Article

Interception of a Single Intruding Unmanned Aerial Vehicle by Multiple Missiles Using the Novel EA-MADDPG Training Algorithm

1 School of Automation Science and Engineering, South China University of Technology, Guangzhou 510641, China
2 Key Laboratory of Autonomous Systems and Networked Control, Ministry of Education, Guangzhou 510641, China
3 Guangdong Engineering Technology Research Center of Unmanned Aerial Vehicle Systems, Guangzhou 510641, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(10), 524; https://doi.org/10.3390/drones8100524
Submission received: 7 August 2024 / Revised: 24 September 2024 / Accepted: 24 September 2024 / Published: 26 September 2024

Abstract

This paper proposes an improved multi-agent deep deterministic policy gradient algorithm called the equal-reward and action-enhanced multi-agent deep deterministic policy gradient (EA-MADDPG) algorithm to solve the guidance problem of multiple missiles cooperating to intercept a single intruding UAV in three-dimensional space. The key innovations of EA-MADDPG include the implementation of the action filter with additional reward functions, optimal replay buffer, and equal reward setting. The additional reward functions and the action filter are set to enhance the exploration performance of the missiles during training. The optimal replay buffer and the equal reward setting are implemented to improve the utilization efficiency of exploration experiences obtained through the action filter. In order to prevent over-learning from certain experiences, a special storage mechanism is established, where experiences obtained through the action filter are stored only in the optimal replay buffer, while normal experiences are stored in both the optimal replay buffer and normal replay buffer. Meanwhile, we gradually reduce the selection probability of the action filter and the sampling ratio of the optimal replay buffer. Finally, comparative experiments show that the algorithm enhances the agents’ exploration capabilities, allowing them to learn policies more quickly and stably, which enables multiple missiles to complete the interception task more rapidly and with a higher success rate.

1. Introduction

Unmanned aerial vehicles (UAVs), especially drones, have been widely applied and played a key role in several local wars due to their low cost, flexible maneuverability, strong concealment, and ability to perform tasks in harsh environments. Although drone technology has shown great potential in areas such as reconnaissance, logistics, and environmental monitoring, its security issues have also raised widespread concern, especially how to effectively address the threat of hostile drones [1]. Therefore, the study of UAV interception is essential, especially in the context of the rapid development and widespread application of UAV technology today.
The algorithms are mainly divided into two major categories: rule-based approaches [2,3,4,5,6,7,8,9,10,11] and learning-based approaches [12,13,14,15,16,17,18,19,20,21,22]. Within these categories, the methods can be further divided into single and multiple types. Single interception guidance methods: Command to line-of-sight guidance adjusts the interceptor’s path based on external commands, precisely controlling the interception process [4]. Optimal guidance designs guidance laws by minimizing a cost function to find the best path under given performance metrics [5]. Augmented proportional navigation adjusts the guidance gain to adapt to different interception conditions, enhancing efficiency against high-speed or highly maneuverable targets [6]. Predictive guidance adjusts the flight path of the interceptor based on predictions of the target’s movements, combating fast-maneuvering targets [7]. Multiple interception guidance methods: The leader–follower strategy involves one UAV acting as the leader, formulating the interception strategy, while the other UAVs follow and execute the plan, simplifying the decision-making process and ensuring team actions are synchronized [8]. The coordinated attack strategy requires high coordination among UAVs to determine the best angles and timings for the attack, effectively utilizing the team’s collective strength to increase the success rate of interceptions [9]. The sequential or alternating attack strategy allows UAVs to attack in a predetermined order or timing, maintaining continuous pressure on the target and extending the duration of combat [10]. The distributed guidance strategy enables each UAV to make independent decisions based on overall team information, enhancing the team’s adaptability and robustness, suitable for dynamic battlefield environments [11].
In recent decades, the rapid development and success of reinforcement learning have made it an ideal tool for solving complex problems, particularly in the field of UAV interception applications. The development of single-agent reinforcement learning algorithms began in the early 1980s, with initial explorations focusing on simple learning control problems [12]. Into the 21st century, with enhancements in computational power and data processing capabilities, deep learning techniques began to be integrated into reinforcement learning. Mnih et al. pioneered the deep q-network, achieving control levels in visually rich Atari games that matched or exceeded human performance for the first time [13]. Additionally, the deep deterministic policy gradient (DDPG) algorithm [14], proposed by Lillicrap et al., and the proximal policy optimization algorithm [15] by Schulman et al., further expanded the application of reinforcement learning in continuous action spaces. Haarnoja et al. introduced the soft actor–critic algorithm, which optimized policies within a maximum entropy framework, demonstrating its superiority in handling tasks involving high dimensions and complex environments [16]. However, single-agent algorithms face challenges such as the exponential explosion in state-action space and local optima when dealing with multi-agent systems. To overcome these challenges, researchers have extended single-agent algorithms to multi-agent algorithms, such as multi-agent deep deterministic policy gradient (MADDPG), multi-agent actor–critic, multi-agent proximal policy optimization (MAPPO), and multi-agent twin-delayed deep deterministic policy gradient (MATD3) [17,18,19,20]. In the realms of single UAV and multiple UAV interception, reinforcement learning has emerged as a significant research focus. For a single UAV, as studied by Koch et al., deep reinforcement learning is employed to train the UAV for autonomous navigation and obstacle avoidance in environments with obstacles [21]. In the context of multiple UAVs, Qie et al. proposed an artificial intelligence method named simultaneous target assignment and path planning based on MADDPG to train the system to solve target assignment and path planning simultaneously according to a corresponding reward structure [22].

1.1. Related Works

In the field of UAV interception, various studies have explored effective interception strategies for both single UAV and multiple UAV environments to tackle challenging missions. Garcia et al. focused on designing minimal-time trajectories to intercept malicious drones in constrained environments, breaking through the limitations of traditional path planning in complex settings [23]. Tan et al. introduced a proportional-navigation-based switching strategy for quadrotors to track maneuvering ground targets, addressing the limitations of traditional proportional derivative control. By dynamically switching between proportional navigation and proportional derivative controls, the proposed method reduced tracking errors and oscillations, demonstrating improved performance and robustness against measurement noise. The optimal switching points, derived analytically, ensured minimal positional errors between the UAV and the target [24]. Cetin et al. addressed the visual quadrotor interception problem in urban anti-drone systems by employing a model predictive control (MPC) method with terminal constraints to ensure engagement at a desired impact angle. The modified MPC objective function reduced the interceptor’s maneuvering requirements at the trajectory’s end, enhancing efficiency. The proposed guidance methodology was successfully tested in various scenarios, demonstrating effective interception of both maneuvering and non-maneuvering targets with minimal maneuvering at the interception’s conclusion [25]. Xue et al. introduced a fuzzy control method for UAV formation trajectory tracking, simplifying control logic and improving stability. Simulations confirm the method’s effectiveness in achieving rapid and stable formation docking [26]. Li et al. developed an enhanced real proportional navigation guidance method for gun-launched UAVs to intercept maneuvering targets, addressing limitations such as saturation overload and capture region constraints. By integrating an extended Kalman filter for data fusion and trajectory prediction, the method improved guidance accuracy and overcame system delay issues, resulting in more effective interception of “Low–slow–small” targets [27]. Liu et al. explored a cooperative UAV countermeasure strategy based on an interception triangle, using a geometric formation formed by three UAVs to enhance interception efficiency [28]. Tong et al., inspired by the cooperative hunting behavior of Harris’s hawks, proposed a multi-UAV cooperative hunting strategy, optimizing the interception process by mimicking natural cooperative behaviors [29]. In addition to drones, research on the interception of other moving objects also offers valuable insights. Shaferman et al. focused on developing and optimizing guidance strategies that enable multiple interceptors to coordinate their operations to achieve a predetermined relative intercept angle. By employing advanced optimization algorithms, this research significantly enhanced mission efficiency, particularly in scenarios where the intercept angle was crucial for successful interception [30]. Chen et al. explored and developed methods for enhancing the accuracy of target recognition and matching in missile guidance systems that used television-based seekers [31].
While rule-based methods have laid a solid foundation for UAV interception, the dynamic and unpredictable nature of modern aerial environments necessitates more adaptable and intelligent approaches. This is where reinforcement learning comes into play, offering significant advantages in UAV interception tasks. Reinforcement learning allows UAVs to learn from interactions with the environment, enabling them to make decisions in real time and adapt to changes dynamically.
In the single-agent domain, to address the issue of using drones to intercept hostile drones, Ting et al. explored the use of a deep q-network (DQN) with a graph neural network (GNN) model for drone-to-drone interception path planning, leveraging the flexibility of GNNs to handle variable input sizes and configurations. The proposed DQN-GNN method effectively trained a chaser drone to intercept a moving target drone, demonstrating the feasibility of integrating GNNs with deep reinforcement learning for this purpose. Results showed that the chaser drone could intercept the target in a reasonable time, suggesting that this approach enhanced the efficiency of drone interception tasks [32]. Furthermore, to counter the threat of small unmanned aerial systems to airspace systems and critical infrastructure, Pierre et al. applied deep reinforcement learning to intercept rogue UAVs in urban airspace. Using the proximal policy optimization method, they verified its effectiveness in improving interception success rates and reducing collision rates [33]. In addition, research on unmanned surface vessels (USVs) and missile interception is also of important reference significance. Du et al. introduced a multi-agent reinforcement learning control method using safe proximal policy optimization (SPPO) for USVs to perform cooperative interception missions. The SPPO method incorporated a joint state–value function to enhance cooperation between defender vessels and introduced safety constraints to reduce risky actions, improving the stability of the learning process. Simulation results demonstrated the effectiveness of the SPPO method, showing high performance in reward and successful cooperative interception of moving targets by the USVs [34]. Liu et al. introduced a combined proximal policy optimization and proportional–derivative control method for USV interception, enhancing interception efficiency. Simulation results showed that the proposed method significantly reduced the interception time compared to traditional approaches [35]. Hu et al. presented the twin-delayed deep deterministic policy gradient (TD3) algorithm to develop a guidance law for intercepting maneuvering targets with UAVs. The proposed method improved upon traditional proportional navigation guidance by directly mapping the line-of-sight rate to the normal acceleration command through a neural network, resulting in enhanced accuracy and convergence of the guidance system [36]. Li et al. proposed a covertness-aware trajectory design method for UAVs based on an enhanced TD3 algorithm. By integrating multi-step learning and prioritizing experience replay techniques, the method enabled UAVs to adaptively select flight velocities from a continuous action space to maximize transmission throughput to legitimate nodes while maintaining covertness. Considering the impact of building distribution and the uncertainty of the warden’s location, this approach effectively addressed issues that were challenging for standard optimization methods and demonstrated significant performance advantages over existing deep reinforcement learning baselines and non-learning strategies in numerical simulations [37]. Li et al. employed a quantum-inspired reinforcement learning (QiRL) approach to optimize the trajectory planning in UAV-assisted wireless networks without relying on prior environmental knowledge. 
QiRL introduced a novel probabilistic action selection policy inspired by quantum mechanics’ collapse and amplitude amplification, achieving a natural balance between exploration and exploitation without the need for tuning exploration parameters typical in conventional reinforcement learning methods. Simulation results validated the effectiveness of QiRL in addressing the UAV trajectory optimization problem, demonstrating faster convergence and superior learning performance compared to traditional q-learning methods [38]. Li et al. introduced TD3 and quantum-inspired experience replay (QiER) to optimize the trajectory planning for UAVs in cellular networks, enhancing performance in reducing flight time and communication outages and improving learning efficiency by effectively balancing exploration and exploitation [39].
For multi-agent systems, Wan et al. presented an enhanced algorithm based on MADDPG called mixed-experience MADDPG (ME-MADDPG), which enhanced sample efficiency and training stability by introducing artificial potential fields and a mixed-experience strategy. Experiments demonstrated that ME-MADDPG achieved faster convergence and better adaptability in complex dynamic environments compared to MADDPG, showing superior performance in multi-agent motion planning [40]. In order to enhance navigation and obstacle avoidance in multi-agent systems, Zhao et al. proposed the MADDPG-LSTM actor algorithm, which combined long short-term memory (LSTM) networks with the MADDPG algorithm, and the simplified MADDPG algorithm to improve efficiency in scenarios with a large number of agents. Experimental results showed that these algorithms outperformed existing networks in the OpenAI multi-agent particle environment, and the LSTM model demonstrated a favorable balance in handling data of varying sequence lengths compared to transformer and self-attention models [41]. Huang et al. proposed and validated the MADDPG for its effectiveness and superior performance in multi-agent defense and attack scenarios [42]. For collaborative drone tasks, a cooperative encirclement strategy based on multi-agent reinforcement learning was proposed, using the attention-mechanism MADDPG algorithm, which addressed the problem of surrounding airborne escape targets in a collaborative drone attack scenario [43]. Using the all-domain simulation 3D war game engine from China Aerospace System Simulation Technology, Wei et al. conducted simulations of confrontations between UAVs and radar stations. By designing rewards and integrating LSTM networks into multi-agent reinforcement learning, the recurrent deterministic policy gradient (RDPG) method was improved, and a combination of MADDPG and RDPG algorithm was developed, significantly enhancing the algorithm’s effectiveness and accuracy [44]. Jeon et al. introduced a fusion–multiactor–attention–critic model for energy-efficient navigation control of UAVs. The model incorporated a sensor fusion layer and a dissimilarity layer to optimize information processing, resulting in improved energy efficiency and 38% more deliveries within the same time steps compared to the original multiactor–attention–critic model [45]. Yue et al. proposed a method for multi-object tracking by a swarm of drones using a multi-agent soft actor–critic approach [46]. For the pursuit–evasion game of multi-rotor drones in obstacle-laden environments, Zhang et al. proposed the CBC-TP Net, a multi-agent bidirectional coordination target prediction network, which incorporated a vectorized extension of the MADDPG method to ensure effective task execution even if the “swarm” system was compromised [47]. The priority-experience replay-based multi-agent deep deterministic policy gradient algorithm (PER-MADDPG) has shown excellent performance in increasing the success rate of pursuits and reducing response times, featuring faster convergence and reduced oscillations compared to the MADDPG algorithm [48]. Zhu et al. applied deep reinforcement learning to the cluster control problems of multi-robot systems in complex environments, proposing the PER-MADDPG algorithm, which significantly enhanced learning efficiency and convergence speed. Experimental results validated the effectiveness of this algorithm in completing cluster tasks in obstacle-laden environments [49]. Jiang et al. 
integrated self-attention into the MADDPG algorithm to enhance the stability of learned policies by adapting to the dynamic changes in the number of adversaries and allies during confrontations. They presented a simplified 2D simulation environment and conducted experiments in three different collaborative and confrontational scenarios, demonstrating improved performance over baseline methods [50]. Zhang et al. tested the distributed decision-making and collaboration tasks of heterogeneous drones in a Unity3D cooperative combat environment, proposing an improved MAPPO algorithm and enhancing the algorithm’s generalizability through the course learning [51]. Huang et al. developed a collaborative path planning method for multiple drones based on the MATD3 algorithm with dual experience pools and a particle swarm optimization algorithm, significantly reducing the age of information [52]. Some typical methods are summarized in Table 1.

1.2. Contributions

This paper investigates the guidance of multiple missiles in three-dimensional space for intercepting a single UAV with random initial positions and uncertain trajectories. We use an improved algorithm called equal-reward and action-enhanced multi-agent deep deterministic policy gradient (EA-MADDPG) based on the MADDPG algorithm for training the control policy of the missiles to enhance their ability to complete interception tasks. The main contributions of this paper are as follows:
  • In multi-agent collision avoidance tasks, MADDPG-LSTM outperforms standard MADDPG due to the LSTM’s capability to handle sequential data. This advantage arises from the LSTM’s underlying principle of maintaining long-term dependencies and capturing temporal patterns through its specialized memory cells and gating mechanisms. Compared to LSTM, the EA-MADDPG in this paper, based on an equal reward setting and an optimal replay buffer with a special storage mechanism, enhances the utilization of important experiences during learning, thereby improving the collision avoidance training effectiveness in multi-missile interception tasks.
  • Prioritized experience replay techniques can enhance exploration performance by prioritizing and sampling experiences based on their significance within the replay buffer. In contrast, this paper introduces a new method to enhance exploration performance by incorporating an action filter. During training, agents use the action filter for action selection with a certain probability, executing these actions to enhance exploration and acquire unique experiences for storage.
  • ME-MADDPG [40] generates special experiences using an artificial potential field with a certain probability and incorporates them into the buffer for sampling and training. In this paper, the EA-MADDPG uses an action filter for action selection to generate unique experiences for storage and learning. All actions are derived from the output of the agents’ actor networks.
Table 1. Typical Methods.
Article | Method
[41,44] | MADDPG with long short-term memory networks
[37,48,49] | Prioritized experience replay techniques
[40] | Artificial potential field and mixed-experience mechanism
The remaining parts of this paper are structured as follows. Section 2 introduces the basic notations. Section 3 presents the kinematic models of missiles and the UAV, as well as the description of the problem. Section 4 details the adopted EA-MADDPG algorithm. Section 5 conducts simulations and discusses the results, and Section 6 summarizes the conclusions of the paper.

2. Notation

$N$ represents the number of agents within the environment. The set of states $\mathcal{X}$ represents the possible configurations of all agents. The set $\mathcal{A} = \{\mathcal{A}_1, \ldots, \mathcal{A}_N\}$ denotes the action spaces of all agents, and $\mathcal{A}_i$ denotes the action space of agent $i$. The set $\mathcal{O} = \{\mathcal{O}_1, \ldots, \mathcal{O}_N\}$ denotes the observation spaces of all agents, and $\mathcal{O}_i$ denotes the observation space of agent $i$. $x \in \mathcal{X}$ denotes the state of all agents at a certain moment. $o_i \in \mathcal{O}_i$ denotes the information observed by agent $i$ at a certain moment. $a = \{a_1, a_2, \ldots, a_N\}$, where $a \in \mathcal{A}$, represents the joint action vector of all agents at a certain moment, and $a_i \in \mathcal{A}_i$ denotes the action executed exclusively by agent $i$ within the environment at a certain moment. $\mu_{\theta_i}: \mathcal{O}_i \to \mathcal{A}_i$ (abbreviated as $\mu_i$) denotes the policy generated by the actor network of the $i$th agent, with $\theta_i$ representing the actor network's parameters; it takes observations as input and produces actions as output. $\mu_{\theta_i'}$ (abbreviated as $\mu_i'$) denotes the policy generated by the target actor network of the $i$th agent, with $\theta_i'$ representing the target network's parameters. $Q_{\phi_i}^{\mu}(x, a): \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the action-value function generated by the critic network of the $i$th agent, with $\phi_i$ representing the critic network's parameters and $\mu$ representing the set $\{\mu_1, \ldots, \mu_N\}$. $Q_{\phi_i'}^{\mu'}(x, a)$ is the action-value function generated by the target critic network of the $i$th agent, with $\phi_i'$ representing the target critic network's parameters and $\mu'$ representing the set $\{\mu_1', \ldots, \mu_N'\}$. $x'$ denotes the subsequent state reached by the agents after executing their current actions. $r = \{r_1, r_2, \ldots, r_N\}$, where $r \in \mathbb{R}^N$, denotes the joint reward vector of all agents, and $r_i \in \mathbb{R}$ represents the immediate reward obtained by agent $i$ within the environment. $r_i^{total} = \sum_{k=0}^{k_{max}-1} \gamma^k r_i^k$ denotes the total expected return of agent $i$, where $k_{max} \in \mathbb{N}^+$ denotes the maximum number of steps in an episode. $\mathcal{N} \in \mathcal{A}_i$ represents the noise added to actions during reinforcement learning training to enhance exploration. $\mathcal{D}$ denotes the replay buffer used to store the past experience tuples collected. $\gamma \in \mathbb{R}$ denotes the discount rate used in calculating cumulative rewards. $\tau \in \mathbb{R}$ denotes the parameter used for updating the target networks. $S \in \mathbb{R}$ denotes the number of experience tuples used for training at one time. $\nabla_x y$ denotes the derivative of $y$ with respect to $x$.

3. Modeling and Problem Description

This paper considers the problem of a group of $N$ missiles intercepting a single intruding UAV that aims to attack a ground target. In this section, the models of the missiles and the UAV are introduced first, followed by a rigorous mathematical description of the interception problem.

3.1. Modeling of Missiles and the UAV

Define $\underline{N} = \{1, \ldots, N\}$. For $i \in \underline{N}$, the kinematic model of the $i$th missile is given by
$$p_i(k+1) = p_i(k) + T v_i(k)$$
where $p_i(k), v_i(k) \in \mathbb{R}^3$ denote the position and velocity of the $i$th missile at the $k$th step, respectively, and $T \in \mathbb{R}$ denotes the duration of each step. Furthermore, the velocity $v_i(k)$ is given by
$$v_i(k) = \begin{bmatrix} \|v_i(k)\| \cos(\beta_i^v(k)) \cos(\alpha_i^v(k)) \\ \|v_i(k)\| \cos(\beta_i^v(k)) \sin(\alpha_i^v(k)) \\ \|v_i(k)\| \sin(\beta_i^v(k)) \end{bmatrix}$$
Here, the speed $\|v_i(k)\|$ satisfies $\|v_i(k)\| = v_{max}$, where $v_{max} \in \mathbb{R}$ denotes the maximum linear velocity. Moreover, as shown in Figure 1, $v_i^{xy}(k) \in \mathbb{R}^2$ denotes the projection of $v_i(k)$ onto the plane parallel to $Oxy$. $\alpha_i^v(k) \in (-\pi, \pi]$ denotes the angle of $v_i^{xy}(k)$ with respect to the positive semi-axis of $x$, with counterclockwise being positive and clockwise being negative; this sign convention for angles is adopted throughout this paper. $\beta_i^v(k) \in (-\frac{\pi}{2}, \frac{\pi}{2}]$ denotes the angle of $v_i(k)$ with respect to $v_i^{xy}(k)$. The angles $\alpha_i^v(k)$ and $\beta_i^v(k)$ are governed by
$$\beta_i^v(k+1) = \beta_i^v(k) + T u_{\beta_i}(k), \quad \alpha_i^v(k+1) = \alpha_i^v(k) + T u_{\alpha_i}(k)$$
where the corresponding angular velocities $u_{\beta_i}(k), u_{\alpha_i}(k) \in \mathbb{R}$ are considered as the control inputs of the missiles. To make the control practical, it is assumed that $|u_{\beta_i}(k)| < \Theta$ and $|u_{\alpha_i}(k)| < \Theta$ for some control input saturation threshold $\Theta > 0$.
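To make the update concrete, the following is a minimal sketch of one missile step under the model above, assuming NumPy; the function name, the default parameter values, and the clipping of the control inputs to $[-\Theta, \Theta]$ are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def missile_step(p, alpha_v, beta_v, u_alpha, u_beta, T=0.1, v_max=50.0, theta=1.0):
    """One kinematic step of a missile under the model above.

    p: position p_i(k) as a length-3 array; alpha_v, beta_v: heading angles;
    u_alpha, u_beta: angular-velocity inputs. T, v_max, theta are illustrative values."""
    # Keep the control inputs within the saturation threshold Theta.
    u_alpha = np.clip(u_alpha, -theta, theta)
    u_beta = np.clip(u_beta, -theta, theta)

    # Constant-speed velocity vector built from the two heading angles.
    v = v_max * np.array([
        np.cos(beta_v) * np.cos(alpha_v),
        np.cos(beta_v) * np.sin(alpha_v),
        np.sin(beta_v),
    ])

    # Position and heading updates.
    p_next = p + T * v
    return p_next, alpha_v + T * u_alpha, beta_v + T * u_beta
```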
The kinematic model of the UAV is given by
$$p_m(k+1) = p_m(k) + T v_m(k)$$
where $p_m(k), v_m(k) \in \mathbb{R}^3$ denote the position and velocity of the UAV at the $k$th step, respectively. The velocity $v_m(k)$ is given by
$$v_m(k) = \begin{bmatrix} v_{max} \cos(\beta_m(k) + U_\beta) \cos(\alpha_m(k) + U_\alpha) \\ v_{max} \cos(\beta_m(k) + U_\beta) \sin(\alpha_m(k) + U_\alpha) \\ v_{max} \sin(\beta_m(k) + U_\beta) \end{bmatrix}$$
As shown in Figure 2, $v_m^{xy}(k) \in \mathbb{R}^2$ denotes the projection of $v_m(k)$ onto the plane parallel to $Oxy$. $\alpha_m(k) \in (-\pi, \pi]$ denotes the angle of $v_m^{xy}(k)$ with respect to the positive semi-axis of $x$. $\beta_m(k) \in (-\frac{\pi}{2}, \frac{\pi}{2}]$ denotes the angle of $v_m(k)$ with respect to $v_m^{xy}(k)$. $U_\alpha, U_\beta \in [-\frac{\pi}{36}, \frac{\pi}{36}]$ represent noise subject to a uniform distribution. Note that the speed of the UAV is the same as the maximal speed of the missiles.
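A matching sketch of the UAV's update adds the uniformly distributed heading noise $U_\alpha, U_\beta \in [-\frac{\pi}{36}, \frac{\pi}{36}]$; again, the helper name and the default parameter values are illustrative assumptions.

```python
import numpy as np

def uav_step(p_m, alpha_m, beta_m, T=0.1, v_max=50.0, rng=None):
    """One kinematic step of the intruding UAV with uniform heading noise."""
    rng = rng or np.random.default_rng()
    # U_alpha, U_beta ~ Uniform[-pi/36, pi/36], redrawn at every step.
    u_a, u_b = rng.uniform(-np.pi / 36, np.pi / 36, size=2)

    v_m = v_max * np.array([
        np.cos(beta_m + u_b) * np.cos(alpha_m + u_a),
        np.cos(beta_m + u_b) * np.sin(alpha_m + u_a),
        np.sin(beta_m + u_b),
    ])
    return p_m + T * v_m
```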
As shown in Figure 3, $l_i^n(k) \in \mathbb{R}^2$ denotes the projection of $p_m(k) - p_i(k)$ onto the plane parallel to $Oxy$, and $l_i^f(k) \in \mathbb{R}^2$ denotes the projection of $p_m(k) - p_i(k-1)$ onto the plane parallel to $Oxy$. Note that $l_i^n(k)$ represents the current relative position between the $i$th missile and the UAV, whereas $l_i^f(k)$ represents the semi-historical relative position. $\alpha_i^n(k) \in (-\pi, \pi]$ denotes the angle of $l_i^n(k)$ relative to the positive semi-axis of $x$. $\beta_i^n(k) \in (-\frac{\pi}{2}, \frac{\pi}{2}]$ denotes the angle of $p_m(k) - p_i(k)$ relative to $l_i^n(k)$. $\alpha_i^f(k) \in (-\pi, \pi]$ denotes the angle of $l_i^f(k)$ relative to the positive semi-axis of $x$. $\beta_i^f(k) \in (-\frac{\pi}{2}, \frac{\pi}{2}]$ denotes the angle of $p_m(k) - p_i(k-1)$ relative to $l_i^f(k)$. Based on these angles, we define
$$\alpha_i(k) = \begin{cases} -2\pi + (\alpha_i^n(k) - \alpha_i^v(k)) & \text{if } \alpha_i^n(k) - \alpha_i^v(k) > \pi \\ \alpha_i^n(k) - \alpha_i^v(k) & \text{if } \pi \ge \alpha_i^n(k) - \alpha_i^v(k) > -\pi \\ 2\pi + (\alpha_i^n(k) - \alpha_i^v(k)) & \text{if } \alpha_i^n(k) - \alpha_i^v(k) \le -\pi \end{cases}$$
$$\beta_i(k) = \beta_i^n(k) - \beta_i^v(k)$$
where $\alpha_i(k)$ denotes the angle of $l_i^n(k)$ with respect to $v_i^{xy}(k)$, and $\beta_i(k)$ denotes the angle of $p_m(k) - p_i(k)$ with respect to $v_i(k)$ after they are both rotated around the $z$-axis to align within the same vertical plane. Moreover, for $k = 1, 2, \ldots$,
$$\alpha_i'(k) = \begin{cases} -2\pi + (\alpha_i^f(k) - \alpha_i^v(k-1)) & \text{if } \alpha_i^f(k) - \alpha_i^v(k-1) > \pi \\ \alpha_i^f(k) - \alpha_i^v(k-1) & \text{if } \pi \ge \alpha_i^f(k) - \alpha_i^v(k-1) > -\pi \\ 2\pi + (\alpha_i^f(k) - \alpha_i^v(k-1)) & \text{if } \alpha_i^f(k) - \alpha_i^v(k-1) \le -\pi \end{cases}$$
$$\beta_i'(k) = \beta_i^f(k) - \beta_i^v(k-1)$$
where $\alpha_i'(k)$ denotes the angle of $l_i^f(k)$ with respect to $v_i^{xy}(k-1)$, and $\beta_i'(k)$ denotes the angle of $p_m(k) - p_i(k-1)$ with respect to $v_i(k-1)$ after they are both rotated around the $z$-axis to align within the same vertical plane. In addition, we simply set $\alpha_i'(0) = 0$ and $\beta_i'(0) = 0$.
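The piecewise definitions of $\alpha_i(k)$ and $\alpha_i'(k)$ simply wrap an angle difference into $(-\pi, \pi]$; a minimal sketch of that wrapping (our own helper, not from the paper) is:

```python
import numpy as np

def wrap_angle_diff(a_target, a_reference):
    """Wrap the difference a_target - a_reference into (-pi, pi],
    mirroring the piecewise definitions of alpha_i(k) and alpha_i'(k)."""
    d = a_target - a_reference
    if d > np.pi:
        d -= 2.0 * np.pi
    elif d <= -np.pi:
        d += 2.0 * np.pi
    return d
```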

3.2. Problem Description

In Figure 4, the blue curves represent a hemisphere $\mathcal{P}$ with center $O$ and radius $\mathcal{R}$. The orange curves represent a cone with apex $O$ and inner angle $\varphi \in [0, \frac{\pi}{2}]$. The cone intersects $\mathcal{P}$, and the part of the hemisphere above the cone is denoted by $\mathcal{P}_u$. In this paper, it is assumed that $p_m(0) \in \mathcal{P}_u$, i.e., the UAV appears randomly on the upper part of the hemisphere. Let $p_{Target} = [p_{Target,x}, p_{Target,y}, p_{Target,z}]^T \in \mathbb{R}^3$ denote the position of the target, satisfying
$$p_{Target,z} = 0, \quad p_{Target,x}^2 + p_{Target,y}^2 < R_r^2$$
for some $R_r > 0$. For all $k$, the angles $\beta_m(k)$ and $\alpha_m(k)$ are set such that the vector $[\cos(\beta_m(k))\cos(\alpha_m(k)), \cos(\beta_m(k))\sin(\alpha_m(k)), \sin(\beta_m(k))]^T$ has the same direction as the vector $p_{Target} - p_m(k)$. For $i \in \underline{N}$, let $p_i(k) = [p_{ix}(k), p_{iy}(k), p_{iz}(k)]^T$. Suppose
$$p_{iz}(0) = 0, \quad p_{ix}(0)^2 + p_{iy}(0)^2 < R_g^2$$
for some $R_g > 0$. Moreover, for $i \in \underline{N}$, $\beta_i^v(0)$ and $\alpha_i^v(0)$ are set such that the vector $[\cos(\beta_i^v(0))\cos(\alpha_i^v(0)), \cos(\beta_i^v(0))\sin(\alpha_i^v(0)), \sin(\beta_i^v(0))]^T$ has the same direction as the vector $p_m(0) - p_i(0)$. The control objective is to design the control inputs of the missiles, i.e., $u_{\beta_i}(k)$ and $u_{\alpha_i}(k)$, such that there exist $i \in \underline{N}$ and $k^*$ satisfying the following two conditions simultaneously,
$$\|p_i(k^*) - p_m(k^*)\| \le R_d$$
$$\|p_m(k^*) - p_{Target}\| > R_s$$
and for $k < k^*$, at least one of these two conditions does not hold. Here, $R_d$ denotes the intercepting distance of the UAV, and $R_s$ denotes the safety distance of the target. At the same time, we define
$$d = \|p_m(k^*) - p_{Target}\|$$
which represents the distance between the UAV and the target at the time when the UAV is successfully intercepted.
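A successful interception at step $k^*$ can thus be checked with a simple predicate; the sketch below assumes NumPy arrays for the positions and uses our own function name.

```python
import numpy as np

def interception_succeeds(p_i, p_m, p_target, R_d, R_s):
    """True if missile i is within the intercept radius of the UAV while the
    UAV is still farther than the safety radius from the ground target."""
    close_enough = np.linalg.norm(p_i - p_m) <= R_d
    target_still_safe = np.linalg.norm(p_m - p_target) > R_s
    return close_enough and target_still_safe
```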
For the multi-missile guidance and interception problem described above, we chose the MADDPG algorithm as the fundamental method, for the following reasons. First, in multi-agent systems, missiles need to work together to achieve a common goal. Each missile must optimize its own behavior while considering the actions of other missiles to avoid mutual interference. MADDPG is designed to handle such coordination by using shared global information during training. This allows each missile to consider the actions of other missiles and learn how to coordinate effectively to approach the UAV faster while avoiding collisions. Second, the action space of the missiles is continuous because the missiles are controlled by flight angular velocities. This requires the algorithm to handle continuous action parameters rather than discrete choices. MADDPG can effectively handle continuous action spaces and is very suitable for multi-agent systems that require precise control. Third, MADDPG allows for centralized training where all missiles can access global information, which helps in learning effective coordination strategies. During execution, each missile makes decisions based only on local information, which reduces the computational burden and improves real-time decision-making. Finally, for the MATD3 algorithm, which uses twin critics, the complexity of training and maintaining more network parameters rises as the number of missiles increases. Therefore, in situations with a larger number of agents, the simplicity and efficiency of MADDPG make it the better choice, as it reduces the computational cost associated with training and maintaining critic networks.
We used the EA-MADDPG algorithm to train an actor network for each missile, with $\beta_i(k)$, $\alpha_i(k)$, and $\|p_m(k) - p_i(k)\|$ as the inputs of each actor and $u_{\beta_i}(k)$ and $u_{\alpha_i}(k)$ as the continuous action outputs. The goal was for the missiles to move toward the UAV more quickly while avoiding collisions with each other and with the ground, thereby completing the interception task. Each missile was assigned a separate critic network, which took the inputs and outputs of the corresponding actor network as its inputs and output a Q-value.

4. EA-MADDPG

Figure 5 shows the framework of EA-MADDPG. Compared to MADDPG, EA-MADDPG has two additional components: the action filter and the optimal replay buffer $\mathcal{D}^*$. During the experience collection phase, we introduce the guided mode, as indicated by the blue arrows in the figure. After the policy network of agent $i$ outputs the action $a_i$ based on the observation $o_i$, $a_i$ is input into the action filter, which outputs the filtered action $a_i^{Guided}$, where $a_i^{Guided} \in \mathcal{A}_i$ and $i = 1, \ldots, N$. Once the agents have executed the filtered actions, we obtain an experience tuple $(x, a^{Guided}, r, x')$ that is stored in the optimal replay buffer $\mathcal{D}^*$, where $a^{Guided} = \{a_1^{Guided}, \ldots, a_N^{Guided}\}$. The steps indicated by the brown arrows in the figure are essentially the same as those in the MADDPG algorithm, with two differences: first, experience tuples $(x, a, r, x')$ are stored simultaneously in both buffers, and second, sampling is conducted from the two buffers in proportion $\eta$. Next, we provide a detailed introduction to the EA-MADDPG algorithm.
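To make the special storage mechanism concrete, the sketch below shows one experience-collection step branching between guided and normal mode; it assumes buffer objects with a store method and an environment wrapper with state, observations, and step methods, and all names are illustrative rather than taken from the authors' code.

```python
import random

def collect_step(env, agents, action_filter, D, D_star, p_guided, noise):
    """One experience-collection step of EA-MADDPG (illustrative sketch).

    Guided-mode tuples are stored only in the optimal buffer D_star,
    while normal tuples are stored in both D and D_star."""
    x = env.state()                      # global state x (assumed env API)
    obs = env.observations()             # per-agent observations o_i
    actions = [agent.act(o) for agent, o in zip(agents, obs)]

    if random.random() < p_guided:
        # Guided mode: pass the actor outputs through the action filter.
        guided = action_filter(x, actions)
        r, x_next = env.step(guided)
        D_star.store((x, guided, r, x_next))
    else:
        # Normal mode: add exploration noise and store in both buffers.
        noisy = [a + noise() for a in actions]
        r, x_next = env.step(noisy)
        D.store((x, noisy, r, x_next))
        D_star.store((x, noisy, r, x_next))
    return x_next
```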

4.1. Equal Reward Setting

In the training of the MADDPG algorithm, a set of reward functions $\{R_i(x, a, x')\}$ is established, enabling agents to receive rewards through these functions each time they complete actions within the environment. In EA-MADDPG, we introduce an additional set of reward functions $\{R_i^{eq}(x, a)\}$ and a reward setting called the equal reward setting to further enhance the training capabilities.
The key characteristic of the equal reward setting is that every state has the same maximum reward:
$$R_i^{eq}(x, a) \le C_{max}$$
$$R_i(x, a, x') \le C_{max}$$
where $C_{max} \in \mathbb{R}$ is a constant value. For any given state $x$, the maximum reward value that can be obtained is equal.
Next, we propose two specific reward functions that meet the above setting. For each agent $i$:
$$R_i^{eq} = f(\beta_i(k), u_{\beta_i}(k), \lambda) + f(\alpha_i(k), u_{\alpha_i}(k), \lambda) + \sum_{j \ne i} \varsigma_1 \big[ f(\beta_j(k), u_{\beta_j}(k), \lambda) + f(\alpha_j(k), u_{\alpha_j}(k), \lambda) \big]$$
$$R_i = f(\beta_i'(k), u_{\beta_i}(k-1), \lambda) + f(\alpha_i'(k), u_{\alpha_i}(k-1), \lambda) + \sum_{j \ne i} \varsigma_2 \big[ f(\beta_j'(k), u_{\beta_j}(k-1), \lambda) + f(\alpha_j'(k), u_{\alpha_j}(k-1), \lambda) \big] + R_i^{col} + R_i^{ground}$$
where
$$f(\psi, \kappa, \Lambda) = \begin{cases} \dfrac{\kappa}{\Lambda} + 1 & \text{if } \psi \ge \Lambda \\[4pt] -\dfrac{\kappa}{\Lambda} + \dfrac{\psi}{\Lambda} + 2 & \text{if } \Lambda \ge \kappa \ge \psi \text{ and } \psi > -\Lambda \\[4pt] \dfrac{\kappa}{\Lambda} - \dfrac{\psi}{\Lambda} + 2 & \text{if } \psi > \kappa \ge -\Lambda \text{ and } \psi < \Lambda \\[4pt] -\dfrac{\kappa}{\Lambda} + 1 & \text{if } \psi \le -\Lambda \end{cases}$$
$\lambda = T \max|u_{\beta_i}| = T \max|u_{\alpha_i}|$ denotes the maximum angle by which the agent can change its direction, from both the vertical and horizontal perspectives, in one step. $\varsigma_1, \varsigma_2 \in \mathbb{R}$. $R_i^{col}$ denotes the penalty for agent $i$ after it collides with other agents, and $R_i^{ground}$ denotes the penalty for agent $i$ after it hits the ground:
$$R_i^{col} = \begin{cases} -100 & \text{if agent } i \text{ collides} \\ 0 & \text{otherwise} \end{cases}$$
$$R_i^{ground} = \begin{cases} -100 & \text{if agent } i \text{ hits the ground} \\ 0 & \text{otherwise} \end{cases}$$
Next, let us take $f(\alpha_i(k), u_{\alpha_i}(k), \lambda)$ in $R_i^{eq}$ as an example to explain why $R_i^{eq}$ and $R_i$ satisfy the equal reward setting.
Figure 6 shows the plane parallel to $Oxy$; the angle between the two blue dashed lines in the figure represents the range within which the missile can change its velocity direction in one step, viewed from the horizontal perspective. The blue dashed lines are symmetric about $v_i^{xy}(k)$. When $\alpha_i(k)$ is greater than $\lambda$, the reward is maximized at 2 when $u_{\alpha_i}(k) = \lambda$ and decreases as $u_{\alpha_i}(k)$ decreases. Similarly, if $\alpha_i(k)$ is less than $-\lambda$, the reward is also maximized at 2 when $u_{\alpha_i}(k) = -\lambda$ and decreases as $u_{\alpha_i}(k)$ increases. If the angle $\alpha_i(k)$ is within $[-\lambda, \lambda]$, the reward is maximized at 2 when $u_{\alpha_i}(k) = \alpha_i(k)$ and decreases as $u_{\alpha_i}(k)$ deviates to either side. This is the general logic of the function $f(\alpha_i(k), u_{\alpha_i}(k), \lambda)$, and it ensures that the maximum reward obtained from it is 2. Extending this to each term of $R_i^{eq}$, they all meet the equal reward setting, meaning that $R_i^{eq}$ satisfies the equal reward setting. $R_i$ has a similar composition to $R_i^{eq}$; therefore, $R_i$ also possesses this property.
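A direct transcription of the piecewise term $f(\psi, \kappa, \Lambda)$ follows, written as a sketch under the assumption that the action argument $\kappa$ is expressed on the same scale as $\Lambda$ (i.e., $\kappa \in [-\Lambda, \Lambda]$), so that the maximum value is 2 as described above:

```python
def f(psi, kappa, lam):
    """Per-angle shaping term f(psi, kappa, Lambda): the value peaks at 2 when
    the action kappa turns the heading as far toward the line of sight as one
    step allows (kappa assumed to lie in [-lam, lam])."""
    if psi >= lam:          # desired turn exceeds one step: reward turning at +lam
        return kappa / lam + 1.0
    if psi <= -lam:         # desired turn exceeds one step on the other side
        return -kappa / lam + 1.0
    # |psi| < lam: peak of 2 at kappa = psi, decreasing linearly on both sides
    if kappa >= psi:
        return -kappa / lam + psi / lam + 2.0
    return kappa / lam - psi / lam + 2.0
```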

4.2. Optimal Replay Buffer $\mathcal{D}^*$

MADDPG uses a single experience replay buffer $\mathcal{D}$. In EA-MADDPG, we introduce an additional replay buffer $\mathcal{D}^*$ called the optimal replay buffer. Figure 7 shows that $e_{new}$ is a tuple containing the experience saved in the current step. $e_1, \ldots, e_{B_s^*}$ are the experience tuples stored in the buffer $\mathcal{D}^*$, and $B_s^* \in \mathbb{R}$ denotes the maximum storage size of $\mathcal{D}^*$. When storing experiences, the method of storing experience tuples in empty slots is the same as for $\mathcal{D}$. When experience tuples in $\mathcal{D}^*$ are being overwritten, $\mathcal{D}^*$ requires an additional judgment step. For example, as shown in the figure, when $e_{new}$ is about to overwrite $e_4$, since $r$ contains the rewards of all agents at this step, we select the minimum reward value $\min(r)$ in $e_{new}$ and compare it with the corresponding minimum reward value $\min(r^{e_4})$ in $e_4$. The experience tuple corresponding to the larger of the two compared reward values is the one that is stored.
During training, samples are extracted from $\mathcal{D}$ and $\mathcal{D}^*$ in proportion $\eta = \frac{N_{\mathcal{D}^*}}{S}$ and then mixed for learning:
$$\eta = \eta_0 e^{-\epsilon \, c_{episode}}$$
$$S = N_{\mathcal{D}^*} + N_{\mathcal{D}}$$
$N_{\mathcal{D}^*} \in \mathbb{R}$ denotes the number of experience tuples collected from $\mathcal{D}^*$. $N_{\mathcal{D}} \in \mathbb{R}$ denotes the number of experience tuples collected from $\mathcal{D}$. $c_{episode} \in \mathbb{R}$ denotes the current episode index. $\epsilon \in \mathbb{R}$ denotes the decay rate of $\eta$. $\eta_0 \in \mathbb{R}$ denotes the initial proportion we set.
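The sketch below illustrates the overwrite-by-comparison rule of $\mathcal{D}^*$ and the mixed sampling in proportion $\eta$; the class and method names are our own, and the ordinary buffer $\mathcal{D}$ is assumed to expose the same sample interface.

```python
import random

class OptimalReplayBuffer:
    """Illustrative optimal replay buffer D*: once full, an incoming tuple
    overwrites the slot it points to only if its minimum per-agent reward
    exceeds that of the stored tuple (cf. Figure 7)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []
        self.ptr = 0

    def store(self, experience):
        # experience = (x, a, r, x_next), where r is the joint reward vector.
        if len(self.slots) < self.capacity:
            self.slots.append(experience)
            return
        if min(experience[2]) > min(self.slots[self.ptr][2]):
            self.slots[self.ptr] = experience
        self.ptr = (self.ptr + 1) % self.capacity

    def sample(self, n):
        return random.sample(self.slots, min(n, len(self.slots)))

def mixed_sample(D, D_star, S, eta):
    """Draw S experiences, a fraction eta from D* and the remainder from D."""
    n_star = int(round(eta * S))
    return D_star.sample(n_star) + D.sample(S - n_star)
```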

4.3. Action Filter

In the training of reinforcement learning, the exploration performance of agents is critical to the speed and effectiveness of learning. Therefore, we introduce an action filter. In the introduction above, we introduced the guided mode, which uses the action filter. As shown in Figure 8, after the agents derive the action set $\{a_1, a_2, \ldots, a_N\}$ and input it into the action filter, we generate a batch of random noise $\{\mathcal{N}_1, \ldots, \mathcal{N}_{Nq}\}$ from $\mathcal{N}$, where $q \in \mathbb{R}$ denotes the repeated exploration coefficient, and integrate it with the action set $\{a_1, a_2, \ldots, a_N\}$ to create a collection of exploratory actions $\{a_1^1, a_1^2, \ldots, a_1^q, a_2^1, \ldots, a_N^q\}$. The observations $\{o_1, o_2, \ldots, o_N\}$, the actions $\{a_1, a_2, \ldots, a_N\}$, and these exploratory actions are then fed into the set of reward functions $\{R_i^{eq}\}$, $i = 1, \ldots, N$, mentioned above to generate a set of exploratory rewards $\{r_i^t = R_i^{eq}(x, a_1, a_2, \ldots, a_i^t, a_{i+1}, \ldots, a_N)\}$, $i = 1, \ldots, N$, $t = 1, \ldots, q$. We input them into the filter:
$$a_i^{Guided} = \arg\max_{a_i^t} \, R_i^{eq}(x, a_1, a_2, \ldots, a_i^t, a_{i+1}, \ldots, a_N)$$
Subsequently, the agents execute $\{a_i^{Guided}\}$, $i = 1, \ldots, N$, obtain rewards $\{r_i\}$, $i = 1, \ldots, N$, and move to the next state $x'$.
The noise $\mathcal{N}$ is generated by
$$\mathcal{N} = N_a K e^{-\vartheta \, c_{episode}}$$
where $K \in \mathcal{A}_i$ and the values in $K$ follow a standard normal distribution. $N_a \in \mathbb{R}$ denotes the amplitude of the noise. $\vartheta \in \mathbb{R}$ denotes the noise decay rate.
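A compact sketch of the action filter under the notation above; it assumes each $R_i^{eq}$ is available as a callable taking the state and the joint action list, and the helper name and signature are our own.

```python
import numpy as np

def action_filter(x, actions, reward_eq_fns, q, noise_amp, rng=None):
    """For each agent i, pick among q noisy variants of a_i the one that
    maximizes R_i^eq while the other agents' actions stay fixed (illustrative).

    actions: list of per-agent actions output by the actor networks."""
    rng = rng or np.random.default_rng()
    guided = list(actions)
    for i, a_i in enumerate(actions):
        # q exploratory candidates a_i^t = a_i + scaled standard-normal noise.
        candidates = [a_i + noise_amp * rng.standard_normal(np.shape(a_i))
                      for _ in range(q)]
        guided[i] = max(
            candidates,
            key=lambda cand: reward_eq_fns[i](x, actions[:i] + [cand] + actions[i + 1:]),
        )
    return guided
```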
During training, there is a certain probability $p^{Guided}$ of choosing the guided mode. As training progresses, this probability gradually approaches zero:
$$p^{Guided} = p_0^{Guided} e^{-\delta \, c_{episode}}$$
where $\delta \in \mathbb{R}$ denotes the probability decay rate we set, and $p_0^{Guided} \in \mathbb{R}$ denotes the initial probability. Algorithm 1 presents the pseudocode for EA-MADDPG.
Algorithm 1 EA-MADDPG
for episode = 1 to M do
    Obtain $p^{Guided}$ and $\eta$
    Initialize a random process $\mathcal{N}$ for action exploration
    Receive initial state $x$
    Determine whether to enter guided mode based on probability $p^{Guided}$
    if Guided Mode then
        for t = 1 to max-episode-length do
            For each agent $i$, select action $a_i = \mu_{\theta_i}(o_i)$
            Obtain the exploratory actions $\{a_1^1, a_1^2, \ldots, a_1^q, a_2^1, \ldots, a_N^q\}$ by integrating the actions with a set of random noise $\{\mathcal{N}_1, \ldots, \mathcal{N}_{Nq}\}$
            Input them into the action filter to obtain $\{a_i^{Guided}\}$:
                $a_i^{Guided} = \arg\max_{a_i^t} R_i^{eq}(x, a_1, a_2, \ldots, a_i^t, a_{i+1}, \ldots, a_N)$
            Execute actions $\{a_i^{Guided}\}$ and observe reward $r$ and new state $x'$
            Store $(x, a^{Guided}, r, x')$ in replay buffer $\mathcal{D}^*$
            $x \leftarrow x'$
        end for
    else
        for t = 1 to max-episode-length do
            For each agent $i$, select action $a_i = \mu_{\theta_i}(o_i) + \mathcal{N}$ with respect to the current policy and exploration
            Execute actions $a = \{a_1, a_2, \ldots, a_N\}$ and observe reward $r$ and new state $x'$
            Store $(x, a, r, x')$ in replay buffer $\mathcal{D}$ and optimal replay buffer $\mathcal{D}^*$
            $x \leftarrow x'$
            for agent $i$ = 1 to N do
                Sample a random mini-batch of $S$ samples $(x^j, a^j, r^j, x'^j)$ in proportion $\eta$ from $\mathcal{D}$ and $\mathcal{D}^*$
                Set $y^j = r_i^j + \gamma \, Q_{\phi_i'}^{\mu'}(x'^j, a_1', \ldots, a_N') \big|_{a_k' = \mu_k'(o_k^j)}$
                Update the critic by minimizing the loss:
                    $L(\phi_i) = \frac{1}{S} \sum_{j=1}^{S} \left( y^j - Q_{\phi_i}^{\mu}(x^j, a_1^j, \ldots, a_N^j) \right)^2$
                Update the actor using the sampled policy gradient:
                    $\nabla_{\theta_i} J \approx \frac{1}{S} \sum_{j=1}^{S} \nabla_{\theta_i} \mu_i(o_i^j) \, \nabla_{a_i} Q_{\phi_i}^{\mu}(x^j, a_1^j, \ldots, a_i, \ldots, a_N^j) \big|_{a_i = \mu_i(o_i^j)}$
            end for
            Update target network parameters for each agent $i$:
                $\theta_i' \leftarrow \tau \theta_i + (1 - \tau) \theta_i'$
                $\phi_i' \leftarrow \tau \phi_i + (1 - \tau) \phi_i'$
        end for
    end if
end for

5. Simulation and Discussion

Comprehensive simulation results are given in this section to examine the performance of the EA-MADDPG algorithm. Two scales were considered, namely N = 3 and N = 6 .

5.1. Simulation Setting

Table 2, Table 3 and Table 4 below list the values of relevant parameters for two training environments. Table 5 lists the values of parameters related to training.
Table 6 and Table 7 list the dimensionality parameters of each layer in the actor networks and critic networks, respectively. For each agent, both the actor and critic networks (including the target networks for actor and critic) employed a classic feedforward neural network structure, a multi-layer perceptron (MLP). Each network comprised an input layer, a hidden layer, and an output layer.
In the actor network, the input layer was fully connected and used a rectified linear unit (ReLU) as the activation function, with a normalization function applied to the batch input data. In the critic network, the input layer was also fully connected and used a rectified linear unit (ReLU) as the activation function, but it passed the input data directly using an identity function. The hidden layers in both networks were fully connected, with ReLU as the activation function. The output layer in both networks was fully connected; in the actor network, the hyperbolic tangent activation function was used.
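Based on this layer description, a minimal PyTorch sketch of the actor and critic structure is given below; the hidden width is a placeholder (the paper's exact dimensions are listed in Tables 6 and 7), and the critic input dimension assumes the centralized joint state and action input used in the update equations.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """MLP actor: batch-normalized input, two ReLU layers, tanh output."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(obs_dim),                 # normalization of batch input data
            nn.Linear(obs_dim, hidden), nn.ReLU(),   # input layer
            nn.Linear(hidden, hidden), nn.ReLU(),    # hidden layer
            nn.Linear(hidden, act_dim), nn.Tanh(),   # output layer
        )

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """MLP critic: identity on the input, two ReLU layers, scalar Q output."""
    def __init__(self, obs_dim, act_dim, n_agents, hidden=64):
        super().__init__()
        in_dim = n_agents * (obs_dim + act_dim)      # centralized joint input (assumed)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),    # input layer (data passed via identity)
            nn.Linear(hidden, hidden), nn.ReLU(),    # hidden layer
            nn.Linear(hidden, 1),                    # output layer: Q-value
        )

    def forward(self, joint_obs_actions):
        return self.net(joint_obs_actions)
```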
For reference, we used the MADDPG [17], ME-MADDPG [40], MADDPG-L [41], MAPPO [19] and MATD3 [20] algorithms in the same environment. At the same time, we replaced the multiple critics in the EA-MADDPG framework with a single universal critic to evaluate the performance of all agents. This framework was called EA-MADDPG-S. The training was conducted on a computer equipped with an AMD Ryzen Threadripper PRO 3975WX 32-core CPU, 128 GB of RAM, and an NVIDIA RTX 4090 GPU.

5.2. Simulation Results and Discussion

5.2.1. Training Result

Through training, we obtained the following results:
Figure 9 illustrates the curves of the average reward of three missiles throughout the training episodes for N = 3 . Figure 10 illustrates the curves of the average reward of six missiles throughout the training episodes for N = 6 .
The reward functions (18) for missiles reflected our expectation of a joint optimal strategy for multiple missiles, where all missiles approached the UAV more quickly while avoiding collisions with each other. If one missile performed poorly during training, resulting in a lower individual reward, it also affected the rewards of the other missiles, causing the overall rewards of all missiles to decrease.
From Figure 9 and Figure 10, we can see that the training performance of the MADDPG-L and MAPPO algorithms was not very effective in the current environment, as the reward curves showed little change from the beginning to the end of training. Therefore, in the following convergence analysis of the reward curves we do not include these two algorithms.
In the three vs. one environment, as shown in Figure 9, missile collisions and crashes into the ground occurred at the beginning of the training, leading to low and even negative reward values. Through continuous training, the missiles learned to fly toward the UAV while avoiding collisions, leading to an increase in reward values. In the subsequent training process, the convergence values of the reward curves under the EA-MADDPG algorithm were higher than those of other algorithms. This indicated that under this algorithm, the three missiles could approach the UAV more quickly while avoiding collisions with each other.
In the six vs. one environment, as shown in Figure 10, the reward curves under the EA-MADDPG algorithm converged, and both the convergence speed and convergence values were greater than those of the reward curves under other algorithms. As the number of missiles increased, the difficulty of training began to rise. Within the same episodes, the reward curve of missile 1 was negative under ME-MADDPG, the reward curves of missiles 3, 4, and 5 were negative under MADDPG, and the reward curves of missiles 1, 4, and 5 were negative under MATD3. This indicated that the corresponding missiles were still colliding with each other or crashing into the ground, which also affected the training and reward values of the other missiles. Under EA-MADDPG, the missiles could actively explore and collect good experiences through the action filter, quickly learning new strategies to overcome the aforementioned issues. To prevent excessive learning from this portion of experiences and to stabilize subsequent learning and the reward curves, EA-MADDPG mixed these experiences with normally obtained experiences in the optimal replay buffer $\mathcal{D}^*$. At the same time, it reduced the probability $p^{Guided}$ and the sampling ratio $\eta$ according to certain decay rates.
We compared the reward curves of EA-MADDPG and EA-MADDPG-S in the three vs. one environment. As shown in Figure 11, the reward curve of missile 1 under EA-MADDPG-S converged faster than that under EA-MADDPG. However, the convergence speed of missile 2 was slower under EA-MADDPG-S compared to EA-MADDPG. For missile 3, the convergence speeds of both algorithms were roughly the same. Nonetheless, the final convergence values for all missiles under EA-MADDPG were higher than those under EA-MADDPG-S. From the comparison of the reward curves above, we conclude that training with a single critic was less effective than having a separate critic for each agent.
As shown in Figure 12, EA-MADDPG required additional time compared to MADDPG. Analyzing EA-MADDPG's training process: in the action filter, we needed to introduce a batch of noise to generate exploratory actions, compute $\{R_i^{eq}\}$, and select $\{a_i^{Guided}\}$ based on $\{R_i^{eq}\}$, all of which required extra time. Additionally, storing experiences in and sampling from $\mathcal{D}^*$ incurred extra time costs. When $p^{Guided}$ and the sampling ratio $\eta$ are set improperly, the extra time consumption increases. The algorithm analysis and experiments demonstrated that both the action filter and the optimal replay buffer design introduced additional time complexity.

5.2.2. Policy Networks Testing

Next, we evaluated the trained policy networks in the three vs. one environment. The testing procedure was as follows: First, we selected several initial positions and strike coordinates for the UAV, observed the interception trajectories of the three missiles and the UAV, conducted 100 tests for each position, and compared $k_{avg}^*$ and $d_{avg}$ under different algorithms for each position, where $k_{avg}^*, d_{avg} \in \mathbb{R}$ denote the average values of $k^*$ and $d$, respectively. Second, 1000 tests were conducted with randomly generated initial positions and strike coordinates, and we compared the percentage of successful interceptions under different algorithms across these 1000 trials.
As shown in Figure 13, the missile trajectories under the EA-MADDPG algorithm most closely matched the trajectories we expected: three missiles were all able to quickly move towards the UAV that was performing small-amplitude irregular movements. From the trajectories under the ME-MADDPG, it can be seen that although the three missiles no longer crashed into the ground or collided with each other, they were still unable to approach the target UAV quickly. This explained the reason for the lower convergence values of the curves of ME-MADDPG mentioned above. The trajectories under MADDPG showed that missile 2 and missile 3 achieved better training results because they could move towards the UAV, while missile 1 did not perform well. Similar to MADDPG, the missile trajectories under the MATD3 exhibited the same performance. Missile 2 and missile 3 could move towards the UAV, while missile 1 did not perform well and only flew straight up. From the trajectory plots, we can explain why the convergence values of the reward curves for the other algorithms were lower than those of the EA-MADDPG algorithm.
As shown in Figure 14 and Figure 15, the two graphs intuitively demonstrated the speed of UAV interception under three algorithms and did not show data under ME-MADDPG because its interception success rate was too low (the interception success percentage is shown below), making the interception data not referential. Under the EA-MADDPG algorithm, positions 1 to 6 required significantly fewer steps for interception, and the UAV was intercepted at a longer distance, thus showing a greater advantage. Combining the analyses mentioned above, we conclude that EA-MADDPG outperformed other algorithms both in terms of the combined training effect of multiple missiles (i.e., all missiles were ideally trained) and the speed of interception.
Next, we conducted 1000 sets of tests to compare the percentage of successful interceptions under different algorithms.
As shown in Figure 16, EA-MADDPG, with an interception success rate of 99.5%, exceeded MADDPG at 84.3% and MATD3 at 72.3%, while the percentage for ME-MADDPG was only 20.3%. This demonstrated that EA-MADDPG’s training effect in terms of interception success rate surpassed the other three algorithms.

6. Conclusions

This paper was motivated by the progressively increasing threat posed by drones and the ongoing development of multi-agent reinforcement learning. In response to the growing threat of UAVs, it combined multi-agent reinforcement learning with missile guidance to study the problem of defending against an intruding UAV with multiple missiles in three-dimensional space.
This paper proposed an improved variant of the MADDPG algorithm called EA-MADDPG. Compared to the original algorithm, the EA-MADDPG algorithm had two distinct advantages: first, it enhanced the exploration of multiple agents through the action filter, allowing the agents to be more “proactive” in exploring and updating their strategies when they were experiencing poor training outcomes; second, through the design of the equal-reward setting and the optimal replay buffer, the agents were able to more efficiently utilize the experiences gained from “proactive” exploration, thereby improving the efficiency and effectiveness of the training.
This study explored a scenario with a single UAV. However, in reality, UAVs often operate in groups to perform tasks. Therefore, in the future, we will further investigate multi-UAV scenarios and multi-agent reinforcement learning strategies for multi-target interception tasks. Additionally, balancing exploration and exploitation during agent training has always been a crucial aspect of reinforcement learning research. In the EA-MADDPG framework presented in this paper, the series of exploratory actions were generated by adding multiple simple random noises. This design still has considerable room for improvement. Future work will consider incorporating Ornstein–Uhlenbeck noise for research and will also refer to the action selection principles based on relevant quantum theory from the QiRL framework and the prioritization principles for experiences in the replay pool from the QiER framework for more in-depth research.

Author Contributions

Conceptualization, H.G. and H.C.; methodology, X.L. and H.C.; software, X.L. and Y.Z.; validation, X.L.; formal analysis, X.L. and H.C.; investigation, X.L. and Y.Z.; writing—original draft preparation, X.L.; writing—review and editing, X.L. and H.C.; visualization, X.L.; supervision, H.G. and H.C.; funding acquisition, H.G. and H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under grant numbers 62173149, 62276104, and U22A2062, in part by the Guangdong Natural Science Foundation under grant numbers 2021A1515012584, 2022A1515011262, and in part by Fundamental Research Funds for the Central Universities.

Data Availability Statement

Data are contained within the article.

DURC Statement

Current research is limited to guidance and control, which is beneficial and does not pose a threat to public health or national security. Authors acknowledge the dual-use potential of research involving guidance and control and confirm that all necessary precautions have been taken to prevent potential misuse. As an ethical responsibility, authors strictly adhere to relevant national and international laws about DURC. Authors advocate for responsible deployment, ethical considerations, regulatory compliance, and transparent reporting to mitigate misuse risks and foster beneficial outcomes.

Conflicts of Interest

The authors declare no conflicts of interest.

Correction Statement

This article has been republished with a minor correction to the DURC Statement. This change does not affect the scientific content of the article.

Abbreviations

The following abbreviations are used in this manuscript:
UAV: Unmanned aerial vehicle
DDPG: Deep deterministic policy gradient
TD3: Twin-delayed deep deterministic policy gradient
MADDPG: Multi-agent deep deterministic policy gradient
EA-MADDPG: Equal-reward and action-enhanced multi-agent deep deterministic policy gradient
EA-MADDPG-S: Equal-reward and action-enhanced multi-agent deep deterministic policy gradient with single critic
ME-MADDPG: Mixed-experience multi-agent deep deterministic policy gradient
MATD3: Multi-agent twin-delayed deep deterministic policy gradient
MADDPG-L: Multi-agent deep deterministic policy gradient with long short-term memory
MAPPO: Multi-agent proximal policy optimization
MLP: Multi-layer perceptron
ReLU: Rectified linear unit

References

  1. Kang, H.; Joung, J.; Kim, J.; Kang, J.; Cho, Y.S. Protect your sky: A survey of counter unmanned aerial vehicle systems. IEEE Access 2020, 8, 168671–168710. [Google Scholar] [CrossRef]
  2. Li, C.Y.; Jing, W.X. Geometric approach to capture analysis of PN guidance law. Aerosp. Sci. Technol. 2008, 12, 177–183. [Google Scholar] [CrossRef]
  3. Yamasaki, T.; Takano, H.; Baba, Y. Robust path-following for UAV using pure pursuit guidance. In Aerial Vehicles; IntechOpen: London, UK, 2009. [Google Scholar]
  4. Lee, G.T.; Lee, J.G. Improved command to line-of-sight for homing guidance. IEEE Trans. Aerosp. Electron. Syst. 1995, 31, 506–510. [Google Scholar]
  5. Bryson, A.E. Applied Optimal Control: Optimization, Estimation and Control; Routledge: London, UK, 2018. [Google Scholar]
  6. Gutman, S. On Proportional Navigation. IEEE Trans. Aerosp. Electron. Syst. 1983, AES-19, 497–510. [Google Scholar]
  7. Shima, T.; Rasmussen, S. UAV Cooperative Decision and Control: Challenges and Practical Approaches; SIAM: University City, PA, USA, 2009. [Google Scholar]
  8. Kumar, V.; Michael, N. Opportunities and challenges with autonomous micro aerial vehicles. Int. J. Robot. Res. 2012, 31, 1279–1291. [Google Scholar] [CrossRef]
  9. Cummings, M.L.; Bruni, S. Collaborative Human-UAV Decision Making: Applications in Civilian UAVs; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  10. Xu, Z.; Zhuang, J. A study on a sequential one-defender-N-attacker game. Risk Anal. 2019, 39, 1414–1432. [Google Scholar] [CrossRef]
  11. Beard, R.W.; McLain, T.W. Small Unmanned Aircraft: Theory and Practice; Princeton University Press: Princeton, NJ, USA, 2012. [Google Scholar]
  12. Barto, A.G.; Sutton, R.S.; Anderson, C.W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 1983, SMC-13, 834–846. [Google Scholar] [CrossRef]
  13. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  14. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  15. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  16. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
  17. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation: Long Beach, CA, USA, 2017; Volume 30. [Google Scholar]
  18. Iqbal, S.; Sha, F. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2961–2970. [Google Scholar]
  19. Yu, C.; Velu, A.; Vinitsky, E.; Wang, Y.; Bayen, A.M.; Wu, Y. The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games. arXiv 2021, arXiv:2103.01955. [Google Scholar]
  20. Ackermann, J.; Gabler, V.; Osa, T.; Sugiyama, M. Reducing overestimation bias in multi-agent domains using double centralized critics. arXiv 2019, arXiv:1910.01465. [Google Scholar]
  21. Koch, W.; Mancuso, R.; West, R.; Bestavros, A. Deep reinforcement learning for UAV navigation and obstacle avoidance. IEEE Trans. Veh. Technol. 2019, 3, 22. [Google Scholar]
  22. Qie, H.; Shi, D.; Shen, T.; Xu, X.; Li, Y.; Wang, L. Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning. IEEE Access 2019, 7, 146264–146272. [Google Scholar] [CrossRef]
  23. García, M.; Viguria, A.; Heredia, G.; Ollero, A. Minimal-time trajectories for interception of malicious drones in constrained environments. In Proceedings of the Computer Vision Systems: 12th International Conference, ICVS 2019, Thessaloniki, Greece, 23–25 September 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 734–743. [Google Scholar]
  24. Tan, R.; Kumar, M. Tracking of ground mobile targets by quadrotor unmanned aerial vehicles. Unmanned Syst. 2014, 2, 157–173. [Google Scholar] [CrossRef]
  25. Çetin, A.T.; Koyuncu, E. Model Predictive Control-Based Guidance with Impact Angle Constraints for Visual Quadrotor Interception. In Proceedings of the 2023 9th International Conference on Control, Decision and Information Technologies (CoDIT), Rome, Italy, 3–6 July 2023; pp. 1–6. [Google Scholar]
  26. Xue, Y.; Wang, C.; Zhang, M. Trajectory tracking control method of UAV formation based on fuzzy control. In Proceedings of the International Conference on Cryptography, Network Security, and Communication Technology (CNSCT 2023), Changsha, China, 6–8 January 2023; Volume 12641, pp. 147–151. [Google Scholar]
  27. Li, J.; Xie, M.; Dong, Y.; Fan, H.; Chen, X.; Qu, G.; Wang, Z.; Yan, P. RTPN method for cooperative interception of maneuvering target by gun-launched UAV. Math. Biosci. Eng. 2022, 19, 5190–5206. [Google Scholar] [CrossRef]
  28. Liu, S.; Chen, T.; Zhao, T.; Liu, S.; Ma, C. Research on cooperative UAV countermeasure strategy based on interception triangle. In Proceedings of the 2023 4th International Conference on Machine Learning and Computer Application, Hangzhou, China, 27–29 October 2023; pp. 1015–1020. [Google Scholar]
  29. Tong, B.; Liu, J.; Duan, H. Multi-UAV interception inspired by Harris’ Hawks cooperative hunting behavior. In Proceedings of the 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), Sanya, China, 27–31 December 2021; pp. 1656–1661. [Google Scholar]
  30. Shaferman, V.; Shima, T. Cooperative optimal guidance laws for imposing a relative intercept angle. J. Guid. Control Dyn. 2015, 38, 1395–1408. [Google Scholar] [CrossRef]
  31. Wei, C.; Fancheng, K.; Zhang, D.; Zhenzhou, B. Research on Target Matching of Television Guided Missile Seeker. In Proceedings of the 2017 International Conference on Electronic Industry and Automation (EIA 2017), Suzhou, China, 23–25 June 2017; Atlantis Press: Dordrecht, The Netherlands, 2017; pp. 319–321. [Google Scholar]
  32. Ting, J.A.S.; Srigrarom, S. Drone-to-drone interception path planning by Deep Q-network with Graph Neural Network based (DQN-GNN) model. In Proceedings of the 2023 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM), Penang, Malaysia, 9–12 June 2023; pp. 122–127. [Google Scholar]
  33. Pierre, J.E.; Sun, X.; Novick, D.; Fierro, R. Multi-agent Deep Reinforcement Learning for Countering Uncrewed Aerial Systems. In Proceedings of the International Symposium on Distributed Autonomous Robotic Systems, Montbéliard, France, 28–30 November 2022; Springer: Cham, Switzerland, 2022; pp. 394–407. [Google Scholar]
  34. Du, B.; Liu, G.; Xie, W.; Zhang, W. Safe multi-agent learning control for unmanned surface vessels cooperative interception mission. In Proceedings of the 2022 International Conference on Advanced Robotics and Mechatronics (ICARM), Guilin, China, 9–11 July 2022; pp. 244–249. [Google Scholar]
  35. Liu, Y.; Wang, Y.; Dong, L. USV Target Interception Control With Reinforcement Learning and Motion Prediction Method. In Proceedings of the 2022 37th Youth Academic Annual Conference of Chinese Association of Automation (YAC), Beijing, China, 19–20 November 2022; pp. 1050–1054. [Google Scholar]
  36. Hu, Z.; Xiao, L.; Guan, J.; Yi, W.; Yin, H. Intercept Guidance of Maneuvering Targets with Deep Reinforcement Learning. Int. J. Aerosp. Eng. 2023, 2023, 7924190. [Google Scholar] [CrossRef]
  37. Li, Y.; Aghvami, A.H. Covertness-aware trajectory design for UAV: A multi-step TD3-PER solution. In Proceedings of the ICC 2022-IEEE International Conference on Communications, Seoul, Republic of Korea, 16–20 May 2022; pp. 7–12. [Google Scholar]
  38. Li, Y.; Aghvami, A.H.; Dong, D. Intelligent trajectory planning in UAV-mounted wireless networks: A quantum-inspired reinforcement learning perspective. IEEE Wirel. Commun. Lett. 2021, 10, 1994–1998. [Google Scholar] [CrossRef]
  39. Li, Y.; Aghvami, A.H.; Dong, D. Path planning for cellular-connected UAV: A DRL solution with quantum-inspired experience replay. IEEE Trans. Wirel. Commun. 2022, 21, 7897–7912. [Google Scholar] [CrossRef]
  40. Wan, K.; Wu, D.; Li, B.; Gao, X.; Hu, Z.; Chen, D. ME-MADDPG: An efficient learning-based motion planning method for multiple agents in complex environments. Int. J. Intell. Syst. 2022, 37, 2393–2427. [Google Scholar] [CrossRef]
  41. Zhao, E.; Zhou, N.; Liu, C.; Su, H.; Liu, Y.; Cong, J. Time-aware MADDPG with LSTM for multi-agent obstacle avoidance: A comparative study. Complex Intell. Syst. 2024, 10, 4141–4155. [Google Scholar] [CrossRef]
  42. Huang, L.; Fu, M.; Qu, H.; Wang, S.; Hu, S. A deep reinforcement learning-based method applied for solving multi-agent defense and attack problems. Expert Syst. Appl. 2021, 176, 114896. [Google Scholar] [CrossRef]
  43. Wang, Y.; Zhu, T.; Duan, Y. Cooperative Encirclement Strategy for Multiple Drones Based on ATT-MADDPG. In Proceedings of the 2023 IEEE 6th International Conference on Electronic Information and Communication Technology (ICEICT), Qingdao, China, 21–24 July 2023; pp. 1035–1040. [Google Scholar]
  44. Wei, X.; Yang, L.; Cao, G.; Lu, T.; Wang, B. Recurrent MADDPG for object detection and assignment in combat tasks. IEEE Access 2020, 8, 163334–163343. [Google Scholar] [CrossRef]
  45. Jeon, S.; Lee, H.; Kaliappan, V.K.; Nguyen, T.A.; Jo, H.; Cho, H.; Min, D. Multiagent reinforcement learning based on fusion-multiactor-attention-critic for multiple-unmanned-aerial-vehicle navigation control. Energies 2022, 15, 7426. [Google Scholar] [CrossRef]
  46. Yue, L.; Lv, M.; Yan, M.; Zhao, X.; Wu, A.; Li, L.; Zuo, J. Improving Cooperative Multi-Target Tracking Control for UAV Swarm Using Multi-Agent Reinforcement Learning. In Proceedings of the 2023 9th International Conference on Control, Automation and Robotics (ICCAR), Beijing, China, 21–23 April 2023; pp. 179–186. [Google Scholar]
  47. Zhang, R.; Zong, Q.; Zhang, X.; Dou, L.; Tian, B. Game of drones: Multi-UAV pursuit-evasion game with online motion planning by deep reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 7900–7909. [Google Scholar] [CrossRef]
  48. Zhang, J.; Qi, G.; Li, Y.; Sheng, A.; Xu, L. A Many-to-Many UAV Pursuit and Interception Strategy Based on PERMADDPG. In Proceedings of the 2023 5th International Conference on Robotics and Computer Vision (ICRCV), Nanjing, China, 15–17 September 2023; pp. 234–240. [Google Scholar]
  49. Zhu, P.; Dai, W.; Yao, W.; Ma, J.; Zeng, Z.; Lu, H. Multi-robot flocking control based on deep reinforcement learning. IEEE Access 2020, 8, 150397–150406. [Google Scholar] [CrossRef]
  50. Jiang, T.; Zhuang, D.; Xie, H. Anti-drone policy learning based on self-attention multi-agent deterministic policy gradient. In Proceedings of the International Conference on Autonomous Unmanned Systems, Changsha, China, 24–26 September 2021; Springer: Singapore, 2021; pp. 2277–2289. [Google Scholar]
  51. Zhan, G.; Zhang, X.; Li, Z.; Xu, L.; Zhou, D.; Yang, Z. Multiple-UAV reinforcement learning algorithm based on improved PPO in Ray framework. Drones 2022, 6, 166. [Google Scholar] [CrossRef]
  52. Huang, H.; Li, Y.; Song, G.; Gai, W. Deep Reinforcement Learning-Driven UAV Data Collection Path Planning: A Study on Minimizing AoI. Electronics 2024, 13, 1871. [Google Scholar] [CrossRef]
Figure 1. The heading parameters of the ith missile.
Figure 2. The heading parameters of the UAV.
Figure 3. The current and semi-historical heading parameters between the ith missile and the UAV.
Figure 4. Environment description.
Figure 5. The structure of the EA-MADDPG algorithm.
Figure 6. Schematic of f(α_i(k), u_{α_i}(k), λ).
Figure 7. Optimal replay buffer D*.
Figure 8. Action filter under guided mode.
Figure 9. The courses of reward for missiles for N = 3.
Figure 10. The courses of reward for missiles for N = 6.
Figure 11. Comparison of reward curves under the EA-MADDPG and EA-MADDPG-S algorithms.
Figure 12. Training time consumption of algorithms in the 3 vs. 1 environment.
Figure 13. The trajectories of missiles under four different algorithms with varying initial and target positions.
Figure 14. d_avg under different positions. Positions 1 to 6 correspond to the initial positions and target positions in Figure 13a–f.
Figure 15. k*_avg under different positions. Positions 1 to 6 correspond to the initial positions and target positions in Figure 13a–f.
Figure 16. The interception success percentage among 1000 tests.
Table 2. Initial missile positions when N = 3.

Parameter | Value | Explanation of Parameter
(p_01x, p_01y, p_01z) | (10, 10, 0) | Initial coordinates of missile 1 (m)
(p_02x, p_02y, p_02z) | (0, −100, 0) | Initial coordinates of missile 2 (m)
(p_03x, p_03y, p_03z) | (−50, 200, 0) | Initial coordinates of missile 3 (m)

Table 3. Initial missile positions when N = 6.

Parameter | Value | Explanation of Parameter
(p_01x, p_01y, p_01z) | (10, 10, 0) | Initial coordinates of missile 1 (m)
(p_02x, p_02y, p_02z) | (0, −100, 0) | Initial coordinates of missile 2 (m)
(p_03x, p_03y, p_03z) | (−50, 200, 0) | Initial coordinates of missile 3 (m)
(p_04x, p_04y, p_04z) | (10, 100, 0) | Initial coordinates of missile 4 (m)
(p_05x, p_05y, p_05z) | (−30, −40, 0) | Initial coordinates of missile 5 (m)
(p_06x, p_06y, p_06z) | (−150, 0, 0) | Initial coordinates of missile 6 (m)
Table 4. Common parameters for both environments.

Parameter | Value | Explanation of Parameter
v_max | 50 | The maximum linear speed (m/s)
Θ | π/6 | The maximum angular velocity in the plane Oxy and in the vertical plane (rad/s)
T | 0.05 | Time per step (s)
φ | π/3 | The angle between the generatrix of the cone and the z-axis (rad)
R | 2100 | The radius of the hemisphere P (m)
R_g | 300 | The radius of the green circle (m)
R_r | 200 | The radius of the red circle (m)
R_d | 30 | Interference distance of the device (m)
R_s | 100 | The minimum distance allowed for the UAV to approach the target (m)
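For readers reproducing the simulation environment, the quantities in Tables 2–4 could be collected into a single configuration object. The sketch below is only a convenience: the field names are illustrative and not defined in the paper, while the numerical values are those listed in the tables above.

```python
import math

# Environment configuration assembled from Tables 2-4.
# Field names are illustrative; only the numerical values come from the tables.
ENV_CONFIG = {
    "missile_init_positions_N3": [(10, 10, 0), (0, -100, 0), (-50, 200, 0)],      # Table 2
    "missile_init_positions_N6": [(10, 10, 0), (0, -100, 0), (-50, 200, 0),
                                  (10, 100, 0), (-30, -40, 0), (-150, 0, 0)],      # Table 3
    "v_max": 50.0,              # maximum linear speed (m/s)
    "omega_max": math.pi / 6,   # maximum angular velocity in Oxy and the vertical plane (rad/s)
    "dt": 0.05,                 # time per step (s)
    "phi": math.pi / 3,         # angle between the cone generatrix and the z-axis (rad)
    "R": 2100.0,                # radius of the hemisphere P (m)
    "R_g": 300.0,               # radius of the green circle (m)
    "R_r": 200.0,               # radius of the red circle (m)
    "R_d": 30.0,                # interference distance of the device (m)
    "R_s": 100.0,               # minimum distance allowed for the UAV to approach the target (m)
}
```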
Table 5. Training parameters.

Parameter | Value | Explanation of Parameter
q | 3 | Repeated exploration coefficient
η_0 | 0.3 | Initial proportion
    | 0.001 | The decay rate of η
p_0^Guided | 0.3 | Initial probability
δ | 0.005 | The probability decay rate of p^Guided
N_a | 0.2 | The amplitude of noise
ϑ | 0.001 | The noise decay rate
γ | 0.95 | Discount factor
ϱ | 0.02 | Learning rate
τ | 0.02 | Soft update rate
B_s | 50,000 | Buffer size of D
B_s* | 5000 | Buffer size of D*
S | 64 | Batch size
n_step | 200 | Training frequency
ς | 0.5 | Gradient clipping
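The abstract notes that the selection probability of the action filter and the sampling ratio of the optimal replay buffer are gradually reduced during training, and Table 5 lists initial values and decay rates for p^Guided, η, and the noise amplitude. The exact decay schedule is defined in the main text and is not reproduced here; the sketch below assumes a simple linear decay clipped at zero, using the Table 5 values purely for illustration.

```python
def linearly_decayed(initial_value, decay_rate, episode, floor=0.0):
    """Linear decay clipped at a floor. The linear form is an assumption made
    for illustration; the schedule used in the paper may differ."""
    return max(floor, initial_value - decay_rate * episode)

# Example with the values from Table 5 (the episode index k is hypothetical).
k = 100
p_guided  = linearly_decayed(0.3, 0.005, k)  # selection probability of the action filter
eta       = linearly_decayed(0.3, 0.001, k)  # sampling proportion from the optimal buffer D*
noise_amp = linearly_decayed(0.2, 0.001, k)  # exploration noise amplitude
```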
Table 6. The (target) actor network's parameters.

Layer | Input Dimension | Output Dimension
Input layer | 3 | 128
Hidden layer | 128 | 128
Output layer | 128 | 2
Table 7. The (target) critic network's parameters.

Layer | Input Dimension | Output Dimension
Input layer | 3N | 128
Hidden layer | 128 | 128
Output layer | 128 | 1
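Tables 6 and 7 specify three-layer MLPs for the (target) actor and critic networks. The sketch below is a minimal PyTorch reading of those dimensions: the ReLU activations follow the abbreviation list, while the tanh on the actor output and the interpretation of the 3N critic input as the concatenated observations of all N missiles are assumptions made for illustration.

```python
import torch.nn as nn

class Actor(nn.Module):
    """Actor MLP matching Table 6: 3 -> 128 -> 128 -> 2."""
    def __init__(self, obs_dim=3, action_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # tanh bounding is an assumption
        )

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Critic MLP matching Table 7: 3N -> 128 -> 128 -> 1.
    Treating the 3N input as the joint observation of N missiles is an assumption."""
    def __init__(self, n_agents, obs_dim=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_agents * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_obs):
        return self.net(joint_obs)
```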
