1. Introduction
The rounding-up of UAV swarms is a complex and challenging research area that intersects multiple disciplines, including computer science, control theory, communication technology, and aerospace engineering. With the rapid advancement of UAV technology, multi-UAV systems have become a hot topic in both research and application. Compared to single UAVs, multi-UAV systems exhibit numerous advantages in task execution. Single UAVs are often limited by endurance, payload capacity, and coverage area when faced with complex tasks and vast regions. However, multi-UAV systems, through collaborative operations, can effectively overcome these limitations, enhancing the efficiency and reliability of task execution. By employing cooperative strategies such as task allocation, path planning, and formation control, multi-UAV systems can cover larger areas and accomplish more complex tasks in a shorter time while offering higher redundancy and fault tolerance. These advantages make multi-UAV systems highly promising in various fields, including military operations, disaster relief [1], communication services [2], mapping and exploration [3], resource delivery [4], and agriculture [5]. However, the challenge lies in achieving efficient control and coordination of UAV swarms in complex environments.
The problem of UAV swarm pursuit involves controlling one or more groups of UAVs to track and capture targets swiftly and accurately under specific conditions. This problem encompasses challenges in target tracking, path planning, communication coordination, and obstacle avoidance. To address these challenges, researchers need to conduct comprehensive investigations from both theoretical and practical perspectives. From a theoretical standpoint, the pursuit problem of UAV swarms requires knowledge from multiple fields, including optimization algorithms and control theory. Research in these domains can provide theoretical support and technical guidance for the pursuit problem of UAV swarms. Additionally, the study of communication protocols and collaboration strategies among drones is crucial for achieving efficient collaboration within drone swarms. From a practical perspective, the pursuit problem of drone swarms needs to be researched and validated in conjunction with real-world application scenarios. This includes testing and analyzing the performance parameters of drones, as well as simulating and replicating actual environments. Through these experiments and simulations, the effectiveness and feasibility of proposed control strategies and algorithms can be evaluated.
In the field of UAV cooperative control, traditional optimization algorithms such as ant colony optimization and wolf pack algorithms face limitations in computation time, flexibility, and intelligence, which hinder their ability to fully achieve intelligent control and decision-making in UAV swarms. On the other hand, the recently emerging deep reinforcement learning (DRL) algorithms, with their powerful high-dimensional perception capabilities and ability to handle nonlinear problems, have proven effective in a wide range of tasks, including path planning [6], collision avoidance [7], and autonomous driving [8]. DRL has also seen widespread applications in UAV swarm control. For instance, in [9], DRL algorithms were used to control multiple UAVs to track a person in an obstacle-laden environment, and in [10], multiple UAVs were utilized as communication relays, with DRL algorithms optimizing to maximize the communicable area. Given the extensive applications of deep reinforcement learning in the multi-UAV domain, DRL can similarly be effectively employed to accomplish complex tasks involving UAV swarm coordination, pursuit, and strike missions in challenging environments.
In this paper, our main contributions are as follows:
We present a collaborative control strategy for unmanned aerial vehicle (UAV) formations utilizing a multi-agent deep reinforcement learning algorithm. This approach is designed to address the challenge of coordinating UAV groups for cooperative rounding-up tasks, with the overarching goal of enhancing strategic control capabilities in battlefield environments and optimizing the efficiency of collaborative operations within UAV groups.
We have developed a shared experience data pool that allows each UAV to learn from the collective experiences of the entire swarm. This shared mechanism enables each UAV to leverage the operational experiences of others, continuously improving its own behavioral strategies. This approach significantly enhances the adaptability of the UAV system in dynamic environments, enabling it to respond more intelligently to complex scenarios.
An innovative reward function has been developed to address challenges encountered by unmanned aerial vehicle (UAV) swarms in diverse scenarios, facilitating the successful accomplishment of target missions within a formation. This unique approach aims to optimize the performance of UAV swarms by tailoring the reward structure to the intricacies of various situations.
2. Related Work
In this section, we summarize existing works in multi-UAV control. The emerging paradigm of multi-UAV systems has given rise to a plethora of research avenues, with the primary focus on three pivotal components: perception, coordination, and navigation. Within this complex tapestry, the incorporation of neural network controls offers a promising dimension, enhancing the efficiency and adaptability of multi-UAV operations.
2.1. Optimization-Based Approach
Researchers have put forward the idea of utilizing traditional optimization methods to regulate the obstacle avoidance and pursuit of multiple unmanned aerial vehicles (UAVs). Optimization-based approaches have emerged as a robust framework for controlling multiple drones, employing mathematical techniques to identify the most optimal solution among a range of viable alternatives. These methods approach the planning dilemma as a constrained optimization problem with the objective of minimizing a designated cost function while adhering to constraints such as collision avoidance and formation maintenance [11,12,13,14].
For the obstacle avoidance navigation problem, this can be ensured by hard constraints that limit the trajectory to a specified convex space [13,15], or by soft constraints that directly embed the penalty function into the objective function [11]. For optimization-based controls, it is common to separate mapping, collaboration, and planning tasks from the broader navigation process. This approach allows for parallel execution across components, enhancing the transparency and explainability of the overall system. X. Zhou et al. [16] formulated the cluster control problem of UAV swarms as a multi-objective optimization problem and improved the multi-objective pigeon-inspired optimization (MPIO) based on the hierarchical learning behavior of the pigeon swarm to solve it in a distributed manner. In multi-UAV systems, model predictive control (MPC) is gradually used in various environments, including obstacle-free environments, space environments with obstacles [17], and multi-UAV collision-free trajectory generation [18,19]. Kuriki and Namerikawa [20] proposed a multi-UAV collaborative formation control strategy based on decentralized MPC and consensus control with collision avoidance capabilities. Their method ensures that each UAV can make decisions independently while ensuring the consistency of collision avoidance coupling constraints.
Regarding the tracking and rounding-up problem, Fei Yu et al. [21] analyzed the pursuit process based on differential games, gave a boundary setting process, and determined the game strategy based on the kinematic characteristics of the aircraft to achieve pursuit in an obstacle environment. Mikhail Khachumov et al. [22] combined intelligent control theory, jointly applying precise collective control with flexible intelligent control, to construct a mathematical model of UAV motion and implement pursuit research in a simulation system. Bingda Tong et al. [23] proposed a Harris hawk (Parabuteo unicinctus)-inspired method for cooperatively hunting and intercepting enemy drones, constructed a simplified drone guidance model, implemented a lead–follow strategy, and successfully intercepted enemy drones. Jiaxin Li et al. [24] designed a pursuit–avoidance strategy, combined it with the UAV motion control method, and achieved good results in improving interception time.
The optimization-based approach has undoubtedly elevated the capabilities of multi-UAV control, encompassing obstacle avoidance, tracking, and rounding-up tasks. However, this method encounters certain challenges that warrant attention. Particularly, in dynamic or unpredictable environments, traditional approaches struggle to guarantee the formulation of suitable control strategies at every stage of UAV swarm movement. Consequently, this limitation can result in errors, delays, and ultimately, catastrophic crashes. Addressing these challenges is crucial to ensure the safe and efficient operation of UAV swarms in complex and dynamic scenarios.
2.2. Learning-Based Approach
In recent years, the field of drone control has witnessed significant advancements with the integration of deep learning and reinforcement learning technologies. Researchers have increasingly turned to these learning-based approaches to enhance the adaptability and efficiency of unmanned aerial vehicle (UAV) systems. By leveraging learning methods, it becomes possible to design an end-to-end controller that directly translates input data into control commands, thereby greatly enhancing the capabilities and efficiency of the UAV system. This paper explores the potential of learning-based technologies in improving the performance of UAV systems, offering insights into the design and implementation of an efficient and adaptable end-to-end controller.
Inspired by biological swarm intelligence, Longting Jiang et al. [25] constructed a communication multi-agent deterministic policy gradient (COM-MADDPG) framework to achieve dynamic rounding-up of target points. Chaoxu Mu et al. [26] also proposed a leader–follower strategy model combined with the actor–critic framework to realize the formation control problem, which effectively reduces the time complexity compared to traditional methods. Bo Li et al. [27] considered the relationship between the enemy and ourselves, designed an alternative maneuver strategy, and trained it using the soft actor–critic algorithm, which improved the flexibility during the pursuit and achieved good rounding-up effects. Xiaowei Fu et al. [28] designed a quasi-proportional guidance control law to generate effective learning samples as an empirical data pool for the DDPG algorithm. Experimental results show that the trained UAV can quickly track other target points, improving efficiency. Qianxin Xia et al. [29] proposed a method based on multi-agent reinforcement learning for formation target capture using an adaptive allocation strategy to complete the allocation of capture points. The trained strategy model can successfully capture escape targets through formation. Ruilong Zhang et al. [30] studied the target pursuit–avoidance problem of multiple quadcopters in an obstacle environment, expanded the MADDPG algorithm, and constructed a two-way coordinated target prediction network to ensure the effectiveness in the pursuit and escape process. Yihao Sun et al. [31] proposed a multi-agent deep deterministic policy gradient algorithm based on attention, which solved the problem of collaborative tracking control of autonomous driving and showed scalability in dynamic environments with obstacles. Yuanda Wang et al. [32] proposed a distributed cooperative pursuit strategy with communication based on reinforcement learning. Based on the deep deterministic policy gradient (DDPG) algorithm, a novel training algorithm for ring and leader–follower network topologies was developed. Verified through extensive simulations, the pursuer can capture agile escapees with a high success rate.
From this perspective, the challenge of cooperative obstacle avoidance and pursuit among multiple UAVs in a three-dimensional environment has predominantly been addressed through the utilization of traditional optimization methods, albeit lacking intelligent elements. However, the advent of reinforcement learning techniques presents an opportunity to train more efficient and intelligent control strategies. In light of this, we propose a novel approach based on deep reinforcement learning to tackle the issue of multi-UAV collaborative pursuit. By leveraging the power of reinforcement learning, our method aims to enhance the cooperative capabilities of UAVs in pursuit scenarios, enabling them to navigate complex environments with greater efficiency and intelligence.
In recent years, reinforcement learning algorithms have been widely applied in decision-making scenarios, particularly in multi-agent deep reinforcement learning (MARL) [33]. Knowledge transfer mechanisms, including shared buffers, action advising, and experience sharing, have been extensively explored in the MARL domain. Ming Tan [34] introduced the concept of shared buffers, where agents improve learning efficiency by sharing experiences. This approach demonstrated the potential of cooperative agents to outperform independent agents by leveraging shared experiences in multi-agent environments. Building on this foundation, recent work has focused on optimizing the conditions under which knowledge transfer occurs. For example, Tonghao Wang et al. [35] proposed an automated design of action-advising trigger conditions using a genetic programming-based method, highlighting how intelligent advising can further enhance agent performance by dynamically adjusting advising strategies according to the environment and agent needs. In research by Tonghao Wang [36] on experience sharing-based memetic transfer learning in MARL (MeTL-ES), a scalable approach was demonstrated where agents share implicit knowledge without the computational burden of traditional action-advising methods. Furthermore, in the work of Songyang Han [37], information sharing in MARL was utilized to enhance the safety and efficiency of connected autonomous vehicles (CAVs) through innovative techniques such as truncated Q-functions and safe action mapping. These studies all highlight the importance of knowledge transfer mechanisms in MARL and provide the basis on which our approach builds.
Our contributions are delineated as follows:
Collaborative control strategy: Unlike existing approaches, our work introduces a novel multi-agent deep reinforcement learning algorithm specifically designed for UAV formations. This strategy focuses on the coordination of UAV groups in cooperative rounding-up tasks, aiming to enhance strategic control capabilities in complex battlefield environments. The innovative aspect of our approach lies in its ability to optimize the efficiency of collaborative operations within UAV groups, which has not been fully addressed in the literature.
Shared experience data pool: We have developed a shared experience data pool that enables each UAV to not only make independent decisions and analyses but also learn from the experiences of the entire UAV swarm. This shared mechanism allows each UAV to leverage the operational experiences of others, continuously refining its own behavioral strategies. This approach significantly enhances the UAV system’s adaptability in dynamic environments, enabling it to respond more intelligently to complex scenarios, marking a substantial advancement over traditional methods.
Innovative reward function: Our work also introduces an innovative reward function tailored to the challenges faced by UAV swarms in diverse operational scenarios. This reward function is specifically designed to optimize UAV swarm performance, ensuring successful mission completion within a formation. The novelty of our approach lies in its customization of the reward structure to account for the complexities of various situations, which we believe sets our work apart from other reinforcement learning-based UAV decision-making approaches.
3. Preliminaries
3.1. Problem Scenario
In this paper, we intend to solve the problem of cooperative obstacle avoidance and round up of UAV swarms. This task can be divided into UAV formation coordination, UAV swarm obstacle avoidance, and UAV swarm rounding-up.
UAV swarm coordination is an effective approach for multiple drones to collaboratively and efficiently accomplish tasks. This relies on utilizing cameras and sensors within the drone cluster to gather crucial information, such as the position and velocity of nearby aircraft. By maintaining a safe distance from each other, the swarm can navigate seamlessly during mission execution. Collaborative drone swarms improve task efficiency, enhance system reliability, and meet mission requirements in complex environments. In addition to collaborative tasks, drone swarms also need to consider obstacle avoidance to ensure safe navigation in challenging environments. This strategy enables drones to detect and evade obstacles while maintaining a safe distance, thereby enhancing the safety and efficiency of the drone swarm. Furthermore, drone swarms can effectively encircle and capture target objects during pursuit missions. This is particularly useful for tracking and capturing both moving and stationary targets, such as escaping vehicles, pedestrians, or other drones. Applying these capabilities to various domains, including security surveillance, border patrol, and crime prevention, can significantly improve the success rate and efficiency of these operations.
Overall, cooperative obstacle avoidance and rounding-up of UAV formations enables UAV swarms to operate effectively in diverse and challenging environments. We expect to combine the UAV swarm control decision-making model to complete obstacle avoidance, tracking, and rounding-up tasks. To this end, we designed a schematic diagram of UAV cooperative obstacle avoidance and rounding-up in an urban scene, as shown in Figure 1. The overall task can be divided into two sub-tasks, as shown in Figure 2, where (a) is the formation reconstruction process of the UAV group after avoiding obstacles, and (b) is the process of the UAV group chasing and surrounding the target point.
3.2. Problem Formulation
In this section, we present a comprehensive analysis of the problem of UAV swarm cooperative rounding-up, taking into account necessary assumptions and mitigating influential factors. Additionally, we provide a concise yet effective design of the experimental scene. Furthermore, we establish a robust flight control model for the UAV, serving as a fundamental framework for subsequent experiments. By addressing these crucial aspects, we aim to lay a solid foundation for our research and facilitate a deeper understanding of the cooperative pursuit dynamics within UAV swarms. In this study, we address the challenge of coordinating a multi-UAV swarm to pursue a mission target point within an obstacle environment. In real-world mission scenarios, the drone swarm and the target point are subject to various external factors, such as wind and gravity. However, to ensure the research aligns with our objectives, we focus solely on the kinematic model of UAV speed and acceleration during the simulation process. By eliminating extraneous factors, we aim to achieve precise motion control of the UAVs. The speed of the UAV can be expressed as $\mathbf{v} = (v_x, v_y, v_z)$, and the acceleration of the UAV can be expressed as $\mathbf{a} = (a_x, a_y, a_z)$.
In this experiment, during the movement of the UAV swarm, we regard each UAV as a particle. The kinematic equation of UAVs in a three-dimensional space environment can be expressed as follows:

$$
\begin{cases}
x(t+\Delta t) = x(t) + v_x(t)\,\Delta t + \tfrac{1}{2}a_x(t)\,\Delta t^{2},\\
y(t+\Delta t) = y(t) + v_y(t)\,\Delta t + \tfrac{1}{2}a_y(t)\,\Delta t^{2},\\
z(t+\Delta t) = z(t) + v_z(t)\,\Delta t + \tfrac{1}{2}a_z(t)\,\Delta t^{2}.
\end{cases}
\tag{1}
$$

In Equation (1), $(x(t), y(t), z(t))$ represent the spatial coordinates of the UAV at time $t$, where $x(t)$ is the coordinate along the x-axis, $y(t)$ is the coordinate along the y-axis, and $z(t)$ is the coordinate along the z-axis. $(x(t+\Delta t), y(t+\Delta t), z(t+\Delta t))$ represent the spatial coordinates of the UAV at time $t+\Delta t$, indicating the new position after a time increment $\Delta t$. $v_x(t)$, $v_y(t)$, and $v_z(t)$ represent the velocity components of the UAV along the x-axis, y-axis, and z-axis at time $t$, respectively. $a_x(t)$, $a_y(t)$, and $a_z(t)$ represent the acceleration components of the UAV along the x-axis, y-axis, and z-axis at time $t$, respectively. $\Delta t$ denotes the time increment, which is the elapsed time from $t$ to $t+\Delta t$. The terms $\tfrac{1}{2}a_x(t)\Delta t^{2}$, $\tfrac{1}{2}a_y(t)\Delta t^{2}$, and $\tfrac{1}{2}a_z(t)\Delta t^{2}$ represent the displacement components caused by the accelerations over the time increment $\Delta t$; they are derived from the equations of motion for uniformly accelerated linear motion. The kinematic model of the UAV is shown in Figure 3.
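For illustration, the following minimal Python sketch implements the uniformly accelerated update of Equation (1); the function name, state layout, and default time step are assumptions of this sketch rather than the exact simulator code.

```python
import numpy as np

def kinematic_step(position, velocity, acceleration, dt=0.05):
    """Propagate one UAV (treated as a particle) forward by one time step.

    position, velocity, acceleration: length-3 arrays of x-, y-, z-components.
    dt: time increment; 0.05 s matches the step length used in the experiments.
    Implements Equation (1), i.e., uniformly accelerated linear motion.
    """
    position = np.asarray(position, dtype=float)
    velocity = np.asarray(velocity, dtype=float)
    acceleration = np.asarray(acceleration, dtype=float)

    new_position = position + velocity * dt + 0.5 * acceleration * dt ** 2
    new_velocity = velocity + acceleration * dt
    return new_position, new_velocity

# Example: a UAV at the origin moving along x while accelerating along y.
p, v = kinematic_step([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.5, 0.0])
```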
To enhance the realism of our UAV model, we introduce a threat area concept, which establishes a radius around each UAV. This area serves as a boundary within which any approaching obstacle or aircraft is deemed a potential threat to the UAV. The threat area is defined by two parameters: the maximum threat distance, denoted as $d_{\max}$, and the minimum threat distance, denoted as $d_{\min}$. Figure 4 provides a visual representation of the threat area, illustrating its schematic diagram. By incorporating this concept, we aim to simulate real-world scenarios where UAVs must navigate and respond to potential threats in their surroundings.
In our research, we introduce a critical distance parameter, denoted as $d_c$, which determines the collision threshold between the drone and any obstacle or approaching aircraft. If the distance between the drone and an object falls below $d_c$, it is considered a collision, resulting in the drone crashing. Additionally, we define a capture criterion based on the target point's position within the internal area enclosed by the UAV swarm. If the target point remains within this region for a significant duration, we conclude that the UAV swarm has successfully captured it. These criteria play a crucial role in evaluating the performance and effectiveness of the UAV swarm in pursuit missions.
In response to the above problem of cooperative obstacle avoidance and rounding-up of UAV groups, we define the following constraints:

$$
\begin{cases}
\lVert p_i - p_T \rVert \le d_t,\\
d_{\min} \le \lVert p_i - p_j \rVert \le d_{\max}, \quad i \ne j,\\
\lVert p_i - p_o \rVert > d_c,\\
\lVert v_i \rVert \le v_{\max},
\end{cases}
$$

where $p_i$ and $p_j$ represent the positions of the drones; $p_T$ represents the position of the target point; $p_o$ represents the position of the obstacle; we consider $\lVert p_i - p_T \rVert \le d_t$ to define the target area of the drone; $d_{\min}$ and $d_{\max}$ define the safe area between drones; and $\lVert p_i - p_o \rVert > d_c$ means that the drone will not collide with obstacles. Finally, to consider drone safety issues, we also set an upper speed threshold $v_{\max}$ for the drones.
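As a concrete illustration, the check below encodes these constraints in Python; the function name and the default numeric thresholds (taken from the experimental settings in Section 5) are assumptions of this sketch.

```python
import numpy as np

def check_constraints(p_i, neighbours, obstacles, p_target, v_i,
                      d_min=1.2, d_max=2.2, d_c=1.2, d_t=2.0, v_max=2.0):
    """Evaluate the constraints above for a single UAV.

    p_i, v_i: position and velocity of the UAV; neighbours / obstacles: lists
    of positions of nearby UAVs and obstacles; p_target: target position.
    The default numeric values are illustrative only.
    """
    def dist(a, b):
        return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

    in_target_area = dist(p_i, p_target) <= d_t               # rounding-up (target) area
    safe_spacing = all(d_min <= dist(p_i, q) <= d_max for q in neighbours)
    no_collision = all(dist(p_i, o) > d_c for o in obstacles)
    speed_ok = float(np.linalg.norm(v_i)) <= v_max            # upper speed threshold
    return in_target_area, safe_spacing, no_collision, speed_ok
```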
4. Approach
In this section, we provide a comprehensive analysis of approaches to solving the rounding-up problem in obstacle spaces by leveraging UAV formations. We delve into the details of deep reinforcement learning-based methods, providing a comprehensive description of each aspect. These include an introduction to the basic theory of reinforcement learning, the design of the swarm control decision model, the architecture of the training network, the configuration of the UAV swarm reward function, and the pseudocode for the implementation of our algorithm. By illuminating these key components, we aim to provide a comprehensive understanding of the proposed approach and its potential implications in solving the rounding-up problem.
4.1. Multi-Agent Deep Reinforcement Learning
Multi-agent deep reinforcement learning is a powerful approach to address the challenge of control decision-making for multiple agents in collaborative or competitive environments. Unlike traditional reinforcement learning, where agents operate independently and solely consider their own states and actions, multi-agent deep reinforcement learning enables agents to collaborate or compete with each other in order to achieve a shared objective or maximize their individual returns. This methodology is particularly relevant in real-world scenarios where agents must work together or engage in competition to accomplish common goals.
In the realm of multi-agent decision-making, the effective sharing of data among agents and the enhancement of overall decision-making performance are crucial. To achieve this, we propose the establishment of a comprehensive shared experience and information buffer, which undergoes continuous optimization strategies to improve its control capabilities. In complex environments, where multiple agents are involved, the state input can be represented as $s = (s_1, s_2, \ldots, s_n)$, while the action output can be expressed as $a = (a_1, a_2, \ldots, a_n)$, with $n$ denoting the number of agents. Each agent is driven by a common task goal and aims to maximize its individual reward return. By executing the strategy $\pi$, the state transitions from $s_t$ to $s_{t+1}$, and each agent receives its own reward $r_i$. Consequently, the overall reward value $R = \sum_{i=1}^{n} r_i$ can be calculated. Our primary objective is to maximize the task reward income, expressed as $\max_{\pi} \mathbb{E}\big[\sum_{t} \gamma^{t} R_t\big]$.
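A minimal sketch of such a shared experience pool is given below, assuming a simple deque-backed buffer with uniform random sampling; the class and field names are illustrative rather than the exact data structures of our implementation.

```python
import random
from collections import deque

class SharedReplayBuffer:
    """Experience pool shared by all UAVs in the swarm.

    Every agent pushes its own transitions into the same buffer, and every
    agent samples from the common pool, so each UAV can learn from the
    experiences of the whole swarm.
    """
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Each UAV i stores (s_i, a_i, r_i, s_i', done) in the same buffer and samples
# mini-batches from it when updating its actor and critic networks.
```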
4.2. Soft Actor–Critic Algorithm
The soft actor–critic algorithm (SAC) is a deep reinforcement learning algorithm that integrates the principles of policy gradient and Q-learning. By incorporating entropy regularization, SAC enhances the exploration and learning capabilities of the agent while ensuring policy stability. This algorithm serves as a powerful tool for addressing complex decision-making problems in various domains. Through the combination of policy gradient and Q-learning, SAC offers a comprehensive framework for optimizing the agent’s behavior and achieving efficient policy convergence. The introduction of entropy regularization further enriches the algorithm’s ability to explore the environment and adapt to dynamic scenarios. As a result, SAC stands as a significant advancement in the field of deep reinforcement learning, with promising applications in diverse domains.
The soft actor–critic (SAC) algorithm is a prominent approach in the field of reinforcement learning, aimed at learning the optimal action strategy by optimizing both the strategy and value function. The primary objective of the policy function optimization is to maximize the weighted sum of the expected return and entropy. This optimization goal is encapsulated in the objective function, which can be mathematically expressed as follows [38]:

$$
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big],
$$

where $\alpha$ is the entropy weight and $\mathcal{H}(\pi(\cdot \mid s_t))$ denotes the entropy of the policy in state $s_t$.
The policy function is optimized to maximize the weighted sum of the expected return and entropy, striking a balance between exploration and exploitation. The expected return is computed using a value function estimator, while the entropy value serves to encourage exploration and prevent the strategy from becoming overly deterministic. By adjusting the weight of entropy, we can effectively control the trade-off between exploration and exploitation.
The optimization goal of the value function is to minimize the mean square error of the value function so that it can accurately estimate the value of the state–action pair. We define the value function to be parameterized by $\psi$, and we learn the training value function by minimizing the squared residual, which can be expressed as follows [38]:

$$
J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[ \tfrac{1}{2}\Big( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\big[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \big] \Big)^{2} \right],
$$

where $\mathcal{D}$ denotes the replay buffer of previously sampled states and actions.
The soft actor–critic (SAC) algorithm incorporates two value function estimators to enhance learning ability and stability. The primary value function estimator is responsible for estimating the value function, while the secondary estimator is utilized for calculating the target value. This dual-Q network structure effectively reduces the estimation error of the value function, leading to improved performance. By employing two value function estimators, the SAC algorithm mitigates the overestimation bias commonly observed in single-estimator approaches. The main value function estimator is updated based on the temporal difference error, while the target value is computed using the secondary estimator. This separation of estimation and target calculation helps to stabilize the learning process and improve the accuracy of value function estimation. Furthermore, the SAC algorithm incorporates a target entropy regularization term to encourage exploration and prevent premature convergence. By maximizing the entropy of the policy distribution, the algorithm promotes exploration in the early stages of learning, facilitating the discovery of optimal action strategies.
Based on an understanding of the above formula, training the value function network needs to minimize the squared difference between the prediction of the value network and the expected prediction of the Q function. By deriving the above formula, we can obtain the following:

$$
\hat{\nabla}_\psi J_V(\psi) = \nabla_\psi V_\psi(s_t)\big( V_\psi(s_t) - Q_\theta(s_t, a_t) + \log \pi_\phi(a_t \mid s_t) \big).
$$
The training Q function is based on the following equation:

$$
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[ \tfrac{1}{2}\big( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \big)^{2} \right],
$$

where $\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\big[ V_{\bar{\psi}}(s_{t+1}) \big]$, with $V_{\bar{\psi}}$ denoting the target value network. Bringing this into the above formula results in the following:

$$
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[ \tfrac{1}{2}\Big( Q_\theta(s_t, a_t) - r(s_t, a_t) - \gamma \, \mathbb{E}_{s_{t+1} \sim p}\big[ V_{\bar{\psi}}(s_{t+1}) \big] \Big)^{2} \right].
$$
In order to use the gradient strategy for training, we can derive the above formula as follows:

$$
\hat{\nabla}_\theta J_Q(\theta) = \nabla_\theta Q_\theta(s_t, a_t)\big( Q_\theta(s_t, a_t) - r(s_t, a_t) - \gamma V_{\bar{\psi}}(s_{t+1}) \big).
$$
The training policy network is based on the following formula:

$$
J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\big[ \log \pi_\phi\big( f_\phi(\epsilon_t; s_t) \mid s_t \big) - Q_\theta\big( s_t, f_\phi(\epsilon_t; s_t) \big) \big],
$$

where $a_t = f_\phi(\epsilon_t; s_t)$ is the action obtained by reparameterizing the policy with the noise vector $\epsilon_t$. To train it using the gradient strategy, we differentiate it as follows:

$$
\hat{\nabla}_\phi J_\pi(\phi) = \nabla_\phi \log \pi_\phi(a_t \mid s_t) + \big( \nabla_{a_t} \log \pi_\phi(a_t \mid s_t) - \nabla_{a_t} Q_\theta(s_t, a_t) \big) \nabla_\phi f_\phi(\epsilon_t; s_t).
$$
During the training process, the stochastic gradient descent method is commonly employed to update both the policy network and the value function network. As the multi-agent interacts with the environment, the collected experience data are stored in a buffer. These data play a crucial role in calculating the loss values of the policy function and the value function. Subsequently, the network parameters are updated using the backpropagation algorithm, aiming to minimize the loss function. This iterative process facilitates the refinement and optimization of the policy and value function networks, enhancing the overall training performance.
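The following PyTorch fragment sketches one such update step, mirroring the value, Q-function, and policy objectives above; the network objects, the actor.sample interface, and the hyperparameter values are assumptions of this illustration, not the exact training code.

```python
import torch
import torch.nn.functional as F

def sac_update(actor, q_net, value_net, target_value_net,
               actor_opt, q_opt, value_opt, batch, gamma=0.99, alpha=0.2, tau=0.005):
    """One SAC gradient step on a sampled mini-batch (illustrative sketch)."""
    s, a, r, s2, done = batch            # tensors sampled from the shared buffer

    # Re-sample an action from the current policy (reparameterized); actor.sample
    # is assumed to return the action and its log-probability.
    a_new, logp = actor.sample(s)

    # Value loss: squared residual between V(s) and Q(s, a~) - alpha * log pi(a~|s).
    v_target = (q_net(s, a_new) - alpha * logp).detach()
    value_loss = F.mse_loss(value_net(s), v_target)
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    # Q loss: squared residual against the soft Bellman target r + gamma * V_target(s').
    q_target = (r + gamma * (1.0 - done) * target_value_net(s2)).detach()
    q_loss = F.mse_loss(q_net(s, a), q_target)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Policy loss: minimize E[alpha * log pi(a~|s) - Q(s, a~)] through the reparameterized sample.
    policy_loss = (alpha * logp - q_net(s, a_new)).mean()
    actor_opt.zero_grad(); policy_loss.backward(); actor_opt.step()

    # Polyak averaging of the target value network.
    with torch.no_grad():
        for p, tp in zip(value_net.parameters(), target_value_net.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)

    return value_loss.item(), q_loss.item(), policy_loss.item()
```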
4.3. Network Architecture
This paper presents a novel approach utilizing two deep neural networks (DNNs) to train actor networks and critic networks. The architecture of the networks is illustrated in Figure 5. In the actor network, the input data are passed through a four-layer neural network with dimensions Linear [18, 128], Linear [128, 128], Linear [128, 64], and Linear [64, 2], along with the LeakyReLU activation function, to derive the corresponding strategic actions. On the other hand, the critic network processes the input data through a four-layer neural network with dimensions Linear [128, 128], Linear [128, 64], Linear [64, 32], and Linear [32, 1], along with the LeakyReLU activation function, to estimate the evaluation value of the corresponding state–action pair.
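For concreteness, a PyTorch sketch of the actor network with the listed layer sizes is shown below; interpreting the two outputs of the final layer as the mean and log-standard deviation of the Gaussian policy is an assumption of this sketch, and the critic follows the same pattern with its own layer sizes.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor network with the layer sizes listed above (18-128-128-64-2)."""
    def __init__(self, obs_dim=18):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.LeakyReLU(),
            nn.Linear(128, 128), nn.LeakyReLU(),
            nn.Linear(128, 64), nn.LeakyReLU(),
            nn.Linear(64, 2),   # interpreted here as (mean, log_std) of the Gaussian policy
        )

    def forward(self, obs):
        mean, log_std = self.backbone(obs).chunk(2, dim=-1)
        std = log_std.clamp(-20, 2).exp()     # keep the standard deviation numerically sane
        return torch.distributions.Normal(mean, std)

    def sample(self, obs):
        dist = self.forward(obs)
        action = dist.rsample()               # reparameterized sample (gradient flows to the actor)
        return action, dist.log_prob(action).sum(-1, keepdim=True)
```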
In the actor network, empirical information observations obtained from the interaction between the UAV group and the environment are taken as input. The output of the actor network consists of a mean vector and a standard deviation vector. These vectors are used to determine a Gaussian distribution, representing the stochastic policy $\pi_\phi$. Subsequently, an action is sampled from this distribution.
On the other hand, the critic network takes the status and actions of the drone swarm as input and outputs a value function that evaluates the current status and actions. The training objective of the critic network is to minimize the mean square error of the value function and reduce the discrepancy between the predicted value and the true value of the value function.
By simultaneously training the actor and critic networks using the SAC algorithm, policy learning and value function learning are implemented. At each time step, the actor network selects an action based on the Gaussian distribution generated by the mean and standard deviation vectors. The resulting interaction with the environment provides the next state and reward. The critic network evaluates the value of the next state, and the TD (temporal difference) error is computed. The parameters of the critic network are updated based on this error, improving the accuracy of the value function prediction. Additionally, the output of the critic network serves as the advantage function for the actor network. The actor network parameters are updated using the gradient of the advantage function, enhancing the strategy.
Overall, this research paper presents a comprehensive framework that combines actor and critic networks to optimize the decision-making process of a UAV group in an interactive environment. The proposed approach demonstrates promising results in improving the performance and efficiency of the UAV group.
4.4. Reward Function
In order to assess the learning performance of drone swarms in the context of swarm rounding-up, it is crucial to establish a reliable measure of success. To address this, this research paper introduces a novel reward function that serves as an evaluation metric for the drones’ learning capabilities.
The defined reward function takes into account various factors that contribute to the overall success of the drone swarm. These factors include the efficiency of the swarm in completing the rounding-up task, the accuracy of the drones’ actions, and coordination and collaboration among the individual drones. By considering these key aspects, the reward function provides a comprehensive assessment of the learning performance of the drone swarm.
In the experimental setup, we defined the operating space as a 30 × 30 area, with each UAV having a radius of 1, a safety distance of 1.2, and a maximum speed of 2. Obstacles in the environment were also assigned a radius of 1. In the context of a rounding-up mission, drones play a crucial role in starting from a designated point, reaching the target location for rounding-up, and executing precise directional strikes. Throughout the itinerary of the drone group, several factors require careful consideration, including maintaining a proper formation, avoiding obstacles, and ensuring the successful completion of the rounding-up task. To effectively evaluate the performance of the drone group based on these aspects, a mixed reward function is proposed in this research paper:

$$
r = r_1 + r_2 + r_3 + r_4,
$$

where $r_1$ represents the reward obtained by the UAV group reaching the rounding-up target position, $r_2$ represents the reward obtained by the UAV group avoiding collision during the march, $r_3$ represents the reward obtained by the UAV group maintaining the formation during the march, and $r_4$ represents the reward obtained by controlling the speed of the drone swarm approaching the target position.
Based on the environmental parameter settings, we consider a UAV to have reached the target when it is within a distance of 2 from the target point. Additionally, we describe the UAV's progress toward the target using a negative scoring function. Furthermore, we define the UAV's heading as moving toward the target when the heading angle is below a set threshold, and the UAV's performance is considered optimal when the heading angle falls below a tighter threshold; this parameter was refined through continuous adjustments based on training outcomes. The reward $r_1$ for the UAV swarm reaching the rounding-up target position therefore consists of two parts: a distance term, obtained as the UAV closes the distance to the target position, and an angle term, obtained when the heading toward the target position is correct.
Considering the parameters of the UAVs and obstacles, we determined that if the edge-to-edge distance between UAVs, or between a UAV and an obstacle, is greater than 0.2—meaning the center-to-center distance between UAVs is greater than 2.2—then the UAVs are within a safe distance. Additionally, we consider that when the distance between UAVs is less than 0.5, or the distance between a UAV and an obstacle is less than 1.2, the UAV is deemed to be at serious risk of damage, and a significant negative reward is applied.
The reward $r_2$ obtained by the UAV swarm for avoiding collisions during its travel is set based on the relative distance $d_o$ between the UAV swarm and the obstacles, following the safe-distance and collision-risk thresholds defined above.
The reward $r_3$ obtained by the UAV swarm for maintaining the formation during the march is based on the relative distance $d_u$ between UAVs within the swarm, rewarding separations within the safe range and penalizing those that fall below the collision-risk threshold.
Regarding speed, our goal is to keep the drone close to its maximum speed while maintaining a certain margin, so we assign a positive reward when the speed is below 1.8 and a large negative reward when the speed exceeds 2.2.
The reward $r_4$ obtained by controlling the speed of the UAV group as it approaches the target position is based on the speed $v$ of each UAV, following the speed thresholds described above.
In this section, we propose the implementation of a mixed reward function to evaluate the performance of drone clusters in rounding-up missions. The reward function takes into account various factors such as obstacle collisions, aircraft collisions, and yaw flight, assigning a negative reward value in these instances, which ultimately leads to mission failure. However, this approach also serves the purpose of optimizing the overall strategy employed.
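To make the structure of this mixed reward concrete, the sketch below assembles the four components using the thresholds stated above; the individual weights, bonus and penalty magnitudes, and the heading threshold are illustrative assumptions rather than the exact values used in training.

```python
import math

def mixed_reward(dist_to_target, heading_angle, dist_to_obstacle, dist_to_uav, speed,
                 heading_threshold=math.pi / 2):
    """Illustrative mixed reward r = r1 + r2 + r3 + r4 for a single UAV.

    Thresholds follow the values stated in the text; weights, penalty magnitudes,
    and the heading threshold are assumptions of this sketch.
    """
    # r1: approach the target (negative distance score) plus heading and arrival bonuses.
    r1 = -0.1 * dist_to_target
    if heading_angle < heading_threshold:
        r1 += 1.0
    if dist_to_target < 2.0:              # within the rounding-up radius
        r1 += 5.0

    # r2: obstacle avoidance; heavy penalty inside the collision-risk distance.
    r2 = -10.0 if dist_to_obstacle < 1.2 else 0.0

    # r3: formation keeping; heavy penalty when UAVs are dangerously close, small bonus otherwise.
    r3 = -10.0 if dist_to_uav < 0.5 else 0.5

    # r4: speed shaping; positive below 1.8, large penalty above 2.2.
    if speed < 1.8:
        r4 = 0.5
    elif speed > 2.2:
        r4 = -10.0
    else:
        r4 = 0.0

    return r1 + r2 + r3 + r4
```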
4.5. Swarm Control Algorithm
In this study, we employed the soft actor–critic (SAC) algorithm as the foundation for our control strategy. The pseudocode for the training algorithm is outlined as follows. Initially, the network undergoes a random initialization process, setting the starting and ending mission coordinates for the UAV formation. Subsequently, the process enters a loop where the drone utilizes visual information to make decisions and execute actions based on factors such as the proximity between adjacent drones, obstacles, and target points. These actions lead to transitions to the next state, accompanied by corresponding rewards. Furthermore, we store the previous state value, action value, reward value, current state value, and completion signal value in a cache space, which is then utilized for network training through value sampling. Notably, information sharing is facilitated among the drones within this system. The overall algorithm is described in Algorithm 1, providing a comprehensive overview of the methodology employed in this research.
Algorithm 1: Soft Actor–Critic (SAC) Algorithm for Multi-UAV Flocking Rounding-Up
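Because the pseudocode box is summarized only in prose above, the following Python-style outline restates the training procedure; the environment and agent interfaces (env.reset, env.step, agent.act, agent.update) are hypothetical names used purely for illustration.

```python
def train_swarm(env, agents, shared_buffer, episodes=2000, batch_size=256):
    """Illustrative outline of Algorithm 1; names and structure are assumptions."""
    for episode in range(episodes):
        # Randomly initialize the mission: start and end coordinates of the formation.
        obs = env.reset(randomize_start_and_goal=True)
        done = False
        while not done:
            # Each UAV decides from its visual observation of neighbouring drones,
            # obstacles, and the target point.
            actions = [agent.act(o) for agent, o in zip(agents, obs)]
            next_obs, rewards, done = env.step(actions)

            # Store every UAV's transition (s, a, r, s', done) in the shared experience pool.
            for o, a, r, o2 in zip(obs, actions, rewards, next_obs):
                shared_buffer.push(o, a, r, o2, done)

            # Sample from the shared pool and update each agent's actor and critic (SAC step).
            if len(shared_buffer) >= batch_size:
                for agent in agents:
                    agent.update(shared_buffer.sample(batch_size))
            obs = next_obs
```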
5. Experiment and Result Analysis
In this section, we present a comprehensive evaluation of our novel multi-UAV formation control, cooperative rounding-up, and obstacle avoidance methods. To assess the effectiveness of our approach, we conducted experiments using the Gazebo9 simulation environment. Gazebo is a highly advanced platform that offers a realistic simulation environment closely resembling the physical characteristics of real-world robots. In this evaluation, we focus on the performance of the controller trained using the soft actor–critic (SAC) algorithm and compare it with other state-of-the-art reinforcement learning algorithms. Through rigorous comparison and verification, we demonstrate that our proposed system outperforms existing methods. To further validate the functionality of our designed system, we devised two complex mission scenarios for evaluation: (1) fixed-point rounding-up, and (2) dynamic tracking and rounding-up. These scenarios provide a comprehensive assessment of the system’s capabilities and showcase its potential in real-world applications.
Fixed-point rounding-up and strike means that the location of the target point to be hit is known, and a drone group is sent to round up and strike it. Dynamic tracking and rounding-up means that the movement trajectory of the target point to be hit is known, and a group of drones is sent to track and round up the moving target point.
5.1. Experimental Parameters and Hardware Settings
In our simulation environment, we configured a drone swarm consisting of five drones, each equipped with two simulated cameras: a grayscale camera and a stereo camera. The grayscale camera is capable of recording grayscale images at a resolution of 720 × 540 pixels, with a frame rate of 20 frames per second. On the other hand, the stereo cameras capture depth images at a resolution of 224 × 224 pixels, also at a frame rate of 20 Hz. These cameras work in conjunction with an onboard PyTorch-based end-to-end controller, which is responsible for controlling the drones. The drone swarm relies on visual perception to gather information about nearby drones and obstacles and utilizes obstacle avoidance and rounding-up actions to earn rewards. In the experimental setup, we trained the system within a 30 × 30 area ({(x, y) ∣ −20 ≤ x ≤ 10, −20 ≤ y ≤ 10}). We deployed five UAVs, each with a radius of 1 m. To ensure collision avoidance, a safety distance of 1.2 m was maintained, and the maximum speed was capped at 2 m/s. Additionally, four cylindrical obstacles with a radius of 1 m each were placed within the environment. During the training process, the start and end points were randomly generated to enhance the system's adaptability and intelligence. For a detailed overview of the environmental parameters used in our experiments, please refer to Table 1.
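For reference, the environment settings just described can be gathered into a single configuration, as in the illustrative sketch below (this is not the actual Table 1, and the upper bound of the y-range is inferred from the 30 × 30 area).

```python
# Illustrative environment configuration mirroring the settings described above.
ENV_CONFIG = {
    "area": {"x": (-20.0, 10.0), "y": (-20.0, 10.0)},  # 30 x 30 space; y upper bound assumed
    "num_uavs": 5,
    "uav_radius_m": 1.0,
    "safety_distance_m": 1.2,
    "max_speed_mps": 2.0,
    "num_obstacles": 4,
    "obstacle_radius_m": 1.0,
    "randomize_start_goal": True,   # start and end points randomized during training
}
```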
In the experiment, the PPO algorithm, DDPG algorithm, and SAC algorithm were used to train the strategy. During the training process, the batch size is the number of samples extracted from the experience pool during each training update of the UAV, and the discount rate determines how strongly future rewards are weighted when calculating the loss value. The optimization algorithm is the Adam algorithm, the initial value of the entropy regularization coefficient is 1, the target entropy value is −2, and so on. The specific parameter settings are shown in Table 2, Table 3 and Table 4.
When training the policy model, we used the PyTorch framework to design the neural network model and optimizer. The CPU was an Intel Xeon Silver 4210R (2.4 GHz, 10 cores/20 threads), and the GPU was an NVIDIA RTX 3090 with 24 GB of memory, which was used to accelerate the training of the policy network. The computer memory was 32 GB.
5.2. Results and Analysis
In this section, we present the configuration of our experimental task environment, which is based on the description provided in the previous section. We proceed to display and analyze the experimental results obtained from the drone swarm. The primary objective of the swarm is to learn and explore, with the aim of achieving formation, obstacle avoidance, and tracking and rounding-up capabilities. To facilitate the training process of the strategy, we devised a hybrid reward function that assigns appropriate rewards and penalties to the UAV swarm. Additionally, we conducted tests to evaluate the performance of the resulting policy in completing tasks within an obstacle environment.
5.2.1. Learning Curve
In this article, we utilize three currently mainstream reinforcement learning algorithms (the SAC, DDPG, and PPO algorithms) to solve the drone swarm rounding-up problem. It should be noted that we ensured that the training of all three algorithms achieved convergence. Based on the reward function described above, we measured the rewards obtained by the five drones in each episode and drew a learning curve to evaluate whether the five drones could perform better under the current policy network. During the training process, fluctuations in the reward curve were normal. These fluctuations typically arise from the algorithm's exploratory phase, where various action combinations are attempted; the complexity of the tasks being tackled; the inherent randomness of the environment; and the non-linear nature of the problems being addressed. As shown in Figure 6, the polyline represents the cumulative rewards obtained by the five drones in the current mission space at the end of each episode of training, where (a) is the learning curve of fixed-point rounding-up, and (b) is the learning curve of dynamic rounding-up. As can be seen from Figure 6, under the same important parameter settings, whether in the static (fixed-point) rounding-up scenario or the dynamic rounding-up scenario, the SAC algorithm obtains higher rewards than the DDPG and PPO algorithms.
5.2.2. Simulation Results and Visual Analysis
In order to test the performance of each strategy, we saved the previously trained control strategy model, then carried out the UAV’s fixed-point rounding-up and dynamic tracking rounding-up and strike missions and analyzed its flight trajectory.
A. UAV Swarm Fixed-Point Rounding-up and Strike Mission Test
The UAV group-targeted rounding-up strike mission is a military action that uses a group composed of multiple UAVs to round up and strike specific targets. In this type of mission, using information from onboard sensors, the drone swarm works together to round up the target in a specific area, prevent it from escaping or causing further damage, and then strike the target in order to destroy or neutralize it.
The flight trajectory is shown in Figure 7. It can be seen from the figure that, in the beginning, there was a certain distance between the UAV group and the target point, and there were obstacles in the way. As our UAV group adjusted, it gradually approached the enemy and finally moved around the enemy's strike point, maintaining a certain distance from the target point and rounding up the target in a specific area. Figure 7 illustrates the fixed-point search process of the three algorithms across four directions. Panels (a), (b), and (c) show the motion trajectories of the PPO, DDPG, and SAC algorithms, respectively, in the upward, downward, leftward, and rightward directions. The PPO algorithm struggles with the fixed-point rounding-up task, demonstrating subpar performance. While the DDPG algorithm completes the task, it fails to produce optimal movement trajectories. In contrast, the SAC algorithm excels in both obstacle avoidance and fixed-point rounding-up tasks, delivering consistently smooth and efficient trajectories.
B. UAV Swarm Dynamic Tracking, Rounding-up, and Strike Mission Test
UAV group dynamic tracking and rounding strike missions refer to military actions that use a group composed of multiple UAVs to track, round-up, and strike moving targets in real time. In this kind of mission, we have anticipated the enemy’s movement, and under the condition of avoiding obstacles, the drone swarm works together to round-up the dynamic target, force it to stop, or directly destroy it.
The flight trajectory is shown in Figure 8. It can be observed from the figure that, over time, the UAV group constantly adjusts its direction and speed, gradually approaches the moving enemy target point, reduces the distance to it, and finally successfully captures the enemy target. Figure 8 shows the movement trajectories of the five drones under the three algorithms while chasing the target point in the dynamic rounding-up scene, where (a) is the test trajectory diagram of the PPO algorithm, (b) is the test trajectory diagram of the DDPG algorithm, and (c) is the test trajectory diagram of the SAC algorithm. For each of the three algorithms, trajectory snapshots were captured at step = 100, 150, 200, 300, and 600.
As can be seen in Figure 8, under the PPO algorithm, the UAV swarm cannot successfully bypass the obstacles and collides with them; although there is a tendency to move closer to the target point, it cannot successfully complete the dynamic round-up task. Under the DDPG algorithm, the UAV swarm can successfully bypass obstacles and complete the round-up task, but the internal spacing of the UAVs is not maintained very well, and there is a risk of collision. Under the SAC algorithm, the UAV swarm completes the obstacle avoidance and round-up tasks very well, and the obstacle avoidance and round-up trajectories are also closer to the ideal state.
To summarize, by testing the effects of the three algorithms in the two rounding-up scenarios, we found that, compared to the PPO and DDPG algorithms, the SAC algorithm completes these tasks better. The SAC algorithm finishes both the fixed-point and dynamic rounding-up tasks faster and more reliably, while PPO is not suitable for these tasks.
5.3. Evaluation Metrics
To assess the efficacy of the three algorithms in both fixed-point and dynamic round-up tasks, we devised three key performance indicators: efficiency, safety, and energy consumption. By leveraging these indicators, we aim to discern which algorithm is better suited for the specific demands of this application scenario.
5.3.1. Efficiency Assessment
In terms of efficiency, we measured the number of steps taken by the various algorithms to complete the two tasks, with each step set at 0.05 s. A shorter capture time indicates that the algorithm is capable of quickly identifying and rounding up the target. Specifically, we consider a distance of 2 m from the target point to represent a successful rounding-up.
Figure 9a, Figure 9b, and Figure 9c represent the changes in the distance of the drone swarm to the target point over time in the fixed-point rounding-up mission for the three algorithms, respectively. It can be observed that all three algorithms approach the mission target point, but the UAV under the PPO algorithm converges prematurely and does not fully reach the rounding-up range. Both the DDPG and SAC algorithms can complete this task, but judging from the number of running steps, DDPG requires about 156 steps, while SAC takes only 130 steps. From this point of view, the SAC algorithm is better than the DDPG algorithm.
Figure 9d, Figure 9e, and Figure 9f represent the changes in the distance of the drone swarm to the target point over time in the dynamic rounding-up task for the three algorithms, respectively. It can be observed that while all three algorithms tend to complete the task, the PPO algorithm fails to maintain the surrounding state after approaching the target point. This inability to closely follow the mission point ultimately leads to mission failure. Both the DDPG and SAC algorithms can complete this task, but in terms of rounding-up stability and the number of steps required, the DDPG algorithm takes about 248 steps and its distance remains unstable, while the SAC algorithm takes only about 130 steps and its distance remains stable. From this point of view, the SAC algorithm performs better in both the fixed-point and dynamic rounding-up tasks.
5.3.2. Safety Assessment
In terms of safety, we measured and recorded the minimum distance between drones and obstacles, the minimum distance between drones, and the minimum average distance between drones throughout the mission. A reduced risk of collision shows that the algorithm can effectively prevent drone collisions and ensure mission safety. In addition, a higher obstacle avoidance ability indicates that the algorithm can quickly identify and avoid obstacles, thereby ensuring the safe flight of the drones. Specifically, we consider 1.2 m to be the safe distance between drones and obstacles and 0.5 m to be the safe distance between drones.
Figure 10a and Figure 10b show the safety indicators of the three algorithms in the fixed-point and dynamic rounding-up tasks, respectively: the minimum distance between drones and obstacles, the minimum distance between drones, and the average distance between drones. In the fixed-point rounding-up task (a), both the DDPG and SAC algorithms achieve safe obstacle avoidance and collision prevention, while the UAVs collide with obstacles under the PPO algorithm. In the dynamic rounding-up task (b), only the UAVs under the SAC algorithm can safely avoid obstacles; UAVs operating under both the DDPG and PPO algorithms are at risk of colliding with obstacles or other aircraft.
5.3.3. Energy Consumption Assessment
In terms of energy consumption, we recorded the average acceleration of the UAV swarm during flight when completing the two tasks using the different methods. We consider that the smaller the acceleration, the smoother the drone operates and the lower the energy consumption. A lower energy consumption means that the algorithm can use the drone's energy resources more efficiently, extending the duration of the mission.
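To make the three indicators concrete, the sketch below shows how they could be computed from logged per-step trajectories; the array layout and function name are assumptions of this illustration.

```python
import numpy as np

def evaluate_run(dist_to_target, dist_to_obstacles, dist_between_uavs, accelerations,
                 capture_radius=2.0, step_time=0.05):
    """Compute the efficiency, safety, and energy indicators for one test run.

    All inputs are per-step logs; the exact logging format is an assumption.
    """
    dist_to_target = np.asarray(dist_to_target)
    captured = np.flatnonzero(dist_to_target <= capture_radius)
    steps_to_capture = int(captured[0]) if captured.size else None            # efficiency
    capture_time_s = steps_to_capture * step_time if steps_to_capture is not None else None

    safety = {                                                                # safety
        "min_dist_to_obstacle": float(np.min(dist_to_obstacles)),
        "min_dist_between_uavs": float(np.min(dist_between_uavs)),
        "avg_dist_between_uavs": float(np.mean(dist_between_uavs)),
    }
    energy = float(np.mean(accelerations))   # lower average acceleration -> lower consumption
    return steps_to_capture, capture_time_s, safety, energy
```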
Figure 11 shows the average acceleration of the UAV swarm for the two tasks controlled by the three algorithms, obtained after running each strategy 20 times. Comparing the three algorithms, it is obvious that, in both the fixed-point rounding-up task and the dynamic rounding-up task, the average acceleration of the UAV group under the SAC algorithm is the smallest; that is, its energy consumption is lower than that of the other two algorithms.
Based on the analysis of the learning curves and simulation visualizations in Section 5.2, as well as the evaluation metrics discussed in Section 5.3, several conclusions can be drawn. First, the SAC algorithm demonstrates a significantly better learning curve than the DDPG and PPO algorithms; with the same critical parameter settings, SAC consistently achieves higher rewards in both the static and the dynamic rounding-up scenarios. Second, the trajectory visualization analysis reveals that the SAC algorithm produces smoother paths and reaches target positions more accurately, especially in dynamic scenarios, compared to DDPG and PPO. Third, the efficiency evaluation shows that SAC allows the UAVs to approach the target point more quickly and maintain a more stable formation around the target. In terms of safety, the SAC-based UAVs maintain safer distances from neighboring UAVs and obstacles throughout the entire flight compared to DDPG and PPO. Lastly, the energy consumption assessment indicates that UAVs under the SAC algorithm exhibit the lowest average acceleration, implying lower energy consumption than the other two algorithms. The comparison results of the three algorithms in the two scenarios, obtained from 20 repeated test runs, are shown in Table 5.
In the table above, we have summarized and compared the performance of the three algorithms across two different scenarios. The comparison includes the number of steps required to reach the target point (efficiency); the minimum distance between UAVs and obstacles, as well as between UAVs themselves (safety); the average acceleration during UAV operation (consumption); and the overall success rate.
In conclusion, UAV swarms utilizing the SAC algorithm can effectively accomplish obstacle avoidance and encirclement tasks, demonstrating superior performance in both static and dynamic scenarios. In contrast, the DDPG and PPO algorithms show relatively poor performances, making them less suitable for handling high-complexity tasks.
6. Conclusions
This research paper proposes a new method using multi-agent deep reinforcement learning to solve the challenge of collaborative obstacle avoidance and round-up of drone swarms, enabling multiple drones to exhibit enhanced collaborative round-up capabilities and higher intelligence. Through a series of experiments, we evaluate the effectiveness of the policy model trained using reinforcement learning by analyzing the average reward curves and the drone-tracking trajectory plots. In addition, we propose three performance evaluation metrics to discern which algorithm is more suitable for the specific needs of this application scenario. The feasibility and effectiveness of the proposed multi-agent deep reinforcement learning method were verified through a large number of simulations. The experimental and simulation results show that the method based on multi-agent deep reinforcement learning achieves good results in the rounding-up scenario and successfully achieves the goal of using multiple drones to capture target points. However, it is important to note that the current scenario is relatively simple.
In our research, we utilized Gazebo to establish a robust and versatile simulation environment that closely mirrors real-world UAV dynamics, providing a strong foundation for validating our theoretical approaches and enabling potential real-world applications. While the direct transfer from simulation to reality, as demonstrated in previous studies, is promising, we recognize Gazebo’s inherent limitations, particularly its idealization of environmental complexity and simplification of hardware constraints. These factors must be carefully addressed during the transition from simulation to real-world implementation to ensure the effectiveness and reliability of UAV systems. Moving forward, we plan to extend our research by conducting real-world UAV flight experiments.