1. Introduction
With the rapid development of science and technology, UAVs have been widely used in military, industrial, agricultural, and other fields. However, when faced with requirements such as target search, target pursuit, and target round-up, a single UAV often suffers from various problems, including a limited detection range and a weak ability to adapt to the environment. Thus, it is increasingly important to study techniques that allow multiple UAVs to handle collaborative tasks automatically. At present, many countries are involved in research on multi-UAV formation, including some specific military programs. As early as 2008, the University of Pennsylvania verified the indoor formation flight and obstacle avoidance of 16–20 small quadcopters. In recent years, several low-cost multi-UAV formation projects, such as the Defense Advanced Research Projects Agency's (DARPA's) Pixie Project and the Attack UAV Swarm Tactics Project in the U.S., have been launched [1]. The ability to continuously traverse and monitor specific target areas is the main advantage of a multi-UAV system. Due to limitations in operational accuracy and capability, a single UAV generally has difficulty performing tasks independently, and a multi-UAV cluster can effectively compensate for this deficiency. However, as a complex cluster system, a multi-UAV cluster must exhibit the ability to self-organize and adapt. In previous research on multi-agent control methods, agents were generally required to make decisions based on environmental information and individual information within the cluster. Overreliance on this information leads to poor environmental adaptability; therefore, enabling a cluster to accomplish tasks with less information and improving the efficiency of information utilization are current research hotspots.
The control of a UAV cluster mainly depends on collaborative control techniques for multi-agent systems [2,3,4]. In 1986, Reynolds first proposed the Boids model, based on observations of bird swarms [5]. It assumed that each individual can perceive information about its neighbors within a certain range and make decisions based on the three basic principles of aggregation, separation, and alignment [6]. On this basis, Vicsek established a planar discrete-time model in 1995 [7] to simulate the emergence of consistent particle behavior. These classic models laid the foundation for traditional cluster control methods. To date, formation control methods based on the principle of consistency have mainly included the leader–follower method [8,9] and the behavior control method [10]. The idea behind the leader–follower method is to select one leader in the cluster, with the remaining agents acting as followers. The leader holds the flight path and the task's target. Based on a distributed control method, the states of the followers gradually become consistent with those of the leader, and the cluster ultimately maintains stable flight. The behavior-based control method follows the idea of swarm intelligence: according to the desired behavior pattern of the UAVs, individual behavior rules and local control schemes are designed for each UAV, and a "behavior library" is obtained and stored in the formation controller. When the control system of a UAV receives an instruction, it selects and executes the corresponding behavior from the "behavior library". Based on these general consistency-based control methods, Lopen improved the method for a multi-agent system with visual constraints [11]. Further, Song handled the loop formation problem with limited communication distance [12], and Hu considered the coordinated control of spacecraft formations with external interference and limited communication resources [13].
Consistency-based cluster control methods can achieve the collaboration of a multi-agent system, but they lack autonomy and adaptability when facing dynamic environments and complex tasks. Therefore, from a data-driven perspective, reinforcement learning (RL) methods, which have strong decision-making capabilities [14,15], have also been widely studied in this field. Bansal [16] explored the process of generating complex behaviors for multiple agents through a self-play mechanism. To deal with problems involving discrete and continuous states in a multi-agent system, Han Hu [17] proposed the DDPG-D3QN algorithm. Jincheng [18] investigated the role of baselines in stochastic policy gradients to better apply policy optimization methods in real-world situations. For offline RL problems, Zifeng Zhuang [19] found that the inherent conservativeness of policy-based algorithms needed to be overcome and proposed behavior proximal policy optimization (BPPO), which, compared with PPO, does not require any additional constraints or regularization. Zongwei Liu [20] proposed an actor-director-critic algorithm, which adds the role of director to the conventional actor-critic algorithm, improving the performance of the agents. To address the problems of low learning speed and poor generalization in decision making, Bo Li [21] proposed PMADDPG, an improved version of the multi-agent deep deterministic policy gradient (MADDPG) algorithm. Siyue Hu [22] proposed a noise-MAPPO algorithm, whose success rate exceeded 90% in all StarCraft Challenge scenarios. Since single-agent RL suffers from overestimation bias of the value function, which causes multi-agent reinforcement learning methods to learn policies ineffectively, Johannes Ackermann [23] proposed an approach that reduces this bias by using double centralized critics; the self-attention mechanism [24] was also introduced on this basis, with remarkable results. In order to improve the learning speed in complex environments, Sriram Subramanian [25] utilized the L2Q learning framework and extended it from single-agent to multi-agent settings. To improve the flow of autonomous vehicles in road networks, Anum [26] proposed a method based on multi-agent RL and an autonomous path selection algorithm.
The research in the above literature has achieved certain results in formation control and obstacle avoidance for multi-UAV systems. However, in these conventional algorithms, the adopted leader generally follows a previously prescribed flight path. This means that if the target to be tracked is not within the detectable range of the leader, the multi-UAV cluster cannot construct an effective decision-making mechanism, leading to failure of the tracking task. To deal with this problem, this paper designs a consistent round-up strategy based on PPO path optimization for the leader–follower tracking problem. The strategy builds on the consistent formation control method for a leader–follower multi-UAV cluster and aims to achieve target round-up and obstacle avoidance. PPO can balance exploration and exploitation while maintaining the simplicity and computational efficiency of the algorithm's implementation. By limiting the step size of the policy updates, PPO avoids the training instability caused by excessively large updates, which allows it to perform in a balanced and robust way across a range of tasks. It is supposed that each member of the multi-UAV cluster has a detectable range for spotting a nearby target and obstacles, and each obstacle has an impact range within which any moving UAV will collide. The basic principle of the proposed strategy is to force the multi-UAV cluster to approach and round up the target based on consistent formation control when any member locates the target and there are no obstacles nearby, while in other conditions optimizing the policy of the leader based on PPO to determine the best flight path and make the followers cooperate with the leader. To verify the performance of the proposed strategy in different environments, four scenarios are considered in the numerical experiments: a fixed target; a moving target; a fixed target and a fixed obstacle; and a fixed target and a moving obstacle. The results show that the strategy exhibits excellent performance in tracking the target and successfully avoiding obstacles. In summary, the main contributions of this paper can be concluded as follows:
- (1)
Designing a flight formation based on the Apollonian circle for tracking the target and executing the collaborative flight of the multi-UAV cluster based on consistent formation control, thereby achieving the round-up of the target in situations where the target is within the detectable range and none of the pursuit UAVs enter the impact range of any obstacle.
- (2)
Optimizing the acting policy of the leader with the PPO algorithm to find the best flight path for tracking the target and avoiding obstacles, thereby achieving the round-up of the target with the help of consistent formation control in situations where the target is out of the detectable range or any of the pursuit UAVs enters the impact range of an obstacle.
- (3)
Validating and analyzing the performance of the proposed algorithm regarding target round-up and obstacle avoidance in environments with a fixed target, a moving target, a fixed target and a fixed obstacle, and a fixed target and a moving obstacle.
The rest of this paper is organized as follows:
Section 2 introduces the necessary preliminaries related to this paper;
Section 3 illustrates the design principles, implementation process, and extensibility of the proposed strategy;
Section 4 details the numerical experiments, in which the results of the proposed round-up strategy applied in different environments are compared and analyzed; and
Section 5 presents the conclusions.
3. A Consistent Round-Up Strategy Based on PPO Path Optimization
The traditional formation control method for a multi-UAV cluster generally requires the leader to follow a previously prescribed flight path, which may lead to failure of target tracking and obstacle avoidance in a complex environment. Specifically, when based on the consistency protocol only, the round-up mission will fail if the pursuit multi-UAV cluster cannot detect the target or if the round-up route is blocked by an obstacle. To deal with this problem, this paper designs a consistent round-up algorithm based on PPO path optimization for the leader–follower [31] tracking problem, as shown in Figure 3.
The proposed strategy assumes that one leader and several followers exist in the multi-UAV cluster. From Figure 3, it can be noted that the proposed round-up strategy consists of two main parts: the PPO algorithm and the consistency protocol. The leader is trained and controlled by the PPO algorithm, playing the role of tracking the target and avoiding the obstacles in the environment; when the target is out of the cluster's detectable range, the PPO-based reinforcement learning algorithm plans the optimal flight path of the leader. Once this flight path is obtained, the followers are expected to follow the leader through the consistency protocol. When the target can be detected by the cluster, the consistency protocol controls the cluster to round up the target based on the formation of an Apollonian circle. The strategy combines these two parts to guide the cluster to complete the mission of safely rounding up the target.
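For illustration, the following Python sketch summarizes this switching logic. The callables `ppo_leader_policy`, `consistency_control`, and `roundup_control` are hypothetical interfaces standing in for the PPO-trained policy and the components detailed in Section 3.1 and Section 3.2, and the detection and impact checks are simple distance tests assumed only for this sketch.

```python
import numpy as np

def cluster_step(positions, target_pos, obstacle_pos, detect_range, impact_range,
                 ppo_leader_policy, consistency_control, roundup_control):
    """One decision step of the round-up strategy sketched in Figure 3.

    positions: (n, 2) array of UAV positions (row 0 is the leader);
    ppo_leader_policy, consistency_control, roundup_control are callables standing
    in for the components described in Sections 3.1-3.3 (hypothetical interfaces).
    """
    # Target detected if any member is within the detectable range.
    detected = np.any(np.linalg.norm(positions - target_pos, axis=1) <= detect_range)
    # Obstacle nearby if any member is inside the impact range of any obstacle.
    blocked = np.any(np.linalg.norm(
        positions[:, None, :] - obstacle_pos[None, :, :], axis=2) <= impact_range)

    if detected and not blocked:
        # Consistency protocol rounds up the target on the Apollonian-circle formation.
        return roundup_control(positions, target_pos)
    # Otherwise, PPO plans the leader's path; followers track it via the consistency protocol.
    leader_action = ppo_leader_policy(positions[0], target_pos, obstacle_pos)
    return consistency_control(positions, leader_action)
```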
3.1. Discrete-Time Consistency Protocol
The purpose of the cluster is to round up a single target while avoiding obstacles. The area in which the target can be detected is defined as the "detectable area". If none of the individuals in the pursuit cluster can detect the target, the leader needs to plan a flight path to approach the detectable area of the target. Once the leader enters the detectable area, the round-up path can be planned based on the Apollonian circles, which requires a cooperative flight according to the consistency protocol.
For a two-dimensional discrete-time system with $n$ UAVs in the cluster, the dynamics of each individual can be described by the following double-integrator model:

$$\boldsymbol{p}_i(k+1) = \boldsymbol{p}_i(k) + T\,\boldsymbol{v}_i(k), \qquad \boldsymbol{v}_i(k+1) = \boldsymbol{v}_i(k) + T\,\boldsymbol{u}_i(k),$$

where $\boldsymbol{p}_i(k)$, $\boldsymbol{v}_i(k)$, and $\boldsymbol{u}_i(k)$ represent the position, velocity, and control input of the $i$th member, respectively, and $T$ denotes the sampling period ($T>0$). For any pair of members $i$ and $j$, if the system starting from any initial state satisfies the following conditions:

$$\lim_{k\to\infty}\left\|\boldsymbol{p}_j(k)-\boldsymbol{p}_i(k)\right\| = 0, \qquad \lim_{k\to\infty}\left\|\boldsymbol{v}_j(k)-\boldsymbol{v}_i(k)\right\| = 0,$$

then the discrete-time system is capable of achieving asymptotic consistency.
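The following minimal Python sketch illustrates one step of this discrete-time double-integrator update; the sampling period value is chosen only for illustration.

```python
import numpy as np

def uav_step(p, v, u, T=0.1):
    """One discrete-time update of a UAV under the double-integrator model:
    p(k+1) = p(k) + T * v(k),  v(k+1) = v(k) + T * u(k)."""
    return p + T * v, v + T * u

# Example: one member starting at the origin with unit speed along x,
# accelerating along y (sampling period T = 0.1 s is an illustrative value).
p_i, v_i = np.zeros(2), np.array([1.0, 0.0])
p_i, v_i = uav_step(p_i, v_i, u=np.array([0.0, 0.5]))
```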
Assume that each member can only communicate with the adjacent neighbors in its communication region; then the set of adjacent neighbors $N_i(k)$ of the $i$th member at moment $k$ can be expressed as:

$$N_i(k) = \left\{\, j \in G \;:\; 0 < \left\|\boldsymbol{p}_j(k)-\boldsymbol{p}_i(k)\right\| \le r_{ij} \,\right\},$$

where $G$ indicates the communication topology network of the pursuit cluster, $\|\cdot\|$ denotes the Euclidean distance, and $r_{ij}$ is the maximum communication radius between the $i$th and $j$th members. It is noted that the cluster should eventually reach a consistent state when performing the tracking task, which means that the direct distance between individual members should be maintained as required. Therefore, based on the definition of asymptotic consistency of the discrete-time system, the following constraint must be satisfied for the cluster:

$$\lim_{k\to\infty}\left\|\boldsymbol{p}_j(k)-\boldsymbol{p}_i(k)\right\| = d, \qquad \lim_{k\to\infty}\left\|\boldsymbol{v}_j(k)-\boldsymbol{v}_i(k)\right\| = 0, \qquad j \in N_i(k),$$

where $d$ is the required distance between neighboring UAVs in the consistent steady state.
Thus, based on the consistency protocol, the control law $\boldsymbol{u}_i(k)$ for the $i$th member of the multi-UAV cluster can be composed of three components, as follows [32]:

$$\boldsymbol{u}_i(k) = \boldsymbol{u}_i^{1}(k) + \boldsymbol{u}_i^{2}(k) + \boldsymbol{u}_i^{3}(k),$$

where $\boldsymbol{u}_i^{1}(k)$ controls the safety distance among the cluster members, $\boldsymbol{u}_i^{2}(k)$ drives the cluster members toward a consistent speed, and $\boldsymbol{u}_i^{3}(k)$ drives the pursuit UAV to achieve the same speed as the target and maintain the round-up distance based on the Apollonian circle. The specific definitions of $\boldsymbol{u}_i^{1}(k)$, $\boldsymbol{u}_i^{2}(k)$, and $\boldsymbol{u}_i^{3}(k)$ are given in Equations (17)–(19).
In Equations (17)–(19), the coefficients $c_1$, $c_2$, and $c_3$ represent the control parameters, $N_i(k)$ indicates the set of neighbors in the communication adjacency network, and $a_{ij}$ denotes the corresponding element of the adjacency matrix. When another member appears within the detection range of the $i$th member, a communication path between the member and this neighbor is quickly established; in this case, the corresponding element of the adjacency matrix is $a_{ij}=1$; otherwise, $a_{ij}=0$. Additionally, $\bar{N}_i(k)$ represents the set composed of the neighbors in the communication adjacency network and the target; if some member has discovered the target, $b_i$ is set to 1; otherwise, $b_i=0$. The symbols $\boldsymbol{p}_i(k)$ and $\boldsymbol{v}_i(k)$ indicate the position and velocity coordinates of the $i$th member in the inertial coordinate system, the subscript $T$ denotes the target UAV, $d$ is the safe distance between the pursuing UAVs, and $d_c$ denotes the capture distance required in the round-up task.
From the descriptions of Equations (17)–(19), it is concluded that $\boldsymbol{u}_i^{1}(k)$ induces the separation of the members in the cluster so that the minimum distance between the members can be maintained; $\boldsymbol{u}_i^{2}(k)$ causes the speed alignment of the pursuit UAVs to maintain a consistent speed; and $\boldsymbol{u}_i^{3}(k)$ aligns the pursuit UAVs with the speed and relative distance of the target, realizing the round-up of the target.
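As a concrete illustration, the Python sketch below implements one common form of such a three-term law. The gains c1, c2, c3 and the exact expressions of the three terms are assumptions for this sketch and may differ from Equations (17)–(19).

```python
import numpy as np

def consistency_control(i, p, v, neighbors, a, b_i, p_T, v_T, d, d_c,
                        c1=1.0, c2=1.0, c3=1.0):
    """Three-term consistency control input for the i-th pursuit UAV.

    p, v: lists/arrays of member positions and velocities; neighbors: indices in N_i(k);
    a[i][j]: adjacency element (1 if j communicates with i, else 0); b_i: 1 if the
    cluster has discovered the target, else 0; p_T, v_T: target position and velocity;
    d: safe inter-UAV distance; d_c: capture distance; c1, c2, c3: control gains.
    """
    u1 = np.zeros(2)  # separation term: keep the safe distance d to neighbors
    u2 = np.zeros(2)  # alignment term: drive neighboring velocities to consistency
    for j in neighbors:
        r = p[j] - p[i]
        dist = np.linalg.norm(r) + 1e-9
        u1 += c1 * a[i][j] * (dist - d) * r / dist
        u2 += c2 * a[i][j] * (v[j] - v[i])
    # Tracking term: match the target's velocity and hold the capture distance d_c.
    r_T = p_T - p[i]
    dist_T = np.linalg.norm(r_T) + 1e-9
    u3 = c3 * b_i * ((dist_T - d_c) * r_T / dist_T + (v_T - v[i]))
    return u1 + u2 + u3
```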
3.2. Target Round-Up Based on the Apollonian Circle
When the leader enters the detectable area of the target, the cluster needs to surround the target and round it up. To achieve this goal, it is necessary to design a round-up route based on the Apollonian circle [33]. Rounding up the target with multiple UAVs ensures, to the greatest extent, that the target cannot escape after being tracked. In order to simplify the formation design process, it is assumed that the speeds of the UAVs do not change during the task.
The diagram of an Apollonian circle is drawn in Figure 4. Suppose that point $P$ is the position of a pursuit UAV with velocity $v_P$, and point $D$ is the position of the target with velocity $v_D$; then the speed ratio $k$ is expressed as follows:

$$k = \frac{v_D}{v_P}.$$

The circle shown in Figure 4 is the so-called Apollonian circle, where $O$ is the center and $r$ is the radius. The position of the center $O$ and the radius $r$ can be expressed as [34]:

$$\boldsymbol{p}_O = \frac{\boldsymbol{p}_D - k^2\,\boldsymbol{p}_P}{1-k^2}, \qquad r = \frac{k\,\lvert DP\rvert}{1-k^2},$$

where $\lvert DP\rvert$ represents the distance between point $D$ and point $P$, and $\boldsymbol{p}_P$ and $\boldsymbol{p}_D$ denote the position vectors of points $P$ and $D$.
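A minimal Python sketch of these relations is given below, assuming the standard vector form of the Apollonian circle as the locus of points C satisfying |CD| = k|CP|.

```python
import numpy as np

def apollonius_circle(p_P, v_P, p_D, v_D):
    """Center and radius of the Apollonian circle for a pursuer at p_P (speed v_P)
    and a target at p_D (speed v_D), with speed ratio k = v_D / v_P.
    The circle is the locus of points C satisfying |CD| = k * |CP|."""
    k = v_D / v_P
    center = (p_D - k**2 * p_P) / (1.0 - k**2)
    radius = k * np.linalg.norm(p_D - p_P) / abs(1.0 - k**2)
    return center, radius, k

# Example: pursuer at the origin at 2 m/s, target 10 m ahead at 1 m/s;
# the resulting circle encloses the target, so capture is always possible.
O, r, k = apollonius_circle(np.array([0.0, 0.0]), 2.0, np.array([10.0, 0.0]), 1.0)
```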
From Figure 4, it is seen that $C$ is an arbitrary point located on the Apollonian circle. Define the angle of the tangent line between the target and the Apollonian circle as $\theta$. In the case where the ratio of the target velocity to the pursuing UAV's velocity is $k$, the pursuit UAV will not be able to successfully pursue the target when the tangent angle is less than $\theta$, which can be expressed as follows:
It can be seen that when the angle is greater than $\theta$, the pursuit UAV can always find a heading that is able to catch the target.
Therefore, when multiple pursuit UAVs are employed, the cluster can form several Apollonian circles to surround the target, thus rounding up the target and preventing its escape. To achieve this goal, it is desirable that the target always remains within the Apollonian circles formed by all of the pursuing UAVs, as shown in Figure 5.
We use $D$ to represent the target to be rounded up and $O_i$ to represent the center of the Apollonian circle formed by the $i$th pursuit UAV and the target. The details of the formed Apollonian circle can be obtained based on Equations (20)–(23). In order to round up the target, it is necessary to design the desired position $P_i^{d}$ for each pursuit UAV. In this way, when the pursuit UAV can detect the target, it continuously flies toward $P_i^{d}$, thus completing the round-up of the target. The final formation of the round-up condition is shown in Figure 6.
In Figure 6, $C_{n-1,n}$ represents the tangent point formed by the $(n-1)$th and $n$th Apollonian circles. Denote the angle formed by the centers of any two adjacent Apollonian circles and the target $D$ as $\varphi$, and denote the angle formed by the center of any Apollonian circle, the target $D$, and the corresponding tangent point as $\psi$; then, it is seen that $\varphi = 2\psi$. Combining the geometric properties and the definition of an Apollonian circle, we can obtain the following relationships:
Based on the above designed formation, it can be seen that if the position of the leader $P_1^{d}$ is known, then the desired positions $P_i^{d}$ of the followers can be obtained as:
where $i = 2, \dots, n$, and $d_c$ is the capture distance required in the round-up task. To ensure that the formed Apollonian circles closely surround the target, the minimum distance $d_{\min}$ between neighboring pursuit UAVs can be set as follows:
Thus, the target is expected to be rounded up by the pursuit cluster, and this round-up strategy is used as the basic strategy for the multi-UAV cluster when the target can be detected and there are no obstacles nearby.
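For illustration only, the Python sketch below places the pursuit UAVs evenly around the target at the capture distance d_c, which is one simple way to realize the encirclement of Figure 6; the desired positions in this paper are derived from the tangency of adjacent Apollonian circles, so the exact expressions may differ.

```python
import numpy as np

def desired_roundup_positions(p_target, p_leader, n, d_c):
    """Desired pursuit positions spaced evenly around the target at distance d_c.

    Simplified placement: the leader's bearing from the target fixes the first slot,
    and the remaining n-1 followers fill the circle at equal angular spacing. The
    formation in the paper follows from tangent adjacent Apollonian circles, so the
    exact expressions may differ from this sketch.
    """
    dx, dy = p_leader - p_target
    bearing = np.arctan2(dy, dx)                       # leader's bearing seen from the target
    angles = bearing + 2.0 * np.pi * np.arange(n) / n  # one slot per pursuit UAV
    return p_target + d_c * np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Example: five pursuit UAVs around a target at the origin with capture distance 2 m.
slots = desired_roundup_positions(np.zeros(2), np.array([3.0, 0.0]), n=5, d_c=2.0)
```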
3.3. Target Tracking and Obstacle Avoidance Based on the Proposed Strategy
The whole process is shown in
Figure 7, where we use a circle to represent the obstacle and a star to represent the target. It is seen that in situations in which the leader is able to reach the detectable area of the target, the target can be rounded up based on the round-up route and consistency protocol provided in
Section 3.1 and
Section 3.2. However, when facing a complex environment in which the leader is unable to directly reach the detectable area of the target, or where certain obstacles exist, it is necessary to plan an optimal flight path for the leader. The optimization process is conducted based on the PPO algorithm. By using such a reinforcement learning method, the leader can be guided to reach the detectable area in an optimized way, thereby further completing the encirclement of the target by other followers.
The PPO algorithm consists of two types of networks: the actor network and the critic network. With an input containing the states of the leader, the target, and the obstacle, the actor network generates the corresponding policy and action, and the critic network generates the state value function. The whole diagram of the actor network and the critic network is shown in Figure 8.
In Figure 8, the input layer has six input nodes, $[x_l, y_l, \Delta x_t, \Delta y_t, \Delta x_o, \Delta y_o]$, where $[x_l, y_l]$ represents the position of the leader itself, $[\Delta x_t, \Delta y_t]$ represents the relative position of the target, and $[\Delta x_o, \Delta y_o]$ represents the relative position of the obstacle. The activation layers use ReLU functions. After that, there are two fully connected layers, each comprising 256 cells. The output layer of the actor network possesses two nodes, corresponding to the change in the horizontal coordinate $\Delta x$ and the change in the vertical coordinate $\Delta y$ of the leader. The output layer of the critic network is designed as one node, corresponding to the state value function.
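A PyTorch sketch consistent with this description is shown below; the Gaussian policy head with a learnable log standard deviation is an assumed design choice, since only the layer sizes and the two-node output are specified above.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: six state inputs -> Gaussian policy over (dx, dy) of the leader."""
    def __init__(self, state_dim=6, hidden=256, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )
        # Learnable log standard deviation (an assumed design choice).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mean = self.net(state)
        return torch.distributions.Normal(mean, self.log_std.exp())

class Critic(nn.Module):
    """Value network: six state inputs -> scalar state value V(s)."""
    def __init__(self, state_dim=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)
```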
The network’s PPO update process is shown in
Figure 9. The first step is to initialize the conditions of the target, leader, and obstacle, where their position and speed are randomly generated within a certain range. Then, the relative state
can be calculated and inputted into the PPO algorithm. Based on the policy network, the leader’s action
will be outputted and executed to the environment. After the interaction, the next state
and the reward
can be obtained. To repeat the above steps, the trace
can be obtained and then stored in the memory.
Based on the trajectory
from the memory, it is possible to obtain the state value function
. To ensure that the output of the critic
is close to
, the loss function for the critic network can be expressed as follows:
As for the actor network, the loss function is shown in Equation (9) in
Section 2.2. Through gradient descent, the parameters of the actor and critic networks can be updated, enabling the leader to approach the target and better avoid the obstacles.
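The PyTorch sketch below shows one such update step for the networks sketched above, combining a mean-squared value loss for the critic with the standard clipped surrogate objective for the actor; the clipping threshold and loss weighting are assumed hyperparameters, and Equation (9) itself is not reproduced here.

```python
import torch

def ppo_update(actor, critic, optimizer, states, actions, old_log_probs,
               returns, advantages, clip_eps=0.2, value_coef=0.5):
    """One PPO gradient step: clipped surrogate loss for the actor plus a
    mean-squared value loss for the critic (standard form; the paper's
    Equation (9) is not reproduced here)."""
    dist = actor(states)
    log_probs = dist.log_prob(actions).sum(-1)
    ratio = torch.exp(log_probs - old_log_probs)          # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    critic_loss = (critic(states) - returns).pow(2).mean()
    loss = actor_loss + value_coef * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return actor_loss.item(), critic_loss.item()

# A single optimizer over both networks' parameters can be used, e.g.
# optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)
```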