A Fast and Robust Algorithm with Reinforcement Learning for Large UAV Cluster Mission Planning
Abstract
1. Introduction
2. Mission Planning Problem Formulation
2.1. Mission and UAV Formulation
2.2. Multiple Objective Functions of Mission Planning
2.3. Constraint Conditions of Mission Planning
3. DIRL-SAM
3.1. Encoder with SAM
3.2. DIRL
Algorithm 1. Dynamical Information Reinforcement Learning
1: Input: Initialized actor–network and critic–network parameters
2: Output: The optimal actor–network parameters
3: for each epoch do
4: for each batch do
5: …
6: …
7: …
8: …
9: …
10: end for batch
11: if the paired t-test on the test batches is passed, then
12: …
13: end if
14: end for epoch
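The body of the loop (steps 5–9 and 12) is not reproduced above. Its structure is consistent with self-critical policy-gradient training against a greedy rollout baseline, as in the attention-model approach of Kool et al. cited below, with the critic network replaced only after a paired t-test confirms a significant improvement. The following is a minimal sketch under that assumption; the interfaces (`actor`, `critic`, `sample_batch`, `rollout`) and all hyperparameters are illustrative reconstructions, not the authors' exact procedure.

```python
import torch
from scipy import stats

def train_dirl(actor, critic, sample_batch, rollout,
               epochs=100, batches_per_epoch=50, test_batches=10,
               alpha=0.05, lr=1e-4):
    """Sketch of Algorithm 1. Assumed interfaces:
    actor(batch) -> (per-instance cost, summed log-prob) via sampling;
    rollout(net, batch) -> per-instance cost via greedy decoding."""
    opt = torch.optim.Adam(actor.parameters(), lr=lr)
    for epoch in range(epochs):
        for _ in range(batches_per_epoch):
            batch = sample_batch()                  # random planning instances
            cost, log_prob = actor(batch)           # stochastic rollout
            with torch.no_grad():
                baseline = rollout(critic, batch)   # greedy baseline rollout
            advantage = cost - baseline             # self-critical advantage
            loss = (advantage.detach() * log_prob).mean()  # REINFORCE objective
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Steps 11-13: replace the critic (baseline network) only when the
        # actor is significantly better on held-out test batches.
        actor_costs, critic_costs = [], []
        with torch.no_grad():
            for _ in range(test_batches):
                tb = sample_batch()
                actor_costs.append(rollout(actor, tb))
                critic_costs.append(rollout(critic, tb))
        a = torch.cat(actor_costs).cpu().numpy()
        b = torch.cat(critic_costs).cpu().numpy()
        t_stat, p_value = stats.ttest_rel(a, b)
        if t_stat < 0 and p_value / 2 < alpha:      # one-sided: actor cost lower
            critic.load_state_dict(actor.state_dict())
    return actor
```

Updating the baseline only on a significant one-sided t-test keeps the advantage estimate stable while still tracking the improving policy.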
4. Experimental Results
4.1. Experimental Settings
- 1. Parameter settings of the comparison optimization algorithms:
- 2. Parameter settings of the objective functions:
4.2. Simulation Experimental Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Fraser, B.T.; Congalton, R.G. Monitoring Fine-Scale Forest Health Using Unmanned Aerial Systems (UAS) Multispectral Models. Remote Sens. 2021, 13, 4873. [Google Scholar] [CrossRef]
- Kurdi, H.; AlDaood, M.F.; Al-Megren, S.; Aloboud, E.; Aldawood, A.S.; Youcef-Toumi, K. Adaptive Task Allocation for Multi-UAV Systems Based on Bacteria Foraging Behavior. Appl. Soft Comput. 2019, 83, 105643. [Google Scholar] [CrossRef]
- Wang, Y.; Ru, Z.Y.; Wang, K.Z.; Huang, P.Q. Joint Deployment and Task Scheduling Optimization for Large-Scale Mobile Users in Multi-UAV-Enabled Mobile Edge Computing. IEEE Trans. Cybern. 2020, 50, 3984–3997. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Shirani, B.; Najafi, M.; Izadi, I. Cooperative Load Transportation Using Multiple UAVs. Aerosp Sci Technol. 2019, 84, 158–169. [Google Scholar] [CrossRef]
- Sun, L.; Chen, J.; Feng, D.; Xing, M. Parallel Ensemble Deep Learning for Real-Time Remote Sensing Video Multi-Target Detection. Remote Sens. 2021, 13, 4377. [Google Scholar] [CrossRef]
- Milani, I.; Bongioanni, C.; Colone, F.; Lombardo, P. Fusing Measurements from Wi-Fi Emission-Based and Passive Radar Sensors for Short-Range Surveillance. Remote Sens. 2021, 13, 3556. [Google Scholar] [CrossRef]
- Wu, H.S.; Li, H.; Xiao, R.B.; Liu, J. Modeling and Simulation of Dynamic Ant Colony’s Labor Division for Task Allocation of UAV Swarm. Phys. A Stat. Mech. Appl. 2018, 491, 127–141. [Google Scholar] [CrossRef]
- Zhang, J.; Xing, J.H. Cooperative Task Assignment of Multi-UAV System. Chin. J. Aeronaut. 2020, 33, 2825–2827. [Google Scholar] [CrossRef]
- Li, Y.B.; Zhang, H.J.; Long, K.P. Joint Resource, Trajectory, and Artificial Noise Optimization in Secure Driven 3-D UAVs with NOMA and Imperfect CSI. IEEE J. Sel. Area Commun. 2021, 39, 3363–3377. [Google Scholar] [CrossRef]
- Wang, Y.; Wang, D.B.; Wang, J.H. A Convex Optimization Based Method for Multiple UAV Autonomous Formation Reconfiguration. Sci China Technol. Sci. 2017, 47, 249–258. [Google Scholar] [CrossRef]
- Mohr, H.; Schroeder, K.; Black, J. Distributed Source Seeking and Robust Obstacle Avoidance through Hybrid Gradient Descent. In Proceedings of the 2019 IEEE Aerospace Conference, Big Sky, MT, USA, 2–9 March 2019; pp. 1–8. [Google Scholar]
- Bagheri, S.M.; Taghaddos, H.; Mousaei, A.; Shahnavaz, F.; Hermann, U. An A-Star Algorithm for Semi-optimization of Crane Location and Configuration in Modular Construction. Automat. Constr. 2021, 121, 103447. [Google Scholar] [CrossRef]
- Martin, R.A.; Rojas, I.; Franke, K.; Hedengren, J.D. Evolutionary View Planning for Optimized UAV Terrain Modeling in a Simulated Environment. Remote Sens. 2016, 8, 26. [Google Scholar] [CrossRef] [Green Version]
- Huang, X.; Dong, X.; Ma, J.; Liu, K.; Ahmed, S.; Lin, J.; Qiu, B. The Improved A* Obstacle Avoidance Algorithm for the Plant Protection UAV with Millimeter Wave Radar and Monocular Camera Data Fusion. Remote Sens. 2021, 13, 3364. [Google Scholar] [CrossRef]
- Banerjee, B.P.; Raval, S. A Particle Swarm Optimization Based Approach to Pre-Tune Programmable Hyperspectral Sensors. Remote Sens. 2021, 13, 3295. [Google Scholar] [CrossRef]
- Alhaqbani, A.; Kurdi, H.; Youcef-Toumi, K. Fish-Inspired Task Allocation Algorithm for Multiple Unmanned Aerial Vehicles in Search and Rescue Missions. Remote Sens. 2021, 13, 27. [Google Scholar] [CrossRef]
- Zhen, Z.Y.; Xing, D.; Gao, C. Cooperative Search-Attack Mission Planning for Multi-UAV based on Intelligent Self-Organized Algorithm. Aerosp. Sci. Technol. 2018, 76, 402–411. [Google Scholar] [CrossRef]
- Vinyals, O.; Fortunato, M.; Jaitly, N. Pointer Networks. In Proceedings of the 2015 Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
- Meng, X.; Wu, L.; Yu, S. Research on Resource Allocation Method of Space Information Networks Based on Deep Reinforcement Learning. Remote Sens. 2019, 11, 448. [Google Scholar] [CrossRef] [Green Version]
- Nazari, M.; Oroojlooy, A.; Snyder, L.V.; Takac, M. Reinforcement Learning for Solving the Vehicle Routing Problem. In Proceedings of the 2018 Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 2–8 December 2018; Volume 31. [Google Scholar]
- Bello, I.; Pham, H.V.; Le, Q.V.; Norouzi, M.; Bengio, S. Neural Combinatorial Optimization with Reinforcement Learning. In Proceedings of the 2017 International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
- Kool, W.; van Hoof, H.; Welling, M. Attention, Learn to Solve Routing Problems! In Proceedings of the 2019 International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Huang, Y.; Mu, Z.; Wu, S.; Cui, B.; Duan, Y. Revising the Observation Satellite Scheduling Problem Based on Deep Reinforcement Learning. Remote Sens. 2021, 13, 2377. [Google Scholar] [CrossRef]
- Ying, C.S.; Chow, A.H.F.; Chin, K.S. An Actor-Critic Deep Reinforcement Learning Approach for Metro Train Scheduling with Rolling Stock Circulation under Stochastic Demand. Transp. Res. B Meth. 2020, 140, 210–235. [Google Scholar] [CrossRef]
- Yang, J.; You, X.H.; Wu, G.X.; Hassan, M.M.; Almogren, A.; Guna, J. Application of Reinforcement Learning in UAV Cluster Task Scheduling. Future Gener. Comput. Syst. 2019, 95, 140–148. [Google Scholar]
- Qie, H.; Shi, D.X.; Shen, T.L.; Xu, X.H.; Li, Y.; Wang, L.J. Joint Optimization of Multi-UAV Target Assignment and Path Planning Based on Multi-Agent Reinforcement Learning. IEEE Access 2019, 7, 146264–146272. [Google Scholar] [CrossRef]
- Wang, C.; Wu, L.Z.; Yan, C.; Wang, Z.C.; Long, H.; Yu, C. Coactive Design of Explainable Agent-Based Task Planning and Deep Reinforcement Learning for Human-UAVs Teamwork. Chin. J. Aeronaut. 2020, 33, 2930–2945. [Google Scholar] [CrossRef]
- Li, K.W.; Zhang, T.; Wang, R. Deep Reinforcement Learning for Multi-objective Optimization. IEEE Trans. Cybern. 2021, 51, 3103–3114. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Atencia, R.C.; Ser, J.D.; Camacho, D. Weighted Strategies to Guide a Multi-Objective Evolutionary Algorithm for Multi-UAV Mission Planning. Swarm Evol. Comput. 2019, 44, 480–495. [Google Scholar] [CrossRef]
- Zhen, Z.Y.; Chen, Y.; Wen, L.D.; Han, B. An Intelligent Cooperative Mission Planning Scheme of UAV Swarm in Uncertain Dynamic Environment. Aerosp. Sci. Technol. 2020, 100, 105826. [Google Scholar] [CrossRef]
- Zhao, X.Y.; Zong, Q.; Tian, B.L.; Zhang, B.Y.; You, M. Fast Task Allocation for Heterogeneous Unmanned Aerial Vehicles Through Reinforcement Learning. Aerosp. Sci. Technol. 2019, 92, 588–594. [Google Scholar] [CrossRef]
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 2015 International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015. [Google Scholar]
- He, K.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 2017 Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 3–9 December 2017. [Google Scholar]
- Hussain, R.; Karbhari, Y.; Ijaz, M.F.; Wozniak, M.; Singh, P.K.; Sarkar, R. Revise-Net: Exploiting Reverse Attention Mechanism for Salient Object Detection. Remote Sens. 2021, 13, 4941. [Google Scholar] [CrossRef]
- Guo, Q.P.; Qiu, X.P.; Liu, P.F. Star-Transformer. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), Minneapolis, MN, USA, 5–7 June 2019. [Google Scholar]
- Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-Critical Sequence Training for Image Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Proceedings of the 1999 Neural Information Processing Systems (NIPS), Denver, CO, USA, 29 November–4 December 1999. [Google Scholar]
- Zhou, N.; Lau, L.; Bai, R.; Moore, T. A Genetic Optimization Resampling Based Particle Filtering Algorithm for Indoor Target Tracking. Remote Sens. 2021, 13, 132. [Google Scholar] [CrossRef]
- Huang, J.; Xing, Y.; You, H.; Qin, L.; Tian, J.; Ma, J. Particle Swarm Optimization-Based Noise Filtering Algorithm for Photon Cloud Data in Forest Area. Remote Sens. 2019, 11, 980. [Google Scholar] [CrossRef] [Green Version]
Algorithms | Concept | Advantages | Limitations |
---|---|---|---|
Mathematical | Transform the problem into a mathematical programming model, then solve it with a dedicated method chosen for the model and objective function, e.g., gradient descent or dynamic programming. | High solution precision. Strong interpretability. | Poor generalization to complex problems, e.g., nonconvex models or multiple coupled objective functions. Time-consuming for large-scale problems. |
Heuristic | Design greedy optimization rules from the problem's structure and the researchers' experience and knowledge, then iterate a predefined search process to produce feasible solutions until convergence criteria are met. | Independent of the initial solution. Robust to difficult solution domains, e.g., non-differentiable or discontinuous objectives. Easy to implement. | Time-consuming, with limited performance on large-scale problems. |
Artificial intelligence | Use a deep neural network to extract high-dimensional features of the problem data and output solutions through a designed decoding policy; unsupervised training optimizes all model parameters. | Fast solving speed and good generalization on large-scale problems. | Expensive training process. Limited robustness on complex models, e.g., mixed continuous/discrete variables or complex multiple constraints. |
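As a concrete instance of the artificial-intelligence row, the sketch below pairs an attention encoder with a pointer-style decoder that maps instance features to assignment probabilities. The feature dimension, layer sizes, and module names are illustrative assumptions; the paper's SAM encoder is a specific variant not reproduced here.

```python
import torch
import torch.nn as nn

class AttentionEncoder(nn.Module):
    """Embed per-node features (UAVs/missions) with self-attention."""
    def __init__(self, in_dim=6, d_model=128, n_heads=8, n_layers=3):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                   # x: (batch, n_nodes, in_dim)
        return self.encoder(self.embed(x))  # (batch, n_nodes, d_model)

class PointerDecoder(nn.Module):
    """One decoding step: attend from a context vector to all node
    embeddings and return a masked probability over feasible nodes."""
    def __init__(self, d_model=128):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)

    def forward(self, node_emb, context, mask):
        # Scaled dot-product compatibility; infeasible nodes are masked out.
        scores = torch.einsum("bd,bnd->bn", self.query(context), node_emb)
        scores = scores / node_emb.size(-1) ** 0.5
        scores = scores.masked_fill(mask, float("-inf"))
        return torch.softmax(scores, dim=-1)
```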
GA Algorithm | Value |
---|---|
Population size | 100 |
Crossover probability | 0.9 |
Mutation probability | 0.1 |
Number of iterations | 3000 |
PSO Algorithm | Value |
---|---|
Population size | 100 |
Inertia weight | 0.8 |
Learning factor 1 | 1.5 |
Learning factor 2 | 1.5 |
Number of iterations | 3000 |
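For reference, the two baseline configurations above can be expressed directly in code; the snippet also shows the canonical PSO velocity update in which the inertia weight and the two learning factors enter. This is a generic sketch, not the comparison implementations used in the paper.

```python
import numpy as np

# Baseline settings from the two tables above.
GA_CFG = {"pop_size": 100, "p_crossover": 0.9, "p_mutation": 0.1, "iters": 3000}
PSO_CFG = {"pop_size": 100, "w": 0.8, "c1": 1.5, "c2": 1.5, "iters": 3000}

def pso_step(x, v, p_best, g_best, cfg=PSO_CFG):
    """Canonical PSO update: inertia term plus cognitive (c1) and social
    (c2) attraction toward the personal and global best positions."""
    r1 = np.random.rand(*x.shape)
    r2 = np.random.rand(*x.shape)
    v = cfg["w"] * v + cfg["c1"] * r1 * (p_best - x) + cfg["c2"] * r2 * (g_best - x)
    return x + v, v
```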
Training Data | Size |
---|---|
UAV training data | 100 |
Mission training data | 10 |
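The training set above fixes the instance size (100 UAVs, 10 missions per instance). A sketch of how such instances might be sampled is given below; the feature layout, value ranges, and the 5 × 5 mission area are assumptions inferred from the coordinates in the tables that follow, not the paper's generator.

```python
import numpy as np

def sample_instance(n_uavs=100, n_missions=10, area=5.0, seed=None):
    """Draw one random training instance (all distributions assumed)."""
    rng = np.random.default_rng(seed)
    uavs = {
        "pos": rng.uniform(0.0, area, size=(n_uavs, 2)),      # initial UAV positions
        "speed": rng.uniform(1.0, 3.0, size=n_uavs),          # cruise speeds
    }
    missions = {
        "pos": rng.uniform(0.0, area, size=(n_missions, 2)),  # mission locations
        "priority": rng.permutation(n_missions) + 1,          # priority ranks 1..n
        "demand": rng.integers(10, 21, size=n_missions),      # UAVs required
    }
    return uavs, missions
```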
Mission | Priority Rank |  |  |  |  |  |  |  |  |
---|---|---|---|---|---|---|---|---|---|
mission 1 | 1 | 1.6 | 2.4 | 20 | (1.0, 1.5) | 0.1 | 0.1 | 0.8 | 0.8 |
mission 2 | 2 | 2.7 | 2.7 | 20 | (4.0, 4.0) | 0.2 | 0.2 | 0.6 | 0.9 |
mission 3 | 3 | 1.4 | 1.4 | 10 | (4.0, 1.0) | 0.3 | 0.3 | 0.4 | 0.8 |
mission 4 | 4 | 2.8 | 2.8 | 10 | (2.5, 2.5) | 0.3 | 0.3 | 0.4 | 0.7 |
mission 5 | 5 | 1.8 | 2.7 | 10 | (1.0, 4.0) | 0.8 | 0.1 | 0.1 | 0.9 |
mission 6 | 6 | 2.7 | 1.8 | 10 | (3.0, 2.0) | 0.8 | 0.1 | 0.1 | 0.9 |
Algorithms | Mission 1 | Mission 2 | Mission 3 | Mission 4 | Mission 5 | Mission 6 | Total Value | Running Time |
---|---|---|---|---|---|---|---|---|
GA | 2.53 | 3.82 | 2.98 | 11.25 | 5.23 | 5.07 | 30.88 | 33.98 s |
PSO | 2.99 | 3.84 | 3.28 | 11.62 | 5.36 | 5.12 | 32.21 | 45.32 s |
RL-SAM | 10.54 | 8.46 | 4.95 | 13.52 | 5.78 | 7.37 | 50.62 | 2.10 s |
DIRL-AM | 2.38 | 3.82 | 2.75 | 11.07 | 4.99 | 4.83 | 29.84 | 3.06 s |
DIRL-SAM | 2.38 | 3.82 | 2.75 | 11.07 | 5.09 | 4.90 | 30.01 | 1.98 s |
Algorithms | Objective Value (150 UAVs) | Running Time (150 UAVs) | Objective Value (200 UAVs) | Running Time (200 UAVs) |
---|---|---|---|---|
GA | 6.51 | 4.43 s | 8.32 | 6.31 s |
PSO | 6.67 | 6.38 s | 8.85 | 9.53 s |
RL-SAM | 49.05 | 0.39 s | 67.43 | 0.48 s |
DIRL-AM | 5.57 | 0.63 s | 4.93 | 0.94 s |
DIRL-SAM | 5.57 | 0.41 s | 5.07 | 0.57 s |
Algorithms | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Test 6 |
---|---|---|---|---|---|---|
Exhaustive method | 8.85/40.1 s | 7.70/84.51 s | 13.55/168.59 s | 4.83/341.24 s | 5.45/682.24 s | 6.34/1676.28 s |
GA | 8.85/1.97 s | 7.70/1.62 s | 13.55/1.96 s | 4.83/2.01 s | 5.45/1.82 s | 6.34/1.71 s |
PSO | 8.85/2.13 s | 7.70/2.18 s | 13.55/2.23 s | 4.83/2.28 s | 5.45/2.31 s | 6.34/2.53 s |
DIRL-SAM | 9.08/0.32 s | 7.76/0.24 s | 13.58/0.35 s | 4.83/0.19 s | 5.46/0.28 s | 6.34/0.33 s |
Random | 19.21/0 s | 18.32/0 s | 21.53/0 s | 18.15/0 s | 14.64/0 s | 17.66/0 s |
Test | Number of UAVs |  |  |  |  |  |  |  |
---|---|---|---|---|---|---|---|---|
Test 1 | 20 | 3.4 | 4.1 | (4.55, 0.30) | 0.5 | 0.3 | 0.2 | 1 |
Test 2 | 21 | 2.5 | 4.8 | (3.80, 2.40) | 0.2 | 0.4 | 0.3 | 1 |
Test 3 | 22 | 4.0 | 5.3 | (3.85, 4.95) | 0.2 | 0.3 | 0.5 | 1 |
Test 4 | 23 | 2.4 | 2.2 | (1.40, 1.90) | 0.5 | 0.1 | 0.4 | 1 |
Test 5 | 24 | 3.1 | 2.5 | (2.75, 1.80) | 0.5 | 0.4 | 0.1 | 1 |
Test 6 | 25 | 1.6 | 5.4 | (3.50, 3.90) | 0.1 | 0.3 | 0.6 | 1 |