Article

Globally Guided Deep V-Network-Based Motion Planning Algorithm for Fixed-Wing Unmanned Aerial Vehicles

1 Sino-European Institute of Aviation Engineering, Civil Aviation University of China, Tianjin 300300, China
2 Shenyang Aircraft Design and Research Institute, Shenyang 110035, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(12), 3984; https://doi.org/10.3390/s24123984
Submission received: 28 April 2024 / Revised: 24 May 2024 / Accepted: 11 June 2024 / Published: 19 June 2024
(This article belongs to the Section Remote Sensors)

Abstract

Fixed-wing UAVs have shown great potential in both military and civilian applications. However, achieving safe, collision-free flight in complex obstacle environments remains a challenging problem. This paper proposes a hierarchical two-layer motion planning algorithm for fixed-wing UAVs, built on a global planner and a local reinforcement learning (RL) planner, for environments containing static obstacles and other UAVs. Considering the kinematic constraints, the global planner provides reference guidance for the ego-UAV with respect to static obstacles. On this basis, a local RL planner is designed to accomplish kinodynamically feasible, collision-free motion planning that accounts for dynamic obstacles within the sensing range. Finally, a multi-stage, multi-scenario training strategy is adopted in the simulation training phase, and the simulation results show that the proposed algorithm significantly outperforms the baseline method.

1. Introduction

With the development and wide application of UAV technology, fixed-wing UAVs have gradually become important tools in many fields due to their high maneuverability and survivability [1,2,3]. Unlike quadcopters, fixed-wing UAVs have huge advantages in endurance, flight speed, payload capacity, etc. [4,5], resulting in great potential in both civil and military applications [6]. Motion planning, as one of the key problems in flight control, is crucial for realizing safe, efficient, and autonomous flight missions. Collision avoidance is a key point in motion planning [7,8]. However, the flight characteristics of fixed-wing UAVs are very complex due to kinematic constraints [9]. Therefore, the goal of this paper is to learn a collision-free, safe, and fast motion planning method for fixed-wing UAVs that satisfies the kinematic constraints.
Unlike the path planning problem, kinodynamic motion planning for fixed-wing UAVs requires consideration of the kinematic equations, kinematic constraints (e.g., obstacle avoidance), and kinodynamic constraints (e.g., velocity, acceleration, and turning radius constraints) in order to successfully reach the target location [10]. Although traditional motion planning algorithms become computationally impractical in extremely dense scenarios, model simplifications allow solvers to achieve real-time motion planning in many situations of interest.
A key challenge is to consider the second-order kinematic model and the corresponding constraints of fixed-wing UAVs while achieving safe and fast motion planning in environments with both dynamic and static obstacles. The problem is straightforward in a static environment, or when first-order kinematic equations or a simplified dynamics module are used, but the combination of these factors makes it difficult to solve. Although existing studies have achieved some results in UAV path planning, it remains a worthwhile research problem [11,12].
In this work, we propose a hybrid A* deep V-network (HADVN) algorithm to realize collision-free motion planning for fixed-wing UAVs. In the proposed algorithm, a hierarchical two-layer motion planning framework is designed. First, considering the kinematic constraints of fixed-wing UAVs, such as the turning radius and the speed range, a global planner is designed based on a modified hybrid A* search, providing reference guidance with respect to static obstacles. With the help of the global guidance, a local RL planner is designed to accomplish kinodynamically feasible, collision-free motion planning that incorporates dynamic obstacles within the sensing range. Specifically, to improve the network generalization, a deep V-network with an attention mechanism is designed in the local RL planner, and a multi-stage, multi-scenario training strategy is adopted in the training process. The contributions of this work are as follows:
  • Unlike the literature [13,14,15,16,17,18,19], the method proposed in this paper can realize safe and fast path planning while satisfying the second-order kinematics model and constraints of fixed-wing UAVs;
  • For the motion planning problem studied in this paper, a deep V-network based on the attention mechanism is adopted with a multi-stage, multi-scenario training strategy to improve training efficiency and network generalization. The effectiveness of the algorithm is verified by comparison simulation experiments.

2. Related Works

(1)
Traditional motion planning algorithms for fixed-wing UAVs: Existing work on the navigation problem for fixed-wing UAVs in cluttered environments often focuses on traditional algorithms. The A* algorithm is employed to generate planning paths in [13,14]. The authors in [15,16] proposed improved potential field methods to help UAVs avoid obstacles in local path planning. However, the above papers only consider the UAV as a mass point and neglect its kinematic model. The authors in [17] modeled a fixed-wing UAV as a first-order equation and defined a safe flight corridor based on its maneuvering characteristics and the Dubins curve to achieve obstacle avoidance. The authors in [20] established a multi-objective optimization model and proposed an improved particle swarm optimization algorithm to solve the three-dimensional path planning problem in complex environments. The work in [21] combined the ant colony algorithm, the self-organizing mapping algorithm, and the optimal Dubins trajectory to achieve task assignment and path planning for multiple UAVs. Considering the kinematic model of fixed-wing UAVs as a second-order system, Ref. [22] proposed an improved RRT* algorithm to realize 3D path planning.
(2)
Deep neural network-based learning methods in the field of robotics: In recent years, with the increase in computational power, deep neural network-based learning methods have shown great potential in dealing with high-dimensional and complex environmental states [23,24], especially in the field of robotics, such as robotic arms [25], wheeled and multi-legged robots [26,27,28], etc., which have been applied to some experimental real-world scenarios. For example, in [29], the authors proposed an imitation learning approach based on perceptual information to realize motion planning for UAVs, and Ref. [30] used deep learning to achieve effective 3D exploration of a vehicle in an indoor environment.
(3)
Deep reinforcement learning (DRL)-based motion planning for fixed-wing UAVs: Unlike the supervised learning methods mentioned above, the RL method is applicable to sequential decision control problems, in which agents learn to explore strategies that maximize reward by interacting with the environment. Under the assumption that the UAV flies at a fixed altitude, Li et al. [18] designed the action space to satisfy first-order fixed-wing UAV kinematic constraints and implemented a deep Q-network to generate a safe path for a fixed-wing UAV. Based on the same model, Wu et al. [19] proposed a multi-critic delayed deep deterministic policy gradient method. Moreover, Li et al. [31] considered the second-order kinematic constraints and designed a deep deterministic policy gradient algorithm to enable UAVs to complete path planning in a 3D environment with only static obstacles; Yan et al. [32] established a posture assessment model to evaluate the collision risk between followers for the case with only neighboring aircraft, and designed a DRL algorithm to implement a collision-free strategy for fixed-wing UAVs.

3. Preliminaries

3.1. Problem Formulation

Consider a scenario where a fixed-wing UAV must navigate from an initial position $p_0$ to a goal position $p_g$ in $\mathbb{R}^3$, surrounded by a number of non-communicating UAVs and static obstacles. The control state of the fixed-wing UAV at time $t$ is represented by $\varsigma_t$ (defined in Section 3.2), and $u_t$ is the control input (defined in Section 4.3.1). The next state $\varsigma_{t+1}$ is obtained through $f(\varsigma_t, u_t)$, where $f$ is the kinematic model of the fixed-wing UAV (defined in Section 3.2). $\mathcal{O}_t$ denotes the area occupied by the fixed-wing UAV, $\mathcal{O}_s$ the area occupied by static obstacles, and $\mathcal{O}_t^j$ the area occupied by each surrounding UAV. We aim to drive the ego-UAV, with the kinematic model $f$, to the goal position $p_g$ under the collision avoidance constraints and the boundary constraints on the control state and control input, as shown in (1a), (1c), and (1d).
We then cast this problem in an RL framework. At each time step $t$, the ego-UAV first obtains its state $s_t$ (defined in Section 4.3.1), the set of the neighboring UAVs' states $S_t = \{ s_t^i \}_{i \in \{1, \dots, N\}}$ within the local FOV, and a local segment of the global guidance $G^*$, denoted $G_t^*$. It then takes an action $a_t$, leading to the immediate reward $R(s_t, a_t)$ (defined in Section 4.3.1) and the next state $s_{t+1} = h(s_t, a_t)$ under the transition model $h$. We use the superscript $j \in \{1, \dots, N\}$ to denote the $j$-th nearby UAV within the local FOV and omit the superscript when referring to the ego-UAV.
With the above definitions, the objective is transformed into learning a policy $\pi$ for the ego-UAV that minimizes the time to goal while avoiding conflicts with neighboring UAVs and static obstacles, defined as
$$\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t \cdot v_{\max}} R\big(s_t, \pi(s_t, S_t)\big)\right]$$
$$\text{s.t.} \quad \varsigma_{t+1} = f(\varsigma_t, u_t) \tag{1a}$$
$$s_T = goal \tag{1b}$$
$$\mathcal{O}_t(\varsigma_t) \cap \mathcal{O}_t^j = \emptyset; \quad \mathcal{O}_t(\varsigma_t) \cap \mathcal{O}_s = \emptyset \tag{1c}$$
$$u_t \in U, \quad s_t \in S, \quad \varsigma_t \in \Theta \tag{1d}$$
$$\forall t \in [0, T], \quad j \in \{1, \dots, n\}$$
where (1a) represents the transition dynamics constraint arising from the kinematic model of the fixed-wing UAV, and (1b) is the terminal constraint. Note that the $goal$ in (1b) is not exactly the goal position $p_g$, since $s_T$ is defined as the terminal state of $s_t$, which will be given in Section 4.3.1. (1c) is the collision avoidance constraint, and $S$, $U$, and $\Theta$ in (1d) are the sets of admissible states, inputs, and control states, respectively. It is assumed that the fixed-wing UAV can obtain its current position and velocity (e.g., with on-board sensing) and knows the information of all static obstacles, so it can calculate the global guidance at the start of each run. However, the relative position and velocity of other UAVs can only be observed when they are within its local FOV. Moreover, we assume that the other UAVs follow the policy generated by the optimal reciprocal collision avoidance (ORCA) algorithm [33].

3.2. Dynamics of Fixed-Wing UAV

At time step $t$, the control state of a fixed-wing UAV is represented by the tuple $\varsigma_t = (x_t, y_t, z_t, v_t, \theta_t, v_{z_t})$, where $(x_t, y_t, z_t)$ denotes the position in 3D space, $v_t$ and $v_{z_t}$ are the horizontal and vertical velocities relative to an inertial frame, and $\theta_t$ denotes the heading angle.
To simplify the planning process, we decouple the horizontal and vertical motion directions of the fixed-wing UAV; then, the kinematic model of the fixed-wing UAV can be expressed as (see (1a))
$$\frac{d}{dt}\begin{bmatrix} x_t \\ y_t \\ z_t \\ v_t \\ \theta_t \end{bmatrix} = \begin{bmatrix} v_t \cos\theta_t \\ v_t \sin\theta_t \\ v_{z_t} \\ f_v(v_t, u_t) \\ f_\theta(\theta_t, \omega_t) \end{bmatrix} \tag{2}$$
$$\begin{aligned} f_v(v_t, u_t) &= \mathrm{clip}(v_t + u_t \cdot dt, \; v_{\min}, \; v_{\max}) \\ f_\theta(\theta_t, \omega_t) &= \theta_t + \omega_t \cdot dt \\ u_t &\in [u_{\min}, u_{\max}], \quad \omega_t \in [\omega_{\min}, \omega_{\max}], \quad v_{z_t} \in [v_{z\min}, v_{z\max}] \end{aligned} \tag{3}$$
where $u_t$ and $\omega_t$ are the control commands.
For fixed-wing UAVs, a certain relative velocity with respect to the air is necessary to generate enough lift to balance gravity. Moreover, the minimum horizontal turning radius is another limitation on fixed-wing UAV flight, which constrains the heading angular velocity. Accordingly, three vehicle kinematic constraints are considered in this planning stage: horizontal and vertical velocity, horizontal acceleration, and heading angular velocity, as given in Equation (3).
Similarly, for each agent $j \in \{1, \dots, n\}$, $p_t^j = (x_t^j, y_t^j, z_t^j)^T \in \mathbb{R}^3$ denotes its position in 3D space, $v_t^j \in \mathbb{R}^2$ and $v_{z_t}^j \in \mathbb{R}$ denote its horizontal and vertical velocities at step $t$ relative to an inertial frame, and $r_j$ is the minimum radius of the ball that can completely encompass the $j$-th UAV.
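For illustration, the following Python sketch integrates the decoupled model in Equations (2) and (3) for one time step. It is a minimal sketch: the numerical bounds and the time step are placeholder assumptions, not the values used in our experiments.

```python
import math

def clip(x, lo, hi):
    """Saturate x to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def step_dynamics(state, u, omega, vz, dt=0.25,
                  v_bounds=(10.0, 30.0),     # assumed horizontal speed limits [m/s]
                  u_bounds=(-2.0, 2.0),      # assumed acceleration limits [m/s^2]
                  omega_bounds=(-0.3, 0.3),  # assumed heading-rate limits [rad/s]
                  vz_bounds=(-3.0, 3.0)):    # assumed vertical speed limits [m/s]
    """One Euler step of the decoupled fixed-wing model in Equations (2) and (3).

    state = (x, y, z, v, theta); u, omega, vz are the control commands.
    """
    x, y, z, v, theta = state
    # Saturate the commands to respect the kinematic constraints of Equation (3).
    u = clip(u, *u_bounds)
    omega = clip(omega, *omega_bounds)
    vz = clip(vz, *vz_bounds)
    # Horizontal position advances with the current speed and heading; altitude with vz.
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    z += vz * dt
    # Speed and heading follow f_v and f_theta.
    v = clip(v + u * dt, *v_bounds)
    theta += omega * dt
    return (x, y, z, v, theta)
```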

4. Approach

In this section, we first describe the overall system structure, and then present details of our approach.

4.1. Systems Description

Our goal is to learn a kinodynamically feasible and collision-free navigation policy via reinforcement learning, relying on a local segment of the global path and information within the sensing range as input, as shown in Figure 1. First, in the environment awareness layer, the UAV obtains its own state and other UAVs' information through sensors. Second, in the global planning layer, the global planner, which knows the information of all static obstacles, provides global guidance for the ego-UAV. Then, in the local planning layer, a local RL planner accomplishes kinodynamically feasible, collision-free motion planning with the help of the global guidance and the dynamic information within the sensing range. Finally, in the execution layer, the ego-UAV executes the control inputs and interacts with the environment.
It is assumed that the UAV knows the information of all the static obstacles and calculates the global guidance at the start of each run and whenever the scene changes. However, the trajectories of the other UAVs are unknown, and the ego-UAV can only obtain their current information when they are within its local field of view.
To that end, we adopt a hierarchical two-layer motion planning framework to tackle the navigation problem. The global planner is designed based on a modified hybrid A* search, providing reference guidance for the ego-UAV with respect to static obstacles (Section 4.2). With the help of the global guidance, a local RL planner is designed to accomplish kinodynamically feasible, collision-free motion planning that incorporates dynamic obstacles within the sensing range (Section 4.3).

4.2. Global Guidance Planner

The main role of the global guidance is to provide long-term global information and a safe reference path with respect to static obstacles. With the help of the global guidance, the proposed local RL planner is able to avoid potential collisions and unnecessary detours by exploiting only the sensing information within a local area (e.g., the field of view).
Here, we employ a modified hybrid A* search to generate a safe and kinodynamically feasible global path. Instead of straight-line segments, motion primitives respecting the fixed-wing UAV dynamics described in Equations (2) and (3) are used as edges connecting two nodes. In contrast to the standard hybrid A* search, the modifications lie in three aspects.

4.2.1. Primitive Generation

In contrast to [13,14,15,16,17,18,19], the fixed-wing UAV dynamics corresponding to the second-order system in Equation (2) and the kinematic constraints in Equation (3) are considered. To achieve this, the current heading angle and horizontal velocity are added to the node information; that is, the current node is defined as $P_k = (x_k, y_k, \theta_k, v_k, P_{k-1})$, where $P_{k-1}$ is the parent node. To expand the neighboring nodes, 11 equally spaced points within $[u_{\min}, u_{\max}]$ and 7 equally spaced points within $[\omega_{\min}, \omega_{\max}]$ are selected, yielding $n_i$ different combinations of action values. Then, the neighboring node $P_{k+1}^i = (x_{k+1}^i, y_{k+1}^i, \theta_{k+1}^i, v_{k+1}^i, P_k)$ of $P_k$ is calculated as
$$\begin{aligned} x_{k+1}^i &= x_k + v_{k+1}^i \cos\theta_{k+1}^i \, \Delta t \\ y_{k+1}^i &= y_k + v_{k+1}^i \sin\theta_{k+1}^i \, \Delta t \\ v_{k+1}^i &= \mathrm{clip}\big(v_k + u_{k+1}^i \Delta t, \; v_{\min}, \; v_{\max}\big) \\ \theta_{k+1}^i &= \theta_k + \omega_{k+1}^i \Delta t \end{aligned} \tag{4}$$
where $i = 1, \dots, n_i$, and $n_i$ is the maximum number of neighboring nodes.
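The node expansion of Equation (4) can be sketched as follows. This is a minimal sketch: the node layout, time step, and bound values are illustrative assumptions, and collision checking and cost bookkeeping are omitted.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """Search node P_k = (x, y, theta, v, parent)."""
    x: float
    y: float
    theta: float
    v: float
    parent: Optional["Node"] = None
    g: float = float("inf")  # cost-to-come, maintained by the search

def expand(node, dt=1.0,
           u_bounds=(-2.0, 2.0), omega_bounds=(-0.3, 0.3),  # assumed command limits
           n_u=11, n_omega=7,                               # discretization from the text
           v_bounds=(10.0, 30.0)):                          # assumed speed limits
    """Generate the neighboring nodes of Equation (4) from the n_u x n_omega
    discretized (u, omega) combinations."""
    u_min, u_max = u_bounds
    w_min, w_max = omega_bounds
    us = [u_min + i * (u_max - u_min) / (n_u - 1) for i in range(n_u)]
    ws = [w_min + i * (w_max - w_min) / (n_omega - 1) for i in range(n_omega)]
    neighbors = []
    for u in us:
        for w in ws:
            v = min(max(node.v + u * dt, v_bounds[0]), v_bounds[1])
            theta = node.theta + w * dt
            x = node.x + v * math.cos(theta) * dt
            y = node.y + v * math.sin(theta) * dt
            neighbors.append(Node(x, y, theta, v, parent=node))
    return neighbors
```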

4.2.2. Analytic Expansion

Inspired by [34], an analytic expansion scheme is introduced to deal with the difficulty of having a primitive end exactly at the goal and to speed up the search. To achieve this, Dubins curves [35,36], which are widely used in path planning for vehicles subject to turning radius constraints such as fixed-wing UAVs and underwater robots, are incorporated into the hybrid A* algorithm. Specifically, for the current node $P_k$, the trajectory from $P_k$ to $P_g$ that is kinodynamically feasible and of minimal length is computed under the Dubins curve assumption. If it satisfies the collision-free condition, the search is terminated in advance. This strategy greatly improves the efficiency of generating the global guidance path, since it reduces the complexity of the search space and terminates the search earlier.

4.2.3. Cost Function

Similar to the classical A* algorithm, the notation $g$ is used to represent the actual cost from the initial node $P_0$ to the current node $P_k$. As we aim to find a path that is short and sufficiently smooth, $g$ is calculated as
$$g = g(P_k) + g_{k+1,u,\omega}^i, \quad k = 0, 1, \dots, \qquad g_{k+1,u,\omega}^i = v_{k+1}^i \cdot \Delta t + \alpha_u \big| u_{k+1} - u_k \big| + \alpha_\omega \big| \omega_{k+1} - \omega_k \big| \tag{5}$$
In addition, the heuristic cost $h$ from $P_k$ to $P_g$ is calculated based on the length of the Dubins curve $l_{d_{kg}}$ designed in the previous section, i.e., $h_{k,u,\omega}^i = l_{d_{kg}}$.
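A minimal sketch of the edge cost in Equation (5) and the Dubins-based heuristic is given below. The weights alpha_u = alpha_omega = 0.01 follow Table 1, while the use of the third-party `dubins` package is an assumption standing in for any Dubins shortest-path routine.

```python
import dubins  # third-party package (pip install dubins), assumed as a stand-in
               # for any Dubins shortest-path routine

def edge_cost(v_next, u_prev, u_next, omega_prev, omega_next,
              dt, alpha_u=0.01, alpha_omega=0.01):
    """Incremental cost of Equation (5): travelled distance plus smoothness
    penalties on changes in acceleration and heading rate."""
    return (v_next * dt
            + alpha_u * abs(u_next - u_prev)
            + alpha_omega * abs(omega_next - omega_prev))

def heuristic(q_node, q_goal, turn_radius):
    """Heuristic h: length of the Dubins curve from the node to the goal.
    q_node and q_goal are (x, y, theta) configurations."""
    path = dubins.shortest_path(q_node, q_goal, turn_radius)
    return path.path_length()
```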
In summary, the flow of the hybrid A* algorithm is shown in Algorithm 1, where $P$ and $C$ refer to the open and closed lists, respectively.

4.3. Local RL Planner

In this part, we aim to develop a decision-making algorithm that provides an estimate of the cost-to-go in dynamic environments with moving UAVs, which can inform actions leading to higher rewards.

4.3.1. RL Formulation

For an ego-UAV, the observation vector is composed of three parts: its own state, the surrounding UAVs' states, and a local segment of the global guidance, defined as
$$o_t = \big[ s_t, S_t, G_t^* \big]$$
where $s_t$ is the ego-UAV state, $S_t$ is the aggregation of the neighboring UAVs' states, $N$ is the total number of surrounding UAVs within the sensing range, and $G_t^*$ is the global guidance provided by the global planner in Section 4.2. Specifically, each UAV's state is parametrized as
$$\begin{aligned} s_t &= [d_{g_t}, h_t, v_t, \theta_t, v_{z_t}, r] && \text{(ego-UAV)} \\ s_t^j &= [\bar{p}_{j_t}, \bar{v}_{j_t}, r_j, \bar{d}_{j_t}, r_j + r] && \text{(other UAVs, labeled by } j, \; j = 1, \dots, N\text{)} \end{aligned}$$
where $d_{g_t}$ and $h_t$ denote the horizontal and vertical distances to the goal, respectively; $\bar{p}_{j_t}$, $\bar{v}_{j_t}$, and $\bar{d}_{j_t}$ represent the relative position, velocity, and distance of the $j$-th UAV with respect to the ego-UAV; and $r$ denotes the minimum radius (size) of the ball that can encompass the ego-UAV.
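The following sketch assembles the ego and neighbor state vectors defined above; the argument layout and the use of 3D relative velocities are illustrative assumptions.

```python
import numpy as np

def ego_state(pos, goal, v, theta, vz, radius):
    """Ego-UAV state [d_g, h, v, theta, v_z, r]: horizontal and vertical
    distances to the goal, speeds, heading, and own radius."""
    d_g = np.linalg.norm(np.asarray(goal[:2]) - np.asarray(pos[:2]))
    h = goal[2] - pos[2]
    return np.array([d_g, h, v, theta, vz, radius])

def neighbor_state(ego_pos, ego_vel, ego_radius, j_pos, j_vel, j_radius):
    """State of neighboring UAV j: relative position, relative velocity,
    its radius, relative distance, and the sum of radii."""
    rel_p = np.asarray(j_pos, dtype=float) - np.asarray(ego_pos, dtype=float)
    rel_v = np.asarray(j_vel, dtype=float) - np.asarray(ego_vel, dtype=float)
    d = np.linalg.norm(rel_p)
    return np.concatenate([rel_p, rel_v, [j_radius, d, j_radius + ego_radius]])
```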
Here, we seek to learn the optimal policy $\pi: o_t \mapsto a_t$ for the ego-agent, mapping the ego-agent's observation of the environment to a probability distribution over actions. According to (1), we consider a continuous action space $a_t \in \mathbb{R}^3$ and define the action vector as
$$a_t = [u_t, \omega_t, v_{z_t}]$$
where $u_t \in [u_{\min}, u_{\max}]$, $\omega_t \in [\omega_{\min}, \omega_{\max}]$, and $v_{z_t} \in [v_{z\min}, v_{z\max}]$.
For simplicity, it is assumed that the UAV flies at the maximum vertical speed when it is far from the goal altitude. In other words, $v_{z_t}$ is updated as follows:
$$v_{z_t} = \begin{cases} v_{z\max}, & h_t > v_{z\max} \cdot \Delta t \\ h_t / \Delta t, & v_{z\min} \cdot \Delta t \le h_t \le v_{z\max} \cdot \Delta t \\ v_{z\min}, & h_t < v_{z\min} \cdot \Delta t \end{cases}$$
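This vertical speed rule can be written as a small helper; the bound values below are placeholders rather than the values used in the experiments.

```python
def vertical_speed(h, dt=0.25, vz_min=-3.0, vz_max=3.0):
    """Vertical speed rule above: fly at the saturated vertical speed when far
    from the goal altitude, otherwise close the remaining gap h in one step."""
    if h > vz_max * dt:
        return vz_max
    if h < vz_min * dt:
        return vz_min
    return h / dt
```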
For this task, a compound reward function is designed to feed back signals for multiple sub-objectives and to generate dense rewards that promote convergence. More specifically, at each step, the reward is given by
$$R_t = R_1(t) + R_2(t) + R_3(t)$$
where $R_1(t)$, $R_2(t)$, and $R_3(t)$ are designed for collision avoidance, goal arrival, and global information awareness, respectively.
Firstly, the collision avoidance reward $R_1(t)$ is obtained by
$$R_1(t) = \begin{cases} r_{collision}, & \bar{d}_t \le 0 \\ r_{collision} + \alpha_1 \cdot \bar{d}_t, & 0 < \bar{d}_t \le d_{safe} \\ 0, & \bar{d}_t > d_{safe} \end{cases}$$
where $r_{collision}$ is a large negative penalty for collision, $\bar{d}_t$ is the real-time distance to the nearest obstacle within the sensing range, and $\alpha_1$ is a small positive number. The second case keeps the UAV alert to the nearest obstacle, assigning a penalty whenever $\bar{d}_t$ falls below the predefined threshold $d_{safe}$.
Secondly, the goal arrival reward $R_2(t)$ is computed by
$$R_2(t) = \begin{cases} r_{success}, & d_{g_t} \le d_{\min} \\ \alpha_2 \big( d_{g_{t-1}} - d_{g_t} \big), & d_{g_t} > d_{\min} \\ r_{fail}, & t \ge t_{\max} \end{cases}$$
where $r_{success}$ is a large positive reward for arrival; $\alpha_2$ is a small positive number used to motivate the UAV to move towards the goal while navigating; $r_{fail}$ is the penalty applied when the maximum time $t_{\max}$ is exceeded; and $d_{\min}$ is the Euclidean distance to the goal within which the UAV is considered to have reached it, since a fixed-wing UAV may not be able to reach the goal point exactly.
Thirdly, to encourage the UAV to take the global information into account while not following the global guidance strictly, $R_3(t)$ is computed by
$$R_3(t) = \begin{cases} 0, & d_{a_t} \le d_{a\max} \\ \alpha_3 \cdot \big( d_{a\max} - d_{a_t} \big), & d_{a_t} > d_{a\max} \end{cases}$$
where $d_{a_t}$ denotes the distance between the UAV and the global guidance path at time $t$, $\alpha_3$ is a small positive number, and $d_{a\max}$ is the predefined allowable deviation distance.
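Putting the three terms together, a sketch of the compound reward is shown below. The constants follow Table 1; the progress term alpha_2 (d_goal_prev - d_goal) reflects our reading of the goal arrival reward above, and the helper signature is an assumption.

```python
def reward(d_bar, d_goal, d_goal_prev, d_dev, reached, timed_out,
           r_collision=-0.25, r_success=1.0, r_fail=-0.1,
           alpha1=0.05, alpha2=0.1, alpha3=0.01,
           d_safe=5.0, d_a_max=10.0):
    """Compound reward R = R1 + R2 + R3; constants follow Table 1.
    d_bar: distance to the nearest obstacle; d_goal, d_goal_prev: current and
    previous distances to the goal; d_dev: deviation from the global guidance."""
    # R1: collision avoidance.
    if d_bar <= 0.0:
        r1 = r_collision
    elif d_bar <= d_safe:
        r1 = r_collision + alpha1 * d_bar
    else:
        r1 = 0.0
    # R2: goal arrival, time-out penalty, and progress towards the goal.
    if reached:
        r2 = r_success
    elif timed_out:
        r2 = r_fail
    else:
        r2 = alpha2 * (d_goal_prev - d_goal)
    # R3: penalty for straying too far from the global guidance path.
    r3 = 0.0 if d_dev <= d_a_max else alpha3 * (d_a_max - d_dev)
    return r1 + r2 + r3
```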

4.3.2. Policy Network Architecture

Figure 2 depicts the network structure of HADRL, which is described in detail below. Since $o_t$ consists of three groups, i.e., $(s_t, S_t, G_t^*)$, we use two embedding functions to encode them separately. Through three fully connected (FC) layers, the embedding of the local segment of the global guidance $G_t^*$ can be represented as
$$e_l = FC_l(G_t^*)$$
As the number of surrounding UAVs in the field of view is uncertain, we customize an attention embedding block to aggregate the neighbor states $S_t$. As shown in Figure 3, the aggregated embedding $e_N$ is computed by
$$e_N = \mathrm{Attention}(S_t) = \sum_{j=1}^{N} \xi_N^j \, FC_{N1}(s_t^j)$$
The attention weight $\xi_N^j$ is calculated as
$$\xi_N^j = \frac{\exp(q_N^j)}{\sum_{j} \exp(q_N^j)}, \qquad q_N^j = FC_{N2}\!\left( \frac{1}{N} \sum_{j} FC_{N3}(s_t^j), \; s_t^j \right)$$
As shown in Figure 2, we concatenate the fixed-length representation of the other agents' states with the ego-agent's state and the embedding of the global guidance to create a joint state representation. This representation vector is fed into four fully connected layers (FCLs). The output, with a linear activation function, estimates the state-value function $V^\pi(S_t) = \mathbb{E}_{S_{t+1:}}\big[\sum_{l=0}^{\infty} r_{t+l}\big]$.
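A minimal PyTorch sketch of such a value network is given below, assuming illustrative input dimensions and hidden sizes; the attention scoring concatenates each neighbor feature with the mean-pooled feature, following our reading of the equations above.

```python
import torch
import torch.nn as nn

class AttentionValueNet(nn.Module):
    """Sketch of the deep V-network: attention over a variable number of
    neighbor states, concatenated with the ego state and guidance embedding."""
    def __init__(self, ego_dim=6, nbr_dim=9, guide_dim=10, hidden=64):
        super().__init__()
        self.embed_guide = nn.Sequential(                 # e_l = FC_l(G_t^*)
            nn.Linear(guide_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.fc_n1 = nn.Sequential(nn.Linear(nbr_dim, hidden), nn.ReLU())  # per-neighbor value embedding
        self.fc_n3 = nn.Sequential(nn.Linear(nbr_dim, hidden), nn.ReLU())  # per-neighbor scoring feature
        self.fc_n2 = nn.Linear(2 * hidden, 1)                              # attention score q_N^j
        self.value_head = nn.Sequential(                  # joint representation -> V
            nn.Linear(ego_dim + 2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, ego, neighbors, guidance):
        # ego: (B, ego_dim); neighbors: (B, N, nbr_dim); guidance: (B, guide_dim)
        e_l = self.embed_guide(guidance)
        feat = self.fc_n3(neighbors)
        pooled = feat.mean(dim=1, keepdim=True).expand_as(feat)   # mean-pooled neighbor feature
        scores = self.fc_n2(torch.cat([feat, pooled], dim=-1))    # (B, N, 1)
        weights = torch.softmax(scores, dim=1)                    # attention weights xi_N^j
        e_n = (weights * self.fc_n1(neighbors)).sum(dim=1)        # weighted sum of neighbor embeddings
        joint = torch.cat([ego, e_n, e_l], dim=-1)
        return self.value_head(joint)                             # estimated state value
```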

4.4. Hybrid A* Deep V Network (HADVN) Algorithm

An agent is unlikely to reach the goal position with an initial RL policy. Hence, during the pretraining phase, we use ORCA as an expert and perform supervised training of the value function parameters for $n_{ORCA}$ episodes. By setting the current state and applying ORCA, we obtain sequences of state, action, next state, and reward stored in a buffer $B \leftarrow \{o_k, a_k, o_{k+1}, r_k\}$. We then compute target value estimates and perform a supervised training step:
$$\theta_{k+1}^V = \arg\min_{\theta^V} \; \mathbb{E}_{(a_k, o_k, r_k) \sim D_{ORCA}} \Big[ \big( V_\theta(o_k) - V_k^{targ} \big)^2 \Big]$$
where $\theta^V$ denotes the value function parameters.
In this work, we train the policy using the deep V-learning algorithm proposed by Chen et al. [37]. The reasons for choosing a deep V-network instead of a Q-network in this paper are as follows: first, compared with a deep Q-network, the structure of the V-network is simpler, since its output is no longer a set of different combinations of the action space but can be reduced to a single node; second, in the supervised learning phase, the optimal policy can be computed faster and the data-processing operations are simpler; third, since the upper and lower bounds of the action quantities (acceleration and angular velocity) are different in this paper, the optimal policy can still be computed quickly with a V-network. We propose to jointly train the value network $V^\pi$ with the ORCA algorithm. Algorithm 1 describes the proposed training strategy, which has two main phases: supervised training and RL training. First, we randomly initialize the value function parameters $\theta^V$. Then, a multi-scene, multi-stage training framework is used to train the network. More details about the different training scenarios considered are given in Section 5.
Algorithm 1 Hybrid A* deep V-network (HADVN) algorithm
Input: target value network, number of supervised and RL training episodes {n_ORCA, n_episodes}, size of mini-batch n_minibatch
Output: optimal value network V^π
 1: Initialize the optimal value network V and the target value network V̂
 2: Initialize the open list P and closed list C; add P_0 to P
 3: while P is not empty do
 4:     Take the node with the smallest total cost f from P, record it as the current node P_k, and add P_k to C
 5:     if ||(x_k, y_k) − (x_g, y_g)|| ≤ d_min then
 6:         return {P_0, …, P_k}
 7:     else
 8:         Generate the Dubins curve l_{d_kg} from P_k to P_g
 9:     end if
10:     if l_{d_kg} satisfies the collision-free condition then
11:         Generate path points {P_{k+1}, …, P_g} on l_{d_kg} and return {P_0, …, P_g}
12:     end if
13:     Based on the discretized action combinations (u_{k+1}^i, ω_{k+1}^i) and Equation (4), generate the set of neighboring nodes {P_{k+1}^i} of P_k, where i = 1, …, n_i
14:     for each P_{k+1}^i in {P_{k+1}^i} do
15:         if P_{k+1}^i is beyond the map boundary, close to an obstacle, or already in C then
16:             continue
17:         end if
18:         Using Equation (5), calculate g = g(P_k) + g_{k+1,u,ω}^i
19:         if P_{k+1}^i is not in P then
20:             add P_{k+1}^i to P
21:         else if g ≥ g(P_{k+1}^i) then
22:             continue
23:         end if
24:         g(P_{k+1}^i) ← g; set the parent node of P_{k+1}^i to P_k
25:         Calculate the total cost f of the node P_{k+1}^i
26:     end for
27: end while
28: Generate G* from {P_0, …, P_k}
29: Return to step 2 if the training scenario changes
30: while episode < n_episodes do
31:     Initialize B and the states
32:     for k = 0, …, n_minibatch do
33:         if episode ≤ n_ORCA then
34:             use ORCA to obtain a_k
35:         else
36:             a_k = argmax_{a_k ∈ A} [ R(o_k, a_k) + γ^{Δt · v_max} V(o_{k+1}) ]
37:         end if
38:         {o_k, a_k, o_{k+1}, r_k, done} = step(o_k, a_k)
39:         Store {o_k, a_k, o_{k+1}, r_k} in B
40:         if done then
41:             episode += 1; update the target network V̂ ← V
42:         end if
43:     end for
44:     if episode ≤ n_ORCA then
45:         Perform a supervised training step
46:     else
47:         Randomly sample {o_k, a_k, o_{k+1}, r_k} from B
48:         Calculate the target value y_k = r_k + γ^{Δt · v_max} V̂(o_{k+1})
49:         Update the optimal value network V by gradient descent
50:     end if
51: end while
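For clarity, the one-step lookahead action selection (step 36) and the bootstrapped target (step 48) can be sketched as follows. The candidate action set, the simulate and reward helpers, and v_max are assumptions for illustration; γ and Δt follow Table 1.

```python
import torch

def select_action(value_net, obs, candidate_actions, simulate, reward_fn,
                  gamma=0.9, dt=0.25, v_max=30.0):
    """One-step lookahead action selection (step 36). `candidate_actions` is a
    sampled/discretized subset of the continuous action space, and `simulate`
    is an assumed one-step model returning the next observation tensors."""
    best_a, best_score = None, -float("inf")
    for a in candidate_actions:
        next_obs = simulate(obs, a)          # (ego, neighbors, guidance) tensors
        with torch.no_grad():
            v_next = value_net(*next_obs).item()
        score = reward_fn(obs, a) + gamma ** (dt * v_max) * v_next
        if score > best_score:
            best_a, best_score = a, score
    return best_a

def td_target(target_net, r_k, next_obs, gamma=0.9, dt=0.25, v_max=30.0):
    """Bootstrapped target of step 48: y_k = r_k + gamma^(dt*v_max) * V_hat(o_{k+1})."""
    with torch.no_grad():
        return r_k + gamma ** (dt * v_max) * target_net(*next_obs).item()
```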

5. Results

This section quantifies the performance throughout the training procedure and compares the proposed method against the baseline approach, the socially attentive reinforcement learning (SARL) algorithm [37], which adopts an LSTM to deal with an uncertain number of dynamic obstacles in the environment. Although SARL only considers a first-order kinematic model of the mobile robot, in the comparative experiments we adapt both its pretraining and its subsequent incremental training so that it can generate motion trajectories for fixed-wing UAVs.

5.1. Experimental Setup

The proposed training algorithm builds upon deep V-learning. We used a laptop with a 12th-generation Intel Core i9 processor and 32 GB of RAM for training. The hyperparameter values are summarized in Table 1.

5.2. Training Procedure

To train and evaluate our method, we adopt a multi-scene, multi-stage training framework, which effectively increases the convergence speed of the network and further improves the generalization ability and effectiveness of the strategy compared to training only in a single complex environment [38,39,40]. Here, we consider three different scenarios. In the first scenario, there are no static obstacles; the number of static obstacles is increased in the second and third scenarios. The obstacles are defined as cylinders of equal height and radius with randomly generated locations. To increase the difficulty of obstacle and collision avoidance, the UAV to be planned and the other adjacent UAVs are randomly distributed on a horizontal circle, and the goal position of the UAV to be planned is randomly placed on the other side of the circle.
Prior to the reinforcement learning training, the network was pretrained using the ORCA algorithm. In the pretraining phase, all UAV strategies were given by ORCA, 3000 episodes of experience were collected in the experience pool, and 50 rounds of training were performed on the optimal value network with a learning rate of 0.001.
Training is completed in three phases corresponding to the three scenarios, with 9000 training episodes for each scenario. Every 300 episodes, the experience pool is emptied and the success rate of the algorithm in the current environment is tested. The experiments show that staged training significantly improves the convergence speed, as shown in Figure 4.
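The staged schedule described above can be summarized as a short sketch; the helper callables are assumptions, and only the episode counts, evaluation interval, and pretraining settings are taken from the text.

```python
def train_schedule(make_scenario, pretrain, train_episode, evaluate,
                   scenarios=(1, 2, 3), episodes_per_scenario=9000, eval_every=300):
    """Multi-stage, multi-scenario schedule from Section 5.2. The four helpers
    (make_scenario, pretrain, train_episode, evaluate) are assumed callables."""
    pretrain(num_episodes=3000, epochs=50, lr=1e-3)  # ORCA-supervised pretraining
    for s in scenarios:                              # scenario 1: no static obstacles;
        env = make_scenario(s)                       # scenarios 2-3: more static obstacles
        for ep in range(episodes_per_scenario):
            train_episode(env)
            if (ep + 1) % eval_every == 0:
                evaluate(env)                        # empty the buffer and test the success rate
```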

5.3. Qualitative Analysis

This section analyzes the trained policy. As shown in Figure 4 and Figure 5, the average reward is about 0.6 and the success rate is close to 0.98 after 2100 training episodes, indicating that the algorithm has learned to avoid collisions in the absence of static obstacles. When the environment changes at 9000 episodes, the success rate and average reward drop significantly, indicating that the ability to avoid static obstacles has not yet been learned. After training to the 13,800th episode, the average reward for the second scenario is about 0.4 and the success rate is around 0.9. In the third scenario, after training to episode 21,000, the average reward is about 0.3, with a success rate of around 0.8.
As shown in Figure 5, the average reward becomes lower as the number of static obstacles in the environment increases. This is because the UAV has to take longer detours to avoid obstacles and collisions.
Figure 6 gives one planning episode from the top view: the trajectory of the ego-UAV is depicted in yellow, the other UAVs in other colors, and the obstacles in black; the markers of each UAV are plotted at equal time intervals. Figure 7 plots the x, y, z responses of the ego-UAV. The solid lines represent the trajectories of the ego-UAV, while the dashed lines give the goal position. It can be seen that the UAV successfully avoids all static obstacles and other UAVs while reaching the goal safely, as marked by the red dot. Using our approach, the UAV initiates collision avoidance maneuvers early enough to produce a smooth trajectory and arrive at the goal quickly.
At the end of training, paths planned in a random obstacle environment are tested using the HADVN algorithm, as shown in Figure 8. It can be observed that the network has learned to avoid randomly generated static obstacles and other UAVs.
To simulate complicated unknown disturbances, random Gaussian noise affecting the fixed-wing UAV dynamics is considered. Following [41], we apply a random Gaussian disturbance with a mean of 0 and a variance of 0.5 to the x or y position, respectively. We run the trained HADRL in three different scenarios with 10, 15, and 20 obstacles, respectively, and record the results of 100 episodes in Table 2. The success rate, i.e., the proportion of episodes in which the ego-UAV reaches its goal without a collision, is used to evaluate the performance. We can conclude that our approach maintains good performance in the presence of random Gaussian disturbances without extra training.
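The positional disturbance used in this test can be reproduced with a small helper; the sketch below assumes the noise is added directly to the x or y coordinate.

```python
import numpy as np

def perturb_position(pos, axis="x", variance=0.5, rng=None):
    """Add a zero-mean Gaussian disturbance (variance 0.5) to the x or y
    coordinate of a position, as in the robustness test above."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, np.sqrt(variance))
    out = np.array(pos, dtype=float)
    out[0 if axis == "x" else 1] += noise
    return out
```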

5.4. Performance Results

This section compares the proposed method with the baseline method, SARL. The simulation results are shown in Figure 9 and Figure 10. It can be seen that, in the first training stage, the difference in average reward and success rate between the SARL and HADRL algorithms is not large, and SARL already has a high success rate after pretraining the initial network, indicating that SARL can plan a collision-free, safe trajectory in the absence of static obstacles. However, in the second and third phases, the average reward and success rate of SARL decrease significantly, while our proposed HADRL algorithm is able to learn and adapt to more complex environments very quickly with the help of the global guidance.
Moreover, to illustrate the advantages of the proposed HADRL algorithm, comparative simulations are performed against the ORCA and SARL algorithms. The scenarios are the same as in Section 5.3, and the results of 100 episodes are recorded. Two metrics, success rate and execution time, are used to evaluate the performance, where the execution time is the time the algorithm takes to reach the goal. The results are shown in Table 3 and Table 4. HADRL achieves a higher success rate in all three scenarios. As for the execution time, HADRL and SARL show similar performance, which is better than that of ORCA. Hence, we conclude that our algorithm performs better in terms of computational efficiency and scalability.

6. Conclusions

For complex environments with static obstacles and other flying UAVs, this paper fully considers the second-order kinematic model and constraints of fixed-wing UAVs and proposes the HADVN motion planning algorithm, based on global path guidance and a local RL planner, so that fixed-wing UAVs can reach the target point safely within a preset time. Unlike the widely used deep Q-networks, our approach employs a deep V-network to deal with the continuous action space. In addition, considering the variable number of neighboring UAVs within the sensing range, an attention mechanism is added to the front end of the network to improve generalization. Finally, a multi-stage, multi-scene training strategy is used in the training process, and the effectiveness of the algorithm is verified by simulation experiments.

Author Contributions

Conceptualization, H.D. and M.Y.; methodology, M.Y. and X.Z.; software, H.D.; validation, H.D. and X.Z.; formal analysis, M.Y.; investigation, H.D.; resources, H.D., M.Y. and X.Z.; data curation, H.D., M.Y. and X.Z.; writing—original draft preparation, H.D. and X.Z.; writing—review and editing, H.D. and X.Z.; visualization, H.D. and X.Z.; supervision, M.Y. and X.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China grant number 62103417, and Tianjin Science and Technology Bureau Science and Technology Plan Project Diversified Fund number 21JCQNJC00800.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UAV     Unmanned Aerial Vehicle
HADVN   Hybrid A* Deep V-Network
RRT     Rapidly-exploring Random Tree
FOV     Field of View
ORCA    Optimal Reciprocal Collision Avoidance
SARL    Socially Attentive Reinforcement Learning

References

  1. Pfeiffer, M.; Schaeuble, M.; Nieto, J. From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation, Marina Bay Sands, Singapore, 29 May–3 June 2017; pp. 1527–1533. [Google Scholar]
  2. Dugas, D.; Nieto, J.; Siegwart, R. Navrep: Unsupervised representations for reinforcement learning of robot navigation in dynamic human environments. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation, Xi’an, China, 30 May–5 June 2021; pp. 7829–7835. [Google Scholar]
  3. Kästner, L.; Marx, C.; Lambrecht, J. Deep-reinforcement-learning-based semantic navigation of mobile robots in dynamic environments. In Proceedings of the 2020 IEEE 16th International Conference on Automation Science and Engineering, Hong Kong, China, 20–21 August 2020; pp. 1110–1115. [Google Scholar]
  4. Yan, T. Collision-avoiding flocking with multiple fixed-wing UAVs in obstacle-cluttered environments: A task-specific curriculum-based MADRL approach. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–15. [Google Scholar] [CrossRef] [PubMed]
  5. Klausen, K. Autonomous recovery of a fixed-wing UAV using a net suspended by two multirotor UAVs. J. Field Robot. 2018, 35, 717–731. [Google Scholar] [CrossRef]
  6. Liu, Z. Mission-oriented miniature fixed-wing UAV swarms: A multilayered and distributed architecture. IEEE Trans. Syst. Man Cybern. Syst. 2020, 52, 1588–1602. [Google Scholar] [CrossRef]
  7. Huang, S. Collision avoidance of multi unmanned aerial vehicles: A review. Annu. Rev. Control 2019, 48, 147–164. [Google Scholar] [CrossRef]
  8. Yasin, J.N. Unmanned aerial vehicles (UAVs): Collision avoidance systems and approaches. IEEE Access 2020, 8, 105139–105155. [Google Scholar] [CrossRef]
  9. Chen, H. Coordinated path-following control of fixed-wing unmanned aerial vehicles. IEEE Trans. Syst. Man Cybern. Syst. 2021, 52, 2540–2554. [Google Scholar] [CrossRef]
  10. Donald, B. Kinodynamic motion planning. J. ACM 1993, 40, 1048–1066. [Google Scholar] [CrossRef]
  11. Chen, H. Coordinated path following control of fixed-wing unmanned aerial vehicles in wind. ISA Trans. 2022, 122, 260–270. [Google Scholar] [CrossRef]
  12. Yan, C. Fixed-Wing UAVs flocking in continuous spaces: A deep reinforcement learning approach. Robot. Auton. Syst. 2020, 131, 103594. [Google Scholar] [CrossRef]
  13. Liu, Q.F. The Research of Unmanned Aerial Vehicles (UAV) Dynamic Path Planning Based on Sparse A* Algorithm and an Evolutionary Algorithm. Master’s Thesis, Nanchang Hangkong University, Nanchang, China, June 2016. [Google Scholar]
  14. Shi, H. Route Planning of Small Fixed-wing UAV Based on Sparse A* Algorithm. Ordnance Ind. Autom. 2021, 40, 14–18. [Google Scholar]
  15. Wu, Q. Application research on improved artificial potential field method in UAV path planning. J. Chongqing Univ. Technol. 2022, 36, 144–151. [Google Scholar]
  16. Guo, Y.C. 3D Path Planning Method for UAV Based on Improved Artificial Potential Field. J. Northwestern Polytech. Univ. 2020, 38, 977–986. [Google Scholar] [CrossRef]
  17. Fan, L.Y. A dense obstacle avoidance algorithm for UAVs based on safe flight corridor. J. Northwestern Polytech. Univ. 2022, 40, 1288–1296. [Google Scholar] [CrossRef]
  18. Li, J.; Liu, Y. Deep Reinforcement Learning based Adaptive Real-Time Path Planning for UAV. In Proceedings of the 2021 8th International Conference on Dependable Systems and Their Applications, Yinchuan, China, 5–6 August 2021; pp. 522–530. [Google Scholar]
  19. Wu, R. UAV path planning based on multicritic-delayed deep deterministic policy gradient. Wirel. Commun. Mob. Comput. 2022, 1–12. [Google Scholar] [CrossRef]
  20. Huang, C. A novel three-dimensional path planning method for fixed-wing UAV using improved particle swarm optimization algorithm. Int. J. Aerosp. Eng. 2021, 1–19. [Google Scholar] [CrossRef]
  21. Guo, J.H. Task allocation and path planning algorithm for multiple fixed-wing UAVs. J. Taiyuan Univ. Technol. 2023, 1, 1–10. [Google Scholar]
  22. Jiang, X. Global Path Planning Of Fixed-wing UAV Based On Improved RRT* Algorithm. J. Appl. Sci. Eng. 2023, 26, 1441–1450. [Google Scholar]
  23. Ren, W. Phase Space Graph Convolutional Network for Chaotic Time Series Learning. IEEE Trans. Ind. Inform. 2024, 20, 7576–7584. [Google Scholar] [CrossRef]
  24. Ren, W. An Interdigital Conductance Sensor for Measuring Liquid Film Thickness in Inclined Gas-Liquid Two-Phase Flow. IEEE Trans. Instrum. Meas. 2024, 73, 1–9. [Google Scholar] [CrossRef]
  25. Wang, L.; Ye, H.; Wang, Q. Learning-based 3D occupancy prediction for autonomous navigation in occluded environments. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, Prague, Czech Republic, 27 September–1 October 2021; pp. 4509–4516. [Google Scholar]
  26. Gu, P. Research on Obstacle-Avoiding Strategy of Hexapod Robot Based on Deep Learning. Master’s Thesis, Southwest University of Science and Technology, Nanchang, China, May 2019. [Google Scholar]
  27. Cai, X. EVORA: Deep Evidential Traversability Learning for Risk-Aware Off-Road Autonomy. arXiv 2023, arXiv:2311.06234. [Google Scholar]
  28. Yang, L. TacGNN: Learning Tactile-Based In-Hand Manipulation with a Blind Robot Using Hierarchical Graph Neural Network. IEEE Robot. Autom. Lett. 2023, 8, 3605–3612. [Google Scholar] [CrossRef]
  29. Tordesillas, J. Deep-panther: Learning-based perception-aware trajectory planner in dynamic environments. IEEE Robot. Autom. Lett. 2023, 8, 1399–1406. [Google Scholar] [CrossRef]
  30. Tao, Y.; Wu, Y.; Li, B. SEER: Safe efficient exploration for aerial robots using learning to predict information gain. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation, London, UK, 29 May–2 June 2023; pp. 1235–1241. [Google Scholar]
  31. Li, Y.; Zhang, S.; Ye, F. A UAV path planning method based on deep reinforcement learning. In Proceedings of the 2020 IEEE USNC-CNC-URSI North American Radio Science Meeting, Montréal, QC, Canada, 5–10 July 2020; pp. 93–94. [Google Scholar]
  32. Yan, C. Deep reinforcement learning of collision-free flocking policies for multiple fixed-wing uavs using local situation maps. IEEE Trans. Ind. Inform. 2021, 18, 1260–1270. [Google Scholar] [CrossRef]
  33. Berg, J.; Lin, M.; Manocha, D. Reciprocal Velocity Obstacles for real-time multi-agent navigation. In Proceedings of the 2008 IEEE International Conference on Robotics and Automation, Pasadena, CA, USA, 19–23 May 2008; pp. 1928–1935. [Google Scholar]
  34. Zhao, X.; Du, H.; Lu, H. Collision-Free Motion-Primitive-Based Motion Planning Algorithm for Fixed-Wing Robotic Aircraft. In Proceedings of the Advances in Guidance, Navigation and Control, Ha’erbin, China, 5–7 August 2022; pp. 5951–5960. [Google Scholar]
  35. Zhang, H.; Zhang, Y.; Guo, C. Path planning for fixed-wing UAVs based on expert knowledge and improved VFH in cluttered environments. In Proceedings of the 2022 IEEE 17th International Conference on Control & Automation, Naples, Italy, 27–30 June 2022; pp. 255–260. [Google Scholar]
  36. Zhang, T.Y. Research on Path Planning of Large UAV based on Dubins Algorithm. Master’s Thesis, Anhui Polytechnic University, Wuhu, China, June 2021. [Google Scholar]
  37. Chen, C.; Liu, Y.; Kreiss, S. Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning. In Proceedings of the 2019 International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 6015–6022. [Google Scholar]
  38. Akmandor, N.; Li, H.; Lvov, G. Deep reinforcement learning based robot navigation in dynamic environments using occupancy values of motion primitives. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems, Kyoto, Japan, 23–27 October 2022; pp. 11687–11694. [Google Scholar]
  39. Fan, T. Distributed multi-robot collision avoidance via deep reinforcement learning for navigation in complex scenarios. Int. J. Robot. Res. 2020, 39, 856–892. [Google Scholar] [CrossRef]
  40. Zhou, B. Robust and Efficient Quadrotor Trajectory Generation for Fast Autonomous Flight. IEEE Robot. Autom. Lett. 2019, 4, 3529–3536. [Google Scholar] [CrossRef]
  41. Labbadi, M. Adaptive fractional-order nonsingular fast terminal sliding mode based robust tracking control of quadrotor UAV with Gaussian random disturbances and uncertainties. IEEE Trans. Aerosp. Electron. Syst. 2021, 57, 2265–2277. [Google Scholar] [CrossRef]
Figure 1. HADRL algorithm structure.
Figure 2. HADRL algorithm network structure.
Figure 3. Attention mechanism module. Different colors are used to illustrate the different placements of the linear layers for clarity.
Figure 4. Number of training rounds—average reward curve.
Figure 5. Number of training rounds—success rate curve.
Figure 6. Results of the trained module. In yellow is depicted the trajectory of the robot; in other colors are the other UAVs; in black are the obstacles.
Figure 7. The x_t, y_t, z_t responses. The dotted line represents the goal position.
Figure 8. Planning results in random obstacle environments. (a,b) show the test results for a randomly generated scene with 15 obstacles, while (c,d) show the results for a randomly generated scene with 25 obstacles; the green dot represents the starting position and the red dot represents the target position.
Figure 9. Comparison of the number of training rounds and average reward curves.
Figure 10. Comparison of training rounds–success rate curves for the two algorithms.
Table 1. Hyperparameters.

Parameters                     Value
Discount factor                0.9
Learning rate                  0.001
Number of training rounds      300
n_ORCA                         5000
n_episodes                     30,000
n_minibatch                    100
r_collision                    −0.25
r_success                      1
r_fail                         −0.1
α_1                            0.05
α_2                            0.1
α_3                            0.01
α_u                            0.01
α_ω                            0.01
d_safe                         5 m
d_min                          1 m
d_amax                         10 m
Δt                             0.25 s
t_max                          25 s
Table 2. Comparison of success rate with random Gaussian noises.

Methods      HADRL (without Noises)   HADRL (with Noises in x Position)   HADRL (with Noises in y Position)
Scenario 1   0.86                     0.72                                0.71
Scenario 2   0.79                     0.65                                0.63
Scenario 3   0.76                     0.60                                0.61
Table 3. Comparison of success rate with other planning strategies.

Methods      ORCA   SARL   HADRL
Scenario 1   0.63   0.78   0.86
Scenario 2   0.55   0.62   0.79
Scenario 3   0.43   0.57   0.76
Table 4. Comparison of execution time with other planning strategies.

Methods      ORCA   SARL   HADRL
Scenario 1   0.32   0.12   0.13
Scenario 2   0.35   0.13   0.14
Scenario 3   0.36   0.13   0.16
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Du, H.; You, M.; Zhao, X. Globally Guided Deep V-Network-Based Motion Planning Algorithm for Fixed-Wing Unmanned Aerial Vehicles. Sensors 2024, 24, 3984. https://doi.org/10.3390/s24123984

