1. Introduction
An Unmanned Aerial Vehicle (UAV), as a highly maneuverable system, supports a wide range of applications in daily life. For example, UAVs can be used for search and rescue [1,2], as relays in communication systems [3,4,5], for public health emergency management [6,7,8], for IoT data collection [9], and more. Regardless of the task type, an effective autonomous-navigation method that guides the UAV to its destination in a safe, smooth, and cost-efficient manner is fundamental to task completion.
Classic path-planning algorithms such as Dijkstra, A*, PRM, and RRT* [10,11] can handle the prototypical navigation problem of finding a path from one point to another while avoiding obstacles. These algorithms assume that obstacles are perfectly known and stationary and that the planned path can be followed accurately. This is referred to as offline path planning because planning is finished before execution. However, the main weakness of these algorithms is that they cannot deal with uncertainty and changes in the environment. Furthermore, the path planned by these classic algorithms only includes location information and does not take velocity constraints into account, so the planned path may not obey the robot's dynamics.
To tackle these problems, more recent works have concentrated on improving the classic path-planning algorithms. Some of them, such as [12,13,14], are geared toward kinodynamic problems. The goal of kinodynamic path planning is to create a robot's motion subject to kinematic constraints, such as avoiding obstacles, and dynamic constraints, such as maximum velocity bounds [15]. Therefore, such methods typically build a graph whose edges are all executable by the robot during front-end path finding or perform trajectory optimization at the back-end. Other works, such as [16,17,18,19], focus on online replanning solutions. These methods typically reduce the offline planner's execution time, enabling it to continually replan in real-time as changes in the surroundings are detected. In addition, a great number of works [20,21,22,23,24] have been developed that consider both robot dynamics and uncertainty in the environment. However, the works mentioned so far are energy-intensive and require a considerable amount of computational resources. Quadrotors, being relatively small-scale UAVs, have limited electric power and computational resources [25,26], so it is not easy to run the above methods onboard.
As a result, in recent years, various DRL-based obstacle-avoidance solutions [27,28,29,30,31,32,33,34,35,36,37,38] have attracted considerable attention due to their real-time, kinodynamic, and energy-efficient features. DRL is a computational approach for learning how to map states to actions so as to obtain an optimized policy [39]. The DRL agent is not given any exact labels or known maps; instead, it uses trial-and-error to figure out the best action set. When it comes to UAV navigation, DRL has a series of advantages. First, all DRL-based methods emphasize training the policy by interacting with the environment directly. This means that actions are calculated in real-time using current sensor readings, ensuring the ability to respond to changes. Second, unlike traditional path-planning approaches that output waypoints, the DRL framework returns the UAV's actions directly, such as velocity and acceleration commands. Thus, it inherently respects the dynamics constraints. Third, DRL is based on neural networks, which makes it an end-to-end approach with low computational complexity. Such a property makes these methods more energy-efficient. Nevertheless, the lack of global information makes DRL-based algorithms run the risk of being stuck in maze-like scenarios. When the trajectory is long and the environment is full of uncertain obstacles, a maze-like scenario usually occurs. As shown in Figure 1, the UAV departs from Point A and heads towards the goal. When it arrives at Point B, the UAV's laser ranger detects an obstacle ahead, so it turns back. Because of its limited field-of-view sensing, it soon perceives the surroundings as obstacle-free and turns around again, which eventually leads to the UAV hovering around in the corner. Based on the results of the state of the art [38], although the DRL network is modified to train a policy for UAVs to escape from maze-like obstacles, there is still a failure rate when the trajectory grows longer.
Based on the above analysis, it remains challenging to navigate a UAV energy-efficiently around unpredictable obstacles over long trajectories. To address this problem, we propose the DRL Enhanced Global Long Trajectories Planning Framework (DGlobal for short). DGlobal comprises two key modules: a DRL training module and a global-planning module. The training module learns a DRL policy for obstacle-avoiding motion planning. The global-planning module first generates an optimal collision-free global path based on previous knowledge. Such a path is downsampled into a few waypoints while keeping the major characteristics of the path. Then, the well-trained policy from the training module navigates the UAV among the waypoints until it finally reaches the destination. However, the DGlobal Framework still faces the following two challenges.
1. To effectively train a DRL policy in a complex environment. We use a robust and accurate physics engine, Gazebo [40], to simulate the environment so that the trained policy can be more practical. However, training a DRL policy in such an accurate and complex environment is time-consuming and requires a large volume of computing resources.
2. To dynamically update the waypoints based on unpredictable obstacles. The waypoints generated by the global-planning module can be blocked by unpredictable obstacles that will mislead the UAV and make it unable to reach the destination.
Although the authors in [41,42] also considered combining global planning (e.g., the RRT algorithm) with DRL-based planning, their main idea is to use the DRL method to build the RRT tree rather than to use the global planner as guidance, and they considered neither how to avoid unpredictable obstacles nor how to improve training efficiency.
In this work, we not only propose the DGlobal Framework but also solve the above two challenges. First, inspired by the human-in-the-loop concept [43], we adopt a Human-in-the-Loop DRL (HL-DRL) Training Module that uses demonstration replay to accelerate the learning process. Second, we propose a goal-updating algorithm that can detect unpredictable obstacles and determine which waypoint is to be reached.
The main contributions of this work are summarized as follows.
- (1)
We propose the DRL Enhanced Global Long Trajectories Planning Framework to navigate a UAV to avoid unpredictable obstacles for long trajectories.
- (2)
We adopt the human-in-the-loop concept and use the HL-DRL module to train a policy for UAV obstacle avoidance in real-time.
- (3)
We establish a global-planning module for waypoint guidance for UAV navigation in maze-like environments.
- (4)
We use Gazebo [40] in conjunction with ROS as the simulator and build different types of environments to validate the efficiency of the proposed method.
The rest of this paper is organized as follows.
Section 2 surveys related works.
Section 3 formulates the navigation problem as a Markov Decision Process (MDP) and introduces some background knowledge of DRL. The proposed method is elaborated in
Section 4. In
Section 5, we describe the simulation details and analyze the experiment results. Last,
Section 6 concludes the paper.
4. The DGlobal Framework
In this section, we first give an overview of the DGlobal Framework and then describe it in detail, including the DRL settings, the DRL architecture, and the entire implementation process. The DGlobal Framework aims to address the following challenges of the UAV navigation problem.
(1) How do we navigate the UAV over a long trajectory and avoid obstacles? It is very hard for the UAV to find the destination itself since the long trajectory could be full of dynamic obstacles and traps.
(2) How do we accelerate training when the environment is rather complex? Training of the DRL module can be very slow when the solution space is rather large due to the complex training environment.
4.1. Overview
An overview of the DGlobal Framework is shown in Figure 3; it consists of the Human-in-the-Loop DRL (HL-DRL) training module and the global-planning module. The HL-DRL training module aims to obtain a well-trained policy for UAV point-to-point collision-free flight, while the global-planning module aims to prevent the UAV from being trapped in complex environments. Compared to other local planners [18,19,20,22], which need to recompute the trajectory during flight, a policy well-trained by DRL usually involves low computational cost and consumes less energy [35,52,53]. Thus, we adopt the DRL-based local planner instead of non-DRL planners.
The HL-DRL training module contains four sub-modules. First, the state space, action space, and reward function are carefully designed to help the UAV comprehend how to complete the task (①). Second, during the training process, the DRL algorithm randomly selects tuples ([state, action, next state, reward, done flag]) from the demonstration replay buffer and the experience replay buffer and concatenates them into a batch of n samples (②). To accelerate training, we adopt the human-in-the-loop mechanism [43]; by introducing the demonstration replay buffer, HL-DRL is able to learn from both human experience and self-exploration. Third, the batch is fed into an Actor–Critic network that is constructed to extract the features of the environment information accurately (③). Fourth, Twin Delayed Deep Deterministic Policy Gradients (TD3) [54], a DRL algorithm, is adopted to update the policy (④). Briefly, the TD3 algorithm has three major steps: (1) update the critic model networks, (2) update the actor model network, and (3) update all target networks. The details of the TD3 algorithm are further introduced in Section 4.4 and Section 4.5. The HL-DRL training module iterates the training process until it converges to an optimized policy.
The global-planning module generates a static, collision-free path based on previous knowledge, and then the RDP algorithm downsamples the path to a few key points that can guide the UAV to escape from maze-like situations. Furthermore, a novel goal-updating algorithm is designed to change the goal among waypoints, which improves the generalization of the proposed method.
Taken together, the global-planning module provides the UAV with some waypoints based on global information to avoid getting lost, and the well-trained policy obtained by the HL-DRL training module navigates the UAV between each two waypoints to deal with changes in the environment.
4.2. DRL Settings
4.2.1. State Space and Action Space
State Space. The information required for UAV navigation generally falls into two classes: one for reaching the target and another for avoiding collisions. First, we adopt the relative information between the goal and the UAV rather than the absolute state of the UAV to represent the state space, so that the representation remains valid when the environment changes. As shown in Figure 4, $p_u$ and $p_g$ are the positions of the UAV and the goal, respectively. Thus, the relative position between them can be defined as $\Delta p = p_g - p_u$. Moreover, $d_g = \|\Delta p\|$ denotes the distance between the goal and the UAV, and $\theta$ is the angle between the UAV velocity vector and the line connecting the UAV and the goal. It is worth noting that these state variables are expressed in the world frame. Second, in order to detect obstacles, we model a range sensor with a detection angle of 360° and an angular resolution of 0.5°, which corresponds to 720 rays.
Figure 5a shows a UAV detecting obstacles with the onboard range sensor in a multi-obstacle environment, and Figure 5b shows the histogram of the laser scanning data. Since the scanning range of real sensors is limited by a finite detection distance, we define the saturated distance of each ray, denoted $d_i$. As shown in Equation (1), if the length of ray $i$ is greater than the sensing range $d_{\max}$, indicating that there are no obstacles within the detecting region, then the distance $d_i$ is set to $d_{\max}$; otherwise, if ray $i$ hits an obstacle at point $O$, then $d_i$ is the distance of point $O$ from the sensor:
$$ d_i = \begin{cases} d_{\max}, & \text{if ray } i \text{ detects no obstacle within } d_{\max},\\ \mathrm{dist}(O, \text{sensor}), & \text{if ray } i \text{ hits an obstacle at point } O, \end{cases} \qquad i = 1, \dots, 720. \tag{1} $$
Here, $i$ ranges from 1 to 720 because the range sensor has 720 samples (rays). Furthermore, it is vital to normalize state values with different units and scales before inputting them into the networks; such pre-processing helps feature extraction and speeds up the convergence of DRL. Therefore, the state space $s$ is formulated as the 724-dimensional concatenation of the normalized relative-goal information and the normalized laser readings $d_i / d_{\max}$, where the relative position is normalized by the environment size: $L_x$ is the x-axis length of the environment, and $L_y$ is the y-axis length.
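For concreteness, the following Python sketch shows one way the 724-dimensional state could be assembled from the relative-goal information and the saturated laser readings. The helper names and the exact scaling of $d_g$ and $\theta$ (by the environment diagonal and by $\pi$, respectively) are illustrative assumptions; the text above only specifies normalization by $L_x$, $L_y$, and $d_{\max}$.

```python
import numpy as np

def build_state(p_uav, p_goal, v_uav, ranges, d_max, L_x, L_y):
    """Assemble the 724-dim state: 4 relative-goal values + 720 laser rays.

    The scaling of d_g (by the environment diagonal) and theta (by pi) is an
    illustrative assumption; the paper only states that positions are scaled
    by L_x, L_y and the rays by d_max.
    """
    dp = p_goal - p_uav                               # relative position in the world frame
    d_g = np.linalg.norm(dp)                          # distance to the goal
    # angle between the velocity vector and the UAV-goal line
    theta = np.arctan2(dp[1], dp[0]) - np.arctan2(v_uav[1], v_uav[0])
    theta = (theta + np.pi) % (2 * np.pi) - np.pi     # wrap to [-pi, pi]

    goal_part = np.array([dp[0] / L_x,
                          dp[1] / L_y,
                          d_g / np.hypot(L_x, L_y),   # assumed scaling
                          theta / np.pi])             # assumed scaling
    laser_part = np.clip(ranges, 0.0, d_max) / d_max  # saturated, normalized rays, Eq. (1)
    return np.concatenate([goal_part, laser_part])    # shape (724,)
```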
Action Space. The ‘XY_VEL_Z_POS’ mode (velocity control in the $xy$-plane and position control in the $z$-direction) is taken to model the dynamics of UAVs and is commonly used in flight-control software such as PX4. In this paper, the UAV’s altitude remains constant except during takeoff and landing; the UAV dynamics are then formulated as
$$ v_t = a_t \, v_{\max}, \qquad p_{t+1} = p_t + v_t \, \Delta t. $$
Thus, we use $a_t \in [-1, 1]^2$ to represent the action space, $v_t$ is the velocity of the UAV, $v_{\max}$ is the maximum speed value, $p_t$ represents the planar position of the UAV at moment $t$, and $\Delta t$ is the time interval.
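A minimal sketch of this action model is given below; the control interval `DT` is an assumed value, while the 2 m/s speed limit matches the simulation setting in Section 5.1.

```python
import numpy as np

V_MAX = 2.0   # maximum planar speed [m/s], as used in the simulations
DT = 0.1      # control interval [s]; illustrative value, not specified in the text

def step_dynamics(p_t, a_t, dt=DT, v_max=V_MAX):
    """Apply one step of the simplified planar dynamics.

    a_t is the network output in [-1, 1]^2; the commanded velocity is
    v_t = a_t * v_max, and the position integrates as p_{t+1} = p_t + v_t * dt.
    Altitude is held constant by the 'XY_VEL_Z_POS' mode and is not modeled here.
    """
    a_t = np.clip(a_t, -1.0, 1.0)
    v_t = a_t * v_max
    p_next = p_t + v_t * dt
    return p_next, v_t
```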
4.2.2. Reward Design and Termination Conditions
Reward design is a fundamental part of DRL and has a substantial impact on the results of DRL training. The DRL agent receives a reward $r$ calculated by the reward function after taking action $a$ in the current state $s$. According to Section 3.1, $r$ greatly affects the value function $Q(s, a)$, which evaluates the quality of the state–action pair. Therefore, a well-designed reward function can guide DRL to generate a safe and effective path to the destination. In this paper, we design the reward function based on our domain knowledge of the navigation problem and introduce it under three conditions.
C1: Colliding with obstacles. If the minimum value among all $d_i$ is less than or equal to the radius of the UAV, i.e., $\min_i d_i \le D_u / 2$, the UAV has hit something, where $D_u$ is the diameter of the UAV. For the UAV, colliding with obstacles could be fatal. Thus, in order to keep the UAV away from obstacles, we penalize it with a negative constant and use C1 as a termination condition.
C2: Reaching the goal. If $d_g$ is less than or equal to the sum of the UAV's radius and the goal's radius, i.e., $d_g \le (D_u + D_g)/2$, the UAV has reached the goal, where $D_g$ is the diameter of the goal point. Therefore, to teach the UAV to reach the target, we give it a positive constant as a reward and consider C2 to be another termination condition.
C3: Neither collision nor task completion. If the UAV has neither hit an obstacle nor completed the task at the current moment, it receives a reward composed of an obstacle penalty, a distance penalty, an angle penalty, and a per-step penalty. First, to make the UAV maintain a certain distance from obstacles, we design the obstacle penalty $r_{obs}$: if the UAV is less than the safe distance $d_{safe}$ away from the nearest obstacle, it suffers a penalty equal to a negative constant $\lambda_{obs}$; otherwise, the obstacle is not a threat, and the UAV receives zero. Second, to reduce the distance between the UAV and the goal point, $d_g$ is used to form the distance penalty $r_{dist}$, whose scaling coefficient $\lambda_{dist}$ is a negative constant. Third, to assist the UAV in moving towards the goal point, $\theta$ is taken to build the angle penalty $r_{ang}$, whose scaling coefficient $\lambda_{ang}$ is a negative constant. Last, to make the UAV reach the target as soon as possible, a negative constant $\lambda_{step}$ is given to the UAV at each step.
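The following sketch gathers the three conditions into a single reward function. The numeric constants and the exact forms of the distance and angle penalties (linear in $d_g$ and $|\theta|$) are illustrative assumptions; only their signs and roles follow the description above.

```python
def reward(d_rays, d_g, theta, D_u, D_g,
           r_collision=-20.0, r_goal=20.0,       # assumed terminal constants
           lam_obs=-0.5, d_safe=1.0,             # obstacle penalty and safe distance (assumed values)
           lam_dist=-0.05, lam_ang=-0.05, lam_step=-0.01):
    """Return (reward, done) for one step.

    C1: collision, C2: goal reached, C3: otherwise a shaped penalty that sums
    obstacle, distance, angle, and per-step terms. All numeric values are
    placeholders; only their signs follow the paper.
    """
    d_min = min(d_rays)
    if d_min <= D_u / 2.0:                 # C1: hit an obstacle
        return r_collision, True
    if d_g <= (D_u + D_g) / 2.0:           # C2: reached the goal
        return r_goal, True
    # C3: neither collision nor completion
    r_obs = lam_obs if d_min < d_safe else 0.0   # keep a safe margin from obstacles
    r_dist = lam_dist * d_g                      # shrink the distance to the goal (assumed linear form)
    r_ang = lam_ang * abs(theta)                 # face the goal (assumed linear form)
    return r_obs + r_dist + r_ang + lam_step, False
```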
4.3. Replay Buffer Design
Experience replay is a technique that makes the HL-DRL training process more stable. In Actor–Critic-based DRL algorithms, the critic network is used to approximate the state–action value $Q(s, a)$. When updating the network, stochastic gradient descent requires the training data to be independent and identically distributed (i.i.d.). Therefore, the authors in [55] designed a large buffer, known as an experience replay buffer, that stores a set of experience tuples. The fixed-size buffer appends the latest tuple (state, action, next state, reward, done flag) whenever the agent takes a step. After the buffer reaches its maximum size, a batch of experience tuples is sampled randomly from it to update the network. As a result, experience replay keeps the training process from oscillating or diverging by breaking the correlation between training samples.
In this paper, we not only adopt the experience replay buffer but also involve a “human tutor” and design a new buffer called the demonstration replay buffer. Each time HL-DRL updates, we sample some tuples from each of the two buffers and concatenate these tuples before inputting them into the networks. The following is a detailed description of the demonstration replay buffer.
The aim of demonstration replay is to speed up the convergence of training. We choose Gazebo as a simulator for HL-DRL training in this paper because it offers the ability to efficiently and accurately simulate robots in complex environments, closing the simulation-to-reality gap to a large extent. However, this feature makes it almost impossible for the UAV to find the target by exploring from scratch. Without any successful experience, it is hard for a DRL module to obtain the optimized policy. Therefore, we hire a group of human players to operate the UAV to the goal and store transitions
(state, action, next state, reward, done flag) in the demonstration replay buffer. The data in the demonstration replay buffer can help the UAV rapidly realize how to complete the task during the pivotal initial phase of training, and since the demonstration replay buffer is prepared before the start of training, human involvement does not slow the whole training process. As shown in
Figure 6, according to the observation and the current state $s_t$ in Gazebo, the human player uses the keyboard to control the UAV, and the operation is logged as action $a_t$. Then, the environment returns the next state $s_{t+1}$, the reward $r_t$, and the done flag; $s_{t+1}$ is obtained by reading the UAV state from Gazebo directly, and $r_t$ is calculated using the same reward function introduced in Section 4.2.2. For every step, the current tuple $(s_t, a_t, s_{t+1}, r_t, \text{done})$ is saved in the demonstration replay buffer, but if a collision occurs, all data in that episode are removed. It should be noted that we collect demonstrations from different human players to exclude the influence of any individual human factor and to maintain data objectivity.
In general, the demonstration replay largely overcomes the sample inefficiency issue in DRL by incorporating human demonstrations and thus makes DRL converge faster.
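A minimal sketch of the two-buffer design is shown below, assuming a fixed demonstration buffer loaded before training and an experience buffer filled online; the batch sizes are placeholders (a 1:1 ratio mirrors the best-performing setting reported in Section 5.1).

```python
import random
from collections import deque

class DualReplay:
    """Experience replay plus a fixed demonstration replay buffer.

    Human-demonstration tuples are loaded once before training; agent tuples
    are appended online. Each update samples from both buffers and concatenates
    the samples into one batch. Batch sizes are illustrative.
    """
    def __init__(self, demo_tuples, capacity=100_000):
        self.demo = list(demo_tuples)                # (s, a, s_next, r, done) from human players
        self.exp = deque(maxlen=capacity)            # self-exploration tuples

    def add(self, s, a, s_next, r, done):
        self.exp.append((s, a, s_next, r, done))

    def sample(self, n_exp=128, n_demo=128):
        batch = random.sample(self.exp, min(n_exp, len(self.exp)))
        batch += random.sample(self.demo, min(n_demo, len(self.demo)))
        random.shuffle(batch)                        # mix demonstration and self-exploration tuples
        return batch
```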
4.4. Network Structure
The Actor–Critic network architecture is applied to build the DRL algorithm in this paper. There are two reasons why the Actor–Critic network structure is adopted to design DGlobal. First, although more-advanced structures such as residual modules and GRU units are available, we focus on validating the effectiveness of our proposed framework. Therefore, using a simpler architecture such as an Actor–Critic network is more efficient. Second, adopting residual modules or GRU units may introduce potential drawbacks such as overfitting, difficulty in tuning hyperparameters, and increased complexity. In particular, as our framework is designed for deployment on a UAV, increased complexity could pose challenges to the limited computational and energy resources onboard.
As we mentioned in
Section 3.1, the actor maps the state into the action to control the agent, and the critic then provides evaluative feedback about the action based on its reward.
Figure 7 shows the structures of the actor network and the critic network. The actor network is composed of two fully connected layers, followed by a Rectified Linear Unit (ReLU) activation and a hyperbolic tangent (tanh) activation, respectively. The first layer receives the 724-dimensional state vector $s$, consisting of the 4-dimensional relative goal information and the 720-dimensional laser information, and outputs a 256-dimensional vector. The second layer takes this vector as input and maps it to the 2-dimensional action vector $a$ that serves as the velocity factor. For the critic network, the input $s$ is first mapped to a 1024-dimensional vector by a fully connected layer with ReLU and then merged with $a$ into a 1026-dimensional vector. After two layers with ReLU and one fully connected layer, the critic outputs the state–action value $Q(s, a)$.
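The following PyTorch sketch mirrors the dimensions stated above (724 → 256 → 2 for the actor; 724 → 1024, merged with the action into 1026 dimensions, for the critic). The widths of the critic's two hidden layers after the merge are not given in the text and are assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """724-dim state -> 256 (ReLU) -> 2-dim action in [-1, 1] (tanh)."""
    def __init__(self, state_dim=724, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """State branch 724 -> 1024 (ReLU), concatenated with the 2-dim action,
    then two ReLU layers and a final linear layer producing Q(s, a).
    The widths of the hidden layers after the merge are assumptions."""
    def __init__(self, state_dim=724, action_dim=2, hidden=512):
        super().__init__()
        self.state_fc = nn.Sequential(nn.Linear(state_dim, 1024), nn.ReLU())
        self.q_head = nn.Sequential(
            nn.Linear(1024 + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        x = torch.cat([self.state_fc(state), action], dim=-1)   # 1026-dim merged vector
        return self.q_head(x)
```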
4.5. Training Process
We use TD3 [54] to address the UAV navigation problem in this paper. Although other DRL algorithms could also be adopted in the DGlobal framework, we choose TD3 since it is a state-of-the-art DRL algorithm and has been widely used to train navigation policies for unmanned vehicles [49,56,57]. TD3 is also based on the Actor–Critic architecture, similar to DDPG; however, it offers several improvements over DDPG, as shown in Figure 8. The actor target network receives the next state $s'$ and outputs the target action $a'$, as in DDPG, but the difference is that TD3 adds a small disturbance to $a'$ before inputting it into the critic target networks (①), i.e., $\tilde{a} = \mathrm{clip}(a' + N, a_{low}, a_{high})$, where $N$ denotes the clipped noise and $a_{low}$ and $a_{high}$ are the boundary values of actions. This behavior reduces the negative impact of $Q$-value estimation errors by smoothing out $Q$ in the action dimension. Next, after obtaining $\tilde{a}$, TD3 uses two critic target networks instead of one and chooses the smaller $Q$-value to calculate the loss function $L$. For the critic model networks, TD3 learns two $Q$-values as well (②). Therefore, the loss function in TD3 can be denoted as
$$ L = \mathbb{E}\Big[ \big(Q_1(s, a) - y\big)^2 + \big(Q_2(s, a) - y\big)^2 \Big], \qquad y = r + \gamma \min_{i=1,2} Q'_i\big(s', \tilde{a}\big). $$
Using the smaller $Q$-value to calculate the target value $y$ in $L$ helps mitigate overestimation of the $Q$ function. Moreover, the actor updates in the direction of a greater $Q$-value, which is almost the same as in DDPG. However, the actor in TD3 updates at a lower frequency than the critic: TD3 recommends updating the actor once the critics have been updated twice (③). This restrains the fluctuation that often appears in DDPG, because it is better to update the policy after the value function has been estimated accurately.
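The sketch below illustrates one TD3 update step with the three ingredients described above (target-policy smoothing, clipped double-Q targets, and delayed actor/target updates). The hyperparameter values are common TD3 defaults rather than the values used in this paper.

```python
import torch
import torch.nn.functional as F

def td3_update(step, batch, actor, actor_t, critic1, critic2, critic1_t, critic2_t,
               actor_opt, critic_opt, gamma=0.99, tau=0.005,
               noise_std=0.2, noise_clip=0.5, a_low=-1.0, a_high=1.0, policy_delay=2):
    """One TD3 update; batch tensors have shape (B, ...), done is a float 0/1 tensor,
    and critic_opt is assumed to optimize the parameters of both critics."""
    s, a, s_next, r, done = batch

    with torch.no_grad():
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)   # clipped noise N
        a_next = (actor_t(s_next) + noise).clamp(a_low, a_high)                    # smoothed target action
        q_next = torch.min(critic1_t(s_next, a_next), critic2_t(s_next, a_next))   # clipped double-Q
        y = r + gamma * (1.0 - done) * q_next                                       # TD target

    # update both critic model networks against the shared target y
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    if step % policy_delay == 0:
        # delayed actor update: ascend Q1 with respect to the actor's action
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Polyak-average all target networks
        for net, net_t in [(actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)]:
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
```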
Overall, TD3 trains the DRL agent in this way until it converges to an optimized policy. However, if the UAV performs the task relying only on the DRL policy, it could get stuck in complex environments because of its limited sensing capability and the lack of global knowledge. To address this concern, we integrate the DRL policy with the global planner; the details are introduced in the following subsection.
4.6. Entire Implementation Process
To make up for deficiencies in DRL-based motion planning, we propose a novel navigation method that incorporates global planning, as illustrated in Algorithm 1.
Phase I (line 1): The path-planning algorithm A* [10] is adopted to generate a collision-free path $\sigma$ based on the previous point-cloud map and the positions of the start and the goal.
Phase II (line 2): We use RDP [58], a polyline-simplification algorithm, to produce a simplified polyline that has fewer points than path $\sigma$ but still keeps its characteristics. These points are stored in sequence in the set $W = \{w_1, \dots, w_m\}$ and are used as waypoints in the following. In $W$, the first element $w_1$ is the start point and the last element $w_m$ is the goal point.
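For reference, a standard recursive implementation of the RDP simplification used in Phase II could look as follows; the tolerance `epsilon` is an assumed parameter.

```python
import numpy as np

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker simplification: keep the endpoints, find the
    intermediate point farthest from the start-end chord, and recurse only if
    it deviates by more than epsilon. `points` is a list of (x, y) coordinates."""
    start = np.asarray(points[0], dtype=float)
    end = np.asarray(points[-1], dtype=float)
    chord = end - start
    norm = np.linalg.norm(chord)
    dists = []
    for p in points[1:-1]:
        d = np.asarray(p, dtype=float) - start
        if norm == 0.0:
            dists.append(np.linalg.norm(d))                              # degenerate chord
        else:
            dists.append(abs(chord[0] * d[1] - chord[1] * d[0]) / norm)  # perpendicular distance
    if not dists or max(dists) <= epsilon:
        return [points[0], points[-1]]              # whole segment is close enough to the chord
    k = int(np.argmax(dists)) + 1                   # index of the farthest point
    left = rdp(points[:k + 1], epsilon)             # simplify both halves recursively
    right = rdp(points[k:], epsilon)
    return left[:-1] + right                        # merge, dropping the duplicated split point

# e.g., waypoints = rdp(global_path, epsilon=0.5)   # tolerance in meters (illustrative)
```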
Phase III (lines 3–10): The well-trained policy $\pi$ is used to navigate the UAV along the waypoints. First, we initialize the goal to the first to-be-reached waypoint in $W$. Then, the policy generates an action $a$ based on the current state $s$, which includes information regarding the current to-be-reached waypoint, and the UAV performs the action. Next, we update the goal (the details are described in Algorithm 2). The algorithm terminates when the goal is reached, a collision occurs, or a fixed number of steps is exhausted. If the UAV reaches the goal, a dynamically feasible and collision-free motion plan can be returned.
Algorithm 1: Proposed method.
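Since the pseudocode listing is not reproduced here, the following Python-style sketch outlines the flow of Algorithm 1 as described above; all injected callables (`planner`, `simplify`, `get_state`, `execute`, `reached`, `update_goal`) are placeholders for the corresponding components.

```python
def navigate(policy, planner, simplify, get_state, execute, reached, update_goal,
             grid_map, start, goal, max_steps=5000):
    """Sketch of Algorithm 1 (Phases I-III). planner = global A*, simplify = RDP,
    get_state builds the DRL state for the current waypoint, execute applies one
    action and reports collisions, reached tests arrival, update_goal = Algorithm 2."""
    path = planner(grid_map, start, goal)        # Phase I: global collision-free path
    W = simplify(path)                           # Phase II: waypoints, W[0] = start, W[-1] = goal
    goal_idx = 1                                 # Phase III: first to-be-reached waypoint
    for _ in range(max_steps):
        s = get_state(W[goal_idx])               # state w.r.t. the current waypoint
        a = policy(s)                            # action from the well-trained policy
        if execute(a):                           # returns True on collision
            return False
        goal_idx = update_goal(W, goal_idx)      # Algorithm 2 (arguments simplified)
        if reached(W[-1]):
            return True                          # collision-free motion plan executed
    return False                                 # step budget exhausted
```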
The goal-updating algorithm and corresponding extension methods are given in the following subsections.
Algorithm 2: Updating the goal.
4.6.1. Goal-Updating Algorithm
While the updated-map strategy can assist the UAV in detecting changes in the environment, when numerous waypoints are blocked by unpredictable obstacles, the UAV may have to rely solely on the local planner (the policy learned by DRL) to navigate through the maze-like environment. In such cases, the UAV has to take risks and could become trapped in the maze. Hence, a goal-updating algorithm that renews the global waypoints is necessary. Algorithm 2 gives the details of updating the goal, and it has three major steps.
Step 1 (lines 1–5). We check whether the UAV has come within range of the $i$-th waypoint $w_i$. If so, we directly set the goal to the $(i+1)$-th waypoint, which prevents the UAV from flying unnecessary distances. The index $i$ is initialized to the index of the current goal waypoint (line 1), and we calculate the distance between the UAV and waypoint $w_i$ (line 3). If this distance is less than or equal to the sum of the UAV's radius and the waypoint's radius, then the UAV is deemed to have reached $w_i$; therefore, we change the goal to waypoint $w_{i+1}$ (line 4). Incidentally, we expand each waypoint into a region with its coordinate as the center and $D_w$ as the diameter because this helps the UAV handle changes in the environment. This process repeats until all remaining waypoints have been examined (line 2). As a result, we obtain an updated goal index.
Step 2 (lines 6–7). If Step 1 did not return a new goal, it implies that the UAV has reached neither the current goal nor any subsequent waypoint. In this case, we need to evaluate whether the UAV is unable to approach the goal because random obstacles were generated within the range of the waypoint. If so, we relax the criterion for the UAV to reach the waypoint. To be specific, if the UAV is less than $\delta$ away from the edge of the waypoint's region and detects that the distance to the nearest obstacle is also less than $\delta$, then we consider that the UAV has arrived at the waypoint and we update the goal (line 7); $\delta$ is chosen according to the environmental setting.
Step 3 (lines 8–9). The goal point's position can be obtained from the updated index, and both of them are returned.
The goal-updating algorithm only switches the goal among the waypoints during the UAV's flight. Between every two waypoints, the UAV is navigated by the DRL policy, and its speed can be dynamically changed.
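A sketch of the three steps of Algorithm 2, under the notation introduced above, is given below; the argument names and the single relaxation threshold `delta` are illustrative.

```python
import numpy as np

def update_goal(W, goal_idx, p_uav, d_rays, r_uav, r_wp, delta):
    """Sketch of Algorithm 2. W is the waypoint list, goal_idx the index of the
    current goal, p_uav the UAV position, d_rays the laser readings, r_uav/r_wp
    the UAV and waypoint radii, and delta the relaxation threshold
    (all names are illustrative)."""
    new_idx = goal_idx
    # Step 1: if the UAV is already within range of any remaining waypoint,
    # jump to the waypoint after the farthest such one.
    for i in range(goal_idx, len(W)):
        if np.linalg.norm(p_uav - W[i]) <= r_uav + r_wp:
            new_idx = min(i + 1, len(W) - 1)
    # Step 2: if no waypoint was reached, relax the criterion when the current
    # goal region appears blocked by an unpredictable obstacle.
    if new_idx == goal_idx:
        near_edge = np.linalg.norm(p_uav - W[goal_idx]) <= r_wp + delta
        obstacle_close = min(d_rays) < delta
        if near_edge and obstacle_close:
            new_idx = min(goal_idx + 1, len(W) - 1)
    # Step 3: return the updated index (the goal position is W[new_idx]).
    return new_idx
```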
Figure 9 gives an example to illustrate the proposed method: (a) The A* algorithm plans the collision-free path (the blue dotted line) based on the previous information, and then the RDP algorithm simplifies the path and generates the waypoints (green circles). (b) The UAV departs from the start point and flies along the waypoints. (c) After the UAV has arrived at W2, the goal is switched to W3; however, the DRL policy navigates the UAV to the other side to stay away from obstacles. (d) Then, the UAV reaches W4. According to Algorithm 2 (lines 2–5), the goal is set to W5, although the UAV has not reached W3. (e) After the UAV has arrived at W6, the goal is changed to W7, where a new obstacle appears. (f) Therefore, according to Algorithm 2 (line 7), we relax the criterion for the UAV reaching W7 by expanding the range of W7. (g) Next, the UAV passes W8 and finally reaches the goal.
4.6.2. Algorithm Extension
During Phase I of Algorithm 1, the A* algorithm is employed to generate a collision-free path $\sigma$, which serves as the basis for navigation and goal updating. However, the presence of dynamic obstacles can cause multiple waypoints in $W$ to become invalid and blocked, which can lead the UAV to the wrong locations and ultimately leave it trapped in maze-like obstacles. Therefore, we adopt the following extensions to enhance the robustness of the proposed algorithms.
First, in Algorithm 1, the A* algorithm generates a set of $n$ paths $P = \{\sigma_1, \dots, \sigma_n\}$, and each path is simplified by the RDP algorithm. Any two of these paths may overlap.
The second extension is in the goal-updating algorithm. Suppose that the UAV has arrived at the $i$-th waypoint $w_i^j$ of path $\sigma_j$ and its next goal is the $(i+1)$-th waypoint $w_{i+1}^j$. If the UAV cannot arrive at $w_{i+1}^j$ within a certain time $T$, then $w_{i+1}^j$ is marked as an unreachable waypoint, and every path containing $w_{i+1}^j$ is removed from $P$. After that, the UAV selects a new waypoint $w_{i'}^{k}$ from path $\sigma_k$ as its new goal, where $k$ and $i'$ are chosen such that $\sigma_k$ belongs to the updated $P$ and $w_{i'}^{k}$ has not yet been reached.
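The following sketch illustrates the second extension under the assumptions above; the time budget, the bookkeeping of visited waypoints, and the choice of the closest remaining waypoint as the new goal are illustrative details rather than the paper's exact rule.

```python
import time

def reselect_waypoint(P, next_wp, deadline, visited, distance_to):
    """Sketch of the second extension. P is the list of simplified backup paths
    (each a list of waypoints), next_wp is the waypoint the UAV failed to reach,
    deadline is the absolute time by which it should have been reached, visited
    is the set of already-reached waypoints, and distance_to measures the
    distance from the UAV to a waypoint. All names are illustrative."""
    if time.time() <= deadline:
        return P, next_wp                          # still within the allowed time T
    # mark next_wp unreachable: drop every path that contains it
    P = [path for path in P if next_wp not in path]
    candidates = [wp for path in P for wp in path if wp not in visited]
    if not candidates:
        return P, None                             # no backup waypoint left
    # assumed selection rule: take the closest not-yet-reached waypoint
    return P, min(candidates, key=distance_to)
```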
Table 2 lists the major symbols used in this section.
5. Simulation Results
In this section, experiments are carried out to evaluate the performance of the proposed method. First, we introduce the simulation setup and show the training results. Then, we test the proposed method in different situations and analyze the results. Finally, we demonstrate the effectiveness of the goal-updating algorithm (Algorithm 2). All algorithms are implemented on a workstation with an 11th-generation Intel i7 processor, an NVIDIA GeForce RTX 3060 GPU, and Ubuntu 20.04.
5.1. Simulation Setup and Training Results
We use Gazebo in combination with ROS as the simulator, which efficiently and accurately simulates the dynamics of the UAV and largely closes the simulation-to-reality gap.
Figure 10 shows the training environment. The yellow ball and the green ball, respectively, indicate the start and the goal points, and any obstacles are represented by grey cylinders in this configuration. As mentioned in
Section 4.2.1, the UAV is equipped with a range sensor with a 360° field of view and a range of 0.08 m to 5 m. We model the dynamics of the UAV using the PX4 ’XY_VEL_Z_POS’ mode, which controls the velocity in the $xy$-plane and the position in the $z$-direction. As our simulation is intended to test the effectiveness of the proposed method in navigating the UAV over long trajectories in maze-like environments, we fix the height of the UAV at 2 m to prevent it from circumventing obstacles by adjusting its height. At the start of each flight, the UAV ascends to a height of 2 m, and upon reaching its destination, it descends to the ground for landing. During the flight, the UAV’s speed can be dynamically changed within a maximum speed limit of 2 m/s.
At the beginning of each training episode, the weights of the networks are initialized at random. The start and the goal are randomly generated within certain areas, and obstacles are randomly placed within the remaining area. The episode ends if the UAV reaches the goal, a collision occurs, or the steps reach the maximum number. The training parameters are listed in
Table 3 and are chosen based on a combination of domain knowledge, empirical observations, and best practices in TD3 [
54]. It is worth noting that the setting of training parameters can have a significant impact on the training process and performance [
59]. Therefore, we perform preliminary experiments and hyperparameter tuning to select the parameters that provided stable learning and good performance for our specific problem.
To show the effectiveness of demonstration replay in this paper, we train TD3 without demonstration replay and TD3 with different ratios of demonstration replay batch size to experience replay batch size for comparison. The agent in TD3 with demonstration replay can learn from human experience and self-exploration, whereas the agent in TD3 without demonstration replay only learns by exploring from scratch. Multiple training experiments are carried out with different initialization seeds, and
Figure 11 shows the reward curves during the training process from one of the multiple training runs. We smooth the curves using the average-filtering method to observe the trend, i.e., $\bar{r}_k = \frac{1}{M}\sum_{j \in \mathcal{W}_k} r_j$, where $r$ is the original curve, $\bar{r}$ is the smoothed one, and $\mathcal{W}_k$ is the window of $M$ consecutive episodes around episode $k$.
M represents the width of the smoothing window. At the beginning of training, the UAV explores the environments at random, so the reward of each episode is quite low. As training continues, the policy updates and the exploration noise is reduced. Therefore, the UAV receives a greater reward value by making full use of the policy. As we can see, the curve of TD3 without demonstration replay continues to fluctuate a great deal and does not converge until training ends. The reason is that the agent in TD3 without demonstration replay can only learn from self-exploration, which rarely contains positive examples; thus, the agent constantly learns from negative examples, leading to the slow convergence problem. When the ratio of demonstration replay batch size to experience replay batch size is 1, the reward curve converges around the 700th episode. This is because demonstration replay provides numerous effective examples for the UAV to learn. Moreover, a balanced proportion between human demonstrations and the agent’s own experiences enables effective learning and efficient convergence. When the ratio is 1/3, where experience replay data have a larger share in the training samples, the reward curve shows a converging trend, but the convergence speed is significantly slower. This indicates that relying too much on the agent’s own experiences may not fully exploit the benefits of human demonstrations, potentially leading to a slower learning process. When the ratio is 3, the demonstration replay data occupy a large proportion of the training samples. It can be seen that the training results are even worse than those obtained by TD3 without demonstration replay. The cause is that the agent overly relies on human demonstrations, which are biased to some extent. This implies the importance of balancing the agent’s own experiences with human demonstrations to avoid overfitting or being restricted by the limitations of the demonstration data.
In the following sections, we evaluate the performance of the DGlobal in various environments with and without unpredictable obstacles.
5.2. Navigation in Environments without Unpredictable Obstacles
To test the performance of navigating in different situations, we build the following six types of static environments: Env1, Env2, and Env3 are all 20 m × 20 m environments, where Env1 includes some cylinder-shaped obstacles, Env2 has a trap, and Env3 is a labyrinth. Env4 is a 20 m × 100 m corridor that is full of obstacles. Env5 and Env6 are 100 m × 100 m environments full of barriers and mazes.
The following two existing algorithms are implemented in these environments to compare to the DGlobal method.
(1) Local Astar (LAS for short). LAS generates safe paths using the local information obtained from laser range scanning. The LAS adopted in this simulation is implemented by an open-source project [
60].
(2) TD3-based navigation (TD3 for short). TD3 [
54] uses a well-trained policy to generate actions according to laser scanning data and is real-time online planning.
The DGlobal method uses both the result of global A* and the well-trained policy of DRL to plan in real-time, as elaborated in
Section 4.6.
The following two subsections introduce the performance of these three algorithms in Env1–Env6. For each environment, the algorithms are executed multiple times with different start and goal points as inputs. To ensure a fair comparison, the generation seed is fixed for each round, guaranteeing that all three methods are carried out in the same environmental configuration. We employ the metric “
average flight time” as an indicator of the average energy consumption across multiple test experiments, as longer flight durations intuitively correlate with increased energy consumption. It is worth mentioning that the global planner calculation is not taken into the “average flight time” of the UAV in DGlobal because the global planner is part of the preprocessing phase, which is analogous to the DRL training process. Additionally, for the same scenario, the global planner does not need to recompute at the beginning of each navigation since DGlobal can effectively handle random obstacles that may appear (which will be discussed and analyzed in
Section 5.3).
5.2.1. Navigation in Env1, Env2, and Env3
Typical cases of navigation results in Env1–Env3 are illustrated in
Figure 12, and
Figure 13 shows the average flight time of these three algorithms under different environments. All methods successfully planned collision-free trajectories in Env1, as seen in
Figure 12a–c. Moreover, TD3, which does not consider global information, has almost the same flight time as DGlobal, indicating that the DRL policy is efficient when navigating the UAV in relatively simple environments. In Env2, as shown in
Figure 12d–f, LAS found a safe path due to its completeness and optimality. The UAV using TD3 lost its way because of the limited sensing range and lack of global knowledge, whereas the UAV using DGlobal finally took 56.3 s to reach the goal thanks to waypoint guidance. Similar results are obtained in Env3, as shown in
Figure 12g–i. The UAV navigated by TD3 was trapped in the U-shaped obstacle; however, DGlobal avoided this situation by incorporating global planning. Consequently, compared to TD3, DGlobal can prevent a UAV with limited sensing ability from being stuck in maze-like environments, and compared to LAS, DGlobal can navigate the UAV to the goal point as soon as possible.
5.2.2. Navigation in Env4 (Corridor-like Environment)
Env4 is adopted to evaluate the performance of the DGlobal method in a long-trajectory scenario where the UAV needs to fly through a corridor to its destination. We test all three algorithms in Env4 many times while varying the number of obstacles.
Figure 14 illustrates typical cases of navigation results, and
Figure 13 presents the flight time. The experimental results show that all methods successfully planned collision-free trajectories in Env4. Additionally, similar to Env1, the flight times of TD3 and DGlobal are quite close, which demonstrates the effectiveness of TD3 in non-maze-like environments. Although the flight path of LAS is shorter than those of TD3 and DGlobal, it takes the longest time, because the LAS algorithm repeatedly needs to hover while calculating the next waypoint, which is time-consuming. This group of simulations implies that TD3 and DGlobal can save more energy than LAS in a corridor-like environment.
5.2.3. Navigation in Env5 and Env6 (Maze-like Environments)
Env5 and Env6 are two
complex, maze-like environments. In Env5, there are two ways to arrive at the destination, while in Env6, the UAV can only get to the destination by flying through the right side of the maze. Typical cases of navigation results are shown in
Figure 15, and the average flight times of TD3, LAS, and DGlobal are illustrated in
Figure 13. According to
Figure 15a–c, we can see that the TD3 method cannot navigate the UAV to the destination in Env5. As illustrated in
Figure 15b, the UAV is trapped in the maze. As discussed in
Section 1, navigation in a maze-like environment is challenging for DRL-based methods. Although the DGlobal method is also based on DRL, it can get out of the maze by following the waypoints obtained from the global A*. On the other hand, the well-trained flight policy also helps the UAV fly smoothly and save time. The time consumed by the DGlobal method is only half of that of LAS, which means DGlobal is more energy-efficient than LAS.
As shown in
Figure 15d–f, neither LAS nor TD3 can navigate the UAV to the destination. The reason is that both rely solely on local information. TD3 and LAS guide the UAV to fly through the left side of the maze, which seems a straightforward way to close the distance to the destination based on local information; however, it is also a trap in Env6. DGlobal, on the other hand, avoids the trap by adopting the global information from the global A* and arrives at the destination in the end.
5.3. Navigation in Environments with Unpredictable Obstacles
In this section, we introduce unpredictable obstacles into the maze-like environments Env2, Env3, and Env6 (Env5 is similar to Env6 and is omitted) to evaluate the performance of the DGlobal algorithm. We use UEnv2, UEnv3, and UEnv6 to represent the environments with unpredictable obstacles. Based on the experimental results in the above sections, the navigation strategies generated by LAS and TD3 are either time-consuming or lead the UAV into traps, so it is unfair to compare them with DGlobal in more complicated environments. Therefore, we conduct a comparison with the state-of-the-art algorithm FRSVG(0) (FRSVG for short) [38]. FRSVG is able to make decisions based on historical observations and actions by using recurrent neural networks. It also addresses the challenge that the UAV easily gets trapped when DRL methods are used to solve navigation problems in large-scale, long-trajectory environments. Thus, it serves as an appropriate comparison for DGlobal. The test experiments are carried out 200 times with obstacles generated randomly. Each time, FRSVG and DGlobal run with the same distribution of obstacles to maintain fairness.
Table 4 shows the success rate (SR), collision rate (CR), and lost rate (LR) of the two algorithms across 200 tests. We define SR, CR, and LR as the respective percentages of episodes in which the UAV successfully reaches the target point, experiences a collision with obstacles, and becomes trapped or lost.
Focusing on the DGlobal method, the data in
Table 4 demonstrate that DGlobal achieves a higher success rate compared to FRSVG in all three environments (98.0%, 98.5%, and 95.5% in UEnv2, UEnv3, and UEnv6, respectively). Additionally, as shown in
Figure 16, the DGlobal method successfully guides the UAV to avoid these obstacles and arrive at the destination. The capability to deal with unpredictable obstacles is a result of training with DRL. We also notice that the UAV updates its goal by using the goal-updating algorithm when flying through UEnv3 and UEnv6 (the effectiveness of the goal-updating algorithm will be analyzed in the following section).
In contrast, the FRSVG agent achieves lower success rates in all three environments (96.5%, 98.0%, and 89.5% in UEnv2, UEnv3, and UEnv6, respectively).
Figure 16 provides typical successful instances of the FRSVG algorithm. Although it solves navigation problems in unpredictable scenarios with the help of recurrent networks, it may still generate some unnecessary trajectories during flight, especially in long-trajectory scenarios such as UEnv6. This indicates that while recurrent networks help the agent to capture temporal dependencies in past observations and actions, they may have difficulty in capturing long-term dependencies. Moreover, the lack of a global planner also leads to unnecessary trajectories, as the agent may not be aware of efficient solutions in the larger context.
5.4. The Effect of the Goal-Updating Algorithm
We analyze the effectiveness of the proposed goal-updating method (Algorithm 2) and its extension (Section 4.6.2) by designing a traditional goal-updating method as a baseline and implementing all of the methods in environments with unpredictable obstacles (cylinders).
First, we compare the traditional method with the proposed goal-updating method (left part of Figure 17). In the traditional goal-updating method, the goal is set to the $(i+1)$-th waypoint only when the UAV has reached the $i$-th waypoint. Thus, the UAV flies through waypoints W2, W3, and W4 in order and is finally trapped at W5, since W5 is blocked by an unpredictable obstacle and the DRL policy keeps the UAV away from obstacles at all times. In contrast, the proposed goal-updating algorithm performs much better: the UAV skips W2 and W3, flies through W4 directly, then passes W5 and finally reaches the destination. The reason is that when the UAV is at P1, its distance from the edge of W2's region is less than $\delta$ and the distance from the nearest obstacle is also less than $\delta$. Therefore, according to the proposed goal-updating method, the goal is changed to W3. Then, the distance between the UAV and W3 is calculated at the next step, which also meets the condition for updating the goal; thus, W4 becomes the goal. When the UAV is at P2, an obstacle overlaps W5. Therefore, the proposed method relaxes the criterion for the UAV reaching W5 by expanding the range of W5. The proposed goal-updating method can help the UAV complete the task even if new obstacles appear at waypoints, indicating that it improves generalization to environmental changes while retaining the information from the waypoints.
Next, we compare the proposed goal-updating method with its extension described in
Section 4.6.2. We establish a maze-like environment that contains a door, and the waypoint (W7) generated by the global planner indicates that the optimal path should go through the door. The proposed goal-updating method leaves the UAV trapped in the maze because the door has been blocked by an unpredictable obstacle. However, at P3, the extended version of the goal-updating method switches to another backup waypoint and leads the UAV to the destination in the end.
Our simulations imply that the extended version of the goal-updating method performs the best. However, it also requires additional memory to store the backup paths.