1. Introduction
Flying robots are among the most flexible man-made robots developed to date. With their high level of maneuverability, these robots can navigate through complex and challenging environments, including natural forests and modern urban buildings, and they can reach areas that most other human-made robots cannot. This flexibility has led to numerous applications, including environmental mapping [1,2], patrol inspection [3], search and rescue [4], logistics automation [5], entertainment performances [6], and agricultural automation [7]. A high level of autonomy enables flying robots to accomplish missions that were previously impossible. To this end, autonomous navigation allows a flying robot to safely interact with the environment and fly to its destination without human intervention.
After years of continuous development and research [8,9], many efforts have pushed flying robots toward full autonomy, thereby freeing human expert pilots from manual control. However, developing end-to-end navigation methods that are capable of robust flight at high speeds in complex environments is a long-standing challenge that remains unsolved. Traditional trajectory-based optimization methods need to prebuild a mathematical model of the environment, which usually requires a map construction procedure such as an ESDF map [10,11,12,13,14,15,16]. However, the mapping procedure tends to be time-consuming, making it difficult to meet real-time requirements. Imitation-learning-based methods train a policy generator from a large number of expert experiences [17,18]. However, by only imitating human experience, the learned policy generator cannot handle unseen scenarios and may make inappropriate decisions.
In recent years, learning-based end-to-end algorithms such as soft actor–critic (SAC) [19] combined with convolutional neural networks have been investigated. SAC is an off-policy deep reinforcement learning (DRL) algorithm that employs a stochastic policy to maximize sample utilization. These end-to-end algorithms directly map the visual observations, attitude, and desired target information of a flying robot to the action output of the agent. In related work [20], a sensor-level DRL-based policy surpassed traditional algorithms in complex pedestrian navigation tasks on a ground robot platform, which is highly impressive. Although neural-network-based methods may be less explainable, they remain attractive, since they do not require rigorous mathematical proofs or tedious theoretical analyses. In the study of Xue et al. [21], seven ranging sensors were used to perceive the environment, and a reinforcement learning approach based on an actor–critic framework was used to achieve the autonomous navigation of a UAV in an unknown environment. Similarly, in Zhang et al.'s study [22], more than seven laser ranging sensors were used to sense the environment, and an improved TD3-based algorithm was used to realize an autonomous navigation task for a UAV in a multi-obstacle environment. However, both methods depend on ranging sensors, and neither can accurately perceive the environment in the UAV's forward direction.
In this work, our assumption is that a UAV can achieve intelligent navigation by relying only on visual inputs and its own attitude information. This assumption is inspired by human experience: humans can explore and navigate autonomously in unknown and complex environments based on visual observation alone. Therefore, we follow the deep reinforcement learning route and focus on designing an efficient visual processing procedure. Our proposed method mainly utilizes vision sensors and onboard inertial measurements to achieve the robust autonomous navigation of flying robots. In contrast to existing works, it requires neither online map construction nor prebuilt maps, which effectively reduces the perception latency. To tackle the sample-efficiency issue [23] in DRL, a previous work, SAC_RAE [24], uses a regularized autoencoder (RAE) to project raw pixels into a latent state before feeding them into the DRL module. However, to improve processing efficiency, SAC_RAE downsamples the projected feature map in the autoencoder multiple times, which results in significant information loss. In this work, we propose a focus autoencoder (FAE) to enhance the representative ability of the features with a large receptive field while keeping the processing procedure efficient. We conducted extensive experiments to validate the effectiveness of our method, as shown in Figure 1. Our proposed SAC_FAE outperforms SAC_RAE and trajectory-based optimization methods by 10% and 19% in the navigation success rate, respectively. In addition, SAC_FAE outperformed other methods in multiple test environments, which validates the effectiveness and robustness of our method. Our code implementation is available at https://github.com/LHL6666/SAC_FAE (accessed on 23 September 2023).
The rest of this paper is organized as follows. We review the related works on the autonomous navigation of aerial robots in Section 2. Section 3 provides the details of our proposed method. Section 4 and Section 5 provide a detailed introduction of our experimental setup and an evaluation of the experimental results, respectively. Finally, we conclude the work in Section 6.
2. Related Works
In the field of automatic navigation for flying robots, various approaches to achieve autonomous flight have been proposed in the literature, which can be broadly categorized into the following three types: (1) Trajectory-based optimization methods: These methods involve designing a set of optimal trajectories that the robot should follow to reach its destination. They commonly rely on mathematical models and algorithms to generate the trajectories, and they typically require accurate information about the environment and the robot’s dynamics; (2) Imitation-learning-based methods: These methods require a large amount of expert experience to fit an AI model that performs well in specific environments, but they have poorer generalization and exploration capabilities; (3) Reinforcement-learning-based methods: These methods represent a promising approach to achieving autonomous flight, which involves training an intelligent agent to learn how to navigate by interacting with its environment and receiving feedback in the form of rewards or penalties. Reinforcement-learning-based methods require a large amount of data for training, but they can use simulation software to obtain this data, thus making them more cost-effective than other methods. Additionally, the agent can continuously explore and learn from its environment, thus ultimately achieving comparable results to human experts. Next, we will introduce these methods in more detail.
2.1. Trajectory-Based Optimization Algorithms
Fast-Planner [25] and EGO-Planner [26] utilize certain search rules to find collision-free paths and then optimize those paths for dynamic feasibility and smoothness. Fast-Planner is noted for its stability: it projects depth images onto point clouds to construct ESDF maps and subsequently performs a path search and trajectory optimization. Since the planning algorithm must operate on the constructed ESDF map, the delay of the observation information becomes more prominent. This also means that, in order to achieve better performance, the speed of the flying robot must be strictly limited. Moreover, due to the adaptive modification of the target point during trajectory optimization, Fast-Planner is not suitable for tasks in challenging environments requiring high-precision navigation. In navigation experiments conducted in complex scenes, the planner may exhibit conservative behaviors, because the target point does not impose a sufficient constraint on the behaviors, thereby resulting in a higher likelihood of task timeout without completion.
EGO-Planner is an improved planning algorithm that builds on Fast-Planner with a stronger decision-making ability, which reduces the probability of task timeout while increasing the success rate. Interestingly, even when the planning horizon of EGO-Planner is increased several times, the algorithm still explores aggressively and plans trajectories filled with demanding maneuvers, such as frequent emergency turns, for flying robots. In addition, the planner requires frequent restarts to reduce data errors.
For navigation in complex unknown environments, these typical algorithms combine online mapping and traditional planning algorithms. From an engineering perspective, splitting the navigation task into environmental perception and local planning is attractive, because each component can run in parallel, thereby making the overall system more efficient and interpretable. However, there is a time–space mismatch between the output of the perception module and the joint debugging of the planner, which makes the interaction between the different stages prone to compounding errors. Additionally, their sequential nature introduces additional delays that make maneuvering at high speed and with agility difficult. Although these issues can be mitigated to some extent by manual tuning with expert knowledge, the divide-and-conquer principle that prevails in autonomous flight research in unknown environments imposes fundamental limits on the speed and agility that flying robotic systems can achieve.
2.2. Imitation-Learning-Based Algorithms
Imitation-learning-based agents learn how to navigate by observing the trajectories of human experts or other robots that have completed specific tasks. Typically, a large volume of observational data is collected and used to train a neural network policy that can replicate an expert's decision-making process. The policy then predicts the next action to be taken from the input observation data and achieves the navigation goal by executing those actions. Imitation-learning-based algorithms are simple to train and, with sufficient training data, robots can learn how to navigate on their own. However, if the training data are insufficient or noisy, the policy may fail to make optimal decisions. Additionally, since the algorithm learns and selects actions based on existing data, it may be unable to handle situations it has never seen before. Typical published studies [18,27,28] used imitation learning to train a policy that matches the expert's behavior as closely as possible; however, the resulting policy is heavily dependent on the input experience.
2.3. Deep-Reinforcement-Learning-Based Algorithms
Recently, research on end-to-end robot navigation using DRL has become increasingly popular. Yarats et al. [24] proposed a SAC_AE policy with regularization constraints on the decoder loss. Then, in [29], Huang et al. used the regularized SAC_AE policy (SAC_RAE) to complete a distributed multi-UAV collision avoidance task, in which the flying robots were able to avoid each other and reach the target point using only the depth image from a front-facing depth camera. However, the validity of this policy was not well demonstrated, because the experiment was conducted in an unobstructed open space. Following the success of the transformer [30] in the computer vision field, combinations of transformers and reinforcement learning have been proposed in several works [31,32,33]. In these works, transformers were used to extract feature information from observations, which was then input into the policy network for learning, thereby achieving satisfactory results in their task scenarios.
However, we have noticed that the introduction of transformer modules in DRL may make policy training more challenging. Nevertheless, the literature suggests that it is theoretically possible to use vision transformers to build an encoder network for the perception module, which takes in all observation information (including depth images and agent state information), extracts latent variables, and computes the attention between them. In practice, transformer modules may lead to unstable learning, particularly in situations in which the agent’s action set is rich and continuous. Therefore, to address this issue, we explored methods to increase the receptive field of convolutional modules, rather than relying solely on the large receptive field advantage of transformer modules.
3. Methodology
3.1. Problem Formulation
The objective of this study was to enable a flying robot to navigate rapidly in complex and unpredictable wild environments using DRL. The agent was limited to receiving only the depth image within a 15 m range ahead (consistent with the ZED Mini stereo camera) and its own pose information, thereby leading to restricted observation of the interaction process with the environment. Therefore, this study falls under the category of a POMDP (Partially Observable Markov Decision Process), described as a 7-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \Omega, \gamma, \mathcal{O})$. Therein, the state space $\mathcal{S}$ represents the hidden variable, $\mathcal{A}$ is the action space, the function $\mathcal{T}$ denotes the state transition probability, $\mathcal{R}$ is the reward function, $\Omega$ is the observation space, $\gamma$ is the discount factor, and $\mathcal{O}$ represents the observation probability.
At the beginning of each episode, target points are randomly generated in the interactive environment at a constant Euclidean distance. The episode ends only when the target point is reached, the episode times out, or the distance to an obstacle falls below the safety threshold. Specifically, at each time step $t$, the flying robot obtains a visual observation $o_t^{v}$ and a pose observation $o_t^{p}$. It then executes an action $a_t$ and obtains the environment's reward $r_t$. By repeating this process, the policy guides the agent to avoid obstacles and reach the target point by outputting the desired actions.
3.2. Policy Setup
3.2.1. Observation Space
The pose observation $o_t^{p}$ consists of three parts: the target position in the body coordinate system, the robot's own velocity observation, and its acceleration observation. The visual observation $o_t^{v}$ includes four consecutive 160 × 120 depth images from adjacent moments. At each time step $t$, the observation obtained by the agent is $o_t = (o_t^{v}, o_t^{p})$.
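To make the observation layout concrete, the short sketch below assembles the stacked depth frames and the pose vector. The helper name `build_observation` and the assumption that each pose sub-vector is 3-D (giving a 9-D pose observation) are illustrative rather than taken from the released code.

```python
import numpy as np

def build_observation(depth_frames, target_body, velocity, acceleration):
    """depth_frames: the most recent 160x120 depth images (at least four)."""
    visual_obs = np.stack(depth_frames[-4:], axis=0)                   # shape (4, 120, 160)
    pose_obs = np.concatenate([target_body, velocity, acceleration])   # e.g., shape (9,) if each part is 3-D
    return visual_obs, pose_obs
```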
3.2.2. Action Space
Our agent enjoys complete control freedom without explicit limitations on its action space. At each time step $t$, our policy outputs an action command consisting of four degrees of control: $v_x$, $v_y$, $v_z$, and $\omega_{yaw}$. Here, $v_x$ represents the velocity in the x direction of the body coordinate system, scaled by the factor $k$ and bounded to a fixed range in m/s. Similarly, $v_y$ and $v_z$ represent the velocities in the y and z directions, respectively, with bounded ranges in m/s, while $\omega_{yaw}$ represents the rate of change of the agent's heading angle, with a bounded range in rad/s; $k$ is always equal to 1 in our experiments.
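A hedged sketch of how a normalized policy output could be mapped to this velocity command is given below. The bound constants `V_XY`, `V_Z`, and `W_YAW` are placeholders rather than the paper's numeric ranges, while the scaling factor k = 1 follows the text.

```python
import numpy as np

K = 1.0                            # scaling factor k (always 1 in our experiments)
V_XY, V_Z, W_YAW = 1.0, 1.0, 1.0   # placeholder bounds in m/s and rad/s (not the paper's values)

def to_command(action):
    """action: normalized policy output, shape (4,), clipped to [-1, 1]."""
    a = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    return {
        "vx": K * V_XY * a[0],          # velocity along the body-frame x axis
        "vy": K * V_XY * a[1],          # velocity along the body-frame y axis
        "vz": K * V_Z * a[2],           # vertical velocity
        "yaw_rate": K * W_YAW * a[3],   # rate of change of the heading angle
    }
```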
3.2.3. Reward Setup
Sparse rewards can make it challenging for reinforcement learning algorithms to learn, because the agent must undergo many trial-and-error iterations to discover the right course of action. Incorporating auxiliary rewards can facilitate learning the correct policy, thus leading to a faster and more efficient learning process. Our reward function is composed of two distinct components. The first, the basic reward, guides the agent to complete the primary task. The second, the auxiliary reward, assists the agent in learning a policy that expedites the achievement of this objective and penalizes inefficient actions to encourage more efficient outputs. At time step $t$, the total reward combines these two components.
One auxiliary coefficient encourages the flying robot to reach its target point as quickly as possible, either by taking a shorter path or by moving at a faster speed. A second coefficient motivates the flying robot to reduce the angle between the velocity direction and the visual observation direction, so as to avoid unnecessary blind-flight maneuvers. For example, a flying robot may choose to climb vertically over an obstacle and then move toward the target point, or it may fly sideways to the left or right, or backwards, such that it cannot visually observe its direction of movement. Both coefficients should be set between 0 and 1. The indicator $M$ determines whether the flying robot exhibits abnormal movements, such as unnecessary blind-flight maneuvers; if the policy outputs an abnormal action, then $M = 1$. In addition, the reward depends on the Euclidean distance between the starting point and the target point, the straight-line distance the agent has moved toward the target point at the current moment, and the maximum number of time steps that can be executed in each episode.
In our experiments, we set these parameters to 0.1, 0.5, and 20, respectively.
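As a purely illustrative aid, the sketch below shows one way the described terms could be combined: a basic goal/collision reward plus auxiliary progress, heading-alignment, and abnormal-action terms. The exact formulas, the terminal bonus values, and the roles of the constants 0.1, 0.5, and 20 are assumptions and are not taken from the paper.

```python
def reward(reached_goal, collided, prev_dist, dist, heading_angle, abnormal, w1=0.1, w2=0.5):
    """Illustrative reward; all weightings and terminal values are assumed."""
    r_basic = 10.0 if reached_goal else (-10.0 if collided else 0.0)  # assumed terminal bonuses
    r_progress = w1 * (prev_dist - dist)     # encourage shrinking the distance to the target
    r_heading = -w2 * abs(heading_angle)     # discourage flying outside the camera's view direction
    r_abnormal = -1.0 if abnormal else 0.0   # M = 1 penalizes blind-flight maneuvers
    return r_basic + r_progress + r_heading + r_abnormal
```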
3.3. Network Architecture
The flow chart of our method is shown in Figure 2. Our method consists of three key modules: the vision encoder, the state encoder, and the reinforcement learning (RL) policy module. The two encoders extract critical information from the depth images and the IMU information, respectively, and project it into feature vectors. The two extracted features are then concatenated and fed into the RL policy module for strategy learning. According to the current observation, the RL module returns a set of actions for UAV control, including the velocities in three dimensions and the yaw rate.
Our SAC_FAE network structure, shown in Figure 3, employs convolution to acquire latent variables from visual observations. However, we did not simply use plain convolutional autoencoders. Instead, we utilized a focus module to execute a particular type of convolution operation. The focus module originated in YOLOv5 and trades feature map resolution for an increased number of channels. This operation groups the input tensor by channel, arranges the pixels in each channel in a particular order, and then performs ordinary convolution. This grouping and rearrangement effectively decreases the computation and memory consumption required by the focus module. Specifically, we used a slicing operation to divide a high-resolution feature map into multiple low-resolution feature maps and concatenated them along the channel dimension.
This approach enables us to complete downsampling operations without losing information and enhances the feature expression capacity of the proposed model through a larger receptive field. Although the focus operation yields lossless feature information and enlarges the global receptive field, much of the feature information is still redundant and does not help model training. Therefore, after each focus operation, we used a convolutional layer to reduce the number of channels to half of the original. Simultaneously, we truncated the gradients from the actor so that the uncertainty of the exploration process would not affect encoder learning.
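The focus (space-to-depth) operation can be sketched as follows, in the style of the YOLOv5 Focus layer. The channel counts and kernel size are illustrative assumptions rather than the exact SAC_FAE configuration.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # After slicing, the channel count is 4 * in_channels; a 3x3 convolution
        # then compresses it (e.g., to half of the concatenated channels).
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Lossless downsampling: every 2x2 spatial neighborhood is rearranged
        # into the channel dimension, halving H and W without discarding pixels.
        patches = torch.cat(
            (x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]),
            dim=1,
        )
        return self.conv(patches)

if __name__ == "__main__":
    depth_stack = torch.randn(1, 4, 120, 160)  # 4 stacked 160x120 depth frames
    print(Focus(4, 8)(depth_stack).shape)      # -> torch.Size([1, 8, 60, 80])
```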
The encoder extracted 384 latent variable features (encoder feature dim) from the input image observations. For the pose and target point of the flying robot, 128 latent variable features (measurement feature dim) were obtained after two layers of MLP. Both the actor and the critic used a 4-layer multilayer perceptron (MLP) with 1024 units.
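The sketch below wires these stated dimensions together (a 384-d visual latent, a 128-d measurement latent from a two-layer MLP, and 4-layer 1024-unit MLP heads). The 9-D pose input, the Gaussian actor parameterization, and the action-conditioned critic follow standard SAC practice and are assumptions here, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

ENCODER_FEATURE_DIM = 384      # visual latent size (encoder feature dim)
MEASUREMENT_FEATURE_DIM = 128  # pose/target latent size (measurement feature dim)
HIDDEN_DIM = 1024              # width of the actor/critic MLPs
ACTION_DIM = 4                 # vx, vy, vz, yaw rate

# Pose/target encoder: two MLP layers producing the 128-d measurement feature.
def measurement_encoder(pose_dim: int = 9) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(pose_dim, MEASUREMENT_FEATURE_DIM), nn.ReLU(),
        nn.Linear(MEASUREMENT_FEATURE_DIM, MEASUREMENT_FEATURE_DIM),
    )

# 4-layer MLP head with 1024 units, used for both the actor and the critic.
def mlp_head(in_dim: int, out_dim: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(in_dim, HIDDEN_DIM), nn.ReLU(),
        nn.Linear(HIDDEN_DIM, HIDDEN_DIM), nn.ReLU(),
        nn.Linear(HIDDEN_DIM, HIDDEN_DIM), nn.ReLU(),
        nn.Linear(HIDDEN_DIM, out_dim),
    )

joint_dim = ENCODER_FEATURE_DIM + MEASUREMENT_FEATURE_DIM   # 512-d concatenated feature
actor = mlp_head(joint_dim, 2 * ACTION_DIM)                 # Gaussian mean and log-std per action
critic = mlp_head(joint_dim + ACTION_DIM, 1)                # Q-value; a SAC critic also takes the action

# Stand-ins for the focus-based visual encoder output and the pose feature:
visual_latent = torch.randn(1, ENCODER_FEATURE_DIM)
pose_latent = measurement_encoder()(torch.randn(1, 9))
mean_logstd = actor(torch.cat([visual_latent, pose_latent], dim=1))
```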
3.4. Training Strategies
To improve sample efficiency, we utilized an experience replay buffer to store the environment interaction trajectories. We discovered that the hyperparameters of buffer capacity and training frequency play an important role in determining the final outcome. Setting the buffer capacity too high can impede the learning process and increase model complexity, because data generated by earlier versions of the policy differ from data generated by the current policy. Setting the replay buffer too small leads to unstable training, and the results tend to fall into local optima. Therefore, the buffer capacity was set to store approximately 50 interaction fragments, and training was performed at the end of each episode. The pseudocode for our policy-training process is presented in Algorithm 1.
Algorithm 1: Policy training.
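As an illustration of the training procedure described above — episodic interaction, a replay buffer holding roughly 50 interaction fragments, and an update at the end of each episode — the following minimal Python sketch may help. Here `env`, `policy`, and `sac_update` are placeholder interfaces for the simulator, the SAC_FAE networks, and one SAC update pass, not the released implementation.

```python
from collections import deque

def train(env, policy, sac_update, episodes: int = 1200, buffer_episodes: int = 50):
    """Placeholder-based sketch of the episodic SAC training loop."""
    replay_buffer = deque(maxlen=buffer_episodes)   # holds ~50 interaction fragments
    for _ in range(episodes):
        trajectory, obs, done = [], env.reset(), False
        while not done:
            action = policy.sample(obs)             # stochastic action from the current policy
            next_obs, reward, done, _ = env.step(action)
            trajectory.append((obs, action, reward, next_obs, done))
            obs = next_obs
        replay_buffer.append(trajectory)            # store the whole interaction fragment
        sac_update(policy, replay_buffer)           # train at the end of each episode
```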
5. Results
Here, SAC_RAE [29] was selected for comparison, since it was the first work to achieve distributed multi-UAV collision avoidance using deep reinforcement learning (DRL) techniques. In addition, since we focused on improving visual feature extraction in this work, we also replaced the vision model of SAC_RAE with two popular and strong architectures, i.e., CNN and ViT, to construct two stronger variants for comparison, i.e., SAC_CNN and SAC_ViT. We also selected two very popular traditional trajectory generation and optimization methods—Fast-Planner [25] and EGO-Planner [26]—for comparison. We trained all of the DRL-based policies for 1200 episodes and measured their performance using the aforementioned metrics in three different scenarios, as shown in Table 2 and Figure 5. From Table 2, we have three observations. First, from simple to complex scenarios, the performance of all methods degraded, which is reasonable, since autonomous navigation is more difficult in complex environments. Second, our method outperformed previous state-of-the-art (SOTA) methods in most settings, with both a higher success rate and a higher speed. Third, the proposed model performed well on the timeout metric across multiple scenarios, which reflects the decisiveness of the strategy in situations where multiple optimal solutions may exist.
In the absence of pretrained weights, the SAC_ViT agent, based on a transformer architecture, learned by interacting with the environment. Although this model exhibited a low crash rate compared to the other models, this is because it only mastered basic control of the aircraft, and its success rate was the lowest across scenarios. Our analysis suggests that self-attention mechanisms require extensive learning and even prior knowledge, similar to the approach described in [32], where imitation learning from a large number of expert trajectories was first employed to endow the strategy with some exploratory ability from the outset. However, long-term training or the provision of expert trajectories is expensive and time-consuming. Under the same experimental settings as the proposed method, the performance of SAC_ViT still left significant room for improvement.
The SAC_RAE architecture performs multiple downsampling operations on the projected feature map in its autoencoder, which is prone to losing obstacle position information and therefore increases the probability of a collision or timeout. As shown in Figure 6, the average reward of this architecture fluctuated significantly during exploration.
Within the SAC_CNN architecture, only the gradient of the critic network was utilized in the perception module of the policy, thus resulting in the fastest initial learning speed. However, due to the instability of the critic during exploration, significant fluctuations can occur. As a result, the convergence performance of this policy was comparable to that of the SAC_RAE.
In addition, it is worth noting that the Fast-Planner algorithm was also effective, and its timeout cases were usually due to modifications of the target position made by the planner to ensure a better hovering attitude. In Fast-Planner, the inability to place control points near obstacles is a significant hindrance for flying robots in dense obstacle environments; its control point selection avoids obstacles as much as possible, but camera noise can sometimes lead to false perceptions of the planned trajectory, or of whether the control points lie within obstacles, thus triggering replanning. In practice, flying robots deployed with the Fast-Planner algorithm typically came within 2.5 m of the expected target point in around 66% of the cases, whereas actual arrival at the target point itself occurred only rarely (though open scenes could facilitate this). Since our experiments considered navigation successful only when the robot's Euclidean distance to its target was under 1.0 m, Fast-Planner could not match EGO-Planner on the success rate metric.
EGO-Planner achieved high success rates in most scenarios, especially in Scene 3, where its success rate was higher than that of all the reinforcement-learning-based methods. The reason is that the simulation environment can only render depth images at 10 Hz, which is insufficient for fast flight in a complex environment. If the camera signal is lost, the learned reinforcement learning strategy will sample a wrong action, thereby leading to a higher crash rate. However, EGO-Planner models the UAV dynamics and maintains local map information; if the camera signal is lost, the traditional planner can still obtain environment information from the built local map and complete the navigation.
For the DRL-based strategies, we also show the average return during the training process in Figure 6, where the shaded area represents the variance, which reflects the stability of the policy-training process.
6. Conclusions
In this study, we presented an effective focus autoencoder module, which performs lossless downsampling of feature maps through slicing operations and achieves a larger receptive field. Experimental results show that our method outperforms previous SOTA methods in most settings, with both a higher success rate and a higher speed. To demonstrate the effectiveness of our strategy, we conducted multiple experiments, ranging from simple to complex scenarios, in different complex environments. Our proposed method outperformed the baselines in multiple test environments, thereby exhibiting good robustness.
Although we believe there is great potential in combining deep reinforcement learning with transformers, training transformer-based agents remains challenging. More effective methods still need to be developed to drive progress in the community and make reinforcement-learning-based autonomous flying robots sufficiently capable.