Article

DTPPO: Dual-Transformer Encoder-Based Proximal Policy Optimization for Multi-UAV Navigation in Unseen Complex Environments

1 Department of Automation, Tsinghua University, Beijing 100190, China
2 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
3 Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90007, USA
4 Department of Information Systems, University of Cologne, 50923 Köln, Germany
5 SenseTime Research
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Drones 2024, 8(12), 720; https://doi.org/10.3390/drones8120720
Submission received: 15 October 2024 / Revised: 22 November 2024 / Accepted: 25 November 2024 / Published: 29 November 2024
(This article belongs to the Special Issue Advances in Detection, Security, and Communication for UAV)

Abstract

Existing multi-agent deep reinforcement learning (MADRL) methods for multi-UAV navigation face challenges in generalization, particularly when applied to unseen complex environments. To address these limitations, we propose a Dual-Transformer Encoder-Based Proximal Policy Optimization (DTPPO) method. DTPPO enhances multi-UAV collaboration through a Spatial Transformer, which models inter-agent dynamics, and a Temporal Transformer, which captures temporal dependencies to improve generalization across diverse environments. This architecture allows UAVs to navigate new, unseen environments without retraining. Extensive simulations demonstrate that DTPPO outperforms current MADRL methods in terms of transferability, obstacle avoidance, and navigation efficiency across environments with varying obstacle densities. The results confirm DTPPO’s effectiveness as a robust solution for multi-UAV navigation in both known and unseen scenarios.

1. Introduction

In recent years, unmanned aerial vehicles (UAVs) (also known as drones) have rapidly emerged as vital tools in numerous applications, ranging from search and rescue missions to infrastructure monitoring and delivery services [1,2]. However, the challenge of ensuring safe and efficient navigation in complex and dynamic environments, particularly when multiple UAVs are involved, remains an open problem. In multi-UAV scenarios, UAVs must coordinate their actions to avoid obstacles [3], maintain efficient paths [4], and successfully complete their missions in environments with limited or partially observable information. Various centralized multi-UAV navigation systems have been developed to address these challenges [5,6,7]. A central server manages all the UAVs’ actions by leveraging global information about their states and observations. This global control can guarantee safety and near-optimal path planning under ideal conditions as it allows for complete knowledge of the environment and inter-drone interactions. However, centralized systems face significant limitations, such as the heavy reliance on stable communication with a central server and the escalating computational burden as the number of UAVs increases, making them less scalable and vulnerable to failures if the server is compromised.
Compared to centralized methods, some traditional decentralized multi-UAV navigation systems [8,9], such as those based on the velocity obstacle framework, allow agents to make independent decisions while avoiding collisions [10,11]. These methods often require extensive communication between agents and are highly sensitive to environmental interference, making them difficult to implement in real-world scenarios. Zhang et al. proposed a GPS-free cooperative fusion localization method, which enables multi-UAV systems to accurately localize non-stationary targets through distributed observations and consensus mechanisms [12]. Mei et al. developed a fixed-time collision-free elliptical circumnavigation coordination strategy that ensures efficient obstacle avoidance and convergence within a guaranteed time frame [13]. Such approaches often rely on complex parameter tuning to achieve precise navigation strategies. Our work focuses on distributed control using multi-agent deep reinforcement learning (MADRL) algorithms [14,15,16], which allow UAVs to learn cooperative strategies in dynamic, uncertain environments without the need for constant communication or predefined rules.
Existing MADRL-based methods have shown promise in addressing the challenges of multi-UAV navigation [17,18,19,20]. These approaches model the problem as decentralized partially observable Markov decision processes (Dec-POMDPs) and apply deep reinforcement learning to train agents to make decisions based on their limited perception. Typical methods like the multi-agent deep deterministic policy gradient (MADDPG) [15] have been successfully applied to tasks such as formation control and obstacle avoidance, but they struggle with issues such as instability during training and limited generalization to more complex environments. Recent approaches based on recurrent deterministic policy gradient (RDPG) [21] and proximal policy optimization (PPO) [22] applied to multi-UAV navigation tasks have demonstrated advantages in handling partial observation and improving training stability, respectively. Despite these advancements, the trained models often face significant limitations when applied to new, unseen environments. Current methods typically require retraining in each new scenario, leading to substantial computational costs and rendering them impractical for real-time applications.
To address this issue, we propose a Dual-Transformer Encoder-Based Proximal Policy Optimization (DTPPO) method, which enables multi-UAV systems to transfer learned knowledge from known scenarios to new, unseen environments without the need for extensive retraining (as shown in Figure 1). The choice of the Transformer architecture over traditional recurrent neural networks (RNNs) is motivated by its ability to handle both spatial and temporal dependencies more effectively. Previous works [23,24] have shown that a Transformer can function as a meta-agent, establishing connections between immediate working memories to iteratively construct an episodic memory across its layers. Our approach incorporates two key components: (1) a Spatial Transformer, which enhances collaboration between neighboring UAVs by modeling the inter-agent dynamics; and (2) a Temporal Transformer, which captures the temporal evolution of multi-UAV trajectories across various environments. Compared to RNNs, which rely on sequential data processing and often struggle with long-term dependencies due to vanishing gradients, Transformers leverage self-attention mechanisms to simultaneously process all the elements in a sequence. This parallelism enables the Spatial Transformer to better capture the inter-agent relationships critical for multi-UAV coordination, while the Temporal Transformer efficiently models long-term temporal dependencies, resulting in enhanced generalization to unseen environments. This Dual-Transformer (Dual-T) architecture is explicitly designed to improve transferability across diverse environments with different obstacle densities and configurations. Through co-training across multiple scenarios, DTPPO ensures that the learned policies generalize well beyond the training environments, enabling UAVs to adapt quickly to new environments without retraining. Furthermore, by leveraging the powerful PPO algorithm, DTPPO balances exploration and exploitation, allowing for robust policy optimization in challenging navigation tasks.
In summary, the main contributions of this paper are as follows.
  • We introduce a novel Dual-Transformer architecture for multi-UAV navigation that enhances inter-agent coordination through spatial and temporal modeling.
  • We develop a co-training framework that allows UAVs to learn generalized navigation strategies across diverse environments with varying obstacle densities.
  • We validate the effectiveness of DTPPO through extensive simulations, demonstrating superior performance and transferability compared to state-of-the-art MADRL-based methods.
The remainder of this paper is organized as follows: Section 2 reviews the related work on multi-UAV navigation and deep reinforcement learning. Section 3 provides the necessary background and prior knowledge related to our problem setup. Section 4 outlines the proposed methodology, including the Dual-Transformer Encoder and PPO-based multi-scenario co-training. Section 5 details the experimental setup and results, and Section 6 concludes the paper with insights and future directions.

2. Related Works

In this section, we review the existing works on multi-UAV navigation with regard to deep reinforcement learning algorithms. In recent years, deep reinforcement learning (DRL), which has achieved great success in many control tasks, has been integrated to achieve UAV autonomous navigation and enhance real-time decision-making capabilities. Wang et al. [25] formulated the navigation problem as a partially observable Markov decision process (POMDP), and they employed an online DRL method to solve it. In [26], a function approximation-based RL algorithm was presented to deal with a large number of state representations and to obtain faster convergence. Li et al. [27] designed a DRL-based UAV navigation framework, which considers temporal abstractions and dynamically chooses the frequency of action decisions with efficiency regularization. To assist multiple UAVs in reaching their goal points without obstacle collision in unknown complex environments, many multi-agent DRL (MADRL) algorithms can be utilized to learn the optimal trajectory for each drone. In multi-UAV navigation, multi-agent deep deterministic policy gradient (MADDPG) [15] methods have been extensively applied to address complex tasks such as formation control, collaborative target tracking, and obstacle avoidance in dynamic environments [17,18,28]. The study of [18] leveraged MADDPG to simultaneously solve target assignment and path planning. To boost the learning effects in unstable 3D environments, Xue et al. [21] proposed a multi-agent recurrent deterministic policy gradient (MARDPG) algorithm for developing navigation policies for multiple UAVs. While these DPG-based methods excel in handling continuous action spaces and multi-UAV coordination, proximal policy optimization (PPO)-based methods have also gained significant attention in UAV navigation due to their robustness and ability to balance exploration and exploitation during policy optimization [19]. Multi-agent PPO (MAPPO) [20] can be applied in multi-UAV systems, enabling each UAV to learn its own policy while still benefiting from centralized training. Hodge et al. [22] developed an adaptive navigation framework using MAPPO combined with incremental curriculum learning, allowing UAVs to efficiently track targets using real-time sensor data. To tackle the challenge of exploring unknown complex environments, Moltajaei Farid et al. [29] employed on-policy RL with MAPPO to guide multiple UAVs in exploring areas of interest. Additionally, Chikhaoui et al. [30] integrated energy constraints into a MAPPO-based DRL framework, enhancing UAV efficiency and extending operational duration.
Although the aforementioned MADRL methods enable UAVs to learn efficient navigation strategies in complex and dynamic environments, they are environment-specific (in other words, training and testing must be conducted in the same environment). Even if UAVs are trained using MADRL algorithms across multiple different maps or environments to learn a general navigation strategy, their performance remains limited in unseen environments. Therefore, this study aims to achieve strong generalization performance by coordinating multiple UAVs across various environments.

3. Preliminary

In this work, we studied the multi-UAV navigation task across various complex and dynamic environments. We introduce the UAV system model and problem statement as follows.

3.1. UAV System Model

Referring to prior works [21,31], we modeled the UAV as a quadrotor with a 12-dimensional state, which includes the absolute position $[x, y, z]$ of the UAV in the world coordinate frame, the Euler angles $[\phi_1, \phi_2, \phi_3]$ representing the UAV’s rotation state, the velocity $[v_x, v_y, v_z]$ along the three axes of the coordinate frame, and the angular velocity $[\omega_x, \omega_y, \omega_z]$. Thus, the complete state vector $s$ can be expressed as $s = [x, y, z, \phi_1, \phi_2, \phi_3, v_x, v_y, v_z, \omega_x, \omega_y, \omega_z]$. The state $s$ of a UAV captures both its position and orientation in the 3D space. To control the UAV, we utilized a 4-dimensional velocity vector as the control action $a = [v_x, v_y, v_z, v_M]$, where $v_x$, $v_y$, and $v_z$ are the components of a unit vector representing the direction of motion in the 3D space, and $v_M$ denotes the magnitude of the desired velocity. Thus, the control action $a$ can specify the direction and speed at which the UAV should move.
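For concreteness, the state and action vectors described above can be assembled as in the following minimal sketch; the helper function names are ours for illustration, not part of the paper or of any simulator API.

```python
import numpy as np

# Illustrative only: pack the 12-dimensional quadrotor state and the
# 4-dimensional control action described in Section 3.1.

def make_state(position, euler_angles, velocity, angular_velocity):
    """s = [x, y, z, phi1, phi2, phi3, vx, vy, vz, wx, wy, wz]."""
    return np.concatenate([position, euler_angles, velocity, angular_velocity]).astype(float)

def make_action(direction, speed):
    """a = [vx, vy, vz, vM]: a unit direction vector plus a desired speed magnitude."""
    d = np.asarray(direction, dtype=float)
    d = d / (np.linalg.norm(d) + 1e-8)  # normalize the direction to unit length
    return np.append(d, float(speed))

s = make_state([1.0, 2.0, 5.0], [0.0, 0.0, 0.1], [0.5, 0.0, 0.0], [0.0, 0.0, 0.0])
a = make_action([1.0, 1.0, 0.0], speed=0.8)
assert s.shape == (12,) and a.shape == (4,)
```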
To successfully reach the designated target point without colliding with obstacles in the environment, MADRL will be applied to control multi-UAV navigation in complex environments. During navigation, environmental information is collected in real time by the UAV’s sensors, and the corresponding control actions are selected. After executing the actions, the UAV transitions to a new state and receives feedback from the environment. Using this feedback, the UAV can update its action selection strategy, enabling it to reach the target more efficiently while avoiding obstacles in the environment. In this paper, we aimed to design a MADRL algorithm that enables multiple UAVs to learn general and effective action strategies for navigation tasks, even in different complex environments, such as those with varying terrains or obstacle densities.

3.2. Problem Statement

The problem of multi-agent UAV action control in various scenarios can be formulated as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [32]. The goal for multiple UAVs is to cooperate and navigate safely through each scenario while avoiding obstacles and efficiently reaching their target destinations. Given a set of environments $E$ with different types of obstacles and obstacle densities, each agent $i$ controls a drone $D_i$ in an environment $e \in E$. We consider the top $n$ nearest neighboring drones $D_{N_i}$ of drone $D_i$ within its sensing range, where $N_i = \{N_1, \ldots, N_n\}$.
Then, we represent this Dec-POMDP using the tuple $\langle S, O, A, P, r, \gamma \rangle$, where $S$ is the state space and $s_t \in S$ denotes the state of all drones at time step $t$. The local observation can be obtained through an observation function $D(s): S \rightarrow O$. $A$ denotes the action space for each agent. When $m$ agents take a joint control action $a_t = \{a_t^1, \ldots, a_t^m\}$ in the environment $e$, the state transition $P(s_{t+1} \mid s_t, a_t): S \times A \rightarrow S$ occurs and each agent $i$ obtains a reward $r_t^i$. Due to the limited sensing range of the drone, the environment is partially observed, and each agent $i$ only has access to the joint actions $a_t^{i,N_i}$ and the local observation $o_{t+1}^{i,N_i}$, which, respectively, include the local control actions and state transitions of the target drone $D_i$ and its top $n$ nearest neighboring drones $D_{N_i}$. Therefore, each agent obtains $(o_{t+1}^{i,N_i}, a_t^{i,N_i}, r_t^i)$ at the next time step $t+1$. When updating the action policy, the cumulative reward over all agents in each scenario, $\sum_{i=1}^{m} \sum_{t} \gamma^{t} r_t^{i}$, is expected to be maximized, where $\gamma$ denotes the discount factor.
In this paper, we aimed to develop a generalized multi-UAV navigation policy capable of performing well across various scenarios, even though these scenarios have not been encountered during training. As shown in Figure 2, maps with different obstacle types and varying obstacle densities represent distinct environments. Our objective is to learn an action control policy parameterized by θ , which can distinguish between tasks (i.e., learning on different environments) in the embedding space and minimize the loss across these diverse tasks:
$\theta^{*} = \arg\min_{\theta} \frac{1}{m|E|} \sum_{e \in E} \sum_{i=1}^{m} \mathcal{L}\left(f_{\theta}(D_i), D_i\right),$
where $\theta$ represents the policy parameters, and $f_{\theta}(D_i)$ is the control action output for UAV $D_i$, which denotes the policy used to solve the navigation task in environment $e$.

4. Methodology

In this section, we present a general MADRL method for the cross-scenario multi-UAV navigation task, which is referred to as DTPPO. We first provide an overview of our method, followed by the introduction of the Dual-Transformer (Dual-T) Encoder module, which is composed of the Spatial Transformer and the Temporal Transformer. Additionally, we illustrate the details of the co-training process across diverse scenarios using the PPO algorithm.

4.1. Overview of DTPPO

The overall training process of our method is shown in Figure 3. DTPPO is trained using the UAVs’ MDP trajectories across multiple environments within a batch. Specifically, for a target agent $i$ and its neighboring agents $N_i$, their MDP trajectories $(o, a, r)$ in a certain range of time steps $[t-L, t]$ are sampled and fed into the Dual-T Encoder module, where $L$ denotes the length of the time frame. The Dual-T Encoder is composed of two transformers: the Spatial Transformer and the Temporal Transformer. At time step $t$, the Spatial Transformer takes the MDP information of each UAV and its neighboring UAVs as input, enhancing the collaboration between the agents within the UAV’s sensing range. The Temporal Transformer utilizes historical MDP trajectories as context to infer the current task, thereby improving transferability.
Referring to previous work [21,31], four types of kinematic information were selected from the observations as states: the absolute position, Euler angles, velocity, and angular velocity. Each UAV utilizes a 4-dimensional velocity vector as its control action to execute the next movement. The full observation for each agent, $o^i = o^{i,N_i}$, contains the local observations from the target agent $i$ and its neighbors. The local observation consists of the current state information concatenated with the historical actions over the last $\Delta t$ time steps, where we set $\Delta t = 15$. The reward $r$ can be defined as the weighted sum of three components: transfer reward, collision penalty, and free space reward. The transfer reward is denoted as follows.
$r_{\mathrm{trans}} = \left\lVert x_{\mathrm{target}} - x_{t-1} \right\rVert_2 - \left\lVert x_{\mathrm{target}} - x_t \right\rVert_2 + \max\left( 0,\; 2 - \left\lVert x_{\mathrm{target}} - x_t \right\rVert_2 \right)$
The first term in the function $r_{\mathrm{trans}}$ measures the change in distance to the target between consecutive time steps, and the second term ensures that if the UAV gets very close to the target (i.e., within a distance of 2 units), it receives an additional positive reward. Thus, the transfer reward encourages the UAV to efficiently approach its target while avoiding unnecessary detours. Combining the collision penalty $r_{\mathrm{col}}$ (which we set to 1.0) and the free space reward $r_{\mathrm{free}}$ (which we set to 0.04), which encourages the UAVs to explore toward a safe space, we define the total reward function as
$r_{\mathrm{total}} = \lambda_1 r_{\mathrm{trans}} + \lambda_2 r_{\mathrm{col}} + \lambda_3 r_{\mathrm{free}},$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are scale factors.
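As a concrete illustration, the reward above can be computed as in the sketch below. The coefficient values are those reported in Section 5.1, while the function names and the sign convention applied to the collision penalty are our own assumptions rather than the authors' code.

```python
import numpy as np

# Illustrative sketch of the total reward in Section 4.1 (not the authors' implementation).
LAMBDA_TRANS, LAMBDA_COL, LAMBDA_FREE = 0.45, 0.30, 0.25   # scale factors from Section 5.1
R_COL, R_FREE = 1.0, 0.04                                   # reported penalty/reward magnitudes

def transfer_reward(x_target, x_t, x_prev):
    """Progress toward the target plus a bonus once within 2 units of it."""
    d_t = np.linalg.norm(np.asarray(x_target) - np.asarray(x_t))
    d_prev = np.linalg.norm(np.asarray(x_target) - np.asarray(x_prev))
    return (d_prev - d_t) + max(0.0, 2.0 - d_t)

def total_reward(x_target, x_t, x_prev, collided, in_free_space):
    r_trans = transfer_reward(x_target, x_t, x_prev)
    r_col = -R_COL if collided else 0.0    # assumption: the penalty enters with a negative sign
    r_free = R_FREE if in_free_space else 0.0
    return LAMBDA_TRANS * r_trans + LAMBDA_COL * r_col + LAMBDA_FREE * r_free
```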

4.2. Dual-Transformer Encoder Module

The Dual-Transformer Encoder (Dual-T Encoder) module is the core of the DTPPO algorithm, and it is designed to handle both the spatial collaboration between UAVs and the temporal dynamics of their MDP trajectories across various environments. It includes the Spatial Transformer and the Temporal Transformer, working together to process the transition for each agent.

4.2.1. Spatial Transformer

The Spatial Transformer is designed to enhance collaboration between the target UAV $D_i$ and its top $n$ nearest neighbors $D_{N_i}$ within the sensing range. Here, we set $n = 4$. At each time step $t$, the Spatial Transformer has access to the MDP features, including the current observations $o_t$, the previous actions $a_{t-1}$, and the rewards $r_{t-1}$ from both the target UAV and its neighboring UAVs. Unlike previous MADRL-based navigation methods [33,34], which only consider the states of neighboring drones for cooperation, the Spatial Transformer considers the complex interrelations among neighboring drones’ observations, actions, and rewards. Regardless of the type of map, UAVs share a common characteristic: within a group of $n+1$ closely located UAVs, one UAV’s action will affect another’s navigation route decisions. Therefore, in the policy learning process, considering only the states is inadequate for capturing the mutual influence between neighboring UAVs, which can further exacerbate instability during the co-training process across various scenarios [24,35].
In the Spatial Transformer, for each UAV, we leverage the full MDP features $m_t^i$ from the target drone $i$ and its neighbors to strengthen coordination during navigation. As DTPPO is an online RL algorithm, only the current observation $o_t$ and the previous action and reward $a_{t-1}, r_{t-1}$ can be acquired. The concatenated MDP features of agent $i$ can be expressed as $m_t^i = [o_t^i, a_{t-1}^i, r_{t-1}^i]$. As $o_t^i$, $a_{t-1}^i$, and $r_{t-1}^i$ have different dimensions, they are passed through three different learnable linear projections $W = [W_o, W_a, W_r]$, which transform them into a common $d$-dimensional latent space:
$m_t^i W = \left[\, o_t^i W_o,\; a_{t-1}^i W_a,\; r_{t-1}^i W_r \,\right] \in \mathbb{R}^{3 \times d}.$
By concatenating the projected features of the neighboring agents, the full MDP transition of agent $i$ at time step $t$ can be defined as
$M_t^i = \left[\, m_t^i W;\; m_t^{N_1} W;\; \ldots;\; m_t^{N_n} W \,\right] \in \mathbb{R}^{3(n+1) \times d}.$
The resulting embedding $M_t^i$ encapsulates the spatial relationships and cooperation among UAVs, which is essential for effective multi-UAV collaboration, especially in densely populated or obstacle-rich environments. Notably, when the number of neighbors is fewer than $n$, we apply zero-padding and add a binary indicator embedding to $o_t$ and $a_{t-1}$ to indicate whether the neighboring drone exists.
Referring to the works [24,36], we prepended a learnable $[\mathrm{decision}]$ token $q_{\mathrm{decision}}$ so that its state at the output of the Spatial Transformer serves as the drone’s representation $d_t$. Moreover, a standard positional embedding $E_{\mathrm{pos}}^{S} \in \mathbb{R}^{3(n+1) \times d}$ is added to each input token to retain positional information [37], and the input to the Spatial Transformer at time step $t$ is given by
$z_{t,i}^{S} = \left[\, q_{\mathrm{decision}};\; m_t^i W;\; m_t^{N_1} W;\; \ldots;\; m_t^{N_n} W \,\right] + E_{\mathrm{pos}}^{S}.$
Then, we feed $z_{t,i}^{S}$ to the Spatial Transformer with multi-head self-attention layers and obtain the drone’s embedding $h_t^i = \mathrm{SpatialTransformer}(z_{t,i}^{S})$.
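The construction above can be sketched in PyTorch roughly as follows. The layer sizes, class name, interleaved token order, and zero-initialized embeddings are illustrative assumptions rather than the exact DTPPO architecture.

```python
import torch
import torch.nn as nn

class SpatialTransformerSketch(nn.Module):
    """Minimal sketch of the projections m_t^i W and the input z_{t,i}^S (not the exact model)."""

    def __init__(self, obs_dim, act_dim, d_model=64, n_heads=4, n_layers=3, n_neighbors=4):
        super().__init__()
        self.W_o = nn.Linear(obs_dim, d_model)   # projection W_o for observations o_t
        self.W_a = nn.Linear(act_dim, d_model)   # projection W_a for previous actions a_{t-1}
        self.W_r = nn.Linear(1, d_model)         # projection W_r for previous rewards r_{t-1}
        self.decision_token = nn.Parameter(torch.zeros(1, 1, d_model))
        n_tokens = 1 + 3 * (n_neighbors + 1)     # [decision] token + 3 tokens per drone
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, d_model))   # plays the role of E_pos^S
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, obs, prev_act, prev_rew):
        # obs: (B, n+1, obs_dim); prev_act: (B, n+1, act_dim); prev_rew: (B, n+1, 1),
        # where index 0 holds the target drone and the rest its neighbors.
        # Token order is grouped by modality here, which differs slightly from the equation.
        tokens = torch.cat([self.W_o(obs), self.W_a(prev_act), self.W_r(prev_rew)], dim=1)
        cls = self.decision_token.expand(obs.size(0), -1, -1)
        z = torch.cat([cls, tokens], dim=1) + self.pos_embed
        h = self.encoder(z)
        return h[:, 0]   # the state at the [decision] token serves as the drone embedding h_t^i

model = SpatialTransformerSketch(obs_dim=32, act_dim=4)
h = model(torch.randn(8, 5, 32), torch.randn(8, 5, 4), torch.randn(8, 5, 1))
print(h.shape)   # torch.Size([8, 64])
```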

4.2.2. Temporal Transformer

The Temporal Transformer plays a crucial role in ensuring that the model generalizes well to unseen environments by capturing long-term temporal dependencies. It processes the sequence of embeddings $h_{[t-L:t]}^{i}$ generated by the Spatial Transformer over the last $L$ time steps, utilizing multi-head self-attention to extract temporal relationships. Thus, DTPPO is a context-based MADRL method.
At each time step $t$, the Temporal Transformer takes as input the spatial embeddings $h_{[t-L:t]}^{i}$ for agent $i$ obtained from the Spatial Transformer, which are first projected to a lower-dimensional space using a trainable projection matrix $W$. These projections encode the relevant spatial and temporal features, enabling the Temporal Transformer to capture the task-related dynamics over time steps. Similarly, we add a positional embedding $E_{\mathrm{pos}}^{T} \in \mathbb{R}^{L \times d}$ (where $d$ denotes the lower dimensionality) to retain the sequential order of the input. The input to the Temporal Transformer for the time window $[t-L, t]$ is
$z_{[t-L:t],i}^{T} = \left[\, h_{t-L}^{i} W;\; h_{t-L+1}^{i} W;\; \ldots;\; h_{t}^{i} W \,\right] + E_{\mathrm{pos}}^{T}.$
Then, the Temporal Transformer, which consists of six multi-head self-attention layers, operates over the input using the attention mechanism, whose attention maps lie in $\mathbb{R}^{L \times L}$. Thus, it can capture the evolving environmental dynamics related to the UAVs by leveraging historical data (i.e., MDP trajectories), and it can extract meaningful patterns for the UAV’s next control actions over time. The output of the Temporal Transformer can be defined as
$h_{[t-L:t],i}^{\mathrm{output}} = \mathrm{TemporalTransformer}\left(z_{[t-L:t],i}^{T}\right).$
To further enhance the UAV’s understanding of environmental dynamics, we introduce a dynamic predictor that operates on the outputs of the Temporal Transformer at consecutive time steps. This dynamic predictor performs autoregressive prediction, which encourages the Temporal Transformer to effectively model the cross-scenario dynamics. Specifically, the predictor takes the output at the previous time step $h_{t-1,i}^{\mathrm{output}}$ and concatenates it with the joint actions $a_{t-1}^{i,N_i}$ and rewards $r_{t-1}^{i,N_i}$ from the target UAV and its neighbors. The goal is to predict the next temporal embedding $\hat{h}_{t,i}^{\mathrm{output}}$ using a single-layer MLP:
$\hat{h}_{t,i}^{\mathrm{output}} = \mathrm{MLP}\left(\left[\, h_{t-1,i}^{\mathrm{output}},\; a_{t-1}^{i,N_i},\; r_{t-1}^{i,N_i} \,\right]\right).$
The training objective of the dynamic predictor is to minimize the prediction loss $l_{\mathrm{pred}} = \mathrm{MSE}(\hat{h}_{t,i}^{\mathrm{output}}, h_{t,i}^{\mathrm{output}})$, which is defined as the mean squared error (MSE) between the predicted embedding $\hat{h}_{t,i}^{\mathrm{output}}$ and the actual output embedding $h_{t,i}^{\mathrm{output}}$ of the Temporal Transformer.
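A corresponding sketch of the Temporal Transformer and the dynamic predictor is given below. The dimensions, class names, and single-linear-layer predictor are assumptions consistent with the description above, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalTransformerSketch(nn.Module):
    """Processes the L spatial embeddings h_{[t-L:t]}^i with self-attention (illustrative)."""

    def __init__(self, d_in, d_model=64, n_heads=4, n_layers=3, horizon=20):
        super().__init__()
        self.proj = nn.Linear(d_in, d_model)                              # projection to a lower dimension
        self.pos_embed = nn.Parameter(torch.zeros(1, horizon, d_model))   # plays the role of E_pos^T
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, h_seq):                                             # h_seq: (B, L, d_in)
        z = self.proj(h_seq) + self.pos_embed[:, : h_seq.size(1)]
        return self.encoder(z)                                            # (B, L, d_model)

class DynamicPredictor(nn.Module):
    """Predicts the next temporal embedding from h_{t-1}^output and the joint actions/rewards."""

    def __init__(self, d_model, joint_act_dim, joint_rew_dim):
        super().__init__()
        self.mlp = nn.Linear(d_model + joint_act_dim + joint_rew_dim, d_model)  # single-layer MLP

    def forward(self, h_prev, joint_act, joint_rew):
        return self.mlp(torch.cat([h_prev, joint_act, joint_rew], dim=-1))

def prediction_loss(h_hat, h_target):
    # l_pred: MSE between the predicted and the actual temporal embedding
    return F.mse_loss(h_hat, h_target)
```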

4.3. PPO-Based Co-Training on Various Scenarios

To learn the decision policy, the output of the Dual-T Encoder is used as input to the Actor–Critic framework in the PPO algorithm [38]. Specifically, both the Actor and Critic networks are implemented as two-layer MLPs, where the Actor generates the control actions for the UAV and the Critic evaluates the state value to guide the learning process. For the policy $\pi$, the actor network takes $h^{\mathrm{output}}$ as input and produces the control action $a_t^i$ for the target drone $i$. In addition, we implement a residual link to prevent over-abstraction of the agent’s embedding by the Dual-T Encoder. The residual connection adds the direct self-observation $o_t^i$ to $h_{t,i}^{\mathrm{output}}$, ensuring that the actor has both a high-level abstract representation of the current environment and enough up-to-date observation information from the target drone $i$. The actor network then outputs the action $a_t^i$ using the policy $\pi$ as follows:
$a_t^i \sim \pi\left(\cdot \mid h_{t,i}^{\mathrm{output}} + o_t^i\right).$
In Equation (10), $h_{t,i}^{\mathrm{output}}$ represents the high-level feature embedding generated by the Dual-T Encoder. It provides a comprehensive context for decision making within the dynamic and complex environment, also enhancing the generalization across diverse scenarios. Conversely, $o_t^i$ represents the self-observation of the target UAV, focusing on its current state. This is critical for making precise, real-time adjustments in response to sudden environmental changes. Thus, combined with the prediction loss $l_{\mathrm{pred}}$, the overall optimization objective function can be written as
$l_{\mathrm{DTPPO}} = \delta_1 l_{\mathrm{actor}} + \delta_2 l_{\mathrm{critic}} + \delta_3 l_{\mathrm{pred}},$
where $\delta_1$, $\delta_2$, and $\delta_3$ denote hyperparameters. The Actor loss $l_{\mathrm{actor}}$ and Critic loss $l_{\mathrm{critic}}$ follow the original PPO method [38]. Finally, we employ co-training across multiple scenarios to increase the diversity of the training data for better model generality. The UAVs are stochastically chosen from various scenarios within each training batch. In these scenarios, there are obstacles and structures of various shapes and obstacle densities, which correspond to navigation tasks following different task distributions. This setup encourages the agent to learn more generalized knowledge while enabling a stable learning process. The training process of DTPPO is summarized in Algorithm 1.
Algorithm 1 Training process of DTPPO
Input: A set of target UAVs $D$ from various scenarios $S$, training episodes $E$, the number of neighbors $n$, the input length $L$ for the Temporal Transformer, the PPO epochs $Epoch$.
Initialize: MDP buffer $\mathcal{D}$, policy parameters $\theta$.
1: for episode = 1 to $E$ do
2:    Initialize buffer $\mathcal{D}$
3:    for each scenario $s \in S$ in parallel do
4:       Use the top $n$ nearest neighbors $N_i$ for each UAV $i$
5:       for each time step $t$ do
6:          Retrieve the last $L$ transitions $\{m_{t-l}^{i}\}_{l=0}^{L}$ for each UAV and add them to buffer $\mathcal{D}$
7:          Make action $a_t^i$ using policy $\pi_\theta$ according to Equation (10), and take the joint action $\{a_t^1, \ldots, a_t^m\}$
8:          Observe the next state $o_{t+1}^i$ and current reward $r_t^i$
9:       end for
10:    end for
11:    for $e$ = 1 to $Epoch$ do
12:       Sample mini-batch data from buffer $\mathcal{D}$
13:       Calculate dynamic predictions $\{\hat{h}_{t-l,i}^{\mathrm{output}}\}_{l=0}^{L-1}$
14:       Compute the total loss $l_{\mathrm{DTPPO}}$ using Equation (11) and update policy parameters $\theta$
15:    end for
16: end for
17: return Optimized policy $\pi_\theta$
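For reference, the total loss of Equation (11), used in line 14 of Algorithm 1, can be written as in the following hedged sketch. It combines a standard clipped PPO surrogate, an MSE value loss, and the dynamic-prediction loss, with the coefficient values of Table 1 as defaults; the argument names are placeholders, not the authors' API.

```python
import torch
import torch.nn.functional as F

def dtppo_loss(log_prob, old_log_prob, advantage, value, value_target,
               h_hat, h_target, clip_eps=0.2, d1=1.0, d2=1.0, d3=1e-2):
    """l_DTPPO = d1 * l_actor + d2 * l_critic + d3 * l_pred (sketch, not the authors' code)."""
    ratio = torch.exp(log_prob - old_log_prob)                            # importance sampling ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    l_actor = -torch.min(ratio * advantage, clipped * advantage).mean()   # clipped PPO surrogate
    l_critic = F.mse_loss(value, value_target)                            # value-function loss
    l_pred = F.mse_loss(h_hat, h_target)                                  # dynamic-prediction loss
    return d1 * l_actor + d2 * l_critic + d3 * l_pred
```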

4.4. Computational Complexity Analysis

To analyze the computational complexity of our method, we evaluated two core components in the Dual-Transformer architecture: the Spatial Transformer and the Temporal Transformer.
  • The Spatial Transformer processes the interactions between a UAV and its $n$ nearest neighbors. The key computational cost arises from the self-attention mechanism, which operates on the concatenated state, action, and reward embeddings of the UAV and its neighbors. Assuming the embedding dimension is $d$, the time complexity of the self-attention in the Spatial Transformer is $O(n^2 d)$. As $n$ refers to the fixed number of neighbors considered for each UAV, restricting $n$ to a small, constant value (e.g., $n = 4$ in our implementation) keeps the computational cost scalable, even as the number of UAVs in the environment increases.
  • The Temporal Transformer captures long-term dependencies over a sliding window of $L$ time steps. For each UAV, the self-attention mechanism within the Temporal Transformer has a complexity of $O(L^2 d)$. Since $L$ represents the window size rather than the total length of the trajectory, its value is kept constant (e.g., $L = 20$ in our implementation) to control the computational cost.
Combining both components, the computational complexity per UAV is $O(n^2 d + L^2 d)$. Since $n$ and $L$ are fixed and do not grow with the total number of UAVs, the per-agent complexity remains constant. Therefore, the overall complexity for a system with $m$ UAVs is $O(m(n^2 d + L^2 d))$. This design ensures that the computational burden scales linearly with the number of UAVs, making it feasible for large-scale multi-UAV systems.
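Plugging in the values used in this paper gives a rough sense of the per-agent cost; the numbers below are unit-less operation counts from the asymptotic expressions only, ignoring constant factors and feed-forward layers, and the fleet size of 100 is an arbitrary example.

```python
n, L, d, m = 4, 20, 149, 100            # neighbors, horizon, embedding dim, example fleet size
spatial = n**2 * d                      # O(n^2 d)  -> 2,384
temporal = L**2 * d                     # O(L^2 d)  -> 59,600
per_agent = spatial + temporal          # 61,984 attention-score terms per agent per step
print(per_agent, m * per_agent)         # total cost scales linearly with the number of UAVs m
```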

5. Experiment

5.1. Experiment and Parameter Setting

We utilized the simulated environment gym-pybullet-drones [31], which supports the random generation of maps. The environment includes three types of obstacles: square pillars, cylinders, and mixed 3D obstacles. We refer to these environments as Scene-I, Scene-II, and Scene-III, respectively. These settings are designed to replicate real-world obstacles, such as urban buildings and varying terrain features. During training, the UAV agents navigate through these randomly generated environments. Obstacle density is defined as the percentage of space within the environment occupied by obstacles, with higher densities posing a greater challenge for UAV navigation. We used obstacle densities of 10%, 25%, and 50% for each type of map, resulting in a total of nine different maps for multi-scenario co-training. This setup encourages generalization across diverse obstacle distributions and task settings. During evaluation, we used six generated unseen maps (as shown in Figure 2) for testing our method.
The altitude of the UAVs is limited to the range [0.0 m, 30.0 m]. The control signal was normalized to the range [−1, 1] for stability. The reward function parameters were set as follows: transfer reward coefficient $\lambda_1 = 0.45$, collision penalty coefficient $\lambda_2 = 0.30$, and free space reward coefficient $\lambda_3 = 0.25$. The exploration reward was set to $r_{\mathrm{free}} = 0.04$ and the collision penalty was set to $r_{\mathrm{col}} = 1.0$. When training our method, the hyperparameters used in the model were carefully tuned based on preliminary experiments to achieve optimal performance. The details of the hyperparameters are listed in Table 1. All of the simulations were run on an Ubuntu 20.04 system with 32 GB RAM and a Tesla V100 GPU. The UAVs were trained for a total of 1,000,000 episodes across multiple environments, which required approximately 38 h to complete.

5.2. Baselines

The proposed DTPPO was compared to the following baseline methods to evaluate its effectiveness. The same states, actions, and rewards were applied in all baselines.
  • MADDPG uses feedforward neural networks for learning. In MADDPG, the UAVs are trained in a centralized manner but execute their learned policies independently (decentralized execution). This method addresses the challenges of non-stationarity in multi-agent environments and reduces the variance in training across multiple UAVs.
  • MARDPG extends RDPG to the multi-agent deep reinforcement learning settings. In MARDPG, each UAV perceives all other UAVs as part of the environment, without direct communication or cooperation between them. This can be referred to as Ind-MARDPG, where each UAV’s navigation policy is trained using a recurrent deterministic policy gradient. The UAVs in the environment independently adopt the same policy without any exchange in the information between agents.
  • MAPPO is an extension of the single-agent PPO algorithm to multi-agent systems. It combines centralized training with decentralized execution, where each UAV learns its own policy but benefits from joint learning with other agents. MAPPO offers more stable learning through the PPO clipping mechanism, which helps to avoid large policy updates. This makes MAPPO particularly suited for complex, dynamic environments where cooperation between agents is crucial.

5.3. Evaluation Metrics

To evaluate the performance of our proposed method, we utilize a set of quantitative metrics that capture the overall efficiency, safety, and robustness of the learned policies. The test metrics are presented as follows, with a short aggregation sketch after the list.
  • Average Transfer Reward: This metric measures the average reward obtained by all the UAVs during their navigation toward the target in different environments. It reflects the efficiency of the learned navigation policies, with higher rewards indicating better performance in reaching the goal.
  • Average Collision Penalty: This metric records the average penalty incurred when any UAV collides with obstacles. It helps assess the safety of the navigation policies, with lower penalties indicating better obstacle avoidance and safer navigation.
  • Average Free Space: This metric evaluates how well the UAVs navigate through open, obstacle-free areas by averaging the rewards earned for doing so. It indicates how effectively the UAVs avoid obstacles while maintaining efficient movement through less congested regions.
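As a simple illustration, the three averages above can be computed from per-episode logs as below; the log structure and field names are assumptions made for this example, not the evaluation code used in the paper.

```python
import numpy as np

def summarize(episodes):
    """episodes: list of dicts holding per-episode totals for the three reward components."""
    return {
        "avg_transfer_reward": float(np.mean([e["transfer"] for e in episodes])),
        "avg_collision_penalty": float(np.mean([e["collision"] for e in episodes])),
        "avg_free_space_reward": float(np.mean([e["free"] for e in episodes])),
    }

logs = [{"transfer": 250.1, "collision": 1.4, "free": 4.8},
        {"transfer": 243.7, "collision": 1.1, "free": 4.5}]
print(summarize(logs))
```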

5.4. Simulation Results

In this section, we show the superior transferability and consistently strong performance of DTPPO when performing navigation tasks in different unseen scenarios after training.

5.4.1. Transferability on the Unseen Scenario

We evaluate the transferability of DTPPO using a zero-shot setting, where the model is trained on several scenarios and then directly tested on unseen scenarios. As shown in Table 2, each column of results shows the performance of transferring to a new, unseen scenario after training on the preset nine scenarios. The results clearly demonstrate that DTPPO achieves the best transfer performance in all the tested scenarios compared to the other baseline methods.
Cooperation is Key: Our results highlight the importance of cooperation between UAVs for better transferability. DTPPO, by leveraging its Dual-Transformer architecture, enables efficient coordination among neighboring agents, which significantly improves navigation in unseen environments. This is particularly evident when compared to the baseline MADDPG, which does not model inter-agent collaboration to the same extent.
Generalization to High-Density Obstacle Scenarios: DTPPO excels in high-density obstacle scenarios, where the complexity of navigation increases substantially. For example, in Scene-III with 50% obstacle density, DTPPO achieves a transfer reward of 214.55, far surpassing other methods like MAPPO (134.90) and MARDPG (77.69). This indicates that our model is able to generalize well even in challenging environments by learning more robust policies during training.
Lower Collision Rates: In addition to higher transfer rewards, DTPPO maintains lower collision penalties across all scenarios. In Scene-II with 50% obstacle density, DTPPO achieves a collision penalty of only 2.56, which is significantly lower than MAPPO (5.80) and MADDPG (24.27). This demonstrates that DTPPO’s learned policies are effective in avoiding obstacles while navigating through unseen environments.
Efficient Use of Free Space: DTPPO also makes better use of the available free space in the environment, as evidenced by the higher avg. free space reward. In Scene-II with 10% obstacle density, DTPPO achieved a reward of 5.17, outperforming all other baselines. This suggests that the model can efficiently navigate and utilize free areas, improving its overall navigation performance in novel environments.
Thus, DTPPO showed remarkable transferability and superior performance when handling unseen scenarios, demonstrating the strength of its design for multi-UAV navigation tasks in dynamic and complex environments.

5.4.2. Performance in Non-Transfer Setting

In this non-transfer setting, as shown in Table 3, each scenario for testing is already seen during training. Our method still achieved the best results in all seen scenarios, demonstrating enhanced performance over MAPPO and MARDPG. For example, in the Scene-I (10%) case, DTPPO yielded an average transfer reward of 262.89, which was significantly higher than MAPPO’s 175.51 and MARDPG’s 101.34. This improvement was consistent across all other scenarios, showing DTPPO’s robustness even in non-transfer settings. Moreover, the performance drop observed in Scene-II (50%) and Scene-III (50%) can be attributed to the higher complexity of these environments with denser obstacles. DTPPO consistently outperformed the other baselines by maintaining superior exploration capabilities, as reflected in its higher transfer rewards and free space rewards. In terms of collision penalty, DTPPO registered the lowest penalty values across all scenarios, indicating safer navigation capabilities compared to MAPPO and MARDPG.
Furthermore, Figure 4 shows the transfer reward optimization process for the top three methods. DTPPO consistently outperformed the other two approaches in terms of both convergence speed and final performance. The learning curves also highlight the stability of DTPPO during training, particularly in more challenging environments like Scene-II (50%) and Scene-III (50%), where MAPPO and MARDPG struggle with higher variance. In conclusion, DTPPO’s ability to maintain high performance in both non-transfer and transfer settings, along with its superior learning stability, makes it an ideal solution for UAV navigation tasks in various obstacle-dense environments.

5.4.3. Ablation Study

In our ablation study, we investigated the impact of removing key components from the DTPPO framework. The results in Figure 5 illustrate the performance drop across six different test scenarios when excluding each of the following components.
  • w/o Spatial Transformer: The removal of the spatial transformer, which facilitates inter-agent collaboration, resulted in the most significant drop in average transfer reward, especially in dense environments such as Scene-II (50%) and Scene-III (50%). This emphasizes the critical role of spatial collaboration in complex, obstacle-filled environments.
  • w/o Temporal Transformer: Replacing the temporal transformer with a GRU led to a noticeable decline in performance, particularly in scenarios like Scene-II (50%). The ability to model temporal dependencies is crucial for maintaining high transfer rewards.
  • w/o Residual Link: Removing the residual link significantly reduced the performance across all scenarios, with the most pronounced drops observed in Scene-II (50%) and Scene-III (50%). In these scenarios, the transfer reward sharply decreased compared to the full model, underscoring the critical role of self-observation in dense environments. Without the residual link, the model loses the ability to incorporate immediate feedback from its own state, resulting in less accurate decision making and reduced performance, especially in more challenging environments.

5.4.4. Varying the Numbers of Scenarios

We varied the number of scenarios used for co-training among 1, 3, 5, 7, and 9, and we investigated the impact on three unseen test scenarios with identical obstacle density: Scene-I (50%), Scene-II (50%), and Scene-III (50%). The primary goal of this setting was to explore how increasing the diversity of co-training scenarios enhances our model’s ability to transfer effectively to dense environments. Figure 6 shows the performance improvement on the three test metrics. As the number of co-training scenarios increased, our model consistently achieved better performance. The gain in Transfer Reward grew steadily, reflecting improved adaptability to unseen dense environments. The Collision Penalty saw a significant reduction, indicating enhanced safety and collision avoidance capabilities. Although the free space reward exhibited a more gradual increase, it still benefited from the larger set of co-training maps, further solidifying the overall robustness of our method in complex scenarios.

5.4.5. Analysis on Dual-T Encoder

The Dual-T Encoder in DTPPO is a critical component that facilitates the model’s ability to capture both spatial and temporal dynamics in multi-agent environments. We analyzed the output embeddings from the Dual-T Encoder by applying 3D t-SNE to visualize their clustering patterns. As shown in Figure 7, after training, the Dual-T Encoder is capable of grouping embeddings based on their respective scenarios, each represented by a distinct color. This result illustrates that the Dual-T Encoder can accurately capture scenario-specific dynamic information.

6. Conclusions

In this paper, we proposed DTPPO, a Dual-Transformer Encoder-Based PPO method designed to address the critical challenge of multi-UAV navigation in unseen complex environments. By integrating a Spatial Transformer to enhance inter-UAV coordination and a Temporal Transformer to capture long-term temporal dependencies, DTPPO provides a robust framework for navigating dynamic and obstacle-laden environments. These components work together to improve both navigation efficiency and transferability, enabling the framework to effectively generalize policies to previously unseen scenarios. This eliminates the need for extensive scenario-specific retraining, which is a significant limitation of many existing approaches. Our simulation results in various obstacle-laden environments demonstrate that DTPPO outperforms the baseline methods, especially in unseen scenarios where it exhibits strong transferability and robust navigation capabilities. These results highlight the architecture’s ability to balance exploration and exploitation while maintaining high levels of adaptability. Notably, the framework significantly reduces computational costs and enhances real-time decision making, making it suitable for large-scale UAV deployments. Future work will focus on extending the framework to real-world deployment scenarios, addressing challenges such as sensor noise and communication delays, as well as adapting the model to handle heterogeneous UAV fleets with diverse capabilities.

Author Contributions

Conceptualization, J.L., A.W., Z.L. and R.Z.; methodology, J.L., A.W. and Z.L.; software, J.L. and A.W.; validation, J.L. and A.W.; formal analysis, J.L. and A.W.; investigation, J.L. and A.W.; resources, Z.L. and R.Z.; data curation, J.L.; writing—original draft preparation, J.L. and A.W.; writing—review and editing, Z.L. and K.L.; visualization, J.L. and A.W.; supervision, Z.L.; project administration, Z.L. and R.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shakhatreh, H.; Sawalmeh, A.H.; Al-Fuqaha, A.; Dou, Z.; Almaita, E.; Khalil, I.; Othman, N.S.; Khreishah, A.; Guizani, M. Unmanned aerial vehicles (UAVs): A survey on civil applications and key research challenges. IEEE Access 2019, 7, 48572–48634.
  2. Mohsan, S.A.H.; Khan, M.A.; Noor, F.; Ullah, I.; Alsharif, M.H. Towards the unmanned aerial vehicles (UAVs): A comprehensive review. Drones 2022, 6, 147.
  3. Huang, S.; Teo, R.S.H.; Tan, K.K. Collision avoidance of multi unmanned aerial vehicles: A review. Annu. Rev. Control 2019, 48, 147–164.
  4. Bellingham, J.S.; Tillerson, M.; Alighanbari, M.; How, J.P. Cooperative path planning for multiple UAVs in dynamic and uncertain environments. In Proceedings of the 41st IEEE Conference on Decision and Control, Las Vegas, NV, USA, 10–13 December 2002; Volume 3, pp. 2816–2822.
  5. Lewis, F.L.; Zhang, H.; Hengster-Movric, K.; Das, A. Cooperative Globally Optimal Control for Multi-Agent Systems on Directed Graph Topologies. In Cooperative Control of Multi-Agent Systems: Optimal and Adaptive Design Approaches; Springer: London, UK, 2014; pp. 141–179.
  6. Liu, Z.; Wang, H.; Wei, H.; Liu, M.; Liu, Y.H. Prediction, planning, and coordination of thousand-warehousing-robot networks with motion and communication uncertainties. IEEE Trans. Autom. Sci. Eng. 2020, 18, 1705–1717.
  7. Liu, Z.; Zhai, Y.; Li, J.; Wang, G.; Miao, Y.; Wang, H. Graph relational reinforcement learning for mobile robot navigation in large-scale crowded environments. IEEE Trans. Intell. Transp. Syst. 2023, 24, 8776–8787.
  8. Van Den Berg, J.; Guy, S.J.; Lin, M.; Manocha, D. Optimal reciprocal collision avoidance for multi-agent navigation. In Proceedings of the IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010.
  9. Van Den Berg, J.; Guy, S.J.; Lin, M.; Manocha, D. Reciprocal n-body collision avoidance. In Robotics Research: The 14th International Symposium ISRR; Springer: Berlin/Heidelberg, Germany, 2011; pp. 3–19.
  10. Snape, J.; Van Den Berg, J.; Guy, S.J.; Manocha, D. The hybrid reciprocal velocity obstacle. IEEE Trans. Robot. 2011, 27, 696–706.
  11. Douthwaite, J.A.; Zhao, S.; Mihaylova, L.S. Velocity obstacle approaches for multi-agent collision avoidance. Unmanned Syst. 2019, 7, 55–64.
  12. Zhang, F.; Shao, X.; Zhang, W. Cooperative fusion localization of a nonstationary target for multiple UAVs without GPS. IEEE Syst. J. 2024.
  13. Mei, Z.; Shao, X.; Xia, Y.; Liu, J. Enhanced Fixed-time Collision-free Elliptical Circumnavigation Coordination for UAVs. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 4257–4270.
  14. Gronauer, S.; Diepold, K. Multi-agent deep reinforcement learning: A survey. Artif. Intell. Rev. 2022, 55, 895–943.
  15. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. Adv. Neural Inf. Process. Syst. 2017, 30, 6382–6393.
  16. Liu, Y.; Luo, G.; Yuan, Q.; Li, J.; Lei, J.; Chen, B.; Pan, R. GPLight: Grouped Multi-agent Reinforcement Learning for Large-scale Traffic Signal Control. IJCAI 2023, 199–207.
  17. Bouhamed, O.; Ghazzai, H.; Besbes, H.; Massoud, Y. Autonomous UAV navigation: A DDPG-based deep reinforcement learning approach. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Seville, Spain, 12–14 October 2020; pp. 1–5.
  18. Qie, H.; Shi, D.; Shen, T.; Xu, X.; Li, Y.; Wang, L. Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning. IEEE Access 2019, 7, 146264–146272.
  19. Rybchak, Z.; Kopylets, M. Comparative Analysis of DQN and PPO Algorithms in UAV Obstacle Avoidance 2D Simulation. In Proceedings of the COLINS (3), Lviv, Ukraine, 12–13 April 2024; pp. 391–403.
  20. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624.
  21. Xue, Y.; Chen, W. Multi-agent deep reinforcement learning for UAVs navigation in unknown complex environment. IEEE Trans. Intell. Veh. 2023, 9, 2290–2303.
  22. Hodge, V.J.; Hawkins, R.; Alexander, R. Deep reinforcement learning for drone navigation using sensor data. Neural Comput. Appl. 2021, 33, 2015–2033.
  23. Melo, L.C. Transformers are meta-reinforcement learners. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 15340–15359.
  24. Jiang, H.; Li, Z.; Wei, H.; Xiong, X.; Ruan, J.; Lu, J.; Mao, H.; Zhao, R. X-Light: Cross-City Traffic Signal Control Using Transformer on Transformer as Meta Multi-Agent Reinforcement Learner. arXiv 2024, arXiv:2404.12090.
  25. Wang, C.; Wang, J.; Shen, Y.; Zhang, X. Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach. IEEE Trans. Veh. Technol. 2019, 68, 2124–2136.
  26. Pham, H.X.; La, H.M.; Feil-Seifer, D.; Van Nguyen, L. Reinforcement learning for autonomous UAV navigation using function approximation. In Proceedings of the 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), Philadelphia, PA, USA, 6–8 August 2018; pp. 1–6.
  27. Li, C.C.; Shuai, H.H.; Wang, L.C. Efficiency-reinforced learning with auxiliary depth reconstruction for autonomous navigation of mobile devices. In Proceedings of the 2022 23rd IEEE International Conference on Mobile Data Management (MDM), Paphos, Cyprus, 6–9 June 2022; pp. 458–463.
  28. He, L.; Aouf, N.; Whidborne, J.F.; Song, B. Deep reinforcement learning based local planner for UAV obstacle avoidance using demonstration data. arXiv 2020, arXiv:2008.02521.
  29. Moltajaei Farid, A.; Roshanian, J.; Mouhoub, M. On-policy Actor-Critic Reinforcement Learning for Multi-UAV Exploration. arXiv 2024, arXiv:2409.11058.
  30. Chikhaoui, K.; Ghazzai, H.; Massoud, Y. PPO-based reinforcement learning for UAV navigation in urban environments. In Proceedings of the 2022 IEEE 65th International Midwest Symposium on Circuits and Systems (MWSCAS), Fukuoka, Japan, 7–10 August 2022; pp. 1–4.
  31. Panerati, J.; Zheng, H.; Zhou, S.; Xu, J.; Prorok, A.; Schoellig, A.P. Learning to fly—A gym environment with pybullet physics for reinforcement learning of multi-agent quadcopter control. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 7512–7519.
  32. Bernstein, D.S.; Givan, R.; Immerman, N.; Zilberstein, S. The complexity of decentralized control of Markov decision processes. Math. Oper. Res. 2002, 27, 819–840.
  33. Wei, D.; Zhang, L.; Liu, Q.; Chen, H.; Huang, J. UAV Swarm Cooperative Dynamic Target Search: A MAPPO-Based Discrete Optimal Control Method. Drones 2024, 8, 214.
  34. Wu, D.; Wan, K.; Tang, J.; Gao, X.; Zhai, Y.; Qi, Z. An improved method towards multi-UAV autonomous navigation using deep reinforcement learning. In Proceedings of the 2022 7th International Conference on Control and Robotics Engineering (ICCRE), Beijing, China, 15–17 April 2022; pp. 96–101.
  35. Zang, X.; Yao, H.; Zheng, G.; Xu, N.; Xu, K.; Li, Z. MetaLight: Value-based meta-reinforcement learning for traffic signal control. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 1153–1160.
  36. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  37. Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems; 2017.
  38. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
Figure 1. A schematic illustration of the zero-shot transfer to a previously unseen environment (Scene-III) after training on known environments (Scene-I and Scene-II).
Figure 2. The navigation algorithm is tested in three types of environments: a square column obstacle, a cylindrical obstacle, and mixed obstacles. Different obstacle densities can be set for training.
Figure 3. Overview of DTPPO. Three types of environments are used: Scene-I (square column obstacles), Scene-II (cylindrical obstacles), and Scene-III (mixed obstacles).
Figure 4. Transfer reward during training.
Figure 5. Ablation study on the different components in DTPPO.
Figure 6. Impact of varying the number of scenarios for co-training.
Figure 7. Visualizing the Temporal Transformer’s output, as evaluated on Scene-III (50%).
Table 1. The implementation details of DTPPO.
Hyperparameters | Details
Learning rate | $5 \times 10^{-4}$
Actor loss coefficient $\delta_1$ | 1
Critic loss coefficient $\delta_2$ | 1
Dynamic predictor loss coefficient $\delta_3$ | $1 \times 10^{-2}$
Entropy coefficient | $1 \times 10^{-2}$
Discount factor $\gamma$ | 0.99
Clipping $\epsilon$ | 0.2
Number of Spatial Transformer layers | 3
Number of Spatial Transformer heads | 6
Number of Temporal Transformer layers | 3
Number of Temporal Transformer heads | 6
Spatial Transformer embedding dimension | 149
Temporal Transformer embedding dimension | 149
Temporal Transformer horizon $L$ | 20
Number of neighbor drones $n$ | 4
Table 2. The test metrics on performing zero-shot transfer to various unseen scenes with different obstacle densities.
Metric | Method | Scene-I (10%) | Scene-I (50%) | Scene-II (10%) | Scene-II (50%) | Scene-III (10%) | Scene-III (50%)
Avg. Transfer Reward | MADDPG | 66.21 | 58.48 | 76.51 | 56.42 | 87.43 | 65.83
 | MARDPG | 95.45 | 84.37 | 105.75 | 86.03 | 92.32 | 77.69
 | MAPPO | 168.39 | 151.58 | 196.85 | 148.57 | 166.43 | 134.90
 | DTPPO | 256.19 | 243.53 | 239.26 | 227.80 | 231.26 | 214.55
Avg. Collision Penalty | MADDPG | 5.22 | 24.68 | 8.27 | 24.27 | 13.66 | 33.25
 | MARDPG | 3.60 | 16.41 | 8.21 | 19.63 | 10.25 | 28.26
 | MAPPO | 2.59 | 4.60 | 3.24 | 5.80 | 4.80 | 7.45
 | DTPPO | 1.20 | 1.61 | 1.20 | 2.56 | 4.42 | 5.58
Avg. Free Space Reward | MADDPG | 1.38 | 1.02 | 1.84 | 0.46 | 0.68 | 0.37
 | MARDPG | 1.27 | 1.69 | 2.01 | 1.15 | 1.28 | 0.68
 | MAPPO | 3.86 | 3.05 | 3.02 | 4.80 | 2.13 | 1.98
 | DTPPO | 4.65 | 3.97 | 5.17 | 4.56 | 3.41 | 3.25
Table 3. The test metrics on performing navigation tasks in the seen scenarios.
Metric | Method | Scene-I (10%) | Scene-I (50%) | Scene-II (10%) | Scene-II (50%) | Scene-III (10%) | Scene-III (50%)
Avg. Transfer Reward | MADDPG | 70.25 | 62.50 | 80.51 | 60.95 | 90.12 | 69.02
 | MARDPG | 101.34 | 90.83 | 111.24 | 90.35 | 97.18 | 80.28
 | MAPPO | 175.51 | 160.04 | 205.73 | 157.12 | 170.29 | 137.51
 | DTPPO | 262.89 | 251.77 | 245.61 | 235.19 | 239.85 | 221.49
Avg. Collision Penalty | MADDPG | 4.95 | 23.71 | 7.69 | 22.11 | 12.86 | 31.44
 | MARDPG | 3.35 | 15.18 | 7.73 | 18.53 | 9.82 | 26.18
 | MAPPO | 2.41 | 4.28 | 3.10 | 5.31 | 4.65 | 7.12
 | DTPPO | 1.13 | 1.53 | 1.12 | 2.34 | 4.21 | 5.37
Avg. Free Space Reward | MADDPG | 1.42 | 1.06 | 1.95 | 0.53 | 0.72 | 0.40
 | MARDPG | 1.31 | 1.63 | 1.94 | 1.10 | 1.21 | 0.61
 | MAPPO | 3.76 | 2.98 | 2.95 | 4.69 | 2.07 | 1.90
 | DTPPO | 4.52 | 3.88 | 4.96 | 4.39 | 3.26 | 3.11
