DTPPO: Dual-Transformer Encoder-Based Proximal Policy Optimization for Multi-UAV Navigation in Unseen Complex Environments
Abstract
1. Introduction
- We introduce a novel Dual-Transformer architecture for multi-UAV navigation that enhances inter-agent coordination through spatial and temporal modeling.
- We develop a co-training framework that allows UAVs to learn generalized navigation strategies across diverse environments with varying obstacle densities.
- We validate the effectiveness of DTPPO through extensive simulations, demonstrating superior performance and transferability compared to state-of-the-art MADRL-based methods.
2. Related Works
3. Preliminary
3.1. UAV System Model
3.2. Problem Statement
4. Methodology
4.1. Overview of DTPPO
4.2. Dual-Transformer Encoder Module
4.2.1. Spatial Transformer
4.2.2. Temporal Transformer
4.3. PPO-Based Co-Training on Various Scenarios
Algorithm 1 Training process of DTPPO
Input: A set of target UAVs from various scenarios, training episodes E, the number of neighbors n, the input length L for the Temporal Transformer, and the number of PPO epochs.
Initialize: MDP buffer and policy parameters.
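For readers who want the control flow at a glance, the following is a minimal structural sketch of the co-training loop in Algorithm 1, using stand-in environments and a random placeholder policy; all class and function names here are illustrative and are not the authors' implementation.

```python
# Minimal structural sketch of Algorithm 1 (co-training across scenarios).
# DummyScenarioEnv, collect_rollout, and ppo_update are illustrative stand-ins.
import random

class DummyScenarioEnv:
    """Stand-in for one training scenario (obstacle map + UAV dynamics)."""
    def reset(self):
        return [0.0] * 8                               # placeholder observation

    def step(self, action):
        next_obs = [random.random() for _ in range(8)]
        reward, done = random.random(), random.random() < 0.05
        return next_obs, reward, done

def collect_rollout(env, policy, horizon=128):
    """Roll the current policy in one scenario and record (s, a, r) transitions."""
    buffer, obs = [], env.reset()
    for _ in range(horizon):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        buffer.append((obs, action, reward))
        obs = env.reset() if done else next_obs
    return buffer

def ppo_update(params, mdp_buffer, ppo_epochs=4):
    """Placeholder for the clipped-PPO update over the shared MDP buffer."""
    for _ in range(ppo_epochs):
        pass                                           # gradient steps would go here
    return params

scenarios = [DummyScenarioEnv() for _ in range(3)]     # scenarios with varying obstacle densities
policy = lambda obs: random.choice([0, 1, 2, 3])       # stand-in for the DTPPO actor
params = {}
for episode in range(10):                              # training episodes E
    mdp_buffer = []
    for env in scenarios:                              # co-training: sample every scenario each episode
        mdp_buffer.extend(collect_rollout(env, policy))
    params = ppo_update(params, mdp_buffer)
```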
4.4. Computational Complexity Analysis
- The Spatial Transformer processes the interactions between a UAV and its n nearest neighbors. The key computational cost arises from the self-attention mechanism, which operates on the concatenated state, action, and reward embeddings of the UAV and its neighbors. Assuming the embedding dimension is d, the time complexity of self-attention in the Spatial Transformer is O((n + 1)² · d), since each UAV attends over itself and its n neighbors. Because n is a fixed, small constant for every UAV (n = 4 in our implementation), the computational cost remains scalable even as the number of UAVs in the environment increases.
- The Temporal Transformer captures long-term dependencies over a sliding window of L time steps. For each UAV, the self-attention mechanism within the Temporal Transformer has a complexity of O(L² · d). Since L is the window size rather than the total trajectory length, it is kept constant (L = 20 in our implementation) to bound the computational cost. A shape-level sketch of both attention inputs is given after this list.
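As a shape-level illustration of the two complexity terms above, the PyTorch snippet below builds the two attention inputs: n + 1 spatial tokens per UAV and L temporal tokens per window. The embedding width is rounded to 144 so that PyTorch's multi-head attention accepts 6 heads (the paper reports 149); the remaining values follow the hyperparameter table, and none of this is the authors' code.

```python
# Shape-level sketch of the Spatial and Temporal Transformer attention inputs.
# d is rounded to 144 (divisible by 6 heads); the paper reports 149.
import torch
import torch.nn as nn

d, n_heads = 144, 6            # embedding width, attention heads
n, L, batch = 4, 20, 32        # n neighbors and window L from the hyperparameter table

spatial_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
temporal_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

# Spatial Transformer: the ego UAV plus its n neighbors give n + 1 tokens,
# so self-attention costs O((n + 1)^2 * d) per UAV.
spatial_tokens = torch.randn(batch, n + 1, d)
spatial_out, _ = spatial_attn(spatial_tokens, spatial_tokens, spatial_tokens)

# Temporal Transformer: a sliding window of the last L steps gives L tokens,
# so self-attention costs O(L^2 * d), independent of total trajectory length.
temporal_tokens = torch.randn(batch, L, d)
temporal_out, _ = temporal_attn(temporal_tokens, temporal_tokens, temporal_tokens)

print(spatial_out.shape, temporal_out.shape)   # torch.Size([32, 5, 144]) torch.Size([32, 20, 144])
```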
5. Experiment
5.1. Experiment and Parameter Setting
5.2. Baselines
- MADDPG uses feedforward neural networks for learning. In MADDPG, the UAVs are trained in a centralized manner but execute their learned policies independently (decentralized execution). This method addresses the challenges of non-stationarity in multi-agent environments and reduces the variance in training across multiple UAVs.
- MARDPG extends RDPG to the multi-agent deep reinforcement learning setting. In MARDPG, each UAV perceives all other UAVs as part of the environment, without direct communication or cooperation between them; this variant can be referred to as Ind-MARDPG, where each UAV’s navigation policy is trained with a recurrent deterministic policy gradient. All UAVs independently adopt the same policy without any information exchange between agents.
- MAPPO is an extension of the single-agent PPO algorithm to multi-agent systems. It combines centralized training with decentralized execution, where each UAV learns its own policy but benefits from joint learning with other agents. MAPPO offers more stable learning through the PPO clipping mechanism, which helps to avoid large policy updates. This makes MAPPO particularly suited for complex, dynamic environments where cooperation between agents is crucial.
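For reference, the clipped surrogate objective that the PPO-based methods (MAPPO and DTPPO alike) optimize is the standard one from Schulman et al.; with the clipping value of 0.2 reported in the hyperparameter table, it reads:

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},\;\; \epsilon = 0.2.
$$

The clip term keeps the probability ratio r_t(θ) close to 1 within a single update, which is the stability property referred to above.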
5.3. Evaluation Metrics
- Average Transfer Reward: This metric measures the average reward obtained by all the UAVs during their navigation toward the target in different environments. It reflects the efficiency of the learned navigation policies, with higher rewards indicating better performance in reaching the goal.
- Average Collision Penalty: This metric records the average penalty incurred when any UAV collides with obstacles. It helps assess the safety of the navigation policies, with lower penalties indicating better obstacle avoidance and safer navigation.
- Average Free Space: This metric evaluates how well the UAVs navigate through open, obstacle-free areas by averaging the rewards earned for doing so. It indicates how effectively the UAVs avoid obstacles while maintaining efficient movement through less congested regions.
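As an illustration only (not the authors' evaluation code), all three metrics reduce to simple averages over per-episode logs; the field names below are assumptions:

```python
# Hypothetical per-episode logs; field names are illustrative, not the paper's schema.
episode_logs = [
    {"transfer_reward": 250.3, "collision_penalty": 1.2, "free_space_reward": 4.8},
    {"transfer_reward": 238.9, "collision_penalty": 2.1, "free_space_reward": 4.4},
]

def average(key: str) -> float:
    """Mean of one logged quantity over all evaluation episodes."""
    return sum(log[key] for log in episode_logs) / len(episode_logs)

avg_transfer_reward = average("transfer_reward")
avg_collision_penalty = average("collision_penalty")
avg_free_space = average("free_space_reward")
print(avg_transfer_reward, avg_collision_penalty, avg_free_space)
```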
5.4. Simulation Results
5.4.1. Transferability on the Unseen Scenario
5.4.2. Performance in Non-Transfer Setting
5.4.3. Ablation Study
- w/o Spatial Transformer: The removal of the spatial transformer, which facilitates inter-agent collaboration, resulted in the most significant drop in average transfer reward, especially in dense environments such as Scene-II (50%) and Scene-III (50%). This emphasizes the critical role of spatial collaboration in complex, obstacle-filled environments.
- w/o Temporal Transformer: Replacing the temporal transformer with a GRU led to a noticeable decline in performance, particularly in scenarios like Scene-II (50%). The ability to model temporal dependencies is crucial for maintaining high transfer rewards.
- w/o Residual Link: Removing the residual link significantly reduced the performance across all scenarios, with the most pronounced drops observed in Scene-II (50%) and Scene-III (50%). In these scenarios, the transfer reward sharply decreased compared to the full model, underscoring the critical role of self-observation in dense environments. Without the residual link, the model loses the ability to incorporate immediate feedback from its own state, resulting in less accurate decision making and reduced performance, especially in more challenging environments.
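To make the ablated component concrete, the snippet below sketches one way such a residual (self-observation) link could be realized: the UAV's own observation is projected and added back onto the encoder output before the policy head. The projection and fusion here are assumptions for illustration, not the paper's exact design.

```python
# Illustrative residual link: re-inject the UAV's own observation into the
# encoder output before the policy head (an assumption, not the exact design).
import torch
import torch.nn as nn

d_obs, d_model = 8, 144                  # observation size (assumed) and encoder width
self_proj = nn.Linear(d_obs, d_model)    # projects the raw self-observation

def fuse(encoder_out: torch.Tensor, self_obs: torch.Tensor) -> torch.Tensor:
    """Residual link: encoder output plus the projected self-observation."""
    return encoder_out + self_proj(self_obs)

fused = fuse(torch.randn(32, d_model), torch.randn(32, d_obs))
print(fused.shape)                       # torch.Size([32, 144])
```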
5.4.4. Varying the Numbers of Scenarios
5.4.5. Analysis on Dual-T Encoder
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Shakhatreh, H.; Sawalmeh, A.H.; Al-Fuqaha, A.; Dou, Z.; Almaita, E.; Khalil, I.; Othman, N.S.; Khreishah, A.; Guizani, M. Unmanned aerial vehicles (UAVs): A survey on civil applications and key research challenges. IEEE Access 2019, 7, 48572–48634. [Google Scholar] [CrossRef]
- Mohsan, S.A.H.; Khan, M.A.; Noor, F.; Ullah, I.; Alsharif, M.H. Towards the unmanned aerial vehicles (UAVs): A comprehensive review. Drones 2022, 6, 147. [Google Scholar] [CrossRef]
- Huang, S.; Teo, R.S.H.; Tan, K.K. Collision avoidance of multi unmanned aerial vehicles: A review. Annu. Rev. Control 2019, 48, 147–164. [Google Scholar] [CrossRef]
- Bellingham, J.S.; Tillerson, M.; Alighanbari, M.; How, J.P. Cooperative path planning for multiple UAVs in dynamic and uncertain environments. In Proceedings of the 41st IEEE Conference on Decision and Control, Las Vegas, NV, USA, 10–13 December 2002; Volume 3, pp. 2816–2822. [Google Scholar]
- Lewis, F.L.; Zhang, H.; Hengster-Movric, K.; Das, A.; Lewis, F.L.; Zhang, H.; Hengster-Movric, K.; Das, A. Cooperative Globally Optimal Control for Multi-Agent Systems on Directed Graph Topologies. In Cooperative Control of Multi-Agent Systems: Optimal and Adaptive Design Approaches; Springer: London, UK, 2014; pp. 141–179. [Google Scholar]
- Liu, Z.; Wang, H.; Wei, H.; Liu, M.; Liu, Y.H. Prediction, planning, and coordination of thousand-warehousing-robot networks with motion and communication uncertainties. IEEE Trans. Autom. Sci. Eng. 2020, 18, 1705–1717. [Google Scholar] [CrossRef]
- Liu, Z.; Zhai, Y.; Li, J.; Wang, G.; Miao, Y.; Wang, H. Graph relational reinforcement learning for mobile robot navigation in large-scale crowded environments. IEEE Trans. Intell. Transp. Syst. 2023, 24, 8776–8787. [Google Scholar] [CrossRef]
- Van Den Berg, J.; Guy, S.J.; Lin, M.; Manocha, D. Optimal reciprocal collision avoidance for multi-agent navigation. In Proceedings of the IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010. [Google Scholar]
- Van Den Berg, J.; Guy, S.J.; Lin, M.; Manocha, D. Reciprocal n-body collision avoidance. In Robotics Research: The 14th International Symposium ISRR; Springer: Berlin/Heidelberg, Germany, 2011; pp. 3–19. [Google Scholar]
- Snape, J.; Van Den Berg, J.; Guy, S.J.; Manocha, D. The hybrid reciprocal velocity obstacle. IEEE Trans. Robot. 2011, 27, 696–706. [Google Scholar] [CrossRef]
- Douthwaite, J.A.; Zhao, S.; Mihaylova, L.S. Velocity obstacle approaches for multi-agent collision avoidance. Unmanned Syst. 2019, 7, 55–64. [Google Scholar] [CrossRef]
- Zhang, F.; Shao, X.; Zhang, W. Cooperative fusion localization of a nonstationary target for multiple UAVs without GPS. IEEE Syst. J. 2024. [Google Scholar] [CrossRef]
- Mei, Z.; Shao, X.; Xia, Y.; Liu, J. Enhanced Fixed-time Collision-free Elliptical Circumnavigation Coordination for UAVs. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 4257–4270. [Google Scholar] [CrossRef]
- Gronauer, S.; Diepold, K. Multi-agent deep reinforcement learning: A survey. Artif. Intell. Rev. 2022, 55, 895–943. [Google Scholar] [CrossRef]
- Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. Adv. Neural Inf. Process. Syst. 2017, 30, 6382–6393. [Google Scholar]
- Liu, Y.; Luo, G.; Yuan, Q.; Li, J.; Lei, J.; Chen, B.; Pan, R. GPLight: Grouped Multi-agent Reinforcement Learning for Large-scale Traffic Signal Control. IJCAI 2023, 199–207. [Google Scholar]
- Bouhamed, O.; Ghazzai, H.; Besbes, H.; Massoud, Y. Autonomous UAV navigation: A DDPG-based deep reinforcement learning approach. In Proceedings of the 2020 IEEE International Symposium on circuits and systems (ISCAS), Seville, Spain, 12–14 October 2020; pp. 1–5. [Google Scholar]
- Qie, H.; Shi, D.; Shen, T.; Xu, X.; Li, Y.; Wang, L. Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning. IEEE Access 2019, 7, 146264–146272. [Google Scholar] [CrossRef]
- Rybchak, Z.; Kopylets, M. Comparative Analysis of DQN and PPO Algorithms in UAV Obstacle Avoidance 2D Simulation. In Proceedings of the COLINS (3), Lviv, Ukraine, 12–13 April 2024; pp. 391–403. [Google Scholar]
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624. [Google Scholar]
- Xue, Y.; Chen, W. Multi-agent deep reinforcement learning for UAVs navigation in unknown complex environment. IEEE Trans. Intell. Veh. 2023, 9, 2290–2303. [Google Scholar] [CrossRef]
- Hodge, V.J.; Hawkins, R.; Alexander, R. Deep reinforcement learning for drone navigation using sensor data. Neural Comput. Appl. 2021, 33, 2015–2033. [Google Scholar] [CrossRef]
- Melo, L.C. Transformers are meta-reinforcement learners. In Proceedings of the International Conference on Machine Learning. PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 15340–15359. [Google Scholar]
- Jiang, H.; Li, Z.; Wei, H.; Xiong, X.; Ruan, J.; Lu, J.; Mao, H.; Zhao, R. X-Light: Cross-City Traffic Signal Control Using Transformer on Transformer as Meta Multi-Agent Reinforcement Learner. arXiv 2024, arXiv:2404.12090. [Google Scholar]
- Wang, C.; Wang, J.; Shen, Y.; Zhang, X. Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach. IEEE Trans. Veh. Technol. 2019, 68, 2124–2136. [Google Scholar] [CrossRef]
- Pham, H.X.; La, H.M.; Feil-Seifer, D.; Van Nguyen, L. Reinforcement learning for autonomous UAV navigation using function approximation. In Proceedings of the 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), Philadelphia, PA, USA, 6–8 August 2018; pp. 1–6. [Google Scholar]
- Li, C.C.; Shuai, H.H.; Wang, L.C. Efficiency-reinforced learning with auxiliary depth reconstruction for autonomous navigation of mobile devices. In Proceedings of the 2022 23rd IEEE International Conference on Mobile Data Management (MDM), Paphos, Cyprus, 6–9 June 2022; pp. 458–463. [Google Scholar]
- He, L.; Aouf, N.; Whidborne, J.F.; Song, B. Deep reinforcement learning based local planner for UAV obstacle avoidance using demonstration data. arXiv 2020, arXiv:2008.02521. [Google Scholar]
- Moltajaei Farid, A.; Roshanian, J.; Mouhoub, M. On-policy Actor-Critic Reinforcement Learning for Multi-UAV Exploration. arXiv 2024, arXiv:2409.11058. [Google Scholar]
- Chikhaoui, K.; Ghazzai, H.; Massoud, Y. PPO-based reinforcement learning for UAV navigation in urban environments. In Proceedings of the 2022 IEEE 65th International Midwest Symposium on Circuits and Systems (MWSCAS), Fukuoka, Japan, 7–10 August 2022; pp. 1–4. [Google Scholar]
- Panerati, J.; Zheng, H.; Zhou, S.; Xu, J.; Prorok, A.; Schoellig, A.P. Learning to fly—A gym environment with pybullet physics for reinforcement learning of multi-agent quadcopter control. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 7512–7519. [Google Scholar]
- Bernstein, D.S.; Givan, R.; Immerman, N.; Zilberstein, S. The complexity of decentralized control of Markov decision processes. Math. Oper. Res. 2002, 27, 819–840. [Google Scholar] [CrossRef]
- Wei, D.; Zhang, L.; Liu, Q.; Chen, H.; Huang, J. UAV Swarm Cooperative Dynamic Target Search: A MAPPO-Based Discrete Optimal Control Method. Drones 2024, 8, 214. [Google Scholar] [CrossRef]
- Wu, D.; Wan, K.; Tang, J.; Gao, X.; Zhai, Y.; Qi, Z. An improved method towards multi-UAV autonomous navigation using deep reinforcement learning. In Proceedings of the 2022 7th International Conference on Control and Robotics Engineering (ICCRE), Beijing, China, 15–17 April 2022; pp. 96–101. [Google Scholar]
- Zang, X.; Yao, H.; Zheng, G.; Xu, N.; Xu, K.; Li, Z. Metalight: Value-based meta-reinforcement learning for traffic signal control. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 1153–1160. [Google Scholar]
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems; 2017. [Google Scholar]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
Hyperparameters | Details |
---|---|
Learning rate | 5 × 10⁻⁴ |
Actor loss coefficient | 1 |
Critic loss coefficient | 1 |
Dynamic predictor loss coefficient | 1 × 10⁻² |
Entropy coefficient | 1 × 10⁻² |
Discount factor | 0.99 |
Clipping | 0.2 |
Number of Spatial Transformer layers | 3 |
Number of Spatial Transformer heads | 6 |
Number of Temporal Transformer layers | 3 |
Number of Temporal Transformer heads | 6 |
Spatial Transformer embedding dimension | 149 |
Temporal Transformer embedding dimension | 149 |
Temporal Transformer horizon L | 20 |
The number of neighbor drones n | 4 |
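For convenience, the hyperparameters above can be collapsed into a single configuration object; the key names below are illustrative rather than taken from the authors' code:

```python
# Hyperparameters from the table above, gathered into one illustrative config dict.
DTPPO_CONFIG = {
    "learning_rate": 5e-4,
    "actor_loss_coef": 1.0,
    "critic_loss_coef": 1.0,
    "dynamic_predictor_loss_coef": 1e-2,
    "entropy_coef": 1e-2,
    "discount_factor": 0.99,
    "clip_range": 0.2,               # PPO clipping
    "spatial_transformer_layers": 3,
    "spatial_transformer_heads": 6,
    "temporal_transformer_layers": 3,
    "temporal_transformer_heads": 6,
    "spatial_embedding_dim": 149,
    "temporal_embedding_dim": 149,
    "temporal_horizon_L": 20,
    "num_neighbor_drones_n": 4,
}
```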
| Metric | Method | Scene-I () | Scene-I () | Scene-II () | Scene-II () | Scene-III () | Scene-III () |
|---|---|---|---|---|---|---|---|
| Avg. Transfer Reward | MADDPG | 66.21 | 58.48 | 76.51 | 56.42 | 87.43 | 65.83 |
| | MARDPG | 95.45 | 84.37 | 105.75 | 86.03 | 92.32 | 77.69 |
| | MAPPO | 168.39 | 151.58 | 196.85 | 148.57 | 166.43 | 134.90 |
| | DTPPO | 256.19 | 243.53 | 239.26 | 227.80 | 231.26 | 214.55 |
| Avg. Collision Penalty | MADDPG | 5.22 | 24.68 | 8.27 | 24.27 | 13.66 | 33.25 |
| | MARDPG | 3.60 | 16.41 | 8.21 | 19.63 | 10.25 | 28.26 |
| | MAPPO | 2.59 | 4.60 | 3.24 | 5.80 | 4.80 | 7.45 |
| | DTPPO | 1.20 | 1.61 | 1.20 | 2.56 | 4.42 | 5.58 |
| Avg. Free Space Reward | MADDPG | 1.38 | 1.02 | 1.84 | 0.46 | 0.68 | 0.37 |
| | MARDPG | 1.27 | 1.69 | 2.01 | 1.15 | 1.28 | 0.68 |
| | MAPPO | 3.86 | 3.05 | 3.02 | 4.80 | 2.13 | 1.98 |
| | DTPPO | 4.65 | 3.97 | 5.17 | 4.56 | 3.41 | 3.25 |
| Metric | Method | Scene-I () | Scene-I () | Scene-II () | Scene-II () | Scene-III () | Scene-III () |
|---|---|---|---|---|---|---|---|
| Avg. Transfer Reward | MADDPG | 70.25 | 62.50 | 80.51 | 60.95 | 90.12 | 69.02 |
| | MARDPG | 101.34 | 90.83 | 111.24 | 90.35 | 97.18 | 80.28 |
| | MAPPO | 175.51 | 160.04 | 205.73 | 157.12 | 170.29 | 137.51 |
| | DTPPO | 262.89 | 251.77 | 245.61 | 235.19 | 239.85 | 221.49 |
| Avg. Collision Penalty | MADDPG | 4.95 | 23.71 | 7.69 | 22.11 | 12.86 | 31.44 |
| | MARDPG | 3.35 | 15.18 | 7.73 | 18.53 | 9.82 | 26.18 |
| | MAPPO | 2.41 | 4.28 | 3.10 | 5.31 | 4.65 | 7.12 |
| | DTPPO | 1.13 | 1.53 | 1.12 | 2.34 | 4.21 | 5.37 |
| Avg. Free Space Reward | MADDPG | 1.42 | 1.06 | 1.95 | 0.53 | 0.72 | 0.40 |
| | MARDPG | 1.31 | 1.63 | 1.94 | 1.10 | 1.21 | 0.61 |
| | MAPPO | 3.76 | 2.98 | 2.95 | 4.69 | 2.07 | 1.90 |
| | DTPPO | 4.52 | 3.88 | 4.96 | 4.39 | 3.26 | 3.11 |