Multiple-UAV Reinforcement Learning Algorithm Based on Improved PPO in Ray Framework
Abstract
1. Introduction
- MAPPO combines a Critic network that receives global information with Actor networks that receive only local information, enabling cooperation among heterogeneous UAVs, and an action-entropy reward is added to the objective function to encourage exploration (see the sketch following this list);
- The policy networks of homogeneous UAVs share parameters, while each UAV retains the ability to make decisions independently;
- A staged training method based on curriculum learning is proposed to improve the generalization of the algorithm.
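The first two contributions can be made concrete with a short loss sketch. Below is a minimal PyTorch-style rendering of a MAPPO-style update; the network sizes, tensor names, and coefficients (`clip_eps`, `entropy_coef`, `value_coef`) are illustrative assumptions, not the paper's exact implementation. The centralized Critic consumes the global state, the Actor consumes only its local observation, and the entropy term implements the action-entropy reward.

```python
# Minimal MAPPO-style loss sketch: centralized critic (global state),
# decentralized actor (local observation), plus an entropy bonus.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: local observation -> action distribution."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, act_dim))
    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class Critic(nn.Module):
    """Centralized value function: global state -> scalar value."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 1))
    def forward(self, state):
        return self.net(state).squeeze(-1)

def mappo_loss(actor, critic, obs, state, actions, old_logp,
               advantages, returns,
               clip_eps=0.2, entropy_coef=0.01, value_coef=0.5):
    dist = actor(obs)
    logp = dist.log_prob(actions)
    ratio = torch.exp(logp - old_logp)                    # pi_new / pi_old
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()         # PPO clipped surrogate
    value_loss = (critic(state) - returns).pow(2).mean()  # centralized value target
    entropy = dist.entropy().mean()                       # action-entropy reward
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```

Sharing a single `Actor` instance across all homogeneous UAVs realizes the parameter sharing of the second bullet, while each UAV still samples its own action from its own observation.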
2. Background
2.1. Markov Game
- $S$ denotes the global state of all agents;
- $A_i$ denotes the action of each agent $i$;
- $O_i$ denotes the observation of each agent $i$ (the full formulation is sketched below).
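For reference, these symbols fit the standard Markov-game (stochastic-game) formulation; the rendering below uses conventional notation and may differ in detail from the paper's own.

```latex
% N-agent Markov game (conventional notation): global states, per-agent actions
% and observations, transition kernel P, per-agent rewards r_i, discount gamma.
\left\langle S,\ \{A_i\}_{i=1}^{N},\ \{O_i\}_{i=1}^{N},\ P,\ \{r_i\}_{i=1}^{N},\ \gamma \right\rangle,
\qquad P : S \times A_1 \times \cdots \times A_N \to \Delta(S),
% each agent i seeks a policy maximizing its expected discounted return:
\qquad J_i = \mathbb{E}\Big[ \textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r_i(s_t, a_{1,t}, \dots, a_{N,t}) \Big].
```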
2.2. PPO Algorithm
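This section builds on the standard clipped surrogate objective of PPO (Schulman et al., 2017):

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\mathrm{old}}(a_t \mid s_t)}, \qquad
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right],
```

where $\hat{A}_t$ is the advantage estimate and $\epsilon$ the clipping range (set to 0.2 in the experiments below); clipping keeps each policy update close to the policy that collected the data.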
3. Multi-Agent Proximal Policy Optimization Algorithm
3.1. Algorithm Framework
3.2. Inherited Training Based on Curriculum Learning
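A minimal sketch of how such staged, checkpoint-inherited training could look in Ray's RLlib follows. It assumes the Ray 1.x-era `ray.rllib.agents.ppo.PPOTrainer` API; the environment IDs (`uav-stage-1`, `uav-stage-2`) are hypothetical placeholders, and the two stages must share observation and action spaces for the weights to transfer.

```python
# Staged ("inherited") training via curriculum learning with RLlib:
# train on an easy scenario, save a checkpoint, then restore those weights
# as the starting point for the harder scenario instead of training from scratch.
import ray
from ray.rllib.agents.ppo import PPOTrainer  # Ray 1.x-era import path

ray.init()

# Stage 1: simple scenario (hypothetical env ID).
trainer = PPOTrainer(env="uav-stage-1", config={"train_batch_size": 8000})
for _ in range(200):
    trainer.train()
checkpoint = trainer.save()   # persist the learned weights
trainer.stop()

# Stage 2: the harder scenario inherits the stage-1 policy.
# Both stages must expose identical observation/action spaces.
trainer = PPOTrainer(env="uav-stage-2", config={"train_batch_size": 8000})
trainer.restore(checkpoint)   # load stage-1 weights, then continue training
for _ in range(200):
    trainer.train()
```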
4. Cooperative Decision Model of UAV Cluster
4.1. Experimental Environment
4.2. Reward Mechanism
4.3. Network Design
5. Experimental Results and Analysis
5.1. Convergence Analysis
5.2. Anti-Interference Analysis
5.3. Generalization Performance Analysis
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
 | | | |
---|---|---|---|---
Range | 90 | 80 | 70 | 80
Hyperparameter | Value
---|---
train_batch_size | 8000
gamma | 0.6
epsilon | 0.2
max_steps | 1000
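Read as an RLlib PPO configuration (an assumption about key names: `epsilon` is interpreted as PPO's clip parameter and `max_steps` as the per-episode horizon), the table corresponds to something like:

```python
# Hedged mapping of the table's hyperparameters onto Ray 1.x-era RLlib PPO keys.
ppo_config = {
    "train_batch_size": 8000,  # samples collected per training iteration
    "gamma": 0.6,              # discount factor
    "clip_param": 0.2,         # PPO clipping range (the table's "epsilon")
    "horizon": 1000,           # maximum steps per episode (the table's "max_steps")
}
```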
Metric | COMA | BiCNet | MAPPO
---|---|---|---
Accumulated reward | 7571.67 | 9340.40 | 9964.01
Episode_len_mean | 193.69 | 180.88 | 172.26
Disturbance/m | Win Rate/% |
---|---|
+6 | 97.3 |
+12 | 92.3 |
+18 | 85.7 |
−6 | 95.7 |
−12 | 93.3 |
−18 | 82.7 |
Target Ship 1 | Target Ship 2 | Win Rate/%
---|---|---
 | | 87.3
 | | 84.7
 | | 89.3
 | | 81.7