Online Safe Flight Control Method Based on Constraint Reinforcement Learning
Abstract
1. Introduction
- A new framework for online safe flight control is proposed. The core idea is to first design a constrained reinforcement learning algorithm with an extra safety budget, which introduces Lyapunov stability requirements to ensure flight safety and improve the robustness of the controller, and then to use an online condition-triggered meta-learning method to adjust the control law online to complete the attitude angle tracking task.
- A novel flight control simulation environment is built based on the Python Flight Mechanics Engine (PyFME) [25] for offline training and online learning (see the interface sketch after this list).
- This work demonstrates that the method not only ensures the safety and stability of the aircraft during flight but also adapts the control law to various environmental changes through online learning.
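To make the simulation environment concrete, the following is a minimal sketch of a gym-style attitude-tracking wrapper around a flight-dynamics core such as PyFME. The class name and the `reset`, `apply_controls`, `propagate`, and `attitude_state` helpers are placeholders of our own, not PyFME's API, and the reward and cost signals only illustrate the tracking-error and safety-boundary ideas designed in Sections 2.3.1 to 2.3.4.

```python
import numpy as np

class AttitudeTrackingEnv:
    """Minimal sketch of a gym-style attitude-tracking environment.

    The coupling to the flight-dynamics engine is abstracted behind `self._sim`;
    the state, action, reward, and cost layouts below are illustrative only and
    do not reproduce the paper's exact design.
    """

    def __init__(self, sim, dt=0.02, max_steps=1000, safety_bound_deg=30.0):
        self._sim = sim                       # wrapped simulation (assumed interface)
        self.dt = dt
        self.max_steps = max_steps
        self.safety_bound = np.deg2rad(safety_bound_deg)
        self._step = 0
        self._target = np.zeros(2)            # commanded pitch and yaw angles [rad]

    def reset(self, target_pitch, target_yaw):
        self._step = 0
        self._target = np.array([target_pitch, target_yaw])
        self._sim.reset()                     # assumed helper that re-trims the aircraft
        return self._observe()

    def step(self, action):
        # action: normalized elevator, aileron, rudder, throttle commands in [-1, 1]
        self._sim.apply_controls(action)      # assumed helper mapping actions to surfaces
        self._sim.propagate(self.dt)          # assumed helper advancing the dynamics
        self._step += 1

        obs = self._observe()
        pitch_err, yaw_err = obs[:2] - self._target
        reward = -np.abs(pitch_err) - np.abs(yaw_err)                # tracking reward
        cost = float(np.any(np.abs(obs[:2]) > self.safety_bound))    # safety cost signal
        done = self._step >= self.max_steps
        return obs, reward, cost, done

    def _observe(self):
        # assumed helper returning [pitch, yaw, roll, p, q, r, ...]
        return np.asarray(self._sim.attitude_state(), dtype=np.float64)
```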
2. Mathematical Model
- The aircraft is a rigid body.
- The ground is flat and stationary, ignoring the influence of the earth’s curvature and rotation.
- The deformation of the landing gear is neglected.
2.1. Aircraft Model
2.2. Controller Model
2.3. Flight Simulation Environment Model
2.3.1. State Quantity Design
2.3.2. Action Quantity Design
2.3.3. Reward Function Design
2.3.4. Cost Function Design
3. Methodology
3.1. Constrained Policy Optimization Algorithm with Extra Safety Budget
Algorithm 1 Aircraft attitude control based on ESB-CPO
Initialize the policy network π_θ and the value network V_φ
Initialize the replay buffer B and the step counter t = 0
for k = 0, 1, 2, … do
    Use the policy π_θ to carry out the flight mission and collect a batch of samples D_k
    According to the FIFO principle, update the replay buffer B with D_k
    Update the step counter t = t + len(D_k)
    for each trajectory τ in B do
        for each state s in τ do
            Compute the Lyapunov-based extra safety budget and the modified cost signal for s
        end for
    end for
    Compute the dual variables by solving the local dual problem
    Estimate the objective and constraint surrogates using the samples constructed from B
    if the approximate ESB-CPO update is feasible then
        take the constrained policy-improvement step
    else
        take a recovery step that reduces the constraint violation
    end if
    Obtain the new policy parameters by backtracking line search to enforce satisfaction of the constraint function in (15)
    Update V_φ by TD-like critic learning
end for
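The sketch below outlines the outer loop of Algorithm 1 in Python. Every callable argument (collect_batch, compute_surrogates, solve_dual, propose_step, constraint_ok) is a placeholder for a quantity the algorithm estimates, so this is a structural sketch of constrained policy optimization with an extra safety budget and a backtracking line search, not the authors' ESB-CPO implementation.

```python
def esb_cpo_training(policy, value_fn, buffer, collect_batch,
                     compute_surrogates, solve_dual, propose_step,
                     constraint_ok, safety_budget=15.0, epochs=500,
                     backtrack_coef=0.8, max_backtracks=10):
    """Skeleton of the outer loop in Algorithm 1 (simplified sketch)."""
    t = 0
    for k in range(epochs):
        batch = collect_batch(policy)          # fly the mission with the current policy
        buffer.extend(batch)                   # FIFO replay buffer update (e.g. a deque)
        t += len(batch)

        # Augment each sample's cost with the Lyapunov-based extra safety budget
        # and estimate the objective/constraint surrogates (g, b, c are placeholders).
        g, b, c = compute_surrogates(policy, value_fn, buffer, safety_budget)

        lam, nu = solve_dual(g, b, c)          # local dual problem of the trust-region step
        full_step = propose_step(g, b, lam, nu)

        # Backtracking line search: accept the largest step satisfying the constraint;
        # if no scaled step is acceptable, the policy is left unchanged this epoch.
        step = full_step
        for _ in range(max_backtracks):
            if constraint_ok(policy, step, c):
                policy.apply_step(step)        # assumed helper applying the parameter update
                break
            step = backtrack_coef * step

        value_fn.fit(buffer)                   # TD-like critic update (assumed helper)
    return policy
```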
3.2. Condition-Triggered Meta-Learning Online Learning Method
Algorithm 2 Online flight control based on meta-learning
Load the offline model parameters
Initialize the replay buffer B, the step counter t = 0, the learning rates α and β, and the batch counter l = 0
while not done do
    Use the policy to carry out the flight mission and collect a batch of samples D
    According to the FIFO principle, update the replay buffer B with D
    Update the step counter t = t + len(D)
    if the attitude angle error exceeds the threshold then
        l = l + 1
        Use the samples in B to construct a task set, and divide the task set into a support set and a query set
        Use the support set to compute the adaptive parameters
        Use the query set to update the policy network parameters
    end if
end while
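A minimal sketch of one triggered update from Algorithm 2, written as a first-order MAML-style step in PyTorch. The transition format (s["angle_error"]), the policy loss function, and the use of plain SGD for both the inner and outer steps are assumptions made for illustration only.

```python
import copy
import torch

def condition_triggered_update(policy, buffer, loss_fn, err_threshold=0.8,
                               alpha=1e-5, beta=1e-5, sup_size=120, qry_size=80):
    """One condition-triggered meta-update (first-order MAML-style sketch).

    `policy` is a torch.nn.Module; `loss_fn(model, samples)` is a placeholder
    returning a scalar policy loss on a list of transitions.
    """
    # Trigger condition: recent attitude angle error exceeds the threshold.
    recent_err = max(abs(s["angle_error"]) for s in buffer[-qry_size:])
    if recent_err <= err_threshold:
        return policy

    support = buffer[-(sup_size + qry_size):-qry_size]
    query = buffer[-qry_size:]

    # Inner loop: adapt a copy of the policy on the support set with rate alpha.
    adapted = copy.deepcopy(policy)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=alpha)
    inner_opt.zero_grad()
    loss_fn(adapted, support).backward()
    inner_opt.step()

    # Outer loop: evaluate the adapted parameters on the query set and update
    # the original policy with rate beta (first-order approximation).
    outer_opt = torch.optim.SGD(policy.parameters(), lr=beta)
    outer_opt.zero_grad()
    query_loss = loss_fn(adapted, query)
    grads = torch.autograd.grad(query_loss, list(adapted.parameters()))
    for p, g in zip(policy.parameters(), grads):
        p.grad = g.clone()
    outer_opt.step()
    return policy
```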
4. Results and Discussion
4.1. Assessment Method
4.2. Experimental Details
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Cheng, H.; Zhang, S.; Liu, T.; Xu, S.; Huang, H. Review of Autonomous Decision-Making and Planning Techniques for Unmanned Aerial Vehicle. Air Space Def. 2024, 7, 6–15+80. [Google Scholar]
- Swaroop, D.; Hedrick, K.; Yip, P.P.; Gerdes, J.C. Dynamic surface control for a class of nonlinear systems. IEEE Trans. Autom. Control 2000, 45, 1893–1899. [Google Scholar] [CrossRef]
- Xidias, E.K. A Decision Algorithm for Motion Planning of Car-Like Robots in Dynamic Environments. Cybern. Syst. 2021, 52, 533–552. [Google Scholar] [CrossRef]
- Huang, Z.; Li, F.; Yao, J.; Chen, Z. MGCRL: Multi-view graph convolution and multi-agent reinforcement learning for dialogue state tracking. IEEE Trans. Autom. Control 2000, 45, 1893–1899. [Google Scholar] [CrossRef]
- Hellaoui, H.; Yang, B.; Taleb, T.; Manner, J. Traffic Steering for Cellular-Enabled UAVs: A Federated Deep Reinforcement Learning Approach. In Proceedings of the 2023 IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023. [Google Scholar]
- Xia, B.; Mantegh, I.; Xie, W. UAV Multi-Dynamic Target Interception: A Hybrid Intelligent Method Using Deep Reinforcement Learning and Fuzzy Logic. Drones 2024, 8, 226. [Google Scholar] [CrossRef]
- Kaufmann, E.; Bauersfeld, L.; Loquercio, A.; Müller, M.; Koltun, V.; Scaramuzza, D. Champion-level drone racing using deep reinforcement learning. Nature 2023, 620, 982–987. [Google Scholar] [CrossRef]
- Cui, Y.; Hou, B.; Wu, Q.; Ren, B.; Wang, S.; Jiao, L.C. Remote Sensing Object Tracking With Deep Reinforcement Learning Under Occlusion. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
- Zhu, Z.D.; Lin, K.X.; Jain, A.K.; Zhou, J.Y. Transfer Learning in Deep Reinforcement Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13344–13362. [Google Scholar] [CrossRef]
- Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Al Sallab, A.A.; Yogamani, S.; Pérez, P. Deep Reinforcement Learning for Autonomous Driving: A Survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4909–4926. [Google Scholar] [CrossRef]
- Minsky, M. Steps toward Artificial Intelligence. Proc. IRE. 1961, 49, 8–20. [Google Scholar] [CrossRef]
- Zhao, W.; He, T.; Chen, R.; Wei, T.; Liu, C. Safe Reinforcement Learning: A Survey. Acta Autom. Sin. 2023, 49, 1813–1835. [Google Scholar]
- Liu, X.; Nan, Y.; Xie, R.; Zhang, S. DDPG Optimization Based on Dynamic Inverse of Aircraft Attitude Control. Comput. Simul. 2020, 37, 37–43. [Google Scholar]
- Hao, C.; Fang, Z.; Li, P. Output feedback reinforcement learning control method based on reference model. J. Zhejiang Univ. Eng. Sci. 2013, 47, 409–414+479. [Google Scholar]
- Huang, X.; Liu, J.; Jia, C.; Wang, Z.; Zhang, J. Deep Deterministic policy gradient algorithm for UAV control. Acta Aeronaut. Astronaut. Sin. 2021, 42, 404–414. [Google Scholar]
- Choi, J.; Kim, H.M.; Hwang, H.J.; Kim, Y.D.; Kim, C.O. Modular Reinforcement Learning for Autonomous UAV Flight Control. Drones 2023, 7, 418. [Google Scholar] [CrossRef]
- Woo, J.; Yu, C.; Kim, N. Deep reinforcement learning-based controller for path following of an unmanned surface vehicle. Ocean Eng. 2019, 183, 155–166. [Google Scholar] [CrossRef]
- Tang, J.; Liang, Y.; Li, K. Dynamic Scene Path Planning of UAVs Based on Deep Reinforcement Learning. Drones 2024, 8, 60. [Google Scholar] [CrossRef]
- Wang, W.; Gokhan, I. Reinforcement learning based closed-loop reference model adaptive flight control system design. Sci. Technol. Eng. 2023, 23, 14888–14895. [Google Scholar]
- Yang, R.; Du, C.; Zheng, Y.; Gao, H.; Wu, Y.; Fang, T. PPO-Based Attitude Controller Design for a Tilt Rotor UAV in Transition Process. Drones 2023, 7, 499. [Google Scholar] [CrossRef]
- Burak, Y.; Wu, H.; Liu, H.X.; Yang, Y. An Attitude Controller for Quadrotor Drone Using RM-DDPG. Int. J. Adapt. Control Signal Process. 2021, 35, 420–440. [Google Scholar]
- Ma, B.; Liu, Z.; Dang, Q.; Zhao, W.; Wang, J.; Cheng, Y.; Yuan, Z. Deep reinforcement learning of UAV tracking control under wind disturbances environments. IEEE Trans. Instrum. Meas. 2023, 72, 1–13. [Google Scholar] [CrossRef]
- Chow, Y.; Nachum, O.; Faust, A.; Ghavamzadeh, M.; Duéñez-Guzmán, E. Lyapunov-based safe policy optimization for continuous control. arXiv 2019, arXiv:1901.10031. [Google Scholar]
- Yu, X.; Xu, S.; Fan, Y.; Ou, L. Self-Adaptive LSAC-PID Approach Based on Lyapunov Reward Shaping for Mobile Robots. J. Shanghai Jiaotong Univ. (Sci.) 2023, 1–18. [Google Scholar] [CrossRef]
- PyFME. Available online: https://pyfme.readthedocs.io/en/latest/ (accessed on 12 April 2024).
- Filipe, N. Nonlinear Pose Control and Estimation for Space Proximity Operations: An Approach Based on Dual Quaternions. Ph.D. Thesis, Georgia Institute of Technology, Atlanta, GA, USA, 2014. [Google Scholar]
- Qing, Y.Y. Inertial Navigation, 3rd ed.; China Science Publishing & Media Ltd.: Beijing, China, 2020; pp. 252–284. [Google Scholar]
- Gazebo. Available online: https://github.com/gazebosim/gz-sim (accessed on 28 July 2024).
- Madaan, R.; Gyde, N.; Vemprala, S.; Brown, M.; Nagami, K.; Taubner, T.; Cristofalo, E.; Scaramuzza, D.; Schwager, M.; et al. AirSim Drone Racing Lab. arXiv 2020, arXiv:2003.05654. [Google Scholar]
- FlightGear. Available online: https://wiki.flightgear.org/Main_Page (accessed on 28 July 2024).
- X-Plane. Available online: https://developer.x-plane.com/docs/ (accessed on 28 July 2024).
- Xu, H.; Wang, S.; Wang, Z.; Zhang, Y.; Zhuo, Q.; Gao, Y.; Zhang, T. Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization. In Proceedings of the 2023 IEEE International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023. [Google Scholar]
- Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained Policy Optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017. [Google Scholar]
- Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; Abbeel, P. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015. [Google Scholar]
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv 2018, arXiv:1801.01290. [Google Scholar]
- Zheng, Q.; Zhao, P.; Zhang, D.; Wang, H. MR-DCAE: Manifold regularization-based deep convolutional autoencoder for unauthorized broadcasting identification. Int. J. Intell. Syst. 2021, 36, 7204–7238. [Google Scholar] [CrossRef]
- Gopi, S.P.; Magarini, M.; Alsamhi, S.H.; Shvetsov, A.V. Machine Learning-Assisted Adaptive Modulation for Optimized Drone-User Communication in B5G. Drones 2021, 5, 128. [Google Scholar] [CrossRef]
- Zheng, Q.; Saponara, S.; Tian, X.; Yu, Z.; Elhanashi, A.; Yu, R. A real-time constellation image classification method of wireless communication signals based on the lightweight network MobileViT. Cogn. Neurodyn. 2024, 18, 659–671. [Google Scholar] [CrossRef]
Indicator Subject | Scoring Criterion | Indicator Score
---|---|---
Pitch channel, first excitation | Steady-state error | 0.1
Pitch channel, first excitation | Settling time | 0.05
Pitch channel, first excitation | Overshoot | 0.1
Pitch channel, second excitation | Steady-state error | 0.1
Pitch channel, second excitation | Settling time | 0.05
Pitch channel, second excitation | Overshoot | 0.1
Yaw channel, first excitation | Steady-state error | 0.1
Yaw channel, first excitation | Settling time | 0.05
Yaw channel, first excitation | Overshoot | 0.1
Yaw channel, second excitation | Steady-state error | 0.1
Yaw channel, second excitation | Settling time | 0.05
Yaw channel, second excitation | Overshoot | 0.1
Total | | 1.0
Symbol | Meaning | Value
---|---|---
 | Full marks for the stabilization ratio term | 30
 | Full marks for the control quality term | 40
 | Full marks for the indicator dispersion term | 30
 | Total | 100
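Assuming, as one reading of the two tables above, that the indicator weights feed the 40-mark control-quality term, the weighted score can be computed as in the sketch below; the key names and the normalization of the per-indicator scores are our own placeholders.

```python
# Indicator weights transcribed from the scoring table above; the per-indicator
# scoring functions and the mapping onto the full-marks terms are placeholders
# for the paper's exact assessment procedure.
INDICATOR_WEIGHTS = {
    ("pitch", "first", "steady_state_error"): 0.10,
    ("pitch", "first", "settling_time"):      0.05,
    ("pitch", "first", "overshoot"):          0.10,
    ("pitch", "second", "steady_state_error"): 0.10,
    ("pitch", "second", "settling_time"):      0.05,
    ("pitch", "second", "overshoot"):          0.10,
    ("yaw", "first", "steady_state_error"):   0.10,
    ("yaw", "first", "settling_time"):        0.05,
    ("yaw", "first", "overshoot"):            0.10,
    ("yaw", "second", "steady_state_error"):  0.10,
    ("yaw", "second", "settling_time"):       0.05,
    ("yaw", "second", "overshoot"):           0.10,
}

def weighted_indicator_score(indicator_scores, full_marks=40.0):
    """indicator_scores maps the keys above to normalized scores in [0, 1];
    the weights sum to 1.0, so the result lies in [0, full_marks]."""
    return full_marks * sum(w * indicator_scores[k]
                            for k, w in INDICATOR_WEIGHTS.items())
```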
Parameter | Symbol | Value | Dimension
---|---|---|---
Mass | m | 30 | kg
Wingspan | | 3 |
Reference area | | 1.5 |
Reference chord length | | 0.469 |
X-axis coordinate of the centre of mass in the theoretical vertex frame | | 0.632 |
Y-axis coordinate of the centre of mass in the theoretical vertex frame | | 0.0473 |
Z-axis coordinate of the centre of mass in the theoretical vertex frame | | 0.0014 |
Initial position random range | | (200, 400) |
Initial speed random range | | (25, 40) |
Initial pitch angle | | 0.0 |
Initial yaw angle random range | | (−180, 180) |
Initial roll angle | | 0.0 |
Initial pitch rate | | 0.0 |
Initial yaw rate | | 0.0 |
Initial roll rate | | 0.0 |
Name | Value | Name | Value |
---|---|---|---|
check_freq | 25 | min_rel_budget | 1.0 |
cost_limit | 8 | safety_budget | 15 |
entropy_coef | 0.01 | saute_discount_factor | 0.99 |
epochs | 500 | test_rel_budget | 1.0 |
gamma | 0.99 | unsafe_reward | −1.0 |
lam | 0.95 | save_freq | 10 |
lam_c | 0.95 | seed | 0 |
max_grad_norm | 0.5 | steps_per_epoch | 10,000 |
num_mini_batches | 16 | target_kl | 0.01 |
pi_lr | 0.0003 | train_pi_iterations | 80 |
max_ep_len | 1000 | train_v_iterations | 40 |
max_rel_budget | 1.0 | vf_lr | 0.001 |
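For reference, the offline training hyperparameters above can be gathered into a plain Python dictionary; the values are transcribed from the table, while the dictionary itself is only our packaging, not the authors' code.

```python
# Offline ESB-CPO training hyperparameters, transcribed from the table above.
ESB_CPO_CONFIG = {
    "check_freq": 25, "cost_limit": 8, "entropy_coef": 0.01, "epochs": 500,
    "gamma": 0.99, "lam": 0.95, "lam_c": 0.95, "max_grad_norm": 0.5,
    "num_mini_batches": 16, "pi_lr": 3e-4, "max_ep_len": 1000,
    "max_rel_budget": 1.0, "min_rel_budget": 1.0, "safety_budget": 15,
    "saute_discount_factor": 0.99, "test_rel_budget": 1.0,
    "unsafe_reward": -1.0, "save_freq": 10, "seed": 0,
    "steps_per_epoch": 10_000, "target_kl": 0.01,
    "train_pi_iterations": 80, "train_v_iterations": 40, "vf_lr": 1e-3,
}
```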
Name | Value | Name | Value |
---|---|---|---|
max_ep_len | 1500 | qry_size | 80 |
buffer_size | 1000 | dist_angle | 0.8 |
batch_size | 200 | learning_rate | 1 × 10−5 |
minimal_size | 200 | gamma | 0.99 |
sup_size | 120 | lam | 0.95 |
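The online hyperparameters can be packaged the same way and wired into the condition-triggered update sketched after Algorithm 2; reading dist_angle as the attitude-error trigger threshold and learning_rate as both the inner and outer rate is our assumption.

```python
# Online meta-learning hyperparameters, transcribed from the table above.
ONLINE_CONFIG = {
    "max_ep_len": 1500, "buffer_size": 1000, "batch_size": 200,
    "minimal_size": 200, "sup_size": 120, "qry_size": 80,
    "dist_angle": 0.8, "learning_rate": 1e-5, "gamma": 0.99, "lam": 0.95,
}

# Example wiring into the earlier sketch (loss_fn and buffer are placeholders):
# policy = condition_triggered_update(
#     policy, list(buffer), loss_fn,
#     err_threshold=ONLINE_CONFIG["dist_angle"],
#     alpha=ONLINE_CONFIG["learning_rate"], beta=ONLINE_CONFIG["learning_rate"],
#     sup_size=ONLINE_CONFIG["sup_size"], qry_size=ONLINE_CONFIG["qry_size"])
```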
Symbol | Meaning | Value
---|---|---
 | Total score of the offline flight algorithm | 82.5
 | Total score of the online flight algorithm | 100