Adaptive Nonlinear Model Predictive Horizon Using Deep Reinforcement Learning for Optimal Trajectory Planning
Abstract
1. Introduction
- Introducing an adaptive NMPH framework that uses a DRL-based method to tune the parameters of the underlying optimization problem so that it generates the best possible reference trajectories for the vehicle.
- Designing the RL components (the agent, the environment, and the reward scheme) of the proposed system.
- Implementing two different actor-critic DRL algorithms—the deterministic DDPG approach and the probabilistic SAC algorithm—within the adaptive NMPH framework, comparing them in terms of learning speed and stability.
- Evaluating the performance of the overall system with each of the above DRL algorithms in a lifelike simulation environment.
2. Methodologies
2.1. Nonlinear Model Predictive Horizon Based on Backstepping Control
2.2. Deep Reinforcement Learning Overview
2.2.1. Reinforcement Learning Preliminaries
2.2.2. Deep Deterministic Policy Gradient
Algorithm 1 Deep Deterministic Policy Gradient.
1: Initialize: Q-function parameters $\theta$, policy parameters $\phi$, empty replay buffer $\mathcal{D}$
2: Set target parameters $\theta_{\text{targ}} \leftarrow \theta$, $\phi_{\text{targ}} \leftarrow \phi$
3: repeat
4: Observe the state $s$
5: Find the action $a = \mu_\phi(s)$ and apply exploration noise to it
6: Apply $a$ by the agent
7: Observe the next state $s'$ and calculate the reward $r$
8: Store $(s, a, r, s')$ in the replay buffer $\mathcal{D}$
9: for a given number of episodes do
10: Obtain a random sample $B$ from $\mathcal{D}$
11: Compute the Bellman target $y = r + \gamma\, Q_{\theta_{\text{targ}}}\!\big(s', \mu_{\phi_{\text{targ}}}(s')\big)$
12: Update the Q-function by applying gradient descent to the MSBE: $\nabla_\theta \frac{1}{|B|} \sum_{(s,a,r,s') \in B} \big( Q_\theta(s,a) - y \big)^2$
13: Update the policy by applying gradient ascent to (8): $\nabla_\phi \frac{1}{|B|} \sum_{s \in B} Q_\theta\big(s, \mu_\phi(s)\big)$
14: Update the parameters of the target networks: $\theta_{\text{targ}} \leftarrow \rho\, \theta_{\text{targ}} + (1-\rho)\, \theta$, $\phi_{\text{targ}} \leftarrow \rho\, \phi_{\text{targ}} + (1-\rho)\, \phi$
15: until convergence
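To make steps 11–14 concrete, the following is a minimal TensorFlow 2 sketch of one DDPG update on a sampled mini-batch. It assumes `q_net`, `pi_net`, and their targets are `tf.keras` models (the Q-networks take a concatenated state–action input) and that `q_opt` and `pi_opt` are Keras optimizers; these names and the values of the discount factor and Polyak coefficient are illustrative assumptions, not the paper's implementation.

```python
import tensorflow as tf

GAMMA, RHO = 0.99, 0.995   # assumed discount factor and Polyak averaging coefficient

def ddpg_update(batch, q_net, pi_net, q_targ, pi_targ, q_opt, pi_opt):
    """One DDPG update (steps 11-14) on a sampled mini-batch."""
    s, a, r, s2 = batch                      # states, actions, rewards, next states
    r = tf.reshape(r, (-1, 1))               # align reward shape with the Q-network output

    # Step 11: Bellman target computed with the target networks
    y = r + GAMMA * q_targ(tf.concat([s2, pi_targ(s2)], axis=-1))

    # Step 12: gradient descent on the mean-squared Bellman error (MSBE)
    with tf.GradientTape() as tape:
        q = q_net(tf.concat([s, a], axis=-1))
        q_loss = tf.reduce_mean((q - tf.stop_gradient(y)) ** 2)
    q_opt.apply_gradients(zip(tape.gradient(q_loss, q_net.trainable_variables),
                              q_net.trainable_variables))

    # Step 13: gradient ascent on E[Q(s, mu(s))], written as descent on its negative
    with tf.GradientTape() as tape:
        pi_loss = -tf.reduce_mean(q_net(tf.concat([s, pi_net(s)], axis=-1)))
    pi_opt.apply_gradients(zip(tape.gradient(pi_loss, pi_net.trainable_variables),
                               pi_net.trainable_variables))

    # Step 14: Polyak (soft) update of the target-network parameters
    for w, w_t in zip(q_net.variables + pi_net.variables,
                      q_targ.variables + pi_targ.variables):
        w_t.assign(RHO * w_t + (1.0 - RHO) * w)
```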
2.2.3. Soft Actor-Critic
Algorithm 2 Soft Actor-Critic.
1: Initialize: policy parameters $\phi$, Q-function parameters $\theta_1$, $\theta_2$, empty replay buffer $\mathcal{D}$
2: Set target parameters $\theta_{\text{targ},1} \leftarrow \theta_1$, $\theta_{\text{targ},2} \leftarrow \theta_2$
3: repeat
4: Observe the state $s$
5: Find the action $a \sim \pi_\phi(\cdot \mid s)$, and apply it through the agent
6: Observe the next state $s'$ and the reward $r$
7: Find the next action $a' \sim \pi_\phi(\cdot \mid s')$
8: Store $(s, a, r, s', a')$ in the replay buffer $\mathcal{D}$
9: for a given number of episodes do
10: Obtain a random sample $B$ from $\mathcal{D}$
11: Compute the Bellman functions in (11) and find the soft-MSBEs (10)
12: Apply gradient descent on the soft-MSBEs: $\nabla_{\theta_i} \frac{1}{|B|} \sum_{(s,a,r,s') \in B} \big( Q_{\theta_i}(s,a) - y \big)^2$, for $i = 1, 2$
13: Reparametrize the action: $\tilde{a}_\phi(s, \xi) = \tanh\!\big( \mu_\phi(s) + \sigma_\phi(s) \odot \xi \big)$, with $\xi \sim \mathcal{N}(0, I)$
14: Apply gradient ascent on the policy: $\nabla_\phi \frac{1}{|B|} \sum_{s \in B} \Big( \min_{i=1,2} Q_{\theta_i}\!\big(s, \tilde{a}_\phi(s,\xi)\big) - \alpha \log \pi_\phi\!\big(\tilde{a}_\phi(s,\xi) \mid s\big) \Big)$
15: Apply gradient descent to tune the entropy temperature $\alpha$: $\nabla_\alpha \frac{1}{|B|} \sum_{s \in B} \big( -\alpha \log \pi_\phi(\tilde{a}_\phi(s,\xi) \mid s) - \alpha \bar{\mathcal{H}} \big)$, where $\bar{\mathcal{H}}$ is the target entropy
16: Update target networks: $\theta_{\text{targ},i} \leftarrow \rho\, \theta_{\text{targ},i} + (1-\rho)\, \theta_i$, for $i = 1, 2$
17: until convergence
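Steps 13–15 are where SAC departs from DDPG: the action is reparameterized through a tanh-squashed Gaussian, and the entropy temperature $\alpha$ is learned rather than fixed. The sketch below illustrates those steps under the assumption that the policy network returns a (mean, log-standard-deviation) pair; the names (`pi_net`, `q1`, `q2`), the target-entropy value, and the optimizers are hypothetical stand-ins, not the paper's code.

```python
import math
import tensorflow as tf

TARGET_ENTROPY = -3.0          # assumed: -dim(action space), here a 3-dimensional action
log_alpha = tf.Variable(0.0)   # log of the entropy temperature alpha (tuned in step 15)

def sample_action(pi_net, s):
    """Step 13: reparameterized, tanh-squashed Gaussian action a = tanh(mu + sigma * xi)."""
    mu, log_std = pi_net(s)                 # assumed: the policy net outputs (mean, log-std)
    std = tf.exp(log_std)
    xi = tf.random.normal(tf.shape(mu))     # xi ~ N(0, I)
    u = mu + std * xi
    a = tf.tanh(u)
    # Gaussian log-likelihood of u, plus the tanh change-of-variables correction
    logp = tf.reduce_sum(-0.5 * ((u - mu) / std) ** 2 - log_std
                         - 0.5 * math.log(2 * math.pi), axis=-1)
    logp -= tf.reduce_sum(tf.math.log(1.0 - a ** 2 + 1e-6), axis=-1)
    return a, logp

def sac_policy_and_alpha_update(s, pi_net, q1, q2, pi_opt, alpha_opt):
    """Steps 14-15: policy improvement and temperature tuning on a batch of states s."""
    # Step 14: gradient ascent on E[min(Q1, Q2) - alpha * log pi] (descent on its negative)
    with tf.GradientTape() as tape:
        a, logp = sample_action(pi_net, s)
        q_min = tf.minimum(q1(tf.concat([s, a], axis=-1)),
                           q2(tf.concat([s, a], axis=-1)))[:, 0]
        pi_loss = tf.reduce_mean(tf.exp(log_alpha) * logp - q_min)
    pi_opt.apply_gradients(zip(tape.gradient(pi_loss, pi_net.trainable_variables),
                               pi_net.trainable_variables))

    # Step 15: move alpha so the policy entropy tracks the target entropy
    with tf.GradientTape() as tape:
        alpha_loss = -tf.reduce_mean(tf.exp(log_alpha) * tf.stop_gradient(logp + TARGET_ENTROPY))
    alpha_opt.apply_gradients(zip(tape.gradient(alpha_loss, [log_alpha]), [log_alpha]))
```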
3. Adaptive Trajectory Planning Framework
3.1. Agent and Environment Representations
- Trajectory tracking reward, which reflects how well the flight trajectory matches the generated reference trajectory.
- Terminal setpoint reward, which reflects how close the ending point of the flight trajectory is to the terminal setpoint of the reference trajectory.
- Completion reward, which reflects how far the drone travels along its prescribed flight trajectory in the associated time interval. (An illustrative composition of these three terms is sketched after this list.)
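As an illustration only, the three components above could be combined as in the following sketch; the error metrics, the weights (`W_TRACK`, `W_TERMINAL`, `W_COMPLETION`), and the function signature are assumptions rather than the paper's exact reward formulas.

```python
import numpy as np

# Hypothetical weighting of the three reward terms (not the paper's values).
W_TRACK, W_TERMINAL, W_COMPLETION = 1.0, 1.0, 1.0

def episode_reward(flight_traj, ref_traj):
    """Combine the three reward components for one evaluation interval.

    flight_traj, ref_traj: (N, 3) arrays of positions sampled at matching times.
    """
    flight = np.asarray(flight_traj, dtype=float)
    ref = np.asarray(ref_traj, dtype=float)

    # Trajectory tracking: penalize the mean distance between flown and reference paths.
    r_track = -W_TRACK * np.linalg.norm(flight - ref, axis=1).mean()

    # Terminal setpoint: penalize the distance between the final flown point and the
    # terminal setpoint of the reference trajectory.
    r_terminal = -W_TERMINAL * np.linalg.norm(flight[-1] - ref[-1])

    # Completion: reward the fraction of the reference path length covered in the interval.
    ref_len = np.linalg.norm(np.diff(ref, axis=0), axis=1).sum()
    flown_len = np.linalg.norm(np.diff(flight, axis=0), axis=1).sum()
    r_completion = W_COMPLETION * min(flown_len / max(ref_len, 1e-9), 1.0)

    return r_track + r_terminal + r_completion
```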
3.2. DRL-Based Adaptive NMPH Architecture
4. Implementation and Evaluation
5. Conclusions and Future Work
- Pros:
  - The proposed design is able to dynamically adjust the parameters of the optimization problem online during flight, which is preferable to tuning them before flight and evaluating the resulting performance afterwards (a minimal sketch of this online adaptation loop follows this list).
  - The DRL model can adapt the gains of the optimization problem in response to changes in the vehicle, such as new payload configurations or replaced hardware components.
- Cons:
  - DRL algorithms employ a large number of hyperparameters. While SAC is less sensitive to hyperparameters than DDPG, finding the best combination of these parameters to achieve fast training is a challenging task.
- Limitations:
  - The present study was performed entirely within a simulation environment and does not include hardware testing results.
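The online adaptation highlighted above can be pictured as a loop in which, at every replanning cycle, the trained actor maps the current flight state to a fresh set of NMPH optimization gains before the reference trajectory is regenerated. The sketch below is purely illustrative: `actor`, `solve_nmph`, the state layout, and the gain bounds are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

GAIN_LOW = np.array([0.1, 0.1, 0.1])      # hypothetical lower bounds on the tunable NMPH gains
GAIN_HIGH = np.array([10.0, 10.0, 10.0])  # hypothetical upper bounds

def actor(state):
    """Stand-in for the trained DDPG/SAC actor: maps a flight-state vector to NMPH gains."""
    # A real actor would evaluate its policy network; this stub returns mid-range gains.
    return 0.5 * (GAIN_LOW + GAIN_HIGH)

def solve_nmph(state, setpoint, gains):
    """Stand-in for the NMPH solver: returns a straight-line reference toward the setpoint."""
    return np.linspace(state[:3], setpoint, num=20)

def replanning_cycle(state, setpoint):
    """One online adaptation cycle: retune the gains, then regenerate the reference trajectory."""
    gains = actor(state)                           # DRL policy proposes optimization parameters
    reference = solve_nmph(state, setpoint, gains) # NMPH generates the reference trajectory
    return gains, reference

gains, ref = replanning_cycle(np.zeros(9), np.array([5.0, 2.0, 1.5]))
print(gains, ref.shape)
```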
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
| Average Error | Zigzag Pattern | Square Pattern | Ascending Square Pattern | Random Setpoints (Exploration) |
|---|---|---|---|---|
| Fixed NMPH parameters | | | | |
| Adaptive NMPH-SAC | | | | |