Runtime Verification-Based Safe MARL for Optimized Safety Policy Generation for Multi-Robot Systems
Abstract
1. Introduction
2. Related Works
2.1. Safety Policy of MARL
2.2. Policy Optimization in MARL
2.3. The Relation between Our Work and Existing Works
3. Preliminaries
4. Optimized Safety Policy-Generation Framework
4.1. Modeling @ Runtime
4.2. Runtime Verification
- Model checking: Describe the runtime model (the SCMG) in the PRISM-games modeling language, specify the safety constraints and functional requirements in rPATL, and verify whether the SCMG satisfies the rPATL specifications.
- Counterexample discovery: If model checking reveals a violation, a counterexample is produced, i.e., a trajectory (state and action sequence) that violates a safety property or a functional requirement. Counterexamples expose possible unexpected behaviors of the robots in the multi-robot system.
- Counterexample-guided runtime model refinement: Guided by the counterexamples, the accuracy and precision of the model are improved by splitting states (or actions) into refined states (or actions) or by eliminating unsafe or unexpected states (or actions).
- Repeat loop: Iterate model checking, counterexample discovery, and refinement until all functional requirements and safety constraints are satisfied or no further refinement is possible, as sketched below.
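The verify-refine loop can be summarized in code. The following is a minimal Python sketch under the assumption of generic `check`, `counterexample`, and `refine` callables standing in for the rPATL query, counterexample extraction, and model refinement steps; these are illustrative placeholders, not the actual PRISM-games API.

```python
from typing import Callable, List, Optional, Tuple


def verify_and_refine(
    model: object,
    specifications: List[str],
    check: Callable[[object, str], bool],              # wraps an rPATL model-checking query
    counterexample: Callable[[object, str], object],   # returns a violating trajectory
    refine: Callable[[object, list], Optional[object]],  # splits/removes states or actions
    max_iterations: int = 20,
) -> Tuple[object, bool]:
    """Iterate model checking, counterexample discovery, and refinement
    until every specification holds or no further refinement is possible."""
    for _ in range(max_iterations):
        violated = [spec for spec in specifications if not check(model, spec)]
        if not violated:
            return model, True                   # all safety/functional specs satisfied
        traces = [counterexample(model, spec) for spec in violated]
        refined = refine(model, traces)          # counterexample-guided refinement
        if refined is None:                      # refinement cannot be executed
            return model, False
        model = refined
    return model, False
```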
4.3. Constraint Training
Algorithm 1. SCPO

1: Initialization
2: for iteration N = 0, 1, …, n do
3:  A.
4:  for each agent do
5:   Collect experience, considering partial observability, and share observations with the other agents
6:   Store transitions in the experience replay buffer
7:   Sample a batch from the experience replay buffer
8:   Compute advantage estimates with GAE
9:   Update the policy by maximizing the agent’s return subject to the cost constraint using an optimizer, with a trust region to ensure small updates
10:  Adjust the step size using a backtracking line search to satisfy the expected cost constraint
11:  Update the objective, which includes intrinsic rewards and safety terms, while satisfying the safety constraints
12:  If the policy is infeasible, apply a TRPO step to the cost surrogate term to recover a feasible policy
13:  end for
14:  Runtime verification, ensuring that the cumulative risk and the probability of risk events are bounded
15: end for
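As a rough illustration of the constrained update in Algorithm 1, the sketch below shows GAE advantage estimation and a single trust-region step with a backtracking line search over the expected-cost constraint, falling back to a cost-reducing recovery step when the policy is infeasible. It is a simplified NumPy sketch (identity Fisher matrix, linearized cost), not the authors' implementation; `kl_fn`, `cost_limit`, and the flat parameter vector `theta` are assumed inputs.

```python
import numpy as np


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory (line 8)."""
    adv, last = np.zeros_like(rewards, dtype=float), 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv


def scpo_step(theta, grad, cost_grad, cost_value, cost_limit, kl_fn,
              max_kl=0.01, backtrack_coeff=0.8, max_backtracks=10):
    """One trust-region update (lines 9-12): move along the reward-improving
    direction, then backtrack until both the KL trust region and the
    linearized expected-cost constraint are satisfied."""
    direction = grad / (np.linalg.norm(grad) + 1e-8)   # identity-Fisher proxy
    step = np.sqrt(2.0 * max_kl) * direction
    for i in range(max_backtracks):
        candidate = theta + (backtrack_coeff ** i) * step
        expected_cost = cost_value + cost_grad @ (candidate - theta)
        if kl_fn(theta, candidate) <= max_kl and expected_cost <= cost_limit:
            return candidate                            # feasible, safe update
    # Infeasible policy: recovery step that only reduces the cost surrogate.
    recovery = -cost_grad / (np.linalg.norm(cost_grad) + 1e-8)
    return theta + np.sqrt(2.0 * max_kl) * recovery
```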
5. Experimental Evaluation
5.1. Case Study
5.2. Experimental Results
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Abbreviations | Meanings |
---|---|
MRS | multi-robot systems |
RL | reinforcement learning |
MARL | multi-agent reinforcement learning |
SCMG | safety-constrained Markov Game |
RV | runtime verification |
rPATL | Probabilistic Alternating-time Temporal Logic with Rewards |
SCPO | safety-constrained policy optimization |
AGVs | automated guided vehicles |
RWARE | multi-robot warehouse environment |
CMIX | Cooperative Multi-agent Information eXchange |
QMIX | Monotonic Value Function Factorization for Deep Multi-Agent Reinforcement Learning |
CMDP | Constrained Markov Decision Process |
MDP | Markov Decision Process |
LTL | Linear Temporal Logic |
QV | quantitative verification |
IPPO | independent proximal policy optimization |
MAPPO | Multi-Agent Proximal Policy Optimization |
HAPPO | Heterogeneous-Agent Proximal Policy Optimization |
SNAC | Shared Network Actor-Critic |
SEAC | Shared Experience Actor-Critic |
IAC | Independent Actor-Critic |
MG | Markov Game |
MBSE | Model-Based Systems Engineering |
GAE | Generalized Advantage Estimation |
PATL | Probabilistic Alternating-time Temporal Logic |
ATL | Alternating-time Temporal Logic |
CTL | Computation Tree Logic |
TRPO | Trust Region Policy Optimization |
KL | Kullback-Leibler Divergence |
Property | Method | Result |
---|---|---|
Property 1: <<player1, player2>>Pmax = ? [F “reach_target”] | Verification | 1.0 |
Property 2: <<player1, player2>>Rmin = ? [C] | Verification | 0.0 |
Property 3: <<player1, player2>>Pmin = ? [F “safe”] | Verification | 1.0 |
Millions of Steps | SCPO | SEAC | SNAC | IAC |
---|---|---|---|---|
10 M | 3 | 15 | 18 | 23 |
20 M | 2 | 13 | 17 | 21 |
30 M | 1 | 12 | 16 | 20 |
40 M | 1 | 11 | 15 | 19 |
50 M | 0 | 10 | 15 | 18 |
60 M | 0 | 9 | 14 | 17 |
70 M | 0 | 8 | 13 | 16 |
80 M | 0 | 7 | 12 | 15 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).