Guidance Design for Escape Flight Vehicle against Multiple Pursuit Flight Vehicles Using the RNN-Based Proximal Policy Optimization Algorithm
Abstract
1. Introduction
- The agent, employing the fully connected neural network (FCNN), is restricted to addressing problems with a fixed dimensionality of the input state due to the fixed number of nodes in the FCNN’s input layer. Given that the dimensionality of the input state is positively correlated with the number of PFVs requiring evasion, an FCNN-based agent alone cannot address a problem characterized by a varying number of PFVs in the escape-and-pursuit scenario. To address this challenging problem, we design a composite architecture integrating both the RNN and the FCNN. The proposed architecture employs the RNN to effectively handle the varying number of PFVs. Specifically, (1) the input of the RNN consists of a series of data sets, each comprising six elements representing the relative position and velocity vectors between the EFV and the ith PFV; (2) the number of data sets corresponds to the number of PFVs requiring evasion; (3) the input state, whose dimensionality equals six times the number of PFVs requiring evasion, is processed once per PFV, with each processing step involving six elements; (4) the output of the RNN is defined as the last hidden state of the RNN. Regarding the FCNN, the number of nodes in its input layer matches the dimensionality of the RNN’s hidden state. Consequently, the RNN and FCNN can be interconnected, enabling the FCNN to generate guidance commands for the EFV (a minimal numerical sketch of this reduction is provided after this list).
- The hidden state of the RNN is crucial for generating a reasonable output based on the integration of both previous and current input states; thus, it is essential to utilize the hidden state when training the agent of the EFV. In conventional DRL techniques, the training data, in the form of (state, action, reward, next state) tuples with each element having a fixed dimensionality, are produced through the ongoing interactions between the agent and its environment and then stored in the replay buffer. Subsequently, a batch of training data is randomly selected from the replay buffer to facilitate the training of the agent. To address the challenge of variable state dimensionality, we have developed a two-step strategy. In the first step, we incorporate both the current hidden state and the next hidden state into each training datum, extending the tuple to (state, current hidden state, action, reward, next state, next hidden state). In the second step, we introduce an innovative dual-layer batch training approach. Specifically, the outer-layer batch is constructed by segmenting the replay buffer according to the number of PFVs, thus ensuring that all training data in the same outer-layer batch possess consistent dimensionality. The inner-layer batch is generated by randomly selecting training data from the corresponding outer-layer batch. These data, characterized by consistent dimensionality, are then utilized to train the agent of the EFV using the method employed in conventional DRL techniques (a buffer sketch is also provided after this list).
- The purpose of this paper is to generate optimal guidance commands that enable the EFV to effectively evade the PFVs while maximizing the residual velocity. To address this problem, a novel reward function is designed by taking into account the prospective states (i.e., evasion distance and residual velocity) derived from a virtual scenario in which the guidance commands of the EFV are predefined, facilitating rapid acquisition of feasible evasion distances and residual velocities. Because this design uses future information for current decision-making, the agent of the EFV, which continuously generates guidance commands according to the real-time situations of the EFV and the PFVs, can be trained more efficiently.
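- To make the first design point above concrete, the following minimal sketch (plain NumPy with toy, untrained weights; the names are ours, not the authors’ implementation) shows how a varying number of six-element PFV nodes is reduced, one recurrence step per node, to a fixed-size vector suitable for the FCNN’s fixed input width. The 256-dimensional hidden state matches the parameter table reported later in the paper.

```python
import numpy as np

NODE_DIM, HIDDEN_DIM = 6, 256

# Toy, untrained weights purely for illustration.
W_in = np.random.randn(HIDDEN_DIM, NODE_DIM) * 0.01
W_h = np.random.randn(HIDDEN_DIM, HIDDEN_DIM) * 0.01

def rnn_last_hidden(pfv_nodes):
    """Process one six-element node per step and return the last hidden state.

    The number of steps equals the number of PFVs requiring evasion, yet the
    returned vector always has HIDDEN_DIM elements, matching the fixed input
    width of the FCNN.
    """
    h = np.zeros(HIDDEN_DIM)
    for node in pfv_nodes:          # one recurrence step per PFV node
        h = np.tanh(W_in @ node + W_h @ h)
    return h

print(rnn_last_hidden(np.random.randn(1, NODE_DIM)).shape)  # (256,)
print(rnn_last_hidden(np.random.randn(3, NODE_DIM)).shape)  # (256,)
```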
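- The dual-layer batching from the second bullet can likewise be sketched in a few lines (illustrative Python; the class and method names are hypothetical, and the authors’ exact data structures are not reproduced here): the outer layer is a set of buckets keyed by the number of PFVs, and an inner batch is drawn from a single bucket so that every sampled tuple, now extended with the current and next hidden states, has consistent dimensionality.

```python
import random
from collections import defaultdict

class DualLayerReplayBuffer:
    """Outer layer: one bucket per number of PFVs requiring evasion.
    Inner layer: a mini-batch drawn at random from a single bucket, so all
    sampled tuples share the same state dimensionality."""

    def __init__(self):
        self.buckets = defaultdict(list)  # num_pfvs -> list of training tuples

    def store(self, state, hidden, action, reward, next_state, next_hidden):
        num_pfvs = len(state)             # state: list of six-element nodes
        self.buckets[num_pfvs].append(
            (state, hidden, action, reward, next_state, next_hidden))

    def sample(self, batch_size):
        non_empty = [k for k, v in self.buckets.items() if v]
        num_pfvs = random.choice(non_empty)          # pick an outer-layer bucket
        bucket = self.buckets[num_pfvs]
        return random.sample(bucket, min(batch_size, len(bucket)))  # inner batch
```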
2. System Model and Problem Formulation
2.1. System Model
- Both the EFV and the PFVs can accurately observe each other’s present and historical positions and velocities, and each can use this information to generate its own guidance commands. Nonetheless, their future positions and velocities are hard to predict because of the interacting behavior of the EFV and each PFV.
- The EFV is characterized by its plane-symmetrical structure, with its guidance commands formulated through the DRL technique. These commands are primarily composed of the composite angle of attack and the angle of heel, each constrained to its own prescribed range. Conversely, each PFV is axially symmetric, with its guidance commands derived using the proportional navigation technique [25]. These commands comprise the angle of attack and the angle of sideslip, which share the same prescribed range.
- The EFV is capable of detecting a PFV when the distance between them is less than 2000.0 m, and a PFV is able to capture the EFV if the distance between them is less than 20.0 m. More precisely, the exact number of PFVs requiring evasion is determined by two factors: firstly, the total number of PFVs present in the scenario; secondly, the number of instances in which the distance between the EFV and a PFV is less than 2000.0 m. Furthermore, to successfully evade the PFVs, the EFV is required to maintain a distance larger than 20.0 m from each PFV during the entire escape-and-pursuit process (see the sketch after this list).
- For both the EFV and the PFVs, the time interval between the generation of successive guidance commands is maintained at a fixed value. Specifically, the step time remains constant throughout the escape-and-pursuit simulation.
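These thresholds translate directly into simple geometric checks. The helper functions below are a schematic rendering under the stated assumptions (2000.0 m detection range, 20.0 m capture range); the function names are hypothetical.

```python
import math

DETECTION_RANGE_M = 2000.0  # the EFV detects a PFV inside this distance
CAPTURE_RANGE_M = 20.0      # a PFV captures the EFV inside this distance

def pfvs_requiring_evasion(efv_pos, pfv_positions):
    """Indices of the PFVs currently inside the EFV's detection range."""
    return [i for i, p in enumerate(pfv_positions)
            if math.dist(efv_pos, p) < DETECTION_RANGE_M]

def evasion_successful_so_far(efv_pos, pfv_positions):
    """True while the EFV keeps more than 20.0 m from every PFV."""
    return all(math.dist(efv_pos, p) > CAPTURE_RANGE_M for p in pfv_positions)
```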
2.2. Evasion Distance and Residual Velocity
2.3. Problem Formulation and Analysis
- The kinematics model of the EFV is expressed by (1), which means that the position and velocity of the EFV are readily available if the output of Stage (1) has been determined.
- Given its own position and velocity, the ith PFV can readily obtain its guidance commands by executing the assumed proportional navigation method, once the position and velocity of the EFV have been determined.
- The kinematics model of the PFV is also expressed by (1). Therefore, the position and velocity of the ith PFV are readily available if the output of Stage (3) has been determined.
- The only “independent variable” that can vary actively in every single step of simulating the escape-and-pursuit scenario, as illustrated by Figure 3, is the guidance command of the EFV, namely, the composite angle of attack and the angle of heel, which constitute the output of Stage (1). A schematic step of this loop is sketched below.
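The four observations above can be collected into a single simulation step, sketched below in plain Python. The callables passed in (the agent, the kinematics propagators implementing Equation (1), and the proportional navigation law) are placeholders for the models described in this section, not the authors’ code.

```python
def simulate_step(efv_state, pfv_states, agent,
                  efv_kinematics, pn_guidance, pfv_kinematics, dt):
    """One step of the escape-and-pursuit loop.

    Only the EFV guidance command produced in Stage (1) is a free decision;
    every other quantity follows deterministically from it.
    """
    # Stage (1): the agent outputs the composite angle of attack and
    # the angle of heel for the EFV.
    composite_aoa, heel_angle = agent(efv_state, pfv_states)

    # Propagate the EFV with its kinematics model (Equation (1)).
    efv_state = efv_kinematics(efv_state, composite_aoa, heel_angle, dt)

    # Each PFV derives its own commands by proportional navigation from the
    # updated EFV state, then is propagated with the same kinematics model.
    new_pfv_states = []
    for pfv_state in pfv_states:
        cmd = pn_guidance(pfv_state, efv_state)
        new_pfv_states.append(pfv_kinematics(pfv_state, cmd, dt))

    return efv_state, new_pfv_states
```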
3. The Proposed RNN-Based PPO Algorithm
- We design the framework of the proposed RNN-based PPO algorithm, utilizing the RNN to dynamically manage the dimensionality of the environment state, which varies with the number of PFVs requiring evasion at different time instants within a single escape-and-pursuit simulation.
- We design the structure of the training data by incorporating the hidden state of the RNN into it. Furthermore, we engineer a dual-layer batch method to adeptly manage the dimensional variation across environment states, enhancing both the stability and the efficiency of the training task.
- We design the architecture of both the actor and critic networks by integrating RNN and FCNN. Furthermore, we propose two distinct training strategies depending on whether the model is initialized using pre-trained weights from scenarios involving a smaller number of PFVs.
- We design an elaborate reward function by creating a virtual escape-and-pursuit scenario, enabling rapid calculation of the future evasion distance and residual velocity for generating the current guidance commands of the EFV (an illustrative sketch of this idea follows this list).
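The concrete reward of Section 3.4 is not reproduced in this excerpt; the fragment below only illustrates the idea of the last bullet under loudly labeled assumptions: a copy of the current engagement is propagated forward with predefined EFV guidance commands, and the resulting prospective evasion distance and residual velocity are combined with placeholder weights (`w_dist` and `w_vel` are ours, not the paper’s values).

```python
def prospective_reward(engagement_snapshot, virtual_rollout, horizon_steps,
                       w_dist=1.0, w_vel=1.0, capture_range_m=20.0):
    """Reward built from a virtual continuation of the engagement.

    virtual_rollout propagates a copy of the engagement for horizon_steps
    using predefined EFV guidance commands and returns the minimum EFV-PFV
    distance reached and the EFV velocity at the end of the rollout.
    """
    evasion_distance, residual_velocity = virtual_rollout(
        engagement_snapshot, horizon_steps)
    if evasion_distance < capture_range_m:
        return -1.0  # prospective capture: strongly discourage this command
    return w_dist * evasion_distance + w_vel * residual_velocity
```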
3.1. Design of the Interaction between the Agent and the Environment
- The environment is composed of a single EFV and up to three PFVs, with the agent integrated in the EFV capable of continually generating guidance commands (i.e., the composite angle of attack and the angle of heel) to evade the PFVs.
- At every interaction step, the state of the environment is represented by a sequence of nodes, where the number of nodes corresponds to the number of PFVs detected by the EFV. Each node is composed of six elements denoting the relative position and velocity between the EFV and the corresponding PFV. Moreover, the nodes corresponding to PFVs detected earlier are placed first and fed to the agent in that order. Specifically, the state illustrated in Figure 5 indicates that the first PFV is detected initially, followed by the second, and finally the third PFV (see the state-construction sketch after this list).
- In the proposed RNN-based PPO algorithm, both the actor and critic networks are constituted by a combination of the RNN and the FCNN. The actor network receives the state of the environment as input and generates the EFV’s guidance commands as output. The critic network’s input encapsulates the environment state, the actor network’s action, and the reward feedback from the environment, and its output is the Q-value corresponding to this information. The Q-value is a pivotal metric in the parameter-updating process of both the actor and critic networks.
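As an illustration of the ordering rule, the snippet below assembles the state sequence with the nodes sorted by detection time, each node holding the six relative position and velocity components (illustrative Python with hypothetical field names).

```python
import numpy as np

def build_state(efv_pos, efv_vel, detected_pfvs):
    """detected_pfvs: records for PFVs currently within detection range,
    each with 'pos', 'vel' (3-element arrays) and 'detect_time'."""
    ordered = sorted(detected_pfvs, key=lambda pfv: pfv["detect_time"])
    nodes = [np.concatenate([pfv["pos"] - efv_pos, pfv["vel"] - efv_vel])
             for pfv in ordered]              # one six-element node per PFV
    return np.stack(nodes)                    # shape: (num_detected_pfvs, 6)
```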
3.2. Design of the Replay Buffer
3.3. Design of the Actor and Critic Networks
- The lower section of the figure illustrates the various environment states, comprising the relative states between the EFV and the first, second, and third PFVs, respectively. In the depicted scenario, the EFV initially evades the first PFV and subsequently evades the second and third PFVs. Upon successfully evading the first PFV, only the relevant input information, namely, the relative states between the EFV and the second and third PFVs, is fed to the RNN for processing.
- Drawing upon the step delineated in the figure, in the RNN computation phase the state nodes are sequentially fed into the RNN’s input layer. This procedure yields a sequence of hidden states, from which the hidden state located at the last position is selected as the output of the RNN. In the FCNN computation phase, the input is the output derived from the RNN computation, and the phase culminates in the generation of the guidance commands for the EFV as its output (a network sketch is given after this list).
- Begin the training task by randomly initializing the weights in both the actor and critic networks. This approach signifies the model’s initiation without prior knowledge, enabling it to autonomously develop the evasion guidance strategy.
- Implement an incremental learning strategy by loading the model’s weights from previous scenarios. This approach enhances the model’s proficiency in mastering the evasion guidance strategy, building upon the pre-learned similar knowledge.
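A minimal PyTorch sketch of the actor architecture described above follows (the class and variable names are ours; the 256-dimensional hidden state and the three-layer FCNN follow the parameter table reported later, while the tanh output scaling is an added assumption because the command ranges are not reproduced in this excerpt). The critic follows the same RNN-plus-FCNN pattern with a scalar output head. Because none of the weights depend on the number of PFVs, the incremental strategy in the last bullet amounts to loading a state dict trained in a scenario with fewer PFVs.

```python
import torch
import torch.nn as nn

class RnnFcnnActor(nn.Module):
    """RNN front end (variable number of PFV nodes) + FCNN head
    (two guidance commands: composite angle of attack, angle of heel)."""

    def __init__(self, node_dim=6, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(node_dim, hidden_dim, num_layers=1, batch_first=True)
        self.fcnn = nn.Sequential(           # 3-layer FCNN, 256 nodes per hidden layer
            nn.Linear(hidden_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 2), nn.Tanh(),    # normalized commands in [-1, 1]
        )

    def forward(self, pfv_nodes):            # pfv_nodes: (num_pfvs, node_dim)
        outputs, _ = self.rnn(pfv_nodes.unsqueeze(0))
        last_hidden = outputs[0, -1]          # RNN output = last hidden state
        return self.fcnn(last_hidden)         # (2,) guidance command vector

actor = RnnFcnnActor()
print(actor(torch.randn(2, 6)))               # works for any number of PFVs

# Incremental strategy: start from weights learned with fewer PFVs.
# actor.load_state_dict(torch.load("actor_single_pfv.pt"))  # path is hypothetical
```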
3.4. Design of the Reward Function
4. Simulation Results and Discussions
4.1. Training Result of the RNN-Based PPO Algorithm in the Designed Three Scenarios
- As illustrated in Figure 10a, the episode reward increases consistently, indicating that the EFV learns to evade the PFV successfully under the designed reward function and hyperparameters. From the perspective of the DRL technique, this trend confirms that the training task is executed successfully.
- Figure 10b shows a downward trend in residual velocity as the episode number increases. The EFV’s energy reserves are inherently limited, and a significant portion is devoted to modifying its flight trajectory to evade the PFV; this energy expenditure results in a consistent reduction in residual velocity as the evasion distance increases.
4.2. Comparative Analysis of the RNN-Based PPO Algorithm and The Conventional FCNN-Based PPO Algorithm
4.3. Improvement for Future Work
- It is posited that the EFV and PFV can obtain each other’s positions and velocities continuously, accurately, and instantaneously. This assumption simplifies the problem to a certain extent. Future efforts will concentrate on refining the algorithm through incremental training involving intermittent, erroneous, and delayed data to enhance the evasion model’s adaptability.
- Given the limited computational resources available on the EFV, the RNN was selected to facilitate the training of the algorithm, rather than the transformer model, which is prevalent in contemporary artificial intelligence research. Our future work will explore the compatibility of embedded intelligent processors with the transformer model and aim to replace the current RNN with the transformer to enhance the algorithm’s adaptability.
- This paper primarily investigates an intelligent evasion model designed for scenarios in which a single EFV evades multiple PFVs. There is a potential risk that the intelligent evasion model might underperform in complex scenarios where multiple EFVs collaboratively evade multiple PFVs. The primary reason for this is that in such scenarios, each EFV must effectively evade multiple PFVs while simultaneously avoiding collisions with fellow EFVs. This scenario constitutes a multi-agent joint reinforcement learning challenge, extending beyond the single-agent reinforcement learning framework addressed in this paper. Our future work will involve related research into these more interesting and challenging problems.
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Garcia, E.; Casbeer, W.D.; Pachter, M. Design and Analysis of State Feedback Optimal Strategies for the Differential Game of Active Defense. IEEE Trans. Autom. Control 2019, 64, 553–568.
- Sinha, A.; Kumar, S.R.; Mukherjee, D. Nonsingular Impact Time Guidance and Control Using Deviated Pursuit. Aerosp. Sci. Technol. 2021, 115, 106776.
- Cheng, L.; Jiang, F.H.; Wang, Z.B.; Li, J.F. Multiconstrained Real-time Entry Guidance Using Deep Neural Networks. IEEE Trans. Aerosp. Electron. Syst. 2021, 57, 325–340.
- Peng, C.; Zhang, H.W.; He, Y.X.; Ma, J.J. State-Following-Kernel-Based Online Reinforcement Learning Guidance Law against Maneuvering Target. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 5784–5797.
- Shalumov, V. Cooperative Online Guide-launch-guide Policy in a Target-missile-defender Engagement Using Deep Reinforcement Learning. Aerosp. Sci. Technol. 2020, 104, 105996.
- Liu, X.D.; Li, G.F. Adaptive Sliding Mode Guidance with Impact Time and Angle Constraints. IEEE Access 2020, 8, 26926–26932.
- Zhou, J.L.; Yang, J.Y. Distributed Guidance Law Design for Cooperative Simultaneous Attacks with Multiple Missiles. J. Guid. Control Dyn. 2016, 39, 2436–2444.
- Zhai, C.; He, F.H.; Hong, Y.G.; Wang, L.; Yao, Y. Coverage-based Interception Algorithm of Multiple Interceptors against the Target Involving Decoys. J. Guid. Control Dyn. 2016, 39, 1647–1653.
- Liang, Z.X.; Ren, Z. Tentacle-Based Guidance for Entry Flight with No-Fly Zone Constraint. J. Guid. Control Dyn. 2018, 41, 991–1000.
- Liang, Z.X.; Liu, S.Y.; Li, Q.D.; Ren, Z. Lateral Entry Guidance with No-Fly Zone Constraint. Aerosp. Sci. Technol. 2017, 60, 39–47.
- Zhao, D.J.; Song, Z.Y. Reentry Trajectory Optimization with Waypoint and No-Fly Zone Constraints Using Multiphase Convex Programming. Acta Astronaut. 2017, 137, 60–69.
- Zhou, Q.H.; Liu, Y.F.; Qi, N.M.; Yan, J.F. Anti-warning Based Anti-interception Avoiding Penetration Strategy in Midcourse. Acta Aeronaut. Astronaut. Sin. 2017, 38, 319922.
- Yu, W.B.; Chen, W.C.; Jiang, Z.G.; Zhang, W.Q.; Zhao, P.L. Analytical Entry Guidance for Coordinated Flight with Multiple No-fly-zone Constraints. Aerosp. Sci. Technol. 2019, 84, 273–290.
- Yan, T.; Cai, Y.L.; Xu, B. Evasion Guidance Algorithms for Air-breathing Hypersonic Vehicles in Three-player Pursuit-evasion Games. Chin. J. Aeronaut. 2020, 33, 3423–3436.
- Wang, Y.Q.; Ning, G.D.; Wang, X.F. Maneuver Penetration Strategy of Near Space Vehicle Based on Differential Game. Acta Aeronaut. Astronaut. Sin. 2020, 41, 724276.
- Shen, Z.P.; Yu, J.L.; Dong, X.W.; Hua, Y.Z.; Ren, Z. Penetration Trajectory Optimization for the Hypersonic Gliding Vehicle Encountering Two Interceptors. Aerosp. Sci. Technol. 2022, 121, 107363.
- Nath, S.; Ghose, D. Worst-Case Scenario Evasive Strategies in a Two-on-One Engagement Between Dubins’ Vehicles With Partial Information. IEEE Control Syst. Lett. 2022, 7, 25–30.
- He, S.M.; Shin, H.S.; Tsourdos, A. Computational Missile Guidance: A Deep Reinforcement Learning Approach. J. Aerosp. Inf. Syst. 2021, 18, 571–582.
- Jiang, L.; Nan, Y.; Zhang, Y.; Li, Z. Anti-Interception Guidance for Hypersonic Glide Vehicle: A Deep Reinforcement Learning Approach. Aerospace 2022, 9, 424.
- Shen, Z.P.; Yu, J.L.; Dong, X.W.; Ren, Z. Deep Neural Network-based Penetration Trajectory Generation for Hypersonic Gliding Vehicles Encountering Two Interceptors. In Proceedings of the 2022 41st Chinese Control Conference (CCC), Hefei, China, 25–27 June 2022; pp. 3392–3397.
- Guo, Y.H.; Jiang, Z.J.; Huang, H.Q.; Fan, H.J.; Weng, W.Y. Intelligent Maneuver Strategy for a Hypersonic Pursuit-Evasion Game Based on Deep Reinforcement Learning. Aerospace 2023, 10, 783.
- Hui, J.P.; Wang, R.; Yu, Q.D. Generating New Quality Flight Corridor for Reentry Aircraft Based on Reinforcement Learning. Acta Aeronaut. Astronaut. Sin. 2022, 9, 325960.
- Pham, D.H.; Lin, C.M.; Giap, V.N.; Cho, H.Y. Design of Missile Guidance Law Using Takagi-Sugeno-Kang (TSK) Elliptic Type-2 Fuzzy Brain Imitated Neural Networks. IEEE Access 2023, 11, 53687–53702.
- Pham, D.H.; Lin, C.M.; Huynh, T.T.; Cho, H.Y. Wavelet Interval Type-2 Takagi-Kang-Sugeno Hybrid Controller for Time-series Prediction and Chaotic Synchronization. IEEE Access 2022, 10, 104313–104327.
- Qian, X.F. Missile Flight Aerodynamics; Beijing Institute of Technology Press: Beijing, China, 2014.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. Available online: http://arxiv.org/abs/1707.06347 (accessed on 28 August 2017).
- Qi, C.Y.; Wu, C.F.; Lei, L.; Li, X.L.; Cong, P.Y. UAV Path Planning Based on the Improved PPO Algorithm. In Proceedings of the 2022 Asia Conference on Advanced Robotics, Automation, and Control Engineering (ARACE), Qingdao, China, 26–28 August 2022; pp. 193–199.
- Xiao, Q.H.; Jiang, L.; Wang, M.M.; Zhang, X. An Improved Distributed Sampling PPO Algorithm Based on Beta Policy for Continuous Global Path Planning Scheme. Sensors 2023, 23, 6101.
- Tan, Z.Y.; Karaköse, M. Proximal Policy Based Deep Reinforcement Learning Approach for Swarm Robots. In Proceedings of the 2021 Zooming Innovation in Consumer Technologies Conference (ZINC), Novi Sad, Serbia, 26–27 May 2021; pp. 166–170.
| Parameter | Value |
|---|---|
| Number of layers in the RNN | 1 |
| Dimensionality of the hidden state | 256 |
| Number of layers in the FCNN | 3 |
| Number of nodes in each hidden layer of the FCNN | 256 |
| Learning rate | 1 |
| Total Number of PFVs in the Scenario | PFV Index | Evasion Distance of the RNN-Based PPO Algorithm (m) | Evasion Distance of the FCNN-Based PPO Algorithm (m) |
|---|---|---|---|
| 1 | 1 | 20.76 (>20.0, success) | 25.41 (>20.0, success) |
| 2 | 1 | 22.47 (>20.0, success) | 25.41 (>20.0, success) |
| 2 | 2 | 34.22 (>20.0, success) | 10.32 (<20.0, fail) |
| 3 | 1 | 26.72 (>20.0, success) | 25.41 (>20.0, success) |
| 3 | 2 | 37.27 (>20.0, success) | 10.32 (<20.0, fail) |
| 3 | 3 | 42.22 (>20.0, success) | 3.12 (<20.0, fail) |