Research on Self-Learning Control Method of Reusable Launch Vehicle Based on Neural Network Architecture Search
Abstract
1. Introduction
2. Problem Description
2.1. General Framework of Self-Learning Control for Launch Vehicle Recovery
2.2. Deep Network Architecture and Automatic Optimization of Hyperparameters
3. Algorithm Design
3.1. Neural Network Architecture Search Algorithm
3.1.1. Lightweight Search Design
3.1.2. Particle Swarm Optimization Algorithm
Algorithm 1: NAS algorithm based on the improved MOHPSO
Input: the search space; the particle swarm position range; the maximum particle velocity; the number of particles; the number of iterations
Output: the optimized neural network architecture model
1. Initialize the particle swarm and the non-dominated-solution Archive Set according to the position range and maximum velocity
2. for i ← 1 to the number of iterations do
3.   for j ← 1 to the number of particles do
4.     Select an architecture, train the model, and compute the particle fitness according to Formula (15)
5.     if the particle fitness exceeds the Pbest fitness then
6.       update Pbest
7.     if the particle fitness dominates members of the Archive Set then
8.       update the Archive Set
9.   end for
10.  Update the velocity and position of the particles according to Formulas (12) and (13)
11. end for
12. return the optimized architecture
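The search loop of Algorithm 1 can be sketched in plain Python. This is a minimal illustrative sketch, not the paper's implementation: the multi-objective archive is reduced to a single global best, and a hypothetical target architecture stands in for the trained-network fitness of Formula (15).

```python
import random

# Toy stand-in fitness: the paper trains each candidate network and scores it
# with Formula (15); here a hypothetical target architecture replaces that,
# so only the search loop itself is demonstrated.
SEARCH_SPACE = {"hidden_layers": [1, 2, 3, 4], "neurons": [64, 128, 256]}
TARGET = {"hidden_layers": 3, "neurons": 128}  # hypothetical optimum

def decode(position):
    # Map each continuous coordinate in [0, 1) onto its discrete option list.
    return {key: opts[int(p * len(opts))]
            for (key, opts), p in zip(SEARCH_SPACE.items(), position)}

def fitness(arch):
    # Higher is better; 0 only at the hypothetical target architecture.
    return (-abs(arch["hidden_layers"] - TARGET["hidden_layers"])
            - abs(arch["neurons"] - TARGET["neurons"]) / 64.0)

def pso_nas(n_particles=12, n_iters=40, v_max=0.2, seed=1):
    rng = random.Random(seed)
    dim = len(SEARCH_SPACE)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(decode(p)) for p in pos]
    best = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                # Velocity update: inertia + cognitive + social terms,
                # clipped to the maximum particle velocity (cf. Formula (12)).
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * rng.random() * (pbest[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, vel[i][d]))
                # Position update, kept inside [0, 1) (cf. Formula (13)).
                pos[i][d] = min(max(pos[i][d] + vel[i][d], 0.0), 0.999)
            f = fitness(decode(pos[i]))
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return decode(gbest), gbest_fit
```

Encoding the discrete architecture choices as continuous coordinates lets the standard PSO velocity/position updates apply unchanged; only the decode step is search-space specific.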
3.2. Bayesian Optimization Algorithm
Algorithm 2: Bayesian reinforcement learning hyperparameter optimization
Input: the combination of hyperparameters to be optimized for launch vehicle recovery reinforcement learning; the maximum number of iterations; the maximum number of evaluations; the median fitness of the first five evaluations
Output: the optimal combination of hyperparameters
1. Initialization: generate the initial evaluation points randomly
2. while the maximum number of evaluations has not been reached:
3.   if the maximum number of iterations has not been reached:
4.     maximize the acquisition function to select the next evaluation point
5.     train the model to obtain the model parameters
6.     calculate the fitness value according to Equation (5)
7.     if the fitness stays below the median of the first five evaluations: end the algorithm early
8.   end if
9.   update the TPE surrogate model and the optimal hyperparameter values
10. end while
11. return the optimal combination of hyperparameters
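The loop of Algorithm 2 can be sketched as follows. This is a toy sketch under stated assumptions: the TPE surrogate and acquisition maximization are replaced by a simple nearest-to-best candidate rule, and `toy_objective` stands in for training the recovery policy and scoring it with Equation (5). Only the evaluation budget and the median-based early stop mirror the algorithm.

```python
import random
import statistics

def toy_objective(hp):
    # Stand-in for "train the RL policy, score with Equation (5)";
    # peaks at a hypothetical best learning rate and discount factor.
    return -((hp["lr"] - 0.01) ** 2) * 1e4 - ((hp["gamma"] - 0.95) ** 2) * 1e2

def sample_hp(rng):
    return {"lr": rng.uniform(1e-5, 0.1), "gamma": rng.uniform(0.8, 0.999)}

def bayes_opt(max_evals=40, n_init=5, patience=10, seed=0):
    rng = random.Random(seed)
    history = [(toy_objective(hp), hp)
               for hp in (sample_hp(rng) for _ in range(n_init))]
    # Early-stop baseline: median fitness of the first five evaluations.
    baseline = statistics.median(f for f, _ in history)
    stale = 0
    while len(history) < max_evals:
        # Surrogate stand-in: real TPE splits the history into "good"/"bad"
        # sets and maximizes the density ratio l(x)/g(x); here we just draw
        # candidates and evaluate the one nearest the current best point.
        best_hp = max(history, key=lambda t: t[0])[1]
        cands = [sample_hp(rng) for _ in range(20)]
        hp = min(cands, key=lambda c: abs(c["lr"] - best_hp["lr"])
                                      + abs(c["gamma"] - best_hp["gamma"]))
        f = toy_objective(hp)
        history.append((f, hp))
        if f < baseline:
            stale += 1
            if stale >= patience:  # end the search early (Algorithm 2, step 7)
                break
        else:
            stale = 0
    return max(history, key=lambda t: t[0])
```

The median-of-early-evaluations threshold is a cheap pruning signal: runs that never clear it are unlikely to produce a competitive policy, so the budget is spent elsewhere.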
4. Simulation Analysis
4.1. Simulation Parameter Settings
4.2. Analysis of Training Results
4.3. Simulation Verification
4.4. Generalization Ability Verification
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Wang, Z.G.; Luo, S.B.; Wu, J.J. Recent Progress on Reusable Launch Vehicle; National University of Defense Technology Press: Changsha, China, 2004. [Google Scholar]
- Jones, H.W. The Recent Large Reduction in Space Launch Cost. In Proceedings of the 48th International Conference on Environmental Systems, Albuquerque, NM, USA, 8–12 July 2018. [Google Scholar]
- Xu, D.; Zhang, Z.; Wu, K.; Li, H.B.; Lin, J.F.; Zhang, X.; Guo, X. Recent progress on development trend and key technologies of vertical take-off vertical landing reusable launch vehicle. Chin. Sci. Bull. 2016, 61, 3453–3463. [Google Scholar] [CrossRef]
- Jo, B.U.; Ahn, J. Optimal staging of reusable launch vehicles for minimum life cycle cost. Aerosp. Sci. Technol. 2022, 127, 107703. [Google Scholar] [CrossRef]
- Li, X.D.; Liao, Y.X.; Liao, J.; Luo, S.B. Finite-time sliding mode control for vertical recovery of the first-stage of reusable rocket. J. Cent. South Univ. (Sci. Technol.) 2020, 51, 979–988. [Google Scholar] [CrossRef]
- Blackmore, L.; Scharf, D.P. Minimum-landing-error powered-descent guidance for Mars landing using convex optimization. J. Guid. Control Dyn. 2010, 33, 1161–1171. [Google Scholar] [CrossRef]
- Tang, B.T.; Cheng, H.B.; Yang, Y.X.; Jiang, X.J. Research on iterative guidance method of solid sounding rocket. J. Solid Rocket Technol. 2024, 47, 135–142. [Google Scholar] [CrossRef]
- Tian, W.L.; Yan, Z.W.; Li, W.; Jia, D. Design and analysis of takeoff and landing control algorithm for four-rocket boosting drone. Adv. Aeronaut. Sci. Eng. 2024, 15, 105–117. [Google Scholar] [CrossRef]
- Wu, N.; Zhang, L. A fast and accurate injection strategy for solid rockets based on the phase plane control. Aerosp. Control 2020, 38, 44–50. [Google Scholar] [CrossRef]
- Zhang, L.; Li, D.Y.; Cui, N.G.; Zhang, Y.H. Full-profile flight prescribed performance control for vertical take-off and landing reusable launch vehicle. Acta Aeronaut. Sin. 2019, 44, 179–195. [Google Scholar]
- Liu, W.; Wu, Y.Y.; Liu, W.; Tian, M.M.; Huang, T.P. RLV reentry robust fault-tolerant attitude control considering unknown disturbance. Acta Aeronaut. Sin. 2019, 44, 169–176. [Google Scholar]
- Yang, Z.S.; Mao, Q.; Dou, L.Q. Design of interval type-2 adaptive fuzzy sliding mode control for reentry attitude of reusable aircraft. J. Beijing Univ. Aeronaut. Astronaut. 2019, 46, 781–790. [Google Scholar] [CrossRef]
- Wang, Z.; Zhang, J.; Li, Y.; Gong, Q.; Luo, W.; Zhao, J. Automated Reinforcement Learning Based on Parameter Sharing Network Architecture Search. In Proceedings of the 2021 6th International Conference on Robotics and Automation Engineering (ICRAE), Guangzhou, China, 19–22 November 2021; pp. 358–363. [Google Scholar]
- Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
- Hadi, B.; Khosravi, A.; Sarhadi, P. Deep reinforcement learning for adaptive path planning and control of an autonomous underwater vehicle. Appl. Ocean Res. 2022, 129, 103326. [Google Scholar] [CrossRef]
- Bijjahalli, S.; Sabatini, R.; Gardi, A. Advances in intelligent and autonomous navigation systems for small UAS. Prog. Aerosp. Sci. 2020, 115, 100617. [Google Scholar] [CrossRef]
- Alagumuthukrishnan, S.; Deepajothi, S.; Vani, R.; Velliangiri, S. Reliable and Efficient Lane Changing Behaviour for Connected Autonomous Vehicle through Deep Reinforcement Learning. Procedia Comput. Sci. 2023, 218, 1112–1121. [Google Scholar] [CrossRef]
- Huang, Z.G.; Liu, Q.; Zhu, F. Hierarchical reinforcement learning with adaptive scheduling for robot control. Eng. Appl. Artif. Intell. 2023, 126, 107130. [Google Scholar] [CrossRef]
- Liu, J.Y.; Wang, G.; Fu, Q.; Yue, S.H.; Wang, S.Y. Task assignment in ground-to-air confrontation based on multiagent deep reinforcement learning. Def. Technol. 2023, 19, 210–219. [Google Scholar] [CrossRef]
- Federici, L.; Scorsoglio, A.; Zavoli, A.; Furfaro, R. Meta-reinforcement learning for adaptive spacecraft guidance during finite-thrust rendezvous missions. Acta Astronaut. 2022, 201, 129–141. [Google Scholar] [CrossRef]
- Federici, L.; Zavoli, A. Robust interplanetary trajectory design under multiple uncertainties via meta-reinforcement learning. Acta Astronaut. 2024, 214, 147–158. [Google Scholar] [CrossRef]
- Costa, B.A.; Parente, F.L.; Belfo, J.; Somma, N.; Rosa, P.; Igreja, J.M.; Belhadj, J.; Lemos, J.M. A reinforcement learning approach for adaptive tracking control of a reusable rocket model in a landing scenario. Neurocomputing 2024, 577, 127377. [Google Scholar] [CrossRef]
- Belkhale, S.; Li, R.; Kahn, G.; McAllister, R.; Calandra, R.; Levine, S. Model-Based Meta-Reinforcement Learning for Flight With Suspended Payloads. IEEE Robot. Autom. Lett. 2021, 6, 1471–1478. [Google Scholar] [CrossRef]
- Xue, S.; Han, Y.; Bai, H. Research on Ballistic Planning Method Based on Improved DDPG Algorithm. In Proceedings of the 2023 International Conference on Cyber-Physical Social Intelligence (ICCSI), Xi’an, China, 20–23 October 2023; pp. 13–19. [Google Scholar]
- Xu, J.; Du, T.; Foshey, M.; Li, B.; Zhu, B.; Schulz, A.; Matusik, W. Learning to fly: Computational controller design for hybrid UAVs with reinforcement learning. ACM Trans. Graph. 2019, 38, 1–12. [Google Scholar] [CrossRef]
- Hutter, F.; Kotthoff, L.; Vanschoren, J. Automated Machine Learning: Methods, Systems, Challenges; Springer Nature: Berlin/Heidelberg, Germany, 2019. [Google Scholar] [CrossRef]
- Wen, L.; Gao, L.; Li, X.; Li, H. A new genetic algorithm based evolutionary neural architecture search for image classification. Swarm Evol. Comput. 2022, 75, 101191. [Google Scholar] [CrossRef]
- Chen, L.C.; Collins, M.D.; Zhu, Y.; Papandreou, G.; Zoph, B.; Schroff, F.; Adam, H.; Shlens, J. Searching for efficient multi-scale architectures for dense image prediction. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 8699–8710. [Google Scholar]
- Wang, Y.; Yang, Y.; Chen, Y.; Bai, J.; Zhang, C.; Su, G.; Kou, X.; Tong, Y.; Yang, M.; Zhou, L. TextNAS: A neural architecture search space tailored for text representation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 9242–9249. [Google Scholar]
- Bergstra, J.; Yamins, D.; Cox, D. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 115–123. [Google Scholar]
- Zoph, B.; Le, Q.V. Neural architecture search with reinforcement learning. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017; pp. 1–16. [Google Scholar]
- Xie, L.; Yuille, A. Genetic CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1379–1388. [Google Scholar]
- Falanti, A.; Lomurno, E.; Ardagna, D.; Matteucci, M. POPNASv3: A pareto-optimal neural architecture search solution for image and time series classification. Appl. Soft Comput. 2023, 145, 110555. [Google Scholar] [CrossRef]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
- Yang, C.H.; Xia, Y.S.; Chu, Z.F.; Zha, X. Logic Synthesis Optimization Sequence Tuning Using RL-Based LSTM and Graph Isomorphism Network. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 3600–3604. [Google Scholar] [CrossRef]
- Eberhart, R.; Kennedy, J. A new optimizer using particle swarm theory. In Proceedings of the MHS '95 Sixth International Symposium on Micro Machine and Human Science, Nagoya, Japan, 4–6 October 1995; IEEE: Piscataway, NJ, USA, 1995. [Google Scholar]
- Feng, X.F.; Liu, T.W.; Li, B.; Su, S. Wind power slope climbing event detection method based on sliding window two-sided CUSUM algorithm. Sci. Technol. Eng. 2024, 24, 595–603. [Google Scholar] [CrossRef]
- Young, M.T.; Hinkle, J.D.; Kannan, R.; Ramanathan, A. Distributed Bayesian optimization of deep reinforcement learning algorithms. J. Parallel Distrib. Comput. 2020, 139, 43–52. [Google Scholar] [CrossRef]
- Deng, S. CNN hyperparameter optimization method based on improved Bayesian optimization algorithm. Appl. Res. Comput. 2019, 36, 1984–1987. [Google Scholar] [CrossRef]
- Dong, L.J.; Fang, Z.; Chen, H.T. Compressor fault diagnosis based on deep learning and Bayesian optimization. Mach. Des. Manuf. 2023, 384, 45–52. [Google Scholar] [CrossRef]
No. | State Variable | Symbol
---|---|---
1 | Position coordinate in the x direction in the inertial coordinate system |
2 | Position coordinate in the y direction in the inertial coordinate system |
3 | Position coordinate in the z direction in the inertial coordinate system |
4 | Velocity in the x direction in the inertial coordinate system |
5 | Velocity in the y direction in the inertial coordinate system |
6 | Velocity in the z direction in the inertial coordinate system |
7 | Pitch angle |
8 | Yaw angle |
9 | Roll angle |
10 | Angular velocity about the x axis in the launch vehicle coordinate system |
11 | Angular velocity about the y axis in the launch vehicle coordinate system |
12 | Angular velocity about the z axis in the launch vehicle coordinate system |
13 | Rocket mass |
No. | Action Variable | Symbol |
---|---|---|
1 | Thrust direction angle | |
2 | Thrust direction angle | |
3 | Magnitude of thrust |
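The state and action tables above can be assembled into vectors as follows. This is an illustrative sketch only, with descriptive argument names standing in for the symbols that did not survive extraction.

```python
# Assemble the 13-dimensional state and 3-dimensional action vectors defined
# in the tables above, using plain Python lists.
def make_state(position, velocity, attitude, body_rates, mass):
    """position/velocity: inertial-frame (x, y, z) triples;
    attitude: (pitch, yaw, roll) angles;
    body_rates: body-frame angular velocities about (x, y, z);
    mass: current rocket mass."""
    state = [*position, *velocity, *attitude, *body_rates, mass]
    assert len(state) == 13, "state vector must have 13 components"
    return state

def make_action(thrust_angle_1, thrust_angle_2, thrust):
    """Two thrust-direction angles and the thrust magnitude."""
    return [thrust_angle_1, thrust_angle_2, thrust]
```

For example, the initial condition from the simulation settings maps to `make_state((2000, -1600, 50), (-90, 180, 0), (0, 0, 0), (0, 0, 0), 41000.0)`.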
No. | Type | Hyperparameter | Symbol | Search Area
---|---|---|---|---
1 | Actor network | Number of hidden layers | | {1, 2, 3, 4}
| | Hidden layer neurons | | {64, 128, 256}
| | Activation function | | {tanh, relu, sigmoid}
2 | Critic network | Number of hidden layers | | {1, 2, 3, 4}
| | Hidden layer neurons | | {64, 128, 256}
| | Activation function | | {tanh, relu, sigmoid}
3 | LSTM network | Hidden layer neurons | | {16, 32, 64, 128, 256, 512}
No. | Hyperparameter | Symbol | Search Area
---|---|---|---
1 | Batch learning size | b | {32, 64, 128, 256, 512}
2 | Maximum steps | N | {256, 512, 1024, 2048, 8192}
3 | Discount factor | | uniform (0.8, 0.999)
4 | Learning rate | | loguniform (1 × 10−5, 1)
5 | Clipping factor | | uniform (0.1, 0.4)
6 | Generalized advantage estimation (GAE) parameter | | uniform (0.8, 1)
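The two tables above define the full search space. A minimal sketch of sampling from it, assuming plain Python, with descriptive key names in place of the symbols omitted from the extracted tables:

```python
import math
import random

# Architecture choices from the network-structure table.
ARCH_SPACE = {
    "actor_layers": [1, 2, 3, 4],
    "actor_neurons": [64, 128, 256],
    "actor_activation": ["tanh", "relu", "sigmoid"],
    "critic_layers": [1, 2, 3, 4],
    "critic_neurons": [64, 128, 256],
    "critic_activation": ["tanh", "relu", "sigmoid"],
    "lstm_neurons": [16, 32, 64, 128, 256, 512],
}

def sample_ppo_hyperparams(rng):
    """Draw one PPO hyperparameter combination from the tabulated ranges."""
    return {
        "batch_size": rng.choice([32, 64, 128, 256, 512]),      # b
        "max_steps": rng.choice([256, 512, 1024, 2048, 8192]),  # N
        "gamma": rng.uniform(0.8, 0.999),                       # discount factor
        "lr": math.exp(rng.uniform(math.log(1e-5), 0.0)),       # log-uniform on [1e-5, 1]
        "clip": rng.uniform(0.1, 0.4),                          # PPO clipping factor
        "gae_lambda": rng.uniform(0.8, 1.0),                    # GAE parameter
    }
```

Sampling the learning rate as `exp(uniform(log(1e-5), log(1)))` reproduces the log-uniform distribution in the table, which spreads trials evenly across orders of magnitude rather than clustering them near 1.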
No. | Rocket Parameter | Value
---|---|---
1 | Total length of rocket | 40 m
2 | Diameter of rocket | 3.66 m
3 | Mass of rocket | 41 t
4 | Specific impulse | 360 s
5 | Atmospheric density | 1.225 kg/m3
6 | Acceleration of gravity | 9.8 m/s2
7 | Aerodynamic drag coefficient | 0.82
8 | Maximum thrust of the rocket | 981 kN
9 | Initial position coordinates | (2000, −1600, 50) m
10 | Initial velocity | (−90, 180, 0) m/s
11 | Landing position coordinates | (0, 0, 0) m
12 | Attitude angle limits during landing | [85, 85, 360]°
13 | Terminal position constraint | 5 m
14 | Terminal velocity constraint | 2 m/s
15 | Terminal attitude deviation constraint | [10, 10, 360]°
16 | Rotational angular velocity constraint | [0.2, 0.2, 0.2] rad/s
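Two derived quantities, not stated in the table, can be computed from the listed values as a quick consistency check; this is an illustrative sketch only.

```python
# Physical constants from the simulation-parameter table.
G0 = 9.8                 # m/s^2, gravitational acceleration
ISP = 360.0              # s, specific impulse
MASS = 41_000.0          # kg, initial rocket mass (41 t)
MAX_THRUST = 981_000.0   # N, maximum thrust (981 kN)

def exhaust_velocity(isp, g0=G0):
    """Effective exhaust velocity v_e = Isp * g0 (m/s), used in mass-flow bookkeeping."""
    return isp * g0

def thrust_to_weight(thrust, mass, g0=G0):
    """Initial thrust-to-weight ratio; it must exceed 1 to decelerate a descent."""
    return thrust / (mass * g0)
```

With the tabulated values, the exhaust velocity is 3528 m/s and the initial thrust-to-weight ratio is about 2.44, consistent with a powered vertical landing.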
No. | NAS Search Policy | Search Time | Average Fitness Value | Accuracy Rate |
---|---|---|---|---|
1 | Reinforcement learning | 10 h | 162 | 79.4% |
2 | Genetic algorithm | 8 h | 140 | 80.5% |
3 | Particle swarm optimization | 7 h | 200 | 80.2% |
4 | Algorithm of this paper | 1.5 h | 298.8 | 99.6% |
No. | Type | Hyperparameter | Symbol | Optimized Value
---|---|---|---|---
1 | Actor network | Number of hidden layers | | 3
| | Hidden layer neurons | | [64, 64, 128]
| | Activation function | | tanh
2 | Critic network | Number of hidden layers | | 3
| | Hidden layer neurons | | [64, 64, 128]
| | Activation function | | relu
3 | LSTM network | Hidden layer neurons | | 128
No. | Hyperparameter | Symbol | Optimized Value
---|---|---|---
1 | Batch learning size | b | 64
2 | Maximum steps | N | 2048
3 | Discount factor | | 0.9391973108460121
4 | Learning rate | | 0.0001720593703687
5 | Clipping factor | | 0.2668120684510983
6 | Generalized advantage estimation (GAE) parameter | | 0.8789545362092943
No. | Name | Terminal Position Accuracy/m | Terminal Speed Accuracy/m/s | Terminal Attitude Deviation /° | Fuel Consumption/kg |
---|---|---|---|---|---|
1 | NAS-RL | 0.4 | 1.0 | (1.916, 0.129, 0.003) | 3437.9 |
2 | RL | 11.9 | 12.9 | (2.593, 0.228, 0.008) | 4086.3 |
3 | Convex optimization | 10.0 | 7.5 | (2.420, −0.774, 0.003) | 3452.2 |
4 | NAS-RL-Embedded platform | 0.4 | 1.0 | (1.916, 0.129, 0.003) | 3437.9 |
No. | Name | Symbol | Maximum Deviation | Minimum Deviation |
---|---|---|---|---|
1 | The initial position of the rocket | +10% | −10% | |
2 | The initial velocity of the rocket | +10% | −10% | |
3 | Mass of rocket | +10% | −10% | |
4 | Aerodynamic drag coefficient | +10% | −10% | |
5 | Fuel specific impulse | +1% | −1% | |
6 | Atmospheric density | +5% | −5% |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Xue, S.; Wang, Z.; Bai, H.; Yu, C.; Li, Z. Research on Self-Learning Control Method of Reusable Launch Vehicle Based on Neural Network Architecture Search. Aerospace 2024, 11, 774. https://doi.org/10.3390/aerospace11090774