Article

Online Safe Flight Control Method Based on Constraint Reinforcement Learning

1 Department of Automatic Control, Xi’an Research Institute of Hi-Tech, Xi’an 710025, China
2 Department of Automation, Tsinghua University, Beijing 100091, China
3 Beijing Aerospace Automatic Control Institute, Beijing 100854, China
* Authors to whom correspondence should be addressed.
Drones 2024, 8(9), 429; https://doi.org/10.3390/drones8090429
Submission received: 13 June 2024 / Revised: 22 August 2024 / Accepted: 22 August 2024 / Published: 26 August 2024

Abstract

UAVs are increasingly prominent in the competition for space due to their multiple characteristics, such as strong maneuverability, long flight distance, and high survivability. A new online safe flight control method based on constrained reinforcement learning is proposed for the intelligent safety control of UAVs. This method adopts constrained policy optimization as the main reinforcement learning framework and develops a constrained policy optimization algorithm with an extra safety budget, which introduces Lyapunov stability requirements and limits the rudder deflection loss to ensure flight safety and improve the robustness of the controller. By efficiently interacting with the constructed simulation environment, a control law model for UAVs is trained. Subsequently, a condition-triggered meta-learning online learning method is used to adjust the control law online, ensuring successful attitude angle tracking. Simulation experimental results show that using the online control law to perform the aircraft attitude angle control task yields an overall score of 100 points. After introducing online learning, the adaptability of attitude control to comprehensive errors such as aerodynamic parameters and wind improved by 21% compared to offline learning. The control law can be learned online to adjust the control policy of UAVs, ensuring their safety and stability during flight.

1. Introduction

Against the backdrop of continuously increasing flight mission requirements, the flight environment is becoming increasingly complex. UAVs inherently possess characteristics such as strong nonlinearity, complex coupling effects, rapid time-varying features, and significant uncertainties. Traditional control design methods require precise knowledge of, or identification of, the aircraft's internal dynamics, and the high-order derivatives of the output variables needed in these methods are difficult to measure in practical applications [1,2]. This necessitates the intelligent upgrade and transformation of the key components of the control system, enabling the aircraft to possess intelligent learning capabilities. Reinforcement learning is an intelligent method that achieves end-to-end flight control based on data, without the need for operations such as model decoupling and linearization.
Currently, reinforcement learning has received widespread attention and research in various decision-making tasks of artificial intelligence [3,4,5,6,7,8,9,10]. Reinforcement learning can typically be described as a Markov decision process (MDP), meaning that the state at the next moment depends only on the current state and the action taken by the agent, with rewards determined by two consecutive states and the action of the agent between them [11]. Introducing constraints into the MDP, that is, adding loss to describe the degree of violation of constraints by the agent’s behavior in the elements of reinforcement learning, forms the constrained Markov decision process (CMDP) [12]. In CMDP problems, the goal of the agent is to maximize cumulative rewards in a task while keeping the loss within certain constraints. Losses include cumulative losses and real-time losses; the former includes expectations and averages of long-term losses or the probability of exceeding a certain threshold, while the latter refers to explicit or implicit losses at each time step.
According to the objective of reinforcement learning, reinforcement learning is essentially solving an optimization problem, and constrained reinforcement learning transforms the unconstrained optimization problem of classical reinforcement learning into an optimization problem with inequality constraints. Therefore, many methods from the field of optimization can be borrowed for constrained reinforcement learning. Researchers worldwide have carried out studies on the application of reinforcement learning to aircraft control [13,14,15,16,17,18]. Hao et al. [14] proposed a model reference output feedback reinforcement learning control algorithm with a broader application scope. Its learning process relies only on the output of the plant and can obtain an output feedback control policy that enables the closed-loop system to achieve the desired dynamic performance. The algorithm constructs a reward function based on the reference model, which can effectively describe the desired closed-loop dynamic performance of the system. Huang et al. [15] utilized the deep deterministic policy gradient (DDPG) algorithm, using state information from multiple data frames as the agent's observation state and rudder angle and engine thrust commands as the agent's output actions. After training, they obtained a generalized and robust intelligent flight controller. Wang et al. [19] proposed a deep deterministic policy gradient-based reference model for quadrotor UAV attitude controller design, which incorporates a reference model in the DDPG structure to avoid the system overshoot caused by excessive control action. Rui et al. [20] proposed a PPO-based RL controller for attitude control during the transition process of tilt rotor unmanned aerial vehicles (TRUAVs). Through direct interaction with the environment to learn control strategies, they designed and improved the reward function to adapt to the transition process. Burak et al. [21] proposed a design method for flight control systems based on reinforcement learning, aiming to improve the transient response performance of closed-loop reference model adaptive control systems. This method implements reinforcement learning in the feedback path gain matrix of the reference model to generate dynamic adjustment strategies, providing the possibility of learning multiple adaptation strategies and thereby improving the transient response performance of traditional model reference adaptive control system designs. Ma et al. [22] proposed an incremental reinforcement learning-based UAV tracking control algorithm for dynamic environments, using a policy relief approach to enable UAVs to explore appropriately in new environments and a significance weighting approach to increase the utilization of episodes with higher significance and richer information.
Traditional reinforcement learning methods lack theoretical guarantees in terms of safety and credibility, making the trained systems unable to meet practical application requirements. Several studies have combined Lyapunov theory with deep reinforcement learning to improve the robustness and stability of control systems. Chow et al. [23] proposed a safety policy optimization method based on Lyapunov functions that trains neural network policies via deep deterministic policy gradient and proximal policy optimization algorithms and ensures that the linearized Lyapunov constraints, which induce the set of feasible solutions, are satisfied at each policy update. Yu et al. [24] proposed an adaptive control method for mobile robots based on Lyapunov reward shaping that optimizes the control parameters through environmental feedback to achieve real-time stable control. Therefore, this article proposes a new framework for online safe flight control based on constrained reinforcement learning. To the best of our knowledge, this is the first time that a constrained reinforcement learning approach has been applied to aircraft control research. The main contributions of this article are as follows:
  • A new framework is proposed for online safe flight control. The core idea is to first design a constrained reinforcement learning algorithm based on an extra safety budget, which introduces Lyapunov stability requirements to ensure flight safety and improves the robustness of the controller, and then to use an online condition-triggered meta-learning method to adjust the control law online to complete the attitude angle tracking task.
  • A novel flight control simulation environment is built based on the Python Flight Mechanics Engine (PyFME) [25] for offline training and online learning.
  • This work demonstrates that the proposed method not only ensures the safety and stability of the aircraft during flight but also adapts the control law to various environmental changes through online learning.
The rest of this article is structured as follows: Section 2 describes the aircraft model, controller model, and simulation environment. Section 3 introduces the proposed method. Section 4 presents the experimental results and analysis. Finally, Section 5 concludes this article.

2. Mathematical Model

To facilitate research on online safe flight control based on the characteristics of the aircraft, three reasonable assumptions are made:
  • The aircraft is a rigid body.
  • The ground is flat and stationary, ignoring the influence of the earth’s curvature and rotation.
  • The deformation of the landing gear is neglected.
We only focus on the aircraft model after takeoff; the taxiing phase during takeoff and the landing phase are not considered here. Due to the large range of variation in the attitude motion parameters of the aircraft, quaternions are used to describe the attitude in the simulation to avoid singularities. After numerical integration of the differential equations, the attitude angles $\theta$, $\psi$, and $\Phi$, which represent the pitch, yaw, and roll angles, respectively, are calculated using the conversion relationship between attitude angles and quaternions.
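A minimal sketch of such a quaternion-to-Euler conversion is given below. It assumes one common aerospace rotation sequence; the paper's exact axis convention (a Y-up ground frame) may assign or order the angles differently, so the mapping should be treated as illustrative rather than the paper's exact formula.

```python
import numpy as np

def quat_to_euler(q0, q1, q2, q3):
    """Convert an attitude quaternion to pitch/yaw/roll angles (radians).
    Uses a common yaw-pitch-roll sequence; the paper's axis convention may differ."""
    roll = np.arctan2(2.0 * (q0 * q1 + q2 * q3), 1.0 - 2.0 * (q1**2 + q2**2))
    pitch = np.arcsin(np.clip(2.0 * (q0 * q2 - q3 * q1), -1.0, 1.0))
    yaw = np.arctan2(2.0 * (q0 * q3 + q1 * q2), 1.0 - 2.0 * (q2**2 + q3**2))
    return pitch, yaw, roll
```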

2.1. Aircraft Model

The differential equations for the state of the aircraft [26,27] are as follows:
$$\begin{bmatrix} \dot{X} \\ \dot{Y} \\ \dot{Z} \end{bmatrix} = \begin{bmatrix} V_x \\ V_y \\ V_z \end{bmatrix}$$
$$\begin{bmatrix} \dot{V}_x \\ \dot{V}_y \\ \dot{V}_z \end{bmatrix} = \left[ A \right] \begin{bmatrix} \dot{W}_{x1} \\ \dot{W}_{y1} \\ \dot{W}_{z1} \end{bmatrix} + \begin{bmatrix} 0 \\ -g \\ 0 \end{bmatrix}$$
$$\begin{bmatrix} \dot{q}_0 \\ \dot{q}_1 \\ \dot{q}_2 \\ \dot{q}_3 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} -q_1 & -q_2 & -q_3 \\ q_0 & -q_3 & q_2 \\ q_3 & q_0 & -q_1 \\ -q_2 & q_1 & q_0 \end{bmatrix} \begin{bmatrix} \omega_{x1} \\ \omega_{y1} \\ \omega_{z1} \end{bmatrix}$$
$$\begin{bmatrix} \dot{\omega}_{x1} \\ \dot{\omega}_{y1} \\ \dot{\omega}_{z1} \end{bmatrix} = \begin{bmatrix} I_x & -I_{xy} & -I_{xz} \\ -I_{yx} & I_y & -I_{yz} \\ -I_{zx} & -I_{zy} & I_z \end{bmatrix}^{-1} \left( \begin{bmatrix} M_{x1} \\ M_{y1} \\ M_{z1} \end{bmatrix} - \begin{bmatrix} \omega_{x1} \\ \omega_{y1} \\ \omega_{z1} \end{bmatrix} \times \begin{bmatrix} I_x & -I_{xy} & -I_{xz} \\ -I_{yx} & I_y & -I_{yz} \\ -I_{zx} & -I_{zy} & I_z \end{bmatrix} \begin{bmatrix} \omega_{x1} \\ \omega_{y1} \\ \omega_{z1} \end{bmatrix} \right)$$
where $X$, $Y$, and $Z$ represent the positions along the three axes in the ground coordinate system, while $V_x$, $V_y$, and $V_z$ stand for the velocities along the three axes in the ground coordinate system. $q_0$, $q_1$, $q_2$, and $q_3$ are the attitude quaternions, and $\omega_{x1}$, $\omega_{y1}$, and $\omega_{z1}$ represent the angular velocities of rotation about the three axes in the body coordinate system. $\dot{W}_{x1}$, $\dot{W}_{y1}$, and $\dot{W}_{z1}$ are the apparent (non-gravitational) accelerations along the three axes in the body coordinate system. $M_{x1}$, $M_{y1}$, and $M_{z1}$ are the combined external moments about the three axes in the body coordinate system, which are composed of aerodynamic moments and inertia moments. $[A]$ is the transformation matrix from the body coordinate system to the ground coordinate system, and $g$ represents the acceleration of gravity. The corresponding calculation diagram is shown in Figure 1.
In the body coordinate system, the aerodynamic forces and moments are as follows:
$$\begin{aligned} F_{x1}^{QD} &= q S_{ref} C_{x1} & M_{x1}^{QD} &= q S_{ref} L_{ref} C_{mx1} \\ F_{y1}^{QD} &= q S_{ref} C_{y1} & M_{y1}^{QD} &= q S_{ref} L_{ref} C_{my1} \\ F_{z1}^{QD} &= q S_{ref} C_{z1} & M_{z1}^{QD} &= q S_{ref} L_{ref} C_{mz1} \end{aligned}$$
where $q$ represents the dynamic pressure; the aerodynamic force coefficients include the axial force coefficient $C_{x1}$, normal force coefficient $C_{y1}$, and lateral force coefficient $C_{z1}$; the aerodynamic moment coefficients include the rolling moment coefficient $C_{mx1}$, yawing moment coefficient $C_{my1}$, and pitching moment coefficient $C_{mz1}$; $L_{ref}$ represents the aerodynamic reference length; and $S_{ref}$ represents the aerodynamic reference area.

2.2. Controller Model

To control the aircraft using reinforcement learning algorithms, the controller serves as the reinforcement learning action network, primarily responsible for executing the aircraft attitude angle control tasks. The inputs (states s) to the controller include observable quantities such as the current altitude $H$, climbing speed $dH$, flight speed $V$, pitch angle $\theta$, yaw angle $\psi$, roll angle $\Phi$, sideslip angle $\beta$, angle of attack $\alpha$, roll rate $p$, pitch rate $q$, yaw rate $r$, and the desired yaw angle $\psi_{des}$ and desired pitch angle $\theta_{des}$. The outputs (action a) are the elevator, aileron, and rudder deflection commands $\delta_e$, $\delta_a$, and $\delta_r$. In the implementation process, a neural network is used to take all the above states as inputs and produce the three control commands for the control surfaces as outputs. The block diagram of the controller system is shown in Figure 2.
In the action space, the control commands for the control surfaces are continuous values. During the simulation process, a Gaussian action model is adopted to control the rotation of the control surfaces. The design of the action network is shown in Figure 3.
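A minimal PyTorch sketch of such a Gaussian action network is shown below. The hidden-layer sizes, the 13-dimensional observation vector (the observables listed above, before state stacking and the added surface deflections), and the tanh squashing of the sampled commands are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GaussianActionNetwork(nn.Module):
    """Sketch of a Gaussian policy: maps flight states to the mean and standard
    deviation of three control-surface commands (elevator, aileron, rudder)."""
    def __init__(self, state_dim: int, action_dim: int = 3, hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu_head = nn.Linear(hidden, action_dim)          # mean deflection commands
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent log std

    def forward(self, state: torch.Tensor) -> torch.distributions.Normal:
        mu = self.mu_head(self.backbone(state))
        return torch.distributions.Normal(mu, self.log_std.exp())

# Usage: sample normalized delta_e, delta_a, delta_r from one observation
policy = GaussianActionNetwork(state_dim=13)
obs = torch.randn(1, 13)                   # illustrative observation vector
action = torch.tanh(policy(obs).sample())  # squash to normalized deflection commands
```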

2.3. Flight Simulation Environment Model

Gazebo’s expertise and depth in flight mechanics simulation are not as good as those of PyFME, and it consumes more resources and has higher hardware requirements in complex simulations [28]. AirSim is not as focused on the depth and breadth of flight mechanics simulation as PyFME in terms of functional scope; PyFME provides more comprehensive flight mechanics models and related parameters, which may be more suitable for highly specialized flight mechanics research [29]. FlightGear’s customization capabilities are not as flexible and powerful as those of PyFME, especially in research that requires highly customized simulation scenarios [30]. Compared to PyFME, which focuses on the depth and accuracy of flight mechanics simulation, X-Plane, as a commercial software program, has relatively limited customization capabilities, limiting users’ ability to deeply modify and extend the software [31]. After considering the existing flight control simulation environments, we found that although they perform well in some respects, they have limitations in terms of flexibility, user customization, and the integration of specific algorithms. To overcome these limitations, we chose PyFME as our research tool.
We established a flight simulation environment based on PyFME, whose main idea is to model the physical scenarios involved in aircraft flight, including aircraft models, atmospheric models, dynamic models, aerodynamic models, and so on. It is capable of simulating the movement of aircraft in the air and, accordingly, replicating all the physical environments pertinent to flight. For further details, please refer to https://github.com/AeroPython/PyFME/wiki, accessed on 12 April 2024. Next, we introduce it from three perspectives: state quantity design, action quantity design, and the design of reward and cost functions.

2.3.1. State Quantity Design

The observables available to the aircraft include flight speed, altitude, climbing rate, pitch angle, yaw angle, roll angle, and the angular rates of these three angles. In reinforcement learning algorithms, the inputs to the agent’s action network include these nine state variables and the control variables. During the learning process, the agent needs to utilize the deflection of the control surfaces to calculate the loss function. Therefore, in the experiment, we add the deflection of the control surfaces to the state space, creating an inner feedback loop. Additionally, we utilize state stacking to allow the agent to capture more dynamic information, obtaining high-order and integral quantities from past states and thereby generating better control outputs. To facilitate model training, during data processing, the mean and standard deviation of the data are dynamically calculated and updated as the state variables arrive, rather than being calculated over all the data at once. This online computing method can better adapt to the dynamic changes of the data and is especially suitable for situations that require real-time updating of statistical information.
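One common way to implement this kind of running normalization is Welford's online algorithm; the sketch below is illustrative and not necessarily the paper's exact implementation.

```python
import numpy as np

class RunningNormalizer:
    """Online (Welford-style) estimate of mean and standard deviation,
    updated sample by sample, for normalizing the state variables."""
    def __init__(self, dim: int, eps: float = 1e-8):
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)   # running sum of squared deviations
        self.count = 0
        self.eps = eps

    def update(self, x: np.ndarray) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x: np.ndarray) -> np.ndarray:
        var = self.m2 / max(self.count - 1, 1)
        return (x - self.mean) / np.sqrt(var + self.eps)
```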

2.3.2. Action Quantity Design

In the attitude control task, the actions of the aircraft involve elevator deflection, rudder deflection, and aileron deflection. The objective of attitude control is to enable the aircraft to follow the target yaw angle and target pitch angle, as shown in Figure 4. In the task design, the target yaw angle and target pitch angle vary sinusoidally with simulation time, allowing for the aircraft to learn to track curved trajectories. The expressions for the target pitch angle θ d e s and target yaw angle   ψ d e s are as follows:
$$\begin{aligned} \theta_{des} &= A_\theta \sin(\omega_\theta t) \\ \psi_{des} &= \psi_0 + A_\psi \sin(\omega_\psi t) \end{aligned}$$
where A θ and A ψ are the amplitudes of change for the target pitch angle and target yaw angle, respectively, while ω θ and ω ψ represent the angular frequencies of change for the target pitch angle and target yaw angle, respectively. A θ randomly takes values between the positive and negative maximum pitch angles, while A ψ randomly takes values between −π/2 and π/2. ω θ and ω ψ randomly take values within a certain frequency range, enabling the aircraft to learn following and attitude control strategies for different frequencies of trajectory changes.
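A small sketch of how such target profiles could be sampled is shown below. The frequency range and the function name are illustrative, since the paper only states that the frequencies are drawn within a certain range.

```python
import numpy as np

def sample_target_profile(theta_max, psi0, omega_range=(0.1, 0.5)):
    """Draw one sinusoidal target-attitude profile as described above.
    theta_max, psi0, and omega_range are illustrative arguments."""
    A_theta = np.random.uniform(-theta_max, theta_max)
    A_psi = np.random.uniform(-np.pi / 2, np.pi / 2)
    w_theta, w_psi = np.random.uniform(*omega_range, size=2)
    # Return theta_des(t), psi_des(t) as a function of simulation time t
    return lambda t: (A_theta * np.sin(w_theta * t),
                      psi0 + A_psi * np.sin(w_psi * t))
```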

2.3.3. Reward Function Design

We designed the reward function based on the aircraft's incremental progress toward the target. Let the target state quantity in a certain task be $T_i$, the current corresponding state quantity be $S_i$, and the corresponding state quantity at the previous moment be $S_i'$. Then, the reward value corresponding to the current state $S_i$ is as follows:
$$R_i = k_{R_i}\left( \left| T_i - S_i' \right| - \left| T_i - S_i \right| \right)$$
where k R i is the proportionality coefficient, related to the dimensions.
After the aircraft reaches the target attitude, the value of the aforementioned reward calculation method is 0, which is not conducive to the agent learning the policy of maintaining the attitude. Therefore, when the aircraft reaches the target attitude, we allow the agent to obtain additional positive rewards. At this point, any action that causes the aircraft to deviate from the target attitude results in a negative reward, whereas maintaining the flight attitude at the target attitude enables the aircraft to accumulate positive rewards.
For the attitude angle control task, due to the high precision requirements, adopting a constant reward based on a threshold for the angle would make it difficult for the aircraft to meet the reward conditions, which is not conducive to learning control strategies for small errors. Therefore, we adopt a smoother reward function model similar to the Laplace distribution, which is expressed as follows:
$$R_{goal} = k_{R_g} \exp\left( -\frac{\left| e \right|}{b} \right)$$
where $k_{R_g}$ is the overall scaling factor for the target reward, used to balance the reward for approaching the target attitude and the reward for maintaining the target attitude; $e$ is the attitude angle tracking error; and $b$ is the precision control factor, used to adjust the sharpness of the Laplace function.
To ensure the safe operation of the aircraft, the maximum angle of attack during flight is set to ±0.5 rad. When the angle of attack exceeds this limit, the episode ends and a reward of −10 is generated. The minimum flight altitude of the aircraft is set to 2 m; if the flight altitude falls below 2 m, the aircraft is considered to have crashed, the episode ends, and a reward of −10 is generated.
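The sketch below combines the reward terms above for a single tracked attitude channel. The gains k_r, k_rg, and b are illustrative placeholders rather than the paper's tuned coefficients, and the full reward would sum the contributions of all tracked quantities.

```python
import numpy as np

def attitude_reward(target, prev_state, cur_state, alpha, altitude,
                    k_r=1.0, k_rg=1.0, b=0.05):
    """Sketch of the reward terms described above for one tracked quantity."""
    # Incremental reward: positive when the error to the target shrinks
    reward = k_r * (abs(target - prev_state) - abs(target - cur_state))
    # Laplace-shaped bonus that grows as the aircraft nears the target attitude
    reward += k_rg * np.exp(-abs(target - cur_state) / b)
    # Safety terminations: excessive angle of attack or altitude below 2 m
    done = abs(alpha) > 0.5 or altitude < 2.0
    if done:
        reward = -10.0
    return reward, done
```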

2.3.4. Cost Function Design

In the aircraft control task, we aim for the deflection of the control surfaces to be as smooth as possible, ensuring that the aircraft’s attitude remains stable and energy consumption is reduced. The cost function, designed specifically based on the deflection of the control surfaces, is formulated as follows:
$$C(s, a, s') = k_c \left| a - a' \right|$$
where $k_c$ is the amplification factor, $s$ is the current state of the aircraft, $a$ is the action at the current moment, $s'$ is the state of the aircraft at the previous moment, and $a'$ is the action at the previous moment.
Therefore, the optimization problem for reinforcement learning in aircraft control becomes achieving the target attitude while ensuring that the control surface deflections are within a certain limit.
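A direct transcription of this cost is sketched below. Summing the deflection change over the three control surfaces is an assumption, since the equation above is written for a generic action.

```python
import numpy as np

def deflection_cost(prev_action, action, k_c=1.0):
    """Control-surface smoothness cost described above: proportional to the
    change in deflection commands between steps; k_c is an illustrative gain."""
    return k_c * np.sum(np.abs(np.asarray(action) - np.asarray(prev_action)))
```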

3. Methodology

The online safe flight control method based on constrained reinforcement learning employs a two-stage process, which is divided into offline virtual trial-and-error training and online reinforcement learning model calibration. In the offline stage, constrained reinforcement learning with an additional safety budget is used to train and optimize the policy function and value function. Finally, the trained policy function is used to drive the aircraft in the simulation to complete the flight mission. At the online stage, the control law model trained in the offline stage is first used to construct a task set through online flights in a virtual environment. Then, real-time interactive data are used to update the task set and continue fine-tuning the model using a conditional trigger-based meta-learning online reinforcement learning method. A schematic diagram of the method is shown in Figure 5.

3.1. Constrained Policy Optimization Algorithm with Extra Safety Budget

The constrained policy optimization with extra safety budget (ESB-CPO) [32] algorithm first samples from the environment, calculates the normalized safe state and the constraint equation gradients for each time step based on the sampled losses, and updates the factors $\alpha_i^\theta$ and $\beta_i^\theta(s_t)$ accordingly. After that, it calculates the Lyapunov advantage estimation (LAE) corresponding to each time step. Finally, it solves the approximate constrained policy optimization (CPO) [33] problem, using a first-order approximation of the optimization objective and constraint equations and a second-order approximation of the KL divergence, to obtain a new policy. The adaptive factors control the safety constraints, enabling the aircraft to initially ignore the constraints in unsafe states and quickly converge to a trajectory that completes the task; subsequently, it gradually meets the requirements of a safe state, ultimately obtaining the optimal trajectory. The LAE $A_{\theta,L}^{C_i}(s,a)$ is as follows:
$$A_{\theta,L}^{C_i}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ V_\theta^{C_i}(s') - V_\theta^{C_i}(s) + \alpha\, V_\theta^{C_i}(s) - \beta\, V_\theta^{C_i}(s') \right]$$
where $\alpha \in (0,1)$ and $\beta \in [0,1]$ are adaptive factors, $s$ and $a$ are the state and action at the current moment, $s'$ is the state at the next moment, and $V_\theta^{C_i}$ is the cost value function. Further, $P(\cdot \mid s, a)$ is the distribution of the next state given $(s, a)$.
Let $\pi_\theta$ denote the parameterized policy. The expected discounted cumulative return of the policy is as follows:
$$J_R(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \right]$$
where $\tau \sim \pi_\theta$ is a trajectory sampled from $\pi_\theta$. The expected discounted cumulative cost of the policy is as follows:
$$J_{C_i}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t, s_{t+1}) \right]$$
The optimization objective of the safe RL algorithm is to find the optimal policy $\pi_\theta^*$ that maximizes $J_R$ and guarantees that $J_{C_i} \le d_i$, where $d_i$ is the upper cost limit of the $i$th constraint.
Thus, the optimization problem can be defined:
$$\max_\theta\; J_R(\theta) \quad \text{s.t.} \quad J_{C_i}(\theta) \le d_i$$
The commonly used advantage functions are as follows:
$$\begin{aligned} A_\theta^R(s, a) &= Q_\theta(s, a) - V_\theta(s) \\ A_\theta^{C_i}(s, a) &= Q_\theta^{C_i}(s, a) - V_\theta^{C_i}(s) \end{aligned}$$
where V θ is the value function, Q θ is the state-action value function, V θ C i is the cost value function, and Q θ C i is the state-action cost value function.
Using the LAE to derive the optimization problem, the policy can be updated:
$$\begin{aligned} \theta' = \arg\max_{\tilde\theta}\;& \mathbb{E}_{s \sim \rho_\theta,\, a \sim \pi_\theta}\left[ \frac{\pi_{\tilde\theta}(a \mid s)}{\pi_\theta(a \mid s)}\, A_\theta^R(s, a) \right] \\ \text{s.t.}\quad & J_{C_i}(\theta) + \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \rho_\theta,\, a \sim \pi_\theta}\left[ \Delta_{\theta, \tilde\theta}(s, a)\, \frac{A_{\theta,L}^{C_i}(s, a)}{1 - \alpha_i^\theta} \right] \le d_i \\ & \mathbb{E}_{s \sim \rho_\theta}\left[ D_{KL}\!\left( \pi_{\tilde\theta}(\cdot \mid s)\, \Vert\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta \end{aligned}$$
where $\alpha_i^\theta$ decreases from 1 to 0 as the updates proceed and $\Delta_{\theta,\tilde\theta}(s,a) = \frac{\pi_{\tilde\theta}(a \mid s)}{\pi_\theta(a \mid s)} - 1$. $\Delta_{\theta,\tilde\theta}(s,a)$ denotes the tendency of the policy to update from $\pi_\theta$ to $\pi_{\tilde\theta}$. If the new policy tries to avoid choosing an action $a$ under $s$, then $\Delta_{\theta,\tilde\theta}(s,a) < 0$; conversely, $\Delta_{\theta,\tilde\theta}(s,a) > 0$.
The relationship between the LAE $A_{\theta,L}^{C_i}(s,a)$ and the standard cost advantage $A_\theta^{C_i}(s,a)$ is as follows:
$$\begin{aligned} \frac{A_{\theta,L}^{C_i}(s, a)}{1 - \alpha_i^\theta} &= A_\theta^{C_i}(s, a) + B_{1,\theta}^{i}(s, a) + B_{2,\theta}^{i}(s) \\ B_{1,\theta}^{i}(s, a) &= (1 - \gamma)\, V_\theta^{C_i}(s, a) - C_i(s, a, s') \\ B_{2,\theta}^{i}(s) &= \frac{\alpha_i^\theta \left( 1 - \beta_i^\theta(s) \right)}{1 - \alpha_i^\theta}\, V_\theta^{C_i}(s) \end{aligned}$$
Therefore, adding two gaps to the constraint function of (16) yields the following:
$$J_{C_i}(\theta) + \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \rho_\theta,\, a \sim \pi_\theta}\left[ \Delta_{\theta,\tilde\theta}(s, a)\, \frac{A_{\theta,L}^{C_i}(s, a)}{1 - \alpha_i^\theta} \right] \le d_i \;\Longleftrightarrow\; J_{C_i}(\theta) + \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \rho_\theta,\, a \sim \pi_\theta}\left[ \frac{\pi_{\tilde\theta}(a \mid s)}{\pi_\theta(a \mid s)}\, A_\theta^{C_i}(s, a) \right] + G_{1,\theta}^{i}(s, a) + G_{2,\theta}^{i}(s, a) \le d_i$$
where $G_{1,\theta}^{i}(s, a) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \rho_\theta,\, a \sim \pi_\theta}\left[ \Delta_{\theta,\tilde\theta}(s, a)\, B_{1,\theta}^{i}(s, a) \right]$ and $G_{2,\theta}^{i}(s, a) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \rho_\theta,\, a \sim \pi_\theta}\left[ \Delta_{\theta,\tilde\theta}(s, a)\, B_{2,\theta}^{i}(s) \right]$. If these gaps are negative, they relax the constraints; otherwise, they tighten them.
The normalized safe state z i θ ( s t ) is a sample-based internal state that directly shows the safety of the state at step t. z i θ ( s t ) is defined as follows:
$$z_i^\theta(s_t) = \frac{d_i - \sum_{l=0}^{t} \gamma^l\, C_i(s_l, a_l, s_{l+1})}{\gamma^t\, d_i}$$
where s l , a l , and s l + 1 are in the trajectory sampled by π θ . When the sum of costs is greater than the cost limit d i , z i θ ( s t ) is less than 0.
z i θ ( s t ) has an initial value of 1 before t = 0, and its update formula is as follows:
$$z_i^\theta(s_{t+1}) = \frac{z_i^\theta(s_t) - \dfrac{C_i(s_t, a_t, s_{t+1})}{d_i}}{\gamma}$$
Considering the range [0, 1], β i θ ( s t ) is calculated as follows:
$$\beta_i^\theta(s_t) = 1 + \min\left( \tanh\left( z_i^\theta(s_t) \right),\; 0 \right)$$
When z i θ ( s t ) is less than 0, β i θ ( s t ) decreases to 0.
The policy gradient directly reflects the effect of constraint on the policy, so the Lagrange multiplier λ i can be introduced to compute α i θ based on the policy gradient of the constraint function, constructing the following local optimization problem:
$$\min_{\lambda_i}\, \max_{\tilde\theta}\;\; \lambda_i\, P_i^\theta(\tilde\theta)$$
where $P_i^\theta(\tilde\theta) = \mathbb{E}_{s \sim \rho_\theta,\, a \sim \pi_\theta}\left[ \frac{\pi_{\tilde\theta}(a \mid s)}{\pi_\theta(a \mid s)}\, A_\theta^{C_i}(s, a) \right]$.
The dual problem of the above equation is as below:
$$\min_{\lambda_i}\, \max_{\tilde\theta}\;\; \lambda_i\, P_i^\theta(\tilde\theta) \quad \text{s.t.} \quad \lambda_i \ge 0$$
The update formula for λ i is as follows:
$$\lambda_{i,t+1} = \max\left( \lambda_{i,t} + \eta\, P_i^\theta(\tilde\theta),\; 0 \right)$$
where η is the step size.
Considering the range of values of α i θ , the formula can be defined:
$$\alpha_i^\theta = \tanh\left( k_i\, e^{\lambda_i} \right)$$
where k i is a hyperparameter that globally controls the rate of decrease in α i θ . As λ i decreases, α i θ changes from 1 to 0.
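The sketch below collects the adaptive-factor bookkeeping described above into one helper. The quantity passed as p_value (the sampled constraint term driving the multiplier update), the step size eta, and the hyperparameter k are placeholders for values that the algorithm estimates from rollouts; the alpha line simply mirrors the formula stated above.

```python
import numpy as np

def update_adaptive_factors(z, cost, d, gamma, lam, p_value, eta, k=1.0):
    """Sketch of the adaptive-factor updates used by ESB-CPO as described above."""
    z_next = (z - cost / d) / gamma             # normalized safe state update
    beta = 1.0 + min(np.tanh(z_next), 0.0)      # beta drops below 1 and tends to 0 once z < 0
    lam_next = max(lam + eta * p_value, 0.0)    # projected update of the Lagrange multiplier
    alpha = np.tanh(k * np.exp(lam_next))       # alpha computed from lambda via the stated mapping
    return z_next, beta, lam_next, alpha
```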
For small step sizes $\delta$, the optimization problem can be solved approximately by a first-order approximation of the objective and constraints and a second-order approximation of the KL divergence. Let the objective gradient be $g$, the constraint gradient be $b$, and the Hessian of the KL divergence be $H$. Define $c_i \doteq J_{C_i}(\theta) - d_i$; then, the approximation of Equation (15) is as follows:
$$\begin{aligned} \theta' = \arg\max_{\tilde\theta}\;& g^T (\tilde\theta - \theta) \\ \text{s.t.}\quad & c_i + b_i^T (\tilde\theta - \theta) \le 0 \\ & \tfrac{1}{2} (\tilde\theta - \theta)^T H (\tilde\theta - \theta) \le \delta \end{aligned}$$
The dual problem of the above equation is as below:
$$\max_{\mu_1 \ge 0,\, \mu_2 \ge 0}\; -\frac{1}{2\mu_1}\left( g^T H^{-1} g - 2\tau^T \mu_2 + \mu_2^T S \mu_2 \right) + \mu_2^T c - \frac{\mu_1 \delta}{2}$$
where $c = [c_0, c_1, \ldots]$, $\tau \doteq g^T H^{-1} B$, $S \doteq B^T H^{-1} B$, and $B = [b_0, b_1, \ldots]$.
Equation (25) can be solved by approximating CPO update as follows:
If Equation (26) is feasible:
$$\hat\theta = \theta + \frac{1}{\mu_1^*} H^{-1} \left( g - \mu_2^*\, b \right)$$
else:
$$\hat\theta = \theta - \sqrt{\frac{2\delta}{b^T H^{-1} b}}\; H^{-1} b$$
where μ 1 * and μ 2 * are solutions of Equation (26).
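A compact sketch of this update rule for a single constraint is given below. The feasibility test is the standard trust-region check used in CPO-style methods, and the backtracking line search that Algorithm 1 applies afterwards is omitted; mu1 and mu2 are assumed to come from solving the dual problem.

```python
import numpy as np

def approximate_cpo_step(theta, g, b, c, H_inv, delta, mu1, mu2):
    """One approximate trust-region policy update with a recovery fallback,
    in the spirit of the update rule above (single constraint, no line search)."""
    q = float(b @ H_inv @ b)
    feasible = c <= np.sqrt(2.0 * delta * q)   # constraint reachable within the trust region
    if feasible:
        step = (H_inv @ (g - mu2 * b)) / mu1   # dual-based reward/constraint trade-off step
    else:
        step = -np.sqrt(2.0 * delta / q) * (H_inv @ b)   # pure recovery step reducing the cost
    return theta + step
```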
In summary, the aircraft attitude control based on ESB-CPO is shown in Algorithm 1. The corresponding framework is shown in Figure 6.
Algorithm 1 Aircraft attitude control based on ESB-CPO
Initialize the policy network π θ , the value network V
Initialize the replay buffer B and step counter t = 0
for k = 0, 1, 2, … do
    Use policy π θ k to carry out the flight mission and collect a batch of samples
     D = { τ } = { ( s t , a t , r t e , s t + 1 ) }
    According to the FIFO principle, update replay buffer B with D
    Update step counter t = t + len( D )
    for  τ in D  do
          for s in τ  do
                 $\beta^{\theta_k}(s) = 1 + \min\left( \tanh\left( z^{\theta_k}(s) \right),\; 0 \right)$
           end for
     end for
    Compute α θ k by solving the local dual problem
    Estimate g ^ , b ^ , c ^ and   H ^ using the sample constructed with D
    if approximate ESB-CPO is feasible then
                        $\hat\theta = \theta + \frac{1}{\mu_1^*} H^{-1} \left( g - \mu_2^*\, b \right)$
    else
                      $\hat\theta = \theta - \sqrt{\frac{2\delta}{b^T H^{-1} b}}\; H^{-1} b$
    end if
    Obtain θ k + 1 by backtracking line search to enforce satisfaction of constraint function in (15)
    Update V by TD-like critic learning
end for

3.2. Condition-Triggered Meta-Learning Online Learning Method

Meta-learning, also known as learning to learn, is characterized by the fact that the trained deep model’s structure is not designed to complete a specific task in a particular scenario, but rather to rapidly adapt and accomplish new tasks in different scenarios after only a few training samples and one or a few iterations. This fully embodies the idea of enabling machines to learn to learn.
Assume there exists a sample set related to each training task $T_i$, also known as a task set, where each task set contains its own training data and testing data; in meta-learning, these are referred to as the support set and query set, respectively. We define the initial parameters of the network as $\phi$ and the model parameters trained on the $i$-th task as $\hat\theta_i$. Therefore, the overall loss function can be defined as follows:
$$L(f_\phi) = \sum_{i=1}^{n} l^{i}\!\left( f_{\hat\theta_i} \right),$$
using gradient descent to update ϕ .
As a result, an optimized model can be obtained. When a new task T i arrives, a small training sample set can be used to train the network, enabling the rapid acquisition of the corresponding network model parameters for that task.
We apply meta-learning to the safe flight control method based on constrained reinforcement learning, using real-time interactive data to update the task set based on conditional triggers and continue fine-tuning the model. The online flight control is shown in Algorithm 2.
Algorithm 2 Online flight control based on meta-learning
Loading offline model parameters
Initialize the replay buffer B , step counter t = 0, learning rates α, β, and the batch counter l = 0.
while not done do
    Use policy π θ l to carry out the flight mission and collect a batch of samples D = { ( s t , a t , r t e , s t + 1 ) }
    According to the FIFO principle, update playback buffer B with D .
    Update step counter t = t + len( D )
    if attitude angle error exceeds threshold, then
        l = l + 1
        Use the samples in B to construct a task set, and divide the task set into a support set and a query set
        Utilize the support set to compute adaptive parameters
                      $\theta_i' = \theta - \alpha \nabla_\theta L\left( f_\theta \right)$
        Utilize the query set to update the policy network parameters
$\theta = \theta - \beta \nabla_\theta L\left( f_{\theta_i'} \right)$
    end if
end
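The sketch below shows one condition-triggered update in the spirit of Algorithm 2, using a first-order approximation of the meta-gradient (the full MAML outer gradient would differentiate through the inner step). The loss_fn callable, which is assumed to return a scalar differentiable loss on a batch, and the learning rates are illustrative assumptions.

```python
import copy
import torch

def triggered_meta_update(policy, support, query, loss_fn, alpha=1e-5, beta=1e-5):
    """One condition-triggered meta-update: adapt a copy of the policy on the
    support set, then update the deployed policy from the query-set loss."""
    # Inner step: adapt a copy of the policy on the support set
    adapted = copy.deepcopy(policy)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=alpha)
    inner_opt.zero_grad()
    loss_fn(adapted, support).backward()
    inner_opt.step()

    # Outer step: evaluate the adapted parameters on the query set and apply
    # the resulting (first-order) gradient to the deployed policy
    query_loss = loss_fn(adapted, query)
    grads = torch.autograd.grad(query_loss, list(adapted.parameters()))
    with torch.no_grad():
        for p, g in zip(policy.parameters(), grads):
            p -= beta * g
    return policy
```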

4. Results and Discussion

The experiments were conducted on a hardware platform equipped with an AMD Ryzen 9 7945HX CPU running Windows 11 (version 23H2), and were based on the open-source deep learning framework PyTorch (version 1.10.1). Our experiments are also applicable to the TensorFlow framework and can run stably on an Ubuntu system.
The limitation on the deflection of control surfaces has greater practical significance in attitude control tasks. In the experiments involving attitude control tasks at this stage, the loss limit is set to 8, and the training results of the controller are shown in Figure 7.
As can be seen from the above figure, TRPO [34] has a larger rudder deflection loss, reflecting its high-frequency rudder control policy. CPO and PPO fail to learn a policy that achieves a satisfactory reward. SAC [35] has a greater rudder deflection loss than ESB-CPO, and its reward converges more slowly than those of TRPO and ESB-CPO. ESB-CPO performs close to TRPO in the early stages of training, and its reward converges rapidly. Thereafter, the rudder deflection loss decreases gradually, which benefits the safe flight of the aircraft, while the return remains close to that of TRPO despite the limit on the amount of rudder deflection. The training process of the controller clearly illustrates how our method works. By adding rudder deflection constraints to the attitude control task, the controller explores effectively under very loose constraints in the early stage, with both high rewards and high losses. In the later stage, the controller gradually tries to avoid unsafe situations, and the rudder deflection loss gradually decreases. The results show that this method allows the loss to exceed the limit and the constraints to be violated early on, while the constrained policy ultimately returns to the safe region, ensuring smooth rudder deflection and improving flight safety.

4.1. Assessment Method

To assess the control quality of the control algorithm, it is first necessary to determine whether it is stable or not based on the simulation results. If it is stable, the time domain control index is calculated, and then the control quality score of the algorithm is calculated; if it is unstable, the control quality score of the algorithm is recorded as 0.
The control quality score is calculated based on the satisfaction of time-domain indicators. For each indicator, if it meets the set scoring criterion, the corresponding score for that indicator is awarded; otherwise, no score is awarded for that indicator. The cumulative score over all indicators is the control quality score $q_i$ of the algorithm, where $i$ is the simulation case number. The scoring criteria and values for each indicator are shown in Table 1.
Under the disturbances of aerodynamic parameters, wind, and other parameters, a total of N simulation cases were run. Based on the stability, control quality of each case, and the distribution of the quality of a group of cases, the total score S of the algorithm was obtained according to the following method:
$$S = C_1 S_1 + C_2 S_2 + C_3 S_1 S_3$$
C 1 , C 2 , and C 3 represent the full scores for different individual items, as detailed in Table 2 below.
S 1 is the scoring coefficient for the stability proportion item, which represents the proportion of stable cases among all the cases:
$$S_1 = \frac{N_s}{N}$$
where N s is the number of stable cases and N is the total number of cases.
S 2 is the scoring coefficient for the control quality item, which represents the average score of the control quality of each case:
$$S_2 = \frac{\sum_{i=1}^{N} q_i}{N}$$
S 3 is the scoring coefficient for the indicator dispersion item, which is calculated based on the dispersion degree of the control quality scores of the stable cases:
$$S_3 = 1 - V$$
where $V$ is the dispersion coefficient of the control quality scores for the stable cases. Let $q_{s,i}$ represent the control quality score of each stable case, and let $q_s = \left[ q_{s,1}, q_{s,2}, \ldots \right]$. The dispersion coefficient $V$ is the ratio of the standard deviation to the mean of $q_s$:
$$V = \frac{\mathrm{std}(q_s)}{\mathrm{mean}(q_s)}$$
Then, to evaluate the real-time performance of the algorithm, a total of N examples were run under the disturbance of aerodynamic parameters and wind. During online learning, when the pitch angle or yaw angle deviation exceeds the set threshold, the condition for triggering learning is met, and timing is started. When the next action ends, timing is stopped. The definition of the control algorithm’s single calculation time is as follows:
$$T_i = t_{off}^{i} - t_{on}^{i}$$
where T i is the single calculation time of the control algorithm when the i-th condition is triggered; t o n i is the time when the i-th condition is triggered and the timer starts counting; t o f f i is the time at which the timing stops when the i-th condition is triggered and the next action ends.
So, the average time consumption of the control algorithm is as follows:
$$\bar{T} = \frac{\sum_{i=1}^{N} T_i}{N}$$
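A small sketch of this scoring procedure is given below; the guards for empty or zero-mean stable-case arrays are illustrative additions rather than part of the assessment method.

```python
import numpy as np

def total_assessment_score(case_scores, stable_mask, C1=30.0, C2=40.0, C3=30.0):
    """Compute S = C1*S1 + C2*S2 + C3*S1*S3 from the per-case control-quality
    scores q_i (0 for unstable cases) and per-case stability flags."""
    q = np.asarray(case_scores, dtype=float)
    stable = np.asarray(stable_mask, dtype=bool)
    N = len(q)
    S1 = stable.sum() / N                 # proportion of stable cases
    S2 = q.sum() / N                      # average control-quality score
    qs = q[stable]                        # scores of the stable cases only
    V = qs.std() / qs.mean() if qs.size and qs.mean() > 0 else 1.0
    S3 = 1.0 - V                          # indicator dispersion term
    return C1 * S1 + C2 * S2 + C3 * S1 * S3
```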

4.2. Experimental Details

During the simulation process, we obtain the state quantities of the aircraft by numerical integration using the fourth-order Runge–Kutta method with a fixed step size of $\Delta t = 0.002$ s. The model parameters and initial state of the aircraft are shown in Table 3.
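The integration step itself can be written generically as below; f stands for the right-hand side of the aircraft differential equations in Section 2.1 assembled into a single state-derivative function, and the state x is assumed to be a NumPy-style array.

```python
def rk4_step(f, x, u, dt=0.002):
    """One fixed-step fourth-order Runge-Kutta update of the state given the
    derivative function x_dot = f(x, u); dt matches the 0.002 s step quoted above."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```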
During the training phase, the target attitude angles were set as shown in Equation (6). The goal of training is to complete the transition process within 20 s, and the simulation step size is 0.02 s, i.e., 1000 steps. Specific network hyperparameters are shown in Table 4.
During the testing phase, the goal of testing is to complete the transition process within 30 s, and the simulation step size is 0.02 s, i.e., 1500 steps. Specific network hyperparameters are shown in Table 5. The target attitude angles, specifically the target pitch angle and the target yaw angle, were set to vary piecewise over time. The equations for these changes are as follows:
$$\theta_{cx} = \begin{cases} 10.0^\circ & 0.0\ \text{s} \le t < 3.0\ \text{s} \\ 15.0^\circ & 3.0\ \text{s} \le t < 10.0\ \text{s} \\ 12.0^\circ & 10.0\ \text{s} \le t \le 30.0\ \text{s} \end{cases}$$
$$\psi_{cx} = \begin{cases} 0.0^\circ & 0.0\ \text{s} \le t < 15.0\ \text{s} \\ 20.0^\circ & 15.0\ \text{s} \le t < 22.0\ \text{s} \\ 0.0^\circ & 22.0\ \text{s} \le t \le 30.0\ \text{s} \end{cases}$$
Assume that the components of the wind velocity vector $\omega_w$ in the ground coordinate system are $\left[ V_{wx}, V_{wy}, V_{wz} \right]^T$ and that the aircraft is in a steady horizontal wind field, so the wind velocity vector lies within the horizontal plane. The wind direction angle $A_w$ is defined, according to the right-hand rule, as the angle rotated about the $O_g Y_g$ axis from the $O_g X_g$ axis of the ground coordinate system to the wind velocity vector. The wind velocity components in the ground coordinate system are then calculated using the following formula:
$$\begin{bmatrix} V_{wx} \\ V_{wy} \\ V_{wz} \end{bmatrix} = \begin{bmatrix} v_w \cos A_w \\ 0 \\ v_w \sin A_w \end{bmatrix}$$
We conducted 100 simulations, with the wind direction angle $A_w$ randomly selected from the interval [0°, 360°] and the wind speed $v_w$ randomly selected from the interval [5, 60] m/s. Meanwhile, we added random white noise with zero mean and variance ranging from 1 to 10 to the aerodynamic parameters. The total scores for both the offline flight control method based on ESB-CPO and the meta-learning online flight control method based on ESB-CPO are presented in Table 6.
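A sketch of how one such wind disturbance case could be drawn is shown below; the component layout assumes a Y-up ground frame, matching the horizontal-wind formula above.

```python
import numpy as np

def sample_wind_disturbance():
    """Draw one steady horizontal wind case as described above: direction
    uniform in [0, 360) degrees, speed uniform in [5, 60] m/s."""
    A_w = np.deg2rad(np.random.uniform(0.0, 360.0))
    v_w = np.random.uniform(5.0, 60.0)
    # Ground-frame components with the Y axis taken as vertical
    return np.array([v_w * np.cos(A_w), 0.0, v_w * np.sin(A_w)])
```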
As shown in Table 6, the overall score for executing the aircraft attitude angle control task using the offline control law is 82.5 points, while the overall score for performing the same task using the online control law is 100 points. By introducing online learning, the adaptability of attitude control to comprehensive errors such as aerodynamic parameters and wind has improved by 21% compared to offline learning.
We measured the computation time of the control algorithm over 1000 runs; the average computation time per call is 0.6 ms on the 2.5 GHz CPU. Converted to an equivalent 1 GHz processor, a single computation of the online-stage control algorithm takes about 1.5 ms, which meets the real-time requirements.
The attitude angle control test results for the offline flight control method based on ESB-CPO and the online flight control method based on meta-learning with ESB-CPO are shown in Figure 8 and Figure 9, respectively.
As can be seen from the above figures, the flight control method based on ESB-CPO can reduce the frequency of control surface movements by incorporating control surface losses to limit their rotation. When step commands arrive, the control surfaces quickly deflect to track the attitude angles, while during other time periods the control surface deflection remains stable. In the offline phase of the attitude control task, due to the large variation in the step commands for the attitude angles, the control law cannot quickly complete the attitude angle tracking task. However, in the online phase, the control law can learn online to adjust the aircraft control policy, completing the attitude angle control task while reducing the rudder deflection, which ensures the safety and stability of the aircraft during the flight process.
As CPO and PPO cannot perform the attitude angle tracking task, only TRPO and SAC test results are shown in Figure 10 and Figure 11.
As can be seen from Figure 10 and Figure 11, the flight control method based on TRPO achieves attitude control accuracy by increasing the frequency and amplitude of the rudder deflection, which leads to greater perturbation of the aircraft during the flight process and poses a major potential hazard to flight safety. With the SAC-based flight control method, the elevator deflection frequency is significantly higher, leading to a decrease in its tracking stability.

5. Conclusions

To address the issues of efficiency, safety, and stability in the intelligent flight process of UAVs, we propose a new architecture for online safe flight control based on constrained reinforcement learning. Firstly, a flight control simulation environment is established based on PyFME for offline training and online learning. Secondly, to avoid flight accidents or mission failures caused by online learning, the ESB-CPO algorithm is used for flight control, and a constrained optimization problem is constructed based on the trust region method, which ensures the safety and stability of the aircraft during flight by introducing the Lyapunov stability requirement and limiting the rudder deflection loss. Finally, meta-learning is combined with the ESB-CPO algorithm to perform attitude angle tracking tasks. Experimental results show that the overall score of the aircraft attitude angle control task is 100 points, and the adaptability of attitude control to comprehensive errors, including aerodynamic parameters and wind, improves by 21% compared to offline learning after introducing online learning, indicating that the control law can be adapted online to the environment to ensure the safety and stability of the aircraft during flight. In the current study, research is conducted only in a simulated environment, which may not fully capture the complexities and uncertainties present in the real world. In future work, experiments need to be conducted in more complex scenarios, such as communication interference or motor faults [36,37,38], to explore and improve control strategies. At the same time, we will test the method on various aircraft models in order to translate the research results into practical applications.

Supplementary Materials

The following supporting information can be downloaded at: https://gitee.com/w776538047/online-safe-flight-control-method, Video S1: A video of the attitude angle control test result.

Author Contributions

Conceptualization, J.Z.; methodology, H.X.; software, H.X.; validation, J.Z.; formal analysis, Z.W.; investigation, Z.W.; resources, Z.W.; data curation, Z.W.; writing—original draft preparation, J.Z.; writing—review and editing, T.Z.; visualization, J.Z.; supervision, T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to thank the other researchers who helped with this study, family and friends who were not involved in the editing of this article, and the reviewers for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cheng, H.; Zhang, S.; Liu, T.; Xu, S.; Huang, H. Review of Autonomous Decision-Making and Planning Techniques for Unmanned Aerial Vehicle. Air Space Def. 2024, 7, 6–15+80. [Google Scholar]
  2. Swaroop, D.; Hedrick, K.; Yip, P.P.; Gerdes, J.C. Dynamic surface control for a class of nonlinear systems. IEEE Trans. Autom. Control 2000, 45, 1893–1899. [Google Scholar] [CrossRef]
  3. Xidias, E.K. A Decision Algorithm for Motion Planning of Car-Like Robots in Dynamic Environments. Cybern. Syst. 2021, 52, 533–552. [Google Scholar] [CrossRef]
  4. Huang, Z.; Li, F.; Yao, J.; Chen, Z. MGCRL: Multi-view graph convolution and multi-agent reinforcement learning for dialogue state tracking. IEEE Trans. Autom. Control 2000, 45, 1893–1899. [Google Scholar] [CrossRef]
  5. Hellaoui, H.; Yang, B.; Taleb, T.; Manner, J. Traffic Steering for Cellular-Enabled UAVs: A Federated Deep Reinforcement Learning Approach. In Proceedings of the 2023 IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023. [Google Scholar]
  6. Xia, B.; Mantegh, I.; Xie, W. UAV Multi-Dynamic Target Interception: A Hybrid Intelligent Method Using Deep Reinforcement Learning and Fuzzy Logic. Drones 2024, 8, 226. [Google Scholar] [CrossRef]
  7. Kaufmann, E.; Bauersfeld, L.; Loquercio, A.; Müller, M.; Koltun, V.; Scaramuzza, D. Champion-level drone racing using deep reinforcement learning. Nature 2023, 620, 982–987. [Google Scholar] [CrossRef]
  8. Cui, Y.; Hou, B.; Wu, Q.; Ren, B.; Wang, S.; Jiao, L.C. Remote Sensing Object Tracking With Deep Reinforcement Learning Under Occlusion. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  9. Zhu, Z.D.; Lin, K.X.; Jain, A.K.; Zhou, J.Y. Transfer Learning in Deep Reinforcement Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13344–13362. [Google Scholar] [CrossRef]
  10. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; AI, S.; Yogamani, S.; Pérez, P. Deep Reinforcement Learning for Autonomous Driving: A Survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4909–4926. [Google Scholar] [CrossRef]
  11. Minsky, M. Steps toward Artificial Intelligence. Proc. IRE. 1961, 49, 8–20. [Google Scholar] [CrossRef]
  12. Zhao, W.; He, T.; Chen, R.; Wei, T.; Liu, C. Safe Reinforcement Learning: A Survey. Acta Autom. Sin. 2023, 49, 1813–1835. [Google Scholar]
  13. Liu, X.; Nan, Y.; Xie, R.; Zhang, S. DDPG Optimization Based on Dynamic Inverse of Aircraft Attitude Control. Comput. Simul. 2020, 37, 37–43. [Google Scholar]
  14. Hao, C.; Fang, Z.; Li, P. Output feedback reinforcement learning control method based on reference model. J. Zhejiang Univ. Eng. Sci. 2013, 47, 409–414+479. [Google Scholar]
  15. Huang, X.; Liu, J.; Jia, C.; Wang, Z.; Zhang, J. Deep Deterministic policy gradient algorithm for UAV control. Acta Aeronaut. Astronaut. Sin. 2021, 42, 404–414. [Google Scholar]
  16. Choi, J.; Kim, H.M.; Hwang, H.J.; Kim, Y.D.; Kim, C.O. Modular Reinforcement Learning for Autonomous UAV Flight Control. Drones 2023, 7, 418. [Google Scholar] [CrossRef]
  17. Woo, J.; Yu, C.; Kim, N. Deep reinforcement learning-based controller for path following of an unmanned surface vehicle. Ocean Eng. 2019, 183, 155–166. [Google Scholar] [CrossRef]
  18. Tang, J.; Liang, Y.; Li, K. Dynamic Scene Path Planning of UAVs Based on Deep Reinforcement Learning. Drones 2024, 8, 60. [Google Scholar] [CrossRef]
  19. Wang, W.; Gokhan, I. Reinforcement learning based closed-loop reference model adaptive flight control system design. Sci. Technol. Eng. 2023, 23, 14888–14895. [Google Scholar]
  20. Yang, R.; Du, C.; Zheng, Y.; Gao, H.; Wu, Y.; Fang, T. PPO-Based Attitude Controller Design for a Tilt Rotor UAV in Transition Process. Drones 2023, 7, 499. [Google Scholar] [CrossRef]
  21. Burak, Y.; Wu, H.; Liu, H.X.; Yang, Y. An Attitude Controller for Quadrotor Drone Using RM-DDPG. Int. J. Adapt. Control Signal Process. 2021, 35, 420–440. [Google Scholar]
  22. Ma, B.; Liu, Z.; Dang, Q.; Zhao, W.; Wang, J.; Cheng, Y.; Yuan, Z. Deep reinforcement learning of UAV tracking control under wind disturbances environments. IEEE Trans. Instrum. Meas. 2023, 72, 1–13. [Google Scholar] [CrossRef]
  23. Chow, Y.; Nachum, O.; Faust, A.; Ghavamzadeh, M.; DuéñezGuzmán, E. Lyapunov-based safe policy optimization for continuous control. arXiv 2019, arXiv:1901.10031. [Google Scholar]
  24. Yu, X.; Xu, S.; Fan, Y.; Ou, L. Self-Adaptive LSAC-PID Approach Based on Lyapunov Reward Shaping for Mobile Robots. J. Shanghai Jiaotong Univ. (Sci.) 2023, 1–18. [Google Scholar] [CrossRef]
  25. PyFME. Available online: https://pyfme.readthedocs.io/en/latest/ (accessed on 12 April 2024).
  26. Filipe, N. Nonlinear Pose Control and Estimation for Space Proximity Operations: An Approach Based on Dual Quaternions. Ph.D. Thesis, Georgia Institute of Technology, Atlanta, GA, USA, 2014. [Google Scholar]
  27. Qing, Y.Y. Inertial Navigation, 3rd ed.; China Science Publishing & Media Ltd.: Beijing, China, 2020; pp. 252–284. [Google Scholar]
  28. Gazebo. Available online: https://github.com/gazebosim/gz-sim (accessed on 28 July 2024).
  29. Madaan, R.; Gyde, N.; Vemprala, S.; Vemprala, M.; Brown, M.; Nagami, K.; Taubner, T.; Cristofalo, E.; Scaramuzza, D.; Schwager, M.; et al. AirSim drone racing Lab. arXiv 2020, arXiv:2003.05654. [Google Scholar]
  30. FlightGear. Available online: https://wiki.flightgear.org/Main_Page (accessed on 28 July 2024).
  31. X-Plane. Available online: https://developer.x-plane.com/docs/ (accessed on 28 July 2024).
  32. Xu, H.; Wang, S.; Wang, Z.; Zhang, Y.; Zhuo, Q.; Gao, Y.; Zhang, T. Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization. In Proceedings of the 2023 IEEE International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023. [Google Scholar]
  33. Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained Policy Optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017. [Google Scholar]
  34. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015. [Google Scholar]
  35. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv 2018, arXiv:1801.01290. [Google Scholar]
  36. Zheng, Q.; Zhao, P.; Zhang, D.; Wang, H. MR-DCAE: Manifold regularization-based deep convolutional autoencoder for unauthorized broadcasting identification. Int. J. Intell. Syst. 2021, 36, 7204–7238. [Google Scholar] [CrossRef]
  37. Gopi, S.P.; Magarini, M.; Alsamhi, S.H.; Shvetsov, A.V. Machine Learning-Assisted Adaptive Modulation for Optimized Drone-User Communication in B5G. Drones 2021, 5, 128. [Google Scholar] [CrossRef]
  38. Zheng, Q.; Saponara, S.; Tian, X.; Yu, Z.; Elhanashi, A.; Yu, R. A real-time constellation image classification method of wireless communication signals based on the lightweight network MobileViT. Cogn. Neurodyn. 2024, 18, 659–671. [Google Scholar] [CrossRef]
Figure 1. The corresponding calculation diagram, where $C_b^n$ is the attitude matrix. $C_{11} = q_0^2 + q_1^2 - q_2^2 - q_3^2$, $C_{12} = 2(q_1 q_2 - q_0 q_3)$, $C_{13} = 2(q_0 q_2 + q_1 q_3)$, $C_{21} = 2(q_1 q_2 + q_0 q_3)$, $C_{22} = q_0^2 - q_1^2 + q_2^2 - q_3^2$, $C_{23} = 2(q_2 q_3 - q_0 q_1)$, $C_{31} = 2(q_1 q_3 - q_0 q_2)$, $C_{32} = 2(q_2 q_3 + q_0 q_1)$, $C_{33} = q_0^2 - q_1^2 - q_2^2 + q_3^2$.
Figure 2. Control system block diagram.
Figure 3. Action network block diagram.
Figure 4. Attitude control mission diagram.
Figure 5. Schematic diagram of online safe flight control method based on constrained reinforcement learning. During the offline training phase of the control law, constrained reinforcement learning with additional safety budget is utilized to update the control law strategy through the interaction between the control law and the offline training environment. In the online optimization stage of the control law, the control law model trained in the offline stage is used to construct a task set by flying online in a virtual environment. Then, real-time interactive data are used to update the task set using a conditional triggered meta-learning online reinforcement learning method and continue fine-tuning the control law strategy.
Figure 6. ESB-CPO algorithm framework. This method first calculates the adaptive factors α i θ and β i θ ( s t ) , and then obtains the LAE value from them. Finally, the approximate trust domain method is used to obtain the new policy π θ through backtracking search to ensure that the constraints are met and update the current policy.
Figure 7. Training results of attitude control tasks.
Figure 8. Offline attitude angle control based on ESB-CPO. Theta, psi, and phi represent the pitch angle, yaw angle, and roll angle, respectively. Delta_elevator, delta_aileron, and delta_rudder represent the deflection amounts of the elevator, rudder, and aileron, respectively. Cur_average is the average of the attitude angle errors over the last 10 time steps.
Figure 9. Online attitude angle control based on ESB-CPO, see Supplementary Material.
Figure 10. Online attitude angle control based on TRPO.
Figure 11. Online attitude angle control based on SAC.
Table 1. Scoring criteria and values for each indicator.
Indicator Subjects | Scoring Criteria | Indicator Score
Pitch channel, first incentive | Steady-state error e_ss,pitch,1 ≤ 1.0° | 0.1
 | Adjusting time t_s,pitch,1 ≤ 4.0 s | 0.05
 | Overshoot σ_pitch,1 ≤ 3.0° | 0.1
Pitch channel, second incentive | Steady-state error e_ss,pitch,2 ≤ 1.0° | 0.1
 | Adjusting time t_s,pitch,2 ≤ 2.0 s | 0.05
 | Overshoot σ_pitch,2 ≤ 1.5° | 0.1
Yaw channel, first incentive | Steady-state error e_ss,yaw,1 ≤ 1.0° | 0.1
 | Adjusting time t_s,yaw,1 ≤ 5.0 s | 0.05
 | Overshoot σ_yaw,1 ≤ 5.0° | 0.1
Yaw channel, second incentive | Steady-state error e_ss,yaw,2 ≤ 1.0° | 0.1
 | Adjusting time t_s,yaw,2 ≤ 5.0 s | 0.05
 | Overshoot σ_yaw,2 ≤ 5.0° | 0.1
Total | | 1.0
Table 2. Full marks for each individual item.
Symbol | Meaning | Value
C_1 | Full marks for the stabilization ratio term | 30
C_2 | Full marks for the control quality term | 40
C_3 | Full marks for the indicator dispersion term | 30
Total | | 100
Table 3. Model parameters and initial state of the aircraft.
Parameter | Symbol | Value | Dimension
Mass | m | 30 | kg
Wingspan | S_w | 3 | m
Reference area | S_ref | 1.5 | m²
Reference chord length | L_ref | 0.469 | m
X-axis coordinate of the centre of mass in the theoretical vertex system | x_c | 0.632 | m
Y-axis coordinate of the centre of mass in the theoretical vertex system | y_c | 0.0473 | m
Z-axis coordinate of the centre of mass in the theoretical vertex system | z_c | 0.0014 | m
Initial position random range | H_0 | (200, 400) | m
Initial speed random range | V_0 | (25, 40) | m/s
Initial pitch angle | θ_0 | 0.0 | °
Initial yaw angle random range | ψ_0 | (−180, 180) | °
Initial roll angle | Φ_0 | 0.0 | °
Initial pitch rate | q_0 | 0.0 | °/s
Initial yaw rate | r_0 | 0.0 | °/s
Initial roll rate | p_0 | 0.0 | °/s
Table 4. Hyperparameters used in training.
Name | Value | Name | Value
check_freq | 25 | min_rel_budget | 1.0
cost_limit | 8 | safety_budget | 15
entropy_coef | 0.01 | saute_discount_factor | 0.99
epochs | 500 | test_rel_budget | 1.0
gamma | 0.99 | unsafe_reward | −1.0
lam | 0.95 | save_freq | 10
lam_c | 0.95 | seed | 0
max_grad_norm | 0.5 | steps_per_epoch | 10,000
num_mini_batches | 16 | target_kl | 0.01
pi_lr | 0.0003 | train_pi_iterations | 80
max_ep_len | 1000 | train_v_iterations | 40
max_rel_budget | 1.0 | vf_lr | 0.001
Table 5. Hyperparameters used in testing.
Name | Value | Name | Value
max_ep_len | 1500 | qry_size | 80
buffer_size | 1000 | dist_angle | 0.8
batch_size | 200 | learning_rate | 1 × 10−5
minimal_size | 200 | gamma | 0.99
sup_size | 120 | lam | 0.95
Table 6. Attitude control mission assessment results.
Symbol | Meaning | Value
S_off | Total score of the offline flight algorithm | 82.5
S_on | Total score of the online flight algorithm | 100
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
