1. Introduction
In recent years, vertical take-off and landing (VTOL) unmanned aerial vehicles (UAVs) have gained considerable attention due to their unique advantages. Distinct from traditional fixed-wing UAVs, VTOL UAVs are capable of taking off and landing vertically, eliminating the need for a runway [1]. Moreover, compared to multi-rotor UAVs, they offer several benefits, such as a larger payload capacity, a higher cruise speed, and a longer flight range [2]. This is primarily attributed to their reliance on wings for lift production, as opposed to the multiple rotors utilized by multi-rotor UAVs. Among the various VTOL UAV configurations, such as tilt-rotor, tail-sitter, and vectored-thrust designs, the ducted-fan tail-sitter fixed-wing UAV stands out as a unique design. Notably, it eliminates the need for additional moving parts to achieve VTOL capability, leading to a simplified mechanical design that is both easier to maintain and less prone to mechanical failures [3]. Moreover, the ducted fan enhances propulsion efficiency, allowing the UAV to cover longer distances; it also reduces noise, making these UAVs particularly suitable for noise-sensitive environments [4].
Ducted-fan tail-sitter fixed-wing UAVs integrate the characteristics of both traditional fixed-wing and multi-rotor UAVs, enabling them to perform both level flight and hover. However, during the transition between these modes, the UAV experiences aerodynamic instability due to stall at the fixed wing [5]. As a result, control strategies for transition maneuvers between the two flight modes are of paramount importance and warrant extensive investigation. Generally, two prevalent methods to address this issue can be found in the current literature [3].
The first approach conceptualizes the transition process as a trajectory optimization problem. Li et al. [6] employed the direct collocation method to obtain the optimal transition trajectory. An optimization approach using the interior point method, focusing on altitude changes during transition, was proposed in [7]. Kubo and Suzuki [8] computed the optimal feed-forward control input for transition via sequential quadratic programming. Banazadeh [9] devised a gradient-based algorithm based on the classical Cauchy method to generate optimal transition trajectories. Naldi and Marconi [10] utilized mixed-integer nonlinear programming to tackle the minimum-time and minimum-energy optimal transition problems.
The second approach is structured in two stages: the first stage devises the desired trajectory for the transition, and the second stage designs a controller to track the trajectory established in the first stage [3]. Jeong et al. [11] proposed a continuous ascent transition trajectory parameterized by the angle of attack and flight path angle and invoked dynamic inversion control for tracking. Flores [12] provided a desired velocity trajectory and implemented a recurrent neural network-based controller for feedback linearization. Cheng and Pei [13] established a transition corridor based on the constraint of maintaining a fixed altitude; they planned the desired velocity within the corridor and utilized an adaptive controller [14].
Alternatively, reinforcement learning (RL) has emerged as a promising approach to address various challenges in the domain of UAVs, owing to its ability to learn and adapt in dynamic environments [15]. As a result, there has been growing interest in leveraging RL algorithms to tackle the transition problem of VTOL UAVs. In [16], an RL-based controller for hybrid UAVs was designed that not only completes the transition automatically but can also be adapted to different configurations. Xu [17] proposed a soft-landing control algorithm based on the RL method. Yuksek [18] employed the deep deterministic policy gradient algorithm to address the transition flight problem for tilt-rotor UAVs.
In this paper, we solve the back-transition task of ducted-fan tail-sitter fixed-wing UAVs (i.e., from level flight mode to hover mode) using safe RL algorithms. While prior work [16] mainly emphasizes the successful execution of transition maneuvers, our focus is on minimizing altitude changes and the transition time while adhering to velocity constraints. Compared to [18], our method integrates the trajectory optimization and control problems into a single RL-based learning process, thereby reducing complexity and computational effort. To the best of our knowledge, this is one of the first works in which the RL methodology is utilized to solve the back-transition control problem of ducted-fan tail-sitter UAVs.
The main contributions of our work are as follows.
- 1. We develop a mathematical model of ducted-fan fixed-wing UAV dynamics. Based on this model, we create a training environment for ducted-fan UAVs in OpenAI GYM [19] using the fourth-order Runge–Kutta method (see the sketch after this list).
- 2. Taking into account the velocity constraint during the transition process, we train controllers using Trust Region Policy Optimization (TRPO) with a fixed penalty, Proximal Policy Optimization with Lagrangian (PPOLag), and Constrained Policy Optimization (CPO). We assess the performance of these algorithms and demonstrate the superiority of the CPO algorithm.
- 3. Comparing the CPO algorithm with the optimal trajectory obtained via GPOPS-II [20], we find that the performance of CPO closely approximates the optimal trajectory. In addition, the CPO algorithm is robust to unknown perturbations of the UAV model parameters and to wind disturbance, a robustness that the GPOPS-II software lacks.
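As a rough illustration of contribution 1, the following minimal sketch shows how a fourth-order Runge–Kutta step can drive an OpenAI GYM-style environment. The dynamics function, state layout, and constants here are placeholders rather than the exact model of this paper.

```python
import numpy as np
import gym
from gym import spaces

class DuctedFanEnv(gym.Env):
    """Minimal sketch of an RK4-integrated longitudinal UAV environment.

    The dynamics below are a placeholder; a real environment would
    implement the 3-DOF model of Section 2.
    """

    def __init__(self, dt=0.02):  # 50 Hz controller/sensor rate
        self.dt = dt
        # state: [x, z, u, w, theta, q]; action: [rotor speed, vane deflection]
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(6,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
        self.state = np.zeros(6, dtype=np.float64)

    def _dynamics(self, s, a):
        # Placeholder time derivative ds/dt = f(s, a); replace with the model of Section 2.
        return np.zeros_like(s)

    def _rk4_step(self, s, a):
        dt, f = self.dt, self._dynamics
        k1 = f(s, a)
        k2 = f(s + 0.5 * dt * k1, a)
        k3 = f(s + 0.5 * dt * k2, a)
        k4 = f(s + dt * k3, a)
        return s + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

    def step(self, action):
        self.state = self._rk4_step(self.state, action)
        reward, done, info = 0.0, False, {}  # reward/termination logic omitted
        return self.state.astype(np.float32), reward, done, info

    def reset(self):
        self.state = np.zeros(6, dtype=np.float64)
        return self.state.astype(np.float32)
```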
This paper is organized as follows. In Section 2, a mathematical model of a ducted-fan fixed-wing UAV is described. In Section 3, the general structure of the RL transition controller is introduced, and the reward function, action, and observation are explained. In Section 4, comparisons between CPO and other RL algorithms are reported; we also compare the transition trajectory of CPO with that of GPOPS-II and verify the robustness of the CPO algorithm. In Section 5, concluding remarks and future works are presented.
2. Mathematical Modeling
In this section, we describe a three-degree-of-freedom (DOF) longitudinal model for ducted-fan fixed-wing UAVs. The model is derived from the Newton–Euler theorems and is simplified from the full 6-DOF dynamic model. Compared to the 6-DOF model, this 3-DOF model speeds up the assessment of different reward functions and hyper-parameter settings [21]. The UAV is equipped with four groups of control vanes as the main control surfaces, each consisting of three movable vanes. In addition, four groups of fixed vanes are situated above the main control surfaces, intended to balance the anti-torque generated by the rotor, with each fixed group consisting of two fixed vanes (see Figure 1). The four groups of control vanes are numbered 1, 2, 3, and 4 and are employed to change the attitude of the UAV (see Figure 2). Each group of three movable vanes, controlled by a single servo, deflects by the same angle. We use $\delta_1$, $\delta_2$, $\delta_3$, and $\delta_4$ to represent the deflections of groups 1, 2, 3, and 4, respectively.
2.1. Three-DOF Dynamics
For the ducted-fan VTOL UAV, two right-handed coordinate systems are applied to describe the states of the aircraft (see Figure 3). The inertial frame axes are denoted as {$O_i$: $x_i$, $y_i$, $z_i$} and the body frame axes are denoted as {$O_b$: $x_b$, $y_b$, $z_b$}, with the origin located at the center of mass. Throughout this section, the superscripts $(i)$ and $(b)$ are utilized to specify whether a variable is formulated in the inertial or body frame. The position of the aircraft in the inertial frame is described by $\mathbf{p}^i$, and the velocity of the aircraft in the body frame is defined as $\mathbf{v}^b$. The Euler angle vector (i.e., roll, pitch, and yaw) is described by $\boldsymbol{\Theta} = [\phi, \theta, \psi]^T$, and the angular velocity vector with respect to the body frame is denoted by $\boldsymbol{\omega}^b$. It should be noted that, based on the above definitions, the pitch angle is $\theta = 0^\circ$ at the level flight condition and $\theta = 90^\circ$ at the landing and hover conditions (see Figure 4). Thus, the longitudinal dynamics of the aircraft are derived; a sketch of their standard form is given below.
In these dynamics, g is the gravitational acceleration, $I_y$ is the moment of inertia about the pitch axis, and $\alpha$ represents the angle of attack. All the forces and moments are discussed below.
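A sketch of a standard 3-DOF longitudinal model consistent with the definitions above (the exact signs and terms depend on the frame conventions of Figure 3, so this is illustrative rather than the paper's exact equations):

$$
\begin{aligned}
\dot{x} &= u\cos\theta + w\sin\theta, & \dot{h} &= u\sin\theta - w\cos\theta,\\
\dot{u} &= -q\,w - g\sin\theta + F_x/m, & \dot{w} &= q\,u + g\cos\theta + F_z/m,\\
\dot{\theta} &= q, & \dot{q} &= M_y/I_y,
\end{aligned}
$$

where $u$ and $w$ are the body-frame velocity components, $q$ is the pitch rate, and $F_x$, $F_z$, and $M_y$ collect the rotor thrust, aerodynamic forces, momentum drag, and control-vane contributions discussed in Sections 2.2–2.5.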
2.2. Rotor
In this subsection, the rotor model is discussed based on basic momentum theory and blade element theory [22]. Considering the airspeed along the body $x_b$ axis and the configuration of the blades, the airflow through the rotor and the thrust of the rotor can be expressed as in [23], where $\mathbf{v}^b$ is the velocity in the body frame, $\rho$ is the air density, $\Omega$ represents the angular velocity of the rotor, r is the radius of the rotor, $\theta_{tw}$ is the twist of the blades, $a$ is the rotor lift curve slope, b is the number of blades, and $c$ is the chord of the rotor blade; the induced velocity $v_i$ and the far-field velocity $v_\infty$ are related through momentum theory. The expressions for T and $v_i$ can be solved iteratively from Equations (2)–(5) using the Newton–Raphson method, as sketched below.
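Since Equations (2)–(5) are not reproduced here, the following Python sketch illustrates the kind of iteration involved, using generic momentum-theory and blade-element expressions; all parameter values and formulas are placeholder assumptions rather than the paper's exact equations.

```python
import numpy as np

# Placeholder rotor parameters (assumed values, not from the paper)
rho = 1.225   # air density [kg/m^3]
R   = 0.25    # rotor radius [m]
A   = np.pi * R**2
a   = 5.7     # blade lift curve slope [1/rad]
b   = 2       # number of blades
c   = 0.03    # blade chord [m]
th0 = 0.3     # representative blade pitch [rad]

def thrust_blade_element(v_i, V_ax, omega):
    """Simplified blade-element thrust with inflow ratio lam = (V_ax + v_i)/(omega*R)."""
    lam = (V_ax + v_i) / (omega * R)
    return 0.25 * rho * a * b * c * omega**2 * R**3 * (2.0 / 3.0 * th0 - lam)

def thrust_momentum(v_i, V_ax):
    """Momentum-theory thrust in axial flight: T = 2*rho*A*v_i*(V_ax + v_i)."""
    return 2.0 * rho * A * v_i * (V_ax + v_i)

def solve_rotor(V_ax, omega, v0=5.0, tol=1e-8, max_iter=50):
    """Newton-Raphson on f(v_i) = T_BE(v_i) - T_MT(v_i), numerical derivative."""
    v_i = v0
    for _ in range(max_iter):
        f = thrust_blade_element(v_i, V_ax, omega) - thrust_momentum(v_i, V_ax)
        h = 1e-6
        df = (thrust_blade_element(v_i + h, V_ax, omega)
              - thrust_momentum(v_i + h, V_ax) - f) / h
        step = f / df
        v_i -= step
        if abs(step) < tol:
            break
    return thrust_blade_element(v_i, V_ax, omega), v_i

T, v_i = solve_rotor(V_ax=0.0, omega=800.0)  # e.g., hover at 800 rad/s
```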
2.3. Aerodynamics Model
The aerodynamic forces and moments include the lift L, the drag D, and the pitching moment $M_a$. They depend primarily on the wings, with additional contributions from the fuselage and duct. The angle of attack $\alpha$, which has a significant impact on the aerodynamic model, can be expressed as $\alpha = \arctan(w/u)$.
The ducted-fan UAV can be regarded as a single lifting body comprising the fuselage, wings, and duct [15]. This simplification introduces some inaccuracies in the aerodynamic data. To compensate for these modeling errors, the lift and drag coefficients are multiplied by a perturbation factor sampled from a uniform distribution between 0.8 and 1.2.
During the transition mode, high-angle-of-attack conditions can induce wing stall, leading to a significant reduction in lift. Traditional linear aerodynamic coefficient models are insufficient to capture this behavior accurately. As demonstrated in [24], an advanced aerodynamic model can be formulated that integrates both the linear lift model and the effects of wing stall; one common formulation of this kind is sketched below.
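A widely used formulation of this kind (a sketch consistent with models such as [24]; the blending parameters $M$ and $\alpha_0$ are assumptions) blends a linear lift curve with a flat-plate stall model through a sigmoid function:

$$
C_L(\alpha) = \big(1-\sigma(\alpha)\big)\left(C_{L_0} + C_{L_\alpha}\,\alpha\right) + \sigma(\alpha)\,\big(2\,\mathrm{sign}(\alpha)\sin^2\!\alpha\cos\alpha\big),
$$

$$
\sigma(\alpha) = \frac{1 + e^{-M(\alpha-\alpha_0)} + e^{M(\alpha+\alpha_0)}}{\left(1 + e^{-M(\alpha-\alpha_0)}\right)\left(1 + e^{M(\alpha+\alpha_0)}\right)},
$$

where $\alpha_0$ is the stall angle of attack and $M$ controls the sharpness of the blend between the linear and flat-plate regimes.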
The resulting lift and drag coefficients are shown in Figure 5. The pitching moment coefficient is expressed by the linear model $C_m(\alpha) = C_{m_0} + C_{m_\alpha}\,\alpha$.
Thus, the lift L, drag D, and pitching moment $M_a$ can be written as

$$
L = \tfrac{1}{2}\rho V^2 S\,C_L, \qquad D = \tfrac{1}{2}\rho V^2 S\,C_D, \qquad M_a = \tfrac{1}{2}\rho V^2 S\,\bar{c}\,C_m,
$$

where V is the airspeed of the UAV, S is the wing area, and $\bar{c}$ is the mean aerodynamic chord.
2.4. Momentum Drag
Due to the existence of crosswinds, the duct must generate a force to align the incoming airflow with its orientation. This results in a reaction force known as momentum drag. Moreover, crosswinds lead to the formation of a region of higher velocity over the near edge of the duct, as the surrounding air is pulled into the duct by the rotor. The increased lift on this edge produces a moment that turns the vehicle away from the crosswind, referred to as the momentum moment [22]. The momentum drag and moment are functions of the crosswind component w (the $z_b$-axis velocity in the body frame) and the induced velocity $v_i$ determined by Equation (5); a sketch of their form is given below.
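A common momentum-theory approximation (a sketch under the assumption that the mass flow through the duct is $\dot m = \rho A v_i$, with $A$ the duct inlet area and $l_m$ a hypothetical moment arm):

$$
D_m = \dot m\,w = \rho A\,v_i\,w, \qquad M_m = D_m\,l_m,
$$

so that the momentum drag scales with the ingested mass flow times the crosswind component.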
2.5. Control Vanes
Each control vane can be modeled as an airfoil. In [3,14,15], the lift slope coefficient of the vane is assumed to be constant. However, our computational fluid dynamics simulations have shown that the lift coefficient of the vane remains virtually unchanged (i.e., it saturates rather than growing linearly) at large angles of attack. Thus, based on the simple model in [25], the lift slope coefficient of the vane can be expressed accordingly; it enters the vane force model sketched at the end of this subsection.
The dynamic pressure on each vane is determined by the rotor-wake flow, where u is the $x_b$-axis velocity in the body frame. Based on the control allocation method [15], the equivalent vane deflection for pitch (about the $y_b$ axis) is formed from the group deflections $\delta_1$–$\delta_4$ (see the sketch below).
The drag forces of the vanes can be neglected [15], and the vane's angle of attack depends only on the vane's deflection [3]. Thus, the force and moment generated by the control vanes can be written in terms of the vane area $S_v$ and the pitch moment arm $l_v$, as sketched below.
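A minimal sketch of such a vane model, assuming the vanes sit in the rotor wake and that groups 1 and 3 govern pitch (the symbols $q_v$, $c_{L_\delta}$, and $\delta_p$ denote the assumed dynamic pressure, vane lift slope, and equivalent pitch deflection):

$$
q_v = \tfrac{1}{2}\rho\,(u + v_i)^2, \qquad \delta_p = \tfrac{1}{2}\,(\delta_1 + \delta_3),
$$

$$
F_v = q_v\,S_v\,c_{L_\delta}\,\delta_p, \qquad M_v = F_v\,l_v .
$$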
4. Simulation and Results
In this section, we present the experimental results. In our experiments, the controller update and sensor data download frequency was set to 50 Hz. The initial flight mode was set to level flight, with the initial angle of attack sampled from a uniform distribution over a fixed range. The initial pitch angle was set equal to the angle of attack, the initial height was set to 50 m, and the corresponding horizontal velocity was determined from the level-flight trim condition sketched below.
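Assuming the standard lift-weight balance at trim (a reconstruction; the exact equation used in the paper may include additional terms such as the thrust projection):

$$
\tfrac{1}{2}\rho V^2 S\,C_L(\alpha) = mg \quad\Longrightarrow\quad V = \sqrt{\frac{2mg}{\rho S\,C_L(\alpha)}}.
$$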
4.1. Comparing CPO and TRPO with Fixed Penalty
To demonstrate the superiority of safe reinforcement learning over conventional reinforcement learning algorithms with a fixed penalty, we conducted three different sets of experiments, named CPO, TRPO with penalty 1, and TRPO with penalty 5. The experimental results are shown in Figure 7.
During training, we took into account the randomization of the initial state and the perturbation of the aerodynamic coefficients. However, in evaluating the transition performance of CPO and TRPO with penalty, we did not consider these factors. Three sets of experiments with the same initial state (angle of attack of 6°) are shown in Figure 8 for the sake of comparison, where the arrows represent the terminal state.
In Figure 7 and Figure 8, we can observe that the trajectory corresponding to TRPO with penalty 1 converges to a locally optimal continuous ascending path, resulting in a fast transition but significant altitude loss. In contrast, the trajectory corresponding to TRPO with penalty 5 consistently satisfies the constraints, but its terminal angle and speed exceed the predefined range, ultimately failing to complete the task. The CPO algorithm, however, effectively meets the constraints while accomplishing the transition with a minimal altitude loss of 0.1 m, which complies with the requirements of a neat transition.
4.2. Comparing CPO and PPOLag
In the domain of safe reinforcement learning, numerous algorithms can address constraint-related issues. For ease of code development, we chose to compare the CPO algorithm with Proximal Policy Optimization with Lagrangian, also called PPOLag (see Figure 9).
PPOLag is a variant of the Proximal Policy Optimization (PPO) algorithm that employs a Lagrangian relaxation approach to handle constraints. Constraints are integrated into the objective function via a penalty term weighted by a Lagrange multiplier. The multiplier is updated during training based on the discrepancy between the current constraint value and its cost limit, and serves as a weighting factor to balance the trade-off between the reward function and the constraints.
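A minimal sketch of this multiplier update (a dual gradient ascent step; the learning rate and names are illustrative, not the exact implementation used here):

```python
def update_lagrange_multiplier(lam, episode_cost, cost_limit, lr=0.05):
    """Dual ascent: increase lam when costs exceed the limit, decay it otherwise."""
    lam += lr * (episode_cost - cost_limit)
    return max(0.0, lam)  # the multiplier must stay non-negative

# The policy is then trained on an objective of the form
#   reward_advantage - lam * cost_advantage.
lam = 0.0
for episode_cost in [12.0, 9.0, 11.5]:   # placeholder constraint returns
    lam = update_lagrange_multiplier(lam, episode_cost, cost_limit=10.0)
```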
PPOLag is simpler in principle and easier to implement. However, as shown in Figure 9b, this method may oscillate near the cost limit of the constraint, leading to poor performance of the agent (as shown in Figure 9a). Consequently, although PPOLag can complete the transition, its performance is inferior to that of CPO: the best-trained PPOLag model (at an angle of attack of 6°) has a terminal height loss of 1.2 m and a transition time of 7.84 s, whereas CPO has a terminal height loss of 0.1 m and a transition time of 6.94 s.
4.3. Comparison with GPOPS-II
According to the problem formulation in Section 3.1, this problem can also be cast as an optimal control problem. We therefore employ GPOPS-II to compute the optimal trajectory. In this case, the control variables for GPOPS-II are the same as in our approach, namely, the angular velocity of the rotor and the elevator deflection. The experimental results without perturbations, at a fixed initial angle of attack, are compared in Figure 10 and Figure 11. The cost function of GPOPS-II combines the transition time $t_f$ and the terminal height loss $\Delta h$; a sketch of its form is given below.
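A plausible form, assuming a weighted sum with placeholder weights $w_1$ and $w_2$:

$$
J = w_1\,t_f + w_2\,\lvert \Delta h \rvert .
$$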
The height loss of GPOPS-II is 0.01 m with a transition time of 5.21 s, while the height loss of CPO is 0.103 m with a transition time of 5.62 s. As can be seen from the figures, the performance of the CPO algorithm closely approximates that of GPOPS-II. However, GPOPS-II has two main drawbacks. Firstly, as a model-based approach, it provides only an optimal feed-forward control input, which is an idealized solution; when the dynamics model of the UAV changes, the problem must be re-solved for the altered model. In contrast, the CPO algorithm is highly robust and can still perform the transition task with high performance despite modeling errors and wind disturbance, as discussed in Section 4.4 and Section 4.5. Secondly, as the micro-controllers onboard aerial robots generally have limited computational power, the GPOPS-II optimization can only be executed offline, whereas RL algorithms such as CPO can solve the transition problem online once a policy model has been trained, saving a great deal of computational resources.
4.4. Robustness Validation
During the training process, the randomness of the UAV is determined only by the randomness of the initial state and the aerodynamic parameters. Therefore, to assess the robustness of the system, we randomized various factors, including the mass, the moment of inertia, and the aerodynamic lift and drag parameters (over a larger range), using a uniform distribution. Furthermore, we took sensor noise into account, because measurement errors in the height can lead to a significant decline in the system's performance, especially in the terminal height loss. To validate robustness to sensor noise, Gaussian noise was introduced into the height measurements. It is important to note that these perturbations were not considered during training, meaning that the UAV had not been exposed to them before.
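A sketch of how such evaluation-time perturbations might be injected (the ranges and the noise standard deviation are illustrative assumptions, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng()

def perturb_model(params):
    """Randomize mass, inertia, and aero coefficients for one evaluation run."""
    out = dict(params)
    out["m"]   *= rng.uniform(0.8, 1.2)
    out["I_y"] *= rng.uniform(0.8, 1.2)
    out["C_L"] *= rng.uniform(0.7, 1.3)   # wider range than during training
    out["C_D"] *= rng.uniform(0.7, 1.3)
    return out

def noisy_height(h, sigma=0.1):
    """Gaussian measurement noise on the height channel."""
    return h + rng.normal(0.0, sigma)
```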
To enable a fair comparison, we only randomized the initial state and tested each perturbation individually. Specifically, we conducted 500 experiments for each perturbation, while maintaining the same transition success condition as in Section 3.1. We report the average performance of the UAV across all experiments (see Table 1). The trajectories under each disturbance group (50 in total) are also provided; we use the pitch angle and velocity of the UAV during the transition process to draw 2D plane curves (see Figure 12, Figure 13, Figure 14 and Figure 15), where the different colors simply indicate different trajectories. From the figures, we find that the UAV can still complete the transition task with excellent performance despite the changes in the UAV model parameters. At the same time, we observe that even when the UAV does not transition successfully, its terminal velocity and pitch angle mostly remain close to our desired range.
4.5. Transition under Wind Disturbance
In this section, we aim to verify the wind disturbance rejection ability of our method. In RL-based robotic control design, the sim-to-real gap is a challenging problem because there are inevitable mismatches between the simulator setting and the real-world setting in the UAV control problem. This is due not only to unknown perturbations of the UAV parameters but also to adversarial disturbances such as wind in the real world. To account for wind disturbance, we introduce two different wind scenarios: constant-magnitude wind (ranging from −5 m/s to 5 m/s) and sinusoidal wind (with an amplitude of 5 m/s and time periods of 2 s, 3 s, 4 s, and 5 s), both along the horizontal direction, i.e., the $x_i$ axis. The dynamics of ducted-fan fixed-wing UAVs in the presence of wind are described in detail in [3,14]; we therefore recreate the dynamic equations of the UAV using their modeling of wind.
To address the wind disturbance, we employ domain randomization [31] and retrain the RL agent under three different conditions: the two wind scenarios mentioned above and a no-wind scenario. Domain randomization is an approach for overcoming the sim-to-real gap by randomizing the simulator environment during the training of the RL control policy. By adopting the domain randomization approach, we aim to retrain a UAV controller that is robust against wind disturbance, as sketched below.
The trajectory comparison among the constant-magnitude wind (5 m/s), sinusoidal wind (time period 2 s), and no-wind conditions is shown in Figure 16, where the arrows represent the terminal state and all runs are initialized from the same angle of attack.
From Figure 8 and Figure 16, we can see that, compared to the controller trained without domain randomization, the terminal height loss of the drone retrained with domain randomization remains nearly the same, while the time taken increases by 1.52 s. This phenomenon can be understood intuitively as follows: the RL algorithm selects actions with the highest expected returns across the various wind disturbance scenarios, rather than actions that achieve high returns in windless environments but may fail to complete the task in windy conditions. Thus, a certain degree of performance loss is reasonable.
To better evaluate the resistance of our method to wind disturbances, we ran the UAV in 100 experiments under the three conditions and from different initial states, in addition to considering the perturbation of the aerodynamic coefficients during the transition (a uniform distribution between 0.8 and 1.2). The experimental results are shown in Figure 17 (success rate 90%), where the different colors simply indicate different trajectories.
In Figure 17, we observe that in most cases the UAV is able to withstand wind disturbances and complete the transition task. However, for those trajectories that do not satisfy the terminal constraint in Section 3.1, we find that the trajectories generally terminate with an attitude close to vertical hover ($\theta \approx 90^\circ$) and a small velocity. This suggests that the UAV is less resistant to interference in the hovering state. In conventional approaches, a common method is to switch to a hover controller when the UAV transitions from level flight to near-hover; wind disturbance can then be resisted by switching between the two controllers and enhancing the hover controller's disturbance rejection. In RL, resistance to wind disturbance while hovering should be treated as a separate task, and the problem should be addressed using multi-task RL or meta-learning, which will be the focus of our future work.
5. Conclusions
In this study, we have developed a safe reinforcement learning-based approach for neat transition control during the back-transition process of ducted-fan fixed-wing UAVs. By constructing a three-degree-of-freedom longitudinal model and implementing the CPO algorithm, our method effectively addresses the challenge of integrating trajectory optimization and control. By comparison, we found that introducing a velocity constraint leads to better performance than adding a penalty to the reward. Furthermore, our method closely matches the performance of GPOPS-II without requiring prior knowledge. Additionally, we confirmed the robustness of the CPO algorithm and found that even when the transition was not successful, the terminal conditions remained close to our desired terminal range. Future research directions include enabling the UAV to complete multiple tasks (i.e., from hover to level flight, hover under wind disturbance, and level flight), ensuring robustness against wind disturbance, and validating the approach in the real world.