1. Introduction
Uncrewed Aerial Vehicles (UAVs), with their strong mobility and high flexibility, offer a cost-effective option for many tasks, such as agricultural inspections [1] and the exploration of hazardous areas [2], enhancing task accessibility and efficiency. Nowadays, the rapid advancement of Artificial Intelligence (AI) and a corresponding exponential increase in computing power have unlocked new possibilities [3]. This synergy enables UAVs equipped with AI capabilities to perform complex tasks, such as real-time path planning and obstacle avoidance, more adeptly than traditional models reliant on pre-programmed algorithms [4,5]. Unlike their predecessors, these AI-driven UAVs can improve over time with extended training and repeated trials, acquiring a generalized capability to navigate safely and efficiently, based on sensor data, even in unknown or dynamic environments. This innovative direction not only expands the range of tasks that UAVs can undertake but also enhances their safety, opening up a new dimension in various sectors of human activity by leveraging the evolving capabilities of intelligent systems.
As a pivotal branch of AI, reinforcement learning (RL) represents a distinct approach [6], focusing on how agents should take actions in an environment to maximize the cumulative reward. Unlike other machine learning methods, such as classification models [7,8] or clustering models [9,10], which typically learn from a fixed dataset, RL is about learning from interaction with an environment: making decisions, evaluating the outcomes, and learning from successes and failures.
RL has been widely integrated into the field of UAV autonomous navigation, including missions such as target tracking [11], swarm control [12], and collision avoidance [13,14]. Different RL methods are suitable for different tasks. Q-Learning [15] and Deep Q-Networks (DQNs) [16] are two classic value-based RL methods; they are appropriate for tasks with discrete action spaces [17,18]. In contrast, policy-based methods [19] directly learn the policy that maps states to actions without requiring a value function, which is particularly effective for problems with high-dimensional or continuous action spaces. However, policy-based methods may encounter stability and convergence issues, hindering the optimization process and potentially resulting in suboptimal policies. Combining the strengths of value-based and policy-based RL, actor–critic methods have been proposed, which incorporate two models: one for the policy (actor) and another for the value function (critic). These are suitable for complex tasks that benefit from the stability of value-based methods and the flexibility of policy-based approaches.
Within the actor–critic framework, neural networks are typically employed for learning and decision-making; such approaches are referred to as deep reinforcement learning (DRL). Among various approaches, the Deep Deterministic Policy Gradient (DDPG) method employs experience replay and target networks [20] and can effectively learn a deterministic policy. In [21], Li et al. introduced a vision-based DDPG algorithm to enable UAVs to perform obstacle avoidance and destination tracking. This algorithm empowers the RL agent to approximately assess the sizes of and distances between obstacles, facilitating the generation of a collision-free path in real time. However, overestimated Q-values during the learning process might hinder the ability to find the optimal policy. Twin Delayed DDPG (TD3) [22] improves on this issue by employing two separate Q-networks and using the smaller Q-value to calculate gradients. Additionally, TD3 typically updates its policy only after the value function has been updated several times; this delayed update strategy helps prevent inaccuracies in value function estimation, thereby increasing the stability of the algorithm. Our previous work [23] demonstrated that an improved TD3 method enabled both single-UAV and multi-UAV settings to avoid obstacles and reach destinations in evolving contexts.
In addition, it is important to note that deterministic policy-based methods may lose effectiveness in cases filled with uncertainties. The Proximal Policy Optimization (PPO) method [24] introduces randomness into action selection by computing a probability distribution, promoting exploration and robustness in unpredictable environments. This relatively simple setting makes PPO more suitable for discrete action space problems [25], as the balance between exploration (trying new actions) and exploitation (utilizing known actions) is generally easier to achieve. In continuous spaces, gradient estimation can become more challenging due to the continuity and high dimensionality of actions. The Soft Actor–Critic (SAC) method is specifically designed for such cases, employing an entropy regularization process that improves the robustness against noise and optimizes the randomness (entropy) of the policy. Moreover, high sample efficiency is essential in real-world UAV applications, where data collection can be time-consuming and costly. SAC can learn effective policies with fewer interactions and directly outputs the parameters of the action's probability distribution in the policy network [26], which leads to smoother and more adaptable control in autonomous navigation.
Although DRL holds significant advantages in tackling complex decision-making and control problems, such as its capacity for autonomous learning and ongoing performance enhancement, processing of high-dimensional data, and adaptation to changes in the environment, its algorithmic design inevitably leads to several challenges. These include low sample efficiency [27] (as DRL typically requires a large amount of data samples to learn effective policies, which can be a limiting factor in real-world applications), issues with training stability and convergence [28], limited generalization capability [29], and high computational resource demands [30]. To address these issues, recent research has explored the integration of other methods to offset these weaknesses. For example, ref. [31] proposes a combination of DRL with Adaptive Control (AC) for a quadrotor landing task. A DRL strategy was selected for the outer loop to maintain optimality for the dynamics, while AC was utilized in the inner loop for real-time adjustment of the closed-loop dynamics towards a stable trajectory delineated by the RL model. In [32], the authors utilized the DDPG method to automate PID system tuning, simplifying mobile robot control by eliminating the manual adjustment of multiple parameters; the deep neural network (NN) based agent also removes the need for state discretization. Furthermore, ref. [33] used a novel Convolutional Neural Network (CNN) approach to train the object recognition capabilities of the onboard RGB-D (red-green-blue-depth) camera of a UAV and incorporated the results and relative distances as inputs to a Markov Decision Process (MDP) based system, achieving effective obstacle avoidance and path planning.
The aforementioned integrated methods capitalize on the advantages of DRL while mitigating its limitations, enhancing system performance. However, these auxiliary methods still have their own inherent drawbacks. For instance, adaptive control systems, despite their capability for online parameter estimation and adjustment, require precise mathematical models, which are impractical to construct for missions in unknown environments. Similarly, while simple and effective, PID control may lack the flexibility to handle nonlinear systems or adapt to complex dynamics. Other machine learning algorithms, such as supervised learning, typically require extensive data for training and may not be directly applicable to real-time control challenges.
In summary, to achieve dynamic path planning for UAVs in uncharted scenarios, the learning and self-improvement capabilities of DRL are essential, but they must be complemented and augmented by another effective method, thereby creating a synergy that exceeds the sum of its parts. After an in-depth process of literature review, simulation, and analysis, the fuzzy inference system (FIS) was identified as an excellent option. Unlike the methods discussed above, fuzzy logic offers an intuitive, flexible, and computationally efficient solution [34]. Particularly when integrated with DRL, fuzzy logic can provide auxiliary decision-making support, reduce unnecessary exploration, accelerate the learning process, and enhance the system's transparency and interpretability. Further details will be elaborated on in the next section.
The remainder of this article is organized as follows: Section 2 outlines the formulation of the problem, reviews recent related literature, and highlights the main contributions. Section 3 elaborates on the proposed hybrid control framework and the other associated algorithms involved in the system; additionally, it presents the model design for the UAVs and the onboard sensors. Section 4 illustrates the numerical and simulation results, including a comparison between the performance of the proposed method and an RL-only approach; an analysis of the feasibility of the control system applied to physical drones is also included. Lastly, discussions and concluding remarks are provided in Section 5.
2. Problem Definition
This paper explores the challenges of dynamic settings through tracking cases of multiple mobile intruder aircraft. The focus is on developing a versatile control framework capable of processing high-dimensional data to accurately navigate toward and intercept moving targets. It utilizes a simulated three-dimensional environment, incorporating flight dynamics, sensor capabilities, and factors such as wind and gravity, to provide a rigorous testing ground for the methodologies.
The complexity of this problem is multi-fold: firstly, the intruder aerial platforms exhibit varied speed and movement patterns, necessitating adaptive tracking capabilities. Secondly, the presence of multiple intruder aircraft introduces additional challenges for an RL problem. Also, each nontargeted platform becomes a potential mobile obstacle, complicating the tracking process. Lastly, the operation is time-constrained, emphasizing the need for efficiency and prompt action.
Improving UAV dynamic tracking has significant practical implications. Enhanced techniques enable drones to autonomously detect and respond to unauthorized activities over large areas, revolutionizing security and surveillance and reducing the need for human intervention [35]. Additionally, in search-and-rescue missions [36], AI-driven UAVs with advanced navigation systems can drastically reduce the time to locate individuals in danger, potentially saving lives.
This study seeks to advance the field of UAV safe autonomous navigation, offering insights and solutions for real-world applications. The characteristics of fuzzy logic-based approaches will be discussed in the following content, and the main contributions of this paper will also be detailed in this section.
2.1. Related Prior Work
Fuzzy logic (FL), as a soft computing method, diverges from conventional binary logic, which strictly categorizes statements as true or false, by introducing a more nuanced approach that handles imprecision and accommodates degrees of truth. Such a capability is practical in problems where information is ambiguous or incomplete. As a result, fuzzy logic finds extensive applications across multiple domains. It is instrumental in developing control systems that require adaptive response mechanisms [37], enhancing decision-making processes by accounting for uncertainties [38], and improving pattern recognition techniques [39] by allowing for flexible criteria.
Within the domain of path planning and obstacle avoidance, algorithms developed based on fuzzy logic exploit input variables from sensors or data from machine learning models to generate continuous outputs. Hu et al. [40] proposed a fuzzy multi-objective path-planning method based on distributed predictive control for cooperatively searching for and tracking ground-moving targets with a UAV swarm in urban environments. This approach integrates an extended Kalman filter and probability estimation to predict the states of the ground targets, with path planning that accounts for obstructed views and energy consumption; fuzzy logic is then employed to prioritize objectives based on their importance levels. Berisha et al. [39] demonstrated a fuzzy controller that utilizes readings from a stereo camera and range finders as input and generates two control commands for the robot's wheels, enabling flexible collision avoidance, as their controller can react dynamically in testing. Our team introduced a Fuzzy-Kinodynamic RRT method [41], which employs the traditional Rapidly-exploring Random Tree algorithm for global path planning and presents a set of heuristic fuzzy rules for de-confliction in 3D spaces. The velocity obstacle (VO) approaches proposed in [42,43], which are designed especially for evading mobile intruders, offer a constructive foundation for a potential combination with fuzzy logic in the future. These algorithms calculate a VO cone for each moving object, where the cone's direction and width depend on the obstacle's position, velocity, and relative velocity with respect to the ownship UAV. A FIS can subsequently be employed to ascertain the optimal action (e.g., selecting an appropriate speed vector that does not intersect any cone) and facilitate evasion based on the current state of the UAV and the environmental conditions, ensuring the ownship avoids all identified intruders safely.
From these applications, it can be seen that fuzzy logic describes and handles uncertainties and vagueness using concepts from natural language [44], such as "large", "medium", and "small". This approach closely mirrors how people evaluate situations and make decisions in daily life. Fuzzy logic also enhances system robustness in the face of input data variations or noise by using partial truth values and fuzzy sets. Additionally, FL does not require precise mathematical models, significantly simplifying problem descriptions. However, there are challenges in using FL, such as the difficulty of establishing accurate rules for problems that require precise control or for which relevant experience is lacking. The process of designing membership functions and rules inevitably involves a degree of subjectivity, leading to potential inconsistencies and imperfections in performance. Fortunately, DRL methods can largely mitigate these issues. For instance, in our scenario, since the environment is unknown, it is insufficient to navigate a UAV to the target while avoiding potential obstacles by merely establishing a fuzzy logic system; the integration of DRL fills this gap effectively, enabling the drone to learn and find the optimal path gradually and autonomously. Conversely, the drone does not need to start learning from scratch: we can impart some basic movement or de-confliction rules to it through a fuzzy inference system, helping it avoid taking incorrect or illogical actions during training.
2.2. Contributions
Based on the discussions, we developed a hybrid control scheme that utilizes DRL and FIS. By leveraging the capability of DRL to perform beyond expectations after training and the ability of fuzzy logic to incorporate specific expert experience and human cognition, the integration of these two methods allows for mutual enhancement and compensation. The main contributions are summarized as follows:
A hybrid intelligent controller is developed for a six-degree-of-freedom underactuated quadrotor to autonomously navigate and intercept multiple intruder aircraft in three-dimensional spaces within a limited time. This approach utilizes innovative learning-based interaction mechanisms between SAC and FIS to improve learning efficiency and enable real-time path planning in unpredictable environments under a model-free context.
An innovative target selection algorithm and a refined approach to handling the observation space are established to tackle the RL multi-target challenge. The algorithm effectively mitigates the exponential growth of state–action pairs and prevents reward signal interference by focusing on one target at a time, thereby addressing a common cause of failure in RL multi-target problems: the 'Agent Confusion' phenomenon caused by task repetition.
A creative reset function is developed to enhance the generalization capabilities of the trained agent. This function regenerates the states of the ownship and the intruders at the start of each training episode. It allows the agent to adapt to new scenarios without re-training, increasing utility and effectiveness.
A practical reward function is designed, with exponential functions introduced in each component to address the challenge of sparse rewards in multi-target RL cases. This approach ensures high sensitivity to slight state changes and provides proportional adjustment of each component based on the importance of different objectives (e.g., target tracking and speed control).
A method involving nonuniform discretization of the observation space is applied to reduce the dimensionality of the state space and improve training efficiency. For each observation variable, larger-step discretization is used in ranges where sensitivity to state changes is low, while more precise discretization is applied in sensitive ranges, reducing training time and minimizing the risk of failing to converge.
3. Methodology
The use cases in this study involve scenarios with one ownship UAV and multiple intruder aircraft, where the objective of the ownship is to intercept each intruder while avoiding collisions during the flights.
Figure 1 illustrates the architecture of the proposed hybrid control system. The system leverages both DRL and fuzzy logic models to propel the ownship, processing its state information at each timestep for the observation space and reward functions. Concurrently, intruder aircraft follow distinct, predefined trajectories using PID controllers, outputting state data on position and velocities.
To address the multiple dynamic target interception problem, a suite of additional algorithms was developed in the SAC-FIS framework, which encompasses a target selection algorithm, methods to assess successful target interception, criteria for episode termination, and a reset function, among others. The rest of this section will provide a comprehensive breakdown of each component within this system.
3.1. The RL Agent Design
For the proposed case, the state–action relationship is multifaceted and unpredictable. The RL agent’s neural network design, which includes layers of varying densities and specific branch structures to compute mean and standard deviation (SD) separately, effectively encodes the uncertainties. Moreover, a concatenationLayer is utilized within the critic network to merge action and state information, enhancing the understanding of their dependencies and improving value estimations. Key training parameters (e.g., learning rate, entropy regularization coefficient, and optimization algorithm) have been fine-tuned through extensive testing, refining the trade-off between exploration and exploitation and enhancing adaptability. The specific network design is as follows:
The critic networks are defined as $Q_{\theta_1}(s, a)$ and $Q_{\theta_2}(s, a)$, where each network includes layers for the state (s) and action (a) paths:
State Path (Spath) Layers:
Action Path (Apath) Layers:
Concatenation and Subsequent Layers:
The actor network is defined as $\pi_{\phi}(a \mid s)$, where $\phi$ parameterizes the following layers:
Based on the described neural network designs, the actor network processes the input state s to generate the parameters of the action distribution, specifically the mean and standard deviation. This enables the agent to select actions in a continuous action space. Meanwhile, the critic networks, parameterized by $\theta_1$ and $\theta_2$, estimate the value of action–state pairs, guiding the agent towards actions that maximize the expected return (an illustrative sketch of this structure follows the parameter list below). Prior to the commencement of the training process, it is imperative to initialize the following parameters.
Initial entropy coefficient:
Target entropy:
Length of the replay buffer B: transitions.
Learning rates:
Discount factor:
Target smoothing coefficient:
Mini-batch size: 256
Number of warm start steps: 1000
Optimizer algorithm for both actor and critics: adam
Gradient threshold: 1
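Since the exact layer sizes are not reproduced above, the following minimal PyTorch sketch only illustrates the described structure: separate state and action paths merged by concatenation in the critic, and separate mean and standard-deviation branches in the actor. The class names, layer widths, and activation choices are assumptions for illustration, not the configuration used in this study (which relies on MATLAB layer objects such as concatenationLayer).

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a): separate state and action paths merged by concatenation."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.state_path = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.action_path = nn.Sequential(nn.Linear(action_dim, hidden), nn.ReLU())
        # Layers applied after the concatenation of the two paths.
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, state, action):
        z = torch.cat([self.state_path(state), self.action_path(action)], dim=-1)
        return self.head(z)

class Actor(nn.Module):
    """pi(a|s): shared trunk with separate mean and standard-deviation branches."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, action_dim)      # mean branch
        self.log_std = nn.Linear(hidden, action_dim)   # standard-deviation branch

    def forward(self, state):
        h = self.trunk(state)
        return self.mean(h), self.log_std(h).clamp(-20, 2).exp()
```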
A general training procedure for the proposed SAC-FIS method is delineated in Algorithm 1. Details regarding the specific control tasks raised in this paper, including the RL environment and the FIS, are discussed in the following subsections.
3.2. RL Environment Design
In this segment, the UAV and sensor model, along with the functions of the RL environment and other related algorithm designs, will be detailed.
3.2.1. The UAV and Sensor Model Design
The UAV platform selected in this paper for both the ownship and the intruder aircraft is the "+" type quadrotor, assumed to be center-symmetric. Although this study employs a model-free SAC method, meaning the algorithm learns the optimal policy through interaction with the environment without relying on a predefined dynamic model of the environment, understanding the UAV model (including both dynamic and kinematic models), as referenced from [45,46], is still beneficial for effective training and control, as it provides critical physical constraints. Based on the rotation matrix, the dynamic model equations of this type of quadrotor are as shown in Equation (1). Additionally, a 3D LiDAR sensor is mounted on the ownship; its azimuth and elevation limits are bounded, with a maximum detection range of 20 m. Each intruder UAV executes reciprocating flights at a specified speed along a predefined route.
Algorithm 1 The hybrid intelligent SAC-FIS control framework.
1: Analyze the dynamic model of the controlled robot, and then determine the specific control variables (outputs) governed by the FIS based on the specific control tasks and related requirements.
2: Define input variables of the FIS that can effectively incorporate certain universal expert experiences without compromising the robot's maneuverability.
3: Define membership functions for each input variable, drawing from both extensive debugging results and thorough data analysis.
4: Establish a comprehensive set of fuzzy rules that capture expert knowledge and observed data behavior.
5: Load the FIS.
6: Initialize the actor network $\pi_\phi$ and the critic networks $Q_{\theta_1}$, $Q_{\theta_2}$ with predefined parameters.
7: Initialize the target critic networks $Q_{\bar{\theta}_1}$, $Q_{\bar{\theta}_2}$ with parameters copied from $Q_{\theta_1}$, $Q_{\theta_2}$.
8: Initialize the replay buffer B.
9: for each iteration do
10: Observe state S. {S includes both inputs and outputs of the FIS.}
11: Compute the output of the FIS, $a_{\mathrm{FIS}}$, and, in parallel, sample the action $a_{\mathrm{SAC}}$ from the SAC model using the actor network's distribution parameters.
12: Concatenate $a_{\mathrm{FIS}}$ and $a_{\mathrm{SAC}}$ to form the complete action A.
13: Execute action A in the environment; observe reward R, next state $S'$, and termination signal D.
14: Store the transition $(S, A, R, S', D)$ in B.
15: if B's size ≥ mini-batch size AND iteration > number of warm start steps then
16: Sample a mini-batch of transitions from B.
17: For each sampled transition, compute the target value $y = R + \gamma (1 - D)\big[\min_{i=1,2} Q_{\bar{\theta}_i}(S', \tilde{A}') - \alpha \log \pi_\phi(\tilde{A}' \mid S')\big]$, with $\tilde{A}' \sim \pi_\phi(\cdot \mid S')$.
18: Update the critic networks by minimizing the loss $L(\theta_i) = \mathbb{E}\big[(Q_{\theta_i}(S, A) - y)^2\big]$. {This update process allows the RL agent to incrementally understand and adapt to the logic of the FIS, collaborating with the FIS outputs to optimize the overall decision-making.}
19: Update the actor network using the policy gradient of $J(\phi) = \mathbb{E}\big[\alpha \log \pi_\phi(\tilde{A} \mid S) - \min_{i=1,2} Q_{\theta_i}(S, \tilde{A})\big]$, with $\tilde{A} \sim \pi_\phi(\cdot \mid S)$.
20: Adaptively adjust $\alpha$ towards the TargetEntropy.
21: Softly update the target networks: $\bar{\theta}_i \leftarrow \tau \theta_i + (1 - \tau)\bar{\theta}_i$.
22: end if
23: end for
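To make the interaction in steps 10-14 concrete, the following Python sketch condenses one control step of the hybrid loop. The names env, sac_policy, fis_yaw, and replay_buffer are hypothetical placeholders standing in for the environment, the SAC actor, the fuzzy inference system, and the experience buffer; they are not part of the actual implementation.

```python
import numpy as np

def hybrid_step(env, sac_policy, fis_yaw, replay_buffer, state):
    """One control step of the SAC-FIS loop (steps 10-14 of Algorithm 1), illustrative only."""
    # FIS part of the action: a yaw command computed from the three fuzzy inputs.
    yaw_cmd = fis_yaw(state["target_angle"], state["front_dist"], state["lateral_err"])
    # SAC part of the action: roll, pitch, and thrust sampled from the actor's distribution.
    roll, pitch, thrust = sac_policy.sample(state["observation"])
    # Concatenate both parts into the complete action A = [roll, pitch, yaw, thrust].
    action = np.array([roll, pitch, yaw_cmd, thrust])
    next_state, reward, done = env.step(action)
    replay_buffer.append((state, action, reward, next_state, done))
    return next_state, done
```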
As depicted in Figure 2, the axes and sign convention of the selected UAV platform are provided. The LiDAR sensor mounted at the center of the ownship is capable of providing 360-degree environmental detection around the ownship, encompassing both obstacles and intruder aircraft. The 3D point cloud generated by the LiDAR is sampled and used as the input for the FIS; the detailed methods and discussion are elaborated in Section 3.3.
where:
$\ddot{x}$, $\ddot{y}$, and $\ddot{z}$ are the accelerations in the x, y, and z directions, respectively;
T is the total thrust of the UAV;
$\phi$, $\theta$, and $\psi$ are the roll, pitch, and yaw angles, respectively;
m is the mass of the UAV;
g is the gravitational acceleration.
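Since Equation (1) is not reproduced here, the following is a commonly used form of the translational dynamics of a "+"-type quadrotor that is consistent with the variables just defined (cf. [45,46]). The exact signs depend on the adopted coordinate convention (e.g., NED versus ENU), so it should be read as a reference form rather than the study's exact equation:

```latex
\ddot{x} = \frac{T}{m}\left(\cos\phi \,\sin\theta \,\cos\psi + \sin\phi \,\sin\psi\right), \quad
\ddot{y} = \frac{T}{m}\left(\cos\phi \,\sin\theta \,\sin\psi - \sin\phi \,\cos\psi\right), \quad
\ddot{z} = \frac{T}{m}\cos\phi \,\cos\theta - g
```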
3.2.2. Action and Observation Spaces (A & S)
Considering the actual quadrotor's dynamics, the control inputs comprise Roll, Pitch, Yaw, and Total Thrust, defining a four-dimensional action space. Referring to Section 3.1, numActions is thus set to 4, which establishes the target entropy. This prevents the agent from converging to local optima by gradually reducing exploration, favoring policy optimization and stabilization. Notably, in this case, Yaw is exclusively governed by the FIS, leaving only three control variables for the RL agent. Further details regarding Yaw control are discussed in Section 3.3.
Meanwhile, in the chasing and interception problem, the observation space (S) is characterized by high dimensionality, resulting in prolonged training times and potential challenges in achieving convergence. To address this problem, we developed a target selection algorithm. Given that the positions and orientations of the ownship and intruder aircraft within the 3D environment are randomized at the beginning of each training episode, the algorithm initially selects the intruder nearest to the ownship as the target, treating the other intruders as dynamic obstacles. It then selects a new target based on proximity only after the current target has been successfully intercepted. Intercepted intruders are stopped and remain hovering at their captured locations. This approach simplifies S by considering only the parameters of the selected target and ignoring those of the other intruders. The algorithm, detailed in Appendix A.1, produces an output ranging from 1 to n, representing the nth intruder selected as the target. It updates S with the target ship's position relative to the ownship and its velocities. Upon a target update, a switch replaces these parameters with those of the new target.
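As an illustration of this selection logic (the full procedure is given in Appendix A.1), a minimal Python sketch could look as follows; it assumes positions are NumPy arrays and that the function is called at the start of an episode and again after each successful interception. The function and variable names are illustrative, not the ones used in the study.

```python
import numpy as np

def select_target(ownship_pos, intruder_pos, captured):
    """Return the 1-based index of the nearest not-yet-captured intruder.

    ownship_pos : (3,) array, intruder_pos : (n, 3) array,
    captured    : boolean list of length n (True = already intercepted).
    """
    dists = np.linalg.norm(intruder_pos - ownship_pos, axis=1)
    dists[np.asarray(captured)] = np.inf        # captured intruders are ignored
    return int(np.argmin(dists)) + 1            # 1..n, matching Appendix A.1

# Example: two intruders, the second already intercepted -> target index 1.
target = select_target(np.zeros(3),
                       np.array([[5.0, 0.0, 0.0], [2.0, 0.0, 0.0]]),
                       captured=[False, True])
```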
Furthermore, a counter (ranging from 0 to n) was introduced to represent the number of captured intruders. This element plays a crucial role in S, eliminating the 'Agent Confusion' phenomenon in most multi-target scenarios, where the agent might repetitively complete identical tasks. Overall, S can be described as follows:
These items constitute the observation vector, and each element is detailed below:
The Euclidean distance between the ownship and the selected target ship.
The ownship's Euler angles in ZYX order.
The ownship's linear velocities.
The ownship's angular velocities, represented as p, q, and r.
The coordinate differences between the ownship and the selected target ship in the X, Y, and Z directions.
The selected target ship's linear velocities.
The ownship's angular velocities about the world coordinate system's X and Y axes.
The actions of roll, pitch, yaw, and thrust.
The inputs of the FIS, represented as the ownship–target angle, the front distance, and the lateral distance error.
The speed vector's projection onto the direction (vector) formed between the ownship and the selected target ship.
The component of the speed vector that is perpendicular to the vector formed between the ownship and the selected target ship.
The number of successfully intercepted intruders.
Most observations in S are continuous, leading to a theoretically infinite observation space. To tackle this, we discretized each observation factor. For instance, heterogeneous discretization was implemented for the ownship–target distance: states beyond 5 m from the target were rounded to 0.5 m, while states within 5 m were rounded to 0.1 m. The precision is up to one decimal place for velocities, position errors, and control inputs. Through such discretization, we substantially reduced the size of S, significantly shortening the training time and helping the agent avoid convergence failures.
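As a concrete illustration of this nonuniform discretization (using the ownship–target distance as the example), the rounding rule can be expressed as follows. The 5 m threshold and the 0.5 m/0.1 m steps are the values stated above; the function name is illustrative only.

```python
def discretize_distance(d, threshold=5.0, coarse=0.5, fine=0.1):
    """Round the ownship-target distance with a coarse step far from the target
    and a fine step close to it (nonuniform discretization)."""
    step = coarse if d > threshold else fine
    return round(round(d / step) * step, 1)   # keep one decimal place

assert discretize_distance(12.34) == 12.5     # beyond 5 m: rounded to 0.5 m
assert discretize_distance(3.14) == 3.1       # within 5 m: rounded to 0.1 m
```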
3.2.3. Reward Function (R)
This part represents the core of the proposed hybrid control system, delineating the RL environment's feedback mechanism in response to the ownship's actions and guiding the agent to learn how to make optimal decisions. Crucially, if the reward function (R) is sparse or biased, the agent may fail to reach the desired outcomes. For complex, high-dimensional state-space tasks, the reward function must be sufficiently sensitive to reflect minor state changes. Furthermore, a task often encompasses multiple objectives, such as target tracking, obstacle avoidance, and speed control. This necessitates that R can adjust the weights of the terms corresponding to each objective based on their importance and priority. Therefore, exponential terms are employed to ensure the sensitivity of each component within R. Moreover, the exponential function yields bounded outputs over a bounded input range, which facilitates simple adjustment of the coefficients of different terms to satisfy the specific needs of the task. The reward function designed for the hybrid control scheme consists of several weighted exponential terms, whose coefficients were designed to reflect the relative importance of each objective, as explained below.
The reward function of the agent comprises several components. The initial term adopts the ownship's position relative to the current target (instead of absolute positions), encourages the ownship to minimize this distance, and helps avoid 'Agent Confusion' when the target changes. The second and third components incentivize the agent to navigate toward the target, imposing penalties for movements in other directions. Subsequent penalizing components associated with the control signals and angular velocities are designed to prevent oscillations and mitigate large variations, thus enhancing safety and energy efficiency. A further term rewards the agent for each successful interception of an intruder. Additionally, the function integrates a termination condition, in which the end of an unsuccessful episode triggers a corresponding penalty. These aspects are elaborated further in Section 3.2.5.
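Because the weight coefficients and exact exponents are not reproduced above, the sketch below only illustrates the general shape of two such exponential components: a bounded, distance-sensitive attraction term and a signed speed-projection term. The scales and bounds used here are placeholders chosen to match the ranges reported in Section 4.1 (the first term bounded in (0, 1], the projection term in (−0.4, 0.4)), not the values from the study.

```python
import numpy as np

def distance_term(delta, scale=5.0):
    """Bounded in (0, 1]; grows as the ownship-target coordinate difference shrinks."""
    return float(np.exp(-abs(delta) / scale))

def projection_term(v_proj, scale=2.0, bound=0.4):
    """Bounded in (-bound, bound); positive when the speed projection points
    toward the target, negative otherwise (cf. Figure 8)."""
    return float(bound * np.sign(v_proj) * (1.0 - np.exp(-abs(v_proj) / scale)))

# Moving toward the target from 2 m away yields a higher combined value
# than moving away from it at the same distance.
print(distance_term(2.0) + projection_term(1.5))    # ~0.88
print(distance_term(2.0) + projection_term(-1.5))   # ~0.46
```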
3.2.4. Reset Function
To enhance the agent's generalization capability, a flexible and efficient reset function is indispensable. Algorithm 2 was developed to ensure that, at the beginning of each training episode, each intruder UAV appears at a random position on its predefined path. Subsequently, the position of the ownship is determined randomly based on the relative distances to the intruders, ensuring that the ownship is within a range of 2.5 m to 20 m of an intruder UAV. Similarly, the initial orientation and velocities of the ownship are also generated randomly.
Algorithm 2 Reset mechanisms for ownship and intruder aircraft during training.
Require: Number of intruder UAVs N, and an array of waypoint sets, one for each intruder n, where $n = 1, \ldots, N$.
Ensure: Initial states of the intruder aircraft are updated based on their paths; the ownship's initial state is updated based on the relative distances to the intruders.
1: Initialize the environment with the update rate (100 Hz) and the reference location.
Step 1: Reset the initial positions of the intruder UAVs.
2: for $n = 1$ to N do
3: Set the number of points for interpolation on the waypoint-based trajectory.
4: Define the waypoints for the nth intruder.
5: Generate a sequence M consisting of evenly spaced points that span the whole trajectory defined by the waypoints.
6: Use M for linear interpolation on the trajectory.
7: Draw a random integer from 1 to the number of interpolation points.
8: Select the corresponding interpolation point as the initial position of the nth intruder.
9: Initialize the nth intruder with this position.
10: end for
Step 2: Reset the ownship's initial state.
11: Determine the ownship's initial position based on the relative distances to each intruder UAV.
12: Select the orientation and velocities randomly, e.g., draw the initial velocities in the x, y, and z directions from a bounded random distribution.
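A condensed Python sketch of this reset logic is given below. For brevity, it samples a random point on a random segment of each waypoint path rather than building the evenly spaced interpolation sequence of Algorithm 2, and the helper names and velocity sampling are illustrative assumptions; only the position-reset logic is shown.

```python
import numpy as np

def reset_positions(waypoints_per_intruder, d_min=2.5, d_max=20.0,
                    rng=np.random.default_rng()):
    """Place each intruder at a random point on its waypoint path, then sample an
    ownship position whose distance to the nearest intruder lies in [d_min, d_max]."""
    intruder_pos = []
    for wp in waypoints_per_intruder:           # wp: (k, 3) array of waypoints, k >= 2
        seg = rng.integers(len(wp) - 1)         # random segment of the path
        t = rng.uniform()                       # random fraction along that segment
        intruder_pos.append(wp[seg] + t * (wp[seg + 1] - wp[seg]))
    intruder_pos = np.asarray(intruder_pos)

    while True:                                 # rejection-sample the ownship position
        anchor = intruder_pos[rng.integers(len(intruder_pos))]
        candidate = anchor + rng.uniform(-d_max, d_max, size=3)
        nearest = np.linalg.norm(intruder_pos - candidate, axis=1).min()
        if d_min <= nearest <= d_max:
            return candidate, intruder_pos
```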
3.2.5. Termination Conditions (D)
Unlike 2D maps, which can easily be bounded, 3D spaces are theoretically infinite, and it is challenging to define a unique geofence for various scenarios. Therefore, it is essential to establish relevant constraints to limit the spatial range and temporal duration of each test. A training episode or a flight test concludes under any of the following conditions (a compact check is sketched after this list):
The ownship is no longer positioned within 30 m of the selected target ship.
The duration surpasses the predefined maximum time threshold.
All intruders have been successfully intercepted.
The ownship collides with an obstacle, identified when the minimal LiDAR reading drops below 0.75 m (in this study, the distance from the center of the ownship to its edge does not exceed 0.4 m).
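The four conditions can be checked as in the following sketch. The 30 m bound and the 0.75 m LiDAR threshold are the values given above; the 30 s default time limit reflects the maximum allotted time reported in Section 4.1, and the variable names are illustrative placeholders.

```python
def episode_done(dist_to_target, elapsed_time, captured_count, n_intruders,
                 min_lidar, t_max=30.0):
    """Return (done, reason) according to the four termination conditions."""
    if dist_to_target > 30.0:
        return True, "target lost (ownship farther than 30 m from the target)"
    if elapsed_time > t_max:
        return True, "time limit exceeded"
    if captured_count == n_intruders:
        return True, "all intruders intercepted"
    if min_lidar < 0.75:
        return True, "collision (minimal LiDAR reading below 0.75 m)"
    return False, ""
```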
3.3. The Fuzzy Inference System
For training cases in 2D environments, drones often exhibit aimless rotational movements during the early training phases, spanning 200–300 episodes [47]. This consumes significant time and decreases the efficiency of testing new algorithms. Such challenges are further exacerbated in 3D environments due to the increased degrees of freedom. To compensate for this, a FIS has been designed to assist the ownship UAV in avoiding extensive trials of irrational poses and movements during the training process. Additionally, this FIS also supports collision avoidance maneuvers, further reducing the training complexity of the agent.
In essence, the primary objective of the proposed FIS is to guide the ownship towards the target ships along an unobstructed path, a functionality that has been partially embedded within the reward function. In practical applications, sensors such as LiDAR or cameras are strategically mounted to face forward on UAVs to ensure optimal data collection; also, for ease of mathematical representation, the FIS is applied to adjust the orientation of the UAV (by controlling yaw to steer left or right) towards the target, thus facilitating the desired movements in cooperation with the reward functions. The choice of the FIS for controlling yaw is based on the fact that yaw adjustments do not impact the quadrotor's velocities in any direction [48], thus preserving its maneuverability.
The FIS incorporates three inputs: the angle between the ownship's body-frame x-axis and the ownship–target vector (the ownship–target angle), the LiDAR reading in the direction of the ownship's motion (the front distance), and the difference between the LiDAR readings to the left and right of the motion direction (the lateral distance error). Additionally, since the ownship's speed vector may not align with its x-axis, the LiDAR reading along the speed vector's projection onto the ownship's XY plane is used as the front distance, together with the readings at 25 degrees to the left and right of that direction to calculate the lateral distance error. Detailed calculations for acquiring these input variables are provided in Appendix A.2.
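As an illustrative complement to Appendix A.2, the three FIS inputs could be computed from the state and LiDAR readings roughly as follows. The helper lidar_at, the sign convention of the angle, and the left-minus-right ordering of the lateral error are assumptions made for this sketch, not the study's exact formulation.

```python
import numpy as np

def fis_inputs(body_x_axis, ownship_pos, target_pos, lidar_at):
    """Compute the three FIS inputs: target angle, front distance, lateral error.

    body_x_axis  : unit vector of the ownship body x-axis (world frame)
    lidar_at(az) : hypothetical helper returning the LiDAR range (capped at 20 m)
                   at azimuth `az` degrees, measured from the projection of the
                   ownship's speed vector onto its XY plane
    """
    # Signed horizontal angle between the body x-axis and the ownship-target vector
    # (the sign convention depends on the chosen coordinate frame).
    t = (target_pos - ownship_pos)[:2]
    h = body_x_axis[:2]
    target_angle = np.degrees(np.arctan2(h[0] * t[1] - h[1] * t[0],
                                         h[0] * t[0] + h[1] * t[1]))

    front_dist = lidar_at(0.0)                      # reading along the motion direction
    lateral_err = lidar_at(-25.0) - lidar_at(25.0)  # left reading minus right reading
    return target_angle, front_dist, lateral_err
```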
Furthermore, for the onboard LiDAR, all readings exceeding the maximum range are clipped to the maximum range (20 m). The membership functions (MFs) and fuzzy rules of this FIS are as follows:
Input MFs:
Front distance: Small [0, 3.5 m), Medium [3.5 m, 7.5 m], and Large (7.5 m, 20 m].
Lateral distance error: Small [−20 m, −7.5 m], Medium [−5 m, 5 m], and Large [7.5 m, 20 m].
Ownship–target angle: Extremely Small, Small, Moderately Small, Neutral, Moderately Large, Large, and Extremely Large.
Output MFs:
Yaw command: Left [−1, −0.7], Soft Left [−0.5, −0.1], Straight [−0.2, 0.2], Soft Right [0.1, 0.5], and Right [0.7, 1].
Fuzzy Rules:
IF the front distance IS Small AND the lateral distance error IS Small THEN the yaw command IS Right.
IF the front distance IS Small AND the lateral distance error IS Medium THEN the yaw command IS Right.
IF the front distance IS Small AND the lateral distance error IS Large THEN the yaw command IS Left.
IF the front distance IS Medium AND the lateral distance error IS Small THEN the yaw command IS Right.
IF the front distance IS Medium AND the lateral distance error IS Large THEN the yaw command IS Left.
IF the front distance IS Large AND the lateral distance error IS Small THEN the yaw command IS Soft Right.
IF the front distance IS Large AND the lateral distance error IS Large THEN the yaw command IS Soft Left.
IF the front distance IS Medium AND the lateral distance error IS Medium AND the ownship–target angle IS Extremely Small THEN the yaw command IS Left.
IF the front distance IS Medium AND the lateral distance error IS Medium AND the ownship–target angle IS Small THEN the yaw command IS Left.
IF the front distance IS Medium AND the lateral distance error IS Medium AND the ownship–target angle IS Moderately Small THEN the yaw command IS Soft Left.
IF the front distance IS Medium AND the lateral distance error IS Medium AND the ownship–target angle IS Neutral THEN the yaw command IS Straight.
IF the front distance IS Medium AND the lateral distance error IS Medium AND the ownship–target angle IS Moderately Large THEN the yaw command IS Soft Right.
IF the front distance IS Medium AND the lateral distance error IS Medium AND the ownship–target angle IS Large THEN the yaw command IS Right.
IF the front distance IS Medium AND the lateral distance error IS Medium AND the ownship–target angle IS Extremely Large THEN the yaw command IS Right.
IF the front distance IS Large AND the lateral distance error IS Medium AND the ownship–target angle IS Extremely Small THEN the yaw command IS Left.
IF the front distance IS Large AND the lateral distance error IS Medium AND the ownship–target angle IS Small THEN the yaw command IS Left.
IF the front distance IS Large AND the lateral distance error IS Medium AND the ownship–target angle IS Moderately Small THEN the yaw command IS Soft Left.
IF the front distance IS Large AND the lateral distance error IS Medium AND the ownship–target angle IS Neutral THEN the yaw command IS Straight.
IF the front distance IS Large AND the lateral distance error IS Medium AND the ownship–target angle IS Moderately Large THEN the yaw command IS Soft Right.
IF the front distance IS Large AND the lateral distance error IS Medium AND the ownship–target angle IS Large THEN the yaw command IS Right.
IF the front distance IS Large AND the lateral distance error IS Medium AND the ownship–target angle IS Extremely Large THEN the yaw command IS Right.
In summary, this section introduces the components and algorithms of the proposed SAC-FIS control framework for UAV multiple dynamic target interception, based on the system architecture shown in Figure 1. With a DRL agent as the foundation, the controller provides the ability to continuously learn and adapt to the task in a model-free manner. This learning process includes progressively understanding the dynamic model of the UAV to achieve six-degree-of-freedom motion control for the underactuated quadrotor system using only four control inputs: Roll, Pitch, Yaw, and Thrust. Furthermore, the FIS leverages onboard sensor data (3D LiDAR) to incorporate universal expert experience and human cognition, assisting the RL agent in avoiding numerous unrealistic trials and thereby improving training efficiency. Concurrently, the DRL agent persistently learns and interprets the logic provided by the FIS, gradually enhancing coordination for more flexible and efficient motions.
The system setup primarily involves the following steps: initially, as per Algorithm 1, the dynamic model of the specific robot (in this study, a quadrotor UAV) must be thoroughly analyzed. Following this, a FIS is designed accordingly (detailed in Section 3.3). Subsequently, the neural network is configured and the relevant training parameters are fine-tuned (discussed in Section 3.1). Additionally, a series of environment interface functions are developed (elaborated in Section 3.2), including the reward function, the reset function, and the observation space, among others. In the subsequent section, the training results applied to various scenarios are illustrated.
4. Results
This section presents a comprehensive display of the simulation results for the proposed hybrid control system, underscoring its robust generalization capabilities. Additionally, comparisons with other approaches are discussed, emphasizing the superiority of the hybrid controller. Moreover, a feasibility analysis of applying the hybrid controller to actual quadcopters is conducted based on extensive testing data.
It is important to highlight that the Matlab UAV Toolbox (2023b) serves as the simulation platform for this study, owing to its incorporation of realistic UAV dynamics and its consideration of environmental factors. In the simulation results presented, the 'ownship' is depicted as a blue quadrotor, whereas other colors, such as orange, purple, and green, represent the intruder aircraft (quadrotors). The simulation space employs the NED (north, east, down) coordinate system; however, the simulation environment displays the north, east, and up directions. This means that the z-coordinates are negative, as the UAV operates above ground level (z < 0) and does not fly underground (z > 0). This convention aligns with the common practice of displaying positive altitude values in the simulation environment. Additionally, all line graphs feature time on the horizontal axis, with the unit in seconds.
4.1. Simulation Results of The Hybrid Controller
Figure 3 demonstrates the application of a well-trained SAC-FIS agent across three scenarios, each featuring variations in the initial positions, the number of intruder aircraft, the intruders' trajectories, and their motion patterns. Additionally, each intruder UAV operates at a distinct speed: the purple intruder is set to 0.6 m/s, the orange intruder to 0.9 m/s, and the green intruder to 1.2 m/s. In contrast, the ownship's speed can reach up to 10 m/s due to the bounding of each action (control variable). In Scenario 1, the ownship and two intruder UAVs start from specified initial positions, and the flight paths of the two intruders are defined by their respective waypoint sets. In Scenario 2, the ownship and the two intruders start from different initial positions, with the intruders' flight paths again determined by predefined waypoints. Scenario 3 incorporates three intruder UAVs, each with its own initial position and waypoint-defined flight path. The simulation results for these scenarios are depicted in Figure 4 and Figure 5, which detail the ownship's and intruder UAVs' trajectories during the interception process from both perspective and top-down views.
Figure 6, Figure 7 and Figure 8 provide an analysis of some key metrics for the three scenarios. Specifically, Figure 6 depicts the distance from the ownship to the current target ship. In Scenario 1, the ownship successfully intercepted the first target within 5.1 s and captured both intruder UAVs in 7.7 s, with each intruder hovering at its position upon successful interception. In Scenario 2, the ownship took 2 s to intercept the first target and 12.8 s for the second, totaling 14.8 s. In Scenario 3, the interception times for the three targets were 4.8 s, 7.6 s, and 7 s, respectively, with a total engagement time of 19.4 s. None of the scenarios exceeded the maximum allotted time of 30 s.
Figure 7 and Figure 8 show the reward values for the first and second terms of the reward function. The first term encompasses the coordinate differences between the ownship and the current target in the X, Y, and Z directions, with each component bounded between 0 and 1; the smaller the difference, the higher the reward. Similarly, Figure 8 assesses both the magnitude and direction of the speed vector's projection onto the ownship–target vector, which is bounded between −0.4 and 0.4. If this component aligns with the direction from the ownship to the target, the reward is positive; otherwise, it is negative.
These results demonstrate that the proposed SAC-FIS method can excellently perform the task of tracking multiple dynamic targets under various configurations, showcasing its efficiency and generalization capability.
4.2. Comparison Results
In order to demonstrate the advantages of the proposed hybrid SAC-FIS controller, comparisons were made with an SAC-only approach, which represents using only the DRL model to control all variables. Concurrently, a comprehensive evaluation was conducted to assess the impact of incorporating the captured-intruder-count factor into the observation space on tracking performance.
Figure 9 presents the training processes of three distinct approaches on a desktop equipped with an NVIDIA GeForce RTX 4090 GPU. The training result for the SAC-FIS method with the captured-intruder counter included in the observation space, reaching the required average return after 434 episodes over 6 h and 47 min, is depicted in Figure 9a. The SAC-only approach, also incorporating the counter, took 8 h and 53 min for 471 episodes, as illustrated in Figure 9b. This purely DRL method necessitated additional training time because the agent had to simultaneously understand four interrelated control variables and gradually improve this more complicated control process; indeed, by the 434th episode (the point at which the SAC-FIS agent in Figure 9a converged), its training time had already amounted to 8 h and 9 min. Moreover, Figure 9c represents the SAC-FIS method's training process without the counter, which was completed in 341 episodes over 4 h and 26 min. This method's shortest training duration is attributed to a smaller state space, facilitating a relatively easier understanding of the environment. Therefore, we designed a unified scenario with more complex configurations to test the performance of these approaches. As shown in Figure 10, the trajectories of the two intruders were determined by their respective waypoint sets, with the initial positions of the ownship and the intruder UAVs specified accordingly. The purple intruder drone incrementally elevated its altitude as it moved away from the ownship, while the orange intruder was in the landing mode.
The simulation results of these approaches are showcased in Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15. The SAC-FIS agent incorporating the captured-intruder counter successfully captured both intruders within 14 s, whereas the SAC-only method took 15.8 s due to a noticeably longer flight path of the ownship, much of which was redundant. This is because, without the guidance of relevant experience, the DRL-only method requires an extensive period to reduce the cost gradually, and achieving the minimum-cost performance is essentially impossible. Figure 11c, Figure 12c, Figure 13c, Figure 14c and Figure 15c show the outcomes for the SAC-FIS agent without the counter, indicating a failed attempt. After successfully capturing the first target within 2.5 s, the ownship mechanically endeavored to complete the same task and, since it had already achieved that tracking objective, ultimately succumbed to the 'Confusion' issue after several failed attempts, leading to the interception failure of the second target ship.
These comparison results highlight the superiority of the SAC-FIS method, as it possesses certain universal experiences from the outset, significantly reducing training time and aiding the agent in completing tasks at a lower cost. Meanwhile, all three agents successfully performed tasks involving only one intruder UAV. However, for tasks involving multiple dynamic targets, incorporating the captured-intruder counter into the observation space (S) is crucial, as it straightforwardly resolves the 'Agent Confusion' phenomenon.
We conducted an in-depth assessment of both the SAC-FIS and SAC-only methods, evaluating their performance with and without the inclusion of the captured-intruder counter in S. Additionally, we tested these methods across scenarios involving two and three intruder aircraft, with each scenario undergoing one hundred trials. The initial positions of the intruders and the various initial parameters of the ownship were randomized at the beginning of each trial using the reset function developed in Section 3.2.4. The success rates of the various approaches are summarized in Table 1. It is evident that the success rate of the SAC-only method is markedly lower than that of SAC-FIS, primarily because the SAC-only agent is more likely to exceed the time limit. Moreover, the importance of introducing the captured-intruder counter is reaffirmed.
4.3. The Hybrid Controller Feasibility Analysis
For the simulations, the mass of the ownship is set to 0.1 kg, with standard gravitational acceleration. The quadrotor's four control inputs (Roll, Pitch, Yaw, and Total Thrust) are bounded, and the reward function also includes mechanisms to prevent the UAV from experiencing jitter and sudden motions, thus ensuring stable flight and avoiding unreasonable poses. More importantly, however, evaluating the safety performance in real-world applications is crucial. After analyzing a vast array of results, we select Scenario 3 from Section 4.1 as an example. As illustrated in Figure 16, the Yaw input, regulated by the FIS, results in relatively smooth signals with few oscillations. In contrast, the other control variables, which are governed by the SAC agent, exhibit persistent fluctuations. When applying this hybrid controller in real flight tests, even though most quadrotors may not be highly sensitive to fluctuating signals, such volatility could still cause mechanical wear and potentially pose a risk. Cordeiro et al. [49] designed a sliding-mode controller (SMC) for fixed-wing UAVs and effectively smoothed the highly fluctuating control signals by incorporating a low-pass filter, reducing common chattering effects while ensuring robustness. In [50], the authors employed an extended Kalman filter (EKF) within their proposed controller; the results demonstrate that the controller, augmented by the EKF, maintains robust tracking performance even when Gaussian white noise is introduced into the state variables. Therefore, for this study, it is beneficial to add a low-pass or Kalman filter to smooth the control signal curves before applying them to real UAVs.
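As a minimal sketch of the kind of smoothing that could be placed between the SAC outputs and the actuators, a first-order exponential low-pass filter is shown below. The smoothing factor is an illustrative value, not a parameter tuned in this study, and a Kalman filter would be a more elaborate alternative.

```python
class LowPassFilter:
    """First-order exponential smoothing: y_k = (1 - alpha) * y_{k-1} + alpha * u_k."""
    def __init__(self, alpha=0.2, initial=0.0):
        self.alpha = alpha          # smaller alpha -> stronger smoothing
        self.y = initial

    def update(self, u):
        self.y = (1.0 - self.alpha) * self.y + self.alpha * u
        return self.y

# Example: smoothing a fluctuating roll command before sending it to the UAV.
lpf = LowPassFilter(alpha=0.2)
smoothed = [lpf.update(u) for u in [0.3, -0.4, 0.35, -0.3, 0.32]]
```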
Furthermore, it is essential to monitor changes in the acceleration of the ownship. As seen in Figure 17, the maximum fluctuation range of the acceleration remains below two times the gravitational acceleration (2 g). According to handbooks and technical specifications for small quadrotors of similar weight, such platforms can typically withstand acceleration changes of 2 g to 5 g [51]. Thus, the hybrid controller is applicable to real quadrotor UAVs in terms of acceleration changes.
In short, the hybrid SAC-FIS controller demonstrated exceptional simulation performance and can be applied to real quadcopters once filtering is applied to smooth the control signals.
5. Conclusions
This study introduces a novel hybrid UAV motion control scheme that combines a SAC-based deep reinforcement learning strategy with a fuzzy inference system for the multiple dynamic target interception problem. A comprehensive analysis of the simulation results and comparisons with alternative approaches underscore the effectiveness of the proposed method. The design of the control framework employs a modular architecture, facilitating straightforward adaptation, either in part or in entirety, to varied problems, thereby augmenting its scalability. Additionally, by adopting a fuzzy logic model to integrate selected universal expert experience, in conjunction with a highly sensitive reward function and a flexible reset function, this approach markedly improves the training efficiency, boosts the generalization ability of the trained agent, and reduces costs simultaneously. Furthermore, dynamic-environment and multi-target cases have always been two significant challenges in RL. This paper addresses these difficulties by redesigning the observation space. The steps taken include focusing exclusively on the current target information, employing relative (instead of absolute) coordinates between the ownship and the selected target aircraft, discretizing each state, and incorporating a counting factor. In the future, we plan to upgrade specific modules within the system to address more complicated problems, such as integrating dense and randomly localized obstacles, and deploying a swarm of ownships operating under a cooperative protocol. On the other hand, by smoothing the control signals and conducting comprehensive safety tests, applying our method to real-world flight trials will enable the identification and correction of potential issues, further enhancing the performance of the hybrid control system.