Article

A Novel Method for a Pursuit–Evasion Game Based on Fuzzy Q-Learning and Model-Predictive Control

by Penglin Hu, Chunhui Zhao * and Quan Pan
School of Automation, Northwestern Polytechnical University, Xi’an 710129, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(9), 509; https://doi.org/10.3390/drones8090509
Submission received: 1 September 2024 / Revised: 16 September 2024 / Accepted: 19 September 2024 / Published: 20 September 2024
(This article belongs to the Special Issue Optimal Design, Dynamics, and Navigation of Drones)

Abstract

This paper explores a pursuit–evasion game (PEG) based on quadrotors by combining fuzzy Q-learning (FQL) and model-predictive control (MPC) algorithms. Initially, the FQL algorithm is employed to perceive, make decisions, and predict the trajectory of the evader. Based on the position and velocity information of both players in the game, the pursuer quadrotor determines its action strategy using the FQL algorithm. Subsequently, a state feedback controller is designed using the MPC algorithm, with reference inputs derived from the FQL algorithm. Within each MPC cycle, the FQL algorithm dynamically provides reference inputs to the MPC, thereby enhancing its robust control and dynamic optimization for the quadrotor. Finally, simulation results verify the effectiveness of the proposed algorithm.

1. Introduction

A pursuit–evasion game (PEG) involves two opposing groups: the pursuers and the evaders. It mainly focuses on how pursuers cooperate to capture evaders and what strategies evaders should adopt to avoid capture or prolong the capture time [1]. In recent years, PEGs have been widely applied in military and civil fields, such as missile defense and interception [2], air combat [3], search and rescue [4], and transportation management [5]. Various forms of PEGs have emerged, including single-to-single [6], single-to-multiple [7], multiple-to-single [8], multiple-to-multiple [9], and other scenarios. At the same time, a variety of algorithms have been presented and discussed, such as the geometric method [10,11], the differential game method [12], the classical control method [13], and the reinforcement learning (RL) method [14].
Compared with other classical control algorithms, the model-predictive control (MPC) algorithm has garnered significant research attention in PEG problems due to its flexible optimization capabilities, constraint-handling ability, adaptability, and robustness. For a PEG of fixed-wing unmanned aerial vehicles (UAVs), the authors proposed a nonlinear MPC algorithm that incorporated the safety and functional constraints of the UAV into the cost function, achieving autonomous obstacle avoidance and pursuit maneuvers [15]. Furthermore, the MPC algorithm was applied to the PEG of unicycle models in an obstacle environment, and the robustness and security of the algorithm were demonstrated [16]. Considering obstacles, the authors proposed a novel PEG algorithm that alternates between game-theoretic and MPC strategies, effectively reducing computational complexity [17]. Considering the constraint of limited information, a PEG algorithm based on MPC was proposed, which enables players to use limited information to predict the opponent’s movement strategy in an obstacle environment [18]. To tackle the two-target, two-attacker PEG problem with state and control constraints, a game strategy based on nonlinear MPC and Apollonius circle (AC) was developed [19]. Additionally, based on the unmanned surface vessel platform, a robust adversarial PEG algorithm based on MPC was proposed that combines the particle swarm optimization algorithm with the MPC cost function [20].
Traditional PEG algorithms, such as MPC, typically rely on precise mathematical models for designing control laws. However, practical applications often involve random factors such as sensor errors and environmental disturbances, which make it difficult to acquire accurate models and pose significant challenges to the control of agents. In this context, RL algorithms have become an important method to solve PEG problems. A distributed deep Q network (DQN) algorithm was proposed to enhance the collaborative pursuit or evasion strategy between agents through a decentralized learning framework [21]. The authors proposed a distributed cooperative PEG algorithm based on RL and designed communication networks to save communication and computational resources [22]. Kartal et al. [23] applied an integral reinforcement learning algorithm to achieve optimal action selection for online and real-time PEG scenarios. In [24], the authors introduced a novel two-stage pursuit strategy employing the deep deterministic policy gradient (DDPG) algorithm to resolve PEG challenges using real-time feedback information, particularly under conditions of incomplete information. Zhang et al. [25] integrated a prediction network into an RL framework, enhancing the algorithm’s performance and agents’ predictive capabilities in the PEG scenario. Selvakumar et al. [26] utilized min–max Q-learning and matrix game theory to transform a non-zero-sum PEG into multiple two-player static zero-sum PEGs and then determined the optimal actions for agents in these complex scenarios.
Traditional RL algorithms, deeply rooted in rigorous mathematical foundations, provide a framework for the precise modeling and analysis of complex problems. However, they often encounter challenges related to interpretability and adaptability. Fuzzy reinforcement learning has emerged as a promising solution to effectively handle imprecise and ambiguous information, especially in contexts where uncertain environmental perception and decision-making processes are prevalent. The inherent flexibility of fuzzy logic enables it to adeptly handle intricate relationships, proving particularly advantageous when dealing with complexities like non-linearity and non-convexity that arise in diverse problem domains. Notably, the rules and inference processes produced by fuzzy reinforcement learning tend to be intuitive and easily comprehensible. Combining the strengths of both fuzzy reinforcement learning and traditional approaches yields promising results. Consequently, a large number of RL algorithms incorporating fuzzy logic principles have emerged. Based on the quadrotor platform, a novel control algorithm was proposed that combines fuzzy logic control with RL techniques and can adapt to varied initial positions and noise conditions [27]. The authors of [28] utilized the fuzzy Q-learning (FQL) algorithm to acquire control policies for agents and also used a formation control mechanism to capture a superior evader, effectively addressing the multiple-to-single PEG problem. Wang et al. [29] combined the fuzzy inference system and RL to address the PEG problem in the continuous action domain, and the proposed algorithm can be applied to high-dimensional situations. A PEG technique combining fuzzy systems with Q-learning was proposed for UAVs, which solved the challenges of continuous state and action spaces in Q-learning [30].
In summary, the MPC algorithm relies on precise mathematical models and target reference inputs, traditional RL algorithms suffer from poor interpretability and adaptability, and fuzzy RL algorithms can effectively handle imprecise and fuzzy information. Therefore, in this paper, we integrate the FQL algorithm with the MPC algorithm to address the PEG problem based on quadrotors and provide a robust framework for its solution. We utilize the perception and decision-making capabilities of FQL to formulate and evaluate strategies for the quadrotor. Meanwhile, we leverage the robust constraint handling and dynamic optimization capabilities of the MPC algorithm to achieve robust control for the quadrotor. First, we employ the FQL algorithm to perform perception and decision-making tasks, while evaluating the target’s position and velocity information. Fuzzy logic is employed to process ambiguous information and generate appropriate action strategies, enabling the quadrotor to make suitable decisions in diverse PEG scenarios for both tracking and evasion. Then, we utilize the action strategy generated by FQL as a reference state for MPC to design a state feedback controller. During each control cycle, we minimize an objective function that considers control input constraints and obstacle avoidance constraints, ensuring that the quadrotor does not violate physical constraints and safety requirements in the PEG scenario. The main contributions of this paper are as follows:
(1)
For the first time, the Apollonius circle is extended from 2D space to 3D space, and an analytical equation is provided. A novel learning algorithm based on the FQL framework is proposed, and a reward function based on the idea of artificial potential field is designed for the learning algorithm, which improves the convergence speed and learning performance of the algorithm.
(2)
Distinct from existing methods, with the strategy derived from the FQL algorithm as the reference signal, we model the trajectory tracking problem as an MPC optimization problem for the decoupled quadrotor. We design a state feedback controller for the quadrotor that takes into account control input constraints and obstacle avoidance constraints and analyze the feasibility and stability of the solution.
(3)
We have verified the learning performance of FQL in 3D scenarios. The trajectory tracking capability of the designed state feedback algorithm has been validated through 3D trajectory tracking results. Finally, the combination of FQL and MPC algorithms is employed to control the quadrotor to achieve a PEG on the Gazebo platform.
The rest of this paper is organized as follows. Section 2 introduces the perception and decision based on the FQL algorithm. In Section 3, we present the quadrotor control based on the MPC algorithm. Section 4 provides detailed simulation results to verify the effectiveness of the algorithm. Finally, Section 5 concludes the paper.

2. Perception and Decision Based on Fuzzy Q-Learning

2.1. The Model of Pursuit–Evasion Game

The pursuer generates its trajectory prediction by assessing factors such as the evader’s position and velocity. We employ the FQL algorithm to learn the evader’s movement strategy and make decisions. In this section, we only require knowledge of the evader’s control strategy without involving specific dynamic controls introduced in the following sections. Therefore, to facilitate the learning of control strategies, we omit the complex control model of the quadrotor and focus on a simpler model. As shown in Figure 1, the agent’s kinematic model is described as (1)
$$\begin{aligned} x_{t+1} &= x_t + v \sin\theta \cos\alpha \\ y_{t+1} &= y_t + v \sin\theta \sin\alpha \\ z_{t+1} &= z_t + v \cos\theta , \end{aligned} \tag{1}$$
where $(x_t, y_t, z_t)$ and $v$ represent the position and the velocity of the agent, respectively. $\alpha$ is the angle between the projection of the velocity $v$ onto the x-y plane and the x-axis, and $\theta$ is the angle between the velocity $v$ and the z-axis. $U = [\alpha, \theta]$ is the steering angle of the agent and also the output of the controller. To ensure that the agent’s motion adheres to real-world constraints, the increments $\Delta\alpha$ and $\Delta\theta$ of the agent’s steering angle are limited to the interval $[-\frac{\pi}{4}, \frac{\pi}{4}]$. The ability of agents to engage in capture or evasion depends on the unique input information of the algorithm, namely, the positions and velocities of agents. Consequently, we introduce the Apollonius circle to depict each agent’s dominant area and determine the movement strategies based on their geometric relationships.
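For concreteness, a minimal Python sketch of the kinematic update (1) is given below. The function name, the unit sample time folded into $v$, and the clamping of the steering increments to $[-\frac{\pi}{4}, \frac{\pi}{4}]$ before integration are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def step_agent(pos, v, alpha, theta, d_alpha, d_theta):
    """Advance the simple kinematic model (1) by one step.

    pos: current position (x_t, y_t, z_t); v: speed;
    alpha, theta: current steering angles; d_alpha, d_theta: commanded increments,
    clamped to [-pi/4, pi/4] as required by the steering constraint.
    """
    d_alpha = np.clip(d_alpha, -np.pi / 4, np.pi / 4)
    d_theta = np.clip(d_theta, -np.pi / 4, np.pi / 4)
    alpha, theta = alpha + d_alpha, theta + d_theta
    x, y, z = pos
    new_pos = np.array([x + v * np.sin(theta) * np.cos(alpha),
                        y + v * np.sin(theta) * np.sin(alpha),
                        z + v * np.cos(theta)])
    return new_pos, alpha, theta
```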
It is assumed that the pursuer’s velocity $v_P$ is faster than the evader’s velocity $v_E$, that is, $v_P > v_E$. As shown in Figure 2, we extended the Apollonius circle from 2D space to its generalized form in 3D space, providing the analytical equation of the generalized Apollonius circle. $A$, $B$, $C$, and $D$ are points located on the generalized Apollonius circle. The lines $PA$ and $PB$ are tangent to the sphere centered at $O_{AC}$, that is, $PA \perp AO_{AC}$ and $PB \perp BO_{AC}$, separating the capture and the escape regions of the agents. Given the positions of the pursuer and the evader denoted as $(x_t^P, y_t^P, z_t^P)$ and $(x_t^E, y_t^E, z_t^E)$, respectively, and the speed ratio $a$ of the agents, we have (2)
$$a = \frac{v_E}{v_P} = \frac{EA}{PA} = \frac{EB}{PB} = \frac{EC}{PC} < 1, \tag{2}$$
where $EA$ denotes the Euclidean distance between the position of the evader $E$ and the point $A$. The center and radius of the generalized Apollonius circle can be calculated by (3)
$$\begin{aligned} O_{AC} &= \left( \frac{x_t^E - a^2 x_t^P}{1 - a^2},\; \frac{y_t^E - a^2 y_t^P}{1 - a^2},\; \frac{z_t^E - a^2 z_t^P}{1 - a^2} \right) \\ R_{AC} &= \frac{a}{1 - a^2} \sqrt{ \left( x_t^P - x_t^E \right)^2 + \left( y_t^P - y_t^E \right)^2 + \left( z_t^P - z_t^E \right)^2 }. \end{aligned} \tag{3}$$
Within the space enclosed by the yellow sphere in Figure 2, the evader consistently arrives earlier than the pursuer, establishing it as the evader’s dominant region. Conversely, the external area represents the pursuer’s dominant region.
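The geometric quantities in (3) are straightforward to compute. The following Python sketch (with illustrative function names) evaluates the center and radius of the generalized Apollonius circle and tests whether a point lies in the evader's dominant region, assuming $v_P > v_E$.

```python
import numpy as np

def apollonius_sphere(p_pursuer, p_evader, v_p, v_e):
    """Center and radius of the generalized Apollonius circle (sphere) in 3D, Eq. (3).

    Assumes v_p > v_e, so the speed ratio a = v_e / v_p is less than one.
    """
    a = v_e / v_p
    p_pursuer = np.asarray(p_pursuer, dtype=float)
    p_evader = np.asarray(p_evader, dtype=float)
    center = (p_evader - a**2 * p_pursuer) / (1.0 - a**2)
    radius = a / (1.0 - a**2) * np.linalg.norm(p_pursuer - p_evader)
    return center, radius

def in_evader_dominant_region(point, p_pursuer, p_evader, v_p, v_e):
    """True if the point lies in the evader's dominant region (inside the sphere)."""
    center, radius = apollonius_sphere(p_pursuer, p_evader, v_p, v_e)
    return np.linalg.norm(np.asarray(point, dtype=float) - center) <= radius
```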
To ensure that the pursuer always moves towards capturing the evader and that the evader constantly avoids the pursuer’s pursuit, we define the agents’ motion strategies. For a random point C on the generalized Apollonius circle, we have (4)
$$\frac{\sin \alpha_1}{\sin \alpha_2} = \frac{v_P}{v_E}, \tag{4}$$
where $\alpha_1$ is the angle between the vector $\overrightarrow{EP}$ and the vector $\overrightarrow{EC}$, and $\alpha_2$ is the angle between the vector $\overrightarrow{PE}$ and the vector $\overrightarrow{PC}$. We obtain (5)
$$\alpha_2 = \sin^{-1}\!\left( \frac{v_E}{v_P} \sin \alpha_1 \right). \tag{5}$$
According to (5), when $\sin\alpha_1 = 1$, the boundary value of the optimal solution for $\alpha_2$ is
$$\alpha_2^* = \sin^{-1} \frac{v_E}{v_P}. \tag{6}$$
Therefore, the optimal motion space for both agents lies within the conical region. In a PEG scenario with $v_P > v_E$, the pursuer moves within the envelope defined by the cone, satisfying the condition $\alpha_2 \le \alpha_2^* = \sin^{-1}\frac{v_E}{v_P}$, to efficiently capture the evader. As for the evader, it maneuvers within the remaining cone envelope, satisfying $\alpha_1 \ge \frac{\pi}{2}$. The terminal conditions are defined as follows: the PEG ends at time $t_f$ if the distance between the pursuer $P$ and the evader $E$ satisfies $d_{PE}(t_f) \le d_s$, where $d_s$ represents the capture distance. Additionally, if the game duration exceeds the designated threshold, the game is deemed over.

2.2. Fuzzy Q-Learning Algorithm

In this paper, the PEG environment is not a typical discrete state–action space but rather a continuous environment. Thus, we need an approximator to map the state into an action. For classical RL, the goal is to obtain the optimal expected return to evaluate the performance of the control strategy. The expected value function is
$$Q(s, a) = \mathbb{E}\left[ \left. \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\right|\; s_t = s,\, a_t = a \right]. \tag{7}$$
In the Q-learning algorithm, it is necessary to consider the challenges arising from dealing with continuous state or action spaces. Therefore, the FQL algorithm, which generates global continuous actions for agents based on a predefined discrete action set, can effectively address the aforementioned issues [14]. We summarize the update process of the FQL algorithm as shown in Figure 3. We assume that the agent has $n$ inputs $\bar{x} = [x_1, \ldots, x_n]$, $m$ actions $A = \{a_1, \ldots, a_m\}$, and $L$ fuzzy rules. Each fuzzy rule $l\ (l = 1, \ldots, L)$ can be described by an IF-THEN statement
$$R^l: \ \text{IF } x_1 \text{ is } F_1^l, \ \ldots, \ \text{AND } x_n \text{ is } F_n^l \ \text{THEN } f^l = a^l, \tag{8}$$
where $x_i$ is the $i$th input of $\bar{x}$ and $f^l$ is the rule’s output related to the corresponding fuzzy variable. $F_i^l$ is the fuzzy set related to the input $x_i$, and $a^l$ is the action selected from the discrete action set $A$ for rule $l$. During the training process, it is necessary to design a reasonable action selection mechanism to overcome the exploration–exploitation dilemma.
In the Softmax exploration strategy, the action with a higher Q-value has a higher probability of being selected. For rule $l$, the probability of an action $a^l$ being selected is (9)
$$\Pr(a^l) = \frac{\exp\left( Q(l, a^l)/T \right)}{\sum_{k=1}^{|A|} \exp\left( Q(l, a_k)/T \right)}, \tag{9}$$
where $Q(l, a^l)$ is the Q-value of action $a^l$ given rule $l$ and $|A|$ is the size of the action space. $T$ is the softmax temperature. To reduce the algorithmic complexity and adapt to different task requirements, another action-selection mechanism is the $\varepsilon$-greedy method, which is described as (10)
$$a^l = \begin{cases} a \in A & \text{with probability } \varepsilon \\ \arg\max_{a \in A} Q(l, a) & \text{with probability } 1 - \varepsilon , \end{cases} \tag{10}$$
where $\varepsilon$ is the probability of randomly selecting an action $a \in A$ given rule $l$. The probability of selecting the action $a$ with the maximum Q-value is $1 - \varepsilon$. After the action is selected for each rule, the global continuous action at time $t$ is
$$a_t(\bar{x}_t) = \frac{\sum_{l=1}^{L} \left( \prod_{i=1}^{n} \mu_{F_i^l}(x_i) \right) a^l}{\sum_{l=1}^{L} \prod_{i=1}^{n} \mu_{F_i^l}(x_i)} = \sum_{l=1}^{L} \Phi_t^l a_t^l, \tag{11}$$
where $\Phi_t^l$ is the firing strength for rule $l$ at time $t$. The firing strength serves to quantify the degree to which the rule is activated under given input conditions, determining which rules will be prioritized and utilized to generate the final output. In a fuzzy logic system, the process begins by fuzzifying the system’s input variables through membership functions, yielding corresponding membership values for each input variable. Subsequently, these membership values are matched against the conditions of the rules within the fuzzy rule base. The firing strength of a rule is then calculated based on whether the conditions of the rule are met. The firing strength is defined as follows
$$\Phi_t^l = \frac{\prod_{i=1}^{n} \mu_{F_i^l}(x_i)}{\sum_{l=1}^{L} \prod_{i=1}^{n} \mu_{F_i^l}(x_i)}, \tag{12}$$
where $\mu_{F_i^l}$ is the membership degree of the fuzzy set $F_i^l$, and it can be calculated by the Gaussian membership function or the triangular membership function. The global Q-function is
$$Q_t(\bar{x}_t) = \sum_{l=1}^{L} \Phi_t^l Q_t(l, a_t^l), \tag{13}$$
where $Q_t(l, a_t^l)$ is the Q-value after taking action $a_t^l$ for rule $l$ at time step $t$. The global Q-function with the maximum Q-value is
$$Q_t^*(\bar{x}_t) = \sum_{l=1}^{L} \Phi_t^l \max_{a \in A} Q_t(l, a). \tag{14}$$
The temporal difference (TD) error is defined as
$$\tilde{\varepsilon}_{t+1} = r_{t+1} + \gamma Q_t^*(\bar{x}_{t+1}) - Q_t(\bar{x}_t), \tag{15}$$
where $\gamma$ is the discount factor and $r_{t+1}$ is the reward received after taking the action at time step $t$. The update law for the Q-function is
$$Q_{t+1}(l, a_t^l) = Q_t(l, a_t^l) + \alpha_q \tilde{\varepsilon}_{t+1} \Phi_t^l = Q_t(l, a_t^l) + \alpha_q \left[ r_{t+1} + \gamma Q_t^*(l, a_{t+1}^l) - Q_t(l, a_t^l) \right] \Phi_t^l, \tag{16}$$
where $\alpha_q$ is the learning rate.
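To make the update cycle concrete, the following Python sketch performs one FQL step: it fuzzifies the inputs with triangular membership functions, computes the firing strengths (12), selects a per-rule action by the $\varepsilon$-greedy rule (10), forms the global action (11) and global Q-values (13) and (14), and applies the TD update (15) and (16). The vectorized Q-table layout, helper names, and default hyperparameters are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def triangular_mf(x, centers, width):
    """Membership degrees of scalar input x in triangular fuzzy sets with the given centers."""
    return np.maximum(0.0, 1.0 - np.abs(x - centers) / width)

def firing_strengths(x_bar, mf_centers, mf_widths):
    """Normalized firing strengths Phi^l for all rules, Eq. (12).

    mf_centers / mf_widths: one array of membership-function centers (and one width)
    per input; rules are the Cartesian product of the per-input fuzzy sets.
    """
    per_input = [triangular_mf(x, c, w) for x, c, w in zip(x_bar, mf_centers, mf_widths)]
    grids = np.meshgrid(*per_input, indexing="ij")
    strengths = np.ones_like(grids[0])
    for g in grids:
        strengths = strengths * g
    strengths = strengths.ravel()
    total = strengths.sum()
    return strengths / total if total > 0 else strengths

def fql_step(Q, phi, phi_next, actions, reward,
             eps=0.1, gamma=0.95, alpha_q=0.001, rng=None):
    """One FQL update: per-rule epsilon-greedy selection (10), global action (11),
    TD error (15), and Q update (16). Q has shape (num_rules, num_actions)."""
    rng = np.random.default_rng() if rng is None else rng
    num_rules, num_actions = Q.shape
    greedy = Q.argmax(axis=1)
    explore = rng.random(num_rules) < eps
    chosen = np.where(explore, rng.integers(num_actions, size=num_rules), greedy)
    global_action = phi @ actions[chosen]                        # Eq. (11), also works for vector actions
    q_global = float(phi @ Q[np.arange(num_rules), chosen])      # Eq. (13)
    q_star_next = float(phi_next @ Q.max(axis=1))                # Eq. (14) at the next inputs
    td_error = reward + gamma * q_star_next - q_global           # Eq. (15)
    Q[np.arange(num_rules), chosen] += alpha_q * td_error * phi  # Eq. (16)
    return global_action, td_error
```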
In this paper, we define four inputs for the FQL used by each agent, and the inputs for the pursuer are (17)
$$\bar{x}^P = [d_{PE}, \delta_P, d_{PO}, \delta_{PO}], \tag{17}$$
where $d_{PE}$ represents the distance between the pursuer and the evader. The term $\delta_P$ denotes the angle difference between the heading of the pursuer $v_P(t)$ and the straight line $\overrightarrow{PE}(t)$ from the pursuer to the evader. Additionally, $d_{PO}$ stands for the distance between the pursuer and the obstacle, and $\delta_{PO}$ indicates the angle difference between the heading of the pursuer $v_P(t)$ and the straight line $\overrightarrow{PO}(t)$ from the pursuer to the obstacle. For the evader, the inputs of the FQL system are defined as (18)
$$\bar{x}^E = [d_{PE}, \delta_E, d_{EO}, \delta_{EO}], \tag{18}$$
where $\delta_E$ denotes the angle difference between the heading of the evader $v_E(t)$ and the straight line $\overrightarrow{EP}(t)$ from the evader to the pursuer. $d_{EO}$ is the distance between the evader and the obstacle. The term $\delta_{EO}$ indicates the angle difference between the heading of the evader $v_E(t)$ and the straight line $\overrightarrow{EO}(t)$ from the evader to the obstacle.
As we all know, the reward function plays a crucial role in RL, as it significantly influences both the convergence speed and overall performance of the RL algorithm. An appropriate reward function can greatly assist the agent in acquiring an accurate strategy. For instance, by allowing the reward to increase as the agent approaches the optimal solution, the algorithm can converge quickly. To achieve better learning performance, we need to use prior knowledge to design a reasonable reward function for the agent.
In this paper, we design a reward function based on the idea of the artificial potential field. The obstacle exerts repulsive force on the pursuer, while the evader exerts attraction on the pursuer. We model the repulsion exerted by obstacles on the pursuer using an exponential function. The reward function for repulsion can be designed as
$$r_{PO} = 1 - \exp(-\alpha_r \Delta d_{PO}), \tag{19}$$
where $\Delta d_{PO} = d_{PO}(t+1) - d_{PO}(t)$, and $\alpha_r$ is the repulsion coefficient controlling the strength of the repulsive force. A smaller distance increment, i.e., approaching the obstacle, results in higher repulsion and consequently a smaller reward. The evader has a similar form of reward function when approaching obstacles. The reward function for attraction can be formulated as
$$r_{PE} = \exp(-\beta_a \Delta d_{PE}) - 1, \tag{20}$$
where $\Delta d_{PE} = d_{PE}(t+1) - d_{PE}(t)$, and $\beta_a$ is the attraction coefficient governing the strength of the attractive force. A smaller distance increment, i.e., closing on the evader, results in stronger attraction and a larger reward. For the evader, the reward function (20) is inverted. Figure 4 depicts the curves of the reward functions (19) and (20) under different coefficients. It can be observed that when the independent variable, i.e., the distance increment, is zero, the value of the reward function is also zero, which aligns with the requirement for realistic reward values. We incorporate an additional term into the reward function that considers whether the pursuer successfully captures the evader
$$r_s = \gamma_s g_s, \tag{21}$$
where $\gamma_s$ is the coefficient for the success-based reward, and $g_s$ is an indicator function with $g_s = 1$ when the pursuer captures the evader and $g_s = 0$ otherwise. This component incentivizes the pursuer to complete its objective successfully. We formulate a comprehensive reward function using a weighted balance between these forces.
$$r_{total} = w_r r_{PO} + w_a r_{PE} + r_s, \tag{22}$$
where $w_r$ and $w_a$ represent the weights for balancing the influence of repulsion and attraction. We can adjust these weights based on experience, domain knowledge, and the specific requirements of the practical problem. By adjusting these weights, we can control the overall behavior of the reward function, enabling the pursuer to pursue the evader flexibly in a complex environment.
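A compact Python sketch of the composite reward (22), under the exponential repulsion and attraction forms reconstructed in (19) and (20), is given below; the coefficient values follow Section 4.1, and the function signature is an illustrative assumption.

```python
import numpy as np

def pursuer_reward(d_pe_now, d_pe_next, d_po_now, d_po_next, captured,
                   alpha_r=10.0, beta_a=5.0, gamma_s=20.0, w_r=5.0, w_a=10.0):
    """Composite pursuer reward (22) built from the repulsion (19), attraction (20),
    and capture (21) terms. Coefficient values follow Section 4.1."""
    delta_po = d_po_next - d_po_now           # obstacle-distance increment
    delta_pe = d_pe_next - d_pe_now           # evader-distance increment
    r_po = 1.0 - np.exp(-alpha_r * delta_po)  # repulsion: approaching the obstacle is penalized
    r_pe = np.exp(-beta_a * delta_pe) - 1.0   # attraction: closing on the evader is rewarded
    r_s = gamma_s * (1.0 if captured else 0.0)
    return w_r * r_po + w_a * r_pe + r_s
```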
Throughout the learning process, each agent selects an action based on the current state. After taking the action and receiving the corresponding reward, the Q-function updates by (16). The learning process then continues until the final state is reached, that is, the evader is captured by the pursuer or the maximum learning time is reached. Using the FQL algorithm, we obtain the motion information of the evader, including position and velocity information. We define the trajectory of the evader as a parameterized regular curve in the output space of the position system
$$P_E = \left\{ \hat{p}_t \in \mathbb{R}^3 \;\middle|\; \hat{p}_t = [x_r(t), y_r(t), z_r(t)] \right\}, \tag{23}$$
where $\hat{p}_t$ is a time-dependent trajectory parameter. Our objective is to design a robust controller for the pursuer quadrotor with control constraints, enabling the stable and precise tracking of the trajectory $P_E$ with evolving parameters $\hat{p}_t$ of the evader quadrotor. We utilize the FQL and MPC algorithms to collaboratively control the quadrotors in the PEG scenario. The control framework of the quadrotor based on FQL and MPC is shown in Figure 5. We employ the FQL algorithm to train and acquire the agent’s control strategy. Then, the motion trajectory of the evader obtained from FQL serves as a reference input for the MPC algorithm. In each control cycle, the MPC algorithm combines the current state of the quadrotor, including position and velocity, as well as the reference trajectory provided by FQL, to generate control signals for the quadrotor.

3. Quadrotor Control Based on MPC

3.1. The Model of Quadrotor and Control Objective

As shown in Figure 6, in the inertial coordinate system $I = \{I_x, I_y, I_z\}$, the position of the quadrotor’s center of mass is $p = [x, y, z]$. The linear velocity and the angular velocity are expressed as $\dot{\eta} = [\dot{x}, \dot{y}, \dot{z}, \dot{\psi}]$. In the body coordinate system $B = \{B_x, B_y, B_z\}$, the velocity and angular velocity are expressed as $v_B = [u, v, w, r]$. In the initial state, we assume that the body coordinate system coincides with the inertial coordinate system. As the quadrotor’s movement state changes, the body coordinate system and the inertial coordinate system will no longer coincide. To align the body coordinate system with the inertial coordinate system, it is necessary to rotate the inertial coordinate system around the coordinate axes $I_x$, $I_y$, and $I_z$, respectively, by the roll angle $\phi$, pitch angle $\theta$, and yaw angle $\psi$. In this paper, we do not consider the complex dynamic model based on the Euler–Lagrange equations but only focus on the translational motion of the quadrotor. We have
$$\dot{\eta} = R(\psi) v_B, \tag{24}$$
where
$$R(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 & 0 \\ \sin\psi & \cos\psi & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$
We design a four-degree-of-freedom quadrotor model that is decoupled in terms of horizontal and altitude control, as follows:
$$\dot{x} = \begin{bmatrix} \dot{\eta} \\ \dot{v}_B \end{bmatrix} = \begin{bmatrix} R(\psi) v_B \\ -F v_B + G u \end{bmatrix} = g(x, u), \tag{25}$$
where $F = \mathrm{diag}\!\left( \frac{1}{\tau_x}, \frac{1}{\tau_y}, \frac{1}{\tau_z}, \frac{1}{\tau_\psi} \right)$ and $G = \mathrm{diag}\!\left( \frac{k_x}{\tau_x}, \frac{k_y}{\tau_y}, \frac{k_z}{\tau_z}, \frac{k_\psi}{\tau_\psi} \right)$ are system identification parameters used to reduce the impact of system complexity and the lagged response of the quadrotor. $\tau_x$, $\tau_y$, $\tau_z$, $\tau_\psi$ are time constants, and $k_x$, $k_y$, $k_z$, $k_\psi$ are gains. $u$ is the control signal. To facilitate the design of the control law, we present the following assumptions.
Assumption 1. 
The quadrotor’s body is considered a rigid structure, and other external force disturbances are neglected. All control parameters of the quadrotor can be accurately measured.
Assumption 2. 
The reference trajectory $\hat{p}_t$ satisfies the following conditions: $|x_r(t)| \le \bar{x}$, $|y_r(t)| \le \bar{y}$, $|z_r(t)| \le \bar{z}$, $|\dot{x}_r(t)| \le \bar{x}_1$, $|\dot{y}_r(t)| \le \bar{y}_1$, $|\dot{z}_r(t)| \le \bar{z}_1$, $|\ddot{x}_r(t)| \le \bar{x}_2$, $|\ddot{y}_r(t)| \le \bar{y}_2$, $|\ddot{z}_r(t)| \le \bar{z}_2$. The reference system state is $x_r(t) = [\eta_r(t); v_{Br}(t)]$; we have (26)
$$\begin{aligned} \eta_r(t) &= [x_r(t), y_r(t), z_r(t), \psi_r(t)] \\ v_{Br}(t) &= [u_r(t), v_r(t), w_r(t), r_r(t)] . \end{aligned} \tag{26}$$
Lemma 1. 
The reference state $\eta_r(t)$, as well as its first-order derivative $\dot{\eta}_r(t)$ and second-order derivative $\ddot{\eta}_r(t)$, are all bounded.
Proof. 
According to (24), we have $\dot{\psi}_r = r_r$, and it satisfies the following relation:
$$\begin{aligned} |\dot{\psi}_r(t)| &\le \frac{\bar{x}_1 \bar{y}_2 + \bar{x}_2 \bar{y}_1}{\bar{x}_1^2 + \bar{y}_1^2} \\ |\ddot{\psi}_r(t)| &\le \frac{\bar{x}_1 \bar{y}_3 + \bar{x}_3 \bar{y}_1}{\bar{x}_1^2 + \bar{y}_1^2} + \frac{2 (\bar{x}_1 \bar{x}_2 + \bar{y}_1 \bar{y}_2)(\bar{x}_1 \bar{y}_2 + \bar{x}_2 \bar{y}_1)}{(\bar{x}_1^2 + \bar{y}_1^2)^2}. \end{aligned} \tag{27}$$
If Assumption 2 holds, we have
$$\begin{aligned} \bar{\eta} &= \max\left\{ \bar{x}, \bar{y}, \bar{z}, |\psi_r| \right\} \\ \bar{\eta}_1 &= \max\left\{ \bar{x}_1, \bar{y}_1, \bar{z}_1, \frac{\bar{x}_1 \bar{y}_2 + \bar{x}_2 \bar{y}_1}{\bar{x}_1^2 + \bar{y}_1^2} \right\} \\ \bar{\eta}_2 &= \max\left\{ \bar{x}_2, \bar{y}_2, \bar{z}_2, \frac{\bar{x}_1 \bar{y}_3 + \bar{x}_3 \bar{y}_1}{\bar{x}_1^2 + \bar{y}_1^2} + \frac{2 (\bar{x}_1 \bar{x}_2 + \bar{y}_1 \bar{y}_2)(\bar{x}_1 \bar{y}_2 + \bar{x}_2 \bar{y}_1)}{(\bar{x}_1^2 + \bar{y}_1^2)^2} \right\}. \end{aligned} \tag{28}$$
By combining (26) and (28), we can deduce that $\eta_r$, $\dot{\eta}_r$, and $\ddot{\eta}_r$ are bounded and satisfy
$$\|\eta_r\| \le \bar{\eta}, \quad \|\dot{\eta}_r\| \le \bar{\eta}_1, \quad \|\ddot{\eta}_r\| \le \bar{\eta}_2. \tag{29}$$
The proof is completed.    □
If the reference trajectory of the evader and the model of the quadrotor are given by (23) and (25), respectively, the control problem for the pursuer quadrotor can be described as Problem 1.
Problem 1. 
Given a reference trajectory depicted by (23) and the system described by (25), the control objective is to compute appropriate control inputs $u(t)$ that satisfy
$$\lim_{t \to \infty} \left\| p(t + T) - \hat{p}_t \right\| = 0, \tag{30}$$
where $T$ is the control cycle and $p(t)$ is the position of the quadrotor. The control signal satisfies the constraint $\|u(t)\| \le u_{\max}$, where $u_{\max}$ is the maximum amplitude of the control signal.

3.2. Design of the State Feedback Controller Based on MPC

We model the tracking of the quadrotor to the reference trajectory in Problem 1 as an MPC optimization problem:
$$\begin{aligned} \min_{\hat{u}(\cdot)} \ \bar{J} &= \int_{t_k}^{t_k+T} \left( \left\| \tilde{x}(\tau;t_k) \right\|_Q^2 + J\big(\hat{x}(\tau;t_k)\big) + \left\| \hat{u}(\tau;t_k) \right\|_R^2 \right) d\tau && \text{(31a)} \\ \text{s.t.} \quad & \dot{\hat{x}}(\tau;t_k) = g\big(\hat{x}(\tau;t_k), \hat{u}(\tau;t_k)\big) && \text{(31b)} \\ & \hat{x}(t_k;t_k) = x(t_k) && \text{(31c)} \\ & \left\| \hat{u}(\tau;t_k) \right\| \le u_{\max}, \quad \tau \in [t_k, t_k+T] && \text{(31d)} \\ & \frac{\partial V}{\partial x} g\big(\hat{x}(t_k;t_k), \hat{u}(t_k;\hat{x}(t_k;t_k))\big) \le \frac{\partial V}{\partial x} g\big(\hat{x}(t_k;t_k), h(\hat{x}(t_k;t_k))\big), && \text{(31e)} \end{aligned}$$
where $\tilde{x}(\tau) = \hat{x}(\tau) - x_r(\tau)$ is the trajectory tracking error. $Q$ and $R$ are weight matrices. $J(x(t))$ is a cost function for obstacle avoidance:
$$J(x(t)) = \frac{\lambda}{1 + e^{-\Delta d_o(t) \lambda_2}}, \tag{32}$$
where $\lambda$ is an adjustable weight parameter representing the penalty for collisions, and $\Delta d_o(t) = d_o^2 - \|p(t) - p_o\|_2^2$, where $d_o$ is the safety distance for obstacle avoidance and $p_o$ is the location of the obstacle. The function $h(x)$ is a tracking control law based on Lyapunov theory, and $V$ is the corresponding Lyapunov function. Constraint (31e) represents the stability constraint term, which dynamically adjusts the prediction horizon by comparing the program calculation time and the sampling period. The introduction of the stability constraint term can ensure the stability of the control system while achieving a dynamic adjustment between control performance and computational efficiency. We define the sampling period as $\delta = T/N$, where $N$ represents the prediction horizon. The update process of the trajectory tracking and collision-avoidance controller is shown in Figure 7. First, we construct the objective function $\bar{J}$ and obtain the state measurement $x(t_k)$ of the quadrotor. Second, we set $\hat{x}_0 = x(t_k)$ as the initial prediction value, solve the MPC optimization problem (31), and obtain the optimal control sequence $U^* = \{\hat{u}_0^*, \hat{u}_1^*, \ldots, \hat{u}_{N-1}^*\}$. During the sampling period, the control signal $u(t) = \hat{u}_0^*$ is used to control the quadrotor. Third, the increment of the prediction horizon is calculated as $\Delta N = t_{cpu}/T$, where $t_{cpu}$ represents the program processing time. If $t_{cpu} > \delta$, let $N = N - \Delta N$; if $t_{cpu} < \delta$, let $N = N + \Delta N$; otherwise, $N$ remains unchanged. Finally, update $k \leftarrow k + 1$ and $t_{k+1} \leftarrow t_k + \delta$.
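The receding-horizon update of Figure 7 can be sketched as the following Python loop. Here `solve_mpc`, `apply_control`, and `get_reference` are hypothetical callables standing in for an optimizer of problem (31), the quadrotor interface, and the FQL reference trajectory, and the rounding and bounds on the horizon increment are assumptions made for illustration.

```python
import time
import numpy as np

def mpc_loop(x0, get_reference, solve_mpc, apply_control, T=1.0, N_init=10,
             N_min=2, N_max=50, steps=100):
    """Receding-horizon loop of Figure 7 with an adaptive prediction horizon.

    solve_mpc(x, ref, N): hypothetical solver for (31), returning a control sequence.
    apply_control(u): sends u to the plant and returns the next state measurement.
    T is the control cycle; delta = T / N is the sampling period.
    """
    x, N = np.asarray(x0, dtype=float), N_init
    for k in range(steps):
        delta = T / N
        t_start = time.perf_counter()
        u_seq = solve_mpc(x, get_reference(k), N)   # optimal sequence for (31)
        t_cpu = time.perf_counter() - t_start
        x = apply_control(u_seq[0])                 # apply only the first input
        dN = max(1, int(round(t_cpu / T)))          # horizon increment (assumed rounding)
        if t_cpu > delta:
            N = max(N_min, N - dN)                  # shrink horizon when solving is slow
        elif t_cpu < delta:
            N = min(N_max, N + dN)                  # grow horizon when solving is fast
    return x
```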
Below, we present the design and stability analysis of the state feedback control law $h(x)$. We first design a Lyapunov function for the position error as follows:
$$V_1 = \frac{1}{2} e^\top e, \tag{33}$$
where $e = \eta_r - \eta$ is the position error. The first derivative of $V_1$ is
$$\dot{V}_1 = e^\top \dot{e} = e^\top \left( \dot{\eta}_r - R(\psi) v_B + \omega e \right) - \omega e^\top e, \tag{34}$$
where $\omega > 0$ is a parameter of the control law $h(x)$. We define the velocity error as
$$e_v = \dot{\eta}_r - R(\psi) v_B + \omega e = \dot{e} + \omega e. \tag{35}$$
The first derivative of e v is
$$\dot{e}_v = \ddot{\eta}_r - \dot{R}(\psi) v_B - R(\psi) \dot{v}_B + \omega \dot{e} = \ddot{\eta}_r - \dot{R}(\psi) v_B - R(\psi) \dot{v}_B + \omega (e_v - \omega e) = \ddot{\eta}_r - \dot{R}(\psi) v_B - R(\psi)(-F v_B + G u) + \omega (e_v - \omega e). \tag{36}$$
By substituting (35) into (34), we obtain
$$\dot{V}_1 = e^\top e_v - \omega e^\top e. \tag{37}$$
The Lyapunov function V is chosen as (38)
$$V = V_1 + \frac{1}{2} e_v^\top e_v. \tag{38}$$
The first derivative of V is
$$\dot{V} = \dot{V}_1 + e_v^\top \dot{e}_v = -\omega e^\top e + e^\top e_v + e_v^\top \dot{e}_v = -\omega e^\top e - \rho e_v^\top e_v + \rho e_v^\top e_v + e^\top e_v + e_v^\top \dot{e}_v, \tag{39}$$
where $\rho$ is another parameter of the control law $h(x)$. To ensure that, under the control law $h(x)$, the Lyapunov function $V$ satisfies (40),
$$\dot{V}(x, h(x)) = -\omega e^\top e - \rho e_v^\top e_v \le 0, \tag{40}$$
we deduce the control law $h(x)$ as
$$h(x) = G^{-1} F v_B + G^{-1} R^{-1}(\psi) \mu, \tag{41}$$
where μ is defined as follows
$$\mu = \ddot{\eta}_r - \dot{R}(\psi) v_B + (\omega + \rho) e_v + (1 - \omega^2) e. \tag{42}$$
Under the action of the control signal u = h ( x ) , the derivative of the Lyapunov function is
$$\dot{V}(x, u) = -\omega e^\top e - \rho e_v^\top e_v + e_v^\top \left[ \mu - R(\psi)\left( -F v_B + G u \right) \right]. \tag{43}$$
Therefore, the stability constraint can be expressed as
$$\hat{e}_v^\top(t_k; t_k) \left[ \hat{\mu}(t_k; t_k) - R\big(\hat{\psi}(t_k; t_k)\big) \left( -F \hat{v}_B(t_k; t_k) + G \hat{u}(t_k; \hat{x}(t_k; t_k)) \right) \right] \le 0. \tag{44}$$
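For reference, a Python sketch of the Lyapunov-based tracking law (41) and (42) on the decoupled model (25) is shown below; the matrix shapes, function names, and default values of $\omega$ and $\rho$ (taken from Section 4.2) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def rotation(psi):
    """Yaw rotation R(psi) for the 4-state [x, y, z, psi] model."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])

def rotation_dot(psi, r):
    """Time derivative of R(psi) with psi_dot = r."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[-s * r, -c * r, 0, 0], [c * r, -s * r, 0, 0],
                     [0, 0, 0, 0], [0, 0, 0, 0]])

def feedback_law(eta, v_b, eta_r, eta_r_dot, eta_r_ddot, F, G, omega=0.5, rho=1.0):
    """Lyapunov-based tracking law h(x) of (41)-(42).

    eta = [x, y, z, psi]; v_b = [u, v, w, r] in the body frame; F, G are the
    diagonal model matrices of (25); omega and rho follow Section 4.2.
    """
    psi, r = eta[3], v_b[3]
    R = rotation(psi)
    e = eta_r - eta                              # position error
    e_v = eta_r_dot - R @ v_b + omega * e        # velocity error, Eq. (35)
    mu = (eta_r_ddot - rotation_dot(psi, r) @ v_b
          + (omega + rho) * e_v + (1.0 - omega**2) * e)      # Eq. (42)
    G_inv = np.linalg.inv(G)
    return G_inv @ F @ v_b + G_inv @ np.linalg.inv(R) @ mu   # Eq. (41)
```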

3.3. Feasibility Analysis of the State Feedback Control Law

Under the condition of satisfying the input constraint $\|h(x)\| \le u_{\max}$, we prove the existence of a feasible solution to the MPC optimization problem using Lemma 2 and Theorem 1.
Lemma 2. 
According to Lemma 1, under the control law u = h ( x ) , the velocity v B of the quadrotor satisfies
$$\|v_B\| \le \sqrt{2} \left( \bar{\eta}_1 + \|e_v(t_0)\|_2 + \omega \|e(t_0)\|_2 \right), \tag{45}$$
where $e(t_0)$ and $e_v(t_0)$ represent the position error and the velocity error at time $t_0$, respectively.
Proof. 
According to (24), we have
$$v_B = R^{-1}(\psi) \dot{\eta}, \qquad \|v_B\| \le \left\| R^{-1}(\psi) \right\| \|\dot{\eta}\| \le \sqrt{2} \|\dot{\eta}\|. \tag{46}$$
According to Lemma 1, as well as $e = \eta_r - \eta$ and $\dot{e} = e_v - \omega e$, we have
$$\|\dot{\eta}\| = \|\dot{\eta}_r - \dot{e}\| \le \bar{\eta}_1 + \|\dot{e}\| \le \bar{\eta}_1 + \|e_v\| + \omega \|e\|. \tag{47}$$
Due to $\dot{V} < 0$, we have $\|e(t_k)\|_2 \le \|e(t_0)\|_2$ and $\|e_v(t_k)\|_2 \le \|e_v(t_0)\|_2$. Therefore, according to $\|\dot{\eta}\| \le \bar{\eta}_1 + \|e_v(t_0)\|_2 + \omega \|e(t_0)\|_2$, we can conclude that (45) holds. The proof is completed.    □
Theorem 1. 
Let $\bar{g} = \|G^{-1}\|$ and $\bar{f} = \|F\|$, and let the control parameters $\omega$ and $\rho$ of the control law $h(x)$ be positive constants. If (48) is satisfied,
$$\sqrt{2}\, \bar{g} \left[ \bar{f}\, l + \bar{\eta}_2 + 2\sqrt{2}\, l^2 + m \right] \le u_{\max}, \tag{48}$$
where $l = \bar{\eta}_1 + \|e_v(t_0)\|_2 + \omega \|e(t_0)\|_2$ and $m = (\omega + \rho) \|e_v(t_0)\|_2 + (1 - \omega^2) \|e(t_0)\|_2$. Then, the optimization problem (31) has a feasible solution.
Proof. 
According to (41), we have
$$\|h(x)\| \le \left\| G^{-1} \right\| \|F\| \|v_B\| + \left\| G^{-1} \right\| \left\| R^{-1}(\psi) \right\| \|\mu\|. \tag{49}$$
According to the definition of R ( ψ ) , we have
$$\left\| R^{-1}(\psi) \right\| = \| R(\psi) \| = \max\left\{ |\sin\psi| + |\cos\psi|,\; 1 \right\} \le \sqrt{2}. \tag{50}$$
According to (42), we have
$$\|\mu\| \le \bar{\eta}_2 + \left\| \dot{R}(\psi) \right\| \|v_B\| + (\omega + \rho) \|e_v\| + (1 - \omega^2) \|e\|, \tag{51}$$
where
$$\dot{R}(\psi) = \begin{bmatrix} -\sin(\psi) r & -\cos(\psi) r & 0 & 0 \\ \cos(\psi) r & -\sin(\psi) r & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
and $\|\dot{R}(\psi)\| \le \sqrt{2} \|v_B\|$. According to Lemma 2, we can obtain
$$\|\mu\| \le \bar{\eta}_2 + 2\sqrt{2}\, l^2 + m. \tag{52}$$
By substituting (45), (50), and (52) into (49), we obtain
$$\|h(x)\| \le \sqrt{2}\, \bar{g} \left[ \bar{f}\, l + \bar{\eta}_2 + 2\sqrt{2}\, l^2 + m \right]. \tag{53}$$
Therefore, if (48) is satisfied, we have $\|h(x)\| \le u_{\max}$. Consequently, a feasible solution exists for the MPC optimization problem (31). The proof is completed.    □
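As a quick numerical check, the following Python sketch evaluates the left-hand side of condition (48) from the initial errors and the model bounds; a feasible solution is then guaranteed whenever the returned value does not exceed $u_{\max}$. The function name and argument conventions are illustrative assumptions.

```python
import numpy as np

def feasibility_bound(g_bar, f_bar, eta1_bar, eta2_bar, e0, ev0, omega=0.5, rho=1.0):
    """Left-hand side of condition (48): sqrt(2) * g_bar * [f_bar*l + eta2_bar + 2*sqrt(2)*l^2 + m]."""
    l = eta1_bar + np.linalg.norm(ev0) + omega * np.linalg.norm(e0)
    m = (omega + rho) * np.linalg.norm(ev0) + (1.0 - omega**2) * np.linalg.norm(e0)
    return np.sqrt(2.0) * g_bar * (f_bar * l + eta2_bar + 2.0 * np.sqrt(2.0) * l**2 + m)
```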
Through the above analysis, we know that there exists a feasible solution to the optimization problem (31). Next, we analyze the stability of the closed-loop system.

3.4. Stability Analysis of the State Feedback Control Law

We analyze the stability of the closed-loop system using the following Theorem 2.
Theorem 2. 
The state feedback control law h ( x ) enables the system tracking trajectory to converge asymptotically to the reference trajectory.
Proof. 
For the Lyapunov function $V$ in (38), there exist positive definite functions $F_i(\cdot)$, $i = 1, 2, 3$, that satisfy the following inequalities [31]:
$$F_1(\|x\|) \le V(x) \le F_2(\|x\|), \qquad \frac{\partial V(x)}{\partial x} g(x, h(x)) \le -F_3(\|x\|). \tag{54}$$
The Lyapunov function of the closed-loop system satisfies
$$\frac{\partial V(x)}{\partial x} g(x, u) \le \frac{\partial V(x)}{\partial x} g(x, h(x)) \le -F_3(\|x\|). \tag{55}$$
According to the Lyapunov theory, under the control algorithm proposed in this paper, the closed-loop system is asymptotically stable at the equilibrium point x ˜ = 0 . The proof is completed.    □
Finally, we provide the overall algorithmic flowchart of the paper, as shown in Algorithm 1.
Algorithm 1 PEG Algorithm Based on FQL and MPC
1: Initialization: inputs $\bar{x}_0$, rules $l\ (l = 1, 2, \ldots, L)$, action set $A = \{a_1, \ldots, a_m\}$, learning rate $\alpha_q$, discount factor $\gamma$, reward function $r$, and Q-function $Q(\cdot) = 0$
2: for each time step $t$ do
3:   Obtain inputs $\bar{x}_t = [x_1, \ldots, x_n]$ at time $t$
4:   Choose an action $a^l$ for rule $l$ based on (9) or (10)
5:   Calculate the global continuous action $U_t(\bar{x}_t)$ by (11)
6:   Calculate the global Q-function $Q_t(\bar{x}_t)$ by (13)
7:   Take the global action $U_t(\bar{x}_t)$
8:   Obtain the reward $r_{t+1}$ and the inputs $\bar{x}_{t+1}$ at time $t+1$
9:   Calculate the maximum Q-value $Q_t^*(\bar{x}_{t+1})$ by (14)
10:  Calculate the TD error $\tilde{\varepsilon}_{t+1}$ by (15)
11:  Update the Q-function $Q_{t+1}(l, a_t^l)$ by (16) for each rule $l$
12: end for
13: Get the motion information of the evader $P_E$ with the trajectory parameter $\hat{p}_t$ by (23)
14: for each control instant $t_k$ do
15:  Construct the objective function $\bar{J}$ and define $\delta = T/N$
16:  Obtain the state measurement $x(t_k)$
17:  Let $\hat{x}_0 = x(t_k)$, and solve the MPC optimization problem (31)
18:  Obtain the optimal control sequence $U^* = \{\hat{u}_0^*, \hat{u}_1^*, \ldots, \hat{u}_{N-1}^*\}$
19:  Execute the control signal $u(t) = \hat{u}_0^*$ and record the program processing time $t_{cpu}$
20:  Calculate the increment of the prediction horizon $\Delta N = t_{cpu}/T$
21:  if $t_{cpu} > \delta$ then
22:    $N = N - \Delta N$
23:  else if $t_{cpu} < \delta$ then
24:    $N = N + \Delta N$
25:  else
26:    $N = N$
27:  end if
28:  Update $k \leftarrow k + 1$ and $t_{k+1} \leftarrow t_k + \delta$
29: end for

4. Simulation Results and Analysis

First, we evaluated the learning performance of the FQL algorithm in different PEG scenarios. Then, we validated the effectiveness of the MPC algorithm in controlling the quadrotor. Finally, the performance of the FQL and MPC algorithms to jointly control the quadrotor PEG was verified in the Gazebo platform.

4.1. Simulation of Quadrotor PEG Based on FQL

We define a 3D environment for the PEG, where agents move in a space with a size of 35 m × 35 m × 15 m at maximum speeds of $v_P = 1.1$ m/s and $v_E = 1$ m/s, respectively. According to (22), the parameters of the reward functions are set as $\alpha_r = 10$, $\beta_a = 5$, $\gamma_s = 20$, $w_r = 5$, $w_a = 10$. The number of episodes is set to 200, and the number of plays in each game is 500. The sample time is 0.1 s. The discount factor is $\gamma = 0.95$, and the learning rate is $\alpha_q = 0.001$. The game terminates when the pursuer captures the evader, that is, $d_{PE} \le d_s = 0.5$ m, or when the time exceeds 100 s. To reduce the computational load, we select the triangular membership function. Specifically, the FQL has five triangular membership functions for each input. The distance inputs are in the interval $[0, 35]$, and the angle inputs are in the interval $[-\pi, \pi]$. Consequently, we have $5 \times 5 \times 5 \times 5 = 625$ rules. In the 3D environment, we select the controller’s output from the set of tuples $\{(\alpha, \theta) \mid \Delta\alpha, \Delta\theta \in [-\frac{\pi}{4}, \frac{\pi}{4}]\}$.
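A possible configuration of this rule base, consistent with the FQL sketch given after Eq. (16), is shown below; the evenly spaced membership-function centers and the coarse 5 × 5 grid of steering increments are illustrative assumptions rather than the exact discretization used in the experiments.

```python
import numpy as np

# Illustrative rule-base configuration for Section 4.1: five triangular sets per input,
# distance inputs on [0, 35] m and angle inputs on [-pi, pi], giving 5^4 = 625 rules.
dist_centers = np.linspace(0.0, 35.0, 5)
angle_centers = np.linspace(-np.pi, np.pi, 5)
mf_centers = [dist_centers, angle_centers, dist_centers, angle_centers]  # [d_PE, delta_P, d_PO, delta_PO]
mf_widths = [np.diff(c)[0] for c in mf_centers]  # width = spacing between neighboring centers

# Discrete action set: (delta_alpha, delta_theta) steering increments on a coarse grid.
increments = np.linspace(-np.pi / 4, np.pi / 4, 5)
actions = np.array([(da, dt) for da in increments for dt in increments])

num_rules = int(np.prod([len(c) for c in mf_centers]))  # 625
Q = np.zeros((num_rules, len(actions)))                 # one Q-value per (rule, action)
```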
We set up four different scenarios, in which the pursuer starts from the initial position [ 7.5 , 7.5 , 0 ] , and the evader needs to start from the position [ 7.5 , 27.5 , 0 ] , avoid obstacles, and then reach the target point [ 30 , 5 , 10 ] . The objective of the pursuer is to capture the evader before it reaches the target point. The obstacles are randomly placed and represented by cuboids of dimensions 1.5 m × 1.5 m × 15 m . The trajectories of pursuers and evaders are illustrated in Figure 8, where Figure 8a–d represent the 3D trajectories of agents, while Figure 8e–h represent the projections of the 3D trajectories onto the x-y plane. We can observe that in the 3D environment, the pursuer successfully captured the evader as well, without any collisions occurring. As shown in Figure 9, we calculated the distances between the pursuer and the evader in four scenarios. After the PEG ends, the distances between the pursuer and the evader are 0.42 m , 0.44 m , 0.32 m , and 0.35 m , respectively, all of which are less than the set distance d s = 0.5 m .
To verify the superiority of the reward function designed in this paper, we compared the learning performance of the algorithm under different reward functions. In the same scenarios, we compared the performance of the reward function proposed in [32] with that designed in this paper. We repeatedly executed the FQL algorithm with two different reward functions, 20 times each, and then calculated the average path length and capture time of the pursuer, as shown in Figure 10. It can be observed that in all four scenarios, the reward function designed in this paper outperforms the one in [32] in terms of both path length and capture time. The primary reason is that the reward function designed in this paper is a nonlinear function that better aligns with the movement status of the agent. As shown in Figure 4, when the distance between the pursuer and the evader decreases, the reward value increases exponentially rather than linearly, which helps motivate the pursuer to capture the evader more quickly. The simulation results validate the effectiveness of the FQL algorithm and the reward function designed in this paper.

4.2. Simulation of Quadrotor Control Based on MPC

We define the parameters of the quadrotor and the MPC algorithm as follows. For the quadrotor parameters, the gains are set to $k_x = k_y = k_z = k_\psi = 1$, and the time constants are set to $\tau_x = 0.8$, $\tau_y = 0.7$, $\tau_z = 0.5$, $\tau_\psi = 0.5$. For the MPC algorithm parameters, the sampling period is $\delta = 0.1$ s, the weight matrices are $Q = \mathrm{diag}(10^3, 10^3, 10^3, 10^3, 10^2, 10^2, 10^2, 10^2)$ and $R = \mathrm{diag}(1, 1, 1, 1)$, and the parameters of the control law are $\omega = 0.5$ and $\rho = 1$. For the obstacle-avoidance cost function, the adjustable weight parameter representing the penalty for collisions is $\lambda = 500$, and the safety distance for obstacle avoidance is $d_o = 0.5$ m. To verify the effectiveness of the altitude controller based on the MPC, we compared the proposed algorithm with the standard LQR algorithm [33,34,35]. We assumed that the quadrotor takes off from the origin $(0, 0, 0)$ of the coordinate system and flies along the reference trajectory described in (56)
$$P_E = \left\{ \hat{p}_t \in \mathbb{R}^3 \;\middle|\; \hat{p}_t = \left[ \frac{8 \cos\left( \frac{\pi}{20} t \right)}{1 + \sin^2\left( \frac{\pi}{20} t \right)},\; \frac{8 \sin\left( \frac{\pi}{20} t \right) \cos\left( \frac{\pi}{20} t \right)}{1 + \sin^2\left( \frac{\pi}{20} t \right)},\; 2 \left( 1 - \cos\left( \frac{\pi}{20} t \right) \right) \right] \right\}. \tag{56}$$
The entire simulation process lasted for 40 s . The result of 3D trajectory tracking of the quadrotor is shown in Figure 11, where (a) represents the 3D trajectory and (b) is the projection of the 3D trajectory onto the x-y plane. The circle represents the starting position, and the star represents the destination position. It can be seen that both algorithms can guide the quadrotor to track the reference trajectory. As shown in Figure 12, we analyzed the convergence of position and angle tracking. It is clear that the algorithm proposed in this paper exhibits less overshoot and a better response speed.
To further verify the stability of the proposed MPC algorithm, we introduced external disturbances into the simulation and compared it with the standard LQR algorithm. Based on the quadrotor model (25), the height control and horizontal control are decoupled. Therefore, we consider the anti-interference performance of the control algorithm in the horizontal plane. We assume that the quadrotor starts from the initial position ( 2 , 1 ) and flies along the reference trajectory described in (57):
$$\begin{aligned} x_t &= 2.5 + 1.5 \cos\left( \frac{\pi}{20} t - \frac{\pi}{2} \right) \\ y_t &= 2.5 + 1.5 \sin\left( \frac{\pi}{20} t - \frac{\pi}{2} \right). \end{aligned} \tag{57}$$
The entire simulation process lasted for 40 s . External interferences with a period of 2 s and an amplitude of 0.5 m were introduced into the yaw channel at t = 12 s and t = 25 s . As shown in Figure 13, the quadrotor moved in a counterclockwise direction as indicated by the arrow in the figure. We analyzed the performance of the proposed MPC and standard LQR algorithms in controlling the quadrotor’s trajectory tracking under external disturbances. The trajectory tracking curve demonstrates that although both algorithms exhibit excellent trajectory tracking performance and a notable capability to suppress external interference, our proposed MPC algorithm exhibits less overshoot and smaller tracking error. We analyzed the quadrotor’s trajectory tracking results along the coordinate axes, as shown in Figure 14. Specifically, Figure 14a,b present the trajectory tracking results, Figure 14c,d display the trajectory tracking errors, Figure 14e,f show the velocity tracking results, and Figure 14g,h represent the velocity tracking errors. Through Figure 14, it can be seen that our proposed algorithm exhibits superior robustness and resistance to external disturbances.

4.3. Simulation of Quadrotor PEG Based on FQL and MPC

We validated the performance of using FQL and MPC to control quadrotors in PEG on the Gazebo platform. The hardware configuration consisted of an Intel i7-9700 processor and a GeForce RTX 3090 graphics card. The learning algorithm operated on a central computer, which subsequently transmitted the strategies to quadrotors. The MPC algorithm ran in MATLAB, and the control of quadrotors was implemented in the Ubuntu 18.04 operating system based on the robot operating system (ROS). The scenario size was set to 35 m × 35 m × 15 m , and the obstacles were represented by cuboids of dimensions 1.5 m × 1.5 m × 15 m . The parameters of the learning algorithm, MPC algorithm, and the quadrotor were the same as the configurations described in Section 4.1 and Section 4.2. Figure 15 presents screenshots of key moments in four simulation scenarios. We selected one quadrotor as the pursuer, marked by the red circle, and another quadrotor as the evader, marked by the blue circle. The flight trajectories of quadrotors in four scenarios are shown in Figure 16, where Figure 16a–d represent the 3D trajectories of quadrotors, while Figure 16e–h represent the projections of the 3D trajectories onto the x-y plane. Notably, the quadrotor successfully completed the pursuit–evasion task while effectively avoiding obstacles. The effectiveness and feasibility of FQL and MPC joint control of quadrotors to carry out PEG were verified.

5. Conclusions

In this paper, we propose a novel control algorithm that combines the FQL and MPC methods for the joint control of quadrotors in a PEG. The proposed algorithm effectively leverages the perception and decision-making capabilities of FQL and the robust control capabilities of MPC. Initially, we used the generalized Apollonius circle in 3D space to determine the agent’s movement strategy. We trained the agent using the FQL algorithm and designed a reward function based on the concept of the artificial potential field to enhance learning performance. Upon obtaining the evader’s reference trajectory, we designed an MPC algorithm to control quadrotors in the PEG scenario. We developed a state feedback control algorithm for the decoupled quadrotor model, considering constraints on control input and obstacle avoidance. Then, we analyzed the feasibility and stability of the state feedback control law. The simulation results evaluate the control performance of the FQL algorithm and the MPC algorithm separately. Subsequently, utilizing the Gazebo platform, we controlled the quadrotors in the PEG using both the FQL algorithm and the MPC algorithm, and our proposed algorithms exhibited excellent performance. This paper provides valuable insights for combining RL algorithms and traditional control techniques and provides a promising reference for future research. In the future, we will conduct demonstration and validation on a quadrotor platform to compare the performance of the algorithm proposed in this paper with other algorithms. Furthermore, we will explore the integration of actor–critic structured learning algorithms with MPC algorithms. Additionally, we will investigate the integration of multi-objective RL algorithms with nonlinear MPC algorithms to address the PEG problem.

Author Contributions

Conceptualization, P.H.; methodology, C.Z.; software, P.H.; resources, P.H.; data curation, C.Z.; writing—original draft preparation, P.H.; writing—review and editing, C.Z.; supervision, Q.P.; project administration, Q.P.; funding acquisition, Q.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Natural Science Foundation of China (No. 62073264).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Weintraub, I.E.; Pachter, M.; Garcia, E. An introduction to pursuit–evasion differential games. In Proceedings of the 2020 American Control Conference (ACC), Denver, CO, USA, 1–3 July 2020; pp. 1049–1066. [Google Scholar]
  2. Lee, E.S.; Shishika, D.; Loianno, G.; Kumar, V. Defending a perimeter from a ground intruder using an aerial defender: Theory and practice. In Proceedings of the 2021 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), New York City, NY, USA, 25–27 October 2021; pp. 184–189. [Google Scholar]
  3. Biediger, D.; Popov, L.; Becker, A.T. The pursuit and evasion of drones attacking an automated turret. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 9677–9682. [Google Scholar]
  4. Zhang, F.; Zha, W. Evasion strategies of a three-player lifeline game. Sci. China Inf. Sci. 2018, 61, 112206. [Google Scholar] [CrossRef]
  5. Yan, R.; Deng, R.; Lai, H.; Zhang, W.; Shi, Z.; Zhong, Y. Multiplayer homicidal chauffeur reach-avoid games via guaranteed winning strategies. arXiv 2021, arXiv:2107.04709. [Google Scholar]
  6. Liang, L.; Deng, F.; Peng, Z.; Li, X.; Zha, W. A differential game for cooperative target defense. Automatica 2019, 102, 58–71. [Google Scholar] [CrossRef]
  7. Ibragimov, G.; Ferrara, M.; Ruziboev, M.; Pansera, B.A. Linear evasion differential game of one evader and several pursuers with integral constraints. Int. J. Game Theory 2021, 50, 729–750. [Google Scholar] [CrossRef]
  8. Deng, Z.; Kong, Z. Multi-agent cooperative pursuit-defense strategy against one single attacker. IEEE Robot. Autom. Lett. 2020, 5, 5772–5778. [Google Scholar] [CrossRef]
  9. Garcia, E.; Casbeer, D.W.; Von Moll, A.; Pachter, M. Multiple pursuer multiple evader differential games. IEEE Trans. Autom. Control. 2020, 66, 2345–2350. [Google Scholar] [CrossRef]
  10. Yan, R.; Shi, Z.; Zhong, Y. Cooperative strategies for two-evader-one-pursuer reach-avoid differential games. Int. J. Syst. Sci. 2021, 52, 1894–1912. [Google Scholar] [CrossRef]
  11. Yan, R.; Shi, Z.; Zhong, Y. Task assignment for multiplayer reach–avoid games in convex domains via analytical barriers. IEEE Trans. Robot. 2019, 36, 107–124. [Google Scholar] [CrossRef]
  12. Lopez, V.G.; Lewis, F.L.; Wan, Y.; Sanchez, E.N.; Fan, L. Solutions for multiagent pursuit–evasion games on communication graphs: Finite-time capture and asymptotic behaviors. IEEE Trans. Autom. Control. 2019, 65, 1911–1923. [Google Scholar] [CrossRef]
  13. Sani, M.; Hably, A.; Robu, B.; Dumon, J. Real-time game-theoretic model predictive control for differential game of target defense. Asian J. Control. 2023, 25, 3343–3355. [Google Scholar] [CrossRef]
  14. Schwartz, H.M. Multi-Agent Machine Learning: A Reinforcement Approach; John Wiley and Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  15. Eklund, J.M.; Sprinkle, J.; Sastry, S.S. Switched and symmetric pursuit/evasion games using online model predictive control with application to autonomous aircraft. IEEE Trans. Control. Syst. Technol. 2011, 20, 604–620. [Google Scholar] [CrossRef]
  16. De Simone, D.; Scianca, N.; Ferrari, P.; Lanari, L.; Oriolo, G. MPC-based humanoid pursuit–evasion in the presence of obstacles. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 5245–5250. [Google Scholar]
  17. Sani, M.; Robu, B.; Hably, A. Pursuit–Evasion Games Based on Game-Theoretic and Model Predictive Control Algorithms. In Proceedings of the 2021 International Conference on Control, Automation and Diagnosis (ICCAD), Grenoble, France, 3–5 November 2021; pp. 1–6. [Google Scholar]
  18. Sani, M.; Robu, B.; Hably, A. Limited information model predictive control for pursuit–evasion games. In Proceedings of the 2021 60th IEEE Conference on Decision and Control (CDC), Austin, TX, USA, 14–17 December 2021; pp. 265–270. [Google Scholar]
  19. Manoharan, A.; Sujit, P.B. NMPC-Based Cooperative Strategy to Lure Two Attackers Into Collision by Two Targets. IEEE Control. Syst. Lett. 2022, 7, 496–501. [Google Scholar] [CrossRef]
  20. Peng, Y.; Mo, T.; Zheng, D.; Deng, Q.; Wang, J.; Qu, D.; Xie, Y. Model Predictive Control-Based Pursuit–Evasion Games for Unmanned Surface Vessel; Springer: Singapore, 2023. [Google Scholar]
  21. Yu, C.; Dong, Y.; Li, Y.; Chen, Y. Distributed multi-agent deep reinforcement learning for cooperative multi-robot pursuit. J. Eng. 2020, 2020, 499–504. [Google Scholar] [CrossRef]
  22. Wang, Y.; Dong, L.; Sun, C. Cooperative control for multi-player pursuit–evasion games with reinforcement learning. Neurocomputing 2020, 412, 101–114. [Google Scholar] [CrossRef]
  23. Kartal, Y.; Subbarao, K.; Dogan, A.; Lewis, F. Optimal game theoretic solution of the pursuit-evasion intercept problem using on-policy reinforcement learning. Int. J. Robust Nonlinear Control. 2021, 31, 7886–7903. [Google Scholar] [CrossRef]
  24. Yang, B.; Liu, P.; Feng, J.; Li, S. Two-stage pursuit strategy for incomplete-information impulsive space pursuit–evasion mission using reinforcement learning. Aerospace 2021, 8, 299. [Google Scholar] [CrossRef]
  25. Zhang, R.; Zong, Q.; Zhang, X.; Dou, L.; Tian, B. Game of Drones: Multi-UAV Pursuit–Evasion Game with Online Motion Planning by Deep Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2022. [Google Scholar]
  26. Selvakumar, J.; Bakolas, E. Min–Max Q-learning for multi-player pursuit–evasion games. Neurocomputing 2022, 475, 1–14. [Google Scholar] [CrossRef]
  27. Camci, E.; Kayacan, E. Game of drones: UAV pursuit–evasion game with type-2 fuzzy logic controllers tuned by reinforcement learning. In Proceedings of the 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Vancouver, BC, Canada, 24–29 July 2016; pp. 618–625. [Google Scholar]
  28. Awheda, M.D.; Schwartz, H.M. A decentralized fuzzy learning algorithm for pursuit–evasion differential games with superior evaders. J. Intell. Robot. Syst. 2016, 83, 35–53. [Google Scholar] [CrossRef]
  29. Wang, L.; Wang, M.; Yue, T. A fuzzy deterministic policy gradient algorithm for pursuit–evasion differential games. Neurocomputing 2019, 362, 106–117. [Google Scholar] [CrossRef]
  30. Liu, S.; Hu, X.; Dong, K. Adaptive Double Fuzzy Systems Based Q-Learning for Pursuit–Evasion Game. IFAC-PapersOnLine 2022, 55, 251–256. [Google Scholar] [CrossRef]
  31. Vidyasagar, M. Nonlinear Systems Analysis; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2002. [Google Scholar]
  32. Hu, P.; Pan, Q.; Zhao, C.; Guo, Y. Transfer reinforcement learning for multi-agent pursuit-evasion differential game with obstacles in a continuous environment. Asian J. Control. 2024, 26, 2125–2140. [Google Scholar] [CrossRef]
  33. Jafari, H.; Zareh, M.; Roshanian, J.; Nikkhah, A. An optimal guidance law applied to quadrotor using LQR method. Trans. Jpn. Soc. Aeronaut. Space Sci. 2010, 53, 32–39. [Google Scholar] [CrossRef]
  34. Fan, B.; Sun, J.; Yu, Y. A LQR controller for a quadrotor: Design and experiment. In Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation, Wuhan, China, 11–13 November 2016. [Google Scholar]
  35. Ihnak, M.S.; Edardar, M.M. Comparing LQR and PID controllers for quadcopter control effectiveness and cost analysis. In Proceedings of the 2023 IEEE 11th International Conference on Systems and Control, Sousse, Tunisia, 18–20 December 2023. [Google Scholar]
Figure 1. Illustration of the PEG model, where P t represents the pursuer, E t represents the evader, and the cylinder represents the obstacle O.
Figure 2. Illustration of the generalized Apollonius circle in 3D space, where P = ( 10 , 10 , 10 ) , E = ( 10 , 10 , 10 ) , and O A C represent the pursuer, evader, and the center of the generalized Apollonius circle, respectively.
Figure 3. The structure of the FQL algorithm.
Figure 4. The reward function with different coefficients.
Figure 5. The control framework of quadrotor PEG based on FQL and MPC.
Figure 6. The model of the quadrotor.
Figure 7. The update process of trajectory tracking and the collision-avoidance controller.
Figure 8. Trajectories of agents in four scenarios.
Figure 9. The distance variations between the pursuer and the evader in four scenarios.
Figure 10. The path length and capture time of two reward functions in four scenarios. Orange represents the reward function in [32], while blue represents the reward function designed in this paper. (a) represents the path length; (b) represents the capture time.
Figure 11. (a) is the trajectories of the quadrotors in 3D space; (b) is the projection of the quadrotors’ trajectories on the x-y plane.
Figure 12. The position and yaw angle tracking trajectories of the quadrotor.
Figure 13. Quadrotor trajectory tracking curve with the external disturbance.
Figure 14. (a,b) are the trajectory tracking results, (c,d) are the trajectory tracking errors, (e,f) are the velocity tracking results, (g,h) are the velocity tracking errors.
Figure 15. Screenshots of the quadrotor PEG based on the gazebo platform in four scenarios.
Figure 16. The flight trajectories of quadrotors in four scenarios.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
