1. Introduction
Trajectory tracking is a time-series problem: the mobile robot must reach a preset position within a specified time under the control of the trajectory tracking system, which is dedicated to estimating the motion state and motion trajectory of the tracked object over a continuous spatiotemporal sequence. Therefore, a controller with high-performance trajectory tracking capability is required for the mobile robot [1]. With the development of control technology, learning-based robot control has become the latest research hotspot in the field of control [2,3,4], and researchers have proposed methods based on reinforcement learning (RL). RL ignores the dynamic model of the robot and learns the control law from a large amount of motion data, and it has received extensive attention in the field of automatic control. Because it is data-driven, reinforcement learning can reduce the need for complex engineering theory. Model-based reinforcement learning algorithms generally target simple dynamic environments or agents with few actual interactions. Without knowledge of the environmental dynamics model, model-free reinforcement learning algorithms can directly evaluate the quality of a policy or find the optimal value function and optimal policy through the actual interaction between the agent and the environment. Internationally, a great deal of research has been carried out on this problem from theory to experiment, and fruitful results have been achieved in theoretical analysis, numerical calculation and experimental verification [5,6,7,8,9,10,11,12,13]. Among them, the value-based deep reinforcement learning (DRL) algorithms, such as Q-learning, Sarsa, and the deep Q-learning algorithm (DQN), can only handle discrete action spaces, so they can only realize discrete direction control of the robot. In the case of a large-scale or continuous behavior space, it is difficult for value-based reinforcement learning to learn a good result. For trajectory tracking, value-based control methods can hardly achieve accurate tracking with a discrete action space alone; in this case, policy learning can be performed directly.
The policy gradient algorithm proposed by Peters [14] realizes continuous control, and the deterministic policy gradient (DPG) proposed by Silver [15] significantly outperforms the stochastic policy gradient in high-dimensional behavior spaces. Lillicrap [16] introduced DQN on the basis of DPG, combined it with the training characteristics of the DQN neural network, and proposed the DDPG algorithm. Owing to the good performance of DDPG, many researchers have applied it to UAV navigation and target tracking and achieved good results [17]. Gan Z et al. [18] established a UAV tracking model based on deep reinforcement learning and studied the target tracking problem of UAVs. Modares et al. [19] used an off-policy reinforcement learning algorithm to learn the solution of the tracking Hamilton-Jacobi-Isaacs (HJI) equation online, and studied the design of an H∞ tracking controller for nonlinear continuous-time systems with completely unknown dynamics. Ye L et al. [20] proposed reinforcement learning tracking control (RLTC) for unknown continuous dynamic systems to solve the traditional iterative learning control (ILC) problem.
The DDPG algorithm is a deterministic policy algorithm that uses deep learning technology and is based on the Actor-critic framework, so it can better solve the trajectory tracking problem in a continuous behavior space [8]. Among recent research results, Luy NT et al. [21] proposed a reinforcement learning-based design method for mobile robot kinematic and dynamic tracking control, in which the Actor-critic structure uses only one neural network to reduce computational cost and storage resources. In response to the time-consuming nature of RL training, Wang GF et al. [22] proposed a framework based on transfer learning (TL), where the agent extracts knowledge from human-demonstration trajectories of the source task and reuses that knowledge during RL in the target task, showing remarkable results in experiments. Levine S et al. [23] proposed a deep reinforcement learning method based on uncertainty awareness: by estimating the probability of collision, the robot can remain “vigilant” in unfamiliar and unknown environments, reduce its running speed and lower the possibility of collision. In order to reduce the number of trials and errors in the interaction between the UAV and the environment, they also proposed a guided policy search algorithm that learns from optimized data to realize obstacle-avoidance control of the UAV in a simulation environment. Hwangbo J et al. [24] proposed a new deep reinforcement learning algorithm under the Actor-critic framework to achieve UAV path tracking control under arbitrary initial conditions. However, William F K et al. [25] compared three deep reinforcement learning algorithms, namely deep deterministic policy gradient, trust region policy optimization, and proximal policy optimization, with the traditional PID (Proportion Integration Differentiation) algorithm; the experiments show that in a UAV attitude control simulation environment, the proximal policy optimization algorithm is superior to the other three control algorithms in terms of overshoot, rise time and tracking error. B Rubí et al. [26] applied the DDPG algorithm to the path-following problem of a UAV; through training and testing in the RotorS-Gazebo environment, it was shown that the tracking performance of the method was better than that of the NLGL (Nonlinear Guidance Law) method. Zheng Q et al. [27] used the PPO reinforcement learning algorithm to adjust the PID controller gains, and achieved good stability of the aircraft in terms of control, anti-jamming and flying height. In addition, Zhen Y et al. [28,29] proposed a hybrid DDPG (Mi-DDPG) algorithm. Levine S et al. [30] formulated policy search as optimization over trajectory distributions. Yang B et al. [31] used a deep deterministic policy gradient algorithm to obtain a closed-loop trajectory tracking method and showed that it achieved a high tracking rate. Most of the above studies on DDPG focus on optimizing the DDPG algorithm itself, or use a DDPG algorithm to adjust the parameters of traditional controllers to increase system stability.
Aiming at the shortcomings of UAV trajectory tracking control based on the deep deterministic policy gradient, such as low training efficiency and unstable convergence in unknown environments, a new UAV trajectory tracking control algorithm is proposed on the basis of DDPG: the state-compensated deep deterministic policy gradient (CDDPG) algorithm, which uses deep reinforcement learning and fuses different state-space networks. The process is shown in Figure 1. The starting point of the CDDPG algorithm is to improve learning efficiency without reducing training accuracy. In practice, the state compensation network compensates the dynamic loss in real time to improve the following effect; as training proceeds, the role of the compensation network is gradually weakened and reinforcement learning becomes the dominant factor, achieving a fast and accurate following effect. An analogy: before learning a certain skill, students first roughly study most of the content by themselves; an instructor corrects their study direction and guides them precisely, speeding up progress and improving accuracy; this cycle repeats, the students' self-study gradually becomes dominant while the tutor's guidance is weakened, and they finally achieve quick and accurate mastery of the skill. The advantages of CDDPG are therefore stable convergence and a short training period.
The rest of this article is organized as follows. The second section presents the proposed control method. The third section verifies the performance of the method through simulation results. The fourth section is devoted to concluding remarks.
2. Models and Methods
Section 2.1 establishes the Markov decision process model for UAV trajectory tracking and describes the problem formulation of policy search in a non-model-based RL framework.
Section 2.2 presents the proposed method. In this paper, the policy is the controller learned by RL, which is updated after the learning iteration. Also, the policy formulation is undertaken in conjunction with the state compensation.
2.1. Problem Formulation
Consider a UAV dynamic tracking system with state $s_t$ and system input $a_t$, given by
$$s_{t+1}=f(s_t,a_t)+w_t,$$
where $f(\cdot)$ is an unknown function and $w_t$ is a noise function. Model-free reinforcement learning learns the dynamics $f$ by continuously updating the dataset $D=\{(s_t,a_t,s_{t+1})\}_{t=1}^{T}$, where $T$ represents the time steps recorded in all previous steps during training. The dataset is continuously updated during learning, assuming that all states are measurable.
In reinforcement learning, a trajectory tracking Markov decision process (MDP) is defined. The UAV tracking MDP consists of a tuple $(S, A, P, R, \gamma)$, where:
$S$ is a finite state set;
$A$ is a finite behavior set;
$P$ is the behavior-based state transition matrix on the set: $P_{ss'}^{a}=\mathbb{P}\left[S_{t+1}=s' \mid S_t=s, A_t=a\right]$;
$R$ is the state- and behavior-based reward function: $R_{s}^{a}=\mathbb{E}\left[R_{t+1} \mid S_t=s, A_t=a\right]$;
$\gamma$ is a decay factor: $\gamma \in [0,1]$.
According to the Bellman equation, two Bellman equations of the state-value function $v_{\pi}(s)$ and the behavior-value function $q_{\pi}(s,a)$ based on the policy $\pi$ are obtained:
$$v_{\pi}(s)=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma\, v_{\pi}(S_{t+1}) \mid S_t=s\right]$$
$$q_{\pi}(s,a)=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma\, q_{\pi}(S_{t+1},A_{t+1}) \mid S_t=s, A_t=a\right]$$
As we know, the value of an action can be expressed in terms of the values of the subsequent states that the action can reach:
$$q_{\pi}(s,a)=R_{s}^{a}+\gamma \sum_{s' \in S} P_{ss'}^{a}\, v_{\pi}(s')$$
The purpose of reinforcement learning is to find an optimal policy that allows the agent to obtain more gain than under any other policy while interacting with the environment. This optimal policy is represented by $\pi^{*}$.
Here, given the current state $s_t$, the policy predicts the probability of taking the next action $a_t$, from which the prediction for the next state $s_{t+1}$ is obtained:
$$\pi_{\theta}(a_t \mid s_t)=\mathbb{P}\left[a_t \mid s_t ; \theta\right],$$
where $\pi$ is a policy function, $\theta$ is the parameter of $\pi$, and $\pi_{\theta}(a_t \mid s_t)$ represents the probability of taking any possible action $a_t$ under a given state $s_t$ and certain parameters $\theta$. The Markov reward process is then expressed in terms of the return
$$G_t=R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\cdots=\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}.$$
In Equation (7), the training input state is $s_t$ and the output action is $a_t$. The purpose of this article is to design the objective function $J(\theta)$, find the optimal parameters $\theta^{*}$ through the gradient ascent method, and thus find the optimal policy to achieve accurate target tracking.
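As a concrete illustration of this formulation, the following minimal sketch (an assumption for illustration, not the authors' code; the network sizes, learning rate and data interface are hypothetical) shows a stochastic policy $\pi_{\theta}(a \mid s)$ parameterized by a small neural network and one gradient-ascent step on an empirical estimate of $J(\theta)$ from sampled returns.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 9 observed state features, 3 continuous actions (X/Y/Z accelerations).
STATE_DIM, ACTION_DIM = 9, 3

class GaussianPolicy(nn.Module):
    """Stochastic policy pi_theta(a|s): outputs the mean of a Gaussian over actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, ACTION_DIM),
        )
        self.log_std = nn.Parameter(torch.zeros(ACTION_DIM))

    def dist(self, state):
        return torch.distributions.Normal(self.net(state), self.log_std.exp())

policy = GaussianPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_step(states, actions, returns):
    """One REINFORCE-style gradient-ascent step on J(theta) ~ E[log pi_theta(a|s) * G]."""
    log_prob = policy.dist(states).log_prob(actions).sum(dim=-1)
    loss = -(log_prob * returns).mean()   # minimizing -J(theta) is gradient ascent on J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```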
2.2. Proposed Policy
In a DDPG algorithm, the parameters of the policy are randomly initialized to small values. In our experience, the DDPG algorithm is easily biased towards exploration: in the early stage it takes an extremely long time to complete each learning episode with random actions, yielding only minor or meaningless policy updates. Since multiple iterations of learning are unavoidable, applying the untrained DDPG algorithm directly in practice is inefficient.
Let the policy of the original RL be $\pi_{RL}$, and the proposed hybrid policy be $\pi_{h}$. Then, the hybrid control policy is:
$$\pi_{h}=\pi_{RL}+\pi_{c},$$
where $\pi_{c}$ is the compensatory control policy.
The DDPG algorithm is a typical representative of the algorithms commonly used for reinforcement learning problems in continuous action spaces. The proposed policy combines the DDPG algorithm with a state compensation network. As the DDPG algorithm is based on the Actor-critic algorithm, two hybrid algorithms, CQAC and CDDPG, are discussed below.
2.2.1. DDPG Algorithm
The Actor-critic algorithm consists of a policy function and a behavior value function, where the policy function acts as an actor, generating behavior and interacting with the environment; the behavior value function acts as a critic, which is responsible for evaluating the actor’s performance and guiding the actor’s follow-up actions. The QAC algorithm does not require a complete state sequence, but since the introduced critic is still an approximate value function, there is a possibility of introducing bias.
The DDPG algorithm is derived from the DQN algorithm and is a typical reinforcement learning algorithm for solving continuous control problems. It is based on the Actor-critic algorithm but differs from it: a noise function is added to the deterministic behavior during learning to realize small-scale behavior exploration. Moreover, backup parameters are added to the actor network and the critic network as target networks, so as to achieve a dual parameter setting and increase the probability of convergence (Figure 2).
The conversion process is outlined below: (1) the current state $s_t$ generates a specific behavior $a_t$ through the actor network; (2) the critic network calculates the behavior value $Q(s_t,a_t)$ corresponding to $s_t$ and $a_t$; (3) the target actor network generates the action $a_{t+1}'$ used to estimate the value according to the subsequent state $s_{t+1}$ given by the environment; (4) the target critic network generates the behavior value used to calculate the target value according to $s_{t+1}$, $a_{t+1}'$ and the reward $R$.
The task of the actor in the DDPG algorithm is to find the optimal behavior that maximizes the value of the output behavior. The reward obtained by executing $a_t$ is $r_t$, and this transformation process is stored as $(s_t, a_t, r_t, s_{t+1})$. After the system has stored enough transformation processes, it randomly selects $X$ samples $(s_i, a_i, r_i, s_{i+1})$ for calculation, where $i=1,2,\ldots,X$. The calculation of the target value function $y_i$ is [16]:
$$y_i=r_i+\gamma\, Q'\!\left(s_{i+1}, \mu'\!\left(s_{i+1} \mid \theta^{\mu'}\right) \mid \theta^{Q'}\right)$$
The loss of the Critic network is the TD-error (temporal-difference error):
$$L=\frac{1}{X} \sum_{i}\left(y_i-Q\!\left(s_i, a_i \mid \theta^{Q}\right)\right)^{2}$$
The actor network is updated using gradient ascent:
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{X} \sum_{i} \nabla_{a} Q\!\left(s, a \mid \theta^{Q}\right)\big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^{\mu}} \mu\!\left(s \mid \theta^{\mu}\right)\big|_{s=s_i}$$
The target networks are updated by:
$$\theta^{Q'} \leftarrow \tau\, \theta^{Q}+(1-\tau)\, \theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\, \theta^{\mu}+(1-\tau)\, \theta^{\mu'}$$
where $\gamma$ is the decay factor and $\tau$ is the update rate parameter; $\theta^{Q}$ and $\theta^{\mu}$ are the weights of the Critic network value function $Q$ and the actor network $\mu$; $\theta^{Q'}$ and $\theta^{\mu'}$ are the weights of the target Critic network value function $Q'$ and the target actor network $\mu'$.
2.2.2. CDDPG Algorithm
A state compensation network is added to the DDPG algorithm to form the CDDPG controller. The control policy of the hybrid controller $\pi_{h}$ is:
$$\pi_{h}(s)=\mu\!\left(s \mid \theta^{\mu}\right)+C\!\left(s \mid \theta^{C}\right),$$
where $\theta^{\mu}$ is the parameter of the DDPG control policy $\mu$ and $\theta^{C}$ is the parameter of the state compensation control policy $C$. In this algorithm, all parameters are updated iteratively: $\theta^{\mu}$ of $\mu$ is determined by the gradient ascent method, and $\theta^{C}$ of $C$ is determined by the gradient descent method.
Rewriting Equations (11) and (12) yields the deduction equation, from which the objective gradient function of the compensation branch can be obtained. In the resulting formulas, $\alpha_{C}$ is the learning rate of the state compensation network (C-Net) $C$; $\theta^{C}$ is the weight of C-Net; $\theta^{C'}$ is the weight of the target state compensation network (TC-Net) $C'$.
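The following sketch illustrates, under assumed interfaces (the `c_net` module, the weighting coefficient `comp_weight`, and the supervision signal are hypothetical), how the C-Net output could be weighted into the DDPG action and how C-Net itself could be trained by gradient descent on the tracking error:

```python
import torch
import torch.nn.functional as F

def hybrid_action(actor, c_net, state, comp_weight):
    """Hybrid CDDPG policy: DDPG action plus the weighted state-compensation output."""
    with torch.no_grad():
        return actor(state) + comp_weight * c_net(state)

def cnet_update(c_net, c_opt, states_after_action, target_difference):
    """Gradient-descent step on C-Net.

    The text describes the C-Net output as the target difference given the post-action
    state, so here it is regressed onto that difference (an assumed surrogate for the
    average tracking error objective)."""
    loss = F.mse_loss(c_net(states_after_action), target_difference)
    c_opt.zero_grad()
    loss.backward()
    c_opt.step()
    return loss.item()
```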
Figure 3 and Algorithm 1 outline the proposed algorithm, where $a_{c}$ and $a_{c}'$ are the compensated actions calculated by C-Net and TC-Net, respectively.
Algorithm 1: CDDPG Algorithm
Input: experience cache space $D$, episode limit Limit, steps per episode $T$
Output: optimized actor policy $\mu(s \mid \theta^{\mu})$
Randomly initialize the critic $Q(s,a \mid \theta^{Q})$, the actor $\mu(s \mid \theta^{\mu})$ and C-Net $C(s \mid \theta^{C})$ with weights $\theta^{Q}$, $\theta^{\mu}$, $\theta^{C}$
Initialize the target networks $Q'$, $\mu'$, $C'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$, $\theta^{C'} \leftarrow \theta^{C}$
Initialize the experience cache space $D$
for episode = 1 to Limit do
  Initialize a noise function $N$ and receive the initial state $s_{1}$
  for t = 1 to T do
    Perform action $a_{t}$, get reward $r_{t}$ and next state $s_{t+1}$, store the transition $(s_{t},a_{t},r_{t},s_{t+1})$ in $D$
    Sample a batch of $X$ random transitions $(s_{i},a_{i},r_{i},s_{i+1})$ from $D$, where $i=1,\ldots,X$
    Set the target value function via Equation (16)
    Update the critic by minimizing the loss via Equation (17)
    Update the actor policy by the policy gradient via Equation (18)
    Update the target networks via Equations (14) and (19)
  end for
end for
The above is the derivation of the CDDPG algorithm from the DDPG algorithm, but the same reasoning also applies to the QAC algorithm; the QAC algorithm combined with the state compensation network is denoted CQAC in the same way. We compare the performance of the QAC and DDPG algorithms with that of the CQAC and CDDPG algorithms in the third section.
3. Simulations and Results Discussion
The algorithm is solved with the UAV position and velocity parameters known. Assuming that the attitude control is stable, this paper focuses on the driving force control of the UAV in three-dimensional space of the Cartesian coordinate system.
In the algorithm simulations, the current state features of the UAV and the position of the following target are used as the state input of the algorithm framework, and the UAV's drive acceleration in the X, Y and Z directions of three-dimensional space is used as the output. The learning process repeats the policy-function solution process. During trajectory tracking, the measurement and control system transmits the current position information, speed information and coordinate data of the following target to the UAV at each time step, and at the same time determines the reward value. The larger the following error, the lower the reward value, and the maximum reward value is zero.
3.1. Training Framework and Results
In CDDPG, the critic network evaluates the value of the UAV in the current state to guide the policy in generating actions, and the actor network is responsible for generating specific actions according to the current state. The input accepted by the critic network consists of the features observed by the UAV and the features of the action, and the output is the state-action value. Considering the UAV's need for feature extraction from the observed state, we design a total of three hidden layers in the critic. Since the number of elements in the input layer is the sum of the number of state elements and the number of behavior elements, the number of hidden layers is determined by the accuracy and delay of the algorithm: too many hidden layers increase the delay of the algorithm, while too few reduce its accuracy and worsen the effect. Taking the above factors into consideration, the input of the critic network designed in this paper contains nine states and three actions, and the output is a Q value. After several simulations, to maintain a good training effect without too much delay, the number of hidden layers is set to three. The hidden layers that process the state and the action operate separately first, are fully connected through the last hidden layer, and then output the state-action value together. The input to the actor network is the state observed by the UAV, and the output is the action the UAV will perform. The actor network is likewise designed with fully connected hidden layers. The architecture of this network is shown in Figure 4. In order to realize exploration, the algorithm adopts the Ornstein-Uhlenbeck process as the noise model and adds random noise to the generated action so that a certain range of exploration around the exact action can be realized.
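A minimal sketch of such an architecture is given below (layer widths and the Ornstein-Uhlenbeck parameters are assumptions; only the structure described above, namely nine state features, three actions, separate state/action branches fused in the last hidden layer, and OU exploration noise, is taken from the text):

```python
import numpy as np
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 9, 3   # nine observed state features, three acceleration outputs

class Critic(nn.Module):
    """Three hidden layers: state and action are processed separately first,
    then fused through the last, fully connected hidden layer into a single Q value."""
    def __init__(self, hidden=64):   # hidden width is an assumption
        super().__init__()
        self.state_branch = nn.Sequential(nn.Linear(STATE_DIM, hidden), nn.ReLU())
        self.action_branch = nn.Sequential(nn.Linear(ACTION_DIM, hidden), nn.ReLU())
        self.fuse = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.fuse(torch.cat([self.state_branch(state),
                                    self.action_branch(action)], dim=-1))

class Actor(nn.Module):
    """Maps the observed state to the action (X/Y/Z drive accelerations); fully connected."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, ACTION_DIM), nn.Tanh())

    def forward(self, state):
        return self.net(state)

class OUNoise:
    """Ornstein-Uhlenbeck process for small-scale exploration around the exact action."""
    def __init__(self, dim=ACTION_DIM, theta=0.15, sigma=0.2, dt=1e-2):   # assumed parameters
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)

    def sample(self):
        dx = -self.theta * self.x * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x
```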
In the compensation simulation, C-Net is trained as a neural network and optimized with the average tracking error as the objective function. The input of the network is the state after the UAV performs the action, and the output is the target difference, which is weighted into the output action that interacts with the environment. The C-Net is designed with a total of three hidden layers, and the layers are fully connected (Figure 5). After this neural network is embedded in the controller, trajectory tracking training is performed.
In the simulation, the tracking accuracy is set as r; that is, if the UAV is located within a sphere of radius r centered on the tracking target, the follow is considered successful and a larger reward is obtained. The simulation is carried out using a deep reinforcement learning framework incorporating a state compensation network. The proposed method is simulated with OpenAI Gym, and the simulation computer is configured with an Intel(R) Core(TM) i5-7300HQ. The UAV flight range is a three-dimensional space of 10 × 10 × 10 (m). Taking the tracking accuracy r as 0.3 m, the proposed method is used to simulate and analyze the control policy of the data model after basic experience learning, and each successful follow-up is regarded as an episode.
Four algorithms, QAC, DDPG, CQAC, and CDDPG, are used for the target-following training. A complete episode contains multiple iteration steps, and the reward after each iteration step is the negative value of the linear distance between the tracking point and the target point. The episode reward is defined as the sum of the rewards of all iteration steps in a complete episode. The learning weights are set accordingly and the simulation is run.
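The reward described here can be sketched as follows (a hypothetical helper assuming the negative Euclidean distance as the per-step reward and the success radius r = 0.3 m from the text; ending the episode on a successful follow is our reading of the setup):

```python
import numpy as np

TRACKING_RADIUS = 0.3   # tracking accuracy r in metres

def step_reward(uav_pos, target_pos):
    """Per-step reward: negative Euclidean distance to the target (maximum reward is zero).

    Also returns a flag marking a successful follow (UAV within radius r of the target),
    which is treated as the end of an episode."""
    distance = float(np.linalg.norm(np.asarray(uav_pos) - np.asarray(target_pos)))
    return -distance, distance <= TRACKING_RADIUS

def episode_reward(step_rewards):
    """Episode reward: sum of the step rewards over all iterations of a complete episode."""
    return float(np.sum(step_rewards))
```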
After 1000 episodes are completed, all training models are saved to prepare for subsequent effect verification.
Figure 6 shows the episode rewards over 1000 episodes. From the 11th episode onward, the rewards are all close to zero, but judging from the rewards of the earlier episodes, the CQAC and CDDPG algorithms improve significantly over QAC and DDPG during the training process. The same can be seen from the curve of the iteration steps experienced by each episode in Figure 7: in the early stage of training, the QAC and DDPG algorithms have to go through tens of thousands of iterations for each complete episode, while the iteration steps required by CQAC and CDDPG are only about 30% of those of the original algorithms. Since the computation time cost per step of the four algorithms is basically the same (about 0.998 milliseconds, Figure 8) under the same computer configuration, this shows that the training time is reduced by about 70% at the same training accuracy. The same conclusion is confirmed by the comparison of the total iterations of the four algorithms (Figure 9).
3.2. Trajectory Tracking Simulation
To verify the dynamic tracking effect of the above training models, we designed a simulation experiment in which a target point performing a spiral trajectory in the Cartesian coordinate system is tracked as accurately as possible. In this simulation, the target point moves at a constant speed to complete the preset trajectory: with a constant angular velocity in the X-Y plane and a constant rising velocity along the Z axis, it completes two turns of helical motion with a radius of 3 m and a fixed pitch in space, taking 60 s in total (a sketch of how such a reference trajectory can be generated is given below).
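The reference trajectory could be generated as follows (a sketch under stated assumptions: the angular velocity is derived from two turns in 60 s, while the pitch value is not given in the text and is left as a placeholder parameter):

```python
import numpy as np

def reference_helix(duration=60.0, n_turns=2, radius=3.0, pitch=1.0, dt=0.1):
    """Generate the reference spiral trajectory: uniform circular motion in the X-Y
    plane plus a uniform rise along Z.

    pitch (rise per turn, in metres) is an assumed placeholder; radius, number of
    turns and duration follow the simulation description."""
    t = np.arange(0.0, duration + dt, dt)
    omega = 2.0 * np.pi * n_turns / duration      # derived angular velocity [rad/s]
    climb_rate = pitch * n_turns / duration       # derived rising velocity [m/s]
    x = radius * np.cos(omega * t)
    y = radius * np.sin(omega * t)
    z = climb_rate * t
    return np.stack([x, y, z], axis=1)            # shape: (len(t), 3)
```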
Using the four trained algorithm models, Figure 10, Figure 11, Figure 12 and Figure 13 show the tracking trajectories of the four algorithms in three-dimensional space and the tracking trajectories in the X, Y and Z directions. Visually, the tracked trajectories under the CQAC and CDDPG algorithms are closer to the reference trajectories than those under the QAC and DDPG algorithms. In order to further clarify the superiority of the proposed algorithm, we analyzed the specific tracking error data.
Figure 14a plots the comprehensive tracking errors under each algorithm, and Figure 14b–d plot the tracking errors along the three coordinate axes; the tracking error is thus quantified and analyzed from the simulation data. It can be seen from Figure 14 that the comprehensive tracking in the simulation is relatively stable. Since the tracking accuracy is set to 0.3 m during model training, the comprehensive tracking error of the QAC and DDPG algorithms basically fluctuates around 0.3 m, while that of the CQAC and CDDPG algorithms basically fluctuates around 0.15 m. Comparing the tracking error curves along the three coordinate axes, the tracking errors of the CQAC and CDDPG algorithms are reduced to half of those of the original QAC and DDPG; the error fluctuation range is also greatly reduced, and the error trend is more stable than with the original algorithms. Clearly, owing to the added compensation network, the tracking accuracy and convergence stability of the proposed method are effectively improved; for the same training time, the stable tracking error of the proposed method is reduced by about 50% compared with the original algorithms. This shows that the compensation network added to the controller produces more active control to reduce the position error and improve the following effect during UAV flight.
Based on the above simulation, the tracking speed and its error are also examined. Figure 15 plots the comparison of the tracking speed and its error in the three directions during the tracking process. According to the simulation settings, the target point performs a uniform circular motion in the X-Y plane and a uniform upward motion along the Z axis; that is, the speed along the X and Y axes follows trigonometric functions, i.e., variable-acceleration motion. The comparison of the tracking speed in each axis direction shows that the speed during tracking is basically consistent with the reference speed, and the speed errors differ little among the algorithms. From the above discussion, the size of the tracking error better represents the quality of the tracking effect.
In order to determine the effect of uniformly accelerated motion along the Z axis on the simulation, the following simulation adds a constant acceleration (0.033 m/s²) along the Z axis on top of the above simulation conditions. The target point performs a half-turn helical motion (Figure 16), its comprehensive error (Figure 17) and position error on the Z axis (Figure 18) are observed, and the stability of each algorithm is analyzed.
After analysis, the simulation results are consistent with the previous results. The proposed algorithm still maintains good tracking characteristics, and uniform acceleration in the Z-axis direction does not affect the following effect, which shows that the proposed algorithm retains good accuracy and stability.

By comparing the trajectory tracking simulation results, it is found that the tracking effect and error of the CDDPG algorithm are better than those of the DDPG algorithm. The designed hybrid controller for trajectory tracking based on CDDPG overcomes the disadvantage that it is difficult for reinforcement learning to form high-quality behaviors in a short time. It shortens the learning iteration period and reduces the tracking error, speeding up convergence while maintaining good convergence stability.
4. Conclusions
Reinforcement learning may require many iterations during learning, and the long initial policy training period results in a large amount of training time for controller development. This paper designs a CDDPG algorithm based on deep reinforcement learning to shorten the training time, improve learning efficiency, and stabilize the balance between exploration and exploitation. Aiming at the UAV trajectory tracking control problem for a system with an unknown model, a compensation control algorithm combined with RL is proposed. The simulation results show that: (1) adding a compensation network significantly improves training efficiency, and accuracy and convergence stability are also effectively improved; (2) under the same configuration, the computational cost of the proposed algorithm is basically the same as that of the DDPG algorithm; (3) the training time is about 70% lower than that of QAC and DDPG; (4) the tracking error is about 50% lower than that of QAC and DDPG. In these respects, the proposed method outperforms model-free reinforcement learning algorithms represented by DDPG. It breaks with the traditional idea of using reinforcement learning only to adjust or optimize system parameters and instead embeds a compensation network into the reinforcement learning method, which is the innovation of this paper.
The work in this paper is a simulation performed under ideal conditions, with the movement of the UAV confined to a fixed space, and it does not address the attitude stability or anti-jamming capability of the aircraft. In further studies, these problems will be considered, more comprehensive simulations will be carried out, real-machine experiments will be added as far as possible, and in-depth research will be conducted in real scenarios.