Article

Trajectory Tracking Control Based on Deep Reinforcement Learning for a Robotic Manipulator with an Input Deadzone

School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(2), 149; https://doi.org/10.3390/sym17020149
Submission received: 8 November 2024 / Revised: 12 January 2025 / Accepted: 17 January 2025 / Published: 21 January 2025
(This article belongs to the Section Computer)

Abstract

This paper proposes a deep reinforcement learning (DRL) method that combines random network distillation (RND) and long short-term memory (LSTM) to address the tracking control problem, while leveraging the inherent symmetry in robotic arm movements to eliminate the need for learning or knowing the system’s dynamic model. In general, the complexity and strong coupling of robotic manipulators make trajectory tracking extremely challenging. Firstly, a prediction network is trained against a fixed, randomly initialized target network using the RND method. The difference in output values between the two networks acts as an internal reward for the robotic manipulator environment. This internal reward mechanism encourages the robotic arm agent to actively explore unpredictable and unknown environmental states, thereby boosting the performance and efficiency of the tracking control for the robotic manipulator. Then, the Soft Actor-Critic (SAC) algorithm, the LSTM network, and the attention mechanism are integrated to resolve the instability problem during training and acquire a stable policy. The LSTM model effectively captures the symmetry and temporal changes in joint angles, while the attention mechanism dynamically prioritizes important features, thereby reducing the instability of the robotic manipulator during tracking tasks and enhancing feature extraction efficiency. The simulation outcomes demonstrate that the proposed method effectively performs the robot tracking task, confirming the efficacy and efficiency of the DRL algorithm.

1. Introduction

The advent of artificial intelligence (AI) has significantly impacted various fields, including the control of robotic manipulators [1]. The control of manipulators is a complex task that requires precision and adaptability to varying environments and tasks. Traditional control methods have been widely used, offering robust solutions for specific, well-defined problems. Several approaches have been proposed to attain satisfactory tracking control results, including sliding mode control, PID control, model predictive control, and adaptive tracking control [2,3,4,5,6,7]. However, these methods often lack the flexibility and learning capabilities needed for more dynamic and unpredictable environments. Fortunately, neural networks, with the ability to learn from data and generalize to new situations, have been employed to enhance the performance of robotic systems. In [8], an adaptive tracking control policy based on neural networks is proposed for robot manipulators. In [9], a new control method based on neural networks is proposed to control the robotic manipulator, utilizing neural networks to approximate its dynamic model. In [10], the development and experimental verification of a dual-loop nonlinear controller for a quadrotor, based on an adaptive neural network, is presented.
However, neural networks often require vast amounts of labeled training data and can struggle with generalizing to entirely new scenarios without retraining. They are prone to overfitting and may not perform well in highly dynamic or unpredictable environments. Reinforcement learning (RL) has attracted attention for its applications in environments where systems can learn optimal control policies through trial and error. In [11], an algorithm utilizing Q-learning (QL) and deep Q networks (DQN) is introduced for training a rotary inverted pendulum system. The SARSA (State-Action-Reward-State-Action) algorithm is applied to control the positioning of the end effector of a three-degree-of-freedom arm, targeting both fixed and random points in [12]. While QL, DQN, and SARSA are useful for addressing specific RL tasks, their primary application is in discrete spaces. Robotic arms need to operate with a high degree of precision, requiring smooth and continuous control signals that discrete algorithms cannot efficiently provide. Continuous RL algorithms have further advanced the field by allowing control in continuous action spaces, which is more natural for a robotic manipulator. In [13], a deep reinforcement learning (DRL) control method is proposed for position and attitude tracking of bionic underwater vehicles (BUV), where the Soft Actor-Critic (SAC) algorithm is used to train the controller through interaction with a simulated BUV.
However, in highly coupled and complex environments such as robotic arms, these algorithms often exhibit significant instability during training. This is where the integration of long short-term memory (LSTM) neural networks and an attention mechanism with SAC can effectively mitigate these instabilities. LSTM networks are designed to handle sequential data and maintain long-term dependencies, making them ideal for processing the time-series data that are prevalent in robotic control tasks [14]. In [15], the SAC algorithm and the random network distillation (RND) method are combined to control a robotic arm to capture the target. The internal rewards generated by RND often motivate the agent to explore more extensively to speed up the convergence of the algorithm.
This paper aims to draw on the advantages of the LSTM network, attention mechanism, and RND to design a DRL trajectory tracking controller for a robotic manipulator with a deadzone. The primary contributions are summarized in the following points.
(1) By generating intrinsic rewards based on the prediction error of a neural network, RND motivates the agent to explore more extensively [15]. This extensive exploration accelerates the convergence of the algorithm by encouraging the agent to visit a wider variety of states, which is crucial in complex robotic tasks.
(2) In highly coupled and dynamic environments like robotic manipulators, maintaining stable learning is challenging. The LSTM component processes time-series data efficiently, capturing the dependencies between consecutive actions and states [14], and the attention mechanism can dynamically focus on critical features. This capability significantly mitigates the instability issues during training, leading to more reliable and consistent policy performance.
(3) SAC provides a robust framework for learning policies in continuous action spaces, ensuring smooth and precise movements. The LSTM component and the attention mechanism enhance the ability of the algorithm to handle the temporal dependencies and stabilize training. RND introduces a mechanism for intrinsic motivation, driving the agent to explore and learn more efficiently. Together, these components create a powerful and stable control strategy for robotic arms, capable of executing complex tasks with high precision and reliability.

2. System Description

Dynamics Model

By deriving equations from the Lagrangian function, we can obtain the dynamic equations of the system. The dynamic equations of the n-degree-of-freedom manipulator are as follows [16]:
$$M(\psi)\ddot{\psi} + C(\psi,\dot{\psi})\dot{\psi} + G(\psi) = D(\tau)$$
The variables $\psi$, $\dot{\psi}$, and $\ddot{\psi} \in \mathbb{R}^{n}$ correspond to the joint position, joint velocity, and joint acceleration, respectively. The inertia matrix is represented by $M(\psi) \in \mathbb{R}^{n \times n}$, the centripetal and Coriolis matrix by $C(\psi,\dot{\psi}) \in \mathbb{R}^{n \times n}$, and the gravitational torque by $G(\psi) \in \mathbb{R}^{n}$; $\tau \in \mathbb{R}^{n}$ represents the joint torque, and $D(\tau)$ is the deadzone function applied to the control input $\tau$. The deadzone $D(\tau_i)$ is expressed as
$$D(\tau_i) = \begin{cases} h_r(\tau_i - b_r), & \tau_i \ge b_r \\ 0, & b_l < \tau_i < b_r \\ h_l(\tau_i - b_l), & \tau_i \le b_l \end{cases}$$
where $h_r$ and $h_l$ are known nonlinear functions, and $b_r > 0$ and $b_l < 0$ are known constants. This paper investigates the Phantom Omni robot, as illustrated in Figure 1a. The schematic diagram, showing the reference frames used in the dynamics, is shown in Figure 1b. The robot is a three-DoF manipulator, and its dynamics are described by Equation (1), with the matrices M, C, and G given in [17]:
$$M = \begin{bmatrix} m_{11} & 0 & 0 \\ 0 & m_{22} & m_{23} \\ 0 & m_{32} & m_{33} \end{bmatrix}, \quad G = \begin{bmatrix} 0 \\ g k_5 c_2 + g k_6 c_{23} \\ g k_6 c_{23} \end{bmatrix}, \quad C = \begin{bmatrix} a_1 \dot{\psi}_2 & a_1 \dot{\psi}_1 & a_2 \dot{\psi}_1 \\ a_1 \dot{\psi}_1 & a_3 \dot{\psi}_3 & a_3 (\dot{\psi}_2 + \dot{\psi}_3) \\ a_2 \dot{\psi}_1 & a_3 \dot{\psi}_2 & 0 \end{bmatrix}$$
where $m_{11} = k_1 + k_2 c_2^{2} + k_3 c_{23}^{2} + 2 k_4 c_2 c_{23}$, $m_{22} = k_2 + k_3 + 2 k_4 c_3$, $m_{23} = k_3 + k_4 c_3$, $m_{32} = m_{23}$, $m_{33} = k_3$, $a_1 = k_2 c_2 s_2 + k_3 c_{23} s_{23} + k_4 c_{2 \times 23}$, $a_2 = k_3 c_{23} s_{23} + k_4 c_2 s_{23}$, $a_3 = k_4 s_3$, with $s_i = \sin(\psi_i)$, $s_{23} = \sin(\psi_2 + \psi_3)$, $c_i = \cos(\psi_i)$, $c_{23} = \cos(\psi_2 + \psi_3)$, and $c_{2 \times 23} = \cos(2\psi_2 + \psi_3)$.
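For concreteness, the sketch below shows one way to assemble these dynamics terms and the input deadzone in Python. The inertia parameters $k_1$–$k_6$, the deadzone bounds $b_r$, $b_l$, and the linear choices for $h_r$ and $h_l$ are illustrative placeholders, not the identified Phantom Omni values.

```python
import numpy as np

# Placeholder inertia parameters; the identified values for the Phantom Omni
# are given in [17] and are not reproduced here.
K = dict(k1=0.004, k2=0.003, k3=0.001, k4=0.001, k5=0.02, k6=0.01)
G_ACC = 9.81  # gravitational acceleration

def dynamics_matrices(psi, dpsi, k=K, g=G_ACC):
    """Return M(psi), C(psi, dpsi), G(psi) with the structure given above."""
    _, p2, p3 = psi
    c2, s2 = np.cos(p2), np.sin(p2)
    c3, s3 = np.cos(p3), np.sin(p3)
    c23, s23 = np.cos(p2 + p3), np.sin(p2 + p3)
    c2x23 = np.cos(2.0 * p2 + p3)

    m11 = k['k1'] + k['k2'] * c2**2 + k['k3'] * c23**2 + 2.0 * k['k4'] * c2 * c23
    m22 = k['k2'] + k['k3'] + 2.0 * k['k4'] * c3
    m23 = k['k3'] + k['k4'] * c3
    M = np.array([[m11, 0.0, 0.0],
                  [0.0, m22, m23],
                  [0.0, m23, k['k3']]])

    a1 = k['k2'] * c2 * s2 + k['k3'] * c23 * s23 + k['k4'] * c2x23
    a2 = k['k3'] * c23 * s23 + k['k4'] * c2 * s23
    a3 = k['k4'] * s3
    d1, d2, d3 = dpsi
    C = np.array([[a1 * d2, a1 * d1, a2 * d1],
                  [a1 * d1, a3 * d3, a3 * (d2 + d3)],
                  [a2 * d1, a3 * d2, 0.0]])

    Gvec = np.array([0.0,
                     g * k['k5'] * c2 + g * k['k6'] * c23,
                     g * k['k6'] * c23])
    return M, C, Gvec

def deadzone(tau, b_r=0.05, b_l=-0.05):
    """Elementwise deadzone D(tau); identity h_r, h_l assumed for illustration."""
    tau = np.asarray(tau, dtype=float)
    out = np.zeros_like(tau)
    out[tau >= b_r] = tau[tau >= b_r] - b_r  # h_r(tau_i - b_r)
    out[tau <= b_l] = tau[tau <= b_l] - b_l  # h_l(tau_i - b_l)
    return out
```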

3. Preliminaries

3.1. Reinforcement Learning and Soft Actor-Critic

In reinforcement learning (RL), an agent aims to optimize its decision-making process to accumulate maximum reward by interacting with the environment. Figure 2 shows how an agent learns during reinforcement learning. In this framework, the agent operates in discrete time steps. At each step $t$, the agent perceives the current state $s_t$, selects an action $a_t$, receives a corresponding reward $r_t$, and transitions to a new state $s_{t+1}$. The objective is to derive a policy $\pi(a|s)$ that maximizes the expected return, expressed as $R = \sum_{t=0}^{T} \gamma^{t} r_t$, with $\gamma$ being the discount factor.
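As a small illustration, the discounted return can be computed directly from the per-step rewards of one episode; the sketch below is a minimal example and the reward values are placeholders.

```python
# Minimal sketch: discounted return R = sum_t gamma^t * r_t over one episode.
def discounted_return(rewards, gamma=0.99):
    R = 0.0
    for t, r in enumerate(rewards):
        R += (gamma ** t) * r
    return R

# Example: three unit rewards give 1 + 0.99 + 0.9801 = 2.9701.
print(discounted_return([1.0, 1.0, 1.0]))
```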
SAC is an RL algorithm designed for continuous action spaces. It combines the advantages of maximum entropy RL with the stability of actor-critic methods. The policy (actor) update in the SAC algorithm seeks to maximize both the expected reward and the entropy of the policy, encouraging exploration. Built upon the actor-critic framework and maximum entropy RL theory, the SAC algorithm optimizes the expected policy entropy along with the standard RL return [18]:
$$\pi^{*} = \arg\max_{\pi} \sum_{t=0}^{T} \mathbb{E}\left[ r(s_t, a_t) + \alpha \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$
where $\mathcal{H}\big(\pi(\cdot \mid s_t)\big)$ represents the entropy of the policy and $\alpha$ is the temperature coefficient that weights it.
The SAC algorithm involves five networks: two soft Q-value networks, two target soft Q-value networks, and a policy network. The Bellman equation is applied to update the soft Q-value networks, and the loss for a Q-value network $\Theta_i$ is specified as in [19]:
$$L_Q(\Theta_i) = \mathbb{E}\left[ \frac{1}{2}\big( Q_{\Theta_i}(s, a) - y(r, s', d) \big)^{2} \right]$$
where $y(r, s', d)$ is the target value computed from the reward, the next state, and the done signal.
SAC ensures stable training by using the minimum value from the two target Q-value networks as the target. The Q-value networks are then updated using stochastic gradient descent:
$$\nabla_{\Theta_i} L_Q(\Theta_i) = \nabla_{\Theta_i} Q_{\Theta_i}(s, a) \big( Q_{\Theta_i}(s, a) - y(r, s', d) \big)$$
The policy network $\pi_{\phi}(a|s)$ is updated by maximizing the expected Q-value together with the entropy term under the current policy:
$$J_{\pi}(\phi) = \mathbb{E}_{s \sim \mathcal{D}}\, \mathbb{E}_{a \sim \pi_{\phi}}\left[ \alpha \log \pi_{\phi}(a|s) - Q_{\Theta_i}(s, a) \right]$$
Stochastic gradient descent is then used to update the policy network:
$$\nabla_{\phi} J_{\pi}(\phi) = \nabla_{\phi}\, \alpha \log \pi_{\phi}(a|s) + \big( \nabla_{a}\, \alpha \log \pi_{\phi}(a|s) - \nabla_{a} Q_{\Theta_i}(s, a) \big) \nabla_{\phi} a$$
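The sketch below shows how these critic and actor updates can be computed in PyTorch. The networks `actor`, `q1`, `q2`, `q1_targ`, and `q2_targ`, the assumption that the actor returns an action and its log probability, and the fixed temperature `alpha` are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, actor, q1, q2, q1_targ, q2_targ, alpha=0.2, gamma=0.99):
    """Compute the soft Q-value losses and the policy loss for one batch."""
    s, a, r, s_next, d = batch  # tensors sampled from the replay buffer

    # Bellman target y(r, s', d): minimum of the two target Q networks plus
    # the entropy bonus, with no gradient through the target.
    with torch.no_grad():
        a_next, logp_next = actor(s_next)
        q_targ = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        y = r + gamma * (1.0 - d) * (q_targ - alpha * logp_next)

    # Critic losses: 0.5 * (Q(s, a) - y)^2 for each soft Q network.
    loss_q1 = 0.5 * F.mse_loss(q1(s, a), y)
    loss_q2 = 0.5 * F.mse_loss(q2(s, a), y)

    # Actor loss: E[alpha * log pi(a|s) - Q(s, a)] with reparameterized
    # actions so the gradient flows through the policy.
    a_pi, logp_pi = actor(s)
    q_pi = torch.min(q1(s, a_pi), q2(s, a_pi))
    loss_pi = (alpha * logp_pi - q_pi).mean()

    return loss_q1, loss_q2, loss_pi
```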

3.2. Long Short-Term Memory

LSTM is a variant of recurrent neural networks (RNNs) specifically designed for modeling sequential data and capturing long-term dependencies. The forget gate governs the removal of information from the cell state, determining what information to discard from $C_{t-1}$ [20]:
$$f_t = \varphi\big( W_f \cdot [h_{t-1}, x_t] + b_f \big)$$
The input gate governs the decision of what new information to store in the cell state:
$$ß_t = \varphi\big( W_i \cdot [h_{t-1}, x_t] + b_i \big), \qquad \tilde{C}_t = \tanh\big( W_C \cdot [h_{t-1}, x_t] + b_C \big)$$
where the input gate $ß_t$ scales the candidate values $\tilde{C}_t$, and the cell state is updated as $C_t = f_t \cdot C_{t-1} + ß_t \cdot \tilde{C}_t$.
The output gate governs the output of the LSTM cell. It merges the current cell state with the activation of the output gate to produce the final hidden state:
$$\eta_t = \varphi\big( W_o \cdot [h_{t-1}, x_t] + b_o \big), \qquad h_t = \eta_t \cdot \tanh(C_t)$$
The symbol φ mentioned above refers to the sigmoid function, while the weight matrix and bias term are represented by W and b, respectively. The combined operation of these gates enables the LSTM unit to learn long-term dependencies, which is crucial for sequential data processing.
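The gate equations above translate directly into code. The cell below is a minimal PyTorch sketch that mirrors them (sigmoid $\varphi$ for the gates, tanh for the candidate and the output); in practice torch.nn.LSTM provides the same computation, and the layer sizes here are illustrative.

```python
import torch

class LSTMCellSketch(torch.nn.Module):
    """Minimal LSTM cell following the gate equations above."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.Wf = torch.nn.Linear(input_size + hidden_size, hidden_size)  # forget gate
        self.Wi = torch.nn.Linear(input_size + hidden_size, hidden_size)  # input gate
        self.Wc = torch.nn.Linear(input_size + hidden_size, hidden_size)  # candidate
        self.Wo = torch.nn.Linear(input_size + hidden_size, hidden_size)  # output gate

    def forward(self, x_t, h_prev, c_prev):
        z = torch.cat([h_prev, x_t], dim=-1)   # [h_{t-1}, x_t]
        f_t = torch.sigmoid(self.Wf(z))        # forget gate f_t
        i_t = torch.sigmoid(self.Wi(z))        # input gate (ß_t)
        c_tilde = torch.tanh(self.Wc(z))       # candidate cell state
        c_t = f_t * c_prev + i_t * c_tilde     # cell state update
        o_t = torch.sigmoid(self.Wo(z))        # output gate (eta_t)
        h_t = o_t * torch.tanh(c_t)            # hidden state
        return h_t, c_t
```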

3.3. Attention Mechanism

Attention mechanisms have become a critical component in sequence modeling, enabling models to focus on the most relevant parts of the input sequence when producing output. The attention mechanism dynamically assigns different weights to different input elements based on their relevance to the current state or prediction, improving the model’s ability to capture long-range dependencies and handle variable-length sequences.
In the context of sequence-to-sequence models, attention mechanisms allow the model to focus on different parts of the input sequence at each time step.
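As a concrete illustration of this weighting, the sketch below projects a sequence of hidden states to query, key, and value matrices and returns the softmax-weighted sum of the values; scaling the scores by the square root of the feature dimension is a common convention assumed here rather than something prescribed above.

```python
import torch

class SelfAttentionSketch(torch.nn.Module):
    """Single-head self-attention over a sequence of features."""
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model)
        self.k_proj = torch.nn.Linear(d_model, d_model)
        self.v_proj = torch.nn.Linear(d_model, d_model)

    def forward(self, h):                        # h: [batch, seq_len, d_model]
        q, k, v = self.q_proj(h), self.k_proj(h), self.v_proj(h)
        scores = torch.matmul(q, k.transpose(-2, -1)) / (h.size(-1) ** 0.5)
        weights = torch.softmax(scores, dim=-1)  # relevance of each time step
        return torch.matmul(weights, v)          # context vectors, [batch, seq_len, d_model]
```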

3.4. Random Network Distillation

Random network distillation (RND) is an RL exploration method aimed at improving an agent’s ability to investigate unfamiliar environments, thus boosting learning efficiency. Similar to the Intrinsic Curiosity Module (ICM) algorithm, RND uses a fixed target network, which is typically a randomly initialized neural network, to encode environmental states. The next state observation $s_{t+1}$ is input into this target network to produce the output $f(s_{t+1})$. The prediction network shares the same architecture and input as the target network, and its output $\hat{f}(s_{t+1})$ is trained to match the state encoding $f(s_{t+1})$ produced by the target network. The prediction error $\epsilon$ is computed as follows [21]:
$$\epsilon = \big\| \hat{f}(s_{t+1}) - f(s_{t+1}) \big\|^{2}$$
Serving as an internal reward, the prediction error $\epsilon$ guides the agent’s exploration. RND incentivizes the agent to explore states where the prediction network’s output is less accurate, promoting effective learning and aiding exploration in intricate environments. RND is widely employed in various domains, including gameplay, robot control, and autonomous driving, demonstrating its effectiveness in enhancing exploration and improving agent performance.
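A minimal sketch of the RND computation follows: a frozen, randomly initialized target network, a trainable predictor with the same architecture, and the squared prediction error used as the internal reward. The network widths, feature size, learning rate, and the choice to update the predictor inside the reward call are illustrative assumptions.

```python
import torch

def make_rnd_net(obs_dim, feat_dim=64):
    return torch.nn.Sequential(torch.nn.Linear(obs_dim, 128),
                               torch.nn.ReLU(),
                               torch.nn.Linear(128, feat_dim))

class RND:
    def __init__(self, obs_dim):
        self.target = make_rnd_net(obs_dim)      # fixed, randomly initialized
        self.predictor = make_rnd_net(obs_dim)   # trained online
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=1e-4)

    def intrinsic_reward(self, s_next):
        with torch.no_grad():
            f = self.target(s_next)              # f(s_{t+1})
        f_hat = self.predictor(s_next)           # f_hat(s_{t+1})
        err = ((f_hat - f) ** 2).sum(dim=-1)     # epsilon = ||f_hat - f||^2
        # Train the predictor so the error shrinks on frequently visited states.
        self.opt.zero_grad()
        err.mean().backward()
        self.opt.step()
        return err.detach()
```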

4. Controller Design

4.1. Design of System Input and Output

In DRL, the actor network generates output actions based on the current input state. In the context of trajectory tracking of the manipulator in the task space, the joint angles are crucial [22]. The goal of the controller is to choose appropriate actions from the real-time state information of the robot arm to ensure accurate tracking control. The state space is defined as follows:
$$s = \begin{bmatrix} \psi & \dot{\psi} & \psi_d & \dot{\psi}_d \end{bmatrix}^{T}$$
where $\psi_d$ and $\dot{\psi}_d$ denote the target angular position and target angular velocity. Additionally, to control the swing amplitude during target curve tracking, saturation constraints are enforced on the manipulator joints. The specific limitations are
$$\psi_i = \begin{cases} \psi_{\min}, & \text{if } \psi_i \le \psi_{\min} \\ \psi_i, & \text{if } \psi_{\min} < \psi_i < \psi_{\max} \\ \psi_{\max}, & \text{if } \psi_i \ge \psi_{\max} \end{cases}, \quad i = 1, 2, \ldots, n$$
where $\psi_{\max} = \pi$ and $\psi_{\min} = -\pi$.
In RL tasks, the reward function is a key component. In the context of robotic manipulator control, the tracking error e ( t ) is the primary variable. The reward function can be written as
$$r_t = -\left| e_i(t) \right|$$
where $e_i(t) = \psi_i(t) - \psi_{d i}(t)$.
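The observation, joint saturation, and reward above can be assembled as in the sketch below; summing the per-joint absolute errors with equal weight is an illustrative choice.

```python
import numpy as np

PSI_MIN, PSI_MAX = -np.pi, np.pi

def build_state(psi, dpsi, psi_d, dpsi_d):
    """State s = [psi, dpsi, psi_d, dpsi_d]^T fed to the actor network."""
    return np.concatenate([psi, dpsi, psi_d, dpsi_d])

def saturate_joints(psi):
    """Clamp each joint angle to [psi_min, psi_max]."""
    return np.clip(psi, PSI_MIN, PSI_MAX)

def tracking_reward(psi, psi_d):
    """Negative absolute tracking error, e_i(t) = psi_i(t) - psi_di(t)."""
    e = psi - psi_d
    return -np.sum(np.abs(e))
```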

4.2. Network Architecture

Inspired by [14], this article combines the SAC algorithm with an LSTM network. SAC-LSTM is based on the actor–critic architecture, comprising a policy network and Q-value networks. The policy network consists of an LSTM layer, an attention mechanism layer, fully connected layers, and output layers, as shown in Figure 3a. The framework employs an attention mechanism to dynamically focus on relevant features in the LSTM output. In [23], the last hidden state of the LSTM is used as the input of the fully connected layer. In the manipulator trajectory tracking task of this paper, the state sequence at each time step is fed into the LSTM, which outputs a hidden state h for each time step. The attention layer computes three linear transformations of h to produce the query, key, and value matrices. The query and key matrices are used to calculate attention weights through a softmax function applied to their dot product, which allows the model to focus on different parts of the sequence by assigning higher weights to more relevant features. These attention weights are then used to produce a weighted sum of the value matrix, effectively generating a context vector that emphasizes important temporal features. This context vector is passed to fully connected layers to generate the Gaussian distribution parameters for the actions (torques): the mean (using the tanh activation function) and the standard deviation (using the softplus activation function with a small constant added to avoid numerical issues). During training, actions are sampled from this distribution to ensure they are random and exploratory. The final actions are scaled by the action bounds after passing through the tanh function, and their log probabilities are computed. As the network converges, the variance of the action distribution gradually decreases.
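A sketch of this policy network is given below: an LSTM over the state sequence, a single-head attention layer standing in for the attention described above, and a squashed-Gaussian output head. The layer sizes, the use of torch.nn.MultiheadAttention, and taking the context at the latest time step are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class ActorSketch(nn.Module):
    def __init__(self, state_dim, action_dim, action_bound, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, action_dim)
        self.std_head = nn.Linear(hidden, action_dim)
        self.action_bound = action_bound

    def forward(self, state_seq):                     # [batch, seq_len, state_dim]
        h, _ = self.lstm(state_seq)                   # hidden state per time step
        ctx, _ = self.attn(h, h, h)                   # attention-weighted context
        feat = self.fc(ctx[:, -1, :])                 # context at the latest step
        mu = torch.tanh(self.mu_head(feat))           # mean via tanh
        std = F.softplus(self.std_head(feat)) + 1e-5  # std via softplus + constant
        dist = Normal(mu, std)
        raw = dist.rsample()                          # reparameterized sample
        action = torch.tanh(raw) * self.action_bound
        # Log probability with the tanh-squashing correction.
        log_prob = dist.log_prob(raw) - torch.log(1.0 - torch.tanh(raw).pow(2) + 1e-6)
        return action, log_prob.sum(dim=-1, keepdim=True)
```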
The Q-value network consists of fully connected layers and an output layer, as shown in Figure 3b. It takes the concatenation of the state sequence and the action as input. First, the concatenated input is fed into the first fully connected layer and processed by the ReLU activation function. The result is then passed to the second fully connected layer, also processed by the ReLU activation function. Finally, the output layer produces a single value, Q ( s , a ) , representing the expected return for the given state and action.
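A matching sketch of the Q-value network is shown below; flattening the state sequence before concatenation with the action and the layer widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CriticSketch(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),  # first FC + ReLU
            nn.Linear(hidden, hidden), nn.ReLU(),                  # second FC + ReLU
            nn.Linear(hidden, 1))                                  # scalar Q(s, a)

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```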

4.3. SLR Algorithm Design

To improve the capability and efficiency of robot manipulator control, the RND method is introduced and the SLR algorithm is proposed. RND prompts the robotic arm agent to more actively explore unpredictable and unknown environmental states, leading to the discovery of superior policies during the training of arm control. Therefore, combining the RND method with the SAC-LSTM algorithm yields more stable and effective control policies.
The RND target and predictor networks consist of a fully connected layer and an output layer. Since the weights of the target network remain unchanged, the prediction network can focus on the intrinsic representation of the learning state during the training process, thereby enhancing both the stability and efficiency of the learning process [24]. The difference in output values between the two networks acts as an internal reward for the robot manipulator environment. Then, the total reward for RL algorithm training is defined as the sum of the intrinsic and extrinsic rewards, formed by the linear combination of the values in Equations (12) and (15). It is expressed as follows:
$$r_{total} = -\sum_{i=1}^{n} w_i \left| e_i(t) \right| + \big\| \hat{f}(s_{t+1}) - f(s_{t+1}) \big\|^{2}$$
The r t o t a l is then used in Equation (7) to update the actor network. This encourages the agent to explore unpredictable and unknown environmental states more actively, enhancing both learning efficiency and performance. The algorithm pseudocode is shown in Algorithm 1, and the algorithm flow chart is shown in Figure 4.
Algorithm 1 SLR.
Input: the parameters of the policy network $\phi$ and of the Q-networks $\Theta_1$, $\Theta_2$
Input: the target network weights $\bar{\Theta}_1 \leftarrow \Theta_1$, $\bar{\Theta}_2 \leftarrow \Theta_2$
Output: optimized parameters
 1: for each iteration do
 2:   for each step do
 3:     Observe $s_t$ and select $a_t \sim \pi_{\phi}(\cdot \mid s_t)$;
 4:     Execute $a_t$ in the environment;
 5:     Get $s_{t+1}$, reward $r$, and done signal $d$ indicating whether $s_{t+1}$ is a terminal state;
 6:     Calculate the internal reward $r^{in} = \| \hat{f}(s_{t+1}) - f(s_{t+1}) \|^{2}$;
 7:     Calculate the augmented reward $r_{total} = r + r^{in}$;
 8:     Store $(s_t, a_t, r_{total}, s_{t+1}, d)$ in the replay buffer $\mathcal{D}$;
 9:     if $s_{t+1}$ is terminal then
10:       Reset the environment;
11:     end if
12:   end for
13:   for each gradient step do
14:     Sample a batch $B = \{(s, a, r_{total}, s', d)\}$ from $\mathcal{D}$;
15:     Compute the Q-network targets following Equation (5);
16:     Update the Q-network parameters by gradient descent, Equation (6);
17:     Update the policy network parameters by gradient descent, Equation (8);
18:     Update the target network parameters using $\bar{\Theta}_i \leftarrow \rho \bar{\Theta}_i + (1 - \rho) \Theta_i$ for $i = 1, 2$;
19:   end for
20: end for
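A minimal sketch of one environment step from the inner loop of Algorithm 1 is shown below, reusing the tracking_reward and RND sketches from the previous sections; the replay buffer object and its store method are assumed for illustration.

```python
import torch

def slr_step(s, a, s_next, psi, psi_d, rnd, buffer, done):
    """Augment the external reward with the RND internal reward and store the transition."""
    r_ext = tracking_reward(psi, psi_d)                      # external tracking reward
    obs = torch.as_tensor(s_next, dtype=torch.float32).unsqueeze(0)
    r_int = float(rnd.intrinsic_reward(obs))                 # RND internal reward
    r_total = r_ext + r_int                                  # augmented reward
    buffer.store(s, a, r_total, s_next, done)                # assumed buffer API
    return r_total
```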

5. Simulation

This section presents simulation experiments conducted using the simulation environment from Section 2 to assess the performance of the method. To evaluate the performance of our SLR algorithm, we selected the SAC, SAC-RND, and SAC-LSTM methods as benchmarks. At the same time, the anti-interference ability of the four methods was assessed. All experiments were performed using the Phantom Omni robotic manipulator. Considering the availability of laboratory hardware, a workstation was used to create a virtual robotic manipulator system. The initial state values were defined as ψ 1 ( 0 ) = 0.23 , ψ 2 ( 0 ) = 0.07 , and ψ 3 ( 0 ) = 0.022 . The total number of steps was 3000, with a step size of 0.01 s per step.

5.1. Training Performance

The training performance of the SAC, SAC-RND, SAC-LSTM, and SLR methods was assessed in this experiment. Figure 5 shows the trends in cumulative reward values for the four algorithms, along with the variance and standard deviation of the cumulative reward. The results suggest that the reward of the SLR agent converges around the 50th episode, and the learning curve continues smoothly after that point. In contrast, the SAC-LSTM and SAC-RND agents only converge after the 100th episode, with their learning curves continuing to fluctuate even after convergence. Throughout the training process, the reward curve of the SAC algorithm exhibits fluctuations. Meanwhile, the variance and standard deviation of the SLR agent stabilize after the 75th episode, whereas those of the SAC, SAC-RND, and SAC-LSTM agents continue to fluctuate significantly.
An ablation experiment was conducted to evaluate the impact of the LSTM network with the attention mechanism and of the RND enhancement. When not combined with the LSTM network, the convergence speed of the SAC-RND algorithm is higher than that of the SAC algorithm, but the instability during training is not effectively alleviated. In addition, the reward curve of the SAC-LSTM algorithm is very smooth, but its reward value after convergence is not as high as that of SAC-RND. However, the stability and convergence speed of the SLR algorithm proposed in this paper are better than those of the other benchmark algorithms during training. This result shows that the LSTM network with an attention mechanism is key to improving stability, while RND is essential for enhancing exploration. Combining them for manipulator control enhances both the efficiency and the robustness of the algorithm.

5.2. Control Performance

In this experiment, the performance of SLR in tracking control was compared to that of SAC, SAC-RND, and SAC-LSTM. In Figure 6 and Figure 7, the tracking trajectories, tracking errors, and torque variations of the joints are presented for SAC, SAC-LSTM, SAC-RND, and SLR during control tasks in both environments: one without deadzones and one with deadzones.
As shown in Figure 6a–c, the control trajectories in the three joints are presented for the environment without deadzones. The tracking errors of all joints are shown in Figure 6d–f. These results indicate that SLR outperforms the baseline methods in tracking control performance, with smaller deviations between the reference and actual positions. The absence of deadzones allows for more precise tracking and reduces delays in response, and SLR still demonstrates superior tracking accuracy compared to SAC, SAC-LSTM, and SAC-RND.
In contrast, in the environment with deadzones, as shown in Figure 7a–c, the control trajectories of the three joints exhibit similar trends, but with slightly larger deviations from the desired path. The tracking errors for all joints in the deadzone environment are shown in Figure 7d–f. SLR still outperforms the baseline methods, although the tracking errors are higher compared to the no-deadzone scenario due to the inherent delays and nonlinearities introduced by the deadzones.
The torque trends for the joints in the no-deadzone environment are shown in Figure 6g–i. These results demonstrate that the torque of all joints is smaller when controlled by SLR compared to SAC, SAC-LSTM, and SAC-RND, highlighting the efficiency of SLR in minimizing energy consumption and mechanical stress. In the deadzone environment, as illustrated in Figure 7g–i, the torque variations are still lower for SLR but exhibit some fluctuation due to the deadzone effects, although SLR maintains a clear advantage in reducing torque compared to the other algorithms.
As demonstrated by this experiment, the SLR algorithm leverages the strengths of RND and LSTM to quickly learn stable control policies, resulting in excellent control ability with reduced torque. The average absolute torque (AAT), average absolute tracking error (AAE), and root-mean-squared error (RMSE) are summarized in Table 1 and Table 2 for both environments, with the corresponding formulas for AAE, AAT, and RMSE as follows:
$$\mathrm{AAE}_i = \frac{1}{T} \sum_{t=1}^{T} \left| e_i(t) \right|, \quad i = 1, 2, \ldots, n$$
$$\mathrm{AAT}_i = \frac{1}{T} \sum_{t=1}^{T} \left| \tau_i(t) \right|, \quad i = 1, 2, \ldots, n$$
$$\mathrm{RMSE}_i = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} e_i(t)^{2} }, \quad i = 1, 2, \ldots, n$$
where $T$ is the total number of time steps.
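These metrics reduce to per-joint averages over the logged trajectories. The sketch below assumes the tracking errors and torques are stored as arrays of shape [T, n] (time steps by joints).

```python
import numpy as np

def evaluate(errors, torques):
    """Return per-joint AAE, AAT, and RMSE from logged errors and torques."""
    aae = np.mean(np.abs(errors), axis=0)     # average absolute tracking error
    aat = np.mean(np.abs(torques), axis=0)    # average absolute torque
    rmse = np.sqrt(np.mean(errors ** 2, axis=0))
    return aae, aat, rmse
```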
The results for both environments highlight the robustness of the SLR algorithm, particularly in terms of minimizing tracking errors and joint torque, showcasing its potential for precise and energy-efficient robotic control.

6. Conclusions

This paper proposes an SLR (SAC-LSTM-RND) algorithm to train robotic manipulators to perform tracking control tasks. According to the characteristics of the robotic manipulator system, the SAC-LSTM algorithm is implemented. LSTM is used to capture the temporal information of the reference trajectory, and the attention mechanism dynamically focuses on critical features, thereby improving the adaptability of the controller to time-varying curves. SAC-LSTM mitigates the instability encountered when training with the SAC algorithm alone. To enhance the exploration ability of the robot arm agent, the RND method is added to the algorithm. RND prompts the robot arm agent to explore uncertain states of the environment more proactively, thereby obtaining superior policies for robotic manipulator control. The results demonstrate that, under the same conditions, the algorithm proposed in this paper exhibits superior learning and control capabilities. However, the proposed algorithm has only been verified in a single-task environment. Future work can be extended to more diverse practical tasks, including multi-task learning, collaborative control, and other scenarios. This will not only verify the effectiveness of the algorithm in more complex environments, but also further promote its development towards a general control strategy.

Author Contributions

Conceptualization, F.W. and J.H.; data curation, J.H.; funding acquisition, F.W. and M.J.; investigation, F.W. and J.H.; methodology, F.W. and J.H.; project administration, Y.Q. and F.G.; writing—original draft preparation, F.W. and J.H.; resources, M.J., Y.Q. and F.G.; visualization, J.H.; writing—review and editing, F.W. and J.H.; formal analysis, M.J., Y.Q. and F.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by National Natural Science Foundation of China under grant no. 62203116, 62103106, in part by GuangDong Basic and Applied Basic Research Foundation 2024A1515010222, and in part by Dongguan Science and Technology of Social Development Program under grant no. 20231800935882, SSL Sci-tech Commissioner Program (20234430-01KCJ-G, 20234371-01KCJ-G).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Karabegović, I.; Husak, E.; Karabegović, E.; Mahmić, M. Robotic Technology as the Basis of Implementation of Industry 4.0 in Production Processes in China. In International Conference “New Technologies, Development and Applications”; Springer Nature: Cham, Switzerland, 2023. [Google Scholar]
  2. He, N.; Li, Y.; Li, H.; He, D.; Cheng, F. PID-Based Event-Triggered MPC for Constrained Nonlinear Cyber-Physical Systems: Theory and Application. IEEE Trans. Ind. Electron. 2024, 71, 13103–13112. [Google Scholar] [CrossRef]
  3. Ulu, B.; Savaş, S.; Ergin, Ö.F.; Ulu, B.; Kırnap, A.; Bingöl, M.S.; Yıldırım, Ş. Tuning the Proportional–Integral–Derivative Control Parameters of Unmanned Aerial Vehicles Using Artificial Neural Networks for Point-to-Point Trajectory Approach. Sensors 2024, 24, 2752. [Google Scholar] [CrossRef]
  4. Hu, J.; Zhang, D.; Wu, Z.G.; Li, H. Neural network-based adaptive second-order sliding mode control for uncertain manipulator systems with input saturation. ISA Trans. 2023, 136, 126–138. [Google Scholar] [CrossRef] [PubMed]
  5. Zhang, Y.; Kong, L.; Zhang, S.; Yu, X.; Liu, Y. Improved sliding mode control for a robotic manipulator with input deadzone and deferred constraint. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 7814–7826. [Google Scholar] [CrossRef]
  6. Zhang, J.; Wang, H. Online model predictive control of robot manipulator with structured deep Koopman model. IEEE Robot. Autom. Lett. 2023, 8, 3102–3109. [Google Scholar] [CrossRef]
  7. Selvaggio, M.; Garg, A.; Ruggiero, F.; Oriolo, G.; Siciliano, B. Non-prehensile object transportation via model predictive non-sliding manipulation control. IEEE Trans. Control Syst. Technol. 2023, 31, 2231–2244. [Google Scholar] [CrossRef]
  8. Yu, J.; Wu, M.; Ji, J.; Yang, W. Neural network-based region tracking control for a flexible-joint robot manipulator. J. Comput. Nonlinear Dyn. 2024, 19, 021003. [Google Scholar] [CrossRef]
  9. Khan, G.D. Control of robot manipulators with uncertain closed architecture using neural networks. Intell. Serv. Robot. 2024, 17, 315–327. [Google Scholar] [CrossRef]
  10. Lopez-Sanchez, I.; Pérez-Alcocer, R.; Moreno-Valenzuela, J. Trajectory tracking double two-loop adaptive neural network control for a Quadrotor. J. Frankl. Inst. 2023, 360, 3770–3799. [Google Scholar] [CrossRef]
  11. Ben Hazem, Z. Study of Q-learning and deep Q-network learning control for a rotary inverted pendulum system. Discov. Appl. Sci. 2024, 6, 1–19. [Google Scholar] [CrossRef]
  12. Liu, J.; Zhou, Y.; Gao, J.; Yan, W. Visual Servoing Gain Tuning by Sarsa: An Application with a Manipulator. In Proceedings of the 2023 3rd International Conference on Robotics and Control Engineering, Nanjing, China, 12–14 May 2023. [Google Scholar]
  13. Ma, R.; Wang, Y.; Tang, C.; Wang, S.; Wang, R. Position and Attitude Tracking Control of a Biomimetic Underwater Vehicle via Deep Reinforcement Learning. IEEE/ASME Trans. Mechatron. 2023, 28, 2810–2819. [Google Scholar] [CrossRef]
  14. Song, D.; Gan, W.; Yao, P. Search and tracking strategy of autonomous surface underwater vehicle in oceanic eddies based on deep reinforcement learning. Appl. Soft Comput. 2023, 132, 109902. [Google Scholar] [CrossRef]
  15. Yang, C.; Yang, J.; Wang, X.; Liang, B. Control of space flexible manipulator using soft actor-critic and random network distillation. In Proceedings of the 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), Dali, China, 6–8 December 2019. [Google Scholar]
  16. Wan, L.; Pan, Y.J.; Shen, H. Improving synchronization performance of multiple Euler–Lagrange systems using nonsingular terminal sliding mode control with fuzzy logic. IEEE/ASME Trans. Mechatron. 2021, 27, 2312–2321. [Google Scholar] [CrossRef]
  17. Ma, Z.; Liu, Z.; Huang, P. Fractional-order control for uncertain teleoperated cyber-physical system with actuator fault. IEEE/ASME Trans. Mechatron. 2020, 26, 2472–2482. [Google Scholar] [CrossRef]
  18. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: Birmingham, UK. [Google Scholar]
  19. Li, F.; Fu, M.; Chen, W.; Zhang, F.; Zhang, H.; Qu, H.; Yi, Z. Improving exploration in actor–critic with weakly pessimistic value estimation and optimistic policy optimization. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 8783–8796. [Google Scholar] [CrossRef] [PubMed]
  20. Graves, A.; Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2012; pp. 37–45. [Google Scholar] [CrossRef]
  21. Bahloul, S.N.; Mahmoudi, Y. Rainbow-RND: A Value-based Algorithm Augmented with Intrinsic Curiosity. In Proceedings of the 2021 International Conference on Information Systems and Advanced Technologies (ICISAT), Tebessa, Algeria, 27–28 December 2021; IEEE: Piscataway, NJ, USA. [Google Scholar]
  22. Huang, F.; Xu, J.; Wu, D.; Cui, Y.; Yan, Z.; Xing, W.; Zhang, X. A general motion controller based on deep reinforcement learning for an autonomous underwater vehicle with unknown disturbances. Eng. Appl. Artif. Intell. 2023, 117, 105589. [Google Scholar] [CrossRef]
  23. Jiawei, X.; Xufang, Z.; Zhong, L.; Qingtao, X. LSTM-DPPO based deep reinforcement learning controller for path following optimization of unmanned surface vehicle. J. Syst. Eng. Electron. 2023, 34, 1343–1358. [Google Scholar] [CrossRef]
  24. Nikulin, A.; Kurenkov, V.; Tarasov, D.; Kolesnikov, S. Anti-exploration by random network distillation. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; PMLR: Birmingham, UK. [Google Scholar]
Figure 1. (a) The 3-DOF Phantom Omni robot. (b) Schematic representation of the robot manipulator.
Figure 2. A diagram showing how the agent learns during reinforcement learning.
Figure 3. (a) The architecture of the actor network. (b) The architecture of the critic network.
Figure 4. SLR algorithm demonstration.
Figure 5. The standard deviation and variance of cumulative rewards for each episode. (a) Total rewards per episode. (b) Standard deviation of the rewards. (c) Variance of the accumulated rewards.
Figure 6. Tracking position, tracking error, and torque in the no-deadzone environment. (a–c) Tracking position. (d–f) Tracking error. (g–i) Torque.
Figure 7. Tracking position, tracking error, and torque in the deadzone environment. (a–c) Tracking position. (d–f) Tracking error. (g–i) Torque.
Table 1. AAE, AAT, and RMSE of the SAC, SAC-RND, SAC-LSTM, and SLR algorithms in the no-deadzone environment.

| Joint | Metric | SAC | SAC-RND | SAC-LSTM | SLR |
| joint1 | AAE | 0.0141 | 0.0076 | 0.0072 | 0.0055 |
| joint1 | AAT | 11.07 | 12.35 | 10.22 | 9.45 |
| joint1 | RMSE | 0.0223 | 0.0123 | 0.0132 | 0.0097 |
| joint2 | AAE | 0.0053 | 0.0059 | 0.052 | 0.0045 |
| joint2 | AAT | 5.732 | 10.23 | 6.702 | 3.612 |
| joint2 | RMSE | 0.0349 | 0.0108 | 0.0118 | 0.0131 |
| joint3 | AAE | 0.0112 | 0.0123 | 0.0092 | 0.0063 |
| joint3 | AAT | 8.713 | 7.349 | 4.032 | 1.692 |
| joint3 | RMSE | 0.0665 | 0.0221 | 0.0184 | 0.0151 |
Table 2. AAE, AAT, and RMSE of the SAC, SAC-RND, SAC-LSTM, and SLR algorithms in the deadzone environment.

| Joint | Metric | SAC | SAC-RND | SAC-LSTM | SLR |
| joint1 | AAE | 0.0152 | 0.0088 | 0.0081 | 0.0064 |
| joint1 | AAT | 12.39 | 13.86 | 11.26 | 10.98 |
| joint1 | RMSE | 0.0241 | 0.014 | 0.0145 | 0.0113 |
| joint2 | AAE | 0.0059 | 0.0067 | 0.058 | 0.0051 |
| joint2 | AAT | 6.663 | 11.26 | 7.468 | 3.781 |
| joint2 | RMSE | 0.0393 | 0.0114 | 0.0126 | 0.0144 |
| joint3 | AAE | 0.013 | 0.0138 | 0.01 | 0.0076 |
| joint3 | AAT | 10.08 | 8.377 | 4.215 | 1.941 |
| joint3 | RMSE | 0.0726 | 0.0248 | 0.0202 | 0.0172 |
