In this section, we introduce our proposal for path following based on DRL. First, we present a variant of PP used for steering control; this variant is designed to integrate better with our proposal and to address an inherent issue of PP. We then delve into the application of SAC and the design of a training environment whose objective is to minimize the cross-track error while encouraging maximization of the linear velocity. Finally, we discuss implementation details.
3.1. Pure Pursuit Steering Control
The original pure pursuit algorithm fits a circle between the robot's position and a look-ahead point on the reference path, assuming that the robot moves along this circular arc. Denoting the reference path by $\mathbf{p}(s)$ with path parameter $s$ and the robot's current position by $\mathbf{p}_r$, the conventional selection of the look-ahead point is a point $\mathbf{p}(s)$ on the reference path such that $\|\mathbf{p}(s) - \mathbf{p}_r\| = L$, representing a distance $L$ from the robot's current position. Since there are potentially multiple points fulfilling this criterion, the one with the highest value of the parameter $s$ is selected. The main issue with this selection method is that when the robot deviates from the path by more than the distance $L$, the control law is not defined [12]. Consequently, the failure to stay within the distance $L$ results in the failure of the optimization problem.
In this paper, a variant is proposed with the primary objective of addressing the aforementioned issue. Leveraging an arc-length parameterized path, an enhancement can be achieved by choosing a point situated an arc length $d$ forward along the reference path from the point closest to the robot's current position, denoted as $\mathbf{p}(s^*)$ with $s^* = \arg\min_{s} \|\mathbf{p}(s) - \mathbf{p}_r\|$. This newly selected point, denoted as $\mathbf{p}(s^* + d)$, is then assigned as the look-ahead point, as illustrated in Figure 4. The look-ahead distance is subsequently calculated as $L = \|\mathbf{p}(s^* + d) - \mathbf{p}_r\|$. Another advantage accompanying this choice is that the optimization problem needs to be solved only once, specifically for the nearest point.
Eventually, the commanded heading rate $\omega$ for a robot traveling at linear velocity $v$ is defined as per Equation (11):

$$\omega = \frac{2 v \sin \eta}{L} \qquad (11)$$

where the look-ahead angle $\eta$ is given by

$$\eta = \operatorname{atan2}\big(y_{la} - y_r,\; x_{la} - x_r\big) - \psi$$

with $(x_{la}, y_{la})$ the coordinates of the look-ahead point $\mathbf{p}(s^* + d)$, $(x_r, y_r)$ the robot's position, and $\psi$ the robot's heading.
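As a concrete illustration, the following minimal Python sketch implements the look-ahead selection and heading-rate command described above, assuming a path discretized as an array of closely spaced points; the function name, the array-based nearest-point search, and the atan2-based look-ahead angle are illustrative choices rather than the exact implementation used in this work.

```python
import numpy as np

def pure_pursuit_variant(path_xy, robot_pose, d, v):
    """Heading-rate command from the proposed pure pursuit variant.

    path_xy    : (N, 2) array of path points sampled approximately uniformly in arc length
    robot_pose : (x, y, psi) of the robot
    d          : arc-length offset of the look-ahead point from the nearest path point [m]
    v          : current linear velocity [m/s]
    """
    x, y, psi = robot_pose
    p_r = np.array([x, y])

    # Nearest point on the (discretized) path: solved only once per control step.
    dists = np.linalg.norm(path_xy - p_r, axis=1)
    i_star = int(np.argmin(dists))

    # Arc length along the path, measured from the nearest point onward.
    seg = np.linalg.norm(np.diff(path_xy[i_star:], axis=0), axis=1)
    s_from_star = np.concatenate(([0.0], np.cumsum(seg)))

    # Look-ahead point: d meters of arc length ahead of the nearest point.
    i_la = min(i_star + int(np.searchsorted(s_from_star, d)), len(path_xy) - 1)
    p_la = path_xy[i_la]

    # Look-ahead distance and look-ahead angle (defined for any lateral offset).
    L = np.linalg.norm(p_la - p_r)
    eta = np.arctan2(p_la[1] - y, p_la[0] - x) - psi
    eta = np.arctan2(np.sin(eta), np.cos(eta))  # wrap to [-pi, pi]

    # Pure pursuit heading-rate command, Equation (11).
    omega = 2.0 * v * np.sin(eta) / max(L, 1e-6)
    return omega
```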
3.2. Soft Actor-Critic in Velocity Control
A Markov Decision Process (MDP) is characterized by a sequential decision process that is fully observable, operates in a stochastic environment, and possesses a transition model adhering to the Markov property [31]. The MDP can be concisely represented as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$, where $\mathcal{S}$ represents the set of all states, called the state space; $\mathcal{A}$ denotes the actions available to the agent, called the action space; $\mathcal{P}$ denotes the transition probability $P(s_{t+1} \mid s_t, a_t)$, representing the likelihood that action $a_t$ in state $s_t$ will result in state $s_{t+1}$; and $\mathcal{R}$ represents the immediate reward $r(s_t, a_t)$ received after transitioning from state $s_t$ to state $s_{t+1}$ as a result of action $a_t$.
Detailed knowledge and the algorithm of SAC can be referenced from [32,33]. Here, we introduce only the essential components used in the proposed system. SAC is an RL algorithm built upon the Maximum Entropy RL (MERL) framework, which generalizes the objective of standard RL by introducing a regularization term. This regularization term ensures that the optimal policy $\pi^*$ maximizes both the expected return and the entropy simultaneously as follows:

$$\pi^* = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big]$$

where $\rho_\pi$ represents the distribution of the trajectory of state-action pairs that the agent encounters under the control policy $\pi$, and $\mathcal{H}\big(\pi(\cdot \mid s_t)\big) = -\mathbb{E}_{a_t \sim \pi}\big[\log \pi(a_t \mid s_t)\big]$ is the entropy weighted by the parameter $\alpha$. This parameter acts as the temperature, influencing the balance between the entropy term and the reward.
The soft Q-function is formulated to evaluate state-action pairs, as outlined in Equation (14):

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim P}\big[ V(s_{t+1}) \big] \qquad (14)$$

where $\gamma \in [0, 1)$ represents the discount rate that prevents an infinitely large return, and the soft state-value function $V(s_t)$ based on MERL is defined by

$$V(s_t) = \mathbb{E}_{a_t \sim \pi}\big[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \big]$$
The optimization is performed for function approximators of both the soft Q-function and the policy. The soft Q-function is parameterized by $\phi \in \mathbb{R}^{n}$, a vector of $n$ parameters, and can be effectively modeled using a DNN. The optimization of the soft Q-function is achieved by employing a policy evaluation algorithm, such as Temporal-Difference (TD) learning. The parameters of the soft Q-function can be optimized by minimizing the mean squared loss given by Equation (16), where the loss is approximated using state-action pairs stored in the experience replay buffer, denoted by $\mathcal{D}$:

$$J_Q(\phi_i) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\Big[ \tfrac{1}{2}\big( Q_{\phi_i}(s_t, a_t) - y \big)^2 \Big], \quad i = 1, 2 \qquad (16)$$

where the TD target $y$, given in Equation (17), is the value that the soft Q-function is updated towards. The update makes use of target soft Q-functions with parameters $\bar{\phi}_i$, where $i \in \{1, 2\}$ indexes the double Q-function approximators, referred to as the clipped double-Q trick [34]:

$$y = r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi_\theta}\Big[ \min_{i = 1, 2} Q_{\bar{\phi}_i}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\theta(a_{t+1} \mid s_{t+1}) \Big] \qquad (17)$$
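For clarity, a minimal PyTorch-style sketch of this critic update is given below; it computes the clipped double-Q TD target of Equation (17) and regresses both online Q-networks towards it as in Equation (16). The network objects, the policy.sample interface, and the batch layout are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def critic_update(batch, policy, q1, q2, q1_targ, q2_targ, alpha, gamma, q_optimizer):
    """One TD-learning step for both soft Q-functions (Equations (16) and (17))."""
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer D

    with torch.no_grad():
        # Sample the next action from the current policy together with its log-probability.
        a_next, logp_next = policy.sample(s_next)
        # Clipped double-Q trick: take the minimum of the two target critics.
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        # Soft TD target, Equation (17); (1 - done) zeroes the bootstrap at terminal states.
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)

    # Mean squared TD error of both online critics, Equation (16).
    loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    q_optimizer.zero_grad()
    loss.backward()
    q_optimizer.step()
    return loss.item()
```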
Similarly, the policy, parameterized by $\theta$, is modeled as a Gaussian with a mean and a standard deviation determined by a DNN. The objective for updating the policy parameters is defined by maximizing the expected return and the entropy, as depicted in Equation (18). Due to the challenges in directly sampling the latent action and computing gradients, the policy is reparameterized such that $a_t = f_\theta(\epsilon_t; s_t) = \tanh\big( \mu_\theta(s_t) + \sigma_\theta(s_t) \odot \epsilon_t \big)$, which keeps the action within a finite bound. Here, $\mu_\theta$ and $\sigma_\theta$ represent the mean and standard deviation of the action, respectively, and $\epsilon_t \sim \mathcal{N}(0, I)$ is an input noise vector. The reparameterized sample is thus differentiable, and the policy objective reads

$$J_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\Big[ \min_{i = 1, 2} Q_{\phi_i}\big(s_t, f_\theta(\epsilon_t; s_t)\big) - \alpha \log \pi_\theta\big(f_\theta(\epsilon_t; s_t) \mid s_t\big) \Big] \qquad (18)$$

where $\pi_\theta$ is implicitly defined in relation to $f_\theta$.
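A short sketch of the reparameterized tanh-Gaussian sampling and the corresponding policy loss (the negative of Equation (18), suitable for gradient descent) is shown below; the tanh log-probability correction and the small constant added for numerical stability are standard implementation choices assumed here.

```python
import torch

def sample_action(mu, log_std):
    """Reparameterized tanh-Gaussian sample and its log-probability."""
    std = log_std.exp()
    eps = torch.randn_like(mu)                 # input noise vector
    u = mu + std * eps                         # latent Gaussian sample (differentiable)
    a = torch.tanh(u)                          # squash the action into a finite bound
    # Log-probability with the tanh change-of-variables correction.
    logp = torch.distributions.Normal(mu, std).log_prob(u) - torch.log(1.0 - a.pow(2) + 1e-6)
    return a, logp.sum(dim=-1, keepdim=True)

def policy_loss(s, actor, q1, q2, alpha):
    """Negative of the objective in Equation (18), minimized by gradient descent."""
    mu, log_std = actor(s)
    a, logp = sample_action(mu, log_std)
    q = torch.min(q1(s, a), q2(s, a))
    return (alpha * logp - q).mean()
```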
In practice, the temperature $\alpha$ is adjusted automatically to constrain the average entropy of the policy, allowing for variability in entropy at different states. The objective of discovering a stochastic policy with maximal expected return, satisfying a minimum expected entropy constraint, is presented in Equation (19):

$$\max_{\pi}\; \mathbb{E}_{\rho_\pi}\Big[ \sum_{t} r(s_t, a_t) \Big] \quad \text{s.t.} \quad \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ -\log \pi(a_t \mid s_t) \big] \ge \bar{\mathcal{H}}, \;\; \forall t \qquad (19)$$

where the entropy target $\bar{\mathcal{H}}$ is set to the negative of the action space dimension.
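In implementations, this constraint is typically enforced by a dual-style temperature update; the sketch below assumes a learnable log-temperature and an entropy target of -1, matching the single-dimensional velocity-rate action used here, while the learning rate is a placeholder.

```python
import torch

target_entropy = -1.0                                  # negative of the action-space dimension
log_alpha = torch.zeros(1, requires_grad=True)         # optimize log(alpha) to keep alpha positive
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def temperature_update(logp):
    """Adjust the temperature so the average policy entropy tracks the target."""
    alpha_loss = -(log_alpha.exp() * (logp + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```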
The overall architecture of the SAC-based controller is depicted in Figure 5. It employs a pair of neural networks: one specifically designed for learning the policy and the other for learning the value function. To mitigate bias in value estimation, target networks are introduced. Following the optimization of the networks, the parameters of the target networks undergo an update using the soft update strategy denoted by Equation (20); specifically, a fraction of the updated network parameters is blended with the target network parameters:

$$\bar{\phi}_i \leftarrow \tau\, \phi_i + (1 - \tau)\, \bar{\phi}_i, \quad i = 1, 2 \qquad (20)$$

where the parameter $\tau \in (0, 1]$ indicates how fast the update is carried out; the update is performed at each step after optimizing the online critic networks.
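The soft update of Equation (20) amounts to Polyak averaging of the parameters, as in the brief sketch below.

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau):
    """Blend a fraction tau of the online parameters into the target network (Equation (20))."""
    for p_targ, p in zip(target_net.parameters(), online_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)
```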
Moreover, an experience replay buffer $\mathcal{D}$ for storing and replaying samples is integrated, which effectively reduces sample correlation, enhances sample efficiency, and improves the learning capability of the algorithm. Through interaction with the training environment, the mobile robot, acting as the agent, takes an action $a_t$ based on the current policy according to the observed state $s_t$, receives an immediate reward $r_t$, and transitions to the next state $s_{t+1}$. This experience is stored in the replay buffer. During each optimization step, a mini-batch is randomly drawn from the buffer to approximate the required expected values. A detailed description of the designed training environment is provided below.
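A minimal uniform replay buffer consistent with this description could look as follows; the capacity and the tuple layout are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer storing (s, a, r, s_next, done) transitions for uniform replay."""
    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random mini-batch, which breaks the temporal correlation of samples.
        return random.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)
```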
3.2.1. Observation Space and Action Space
The observation $s_t$, which serves as the input to the velocity controller for path following, is designed as follows:

$$s_t = \big[\, e,\; e_\psi,\; v,\; \omega,\; \eta \,\big]$$

where $e$ is the cross-track error, $e_\psi$ is the normalized orientation error between the path and the mobile robot, and $v$ and $\omega$ are the current linear velocity and rotational velocity of the robot, respectively. The look-ahead angle $\eta$, obtained from the look-ahead point as discussed in the previous section, functions as an augmented observation that provides information about the curvature of the path ahead.
A graphical explanation is presented in Figure 6. Through trial and adjustment, the arc-length offset parameter $d$, which serves both the steering control and the observation in this paper, is set to 0.2 m. In configuring this parameter, our primary consideration is selecting a look-ahead point that ensures the robot quickly regains the path. A small value of $d$ makes the robot approach the path rapidly, but it may result in overshooting and oscillations along the reference path. Conversely, a large $d$ reduces oscillations but may increase cross-track errors, especially around corners.
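For concreteness, the observation vector can be assembled as in the sketch below; the ordering of the elements and the wrap-to-[-pi, pi] normalization of the orientation error are assumptions consistent with the observation definition above.

```python
import numpy as np

def build_observation(e_ct, e_psi, v, omega, eta):
    """Observation for the velocity controller.

    e_ct  : cross-track error [m]
    e_psi : orientation error between the path and the robot [rad]
    v     : current linear velocity [m/s]
    omega : current rotational velocity [rad/s]
    eta   : look-ahead angle computed from the point d = 0.2 m ahead along the path [rad]
    """
    e_psi = np.arctan2(np.sin(e_psi), np.cos(e_psi))  # normalize to [-pi, pi]
    return np.array([e_ct, e_psi, v, omega, eta], dtype=np.float32)
```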
The action, denoted as $a_t$, represents the normalized rate of the linear velocity, $\dot{v}$. This choice is adopted to mitigate undesired rapid changes: applying an incremental control input to the robot makes it easier to achieve smooth motions without the need for additional rewards or penalties on excessive velocity changes. Furthermore, constraining the velocity rate within a specified range provides better stability. The linear velocity command after the saturation operation for the mobile robot at each time step is calculated using Equation (22):

$$v_t = \operatorname{sat}\big( v_{t-1} + ( k\, a_t + b )\, \Delta t \big) \qquad (22)$$

where $k$ and $b$ are the scale and bias, respectively, that recover the normalized action to the range of the desired velocity rate, and $\Delta t$ is the control period. The range of the linear velocity rate (in m/s²) is designed such that the exploration space for deceleration is slightly larger than that for acceleration, addressing situations that require urgent braking.
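The mapping from the normalized action to the saturated velocity command can be sketched as below; the clipping bounds, the control period, and the exact rate limits are placeholders, since only the asymmetry of the rate range is stated above.

```python
import numpy as np

def apply_action(a, v_prev, dt, vdot_min, vdot_max, v_max):
    """Saturated velocity command from a normalized action a in [-1, 1] (cf. Equation (22))."""
    k = 0.5 * (vdot_max - vdot_min)        # scale recovering the rate range
    b = 0.5 * (vdot_max + vdot_min)        # bias shifting the asymmetric range
    v_dot = k * float(a) + b               # linear velocity rate [m/s^2]
    return float(np.clip(v_prev + v_dot * dt, 0.0, v_max))  # integrate and saturate
```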
3.2.2. Reward Function
The reward function is designed to penalize the robot when it deviates from the path, while rewarding the robot's velocity as much as possible, as depicted in Equation (23), where $w_1$, $w_2$, and $w_3$ are positive constants that define the importance of each term, and $e_{tol}$ is the tolerance for the cross-track error within which the robot receives positive velocity rewards. Based on intuitive considerations, it is desirable for the robot to decrease its velocity when deviating from the path, preventing further growth of the error. To achieve this, a segmented penalty approach is also introduced: when the cross-track error exceeds a critical threshold, the penalty on velocity increases accordingly. This design ensures that the policy receives velocity rewards only when the cross-track error is within the critical threshold.
The reward contributed by the first two terms is visualized in Figure 7 over the considered range of cross-track error. Within this range, the reward at each step ranges from a maximum value of 1, indicating perfect tracking of the path at maximum speed, down to its minimum value. Owing to the improvement of the pure pursuit algorithm, the mobile robot can consistently track the point ahead of itself on the path at any lateral distance. In this scenario, it is sufficient to investigate only the reward associated with the robot traveling along the path.
Additionally, it has been observed in experiments that the agent may choose to discontinue forward movement at challenging turns to avoid potential penalties. To address this situation, a flag $F$ indicating a stationary state has been introduced, defined as

$$F = \begin{cases} 1, & v < \epsilon \\ 0, & \text{otherwise} \end{cases}$$

where $\epsilon$ is an extremely small positive value introduced for numerical stability. The parameters of the reward function, $w_1$, $w_2$, $w_3$, and the tolerance $e_{tol}$ (in meters), are set to fixed values chosen through tuning.
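Since the exact form of Equation (23) and its constants are not reproduced here, the following sketch shows one plausible realization of the described behavior: a cross-track penalty, a velocity term that is rewarded only within the tolerance and penalized beyond the critical threshold, and the stationary-state flag; the weights, thresholds, and the velocity normalization are placeholders.

```python
def reward(e_ct, v, v_max, w1, w2, w3, e_tol, eps=1e-6):
    """Plausible realization of the reward described around Equation (23)."""
    F = 1.0 if v < eps else 0.0                  # stationary-state flag
    error_term = -w1 * abs(e_ct)                 # penalize deviation from the path
    if abs(e_ct) <= e_tol:
        velocity_term = w2 * (v / v_max)         # reward speed while within tolerance
    else:
        velocity_term = -w2 * (v / v_max)        # segmented penalty beyond the threshold
    return error_term + velocity_term - w3 * F   # discourage stopping at difficult turns
```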
3.2.3. Environment and Details
The training environment encompasses the kinematics of the robot itself and a reference path. Ensuring the adaptability of the policy to various challenges is crucial, as it cultivates the ability to handle generalized scenarios, thereby reducing the risk of overfitting to specific paths. We employed the stochastic path generation algorithm proposed in our previous work [35], randomly generating a reference path for the robot to follow at the beginning of each episode. In this paper, the parameters of the stochastic path generation algorithm are fixed to predefined values. Straight paths are also introduced into the training with a fixed generation probability; a straight path is obtained by constraining the corresponding generation parameters. Examples of randomly generated curved paths are illustrated in Figure 8.
After the generation of a reference path, the initial posture of the robot is randomly sampled from a uniform distribution, with bounded position errors along the global $x$-axis and $y$-axis and a bounded heading error with respect to the reference path. A warm-up strategy is also implemented to gather completely random experiences: during the initial training phase, the agent takes random actions uniformly sampled from the action space. Each episode terminates when the robot reaches the endpoint or when 400 time steps have elapsed. The training parameters are summarized in Table 2.
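The episode structure described above can be summarized by the following training-loop skeleton; the gym-style env, agent, and buffer interfaces are assumptions, and only the 400-step episode cap is taken from the text.

```python
def train(env, agent, buffer, total_steps, warmup_steps, batch_size, max_episode_steps=400):
    """Training-loop skeleton: warm-up with random actions, then SAC updates each step."""
    s = env.reset()                               # new random path and initial posture
    episode_steps = 0
    for step in range(total_steps):
        if step < warmup_steps:
            a = env.action_space.sample()         # gather completely random experiences
        else:
            a = agent.act(s)                      # sample from the current stochastic policy

        s_next, r, done, _ = env.step(a)
        buffer.add(s, a, r, s_next, done)
        s, episode_steps = s_next, episode_steps + 1

        if step >= warmup_steps and len(buffer) >= batch_size:
            agent.update(buffer.sample(batch_size))   # critic, actor, temperature, targets

        if done or episode_steps >= max_episode_steps:
            s = env.reset()
            episode_steps = 0
```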
The actor and critic neural networks are both structured with two hidden layers of 256 neurons each, equipped with the Rectified Linear Unit (ReLU) activation function. The actor's final layer outputs the mean $\mu_\theta$ and standard deviation $\sigma_\theta$ of a Gaussian distribution, from which a valid action is sampled. Subsequently, the action undergoes a tanh transformation to confine its range, as outlined in Equation (22). In the critic network, the action and state are concatenated to form the input. For the optimization of the neural networks, the Adam optimizer [36] is utilized with a minibatch size of 256. Training is conducted five times separately, allowing for an assessment of the algorithm's effectiveness and stability. Each training run uses a distinct random seed to control factors such as the path generation parameters and the initial posture of the robot, ensuring the reproducibility of the experiments. The hyperparameters are summarized in Table 3.
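A PyTorch sketch of the actor and critic architectures matching this description is given below; the log-standard-deviation clamp is a common implementation detail assumed here rather than stated in the text.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Two 256-unit ReLU hidden layers; outputs the mean and log-std of the action distribution."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.mu_head = nn.Linear(256, act_dim)
        self.log_std_head = nn.Linear(256, act_dim)

    def forward(self, obs):
        h = self.body(obs)
        return self.mu_head(h), self.log_std_head(h).clamp(-20, 2)

class Critic(nn.Module):
    """Soft Q-network; the state and action are concatenated to form the input."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, act):
        return self.q(torch.cat([obs, act], dim=-1))
```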