2. Materials and Methods
To review the related work, we went through all robotic paradigms, always keeping in mind the autonomous driving problem that we were attempting to solve. This autonomous driving problem poses four main challenges. First, the fulfillment of safety and legal requirements under dynamic scenarios, i.e., avoiding dynamic obstacles, driving the robot within the drivable space, and limiting its maximum linear speed; some methods in the literature struggle to meet these requirements under dynamic scenarios. Second, the fulfillment of comfort requirements; some robotic paradigms are less suitable than others for this. Third, the generation of feasible trajectories, i.e., trajectories that can be realized by the motion control; this point is a headache for many state-of-the-art algorithms, specifically when facing driving situations close to the limit (grip limit). Finally, the proposed algorithm must have a reasonable computational load; state-of-the-art algorithms are well known for demanding very high computational resources, making them hard to run on microcontrollers.
- A. Reactive Paradigm
Methods based on the reactive paradigm perform well with holonomic robots in spatially restricted scenarios with static obstacles while fulfilling safety and legal requirements. Common methods are the potential field method [1], the velocity obstacle method [2], the dynamic window approach [3], or the partial center of area method [4]. However, non-holonomic robots might be difficult to integrate with the reactive paradigm since they need to plan their movement ahead in order to avoid collisions with obstacles. Moreover, control based on reactions to certain sensorial inputs (behaviors) without any local planning can lead to aggressive and long trajectories [5], making it difficult to integrate comfort requirements. On the other hand, the reactive paradigm demands low computational resources, which enables the usage of inexpensive microcontrollers.
- B. Deliberative Paradigm
The robotic paradigm with a local planning stage is the deliberative paradigm. The state-of-the-art solutions that attempt to solve the previously described autonomous driving problem are mainly grouped in this robotic paradigm. The methods within the deliberative paradigm can be divided into three different subgroups: graph-based methods, sampling-based methods, and model predictive control methods.
Within the graph-based methods group, the Dijkstra algorithm [6] performs very well at avoiding obstacles and finding the shortest path. However, it is difficult to use with large grids because it is computationally very expensive. The A* algorithm [7] followed the Dijkstra algorithm and reduced the computational load by including heuristic functions to find the solution faster. However, the Dijkstra and A* algorithms also have an additional and very important disadvantage: they do not consider the robot dynamics in the motion planning, making them invalid for non-holonomic robots. Kinematic or dynamic embedded models can be considered to make the method valid for non-holonomic robots; a good example is the hybrid A* algorithm [8]. However, these methods rely on embedded models that are affected by uncertainties, producing trajectories that might be unfeasible to control, specifically if the robot faces driving situations close to the limit.
With the aim of reducing the computational cost, sampling-based methods were created. Sampling-based methods can work on a continuous space where only specific discrete waypoints are sampled. Common methods are rapidly exploring random trees (RRTs) [9] or the probabilistic roadmap method (PRM) [10]. Once the sampling process is completed, a path considering non-holonomic constraints can be generated: [11] used maximum and minimum limits on the path's curvature and the momentum of the vehicle, and [12] used an embedded dynamic model of the robot. Eventually, the final trajectory was selected based on the score given by a cost function. Since non-holonomic constraints were considered, the cost function only needed to evaluate aspects such as robot collisions, comfort, or total travel time. However, [11] did not consider all non-holonomic constraints and [12] relied on embedded models that could differ from reality. Driving situations close to the limit are also challenging, even when using complex embedded robot dynamic models. Feasible trajectories can be easily generated in scenarios of low velocity, such as parking [13], or in scenarios with moderate accelerations [14], but scenarios close to the limit are difficult to handle with this approach. Some other approaches based on pure mathematical functions can be applied to connect the sampled waypoints with a smooth path profile, such as Bezier curves [15], spline curves [16], or the lattice planner with polynomial curves for the path's curvature [17]. These approaches can have the advantage of not embedding a dynamic model of the robot into the logic, which can simplify the implementation. However, since some of the generated trajectories might not fulfill the non-holonomic constraints, the cost function must consider these constraints to reject non-feasible trajectories.
Model predictive control attempts to generate feasible trajectories considering all non-holonomic constraints, without having to completely discretize (graph-based) or partially sample (sampling-based) the continuous space. An embedded dynamic model of the robot is included in the logic. A cost function defines the goals of the autonomous driving problem, and an optimizer iteratively solves this problem, minimizing the cost function and generating the corresponding control actions over a future horizon. Model predictive control also addresses the problem of planning feasible trajectories under driving situations close to the limit. In [18], the parameters of the embedded model were updated over the horizon in situations close to the limit based on a predictive friction estimate to obtain even more reliable and feasible trajectories. However, there was still a reliance on models or estimators that could differ from reality, and the computational consumption was still very high due to the usage of an iterative optimizer.
In general, the main problems of the deliberative paradigm are the challenging design of the cost function, the high computational cost, and the uncertainties of embedded dynamic models, regardless of whether the algorithm is graph-based, sampling-based, or based on model predictive control. In [8], the challenges of designing an effective cost function for the hybrid A* algorithm were addressed. In [19], a comparison of motion planning algorithms was performed in terms of computational cost, highlighting the high computational cost of graph-based and sampling-based methods.
- C. Sense–Think–Act Paradigm
The problem of the uncertainties of embedded models, which eventually compromise the feasibility of the generated trajectory, is addressed by the sense–think–act robotic paradigm. In this paradigm, a model, usually a deep neural network, is trained with real trajectories (supervised learning) to create a path planner based on artificial intelligence [20]. Thus, it avoids the usage of embedded dynamic models. However, the main drawbacks of this paradigm are the need for large databases and the labeling of these databases. Labeling is not only a very time-consuming task, but it can also introduce bias into the results, specifically if it is done manually.
- D. Sense–Explore–Act–Learn Paradigm
The methods based on reinforcement learning techniques attempt to bring a solution to the previously mentioned drawbacks. The drawbacks and the solution proposed by the reinforcement learning approach are summarized in Table 1.
Due to the advantages summarized in the previous table, reinforcement learning techniques have been the subject of extensive study within the autonomous driving domain. In [21], Q-learning was used to find the shortest path from the starting pose to the destination pose. Reinforcement learning techniques with continuous actions and states were introduced in [22], with very promising results. A deep deterministic policy gradient (DDPG) was introduced in [23] to learn from a real scenario how to drive and keep the vehicle inside the road boundaries by providing just a single camera image of the road as the input. However, this work only implemented partial safety requirements (driving within the drivable space without obstacle avoidance) and did not consider any comfort or legal requirements. Additionally, the perception, motion planning, and motion control modules were embedded in the deep neural networks of the deep reinforcement learning algorithm. In [24], the autonomous driving algorithm relied not only on input images from the cameras, as in [23], but also on post-processed data such as velocities or relative distances. In [25], comfort requirements were included for lane change maneuvers as a simple reward function. In contrast to [23], it separated the perception module from the reinforcement learning task, and the agent consumed already post-processed signals from the perception module, such as robot velocities or relative distances, instead of images. The algorithm presented in this work keeps the perception, motion planning, and motion control modules separated from each other. Images are not directly provided to the agent. Instead, all the relevant post-processed information from the perception module is provided to the agent so that it can learn the best possible trajectory based only on variables that have an important impact on motion planning.
Regarding the definition of reward functions, [26] included positively and negatively weighted rewards. The proposed methodology only includes negatively weighted rewards to exclude policies that end up in endless loops of permanently collecting positive rewards instead of reaching the destination. In terms of requirements, ref. [26] included vehicle speed (legal requirements), collisions (safety requirements), and travel distance towards the destination (task-oriented requirements), but comfort requirements were not included. In [27], comfort, legal, and task-oriented requirements (lane keeping) were included, but safety requirements were not, even though they are needed for, for example, overtaking maneuvers in dynamic scenarios. In [28], all requirements were included. However, the comfort requirements were not expressed in terms of acceleration and jerk but in terms of limiting the steering wheel angle (thus only penalizing lateral acceleration while excluding lateral jerk and longitudinal acceleration and jerk). In [28], a mix of sparse and continuous reward functions was used. The proposed methodology only uses continuous reward functions to better guide the reinforcement learning algorithm towards the most valuable rewards.
The methodology proposed is based on the deep deterministic policy gradient (DDPG) algorithm. The DDPG method, like the Q-learning method, is a model-free algorithm. That is, it does not use any embedded model of the robot inside the agent logic. Instead, it learns the optimal policy directly from the environment. The DDPG method is intended for environments with continuous states and actions. This is possible by approximating the actor model (which computes the action $a_t$) with a deep neural network $\mu$ of weights $\theta^{\mu}$ as a function of the state $s_t$ at time $t$ (1):

$$a_t = \mu(s_t \mid \theta^{\mu}) \tag{1}$$
The DDPG algorithm is an actor–critic method and, therefore, the critic function is also approximated with another deep neural network $Q$ of weights $\theta^{Q}$ as a function of the state $s_t$ and the action $a_t$ at time $t$ (2):

$$q_t = Q(s_t, a_t \mid \theta^{Q}) \tag{2}$$
Actor–critic methods combine the benefits of policy-based and value-based methods. They benefit from the good convergence properties of policy-based methods and they also benefit from the sample efficiency of value-based methods, which tend to find an optimal solution faster.
The deterministic policy gradient theorem (3) updates the weights $\theta^{\mu}$ of the actor model $\mu$ in the direction of the maximum value of the critic model $Q$:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\Big[\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_t,\, a=\mu(s_t)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_t}\Big] \tag{3}$$
On the other hand, the Bellman equation (4) is used to iteratively update the weights $\theta^{Q}$ of the critic network $Q$ and eventually learn the optimal critic and actor networks:

$$Q(s_t, a_t \mid \theta^{Q}) = \mathbb{E}\big[r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1} \mid \theta^{\mu}) \mid \theta^{Q})\big] \tag{4}$$
Additionally, this actor–critic method uses a replay buffer and target networks to bring more stability to the learning process. Random observations are selected from the replay buffer to avoid training the neural networks with consecutive and correlated observations. The weights $\theta^{\mu'}$ and $\theta^{Q'}$ of the target networks $\mu'$ and $Q'$ are updated after the weights of the actor and critic networks with a soft factor $\tau$ (5). This lag brings stability to the learning:

$$\theta^{Q'} \leftarrow \tau\, \theta^{Q} + (1 - \tau)\, \theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\, \theta^{\mu} + (1 - \tau)\, \theta^{\mu'} \tag{5}$$
The neural network architectures of the actor and critic models consider several hidden layers (Figure 1). The variables in brackets denote the number of neurons in each layer.
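As an illustration, a minimal Keras sketch of such actor and critic networks is given below. The layer widths (400 and 300 neurons), activations, and output scaling are assumptions for illustration only; the actual sizes are those given in Figure 1.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_actor(state_dim, action_dim, action_scale):
    """Actor network: maps a state to a deterministic action, as in Equation (1)."""
    state_in = layers.Input(shape=(state_dim,))
    x = layers.Dense(400, activation="relu")(state_in)   # hidden layer sizes are assumed
    x = layers.Dense(300, activation="relu")(x)
    # tanh bounds the output, scaled to the commanded linear/angular velocity range
    action_out = layers.Dense(action_dim, activation="tanh")(x)
    action_out = layers.Lambda(lambda a: a * action_scale)(action_out)
    return tf.keras.Model(state_in, action_out)


def build_critic(state_dim, action_dim):
    """Critic network: maps a (state, action) pair to a Q-value, as in Equation (2)."""
    state_in = layers.Input(shape=(state_dim,))
    action_in = layers.Input(shape=(action_dim,))
    x = layers.Concatenate()([state_in, action_in])
    x = layers.Dense(400, activation="relu")(x)
    x = layers.Dense(300, activation="relu")(x)
    q_out = layers.Dense(1)(x)                            # scalar Q-value
    return tf.keras.Model([state_in, action_in], q_out)
```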
Finally, exploration is essential to find the optimal policy. For that purpose, a white noise signal with zero mean is added to the actor–network output (6). The amplitude of the exploration is controlled with the variance $\sigma^{2}$ of the normal distribution $\mathcal{N}$:

$$a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}(0, \sigma^{2}) \tag{6}$$
As the learning process progresses and the actor and critic networks learn the optimal values, less exploration is needed. Therefore, a decay of the exploration was programmed. This decay also helped to stabilize the learning process. The DDPG pseudocode is shown in Algorithm 1:
Algorithm 1. Deep Deterministic Policy Gradient
Init actor $\mu(s \mid \theta^{\mu})$ and critic $Q(s, a \mid \theta^{Q})$ with random weights $\theta^{\mu}$ and $\theta^{Q}$
Init target networks $\mu'$ and $Q'$ with weights $\theta^{\mu'} \leftarrow \theta^{\mu}$ and $\theta^{Q'} \leftarrow \theta^{Q}$
Init replay buffer, discount factor $\gamma$, and soft update factor $\tau$
for episode = 1, …, M do
  Observe initial state $s_1$
  while True
    Select action $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}(0, \sigma^{2})$
    Execute $a_t$ and observe next state $s_{t+1}$, reward $r_t$,
    and termination flag $d_t$
    Record transition $(s_t, a_t, r_t, s_{t+1}, d_t)$ in the buffer
    Sample a batch of $N$ transitions $(s_i, a_i, r_i, s_{i+1}, d_i)$ from the buffer
    Compute $y_i = r_i + \gamma\, Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
    Update $\theta^{Q}$ by minimizing the loss function L:
      $L = \frac{1}{N} \sum_i \big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^2$
    Update $\theta^{\mu}$ with the sampled policy gradient:
      $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_i \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}$
    Update $\theta^{Q'}$ and $\theta^{\mu'}$:
      $\theta^{Q'} \leftarrow \tau\, \theta^{Q} + (1 - \tau)\, \theta^{Q'}$
      $\theta^{\mu'} \leftarrow \tau\, \theta^{\mu} + (1 - \tau)\, \theta^{\mu'}$
    if $d_t$ == True
      break
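The update step inside the loop of Algorithm 1 could be written in TensorFlow roughly as follows. This is only a sketch: it assumes the actor, critic, and corresponding target models from the previous network sketch, and the optimizer choice, learning rates, and hyperparameter values are illustrative assumptions rather than the exact settings used in this work.

```python
import tensorflow as tf

gamma, tau = 0.99, 0.005                      # discount and soft-update factors (assumed values)
critic_opt = tf.keras.optimizers.Adam(1e-3)   # assumed learning rates
actor_opt = tf.keras.optimizers.Adam(1e-4)


@tf.function
def ddpg_update(states, actions, rewards, next_states, dones):
    # Critic update: minimize the Bellman error of Equation (4)
    target_actions = target_actor(next_states)
    y = rewards + gamma * (1.0 - dones) * target_critic([next_states, target_actions])
    with tf.GradientTape() as tape:
        q = critic([states, actions])
        critic_loss = tf.reduce_mean(tf.square(y - q))
    critic_grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(critic_grads, critic.trainable_variables))

    # Actor update: deterministic policy gradient of Equation (3)
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([states, actor(states)]))
    actor_grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(actor_grads, actor.trainable_variables))

    # Soft update of the target networks, Equation (5)
    for target, source in ((target_actor, actor), (target_critic, critic)):
        for t_var, s_var in zip(target.variables, source.variables):
            t_var.assign(tau * s_var + (1.0 - tau) * t_var)
```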
Before reaching the maximum number of learning episodes, the learning process can be terminated if some specific conditions are met. When these conditions are reached, the policy is considered mature enough to fulfill the requirements. There are mainly two conditions: first, the number of total episodes must be greater than a minimum number. With this condition, we ensured that the exploration rate decayed until almost 0 to avoid very noisy and unstable policies. Second, no termination due to non-compliance with the safety, legal, or comfort requirements must be detected for a minimum number of episodes to ensure proper learning. In other words, the termination of the learning episode must be related only to task-oriented requirements, such as the robot reaching the destination pose. If the episode is terminated because the robot crashes into an obstacle, the counter used to consider the policy mature enough is reset to 0.
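One possible reading of this stopping rule in code is sketched below; the function and parameter names (min_episodes, clean_episodes_required) and their values are illustrative assumptions, not the thresholds used in this work.

```python
def learning_finished(episode, clean_episode_counter,
                      min_episodes=150, clean_episodes_required=30):
    """Return True when the policy is considered mature enough.

    clean_episode_counter counts consecutive episodes terminated only by
    task-oriented conditions (e.g., destination reached); it is reset to 0
    whenever an episode ends due to a safety, legal, or comfort violation.
    """
    exploration_decayed = episode >= min_episodes          # exploration has decayed to ~0
    requirements_met = clean_episode_counter >= clean_episodes_required
    return exploration_decayed and requirements_met
```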
The agent, which is based on the DDPG algorithm, interacts with the environment. Important concepts of the environment are the state variables, reward functions, and motion control.
- A. State Variables
The state variables are essential during the reinforcement learning process (7). One important property is that the state variables must be capable of generalization so that the policy obtained after the learning process can solve unseen scenarios. Therefore, absolute variables, such as the absolute distance traveled by the robot, are avoided. A set of relative distances with respect to the robot's position, the relative orientation to the destination, and dynamic variables such as the robot's linear and angular velocities and accelerations are selected. The relative distances come from a lidar sensor, which covers 180 degrees of the robot's front view. To smoothly reduce the linear velocity before reaching the destination, the relative distance between the robot and the destination is also included.
Figure 2 shows an image of the previously mentioned state variables:
For the learning process to be stable, the state variables must fulfill the Markov property, formulated in (8):

$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) \tag{8}$$

According to (8), the future state $s_{t+1}$ only depends on the current state $s_t$ and the action taken. This means that all the information from the earlier history is irrelevant for computing the future state $s_{t+1}$. To make it possible for the state variables to fulfill the Markov property, they must include not only relative distances but also dynamic variables like the linear and angular velocities and accelerations. Therefore, they are included in the list of state variables in (7).
Finally, it is necessary to make sure that all state variables can be measured with sensors. The relative distances come from a lidar sensor mounted on the top of the robot, covering 180 degrees of the front view. The linear and angular velocities and accelerations of the robot can be measured with encoders mounted on the driven wheels and IMU sensors. The distance and orientation to the destination pose can be calculated based on camera images. The destination pose can be recognized by using a special color that the camera detects. Based on that recognition and after proper calibration of the camera parameters, the distance to the destination pose and the relative orientation of the robot to the destination pose can be calculated.
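A sketch of how such a state vector could be assembled from the sensor readings is shown below; the function name, the argument names, and the grouping of variables are assumptions consistent with the description of (7), not the exact implementation.

```python
import numpy as np


def build_state(lidar_ranges, dest_distance, dest_heading,
                v, omega, a_lin, a_ang):
    """Assemble the state vector consumed by the agent (cf. Equation (7)).

    lidar_ranges  : relative distances over the 180-degree front view (lidar)
    dest_distance : relative distance to the destination pose (camera)
    dest_heading  : relative orientation to the destination pose (camera)
    v, omega      : linear and angular velocities (encoders/IMU)
    a_lin, a_ang  : linear and angular accelerations (IMU)
    """
    return np.concatenate([
        np.asarray(lidar_ranges, dtype=np.float32),
        np.array([dest_distance, dest_heading, v, omega, a_lin, a_ang],
                 dtype=np.float32),
    ])
```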
- B. Reward Functions
The reward functions are the key to the reinforcement learning algorithm. A reward function does not define which action to take. Instead, the reward functions measure how good or bad it is to end up in a new state after an action is taken. This measurement of good or bad is closely related to the sensation of pain or pleasure. Based on the reward functions, the reinforcement learning algorithm automatically learns a complex critic function to navigate the robot autonomously. In contrast to the reward function, the critic function defines, for a given state, how good an action is before the robot performs that action. Compared with state-of-the-art algorithms like the lattice planner or model predictive control, the critic function is equivalent to the cost function. In these state-of-the-art algorithms, the cost function is usually a single and very complex function, which needs to be manually designed to fulfill the requirements of the autonomous driving problem. The reinforcement learning algorithm, on the other hand, automatically learns this cost function (critic function) from simple, multiple, and very intuitive reward functions. The critic function therefore gathers all the experience seen during the learning process to build a complex function that fulfills all requirements. To learn this complexity, deep neural networks are usually selected as policy and critic approximation functions.
There are some important rules to follow before defining the reward functions:
- In problems with continuous actions and states, or in the case of very large discrete spaces, it is very convenient to shape the reward functions instead of using sparse rewards. The reason is that the robot learns by means of exploration. If the robot starts exploring randomly, the probability of it ever seeing a sparse reward can be very low, leading either to an unsuccessful learning process because the robot never finds the reward or to an extremely long learning process. With shaped rewards (rewards with a smooth continuous gradient), on the other hand, the robot can adapt its actions in the direction that maximizes the cumulative reward right from the beginning.
- It is better to avoid purely negative rewards. If the reward functions are always negative, excluding 0, the reinforcement learning might find a solution by terminating the learning episode as soon as possible rather than continuing to drive and thereby accumulating more negative rewards.
- Positive rewards are also tricky, and they must be treated carefully. If the robot finds a positive reward, the policy might end up in a loop that keeps collecting that reward instead of completing the task.
- The learning scenarios must always be carefully designed to facilitate finding rewards from a beneficial starting situation.
Regarding the type of rewards, they are divided into safety, legal, comfort, and task-oriented requirements. The safety, legal, and comfort requirements are based on the pain that the robot might suffer if they are not fulfilled. On the other hand, task-oriented requirements are based on the sensation of pleasure that is obtained by fulfilling the proposed tasks. Therefore, depending on the robot’s mission, task-oriented requirements can be built following different criteria.
- (1) Safety requirements
Safety requirements implement a penalization, or negative reward, if the robot crashes into obstacles. Situations where the robot drives too close to an obstacle are also penalized even when no contact between the robot and the obstacle happens. For that reason, a safety margin of distance dlimit around the obstacle footprint is defined, as represented in Figure 3. A very simple reward function (9) is therefore defined: 0 if the robot drives outside the safety margin, and a linear interpolation between 0 and −1 if the robot drives inside the safety margin. The variable d is the relative distance from the robot to the obstacle.
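Read this way, the safety reward of (9) can be sketched as the function below. The exact parameterization of the interpolation (0 at the safety-margin boundary, −1 at the obstacle footprint) is an assumption based on the description above.

```python
def safety_reward(d, d_limit):
    """Safety reward, cf. (9): 0 outside the safety margin, linearly
    interpolated between 0 and -1 inside it.

    d       : relative distance from the robot to the closest obstacle
    d_limit : width of the safety margin around the obstacle footprint
    """
    if d >= d_limit:
        return 0.0
    # assumed endpoint: -1 when the robot reaches the obstacle footprint (d = 0)
    return -(d_limit - d) / d_limit
```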
- (2) Legal requirements
Legal requirements limit the maximum speed at which the robot can travel. The robot must control its linear speed so that it does not exceed this maximum value. The implementation of this reward function is very simple as well (10). There are two speed thresholds. If the speed does not exceed the lower threshold, the reward is 0. Between the lower and the upper threshold, the reward is linearly interpolated between 0 and −1. The lower threshold can be set equal to the legal maximum velocity, and the upper threshold can be set a little higher than the legal maximum velocity.
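A sketch consistent with this description follows. The parameter names v_low and v_high are placeholders for the two thresholds, and clipping the penalty to −1 above the upper threshold is an assumption.

```python
def legal_reward(v, v_low, v_high):
    """Legal reward, cf. (10): 0 below the lower speed threshold, linear
    interpolation between 0 and -1 between the two thresholds, -1 above.

    v      : current linear speed of the robot
    v_low  : lower threshold (e.g., the legal maximum speed)
    v_high : upper threshold (slightly above the legal maximum speed)
    """
    if v <= v_low:
        return 0.0
    if v >= v_high:
        return -1.0                      # assumed clipping above the upper threshold
    return -(v - v_low) / (v_high - v_low)
```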
- (3) Comfort requirements
Comfort can be defined as the pleasant feeling of the people traveling inside the robot. Feelings of discomfort can come from different sources. The three main sources that can decrease the feeling of comfort are listed below:
- High acceleration values (longitudinal and lateral/angular)
- High jerk values (longitudinal and lateral/angular)
- Low-frequency acceleration over long periods of time (feeling of sickness)
In this work, the first source of discomfort is implemented. The reward function (11) defines a window given by two acceleration thresholds. Accelerations below the lower threshold are not penalized. Accelerations between the lower and the upper threshold are penalized using linear interpolation. The penalization is clipped to −1 if the acceleration is higher than the upper threshold.
Reward functions for longitudinal and angular accelerations are added together to build the final comfort function.
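The comfort reward of (11) can be sketched analogously to the previous terms. Taking the absolute value of the acceleration and reusing the same helper for the longitudinal and angular terms are assumptions; any renormalization of the summed term back to the [−1, 0] range is not reproduced here.

```python
def comfort_reward(acc, a_low, a_high):
    """Comfort reward, cf. (11): 0 below a_low, linear interpolation between
    0 and -1 between a_low and a_high, clipped to -1 above a_high."""
    acc = abs(acc)                       # assumed: penalize the acceleration magnitude
    if acc <= a_low:
        return 0.0
    if acc >= a_high:
        return -1.0
    return -(acc - a_low) / (a_high - a_low)


def total_comfort_reward(a_lin, a_ang, lin_limits, ang_limits):
    """Sum of the longitudinal and angular comfort terms, as described above."""
    return (comfort_reward(a_lin, *lin_limits)
            + comfort_reward(a_ang, *ang_limits))
```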
- (4) Task-oriented requirements
It is possible to define multiple and specific tasks to be accomplished during autonomous navigation. In this work, the robot is intended to reach a specific destination pose. For that purpose, the robot is rewarded for facing the right direction towards the destination pose. The robot is also rewarded for traveling at the maximum legal speed. For the first task, if the direction to the destination pose (determined by visual sensors) is not the same as the robot's heading, the robot is penalized according to (12). If the destination cannot be detected by the visual sensors, the destination reward is set to 0 and the robot moves forward within the drivable space.
For the second task, the robot is penalized if its linear speed is not the maximum legal one (13). Since legal requirements are also implemented, the robot is penalized according to (10) if it exceeds the maximum legal speed. In this reward function, the target linear velocity is equal to the maximum legal velocity.
Finally, the reward due to the task-oriented requirements is obtained by adding the single functions (14).
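A hedged sketch of the task-oriented terms (12)–(14) is given below. The argument names and the use of normalized absolute errors are assumptions, since only the qualitative shape of the penalties is described above.

```python
import math


def heading_reward(theta_dest, theta_robot, detected):
    """Destination-heading reward, cf. (12): penalize misalignment between the
    robot heading and the direction to the destination; 0 if the destination
    is not detected by the visual sensors."""
    if not detected:
        return 0.0
    error = abs(math.atan2(math.sin(theta_dest - theta_robot),
                           math.cos(theta_dest - theta_robot)))  # wrapped to [0, pi]
    return -error / math.pi                                      # assumed normalization to [-1, 0]


def speed_reward(v, v_target):
    """Target-speed reward, cf. (13): penalize deviation from the maximum legal speed."""
    return -min(abs(v - v_target) / v_target, 1.0)               # assumed normalization


def task_reward(theta_dest, theta_robot, detected, v, v_target):
    """Task-oriented reward, cf. (14): sum of the single task terms."""
    return heading_reward(theta_dest, theta_robot, detected) + speed_reward(v, v_target)
```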
- (5) Final reward function
The final reward function can be simply built by adding the single reward functions (15). Each reward function is normalized between −1 and 0. A weighted addition of the single reward functions is performed to prioritize the requirements. In general, the order of importance within the requirements is as follows: safety, legal, comfort, and task-oriented (16).
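Under these priorities, the final reward of (15) and (16) might be combined as sketched below. The specific weight values are illustrative assumptions that only respect the ordering safety > legal > comfort > task-oriented; the actual weights used in this work are not reproduced here.

```python
def total_reward(r_safety, r_legal, r_comfort, r_task,
                 w_safety=1.0, w_legal=0.7, w_comfort=0.4, w_task=0.2):
    """Final reward, cf. (15): weighted sum of the normalized single rewards,
    with weights ordered according to (16). Weight values are assumed."""
    return (w_safety * r_safety + w_legal * r_legal
            + w_comfort * r_comfort + w_task * r_task)
```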
- C. Motion Control
The motion control strategy depends on the type of robot to be controlled. The commanded signals from the agent are the linear and angular velocities. The motion control takes care of the actuator control to fulfill the commanded velocities. In this work, a differential robot was used for the validation. Considering that a differential robot traveling at a constant linear speed must also fulfill non-holonomic constraints, the differential robot is a simple but valid platform to validate the methodology. The commanded linear and angular velocities of the robot were converted into wheel rotational speeds by means of the kinematic equations of the differential robot (17) and (18).
The differential robot geometry and kinematic variables are shown in Figure 4.
Finally, two independent PI controllers control the commanded velocities by adapting the duty cycles of the electrical motors connected to the driven wheels.
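For completeness, a sketch of the kinematic conversion of (17) and (18) and of one wheel-speed PI controller is shown below. The symbols for the wheel radius and track width follow the standard differential-drive model and are not necessarily the notation of Figure 4; the controller gains and the duty-cycle saturation range are assumptions.

```python
def wheel_speeds(v, omega, wheel_radius, track_width):
    """Differential-drive kinematics, cf. (17)-(18): convert the commanded linear
    speed v [m/s] and angular speed omega [rad/s] into right/left wheel
    rotational speeds [rad/s]."""
    omega_right = (v + omega * track_width / 2.0) / wheel_radius
    omega_left = (v - omega * track_width / 2.0) / wheel_radius
    return omega_right, omega_left


class PIController:
    """One of the two independent wheel-speed PI controllers; the output is the
    motor duty cycle, saturated here to [-1, 1] (assumed range)."""

    def __init__(self, kp, ki, dt):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.integral = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        duty = self.kp * error + self.ki * self.integral
        return max(-1.0, min(1.0, duty))
```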
Figure 5 shows the final cascade control of the robot's motion. An external motion planning loop based on deep reinforcement learning techniques generates the requested linear and angular speeds of the robot, depending on the robot's state measured from the environment. These requested linear and angular speeds are forwarded to an internal control loop based on classical PID controllers. The internal loop measures the actual linear and angular speeds of the robot for closed-loop control.
3. Results
All the experiments were carried out in the CoppeliaSim simulator (Figure 6). A Python script controlled the connection with the simulation by means of APIs. The main libraries used to build the reinforcement learning algorithm were TensorFlow and Gym. TensorFlow and Keras provide the necessary functions to build the actor and critic neural network models. Gym provides the necessary tools to implement and compare reinforcement learning algorithms. The actor neural network (policy) produced linear and angular speed set points, which were converted into left and right wheel speeds of the differential robot. These wheel speeds were sent to the simulator via APIs. The simulator implemented PID controllers to control these speeds in a closed loop.
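The overall interaction between the agent and the simulator can be pictured with the Gym-style wrapper below. The class name, the constants, and all calls on the sim_client object are placeholders; the actual CoppeliaSim remote-API interface used in this work is not reproduced here.

```python
import gym
import numpy as np

NUM_STATE_VARS = 32      # assumed size of the state vector of (7)
WHEEL_RADIUS = 0.03      # assumed wheel radius [m]
TRACK_WIDTH = 0.2        # assumed track width [m]


class RobotNavigationEnv(gym.Env):
    """Hypothetical Gym wrapper around the CoppeliaSim scene (placeholder calls)."""

    def __init__(self, sim_client):
        self.sim = sim_client                      # handle to the remote-API connection
        # actions: commanded linear and angular velocity set points (normalized)
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        # observations: state vector of Equation (7)
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf,
                                                shape=(NUM_STATE_VARS,), dtype=np.float32)

    def reset(self):
        self.sim.reset_scene()                     # placeholder: restart the simulation
        return self.sim.read_state()               # placeholder: build the state vector

    def step(self, action):
        v_cmd, omega_cmd = float(action[0]), float(action[1])
        # convert to wheel speeds via the differential-drive kinematics (17)-(18)
        w_r = (v_cmd + omega_cmd * TRACK_WIDTH / 2.0) / WHEEL_RADIUS
        w_l = (v_cmd - omega_cmd * TRACK_WIDTH / 2.0) / WHEEL_RADIUS
        self.sim.set_wheel_speeds(w_l, w_r)        # placeholder: send set points to PID loops
        self.sim.advance()                         # placeholder: advance one simulation step
        state = self.sim.read_state()
        reward = self.sim.compute_reward()         # placeholder: rewards (9)-(15)
        done = self.sim.episode_terminated()
        return state, reward, done, {}
```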
The experiments defined within this section can be divided into two main groups. First, an experiment or set of experiments used for the learning process. Second, an experiment or set of experiments used for the validation process. It was important to use different experiments for the validation process to prove that the methodology proposed can be generalized to new unseen scenarios after the learning process is completed.
For validation purposes, an experiment including a complex scenario with dynamic and static obstacles was set up. The scenario (see Figure 7) was carefully chosen to validate the following maneuvers:
- (1) Follow-up
- (2) Overtaking
- (3) Lane keeping
- (4) Lane change
Before the robot could perform in this validation experiment, it needed to learn how to drive autonomously. For the learning purpose, less complex experiments were set up. In this case, it was important to carefully select the experiments and scenarios so that the robot could easily find rewards and learn a policy that would enable it to generalize and drive autonomously in other unseen and more complex scenarios. Therefore, the learning scenario (Figure 8) was specially adapted so that reward values were seen early in the learning process, better guiding the policy towards the right strategy.
The static obstacle on the left-hand side reduced the drivable space of the robot. This allowed the robot to receive a negative reward right from the beginning if it got too close to the walls, which delimited the boundaries of the circuit. Additionally, the starting pose of the robot was not too far away from the first dynamic obstacle. This helped the robot to quickly see negative rewards while approaching the dynamic obstacle and to adapt the policy and critic functions to follow the obstacle instead of crashing into it. The resulting trajectory of the robot after the learning process is shown in Figure 9. The robot's linear velocity was reduced to follow the dynamic obstacle at the beginning of the circuit, and the commanded angular velocity was always smooth to maximize comfort (Figure 10).
Figure 11 shows the episodic reward as a function of the learning episodes. The episodic reward increased towards 0 after each learning episode. There was even room for improvement, as the episodic reward did not seem to be completely flat after more than 200 learning episodes. It is interesting to note the recovery processes in the episodic reward plot, that is, the episodic reward suddenly went down and then up again in the following episodes. This was due to two different factors: first, the exploration ratio sometimes led the robot to collide with the obstacles. In the beginning, when the policy was still immature and the exploration ratio was high, the recovery peaks appeared more frequently. Second, when updating the policy following the gradient descent approach, the step or learning rate could be too large, producing a policy that crashed the robot into the obstacles. For these two reasons, a decay for both the exploration and the learning rate was implemented. This is why the recovery peaks became less frequent as the learning process went on. The total time needed to automatically learn the policy was 6.57 h. The system used for the learning process was a desktop computer with a 3.40 GHz AMD Ryzen 5 2600 six-core processor and 8 GB of RAM (no GPU installed).
Finally, the traveled path of the robot in the validation scenario is shown in Figure 12. The first dynamic obstacle blocked the way, and the robot needed to adapt its linear speed to follow the obstacle. Once the robot detected enough free space, it overtook the obstacle. The second and third obstacles traveled at opposite speeds. Lane change maneuvers were needed to overtake both obstacles. The linear and angular velocities and accelerations are shown in Figure 13. The robot safely avoided all static and dynamic obstacles of the circuit while traveling at the maximum legal linear speed (except in the follow-up sector). In addition, the longitudinal and angular accelerations did not exceed the maximum comfort levels defined for this scenario. Finally, the robot reached the destination pose and ramped down its linear velocity to a standstill. All of this happened in a scenario never used during the learning process, which shows that the policy learned during the learning process can be generalized to new and unseen scenarios.
There are several key points marked with capital letters in Figure 12 and Figure 13. These points are explained in detail in Table 2.
Another interesting autonomous driving use case is a robot exploring inside an open-world scenario. For this purpose, the experiment depicted in Figure 14 was used as a validation scenario to test the methodology in open-world scenarios.
The scenario used for the learning process was a simplified open-world scenario to speed up the learning process. The results after the learning process was completed are shown in Figure 15 for different destination poses.
In all three cases from Figure 15, the robot reached the destination pose under safety, legal, and comfort conditions. Nevertheless, the robot still entered the safety margin at some points (beginning of the blue path) and got close to some obstacles. This was because the termination conditions of the learning process were met while the agent was still learning. Figure 16 shows precisely that the agent was still learning, since the episodic reward was still not flat after more than 200 learning episodes (red line).
However, since the learning process took a long time, a relaxed termination condition needed to be defined to find a trade-off between the performance of the resulting policy and the total learning time. In this case, the total learning time went up to 6.1 h. If a more powerful system with a server and GPUs were used, the termination condition could have been adapted to check that the episodic reward no longer increases over the learning episodes, ensuring in that way a complete learning process.
Finally, a comparison was performed between different strategies to create reward functions. First, a mix of positive (robot driving at the desired speed) and negative (collision, comfort, legal, and direction towards the destination) reward functions was created, similar to the ones used in many algorithms of the literature reviewed in the Materials and Methods section. Second, a variant with strictly negative reward functions, such as the one proposed in this paper, was chosen as a reference for comparison. The results are shown in Figure 17. The variant with positive and negative terms found a suboptimal solution where the robot kept driving to collect more rewards.
These results show the effectiveness of reinforcement learning techniques in dynamic scenarios, which are usually the most complex to solve, and in open-world scenarios, which are common for robots whose task is to explore their surroundings. Maneuvers such as obstacle avoidance, overtaking, follow-up, lane keeping, and lane change were successfully validated in the previous scenarios.