1. Introduction
Mobile robots have recently played a key role in the field of autonomous driving [1,2,3,4,5] and in application fields such as delivery and service operations in buildings, hospitals, and restaurants [6,7,8]. To fulfill these roles, three essential processes are required: perception of the surrounding environment, decision-making to generate a path to the destination, and control of the robot's motion. First, in the perception process, the robot obtains information about its surroundings using sensors such as cameras or LiDAR. Next, the decision-making process involves path planning. One representative technique is Simultaneous Localization and Mapping (SLAM), which allows the robot to navigate based on a map of its surroundings [9]. Finally, in the control process, the robot regulates its motion by following the reference path generated in the decision-making process. Many studies have been conducted on SLAM-based autonomous driving [10]; however, implementation is complex because accurate perception and navigation in environments with unexpected dynamic obstacles require multiple sensors or deep learning. To alleviate the issue associated with dynamic obstacles, Dang et al. [11] modified SLAM by implementing sensor fusion and dynamic object removal methods. They achieved accurate position estimation and map construction through dynamic object detection and removal techniques that integrate radar, cameras, LiDAR, and other sensors. However, if an error occurs when estimating the position and motion of a dynamic object, it may be difficult to remove the object accurately. In addition, the complexity of synchronizing multiple sensors remains, making the approach challenging to apply even with these improvements for dynamic environments. Xiao et al. [12] proposed Dynamic-SLAM, which combines SLAM and deep learning. The authors of [12] used a single-shot detector based on a convolutional neural network (CNN) to detect dynamic objects and enhanced the detection recall through a compensation algorithm for missed objects. As a result, this method demonstrated improved performance in the presence of dynamic obstacles based on visual SLAM. However, this approach still requires an environment map to be constructed, and the CNN requires a large amount of data. To address these issues, many researchers have studied the use of reinforcement learning for autonomous driving. Unlike SLAM, which requires additional adjustments to handle dynamic obstacles, reinforcement learning-based autonomous driving can adapt to changes in the environment through the learned optimal policy. Applying reinforcement learning allows the optimal policy needed for autonomous driving to be found with just a single sensor. Even when considering the additional burden associated with deep learning, it reduces the effort required for data collection. Therefore, it can be considered less complex to implement than SLAM. These advantages have attracted the interest of many researchers [13,14,15,16,17,18,19,20].
Reinforcement learning mimics the direct, experiential type of learning found in humans. Generally, humans employ two types of learning: indirect learning through observation and direct learning through hands-on experience. Traditionally, machine learning, which simulates the indirect learning method, has produced outstanding results in object recognition and image classification. Similarly, just as people develop the ability to make split-second decisions through experience, research is ongoing to develop neural networks capable of quick decision-making through reinforcement learning, aimed at handling complex tasks. The main goal of reinforcement learning is to find an optimal policy that achieves an objective through the interaction of an agent with its environment. The agent observes the environment and determines the optimal action. After executing the action, the agent receives a reward as feedback. Based on this process, the agent finds the optimal policy that maximizes the cumulative reward. Generally, the rewards obtained during experience collection provide meaningful information only after the episode has ended; in this case, finding the optimal policy can be difficult because most rewards are uninformative. This is referred to as the sparse reward problem, and it becomes more pronounced in the autonomous driving of mobile robots in environments with dynamic obstacles, where the goals of reaching the destination and avoiding obstacles must be achieved simultaneously. Two techniques can be applied to alleviate this problem. The first, reward shaping, complements the reward system with specific terms that provide sufficient information about the goals based on domain knowledge of the task. Jesus et al. [21] successfully implemented reward shaping for the indoor autonomous driving of mobile robots. However, their reward function reflected only information about the destination and did not adequately address the goal of obstacle avoidance. The second technique is hindsight experience replay (HER), which generates alternative success episodes by extracting partial trajectories from failed episodes [22]. HER increases the number of successful experiences in the learning database by re-evaluating failed experiences as successes with respect to virtual objectives. Consequently, it promotes the exploration of various routes, increasing the probability of reaching the actual destination. In our previous study, we employed HER to implement reinforcement learning-based autonomous driving of a mobile robot. The agent was trained in a simple driving environment in simulation, and we demonstrated that the proposed method operates effectively in both simulated and real environments [23]. HER has also been widely applied in mobile robotics and robot arm control [24,25,26,27,28]. Both reward shaping and HER have individually been used to implement reinforcement learning-based autonomous driving in dynamic environments. However, to the best of our knowledge, no attempts have been made to utilize both methods together for handling dynamic environments, which was the objective of the present study.
We designed a reinforcement learning method for dynamic environments with moving obstacles by combining the concepts of multifunctional reward shaping and HER. First, by adopting the concept of reward shaping, specific information about the environment is reflected in the reward function to guide the agent toward its goals: the reward function incorporates destination information for navigation to the destination and obstacle information for obstacle avoidance. Additionally, we employed HER, which regenerates successful episodes from failed ones and thus addresses the data imbalance between successful and failed episodes. Consequently, the proposed method improves the policy optimization process. To validate its effectiveness, we performed an autonomous driving experiment comparing the following methods, all based on the deep deterministic policy gradient (DDPG):
- (1)
- (2) Only reward shaping (goal-based) [21].
- (3) Only reward shaping (proposed method).
- (4)
- (5) Proposed method.
3. Materials and Methods
3.1. Mobile Robot and Environmental Configuration
In this study, we used the TurtleBot 3 Burger, as shown in Figure 1a. The robot is equipped with two Dynamixel motors on the left and right sides, which transfer power to the two wheels. The OpenCR controller is used to control these wheels. Additionally, a laser distance sensor is mounted at the top of the robot, allowing it to measure distances around the robot over a 360° range. The detection range of this sensor is 0.12 m to 3.5 m. The system is controlled using a Raspberry Pi 3B+ board.
ROS (Robot Operating System) is a software platform for developing robot applications; it serves as a meta-operating system that runs on top of conventional operating systems such as Linux, Windows, and Android. Communication in ROS is generally categorized into three types: topics, services, and actions. Topic communication involves one-way message transmission, service communication entails a bidirectional request and response, and action communication employs a bidirectional message exchange with feedback.
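As a minimal illustration of the topic mechanism used in this setup, the following rospy sketch subscribes to the LDS scan and publishes velocity commands. The topic names /scan and /cmd_vel are the conventional TurtleBot 3 ones and the node name and fixed command values are assumptions for this sketch, not details taken from our implementation.

```python
# Minimal rospy sketch of topic communication (assumed node/topic details).
import rospy
from geometry_msgs.msg import Twist
from sensor_msgs.msg import LaserScan

def on_scan(msg):
    # msg.ranges holds the 360-degree LDS distance readings (0.12-3.5 m).
    rospy.loginfo("front distance: %.2f m", msg.ranges[0])

rospy.init_node("rl_agent_example")
rospy.Subscriber("/scan", LaserScan, on_scan)                # one-way topic: sensor -> agent
cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)   # one-way topic: agent -> wheels

rate = rospy.Rate(10)
while not rospy.is_shutdown():
    cmd = Twist()
    cmd.linear.x, cmd.angular.z = 0.1, 0.0                   # constant forward motion as a demo
    cmd_pub.publish(cmd)
    rate.sleep()
```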
Figure 1b illustrates the TurtleBot3 and the experimental environment within the 3D simulator Gazebo. In this environment, the destination is set randomly when driving starts. Driving starts from the center of the space; whenever the destination is reached, navigation restarts from that destination point. The dynamic obstacles consist of four cylindrical structures that rotate with a fixed radius. Gazebo allows the creation of environments similar to the real world, reducing the time and cost of development and enhancing convenience. Moreover, it has good compatibility with ROS.
In this study, as shown in Figure 2, a reinforcement learning system was implemented in the Gazebo simulation using ROS, utilizing the sensor values of the mobile robot and topic communication between nodes. A step is defined as the process in which the robot executes the action determined by the reinforcement learning algorithm, receives a reward, and completes the transition to the next state. As a result of this process, a single transition $(s_t, a_t, r_t, s_{t+1})$ is generated, consisting of the current state $s_t$, the action $a_t$, the reward $r_t$, and the next state $s_{t+1}$. An episode is defined as the trajectory observed from when driving begins until the goal is achieved or a failure (collision or timeout) occurs. Success is defined as reaching the destination, while failure includes colliding with an obstacle and not reaching the destination within a limited number of actions (timeout). A trajectory is defined as the connected sequence of transitions resulting from the steps performed within an episode.
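To make the transition and buffer terminology concrete, the sketch below stores one transition per step; the field names and the buffer capacity are illustrative assumptions rather than our exact implementation.

```python
# Illustrative transition record and replay buffer (assumed field names/capacity).
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, *args):
        # One step: the agent acts, receives a reward, and observes the next state.
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # Random minibatch used to update the actor and critic networks.
        return random.sample(self.buffer, batch_size)
```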
3.2. Reinforcement Learning Parameters
For reinforcement learning, it is necessary to define the states, actions, and rewards. The states and actions are described in this subsection, and the reward system is described in detail in Section 3.3.
Figure 3 illustrates the states used in the experiments.
The state $s_t$ can be expressed by Equation (5), where $L_t$ denotes the set of distance values measured in front of the robot by the laser distance sensor (LDS) every 18 degrees over a total of 180 degrees:

$$ s_t = \left( L_t,\; d_{g,t},\; \theta_{g,t},\; a_{t-1},\; d_{o,t},\; \theta_{o,t} \right). \quad (5) $$

The variable $d_{g,t}$ denotes the linear distance between the robot's current coordinates $(x_r, y_r)$ and the destination coordinates $(x_g, y_g)$ as follows:

$$ d_{g,t} = \sqrt{(x_g - x_r)^2 + (y_g - y_r)^2}. \quad (6) $$

The variable $\theta_{g,t}$ denotes the angular difference between the robot's yaw value $\psi_t$ and the direction of the destination:

$$ \theta_{g,t} = \operatorname{atan2}(y_g - y_r,\; x_g - x_r) - \psi_t. \quad (7) $$

The variable $a_{t-1}$ denotes the immediately previous action, defined as follows:

$$ a_{t-1} = (v_{t-1},\; \omega_{t-1}), \quad (8) $$

where $v_{t-1}$ and $\omega_{t-1}$ denote the robot's linear and angular velocities for that action, respectively. The variable $d_{o,t}$ denotes the linear distance between the coordinates of the robot $(x_r, y_r)$ and those of the closest obstacle $(x_o, y_o)$:

$$ d_{o,t} = \sqrt{(x_o - x_r)^2 + (y_o - y_r)^2}. \quad (9) $$

The variable $\theta_{o,t}$ denotes the direction of the obstacle closest to the robot relative to the robot's heading:

$$ \theta_{o,t} = \operatorname{atan2}(y_o - y_r,\; x_o - x_r) - \psi_t. \quad (10) $$

Figure 4 demonstrates the components of the action. The action $a_t$ is constructed from the linear velocity $v_t$ and the angular velocity $\omega_t$, along with exploration noise $N_t$:

$$ a_t = (v_t,\; \omega_t) + N_t. \quad (11) $$
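The sketch below shows how such a state vector and noisy action might be assembled in Python. The function names, the Gaussian form of the exploration noise, and the per-step inputs are illustrative assumptions rather than our exact implementation.

```python
# Illustrative assembly of the state vector (Eqs. 5-10) and noisy action (Eq. 11).
import numpy as np

def build_state(lds_front, robot_xy, robot_yaw, goal_xy, prev_action, obstacle_xy):
    """lds_front: LDS readings sampled every 18 deg over the front 180 deg."""
    dx, dy = goal_xy[0] - robot_xy[0], goal_xy[1] - robot_xy[1]
    d_goal = np.hypot(dx, dy)                              # Eq. (6)
    theta_goal = np.arctan2(dy, dx) - robot_yaw            # Eq. (7)
    ox, oy = obstacle_xy[0] - robot_xy[0], obstacle_xy[1] - robot_xy[1]
    d_obs = np.hypot(ox, oy)                               # Eq. (9)
    theta_obs = np.arctan2(oy, ox) - robot_yaw             # Eq. (10)
    return np.concatenate([lds_front,
                           [d_goal, theta_goal, *prev_action, d_obs, theta_obs]])

def select_action(actor, state, noise_std=0.1):
    # Deterministic policy output (v, w) plus exploration noise N_t, Eq. (11).
    v, w = actor(state)
    return np.array([v, w]) + np.random.normal(0.0, noise_std, size=2)
```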
3.3. Design of Reward System
A sparse reward system can be expressed as in Equation (12). If the distance between the robot and its destination is less than 0.15 m after performing an action, the outcome is defined as a success and a reward of +500 is returned. If the robot collides with a wall or an obstacle, a reward of −550 is returned instead. In all other cases, where no such terminal transitions occur, every action receives a reward of −1:

$$ r_t = \begin{cases} +500, & d_{g,t} < 0.15\ \mathrm{m}, \\ -550, & \text{collision}, \\ -1, & \text{otherwise}. \end{cases} \quad (12) $$
The proposed reward system, expressed in Equation (13), augments the sparse reward with reward shaping that includes specific information related to the destination and the obstacles. The rewards for reaching the destination and for collisions remain the same as in Equation (12), whereas the reward for every other action is the sum of a destination-related term $r_{goal}$ and an obstacle-related term $r_{obs}$:

$$ r_t = \begin{cases} +500, & d_{g,t} < 0.15\ \mathrm{m}, \\ -550, & \text{collision}, \\ r_{goal} + r_{obs}, & \text{otherwise}. \end{cases} \quad (13) $$

To provide detailed information about the destination, $r_{goal}$ is based on the change in the distance to the destination. When the distance decreases, a positive reward proportional to the change in distance is generated, whereas when the distance increases, a fixed negative reward is applied instead:

$$ r_{goal} = \begin{cases} c_1\,(d_{g,t-1} - d_{g,t}), & d_{g,t} < d_{g,t-1}, \\ -c_2, & \text{otherwise}, \end{cases} $$

where $c_1$ and $c_2$ are positive constants. Distance information associated with the obstacles is included through $r_{obs}$, ensuring that the optimal policy avoids the moving obstacles. Its penalty and advantage components are exponential functions of the change in the distance to the closest obstacle, $d_{o,t-1} - d_{o,t}$, so that a rapidly growing penalty is applied as the robot approaches the obstacle and a rapidly growing advantage is applied as it moves away.
As shown in Figure 5a, the negative reward increases rapidly when the distance from the robot to the obstacle in the current state is smaller than it was in the previous state. In contrast, as shown in Figure 5b, the positive reward increases rapidly when the distance to the obstacle in the current state is larger than it was in the previous state.
Remark 1: Assigning an intentionally larger weight to the collision reward emphasizes the significance of the least desirable event (a collision) commonly encountered in dynamic environments. This weighting aims to instill a recognition of the risk associated with collision states during the learning process. Additionally, a fixed penalty is used in the destination-related function; it continuously imposes a penalty whose magnitude is similar to the maximum positive reward, aiding policy formation for reaching the destination. In designing the obstacle-related rewards, exponential functions of a similar form are used for both the advantage and the penalty, introducing a graded perception of the risk associated with collisions. In addition, a large-scale function is employed because of the limited conditions and time in which it operates, so that it exerts a robust influence when active.
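The following Python sketch summarizes the reward logic described above. The terminal rewards (+500, −550) and the 0.15 m success threshold come from the text; the shaping constants C1, C2, C3 and the exact exponential form of the obstacle term are illustrative assumptions, not the values used in our experiments.

```python
# Sketch of the shaped reward; C1, C2, C3 and the exponential form are assumed.
import math

C1, C2, C3 = 200.0, 8.0, 10.0   # assumed scaling constants

def shaped_reward(d_goal, d_goal_prev, d_obs, d_obs_prev, collided):
    if d_goal < 0.15:            # success: destination reached
        return 500.0
    if collided:                 # collision with a wall or an obstacle
        return -550.0
    # Destination term: proportional bonus when approaching, fixed penalty otherwise.
    r_goal = C1 * (d_goal_prev - d_goal) if d_goal < d_goal_prev else -C2
    # Obstacle term: exponentially growing penalty/advantage with the distance change.
    delta_obs = d_obs_prev - d_obs            # > 0 means the robot moved closer
    r_obs = -C3 * math.expm1(delta_obs) if delta_obs > 0 else C3 * math.expm1(-delta_obs)
    return r_goal + r_obs
```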
3.4. Constituent Networks of the DDPG
The DDPG consists of an actor network, which approximates the policy, and a critic network, which evaluates the value of the policy. Both networks are based on a multilayer perceptron (MLP) structure comprising fully connected layers. To ensure learning stability, target networks are also constructed for each network.
Figure 6 illustrates the structure of the actor network. This network takes the state $s_t$ as input, which passes through two hidden layers, each consisting of 500 nodes, to generate two values representing the linear and angular velocities.
Figure 7 illustrates the critic network. The input consists of two components: the state $s_t$ and the action $a_t$. Each passes through a hidden layer of 250 nodes, and the intermediate outputs are combined in a second hidden layer of 500 nodes. Finally, the network generates a single output value, namely the Q-value for the given state and action, which is used to evaluate the policy.
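A PyTorch sketch of the two networks is given below. The layer sizes follow the description above, while the activation functions and the output scaling of the actor are assumptions.

```python
# Illustrative actor/critic structures (layer sizes from the text; activations assumed).
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, action_dim), nn.Tanh(),   # bounded linear/angular velocity
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.state_branch = nn.Linear(state_dim, 250)    # 250-node state branch
        self.action_branch = nn.Linear(action_dim, 250)  # 250-node action branch
        self.merge = nn.Sequential(
            nn.Linear(500, 500), nn.ReLU(),              # second hidden layer, 500 nodes
            nn.Linear(500, 1),                           # scalar Q-value
        )

    def forward(self, state, action):
        h = torch.cat([torch.relu(self.state_branch(state)),
                       torch.relu(self.action_branch(action))], dim=-1)
        return self.merge(h)
```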
The actor network is trained to maximize the Q-value, i.e., the output of the critic network. Its parameters are updated by gradient ascent, which is implemented as gradient descent on the loss $L_{actor}$ defined as follows:

$$ L_{actor} = -\frac{1}{N} \sum_{i} Q\big(s_i, \mu(s_i \mid \theta^{\mu}) \mid \theta^{Q}\big), $$

where $\mu$ denotes the deterministic policy and $\theta^{Q}$ and $\theta^{\mu}$ denote the weights of the critic and actor networks, respectively. The critic network also updates its parameters using gradient descent, but its loss function is defined as a smooth L1 loss between the Q-value and the target label $y_i$:

$$ L_{critic} = \frac{1}{N} \sum_{i} \mathrm{smooth}_{L1}\big( Q(s_i, a_i \mid \theta^{Q}),\; y_i \big), $$

where $y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$, $\gamma$ is the discount factor, and $\theta^{\mu'}$ and $\theta^{Q'}$ denote the weights of the target actor and target critic networks, respectively.
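The update step implied by these losses can be sketched as follows. The discount factor, the soft-update rate TAU, and the optimizer handling are placeholder assumptions, not the settings used in our experiments.

```python
# Sketch of one DDPG update step; GAMMA, TAU, and optimizers are assumed placeholders.
import torch
import torch.nn.functional as F

GAMMA, TAU = 0.99, 0.005

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt):
    # Batch tensors of shape (N, .); r and done are (N, 1) float tensors.
    s, a, r, s_next, done = batch
    # Critic: smooth L1 loss between Q(s, a) and the bootstrapped target y.
    with torch.no_grad():
        y = r + GAMMA * (1.0 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.smooth_l1_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: gradient ascent on Q, implemented as descent on its negation.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft update of the target networks toward the online networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - TAU).add_(TAU * p.data)
```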
3.5. Generating Alternate Data with HER
In reinforcement learning, the common approach to collecting training data is to store transitions $(s_t, a_t, r_t, s_{t+1})$ in a memory buffer. These transitions are generated after the agent performs an action, and the learning process is initiated once a certain number of transitions have accumulated in the buffer. In this process, finding the optimal policy is challenging because of the low probability of achieving the goal through exploration, and this difficulty is exacerbated in environments with sparse rewards. To address this issue, we implement HER by re-evaluating failed episodes to create successful trajectories. Algorithm 1 illustrates the detailed procedure. $G$ is the set of states selected from a failed trajectory to be re-evaluated as new destination states. A failed episode occurs when the robot collides with a wall or an obstacle or experiences a timeout, and HER is applied three times to each failed episode. In the case of a collision, the states 5, 25, and 50 steps before the final state of the trajectory are designated as the new destinations. As shown in Figure 8, trajectories from the initial position to these new destinations are extracted. The white trajectory represents the original unsuccessful path, while the blue, green, and yellow trajectories signify new successful paths whose destinations are the states 5, 25, and 50 steps before the terminal state, respectively. Rewards are then recalculated, contributing to the generation of successful experiences.
Algorithm 1. The hindsight experience replay algorithm applied in this study. HER is applied three times to each failed episode, whether it ends in a collision or a timeout, generating diverse paths as new successful experiences.
Algorithm 1. Hindsight Experience Replay
 1: given the episode termination time T
 2: after the episode terminates:
 3:   G ← ∅   ▷ set of new destination states selected from the failed trajectory
 4:   if the terminal state is a collision:
 5:     G ← {s_{T−5}, s_{T−25}, s_{T−50}}
 6:   if the terminal state is a timeout:
 7:     G ← {s_{T−50}, s_{T−150}, s_{T−250}}
 8:   for each new destination g′ ∈ G do
 9:     for t = 0, 1, …, T do
10:       recompute the reward r′_t with respect to g′
11:       if s_t is the new destination g′:
12:         break
13:       store the transition (s_t ∥ g′, a_t, r′_t, s_{t+1} ∥ g′)   ▷ ∥ denotes concatenation
14:     end for
15:   end for
When a timeout occurs, the states 50, 150, and 250 steps before the terminal state are designated as the new destinations, as illustrated in Figure 9. The white trajectory denotes the original failed path, while the blue, green, and yellow paths represent successful trajectories whose destinations are the states 50, 150, and 250 steps before the terminal state, respectively. Rewards are then recalculated, contributing to the generation of successful experiences.
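A Python sketch of this relabeling procedure is given below, reusing the illustrative replay buffer from the earlier sketch. The per-step tuple layout and the hypothetical recompute_reward helper are assumptions; the step offsets follow the text.

```python
# Sketch of HER relabeling for failed episodes (tuple layout and helpers assumed).
import numpy as np

COLLISION_OFFSETS = (5, 25, 50)      # new goals for collision failures
TIMEOUT_OFFSETS = (50, 150, 250)     # new goals for timeout failures

def apply_her(episode, failure_type, replay_buffer, recompute_reward):
    """episode: list of (state, action, next_state, next_xy) tuples of one failed run."""
    T = len(episode)
    offsets = COLLISION_OFFSETS if failure_type == "collision" else TIMEOUT_OFFSETS
    for k in offsets:
        if k >= T:
            continue                                  # episode too short for this offset
        new_goal = np.asarray(episode[T - 1 - k][3])  # position k steps before the end
        for state, action, next_state, next_xy in episode:
            r = recompute_reward(next_xy, new_goal)   # reward w.r.t. the virtual goal
            done = np.linalg.norm(np.asarray(next_xy) - new_goal) < 0.15
            # Concatenate the virtual goal to the states, as in Algorithm 1.
            replay_buffer.store(np.concatenate([state, new_goal]), action, r,
                                np.concatenate([next_state, new_goal]), done)
            if done:                                  # virtual destination reached
                break
```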
5. Discussion
In the field of autonomous mobile robot navigation, the primary goal is to reach the destination while avoiding obstacles. SLAM-based techniques have successfully implemented autonomous navigation in indoor environments by relying on pre-mapped surrounding information. However, they face limitations when unexpected dynamic obstacles appear or when internal elements of the indoor environment change, necessitating map reconstruction. Research is therefore underway to apply reinforcement learning to the autonomous driving of mobile robots. In a related study, reward shaping was applied with the DDPG, but adaptability to dynamic obstacles needed improvement. To address this limitation, this study proposes a technique that applies HER and multifunctional reward shaping simultaneously, with the objective of achieving autonomous driving that effectively handles dynamic obstacles. Verification through test driving in both simulation and real-world environments demonstrates the effectiveness of our approach. HER proves valuable by generating successful experiences from failed ones, addressing the imbalance in experience data and aiding the search for optimal policies. The multifunctional reward shaping continuously provides information about the goal and the obstacles, facilitating the discovery of policies that avoid obstacles while reaching the destination. The training success rate of the proposed technique reached 80%. From the perspective of overall driving success, our method achieved a success rate of over 95% in both simulation and real-world test driving, validating its effectiveness. Notably, despite differences between the training and real-world environments, the 95% navigation success rate highlights the adaptability of the reinforcement learning-based autonomous driving technique to environmental changes.
Compared to SLAM techniques, our proposed approach exhibits advantages in environmental adaptability. This study demonstrates that intuitive ideas, such as those presented in our technique, can enhance performance and offer advantages in terms of implementation complexity. This underscores the adaptability of reinforcement learning-based autonomous driving technology to dynamic environmental changes.
6. Conclusions
We propose a technique that adopts the concepts of both multifunctional reward shaping and HER to implement reinforcement learning-based autonomous driving of a mobile robot in a dynamic environment. Reward shaping is used to design a reward system that induces actions to reach the destination while avoiding obstacles; the specific reward system was constructed by designing functions that provide information about the destination and the obstacles, respectively. In addition, to balance the experiences of failure and success, we implemented HER, which generates success experiences from failure experiences. The proposed method therefore addresses the sparse reward problem and aids in finding an adaptive optimal policy in dynamic environments. The proposed approach, combining the reward shaping and HER techniques, was validated through simulation and real-world test driving, demonstrating its effectiveness in finding optimal policies. In particular, it demonstrated effectiveness in finding adaptable optimal policies, as evidenced by the high success rate in real-world environments that differ from the training setting.