1. Introduction
Autonomous vehicles (AVs) are designed to navigate unexpected and potentially dangerous driving conditions by engaging in fallback maneuvers or achieving minimum risk conditions when necessary [1,2]. For example, an AV may need to pull over to the roadside if a critical sensor fails, or swiftly change lanes to avoid a suddenly stopping vehicle ahead [3]. Current AVs from industry leaders such as Tesla, Mobileye [4], and Waymo operate at Level 2 or 3 autonomy, as defined by the Society of Automotive Engineers (SAE) [1]. AVs can handle certain risky situations at these levels, but they are engineered to transfer control back to a human driver when the system reaches its limits. Therefore, drivers must be ready to take over, either from within the vehicle or from a remote location. The ultimate aim of AV technology is to achieve Level 4+ (Levels 4 and 5) autonomy, where vehicles can handle all driving tasks independently without human intervention, even in complex driving scenarios.
Understanding the evolution of AVs requires familiarity with the SAE-defined levels of autonomy. Starting from Level 0, where there is no automation, the scale progresses to Level 1, with basic driver assistance, such as cruise control. At Level 2, we observe partial automation that can control steering and acceleration under certain conditions. Level 3 allows the vehicle to take over many driving functions yet still requires a driver to intervene when prompted. Level 4 introduces the capability of the vehicle to operate independently in most environments, and Level 5 represents full automation: a vehicle that is fully autonomous in every driving scenario, eliminating the need for any human intervention. The leap to autonomy in Levels 4 and 5 entails a complex blend of technologies enabling vehicles to navigate and respond to all driving situations autonomously.
Progression towards higher levels of AV technology is poised to yield transformative societal rewards such as increased safety by reducing driver error, enhanced mobility for those who cannot drive, and increased free time for drivers on the move. As we advance towards Level 4+ automation, the implications for society are immense, offering not only safer and more accessible transportation options but also the promise of streamlined traffic management, the reshaping of transportation across personal and public domains, and environmental benefits via lowered emissions. Achieving this high level of autonomous operation necessitates a harmonious blend of cutting-edge technologies, including advanced machine learning, sensor fusion, and computer vision, coupled with the development of inter-vehicle and vehicle–infrastructure communication capabilities. The evolution of these technologies is expected to lead to a future in which intelligent, interconnected vehicles will form an efficient smart transportation network.
Generic autonomous driving system (ADS) research has two primary focus areas: development and validation. The ADS research development phase focuses on the design of algorithms, hardware, and software for autonomous navigation. A holistic ADS is operationalized through a sequential triad of perception, decision-making, and control components. Employing this mechanism, an AV interprets its immediate driving environment by analyzing the data gathered from an ensemble of sensors. In the decision-making phase, viable actions are determined based on this interpretation. During the control phase, the AV executes the selected maneuver, ensuring adherence to its intended trajectory.
The decision-making component is the most challenging to optimize within the ADS architecture. This challenge arises primarily from two factors. First, the system must manage a variety of unpredictable and potentially dangerous situations; even situations that pose the same level of risk can differ depending on when and where they occur. Second, the decision-making process itself is complex: the best course of action for a vehicle can vary significantly depending on the specific fallback scenario, and several viable actions are often available in response to a single situation.
To address these issues, researchers typically employ two main methodologies: (1) heuristic-based techniques [5,6] and (2) data-driven learning strategies [7,8,9].
A heuristic-based approach in ADS development primarily relies on expert-defined rules to guide vehicular decision-making. These rules simplify the complex decisions encountered by vehicles, drawing on expert knowledge and the fundamental principles of physics. One primary advantage of heuristic methods is their predictability and transparency, which render vehicle behavior foreseeable and interpretable. Moreover, these methods offer faster deployment than certain data-driven methods. Historically, heuristics were found in early ADS Levels 1–2, such as advanced driver assistance systems (ADAS), exemplified by lane-keeping and emergency braking systems. However, they struggle to address many complex and dynamic driving scenarios and often require frequent updates. Their primary limitation is the dependence on human intervention, which compromises the objective of achieving ADS Level 4+. Orzechowski et al. (2020) developed a scalable decision-making framework for automated driving utilizing a heuristic approach, emphasizing a hierarchical behavior-based architecture [10]. This system employs modular behavioral units combined in a bottom-up manner to devise more intricate driving behaviors. Although state-of-the-art automated-vehicle platforms often utilize finite-state machines, these lack transparency, scalability, and ease of maintenance. The proposed model merges knowledge-based systems with behavior-based systems to harness the strengths of both. However, its reliance on human intervention limits its adaptability.
Data-driven learning approaches, particularly those rooted in deep learning, have led to advancements in ADS. These methods train an ADS to recognize patterns and make informed decisions by harnessing extensive datasets from diverse driving conditions. Reinforcement learning, an offshoot of this approach, enables the system to learn optimal strategies iteratively by interacting with the environment. Such data-driven techniques demonstrate heightened adaptability, can handle intricate driving scenarios that may not be hard-coded, and often outperform heuristic methods in unpredictable situations; data-driven and reinforcement learning methods therefore represent significant progress in vehicle autonomy. They introduce a dynamic learning process grounded in real-world experience, in contrast to more fixed heuristic approaches. Their effectiveness, particularly for tasks such as object detection, highlights their potential applications.
Consequently, these methods are increasingly considered the way forward and are likely to dominate the subsequent stages of autonomous driving. Owais et al. (2020) proposed a novel algorithm for generating Pareto-optimal paths in stochastic transportation networks by considering the variability in travel times correlated with traffic flow [11]. The algorithm utilizes a multi-objective analysis to provide a set of optimal paths for each origin–destination (O-D) pair based on successive simulations that reflect changing traffic conditions within different time slots. However, this study did not explicitly discuss the computational complexity or scalability of the algorithm, which may be important when applying this method to vast and complex networks. It also does not address the potential limitations in the accuracy of traffic-flow predictions, which may affect the reliability of the generated routes. Alshehri et al. (2023) presented a novel application of deep residual neural networks to estimate O-D trip matrices utilizing traffic-flow data from strategically placed sensors, inversely addressing the conventional traffic assignment problem [12]. This research seeks to determine the correct O-D matrix from traffic counts and the optimal number and placement of sensors for efficient data gathering. Despite this innovative approach, the study may face challenges, such as the need for high-quality historical traffic data for model training, the computational complexity of the deep learning model, and practical constraints on sensor placement in real-world scenarios. Pini et al. (2023) developed SafePathNet, a novel machine-learning system designed for trajectory prediction and planning in AVs [13]. SafePathNet utilizes a unified neural network to anticipate future pathways for AVs and the surrounding road agents, avoiding conventional rule-based methodologies. It integrates the attention mechanism of a transformer module with an expert-mixing strategy to predict diverse trajectories while prioritizing safety and reliability. SafePathNet chooses a trajectory in real-time decision-making based on safety metrics and the predicted likelihood. Simulator and real-world tests have underscored its safety and effectiveness, distinguishing it from contemporary data-driven methods. However, this study did not focus on achieving ADS Level 4+.
By leveraging deep reinforcement learning, the proposed system significantly advances the adaptive capabilities of AVs, aiming at ADS Level 4+ autonomy. Deep reinforcement learning overcomes the limitations of heuristic and static rule-based systems, such as fuzzy logic and expert systems, enabling AVs to interpret extensive sensor data and iteratively refine their driving strategies from real-world interactions. Fuzzy logic copes with ambiguity and expert systems execute preprogrammed rules, but neither can self-optimize in response to novel stimuli. In contrast, our deep reinforcement learning approach excels at assimilating experiences from unpredictable and intricate driving conditions, fostering a learning mechanism that continually enhances decision-making and adaptability. This empowers AVs with the decision-making agility required for high-risk environments, ensuring robust performance across a diverse spectrum of real-world driving situations to achieve ADS Level 4+.
Second, a validation phase for AVs and their ADS is imperative to ensure their safety, reliability, and functionality. Although a wide array of tests is utilized in the development and validation process, the two critical testing methods for AVs and ADS are (1) miles driven [14,15,16,17,18] and (2) scenario-based testing [19,20,21,22,23,24]. The other validation approaches are presented in Appendix A.
Miles-driven validation quantifies the performance of an ADS by measuring the distance traveled by an AV without significant safety incidents. This method offers a direct and easily comprehensible metric and provides an ADS with a genuine and varied real-world driving experience. However, it exhibits several limitations. Relying solely on mileage does not necessarily represent the diversity and complexity of the driving scenarios a vehicle encounters. Moreover, accruing substantial mileage for testing can be both resource intensive and time consuming, particularly when the objective is to validate a system’s response to rare events. Shalev-Shwartz et al. (2017) asserted that evaluating the safety of AVs utilizing a miles-driven approach is problematic [4]. This method measures the number of miles that an AV drives without an accident and compares its safety with that of human drivers. To statistically prove that an AV is as safe as a human, it would need to drive 30 million miles, given a specific fatality probability. If the goal is for AVs to become significantly safer, this requirement increases to 30 billion miles. In addition, any software update can render previous tests obsolete, and there is concern regarding whether these miles genuinely represent real-world conditions.
Scenario-based validation emphasizes the evaluation of an ADS within specific predefined scenarios. These scenarios depict challenging or critical driving situations that can be drawn from real-world datasets or constructed based on conceivable yet complex events. This method enables a more structured evaluation, ensuring that the ADS is tested across a variety of scenarios. The focused nature of this approach facilitates efficient testing, particularly in situations where assessing safety and performance is paramount. It also offers a mechanism for repeatedly reproducing challenging driving situations, which may be infrequent in standard driving but are vital for comprehensive validation. However, this methodology has several challenges. Constructing a comprehensive scenario library is essential, and this process can be demanding.
Furthermore, despite extensive scenario libraries, the inherent unpredictability of real-world driving is not always captured. Galko et al. (2014) introduced a vehicle-hardware-in-the-loop system designed to prototype and validate an ADAS for vehicles [25]. The SERBER system provides a scenario-based testing platform for ADAS in vehicles. Instead of outdoor testing or conventional test benches, SERBER utilizes a controlled environment to simulate real-world scenarios. This approach addresses the consistency and safety challenges of conventional methods. By replicating real-world scenarios indoors without large spaces or high-speed mobile bases, SERBER offers a safe and effective ADAS validation process.
In our quest to advance ADS, we adopted a data-driven learning method rather than a heuristic-based method. This approach enables our system to adapt and improve its performance utilizing real-world data, enhancing its power and efficiency. When assessing ADS performance, we did not rely solely on mileage. Instead, we prioritized simulation and evaluation under specific driving scenarios. This approach is instrumental for developing a safer and more reliable ADS by providing a clear understanding of the performance of the proposed system under challenging situations.
Here, we propose a decision-making algorithm based on deep reinforcement learning for a fallback scenario. Decision-making in fallback scenarios is an optimization task in which an autonomous vehicle selects an appropriate optimal maneuver under certain risky driving conditions. The designed fallback scenario involves three vehicles driving on a highway: the vehicle in front of the ego vehicle decelerates rapidly while the vehicle in the right lane drives at high speed. We derived three expected optimal maneuvers in advance: slow following, lane change, and lane change after yielding. We then formulated the given fallback scenario in terms of deep reinforcement learning. In addition, we adopted a proportional–integral–derivative (PID) control algorithm for low-level control and provided closed-form equations. The ego vehicle was then trained in the fallback scenario employing the proposed decision-making algorithm.
The contributions of this study include significant advancements in the algorithm development and testing methodologies for ADS.
Learning Optimal Policy without Human Intervention: We introduced an algorithm that mimics human-driver decision-making. Uniquely developed from scratch, it exhibits exceptional adaptability across different fallback scenarios. Rooted in a “learning from scratch” approach, it possesses inherent flexibility and versatility. Because the algorithm captures the characteristics of human decision-making, it can respond efficiently to a myriad of diverse driving situations.
Scenario-Based Simulation Testing for Efficient Validation: The test system setup process was streamlined, minimizing the required time to a few hours. A notable aspect of our methodology is its open-source framework, which ensures broad accessibility at no cost. While conventional deterministic and heuristic methods often struggle in varied environments, reinforcement learning enhances the adaptability of our model, enabling it to address fallbacks even in uncharted situations.
Pioneering Level 4+ ADS Deployment: Previous efforts utilizing scenario-based reinforcement learning primarily targeted Level 1–3 autonomy, wherein human intervention still played a role. Our research attempts to enhance ADS for Level 4+ operations, wherein the system operates autonomously. Unlike earlier studies with potentially vague objectives, our approach was methodically tailored with a clear objective to achieve Level 4+ autonomy. This emphasizes our contributions to evolutionary and groundbreaking ADS.
Thus, this study provides innovative algorithmic developments, advanced testing techniques, and a pioneering vision for next-generation ADS autonomy.
The remainder of this paper is organized as follows: Section 2 examines the background knowledge on deep reinforcement learning, which forms the basis for the proposed decision-making algorithm. In Section 3, a fallback scenario is designed, and the optimal fallback is derived with a heuristic method. Section 4 formulates the designed fallback scenario with respect to deep reinforcement learning. Section 5 details the low-level control algorithm, system, network, and hyperparameters of the decision-making algorithm. Section 6 evaluates and discusses whether the maneuver learned through reinforcement learning matches the optimal avoidance maneuver derived with the heuristic method in the proposed fallback scenario. Finally, Section 7 summarizes the proposed research and describes future research directions.
2. Background
The proposed decision-making algorithm is primarily based on reinforcement learning. Recently, naïve reinforcement learning has incorporated artificial neural networks from supervised learning as a plugin to overcome its typical drawbacks (e.g., discrete state representations) and achieve synergy through collaboration. Various advanced deep reinforcement learning algorithms have been developed, for example, deep deterministic policy gradient (DDPG), trust region policy optimization (TRPO), and proximal policy optimization (PPO). These advanced algorithms have demonstrated outstanding results in various applications. However, we implemented the proposed algorithm as a deep Q-network (DQN), which, although older, is adequate for our research for the following reasons. The DQN is a neural network that renders the state space continuous while keeping the action space discrete. Furthermore, it has a lower computational cost than the above algorithms.
Reinforcement learning is a machine-learning algorithm that learns optimal decision-making patterns based on sequential interaction data between an agent and the environment. The aim is to determine the optimal policy $\pi^*$. In the current state $s_t$, with the optimal policy, the agent selects the optimal action $a_t$ that is expected to yield the largest total reward in the future:

$a_t = \pi^*(s_t) = \arg\max_{a} Q(s_t, a)$,

where $Q(s, a)$ denotes the state–action value function. A state–action value function that adopts an artificial neural network is called deep reinforcement learning. Because the artificial neural network contains weights $\theta$, we denote the function as $Q(s, a; \theta)$ and refer to it as the Q-network. As the interaction continues, more accurate and diverse driving data containing collision and/or collision-free experiences are utilized to train the Q-network. Therefore, the Q-network yields increasingly accurate value estimates.
Sequential interaction data for learning the Q-network were collected through the interaction between the agent and the environment. The agent receives the state $s_t$ from the environment and outputs the action $a_t$. The environment receives the action of the agent and outputs the next state $s_{t+1}$ and reward $r_t$. Through one interaction, data are collected as the tuple $(s_t, a_t, r_t, s_{t+1})$.

The training data for the Q-network are constructed such that the input is the state $s_t$, and the output is the temporal difference (TD) target $y_t$, calculated from the interaction data as

$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$,

where $a'$ is an arbitrary next action and $\gamma$ is the discount factor.

The Q-network is an approximation function obtained by minimizing the error between its estimates and the training-output data. The loss function of the Q-network is expressed as

$L(\theta) = \mathbb{E}\big[\,(y_t - Q(s_t, a_t; \theta))^2\,\big]$.

The gradient is as follows:

$\nabla_\theta L(\theta) = \mathbb{E}\big[\,-2\,(y_t - Q(s_t, a_t; \theta))\,\nabla_\theta Q(s_t, a_t; \theta)\,\big]$.

The network is updated with a batch, and stochastic gradient descent is utilized to minimize the objective function by following this gradient.
Reinforcement learning algorithms repeat learning until an optimal policy is found.
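For illustration, a minimal sketch of this DQN update in Python is given below. It uses the modern TensorFlow/Keras API rather than the Keras 2.1.5 release reported in Section 6.1, and the network size, discount factor, and variable names are assumptions for illustration rather than the exact implementation used in this study; the nine-dimensional state and nine discrete actions follow Section 4.

```python
# Minimal DQN update sketch (illustrative; not the exact implementation of this study).
# Assumptions: 9-dimensional state, 9 discrete actions (Section 4), gamma = 0.99.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

STATE_DIM, NUM_ACTIONS, GAMMA = 9, 9, 0.99

def build_q_network():
    # Q(s, a; theta): maps a state to one Q-value per discrete action.
    model = Sequential([
        Dense(64, activation="relu", input_shape=(STATE_DIM,)),
        Dense(64, activation="relu"),
        Dense(NUM_ACTIONS, activation="linear"),
    ])
    model.compile(loss="mse", optimizer=Adam(learning_rate=1e-3))
    return model

def train_on_batch(q_net, states, actions, rewards, next_states, dones):
    """One stochastic-gradient step on the TD targets y_t = r_t + gamma * max_a' Q(s_{t+1}, a')."""
    q_values = q_net.predict(states, verbose=0)     # current estimates Q(s_t, .)
    next_q = q_net.predict(next_states, verbose=0)  # bootstrap estimates Q(s_{t+1}, .)
    targets = q_values.copy()
    for i in range(len(states)):
        y = rewards[i] if dones[i] else rewards[i] + GAMMA * np.max(next_q[i])
        targets[i, actions[i]] = y                  # only the taken action is supervised
    # Minimizing the MSE between Q(s_t, a_t) and y_t realizes the loss L(theta) above.
    return q_net.train_on_batch(states, targets)
```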
The remainder of the detailed implementation-associated background, such as hyperparameters, is explained in Section 5.
4. Fallback Scenario Formulation
To formulate a given fallback scenario, we mapped it to reinforcement learning concepts, such as agent, environment, and learning goals.
The agent denotes the ego vehicle, and the environment denotes the driving conditions, including the ego vehicle and two other vehicles, Vehicles A and B. The learning objective of the ego vehicle is to reach the target destination without colliding with Vehicles A and B and without crossing either road boundary.
4.1. State and Perception
The state $s$ comprises the physical information that the ego vehicle collects to perceive surrounding vehicles and the road environment. It comprises nine parameters, as expressed in Equation (1):

$s = [\,d_g,\; y,\; \psi,\; d_{x,A},\; d_{y,A},\; \psi_A,\; d_{x,B},\; d_{y,B},\; \psi_B\,]$.   (1)

The details of the coordinates and spatial variables are presented in Figure 2. These nine parameters represent three types of information: the goal-related, localization, and relationship states. The goal-related state is the information utilized to achieve the learning goal. The ego vehicle attempts to reach the goal line, so the goal-related state is the longitudinal relative distance $d_g = x_g - x$ between the current longitudinal position $x$ of the ego vehicle and the goal line $x_g$. The localization state represents the information through which the ego vehicle perceives its own location in the environment. The chosen localization state parameters were the lateral absolute position $y$ and the rotational absolute yaw angle $\psi$, as illustrated in Figure 2a. The longitudinal absolute position $x$ was not included separately because it is already contained in the goal-related state $d_g$, which is the distance between the ego vehicle and the goal position. The relationship state is the information between the ego vehicle and the other vehicles. This state is scalable depending on the number of surrounding vehicles, and each relationship state has three degrees of freedom per vehicle: one longitudinal and one lateral relative distance and one relative rotation angle, as illustrated in Figure 2b. The ego vehicle therefore has six relationship state parameters: $d_{x,A}$, $d_{y,A}$, and $\psi_A$ for Vehicle A and $d_{x,B}$, $d_{y,B}$, and $\psi_B$ for Vehicle B.
The state $s$, which includes these spatial variables, serves as the training input for the deep reinforcement learning algorithm, allowing the autonomous system to make informed decisions based on comprehensive environmental awareness. By processing this input, deep reinforcement learning can train AVs to execute complex tasks, such as navigation and obstacle avoidance, and to learn optimal strategies through repeated interactions with various simulated scenarios.

Additional parameters, such as velocity and acceleration, could also be considered as part of the state. However, this information is correlated with the distance information; in other words, part of it is already contained in the chosen parameters. Hence, velocity and acceleration need not be included in the state; they would be duplicate or trivial.
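A minimal sketch of how such a nine-dimensional state vector could be assembled is given below; the helper function, argument names, and pose convention are hypothetical and merely restate the structure described above.

```python
import numpy as np

def build_state(ego_pose, goal_x, pose_a, pose_b):
    """Assemble the nine-dimensional state vector described in Section 4.1.

    Each pose is assumed to be a (x, y, yaw) tuple in meters and radians;
    these names and the helper itself are illustrative, not the study's code.
    """
    ex, ey, eyaw = ego_pose
    goal_related = [goal_x - ex]                         # longitudinal distance to the goal line
    localization = [ey, eyaw]                            # lateral position and yaw of the ego vehicle
    relationship = []
    for ox, oy, oyaw in (pose_a, pose_b):                # Vehicles A and B
        relationship += [ox - ex, oy - ey, oyaw - eyaw]  # relative longitudinal, lateral, rotational terms
    return np.array(goal_related + localization + relationship, dtype=np.float32)

# Example with the initial conditions of Section 6.2 (ego at x = 1.00 m, goal line at 5.00 m).
s0 = build_state((1.00, 0.15, 0.0), 5.00, (2.00, 0.15, 0.0), (0.00, -0.15, 0.0))
assert s0.shape == (9,)
```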
4.2. Action and Abstract Control
The action $a$ comprises nine abstract actions for fallback maneuvers, as expressed in Equation (2):

$a \in \{a_1, a_2, \ldots, a_9\}$.   (2)

Each action of the ego vehicle contains commands for the longitudinal and lateral controls. The details of each action are presented in Table 1.
Each action has specific control commands for longitudinal and rotational control. Longitudinal control is performed with a velocity command, which is expressed explicitly. However, lateral control is performed by specifying the driving lane, a command that is expressed implicitly. PID control with sensor fusion is necessary to render the lateral control explicit. Further details are provided in Appendix B.
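As an illustration of how the implicit lane command can be rendered explicit, the sketch below converts a target lane into an angular-velocity command with a simple PID law; the gains, lane-center values, and class structure are assumptions for illustration and do not reproduce the closed-form equations of Appendix B.

```python
class LateralPID:
    """Illustrative PID controller turning a lane command into an angular velocity.

    The gains and the lane-center lookup are hypothetical; Appendix B gives the
    closed-form equations actually used for low-level control in this study.
    """

    LANE_CENTERS_Y = {"left": 0.15, "right": -0.15}  # lateral lane centers in meters (assumed)

    def __init__(self, kp=2.0, ki=0.0, kd=0.5, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def angular_velocity(self, target_lane, current_y):
        # Error is the lateral offset between the target lane center and the ego vehicle.
        error = self.LANE_CENTERS_Y[target_lane] - current_y
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        # Classic PID law: u = Kp*e + Ki*integral(e) + Kd*de/dt.
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Usage: command the right lane while the ego vehicle is centered in the left lane.
pid = LateralPID()
omega = pid.angular_velocity("right", current_y=0.15)  # negative value steers toward the right lane
```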
4.3. Reward for Achieving the Goal
An ego vehicle must be designed to reach the goal line without collision. A reward function was defined to solve the fallback scenario. The reward function provides success, failure, and transitional rewards to encourage safe driving. The reward for success is reaching the destination without colliding with nearby vehicles.

A reward is a key component that affects learning; a poor reward design results in a poor policy or even divergence of learning. Rewards can be classified as episodic, transitional, or conditional. An episodic reward is provided when a learning episode is complete: when an episode ends with the goal achieved, the agent is awarded a positive reward of +100; otherwise, the agent does not receive this reward. A transitional reward is awarded at each (possibly conditional) time step during an episode. Depending on the design, it can be constant or variable. Furthermore, both types of reward can be awarded simultaneously.
The reward $r$ was formulated as the sum of three sub-rewards ($r_1$, $r_2$, $r_3$), as expressed in Equation (3):

$r = r_1 + r_2 + r_3$.   (3)

The first sub-reward $r_1$ is a sparse reward awarded only when the goal is achieved, as expressed in Equation (4). This reward is a strong motivation for the ego vehicle to achieve the goal because it provides a large reward at once compared with the subsequent rewards. Specifically, when the ego vehicle successfully navigated to the target location without colliding, a large reward of +100 was awarded. Conversely, nothing (zero reward) was awarded if the ego vehicle failed and the current episode ended:

$r_1 = \begin{cases} +100, & \text{if } x \ge x_g, \\ 0, & \text{otherwise,} \end{cases}$   (4)

where $x$ and $x_g$ denote the current and target longitudinal positions of the ego vehicle, respectively.
The second sub-reward $r_2$ is a positive and dense reward awarded at every time step after the interaction, as described in Equation (5). This sub-reward serves as a directional indicator, revealing whether the current behavior positively or negatively influences goal attainment. It offers vital navigational insights by progressively rewarding the agent as it approaches the desired outcome. For example, if the ego vehicle starts at 1.0 m and its position at step $t$ is 3.5 m, the reward is 100 × (3.5 − 1.0) = 250. In the next step $t+1$, if the position of the ego vehicle is 2.5 m, the reward is 100 × (2.5 − 1.0) = 150. When comparing the rewards of steps $t$ and $t+1$, it can be concluded that this is a bad case because the position at step $t+1$ is lower than that at step $t$:

$r_2 = c\,(x_t - x_0),$   (5)

where $x_0$ is the initial longitudinal position of the ego vehicle and $c$ is a constant value (100 in the example above).
The third sub-reward $r_3$ is a negative, dense reward awarded at every time step after the interaction, as described in Equation (6). This sub-reward imposes a time penalty to increase the efficiency of the learning process, leading to faster optimal learning. As a practical example, an ego vehicle can reach the target point at low speed; however, such a policy is inefficient in terms of time consumption. Therefore, the ego vehicle must achieve its goal at the highest speed to reduce the penalty and increase the total reward:

$r_3 = -n,$   (6)

where $n$ is the number of steps in an episode.
The expected maximum total reward, according to Equations (3)–(6), was computed as 480. When the ego vehicle reaches the goal point, the maximum value of the first sub-reward is +100. The maximum value of the second sub-reward is 400, because the driving distance from the initial ego-vehicle position of 1.00 m to the goal position of 5.00 m is 4.00 m, giving 100 × 4.00 = 400. To obtain the maximum total reward, the third sub-reward must be minimized; its minimum value is −20, reached when the ego vehicle covers the 4.00 m at the maximum velocity of 0.2 m/s, which requires a minimum of 20 s, that is, 20 steps. However, the ego vehicle must also change lanes; therefore, the actual maximum total reward is slightly less than 480.
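The following sketch restates the reconstructed reward terms of Equations (3)–(6) and reproduces the 480 upper bound; the function name and the one-step-per-second assumption are illustrative.

```python
GOAL_X = 5.00   # goal line position (m)
X0 = 1.00       # initial ego position (m), Section 6.2
C = 100.0       # scaling constant of the second sub-reward

def step_reward(x_t, n_steps, reached_goal):
    """Total reward r = r1 + r2 + r3 at the current step (illustrative reconstruction)."""
    r1 = 100.0 if reached_goal else 0.0   # sparse goal reward, Eq. (4)
    r2 = C * (x_t - X0)                   # dense progress reward, Eq. (5)
    r3 = -float(n_steps)                  # dense time penalty, Eq. (6)
    return r1 + r2 + r3

# Upper bound: goal reached after the minimum 20 steps (4.00 m at 0.2 m/s, one step per second).
assert step_reward(GOAL_X, 20, True) == 480.0
```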
6. Training and Evaluation
To validate the fallback capability of the proposed decision-making algorithm, we conducted a simulated training experiment on an ego vehicle in the given fallback scenario. This section describes the experimental setup, coordinates, and initial conditions. Subsequently, we present and discuss the metrics and training results. Detailed calculations for the initial stage of training are described in Appendix C.
6.1. Experimental Setup
The training simulation was conducted on a computer with the following specifications: Intel Core i7-10700 CPU, NVIDIA GeForce RTX 2060 GPU, and 16 GB of Samsung DDR4 RAM (Samsung Electronics, Republic of Korea). We provide the hardware specifications because the average training time reported in the training results is highly hardware-sensitive.
The learning environment was built on the Gazebo simulator within the ROS framework: Gazebo 9.0.0 and ROS Melodic Morenia on the Ubuntu 18.04 Bionic Beaver OS. Gazebo provides a robust physics engine, high-quality graphics, and convenient programming and graphical interfaces. ROS provides many robotics libraries and interoperates with multiple plugins.
We employed the Turtlebot3 (©ROBOTIS) as the AV, which can be easily reprogrammed [26]. Turtlebot3 is a ROS-based mobile robot that provides low-level control algorithms such as simultaneous localization and mapping, navigation, and collision avoidance. The specifications of the Turtlebot3 are a maximum linear velocity of 0.22 m/s, a maximum angular velocity of 2.84 rad/s (162.72 deg/s), a length of 138 mm, a width of 178 mm, and a height of 192 mm.
The source code was written in Python (2.7.17) utilizing the Keras deep learning library (2.1.5). The code utilized for the training was obtained from GitHub [27] and was based on the ROBOTIS e-manual [26].
6.2. Coordinates and Initial Conditions
The training began with the initial conditions presented in Figure 4 and detailed in Table 3. The origin of the coordinate frame is located at the left center, and the horizontal and vertical lines represent the x- and y-axes, respectively. The goal line is 5.00 m from the origin along the x-axis. The ego vehicle attempts to learn the linear and angular velocities required to reach the goal line without colliding with Vehicle A or B. The initial position (x, y) and orientation of the ego vehicle were 1.00 m, 0.15 m, and 0.00 rad, respectively. Vehicles A and B (which did not learn) were driven according to constant preset control commands. Vehicle A drove at a low speed (0.05 m/s) owing to a defect and did not steer (angular velocity of 0.00 rad/s); its initial position and orientation were 2.00 m, 0.15 m, and 0.00 rad, respectively. As Vehicle B was not affected by the defective vehicle, it traveled at a high speed of 0.15 m/s and did not steer (angular velocity of 0.00 rad/s); its initial position and orientation were 0.00 m, −0.15 m, and 0.00 rad, respectively.
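For reference, the initial conditions above can be summarized as a small configuration structure; this dictionary is merely an illustrative restatement of Table 3 and Figure 4, not part of the released code.

```python
# Illustrative restatement of the initial conditions (Table 3); units: m, rad, m/s, rad/s.
SCENARIO = {
    "goal_line_x": 5.00,
    "ego":       {"pose": (1.00,  0.15, 0.00), "controlled_by": "DQN policy"},
    "vehicle_a": {"pose": (2.00,  0.15, 0.00), "linear_v": 0.05, "angular_v": 0.00},  # defective, slow
    "vehicle_b": {"pose": (0.00, -0.15, 0.00), "linear_v": 0.15, "angular_v": 0.00},  # fast, right lane
}
```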
6.3. Metric and Training Results
The metrics considered for the fallback capability were the three optimal heuristic maneuvers derived in Section 3.2: slow following, lane change, and lane change after yielding. We evaluated the decision-making algorithm by comparing the training results with these heuristic maneuvers. If the training results converge to a particular heuristic maneuver, the ego vehicle has successfully learned the optimal policy to respond appropriately to the corresponding fallback scenario.
The proposed decision-making algorithm was trained 100 times with the same hyperparameters. The training was performed in a single synchronous environment, and the time required to complete the training was measured and reported. This was done to generalize the training results through statistical evaluation. The 100 results converged into five groups: (1) lane change, (2) slow following, (3) rear-end collision, (4) front-end collision, and (5) side collision. Lane change and slow following were therefore successful outcomes of the trained fallback technique, whereas the three collision types were considered failures.
Figure 5 presents the quantitative training results for each episode’s total reward and maximum Q-value. The Q-value is related to the loss function and is presented as an indicator for evaluating the learning process and policy results.
Lane change involved the ego vehicle moving from the current lane to the right lane, which occurred 38 of 100 times (38%). In Figure 5a,b, the average of the 38 training results and each individual training result are depicted in red and shaded red, respectively. The total reward converged within the range of 470–480 after 200 episodes. The maximum Q-value increased monotonically. Moreover, each training run required an average of 2.5 h for 500 episodes.

Slow following involved the ego vehicle following the slow front vehicle, which occurred 9 of 100 times (9%). In Figure 5a,b, the average of the nine training results and each individual training result are depicted in green and shaded green, respectively. The total reward converged to 370–380 after 200 episodes. The maximum Q-value increased slightly. Moreover, each training run required an average of 8.5 h for 500 episodes.

Rear-end collision involved the front of Vehicle B colliding with the rear of the ego vehicle, which occurred 25 of 100 times (25%). In Figure 5a,b, the average of the 25 training results and each individual training result are depicted in blue and shaded blue, respectively. The total reward converged in the range of 120–130 after 100 episodes. The maximum Q-value increased slightly.

Front-end collision involved the front of the ego vehicle colliding with the rear of Vehicle A, which occurred 17 of 100 times (17%). In Figure 5a,b, the average of the 17 training results and each individual training result are depicted in cyan and shaded cyan, respectively. The total reward converged in the range of 100–110 after 100 episodes. The maximum Q-value increased slightly and was indistinguishable from the rear-end collision results.

Side collision involved the side of the ego vehicle colliding with the front of Vehicle B, which occurred 11 of 100 times (11%). In Figure 5a,b, the average of the 11 training results and each individual training result are depicted in magenta and shaded magenta, respectively. The total reward converged within the range of 30–40 after 200 episodes. The maximum Q-value decreased slightly.
Thus, the lane change maneuver was the optimal fallback maneuver: it achieved its purpose and provided the most efficient (fastest) result. This analysis demonstrated that the lane change maneuver received the highest reward and Q-value in the experimental results. The slow following maneuver is a suboptimal fallback maneuver: it achieved its purpose but was relatively slow and inefficient compared with the lane change maneuver. The rear-end, front-end, and side collision outcomes are poor fallback maneuvers because they failed to achieve the goal. In addition, the lane change and slow following maneuvers were consistent with the heuristic optimal fallback maneuvers. This implies that the ego vehicle learned the appropriate fallback maneuvers utilizing the proposed algorithm. The ego vehicle can adopt one of the trained models among the 38 lane change models or the 9 slow following models for the fallback scenario. Furthermore, the lane change maneuver is recommended over the slow following maneuver because of its driving efficiency.
6.4. Discussion
Next, we discuss why the training experiment yielded five convergence results even though the proposed algorithm was independently trained 100 times with the same hyperparameters.
Convergence was primarily determined by the existence and frequency of successful experiences within the first 64 episodes. This number corresponds to the batch size, and training starts at this point. The experience data collected during these 64 episodes were utilized more frequently and repeatedly to train the neural network model.
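The repeated reuse of early experience is what a standard DQN replay memory provides; the sketch below, with assumed capacity and sampling size, illustrates the mechanism (the exact buffer settings belong to Section 5 and are not reproduced here).

```python
import random
from collections import deque

class ReplayMemory:
    """Standard experience replay buffer (capacity and batch size are assumed values)."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one interaction tuple (s_t, a_t, r_t, s_{t+1}, done).
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Early experiences remain in the buffer and are resampled many times,
        # which is why the first successful episodes strongly shape convergence.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```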
Therefore, if the initial neural network is trained on successful data containing optimal decision-making patterns, it converges to a lane change or slow following; eventually, an active behavioral maneuver (lane change) or a passive behavioral maneuver (slow following) emerges. Otherwise, if it is trained with poor data, it converges to a collision. Consequently, five convergence groups appeared. If learning fails, no pattern can be learned from the data; in other words, the agent does not capture the sense of driving.
The likelihood of successful experiences is related to the exploration–exploitation rate $\epsilon$. The current exploration rate decreased as the episodes progressed even though the ego vehicle had not yet had a successful experience; in other words, the exploration rate decayed too early. Methods for delaying the decay of the exploration rate are considered acceptable, for example, decaying the rate only when the ego vehicle experiences a success.
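A minimal sketch of such a success-gated decay schedule, contrasted with the usual per-episode decay, is shown below; the rate values are illustrative assumptions.

```python
# Epsilon-greedy exploration with success-gated decay (illustrative; rates are assumed values).
EPS_START, EPS_MIN, DECAY = 1.0, 0.05, 0.99

def decay_per_episode(eps):
    """Conventional schedule: epsilon shrinks every episode, even without any success."""
    return max(EPS_MIN, eps * DECAY)

def decay_on_success(eps, episode_succeeded):
    """Alternative discussed in Section 6.4: shrink epsilon only after a successful episode,
    so the agent keeps exploring until it has actually reached the goal at least once."""
    return max(EPS_MIN, eps * DECAY) if episode_succeeded else eps
```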
Furthermore, we discuss why the lane change after yielding did not emerge from training. According to the training results, the proposed algorithm can be expected to learn only a single abstract maneuver, whereas the lane change after yielding combines two abstract maneuvers. Therefore, the lane change after yielding did not occur. A neural network with a more complex or additional structure must be designed to learn such compound abstract maneuvers.
The deployment of deep reinforcement learning is distinguished by its capacity to learn from scratch, which is instrumental in its adaptiveness. This feature of deep reinforcement learning is particularly advantageous because it facilitates the ability of an AV to discover and refine optimal driving policies in a wide range of fallback scenarios. Notably, this adaptiveness is not only a theoretical benefit but also a practical necessity, given the unpredictable nature of real-world driving environments. Starting from basic principles and learning via trial and error, a deep reinforcement learning framework was designed to cope with the nuances and complexities of such scenarios. This approach significantly boosts the system’s scalability, allowing it to maintain its robustness and efficacy. As the diversity and complexity of potential driving situations expand, the ability of our deep reinforcement learning system to adapt and scale ensures that autonomous vehicles can navigate through these challenges while increasing autonomy and safety. Therefore, this ongoing learning process is crucial for advancing towards fully autonomous driving capabilities, where the system must handle an ever-growing array of situations with minimal human oversight.