1. Introduction
With the rapid advancement of computer technology, sensor technology, and artificial intelligence, autonomous vehicles are increasingly becoming a focal point of public attention. Autonomous vehicles not only free humans from manual driving tasks but also adhere strictly to traffic rules and driving standards during operation, thereby reducing the likelihood of traffic accidents; their widespread adoption is also expected to benefit society more broadly. The autonomous driving system (ADS) is primarily composed of three main modules: perception, decision-making, and execution. Among these, the decision-making module is a fundamental prerequisite for ensuring the safety of autonomous vehicles [1]. At present, the decision-making of autonomous vehicles still faces great challenges, mainly in the following three aspects. First, the complexity and variability of driving environments make it difficult for existing decision-making algorithms to handle all driving scenarios and obstacles effectively [2]. Second, most decision-making algorithms require extensive data support; however, acquiring data for collision or other hazardous scenarios is difficult [3]. Third, since existing decision-making algorithms treat safety as the primary consideration [4], the resulting decisions may reduce driving efficiency and degrade the passenger experience. Therefore, research on decision-making for autonomous driving is of critical importance.
The decision-making module, as one of the primary components of the ADS, is tasked with interacting with other traffic participants in complex driving scenarios and providing the most rational decisions in accordance with safe driving principles [5]. Appropriately designing the reward function based on traffic regulations can therefore enhance the safety of decision-making in autonomous vehicles [6]. In the decision-making process of the ADS, existing research primarily considers the vehicle dynamics model, traffic regulations, and road boundaries. Although substantial academic research has addressed decision-making algorithms for autonomous driving in the presence of both static and dynamic obstacles, challenges such as poor adaptability and low comfort remain [7,8]. Behavior decision-making methods for the ADS currently fall into three main categories: learning-based methods [9], probabilistic assessment-based methods [10], and motion planning-based methods [11].
Learning-based methods can be divided into machine learning-based methods and reinforcement learning (RL)-based methods [12]. Huang et al. [13] proposed a self-evolutionary approach based on soft actor-critic and behavior cloning, which enhances the ability to address real-world challenges. Wu et al. [14] proposed a deep neural network-based method that integrates proximal policy optimization with recurrent neural networks to improve exploration efficiency; the method was experimentally evaluated on driving tasks within an open-world racing simulator. Zhang et al. [15] proposed an integrated framework combining deep reinforcement learning with a rule-based hierarchical structure, which simultaneously considers car-following and lane-changing behaviors. This approach not only enhances policy learning capability but also improves interpretability.
Probabilistic assessment-based methods proceed in two steps: first, the risk probability of the current environment is calculated; then, decisions are made according to the assessed risk level. Traditional risk assessment methods compute risk levels from indicators such as time-to-collision and time headway. In contrast, probabilistic risk assessment methods can more effectively address uncertainties in the driving environment. Liu et al. [16] proposed a novel approach to capture the relationships between traffic entities and assess collision risks. They introduced Transformer models for pedestrian collision risk assessment and conducted experiments on a dataset created for pedestrian crossing scenarios; the experimental results demonstrated high accuracy. Chen et al. [17] proposed a potential damage risk model that provides a physically meaningful structure for estimating risks in driving scenarios. The results reveal that differences in drivers' risk perception stem from variations in the assessment of potential damage, offering a risk model for autonomous vehicles.
Motion planning-based methods generate a collision-free path based on factors such as the vehicle dynamics model and road boundaries. Li et al. [18] combined the artificial potential field method with RL, enabling autonomous vehicles to rapidly generate paths that meet safety requirements. Jeong [19] proposed a motion planner for intersection scenarios that estimates the target vehicle's aggressiveness probability from environmental sensors; the driving mode is determined by a game theory-based supervisor configured according to scene probabilities and intersection theory, and specific case studies validate the effectiveness of the method. Li et al. [20] proposed an oscillation-free decision-making and motion-planning framework that overcomes the limitations of autonomous driving systems in tasks such as lane changing and lane keeping.
In recent years, the deep deterministic policy gradient (DDPG) [21] algorithm has been widely used in the field of autonomous driving. Its exceptional capability in handling continuous actions can enhance the smoothness of autonomous vehicle operation. Formal methods are also used to ensure the safety of autonomous vehicles. Linear temporal logic and Probabilistic Computation Tree Logic (PCTL) [22] are often used in formal methods to describe system behavior, and they play a crucial role in system verification and model checking. UPPAAL can model and verify systems with strict timing constraints. It supports not only continuous behaviors but also discrete actions, a capability that is crucial for the ADS, where timing factors are key to safety.
To make decisions that jointly account for driving safety, driving efficiency, and passenger comfort, this paper proposes a DDPG-based behavior safety decision-making method together with a method for verifying its decisions. The approach not only ensures the driving efficiency of autonomous vehicles but also safeguards passenger safety and comfort.
This paper proposes a behavior decision-making method based on DDPG. Compared with traditional RL algorithms, DDPG handles both discrete and continuous actions effectively. In the field of autonomous driving, continuous actions are crucial; given DDPG's outstanding ability to deal with continuous action spaces, the proposed decision-making method better meets driving safety requirements.
Due to the uncertainty of the data, the decision actions obtained through the DDPG algorithm are not absolutely safe. Therefore, this paper also proposes a decision safety verification method: using formal methods, we precisely model driving scenarios and express their safety requirements in PCTL.
The organization of this paper is as follows.
Section 2 presents the fundamental concepts of decision-making and safety verification in autonomous driving.
Section 3 describes the autonomous driving decision-making method based on the DDPG algorithm.
Section 4 introduces the safety verification process for behavior decisions obtained using the DDPG algorithm.
Section 5 provides a highway driving scenario to illustrate the effectiveness of the proposed method. Finally, Section 6 summarizes the findings and discusses future work.
2. Preliminary
2.1. Reinforcement Learning
RL has been widely applied in the field of artificial intelligence due to its exceptional capacity for autonomous learning. During decision-making training based on RL, the agent continuously updates its environmental information by interacting with the environment and conducts a series of trials to identify the optimal course of action. During training, after the agent makes a decision based on the environmental information, the environment changes accordingly and provides the agent with feedback in the form of a reward. Rewards are classified into positive and negative rewards, which signify the effectiveness or ineffectiveness of the agent's actions; positive rewards are typically represented by positive values, while negative rewards are represented by negative values. The agent obtains rewards through interaction with the environment and continually improves based on these rewards, ultimately developing an optimal strategy. The process of maximizing cumulative rewards is modeled as a Markov Decision Process (MDP), which can be represented as a four-tuple $(S, A, P, R)$.
$S$ is the finite state space, where $s_0$ is the initial state and $s_0 \in S$.
$A$ is the finite action space.
$P$ is the state transition function, which defines the probability $P(s_{t+1} \mid s_t, a_t)$ of transitioning from state $s_t$ to state $s_{t+1}$ after taking action $a_t$, where $s_t$ denotes the state at time $t$, $a_t$ the action at time $t$, and $r_t$ the reward at time $t$.
$R$ is the reward function, which specifies the reward obtained after performing action $a_t$ in state $s_t$ and transitioning to state $s_{t+1}$.
The goal of the MDP is to find the optimal policy $\pi^*$ that maximizes the expected cumulative reward:
$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right],$$
where $\pi^*$ denotes the optimal policy, $\gamma \in [0,1]$ is the discount factor, and $r_t$ represents the reward received from the environment at time $t$.
In RL, the objective of the MDP is to identify a policy $\pi$ that maximizes the expected cumulative reward. We find the optimal policy by solving the Q-value function
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\, a_0 = a\right],$$
which represents the expected cumulative reward obtained by taking action $a$ in state $s$ and then following policy $\pi$. We evaluate the effectiveness of the policy through $Q$.
2.2. Deep Deterministic Policy Gradient
The DDPG algorithm is an extension of the deterministic policy gradient (DPG) algorithm. In the DDPG algorithm, the agent makes decisions and executes corresponding actions by observing the state of the environment. It utilizes deep neural networks to approximate the decision function, enabling the agent to make corresponding decisions based on the input state information.
The DDPG algorithm comprises two main components: the Actor and the Critic. The DDPG algorithm combines value functions with deterministic policies. It utilizes one neural network to represent the value function and another neural network to represent the deterministic policy.
In Figure 1, $\mu$ denotes the policy and $Q$ represents the Q-value function. The parameter of the Actor-online network is $\theta^{\mu}$ and the parameter of the Actor-target network is $\theta^{\mu'}$. The parameter of the Critic-online network is $\theta^{Q}$ and the parameter of the Critic-target network is $\theta^{Q'}$. The Actor-online network selects action $a_t$ in state $s_t$ and, after interacting with the environment, produces a new state $s_{t+1}$ and reward $r_t$. The Actor-target network is used to determine the optimal action for state $s_{t+1}$. The Critic-online network iteratively updates the parameter vector $\theta^{Q}$ by evaluating the current Q-value. The Critic-target network is primarily responsible for evaluating the Q-values of the actions estimated by the Actor-target network. The fundamental definitions in the DDPG algorithm are as follows.
Definition 1. The deterministic action policy $\mu$ is a function that outputs a deterministic action $a$ for a given state:
$$a_t = \mu(s_t \mid \theta^{\mu}).$$
Definition 2. The Q-value represents the expected return obtained by taking action $a_t$ in state $s_t$ and consistently following policy $\mu$ thereafter:
$$Q^{\mu}(s_t, a_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t,\, a_t\right].$$
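To make the Actor-Critic structure of Figure 1 concrete, the following is a minimal PyTorch-style sketch of the four networks (online and target copies of the Actor and Critic). The hidden-layer sizes and the state and action dimensions are illustrative assumptions, not the configuration used in this paper.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s | theta_mu): maps a state to a continuous action."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class Critic(nn.Module):
    """Q(s, a | theta_Q): estimates the expected return of a state-action pair."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

# Online networks and their target copies (initialized with identical weights).
actor, critic = Actor(state_dim=10, action_dim=2), Critic(state_dim=10, action_dim=2)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
```

Here the action dimension of 2 could correspond, for example, to steering and acceleration commands, but this mapping is an assumption for illustration only.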
2.3. Stochastic Hybrid Automata
A Stochastic Hybrid Automata (SHA) [
23] is a mathematical model that can describe both discrete behaviors and continuous actions. SHA can describe systems with both deterministic and stochastic behaviors. In autonomous driving scenarios, both the behavior of vehicles and the state transitions of drivers involve uncertainty. Therefore, we use a SHA to construct the autonomous driving model. A SHA can be represented as a seven-tuple
.
denotes the finite set of states, with representing the initial state, where .
denotes the finite set of clocks.
denotes the finite set of transition actions, including discrete and continuous actions.
represents the set of invariants at location .
represents the set of conditions for state transitions.
denotes the transition probability.
represents the labeling function that maps each location to the set of atomic propositions.
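As an illustration only, the components of an SHA can be collected in a simple data structure; the field names below mirror the symbols used above, and the toy values are hypothetical, not tied to any particular tool's input format.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set, Tuple

@dataclass
class SHA:
    """The seven-tuple (L, C, Sigma, Inv, G, P, Lab) of Section 2.3 (initial location kept with L)."""
    locations: Set[str]                               # L: finite set of locations
    initial: str                                      # l0 in L
    clocks: Set[str]                                  # C: finite set of clocks
    actions: Set[str]                                 # Sigma: discrete/continuous actions
    invariants: Dict[str, str]                        # Inv(l): invariant at each location
    guards: Dict[Tuple[str, str, str], str]           # G: guard for (source, action, target)
    probabilities: Dict[Tuple[str, str, str], float]  # P: transition probabilities
    labels: Dict[str, FrozenSet[str]] = field(default_factory=dict)  # Lab: atomic propositions

# A toy fragment of the ADS automaton of Section 4.1 (illustrative values only).
ads = SHA(
    locations={"Normal_driving", "Prepare_overtaking"},
    initial="Normal_driving",
    clocks={"t"},
    actions={"approach", "abort"},
    invariants={"Prepare_overtaking": "t <= 5"},
    guards={("Normal_driving", "approach", "Prepare_overtaking"): "gap <= danger_distance"},
    probabilities={("Normal_driving", "approach", "Prepare_overtaking"): 1.0},
    labels={"Normal_driving": frozenset({"safe"})},
)
```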
2.4. Probabilistic Computation Tree Logic
PCTL is a temporal logic capable of describing systems with probabilistic properties; in this study, it is well suited to verifying SHA that exhibit both stochastic and hybrid behaviors. The syntax of PCTL comprises state formulas and path formulas. The state formula $\Phi$ is used to describe the properties of a state, and its syntax is as follows:
$$\Phi ::= \mathrm{true} \mid a \mid \neg \Phi \mid \Phi_1 \wedge \Phi_2 \mid P_{\bowtie p}[\varphi].$$
The path formula $\varphi$ describes the properties of paths originating from a state, and its syntax is as follows:
$$\varphi ::= X\,\Phi \mid \Phi_1\, U\, \Phi_2 \mid \Phi_1\, U^{\leq k}\, \Phi_2,$$
where $\Phi_1$ and $\Phi_2$ represent state formulas, $AP$ denotes the set of atomic propositions, $a \in AP$, $P$ denotes the probabilistic operator, $\bowtie \in \{<, \leq, >, \geq\}$, and $p \in [0, 1]$. The operator $X$ denotes the next state, $U$ represents the until operator, and $U^{\leq k}$ indicates that a particular state is reached within $k$ steps.
Each state $s$ either satisfies or does not satisfy a state formula $\Phi$, written $s \models \Phi$ or $s \not\models \Phi$. In particular, for the probabilistic operator the satisfaction relation is
$$s \models P_{\bowtie p}[\varphi] \iff Pr(s \models \varphi) \bowtie p,$$
where $Pr(s \models \varphi)$ represents the probability that paths starting from state $s$ satisfy the path formula $\varphi$. The operator $\bowtie$ compares the obtained probability with the given threshold $p$, thereby determining whether the formula is true or false.
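For example, the property $P_{\geq 0.9}\big[\, \neg\,\mathit{collision} \;U^{\leq 100}\; \mathit{lane\_change\_done} \,\big]$ states that, with probability at least 0.9, the lane change completes within 100 steps without a collision along the way; the atomic propositions and bounds in this example are illustrative and are not the exact propositions used later in this paper.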
3. DDPG Algorithm for Behavior Decision-Making
Traditional RL methods, such as the DQN algorithm, are only suitable for discrete action spaces. The DDPG algorithm is an RL method that is applicable to both continuous and discrete action spaces, and it effectively addresses continuous action problems in the field of autonomous driving.
3.1. Behavior Decision-Making Process
During the operation of an autonomous vehicle, we assume the vehicle's state at time $t$ is $s_t$ and the action taken is $a_t$. The process proceeds as follows.
First, based on the information obtained by the autonomous vehicle interacting with the environment, the Actor-online network gives the action $a_t$ in the current state $s_t$ and passes it to the Critic-online network. The Critic-online network calculates $Q(s_t, a_t \mid \theta^{Q})$ from the state $s_t$ and action $a_t$, which is used to evaluate the value of the current state-action pair. After executing action $a_t$, the state transitions to $s_{t+1}$. The Actor-target network identifies the optimal subsequent action $a_{t+1} = \mu'(s_{t+1} \mid \theta^{\mu'})$ for state $s_{t+1}$ and passes $a_{t+1}$ to the Critic-target network. The Critic-target network generates $Q'(s_{t+1}, a_{t+1} \mid \theta^{Q'})$, and the autonomous vehicle receives the environmental reward $r_t$ to derive the target Q-value $y_t = r_t + \gamma Q'(s_{t+1}, a_{t+1} \mid \theta^{Q'})$.
During the training process, the DDPG algorithm continuously optimizes the internal parameters of the Actor and Critic networks to improve the policy and the Q-value. The policy gradient method is used to update the parameters of the Actor network and optimize the policy; the policy gradient is computed through the Critic-online network and then fed back to the Actor-online network. The policy gradient is computed as
$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s_t \sim D}\left[\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_t,\, a=\mu(s_t)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_t}\right],$$
where $D$ represents the experience replay buffer, and $\theta^{\mu}$ and $\theta^{Q}$ are the Actor and Critic network parameters, respectively.
In traditional RL, the parameters of the online network are frequently adjusted during training, which may lead to instability. Because the DDPG algorithm operates in continuous action spaces and high-dimensional state spaces, parameter stability is particularly likely to be affected, so measures are required to enhance it. The DDPG algorithm improves training stability and convergence speed through soft updates, which make the target network parameters track the online network parameters so that parameter changes during training are smoother. The soft update is given by
$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1-\tau)\,\theta^{\mu'},$$
where $\tau$ controls the update rate of the target network parameters. In practice, a small value of $\tau$, typically 0.001, is chosen to ensure the gradual update of the target networks and the stability of training.
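The target computation, network updates, and soft update can be summarized in a few lines of PyTorch-style code. This is a minimal sketch under the network definitions assumed in Section 2.2, with terminal-state masking omitted for brevity; it is not the exact training code used in this paper.

```python
import torch

GAMMA, TAU = 0.99, 0.001  # discount factor and soft-update rate (tau = 0.001, as above)

def ddpg_update(batch, actor, actor_target, critic, critic_target,
                actor_opt, critic_opt):
    # Batched tensors stacked from sampled transitions (s_t, a_t, r_t, s_{t+1}).
    state, action, reward, next_state = batch

    # Target Q-value: y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})).
    with torch.no_grad():
        target_q = reward + GAMMA * critic_target(next_state, actor_target(next_state))

    # Critic update: minimize the mean-squared error to the target Q-value.
    critic_loss = torch.nn.functional.mse_loss(critic(state, action), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: ascend the policy gradient by maximizing Q(s, mu(s)).
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update: theta' <- tau * theta + (1 - tau) * theta'.
    for target, online in ((actor_target, actor), (critic_target, critic)):
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.data.mul_(1.0 - TAU).add_(TAU * p.data)
```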
3.2. Autonomous Driving Model
The design of the reward function in RL directly affects the agent's behavior learning and the final policy it adopts. The reward is typically confined to the range [0, 1]: the agent receives a reward of 1 upon reaching the target state and a reward of 0 when it deviates from the target state. The learning process of the agent in an autonomous vehicle is highly complex during operation, so we redefine the reward function to make the agent more intelligent. Traditional reward terms require that the autonomous vehicle must not collide with other vehicles, must not exceed a speed threshold, and must not deviate from its designated lane. The range of this speed threshold is defined in the case study.
During operation, autonomous vehicles must strictly adhere to traffic regulations. Collisions are extremely dangerous and can have a serious impact on surrounding vehicles, so a collision incurs a penalty.
According to human drivers' habits, vehicles are expected to drive as close to the center of a symmetric road as possible. If an AV deviates excessively from the central line of a symmetric lane, it may collide with a vehicle on its left or right. Therefore, a penalty based on the deviation between $y_{\mathrm{ego}}$ and $y_{\mathrm{center}}$ is applied to discourage such behavior, where $y_{\mathrm{center}}$ represents the position of the centerline of the lane in which the autonomous vehicle is located and $y_{\mathrm{ego}}$ denotes the lateral position of the autonomous vehicle.
In high-speed scenarios such as highways, where the speed limits are relatively high, if the vehicle’s actual speed exceeds the maximum speed limit, it becomes difficult for the vehicle to respond effectively to unexpected situations. Therefore, the speed of autonomous vehicles must not exceed the maximum speed limit.
To incentivize the autonomous vehicle to reach its destination efficiently, driving at a speed close to the desired speed is rewarded. This reward is defined in terms of $v_x$ and $v_y$, the velocity components in the $x$ and $y$ directions, and $v_{\mathrm{desired}}$, the desired velocity.
Frequent or excessive steering can reduce passenger comfort, so smooth steering is rewarded and oversteering is penalized, where $s$ denotes the steering output.
Taking the above factors into account, the total reward $R$ is expressed as the combination of the collision, lane-centering, speed, and steering terms described above.
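To illustrate how these terms might combine in practice, the following is a minimal reward sketch; the weighting coefficients, thresholds, and exact functional forms are assumptions for illustration and are not the values used in this paper.

```python
def total_reward(collided: bool, y_ego: float, y_center: float,
                 speed: float, v_desired: float, v_max: float,
                 steering: float) -> float:
    """Illustrative combination of the reward terms described in Section 3.2."""
    r = 0.0
    if collided:
        r -= 10.0                      # heavy penalty for any collision
    r -= 0.5 * abs(y_ego - y_center)   # penalize deviation from the lane centerline
    if speed > v_max:
        r -= 5.0                       # penalize exceeding the maximum speed limit
    r -= 0.1 * abs(speed - v_desired)  # reward speeds close to the desired speed
    r -= 0.2 * abs(steering)           # penalize excessive steering for comfort
    return r
```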
Algorithm 1 presents the pseudocode for decision-making by the DDPG algorithm.
Algorithm 1. DDPG-based decision-making
Initialization: Actor-online network parameters $\theta^{\mu}$ and Actor-target network parameters $\theta^{\mu'}$, Critic-online network parameters $\theta^{Q}$ and Critic-target network parameters $\theta^{Q'}$, experience buffer $D$, maximum time step $T$, and number of training episodes max_episode
1. while episode < max_episode do
2.   Initialize state $s_1$
3.   for t in T do:
4.     $a_t = \mu(s_t \mid \theta^{\mu})$ // obtain the behavioral action based on the current state and the Actor-online network parameters
5.     Obtain the next state $s_{t+1}$ and reward $r_t$ // after executing the action $a_t$
6.     $D \leftarrow D \cup \{(s_t, a_t, r_t, s_{t+1})\}$ // store the transition in the experience buffer D
7.     batch = D.sample(batch_size) // extract a mini-batch of samples from the experience buffer D
8.     for sample $(s_i, a_i, r_i, s_{i+1})$ in batch:
9.       $a_{i+1} = \mu'(s_{i+1} \mid \theta^{\mu'})$ // target actions from the Actor-target network
10.      $y_i = r_i + \gamma Q'(s_{i+1}, a_{i+1} \mid \theta^{Q'})$ // target Q-values
11.      Update $\theta^{Q}$ by minimizing the loss $L = \frac{1}{N}\sum_i \big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^2$
12.      Update $\theta^{\mu}$ by maximizing $J$ with the sampled policy gradient
13.      Update the target networks:
14.      $\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}$
15.      $\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$
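Lines 6 and 7 of Algorithm 1 rely on an experience replay buffer. A minimal implementation might look as follows; the buffer capacity is an arbitrary choice, and in practice the sampled transitions are stacked into batched tensors before the update step.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s_t, a_t, r_t, s_{t+1}) transitions and samples uniform mini-batches."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are discarded automatically

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```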
4. Safety Verification Models Based on SHA
Vehicles equipped with an ADS constantly interact with complex and dynamic environments. If the decision-making actions of an autonomous vehicle are not absolutely safe, the consequences for the traffic environment can be unpredictable. Therefore, after obtaining decision results from the DDPG algorithm, this paper uses the formal verification tool UPPAAL to verify them.
The inputs to the tool UPPAAL are SHA models and safety constraints. The SHA model is derived from the joint modeling of the driving scenarios and decision-making processes. Safety constraints are formalized descriptions of safety requirements in different scenarios. In this paper, safety requirements are described using PCTL.
4.1. Construction of SHA
Firstly, we model the autonomous driving environment. Our SHA model mainly includes the autonomous driving vehicle model, ADS; the surrounding vehicle model, Surround_Vehicle; the vehicle dynamics model, Dynamics; and the system model, Decision.
Figure 2 shows the abstract representation of the autonomous driving vehicle model ADS. It mainly includes the normal driving state, Normal_driving; the overtaking preparation state, Prepare_overtaking; the right overtaking state, Right_overtaking; and the left overtaking state, Left_overtaking. When the distance between the autonomous vehicle and the vehicle ahead falls to a specified hazard distance, the autonomous vehicle transitions from the normal driving state to the state of preparing to change lanes and overtake. At this point, the autonomous vehicle evaluates whether the current conditions permit a lane change. If the conditions are met, the autonomous vehicle moves to the left or right lane; after changing lanes, it adjusts itself and resumes normal driving. If the conditions are not met, the autonomous vehicle slows down and continues in the current lane. To make the scenario more realistic, this paper specifies that changing lanes to the left has higher priority than changing lanes to the right.
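The discrete skeleton of the ADS automaton described above can be summarized as a transition table; the guard descriptions below are informal paraphrases of the conditions in Figure 2, not the exact UPPAAL guards.

```python
# State -> list of (guard description, next state) pairs for the ADS automaton.
ADS_TRANSITIONS = {
    "Normal_driving": [
        ("distance to front vehicle <= danger distance", "Prepare_overtaking"),
    ],
    "Prepare_overtaking": [
        ("left lane-change conditions satisfied (higher priority)", "Left_overtaking"),
        ("right lane-change conditions satisfied", "Right_overtaking"),
        ("no lane change possible: decelerate and stay in lane", "Normal_driving"),
    ],
    "Left_overtaking": [
        ("target-lane centerline reached", "Normal_driving"),
    ],
    "Right_overtaking": [
        ("target-lane centerline reached", "Normal_driving"),
    ],
}
```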
Figure 3 shows an abstract schematic of the surrounding manually driven vehicles. The driving environment model constructed in this paper is centered on the autonomous vehicle, and the surrounding vehicle states are classified into four types: acceleration, deceleration, constant speed, and far. The far state means that the vehicle has not yet entered the observation range of the autonomous vehicle's sensors and therefore has no influence on the autonomous vehicle's travel.
The current vehicle is traveling at a constant speed within a designated lane. When the distance between the surrounding vehicles and the autonomous vehicle exceeds the maximum detection range of the sensors, the surrounding vehicles are considered to be in the far state.
When the vehicle enters the sensor’s observation range, the surrounding vehicles travel in the current lane according to a given speed range. We do not consider the lateral movements of surrounding vehicles, that is, surrounding vehicles are assumed to remain within the designated lane. As a result, autonomous vehicles will not experience accidents due to lane changes by surrounding vehicles.
4.2. Safety Constraints for Highway Scenarios
After constructing the highway lane change overtaking model, the safety of the model is verified. To this end, we formally represent the safety properties to be verified through PCTL. Some of the safety properties to be verified are shown in
Table 1.
The enumerated properties allow us to observe whether the autonomous vehicle can safely and efficiently change lanes to the left or right within a specified time, thereby achieving safe and effective driving.
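Properties of this kind can be written in the PCTL syntax of Section 2.4. The following are illustrative examples only; the step bounds, probability thresholds, and proposition names are assumptions, not the entries of Table 1:
$$P_{\geq 0.9}\big[\, \mathit{Prepare\_overtaking} \;U^{\leq 30}\; (\mathit{Left\_overtaking} \vee \mathit{Right\_overtaking}) \,\big], \qquad P_{\geq 0.95}\big[\, \neg\,\mathit{collision} \;U^{\leq 100}\; \mathit{Normal\_driving} \,\big].$$
The first asks whether, with probability at least 0.9, the vehicle leaves the preparation state and begins an overtaking maneuver within a bounded number of steps; the second asks whether normal driving is resumed without a collision.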
5. Case Study
Changing lanes to overtake has always been a risky maneuver in driving. During the lane-changing process, human drivers carefully assess the surrounding environment through visual perception to determine whether conditions are suitable for the maneuver, and in sudden situations they can quickly formulate appropriate responses. In contrast, autonomous vehicles lack this ability to adapt spontaneously; instead, they rely on data collected by sensors such as radars and cameras, which are then processed by intelligent systems to determine whether overtaking is feasible. During training, the agent continuously refines and updates itself to make decisions appropriate to the current environment. Although RL methods can continuously optimize and improve decision-making through interaction with the environment, the inherent uncertainty in the data makes it crucial to verify the resulting behavioral decisions. To illustrate the proposed method, we present a highway lane-changing scenario.
5.1. Highway Scenario
Currently, the cost of safety training for autonomous vehicles in the real world is difficult to estimate, so safety training is performed in simulation. We used the simulation tools CARLA and RoadRunner to build our driving scenarios. CARLA is a powerful simulation tool that contains rich map information and realistic entities and supports multiple sensors, while RoadRunner provides a rich set of maps. The highway scenario we constructed is illustrated in
Figure 4.
The scenario shown in Figure 4 contains weather information, road information, and vehicle information. Road information includes the maximum and minimum speed limits, among other attributes, and the weather can be rainy, cloudy, sunny, or foggy. The scenario contains four vehicle entities: the autonomous vehicle serving as the observation subject, referred to as Ego; the front vehicle v_Front; the right-side vehicle v_Right; and the left-side vehicle v_Left. In the scenario, the roads constructed by CARLA are symmetric, so during driving the AV can assess whether it remains centered in the lane by observing its angle of deviation from the road's central line. In the current scenario, to reach its destination as quickly as possible, the autonomous vehicle must perform lane-changing maneuvers while maintaining a safe distance from surrounding vehicles. However, due to changing road conditions and weather, the safety of the lane-changing maneuvers remains uncertain. In the driving scenario, environmental factors such as weather and road conditions are crucial: the ADS relies on sensors to collect information about weather, road boundaries, traffic signs, and other scene elements. These environmental factors serve as key inputs for decision-making, interacting with the reward function to generate driving actions appropriate to the current scenario.
5.2. Decision-Making Results Based on DDPG
This paper focuses on the lane-changing process of autonomous vehicles in the highway scenario. Ego is the autonomous vehicle and the primary focus of our decision-making process, while the remaining three vehicles are distributed around the autonomous vehicle.
The specific process of lane changing for the autonomous vehicle is outlined as follows.
1. Ego is traveling normally when v_Front enters the range of Ego's sensors. When the gap between the two vehicles narrows to a dangerous distance, Ego must take lane-changing measures to avoid a collision with the vehicle in front and to reach its destination as quickly as possible. The dangerous distance is defined as 80 m in accordance with traffic regulations.
2. The ADS analyzes the sensor data to determine whether the conditions for a right or left lane change are met. Because the road in the scenario is symmetric, lane changes to both the left and the right are permitted. If the conditions are met, Ego performs a lane change; if not, Ego decelerates to avoid a collision (a simplified sketch of this check is given after this list). Ego's controls are divided into lateral and longitudinal control: longitudinal control governs the speed of the autonomous vehicle (acceleration, deceleration, and constant speed), while lateral control governs its direction (maintaining the current direction, steering left, and steering right).
3. When Ego chooses to change lanes to the left or right, the lane change is considered complete once Ego reaches the centerline of the target lane, at which point Ego transitions back to the normal driving state.
If there is still a vehicle in Ego's target lane, Ego repeats steps 1–3 to ensure the safety and driving efficiency of the autonomous vehicle.
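A simplified version of the lane-change check in step 2 is sketched below. The 80 m danger distance follows the text, while the safe-gap threshold and the exact encoding of the left-before-right priority are assumptions for illustration.

```python
DANGER_DISTANCE = 80.0  # m, distance at which Ego must react to v_Front (from the text)
SAFE_GAP = 30.0         # m, assumed minimum gap to vehicles in the target lane

def choose_maneuver(gap_front: float, gap_left: float, gap_right: float) -> str:
    """Return 'keep_lane', 'change_left', 'change_right', or 'decelerate' for Ego."""
    if gap_front > DANGER_DISTANCE:
        return "keep_lane"            # v_Front is not yet a threat
    if gap_left >= SAFE_GAP:          # left lane change has higher priority
        return "change_left"
    if gap_right >= SAFE_GAP:
        return "change_right"
    return "decelerate"               # no safe lane change is available
```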
We compared the rewards obtained by the two methods. The experimental results indicate that the method proposed in this paper yields safer and more efficient decisions.
Figure 5 illustrates the progression of rewards as the agent continues to learn. As training progresses, the reward acquired by the agent increases, and it stabilizes at around 500 training episodes.
5.3. SHA for Highway Lane Changing
Based on the highway scenario constructed in CARLA, we construct the corresponding SHA model and verify its safety using UPPAAL (version 5.1.0).
Figure 6 shows the SHA model for the lane change overtaking scenario.
We use the UPPAAL tool in conjunction with PCTL safety specifications to perform verification and analysis of the SHA model. We present a subset of the property verification results, as illustrated in
Figure 7.
Next, we analyze property P1 in detail. P1 represents the probability that the autonomous vehicle remains safe when the distance to the preceding vehicle exceeds 100 m.
Figure 8 indicates that, after 382 simulations, we obtain a confidence interval of [0.607081, 0.704599] at a confidence level of 0.66. The safety of the autonomous vehicle is acceptable when the distance between the autonomous vehicle and the vehicle in front of it is greater than 100 m. We verify whether the probability of this safety property meets the given probability threshold; when the verification passes, the safety of the autonomous vehicle is considered guaranteed.
In other words, the obtained probability is compared with a given threshold value, and when it exceeds the threshold, the decision is deemed safe.
6. Conclusions
In order to enhance the intelligence of the ADS, this paper proposes a DDPG-based behavior safety decision-making method and an accompanying verification approach. To improve decision-making accuracy, we configured the reward function appropriately: through rational rewards and penalties, the autonomous vehicle is encouraged to drive as close to the center of the road as possible and reach its destination efficiently, while lane change maneuvers are permitted when necessary to avoid collisions with surrounding vehicles. To ensure the safety of the decisions, this study constructs an SHA model that integrates the autonomous driving scenario with the decision outcomes. The experimental results show that the proposed method effectively enhances the intelligence of the ADS while ensuring the safety of the autonomous vehicle.
In future work, we will focus on the following areas. First, we will construct an environment model using rule-based formal methods to guide the reward function in RL, thereby enhancing the interpretability of the algorithm. Second, this study focuses primarily on highway scenarios, and reliable research on more complex environments, such as intersections and dense urban traffic, is still lacking; we will therefore consider scenarios with greater uncertainty, and we are even considering experiments in real-world settings, such as closed road sections, to further validate our approach. Third, regarding comfort, we will address not only the lateral control of the vehicle but also longitudinal control, implementing more refined speed-based rewards to enable more intelligent decision-making. Finally, although PCTL can express complex safety constraints, it suffers from scalability issues in complex and uncertain environments; to address this, we can employ MDP-based or finite temporal models for verification to enhance scalability.