1. Introduction
With the rapid advancement of computer technology, sensor technology, and artificial intelligence, autonomous vehicles are increasingly becoming a focal point of public attention. Autonomous vehicles not only free humans from manual driving tasks but also adhere strictly to traffic rules and driving standards during operation, thereby reducing the likelihood of traffic accidents; their widespread adoption is also expected to benefit society more broadly. The autonomous driving system (ADS) is primarily composed of three main modules: perception, decision-making, and execution. Among these, the decision-making module is a fundamental prerequisite for ensuring the safety of autonomous vehicles [1]. At present, the decision-making of autonomous vehicles still faces great challenges, mainly in the following three aspects. First, the complexity and variability of driving environments make it difficult for existing decision-making algorithms to handle all driving scenarios and obstacles effectively [2]. Second, most decision-making algorithms require extensive data support; however, acquiring data for collision or other hazardous scenarios is difficult [3]. Third, since existing decision-making algorithms treat safety as the primary consideration [4], the resulting decisions may reduce driving efficiency and degrade the passenger experience. Therefore, research on decision-making for autonomous driving is of critical importance.
The decision-making module, as one of the primary components of the ADS, is tasked with interacting with other traffic participants in complex driving scenarios and providing the most rational decisions in accordance with safe driving principles [5]. Appropriately designing the reward function based on traffic regulations can therefore enhance the safety of decision-making in autonomous vehicles [6]. In the decision-making process of the ADS, existing research primarily considers the vehicle dynamics model, traffic regulations, and road boundaries. Although substantial academic research has addressed decision-making algorithms for autonomous driving in the presence of both static and dynamic obstacles, challenges such as poor adaptability and low comfort remain [7,8]. Behavior decision-making methods for the ADS currently fall into three main categories: learning-based methods [9], probabilistic assessment-based methods [10], and motion planning-based methods [11].
Learning-based methods can be divided into machine learning-based methods and reinforcement learning (RL)-based methods [12]. Huang et al. [13] proposed a self-evolutionary approach based on soft actor-critic and behavior cloning, which enhances the ability to address real-world challenges. Wu et al. [14] proposed a deep neural network-based method that integrates proximal policy optimization with recurrent neural networks to improve exploration efficiency; the method was experimentally evaluated on driving tasks within an open-world racing simulator. Zhang et al. [15] proposed an integrated framework combining deep reinforcement learning with a rule-based hierarchical structure, which simultaneously considers car-following and lane-changing behaviors. This approach not only enhances policy learning capability but also improves interpretability.
Probabilistic assessment-based methods proceed in two steps: first, the risk probability of the current environment is calculated; then, decisions are made according to the assessed risk level. Traditional risk assessment methods compute risk levels from indicators such as time-to-collision and time headway. In contrast, probabilistic risk assessment methods can more effectively address uncertainties in the driving environment. Liu et al. [16] proposed a novel approach to capture the relationships between traffic entities and assess collision risks. They introduced Transformer models for pedestrian collision risk assessment and conducted experiments on a dataset created for pedestrian crossing scenarios; the experimental results demonstrated high accuracy. Chen et al. [17] proposed a potential damage risk model that provides a physically meaningful structure for estimating risks in driving scenarios. The results reveal that differences in drivers' risk perception stem from variations in the assessment of potential damage, offering a risk model for autonomous vehicles.
Motion planning-based methods generate a collision-free path based on factors such as the vehicle dynamics model and road boundaries. Li et al. [18] combined the artificial potential field method with RL, enabling autonomous vehicles to rapidly generate paths that meet safety requirements. Jeong [19] proposed a motion planner for intersection scenarios that estimates the target vehicle's aggressiveness probability from environmental sensors; the driving mode is determined by a game theory-based supervisor configured according to scene probabilities and intersection theory, and specific case studies validate the effectiveness of the method. Li et al. [20] proposed an oscillation-free decision-making and motion-planning framework that overcomes the limitations of autonomous driving systems in tasks such as lane changing and lane keeping.
In recent years, the deep deterministic policy gradient (DDPG) [21] algorithm has been widely used in the field of autonomous driving. Its exceptional capability in handling continuous actions can enhance the smoothness of autonomous vehicle operation. Formal methods are also used to ensure the safety of autonomous vehicles. Linear temporal logic and Probabilistic Computation Tree Logic (PCTL) [22] are often used in formal methods to describe system behavior, and they play a crucial role in system verification and model checking. UPPAAL can model and verify systems with strict timing constraints. It supports not only continuous behaviors but also discrete actions, a capability that is crucial for the ADS, where timing factors are key to safety.
To make decisions that jointly account for driving safety, driving efficiency, and passenger comfort, this paper proposes a DDPG-based behavior safety decision-making method together with a method for verifying its decisions. The approach not only ensures the driving efficiency of autonomous vehicles but also safeguards passenger safety and comfort.
This paper proposes a behavior decision-making method based on DDPG. Compared with traditional RL algorithms, DDPG handles both discrete and continuous actions effectively. In the field of autonomous driving, continuous actions are crucial; given DDPG's outstanding ability to deal with continuous action spaces, the proposed decision-making method better meets driving safety requirements.
Due to the uncertainty of the data, the decision actions obtained through the DDPG algorithm are not absolutely safe. Therefore, this paper also proposes a decision safety verification method: using formal methods, we precisely model driving scenarios and express their safety requirements in PCTL.
The organization of this paper is as follows.
Section 2 presents the fundamental concepts of decision-making and safety verification in autonomous driving.
Section 3 describes the autonomous driving decision-making method based on the DDPG algorithm.
Section 4 introduces the safety verification process for behavior decisions obtained using the DDPG algorithm.
Section 5 provides a highway driving scenario to illustrate the effectiveness of the proposed method. Finally, Section 6 summarizes the findings and discusses future work.
2. Preliminary
2.1. Reinforcement Learning
RL has been widely applied in the field of artificial intelligence due to its exceptional capacity for autonomous learning. During decision-making training based on RL, the agent continuously updates its environmental information by interacting with the environment and conducts a series of trials to identify the optimal course of action. During training, after the agent makes a decision based on the environmental information, the environment changes accordingly and provides the agent with feedback in the form of a reward. Rewards are classified into positive and negative rewards, which signify the effectiveness or ineffectiveness of the agent's actions; positive rewards are typically represented by positive values, while negative rewards are represented by negative values. The agent obtains rewards through interaction with the environment and continually improves based on these rewards, ultimately developing an optimal strategy. The process of maximizing cumulative rewards is modeled as a Markov Decision Process (MDP), which can be represented as a four-tuple $(S, A, P, R)$.
$S$ is the finite state space, where $s_0$ is the initial state and $s_0 \in S$.
$A$ is the finite action space.
$P$ is the state transition function, which defines the probability $P(s_{t+1} \mid s_t, a_t)$ of transitioning from state $s_t$ to state $s_{t+1}$ after taking action $a_t$, where $s_t$ denotes the state at time $t$, $a_t$ the action at time $t$, and $r_t$ the reward at time $t$.
$R$ is the reward function, which specifies the reward obtained after performing action $a_t$ in state $s_t$ and transitioning to state $s_{t+1}$.
The goal of the MDP is to find the optimal policy $\pi^*$ that maximizes the expected cumulative reward:
$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right],$$
where $\pi^*$ denotes the optimal policy, $\gamma \in [0,1]$ is the discount factor, and $r_t$ represents the reward received from the environment at time $t$.
In RL, the objective of the MDP is to identify a policy $\pi$ that maximizes the expected cumulative reward. We find the optimal policy by solving the Q-value function
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\, a_0 = a\right],$$
which represents the expected cumulative reward obtained by taking action $a$ in state $s$ and then following policy $\pi$. We evaluate the effectiveness of the policy through $Q$.
2.2. Deep Deterministic Policy Gradient
The DDPG algorithm is an extension of the deterministic policy gradient (DPG) algorithm. In the DDPG algorithm, the agent makes decisions and executes corresponding actions by observing the state of the environment. It utilizes deep neural networks to approximate the decision function, enabling the agent to make corresponding decisions based on the input state information.
The DDPG algorithm comprises two main components: the Actor and the Critic. The DDPG algorithm combines value functions with deterministic policies. It utilizes one neural network to represent the value function and another neural network to represent the deterministic policy.
In Figure 1, $\mu$ denotes the policy and $Q$ represents the Q-value function. The parameter of the Actor-online network is $\theta^{\mu}$ and the parameter of the Actor-target network is $\theta^{\mu'}$. The parameter of the Critic-online network is $\theta^{Q}$ and the parameter of the Critic-target network is $\theta^{Q'}$. The Actor-online network selects action $a_t$ in state $s_t$ and, after interacting with the environment, produces a new state $s_{t+1}$ and reward $r_t$. The Actor-target network is used to determine the optimal action for state $s_{t+1}$. The Critic-online network iteratively updates the parameter vector $\theta^{Q}$ by evaluating the current Q-value. The Critic-target network is primarily responsible for evaluating the Q-values of the actions estimated by the Actor-target network. The fundamental definitions in the DDPG algorithm are as follows.
Definition 1. The deterministic action policy $\mu$ is a function that outputs a deterministic action $a$ for a given state:
$$a_t = \mu(s_t \mid \theta^{\mu}).$$
Definition 2. The Q-value represents the expected return obtained by taking action $a_t$ in state $s_t$ and consistently following policy $\mu$ thereafter:
$$Q^{\mu}(s_t, a_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t,\, a_t\right].$$
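To make the Actor-Critic structure of Figure 1 concrete, the following is a minimal PyTorch-style sketch of the four networks (online and target copies of the Actor and Critic). The hidden-layer sizes and the state and action dimensions are illustrative assumptions, not the configuration used in this paper.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s | theta_mu): maps a state to a continuous action."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class Critic(nn.Module):
    """Q(s, a | theta_Q): estimates the expected return of a state-action pair."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

# Online networks and their target copies (initialized with identical weights).
actor, critic = Actor(state_dim=10, action_dim=2), Critic(state_dim=10, action_dim=2)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
```

Here the action dimension of 2 could correspond, for example, to steering and acceleration commands, but this mapping is an assumption for illustration only.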
2.3. Stochastic Hybrid Automata
A Stochastic Hybrid Automata (SHA) [
23] is a mathematical model that can describe both discrete behaviors and continuous actions. SHA can describe systems with both deterministic and stochastic behaviors. In autonomous driving scenarios, both the behavior of vehicles and the state transitions of drivers involve uncertainty. Therefore, we use a SHA to construct the autonomous driving model. A SHA can be represented as a seven-tuple
.
denotes the finite set of states, with representing the initial state, where .
denotes the finite set of clocks.
denotes the finite set of transition actions, including discrete and continuous actions.
represents the set of invariants at location .
represents the set of conditions for state transitions.
denotes the transition probability.
represents the labeling function that maps each location to the set of atomic propositions.
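As an illustration only, the components of an SHA can be collected in a simple data structure; the field names below mirror the symbols used above, and the toy values are hypothetical, not tied to any particular tool's input format.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set, Tuple

@dataclass
class SHA:
    """The seven-tuple (L, C, Sigma, Inv, G, P, Lab) of Section 2.3 (initial location kept with L)."""
    locations: Set[str]                               # L: finite set of locations
    initial: str                                      # l0 in L
    clocks: Set[str]                                  # C: finite set of clocks
    actions: Set[str]                                 # Sigma: discrete/continuous actions
    invariants: Dict[str, str]                        # Inv(l): invariant at each location
    guards: Dict[Tuple[str, str, str], str]           # G: guard for (source, action, target)
    probabilities: Dict[Tuple[str, str, str], float]  # P: transition probabilities
    labels: Dict[str, FrozenSet[str]] = field(default_factory=dict)  # Lab: atomic propositions

# A toy fragment of the ADS automaton of Section 4.1 (illustrative values only).
ads = SHA(
    locations={"Normal_driving", "Prepare_overtaking"},
    initial="Normal_driving",
    clocks={"t"},
    actions={"approach", "abort"},
    invariants={"Prepare_overtaking": "t <= 5"},
    guards={("Normal_driving", "approach", "Prepare_overtaking"): "gap <= danger_distance"},
    probabilities={("Normal_driving", "approach", "Prepare_overtaking"): 1.0},
    labels={"Normal_driving": frozenset({"safe"})},
)
```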
2.4. Probabilistic Computation Tree Logic
PCTL is a temporal logic capable of describing systems with probabilistic properties; in this study, it is well suited to verifying SHA that exhibit both stochastic and hybrid behaviors. The syntax of PCTL comprises state formulas and path formulas. The state formula $\Phi$ is used to describe the properties of a state, and its syntax is as follows:
$$\Phi ::= \mathrm{true} \mid a \mid \neg \Phi \mid \Phi_1 \wedge \Phi_2 \mid P_{\bowtie p}[\varphi].$$
The path formula $\varphi$ describes the properties of paths originating from a state, and its syntax is as follows:
$$\varphi ::= X\,\Phi \mid \Phi_1\, U\, \Phi_2 \mid \Phi_1\, U^{\leq k}\, \Phi_2,$$
where $\Phi_1$ and $\Phi_2$ represent state formulas, $AP$ denotes the set of atomic propositions, $a \in AP$, $P$ denotes the probabilistic operator, $\bowtie \in \{<, \leq, >, \geq\}$, and $p \in [0, 1]$. The operator $X$ denotes the next state, $U$ represents the until operator, and $U^{\leq k}$ indicates that a particular state is reached within $k$ steps.
Each state $s$ either satisfies or does not satisfy a state formula $\Phi$, written $s \models \Phi$ or $s \not\models \Phi$. In particular, for the probabilistic operator the satisfaction relation is
$$s \models P_{\bowtie p}[\varphi] \iff Pr(s \models \varphi) \bowtie p,$$
where $Pr(s \models \varphi)$ represents the probability that paths starting from state $s$ satisfy the path formula $\varphi$. The operator $\bowtie$ compares the obtained probability with the given threshold $p$, thereby determining whether the formula is true or false.
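For example, the property $P_{\geq 0.9}\big[\, \neg\,\mathit{collision} \;U^{\leq 100}\; \mathit{lane\_change\_done} \,\big]$ states that, with probability at least 0.9, the lane change completes within 100 steps without a collision along the way; the atomic propositions and bounds in this example are illustrative and are not the exact propositions used later in this paper.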
3. DDPG Algorithm for Behavior Decision-Making
Traditional RL methods, such as the DQN algorithm, are only suitable for discrete action spaces. The DDPG algorithm is an RL method that is applicable to both continuous and discrete action spaces, and it effectively addresses continuous action problems in the field of autonomous driving.
3.1. Behavior Decision-Making Process
During the operation of an autonomous vehicle, we assume the vehicle's state at time $t$ is $s_t$ and the action taken is $a_t$. The process proceeds as follows.
First, based on the information obtained by the autonomous vehicle interacting with the environment, the Actor-online network gives the action $a_t$ in the current state $s_t$ and passes it to the Critic-online network. The Critic-online network calculates $Q(s_t, a_t \mid \theta^{Q})$ from the state $s_t$ and action $a_t$, which is used to evaluate the value of the current state-action pair. After executing action $a_t$, the state transitions to $s_{t+1}$. The Actor-target network identifies the optimal subsequent action $a_{t+1} = \mu'(s_{t+1} \mid \theta^{\mu'})$ for state $s_{t+1}$ and passes $a_{t+1}$ to the Critic-target network. The Critic-target network generates $Q'(s_{t+1}, a_{t+1} \mid \theta^{Q'})$, and the autonomous vehicle receives the environmental reward $r_t$ to derive the target Q-value $y_t = r_t + \gamma Q'(s_{t+1}, a_{t+1} \mid \theta^{Q'})$.
During the training process, the DDPG algorithm continuously optimizes the internal parameters of the Actor and Critic networks to improve the policy and the Q-value. The policy gradient method is used to update the parameters of the Actor network and optimize the policy; the policy gradient is computed through the Critic-online network and then fed back to the Actor-online network. The policy gradient is computed as
$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s_t \sim D}\left[\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_t,\, a=\mu(s_t)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_t}\right],$$
where $D$ represents the experience replay buffer, and $\theta^{\mu}$ and $\theta^{Q}$ are the Actor and Critic network parameters, respectively.
In traditional RL, the parameters of the online network are frequently adjusted during training, which may lead to instability. Because the DDPG algorithm operates in continuous action spaces and high-dimensional state spaces, parameter stability is particularly likely to be affected, so measures are required to enhance it. The DDPG algorithm improves training stability and convergence speed through soft updates, which make the target network parameters track the online network parameters so that parameter changes during training are smoother. The soft update is given by
$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1-\tau)\,\theta^{\mu'},$$
where $\tau$ controls the update rate of the target network parameters. In practice, a small value of $\tau$, typically 0.001, is chosen to ensure the gradual update of the target networks and the stability of training.
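The target computation, network updates, and soft update can be summarized in a few lines of PyTorch-style code. This is a minimal sketch under the network definitions assumed in Section 2.2, with terminal-state masking omitted for brevity; it is not the exact training code used in this paper.

```python
import torch

GAMMA, TAU = 0.99, 0.001  # discount factor and soft-update rate (tau = 0.001, as above)

def ddpg_update(batch, actor, actor_target, critic, critic_target,
                actor_opt, critic_opt):
    # Batched tensors stacked from sampled transitions (s_t, a_t, r_t, s_{t+1}).
    state, action, reward, next_state = batch

    # Target Q-value: y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})).
    with torch.no_grad():
        target_q = reward + GAMMA * critic_target(next_state, actor_target(next_state))

    # Critic update: minimize the mean-squared error to the target Q-value.
    critic_loss = torch.nn.functional.mse_loss(critic(state, action), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: ascend the policy gradient by maximizing Q(s, mu(s)).
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update: theta' <- tau * theta + (1 - tau) * theta'.
    for target, online in ((actor_target, actor), (critic_target, critic)):
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.data.mul_(1.0 - TAU).add_(TAU * p.data)
```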
3.2. Autonomous Driving Model
The design of the reward function in RL directly affects the agent's behavior learning and the final policy it adopts. The reward is typically confined to the range [0, 1]: the agent receives a reward of 1 upon reaching the target state and a reward of 0 when it deviates from the target state. The learning process of the agent in an autonomous vehicle is highly complex during operation, so we redefine the reward function to make the agent more intelligent. Traditional reward terms require that the autonomous vehicle must not collide with other vehicles, must not exceed a speed threshold, and must not deviate from its designated lane. The range of this speed threshold is defined in the case study.
During operation, autonomous vehicles must strictly adhere to traffic regulations. Collisions are extremely dangerous and can have a serious impact on surrounding vehicles, so a collision incurs a penalty.
According to human drivers' habits, vehicles are expected to drive as close to the center of a symmetric road as possible. If an AV deviates excessively from the central line of a symmetric lane, it may collide with a vehicle on its left or right. Therefore, a penalty based on the deviation between $y_{\mathrm{ego}}$ and $y_{\mathrm{center}}$ is applied to discourage such behavior, where $y_{\mathrm{center}}$ represents the position of the centerline of the lane in which the autonomous vehicle is located and $y_{\mathrm{ego}}$ denotes the lateral position of the autonomous vehicle.
In high-speed scenarios such as highways, where the speed limits are relatively high, if the vehicle’s actual speed exceeds the maximum speed limit, it becomes difficult for the vehicle to respond effectively to unexpected situations. Therefore, the speed of autonomous vehicles must not exceed the maximum speed limit.
To incentivize the autonomous vehicle to reach its destination efficiently, driving at a speed close to the desired speed is rewarded. This reward is defined in terms of $v_x$ and $v_y$, the velocity components in the $x$ and $y$ directions, and $v_{\mathrm{desired}}$, the desired velocity.
Frequent or excessive steering can reduce passenger comfort, so smooth steering is rewarded and oversteering is penalized, where $s$ denotes the steering output.
Taking the above factors into account, the total reward $R$ is expressed as the combination of the collision, lane-centering, speed, and steering terms described above.
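To illustrate how these terms might combine in practice, the following is a minimal reward sketch; the weighting coefficients, thresholds, and exact functional forms are assumptions for illustration and are not the values used in this paper.

```python
def total_reward(collided: bool, y_ego: float, y_center: float,
                 speed: float, v_desired: float, v_max: float,
                 steering: float) -> float:
    """Illustrative combination of the reward terms described in Section 3.2."""
    r = 0.0
    if collided:
        r -= 10.0                      # heavy penalty for any collision
    r -= 0.5 * abs(y_ego - y_center)   # penalize deviation from the lane centerline
    if speed > v_max:
        r -= 5.0                       # penalize exceeding the maximum speed limit
    r -= 0.1 * abs(speed - v_desired)  # reward speeds close to the desired speed
    r -= 0.2 * abs(steering)           # penalize excessive steering for comfort
    return r
```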
Algorithm 1 presents the pseudocode for decision-making by the DDPG algorithm.
Algorithm 1. DDPG-based decision-making
Initialization: Actor-online network parameters $\theta^{\mu}$ and Actor-target network parameters $\theta^{\mu'}$, Critic-online network parameters $\theta^{Q}$ and Critic-target network parameters $\theta^{Q'}$, experience buffer $D$, maximum time step $T$, and number of training episodes max_episode
1. while episode < max_episode do
2.   Initialize state $s_1$
3.   for t in T do:
4.     $a_t = \mu(s_t \mid \theta^{\mu})$ // obtain the behavioral action based on the current state and the Actor-online network parameters
5.     Obtain the next state $s_{t+1}$ and reward $r_t$ // after executing the action $a_t$
6.     $D \leftarrow D \cup \{(s_t, a_t, r_t, s_{t+1})\}$ // store the transition in the experience buffer D
7.     batch = D.sample(batch_size) // extract a mini-batch of samples from the experience buffer D
8.     for sample $(s_i, a_i, r_i, s_{i+1})$ in batch:
9.       $a_{i+1} = \mu'(s_{i+1} \mid \theta^{\mu'})$ // target actions from the Actor-target network
10.      $y_i = r_i + \gamma Q'(s_{i+1}, a_{i+1} \mid \theta^{Q'})$ // target Q-values
11.      Update $\theta^{Q}$ by minimizing the loss $L = \frac{1}{N}\sum_i \big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^2$
12.      Update $\theta^{\mu}$ by maximizing $J$ with the sampled policy gradient
13.      Update the target networks:
14.      $\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}$
15.      $\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$
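Lines 6 and 7 of Algorithm 1 rely on an experience replay buffer. A minimal implementation might look as follows; the buffer capacity is an arbitrary choice, and in practice the sampled transitions are stacked into batched tensors before the update step.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s_t, a_t, r_t, s_{t+1}) transitions and samples uniform mini-batches."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are discarded automatically

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```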
4. Safety Verification Models Based on SHA
Vehicles equipped with an ADS constantly interact with complex and dynamic environments. If the decision-making actions of an autonomous vehicle are not absolutely safe, the consequences for the traffic environment can be unpredictable. Therefore, after obtaining decision results from the DDPG algorithm, this paper uses the formal verification tool UPPAAL to verify them.
The inputs to the tool UPPAAL are SHA models and safety constraints. The SHA model is derived from the joint modeling of the driving scenarios and decision-making processes. Safety constraints are formalized descriptions of safety requirements in different scenarios. In this paper, safety requirements are described using PCTL.
4.1. Construction of SHA
Firstly, we model the autonomous driving environment. Our SHA model mainly includes the autonomous driving vehicle model, ADS; the surrounding vehicle model, Surround_Vehicle; the vehicle dynamics model, Dynamics; and the system model, Decision.
Figure 2 shows the abstract representation of the autonomous driving vehicle model ADS. It mainly includes the normal driving state, Normal_driving; the overtaking preparation state, Prepare_overtaking; the right overtaking state, Right_overtaking; and the left overtaking state, Left_overtaking. When the distance between the autonomous vehicle and the vehicle ahead falls to a specified hazard distance, the autonomous vehicle transitions from the normal driving state to the state of preparing to change lanes and overtake. At this point, the autonomous vehicle evaluates whether the current conditions permit a lane change. If the conditions are met, the autonomous vehicle moves to the left or right lane; after changing lanes, it adjusts itself and resumes normal driving. If the conditions are not met, the autonomous vehicle slows down and continues in the current lane. To make the scenario more realistic, this paper specifies that changing lanes to the left has higher priority than changing lanes to the right.
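The discrete skeleton of the ADS automaton described above can be summarized as a transition table; the guard descriptions below are informal paraphrases of the conditions in Figure 2, not the exact UPPAAL guards.

```python
# State -> list of (guard description, next state) pairs for the ADS automaton.
ADS_TRANSITIONS = {
    "Normal_driving": [
        ("distance to front vehicle <= danger distance", "Prepare_overtaking"),
    ],
    "Prepare_overtaking": [
        ("left lane-change conditions satisfied (higher priority)", "Left_overtaking"),
        ("right lane-change conditions satisfied", "Right_overtaking"),
        ("no lane change possible: decelerate and stay in lane", "Normal_driving"),
    ],
    "Left_overtaking": [
        ("target-lane centerline reached", "Normal_driving"),
    ],
    "Right_overtaking": [
        ("target-lane centerline reached", "Normal_driving"),
    ],
}
```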
Figure 3 shows an abstract schematic of the surrounding manually driven vehicles. The driving environment model constructed in this paper is centered on the autonomous vehicle, and the surrounding vehicle states are classified into four types: acceleration, deceleration, constant speed, and far. The far state means that the vehicle has not yet entered the observation range of the autonomous vehicle's sensors and therefore has no influence on the autonomous vehicle's travel.
The current vehicle is traveling at a constant speed within a designated lane. When the distance between the surrounding vehicles and the autonomous vehicle exceeds the maximum detection range of the sensors, the surrounding vehicles are considered to be in the far state.
When the vehicle enters the sensor’s observation range, the surrounding vehicles travel in the current lane according to a given speed range. We do not consider the lateral movements of surrounding vehicles, that is, surrounding vehicles are assumed to remain within the designated lane. As a result, autonomous vehicles will not experience accidents due to lane changes by surrounding vehicles.
4.2. Safety Constraints for Highway Scenarios
After constructing the highway lane change overtaking model, the safety of the model is verified. To this end, we formally represent the safety properties to be verified through PCTL. Some of the safety properties to be verified are shown in
Table 1.
The enumerated properties allow us to observe whether the autonomous vehicle can safely and efficiently change lanes to the left or right within a specified time, thereby achieving safe and effective driving.
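Properties of this kind can be written in the PCTL syntax of Section 2.4. The following are illustrative examples only; the step bounds, probability thresholds, and proposition names are assumptions, not the entries of Table 1:
$$P_{\geq 0.9}\big[\, \mathit{Prepare\_overtaking} \;U^{\leq 30}\; (\mathit{Left\_overtaking} \vee \mathit{Right\_overtaking}) \,\big], \qquad P_{\geq 0.95}\big[\, \neg\,\mathit{collision} \;U^{\leq 100}\; \mathit{Normal\_driving} \,\big].$$
The first asks whether, with probability at least 0.9, the vehicle leaves the preparation state and begins an overtaking maneuver within a bounded number of steps; the second asks whether normal driving is resumed without a collision.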
5. Case Study
Changing lanes to overtake has always been a risky maneuver in driving. During the lane-changing process, human drivers carefully assess the surrounding environment through visual perception to determine whether conditions are suitable for the maneuver, and in sudden situations they can quickly formulate appropriate responses. In contrast, autonomous vehicles lack this ability to adapt spontaneously; instead, they rely on data collected by sensors such as radars and cameras, which are then processed by intelligent systems to determine whether overtaking is feasible. During training, the agent continuously refines and updates itself to make decisions appropriate to the current environment. Although RL methods can continuously optimize and improve decision-making through interaction with the environment, the inherent uncertainty in the data makes it crucial to verify the resulting behavioral decisions. To illustrate the proposed method, we present a highway lane-changing scenario.
5.1. Highway Scenario
Currently, the cost of safety training for autonomous vehicles in the real world is difficult to estimate, so safety training is performed in simulation. We used the simulation tools CARLA and RoadRunner to build our driving scenarios. CARLA is a powerful simulation tool that contains rich map information and realistic entities and supports multiple sensors, while RoadRunner provides a rich set of maps. The highway scenario we constructed is illustrated in
Figure 4.
The scenario shown in Figure 4 contains weather information, road information, and vehicle information. Road information includes the maximum and minimum speed limits, among other attributes, and the weather can be rainy, cloudy, sunny, or foggy. The scenario contains four vehicle entities: the autonomous vehicle serving as the observation subject, referred to as Ego; the front vehicle v_Front; the right-side vehicle v_Right; and the left-side vehicle v_Left. In the scenario, the roads constructed by CARLA are symmetric, so during driving the AV can assess whether it remains centered in the lane by observing its angle of deviation from the road's central line. In the current scenario, to reach its destination as quickly as possible, the autonomous vehicle must perform lane-changing maneuvers while maintaining a safe distance from surrounding vehicles. However, due to changing road conditions and weather, the safety of the lane-changing maneuvers remains uncertain. In the driving scenario, environmental factors such as weather and road conditions are crucial: the ADS relies on sensors to collect information about weather, road boundaries, traffic signs, and other scene elements. These environmental factors serve as key inputs for decision-making, interacting with the reward function to generate driving actions appropriate to the current scenario.
5.2. Decision-Making Results Based on DDPG
This paper focuses on the lane-changing process of autonomous vehicles in the highway scenario. Ego is the autonomous vehicle and the primary focus of our decision-making process, while the remaining three vehicles are distributed around the autonomous vehicle.
The specific process of lane changing for the autonomous vehicle is outlined as follows.
1. Ego is traveling normally when v_Front enters the range of Ego's sensors. When the gap between the two vehicles narrows to a dangerous distance, Ego must take lane-changing measures to avoid a collision with the vehicle in front and to reach its destination as quickly as possible. The dangerous distance is defined as 80 m in accordance with traffic regulations.
2. The ADS analyzes the sensor data to determine whether the conditions for a right or left lane change are met. Because the road in the scenario is symmetric, lane changes to both the left and the right are permitted. If the conditions are met, Ego performs a lane change; if not, Ego decelerates to avoid a collision (a simplified sketch of this check is given after this list). Ego's controls are divided into lateral and longitudinal control: longitudinal control governs the speed of the autonomous vehicle (acceleration, deceleration, and constant speed), while lateral control governs its direction (maintaining the current direction, steering left, and steering right).
3. When Ego chooses to change lanes to the left or right, the lane change is considered complete once Ego reaches the centerline of the target lane, at which point Ego transitions back to the normal driving state.
If there is still a vehicle in Ego's target lane, Ego repeats steps 1–3 to ensure the safety and driving efficiency of the autonomous vehicle.
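A simplified version of the lane-change check in step 2 is sketched below. The 80 m danger distance follows the text, while the safe-gap threshold and the exact encoding of the left-before-right priority are assumptions for illustration.

```python
DANGER_DISTANCE = 80.0  # m, distance at which Ego must react to v_Front (from the text)
SAFE_GAP = 30.0         # m, assumed minimum gap to vehicles in the target lane

def choose_maneuver(gap_front: float, gap_left: float, gap_right: float) -> str:
    """Return 'keep_lane', 'change_left', 'change_right', or 'decelerate' for Ego."""
    if gap_front > DANGER_DISTANCE:
        return "keep_lane"            # v_Front is not yet a threat
    if gap_left >= SAFE_GAP:          # left lane change has higher priority
        return "change_left"
    if gap_right >= SAFE_GAP:
        return "change_right"
    return "decelerate"               # no safe lane change is available
```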
We compared the rewards obtained by the two methods. The experimental results indicate that the method proposed in this paper yields safer and more efficient decisions.
Figure 5 illustrates the progression of rewards as the agent continues to learn. As training progresses, the reward acquired by the agent increases, and it stabilizes at around 500 training episodes.
5.3. SHA for Highway Lane Changing
Based on the highway scenario constructed in CARLA, we construct the corresponding SHA model and verify its safety using UPPAAL (version 5.1.0).
Figure 6 shows the SHA model for the lane change overtaking scenario.
We use the UPPAAL tool in conjunction with PCTL safety specifications to perform verification and analysis of the SHA model. We present a subset of the property verification results, as illustrated in
Figure 7.
Next, we analyze property P1 in detail. P1 represents the probability that the autonomous vehicle remains safe when the distance to the preceding vehicle exceeds 100 m.
Figure 8 indicates that, after 382 simulations, we obtain a confidence interval of [0.607081, 0.704599] at a confidence level of 0.66. The safety of the autonomous vehicle is acceptable when the distance between the autonomous vehicle and the vehicle in front of it is greater than 100 m. We verify whether the probability of this safety property meets the given probability threshold; when the verification passes, the safety of the autonomous vehicle is considered guaranteed.
In other words, the obtained probability is compared with a given threshold value, and when it exceeds the threshold, the decision is deemed safe.
6. Conclusions
In order to enhance the intelligence of the ADS, this paper proposes a DDPG-based behavior safety decision-making method and an accompanying verification approach. To improve decision-making accuracy, we configured the reward function appropriately: through rational rewards and penalties, the autonomous vehicle is encouraged to drive as close to the center of the road as possible and reach its destination efficiently, while lane change maneuvers are permitted when necessary to avoid collisions with surrounding vehicles. To ensure the safety of the decisions, this study constructs an SHA model that integrates the autonomous driving scenario with the decision outcomes. The experimental results show that the proposed method effectively enhances the intelligence of the ADS while ensuring the safety of the autonomous vehicle.
In future work, we will focus on the following areas. First, we will construct an environment model using rule-based formal methods to guide the reward function in RL, thereby enhancing the interpretability of the algorithm. Second, this study focuses primarily on highway scenarios, and reliable research on more complex environments, such as intersections and dense urban traffic, is still lacking; we will therefore consider scenarios with greater uncertainty, and we are even considering experiments in real-world settings, such as closed road sections, to further validate our approach. Third, regarding comfort, we will address not only the lateral control of the vehicle but also longitudinal control, implementing more refined speed-based rewards to enable more intelligent decision-making. Finally, although PCTL can express complex safety constraints, it suffers from scalability issues in complex and uncertain environments; to address this, we can employ MDP-based or finite temporal models for verification to enhance scalability.