1. Introduction
Artificial intelligence is in its golden age of development due to the exponential increase in data production and the continuous improvement of computing power [1]. Autonomous driving is one of its main applications. A comprehensive autonomous driving system integrates sensing, decision-making, and motion-controlling modules [2,3,4]. As the “brains” of connected autonomous vehicles (CAVs) [5], the decision-making module formulates the most reasonable control strategy according to the state feature matrix transmitted by the sensing module, the vehicle state, and the information transmitted from the cloud [6]. It then sends the determined control strategy, including high-level behavior and low-level control requirements, to the motion-controlling module [7,8]. Making reasonable decisions based on the information provided by the other modules is therefore crucial to completing autonomous driving tasks safely and efficiently [9].
In uncertain interactive traffic scenarios, the driving environment is highly dynamic and uncertain, and the influence of the driving behaviors of different traffic participants propagates continuously [10]. At the level of the overall transportation system, all traffic participants need to cooperate efficiently [11]. At the level of the individual traffic participant, each must judge risk factors and make appropriate decisions promptly in response to dynamic scene changes [12]. In [13], a Gaussian mixture model and an importance-weighted least-squares probabilistic classifier were combined for scene modeling; the resulting model can identify the braking-strength levels of new drivers when driving data are insufficient. The key for CAVs participating in such traffic scenarios is that each CAV must generate appropriate, cooperative behavior that matches human-driven vehicles and other CAVs. Therefore, CAVs urgently demand efficient and accurate multi-agent decision-making technology to handle the interactions between different traffic participants effectively.
Current multi-agent decision-making technologies mainly focus on deep reinforcement learning (DRL) because of its excellent performance in high-dimensional dynamic state spaces [14]. The keys to applying DRL in uncertain interactive traffic scenarios can be summarized as follows: (1) efficient modeling of interactive traffic scenes and accurate representation of state features; (2) generation of reasonable, cooperative decision-making behaviors based on uncertain scene changes and individual task requirements. The design of the reward function is an essential part of any DRL application: concretizing and quantifying the task objectives establishes the link between the objectives and the algorithm. Therefore, the design of the reward function determines whether the agent can generate reasonable, cooperative decision-making behaviors [15]. In addition, accurate modeling of the interactive traffic environment and accurate representation of the state features are prerequisites for agents to generate the expected behaviors.
In studies of traffic scenarios using DRL methods, researchers have found that in uncertain interactive traffic scenarios, sparse rewards leave agents without effective guidance [16], making the algorithm difficult to converge. To solve the sparse reward problem, researchers divide the original goal into different sub-goals and assign reasonable rewards or punishments to each. In [17], the deep deterministic policy gradient (DDPG) algorithm was adopted to settle the autonomous braking problem, and the reward function was split into three parts to address braking too early, braking too late, and braking too quickly. In [18], the reward function was divided into efficiency and safety rewards to train a decision model that accounts for both safety and efficiency. In addition, considering changes in driving-task requirements and scene complexity, some studies have assigned different weights to the sub-reward functions obtained from this decomposition so as to train different decision-making modes. In [19], the reward function was divided into sub-reward functions based on safety, efficiency, and comfort, and the system could realize different control targets by adjusting the weight values in the reward function.
In decision-making research involving uncertain interactive traffic scenarios, DRL methods that take only the individual characteristics of each vehicle as input ignore the transitive interactive influence between vehicles. As a result, CAVs may fail to generate reasonable, cooperative behavior, which can reduce overall traffic efficiency and increase the occurrence of traffic accidents. Graph representations can accurately describe the interactions between agents and thus capture the relationships between vehicles in uncertain interactive traffic scenarios. Therefore, some researchers have focused on graph reinforcement learning (GRL) methods and modeled the interactions with graph representations [20]. In [21], a hierarchical GNN framework combined with an LSTM was proposed to model the interactions of heterogeneous traffic participants (vehicles, pedestrians, and riders) and predict their trajectories. The GRL method combines GNN and DRL: the features of interactive scenes are processed by the GNN, and cooperative behaviors are generated by the DRL framework [22]. In [23], the traffic network was modeled by dynamic traffic-flow probability graphs, and a graph convolutional policy network was used in reinforcement learning.
This paper proposes an innovative dynamic reward function matrix; various decision-making modes can be trained by adjusting its weight coefficients. Additionally, the weight coefficients are further set as functions of the reward, forming an internal dynamic reward function. In the traffic environment adopted in this paper, the randomness of, and the interactions between, HVs and CAVs are strengthened by having some human-driven vehicles make uncertain lane changes. Two GRL algorithms are used in this paper: GQN and MDGQN. Finally, we report a simulation based on the SUMO platform and a comparative analysis from various perspectives, such as reward functions, algorithms, and decision-making modes. A schematic diagram of the designed framework is shown in Figure 1.
To summarize, the contributions of this paper are as follows:
- 1.
Innovative dynamic reward function matrix: we propose a reward function matrix comprising a decision-weighted coefficient matrix, an incentive-punished-weighted coefficient matrix, and a reward–penalty function matrix. By adjusting the decision-weighted coefficient matrix, decision-making modes with different emphases among the driving task, traffic efficiency, ride comfort, and safety can be realized. Because the reward–penalty function matrix separates rewards from penalties, individual performance can be optimized by adjusting the incentive-to-punishment ratio of each sub-reward function. In addition, the weight coefficients of the incentive-punished-weighted coefficient matrix are further set as functions of the reward functions, which reduces the impact of proportional adjustment on important metrics, such as the collision rate.
- 2.
Adjusting parameters to train multiple decision-making modes: we compare the proposed reward function matrix with the traditional reward function under the same conditions. By adjusting the parameters of the decision-weighted coefficient matrix and the incentive-punished-weighted coefficient matrix, we can obtain aggressive or conservative and incentive- or punishment-oriented decision-making modes. Specifically, the four decision-making modes trained in this paper are aggressive incentive (AGGI), aggressive punishment (AGGP), conservative incentive (CONI), and conservative punishment (CONP).
- 3.
Modeling of the interactive traffic scene and evaluation of decision-making modes: we designed a highway-exit scene with strong interactions between CAVs and human-driven vehicles (HVs) and adopted two algorithms to verify their differences. Additionally, we propose a set of indicators to evaluate the performance of driverless decision-making and use them to verify the performance differences among the algorithms and decision-making modes.
This article is organized as follows. Section 2 introduces the problem formulation. Section 3 introduces the methods used. Section 4 proposes the reward function matrix. Section 5 describes and analyzes the simulation results. Section 6 summarizes this paper and gives future development directions.
3. Methods
This section describes the principles of the methods used, including the graph convolutional neural network (GCN), deep Q-learning, GQN, and MDGQN.
3.1. Graph Convolutional Neural Network
GCN is a neural network model that encodes graph structure directly; its goal is to learn a function of the features on a graph. A graph can be represented by $G = (V, E)$, where $V$ denotes the set of nodes in the graph, $N$ denotes the number of nodes, and $E$ denotes the set of edges. The state of $G$ is considered a tuple of three matrices of information: the feature matrix $X$, the adjacency matrix $A$, and the degree matrix $D$.
The adjacency matrix $A \in \mathbb{R}^{N \times N}$ is used to represent the connections between nodes;
The degree matrix $D$ is a diagonal matrix with $D_{ii} = \sum_{j} A_{ij}$;
The feature matrix $X \in \mathbb{R}^{N \times F}$ is applied to represent the node features, where $F$ represents the dimension of the features.
GCN is a multi-layer graph convolutional neural network based on a first-order local approximation of spectral graph convolution. Each convolution layer processes only first-order neighborhood information; multi-order neighborhood information can be transmitted by stacking several convolution layers.
The propagation rule for each convolution layer is as follows [24]:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph $G$ with added self-connections, $I_N$ is the identity matrix, $\tilde{D}$ with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ is the corresponding degree matrix, and $W^{(l)}$ is the layer-specific trainable weight matrix. $\sigma(\cdot)$ denotes an activation function, such as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$. $H^{(l)} \in \mathbb{R}^{N \times D}$ is the matrix of activations in the $l$-th layer, with $H^{(0)} = X$.
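To make the propagation rule concrete, the following is a minimal NumPy sketch of a GCN layer; the toy graph, feature sizes, and variable names are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One GCN layer: H_next = ReLU(D^(-1/2) (A + I) D^(-1/2) H W).

    A: (N, N) adjacency matrix of the undirected graph.
    H: (N, F_in) node activations (H^(0) is the feature matrix X).
    W: (F_in, F_out) layer-specific trainable weight matrix.
    """
    N = A.shape[0]
    A_tilde = A + np.eye(N)                    # add self-connections
    d = A_tilde.sum(axis=1)                    # node degrees of the augmented adjacency
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^(-1/2)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetrically normalized adjacency
    return np.maximum(A_hat @ H @ W, 0.0)      # ReLU activation

# Stacking two layers transmits second-order neighborhood information.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # toy 3-node path graph
X = np.random.randn(3, 4)                                     # N = 3 nodes, F = 4 features
W1, W2 = np.random.randn(4, 8), np.random.randn(8, 2)
Z = gcn_layer(A, gcn_layer(A, X, W1), W2)                     # node embeddings
```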
3.2. Deep Q-Learning
Q-learning [25] is a value-based reinforcement learning algorithm. The action–value function $Q(s_t, a_t)$ is the expected return that the agent can obtain by taking action $a_t$ in state $s_t$ at time $t$. The environment feeds back the corresponding reward $r_t$ according to the agent's action. Each time step produces a quadruplet $(s_t, a_t, r_t, s_{t+1})$, which is stored in the experience replay buffer. Deep Q-learning (DQN) [26] replaces the optimal action–value function $Q^{*}(s, a)$ with a deep neural network $Q(s, a; w)$. The principle of the algorithm is described below.
The predicted value of DQN can be calculated from the given four-tuple $(s_t, a_t, r_t, s_{t+1})$:
$$\hat{q}_t = Q(s_t, a_t; w)$$
The TD target can be calculated based on the actually observed reward $r_t$:
$$\hat{y}_t = r_t + \gamma \max_{a} Q(s_{t+1}, a; w)$$
DQN updates the network parameters according to the following formula:
$$w \leftarrow w - \alpha\,(\hat{q}_t - \hat{y}_t)\,\nabla_{w} Q(s_t, a_t; w)$$
where $\alpha$ represents the learning rate and $\gamma$ is the discount factor.
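As an illustration of these update rules, here is a minimal PyTorch sketch of a single DQN update step; the state dimension, network sizes, and names are assumptions made only for this example.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))  # Q(s, .; w): 8-dim state, 4 actions
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)              # learning rate alpha
gamma = 0.99                                                          # discount factor

def dqn_update(s: torch.Tensor, a: int, r: float, s_next: torch.Tensor) -> None:
    q_pred = q_net(s)[a]                             # predicted value: Q(s_t, a_t; w)
    with torch.no_grad():
        td_target = r + gamma * q_net(s_next).max()  # TD target: r_t + gamma * max_a Q(s_{t+1}, a; w)
    loss = 0.5 * (q_pred - td_target) ** 2           # squared TD error
    optimizer.zero_grad()
    loss.backward()                                  # backpropagate to obtain the gradient w.r.t. w
    optimizer.step()                                 # gradient-descent update of w
```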
3.3. Graph Convolutional Q-Network (GQN)
As described in Section 3.2, Q-learning uses a Q-table to store the Q value of each state–action pair. However, if the state and action spaces are high-dimensional and continuous, the curse of dimensionality arises; that is, as the dimensionality grows linearly, the computational load grows exponentially. GQN [27] replaces the optimal action–value function $Q^{*}(s, a)$ with a graph convolutional Q-network $Q(s, a; w)$:
$$Q(s, a; w) = F(Z)$$
where $Z$ is the node-embedding output of the graph convolution layers, $F(\cdot)$ represents the neural network block, including the fully connected layers, and $w$ is the aggregation of all the trainable weights.
The training process of GQN is the same as that of DQN [26]. First, the predicted value of GQN is calculated from the four-tuple $(s_t, a_t, r_t, s_{t+1})$ sampled from the experience replay buffer:
$$\hat{q}_t = Q(s_t, a_t; w)$$
However, if the Q-network uses its own estimate to update itself, bootstrapping occurs and the deviation propagates. Therefore, another neural network, called the target network $Q(s, a; w^{-})$, is used to calculate the TD target. Its structure is exactly the same as that of the Q-network, but its parameter $w^{-}$ differs from $w$. The target network selects the optimal action and performs forward propagation:
$$a^{*} = \arg\max_{a} Q(s_{t+1}, a; w^{-}), \qquad \hat{q}_{t+1} = Q(s_{t+1}, a^{*}; w^{-})$$
The TD target $\hat{y}_t$ and the TD error $\delta_t$ are then calculated:
$$\hat{y}_t = r_t + \gamma\,\hat{q}_{t+1}, \qquad \delta_t = \hat{q}_t - \hat{y}_t$$
The gradient $\nabla_{w} Q(s_t, a_t; w)$ is calculated by backpropagation through the Q-network, and the parameter of the Q-network is updated by gradient descent:
$$w' = w - \alpha\,\delta_t\,\nabla_{w} Q(s_t, a_t; w)$$
where $w'$ is the updated Q-network parameter and $\alpha$ represents the learning rate. Finally, GQN uses a weighted average of the two networks to update the target network parameters:
$$w^{-} \leftarrow \tau\, w' + (1 - \tau)\, w^{-}$$
where $\tau$ represents the soft update rate.
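The PyTorch sketch below illustrates the structure described above: graph convolution layers produce node embeddings Z, a fully connected block maps them to Q values, and the target network is updated by the soft rule. All layer sizes and names are illustrative assumptions; A_hat denotes a normalized adjacency as in Section 3.1, and reading out per-node Q values is also an assumption for this sketch.

```python
import copy
import torch
import torch.nn as nn

class GQNet(nn.Module):
    """Graph convolution layers followed by a fully connected Q head."""
    def __init__(self, feat_dim: int, hidden: int, n_actions: int):
        super().__init__()
        self.w1 = nn.Linear(feat_dim, hidden, bias=False)  # GCN layer 1 weights W^(0)
        self.w2 = nn.Linear(hidden, hidden, bias=False)    # GCN layer 2 weights W^(1)
        self.q_head = nn.Linear(hidden, n_actions)         # fully connected block F(.)

    def forward(self, X: torch.Tensor, A_hat: torch.Tensor) -> torch.Tensor:
        H = torch.relu(A_hat @ self.w1(X))   # first-order neighborhood aggregation
        Z = torch.relu(A_hat @ self.w2(H))   # node embeddings Z
        return self.q_head(Z)                # per-node Q values, shape (N, n_actions)

q_net = GQNet(feat_dim=6, hidden=32, n_actions=5)  # Q-network with parameters w
target_net = copy.deepcopy(q_net)                  # target network: same structure, parameters w-

def soft_update(tau: float = 0.01) -> None:
    """Weighted average of the two networks: w- <- tau * w + (1 - tau) * w-."""
    for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```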
3.4. Multi-Step Double Graph Convolutional Q-Network (MDGQN)
MDGQN further adopts double Q-learning and a multi-step TD target algorithm on top of GQN. As shown in Section 3.3, the target network cannot completely avoid bootstrapping, since its parameters are still related to those of the Q-network. Double Q-learning improves on the target-network approach: Q-learning with a target network uses the target network both to select the optimal action and to calculate the Q value of that action, whereas double Q-learning selects the optimal action according to the Q-network and uses the target network to calculate the Q value of the selected action. The action-selection step of GQN is therefore modified as follows:
$$a^{*} = \arg\max_{a} Q(s_{t+1}, a; w), \qquad \hat{q}_{t+1} = Q(s_{t+1}, a^{*}; w^{-})$$
The multi-step TD target algorithm balances the large variance caused by Monte Carlo estimation against the large bias caused by bootstrapping. The TD target is modified accordingly:
$$\hat{y}_t = \sum_{i=0}^{m-1} \gamma^{i}\, r_{t+i} + \gamma^{m}\, \hat{q}_{t+m}$$
where $\hat{y}_t$ is called the m-step TD target.
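Reusing q_net and target_net from the sketch in Section 3.3, the m-step double-Q TD target could be computed as below; the replay-buffer interface, the controlled-node index, and all names are assumptions for illustration.

```python
import torch

def multi_step_double_q_target(rewards, X_m, A_hat_m, node_idx, gamma=0.99):
    """m-step double-Q TD target for one controlled node.

    rewards: the m observed rewards [r_t, ..., r_{t+m-1}];
    X_m, A_hat_m: graph features and normalized adjacency at state s_{t+m}.
    """
    with torch.no_grad():
        a_star = q_net(X_m, A_hat_m)[node_idx].argmax()      # action selected by the Q-network
        q_eval = target_net(X_m, A_hat_m)[node_idx, a_star]  # evaluated by the target network
    m = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))  # sum of discounted rewards
    return discounted + (gamma ** m) * q_eval                          # m-step TD target
```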
4. Reward Functions
The design of the reward function is an important criterion and goal of the DRL training process. Based on four aspects (the results of the driving task, traffic efficiency, ride comfort, and safety), the reward function is divided into four blocks. Furthermore, we propose a reward function matrix comprising a decision-weighted coefficient matrix, an incentive-punished-weighted coefficient matrix, and a reward–penalty function matrix.
Specific parameters are described below:
- 1.
The decision-weighted coefficient matrix contains the weights that scale the combined reward and penalty terms associated with the results of the driving task, traffic efficiency, ride comfort, and safety, respectively.
- 2.
The incentive-punished-weighted coefficient matrix contains the weights of the reward functions and the weights of the penalty functions associated with the results of the driving task, traffic efficiency, ride comfort, and safety, respectively.
- 3.
The reward–penalty function matrix contains the reward functions and the penalty functions themselves, defined for the results of the driving task, traffic efficiency, ride comfort, and safety, respectively.
By adjusting the weight coefficients of the decision-weighted coefficient matrix and the incentive-punished-weighted coefficient matrix, DRL can train decision-making modes oriented toward different goals. In the decision-making process of autonomous vehicles, the upper control module can then choose among these modes according to its current needs. To select decision-making modes with excellent comprehensive performance and strong contrast, we conducted multiple sets of experiments; some of the experimental data are given in Appendix A. By adjusting the parameters in Table 2 and Table 3, this paper determined four decision-making modes: AGGI, AGGP, CONI, and CONP.
Since changing the weight coefficients can dilute some essential rewards or punishments, this paper improves the reward function to counter this defect: the weight coefficients of the incentive-punished-weighted coefficient matrix are further set as functions of the reward functions, forming an internal dynamic reward function.
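The paper's specific combination and dynamic-weighting formulas are not reproduced here; purely as an illustration of how the three matrices and an internal dynamic weight could interact, consider the following sketch, in which the weighted-sum form, the dynamic-weight rule, and all names and values are assumptions.

```python
import numpy as np

w_decision = np.array([1.0, 0.8, 0.5, 1.2])   # decision weights: task, efficiency, comfort, safety
k_reward = np.array([1.0, 1.0, 0.5, 0.5])     # incentive weights on the reward functions
k_penalty = np.array([0.5, 0.5, 1.0, 1.5])    # punishment weights on the penalty functions

def dynamic_penalty_weights(P: np.ndarray) -> np.ndarray:
    """Illustrative internal dynamic rule: restore full weight to any penalty term whose
    magnitude is large (e.g., a collision), so that re-weighting cannot dilute it."""
    return np.where(np.abs(P) > 10.0, 1.0, k_penalty)  # threshold 10.0 is an assumption

def scalar_reward(R: np.ndarray, P: np.ndarray) -> float:
    """R, P: the four reward and four penalty function values at the current step."""
    k_p = dynamic_penalty_weights(P)
    return float(w_decision @ (k_reward * R + k_p * P))
```

In such a scheme, shifting w_decision changes the emphasis among the four aspects, while the ratio between the incentive and punishment weights distinguishes incentive-oriented from punishment-oriented modes, in line with the four modes described above.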
Based on [3], the specific reward and penalty functions were designed. First, we designed the reward and penalty functions associated with the results of the driving task. The independent variables of this reward function are the numbers of CAVs and HV2 vehicles reaching their destinations (for example, the number of CAVs leaving from “Exit 1”), which aims to train decisions that can also assist HVs in completing their driving tasks. The penalty function is based on collisions.
To train the decision-making model to improve traffic efficiency, this paper divides the speed interval of the CAVs into three parts. The corresponding reward and penalty functions were designed to curb speeding, encourage high-speed driving, and punish low-speed blocking in these three speed ranges. To help the CAVs explore the optimal speed more quickly and stably, an exponential function was used to design a soft reward function [28]. The variables in these functions are the velocity of the CAV and the maximum velocity allowed by the current lane, whose value is 25 or 15, depending on the lane.
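The exact speed reward used in the paper is not reproduced above; the sketch below only illustrates the idea of a soft exponential reward over three speed ranges, and the thresholds, shapes, and coefficients are assumptions.

```python
import numpy as np

def speed_reward(v: float, v_max: float) -> float:
    """Illustrative soft speed reward: encourage driving near v_max,
    punish speeding and low-speed blocking. All thresholds are assumptions."""
    if v > v_max:                                   # speeding: penalty grows with the excess speed
        return float(-(1.0 - np.exp(-(v - v_max))))
    if v >= 0.6 * v_max:                            # high-speed range: soft reward, maximal at v_max
        return float(np.exp((v - v_max) / v_max))
    return -(1.0 - v / (0.6 * v_max))               # low-speed blocking: penalty grows as v drops
```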
To improve the ride comfort of all vehicles in this traffic section, the corresponding reward and penalty functions are designed based on the accelerations and the number of lane changes of all vehicles. Their variables are the numbers of vehicles whose accelerations lie in the rewarded and penalized acceleration ranges, respectively, and $m$, the number of lane changes in this traffic section within 0.5 s.
Superior safety performance is a prerequisite for developing decision-making technology [29]. According to [30], the length of a CAV's safety time is one of the most important factors affecting road safety. This paper therefore introduces the safety times of the CAVs into the corresponding reward function. The safety times are computed from the longitudinal position of the CAV, the longitudinal positions of its front and rear vehicles, and the longitudinal speeds of the front and rear vehicles.
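The exact safety-time formula is not reproduced above; a common time-to-collision-style definition, given purely as an illustrative assumption, is sketched below.

```python
def safety_times(x_cav: float, v_cav: float,
                 x_front: float, v_front: float,
                 x_rear: float, v_rear: float,
                 eps: float = 1e-6) -> tuple:
    """Illustrative safety times to the front and rear vehicles (time-to-collision style).
    x_* are longitudinal positions, v_* longitudinal speeds; the functional form is an assumption."""
    closing_front = max(v_cav - v_front, eps)    # CAV closing in on the front vehicle
    closing_rear = max(v_rear - v_cav, eps)      # rear vehicle closing in on the CAV
    t_front = (x_front - x_cav) / closing_front  # time until the front gap closes
    t_rear = (x_cav - x_rear) / closing_rear     # time until the rear gap closes
    return t_front, t_rear
```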
A driving hazard diagram is proposed to represent the degree of danger of the vehicle's state based on the safety times with respect to the front and rear vehicles. As shown in Figure 3, three primary colors are used to represent the degrees of danger: the red region represents the danger of a collision accident, and the deeper the color, the greater the likelihood; the yellow region indicates that the vehicle needs to pay attention to a possible emergency; and the green region indicates that the vehicle is in a safe state. The sub-reward functions were designed by dividing Figure 3 into five categories. On the basis of these principles, a safety-based reward function is proposed for training safe decisions; its formula is shown in Table 4.
6. Conclusions
We proposed a reward function matrix comprising a decision-weighted coefficient matrix, an incentive-punished-weighted coefficient matrix, and a reward–penalty function matrix. By adjusting the weight coefficients of the reward function matrix, various decision-making modes can be trained. We trained four decision-making modes, namely, AGGI, AGGP, CONI, and CONP, and used the GQN algorithm and the MDGQN algorithm, which improves on GQN, for comparative verification. A large number of simulation results support the following three conclusions. First, the proposed reward function promotes fast convergence of the algorithm and greatly improves the stability of the training process; taking the CONP decision-making mode as an example, the average normalized reward value after the 200th round was 35.40% higher than that of the baseline, while its variance was only 3.27% greater than that of the baseline. Second, the comprehensive performance of the MDGQN algorithm is superior to that of GQN: under all four decision-making modes, the averages and variances of the test reward values of MDGQN are better than those of GQN, and in terms of driving performance MDGQN outperforms GQN except for a slight increase in the number of braking events in the CONP decision-making mode. Finally, the proposed reward function matrix can effectively train various decision-making modes to adapt to different autonomous driving scenarios; as the incentive weight increases, the comparative effect of the algorithm becomes more obvious, but safety decreases. In future work, we will further study the interactions of autonomous vehicles and decision-mode switching.