Article

Double Counterfactual Regret Minimization for Generating Safety-Critical Scenario of Autonomous Driving

Yong Wang, Pengchao Sun, Liguo Shuai and Daifeng Zhang
1 School of Mechanical Engineering, Southeast University, Nanjing 215009, China
2 Institute of Automation, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250399, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(21), 4303; https://doi.org/10.3390/electronics13214303
Submission received: 23 September 2024 / Revised: 25 October 2024 / Accepted: 29 October 2024 / Published: 1 November 2024

Abstract
Developing a high-quality scenario library is crucial for the reliable evaluation of autonomous driving systems. A fundamental prerequisite for constructing such a library is the generation of safety-critical scenarios. In this paper, we propose a nested game algorithm that assigns trajectories and their time series to multiple background vehicles. This method aims to generate safety-critical scenarios with varying degrees of interference for the tested vehicle. To ensure the realism of the generated scenarios, we extracted a series of natural trajectories from an existing dataset as the input for the algorithm. We then analyzed multiple types of scenarios generated by this method and evaluated their risk using generalized metrics, such as the Time-to-Collision (TTC) and Minimum Safe Distance Factor (MSDF). The experimental results demonstrate that the nested game-based approach can efficiently construct safety-critical scenarios, contributing to the development of high-quality test scenario libraries.

1. Introduction

Autonomous driving technologies have made significant advances over the past decade, but one of the key challenges currently hindering their progress is ensuring safety. In practice, autonomous vehicles must pass a series of comprehensive tests before they can be deployed on the road. Scenarios are essential for autonomous vehicle testing, and safety-critical scenarios are crucial for accelerating this process. Scenario testing can be categorized into real-world testing and virtual simulation testing. Studies have shown that to demonstrate with 95% confidence that autonomous vehicles outperform human drivers, these vehicles would need to be tested over a distance of at least 440 million km [1]. Consequently, conducting real-world roadway measurements can be both time-consuming and costly. In addition, real-world road tests are marked by high levels of uncertainty and diversity, and the driving data collected are often non-critical [2], resulting in significant inefficiencies in real-world testing. According to a report by the California Department of Motor Vehicles, safety-critical scenarios occur only once every 48,280.32 km of driving [3]. Therefore, selecting safety-critical scenarios and testing them in a simulation environment can efficiently evaluate the safety functionality of autonomous driving systems and identify potential safety hazards [4]. Safety-critical scenarios are scenarios in which autonomous vehicles are highly likely to collide with other entities, and how to select them for the training and evaluation of autonomous vehicles is extremely important [5]. In recent years, simulation tests that generate safety-critical scenarios have become one of the most important technical methods for ensuring the safety of autonomous vehicles [6].
Kruber et al. [7,8] proposed using unsupervised clustering methods to group similar scenarios. Kruber et al. [7] first proposed an unsupervised machine learning method called MURF (Modified Unsupervised Random Forest), which uses multiple decision trees to train on and predict samples. Building on this work, Kruber et al. [8] proposed the xMURF algorithm, which not only considers the terminal leaves but also extracts similarity by analyzing the paths that data points take through the trees. However, its effectiveness is highly dependent on the size of the dataset. Wang et al. [9] utilized the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) [10], a nonparametric Bayesian learning approach, to extract traffic primitives from large volumes of time-series traffic data more efficiently in terms of time and resources. However, the extracted primitives are not interpretable, and the hierarchical relationships between them are unclear. Wheeler et al. [11] proposed an automated method for identifying and aggregating crisis situations, capturing their severity and frequency of occurrence, which enables risk-based safety validation. Importance Sampling (IS) [12] is incorporated into the process and can be used to generate scenarios such as highway queue jumping and car tailgating. The method accelerates the safety assessment of autonomous vehicles but suffers from issues such as oversampling. Overall, the effectiveness of clustering-based scenario generation methods in accurately categorizing safety-critical scenarios depends heavily on the characteristics of the data samples.
In recent years, adversarial generation-based methods for creating safety-critical scenarios have become a prominent technique. Rempe et al. [13] proposed a method called STRIVE for automatic scenario generation. The method generates collision scenarios by optimizing the latent distribution of a traffic model to perturb the initial real-world conditions. However, it only generates collisions between vehicles and does not account for interactions involving pedestrians and drivers. Priisalu et al. [14] proposed a pedestrian prediction modeling framework to compute collision-prone points for pedestrians in a scenario. However, generating special scenarios through this framework requires adequate and realistic constraint scenarios to be provided. Sallami et al. [15] proposed a GAN that can handle time-series data and generate realistic lane-changing scenarios in a highway environment. However, the method is difficult to extend to scenarios involving other road types due to its strong dependency on the specific type of data. Chen [16], Wachi [17], Yasasa [18], and Xu [19] utilized reinforcement learning to generate scenarios for the adversarial navigation of autonomous vehicles. Hsu et al. [20] proposed a GAN-based deep learning framework for predicting the trajectories of vehicles around autonomous vehicles in RGB image sequences without explicit coordinate labeling. Liu et al. [21] proposed using deep reinforcement learning for safety-critical scenario generation, in which existing scenarios are edited, by adding new surrounding vehicles or changing the trajectories of surrounding vehicles during the testing process, to generate new scenarios; this solves the parameter dimensionality problem suffered by traditional techniques.
These methods optimize the vehicle's adversarial learning approach and continuously discover rare events during the Markov decision process. However, AI-driven systems often suffer from the black-box problem, lack interpretability, and pose potential risks in safety-critical settings. Overall, the primary issues with the aforementioned approaches are that the simulation scenarios are difficult to scale, they rely heavily on sample datasets, the training costs are high, and the learned control strategies lack interpretability. To solve these problems, we propose a new scenario generation technique that introduces ideas from game theory to generate anthropomorphic strategies, and thus, test scenarios. In summary, the primary contributions of this study are as follows:
  • A novel structure for scenario generation was designed. This structure depends solely on the size of the dataset, rather than its type, and is easily scalable. The more data that are available, the greater the number of valid scenarios that can be generated.
  • Data analysis using real traffic trajectories. We extracted a series of real traffic trajectories from the highD dataset for use in the scenario architecture, including common trajectories in reality, such as long- and short-distance lane-changing trajectories, and straight-line-driving trajectories with different speeds.
  • Inspired by Kuhn poker, we introduced game theory into the field of autonomous driving scenario generation by constructing virtual agents that mimic the human regret-matching process. This approach generates anthropomorphic strategies, which, in turn, create the desired scenarios.
To fully understand the topic of this paper, the subsequent sections are organized as follows: Section 2 presents the problem description and game modeling. Section 3 describes the double counterfactual regret minimization (D-CFR) algorithm, outlines the solution steps, and details the D-CFR-based scenario generation process. Section 4 demonstrates a typical safety-critical scenario generated using the D-CFR algorithm within a simulation tool and further analyzes its significance. Finally, Section 5 summarizes this paper, discusses various issues, and provides insights for future work.

2. Problem Formulation

2.1. Scenario Description

We considered the problem as an attacker–defender game scenario occurring while a vehicle travels on a road. Specifically, we framed it as a nested attacker–defender problem, i.e., one attacker–defender game nested within another, with the parameters outlined in Table 1. The problem involves interfering with the behavior of the tested vehicle by deploying natural trajectories to the background vehicles.
Zhao [22] categorized existing driving strategies for autonomous vehicles into four types: defensive driving strategies, competitive driving strategies, negotiated driving strategies, and cooperative driving strategies. To model the physical level, we considered both the vehicle’s physical defense system and its physical offense system based on the vehicle’s driving strategy. We distinguished the autonomous vehicles as the tested vehicle (ego vehicle) and the background vehicle. In this nested attack and defense problem, the tested vehicle functioned as the defender, while the background vehicle served as the attacker.
Our goal was to test the functional safety of autonomous vehicles. Therefore, we assumed that the defender’s physical defense system was sufficiently rational. The aggression between the attacker and defender aligned with a competitive driving strategy, where the primary objective was to maximize the attacker’s own driving gains under the assumption that the defender drove rationally. Decisions regarding the attack behavior between attackers followed a cooperative driving strategy, where the primary objective was to formulate and execute the globally optimal attack plan. To ensure the plausibility of attack scenarios in autonomous driving, we designed the physical attack system with the core concept of seizing the defender’s right of way to increase the attacker’s road traffic gain, rather than arbitrarily colliding with the defender.
In his analysis of crashes, Shawky [23] demonstrated that in crash data from 2010 to 2017, sudden lane changes accounted for approximately 17.0% of all serious crashes, followed by speeding (12.8%) and rear-end collisions (11.2%). Additionally, the severity of lane-change crashes was relatively high compared with other causes. We propose classifying the defender's right of way into three categories, illustrated in more detail in Figure 1:
  • Primary right of way seized (P-rows): the attacker imposes a direct threat to the defender, such as when a neighboring vehicle changes lanes into the defender's path or brakes sharply in front of it.
  • Inferior right of way seized (I-rows): areas of the road where the attacker members, in coordination with the attacker leader, execute a globally optimal attack. For example, just before the moment of a potential accident, the attacker members position themselves alongside the defender, blocking its ability to change lanes and thereby preventing it from evading the attacker leader's offensive maneuvers.
  • Non-essential right of way seized (N-rows): sections of the road where attacking yields minimal or no advantage, such as the area behind the defender's vehicle or regions that must be kept clear to prevent a "deadlock" situation.
Additionally, we extracted a series of natural trajectories from the highD dataset to deploy attack behaviors, which ensured that the attacker's actions were realistically plausible.
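To make the taxonomy concrete, the following minimal Python sketch encodes the three categories and a toy region classifier relative to the defender; the lane width, offsets, and thresholds are illustrative assumptions of ours, not values from the paper.

```python
from enum import Enum

class RightOfWay(Enum):
    P_ROWS = "primary"        # direct threat: cut-in or hard braking ahead
    I_ROWS = "inferior"       # coordinated blocking alongside the defender
    N_ROWS = "non-essential"  # little or no attack value, e.g., behind the defender

def classify_region(dx: float, dy: float, lane_width: float = 3.75) -> RightOfWay:
    """Classify a road region at longitudinal offset dx and lateral offset dy
    relative to the defender (dx > 0 means ahead of the defender).
    All thresholds here are hypothetical, for illustration only."""
    if dx > 0 and abs(dy) < lane_width / 2:             # directly ahead, same lane
        return RightOfWay.P_ROWS
    if abs(dx) <= 5.0 and abs(dy) < 1.5 * lane_width:   # alongside, adjacent lane
        return RightOfWay.I_ROWS
    return RightOfWay.N_ROWS                            # behind or far away
```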

2.2. Model

In this section, we present a nested attack-and-defense game designed for autonomous vehicle safety testing. Each layer of the game is an extensive-form, two-player, zero-sum game involving an attacker and a defender. It was assumed that the attackers have perfect memory of their actions and can observe the defender's behavior, with information shared between the attackers. The overall goal of the attacker is to execute an effective and reasonable physical attack that exposes the tested vehicle to a functional safety problem.
The attack-and-defense game could also be modeled as a single extensive-form game with incomplete information. This approach reduces the number of stages in the game process without altering the overall scale. While formulating fewer stages results in a more concise problem, it can also complicate the modeling process. Decomposing the problem into more stages allows for a more direct translation of the real-world self-driving attack-and-defense scenario into a game formulation, providing a clearer representation of the game. Thus, we decomposed the above problem into two subproblems with nested links.
Considering the parameters relevant to the scenario, in the first subproblem, the defender's goal is to secure itself by allocating an appropriate amount of physical resources, and its payoff is defined by the expected security loss. The attacker's goal is to make a preliminary attack by allocating physical attack resources to maximize the defender's security loss, consistent with the zero-sum formulation. The game structure corresponding to the first subproblem is shown in Table 2. In the second subproblem, the defender's goal is to secure itself by assigning time sequences of behaviors, while the attacker's goal is to maximize the defender's security loss by assigning a time sequence of attacks. The game structure corresponding to the second subproblem is shown in Table 3.
For the first subproblem, considering incomplete information and an extensive-form structure, we defined $G_1 = (N_1, \Sigma_1, g_1, K_1)$, where
  • $N_1 = \{1, 2\}$ is the set of players;
  • $\Sigma_1 = (\Sigma_{1,1}, \Sigma_{1,2})$, where $\Sigma_{1,i}$ is the set of sequences of player i;
  • $g_1 = (g_{1,1}, g_{1,2})$, where $g_{1,i}: \Sigma_1 \to \mathbb{R}$ is the payoff function of player i;
  • $K_1 = (K_{1,1}, K_{1,2})$, where $K_{1,i}$ is the set of constraints for player i to realize an action.
The model is a two-player game in which agent 1 acts as the defender and agent 2 as the attacker. A sequence is an ordered set of actions taken by a player at specific points in the game tree. Let $A_{i,j}$ be the set of actions available to player i in phase j. The basic actions of player i correspond to the physical resources allocated to the task, as defined by the set $A_{i,j}$. To form valid sequences, we indexed each set of basic actions within each information set, ensuring that each information set was associated with the same physical resources.
An information set is a set of nodes in a game tree that a player cannot distinguish between due to incomplete information about their opponents. To rationally configure the environment, we assumed that the attackers have perfect memory of their states and actions and that they share perfect information about the allocation of physical attack resources among themselves through conventional intelligence. We also assumed that the attacker cannot ascertain whether the defender has adopted defensive measures before launching a physical attack. This implies that the attacker cannot predict the defender's immediate behavior and cannot determine whether an attack will be successful, receiving a referenceable value only after the fact.
The total payoff is $g_1(\sigma_1) = (g_{1,1}(\sigma_1), g_{1,2}(\sigma_1))$, where $\sigma_1 = (\sigma_{1,1}, \sigma_{1,2})$ represents the sequence pair of the two players, with $\sigma_{1,1} \in \Sigma_{1,1}$ and $\sigma_{1,2} \in \Sigma_{1,2}$. All incomplete action sequences correspond to zero gain, i.e., $g_1(\sigma_1) = (0, 0)$ when either the defender or the attacker does not participate in the game in the available phases. In this problem, the attacker's gain is directly related to the total gain of the attacker in the second subproblem. Let $M_1 = \{1, 2, \ldots, m_1\}$ be the set of subplayers among the attackers in this phase. The attacker's payoff function is then given by $g_1(\sigma_1) = g_2(\sigma_2)$.
$K_1 = (K_{1,1}, K_{1,2})$ represents the set of linear constraints necessary for realizing actions, where $K_{1,i}$ denotes the set of constraints related to agent i. Each information set has a constraint, denoted by $r_{1,i}(\sigma_{1,i})$, which represents the realization condition of a sequence $\sigma_{1,i} \in \Sigma_{1,i}$. The constraints in this phase depend largely on those in the second subproblem, and it can be argued that there are no additional constraints in this phase.
For the second subproblem, we referred to the previous problem and defined $G_2 = (N_2, \Sigma_2, g_2, K_2)$, where
  • $N_2 = \{1, 2\}$ is the set of players;
  • $\Sigma_2 = (\Sigma_{2,1}, \Sigma_{2,2})$, where $\Sigma_{2,i}$ is the set of sequences of player i;
  • $g_2 = (g_{2,1}, g_{2,2})$, where $g_{2,i}: \Sigma_2 \to \mathbb{R}$ is the payoff function of player i;
  • $K_2 = (K_{2,1}, K_{2,2})$, where $K_{2,i}$ is the set of constraints for player i to realize an action.
In designing its payoff function $g_2(\sigma_2) = (g_{2,1}(\sigma_2), g_{2,2}(\sigma_2))$, we considered various parameters, such as the mass, relative position, real-time speed, speed threshold, and steering angle of an autonomous vehicle during its movement, as well as other factors that directly affect the vehicle's risk level. To address these multidimensional parameters, this study employed the driving risk field model [24,25] to measure the risk impact imposed by the attacker on the defender. A complete description of the risk field model is given in Appendix A. We defined the attacker's payoff function in this phase as $g_2(\sigma_2) = F(\sigma_2)$, where $F(\sigma_2)$ represents the risk benefit the attacker obtains by adopting strategy $\sigma_2$ at this stage.
Finally, with respect to the constraint set $K_2 = (K_{2,1}, K_{2,2})$, the actions of the attackers are fundamentally not allowed to conflict with each other. Unless an attacker can determine with high probability that its attack behavior will succeed, the behavior of the other attackers remains arbitrary, meaning that there is effectively no constraint at this point.
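To illustrate how the tuples $G_1$ and $G_2$ could be represented in code, the following minimal Python sketch models one game layer as a container; the field names and types are our own assumptions rather than an implementation used in this work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Seq = Tuple[str, ...]  # an ordered sequence of action labels for one player

@dataclass
class ExtensiveFormGame:
    """One layer of the nested game, G_k = (N_k, Sigma_k, g_k, K_k)."""
    players: List[int]                                  # N_k = {1, 2}
    sequences: Dict[int, List[Seq]]                     # Sigma_{k,i} per player i
    payoff: Callable[[Seq, Seq], Tuple[float, float]]   # g_k(sigma_k)
    constraints: Dict[int, Callable[[Seq], bool]]       # K_{k,i}: realization tests

# Nesting: in layer one, the attacker's payoff is the total payoff of layer two,
# i.e., g1(sigma1) = g2(sigma2) for the timing strategy solved in the second layer.
```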

3. Methodology

In the context of large-scale intelligent game confrontation, counterfactual regret minimization (CFR) represents a cutting-edge methodology that is capable of generating effective strategies with favorable practical convergence. CFR has already achieved notable success in high-level human–machine Texas Hold'em play, such as DeepStack [26], Libratus [27], and Pluribus [28]. Lisy [29] and Davis [30], among others, also applied CFR to security games. Given these previous successes, we are interested in exploring what the mixed strategies produced by CFR can contribute to the field of autonomous vehicle scenario generation.

3.1. Nash Equilibrium

It is our contention that the game model may exhibit a distinctive "Nash equilibrium" in the field of autonomous driving testing. This is because effective test scenarios aim to achieve both high risk exposure and a diverse range of scenario types. If a Nash equilibrium is reached, i.e., a state where no player can increase their returns by unilaterally changing their strategy, the chosen strategy may be globally optimal. However, this can result in the generation of high-risk test scenarios with extremely limited variety, which is undesirable. Therefore, we infer that a "Nash equilibrium" in this field should instead be a large collection of meaningful solutions.

3.2. Double Counterfactual Regret Minimization

Neller et al. [31] gave a full description of CFR, so our summary of the algorithm is included in Appendix A. Here, we briefly describe the regret-matching component, which remains the central piece of D-CFR. This component updates a player's current strategy based on scaled regret values:
$$\sigma_i^{T+1}(I, a) = \begin{cases} \dfrac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R_i^{T,+}(I, a') > 0, \\ \dfrac{1}{|A(I)|}, & \text{otherwise.} \end{cases}$$
$R_i^T(I, a)$ represents the cumulative regret of action a in information set I, and $A(I)$ denotes the set of actions available in that information set. Regrets are clipped to non-negative values: $R_i^{T,+}(I, a) = \max(R_i^T(I, a), 0)$. Regret matching then samples actions from a probability distribution proportional to the positive regret values.
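Since regret matching is the central update in D-CFR, a compact implementation may help; the following Python sketch implements exactly the equation above for the cumulative regrets of a single information set (NumPy is assumed).

```python
import numpy as np

def regret_matching(cum_regret: np.ndarray) -> np.ndarray:
    """Compute sigma^{T+1}(I, .) from cumulative regrets R^T(I, .):
    proportional to positive regrets, uniform when none are positive."""
    positive = np.maximum(cum_regret, 0.0)  # R^{T,+}(I, a) = max(R^T(I, a), 0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(cum_regret), 1.0 / len(cum_regret))

# e.g., cumulative regrets [2.0, -1.0, 3.0] yield the strategy [0.4, 0.0, 0.6]
```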
As discussed in Section 2, we decomposed the problem into a nested two-level game. To solve this two-layer game, we propose the D-CFR method, in which each CFR layer addresses one subproblem, but the strategy at each layer influences the strategy of the other. In other words, the strategies of the two CFR layers interact with each other, as explained in more detail in Figure 2.
In this study, each layer of the CFR link was considered an intelligent agent whose role was to provide the optimal policy for a different resource allocation. In Figure 2, the memory of agent A stores reorganized fragments of the initial scenario, with scenario elements derived from traffic information in the highD dataset. This traffic information includes physical attributes, such as the size, type, and trajectory of traffic participants, which were the main focus of this study. The constructed scenario consists of three main elements: the road type, the traffic participants, and the tested vehicle. The variable number of traffic participants, which can be more than one, and their types, which can be different kinds of vehicles or pedestrians, facilitate the construction of complex traffic scenarios. The reorganized initial trajectory time-series combinations are stored in the memory of agent B, where the information is derived from each point in the trajectory, including the position, velocity, and steering angle.
The specific role of agent A was to assign the best combination of trajectories. For trajectory combinations with a high degree of danger, agent A continuously increased their probability of occurrence while updating its strategy. Its update strategy relied on the optimal timing information provided by agent B and primarily focused on the value-at-risk impact associated with these timings. As shown in the upper part of Figure 2, agent A selected a set of trajectory combinations to send down to agent B based on its current strategy S_A, and then received feedback from agent B to update the new round of strategy S_A through the regret-matching component.
The specific role of agent B was to assign the optimal timing to the trajectory combinations provided by agent A. It calculated the risk impact associated with this optimal timing and then fed this information back to agent A. As shown in the lower part of Figure 2, agent B received the trajectory combination information from agent A, reorganized the timing combinations, and generated the optimal strategy S_B* through continuous iterations of virtual self-play. It then fed the optimal timings and the corresponding risks back to agent A. The full D-CFR procedure is given in Algorithm 1.
Algorithm 1 D-CFR Algorithm.
1: Function WalkTree(h, i, π_i, π_{−i})
2:    if h ∈ Z then
3:       return u_i(h)
4:    end if
5:    if h ∈ C then
6:       Sample outcome a ∈ A(h) with probability f_c(a | h)
7:       return WalkTree(ha, i, π_i, π_{−i})
8:    end if
9:    I ← information set containing h
10:   u ← 0
11:   σ ← regret-matched strategy at I
12:   for a ∈ A(h) do
13:      if player(h) = i then
14:         u′ ← WalkTree(ha, i, σ[a] · π_i, π_{−i})
15:      else
16:         u′ ← WalkTree(ha, i, π_i, σ[a] · π_{−i})
17:      end if
18:      m[a] ← u′
19:      u ← u + σ[a] · u′
20:   end for
21:   if player(h) = i then
22:      for a ∈ A(I) do
23:         r_I[a] ← r_I[a] + π_{−i} · (m[a] − u)
24:         s_I[a] ← s_I[a] + π_i · σ[a]
25:      end for
26:   end if
27:   return u
28: end Function
29:
30: Function D-CFR
31:   for α_1 ∈ {1, 2, …} do
32:      for i ∈ N_1 do
33:         WalkTree(∅, i, 1, 1)
34:      end for
35:      for α_2 ∈ {1, 2, …} do
36:         for j ∈ N_2 do
37:            WalkTree(∅, j, 1, 1)
38:         end for
39:      end for
40:   end for
41: end Function
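To clarify how the two CFR layers interact, the following Python sketch mirrors the outer structure of Algorithm 1; the agent methods (sample_trajectory_combo, cfr_iteration, best_timing, update_regrets) are hypothetical names introduced here for illustration, and the default iteration counts loosely follow the α_1 and α_2 ranges in Table 4.

```python
def d_cfr(agent_a, agent_b, outer_iters=25, inner_iters=100):
    """Minimal sketch of the nested D-CFR loop. Agent A proposes a trajectory
    combination from its current strategy S_A; agent B runs its own CFR
    iterations over timings and returns the resulting risk, which A treats
    as the payoff when updating its regrets. Agent interfaces are assumed."""
    for _ in range(outer_iters):                   # alpha_1 iterations (agent A)
        combo = agent_a.sample_trajectory_combo()  # draw from strategy S_A
        for _ in range(inner_iters):               # alpha_2 iterations (agent B)
            agent_b.cfr_iteration(combo)           # virtual self-play over timings
        timing, risk = agent_b.best_timing(combo)  # optimal strategy S_B*
        agent_a.update_regrets(combo, risk)        # regret matching on S_A
    return agent_a.current_strategy()
```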

4. Testing and Case Study

Virtual simulation has become a mainstream and effective approach for testing autonomous driving systems, particularly for decision-making processes that do not heavily depend on physical factors. The highway environment serves as a highly illustrative scenario in automated driving testing, encompassing multiple situations with varying levels of challenges that pose significant threats to the safety and functionality of autonomous vehicles. In this section, we focus on demonstrating the effectiveness of D-CFR for testing collaborative interactions between multiple traffic participants in a highway environment.

4.1. Setup

The variables and parameters were set as shown in Table 4. The experiments in this study were run on an Intel(R) Core(TM) i5-8300H CPU and an NVIDIA GeForce GTX 1050 Ti GPU. Figure 3 depicts a typical case of scenario generation theoretically realized by the D-CFR method. The blue vehicle represents the ego vehicle, and the green vehicles are the primary traffic participants in the experiment, which exerted a significant influence on the trajectory of the tested vehicle. The remaining (non-critical) traffic participants operated within the established traffic flow. The vehicles depicted within the yellow dashed box are the primary composers of the test scenario and were classified as critical traffic participants. The scenario in this figure, with its larger number of traffic participants, is more representative of a real traffic scenario; in reality, however, only a few traffic participants have a direct impact on the ego vehicle. Therefore, we removed the non-critical traffic participants from the experiment, leaving only the critical traffic participants for collaborative testing, which substantially reduced the complexity of the experiment.

4.2. Result Analysis

We generated a series of scenario cases, and in order to assess their risk, generalized metrics, such as the Time-to-Collision (TTC) [32] and the Minimum Safe Distance Factor (MSDF) [33], were introduced. The TTC and MSDF are defined as follows:
$$TTC_i = \frac{X_i - X_{ego}}{V_{ego} - V_i}$$
$$MSDF_{lat} = \frac{d_{lat}}{d_{min}^{lat}}$$
where $X_i$ and $X_{ego}$ are the longitudinal coordinates of the traffic participant and the tested vehicle, respectively, and $V_i$ and $V_{ego}$ are their respective speeds. $d_{lat}$ and $d_{min}^{lat}$ are the relative lateral distance and the minimum safe lateral distance between the tested vehicle and the traffic participant, respectively. Our aim was to monitor the longitudinal risk with the TTC and the lateral risk with the MSDF, so we only defined $MSDF_{lat}$ in the lateral direction.
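Both metrics follow directly from these definitions; the sketch below is a minimal Python version, with the added assumption that the TTC is infinite when the tested vehicle is not closing on the traffic participant.

```python
def ttc(x_i: float, x_ego: float, v_i: float, v_ego: float) -> float:
    """Time-to-Collision: longitudinal gap over closing speed.
    Returns infinity when the ego vehicle is not closing (assumption)."""
    closing = v_ego - v_i
    return (x_i - x_ego) / closing if closing > 0 else float("inf")

def msdf_lat(d_lat: float, d_min_lat: float) -> float:
    """Lateral Minimum Safe Distance Factor: values below 1 mean the actual
    lateral gap is smaller than the minimum safe lateral distance."""
    return d_lat / d_min_lat

# e.g., a participant 30 m ahead, closing at 5 m/s, gives TTC = 6.0 s
print(ttc(x_i=30.0, x_ego=0.0, v_i=20.0, v_ego=25.0))  # 6.0
```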
Based on the analysis of scenario types and of metrics such as the TTC and $MSDF_{lat}$, combined with manual expert experience, we describe several typical scenario cases with different levels of risk generated by D-CFR; a visual representation is shown in Figure 4. Specifically, Case A represents a low-risk scenario, Case B a medium-risk scenario, and Case C a high-risk scenario. Figure 4 depicts a typical two-vehicle cut-in scenario, which is one of the more dangerous situations in reality. In this scenario, background vehicle 01 was located in the lane to the left of the tested vehicle, while background vehicle 02 was located in the lane to its right. In Case A, the first background vehicle (vehicle-02) changed lanes to the left far ahead of the tested vehicle. Subsequently, the second background vehicle (vehicle-01) changed lanes to the right from the left-front of the tested vehicle. The distance between the two lane changes was relatively long, which allowed the tested vehicle to complete the avoidance maneuver effortlessly. In Case B, the first background vehicle (vehicle-02) changed lanes to the left from the right-front of the tested vehicle and maintained a safe distance. However, at this point, the second background vehicle (vehicle-01) began to change lanes to the right from the left side of the tested vehicle, with a shorter lane-change distance. To avoid danger, the tested vehicle needed to take action, such as braking or changing lanes to the right, thereby yielding its right of way. In Case C, background vehicle-01 and background vehicle-02 simultaneously initiated lane changes from either side of the tested vehicle toward the middle lane, with very short lane-change distances. To avoid the risk, the tested vehicle could only perform an emergency braking maneuver.
The dashed line in the figure represents the baseline of the TTC, taken from the most hazardous scenario captured in the input natural trajectories. As shown in Figure 4d–f, the urgency and complexity of controlling the tested vehicle increased with the scenario complexity. Subjectively, it can be inferred from both the scenario elements and the longitudinal and lateral values monitored by the metrics, such as the TTC and MSDF, that Case C represented a complex traffic scenario that posed a significant challenge to the functional safety of the tested vehicle. To theoretically obtain scenarios with a greater degree of risk, we tested the scenarios generated with different numbers of iterations and recorded the corresponding optimal fitness, i.e., the maximum risk value returned by the risk field, as shown in Table 5, where n* denotes the number of iterations required for agent B to obtain the optimal strategy or reach a Nash equilibrium state. It can be observed that the optimal fitness increased with the number of iterations. Beyond 150 iterations, the optimal fitness continued to increase at a very slow rate, indicating a high level of risk and theoretically yielding higher-risk scenarios. However, we found that the scenarios generated around this iteration count were too homogeneous, i.e., the strategy then always tended to generate one or a few higher-risk scenarios, some of which could be judged unrealistic from a subjective point of view. Therefore, we appropriately reduced the number of iterations when iterating the strategy, so that the generated scenarios were both risky and diverse.
Figure 5 illustrates some of the other scenario types generated by D-CFR. In Case D, a background vehicle on the left side of the tested vehicle attempted to cut in ahead of it, while another background vehicle on the right side traveled straight ahead, seizing the right of way the tested vehicle would have needed to change lanes and avoid the hazard. In this scenario, the tested vehicle could face the functional problem of colliding with the vehicle cutting in on the left due to untimely braking, or of changing lanes to the right and colliding with the vehicle there. In Case E, the tested vehicle attempted to cut in from the left in front of a background vehicle that was traveling straight ahead in the right lane. In this scenario, if the background vehicle employed a competitive driving strategy [22], it would likely not yield the right of way; consequently, the tested vehicle could be involved in a collision during the cut-in maneuver. In Case F, the tested vehicle and the two background vehicles traveled in their own lanes, with no possibility of an accident occurring, even though D-CFR adjusted the background vehicles' trajectory timing to bring them as close to the tested vehicle as possible. The emergence of this scenario illustrates the diversity of the scenarios generated by the D-CFR strategy: such very low-risk scenarios occurred with only a tiny probability, while risk scenarios across the full range of values were more likely to occur.
To demonstrate the extensibility of our proposed method, we extended it to generate two more sets of scenarios. In Case G, we introduced an additional background vehicle with direct influence for collaborative testing. The background vehicle to the left of the tested vehicle executed a right-hand cut-in, while the vehicle behind it traveled straight ahead to occupy the right of way in front. Additionally, the forward movement of the background vehicle on the right restricted the tested vehicle's ability to change lanes. However, it would not have been reasonable to seize all of the tested vehicle's right of way, because forcing it into a "desperate situation" with no way out would reveal nothing about its possible functional problems. Thus, in this scenario, the D-CFR algorithm let the background vehicle on the right yield the right of way, giving the tested vehicle its only chance of survival: changing lanes to the right to avoid the danger. Obviously, the risk level of this scenario, which involved the simultaneous testing of three background vehicles, was extremely high. In Case H, a simple pedestrian motion model was incorporated into the experiment for human–vehicle cooperative testing. The right background vehicle accelerated forward, significantly obstructing the tested vehicle's view to the right-front. As a pedestrian entered the center lane, the tested vehicle was at high risk of colliding with the pedestrian due to factors such as an insufficient reaction time and a short braking distance.
The above cases illustrate that, compared with traditional machine learning methods, our proposed method is easily scalable, interpretable, and controllable. This is mainly because the method does not overly rely on the features of the dataset and requires no training: it suffices to add or modify the motion data of the corresponding traffic participants. However, the drawbacks are also obvious. In the extension cases G and H, the time needed to generate an effective scenario increased substantially, which seriously affects the efficiency of large-scale scenario generation.

5. Conclusions

In this paper, we propose a method to address the scarcity of effective scenarios in autonomous vehicle testing. We first extracted a series of natural trajectories from the existing highD dataset as the attacker's actions, and then proposed the D-CFR method to play a self-play game with multiple traffic participants, generating optimal strategies that in turn produce effective scenarios. Finally, we evaluated the risk level of the generated scenarios using generalized metrics, such as the TTC and MSDF. The results show that the D-CFR-based scenario generation method can successfully generate challenging and complex traffic scenarios. Additionally, we demonstrated through extended experiments that our method is easily extensible to multi-vehicle, human–vehicle, and other multi-traffic-participant collaborative testing.
In future work, we plan to extend this nested game technique and test it in more realistic settings, such as more complex urban intersections.

Author Contributions

Conceptualization, Y.W. and P.S.; methodology, Y.W. and P.S.; software, P.S.; validation, Y.W., P.S. and D.Z.; formal analysis, D.Z.; investigation, P.S.; resources, L.S.; data curation, Y.W.; writing—original draft, Y.W.; writing—review and editing, P.S. and D.Z.; visualization, Y.W. and P.S.; supervision, D.Z.; project administration, L.S. and D.Z.; funding acquisition, Y.W. All authors read and agreed to the published version of this manuscript.

Funding

This research was funded by the Key R&D Program of Shandong Province, China, grant number 2023CXGC010111, which also funded the APC.

Data Availability Statement

Dataset available upon request from the authors.

Acknowledgments

Videos of the cases mentioned in this paper are available at https://github.com/Wilianii/DCFR-for-Scenario-Generation-of-Autonomous-Driving (accessed on 22 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Risk Field Theory

For the second subproblem, we considered it an optimization problem of timing resource allocation. Considering the influence of the multidimensional parameters of autonomous driving, this study employed the driving risk field model [24,25] to measure the risk imposed by the attacker on the defender. At time node t, the amount of risk posed by attacker j to defender i is defined as
$$F_{ji,0}(t) = E_{j,0}(t)\, r_0 \left( \frac{1}{k_{x,0}(t)\, x_{ji}^2(t) + k_{y,0}(t)\, y_{ji}^2(t)} - \frac{1}{r_{max}^2} \right)$$
where $x_{ji}$ and $y_{ji}$ are the longitudinal and lateral distances between attacker j and defender i, respectively, and $E_{j,0}(t)$ denotes the kinetic energy of attacker j, defined as
$$E_{j,0}(t) = \frac{1}{2} m_j v_j^2$$
where $m_j$ and $v_j$ denote the mass and instantaneous velocity of attacker j, respectively, and $k_{x,0}(t)$ and $k_{y,0}(t)$ are the gradient adjustment coefficients for the longitudinal and lateral directions, respectively. We assumed that the vehicle travels along the positive x-direction during the test, so that
$$k_{x,0}(t) = \frac{\left( v_{max} - v_j(t) \cdot \operatorname{sign}(x_i(t) - x_j(t)) \cdot \operatorname{sign}_0(v_j(t) - v_i(t)) \right)^2}{\left( v_{max} + v_j(t) \cdot \operatorname{sign}(x_i(t) - x_j(t)) \cdot \operatorname{sign}_0(v_j(t) - v_i(t)) \right)^2}$$
$$k_{y,0}(t) = 1$$
$$\operatorname{sign}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases} \qquad \operatorname{sign}_0(x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}$$
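Read literally, the risk field can be evaluated as follows; this is a minimal Python sketch of the reconstructed formulas, in which the calibration constants r_0 and r_max are assumed placeholders, since their experimental values are not given here.

```python
def sign(x: float) -> float:
    return 1.0 if x >= 0 else -1.0

def sign0(x: float) -> float:
    return 0.0 if x == 0 else (1.0 if x > 0 else -1.0)

def risk_field(m_j, v_j, v_i, x_j, x_i, dx, dy, v_max, r0=1.0, r_max=50.0):
    """Risk F_{ji,0}(t) exerted by attacker j on defender i at one time node.
    dx, dy are the longitudinal/lateral distances x_ji, y_ji (assumed nonzero
    so the field stays finite); r0 and r_max are assumed constants."""
    e_j = 0.5 * m_j * v_j**2                      # kinetic energy E_{j,0}
    s = v_j * sign(x_i - x_j) * sign0(v_j - v_i)  # directed relative-speed term
    k_x = (v_max - s)**2 / (v_max + s)**2         # longitudinal gradient coeff.
    k_y = 1.0                                     # lateral gradient coeff.
    return e_j * r0 * (1.0 / (k_x * dx**2 + k_y * dy**2) - 1.0 / r_max**2)
```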

Appendix A.2. Counterfactual Regret Minimization (CFR)

Counterfactual regret minimization is solved using a regret-matching algorithm. We need to consider the probability of reaching each information set given a player's strategy: as the game is processed sequentially through a series of information sets, the game's state information and the probabilities of the players' action sequences are passed forward, while the utility information is passed backward through the information sets. For each information set visited recursively in the training iterations, the mixed strategy is computed according to the regret-matching equation, which is given below together with a series of notations and definitions.
Let A be the set of all actions in the game, I be an information set, $A(I)$ denote the set of actions available at information set I, and t and T denote time steps. Let Z denote the set of all terminal nodes of the game; we write $h \sqsubseteq z$, $z \in Z$, when history h is a prefix of terminal history z. Let $u_i(z)$ denote the utility value obtained by player i upon reaching terminal node z, and define the counterfactual value of a non-terminal node h as
$$v_i(\sigma, h) = \sum_{h \sqsubseteq z,\, z \in Z} \pi_{-i}^{\sigma}(h)\, \pi^{\sigma}(h, z)\, u_i(z)$$
The counterfactual regret of not taking action a at history h is defined as
$$r(h, a) = v_i(\sigma_{I \to a}, h) - v_i(\sigma, h)$$
The counterfactual regret of not taking action a at information set I is
$$r(I, a) = \sum_{h \in I} r(h, a)$$
Let $r_i^t(I, a)$ be the regret value when player i adopts strategy $\sigma^t$ at information set I and does not take action a. The cumulative counterfactual regret is defined as
$$R_i^T(I, a) = \sum_{t=1}^{T} r_i^t(I, a)$$
When a player uses strategy $\sigma$, the regret of an action is the difference between the value of always choosing action a and the expected value of the node, weighted by the probability that the other players play to reach that node. Define the non-negative counterfactual regret as $R_i^{T,+}(I, a) = \max(R_i^T(I, a), 0)$; the update formula for the new strategy is then
$$\sigma_i^{T+1}(I, a) = \begin{cases} \dfrac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R_i^{T,+}(I, a') > 0, \\ \dfrac{1}{|A(I)|}, & \text{otherwise.} \end{cases}$$
For each information set, this equation computes action probabilities proportional to the positive cumulative regrets. For each action, CFR generates the next game state and calculates the action's utility recursively. Regret values are computed from the returned utilities, and finally, the probability of occurrence at the current node is computed and returned.

References

  1. Kalra, N.; Paddock, S.M. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transp. Res. Part A Policy Pract. 2016, 94, 182–193. [Google Scholar] [CrossRef]
  2. Klischat, M.; Althoff, M. Generating Critical Test Scenarios for Automated Vehicles with Evolutionary Algorithms. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 2352–2358. [Google Scholar]
  3. Sinha, A.; Chand, S.; Vu, V.; Chen, H.; Dixit, V. Crash and disengagement data of autonomous vehicles on public roads in California. Sci. Data 2021, 8, 298. [Google Scholar] [CrossRef] [PubMed]
  4. Zhang, X.; Li, F.; Wu, X. CSG: Critical Scenario Generation from Real Traffic Accidents. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 1330–1336. [Google Scholar]
  5. Ponn, T.; Gnandt, C.; Diermeyer, F. An optimization-based method to identify relevant scenarios for type approval of automated vehicles. In Proceedings of the ESV—International Technical Conference on the Enhanced Safety of Vehicles, Eindhoven, The Netherlands, 10–13 June 2020; pp. 10–13. [Google Scholar]
  6. Klikovits, S.; Riccio, V.; Castellano, E.; Cetinkaya, A.; Gambi, A.; Arcaini, P. Does Road Diversity Really Matter in Testing Automated Driving Systems?—A Registered Report. arXiv 2022, arXiv:2209.05947. [Google Scholar]
  7. Kruber, F.; Wurst, J.; Botsch, M. An unsupervised random forest clustering technique for automatic traffic scenario categorization. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 2811–2818. [Google Scholar]
  8. Kruber, F.; Wurst, J.; Morales, E.S.; Chakraborty, S.; Botsch, M. Unsupervised and Supervised Learning with the Random Forest Algorithm for Traffic Scenario Clustering and Classification. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 2463–2470. [Google Scholar]
  9. Wang, W.; Zhao, D. Extracting Traffic Primitives Directly From Naturalistically Logged Data for Self-Driving Applications. IEEE Robot. Autom. Lett. 2018, 3, 1223–1229. [Google Scholar] [CrossRef]
  10. Fox, E.B.; Sudderth, E.B.; Jordan, M.I.; Willsky, A.S. An HDP-HMM for systems with state persistence. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 312–319. [Google Scholar]
  11. Wheeler, T.A.; Kochenderfer, M.J. Critical Factor Graph Situation Clusters for Accelerated Automotive Safety Validation. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 2133–2139. [Google Scholar]
  12. Kiss, O.; Grossi, M.; Roggero, A. Importance sampling for stochastic quantum simulations. Quantum 2023, 7, 977. [Google Scholar] [CrossRef]
  13. Rempe, D.; Philion, J.; Guibas, L.J.; Fidler, S.; Litany, O.L. Generating Useful Accident-Prone Driving Scenarios via a Learned Traffic Prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17305–17315. [Google Scholar]
  14. Priisalu, M.; Pirinen, A.; Paduraru, C.; Sminchisescu, C. Generating Scenarios with Diverse Pedestrian Behaviors for Autonomous Vehicle Testing. In Proceedings of the 5th Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022; Volume 164, pp. 1247–1258. [Google Scholar]
  15. Mziou Sallami, M.; Kerboua-Benlarbi, S.; Doufene, A. Autonomous driving scenario generation using Generative Adversarial Networks. Quantum 2020, 7, 20–56. [Google Scholar]
  16. Chen, B.; Chen, X.; Wu, Q.; Li, L. Adversarial Evaluation of Autonomous Vehicles in Lane-Change Scenarios. IEEE Trans. Intell. Transp. Syst. 2022, 23, 10333–10342. [Google Scholar] [CrossRef]
  17. Wachi, A. Failure-scenario maker for rule-based agent using multi-agent adversarial reinforcement learning and its application to autonomous driving. arXiv 2019, arXiv:1903.10654. [Google Scholar]
  18. Abeysirigoonawardena, Y.; Shkurti, F.; Dudek, G. Generating Adversarial Driving Scenarios in High-Fidelity Simulators. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8271–8277. [Google Scholar]
  19. Xu, M.; Huang, P.; Li, F.; Zhu, J.; Qi, X.; Oguchi, K.; Huang, Z.; Lam, H.; Zhao, D. Accelerated policy evaluation: Learning adversarial environments with adaptive importance sampling. arXiv 2022, arXiv:2106.10566. [Google Scholar]
  20. Hsu, C.C.; Kang, L.W.; Chen, S.Y.; Wang, I.S.; Hong, C.H.; Chang, C.Y. Deep learning-based vehicle trajectory prediction based on generative adversarial network for autonomous driving applications. Multimed. Tools Appl. 2023, 82, 1573–7721. [Google Scholar] [CrossRef]
  21. Liu, H.; Zhang, L.; Hari, S.K.S.; Zhao, J. Safety-Critical Scenario Generation Via Reinforcement Learning Based Editing. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 14405–14412. [Google Scholar]
  22. Zhao, C.; Li, L.; Pei, X.; Li, Z.; Wang, F.-Y.; Wu, X. A comparative study of state-of-the-art driving strategies for autonomous vehicles. Accid. Anal. Prev. 2021, 150, 105937. [Google Scholar] [CrossRef] [PubMed]
  23. Shawky, M. Factors affecting lane change crashes. IATSS Res. 2020, 44, 155–161. [Google Scholar] [CrossRef]
  24. Wang, J.; Wu, J.; Li, Y. The concept, principle, and modeling of driving risk in the context of human-vehicle-road collaboration. China Highw. J. 2016, 29, 105–114. [Google Scholar]
  25. Huang, H.; Wang, J.; Fei, C.; Zheng, X.; Yang, Y.; Liu, J.; Wu, X.; Xu, Q. A probabilistic risk assessment framework considering lane-changing behavior interaction. Sci. China Inf. Sci. 2020, 63, 1–15. [Google Scholar] [CrossRef]
  26. Moravčík, M.; Schmid, M.; Burch, N.; Lisý, V.; Morrill, D.; Bard, N.; Davis, T.; Waugh, K.; Johanson, M.; Bowling, M. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science 2017, 356, 508–513. [Google Scholar] [CrossRef] [PubMed]
  27. Brown, N.; Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 2018, 359, 418–424. [Google Scholar] [CrossRef] [PubMed]
  28. Blair, A.; Saffidine, A. AI surpasses humans at six-player poker. Science 2019, 365, 864–865. [Google Scholar] [CrossRef] [PubMed]
  29. Lisy, V.; Davis, T.; Bowling, M. Counterfactual Regret Minimization in Sequential Security Games. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  30. Davis, T.; Waugh, K.; Bowling, M. Solving large extensive-form games with strategy constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 1861–1868. [Google Scholar]
  31. Neller, T.W.; Lanctot, M. An introduction to counterfactual regret minimization. In Proceedings of the Model AI Assignments, the Fourth Symposium on Educational Advances in Artificial Intelligence (EAAI-2013), Bellevue, WA, USA, 15–16 July 2013. [Google Scholar]
  32. Minderhoud, M.M.; Bovy, P.H.L. Extended time-to-collision measures for road traffic safety assessment. Accid. Anal. Prev. 2001, 33, 89–97. [Google Scholar] [CrossRef] [PubMed]
  33. Wishart, J.; Como, S.; Elli, M.; Russo, B.; Weast, J.; Altekar, N.; James, E.; Chen, Y. Driving safety performance assessment metrics for ads-equipped vehicles. SAE Int. J. Adv. Curr. Pract. Mobil. 2020, 2, 2881–2899. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of right-of-way types.
Figure 2. Illustration of the scenario generation process with the D-CFR algorithm.
Figure 3. Schematic diagram of a typical scenario with a complex traffic flow.
Figure 4. Typical straight road scenario case analysis.
Figure 5. Other cases and extended experimental case descriptions.
Table 1. Parameters in the scenario.

Parameter   Description
AK          Set of attackers
DF          Set of defenders
NT          Set of natural trajectories
TS          Set of time series
N_TS        Number of time series
N_AK        Number of attackers
N_DF        Number of defenders
N_NT        Number of natural trajectories
Road type   Type of road for testing
Table 2. The structure of the game in the right-of-way resource allocation problem.

Role            Size        Action
Attacker camp   1           The attacking player's camp
Defender camp   1           The defending player's camp
Attacker        |M_{1,1}|   Adopt an attack strategy to attack the defender
Defender        |M_{1,2}|   Keep the defender safe
Action          N_NT        Natural trajectories for deploying attacks
Table 3. The structure of the game in the time-series resource allocation problem.

Role       Size        Action
Attacker   |M_{2,1}|   Adopt an attack strategy to attack the defender
Defender   |M_{2,2}|   Keep the defender safe
Action     N_TS        Natural trajectories for the time series
Table 4. Experimental factors.

Factor      Value/Type         Description
N_f         30                 Frames in the scenario
P_type      Compact vehicle    Type of traffic participants
N_P         2                  Number of traffic participants
N_NT        40                 Number of natural trajectories
Road type   Straight highway   Type of road for testing
l_e         3.94               Ego vehicle length (m)
w_e         1.92               Ego vehicle width (m)
l_p         3.94               Traffic participant length (m)
w_p         1.92               Traffic participant width (m)
α_1         [20, 30]           Number of iterations performed by agent A
α_2         [80, 120]          Number of iterations performed by agent B
β           [−50, 50]          Range of adjustable time series
lane_e      2                  Lane where the ego vehicle is located
lane_p      1, 3               Lanes where the traffic participants are located
Table 5. Relationship between the number of iterations, running time, and best fitness (maximum risk).

Iteration (A)   Iteration (B)   Running Time (s)   Best Fitness ²
10              n* ¹            183.44             7025
50              n*              1492.88            4,228,302
100             n*              3422.10            149,897,307
150             n*              4905.49            150,025,045
200             n*              5729.71            150,055,946

¹ n* denotes the number of iterations required for agent B to obtain the optimal strategy or reach a Nash equilibrium state. ² Best fitness is the maximum risk value derived from the risk field theory.