1. Introduction
According to data from the World Health Organization, road traffic collisions result in approximately 1.19 million fatalities worldwide each year, and road traffic injuries are the leading cause of death among children and young adults aged 5–29 [1]. The Roadside Safety Research Program of the Federal Highway Administration indicates that roadway departure crashes account for more than 50 percent of all traffic crash fatalities in the United States [2]. Roadside collisions often result in more fatalities because vehicles strike large stationary objects such as guardrails, traffic barriers, and trees [3]. Lane-changing is one of the most common driving behaviors, and run-off-road (ROR) incidents frequently occur during lane changes near the roadside. It is therefore crucial to analyze the risks associated with near-roadside lane changes in depth. Data-driven models have advanced considerably and are now widely applied in driving safety research. This study employs deep learning models to predict ROR risk and compares the performance of risk sequence prediction with that of risk status prediction. Research on predicting ROR risk can advance the theoretical development of Advanced Driving Assistance Systems (ADAS) and driving intervention technologies, thereby reducing the risk of vehicles running off the road.
Research on roadside safety is increasing due to the significant frequency and severe consequences of roadside crashes [4]. Current roadside safety research typically analyzes safety-relevant factors such as road curves, road shoulders, and roadside signals [5]. Ewan et al. revealed that narrower road widths, narrower road shoulders, and larger curves increase crash risk [6]. Jiang et al. examined the relationship between road shoulder type and roadside crashes [7]. El Esawey et al. studied the relationship between the placement of roadside utility poles and utility pole collisions, finding that increasing the pole offset provides better safety improvements than increasing pole spacing [8]. Many reports have highlighted that pavement condition is critical to roadside safety [9,10]. Meanwhile, there are also studies focusing on the impact of human factors on roadside crashes [11,12].
While factor analysis enhances roadside safety through macro-policy and infrastructure development, it does not facilitate real-time analysis and prediction of ROR risks. Quantifying driving risks is fundamental for real-time driving safety analysis. Due to the rarity of traffic accidents, Surrogate Safety Measures (SSMs) are widely used in traffic safety research [13,14]. Time To Collision (TTC) is one of the earliest and most widely applied SSMs, initially used to assess the time required for a following vehicle to collide with a leading vehicle in car-following situations [15]. Subsequent studies have expanded the TTC metric from various perspectives, including application scenarios [16], assessment time windows [17], and mathematical formulation [18].
From the perspective of accident consequences, driving risk encompasses both the likelihood of collision and its severity. Shangguan et al. designed a rear-end risk assessment metric that considers both the probability of collision and its severity, primarily using the change in velocity derived from the law of conservation of energy to quantify collision severity [19]. Gabauer et al. also indicated that velocity change is effective for assessing the risk of roadside collisions [20]. Park et al. considered risk exposure and severity levels, designing a composite metric to quantify lane-changing risks based on stopping sight distance [21]. Chan et al. applied the product of velocity squared and the inverse of TTC to calculate collision risk, considering both collision likelihood and severity [22].
Safety evaluation is a crucial component of roadside safety. Before analyzing and predicting ROR risk, it is essential to assess and define it [23]. Previous studies have generally classified ROR risk into several levels using qualitative or quantitative methods. For instance, Cheng et al. categorized the rollover risk of roadside accidents into four categories based on accident outcomes and analyzed influencing factors using a Bayesian network [24]. Fang et al. utilized the inherent safety features of the roadside and the likelihood of vehicle ROR to statically classify roadside environment safety into five levels [25]. Long et al. employed the Acceleration Severity Index (ASI) to represent roadside risk levels and used the Fisher optimal segmentation algorithm to divide roadside risk into three categories [26].
Based on the quantification and classification of driving risks, modeling approaches can be used to predict these risks. The essence of real-time driving risk prediction lies in analyzing and forecasting time series by constructing a mapping between historical driving feature sequences and future driving risks. There are two main prediction methods in this field: statistical algorithms and data-driven algorithms. Traditional statistical algorithms predict future driving risks by capturing the evolutionary trends of historical driving feature sequences. Although they have a solid theoretical foundation, their application is often limited by assumptions, and they struggle to capture complex nonlinear dynamic features, resulting in suboptimal prediction performance. With the increase in available data and advancements in computing, data-driven algorithms have become the mainstream models in traffic safety analysis. Deep learning is a type of data-driven algorithm characterized by its non-parametric nature, which allows it to effectively capture nonlinear relationships among multidimensional variables. Shangguan et al. conducted a comparative analysis of several data-driven algorithms to evaluate their effectiveness in predicting real-time driving safety statuses [27]. Arvin et al. combined Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks to extract driving information from the initial 15 s for predicting crash and near-crash events [28]. Zhang et al. employed several deep learning models to predict the evolution of car-following risks, with results indicating that sequence prediction provides richer safety information than status prediction [29].
Despite the widespread application of machine learning and deep learning-based modeling methods in road traffic safety [30,31], and the extensive analysis of factors influencing roadside safety by numerous studies, three principal research gaps persist in this field: (1) ROR incidents can lead to severe consequences, yet research on them is less extensive than that on car-following and lane-changing risks; (2) Real-time prediction and prevention of driving risks are key technologies in advanced driving systems, but quantitative and predictive research on ROR risks is lacking, and factor analysis alone cannot directly support the construction of ADAS; (3) Predicting safety status is the mainstream approach in current driving risk prediction, but a single driving risk status fails to provide specific safety information, and few studies have comprehensively compared the differences in prediction precision and efficiency between risk sequence prediction and risk status prediction.
In response to the aforementioned research gaps, this study makes three primary contributions: (1) The ROR risk prediction experiment was conducted from the perspective of quantitative analysis. Near-roadside lane change samples were selected from the high-D natural driving dataset, and ROR risk was quantified based on the likelihood and severity of collisions. Subsequently, deep learning prediction techniques were employed to forecast ROR risks. (2) The prediction experiments on driving risk effectively demonstrate that the performance of sequence prediction is superior to that of commonly used status prediction. Five models representing mainstream deep learning prediction techniques were selected to predict ROR risk across different time window combinations. The results revealed that sequence prediction can enhance prediction precision and provide richer safety information without compromising efficiency. (3) An in-depth analysis was conducted on the impact of sample imbalance on prediction performance, the influence of lane-changing scenarios on lane-changing safety and duration, and the prediction of car-following risks.
The subsequent sections of the paper are organized as follows: Section 2 elucidates the methods and models applied in the study. Section 3 introduces the datasets used in this research and the training environment of the deep learning models. Section 4 provides a detailed analysis and discussion of the results. Finally, Section 5 concludes this study and outlines future research directions.
3. Dataset and Experiment Setting
3.1. Dataset Description and Sample Selection
The high-D dataset is a widely used naturalistic driving dataset for micro-driving behavior analysis, containing approximately 16.5 h of driving trajectory data recorded on German highways from an aerial perspective [42], with each highway scenario being approximately 420 m long. The high-D dataset is used in this study to extract near-roadside lane-changing samples for three reasons: (1) Large data volume. The high-D dataset records the trajectory data of over 110,000 vehicles. This large sample size enables deep learning models to learn data patterns during training, while sufficient validation and test samples enhance the validity of the prediction results; (2) High data quality. Advanced computer vision techniques were applied to extract the trajectories, limiting the positioning error to 10 cm. The data recording frequency is 25 Hz, which meets the real-time requirements of driving safety analysis; (3) Fixed scenarios. The selected scenarios in the high-D dataset are basic highway segments, where lane-changing behavior is not affected by ramps or changes in the number of lanes.
The high-D dataset contains detailed lane change (LC) information (original lane, target lane, surrounding vehicles, etc.) and has been widely used to study LC risk and patterns [43]. Near-roadside LC behaviors were extracted from the high-D dataset to evaluate the potential ROR risk. Drivers generally change lanes to pursue safety or efficiency, and the vehicles surrounding the subject vehicle can influence LC behavior. Up to four surrounding vehicles are considered (preceding vehicle in the original lane: pre; following vehicle in the original lane: fol; preceding vehicle in the target lane: t_pre; following vehicle in the target lane: t_fol), and their spatial and motion information was included in the risk analysis. The longitudinal and lateral velocity, distance, and acceleration of the surrounding vehicles were collected, and their differences relative to the subject vehicle were calculated. If no vehicle is present in one of the four positions (pre, fol, t_pre, t_fol), the longitudinal distance was set to 420 m, the lateral distance to 5 m, and the relative velocity and acceleration to 0.
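The following minimal sketch illustrates this default-filling step; the column naming scheme (e.g., t_fol_dx for the longitudinal gap to the following vehicle in the target lane) is hypothetical rather than the original high-D field names:

```python
import pandas as pd

# Default values used when a surrounding vehicle (pre, fol, t_pre, t_fol) is absent:
# longitudinal gap = 420 m (scenario length), lateral gap = 5 m, relative motion = 0.
DEFAULTS = {"dx": 420.0, "dy": 5.0, "dvx": 0.0, "dvy": 0.0, "dax": 0.0, "day": 0.0}

def fill_missing_neighbors(frame: pd.DataFrame,
                           positions=("pre", "fol", "t_pre", "t_fol")) -> pd.DataFrame:
    """Replace missing neighbor features with the fixed defaults described in the text."""
    frame = frame.copy()
    for pos in positions:
        for feat, value in DEFAULTS.items():
            col = f"{pos}_{feat}"  # e.g. "t_fol_dx" (hypothetical column name)
            if col in frame.columns:
                frame[col] = frame[col].fillna(value)
    return frame
```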
In this study, the extraction of lane-changing samples combines data on the vehicle's lane position with specific quantitative criteria. As shown in Figure 5, the lane-changing duration is determined by setting a threshold for the vehicle's lateral velocity. If the threshold is set too high, some lane-changing driving data may be lost from the samples. Since there is an inherent disturbance in the lateral speed of manually driven vehicles, setting the threshold too low may include segments unrelated to the lane-changing process. By observing the lane-changing process of multiple samples, we set the lateral speed threshold to 0.01 m/s, which balances the effectiveness and completeness of sample extraction. The LC duration is determined in the following three steps:
Step 1: The near-roadside LC samples and surrounding vehicles were identified according to the official high-D documentation.
Step 2: The LC moment t0 was determined first. Searching forward from t0 until the lateral speed falls below 0.01 m/s gives the end of the LC (tend), and searching backward from t0 gives the beginning of the LC (tbegin). The LC duration is [tbegin, tend].
Step 3: Samples in which a surrounding vehicle left the aerial view during the LC duration were excluded.
Finally, 660 complete near-roadside LC samples were identified. To smooth the data and compress its volume, the data were aggregated at a granularity of 0.2 s by averaging.
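A minimal sketch of the threshold-based boundary search and the 0.2 s aggregation is given below, assuming a per-frame lateral-speed array, an integer frame index t0 at the LC moment, and a 25 Hz integer-indexed sample table (all variable names are illustrative):

```python
import numpy as np
import pandas as pd

LAT_V_THRESHOLD = 0.01  # m/s, lateral-speed threshold from Section 3.1
FRAMES_PER_STEP = 5     # 25 Hz frames averaged into one 0.2 s step

def lc_duration(lat_speed: np.ndarray, t0: int) -> tuple[int, int]:
    """Search backward/forward from the LC moment t0 until |lateral speed| drops
    below the threshold; returns (t_begin, t_end) as frame indices."""
    t_begin = t0
    while t_begin > 0 and abs(lat_speed[t_begin - 1]) >= LAT_V_THRESHOLD:
        t_begin -= 1
    t_end = t0
    while t_end < len(lat_speed) - 1 and abs(lat_speed[t_end + 1]) >= LAT_V_THRESHOLD:
        t_end += 1
    return t_begin, t_end

def aggregate_02s(sample: pd.DataFrame) -> pd.DataFrame:
    """Average every 5 consecutive 25 Hz frames to obtain 0.2 s granularity."""
    return sample.groupby(sample.index // FRAMES_PER_STEP).mean()
```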
The KMeans++ algorithm was used to cluster the ROR risk of the selected samples. As shown in Figure 6, the clustering error ceases to decrease significantly when the number of clusters exceeds four. Therefore, the number of ROR risk statuses is set to four: Safe, Low-risk, Medium-risk, and High-risk. The numerical ranges and proportions of each risk status are presented in Table 1.
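A sketch of this clustering step with scikit-learn, assuming the quantified ROR risk values are available as a one-dimensional array; the exact preprocessing used in the study may differ:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_ror_risk(risk_values: np.ndarray, k_range=range(2, 9)):
    """Cluster scalar ROR risk values with k-means++ and report the inertia
    (clustering error) used for the elbow plot in Figure 6."""
    x = risk_values.reshape(-1, 1)
    inertias = {}
    for k in k_range:
        km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(x)
        inertias[k] = km.inertia_
    # The elbow suggests k = 4: Safe, Low-risk, Medium-risk, High-risk.
    final = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(x)
    return final.labels_, inertias
```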
3.2. Experiment Setting
Our experiments were carried out on an Intel i7-14700KF CPU and an NVIDIA GeForce RTX 4070 Ti SUPER GPU with 16 GB of memory. The framework was developed using Python 3.11 and PyTorch 2.2.1. The dataset was split into training, validation, and test sets in a ratio of 6:2:2. The data were standardized before being fed to the network, with the scaler parameters fitted on the training set.
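A minimal sketch of the split-and-standardize procedure, assuming samples are stored as a (samples, time steps, features) array; the study's actual pipeline may differ in detail:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_and_scale(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """6:2:2 train/validation/test split; the scaler is fitted on the training set only."""
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=seed)

    n_features = X.shape[-1]
    scaler = StandardScaler().fit(X_train.reshape(-1, n_features))

    def standardize(a: np.ndarray) -> np.ndarray:
        # Standardize per feature, then restore the (samples, steps, features) shape.
        return scaler.transform(a.reshape(-1, n_features)).reshape(a.shape)

    return (standardize(X_train), y_train), (standardize(X_val), y_val), (standardize(X_test), y_test)
```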
The setting of parameters significantly impacts model performance. Referring to the empirical ranges of parameter settings from related models [37,38], we defined the hyperparameter space of the models, and grid search was employed to select the primary hyperparameters. In terms of training parameters, the candidate learning rates were (0.01, 0.005, 0.001). Meanwhile, weight decay and learning rate decay [44] were applied during training to alleviate overfitting. A batch learning strategy was used, with batch size options of (128, 256, 512).
In terms of model parameters, adjustments were made primarily to the main hyperparameters. For the LSTM-related models, the number of layers was chosen from (1, 2) and the hidden layer dimension from (16, 64, 128). The number of attention heads in the multi-head attention mechanism was chosen from (2, 4), and the number of encoder layers from (2, 3). The kernel size of the 2D convolutional network was uniformly set to 3 × 3, with the channel numbers of the convolutional layers set to 16 and 32, respectively. Similarly, the kernel size of the 1D convolutional network was set to 3, with 32 and 64 channels for the convolutional layers, respectively. All models were trained for 100 epochs, and the checkpoint that performed best on the validation set during training was saved and used for testing.
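For illustration, the search space described above can be encoded as a simple grid; the dictionary below only restates the ranges given in the text and is not the study's original tuning script:

```python
from itertools import product

# Hyperparameter search space summarized from Section 3.2.
GRID = {
    "learning_rate": [0.01, 0.005, 0.001],
    "batch_size": [128, 256, 512],
    "lstm_layers": [1, 2],
    "hidden_dim": [16, 64, 128],
    "attention_heads": [2, 4],
    "encoder_layers": [2, 3],
}

def grid_configs(grid: dict):
    """Yield every configuration in the Cartesian product of the search space."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```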
Related research indicates that warning the driver 0.5–1 s before a potential traffic crash can effectively prevent the collision [45]. To investigate the impact of different observation and prediction window lengths, the lengths of these windows were set to (0.6 s, 1 s, 2 s). The prediction windows were set considering the gradient of driving intervention. If the model predicts a high driving risk within the next 0.6 s or 1 s, emergency and forceful braking control measures can be taken to prevent an accident. A 2 s prediction window is more suitable for milder braking control measures and can also support providing risk alerts to the driver.
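A sketch of how observation/prediction window pairs can be sliced from one aggregated sample (0.2 s per step); the study's exact windowing code is not published, so this is only an assumed implementation:

```python
import numpy as np

STEP = 0.2  # s, aggregation granularity from Section 3.1

def make_windows(features: np.ndarray, risk: np.ndarray, obs_s: float, pred_s: float):
    """Slice one lane-change sample into (observation, prediction) window pairs.
    features: (T, F) array; risk: (T,) array; window lengths in seconds."""
    n_obs, n_pred = int(obs_s / STEP), int(pred_s / STEP)
    X, Y = [], []
    for t in range(len(risk) - n_obs - n_pred + 1):
        X.append(features[t : t + n_obs])                # model input: past feature sequence
        Y.append(risk[t + n_obs : t + n_obs + n_pred])   # sequence prediction target
    return np.stack(X), np.stack(Y)  # status prediction would instead use a single future label
```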
3.3. Evaluation Metrics
To comprehensively assess the model's prediction performance, the Macro F1 Score (MFS) was used to evaluate the model's effectiveness in ROR risk prediction. The precision and recall for predicting status k are shown in Equations (3) and (4), respectively, where TPk represents the number of correctly predicted samples of status k, and FPk and FNk represent the false positives and false negatives for status k. Considering both precision and recall, the F1 score is calculated as shown in Equation (5). Assuming there are n statuses, the MFS is calculated in Equation (6). A larger MFS value indicates higher average precision and recall across all risk categories, signifying better model performance.
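A direct implementation of the MFS as defined above (per-status precision, recall, and F1, averaged over the n statuses):

```python
import numpy as np

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray, n_status: int = 4) -> float:
    """Macro F1 Score (MFS): unweighted mean of the per-status F1 scores."""
    f1_scores = []
    for k in range(n_status):
        tp = np.sum((y_pred == k) & (y_true == k))   # TPk
        fp = np.sum((y_pred == k) & (y_true != k))   # FPk
        fn = np.sum((y_pred != k) & (y_true == k))   # FNk
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
        f1_scores.append(f1)
    return float(np.mean(f1_scores))
```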
4. Results
4.1. Prediction Results
The average MFS for each model under sequence prediction and status prediction is presented in Figure 7. The overall predictive performance of each model in sequence prediction is superior to that in status prediction. With the exception of the Transformer, which shows similar performance in both modes, the average MFS of the other four models in sequence prediction exceeds that of status prediction by 3.13% to 6.54%. This indicates that sequence prediction outperforms status prediction in terms of prediction precision. Among the models, LSTM-CNN achieves the best average performance for sequence prediction, followed by CNN and LSTM. Although the Transformer significantly outperforms the other models in status prediction, its status prediction performance still lags noticeably behind the sequence prediction of LSTM-CNN, CNN, and LSTM. Despite employing model combination and the multi-head attention mechanism, CNN-LSTM-MA performed the worst among the five models. This suggests that combining the serial CNN-LSTM structure with the multi-head attention mechanism did not yield a positive gain in model performance, and it also shows that stacking models and advanced techniques does not necessarily enhance performance.
The distribution of the MFS across different time window combinations is shown in Figure 8, and several key findings can be drawn:
Deep learning models achieved excellent results in predicting ROR risk, with sequence prediction generally outperforming status prediction. The optimal MFS for sequence prediction reached 0.964, 0.934, and 0.858 for prediction windows of 0.6 s, 1 s, and 2 s, respectively (with the CNN model's observation window set at 2 s). Although the mainstream models differ in their feature extraction mechanisms, the MFS for sequence prediction is generally higher than that for status prediction across different time window combinations. Intuitively, the MFS plane formed by the sequence prediction results (red) lies above the plane formed by status prediction (blue). Although sequence prediction is not consistently superior to status prediction across all time window combinations in the Transformer model, its sequence prediction for the next 0.6 s is consistently better than status prediction within each observation window. Furthermore, under the various time window combinations, the average performance of sequence prediction surpasses that of status prediction. The difference in MFS tends to widen as the prediction window grows, especially for the LSTM-CNN model, where the average MFS for sequence prediction is 11.76% higher than that for status prediction with a 2 s prediction window. This indicates that the sequence prediction modeling approach is superior to status prediction, and that establishing a mapping between historical features and future ROR risk sequences yields more robust prediction results.
The MFS significantly declines as the prediction window increases. When the observation window is set to 2 s and the prediction window to 0.6 s, the MFS for all models in sequence prediction exceeds 0.9. However, when the prediction window is extended to 2 s, the highest MFS observed in sequence prediction is only 0.858 (CNN with a 2 s observation window). A similar trend is observed in status prediction, where the MFS decreases noticeably with the increase in the prediction window. This trend aligns with the general understanding that the difficulty of time series prediction increases with the length of the prediction window.
The predictive performance of models does not necessarily improve with the extension of the observation window. While the MFS generally increases with the observation window when the prediction window is set to 0.6 s, several models exhibit a decline in MFS when the prediction window is 2 s. For instance, when the prediction window is set to 2 s, the CNN-LSTM-MA model has an MFS of 0.793 with a 0.6 s observation window in sequence prediction, but this drops to 0.751 when the observation window is increased to 2 s. This decline is observed in both sequence and status predictions across different models. This suggests that although a longer observation window can provide more risk-related information for predictions, it also increases the complexity of the data and introduces potential noise unrelated to the risk, leading to a deterioration in model performance.
4.2. Real-Time Efficiency of Risk Prediction
In addition to prediction precision, the efficiency of model operation is crucial for real-time risk prediction. To fairly assess the efficiency of the different prediction approaches, the sequence and status prediction models were configured with the same structural parameters. With both the observation and prediction windows set to 2 s, the number of predictions performed per second was recorded, and the average results over 10 runs are shown in Figure 9. The test results indicate that there is no significant efficiency difference between sequence and status prediction. Although sequence prediction theoretically involves slightly more model parameters due to its larger output dimension, the experimental results indicate that these minor differences do not significantly affect execution efficiency. For the LSTM, CNN, and CNN-LSTM-MA models, status prediction is slightly more efficient than sequence prediction, whereas sequence prediction is more efficient for LSTM-CNN and Transformer. This variation is mainly attributable to run-to-run variability in hardware performance.
The CNN demonstrated the best efficiency, capable of performing 10,231 predictions per second under sequence prediction. Due to its complex gating structure, the LSTM model is less efficient than the CNN, and the efficiency of the LSTM-CNN model is further reduced by model stacking. The multi-head attention mechanism introduces additional model parameters. Although CNN-LSTM-MA and Transformer exhibit the lowest efficiency among the five models, each is still capable of performing approximately 2000 predictions per second, which is sufficient to meet the real-time requirements of intelligent driving systems.
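A simple way to measure predictions per second for a trained PyTorch model, assuming a single pre-batched input tensor; the study's exact benchmarking setup is not specified, so this is only an illustrative sketch:

```python
import time
import torch

@torch.no_grad()
def predictions_per_second(model: torch.nn.Module, sample: torch.Tensor,
                           n_runs: int = 10, n_iter: int = 1000) -> float:
    """Average number of forward passes per second over n_runs timed repetitions."""
    model.eval()
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        for _ in range(n_iter):
            model(sample)
        if sample.is_cuda:
            torch.cuda.synchronize()  # wait for queued GPU work before stopping the timer
        rates.append(n_iter / (time.perf_counter() - start))
    return sum(rates) / len(rates)
```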
4.3. The Impact of Imbalanced Dataset
As mentioned above, there is an imbalance among the four safety statuses of the samples, with the Safe status accounting for 71.19% of the samples, while the High-risk category comprises only 0.79%. This imbalance can introduce learning biases during model training, as excessive focus on the predominant categories may adversely affect the prediction performance for minority categories. The confusion matrix for the CNN predictions with both the observation and prediction windows set to 2 s is shown in Figure 10. Both sequence and status predictions exhibit varying degrees of per-category accuracy disparity. In sequence prediction, the accuracy for the Low-risk category reaches 96.71%, whereas the accuracy for the minority High-risk category is only 64.71%, a gap of 32%. In status prediction, the accuracy for High-risk predictions is merely 50.98%, widening the gap to 36.51% relative to the Low-risk category.
On the one hand, this finding highlights that the imbalance in safety status significantly impacts prediction outcomes. On the other hand, it also demonstrates that sequence prediction can mitigate status imbalance issues compared to status prediction. The sequence-to-sequence mapping approach aligns with the sequential nature of risk sequences, potentially overcoming the learning biases that arise during the mapping from sequences to category probabilities in status prediction.
4.4. Case Study of Sequence Prediction
In addition to its accuracy advantages, sequence prediction provides detailed information about the evolution of risk sequences compared to status prediction. The evolution of ROR risk for a specific sample is shown in Figure 11a, which reflects the general trend of risk evolution during a near-roadside lane change. In the initial stage of the lane change, due to the increase in lateral speed and the decrease in the distance to the roadside, the ROR risk gradually increases to Medium-risk. Once the vehicle enters the target lane, the lateral velocity begins to decrease and the ROR risk starts to decline.
The sequence prediction models’ results for the rising, turning, and declining phases of ROR risk are shown in
Figure 11b–d. The results indicate that all five prediction models accurately forecasted the continuous upward trend of ROR risk. Except for the LSTM model, which predicted the highest risk level as Low-risk for the upcoming 2 s, the other models predicted the risk status correctly. The turning points of risk contain two critical information: the risk degree and the risk transition moment. On one hand, the sequence prediction models accurately predicted the safety status at the risk extremum. On the other hand, they identified the risk turning point approximately 1.2 s in advance (each time step being 0.2 s), providing additional valuable information for driving safety. Simultaneously, the sequence prediction models also accurately captured the sequence trend where the ROR risk declined.
In the practical application of risk sequence prediction models, precise driving interventions can be implemented by combining the predicted risk status with the trend of risk evolution. For instance, if the ROR risk rises rapidly to a high-risk level without showing any signs of reversal, this indicates that the driving risk is likely to continue increasing in the future. In this case, lateral emergency braking measures should be taken to prevent the vehicle from running off the road. Conversely, when the model predicts that driving risk will rise to a high-risk level within a certain timeframe and simultaneously indicates a risk reversal, appropriate safety warnings can be issued to the driver. Since the model performs risk predictions adopting a sliding time window approach, it can also dynamically adjust the driving intervention methods by comparing the actual risk values with the predicted values during the prediction process.
4.5. The Impact of Surrounding Vehicles on ROR Risk
To further study the influence of interaction information, the lane change (LC) duration and mean risk of different situations are also discussed. The situations are defined by the presence or absence of the four surrounding vehicles, giving four main LC situations: pre and t_pre only (S1, 211 samples); pre, t_pre, and fol (S2, 66 samples); pre, t_pre, and t_fol (S3, 121 samples); and all four surrounding vehicles (S4, 195 samples). The average LC duration and risk of these situations are shown in Figure 12. As the duration and risk do not follow a normal distribution, the Kruskal-Wallis H test was applied to assess the differences among situations. The duration and risk differ significantly across situations (duration: H = 17.177, p < 0.001; risk: H = 17.103, p < 0.001), and Bonferroni multiple comparisons reveal significant differences in duration for S2–S3 (p = 0.012), S2–S1 (p = 0.004), and S4–S1 (p = 0.042), and in risk for S1–S4 (p < 0.001). The main difference among the situations is the presence of a following vehicle in the original and target lanes. The duration differences for S2–S1 and S2–S3 indicate that a following vehicle in the original lane hastens the LC process, and the duration and risk differences for S1–S4 suggest that additional following vehicles increase the risk of a near-roadside LC while shortening the LC process.
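The group comparison can be reproduced with SciPy as sketched below; since the exact post-hoc procedure is not specified, pairwise Mann-Whitney U tests with a Bonferroni correction are used here as one common choice:

```python
from itertools import combinations
from scipy import stats

def compare_situations(groups: dict[str, list[float]]) -> None:
    """Kruskal-Wallis H test across LC situations (e.g., S1-S4 durations or risks),
    followed by Bonferroni-corrected pairwise comparisons."""
    h, p = stats.kruskal(*groups.values())
    print(f"Kruskal-Wallis: H = {h:.3f}, p = {p:.4f}")

    pairs = list(combinations(groups, 2))
    for a, b in pairs:
        _, p_pair = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided")
        p_adj = min(p_pair * len(pairs), 1.0)  # Bonferroni-adjusted p value
        print(f"{a} vs {b}: adjusted p = {p_adj:.4f}")
```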
4.6. Prediction Performance in Car-Following Scenario
To study the applicability of sequence prediction in car-following scenarios, we randomly selected 1000 car-following samples from the high-D dataset to conduct a longitudinal risk prediction study. Considering that car-following on highways may involve high speeds with small speed differences, traditional car-following risk indicators such as Time-To-Collision (TTC) may not adequately reflect the driving risk. We therefore used the Safety Margin (SM) to quantify driving risk during the car-following process [46]. The empirical formula for SM is shown in Equation (7), where Vn represents the velocity of the following vehicle, Vn−1 the velocity of the leading vehicle, Dn the following distance, and g the gravitational acceleration (9.8 m/s²). This empirical formula also accounts for factors such as driver reaction time and road friction. The dataset division and model training process are consistent with those used for ROR risk prediction.
The input features for the car-following risk prediction model include 11 feature sequences related to the leading and following vehicles, which can be categorized into three groups: (1) following vehicle-related: velocity, acceleration, and jerk; (2) leading vehicle-related: velocity, acceleration, and jerk; (3) interaction between following and leading vehicles: velocity difference, acceleration difference, jerk difference, car-following distance, and SM. The model’s output is either the SM sequence over a future period (sequence prediction) or the car-following risk level (status prediction).
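A sketch of how these 11 feature sequences might be assembled, assuming per-frame velocity and acceleration columns, a precomputed gap series, and an SM series obtained from Equation (7); all column and variable names are illustrative:

```python
import numpy as np
import pandas as pd

def car_following_features(follow: pd.DataFrame, lead: pd.DataFrame, gap: np.ndarray,
                           sm: np.ndarray, dt: float = 0.2) -> pd.DataFrame:
    """Assemble the 11 input feature sequences; jerk is approximated as the numerical
    derivative of acceleration. Columns "v" and "a" are hypothetical placeholders."""
    v_f, a_f = follow["v"].to_numpy(), follow["a"].to_numpy()
    v_l, a_l = lead["v"].to_numpy(), lead["a"].to_numpy()
    jerk_f, jerk_l = np.gradient(a_f, dt), np.gradient(a_l, dt)
    return pd.DataFrame({
        "v_f": v_f, "a_f": a_f, "jerk_f": jerk_f,                      # following vehicle
        "v_l": v_l, "a_l": a_l, "jerk_l": jerk_l,                      # leading vehicle
        "dv": v_f - v_l, "da": a_f - a_l, "djerk": jerk_f - jerk_l,    # interaction terms
        "gap": gap,                                                    # car-following distance
        "sm": sm,                                                      # SM from Equation (7)
    })
```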
As shown in Figure 13, the car-following risk is categorized into five statuses using the KMeans++ clustering algorithm and the elbow method. The average MFS for the different models in car-following risk prediction across the various prediction windows is illustrated in Figure 14. Notably, the MFS for sequence prediction is consistently higher than that for status prediction across all five models, further indicating that the sequence prediction approach is superior to status prediction. Among the models, LSTM-CNN and Transformer achieved the best performance for sequence prediction and status prediction, with MFS values of 0.984 and 0.974, respectively. Despite its basic model structure, LSTM still attained the second-best performance in sequence prediction, demonstrating the advantage of recurrent neural network-based models in encoding time series. The CNN-LSTM-MA model again failed to achieve ideal performance in car-following risk prediction.