1. Introduction
An urban expressway is a high-speed multi-lane highway with access ramps and median dividers. The expressway is an important component of the urban transportation system in China’s major developed cities. Urban expressways have distinctive features, such as short spacing between entrance or exit ramps, complex connections with frontage roads [1], and short deceleration lanes, which increase the probability of accidents. The off-ramps of urban expressways suffer more crashes than other segments because of frequent lane-changing. According to the crash report (Traffic Management Bureau, Ministry of Public Security 2017), there were 6116 crashes and 1472 fatalities on expressways in 2016. In addition, urban expressway crashes reportedly made up 5.6% of total roadway crashes in Xi’an and caused 673,370 RMB of economic loss in 2011 [2]. In Shanghai, crashes at off-ramps accounted for 21.6% of total expressway crashes [3]. Crashes are therefore likely to occur in the vicinity of ramps, and it is important to predict and evaluate the crash risk at ramps. However, random circumstances and missing data make it difficult to achieve accurate safety evaluations.
Many factors play a role in the safety of off-ramps, including annual average daily traffic (AADT), ramp configurations, ramp lengths, deceleration lanes, acceleration lanes [4], and the number of lanes [1]. It has been found that ramp AADT, the number of exit ramp lanes, ramp alignment [5], speed-changing lane length [6,7,8], vehicle characteristics (steering, acceleration, and speed) [9,10], the number of lanes of ramps [11,12], driving behaviors [13,14], etc., are closely related to ramp crashes.
Traditional models used to predict the number of crashes include Poisson, negative binomial [5,15], and generalized linear models with main-effect variables [16,17]. Because crash data are sometimes incomplete, surrogate safety measures (SSMs) have attracted growing attention [18]. Such studies analyze traffic conflict models [19], and several SSMs are also widely used to evaluate the severity of a conflict [20,21]. The time-to-collision (TTC) is one of the most widely used indicators for detecting dangerous situations [22,23,24,25]. It is the time after which a crash would occur if both vehicles continued traveling at their current constant speeds, so it can be used to calculate the probability of a crash. This indicator is inversely related to crash risk (smaller TTC values indicate higher crash risks) [26,27,28]. TTC is significantly affected by the road environment, traffic flow conditions, driver characteristics, weather [21,29], and so on. It has been found that the TTC decreases with an increase in traffic density [30].
Crash risk estimation calculates the likelihood of a crash before it occurs, and the related models can be broadly classified into statistical methods, classification trees, and artificial intelligence. A statistical model is based on a certain probability distribution. Therefore, the traditional crash frequency modeling cannot be conducted. Case-control logistic models, such as logistic regression and Tobit regression, are employed to quantify the relationship between the explanatory variables and the risk level [31,32]. The Bayesian network model emphasizes the logical cause of crashes [33]. Moinul et al. [34] used a Bayesian belief network for a basic freeway segment, and the classification rate was only 66%. The non-parametric classification tree is usually applied to injury severity forecasting. The well-known artificial neural network (ANN) paradigms have also been investigated for crash risk estimation. The reliability of their results relies on three features: network architecture, the model of the neuron, and learning algorithms [35]. Lee et al. [36] used a multi-layer perceptron neural network to determine collision warnings in the car-following state; the output was the level of collision severity.
Previous studies focused mainly on freeway segments rather than off-ramps, and few have examined collision risks at ramps. A limitation of the logistic regression model is that it assumes no dependency among the influencing risk factors. The critical values of risk severity for expressways are unclear [37,38,39]. The TTC distribution is uncertain in practice, and the 15% TTC is commonly taken as the threshold for traffic conflicts. To avoid bias, most studies ignore the severity prediction of conflicts. The classification tree and the neural network do not require a specific functional form to model the risk factors. However, they cannot interpret the relationship between the factors and crash severity. The Bayesian network model is time consuming when the samples have a large dimension.
The crash risk level is measured by the TTC value. To quantify the relationship between the contributing factors and the severity of crash risks, the crash risk is divided into different severity levels. However, the crash risk division in previous studies was unclear. Here, the distribution of the observed TTCs is explored to determine the transitions between safety states, and the crash risk levels are classified according to this distribution. Since the risk level is an ordinal response variable, an ordinal method is employed to analyze crash risks. In this study, a naive Bayesian model is used to overcome the time consumption problem.
The contributions of this research are as follows: (1) the TTC distributions at off-ramps are explored with a Gaussian mixture model (GMM); (2) the critical values for crash risk severity are determined; (3) a naive Bayesian model is developed to explore the relationship between the severity of the TTC and the explanatory variables. An ordinal logistic regression model is also built for comparison.
The remainder of the paper is organized as follows. Section 2 introduces the traffic flow analysis at off-ramps. Section 3 presents the estimation of the TTC distribution functions. The ordinal logistic regression model and the naive Bayesian model are developed and compared for crash risk estimation in Section 4. The discussion and conclusions are given at the end.
2. Data Processing
2.1. Data Collection
In order to explore the crash risks upstream of expressway off-ramps, we selected five sites on an urban expressway in Xi’an. The expressway has six lanes, each 3.75 m wide. The five sites are shown in Figure 1. The speed limit is 100 km/h for passenger cars and 80 km/h for trucks. The off-ramp influence area extends from 200 m before the ramp to 150 m past it, and there is no on-ramp in this influence area. The investigated off-ramps are on straight and flat segments, connecting parallel to accesses. The length of the deceleration lane is 140 m. Traffic signs are placed 2 km, 1 km, and 500 m ahead of the exits. The vehicles are divided into passenger cars and trucks according to the “Technical Standard for Highway Engineering 2016”. Vehicles whose axle spacing is greater than 3.8 m are classified as trucks; all others are considered passenger vehicles. An unmanned aerial vehicle was used to collect traffic flow data for 20 min at each site during the morning peak hour on 2 and 3 December. Images from a camera mounted on a nearby pedestrian overcrossing were also used.
2.2. Statistical Analysis of Traffic Flow
The inner lane, middle lane, and outer lane are referred to as Lanes 1, 2, and 3, respectively. Vehicle trajectories were extracted from the unmanned aerial vehicle video with Tracker at a frequency of 5 Hz. The video contained 25 frames per second, so one position was recorded every five frames. The vehicle trajectories were extracted from the video images by the following steps:
Step 1: A coordinate reference system was set up in the video to calculate vehicle positions at different times.
Step 2: In this study, the 3.75 m width of the lane was used to calibrate the position of vehicles with the ground plane.
As shown in Figure 2, the pedestrian overcrossing and the upper-bound road constituted a coordinate reference system. Figure 3 shows the vehicle trajectories extracted from the video. The red trajectories represent the vehicles in the mainline. The blue trajectories indicate the vehicles out of the mainline. The green trajectories represent vehicles in the auxiliary lane.
Each vehicle was treated as a single pixel and tracked automatically. We extracted 1552 trajectories as samples at off-ramps. An eight-second video segment was analyzed for the observation area. Because the vehicle position was calculated at 5 Hz, each trajectory record contained 40 position points.
In this study, the Grubbs outlier test was used to detect outliers in the spot speed, as in Equation (1). The outliers were removed from the trajectory data to eliminate errors and improve accuracy. Missing data were replaced by the mean of the adjacent points to obtain a complete trajectory.
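The cleaning step described above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function names, the use of `None` to mark a missing point, and the caller-supplied critical value `g_crit` (normally taken from Grubbs tables for the chosen sample size and significance level) are all assumptions.

```python
from statistics import mean, stdev


def grubbs_statistic(speeds):
    """Return (G, index) for the observation farthest from the sample mean.

    G = max|x_i - mean| / s, where s is the sample standard deviation.
    """
    m, s = mean(speeds), stdev(speeds)
    idx = max(range(len(speeds)), key=lambda i: abs(speeds[i] - m))
    return abs(speeds[idx] - m) / s, idx


def remove_one_outlier(speeds, g_crit):
    """Mark the most extreme speed as missing (None) if G exceeds g_crit.

    g_crit is a hypothetical, caller-supplied critical value.
    """
    g, idx = grubbs_statistic(speeds)
    cleaned = list(speeds)
    if g > g_crit:
        cleaned[idx] = None
    return cleaned


def fill_gaps(track):
    """Replace None entries by the mean of the nearest valid neighbours.

    Assumes the track contains at least one valid value.
    """
    filled = list(track)
    for i, value in enumerate(filled):
        if value is None:
            # Nearest already-filled point on the left, nearest valid one on the right.
            left = next((filled[j] for j in range(i - 1, -1, -1)
                         if filled[j] is not None), None)
            right = next((track[j] for j in range(i + 1, len(track))
                          if track[j] is not None), None)
            neighbours = [x for x in (left, right) if x is not None]
            filled[i] = sum(neighbours) / len(neighbours)
    return filled
```

A point at either end of a trajectory has only one neighbour, so in this sketch it simply takes that neighbour's value.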
The correctness of each sample was calculated as Equation (1).
Therefore, the correctness of the subsample was calculated as follows:
The correctness of the total sample was 95.33%. This showed that the quality of the extracted data was sufficient to meet the modeling requirements.
The average speed, 85% speed, speed standard deviation (speed S.D.), and traffic volume for each lane are summarized in Table 1. The traffic composition was calculated for Lane 3.
Table 1 reveals that the average speed at the off-ramps ranged from 25.88 km/h to 45.61 km/h, and the 85% speed ranged from 38.56 km/h to 49.50 km/h. The speed of Lane 1 was the highest of the three lanes, with an 85% speed above 44.57 km/h. The speeds of Lane 2 and Lane 3 were similar. The speed S.D. of Lane 1 was higher than that of the other lanes, indicating larger speed differences among vehicles. The 85% speed of Lane 3 was the lowest, usually below 40 km/h, owing to the strong influence of lane-changing. The traffic volumes were between 697 veh/h and 919 veh/h, and the traffic flow in Lane 2 was higher than in the other lanes. The diverging rate is the proportion of vehicles exiting at the current off-ramp. Because of large residential communities nearby, the diverging rates were higher at Site 1 and Site 4. The highest diverging rate was 19.27% at Site 4, and the lowest was 6.71% at Site 5. The truck percentage ranged from 6.25% to 23.8%. The speed decreased as the share of trucks in the traffic flow increased.
The speed S.D. for the five sites of each lane is shown in Figure 4. It can be seen from Figure 4 that Lane 3 had the maximum speed S.D. and Lane 1 had the minimum speed S.D. at each location, which indicated that Lane 3 had a wider range of speed variation and a greater risk of crashes compared with other lanes.
3. Methodology
3.1. TTC Definition
TTC is defined as the time required for two vehicles to collide if they continue at their present speeds on the same path [23]. The two vehicles are in a car-following state, as shown in Figure 5. TTC thus serves as an indicator of crash risk.
The TTC of vehicles on the mainline at off-ramps can be calculated using Equation (2):

$$TTC_f(t) = \frac{L_l(t) - L_f(t) - l_l(t)}{V_f(t) - V_l(t)}, \quad V_f(t) > V_l(t) \qquad (2)$$

where TTCf(t) denotes the TTC of the following vehicle; Ll(t) denotes the position of the leading vehicle at time t; Lf(t) denotes the position of the following vehicle at time t; ll(t) denotes the length of the leading vehicle; Vl(t) denotes the speed of the leading vehicle; and Vf(t) denotes the speed of the following vehicle.
The distance between the rear of the leading vehicle and the front of the following vehicle can be represented as Ll(t) − Lf(t) − ll(t). The speed difference between the two vehicles can be represented as Vf(t) − Vl(t). Therefore, the TTC of individual vehicles can be calculated by this equation.
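As a minimal sketch of this calculation, the function below divides the bumper-to-bumper gap by the closing speed. The function name, the units (metres and metres per second), and the choice to return infinity when the follower is not closing on the leader are assumptions made for illustration.

```python
import math


def time_to_collision(lead_pos, follow_pos, lead_len, lead_speed, follow_speed):
    """TTC of the following vehicle in seconds.

    gap     = Ll(t) - Lf(t) - ll(t)   (rear of leader to front of follower, m)
    closing = Vf(t) - Vl(t)           (m/s)
    Returns math.inf when the follower is not faster than the leader,
    because no collision would occur at constant speeds.
    """
    gap = lead_pos - follow_pos - lead_len
    closing = follow_speed - lead_speed
    if closing <= 0:
        return math.inf
    return gap / closing


# Leader front at 50 m, follower front at 20 m, leader 5 m long,
# leader at 10 m/s and follower at 15 m/s: 25 m gap closing at 5 m/s.
example_ttc = time_to_collision(50.0, 20.0, 5.0, 10.0, 15.0)
```

Pairs with an infinite TTC would normally be excluded before percentiles or distributions are computed.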
3.2. TTC Calculation
The lane-change process at off-ramps is shown in Figure 6.
We observed that vehicles often changed lanes from the middle lane or the outer lane to the auxiliary lane to diverge from the mainline.
The TTC for each vehicle was calculated using Equation (2). The values of 15% TTC, 50% TTC, and 85% TTC are summarized in Table 2.
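The percentile values can be obtained from a list of finite TTC samples as in the sketch below. Linear interpolation between order statistics is an assumption; the text does not say which percentile definition was used, and other definitions give slightly different values for small samples.

```python
def percentile(values, p):
    """p-th percentile (0-100) using linear interpolation between order statistics."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of an empty sample is undefined")
    pos = (len(xs) - 1) * p / 100.0
    lower = int(pos)                      # pos is non-negative, so int() floors
    upper = min(lower + 1, len(xs) - 1)
    return xs[lower] + (xs[upper] - xs[lower]) * (pos - lower)


def ttc_summary(ttc_values):
    """Return the 15%, 50%, and 85% TTC of a sample."""
    return tuple(percentile(ttc_values, p) for p in (15, 50, 85))
```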
Comparing the TTCs of the three lanes, the TTCs in Lane 3 were distributed over a wider range. The 85% TTCs were smaller than 34.73 s for the expressway, and the 50% TTCs ranged from 10.35 s to 29.67 s. Smaller TTCs indicate an increased probability of crash occurrence. The 15% TTCs of Lane 1 and Lane 2 were between 7.56 s and 23.65 s. The TTC of Lane 3 was significantly smaller than that of the other lanes because Lane 3 was disturbed by vehicles merging from the other lanes. Its 15% TTCs ranged from 3.05 s to 9.03 s, which indicates that Lane 3 was more dangerous than the other lanes. The TTC frequency distribution and the TTC cumulative distribution are depicted in Figure 7a,b.
Figure 7a shows the TTC distributions for the three lanes. Figure 7b shows the cumulative frequency of TTCs for all three lanes. For Lane 3, the 15% TTC, 50% TTC, and 85% TTC were 3.4 s, 7.5 s, and 16.7 s, respectively. The TTC curves of Lane 1 and Lane 2 almost overlapped, with a 15% TTC, 50% TTC, and 85% TTC of 3.3 s, 9.7 s, and 16.7 s, respectively. The 50% TTCs of Lane 1 and Lane 2 were greater than that of Lane 3.
Cross-sectional comparisons for the five sites are shown in Figure 8. A smaller TTC indicates a higher risk of crash. The 15% TTC, 50% TTC, and 85% TTC for Site 1 and Site 4 were smaller than those of the other sites, which indicates a higher crash risk level. Therefore, Site 1 and Site 4 were more dangerous than the other three sites.
3.3. TTCs’ Distribution Prediction with GMM
Understanding the distribution of TTCs is important for determining danger. The Gaussian mixture model (GMM) is widely used for probability density function (PDF) estimation; it is a parametric PDF represented as a weighted sum of Gaussian component densities. The basic theoretical assumption is that an arbitrary distribution can be approximated by a weighted sum of Gaussian models if enough of them are used. In this study, GMM was applied to explore the TTC distribution and capture its features for risk assessment without any distribution assumption. The PDF of the complete GMM is the sum of the sub-PDFs, as described by Equation (3):

$$p(x_k \mid \theta) = \sum_{i=1}^{M} w_i \, g(x_k \mid \mu_i, \sigma_i^2) \qquad (3)$$

where $x_k$ is the TTC of the kth vehicle; $w_i$ is the mixture weight, which lies between zero and one and represents the percentage of the TTCs belonging to category i, with $\sum_{i=1}^{M} w_i = 1$; $\mu_i$ is the mean; and $\sigma_i^2$ is the variance of component i. GMM was used to estimate the PDF of the TTC samples, and the estimated model was the sum of several Gaussian components with different probabilities $w_i$.
Each component density function is given by Equation (4):

$$g(x_k \mid \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x_k - \mu_i)^2}{2\sigma_i^2}\right) \qquad (4)$$

The complete GMM was parameterized by the means, variances, and mixture weights of all Gaussian component densities, $\theta = \{w_i, \mu_i, \sigma_i^2\}$, $i = 1, \dots, M$. The number of Gaussian sub-models should be consistent with the number of crash risk levels.
The most widely used method for estimating the parameters of a GMM is maximum likelihood estimation (MLE). For training samples $X = \{x_1, \dots, x_N\}$, the likelihood function of the GMM can be written as Equation (5):

$$L(\theta \mid X) = \prod_{k=1}^{N} p(x_k \mid \theta) \qquad (5)$$

where $p(x_k \mid \theta)$ is the GMM density of Equation (3) and $\theta$ collects the mixture weights, means, and variances. This function is called the likelihood of the parameters given the training data; it is a function of $\theta$ with the TTC values held fixed.
The expectation-maximization (E-M) algorithm iterates through two steps to estimate the parameters. It is a general method for finding maximum-likelihood estimates when the given data are incomplete or have missing values. The goal of the algorithm is to find the parameters $\theta$ that maximize $L(\theta \mid X)$. The two steps, the E-step and the M-step, are repeated until the change in the estimate between iterations falls below the convergence threshold.
The condition of convergence is given by Equation (6), where ε is a small preset constant.
In this study, the crash risks were divided into three levels: high, medium, and low. Therefore, three Gaussian sub-models were used to fit the TTCs’ distribution. The iteration stop condition was 1 × 10⁻¹⁵. The confidence level of estimation was 95%.
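The E-M fitting procedure can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions, not the authors' code: the quantile-based starting values, the variance floor, the iteration cap, and the choice to apply the stop condition to the change in log-likelihood are all assumptions introduced here.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(samples, n_components=3, tol=1e-15, max_iter=500):
    """Fit a 1-D Gaussian mixture by E-M; returns (weights, means, variances)."""
    xs = sorted(samples)
    n = len(xs)
    # Assumed deterministic start: means spread over the data quantiles.
    mus = [xs[int((i + 0.5) * n / n_components)] for i in range(n_components)]
    overall_mean = sum(xs) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in xs) / n, 1e-6)
    variances = [overall_var] * n_components
    ws = [1.0 / n_components] * n_components
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: responsibility of each component for each sample.
        resp, ll = [], 0.0
        for x in xs:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(ws, mus, variances)]
            total = max(sum(dens), 1e-300)   # guard against log(0) on underflow
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for i in range(n_components):
            nk = sum(r[i] for r in resp)
            ws[i] = nk / n
            mus[i] = sum(r[i] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[i] = max(var, 1e-6)    # assumed floor to avoid collapse
        # Assumed stop rule: change in log-likelihood below tol.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return ws, mus, variances
```

With three components, the fitted triples would correspond to the high, medium, and low risk sub-models once ordered by their means.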
The modeling process is shown in Figure 9.
GMM was used to estimate the TTC distribution for the five sites. The blue line represents the high risk level, and the green and red lines represent the medium and low risk levels, respectively.
The PDF indicates the probability of each crash risk level at a given TTC value. The intersection of the curves in Figure 10 marks the transition from high to medium risk. A TTC smaller than the intersection value is more likely to belong to the high crash risk level; otherwise, it is more likely to belong to the medium crash risk level. Therefore, the TTC at the transition point on the PDF curves was taken as the threshold separating high crash risk from medium crash risk.
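One way to locate such a transition point numerically is to search between the two component means for the TTC at which the weighted component densities are equal. The sketch below uses bisection; the function names and the `(weight, mean, variance)` tuple layout are assumptions, and the parameter values in the example are made up rather than taken from the fitted models.

```python
import math


def weighted_pdf(x, weight, mu, var):
    """Weighted Gaussian component density w * g(x | mu, var)."""
    return weight * math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def crossing_point(comp_a, comp_b, tol=1e-9):
    """TTC between the two means at which the weighted densities intersect.

    Each component is a (weight, mean, variance) tuple. Returns None when
    the density difference does not change sign between the means.
    """
    def diff(x):
        return weighted_pdf(x, *comp_a) - weighted_pdf(x, *comp_b)

    lo, hi = sorted((comp_a[1], comp_b[1]))
    if diff(lo) * diff(hi) > 0:
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if diff(lo) * diff(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


# Hypothetical components with equal weights and variances cross midway.
threshold = crossing_point((0.5, 2.0, 1.0), (0.5, 6.0, 1.0))
```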
The Kolmogorov-Smirnov (K-S) test is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution. Here, the K-S test was applied to compare the TTC samples with the proposed GMM probability distribution at the 0.05 significance level. The K-S sig. values show that the null hypothesis cannot be rejected, i.e., the samples are consistent with having been drawn from the reference distribution, as given in Table 3.
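The one-sample K-S statistic against a fitted mixture can be sketched as follows. The function names are hypothetical, and the snippet only computes the statistic D; the significance value reported in the table would come from the K-S distribution (for large samples, D is often compared with roughly 1.36/√n at the 0.05 level).

```python
import math


def gmm_cdf(x, weights, means, variances):
    """Cumulative distribution function of a 1-D Gaussian mixture."""
    return sum(w * 0.5 * (1.0 + math.erf((x - m) / math.sqrt(2.0 * v)))
               for w, m, v in zip(weights, means, variances))


def ks_statistic(samples, cdf):
    """Largest distance between the empirical CDF and the reference CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        fx = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - fx, fx - i / n)
    return d
```

A call such as `ks_statistic(ttc_samples, lambda x: gmm_cdf(x, ws, mus, variances))` would compare observed TTCs with a fitted mixture, where `ttc_samples`, `ws`, `mus`, and `variances` are placeholders for the data and fitted parameters.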
Table 4 lists the theoretical GMM estimates at the given percentages, with the severe crash thresholds in the last column.
The theoretical values of 15% TTC, 30% TTC, 50% TTC, and 85% TTC for the five sites were calculated. The severe crash risk thresholds obtained from the PDF curves for Sites 1 to 5 were 2.23 s, 6.71 s, 2.61 s, 2.29 s, and 2.60 s, respectively.
The 15% TTC, 50% TTC, and 85% TTC for Site 1 and Site 4 were smaller than the other sites. Comparing the crash risk thresholds with the 15% TTC, it was found that the severe crash risk thresholds obtained from the distribution density functions were smaller than the 15% TTC of the collected samples.
A small TTC indicates that the driver has little time to take measures to avoid a collision. Therefore, crashes were more likely to occur at Site 1 and Site 4, and Site 2 was safer than the other sites. The thresholds could also be used to trigger collision warnings.
4. Crash Risk Modeling
This paper proposes an ordinal logistic regression model and a naive Bayesian model with four variables: speed, speed S.D., traffic volume, and truck percentage.
TTC has been proven to be an effective indicator for rating the severity of crash risks. The crash risks were divided into three levels (high, medium, and low) according to the TTC thresholds in Table 4. The critical values for crash risk classification were determined from the derived PDFs. Combined with the results of previous research [40], the critical value for high risk was set to 2.7 s: when the TTC was between 0 and 2.7 s, the crash risk of off-ramps was high. The critical value for medium risk was 4.7 s: when the TTC was between 2.7 and 4.7 s, the crash risk of off-ramps was medium, and when the TTC was greater than 4.7 s, the crash risk was low.
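The classification rule can be written as a short function. Which level a TTC of exactly 2.7 s or 4.7 s falls into is not stated, so assigning each boundary value to the more severe level is an assumption of this sketch, as are the names used.

```python
HIGH_RISK_MAX_TTC = 2.7    # seconds, critical value for high risk
MEDIUM_RISK_MAX_TTC = 4.7  # seconds, critical value for medium risk


def risk_level(ttc_seconds):
    """Map a TTC value in seconds to 'high', 'medium', or 'low' crash risk."""
    if ttc_seconds < 0:
        raise ValueError("TTC must be non-negative")
    if ttc_seconds <= HIGH_RISK_MAX_TTC:       # boundary handling is assumed
        return "high"
    if ttc_seconds <= MEDIUM_RISK_MAX_TTC:
        return "medium"
    return "low"
```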
Since not all vehicles on the mainline needed to enter the ramp, 158 samples were selected from the 1552 trajectory samples to develop the models: 69 samples with a low crash risk, 44 with a medium risk, and 45 with a high risk. Two conditions were considered when screening samples. The first was that the vehicle was about to leave the mainline via the ramp, and the second was that a potential collision had occurred, as indicated by a small TTC.
To test the validity of the model, 120 samples were randomly selected as the training dataset, and all 158 samples were used as the validation dataset. The training dataset included 51 samples with a low crash risk, 27 with a medium risk, and 42 with a high risk. Before the models were established, the relationships between the crash risk at off-ramps and each of the four independent variables (speed, traffic volume, speed S.D., and truck percentage) were analyzed separately, as shown in Figure 12, Figure 13, Figure 14 and Figure 15, respectively.
It can be seen from Figure 13 that the impact of traffic volume on crash risk was relatively obvious. The lower crash risk corresponded to the smaller traffic volume, and the higher crash risk corresponded to the larger traffic volume. In addition, Figure 12, Figure 14, and Figure 15 illustrate that the impact of speed, speed S.D., and truck percentage on crash risk was not obvious.
4.1. The Ordinal Logistic Regression Model
In this study, the ordinal logistic regression model was used to describe the relationship between the independent variables and the ordinal response variable. It is based on cumulative probabilities. The ordinal logistic regression model assumes that the dependent variable Y can be divided into g ordered categories (Y = 1, 2, ..., g) and that X1, X2, …, Xm are the independent variables. The ordinal logistic regression model is given by Equation (7), where α_j is the constant term of the jth regression equation and β_m is the regression coefficient of independent variable Xm.
There are g − 1 equations for the different categories of Y. The concept of the ordinal logistic regression model is to assume that the independent variables have the same influence on the odds ratio of cumulative probability. The regression coefficients of each variable in all equations are the same, and the differences in the cumulative probability of the different categories are depicted by the constant terms.
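For reference, a common way of writing the cumulative-logit (proportional odds) model described above is sketched here. The sign convention and the symbols α_j and β_i are assumptions: some statistical software writes the linear term with a minus sign, and the authors' exact notation may differ.

```latex
\begin{align}
\ln \frac{P(Y \le j \mid X)}{1 - P(Y \le j \mid X)}
  &= \alpha_j + \sum_{i=1}^{m} \beta_i X_i, \qquad j = 1, \dots, g-1 \\
P(Y = j \mid X)
  &= P(Y \le j \mid X) - P(Y \le j-1 \mid X),
  \qquad P(Y \le 0 \mid X) = 0,\; P(Y \le g \mid X) = 1
\end{align}
```

The coefficients β_i carry no category index, which is exactly the shared-slope assumption stated in the text; only the constants α_j differ between the g − 1 equations.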
The probabilities of the categories are P1, P2, …, Pg, with P1 + P2 + … + Pg = 1. Hence, the probability that Y = j is given by Equation (8).
The value of Y means the crash risk level. Y = 1 indicates a low risk, and Y = 2 indicates a medium risk, while Y = 3 indicates a high risk.
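The step from cumulative probabilities to category probabilities can be illustrated with a short sketch. It assumes the cumulative-logit form logit P(Y ≤ j) = α_j + Σ β_i X_i with increasing constants α_j; the function name and the numbers in the example are hypothetical and are not the fitted coefficients.

```python
import math


def category_probabilities(alphas, betas, x):
    """Return [P(Y=1), ..., P(Y=g)] for one observation x.

    alphas: g - 1 increasing constant terms (one per cumulative equation).
    betas:  regression coefficients shared by all equations.
    Assumes logit P(Y <= j) = alpha_j + sum(beta_i * x_i).
    """
    eta = sum(b * xi for b, xi in zip(betas, x))
    # Cumulative probabilities P(Y <= j), with P(Y <= g) = 1 appended.
    cumulative = [1.0 / (1.0 + math.exp(-(a + eta))) for a in alphas] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)   # P(Y = j) = P(Y <= j) - P(Y <= j - 1)
        previous = c
    return probs


# Hypothetical three-level example (low, medium, high) with one predictor.
probs = category_probabilities([-1.0, 1.0], [0.0], [5.0])
```

The predicted risk level would then be the category with the largest probability.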
The proportional odds assumption for the independent variables of the ordinal logistic regression was validated by the test of parallel lines. The null hypothesis of this test states that the slope coefficients in the model are the same across response categories.
The test results are shown in Table 5. In this case, P = 0.361 (greater than 0.05), so the null hypothesis was not rejected and the regression equations could be regarded as parallel to each other. Hence, the analysis could be carried out by the ordinal logistic regression model.
The ordinal logistic regression model was used to relate the crash risk level to the explanatory variables, with significance assessed at the 0.05 level (95% confidence). The significance values of traffic volume and speed S.D. were 0.000 and 0.027, both less than 0.05, indicating that these two independent variables significantly affected the dependent variable. The significance values of speed and truck percentage were 0.062 and 0.502, both greater than 0.05, indicating that their effects were not statistically significant in this model. The ordinal logistic regression model results are shown in Table 6.
Two variables, traffic volume and speed S.D., were therefore retained in the estimation model. The results showed that the proposed model could evaluate the relationship between the crash risks of off-ramps and these independent variables well. Although the model ignored the impact of speed and truck percentage, which had a smaller impact on the crash risk than the other two variables, its accuracy was as high as 84.81%.
4.2. The Naive Bayesian Model
The naive Bayesian model is a statistical classification method. Based on the Bayesian theorem and the assumption of conditional independence among the attributes, it predicts the probability of each class from the observed variables. The naive Bayesian model is characterized by high accuracy and high speed. It assumes that the attributes do not interact with one another given the classification result, which makes the calculation easier. In the study of traffic behavior, the Bayesian theorem can be applied to judge the probability of traffic behavior based on traffic flow.
According to the Bayesian rule, the conditional probability P(A|B) represents the probability of event A occurring given that B has occurred. It is calculated as in Equation (9):

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (9)$$

where P(A) is the prior probability of A; P(B) is the prior probability of B; P(A|B) is the conditional probability of A given B, which is called the posterior probability of A; and P(B|A) is the conditional probability of B given A, which is also called the posterior probability of B.
In the Bayesian rule, the denominator can be regarded as a normalizing coefficient η = 1/P(B), which gives Equation (10):

$$P(A \mid B) = \eta\,P(B \mid A)\,P(A) \qquad (10)$$

If multiple observations are available from traffic flow detection, the posterior can be calculated by Equation (11).
Thus, the derivation formula of Bayesian filtering is given by Equation (12).
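As a hedged illustration of how several observations enter the calculation, the usual naive Bayesian posterior is written below. The symbols C_k (a risk class) and x_1, …, x_n (the observed attributes) are notation introduced here for illustration and may differ from the authors' Equations (11) and (12).

```latex
P(C_k \mid x_1, \dots, x_n)
  = \eta \, P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k),
\qquad
\eta = \left[ \sum_{k} P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k) \right]^{-1}
```

The product form follows from assuming that the attributes are conditionally independent given the class, and η only rescales the class scores so that they sum to one.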
The specific analysis of the naive Bayesian model is as follows. Each TTC sample had four attributes: speed, speed S.D., traffic volume, and truck percentage. The crash risk of the ramp was divided into three classes: low risk, medium risk, and high risk. The samples in each class were characterized by the four attributes, and the probability of each risk level was then obtained.
The naive Bayesian model was established with four variables: speed, speed S.D., traffic volume, and truck percentage. The model was trained on the 120 training samples and validated on all 158 samples. The naive Bayesian model classification results are shown in Table 7.
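A naive Bayesian classifier of this kind can be sketched in a few lines. The sketch assumes Gaussian class-conditional densities for the continuous attributes, which the text does not specify; the function names, the variance floor, and the toy numbers are likewise assumptions and not the study data.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels, var_floor=1e-9):
    """Estimate class priors and per-attribute means/variances.

    rows:   list of attribute vectors (e.g. [speed, speed_sd, volume, truck_pct]).
    labels: risk level of each row.
    Returns {label: (prior, means, variances)}.
    """
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        variances = [max(sum((v - m) ** 2 for v in col) / n, var_floor)
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model


def predict(model, row):
    """Return the class with the largest log-posterior for one sample."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            # Log of the Gaussian density, summed over independent attributes.
            score += -0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)
        return score
    return max(model, key=log_posterior)


# Toy data with two attributes (traffic volume, speed S.D.); values are made up.
toy_rows = [[100.0, 5.0], [110.0, 6.0], [270.0, 9.0], [280.0, 10.0]]
toy_labels = ["low", "low", "high", "high"]
toy_model = fit_gaussian_nb(toy_rows, toy_labels)
```

Working with log-probabilities avoids numerical underflow when several small densities are multiplied.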
It was found that the naive Bayesian model could evaluate the relationship between the crash risks of off-ramps and each independent variable well. The established model was tested on the validation dataset, and the prediction accuracy was 86.71%, which supports the effectiveness of the naive Bayesian model. Traffic volume has been one of the most common exposure variables in previous analyses, and there was a significant positive relationship between traffic volume and the crash risks of off-ramps. The naive Bayesian model showed that the relationship between crash risk and traffic volume was the most pronounced: the average traffic volumes corresponding to low, medium, and high crash risk were 108, 177, and 275, respectively.
5. Discussion
The ordinal logistic regression model and the naive Bayesian model were designed to explore the relationship between the crash risks of off-ramps and the explanatory variables including speed, speed S.D., traffic volume, and truck percentage.
When evaluating the fit of a model, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are commonly used. The formulas are shown in Equations (13) and (14). The AIC and the BIC are intended to guard against overfitting when training a model. Adding unknown parameters generally improves the fitting accuracy but makes the model more complex, which can lead to overfitting. Therefore, model evaluation should consider both the accuracy of fitting and the number of unknown parameters. In general, smaller AIC and BIC values indicate a better fit.

$$AIC = 2k - 2\ln L \qquad (13)$$

$$BIC = k\ln n - 2\ln L \qquad (14)$$

where k is the number of model parameters, n is the number of samples, and L is the maximized value of the likelihood function.
MLE was used to train the ordinal logistic regression model, for which the AIC was 119.81 and the BIC was 125.9352. The naive Bayesian model was not trained by MLE, so its AIC and BIC could not be calculated. The two models were therefore compared in terms of prediction accuracy.
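The two criteria can be computed as in the sketch below, which uses the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L. The reported values differ by 6.1252, which equals 2 × (ln 158 − 2), so they are consistent with k = 2 and n = 158 under these definitions. That reading, and the log-likelihood of about −57.905 it implies, is an inference made here and is not stated in the text.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_likelihood


# Inferred values (assumptions, see lead-in): k = 2, n = 158, ln L = -57.905.
inferred_log_likelihood = -57.905
aic_value = aic(inferred_log_likelihood, k=2)
bic_value = bic(inferred_log_likelihood, k=2, n=158)
```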
The ordinal logistic regression model considered speed S.D. and traffic volume, while the naive Bayesian model took four independent variables into account. The ordinal logistic regression model and the naive Bayesian model were tested with all 158 samples, and the predicted results are shown in Table 8 and Table 9.
It can be observed from Table 8 and Table 9 that the prediction accuracy of the two models was relatively high. The prediction accuracy of the ordinal logistic regression model was 84.81%, and the prediction accuracy of the naive Bayesian model was 86.71%. The naive Bayesian model outperformed the ordinal logistic regression model.
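Prediction accuracy of this kind is the share of samples on the diagonal of the confusion matrix. The sketch below shows the calculation; the function names and label strings are assumptions. As a consistency check, 84.81% and 86.71% of 158 samples correspond to 134 and 137 correct predictions, respectively, which is an inference from the reported percentages rather than a figure taken from the tables.

```python
LEVELS = ("low", "medium", "high")


def confusion_matrix(actual, predicted, levels=LEVELS):
    """Rows are actual risk levels, columns are predicted risk levels."""
    index = {level: i for i, level in enumerate(levels)}
    matrix = [[0] * len(levels) for _ in levels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total
```

Per-class accuracy, such as the lower rate for medium risk, can be read from each row as the diagonal entry divided by the row total.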
In addition, both models had relatively low prediction accuracy for medium crash risk, which may be due to the similar sample data of medium crash risk and low crash risk.
An advanced crash warning system for off-ramps was designed based on the crash risk probability to enhance safety. By integrating the model, the system can predict the crash risk level in real time from loop-detector traffic flow information, including traffic volume, truck percentage, speed, and speed S.D., and it sends warning signals according to the crash risk level. The framework of the system is shown in Figure 16.