1. Introduction
With the rapid development of automobiles and autonomous vehicle technologies, reducing traffic accidents is receiving increasing attention in many countries. In the future, human-driven vehicles may gradually be replaced by autonomous vehicles, but during this process, road traffic will be in a mixed state of autonomous and human-driven vehicles. Vehicle conflicts may arise due to the perception, communication, and speed differences of autonomous vehicles in mixed traffic flows in highway ramp areas. The traffic conditions in the diversion and merging areas, which serve as the connection points between the arterial roads and highways, are extremely complex and prone to traffic accidents. According to data provided by the World Health Organization, about 1.35 million people die each year in road traffic accidents worldwide, equivalent to about 3700 deaths every day [1]. Although the occurrence of traffic accidents is random, the temporal and spatial distribution of accidents on specific road segments shows distinct characteristics under different road conditions and traffic environments. The special sections of highways (such as merging and diversion areas) are black-spots for traffic accidents, especially for major accidents. Therefore, accident analysis of these areas has gradually become a research topic [2,3,4,5]. In recent years in particular, machine learning algorithms have been widely used to divide and analyze accident-prone areas [6,7,8].
The merging and diversion areas of highways are important highway nodes in the transportation system. They are also areas where traffic accidents frequently occur. The vehicles going straight in the merging and diversion areas of highways often travel fast, while the vehicles entering or exiting the highway need to slow down. The large speed difference can easily cause traffic accidents. Past studies have shown that merging and diversion behaviors are among the main causes of accidents. Compared with general road segments, the spatial competition ratio in merging and diversion areas is higher. Rear-end and side-collision accidents are more likely to occur in such areas [9,10]. In order to explore the causes of traffic accidents, researchers have proposed many traffic safety analysis methods to reduce or eliminate traffic accidents [11,12]. Identifying the black-spots of highways is an important method to understand and predict accidents. There are several statistical methods for identifying accident-prone areas. In order to explore the impact of risk factors on the collision frequency and severity of large trucks, Dong et al. [13] applied multinomial logit (MNL) and negative binomial (NB) models to analyze collision severity and frequency, respectively. The results showed that the driver’s age, speed limit, and location type only had significant effects on the frequency of large truck crashes. Pulugurtha et al. [14] used the geographic information system (GIS) method to study the spatial patterns of pedestrian collisions in order to identify the areas with high incidences of pedestrian collisions. The results indicated that there were significant differences in the ranking results when each method was considered. Dellinger et al. [15] explored the factors that influence fatal crashes among U.S. drivers over the age of 55. The results showed that the number of fatal crashes increased with age. Meanwhile, the relative contributions of collision incidence density and exposure prevalence were greater than collision mortality. To improve road safety, Pritee et al. [16] used geographic information systems for risk assessment and statistical analysis to identify high-frequency accident sites. The results showed that it was beneficial to divide the accident site by using the heat map analysis method of kernel density estimation. Jahan et al. [17] developed a hybrid method, based on accident type, for improving the rate quality control method to overcome its shortcomings. According to the case study and based on the results in the real environment, the proposed method could detect and identify the accident hotspots. Ray et al. [18] presented a new method to rank the severity of roadside hazards based on observable crash data. Unlike the earlier subjective severity index method, the new EFCCR method was based on observed crash data and used a systematic approach to calculate crash severities. Zou et al. [19] proposed methods based on the Ordered Weighted Averaging (OWA) operator and Uncertain Ordered Weighted Averaging (UOWA) operator to fuse different certain results and different interval results into one result, respectively. The research provided more reliable choices to describe different results obtained from different methods in accident reconstruction. For the purpose of studying and comparing the effects of the methodological diversity of road network segmentation on the performance of different black-spot identification (BSID) methods, Ghadi et al. [20] evaluated four commonly applied BSID methods (empirical Bayesian (EB), excess EB, accident frequency, and accident ratio) against four different segmentation methods (spatial clustering, constant length, constant traffic volume, and the standard Highway Safety Manual segmentation method). The results showed that the EB method had surpassed the other BSID methods for all segmentation approaches.
In past studies, researchers have evaluated traffic accidents in various cities and rural areas. In general, these studies can be divided into two categories. The first is (a) analyzing the factors that cause traffic accidents [21,22,23]. These studies tried to determine the interactions among environmental, human, road, and vehicle characteristics and traffic accidents. The second is (b) utilizing various geospatial analysis methods to map black-spots of traffic accidents. For example, the kernel density estimation (KDE) method was commonly used to illustrate the density of traffic accidents based on the number of accident points at each spatial location [24,25]. In addition, some studies used distance-based methods to determine the clustering of traffic accidents [26,27]. Some studies used time series analysis to explore the evolution of traffic accident clusters [28,29]. Time series analysis and spatial clustering analysis were combined to identify accident-prone locations; for example, Kingham et al. studied the temporal evolution and spatial clustering of traffic accidents in Christchurch, New Zealand [30]. Liu et al. [31] studied the temporal evolution of fatal accidents in Iowa (USA) and found that fatal car accidents in all Iowa counties declined from 2006 to 2015, but the rate of decline varied across counties.
In recent years, with the development of artificial intelligence and machine learning, some emerging algorithms have been used to identify accident-prone locations. For example, Meng et al. [32] established a self-organizing neural network model for identifying accident-prone locations. They proposed a process for identifying prominent accident-inducing factors based on the combination of discrete multivariate algorithms and probability distributions. Wang et al. [33] applied the DENCLUE clustering algorithm in identifying accident-prone locations, which can effectively avoid the pre-division of investigation locations and achieve arbitrary length clustering compared to traditional methods. Ifthikar et al. [34] identified accident-prone areas and related causes by clustering accident location coordinates. Qiu et al. [35] proposed an improved DBSCAN clustering algorithm to identify traffic-accident-prone areas by selecting reasonable values for the parameters ε and minPts. Yakar et al. [36] studied the application of the relative frequency method in determining accident-prone road sections. Zhao et al. [37] developed deep convolutional embedded clustering (DCEC) to classify traffic flow into nine states. The results of the logistic regression model proved that the nine traffic states were significantly associated with crash risk in the vicinities of weaving segments, and each traffic state could be assigned a unique safety level.
However, most of these methods were based on statistical analysis and focused only on non-spatial features and attributes of the data. Geographical spatial properties were not associated with accident occurrence. Traditional statistical analysis usually treated the occurrence of traffic accidents as a random and independent process, and the corresponding geographical spatial information of these accidents was often omitted. However, the spatial data of one accident are not always the same as those of other accidents. Even if two accidents happened at the same location, their traffic environment and timestamp will never be the same. Therefore, an unreasonable combination of these factors will lead to information loss and unreasonable clustering results for accident-prone areas.
To solve this problem, the spatial autocorrelation-based method is an ideal way to identify the geographical location relationship of various accidents and to reflect the spatial correlation of different accident attributes. This method is also suitable for the safety evaluation of highway black-spots. For example, Khanh et al. [38] proposed a method for determining the location of traffic accident black-spots by combining the kernel density estimation (KDE) algorithm and spatial autocorrelation analysis. Fan [39] used a spatial autoregressive quantile model to estimate how risk factors affect overall and fatal traffic accident rates. The results were expected to provide strategies for reducing accident rates and improving road safety. Khaled et al. [40] used spatial autocorrelation (Global Moran I Index) and local hotspot analysis (Getis-Ord Gi*) in a GIS environment to determine the spatial patterns and temporal evolution of accident black-spots along the internal and main road networks in the study area. Fan et al. [41] trained and optimized a complex model for identifying accident black-spots using the support vector machine method based on the structured association features of urban traffic accident big data, which improved the accuracy of black-spot identification. Tanprasert et al. [42] proposed a new technology that used street-view images to identify black-spots. This technology was based on the hypothesis that the features of the surrounding road environment had an impact on the safety level of specific locations. It was the first black-spot classification technology that was fully environment-aware. Vitianingsih et al. [43] presented a framework of spatial analysis using a hybrid estimation model based on a combination of multi-criteria decision making (MCDM) and artificial neural network (ANN) (MCDM-ANN) classification. This model is useful for traffic-accident-prone road classification with a spatial dataset.
Although the above studies have proposed various methods to identify black-spots or accident-prone areas, most results remained at the macroscopic level of road traffic. The spatial characteristics of complex traffic environments were not considered or described quantitatively in these studies. In high-speed road sections and special nodes, black-spots often occur at specific positions that are strongly correlated on the microscopic scale. Therefore, traditional accident statistics methods are not suitable for such road sections. To solve this problem, this paper uses a microscopic spatial autocorrelation method to delineate the accident-prone areas within highway diversion and merging areas.
The rest of this study is organized as follows. Section 2 introduces the proposed method. Section 3 presents the experiments and data. Section 4 analyzes and verifies the results. The analysis and discussion are presented in Section 5. The conclusion is presented in Section 6.
2. Methodology
From the spatial distribution perspective, traffic accidents are not evenly distributed. In some areas, they cluster in one location, while in other regions, this phenomenon may not be present. This phenomenon also occurs near highway ramps. Therefore, geographic information system (GIS) technology was used to analyze the spatial characteristics of traffic accidents.
First, the locations of traffic accidents near the highway ramp were geocoded on the digital road network. However, considering that merging and diversion areas on highways are relatively small micro scenes, and that road traffic accident data are often obtained from macro road networks in a country or province, the object of this study was specific scenes of special sections of highways. This required classifying accident data based on the characteristics of each special section, e.g., classifying scenes based on the number of lanes on the highway. After determining the category of special section scenes, combined with the road network map of each special section, a hotspot map of the location of traffic accidents was drawn in special sections.
Next, the distribution of accident points needed to be checked to confirm that it was clustered, which is a precondition for the subsequent cluster analysis. This required testing whether traffic accidents in a specific section of the highway were randomly distributed, using the average nearest neighbor method.
After confirming that the accidents were clustered, the kernel density estimation method was applied to calculate and draw a density map of traffic accidents. Finally, in order to evaluate the distribution pattern of traffic accidents, we used the multinomial logit regression model to find the causal factors of differences in frequent accident locations and determine the explanatory variables that cause changes in the location of accidents. The flowchart of this study is shown in Figure 1.
The following sections introduce the detailed methods used in this study, including kernel density estimation, the average nearest neighbor method, and multinomial logit regression analysis.
2.1. Kernel Density Estimation
The core idea of kernel density estimation (KDE) is that geographic phenomena and events can occur at any location in the spatial plane, but the probability of occurrence varies by location. Areas with dense points have a higher probability of event occurrence, while sparse areas have a lower probability. Therefore, KDE is particularly useful for analyzing and displaying point data. The geometric interpretation of the kernel density is that the density distribution is highest at the center of each point $x_i$ and decreases outward. It will reach 0 at a certain threshold range (the edge of the window) from the center, as shown in Figure 2. The kernel density at the grid center $x$ is the sum of the densities within the window range, as shown in Equation (1).
$$f(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \tag{1}$$
where $K(\cdot)$ represents the kernel function, $h$ is the bandwidth, $n$ is the number of points in the study area $R$, $d$ is the dimension of the data, and $(x - x_i)$ represents the distance from the estimation point $x$ to the event point $x_i$.
For example, when $d = 2$, a commonly used kernel density equation for two-dimensional plane space can be defined as shown in Equation (2):
$$f(x) = \frac{1}{n h^{2}} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \tag{2}$$
Many researchers have pointed out that bandwidth is the most critical criterion for determining the most appropriate density surface. Therefore, the choice of bandwidth will significantly affect the resulting hotspots. In other words, the smaller the bandwidth, the smaller the hotspots. The bandwidth also controls the smoothness of the density surface: the larger the bandwidth, the smoother the surface. Therefore, selecting the best bandwidth is crucial. According to the research of many scholars, the bandwidth is usually within the range of 20 to 1000 m.
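The following is a minimal sketch of a two-dimensional kernel density estimate in the form of Equation (1). It assumes a quartic (biweight) kernel, which is a common choice in GIS software but is not specified in the text; the function names `quartic_kernel` and `kernel_density` are hypothetical and are introduced only for illustration.

```python
import numpy as np

def quartic_kernel(u):
    """Quartic kernel for 2-D data (assumed kernel, not taken from the text).

    It equals (3/pi) * (1 - u^2)^2 inside the window (u < 1) and 0 outside,
    so it integrates to 1 over the unit disk.
    """
    u = np.asarray(u, dtype=float)
    return np.where(u < 1.0, (3.0 / np.pi) * (1.0 - u**2) ** 2, 0.0)

def kernel_density(grid_xy, events_xy, h):
    """Evaluate f(x) = 1/(n h^2) * sum_i K(|x - x_i| / h) at each grid point."""
    grid_xy = np.asarray(grid_xy, dtype=float)
    events_xy = np.asarray(events_xy, dtype=float)
    n = len(events_xy)
    # Pairwise distances, shape (number of grid points, number of events)
    dist = np.linalg.norm(grid_xy[:, None, :] - events_xy[None, :, :], axis=2)
    return quartic_kernel(dist / h).sum(axis=1) / (n * h**2)
```

Calling `kernel_density(grid, events, h=50.0)` with coordinates in metres returns one density value per grid point. Repeating the call with a larger `h` spreads each event over a wider window, which lowers the peaks and smooths the surface, in line with the bandwidth discussion above.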
2.2. Mean Nearest Neighbor Analysis
The Nearest Neighbor Index (NNI) is the ratio of the observed mean distance between each point and its nearest neighbor to the expected mean distance between points. If the observed distance is less than the expected distance, it indicates a clustered distribution of points. The greater the difference between the two, the stronger the clustering. If the observed distance is greater than the expected distance, it indicates a dispersed distribution of points. The formula for calculating the Nearest Neighbor Index is as shown in Equation (3).
$$NNI = \frac{\bar{d}_{obs}}{\bar{d}_{exp}} \tag{3}$$
where $\bar{d}_{obs} = \frac{1}{n}\sum_{i=1}^{n} d_i$, $\bar{d}_{exp} = \frac{0.5}{\sqrt{n/A}}$, $NNI$ is the nearest neighbor index, $\bar{d}_{obs}$ is the observed mean distance ($d_i$ being the distance from point $i$ to its nearest neighbor), $\bar{d}_{exp}$ is the expected mean distance, $n$ is the number of points in the study area, and $A$ is the area of the study area.
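As an illustration of the index described above, the sketch below computes the NNI for a small set of planar points. It assumes the standard expected mean distance under a random distribution, 0.5 / sqrt(n / A); the function name `nearest_neighbor_index` is hypothetical, and the brute-force distance matrix is only suitable for modest numbers of points.

```python
import numpy as np

def nearest_neighbor_index(points_xy, area):
    """Return observed mean nearest-neighbor distance / expected mean distance."""
    pts = np.asarray(points_xy, dtype=float)
    n = len(pts)
    # Full pairwise distance matrix; the diagonal is masked so that a point
    # is never treated as its own nearest neighbor.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    observed = dist.min(axis=1).mean()
    # Expected mean distance for n randomly placed points in the given area
    expected = 0.5 / np.sqrt(n / area)
    return observed / expected
```

A value below 1 suggests clustering and a value above 1 suggests dispersion. In practice the index is usually reported together with a significance test, which this sketch omits.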
2.3. Multinomial Logistic Regression
The multinomial logit model can be viewed as a joint estimation of multiple binary logit models formed by pairing each selection category of the dependent variable with a reference category. The model is specified as follows in Equation (4):
$$\ln\frac{P(y = j \mid \mathbf{x})}{P(y = b \mid \mathbf{x})} = \mathbf{x}^{\top}\boldsymbol{\beta}_j \tag{4}$$
where $b$ is the reference category, $j$ indexes the categories of the categorical dependent variable ($j = 1, \ldots, J$, with $J$ the total number of categories), $\mathbf{x}$ is the vector of explanatory variables, and $\boldsymbol{\beta}_j$ is the coefficient vector of the $j$-th category. When $j = b$, the left-hand side of Equation (4) is $\ln 1 = 0$, so $\boldsymbol{\beta}_b = 0$. This means that the log-odds of the reference category relative to itself is always 0, so all explanatory variable coefficients corresponding to the reference category are 0 as well.
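To make the role of the reference category concrete, the sketch below turns a set of coefficients into category probabilities under the log-odds form of Equation (4). It is an illustration only: the function name `mnl_probabilities` and the argument layout (one coefficient row per category) are assumptions, and the coefficients would normally come from a fitted model rather than being typed in by hand.

```python
import numpy as np

def mnl_probabilities(x, betas, reference=0):
    """Category probabilities of a multinomial logit model.

    x         : explanatory variables, shape (k,)
    betas     : one coefficient row per category, shape (J, k)
    reference : index of the reference category b, whose coefficients
                are forced to zero as required by the model
    """
    betas = np.array(betas, dtype=float)  # copy, so the caller's data is untouched
    betas[reference] = 0.0
    utilities = betas @ np.asarray(x, dtype=float)
    utilities -= utilities.max()          # numerical stability; ratios are unchanged
    exp_u = np.exp(utilities)
    return exp_u / exp_u.sum()
```

With these probabilities, `np.log(p[j] / p[reference])` recovers the linear predictor of category `j`, and the reference category always has log-odds 0 relative to itself, which is the identification constraint described above.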
5. Discussion
5.1. Analysis of Influencing Factors in Different Accident-Prone Areas in the Diversion Area
The model results show that for the diversion area, the explanatory variables “longitudinal distance from the ramp”, “humidity”, “visibility”, “lane 1”, and “lane 2” have relatively significant effects on the model. In Area 1, “longitudinal distance from the ramp”, “humidity”, and “lane 2” have significant effects on accidents in Area 1. Among them, the longitudinal distance from the ramp and humidity have positive effects. This indicates that compared with Area 4, especially in the upstream area, the closer the distance from the ramp, the greater the possibility of accidents occurring in Area 1. The greater the humidity, the more likely accidents occur in Area 1. Compared with Area 4, the probability of accidents not occurring in lane 2 in Area 1 is low, which means most accidents in Area 1 occur in lane 2. For Area 2, only the variable “longitudinal distance from the ramp” has a significant effect, which is similar to Area 1. In the upstream area, the closer the distance to the ramp, the greater the possibility of accidents occurring in Area 2. Area 3 has no significant influential variables.
5.2. Analysis of Influencing Factors of Different Accident-Prone Areas in the Merging Zone
In the merging zone, the variables “distance from the nose in the longitudinal direction”, “temperature”, “daytime”, “lane 1”, “lane 2”, and “clear weather” have relatively statistically significant effects on the model. From the results of the merging region model parameter estimation, it is found that in Area 1, variable 1 “longitudinal distance from exit”, “temperature”, “lane 2”, and weather all have significant effects on whether the accident occurs in Area 1. Unlike in the diversion area, the longitudinal distance from the nose has a negative correlation with accidents occurring in Area 1; that is, the farther the longitudinal distance, the less likely it is to occur in Area 1, which is consistent with the kernel density result graph, where Area 1 of the merging zone is located far from the ramp. The higher the temperature, the more likely accidents will occur in Area 1. Compared with Area 4, the probability of accidents occurring in lane 2 is higher in Area 1. When the weather is not good (not clear weather), accidents are more likely to occur in Area 1. In Area 2, the pattern of accidents is similar to that in Area 1, but more accidents occur at night. Area 3 also has roughly the same pattern: accidents are more likely to occur in the second lane when the weather is not good and the temperature is higher.
5.3. Comparative Analysis of Factors Affecting Accident Occurrence in the Merging and Diversion Areas
By comparing the accident occurrence patterns in the merging and diversion areas, the following patterns can be observed. In the diversion area, there is a significant positive correlation between the area distribution of accident occurrence locations and the longitudinal distance from the ramp. This means that the farther the accident is from the upstream, the lower the accident density in the accident area. On the other hand, there is no obvious longitudinal distribution pattern in the accident area distribution in the merging area. Humidity has a certain influence on the distribution of accidents in the diversion area. However, temperature and weather have a significant impact on the distribution of accidents in the merging area. Regardless of whether it is in the diversion or merging area, the accident rate in the second lane is the highest. Based on the analysis results, the following conclusions can be drawn.
a. The accident-prone areas in the diversion area are more concentrated than those in the merging area. The accident density level distribution is closer to the cross-section near the ramp. In contrast, there is often more than one accident-prone area in the merging area. This is because, in the high-speed diversion area (500 m before and after the ramp), some drivers who are not familiar with the road conditions will suddenly slow down near the ramp. If the vehicle behind them does not slow down in time, a rear-end accident is likely to occur.
b. In the merging area, accidents are more spread out than in the diversion zone. This is because vehicles are generally in the acceleration lane before entering the main road from the ramp and need to accelerate to a specific speed to enter the main road. The acceleration of vehicles in the acceleration lane varies widely, so the location where vehicles accelerate to the speed limit and then change lanes to enter the main road varies. Therefore, accidents in the merging zone appear to be more dispersed than accidents in the diversion zone.
c. In the diversion areas and merging areas, accidents occurring in the second lane account for the highest proportion among all lanes. This is because the second lane is the lane with the most vehicles, and the probability of accidents occurring is relatively high. Moreover, if vehicles in the overtaking lane (which typically travel fast) need to leave the highway when approaching the ramp, they need to change lanes, which can easily lead to side collisions and rear-end collisions with adjacent vehicles. Other lanes have relatively fewer vehicles near the ramp, so accidents are more likely to occur in the second lane.
d. The model shows that rain increases the probability of vehicle accidents, so necessary traffic control and traffic warning measures should be implemented in the merging and diversion areas on rainy days.
e. The model indicates that the accident rate at night is much higher than that during the day, which is also mentioned in the research of Chen et al. [44]. This is mainly because the driver’s visibility is poor at night, and there are almost no lighting facilities on the highway except in service areas and tunnels. Drivers can only judge the road conditions ahead based on the vehicle lights. If they are driving on a complex road section (such as the merging and diversion areas) or face a sudden situation, the driver cannot make a prediction in advance, which can easily lead to the occurrence of traffic accidents. This is consistent with the study results of Wei et al. [45].
6. Conclusions
The main contribution of this paper is the identification of accident-prone areas near highway ramps using spatial autocorrelation analysis at the microscopic scale. This approach differs from traditional methods that rely on a statistical analysis of accident data without considering the spatial relationships between accidents. The accident-prone areas were classified into four levels with specific spatial division results. The main differences in accident distribution and causes were presented according to the analysis and model results. The main conclusions are as follows.
Firstly, based on kernel density analysis of accident data in highway diversion and merging areas, this study found that the accident-prone locations were mainly located at the highway entrances and exits. This is consistent with normal expectations because these locations are usually areas with high traffic flow, fast speeds, and frequent lane-changing, which are the main factors of accidents. Secondly, when analyzing the time period and weather conditions of different accident-prone areas, it was found that accidents occurred more often in poor weather conditions and at night. This indicates that drivers need to pay attention to driving safety in such situations, especially near the highway ramp (within 100 m). In this study, spatial clustering analysis of accident-prone points was conducted. A significant spatial correlation was found among the high accident-prone points. These points were usually located in specific areas near the ramp, which need more management to reduce the accident rates. Finally, a multinomial logit model was used to analyze the causes of spatial differences in accident distribution. It was found that temperature, the accident lane, weather, and the accident time were important factors affecting the spatial distribution of traffic accidents. According to these findings, this study suggests that different preventive measures should be used for different types of traffic accidents. For example, the flexible control of speed limits and time headway (THW), together with warning signs and lane markings, should be strengthened in the highway diversion area to improve drivers’ safety awareness. In summary, the results of this study provide an important decision basis for traffic management departments to adopt a refined management strategy for the diversion and merging areas to reduce the accident rate.
In addition to the above results, this paper has the following shortcomings. First, this study only analyzed the accident location, time, and weather, and ignored the influence of different levels of accident areas on drivers’ subjective behaviors in depth. Second, this study only used the U.S. highway accident dataset, which to some extent limits the generalizability of the conclusions. Third, the proposed management measures have not been validated in realistic scenarios.
To address the above problems, a combination of field experiments and simulations should be realized in future research. This will help to comprehensively compare and analyze the differences between accidents near the ramp and accidents on normal sections. The kernel density algorithm can also be combined with artificial intelligence technology to analyze the implied relationships in traffic accidents to obtain more accurate accident causation analysis results. Moreover, different traffic rules can be considered as the causative factors of accidents to provide a scientific basis for improving the management of ramps.