Article

Characterization of Traffic Accidents Based on Long-Horizon Aggregated and Disaggregated Data

1 The Center of Road Traffic Safety, Naif Arab University for Security Sciences, Riyadh 11452, Saudi Arabia
2 Civil Engineering Department, Faculty of Engineering, Beni-Suef University, Mandated to Al Minia High Institute of Engineering and Technology, El Minia 14812, Egypt
3 Civil Engineering Department, Faculty of Engineering, Aswan University, Aswan 81542, Egypt
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(2), 1483; https://doi.org/10.3390/su15021483
Submission received: 13 December 2022 / Revised: 9 January 2023 / Accepted: 10 January 2023 / Published: 12 January 2023
(This article belongs to the Section Sustainable Transportation)

Abstract

For sustainable transportation systems, modeling road traffic accidents is essential in order to formulate measures to reduce their harmful impacts on society. This study investigated the outcomes of using different datasets in traffic accident models with a low number of variables that can be easily manipulated by practitioners. Long-horizon aggregated and disaggregated road traffic accident datasets on Egyptian roads (for five years) were used to compare the model’s fit for different data groups. This study analyzed the results of k-means data clustering and classified the data into groups to compare the fit of the base model (Smeed’s model and different types of regression models). The results emphasized that the aggregated data used had less efficiency compared with the disaggregated data. It was found that the classification of the disaggregated dataset into reasonable groups improved the model’s fit. These findings may help in the better utilization of the available road traffic accident data for determining the best-fitting model that can assist decision-makers to choose suitable road traffic accident prevention measures.

1. Introduction

Road traffic accidents cause great losses of life, damage to properties, and notable psychological effects on the victims and their families. Annually, traffic accidents result in more than 50 million injuries and 1.35 million deaths worldwide [1]. Decision-makers need accurate information about the relationships between traffic accidents and the contributing factors. Developing accident-prediction models can help in predicting the causes of accidents effectively and allows transportation authorities to formulate accurate road safety measures to improve quality of life by ensuring sustainable transportation systems. Thus, many studies all over the world have aimed to evaluate the causes of traffic accidents in order to reduce their harmful impacts. Traffic accident modeling has been extensively studied in the literature using different techniques during the last few decades [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17]. Although different methodologies have been used in accident modeling research, there are still numerous issues that need to be investigated, according to the recommendations of the extensive review performed by [18,19]. These include issues related to the characteristics of traffic accident data, parsimonious versus fully specified models, unobserved heterogeneity, spatial and temporal correlations, risk compensation, the choice of the methodological approach, and the under-reporting of traffic accidents with less severe injuries.
This study aimed to use some existing accident models to compare their performance based on the characteristics of aggregated and disaggregated datasets collected on Egyptian roads during the periods between 2015 and 2019 and between 1999 and 2003, respectively. The methodology was based on classifying different data characteristics (using k-means clustering) and the effect of that classification on model fitting. Because of the limitations of data availability and/or the need to specify models with a few simplistic explanatory variables, some parsimonious models (preferred by practitioners) were used for this comparison. As the death rate is a benchmark to measure the road safety conditions of a country, Smeed’s law proved the existence of a relationship between the death rate, the number of vehicles, and the population by using data from 1938 gathered from 20 different countries [20]. Many other updates to Smeed’s law found that the increase in car ownership led to a decrease in the number of fatalities per vehicle [21,22]. Moreover, this study compared the fit of modeling whole data with different clusters of data based on Smeed’s model and different regression model forms to evaluate the effectiveness of data clustering. The models tried to estimate the death rate caused by traffic accidents in Egypt by utilizing five years of historical data. The main objective was to compare the fit of the different data cluster models with estimating the models’ parameters. The specific objectives of this research were as follows:
  • To compare the prediction accuracy of an existing model based on long-horizon accident datasets;
  • To validate the models based on the available data, which included clustering the accident data into reasonable groups and testing the model’s fits for the different data groups.
The next section provides a literature review of the existing relevant research. Section 3 outlines the research methodology and Section 4 presents the characteristics and processing of the accident data utilized. The modeling of the clustered data, results and main conclusions are presented in Section 5 and Section 6, respectively.

2. Literature Review

There are different traffic accident prediction models in use; however, because of the limitations of data availability or the need to specify models with a few simplistic explanatory variables, parsimonious models are often used. Many human factors, such as the driver’s age, gender, and other socioeconomic characteristics, are considered, but the drivers’ perceptions and reaction time, stress, fatigue, and/or emotional condition (at the time of the accident) are difficult to record. In addition, some environmental conditions in the traffic accident’s location may not be recorded. There are several published studies covering the effect of unrecorded related factors on road traffic accidents [23,24,25,26]. They used latent-class models that addressed the effect of unrecorded factors by classifying the data into subgroups with homogeneous characteristics. However, because of the assumed homogeneous characteristics of each subgroup, these models did not consider variations within the same subgroup [18]. Other researchers, such as [27,28,29], used random-parameter models that allowed for differences in the parameters among observations to deal with the effect of unrecorded factors. These models can be more complex and may include a large number of variables, which may be cost-ineffective in terms of the computation time required compared with using models with a few variables. Factors that may influence the frequency and/or severity of road traffic accidents are likely to be correlated in space and time, so ignoring the spatial and temporal correlations of the data will certainly result in inefficient and possibly inconsistent parameter estimates. Some studies have addressed this issue by using multivariate models, where multiple dependent variables are interrelated with each other. These methods, however, are either too restrictive, relatively cumbersome and time-consuming, and/or are literally infeasible in cases of high dimensionality. 
This section briefly highlights the relevant literature; a comprehensive review of the studies that used various methodological approaches in the field of traffic accident research can be found in [18,19].
Riccardi et al. [2] modeled traffic accidents in Great Britain using many parametric and non-parametric models and found that the parametric models established a relationship between the dependent and independent variables with a clear interpretation of the outputs, whereas the non-parametric models required more explanatory variables with a high probability of dependency among them. Similarly, [3] modeled fatal pedestrian accidents in Italy by applying the mixed logit model, machine learning, and association rules. The F-measure and the G-mean measures were utilized to compare the performances of the models in both approaches.
Clustering analysis is a mathematical statistical method that can be applied to large datasets, where the raw data are sorted and grouped into clusters [30,31,32,33,34,35,36,37,38]. These clusters are internally homogeneous, although they are different from each other. The components of a single cluster are similar to each other, while the components of different clusters are less homogeneous [39]. Clustering analysis is similar to multidimensional scaling for investigating the similarity between factors by examining the full range of interrelated relationships [38]. The difference between the two methodologies is that multidimensional scaling identifies the key dimensions, while clustering analysis identifies the groups. Clustering analysis is considered to be the opposite of factor analysis [33]. Factor analysis reduces the number of variables by grouping them into smaller groups of factors, but clustering analysis reduces the number of observations or cases by grouping them into smaller groups [37]. Hence, clustering analysis aims to make the variance of the elements within each group as small as possible and to make the variance between groups and their centers as large as possible [40]. Non-hierarchical clustering analysis is suitable for large amounts of data that are compatible with rich disaggregated observations [36]. The number of clusters (k) can be determined specifically or by the node or clustering method. Non-hierarchical clustering analysis relies on three steps: (1) creating a preliminary distribution of the existing observations within a specific number of initial groups; (2) the created initial groups are considered to constitute primary clusters; and (3) the primary clusters are re-divided to form smaller and smaller clusters up to the final stage [35].
Nicholson [7] investigated many measures of accident clustering and recommended using simple methods of clustering accident data. The choice of plan type (site, route, or area) was suggested to be dependent upon the spatial distribution of accidents. Choosing a site plan when accidents are highly dispersed or an area plan when accidents are highly clustered at certain points will probably result in a poor economic return. In the same regard, Shaikh and Nicholson [10] studied accident clustering in New Zealand and found that accidents are much more dispersed in New Zealand compared with other countries, and suggested that less emphasis should be placed on site plans and more emphasis should be placed on route and area plans in New Zealand. Nicholson [8] discussed the evaluation of the indices of accident clustering and their interpretation to provide better randomness in the descriptions of how accidents occur. He evaluated the truncated negative binomial distribution and suggested a new form of truncated negative binomial distribution. Sabel et al. [5] used kernel estimation clustering analysis to automatically identify road traffic accident “black spots” and “black areas” in Christchurch, New Zealand, using GIS and Python software. They found that kernel estimation was able to quickly identify the accident clusters and, when used in conjunction with Monte Carlo simulation techniques, to identify statistically significant clusters.
Assi et al. [4] developed machine learning (ML) models to predict the severity of crash injuries in Great Britain and divided the ML models into different clusters using the fuzzy c-means method. They developed four ML models: feed-forward neural networks (FNN), a support vector machine (SVM), a fuzzy C-means clustering-based feed-forward neural network (FNN-FCM), and a fuzzy C-means-based support vector machine (SVM-FCM). They found that the FNN combined with FCM provided a slight improvement compared with the FNN without clustering, while the SVM-FCM model had higher accuracy when compared with the SVM. They concluded that the FCM clustering algorithm enhanced the prediction power of the FNN and the SVM models. Depaire et al. [9] investigated the effectiveness of using latent class clustering and compared the results of these cluster-based analyses with the results of full-data analysis, and found that clustering revealed important relationships in the variation in a variable’s effect between different traffic accident types on the probability of injury as an outcome. For example, they found that the full-data model hid the probability of the first road user being slightly injured in a traffic accident, while the cluster-based models revealed a more complete interpretation.
Smeed’s law [20] measures the road traffic safety conditions of a country as the relationship between the death rate, the number of vehicles, and the population. Smeed [20] found that an increase in the car ownership rate (the ratio between the size of the vehicle fleet and the population) caused a decrease in the rate of death caused by traffic accidents (the ratio between the number of deaths and the size of the vehicle fleet), with α = 0.0003 and β = −2/3. Many studies have updated Smeed’s law using a range of sociodemographic, economic, environmental, and policy-related variables in order to better estimate the road safety outcomes of a country [12,13]. These studies found that Smeed’s formula describes the change in fatalities reasonably well up to the 0.2–0.3 vehicles/person motorization rate, while above this level, the formula seems to overestimate the fatality rate.
Kopits and Cropper [12] showed that, in developing countries, the rate of the growth in vehicle ownership has increased more rapidly than the reduction in the fatality rate, while in industrialized countries, the motorization rate has tended to increase at a slower rate than the rate of the reduction in the number of fatalities per vehicle. However, refs. [32,33] seriously criticized Smeed’s model, as data from only one year were utilized in the model’s development. They also pointed out that Smeed’s model could not be used for all countries because each country has distinct traffic, economic, and social parameters, and that the model’s coefficient and exponent should, therefore, be country dependent. However, in this article, because of our emphasis on comparing different data clusters, rather than focusing on the model’s coefficients, Smeed’s law was used, along with other different forms of regression models.

3. Methodology

In this study, the use of a model that can be implemented in real-world practice was evaluated (with a small number of variables, but with representative factors that cause road traffic accidents). The big data collected could help in developing relatively simplistic models using only the explanatory variables of road traffic accidents needed for practical use in the field of road safety [34,35]. The use of existing models was studied in order to produce a model that represents actual field conditions as closely as possible and with less variability. The proposed models were calibrated and validated based on the collections of long-horizon data after they were clustered using normal clustering methods. Clustering helped in classifying the data based on the characteristics of the most likely causal factors. Accordingly, the prediction model was developed based on the other remaining factors (i.e., the factors that were not considered during clustering). It was found that these steps played a significant role in using simple models with few explanatory variables.
The k-means algorithm was used as the conceptual approach of non-hierarchical clustering to classify the different parameters of the disaggregated datasets. The objective of this procedure was to classify the sample of data within (k) clusters. As a result, the sum of the squares within the clusters was as small as possible. The k-means algorithm involves the following steps.
To determine the number of required (k) clusters with the random or intentional initial splitting of the observations into groups, the elements of each group were sorted separately. These primary groups were known as the initial clusters. In order to estimate the probability of cluster membership based on one or more probability distributions, the log-likelihood method was used to measure the distances between each item (x) and the center of its cluster, and the distances between the items. It could also calculate the distance between the centers of the final clusters. The log-likelihood method assumes that continuous variables are normally distributed and that categorical variables are distributed according to multinomial distributions. Accordingly, the overall probability or likelihood of the data can be maximized.
There were many benefits to using the various clustering methods in this study. For instance, the k-means algorithm produces tighter clusters than hierarchical clustering. Hence, applying k-means data clustering is a crucial way to obtain the optimal number of clusters from the model itself, and human intervention is not required. Although the initial seeds have a significant impact on the final results, they ease the classification probabilities of the sample’s contributory factor memberships in a clear visualization. On the other hand, the outputs of hierarchical clustering are more informative than the unstructured set of flat clusters returned by k-means. Each approach has its own disadvantages when calculating the similarity between clusters. Hierarchical clustering analysis may not be suitable for large datasets because of the high temporal and spatial complexity. This research used various values for k-means as a simple and fast way to specify the proper number of clusters; then, the modeling was conducted accordingly.
Although many studies (e.g., [4,9]) have revealed that traffic accident data should be clustered on the basis of vehicle type, the clustering analysis conducted in this study emphasized that other contributory factors have a significant impact on the clustering of the data. Moreover, the effectiveness of clustering was investigated by [9], and the results were compared with the results of full-data analysis, which indicated that clustering could reveal important relationships for the variation in a variable’s effect on different traffic accident types. The cluster-based models revealed a more complete interpretation, while the full-data model was found to hide the probability of some causative factors. The methodology followed in this article aimed to identify clusters that can be transferred to other datasets, rather than focus on finding typical groupings in the data. Accordingly, the results of clustering were validated to provide a degree of confidence, and the model’s performance was evaluated.
Nicholson [6] showed that the efficiency of modeling can be improved by increasing the observation period, and concluded that a five-year period is generally optimal from the viewpoint of statistical reliability. Accordingly, this article described an evaluation of the indices of accident clustering in order to estimate the death rate resulting from traffic accidents in Egypt by utilizing five years of historical data between 1999 and 2003.
First, the centers of the initial clusters were determined, and then the distances between each item (x) and the centers of the clusters were calculated. Each item was then assigned to the cluster whose center was nearest. These steps were repeated for all n items. If the resulting k clusters did not meet the degree of accuracy required to stop clustering, another number of clusters could be specified and all of the previous steps repeated; the results of the two sets were then compared and evaluated, and the best set was chosen [23,24,25,26]. The final distribution of the clusters’ centers and the distances between the items and the centers of the final clusters were used to interpret the clusters’ details. The values of the data were standardized to zero mean and unit variance to prevent the clustering from being dominated by features with a bigger scale. We used four values of k (k = 2, k = 3, k = 4, and k = 5), which specified the number of clusters, to carry out the clustering models. If a small cluster appeared that was hard to profile by means of the cluster-dependent distributions, this indicated a group of outliers [9]. Ten contributory parameters were included as the principal components, namely road shape, road type, surface conditions, weather conditions, traffic volume, accident time, number of reported crashed vehicles, reported cause of the accident, number of deaths, and number of injuries. After performing a descriptive analysis of the aggregated and disaggregated datasets, clustering and analysis of variance (ANOVA) were conducted simultaneously to identify the variance among the groups and their centers. These tests determined the significance of the impacts of the different variables on a cluster.
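The assign-and-update loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses Euclidean distance rather than the log-likelihood measure mentioned earlier, the function names (`standardize`, `kmeans`, `within_ss`) are hypothetical, and the six records are made-up values, not the GARBLT data. The study itself reports using the SPSS package.

```python
import math
import random

def standardize(rows):
    """Scale every column to zero mean and unit variance (population SD)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def kmeans(rows, k, iters=100, seed=0):
    """Assign each item to its nearest center, then recompute the centers."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)              # random initial seeds
    labels = [0] * len(rows)
    for step in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(r, centers[j]))
                      for r in rows]
        if step > 0 and new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:                        # keep the old center if a cluster empties
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers

def within_ss(rows, labels, centers):
    """Sum of squared distances between items and their cluster centers."""
    return sum(math.dist(r, centers[lab]) ** 2 for r, lab in zip(rows, labels))

# Made-up records with two features on very different scales.
raw = [[1.0, 200], [1.2, 220], [0.9, 210], [5.0, 900], [5.2, 950], [4.8, 920]]
z = standardize(raw)
labels, centers = kmeans(z, k=2)
print(labels, round(within_ss(z, labels, centers), 4))
```

Running the same function for k = 2, 3, 4, and 5 and comparing the within-cluster sums of squares and cluster sizes mirrors the comparison of clustering models described in the text.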
Next, the clustered datasets were compared by using some base models. The models’ performances were evaluated in terms of the errors between the estimations and the observations for the base models described below.
Smeed’s law [20] measures the road traffic safety conditions of a country as the relationship between the death rate, the number of vehicles, and the population, as shown in Equation (1).
$$\frac{D}{N} = \alpha \left(\frac{N}{P}\right)^{\beta} \tag{1}$$
where D is the number of annual deaths caused by road traffic accidents, N is the number of registered vehicles, P is the population, and α and β are the model parameters used for the estimation.
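As a quick numerical illustration of Equation (1), the sketch below evaluates the death rate per vehicle using the parameter values attributed to Smeed in Section 2 (α = 0.0003, β = −2/3). The function name `smeed_death_rate` and the vehicle and population figures are hypothetical and chosen only to give a motorization rate of 0.1 vehicles per person.

```python
def smeed_death_rate(vehicles, population, alpha=0.0003, beta=-2.0 / 3.0):
    """Deaths per registered vehicle, D/N = alpha * (N/P)**beta."""
    if vehicles <= 0 or population <= 0:
        raise ValueError("vehicles and population must be positive")
    return alpha * (vehicles / population) ** beta

# Hypothetical country: 5 million vehicles, 50 million people (N/P = 0.1).
rate = smeed_death_rate(5_000_000, 50_000_000)
implied_deaths = rate * 5_000_000
print(f"deaths per vehicle: {rate:.6f}, implied annual deaths: {implied_deaths:.0f}")
```

With a negative exponent, raising the motorization rate N/P lowers the deaths per vehicle, which is the qualitative behavior Smeed reported.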
In this study, Smeed’s law [20] was applied to the whole dataset to relate the death rate resulting from road traffic accidents in Egypt to car ownership by utilizing historical data from 1999 to 2003. The relationship between $D/N$ (the actual death rate based on the historical data) and $\alpha (N/P)^{\beta}$ (the estimated death rate calculated from the historical data as a function of the coefficients of Smeed’s model) was formulated as an optimization problem, as shown in Equation (2). The target of this optimization was to minimize the sum of the squared differences between the actual ($D_i/N_i$) and the estimated ($\alpha (N_i/P_i)^{\beta}$) death rates for year $i$ among the $n$ study years, as follows:
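$$\min \sum_{i=1}^{n} \left[\frac{D_i}{N_i} - \alpha \left(\frac{N_i}{P_i}\right)^{\beta}\right]^{2} \tag{2}$$
$$\text{subject to} \quad 0.0001 \le \alpha \le 0.0009, \qquad 0.01 \le \beta \le 0.99$$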
The constraints’ values were assumed to have a wider range, as estimated by [20]. The values of α and β were not very sensitive in fitting the actual death rate to the estimated one.
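One simple way to solve a box-constrained least-squares problem of this form is to scan β over its allowed range and, for each β, use the closed-form least-squares α clipped to its bounds. The sketch below is an assumption about how such a fit could be done, not the authors' solver. The name `fit_smeed` is hypothetical, the data are synthetic, and the negative β range passed in the example is an assumption chosen to be consistent with the β = −2/3 reported for Smeed's original fit; callers pass whatever bounds they adopt.

```python
def fit_smeed(car_ownership, death_rate, alpha_bounds, beta_bounds, steps=980):
    """Minimize sum((y - alpha * x**beta)**2) over a box by scanning beta."""
    a_lo, a_hi = alpha_bounds
    b_lo, b_hi = beta_bounds
    best = None
    for k in range(steps + 1):
        beta = b_lo + (b_hi - b_lo) * k / steps
        basis = [x ** beta for x in car_ownership]
        # For a fixed beta the objective is quadratic in alpha, so the
        # least-squares alpha has a closed form; clip it to the allowed range.
        alpha = (sum(y * b for y, b in zip(death_rate, basis))
                 / sum(b * b for b in basis))
        alpha = min(max(alpha, a_lo), a_hi)
        sse = sum((y - alpha * b) ** 2 for y, b in zip(death_rate, basis))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta)
    return best[1], best[2], best[0]

# Synthetic five-year series (not the Egyptian data), generated from
# alpha = 0.0003 and beta = -2/3 so that the fit can be checked.
ownership = [0.030, 0.032, 0.035, 0.037, 0.040]
rates = [0.0003 * x ** (-2.0 / 3.0) for x in ownership]
alpha_hat, beta_hat, sse = fit_smeed(
    ownership, rates, alpha_bounds=(0.0001, 0.0009), beta_bounds=(-0.99, -0.01)
)
print(f"alpha = {alpha_hat:.6f}, beta = {beta_hat:.3f}, SSE = {sse:.3e}")
```

The scan resolution (`steps`) limits how precisely β is recovered; a finer grid or a local refinement around the best β would tighten the estimate.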
Moreover, different forms of regression models (linear, exponential, logarithmic, polynomial, and power regression models) were applied to relate the death rate as a dependent variable to the car ownership rate as an independent variable. The models’ performances were statistically measured by the determination coefficient ($R^2$) to reflect the models’ fits when using the whole dataset and different clusters of data.
The model estimations using all the data and the clustered data were compared on the basis of the $R^2$ values of the relationship between the actual versus the estimated death rates (estimated by Smeed’s law) and the car ownership rate versus the death rate for different regression models.
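The comparison of fits on all the data versus each cluster can be sketched as follows for the linear regression case. The helper names (`linear_fit`, `r_squared`, `r2_by_group`) are hypothetical, and the eight points are synthetic values built so that two groups follow different trends; they are not results from this study.

```python
def linear_fit(x, y):
    """Ordinary least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def r2_by_group(x, y, labels):
    """R^2 of a linear fit on all data and on each cluster separately."""
    groups = {"all": list(range(len(x)))}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    out = {}
    for name, idx in groups.items():
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        a, b = linear_fit(xs, ys)
        out[name] = r_squared(ys, [a + b * v for v in xs])
    return out

# Synthetic example: two clusters with opposite linear trends.
x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
y = [10.0, 8.0, 6.0, 4.0, 1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = r2_by_group(x, y, labels)
print(scores)
```

In this constructed case the pooled fit is poor while each cluster is fitted well, which is the kind of effect the cluster-wise comparison is designed to detect.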

4. Data Characteristics and Processing

The present work utilized road traffic accident data that were obtained from the recorded data of the General Authority for Roads, Bridges, and Land Transport (GARBLT) of the Egyptian government [41]. The disaggregated accident data records included the accidents’ dates and times (hour and minute) and locations. The road’s geometry and traffic conditions (the road’s width, length, vertical grade, curvature, and annual average daily traffic) at the accident’s location and the surface conditions (paved or unpaved, wet or dry, etc.) were also major aspects that were accounted for during the analysis. In addition, the weather conditions at the time of the accident, the type of accident (single-vehicle, front to front, front to back, etc.), the type of vehicle, and other information were also used. The sample size was 10,857 observations during the period between 1999 and 2003. Moreover, we obtained aggregated accident data for five years on urban roads and a rural highway (the desert road) in Egypt. The aggregated accident data records included the percentage of accidents on different roads in Egypt during the period between 1 January 2015 and 31 December 2019. This provided the percentage of road traffic accidents during different time periods (day or night); the percentage of road traffic accidents caused by different vehicle types (private car, truck, taxi, or other); and the percentage of road traffic accidents caused by human, environmental, vehicular, or unknown factors.
Comprehensive data processing was conducted for the disaggregated data to prepare them for statistical analysis. The data processing stage also involved validating the data and checking the descriptions of all of the observations. The processing phase included standardizing the values to ensure an equitable statistical comparison between variables, regardless of their type. Standardization expresses each data point relative to the distribution of all observations of its variable, placing the different variables on the same scale as standard scores. The standard scores for each observed value of the variables were estimated on the basis of the mean and standard deviation of all observations of a certain parameter, as shown in Equation (3). Accordingly, standard scores were computed for the 10 parameters: Z-score (shape), Z-score (type), Z-score (time), Z-score (weather_ID), Z-score (surface_ID), Z-score (traffic_notes), Z-score (death), and Z-score (volume), which refer to the road’s shape, the road type, the time of the accident, the weather conditions, the surface conditions, the reported cause of the accident, the total number of deaths, and the traffic volume, respectively, together with Z-score (Crahed_Cars_Count), which refers to the number of reported crashed vehicles, and Z-score (hurt), which refers to the total number of injuries.
$$z_i = \frac{X_i - \mu}{\sigma} \tag{3}$$
where $z_i$ is the standardized value of the $i$th traffic accident observation $X_i$, and $\mu$ and $\sigma$ are the mean and standard deviation, respectively, of all $X_i$ observations.
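A minimal sketch of Equation (3) for a single variable is given below. The function name `z_scores` and the sample values are hypothetical. Whether the population or the sample standard deviation was used in the study is not stated, so the sketch exposes both through a flag and defaults to the population form as an assumption.

```python
import statistics

def z_scores(values, sample=False):
    """Standardize one variable: (X_i - mean) / standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values) if sample else statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(v - mu) / sigma for v in values]

# Hypothetical number of deaths in five accident records.
deaths = [0, 1, 0, 3, 1]
z = z_scores(deaths)
print([round(v, 4) for v in z])
```

After standardization the variable has zero mean and unit standard deviation, so variables measured in different units contribute on the same scale to the distance calculations used in clustering.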
To achieve an in-depth understanding of the variables involved in the study, a descriptive analysis was conducted. The sample with 10,857 observations of disaggregated traffic accident data was analyzed in order to manage the data and present them accurately before executing the clustering analysis. The analysis summarized the statistics for the different scale variables and measures of the data. The SPSS package was used to calculate the descriptive statistics and to test the potential significance and importance of the nominated group variables. The frequency, validity, and cumulative percentage were obtained for each nominated group. The outputs were described and summarized as shown in Figure 1 and Figure 2 (based on the analysis by [41]). Figure 1 shows that human factors affected 70% of all traffic accidents, while vehicular factors were involved in 30%. The majority of accidents occurred because of over-speeding in clear weather conditions and on dry roads. Accidents were distributed across all roads, with notably high rates on roads in Bani Sweif, Canal, and Sinai, while roads in the South Valley and the Red Sea had lower accident rates. Figure 1 shows some conditions with very low occurrence, such as heavy rains, sandstorms, and dusty conditions. Because of the low probability of such conditions occurring, they were screened during data processing and the clustering process, and were consequently excluded from the accident model, as illustrated in the next sections. The analysis of the aggregated data showed a similar pattern to that in Figure 2. This also shows that the majority of the accidents occurred in the daytime because of human factors. Private cars were involved in 59% of the accidents. The Aswan–Cairo and Cairo–Alexandria rural roads had the highest accident rates.
The tables and figures below illustrate the results of the clustering analysis for the four clustering models with k = 2, 3, 4, and 5. The results show the proportion of road traffic accidents assigned to each cluster in the four clustering models. As shown in Table 1, when k = 2, 95.88% of the total number of accidents were assigned to the first cluster and only 4.12% to the second cluster. When k = 3, only 2.53% of all accidents were assigned to the first cluster, 89.16% to the second cluster, and 8.30% to the third cluster. When k = 4, 89.47% of all accidents were assigned to the first cluster, while 1.02%, 1.05%, and 8.45% were assigned to the second, third, and fourth clusters, respectively. Finally, when k = 5, 1.02% of all accidents were assigned to the first and second clusters, 7.53% to the third cluster, and 8.1% and 82.22% to the fourth and fifth clusters, respectively.
The distance between the final clusters’ centers was found to be 4.4 for k = 2. When k = 3, the distance was 5.45 between the first and second clusters’ centers, 6.47 between the first and third, and 4.32 between the second and third. For k = 4, the distance was 5.22 between the first and second clusters’ centers, 7.86 between the first and third, 2.65 between the first and fourth, 9.46 between the second and third, 5.99 between the second and fourth, and 6.665 between the third and fourth. Finally, for k = 5, the distances from the first cluster’s center were 9.5, 6.02, 6.25, and 5.23 to the centers of the second, third, fourth, and fifth clusters, respectively, and the distances from the second cluster’s center were 6.68, 8.84, and 7.9 to the third, fourth, and fifth clusters’ centers, respectively.
The values of different variables in the clusters’ centers for the different k-means scenarios are presented in Figure 3a–d to compare the different clustering patterns with k = 2, k = 3, k = 4, and k = 5, respectively. Figure 3a shows that the two clusters had different contents, which indicated that each cluster had a homogeneous combination of data on different factors affecting traffic accidents, and both clusters were different from each other. This means that the data could be reasonably divided into two different groups. This homogeneity within a single group of data and the variation between the groups may improve the model’s fit when modeling separate groups over the fit of the model with combined data. To test whether splitting the data into more groups was meaningful or not, we used k = 3, k = 4, and k = 5, as shown in Figure 3b–d. Figure 3b,c shows that splitting the data produced three and four internally homogeneous groups that were different from each other, while Figure 3d shows that, in the case of k = 5, the second and the third clusters had many variables with similar characteristics. A visual comparison of Figure 3b–d shows that k = 3 and k = 4 were suitable. The ANOVA test presented in Table 2 shows that the number of cars involved in the traffic accidents (Crahed_Cars_Count) had low statistical significance for k = 2, 3, 4, and 5 (p-values of 0.940, 0.748, 0.473, and 0.897, respectively), while all other variables were statistically significant in the case of k = 4. The aggregated data showed the same result: clustering them into four groups was preferable. Although all factors were statistically significant (as shown in Table 3), Figure 4d shows that clusters 2 and 5 (in the case of k = 5) are not identical and that only six cases belonged to the second cluster, as shown in Table 4. Therefore, clustering the data into four groups may give a better model fit than modeling the whole dataset.
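The ANOVA test used to judge whether a variable differs across clusters rests on the F statistic, the ratio of between-cluster to within-cluster mean squares. A minimal sketch is given below; the function name `one_way_anova_f` and the three small groups are hypothetical, and the p-values reported in the tables would additionally require the F distribution, which this sketch does not compute.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the cluster means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the items around their own cluster mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical standardized values of one variable in three clusters.
clusters = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(clusters)
print(f_stat, df_b, df_w)
```

A large F indicates that the variable separates the clusters well, whereas a value near 1 (and hence a large p-value, as reported for Crahed_Cars_Count) suggests that the variable contributes little to distinguishing the clusters.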

5. Modeling of the Clustered Data

To examine the goodness of fit of the developed Smeed models, the actual death rates were compared with the estimated ones. Figure 5 shows the fits of the estimated values to the real values for (a) all the data, (b) the first cluster’s data, (c) the second cluster’s data, (d) the third cluster’s data, and (e) the fourth cluster’s data. The data points were scattered close to a 45° line, which reflects agreement between the actual and predicted values and was supported statistically by the coefficient of determination (R²). The R² values were acceptable for all datasets and were higher for the individual clusters than for all of the data, meaning that clustering improved the models’ fits. For the regression models, the R² values in Figures 6–10 reflect the models’ fits when the death rate, as the dependent variable, was related to the car ownership rate, as the independent variable, for the same five datasets (a–e) using linear, exponential, logarithmic, polynomial, and power regression models. Figures 6–10 all show that the R² values for the different clusters were greater than the values for all of the data. This means that clustering improved the models’ fits regardless of the model type, except for a few cases in the power and logarithmic models. These exceptions may be due to the models’ characteristics or to a sample size that was insufficient for developing certain models.
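As an illustration of how a Smeed-type model can be fitted and checked against observations, the sketch below fits a power law between the death rate per vehicle and the car ownership rate by least squares on the logarithms and then computes R² between actual and estimated values. It assumes Smeed's classic form, D/N = 0.0003 (N/P)^(−2/3), only to generate synthetic data; the ownership values, function names, and coefficients are illustrative assumptions and are not the coefficients fitted in this study.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of ln(y) = ln(a) + b * ln(x); returns (a, b)."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(x)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (w - mean_y) for u, w in zip(log_x, log_y))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def r_squared(actual, predicted):
    """Coefficient of determination between observed and estimated values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Synthetic data (assumption): vehicles per person and deaths per vehicle
# generated exactly from Smeed's classic constants.
ownership = [0.02, 0.03, 0.04, 0.05, 0.06]
death_rate = [0.0003 * v ** (-2 / 3) for v in ownership]

a, b = fit_power_law(ownership, death_rate)
estimated = [a * v ** b for v in ownership]
r2 = r_squared(death_rate, estimated)
```

With real data, the same two functions could be applied to all the data and to each cluster in turn, and the resulting R² values compared in the way Figure 5 is read.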
Compared with the previous relevant literature, the following insights can be highlighted. Although extensive research has been carried out in this field of heuristic-based cluster analysis, the statistical properties of these methods are generally unknown, whereas the statistical properties of probability model-based clustering techniques are better understood [9]. Therefore, the current study relied on k-means clustering analysis. The results of [2,3] suggested that the combined use of parametric and non-parametric methods may effectively overcome the limits of each group of methods, with satisfactory prediction accuracies and the ability to interpret the factors contributing to fatal and serious crashes. However, we sought a simple model with fewer variables that would be suitable for practitioners. Nicholson [7] stated that there is a need for a simple way to cluster accident data and [4] recommended simplifying accident models by eliminating variables. Although this is expected to reduce the model’s accuracy, it might make it agile enough to be utilized in developing countries, where traffic crash data are usually scarce [4]. Moreover, many studies (e.g., [4,9]) have revealed that traffic accident data should be mainly clustered on the basis of vehicle type, whereas the cluster analysis in this study showed that other contributory factors are also important for data clustering.
Riccardi et al. [2] found that the proper clustering of the factors that affect fatal and serious-injury accidents is largely different from that of the factors that contribute to accidents causing slight injuries. They tried to avoid the imbalanced distribution of the variables and the drawbacks of the error rate (which assumed that the errors had equal values, which was not true for imbalanced data, and misclassified some classes, such as fatal and serious-injury crashes). Therefore, Refs. [2,4] divided the modeling data into fatal and serious-injury accidents, and into severe and non-severe crashes, respectively, which was difficult here because of the availability and characteristics of the data used in this study.

6. Summary and Conclusions

On the basis of aggregated and disaggregated long-horizon traffic accident datasets in Egypt, the present study compared the performance of some existing models. To validate the models’ fits to the characteristics of different data, the traffic accident data were clustered into reasonable groups. Some parsimonious models (with fewer variables) that can be implemented in real-world practice were used. The k-means algorithm was used for clustering analysis. Ten contributory parameters were included as the principal causal factors, namely the road’s shape, the road type, the time of the accident, the weather conditions, the surface conditions, traffic volumes, the number of crashed vehicles reported, the cause as reported by the traffic police, the number of deaths, and the number of injuries. Using k = 2, k = 3, k = 4, and k = 5, which specified different numbers of clusters, four clustering modes were evaluated. By summarizing the factors within each cluster, ANOVA was used to identify the variances within groups and their centers, and the most suitable number of clusters was then determined.
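The clustering procedure summarized above can be sketched as follows: standardize each variable to z-scores, run k-means for a chosen k, and count the cases per cluster. This is a minimal, standard-library illustration of Lloyd's algorithm under stated assumptions (random initial centers drawn from the data, Euclidean distance, sample standard deviation for the z-scores); the function names and the six two-variable records are hypothetical, and the study's own software and settings are not reproduced here.

```python
import math
import random
import statistics
from collections import Counter


def zscore_columns(rows):
    """Standardize each column to zero mean and unit sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(rows, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)  # initial centers are k distinct records
    labels = None
    for _ in range(iterations):
        # Assignment step: each record goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(row, centers[c])) for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [statistics.fmean(col) for col in zip(*members)]
    return labels, centers


# Hypothetical records with two variables forming two well-separated groups.
records = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [9.0, 9.0], [9.1, 8.8], [8.9, 9.2]]
standardized = zscore_columns(records)
labels, centers = kmeans(standardized, k=2)
cluster_sizes = Counter(labels)
```

Repeating the call with k = 2, 3, 4, and 5 and tabulating `cluster_sizes` gives counts of the kind reported in Tables 1 and 4.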
Smeed’s model was applied to relate the death rate resulting from traffic accidents in Egypt to car ownership, utilizing five years of historical data. The model’s performance was evaluated in terms of the errors between the estimations and the observations. Moreover, different forms of regression models (linear, exponential, logarithmic, polynomial, and power regression models) were used to relate the death rate as a dependent variable to car ownership as an independent variable. The results of model fitting showed that the R² values for the individual clusters were higher than the values for all the data, indicating that clustering improved the models’ fits regardless of the type of model. The exceptions were a few cases for the power and logarithmic models, which might have been caused by the models’ characteristics or the sample size. The results revealed that data clustering has a significant impact on classifying the data on the basis of the characteristics of the most important causal factors of traffic accidents. Consequently, predictive models can be developed on the basis of the other remaining factors (factors not considered during the clustering process).
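The comparison of R² for the pooled data against R² for the individual clusters can be illustrated with a small sketch. It fits a simple linear regression to all records and to each cluster separately and reports the R² of each fit. The data are synthetic and deliberately constructed so that the two clusters follow different linear relations; the function names and values are assumptions for illustration and do not reproduce the study's results.

```python
def linear_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((u - mean_x) ** 2 for u in x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(actual, predicted):
    """Coefficient of determination between observed and estimated values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def fit_r2(x, y):
    """Fit a line to (x, y) and return the R² of that fit."""
    intercept, slope = linear_fit(x, y)
    return r_squared(y, [intercept + slope * v for v in x])


# Synthetic example (assumption): two clusters with different linear trends.
x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0, 9.0, 8.0, 7.0, 6.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

r2_pooled = fit_r2(x, y)
r2_by_cluster = {}
for cluster in sorted(set(labels)):
    xs = [u for u, lab in zip(x, labels) if lab == cluster]
    ys = [v for v, lab in zip(y, labels) if lab == cluster]
    r2_by_cluster[cluster] = fit_r2(xs, ys)
```

In this constructed case the pooled fit explains little of the variance while each cluster is fitted almost exactly, which is the pattern the paragraph above describes. Other functional forms could be handled the same way by transforming x, y, or both before the linear fit.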
Compared with the previous relevant literature, this study avoided using heuristic-based cluster analysis, as the statistical properties are generally unknown, whereas the statistical properties of the k-means algorithm used here are better understood. These criteria may help in using simple models with few variables, as recommended by several previous studies, that can be used for practical applications. Many previous studies mainly clustered accident data based on the type of vehicle, while the cluster analysis conducted in this study showed that other variables are also important for data clustering. We close by noting that, if data with larger sample sizes and different characteristics become available, such as different populations, car ownership, and death rates, these could be used to verify the methodology proposed here. More metrics, such as sensitivity and precision, are highly recommended for future studies to investigate the capability of the developed models to predict the death rate, which was not feasible here because of the availability and characteristics of the data used.
Finally, it should be emphasized that the descriptive cluster analysis followed in this study focused on finding a concise description for each traffic accident type, which can be useful during the interpretation of subsequent analyses. However, the results of the cluster analysis conducted in this study also contain other useful information that can provide interesting insights into various traffic accident types, which may guide decision-makers to deploy appropriate preventive measures for road traffic accidents toward sustainable transportation systems.

Author Contributions

Conceptualization, S.S., S.H. and A.M.W.; methodology, S.S., N.K.R., S.H. and A.M.W.; software, S.S., N.K.R., S.H. and A.M.W.; validation, S.S., S.H. and A.A.; analysis, S.S., S.H. and A.M.W.; data, N.K.R. and A.M.W.; writing—original draft preparation, S.S. and N.K.R.; writing—review and editing, S.H., A.M.W. and A.A.; resources, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the authors.

Data Availability Statement

The data used are unavailable due to privacy issues.

Acknowledgments

The authors express their gratitude to The Center of Road Traffic Safety at Naif Arab University for Security Sciences for providing the facilities required for completing this research.

Conflicts of Interest

The authors declare that the paper has no financial/personal interest or any other matter that could affect its objectivity. The authors certify that potential competing interests do not exist for this paper.

References

  1. World Health Organization. Global Status Report on Road Safety 2018; World Health Organization: Geneva, Switzerland, 2018.
  2. Queirós-Reis, L.; Gomes da Silva, P.; Gonçalves, J.; Brancale, A.; Bassetto, M.; Mesquita, J.R. SARS-CoV-2 Virus−Host Interaction: Currently Available Structures and Implications of Variant Emergence on Infectivity and Immune Response. Int. J. Mol. Sci. 2022, 22, 10836.
  3. Riccardi, M.R.; Mauriello, F.; Scarano, A.; Montella, A. Analysis of contributory factors of fatal pedestrian crashes by mixed logit model and association rules. Int. J. Inj. Control. Saf. Promot. 2022, 1–15.
  4. Assi, K.; Rahman, S.; Mansoor, U.; Ratrout, N. Predicting Crash Injury Severity with Machine Learning Algorithm Synergized with Clustering Technique: A Promising Protocol. Int. J. Environ. Res. Public Health 2020, 17, 5497.
  5. Sabel, C.E.; Kingham, S.; Nicholson, A.; Bartie, P. Road traffic accident simulation modelling-A kernel estimation approach. In Proceedings of the 17th Annual Colloquium of the Spatial Information Research Centre, Dunedin, New Zealand; 2005; pp. 67–75.
  6. Nicholson, A.J. Towards a Comprehensive Strategy for Accident Reduction and Prevention. In Proceedings of the Road Traffic Safety Seminar, Road Traffic Safety Research Council, Wellington, New Zealand, 14–16 September 1988; Volume 2, pp. 12–25.
  7. Nicholson, A.J. Accident clustering: Some simple measures. Traffic Eng. Control 1989, 30, 241–246.
  8. Nicholson, A. Indices of accident clustering: A re-evaluation. Traffic Eng. Control 1995, 36, 5.
  9. Depaire, B.; Wets, G.; Vanhoof, K. Traffic accident segmentation by means of latent class clustering. Accid. Anal. Prev. 2008, 40, 1257–1266.
  10. Shaikh, N.S.; Nicholson, A.J. Accident Clustering in New Zealand. In IPENZ Annual Conference 1993, Proceedings of: Sustainable development: Papers Prepared for the Conference, the University of Waikato, Hamilton, 5–9 February 1993; Institution of Professional Engineers New Zealand: Wellington, New Zealand, 1993; pp. 396–407.
  11. Tang, J.; Zheng, L.; Han, C.; Yin, W.; Zhang, Y.; Zou, Y.; Huang, H. Statistical and machine-learning methods for clearance time prediction of road incidents: A methodology review. Anal. Methods Accid. Res. 2020, 27, 100123.
  12. Theofilatos, A.; Chen, C.; Antoniou, C. Comparing Machine Learning and Deep Learning Methods for Real-Time Crash Prediction. Transp. Res. Rec. J. Transp. Res. Board 2019, 2673, 169–178.
  13. Wang, J.; Kong, Y.; Fu, T. Expressway crash risk prediction using back propagation neural network: A brief investigation on safety resilience. Accid. Anal. Prev. 2019, 124, 180–192.
  14. Li, Z.; Wang, W.; Liu, P.; Bigham, J.M.; Ragland, D.R. Using Geographically Weighted Poisson Regression for county-level crash modeling in California. Saf. Sci. 2013, 58, 89–97.
  15. Pirdavani, A.; Brijs, T.; Bellemans, T.; Kochan, B.; Wets, G. Evaluating the road safety effects of a fuel cost increase measure by means of zonal crash prediction modeling. Accid. Anal. Prev. 2013, 50, 186–195.
  16. Yu, R.; Abdel-Aty, M. Multi-level Bayesian analyses for single- and multi-vehicle freeway crashes. Accid. Anal. Prev. 2013, 58, 97–105.
  17. Cheng, L.; Geedipally, S.R.; Lord, D. The Poisson–Weibull generalized linear model for analyzing motor vehicle crash data. Saf. Sci. 2013, 54, 38–42.
  18. Mannering, F.L.; Bhat, C.R. Analytic methods in accident research: Methodological frontier and future directions. Anal. Methods Accid. Res. 2014, 1, 1–22.
  19. Santos, K.; Dias, J.P.; Amado, C. A literature review of machine learning algorithms for crash injury severity prediction. J. Saf. Res. 2022, 80, 254–269.
  20. Smeed, R.J. Some Statistical Aspects of Road Safety Research. J. R. Stat. Soc. Ser. A (Gen.) 1949, 112, 1–34.
  21. Kopits, E.; Cropper, M. Traffic fatalities and economic growth. Accid. Anal. Prev. 2005, 37, 169–178.
  22. Dupont, E.; Commandeur, J.J.; Lassarre, S.; Bijleveld, F.; Martensen, H.; Antoniou, C.; Papadimitriou, E.; Yannis, G.; Hermans, E.; Pérez, K.; et al. Latent risk and trend models for the evolution of annual fatality numbers in 30 European countries. Accid. Anal. Prev. 2014, 71, 327–336.
  23. Peng, Y.; Lord, D.; Information, R. Application of Latent Class Growth Model to Longitudinal Analysis of Traffic Crashes. Transp. Res. Rec. J. Transp. Res. Board 2011, 2236, 102–109.
  24. Zou, Y.; Zhang, Y.; Lord, D. Application of finite mixture of negative binomial regression models with varying weight parameters for vehicle crash data analysis. Accid. Anal. Prev. 2013, 50, 1042–1051.
  25. Zou, Y.; Zhang, Y.; Lord, D. Analyzing different functional forms of the varying weight parameter for finite mixture of negative binomial regression models. Anal. Methods Accid. Res. 2014, 1, 39–52.
  26. Yuan, Y.; Yang, M.; Guo, Y.; Rasouli, S.; Gan, Z.; Ren, Y. Risk factors associated with truck-involved fatal crash severity: Analyzing their impact for different groups of truck drivers. J. Saf. Res. 2021, 76, 154–165.
  27. Hosseinzadeh, A.; Moeinaddini, A.; Ghasemzadeh, A. Investigating factors affecting severity of large truck-involved crashes: Comparison of the SVM and random parameter logit model. J. Saf. Res. 2021, 77, 151–160.
  28. Narayanamoorthy, S.; Paleti, R.; Bhat, C.R. On accommodating spatial dependence in bicycle and pedestrian injury counts by severity level. Transp. Res. Part B Methodol. 2013, 55, 245–264.
  29. Castro, M.; Paleti, R.; Bhat, C.R. A latent variable representation of count data models to accommodate spatial and temporal dependence: Application to predicting crash frequency at intersections. Transp. Res. Part B Methodol. 2012, 46, 253–272.
  30. Harvey, J.M.; Han, J. Geographic Data Mining and Knowledge Discovery; CRC Press: Boca Raton, FL, USA, 2009.
  31. Omran, M.G.; Engelbrecht, A.P.; Salman, A. An overview of clustering methods. Intell. Data Anal. 2007, 11, 583–605.
  32. Wu, X.; Cheng, C.; Zurita-Milla, R.; Song, C. An overview of clustering methods for geo-referenced time series: From one-way clustering to co- and tri-clustering. Int. J. Geogr. Inf. Sci. 2020, 34, 1822–1848.
  33. Bogatyrev, M.; Samodurov, K. Conceptual Approach to Clustering in the Study of Gene Expression. Доклады Международной Конференции Математическая Биология И Биоинформатика 2018, 7, e54.
  34. Dempe, S. Wiley Encyclopaedia of Operations Research and Management Science, by James J. Cochran. Optimization 2013, 62, 167–168.
  35. Gulagiz, F.K.; Suhap, S. Comparison of Hierarchical and Non-Hierarchical Clustering Algorithms. Int. J. Comput. Eng. Inf. Technol. 2017, 9, 6–14.
  36. Cheng, R.; Milligan, G.W. K-Means Clustering Methods with Influence Detection. Educ. Psychol. Meas. 1996, 56, 833–838.
  37. Wickramasinghe, N.D. Canonical correlation analysis: An introduction to a multivariate statistical analysis. J. Coll. Community Physicians Sri Lanka 2019, 25, 37.
  38. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Routledge: Boca Raton, FL, USA, 2017.
  39. Kemp, A.W.; Stuart, A.; Ord, J.K. Kendall’s Advanced Theory of Statistics. J. Am. Stat. Assoc. 1994, 43, 220.
  40. Boehmke, B.; Greenwell, B. Multivariate Adaptive Regression Splines. In Hands-On Machine Learning with R. 2020, pp. 141–156. Available online: https://bradleyboehmke.github.io/HOML/mars.html (accessed on 12 December 2022).
  41. General Authority for Roads, Bridges, and Land Transport, (GARBLT), for the Study of “Safety and Protection of Public Transport on the Rural Roads in Egypt”. 2003. Available online: https://archive.unescwa.org/general-authority-roads-bridges-and-land-transport (accessed on 12 December 2022).
Figure 1. Disaggregated data descriptive classification.
Figure 2. Aggregated data descriptive classification.
Figure 3. Disaggregated data cluster center characteristics: (a) k = 2, (b) k = 3, (c) k = 4, and (d) k = 5.
Figure 4. Aggregated data cluster center characteristics: (a) k = 2, (b) k = 3, (c) k = 4, and (d) k = 5.
Figure 5. The death rates estimated by Smeed’s law versus the actual death rates.
Figure 6. Linear correlations between death rates and car ownership rates.
Figure 7. Exponential correlations between death rates and car ownership rates.
Figure 8. Logarithmic correlation between death rates and car ownership rates.
Figure 9. Polynomial correlation between death rates and car ownership rates.
Figure 10. Power correlation between death rates and car ownership rates.
Table 1. Numbers of road traffic accidents in each cluster of disaggregated data.

| Cluster | k = 2 | k = 3 | k = 4 | k = 5 |
|---|---|---|---|---|
| 1st | 7791 | 206 | 7270 | 83 |
| 2nd | 335 | 7245 | 83 | 84 |
| 3rd | 0 | 675 | 86 | 612 |
| 4th | 0 | 0 | 687 | 666 |
| 5th | 0 | 0 | 0 | 6681 |
Table 2. Analysis of variance for the clustering of disaggregated data.

| k | Parameter | Cluster mean square | Cluster df | Error mean square | Error df | F | Sig. (p-value) |
|---|---|---|---|---|---|---|---|
| 2 | Zscore(SHAPE) | 17.069 | 1 | 1.173 | 8124 | 14.550 | 0.000 |
| 2 | Zscore(TYPE) | 46.809 | 1 | 1.080 | 8124 | 43.333 | 0.000 |
| 2 | Zscore(TIME) | 0.000 | 1 | 1.010 | 8124 | 0.000 | 0.995 |
| 2 | Zscore(WEATHER_ID) | 3.403 | 1 | 0.916 | 8124 | 3.716 | 0.054 |
| 2 | Zscore(SURFACE_ID) | 13.012 | 1 | 0.922 | 8124 | 14.107 | 0.000 |
| 2 | Zscore(TRAFFIC_NOTES) | 0.127 | 1 | 1.002 | 8124 | 0.127 | 0.722 |
| 2 | Zscore(DEATH) | 2675.527 | 1 | 0.729 | 8124 | 3668.701 | 0.000 |
| 2 | Zscore(HURT) | 3545.551 | 1 | 0.628 | 8124 | 5646.859 | 0.000 |
| 2 | Zscore(Crahed_Cars_Count) | 0.006 | 1 | 1.028 | 8124 | 0.006 | 0.940 |
| 2 | Zscore(VOLUME) | 30.712 | 1 | 1.184 | 8124 | 25.941 | 0.000 |
| 3 | Zscore(SHAPE) | 4.318 | 2 | 1.174 | 8123 | 3.677 | 0.025 |
| 3 | Zscore(TYPE) | 19.546 | 2 | 1.081 | 8123 | 18.076 | 0.000 |
| 3 | Zscore(TIME) | 44.132 | 2 | 0.999 | 8123 | 44.172 | 0.000 |
| 3 | Zscore(WEATHER_ID) | 3345.198 | 2 | 0.093 | 8123 | 36,057.577 | 0.000 |
| 3 | Zscore(SURFACE_ID) | 2329.633 | 2 | 0.351 | 8123 | 6646.201 | 0.000 |
| 3 | Zscore(TRAFFIC_NOTES) | 34.048 | 2 | 0.993 | 8123 | 34.276 | 0.000 |
| 3 | Zscore(DEATH) | 1708.920 | 2 | 0.638 | 8123 | 2678.599 | 0.000 |
| 3 | Zscore(HURT) | 1232.853 | 2 | 0.761 | 8123 | 1620.269 | 0.000 |
| 3 | Zscore(Crahed_Cars_Count) | 0.299 | 2 | 1.028 | 8123 | 0.291 | 0.748 |
| 3 | Zscore(VOLUME) | 11.533 | 2 | 1.185 | 8123 | 9.733 | 0.000 |
| 4 | Zscore(SHAPE) | 756.298 | 3 | 0.896 | 8122 | 843.890 | 0.000 |
| 4 | Zscore(TYPE) | 37.714 | 3 | 1.072 | 8122 | 35.170 | 0.000 |
| 4 | Zscore(TIME) | 6.439 | 3 | 1.008 | 8122 | 6.390 | 0.000 |
| 4 | Zscore(WEATHER_ID) | 1.837 | 3 | 0.916 | 8122 | 2.005 | 0.111 |
| 4 | Zscore(SURFACE_ID) | 3.385 | 3 | 0.923 | 8122 | 3.668 | 0.012 |
| 4 | Zscore(TRAFFIC_NOTES) | 8.044 | 3 | 0.999 | 8122 | 8.053 | 0.000 |
| 4 | Zscore(DEATH) | 1603.182 | 3 | 0.467 | 8122 | 3435.010 | 0.000 |
| 4 | Zscore(HURT) | 1457.940 | 3 | 0.526 | 8122 | 2771.453 | 0.000 |
| 4 | Zscore(Crahed_Cars_Count) | 0.861 | 3 | 1.028 | 8122 | 0.838 | 0.473 |
| 4 | Zscore(VOLUME) | 39.562 | 3 | 1.173 | 8122 | 33.716 | 0.000 |
| 5 | Zscore(SHAPE) | 567.894 | 4 | 0.896 | 8121 | 633.821 | 0.000 |
| 5 | Zscore(TYPE) | 27.735 | 4 | 1.073 | 8121 | 25.855 | 0.000 |
| 5 | Zscore(TIME) | 24.662 | 4 | 0.998 | 8121 | 24.710 | 0.000 |
| 5 | Zscore(WEATHER_ID) | 1661.401 | 4 | 0.098 | 8121 | 16,899.225 | 0.000 |
| 5 | Zscore(SURFACE_ID) | 1155.645 | 4 | 0.355 | 8121 | 3254.191 | 0.000 |
| 5 | Zscore(TRAFFIC_NOTES) | 22.120 | 4 | 0.991 | 8121 | 22.320 | 0.000 |
| 5 | Zscore(DEATH) | 1177.736 | 4 | 0.479 | 8121 | 2459.162 | 0.000 |
| 5 | Zscore(HURT) | 1027.350 | 4 | 0.559 | 8121 | 1838.884 | 0.000 |
| 5 | Zscore(Crahed_Cars_Count) | 0.278 | 4 | 1.028 | 8121 | 0.270 | 0.897 |
| 5 | Zscore(VOLUME) | 29.425 | 4 | 1.174 | 8121 | 25.071 | 0.000 |
Table 3. Analysis of variance for the clustering of aggregated data.

| k | Parameter | Cluster mean square | Cluster df | Error mean square | Error df | F | Sig. (p-value) |
|---|---|---|---|---|---|---|---|
| 2 | Zscore(TIME) | 1051.380 | 1 | 0.044 | 1104 | 23,986.163 | 0.000 |
| 2 | Zscore(TYPE) | 293.901 | 1 | 0.205 | 1104 | 1433.764 | 0.000 |
| 2 | Zscore(CAUSE) | 734.029 | 1 | 0.333 | 1104 | 2203.021 | 0.000 |
| 3 | Zscore(TIME) | 538.773 | 2 | 0.020 | 1103 | 26,738.792 | 0.000 |
| 3 | Zscore(TYPE) | 174.300 | 2 | 0.156 | 1103 | 1120.325 | 0.000 |
| 3 | Zscore(CAUSE) | 473.028 | 2 | 0.141 | 1103 | 3348.504 | 0.000 |
| 4 | Zscore(TIME) | 359.657 | 3 | 0.019 | 1102 | 19,054.883 | 0.000 |
| 4 | Zscore(TYPE) | 132.904 | 3 | 0.110 | 1102 | 1205.487 | 0.000 |
| 4 | Zscore(CAUSE) | 315.762 | 3 | 0.140 | 1102 | 2250.952 | 0.000 |
| 5 | Zscore(TIME) | 269.810 | 4 | 0.019 | 1101 | 14,467.421 | 0.000 |
| 5 | Zscore(TYPE) | 99.745 | 4 | 0.110 | 1101 | 905.917 | 0.000 |
| 5 | Zscore(CAUSE) | 241.121 | 4 | 0.125 | 1101 | 1932.258 | 0.000 |
Table 4. Number of road traffic accidents in each cluster of aggregated data.

| Cluster | k = 2 | k = 3 | k = 4 | k = 5 |
|---|---|---|---|---|
| 1st | 784 | 557 | 557 | 557 |
| 2nd | 322 | 233 | 303 | 6 |
| 3rd | 0 | 316 | 13 | 13 |
| 4th | 0 | 0 | 233 | 233 |
| 5th | 0 | 0 | 0 | 297 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Shokry, S.; Rashwan, N.K.; Hemdan, S.; Alrashidi, A.; Wahaballa, A.M. Characterization of Traffic Accidents Based on Long-Horizon Aggregated and Disaggregated Data. Sustainability 2023, 15, 1483. https://doi.org/10.3390/su15021483
