1. Introduction
Road traffic accidents cause great loss of life, damage to property, and notable psychological effects on the victims and their families. Annually, traffic accidents result in more than 50 million injuries and 1.35 million deaths worldwide [1]. Decision-makers need accurate information about the relationships between traffic accidents and the contributing factors. Developing accident-prediction models can help in identifying the causes of accidents and allows transportation authorities to formulate targeted road safety measures that improve quality of life by ensuring sustainable transportation systems. Thus, many studies all over the world have aimed to evaluate the causes of traffic accidents in order to reduce their harmful impacts. Traffic accident modeling has been extensively studied in the literature using different techniques during the last few decades [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17]. Although different methodologies have been used in accident modeling research, numerous issues still need to be investigated, according to the recommendations of the extensive reviews in [18,19]. These include issues related to the characteristics of traffic accident data, parsimonious versus fully specified models, unobserved heterogeneity, spatial and temporal correlations, risk compensation, the choice of the methodological approach, and the under-reporting of traffic accidents with less severe injuries.
This study aimed to use some existing accident models to compare their performance based on the characteristics of aggregated and disaggregated datasets collected on Egyptian roads during the periods between 2015 and 2019 and between 1999 and 2003, respectively. The methodology was based on classifying different data characteristics (using k-means clustering) and examining the effect of that classification on model fitting. Because of the limitations of data availability and/or the need to specify models with a few simple explanatory variables, some parsimonious models (preferred by practitioners) were used for this comparison. The death rate is a benchmark for the road safety conditions of a country, and Smeed’s law established a relationship between the death rate, the number of vehicles, and the population by using data from 1938 gathered from 20 different countries [20]. Many later updates to Smeed’s law found that an increase in car ownership led to a decrease in the number of fatalities per vehicle [21,22]. Moreover, this study compared the fit obtained by modeling the whole dataset with the fit obtained for different clusters of data, based on Smeed’s model and different regression model forms, to evaluate the effectiveness of data clustering. The models estimated the death rate caused by traffic accidents in Egypt by utilizing five years of historical data. The main objective was to compare the fits of the models for the different data clusters while estimating the models’ parameters. The specific objectives of this research were as follows:
To compare the prediction accuracy of an existing model based on long-horizon accident datasets;
To validate the models based on the available data, which included clustering the accident data into reasonable groups and testing the model’s fits for the different data groups.
The next section provides a literature review of the existing relevant research. Section 3 outlines the research methodology, and Section 4 presents the characteristics and processing of the accident data utilized. The modeling of the clustered data, results and main conclusions are presented in Section 5 and Section 6, respectively.
2. Literature Review
There are different traffic accident prediction models in use; however, because of the limitations of data availability or the need to specify models with a few simplistic explanatory variables, parsimonious models are often used. Many human factors, such as the driver’s age, gender, and other socioeconomic characteristics, are considered, but the drivers’ perceptions and reaction time, stress, fatigue, and/or emotional condition (at the time of the accident) are difficult to record. In addition, some environmental conditions in the traffic accident’s location may not be recorded. There are several published studies covering the effect of unrecorded related factors on road traffic accidents [
23,
24,
25,
26]. They used latent-class models that addressed the effect of unrecorded factors by classifying the data into subgroups with homogeneous characteristics. However, because of the assumed homogeneous characteristics of each subgroup, these models did not consider variations within the same subgroup [
18]. Other researchers, such as [
27,
28,
29], used random-parameter models that allowed for differences in the parameters among observations to deal with the effect of unrecorded factors. These models can be more complex and may include a large number of variables, which may be cost-ineffective in terms of the computation time required compared with using models with a few variables. Factors that may influence the frequency and/or severity of road traffic accidents are likely to be correlated in space and time, so ignoring the spatial and temporal correlations of the data will result in inefficient and possibly inconsistent parameter estimates. Some studies have addressed this issue by using multivariate models, where multiple dependent variables are interrelated with each other. These methods, however, are too restrictive, relatively cumbersome and time-consuming, and/or infeasible in cases of high dimensionality. This section briefly highlights the relevant literature; a comprehensive review of the studies that used various methodological approaches in the field of traffic accident research can be found in [18,19].
Riccardi et al. [2] modeled traffic accidents in Great Britain using many parametric and non-parametric models and found that the parametric models established a relationship between the dependent and independent variables with a clear interpretation of the outputs, whereas the non-parametric models required more explanatory variables with a high probability of dependency among them. Similarly, the authors of [3] modeled fatal pedestrian accidents in Italy by applying the mixed logit model, machine learning, and association rules. The F-measure and the G-mean measures were utilized to compare the performances of the models in both approaches.
Clustering analysis is a mathematical statistical method that can be applied to large datasets, where the raw data are sorted and grouped into clusters [
30,
31,
32,
33,
34,
35,
36,
37,
38]. These clusters are internally homogeneous, although they are different from each other. The components of a single cluster are similar to each other, while the components of different clusters are less homogeneous [
39]. Clustering analysis is similar to multidimensional scaling for investigating the similarity between factors by examining the full range of interrelated relationships [
38]. The difference between the two methodologies is that multidimensional scaling identifies the key dimensions, while clustering analysis identifies the groups. Clustering analysis is considered to be the opposite of factor analysis [
33]. Factor analysis reduces the number of variables by grouping them into smaller groups of factors, but clustering analysis reduces the number of observations or cases by grouping them into smaller groups [
37]. Hence, clustering analysis aims to make the variance of the elements within each group as small as possible and to make the variance between groups and their centers as large as possible [
40]. Non-hierarchical clustering analysis is suitable for large amounts of data that are compatible with rich disaggregated observations [
36]. The number of clusters (k) can be determined specifically or by the node or clustering method. Non-hierarchical clustering analysis relies on three steps: (1) creating a preliminary distribution of the existing observations within a specific number of initial groups; (2) treating the created initial groups as primary clusters; and (3) re-dividing the primary clusters to form smaller and smaller clusters up to the final stage [35].
Nicholson [
7] investigated many measures of accident clustering and recommended using simple methods of clustering accident data. The choice of plan type (site, route, or area) was suggested to be dependent upon the spatial distribution of accidents. Choosing a site plan when accidents are highly dispersed or an area plan when accidents are highly clustered at certain points will probably result in a poor economic return. In the same regard, Shaikh and Nicholson [
10] studied accident clustering in New Zealand and found that accidents are much more dispersed in New Zealand compared with other countries, and suggested that less emphasis should be placed on site plans and more emphasis should be placed on route and area plans in New Zealand. Nicholson [
8] discussed the evaluation of the indices of accident clustering and their interpretation to provide better randomness in the descriptions of how accidents occur. He evaluated the truncated negative binomial distribution and suggested a new form of truncated negative binomial distribution. Sabel et al. [
5] used kernel estimation clustering analysis to automatically identify road traffic accident “black spots” and “black areas” in Christchurch, New Zealand, using GIS and Python software. They found that kernel estimation was able to quickly identify the accident clusters and, when used in conjunction with Monte Carlo simulation techniques, to identify statistically significant clusters.
Assi et al. [
4] developed machine learning (ML) models to predict the severity of crash injuries in Great Britain and divided the ML models into different clusters using the fuzzy c-means method. They developed four ML models: feed-forward neural networks (FNN), a support vector machine (SVM), a fuzzy C-means clustering-based feed-forward neural network (FNN-FCM), and a fuzzy C-means-based support vector machine (SVM-FCM). They found that the FNN combined with FCM provided a slight improvement compared with the FNN without clustering, while the SVM-FCM model had higher accuracy when compared with the SVM. They concluded that the FCM clustering algorithm enhanced the prediction power of the FNN and the SVM models. Depaire et al. [
9] investigated the effectiveness of using latent class clustering and compared the results of these cluster-based analyses with the results of full-data analysis, and found that clustering revealed important relationships in the variation in a variable’s effect between different traffic accident types on the probability of injury as an outcome. For example, they found that the full-data model hid the probability of the first road user being slightly injured in a traffic accident, while the cluster-based models revealed a more complete interpretation.
Smeed’s law [
20] measures the road traffic safety conditions of a country as the relationship between the death rate, the number of vehicles, and the population. Smeed [
20] found that an increase in the car ownership rate (the ratio between the size of the vehicle fleet and the population) caused a decrease in the rate of death caused by traffic accidents (the ratio between the number of deaths and the size of the vehicle fleet), with α = 0.0003 and β = −2/3. Many studies have updated Smeed’s law using a range of sociodemographic, economic, environmental, and policy-related variables in order to better estimate the road safety outcomes of a country [
12,
13]. These studies found that Smeed’s formula describes the change in fatalities reasonably well up to the 0.2–0.3 vehicles/person motorization rate, while above this level, the formula seems to overestimate the fatality rate.
Kopits and Cropper [
12] showed that, in developing countries, the rate of the growth in vehicle ownership has increased more rapidly than the reduction in the fatality rate, while in industrialized countries, the motorization rate has tended to increase at a slower rate than the rate of the reduction in the number of fatalities per vehicle. However, refs. [
32,
33] seriously criticized Smeed’s model, as data from only one year were utilized in the model’s development. They also pointed out that Smeed’s model could not be used for all countries because each country has distinct traffic, economic, and social parameters, and that the model’s coefficient and exponent should, therefore, be country dependent. However, in this article, because of our emphasis on comparing different data clusters, rather than focusing on the model’s coefficients, Smeed’s law was used, along with other different forms of regression models.
3. Methodology
In this study, the use of a model that can be implemented in real-world practice was evaluated (with a small number of variables, but with representative factors that cause road traffic accidents). The big data collected could help in developing relatively simplistic models using only the explanatory variables of road traffic accidents needed for practical use in the field of road safety [
34,
35]. The use of existing models was studied in order to produce a proper model that represents actual field conditions as closely as possible and with less variability. The proposed models were calibrated and validated on long-horizon data after the data were clustered using standard clustering methods. Clustering helped in classifying the data based on the characteristics of the most likely causal factors. Accordingly, the prediction model was developed based on the remaining factors (i.e., the factors that were not considered during clustering). It was found that these steps played a significant role in using simple models with few explanatory variables.
The k-means algorithm was used as the conceptual approach of non-hierarchical clustering to classify the different parameters of the disaggregated datasets. The objective of this procedure was to classify the sample of data within (k) clusters. As a result, the sum of the squares within the clusters was as small as possible. The k-means algorithm involves the following steps.
To determine the number of required (k) clusters with the random or intentional initial splitting of the observations into groups, the elements of each group were sorted separately. These primary groups were known as the initial clusters. In order to estimate the probability of cluster membership based on one or more probability distributions, the log-likelihood method was used to measure the distances between each item (x) and the center of its cluster, and the distances between the items. It could also calculate the distance between the centers of the final clusters. The log-likelihood method assumes that continuous variables are normally distributed and that categorical variables are distributed according to multinomial distributions. Accordingly, the overall probability or likelihood of the data can be maximized.
There were many benefits to using the various clustering methods in this study. For instance, the k-means algorithm produces tighter clusters than hierarchical clustering. Hence, applying k-means data clustering is a crucial way to obtain the optimal number of clusters from the model itself, and human intervention is not required. Although the initial seeds have a significant impact on the final results, they ease the classification probabilities of the sample’s contributory factor memberships in a clear visualization. On the other hand, the outputs of hierarchical clustering are more informative than the unstructured set of flat clusters returned by k-means. Each approach has its own disadvantages when calculating the similarity between clusters. Hierarchical clustering analysis may not be suitable for large datasets because of its high time and space complexity. This research ran k-means with various values of k as a simple and fast way to specify the proper number of clusters; then, the modeling was conducted accordingly.
Although many studies (e.g., [
4,
9]) have revealed that traffic accident data should be clustered on the basis of vehicle type, the clustering analysis conducted in this study emphasized that other contributory factors have a significant impact on the clustering of the data. Moreover, the effectiveness of clustering was investigated by [
9], and the results were compared with the results of full-data analysis, which indicated that clustering could reveal important relationships for the variation in a variable’s effect on different traffic accident types. The cluster-based models revealed a more complete interpretation, while the full-data model was found to hide the probability of some causative factors. The methodology followed in this article aimed to identify clusters that can be transferred to other datasets, rather than focus on finding typical groupings in the data. Accordingly, the results of clustering were validated to provide a degree of confidence, and the model’s performance was evaluated.
Nicholson [
6] showed that the efficiency of modeling can be improved by increasing the observation period, and concluded that a five-year period is generally optimal from the viewpoint of statistical reliability. Accordingly, this article described an evaluation of the indices of accident clustering in order to estimate the death rate resulting from traffic accidents in Egypt by utilizing five years of historical data between 1999 and 2003.
First, the centers of the initial clusters were determined, and then the distances between each item (x) and the center of its cluster were calculated. Finally, the items were assigned to the clusters according to their closeness to the clusters’ centers. The previous steps were repeated for all n items and for each time, the item was assigned to the nearest cluster’s center. If the (k) clusters were not associated with a certain degree of accuracy to stop clustering, we could specify another number of clusters and repeat all of the previous steps, then compare and evaluate the results of the two sets and choose the best set [
23,
24,
25,
26]. The final distribution of the clusters’ centers and the distances between the items and the centers of the final clusters were used to interpret the clusters’ details. The values of the data were standardized to zero mean and unit variance to prevent the clustering from being dominated by features with a larger scale. We ran the k-means clustering with four values of k (k = 2, k = 3, k = 4, and k = 5), each of which specified the number of clusters. If a small cluster appeared that was hard to profile by means of the cluster-dependent distributions, this indicated a group of outliers [
9]. Ten contributory parameters were included as the principal components, namely road shape, road type, surface conditions, weather conditions, traffic volume, accident time, number of reported crashed vehicles, reported cause of the accident, number of deaths, and number of injuries. After performing a descriptive analysis of the aggregated and disaggregated datasets, clustering and analysis of variance (ANOVA) were conducted simultaneously to identify the variance among the groups and their centers. These tests determined the significance of the impacts of the different variables on a cluster. Next, the clustered datasets were compared by using some base models. The models’ performances were evaluated in terms of the errors between the estimations and the observations for the base models described below.
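The clustering procedure described above (standardize, assign each item to the nearest center, update the centers, repeat for k = 2 to 5) can be sketched in a few lines of plain Python. This is only an illustrative sketch and not the software used in the study: the function names, the random seeding of the initial centers, and the two-column sample records are all hypothetical.

```python
import random
from statistics import fmean, pstdev

def standardize(rows):
    """Scale every column to zero mean and unit variance (z-scores)."""
    cols = list(zip(*rows))
    mus = [fmean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, seed=0, max_iter=100):
    """Plain Lloyd iteration; initial centers are k randomly chosen rows."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each item goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(row, centers[j]))
                      for row in rows]
        if new_labels == labels:  # no item changed cluster: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                centers[j] = [fmean(col) for col in zip(*members)]
    # Within-cluster sum of squares, the quantity k-means tries to reduce.
    wcss = sum(sq_dist(row, centers[lab]) for row, lab in zip(rows, labels))
    return labels, centers, wcss

# Hypothetical records with two numeric features (not the study's data).
data = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.1], [8.0, 9.0],
        [8.3, 9.2], [7.9, 8.8], [1.1, 2.2], [8.1, 9.1]]
z = standardize(data)

results = {}
for k in (2, 3, 4, 5):
    results[k] = kmeans(z, k)
    labels, _, wcss = results[k]
    sizes = [labels.count(j) for j in range(k)]
    print(f"k={k}: cluster sizes={sizes}, within-cluster SS={wcss:.3f}")
```

Comparing the cluster sizes and the within-cluster sum of squares across the four values of k mirrors the comparison made in the study; very small clusters in such output would be candidates for the outlier groups mentioned above.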
Smeed’s law [20] measures the road traffic safety conditions of a country as the relationship between the death rate, the number of vehicles, and the population, as shown in Equation (1).
\[ \frac{D}{N} = \alpha \left( \frac{N}{P} \right)^{\beta} \qquad (1) \]
where D is the number of annual deaths caused by road traffic accidents, N is the number of registered vehicles, P is the population, and α and β are the model parameters used for the estimation.
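As a worked illustration of Smeed’s law, the short sketch below evaluates the deaths-per-vehicle rate for a hypothetical country using the original parameter values α = 0.0003 and β = −2/3 quoted in the literature review. The function name and the population and fleet figures are assumptions chosen so that the arithmetic is easy to follow: with N/P = 0.125, the rate is 0.0003 × 0.125^(−2/3) = 0.0003 × 4 = 0.0012 deaths per vehicle, or about 1200 deaths per year.

```python
def smeed_death_rate(vehicles, population, alpha=0.0003, beta=-2.0 / 3.0):
    """Deaths per registered vehicle: D/N = alpha * (N/P) ** beta."""
    if vehicles <= 0 or population <= 0:
        raise ValueError("vehicles and population must be positive")
    return alpha * (vehicles / population) ** beta

# Hypothetical country (not data from the study).
population = 8_000_000
vehicles = 1_000_000                 # car ownership N/P = 0.125

rate = smeed_death_rate(vehicles, population)   # deaths per vehicle
deaths = rate * vehicles                        # estimated annual deaths
print(f"death rate = {rate:.6f} per vehicle, deaths = {deaths:.0f}")
```

Doubling the fleet for the same population lowers the rate per vehicle, which is the inverse relationship between car ownership and deaths per vehicle that Smeed reported.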
In this study, Smeed’s law [20] was applied to the whole dataset to relate the death rate resulting from road traffic accidents in Egypt to car ownership by utilizing historical data from 1999 to 2003. The relationship between $R_i$ (the actual death rate based on the historical data) and $\hat{R}_i$ (the estimated death rate calculated from the historical data as a function of the coefficients of Smeed’s model) was formulated as an optimization problem, as shown in Equation (2). The target of this optimization was to minimize the sum of the squares of the differences between the actual ($R_i$) and the estimated ($\hat{R}_i$) death rates for year i among n study years, as follows:
\[ \min_{\alpha,\,\beta} \; \sum_{i=1}^{n} \left( R_i - \hat{R}_i \right)^{2}, \qquad \hat{R}_i = \alpha \left( \frac{N_i}{P_i} \right)^{\beta} \qquad (2) \]
subject to bound constraints on α and β. The constraints’ values were assumed to have a wider range, as estimated by [20]. The values of α and β were not very sensitive in fitting the actual death rate to the estimated one.
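The least-squares estimation of Smeed’s parameters can be illustrated with a small sketch. It is not the solver used in the study: here β is scanned over a grid inside assumed bounds, and for each β the best α is obtained in closed form (the sum of squares is quadratic in α). The bounds on β, the grid resolution, and the synthetic ownership values are all hypothetical, and no bound on α is enforced.

```python
def best_alpha_and_sse(ownership, death_rate, beta):
    """For a fixed beta, return the least-squares alpha and the resulting SSE."""
    powered = [x ** beta for x in ownership]
    alpha = (sum(r * p for r, p in zip(death_rate, powered))
             / sum(p * p for p in powered))
    sse = sum((r - alpha * p) ** 2 for r, p in zip(death_rate, powered))
    return alpha, sse

def fit_smeed(ownership, death_rate, beta_bounds=(-2.0, 0.0), steps=2000):
    """Grid search over beta within assumed bounds; alpha solved in closed form."""
    lo, hi = beta_bounds
    best = None
    for k in range(steps + 1):
        beta = lo + (hi - lo) * k / steps
        alpha, sse = best_alpha_and_sse(ownership, death_rate, beta)
        if best is None or sse < best[2]:
            best = (alpha, beta, sse)
    return best

# Synthetic five-year series generated from Smeed's original parameters,
# used only to check that the procedure recovers them (not the Egyptian data).
ownership = [0.020, 0.024, 0.029, 0.035, 0.042]          # vehicles per person
death_rate = [0.0003 * x ** (-2.0 / 3.0) for x in ownership]

alpha_hat, beta_hat, sse = fit_smeed(ownership, death_rate)
print(f"alpha = {alpha_hat:.6f}, beta = {beta_hat:.3f}, SSE = {sse:.3e}")
```

With only five yearly observations the sum of squares is typically flat around its minimum, which is consistent with the low sensitivity to α and β noted above.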
Moreover, different forms of regression models (linear, exponential, logarithmic, polynomial, and power regression models) were applied to relate the death rate as a dependent variable to the car ownership rate as an independent variable. The models’ performances were statistically measured by the coefficient of determination (R²) to reflect the models’ fits when using the whole dataset and different clusters of data.
The model estimations using all the data and the clustered data were compared on the basis of the R² values of the relationship between the actual versus the estimated death rates (estimated by Smeed’s law) and the car ownership rate versus the death rate for different regression models.
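The comparison of regression forms by their coefficient of determination can be sketched as follows. This is a simplified illustration under stated assumptions: the exponential and power forms are fitted by ordinary least squares on log-transformed values, the polynomial form is omitted to keep the sketch short, the coefficient of determination is always computed on the original scale, and the data are synthetic rather than taken from the study.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def compare_forms(xs, ys):
    """Fit four model forms and report R^2 for each on the original scale."""
    ln_x = [math.log(x) for x in xs]
    ln_y = [math.log(y) for y in ys]
    out = {}
    a, b = linear_fit(xs, ys)                       # y = a + b x
    out["linear"] = r_squared(ys, [a + b * x for x in xs])
    a, b = linear_fit(ln_x, ys)                     # y = a + b ln x
    out["logarithmic"] = r_squared(ys, [a + b * lx for lx in ln_x])
    a, b = linear_fit(xs, ln_y)                     # ln y = a + b x
    out["exponential"] = r_squared(ys, [math.exp(a + b * x) for x in xs])
    a, b = linear_fit(ln_x, ln_y)                   # ln y = a + b ln x
    out["power"] = r_squared(ys, [math.exp(a) * x ** b for x in xs])
    return out

# Synthetic car-ownership and death-rate values that follow a power law.
ownership = [0.020, 0.024, 0.029, 0.035, 0.042]
death_rate = [0.0003 * x ** (-2.0 / 3.0) for x in ownership]

fits = compare_forms(ownership, death_rate)
for name, r2 in fits.items():
    print(f"{name:12s} R^2 = {r2:.4f}")
```

Running the same function once on the whole dataset and once per cluster gives the kind of side-by-side comparison of fits described in this section.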
4. Data Characteristics and Processing
The present work utilized road traffic accident data that were obtained from the recorded data of the General Authority for Roads, Bridges, and Land Transport (GARBLT) of the Egyptian government [
41]. The disaggregated accident data records included the accidents’ dates and times (hour and minute) and locations. The road’s geometry and traffic conditions (the road’s width, length, vertical grade, curvature, and annual average daily traffic) at the accident’s location and the surface conditions (paved or unpaved, wet or dry, etc.) were also major aspects that were accounted for during the analysis. In addition, the weather conditions at the time of the accident, the type of accident (single-vehicle, front to front, front to back, etc.), the type of vehicle, and other information were also used. The sample size was 10,857 observations during the period between 1999 and 2003. Moreover, we obtained aggregated accident data for five years on urban roads and a rural highway (the desert road) in Egypt. The aggregated accident data records included the percentage of accidents on different roads in Egypt during the period between 1 January 2015 and 31 December 2019. This provided the percentage of road traffic accidents during different time periods (day or night); the percentage of road traffic accidents caused by different vehicle types (private car, truck, taxi, or other); and the percentage of road traffic accidents caused by human, environmental, vehicular, or unknown factors.
Comprehensive data processing was conducted for the disaggregated data to prepare them for statistical analysis. The data processing stage also involved checking the data validation and the descriptions of all of the observations. The processing phase included standardizing the values to ensure an equitable statistical comparison between different types of variables, regardless of the type of variable. The standardized values described a data point and scaled it by the population data by placing the different variables on the same scale to produce standard scores. The standard scores for each observed value of the variables were estimated on the basis of the mean and standard deviation of all observations of a certain parameter, as shown in Equation (3). Accordingly, the standard scores of the 10 parameters included the principal variables, namely Z-score (shape), Z-score (type), Z-score (time), Z-score (weather_ID), Z-score (surface_ID), Z-score (traffic_notes), Z-score (death), and Z-score(volume), which refer to the road’s shape, the road type, the time of the accident, the weather conditions, the surface conditions, the reported cause of the accident, the total number of deaths, and the traffic volume, respectively. Z-score (Crahed_Cars_Count) refers to the number of reported crashed vehicles and Z-score (hurt) refers to the total number of injuries.
\[ z_i = \frac{x_i - \mu}{\sigma} \qquad (3) \]
where $z_i$ is the standardized value of the ith traffic accident observation $x_i$, and µ and σ are the mean and standard deviation of all n observations.
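A minimal sketch of this standardization is given below. It assumes the population standard deviation is intended; the function name and the sample traffic volumes are hypothetical and only illustrate the calculation.

```python
from statistics import fmean, pstdev

def z_scores(values):
    """Standard score of each value: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x - mu) / sigma for x in values]

# Hypothetical traffic volumes (vehicles/day) for five accident records.
volumes = [12000, 15000, 9000, 30000, 14000]
scores = z_scores(volumes)
print([round(s, 3) for s in scores])
```

After this transformation every variable has zero mean and unit variance, so variables measured in large units (such as traffic volume) no longer dominate the distance calculations used in clustering.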
To achieve an in-depth understanding of the variables involved in the study, a descriptive analysis was conducted. The sample with 10,857 observations of disaggregated traffic accident data was analyzed in order to manage the data and present them accurately before executing the clustering analysis. The analysis summarized the statistics for the different scale variables and measures of the data. The SPSS package was used to calculate the descriptive statistics and to test the potential significance and importance of the nominated group variables. The frequency, validity, and cumulative percentage were obtained for each nominated group. The outputs were described and summarized as shown in
Figure 1 and Figure 2 (based on the analysis by [41]).
Figure 1 shows that human factors affected 70% of all traffic accidents, while vehicular factors were involved in 30%. The majority of accidents occurred because of over-speeding in clear weather conditions and on dry roads. Accidents were distributed across all roads, with notably high rates on roads in Bani Sweif, Canal, and Sinai, while roads in the South Valley and the Red Sea had lower accident rates.
Figure 1 shows some conditions with very low occurrence, such as heavy rains, sandstorms, and dusty conditions. Because of the low probability of such conditions occurring, they were screened during data processing and the clustering process, and were consequently excluded from the accident model, as illustrated in the next sections. The analysis of the aggregated data showed a similar pattern to that in
Figure 2. This also shows that the majority of the accidents occurred in the daytime because of human factors. Private cars were involved in 59% of the accidents. The Aswan–Cairo and Cairo–Alexandria rural roads had the highest accident rates.
The tables and figures below illustrate the results of the clustering analysis for the four clustering results where k = 2, 3, 4, and 5. The results show the total number of road traffic accidents assigned to each cluster in the four clustering models. As shown in
Table 1, when k = 2, 95.88% of the total number of accidents were assigned to the first cluster, while only 4.12% were assigned to the second cluster. For the case of k = 3, only 2.53% of all accidents were assigned to the first cluster, 89.16% were assigned to the second cluster, and 8.30% were assigned to the third cluster. In the case of k = 4, 89.47% of all accidents were assigned to the first cluster, while 1.02%, 1.05%, and 8.45% of the accidents were assigned to the second, third, and fourth clusters, respectively. Finally, in the case of k = 5, it was found that 1.02% of all accidents were assigned to the first and second clusters, 7.53% were assigned to the third cluster, and 8.1% and 82.22% were assigned to the fourth and fifth clusters, respectively. The distance between the final clusters’ centers was found to be 4.4 for k = 2. When k = 3, the distance between the final clusters’ centers was found to be 5.45 between the first and second clusters, 6.47 between the first and third clusters, and 4.32 between the second and third clusters. In the case of the model with four clusters, the distance between the final clusters’ centers was found to be 5.22 between the first and second clusters, and 7.86 and 2.65 between the first cluster and the third and fourth clusters, respectively, while the distance between the second and third clusters was found to be 9.46. The distance from the fourth cluster’s center was found to be 5.99 to the second cluster’s center and 6.665 to the third cluster’s center. Finally, for k = 5, the distances from the first cluster’s center were 9.5, 6.02, 6.25, and 5.23 to the centers of the second, third, fourth, and fifth clusters, respectively. For the second cluster’s center, the distances were found to be 6.68, 8.84, and 7.9 from the third, fourth, and fifth clusters’ centers, respectively.
The values of different variables in the clusters’ centers for the different k-means scenarios are presented in
Figure 3a–d to compare the different clustering patterns with k = 2, k = 3, k = 4, and k = 5, respectively.
Figure 3a shows that the two clusters had different contents, which indicated that each cluster had a homogeneous combination of data on different factors affecting traffic accidents, and both clusters were different from each other. This means that the data could be reasonably divided into two different groups. This homogeneity of a single group of data and the variation in the data groups between each other may improve the model’s fit when modeling separate groups over the fit of the model with combined data. To test whether splitting the data into more groups was meaningful or not, we used k = 3, k = 4, and k = 5, as shown in
Figure 3b–d.
Figure 3b,c shows that splitting the data produced three and four internally homogeneous groups that were different from each other, while
Figure 3d shows that, in the case of k = 5, the second and the third clusters had many variables with similar characteristics. A visual comparison of
Figure 3b–d shows that k = 3 and k = 4 were suitable. The ANOVA test presented in Table 2 shows that the number of cars involved in the traffic accidents (Crahed_Cars_Count) had low statistical significance for k = 2, 3, 4, and 5 (p-values of 0.940, 0.748, 0.473, and 0.897, respectively), while all other variables were statistically significant in the case of k = 4. The aggregated data showed the same result: clustering the aggregated data into four groups was better. Although all factors were statistically significant (as shown in Table 3), Figure 4d shows that clusters 2 and 5 (in the case of k = 5) were not identical and that only 6 cases belonged to the second cluster, as shown in Table 4. Therefore, clustering the data into four groups may give a better model fit than modeling the whole dataset.
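The ANOVA comparison of a variable across clusters rests on the F statistic, the ratio of the between-cluster to the within-cluster mean square. The sketch below computes only that statistic for one variable; the p-value, which requires the F distribution, is not computed here. The function name and the three small groups are hypothetical and are not values from Table 2 or Table 3.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """F = (SS_between / (k - 1)) / (SS_within / (n - k)) for k groups, n values."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical standardized values of one variable in three clusters.
clusters = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(clusters)
print(f"F = {f_stat:.2f}")
```

A large F value indicates that the cluster means of the variable differ by much more than the scatter inside the clusters, which is what a statistically significant variable in the ANOVA tables reflects.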
5. Modeling of the Clustered Data
To examine the goodness of fit of the developed Smeed models, the actual death rates were compared with the estimated ones.
Figure 5 shows the fits of the estimated values to the real values for (a) all the data, (b) the first cluster’s data, (c) the second cluster’s data, (d) the third cluster’s data, and (e) the fourth cluster’s data. The data points were scattered close to a 45° line. This reflects agreement between the actual and predicted values, which was supported statistically by the coefficient of determination (R
2). The predicted and observed values agreed closely, and the R
2 values were acceptable for all datasets. The R
2 values for the different clusters were higher than the values for all of the data, meaning that clustering improved the models’ fits. For the regression models, the R
2 values in
Figure 6,
Figure 7,
Figure 8,
Figure 9 and
Figure 10 reflect the models’ fits by correlating the death rate as a dependent variable with the car ownership rate as an independent variable for (a) all the data, (b) the first cluster’s data, (c) the second cluster’s data, (d) the third cluster’s data, and (e) the fourth cluster’s data using linear, exponential, logarithmic, polynomial, and power regression models.
Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10 all show that the R² values for the different clusters were greater than the values for all of the data. This means that clustering improved the models’ fits regardless of the model type, except for a few cases in the power and logarithmic models. These exceptions may stem from the models’ characteristics or from a sample size that was too small for developing certain models.
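The five regression forms compared above can be fitted and scored with a short routine such as the one below. It is a sketch under stated assumptions: the polynomial is taken to be quadratic (the order is not specified here), the exponential and power models are fitted on log-transformed values, R² is computed on the original scale, and the cluster data and function names are hypothetical.

```python
import math

def ols(u, v):
    """Intercept and slope of the least-squares line v = c0 + c1 * u."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    c1 = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / sum((a - mu) ** 2 for a in u)
    return mv - c1 * mu, c1

def quadratic_fit(x, y):
    """Coefficients [c0, c1, c2] of y = c0 + c1*x + c2*x^2 (normal equations)."""
    s = [sum(v ** p for v in x) for p in range(5)]
    t = [sum(w * v ** p for v, w in zip(x, y)) for p in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for i in range(3):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[piv] = m[piv], m[i]
        for r in range(3):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [a - f * b for a, b in zip(m[r], m[i])]
    return [m[i][3] / m[i][i] for i in range(3)]

def r2(y, yhat):
    mean = sum(y) / len(y)
    return 1 - sum((a - p) ** 2 for a, p in zip(y, yhat)) / sum((a - mean) ** 2 for a in y)

def fit_all(x, y):
    """R^2 of each model form; x and y must be strictly positive."""
    lnx = [math.log(v) for v in x]
    lny = [math.log(v) for v in y]
    preds = {}
    a, b = ols(x, y)
    preds["linear"] = [a + b * v for v in x]
    a, b = ols(x, lny)                      # ln y = a + b x
    preds["exponential"] = [math.exp(a + b * v) for v in x]
    a, b = ols(lnx, y)                      # y = a + b ln x
    preds["logarithmic"] = [a + b * u for u in lnx]
    a, b = ols(lnx, lny)                    # ln y = a + b ln x
    preds["power"] = [math.exp(a + b * u) for u in lnx]
    c = quadratic_fit(x, y)
    preds["polynomial"] = [c[0] + c[1] * v + c[2] * v * v for v in x]
    return {name: r2(y, p) for name, p in preds.items()}

# Hypothetical clusters (NOT the study's data): car ownership rate vs. death rate.
x_a, y_a = [0.02, 0.03, 0.04, 0.05, 0.06], [30, 24, 21, 18, 17]
x_b, y_b = [0.02, 0.03, 0.04, 0.05, 0.06], [12, 11, 9, 8.5, 8]

for name, (xs, ys) in {"all": (x_a + x_b, y_a + y_b),
                       "cluster A": (x_a, y_a),
                       "cluster B": (x_b, y_b)}.items():
    print(name, fit_all(xs, ys))
```

Comparing the printed dictionaries for the pooled data and for each cluster mirrors the comparison made in Figures 6 to 10.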
Compared with the previous relevant literature, the following insights can be highlighted. Although extensive research has been carried out on heuristic-based cluster analysis, the statistical properties of these methods are generally unknown, whereas the statistical properties of probability model-based clustering techniques are better understood [
9]. Therefore, the current study relied on k-means clustering analysis. The results of [
2,
3] suggested that the combined use of parametric and non-parametric methods may effectively overcome the limits of each group of methods, with satisfactory prediction accuracies and the ability to interpret the factors contributing to fatal and serious crashes. However, we sought a simple model with fewer variables that would be suitable for practitioners. Nicholson [
7] stated that there is a need for a simple way to cluster accident data, and Ref. [
4] recommended simplifying accident models by eliminating variables. Although this is expected to reduce a model’s accuracy, it might make the model agile enough to be used in developing countries, where traffic crash data are usually scarce [
4]. Moreover, many studies (e.g., [
4,
9]) have revealed that traffic accident data should be mainly clustered on the basis of vehicle type, whereas the cluster analysis in this study showed that other contributory factors are also important for data clustering.
Riccardi et al. [
2] found that the proper clustering of the factors that affect fatal and serious-injury accidents is largely different from that of the factors that contribute to accidents causing slight injuries. They tried to avoid the imbalanced distribution of the variables and the drawbacks of the error rate, which assumes that all errors have equal values (not true for imbalanced data) and misclassifies some classes, such as fatal and serious-injury crashes. Therefore, Refs. [2,4] divided the modeling data into fatal and serious-injury accidents, and into severe crashes and non-severe crashes, respectively, which was difficult here because of the availability and characteristics of the data used in this study.
6. Summary and Conclusions
On the basis of aggregated and disaggregated long-horizon traffic accident datasets in Egypt, the present study compared the performance of some existing models. To validate the models’ fits to the characteristics of different data, the traffic accident data were clustered into reasonable groups. Some parsimonious models (with fewer variables) that can be implemented in real-world practice were used. The k-means algorithm was used for clustering analysis. Ten contributory parameters were included as the principal causal factors, namely the road’s shape, the road type, the time of the accident, the weather conditions, the surface conditions, traffic volumes, the number of crashed vehicles reported, the cause as reported by the traffic police, the number of deaths, and the number of injuries. Four clustering modes were evaluated, with k = 2, 3, 4, and 5 clusters. The factors within each cluster were summarized, and ANOVA was used to examine the variances within groups and their centers in order to determine the most suitable number of clusters.
Smeed’s model was applied to relate the death rate resulting from traffic accidents in Egypt to car ownership, using five years of historical data. The model’s performance was evaluated in terms of the errors between the estimations and the observations. Moreover, different forms of regression models (linear, exponential, logarithmic, polynomial, and power regression models) were used to relate the death rate as a dependent variable to car ownership as an independent variable. The results of model fitting showed that the R² values for the individual clusters were higher than the values for all the data, indicating that clustering improved the models’ fits regardless of the type of model. The exceptions were a few cases in the power and logarithmic models, which might have been caused by the models’ characteristics or the sample size. The results revealed that data clustering has a significant impact on classifying the data on the basis of the characteristics of the most important causal factors of traffic accidents. Consequently, predictive models can be developed on the basis of the other remaining factors (factors not considered during the clustering process).
Compared with the previous relevant literature, this study avoided using heuristic-based cluster analysis, as the statistical properties are generally unknown, whereas the statistical properties of the k-means algorithm used here are better understood. This approach may support the use of simple models with few variables that are suitable for practical applications, as recommended by several previous studies. Many previous studies mainly clustered accident data based on the type of vehicle, while the cluster analysis conducted in this study showed that other variables are also important for data clustering. We close by noting that, if data with larger sample sizes and different characteristics become available, such as different populations, car ownership, and death rates, these could be used to verify the methodology proposed here. More metrics, such as sensitivity and precision, are highly recommended for future studies to investigate the capability of the developed models to predict the death rate, which was not feasible here because of the availability and characteristics of the data used.
Finally, it should be emphasized that the descriptive cluster analysis followed in this study focused on finding a concise description for each traffic accident type, which can be useful during the interpretation of subsequent analyses. However, the results of the cluster analysis conducted in this study also contain other useful information that can provide interesting insights into various traffic accident types, which may guide decision-makers to deploy appropriate preventive measures for road traffic accidents toward sustainable transportation systems.