1. Introduction
Over the past decade, traffic fatalities have trended consistently upward, in step with the expansion of road mileage. According to statistics from the Federal Highway Administration (FHWA) [1], road mileage increased by approximately 2.5% between 2013 and 2022, an average annual growth rate of about 0.3%. Similarly, the number of registered vehicles has risen at an average rate of 1.5% per year. This expansion of road infrastructure, coupled with increased vehicle usage, has contributed to a rise in traffic fatalities. Notably, the fatality rate per 100 million Vehicle Miles Traveled (VMT) has increased by 23%, rising from 1.10 to 1.35 over the past decade. These trends underscore the growing safety concerns associated with the expansion and utilization of transportation systems. Moreover, similar patterns in road mileage, VMT, and traffic fatalities are observed not only within the European Union but also in other countries, such as South Korea.
Traffic accidents can be attributed to various human factors, including distracted driving, speeding, and drowsy driving, as well as environmental factors, road conditions, technical issues, weather, road infrastructure, and traffic system defects [2]. In terms of human factors, emerging technologies, particularly autonomous vehicles, are anticipated to play a crucial role in accident prevention. While road management agencies conduct routine inspections to mitigate accidents resulting from hazardous road conditions, inadequately maintained roads continue to incur significant social costs due to the ongoing expansion of road networks, increasing road deterioration, and a shortage of personnel dedicated to road management. For instance, potholes can cause drivers to lose control, potentially leading to accidents. In 2021, potholes were responsible for 0.8% of road accidents, resulting in 1.4% of fatalities and 0.6% of injuries [3]. Furthermore, road irregularities reduce vehicle speeds by 55% and increase emissions by 2.49% [4].
Although human factors account for the majority of traffic accidents, the contribution of environmental factors is significant; for instance, Ref. [5] highlights that environmental factors are responsible for 34% of all accidents. To address accidents related to environmental conditions, ongoing research is exploring the use of various sensing technologies, drones, and artificial intelligence for effective road and facility management. However, these methods have not yet been widely implemented on actual roads. As mobile phone usage proliferates, crowdsourced data have emerged as a promising and sustainable alternative for addressing traffic accident issues by leveraging public observations and insights. This approach facilitates the real-time collection of detailed, localized information on road conditions, hazards, and traffic patterns that may not be captured by traditional data sources. By involving citizens in reporting and analyzing traffic-related incidents, traffic agencies can gain a more comprehensive understanding of high-risk road segments and their underlying factors. Moreover, engaging the community enhances public awareness and participation in road safety initiatives, thereby contributing to the reduction in traffic accidents and the enhancement of overall road safety. Above all, the sustainability of crowdsourced data in traffic safety lies in its ability to provide ongoing, scalable, and cost-effective insights for improving transportation systems.
Traditional research on identifying high-risk road segments typically employs inductive reasoning based on historical data collected from road sensors. However, conclusions reached in this way can change as new information becomes available, which is a drawback in terms of flexibility. This study aims to utilize crowdsourced data as an alternative means of pinpointing areas where accidents are expected to occur due to environmental factors. Rather than relying solely on inductive reasoning, this study introduces a deductive approach that includes the analysis of crowdsourced data as a potential direct factor in accidents. The hypothesis of this study is that “traffic accidents occur in areas where user feedback is frequent”, and this hypothesis is tested using crowdsourced data and fatal accident records. The significance of this study lies less in quantifying the correlation between user feedback and traffic accidents than in empirically investigating how crowdsourced data based on user feedback can be utilized in relation to traffic accidents.
2. Literature Review
Crowdsourced data are emerging as an extensive yet cost-effective source of traffic-related information [6,7,8], as crowdsourcing “makes every user an instant social sensor” [9]. This research investigates the correlation between traffic accidents and user feedback, proposing novel applications based on crowdsourced data. To this end, a comprehensive review of existing literature on crowdsourcing in transportation has been conducted. The findings categorize the literature into three distinct groups based on their specific applications in transportation.
The first group of studies utilizes crowdsourced data for incident detection. Salas et al. [10] examined the feasibility of using Twitter for real-time incident detection in the United Kingdom (UK). They proposed a comprehensive methodology that integrates Natural Language Processing (NLP) techniques with a Support Vector Machine (SVM) algorithm to classify public tweets, demonstrating its applicability in identifying traffic-related tweets. Pandhare et al. [11] leveraged tweets related to traffic and accidents to detect road events. In that study, logistic regression and SVM classifiers were applied in conjunction with text-mining techniques to assign appropriate class labels to road events.
Similarly, Zhang et al. [8] employed deep belief network (DBN) and long short-term memory (LSTM) techniques to detect traffic accidents using Twitter-based data. Lu et al. [12] proposed a crowdsourced approach to forecast city-level traffic incidents by collecting social media data on adverse weather and traffic reports. They combined adverse weather reports and weather-related data from Weibo and tweets to develop a regression model, which demonstrated superior predictive performance in forecasting city-level traffic incidents compared to traditional approaches. Dabiri and Heaslip [13] employed a method known as “bag-of-words” for incident detection, transforming tweets into numerical feature vectors that can be processed by computers. They utilized unsupervised deep learning algorithms for modeling tweets and implemented supervised deep learning architectures. Rettore et al. [14] introduced the Road Data Enrichment (RoDE) framework, which leverages Twitter data to enhance Intelligent Transportation System (ITS) services through Twitter MAPS (T-MAPS) for route planning and Twitter Incident (T-Incident) for event detection. While T-MAPS achieves up to 62% similarity with Google Maps’ routes, T-Incident demonstrates over 90% accuracy in identifying traffic events, showcasing its superior performance in incident detection. However, this study also revealed the limitations of crowdsourced data in planning applications. Alkouz et al. [15] presented SNSJam, a system that employs cross-lingual data (English and Arabic) from Twitter and Instagram to detect and predict traffic jams. Experimental findings demonstrated that integrating data streams from multiple languages and platforms significantly improves the accuracy of traffic event detection. Waze data are another significant source of crowdsourced information for incident detection. Amin-Naseri et al. [16] investigated Waze data to identify the characteristics of this social sensor and to provide a comparison with common data sources in traffic management. They empirically demonstrated that crowdsourced data could offer extensive coverage, providing timely reporting while maintaining reasonable geographic accuracy. Recently, several studies [17,18,19] have developed methodologies to detect road incidents by processing Twitter or Waze data.
The second group of studies utilizes crowdsourcing data to detect or monitor pavement conditions. Monitoring pavement conditions is essential for effective pavement management and maintenance. Traditional methods, such as accelerometers, videos, and laser scanning, are constrained by equipment and labor limitations, which can delay maintenance actions. Recent studies have focused on utilizing Waze data to monitor pavement conditions. Gu et al. [20] employed crowdsourced Waze data to evaluate pavement conditions, proposing Pothole Report Density (PRD) and Weather Report Density (WRD) as surrogate measures. They utilized a geographically weighted random forest (GWRF) model to analyze the relationship between crowdsourced data and the official Pavement Quality Index (PQI), finding that PRD exhibits a high correlation with the PQI. Similarly, Liu et al. [21] demonstrated significant benefits of crowdsourced data in pothole detection. Gu et al. [22] developed a framework for pothole detection and evaluation using reports from the Waze app, employing two spatiotemporal density models: STKDE and ST-DBSCAN. This framework was validated against official pavement maintenance records in Nashville, Tennessee. The study found that crowdsourced reports are capable of accurately identifying existing potholes while also revealing additional potholes that regular patrols may overlook.
The third group of studies employs crowdsourcing data for planning and modeling in the field of transportation. Lin et al. [23] introduced the Topic-Enhanced Gaussian Process Aggregation Model (TEGPAM) to predict road speed using multi-source data, including INRIX data and tweets. They addressed challenges such as location uncertainty, language ambiguity, and data heterogeneity. Liao et al. [24] developed a data fusion framework to compare travel times by car and public transit. They combined multiple data sources to estimate travel times for both modes, including traffic data, transit data, and travel demand estimated from Twitter data. Lin and Li [25] created a traffic accident impact model using crowdsourcing data, categorizing accident-induced congestion into four levels and extracting spatiotemporal features, weather information, and accident details from the crowdsourced data. They trained three classification models and tested them to predict congestion levels and durations. Essien et al. [26] introduced a deep learning model that integrates tweet data with traffic and weather information. Their model demonstrated improved accuracy in traffic flow predictions when tested in Greater Manchester compared to classical and machine learning models. Janež et al. [27] investigated the potential of crowdsourcing data to supplement or replace conventional vehicle counters, such as inductive loop counters (ILC). In that study, crowdsourced data were collected from Telraam counters, which are low-cost cameras operated by citizens. They applied regression models to compare ILC and Telraam counters across four segments. Another study employed user-generated feedback for real-time decision making in traffic management and planning. Dienstl and Scholz [28] focused on utilizing user-generated feedback, specifically through Volunteered Geographic Information (VGI), in demand-responsive transport systems. By integrating user feedback directly into transport management, the study demonstrated how this approach can provide immediate, actionable data for service improvements. A pilot project in Austria confirmed that citizen-sourced data are an effective tool for real-time decision making in transport systems. Liu and Feng [29] developed a deep learning model to predict speed using crowdsourced police enforcement data, arguing that various underlying factors, such as police enforcement, can influence driving behaviors and should be considered in the development of speed prediction models.
This study distinguishes itself from previous studies in several key aspects. The main difference is that it specifically aims to leverage crowdsourced data for safety management rather than traffic management. Existing studies indicate that user feedback in crowdsourced data is employed in various applications, such as incident detection, traffic condition prediction, route planning, and the identification of road conditions such as potholes. The majority of studies utilize crowdsourced data predominantly for traffic management and operations, with relatively few addressing safety management. Although some studies [8,10,11,12,13,14,15,16,17,18,19] leverage crowdsourced data to detect accidents or predict traffic conditions resulting from incidents, they typically consider accidents as one of several traffic incidents, rather than focusing on accident detection as the primary objective. Specifically, this study discusses the potential applications of crowdsourced data in identifying high-crash locations within the context of accident analysis. Traditionally, traffic conditions (e.g., traffic flow and speed), road characteristics (e.g., road type and number of lanes), and environmental data (e.g., weather conditions) have been employed to identify high-crash locations. In contrast, this study explores whether crowdsourced data can be utilized to classify high-crash locations with greater precision.
Additionally, this research employs a nationally developed application to gather feedback from road users. Previous studies have predominantly relied on commercial applications such as Twitter or Waze to collect user feedback for their research. However, as highlighted by prior studies [13,22,30,31], the reliability of such data is often questionable, necessitating further refinement, which incurs significant time and cost. In contrast, this study uses a specialized application designed to collect road user feedback specifically for road management. This method is anticipated to offer a comparatively higher level of data reliability while requiring less cost and effort for data refinement.
Finally, during the literature review process, we identified a similar study that compares police crash reports with Waze incident reports [32]. Unlike that study, this research focuses on identifying high-crash areas within a national road network, distinguishing these areas more precisely by utilizing road complaint reports.
3. Methodology
This study aims to conduct a spatial analysis of crowdsourced data alongside fatal accident data to assess the hypothesis that traffic accidents occur in areas with frequent user complaints. To verify this hypothesis, it was essential to determine whether each dataset is randomly distributed or follows a distinct spatial pattern. Thus, the first step of the methodology was to perform a spatial autocorrelation analysis based on the location information of each dataset.
Spatial autocorrelation refers to the degree to which observations in a spatial dataset are correlated with one another as a function of their location in space. It measures whether objects or events that are close to each other in a geographical area are more similar (positive spatial autocorrelation) or dissimilar (negative spatial autocorrelation) than those that are further apart.
Figure 1 illustrates hypothetical examples of geographical spaces exhibiting positive and negative autocorrelation.
Spatial autocorrelation is typically measured using statistical indicators such as Moran’s I and Geary’s C. This study employs local Moran’s I, defined by Equations (1) and (2), as it is well-suited for analyzing spatially dispersed events, such as traffic accidents:
where
N: number of spatial units analyzed;
i, j: indices of the spatial units (1, 2, 3, …, N; i ≠ j);
w_ij: spatial weight between units i and j.
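Equations (1) and (2) are not reproduced in this text. For orientation only, one widely used formulation of local Moran's I is sketched below; the attribute value x_i, its mean x̄, and the variance term S_i² are symbols assumed here for illustration and may differ from the notation in the original equations.

```latex
I_i = \frac{x_i - \bar{x}}{S_i^{2}} \sum_{j=1,\; j \neq i}^{N} w_{ij}\,\bigl(x_j - \bar{x}\bigr),
\qquad
S_i^{2} = \frac{\sum_{j=1,\; j \neq i}^{N} \bigl(x_j - \bar{x}\bigr)^{2}}{N - 1}
```

Under this form, a positive I_i indicates that unit i is surrounded by units with similar deviations from the mean, and a negative I_i indicates dissimilar neighbors.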
If the distribution of an event exhibits positive spatial autocorrelation, it can be interpreted that data with similar characteristics are geographically clustered together. For instance, if areas with a high incidence of traffic accidents are located close to one another, these regions can be described as exhibiting positive spatial autocorrelation.
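The interpretation above can be illustrated with a minimal Python sketch. It assumes one common formulation of local Moran's I (deviation of unit i, scaled by the variance of the other units, times the weighted sum of neighboring deviations); the function name, the binary adjacency weights, and the toy counts are hypothetical and are not taken from the study's data.

```python
def local_morans_i(values, weights):
    """Local Moran's I for each spatial unit.

    values  -- list of attribute values (e.g. event counts per unit)
    weights -- N x N nested list; weights[i][j] is the spatial weight
               between units i and j (the diagonal is ignored)
    Assumes the values are not all identical (variance term > 0).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    result = []
    for i in range(n):
        # Variance of all units except i (the S_i^2 term).
        s2 = sum(dev[j] ** 2 for j in range(n) if j != i) / (n - 1)
        # Spatially weighted sum of the neighbors' deviations.
        lag = sum(weights[i][j] * dev[j] for j in range(n) if j != i)
        result.append(dev[i] / s2 * lag)
    return result


# Four units along a road; each unit neighbors the adjacent one(s).
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = local_morans_i([10, 9, 1, 0], chain)     # high next to high
alternating = local_morans_i([10, 0, 10, 0], chain)  # high next to low
```

In the first toy case every unit yields a positive value (similar neighbors), and in the second every unit yields a negative value (dissimilar neighbors), matching the positive and negative autocorrelation cases described above.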
After evaluating local Moran’s I to determine spatial autocorrelation for each dataset, hotspot analysis was conducted exclusively on those datasets that exhibit positive spatial autocorrelation. This study employs a density-based clustering technique known as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) for the hotspot analysis. DBSCAN forms clusters based on data density, creating clusters only in areas with a higher concentration of data while treating data in low-density areas as noise and excluding them. A key advantage of this density-based clustering technique is that it does not require the user to pre-specify the number of clusters and can recover clusters of arbitrary shape, although its results depend on the chosen neighborhood radius and minimum number of points.
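To make the clustering step concrete, the following is a minimal, standard-library sketch of the DBSCAN procedure on planar coordinates. It is not the implementation used in this study; the parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a core point), their values, and the toy points are assumptions chosen for illustration.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each (x, y) point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # core point: keep growing the cluster
    return labels


# Two dense groups of events and one isolated event.
events = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(events, eps=1.5, min_pts=3)
```

With these toy inputs the two dense groups receive separate cluster ids and the isolated event is labeled as noise, without the number of clusters being specified in advance.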
Hotspot analysis facilitates the visual exploration of the spatial distribution of multiple events. To evaluate whether the occurrence of one event influences another, this study leverages the results of the hotspot analysis to classify areas where both events occur simultaneously (i.e., where two different hotspots overlap). By calculating the event occurrence density within the identified hotspots for each dataset and comparing the densities between the two classified groups, this study seeks to clarify the relationship between the occurrences of the two events.
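The comparison described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in this study: each hotspot is approximated by the padded bounding box of its member points, overlap is tested between boxes, and density is the number of events divided by the box area. The function names, the `pad` parameter, and the toy coordinates are hypothetical.

```python
from statistics import mean


def bounding_box(points):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of a hotspot."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_overlap(a, b):
    """True if two bounding boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def density(points, pad=0.5):
    """Events per unit area inside the padded bounding box of a hotspot."""
    x0, y0, x1, y1 = bounding_box(points)
    area = (x1 - x0 + 2 * pad) * (y1 - y0 + 2 * pad)
    return len(points) / area


def compare_densities(report_hotspots, accident_hotspots):
    """Mean report density in hotspots that do and do not overlap accident hotspots."""
    accident_boxes = [bounding_box(h) for h in accident_hotspots]
    overlapping, isolated = [], []
    for hotspot in report_hotspots:
        box = bounding_box(hotspot)
        hit = any(boxes_overlap(box, b) for b in accident_boxes)
        (overlapping if hit else isolated).append(density(hotspot))
    return (mean(overlapping) if overlapping else None,
            mean(isolated) if isolated else None)


report_hotspots = [
    [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)],  # near an accident hotspot
    [(20, 20), (22, 20), (20, 22)],                # far from any accident hotspot
]
accident_hotspots = [[(0.5, 0.5), (1.5, 1.5), (1, 1)]]
overlap_density, isolated_density = compare_densities(report_hotspots,
                                                      accident_hotspots)
```

A higher mean density in the overlapping group than in the isolated group would be consistent with the hypothesis; in practice the hotspot geometry and area definition would need to match the clustering output.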
Figure 2 illustrates the methodology employed in this study.
4. Analysis Data and Research Area
To validate the research hypothesis, this study conducts a spatial analysis using two sets of data: crowdsourced data and fatal accident occurrence data. The Ministry of Land, Infrastructure, and Transport in South Korea developed a mobile-based crowdsourcing data collection application, which was launched in 2013. The resulting dataset includes records of inconveniences experienced by road users, as well as follow-up reports on the actions taken to address these issues. The data used here were collected from 2014 to 2022.
Table 1 presents typical examples of each attribute included in the crowdsourced data.
The inconvenience reports are classified into six categories: poor road surface condition, potholes, roadkill and falling rocks, poor drainage, defective road facilities, and others. The ‘others’ category encompasses cases where reports are either unspecified or cannot be classified under the five primary categories. The records are collected and classified into 17 regions based on location data. The dataset used in this study contains 65,680 inconvenience reports across these 17 regions for the period from 2014 to 2022.
Fatal accident data were obtained through an open API provided by the Traffic Accident Analysis System (TAAS), operated by the Korea Road Traffic Authority. This dataset was also collected from 2014 to 2022 and includes information on the number of fatalities, injuries, types of accidents, and locations where the accidents occurred. This study utilizes the fatal accident dataset associated with the crowdsourced data, which includes 42,602 fatal accident records.
Table 2 presents a sample of the fatal accident data.
The analysis area encompasses all regions of South Korea where both crowdsourced data and fatal accident data are available.
Figure 3 visualizes the location information extracted from each dataset after applying the Lambert conformal conic projection to convert it into a Cartesian coordinate system. The rectangle in Figure 3 marks the Seoul metropolitan area.
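As an illustration of the coordinate conversion mentioned above, the sketch below implements the forward Lambert conformal conic projection on a sphere. It is a simplified approximation: the projection parameters used in this study are not stated, so the origin (38°N, 127.5°E), the standard parallels (36°N and 40°N), the spherical Earth radius, and the sample point are all assumed values for demonstration only.

```python
import math


def lambert_conformal_conic(lat, lon, lat0, lon0, lat1, lat2,
                            radius=6371000.0):
    """Spherical Lambert conformal conic forward projection.

    Inputs are in degrees; returns (x, y) in metres relative to the
    origin (lat0, lon0). lat1 and lat2 are the two standard parallels
    and must differ.
    """
    phi, lam, phi0, lam0, phi1, phi2 = map(
        math.radians, (lat, lon, lat0, lon0, lat1, lat2))

    def t(p):
        return math.tan(math.pi / 4 + p / 2)

    # Cone constant and scaling factor derived from the standard parallels.
    n = math.log(math.cos(phi1) / math.cos(phi2)) / math.log(t(phi2) / t(phi1))
    f = math.cos(phi1) * t(phi1) ** n / n
    rho = radius * f / t(phi) ** n      # radius of the point's parallel
    rho0 = radius * f / t(phi0) ** n    # radius of the origin's parallel
    theta = n * (lam - lam0)
    return rho * math.sin(theta), rho0 - rho * math.cos(theta)


# Assumed parameters: origin 38N 127.5E, standard parallels 36N and 40N.
params = dict(lat0=38.0, lon0=127.5, lat1=36.0, lat2=40.0)
origin_xy = lambert_conformal_conic(38.0, 127.5, **params)
north_xy = lambert_conformal_conic(39.0, 127.5, **params)
east_xy = lambert_conformal_conic(38.0, 128.5, **params)
```

The origin maps to (0, 0), points north of it on the central meridian get a positive y, and points east of it get a positive x, so distances for density-based clustering can then be computed in metres rather than degrees.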
6. Conclusions
Identifying a single cause of traffic accidents is challenging. While some accidents can be attributed to isolated factors such as drunk driving or speeding, most result from a complex interplay of various elements. To address this issue, multiple data sources, including traffic conditions (e.g., traffic flow, speed), road characteristics (e.g., road type, number of lanes), and environmental data (e.g., weather conditions), are integrated to identify locations with a high crash incidence. Additionally, inconvenience reports from road users can serve as valuable indicators of potential accidents. Locations with frequent user reports or a high density of such reports require close inspection and maintenance by road administrators to mitigate potential accident-related factors.
As mobile phones continue to proliferate, crowdsourcing data have emerged as a promising alternative for addressing traffic accident issues by leveraging public observations and insights. Above all, crowdsourced data have a significant advantage in that they provide sustainable, scalable, and cost-effective insights for traffic safety. This study aims to test the potential of crowdsourcing data to identify segments or areas with a high frequency of accidents. By integrating crowdsourcing data with accident records, we investigate whether traffic accidents are more likely to occur in areas with a high density of user inconvenience reports.
Various spatial analysis techniques were employed to examine the two datasets. Initially, spatial autocorrelation analysis was performed to evaluate the feasibility of cluster analysis, confirming its applicability to both datasets. Density-based clustering methods were then utilized to identify hotspots based on both complaint and fatal accident data. Finally, a comprehensive analysis of the crowdsourcing data and fatal accident data within overlapping hotspot areas revealed several key findings that substantiate the research hypothesis:
Spatial autocorrelation analysis revealed that inconvenience reports and accident events each form spatial clusters.
Hotspots of inconvenience reports near high-risk accident areas exhibited a higher concentration of accident-related reports.
Inconvenience reports are not uniformly distributed; rather, they tend to cluster near high-risk accident locations.
Density analysis demonstrated that traffic accidents tend to occur in areas with frequent inconvenience reports.
This study has clear limitations in explaining the underlying causes of the correlation between user feedback and traffic accidents. While more in-depth research is needed to identify the causes, we speculate that Heinrich’s law can help explain the results of this study. Interestingly, recent studies [34,35] have shown that Heinrich’s law can also be applied to events such as traffic accidents.
In practice, identifying the causes contributing to high-risk accident areas is challenging. However, if high-risk accident areas are located near hotspots of inconvenience reports, this study suggests that the reported risks are closely related to accidents occurring in these locations, indicating a need for immediate action to mitigate the identified risks. The significance of this study lies not in quantifying the correlation between inconvenience reports and traffic accidents, but in empirically exploring the potential application of crowdsourcing data in relation to traffic safety. This research provides foundational insights into the relationship between user feedback regarding road inconvenience and accident occurrences.
Nevertheless, several directions for future research are apparent. Firstly, it is essential to gather additional crowdsourcing data to enhance the empirical foundation. Furthermore, developing alternative analytical methodologies is crucial to address issues related to under- or over-clustering identified during the cluster analysis.