1. Introduction
Traffic safety research plays a crucial role in enhancing road safety by examining the root causes of accidents, identifying hazardous behaviors or factors, and proposing effective countermeasures [1]. Despite advancements in vehicle safety, enhancements in road design, and the implementation of various policies, traffic safety remains a significant challenge. One important aspect of road safety research is understanding the contributing factors that lead to different crash severity outcomes, which is essential for mitigating crash consequences.
The challenge of traffic accident modeling stems from its multifaceted nature, involving an intricate interplay among diverse factors such as human behavior, vehicle dynamics, traffic conditions, environmental factors, and roadway characteristics. Traffic safety research has primarily focused on understanding causality from observational data, due to the impracticality of conducting controlled experiments in this field [1]. Traditional statistical and econometric methods have long been studied in the traffic safety domain [2,3,4,5,6] for causality understanding. These classic methods suffer from several limitations, including constraints imposed by specific functional forms and distributional assumptions [1], as well as subtle confounding effects, also referred to as heterogeneous treatment effects [1,7,8], which often lead to an incomplete or misleading understanding of influencing factors.
Another limitation of statistical and econometric methods is that they were designed around structured data with traditional numeric or categorical coding and a limited number of features. These methods cannot effectively handle unstructured textual data or passages of narratives. Owing to recent advancements in AI and the abundance of narrative data captured in crash reports, natural language processing (NLP) methods have been applied to text mining of crash narratives [9,10,11]. In previous works, researchers were required to collect a large amount of high-quality, labeled crash reports for model training. However, this process is time-consuming and costly, and low-quality training data or poorly chosen training parameters can lead to undesirable performance. In contrast, large language models (LLMs) offer a distinct advantage by leveraging the immense knowledge acquired from extensive pre-training on vast datasets, effectively addressing these challenges. Motivated by their superior capability to comprehend and generate human-like text, this study investigates whether LLMs can process complex and unstructured data in the traffic safety domain to enable elaborate case-specific analysis.
Despite the release of numerous LLMs, represented by the GPT family [12] and the LLaMA family [13], their ability to support traffic crash analysis and reasoning remains unexplored. Applying LLMs to crash analysis presents two major challenges: (1) it requires LLMs to fully understand the domain knowledge and potential causality behind crash events, yet LLMs, typically built on transformer architectures, are often regarded as “black-box” models, making it difficult to interpret their decision-making processes; and (2) while LLMs possess extensive real-world knowledge acquired during pre-training, they are not specialized in analyzing textual data in crash reports, which creates an alignment gap between the model’s original intent and the specific requirements of this task.
To address the first challenge, we propose to leverage the Chain-of-Thought (CoT) technique to enhance the reasoning capabilities of LLMs [14]. This technique guides the model through a structured reasoning process, helping it better understand the detailed knowledge and causality behind crash data. Additionally, the CoT approach provides explainable reasoning steps for each intermediate result, making the model’s decisions more transparent and easier to interpret. By incorporating CoT, we aim to improve the LLMs’ performance in crash severity modeling, leveraging their capability to effectively process the complex and diverse data relevant to crash analysis.
To address the second challenge, we propose to use prompt engineering (PE) and few-shot learning (FS) to better align the LLMs with the specific requirements of the target task: crash severity modeling and analysis. PE can tailor the input prompts to guide the LLMs toward more relevant and reliable analysis, while few-shot learning can provide the models with specific examples to improve their understanding and performance in the subject domain. By combining these techniques, we aim to bridge the alignment gap and enhance the models’ ability to effectively analyze textual descriptions in crash reports.
To demonstrate the efficacy of our approach, we explore three state-of-the-art LLMs, specifically GPT-3.5-turbo, LLaMA3-8B, and LLaMA3-70B, for crash severity inference, framing it as a multi-class classification task. In our experiments, we utilize textual narratives derived from crash tabular data as input for crash severity analysis with LLMs. Additionally, we incorporate CoT to guide the LLMs in analyzing potential crash causes and subsequently inferring the severity outcome. We also examine prompt engineering specifically designed for crash severity inference. We task LLMs with crash severity inference to (1) assess their capability for crash severity analysis; (2) evaluate the effectiveness of CoT and domain-informed PE; and (3) examine their reasoning ability within the CoT framework.
The experimental setup involves several strategies, including plain zero-shot and few-shot settings, zero-shot with Chain-of-Thought (ZS_CoT), zero-shot with prompt engineering (ZS_PE), zero-shot with both prompt engineering and Chain-of-Thought (ZS_PE_CoT), and few-shot with prompt engineering (FS_PE). The LLMs evaluated include GPT-3.5-turbo, LLaMA3-8B, and LLaMA3-70B, with specific hyperparameters to ensure consistent and reliable results. We compare the performance of these models and settings to determine the most effective approach for the crash severity inference task.
3. Data
In this section, we first discuss the dataset employed for the study. We then explain how we convert the crash tabular data to coherent descriptive narratives. Finally, we discuss our experimental settings and evaluation methods.
3.1. Dataset
Our empirical analysis utilizes the CrashStats data from Victoria, Australia, spanning 2006 to 2020. The crash database contains records of vehicles involved in crashes. A four-point ordinal scale is used to code the severity level of each accident: (1) non-injury accident, (2) other injury (minor injury) accident, (3) serious injury accident, and (4) fatal accident. Each sample denotes a vehicle involved in a crash, together with the driver’s information. After data preprocessing, the final dataset has an extremely low representation of non-injury accidents (only four instances, roughly 0.001% of all records). Consequently, these four non-injury accidents are merged into the category of “Minor or non-injury accidents”. As a result, the dataset contains 197,425 minor or non-injury accidents, 89,925 serious injury accidents, and 4760 fatal accidents.
The traffic accident attributes considered in our empirical study include crash characteristics, driver traits, vehicle details, roadway attributes, environmental conditions, and situational factors (see Table 1).
3.2. Textual Narrative Generation
To obtain coherent, informative passages enriched with domain-specific knowledge, we use a simple yet effective template to convert the raw structured tabular data into detailed, human-readable textual narratives that encapsulate vital information about each traffic accident and can be more readily consumed by LLMs. This process is depicted in Figure 1.
The primary objective is to augment the applicability and relevance of tabular data as input for LLMs, facilitating more context-aware inference for tasks such as accident severity assessment. Furthermore, road safety engineers can supplement the narratives with established facts or domain-specific knowledge, for instance, “Children and elders are typically more vulnerable in accidents without seat belts”. However, it is important to note that this study does not delve into the intricate design of scientific knowledge in traffic safety.
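To make this concrete, the sketch below illustrates one way such a template could be implemented in Python. The column names and phrasing are hypothetical placeholders, not our actual schema; the real template is the one shown in Figure 1.

```python
# Minimal sketch of template-based narrative generation from one crash record.
# Field names and wording are illustrative only (the actual template follows Figure 1).

def record_to_narrative(row: dict) -> str:
    """Convert one structured crash record into a human-readable narrative."""
    parts = [
        f"A crash occurred on a {row['road_geometry']} in a "
        f"{row['speed_zone']} km/h speed zone under {row['light_condition']} conditions.",
        f"The driver was a {row['driver_age']}-year-old {row['driver_sex']} "
        f"operating a {row['vehicle_type']} manufactured in {row['vehicle_year']}.",
        f"The road surface was {row['surface_condition']} and the weather was "
        f"{row['atmospheric_condition']}.",
    ]
    # Domain-specific facts (e.g., seat-belt vulnerability notes) could be appended here.
    return " ".join(parts)

example = {
    "road_geometry": "T-intersection", "speed_zone": 60,
    "light_condition": "dark with street lights on", "driver_age": 34,
    "driver_sex": "male", "vehicle_type": "sedan", "vehicle_year": 2012,
    "surface_condition": "wet", "atmospheric_condition": "raining",
}
print(record_to_narrative(example))
```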
4. Experiments
4.1. Experiments Design
In this work, we tackle the crash severity inference problem as a classification task. The inputs encompass various crash-related attributes, including environmental conditions, driver characteristics, crash details, and vehicle features. The original data is in tabular format with categorical and numerical fields. We transform the tabular data into consistent textual narratives with a simple template, as detailed in the preceding section. The objective is to estimate the severity outcomes of crashes with state-of-the-art LLMs, namely GPT-3.5-turbo, LLaMA3-8B, and LLaMA3-70B.
The LLaMA3 models are open-source foundation models, while the GPT model is closed-source and paid. Considering the cost and our goal of evaluating the crash severity inference performance of foundation models, we randomly draw 50 samples from each of the three severity outcome categories, resulting in a total of 150 samples. These samples are used to demonstrate the potential of LLMs in enhancing crash analysis and reasoning.
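Such a balanced subsample can be drawn with a stratified per-class draw; a minimal pandas sketch is shown below, where the file name, the `severity` column, and the seed are our assumptions rather than the exact pipeline.

```python
import pandas as pd

# Hypothetical file; assumes a 'severity' column holding the three outcome labels.
df = pd.read_csv("crashstats_preprocessed.csv")

# Draw 50 records from each severity class (150 in total) with a fixed seed.
sample = df.groupby("severity", group_keys=False).sample(n=50, random_state=42)
assert len(sample) == 150
```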
The strategies outlined in Table 2 include zero-shot and few-shot settings coupled with different techniques. We use ZS and FS to denote the plain zero-shot and few-shot settings without prompt engineering or Chain-of-Thought. The other settings are zero-shot with Chain-of-Thought (ZS_CoT), zero-shot with prompt engineering (ZS_PE), zero-shot with both prompt engineering and Chain-of-Thought (ZS_PE_CoT), and few-shot with prompt engineering (FS_PE). It is worth noting that some literature treats Chain-of-Thought (CoT) as a special form of prompt engineering (PE). In this paper, we make a distinction between PE and CoT to highlight the advantages of CoT over a basic PE approach in the context of traffic safety analysis.
With these experiments, we aim to determine: (1) the accident severity inference performance of LLMs in a plain zero-shot setting; (2) whether CoT enhances performance through its reasoning process (ZS_CoT vs. ZS; ZS_PE_CoT vs. ZS_PE); (3) whether PE improves performance in zero-shot and few-shot settings (ZS_PE vs. ZS; FS_PE vs. FS); and (4) whether few-shot learning boosts performance compared to the zero-shot setting (FS vs. ZS). Accordingly, we tested six prompts to automatically infer the severity outcome of crashes. The details of these prompts are presented in the following section.
The experiments were conducted using GPT-3.5-turbo, LLaMA3-8B, and LLaMA3-70B. For all three models, the hyperparameters are configured with temperature = 0 and top_p = 0.0001 for crash severity inference, aiming to produce consistent and reliable results through greedy decoding. Additionally, the LLaMA3 models are configured to generate deterministic outputs, ensuring reproducibility.
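As a rough illustration, a single inference call with these decoding settings could look like the sketch below. It uses the OpenAI Python client for GPT-3.5-turbo and is an assumed setup rather than our exact experimental harness; the LLaMA3 models would be served through an analogous endpoint.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def infer_severity(narrative: str, prompt_template: str) -> str:
    """Query GPT-3.5-turbo with near-greedy decoding for consistent outputs."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": prompt_template.format(narrative=narrative)}],
        temperature=0,   # greedy decoding
        top_p=0.0001,    # effectively restricts sampling to the top token
    )
    return response.choices[0].message.content.strip()
```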
4.2. Prompts for LLMs
In this section, we explain in detail how we designed the prompts for the different experiments in Table 2.
4.2.1. Zero-Shot
The prompt designed for the plain zero-shot setting is shown in Figure 2. It tasks each LLM, acting as a professional road safety engineer, with classifying the severity of a traffic crash in Victoria, Australia, based on a detailed description of the crash. The engineer is required to categorize the crash into one of three specified categories: ’Fatal accident’, ’Serious injury accident’, or ’Minor or non-injury accident’. The engineer’s response is restricted to outputting only the classification result, ensuring a focused and objective assessment. This prompt is designed to elicit a precise evaluation of the crash severity outcome, leveraging the persona’s expertise in road safety and crash analysis.
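A plausible rendering of this zero-shot prompt is sketched below; since the exact wording is given in Figure 2, treat this as an approximation rather than the verbatim prompt.

```python
# Approximate zero-shot prompt; the exact wording appears in Figure 2.
ZS_PROMPT = """You are a professional road safety engineer in Victoria, Australia.
Classify the severity of the following traffic crash into exactly one of:
'Fatal accident', 'Serious injury accident', or 'Minor or non-injury accident'.
Respond with only the classification result.

Crash description: {narrative}"""
```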
The prompt designed for the zero-shot with CoT setting is shown in Figure 3. It leverages a CoT approach, encouraging the LLM, as an engineer, to methodically reason through the details of the accident to determine both the cause and the severity outcome, thereby ensuring a comprehensive and structured assessment grounded in the LLM’s knowledge. The difference between ZS_CoT and plain ZS is highlighted in red in Figure 3.
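Continuing the sketch above, the CoT variant can be expressed as a small modification of the zero-shot prompt; the added reasoning instruction below approximates the wording highlighted in Figure 3.

```python
# CoT variant: replace the "answer only" instruction with a request to
# reason about causes first, then conclude with the severity label.
ZS_COT_PROMPT = ZS_PROMPT.replace(
    "Respond with only the classification result.",
    "First, reason step by step about the likely causes and contributing "
    "factors of this crash. Then, based on your reasoning, conclude with "
    "the severity classification."
)
```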
The prompt designed for the zero-shot with prompt engineering setting is shown in Figure 4. It instructs the LLM, as a professional road safety engineer, to classify the severity outcome of a traffic crash using a descriptively modified set of categories (see the revised class descriptions in red in Figure 4) that accommodates alignment constraints in LLMs. The engineer must categorize the crash into one of three revised descriptive labels: ’Serious accident with potentially fatal outcomes’, ’Serious injury accident’, or ’Minor or non-injury accident’. The prompt explicitly requires the engineer to output only the classification result.
This rephrasing aims to preserve classification accuracy while adhering to the alignment constraints of LLMs, which tend to avoid directly assigning the ’Fatal accident’ label due to their training to steer clear of discussing unpleasant or unsafe topics related to human death. In comparison with the other settings, this could reveal whether prompt engineering enhances LLMs’ performance in traffic safety analysis by addressing inherent biases and improving the models’ ability to more reliably infer the fatal outcome of traffic incidents.
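Operationally, this amounts to a label substitution at prompt-construction time, with the softened label mapped back to the original category during evaluation. A minimal sketch follows; the parsing heuristic is our own illustration, not a prescribed procedure.

```python
# Softened label used in the prompt to avoid alignment-induced hedging,
# mapped back to the original category when scoring predictions.
SOFT_TO_ORIGINAL = {
    "Serious accident with potentially fatal outcomes": "Fatal accident",
    "Serious injury accident": "Serious injury accident",
    "Minor or non-injury accident": "Minor or non-injury accident",
}

def normalize_prediction(model_output: str) -> str:
    """Map a raw model response back to one of the original three labels."""
    text = model_output.lower()
    for soft_label, original_label in SOFT_TO_ORIGINAL.items():
        if soft_label.lower() in text:
            return original_label
    return "unparsed"  # responses matching no known label are flagged
```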
The prompt designed for the zero-shot with PE and CoT setting is shown in Figure 5. In this prompt design, we not only include CoT, by requiring a logical deduction from cause reasoning to severity outcome classification, but also change the class label ’Fatal accident’ to the softened version ’Serious accident with potentially fatal outcomes’. The differences between ZS_PE_CoT and plain ZS are highlighted in red in Figure 5.
4.2.2. Few-Shot
The prompt designed for the plain few-shot setting is shown in Figure 6. In this paper, the few-shot setting refers to a three-shot scenario, where three examples, one from each severity category, are provided for the LLMs to infer from.
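A three-shot prompt of this kind can be assembled by prepending one labeled exemplar per severity class to the query, as sketched below; the exemplar narratives are placeholders, and the exact prompt wording is given in Figure 6.

```python
# One placeholder exemplar per severity class; in practice these would be
# actual narratives sampled from the dataset (see Figure 6 for the real prompt).
FEW_SHOT_EXAMPLES = [
    ("<narrative of a fatal crash>", "Fatal accident"),
    ("<narrative of a serious injury crash>", "Serious injury accident"),
    ("<narrative of a minor crash>", "Minor or non-injury accident"),
]

def build_few_shot_prompt(narrative: str) -> str:
    """Prepend the three labeled exemplars to the query narrative."""
    demos = "\n\n".join(
        f"Crash description: {text}\nSeverity: {label}"
        for text, label in FEW_SHOT_EXAMPLES
    )
    return (
        "You are a professional road safety engineer. Classify each crash as "
        "'Fatal accident', 'Serious injury accident', or "
        "'Minor or non-injury accident'.\n\n"
        f"{demos}\n\nCrash description: {narrative}\nSeverity:"
    )
```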
The only difference between the prompt for few-shot with prompt engineering (FS_PE) and that of the plain few-shot setting is that we substitute “Fatal accident” with “Serious accident with potentially fatal outcomes”.
4.3. Evaluation Metrics
Following standard practice in the context of multi-class classification, we adopt two commonly used classification metrics: Macro-Accuracy and F1-score. Additionally, we include class-specific accuracies. These metrics are briefly discussed below.
(1)
Accuracy. Accuracy measures the proportion of correctly classified instances in the test dataset. For a given class $c$, it is calculated as:

$$\text{Accuracy}_c = \frac{N_c^{\text{correct}}}{N_c}$$

where $N_c^{\text{correct}}$ is the number of correctly classified instances of class $c$ and $N_c$ is the total number of instances of class $c$ in the test dataset. It should be noted that we first calculate the accuracy for each class and then calculate the Macro-Accuracy as the average of these class accuracies:

$$\text{Macro-Accuracy} = \frac{1}{C}\sum_{c=1}^{C}\text{Accuracy}_c$$

where $C$ is the number of classes.
(2)
F1-score. The F1-score is defined as the harmonic mean of precision and recall, computed as:

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The F1-score reported in the following section (Section 5) is at the macro level, i.e., the average F1 across all classes.

Precision quantifies the accuracy of positive predictions for a specific class, computed as:

$$\text{Precision} = \frac{TP}{TP + FP}$$

where $TP$ is the number of true positives (instances correctly predicted as the class) and $FP$ is the number of false positives (instances incorrectly predicted as the class).

Recall, also known as sensitivity or true positive rate, measures the ability of the model to correctly identify instances of a specific class. It is calculated as:

$$\text{Recall} = \frac{TP}{TP + FN}$$

where $FN$ is the number of false negatives (instances of the class incorrectly predicted as another class).
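Because the per-class accuracy defined above equals the recall of that class, the Macro-Accuracy is equivalent to macro-averaged recall; both reported metrics can therefore be computed directly with scikit-learn, as in the sketch below.

```python
from sklearn.metrics import f1_score, recall_score

# Toy example: ground-truth labels and model predictions for three cases.
y_true = ["Fatal accident", "Serious injury accident", "Minor or non-injury accident"]
y_pred = ["Fatal accident", "Serious injury accident", "Serious injury accident"]

macro_f1 = f1_score(y_true, y_pred, average="macro")
# Per-class accuracy is the per-class recall, so Macro-Accuracy is macro recall.
macro_accuracy = recall_score(y_true, y_pred, average="macro")
```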
5. Findings
5.1. Exemplar Responses of LLMs to Crash Severity Inference Queries
The exemplar responses of GPT-3.5, LLaMA3-8B, and LLaMA3-70B in each of the six settings, including ZS, FS, ZS_CoT, FS_PE, ZS_PE, and ZS_PE_CoT (outlined in
Table 2), are shown in
Figure 7. It demonstrates that the LLMs can effectively respond to the severity inference task, delivering expected results. Note that the examples in
Figure 7 only showcase correct severity inferences.
Given the prompt for each setting (see Section 4.2), each model can directly answer, or ultimately summarize, its estimated severity for the given accident as one of the defined categories.
In the plain Zero-shot and Few-shot settings, the models respond directly with one of the three class labels, i.e., “Minor or non-injury accident”, “Serious injury accident”, or “Fatal accident”. Similarly, in the ZS_PE and FS_PE settings, the models respond directly as “Minor or non-injury accident”, “Serious injury accident”, or “Serious accident with potentially fatal outcomes”.
In contrast, in the CoT settings (ZS_CoT and ZS_PE_CoT), the models return longer responses, reasoning first and then inferring the severity outcome of each accident. Generally, the GPT-3.5 model’s responses are more concise.
5.2. Severity Inference Performance of the LLMs with Different Strategies
The performance metrics of GPT-3.5, LLaMA3-8B, and LLaMA3-70B for the crash severity inference task under the six settings (see Table 2) are presented in Table 3.
The results reveal varied performance across models and settings in inferring crash severity outcomes. LLaMA3-70B consistently exhibited superior performance, particularly with zero-shot prompt engineering (ZS_PE), achieving the highest macro F1-score (0.4755) and macro-accuracy (49.33%). Furthermore, LLaMA3-70B attained the second-best performance in macro F1-score (0.4747) and macro-accuracy (47.33%) under the zero-shot with Chain-of-Thought (ZS_CoT) setting. These findings suggest that both prompt engineering and Chain-of-Thought methodologies contribute positively to model performance. Nevertheless, no single technique demonstrated consistent superiority across all severity categories. For fatal accidents, GPT-3.5 with zero-shot prompt engineering and Chain-of-Thought (ZS_PE_CoT) exhibited the highest accuracy (68%). In contrast, for “serious injury” accidents, GPT-3.5 and LLaMA3-8B in the plain zero-shot setting (ZS), as well as GPT-3.5 in the zero-shot with Chain-of-Thought scenario (ZS_CoT), achieved 100% accuracy. However, it is crucial to note that in these settings, these models performed poorly for fatal and “minor or non-injury” accidents, indicating an inherent bias toward the intermediate severity category of “serious injury”.
Interestingly, LLaMA3-70B with the basic zero-shot approach demonstrated the best inference performance for “minor or non-injury” accidents (58% accuracy) while maintaining a relatively balanced performance across fatal and serious injury accidents. This suggests a robust generalization capability of LLaMA3-70B across crash severity categories.
The implementation of prompt engineering, particularly in the zero-shot settings, generally enhanced performance across models. This improvement was especially pronounced for fatal accident classification, where rephrasing the “Fatal accident” label to the softened “Serious accident with potentially fatal outcomes” allowed the models to maintain classification accuracy while adhering to their aligned behaviors.
These results underscore the complexity inherent in the crash severity inference task, as no single approach consistently outperformed the others across all metrics and severity categories. These findings highlight the need for careful selection of models and methodologies based on the specific task requirements and the emphasis placed on particular severity categories in the application context.
5.3. Effectiveness of Prompt Engineering (PE) and Chain-of-Thought (CoT)
Figure 8 shows the performance gains from CoT and PE relative to the plain zero-shot setting. Both ZS_CoT and ZS_PE consistently demonstrate enhanced performance in terms of Macro F1-score and Macro-accuracy across all three models evaluated. This improvement underscores the efficacy of CoT and PE in boosting model performance in zero-shot scenarios.
Notably, the implementation of PE (ZS_PE) yields more substantial improvements than CoT (ZS_CoT). This difference suggests that, for this specific task, reformulating prompts may be more effective at guiding model outputs than the structured reasoning of CoT. The consistent pattern of improvement across different model architectures and sizes indicates the broad applicability of these techniques in zero-shot learning paradigms.
As illustrated in Figure 8, CoT improves both Macro F1-score and Macro-accuracy across all three models relative to the plain zero-shot (ZS) setting. Based on the results summarized in Table 3, GPT-3.5 and LLaMA3-8B show improved recognition of “Minor or non-injury” accidents. LLaMA3-70B demonstrates substantial gains in identifying “Serious injury” and “Minor or non-injury” accidents, with only a slight reduction in performance for “Fatal” accidents. The use of CoT enables LLMs to better understand and reason through questions, leading to more reliable and explainable inferences.
The PE technique also increases Macro F1-score and Macro-accuracy across all three models compared to the plain zero-shot (ZS) setting, as depicted in Figure 8. Notably, it greatly enhances the models’ ability to detect fatal accidents by simply softening the label description from “Fatal accident” to “Serious accident with potentially fatal outcomes”, resulting in more balanced performance across severity categories. Compared to the zero-shot baseline, GPT-3.5 with PE attains a remarkable increase in fatal accident detection, with accuracy rising from 0% to 62%. Similarly, LLaMA3-8B and LLaMA3-70B show increases in fatal accident accuracy from 0% to 34% and from 44% to 60%, respectively.
These improvements may stem from the fact that PE directs LLMs to concentrate more specifically on accident severity classification, potentially addressing any initial tendency to be overly cautious or generalized in their responses. This targeted guidance enables the models to make more precise distinctions among accident severity categories.
Figure 9 shows the comparative performance of the three models across the incremental settings ZS, ZS_PE, and ZS_PE_CoT.
In both the zero-shot (ZS) and zero-shot with PE (ZS_PE) settings, LLaMA3-70B consistently outperforms the other two models. In the ZS setting, LLaMA3-70B achieves a macro F1-score of 0.4541 and a macro-accuracy of 45.33%, significantly higher than both GPT-3.5 (0.1812 and 34.00%) and LLaMA3-8B (0.1818 and 34.00%). This performance advantage is maintained in the ZS_PE setting, where LLaMA3-70B shows further improvement with a macro F1-score of 0.4755 and a macro-accuracy of 49.33%, compared to GPT-3.5 (0.3798 and 45.33%) and LLaMA3-8B (0.3120 and 40.00%).
However, the performance dynamics shift in the zero-shot with PE and CoT setting (ZS_PE_CoT). Here, LLaMA3-8B leads with a macro F1-score of 0.4033 and a macro-accuracy of 45.33%, surpassing both GPT-3.5 (0.3509 and 42.00%) and LLaMA3-70B (0.3581 and 42.67%). GPT-3.5 and LLaMA3-70B experience slightly decreased performance under ZS_PE_CoT compared to ZS_PE. In contrast, LLaMA3-8B shows an improved macro F1-score, from 0.3120 to 0.4033, as well as an increased macro-accuracy, from 40.00% to 45.33%. This shift suggests that the combination of PE and CoT reasoning is particularly beneficial to smaller models, such as LLaMA3-8B, rather than larger models like GPT-3.5 and LLaMA3-70B, for the crash severity inference task.
In the ZS_PE_CoT setting, all three models, GPT-3.5, LLaMA3-8B, and LLaMA3-70B, demonstrate improved recognition of fatal accidents, as evidenced in Table 3. This enhancement indicates that the combination of CoT and PE is particularly beneficial for identifying more severe crashes than less severe ones.
5.4. Zero-Shot vs. Few-Shot Learning
In the FS setting, the inclusion of three examples improves both the macro F1-score and macro-accuracy compared to the ZS setting, boosting the classification accuracy of GPT-3.5 and LLaMA3-8B for “Fatal accident” and “Minor or non-injury accident”. However, this comes at the expense of accuracy for “Serious injury accident”, indicating either a trade-off in classification performance across severity categories or a reduction of the model bias observed in the ZS setting.
Moreover, smaller models like LLaMA3-8B generally benefit more from few-shot learning than larger models, such as GPT-3.5 and LLaMA3-70B, as evidenced by a notable increase in macro F1-score from 0.1818 to 0.4068. Nevertheless, LLaMA3-70B, being a larger model, performs slightly better in the zero-shot settings, suggesting it may have acquired general knowledge of the traffic safety domain during pre-training, which zero-shot prompting can draw upon.
In contrast, the effects of PE in the FS setting vary more across models. GPT-3.5 demonstrates improvements in macro F1-score and fatal accident accuracy, and LLaMA3-70B shows remarkably improved inference accuracy for “Fatal accident” and “Serious injury accident”. Conversely, LLaMA3-8B shows decreased macro F1-score and macro-accuracy, indicating that PE in the few-shot setting may not be equally beneficial for models of different sizes.
It is important to note that we did not explore the choice of examples in the FS setting, which might affect different models to varying degrees.
7. Conclusions
In conclusion, this study demonstrates the efficacy of LLMs in crash severity reasoning and inference using textual narratives of crash events constructed from structured tabular data. Our comprehensive evaluation of modern LLMs (GPT-3.5-turbo, LLaMA3-8B, and LLaMA3-70B) across different settings (zero-shot, few-shot, CoT, and PE) yields insightful findings. LLaMA3-70B consistently outperformed the other models, especially in the zero-shot settings. The CoT and PE techniques led to enhanced performance, improving logical reasoning and addressing alignment issues.
Notably, the use of CoT provided valuable insights into LLM reasoning processes, revealing their capacity to consider multiple factors such as environmental conditions, driver behavior, and vehicle characteristics in the crash severity inference task. These findings collectively suggest that LLMs hold considerable promise for crash analysis and modeling. Future research may explore other safety applications beyond the severity analysis and inference.