Identifying Human Factors in Aviation Accidents with Natural Language Processing and Machine Learning Models

Lázaro, Flávio L.; Madeira, Tomás; Melicio, Rui; Valério, Duarte; Santos, Luís F. F. M.

doi:10.3390/aerospace12020106


by Flávio L. Lázaro ^1,2, Tomás Madeira ^1, Rui Melicio ^1,3,4, Duarte Valério ^1,* and Luís F. F. M. Santos ^3,5

¹ Institute of Mechanical Engineering (IDMEC-LAETA), Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal

² Faculdade de Engenharia, Universidade Agostinho Neto, Av. 21 de Janeiro, Luanda 1756, Angola

³ Aeronautics and Astronautics Research Center (AEROG-LAETA), Universidade da Beira Interior, Calçada Fonte do Lameiro, 6200-358 Covilhã, Portugal

⁴ Synopsis Planet, Advance Engineering Unipessoal LDA, 2810-174 Almada, Portugal

⁵ ISEC Lisboa, Alameda das Linhas de Torres 179, 1750-142 Lisboa, Portugal

* Author to whom correspondence should be addressed.

Aerospace 2025, 12(2), 106; https://doi.org/10.3390/aerospace12020106

Submission received: 31 October 2024 / Revised: 20 January 2025 / Accepted: 27 January 2025 / Published: 31 January 2025

(This article belongs to the Special Issue Machine Learning for Aeronautics (2nd Edition))


Abstract:

The use of machine learning techniques to identify contributing factors in air incidents has grown significantly, helping to identify and prevent accidents and improve air safety. In this paper, classifier models such as LS, KNN, Random Forest, Extra Trees, and XGBoost, which have proven effective in classification tasks, are used to analyze incident reports parsed with natural language processing (NLP) techniques, to uncover hidden patterns and prevent future incidents. Metrics such as precision, recall, F1-score and accuracy are used to assess the degree of correctness of the predictive models. The adjustment of hyperparameters is obtained with Grid Search and Bayesian Optimization. KNN had the best predictive rating, followed by Random Forest and Extra Trees. The results indicate that the use of machine learning tools to classify incidents and accidents helps to identify their root cause, improving situational decision-making.

Keywords: machine learning; aviation; safety; classifier models; hyperparameter tuning; human factors; SMS

1. Introduction

At present, industries and companies in the aviation sector have increasingly sought to exceed the safety levels of their equipment and systems in order to ensure the reliability of their operations and services. Automated decision support technologies still represent one of the main challenges in the aviation industry [1]. Furthermore, given that air transport contributes significantly to social and economic development on a global scale, stakeholders in this highly regulated sector have a strong interest in the prevention of operational risk and hazard identification, aiming to constantly improve the safety of people and property [2]. Over the last few years, the levels of accidents and incidents have decreased, not only due to the growing technological improvement resulting from the demand for dynamic systems that allow the early detection of failures but also due to the introduction of processes, policies, and regulations created to deal with human factors, organizational factors, and state safety programs [3,4]. Despite the SARS-COV-2 pandemic negatively affecting global air traffic performance in 2020, an IATA report states that international traffic in 2023 increased by 24.2% compared to 2022, reaching a rate of 94.7% of the pre-pandemic level in 2019 [5]. This significant growth reflects the resumption of passenger confidence and the increase in airline operations, driven by contingency and safety measures such as mass vaccination and the easing of travel restrictions in many countries [6].

The investigation into and reporting of aviation incidents are essential elements of continued progress in terms of safety. While accidents often involve fatalities or serious injuries, incidents, which are more common and less costly, play a crucial role in gathering information to detect possible risks and hazards [7]. Given the complexity of aviation systems, various elements such as human error, mechanical failures in aircraft, extreme weather conditions, and questionable organizational policies, or a mixture of these factors, can lead to incidents and accidents. Due to the importance of data on such occurrences, countries and international organizations have dedicated considerable efforts to collect and store reports to support analytical decisions [8]. The analysis of the causes of incidents through these occurrence reports has helped to investigate the fundamental reasons behind aviation accidents. For instance, the authors of [9] analyzed 99 incident reports linked to the Flight Management System, from the Aviation Safety Reporting System, and stated that there was a significant amount of operational and design-related problems in the management systems due to the user interface not having been ideally designed. It is recommended for manufacturers to seek a more appropriate balance in the design of the Flight Management System between logic and usability to reduce the occurrence of errors. Other studies [10,11,12,13] have indicated that factors such as pressure, fatigue, communication failures, and lack of technical knowledge among key professionals such as maintenance technicians, air crew, and air traffic controllers are among the main reasons for aviation accidents. 
As mitigation measures to address these challenges, international regulatory bodies require airlines to increasingly improve their Safety Management Systems and Maintenance Resource Management in order to accurately detect safety breaches for appropriate and efficient decision-making in individual and environmental processes at all levels of the industry, aiming to avoid possible catastrophic events or dangerous situations [14]. The implementation of these measures involves the planning and definition of safety policies and objectives, safety risk management, and the guarantee and promotion of safety to ensure that all players involved in the aviation sector incorporate in their thinking the appropriate industrial standards and safety policies, developing a culture of safety management and communication.

The ICAO safety report shows that, in 2023, there were 66 air accidents, of which only 1 was fatal, resulting in the death of 72 passengers [15]. This report reflects that the total number of accidents has been decreasing in recent years (i.e., 2018–2023), as shown in Figure 1.

However, despite the indicators pointing to a reduction in catastrophes, there is still much to be done to improve aviation safety. This is because, in addition to current measures, there is a clear need to implement predictive safety models that take into account the human factor to identify and avoid high-risk situations. The Fatigue Risk Management System (FRMS) Implementation Guide for Operators, published by IATA, ICAO, and IFALPA in 2011, offers a structured framework to help air operators identify, assess, and mitigate the risks associated with fatigue, which is also considered as a critical safety factor [16]. The scientific community has been increasingly interested in the development of Human Reliability Analysis (HRA) methods [17,18,19,20], aiming to generate predictive indicators in an accessible way, taking advantage of information based on collected data. These methods involve analysis of textual reports, which often require manual categorization of different aspects related to human factors, making this task a costly one. It is also important to note that aviation data are vast, dimensional, noisy, and class unbalanced; this, compounded with the unsuitability of data augmentation techniques for this area [10], makes data difficult to classify. Using data augmentation or synthetic data to enhance a model’s ability to learn from known cases can, in the context of human factors, lead to incoherent scenarios, such as an operational error attributed incorrectly to a maintenance technician. Therefore, synthetic data generation should be employed cautiously, ensuring that the generated data accurately reflect realistic and consistent outcomes. The integration of advanced machine learning techniques has become the heart of the evolution of aviation safety programs [21]. It is crucial to ensure that the classification models capture the available data and perform well.

This study aims to evaluate the performance of four classifier models―K-Nearest Neighbors (KNN), Random Forest (RF), Extra Trees (ET), and Extreme Gradient Boosting (XGBoost)―in the process of identifying the main contributing factors in air incidents and accidents, using machine learning (ML) data pre-processing techniques. The choice of these classifiers was based on the ability demonstrated by these algorithms to handle diverse amounts of data, identify patterns, and perform accurate classifications in natural language processing tasks, as evidenced by previous studies [22,23]. Two algorithms (Grid Search and Bayesian Optimization) are used separately to optimize the hyperparameters of the classifiers, which through inferences explore different configurations to find the best fit. Recent research indicates that ML can uncover hidden patterns in large operational datasets, such as those present in air incident reports, thereby improving risk detection and failure prediction [24]. The use of natural language processing (NLP) technologies facilitates not only the extraction of relevant information in complex textual reports but also improves their analysis and trend recognition [25].

NLP has already been used to parse aviation accident reports to identify any human factors involved [26]. Using ML to arrive at the main contributing human factors, however, has not been carried out in [26]. For a review of the application of ML to aviation safety, see [27], from which it can be seen that these tools have seldom been used to relate human factors and accidents.

This study is organized as follows. Section 2 describes the analysis and pre-processing of aviation safety reports: using NLP techniques, through incorporation processes, connections of semantic meaning are established between extensive excerpts of text for local comparison with the categories of human factors present in the documents with different degrees of distance; in this way, the reports are classified using the Human Factors Analysis and Classification System for Machine Learning (HFACS-ML) framework proposed in [28] for the classification of applied human factors in the context of machine learning. Section 3 connects the identified samples with vectorized documents (Doc2Vec), applying them in a label propagation algorithm (LS) and to the classifier models for evaluations and insights into the performance of the different predictive models tested. Finally, in Section 4, the results are discussed, conclusions are outlined, and some aspects for further work are recommended.

The novelty in this paper is the use of these classifiers in identifying the most important human factors causing incidents and accidents. Another novelty is the use of 25 years of data rather than the shorter periods addressed in previous publications such as [10,28].

2. Materials and Methods

For this study, detailed reports on the most relevant threats to aviation safety from the last twenty-five years (from 2000 to 2024) from the Aviation Safety Network (ASN) database [29] were collected, thus obtaining a total of 1836 documents identified as probable causes. The term “relevant threats” encompasses those safety threats identified from the ASN database, which have shown a substantial impact on aviation safety, such as mechanical failures, human errors, and severe weather conditions. The ASN database covers safety occurrences “involving airliners (12+ passengers), corporate jets and military transport aircraft since 1919” [29]. Notice that these criteria differ from those of ICAO safety reports [15].

Earlier reports in the database, from the 1919–1999 period, were not used. Including them would make little sense, given the profound changes in the aviation industry during the second half of the 20th century, notably the significant improvements in aircraft technology. In fact, a trade-off had to be struck. A longer period would make more data available to the algorithms, and more data should mean better results, but only if the data are not so old that they are no longer relevant. Likewise, changes in the aviation industry during the 21st century could justify using only data from the most recent years, but then too few database items would be left, and the algorithms would yield poorer results. We therefore settled on a twenty-five-year period, deemed long enough to cover a sufficient number of reports and recent enough for the reports to be relevant to today’s practices.

The human factors involved in each incident or accident are extracted using NLP. After this step, the methodology consists of using performance metrics, such as confusion matrix, precision, recall, F1-score, and accuracy, to assess the degree of correctness of the predictive models. Cross-validations are performed, using labeled and unlabeled datasets for training and testing, to evaluate their generalization and learning capabilities in the process of classifying the reports.

2.1. Human Factor Classification

The Human Factors Analysis and Classification System (HFACS) framework [30] was selected for its robust multi-level categorization of human factors, which allows for a detailed analysis of the cascade of human errors from frontline operations to organizational decisions. This systematic approach aligns with the objectives of our study to uncover the underlying human factors contributing to aviation accidents, offering a structured methodology to analyze and classify the data comprehensively.

In the present study, textual report data are matched with the human factor categories of the HFACS. However, it was often observed that reports say little or nothing about organizational influences; on the other hand, the HFACS is structured to address very specific instances of human factors, which may not always be explicitly detailed in accident reports, leading to potential subjective interpretations. This may happen because of some lack of communication between the information made available to researchers and the actual practices adopted by senior management decision makers. To address these challenges and enhance the linkage between reported incidents and human factor categories, the HFACS-ML model, proposed in [28], was used instead of the HFACS. This model adaptation, presented in Figure 2, allows for a more nuanced integration of the contents of accident reports into human factors categories, bridging the identified gaps between reported data and systemic influences.

The labeling of the data for the evaluation of the predictive classification models was done automatically, first, and then manually. The keywords found in the documents provided by the database were automatically processed, observing at least 20 random reports to assess the consistency (of the majority) of tags related to the different HFACS-ML levels and categories, as illustrated by the schematization presented in Figure 3. The irregular samples identified were corrected manually.

Although the automatic method allowed us to find a significant number of labels, irregular samples were identified and then manually corrected by analyzing them individually and classified into their respective HFACS-ML categories to introduce greater diversity to the set of labels. These processes allowed a total of 114 labels for unsafe supervision, 402 labels for precondition of unsafe acts, and 122 labels for unsafe acts, as described in Table 1.

2.2. NLP Pre-Processing of the Data

In text classification, the input flow of documents is divided into categories by classifiers learned from training samples [31]. However, before the data are used to train a classifier, they must be pre-processed to make them more uniform and understandable to the algorithm. This preparatory phase can consume about 80% of the total effort in text mining projects [32]. Nevertheless, comprehensive pre-processing is crucial, as the quality of the final classifier output is directly linked to how the data were pre-processed. Several pre-processing attempts were made with different settings to find the best configuration, including techniques such as data cleaning, normalization and synonym grouping, and tokenization.

In the cleaning of the data, interfering or noisy information was eliminated, such as duplicate occurrences, stop words, and punctuations, as well as all incidents related to terrorist attacks. The reason for this last aspect is the understanding that employees’ performance in the face of malicious external threats should not reflect their professional behavior in normal situations. In the normalization stage, due to the diversity of terms contained in the documents, also originated by the existence of reports with languages other than English, the different documents were translated into English, and the synonyms were grouped together, which consisted of replacing some specific expressions with generic terms, e.g., ATC (Air Traffic Control), Control Tower, and Area Control Center were replaced by the generic term “Traffic Control”. With the support of Python’s nltk library (documentation available at [33]), the texts were tokenized, where each narrative was analyzed and divided into vectorized tokens. Stop words with terms (e.g., “the” or “and”) that contribute little significant information to the differentiation between two or more documents were eliminated. Stemming was also used to reduce words to their root forms in order to combine closely related words. Figure 4 provides an overview of the subsequent processes, which are detailed in the following sections.
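The cleaning, synonym grouping, tokenization, stop-word removal, and stemming steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline actually used in the study: the `SYNONYMS` map, the short `STOP_WORDS` list, and the `crude_stem` helper are hypothetical stand-ins (the study relied on the nltk library, whose stop-word lists and stemmers are far more complete).

```python
import re

# Hypothetical synonym map: specific expressions -> generic term (example from the text).
SYNONYMS = {
    "air traffic control": "traffic control",
    "area control center": "traffic control",
    "control tower": "traffic control",
    "atc": "traffic control",
}

# Tiny illustrative stop-word list; a real run would use a full list such as nltk's.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "was", "by", "on", "with"}


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(narrative):
    """Lower-case, group synonyms, tokenize, drop stop words, and stem."""
    text = narrative.lower()
    # Group synonyms before tokenizing so multi-word expressions are caught.
    for phrase, generic in SYNONYMS.items():
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", generic, text)
    tokens = re.findall(r"[a-z]+", text)  # keeps letters only, dropping punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


def drop_duplicates(reports):
    """Remove duplicate occurrences (case-insensitive) while preserving order."""
    seen, unique = set(), []
    for report in reports:
        key = report.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(report)
    return unique


tokens = preprocess("The crew contacted ATC and the Control Tower.")
```

Grouping synonyms before tokenization is a design choice of this sketch: it lets multi-word expressions such as "Control Tower" be matched as a unit before they are split into separate tokens.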

2.3. Using GPT as Pre-Processing Tool

In order to test alternative methods to the ones developed, and based on the data-driven nature of artificial intelligence applications, an alternative procedure was performed using ChatGPT by applying the HFACS framework of [30] for accurate analysis and classification of the human factors contained in each aviation accident report. The GPT-4 Turbo platform was used—for reading and cleaning text, transforming cases, and labeling reports—as it is a tool that supports and generates texts with large scales of tokens per message [34]. However, the results from ChatGPT were inconsistent. ChatGPT identified the same human factors in most reports, regardless of the text or the root causes. This can be justified by the fact that artificial intelligence tools still lack developmental improvements in context interpretation, as they tend to misclassify the causal factors implicated by human behaviors and cognitive decision-making processes, a conclusion already reached in [35] precisely for HFACS.

3. NLP for Feature Extraction and HFACS Label Diffusion

Broadly speaking, texts and documents consist of unstructured datasets. However, it is necessary to transform these unstructured text strings into a structured format when employing mathematical techniques as part of a classifier. The area of natural language processing (NLP) centrally involves two main phases: the first step is to convert the input text (raw information) into numerical format (vectors or matrix), and the second step is to create models to manipulate the data to achieve a specific goal or complete an intended task [36]. Feature extraction, size reduction, choice of classifiers, and evaluations are some of the distinct tasks used in most document classification and categorization systems, in which statistical and/or mathematical formulas are employed to calculate resources.

The use of semi-supervised learning allows combining manually labeled data with large amounts of unlabeled data to train models [37]. Traditional methods of semi-supervised learning, which previously focused on transductive learning, have not yet been fully explored within the inductive context used in modern deep learning, whereas, in supervised learning, knowledge is acquired from labeled data that include a set of examples with their correct answers associated with them [38]. The challenge is to develop a function that can accurately relate inputs to existing outputs while being able to accurately predict the future.

To obtain a more in-depth analysis of the documents, the ability to disseminate information through the underlying structure of the data was analyzed with the Label Spreading (LS) algorithm, in order to anticipate human factors in unknown documents. The KNN, RF, Extra Trees, and XGBoost classifiers, in turn, were applied only to the labeled dataset and tuned with optimization techniques to assess their performance in predicting human factors in aviation incident and accident documents. The performances of the respective models evaluated are described below.

3.1. Document to Vector (doc2vec)

The doc2vec model is an extension of word to vector (word2vec) used in learning processes for document embedding [39]. In this model, paragraph vectors function as memory devices to maintain their focus through feature extraction. Its main approaches, which are distributed bag of words (dbow) and distributed memory of paragraph vectors (dmpv), have similar characteristics to the skip-gram and the continuous bag of words (cbow) of word2vec. While dbow focuses on predicting random words in documents using only the document vector and ignoring the direct context of the words, dmpv predicts context words using both the vectors of the contextual words and the vector of the document [40]. In our study, we adopted vector architecture with the dbow model due to its ability to capture information, which does not depend on word order. This approach highlights the importance of maintaining the overall meaning of the text in technical reports, which are crucial for extracting valuable information and hidden patterns present in them. Additionally, dbow simplifies the analysis of large textual datasets by creating dense and continuous representations.

Its analytical formulation seeks to predict a word w_{i} from a document vector D_{d}, with a probability given by the following:

P (w_{i} | D_{d}) = \frac{\exp (V_{w_{i}} \cdot D_{d})}{\sum_{w \in V} \exp (V_{w} \cdot D_{d})}

(1)

where V_{w_{i}} is the vector of word w_{i}, D_{d} is the vector of document d, and the sum in the denominator runs over all words w in the vocabulary V.

The objective function is chosen to maximize the probability of all the words in the document and is given by the following:

J (θ) = \sum_{w_{i} \in d} \log P (w_{i} | D_{d})

(2)

These characteristics are especially important in the aviation industry due to the huge number of accumulated reports that require the identification of trends or common causes in incidents. Another advantage is the computing savings compared to more complicated models, which allows for faster and less expensive analysis.
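Equations (1) and (2) can be checked numerically with a small sketch. The three-word vocabulary and the two-dimensional vectors below are made-up values for illustration only; the function names are hypothetical. Practical doc2vec implementations usually approximate the full softmax (for instance with negative sampling) rather than evaluating it over the whole vocabulary, so this should be read as an illustration of the formulas, not of the training procedure used in the study.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax_prob(word, doc_vec, word_vecs):
    """Equation (1): P(word | document) as a softmax over the vocabulary."""
    scores = {w: dot(vec, doc_vec) for w, vec in word_vecs.items()}
    top = max(scores.values())  # subtracting the maximum avoids overflow in exp
    denom = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[word] - top) / denom


def log_likelihood(doc_words, doc_vec, word_vecs):
    """Equation (2): sum of the log-probabilities of the words in the document."""
    return sum(math.log(softmax_prob(w, doc_vec, word_vecs)) for w in doc_words)


# Toy vocabulary with made-up 2-dimensional word vectors (illustrative values only).
word_vecs = {"fatigue": [1.0, 0.0], "runway": [0.0, 1.0], "crew": [0.5, 0.5]}
doc_vec = [2.0, 0.0]  # a document vector aligned with "fatigue"

probs = {w: softmax_prob(w, doc_vec, word_vecs) for w in word_vecs}
objective = log_likelihood(["fatigue", "crew"], doc_vec, word_vecs)
```

Training adjusts the document vector (and the word vectors) so that the objective increases, that is, so that the words actually present in a document become more probable given its vector.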

When applying the cosine similarity measure to identify the closest documents, the categories of human factors inferred from the documentary projections, extracted from a randomly selected report with reference (id), were observed at the local level in six files, as shown in Table 2.

Some authors claim that documents with similar characteristics of human nature are usually placed close to each other in vector space, whereas documents with distinct human factors are usually placed farther apart from each other [41,42]. These statements reinforce the use of the model under analysis to classify the human factors of unknown documents based on their location in the vector space. In this regard, all vectors of the documents are adjusted (excluding the magnitude of the differences) to obtain unit length and maintain consistency in the use of cosine as the main metric for vector similarity in specific textual documents. The core idea behind doc2vec is that document representations must be so precise as to anticipate the words or context of those documents.
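The unit-length normalization and cosine-based retrieval of the closest documents can be sketched as below. The report identifiers and vectors are invented for illustration, and `most_similar` is a hypothetical helper name, not a function from the study.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; magnitudes are discarded."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


def most_similar(query_id, doc_vecs, top_n=6):
    """Return the top_n (id, similarity) pairs closest to the query document."""
    query = doc_vecs[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in doc_vecs.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up document vectors; in the study these would come from doc2vec.
doc_vecs = {"r1": [1.0, 0.1], "r2": [2.0, 0.3], "r3": [0.0, 1.0], "r4": [-1.0, 0.0]}
neighbours = most_similar("r1", doc_vecs, top_n=2)
```

Because every vector is scaled to unit length first, two reports pointing in the same direction are treated as identical regardless of their magnitudes, which is the property the text relies on.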

3.2. Label Spreading (LS)

The Label Spreading (LS) approach, introduced by [43], is a semi-supervised learning technique that spreads known labels on a graph of unlabeled data. When applied to the classification of texts on air incidents, it can be used to disseminate HFACS markers in reports that lack a specific previous categorization. When using semi-supervised learning (LS), it is feasible to employ a limited set of labeled reports and spread these labels across a more extensive network of unlabeled reports for the identification of patterns associated with contributing elements that have not yet been explicitly acknowledged. In the study of [44], the authors explored the use of label propagation and Label Spreading in partially supervised text classification situations, demonstrating that these techniques can be effective in increasing accuracy in the classification of documents with incomplete information.

In this study, the process is carried out through iterations, where each node (representing a document or incident report) receives information from neighboring nodes, while retaining a fraction of its original information. This flow of information is symmetrical between the nodes, and the process continues until convergence is reached, at which point the labels of the unlabeled nodes (documents) are assigned based on the class that received the most information over the course of the iterations. This iterative diffusion process makes it possible for uncategorized incident reports to be labeled according to the most likely contributory factors, as defined by the HFACS-ML model. The affinity matrix is calculated using a radial basis function (RBF) to define the similarity between the documents, as given by the following equation:

A_{i j} = \exp (- Γ \cdot {‖x_{i} - x_{j}‖}^{2})

(3)

Here, A_{i j} represents the affinity between documents i and j, x_{i} and x_{j} are the feature vectors of the respective documents, and {‖x_{i} - x_{j}‖}^{2} is the squared Euclidean distance between them. The Γ (Gamma) hyperparameter controls the sensitivity to distance: a higher value of Γ implies a significant affinity only for very close documents, while lower Γ values allow for broader influences. This affinity matrix determines how much two documents can influence each other during the dissemination process: two documents separated by a shorter distance have a greater affinity, while documents farther apart have a weaker influence on each other [45]. We incorporated the dbow model into the Label Spreading classifier, with the aim of analyzing the behavior of class categorization in the process of data extraction and prediction. We used previously labeled data, separated in a stratified manner into training and test sets of various training sizes (partitions). Figure 5a–c present the distribution of the predictions for each class. Some instabilities are observed in the identification of labels, particularly at the level of precondition for unsafe act, where there is greater overlap and confusion between categories such as the condition of the operator and personnel factor. This pattern may indicate the complexity of the factors involved at this level, deserving greater attention.

However, overall, modeling the data with LS yields reasonable performance, with accuracy scores of 0.77 for unsafe supervision, 0.68 for the precondition for unsafe act, and 0.88 for unsafe act. These scores can be improved by using other classifiers and efficient hyperparameter tuning techniques. Another measure would be to increase the amount of data, which would allow the weights of the less represented categories to be adjusted.
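The diffusion process of this section can be sketched in a few lines. This is a minimal illustration, assuming the standard Label Spreading iteration with a symmetrically normalized affinity matrix; the source does not state which implementation was used. The parameter names `alpha` (fraction of information taken from neighbors) and `n_iter`, as well as the toy data, are assumptions introduced here.

```python
import math


def rbf_affinity(X, gamma):
    """Equation (3): A_ij = exp(-gamma * ||x_i - x_j||^2), with a zero diagonal."""
    n = len(X)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # a node does not reinforce itself
                d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
                A[i][j] = math.exp(-gamma * d2)
    return A


def label_spreading(X, labels, n_classes, gamma=1.0, alpha=0.8, n_iter=100):
    """labels[i] is a class index, or -1 for an unlabeled document."""
    n = len(X)
    A = rbf_affinity(X, gamma)
    deg = [sum(row) for row in A]
    # Symmetric normalization S = D^{-1/2} A D^{-1/2}, so information flows
    # symmetrically between nodes.
    S = [
        [A[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    # One-hot initial labels; unlabeled rows are all zeros.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)] for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(n_iter):
        # Each node receives information from its neighbors (weight alpha)
        # while retaining a fraction (1 - alpha) of its original label.
        F = [
            [alpha * sum(S[i][k] * F[k][c] for k in range(n)) + (1 - alpha) * Y[i][c]
             for c in range(n_classes)]
            for i in range(n)
        ]
    # Assign each node the class that accumulated the most information.
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]


# Two well-separated toy clusters with one labeled document each.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels = [0, -1, -1, 1, -1, -1]
predicted = label_spreading(X, labels, n_classes=2)
```

A fixed number of iterations is used here for simplicity; a fuller implementation would stop once the change in F falls below a tolerance.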

3.3. ML Classifiers

Each classifier has an essential role in the supervised classification procedure. These classifiers undergo tests to identify the model that best fits the data. Next, we describe the role of each classifier used in the development of the present study.

(i): K-Nearest Neighbors (KNN)

KNN is a supervised machine learning algorithm, commonly applied to classification and regression problems [46]. Its approach consists of identifying the k data points closest to a specific point of interest, based on a relevant distance measure. In our study, it seeks to predict the correct class of each test point by calculating the distance between that point and all training points.

(ii): Random Forest (RF)

RF is a decision-tree-based ensemble learning technique in which each tree is trained on a random subset of features in the data universe to avoid overfitting and improve predictive accuracy [47]. This method aggregates predictions from multiple decision trees, which individually capture different patterns in the data, with the final prediction determined by voting among the individual trees.

(iii): Extra Trees (ET)

Extra Trees is a machine learning method that stands out for its efficiency and speed in data analysis. It trains randomized decision trees on the training dataset, making the model robust and resistant to overfitting, to ensure greater accuracy and reliability of the results in the classification of texts [48]. It is generally an efficient technique due to its generalization capacity and training speed.

(iv): Extreme Gradient Boosting (XGBoost)

This is a promising algorithm, which uses a gradient-driven method, also based on sequentially constructed decision trees [49]. In our analysis, it employs a stepwise approach to optimize the ensemble by minimizing a loss function. In addition, it applies regularization techniques to avoid overfitting and improve the model’s performance. This method stands out for its efficiency when facing various machine learning challenges, incorporating techniques such as cross-validation, parallelization, etc.
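Of the four classifiers described above, KNN is simple enough to sketch in full. The example below is illustrative only: it uses Euclidean distance and a majority vote, whereas the distance measure and the value of k used in the study are hyperparameters to be tuned; the toy vectors and labels are invented.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest neighbors."""
    # Distance from the query to every training point (Euclidean, as an assumption).
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up 2-dimensional document vectors with HFACS-ML level labels.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["unsafe act"] * 3 + ["unsafe supervision"] * 3

pred_a = knn_predict(train_X, train_y, [0.5, 0.5], k=3)
pred_b = knn_predict(train_X, train_y, [5.5, 5.5], k=3)
```

The tree-based ensembles (RF, Extra Trees, XGBoost) follow the same fit-then-predict pattern but are too long to reproduce here.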

3.4. Adjusting Hyperparameters

Hyperparameter tuning plays a crucial role in the field of machine learning and exerts a great influence on the efficiency of the algorithms used. While the parameters of the model are learned during the training process, the hyperparameters are defined in advance and guide the overall behavior of the model itself [50]. Essentially, tuning involves testing different combinations to identify the most suitable settings for a specific problem. In addition, efficiently fitting hyperparameters can lead to more accurate models, achieving an optimal balance between bias and variance, thus dealing with issues such as overfitting or underfitting [51,52]. By changing different settings, such as learning rate and batch size, or applying different regularization terms to the model, it is possible to improve its efficiency and reduce the training time and computational resources required. This is particularly important in large-scale scenarios where resources are limited. Automated methods such as Grid Search and Bayesian Optimization have proven effective in exploring different settings and minimizing the manual intervention required to find the best adjustments [53].

3.5. Evaluation Metrics

The evaluation of the models is conducted by comparing their performance through metrics such as precision, recall, F1-score, and accuracy, derived from a confusion matrix. These metrics take into account the elements of the confusion matrix, i.e., the True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) values. Table 3 presents the corresponding expression for each metric.

(a): Precision: The percentage of correctly predicted positive labels out of all predicted positive labels, with higher precision indicating that fewer negative samples have been wrongly classified as positive.
(b): Recall: The rate of positive samples that were correctly identified in relation to the total of actual positive samples, with a higher value associated with fewer misclassified positive samples.
(c): F1-score: The harmonic mean of precision and recall (also called sensitivity). Its values range from 0 to 1. It provides insight into the model’s ability to accurately classify data points, as well as into the model’s consistency.
(d): Accuracy: The ratio of correct predictions to the total number of predictions made by the classifier.
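As a worked example of the four metrics, the following sketch evaluates the expressions of Table 3 for a binary confusion matrix. The function name and the counts (TP = 40, FP = 10, TN = 45, FN = 5) are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, F1-score and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, chosen only to illustrate the arithmetic.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts, precision is 40/50 = 0.80, recall is 40/45 ≈ 0.89, the F1-score is 80/95 ≈ 0.84, and accuracy is 85/100 = 0.85.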

3.5.1. Grid Search

Grid Search is a technique that exhaustively tests all combinations of a predefined range of values for each hyperparameter [54]. In this work, it is used to identify the most informative characteristics to reduce the size of classification models and to find the model that best fits the trade-off between precision and recall. After selection, the informative features are fed into the classification models. A grid search ensures that all combinations of hyperparameters are considered [55,56]. However, a challenge arises when more hyperparameters are added, as the number of combinations increases exponentially. Cross-validation was used for performance evaluation: the training set was divided into several partitions, with the model trained on some partitions and validated on another. This process ensured that the evaluation was not influenced by a specific division of the data, which increases the robustness of the results. Table 4 presents the layout of the hyperparameters for each classifier.
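The combination of exhaustive search and cross-validation can be sketched as follows. This is a minimal pure-Python illustration, not the pipeline used in this study: the small KNN implementation, the five interleaved folds, the synthetic two-class data, and the use of accuracy as the selection score are all assumptions. The grid values follow the KNN rows of Table 4, with hyperparameter names written in underscore form.

```python
import itertools
import random

def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1.0 / p)

def knn_predict(train_X, train_y, x, n_neighbors, weights, p):
    """Predict the label of x by (optionally distance-weighted) neighbor voting."""
    dists = sorted((minkowski(row, x, p), label) for row, label in zip(train_X, train_y))
    votes = {}
    for d, label in dists[:n_neighbors]:
        w = 1.0 if weights == "uniform" else 1.0 / (d + 1e-9)
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

def cv_accuracy(X, y, params, n_folds=5):
    """Mean accuracy over interleaved folds: train on the rest, validate on one fold."""
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        train_X = [X[i] for i in range(len(X)) if i not in test_idx]
        train_y = [y[i] for i in range(len(X)) if i not in test_idx]
        for i in test_idx:
            correct += knn_predict(train_X, train_y, X[i], **params) == y[i]
    return correct / len(X)

def grid_search(X, y, grid):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((cv_accuracy(X, y, params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_score, best_params, results

# Hypothetical synthetic data: two Gaussian clusters in two dimensions.
rng = random.Random(0)
centers = {0: 0.0, 1: 3.0}
X = [[rng.gauss(c, 1.0), rng.gauss(c, 1.0)] for c in centers.values() for _ in range(30)]
y = [label for label in centers for _ in range(30)]

knn_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"], "p": [1, 2]}
best_score, best_params, results = grid_search(X, y, knn_grid)
```

Here the grid contains 4 × 2 × 2 = 16 combinations, and each one is evaluated on every fold, so the cost grows multiplicatively with each hyperparameter added.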

These configurations had a great impact on the performance of each model. For KNN, the selected parameters adjust the number of nearest neighbors, the distance measure, and the weight assigned to each neighbor, to achieve a balance between simplicity and accuracy in local classification. For RF and ET, the parameters guide the growth of the trees and control the splitting of nodes, to reduce variance and avoid overfitting. Using these two closely related models provided, on the one hand, additional stability to the process and, on the other hand, added randomness with an increase in tree diversity. For XGBoost, parameters such as learning rate and subsampling were selected to achieve convergence efficiently while avoiding overfitting. Tuning the hyperparameters of this model is essential because boosting corrects errors gradually: poorly adjusted parameters could cause the model to learn too quickly (overfitting) or too slowly (underfitting) [57,58].

3.5.2. Bayesian Optimization

Bayesian Optimization is an approach to hyperparameter tuning that uses a Bayesian model to predict performance as a function of the hyperparameters [59]. This method typically relies on three components: (1) a surrogate model; (2) an acquisition function; and (3) an optimization algorithm. In our pipeline, the surrogate model is a probabilistic model used to approximate performance, defining a loss function associated with different selections of hyperparameters. The acquisition function selects the hyperparameters to evaluate next, seeking a balance between exploration (selecting hyperparameters where the uncertainty of the surrogate model’s prediction is high) and exploitation (selecting hyperparameters where the surrogate model predicts the best value). In this way, the optimization algorithm suggests a new set of hyperparameters, which is evaluated and added to the training data of the surrogate model. A model or algorithm may be appropriate for a specific dataset, but if its hyperparameters are not well adjusted to that dataset, the model’s true potential will not be achieved [60]. Bayesian Optimization thus offers a way to automatically find the best settings for machine learning models.
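The interplay of the three components can be sketched for a single hyperparameter. The choices below are assumptions made for illustration and are not taken from this study: a Gaussian process with a squared-exponential kernel as the surrogate model, an upper confidence bound as the acquisition function, and a simple search over a fixed list of candidates as the optimization algorithm. The objective is a hypothetical validation score with a peak at 0.55.

```python
import math

def rbf(a, b, length=0.15):
    """Squared-exponential kernel between two scalar hyperparameter values."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Surrogate model: posterior mean and standard deviation at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)
    v = solve(K, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v)), 0.0)
    return mean, math.sqrt(var)

def ucb(mean, std, kappa=2.0):
    """Acquisition function: predicted value (exploitation) plus uncertainty (exploration)."""
    return mean + kappa * std

def bayes_opt(objective, candidates, n_init=3, n_iter=10):
    """Evaluate a few initial points, then repeatedly evaluate the acquisition maximizer."""
    step = max(1, len(candidates) // n_init)
    xs = candidates[::step][:n_init]
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        remaining = [c for c in candidates if c not in xs]
        if not remaining:
            break
        # Optimization step: search the candidates for the highest acquisition value.
        x_next = max(remaining, key=lambda c: ucb(*gp_posterior(xs, ys, c)))
        xs.append(x_next)
        ys.append(objective(x_next))  # new observation for the surrogate
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs

def objective(x):
    """Hypothetical validation score of a hyperparameter rescaled to [0, 1]."""
    return math.exp(-((x - 0.55) ** 2) / 0.02)

candidates = [i / 50 for i in range(51)]
best_x, best_y, evaluated = bayes_opt(objective, candidates)
```

Only 13 of the 51 candidate values are evaluated in this sketch, whereas an exhaustive search over the same candidates would evaluate all of them.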

3.6. Metrics Results

The experimental results of the classifiers optimized by Grid Search and Bayesian Optimization are presented in Table 5. The analysis provides insights into the performance of the classifiers in terms of the precision, recall, F1-score, and accuracy metrics for each of the three categories, thus evidencing the models’ behavior throughout the prediction processes.

In the unsafe supervision category, the KNN classifier stood out with a precision of 0.96 and a recall of 0.97 after Bayesian Optimization, a slightly better performance than that obtained with Grid Search (precision: 0.95; recall: 0.96). RF also achieved consistent results with both approaches, with a small improvement in precision under Bayesian Optimization (precision: 0.95; recall: 0.96) compared to Grid Search (precision: 0.92; recall: 0.96). ET performed slightly below KNN and RF, achieving a precision of 0.92 and a recall of 0.96 with Bayesian Optimization, while XGBoost performed clearly lower. XGBoost nevertheless showed a substantial increase in F1-score with Bayesian Optimization (from 0.73 to 0.80), indicating that Bayesian Optimization can have a greater impact on this specific classifier.

In the precondition for unsafe act category, the overall performance is lower than for the unsafe supervision category. In this case, KNN and XGBoost performed comparably, both achieving F1-scores of 0.80 after Bayesian Optimization. RF had the lowest precision (0.73), with a recall of 0.79, while ET showed improvements with Bayesian Optimization compared to Grid Search, notably in the F1-score (0.77 compared to 0.76). XGBoost achieved the highest precision at this level after Bayesian Optimization (0.83), although its recall and accuracy (both 0.78) were slightly below those of the other classifiers, which suggests that this classifier fit the dataset at this level reasonably well.

In the unsafe act category, the results show greater homogeneity among the classifiers. KNN was again the best-performing classifier with Bayesian Optimization: it obtained a precision of 0.91 and a recall of 0.94, remaining consistent with the other classifiers and with its own performance at the other levels. The RF and ET classifiers performed almost identically for both optimization methods. Interestingly, XGBoost, despite improving over Grid Search (precision from 0.76 to 0.80), had the lowest performance in this category. Figure 6a,b illustrate the overall performance of the classifiers with both optimization methods.

In terms of the effectiveness of the optimization approaches, classifiers tuned with Bayesian Optimization generally have an advantage over those tuned with Grid Search. This may be explained by the fact that Bayesian Optimization uses a probabilistic model that explores and exploits the hyperparameter space more efficiently than Grid Search, which performs a systematic and exhaustive search.
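The size of this advantage can be read directly from Table 5. The short computation below copies the F1-scores from that table (Grid Search first, Bayesian Optimization second) and reports the gain per classifier and level; the variable names are illustrative.

```python
# F1-scores copied from Table 5: (Grid Search, Bayesian Optimization).
f1_scores = {
    "Unsafe Supervision": {"KNN": (0.95, 0.96), "Random Forest": (0.94, 0.95),
                           "Extra Trees": (0.93, 0.94), "XGBoost": (0.73, 0.80)},
    "Precondition for Unsafe Act": {"KNN": (0.79, 0.80), "Random Forest": (0.75, 0.76),
                                    "Extra Trees": (0.76, 0.77), "XGBoost": (0.79, 0.80)},
    "Unsafe Act": {"KNN": (0.92, 0.92), "Random Forest": (0.91, 0.91),
                   "Extra Trees": (0.91, 0.91), "XGBoost": (0.75, 0.79)},
}

# Gain of Bayesian Optimization over Grid Search, rounded to the table's precision.
gains = {
    (level, clf): round(bo - gs, 2)
    for level, by_clf in f1_scores.items()
    for clf, (gs, bo) in by_clf.items()
}
mean_gain = sum(gains.values()) / len(gains)
largest = max(gains, key=gains.get)
```

The F1-score never decreases under Bayesian Optimization, the mean gain over the twelve classifier and level pairs is 0.015, and the largest gain (0.07) is that of XGBoost in the unsafe supervision category.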

4. Discussion and Conclusions

4.1. Discussion

This study compares the performances of the KNN, RF, ET, and XGBoost classifiers, each optimized with two methods (Grid Search and Bayesian Optimization). The data samples and document vectors combined with the LS algorithm proved promising for classification in datasets with unknown labels. Each approach relies on different algorithms and modeling assumptions to produce a classification.

The results indicate that KNN is better suited to the classification of contextual data, especially at the unsafe supervision and unsafe act levels. This may be because these are the levels with the fewest labels. RF and ET also performed well in the classification process, although we believe that these classifiers could further improve their performance with larger and more uniformly labeled datasets, since larger and more balanced data would provide a more representative sample distribution, reducing bias and variance in the models.

In our framework, XGBoost, although competitive, performs worse than the other classifiers at the precondition for unsafe act and unsafe act levels. This suggests that the model may not be the most appropriate for identifying the precursor conditions at these levels.

4.2. Contribution

The HFACS framework serves as a structured method for identifying and classifying the human factors contributing to the aviation accidents analyzed in this study. The framework was applied to analyze root causes across four levels, which provided clear categorizations for machine learning analysis. For example, the model is capable of identifying preconditions for unsafe acts, including scenarios where environmental and human factors such as poor weather and fatigue coincided, influencing crew decision-making. Similarly, the analysis revealed organizational influences, including cultural issues that discourage incident reporting, which could otherwise have provided early warnings to avert more serious accidents.

The choice of the HFACS framework was motivated by its ability to systematically categorize complex accident data into manageable segments for detailed analysis using advanced machine learning techniques. This approach not only helps in pinpointing specific categories of human error but also enhances the understanding of how such factors interact at various levels, leading to more targeted interventions.

Regarding the technical outputs, the model provides aviation industry experts and researchers with valuable insights into the systemic and human factors influencing safety outcomes. Using ML to analyze data within the HFACS categories, this study provides labeling capabilities for potential safety hazards. This enables more effective preventative measures and supports the refinement of training programs and safety policies based on data-driven insights. Additionally, the identification of organizational cultural issues can guide strategic changes to promote a safety-first approach and improve communication practices. Overall, the integration of ML with the HFACS framework aids in the optimization of safety management systems, making a significant contribution to enhancing aviation safety.

4.3. Conclusions

The results obtained with Grid Search and Bayesian Optimization demonstrate the relevance of using efficient optimization methods to improve the performance of classifiers. Hyperparameter optimization is a crucial aspect of the model construction process and can have significant implications for predictive performance when accuracy matters. In particular, the results observed in this study with Bayesian Optimization further reinforce the idea that the use of machine learning tools for the accurate classification of incidents and accidents helps to identify their real nature and supports situational decision-making.

Limitations were observed in this study regarding the small amount of labeled data, especially in the unsafe supervision and unsafe act categories. For further studies, it would be pertinent to use other databases with larger datasets to analyze how the models behave in a larger-scale environment. This type of expansion would provide insights into the generalizability of optimized classifiers, both with Grid Search and Bayesian Optimization, and would help validate the robustness of the proposed approach in broader and more complex scenarios. These complementary studies would allow for a continuous refinement of the models, contributing to a more effective use of the HFACS-ML framework in the identification and mitigation of contributing factors in air incidents and accidents.

As to future work, since large semantic models are expected to improve their performance and increase their reliability, we plan to integrate them into future studies to expand and compare results, once the issues identified in [33] are solved.

Author Contributions

Conceptualization, D.V. and R.M.; methodology, R.M., D.V., and F.L.L.; software, T.M. and F.L.L.; validation, L.F.F.M.S.; resources, R.M.; data curation, F.L.L. and T.M.; writing—original draft preparation, F.L.L.; writing—review and editing, D.V. and L.F.F.M.S.; visualization, F.L.L.; supervision, D.V. and R.M.; project administration, R.M.; funding acquisition, R.M. All authors have read and agreed to the published version of the manuscript.

Funding

The author Flávio Lázaro acknowledges a scholarship from Projecto de Desenvolvimento de Ciência e Tecnologia, from MESCTI, number 011/D-UL/PDCT-M003/2022. The authors acknowledge Fundação para a Ciência e a Tecnologia (FCT) for its financial support via the projects LAETA Base Funding (https://doi.org/10.54499/UIDB/50022/2020) and LAETA Programatic Funding (https://doi.org/10.54499/UIDP/50022/2020). The authors acknowledge https://doi.org/10.54499/LA/P/0079/2020.

Data Availability Statement

Data available in a publicly accessible repository [27].

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ali, T.; Khazaei, H.; Moghaddam, M.H.Y.; Hassan, Y. Machine Learning in Transportation; Hindawi: London, UK, 2019.
2. Santos, L.F.; Melicio, R. Stress, pressure and fatigue on aircraft maintenance personal. Int. Rev. Aerosp. Eng. 2019, 12, 35–45.
3. Cusick, S.K.; Cortes, A.I.; Rodrigues, C.C. Commercial Aviation Safety, 6th ed.; McGraw Hill Education: New York, NY, USA, 2017.
4. Muecklich, N.; Sikora, I.; Paraskevas, A.; Padhra, A. The role of human factors in aviation ground operation-related accidents/incidents: A human error analysis approach. Transp. Eng. 2023, 13, 100184.
5. International Air Transport Association. Annual Report 2023. IATA. 2023. Available online: https://www.iata.org/contentassets/c81222d96c9a4e0bb4ff6ced0126f0bb/annual-review-2023.pdf (accessed on 9 May 2024).
6. ICAO. Annual Report of the Council to the Assembly. 2023. Available online: https://www.icao.int/about-icao/Annual_Report_2023_EN/AnnualReport2023.html#p=1 (accessed on 9 May 2024).
7. Dong, T.; Yang, Q.; Ebadi, N.; Luo, X.R.; Rad, P. Identifying Incident Causal Factors to Improve Aviation Transportation Safety: Proposing a Deep Learning Approach. J. Adv. Transp. 2021, 2021, 5540046.
8. Shi, D.; Guan, J.; Zurada, J.; Manikas, A. A data-mining approach to identification of risk factors in safety management systems. J. Manag. Inf. Syst. 2017, 34, 1054–1081.
9. Dodd, R.S.; Eldredge, D.; Mangold, S.J. A Review and Discussion of Flight Management System Incidents Reported to the Aviation Safety Reporting System; The National Academies of Sciences, Engineering, and Medicine: Washington, DC, USA, 1992.
10. Lázaro, F.L.; Nogueira, R.P.R.; Melicio, R.; Valério, D.; Santos, L.F.F.M. Human Factors as Predictor of Fatalities in Aviation Accidents: A Neural Network Analysis. Appl. Sci. 2024, 14, 640.
11. National Research Council. Improving the Continued Airworthiness of Civil Aircraft: A Strategy for the FAA’s Aircraft Certification Service; The National Academies Press: Washington, DC, USA, 1998.
12. Schreiber, F. Human Performance Error Management. 2007. Available online: https://skybrary.aero/bookshelf/books/1640.pdf (accessed on 23 May 2024).
13. Othman, N.; Fairuz, I. Mental Workload Evaluation of Aircraft Operators’ Using Pupil Dilation and Nasa-Task Load Index. Int. Rev. Aerosp. Eng. 2016, 9, 80.
14. ICAO. Annex 19 to the Convention on International Civil Aviation–Safety Management; ICAO: Montreal, QC, Canada, 2013.
15. ICAO. Safety Report. 2024. Available online: https://www.icao.int/safety/Documents/ICAO_SR_2024.pdf (accessed on 2 August 2024).
16. International Air Transport Association (IATA), International Civil Aviation Organization (ICAO), and International Federation of Air Line Pilots’ Associations (IFALPA). Fatigue Management Guide for Airline Operators; International Civil Aviation Organization: Montreal, QC, Canada, 2015; Available online: https://www.icao.int/safety/fatiguemanagement/frms%20tools/frms%20implementation%20guide%20for%20operators%20july%202011.pdf (accessed on 4 June 2024).
17. Chen, W.; Huang, S. Human Reliability Analysis for Visual Inspection in Aviation Maintenance by a Bayesian Network Approach. Transp. Res. Rec. J. Transp. Res. Board 2014, 2449, 105–113.
18. Franciosi, C.; Pasquale, V.D.; Iannone, R.; Miranda, S. A Taxonomy of Performance Shaping Factors for Human Reliability Analysis in Industrial Maintenance. J. Ind. Eng. Manag. 2019, 12, 115–132.
19. Ng, Y.S.R.; Rashid, H. Enhancing human performance reliability in aircraft pushback operations. Int. J. Qual. Reliab. Manag. 2019, 36, 485–509.
20. Li, X.; Guo, Y.; Ge, F.L.; Yang, F.Q. Human reliability assessment on building construction work at height: The case of scaffolding work. Saf. Sci. 2023, 159, 106021.
21. Chen, J.; Diao, M.; Zhang, C. Predicting airline additional services consumption willingness based on high-dimensional incomplete data. IEEE Access 2022, 10, 39596–39603.
22. Lee, H.; Madar, S.; Sairam, S.; Puranik, T.G.; Payan, A.P.; Kirby, M.; Pinon, O.J.; Mavris, D.N. Critical Parameter Identification for Safety Events in Commercial Aviation Using Machine Learning. Aerospace 2020, 7, 73.
23. Sarker, I.H. Machine learning: Algorithms, real-world applications and research directions. SN Comput. Sci. 2021, 2, 160.
24. Le Clainche, S.; Ferrer, E.; Gibson, S.; Cross, E.; Parente, A.; Vinuesa, R. Improving aircraft performance using machine learning: A review. Aerosp. Sci. Technol. 2023, 138, 108354.
25. Yang, C.; Huang, C. Natural language processing (NLP) in aviation safety: Systematic review of research and outlook into the future. Aerospace 2023, 10, 600.
26. Perboli, G.; Gajetti, M.; Fedorov, S.; Giudice, S.L. Natural Language Processing for the identification of Human factors in aviation accidents causes: An application to the SHEL methodology. Expert Syst. Appl. 2021, 186, 115694.
27. Demir, G.; Moslem, S.; Duleba, S. Artificial Intelligence in Aviation Safety: Systematic Review and Biometric Analysis. Int. J. Comput. Intell. Syst. 2024, 17, 279.
28. Madeira, T.; Melício, R.; Valério, D.; Santos, L. Machine Learning and Natural Language Processing for Prediction of Human Factors in Aviation Incident Reports. Aerospace 2021, 8, 47.
29. ASN. Aviation Safety Database. 2024. Available online: https://aviation-safety.net/database/ (accessed on 12 March 2024).
30. Wiegmann, D.A.; Shappell, S.A. A Human Error Approach to Aviation Accident Analysis. In The Human Factors Analysis and Classification System; Ashgate Publishing Limited: Aldershot, UK, 2003.
31. Zhou, X.; Gururajan, R.; Li, Y.; Venkataraman, R.; Tao, X.; Bargshady, G.; Barua, P.D.; Kondalsamy-Chennakesavan, S. A survey on text classification and its applications. Web Intell. 2020, 18, 205–216.
32. De Vries, V. Classification of Aviation Safety Reports Using Machine Learning. In Proceedings of the International Conference on Artificial Intelligence and Data Analytics for Air Transportation (AIDA-AT), Singapore, 3–4 February 2020; pp. 1–6.
33. Bird, S.; Klein, E.; Loper, E. Natural Language Toolkit (NLTK) Documentation. 2023. Available online: https://www.nltk.org/ (accessed on 23 May 2024).
34. Bastian, M. GPT-4 Has More Than a Trillion Parameters—Report. The Decoder: AI in Practice. 2023. Available online: https://the-decoder.com/gpt-4-has-a-trillion-parameters/ (accessed on 13 May 2024).
35. Saunders, D.; Hu, K.; Li, W.C. The Process of Training ChatGPT Using HFACS to Analyse Aviation Accident Reports. In Ergonomics & Human Factors, Proceedings of the Conference, 22–24 April 2024, Kenilworth, UK; Chartered Institute of Ergonomics and Human Factors (CIEHF): Loughborough, UK, 2024; Available online: https://publications.ergonomics.org.uk/uploads/The-Process-of-Training-ChatGPT-Using-HFACS-to-Analyse-Aviation-Accident-Reports.pdf/ (accessed on 28 November 2024).
36. Patil, R.; Boit, S.; Gudivada, V.; Nandigam, J. A survey of text representation and embedding techniques in nlp. IEEE Access 2023, 11, 36120–36146.
37. Iscen, A.; Tolias, G.; Avrithis, Y.; Chum, O. Label Propagation for Deep Semi-supervised Learning. arXiv 2019.
38. Burkart, N.; Huber, M.F. A survey on the explainability of supervised machine learning. J. Artif. Intell. Res. 2021, 70, 245–317.
39. Lau, J.H.; Baldwin, T. An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP; Association for Computational Linguistics: Berlin, Germany, 2016; pp. 78–86.
40. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 22–24 June 2014; pp. 1188–1196.
41. Li, Y.; Yang, T. Word embedding for understanding natural language: A survey. In Guide to Big Data Applications, Studies in Big Data; Springer: Cham, Switzerland, 2018; Volume 26, pp. 83–104.
42. Robinson, S.D. Multi-Label Classification of Contributing Causal Factors in Self-Reported Safety Narratives. Safety 2018, 4, 30.
43. Zhou, D.; Bousquet, O.; Lal, T.; Weston, J.; Schölkopf, B. Learning with local and global consistency. In Advances in Neural Information Processing Systems; Thrun, S., Saul, L., Schölkopf, B., Eds.; The MIT Press: Cambridge, MA, USA, 2004; Volume 16, pp. 321–328.
44. Yu, M.; Zhou, Y.; Li, R.; Wang, X.; Zhong, Y. Semi-supervised learning via manifold regularization. J. China Univ. Post Telecommun. 2012, 19, 79–88.
45. Niyogi, P. Manifold regularization and semi-supervised learning: Some theoretical analyses. J. Mach. Learn. Res. 2013, 14, 1229–1250.
46. Peterson, L.E. K-nearest neighbor. Scholarpedia 2009, 4, 1883.
47. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
48. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42.
49. Jiao, Y.; Dong, J.; Han, J.; Sun, H. Classification and Causes Identification of Chinese Civil Aviation Incident Reports. Appl. Sci. 2022, 12, 10765.
50. Mehdary, A.; Chehri, A.; Jakimi, A.; Saadane, R. Hyperparameter Optimization with Genetic Algorithms and XGBoost: A Step Forward in Smart Grid Fraud Detection. Sensors 2024, 24, 1230.
51. Yang, L.; Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 2020, 415, 295–316.
52. Li, H.; Chaudhari, P.; Yang, H.; Lam, M.; Ravichandran, A.; Bhotika, R.; Soatto, S. Rethinking the Hyperparameters for Fine-tuning. arXiv 2020, arXiv:2002.11770.
53. Khalid, R.; Javaid, N. A survey on hyperparameters optimization algorithms of forecasting models in smart grid. Sustain. Cities Soc. 2020, 61, 102275.
54. Omar, M.; Yakub, F.; Abdullah, S.S.; Abd Rahim, M.S.; Zuhairi, A.H.; Govindan, N. One-step vs horizon-step training strategies for multi-step traffic flow forecasting with direct particle swarm optimization grid search support vector regression and long short-term memory. Expert Syst. Appl. 2024, 252, 124154.
55. Hutter, F.; Kotthoff, L.; Vanschoren, J. Automated Machine Learning: Methods, Systems, Challenges, 1st ed.; Springer Publishing Company, Incorporated: Berlin/Heidelberg, Germany, 2019.
56. Pedregosa, F.; Michel, V.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Vanderplas, J.; Cournapeau, D.; Pedregosa, F.; Varoquaux, G.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
57. Shi, X.; Wong, Y.D.; Li, M.Z.F.; Palanisamy, C.; Chai, C. A feature learning approach based on xgboost for driving assessment and risk prediction. Accid. Anal. Prev. 2019, 129, 170–179.
58. Shen, X.; Wei, S. Application of XGBoost for Hazardous Material Road Transport Accident Severity Analysis. IEEE Access 2020, 8, 206806–206819.
59. Wang, X.; Jin, Y.; Schmitt, S.; Olhofer, M. Recent advances in Bayesian optimization. ACM Comput. Surv. 2023, 55, 1–36.
60. Azevedo, B.F.; Rocha, A.M.A.C.; Pereira, A.I. Hybrid approaches to optimization and machine learning methods: A systematic literature review. Mach. Learn. 2024, 113, 4055–4097.

Figure 1. Accidents and fatalities trend (2018–2023) (adapted from [15]).

Figure 2. HFACS machine learning (HFACS-ML) framework (adapted from [28]).

Figure 3. Layout of HFACS ML labels.

Figure 4. Process flowchart.

Figure 5. (a) Confusion matrix of unsafe supervision; (b) confusion matrix of precondition for unsafe act; (c) confusion matrix of unsafe act.

Figure 6. Global accuracy performance, as given in Table 5: (a) Grid Search; (b) Bayesian Optimization.

Table 1. Full-label distribution summary.

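HFACS-ML Level	HFACS-ML Category	Tag Count
Unsafe Supervision	Failed to Correct Problem	3
Unsafe Supervision	Inadequate Supervision	21
Unsafe Supervision	Planned Inappropriate Operation	6
Unsafe Supervision	Supervisory Violation	57
Unsafe Supervision	Undetermined	4
Unsafe Supervision	Not Available	23
Precondition for Unsafe Act	Condition of Operator	61
Precondition for Unsafe Act	Personnel Factor	85
Precondition for Unsafe Act	Physical Environment 1	184
Precondition for Unsafe Act	Physical Environment 2	51
Precondition for Unsafe Act	Technological Environment	11
Precondition for Unsafe Act	Undetermined	4
Precondition for Unsafe Act	Not Available	6
Unsafe Act	Error	55
Unsafe Act	Violation	48
Unsafe Act	Undetermined	14
Unsafe Act	Not Available	5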

Table 2. Most similar documents.

		HFACS-ML Levels
Document	id	Unsafe Supervision	Precondition	Unsafe Act
Reference	520	Supervisory Violation	Personnel Factor	Violation
Most Similar Documents	326	Supervisory Violation	Personnel Factor	Violation
	210	Supervisory Violation	Personnel Factor	Violation
	1627	Supervisory Violation	Physical Environment 1	Error
	191	Supervisory Violation	Personnel Factor	Violation
	1485	Supervisory Violation	Condition of Operator	Error
	187	Supervisory Violation	Personnel Factor	Violation
Scores		6/6	4/6	4/6
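The scores row of Table 2 appears to count, for each level, how many of the six most similar documents carry the same label as the reference document. A small sketch under that reading, with the labels copied from the table and illustrative variable names:

```python
# Labels per document: (Unsafe Supervision, Precondition, Unsafe Act), from Table 2.
reference = ("Supervisory Violation", "Personnel Factor", "Violation")  # id 520
similar = {
    326: ("Supervisory Violation", "Personnel Factor", "Violation"),
    210: ("Supervisory Violation", "Personnel Factor", "Violation"),
    1627: ("Supervisory Violation", "Physical Environment 1", "Error"),
    191: ("Supervisory Violation", "Personnel Factor", "Violation"),
    1485: ("Supervisory Violation", "Condition of Operator", "Error"),
    187: ("Supervisory Violation", "Personnel Factor", "Violation"),
}

# For each level, count the similar documents that share the reference label.
scores = [
    sum(labels[level] == reference[level] for labels in similar.values())
    for level in range(len(reference))
]
```

Under this assumption, the counts are 6, 4, and 4 out of 6, matching the scores reported in the table.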

Table 3. Expressions of each metric.

Metrics	Formula
Precision	$\frac{TP}{TP + FP}$
Recall	$\frac{TP}{TP + FN}$
F1-score	$2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
Accuracy	$\frac{TP + TN}{TP + FP + TN + FN}$

Table 4. Hyperparameter layout for each classifier on Grid Search.

Classifiers	Hyperparameter	Values
KNN	n neighbors	3, 5, 7, 9
	weights	uniform, distance
	p	1, 2
RF/ET	n estimators	100, 200, 300
	max depth	10, 20, 30
	min samples split	2, 5, 10
	min samples leaf	1, 2, 4
XGBoost	learning rate	0.01, 0.3
	max depth	3, 6, 9, 12
	n estimators	100, 200, 300
	subsample	0.8, 0.2
	colsample bytree	0.8, 0.2
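The number of combinations that an exhaustive search has to evaluate follows directly from Table 4, as the product of the number of values per hyperparameter. The short computation below makes this explicit. The hyperparameter names are written with underscores on the assumption that they correspond to the usual scikit-learn and XGBoost argument names, and the dictionary layout is illustrative.

```python
from math import prod

# Value lists copied from Table 4 (names in underscore form are an assumption).
grids = {
    "KNN": {
        "n_neighbors": [3, 5, 7, 9],
        "weights": ["uniform", "distance"],
        "p": [1, 2],
    },
    "RF/ET": {
        "n_estimators": [100, 200, 300],
        "max_depth": [10, 20, 30],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    "XGBoost": {
        "learning_rate": [0.01, 0.3],
        "max_depth": [3, 6, 9, 12],
        "n_estimators": [100, 200, 300],
        "subsample": [0.8, 0.2],
        "colsample_bytree": [0.8, 0.2],
    },
}

# Grid size = product of the number of candidate values per hyperparameter.
sizes = {name: prod(len(values) for values in grid.values()) for name, grid in grids.items()}
```

This gives 16 combinations for KNN, 81 for each of RF and ET, and 96 for XGBoost, each of which is evaluated once per cross-validation fold.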

Table 5. Results with the best prediction from the optimized classifiers (GS: Grid Search; BO: Bayesian Optimization).

Level	Classifier	GS Precision	GS Recall	GS F1-Score	GS Accuracy	BO Precision	BO Recall	BO F1-Score	BO Accuracy
Unsafe Supervision	KNN	0.95	0.96	0.95	0.96	0.96	0.97	0.96	0.97
Unsafe Supervision	Random Forest	0.92	0.96	0.94	0.96	0.95	0.96	0.95	0.96
Unsafe Supervision	Extra Trees	0.91	0.95	0.93	0.95	0.92	0.96	0.94	0.96
Unsafe Supervision	XGBoost	0.80	0.78	0.73	0.78	0.81	0.79	0.80	0.80
Precondition for Unsafe Act	KNN	0.79	0.80	0.79	0.80	0.79	0.82	0.80	0.82
Precondition for Unsafe Act	Random Forest	0.72	0.78	0.75	0.78	0.73	0.79	0.76	0.79
Precondition for Unsafe Act	Extra Trees	0.75	0.77	0.76	0.79	0.75	0.79	0.77	0.79
Precondition for Unsafe Act	XGBoost	0.80	0.78	0.79	0.78	0.83	0.78	0.80	0.78
Unsafe Act	KNN	0.90	0.94	0.92	0.94	0.91	0.94	0.92	0.94
Unsafe Act	Random Forest	0.88	0.94	0.91	0.94	0.88	0.94	0.91	0.94
Unsafe Act	Extra Trees	0.88	0.94	0.91	0.94	0.88	0.94	0.91	0.94
Unsafe Act	XGBoost	0.76	0.74	0.75	0.75	0.80	0.78	0.79	0.78

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
