1. Introduction
In the last two decades, safety risk management in civil aviation has shifted from post-accident investigations and analyses toward proactively identifying emerging safety hazards and using these proactive findings to supplement post-accident investigations. This shift requires new approaches that can process large amounts of multivariate time-series aviation data from various sources, where the data can be both time-varying and categorical. Identifying anomalous events within historical flight data is crucial for extracting safety hazards. These hazards can be associated with many factors, including adverse weather conditions (such as heavy rain and strong winds), mechanical failures, human error (pilot, air traffic controller), airspace congestion, inadequate flight planning, ground operations, difficult terrain, and bird strikes, among others. The standard anomaly detection technique applied to aviation data is exceedance detection [1]. These methods require domain knowledge and compare specific flight parameters against aircraft-dependent thresholds pre-defined by aviation subject matter experts. Because exceedance-based methods rely on rules with strict thresholds [2], they are limited in identifying new safety risks and in capturing useful information from multivariate and multimodal flight and environmental data with complex, nonlinear relationships that occur at different temporal scales. Recently, machine learning techniques have been investigated to fill this gap and automatically identify anomalies in multivariate flight data.
Exploring machine learning techniques to be used with flight sensor data for improving aviation safety is an active research field. Li et al. [3] introduced a cluster-based anomaly detection approach to detect abnormal flights, which can support domain experts in detecting anomalies and associated risks from routine airline operations. Their approach, “ClusterAD-Flight”, used data from the flight data recorder and applied the density-based spatial clustering of applications with noise (DBSCAN) algorithm to perform cluster analysis and detect abnormal flights with unusual data patterns. Pointing out that the need for predefined criteria or domain knowledge is a shortcoming of existing anomaly detection techniques, the authors stated that ClusterAD-Flight no longer requires these. Li et al. further extended their approach in [4], naming the extension “ClusterAD-DataSample”. In this extended approach, Gaussian Mixture Model (GMM)-based clustering is applied to digital flight data to detect flights with unusual data patterns, under the assumption that normal flights share common patterns while anomalies do not. The authors stated that, in comparison to ClusterAD-Flight, which decides whether the take-off or approach phase as a whole is abnormal, ClusterAD-DataSample can detect instantaneous abnormal data samples during flight. They noted that with their approach, airline safety experts can identify latent risks from daily operations without specifying what to look for in advance.
L. Basora, X. Olive, and T. Dubot provided a survey of data-driven anomaly detection approaches and their application to the aviation domain in [5]. The surveyed approaches included machine learning techniques such as clustering-based methods and advanced autoencoders. Following this survey, two of the authors, X. Olive and L. Basora, introduced a reconstruction-based anomaly detection technique using autoencoders [6] to detect and identify significant events in historical aircraft trajectory data. For flight data, the authors used Automatic Dependent Surveillance–Broadcast (ADS-B) trajectory data, since it is often more accessible than aircraft data. They investigated the trajectory anomaly scores computed by autoencoders for significant operational events such as re-routings or deconfliction measures and found that the highest anomaly scores corresponded to poor weather conditions, while anomalies with lower scores related to Air Traffic Control (ATC) tactical actions.
The National Aeronautics and Space Administration’s (NASA) Ames Research Center has produced data mining and machine learning tools for aviation safety, such as Multiple Kernel Anomaly Detection [7]. Another NASA software tool, the Automatic Discovery of Precursors in Time-Series, finds precursors in multidimensional time-series data and has been applied to flight anomalies such as missed approaches [8] and take-off stalls [9].
Gavrilovski et al. [10] surveyed data-mining techniques in the aviation domain and provided a review of published work. Janakiraman [11] introduced an approach that combines multiple-instance learning and deep recurrent neural networks for weakly supervised learning problems involving time-series flight data. Martinez et al. [12] introduced a methodology that performs precursor analysis and binary classification using Gradient Boosting frameworks and analyzes Flight Data Monitoring (FDM) temporal series with Long Short-Term Memory (LSTM) deep learning techniques. The authors found that aircraft speed, flap positions, altitude, rate of descent, and the meteorological conditions at the destination airport were the most relevant precursors, and that the investigated deep learning technique provided better forecasting performance.
Wang et al. [13,14] used surveillance track data and wind data to build and improve a Logistic Regression forecasting model for predicting unstable approaches (UA). The authors demonstrated that adding more features can improve prediction performance. Ackley et al. [15] developed a methodology based on supervised machine learning techniques to train a model for classifying time-series flight data into safety events and non-safety events in the approach and landing phases. Time-series digital flight data obtained from historical commercial aviation operations were used to train the model and to identify critical feature subsets and event precursors directly related to elevated levels of flight risk for commercial aircraft. Bleu-Laine et al. [16] introduced a methodology that leverages high-dimensional aviation data to predict multiple adverse events and discover their precursors. Their methodology used a deep learning model consisting of Convolutional Neural Networks (CNN) for each sensor data type to predict adverse events and determine their precursors. Recently, autoencoder architectures have been used for anomaly detection with time-series flight sensor data, utilizing the reconstruction error as the anomaly score [17,18]. Variational autoencoders (VAE) have also been investigated for their anomaly detection potential [1].
To better characterize flight behavior and emerging safety risks, there is a need to consider, in addition to time-series data from various aircraft flight sensors, several other environmental and operational parameters in the time-series data model, such as runway identifier, runway status, airport traffic, weather, wind, visibility, and temperature. These parameters can be time-varying and can also be categorical. In the case of cascading aircraft failures, such as sensor readout differences, it is difficult to detect and characterize the precursor events through threshold-exceedance monitoring alone before a catastrophic failure occurs. A novel detection paradigm is therefore needed that can monitor the states of multiple aircraft variables and correlate these states with the nominal or anomalous conditions of the aircraft through mathematical models trained with multivariate and multimodal flight data.
In this paper, we introduce a forecasting-based anomaly detection approach that uses multivariate aviation time-series data with the Temporal Fusion Transformer (TFT) architecture [19]. We show how a TFT model trained on nominal multivariate time-series data can be used for anomaly detection. As anomalies, we used flights that experienced a UA. We used Fisher’s linear discriminant [20] to demonstrate that the TFT model trained with nominal flight data is sensitive to UA flight data and can predict the temporal locations of a UA during the approach phase of the landing. The contributions of this paper are as follows:
- (a) Explored the feasibility of the TFT architecture with multivariate aviation time-series data for anomaly detection via nominal behavior learning.
- (b) Demonstrated that the trained TFT forecasting models for nominal behavior are sensitive enough to detect anomalous flight time-series sequences, such as UA, and to indicate the temporal locations of the anomaly.
- (c) Showed the feasibility of training a single TFT model to forecast multiple outputs for anomaly detection.
The paper is organized as follows: In Section 2, we describe the anomaly detection approach and summarize background information about the TFT. The multivariate time-series flight data used in this research, sourced from The MITRE Corporation’s TDP threaded track [21] and digital flight data [22], are also introduced in this section. In Section 3, we present the conducted investigations and summarize the results. In Section 4, we discuss our findings and address potential future work. Finally, in Section 5, we state our conclusions.
3. Results
Our first two sets of investigations analyzed the impact of varying the input features used in TFT model training and the output targets (i.e., single-output versus multi-output forecasting) on the Root Mean Square Error (RMSE) of the predictions. The third set of investigations generated a quantitative metric to demonstrate that the trained TFT model can differentiate anomalous time sequences from nominal ones and can identify temporal sections of anomalous behavior. For all considered TFT models, the training split was used to train the TFT via gradient descent using the PyTorch-Forecasting Python package [24]. The validation split was used to decide when to stop training (i.e., when the validation error reaches a local minimum). The test split was reserved to assess the performance of the TFT on data that the model had not seen before. The UA test split was used to assess the feasibility of anomaly detection based on RMSE thresholding. The hyperparameters used for the TFT model training are listed in Table 3. For technical information about these hyperparameters, refer to [24].
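As a concrete illustration, the following is a minimal sketch (not our exact pipeline) of training a TFT on the nominal training split with PyTorch-Forecasting, with early stopping on the validation loss. The column names, file names, and hyperparameter values shown here are illustrative assumptions rather than the settings in Table 3.

```python
# Minimal sketch of TFT training with PyTorch-Forecasting; column/file names
# and hyperparameters are illustrative assumptions, not the values in Table 3.
import pandas as pd
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer

train_df = pd.read_parquet("nominal_train.parquet")  # hypothetical file
val_df = pd.read_parquet("nominal_val.parquet")      # hypothetical file

training = TimeSeriesDataSet(
    train_df,
    time_idx="timestep",                  # integer time index per flight
    target="speed",                       # single-output forecasting target
    group_ids=["flight_id"],
    max_encoder_length=64,                # 64-timestep look-back window
    max_prediction_length=8,              # 8-timestep look-ahead window
    static_categoricals=["runway_id"],
    time_varying_unknown_reals=["speed", "altitude", "course", "climb_rate"],
)
validation = TimeSeriesDataSet.from_dataset(training, val_df)

tft = TemporalFusionTransformer.from_dataset(
    training, hidden_size=32, attention_head_size=4, dropout=0.1,
)
trainer = pl.Trainer(
    max_epochs=100,
    callbacks=[EarlyStopping(monitor="val_loss", patience=5)],  # stop at validation minimum
)
trainer.fit(
    tft,
    train_dataloaders=training.to_dataloader(train=True, batch_size=128),
    val_dataloaders=validation.to_dataloader(train=False, batch_size=128),
)
```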
3.1. Using Different Input Features in the TFT Model Training
A total of three different subsets of input feature combinations are considered when training TFT models, as shown in Table 4. The first input feature combination (TFT-1) contains time (time before touchdown), a runway identifier, and eight flight track features (latitude, longitude, altitude, speed, course, curvature, acceleration, and climb rate). The second input feature combination (TFT-2) contains the TFT-1 features plus two additional wind-related features (headwind and crosswind). The third input feature combination (TFT-3) contains the TFT-2 features plus four additional weather/wind-related features (wind direction, wind speed, visibility, and wind runway difference). Speed is set as the output target. This investigation was mainly intended to observe which of the input feature combinations would provide lower forecasting errors and to assess the impact of feature selection when training a TFT model. Intuitively, we assume that the TFT model with the lowest forecasting error on nominal test flight data is the best candidate for detecting and discriminating flight data that contains anomalous behavior or events.
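For reference, the three nested feature combinations can be written out as simple lists mirroring Table 4 (the identifier names are our own shorthand):

```python
# The three nested input feature combinations from Table 4 (names are shorthand).
TRACK_FEATURES = ["latitude", "longitude", "altitude", "speed",
                  "course", "curvature", "acceleration", "climb_rate"]
TFT_1 = ["time_before_touchdown", "runway_id"] + TRACK_FEATURES
TFT_2 = TFT_1 + ["headwind", "crosswind"]
TFT_3 = TFT_2 + ["wind_direction", "wind_speed", "visibility",
                 "wind_runway_difference"]
```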
The RMSE metric is used to assess the forecasting performance of the TFT models trained with different subsets of input features. Suppose $t$ corresponds to a forecast timestep for nominal flight data and speed is the targeted output for forecasting. At forecast timestep $t$, the TFT model makes predictions for speed at the look-ahead window timesteps ($t+1, \ldots, t+\tau$). For prediction, the TFT model uses the input data in the 64-timestep look-back window ($t-63, \ldots, t$). The RMSE computation at $t$, RMSE($t$), is mathematically expressed in (1), where $\hat{y}_{t+1}, \ldots, \hat{y}_{t+\tau}$ are the TFT-predicted speed values and $y_{t+1}, \ldots, y_{t+\tau}$ are the actual observed speed values. In (1), $\tau$ corresponds to the size of the look-ahead window, which is set to 8 in this work:

$$\mathrm{RMSE}(t) = \sqrt{\frac{1}{\tau}\sum_{i=1}^{\tau}\left(\hat{y}_{t+i} - y_{t+i}\right)^{2}} \quad (1)$$
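A direct implementation of (1) over all forecast timesteps of a single flight might look like the following sketch (the array shapes are our assumptions):

```python
import numpy as np

def rmse_profile(y_pred: np.ndarray, y_true: np.ndarray) -> np.ndarray:
    """Equation (1) evaluated at every forecast timestep of one flight.
    y_pred, y_true: shape (num_forecast_timesteps, tau) with tau = 8."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2, axis=1))

# The averaged profile over the nominal test split would then be, e.g.:
# avg_profile = np.mean([rmse_profile(p, y) for p, y in test_flights], axis=0)
```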
The RMSE profile of a nominal flight is formed by computing the RMSE values at each forecast timestep before touchdown (a total of 169 forecast timesteps). RMSE profiles are generated for each nominal flight in the test split. Although the nominal flights vary temporally from one another, making it impractical to compare the RMSE values of two separate flights directly, we averaged the resulting RMSE profiles over the nominal flights in the test split. This was done to identify the TFT model that provided lower RMSE values and to obtain a rough indication of the temporal locations where lower forecasting errors were observed.
Figure 3 shows the averaged RMSE profiles for all the nominal flights in the test split with four TFT models trained to predict the speed of the aircraft. The time before touchdown on the x-axis corresponds to the forecast timesteps. Three of these TFT models (TFT-1, TFT-2, TFT-3) correspond to the three different input feature combinations. As a benchmark, we also included the performance of a TFT model trained to predict speed solely as a function of the previously observed 64 timesteps of speed. This single-variate TFT model can be thought of as an Autoregressive Integrated Moving Average (ARIMA) [25] model. Notably, from Figure 3, it can be observed that the ARIMA-like TFT model is the worst performer at each forecast timestep, yielding higher forecast errors along the x-axis, which highlights that the TFTs have learned patterns from the multivariate inputs. The mean value of the averaged RMSE profiles (along the whole forecast time-series) is 3.43 for the ARIMA-like TFT model, 3.13 for TFT-1, 3.06 for TFT-2, and 3.11 for TFT-3. Among the other TFT models, it can be observed that, except for the first 55 timesteps, the TFT-2 model, which is trained with a runway identifier, flight track features, headwind, and crosswind, provided lower forecasting errors than the TFT-1 and TFT-3 models, indicating the importance of feature selection in model training.
One interesting attribute of the TFT architecture is that, through its variable selection networks (VSNs), it can learn global importance weights of input features and provide feature importance rankings. Figure 4 shows the resultant feature importance rankings for the three TFT models when used with the nominal test data split. To explore whether the TFT’s feature importance ranking capability could be used for feature selection to train TFT models with lower forecasting errors, we used all but the four least important features in the TFT-3 input feature combination (Figure 4c), so that the number of input features matched TFT-2 (since TFT-2 provided lower forecasting errors), to train a new TFT model. We label this input feature combination as “TFT-select”.
Figure 5 shows the averaged RMSE profiles obtained with this new input feature combination (TFT-select) and with TFT-2. From Figure 5, it is observed that the two averaged RMSE profiles are overall quite close to each other. The mean value of the averaged RMSE profiles (along the whole forecast time-series) is 3.02 for TFT-select and 3.06 for TFT-2 (it was 3.11 for TFT-3 in Figure 3). After the first 55 timesteps, TFT-2 provides slightly lower forecasting errors than TFT-select, while TFT-select performs significantly better within the first 55 timesteps. Overall, this result shows the feasibility of conducting feature selection using the information from the TFT’s feature importance rankings instead of a manual feature selection process.
A potential automated feature selection process for TFT model training could thus entail first identifying all available features, then training a TFT model with these features. The trained TFT model could then be applied to a nominal flight data set to obtain the resultant feature rankings. Based on these TFT-based feature rankings, features that are not deemed important could be excluded in view of computational constraints (for example, by dropping all features with importance values less than 5%). Finally, a new TFT model could be trained using only the selected top-ranked features, which is anticipated to decrease the TFT model training time while retaining the same forecasting power, or perhaps providing even better performance by excluding redundant or less important features.
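The following is a hedged sketch of that loop; `train_tft` and `feature_importances` are hypothetical helpers standing in for a full training pipeline and for reading the VSN importance weights, and the 5% cutoff is the example value mentioned above:

```python
# Hedged sketch of the automated TFT-based feature selection loop described
# above; train_tft and feature_importances are hypothetical helpers.
def select_and_retrain(all_features, train_data, nominal_data, min_importance=0.05):
    # 1) Train an initial TFT model on all available features.
    model = train_tft(train_data, features=all_features)
    # 2) Apply it to nominal flight data to obtain VSN importance weights,
    #    returned here as a dict {feature_name: weight} summing to 1.
    ranking = feature_importances(model, nominal_data)
    # 3) Keep only features at or above the importance cutoff (e.g., 5%).
    selected = [f for f in all_features if ranking[f] >= min_importance]
    # 4) Retrain a (cheaper) TFT model on the selected features only.
    return train_tft(train_data, features=selected), selected
```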
3.2. Single Output vs. Multi-Output Forecasting with the TFT
The TFT architecture allows for training a single TFT model that forecasts multiple outputs at once. This capability could reduce computation needs and simplify the data processing pipeline. In this part, speed and altitude were considered as the two output targets, due to their known correlation with UA events, to assess the TFT’s ability to predict multiple targets at once. Some TFT models were trained to jointly predict both speed and altitude, while others were trained to predict only speed or only altitude (but not both). We compared the forecasting errors of the single-output and multi-output TFT models on the nominal flight data test split.
Prior to training the multi-output models, the speed and altitude targets are scaled via “min-max” normalization to help balance the contributions of the speed and altitude prediction errors in the loss function.
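A minimal sketch of such a multi-output setup with PyTorch-Forecasting is shown below; the manual min-max scaling, the column names, and the MultiLoss/QuantileLoss choice are our assumptions for illustration, not necessarily the configuration used here:

```python
# Sketch: min-max scale both targets so their errors contribute comparably
# to the loss, then declare both as forecasting targets. Column names and
# loss choice are assumptions; the library may normalize targets further.
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.metrics import MultiLoss, QuantileLoss

for col in ("speed", "altitude"):
    lo, hi = train_df[col].min(), train_df[col].max()
    train_df[col] = (train_df[col] - lo) / (hi - lo)   # min-max normalization

multi_ds = TimeSeriesDataSet(
    train_df,
    time_idx="timestep",
    target=["speed", "altitude"],          # joint forecasting of both outputs
    group_ids=["flight_id"],
    max_encoder_length=64,
    max_prediction_length=8,
    time_varying_unknown_reals=["speed", "altitude"],
)
tft = TemporalFusionTransformer.from_dataset(
    multi_ds,
    loss=MultiLoss([QuantileLoss(), QuantileLoss()]),  # one loss per target
    output_size=[7, 7],                                # 7 quantiles per target
)
```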
Figure 6 and Figure 7 compare the averaged RMSE profiles of the multi-output TFT models trained to jointly predict speed and altitude during the arrival phase for the nominal flight data in the test split. The multi-output TFT models use the three subsets of input feature combinations introduced in Table 4. For comparison, the performance of the TFT models trained to predict a single output is provided in each plot as well. In Figure 6, the mean value of the averaged RMSE profiles is 3.06 for TFT-2 (single output, speed), 3.18 for TFT-1 (multi-output), 3.13 for TFT-2 (multi-output), and 3.27 for TFT-3 (multi-output). In Figure 7, the mean value of the averaged RMSE profiles is 83.48 for TFT-2 (single output, altitude), 88.72 for TFT-1 (multi-output), 96.65 for TFT-2 (multi-output), and 96.77 for TFT-3 (multi-output). For each of the two targets (speed and altitude), the single-output TFT model is generally superior across the forecast time axis. This is understandable, as the single-output models are trained exclusively to predict a single target, dedicating the TFT model’s entire forecasting capacity to this one task. Nevertheless, the plots in Figure 6 and Figure 7 demonstrate the TFT’s multi-output forecasting capability. With an enhanced architecture (adding more layers or increasing the number of network parameters in each layer) and hyperparameter fine-tuning, improved performance for the multi-output TFT models could be possible.
3.3. Anomaly Detection via Nominal Behavior Learning with the TFT Forecasting Model
To demonstrate anomaly detection with the TFT forecasting model, we make use of both the nominal and UA test splits to examine their corresponding RMSE values at each timestep and to identify the temporal locations where the UA test split’s RMSE values differ from the nominal RMSE values. Even though the temporal locations of a UA might differ from one UA-labeled flight to another, we hypothesize that a significant portion of them should take place at a time close to landing. Figure 8 shows the averaged RMSE profiles for both speed and altitude using the TFT-2 single-output TFT models (for the nominal and UA flight data in the test split). The differences between the averaged RMSE profiles of the nominal and UA flight data are visually noticeable in Figure 8 and provide information about the temporal locations where UA events possibly take place.
We utilized Fisher’s linear discriminant [20] to quantitatively show how separable the nominal flight data are from the UA flight data with respect to their RMSE values and to locate the temporal positions where the RMSE values of the nominal flight data differ from those of the UA flight data throughout the time-series. Fisher’s linear discriminant is not used as a classifier here but rather as an auxiliary analysis tool to identify a candidate time point along the time-before-touchdown axis where the separation between the RMSE values of the nominal and UA flight data is relatively high. The RMSE values at this identified time point for both the nominal and UA flight data samples are then used in an RMSE-threshold-based anomaly detection setting to demonstrate the feasibility of the forecasting-based anomaly detection.
Fisher’s criterion function is mathematically described in (2), where $S_B$ is the between-class scatter matrix, $S_W$ is the within-class scatter matrix, and $W$ is a transformation that maximizes the ratio of the between-class scatter to the within-class scatter:

$$J(W) = \frac{\left|W^{T} S_B W\right|}{\left|W^{T} S_W W\right|} \quad (2)$$

One of the two classes corresponds to the RMSE values of the nominal flight data (speed and altitude forecasting errors), and the other class corresponds to the RMSE values of the UA flight data. With the $W$ that maximizes the ratio of the between-class scatter to the within-class scatter, the resulting Fisher’s criterion is utilized as a metric to visualize the time instances where the separation between the two classes (nominal and UA) starts to become apparent within the time axis and to examine which of the TFT models (single-output alone, merged single-output, or multi-output) provides higher separation through time.
In the single-output case, where the RMSE is a scalar, there is no projection to optimize, so Fisher’s method is not strictly applicable; the criterion reduces to the ratio of the between-class to within-class variance of the RMSE values. In the multi-output case, where two RMSEs are available, Fisher’s method first computes a scalar projection of the joint speed and altitude RMSEs that simultaneously minimizes the within-class variance and maximizes the between-class variance. In both the single- and multi-output cases, anomaly detection can then be conducted by setting a threshold on the (projected) RMSE that balances true positives against false positives on the test set.
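As an illustration, the Fisher score at a single timestep can be computed as in the following sketch (array shapes and variable names are our own; the computation reduces to the variance-ratio criterion when d = 1):

```python
import numpy as np

def fisher_score(X_nom: np.ndarray, X_ua: np.ndarray) -> float:
    """Fisher's criterion for nominal vs. UA RMSE features at one timestep.
    X_nom, X_ua: (n_samples, d) arrays; d = 1 (single output) or d = 2
    (speed and altitude RMSEs from the merged single-output or multi-output models)."""
    def scatter(X):
        c = X - X.mean(axis=0)
        return c.T @ c                       # class scatter matrix
    mu_diff = X_ua.mean(axis=0) - X_nom.mean(axis=0)
    S_w = scatter(X_nom) + scatter(X_ua)     # within-class scatter S_W
    w = np.linalg.solve(S_w, mu_diff)        # optimal projection direction W
    # For this optimal w, J(w) simplifies to mu_diff^T S_W^{-1} mu_diff.
    return float(w @ mu_diff)
```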
Figure 9 plots the optimal Fisher’s discriminant score as a function of “time before touchdown”. All test samples available at each timestep were used to compute Fisher’s score and, in the multi-output cases, to optimize the scalar projection. Larger Fisher’s scores indicate greater class separation at that timestep and, presumably, greater classification potential when combined with an appropriate RMSE threshold.
Based on Figure 9, all four TFT models have a maximum Fisher’s score around 26 timesteps before touchdown. Fisher’s score is highest when the RMSEs of the single-output speed and altitude TFT models are merged, followed by the multi-output TFT model at that timestep. Figure 9 also shows the impact of considering multiple outputs rather than a single output for anomaly detection. With the speed output alone, the temporal range for differentiating UA from nominal flights is narrow but has a higher separation potential, whereas with the altitude output alone there is a wider temporal range for differentiation but a lower separation potential. Merging both outputs yields the best of both worlds, achieving high separation and a wide temporal range. The RMSE values of the test split (nominal and UA) from the single-output TFT models for altitude and speed prediction (using the TFT-2 input feature combination) at 26 timesteps before touchdown can be seen in Figure 10. Higher RMSE values with wide scattering can be observed for both the speed and altitude predictions of the UA flight data, in contrast to a more tightly clustered set of RMSE values with smaller magnitudes for the nominal flight data at this timestep.
After identifying the time point that yielded a good separation between the RMSE values of the nominal and UA flight data, 26 timesteps before touchdown, we set RMSE thresholds for both the speed and altitude outputs at this time point to conduct threshold-based anomaly detection. To determine whether a test flight is nominal or UA, the flight’s RMSE values for the two target outputs are compared with the set RMSE thresholds. If the test flight’s RMSE values are below the RMSE thresholds for both target outputs, the flight is considered nominal. However, if the test flight’s RMSE value is above the set RMSE threshold for either of the two target outputs, the flight is considered anomalous.
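This decision rule is a one-liner; in sketch form (with the example thresholds from the next paragraph as defaults):

```python
def is_anomalous(rmse_speed: float, rmse_altitude: float,
                 thr_speed: float = 5.15, thr_altitude: float = 64.0) -> bool:
    """Flag a flight as UA at 26 timesteps before touchdown if either
    output's RMSE exceeds its threshold; otherwise treat it as nominal."""
    return rmse_speed > thr_speed or rmse_altitude > thr_altitude
```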
Regarding the threshold settings, suppose the RMSE threshold is set to 5.15 for the speed output and to 64 for the altitude output at 26 timesteps before touchdown, after examining the RMSE scatter plot in Figure 10 for the two outputs. Our goal here is to get a sense of how the anomaly detection results look when the RMSE values from the two outputs are used alone and when they are used together. With these thresholds, the resultant (normalized) confusion matrices for speed alone and altitude alone are shown in Table 5a,b. It can be noticed from the two confusion matrices that, with the speed output alone and with the altitude output alone, the false positive (FP) rates are the same at 6.99%, whereas the true positive (TP) rate is 38.63% using speed and 33.33% using altitude, indicating that speed is the better output for separating UA from nominal flight data, as was also observed from the Fisher discrimination scores in Figure 9.
When the two outputs are used together for classification, such that a flight is labeled nominal only when the RMSE values for both outputs are below their assigned thresholds and labeled UA in all other cases, the resultant confusion matrix is the one in Table 5c. The TP rate jumps significantly to 56.96%, while the FP rate increases to 12.52%. By raising the two thresholds slightly (speed threshold increased to 5.85 and altitude threshold increased to 85.0) to obtain the same FP rate of 6.99% (for a fair TP rate comparison), we get the confusion matrix shown in Table 5d. From Table 5d, the TP rate becomes 45.46% (while the FP rate is 6.99%, matching the single-output FP values), which is significantly higher than the 38.63% TP rate obtained using speed alone.
We used the Receiver Operating Characteristic (ROC) curve to visualize the detection performance as the RMSE thresholds for speed and altitude are changed incrementally. The speed RMSE threshold range is set between 0.25 and 6.55, and the altitude RMSE threshold range is set between 1 and 106, with 37 equally spaced threshold points in each range. The two threshold pairs used in the confusion matrices above, (5.15, 64) and (5.85, 85), are assumed to be included among the threshold pairs in these two ranges. Figure 11 shows the resultant ROC curve, which indicates that with the speed and altitude outputs used together, higher detection performance can be achieved compared to using a single output, supporting the findings from the confusion matrices. This result demonstrates the impact of examining multi-output flight parameters for anomaly detection with the TFT nominal behavior forecasting models. It is worth mentioning that while we used a simple RMSE threshold-based classifier for anomaly detection in these proof-of-concept demonstrations, other ML algorithms (e.g., regression trees, support vector machines, neural nets) could be utilized for higher accuracy.
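The sweep itself can be sketched as follows; the placeholder input arrays and the pairing of the i-th speed threshold with the i-th altitude threshold are our assumptions about how the two ranges are traversed:

```python
import numpy as np

rng = np.random.default_rng(0)                     # placeholder data, illustration only
rmse_speed = rng.gamma(2.0, 1.5, size=500)         # per-flight speed RMSEs at t = -26
rmse_altitude = rng.gamma(2.0, 20.0, size=500)     # per-flight altitude RMSEs at t = -26
labels = rng.integers(0, 2, size=500)              # 1 = UA, 0 = nominal (placeholder)

speed_thresholds = np.linspace(0.25, 6.55, 37)     # 37 equally spaced speed thresholds
altitude_thresholds = np.linspace(1.0, 106.0, 37)  # 37 equally spaced altitude thresholds

def tp_fp_rates(thr_s, thr_a):
    """A flight is flagged as UA when either output's RMSE exceeds its threshold."""
    flagged = (rmse_speed > thr_s) | (rmse_altitude > thr_a)
    return flagged[labels == 1].mean(), flagged[labels == 0].mean()

# Pairing the i-th thresholds of each range is an assumption; each (TP, FP)
# pair traces one point of the ROC curve.
roc_points = [tp_fp_rates(s, a) for s, a in zip(speed_thresholds, altitude_thresholds)]
```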
4. Discussion
The multivariate time-series dataset used in this work contains a small subset of digital flight data features and lacks on-board sensor features that would otherwise enable the model to capture more nuanced relationships between inputs and outputs and provide greater insights into safety event precursors. Despite these caveats, our preliminary results suggest that the TFT is an effective way to summarize multivariate time-series aviation data.
In this initial proof-of-concept effort, we did not perform hyperparameter optimization for the TFT model training, nor did we conduct performance benchmarking. We consider these promising future directions for improving the forecasting accuracy of the TFT models and for examining how the TFT models compare with other forecasting techniques in the literature when used with aviation data for anomaly detection. Another promising direction is augmenting the existing dataset with a more complete set of digital flight data features and incorporating additional modes of data, such as gridded weather (e.g., convective weather, wind fields), voice, and textual information. These modes need not be raw inputs to the TFT but can be vectorized data representations derived from other upstream capabilities, such as large language models and speech recognition software.
One more promising future direction would be to add additional layers of analysis to the anomaly detection framework. As it stands, our approach flags anomalies based on the magnitude of the error between the nominal TFT’s prediction over a future forecast window and the observed behavior in that window. When a large prediction error occurs, we assume that it is due to variable settings in the observed inputs that the model had limited exposure to during training, i.e., off-nominal precursors. The automated identification of such precursors and of the anomaly type is a natural next step, which we anticipate could be explored via analysis of the TFT’s latent representation of model inputs, its temporal attention weights, and the RMSE profiles from the multi-output predictions.