1. Introduction
In real-world applications, obtaining high-quality labels for a dataset is a challenge that requires significant effort from subject matter experts (SMEs). At the same time, supervised learning methods (the most commonly used in the data science and machine learning domains) require fully labeled datasets to train a model that generalizes well to unseen data. Depending on the application and the complexity of the data, a subset of the data can be reviewed and labeled by the SMEs. However, if the data is exceedingly complex, labeling takes a significant amount of time, and a large number of labeled examples can be unrealistic to obtain. Furthermore, advanced machine learning models (e.g., deep learning) can contain an extremely large number of trainable parameters, requiring a significant amount of labeled data to reach optimal performance and prevent the model from overfitting. Techniques such as crowdsourcing, a term coined in 2005 [
1], can help alleviate this problem, constrained by the “crowd’s” skill level. If the objective is to label images of dogs and cats, the available labeling workforce can be broad, since the majority of the population can easily identify these animals under a variety of photographic conditions. Other applications may require some degree of training, which can be incorporated into the labeling process. In such a setting, a level of trust or confidence, based on monitoring a user’s known experience, can be assigned to each label. This helps track the quality of the labels, as in the Nemo-Net [
2] and Galaxy Zoo [
3] citizen science projects. While this works well for intermediate levels of expertise, in domains requiring more specific and technical knowledge, the pool of SMEs with the relevant background and experience to validate each data instance may be drastically smaller. Furthermore, the data might be considered proprietary and sensitive (as is the case in the aviation safety domain) and hence cannot be shared with citizen scientists.
The lack of available SMEs is particularly compounded in aviation safety by the fact that not all events of interest are simple threshold crossings. They may involve reviewing a variety of actions, conditions, and sequences of events, corroborated across various data sources, before an occurrence can qualify as an operationally significant event. As a result, properly assigning data labels is slow: an hour of forensic review may yield only a dozen or so labels [
4]. Moreover, an increasingly important area of focus in this domain is vulnerability discovery, or anomaly detection, especially the detection of unknown anomalies, for which there is no pre-defined notion of what the algorithm should detect. These types of anomalies also add time to the review process because there may be no precedent for categorizing them. In addition, it may take more time to fully consider a scenario’s significance, its proximity to the safety margins, and the impact it may have had on operations.
In the realm of machine learning algorithms that can be used to address this problem, there exist three main approaches: unsupervised, supervised, and semi-supervised learning. Purely unsupervised approaches can be prone to high false-positive rates and do not perform as well as supervised approaches when labels are present. On the other hand, as discussed above, a completely supervised approach might not reach optimal performance when the labeled dataset is small. Furthermore, any supervised model can only detect known (labeled) anomalies and hence suffers from an inability to discover unknown vulnerabilities. Semi-supervised learning combines the two approaches and can leverage the known labeled data as well as the patterns found in the vast pool of available unlabeled data [
5]. This gives semi-supervised approaches an advantage that can potentially address the drawbacks of both unsupervised and supervised models, especially those associated with the scarcity of labeled data (when only a small set of labels can be acquired).
In addition to the performance improvements from semi-supervised learning, where both the labeled and unlabeled data are leveraged, the algorithm yields a useful structuring of the data in the learned feature space (i.e., the latent space). By examining labeled examples that are collocated within this arrangement, it can be inferred that nearby unlabeled data instances have similar characteristics. This provides some level of model explainability in how the data is organized in this space and can be used to find other similar events within the unlabeled data.
In the aviation safety domain, cases such as unstable approaches [
6] are well defined and use threshold crossings to detect the occurrence and severity of known events. In this paper, we leverage these known anomalies from commercial aircraft approach-to-landing data to help bootstrap and guide a novel semi-supervised learning approach. We benchmark the performance of the model on three synergistic criteria: (1) classification accuracy, (2) interpretability of the extracted/learned features, and (3) robustness to adversarial perturbations, comparing against several baseline methods from the aviation and machine learning literature. Furthermore, as we discuss later in the results section, similarities among the features extracted from the data can be exploited, with SMEs in the loop, to help identify samples that may contain previously unknown anomalies. This sets up the next step of designing an active learning mechanism, in which newly discovered anomaly categories are refactored into the training set. This allows us to begin to build previously unknown classes in the data and to further guide the algorithm’s ability to classify the categories and discover new anomalies over subsequent iterations.
In a deployed operational environment, imagine a situation where borderline events that do not meet the strict criteria currently used to define an event are labeled nominal and therefore go unmonitored. Our approach can assist in identifying these borderline events, since they share similar patterns with the labeled events. Uncovering and understanding an expanded event category can give operators a clearer assessment of their risk exposure. It can also offer a means of crafting more accurate event-definition logic that encompasses the more comprehensive event category. In other words, this technique can add new insight into modes of operation that had previously been limited in scope and help improve the overall safety of operations.
2. Related Work
Machine learning has been widely applied in the aviation domain for anomaly detection and safety improvement applications [
7,
8,
9,
10]. Due to the lack of properly labeled aviation data, the majority of the literature on aviation safety and anomaly detection has focused on unsupervised learning [
11]. Unsupervised approaches can roughly be categorized into: (1) distance-based [
12,
13,
14,
15], (2) kernel-based [
9], and (3) deep learning–based methods [
11,
16]. Distance-based models (nearest-neighbor and clustering approaches) have proven effective, using a distance metric to identify anomalous events. This category of models, however, is less popular due to its quadratic computational complexity [
11]. Bay and Schwabacher [
12] made one of the early attempts to reduce the computational complexity of distance-based anomaly detection, defining anomalies as points with far-off nearest neighbors. Kernel-based approaches such as One-Class Support Vector Machine (OC-SVM) are frequently used for unsupervised anomaly detection applied to the aviation domain. NASA’s Multiple Kernel Anomaly Detection (MKAD) model [
9] uses OC-SVM as part of its overall framework and demonstrates significant proficiency in finding operationally significant anomalies in heterogeneous time-series from commercial flights.
More recently, deep learning has become popular in anomaly detection literature. Specifically, methods based on deep generative models such as Auto-Encoders (AEs) [
17], Variational Auto-Encoders (VAEs) [
18], and Generative Adversarial Networks (GANs) [
19] have been widely adopted for anomaly detection purposes in many science and engineering domains. A popular subset of these approaches is reconstruction-based anomaly detection, in which a deep generative model reconstructs/generates the input data by sampling from a lower-dimensional latent feature space. The main intuition is that, since the majority of the training data are nominal, the reconstruction error for those data will be lower than for the minority of anomalous data present in training. In other words, reconstruction-based models bet on anomalies being noticeably inconsistent in the subspace representation and hence incurring high reconstruction errors. This approach has been widely used for identifying anomalies in time-series data [
20,
21,
22,
23,
24,
25] as well as aviation data [
7,
16,
26,
27]. Janakiraman and Nielsen [
7] implemented an Extreme Learning Machine-based AE to learn the nominal distribution; anomalies are predicted when the reconstruction error surpasses a boundary established on nominal data. Wang et al. [
26] developed a transfer-learning-based AE that forces the latent space to learn useful aspects of the data. The authors applied this model to flight-track anomaly detection problems on data from multiple airports and reported high performance, as well as a strong capability to reduce data-processing requirements. Memarzadeh et al. [
16] developed a Convolutional Variational Auto-Encoder (CVAE) to detect flight-track anomalies, using a distance-based reconstruction error as the metric for identifying anomalies.
Despite the compelling case of requiring no labels, unsupervised approaches are not competitive in performance with supervised models. Lee et al. [
28] introduced a framework called Safety Analysis of Flight Events that uses classic supervised machine learning models. They showcased the versatility of the framework on Flight Operational Quality Assurance (FOQA) data, identifying multiple anomalies in the approach to landing of commercial aircraft. In another study, Janakiraman [
29] developed a supervised precursor mining algorithm, Deep Temporal Multiple Instance Learning (DT-MIL), which finds anomalies by correlating incoming events with anomalous multi-dimensional time-series. The author used deep recurrent neural networks in a multiple-instance learning structure to efficiently track temporal behavior. Despite the significant predictive capabilities of supervised methods, developing such an approach can be expensive and at times infeasible. This is especially true for aviation safety datasets, since acquiring reliable and accurate labels requires significant time and effort from SMEs and is largely impractical. Unsupervised methods, on the other hand, come cheap, so long as one assumes that an operationally significant aviation anomaly is equivalent to a statistical one. This assumption does not consistently hold, and it results in the poorer performance of unsupervised approaches relative to supervised methods in the aviation domain, producing a significant number of false positives (false alarms) that call into question the reliability and applicability of unsupervised methods.
There are limited studies in the aviation safety literature that fill the gap between unsupervised and supervised methods. Active learning [
30,
31,
32,
33] has been developed to tackle this problem, where information-theoretic or uncertainty-based criteria identify the most informative data (among the vast pool of unlabeled data) to be reviewed and labeled by SMEs. Although this approach improves the performance and efficiency of supervised methods by incorporating smart labeling strategies, it does not completely address the shortcomings of supervised learning and still requires SMEs in the loop.
Semi-supervised methods are a potential way to fill the gap where labeled data exist but are insufficient for fully supervised modeling. These approaches have been applied to anomaly detection in time-series [
34,
35]; however, they have not yet been thoroughly explored in the aviation safety domain. To our knowledge, the only existing work on semi-supervised aviation anomaly detection is a study by the authors of this paper [
36], in which two recent semi-supervised approaches were used to detect aviation anomalies.
3. Method
In this paper, we develop RESAD, a Robust and Explainable Semi-supervised deep learning model for Anomaly Detection in aviation data that addresses the shortcomings of both supervised and unsupervised learning. The semi-supervised mechanism allows decision makers to make inferences based on the minimally available (but extremely valuable) labeled data as well as the vast amount of unlabeled data. As a result, it overcomes the main disadvantages of these two families of methods: (1) supervised learning performing sub-optimally due to the scarcity of labeled data, and (2) unsupervised learning lacking accuracy and reliability because it does not leverage operational domain knowledge from SMEs. The proposed semi-supervised model is also superior to active learning in that it does not rely on the availability of SMEs for data labeling; however, it can easily fit within an active learning framework.
We build the model upon two existing methods in the machine learning literature [
37,
38], and show that it is superior to multiple baseline methods from the literature on multivariate flight time-series anomaly detection. Specifically, we train RESAD based on a loss function that: (1) takes advantage of both the labeled and unlabeled sets of data to extract informative features for accurate classification of multiple classes of anomalies; (2) uses graph-theoretic label propagation to enforce a compact clustering of data belonging to each class in the latent feature space, which improves the interpretability of the extracted features and their use in down-stream tasks; and (3) uses the reconstruction fidelity of the input data, based on the model’s generative capability, to improve the robustness of the learned latent features to adversarial perturbations.
Let us imagine that the available data is grouped into two sets: the minority labeled set, $\mathcal{D}_L = \{(x_i, y_i)\}_{i=1}^{N_L}$, and the majority unlabeled set, $\mathcal{D}_U = \{x_j\}_{j=1}^{N_U}$, where the size of the unlabeled set is significantly larger, i.e., $N_U \gg N_L$. It should be noted that any supervised learning technique would ignore $\mathcal{D}_U$, while any unsupervised learning method would ignore the labels in $\mathcal{D}_L$. As depicted in Figure 1, RESAD consists of three components: (1) an encoder, (2) a decoder, and (3) a classifier. The encoder, $f_{\theta}$, is a deep convolutional neural network (exact architectures are reported in Appendix A) that maps the input data $X$ to a latent feature space $Z$. The decoder, $g_{\phi}$, is also a deep convolutional neural network that reconstructs the data $\hat{X}$ from the latent features $Z$. The classifier, $h_{\psi}$, is a fully connected neural network with dropout regularization that classifies the data in the latent feature space. Parameters $\theta$, $\phi$, and $\psi$ represent the weights of the neural networks for the encoder, the decoder, and the classifier, respectively.
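As a concrete illustration, the following is a minimal PyTorch sketch of the three components. The exact architectures are given in Appendix A, so the layer counts, kernel sizes, and hidden widths here are illustrative placeholders; only the 20-variable, 160-step input shape and the 256-dimensional latent space follow the dataset and settings described in this paper.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps input time-series X (batch, channels, time) to latent features Z."""
    def __init__(self, in_channels=20, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),  # infers the flattened size on first call
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs X_hat from the latent features Z."""
    def __init__(self, out_channels=20, latent_dim=256, time_steps=160):
        super().__init__()
        self.time_steps = time_steps
        self.fc = nn.Linear(latent_dim, 128 * (time_steps // 4))
        self.net = nn.Sequential(
            nn.ConvTranspose1d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(64, out_channels, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),  # outputs in [0, 1], matching MinMax-scaled inputs
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 128, self.time_steps // 4)
        return self.net(h)

class Classifier(nn.Module):
    """Fully connected classifier with dropout, operating in the latent space."""
    def __init__(self, latent_dim=256, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, n_classes),
        )

    def forward(self, z):
        return self.net(z)  # logits; softmax is applied in the loss/prediction
```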
We train the entire network end to end, using all available data (labeled and unlabeled). The overall objective of the optimization is to find the set of weights (i.e., $\{\theta, \phi, \psi\}$) that minimizes the following loss function:

$$\mathcal{L}(\theta, \phi, \psi) = \alpha\,\mathcal{L}_{CE} + \beta\,\mathcal{L}_{CCLP} + \gamma\,\mathcal{L}_{rec} \tag{1}$$
The first term is the classification loss, defined as the cross-entropy ($\mathcal{L}_{CE}$) between the prediction of the classifier and the true labels on the labeled set:

$$\mathcal{L}_{CE} = -\frac{1}{N_L} \sum_{(x_i, y_i) \in \mathcal{D}_L} \sum_{c=1}^{C} y_{ic} \log h_{\psi}(f_{\theta}(x_i))_c \tag{2}$$

where $C$ is the number of known classes and $y_{ic}$ is the one-hot encoding of label $y_i$.
The second term in Equation (1) corresponds to the compact clustering via label propagation (CCLP) loss. We have adopted this loss term from [38], and it is defined as follows:

$$\mathcal{L}_{CCLP} = -\frac{1}{S} \sum_{s=1}^{S} \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} T_{ij} \log H^{(s)}_{ij} \tag{3}$$

where $N$ is the total number of training data, $T$ is the optimal transition matrix between data instances in the latent feature space, $H$ is the actual transition matrix estimated via dynamic graph construction and label propagation, and $S$ is the number of steps of the Markov chain on the graph. Equation (3) is the cross-entropy between the desired optimal transition function $T$ and the estimated one $H$.
To estimate $H$, we first calculate the adjacency matrix, $A$, based on the similarity of the data instances in the latent space. We define the adjacency matrix using cosine similarity,

$$A_{ij} = \frac{z_i^{\top} z_j}{\|z_i\|\,\|z_j\|} \tag{4}$$

where $\top$ denotes the transpose operation and $z_i$ is the latent representation of instance $i$. It should be noted that the results are not affected by the choice of similarity measure, and any other metric (such as the negative Euclidean distance) can also be used. The Markovian random walk along the nodes of this graph is defined by the transition matrix $H$, which is obtained by row-normalizing the adjacency matrix $A$,

$$H_{ij} = \frac{A_{ij}}{\sum_k A_{ik}} \tag{5}$$
Once the graph is constructed according to the transition matrix $H$, label propagation uses $H$ to propagate the class confidence from the labeled to the unlabeled samples and estimate the optimal transition function $T$. This process iterates until it converges to an equilibrium. The class posteriors for the unlabeled data, $\Phi_U$, at this equilibrium can be computed in closed form [38] as follows,

$$\Phi_U = (I - H_{UU})^{-1} H_{UL} Y_L \tag{6}$$

where $Y_L$ contains the one-hot labels of the labeled set and $H$ is re-arranged into its labeled ($L$) and unlabeled ($U$) blocks as follows,

$$H = \begin{bmatrix} H_{LL} & H_{LU} \\ H_{UL} & H_{UU} \end{bmatrix} \tag{7}$$

As a result, $\Phi = [Y_L; \Phi_U]$ is the class posterior estimated by the label propagation at convergence, where $C$ is the number of known classes (i.e., $\Phi$ is an $N \times C$ matrix).
Finally, the optimal transition function between data instances, $T$, is calculated based on the class posterior, $\Phi$. The equilibrium denotes an optimal state in which the transition probability between any two data instances of the same class is the same, and it is zero for inter-class transitions. Kamnitsas et al. [38] provide the following formula for calculating this optimal transition function,

$$T_{ij} = \sum_{c=1}^{C} \frac{\phi_{ic}\,\phi_{jc}}{m_c} \tag{8}$$

where $\phi_{ic}$ is the posterior for node $i$ to belong to class $c$, and $m_c = \sum_{k} \phi_{kc}$ is the expected mass assigned to class $c$.
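The CCLP term can be computed directly from a batch of latent features. The sketch below follows Equations (3)-(8), with the labeled instances placed in the first rows. The exponentiation of the cosine similarities (to keep the adjacency positive before row-normalization) and the small stabilizer `eps` are our own implementation choices; the authoritative formulation is the one in [38].

```python
import torch
import torch.nn.functional as F

def cclp_loss(z, y_onehot, n_labeled, steps=3, eps=1e-8):
    """CCLP loss of Eqs. (3)-(8). z: (N, d) latent features with the labeled
    instances in the first `n_labeled` rows; y_onehot: their (n_labeled, C)
    one-hot labels."""
    n = z.shape[0]
    # Eq. (4): cosine-similarity adjacency; exponentiated to keep entries
    # positive before row-normalization (an implementation choice).
    z_norm = F.normalize(z, dim=1)
    A = torch.exp(z_norm @ z_norm.T) * (1.0 - torch.eye(n, device=z.device))

    # Eq. (5): transition matrix H by row-normalizing A.
    H = A / A.sum(dim=1, keepdim=True)

    # Eqs. (6)-(7): closed-form label propagation over the labeled (L)
    # and unlabeled (U) blocks of H.
    H_uu = H[n_labeled:, n_labeled:]
    H_ul = H[n_labeled:, :n_labeled]
    I = torch.eye(n - n_labeled, device=z.device)
    phi_u = torch.linalg.solve(I - H_uu, H_ul @ y_onehot)  # unlabeled posteriors
    phi = torch.cat([y_onehot, phi_u], dim=0)              # (N, C)

    # Eq. (8): optimal transition matrix T from the class posteriors.
    mass = phi.sum(dim=0) + eps                            # expected class mass m_c
    T = (phi / mass) @ phi.T

    # Eq. (3): cross-entropy between T and the s-step transitions H^s.
    loss, H_s = 0.0, H
    for _ in range(steps):
        loss = loss - (T * torch.log(H_s + eps)).sum() / n
        H_s = H_s @ H
    return loss / steps
```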
Finally, the third term in Equation (1) is the reconstruction loss, which ensures that the latent feature space is informative and robust enough that the input data can be reconstructed accurately from it. We define the reconstruction loss as the binary cross-entropy (BCE) between the input data and the reconstruction; however, any other metric, such as the mean squared error (MSE), can be used as well. Our experiments have shown that when the input data is normalized using MinMax scaling, meaning that all features range between 0 and 1, the time-step-level (or pixel-level, in the case of imagery data) BCE loss captures the variability in the reconstructed data much better than the MSE loss. However, if the data is scaled using standard scaling, so that features are not necessarily bounded between 0 and 1, only the MSE loss should be used. The reconstruction loss is formalized as follows:

$$\mathcal{L}_{rec} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t} \left[ x_{it} \log \hat{x}_{it} + (1 - x_{it}) \log (1 - \hat{x}_{it}) \right] \tag{9}$$

where $t$ runs over the time steps and variables of instance $i$. $\alpha$, $\beta$, and $\gamma$ in Equation (1) are the hyper-parameters that tune the importance of each loss term in the overall loss function.
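Combining the three terms, the following is a minimal sketch of the overall objective in Equation (1), reusing `cclp_loss` from the sketch above. The default weights are placeholders for the tuned values discussed in Section 4.3, and the BCE term assumes MinMax-scaled inputs.

```python
import torch
import torch.nn.functional as F

def resad_loss(encoder, decoder, classifier, x_lab, y_lab, x_unl,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Eq. (1): weighted sum of the classification (Eq. 2), CCLP (Eq. 3),
    and reconstruction (Eq. 9) losses. Weight values are placeholders."""
    x = torch.cat([x_lab, x_unl], dim=0)            # labeled rows first
    z = encoder(x)

    # Eq. (2): cross-entropy on the labeled subset only.
    logits = classifier(z[: x_lab.shape[0]])
    ce = F.cross_entropy(logits, y_lab)

    # Eq. (3): compact clustering via label propagation over all data.
    y_onehot = F.one_hot(y_lab, num_classes=logits.shape[1]).float()
    cclp = cclp_loss(z, y_onehot, n_labeled=x_lab.shape[0])

    # Eq. (9): BCE reconstruction (inputs MinMax-scaled to [0, 1]);
    # use F.mse_loss instead if the data is standard-scaled.
    recon = F.binary_cross_entropy(decoder(z), x)

    return alpha * ce + beta * cclp + gamma * recon
```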
4. Results and Discussion
We compare the performance of RESAD with two baseline semi-supervised models: (1) Compact Clustering via Label Propagation (CCLP) [
38] and (2) Auto-Encoder + Classifier (AE+C) [
37]. Kingma et al. [
37] proposed a generalization of deep generative models such as the VAE (a widely used deep learning method for representation learning, nonlinear dimensionality reduction, and anomaly detection in many domains, including our previous work [
16]) to a semi-supervised version by adding a classifier to the VAE structure, trained mainly on the minimally available labeled data. They showed that such a semi-supervised deep generative model is superior to classic semi-supervised methods such as transductive support vector machines [
39], especially when the size of unlabeled data is huge. Later, Kamnitsas et al. [
38] developed the CCLP formulation, a discriminative model with a novel cost function for semi-supervised learning based on deep learning and graph theory, and showed that it is superior to architectures based on deep generative models such as [
37]. These two models can be seen as simpler versions of ours: CCLP contains only an encoder and a classifier and is trained on only the first two terms in Equation (
1), while AE+C has an encoder, a decoder, and a classifier, but does not enforce compact clustering and is trained on only the first and third terms in Equation (
1).
We quantify the comparison according to three metrics: (1) classification performance, (2) latent space configuration, and (3) robustness to the adversarial perturbations. For classification performance, we also include a comparison with DT-MIL [
29] to show the superiority of semi-supervised learning over supervised methods when labeled data is scarce. DT-MIL is chosen as the most recent supervised, deep learning-based anomaly detection model that has been validated on FOQA data (data similar to that used in this study). For the second metric, we qualitatively and quantitatively show how interpretable and useful the learned latent features are for down-stream tasks (e.g., active learning, clustering). Lastly, the third metric evaluates the robustness of the models’ inference to noise and adversarial perturbations in the input data.
The next subsection describes a real-world multi-class anomaly detection dataset covering the approach-to-landing phase of commercial aircraft. We developed this dataset to benchmark the performance of our proposed method, RESAD, against the baseline methods mentioned above.
4.1. Multi-Class Anomaly Detection during Approach to Landing of Commercial Aircraft
In this section, we introduce a multi-class anomaly-detection dataset based on FOQA data from a commercial airline (https://c3.nasa.gov/dashlink/projects/85/, accessed on 1 March 2021). This data primarily comprises 1-Hz recordings for each flight and covers a variety of systems, including the state and orientation of the aircraft, positions and inputs of the control surfaces, engine parameters, and autopilot modes and their corresponding states. The data is acquired in real time on board the aircraft and downloaded by the airline once the aircraft has reached the destination gate. These time series are analyzed by SMEs to flag known events and create labels. Each data instance is a 160-second recording of 20 variables during the aircraft’s approach to landing, from a few seconds before an altitude of 1000 ft to a few seconds after an altitude of 500 ft. It should be noted that, for many flights, depending on the landing runway and airport geometries, the duration from 1000 ft to 500 ft altitude is less than 160 s. In such cases, we expand the data window to include an additional period directly before reaching 1000 ft altitude.
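To make the windowing concrete, below is a sketch of how such a 160-sample window could be extracted from a 1-Hz recording. The 5-second buffer, the array layout, and the function name are hypothetical illustrations of the procedure described above, not the paper’s actual preprocessing code.

```python
import numpy as np

def extract_approach_window(flight, altitude, window=160, buffer_s=5):
    """flight: (T, 20) array of 1-Hz records; altitude: (T,) ft above ground.
    Returns the (window, 20) slice ending shortly after the 500 ft crossing."""
    below_500 = np.flatnonzero(altitude <= 500.0)
    if below_500.size == 0:
        raise ValueError("flight never descends through 500 ft")
    end = min(below_500[0] + buffer_s, flight.shape[0])  # a few s past 500 ft
    start = end - window
    if start < 0:
        raise ValueError("insufficient recording before the window")
    # When the 1000 -> 500 ft segment is shorter than `window` samples,
    # the slice automatically extends into the period before 1000 ft.
    return flight[start:end]
```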
We processed and labeled 30,522 data instances in total, comprising four classes: (1) nominal, where no anomaly of the other three classes is known to be present (~66.7% of the total data); (2) speed anomaly, where the anomaly is identified based on a deviation from the target landing airspeed during approach (~22.9%); (3) path anomaly, where the path of descent for landing is flagged as anomalous, deviating significantly from the glide slope (~7.2%); and (4) control anomaly, where the flaps (a specific control surface on the wings of the aircraft) are flagged as anomalous if their extension is delayed compared to the expected nominal deployment during approach to landing (~3.2%). These events were chosen because they are all relevant metrics used to measure unstabilized approaches.
Figure A4 in
Appendix C visualizes the flight time-series in the training set in 2D using t-distributed Stochastic Neighbor Embedding (t-SNE) [40], color-coded by their true class. There appear to be some distinct modes/clusters in the input space, but none correspond to the known classes of anomaly present in the data. This makes the task of anomaly detection in the input space difficult, since the data is not easily separable or organized. We aim to use our proposed semi-supervised method, RESAD, to efficiently generate latent representations with easily separable boundaries and an interpretable configuration that down-stream tasks can leverage.
Each data instance is either nominal or contains only one type of anomaly: a restriction that simplifies the validation process. Testing on data that contains multiple types of anomalies per instance will be part of our future work.
Figure 2 shows the distribution of the data by landing airport. As can be seen, the majority of the data is from landings at Minneapolis–Saint Paul International Airport, Detroit Metropolitan Wayne County Airport, and Memphis International Airport.
We divide the data into three sets: training (60%), validation (20%), and testing (20%). The training set is used for training the models; the validation set is used to select an optimal choice of hyper-parameters (discussed in the next section); and the testing set is used to report an unbiased estimate of the models’ performance. All figures presented throughout the paper are based on results obtained by applying the models to the testing set (unseen during training, validation, and hyper-parameter tuning).
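Such a split can be reproduced, for example, with two chained calls to scikit-learn’s train_test_split; stratifying by class and the fixed seed are our assumptions, and X and y stand for the full set of instances and labels.

```python
from sklearn.model_selection import train_test_split

# Peel off 20% for testing, then split the remainder 75/25 so the final
# proportions are 60% train / 20% validation / 20% test.
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trval, y_trval, test_size=0.25, stratify=y_trval, random_state=0)
```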
4.2. Implementation Details
All three semi-supervised models are implemented in Python using the PyTorch library. The architecture of the encoder, the decoder, and the classifier in all of them are identical and are reported in
Appendix A. The DT-MIL model was used the way it was originally implemented, using the Keras library (for details refer to [
29]). DT-MIL is a binary classification model; since we did not intend to alter the original model, we implemented DT-MIL in a one-versus-all scheme for multi-class classification.
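A sketch of the one-versus-all scheme is shown below, assuming each binary DT-MIL-style model exposes a fit method and a scalar positive-class score via predict_proba; this wrapper interface is hypothetical, not part of the original DT-MIL code.

```python
import numpy as np

class OneVersusAll:
    """Wraps C binary classifiers into a multi-class predictor."""
    def __init__(self, make_binary_model, n_classes=4):
        self.models = [make_binary_model() for _ in range(n_classes)]

    def fit(self, X, y):
        for c, model in enumerate(self.models):
            model.fit(X, (y == c).astype(int))  # class c versus the rest
        return self

    def predict(self, X):
        # Each model scores "probability of class c"; take the argmax.
        scores = np.stack([m.predict_proba(X) for m in self.models], axis=1)
        return scores.argmax(axis=1)
```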
All models were trained for 200 epochs; the Adam optimizer [41] was used for all models, with a fixed learning rate and the default momentum parameters. We performed hyper-parameter tuning on the validation set to identify reasonable choices for the hyper-parameters. Based on our comprehensive experimentation, the latent space dimension of all three semi-supervised models is fixed to 256 (i.e., $\dim(Z) = 256$), the number of steps of the Markov chain in Equation (3) is fixed to $S = 3$, and the weights of the loss terms in Equation (1), i.e., $\alpha$, $\beta$, and $\gamma$, are fixed based on the size of the labeled set, $N_L$. We report these values in the next section, where we discuss the findings.
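A minimal training loop consistent with this setup is sketched below, reusing the `resad_loss` function from Section 3. The data loaders are assumed to be defined, and the learning-rate and weight values shown are placeholders rather than the tuned ones.

```python
import itertools
import torch

# Placeholder loss weights; the tuned values depend on the labeled-set size.
alpha, beta, gamma = 1.0, 1.0, 1.0

params = itertools.chain(encoder.parameters(),
                         decoder.parameters(),
                         classifier.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)  # placeholder rate, default betas

for epoch in range(200):
    # labeled_loader yields (x_lab, y_lab); unlabeled_loader yields x_unl.
    for (x_lab, y_lab), x_unl in zip(labeled_loader, unlabeled_loader):
        optimizer.zero_grad()
        loss = resad_loss(encoder, decoder, classifier,
                          x_lab, y_lab, x_unl,
                          alpha=alpha, beta=beta, gamma=gamma)
        loss.backward()
        optimizer.step()
```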
4.3. Classification Performance
Figure 3 compares the classification performance, in terms of average accuracy (mean $\pm$ standard deviation), of our proposed model, RESAD (green), against the baseline models. The results are based on 20 independent trials of training, where the labeled set is sampled uniformly across classes and randomly within each class (a code sketch of this sampling appears below). For example, in the case of a 100-sample labeled set, 25 samples are randomly selected from each of the four classes. The
x-axis shows the number (and percentage) of labeled samples used in training. It should be noted that the results presented in the figure are based on performance on the testing set, which was not seen by the algorithms in either the training or validation (hyper-parameter tuning) phases. As mentioned before, we set aside a validation set to perform hyper-parameter tuning and find the right combination of weights for the loss function in Equation (
1) for our approach. Based on our comprehensive experimentation, the following general rules emerged: for the classification loss (the first term), a higher weight improves performance when the labeled set is small (e.g., 100 and 200 samples), while a smaller weight may be sufficient as the labeled set grows (e.g., 500 and 1000 samples); a large weight for the CCLP loss (the second term) improves performance; and the weight of the reconstruction loss (the third term) did not play a major role in the classification task but was crucial for robustness to adversarial perturbation (discussed later in this section). Based on this experimentation, we fixed the values of $\alpha$, $\beta$, and $\gamma$ accordingly for each labeled-set size. It should be noted that we performed hyper-parameter tuning only for our proposed model. For the baseline methods, we kept the training loss function and the corresponding weights of its terms identical to those used by the original authors.
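The class-balanced labeled-set sampling described at the start of this subsection can be sketched as follows; the seed is arbitrary.

```python
import numpy as np

def sample_balanced_labeled_set(y, n_labeled, n_classes=4, seed=0):
    """Draw n_labeled indices uniformly across classes and randomly within
    each class (e.g., 25 per class for a 100-sample labeled set)."""
    rng = np.random.default_rng(seed)
    per_class = n_labeled // n_classes
    picks = [rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
             for c in range(n_classes)]
    return np.concatenate(picks)
```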
A major finding in
Figure 3 is that all the semi-supervised models significantly outperform the supervised DT-MIL method. This emphasizes the superiority of semi-supervised learning when the size of the labeled set is small. Among the semi-supervised models, RESAD performs slightly better than the baselines.
Figure A5 and
Figure A6 in
Appendix C show the per-class precision and recall values for each method. The semi-supervised methods perform comparably in recall on the anomalous classes. RESAD has higher precision on the minority anomaly classes (path and control in
Figure A5) and a higher recall for the majority nominal class (
Figure A6).
Furthermore, we calculate precision, recall, F1-score, and AUROC (Area Under the ROC Curve) for the binary anomaly detection problem, where we evaluate how accurately the models distinguish the nominal class from the anomalous classes. These performance metrics are reported in
Table A1 in
Appendix C. Although the differences between the semi-supervised models may not seem significant in classification performance, we shall see later that they are significant with respect to the other metrics.
4.4. Latent Space Configuration
Figure 4 visually compares the configuration of the latent feature space for the three semi-supervised models, i.e., AE+C, CCLP, and RESAD. All panels show the best example out of the 20 independent trials of training for each model, based on a 1000-sample labeled set, and visualize the 256-dimensional latent feature space in 2D using t-SNE, initialized with Principal Component Analysis (PCA) and with the perplexity parameter set to 50. The left column color-codes the data instances by true class (blue: nominal, orange: speed anomaly, green: path anomaly, red: control anomaly). As can be seen, RESAD and CCLP show significant improvement over AE+C in clustering the data of each class compactly together and far away from the other classes. This distinction is much weaker in the AE+C approach. This is an intuitive result, since both CCLP and RESAD use graph theory to enforce such compact clustering in the latent feature space, while AE+C does not.
To make sure that this compact clustering is not an artifact of t-SNE’s nonlinear embedding from 256D to 2D, and to better understand the structure and configuration of the latent feature space, we perform unsupervised clustering in the 256D latent space of these models. The middle column in Figure 4 shows the result of applying KMeans clustering with $K = C + 1$ ($C$ being the number of classes in the training data) and Euclidean distance as the distance metric in the 256D latent space, visualized in 2D using t-SNE. In order to associate each cluster with a true class, we use the classifier’s prediction for the data in the cluster: for example, if the majority of the data in a cluster are classified as nominal, we associate that cluster with the nominal class. We use the color purple to show the $(C+1)$-th cluster (the fifth cluster here), which we call the uncertain cluster. As illustrated, both CCLP and RESAD confirm that the data of each class are compactly clustered together and far away from the data of other classes, with a central cluster (the purple cluster) merging them together. In the case of the AE+C approach, however, the clusters found by KMeans do not necessarily correspond to the actual classes of the data. For example, the data of both the path and control anomalies are clustered together (green cluster in the middle column, top panel), which is not a desired structure for the latent feature space. Moreover, the data of the nominal class is split into multiple smaller clusters, and the purple cluster does not play the role of a central merging cluster between the different classes of the data.
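A sketch of this cluster analysis is given below, using scikit-learn’s KMeans with $K = C + 1$ and associating each cluster with a class by the majority vote of the classifier’s predictions over its members; which cluster plays the role of the uncertain one is identified afterwards, and the function name is ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def associate_clusters(z, class_pred, n_classes=4, seed=0):
    """z: (N, 256) array of latent features; class_pred: (N,) classifier
    predictions. Returns KMeans assignments and a majority-vote class
    per cluster."""
    km = KMeans(n_clusters=n_classes + 1, random_state=seed).fit(z)
    cluster_to_class = {
        k: np.bincount(class_pred[km.labels_ == k], minlength=n_classes).argmax()
        for k in range(n_classes + 1)
    }
    return km.labels_, cluster_to_class
```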
The right column of Figure 4 visualizes the data instances for which the classifier is most uncertain. To quantify the uncertainty of the classifier’s prediction, we use the entropy of the output of its Softmax layer. The output of this layer is a $C$-dimensional vector whose component $i$ denotes the probability that the data belongs to class $i$, for $i \in \{1, \ldots, C\}$ (note that $C = 4$ in this example). The entropy is then given as

$$\mathbb{H}(\hat{y}) = -\sum_{i=1}^{C} \hat{y}_i \log \hat{y}_i \tag{10}$$

where $\hat{y} = h_{\psi}(f_{\theta}(X))$ is the classifier’s prediction for input $X$. In this figure, points with higher color intensity are associated with higher prediction uncertainty (i.e., entropy). As is evident in the figure, the central purple cluster for CCLP and RESAD consists of the most uncertain data instances in the testing set. This is a significant finding and a benefit of enforcing the CCLP loss: the method automatically forms a cluster in which the data instances that are hard to classify are compactly clustered together, while data instances that are easier to classify are clustered away from this central cluster and into their own class-specific clusters. This important formation of the uncertain cluster is completely lost in the AE+C approach (top panel, right column), where hard-to-classify data instances are spread throughout the entire latent feature space.
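The entropy of Equation (10) can be computed directly from the classifier logits; for example, to rank test instances for SME review (here `x_test` and the trained `encoder` and `classifier` are assumed to be available):

```python
import torch
import torch.nn.functional as F

def prediction_entropy(logits):
    """Entropy of the softmax output (Eq. 10); higher = more uncertain."""
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-12)).sum(dim=1)

# Rank test instances by uncertainty, e.g., to prioritize SME review.
entropy = prediction_entropy(classifier(encoder(x_test)))
most_uncertain = torch.argsort(entropy, descending=True)
```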
One major benefit of the formation of such an uncertain cluster is that we can design an active learning strategy to automatically identify the most informative subset of the unlabeled set, which can then be reviewed and properly labeled by the SMEs; this is part of our future work. This aspect also highlights an important advantage of RESAD over CCLP: the size of the central purple cluster, which represents the number of data instances about which the classifier is highly uncertain. In the case of the CCLP approach, approximately 22% of the data in the testing set are mapped to the central purple cluster, while for RESAD the number is approximately 9%. This means that not only does our approach force the classifier to make more confident predictions on the unlabeled set (and a more accurate prediction according to Figure 3), but it also reduces the size of the uncertain cluster significantly (1350 versus 570 data instances in the testing set). As a result, SMEs have fewer data to review and label; given how expensive and time-consuming the review of each data instance is, this results in significant savings in SMEs’ time and data labeling cost.
In
Figure 5, we further quantify the purity of the class-specific clusters by calculating the entropy of the class distribution of the data instances mapped to each class-specific cluster. Lower values mean that the clusters are purer and contain fewer data from other classes; we visualize the average value across the $C$ class-specific clusters in the figure. Both RESAD and CCLP improve the purity of the clusters as more labeled data is provided, which is an intuitive result. AE+C, however, does not improve the purity of the clusters at all. This drawback of the AE+C approach is also evident in
Figure 4, where the class-specific clusters are neither compact nor distant from one another.
4.5. Robustness to Adversarial Perturbation
In this section, we investigate the robustness of our proposed approach, as well as the baseline methods, against small but misguiding noise (i.e., adversarial perturbations). This is an important experiment that shows the reliability of model predictions in the presence of unwanted noise, a crucial factor for operational models. We do this by implementing a perturbation scheme called the fast gradient sign method (FGSM) [42]. FGSM is a white-box adversarial perturbation method that generates adversarial examples with access to the model parameters [43]. This perturbation scheme hypothesizes that neural networks behave largely linearly (i.e., their components, such as the dot product and convolution, are linear) and are therefore vulnerable to linear adversarial noise. Such a linear perturbation can be derived from

$$\eta = \epsilon \, \mathrm{sign}\left( \nabla_X \mathcal{L}(\Theta, X, y) \right) \tag{11}$$

where $\epsilon$ is the magnitude of the error, $\mathrm{sign}(\cdot)$ is the sign function, $\Theta = \{\theta, \psi\}$ is the set of model parameters that affect the classification of $X$, i.e., $h_{\psi}(f_{\theta}(X))$ (since the input data $X$ is first mapped to the latent feature space by the encoder, $f_{\theta}$, and then classified by the classifier, $h_{\psi}$), and $\mathcal{L}$ is the loss function used for model training, which is given in Equation (1). Based on Equation (11), the adversarial noise is obtained by applying the sign function to the gradient of the loss function with respect to the input data. Based on our robustness evaluations of adversarial examples using FGSM,
Figure 6 shows that RESAD consistently and significantly outperforms the baseline CCLP and AE+C models across different percentages of perturbation. CCLP ranks second in average classification accuracy, and AE+C is the worst-performing model. This figure shows the results for the case of a 1000-sample labeled set; however, the superiority of our approach over the baselines holds with smaller labeled sets as well (
Figure A3 in
Appendix C).
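A sketch of the FGSM perturbation of Equation (11) is shown below. For brevity, it differentiates only the classification term through the encoder-classifier path, whereas the paper applies the full training loss of Equation (1); the clamp assumes MinMax-scaled inputs.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(encoder, classifier, x, y, epsilon):
    """FGSM (Eq. 11): x_adv = x + epsilon * sign(dL/dx)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(classifier(encoder(x_adv)), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0)  # stay in the MinMax-scaled input range
```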
We also report the effect of adversarial perturbation on the per-class F1-score of classification in
Figure A7 in
Appendix C. Please note that the F1-score is the harmonic mean of precision and recall and is defined as follows,

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{12}$$

As is evident, RESAD’s superiority over the baseline methods is consistent across the class-specific performance metric (i.e., F1-score). CCLP, on the other hand, performs better than AE+C on the majority classes (nominal and speed anomaly) and worse on the minority classes (path anomaly and control anomaly).
In order to further improve the robustness of RESAD, we augmented the training objective in Equation (
1) with two ideas recently proposed in the machine learning community: the autoencoding variational autoencoder (AVAE) [
44] and interpolation consistency training (ICT) [
45]. Both of these approaches were developed to improve the consistency and robustness of the mapping from the input data to the latent feature space for down-stream tasks such as the one here (i.e., multi-class classification). Details of these approaches are described in
Appendix B. However, we observed in
Figure A3 (in
Appendix B) that neither of these augmentations yields any improvement in the robustness of RESAD to adversarial perturbation. In fact, in the case of a 1000-sample labeled set, AVAE and RESAD perform very close to each other and outperform ICT, while in the case of a 200-sample labeled set, RESAD dominates both AVAE and ICT in average classification accuracy across different percentages of perturbation.
5. Conclusions
We proposed RESAD, a Robust and Explainable Semi-supervised learning model for Anomaly Detection in aviation data. Our proposed model is novel in several aspects: (1) It is semi-supervised: it addresses the shortcomings of supervised and unsupervised models in the aviation literature by taking into account both the majority unlabeled and the minority labeled data sets. (2) It is explainable: the model incorporates graph-theoretic methods to propagate labels from the labeled set to the unlabeled set and form a compactly structured feature space. This improves the interpretability of the learned latent feature space, from which more information can be extracted for down-stream tasks such as active learning. (3) Lastly, it is robust to adversarial perturbations, which significantly improves its reliability and applicability in the domain.
We evaluated the classification performance of RESAD against three existing methods in the literature. For this purpose, we developed a real-world case study of multi-class anomaly detection using commercial aircraft flight data during approach to landing. First, we illustrated the superiority of semi-supervised learning over a supervised method from the aviation literature (Figure 3) when the size of the labeled data is small. We specifically showed that, with only a small fraction of the training data labeled, the supervised model (DT-MIL) detects anomalies with markedly lower average accuracy than all three semi-supervised models (AE+C, CCLP, and RESAD).
We further quantified the interpretability of the latent feature space learned by the three semi-supervised models. We showed qualitatively (
Figure 4) and quantitatively (
Figure 5) that the methods which induce supervision into their feature learning and encoding (CCLP and RESAD) build an interpretable latent feature space. This well-structured latent space is advantageous because it reveals which regions of the space are compactly populated by each of the labeled anomaly classes, as observed when clusters with high class purity were formed by unsupervised clustering and corroborated by the t-SNE visualization. This important trait allows for intelligent sampling from each region to select the most informative data for future labeling efforts (i.e., active learning). The AE+C approach, on the other hand, learns a latent feature space that is neither compact nor representative enough for more advanced down-stream tasks, since AE+C does not induce any supervision in its feature learning. Moreover, the purity of the class-specific clusters formed in the latent feature space does not improve in the AE+C method as the size of the labeled set increases (
Figure 5).
Lastly, we quantified the robustness of the three semi-supervised methods against adversarial perturbations induced in the input data space and showed that RESAD significantly outperforms CCLP and AE+C (Figure 6). Specifically, at a relatively high level of adversarial perturbation (according to Equation (11)), RESAD’s average accuracy exhibits a considerably smaller drop, in percentage points (pp), relative to the no-perturbation case than either CCLP or AE+C. We further compared the robustness of RESAD against augmentations based on two recent studies in the machine learning literature (Figure A3), and showed that neither augmentation improves the robustness of our model; each instead results in a loss of performance (ICT for both smaller and larger labeled sets, and AVAE for the smaller labeled set).
Potential future directions: One potential direction for future work is to extend the semi-supervised model to an open-set recognition model. In a testing scenario, such a model would be capable of rejecting new data as belonging to any of the known classes and labeling it as unknown. This would be an important step toward detecting unknown vulnerabilities and anomalies. Different metrics obtained from the model, such as the reconstruction error, the entropy of the classifier’s prediction, and/or the distance to the centroid of the assigned cluster in the latent feature space, could be used to develop such a capability. Another, more practical extension of the model is to examine methods that shed light on the inference made by the model, such as integrated gradients [
46] or SHAP values [
47]. These methods propagate the output of the model’s classifier back to the input space to identify which features, in which specific time windows, were influential in the model’s decision making. These explanations in the original input space can help with model validation and promote acceptance within the domain.