1. Introduction
Anomaly detection aims to find patterns that do not comply with expected behaviour [1]. Anomaly detection tasks are important research topics in many real-world applications [2,3]. Typically, anomalies are categorised as point, contextual, and collective anomalies. Since no specific distribution usually fits the data well and the characteristics of anomalies vary, traditional anomaly detection methods based on distance estimation or statistical theory may struggle. Moreover, complex and changing intrinsic data characteristics, low recall rates, and high-dimensional data [4] further impede the learning performance of traditional machine learning methods. Under these circumstances, the detection algorithm requires an excellent ability to learn data features. Deep learning methods can commonly learn the complex dynamics in the data without relying on assumptions about its underlying distribution, which makes them popular for anomaly detection tasks. Among deep learning methods, Transformers, which apply the self-attention mechanism, have shown outstanding performance in modelling long-range dependencies.
Common methods for dealing with anomaly detection tasks [5,6,7] can generally be classified into traditional machine learning methods and deep learning methods.
SVM [8], One-Class SVM (OC-SVM) [9], Isolation Forest [10], and Local Outlier Factor (LOF) [11] are typical examples of machine learning anomaly detection algorithms. However, if the raw samples are complex or dense, the detection performance of these methods is limited. LODA [12] is a lightweight anomaly detector that ensembles different detectors and is suitable for data streams. LSCP [13] is another ensemble framework compatible with different types of base detectors; it further determines the most competent base detector in a local region based on similarity measures.
Although machine learning methods are suitable for some anomaly detection tasks, deep learning methods learn expressive representations of complex data more effectively in some real-world applications [4,14]. For example, deep support vector data description (DeepSVDD) [15] is applied to complex data for better feature selection. Recurrent neural networks (RNNs), which capture temporal dependence, are commonly used to recognise or predict sequences. RNNs have been extended with gating mechanisms into widely used variants such as LSTM and gated recurrent units (GRU). For example, one LSTM-based method is adopted for detecting urban anomalies [16]. DeepAnT [17] is a time-series anomaly detection method that does not require a huge dataset. It applies a CNN-based network that takes a window of the time series and predicts the value at the next time stamp; the predicted value is then sent to an anomaly detector module to determine its abnormality. DAGMM [18] applies a compression network and an estimation network to achieve unsupervised anomaly detection. The compression network implements a deep autoencoder to generate a low-dimensional representation for each input; the estimation network, based on a Gaussian mixture model, then takes the representation and predicts the corresponding likelihood. The parameters of both sub-networks are jointly optimised. SO-GAAL [19] applies a generative adversarial learning framework, which consists of a generator and a discriminator, to detect anomalies. Alternatively, GDN [20] combines graph structure learning and attention weights to achieve good anomaly detection results in some fields. LUNAR [21] is another graph neural network-based anomaly detection method. It extracts information from the nearest neighbours of each node to detect anomalies and can learn and adapt to different sets of data. Transformer [22]-based algorithms are also widely applied to anomaly detection tasks. For example, UTRAD [23] obtains stable training and accurate anomaly detection/localisation results based on a transformer-based autoencoder. Additionally, MT-RVAE [24] utilises a variational Transformer model with improved positional encoding and feature extraction to achieve satisfying anomaly detection performance.
In this paper, one Transformer-based network, Informer [25], is chosen as the baseline for dealing with anomaly detection tasks on data collected from real-world applications. The original Informer is an efficient Transformer variant that adopts the ProbSparse self-attention mechanism to significantly reduce time complexity and memory usage while outperforming existing methods, mainly in time-series forecasting tasks. Additionally, it can handle tasks in an unsupervised way to avoid cumbersome labelling costs. However, directly applying the original Informer to time-series anomaly detection tasks may not be appropriate. Since the original Informer is designed for time-series forecasting, it aims to find the overall trend of the target sequence and can ignore some unusual details. Moreover, Transformers may focus on dominant relationships among sequences while paying less attention to intrinsic details when dealing with short-term data. As shown in Figure 1, the Informer decoder has a straight-through feature-transmitting structure across its layers, and its output is based merely on the last layer, which contains the least noise but may miss some details. However, anomalies are rare in anomaly detection tasks, and some minor details in the data may reflect an anomaly and cannot be ignored. Different features should therefore be utilised to improve the overall detection performance [26].
To overcome the above-mentioned limitations, we propose to better utilise features from shallow and deep decoder layers with a new multi-layer feature fusion decoder. The original feature-transmitting structure in the Informer decoder is replaced with the proposed feature fusion decoder to fully utilise the features extracted from shallow and deep decoder layers. This strategy prevents the decoder from missing unusual anomaly details while maintaining robustness to noise inside the data. Next, the auxiliary predictions generated by the decoder are adaptively fused based on similarities/distances and sequence information in the temporal context fusion module. This strategy exploits the temporal context information of the data through a learnable weight to make the output more robust. We evaluate the proposed method on both public datasets and our collected transportation dataset for anomaly detection tasks and compare the results with recently proposed machine learning and deep learning methods.
The main contributions of our work can be summarised as follows:
We introduce a novel framework: Temporal Context Fusion Transformer (TCF-Trans) for unsupervised anomaly detection in time series based on temporal context fusion.
We replace the straight-through feature-transmitting structure in the decoder layers of Informer with the proposed feature fusion decoder, which fully utilises the features extracted from shallow and deep decoder layers. This strategy prevents the decoder from missing unusual anomaly details while maintaining robustness to noise inside the data.
We propose the temporal context fusion module to adaptively fuse the auxiliary predictions generated by the decoder. This strategy alleviates noise or distortions caused by any single auxiliary prediction and fully uses the temporal context information of the data.
Extensive experiments on public and collected transportation datasets validate that the proposed framework is effective for anomaly detection tasks in time series, such as transportation applications. In addition, a series of sensitivity experiments and an ablation study show that the proposed method maintains high performance under various experimental settings.
The remaining parts of this paper are organised as follows. Section 2 reviews related works and the background of the Transformer. Section 3 describes the details of the proposed method. Section 4 describes the experiments for validating the proposed method. The conclusion of this paper is presented in Section 5.
3. TCF-Trans: Temporal Context Fusion Transformer
3.1. Overall Structure
As shown in Figure 3, TCF-Trans consists of three main modules: an auxiliary prediction generator, a temporal context fusion module, and an anomaly detection module. The auxiliary prediction generator performs feature learning, fusion, and refinement of the processed input data based on an encoder–decoder architecture. Next, the generated auxiliary predictions are further processed by the temporal context fusion module based on similarity/distance and sequence information to adaptively generate the output predictions. Finally, the output predictions are compared with the target data under anomaly detection criteria to produce anomaly scores, and the final detection result is determined based on a threshold. These modules are presented in detail in the following sections.
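To make the data flow between the three modules concrete, the following is a minimal PyTorch sketch of the overall pipeline. The class and module names (`TCFTrans`, `generator`, `fusion`) and their interfaces are hypothetical placeholders for the components described above, not the authors' released implementation.

```python
import torch.nn as nn

class TCFTrans(nn.Module):
    """Sketch of the TCF-Trans pipeline: auxiliary prediction generator -> temporal context fusion."""

    def __init__(self, generator: nn.Module, fusion: nn.Module):
        super().__init__()
        self.generator = generator   # encoder-decoder producing N auxiliary predictions
        self.fusion = fusion         # adaptively fuses the auxiliary predictions

    def forward(self, x_enc, x_dec):
        aux_preds = self.generator(x_enc, x_dec)   # list of N tensors, each (batch, length, dim)
        y_hat = self.fusion(aux_preds)             # fused output prediction, (batch, length, dim)
        return aux_preds, y_hat
```

The anomaly detection module then compares `y_hat` with the target sequence to produce anomaly scores, as described in Section 3.3.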
3.2. Auxiliary Prediction Generator
The auxiliary prediction generator implements an encoder–decoder architecture similar to the Informer [25]. The encoder input $X_{en}$ is encoded into the hidden state representation $H$ as the encoder output. Next, the decoder produces the auxiliary predictions $\hat{Y}_1, \ldots, \hat{Y}_N$ based on the encoder output $H$ and the decoder input $X_{de}$.
The process in the encoder follows the Informer baseline. However, as mentioned earlier, the decoder of the Informer baseline is ineffective in anomaly detection tasks. To overcome this limitation, we propose a multi-layer feature fusion decoder to better use the different features extracted in shallow and deep decoder layers and to further refine them to generate auxiliary predictions. To illustrate the idea of feature fusion in the decoder, we take a three-layer decoder as an example. As shown in Figure 4, the decoder consists of three layers whose inputs are based on the encoder output $H$ and the decoder input $X_{de}$. The outputs of the three layers are denoted as $D_1$, $D_2$, and $D_3$, where $D_i$ is the output of the $i$th layer. Our feature fusion aims to generate a merged and refined feature (denoted as $F_i$) for the $i$th layer based on $D_1, \ldots, D_i$ and use it to produce auxiliary predictions to assist detection.
In the multi-layer decoder, shallow layers contain more information about details, while deep layers contain deeper representations of the data [41,42]. Therefore, we can fuse features from deep layers with those from shallow layers to obtain a better representation of the data, as follows:
$F_i = \mathrm{Concat}(D_1, D_2, \ldots, D_i) \in \mathbb{R}^{L \times d_i},$
where $\mathrm{Concat}(\cdot)$ denotes the concatenation, $i$ is the order of the layer, and $d_i$ is the dimension of the fused feature.
During the fusion process, the deeper the layer, the more features it fuses. Under this circumstance, directly applying an FC layer to produce outputs is likely to lose information. Therefore, feature-refining tools such as multilayer perceptron (MLP) layers can be implemented to refine the fused features of deeper layers. Note that the MLP layers correspond to hidden layers in an MLP network, and layer normalisation can be added to achieve stable transmission.
Lastly, the auxiliary predictions produced by the FC layer using the refined features can be defined as follows:
$\hat{Y}_i = \mathrm{FC}(\mathrm{MLP}(F_i)), \quad i = 1, \ldots, N,$
where $N$ is the number of layers in the decoder and $\hat{Y}_i$ is the $i$th auxiliary prediction.
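As a concrete illustration of the fusion-and-refinement step, below is a minimal PyTorch sketch of the feature fusion decoder head, assuming each decoder layer outputs a tensor of shape (batch, length, d_model). The class name, layer sizes, activation choice, and the use of a two-layer MLP with layer normalisation are illustrative assumptions rather than the exact published architecture.

```python
import torch
import torch.nn as nn

class FeatureFusionHead(nn.Module):
    """Fuse shallow and deep decoder-layer outputs and emit one auxiliary
    prediction per layer: Y_hat_i = FC(MLP(Concat(D_1, ..., D_i)))."""

    def __init__(self, num_layers: int, d_model: int, d_out: int):
        super().__init__()
        self.refiners = nn.ModuleList()
        self.heads = nn.ModuleList()
        for i in range(1, num_layers + 1):
            d_fused = i * d_model  # the deeper the layer, the more features it fuses
            self.refiners.append(nn.Sequential(
                nn.Linear(d_fused, d_model),
                nn.GELU(),
                nn.Linear(d_model, d_model),
                nn.LayerNorm(d_model),   # stabilise transmission of refined features
            ))
            self.heads.append(nn.Linear(d_model, d_out))  # FC layer producing Y_hat_i

    def forward(self, layer_outputs):
        """layer_outputs: list of N tensors, each of shape (B, L, d_model)."""
        aux_preds = []
        for i in range(len(layer_outputs)):
            fused = torch.cat(layer_outputs[:i + 1], dim=-1)   # F_i = Concat(D_1..D_i)
            refined = self.refiners[i](fused)                  # MLP refinement
            aux_preds.append(self.heads[i](refined))           # i-th auxiliary prediction
        return aux_preds
```

With `num_layers=3`, this mirrors the three-layer example above: the shallowest head sees only $D_1$, while the deepest head fuses all three layer outputs.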
During the training of this module, the mean squared error (MSE) loss function can be chosen to compute the loss between the auxiliary predictions and the target sequences. Moreover, a weight can be assigned to each prediction to emphasise the importance of the predictions produced by different layers. The loss function of this module can be expressed as follows:
$\mathcal{L}_{aux} = \sum_{i=1}^{N} w_i \, \mathrm{MSE}(\hat{Y}_i, Y),$
where $w_i$ is the weight of the $i$th decoder layer and $Y$ is the target sequence.
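As an illustration, a minimal sketch of this weighted multi-prediction loss is given below; the helper name and the equal default weights are assumptions, since specific weight values are a training choice not fixed here.

```python
import torch.nn.functional as F

def auxiliary_loss(aux_preds, target, layer_weights=None):
    """Weighted sum of per-layer MSE losses between auxiliary predictions and the target.

    aux_preds:     list of N tensors, each of shape (B, L, D)
    target:        tensor of shape (B, L, D)
    layer_weights: optional list of N scalars emphasising different decoder layers
    """
    if layer_weights is None:
        layer_weights = [1.0] * len(aux_preds)   # assumed default: equal emphasis per layer
    return sum(w * F.mse_loss(pred, target) for w, pred in zip(layer_weights, aux_preds))
```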
3.3. Temporal Context Fusion Module and Anomaly Detection
After obtaining the auxiliary predictions produced by the former module, one direct way to combine them is to empirically assign a weight to each prediction. However, such a method takes a long time to reach a satisfying solution because of the large number of trials required. In addition, applying a scalar weight limits the utilisation of the temporal context of these predictions: different points or dimensions in the sequence may carry different intrinsic temporal context or importance within one prediction, whereas a scalar weight per prediction only allows different importance across predictions. Since all points and dimensions in the sequence may not share the same importance, an output prediction based on fixed scalar weights may only partially exploit our feature fusion decoder. Therefore, we aim to produce the final prediction adaptively, based on a weight learned from the auxiliary predictions' similarities/differences and their fused temporal context.
As shown in Figure 5, the auxiliary predictions are sent to the similarity/distance measurement block to determine their similarities/differences. We can calculate the similarities/differences among them as follows:
$S_{ij} = d(\hat{Y}_i, \hat{Y}_j), \quad i, j = 1, \ldots, N,$
where $d(\cdot, \cdot)$ can be a user-defined distance or similarity measurement, such as the Euclidean distance.
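For instance, the pairwise measurement could be implemented as in the following minimal sketch, assuming the Euclidean distance over flattened predictions; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def pairwise_prediction_distances(aux_preds):
    """Euclidean distances between every pair of auxiliary predictions.

    aux_preds: list of N tensors, each of shape (B, L, D)
    returns:   tensor of shape (B, N, N) with entry [b, i, j] = ||Y_hat_i - Y_hat_j||_2
    """
    stacked = torch.stack(aux_preds, dim=1)      # (B, N, L, D)
    flat = stacked.flatten(start_dim=2)          # (B, N, L*D)
    return torch.cdist(flat, flat, p=2)          # (B, N, N)
```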
In the meantime, a slice or all of each auxiliary prediction $\hat{Y}_i \in \mathbb{R}^{L \times d}$ can be chosen as the sequence information, where $L$ and $d$ are its length and dimension, respectively. The target sequence could also be added to the sequence information; here we take the auxiliary predictions as an example for ease of understanding. Then, the temporal context fusion is processed as follows:
$C = \mathrm{Concat}(S, \hat{Y}_1, \ldots, \hat{Y}_N),$
where $\mathrm{Concat}(\cdot)$ denotes the concatenation and $N$ is the number of layers in the decoder.
Next, the fused temporal context is processed by MLP layers for further feature extraction and refinement. Since we aim to make full use of the auxiliary predictions' similarities/differences and temporal information, the output of the MLP is sent to an FC layer to adaptively generate the weight, whose dimension can be chosen according to the weight resolution expected for various data inputs (a learned weight with a higher dimension provides a higher resolution). Specifically, the learned weight can range from a single scalar per auxiliary prediction to a weight defined for every point and dimension of the sequence.
Last, the output prediction can be computed as follows:
$\hat{Y} = \sum_{i=1}^{N} W_i \odot \hat{Y}_i,$
where $W_i$ is the learned weight for the $i$th auxiliary prediction and $\odot$ denotes element-wise multiplication (which reduces to scalar weighting when $W_i$ is a scalar).
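The following is a minimal, self-contained sketch of the temporal context fusion step under stated assumptions: pairwise Euclidean distances as the similarity measure, the full auxiliary predictions as the sequence information, a small MLP plus FC layer producing one weight per prediction and time step, and a softmax to normalise the weights. The class name, layer sizes, and softmax normalisation are illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn

class TemporalContextFusion(nn.Module):
    """Adaptively fuse N auxiliary predictions using their pairwise distances
    and temporal context: Y_hat = sum_i W_i * Y_hat_i."""

    def __init__(self, num_preds: int, d_out: int, d_hidden: int = 64):
        super().__init__()
        # Context per time step: N*N distance entries + N*d_out prediction values.
        d_context = num_preds * num_preds + num_preds * d_out
        self.mlp = nn.Sequential(
            nn.Linear(d_context, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_hidden),
        )
        self.fc = nn.Linear(d_hidden, num_preds)   # one weight per auxiliary prediction

    def forward(self, aux_preds):
        """aux_preds: list of N tensors, each of shape (B, L, d_out)."""
        stacked = torch.stack(aux_preds, dim=1)                  # (B, N, L, D)
        B, N, L, D = stacked.shape
        per_step = stacked.permute(0, 2, 1, 3).contiguous()      # (B, L, N, D)
        # Pairwise Euclidean distances between the predictions at each time step.
        flat = per_step.view(B * L, N, D)
        dists = torch.cdist(flat, flat, p=2).view(B, L, N, N)
        # Fused temporal context: distances concatenated with the sequence information.
        context = torch.cat([dists.reshape(B, L, N * N),
                             per_step.reshape(B, L, N * D)], dim=-1)
        weights = torch.softmax(self.fc(self.mlp(context)), dim=-1)   # (B, L, N)
        # Weighted sum of the auxiliary predictions.
        return torch.einsum('bln,blnd->bld', weights, per_step)       # (B, L, D)
```

In this sketch, the learned weight varies across time steps but is shared across output dimensions; a higher-resolution variant could emit `num_preds * d_out` weights per step instead.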
During the training of this module, we can also use the MSE loss function to compute the loss between the output prediction and the target sequences.
After obtaining the output predictions, the anomaly detection module compares them with the target sequence under specific criteria. Here, we can choose the MSE to generate the anomaly score as follows:
$s_t = \mathrm{MSE}(\hat{Y}_t, Y_t),$
where $s_t$ is the anomaly score at time step $t$.
Once we have the anomaly scores, a threshold can be set to determine anomalies. The threshold can be determined empirically or via a grid search for better precision, but such methods require extensive trials and are time-consuming. Alternatively, methods based on Streaming Peaks-Over-Threshold (SPOT) [43] can be applied to determine the threshold $th$. Anomaly scores exceeding $th$ are then considered potential anomalies. The main procedures for detecting anomalies via TCF-Trans are summarised in Algorithm 1.
Algorithm 1: Anomaly detection via TCF-Trans.
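To make the end-to-end detection procedure concrete, here is a hedged sketch of the inference-time scoring and thresholding loop. It assumes a trained `model` returning the fused prediction, hypothetical tensor names, and a simple empirical-quantile threshold in place of the full SPOT procedure, so it illustrates the workflow rather than reproducing Algorithm 1.

```python
import torch

@torch.no_grad()
def detect_anomalies(model, x_enc, x_dec, target, quantile: float = 0.99):
    """Score each time step by prediction error and flag scores above a threshold.

    model:  trained TCF-Trans-style network returning (aux_preds, fused_prediction)
    target: ground-truth sequence of shape (B, L, D)
    """
    model.eval()
    _, y_hat = model(x_enc, x_dec)                    # fused output prediction, (B, L, D)
    scores = ((y_hat - target) ** 2).mean(dim=-1)     # point-wise MSE anomaly score, (B, L)
    # Placeholder threshold: empirical quantile of the scores; SPOT could be used instead.
    th = torch.quantile(scores.flatten(), quantile)
    return scores, scores > th                        # boolean anomaly flags per time step
```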
4. Experiments
In this section, we validate the effectiveness of the proposed anomaly detection framework through several experiments. First, we describe the experimental setup. Then, we evaluate the proposed framework on three public datasets in Section 4.2. Next, we compare the proposed method with several state-of-the-art methods on a real-world transportation traffic dataset in Section 4.3. Moreover, we conduct an ablation study and parameter sensitivity experiments in Section 4.4 and Section 4.5, respectively.
4.1. Setup
We follow standard evaluation metrics in anomaly detection tasks, including the F1 score, Precision, and Recall, to evaluate performances as follows:
$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$
where $TP$ denotes the number of correct anomalous detections, $FP$ denotes the number of incorrect anomalous detections, and $FN$ denotes the number of incorrect normal detections. A higher F1 score, precision, and recall demonstrate better performance.
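For reference, a minimal sketch of these metrics computed from binary anomaly labels (1 marks an anomaly; the function name is illustrative):

```python
import numpy as np

def precision_recall_f1(y_true: np.ndarray, y_pred: np.ndarray):
    """Compute Precision, Recall, and F1 from binary anomaly labels (1 = anomaly)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))   # correct anomalous detections
    fp = np.sum((y_pred == 1) & (y_true == 0))   # incorrect anomalous detections
    fn = np.sum((y_pred == 0) & (y_true == 1))   # missed anomalies
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1
```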
The proposed method is implemented with the PyTorch [44] framework and runs on an NVIDIA RTX 3080 GPU. Some comparison methods are based on publicly available code provided by PyOD [45]. We implement three layers ($N = 3$) for the feature fusion decoder with a fixed model dimension.
4.2. Evaluation on Public Datasets
We evaluate the proposed method by applying it to three real-world public anomaly detection datasets. The first public dataset is a gesture dataset [46] collected in a real-world scenario. It records the X and Y moving coordinates of an actor's right hand as a time-series sequence. During the actor's right-hand actions, anomalous actions within a specific time period are recorded. In total, the gesture dataset contains around 11,000 data points with a dimension of two. Around 70% of the samples are used to train the model, while the rest of the data are used for testing. The second and third public datasets are provided in the NAB (Numenta Anomaly Benchmark) [47] with known anomaly causes collected in real-world scenarios. The second contains temperature sensor data of an internal component of a large industrial machine (i.e., the machine temperature dataset). The third contains ambient temperature data in an office setting (i.e., the ambient temperature dataset). Each dataset has more than 7000 univariate data samples collected in time series. Part of each dataset is used as the training set, while the rest is set aside for testing.
On the gesture dataset, we implement three state-of-the-art comparison methods: LUNAR [21], DeepAnT [17], and DeepSVDD [15]. Comparison anomaly detection results on this dataset are shown in Table 1, where the bold-faced number in each column indicates the best result among all the compared methods; this convention applies to all other tables in this paper. The results show that the proposed TCF-Trans obtains the best performance in terms of the F1 score, which indicates that the method achieves a good balance in its overall performance without biased detection. Although the recall of the proposed method is not the highest, the methods with higher recall values suffer from low precision, which means they are highly likely to generate false alarms. Therefore, the proposed method obtains competitive performance among the compared state-of-the-art anomaly detection methods. Visualisation examples of detection result slices achieved by the proposed method and LUNAR on the gesture dataset are presented in Figure 6, where the X-axis represents the recording time. Figure 6a shows the raw data from the test set, and Figure 6b,c show the corresponding anomaly scores of the proposed method and LUNAR, respectively. Potential anomalies occur continuously after around the 1600th recording time. Compared with LUNAR, the proposed method produces more stable anomaly scores in the anomalous regions. In contrast, the anomaly scores obtained from LUNAR suffer from many missed alarms.
On the machine temperature dataset and the ambient temperature dataset, we implement four state-of-the-art comparison methods: LSCP [13], LUNAR [21], SO-GAAL [19], and DeepSVDD [15]. Comparison anomaly detection results on the machine temperature dataset are shown in Table 2. The results show that the proposed method outperforms the other methods in the F1 score, with close precision and recall values, indicating that the proposed method can avoid biased detection results. Although some methods have higher recall values than the proposed method, the gap between their recall and precision is noticeable. This unbalanced performance prevents such methods from making accurate predictions on normal samples.
Table 3 summarises the comparison anomaly detection results on the ambient temperature dataset. The results show that TCF-Trans obtains satisfying results among the compared methods. Meanwhile, we notice that the performance fluctuations of some methods across these three datasets are apparent, while the proposed method is more stable across different datasets. This advantage indicates that the proposed method is less sensitive to changes in the intrinsic characteristics of the dataset and may be exploited for many detection tasks.
4.3. Evaluation of the Real-World Transportation Dataset
The proposed method is applied to a real-world transportation dataset collected for anomaly detection tasks to show its potential in other real-life applications. The dataset is collected in Chongqing, China, at a daily rate (i.e., the real-world transportation dataset (days)), and it contains vehicle traffic data from several roads. With a daily collection rate, this dataset can reflect long-term traffic anomalies, which can be valuable for local authorities in assessing overall traffic planning and management. Moreover, since the relatively large collection interval requires a longer accumulation period to enlarge the dataset, collection is more difficult, and this relatively small dataset already covers several months. Under such circumstances, the proposed method's learning ability with a small number of samples can also be evaluated. A statistical summary of this dataset is shown in Table 4. In total, the real-world transportation dataset (days) contains 270 days of vehicle traffic data from different roads. The part of the dataset that does not contain anomalies is used as the training set, while the rest is chosen as the testing set. We implement five state-of-the-art anomaly detection methods for comparison: LODA [12], LSCP [13], LUNAR [21], SO-GAAL [19], and DeepSVDD [15].
Table 5 presents the comparison anomaly detection results on this dataset. The results show that TCF-Trans effectively detects anomalies in the real-world transportation dataset, and it can be observed that the proposed method balances precision and recall. We attribute this ability to the proposed feature fusion strategy, because fusing different features with different characteristics alleviates the drawbacks of being vulnerable to noise or missing anomaly details. As a result, the proposed method obtains good overall detection performance. A visualisation example of a detection result slice is shown in Figure 7, in which the X-axis represents the collection dates. Figure 7a contains the raw data slice from the test set, and Figure 7b shows the corresponding anomaly scores on the test set. SPOT, mentioned in Section 3, can be used to determine the threshold without requiring time-consuming grid search trials. It can be observed that there are two types of anomalies in the raw data. For example, a continuous long-term drop in vehicle traffic happens around the 100th date, which may reflect a continuous anomalous traffic drop due to large-scale traffic control by the local authorities. The corresponding anomaly score achieved by the proposed method remains continuously high without a clear drop over this region, which shows that it can effectively handle this anomalous pattern. Some short-term sharp changes, such as the points before the 140th date, may reflect temporary construction and belong to another type of anomaly. The anomaly scores in this region are also sufficiently stable, showing that the proposed method can effectively detect short-term anomalies.
4.4. Ablation Study
In this section, we conduct an ablation study on the real-world transportation dataset (days) to analyse the effectiveness of each component of the proposed method. We implement three variants of the proposed method: (i) the Informer baseline; (ii) TCF-Trans w/o temporal context fusion, in which we replace the temporal context fusion module with one FC layer to generate the output directly; and (iii) TCF-Trans w/o feature fusion, in which we do not fuse features from different layers. To minimise the impact of different threshold methods, we report results via a grid search based on the F1 score to check the performance in theory (denoted with †) in this ablation study.
Based on the results shown in Table 6, we make the following observations: (1) The proposed TCF-Trans utilises the advantages of each sub-module to achieve the best performance among these variants. (2) Since features from different layers have different characteristics, fusing features from different layers helps to improve detection performance; the fusion makes the proposed method robust to noise and prevents the decoder from missing potential details related to anomalies. (3) Adaptively fusing the auxiliary predictions based on temporal context helps to generate satisfactory results. In contrast, directly transforming auxiliary predictions into the final output may be inappropriate.
Next, we train the proposed method with another optimiser, SGD, to evaluate its impact. Table 7 shows the results of using different optimisers. The results obtained using SGD are worse than those obtained using Adam. The lower performance may be caused by SGD becoming trapped in a local optimum.
Moreover, we gradually decrease the amount of data used for training from 100% to 80% to show the proposed method's potential to achieve effective detection performance without requiring the accumulation of a large amount of data. The results using different ratios of training data are listed in Table 8. It can be found that the F1 scores do not vary much for ratios from 100% to 80%, which indicates that the proposed method has the potential to obtain satisfying detection performance with less training data. Moreover, the results further validate the proposed method's learning ability with few samples.
4.5. Parameter Sensitivity Experiments
In this section, the proposed method is implemented under different parameter settings to evaluate its sensitivity to parameter changes on the real-world transportation dataset (days). We also report results achieved via a grid search, i.e., performance in theory (denoted with †), to reduce the impact of different threshold methods.
Since the decoder input $X_{de}$ combines the reference sequence (the piece of the sequence immediately before the output) with a placeholder, we aim to evaluate the effect of reference sequences with different lengths, as well as the corresponding input sequence lengths. Therefore, we choose five combinations of reference and input lengths to evaluate their effect. The results of the proposed method using different combinations of lengths are summarised in Table 9. The F1 scores with different lengths are close, which indicates that the proposed method handles different input and reference sequence lengths well.
Meanwhile, we aim to determine the impact of a larger number of decoder layers. We increase the number of decoder layers to four (i.e., the four-layer TCF-Trans) to validate its performance, and we also compare it with the Informer baseline using four decoder layers (i.e., the four-layer Informer baseline). As shown in Table 10, compared with the four-layer Informer baseline, the impact of increasing the number of decoder layers on the proposed method is not serious. This can be explained by our fusion strategy, which fully utilises features from shallow and deep layers to make the method more robust to the noise from a single layer while retaining important details for anomaly detection.
Loss functions other than the MSE loss may also be chosen for training. We evaluate the proposed method using two different training loss functions: the MAE loss and the SmoothL1 loss. The results of using different types of training loss are summarised in Table 11. These results show that the F1 scores with different types of training loss do not vary much, which indicates that various loss functions can be adopted in our method.
Based on the experiments conducted, the proposed method can handle anomaly detection tasks with different settings and shows robustness to these changes. Therefore, the proposed method, TCF-Trans, is an effective solution for anomaly detection in time-series applications.