1. Introduction
Anomaly detection refers to the identification of "abnormal points" in data through data mining methods. Common application scenarios include finance, where "fraud cases" such as credit card application fraud and false credit are identified from financial data; network security, where "intruders" are identified from traffic data and new network intrusion patterns are discovered [1,2]; e-commerce, where "malicious buyers" such as coupon abusers and click-fraud groups are identified from transaction data; early warning of ecological disasters, where forecasts of wind speed, rainfall, temperature, and other indicators are used to judge possible future extreme weather; and industry, where defect detection of industrial products can be carried out by anomaly detection methods instead of inspection by human eyes [3,4]. Many anomaly detection algorithms have been introduced for these fields, but the high dimensionality and non-linearity of the data features make it difficult for them to detect abnormal data effectively [5,6].
In recent years, deep learning [7] has been able to learn complex relationships in high-dimensional, non-linear data, but two main problems remain when it is applied to anomaly detection. (1) Class imbalance: anomalies are usually rare instances, while normal instances account for the vast majority of the data. It is therefore difficult, or even impossible, to collect a large number of labeled abnormal instances, which makes large-scale labeled data unavailable in most applications. (2) Heterogeneity of anomaly types: anomalies are irregular, and one type of anomaly may show completely different characteristics from another. For example, in video surveillance, abnormal events such as robbery, traffic accidents, and theft are visually very different, which poses a huge challenge to the coverage of anomaly detection.
In response to the above two challenges, existing deep learning-based anomaly detection algorithms [5,6,8,9,10,11,12] use unsupervised learning and proceed in two steps (i.e., pipeline (a) in Figure 1). (1) Learn new feature representations for the normal data, such as the intermediate representation of an autoencoder [8,9,12] or the latent space of a generative adversarial network [10,11]. The use of autoencoders for anomaly detection rests on the assumption that normal instances can be reconstructed from the compressed feature space better than abnormal instances; to minimize the overall reconstruction error, the retained information must be as relevant as possible to the input (i.e., normal) instances. Distance-based methods [5,6] instead calculate the distance between each point and its surrounding points to determine whether a point is abnormal, assuming that a normal point has many close neighbors while an abnormal point lies far from its neighbors. (2) Express the anomaly score through the reconstruction error [8,9,10,11,12] or the distance metric [5,6] obtained in the first step. However, the reconstructed features in this two-step approach [8,9,10,11,12] may not fully represent the original features, which reduces the accuracy of the final anomaly detection result. Distance-based methods [5,6] integrate traditional anomaly detection measures into the reconstructed features to mitigate this problem. Although this greatly improves the ability of the reconstructed features to restore the original features, the optimization target remains the feature learning rather than the anomaly scores, which may lead to low learning efficiency and low-quality anomaly scores.
The algorithms above are mainly unsupervised [13], and a common problem in unsupervised learning is that it is difficult to control which features the model learns, because unsupervised learning lacks prior knowledge to guide feature optimization. Supervised learning [14] and semi-supervised learning [15] can address this problem, but large-scale labeled data are unavailable in anomaly detection, so we focus on semi-supervised learning, that is, the use of a limited amount of labeled abnormal data to learn the anomaly scoring model. Such data are often representative anomalies screened manually at an early stage; for example, in financial fraud, a deceived customer reports to the bank, and the bank attaches an anomaly label to that customer's data. Semi-supervised learning avoids wasting data and resources and, at the same time, alleviates both the insufficient generalization ability of supervised models and the inaccuracy of unsupervised models.
In recent years, an end-to-end semi-supervised model, the deviation network, has been proposed (i.e., pipeline (b) in Figure 1) [16]. The method takes raw data as input and directly learns and outputs anomaly scores instead of feature representations. Specifically, given a training data object, the framework first uses a neural anomaly score learner to assign it an anomaly score, and then defines the mean of the anomaly scores of some normal data objects, drawn from a prior probability distribution, as a reference score to guide subsequent anomaly score learning. Finally, the framework defines a loss function, called the deviation loss, to enforce statistically significant deviations of the anomaly scores of anomalies from the scores of normal data objects in the upper tail. Although this method addresses the two problems above, the reference score of the normal data is obtained from a prior probability distribution, so it cannot explain the normal data well.
In this article, in response to the shortcomings of the existing methods, we introduce a variational auto-encoder into the anomaly detection model (i.e., pipeline (c) in Figure 1), which uses a data-driven method to learn the reference score of the normal data, making it more convincing. On this basis, a new anomaly detection framework was designed. First, we learn the normal distribution of the normal data through the variational auto-encoder and use it as the reference score; then, another neural network learns the anomaly scores; finally, the framework defines a deviation loss function that forces the anomaly scores of anomalies to deviate significantly from the reference score of the normal data.
The network model is further instantiated as a method called the variational deviation network (V-DevNet). V-DevNet is a semi-supervised method that uses a small amount of labeled abnormal data, typically from a few to dozens of instances, accounting for only 0.006–1% of all data. These abnormal data are trained through neural networks, the normal distribution generated by the variational auto-encoder is used as the reference score, and the anomaly score is directly optimized through the deviation loss. The variational deviation network designed in this way not only makes the reference score more consistent with different data types, and therefore more convincing, but also generalizes better to multiple types of abnormal data. In addition, unlike the anomaly scores generated by other mainstream algorithms, which are difficult to interpret, the anomaly scores obtained through the deviation loss of the variational deviation network are easier to interpret.
Compared with the class (a) methods in Figure 1, our model does not directly learn a feature representation of the data, but continuously learns the normal distribution of each data instance through the variational auto-encoder, which greatly reduces the inaccuracy caused by reconstruction error. Compared with the class (b) method in Figure 1, our model learns the reference scores through the variational auto-encoder, so it can learn reference scores that match the data themselves for different data types, which yields higher detection accuracy, instead of obtaining the reference scores randomly as in the class (b) method.
Therefore, this article has made the following contributions:
Introduced a new model that incorporates the variational auto-encoder. This model specifies the reference score by learning the normal distribution of each normal data instance, generating different reference scores for different data and giving the reference scores stronger explanatory power; and
The framework instantiates a new deep anomaly detection method, namely the variational deviation network (V-DevNet). V-DevNet optimizes the anomaly score through the anomaly score neural network, the variational auto-encoder, and the deviation loss, and the resulting anomaly scores are accurate and easy to explain.
Experiments on nine existing high-dimensional datasets show that:
In terms of AUC-ROC and AUC-PR, the experimental results of V-DevNet were significantly better than those of the current mainstream anomaly detection methods, with AUC-ROC improved by 2–33% on average and AUC-PR improved by 6–332% on average.
4. Methods
In this paper, a variational auto-encoder was added to the deviation framework [16], instantiating a variational deviation network. This method calculates the reference score in a data-driven manner and uses it in the deviation loss to optimize the anomaly score network. Finally, anomaly scores are obtained by training the anomaly score network.
4.1. Overall Framework
Based on the combination with the variational auto-encoder model, we introduce a new framework consisting of three parts: the anomaly score network, the variational auto-encoder, and the deviation loss function. These three parts are used together to train the anomaly detection model. Our main contribution is that, when generating reference scores, we adopt a data-driven approach, the variational auto-encoder, which makes the reference scores more explanatory and superior to other mainstream models in terms of anomaly scores and data efficiency. As shown in Figure 3, the overall architecture consists of three parts:
Anomaly score network $\phi(x;\Theta)$: its main function is to generate an anomaly score for each input $x$.
Variational auto-encoder: in order to generate reference scores in a data-driven manner, we introduce a variational auto-encoder, which generates the reference scores $\mu_R$ and $\sigma_R$ through the two encoder networks (for the mean and the variance). The reference scores generated in this way are more explanatory and better fit the data themselves.
Finally, we define the deviation loss function and use $\phi(x;\Theta)$, $\mu_R$, and $\sigma_R$ to guide the optimization of the parameters $\Theta$. Through the deviation loss function, the anomaly scores of abnormal data deviate significantly from the reference score $\mu_R$ of the normal data, while the scores of normal data stay close to $\mu_R$.
4.2. Anomaly Score Network
The anomaly score network is composed of three parts: the input layer, the intermediate representation layer, and the anomaly scorer (anomaly score learner). The anomaly score network is expressed as $\phi(x;\Theta): \mathcal{X} \rightarrow \mathbb{R}$, where the intermediate representation layer is expressed as $\psi(x;\Theta_t): \mathcal{X} \rightarrow \mathcal{Q}$ and the anomaly scorer is expressed as $\eta(q;\Theta_s): \mathcal{Q} \rightarrow \mathbb{R}$, where $\Theta = \{\Theta_t, \Theta_s\}$.
The intermediate representation layer is a feature learning network with $H$ hidden layers and weights $\Theta_t$, specifically expressed as:

$q = \psi(x;\Theta_t)$

Different hidden network architectures can be chosen according to the data type, such as convolutional networks, recurrent networks, and multilayer perceptron networks.
The anomaly scorer uses a single neural unit to obtain the anomaly score from the feature representation output by the intermediate representation layer, specifically expressed as:

$\eta(q;\Theta_s) = w_s^{\top} q + b_s$

where $q$ is the intermediate representation and $w_s$ and $b_s$ are the weight and bias parameters of the scorer.
Therefore, the overall anomaly score network can be summarized as:

$\phi(x;\Theta) = \eta\big(\psi(x;\Theta_t);\Theta_s\big)$
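To make the structure concrete, the following is a minimal Keras sketch of such an anomaly score network, assuming a single-hidden-layer MLP feature learner; the layer sizes and names are illustrative and not the exact configuration used in our experiments.

```python
# Minimal sketch of the anomaly score network phi(x; Theta) in Keras.
# Layer sizes and names are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_anomaly_score_network(input_dim, hidden_units=20):
    x_in = keras.Input(shape=(input_dim,), name="x")
    # Intermediate representation psi(x; Theta_t): one hidden layer here.
    q = layers.Dense(hidden_units, activation="relu", name="psi")(x_in)
    # Anomaly scorer eta(q; Theta_s): a single linear unit producing the score.
    score = layers.Dense(1, activation=None, name="eta")(q)
    return keras.Model(x_in, score, name="anomaly_score_network")
```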
4.3. Variational Auto-Encoder
After the anomaly score network is built, a reference score for normal data must be defined to guide its optimization. Reference scores can be generated in two main ways: data-driven and prior-probability-based. We chose the data-driven method mainly because of its good interpretability: when dealing with different data, corresponding reference scores are generated, which improves the accuracy and data efficiency of the anomaly scores.
4.3.1. Variational Auto-Encoder and Autoencoder
The variational auto-encoder [27] has the same structure as the autoencoder [28,29] in that both consist of an encoder and a decoder; the difference is that the encoder of a variational auto-encoder returns a distribution over the latent space rather than a single point. The autoencoder uses neural networks as the encoder and decoder and learns the best encoding–decoding scheme through iterative optimization (as shown in Figure 4). In each iteration, we feed some data to the autoencoder, compare the encoded-and-then-decoded output with the initial data, and update the network weights by back-propagating the error. Intuitively, the whole autoencoder structure (encoder + decoder) creates a data bottleneck that ensures only the main part of the information can pass through and be reconstructed. In our general framework, the candidate encoders are defined by the encoder network architecture, the candidate decoders are defined by the decoder network architecture, and the reconstruction error is reduced by gradient descent over the encoder and decoder parameters. In contrast, the variational auto-encoder describes a probability distribution for each latent attribute through the encoder and adds a regularization term on the returned distribution to the loss function, which addresses the irregularity of the latent space and ensures that it is better organized. For example, in image generation, an autoencoder trained on portraits learns compressed representations of attributes such as skin color, face shape, and mouth. The disadvantage of the autoencoder is that it learns a single value to describe each latent attribute of the image; this is often not general enough and cannot represent features well when the feature representation is complex. The variational auto-encoder solves this problem: we describe the latent attributes in probabilistic terms instead of training a single value, so each latent attribute of a given input is represented as a probability distribution. When decoding from a latent state, we randomly sample from each latent distribution and generate a vector as input to the decoder model.
4.3.2. Principle of Variational Auto-Encoder
The variational auto-encoder generates a latent representation through the encoder and then generates an output $\hat{x}$ similar to the normal data $x$ through the decoder. We only observe the original normal data $x$ and want to infer the characteristics of the latent variable $z$, that is, to compute the posterior $p(z \mid x)$, which is expressed as:

$p(z \mid x) = \dfrac{p(x \mid z)\,p(z)}{p(x)}$

However, computing $p(z \mid x)$ is quite difficult, so we approximate $p(z \mid x)$ with another, tractable distribution $q(z \mid x)$. If we can define the parameters of $q(z \mid x)$ so that it is very similar to $p(z \mid x)$, we can use it to approximate the intractable posterior; here we use the KL divergence, which measures the difference between two probability distributions. To ensure that $q(z \mid x)$ is similar to $p(z \mid x)$, we can minimize the KL divergence between the two distributions.
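For completeness, the standard decomposition below (not spelled out above) shows why minimizing this KL divergence is tractable: it is equivalent to maximizing the evidence lower bound (ELBO), whose negative gives exactly the two-term loss described in the following paragraphs.

$$\mathrm{KL}\big(q(z \mid x)\,\|\,p(z \mid x)\big) = \log p(x) - \Big(\underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)}_{\text{ELBO}}\Big)$$

Since $\log p(x)$ does not depend on $q$, minimizing the left-hand side is equivalent to maximizing the ELBO.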
In the image generation model, we can use $q(z \mid x)$ to infer the latent variables (latent states) that can be used to generate observations. We can construct this model as a neural network, where the encoder learns the mapping from $x$ to $z$, with two sub-networks in the encoder computing the mean and variance of the normal distribution, and the decoder learns the mapping from $z$ back to $x$, as shown in Figure 5.
In order to prevent the noise from collapsing to zero, we require that all $q(z \mid x)$ be aligned with the standard normal distribution while ensuring that the model retains its generative ability, that is, that it can sample $z$ from $\mathcal{N}(0, I)$ to generate an image.
The loss function includes two terms. The first penalizes the reconstruction error, and the second encourages the learned distribution $q(z \mid x)$ to be similar to the true prior distribution $p(z)$, which we assume to be a unit Gaussian for each dimension $j$ of the latent space.
In order for $q(z \mid x)$ to conform to the standard normal distribution, the loss function is specifically expressed as:

$\mathcal{L} = -\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] + \mathrm{KL}\big(q(z \mid x)\,\|\,\mathcal{N}(0, I)\big), \qquad \mathrm{KL} = \tfrac{1}{2}\sum_{j}\big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big) \quad (7)$
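The following is a minimal TensorFlow/Keras sketch of such a variational auto-encoder and its loss; the layer sizes, the latent dimension, and the squared-error reconstruction term are illustrative assumptions rather than our exact configuration.

```python
# Minimal sketch of the variational auto-encoder used to learn q(z|x),
# assuming TensorFlow/Keras; layer sizes and latent dimension are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model, Input

class Sampling(layers.Layer):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

def build_vae(input_dim, latent_dim=20, hidden=64):
    x_in = Input(shape=(input_dim,))
    h = layers.Dense(hidden, activation="relu")(x_in)
    z_mean = layers.Dense(latent_dim)(h)       # mu of q(z|x)
    z_log_var = layers.Dense(latent_dim)(h)    # log(sigma^2) of q(z|x)
    z = Sampling()([z_mean, z_log_var])
    x_hat = layers.Dense(input_dim)(layers.Dense(hidden, activation="relu")(z))
    return Model(x_in, [x_hat, z_mean, z_log_var])

def vae_loss(x, x_hat, z_mean, z_log_var):
    # Reconstruction error plus KL(q(z|x) || N(0, I)), as in Equation (7).
    recon = tf.reduce_sum(tf.square(x - x_hat), axis=-1)
    kl = 0.5 * tf.reduce_sum(
        tf.square(z_mean) + tf.exp(z_log_var) - z_log_var - 1.0, axis=-1)
    return tf.reduce_mean(recon + kl)
```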
4.3.3. Generate Reference Score
We use the normal data $x$ as the input of the variational auto-encoder. The encoder generates, for each data instance, the parameters $\mu$ and $\sigma$ of its latent distribution $z$; the decoder then generates $\hat{x}$, similar to $x$, from the latent space $z$; and finally the latent distribution, and with it $\mu$ and $\sigma$, is further optimized through the loss function (7). The reference scores $\mu_R$ and $\sigma_R$ obtained from the trained variational auto-encoder are close to a normal distribution. Fortunately, according to [30], the normal distribution fits anomaly scores well in many datasets, so we used the data-driven variational auto-encoder, whose reference scores not only have good explanatory properties, but also differ across datasets while remaining close to a normal distribution, which makes them more representative.
4.4. Deviation Loss Function
According to the reference score of the normal data generated by the variational auto-encoder, the deviation function used to optimize the anomaly score is defined as:

$dev(x) = \dfrac{\phi(x;\Theta) - \mu_R}{\sigma_R} \quad (8)$

where $\mu_R$ and $\sigma_R$ are the reference scores, close to a normal distribution, generated by the encoder of the variational auto-encoder. The deviation function is then used in the loss function that guides the optimization [31]:

$L\big(\phi(x;\Theta), \mu_R, \sigma_R\big) = (1 - y)\,\lvert dev(x)\rvert + y\,\max\big(0,\, a - dev(x)\big) \quad (9)$
Here, $y = 1$ denotes abnormal data and $y = 0$ denotes normal data. According to the loss function in (9), when $y = 0$, the latter term of the loss is 0 and the former term is $\lvert dev(x)\rvert$, which optimizes the anomaly score so that normal data stay close to the reference score $\mu_R$, while the anomaly scores generated for abnormal data deviate from $\mu_R$. When $y = 1$, the first term of the loss is 0; the loss then encourages the abnormal data to have a significant deviation from the reference score, so that abnormal data deviate from the reference score to the greatest extent. Here, we used $a = 5$, which enforces a significant deviation.
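A minimal TensorFlow sketch of this deviation loss (Equations (8) and (9)) is given below, assuming the reference scores $\mu_R$ and $\sigma_R$ have already been produced by the variational auto-encoder; the function and argument names are ours.

```python
# Minimal sketch of the deviation loss (Equations (8)-(9)) in TensorFlow.
import tensorflow as tf

def deviation_loss(y_true, scores, mu_r, sigma_r, a=5.0):
    # dev(x) = (phi(x; Theta) - mu_R) / sigma_R
    dev = (scores - mu_r) / sigma_r
    normal_term = (1.0 - y_true) * tf.abs(dev)        # pull normal data toward mu_R
    anomaly_term = y_true * tf.maximum(0.0, a - dev)  # push anomalies above mu_R
    return tf.reduce_mean(normal_term + anomaly_term)
```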
In practical applications, only a few data instances carry anomaly labels. For normal data, we address this by treating all data without anomaly labels as normal. The variational auto-encoder is also trained on this unlabeled data, and we generate the reference scores by taking their average. The experimental results show that this approach works well in practice: the unlabeled "normal" data in anomaly detection applications may contain very rare anomalies, but these have very limited influence on the optimization of the anomaly scores. In the experimental section, our variational deviation network and the other deep methods were trained on data prepared in this way, and the effect was evident.
4.5. Variational Deviation Network Algorithm
The variational deviation network consists of two main algorithms: Algorithm 1 trains the variational auto-encoder [18] and generates the reference scores, and Algorithm 2 optimizes the anomaly score network using the reference scores obtained from Algorithm 1.
Algorithm 1 realizes the whole process of generating the reference score. Because the reference score needs to be obtained in a data-driven manner, additional training of the variational auto-encoder is required to ensure the accuracy and interpretability of the reference score. The training data $\mathcal{X}$ consist of $\mathcal{U}$ and $\mathcal{K}$, where $\mathcal{U}$ is the (assumed) normal data and $\mathcal{K}$ is the labeled abnormal data. In Algorithm 1, the main input is the normal data $\mathcal{U}$. In steps 2–7, the weights of the encoder in the variational auto-encoder are trained to obtain an encoder that can generate $\mu$ and $\sigma$; step 4 optimizes the loss function over the weights. Step 8 feeds the normal data into the trained encoder, step 9 outputs the $\mu_i$ and $\sigma_i$ of each normal data instance, and step 10 computes their averages. The reference scores $\mu_R$ and $\sigma_R$ used throughout this article are these averages.
Algorithm 1 Training the variational auto-encoder (generating the reference scores).
Input: normal data $\mathcal{U}$
Output: reference scores $\mu_R$, $\sigma_R$
1: randomly initialize the encoder and decoder weights
2: for i = 1 to n_epochs do
3:   for j = 1 to n_batches do
4:     compute the loss (7)
5:     perform gradient descent
6:   end for
7: end for
8: input the normal data into the trained encoder
9: output $\mu_i$ and $\sigma_i$ for each normal data instance
10: $\mu_R, \sigma_R \leftarrow$ average of the $\mu_i$ and $\sigma_i$
Algorithm 2 realizes the whole process of the variational deviation network. It takes the training data $\mathcal{X}$ as input. Steps 2–7 train the weights of the anomaly score network; step 4 optimizes the deviation loss, which incorporates the reference scores $\mu_R$ and $\sigma_R$ obtained from Algorithm 1; and the final output is the trained variational deviation network model.
Algorithm 2 Training the variational deviation network.
Input: training data $\mathcal{X} = \mathcal{U} \cup \mathcal{K}$
Output: the trained anomaly score network $\phi(\cdot;\Theta)$
1: randomly initialize the network weights $\Theta$
2: for i = 1 to n_epochs do
3:   for j = 1 to n_batches do
4:     compute the deviation loss (9) using $\mu_R$ and $\sigma_R$ from Algorithm 1
5:     optimize the weights $\Theta$ by gradient descent
6:   end for
7: end for
8: return the trained model
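To show how the two algorithms fit together, the sketch below combines the earlier illustrative helpers (`build_vae`, `vae_loss`, `build_anomaly_score_network`, and `deviation_loss`); the optimizer, batch size, and the way the reference scores are averaged are assumptions, not the exact procedure of the paper.

```python
# Illustrative end-to-end training loop for V-DevNet, combining the earlier
# sketches. All hyperparameters here are assumptions.
import numpy as np
import tensorflow as tf

def train_v_devnet(x_unlabeled, x_train, y_train, input_dim,
                   n_epochs=50, batch_size=512, lr=1e-3):
    # --- Algorithm 1: train the VAE on (assumed) normal data, get mu_R, sigma_R.
    vae = build_vae(input_dim)
    opt_vae = tf.keras.optimizers.RMSprop(lr)
    for _ in range(n_epochs):
        for i in range(0, len(x_unlabeled), batch_size):
            xb = tf.convert_to_tensor(x_unlabeled[i:i + batch_size], tf.float32)
            with tf.GradientTape() as tape:
                x_hat, z_mean, z_log_var = vae(xb, training=True)
                loss = vae_loss(xb, x_hat, z_mean, z_log_var)
            grads = tape.gradient(loss, vae.trainable_variables)
            opt_vae.apply_gradients(zip(grads, vae.trainable_variables))
    _, z_mean, z_log_var = vae(tf.convert_to_tensor(x_unlabeled, tf.float32))
    mu_r = float(tf.reduce_mean(z_mean))                 # averaged reference scores
    sigma_r = float(tf.reduce_mean(tf.exp(0.5 * z_log_var)))

    # --- Algorithm 2: train the anomaly score network with the deviation loss.
    scorer = build_anomaly_score_network(input_dim)
    opt = tf.keras.optimizers.RMSprop(lr)
    for _ in range(n_epochs):
        idx = np.random.permutation(len(x_train))
        for i in range(0, len(idx), batch_size):
            b = idx[i:i + batch_size]
            xb = tf.convert_to_tensor(x_train[b], tf.float32)
            yb = tf.convert_to_tensor(y_train[b], tf.float32)
            with tf.GradientTape() as tape:
                scores = tf.squeeze(scorer(xb, training=True), axis=-1)
                loss = deviation_loss(yb, scores, mu_r, sigma_r)
            grads = tape.gradient(loss, scorer.trainable_variables)
            opt.apply_gradients(zip(grads, scorer.trainable_variables))
    return scorer, mu_r, sigma_r
```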
5. Results
5.1. Dataset
In the experimental part, we used nine representative datasets (see Table 1 and Table 2 for details), including five datasets with obvious anomalies and four datasets whose labeled anomalies are not obvious. The datasets cover various fields, such as finance (identifying credit card application fraud, false credit, etc.), network security (finding "intruders" in traffic data and identifying new network intrusion patterns), and medicine (patients with abnormal thyroid function).
5.2. Comparison Algorithm
The variational deviation algorithm is compared with five algorithms: the deviation network (DevNet) [16], REPEN [5], Deep-SVDD [6], FSNet (prototypical networks for few-shot learning) [32], and iForest (isolation-based anomaly detection) [17], because these five algorithms are among the most advanced methods in the field of anomaly detection. DevNet is an end-to-end anomaly detection algorithm; REPEN is a semi-supervised deep anomaly detection algorithm designed for a small amount of labeled abnormal data; Deep-SVDD learns a feature representation of normal data; FSNet is an anomaly detection algorithm for few-shot learning; and iForest is a typical unsupervised anomaly detection algorithm. All algorithms are implemented in Python: V-DevNet, DevNet, Deep-SVDD, FSNet, and REPEN are implemented with Keras [33], and iForest is called from scikit-learn. In training, the first four algorithms all used the gradient optimization algorithm of [34].
5.3. Experiment Settings
Since our experimental data were multidimensional and unordered, the new model designed here mainly consists of a multilayer perceptron (MLP) network. For both the variational auto-encoder and the anomaly score network, we designed two architectures: one with a single hidden layer of 20 neural units, and another with three hidden layers for higher-dimensional and more complex data, with 1000 units in the first hidden layer, 250 in the second, and 20 in the third. We used the ReLU activation function because of its efficient computation and gradient propagation.
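For reference, a minimal Keras sketch of the two MLP backbones described above is given below; regularization and initialization settings are omitted and the function name is ours.

```python
# Sketch of the two MLP backbones: architecture 1 (one hidden layer of 20 units)
# and architecture 2 (1000-250-20 units), assuming Keras.
from tensorflow import keras
from tensorflow.keras import layers

def mlp_backbone(input_dim, deep=False):
    units = [1000, 250, 20] if deep else [20]
    x_in = keras.Input(shape=(input_dim,))
    h = x_in
    for u in units:
        h = layers.Dense(u, activation="relu")(h)
    return keras.Model(x_in, h, name="mlp_backbone")
```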
Table 1 shows the results of architecture 1, trained with one hidden layer. The comparison algorithms above were trained for 50 epochs, with 20 mini-batches in each epoch.
In a real-world anomaly detection environment, there are a large number of unlabeled data objects and a very small set of anomalous data. To fit this scenario, the anomalous and normal objects of each dataset were first divided into two subsets, with 80% of the data used as the training set and 20% as the test set. To conform to the realistic scenario, we randomly added or removed anomalous data in each training dataset so that anomalies accounted for 2% of the training data; the resulting data constitute the unlabeled training dataset U. We further randomly selected 30 anomalies from the anomaly class as prior knowledge of the anomalies of interest (i.e., the labeled anomaly set K, which accounts for only 0.005–1% of all training data objects), thus constituting a realistic anomaly detection environment.
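The following sketch illustrates one way to build such a split with scikit-learn and NumPy; the variable names, the 2% contamination target, and the handling of leftover anomalies are assumptions for illustration only.

```python
# Illustrative construction of the unlabeled set U (~2% contamination) and the
# labeled anomaly set K; details are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

def make_split(x_normal, x_anomaly, contamination=0.02,
               n_labeled_anomalies=30, seed=42):
    rng = np.random.default_rng(seed)
    # 80/20 train/test split for both normal and anomalous objects.
    xn_tr, xn_te = train_test_split(x_normal, test_size=0.2, random_state=seed)
    xa_tr, xa_te = train_test_split(x_anomaly, test_size=0.2, random_state=seed)
    xa_tr = xa_tr[rng.permutation(len(xa_tr))]
    # Labeled anomaly set K: a handful of known anomalies.
    K = xa_tr[:n_labeled_anomalies]
    # Unlabeled set U: training normals plus enough extra anomalies to reach ~2%.
    n_contam = int(contamination * len(xn_tr) / (1.0 - contamination))
    U = np.concatenate(
        [xn_tr, xa_tr[n_labeled_anomalies:n_labeled_anomalies + n_contam]], axis=0)
    x_test = np.concatenate([xn_te, xa_te], axis=0)
    y_test = np.concatenate([np.zeros(len(xn_te)), np.ones(len(xa_te))])
    return U, K, x_test, y_test
```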
5.4. Performance Evaluation Method
In this experiment, we mainly used two important evaluation indicators, AUC-ROC and AUC-PR. Before understanding these two types of indicators, we need to understand the relevant terms. Please refer to
Table 3 for details.
In the AUC-ROC evaluation index, we need two reference values. One is the true positive rate:

$TPR = \dfrac{TP}{TP + FN}$

which is the ratio between the correctly predicted positive examples and all positive examples in the dataset, i.e., the recall of the positive class. The other is the false positive rate:

$FPR = \dfrac{FP}{FP + TN}$

which is the ratio between the negative examples incorrectly predicted as positive and all negative examples in the dataset. The ROC graph is drawn from these two values, with TPR on the y-axis and FPR on the x-axis. Drawing a vertical line $x = \alpha$ in the ROC chart, its intersection with an algorithm's ROC curve expresses how large a positive-class recall the algorithm can achieve when the negative-class "recall" reaches $\alpha$; the same holds for horizontal lines. To obtain a single value as a reference indicator, the AUC (area under the curve) is used, that is, the total area under the ROC curve. When the positive-class recall is large while the false positive rate is small, the algorithm performs better; in this case the ROC curve tends toward the upper left, being as "convex to the left" as possible, and the area under the curve becomes larger. Therefore, the larger the AUC-ROC, the better the algorithm performance.
Two reference values are also needed for the AUC-PR evaluation index. One is the precision:

$P = \dfrac{TP}{TP + FP}$

which is the ratio between the correctly predicted positive examples and all examples predicted as positive, indicating the accuracy of the positive predictions. The other is the recall:

$R = \dfrac{TP}{TP + FN}$

which is the ratio between the correctly predicted positive examples and all positive examples in the dataset. When plotting the PR graph, P is the y-axis and R is the x-axis. In general, once the recall exceeds a certain value, the PR curve behaves as a decreasing function: with the dataset and the algorithm fixed, increasing the recall (x-axis) requires increasing the total number of positive predictions, which brings in more negative examples predicted as positive and lowers the precision (y-axis). As before, the AUC (area under the curve) is used to obtain a single reference value. When the algorithm achieves high recall and high precision with a small number of positive predictions, its performance is good; in this case the PR curve tends toward the upper right, being as "convex to the right" as possible, and the area under the curve is as large as possible. Therefore, the larger the AUC-PR, the better the algorithm performance.
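As a concrete illustration, both indicators can be computed from anomaly scores with scikit-learn's standard implementations (the labels and scores below are placeholders):

```python
# AUC-ROC and AUC-PR (average precision) for anomaly scores, using scikit-learn.
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = [0, 0, 0, 1, 0, 1]              # 1 = anomaly, 0 = normal (placeholder labels)
scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.8]  # anomaly scores from the model

auc_roc = roc_auc_score(y_true, scores)
auc_pr = average_precision_score(y_true, scores)  # average-precision estimate of AUC-PR
print(f"AUC-ROC={auc_roc:.3f}, AUC-PR={auc_pr:.3f}")
```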
The study in [35] found that when a dataset is imbalanced, that is, when the numbers of positive and negative examples differ greatly, which is exactly the characteristic of anomaly detection data, AUC-ROC shows smaller differences between algorithms, while AUC-PR differentiates their performance more effectively. In order to further improve the accuracy of the evaluation, we applied the average-precision method of [36] to calculate AUC-PR, with AUC-ROC used as an additional reference index. To demonstrate the superiority of V-DevNet, we also applied the Wilcoxon signed-rank test [37], whose idea is to rank the absolute values of the differences between the observed values and the center assumed by the null hypothesis (or between the two sets of sample values) and to sum the ranks separately according to their signs as the test statistic.
Using the above evaluation methods, the AUC-PR and AUC-ROC results of the variational deviation network and the other five comparison methods are shown in Table 1 and Table 2. From the tables, we can see that the variational deviation network showed the best performance on datasets 7 and 9. Under the AUC-ROC index, comparing the variational deviation network with the other methods using an ANOVA test, we found that the p-value was less than 0.05, so the differences were significant; by comparing the means, we found that the performance of the variational deviation network was significantly higher than that of the deviation network (2%), REPEN (12%), Deep-SVDD (6%), FSNet (25%), and iForest (33%). Under the AUC-PR index, the ANOVA test against the other methods also gave a p-value less than 0.05, so the differences were significant; by comparing the means, we found that the performance of the variational deviation network was significantly higher than that of the deviation network (6%), REPEN (130%), Deep-SVDD (30%), FSNet (124%), and iForest (332%). The experimental results show that using the variational auto-encoder to learn the reference scores in a data-driven way, and using these reference scores to guide the optimization of the deviation network, yields significantly better results than the other methods. In the study of time complexity, we found that the overall running time of V-DevNet grows linearly with the size and dimensionality of the data, which provides a strong guarantee for the practical use of the algorithm.
6. Conclusions
This article added a variational auto-encoder to the deviation network and instantiated a new V-DevNet framework. Instead of learning reconstructed features, the framework learns the normal distribution of each data instance as the reference score. Through data-driven training, values of $\mu_R$ and $\sigma_R$ that match the data are obtained, which more accurately characterize the distribution of the normal (unlabeled) data and have good interpretability. The performance in terms of AUC-ROC and AUC-PR was higher than that of the deviation network (whose reference score is randomly generated from a prior) and significantly higher than that of the other four types of anomaly detection methods (specific comparative data are shown in Table 1).
However, the proposed method has certain limitations. Because it is based on the variational auto-encoder, the probability distribution of the normal data is optimized toward a normal distribution, that is, toward the mean and variance obtained from the normal distribution, in order to optimize the reference score. Although the literature cited above notes that most datasets conform well to a normal distribution, this cannot cover all cases, so further research will investigate whether the variational auto-encoder can be fitted to other probability distributions.
In the future, we will consider using V-DevNet for anomaly detection on image and text data. A convolutional neural network can also be added to the anomaly score network in V-DevNet. We believe that V-DevNet can then better capture the normal distribution of images and text, which will further improve its performance.