1. Introduction
Hyperspectral images are three-dimensional data cubes that combine two-dimensional spatial information with abundant spectral information across hundreds of narrow contiguous bands [
1]. Hyperspectral images have greater advantages in the areas of ground object classification [
2,
3,
4], change detection [
5,
6], and target detection [
7,
8,
9], owing to their powerful spatial–spectral information expression capabilities compared with other remote sensing images [
10], such as visible and infrared imagery. According to whether the spectral information of targets is known, target detection can be divided into supervised target detection and anomaly detection [
11,
12]. Anomaly detection divides the pixels into anomaly and normal (background) without prior spectral information of the targets [
13]. Generally, anomalies in hyperspectral images refer to the spectral irregularities caused by the presence of atypical ground objects [
14]. Anomalies have no authoritative definition, but they are widely agreed to be targets whose spectral information clearly deviates from the background distribution, such as airplanes, vehicles, and other man-made objects in natural scenes [
15,
16].
Traditional methods of hyperspectral anomaly detection are generally divided into three categories: background modeling-based methods, distance-based methods, and representation-based methods [
17,
18].
The background modeling method is based on statistical theory. Assuming that the background obeys a certain distribution, statistical methods can be used, once the distribution's parameters have been estimated, to measure the likelihood that a test pixel is anomalous.
The well-known Reed–Xiaoli (RX) detector is a classic background modeling method [
19]. It builds a statistical model of the background through the mean and covariance matrix of the whole image with the assumption that the background obeys a multivariate Gaussian distribution. Then, it detects anomalies by measuring the Mahalanobis distance between the test pixel spectrum and background spectral distribution. However, a whole-image-based background modeling method may be inaccurate with a complex background consisting of a variety of different distributions. Therefore, the Local RX (LRX) method uses samples from the local area around the test pixel to perform statistical background modeling [
20]. Another issue with the RX method is that the background does not obey the multivariate Gaussian distribution in most cases. Therefore, Kwon and Nasrabadi [
21] proposed a detection method based on the kernel function (Kernel RX, KRX), which maps hyperspectral image data into a higher dimensional feature space. Their results show that hyperspectral data tend to obey the multivariate Gaussian distribution in such a kernel space. Many improved RX detectors have appeared in follow-up studies, such as Segmented-RX [
22], Weighted-RX, and linear filter-based RX detectors [
23], etc.
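As a concrete illustration of the RX principle described above, the following is a minimal NumPy sketch (our own illustrative helper, not the implementation from the cited papers); the covariance pseudo-inverse is used for numerical stability:

```python
import numpy as np

def rx_detector(hsi):
    """Global RX sketch: score each pixel by its (squared) Mahalanobis
    distance to the mean spectrum of the whole image.

    hsi: (H, W, B) hyperspectral cube; returns an (H, W) score map.
    """
    H, W, B = hsi.shape
    X = hsi.reshape(-1, B).astype(np.float64)   # pixels as rows
    mu = X.mean(axis=0)                         # background mean spectrum
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    diff = X - mu
    # squared Mahalanobis distance per pixel
    scores = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return scores.reshape(H, W)
```

A local RX (LRX) variant would instead estimate the mean and covariance from a window around each test pixel rather than from the whole image.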
In distance-based hyperspectral anomaly methods, pixels are grouped according to distance, and pixels that deviate from the cluster centers are considered anomalies. Distance-based approaches can be further classified into support vector machine (SVM)-based, clustering-based, and graph-based methods. The SVM-based method encloses the background samples in a minimum-volume hypersphere via support vector data description (SVDD) [
24], and the distance between the test pixel and the center of this hypersphere in the higher-dimensional feature space is taken as the anomaly score [
25]. In the clustering-based method, the original hyperspectral data are clustered by k-means [
26], fuzzy clustering [
27] or other approaches, and the Mahalanobis distance is calculated for each cluster to estimate the anomalous scores of test pixels. Graph theory [
28], which reflects the internal structure of the data, is introduced in the graph-based method. Pixels in hyperspectral images are considered vertices of a graph, and two pixels are connected by an edge, becoming inseparable, when their similarity exceeds a given threshold. The large components of the graph form the background set; therefore, a test pixel's anomaly score is taken as its distance from the nearest point in the background graph [
29,
30].
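To make the clustering-based idea concrete, here is a small NumPy sketch (illustrative only, not any specific cited method; the function names are hypothetical): pixels are clustered with plain k-means, and each pixel is then scored by the Mahalanobis distance to its own cluster's statistics.

```python
import numpy as np

def _init_centers(X, k):
    # deterministic farthest-point initialization
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers, dtype=float)

def cluster_mahalanobis(X, k=2, n_iter=20):
    """X: (N, B) pixel spectra. Returns per-pixel anomaly scores."""
    centers = _init_centers(X, k)
    for _ in range(n_iter):                      # Lloyd's k-means
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(0)
    scores = np.zeros(len(X))
    for c in range(k):                           # per-cluster Mahalanobis
        idx = labels == c
        if idx.sum() < 2:
            continue
        mu = X[idx].mean(0)
        cov_inv = np.linalg.pinv(np.cov(X[idx], rowvar=False))
        diff = X[idx] - mu
        scores[idx] = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return scores
```

A pixel far from its own cluster's distribution receives a high score even when the scene contains several distinct background materials.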
To avoid estimating the background with an inappropriate data distribution or evaluating anomalies by cluster distance alone, representation-based methods express each pixel in a hyperspectral image as a linear combination of elements in a background dictionary. Because the background and anomalies belong to different clusters, background pixels are represented more accurately than anomalies by a dictionary constructed from background samples, so the residual between the represented and original images serves as a measure of anomaly. Sparse representation (SR) [
31], collaborative representation (CR) [
32], and low-rank representation (LRR) [
33] are all representation methods.
SR theory considers that signals are sparse [
34]; i.e., the number of essential background features is far smaller than the number of samples. Therefore, background pixels can be expressed by a linear combination of as few elements as possible from a given over-complete dictionary constructed from a large amount of background sample data. In the SR-based anomaly detection method, anomalies are reconstructed less accurately through a learned dictionary than the background [
13], and various anomaly detection models based on sparse representation theory have been proposed [
35,
36,
37]. Furthermore, a CR-based anomaly detector was proposed based on the assumption that the collaboration between dictionary atoms is more considerable than the competition between them [
38]. The number of dictionary atoms used to represent a test pixel is required to be as small as possible in SR theory, while CR theory uses all dictionary atoms. CR-based methods have rapidly spawned many extensions [
39,
40,
41]. The over-complete background dictionary in SR and CR theory is usually obtained by a dual-window strategy [
42], meaning that only local attributes constrain the coding vector while global structural information is ignored. In addition, without prior information, it is difficult to set the dual window to a reasonable size that keeps the dictionary free of anomaly contamination.
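As an illustration of the representation idea (a hedged sketch, not the CRD algorithm of [38]): given a background dictionary D, for example spectra taken from the outer region of a dual window, collaborative representation solves a ridge-regularized least-squares problem over all atoms, and the reconstruction residual is the anomaly score. The helper name and the λ value are illustrative.

```python
import numpy as np

def cr_residual(y, D, lam=1e-2):
    """y: (B,) test spectrum; D: (B, M) background dictionary.
    Solves min_a ||y - D a||^2 + lam ||a||^2 using ALL atoms (the CR
    assumption), then returns the reconstruction residual as the score."""
    M = D.shape[1]
    a = np.linalg.solve(D.T @ D + lam * np.eye(M), D.T @ y)
    return float(np.linalg.norm(y - D @ a))
```

Pixels well explained by the background dictionary yield a small residual; spectra that deviate from the background span yield a larger one. An SR detector would instead enforce sparsity on `a` (e.g., via orthogonal matching pursuit).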
Different from the two theories above, LRR assumes that each pixel in hyperspectral data can be expressed as a linear representation of the basis vectors formed by the dictionary [
43]. In real hyperspectral images, the number of categories is far smaller than the number of pixels; therefore, the matrix's rows or columns are strongly correlated, making the hyperspectral image low-rank. Based on LRR theory, a hyperspectral image is decomposed into a low-rank background matrix and a sparse anomaly matrix, under the assumption that the background lies in a low-dimensional subspace and is highly correlated, while anomalies occupy only a small part of the image scene and have a low occurrence probability [
44]. Other algorithms were presented to better represent and decompose the background and anomaly matrices [
45,
46,
47], finding the lowest-rank representation of all pixels under a background dictionary while constraining the anomalies to be sparse.
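The low-rank plus sparse decomposition can be sketched with the classic inexact-ALM solver for robust PCA (a stand-in for the LRR solvers cited above, not their exact algorithms): the background matrix L is recovered by singular-value thresholding and the anomaly matrix S by entrywise shrinkage. The function name is hypothetical.

```python
import numpy as np

def soft(x, t):
    """Entrywise soft-thresholding (shrinkage) operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def rpca_ialm(X, lam=None, n_iter=100, tol=1e-7):
    """Inexact augmented-Lagrange-multiplier solver for
    X = L (low-rank background) + S (sparse anomalies)."""
    m, n = X.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    norm2 = np.linalg.norm(X, 2)
    Y = X / max(norm2, np.linalg.norm(X, np.inf) / lam)  # dual variable
    mu, rho = 1.25 / norm2, 1.5
    S = np.zeros_like(X)
    L = np.zeros_like(X)
    for _ in range(n_iter):
        # low-rank update: singular-value thresholding
        U, sv, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = (U * soft(sv, 1.0 / mu)) @ Vt
        # sparse update: entrywise shrinkage
        S = soft(X - L + Y / mu, lam / mu)
        Z = X - L - S
        Y = Y + mu * Z
        mu *= rho
        if np.linalg.norm(Z) / max(np.linalg.norm(X), 1e-12) < tol:
            break
    return L, S
```

For a hyperspectral cube, X would hold pixels as columns (bands × pixels); the column-wise energy of S then serves as the anomaly score.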
In recent years, deep learning has gradually become popular in the field of computer vision and has achieved many outstanding results. Due to its excellent performance in modeling and extracting features from complex data, deep learning has been introduced into hyperspectral anomaly detection. A deep neural network's main role in anomaly detection is to act as a feature extractor that captures the essential features of the original hyperspectral image while reducing its dimensionality.
Most researchers proposed generative models, including autoencoders (AEs) [
48], deep belief networks (DBNs) [
49], and generative adversarial networks (GANs) [
50], which minimize the error between the original and reconstructed spectra to extract features in hidden-layer or reconstruction spaces. Subsequently, traditional anomaly detectors (such as RX, CRD, and LRR) are applied to the features extracted by the deep neural network to perform the final detection. Arisoy et al. [
51] utilized a GAN model to generate a synthetic background image similar to the original and an RX detector to detect anomalies on the residual image of synthetic and original images. Jiang et al. [
52] used an adversarial autoencoder (AAE) to reconstruct the hyperspectral image and combined the morphological filter and RX detector on the residual image of original and reconstructed images to detect anomalies. AAE was also adopted in the spectral adversarial feature learning (SAFL) architecture proposed by Xie et al. [
53]. Self-attention mechanisms [
54] have been widely employed for anomaly detection in texts [
55], videos [
56], and industrial images [
57] because of their effective learning capabilities. Jiang [
58] proposed a manifold-constrained, multi-head, self-attention, variational autoencoder (MMS-VAE) method for hyperspectral anomaly detection, which introduced a self-attention mechanism to focus on abnormal areas by learning context-related information.
The main problem for methods based on deep learning is that the objective functions do not jointly optimize the feature extractor and anomaly detector, which makes the deep neural network unable to exert its advantages.
To address the above problems, in this paper we propose a new deep-learning-based hyperspectral anomaly detection model that jointly optimizes the feature extractor and the hyperplane layer (i.e., the anomaly detector). Accordingly, we present a model with an extractor that obtains both global and local features in hyperspectral images, together with an anomaly weight map that roughly evaluates the probability that each pixel is anomalous and enhances possible anomalous regions. Specifically, we designed our network structure and loss function as an extension of the basic one-class support vector machine (OC-SVM) algorithm [
24,
59]. OC-SVM is a variant of the traditional SVM algorithm in which all training samples are considered as positive and the original data are mapped to a new high-dimensional space through the kernel function, solving the problem of unbalanced samples in two-class classification. Our proposed model replaces the kernel function in the original OC-SVM with a feature extractor based on a deep neural network with optimizable parameters, making it possible to jointly train the feature extraction network with the subsequent hyperplane layer for a one-class objective.
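The one-class objective underlying this design can be written as a hinge loss on the signed output of the hyperplane layer. The sketch below is a generic OC-NN-style formulation in NumPy (our own hypothetical helpers, with the network omitted; the paper's actual Equation (18) may differ in form):

```python
import numpy as np

def one_class_loss(w, features, r, nu=0.1):
    """Generic one-class objective: features (N, d) come from the extractor
    phi_theta, w (d,) is the hyperplane weight, r the margin. Background
    features should satisfy w . phi(x) >= r; the hinge term penalizes
    violations, and nu trades margin violations against the margin r."""
    scores = features @ w
    hinge = np.maximum(0.0, r - scores).mean()
    return 0.5 * float(w @ w) + hinge / nu - r

def anomaly_score(w, features, r):
    # pixels falling far below the hyperplane are scored as anomalous
    return r - features @ w
```

Because the loss is differentiable in both w and the extractor parameters (via `features`), gradients can flow through the hyperplane layer into the feature network, which is what enables the joint training described above.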
In the feature extractor, both global and local features are obtained, and the original data are mapped into a new feature space. We introduced an attention mechanism to mine the feature correlations between each pixel and the others in the image as global information. In addition, we utilized an anomaly weight map to enhance possible anomalous regions, considering that low-SNR targets may degrade anomaly detection performance. Moreover, we also designed a local feature extraction block, because a local spatial structure helps to improve the efficiency of feature utilization [
60,
61]. In the hyperplane layer, most of the pixels are considered background, while the origin of the high-dimensional space represents the anomaly points. Finally, we trained the best hyperplane to separate the background from the anomalies by maximizing the distance between the hyperplane and the representative anomaly point.
Compared with other existing state-of-the-art hyperspectral anomaly detection methods, the main contributions of our new proposed method are summarized as follows:
We propose a new hyperspectral anomaly detection model and an unsupervised training framework. Whereas other methods treat them separately, we combine and co-optimize the feature extraction network and the hyperplane layer through a one-class objective function, which maximizes the advantage of a deep neural network in extracting features customized for the anomaly detector.
Our model simultaneously extracts global and local image features to achieve higher feature utilization. We calculate the relevant information between pixels with a self-attention mechanism to obtain the image's global features, and we also use a local feature extraction block to mine the local spatial information of each pixel's neighboring regions.
We apply the anomaly weight map in the self-attention mechanism to enhance the possible anomalous regions in the original image. The differences between targets and the background are highlighted through the anomaly weight map, which helps the final anomaly detection results, especially when the target is weak and confounded by the background.
The remainder of our article is organized as follows.
Section 2 presents the proposed method in detail.
Section 3 introduces experimental settings and provides the detection results.
Section 4 discusses the efficiency of the adversarial learning and joint training methods. Finally,
Section 5 presents our conclusions.
3. Experiments
We used six real hyperspectral images and five methods for comparison, including traditional anomaly detection methods, such as GRX [
19], CRD [
39], and LSAD [
69], and two recent anomaly detection methods based on deep learning, namely, HADGAN [
52] and LREN [
47], to verify the performance of our proposed hyperspectral anomaly detection algorithm. Because three-dimensional convolution can play different roles according to the anomaly size in images, we also introduced another version without the use of three-dimensional convolution for spatial feature extraction named SAOCNN_NS, where “NS” stands for “no spatial features”.
3.1. Datasets Description
We evaluated our proposed method on six public hyperspectral datasets, including Coast [
15], Pavia [
70], Washington DC Mall, HYDICE [
71], Salinas, and San Diego [
72].
Coast: The Coast dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor and contains 100 × 100 spectral vectors, each with 207 channels covering a range of 450 to 1350 nm. After removing some noisy bands, 141 bands remain. Buildings of different sizes in the image were marked as anomalies; the total number of anomalous pixels was 155, accounting for 1.55%, and no single target occupied more than 20 pixels.
Pavia: The Pavia dataset was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the Pavia city center. A sub-image with 100 × 100 px, 102 spectral bands in wavelengths ranging from 430 to 860 nm, and a 1.3 m spatial resolution was selected from the hyperspectral data. The background was composed of rivers and bridges, and the anomalous targets were vehicles driving on the bridge, as well as bare soil near the bridge piers. The anomaly distribution map was obtained in ENVI by manual annotation; the total number of anomalous pixels was 61, accounting for 0.61%, and the pixel count of each anomalous target ranged from 4 to 15.
Washington DC Mall: This dataset is composed of airborne hyperspectral data over the Washington DC Mall. Its size is 1208 × 307, and it contains 191 bands after discarding some useless bands. We selected a sub-image with 100 × 100 px in which the main background was vegetation; the anomalous targets were the two man-made buildings in the image, occupying 7 and 12 pixels, respectively, accounting for 0.19% in total.
HYDICE urban: This dataset was collected by the Hyperspectral Digital Imagery Collection Experiment (HYDICE) airborne sensor for an urban area. A sub-image of this dataset was selected, which had 80 × 100 pixels and 162 spectral bands in a wavelength ranging from 400 to 2500 nm. The background contained vegetation, highways, and parking lots, and the image’s anomalous targets were vehicles. The total number of anomalous pixels was 19, accounting for 0.24%; furthermore, the pixel number of each anomalous target ranged from 1 to 4.
Salinas Valley: This dataset was collected by the high-spatial-resolution, 224-band AVIRIS sensor over Salinas Valley, California. It had 204 bands after discarding the 20 water absorption bands. The original data size was 512 × 217, and we selected the 200 × 200 sub-image where the background was mainly various crops and roads between farmland; furthermore, the anomalous targets were man-made houses of different sizes. We obtained the anomaly distribution map in ENVI software by manual annotation, and the total number of anomalous pixels was 102, accounting for 0.22%; furthermore, the size of one single target ranged from 10 to 50 pixels.
San Diego: This dataset was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over San Diego, CA, USA. The spatial resolution was 3.5 m per pixel. We selected a 90 × 90 sub-image, which contained 189 bands and a spectral range from 400 to 2500 nm. The background included aprons, grass, roofs, and shadows. The three aircraft in the scene were considered anomalous targets; they occupied 134 pixels, accounting for 1.34%, and the size of each single target ranged from 30 to 50 pixels.
In the hyperspectral image datasets above, we considered the anomalous targets in San Diego and Salinas as planar targets, while the smaller targets in the other datasets are considered small targets.
3.2. Evaluation Metric
We used receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) to evaluate detection results [
73] and to quantitatively analyze and compare the detection performance of our proposed method with the other methods. Two types of ROC curves are applied in our experiments, based on the three parameters $P_d$, $P_f$, and $\tau$, which represent the true positive rate, the false positive rate, and the detection threshold, respectively. The first type is the ROC curve of $(P_d, P_f)$, which indicates the trade-off between $P_d$ and $P_f$: the closer the curve is to the upper-left corner and the closer its area is to 1, the better the detection performance. The other type is the ROC curve of $(P_f, \tau)$, which reports $P_f$ at each threshold: the closer the curve is to the lower-left corner and the closer its area is to 0, the lower the false alarm rate.
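Both curve types can be computed from a detection map and a ground-truth mask by sweeping the threshold. The following is a minimal sketch (our own hypothetical helper, using a trapezoidal AUC), not the evaluation code of the cited reference:

```python
import numpy as np

def _trapezoid(y, x):
    # trapezoidal rule over (x, y) samples
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) * 0.5))

def roc_aucs(scores, labels, n_thresh=200):
    """scores: flat anomaly scores; labels: 1 = anomaly, 0 = background.
    Returns (AUC of the (P_d, P_f) curve, AUC of the (P_f, tau) curve)."""
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    taus = np.linspace(0.0, 1.0, n_thresh)
    n_an = (labels == 1).sum()
    n_bg = (labels == 0).sum()
    pd = np.array([((s >= t) & (labels == 1)).sum() / n_an for t in taus])
    pf = np.array([((s >= t) & (labels == 0)).sum() / n_bg for t in taus])
    order = np.argsort(pf)               # integrate P_d over increasing P_f
    auc_pd_pf = _trapezoid(pd[order], pf[order])
    auc_pf_tau = _trapezoid(pf, taus)    # lower is better
    return auc_pd_pf, auc_pf_tau
```

A perfect detector drives the first AUC toward 1 and the second toward 0, matching the criteria described above.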
In addition, we also used box plots [
74] to show the separation between the background and anomalies for each method more intuitively. In the box plots, the red and green boxes represent the anomaly and background score distributions in the detection map, respectively, where the bottom and top of each box mark the 25th and 75th percentiles of the samples. Thus, a greater distance between the red and green boxes means better separation between background and anomalies, and a narrower green box indicates better background suppression.
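This box-plot criterion can be quantified directly. A sketch (hypothetical helper names) computing the quartiles of the anomaly and background scores and the gap between the two boxes:

```python
import numpy as np

def box_stats(scores, labels):
    """Quartiles behind the separation box plots: the anomaly (red) and
    background (green) boxes span the 25th-75th percentiles of their
    scores; a larger positive gap between the anomaly 25th percentile
    and the background 75th percentile means better separation."""
    an = scores[labels == 1]
    bg = scores[labels == 0]
    stats = {
        'anomaly': (np.percentile(an, 25), np.percentile(an, 75)),
        'background': (np.percentile(bg, 25), np.percentile(bg, 75)),
    }
    stats['gap'] = stats['anomaly'][0] - stats['background'][1]
    return stats
```

A narrow background box with a large positive `gap` corresponds to the "narrow green box, widely separated red box" pattern described above.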
3.3. Experimental Setting
As mentioned in
Section 2, our proposed method consists of two parts: a feature extraction network and an OC-SVM layer. In the feature extraction network, both the encoder (E) and decoder (De) of the AAE are composed of four convolutional layers with 1 × 1 kernels, and the number of hidden-layer spatial features is set to 25; meanwhile, the discriminator (Dz) of the AAE contains three convolutional layers with 1 × 1 kernels and one average pooling layer. In the local feature extraction block, we applied a two-dimensional convolutional layer with a 1 × 1 kernel and a three-dimensional convolutional layer with a 3 × 3 × 3 kernel, reducing the original hyperspectral features to 100 dimensions. In the OC-SVM layer, we employed a two-dimensional convolutional layer with a 1 × 1 kernel instead of a single fully connected layer.
Adam is used to optimize the network parameters during training, with a learning rate that decays exponentially over the course of training. In the optimization functions, we set $\nu$ in Equation (18) to 0.1 and the gradient penalty coefficient in WGAN-GP (Equation (4)) to 10. We used PyTorch to implement our approach, and all experiments were performed on a machine with a GeForce GTX 1080Ti graphics card, an Intel(R) Core(TM) i9-7920U CPU, and 64 GB of RAM.
3.4. Experimental Results
We performed anomaly detection on the six hyperspectral image datasets using the two proposed algorithms, SAOCNN and SAOCNN_NS, and the five comparison algorithms.
Figure 6 shows the anomaly detection maps of each algorithm.
Figure 7 shows the ROC curves of $(P_d, P_f)$, and
Table 1 lists the $AUC_{(P_d,P_f)}$ (the AUC of the $(P_d, P_f)$ curve) for each algorithm, in which the largest values are indicated in bold and the second largest are underlined.
Figure 8 shows the ROC curves of $(P_f, \tau)$, and
Table 2 lists the $AUC_{(P_f,\tau)}$ (the AUC of the $(P_f, \tau)$ curve) for each algorithm, in which the minimum values are indicated in bold and the second smallest are underlined. In addition, the box plots in
Figure 9 can be used to analyze the separation between the background and anomalies for each algorithm on the six datasets.
3.4.1. Performance for Small Target
The three hyperspectral image datasets of Coast, Pavia, and Washington DC Mall have small targets and simple backgrounds; therefore, the anomaly detection in these images is the simplest to perform.
Figure 6a–c shows that the targets in the detection maps of our proposed methods are prominent and that the background suppression is better. The $AUC_{(P_d,P_f)}$ values of the proposed methods approach the ideal value of 1 in
Table 1 and are usually the largest or second largest among all of the methods. In addition, the $AUC_{(P_f,\tau)}$ values of SAOCNN are about 0.001, smaller than those of the other methods in
Table 2, which indicates that SAOCNN has a lower false alarm rate. Compared with CRD and LSAD, the detection results of our methods show higher contrast between anomalies and background and better background suppression, according to our visual perception of the detection maps in
Figure 6 and the box plots in
Figure 9, especially on the Pavia dataset. Compared with GRX, although the $(P_d, P_f)$ ROC curves of SAOCNN are very close to those of GRX in
Figure 7, SAOCNN approaches 1 faster, and the $AUC_{(P_d,P_f)}$ values of our proposed methods are slightly higher than those of GRX. Compared with LREN and HADGAN, two recently proposed deep learning methods, the detection maps of our proposed methods contain less background and more anomalies. Moreover, the $(P_d, P_f)$ ROC curves of SAOCNN_NS lie at the top-left corner, indicating that our proposed network is more effective at extracting the deep features of anomalies while suppressing the background than LREN and HADGAN are. Although the $(P_f, \tau)$ ROC curves of HADGAN and SAOCNN_NS on the DC dataset are very close, and the former reaches 1 the fastest, the box plot shows that the green boxes of the proposed method are at a lower level, which indicates better background suppression.
In the HYDICE dataset, the targets are smaller (at the pixel level) and the background is more complex than in the Coast, Pavia, and Washington DC Mall datasets. Compared with the traditional LSAD, CRD, and GRX and the deep-learning-based HADGAN, our proposed methods produce visual results that are closer to the reference images. Moreover, the green boxes of our proposed methods are narrower in the box plot shown in
Figure 9, indicating better background suppression. Compared with LREN, our proposed method's background suppression is slightly inferior in the detection maps. The $AUC_{(P_f,\tau)}$ values of LREN and SAOCNN are very close and lower than those of the others, at 0.0088 and 0.0138, respectively, as shown in
Table 2. However, the $(P_d, P_f)$ ROC curves shown in
Figure 7 demonstrate that the $P_d$ of SAOCNN_NS is the fastest to reach 1, while the $P_d$ of LREN is the last to reach 1 because some pixels are not detected. LREN may achieve a lower false alarm rate due to its better background suppression; however, it misses some targets, which lowers its $AUC_{(P_d,P_f)}$ value. Our proposed methods present a superior balance between detection and false alarm rates.
For small targets, our methods perform well, with better background suppression and proper target prominence in both simple and complex backgrounds compared with the other methods. Comparing the four ROC curves, the curves of the proposed SAOCNN_NS are basically the closest to the top-left corner, and the curves of the proposed SAOCNN quickly reach 1 on most of the datasets, except for HYDICE. The detection maps in
Figure 6 show that the targets detected by SAOCNN are larger than the real ones annotated in the reference maps, mainly because a three-dimensional convolution kernel is added to the local feature extraction network. In the HYDICE dataset especially, most of the anomalies are point targets, while the detected targets appear much larger in the detection map. The role of the three-dimensional convolution kernel in the network is to extract spatial features (such as texture and shape), which point targets lack. However, the background in this dataset is more complicated, with obvious spatial features, and is therefore more easily confused with targets, which is why SAOCNN presents a lower detection probability than the others when the false alarm rate is low. Without three-dimensional convolution, the targets detected by SAOCNN_NS appear to be of proper size in
Figure 6, and SAOCNN_NS presents a better balance between detection probability and false alarm rate in
Figure 7. However, SAOCNN_NS does not perform as well as SAOCNN in background suppression, with the latter presenting a lower false alarm rate in
Figure 8.
3.4.2. Performance for Planar Target
The Salinas dataset has larger targets and a simpler background. The detection results of each algorithm on the dataset are shown in
Figure 6e. Targets in the detection maps of SAOCNN are the most prominent, while the background suppression of SAOCNN_NS and LREN is not as strong as that of SAOCNN. Targets in the detection maps of HADGAN, LSAD, CRD, and GRX are almost invisible, and there are obvious false alarm pixels.
Figure 7 shows that the $(P_d, P_f)$ ROC curve of SAOCNN is at the top-left corner and is the first to reach 1, with the largest AUC value of 0.9990. In the box plot shown in
Figure 9, the red and green boxes of HADGAN, LSAD, CRD, and GRX are very close and at very low positions, whereas the red boxes of our proposed methods are relatively high, indicating that the four comparison methods are inferior at separating anomalies from the background. Compared with LREN, SAOCNN_NS performs at a similar level in all respects. However, with the three-dimensional convolution kernel added, the $AUC_{(P_d,P_f)}$ value of SAOCNN improves by 2.968% over LREN. Meanwhile, its background suppression is also better, with a narrower green box in the box plot.
In the San Diego dataset, the targets are three airplanes that are not only larger but also have certain shape features, and the background is more complicated, with some roof areas that may be easily detected as anomalies compared with other datasets. From the detection results shown in
Figure 6f, the targets detected by SAOCNN are very prominent and completely retain their basic shape structures, with superior background suppression. Although the targets are detected by SAOCNN_NS, its suppression of the roof areas in the background is imperfect. The HADGAN and LREN detection maps only highlight part of the anomalous pixels and preserve some background details, while the targets in the LSAD, CRD, and GRX maps are almost invisible. Among the $(P_d, P_f)$ ROC curves in
Figure 7, that of SAOCNN is at the top-left corner and is the fastest to approach 1, and its detection probability is always higher than that of the other methods when the false alarm rate is about 0–0.01. For the $AUC_{(P_d,P_f)}$ values in
Table 1, SAOCNN has the largest (0.9962), followed by SAOCNN_NS (0.9884), a considerable increase compared with the other methods.
Figure 8 shows that the $(P_f, \tau)$ ROC curve of GRX is slightly lower than that of SAOCNN, and
Table 2 shows that GRX has the lowest $AUC_{(P_f,\tau)}$ value, 0.0297. This may be because the green box of GRX in the box plot of
Figure 9 is smaller, resulting in a lower false alarm rate at a low threshold. However, the red box of the targets in GRX is lower and extremely close to the green box, while our proposed methods show red and green boxes farther apart, which better distinguishes the anomalies from the background.
For planar targets, both versions of our proposed method offer better target prominence and recognize the approximate shape of the anomalies in the detection maps, whereas the anomalous targets of the other comparison methods are usually submerged in the background, which is not conducive to detection. Comparing our two versions, SAOCNN_NS does not perform as well in background suppression as SAOCNN. Especially when the background is complex, SAOCNN is clearly superior, because the three-dimensional convolution kernel extracts the favorable spatial features of targets and suppresses the unfavorable ones.
3.4.3. Detection Accuracy and Efficiency
Table 1 lists the average $AUC_{(P_d,P_f)}$ for all algorithms over the six datasets. The average values of SAOCNN and SAOCNN_NS are 0.9965 and 0.9941, respectively, while those of HADGAN, LREN, LSAD, CRD, and GRX are 0.9913, 0.9705, 0.9268, 0.9143, and 0.9777, respectively, which indicates that the two versions of our method are superior. The $AUC_{(P_d,P_f)}$ values of our proposed methods are close to the ideal value of 1 for all datasets, even though the lowest detection accuracy is 0.9812, for SAOCNN_NS on the Salinas dataset.
Table 2 lists the average $AUC_{(P_f,\tau)}$ for all algorithms. The average value of SAOCNN is the minimum, 0.0100, followed by GRX at 0.0250. The average for SAOCNN_NS is 0.0441, which is larger than those of HADGAN and GRX but smaller than those of LREN, LSAD, and CRD. In summary, SAOCNN has superior anomaly detection accuracy, with a higher detection rate and a lower false alarm rate than the other methods, which demonstrates that anomalies are sufficiently recognized to meet the detection requirements.
Both the HADGAN and SAOCNN introduce the AAE to reconstruct hyperspectral data and highlight anomalies. The difference is that HADGAN applies the traditional RX detector in a subsequent step to obtain the final anomaly detection map, whereas SAOCNN trains the proposed OC-SVM layer together with the feature extraction network to obtain the results. The above experiments show that although both effectively highlight anomalies, the box plots reveal that HADGAN is far less effective at suppressing the background than SAOCNN, with obvious background texture displayed in its detection maps. This supports the idea that joint training in SAOCNN extracts deep features that are more conducive to detecting anomalies, striking a balance between anomaly prominence and background suppression.
Due to the randomness of neural network training, we repeated the training experiments 10 times to verify the credibility of our experimental results. The obtained results, including the average and standard deviation (std), are shown in
Table 3, and they prove that our proposed methods are also stable.
In addition to accuracy, computational efficiency is also an important aspect of detection performance.
Table 4 records the operation time of each algorithm on the six datasets. The shortest time in the table is displayed in bold, and the second shortest is underlined. Our proposed methods have the shortest operation times, mainly due to their fewer network layers and GPU acceleration. Compared with SAOCNN, SAOCNN_NS has no three-dimensional convolution layer, leading to faster calculation. Although LREN also utilizes a deep neural network, it has the longest operation time: its networks are only used for dimension reduction, feature extraction, and dictionary generation, while its anomaly detector is still a traditional low-rank-representation method that solves for weight coefficients through iterative optimization, which greatly reduces operational efficiency.
3.5. Ablation Study
To verify the effectiveness of the major components in SAOCNN, we designed a set of ablation experiments based on the benchmark OC-NN [
68]. In this experiment, the following combinations are used for comparison:
SAOCNN: our complete proposed method based on OC-NN, in which the rNon-local block with the anomaly weight map and a three-dimensional convolutional layer are added;
SAOCNN_NS: the rNon-local mechanism with the anomaly weight map is added to OC-NN, but without a three-dimensional convolutional layer to extract spatial features. “NS” stands for “no spatial feature”;
OCNN: the original OCNN is used as a benchmark for comparing the detection results of the ablation experiments, with the difference that the loss function is Equation (
18) and error back-propagation is used to optimize the parameters;
OCNN_S: a three-dimensional convolutional layer is added to extract spatial features based on OCNN. “S” stands for “Spatial feature”;
OCNN_NL: the original non-local network is added to extract global features based on OCNN. “NL” stands for “non-local network”;
SAOCNN_NA: the original non-local network and a three-dimensional convolutional layer are added based on OCNN. Compared with the complete algorithm, the residual weight map of AAE is not used in the non-local network. “NA” stands for “No AAE”.
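The six ablation combinations listed above differ only in which components are switched on; one way to encode them is as a table of flags (a sketch for clarity; the field names are ours, not identifiers from the paper's code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AblationConfig:
    """Flags controlling which SAOCNN components are enabled."""
    use_3d_conv: bool      # three-dimensional convolutional layer (spatial features)
    use_non_local: bool    # non-local block (global features)
    use_aae_weight: bool   # AAE residual / anomaly weight map inside the non-local block

VARIANTS = {
    "OCNN":      AblationConfig(False, False, False),  # benchmark
    "OCNN_S":    AblationConfig(True,  False, False),
    "OCNN_NL":   AblationConfig(False, True,  False),
    "SAOCNN_NA": AblationConfig(True,  True,  False),  # no AAE weight map
    "SAOCNN_NS": AblationConfig(False, True,  True),   # no spatial feature
    "SAOCNN":    AblationConfig(True,  True,  True),   # complete method
}
```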
The detection results of each module combination in the ablation experiment are shown in
Figure 10. The ROC curves of each combination are shown in
Figure 11, and the corresponding AUC values are listed in
Table 5. Finally, the box plot is used in
Figure 12 to show the separation between the background and anomalies on the different datasets for each module combination. In addition, the repeated experiments are likewise presented in
Table 6 to verify the reliability of the results.
- (1)
AAE
The AAE is the first module in the whole network and is applied to generate the anomaly weight map that highlights the possible anomalous areas. Comparing the SAOCNN and SAOCNN_NA, we can see that the ROC curves of the SAOCNN are higher than those of the SAOCNN_NA and reach the ideal value of 1 faster (
Figure 11). The average AUC value increased by 0.433%, which indicates that the AAE’s assistance can improve the detection probability of targets at a low false alarm rate.
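A minimal sketch of the anomaly weight map idea (the function name and min–max normalization are our assumptions; the paper's exact formulation may differ): the AAE reconstructs the background well, so pixels with large reconstruction residuals are flagged as likely anomalies.

```python
import numpy as np

def anomaly_weight_map(cube, recon):
    """Per-pixel reconstruction residual, min-max normalized to [0, 1].
    Large residuals mark pixels the (A)AE failed to reconstruct,
    i.e., likely anomalies."""
    resid = np.linalg.norm(cube - recon, axis=-1)   # L2 norm over the spectral axis
    lo, hi = resid.min(), resid.max()
    return (resid - lo) / (hi - lo + 1e-12)

cube = np.random.rand(8, 8, 5)
recon = cube.copy()            # pretend the AAE reconstructs the background perfectly
cube[3, 4] += 2.0              # inject one spectrally deviant pixel
wmap = anomaly_weight_map(cube, recon)
```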
- (2)
rNon-local Network
The main function of the non-local network is to obtain global features by calculating the correlation between the test pixel and all other pixels in the image. In our ablation study, the OCNN_NL adds the ordinary non-local network to the benchmark OCNN. The average
AUC value of the OCNN is 0.029 larger than that of the OCNN_NL in
Table 5, which indicates that the simple non-local network cannot improve the detection performance of the OCNN. Furthermore, the background suppression in the OCNN_NL detection maps is worse than that of the OCNN in
Figure 10. The main reason for this may be that the non-local network enhances both the targets and background by extracting the global features of each pixel in a way that is not directed at anomalies.
We added the residual image reconstructed by the AAE to improve the non-local structure into rNon-local, which corresponds to the SAOCNN_NS in the experiment, enhancing the possible anomalous areas in the feature map produced by the non-local network. Comparing the AUC values
of the SAOCNN_NS and OCNN in
Table 5, the SAOCNN_NS detection results are better than those of the OCNN on each dataset, with the average AUC value increased by 0.222%. From the box plot in
Figure 12, we observed that the distances between the red and the green boxes of the SAOCNN_NS are longer, which suggests that it is necessary to combine the AAE and rNon-local because the use of the anomaly weight map in rNon-local effectively enhances the features of anomalous areas.
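Our reading of the rNon-local step, sketched with plain arrays (the embedding transforms of a full non-local block are omitted for brevity): standard non-local attention is computed over all pixels, and the AAE residual (anomaly weight map) re-scales the attended output so that likely-anomalous positions are enhanced.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def r_non_local(feats, weight_map):
    """Residual-weighted non-local step (illustrative sketch).
    feats:      H x W x C feature map
    weight_map: H x W anomaly weight map from the AAE residual"""
    h, w, c = feats.shape
    X = feats.reshape(-1, c)                  # N x C, N = h * w
    attn = softmax(X @ X.T, axis=-1)          # pairwise correlations, row-normalized
    out = attn @ X                            # aggregate global context for each pixel
    out = out * weight_map.reshape(-1, 1)     # emphasize likely-anomalous positions
    return feats + out.reshape(h, w, c)       # residual connection

f = np.random.rand(4, 4, 3)
wm = np.zeros((4, 4))
wm[1, 2] = 1.0                                # only one position flagged as anomalous
y = r_non_local(f, wm)
```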
- (3)
The 3D convolutional layer
The OCNN_S adds a three-dimensional convolutional layer to the OCNN to extract spatial features. The average
AUC value of the OCNN_S is 0.202% higher than that of the OCNN in
Table 5. The ROC curves of the OCNN_S reach 1 faster than those of the OCNN in four of the datasets shown in
Figure 11, the exceptions being Coast and HYDICE. The green boxes of the OCNN_S are narrower than those of the OCNN in the box plot comparison in
Figure 12, which means that the OCNN_S demonstrates better background suppression than the OCNN.
In the Coast dataset, the target sizes in the detection map of the OCNN_S are much larger than the real ones, causing lower
AUC values because of the three-dimensional convolution kernel (
Section 3.4.1). In the HYDICE dataset, the background contains more spatial features compared with the anomaly. The three-dimensional convolution layer used in the OCNN_S cannot extract enough spatial information from the target’s very small pixels, resulting in a few false alarm targets in the detection map of OCNN_S (
Figure 10d). For the datasets Salinas and San Diego with larger targets and certain spatial features, the performance of OCNN_S is considerably better than that of the OCNN.
Figure 10e,f shows that targets are more prominent in the detection maps of OCNN_S; furthermore, its background suppression is also better.
It can be concluded that a three-dimensional convolutional layer is more advantageous for targets with shape structures and can suppress backgrounds that lack spatial features, thus increasing the contrast between the background and targets.
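The role of the three-dimensional convolutional layer, mixing spatial and spectral neighborhoods in one operation, can be illustrated with a naive "valid" 3D convolution (an illustration only, not the network's actual layer):

```python
import numpy as np

def conv3d_valid(cube, kernel):
    """Naive 'valid' 3D convolution over (height, width, band).
    A single 3D kernel responds to joint spatial-spectral structure,
    which is why it favors targets with shape information."""
    H, W, B = cube.shape
    kh, kw, kb = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1, B - kb + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(cube[i:i+kh, j:j+kw, k:k+kb] * kernel)
    return out

cube = np.random.rand(6, 6, 8)
out = conv3d_valid(cube, np.ones((3, 3, 3)) / 27.0)  # local averaging kernel
```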
- (4)
Non-local network combined with 3D convolutional layer
A non-local network and a three-dimensional convolutional layer are used in the SAOCNN_NA.
Table 5 shows that the optimal method in terms of the average AUC
is the OCNN_S, followed by the SAOCNN_NA and OCNN. The three methods’
AUC values are very close in the Coast, Pavia, and Washington DC Mall datasets; however, the OCNN performs best in the HYDICE dataset, and the SAOCNN_NA is the best in the Salinas and San Diego datasets. The reason may be that the ordinary non-local network amplifies both the advantages and disadvantages of the three-dimensional convolutional layer.
In summary, the non-local network and the three-dimensional convolution layer we added to the OC-NN can effectively improve anomaly detection performance; furthermore, the anomaly weight map of the AAE is added to enhance the possible anomalous regions in the feature map extracted by the improved rNon-local network. As a result, the detection performance of the SAOCNN reached the optimal level, with the average AUC value exceeding that of the benchmark OCNN by 0.464%; the improvement is especially significant in the Salinas and San Diego datasets.
4. Discussion
We carried out a training analysis with four scenarios to verify the effectiveness of the training strategies in our proposed methods, including the adversarial learning and joint training approaches. The AUC
values for each dataset are shown in
Table 7.
Figure 13 displays the ROC curves, and
Figure 14 shows the box plots for each scenario. In addition, the repeated experiments are also presented in
Table 8.
AAE+OCSVM: our proposed method with adversarial learning (i.e., an additional discriminator Dz) and joint training for the whole network. “+OCSVM” stands for the rest of the network, excluding the anomaly weight map generation block;
AAE+OCSVM(se): a two-step training version of our proposed network. The AAE is first trained separately with its own losses, and then the trained AAE is fixed and used to help train the rest of the network. “se” stands for separate;
AE+OCSVM: the AE, with the basic encoder–decoder structure and its reconstruction loss, is used to generate the anomaly weight map;
AE+OCSVM(se): a two-step training version of the third scenario.
The four schemes are trained with the same epoch settings to ensure a fair comparison. In the two-step approaches, the AAE or AE is trained for 500 epochs separately, and the subsequent parts are trained for an additional 500 epochs with the AAE or AE fixed. The co-training approaches train the whole network for 500 epochs.
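The epoch budgets of the four scenarios can be summarized as a small table (the scenario keys are shorthand for the names above, and the dictionary layout is ours):

```python
# Epoch budgets per training scenario, as described in the experiment setup.
# "pretrain": epochs training the (A)AE alone; "joint": epochs training the rest
# (two-step: with the (A)AE fixed; co-training: the whole network together).
SCHEDULES = {
    "AAE+OCSVM":     {"pretrain": 0,   "joint": 500},  # co-trained end to end
    "AAE+OCSVM(se)": {"pretrain": 500, "joint": 500},  # AAE fixed after step 1
    "AE+OCSVM":      {"pretrain": 0,   "joint": 500},
    "AE+OCSVM(se)":  {"pretrain": 500, "joint": 500},
}
```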
- (1)
Adversarial learning method
As discussed in
Section 2.2.1, the adversarial training approach is applied in the anomaly weight map generation block to highlight the anomalous areas in the residual image, with an additional discriminator Dz that helps constrain the hidden feature
of the AE to follow the background distribution, which is assumed to be Gaussian. We compared scenarios (A) and (C) to validate the effectiveness of the adversarial learning method.
Table 7 shows that the average AUC value
is improved by 0.321% when adding the adversarial training. The AAE obtains detection results that are better by up to 1.064%, especially for the San Diego dataset with its airplane targets and complex backgrounds.
Although the AUC
values do not significantly increase, we noticed by comparing the ROC curves in
Figure 13 that the red curves lie to the upper left of the yellow curves in most datasets, and the detection probability is much higher at low false alarm rates. For the Pavia dataset, the AE performed better than the AAE in the ROC curves; however, the box plot comparison in
Figure 14 shows that the red box of the AAE is farther from the green box than that of the AE. This suggests that the AAE highlighted anomalies better, although this is not obvious from the AUC
perspective.
In conclusion, our comparison results indicate that additional adversarial training helps yield a better anomaly weight map, which highlights the possible anomalous areas and improves the final detection performance.
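The adversarial constraint can be sketched with standard AAE-style losses: the discriminator Dz learns to separate samples from the Gaussian prior ("real") from encoder outputs ("fake"), while the encoder is trained to fool it. The toy linear discriminator and all numbers below are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy linear discriminator Dz scoring latent vectors (weights are illustrative).
w = rng.normal(size=4)

def dz(z):
    return sigmoid(z @ w)

# Dz should score N(0, I) prior samples as "real" (-> 1) and encoder
# outputs as "fake" (-> 0); the encoder is trained to fool Dz, which pushes
# the latent distribution toward the Gaussian prior.
z_real = rng.normal(size=(64, 4))            # samples from the prior
z_fake = rng.normal(loc=3.0, size=(64, 4))   # stand-in for encoder outputs

d_loss = -np.mean(np.log(dz(z_real) + 1e-12) + np.log(1 - dz(z_fake) + 1e-12))
g_loss = -np.mean(np.log(dz(z_fake) + 1e-12))  # encoder's adversarial loss
```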
- (2)
Joint training method
To better demonstrate the superiority of jointly optimizing the feature extractor and the subsequent anomaly detector in our proposed method, we present a two-step training version of our proposed network for comparison. As described in
Section 2.4, when training the whole SAOCNN, the AAE is first trained separately for AAE_pre_num epochs, and then the subsequent network, including the OCSVM layers, is updated together with the feature extraction network through the joint loss. Such joint training drives the feature extractor to learn additional deep details specific to the anomaly detection task, which cannot be achieved with the two-step training method.
In our experiment, we used the two-step training versions of both scenarios (A) and (C).
Table 7 shows that the joint training methods demonstrate superior performance compared with the corresponding separate training methods. The average AUC
values are increased by 0.060% and 0.020% for the AAE and AE versions, respectively, when implementing joint training. For our proposed method, the AUC
values increased by 0.342% and 0.343% in the Coast and San Diego datasets, respectively. The ROC curves in
Figure 13 show that the red curves, which represent our proposed method, are located in the upper-left corner for these two datasets. This shows that the joint training approaches strike an excellent balance between a high detection probability and a low false alarm rate. For the other datasets, although the ROC curves of the two types are close, the box plot comparison in
Figure 14 shows that the red boxes of the co-training approaches are farther from the green boxes than those of the two-step training approaches, which means that joint training is better at highlighting anomalies. In the HYDICE dataset, the co-training approaches also suppress the background better, as the green boxes of the two joint training methods are much closer to 0 in the box plot.
The overall results show that the
joint training approach in our proposed methods helps the whole SAOCNN extract more relevant anomaly features than the two-step training approach. In the HADGAN [
52] and SAFL [
53] methods, the feature extractor and anomaly detector are separated, and the latter makes no contribution to constraining the feature extraction network. This two-step training approach is necessary for those methods because the anomaly detector still adopts a traditional method. However, in our method, a one-class network is applied for anomaly detection and is updated together with the whole network through the joint loss. The joint training method is not only more concise but also enables the anomaly detector to better guide the feature extraction network, making the anomalies more prominent while suppressing the background in the detection maps.
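A sketch of the one-class objective that such a joint update minimizes, following the OC-NN form of Chalapathy et al. (the weight-norm regularization terms are left out for brevity, and the variable names are ours): r acts as a learnable margin, and the hinge penalizes background scores that fall below it.

```python
import numpy as np

def ocnn_loss(scores, r, nu=0.1):
    """Simplified OC-NN objective: (1 / nu) * mean(max(0, r - score)) - r.
    Minimizing it jointly with the feature extractor is what the SAOCNN's
    joint training amounts to; nu trades margin violations against r."""
    return (1.0 / nu) * np.mean(np.maximum(0.0, r - scores)) - r

bg_scores = np.array([1.2, 0.9, 1.1, 1.3])    # network outputs for background pixels
loss_good = ocnn_loss(bg_scores, r=1.0)       # scores mostly above the margin
loss_bad = ocnn_loss(bg_scores - 1.0, r=1.0)  # scores fall below the margin
```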