1. Introduction
Hyperspectral images (HSIs) are captured by space-borne or airborne imaging spectrometers. Unlike ordinary three-channel (e.g., red, green, blue) optical images, each pixel of an HSI contains dense and continuous spectral information along the channel dimension. The spectra of different objects contain unique spectral features, just like fingerprints [1], and the subtle spectral discrepancies of different targets (discrepancies along the spectral dimension are considered part of the spectral series information) are an important basis for achieving fine-grained classification. The purpose of HSI classification is to assign a definite category to each pixel, which provides information guidance for land change detection, object detection, precision agriculture and other earth observation missions [2,3,4].
Traditional machine learning methods for HSI classification, such as support vector machine (SVM) [5], dynamic subspace [6] and logistic regression [7], rely on the spectral information of individual pixels. These methods struggle to achieve accurate classification when spectral variability is severe and abundant mixed pixels exist.
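As a concrete illustration of this spectral-only paradigm, the sketch below trains an SVM on per-pixel spectra; the data are random placeholders standing in for a real HSI, and all names are illustrative.

```python
# A minimal sketch of a traditional spectral-only baseline: an SVM classifies
# each pixel from its spectrum alone, in the spirit of [5]. The data are
# synthetic placeholders rather than a real hyperspectral scene.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 103))        # 200 pixels, each a 103-band spectrum
y = rng.integers(0, 9, size=200)  # 9 land-cover classes
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.predict(X[:5]))         # predicted classes for five pixels
```

Because the classifier sees each spectrum in isolation, it cannot exploit spatial context, which is one reason such methods degrade under abundant mixed pixels.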
In recent years, CNN-based image classification algorithms have stood out in the field of HSI classification [1]. For example, Chen et al. [8] discussed the influence of different CNN-based structures on feature extraction performance. Because the receptive field of a 1D kernel is a strip along the spectral axis, 1D CNNs are often regarded as spectral feature extractors. In [9,10], a 1D convolution kernel with a finite number of layers was used to extract spectral features directly. Hu et al. [11] employed a stacked 1D convolution architecture to extract spectral features at multiple layers, and then classified the pixels with a fully connected layer. Backbone networks based on 2D and 3D kernels and their hybrid variants are regarded as spatial–spectral feature extractors. Lee and Kwon [12] combined multi-scale spatial–spectral features extracted by 2D and 3D CNNs. To prevent the gradient vanishing caused by deeply stacked CNNs, the residual connection of ResNet [13] was introduced into HSI classification. Paoletti et al. [14] fused CapsNet and ResNet to achieve fast HSI classification. Zhong et al. [15] used a series of 3D kernels to extract spatial–spectral features jointly, with residual connections enhancing the interaction between deep and shallow features. Although CNN-based methods have achieved remarkable classification performance, the network lacks flexibility once designed. Due to their fixed kernel sizes and limited number of layers, CNN backbones show limitations in capturing global information, especially along the spectral dimension of HSIs.
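To make the role of residual connections concrete, the following is a minimal sketch of a 3D-convolutional residual block in the spirit of spectral residual networks such as [15]; the channel count and kernel depth are illustrative assumptions, not the configuration of any cited paper.

```python
# A 3D residual block operating along the spectral axis: the skip connection
# lets gradients bypass the stacked convolutions, mitigating vanishing
# gradients in deeper networks [13]. Sizes are illustrative.
import torch
import torch.nn as nn

class SpectralResBlock(nn.Module):
    def __init__(self, channels: int = 24, kernel_depth: int = 7):
        super().__init__()
        pad = kernel_depth // 2
        # 1x1 spatial footprint, kernel_depth taps along the spectral axis
        self.conv1 = nn.Conv3d(channels, channels, (kernel_depth, 1, 1), padding=(pad, 0, 0))
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, (kernel_depth, 1, 1), padding=(pad, 0, 0))
        self.bn2 = nn.BatchNorm3d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + x)  # identity shortcut

# Input layout: (batch, feature maps, bands, height, width)
x = torch.randn(2, 24, 97, 7, 7)
print(SpectralResBlock()(x).shape)  # torch.Size([2, 24, 97, 7, 7])
```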
Recently, the transformer network has shown a powerful ability to extract long-term dependencies of sequence data in the field of natural language processing (NLP) [16]. Different from CNN-based models, the transformer has a global receptive field even in shallow layers because of the self-attention mechanism. Some researchers have applied the transformer to HSI classification because self-attention can efficiently model long-range inter-spectral dependencies. For example, He et al. [17] first used the transformer-based BERT [18] model for HSI classification. Hong et al. [2] proposed a pure Vision Transformer (ViT) [19]-based framework named SpectralFormer, which learns locally detailed spectral representations through a group-wise spectral embedding operation; this method also applies skip connections to enhance the representation ability of tokens from shallow to deep layers. Qing et al. [20] adopted average pooling and maximum pooling operations as a spectral attention block to enhance feature representation without losing spectral information; the resulting feature maps were then fed into a transformer for classification. These pure transformer-based methods effectively model long-range spectral dependencies; however, they often divide the entire HSI patch into a series of tokens, which prevents the transformer from efficiently modeling spatial contextual information.
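The group-wise spectral embedding idea can be pictured as follows; this hedged sketch only illustrates the general mechanism of projecting overlapping neighboring-band groups into tokens, with the group size and embedding width chosen arbitrarily rather than taken from [2].

```python
# Overlapping groups of neighboring bands are linearly projected into tokens,
# so each token carries locally detailed spectral context instead of a single
# band value. Group size and embedding width are illustrative assumptions.
import torch
import torch.nn as nn

class GroupWiseSpectralEmbedding(nn.Module):
    def __init__(self, group: int = 3, dim: int = 64):
        super().__init__()
        self.group = group
        self.proj = nn.Linear(group, dim)

    def forward(self, spectra: torch.Tensor) -> torch.Tensor:
        """spectra: (B, bands) -> tokens: (B, bands - group + 1, dim)."""
        windows = spectra.unfold(1, self.group, 1)  # sliding neighboring-band groups
        return self.proj(windows)

tokens = GroupWiseSpectralEmbedding()(torch.randn(2, 103))
print(tokens.shape)  # torch.Size([2, 101, 64])
```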
To improve the spatial information representation capability of tokens, some approaches combine CNNs (e.g., VGGNet [21], ResNet [13], etc.) with the transformer model. A CNN-based backbone first extracts locally spatial context information from the hyperspectral data. The feature maps output by the CNN are then transformed into sequential features (tokens) and sent to the transformer to further model deep inter-spectral dependencies. We refer to this as a two-stage approach. For example, He et al. [22] combined VGGNet with a transformer and used the pre-trained VGGNet as a teacher network to guide a VGG-like model in learning the spatial features of HSIs; the resulting feature maps were then fed into the transformer. In [23], Sun et al. proposed the Spectral–Spatial Feature Tokenization Transformer (SSFTT). SSFTT uses principal component analysis (PCA) [24] to reduce the dimensionality of the original HSI data and stacked hybrid CNNs to extract spectral–spatial features. A Gaussian-weighted tokenization module then keeps the features consistent with the distribution of the original samples, which helps the transformer learn spectral information. Yang et al. [25] proposed a CNN-based Hyperspectral Image Transformer (HiT) architecture whose Conv-Permutator captures information from different dimensions of the HSI representations. Other joint CNN and transformer networks (i.e., LeViT [26], RvT [27]) were also applied to HSI classification to demonstrate the superiority of HiT.
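Schematically, the two-stage pipeline can be summarized as below; this is a generic sketch under assumed layer sizes, not the architecture of any specific cited method.

```python
# Stage 1: a small CNN extracts locally spatial features; Stage 2: the feature
# maps are flattened into per-pixel tokens and fed to a transformer encoder.
import torch
import torch.nn as nn

class TwoStageHSIClassifier(nn.Module):
    def __init__(self, bands: int = 103, dim: int = 64, classes: int = 9):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(bands, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.cnn(x)                        # (B, dim, H, W)
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, dim): one token per pixel
        tokens = self.encoder(tokens)          # model inter-token dependencies
        return self.head(tokens.mean(dim=1))   # pool tokens, then classify

x = torch.randn(2, 103, 9, 9)  # a toy batch of 9x9 patches with 103 bands
print(TwoStageHSIClassifier()(x).shape)  # torch.Size([2, 9])
```

Note how the spectral axis is mixed away by the very first convolution with band-shared weights; this is exactly the distortion of spectral continuity discussed in the next paragraph.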
The aforementioned joint CNN and transformer architectures allow the model to capture locally spatial context and reduce spatial semantic ambiguity when extracting spatially structured information from sequential features. However, these two-stage feature extraction methods are not effective at learning the spatial–spectral correlations of HSIs. In addition, the CNNs focus excessively on spatial information, which distorts the continuous nature of the original spectral signatures and makes it harder for the subsequent transformer to model discrepancies in spectral properties. The classification accuracy of the two-stage methods is even lower than that of some multidimensional CNNs when the targets to be classified exhibit strong spectral intra-class variability or inter-class similarity.
In summary, existing joint CNN and transformer classification methods distort the sequence relationships of the original spectral information while enhancing spatial representation capability, which further weakens the ability of the self-attention mechanism to distinguish subtle spectral discrepancies. To address these limitations, we propose a Neighborhood Enhancement Hybrid Transformer (NEHT) network for HSI classification. The proposed network is roughly divided into three components: a Channel Adjustment Module (CAM), a Spectral Pooling and Enhancement Module (SPEM) and a Hybrid Attention Module (HAM). First, we use a very simple CAM, consisting of a 2D convolution operation, to extract shallow features of the HSI. Second, to improve the spatial–spectral representation capability of tokens, we propose the SPEM, which mainly contains two blocks: a Spatial Neighborhood Enhancement (SANE) block and a Spectral Neighborhood Enhancement (SENE) block. These two parallel blocks model spatial and spectral relations simultaneously, providing further opportunities for extracting spatial–spectral features and achieving better feature representation learning. We also introduce a feature fusion strategy in SPEM that generates complementary spatial–spectral clues from adjacent bands for each token and enhances the transformer's ability to identify subtle discrepancies between spectra for fine-grained classification. Finally, the HAM adopts the self-attention mechanism of the transformer to capture the global correlations between the enhanced tokens and outputs the classification results.
The main contributions of this paper are listed as follows:
1. In contrast to existing methods that stack CNNs before the transformer and apply shared weights to all bands, an efficient parallel CNN-based structure named SPEM is proposed in the NEHT network to extract reliable spatial–spectral features from neighboring bands. The two blocks of SPEM generate data-dependent weights that enhance the generalization capability of the model.
2. To minimize the distortion of the continuous nature of spectral signatures caused by stacked CNNs, a residual-like feature fusion strategy with a Shift-and-Add Concatenation operation is proposed to enhance the distinguishability of spectra without losing the original fine features.
3. The special hybrid architecture enables the transformer to learn more reliable spatial–spectral information from shallow to deep layers. Experiments verify the superiority of the proposed method, and the impact of key network parameters is studied exhaustively.
The rest of this article is organized as follows. Section 2 reviews related works. Section 3 introduces the proposed NEHT network. The network configuration and experimental results are presented in Section 4. Section 5 draws conclusions.
4. Results and Discussion
In this section, three well-known data sets are first described. The implementation details of the network and the environment configuration are then introduced. Next, extensive experiments with ablation analysis are conducted to demonstrate the performance of our approach both quantitatively and qualitatively. Finally, our method is compared with other state-of-the-art methods to show its superiority.
4.1. Description of Data Sets
4.1.1. Pavia University Data Set
The Pavia data set was captured by the Reflective Optics System Imaging Spectrometer (ROSIS). The Pavia University (PU) data set is a part of the Pavia data sets. It has a size of 610 × 340 pixels with a ground sampling distance of 1.3 m, and its spectral range spans 0.43 to 0.86 μm. After removing the noisy bands, 103 bands are retained in the experiments. It has nine classes of interest annotated with different labels. The total number of labeled pixels is 42,776, and the distribution of each category and its number is shown in Table 1. Figure 4a shows the false-color version of the data set and its corresponding ground-truth labels.
4.1.2. Salinas Data Set
The Salinas (SA) data set was collected by the AVIRIS sensor over the Salinas Valley in California. It has a size of 512 × 217 pixels with a ground sampling distance of 3.7 m. This data set has 204 spectral bands and 16 labeled categories. The false-color composite image and its ground-truth map are shown in Figure 4b. The number of pixels in each class is listed in Table 2.
4.1.3. Indian Pines Data Set
The Indian Pines (IP) data set was also captured by the AVIRIS sensor and covers agricultural areas in northwestern Indiana. The spatial size of this data set is 145 × 145 pixels with a ground sampling distance of 20 m. The false-color composite image and its ground-truth map are shown in Figure 4c. The number of spectral bands is 224, with wavelengths from 0.4 to 2.5 μm. Because of water absorption, 24 bands were removed, leaving 200 bands. There are 16 classes among the 10,249 labeled pixels listed in Table 3.
4.2. Experimental Configuration
We randomly divide the HSI cube into training, validation and testing data sets, represented by $X_{train}$, $X_{val}$ and $X_{test}$, respectively; their corresponding label sets are denoted $Y_{train}$, $Y_{val}$ and $Y_{test}$. $X_{train}$ is used to update the network parameters and contains 5% of the labeled data for the PU and SA data sets and 10% for the IP data set. A total of 1% of the labeled data is used to validate the trained network. The entirety of the data is used for testing and for calculating three evaluation metrics: Overall Accuracy (OA), Average Accuracy (AA) and the Kappa coefficient ($\kappa$). The network is trained for 80 epochs on the PU and SA data sets and 100 epochs on the IP data set. During training, the Adam optimizer with a batch size of 64 is adopted; the initial learning rate is set to 0.005 for the PU and SA data sets and 0.0005 for the IP data set. We use a multi-step learning rate decay strategy: the decay rate gamma is set to 0.1 for all data sets, and the milestones are set to [20,40,80] for the PU and SA data sets and [60,80] for the IP data set. For each data set, the input channel of CAM is determined by the number of spectral bands. The output channel of the standard 2D convolution in CAM is 96 for the PU data set and 196 for the SA and IP data sets. The whole process is repeated five times to report the average accuracy. Over all training epochs, the model configuration with the highest validation accuracy is used to evaluate the test set.
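All three metrics can be computed from the confusion matrix; the following minimal sketch, with hypothetical variable names, shows one standard way to obtain OA, AA and $\kappa$ from predicted and true labels.

```python
# OA: fraction of correctly classified pixels; AA: mean per-class recall;
# kappa: agreement corrected for chance, from the confusion-matrix marginals.
import numpy as np

def hsi_metrics(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int):
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)              # build confusion matrix
    total = cm.sum()
    oa = np.trace(cm) / total
    aa = (np.diag(cm) / cm.sum(axis=1)).mean()      # assumes every class occurs
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(hsi_metrics(y_true, y_pred, 3))  # (0.666..., 0.666..., 0.5)
```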
All experiments were run on hardware comprising an 8th-generation Intel® Core™ i7-8700 processor (12 MB cache, 3.20 GHz, 6 cores/12 threads) and an NVIDIA GeForce GTX 1080Ti graphics processing unit (GPU) with 11 GB of RAM. The software environment consists of the Windows 10 Pro 64-bit operating system with CUDA 10.1 and cuDNN 7.1; Python 3.7 is the programming language, and the network was built with PyTorch 1.8. To alleviate data imbalance, we used inverse median frequency weighting to penalize the less frequently occurring classes more heavily.
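A minimal sketch of one common formulation of this weighting is given below: each class weight is the median class frequency divided by that class's own frequency, so rarer classes incur a larger loss. Names here are illustrative, and the exact weighting may differ in detail.

```python
# Rare classes get weights > 1, frequent classes < 1; passing the weights to
# CrossEntropyLoss penalizes mistakes on under-represented classes more.
import numpy as np
import torch
import torch.nn as nn

def inverse_median_frequency_weights(train_labels: np.ndarray, n_classes: int) -> torch.Tensor:
    counts = np.bincount(train_labels, minlength=n_classes).astype(np.float64)
    freq = counts / counts.sum()
    weights = np.median(freq) / np.maximum(freq, 1e-12)  # guard empty classes
    return torch.tensor(weights, dtype=torch.float32)

labels = np.array([0] * 500 + [1] * 50 + [2] * 5)  # toy imbalanced label set
w = inverse_median_frequency_weights(labels, 3)
criterion = nn.CrossEntropyLoss(weight=w)
print(w)  # tensor([ 0.1000,  1.0000, 10.0000])
```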
4.3. Parameter Analysis
To give a detailed and complete analysis of the proposed network, experiments are conducted in this section on some key parameters of the NEHT network. These parameters include the patch size, the number of attention heads and encoder blocks, and the group size of SPEM. Other parameters, such as batch size, learning rate and drop ratio, are fixed.
4.3.1. Evaluation of the Influence of the Patch Size
In the data processing stage, the HSI cube needs to be divided into patches of the same size, with the label of each patch determined by its center pixel. Each patch is flattened into an image sequence along the channel dimension before the attention mechanism. Dosovitskiy et al. [19] indicated that the patch size is inversely proportional to the sequence length of the transformer, and the FLOPS of a transformer are approximately proportional to its depth and quadratic in its width [46]. However, since the patch embedding layer is discarded in the NEHT network, the width of the transformer is directly determined by the output of CAM, and the output length for each data set is fixed. Intuitively, as the patch size increases, the length of each sequence also increases and more parameters need to be learned. Therefore, patch size is positively correlated with model complexity, and an overly large patch size causes the network to over-fit. To search for the optimal patch size, we evaluate a range of candidate patch sizes on the three data sets.
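For reference, a minimal sketch of this patch-extraction step is given below; the padding mode, function name and variable names are illustrative assumptions.

```python
# Pad the cube, then cut a fixed-size patch around every labeled pixel; each
# patch inherits the label of its center pixel.
import numpy as np

def extract_patches(cube: np.ndarray, gt: np.ndarray, patch: int):
    """cube: (H, W, bands); gt: (H, W) ground truth with 0 = unlabeled."""
    r = patch // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    patches, labels = [], []
    for i, j in zip(*np.nonzero(gt)):                # labeled pixels only
        patches.append(padded[i:i + patch, j:j + patch, :])
        labels.append(gt[i, j] - 1)                  # shift labels to start at 0
    return np.stack(patches), np.array(labels)

cube = np.random.rand(20, 20, 103)                   # toy PU-like cube
gt = np.random.randint(0, 10, size=(20, 20))         # toy ground truth
X, y = extract_patches(cube, gt, patch=9)
print(X.shape, y.shape)  # (num_labeled, 9, 9, 103) (num_labeled,)
```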
Figure 5 presents the results obtained for the PU, SA and IP data sets. The results illustrate that, within the smaller range of patch sizes, network performance is positively correlated with patch size; beyond a certain patch size, however, the OA scores tend to flatten or even decline slightly. Compared with the PU and SA data sets, the IP data set is more sensitive to changes in patch size. The SA and IP data sets obtain their highest OA scores at the same patch size, while the maximum OA score for the PU data set appears at a different patch size.
4.3.2. Evaluation of the Influence of the Attention Heads and Model Depth
The multi-head self-attention mechanism enables the transformer to model the dependencies between tokens well. Increasing the number of heads is similar to increasing the number of feature maps in a convolution, while increasing the number of encoder blocks improves the model's ability to extract deep semantic information. For the HSI classification task, the working dimension (i.e., model width) of the NEHT network and other transformer-based architectures is relatively fixed, so the numbers of heads and encoder blocks jointly determine the performance of the model. With limited training samples, an ultra-deep network not only increases computational complexity but also degrades network performance. Some transformer-based HSI classification methods analyze the numbers of encoders and heads separately; we deem that adjusting the two parameters jointly is more likely to yield optimal results.
We conducted experiments with different numbers of heads under different numbers of encoder blocks to dynamically determine the model depth most suitable for HSI data. We set the number of encoder blocks to 1, 2, 3, 4 and 5, and at each depth we set the number of heads to 1, 2, 4, 8 and 16. The experimental results are shown in Figure 6. It can be concluded that the performance of the network gradually improves as the depth increases, but starts to decline when the depth exceeds 4. For all three data sets, the highest OA scores are obtained with a model depth of 4 and 16 heads.
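The joint sweep can be organized as a simple grid search, sketched below; `train_and_validate` is a hypothetical placeholder for training the network at each setting and returning its validation OA, and the width of 96 mirrors the CAM output channel used for the PU data set.

```python
# Grid search jointly over encoder depth and head count. Note that the head
# count must divide the model width (96 is divisible by 1, 2, 4, 8 and 16).
import itertools
import torch.nn as nn

def build_encoder(dim: int, depth: int, heads: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

def train_and_validate(encoder: nn.TransformerEncoder) -> float:
    return 0.0  # placeholder: train, then report OA on the validation set

best = (0.0, None, None)
for depth, heads in itertools.product([1, 2, 3, 4, 5], [1, 2, 4, 8, 16]):
    oa = train_and_validate(build_encoder(dim=96, depth=depth, heads=heads))
    if oa >= best[0]:
        best = (oa, depth, heads)
print(best)  # in our experiments, depth 4 with 16 heads performed best
```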
4.3.3. Evaluation of the Influence of the Group Size
The distribution range of effective spatial and spectral features may differ across categories. As the most important parameter in SPEM, the group size determines the distribution range of the fused feature maps, which improves the network's ability to capture long-term dependencies and the semantic expressiveness of tokens without directly increasing the width and depth of the model. Especially in the spectral dimension, different objects captured by the same sensor have strong responses in different intervals. For targets with high inter-class similarity, we need to pay more attention to the differences in spectral information within a certain wavelength range.
To find an optimal group size, we verify the classification performance of the model under group sizes of 3, 5, 7, 9, 11 and 13. Figure 7 shows the effects of different group sizes on the classification accuracy for the three data sets. According to the results, the highest OA score occurs at a group size of 9 for the PU and IP data sets and 11 for the SA data set. A common conclusion is that as the group size increases, the subtle spatial–spectral discrepancies of neighboring feature maps can be better modeled by SPEM. However, too large a group size increases the model inference time and weakens the representation ability of the neighborhood feature maps.
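To make the role of the group size tangible, the sketch below pairs each feature map with its `group` nearest channel neighbors via a sliding window; this is only an illustration of the general idea, not the actual SPEM fusion, and the replicate padding at the channel edges is an assumption.

```python
# A sliding window of width `group` along the channel axis: each of the C
# feature maps is grouped with its neighbors, so a larger group size widens
# the neighborhood that each fused token can draw on.
import torch

def neighborhood_groups(features: torch.Tensor, group: int) -> torch.Tensor:
    """features: (B, C, H, W) -> (B, C, H, W, group); group is assumed odd."""
    r = group // 2
    padded = torch.cat([features[:, :1].expand(-1, r, -1, -1),  # replicate edges
                        features,
                        features[:, -1:].expand(-1, r, -1, -1)], dim=1)
    return padded.unfold(1, group, 1)  # windows of width `group`, stride 1

f = torch.randn(2, 96, 9, 9)
g = neighborhood_groups(f, group=9)
print(g.shape)  # torch.Size([2, 96, 9, 9, 9])
```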
4.3.4. Ablation Analysis
To fully demonstrate the effectiveness of the proposed methods, we investigated the influence of the different components of the NEHT network on the IP data set. The whole model was divided into three components, two of which need to be tested (i.e., CAM and SPEM); SPEM is further divided into two blocks (i.e., the SANE block and the SENE block). The performance of each component and the joint performance of different component combinations are listed in Table 4. We also compare other stacked CNN-with-transformer architectures to show the superiority of our proposed architecture for HSI classification tasks; the results are listed in Table 5.
In detail, the pure transformer-based method (ViT without a CNN-based patch embedding module) yields the lowest classification accuracy, which means there are still many limitations to using the transformer directly for HSI classification. Adding either CAM or part of SPEM to ViT improves the classification accuracy. The fourth and fifth cases show that, compared with CAM, SPEM improves classification accuracy significantly (by 2.29% and 10.48% OA, respectively). Comparing the second and third cases, without the channel adjustment module (CAM), spatial information is more effective for improving classification accuracy in the shallow layers of the network. Comparing the sixth and seventh cases, CAM+SENE obtains a higher OA score than CAM+SANE (by 0.34%); this may be because the combination of CAM and SENE extracts both spatial and spectral information, while CAM+SANE pays more attention to spatial information. From the second, third, sixth and seventh cases, we can conclude that CAM improves the reliability of the features learned by either part of SPEM.
From Table 5, joining stacked 2D or 3D CNN architectures with the transformer does not bring a significant performance improvement. The hybrid convolution (2D+3D Conv) provides a more representative feature map for the transformer and obtains relatively better classification performance. The architecture we propose brings a further performance improvement (more than 2% in OA, 5% in AA and 2% in $\kappa$). In conclusion, the joint use of CAM and SPEM tends to obtain the highest classification accuracy.
4.4. Comparison with Other Methods
This section compares the performance of the proposed NEHT network with classical traditional methods, CNN-based deep learning methods, a ViT-based method and joint CNN and transformer methods. For the traditional methods, we chose SVM [5], random forest (RF) [47] and multinomial logistic regression (MLR) [48]. For the CNN-based methods, PyResNet [14], ContextualNet [12], ResNet [13] and SSRN [15] were selected. For the transformer-based methods, we took the pure ViT method as the baseline and the recent joint CNN and transformer methods (i.e., SSFTT [23], LeViT [26], HiT [25]) as comparison methods.
From Table 6, Table 7 and Table 8 we can conclude that our method outperforms the others; compared with the traditional methods in particular, the NEHT network is far more competitive. For the PU data set, the proposed NEHT network achieved absolute OA improvements of 10.42%, 1.03% and 0.41% over the best traditional, CNN-based and joint CNN and transformer methods, respectively, and absolute AA improvements of 14.44%, 0.83% and 0.57%. For the SA data set, the corresponding absolute improvements were 6.83%, 0.67% and 0.06% in OA and 14.44%, 0.28% and 0.27% in AA. For the IP data set, they were 16.96%, 1.96% and 1.46% in OA and 21.43%, 1.38% and 2.3% in AA.
Figure 8, Figure 9 and Figure 10 present the classification maps of the different methods. We can observe that the traditional methods, especially those that learn only spectral features, show more misclassification on the three considered data sets. Owing to their strong ability to model locally contextual information, the CNN-based methods obtain relatively smooth classification maps, but they may misclassify targets with small inter-class distances. The pure ViT model without any CNN architecture does not achieve satisfactory classification results, because the self-attention mechanism is not as good as CNNs at fitting spatially structured information under limited training samples. We notice that the joint models obtain higher OA scores than the CNN models. Although the gap between the NEHT network and SSFTT in OA is not large, our method is more robust in handling edge and texture details. This is because SPEM can extract highly semantic token representations from neighboring bands and amplify subtle spectral discrepancies.
To evaluate how the training percentage affects the overall accuracy of the aforementioned methods, different proportions of training samples were selected (i.e., 1%, 2%, 3%, 4% and 5% for the PU and SA data sets and 2%, 4%, 6%, 8% and 10% for the IP data set). For classes whose total number of samples does not meet the extraction ratio, we take only one pixel for the training set. Figure 11 gives the results; it can be concluded that our method is superior to the other methods with limited training data and shows more stable performance with fewer training samples (i.e., on the SA and IP data sets). As the portion of the training set increases, the gap in overall accuracy between the proposed method and the other CNN-based methods narrows. However, with ultra-small training sets on the PU and SA data sets, the classification accuracy of SSFTT is slightly higher than that of our method; this may be because the traditional PCA dimensionality reduction used in SSFTT is more reliable than a purely data-driven deep model in that regime.