1. Introduction
Hyperspectral imagery (HSI) is acquired by hyperspectral imagers and is rich in both spatial and spectral information. Compared with ordinary images, hyperspectral remote sensing images have many more bands and extremely high spectral resolution. Hyperspectral remote sensing is widely applied in Earth observation, in fields such as precision agriculture [1], land cover analysis [2], marine hydrology detection [3] and geological exploration [4].
Hyperspectral imagery (HSI) classification is an important task in hyperspectral image processing and application. In early research, many traditional machine learning methods were applied to hyperspectral image classification, such as the K-nearest neighbor method [5], support vector machines [6], random forests [7], naive Bayes [8] and decision trees [9]. Although these traditional methods achieved good performance, they all learn to classify from shallow features and rely on manually designed classification features, making it difficult to learn the more complex information in hyperspectral images [10].
Hyperspectral image classification algorithms based on deep learning can automatically extract high-level features of the image, so that the classification model can better express the characteristics of the remote sensing image and improve classification accuracy. Chen et al. [11] applied deep learning theory to hyperspectral image classification for the first time, using stacked autoencoders to extract spatial–spectral features from hyperspectral images, and achieved good results. Yu et al. [12] applied convolutional neural networks (CNNs) to hyperspectral image classification, but used only spectral information, without taking into account the relationships between adjacent pixels. Chen et al. [13] proposed a three-dimensional convolutional neural network (3D-CNN) feature extraction model that directly extracts spectral–spatial features from hyperspectral images end-to-end and achieves better classification results, with higher inter-class distinguishability than two-dimensional convolutional neural networks (2D-CNNs). Roy et al. [14] proposed the HybridSN framework, a spectral–spatial 3D-CNN followed by a spatial 2D-CNN that further learns a more abstract spatial representation. Zhong et al. [15] proposed the SSRN network, in which a spectral residual block and a spatial residual block sequentially learn discriminative features from the rich spectral features and spatial context in hyperspectral images. Selecting informative spectral–spatial kernel features is challenging due to noise and band correlations, and is usually addressed with a convolutional neural network with a fixed-size receptive field (RF). Roy et al. [16] proposed an attention-based adaptive spectral–spatial kernel modified residual network (A2S2K-ResNet) with spectral attention to capture discriminative spectral–spatial features in an end-to-end training manner, using improved 3D ResBlocks to jointly extract spectral–spatial features for HSI classification. Alipour-Fard et al. [17] proposed the multi-branch selective kernel network (MSKNet), which convolves the input image with different receptive field sizes to generate multiple branches and adjusts each branch according to the input through a channel attention mechanism. By automatically adjusting the neuron receptive field size and enhancing cross-channel relationships between features, MSKNet alleviates the limitation that a fixed-size receptive field in a convolutional neural network constrains the learned weights of the model. Although CNN-based methods have advantages in spatial feature extraction, they have difficulty handling sequential data and are not good at modeling long-range dependencies.
Recently, the application of transformers to vision tasks has become a hot topic. The spectrum of an HSI pixel is a kind of sequence data, usually containing hundreds of spectral bands. Attention-based transformer models have demonstrated their advantages in handling sequential data, and the transformer framework can represent high-level semantic features well. Although CNNs have good local perception ability, the inherent limitations of their backbones mean they cannot mine and represent the sequential attributes of spectral features well, whereas the attention-based transformer model can be trained in parallel and captures global information. CNN methods also have limited ability to acquire deep semantic features: as depth increases, a traditional CNN increases the channel dimension and reduces the spatial dimension, and the computational cost grows significantly. The transformer does not have this problem, since the channel and spatial dimensions do not change across layers; nevertheless, this strategy of reducing the spatial dimension while increasing the channel dimension can also benefit the performance of the transformer structure.
He et al. [18] used a CNN to extract spatial features and a transformer to capture spectral sequence relationships. Hong et al. [19] proposed the SpectralFormer architecture to learn spectral local sequence information from adjacent bands of HSI images and generate group-wise spectral embeddings. Ji et al. [20] applied bidirectional encoder representations from transformers (BERT), which has a global receptive field and can directly capture the global dependencies between pixels regardless of their spatial distance. Han et al. [21] proposed the Transformer iN Transformer (TNT) block, which uses an outer transformer block to model the relationships between patches and an inner transformer block to model the relationships between pixels. The model retains patch-level information extraction while also achieving pixel-level information extraction, which significantly improves its ability to model local structures. Touvron et al. [22] used distillation to enable a transformer-based model to learn some of the inductive biases of a CNN model, thereby improving its image-processing capability. Although the global interaction between token embeddings can be well modeled by the transformer's self-attention mechanism, a locality mechanism for information exchange within local regions is lacking. Li et al. [23] introduced locality into the transformer by adding depthwise convolutions to the feedforward network.
In order to capture long-range spectral relationships in HSI sequences, obtain deep semantic features and make full use of the local and global information in the data, this paper proposes a new classification framework: an attention-based adaptive spectral–spatial kernel combined with an improved ViT. The contributions of this article are summarized as follows:
1. This study proposes a novel HSI classification architecture, an attention-based adaptive spectral–spatial kernel combined with an improved ViT. It systematically combines bands from shallow to deep, enables neurons to adaptively adjust their receptive field size and successfully handles the long-range dependencies of the spectrum, making full use of the spectral–spatial information and local–global information in HSI to improve classification performance.
2. This study proposes an improved ViT model that introduces a re-attention mechanism and a locality mechanism. The re-attention mechanism increases the diversity of attention maps at different levels. The locality mechanism is introduced into the ViT, so that the global relation modeling of the improved attention mechanism and the local information aggregation of the locality mechanism are combined, making full use of the local and global information in the data and improving classification accuracy.
3. To train the model better, the Focal Loss function is used to increase the loss weight of small-class and hard-to-classify samples in the HSI data and reduce the loss weight of easy-to-classify samples, so that the network learns more useful hyperspectral image information. In addition, the Apollo optimizer is used to train the HSI classification model, better updating and computing the network parameters that affect model training and output, thereby minimizing the loss function and improving the classification model's performance.
4. The effectiveness of the method is verified on four challenging public HSI datasets: urban ground object classification on the Pavia University dataset, mineral classification on the Xuzhou dataset, and crop classification on the Indian Pines and WHU-Hi-LongKou datasets. The method's effectiveness is thus demonstrated on public datasets from different application domains, and compared with other representative methods, the proposed method achieves higher classification accuracy.
The remainder of this paper is organized as follows. Section 2 describes the details of the proposed classification method. Section 3 describes the experimental datasets, experimental results and related analyses. Section 4 gives conclusions and suggestions for future work.
2. Related Works
The framework proposed in this paper for HSI classification is shown in Figure 1. First, the principal component analysis (PCA) method is used to remove redundant spectral bands and reduce the time and space complexity of image processing. To effectively adjust the receptive field size of neurons and the cross-channel dependencies, we propose an attention-based adaptive spectral–spatial residual method. Since a CNN is good at capturing local information but has difficulty processing the continuous spectral data of HSI, the extracted features are sent to a modified Vision Transformer model. The original ViT model is improved with a re-attention mechanism to increase the diversity of the attention maps at different levels. Then, locality is added to the ViT by introducing a depthwise convolution in the feedforward network, and the transformed features are fed into the transformer encoder modules for feature representation and learning. The rest of this section is divided into four parts: the attention-based adaptive spectral–spatial residual module, the improved ViT, the Apollo optimizer and the Focal Loss function.
2.1. Spectral–Spatial Feature Extraction
Let the hyperspectral data cube be $X \in \mathbb{R}^{W \times H \times B}$, where $X$ is the original input, $W$ is the width, $H$ is the height, and $B$ is the number of spectral bands. Every HSI pixel in $X$ contains $B$ spectral measures and forms a one-hot vector $Y = (y_1, y_2, \ldots, y_K)$, where $K$ represents the number of land cover categories. To remove spectral redundancy, a principal component analysis was first performed on the raw input HSI data to reduce the number of spectral bands from $B$ to $b$ while maintaining the same spatial dimensions. Let $X_{PCA} \in \mathbb{R}^{W \times H \times b}$ be the data cube after PCA processing and $b$ be the number of spectral bands after PCA [24]. Thus, spectral bands are reduced, and spectral information is preserved. Using the combined spectral and spatial information, a region of size $s \times s$ centered on the pixel $(x_i, y_j)$ is extracted from $X_{PCA}$ and defined as a spectral–spatial vector. Taking the HSI data cube $X_{PCA}$ as input, the adaptive spectral–spatial kernel feature map $F_{ASSK}$ is generated as output [16]:

$$F_{ASSK} = f_{ASSK}(X_{PCA}; \Theta_{ASSK})$$

where $\Theta_{ASSK}$ is the set of trainable parameters in ASSK. By automatically adjusting the receptive field size, neurons can jointly learn spectral–spatial features and amplify the multi-scale information of neurons in the next layer.
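As a concrete illustration of this preprocessing step, the following is a minimal Python sketch of the PCA band reduction and patch extraction using NumPy and scikit-learn; the patch size and the number of retained components are illustrative assumptions, not values prescribed by this paper.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(cube: np.ndarray, b: int) -> np.ndarray:
    """Reduce an HSI cube (W, H, B) to (W, H, b) along the spectral axis."""
    W, H, B = cube.shape
    flat = cube.reshape(-1, B)                      # one spectrum per row
    reduced = PCA(n_components=b).fit_transform(flat)
    return reduced.reshape(W, H, b)

def extract_patch(cube: np.ndarray, i: int, j: int, s: int) -> np.ndarray:
    """Extract an s x s spatial patch centered on pixel (i, j)."""
    r = s // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[i:i + s, j:j + s, :]

# Illustrative usage: reduce 200 bands to 30 and take a 7 x 7 patch.
cube = np.random.rand(145, 145, 200).astype(np.float32)
x_pca = pca_reduce(cube, b=30)
patch = extract_patch(x_pca, i=10, j=20, s=7)       # shape (7, 7, 30)
```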
To enable neurons to adaptively adjust the size of the receptive field, we use selective kernel convolution, learning the selection of spectral–spatial kernel attention feature maps between different receptive fields through $F_{ASSK}$, as shown in Figure 2. Selective kernel convolution combines multiple kernels with different kernel sizes. The basic idea is to use gates to control the flow of information from two branches, carrying information at different scales, into the neurons of the next layer. To achieve this, the gates need to integrate information from all branches: multiple branches with different kernel sizes are fused using softmax attention guided by the information in these branches. Different attention over these branches produces different effective receptive field sizes for the neurons in the fusion layer. $\tilde{F}$ and $\hat{F}$ are the transformations of the $l$-th layer, where $X^{l}$ is the input to the $l$-th layer spectral and spatial kernel selection transformation. The output feature maps $\tilde{U}$ and $\hat{U}$ are defined as:

$$\tilde{U} = \tilde{F}(X^{l}) = W_{1} * X^{l} + b_{1}, \qquad \hat{U} = \hat{F}(X^{l}) = W_{2} * X^{l} + b_{2}$$

Among them, $*$ is the three-dimensional convolution operation, $W_{k}$ is the weight of the $k$-th convolution layer, and $b_{k}$ is the bias; two three-dimensional convolution kernels with receptive field sizes (1 × 1 × 7) and (3 × 3 × 7) are used to extract the spectral and spatial feature maps. $\tilde{F}$ extracts spectral features, and $\hat{F}$ extracts spatial features.
By automatically adjusting the size of the receptive field of neurons, the neurons jointly learn the spectral–spatial features and amplify the multi-scale information flow to the neurons in the next layer. Firstly, element-wise summation is used to fuse the results of the two branches:

$$U = \tilde{U} + \hat{U}$$
Secondly, global information is embedded by using global average pooling (GAP) to generate feature response vectors (FRVs) carrying the channel statistics of the data. Specifically, the spatial dimension of $U$ is reduced to a scalar $s_{c}$ along the $c$-th feature map direction by averaging the spatial elements of $U_{c}$ at each channel:

$$s_{c} = F_{GAP}(U_{c}) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} U_{c}(i, j)$$
Furthermore, to obtain neural activations of different channel features and enable adaptive kernel selection, a compact feature $z$ is created to guide precise and adaptive selection. This is achieved by a simple fully connected layer, which reduces the dimensionality to improve efficiency; the feature weight vector is defined as:

$$z = F_{fc}(s) = \mathrm{ReLU}(\mathrm{BN}(W_{fc}\, s))$$

ReLU is the activation function, and BN is the batch normalization process, used to aid model convergence. The compression ratio $r$ is used to control the compressed dimension $d$:

$$d = \max(C / r, L)$$

where $L$ is the minimum value of $d$ ($L = 32$ in our experiments) and $C$ is the number of feature channels.
Guided by the channel descriptor $z$, a discriminative spectral–spatial kernel feature map is automatically selected. Specifically, a softmax function is applied to $z$ across the channel dimension:

$$a_{c} = \frac{e^{A_{c} z}}{e^{A_{c} z} + e^{B_{c} z}}, \qquad b_{c} = \frac{e^{B_{c} z}}{e^{A_{c} z} + e^{B_{c} z}}$$

Among them, $a_{c}$ and $b_{c}$ denote the soft attention values for $\tilde{U}_{c}$ and $\hat{U}_{c}$, respectively, and $A_{c}$ and $B_{c}$ are the $c$-th rows of the learnable matrices $A$ and $B$. The final feature map $V$ is obtained through the attention weights on each kernel:

$$V_{c} = a_{c} \cdot \tilde{U}_{c} + b_{c} \cdot \hat{U}_{c}$$

Among them, $V = [V_{1}, V_{2}, \ldots, V_{C}]$ and $a_{c} + b_{c} = 1$.
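The selective kernel mechanism described above can be sketched in PyTorch as follows. This is a simplified illustration under our own assumptions (24 channels, two fixed branch kernels, reduction ratio r = 2), not the authors' released implementation:

```python
import torch
import torch.nn as nn

class SelectiveKernel3D(nn.Module):
    """Two-branch selective kernel convolution with softmax fusion (sketch)."""
    def __init__(self, channels: int = 24, r: int = 2, L: int = 32):
        super().__init__()
        # Spectral branch (1x1x7) and spatial branch (3x3x7), as in the text;
        # Conv3d kernels are ordered (band, height, width).
        self.spectral = nn.Conv3d(channels, channels, (7, 1, 1), padding=(3, 0, 0))
        self.spatial = nn.Conv3d(channels, channels, (7, 3, 3), padding=(3, 1, 1))
        d = max(channels // r, L)
        self.fc = nn.Sequential(nn.Linear(channels, d), nn.BatchNorm1d(d), nn.ReLU())
        self.attn = nn.Linear(d, channels * 2)      # rows of A and B stacked

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u1, u2 = self.spectral(x), self.spatial(x)  # two receptive fields
        u = u1 + u2                                 # element-wise fusion
        s = u.mean(dim=(2, 3, 4))                   # GAP -> (N, C)
        z = self.fc(s)                              # compact descriptor z
        ab = self.attn(z).view(-1, 2, u.size(1))    # (N, 2, C)
        w = torch.softmax(ab, dim=1)                # a_c + b_c = 1 per channel
        a = w[:, 0].view(-1, u.size(1), 1, 1, 1)
        b = w[:, 1].view(-1, u.size(1), 1, 1, 1)
        return a * u1 + b * u2                      # selected feature map V

# Illustrative usage on a (batch, channel, band, height, width) patch.
v = SelectiveKernel3D()(torch.randn(4, 24, 30, 7, 7))
```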
To extract more robust and discriminative spectral–spatial characteristics, the kernel feature maps are passed through four ResBlocks. Each ResBlock is made up of 24 kernels, separated into spectral and spatial feature learning by their distinct kernel shapes. The first two ResBlocks extract spectral features, whereas the latter two extract spatially focused features. As a result, combining spectral and spatial information increases the model's discriminative capability. A GAP layer is applied after the ResBlocks to transform the 3D feature maps of size 7 × 7 × 24 into feature vectors of size 1 × 1 × 24.
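As a rough sketch of one such residual block, the code below stacks two 3D convolutions with an identity skip connection; the 24-kernel width comes from the text, while the (7, 1, 1) spectral kernel shape (band, height, width order) is our assumption for the spectral-stage blocks:

```python
import torch
import torch.nn as nn

class SpectralResBlock3D(nn.Module):
    """3D residual block with 24 kernels for spectral feature learning (sketch)."""
    def __init__(self, channels: int = 24):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, (7, 1, 1), padding=(3, 0, 0)),
            nn.BatchNorm3d(channels), nn.ReLU(),
            nn.Conv3d(channels, channels, (7, 1, 1), padding=(3, 0, 0)),
            nn.BatchNorm3d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.body(x) + x)   # identity skip connection
```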
Efficient feature recalibration (EFR) recalibrates the residual spectral–spatial channels. The EFR module takes the transformed feature map of the $l$-th layer, $X^{l}$, as input and generates the channel-recalibrated feature map $\tilde{X}^{l}$ as output, that is:

$$\tilde{X}^{l} = F_{EFR}(X^{l}; \Theta_{EFR})$$

where $\Theta_{EFR}$ is the set of trainable parameters in the EFR module.
2.2. Improved Vision Transformer
Transformer networks were developed to model long-term relationships between sequence elements in machine translation. Although the transformer's self-attention mechanism can model the global interaction between token embeddings, it lacks a locality mechanism for information exchange within small regions. Because locality is critical for HSI images, we add a locality mechanism to the ViT by incorporating depthwise convolution. The improved ViT combines the attention mechanism's global relation modeling with the locality mechanism's local information aggregation: locality is added to the ViT by introducing depthwise convolutions in the feedforward network, and the re-attention mechanism, built on the original ViT, is used to increase the diversity of attention maps at different levels.
Compared with standard convolution, depthwise convolution performs the calculation per channel; that is, each input feature map is convolved separately to produce one channel of the output feature map. As a result, depthwise convolution is both parameter- and computation-efficient. The patch is input to the embedding layer, the Linear Projection of Flattened Patches in Figure 1, producing a series of vectors called tokens. A new class token is then prepended to the token series, and positional encoding is added to the patch embeddings to retain position information; tokens that are located closer together tend to be encoded more similarly. The sequence is then input to the transformer encoder, whose block is stacked N times. The output of the transformer is classified by the MLP head, which consists of LayerNorm and two fully connected layers with a GELU activation function, to obtain the final classification result [24].
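A minimal PyTorch sketch of this tokenization step follows; the embedding dimension and token count are illustrative assumptions rather than the settings used in this paper:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Flatten patches, project to tokens, prepend a class token, add positions."""
    def __init__(self, patch_dim: int, embed_dim: int = 64, num_patches: int = 49):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)          # linear projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim) of flattened patches
        tokens = self.proj(patches)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)  # one CLS per sample
        tokens = torch.cat([cls, tokens], dim=1)
        return tokens + self.pos_embed                       # positional encoding

tokens = PatchEmbedding(patch_dim=24)(torch.randn(4, 49, 24))  # (4, 50, 64)
```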
Figure 3 depicts the transformer encoder, which consists of N stacks of the same layer. Each layer consists of the re-attention mechanism and a position-wise fully connected feedforward network. Around each of these two sublayers, we utilize a residual connection [25] and a normalization layer [26]. That is, the output of each sublayer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sublayer.
1. Re-Attention
Re-Attention effectively overcomes the problem of attention collapse and allows deeper ViT training; it collects complementary information from the different attention heads through their interactions to promote the diversity of attention maps. Specifically, we use dynamic aggregation to create a new set of attention maps from the heads' attention maps. A learnable transformation matrix $\Theta \in \mathbb{R}^{H \times H}$ is defined and used to combine the multi-head attention maps into a regenerated map before multiplying by $V$. Re-Attention is computed as follows [27]:

$$\mathrm{Re\text{-}Attention}(Q, K, V) = \mathrm{Norm}\left(\Theta^{\top}\, \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right)\right) V$$

The transformation matrix $\Theta$ is multiplied with the self-attention map along the head dimension. Norm is the normalization function used to reduce hierarchical variance. The softmax function is applied to the rows of the similarity matrix, and $\sqrt{d}$ is used to normalize the result. The three learnable weight matrices produce the query (Q), key (K) and value (V). Relationships between tokens are modeled by projecting the similarity between query–key pairs, resulting in attention scores.
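The following is a minimal PyTorch sketch of the re-attention computation; the head count, dimensions and the choice of BatchNorm as the Norm function are our assumptions, not details fixed by [27]:

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Multi-head self-attention with a learnable cross-head mixing matrix."""
    def __init__(self, dim: int = 64, heads: int = 8):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.theta = nn.Parameter(torch.eye(heads))   # transformation matrix
        self.norm = nn.BatchNorm2d(heads)             # variance reduction (Norm)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, dk)
        q, k, v = (z.view(n, t, self.heads, self.dk).transpose(1, 2)
                   for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        # mix attention maps across heads with the learnable matrix theta
        attn = torch.einsum("hg,ngts->nhts", self.theta, attn)
        attn = self.norm(attn)
        y = (attn @ v).transpose(1, 2).reshape(n, t, d)
        return self.out(y)

y = ReAttention()(torch.randn(4, 50, 64))   # (4, 50, 64)
```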
2. Feed Forward
After the re-attention layer, a feedforward network is attached. The token sequence is first reshaped into a feature map on a 2D lattice. Then, two convolutions and one depthwise convolution are applied to the feature map. Finally, the feature map is reshaped back into a sequence of tokens, which is used by the self-attention in the next transformer layer. The specific description is as follows.
The feedforward network consists of two convolutions of size $1 \times 1$ and transforms features along the embedding dimension. The hidden dimension between the two convolutional layers is expanded to learn richer feature representations. Since the feedforward network is applied to the sequence of tokens $T \in \mathbb{R}^{N \times d}$ position-wise, the token sequence is reshaped into a feature map:

$$Z = \mathrm{Seq2Img}(T), \qquad Z \in \mathbb{R}^{\sqrt{N} \times \sqrt{N} \times d}$$
The sequence is converted into a 2D feature map using Seq2Img. To re-establish token proximity, each token is placed at its pixel position in the feature map, offering a chance to reinstate locality into the network.
There is no information exchange between neighboring pixels when only $1 \times 1$ convolutions are performed on the feature map, and the transformer's attention mechanism only captures the global interdependence between all tokens. In the inverted residual block, there is a depthwise convolution: depthwise convolution gives each channel a k × k (k > 1) convolution kernel, and a new feature is computed by combining the features within each k × k window. Therefore, depthwise convolution is a good approach for bringing locality into the network. The depthwise convolution is introduced into the transformer feedforward network, and the calculation formula is [23]:

$$T' = \mathrm{Img2Seq}\left(W_{2} * \delta\left(W_{d} \circledast \delta\left(W_{1} * Z\right)\right)\right)$$

$\delta$ is the nonlinear activation function; the bias terms are omitted for clarity. In most cases, the dimensional expansion ratio is set to 4. $Z$ is reshaped from the token sequence, $W_{1}$ and $W_{2}$ are the $1 \times 1$ convolution kernels, and $W_{d}$ is the kernel of the depthwise convolution ($\circledast$ denotes depthwise convolution). The Img2Seq function reshapes the image feature map back into a sequence of tokens, which is used in the following self-attention layer.
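A compact PyTorch sketch of this locality-enhanced feedforward network is given below, assuming a square token grid, a GELU activation and an expansion ratio of 4; handling of the class token is omitted for brevity:

```python
import torch
import torch.nn as nn

class LocalFeedForward(nn.Module):
    """Feedforward network with a depthwise convolution between two 1x1 convs."""
    def __init__(self, dim: int = 64, expansion: int = 4, k: int = 3):
        super().__init__()
        hidden = dim * expansion
        self.conv1 = nn.Conv2d(dim, hidden, 1)                 # W1, 1x1 conv
        self.dw = nn.Conv2d(hidden, hidden, k, padding=k // 2,
                            groups=hidden)                     # Wd, depthwise
        self.conv2 = nn.Conv2d(hidden, dim, 1)                 # W2, 1x1 conv
        self.act = nn.GELU()

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        n, t, d = tokens.shape
        h = int(t ** 0.5)                                      # square grid
        z = tokens.transpose(1, 2).reshape(n, d, h, h)         # Seq2Img
        z = self.conv2(self.act(self.dw(self.act(self.conv1(z)))))
        return z.reshape(n, d, t).transpose(1, 2)              # Img2Seq

out = LocalFeedForward()(torch.randn(4, 49, 64))   # (4, 49, 64)
```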
2.3. Apollo Optimizer
The optimizer updates and computes the network parameters that influence model training and output in order to approach or attain the optimal value, minimizing (or maximizing) the loss function. This work employs Apollo, a quasi-Newton stochastic optimization method for non-convex problems that is both simple and computationally efficient. The approach is useful for large-scale optimization problems involving big datasets or high-dimensional parameter spaces, such as deep neural networks, and using the Apollo optimizer improves HSI classification accuracy. The method approximates the Hessian with a diagonal matrix, dynamically incorporating the curvature of the loss function, and the update and storage of the Hessian diagonal are as efficient as adaptive first-order optimization methods, with linear complexity. The Hessian approximation is replaced with its rectified absolute value to handle non-convexity and ensure positive definiteness.
The Apollo optimizer formula is as follows [28]:

$$\theta_{t+1} = \theta_{t} - \eta_{t} B_{t}^{-1} g_{t}$$

where $g_{t} = \nabla f(\theta_{t})$ is the gradient at $\theta_{t}$, and $B_{t}$ is the approximation of the Hessian matrix:

$$B_{t} \approx \nabla^{2} f(\theta_{t})$$

where $\eta_{t}$ is the step size, and $B_{t}$ is updated at each parameter update. An exponential moving average (EMA) with bias correction is applied to the gradients $g_{t}$:

$$m_{t} = \frac{\beta (1 - \beta^{t-1})}{1 - \beta^{t}}\, m_{t-1} + \frac{1 - \beta}{1 - \beta^{t}}\, g_{t}$$
where $\beta$ is the decay rate of the EMA. For each parameter update, the update formula of the Hessian approximation is as follows:

$$B_{t} = B_{t-1} - \frac{d_{t}^{\top} B_{t-1} d_{t} - d_{t}^{\top} y_{t}}{\|d_{t}\|_{4}^{4}}\, \mathrm{Diag}\left(d_{t}^{2}\right)$$

Among them, $d_{t} = \theta_{t} - \theta_{t-1}$, $y_{t} = m_{t} - m_{t-1}$, $d_{t}^{2}$ is the element-wise squared vector of $d_{t}$, $\mathrm{Diag}(d_{t}^{2})$ is a diagonal matrix whose diagonal elements are the entries of $d_{t}^{2}$, and $\|d_{t}\|_{4}$ is the 4-norm of the vector.
To reduce the noise of the stochastic gradients, the stochastic gradient $g_{t}$ is replaced with the bias-corrected moving average $m_{t}$. Combined with the correspondingly corrected $y_{t} = m_{t} - m_{t-1}$, the update term $\Lambda$ of the quasi-Newton step above is modified by replacing $g_{t}$ with $m_{t}$:

$$\Lambda_{t} = B_{t}^{-1} m_{t}$$

where $\Lambda_{t}$ is the update direction after correction.
When calculating the update direction with $B_{t}$ as the preconditioner, its absolute value is used:

$$|B_{t}| = \sqrt{B_{t}^{\top} B_{t}}$$

where $\sqrt{\cdot}$ is the positive definite square root of the matrix. Apollo uses a diagonal matrix to represent $B_{t}$. In order to deal with the non-convexity of the objective function, the absolute value of $B_{t}$ is rectified with the convexity hyperparameter $\sigma$:

$$D_{t} = \mathrm{rectify}(|B_{t}|, \sigma) = \max(|B_{t}|, \sigma)$$

Among them, the rectify function is similar to the rectified linear unit (ReLU), with the threshold set to $\sigma$; the final update direction is then $\Lambda_{t} = D_{t}^{-1} m_{t}$.
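To make the diagonal quasi-Newton update concrete, here is a minimal NumPy sketch of a single Apollo step under our reading of the formulas above; the hyperparameter values are illustrative, and the authors' official implementation should be preferred in practice:

```python
import numpy as np

def apollo_step(theta, grad, state, lr=0.0004, beta=0.9, sigma=0.01, eps=1e-4):
    """One Apollo update: gradient EMA + diagonal Hessian approximation (sketch)."""
    state["t"] += 1
    t = state["t"]
    m_prev = state["m"].copy()
    # Bias-corrected exponential moving average of the gradient.
    state["m"] = (beta * (1 - beta ** (t - 1)) * m_prev
                  + (1 - beta) * grad) / (1 - beta ** t)
    d, B = state["d"], state["B"]
    y = state["m"] - m_prev
    # Diagonal Hessian update; for diagonal B, d^T B d = d . (B * d).
    denom = np.sum(np.abs(d) ** 4) + eps          # ||d||_4^4, guarded
    alpha = (d @ (B * d) - d @ y) / denom
    B -= alpha * d ** 2                           # Diag(d^2) update
    # Rectified absolute value handles non-convexity (positive definiteness).
    D = np.maximum(np.abs(B), sigma)
    step = -state["m"] / D                        # update direction -Lambda
    state["d"] = lr * step                        # displacement d_{t+1}
    return theta + state["d"]

n = 10
state = {"t": 0, "m": np.zeros(n), "d": np.zeros(n), "B": np.full(n, 0.01)}
theta = np.random.randn(n)
theta = apollo_step(theta, grad=2 * theta, state=state)   # e.g. f = ||theta||^2
```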
2.4. Focal Loss
Cross-entropy assigns every sample the same weight in the loss; however, in real HSI classification tasks, the number of samples in different categories varies substantially, as does the classification difficulty of samples within the same category. The classification complexity of distinct samples varies according to the differences between them. If the same weight is used to optimize every instance's prediction, the predictions for hard-to-classify data will be relatively poor. Furthermore, the classification results of certain instances are suboptimal due to the effect of mixed pixels. To improve classification performance, the model must adaptively adjust each instance's proportion of the loss according to its classification difficulty, paying attention to small-class samples and hard-to-classify samples at the same time; more "optimization resources" should be allocated to challenging classification samples [29]. Focal Loss [30], an improved variant of the cross-entropy loss, is used in this research. The cross-entropy loss is defined as:

$$CE = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{n} y_{ic} \log(p_{ic})$$
Assuming there are $n$ label values, $y_{ic}$ is the real label, $p_{ic}$ is the predicted probability of the $c$-th label value for the $i$-th sample, and $N$ represents the number of samples. A common way to address class imbalance is to introduce a weighting factor $\alpha$, which lies between $[0, 1]$:

$$CE_{\alpha} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{n} \alpha\, y_{ic} \log(p_{ic})$$
A more formal approach is to add a modulating factor $(1 - p_{ic})^{\gamma}$ to the cross-entropy loss function, with a tunable focusing parameter $\gamma \geq 0$. Combining the above two formulas, the Focal Loss is obtained:

$$FL = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{n} \alpha\, (1 - p_{ic})^{\gamma}\, y_{ic} \log(p_{ic})$$
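A short PyTorch sketch of this multi-class focal loss follows; the defaults α = 0.25 and γ = 2 are the values suggested in [30] and are assumptions here, not the values tuned in this work:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Multi-class focal loss: down-weights easy samples via (1 - p)^gamma."""
    log_p = F.log_softmax(logits, dim=-1)                      # (N, C)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)   # log p of true class
    pt = log_pt.exp()
    return (-alpha * (1 - pt) ** gamma * log_pt).mean()

loss = focal_loss(torch.randn(8, 16), torch.randint(0, 16, (8,)))
```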
3. Experimental Results
The experiments were carried out on the Windows 10 operating system, and the classification methods were implemented in Python using the PyTorch library. The experimental environment is an Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz, 16 GB of memory, and a GeForce GTX 1650 Ti graphics card. To minimize experimental error and randomness, all experimental data in this paper are the average results of 10 runs. To fit the hardware resources and reduce the amount of computation per batch during network training, the size of the input data is set to 32 × 32. All experimental networks reach a stable convergence state within 100 training epochs. To ensure that all methods achieve their best classification results, this paper sets the maximum number of training epochs to 200 and adopts early stopping to avoid overfitting. We use the Apollo optimizer to learn the model parameters, with the learning rate set to 0.0004. Three indicators, overall accuracy (OA), average accuracy (AA) and the Kappa coefficient (K), are used to quantitatively evaluate the experimental results.
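For reference, the three evaluation indicators can be computed from a confusion matrix as in the sketch below; this is a standard formulation, not code taken from the paper:

```python
import numpy as np

def classification_metrics(conf: np.ndarray):
    """Compute OA, AA and the Kappa coefficient from a confusion matrix."""
    total = conf.sum()
    oa = np.trace(conf) / total                              # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)             # per-class accuracy
    aa = per_class.mean()                                    # average accuracy
    pe = (conf.sum(axis=0) @ conf.sum(axis=1)) / total ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

conf = np.array([[50, 2, 0], [3, 45, 2], [1, 0, 47]])
oa, aa, kappa = classification_metrics(conf)
```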
3.1. Hyperspectral Datasets Description
In this study, we conduct experiments on four different HSI datasets, including Indian Pines datasets, Pavia University datasets, Xuzhou datasets and WHU-Hi-LongKou datasets. The datasets used are described in detail below. The number of samples per class, a false color map and a ground truth map of the datasets are shown in
Table 1,
Table 2,
Table 3 and
Table 4.
1. Data in the Indian Pines dataset were obtained by the AVIRIS sensor over the Indian Pines agricultural proving ground in northwestern Indiana, USA. The original data have a total of 224 bands; 4 zero bands and 20 water absorption bands (104–108, 150–163 and 220) are removed, and the remaining 200 bands, ranging from 0.4 to 2.5 μm, are used for the experimental study. The spatial size is 145 × 145 pixels with 16 different types of land cover.
Table 1. Indian Pines Dataset Labeled Sample Counts.
2. Data in the Pavia University dataset were obtained by the ROSIS-03 sensor over the University of Pavia, Pavia, Italy. The size of the dataset is 610 × 340 pixels, and the spatial resolution is 1.3 m. The original data have 115 bands with spectral coverage ranging from 0.43 to 0.86 μm. Twelve noise bands are removed, and the remaining 103 bands are available for the experiments; the dataset has 9 categories.
Table 2. Pavia University Dataset Labeled Sample Counts.
3. Data in the Xuzhou dataset [31,32] were acquired by the HySpex SWIR-384 and HySpex VNIR-1600 imaging spectrometers over Xuzhou, Jiangsu Province, China, in November 2014; the experimental area is located near a coal mining area. The size of the dataset is 500 × 260 pixels with spectral coverage from 415 to 2508 nm; after the noise bands are removed, 436 bands are used for the experiments, with 9 categories.
Table 3. Xuzhou Dataset Labeled Sample Counts.
4. The WHU-Hi-LongKou dataset [33,34] was acquired with an 8-mm focal length Headwall Nano-Hyperspec imaging sensor mounted on a DJI Matrice 600 Pro (DJI M600 Pro) drone platform over Longkou Town, Hubei Province, China, in July 2018. The study area is a simple agricultural scene containing six crops: corn, cotton, sesame, broad-leaf soybean, narrow-leaf soybean and rice, with a total of nine categories. The image size is 550 × 400 pixels with 270 bands between 0.4 and 1 μm, and the spatial resolution of the UAV-borne hyperspectral imagery is about 0.463 m.
Table 4. WHU-Hi-LongKou Dataset Labeled Sample Counts.
3.2. Comparison of the Proposed Methods with the State-of-the-Art Methods
In this section, to evaluate the classification performance of our proposed method, it is validated in several comparative experiments against the traditional method RBF-SVM [35] and the deep-learning-based methods CNN [36], HybridSN [14], PyResNet [37], SSRN [15], SSFTT [38] and A2S2KResNet [16]. For RBF-SVM, the radial basis function is used as the kernel, and a grid search over an exponentially growing sequence is used to find suitable hyperparameters. In each dataset, the number of training samples is 10% of the total number of samples. The experimental results of the proposed method are shown in Table 5, Table 6, Table 7 and Table 8. It can be seen that the OA, AA and Kappa values achieved by the proposed method are the best, with OA reaching 98.81%, 99.76%, 99.80% and 99.89% on the Indian Pines, Pavia University, Xuzhou and WHU-Hi-LongKou datasets, respectively.
To show the classification results more clearly, we present the classification maps of the eight methods on the four hyperspectral datasets in Figure 4, Figure 5, Figure 6 and Figure 7. Our proposed method clearly yields more accurate classification results than the other methods. Across the four datasets, the classification maps of RBF-SVM and CNN contain more noisy scatter points, and the HybridSN, PyResNet, SSRN, SSFTT and A2S2KResNet methods still show some misclassifications. Compared with the ground truth, the proposed method obtains more accurate classification results, which further proves its effectiveness for the classification of hyperspectral data.
3.3. Ablation Experiments
Among them, we performed ablation experiments on the Indian Pines dataset to verify the effectiveness of the proposed method. The experimental results are shown in
Table 9.
When we only use the A2S2KResNet model to classify the hyperspectral data, its OA on the Indian Pines dataset is only 98.51%. When A2S2KResNet is combined with ViT (A2S2KResNet + ViT), the OA is 98.61%, which shows that the ViT model can slightly improve the classification performance. When the A2S2KResNet + ViT model is combined with either the Focal Loss function or the Apollo optimizer, the OAs are 98.63% and 98.75%, respectively, showing that the Focal Loss function and the Apollo optimizer each slightly help the A2S2KResNet + ViT model. When A2S2KResNet + ViT is combined with both the Focal Loss function and the Apollo optimizer, which is the HSI classification model proposed in this paper, it achieves the highest classification accuracy on the Indian Pines dataset, further proving the effectiveness of our method in improving HSI classification performance.
4. Discussions
This paper designed an HSI classification method with several modifications. This study proposes an improved ViT model that introduces a re-attention mechanism and a locality mechanism. The improved ViT model is then combined with the attention-based adaptive spectral–spatial kernel, which systematically combines bands from shallow to deep, enables neurons to adaptively adjust the receptive field size and successfully handles the long-range dependencies of the spectrum, making full use of the spectral–spatial information and local–global information in HSI to improve classification performance. The Focal Loss function is used to increase the loss weight of small-class and hard-to-classify samples in the HSI data, and Apollo, a quasi-Newton method for non-convex stochastic optimization, is introduced to dynamically incorporate the curvature of the loss function by approximating the Hessian with a diagonal matrix.
It can be seen from Table 5, Table 6, Table 7 and Table 8 that the classical RBF-SVM method and several deep-learning-based methods, including CNN, HybridSN, PyResNet, SSRN, SSFTT and A2S2KResNet, are considered for comparison. All experimental results show that the proposed method achieves the best performance on all datasets, obtaining superior classification accuracy on the four popular HSI datasets. Taking the Indian Pines dataset as an example, the OA, AA and K of the proposed method are improved by 18.8%, 19.48% and 21.56%, respectively, compared with RBF-SVM. Furthermore, compared with CNN, the OA of the proposed method is improved by 20.5%, 15.5%, 9.24% and 8.47% on the Indian Pines, Pavia University, Xuzhou and WHU-Hi-LongKou datasets, respectively. For the Pavia University dataset, the proposed method improves the OA by 8.05%, 5.96%, 0.06%, 0.14% and 0.22% compared with HybridSN, PyResNet, SSRN, SSFTT and A2S2KResNet, respectively.
Furthermore, to verify the effectiveness of the proposed method on the different HSI datasets, Figure 8, Figure 9, Figure 10 and Figure 11 show the per-class classification results of the different methods. Our method achieves the highest classification accuracy for almost every class across the four datasets. For example, for the Oats category in the Indian Pines dataset, our method improves the accuracy by 3.17% over the best of the other methods. The effectiveness of the method is demonstrated on datasets from different application domains: urban land feature classification is realized on the Pavia University dataset, mineral classification is realized on the Xuzhou dataset, and fine crop classification is implemented on the Indian Pines and WHU-Hi-LongKou datasets.