1. Introduction
With the continuous improvement of imaging sensors, many more hyperspectral images (HSIs) are becoming available. HSIs offer abundant information for identifying materials, as they record hundreds of spectral bands for each pixel. In particular, materials differ in their emission, reflection, and absorption of electromagnetic waves, which makes it possible to identify and detect different materials at a fine-grained level. This rich spectral information makes HSIs indispensable in various fields, such as ecosystem measurement [1], mineral analysis [2,3], biomedical imaging [4], and precision agriculture [5].
In essence, the HSI classification task aims to assign each pixel to a class label. HSI classification methods can be divided into two categories: traditional methods and deep learning-based methods. In traditional HSI classification methods, researchers attempt to solve the task with machine learning techniques such as K-Nearest Neighbors (KNNs) [6,7], Random Forests (RFs) [8], and Support Vector Machines (SVMs) [9,10]. For instance, Li et al. [11] proposed a spectral band selection method that combines a Markov Random Field (MRF) with spectral selection to determine the optimal bands for subsequent feature learning. However, these traditional methods rely on manual feature extraction, which leaves considerable room for human error and subjectivity. They also often fail to fully exploit the rich and complex spatial-spectral characteristics of different materials, which can lead to inaccurate and unreliable results.
Deep learning-based HSI classification methods have since been developed and show strong feature extraction capability. Chen et al. [12] introduced an autoencoder method for pixel identification, while Hu et al. [13] designed a CNN-based method to capture local spatial features. Ran et al. [14] combined spectral band analysis with a CNN. RNNs have also been utilized because of their sequential modeling capability, but RNN-based methods may not explore spatial information as well as CNNs, leading to poor classification. Mei et al. [15] established a five-layer CNN-based method that integrates spatial and spectral information; however, it processes them separately, resulting in insufficient use of spatial-spectral fusion features.
Three-dimensional CNN architectures can extract fusion information directly from 4D tensors, enabling models that exploit spatial-spectral fusion. Yang et al. [16] used two CNN branches to capture spatial and spectral information separately and then fed the combined outputs to a fully connected layer to jointly extract spatial-spectral fusion features. Other methods, such as the three-stream FCN and novel 3D-CNN architectures, were introduced for HSI classification; they comprise multiple 3D convolutional, pooling, and regularization layers and effectively capture spatial-spectral fusion. However, CNN-based methods may struggle to capture global HSI information.
Recently, transformer networks have been applied to computer vision tasks and have performed well [17,18], owing to their ability to capture long-range dependencies. For instance, Dosovitskiy et al. [19] first used transformers for image classification, introducing the Vision Transformer (ViT). In ViT, input images are divided into fixed-size patches and treated as a sequence of tokens with positional embeddings. These tokens are then fed into a series of transformer blocks to extract parameterized vectors. The transformer's key components are the self-attention mechanism and the Multilayer Perceptron (MLP), which can capture spatial transformations and long-range dependencies. Unfortunately, the ViT model fails to exploit the 2D structure of images, which can degrade performance. To improve performance, local features from CNNs can be used as input tokens to capture local spatial information. For example, Graham et al. [20] used convolution layers to extract local features, which are then fed into transformer blocks. However, these improved transformers still do not fully integrate local features and global representations.
As noted above, ViT divides input images into blocks, adds positional information, and models the relationships between blocks. Inspired by this, He et al. [21] introduced the transformer into hyperspectral image (HSI) classification as the SSF model, utilizing a CNN to capture local spatial features and a transformer module to capture sequential spectral relationships. Mei et al. [22] proposed GAHT, which combines a CNN and a transformer to explore local relationships within spectral channels and constructs a hierarchical transformer. However, these methods still have the following issues:
- 1.
Most transformer-based methods explore global spatial dependencies while ignoring those along the spectral dimension. As a result, existing transformer-based HSI classification methods struggle to capture long-range spectral dependencies, hindering further performance improvements.
- 2.
Most transformer-based methods cannot further refine local features during the training stage. This is mainly because transformers process local spatial features directly through the multi-head self-attention mechanism, which limits the further exploitation of local features.
We present a new method, the Aggregation Multi-Hierarchical Feature Network (AMHFN), to tackle these challenges in hyperspectral image classification. The AMHFN centers on two key modules: a Local-Pixel Embedding module (LPEM) and a Multi-Scale Convolutional Extraction (MSCE) module. The LPEM captures refined local features using a grouped convolution layer and a batch normalization layer, while the MSCE utilizes multi-scale convolutional layers, an Efficient Channel Attention (ECA) layer [23], and an Efficient Spatial Attention (ESA) layer to extract and re-weight local spatial-spectral features. The input HSI cube is projected into features that simultaneously possess global spectral information and refined local spatial information. These features are then fed into a Multi-Scale Global Extraction (MSGE) module to capture and integrate global dependencies across both the spatial and spectral dimensions. With this design, the proposed AMHFN excels at capturing global dependencies and exploring refined local features, significantly enhancing hyperspectral image classification performance. Our contributions can be summarized as follows:
- 1.
We propose a novel hybrid hyperspectral image classification method, called Aggregation Multi-Hierarchical Feature Network (AMHFN), that captures and aggregates local hierarchical features and explores global dependencies of spectral information and prominent local spatial features.
- 2.
We propose a Local-Pixel Embedding module (LPEM) to exploit refined local contextual spatial-spectral features. Specifically, the proposed LPEM consists of a grouped convolution layer to capture hierarchical spatial-spectral features.
- 3.
We further propose two modules to capture and aggregate the multi-scale hierarchical features. A Multi-Scale Convolutional Extraction (MSCE) module captures local spectral-spatial fusion information, while a Multi-Scale Global Extraction (MSGE) module captures and integrates global dependencies.
- 4.
Finally, when evaluated on three public HSI benchmarks, the proposed AMHFN outperforms other HSI classification methods.
The remainder of this paper is structured as follows: Section 2 reviews related work on HSI classification methods, Section 3 elaborates on the proposed AMHFN model, Section 4 presents a thorough experimental validation on three HSI datasets, and Section 5 offers concluding remarks.
3. Proposed Methodology
In this section, we give a brief introduction of the proposed AMHFN (as shown in Algorithm 1). As shown in Figure 1, it is a novel hybrid HSI classification method integrating a CNN and a transformer. It consists of a "Stem" layer to extract shallow features and three stages to capture local and global multi-scale features. Specifically, each stage comprises three key modules: a Local-Pixel Embedding module (LPEM) to retain local spatial features, a Multi-Scale Convolutional Extraction (MSCE) module to capture multi-scale hierarchical local spatial-spectral features, and a Multi-Scale Global Extraction (MSGE) module to explore multi-scale hierarchical global dependencies. With these three key modules, the proposed AMHFN can model spectral information and capture more refined multi-scale hierarchical local features.
Suppose $X \in \mathbb{R}^{P \times P \times C}$ is the input HSI patch, where $P$ denotes the patch size and $C$ is the number of channels, and let $X_{stem}$ be the output of the "Stem" layer. Thus, $X_{stem}$ can be obtained by
$$X_{stem} = f_{Stem}(X),$$
where $f_{Stem}(\cdot)$ denotes the stem layer, which comprises two 2D convolutional layers to extract local features. The stem layer extracts features from the HSI input, reduces the spectral-spatial dimensionality, and performs feature mapping.
$X_1$, $X_2$, and $X_3$ are the outputs of the first, second, and third stages, where $C_i$ denotes the channel number of the $i$-th stage. All outputs from the three stages are concatenated, and the resulting features are re-weighted using a linear operation. Note that the raw inputs are connected through a global residual connection. The output of each stage and the final fused feature can be obtained by
$$X_i = f_{MSGE}(f_{MSCE}(f_{LPEM}(X_{i-1}))), \quad i = 1, 2, 3,$$
$$Y = \mathrm{Linear}(\mathrm{Concat}(X_1, X_2, X_3)),$$
where $X_0 = X_{stem}$, $f_{LPEM}(\cdot)$ denotes the LPEM module, $f_{MSCE}(\cdot)$ denotes the MSCE module, $f_{MSGE}(\cdot)$ denotes the MSGE module, $\mathrm{Linear}(\cdot)$ is the linear operation, and $\mathrm{Concat}(\cdot)$ denotes the concatenation layer. After the "Stage 3" layer, the output features are fed into the "Pooling" layer to predict the label of the raw pixel input.
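For illustration, the following is a minimal PyTorch sketch of this data flow (stem, three stages, concatenation, linear re-weighting, and pooling). The channel sizes are illustrative assumptions, and the LPEM, MSCE, and MSGE modules appear only as placeholders; their details are given in the following subsections.

```python
import torch
import torch.nn as nn

class AMHFNSkeleton(nn.Module):
    """Sketch of the AMHFN data flow: Stem -> 3 stages -> Concat -> Linear -> Pooling.
    LPEM/MSCE/MSGE are placeholders here; their details follow in later subsections."""

    def __init__(self, in_channels=103, dim=64, num_classes=9, num_stages=3):
        super().__init__()
        # "Stem": two 2D convolutional layers for shallow feature extraction.
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
        )
        # Each stage = LPEM -> MSCE -> MSGE (placeholders; channel count kept fixed for simplicity).
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Identity(), nn.Identity(), nn.Identity())
            for _ in range(num_stages)
        ])
        # Linear re-weighting of the concatenated stage outputs back to `dim` channels.
        self.fuse = nn.Conv2d(num_stages * dim, dim, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)   # "Pooling" layer
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, C, P, P) HSI patch
        x0 = self.stem(x)
        feats, h = [], x0
        for stage in self.stages:
            h = stage(h)                       # LPEM -> MSCE -> MSGE
            feats.append(h)
        y = self.fuse(torch.cat(feats, dim=1)) + x0   # global residual connection (simplified)
        y = self.pool(y).flatten(1)
        return self.head(y)

if __name__ == "__main__":
    logits = AMHFNSkeleton()(torch.randn(2, 103, 9, 9))
    print(logits.shape)  # torch.Size([2, 9])
```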
In the following subsections, we introduce the details of the proposed modules.
Algorithm 1 AMHFN Implementation Process.
Require: HSI image data $X$, labels $Y$, spatial size $P$, training sample rate $r$.
Ensure: Classification map and four performance evaluation metrics.
1: Set the batch size $B$ to 64, the optimizer to Adam (learning rate: $1 \times 10^{-3}$), and the number of epochs $E$ to 100.
2: Extract the input patches from $X$ and divide them into a training dataset and a test dataset.
3: for $epoch = 1$ to $E$ do
4: Perform the "Stem" layer for shallow feature extraction.
5: for $stage = 1$ to 3 do
6: Take the stage input; perform LPEM, MSCE, and MSGE; and obtain their outputs, respectively.
7: Aggregate these outputs to form the output of the current stage.
8: Perform the "Pooling" layer and "Linear" layer to predict the result.
9: Use the softmax function to identify the labels.
10: Obtain the classification map by testing the trained model on the test dataset.
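A compact PyTorch training loop corresponding to Algorithm 1 is sketched below; the hyperparameters follow the settings listed above, while the model and data tensors are stand-ins for any AMHFN-like nn.Module and pre-extracted HSI patches.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_amhfn(model, train_x, train_y, test_x, test_y, epochs=100, batch_size=64, lr=1e-3):
    """Training/evaluation loop following Algorithm 1 (batch size 64, Adam, lr 1e-3, 100 epochs)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(train_x, train_y), batch_size=batch_size, shuffle=True)

    for epoch in range(epochs):                      # for epoch = 1 to E
        model.train()
        for patches, labels in loader:
            patches, labels = patches.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(patches)                  # Stem -> Stages -> Pooling -> Linear
            loss = criterion(logits, labels)         # softmax is folded into the loss
            loss.backward()
            optimizer.step()

    # Obtain the classification result on the test set with the trained model.
    model.eval()
    with torch.no_grad():
        preds = model(test_x.to(device)).argmax(dim=1).cpu()
    accuracy = (preds == test_y).float().mean().item()
    return preds, accuracy
```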
Local-Pixel Embedding module: The proposed LPEM is a grouped convolutional operation used to capture deep spatial-spectral features from the HSI. Specifically, a grouped convolution layer applies $n$ kernels to the input; the grouped convolution is followed by batch normalization and a ReLU activation:
$$X_{LPEM} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{GConv}(X_{stem}))),$$
where $X_{LPEM}$ is the output of the LPEM. After extracting the deep spatial-spectral features, we utilize a linear operation to project the extracted features to the desired dimension:
$$X_{E} = \mathrm{Linear}(X_{LPEM}),$$
where $\mathrm{Linear}(\cdot)$ is the linear operation. In this study, we adopt nn.Linear, a module provided by PyTorch that applies a linear transformation to the incoming data.
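A minimal sketch of the LPEM as described above is given below: a grouped convolution, batch normalization, and ReLU, followed by a linear projection applied along the channel dimension. The kernel size, group count, and channel numbers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LPEM(nn.Module):
    """Local-Pixel Embedding: grouped conv -> BN -> ReLU -> linear channel projection (sketch)."""

    def __init__(self, in_channels=64, out_channels=64, groups=4, kernel_size=3):
        super().__init__()
        self.gconv = nn.Conv2d(in_channels, in_channels, kernel_size,
                               padding=kernel_size // 2, groups=groups)  # grouped convolution
        self.bn = nn.BatchNorm2d(in_channels)
        self.act = nn.ReLU(inplace=True)
        self.proj = nn.Linear(in_channels, out_channels)  # nn.Linear projection to the desired dim

    def forward(self, x):                 # x: (B, C, H, W)
        x = self.act(self.bn(self.gconv(x)))
        x = x.permute(0, 2, 3, 1)         # move channels last so nn.Linear acts per pixel
        x = self.proj(x)
        return x.permute(0, 3, 1, 2)      # back to (B, C_out, H, W)

if __name__ == "__main__":
    print(LPEM()(torch.randn(2, 64, 9, 9)).shape)  # torch.Size([2, 64, 9, 9])
```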
3.1. Multi-Scale Convolutional Layer
Figure 1 shows the Multi-Scale (MS) convolutional layers, which are divided into a convolution layer, a multi-scale convolution layer, and an aggregation layer. The convolution layer adjusts the input channels, the multi-scale convolution layer has four branches with varying receptive fields, and the aggregation layer fuses the features to generate the final output. This design captures local spatial-spectral information and aggregates multi-scale, hierarchical features. The MS module can be formulated as
$$B_i = f^{i}_{MS}(X), \quad i = 1, \ldots, 4,$$
where $X$ denotes the input, $f^{i}_{MS}(\cdot)$ denotes the $i$-th multi-scale convolution branch, and $B_i$ denotes its output. The branches use convolution layers with different kernel sizes together with an average pooling layer. We then fuse and re-weight the multi-scale features by applying a convolutional layer to produce the final output $X_{MS}$:
$$X_{MS} = f_{Conv}(\mathrm{Concat}(B_1, B_2, B_3, B_4)).$$
The MS layer not only captures multi-scale local contextual information but also explores global dependence across the spectral dimension, achieving adaptability in both the spatial and spectral dimensions.
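The following sketch illustrates one possible realization of the MS layer; the choice of 1×1, 3×3, and 5×5 convolution branches plus an average-pooling branch is an illustrative assumption, since the exact kernel sizes are not fixed above.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Multi-Scale (MS) convolutional layer (sketch): channel adjustment, four parallel
    branches with different receptive fields, and a fusing convolution."""

    def __init__(self, in_channels=64, dim=64):
        super().__init__()
        self.adjust = nn.Conv2d(in_channels, dim, kernel_size=1)     # adjusts input channels
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.Conv2d(dim, dim, kernel_size=5, padding=2),
            nn.Sequential(nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
                          nn.Conv2d(dim, dim, kernel_size=1)),        # average-pooling branch
        ])
        self.fuse = nn.Conv2d(4 * dim, dim, kernel_size=1)            # fuse / re-weight features

    def forward(self, x):
        x = self.adjust(x)
        outs = [branch(x) for branch in self.branches]                # B_i, i = 1..4
        return self.fuse(torch.cat(outs, dim=1))                      # X_MS

if __name__ == "__main__":
    print(MultiScaleConv()(torch.randn(2, 64, 9, 9)).shape)  # torch.Size([2, 64, 9, 9])
```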
3.2. Multi-Scale Convolutional Extraction Module
The proposed MSCE module, as shown in Figure 1, uses multi-scale convolutional layers to extract local spatial-spectral features, an ECA layer to capture refined spectral information, and an ESA layer to enhance and refine spatial information.
3.2.1. ECA-Based Layer
The ECA layer (as shown in Figure 2) applies global average pooling to the input features, followed by a 1D convolution with kernel size $k$ and a Sigmoid activation, to obtain channel weights. Here, $k$ represents the number of adjacent channels involved in inter-channel information interaction. The input is then re-weighted channel-wise by the Sigmoid output:
$$X_{ECA} = X \odot \sigma\!\left(f^{k}_{1D}(\mathrm{GAP}(X))\right),$$
where $\mathrm{GAP}(\cdot)$ is channel-wise global average pooling, $f^{k}_{1D}(\cdot)$ is the 1D convolution with kernel size $k$, $\sigma(\cdot)$ is the Sigmoid function, and $\odot$ denotes channel-wise multiplication.
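The ECA layer as described (global average pooling, a 1D convolution with kernel size k, a Sigmoid, and channel re-weighting) can be sketched as follows; the default k = 3 reflects the kernel-size ablation reported in Section 4.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention (sketch): GAP -> 1D conv (kernel k) -> Sigmoid -> re-weight channels."""

    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                  # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                             # channel-wise global average pooling -> (B, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)           # interaction among k adjacent channels
        w = self.sigmoid(w)                                # channel weights in (0, 1)
        return x * w.unsqueeze(-1).unsqueeze(-1)           # re-weight the input channels

if __name__ == "__main__":
    print(ECA(k=3)(torch.randn(2, 64, 9, 9)).shape)        # torch.Size([2, 64, 9, 9])
```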
3.2.2. ESA-Based Layer
Recently, some researchers [42] proposed Partial Convolution (PConv) and Efficient Spatial Attention (ESA) in the field of natural images, which can reduce computational redundancy and speed up operations.
Figure 3 illustrates the operational process of PConv. Given the original feature map $I \in \mathbb{R}^{h \times w \times C}$, where $h$, $w$, and $C$ represent the height, width, and number of channels, respectively, PConv applies conventional convolution only to a selected part of the feature map, denoted $I_{c} \in \mathbb{R}^{h \times w \times c}$ with $c < C$ channels, to perform feature extraction. This ensures that both the spatial dimensions and the channel count of the output feature map $I'_{c}$ match those of the input part $I_{c}$. The final feature map, obtained by concatenating $I'_{c}$ with the non-convolved part $I_{C-c}$, maintains the same spatial dimensions and channel number as the original feature map $I$. PConv thus delivers an efficient method for feature extraction that reduces computational redundancy and memory requirements. Finally, PConv is formulated as follows:
$$X_{PConv} = \mathrm{Concat}(\mathrm{Conv}(I_{c}),\; I_{C-c}),$$
where Concat stands for concatenation along the channel dimension.
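A minimal sketch of PConv follows: only the first c channels are convolved, and the remaining channels are concatenated back unchanged. The partial ratio of 1/4 and the 3×3 kernel are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial Convolution (sketch): convolve only the first `c` channels and
    concatenate the untouched remaining channels."""

    def __init__(self, channels=64, partial_ratio=0.25, kernel_size=3):
        super().__init__()
        self.c = int(channels * partial_ratio)              # number of convolved channels (c < C)
        self.conv = nn.Conv2d(self.c, self.c, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):                                    # x: (B, C, h, w)
        x1, x2 = x[:, :self.c], x[:, self.c:]                # split into convolved / untouched parts
        return torch.cat([self.conv(x1), x2], dim=1)         # same spatial size and channel count as input

if __name__ == "__main__":
    print(PConv()(torch.randn(2, 64, 9, 9)).shape)           # torch.Size([2, 64, 9, 9])
```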
Figure 3.
The operation process of Partial Convolution (PConv). “×” represents the operation of convolution.
As shown in Figure 4, the ESA layer is built upon one PConv layer and two PWConv (point-wise convolution) layers. The PConv layer is used to capture local spatial information, and the PWConv layers are utilized to capture local features along the spatial-spectral dimension. Specifically, the ESA balances low latency and feature diversity by placing normalization and activation only between the two PWConv layers, instead of after each convolution. The features extracted by PConv and those left untouched are then combined in the PWConv layers of the ESA by first increasing the feature map's dimensionality along the channel axis and then reducing it back to the initial channel dimension. Therefore, the output of the ESA can be summarized as follows:
$$X_{ESA} = \mathrm{PWConv}_2\!\left(\delta\!\left(\mathrm{Norm}\!\left(\mathrm{PWConv}_1(\mathrm{PConv}(X))\right)\right)\right),$$
where $\mathrm{PWConv}$ and $\mathrm{PConv}$ denote the point-wise convolution and partial convolution operations, respectively, $\mathrm{Norm}$ is the normalization layer, and $\delta$ is the activation function.
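A self-contained sketch of the ESA block under these assumptions is shown below; the channel expansion ratio, partial ratio, and residual connection are illustrative choices, while the placement of normalization and activation between the two PWConv layers follows the description above.

```python
import torch
import torch.nn as nn

class ESA(nn.Module):
    """Efficient Spatial Attention block (sketch): PConv followed by two point-wise convolutions,
    with normalization and activation only between the two PWConv layers."""

    def __init__(self, channels=64, expansion=2, partial_ratio=0.25):
        super().__init__()
        self.c = int(channels * partial_ratio)
        hidden = channels * expansion
        self.pconv = nn.Conv2d(self.c, self.c, 3, padding=1, bias=False)  # PConv: convolve only c of C channels
        self.pwconv1 = nn.Conv2d(channels, hidden, kernel_size=1)         # expand along the channel axis
        self.norm = nn.BatchNorm2d(hidden)
        self.act = nn.ReLU(inplace=True)
        self.pwconv2 = nn.Conv2d(hidden, channels, kernel_size=1)         # reduce back to the input channels

    def forward(self, x):
        y = torch.cat([self.pconv(x[:, :self.c]), x[:, self.c:]], dim=1)  # partial convolution
        y = self.pwconv2(self.act(self.norm(self.pwconv1(y))))            # norm/activation between PWConvs
        return x + y                                                       # assumed residual connection

if __name__ == "__main__":
    print(ESA()(torch.randn(2, 64, 9, 9)).shape)                          # torch.Size([2, 64, 9, 9])
```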
Figure 4.
Structure of the ESA mechanism.
Finally, the ECA (Efficient Channel Attention) layer captures refined spectral information by emphasizing important spectral channels, which helps distinguish subtle differences between spectral signatures, while the ESA (Efficient Spatial Attention) layer enhances and refines spatial information by attending to relevant spatial features, improving the ability to identify spatial patterns and structures within the image. Together, ECA and ESA effectively balance and enhance spectral and spatial information, leading to more accurate and detailed classification results. The proposed MSCE module can therefore be formulated as
$$X_{MSCE} = f_{ESA}\!\left(f_{ECA}\!\left(f_{MS}(X)\right)\right),$$
where $f_{MS}(\cdot)$, $f_{ECA}(\cdot)$, and $f_{ESA}(\cdot)$ denote the multi-scale convolutional layer, the ECA layer, and the ESA layer, respectively.
3.3. Multi-Scale Global Extraction Module
The MSCE module can extract multi-scale local features but fails to explore global dependencies. To capture the global dependencies in HSIs, we design the MSGE module to enhance representation learning. Specifically, the MSGE module integrates multi-head attention and multi-scale convolutional layers to effectively capture and refine complex global relationships, enhancing the model's ability to learn rich and nuanced representations and improving performance on HSI classification tasks. The proposed MSGE module uses multi-scale convolutional layers and a transformer encoder built on the self-attention mechanism (see Figure 5). The transformer encoder incorporates multiple multi-head self-attention layers and a position-wise fully connected feed-forward network. Both the input and output of this module are a sequence of feature maps.
Self-attention (SA) is a mechanism that enables models to focus on relationships between different positions in a sequence. SA computes relationships between a query and a set of key-value pairs, generalizing the scaled dot-product attention common in NLP tasks, and can improve model performance by helping it focus on important relationships in the input sequence:
$$\mathrm{SA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$
where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, and $d_k$ is the dimension of $K$.
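The scaled dot-product self-attention above can be written in a few lines of PyTorch:

```python
import math
import torch

def self_attention(q, k, v):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # pairwise relations between positions
    return torch.softmax(scores, dim=-1) @ v             # weighted sum of the values

if __name__ == "__main__":
    q = k = v = torch.randn(2, 81, 64)                    # (batch, sequence length, d_k)
    print(self_attention(q, k, v).shape)                  # torch.Size([2, 81, 64])
```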
Figure 5.
Structure of the self-attention (SA) and multi-head self-attention (MHSA) mechanisms.
Multi-head self-attention (MHSA) divides the input sequence into multiple sub-sequences and applies self-attention to each; it is a generalization of SA. Specifically, MHSA uses multiple SA mechanisms to explore different relationships in the input sequence. Each SA mechanism is denoted by $head_i$, where $i$ denotes the $i$-th head. The output of MHSA can be formulated as follows:
$$\mathrm{MHSA}(Q, K, V) = \mathrm{Concat}(head_1, \ldots, head_h)\,W,$$
where $h$ is the number of heads, $W$ is the output parameter matrix, $head_i = \mathrm{SA}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$, and $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the learnable projection matrices of the $i$-th head.
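A minimal multi-head self-attention sketch follows; packing the per-head projections W_i^Q, W_i^K, and W_i^V into a single linear layer is an implementation convenience, and the head count is an illustrative assumption.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Multi-head self-attention (sketch): project Q/K/V per head, apply SA to each head,
    concatenate the h outputs, and multiply by an output weight matrix W."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_head = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)          # per-head projections packed into one layer
        self.out = nn.Linear(dim, dim)              # output parameter matrix W

    def forward(self, x):                           # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, heads, N, d_head) so each head attends independently
        q, k, v = (t.reshape(B, N, self.heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = (attn @ v).transpose(1, 2).reshape(B, N, -1)   # Concat(head_1, ..., head_h)
        return self.out(heads)

if __name__ == "__main__":
    print(MHSA()(torch.randn(2, 81, 64)).shape)     # torch.Size([2, 81, 64])
```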
Within each transformer block, the output of the multi-scale convolutional layer is first passed through a linear layer with a non-linear activation function and then divided into $h$ chunks. Each chunk is fed into a separate SA mechanism, the $h$ outputs are concatenated and multiplied by the weight matrix $W$, and the result passes through a final linear layer with a non-linear activation function, which reduces the dimensionality and enhances the model's ability to learn nonlinear relationships. The consecutive transformer blocks can be formulated as follows:
$$\hat{Z}^{l} = \mathrm{MHSA}(\mathrm{LN}(Z^{l-1})) + Z^{l-1},$$
$$Z^{l} = \mathrm{MLP}(\mathrm{LN}(\hat{Z}^{l})) + \hat{Z}^{l},$$
where $\hat{Z}^{l}$ and $Z^{l}$ denote the output features of the MHSA module and the MLP module for block $l$, respectively, and $\mathrm{LN}(\cdot)$ denotes layer normalization.
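A pre-norm transformer block matching this formulation can be sketched as follows; the MLP expansion ratio and the use of PyTorch's built-in nn.MultiheadAttention are illustrative choices.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm transformer encoder block (sketch): MHSA and an MLP, each with
    layer normalization and a residual connection, matching the formulation above."""

    def __init__(self, dim=64, heads=4, mlp_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),                         # non-linear activation in the feed-forward network
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):                      # z: (B, N, dim)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]    # Z_hat^l = MHSA(LN(Z^{l-1})) + Z^{l-1}
        return z + self.mlp(self.norm2(z))                    # Z^l = MLP(LN(Z_hat^l)) + Z_hat^l

if __name__ == "__main__":
    print(TransformerBlock()(torch.randn(2, 81, 64)).shape)  # torch.Size([2, 81, 64])
```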
4. Experiments
We selected three HSI datasets, WHU-Hi-LongKou, Pavia University, and Houston 2013, to evaluate the proposed method. The experiments include parameter analysis, ablation studies, and classification results.
4.1. Datasets
4.1.1. WHU-Hi-LongKou Dataset
The WHU-Hi-LongKou (WHL) dataset was acquired with an 8-mm focal length Headwall Nano-Hyperspec imaging sensor mounted on a DJI Matrice 600 Pro UAV flying at an altitude of 500 m. The resulting imagery is 550 × 400 pixels with 270 bands from 400 to 1000 nm at a 0.463 m spatial resolution. The dataset contains 204,542 labeled samples across 9 land-cover classes. In our experiments, we used 2% of the samples for training and 98% for testing, as shown in Table 1.
4.1.2. Pavia University Dataset
The Pavia University (PU) dataset was acquired in 2001 using the ROSIS sensor. It covers 115 spectral bands from 380 nm to 860 nm; after discarding noisy bands, 103 bands remained for research. The image has 610 × 340 pixels and contains 42,776 labeled samples across 9 land-cover types. Only 5% of the samples were used for training, while the remaining 95% were used for testing. This split ensures rigorous model evaluation and a comprehensive understanding of performance, as shown in Table 2.
4.1.3. Houston 2013 Dataset
The publicly available Houston 2013 (H2) dataset was collected by the National Center for Airborne Laser Mapping (NCALM) at a spatial resolution of 2.5 m. It was gathered during the summer of 2013 over Houston, Texas, USA, and was initially used for the 2013 IEEE GRSS Data Fusion Contest. The image has 349 × 1905 pixels with 144 spectral bands. It was acquired from an airplane flying at 500 m between 12:30 and 16:30 on 18 June 2013 and covers 15 distinct land-cover classes with 15,029 labeled samples. In our experiments, 10% of the samples were used for training and 90% for testing, as shown in Table 3.
4.2. Experimental Setup
4.2.1. Evaluation Indicators
In evaluating the proposed method's classification performance, we used three common indicators: the Kappa coefficient ($\kappa$), overall accuracy (OA), and average accuracy (AA). The Kappa coefficient measures the agreement between two sets of data, with higher values indicating better agreement; OA is the percentage of correct predictions, and AA is the mean of the per-class accuracies. These metrics provide a comprehensive assessment of the method's performance, with higher values signifying better performance. The Kappa coefficient is computed as
$$\kappa = \frac{p_o - p_e}{1 - p_e},$$
where $p_o$ and $p_e$ are the observed and expected accuracies, respectively.
Table 3.
Number of training and testing samples for the Houston 2013 dataset.

| Class No. | Class Name | Training | Testing |
| --- | --- | --- | --- |
| 1 | Healthy Grass | 125 | 1126 |
| 2 | Stressed Grass | 125 | 1129 |
| 3 | Synthetic Grass | 70 | 627 |
| 4 | Trees | 124 | 1120 |
| 5 | Soil | 124 | 1118 |
| 6 | Water | 33 | 292 |
| 7 | Residential | 127 | 1141 |
| 8 | Commercial | 124 | 1120 |
| 9 | Road | 125 | 1127 |
| 10 | Highway | 123 | 1104 |
| 11 | Railway | 123 | 1112 |
| 12 | Parking Lot 1 | 123 | 1110 |
| 13 | Parking Lot 2 | 47 | 422 |
| 14 | Tennis Court | 43 | 385 |
| 15 | Running Track | 66 | 594 |
| Total | | 1502 | 13,527 |
The average accuracy is computed as
$$\mathrm{AA} = \frac{1}{M}\sum_{i=1}^{M} a_i,$$
where $M$ is the number of classes, $n_i$ is the number of samples of each class, and $a_i$ is the accuracy of each class. The overall accuracy is computed as
$$\mathrm{OA} = \frac{1}{N}\sum_{i=1}^{M} TP_i,$$
where $N$ is the total number of samples. The accuracy of each class is computed as
$$a_i = \frac{TP_i}{TP_i + FP_i},$$
where $TP_i$ and $FP_i$ are the true positives and false positives of each class, respectively.
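The three indicators can be computed from a confusion matrix as sketched below; here the per-class accuracy is taken as the recall of each class, which is one common convention and may differ in detail from the exact definition used above.

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Overall accuracy (OA), average accuracy (AA), and Kappa coefficient from a confusion matrix
    (sketch; per-class accuracy is computed here as the recall of each class)."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                                    # fraction of correctly predicted samples
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)  # accuracy of each class
    aa = per_class.mean()
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # expected agreement by chance
    kappa = (oa - p_e) / (1 - p_e)
    return oa, aa, kappa

if __name__ == "__main__":
    y_true = np.array([0, 0, 1, 1, 2, 2])
    y_pred = np.array([0, 1, 1, 1, 2, 0])
    print(classification_metrics(y_true, y_pred, 3))
```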
4.2.2. Implementation Details
The experiments were run on an Intel(R) Xeon(R) Gold 6230R CPU and an NVIDIA RTX A5000 GPU using the PyTorch deep learning framework. The Adam optimizer was used with an initial learning rate of $1 \times 10^{-3}$, a mini-batch size of 64, and 100 epochs. These parameters remained consistent across all experiments.
4.2.3. Comparison with State-of-the-Art Backbone Methods
A range of state-of-the-art classification networks based on CNN and transformer architectures were employed to validate the proposed method: 2D-CNN [43], 3D-CNN [44], HybridSN [9], ViT [19], PiT [45], HiT [36], and GAHT [22]. The 2D-CNN and 3D-CNN methods incorporate 2D or 3D convolutional layers, BN layers, activation functions, and linear layers. HybridSN combines 3D and 2D convolutional blocks, linear layers, and pooling layers. The ViT method uses a linear-projection component and transformer encoders. PiT includes four transformer encoder blocks, three pooling layers, and a linear-projection component. The HiT method combines a spectral-adaptive 3D convolution projection (SACP) module and a Convolutional Permutator (Conv-Permutator) module. The GAHT method uses a Grouped Pixel Embedding module to confine the multi-head self-attention (MHSA) mechanism within a local spectral context, overcoming the issue of excessive dispersion in MHSA. Finally, the proposed AMHFN incorporates the LPEM, MSCE, and MSGE modules to extract prominent and subtle feature information from the spectral space and to learn long-range correlations between pixels and bands.
4.3. Ablation Studies
4.3.1. Ablation Study of the Input Patch Size
The proposed method is based on a spatial-spectral approach, where the patch size directly determines the extent to which the central pixel can utilize spatial-spectral information from neighboring pixels. Hence, patch size plays a crucial role in determining AMHFN's performance. The optimal patch size for each dataset is identified using the AA: WHL and PU favor a smaller patch size, whereas H2 requires a larger one (Table 4). This is perhaps because WHL and PU have denser pixel distributions, so smaller patches can fully utilize the spatial-spectral information in the HSI, whereas H2 has a very sparse pixel distribution and requires larger patches to acquire sufficient information.
4.3.2. Ablation Study of the Kernel Size in the ECA Block
The proposed method utilizes an ECA block, a variant of the SE block, which contains two parts: a global average pooling layer and a 1D convolution layer. Thus, the kernel size in the ECA block affects the proposed method's performance. We evaluated the impact of different kernel sizes on AA, setting the kernel size to 1, 3, 9, and 15; note that a kernel size of 1 indicates no inter-channel interaction and amounts to a direct channel re-weighting.
Figure 6 illustrates the effect of different kernel sizes on AA across the datasets. We observe that a kernel size of 3 yields the best AA on all HSI datasets, and AA decreases to varying degrees as the kernel size grows, possibly because expanding the channel interaction range introduces additional noise. Based on this analysis, we set the kernel size to 3 in all experiments.
4.3.3. Ablation Study of the Proposed Multi-Feature Hierarchical Module
The proposed AMHFN is built on the MSCE module, which captures prominent multi-scale local features and aggregates subtle local contextual features. In this ablation study, we employed three modules that use PConv to divide the channels into three parts, correspondingly separating the feature information into prominent, moderate, and subtle parts, and we compared their performance with that of the two-component models on the three datasets using three performance metrics.
Table 5 indicates that the three-module variants are all inferior to the proposed AMHFN, and the ECA-only variant performs worse than the baseline method. This may be attributed to an excessively fine-grained feature representation, which impedes the model from fully capturing effective features, akin to overfitting of the loss function.
4.3.4. Ablation Study of the Numbers of the Training Samples
The robustness and stability of the proposed AMHFN were evaluated through a comprehensive set of experiments with varying numbers of training samples. Different training sample percentages were used for each HSI dataset: 1–4% on the WHU-Hi-LongKou dataset, 5–20% on the Houston 2013 dataset, and 1–7% on the Pavia University dataset.
The experimental results depicted in Figure 7 offer clear insights. Most notably, all methods improve as the number of training samples increases; this is expected, since deep learning methods with intricate architectures require substantial training data for optimal performance. Of particular interest is the performance of the proposed AMHFN, which delivers superior results compared with well-established techniques at the same training proportion. This observation highlights both AMHFN's efficacy and its robustness under limited training data.
4.4. Classification Results
We comprehensively evaluated the proposed method and comparison methods on three HSI datasets. The experimental results in
Table 6,
Table 7 and
Table 8 show performance metrics for each method, with optimal results in bold.
The proposed AMHFN outperformed the other methods on all three datasets, primarily owing to its specialized modules that capture deep spatial-spectral features and enhance spatial-spectral information. Interestingly, the CNN-based methods generally outperform the transformer-based methods. Also, 3D-CNN outperforms 2D-CNN on the WHL dataset, possibly owing to the advantage of 3D convolution in extracting spectral information from 200 channels; however, 3D-CNN underperforms on the Houston 2013 dataset, likely due to insufficient training samples. The general-purpose transformer-based methods do not perform better than the CNN-based ones; for example, ViT only achieves 88.93%, 91.45%, and 86.15% in terms of $\kappa$, OA, and AA, respectively. This might be because their architectures are specialized for natural images rather than spatial-spectral exploration. The PiT method's use of a pooling layer in its final part might lead to the loss of critical feature information and inferior classification performance. In contrast, HiT and GAHT are transformer models customized for HSI classification and achieve satisfactory results compared with ViT and PiT. For instance, GAHT achieves outstanding OA and AA exceeding 98% on the Houston 2013 dataset, demonstrating the effectiveness of MHSA when confined to a local spatial-spectral context. Finally, our proposed AMHFN exhibits superior classification performance over the other methods, with an AA exceeding that of GAHT by 0.67% on the PU dataset.
Table 8.
Classification results of the Houston 2013 dataset with 10% training samples.

| Class No. | 2D-CNN | 3D-CNN | HybridSN | ViT | PiT | HiT | SSFTT | GAHT | AMHFN (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 98.58 | 95.74 | 98.40 | 96.98 | 96.89 | 98.13 | 97.51 | 98.40 | 98.76 |
| 2 | 99.38 | 98.76 | 98.32 | 98.85 | 96.63 | 97.96 | 99.91 | 98.66 | 99.38 |
| 3 | 100 | 99.36 | 100 | 98.72 | 98.09 | 99.84 | 99.84 | 99.68 | 100 |
| 4 | 99.11 | 98.30 | 99.64 | 98.84 | 96.61 | 98.93 | 98.66 | 98.48 | 97.14 |
| 5 | 99.11 | 97.41 | 99.02 | 96.60 | 90.88 | 97.32 | 99.28 | 98.64 | 98.75 |
| 6 | 89.73 | 76.37 | 85.27 | 88.70 | 83.90 | 89.73 | 91.78 | 97.67 | 99.32 |
| 7 | 97.55 | 92.11 | 95.00 | 96.84 | 89.66 | 96.49 | 96.49 | 97.90 | 98.60 |
| 8 | 93.21 | 84.46 | 90.98 | 92.86 | 82.77 | 94.82 | 95.45 | 97.53 | 96.12 |
| 9 | 93.08 | 87.93 | 90.24 | 89.97 | 81.01 | 94.14 | 96.72 | 97.83 | 98.05 |
| 10 | 99.09 | 92.66 | 94.29 | 93.12 | 68.48 | 95.38 | 99.91 | 99.08 | 99.18 |
| 11 | 96.40 | 86.42 | 90.11 | 90.56 | 80.13 | 95.68 | 97.66 | 98.39 | 97.21 |
| 12 | 99.19 | 90.72 | 92.70 | 95.50 | 81.71 | 97.30 | 98.11 | 99.27 | 98.29 |
| 13 | 93.84 | 78.67 | 99.05 | 65.40 | 44.79 | 85.55 | 98.82 | 96.63 | 99.76 |
| 14 | 100 | 98.96 | 99.22 | 97.66 | 87.53 | 99.74 | 100 | 99.86 | 100 |
| 15 | 100 | 98.82 | 99.49 | 99.49 | 86.53 | 100 | 100 | 99.19 | 100 |
| $\kappa$ (%) | 97.28 | 91.86 | 94.99 | 93.95 | 84.58 | 96.23 | 97.94 | 97.92 | 98.32 |
| OA (%) | 97.49 | 92.47 | 95.36 | 94.40 | 85.73 | 96.51 | 98.09 | 98.07 | 98.45 |
| AA (%) | 97.22 | 91.78 | 95.45 | 93.34 | 84.37 | 96.07 | 98.01 | 98.01 | 98.71 |

2D-CNN, 3D-CNN, and HybridSN are CNN-based methods; ViT, PiT, HiT, SSFTT, and GAHT are transformer-based methods.
Figure 8, Figure 9 and Figure 10 show the classification maps generated by the various methods on the different datasets. The CNN-based methods, specifically 2D-CNN, produce notably smooth maps with reduced salt-and-pepper noise, indicating enhanced classification accuracy for single, large ground features. The transformer-based methods are adept at capturing global dependencies in hyperspectral images (HSIs), yielding results comparable to 2D-CNN. The proposed AMHFN excels at harnessing hierarchical features and enhancing refined spatial-spectral information; it also demonstrates an impressive capability to explore global dependencies in HSIs, producing accurate and detailed classification maps.
Figure 11 shows the t-SNE visualization of the Houston 2013 dataset features learned by six methods. Our method shows impressively low inter-class confusion and precisely distinguishes between classes; for example, there is minimal overlap between classes 1 and 2. In contrast, the GAHT method shows confusion, especially between classes 1 and 4, and the other methods exhibit even higher levels of confusion. Our method also achieves superior clustering performance, maintaining large inter-class distances while minimizing intra-class distances. This visual analysis further supports the proposed methodology.
4.5. Discussion
From extensive experiments, we can find that the strengths of AMHFN are feature differentiation and hierarchical processing. By using LPEM and MSCE, AMHFN effectively differentiates between significant and subtle features, which is essential in dealing with the complex and redundant nature of HSI data. Meanwhile, the hierarchical structure allows for a more organized and detailed analysis of features, enhancing the model’s ability to classify images accurately.
In summary, the AMHFN approach appears to be a sophisticated and well-validated method for hyperspectral image classification. By integrating techniques like LPEM and MSCE within a hierarchical framework, it addresses key challenges in feature differentiation and redundancy. The results from extensive testing support its efficacy and highlight its potential for practical applications in HSI analysis.
Figure 8.
Classification maps obtained using different methods on the WHU-Hi-LongKou dataset (with 2% training samples).
Figure 9.
Classification maps obtained using different methods on the Houston 2013 dataset (with 10% training samples).
Figure 10.
Classification maps obtained by different methods on the Pavia University dataset (with 1% training samples).
Figure 11.
Visualization of t-SNE data analysis on the Houston 2013 dataset.
5. Conclusions
In this paper, we presented a new approach, AMHFN, for the HSI classification task. The proposed AMHFN gradually reduces the channels of the feature maps through the LPEM, which makes it easier for the subsequent multi-scale modules to distinguish between significant features and more nuanced ones. The MSCE is especially adept at handling the abundant redundant information inherent in HSI by differentiating feature information into two distinct categories: prominent and subtle aspects. Moreover, the hierarchical structure of our model significantly aids the MSGE by separating the two kinds of features and directing them to two distinct MSGE modules, ensuring that the more nuanced feature information is not overlooked. The results of our extensive experiments and analyses demonstrate the strong performance of the proposed model across multiple public HSI datasets and, more broadly, for HSI classification.
Future work will focus on improving the transformer architecture, for example through transfer learning and mutual learning between various networks (CNNs and transformers). We will also work towards a standardized and universal transformer-based method for HSI classification.