1. Introduction
Hyperspectral remote sensing technology captures hundreds of spectral bands from a target area using sensors or imaging spectrometers, thereby acquiring both spatial and spectral information simultaneously. In recent years, advancements in hyperspectral sensors and spectral imaging technology have significantly enriched the information contained in hyperspectral images (HSIs) [
1]. These HSIs not only provide detailed two-dimensional spatial information of the target but also include one-dimensional spectral information, making them highly applicable in various fields such as biomedical imaging [
2], mineral exploration [
3], food safety [
4], disaster prevention and mitigation [
5], urban development [
6], military reconnaissance [
7], and precision agriculture [
8]. To fully leverage the potential of HSI data, researchers have explored various data processing techniques, such as denoising [
9,
10], spectral unmixing [
11], and target detection and classification [
12,
13,
14]. Among these techniques, land cover classification has garnered significant attention. The primary objective of HSI classification is to use the rich spatial and spectral information in HSI to classify each pixel according to land cover types. Despite the valuable opportunities provided by such rich data, effectively extracting and distinguishing relevant features remains a significant challenge. Consequently, researchers continue to explore various approaches to address the challenges in feature extraction for HSI classification.
Early attempts at HSI classification relied mostly on traditional machine learning techniques centered on the extraction of spectral information. Frequently employed methods include random forest [
15], k-nearest neighbor [
16], support vector machines [
17,
18], and Bayesian estimation methods [
19]. However, HSIs contain a great deal of redundant information across their vast spectral dimension, typically hundreds of bands. To address this, researchers developed dimensionality reduction and feature extraction methods, such as principal component analysis (PCA) [
20,
21], independent component analysis (ICA) [
22], and linear discriminant analysis (LDA) [
23,
24]. These methods map the original spectral features into a new space using linear or nonlinear transforms to achieve dimensionality reduction and feature extraction, substantially reducing the number of spectral features. This reduction lowers the model's computational complexity and enhances the effectiveness of conventional machine learning models in HSI classification tasks. Nevertheless, the limited ability of these approaches to exploit spatial information constrains classification performance, despite their effectiveness in extracting spectral features. Consequently, spectral-spatial feature extraction techniques have attracted considerable interest, and researchers have developed mathematical morphological operators to improve the extraction of spatial features from HSIs. The techniques of morphological profile (MP) [
25], extended morphological profile (EMP) [
26], and extended multiattribute profile (EMAP) [
27] leverage the integration of spatial and spectral information through various methodologies. These approaches facilitate the identification of the size and shape of distinct objects within an image, consequently enhancing the accuracy of classification outcomes. However, while these methods proved effective as an early stage of HSI classification for understanding the data and its features, they show limitations when confronted with the complexity of real HSI data, especially in fusing spatial and spectral information effectively.
Deep learning techniques simulate the hierarchical functioning of the human visual system by constructing deep network models with hierarchical structures based on the characteristics of input data and artificial neural networks. These models can independently learn high-level, discriminative features from the data. With the advancement of deep learning, leveraging powerful computational resources and abundant data, recent algorithms such as CNNs [
28,
29], Transformer [
30,
31], and Mamba [
32,
33] have been employed in hyperspectral image (HSI) classification, demonstrating excellent performance in this task. CNNs are particularly effective at extracting spatial features and learning feature representations automatically, thereby improving image classification accuracy and offering robust feature extraction capabilities. Various CNN architectures have been proposed for extracting both spectral and spatial features, including 1D CNNs [
34], 2D CNNs [
35], 1D-2D CNNs [
36], 3D CNNs [
37], and 2D-3D CNNs (Hybrid CNNs) [
38]. 1D CNNs [
39,
40] are primarily used for spectral feature extraction, while 2D CNNs [
41,
42] delve into deep spatial features of pixels within spectrally compressed image blocks. 3D CNNs [
43,
44] are employed to extract both spectral and spatial features from HSI data, and Hybrid CNNs [
45,
46] leverage the advantages of 2D and 3D CNNs for a more comprehensive extraction of multi-scale and multi-dimensional information in HSIs. In spatial convolutional neural networks, the DHCNet [
47] model introduces deformable convolution and adaptive pooling operations that can dynamically adjust their size based on the input spatial information, addressing the limitation of traditional CNNs, whose fixed-position convolution kernels cannot adapt to spatial structures. Zhong et al. [
48] proposed a spatial-spectral residual network, SSRN, for HSI classification, leveraging the information of front-layer features as complements to back-layer features, significantly enhancing feature utilization. In the spectral-spatial convolutional neural network, Roy et al. [
38] introduced HybridSN, capable of more efficient learning of spectral-spatial features and more abstract spatial features, contributing to improved classification accuracy. Li et al. [
49] proposed a dual-channel 2D CNN architecture that considers both local and global spatial features while capturing spectral features, adaptively combining feature weights from two parallel streams to enhance the network’s expressive capabilities. Additionally, FADCNN [
50] presents a spatial-spectral dense convolutional neural network framework that employs a feedback attention mechanism, facilitating improved extraction and integration of spectral and spatial features, as well as refining these features to leverage semantic information. Despite the good classification results achieved by CNNs as HSI feature extractors, they face limitations in processing complex high-dimensional data, insufficient integration of spatial and spectral information, and a high demand for training samples.
Following the success of CNNs, graph convolutional networks (GCNs) have increasingly been applied in HSI classification due to their advantages in processing graph-structured data [
51]. By constructing relational graphs among pixels, GCNs effectively model the complex interactions between spatial and spectral information, thereby enhancing classification performance. Qin et al. [
52] proposed a second-order GCN, extending the standard GCN structure to fully utilize the inter-band relationships in hyperspectral images, improving classification accuracy. Wan et al. [
53] applied superpixel segmentation, dividing hyperspectral images into multiple superpixels and feeding these as nodes into a GCN, effectively extracting both internal and neighboring information of superpixels and enhancing classification results. Additionally, the dynamic multiscale graph convolutional network classifier (DMSGer) [
54] was proposed to capture pixel-level and region-level features simultaneously, strengthening classification performance in hyperspectral images. By modeling at multiple scales, DMSGer can better capture complex spatial features, thereby improving the ability to differentiate between classes. However, GCNs still face limitations in graph construction, particularly for large-scale graphs where computational costs become prohibitive, making it challenging for GCNs to classify or identify materials in large-scale hyperspectral scenes efficiently.
The Transformer architecture, initially introduced for natural language processing [
55], has been creatively adapted for computer vision, leading to the development of Vision Transformer [
56]. This innovation has expanded the application of Transformers into the field of HSI analysis. Unlike the CNN approach, which focuses on local spatial information, the Transformer's self-attention mechanism, aided by positional encoding, effectively models global sequence information. This mechanism efficiently captures long-range dependencies, providing a comprehensive understanding of the complex relationships between spatial and spectral features in HSI. HSI-BERT [
57] represents a pioneering application of Transformer-based models in HSI classification. It treats each pixel in the HSI cube as a Transformer token to capture the global context, demonstrating competitive accuracy. Hong et al. [
58] recognized the critical role of long-range dependencies in spectral bands and proposed SpectralFormer, a model that utilizes a pure Transformer architecture specialized in processing spectral features and establishing long-range dependencies. Tang et al. [
59] proposed a Transformer network with a dual-attention mechanism, capturing spectral and spatial features separately, and achieved superior classification results through the introduction of a skip-connection mechanism.
As research progressed, it was found that fusing CNN and Transformer for feature extraction could achieve better classification results. For instance, the SSFTT [
60] method preprocessed HSI data using 3D and 2D convolutions. 3D convolution was used to capture both spectral and spatial information features, while 2D convolution focused on extracting purely spatial features. A Gaussian-weighted feature tagger was then used to generate input tokens, which were fed into the Transformer encoder for classification by a linear layer. SSFTT successfully addressed the deep semantic feature extraction problem in HSI classification and became an important benchmark for subsequent landmark Transformer-based HSI classification research. Roy et al. [
61] proposed a novel morphFormer network for HSI classification, enhancing feature interaction through the combination of an attention mechanism with learnable spectral and spatial morphology convolutions, leading to significantly improved classification performance. Despite its impressive performance, the Transformer architecture has several drawbacks in real-world applications. Its multilayer structure and complexity require significant processing power during training and inference [
62]. Moreover, effective training of the Transformer model often necessitates a substantial amount of labeled data, which can be costly and difficult to obtain for hyperspectral data, especially when samples are limited or imbalanced. This can lead to overfitting and challenges in applying the model to new data. Additionally, the intricate self-attention mechanism and high computational complexity of the Transformer model result in poor real-time performance [
63]. Designing a hyperspectral network with few parameters, good classification performance, and high real-time performance presents a significant challenge.
This paper introduces the novel HSI classification model SSFAN, as illustrated in
Figure 1. SSFAN integrates advanced spectral-spatial feature extraction and deep learning algorithms. The model is composed of three key components: the Parallel Spectral-Spatial Feature Extraction Block (PSSB), the Scan Block, and the Squeeze-and-Excitation MLP Block (SEMB), designed to effectively extract and process spectral and spatial information, thereby enhancing the classification accuracy of HSI. The HSI data are initially preprocessed and fed into the PSSB, which includes two parallel streams. Each stream incorporates a 3D convolutional layer followed by a 2D convolutional layer. This process utilizes 3D convolution to extract spectral and spatial information from the input hyperspectral data, and then enhances the spatial feature representation through 2D convolution. The Scan Block is responsible for extracting spatial information at different scales from the center pixel, outward, employing a layered scanning strategy. This enables the model to capture both local and global spatial relationships. The SEMB consists of the Spectral-Spatial Recurrent Block (SSRB) and an MLP Block, which employs a deep residual structure combined with LayerNorm. This structure enhances the nonlinear representation of features while maintaining model stability. The SSRB introduces the SToken Module, a mechanism for adaptive weight assignment that facilitates flexible handling of time steps and feature dimensions through multilayered linear transformations and parameterization operations. Multiple state update operations are utilized to extract deeper spectral-spatial features. Finally, the MLP Module processes the input features through a series of linear transformations, activation functions (GELU), and Dropout layers, enabling the capture of complex patterns and relationships in the input data. The classification is completed through an argmax layer. The SSFAN model stands out by significantly reducing the number of parameters and MACs compared to other state-of-the-art models, thereby accelerating the training and inference speeds and enhancing the model’s deployment capabilities under limited computational resources. The codes are available at
https://github.com/one-boy-zc/SSFAN (accessed on 25 October 2024).
The contributions of this work are summarized as follows:
- (1)
A Parallel Spectral-Spatial Feature Extraction Block (PSSB) is proposed, which extracts spectral-spatial information more fully and thereby increases classification accuracy.
- (2)
The Scan Block, designed for image spreading, allows the model to capture both local and global spatial relationships through a layered scanning method.
- (3)
The combination of SSRB and MLP Block in the SEMB introduces an adaptive weight assignment mechanism, facilitating the extraction of deeper spectral-spatial features through multi-layer linear transformations and parameterization operations.
- (4)
SSFAN significantly reduces the number of parameters and MACs compared to Transformer-based models, speeding up training and inference and improving deployment capabilities.
2. Materials and Methods
The SSFAN model for HSI classification is composed of three primary components: the Scan Block, the Parallel Spectral-Spatial Feature Extraction Block, and the Squeeze-and-Excitation MLP Block, as depicted in
Figure 1.
2.1. HSI Data Preprocessing
Given the raw Hyperspectral Imaging (HSI) data
, where
l represents the number of spectral bands, and
denotes the spatial resolution size, each pixel in
I is characterized by
l spectral dimensions and is associated with a one-hot vector
, where
C is the number of feature classes. Rich spectral information is present in the
l spectral bands, but they also introduce substantial redundancy, which greatly increases the computational cost. Therefore, the spectral dimensionality, and with it the computational burden, is reduced using Principal Component Analysis (PCA) [
21]. PCA maintains the spatial dimensions of the HSI while reducing its spectral dimensions from
l to
b. The particular procedure is:
where
is the mean vector of each spectral channel,
is the eigenvector matrix corresponding to the first
b largest eigenvalues of the covariance matrix, and
and
denote the original hyperspectral data with
l bands and the dimensionality-reduced hyperspectral data with
b bands, respectively.
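For illustration, a minimal sketch of this PCA step is given below, assuming the raw cube is stored as a NumPy array of shape (H, W, l) and that scikit-learn is available; the function and variable names are ours and not taken from the released code.

```python
import numpy as np
from sklearn.decomposition import PCA

def apply_pca(cube: np.ndarray, n_components: int) -> np.ndarray:
    """Reduce the spectral dimension of an HSI cube from l to b = n_components bands."""
    h, w, l = cube.shape
    flat = cube.reshape(-1, l)            # one row per pixel, one column per band
    pca = PCA(n_components=n_components)  # centers the data and keeps the top-b eigenvectors
    reduced = pca.fit_transform(flat)
    return reduced.reshape(h, w, n_components)
```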
Following spectral dimensionality reduction via PCA, the HSI data were subjected to 3D-patch extraction.
was used to generate each neighboring 3D-patch (
), where
represents the window size. Every 3D-patch has a center pixel set to
, where
and
. The label of each 3D-patch is determined by the label of its center pixel. However, when extracting the region surrounding a pixel located at the edge of the image, some pixel values in the patch are unavailable because no pixel data exist beyond the boundary. Consequently, a padding operation with a padding width of
is carried out on these pixels, based on the method in SSFTT [
60]. In total,
determines how many 3D-patches there are in
. The width and height of each patch are
,
, and
b is their spectral dimension. After background pixels with zero labels are removed, the samples are split into training and test datasets.
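A sketch of the 3D-patch extraction with zero-padding described above follows; `patch_size` denotes the window size, and shifting the labels to start at zero is our convention for removing the zero background class.

```python
import numpy as np

def create_patches(cube: np.ndarray, labels: np.ndarray, patch_size: int):
    """Extract one (patch_size, patch_size, b) patch per labeled pixel of a (H, W, b) cube."""
    margin = patch_size // 2
    # pad the spatial dimensions only, so that edge pixels still get a full window
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)), mode='constant')
    patches, patch_labels = [], []
    for r in range(cube.shape[0]):
        for c in range(cube.shape[1]):
            if labels[r, c] == 0:                       # drop background pixels with zero labels
                continue
            patches.append(padded[r:r + patch_size, c:c + patch_size, :])
            patch_labels.append(labels[r, c] - 1)       # the patch label is its center-pixel label
    return np.asarray(patches), np.asarray(patch_labels)
```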
2.2. Parallel Spectral–Spatial Feature Extraction Block
After data preprocessing, the spectral-spatial information in each sample patch is extracted using the Parallel Spectral-Spatial Feature Extraction Block (PSSB). In contrast to the conventional single pathway, which is unable to sufficiently extract the spectral-spatial information [
60], the PSSB comprises two parallel streams, each consisting of a 3D convolutional layer followed by a 2D convolutional layer. Both pathways receive the same sample patch as input, and the features extracted by the two streams are then combined. Since the two streams have identical configurations, the following description covers only one pathway. The input of each sample patch (
) is fed into the 3D convolutional layer.
The process of 3D convolution is detailed in
Figure 2. Given the value at a given position of the
j-th output of the
i-th convolution kernel, the corresponding weight of the convolution kernel, and the value at the matching position in the sample patch, the 3D convolution is computed as:
where
and
represent the height and width of the sample patch, respectively,
represents the number of bands in the sample patch,
denotes the activation function (ReLU), and
denotes the
i-th bias used for the
j-th output. Theoretically,
3D convolutional kernels with a size of
make up the 3D convolutional layer.
3D feature cubes containing spectral-spatial information will be generated after the 3D convolutional layer. The size of each cube is shown in Equation (
3), and the total size of all feature cubes is shown in Equation (
4).
The feature cubes are then rearranged and fed into the next 2D convolutional layer, with a kernel size of
, to further enhance the spectral-spatial features. At the spatial location
on the j-th feature map in the
i-th layer of the 2D convolutional layer, the activation value
is defined as follows:
where
and
represent the width and height of the 2D convolutional kernel, and
represents the weight parameter at the k-th feature map
.
denotes the activation function (ReLU).
denotes the value of the
k-th feature map at
. The total number of feature maps after 2D convolutional layer processing is
@
, with
being the number of 2D convolutional kernels, and each convolutional kernel having a size of
, which are all set to 3.
Moreover, PSSB effectively addresses the high-dimensional nature of HSI data, mitigating the computational and representational shortcomings of single-channel methods in dealing with high-dimensional data. By integrating features from two identical channels, PSSB circumvents the limitation that single-channel methods cannot fully extract spectral and spatial features. This feature fusion enhances the model’s capability to capture complex HSI data and facilitates more effective extraction of spectral-spatial information, thereby improving the classifier’s accuracy, particularly in the classification of edge pixels or mixed pixels.
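To make the structure concrete, a PyTorch sketch of one possible realization of the PSSB is given below, assuming eight 3D kernels and 64 2D kernels with all kernel sides set to 3 (consistent with the Pavia University example of Section 2.5) and fusion of the two streams by element-wise addition; the exact fusion rule, padding, and normalization layers in the released code may differ.

```python
import torch
import torch.nn as nn

class SpectralSpatialStream(nn.Module):
    """One stream of the PSSB: a 3D conv followed by a 2D conv (kernel sizes assumed to be 3)."""
    def __init__(self, in_bands: int, n3d: int = 8, n2d: int = 64):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, n3d, kernel_size=(3, 3, 3)),
            nn.ReLU(inplace=True),
        )
        # after the 3D conv, the spectral axis is folded into the channel axis
        self.conv2d = nn.Sequential(
            nn.Conv2d(n3d * (in_bands - 2), n2d, kernel_size=3),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                        # x: (B, 1, b, s, s)
        x = self.conv3d(x)                       # (B, n3d, b-2, s-2, s-2)
        b_, c, d, h, w = x.shape
        x = x.reshape(b_, c * d, h, w)           # rearrange: merge kernels and spectral slices
        return self.conv2d(x)                    # (B, n2d, s-4, s-4)

class PSSB(nn.Module):
    """Two identical streams whose features are fused (element-wise addition assumed)."""
    def __init__(self, in_bands: int):
        super().__init__()
        self.stream1 = SpectralSpatialStream(in_bands)
        self.stream2 = SpectralSpatialStream(in_bands)

    def forward(self, x):
        return self.stream1(x) + self.stream2(x)
```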
2.3. Scan Block
The Scan Block is an enhanced version of the spreading operation, designed to extract multi-scale features from the central region and its surroundings, enabling the model to capture local information at different scales. The input to the Scan Block is the multi-channel feature map output from the PSSB. Initially, the Scan Block calculates the midpoints of the input tensor’s height and width (
) to determine the center index. The input tensor’s dimensions are then reshaped from
to
to facilitate manipulation of spatial and spectral information. This reshaping makes it easier to handle the spatial dimensions
h and
w (the second and third dimensions of
x), aligning them for efficient slicing operations. Next, an output tensor,
, is initialized to store the extracted values, with dimensions
. This output tensor represents a multi-channel spread of the feature map. The center region,
, is first assigned directly to
. Subsequently, features from regions of varying scales are extracted iteratively, starting from the center and gradually expanding outward. In each
i-th layer of this loop, a region containing
rows and
columns is extracted, resulting in a total of
pixels being added, with five loops in total. As shown in
Figure 3, this process transforms the feature map from 2D to 1D without altering the number of channels. The color distribution in the figure illustrates how values in the original 2D feature map are accurately mapped to the corresponding positions in
after the spreading process. By progressively extracting regions at different scales, the Scan Block effectively captures local features of varying sizes, enhancing spatial information processing. This multi-scale feature extraction improves the model’s spatial perception, ultimately enhancing overall performance.
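The layered, center-outward scanning can be sketched as follows; using the Chebyshev distance to define the rings and the ordering of pixels inside each ring are our assumptions.

```python
import torch

def scan_center_outward(x: torch.Tensor, n_rings: int = 5) -> torch.Tensor:
    """Flatten a (B, C, H, W) feature map into a (B, C, L) sequence ordered from the
    center pixel outward in concentric square rings, keeping the channel count unchanged."""
    b, c, h, w = x.shape
    ch, cw = h // 2, w // 2                                       # center indices
    rows = torch.arange(h).view(-1, 1).expand(h, w)
    cols = torch.arange(w).view(1, -1).expand(h, w)
    ring = torch.maximum((rows - ch).abs(), (cols - cw).abs())    # ring index of every pixel
    pieces = []
    for i in range(n_rings + 1):                                  # i = 0 is the center pixel itself
        mask = ring == i                                          # all pixels on the i-th square ring
        pieces.append(x[:, :, mask])                              # (B, C, 8*i) values (1 for the center)
    return torch.cat(pieces, dim=-1)                              # (B, C, L)
```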
Before the output of the Scan Block is passed to the next module, positional information and a Learnable Token are added to it. The Learnable Token facilitates the subsequent classification: through it, the model can gather global information over the whole sequence, providing a useful global feature for the classification task. Position Embedding labels the positional information of each semantic token, allowing the model to process sequences with spatial or sequential sensitivity, which is crucial for the model's comprehension. The output after adding the positional information and the Learnable Token has a size of .
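A sketch of this step, assuming the scanned sequence has already been transposed to (batch, length, dim) and that the zero-initialized Learnable Token is prepended to the sequence (the exact insertion position is not spelled out above):

```python
import torch
import torch.nn as nn

class TokenAndPosition(nn.Module):
    """Prepend a zero-initialized learnable token and add a learnable position embedding."""
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len + 1, dim))

    def forward(self, x):                                # x: (B, L, D) from the Scan Block
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                   # (B, L + 1, D)
        return x + self.pos_embed                        # add positional information
```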
2.4. Squeeze-and-Excitation MLP Block
The Squeeze-and-Excitation MLP Block (SEMB) consists of two core modules: the Spectral-Spatial Recurrent Block (SSRB) and the MLP Block. The combination of these two modules enables a more comprehensive extraction and aggregation of the information in the hyperspectral data, allowing the model to handle complex spectral dimensions with better expressiveness and classification accuracy. For the input feature , the shape is , where B is the batch size, L is the length of the sequence, and D is the feature dimension. is input to the SSRB module for processing by the self-attention mechanism to extract important spectral and spatial features. The output of the SSRB passes through the AdaptiveAvgPool and Layer Normalization layers before being passed to the MLP Block for further deep extraction of feature information from the sequence. Each module contains a residual connection, which not only aids the effective transfer of information but also mitigates the vanishing gradient problem and ensures robust training of the model. Next, the working principle and implementation of these two key modules are introduced in detail.
2.4.1. Spectral-Spatial Recurrent Block
As shown in
Figure 1, the Spectral-Spatial Recurrent Block (SSRB) combines several components, such as linear layers, an attention mechanism, and state updates. SSRB consists of six main parts: Sigmoid, Linear Transformations, SToken Module, State Initialization & Weights, State Update & Feature Extraction, and Output Layer. Firstly,
is input into the Sigmoid function and compressed into the range (0, 1) to get the activation value
z, which is mainly used for the subsequent output adjustment. Next is the State Initialization & Weights module, which is used to initialize some tensors for subsequent processing. Two weight matrices
and
are randomly generated in the initial stage to participate in the subsequent recursive operations. Then,
is input into two linear transformations
and
to generate two intermediate tensors
and
with the same dimensions as
, respectively. Meanwhile,
is input to SToken Module to get the feature vector
T. The computation of
T will be described in detail in the subsequent section, and the formulae for
B and
C are as follows:
where the bias terms are
and
, and the weight matrices of the linear layer are
and
. Then,
and A are calculated. To obtain
, the input
is linearly varied, and the result is calculated using a Sigmoid nonlinear activation function, which can be computed as follows:
where
denotes the linear transformation. A is obtained from
and
for subsequent state updates, calculated as follows:
B is obtained by the
and
Einstein summation conventions, calculated as:
After the initialization tensors are prepared, the state update and feature extraction are carried out in the State Update & Feature Extraction module. Assuming that the initial state
s is a zero vector, the state is processed recursively, with a recursive update conducted at each time step
t,
t in the range
, where
L is the length of
.
where
is
,
is
,
is
, and the output prediction is obtained from
, computed as:
where
is
,
is
, and
is
. After the recursive processing is completed, the predictions of all time steps are stitched together to obtain the final output
and adjusted with
z, computed as:
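Because the equations themselves are not reproduced here, the following PyTorch sketch shows one Mamba-style reading of the recursion described above (gating value z, input-dependent B, C, and step size, and a per-time-step state update); all tensor shapes, the state dimension, and the exact discretization are our assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class SelectiveRecurrence(nn.Module):
    """Minimal selective state-update loop in the spirit of the SSRB description."""
    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.to_B = nn.Linear(dim, state_dim)       # produces the state-injection tensor B
        self.to_C = nn.Linear(dim, state_dim)       # produces the read-out tensor C
        self.to_delta = nn.Linear(dim, dim)         # produces the per-step step size
        self.A_log = nn.Parameter(torch.randn(dim, state_dim))   # randomly initialized weights

    def forward(self, x):                           # x: (B, L, D)
        z = torch.sigmoid(x)                        # gate used to adjust the final output
        B_seq = self.to_B(x)                        # (B, L, N)
        C_seq = self.to_C(x)                        # (B, L, N)
        delta = torch.sigmoid(self.to_delta(x))     # (B, L, D)
        s = x.new_zeros(x.size(0), x.size(2), B_seq.size(-1))    # initial state is a zero vector
        outputs = []
        for t in range(x.size(1)):                  # recursive update over every time step
            A_t = torch.exp(torch.einsum('bd,dn->bdn', delta[:, t], -torch.exp(self.A_log)))
            Bx_t = torch.einsum('bd,bn,bd->bdn', delta[:, t], B_seq[:, t], x[:, t])
            s = A_t * s + Bx_t                      # state update
            outputs.append(torch.einsum('bdn,bn->bd', s, C_seq[:, t]))   # per-step prediction
        y = torch.stack(outputs, dim=1)             # stitch the predictions of all time steps
        return y * z                                # adjust the output with the gate z
```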
The SToken Module plays a crucial role in integrating attention mechanisms within the SSRB framework, particularly in enhancing input sequence features through an attention mechanism. This module effectively captures significant spectral information features and fuses them with the original input by applying a Squeeze-and-Excitation (SE) attention computation on the input data sequence and combining it with a trainable bias parameter. The SToken Module is designed to implement an SE attention mechanism, which operates on the input sequence’s dimensionality to perform attention computation, and then adjusts the input features’ weights, enabling the model to focus more on important features.
Figure 4 visualizes the SToken Module process.
The module begins by extracting a compressed feature from the input sequence using a global average pooling operation. This operation averages the sequence in the sequence dimension, resulting in a feature vector that represents the average feature of the
i-th sample over the sequence dimension, denoted as
. Subsequently, this feature vector is fed into a fully connected layer, which consists of a linear layer, a Sigmoid function, and a ReLU activation function. The Sigmoid function is used to scale the features within the
interval, while the ReLU activation provides a nonlinear mapping. A learnable bias is then initialized and extended to the same shape as the SE-seq, followed by a fully connected layer. The computational formula for this process is:
where
represents the bias added in the fully connected layer, and
is the weight of the fully connected layer. The resulting
has a shape of
, which is then concatenated with the second through last elements of the original input
, effectively incorporating the adjusted features into the sequence while preserving the original sequence features. The shape of the adjusted
is restored to
. Finally,
is element-wise multiplied with
, a process analogous to the attention mechanism, which assigns greater weight to important parts of the sequence, thereby highlighting key features, computed as:
This operation adjusts and weights the features of the input sequence, improving the model’s focus on important features and facilitating subsequent state updates.
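A compact sketch of the SToken Module, read as a standard Squeeze-and-Excitation block over the sequence dimension, is shown below; the reduction ratio, the exact ordering of the activation functions, and the way the re-weighted features are merged back into the sequence are assumptions.

```python
import torch
import torch.nn as nn

class STokenModule(nn.Module):
    """SE-style attention: squeeze over the sequence, excite per feature, re-weight the input."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(dim))        # trainable bias added to the squeezed feature
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),                                 # attention weights scaled to (0, 1)
        )

    def forward(self, x):                                 # x: (B, L, D)
        squeezed = x.mean(dim=1)                          # global average pooling over the sequence
        weights = self.fc(squeezed + self.bias)           # (B, D) per-feature importance
        return x * weights.unsqueeze(1)                   # element-wise re-weighting of the sequence
```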
2.4.2. MLP Block
Before formally entering the MLP Block, the output in the Spectral-Spatial Recurrent Block is normalized and adaptive mean pooling is performed in order to homogenize the features over the sequence dimension. This adaptive average pooling layer serves as a dimensionality reduction operation on the input tensor , transforming it into a compact feature representation for subsequent processing in the MLP Block. The shape of is , and the transpose operation is first applied to swap dimensions 1 and 2, changing the shape to . This step is taken to enable the adaptive average pooling operation to perform pooling operations on the sequence dimension. Subsequently, the tensor enters the adaptive average pooling operation, which performs global average pooling over the entire sequence L for each feature dimension D. The average of all elements in each sequence is used as the output of that sequence, resulting in a shape of . Finally, a squeeze operation is applied, removing the dimension of size 1, altering the tensor’s shape to . The output after adaptive average pooling, denoted as , serves as the input to the MLP Block.
The MLP Block comprises two linear layers, followed by a GELU activation function and two Dropout layers, each applied after the linear layers. After the first linear layer, a GELU activation function is applied, followed by dropout. The input then proceeds through the second linear layer and dropout before reaching the final output. This MLP layer is followed by a LayerNormalize layer, which aids in mitigating gradient explosion and vanishing gradients, facilitating faster training. The output after the MLP Block has a size of , and the final result is obtained through the argmax function in the numpy library.
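The pooling and MLP head described above can be sketched as follows; the hidden width, dropout rate, and the exact placement of the LayerNorm are treated as configurable assumptions.

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """Adaptive average pooling over the sequence followed by the two-layer MLP classifier."""
    def __init__(self, dim: int, hidden: int, num_classes: int, p_drop: float = 0.1):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)               # averages every feature over the sequence
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, num_classes),
            nn.Dropout(p_drop),
        )

    def forward(self, x):                                 # x: (B, L, D)
        x = self.pool(x.transpose(1, 2)).squeeze(-1)      # (B, D, L) -> (B, D, 1) -> (B, D)
        return self.mlp(self.norm(x))                     # class logits; argmax is applied afterwards
```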
2.5. Implementation
Compared to the Transformer model, SSFAN has fewer parameters and fewer multiply-accumulate operations (MACs), and is able to accomplish the classification task more quickly and to a high standard. In this paper, we take the Pavia University dataset, with 9 land cover classes and a size of , as an example to illustrate the proposed SSFAN.
After PCA dimensionality reduction and patch partitioning, each patch has a size of . The PSSB consists of two identical parallel paths; we will analyze the data flow using the first path as an example. In the 3D convolution layer of the first path, each patch generates eight feature cubes of size , with each patch utilizing eight convolution kernels. The purpose of the 3D convolution layer is to extract rich spectral information from each patch. After rearranging the eight feature cubes, a feature cube consisting of features of size is generated. Next, 64 two-dimensional convolution kernels are used to perform 2D convolution, resulting in 64 feature maps of size . Each feature map is processed through a Scan operation, flattening it into 64 feature vectors of size , where 16 represents the number of channels. Simultaneously, a learnable token vector, cls-tokens, initialized to all zeros, of size is created and concatenated with the output of the Scan Block. Subsequently, positional information is added, yielding , which is then input into the Squeeze-and-Excitation MLP Block, passing sequentially through the SSRB and MLP Block modules. In the SSRB, the input undergoes SToken attention mechanism and sequence modeling, and the resulting output is compressed into a single global representation through adaptive pooling. After passing through the MLP Block for nonlinear transformation, the final classification result is obtained. The overall process of the proposed SSFAN method is illustrated in Algorithm 1.
2.6. Loss Function
The cross-entropy loss function [
64] is widely utilized in HSI classification. Its primary benefit lies in its ability to accurately assess the discrepancy between the model’s predicted probability distribution and the actual labeling distribution, which aids in achieving rapid convergence during model training. Nonetheless, the cross-entropy loss function’s sensitivity to class imbalance and outliers can impact model performance in certain scenarios, necessitating supplementary strategies to address these issues in real-world applications. In order to achieve this, this paper designs a hybrid cross-entropy loss function (
), which is robust and better accounts for the category distribution. It combines the Normalized Generalized Cross Entropy (
) and the Normalized Cross Entropy (
) loss functions.
Algorithm 1 SSFAN Model |
- Input:
Input HSI data and ground-truth ; after PCA bands number ; test dataset comprises of the total; patch size ; - Output:
Output the predicted categories for the test dataset. - 1:
Set Batchsize to 100; Optimizer Adam, learning rate ; Training epochs ; - 2:
After PCA transformation, is obtained. - 3:
Generate sample patches from I and divide them into training and testing datasets. - 4:
Generate train dataloader and test dataloader. - 5:
for each training epoch do - 6:
Execute 3D convolutional layer and 2D convolutional layer to obtain . - 7:
Execute another 3D convolutional layer and 2D convolutional layer to obtain ; - 8:
- 9:
Perform Scan Block. - 10:
Initialize learnable tokens, connect them to the output of the Scan Block, and embed the position to obtain the . - 11:
Perform Spectral-Spatial Recurrent Block; - 12:
Perform MLP Block. - 13:
end for - 14:
Use a trained model to predict categories in the test dataset. - 15:
return Predicted label.
|
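A minimal training and prediction loop corresponding to Algorithm 1 is sketched below; the learning rate and number of epochs are left as arguments because their values are not given in this excerpt, and `loss_fn` stands in for the hybrid loss of Section 2.6.

```python
import torch

def train_ssfan(model, train_loader, loss_fn, device, epochs, lr):
    """Train SSFAN with the Adam optimizer as specified in Algorithm 1 (batch size 100)."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for patches, labels in train_loader:          # patches: (batch, 1, b, s, s) 3D-patches
            patches, labels = patches.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(patches), labels)    # forward pass: PSSB -> Scan Block -> SEMB
            loss.backward()
            optimizer.step()

@torch.no_grad()
def predict(model, test_loader, device):
    """Predict the categories of the test dataset via the final argmax."""
    model.eval()
    preds = [model(p.to(device)).argmax(dim=1).cpu() for p, _ in test_loader]
    return torch.cat(preds)
```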
NGCE is a generalized version of the standard cross-entropy loss function, introducing a parameter
q to control the sensitivity of the loss function. Given the predicted probability
p and the target category
y, NGCE is formulated as:
where
C denotes the number of categories;
is the one-hot coding of the target label, with a value of 1 at the index of only the correct category and 0 for the rest;
is the probability value of the model’s prediction output; and
q is a hyperparameter controlling the shape of the loss function, which degrades to standard cross-entropy when
, and increases the robustness of the loss to incorrect predictions when
. The numerator computes the difference between the predicted probability and the target label and adjusts the degree of nonlinearity of the loss by a
q power, and the denominator is the number of categories minus the sum of the
q powers of the predicted probabilities for each category, ensuring that the loss adjusts to category imbalances. As
q tends to 0, NGCE will tend to be more forgiving of uniform predictions for each category, but will be more sensitive to extreme mispredictions. Therefore, NGCE is suitable for scenarios that require robustness to outliers. In this paper, we demonstrate through multi-group ablation experiments that the model works best when the value of
q takes 0.7.
NCE is a normalized version of the standard cross-entropy loss, which mainly balances the effect of category imbalance on the loss by normalizing the denominator term. The formula of NCE is:
where
,
C, and
have the same meaning as in Equation (
16). In Equation (
17), the numerator is the standard cross-entropy loss, which measures the model's prediction loss for the correct category; the denominator is the logarithmic summation of the prediction probabilities over all categories, which serves as a normalizer, preventing exploding or vanishing gradients when the category distribution is imbalanced. The main purpose of NCE is to adjust the value of the loss function so that it remains stable under different category distributions.
The hybrid loss function proposed in this paper combines NGCE and NCE by introducing two hyperparameters
and
to control the weights of the two in the final loss, respectively. The hybrid loss function is formulated as:
By adjusting the values of and , we can control the relative significance of NGCE and NCE in the loss. If the model is highly sensitive to outliers, increasing the value of can enhance the impact of NGCE, and if the model struggles with category imbalance, increasing the value of can enhance the impact of NCE. NGCE offers robustness against misclassification, while NCE addresses category imbalance through normalization. The combination of the two makes this loss function more stable and adaptable in various data distributions and anomaly scenarios. In this paper, we demonstrate through multiple sets of ablation experiments that the model achieves its best performance when and are both set to 1.0.
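A sketch of the hybrid loss as we read Equations (16)-(18) and the surrounding text follows: NGCE plus NCE, weighted by two hyperparameters (named alpha and beta here; both set to 1.0 in this paper, with q = 0.7).

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, targets, q: float = 0.7, alpha: float = 1.0, beta: float = 1.0):
    """L = alpha * NGCE + beta * NCE for logits of shape (B, C) and integer targets of shape (B,)."""
    probs = F.softmax(logits, dim=1).clamp_min(1e-7)
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1)     # probability of the correct category
    num_classes = logits.size(1)
    # Normalized Generalized Cross Entropy: (1 - p_y^q) / (C - sum_k p_k^q)
    ngce = (1.0 - p_y.pow(q)) / (num_classes - probs.pow(q).sum(dim=1))
    # Normalized Cross Entropy: (-log p_y) / (-sum_k log p_k)
    nce = (-p_y.log()) / (-probs.log().sum(dim=1))
    return (alpha * ngce + beta * nce).mean()
```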
5. Discussion
5.1. Advantages of Parallel Spectral–Spatial Feature Extraction Block
Compared with the single-stream Spectral-Spatial Feature Extraction Block (SSSB), the parallel-stream Parallel Spectral-Spatial Feature Extraction Block (PSSB) can extract richer spectral-spatial information. To assess the performance of this module, we carried out several experiments using three distinct datasets, evaluating the No Spectral-Spatial Feature Extraction Block (NSSB), the single-stream Spectral-Spatial Feature Extraction Block (SSSB), and the parallel-stream Parallel Spectral-Spatial Feature Extraction Block (PSSB). The experimental findings are presented in
Table 5. The table indicates that the Spectral-Spatial Feature Extraction Block greatly improves the classification performance. Specifically, the OA metrics on the IP, Pavia University, and WHLK datasets are improved by
,
, and
, respectively, using PSSB compared to NSSB, and the AA and Kappa metrics also show similar increases. These ablation experimental results indicate that the PSSB module has a significant performance enhancement in the overall model, further validating its importance and effectiveness in the classification task. To further illustrate the advantages of the parallel streams, we performed comparative experiments between PSSB and SSSB. The results indicate that PSSB surpasses SSSB in all three metrics: OA, AA and Kappa, highlighting the enhanced effectiveness of the parallel channel in extracting spectral-spatial features.
To validate the effectiveness of the proposed PSSB, which employs two Spectral–Spatial Feature Extraction Blocks, we conducted a series of comparative experiments on the PU hyperspectral dataset, with the results presented in
Table 6. From the results, the OA and Kappa value using PSSB reached
and
, respectively, further demonstrating the effectiveness of PSSB. Meanwhile, as the number of Spectral–Spatial Feature Extraction Blocks increased, the model’s parameter count gradually increased; however, its performance did not improve and instead exhibited a declining trend. This may indicate that the model is experiencing overfitting.
5.2. Discussion About Patch Size
HSIs not only contain rich spectral information, but also carry spatial information. Classification using a pixel point alone may ignore the spatial context around the pixel point, while segmenting the image into patches can effectively utilize the local spatial information. The patch size determines the spatial extent of each patch, i.e., how far the patch extends from the center pixel to the outermost pixel. A more detailed categorization might come from a smaller patch size, but the accuracy of the classification might suffer from insufficient contextual data. Using a larger patch size gathers more contextual details, enhancing the reliability of the classification. However, this also raises computational complexity, potentially resulting in longer training durations or increased memory requirements. Therefore, it is crucial to choose an appropriate patch size. To validate the effectiveness of different patch size values used in this study, a series of experiments were carried out on three datasets with varying patch size values, as presented in
Table 7. The results show that a patch size of 11 consistently leads to the fastest training and testing times; however, the other three metrics do not achieve optimal values. As the value of patch size increases, the training time and testing time both increase, while the values of the remaining three metrics first increase and then decrease. There are two main reasons for this: first, as the patch size grows, the spatial information may increase, but the “purity” of the spectral information could diminish. This happens because a larger patch might include spectral features from various categories, causing the spectral characteristics of the central pixel to be mixed with those of the surrounding pixels. Second, increasing the patch size also raises the dimensionality of the input data, which can complicate the model and increase the risk of overfitting. Therefore, the decision to set the patch size at 15 in this study is based on experimental evidence.
5.3. Discussion About Loss Function
The choice of loss function directly affects the optimization process and final performance of the model. Different tasks and data distributions may require different loss functions. In order to verify the effectiveness of the hybrid loss function proposed in this paper, a series of experiments are conducted and the results are shown in
Table 8. We use the regular cross-entropy loss function
as a comparison in order to mimic the scenario without applying the hybrid loss function
described in this research, i.e., to conduct loss function ablation experiments. On three datasets, we conducted four sets of experiments:
,
,
and
. The experimental findings indicate that among the four different loss functions, the hybrid loss function introduced in this paper yields the best performance. Notably, on the PU dataset, the
hybrid loss function stands out as the most effective, achieving optimal results in the three key evaluation metrics: Overall Accuracy (OA), Average Accuracy (AA), and Kappa. Due to the inclusion of two loss functions in our proposed hybrid loss function, both training and testing times will be extended. However, when applied to the WHLK dataset, these times are shorter compared to other combinations of loss functions. This may be attributed to the extensive data and experimental samples available in the WHLK dataset. The
hybrid loss function yields the best performance across the three primary evaluation metrics: OA, AA and Kappa.
hybrid function’s advantage is more obvious on the data with large samples. Multiple sets of experiments on three datasets demonstrate the effectiveness of the hybrid loss function proposed in this paper.
5.4. Training Time and Test Time
Training time is the stage when the model learns on the training dataset, which directly affects the model’s fitting ability and complexity, while testing time is the stage when the model is deployed to make predictions on new data, which affects the model’s real-time performance and application scenarios. The balance between training time and testing time is an important consideration in designing an efficient model, which is directly related to the usability and practicality of the model. In order to verify the effectiveness of the model proposed in this paper, we conducted a series of experiments, the results of which are shown in
Table 9. In terms of training time, the SSFAN model performs best on all datasets. Specifically, on the IP dataset, the training time of SSFAN is 78.66 s, showing a significant advantage over other models. For example, compared to 2D CNN (87.51 s), SSFAN's training time is reduced by 8.85 s, which indicates that SSFAN is more efficiently trained when dealing with this dataset. On the PU dataset, the training time of SSFAN is 310.72 s, which is higher than that of 2D CNN (114.63 s), but still shows a large time saving compared to SSRN (704.40 s) and 3D CNN (688.85 s). On the WHLK dataset, the training time of SSFAN is 1452.56 s, which is shorter than SpectralFormer (1567.77 s) and Hybrid CNN (2145.48 s), and significantly lower than SSRN (5689.40 s) and 3D CNN (2490.97 s), showing its higher training efficiency. SSFAN also shows excellent performance in terms of testing time. On the IP dataset, the test time of SSFAN is 1.87 s, which is significantly lower than that of 2D CNN (3.32 s) and all other models. This shows that SSFAN not only performs well in the training phase, but also provides fast response time in the testing phase. SSFAN took 8.17 s to test on the PU dataset, while 2D CNN and SpectralFormer took 4.82 and 19.54 s, respectively. Although SSFAN's test time on this dataset is not the shortest, its overall performance is still optimal. On the WHLK dataset, the test time of SSFAN is 37.05 s, which shows a significant advantage over 2D CNN (78.56 s) and Hybrid CNN (134.45 s), and the gap is more obvious when compared to SSRN (267.67 s) and 3D CNN (222.56 s). These results indicate that the SSFAN model can provide higher computational efficiency when dealing with complex datasets, which is especially important for large-scale data processing in practical applications. Its shorter training and testing time not only improves the practical efficiency of the experiment, but also reduces the consumption of resources, providing a reliable solution for efficient data processing.
5.5. Discussion About Parameters and MACs
Parameters and MACs are two key metrics when discussing the efficiency of deep learning models. The number of parameters influences the model size and the memory needed for training, whereas MACs indicate the computational complexity during the inference process. To demonstrate the effectiveness of the models presented in this paper, we compare the parameter counts and MACs across different models, with the experimental results summarized in
Table 10. In terms of the number of parameters, the SSFAN model has only 39.86 K parameters, showing a significant advantage over most other models. For example, the 1D CNN, although smaller in parameter count (20.43 K), is far inferior to SSFAN in terms of performance metrics. In terms of computational complexity (MACs), SSFAN also exhibits a low value (10.35 M), which shows significant computational savings compared to 2D CNN (30.45 M) and 3D CNN (98.67 M). This shows that SSFAN has a strong advantage in terms of computational efficiency and is able to achieve efficient performance with reduced consumption of computational resources. The SSFAN model surpasses all other models evaluated on the PU dataset. Its benefits in terms of parameter count and computational complexity enable it to deliver effective classification results even with restricted resources. Regarding accuracy, SSFAN achieves the best results in OA, AA and Kappa, demonstrating its overall excellence in classification tasks. This enhancement in performance is due not only to the optimization of the model's architecture but also to its effectiveness in feature extraction and data handling.
As shown in
Figure 11, scatter plots of the different models on the Pavia University dataset are presented, where the size of each point represents the MACs value. It can be seen that the SSFAN model maintains the highest OA and AA values with fewer parameters and lower MACs, which illustrates the effectiveness of the proposed model.
5.6. Discussion of Model Robustness
To assess the robustness [
71] of the SSFAN model proposed in this study, we introduced various levels of Gaussian, Salt-and-Pepper, and Poisson noise into the dataset. Gaussian noise consists of random noise with amplitudes that follow a normal distribution, Poisson noise represents discrete events that occur randomly, while Salt-and-Pepper noise causes abrupt shifts in pixel values to extreme levels. By adjusting the parameters for each noise type, we progressively increased noise intensity to evaluate model performance across these different noise levels. The experimental results are summarized in the
Table 11.
For Gaussian noise, the mean was set to zero, with standard deviation (std) values tested at 1, 5, and 10. As the standard deviation increased, classification accuracy metrics showed slight improvements. At an std of 1, the OA was , AA reached , and the Kappa was . When the std rose to 10, these metrics increased slightly, with OA reaching , AA at , and Kappa at . This indicates that moderate Gaussian noise does not hinder model robustness; instead, it may enhance the model’s generalization ability. For salt-and-pepper noise, the noise level was controlled by setting the parameters salt_prob and pepper_prob, both tested at levels of 0.1, 0.3, and 0.5. As noise levels increased, classification accuracy metrics gradually declined. For instance, with salt_prob and pepper_prob at 0.1, OA reached , AA was , and Kappa stood at . When these parameters were raised to 0.5, OA fell to , AA dropped to , and Kappa decreased to . These results suggest that high levels of salt-and-pepper noise significantly impact model accuracy, likely due to the noise’s disruptive effect on image pixels. For Poisson noise, intensity was controlled by adjusting the scale parameter, tested at values of 1, 10, and 20. It was observed that classification accuracy did not decrease significantly as the scale increased. At a scale of 1, OA was , AA reached , and Kappa was . With the scale at 10, OA rose slightly to , AA reached , and Kappa was . At a scale of 20, OA showed a slight decrease to . This indicates that Poisson noise has a minimal effect on model performance and may even positively contribute to model generalization. In summary, the results demonstrate that moderate levels of Gaussian and Poisson noise can enhance model generalization and robustness. In contrast, high levels of salt-and-pepper noise significantly reduce model performance, suggesting a more destructive impact on the data.
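The noise injection used in these robustness experiments can be sketched as follows; the exact noise models and parameter handling are assumptions consistent with the description above.

```python
import numpy as np

def add_gaussian(cube, std):
    """Additive zero-mean Gaussian noise (std tested at 1, 5 and 10)."""
    return cube + np.random.normal(0.0, std, cube.shape)

def add_salt_and_pepper(cube, salt_prob, pepper_prob):
    """Drive randomly chosen pixels to the data maximum (salt) or minimum (pepper)."""
    noisy = cube.copy()
    mask = np.random.rand(*cube.shape[:2])        # one draw per spatial location
    noisy[mask < salt_prob] = cube.max()
    noisy[mask > 1.0 - pepper_prob] = cube.min()
    return noisy

def add_poisson(cube, scale):
    """Poisson (shot) noise whose intensity is controlled by a scale factor (1, 10, 20)."""
    shifted = cube - cube.min()                   # Poisson rates must be non-negative
    return np.random.poisson(shifted * scale) / scale + cube.min()
```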
5.7. Limitations and Future Perspectives
The SSFAN model innovatively integrates the PSSB, Scan Block, and SEMB modules to create a compact and lightweight architecture capable of efficiently extracting and processing spectral and spatial information. This results in excellent classification performance and robust real-time processing capability. However, there remains room for further optimization. Although the Scan module introduces an innovative spatial feature arrangement, its manually defined center scanning process may limit its ability to capture complex spatial relationships, particularly when dealing with hyperspectral data that exhibit significant spatial variability. This scanning method lacks flexibility, which may hinder the model's ability to fully utilize spatial information at each pixel. Furthermore, while the SSRB sequence processing mechanism within the SEMB module effectively integrates sequence information, its computational complexity remains high when processing long sequences, potentially increasing model inference time and computational resource demands. In the data preprocessing stage, this study did not account for potential overlap between training and testing datasets caused by the use of patches.
Future research will explore more flexible and efficient spatial feature extraction methods to enhance the model’s ability to capture spatial features at multiple scales. Additionally, optimizing the computational efficiency of the SSRB module will be a primary focus, particularly by introducing more lightweight sequence processing mechanisms to reduce the model’s complexity and computational cost. Further studies will also investigate the effective integration of multimodal data or other complementary information to provide more contextual insights, thereby enhancing the accuracy of hyperspectral image classification. Moreover, we should explore new dataset partitioning methods to minimize overlap between training and testing datasets during data preprocessing, which is essential for reliable model evaluation.