1. Introduction
With the continuous advancement of spectral imaging technology, hyperspectral data have achieved significant improvements in both spatial and spectral resolution. Compared to multispectral and RGB images, hyperspectral images (HSI) possess narrower bandwidths and a greater number of bands, allowing them to provide more detailed and continuous spectral information [1,2,3]. As a result, HSI have demonstrated tremendous potential in various earth observation fields [4] such as precision agriculture [5,6], urban planning [7,8], environmental management [9,10,11], and target detection [12,13,14,15]. Consequently, research on HSI classification has progressed rapidly.
While traditional HSI classification methods, such as nearest neighbor [16], Bayesian estimation [17], multinomial logistic regression [18,19], and Support Vector Machine (SVM) [20,21,22,23], have their merits in certain scenarios, they are often limited in data representation and fitting capability and struggle to produce satisfactory results on more complex datasets. In recent years, deep learning methods, with their outstanding feature extraction capabilities, have therefore gradually become the focus of HSI classification research.
Convolutional Neural Networks (CNNs) dominate the field of deep learning and are capable of accumulating in-depth spatial features through layered convolution. As such, CNNs have been extensively applied to and researched in HSI classification [24,25]. Notably, Roy and colleagues [26] introduced a model named HybridSN, which first employs a 3D-CNN to extract spatial–spectral features from spectral bands that have undergone PCA dimensionality reduction and then uses a 2D-CNN to extract more abstract spatial feature hierarchies. Compared to a pure 3D-CNN, this hybrid approach simplifies the model architecture while effectively merging spatial and spectral information. Building on this, subsequent researchers have incorporated one-dimensional convolution on the central pixel to compensate for spectral information that may be lost after PCA reduction; examples include the Cubic-CNN model proposed by J. Wang et al. [27] and the JigsawHSI model introduced by Moraga and others [28].
The Vision Transformer (ViT) model [29], which evolved from the natural language processing (NLP) domain, has also increasingly become a focal point in the field of deep learning. The ViT model segments images into fixed-size patches and leverages embedding techniques to obtain a broader receptive field. Furthermore, with the help of multi-head attention mechanisms, it adeptly captures the dependencies between different patches, thereby achieving high processing efficiency and remarkable image recognition performance. Consequently, numerous studies have explored the application of this model to HSI classification. For instance, a research team proposed the Spatial–Spectral Transformer (SST) model in [30]. They utilized VGGNet [31], with several convolutional layers removed, as a feature extractor to capture spatial characteristics from hyperspectral images; they then employed the DenseTransformer to discern relationships between spectral sequences and used a multi-layer perceptron for the final classification task. Qing et al. introduced SATNet in [32], which effectively captures spectral continuity by adding position encoding vectors and learnable embedding vectors. Meanwhile, Hong and colleagues presented the SpectralFormer (SF) model in [33]. This model adopts the Group-wise Spectral Embedding (GSE) module to encode adjacent spectra, ensuring spectral information continuity, and utilizes the Cross-layer Adaptive Fusion (CAF) technique to minimize information loss during hierarchical transmission. X. He and their team introduced the SSFTT network in [34]; this model significantly simplifies the SST structure and incorporates Gaussian-weighted feature tokenization for feature transformation, thus reducing computational complexity while enhancing classification performance. In recent studies, researchers have continued to explore more lightweight and effective transformer-based methods for feature fusion and extraction; for instance, Xuming Zhang and others proposed the CLMSA and PLMSA modules [35], while Shichao Zhang and colleagues introduced the ELS2T [36].
Due to the high correlation between adjacent bands, HSI contain a significant amount of redundant information. To mitigate the impact of this redundancy, the methods mentioned above [26,30,31] commonly preprocess HSI using Principal Component Analysis (PCA). However, when PCA is omitted, their prediction accuracy drops significantly, highlighting a deficiency in extracting key spectral information directly from the raw bands. As a non-learnable dimensionality reduction technique, PCA is irreversible and can cause information loss, such as the loss of spectral continuity [37]; models reliant on PCA may thus produce suboptimal results. In transfer learning or few-shot classification tasks, HSI must be fed into the model with a large number of channels to preserve as much original information as possible. This places higher demands on feature extractors, which must efficiently process and extract key information from these numerous channels. Moreover, different datasets may require different degrees of dimensionality reduction, making the selection of appropriate dimensions for each dataset a time-consuming operation. Therefore, we propose CESA-MCFormer, which effectively extracts key information from HSI under conditions of limited samples and numerous channels, achieving higher classification accuracy in downstream tasks without relying on PCA for dimensionality reduction. To achieve this, we incorporate attention mechanisms and mathematical morphology.
Attention mechanisms have been extensively applied in various domains of machine learning and artificial intelligence. Hu et al. [38] introduced a “channel attention module” in their SE network structure to capture inter-channel dependencies. Woo et al. [39] proposed CBAM, which combines channel and spatial attention, adaptively learning weights in both dimensions to enhance the network’s expressive power and robustness. Meanwhile, Zhong et al. [40] presented a deep convolutional neural network model integrating a “global attention mechanism” and a “local attention mechanism” in sequence to capture both global and local contextual information. Inspired by these advancements, researchers began incorporating spatial attention into HSI classification. Several studies [41,42,43,44] combine spectral and spatial attention mechanisms, enabling adaptive selection of key features within HSI. However, a common practice in HSI classification is to segment the HSI into small patches and classify each patch according to its center pixel, which makes the information provided by the center pixel crucial; the methods above do not sufficiently account for this. Recent studies have recognized this issue, such as [45,46], which employ the Central Attention Module (CAM). This module determines feature weights by analyzing the correlation of each pixel with the center pixel. However, considering the phenomena of “same material, different spectra” and “different materials, same spectra” in HSI, relying solely on similarity to the center pixel for weight allocation may overlook important spatial information provided by other pixels. Therefore, effectively weighting the center pixel while taking global spatial information into account remains a challenge.
Mathematical Morphology (MM) primarily focuses on studying the characteristics of object morphology, processing and describing object shapes and structures using mathematical tools such as set theory, topology, and functional analysis [47]. In previous HSI classification tasks, researchers often utilized attribute profiles (APs) and extended morphological profiles (EPs) to extract spatial features more effectively [48,49,50,51]. However, this approach typically requires many structuring elements (SEs), which are non-trainable and thus unable to effectively capture dynamic feature changes. To overcome these limitations, Roy et al. proposed the Morphological Transformer (morphFormer) in [52], combining trainable MM operations with transformers and thereby enhancing the interaction between HSI features and the CLS token through learnable pooling operations. However, this method dilates and erodes spatio-spectral features simultaneously, and each SE introduces a significant number of parameters. This not only risks losing fine-grained feature information during feature selection but also leads to model overfitting and reduced robustness, especially in scenarios with limited data. Hence, there is substantial room for improvement in the application of MM to HSI classification.
The core contributions of this study are as follows:
We designed a flexible and efficient Center Enhanced Spatial Attention (CESA) module specifically for hyperspectral image feature extraction. This module can be easily integrated into various models, enhancing focus on areas around the center pixel while considering global spatial information;
We introduced Morphological Convolution (MC) to replace the traditional linear layer feature extraction mechanism in the transformer encoder. MC selects fine-grained features through a strategy of separating and then integrating spatial and spectral features, significantly reducing the number of parameters and enhancing the model’s robustness;
Utilizing these modules, we developed the CESA-MCFormer feature extractor, capable of effectively extracting key features from a multitude of channels, supporting various downstream classification tasks. We conducted in-depth ablation experiments to provide practical and theoretical insights for researchers exploring and applying similar modules.
The rest of the paper is organized as follows: Section 2 provides an overview of the CESA-MCFormer framework and details our proposed CESA and MC modules; Section 3 describes the experimental datasets, results under various parameter settings, and an analysis of the model parameters; Section 4 concludes with our research findings.
2. Methodology
The architecture of CESA-MCFormer is illustrated in Figure 1. For an input HSI patch, the spectral continuity information is first extracted through a 3D–2D Conv Block [52], and the channel dimensionality is transformed to 64. The resulting HSI feature is then fed into the Emb Block, which mixes spatial and spectral features and generates a 64 × 64 feature matrix. Next, a learnable CLS token, initialized to zero, is introduced for feature aggregation, along with a learnable matrix, also initialized to zero, for spatial–spectral position encoding. After the feature map is combined with the position encoding, it is passed through multiple iterations of the Transformer Encoder for deep feature extraction, and the extracted features are then fed into the Classifier Head for downstream classification tasks. Next, we provide a detailed introduction to the Emb Block and the Transformer Encoder.
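To make the token assembly concrete, the snippet below is a minimal PyTorch sketch of how the zero-initialized CLS token and position encoding could be attached to the 64 × 64 feature matrix produced by the Emb Block; the exact shape of the position-encoding matrix (here assumed to cover the CLS token as well) is an assumption, not a detail taken from the paper.

```python
import torch
import torch.nn as nn

dim, n_tokens = 64, 64
cls_token = nn.Parameter(torch.zeros(1, 1, dim))                 # learnable CLS token, zero-initialized
pos_encoding = nn.Parameter(torch.zeros(1, n_tokens + 1, dim))   # learnable spatial–spectral position encoding (shape assumed)

feat = torch.randn(2, n_tokens, dim)                              # a batch of Emb Block outputs: (B, 64, 64)
tokens = torch.cat([cls_token.expand(feat.size(0), -1, -1), feat], dim=1)
tokens = tokens + pos_encoding                                     # combine the feature map with the position encoding
# `tokens` would then be passed through multiple Transformer Encoder iterations.
```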
2.1. Emb Block
Given that the HSI patches input into the model are generally small in spatial size, we introduce the Emb Block to directly mix and encode global spatial features. This approach equips the model with a global receptive field before deep feature extraction, as illustrated in Figure 2. Since the information provided by the central pixel of the HSI patch is crucial, CESA is first used to weight the information at different positions, helping the model actively eliminate redundant information. Then, we introduce a learnable weight matrix, initialized with Xavier normal initialization and composed of 64 scoring vectors. By computing the dot product between the HSI features and each scoring vector, we score the features of each pixel; the scores are then transformed into mixing weights using the softmax function. Another learnable weight matrix, initialized in the same manner, remaps the HSI features of each pixel through matrix multiplication. Finally, by multiplying the two resulting matrices, we mix the spatial features according to the mixing weights and obtain the final feature encoding matrix.
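The following PyTorch sketch illustrates one way the Emb Block mixing step described above could be implemented; applying the softmax over spatial positions and the exact tensor layout are a reading of the text, not the authors' reference code.

```python
import torch
import torch.nn as nn

class EmbBlockSketch(nn.Module):
    def __init__(self, channels: int = 64, n_tokens: int = 64):
        super().__init__()
        # 64 scoring vectors, one per output token (Xavier normal initialization)
        self.score = nn.Parameter(torch.empty(n_tokens, channels))
        # remapping matrix applied to every pixel's spectral feature
        self.remap = nn.Parameter(torch.empty(channels, channels))
        nn.init.xavier_normal_(self.score)
        nn.init.xavier_normal_(self.remap)

    def forward(self, x):                           # x: (B, N, C) CESA-weighted pixel features
        scores = x @ self.score.t()                 # (B, N, 64): one score per pixel per scoring vector
        weights = scores.softmax(dim=1)             # mixing weights over the spatial positions (assumed axis)
        values = x @ self.remap                     # remap each pixel's spectral features
        return weights.transpose(1, 2) @ values     # (B, 64, C) mixed feature encoding matrix
```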
The overall architecture of CESA is illustrated in Figure 3. To comprehensively consider both global information and the importance of the central pixel, we designed two sub-modules: Soft CESA and Hard CESA. Hard CESA is a non-learnable module that statically assigns higher weights to pixels closer to the center. Soft CESA, conversely, is a learnable module that uses global information as a reference, enabling the model to adaptively select more important spatial information. This design effectively integrates global and local information, enhancing the overall performance of the model.
Specifically, CESA takes an HSI or its feature map as input. Hard CESA and Soft CESA compute and output a hard probabilistic diversity map and a soft probabilistic diversity map, respectively. The two maps are added together and then expanded along the channel dimension to match the size of the input before being element-wise multiplied with it. Finally, an optional simple convolutional module adjusts the dimensions of the output feature. The implementation details of Hard CESA and Soft CESA are presented in the following sections.
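As a usage-oriented sketch, the function below shows how the two probabilistic diversity maps could be combined and applied to an input feature, assuming both maps are single-channel tensors; the optional output convolution stands in for the dimension-adjustment module mentioned above.

```python
import torch

def cesa_forward(x, hard_map, soft_map, out_conv=None):
    """x: (B, C, H, W) input feature; hard_map, soft_map: (B, 1, H, W) diversity maps."""
    attn = hard_map + soft_map                  # add the static (hard) and learned (soft) maps
    out = x * attn.expand_as(x)                 # expand along the channel dimension, then re-weight
    return out_conv(out) if out_conv is not None else out   # optional dimension adjustment
```

In a host network, this amounts to inserting a single call such as `features = cesa_forward(features, hard_map, soft_map)` between two existing layers.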
2.1.1. Hard CESA
The output of Hard CESA depends only on the size of the input feature and the hyperparameter K. For a pixel q in the input, its position coordinates are defined as (x, y), and its spectral features are denoted accordingly. We define the center pixel of the patch together with its coordinates in the image, and the distance d between the center pixel and q is given by Equation (1). The value of the hard probabilistic diversity map for pixel q is then defined by Equation (2), where h is the side length of the input patch. The hyperparameter K controls the importance gap between the center and edge pixels: as K becomes larger, the weight of the center pixels increases and the weight of the edge pixels decreases. When K takes its minimum value, all pixels in the patch have equal weights, and Hard CESA therefore has no effect.
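Since Equations (1) and (2) are not reproduced here, the sketch below only illustrates the qualitative behaviour of Hard CESA under assumed formulas: a Euclidean distance to the patch center and a softmax-normalized, K-controlled decay. It is not the paper's exact weighting function.

```python
import torch

def hard_cesa_map(h: int, w: int, k: float = 1.0) -> torch.Tensor:
    """Static (non-learnable) weight map; larger k emphasizes the center, k = 0 is uniform."""
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1).expand(h, w)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1).expand(h, w)
    yc, xc = (h - 1) / 2.0, (w - 1) / 2.0                      # coordinates of the center pixel
    d = torch.sqrt((xs - xc) ** 2 + (ys - yc) ** 2)            # distance to the center (Euclidean assumed)
    m_hard = torch.softmax(-k * d.flatten(), dim=0).view(h, w) # K-controlled decay (assumed form)
    return m_hard                                               # (H, W) hard probabilistic diversity map
```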
2.1.2. Soft CESA
As shown in Figure 4, Soft CESA processes the input feature into three feature maps. The first two represent the overall features of the input, while the third introduces the features of the center pixel.
Specifically, for a pixel q in the input, its position coordinates are defined as (x, y), and its spectral features are denoted accordingly. The values of the first and second feature maps at position (x, y) are each computed from these spectral features. To effectively extract the center overall feature in Soft CESA, we introduce a central weight vector to weight the input, from which the value of the third feature map at position (x, y) is obtained. We extract the spectral features of the central pixel and its eight neighboring pixels, flatten them into a one-dimensional vector, and use this as the central feature vector. In addition, a matrix composed of c learnable spectral feature encoding vectors and a vector of c bias terms are introduced to weight and sum the spectral bands at each position, yielding the corresponding encoding r.
Finally, we concatenate the three feature maps along the channel dimension to form the final feature matrix F. After a convolutional layer, a softmax activation function is applied to produce the final soft probabilistic diversity map.
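The skeleton below reflects only the overall flow of Soft CESA (three single-channel maps, concatenation, convolution, and a spatial softmax). Because the per-map equations are not reproduced here, the internal computations of the two overall-feature maps, the fusion kernel size, and the way the central feature vector interacts with the per-position encodings are placeholder assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn

class SoftCESASketch(nn.Module):
    def __init__(self, channels: int, c: int = 8):
        super().__init__()
        self.overall_1 = nn.Conv2d(channels, 1, kernel_size=1)   # placeholder for the first overall-feature map
        self.overall_2 = nn.Conv2d(channels, 1, kernel_size=1)   # placeholder for the second overall-feature map
        self.encode = nn.Linear(channels, c)                      # c learnable spectral encoding vectors + c biases
        self.center_proj = nn.Linear(9 * channels, c)             # projects the flattened 3x3 center neighborhood
        self.fuse = nn.Conv2d(3, 1, kernel_size=3, padding=1)     # fusion convolution (kernel size assumed)

    def forward(self, x):                                          # x: (B, C, H, W), H and W >= 3
        b, _, h, w = x.shape
        f1, f2 = self.overall_1(x), self.overall_2(x)              # (B, 1, H, W) overall-feature maps
        # central feature vector: central pixel and its eight neighbors, flattened
        cy, cx = h // 2, w // 2
        center = x[:, :, cy - 1:cy + 2, cx - 1:cx + 2].reshape(b, -1)
        center_code = self.center_proj(center)                     # (B, c)
        pixel_code = self.encode(x.permute(0, 2, 3, 1))            # (B, H, W, c) per-position encodings r
        # placeholder interaction between center code and per-position encodings (assumed)
        f3 = (pixel_code * center_code[:, None, None, :]).sum(-1, keepdim=True).permute(0, 3, 1, 2)
        m = self.fuse(torch.cat([f1, f2, f3], dim=1))              # concatenate along channels, then convolve
        return torch.softmax(m.flatten(1), dim=1).view(b, 1, h, w) # soft probabilistic diversity map
```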
It can be observed that the entire CESA module uses only a small number of learnable parameters, and the parameter c can be flexibly adjusted through the preceding Conv Block. This means that the computational cost of CESA is very low, allowing it to be easily embedded into other models without significantly increasing their complexity.
2.2. Transformer Encoder
The primary function of the Transformer Encoder module is to extract deep spatial–spectral features through multiple iterations. As shown in Figure 5, in each iteration, HSI features are first processed through Spectral Morph and Spatial Morph for feature selection and extraction, followed by an interaction with the CLS token through Cross Attention, aggregating the spatial–spectral features into the CLS token.
To capture multi-dimensional features, we employ a multi-head attention mechanism in Cross Attention [29]. The input CLS token and HSI features are uniformly divided into eight parts along the spectral feature dimension, each with a feature length of eight. For each segmented feature, the CLS token serves as the query, and the matrix formed by concatenating the CLS token and the HSI features is used as the key and value. In this process, the query, key, and value projection matrices are all learnable parameters, while the feature length l is set to eight in this study. After the eight attention outputs are obtained, they are reassembled along the spectral feature dimension and then processed through a linear layer followed by a dropout layer; the result is added to the input CLS token to produce the updated CLS token.
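A PyTorch sketch of this Cross Attention update is given below, with eight heads of length eight, the CLS token as the query, and the concatenation of the CLS token and HSI features as key and value; the layer names and the scaled-dot-product form are standard choices rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionSketch(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads            # 8 heads, feature length l = 8
        self.to_q = nn.Linear(dim, dim, bias=False)                # learnable query projection
        self.to_k = nn.Linear(dim, dim, bias=False)                # learnable key projection
        self.to_v = nn.Linear(dim, dim, bias=False)                # learnable value projection
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.Dropout(dropout))

    def forward(self, cls_token, hsi_feat):        # cls_token: (B, 1, 64); hsi_feat: (B, 64, 64)
        kv_input = torch.cat([cls_token, hsi_feat], dim=1)         # CLS token concatenated with HSI features
        q = self._split(self.to_q(cls_token))                      # (B, heads, 1, l)
        k = self._split(self.to_k(kv_input))                       # (B, heads, 65, l)
        v = self._split(self.to_v(kv_input))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).flatten(2)                # reassemble the 8 heads -> (B, 1, 64)
        return cls_token + self.proj(out)                          # linear + dropout, added to the input CLS token

    def _split(self, x):                                           # (B, N, dim) -> (B, heads, N, head_dim)
        b, n, _ = x.shape
        return x.view(b, n, self.heads, self.head_dim).transpose(1, 2)
```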
Inspired by morphFormer [52], we also incorporate a Spectral Morph Block and a Spatial Morph Block into our model. The overall architecture of the two modules is identical, as shown in Figure 6. Both process HSI features through erosion and dilation modules. After this processing, the Spectral Morph Block uses a convolution layer (the blue Conv block in Figure 6) to extract deeper channel information, while the Spatial Morph Block uses a convolution layer to aggregate more channel information. The Morphological Convolution (MC) we propose is represented by the erosion and dilation modules in Figure 6. Next, we elaborate on how MC is implemented.
MC’s primary function is to eliminate redundant data during the feature extraction process, ensuring that as the depth of the encoder increases, the HSI feature retains only pivotal information. To accomplish this, we apply multiple learnable Structuring Elements (SEs) to the HSI feature for morphological convolution. Through dilation, we select maximum values from adjacent features, emphasizing boundary details; in contrast, erosion identifies the minimum values, effectively attenuating minor details. Additionally, directly employing full-size SEs would inflate the parameter count and pose overfitting risks. To mitigate this, we separate the spectral and spatial SEs, significantly reducing the number of parameters and thereby boosting the model’s robustness.
Specifically, to maintain consistency between the input and output dimensions of the module, we first reshape the spatial dimension of the HSI feature into two dimensions and then pad its boundaries, resulting in the feature matrix H. Next, using a sliding window with a stride of 1, we segment H into 64 sub-blocks matching the spatial size of the SEs, each referred to as an Xpatch. We then further decompose each Xpatch in both the spatial and spectral directions: spatially, the Xpatch is divided into vectors of dimension 64 (one per window position), and spectrally, it is parsed into 64 vectors (one per channel) whose dimension equals the number of window positions. We then introduce multiple SE groups, where each group consists of a spatial vector whose length equals the number of window positions and a spectral vector of length 64. For simplicity, we denote one group of SEs by W, comprising its spectral vector and its spatial vector. For any given Xpatch and W, the dilation operation of the morphological convolution is shown in Figure 7.
First, each segmented feature vector is added to the corresponding spatial or spectral component of W at the respective positions, and the maximum value is taken. Then, we introduce two learnable projection vectors and two learnable bias terms: the results of the previous step are concatenated into two one-dimensional vectors, dot-multiplied with the projection vectors, and added to the bias terms. Finally, the two resulting feature values are concatenated to form the convolution result of that Xpatch under the given SE group W.
In our experiments, 16 groups of W are used in the dilation block; after all groups have been computed with every Xpatch, the dilation module produces the final HSI feature. The erosion operation of the morphological convolution is defined analogously for any Xpatch and W, taking minimum values instead of maximum values.
Overall, we process the 64 Xpatch sub-blocks using 32 sets of SEs: 16 sets perform the dilation operation, while the other 16 handle the erosion operation, yielding two feature matrices. Thanks to the spatial–spectral separation, the parameter count required for the SEs is greatly reduced. Additionally, MC operates similarly to traditional convolutional layers, allowing it to directly replace convolutional layers in existing models; this endows MC with significant versatility and adaptability.
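To make the separated-SE idea concrete, the following PyTorch sketch implements a dilation branch of MC under assumed shapes (a 3 × 3 spatial SE, 16 SE groups, and 64-channel features). The "add the SE, then take the maximum" step, the learnable projection vectors, and the bias terms follow the description above, while the initialization and the final feature layout are assumptions. An erosion branch would mirror this sketch with minima in place of maxima.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCDilationSketch(nn.Module):
    def __init__(self, channels: int = 64, se_size: int = 3, groups: int = 16):
        super().__init__()
        self.se_size, self.groups = se_size, groups
        self.se_spatial = nn.Parameter(torch.zeros(groups, se_size * se_size))   # spatial SE vectors
        self.se_spectral = nn.Parameter(torch.zeros(groups, channels))           # spectral SE vectors
        self.proj_spatial = nn.Parameter(torch.randn(groups, channels) * 0.02)   # learnable projection vectors
        self.proj_spectral = nn.Parameter(torch.randn(groups, se_size * se_size) * 0.02)
        self.bias = nn.Parameter(torch.zeros(groups, 2))                         # one bias term per branch

    def forward(self, x):                                    # x: (B, 64, H, W) reshaped HSI feature, H*W = 64
        b, c, h, w = x.shape
        k2 = self.se_size ** 2
        patches = F.unfold(x, self.se_size, padding=self.se_size // 2)           # (B, C*k2, H*W)
        patches = patches.view(b, c, k2, h * w).permute(0, 3, 1, 2)              # (B, HW, C, k2): one Xpatch per pixel
        p = patches.unsqueeze(2)                                                  # (B, HW, 1, C, k2)

        # dilation along the spectral direction: add the spectral SE, take the maximum over the bands
        spec = (p + self.se_spectral.view(1, 1, self.groups, c, 1)).amax(dim=3)  # (B, HW, G, k2)
        spec = (spec * self.proj_spectral.view(1, 1, self.groups, k2)).sum(-1) + self.bias[:, 0]

        # dilation along the spatial direction: add the spatial SE, take the maximum over the window
        spa = (p + self.se_spatial.view(1, 1, self.groups, 1, k2)).amax(dim=4)   # (B, HW, G, C)
        spa = (spa * self.proj_spatial.view(1, 1, self.groups, c)).sum(-1) + self.bias[:, 1]

        return torch.cat([spec, spa], dim=-1)                 # (B, HW, 2*groups) dilated feature per position
```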