1. Introduction
Hyperspectral remote sensing can obtain the intrinsic characteristics and change patterns of objects by recording their electromagnetic wave characteristics without direct contact, making it a cutting-edge remote sensing technology [1]. Hyperspectral imaging (HSI) records the spatial information under each waveband and the spectral information at each position. Therefore, it has excellent application prospects in many fields, such as agriculture and forestry [2,3,4,5], ocean [6], disaster [7], mineral exploration [8,9], and urban construction [10,11,12]. HSI classification assigns a category label to each pixel based on sample features, and it is increasingly becoming a key technology in hyperspectral remote sensing.
In the first two decades of the evolution of HSI classification, many machine learning algorithms based on hand-crafted spectral and spatial features were proposed, for instance, the spectral angle mapper [13], support vector machines [14], sparse representation [15], manifold learning [16], Markov random fields [17], morphological profiles [18], and random forests [19]. However, due to the significant variability among different objects, classification algorithms based on manual feature extraction struggle to fit an optimal set of features for different objects and lack robustness and discriminability.
Recently, studies on HSI classification have heavily focused on deep learning (DL), since it can adaptively extract features from the input data in a hierarchical manner [20,21,22]. This allows DL to learn features in both the spectral and spatial dimensions without requiring prior statistical knowledge of the input data. Chen et al. [23] first introduced DL to HSI classification by applying a deep stacked auto-encoder (SAE). Similarly, in [24], the feasibility of using a deep belief network (DBN) for HSI classification was investigated. However, SAEs and DBNs can suffer degraded performance, as they use complex structures that modify the input data [25]. Researchers later found that convolutional neural networks (CNNs) [26] can effectively extract multi-level features from large samples, eliminating the need for complicated hand-crafted feature extraction. Hu et al. [27] first applied a one-dimensional CNN (1D-CNN) to HSI classification and obtained higher classification accuracy than many conventional machine learning techniques. Nevertheless, the 1D-CNN has limited ability to capture spatial relationships in the input data. In contrast, the two-dimensional CNN (2D-CNN) [28] learns how pixels in an image are related, allowing it to capture the complex spatial patterns that are important for accurate image classification. However, it may struggle to capture spectral relationships, as it treats the different spectral bands merely as separate channels of the image. To combine the advantages of the 1D-CNN and 2D-CNN, researchers have attempted various methods. Yu et al. [29] utilized a 1D-CNN to extract spectral features and a 2D-CNN to extract spatial-spectral features, resulting in highly accurate classification. The three-dimensional CNN (3D-CNN) [30] was proposed to operate directly on 3D HSI data; it can learn both spatial and spectral relationships in the input data, compensating for the weaknesses of 2D-CNNs. CNNs have since gained significant attention and popularity among scholars [31], as evidenced by recent studies. Zhong et al. [32] proposed a spectral-spatial residual network (SSRN) that employs 3D convolutions with residual connections to extract discriminative features. Li et al. [33] developed a double-branch dual-attention network (DBDA) that integrates spectral and spatial attention mechanisms to refine the extracted feature maps. Yan et al. [34] designed a dual-branch network structure that incorporates transfer learning to relieve the issue of insufficient samples in HSI classification. Through such network structures, both [33] and [34] investigated how multi-model features can improve HSI task performance. Although CNNs are well adapted to the high-dimensional and complex features of HSIs, they incur high computational complexity, and their classification accuracy can suffer when data annotation is insufficient. Furthermore, CNNs may require task-specific feature extractors and are prone to overfitting on small samples.
In supervised learning, sufficient labeled samples are required to provide a foundation for the classification algorithm [35]. However, labeling samples pixel by pixel is time consuming and costly. The combination of limited labeled samples and high-dimensional data can give rise to the Hughes phenomenon [36], a type of model overfitting caused by insufficient training data, which heavily affects classification accuracy. Zhang et al. [37] proposed a lightweight 3D network based on transfer learning to address the sample-limited problem. Sellami et al. [38] proposed a semi-supervised network with adaptive band selection to reduce dimensional redundancy and alleviate the Hughes phenomenon. Although deeper networks can extract richer features and achieve higher classification accuracy, when the number of training samples is vastly smaller than the data dimensionality, the parameter count grows explosively and gradients vanish during training. Li et al. [39] designed a depth-wise separable Res-Net framework, which separates spectral and spatial information in HSI and reduces network size to avoid overfitting. CNNs have shown remarkable performance in HSI classification tasks, and researchers have proposed various techniques, including transfer learning, adaptive band selection, and depth-wise separable networks, to improve the accuracy and robustness of small-sample HSI classification. However, convolution operations assign equal weights to all pixels or bands in an image, even though some pixels and bands are more beneficial for classification than others, and some may even interfere with classification.
Currently, the attention mechanism provides a solution to the aforementioned issue [40,41,42,43]. The attention mechanism draws inspiration from the visual focus region of the human brain: it helps the network concentrate on significant regions while ignoring irrelevant ones and performs adaptive weight fitting on features. This enhances the efficiency of feature extraction and reduces unnecessary computation and data preprocessing, making it a promising approach for HSI classification. Yu et al. [44] proposed a spatial-spectral dense CNN framework based on a feedback attention mechanism to extract high-level semantic features. Roy et al. [45] proposed an end-to-end trained adaptive spectral-spatial kernel improved residual network (A2S2K) with an attention-based mechanism to capture discriminative features for HSI classification. Li et al. [46] proposed a multi-attention fusion network (MAFN) that employs spatial and spectral attention mechanisms, respectively, to mitigate the effects of band redundancy and interfering pixels. Xue et al. [47] proposed the attention-based second-order pooling network (A-SPN) for modeling distinct and representative features by training the model with adaptive attention weights and second-order statistics. The attention mechanism learns more effective feature information but can lead to overfitting when the sample size is limited. Additionally, high-dimensional hyperspectral data carry a large amount of redundant information, within which a traditional single-attention mechanism struggles to locate adequate information quickly and accurately, so a deeper network is required.
We propose an attention-embedded triple-branch fusion convolutional neural network (AETF-Net) for HSI classification to address the aforementioned issues. As shown in Figure 1, the network comprises a spectral attention branch, a spatial attention branch, and a multi-attention fusion branch (MAFB). The spectral and spatial attention branches address feature redundancy and the correlation between the spectral and spatial dimensions, respectively. We design a global band attention module (GBAM) in the spectral branch with a novel SMLP to extract more discriminative band features. In the spatial branch, we adopt a bi-directional spatial attention module (BSAM) to extract spatial feature information in both the horizontal and vertical directions. To incorporate the extracted spectral and spatial features while reducing the computational cost, we introduce a large-kernel decomposition technique in the MAFB, which replaces a large-kernel convolution with a small-kernel depth-wise convolution and a depth-wise dilated convolution (a minimal sketch of this decomposition is given after the contribution list). In the proposed AETF-Net, multiple kinds of attention are fused to provide a reference basis for the relative importance of bands and pixels, supplying the 3D convolutions with different weight values. Consequently, the proposed AETF-Net ensures efficient feature extraction while avoiding the gradient vanishing and feature dissipation issues caused by deep neural networks. In conclusion, the main contributions of this paper are as follows.
A novel multi-attention-based module is introduced that incorporates spatial attention, spectral attention, and joint spatial-spectral attention. The proposed approach embeds spatial and spectral feature information into each level of the joint spatial-spectral feature extraction module via cascading to compensate for the feature loss issue of the deep neural network.
An improved spectral feature extraction mechanism is designed to generate more accurate band features and weighting information. Moreover, we introduce an innovative weight fusion strategy for feature enhancement to prevent data loss during feature fusion and preserve the relative size relationship between weights.
The proposed AETF-Net has been validated on three public datasets (i.e., IP, UP, and KSC) and shows significantly better classification results than existing methods. In particular, at small sample rates, our method outperforms both traditional and advanced methods, verifying its effectiveness.
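For a concrete picture of the large-kernel decomposition mentioned above, the following PyTorch sketch illustrates the general technique; the kernel sizes, dilation, and module name here are our illustrative assumptions, not the exact configuration of the MAFB.

```python
import torch
import torch.nn as nn

class LargeKernelDecomp(nn.Module):
    """Illustrative decomposition of a large-kernel convolution into a
    small depth-wise convolution, a depth-wise dilated convolution, and
    a 1x1 point-wise convolution (kernel sizes are assumptions)."""
    def __init__(self, channels):
        super().__init__()
        # 5x5 depth-wise convolution captures local context
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # 7x7 depth-wise convolution with dilation 3 enlarges the
        # receptive field to roughly that of a dense 21x21 kernel
        self.dw_dilated = nn.Conv2d(channels, channels, 7, padding=9,
                                    dilation=3, groups=channels)
        # 1x1 point-wise convolution mixes information across channels
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):          # x: (batch, channels, height, width)
        return self.pw(self.dw_dilated(self.dw(x)))

# Quick shape check: the spatial size is preserved.
y = LargeKernelDecomp(32)(torch.randn(2, 32, 11, 11))
print(y.shape)  # torch.Size([2, 32, 11, 11])
```

Compared with a dense large-kernel convolution, this factorization needs far fewer parameters and multiplications, which is the motivation behind the MAFB design.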
The rest of this paper is arranged as follows. Section 2 elaborates on the proposed AETF-Net. Section 3 describes the datasets in detail and analyzes the experimental results. Section 4 provides a comprehensive discussion of the differences between the proposed method and the comparative algorithms. Section 5 summarizes the paper and provides suggestions for further research.
3. Results
3.1. Dataset Description
The datasets used in this paper are Indian Pines (IP), University of Pavia (UP), and Kennedy Space Center (KSC). The sample numbers and corresponding colors of the three datasets are given in Table 1, Table 2 and Table 3.
The IP dataset is a widely used hyperspectral remote sensing dataset containing a scene captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in northwestern Indiana. The scene comprises two-thirds agricultural land and one-third forest or other natural perennial vegetation. Its image size is 145 × 145 pixels, with a spatial resolution of 20 meters/pixel (m/p) and a wavelength range from 0.4 to 2.5 μm covering 224 bands, of which 200 remain after removing the water absorption bands, and it contains 16 land-cover classes.
The UP dataset was captured in 2003 by the ROSIS sensor, a reflective optics system imaging spectrometer, over the University of Pavia in Northern Italy. It has a spatial resolution of 1.3 m/p, an image size of 610 × 340 pixels, 103 bands within the wavelength range from 0.43 to 0.86 μm, and 9 classes. Compared with the IP dataset, the UP dataset has fewer bands while still posing a high-dimensional and complex classification task.
The KSC dataset is a hyperspectral remote sensing dataset collected and released by the National Aeronautics and Space Administration (NASA); it was acquired over the Kennedy Space Center by the AVIRIS sensor in March 1996. It has a spatial resolution of 18 m/p, 512 × 614 pixels, and 224 bands from 0.4 to 2.5 μm, of which 176 remain after removing water absorption and noise bands, and it covers 13 ground cover types. The KSC dataset has the same number of original bands as the IP dataset but far fewer labeled samples, which places higher demands on the algorithm.
3.2. Experimental Setup
To demonstrate the efficiency of the proposed method, we conducted a series of classification experiments on three well-known hyperspectral datasets. The comparison methods are all CNN-based and can be divided into two categories: traditional CNN-based methods (2D-CNN, 3D-CNN, Res-Net, and SSRN) and CNN-based methods with attention mechanisms (DBDA, A2S2K, MAFN, and A-SPN). All comparison methods use the same parameter settings as in their corresponding references. Classification performance is evaluated using three metrics: overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa). All methods were repeated ten times independently, after which the mean and standard deviation were computed to guarantee the generalizability of the experimental results.
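For reference, the three metrics can be computed as in the following sketch; these are the standard definitions, and the function name and use of scikit-learn are our choices, not the authors' code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def evaluate(y_true, y_pred):
    """Compute OA, AA, and Kappa from flattened label arrays."""
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()                 # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))   # mean of per-class recalls
    kappa = cohen_kappa_score(y_true, y_pred)    # chance-corrected agreement
    return oa, aa, kappa
```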
In our experiments, each of the three datasets was divided into a 1% training set, a 1% validation set, and a 98% test set. During the training phase, we continuously adjusted certain hyperparameters of the model, such as the convolution kernel size, patch size, and learning rate, based on the training results. The model was trained with the Adam optimizer and the cross-entropy loss function. In the validation phase, the 1% of samples randomly selected as the validation set was used to compute performance metrics and select the best-performing model as the final model. In the final testing phase, the remaining 98% of samples were used to test the best model and obtain the reported results. For any class whose 1% share contained fewer than two samples, two training samples were taken.
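A minimal sketch of this splitting procedure is given below; the helper name and the exact rounding behavior are our assumptions, with the two-sample floor applied per class as described above.

```python
import numpy as np

def stratified_split(labels, train_frac=0.01, min_per_class=2, seed=0):
    """Split labeled pixel indices into train/val/test per class.
    labels: 1-D array of class ids for all labeled pixels."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n = max(int(round(train_frac * idx.size)), min_per_class)
        train.extend(idx[:n])          # 1% of the class, at least two samples
        val.extend(idx[n:2 * n])       # validation mirrors the train size
        test.extend(idx[2 * n:])       # remainder forms the test set
    return np.array(train), np.array(val), np.array(test)
```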
By employing an early stopping strategy during the training phase, we found that the model's loss and accuracy stabilize at around 200 epochs, so we ultimately trained the model for 200 epochs. The batch size was set to 64, and the Adam optimizer was used in the proposed method. The learning rate was initialized at 0.01 and then adjusted using the cosine annealing algorithm. The k value in the SMLP structure of GBAM was set to 9 based on the optimal experimental results. All experiments were run in PyTorch on a computer with an Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00 GHz, 64 GB RAM, and an NVIDIA GeForce RTX 3090 graphics card.
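The optimization setup described above corresponds to the following PyTorch sketch; the model and data here are placeholders standing in for the actual AETF-Net and HSI loader.

```python
import torch
import torch.nn as nn

# Placeholder network and data (NOT the actual AETF-Net or HSI loader).
model = nn.Sequential(nn.Flatten(), nn.Linear(200 * 11 * 11, 16))
data = torch.utils.data.TensorDataset(torch.randn(256, 200, 11, 11),
                                      torch.randint(0, 16, (256,)))
loader = torch.utils.data.DataLoader(data, batch_size=64, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):               # 200 epochs, as set above
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)  # cross-entropy loss
        loss.backward()
        optimizer.step()
    scheduler.step()                   # cosine-annealed learning rate
```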
3.2.1. The Effect of the Number of Training Samples
To further analyze the effect of the number of training samples on the proposed AETF-Net, we split the three datasets into training, validation, and test sets with varying proportions; the validation set is always the same size as the training set, and the remainder constitutes the test set. The other hyperparameters were kept consistent with those above. For the IP dataset, the training ratio varies over 1%, 3%, 5%, 10%, and 20% of the samples; for UP and KSC, it varies over 1%, 3%, 5%, 7%, and 10%, respectively.
Figure 5 shows the classification results of the proposed method with different numbers of training samples; the vertical axis is OA, and the horizontal axis is the training set ratio. For all three datasets, OA increases as the number of training samples grows until it stabilizes. For the IP dataset, OA plateaus when the training set size is between 3% and 5%, improves markedly after 5%, and stabilizes when the training ratio reaches 10%. The data distributions of UP and KSC are not as heterogeneous as that of the IP dataset; therefore, OA becomes stable after the training set size reaches 4% and approaches 100%, and for the UP dataset, which has a sufficient number of samples, this occurs after only 1%.
3.2.2. Effectiveness of the k Value in the SMLP Structure
A series of experiments were conducted to verify the effectiveness of the improved SMLP structure in the GBAM module by setting various values of the hyperparameter k; the remaining hyperparameter settings were consistent with those described above. We first conducted experiments on the original MLP structure, followed by experiments on the improved SMLP structure with different values of k (3, 5, 7, 9, 11, 13, 15). To ensure fairness, all experiments were run independently 10 times, and the final average results were compared. As shown in Table 4, when the original MLP was used in the channel attention module, the OA was 2.29% lower than that of the improved SMLP structure (k = 9), indicating that the SMLP can exploit the inter-band correlation within the sliding window of the convolutional kernel to extract more useful features than the original MLP structure.

However, performance decreased significantly when k was set to 3 or 5, falling even below that of the original MLP structure, because a small window cannot capture all the local features, leading to feature loss. As k increases, the sliding window captures more inter-band correlation and local features. However, when k exceeds 11, the classification accuracy starts to decline: the windows overlap heavily, the same local information is extracted multiple times, and overfitting results. Therefore, based on the best experimental results, we set k to 9, increase the number of convolutional kernels to extract diverse features of the data, and keep the stride small enough to retain local features. Combined with a deconvolution, this reduces the dimensionality while retaining the important features of the input data, eliminates the influence of the one-dimensional convolution layer, and restores the data to its original dimensionality.
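As an illustration of this design, the sketch below implements an SMLP-style band attention with a sliding 1D convolution (kernel size k = 9) followed by a transposed convolution that restores the band dimensionality. The channel width r and the overall layout are our assumptions, not the exact SMLP used in the GBAM.

```python
import torch
import torch.nn as nn

class SMLPAttention(nn.Module):
    """Sketch of SMLP-style band attention (assumed structure)."""
    def __init__(self, k=9, r=4):
        super().__init__()
        # 1D convolution slides over the band axis, mixing each band
        # with its k-1 neighbours to capture inter-band correlation
        self.conv = nn.Conv1d(1, r, kernel_size=k, padding=k // 2)
        # transposed convolution restores the original band dimensionality
        self.deconv = nn.ConvTranspose1d(r, 1, kernel_size=k, padding=k // 2)

    def forward(self, x):                          # x: (B, C, H, W)
        w = x.mean(dim=(2, 3)).unsqueeze(1)        # global pooling -> (B, 1, C)
        w = self.deconv(torch.relu(self.conv(w)))  # (B, 1, C), length preserved
        w = torch.sigmoid(w).squeeze(1)            # band weights in (0, 1)
        return x * w[:, :, None, None]             # re-weight each band

x = torch.randn(2, 200, 11, 11)
print(SMLPAttention()(x).shape)  # torch.Size([2, 200, 11, 11])
```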
3.2.3. The Effect of Patch Size
The patch size of the training network has an essential effect on classification performance. Typically, a larger patch contains more spatial information, leading to better classification performance. However, a larger patch also introduces a massive number of parameters and exacerbates the limited-sample learning issue.
In this section, we design several experiments based on the dataset partitioning method and parameter settings described above to analyze the effect of the patch size on the proposed method.
Figure 6 shows the classification results of the proposed method with different patch sizes. For the IP dataset, OA reaches 80% when the patch size is 5 × 5; as the patch size increases, OA gradually improves and plateaus near 90% once the patch exceeds 13 × 13. For the KSC dataset, OA decreases once the patch grows beyond a certain size. The reason is that the objects in KSC are small and dispersed, so a large patch contains multiple classes, which provides negative information for classifying the patch's center pixel. Thus, considering the computational cost and the HSI scenes, we set the patch size to 11 × 11 in our experiments.
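For concreteness, patch extraction around a center pixel can be sketched as below; this is our illustrative helper, and the reflect padding at image edges is an assumption.

```python
import numpy as np

def extract_patch(cube, row, col, size=11):
    """Crop a size x size spatial patch centred on (row, col) from an
    HSI cube of shape (H, W, C), padding at the borders."""
    r = size // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    # after padding, original pixel (row, col) sits at (row + r, col + r)
    return padded[row:row + size, col:col + size, :]

patch = extract_patch(np.zeros((145, 145, 200)), 0, 0)
print(patch.shape)  # (11, 11, 200)
```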
3.3. Result and Analysis
Experimental results on the IP dataset: As shown in Table 5 and Figure 7, the proposed AETF-Net obtains the highest accuracy among all the methods, with 89.58% OA, 87.91% AA, and 88.17% Kappa, and achieves the most detailed and smooth classification maps. The 3D-CNN has better feature extraction capability than the 2D-CNN because it can incorporate both spectral and spatial information; however, with insufficient training samples, the overfitting caused by the conflict between the high data dimensionality and the small sample count leaves its OA 5.47% below that of the 2D-CNN. Res-Net has the worst classification result, with 36.06% OA, because its many layers are highly redundant and cannot be trained effectively with so few samples. A2S2K adds an attention mechanism to the residual structure to weight the valuable features, yielding a 0.27% improvement in OA over SSRN, which illustrates the effectiveness of the attention mechanism.

MAFN, DBDA, and A-SPN also introduce attention mechanisms and obtain higher accuracy than the methods without them. Among them, DBDA captures rich spatial and spectral features using a two-branch, densely connected network and obtains the highest accuracy among the compared methods, with 82.19% OA. However, DBDA does not use attention to locate the regions of interest at the very beginning, so the evaluation indices of our method are 7.39%, 6.46%, and 8.63% higher than those of DBDA. The unbalanced sample distribution of the IP dataset leaves very few training samples for some classes after the 1% split. A-SPN performs better on small-sample classes: its accuracies for classes 7 and 9 (i.e., Grass-pasture-mowed and Oats) are higher than those of the proposed method. However, its performance on classes 3, 4, 12, and 15 (i.e., Corn-mintill, Grass-pasture, Soybean-clean, and Buildings-Grass-Trees-Drives) is significantly lower than that of the proposed method, because these classes lie at the edges of the image with many neighboring classes, making their blurred boundaries difficult to classify correctly. As a result, its overall OA, AA, and Kappa are 14.90%, 13.18%, and 17.87% lower, respectively, than those of the proposed method. To further evaluate the classification performance visually, the ground-truth map and the classification results of the eight comparison methods are shown in Figure 7. The 2D-CNN, 3D-CNN, and Res-Net maps show considerable noise within classes and at class boundaries. The maps of SSRN, A2S2K, MAFN, and A-SPN contain fewer noise points, although their misclassification rates are higher than that of DBDA. By comparison, the classification map of our proposed method has the fewest noise points and misclassified pixels on the boundaries between classes and is closest to the ground-truth map.
Experimental results on the UP dataset: Table 6 and Figure 8 show the numerical and visual results of the UP comparison experiments. The OA of the proposed AETF-Net is improved over the attention-based methods A2S2K, MAFN, DBDA, and A-SPN by 1.76%, 0.82%, 0.86%, and 2.66%, respectively. Due to the relatively balanced class distribution of this dataset, the 2D-CNN, 3D-CNN, and Res-Net obtain relatively high classification accuracy. MAFN and DBDA both outperform SSRN, A2S2K, and A-SPN, and MAFN achieves the second-best results through its multi-scale, multi-attention feature extraction framework; however, MAFN lacks information interaction and feature transfer between the extraction of spectral and spatial attributes, yielding one-sided features. Compared with the similar multi-attention fusion method DBDA, our method achieves higher accuracy, with gains of 0.86% OA, 1.09% AA, and 0.45% Kappa, which demonstrates the effectiveness of our proposed feature fusion strategy. In addition, our method can reduce the network depth while extracting sufficient feature information, which avoids the overfitting problem caused by limited samples. The classification map of our method also performs better on the UP dataset: in the 2D-CNN, 3D-CNN, Res-Net, SSRN, and A-SPN maps, classes 2 and 6 show considerable noise, while the noise points are significantly reduced in the other methods thanks to the multi-attention structures used in MAFN, DBDA, and the proposed method, demonstrating the effectiveness of the multi-attention strategy. Overall, the classification map produced by our method has more precise object edges and is closest to the ground-truth map.
Experimental results on the KSC dataset: Under the 1% division, the KSC dataset has only 50 training samples; as shown in Table 7, the proposed method still achieves the best classification accuracy, with 96.48% OA, 95.00% AA, and 96.08% Kappa, and obtains the clearest results for hard-to-distinguish categories such as classes 4, 6, 8, and 9. Of the thirteen classes, the proposed method achieves the highest accuracy on eight, with classes 10 and 13 attaining the best precision. Although the KSC dataset has the fewest training samples, good classification accuracy is obtained because the dataset is relatively balanced, the feature distribution is dispersed, and inter-class differences are less influential. However, due to the limited samples, the classification accuracies of the 2D-CNN, 3D-CNN, and Res-Net remain unsatisfactory. Although MAFN performed well on the IP and UP datasets, it falls behind on the KSC dataset because of the minimal and balanced number of samples in each class, which indicates that MAFN is unsuitable for small-sample classification. In addition, A2S2K attains the best classification accuracy among all the compared methods because it employs an attention mechanism at the beginning of its framework to extract valuable characteristics. As shown in Figure 9, the proposed method produces a smoother visual result than the other methods, and its classification map is closest to the ground-truth map.
Furthermore, for the proposed method, the standard deviation over ten runs of almost every class accuracy, as well as of OA, AA, and Kappa, is lower than that of the other methods. This demonstrates that the proposed method produces less variation and more stable results on small samples of different datasets, implying that it is more robust and can be adapted to a broader range of hyperspectral datasets.
3.4. Ablation Study
To further validate the contribution of the GBAM, BSAM, and MAFB in the proposed framework to the final classification results, ablation experiments were conducted while maintaining the original experimental setup.
The effectiveness of the three branches is examined with five variants: (1) GBAM: only the GBAM for spectral feature extraction, followed by the classifier; (2) BSAM: only the BSAM for spatial feature extraction, followed by the classifier; (3) LCNN: the LCNN network of the MAFB without fusing GBAM and BSAM; (4) LCNN + GBAM: the LCNN network of the MAFB fused with GBAM but without BSAM; (5) LCNN + BSAM: the LCNN network of the MAFB fused with BSAM but without GBAM.
From the results in Table 8, we can see that GBAM alone and BSAM alone perform worse than the other variants, because classification based on a single spectral or spatial feature is significantly inferior to classification based on spectral-spatial feature fusion. The LCNN outperforms GBAM and BSAM by about 3.86% to 10% in OA because it combines spectral and spatial features through the 3D convolutional operation.
Additionally, the OA of "LCNN + GBAM" and "LCNN + BSAM" increases by 0.38% to 1.72% compared with that of "LCNN", which proves the effectiveness of the attention in GBAM and BSAM for classification. In particular, BSAM clearly helps, improving AA by about 4.1%. This demonstrates that capturing spatial relationships between feature mappings, or long-range dependencies, via the attention mechanism can significantly enhance the performance of the HSI classification model.
Lastly, the best classification results are obtained when the spatial context information and the band dependencies are injected concurrently into each stage of the MAFB for spatial-spectral joint attention feature extraction, demonstrating the effectiveness of the proposed multi-attention fusion mechanism.
3.5. Analysis of the Multi-Attention Fusion Strategy
The fusion strategy is essential for multi-attention fusion and significantly affects the classification performance. In this section, we design six multi-attention fusion strategies within the AETF-Net framework and conduct experiments to analyze their effect on classification performance.
The six multi-attention fusion strategies are shown in Figure 10. They fall into two groups: attention weight fusion (Figure 10a) and attention feature map fusion (Figure 10b). In the weight fusion strategies, the output of each attention module is a weight matrix, and only the weights are combined; in the feature map fusion strategies, the output of each attention module is the combination of its weights and the input maps, and these feature maps are combined. Specifically, the six strategies are designed as follows (a minimal sketch of the weight fusion variants is given after the list):
- (1) Figure 10a(1): the attention weight matrices produced by the GBA and BSA modules are element-wise multiplied and then multiplied with the original feature maps.
- (2) Figure 10a(2): the attention weight matrices produced by the GBA and BSA modules are element-wise added and then multiplied with the original feature maps.
- (3) Figure 10b(1): the feature maps produced by the GBAM and BSAM modules are element-wise added and then added to the original feature maps.
- (4) Figure 10b(2): the feature maps produced by the GBAM and BSAM modules are element-wise added and then multiplied with the original feature maps.
- (5) Figure 10b(3): the feature maps produced by the GBAM and BSAM modules are element-wise multiplied and then added to the original feature maps.
- (6) Figure 10b(4): the feature maps produced by the GBAM and BSAM modules are element-wise multiplied and then multiplied with the original feature maps.
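The two weight fusion variants in Figure 10a can be sketched as follows; the shapes and names here are illustrative assumptions, and in the actual network, the weights come from the GBA and BSA modules.

```python
import torch

def fuse_attention_weights(x, w_band, w_spatial, mode="mul"):
    """Sketch of the Figure 10a strategies.
    x:         (B, C, H, W) original feature maps
    w_band:    (B, C, 1, 1) spectral weights from the GBA module
    w_spatial: (B, 1, H, W) spatial weights from the BSA module"""
    joint = w_band * w_spatial if mode == "mul" else w_band + w_spatial
    return x * joint        # joint weights modulate the original maps

x = torch.randn(2, 200, 11, 11)
out = fuse_attention_weights(x, torch.rand(2, 200, 1, 1),
                             torch.rand(2, 1, 11, 11), mode="mul")
print(out.shape)  # torch.Size([2, 200, 11, 11])
```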
Table 9 shows the classification results of the proposed AETF-Net with the six multi-attention fusion strategies. Comparing the two groups, the attention weight fusion strategies hold a slight advantage over the attention feature map fusion strategies in classification accuracy. Both groups exploit the attention mechanism for spectral-spatial feature learning, but the feature map fusion strategies cost more computing resources to generate the feature maps and lead to information redundancy.
Among the attention weight fusion strategies, the multiplication strategy retains the relative size relationship between different feature mappings better than the addition strategy: it preserves the variability between features, strengthening the compelling features while suppressing redundant ones, and thereby improves classification performance. Thus, the proposed AETF-Net adopts the attention weight fusion strategy with multiplication (Figure 10a(1)).
3.6. Running Time Analysis
We computed the training and testing times of the different methods using randomly selected samples. As shown in Figure 11, the proposed approach significantly reduces the training time compared with traditional DL methods. This is primarily because traditional DL methods stack multiple convolutional and pooling layers to extract feature information, which produces a large number of parameters when processing high-dimensional hyperspectral data. However, our method does not achieve the shortest training and testing times among the attention-based DL methods, suggesting that further improvements are needed. Judging from our model framework, the extra computational cost likely stems from the multiple information fusion steps in the backbone network. Nevertheless, our proposed method fully extracts and fuses the spatial and spectral features of HSI, and the increase in computational cost is justifiable given the significant improvement in classification accuracy. A-SPN, which has the shortest processing time, likely owes this to its abandonment of the hierarchical structure of traditional convolution and pooling layers, which greatly reduces its parameter cost. This is a direction worth exploring in our future work.
4. Discussion
The proposed AETF-Net method has shown remarkable performance in terms of accuracy and classification map quality on three publicly available datasets, surpassing existing state-of-the-art methods.
Firstly, one of the key factors affecting the accuracy of deep learning-based image classification is the number of training samples. The 3D-CNN outperforms the 2D-CNN in feature extraction capability, but overfitting becomes a challenge when the number of training samples is insufficient, and Res-Net's redundant layers lead to worse classification results. Attention mechanisms, as seen in the A2S2K, MAFN, DBDA, and A-SPN methods, have been demonstrated to improve accuracy, especially for small-sample classes.
Additionally, despite having a limited number of training samples, the AETF-Net method achieved the best classification accuracy on KSC, aided by the dataset's balanced class distribution and dispersed feature distribution. The study emphasizes the limitations of existing methods for small-sample classification and highlights the importance of attention mechanisms in achieving high accuracy. Furthermore, the results demonstrate the potential of AETF-Net to improve image classification tasks and its robustness across a broader range of datasets.
Furthermore, the study’s findings also suggest that AETF-Net has the potential to overcome challenges associated with unbalanced sample distributions and misclassification at class boundaries by minimizing noise and improving classification accuracy. This has significant implications for the development of more reliable and accurate image classification in practical applications.
In conclusion, the results of this study have important implications for the development of deep learning-based image classification methods. The study emphasizes the importance of continued research in this area to improve accuracy and overcome the challenges associated with small sample classification, unbalanced sample distributions, and misclassification at class boundaries.