1. Introduction
In contrast to traditional panchromatic and multi-spectral images, hyperspectral images (HSIs) typically consist of dozens or even several hundred spectral bands in the visible and far-infrared spectra, and they can be effectively utilized to distinguish between different categories of land cover. In recent years, the analysis and processing of hyperspectral images have been applied in many fields [1], such as urban development and surveillance [2,3], environmental management [4], and agriculture [5].
Various supervised machine learning methods have been proposed and developed over time to improve the classification of HSIs, such as the support vector machine (SVM) [6,7,8], k-nearest neighbor (K-NN) [9,10], and random forest [11,12,13]. These algorithms only consider the discriminant information of spectral signatures. Subsequently, spectral–spatial-based algorithms were proposed that also consider spatial contextual features in order to improve classification accuracy and efficiency. The support vector machine with a composite kernel (SVMCK) is a representative patch-wise algorithm that simultaneously projects the spectral–spatial features into the reproducing kernel Hilbert space (RKHS) [14]. A joint sparse-representation-based approach (JSRC) simultaneously represents all pixels in a local patch with a group of common atoms from the training dictionary [15]. In ref. [16], a joint spectral–spatial derivative-aided kernel sparse representation of patch-based kernels was proposed for HSI classification, which simultaneously considers the derivative features of the spectral variation. Additionally, an adaptive non-local spectral–spatial kernel (ANSSK) was proposed to further exploit homogeneous spectral–spatial features in the embedded manifold feature space [17]. As for spatial filter feature extraction, various filter design algorithms, such as extended morphological profiles (EMPs) [18], edge-preserving features [19], and Gabor filters [20,21,22,23], have been proposed to improve classification performance. Most of the aforementioned classification algorithms rely on hand-crafted feature extractors and traditionally trained models; therefore, specialized field expertise is usually required for hand-crafted feature extraction.
Along with increased GPU computational resources, convolutional neural network (CNN)-based approaches have shown remarkable performance in visual tasks. For HSI classification, a 2D CNN with differently designed convolutional operators was proposed in [24]. Thereafter, Song et al. designed a deep feature fusion network (DFFN) [25]. A spectral–spatial residual network (SSRN) was proposed by Zhong et al. to extract spectral–spatial features in an orderly fashion and classify HSIs according to joint spectral–spatial features [26]. Swalpa et al. designed a spectral–spatial 3D CNN structure to reduce model complexity [27]. Mercede et al. proposed a rotation-equivariant model for HSI analysis, in which the conventional convolution kernel was substituted with circular harmonic filters (CHFs) [28]. Wei et al. divided pixels into different clusters as a material map for extracting spatial features to achieve effective classification [29]. Haokui et al. [30] proposed an HSI classification method with cross-sensor and cross-modal strategies based on transfer learning, which utilized RGB image data and other HSI data collected by arbitrary sensors as pre-training datasets. Wang et al. proposed a network architecture search (NAS)-guided lightweight spectral–spatial attention feature fusion network (LMAFN) for HSI classification [31]. A novel multi-structure KELM with an attention fusion strategy (MSAF-KELM) was proposed to achieve the accurate fusion of multiple classifiers for effective HSI classification with ultra-small sample rates [32]. Yue et al. [33] enhanced the representation of learned features by reconstructing the spectral and spatial features of an HSI to achieve robust unknown detection. In addition, the graph convolutional network (GCN) [34,35] and the fully convolutional neural network [36] have gradually attracted more attention due to their inherent advantages. For instance, an attention-weighted graph convolutional network (AwGCN) model was proposed to explore the internal relationships of data for semi-supervised label propagation in few-shot image classification [37]. L. Mou et al. constructed a graph-based end-to-end semi-supervised network, called the non-local GCN, that utilizes both labeled and unlabeled data [38]. A spectral–spatial 3D fully convolutional network (SS3FCN) was designed for the simultaneous exploration of spectral–spatial and semantic information [39]. In ref. [40], a fully convolutional neural network with de-convolution layers and an optimized ELM was introduced for HSI classification. To augment the available features, Zhu et al. [41] first explored a generative adversarial network (GAN) for HSI classification and demonstrated better performance with limited training samples compared to some traditional CNNs. Nevertheless, patch-wise GANs and CNNs suffer from a computational redundancy problem caused by the repetition of the patches of adjacent pixels during the training and testing processes.
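To make the redundancy concrete, consider that a patch-wise classifier re-reads almost the entire neighborhood of each pixel when classifying its neighbor. The following toy sketch (hypothetical numbers; the patch size is illustrative, not taken from any cited work) quantifies the overlap between the patches of two horizontally adjacent pixels:

```python
import numpy as np

def patch_overlap_ratio(w: int) -> float:
    """Fraction of pixels shared by the w x w patches of two
    horizontally adjacent pixels."""
    shared = w * (w - 1)   # overlapping region: w rows x (w-1) columns
    return shared / (w * w)

# With a 27 x 27 patch, two neighboring pixels share over 96% of their
# input, and this input is recomputed for every pixel in the scene.
print(patch_overlap_ratio(27))
```

A fully convolutional design avoids this by computing each feature map once for the whole image, which is the motivation behind the FCN-style approaches cited above.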
In practical applications, high-dimensional spectral features and limited labeled samples have consistently challenged classification tasks. As a consequence, unlabeled samples have been utilized to generate pseudo-labeled samples in order to enlarge the training set and improve classifier performance. Zhang et al. presented a semi-supervised classification algorithm based on simple linear iterative cluster (SLIC) splitting [42], which was expected to improve the efficiency of an extended training set by selecting pseudo-labeled samples (PLSs). A considerable number of unlabeled samples also provide abundant discriminant spectral–spatial features. Mingmin Chi et al. presented a continuation-method-based local optimization algorithm for global optimization, tuned with an iterative learning procedure during the learning phase of semi-supervised support vector machines (S3VMs) [43]. A non-parametric, kernel-based transductive support vector machine (TSVM) classification framework was proposed by L. Bruzzone to alleviate the Hughes phenomenon [44]. Meanwhile, semi-supervised learning frameworks based on spectral–spatial graph convolutional networks [36,45] and generative adversarial networks [46,47] have also been exploited to increase HSI classification accuracy by mitigating the problems caused by limited labeled samples.
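The core idea shared by these pseudo-labeling schemes can be sketched as follows. This is a generic, minimal illustration (the threshold value and function names are assumptions, not the formulation of any cited method): predictions on unlabeled samples are kept as pseudo-labels only when the classifier's confidence is high.

```python
import numpy as np

def select_pseudo_labels(probs: np.ndarray, threshold: float = 0.95):
    """Keep predictions whose top class probability exceeds the threshold;
    return (indices, labels) of the accepted unlabeled samples."""
    conf = probs.max(axis=1)        # confidence of the predicted class
    labels = probs.argmax(axis=1)   # predicted class per sample
    keep = np.where(conf >= threshold)[0]
    return keep, labels[keep]

# Toy softmax outputs for four unlabeled pixels
probs = np.array([[0.98, 0.01, 0.01],
                  [0.40, 0.35, 0.25],
                  [0.02, 0.96, 0.02],
                  [0.60, 0.30, 0.10]])
idx, lab = select_pseudo_labels(probs)
print(idx, lab)  # only the confident pixels (rows 0 and 2) are kept
```

The accepted samples are then merged into the training set and the classifier is retrained, trading a fixed global threshold against the risk of propagating wrong labels.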
In order to eliminate the computational redundancy caused by patch-wise algorithms and to fully and efficiently utilize the abundance of unlabeled samples, we established a novel active inference transfer convolutional fusion network (AI-TFNet) for HSI classification. The notable contributions of the proposed AI-TFNet are highlighted as follows:
In the proposed AI-TFNet, an active inference pseudo-label propagation algorithm for spatially homogeneous samples was constructed: the proposed TFNet segments the homogeneous area, and a spectral–spatial similarity metric learning function selects propagated pseudo-labels according to spectral–spatial homogeneity and continuity. Meanwhile, an end-to-end, fully hybrid multi-stage transfer fusion network (TFNet) was designed to improve classification performance and efficiency.
A confidence-augmented pseudo-label loss function (CapLoss) was designed to define the confidence of a pseudo-label by automatically assigning an adaptive threshold in homogeneous regions. This acquires homogeneous pseudo-label samples and actively infers pseudo-labels by augmenting the homogeneous training samples based on spatial homogeneity and spectral continuity.
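The adaptive-threshold idea can be illustrated with a minimal sketch. This is not the exact CapLoss formulation (which is defined in Section 2); it only shows, under the assumption that the threshold is the mean confidence within each homogeneous region, how per-region thresholds admit more pseudo-labels in easy regions than a single global cutoff would:

```python
import numpy as np

def adaptive_region_threshold(conf: np.ndarray, regions: np.ndarray) -> np.ndarray:
    """Accept a pseudo-label when its confidence reaches the mean
    confidence of its homogeneous region (illustrative rule only)."""
    accept = np.zeros_like(conf, dtype=bool)
    for r in np.unique(regions):
        mask = regions == r
        accept[mask] = conf[mask] >= conf[mask].mean()  # per-region threshold
    return accept

# Two homogeneous regions with very different confidence levels
conf = np.array([0.9, 0.8, 0.95, 0.1, 0.9, 0.2])
regions = np.array([0, 0, 0, 1, 1, 1])
print(adaptive_region_threshold(conf, regions))
```

A fixed global threshold (e.g., 0.85) would reject every pixel in the second region, whereas the region-adaptive rule still recovers its most confident sample.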
In addition, to reveal and merge the local low-level and global high-level spectral–spatial contextual features during different feature extraction stages, a fully hybrid multi-stage transfer convolutional fusion network was designed to achieve end-to-end HSI classification and improve classification efficiency.
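The fusion of stages can be sketched generically. The snippet below is an assumption-laden toy (array shapes, the 2x scale gap, and nearest-neighbor upsampling are illustrative choices, not TFNet's actual layers): a coarse high-level map is upsampled to the resolution of a fine low-level map and the two are concatenated along the channel axis.

```python
import numpy as np

def fuse_stages(low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Fuse a fine low-level map (H, W, C1) with a coarser high-level map
    (H//2, W//2, C2): upsample the coarse map 2x by nearest neighbor,
    then concatenate along the channel axis."""
    up = high.repeat(2, axis=0).repeat(2, axis=1)  # (H, W, C2)
    return np.concatenate([low, up], axis=-1)      # (H, W, C1 + C2)

low = np.zeros((8, 8, 16))   # early stage: spatial detail
high = np.zeros((4, 4, 32))  # late stage: semantic context
fused = fuse_stages(low, high)
print(fused.shape)  # (8, 8, 48)
```

The fused map keeps both detail and context per pixel, which is what allows a subsequent classifier head to operate on the whole image at once.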
Experimental results demonstrated that, compared to other related algorithms, our proposed AI-TFNet achieved better results on several different HSI scenario datasets in terms of accuracy and efficiency.
The rest of this paper is organized as follows. In Section 2, we introduce our proposed algorithm in detail. In Section 3, the parameter analysis and experimental results are illustrated and discussed. Finally, conclusions are drawn in Section 4.
4. Conclusions
In this paper, we proposed a novel active inference transfer convolutional fusion network (AI-TFNet) to improve the accuracy and efficiency of HSI classification, especially when training samples are limited. First, the proposed multi-stage hybrid spectral–spatial fully convolutional fusion structure (TFNet) overcomes the computational repetition caused by patch-wise deep-learning algorithms. In addition, the multi-stage hybrid structure merges low-level spectral–spatial features (detailed information) with high-level spectral–spatial features (contextual information), which not only avoids redundant patch-wise computations but also reveals local and high-level contextual features. Furthermore, a confidence score and the CapLoss function were designed to augment the training set with actively inferred pseudo-labeled samples and to support backpropagation in the training stage, even with small sample sets. The experimental results on three HSI datasets further demonstrated that the proposed TFNet and AI-TFNet achieved better accuracy, efficiency, and classification performance, regardless of sample size.
Although the proposed TFNet and AI-TFNet produced robust classification accuracy, expanding their application with more adaptive, automatic training-sample selection via online inference and contextual analysis is a challenging direction to be addressed in future research.