1. Introduction
Hyperspectral image (HSI) classification is an important research problem in remote sensing (RS) with a broad range of applications. Unlike RGB images, hyperspectral data comprise both spectral signatures and spatial contexts. On the one hand, this provides abundant spectral–spatial information for classification; on the other hand, it raises challenges in extracting high-dimensional features [1,2,3,4].
Early HSI classification methods mainly focused on selecting or extracting spectral features, owing to the abundant spectral information contained in the hundreds of contiguous spectral bands. Feature-selection (also known as band-selection) methods try to find the most representative features (bands) in the raw HSI data so as to preserve their physical meaning. For instance, Wang et al. [5] used manifold ranking as an unsupervised feature-selection method to choose the most representative bands for training subsequent classifiers. Yin et al. [6] introduced a computational evolutionary strategy into supervised band selection, in which candidate band combinations are evaluated through an affinity function driven by hyperspectral classification accuracy. Feature-extraction approaches usually learn representative features through linear or nonlinear transformations. For instance, Huang et al. [7] extended the k-nearest-neighbor technique and proposed a dimensionality-reduction method called double nearest proportion feature extraction. Building on nonparametric weighted feature extraction (NWFE), a linear-transformation method, Kuo et al. [8] proposed kernel-based NWFE, which combines the advantages of linear and nonlinear transformations.
These spectrum-based approaches select or extract features directly from the pixel-wise spectra while ignoring the intrinsic geographical structure of HSI data. Recent studies have shown that the combined use of spectral and spatial information can enhance the representational power of the extracted features. There are two categories of methods for extracting spectral–spatial information from HSI data. The first extracts the spectral signatures and the spatial contexts separately and then combines them to perform pixel-wise classification [9]. The second treats the raw HSI data as a whole and extracts joint spatial–spectral features directly using a 3D feature extractor; for example, spectral–spatial integrated features have been extracted at different frequencies and scales using series of 3D discrete wavelet filters [10], 3D Gabor wavelets [11] or 3D scattering wavelets [12]. Since hyperspectral data are typically presented as 3D cubes, the second category can yield a large number of discriminative features, which effectively improves classification performance.
The above traditional approaches typically rely on handcrafted features, which are expected to be discriminative and representative of the characteristics of HSI data. Such features are based on domain knowledge and may lose some valuable details. For feature classification, support vector machines (SVMs) [13] are often employed because they are robust to high-dimensional inputs, but their representational capacity is still limited.
Since 2012, with the emergence of deep learning, the performance of many vision tasks has improved dramatically, including but not limited to object detection [14], segmentation [15] and tracking [16]. In recent years, deep-learning-based methods have been introduced to HSI classification. In particular, supervised convolutional neural networks (CNNs) and their extensions, including 1D-CNN [17,18], 2D-CNN [19,20], 3D-CNN [18,21] and ResNet [22,23], have been successfully employed to extract deep spectral–spatial features and have demonstrated state-of-the-art performance. Usually, a CNN consists of at least three convolutional layers that extract both low-level and high-level features. Moreover, instead of separating feature extraction and feature classification into two steps, a CNN integrates them into one framework trained end-to-end through back-propagation [24]. Since the extracted features directly contribute to the final classification performance, deep learning methods achieve better performance than traditional methods.
However, two constraints prevent state-of-the-art deep CNNs from being used directly for HSI classification. The first is the difference in data format between RGB images and HSI. RGB images are well represented by 2D CNN models, while 3D CNNs are preferable for preserving the abundant information carried by the spectral signatures and spatial contexts of HSI. However, the number of parameters grows rapidly when convolution moves from 2D to 3D [25]: a 3D CNN has far more parameters than its 2D counterpart due to the additional kernel dimension, making it more difficult and expensive to train. The second is the limited-training-sample dilemma. The representation ability of deep learning models strongly depends on large numbers of training samples, but manual annotation of hyperspectral data is difficult, resulting in a shortage of labeled pixels. Without sufficient training samples, a deep model with powerful representation capacity may overfit. Therefore, most existing CNN-based HSI classification methods resort to small-scale models of limited depth (generally no more than 10 layers) at the cost of reduced performance. Nevertheless, large-scale networks remain desirable for jointly exploiting the underlying nonlinear spectral and spatial structures of hyperspectral data residing in a high-dimensional feature space [26].
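To make the parameter growth from 2D to 3D convolutions concrete, the short calculation below compares the weight count of a single 2D and 3D convolution layer; the channel widths are illustrative assumptions, not values from any model in this paper. Per layer, the growth factor equals the kernel depth, and it compounds with the larger 3D feature volumes that must be stored and processed.

```python
# Weight counts of one convolution layer (biases omitted); only the
# kernel volume differs. Channel widths are illustrative.
c_in, c_out, k = 64, 64, 3
params_2d = c_in * c_out * k * k      # 3x3 kernel:   36,864 weights
params_3d = c_in * c_out * k * k * k  # 3x3x3 kernel: 110,592 weights
print(params_2d, params_3d, params_3d // params_2d)  # 36864 110592 3
```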
To address these inherent problems, in this paper we propose a 3D asymmetric inception network (AINet) and a data-fusion transfer learning strategy for HSI classification. Our contributions can be summarized in four points:
A novel light-weight deep 3D CNN with an asymmetric structure, AINet, is proposed for HSI classification; it enables a very deep network to be trained on the small volume of available HSI data and fully exploits the potential of CNNs.
Considering the properties of hyperspectral images, in which spectral signatures are emphasized over spatial contexts, an asymmetric inception unit (AI unit) is proposed. To convey and classify features effectively, the 3D convolution layer is replaced with two asymmetric inception units, namely the space inception unit and the spectrum inception unit.
Data-fusion transfer learning is exploited to improve model initialization, increasing training efficiency and classification performance while compensating for limited training data.
The proposed method was evaluated on three public HSI datasets. The experimental results show that it achieves better performance than other state-of-the-art deep-learning-based methods.
3. Methodology
Among the deep learning models used in the HSI literature, 3D-CNNs perform better than 2D-CNNs for HSI classification because HSI data are inherently three-dimensional. Different objects in HSI generally have different spectral structures, so convolving along the spectral dimension is critical. In addition, some distinct objects share similar spectral structures; for these objects, convolving along the spatial dimensions is also beneficial, as it captures important spatial variations observed in high-resolution data [27,28]. For 2D-CNN-based methods, without spectral dimension reduction, the number of parameters becomes extremely large due to the hundreds of bands. However, if dimension reduction is applied, it may destroy the spectral structure that is critical for discriminating different objects.
Generally speaking, 3D-CNN-based approaches outperform 2D-CNN-based approaches [18,22]. However, the existing 3D-CNN-based approaches still have two deficiencies: (1) compared with 2D convolutions, 3D convolutions have more parameters, making 3D-CNN models computation-intensive; (2) limited by the training samples in HSI datasets, 3D-CNN models employed in HSI classification almost always consist of fewer than five convolution layers, whereas extensive experiments in computer vision have shown that network depth is critical to the performance of image-processing tasks [23,30].
In this section, we first introduce the proposed AINet and then describe the proposed data-fusion transfer learning strategy.
3.1. AINet for HSI Classification
Network Structure: Figure 1 shows the overall framework of the proposed AINet for HSI classification. To utilize the spectral and spatial information contained in HSI, we extract L × S × S-sized cubes from the raw HSI data as samples, where L and S denote the number of spectral bands and the spatial size, respectively (following [18], we set S to 27 in this paper). The samples are fed into AINet to extract deep spectral–spatial features, and the classification results are then computed. Inspired by the design of ResNet [23], AINet employs a similar basic structure with key modifications tailored to HSI data. AINet starts with a 3D convolution layer, then stacks six AI units of increasing width, and ends with one 3D spatial pyramid pooling layer and one fully connected layer. Specifically, the channels of the six AI units are 32, 64, 64, 128, 128 and 256, respectively. To reduce the feature dimensions, four max-pooling layers with kernel = [3, 3, 3] and stride = [2, 2, 2] are inserted among the six AI units.
3D Pyramid Pooling: Before the fully connected layer, a 3D pyramid pooling layer maps features of different sizes to vectors of fixed dimension. Different HSI datasets are usually captured by different sensors and have different numbers of spectral bands; for example, the Pavia University dataset has 103 bands while the Indian Pines dataset contains 200. With the 3D pyramid pooling layer, the same network can be applied to different HSI datasets without any modification. In this paper, the 3D spatial pyramid pooling layer is composed of three pooling levels. As the last AI unit has 256 channels, the pooling layer outputs a fixed-length feature vector whose dimension is 256 times the total number of pooling bins.
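To make the overall architecture concrete, the PyTorch sketch below assembles the pieces described above. It is a minimal illustration under stated assumptions, not the released implementation: the placement of the four max-pooling layers, the padding added so that a 27 × 27 input survives four stride-2 poolings, and the pyramid-pooling bin sizes of 1, 2 and 4 are all assumptions, and a plain 3D convolution block stands in for the AI unit detailed in Section 3.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pyramid3DPool(nn.Module):
    """Three-level 3D spatial pyramid pooling to a fixed-length vector."""
    def __init__(self, levels=(1, 2, 4)):          # assumed bin sizes
        super().__init__()
        self.levels = levels

    def forward(self, x):                           # x: (B, C, D, H, W)
        pooled = [F.adaptive_max_pool3d(x, s).flatten(1) for s in self.levels]
        return torch.cat(pooled, dim=1)

def stand_in_unit(c_in, c_out):
    # Placeholder for the AI unit of Section 3.2.
    return nn.Sequential(nn.Conv3d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm3d(c_out), nn.ReLU(inplace=True))

class AINetSketch(nn.Module):
    def __init__(self, n_classes, widths=(32, 64, 64, 128, 128, 256)):
        super().__init__()
        layers = [nn.Conv3d(1, widths[0], 3, padding=1)]   # opening 3D conv
        c = widths[0]
        for i, w in enumerate(widths):                      # six AI units
            layers.append(stand_in_unit(c, w))
            c = w
            if i < 4:   # four max-pooling layers, kernel 3, stride 2
                layers.append(nn.MaxPool3d(3, stride=2, padding=1))
        self.features = nn.Sequential(*layers)
        self.pool = Pyramid3DPool()
        # 256 channels x (1^3 + 2^3 + 4^3) bins under the assumed levels
        self.fc = nn.Linear(256 * (1 + 8 + 64), n_classes)

    def forward(self, x):                           # x: (B, 1, L, S, S)
        return F.log_softmax(self.fc(self.pool(self.features(x))), dim=1)

# Example: two 103-band, 27 x 27 cubes (Pavia University dimensions).
out = AINetSketch(n_classes=9)(torch.randn(2, 1, 103, 27, 27))
print(out.shape)    # torch.Size([2, 9])
```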
Training and Loss: We employ log softmax [50] as the activation function of the fully connected layer. During training, we use the negative log-likelihood as the loss function and add an L2 regularization term with weight 1 × 10−5 to alleviate over-fitting. The optimizer is stochastic gradient descent (SGD) with momentum [51]. The same setting is adopted for all experiments: momentum, weight decay, batch size, epochs and learning rate are 0.9, 1 × 10−5, 20, 60 and 0.01, respectively. In the last 12 epochs, the learning rate is decreased to 0.001.
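A minimal training-loop sketch matching the stated setup is given below; the dataset is a random placeholder, and the SGD weight_decay argument plays the role of the L2 term described above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = AINetSketch(n_classes=9)        # from the sketch in Section 3.1
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-5)
criterion = nn.NLLLoss()                # model outputs log-probabilities

# Placeholder data: 200 random cubes with random labels.
data = TensorDataset(torch.randn(200, 1, 103, 27, 27),
                     torch.randint(0, 9, (200,)))
loader = DataLoader(data, batch_size=20, shuffle=True)

for epoch in range(60):
    if epoch == 48:                     # last 12 epochs at lr = 0.001
        for group in optimizer.param_groups:
            group["lr"] = 0.001
    for cubes, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(cubes), labels)
        loss.backward()
        optimizer.step()
```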
3.2. AI Unit
Because 3D convolutions can learn spectral and spatial information directly from raw HSI data, 3D-CNN-based methods achieve state-of-the-art performance for HSI classification. However, compared with 2D convolutions, 3D convolutions are prone to overfitting and computation-intensive. To address these problems, we propose an asymmetric inception unit (AI unit), which consists of a space inception unit and a spectrum inception unit. The structure of the AI unit is illustrated in Figure 2.
In the space inception unit, there are three convolution paths. Path one has a single pointwise convolution layer; path two consists of one pointwise convolution layer followed by one 2D convolution layer; and path three has one pointwise convolution layer followed by two 2D convolution layers. The outputs of the three paths are concatenated along the channel dimension and added to the output of the shortcut connection. Inspired by the Inception networks [35], we give the three paths different widths, with a split ratio of 1:2:1. In the last two paths, the width of the pointwise convolution layer is half that of the following convolution layers. For instance, in the AI unit with 32 channels, the width of the first path is 8; in the second path, the widths of the pointwise and 2D convolution layers are 8 and 16, respectively; and the widths of the three layers in the last path are 4, 8 and 8. The spectrum inception unit has the same overall structure as the space inception unit, except that the 2D convolution layers are replaced with 1D convolution layers operating along the spectral dimension.
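The sketch below implements the unit as described, under assumptions the text leaves open: kernel extents of 3 (so the space unit uses (1, 3, 3) kernels and the spectrum unit (3, 1, 1) kernels, both expressed as 3D convolutions), batch normalization with ReLU after each convolution, and a pointwise shortcut whenever the unit changes the channel width.

```python
import torch
import torch.nn as nn

def conv_bn(c_in, c_out, kernel=1, padding=0):
    return nn.Sequential(nn.Conv3d(c_in, c_out, kernel, padding=padding),
                         nn.BatchNorm3d(c_out), nn.ReLU(inplace=True))

class AIUnit(nn.Module):
    def __init__(self, c_in, c_out, spectral=False):
        super().__init__()
        # Assumed kernels: (3,1,1) spectral, (1,3,3) spatial.
        k, p = ((3, 1, 1), (1, 0, 0)) if spectral else ((1, 3, 3), (0, 1, 1))
        w = c_out // 4                      # 1:2:1 split of the output width
        self.path1 = conv_bn(c_in, w)                         # pointwise only
        self.path2 = nn.Sequential(conv_bn(c_in, w),          # pointwise
                                   conv_bn(w, 2 * w, k, p))   # one conv
        self.path3 = nn.Sequential(conv_bn(c_in, w // 2),     # pointwise
                                   conv_bn(w // 2, w, k, p),  # two convs
                                   conv_bn(w, w, k, p))
        self.shortcut = (nn.Identity() if c_in == c_out
                         else nn.Conv3d(c_in, c_out, 1))

    def forward(self, x):
        out = torch.cat([self.path1(x), self.path2(x), self.path3(x)], dim=1)
        return out + self.shortcut(x)

# For c_out = 32 the path widths are 8, (8 -> 16) and (4 -> 8 -> 8),
# matching the example in the text (8 + 16 + 8 = 32 channels).
x = torch.randn(2, 32, 25, 13, 13)
y = AIUnit(32, 64, spectral=True)(AIUnit(32, 32)(x))
print(y.shape)      # torch.Size([2, 64, 25, 13, 13])
```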
In HSI datasets, the spectral resolution is much higher than the spatial resolution, and the spectral information is much richer. Therefore, in spectral–spatial feature extraction, we pay more attention to spectral features. The proposed AINet contains six AI units; the four units in the middle can be divided into two groups, each stacking units of equal width. Here, instead of stacking two identical AI units in each group, we stack one space inception unit and two spectrum inception units. This differs from popular networks such as ResNet [23] and MobileNet [31], which build the whole model by stacking identical units. Figure 3 shows the difference between one AI unit and two AI units.
3.3. Transfer Learning with Data Fusion
In RGB image classification, it is common to pretrain networks on the ImageNet dataset, which has over 14 million hand-annotated images in over 20,000 categories; such pretraining is very useful for improving performance and overcoming the problem of limited training samples. The diversity of the pretraining dataset is a key factor in transfer learning. For example, pretraining a model on a dataset with a million images and a thousand categories generally achieves better results than pretraining the same model on a dataset with 10 million images but only 10 categories. We believe that pretraining with more diverse samples leads to better generalization ability.
To further improve the performance of HSI classification, we propose a data-fusion transfer learning strategy. As shown in Figure 4, the strategy consists of data-fusion pretraining and fine-tuning: (1) data-fusion pretraining, in which the proposed network is trained on two different HSI datasets to increase sample diversity and obtain a robust initialization; and (2) fine-tuning, in which the new model for the target HSI dataset is initialized with the parameters of the pretrained model, while the fully connected layers are randomly initialized from a Gaussian distribution.
During pretraining, the proposed network is trained on two source HSI datasets; here, the Pavia Center and Salinas datasets are used, as they have the largest numbers of labeled samples among the public HSI datasets. To be more specific, the model is initialized from a Gaussian distribution and pretrained on the first source dataset for N epochs; then the feature extraction part is retained while the classifier is reinitialized from a Gaussian distribution. The feature extraction part and the classifier are then pretrained on the second source dataset for N epochs with different learning rates. In this paper, N is set to 10, and the learning rate of the feature extraction part is one tenth of that used for the classifier during the second pretraining stage.
After pretraining the model on the two source HSI datasets, we transfer the entire model except the classifier to initialize the fine-tuning model for the target HSI dataset. The transferred part and the new classifier are then fine-tuned on the target dataset at the same learning rate.
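The two-stage pretraining followed by fine-tuning can be sketched as follows. The model is assumed to expose .features and .fc as in the earlier sketch; train_one_epoch, the loaders and the non-pretraining epoch counts are placeholders, while N = 10 and the one-tenth learning-rate ratio follow the text.

```python
import torch

def train_one_epoch(model, loader, opt, criterion=torch.nn.NLLLoss()):
    for x, y in loader:
        opt.zero_grad()
        criterion(model(x), y).backward()
        opt.step()

def pretrain_and_finetune(model, src1, src2, target,
                          n_cls_src2, n_cls_target, N=10, lr=0.01):
    # Stage 1: pretrain the Gaussian-initialized model on source dataset 1.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(N):
        train_one_epoch(model, src1, opt)

    # Stage 2: keep the feature extractor, reinitialize the classifier,
    # and pretrain on source dataset 2 with a smaller extractor lr.
    model.fc = torch.nn.Linear(model.fc.in_features, n_cls_src2)
    opt = torch.optim.SGD([
        {"params": model.features.parameters(), "lr": lr / 10},
        {"params": model.fc.parameters(), "lr": lr},
    ], momentum=0.9)
    for _ in range(N):
        train_one_epoch(model, src2, opt)

    # Fine-tuning: transfer everything but the classifier to the target
    # dataset; one shared learning rate for the whole network.
    model.fc = torch.nn.Linear(model.fc.in_features, n_cls_target)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(60):                      # placeholder epoch count
        train_one_epoch(model, target, opt)
    return model
```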
4. Experiments
4.1. Datasets and Experimental Settings
In this paper, we compare the proposed AINet with a traditional approach and five CNN-based approaches for HSI classification on three public HSI datasets: Pavia University, Indian Pines and KSC. In the transfer-learning experiment, the Pavia Center and Salinas datasets are employed as the source datasets. The false-color composite and ground truth of each dataset are shown in Figure 5. A brief introduction to each dataset is given below, and more information can be found at http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes (accessed on 20 February 2022). The code of the proposed algorithm is available at https://github.com/UniLauX/AINet (accessed on 20 February 2022).
The Pavia University and Pavia Center datasets were captured by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor in 2001. After several of the noisiest bands were removed, Pavia University has 103 bands and Pavia Center has 102. Both datasets are divided into 9 classes.
The Indian Pines and Salinas datasets were acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in 1992. After correction, each dataset has 200 bands and contains 16 classes.
The KSC dataset was acquired by the AVIRIS sensor in 1996; after removing water-absorption and low-SNR bands, 176 bands were used for analysis. Thirteen classes are defined for classification.
For the three target HSI datasets, samples are divided into training and testing sets. For comparison purposes, we follow [18] in setting the sample distribution for the Indian Pines and KSC datasets. For the Pavia University dataset, 200 random samples are taken from each class as training samples. Table 1, Table 2 and Table 3 provide the split details. The two source HSI datasets are described in Table 4 and Table 5.
In the transfer-learning experiment, we randomly extracted 200 samples from each class of the Pavia Center dataset and 100 samples from each class of the Salinas dataset as test samples, and took the rest as training samples.
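A small sketch of the per-class random sampling used above is given below; the convention that label 0 marks unlabeled pixels in the ground-truth map is an assumption.

```python
import numpy as np

def per_class_split(gt, n_per_class, seed=0):
    """Split flat pixel indices of a 2D label map into train/test sets."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(gt[gt > 0]):          # 0 assumed to mean unlabeled
        idx = rng.permutation(np.flatnonzero(gt == c))
        train_idx.extend(idx[:n_per_class])  # e.g. 200 for Pavia University
        test_idx.extend(idx[n_per_class:])
    return np.array(train_idx), np.array(test_idx)
```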
4.2. Performance Comparison of Different Network Structures
In this section, we compare the proposed AINet with a traditional method and five CNN-based HSI classification methods, namely SVM-3DG [52], 1D-CNN, 2D-CNN and 3D-CNN [18], MSDN-SA [29], and SSRN [22]. Each experiment is run five times with the same settings to obtain the average performance. The experimental results are listed in Table 6, Table 7 and Table 8, which report the number of training samples, the number of parameters in the convolution layers, the depth of the CNN models, the overall accuracy (OA), the average accuracy (AA) and the kappa coefficient (K). OA is the ratio of correctly classified samples to the total number of test samples, AA is the mean of the per-class accuracies, and K is a coefficient that measures inter-rater agreement for qualitative items [53] (a sketch of these metrics follows this paragraph). The classification maps are shown in Figure 6, Figure 7 and Figure 8. From Table 6, Table 7 and Table 8, we can see that the proposed AINet achieves the highest classification performance on all of the datasets. For instance, on the Indian Pines dataset, the OA of AINet is 99.14%, which is 9.15% higher than that of 2D-CNN, 1.58% higher than that of 3D-CNN and 0.74% higher than that of SSRN. The experiments indicate that all of the 3D-CNN-based HSI classification methods are superior to 2D-CNN. From 3D-CNN and MSDN-SA to SSRN and AINet, model depth increases (the depths of the four models are 4, 7, 12 and 32, respectively) and classification accuracy keeps improving. Although AINet is much deeper than SSRN, it has only slightly more parameters than SSRN and far fewer than 3D-CNN.
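For reference, the three reported metrics can be computed from a confusion matrix as in the sketch below, where C[i, j] counts test samples of class i predicted as class j.

```python
import numpy as np

def oa_aa_kappa(C):
    n = C.sum()
    oa = np.trace(C) / n                                  # overall accuracy
    aa = np.mean(np.diag(C) / C.sum(axis=1))              # mean class accuracy
    pe = (C.sum(axis=0) * C.sum(axis=1)).sum() / n ** 2   # chance agreement
    return oa, aa, (oa - pe) / (1 - pe)                   # kappa

print(oa_aa_kappa(np.array([[50, 2], [3, 45]])))  # ~(0.95, 0.95, 0.90)
```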
4.3. Classification Results with Spatially Disjoint Samples
Previous research [4,54,55] has pointed out that the random-sampling strategy has a significant impact on the reliability and quality of the evaluation, since it can make it easier for a network to classify test samples during inference (the network has already processed them in some way during training). Compared with spatially disjoint samples, randomly selected samples may produce significant spatial overlap between the training and test sets, which can lead to artificially optimistic estimates of classification performance. To obtain a more realistic and accurate evaluation of the models, in this subsection a sampling strategy based on spatially separated samples is used. The classification results of all methods compared in Section 4.2 under the two sampling strategies are summarized in Table 9, Table 10 and Table 11.
As can be seen, 2D-CNN, 3D-CNN, MSDN-SA and SSRN all suffer an accuracy deterioration, and the performance of 2D-CNN and 3D-CNN declines drastically. Because the spatial resolution of the Indian Pines dataset is lower than that of the other two datasets, the 2D-CNN and 3D-CNN algorithms, which rely more on spatial information, degrade most significantly on this dataset. Although AINet also experiences performance degradation, it still achieves the highest OA, AA and K.
4.4. Results of Transfer Learning
In this section, we combine the proposed AINet with data-fusion-based transfer learning to further improve classification performance. In [45], the authors adopted transfer learning in their framework but required the data used for pretraining to be collected by the same sensor as the target data. In contrast to previous work, we impose no such restriction on the pretraining datasets, which makes our results more broadly applicable.
Here, we employ five HSI datasets in total: three target datasets (Pavia University, Indian Pines and KSC) and two source datasets (Pavia Center and Salinas). The source dataset Pavia Center and the target dataset Pavia University were both collected by the ROSIS sensor, so their spatial and spectral properties are similar. The source dataset Salinas and the target dataset Indian Pines were both acquired by the AVIRIS sensor, and their spatial and spectral resolutions are roughly identical. The last target dataset, KSC, was also collected by AVIRIS, but it retains 176 bands, which differs considerably from Salinas and Indian Pines; as a result, the basic attributes of KSC are rather different from those of the two source datasets.
In the transfer-learning experiments, we evaluate four different strategies, named AINet+T1, AINet+T2, AINet+T3 and AINet+T4. In AINet+T1, we first pretrain the proposed model on Pavia Center, then transfer the pretrained model to the target datasets and fine-tune it. Similarly, in AINet+T2, we first pretrain on Salinas, then transfer and fine-tune. Unlike AINet+T1 and AINet+T2, both AINet+T3 and AINet+T4 have two pretraining stages using different source datasets: AINet+T3 pretrains on Pavia Center in the first stage and on Salinas in the second, while AINet+T4 reverses this order.
The experimental results of transfer learning are listed in Table 12 and shown in Figure 9 and Figure 10. For each target dataset, we randomly choose 15 or 30 samples from each class as training samples and reserve the rest as test samples.
6. Conclusions
This paper proposes a 3D asymmetric inception network (AINet) for hyperspectral image classification. First, compared with traditional 3D CNNs, AINet offers a light-weight but much deeper architecture that exploits the potential of deep learning to extract representative features while alleviating the problems caused by limited annotated data. Second, considering the properties of hyperspectral images, spectral signatures are emphasized over spatial contexts in the proposed AI unit. Furthermore, a data-fusion transfer learning strategy is adopted to improve model initialization and classification accuracy.
We conducted comparison experiments on three challenging public HSI datasets against deep-learning-based HSI classification methods. The results demonstrate that the proposed AINet achieves competitive performance. Although AINet is much deeper than SSRN, it has only slightly more parameters than SSRN and far fewer than 3D-CNN; in fact, benefiting from the AI unit, AINet contains far fewer parameters and attains higher performance than the basic network model. In addition, we performed experiments to verify the effectiveness of the proposed data-fusion transfer learning strategy. The results show that pretraining the model on multiple source datasets is more effective than pretraining on a single source dataset.
In the future, we plan to pursue two topics: investigating the reduction in training time brought by transfer learning, and adopting strategies to overcome data imbalance in HSI classification.