1. Introduction
Hyperspectral remote sensing refers to the science and technology of acquiring, processing, analyzing, and applying remote sensing data with high spectral resolution. Unlike multispectral remote sensing, hyperspectral remote sensing can observe surface objects in hundreds of continuous spectral bands, providing rich spectral information that enhances the characterization of surface features [1]. Hyperspectral remote sensing has been widely used in surface classification, target detection, agricultural monitoring, mineral mapping, environmental management, and other fields [2].
Remote sensing image classification is an essential part of hyperspectral remote sensing image processing and application, and its ultimate goal is to assign a unique category label to each pixel in the image. In the past decades, a variety of HSI classification methods have been proposed [3,4], focusing mainly on spectral or spatial–spectral information. For classification based on spectral information, a TabNet with spatial attention (TabNets) was designed for hyperspectral image classification in [3]. To fully exploit the spatial features of HSIs, many methods have been proposed, such as an encoder–decoder with a residual network (EDRN) [4], which combines hyperspectral and panchromatic remote sensing images to extract representative deep features. However, high dimensionality, strong correlation between bands, and spectral mixing make the classification of hyperspectral remote sensing images a significant challenge [5]. Thanks to the development of remote sensing technology, it is now possible to measure different aspects of the same object on the Earth's surface [6]. Hyperspectral data are easily disturbed by environmental factors such as clouds and shadows, which can lead to confused information. For example, if building roofs and roads are both made of concrete, it is difficult to distinguish them using hyperspectral data alone, because their spectral responses are similar. Light detection and ranging (LiDAR), in contrast, uses pulsed lasers to measure distances and is an active remote sensing method [7,8]. It is not susceptible to weather conditions, provides height and shape information of the scene with excellent accuracy and flexibility [9,10], and can accurately separate these two categories. Conversely, LiDAR data cannot distinguish two roads composed of different materials (e.g., asphalt and concrete) at the same height [11,12]. Therefore, deeply integrating the two kinds of data realizes the complementary advantages of multi-sensor remote sensing, breaks through the performance bottleneck of single remote sensing data (such as "different objects with the same spectrum" or "the same object with different spectra"), and ultimately improves the accuracy of object classification [13,14].
Most traditional classification models first perform feature extraction on multi-sensor data and then distinguish classes through a classifier [15]. Among the feature extraction methods, knowledge-guided feature extraction applies mathematical operations to relevant bands, based on an understanding of spectral features, to obtain deep-level information. However, expert knowledge is often hard to obtain. Furthermore, traditional classification methods rely heavily on hand-designed features, which limits the representational capacity of the models [16].
With the development of deep-learning techniques, convolutional neural networks (CNNs) [17,18,19,20,21,22,23] have been widely used in computer vision tasks, such as image classification [24,25], object detection [26], and semantic segmentation [17]. CNN-based methods have been widely applied to remote sensing image classification and have become the mainstream in this field [27,28]. Convolutional neural networks have shown excellent feature extraction capabilities and are gradually replacing methods based on hand-crafted features.
For example, in [29], a Two-Branch CNN combines the separately extracted spatial and spectral features of HSIs with features extracted from LiDAR data. EndNet (encoder–decoder network) [30] is a deep encoder–decoder architecture that reconstructs the multi-sensor inputs by encoding and decoding the fused features with an autoencoder. In [31], multimodal deep learning middle fusion (MDL-Middle) performs intermediate fusion within a CNN model.
Based on these advanced approaches, we make the following observations. First, the ability to fully integrate different modalities is critical, as reflected by the complementary nature of HSI and LiDAR features. Multi-sensor data are naturally correlated, and if this complementarity is fully exploited during encoding, the extracted multi-sensor representation will be more robust and comprehensive. Brain studies also confirm that the human brain begins to integrate multi-sensory information in the primary perceptual cortex.
Second, the design of feature extraction networks tailored to different sensor data is critical and must fully account for the characteristics of each sensor domain, which matters greatly for downstream tasks.
Therefore, this paper addresses these two aspects and explores solutions for each.
Self-supervised learning has recently emerged as an effective approach. In the multimodal field, it can achieve downstream performance comparable to supervised pre-training in tasks such as action recognition, information retrieval, and video question answering. For example, many self-supervised methods [32,33,34,35,36] exploit contrastive objectives to facilitate multimodal learning such as visual–linguistic learning. MoCo [34] further improves this scheme by storing representations produced by a momentum encoder in a dynamic dictionary with a queue. MoCov2 [36] borrows the multilayer perceptron (MLP) head design and shows significant improvements.
The success of self-supervised learning in the multimodal domain suggests that it can also be applied in the multi-sensor domain to support the deep fusion of features from different sensors, further improving the performance of multi-sensor downstream tasks.
To address the second aspect, we investigated a large number of state-of-the-art networks and their improved variants. Recently, the ConvNeXt [37] network has become a research hot spot in computer vision. While maintaining the CNN structure, ConvNeXt draws on the design concepts of methods such as the Swin Transformer [38]. The Swin Transformer is a landmark work in the transformer direction, which demonstrated for the first time that transformers can serve as general-purpose vision backbones and achieve state-of-the-art performance on a range of computer vision tasks. ConvNeXt uses a larger kernel size to approximate long-range modeling while preserving the local sensitivity of the CNN, thereby retaining global information. The spectrum of an HSI is a kind of sequence data that usually contains hundreds of spectral bands. By exploiting ConvNeXt's ability to extract both local and global information, we can not only extract global spectral–spatial information but also mitigate problems such as accuracy degradation caused by mixed pixels.
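As a reference, a minimal sketch of a ConvNeXt-style block is given below (large-kernel depthwise convolution, LayerNorm, and an inverted MLP with a residual connection); it follows the published design at a high level and is not necessarily the exact block configuration used in this paper.

```python
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt block: 7x7 depthwise conv, LayerNorm, inverted MLP, residual."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # large-kernel depthwise conv
        self.norm = nn.LayerNorm(dim)            # applied in channels-last layout
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # pointwise expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # pointwise projection

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to (N, C, H, W)
        return shortcut + x
```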
Returning to convolutional neural networks, the receptive field describes the extent of the original image perceived by a neuron at a given position in the network. The larger a neuron's receptive field, the larger the region of the original image it can reach, which means it may capture more global and higher-level semantic features; the smaller the receptive field, the more its features tend to be local and detailed. The receptive field is therefore very important for the network.
Dilated convolution [39] introduces a dilation rate parameter into the convolution kernel, which defines the spacing r between kernel weights. In other words, dilated convolution is similar to traditional convolution and keeps the same number of weights, but the weights are spaced r positions apart, i.e., the kernel of a dilated convolution layer is sparse. This allows the convolution filter to obtain a larger receptive field without reducing the spatial resolution or increasing the kernel size, improving recognition in downstream tasks. A traditional convolutional neural network can be regarded as a cascade of convolution operators that encode the input and represent its different frequency components. However, there is no effective fusion between frequencies, and the interaction between frequencies is critical for encoding LiDAR information. OctaveConv [40] decomposes the input feature map into two sets of feature maps with different spatial frequencies and processes each at its corresponding frequency, which helps each layer obtain a larger receptive field and capture more contextual information.
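The sketch below illustrates both operations in PyTorch: a dilated convolution only requires the dilation (spacing r) argument, and a simplified Octave convolution splits channels into high- and low-frequency groups processed at two resolutions with cross-frequency exchange. The channel split ratio alpha and layer sizes are illustrative, not the exact settings used in this paper.

```python
import torch.nn as nn
import torch.nn.functional as F

# A dilated convolution only needs the dilation (spacing r) argument; r = 2 enlarges the receptive field.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

class OctaveConv(nn.Module):
    """Simplified Octave convolution: channels are split into a high-frequency part at full
    resolution and a low-frequency part at half resolution, with cross-frequency exchange."""
    def __init__(self, in_ch, out_ch, kernel_size=3, alpha=0.5):
        super().__init__()
        self.in_lo, self.out_lo = int(alpha * in_ch), int(alpha * out_ch)
        self.in_hi, self.out_hi = in_ch - self.in_lo, out_ch - self.out_lo
        pad = kernel_size // 2
        self.h2h = nn.Conv2d(self.in_hi, self.out_hi, kernel_size, padding=pad)
        self.h2l = nn.Conv2d(self.in_hi, self.out_lo, kernel_size, padding=pad)
        self.l2h = nn.Conv2d(self.in_lo, self.out_hi, kernel_size, padding=pad)
        self.l2l = nn.Conv2d(self.in_lo, self.out_lo, kernel_size, padding=pad)

    def forward(self, x_h, x_l):
        # x_h: (N, C_hi, H, W) high-frequency map; x_l: (N, C_lo, H/2, W/2) low-frequency map.
        y_h = self.h2h(x_h) + F.interpolate(self.l2h(x_l), scale_factor=2, mode="nearest")
        y_l = self.l2l(x_l) + self.h2l(F.avg_pool2d(x_h, kernel_size=2))
        return y_h, y_l
```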
The main contributions are summarized as follows:
We introduce a multi-sensor pair training framework for the HSI-LiDAR classification task. Our framework can exploit the intrinsic data properties of each modality and simultaneously extract semantic information from cross-modal correlations. It not only encodes the two modalities independently to capture more modality-specific information but also learns the alignment between modalities and the deep fusion of the two sensors' information for HSI-LiDAR classification tasks;
It is well known that information is conveyed at different frequencies, where higher frequencies typically encode fine details and lower frequencies typically encode global structure. The LiDAR-derived Digital Surface Model (DSM) contains rich elevation information with both high- and low-frequency components. We therefore propose a new LiDAR encoder network structure based on Octave convolution. The output maps of a convolutional layer can likewise be factorized and grouped by their spatial frequency. OctaveConv focuses on reducing spatial redundancy in CNNs and is designed to replace vanilla convolution operations. In this way, the high- and low-frequency information of the DSM is fully utilized during feature extraction;
Due to the spectral redundancy and low spatial resolution of HSIs, we propose a Spectral-Aware Trident network in parallel and a ConvNeXt network in series; both employ dilated convolution to enlarge the receptive field. As discussed above, ConvNeXt retains the local sensitivity of the CNN while using a larger kernel size to approximate long-range modeling, which preserves global information. Since the spectrum of an HSI is a sequence typically containing hundreds of bands, exploiting ConvNeXt's local and global feature extraction allows us to extract global spectral–spatial information and to overcome the accuracy degradation caused by mixed pixels;
For network training, we adopt a stagewise training strategy, which trains the HSI branch, the LiDAR branch, and the joint HSI-LiDAR classification task in separate stages. Training the HSI and LiDAR branches first provides better parameter initialization for the HSI-LiDAR classification model, which usually leads to better generalization and faster convergence on the downstream task.
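A minimal outline of this stagewise strategy is sketched below; the module and loader names are placeholders rather than the actual implementation, and `fit` stands for any single-stage training routine.

```python
import torch.nn as nn

def stagewise_train(hsi_encoder: nn.Module, lidar_encoder: nn.Module,
                    joint_head: nn.Module, fit, loaders: dict) -> nn.Module:
    fit(hsi_encoder, loaders["hsi"])      # Stage 1: pre-train the HSI branch
    fit(lidar_encoder, loaders["lidar"])  # Stage 2: pre-train the LiDAR branch
    # Stage 3: the joint classifier reuses the pre-trained branch weights as
    # initialization and is fine-tuned on the HSI-LiDAR classification task.
    joint_model = nn.ModuleDict({"hsi": hsi_encoder, "lidar": lidar_encoder, "head": joint_head})
    fit(joint_model, loaders["joint"])
    return joint_model
```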
3. Results
To evaluate the effectiveness of the proposed model, we used two different datasets: Houston2013 and Trento. All deep models were implemented in the PyTorch 1.9 framework, and all experiments were carried out in the same hardware environment: Ubuntu 16.04 with an NVIDIA Tesla K80 GPU.
3.1. Experimental Datasets Description
Houston2013 dataset: This dataset consists of a hyperspectral image and a LiDAR-derived DSM, both of 349 × 1905 pixels with the same spatial resolution (2.5 m). The data were acquired in 2013 by the NSF-funded Center for Airborne Laser Mapping (NCALM) over the University of Houston campus and the neighboring urban area. The HSI has 144 spectral bands in the 380 nm to 1050 nm region and includes 15 classes.
Table 1 lists the number of samples of different classes and the color of each class.
Figure 7 gives the visualization results of the Houston2013 dataset. These data and reference classes can be obtained online from the IEEE GRSS website (http://dase.grss-ieee.org/, accessed on 10 September 2022).
Trento dataset: This dataset consists of a hyperspectral image and a LiDAR-derived DSM, both of 600 × 166 pixels with the same spatial resolution (1 m). The HSI was acquired by the AISA Eagle sensor, and the LiDAR DSM was produced using first- and last-pulse point cloud returns obtained by the Optech ALTM 3100EA sensor. The HSI has 63 spectral bands covering the 402.89 to 989.09 nm region and includes six classes.
Table 2 lists the number of samples of different classes and the color of each class.
Figure 8 gives the visualization results of the Trento dataset.
3.2. Experimental Setup
The proposed network was implemented on the PyTorch platform. In each epoch, the original training set was randomly divided into a training subset and a validation subset at a ratio of 8:2. In the training phase, we used an SGD optimizer with a weight decay of 1 × 10−4, a momentum of 0.9, and a batch size of 64 on an NVIDIA Tesla K80 GPU, together with the "step" learning rate strategy. The HSI encoder and LiDAR encoder were trained with initial learning rates of 0.03 and 4 × 10−5, respectively, and the joint HSI-LiDAR encoder was then trained with an initial learning rate of 0.01. All models were trained for 100 epochs.
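A minimal sketch of this optimization setup is shown below; the step size and decay factor of the "step" schedule are assumptions, since they are not specified above, and the encoder is a placeholder module.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(64, 16)  # placeholder standing in for the HSI encoder
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.03,   # 0.03: initial lr of the HSI encoder
                            momentum=0.9, weight_decay=1e-4)
# "step" learning rate strategy; step_size and gamma are assumed values
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... one epoch over the 8:2 training/validation split ...
    scheduler.step()
```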
In terms of details, we normalized all encoder feature vectors before computing their dot products in the contrastive losses, with the temperature τ set to 0.07. For the trade-off parameters in the final loss, we set λ1 to 0.5, λ2 to 0.2, λ3 to 0.5, and λ4 to 1.0; all auxiliary loss weights were less than 1.0. After pre-training the encoders, we found that the classification performance of the LiDAR encoder was not as good as that of the HSI encoder. To compensate for the weaker LiDAR encoder, we set the auxiliary classification loss weight of LiDAR to 0.5 and that of HSI to 0.3. We used the same hyperparameters for all datasets.
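The sketch below illustrates how such a temperature-scaled contrastive term and a weighted loss combination could look; the assignment of each λ to a specific loss term is an assumption made purely for illustration, since the full loss definitions are given elsewhere.

```python
import torch
import torch.nn.functional as F

def contrastive_term(z_hsi, z_lidar, tau=0.07):
    # Normalize the encoder feature vectors before the dot product, then scale by temperature tau.
    z_hsi = F.normalize(z_hsi, dim=1)
    z_lidar = F.normalize(z_lidar, dim=1)
    logits = z_hsi @ z_lidar.t() / tau
    targets = torch.arange(z_hsi.size(0), device=z_hsi.device)  # matching HSI-LiDAR pairs on the diagonal
    return F.cross_entropy(logits, targets)

def total_loss(l_contrast, l_aux_hsi, l_aux_lidar, l_cls,
               lam1=0.5, lam2=0.2, lam3=0.5, lam4=1.0):
    # Weighted sum of contrastive, auxiliary, and classification terms;
    # which λ multiplies which term is an illustrative assumption.
    return lam1 * l_contrast + lam2 * l_aux_hsi + lam3 * l_aux_lidar + lam4 * l_cls
```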
We evaluated the model on the two datasets using three metrics: overall accuracy (OA), average accuracy (AA), and the Kappa coefficient. OA is the ratio of correctly classified pixels to the total number of pixels in the test set. AA is the mean of the per-class accuracies, i.e., the sum of the accuracies of each class divided by the number of classes. The Kappa coefficient measures the consistency between the remote sensing classification map and the ground-truth map.
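For clarity, the three metrics can be computed from the confusion matrix as sketched below.

```python
import numpy as np

def classification_metrics(C: np.ndarray):
    # C[i, j] counts test pixels of true class i predicted as class j.
    n = C.sum()
    oa = np.trace(C) / n                                   # overall accuracy
    aa = np.mean(np.diag(C) / C.sum(axis=1))               # mean of per-class accuracies
    pe = np.sum(C.sum(axis=0) * C.sum(axis=1)) / (n ** 2)  # chance agreement from the marginals
    kappa = (oa - pe) / (1.0 - pe)                         # Kappa coefficient
    return oa, aa, kappa
```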
3.3. Experimental Results
To demonstrate the effectiveness of the proposed model, several representative multi-sensor joint classification methods were selected for comparison, including Two-Branch CNN, EndNet, and MDL-Middle. Two-Branch CNN performs feature fusion by combining the spatial and spectral features extracted from the HSI branch with the LiDAR features extracted by a cascaded network. EndNet is a deep encoder–decoder architecture that fuses multi-sensor information by enhancing the fused features. MDL-Middle performs multi-sensor feature fusion in the middle layers of a CNN model. We also compared single-sensor classification models: Trident-HSI, ConvNeXt-HSI, CNN-LiDAR, and OctaveConv-LiDAR. To ensure the validity of the comparative experiments, all methods were evaluated on the same Houston2013 and Trento datasets, and the training and test sets of each dataset were kept identical.
3.3.1. Classification Results of the Houston2013 Dataset
Table 3 shows the detailed classification results of the eight models in terms of OA, AA, and Kappa coefficient on the Houston2013 dataset. The best results are shown in bold. As can be seen from Table 3, our proposed method shows a clear improvement in OA, AA, and Kappa over Two-Branch CNN, EndNet, and MDL-Middle. The classification performance for Soil, Road, Railway, Parking Lot1, Tennis Court, and Running Track is better than that of these three models; in particular, the recognition accuracies of Healthy Grass (88.41%) and Railway (94.97%) represent improvements of up to 6.83% and 11.19%, respectively, over these three models.
The following observations are also evident from Table 3. First, all single-sensor models perform worse than the multi-sensor model, and the performance using only HSI data is significantly higher than that using only LiDAR data. For example, on the Houston2013 dataset, the OctaveConv-LiDAR classification model achieves an OA of 67.58%, an AA of 65.29%, and a Kappa of 64.92%, all lower than those of the ConvNeXt-HSI classification model (87.12%, 88.17%, and 86.02%, respectively). With dual-sensor joint classification, the OA increases to 88.14%, the AA to 88.14%, and the Kappa to 87.16%. Second, the ConvNeXt-HSI and OctaveConv-LiDAR models proposed in this paper outperform the corresponding Trident-HSI and CNN-LiDAR models. The OA, AA, and Kappa of ConvNeXt-HSI are higher than those of Trident-HSI by 5.77%, 6.14%, and 6.17%, respectively, and those of OctaveConv-LiDAR are higher than those of CNN-LiDAR by 4.41%, 4.69%, and 4.71%, respectively.
In summary, the method proposed in this paper outperforms all other models, demonstrating that multi-sensor classification is more effective than single-sensor classification.
Figure 9 shows the classification maps of Two-Branch CNN, EndNet, MDL-Middle, Trident-HSI, CNN-LiDAR, ConvNeXt-HSI, OctaveConv-LiDAR, and the proposed model, where different colors represent different object classes. From the single-sensor perspective, the single-sensor HSI methods (Figure 9d,f) use rich spectral information to provide detailed ground-object information for the targets to be detected, but have difficulty identifying similar objects (such as grass and shrubs); the methods based on single-sensor LiDAR data (Figure 9e,g) use elevation information to distinguish objects of different heights but have difficulty classifying objects of the same height. In contrast, as shown in Figure 9h, our proposed HSI-LiDAR joint classification model, which combines multi-sensor data with self-supervised learning, obtains more detailed information and smoother classification results (for example, railways) than the other three advanced algorithms, Two-Branch CNN, EndNet, and MDL-Middle (Figure 9a–c), and can achieve high-precision classification in complex scenes.
3.3.2. Classification Results of the Trento Dataset
Table 4 shows the detailed classification results of the eight models in terms of OA, AA, and Kappa coefficient on the Trento dataset. Compared with Two-Branch CNN, EndNet, and MDL-Middle, our proposed method again shows clear improvements in OA, AA, and Kappa. The classification performance for Apples, Buildings, and Roads is superior to that of these three models; in particular, the recognition accuracy for Buildings reaches 99.10%, an improvement of up to 1.17% over these three models.
Moreover, all single-sensor models perform worse than the multi-sensor model, and the performance using only HSI data is significantly higher than that using only LiDAR data. For example, on the Trento dataset, the OctaveConv-LiDAR classification model achieves an OA of 91.85%, an AA of 83.57%, and a Kappa of 89.21%, all lower than those of the ConvNeXt-HSI classification model (96.40%, 92.91%, and 95.20%, respectively). With dual-sensor joint classification, the OA increases to 88.14%, the AA to 88.14%, and the Kappa to 87.16%.
Similarly, comparing the single-branch models, ConvNeXt-HSI outperforms Trident-HSI and OctaveConv-LiDAR outperforms CNN-LiDAR. The OA, AA, and Kappa of ConvNeXt-HSI are higher than those of Trident-HSI by 1.12%, 2.74%, and 2.5%, respectively, and those of OctaveConv-LiDAR are higher than those of CNN-LiDAR by 0.9%, 1.59%, and 1.19%, respectively.
Figure 10 shows the classification maps of Two-Branch CNN, EndNet, MDL-Middle, Trident-HSI, CNN-LiDAR, ConvNeXt-HSI, OctaveConv-LiDAR, and the proposed model, where different colors represent different object classes. The Trento dataset leads to the same conclusions as the Houston2013 dataset. From the single-sensor perspective, the single-sensor HSI methods (Figure 10d,f) have difficulty identifying similar objects (for example, Apples and Woods), while the methods based on single-sensor LiDAR data (Figure 10e,g) have difficulty classifying objects of the same height (such as Buildings and Roads). As shown in Figure 10h, our proposed method obtains more detailed information and smoother classification results (e.g., Vineyard) than the other three advanced algorithms (Figure 10a–c).
3.3.3. Computational Complexity Analysis
Table 5 shows the model complexity analysis of the different models, using two important metrics: floating point operations (FLOPs) and the number of model parameters (#param). FLOPs refers to the number of floating point operations required for one forward propagation of a single input sample (one image), i.e., the time complexity of the model. #param refers to the number of parameters the model contains, which directly determines the size of the model and also affects the memory used during inference, i.e., the spatial complexity of the model.
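As an illustration, #param can be counted directly from the model, and the FLOPs of a single convolutional layer can be estimated from its configuration as sketched below (counting a multiply–accumulate as two operations); the helper names are illustrative.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # #param: total number of learnable parameters (spatial complexity).
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def conv2d_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    # FLOPs of one convolutional layer for a single input sample,
    # with each multiply-accumulate counted as two operations.
    return 2 * c_in * k * k * c_out * h_out * w_out
```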
Because EndNet does not consider neighborhood information, the spatial and temporal complexity of EndNet is small. Although using a single pixel as input reduces the model complexity, ignoring neighborhood information leads to a decrease in accuracy. The model proposed in this paper uses multiple encoders and the network is deeper than other models, which greatly increases the computational cost but also improves the performance of the network.