1. Introduction
In recent years, remote sensing technology has played a crucial role in Earth observation tasks [1]. With the development of sensor technology, remote sensing imaging methods exhibit a diversified trend [2]. Although an abundance of multi-source data is now available, remote sensing data from each source capture only one or a few specific properties and therefore cannot fully describe the observed scenes [3,4]. Multi-source remote sensing data fusion techniques offer a natural solution to this problem: by integrating complementary information from multi-source data, tasks can be performed more reliably and accurately [5,6]. Specifically, light detection and ranging (LiDAR) data can provide additional elevation information to complement hyperspectral image (HSI) data. The joint land cover classification of HSI-LiDAR data has therefore become a promising approach that has achieved favorable results in practical tasks [7].
Depending on the sensor types, multi-source remote sensing data can be categorized as homogeneous or heterogeneous data [8]. HSI-LiDAR data are heterogeneous remote sensing data that contain two forms of characteristics, namely spatial–spectral features in the HSI and spatial–elevation features in the LiDAR data [9]. Depending on the hierarchical level of data fusion, multi-source remote sensing joint classification techniques can be further categorized into pixel-level, feature-level, and decision-level approaches [10]. Due to the vast variation in the observed target characteristics, joint HSI-LiDAR data fusion is usually performed with feature-level or decision-level techniques. In general, a typical HSI-LiDAR fusion classification network consists of four components: an HSI feature extraction module, a LiDAR feature extraction module, a feature embedding fusion module, and a classification module, as shown in Figure 1.
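For illustration, the following minimal PyTorch skeleton sketches this four-component pipeline under simple assumptions; the branch designs, channel sizes, and patch shapes are placeholders for illustration, not the specific networks reviewed below.

```python
# Minimal sketch of a generic HSI-LiDAR fusion classifier (illustrative only):
# two feature extraction branches, a feature embedding fusion step, and a
# classification head. Shapes and layer choices are assumptions.
import torch
import torch.nn as nn

class HsiLidarFusionNet(nn.Module):
    def __init__(self, hsi_bands=63, n_classes=6, dim=24):
        super().__init__()
        self.hsi_branch = nn.Sequential(            # HSI feature extraction
            nn.Conv2d(hsi_bands, dim, 3, padding=1), nn.ReLU())
        self.lidar_branch = nn.Sequential(          # LiDAR feature extraction
            nn.Conv2d(1, dim, 3, padding=1), nn.ReLU())
        self.fusion = nn.Conv2d(2 * dim, dim, 1)    # feature embedding fusion
        self.classifier = nn.Sequential(            # classification module
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, n_classes))

    def forward(self, hsi, lidar):
        f = torch.cat([self.hsi_branch(hsi), self.lidar_branch(lidar)], dim=1)
        return self.classifier(self.fusion(f))

# Example: an 11 x 11 HSI/LiDAR patch pair mapped to class logits
logits = HsiLidarFusionNet()(torch.randn(2, 63, 11, 11), torch.randn(2, 1, 11, 11))
```

Simple concatenation in the fusion step is only the most basic option; the fusion techniques reviewed below replace it with richer interactions between the two branches.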
For the HSI feature extraction module, the feature extraction approaches mainly include convolution-based, recurrent-based, transformer-based, and attention-based techniques. Many convolutional neural network (CNN)-based approaches use 2D convolution to learn local contextual information from pixel-centric data cubes [11,12]. However, these methods devote insufficient attention to the spectral signatures and fail to consider the joint spatial–spectral information in HSI. Naturally, some scholars use 2D-3D convolutions to improve the feature extraction modules and obtain joint spatial–spectral feature embeddings, which has achieved promising results in practical applications [13,14]. Nevertheless, 3D convolution introduces additional computation along the spectral dimension compared with 2D convolution, which dramatically raises the number of parameters and the computational complexity. Moreover, 2D and 3D convolutions are restricted by their receptive fields and overlook the long-distance dependencies among features. To capture the long-distance dependencies of spectral signatures, some scholars regard the spectral signatures of HSI as time-series signals and employ recurrent neural network (RNN) techniques to process them [15]. However, the sequential processing of RNNs is time-consuming, which limits their application in practical tasks. In recent years, transformers and attention mechanisms have been proposed, providing new ideas for HSI feature extraction [16,17]. These approaches can capture long-distance dependencies of feature embeddings in parallel [18,19]. However, attention mechanisms are weaker than CNN models at extracting spatial information. Furthermore, the training and inference speed of transformer models is strongly influenced by data size and model structure. To address these issues, Mamba models have emerged as a promising approach with strong long-distance modeling capability and linear computational complexity [20,21]. However, it is difficult for Mamba models to provide an integrated spatial and spectral understanding of HSI features. For the LiDAR feature extraction module, 2D CNNs and attention mechanisms are widely used, and their high performance has been demonstrated in real tasks [22,23,24].
For the feature embedding fusion module, the fusion techniques mainly include concatenation fusion, hierarchical fusion, residual fusion, graph-based fusion, and transformer- and attention-based fusion. The concatenation fusion technique achieves data fusion by combining the feature embeddings of different input data into a joint feature [25,26]. However, this approach performs only simple stacking of multi-source features and works well only on some small-scale datasets. In contrast, the hierarchical fusion technique is an effective improvement over concatenation fusion. In hierarchical fusion, shallow and deep features, which are extracted at different hierarchies, interact with each other to perform fusion [23,27,28]. In the feature fusion process, different attention modules are used at each level, expressed as shallow fusion and deep fusion, to achieve full complementarity of the features. However, the effectiveness of hierarchical fusion depends on the quality of the features extracted by the network at each level of the model, which limits the application of the method to complex tasks. Residual fusion uses CNNs and residual aggregation to realize data fusion [29,30]. However, this method handles the heterogeneous data fusion problem poorly. To represent the relationship between multi-source features, scholars have tried graph-based fusion techniques [9,31]. However, the computational complexity of graph-based fusion is significantly influenced by the number of nodes and edges, which makes it unsuitable for large-scale data. The transformer- and attention-based fusion techniques utilize attention mechanisms to achieve data fusion [32,33]. This approach can capture long-distance dependencies of feature embeddings and provide a high-performance representation of the relationship between multi-source features [34,35]. However, transformer and attention mechanisms are less sensitive to location information and lack the ability to collect local information. Finally, the classification module maps the fused features to the final classification results. Linear layers and adaptive weighting techniques are often used in this component to enhance the robustness and generalization of the model [36,37].
In this paper, a cross attention-based multi-scale convolutional fusion network (CMCN) is proposed for HSI-LiDAR land cover classification. The approach mainly consists of three modules: a spatial–elevation–spectral convolutional feature extraction module (SESM), a cross attention fusion module (CAFM), and a classification module. Two considerations drive the design of the network. First, discriminative spatial–spectral and spatial–elevation features should be extracted under diversified land cover conditions, and the correlative and complementary information should be preserved. Second, the long-distance dependencies of feature embeddings should be captured to represent the deep semantic relations of multi-source data. To capture discriminative spatial–spectral and spatial–elevation features and preserve the correlation and complementarity, improved multi-scale convolutional blocks are utilized in the SESM to extract features from HSI and LiDAR data. Spatial and spectral pseudo-3D convolutions [38] and pointwise convolutions are jointly used to extract spatial–elevation–spectral features and simplify the computation. Residual and one-shot aggregations are employed to maintain shallow features in deep layers and make the network easier to train. A parameter-sharing technique is used to exploit the correlation and complementarity. To capture the long-distance dependencies and represent the relations of the feature embeddings, a local–global cross attention mechanism is applied in the CAFM to collect local contextual features and integrate globally significant relational semantic information. A classification module is implemented to collect the fused features and translate them into the final classification results. More details about the proposed method are presented in Section 3. The main contributions are summarized as follows:
1. A multi-scale convolutional feature extraction module is designed to extract spatial–elevation–spectral features from HSI and LiDAR data. In this module, spatial and spectral pseudo-3D multi-scale convolutions and pointwise convolutions are jointly utilized to extract discriminative features, which enhances the ability to extract ground characteristics in diversified environments. Residual and one-shot aggregations are employed to maintain shallow features and ensure convergence. To capture the correlation and complementarity of spatial and elevation information between HSI and LiDAR data, a parameter-sharing technique is applied to generate feature embeddings;
2. A local–global cross attention block is designed to collect and integrate effective information from multi-source feature embeddings. To collect local information, convolutional layers are implemented to perform the mapping transformation. After that, a global cross attention mechanism is applied to capture long-distance dependencies and generate attention weights. Then, a multiplication operation and residual aggregation are used to produce semantic representations and accomplish data fusion (a minimal illustrative sketch is given after this list);
3. A novel cross attention-based multi-scale convolutional fusion network is proposed to achieve the joint classification of HSI and LiDAR data. A multi-scale CNN framework with parameter sharing and a local–global cross attention mechanism are combined to exploit joint deep semantic representations of HSI and LiDAR data and achieve data fusion. The classification module is implemented to produce the classification results. Experimental results on three publicly available datasets are reported.
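As a rough illustration of the second contribution, the sketch below implements a local–global cross attention fusion block under our own assumptions: pointwise convolutions provide the local mapping, torch.nn.MultiheadAttention supplies the global cross attention, and a residual connection aggregates the fused output. The layer choices, channel compression, and head count are illustrative and do not reproduce the authors' exact CAFM.

```python
# A minimal local-global cross attention fusion sketch (assumed design, not
# the paper's implementation): local pointwise mapping, global cross
# attention between modalities, and residual aggregation.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=24, heads=2, reduction=2):
        super().__init__()
        # Local mapping: pointwise convolutions collect local context and
        # compress channels before attention (reduction ratio is assumed).
        self.local_hsi = nn.Conv2d(dim, dim // reduction, kernel_size=1)
        self.local_lidar = nn.Conv2d(dim, dim // reduction, kernel_size=1)
        # Global cross attention: HSI queries attend to LiDAR keys/values.
        self.attn = nn.MultiheadAttention(dim // reduction, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim // reduction)
        self.proj = nn.Conv2d(dim // reduction, dim, kernel_size=1)

    def forward(self, hsi_feat, lidar_feat):
        b, c, h, w = hsi_feat.shape
        q = self.local_hsi(hsi_feat).flatten(2).transpose(1, 2)      # (B, HW, C/r)
        kv = self.local_lidar(lidar_feat).flatten(2).transpose(1, 2) # (B, HW, C/r)
        fused, _ = self.attn(self.norm(q), self.norm(kv), self.norm(kv))
        fused = fused.transpose(1, 2).reshape(b, -1, h, w)
        # Residual aggregation keeps the original HSI features in the output.
        return hsi_feat + self.proj(fused)

# Example usage with 24-channel feature maps on 11 x 11 patches
out = CrossAttentionFusion()(torch.randn(2, 24, 11, 11), torch.randn(2, 24, 11, 11))
```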
The rest of the paper is organized as follows. Section 2 introduces the related work, including HSI and LiDAR data classification, residual and one-shot aggregations, and the cross attention mechanism. Section 3 presents the details of the proposed network. Section 4 gives the experimental results, and Section 5 provides the discussion. Section 6 concludes this article and outlines future work.
4. Experiment
4.1. Dataset Description
In the experiment, three HSI-LiDAR pair datasets with different land covers are used to evaluate the effectiveness of the proposed network: the Trento dataset, the MUUFL dataset, and the Houston2013 dataset. Brief descriptions of the datasets are given as follows.
Trento Dataset [4]: The Trento dataset is an HSI-LiDAR pair dataset, where the HSI data were captured by an AISA Eagle sensor and the LiDAR digital surface model (DSM) data were acquired by an Optech ALTM 3100EA sensor. The dataset was captured over a rural area south of the city of Trento, Italy. The spatial size of the Trento dataset is 600 × 166 pixels, and the spatial resolution is about 1 m. The HSI data contain 63 bands with spectral wavelengths ranging from 420 to 990 nm. The LiDAR DSM data reflect the height of ground objects. The land covers are classified into six categories: Apple trees, Buildings, Ground, Woods, Vineyard, and Roads. The pseudo-color image of the HSI data, the LiDAR DSM image, and the ground-truth map of the Trento dataset are shown in Figure 5. The classes, colors, and the number of samples for each class are provided in Table 9.
MUUFL Dataset [65]: The MUUFL dataset was collected by the ITRES CASI-1500 sensor in November 2010 over the University of Southern Mississippi Gulf Park campus in Long Beach, Mississippi, and contains an HSI dataset and a LiDAR dataset. The spatial size is 325 × 220 pixels, and the spatial resolution is about 1 m. The HSI contains 64 available bands in the range of 375 to 1050 nm. The LiDAR data reflect the height of the ground objects. The land covers are classified into 11 categories: Trees, Mostly grass, Mixed ground surface, Dirt and sand, Road, Water, Building Shadow, Building, Sidewalk, Yellow curb, and Cloth panels. The pseudo-color image of the HSI data, the LiDAR DSM image, and the ground-truth map are shown in Figure 6. The classes, colors, and the number of samples for each class are provided in Table 9.
Houston2013 Dataset [66,67]: The Houston2013 dataset was acquired by the ITRES CASI-1500 sensor over the University of Houston campus, Houston, Texas, USA, and the neighboring urban area in 2012. It is composed of HSI data and LiDAR DSM data. The spatial size is 349 × 1905 pixels, and the spatial resolution is about 2.5 m. The HSI data contain 144 spectral bands in the 380 to 1050 nm region. The LiDAR data reflect the height of the ground objects. The land covers are classified into 15 categories: Healthy grass, Stressed grass, Synthetic grass, Trees, Soil, Water, Residential, Commercial, Road, Highway, Railway, Parking Lot 1, Parking Lot 2, Tennis Court, and Running track. The pseudo-color image of the HSI data, the LiDAR DSM image, and the ground-truth map of the Houston2013 dataset are shown in Figure 7. The classes, colors, and the number of samples for each class are provided in Table 9.
4.2. Experimental Setup and Assessment Indices
To evaluate the performance of the proposed network on multi-source remote sensing data, three HSI-LiDAR pair datasets with different land covers and spatial resolutions are introduced in our experiments. Ten representative methods are collected for comparison, including SVM [68], HYSN [45], DBDA [53], PMCN [69], FusAtNet [34], CCNN [22], AM3net [27], HCTnet [33], Sal2RN [65], and MS2CAN [70]. Among these methods, SVM is adopted to represent classical machine learning HSI classification methods based on spectral signatures; HYSN, DBDA, and PMCN are used to represent classical deep learning HSI classification methods based on spatial–spectral information; and FusAtNet, CCNN, AM3net, HCTnet, Sal2RN, and MS2CAN are employed to represent state-of-the-art deep learning HSI-LiDAR classification methods based on spatial–elevation–spectral information. The details of the comparison methods are described as follows:
1. SVM: this method finds the optimal hyperplane, which is determined by the support vectors, to achieve classification;
2. HYSN: this approach proposes a hybrid spectral convolutional neural network for HSI classification and uses spectral–spatial 2D-3D convolutions to extract features;
3. DBDA: this approach proposes a double-branch dual-attention mechanism network for HSI classification. CNNs and self-attention mechanisms are used to extract spectral–spatial features;
4. PMCN: this method uses multi-scale spectral–spatial convolutions to extract features from HSI data, and an attention mechanism is applied to enhance the performance;
5. FusAtNet: this method uses residual aggregation and attention blocks to achieve data fusion of HSI and LiDAR data;
6. CCNN: this approach proposes a coupled convolutional neural network for HSI and LiDAR classification. Multi-scale CNNs are used to capture spectral–spatial features from HSI and LiDAR, and a hierarchical fusion method is implemented to fuse the extracted features;
7. AM3net: this approach uses CNNs to extract spectral–spatial–elevation features, and the involution operator is specially designed for spectral features. A hierarchical mutual-guided module is proposed to fuse feature embeddings for HSI-LiDAR data classification;
8. HCTnet: this method proposes a dual-branch approach to achieve HSI and LiDAR classification; 2D-3D convolutions are implemented to extract features, and a transformer network is used to fuse the features;
9. Sal2RN: this approach uses CNNs to extract spectral–spatial features from HSI and LiDAR data and applies a cross attention mechanism to achieve fusion;
10. MS2CAN: this approach is a multiscale pyramid fusion framework based on spatial–spectral cross-modal attention for HSI and LiDAR classification.
The SVM parameters are tuned to optimal values by experimental analysis. The other comparison methods use the default parameters specified in their original papers. The data augmentation of FusAtNet is removed in the actual experiments.
For the proposed CMCN, the patch size of the data cube is set to . The batch size is set to 32. The number of feature maps is set to 24. The multiplicity of channel compression and the number of heads in the multi-head attention, which are used in the CAFM, are set to 2 and 2, respectively. The dropout rate is set to 0.5. Additional hyperparameters are listed as follows. The number of epochs is set to 200. The initial learning rate is set to . The adaptive moment estimation (Adam) [71] optimizer is applied to train the network, where the attenuation rate is set to and the fuzzy factor is set to . The cosine annealing technique [72] is adopted over 200 epochs. Early stopping is applied during training with a patience of 50 epochs; 1.00% of the labeled samples are randomly selected as training samples and another 1.00% as validation samples, and the remaining labeled samples are used as testing samples.
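The optimizer schedule described above can be sketched as follows; the toy model, the toy data, and the initial learning rate (not recoverable from the text above) are placeholders, not the paper's settings.

```python
# Hedged sketch of the training schedule: Adam + cosine annealing over 200
# epochs + early stopping with a patience of 50 epochs. The model, data, and
# learning rate below are placeholders, not the paper's values.
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(24, 6)                                   # stand-in for the CMCN network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption
scheduler = CosineAnnealingLR(optimizer, T_max=200)

x, y = torch.randn(32, 24), torch.randint(0, 6, (32,))     # toy batch (batch size 32)
best_val, patience, wait = float("inf"), 50, 0
for epoch in range(200):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
    val_loss = loss.item()            # placeholder for a real validation pass
    if val_loss < best_val:
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:          # early stopping
            break
```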
The overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Kappa) [73] are introduced to quantitatively measure the performance of the competitors. All experiments are repeated 10 times independently, and the average values are reported as the final results. The hardware environment is a workstation with an Intel Xeon E5-2680 v4 processor (2.4 GHz) and an NVIDIA GeForce RTX 2080 Ti GPU. The software environment is CUDA 11.2, PyTorch 1.10, and Python 3.8.
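For reference, the three assessment indices can be computed from a confusion matrix as in the short example below; the matrix values are illustrative only.

```python
# Worked example of OA, AA, and Kappa from a confusion matrix C, where
# C[i, j] counts samples of true class i predicted as class j.
import numpy as np

def oa_aa_kappa(C):
    C = np.asarray(C, dtype=float)
    n = C.sum()
    oa = np.trace(C) / n                                  # overall accuracy
    aa = np.mean(np.diag(C) / C.sum(axis=1))              # mean per-class accuracy
    pe = np.sum(C.sum(axis=0) * C.sum(axis=1)) / n ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)                          # Kappa coefficient
    return oa, aa, kappa

print(oa_aa_kappa([[50, 2, 0], [3, 40, 5], [1, 4, 45]]))  # illustrative matrix
```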
4.3. Experimental Results
We first compare the performance of the various methods on the Trento dataset. The classification results are given in Table 10, and the full-factor classification maps are shown in Figure 8. The best classification accuracy for each category, as well as the best OA, AA, Kappa, and training time, are highlighted in bold in the tables. Observing the classification accuracy of each method, we can see that the Trento dataset is relatively easy to classify with sufficient training samples. SVM gives the lowest OA (85.39%), indicating that pixel-wise classification methods using only spectral signatures are less effective than methods using spatial–spectral features. Specifically, C2-Buildings and C6-Roads are hard to classify with SVM (74.12% and 71.57%), indicating that ground objects of these two categories are difficult to distinguish by spectral signatures alone. HYSN, DBDA, and PMCN provide higher OAs (95.50%, 96.55%, and 97.46%) than SVM, which indicates that the inclusion of spatial information helps improve classification accuracy. Viewing the classification accuracies of these spatial–spectral-based deep learning HSI classification methods in various categories, especially C2-Buildings and C6-Roads, we can see that the accuracies increase gradually, demonstrating that multi-scale convolution and attention mechanisms can effectively improve the generalization of the model. The HSI-LiDAR fusion classification methods (FusAtNet, CCNN, AM3net, HCTnet, Sal2RN, MS2CAN, and CMCN) yield higher OAs than the HSI classification methods (SVM, HYSN, DBDA, and PMCN), indicating that the elevation information provided by LiDAR data supplies additional discriminative features that help improve classification accuracy. Checking the average classification accuracies of each category and the OAs of these HSI-LiDAR fusion classification methods, we can see that CCNN, MS2CAN, and CMCN obtain relatively higher classification accuracies, which further demonstrates that multi-scale feature extraction and multi-level feature fusion can effectively exploit discriminative characteristics in complex ground cover environments. In particular, the proposed CMCN obtains the highest OA, which shows that the proposed method can effectively capture correlative and complementary information and fuse it to generate discriminative deep semantic features. Observing the full-factor classification maps and the ground-truth map, we can see that some scattered buildings, ground, and roads are difficult to distinguish. Region I is a parcel containing buildings, ground, and roads, and the comparison shows that CMCN expresses the detailed information of the ground cover more precisely. In Region II, there are two trees on the road, which are also clearly resolved in the full-factor classification map given by CMCN. Viewing the training times of the competitors, SVM gives the shortest training time (4.63 s). Among the deep neural networks, HCTnet provides the shortest training time (10.07 s), whereas FusAtNet gives the longest (68.03 s). The training time of CMCN is 30.98 s.
To further test the performance of the proposed method, experiments are implemented on the MUUFL dataset, which contains complex topographical landscapes and an unbalanced number of labeled samples. The classification results and full-factor classification maps are presented in Table 11 and Figure 9. Different from the experimental results on the Trento dataset, SVM obtains an OA of 83.87%, which is higher than those of HYSN and FusAtNet. Among the spatial–spectral-based deep learning classification methods, DBDA provides the highest OA (88.09%). The spatial–elevation–spectral-based fusion classification methods provide relatively higher classification accuracies than SVM, HYSN, and PMCN. MS2CAN and CMCN obtain the second-highest and highest OAs (88.65% and 88.99%) for HSI-LiDAR classification. Reviewing the classification accuracies for each category, it can be seen that C7-Building Shadow, C9-Sidewalk, and C10-Yellow curb are hard to classify. This may be due to the dispersed distribution of the ground cover and the similarity in elevation, which prevents the LiDAR data from providing additional valid discriminatory information. In addition, inadequate training of the deep neural networks caused by the insufficient number of labeled samples for C10-Yellow curb may also be one of the reasons this category is difficult to recognize. Checking the classification accuracies of C7-Building Shadow and C9-Sidewalk, we can see that CCNN and CMCN achieve relatively high results (83.35% and 80.91% for C7-Building Shadow; 74.94% and 75.98% for C9-Sidewalk), indicating that multi-scale feature extraction adapts better to remote sensing images with vast structural variations. For C10-Yellow curb, SVM provides the highest accuracy (68.40%), while all other methods give low classification accuracies on this category. The accuracy of the proposed method on C10-Yellow curb is also poor (8.66%), indicating that the approach performs poorly with insufficient labeled samples and requires further improvement. Observing the full-factor classification maps and the ground-truth map, CMCN yields a relatively clearer map for land cover classification. In Regions I and II, the buildings, building shadows, and roads are smoothly recognized. However, the sidewalk is misclassified as mostly grass in Region I, and the yellow curb is misclassified as road in Region II. SVM provides the shortest training time (11.75 s), and FusAtNet gives the longest (225.94 s). The training time of CMCN is 66.53 s.
To further test the performance of the proposed network, experiments are conducted on the Houston2013 dataset. The classification results and the full-factor classification maps are given in Table 12 and Figure 10. The spectral-based SVM obtains the lowest OA (75.70%). The spatial–spectral-based methods provide higher OAs (79.98%, 86.53%, and 89.09%) than SVM. PMCN, which combines multi-scale convolution and attention mechanisms, obtains the highest OA among these spatial–spectral-based deep learning methods. The spatial–elevation–spectral-based methods obtain relatively higher OAs than SVM and the spatial–spectral-based methods. CMCN achieves the highest OA (89.47%) among all competitors. Checking the classification accuracies of each category, we can see that C9-Road and C12-Parking Lot 1 are relatively difficult to classify. For C9-Road, the classification accuracies of all methods are below 90%; FusAtNet gives the lowest accuracy (53.10%), while CMCN obtains the highest (85.28%). For C12-Parking Lot 1, SVM provides the lowest accuracy (44.75%), and MS2CAN gives the highest (93.83%). The proposed CMCN obtains an accuracy of 76.83%, which is relatively low among the competitors. This may be caused by the scattered distribution of labeled samples in C9-Road and C12-Parking Lot 1, where spatial and elevation information cannot be fully utilized. Observing the full-factor classification maps and the ground-truth map, we can see that CMCN provides clearer and smoother classification maps in most categories. In Regions I and II, buildings, roads, and land covers can be clearly identified. However, some of the ground objects in Region II are still poorly recognized due to cloud obscuration. For the training time, SVM provides the shortest time (3.49 s), and FusAtNet takes the longest (91.21 s). The training time of CMCN is 24.2 s.
6. Conclusions
In this paper, a cross attention-based multi-scale convolutional fusion network is proposed for pixel-wise HSI-LiDAR classification. The proposed model consists of three modules, namely the SESM, the CAFM, and the classification module. The SESM is used to extract spatial, elevation, and spectral features from the HSI and LiDAR data. The CAFM is implemented to fuse the extracted HSI and LiDAR features in a cross-modal representation learning manner and to generate joint semantic information. The classification module is employed to map the semantic features to classification results. Several techniques, such as multi-scale convolution, the cross attention mechanism, one-shot aggregation, residual aggregation, parameter sharing, batch normalization, layer normalization, and the Mish activation function, are implemented to improve the performance of the network. Three HSI-LiDAR datasets containing different land covers and spectral–spatial resolutions are used to verify the effectiveness of the proposed method, and ten relevant methods are included for comparison. In addition, the impact of the hyperparameters, the proportion of training samples, the computational cost, the visualization of data features, the model analysis, and the limitations are discussed.
In conclusion, our research contributes to the field of multi-source data fusion and classification by proposing an effective framework that combines multi-scale CNN and cross attention techniques. Compared with state-of-the-art methods, the proposed CMCN provides competitive classification performance on widely used datasets, such as Trento, MUUFL, and Houston2013. The experimental results demonstrate the potential of these techniques for enhancing the ability to extract discriminative spatial–spectral features and for capturing the correlation and complementarity between different data sources in HSI and LiDAR joint classification. Although the proposed method performs effectively in HSI-LiDAR classification, several issues still require attention, such as the quality of the labeled samples, unbalanced available data, FLOPs, parameter scale, and interpretability. In the future, we will aim to overcome these issues and further enhance the robustness and overall performance of our approach in multi-source data fusion and classification.