1. Introduction
Hyperspectral image (HSI) analysis is an essential technology in aerial and satellite-based remote sensing. Hyperspectral data have been successfully applied in various fields, such as classification [1,2,3,4,5], environmental monitoring [6,7], and object recognition [8,9,10]. Because of the limited spatial resolution of HSI, many pixels are mixtures of several materials, which degrades the overall performance of hyperspectral data processing [11,12]. Hyperspectral unmixing (HSU) has therefore become an important technique for handling mixed pixels. HSU aims to decompose the measured pixel spectra of remote sensing data into a set of pure spectral signatures, referred to as endmembers, and their fractional abundances. The endmembers are normally assumed to represent the pure materials, and the abundances give the proportion of every endmember at each pixel. HSU is applied in various fields, such as agriculture, geology [13], and environmental biology [14]. The mixed pixel issue can be addressed via supervised, unsupervised, and semi-supervised approaches. In this work, the supervised approach is adopted to estimate the abundance maps, as the endmembers are assumed to be known; they can be extracted by endmember extraction algorithms such as the pixel purity index (PPI) [15], N-FINDR [16], vertex component analysis (VCA) [17], and minimum volume simplex analysis (MVSA) [18]. In the unsupervised setting, different HSU methods have been proposed [19,20,21,22,23,24] to estimate both endmembers and their fractional abundances simultaneously. Semi-supervised approaches assume that each mixed pixel in the observed image can be represented as a mixture of a few spectral signatures drawn from a large spectral library [25].
There are two kinds of HSU models: the linear spectral mixing model (LSMM) and the nonlinear spectral mixing model (NLSMM) [26]. The LSMM holds when the incident light interacts with just one material before reaching the sensor [27], so each pixel is expressed as a linear combination of the endmembers; the NLSMM holds when the light interacts with more than one material in the scene. The LSMM is widely used in many applications due to its efficiency and simplicity [28]. Several methods, such as sparse regression [29], nonnegative matrix factorization (NMF) [30], and geometrical approaches [31], have been used to solve the linear spectral mixing problem; these linear spectral unmixing methods derive the endmembers and their abundances from the hyperspectral image.
Over the last few decades, several supervised approaches to the linear spectral mixing problem have been proposed. The well-known fully constrained least squares (FCLS) method minimizes the error between the observed spectrum and the linearly mixed spectrum, subject to the physical constraints that the abundances must be nonnegative and sum to one. A related method, mixture tuned matched filtering (MTMF) [32], has also been used to extract abundances. More recent unmixing algorithms, such as sparse unmixing by variable splitting and augmented Lagrangian (SUnSAL) [33] and its constrained variant (CSUnSAL), solve the optimization problem using the alternating direction method of multipliers (ADMM) [34]. Both SUnSAL and CSUnSAL apply an ℓ1 regularization term to the fractional abundances; SUnSAL uses an ℓ2 data-fidelity term, whereas CSUnSAL enforces data fidelity through a constraint. SUnSAL was improved in [35] by incorporating spatial information via total variation (TV) regularization of the abundances (SUnSAL-TV); however, this can lead to over-smoothed and blurred abundance maps. Collaborative sparse unmixing [29] instead applies an ℓ2,1 regularization term to promote sparsity of the abundance maps. Spectral variability can be represented with the perturbed linear mixing model (PLMM) [36] and the extended linear mixing model (ELMM) [37] to estimate abundances. In [38], an augmented linear mixing model (ALMM) was presented, which first introduces a spectral variability dictionary and then applies data-driven dictionary learning to estimate the abundances.
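To make the FCLS idea above concrete, the following is a minimal Python sketch (not the implementation used in any of the cited works): the sum-to-one constraint is appended as a heavily weighted row of the least squares system, and nonnegativity is handled by SciPy's nonnegative least squares solver.

```python
import numpy as np
from scipy.optimize import nnls

def fcls_abundances(y, E, delta=1e3):
    """Fully constrained least squares for one pixel under the LSMM.

    Solves min ||y - E a||^2 subject to a >= 0 and sum(a) = 1 by appending
    the sum-to-one condition as a row weighted by `delta` and running NNLS.

    y : (B,) observed pixel spectrum
    E : (B, R) endmember matrix, one column per endmember
    """
    _, R = E.shape
    E_aug = np.vstack([E, delta * np.ones((1, R))])   # weighted sum-to-one row
    y_aug = np.append(y, delta)
    a, _ = nnls(E_aug, y_aug)
    return a

# Toy check: recover a 60/30/10 mixture of three random endmembers.
rng = np.random.default_rng(0)
E = np.abs(rng.normal(size=(50, 3)))
y = E @ np.array([0.6, 0.3, 0.1])
print(fcls_abundances(y, E))   # approximately [0.6, 0.3, 0.1]
```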
In the past few decades, deep learning approaches have been employed in computer vision, object detection [39], and pattern recognition. These approaches automatically extract rich features from remote sensing data and have been widely used for hyperspectral image classification. The literature shows, however, that far fewer deep learning methods have been applied to hyperspectral unmixing than to tasks such as image classification. In this work, we aim to fill this gap by connecting deep supervised learning with robust supervised HSU to address the mixed pixel issue.
With the success of artificial neural networks, various deep learning approaches have been employed for hyperspectral unmixing; recent neural network methods for the HSU problem include [40,41,42,43,44,45,46,47]. In [48], 1D and 3D methods based on a scattering transform framework (STF) were presented to extract features from HSIs, after which k-nearest neighbor (k-NN) was used to extract the abundances. More recently, the autoencoder, a neural-network-based deep learning model, has been used to address HSU problems and has gained considerable popularity in the hyperspectral community. Two specific instances, the de-noising autoencoder and the nonnegative sparse autoencoder (NNSAE), were utilized to address the HSU problem by simultaneously estimating endmembers and their fractional abundances from the HSI [49,50]. Another unmixing method [51] solves the HSU problem using a variational autoencoder and a stacked NNSAE. In [52], a concatenated marginalized de-noising model and an NNSAE were used for hyperspectral unmixing in noisy environments. In [53], the authors proposed a stacked NNSAE to minimize the impact of a low signal-to-noise ratio. Several other autoencoders, such as [54], improve spectral unmixing performance using multitask learning and convolutional neural networks (CNNs) [55]. A weakly supervised two-stream network [56], consisting of an endmember extraction network and an unmixing network, has also been proposed for the spectral unmixing problem.
Recent research has made clear that using a 2D-CNN or a 3D-CNN exclusively has drawbacks, such as discarding band-related information or requiring extremely complex models, which prevents these deep learning techniques from reaching high accuracy. The underlying reason is that an HSI is volumetric data with a spectral dimension. A 2D-CNN alone cannot exploit the spectral information to produce feature maps with good discriminative power, while a deep 3D-CNN is computationally expensive and, used alone, performs poorly for materials with similar signatures across many spectral bands; both approaches also require additional processing time to analyze the spectral–spatial data cubes. Moreover, the presence of various types of noise, i.e., noisy pixels and noisy channels in remote sensing data, severely degrades the overall performance of spectral unmixing, and only a few existing unmixing methods have been designed to achieve robustness in the spectral dimension [57,58]. These observations motivate us to propose a supervised robust HSU method that accounts for both noisy pixels and noisy channels to enhance robustness in the spectral and spatial dimensions.
In this paper, we propose a novel supervised end-to-end deep hybrid convolutional autoencoder (DHCAE) network for robust HSU. The proposed method exploits both spectral and spatial information to estimate abundances given the endmembers of the HSI. The main contributions of this work are threefold:
To the best of the authors’ knowledge, this is the first robust HSU model built as an end-to-end framework using a deep hybrid convolutional autoencoder. The framework learns discriminative features from the HSI to improve unmixing performance.
We combine 3D and 2D convolutional layers in the proposed approach, exploiting spectral–spatial information to improve hyperspectral unmixing performance.
The performance of the proposed method is evaluated on one synthetic and three real datasets. The results confirm that the DHCAE approach outperforms existing methods.
The remainder of this article is organized as follows. The proposed DHCAE network is described in Section 2. The experimental results on the synthetic and three real-world remote sensing datasets are presented in Section 3. Section 4 offers the discussion, and Section 5 concludes this article.
3. Experiment and Analysis
The proposed DHCAE was implemented in the Keras and TensorFlow framework using Python. The experiments were performed on an HP Notebook 15-da0001tx with an Intel Core i7-8550U CPU and a GPU with 4 GB of memory. We demonstrate the unmixing performance of the proposed approach on four datasets: one synthetic hyperspectral dataset and three real datasets. We compared the proposed approach (DHCAE) with six related unmixing methods, namely FCLS [59], nonnegative matrix factorization quadratic minimum volume (NMF-QMV) [60], the augmented linear mixing model (ALMM) [31], hyperspectral unmixing using deep image prior (UnDIP) [61], deep hyperspectral unmixing using a transformer network (DHTN) [62], and the deep convolutional autoencoder (DCAE) [63]. DCAE performs well in terms of abundance estimation, and because the DCAE and DHCAE networks have similar structures, their capabilities on hyperspectral unmixing problems are easy to compare. For the DCAE network, the encoder comprises four convolutional layers and two fully connected layers; its decoder is the same as in our approach. For quantitative assessment, we used three criteria: the abundance overall root mean square error (aRMSE), the reconstruction overall root mean square error (rRMSE), and the average spectral angle mapper (aSAM). When the reference abundance maps $\mathbf{A}$ are available, the aRMSE measures the error of the estimated abundance maps $\hat{\mathbf{A}}$. Without ground-truth abundance maps, the other two metrics measure the reconstruction error between the original HSI $\mathbf{Y}$ and its reconstruction $\hat{\mathbf{Y}}$. These are defined as follows:
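(The standard forms from the unmixing literature are assumed here, with $P$ pixels, $R$ endmembers, and $B$ spectral bands; $\mathbf{a}_i$ and $\hat{\mathbf{a}}_i$ denote the reference and estimated abundance vectors of pixel $i$, and $\mathbf{y}_i$ and $\hat{\mathbf{y}}_i$ the observed and reconstructed spectra.)

\[
\mathrm{aRMSE} = \frac{1}{P}\sum_{i=1}^{P}\sqrt{\frac{1}{R}\left\lVert \mathbf{a}_i - \hat{\mathbf{a}}_i \right\rVert_2^2},
\qquad
\mathrm{rRMSE} = \frac{1}{P}\sum_{i=1}^{P}\sqrt{\frac{1}{B}\left\lVert \mathbf{y}_i - \hat{\mathbf{y}}_i \right\rVert_2^2},
\]
\[
\mathrm{aSAM} = \frac{1}{P}\sum_{i=1}^{P}\arccos\!\left(\frac{\mathbf{y}_i^{\top}\hat{\mathbf{y}}_i}{\lVert \mathbf{y}_i \rVert_2\,\lVert \hat{\mathbf{y}}_i \rVert_2}\right).
\]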
3.1. Experiments on Synthetic Dataset
The synthetic data were generated using five endmember reference spectra randomly chosen from the United States Geological Survey (USGS) digital spectral library [64]. As shown in Figure 3, the five endmember references have 480 spectral bands. The abundance maps are 64 × 64 pixels with a maximum abundance purity of 0.8, following the procedures in [65,66].
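As an illustration only (the exact generation procedure of [65,66] is not reproduced here), a synthetic cube of this kind can be produced under the LSMM roughly as follows; the endmember matrix `E` stands in for the five USGS signatures.

```python
import numpy as np

def make_synthetic_cube(E, height=64, width=64, max_purity=0.8, seed=0):
    """Generate noise-free synthetic mixed pixels under the linear mixing model.

    E : (B, R) endmember matrix (e.g., B = 480 bands, R = 5 USGS signatures)
    Returns the clean cube (height, width, B) and abundance maps (height, width, R).
    """
    rng = np.random.default_rng(seed)
    B, R = E.shape
    # Random abundances that are nonnegative and sum to one (Dirichlet draw).
    A = rng.dirichlet(np.ones(R), size=height * width)
    # Repeatedly blend overly pure pixels toward the uniform mixture until
    # every pixel satisfies the maximum purity constraint.
    while True:
        too_pure = A.max(axis=1) > max_purity
        if not too_pure.any():
            break
        A[too_pure] = 0.5 * A[too_pure] + 0.5 / R
    X = A @ E.T                                   # (H*W, B) mixed spectra
    return X.reshape(height, width, B), A.reshape(height, width, R)
```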
The extracted abundance maps of the different materials in the synthetic dataset are shown in Figure 4. The first column shows the true abundance maps of the distinct spectral signatures, and the remaining columns show the abundance maps extracted by the various unmixing methods. DHTN and DCAE produce reasonable abundance maps, but their results are still unsatisfactory. Compared with the previous state-of-the-art approaches, our proposed approach extracts abundance maps that are closer to the true abundance maps, and the extracted abundances of all five spectral signatures are accurate and stable.
To quantitatively validate the robustness of our proposed approach in the simulated experiments, we compared the performance of the various unmixing algorithms on the synthetic data with three distinct types of noise: band noise only, pixel noise only, and both band and pixel noise.
For the band-noise case, Gaussian noise was added to each band at four SNR levels, i.e., 5, 15, 25, and 35 dB, to test the performance of the proposed approach under different noise levels. Figure 5 shows the aSAM and aRMSE scores of all methods at the various noise levels. It can be seen that our proposed approach is more robust against band noise at various SNRs than the other unmixing methods.
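One way to realize this band-noise setting is sketched below (the exact convention of the experiments is not spelled out here; in this sketch each band's noise variance is derived from that band's mean signal power). The pixel-noise case described next is handled analogously, with the loop running over pixels instead of bands.

```python
import numpy as np

def add_band_noise(cube, snr_db, seed=0):
    """Add zero-mean Gaussian noise to every band of an (H, W, B) cube at a target SNR (dB)."""
    rng = np.random.default_rng(seed)
    noisy = cube.astype(float).copy()
    for b in range(noisy.shape[-1]):
        band = noisy[..., b]
        noise_var = np.mean(band ** 2) / (10.0 ** (snr_db / 10.0))  # SNR = P_signal / P_noise
        band += rng.normal(scale=np.sqrt(noise_var), size=band.shape)
    return noisy

# The four noise levels used in the experiments:
# noisy_cubes = {snr: add_band_noise(clean_cube, snr) for snr in (5, 15, 25, 35)}
```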
For the pixel-noise case, we added Gaussian noise to each pixel of the synthetic image with SNR values from 5 to 35 dB. Figure 6 shows the aSAM and aRMSE values of all methods at the various noise levels; our proposed approach performs well and is more robust against pixel noise.
To further investigate the robustness of the proposed approach, we also performed experiments in which pixel noise and band noise were added simultaneously: Gaussian noise at SNR levels of 5, 15, 25, and 35 dB was added to each pixel and each band of the synthetic dataset. As illustrated in Figure 7, our proposed approach remains more robust across the different pixel- and band-noise levels. The performance of the other competitors drops as the number of noisy pixels and bands increases, which means that they are more easily corrupted by noisy pixels and bands than the proposed approach.
3.2. Experiments on Jasper Ridge Dataset
The first real HSI scene used in our experiments is the Jasper Ridge dataset captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS), shown in Figure 8a. The image of 100 × 100 pixels was measured in 224 spectral bands covering the wavelength range from 0.38 to 2.5 μm. Tree, Water, Dirt, and Road are the four main endmembers observed in this dataset. Among the 224 bands, there are some noisy bands (2–4, 220–224) and blank bands (1, 108–112, 154–166). Figure 8b–e shows bands seriously corrupted by noise. The presence of noisy bands in an HSI reduces the performance of HSU methods; therefore, to test the robustness of our proposed approach, we conducted experiments both without the noisy bands (198 bands) and with them (224 bands).
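For reference, a simple way to build the reduced cube is to mask out the listed band indices (1-based numbering of the 224 AVIRIS bands is assumed here):

```python
import numpy as np

# Noisy (2-4, 220-224) and blank (1, 108-112, 154-166) Jasper Ridge bands, 1-based.
BAD_BANDS = set(range(2, 5)) | set(range(220, 225)) | {1} \
          | set(range(108, 113)) | set(range(154, 167))

def drop_bands(cube, bad_1based=BAD_BANDS):
    """Return a copy of the (H, W, B) cube with the listed 1-based bands removed."""
    keep = [b for b in range(cube.shape[-1]) if (b + 1) not in bad_1based]
    return cube[..., keep]
```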
3.2.1. Results without Noisy Bands
Table 2 lists the quantitative results for the Jasper Ridge dataset without noisy bands, where the proposed DHCAE network achieves the best performance among all competitors, with an rRMSE of 0.0068 and an aSAM of 0.0314. The second-best results on this dataset are obtained by DCAE. Figure 9 shows the extracted abundances for the different materials; apart from our proposed approach, the DCAE method also achieves accurate abundance estimation for all endmembers.
3.2.2. Results with Noisy Bands
We also tested the unmixing performance of our proposed approach on the Jasper Ridge dataset with the noisy bands included (224 bands). Table 3 reports the rRMSE and aSAM values of the various unmixing approaches; the values achieved by the DHCAE network are better than those of the other unmixing methods. Figure 10 illustrates the qualitative results of the extracted abundances for the different spectral signatures. These results show that our proposed approach produces abundance maps that are more robust to noise than those of the other unmixing approaches.
3.3. Experiments on Urban Dataset
The second real dataset was collected by the Hyperspectral Digital Imagery Collection Experiment (HYDICE) sensor in October 1995. Figure 11a shows the observed image of 307 × 307 pixels. Four different materials are included in the scene: asphalt, grass, trees, and roofs. The original dataset has 210 spectral bands ranging from 400 to 2500 nm, some of which (1–4, 76, 87, 101–111, 136–153, and 198–210) are noisy due to atmospheric effects and water absorption. Figure 11b–e shows some of these noisy bands; it can be seen that bands 138, 149, 207, and 209 are corrupted by noise. The existence of noisy bands degrades the performance of spectral unmixing methods, but such bands may also contain important information. Therefore, we performed experiments on the data both with the noisy bands removed (162 bands) and with them retained (210 bands) to demonstrate the robustness of our approach.
3.3.1. Results with Noisy Bands Removed
Table 4 presents the quantitative comparison of rRMSE and aSAM results on the Urban dataset. According to Table 4, the proposed DHCAE network provides superior unmixing results to the other competitors, with an rRMSE of 0.0115 and an aSAM of 0.0331; DCAE achieves the second-best results on this dataset. Figure 12 depicts the extracted abundances for the different materials (endmembers) in the Urban dataset. It can be observed that our proposed approach extracts abundance maps that are more separable and closer to the ground-truth abundance maps than those provided by the state-of-the-art competitors.
3.3.2. Results with Noisy Bands Retained
We also investigated the robustness of our proposed approach on the Urban dataset with the noisy bands retained (210 bands). Table 5 lists the rRMSE and aSAM values yielded by our proposed approach and the other six unmixing methods; as seen in Table 5, the DHCAE network achieves better results than the other competitors. Figure 13 displays the abundance maps estimated by the various unmixing approaches. Comparing Figure 12 and Figure 13, it can easily be seen that the FCLS and ALMM results are still corrupted by noise, whereas the results achieved by our proposed approach are more robust to noise than those of the other unmixing competitors.
3.4. Experiments on Washington DC Dataset
The third real hyperspectral dataset was also collected by the HYDICE sensor. The observed image comprises 1280 × 307 pixels with 210 channels ranging from 0.4 to 2.4 μm; noisy and water vapor channels were removed, and we investigated a cropped image of 319 × 292 pixels with 191 channels. According to [67], six endmembers are included in the Washington DC Mall scene: Grass, Tree, Roof, Road, Water, and Trail. The cropped image and the six endmembers are shown in Figure 14.
Results on Washington DC Dataset
Table 6 presents the quantitative assessment results for the Washington DC Mall dataset. According to Table 6, the proposed unmixing method provides the best results among all compared methods; the second-best results are achieved by DHTN and DCAE in terms of rRMSE and aSAM. Figure 15 shows the qualitative results of the estimated abundance maps for the six spectral signatures. The results clearly indicate that the proposed DHCAE network provides clearer and smoother abundance maps.
3.5. Parameter Settings
In this section, we used a random search to find suitable values for the hyperparameters of the proposed model. Learning rates of 0.001, 0.003, 0.004, and 0.005 were investigated; based on the results, the optimal learning rate is 0.005. Similarly, the Adam optimizer was selected among different optimizers. Several experiments were conducted to assess the impact of different dropout values, and the optimal dropout rate is 20%. We tried spatial window sizes of 13 × 13, 15 × 15, 17 × 17, and 19 × 19 to learn adequate spatial information, and a window size of 15 × 15 was selected. The network was trained for 100 epochs with a minibatch size of 100.
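As a sketch of how these settings are wired together in Keras (the model construction itself is described in Section 2 and is not repeated here; the mean-squared-error loss and the target tensor are assumptions for illustration):

```python
import tensorflow as tf

# Hyperparameters selected by the random search described above.
LEARNING_RATE = 0.005
DROPOUT_RATE = 0.20      # used inside the network's dropout layers
WINDOW_SIZE = 15         # spatial input patches of 15 x 15 pixels
EPOCHS = 100
BATCH_SIZE = 100

def train(model, patches, targets):
    """Compile and fit a Keras model with the selected hyperparameters.

    patches : (N, WINDOW_SIZE, WINDOW_SIZE, B, 1) input windows
    targets : training targets (e.g., the spectra reconstructed by the decoder)
    """
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
                  loss="mse")
    model.fit(patches, targets, epochs=EPOCHS, batch_size=BATCH_SIZE, shuffle=True)
    return model
```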
The regularization parameters influence the performance of the different unmixing approaches. We found that the following values produced the best results: for ALMM, α = β = 0.002, γ = η = 0.005, and the number of basis vectors L = 100; for UnDIP, a learning rate of 0.001 and 3000 iterations; for DCAE, the Adam optimizer with a learning rate of 0.005, with the number of epochs and the batch size both set to 100.
3.6. Effects of Spatial Window Size
The effect of the spatial window size (S × S) on the unmixing performance of our proposed approach is shown in Figure 16. The best estimation was achieved with a spatial window size of 15 × 15; therefore, we set 15 × 15 as the input size for all hyperspectral datasets.
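For completeness, the S × S input windows can be formed as below (reflect padding at the image border is an assumption; the handling of border pixels is not specified above):

```python
import numpy as np

def extract_patches(cube, window=15):
    """Return an (H*W, window, window, B) array of windows centred on every pixel."""
    H, W, B = cube.shape
    pad = window // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    patches = np.empty((H * W, window, window, B), dtype=cube.dtype)
    for i in range(H):
        for j in range(W):
            patches[i * W + j] = padded[i:i + window, j:j + window, :]
    return patches
```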
3.7. Comparative Analysis
To validate the effectiveness of the proposed DHCAE method, we compared it with two existing designs: a 3D-CNN and a 2D-CNN. Figure 17 depicts the outcomes of the three methods. According to Figure 17, the proposed DHCAE method achieves the best performance in terms of rRMSE and aSAM on every hyperspectral dataset. The proposed method is based on a hierarchical representation in which a spectral–spatial 3D-CNN is followed by a complementary spatial 2D-CNN. Compared with this hybrid of 3D and 2D convolutions, it is clear that 3D or 2D convolutions alone cannot represent highly discriminative features.
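The hybrid design being compared can be sketched in Keras roughly as follows; the filter counts, kernel sizes, and the softmax abundance head are illustrative choices, not the exact DHCAE configuration of Section 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_hybrid_encoder(bands, num_endmembers, window=15):
    """Hybrid 3D -> 2D convolutional encoder mapping a patch to an abundance vector."""
    inputs = tf.keras.Input(shape=(window, window, bands, 1))
    # Joint spectral-spatial feature extraction with 3D convolutions.
    x = layers.Conv3D(8, (3, 3, 7), padding="same", activation="relu")(inputs)
    x = layers.Conv3D(16, (3, 3, 5), padding="same", activation="relu")(x)
    # Fold the spectral feature axis into channels, then refine spatially in 2D.
    _, h, w, d, c = x.shape
    x = layers.Reshape((h, w, d * c))(x)
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(x)
    x = layers.Dropout(0.20)(x)
    x = layers.Flatten()(x)
    # Softmax output keeps the abundances nonnegative and summing to one.
    outputs = layers.Dense(num_endmembers, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```

In this sketch the 3D block captures band-to-band correlations, while the 2D block operates on the flattened spectral features, reflecting the complementarity discussed above.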
3.8. Computational Time
We also compared the computational time of our proposed DHCAE network with that of the previous state-of-the-art unmixing approaches. The average computational times in seconds for all spectral unmixing approaches on the three datasets are reported in Table 7. Using an autoencoder for the unmixing problem is computationally demanding; however, the DHCAE network can be parallelized on graphics processing units, which reduces the computational time.