1. Introduction
Hyperspectral images (HSIs) have important applications in many fields, such as remote sensing [1,2], food safety [3,4], astronomy [5], medicine [6], and agriculture [7,8]. However, during the imaging process, interference from complex human factors and the natural environment, such as illumination, means that the collected HSIs often contain various kinds of noise (e.g., Gaussian noise). Therefore, much work has aimed at improving the quality of hyperspectral remote sensing images, for example, through pansharpening [9,10,11], super-resolution [12], and denoising [13].
Most successful traditional HSI denoising methods are based on strong prior knowledge, such as low-rank representation [14,15,16,17,18], sparse coding [19,20,21,22,23], and global correlation along the spectrum [24,25]. With the development of deep learning (DL), DL-based methods, such as those using convolutional neural networks (CNNs), have drawn more and more attention [13,26,27,28]. To achieve a good denoising effect, DL-based methods need a large number of training samples to learn their network parameters. However, the currently widely used datasets (ICVL [29], Pavia [30], etc.) contain a limited number of training samples because HSIs are more challenging to obtain than RGB images. Therefore, we aim to extend data augmentation to the task of HSI denoising, generate new samples that give the network more positive feedback, and thereby further improve its denoising performance.
Data augmentation (DA) is an effective way to improve performance in machine learning without increasing the computational cost; it can improve model robustness and reduce the model's sensitivity to the data at the same time. The core idea of most DA methods is to partially block or obfuscate the training sample so that the model gains greater generalization ability. The most commonly used geometric transformations are flipping, rotation, cropping, scaling, translation, and so on. Combined with deep neural networks, DA strategies have been successfully applied in high-level vision tasks, such as image classification [31,32,33,34,35,36,37,38] and object recognition [39,40]. Some typical operations include feature space augmentation [41,42], adversarial training [43,44,45], and so on. However, it has been found that most existing DA methods lead to the loss or confusion of spatial information between pixels when applied directly to low-level vision tasks such as HSI denoising. Unlike in high-level tasks, the relationships between pixels play an important role in low-level vision tasks: sharp transitions, mixed image content, and a lack of pixel relationships can all degrade model performance. Therefore, such DA methods hinder the model's ability to recover images and cannot be used directly for low-level tasks.
Many studies have sought to alleviate the limitations of DA in low-level vision tasks [46,47]. Radu et al. [48] used simple geometric manipulations, such as rotation and flipping, to improve the performance of single-image super-resolution (SISR); this is the most basic form of DA. Yoo et al. [47] further proposed CutBlur for the super-resolution of ordinary color images, which brought a further improvement. CutBlur introduces parts of the high-resolution image into the low-resolution image by replacing low-resolution patches with the corresponding high-resolution patches, which provides a beneficial regularization effect for model training and minimizes boundary effects. They also explored the possibility of applying this method to other low-level vision tasks.
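The CutBlur operation described above can be sketched as follows. This is an illustrative NumPy version only: the rectangle-size ratio and the (bands, H, W) array layout are our assumptions, not the settings of the original paper.

```python
import numpy as np

def cutblur(degraded, clean, ratio=0.4, rng=None):
    """CutBlur sketch: cut a random rectangle from the high-quality image
    and paste it into the degraded one. 'ratio' (relative side length of
    the cut region) and the (bands, H, W) layout are illustrative."""
    rng = rng or np.random.default_rng()
    h, w = degraded.shape[-2:]
    ch, cw = int(h * ratio), int(w * ratio)
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    out = degraded.copy()
    # inside the rectangle the sample now carries clean-image content
    out[..., y:y + ch, x:x + cw] = clean[..., y:y + ch, x:x + cw]
    return out
```

Outside the pasted rectangle, the sample is unchanged, which is why the method preserves pixel relationships better than pixel-erasing augmentations.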
However, the above methods are still not directly applicable to HSI denoising. Although CutBlur, which pastes high-resolution and low-resolution regions into each other, has improved the performance of super-resolution models, there is still room for improvement. After copying and pasting between a clean image and a noisy image with CutBlur, the noise in the newly generated samples clusters in one region, which prevents the network from fully learning the difference between clean and noisy images. We want to refine this difference so that some parts of the newly generated samples are more noticeable and can be learned well.
Motivated by this, in this work, we designed a new DA method named PatchMask for HSI denoising. First, the noisy and clean images are segmented into patches; then, a certain number of noisy patches are randomly selected and exchanged with the clean patches at the corresponding positions to generate two new training samples, each containing part of the noisy image and part of the clean image. Through our PatchMask method, the network learns not only the presence and intensity of the noise, but also which noisy regions deserve more attention. Our main contributions are as follows:
Few existing DA methods are explicitly designed for HSIs; the proposed method combines the characteristics of HSIs and is therefore more advantageous for this task. With our PatchMask, the difference between clean and noisy samples can be learned more precisely, and more attention is paid to the noisy areas.
Our PatchMask method was applied to several HSI denoising models and achieved good performance in the presence of Gaussian noise. Extensive experiments on the ICVL and CAVE datasets show that our method can improve the performance of multiple networks and has a certain universality.
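For concreteness, the patch-swapping operation described in the introduction can be sketched as follows. This is a minimal NumPy illustration; the grid size, swap ratio, and (bands, H, W) layout are our own placeholder choices, not the exact settings of Section 3.

```python
import numpy as np

def patch_mask(clean, noisy, grid=4, swap_ratio=0.5, rng=None):
    """PatchMask sketch: split both HSI cubes into a grid x grid layout of
    patches, randomly pick a subset of positions, and exchange the patches
    there between the clean and noisy cubes. Returns two complementary
    new training samples."""
    rng = rng or np.random.default_rng()
    assert clean.shape == noisy.shape  # (bands, H, W)
    _, h, w = clean.shape
    ph, pw = h // grid, w // grid
    new_a, new_b = clean.copy(), noisy.copy()
    n_swap = int(grid * grid * swap_ratio)
    idx = rng.choice(grid * grid, size=n_swap, replace=False)
    for k in idx:
        i, j = divmod(k, grid)
        ys, xs = slice(i * ph, (i + 1) * ph), slice(j * pw, (j + 1) * pw)
        new_a[:, ys, xs] = noisy[:, ys, xs]  # clean cube receives noisy patches
        new_b[:, ys, xs] = clean[:, ys, xs]  # noisy cube receives clean patches
    return new_a, new_b
```

Because the two outputs are complementary, every patch position appears once as clean and once as noisy across the pair, so the pair as a whole carries the full clean/noisy contrast.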
This paper is organized as follows. Section 2 gives a brief review of HSI denoising methods and DA methods. In Section 3, the new DA method, PatchMask, is described in detail. In Section 4, we describe the extensive experiments conducted to demonstrate the effectiveness of our method. Finally, Section 5 concludes the paper.
4. Experiments and Results
In this section, we describe our experimental procedure and present the results. To demonstrate the effectiveness of our method, we conducted comparative experiments against other DA methods on a common dataset. In addition, to demonstrate the effect of our method on networks of different sizes, we selected several networks with different numbers of parameters for testing. We also tested different application scales to illustrate how adding samples through DA affects the network.
4.1. Comparisons with Other Methods
To demonstrate the effectiveness of our method, we selected the ICVL dataset and the QRNN3D [60] network for testing; the test results are shown in Table 1. Both the original CutBlur method and our newly proposed PatchMask method achieved better performance than the baselines. The main reason is that both CutBlur and PatchMask retain the contextual content of the original image and do not cause excessive semantic loss; they only change the distribution of the noise or overlay a mask on the original image. This increases the variety of the data and the number of intermediate samples between clean and noisy images. At the same time, our proposed method enables the high-frequency information in some patches to be sufficiently trained. In this experiment, we tried to keep the parameters of the two DA methods consistent: for CutBlur, we followed the parameter settings used in the original paper, and the parameters of our method were chosen to better compare the impacts of the two methods on the network.
As shown in Figure 5, we also tested other DA methods that are commonly used on RGB images, such as Cutout [65]. Unlike in tasks such as classification and recognition, removing a large number of pixels causes difficulties for restoration, and the recovered results differed significantly from the original image. Here, we set the number of erased blocks to 1 and the block size to 2. The network results became worse after the loss of the original pixels, mainly because the network received no response for the missing pixels, which degraded its performance in the image recovery task. In addition, we conducted experiments on another DA method, Mixup [67]. In our experiments with Mixup, we found that the network performance was also degraded to some extent, mainly because the mixing operation confuses the information between the bands of the clean and noisy images.
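For reference, minimal sketches of the two compared operations are given below. The parameter values are illustrative (both methods were originally proposed for RGB classification tasks, with their own settings):

```python
import numpy as np

def mixup(x1, x2, lam=0.7):
    """Mixup sketch: a convex combination of two samples. For denoising,
    this mixes clean and noisy content within every band, which is what
    confuses the network. 'lam' is an illustrative mixing weight."""
    return lam * x1 + (1.0 - lam) * x2

def cutout(x, size=8, rng=None):
    """Cutout sketch: erase one square region of the input. The erased
    pixels are unrecoverable, which is what harms pixel-level
    restoration tasks. 'size' is illustrative."""
    rng = rng or np.random.default_rng()
    h, w = x.shape[-2:]
    y = rng.integers(0, h - size + 1)
    x0 = rng.integers(0, w - size + 1)
    out = x.copy()
    out[..., y:y + size, x0:x0 + size] = 0.0
    return out
```

In contrast to PatchMask, neither operation preserves a valid clean/noisy correspondence at every pixel: Mixup blends the two signals, and Cutout destroys pixels outright.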
4.2. Comparisons on Benchmark Datasets
4.2.1. ICVL Dataset
HSIs are difficult to obtain experimentally; therefore, we used the ICVL [29] dataset. The ICVL dataset is an HSI set acquired using a Specim PS Kappa DX4 hyperspectral camera and a rotating stage for spatial scanning. It includes 200 images, each with 31 spectral bands. For the accuracy of the experiment, we used 100 images as the training set and another 50 as the test set, and the spatial size of the inputs to the network was cropped uniformly. Figure 6 shows RGB renderings of the HSI dataset.
4.2.2. CAVE Dataset
The CAVE dataset is a database of hyperspectral images acquired to simulate a generalized assorted pixel (GAP) camera; the entire dataset contains 32 hyperspectral images of different scenes. As shown in Figure 7, these images cover a wide range of real-world materials and objects, and each image includes full-spectral-resolution reflectance data in 31 bands from 400 to 700 nm in 10 nm steps.
To demonstrate the generalization ability of the proposed method, we also tested it on this dataset. To ensure the accuracy of the experimental results, the experimental setup was the same as that described in Section 4.1; we only replaced the ICVL dataset with the CAVE dataset, keeping QRNN3D as the network. The improvement here was less significant than on the ICVL dataset, mainly because of the difference in data volume: the ICVL training set contained 100 images, while the CAVE training set contained only 26. Nevertheless, our approach was equally effective on the CAVE dataset. Please see Figure 8 and Table 2 for detailed experimental results and data.
4.3. Implementation Details
In this section, we describe the implementation details of the experiments. We used two datasets, ICVL [70] and CAVE [71]. We chose the ICVL dataset for most of the experiments due to its large data volume and high image quality; from it, we chose 100 images as the training set and 50 images as the test set. The CAVE dataset was used only to demonstrate the generalization ability of our method. Notably, all images in both datasets have 31 bands.
The QRNN3D [60] network has shown good performance in experiments as a network dedicated to hyperspectral denoising; thus, we used it as the benchmark network. In our experiments, we added Gaussian noise to the dataset and trained each network with its original loss function, following the original papers. The input images were uniformly cropped for training, and all band information was retained. Meanwhile, we used the Adam optimizer with a cosine annealing learning-rate schedule [72] for training.
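The training-sample preparation just described (spatial crop, all bands kept, additive Gaussian noise) can be sketched as follows. The crop size and noise level below are placeholders, since the exact values were fixed in the setup above:

```python
import numpy as np

def make_training_pair(hsi, crop=64, sigma=0.1, rng=None):
    """Crop a (bands, H, W) cube to crop x crop spatially, keeping all
    bands, and add i.i.d. Gaussian noise to form a (clean, noisy)
    training pair. 'crop' and 'sigma' are illustrative values."""
    rng = rng or np.random.default_rng()
    _, h, w = hsi.shape
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    clean = hsi[:, y:y + crop, x:x + crop]
    noisy = clean + rng.normal(0.0, sigma, size=clean.shape)
    return clean, noisy
```

Keeping every band in the crop matters because the networks under test (particularly the 3D-convolutional ones) exploit inter-band correlation.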
In addition, we used two common metrics, the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM), to evaluate performance. The PSNR describes the ratio between the maximum possible power of a signal and the power of the corrupting noise and is commonly employed to measure the reconstruction quality of images and videos. The SSIM treats structural information as a feature of the scene-wide structure of objects, independent of luminance and contrast, and models distortion as the interaction of these three elements. Both metrics are used in most hyperspectral denoising work; see, e.g., SDeCNN [26] and SSDRN [73]. Therefore, to ensure valid and fair comparisons, we used these two most common metrics for evaluation. In addition, all experiments were performed on the same NVIDIA GeForce RTX 3090 (24 GB) GPU for fair comparison.
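As a reference point, a band-averaged PSNR, a common convention in HSI denoising (per-band PSNR, then the mean over bands), can be computed as follows; the unit data range is an assumption about how the images are normalized:

```python
import numpy as np

def psnr(ref, test, data_range=1.0):
    """Band-averaged peak signal-to-noise ratio in dB for (bands, H, W)
    cubes: PSNR is computed per band and then averaged, as commonly
    reported for HSI denoising. 'data_range' assumes [0, 1] data."""
    vals = []
    for b in range(ref.shape[0]):
        mse = np.mean((ref[b] - test[b]) ** 2)
        vals.append(10.0 * np.log10(data_range ** 2 / mse))
    return float(np.mean(vals))
```

SSIM involves local luminance, contrast, and structure comparisons and is longer to write out; library implementations (e.g., scikit-image's `structural_similarity`) are typically used in practice.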
4.4. Comparison of Different Models
Generally speaking, the more parameters a model has, the more beneficial it is for learning: a larger parameter count means a larger model capacity and more that can be learned. We selected five networks with different numbers of parameters (DnCNN [61], CBDNet [74], HSID-CNN [28], MPRnet [62], and QRNN3D [60]). To fairly compare the impact of the DA method across these networks, we applied DA with the same parameter settings in all cases.
Below, we briefly introduce the five networks mentioned above, describe the modifications we made in our experiments so that each network could handle HSIs, and then show the results of our method on the different models.
DnCNN: The structure is shown in Figure 9. This network was the first to use residual learning for noise reduction. By combining residual learning and batch normalization (BN), the training of the denoising model can be greatly improved and accelerated. For a specific noise level, DnCNN achieves outstanding visual quality and evaluation scores. For this network, we adjusted the input and output channels uniformly to 31 to accommodate the large number of bands in HSIs. It should be noted that this network does not attend to the information between different bands and therefore has certain limitations for HSIs.
MPRnet: This is a progressive multi-stage network. As shown in Figure 10, the first two stages of the network adopt the U-Net structure, and many attention modules are embedded in the network: each stage first passes through a channel attention block, and the skip connections of the U-Net also contain a channel attention block (CAB) module. In addition, a supervised attention module (SAM) introduces supervision information between stages, and cross-stage feature fusion (CSFF) is performed in the encoder–decoder part of the U-Net to better preserve contextual information. We also set the input and output channels to 31 to accommodate the number of bands in the dataset.
CBDNet: This model is from a CVPR 2019 paper and reached state-of-the-art (SOTA) performance on the DND dataset at the time. It is geared toward removing noise from real environments, and the whole network has two components: a noise estimation sub-network and a non-blind denoising sub-network that removes noise of unknown levels. In that work, synthetic noisy images and real-world noisy images were both used to train the network, allowing it to represent the noise in real-world images and improve denoising performance. The network structure is shown in Figure 11.
HSID-CNN: Targeting the high redundancy and correlation of information in HSIs, this network performs spatial–spectral joint processing at the input of a convolutional neural network. An end-to-end nonlinear mapping from noisy images to clean images is realized with deep convolutional neural networks, which overcomes the inflexibility of other methods. The network uses multi-scale feature extraction and multi-level representation to obtain multi-scale spatial–spectral features and fuses the different features for restoration, thereby achieving better performance. The network structure is shown in Figure 12.
QRNN3D: This is an alternating-direction 3D recurrent neural network for HSI denoising that effectively exploits structural spatial–spectral correlation and global correlation information along the spectrum. The alternating-direction structure removes causal dependencies without adding extra computational cost. The model can capture spatial–spectral dependence while remaining flexible for HSIs with arbitrary numbers of bands. The network structure is shown in Figure 13.
As shown in Table 3, the performance of all of the above models improved after applying our DA method. However, for models with simpler structures, the improvement was limited, mainly because there is not much information that a simple model can learn, and the added DA samples made it harder for such models to adapt to the change. Moreover, 3D convolutions are very useful for hyperspectral denoising tasks: in Table 3, QRNN3D achieved good performance with fewer parameters because 3D convolution extracts inter-band information more efficiently. The experimental outcomes are shown in Figure 14.
4.5. Ablation Study
4.5.1. Proportion of Newly Generated Samples
The number of samples generated through DA should be investigated. If too many samples are generated through DA, the learning of the network will be biased toward the new samples rather than the original ones. Conversely, if too few samples are generated, the network cannot learn the differences between the new samples and the original samples. Therefore, we designed a set of experiments to verify how different scales of augmented data affect the network. For experimental accuracy, the remaining parameters were fixed empirically at the values that gave the best performance.
From Table 4, we can see that the network performed best when the number of new samples reached 30% of the original dataset. As the proportion of new samples increased further, the performance of the network declined somewhat. We believe that the proportion of original-sample information learned by the network decreased as the proportion of new samples increased; thus, the performance of the network model also decreased. In Figure 15, we can also see that the images were reconstructed best with 30% new samples.
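Growing the training set by a fixed fraction of augmented pairs, as in the experiment above, could look like the following sketch (the helper names and the callback signature are our own illustration, not the released code):

```python
import numpy as np

def extend_with_augmented(train_pairs, make_augmented, fraction=0.3, rng=None):
    """Append augmented samples amounting to 'fraction' of the original
    training set, built from a random subset of the originals.
    'make_augmented' is a caller-supplied callback (e.g., PatchMask)."""
    rng = rng or np.random.default_rng()
    n_new = int(len(train_pairs) * fraction)
    idx = rng.choice(len(train_pairs), size=n_new, replace=False)
    return list(train_pairs) + [make_augmented(train_pairs[i]) for i in idx]
```

Because the original samples are kept intact and only a bounded fraction of augmented ones is appended, the network still sees the original data distribution at every epoch.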
4.5.2. Total Number of Patches—α
The parameter α plays an important role in the proposed PatchMask DA method; it is the total number of patches into which the image is divided. The greater the total number of patches, the smaller the area of each patch and the finer the regions that the patches can form, making it possible to fit more complex textured areas of the image. In this way, more complex regions are trained, and the network performance improves. To test this conjecture, we conducted a set of experiments on α; the experimental results are shown in Table 5 and Figure 16.
4.5.3. The Ratio of Patch Swaps
Another key parameter is the proportion of patches that are swapped. Our DA method generates two new complementary samples. As shown in Table 6, the performance dropped significantly when the exchange ratio was too low. With a low swap ratio, the number of noisy patches decreases, and the probability that noisy patches fall in information-dense regions decreases further. The envisaged effect, namely, covering areas with complex textures, where denoising is harder, with noise masks, is then lost, resulting in some performance degradation. The visualization results of the experiment are presented in Figure 17, where the image reconstruction preserves details better at the chosen swap ratio.
The experimental results are shown in Table 6. This part of the experiment was performed with the other parameters kept the same: we set the proportion of the added samples to 30% of the original dataset, and the α parameter described above was set to 16. In addition, we again chose QRNN3D as the model and ICVL as the dataset, and during training, we randomly added Gaussian noise to the dataset.
4.6. Convergence Analysis
To prove that our DA method did not cause divergence in the original network, we show a comparison of the training loss curve obtained with our method. The abscissa in
Figure 18 represents the training epoch, and the ordinate represents the training loss. As shown in
Figure 18, the early stage of training (before epoch = 15), shown with the red curve (without our method), had consistently lower loss values than the blue curve (with our method), and the training loss curve decreased faster than the blue curve. The main reason for our analysis is that when the DA method was not used, the network needed to learn less content, and the network did not have to learn the changes in the noise distribution after DA and the parts that required the network to pay attention. Therefore, the convergence rate of the network was faster.
In the later stage of training (shown in the zoomed-in view), the loss value of the blue curve was significantly lower than that of the red curve, which was in line with our predictions: after applying the DA method, the performance of the network was further improved.