1. Introduction
As a coherent imaging system, synthetic aperture radar (SAR) is inherently affected by speckle, a random granular pattern caused by the interference of waves scattered from the observed objects. Speckle is a real SAR measurement that carries information about the observed object. However, its uncertainty, variance, and noise-like appearance hamper image interpretation, making despeckling indispensable. Despeckling research has been constantly driven by new developments in image processing. Recently, deep learning methods have shown very promising results. The latest supervised deep learning denoising methods, such as the denoising convolutional neural network (DnCNN) [1] and the fast and flexible denoising convolutional neural network (FFDNet) [2], have improved the denoising results for ordinary images compared with traditional methods. Inspired by this, many researchers have tried to apply supervised deep learning methods to SAR despeckling. Such methods require corresponding clean images as references to help the deep model learn the mapping from the noisy image to the denoised image. However, there is no “clean” SAR image. To solve this problem, some researchers have tried to build an approximate clean reference from noisy images by multi-look fusion [3]. Other researchers have used optical images or pre-denoised multi-temporal fusion images as clean references and generated simulated noisy images for training [4]. However, both the approximate clean reference approach and the simulated noise approach are problematic. Multi-look fusion cannot match the denoising results of state-of-the-art denoising filters, thus greatly limiting the denoising potential of the deep model. Simulating the noise, on the other hand, creates a domain gap [5], which causes a despeckling performance mismatch between synthesized images and real SAR images.
Due to the lack of clean ground truth, self-supervised deep learning despeckling, which requires no clean reference, is an appealing research direction. Researchers in the image denoising domain have proposed multiple self-supervised deep learning denoising methods. Some of these methods require only a single noisy image to train the network, such as deep image prior [6], noise2void [7], and noise2self [8]. Their application to SAR despeckling has just begun. Currently, various blind-spot methods have been proposed, which estimate the value of a pixel from the surrounding pixels in its receptive field. By using the SAR image statistics, Molini et al. [9] adapted Laine et al.’s [10] model to SAR despeckling and achieved a fair result. Joo et al. [11] extended the neural adaptive image denoiser (N-AIDE) to multiplicative noise, and their results outperformed the dilated residual network-based SAR filter (SAR-DRN) [12] on a simulated SAR dataset.
Using only one noisy image is valuable when independent noise realization pairs are not available for training. On the other hand, using multiple noise realizations of the same ground truth has the potential to produce better denoising results, as multiple noise realizations provide the filter with richer information about the data distribution and the noise characteristics. Lehtinen et al. [13] proposed the noise2noise method, which trains a deep learning model with multiple noisy image pairs that share the same clean target and have independent noise. Directly applying the noise2noise method to multi-temporal SAR despeckling is problematic, as multi-temporal SAR images of the same region cannot be regarded as independent noise realizations of the same clean image without ground truth concerning the temporal variation. Addressing this problem, Dalsasso et al. [14] used a simulation-based, supervised despeckling model to acquire pre-estimations of the despeckled images. They then used the difference between the pre-estimations to compensate for the temporal change and in this way formed a series of noise2noise training image pairs. Their model can be regarded as a hybrid of supervised and self-supervised learning and thus may not be fully free of the domain gap. In our previous work [15], we proposed a fully self-supervised noise2noise SAR despeckling method (NR-SAR-DL) that uses bi-temporal SAR amplitude image blocks of the same region as training pairs and weights the training loss by the estimated pixel similarity between the training images. This method yields state-of-the-art results.
The similarity-based loss weighting employed in the training process in [15] can only reduce, not fully eliminate, the influence of the temporal variation, as no lower bound on similarity was imposed on either the training image pairs or their pixel pairs. As a result, noisy images with major changes in their ground truths are not excluded from the training process and thus may taint the neural network. The influence is minor when the temporal variation is limited but evident when it is significant. Another drawback is that the multi-temporal SAR images used for training must be finely registered at the subpixel level, which is often a very time-consuming task.
Extending the original method of [15] by setting a threshold on the similarity of the training image block pairs shows promise for improving the overall despeckling performance. Such a method answers the question of whether an image pair contains two independent noise realizations of the same clean image using two quantitative parameters: an estimate of their similarity based on statistical features, and a similarity threshold that separates similar pairs from non-similar ones.
Some prominent similarity measures, like the one used in NR-SAR-DL [15], do not consider location or time. Thus, similar block pairs can be selected not only from time sequence images of the same location, as in the method of [15], but also from different locations and different times. Inspired by the irrelevance of location and time in the similarity calculation, we developed a Block-Matching and Noise-Referenced Deep Learning network filter (BM-NR-DL) that is built on similar block pairs. The main idea of this method is to search for and construct a large number of image block pairs that are deemed similar by their similarity estimation and then train the deep learning despeckling network with them. It uses only real SAR images to train the deep model and can be fully self-supervised. Compared with other self-supervised methods, the proposed method has two main innovations:
(1) For the first time in SAR despeckling, the proposed method combines the self-supervised noise2noise method with similarity-based block-matching. Blocks can be searched in a single SAR image or in multiple images, thus providing a uniform yet flexible training solution. Furthermore, when training on time sequence images of the same region, the images no longer need to be finely registered.
(2) The U-network [16] style model employed in both the original noise2noise [13] and the NR-SAR-DL solution [15] requires a minimum image block size for training, such as 64 × 64. However, large image blocks impede the block-matching procedure. The proposed method inserts a transposed convolutional layer before the encoder and minimizes the number of encoding layers without damaging the despeckling performance. The model can now be trained with a block size significantly smaller than those of [13] and [15], making block-matching possible.
Tests with both simulated and real SAR single-look intensity data show that the proposed method outperforms other state-of-the-art despeckling methods. The proposed method offers great application flexibility, as it can be trained with either one SAR image or multiple images. The method also generalizes well: a network trained using SAR images acquired by a certain sensor can readily despeckle other images acquired by the same sensor. Moreover, the proposed method scales well in both training and despeckling. For training, the flexibility of the method means that the set of image pairs used in the training process can be readily scaled up. For despeckling, the deep learning inference process is much more computationally efficient than traditional methods, such as block-matching and 3D filtering (BM3D) and probabilistic patch-based weights (PPB). The despeckling process can also be easily scaled up by taking advantage of parallel computing.
The article is organized as follows:
Section 2 reviews the related research.
Section 3 describes the main features of the proposed method.
Section 4 presents the validation of the proposed model and the comparison with other despeckling methods.
Section 5 discusses the performance of the proposed model and points to potential future research directions.
Section 6 concludes the research.
2. Related Work
Deep Learning and SAR despeckling: Denoising is an image restoration process whose purpose is to recover the clean image from its noisy observation. The noisy observation can be expressed as a function of the signal and the corruption:
$$Y = f(X, n)$$
where $Y$ is the noisy observation, $X$ is the clean image, and $n$ is the corruption. The function $f$ is largely domain specific. In SAR despeckling, it is usually described as a multiplicative noise function and as noise-free in extreme intensity conditions. Because the problem is ill-posed, image priors are required to limit and regularize the solution space. Many explicit, handcrafted priors, such as total variation (TV) [17], Markov random fields (MRFs) [18], and nonlocal means (NLM) [19], are popular among state-of-the-art denoising algorithms. The classical SAR despeckling methods, including the Lee filter [20], Kuan filter [21], Frost filter [22], PPB [23], and SAR-BM3D [24], also use handcrafted priors. However, the features of both the noise and the original image are complex and may be dictated by the application instance, which impedes the construction of widely applicable yet precise priors. Ma et al. [25] presented variational methods that introduce non-local regularization functions. Deep learning methods, on the other hand, imply priors through their network structures and perform better than methods with explicit priors [6]. Meanwhile, deep learning methods can learn priors from data with proper training [26]. The combination of implicit and learnable priors may explain why deep learning methods despeckle better than traditional ones, as demonstrated in recent studies [12,27].
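As a minimal sketch of the multiplicative model, the following NumPy snippet simulates single-look intensity speckle on a constant scene. It assumes fully developed speckle drawn from a unit-mean gamma distribution, which is a common textbook model rather than something specified in this article; the function name `add_speckle` and the scene values are hypothetical.

```python
import numpy as np


def add_speckle(clean_intensity, looks=1, rng=None):
    """Multiply a clean intensity image by unit-mean gamma speckle (assumed model)."""
    rng = np.random.default_rng() if rng is None else rng
    # Gamma(shape=L, scale=1/L) has mean 1 and variance 1/L.
    speckle = rng.gamma(shape=looks, scale=1.0 / looks, size=clean_intensity.shape)
    return clean_intensity * speckle


rng = np.random.default_rng(0)
clean = np.full((256, 256), 100.0)           # hypothetical homogeneous scene
noisy = add_speckle(clean, looks=1, rng=rng)

# On a homogeneous region, mean^2 / variance estimates the number of looks.
estimated_looks = noisy.mean() ** 2 / noisy.var()
print(f"mean = {noisy.mean():.1f}, estimated looks = {estimated_looks:.2f}")
```

Because the speckle has unit mean, the sample mean stays near the clean value while the variance grows with the signal level, which is why the corruption cannot be treated as additive noise of fixed strength.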
The deep learning despeckling often consists of a training process and an inference process. The training process aims to find the parameter set $\theta$ of the network $F_{\theta}$ that minimizes the sum of the loss calculated by the loss function $L$ between the network output image and its corresponding training reference:
$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{M} L\left(F_{\theta}(Y_i), R_i\right)$$
where $M$ is the number of training samples, $Y_i$ is the $i$-th noisy training input, and $R_i$ is the $i$-th training reference. Due to the large number of training samples used in the learning process, each optimization step often employs only a small subset of the training samples, achieving stochastic gradient descent of the loss. Once the network is well trained, it infers the clean estimation $\hat{X} = F_{\hat{\theta}}(Y)$ from noisy images $Y$.
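The mini-batch optimization described above can be illustrated with a toy example. The sketch below is not the article's network: it assumes a linear model in place of the CNN, a squared-error loss, and synthetic data, and all names (`inputs`, `references`, `theta`) are hypothetical. It only shows how each step uses a small subset of the samples, with noisy references standing in for clean ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: M samples and a linear "network" F_theta(y) = y @ theta.
M, dim = 2000, 3
true_theta = np.array([1.0, -2.0, 0.5])
inputs = rng.normal(size=(M, dim))
# The references are noisy, as in noise-referenced training.
references = inputs @ true_theta + rng.normal(scale=0.1, size=M)

theta = np.zeros(dim)
learning_rate, batch_size, epochs = 0.05, 32, 20

for _ in range(epochs):
    order = rng.permutation(M)               # reshuffle the samples every epoch
    for start in range(0, M, batch_size):
        batch = order[start:start + batch_size]
        residual = inputs[batch] @ theta - references[batch]
        # Gradient of the mean squared error over this mini-batch only.
        grad = 2.0 * inputs[batch].T @ residual / len(batch)
        theta -= learning_rate * grad

# "Inference": apply the fitted parameters to inputs.
estimate = inputs[:5] @ theta
print(np.round(theta, 3))
```

The fitted parameters approach the generating ones even though no noise-free reference was ever used, because zero-mean reference noise averages out over many samples.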
The two main factors that influence the despeckling performance are the network structure, which dictates the handcrafted prior, and the learning method, which dictates the learnable prior. Currently, the most common network structure employed for SAR despeckling is the convolutional neural network (CNN) with an encoder–decoder structure [28], Siamese structure [15], N3 non-local layer [3], residual architecture [28], or generative adversarial network [29]. The input of the network can be the original SAR images [30] or their log transformation [31], and the output of the network can be the residual content, the denoised image, or the weights for a further denoising process [3]. On the other hand, self-supervised learning has emerged as a promising trend, as it avoids the domain gap introduced by supervised learning methods, which employ simulated data in their training. The existing self-supervised methods require either a single noisy image or multiple noisy image pairs. Compared with a single noise instance, using multiple noisy images with independent noise provides more information about the noise and the original signal. It may lead to better denoising performance, as demonstrated in [7].
Noise2Noise and SAR despeckling: Among the deep learning denoising methods that use only noisy images, noise2noise [13] achieves state-of-the-art results. This method trains the model with a series of noisy image pairs, where each pair contains two independent, zero-mean noise realizations of the same scene, one as the training input and one as the training reference. The main obstacle to applying noise2noise to SAR despeckling is the construction of noisy SAR image pairs that refer to the same ground truth, since the ground truth is not directly available. Different observations of the same land area cannot be deemed fully stationary, as land features are constantly changing. Addressing this problem, some researchers have constructed the image pairs by using synthesized speckled images [14,32,33], either directly for network training [32,33] or as an initial step [14] to evaluate and compensate for the change that occurs between multi-temporal images acquired over the same area. Training with synthesized noisy images may inevitably bring in the domain gap. On the other hand, the NR-SAR-DL method previously proposed by our research group [15] employs no synthesized noisy images. As demonstrated in Figure 1, this method directly uses multi-temporal observations of the same area as the noise2noise training set. The denoising network is an encoder–decoder convolutional network with skip connections, trained in a Siamese way with two branches. The method compensates for the temporal change by weighting the training loss with the pixel-level similarity estimation.
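The reason a noisy reference can stand in for a clean one may be sketched as follows. The derivation assumes an L2 loss and a reference $Y_2$ whose noise is independent of the input $Y_1$ given the clean image $X$, with $\mathbb{E}[Y_2 \mid X] = X$; the symbols $Y_1$, $Y_2$, and $F_\theta$ are illustrative notation rather than notation taken from the cited works.

```latex
\begin{aligned}
\mathbb{E}\,\lVert F_\theta(Y_1) - Y_2 \rVert^2
 &= \mathbb{E}\,\lVert F_\theta(Y_1) - X \rVert^2
  - 2\,\mathbb{E}\big[(F_\theta(Y_1) - X)^{\top}(Y_2 - X)\big]
  + \mathbb{E}\,\lVert Y_2 - X \rVert^2 \\
 &= \mathbb{E}\,\lVert F_\theta(Y_1) - X \rVert^2
  + \underbrace{\mathbb{E}\,\lVert Y_2 - X \rVert^2}_{\text{independent of } \theta}
\end{aligned}
```

The cross term vanishes because, conditioned on $X$, the two factors are independent and $\mathbb{E}[Y_2 - X \mid X] = 0$. The noisy-reference loss and the clean-reference loss therefore share the same minimizer. For unit-mean multiplicative speckle in the intensity domain, $Y_2 = X n$ with $\mathbb{E}[n] = 1$ and $n$ independent of $X$ gives $\mathbb{E}[Y_2 \mid X] = X$, so the condition holds only when both observations share the same $X$; temporal change between acquisitions violates exactly that premise.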
Block-matching in SAR despeckling algorithms: The self-similarity present in noisy images makes it possible to construct a denoising algorithm by matching similar blocks and then applying a nonlocal filter, as in BM3D [34]. As speckled SAR images also present self-similarity, several despeckling algorithms are built on block-matching and perform well, such as SAR-BM3D [24] and fast adaptive nonlocal SAR (FANS) [35]. Block-matching can also be carried out among multi-temporal SAR images [36]. Block-matching has recently been embedded in some deep learning denoising methods [37,38,39] and despeckling methods [40,41]. However, the networks involved in these methods still require supervised training with clean references. Combining block-matching with self-supervised noise-referenced network training has only emerged in the latest pre-print, with feasible results [42].
4. Experiments and Results
We conducted experiments on both simulated and real SAR data to gauge the performance of the proposed despeckling method. The training was carried out with either a single SAR image or time sequence images from the same sensor. We compared the despeckling results of the proposed model with the following four state-of-the-art methods: for single-image despeckling, the probabilistic patch-based (PPB) filter [23] and the dilated residual network-based SAR (SAR-DRN) filter [12]; and for multitemporal despeckling, the 3D block matching-based multitemporal SAR (MSAR-BM3D) filter [36] and the noise-referenced deep learning (NR-SAR-DL) filter [15]. All these methods were run using the source code provided by the authors of the respective articles. The proposed method was implemented in PyTorch [45] and run on a workstation with an Intel 10920X CPU, two NVIDIA 2080Ti GPUs, and 128 GB of RAM. The experiment employed the Adam method [46] to optimize the network in the training process, with the parameters set to β1 = 0.9 and β2 = 0.999. The learning rate was set to 0.003 without decay during the training. The block size was set to 13 × 13 pixels, and
was set to 2 for similarity calculation in the second pass for all the dataset except Sentinel-1, where
was set to 1. The loss weight λ was set to 0 for the first pass and 0.5 for the second pass. Due to limited computing capacity, each test used 10,000 index blocks, and similar blocks were searched within 90 × 90 pixels from the center of each index block. For each index block, the 32 most similar blocks were retrieved to construct training pairs. We chose the similarity threshold so as to eliminate the 10% outliers among the training pairs produced by the similarity search. However, fine-tuning this threshold based on the denoising result was still required. The source code of the proposed method and the experiment dataset can be retrieved from
https://github.com/githubeastmad/bm-nr-dl (accessed on 19 December 2021).
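The block search described above can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the similarity measure used here (mean squared difference of log-intensity) is a stand-in, since the statistic actually used by the method is not defined in this section, and the names `match_blocks` and `drop_outliers` are hypothetical. The block size of 13 and the 32 matches per index block follow the text; the search radius is reduced to keep the example fast.

```python
import numpy as np


def match_blocks(image, center, block=13, search=45, k=32):
    """Return (distance, row, col) for the k blocks most similar to the index block."""
    half = block // 2
    log_img = np.log(image + 1e-6)           # stand-in similarity works on log-intensity
    rows, cols = image.shape
    r0, c0 = center
    ref = log_img[r0 - half:r0 + half + 1, c0 - half:c0 + half + 1]
    found = []
    for r in range(max(half, r0 - search), min(rows - half, r0 + search + 1)):
        for c in range(max(half, c0 - search), min(cols - half, c0 + search + 1)):
            if (r, c) == (r0, c0):
                continue                      # skip the index block itself
            cand = log_img[r - half:r + half + 1, c - half:c + half + 1]
            found.append((float(np.mean((cand - ref) ** 2)), r, c))
    found.sort()                              # smallest distance = most similar
    return found[:k]


def drop_outliers(pairs, keep_fraction=0.9):
    """Keep the pairs whose distance is below the keep_fraction quantile."""
    distances = np.array([d for d, _, _ in pairs])
    threshold = float(np.quantile(distances, keep_fraction))
    return [p for p in pairs if p[0] <= threshold], threshold


rng = np.random.default_rng(0)
image = 50.0 * rng.gamma(shape=1.0, scale=1.0, size=(100, 100))  # hypothetical speckled image
matches = match_blocks(image, center=(50, 50), search=20)
kept, threshold = drop_outliers(matches)
print(len(matches), len(kept))
```

In this sketch the quantile threshold is applied to the matches of one index block for brevity; applying it to the pooled distances of all index blocks would correspond more closely to removing the 10% outliers from the whole training set.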
Figure 3 demonstrates the training curve for a TerraSAR image. During the training, the loss converged very quickly but did not decrease close to zero, since the training output was a clean image while the reference was its noisy counterpart. Training on the TerraSAR image took around 9 min per pass, with 1500 iterations and a batch size of 300, to achieve a decent denoising result. For comparison, the PPB filter took around 4 min on the same machine.
4.1. Experiments with Simulated Images
Using the same optical image, we synthesized single-look speckled SAR images for the experiments. Two tests were carried out: the first test trained the network on a single-look speckled image and despeckled the same image. The second test trained the network on bi-temporal images and despeckled the image used for training. The despeckled results of the proposed model were quantitatively compared with those of the other models by the peak signal-to-noise ratio (PSNR) [47], the structural similarity (SSIM) index [48], and the equivalent number of looks (ENL) [49]. Table 2 lists the assessment results, and Figure 4 shows the filtering results of the different methods. Table 2 also includes the rank-sum of each filter, which is the sum of the filter's performance rankings in the three metrics. In the experiments, the PPB filter presented an over-smoothed result with considerable loss of fine linear structures. The SAR-DRN filtered image still had notable speckle. It also presented a lower range of gray levels compared with the clean image. The MSAR-BM3D filtered image had a block effect in the homogeneous region. The linear structures in the urban region were also blurred to some degree. NR-SAR-DL introduced some undesirable point-like artifacts. It also had a subtle block effect in the homogeneous area. The proposed BM-NR-DL filter, trained using either a single noisy image or bi-temporal images, yielded good despeckling results with visually pleasing output. The despeckled image presented no significant artifacts or blurs. The quantitative evaluation listed in Table 2 supports these observations.
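For readers who want to reproduce this kind of assessment, the sketch below computes PSNR, ENL, and a rank-sum in NumPy. It uses the common definitions (PSNR from the mean squared error with an assumed peak of 255, ENL as squared mean over variance of a homogeneous region); the function names and the example scores are hypothetical and are not values from Table 2. SSIM is omitted because it needs a windowed implementation.

```python
import numpy as np


def psnr(clean, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB; the peak value is an assumption."""
    mse = np.mean((np.asarray(clean, dtype=float) - np.asarray(estimate, dtype=float)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))


def enl(region):
    """Equivalent number of looks of a (presumed homogeneous) region."""
    region = np.asarray(region, dtype=float)
    return float(region.mean() ** 2 / region.var())


def rank_sum(scores):
    """Sum of per-metric ranks (1 = best), assuming higher is better for every metric."""
    names = list(scores)
    totals = dict.fromkeys(names, 0)
    n_metrics = len(next(iter(scores.values())))
    for m in range(n_metrics):
        ordered = sorted(names, key=lambda name: scores[name][m], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return totals


# Hypothetical scores per filter: [PSNR, SSIM, ENL].
example_scores = {"filter_a": [30.0, 0.90, 50.0], "filter_b": [28.0, 0.95, 40.0]}
example_ranks = rank_sum(example_scores)
print(example_ranks)
```

A lower rank-sum indicates a filter that ranks well across metrics rather than excelling in only one, which is the purpose of the rank-sum column.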
4.2. Experiments with Real SAR Images
We then inspected the despeckling performance of the proposed model using real single-look SAR datasets from three different sensors: TerraSAR, Sentinel-1, and ALOS. The TerraSAR images were acquired over Ruhr, Germany on 20 February and 4 March 2008. The Sentinel-1 images were acquired over Wuhan, China on 26 April and 8 May 2019. The ALOS images were acquired in 2007 and 2010 and contained major changes. The multi-temporal data had not been finely registered. Two tests were carried out for each dataset: the first test trained the despeckling deep model with only one speckled image and then despeckled the same image. The second test trained the network with two speckled images and despeckled the two images. Since no ground truth was available, we compared the performance of the proposed method with the others using the following four quantitative metrics: the ENL, the edge-preservation degree based on the ratio of average (EPD-ROA) [50], the mean of ratio (MOR), and the target-to-clutter ratio (TCR) [51]. The EPD-ROA metric is calculated as follows:
$$\text{EPD-ROA} = \frac{\sum_{i=1}^{m} \left| D_{1}(i) / D_{2}(i) \right|}{\sum_{i=1}^{m} \left| O_{1}(i) / O_{2}(i) \right|}$$
where $D_{1}(i)$ and $D_{2}(i)$ are the adjacent pixel pair of a certain direction in the despeckled image, $O_{1}(i)$ and $O_{2}(i)$ are the corresponding pixel pair in the original noisy image, and $m$ is the number of pixel pairs. The EPD-ROA metric describes how the denoising filter retains edges along certain directions. Its value should be close to 1 if the edge preservation is good. The MOR metric is the mean of the ratio between the despeckled image and the original one. It describes how the denoising filter preserves the radiometric information and, if the preservation is good, should be close to 1. The TCR is calculated with image patches as follows:
$$\text{TCR} = \left| 20 \log_{10} \frac{\max(X_{d})}{\operatorname{mean}(X_{d})} - 20 \log_{10} \frac{\max(X_{n})}{\operatorname{mean}(X_{n})} \right|$$
where $X_{d}$ and $X_{n}$ denote the despeckled and noisy image patches with a high return point, respectively. The TCR measures the radiometric information preservation of the region around the high return point and should be close to 0.
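The three reference-free metrics can be sketched in NumPy as follows, using their commonly used definitions. Some choices here are assumptions: the EPD-ROA is computed along one image axis only, the MOR is taken as the mean of the pixelwise ratio of the noisy image to the despeckled image, and the function names are hypothetical.

```python
import numpy as np


def epd_roa(despeckled, noisy, axis=1, eps=1e-12):
    """Edge-preservation degree: ratio of summed adjacent-pixel ratios along one axis."""
    def ratio_sum(img):
        img = np.asarray(img, dtype=float)
        first = img[:, :-1] if axis == 1 else img[:-1, :]
        second = img[:, 1:] if axis == 1 else img[1:, :]
        return np.sum(np.abs(first / np.maximum(second, eps)))
    return float(ratio_sum(despeckled) / ratio_sum(noisy))


def mean_of_ratio(despeckled, noisy, eps=1e-12):
    """Mean of the pixelwise ratio noisy / despeckled (orientation is an assumption)."""
    despeckled = np.asarray(despeckled, dtype=float)
    noisy = np.asarray(noisy, dtype=float)
    return float(np.mean(noisy / np.maximum(despeckled, eps)))


def tcr(despeckled_patch, noisy_patch):
    """Absolute difference, in dB, of the max-to-mean contrast of the two patches."""
    def contrast_db(patch):
        patch = np.asarray(patch, dtype=float)
        return 20.0 * np.log10(patch.max() / patch.mean())
    return float(abs(contrast_db(despeckled_patch) - contrast_db(noisy_patch)))


rng = np.random.default_rng(0)
noisy = 1.0 + rng.gamma(1.0, 1.0, size=(64, 64))   # hypothetical positive-valued image

# A patch with one strong return, and a filter output that flattened it completely.
target_patch = np.array([[1.0, 1.0], [1.0, 3.0]])
flat_patch = np.full((2, 2), 1.5)
tcr_flat = tcr(flat_patch, target_patch)
print(epd_roa(noisy, noisy), mean_of_ratio(noisy, noisy), round(tcr_flat, 2))
```

A filter that leaves the image unchanged scores 1, 1, and 0 on these metrics, which is why they are read together with the ENL: perfect preservation alone says nothing about speckle suppression.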
Table 3 lists the assessment results, and Figure 5, Figure 6 and Figure 7 show the despeckling results of the experiments with the ALOS, Sentinel-1, and TerraSAR datasets, respectively. Table 3 also includes the rank-sum of each filter, which is the sum of the filter's performance rankings in the four metrics. For the TerraSAR dataset, PPB, BM-NR-DL, and NR-SAR-DL presented superior speckle suppression capacity, as indicated by the ENL. However, PPB over-smoothed the image and yielded unnatural despeckling results. It also introduced undesirable artifacts in the despeckled image. SAR-DRN retained the fine structures well but was poor in speckle suppression. NR-SAR-DL achieved very good speckle suppression but failed to retain some strong point targets, as shown by the TCR value for the selected highlight. MSAR-BM3D yielded natural despeckling output with good highlight retention. However, its result presented some artifacts and had a significant block effect. The BM-NR-DL filter showed the best speckle suppression, good preservation of fine structures, and good retention of strong point targets. The despeckling results of the model trained with a single image presented higher ENL and TCR than those of the model trained with two images. However, the latter model better preserved the fine structures and radiometric information.
For the Sentinel-1 and ALOS datasets, although PPB presented the highest ENL, visual inspection showed that its results missed fine structures and were over-smoothed. On the other hand, the SAR-DRN filter largely failed to eliminate the speckle. It also greatly distorted the radiometric information of the speckled image. The despeckling results of the MSAR-BM3D method had artifacts and presented block effects, similar to the results with the simulated and TerraSAR images. NR-SAR-DL showed good performance in all four metrics. However, BM-NR-DL performed even better, except for EPD and TCR in the Sentinel-1 dataset. Visual inspection also revealed that the BM-NR-DL method preserved more fine structures and had much less radiometric distortion than the NR-SAR-DL method. The only exception was that the ALOS image despeckled by BM-NR-DL had a lower contrast than that despeckled by NR-SAR-DL. However, the contrast could easily be restored by post-processing, with the fine structures still well preserved. The high return preserving capability of BM-NR-DL was acceptable. In sum, BM-NR-DL achieved a good balance between speckle suppression and structure preservation. Unlike other methods, whose results look over-sharpened, the despeckled image of BM-NR-DL looked more natural.
4.3. Generalization Experiments
Generalization capacity, which refers to the performance of a denoising method on a non-training dataset, is not a necessity for BM-NR-DL, as this method can be self-supervised. However, exploiting the generalization capacity of the despeckling network is valuable, since using a pre-trained network has a great performance advantage compared with training an ad hoc model for each new noisy image.
To understand the generalization capability of the denoising network, we used the models trained in the previous experiments with real SAR data to despeckle images that were not used in the training process. A preliminary evaluation with images acquired by a sensor different from that of the training images was unsatisfactory. As a result, generalization experiments were carried out only with images from the same sensor. For TerraSAR, the single-image-trained network was used to despeckle another TerraSAR image of the same region but with significant temporal alteration. In addition, the two-image-trained network was used to despeckle another TerraSAR image of a new region. For Sentinel-1, the two-image-trained network was used to despeckle another Sentinel-1 image of the same region but with significant temporal alteration. In the tests of both datasets, we used the PPB filter as the reference. The images are shown in Figure 6h,i and Figure 7h,j, and the quantitative analysis is shown in Table 4.
The experiment showed positive results. The visual inspection of the despeckling result showed that the speckle was effectively eliminated without significant loss in the fine structure in accordance with the quantitative evaluation. We spotted no obvious block effects and very few artifacts. In contrast, the PPB filter produced an over-smooth result with noticeable artifacts. The quantitative evaluation indicated a slight decrease in the EPD value for the TerraSAR dataset when compared with the despeckling result of the training images. The previous ad hoc model despeckling result showed a higher EPD value than PPB in contrast to a slightly lower value than PPB in the current experiment. However, there was no perceivable deterioration in the edge perception.
The observations from the above experiment demonstrate that the generalization capability of the denoising network is very promising when used for a given sensor. This generalization capacity brings application flexibility. A noisy image can now be despeckled either by an ad hoc trained network or by a network previously trained with images from the same sensor. The ad hoc trained network can yield promising despeckling results. On the other hand, the pre-trained network can process the noisy image much more quickly with very limited performance reduction. In our experiment, despeckling a 2400 × 2400 Sentinel-1 image took only around 10 s, while PPB required over 400 s.
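One way such inference can be scaled to large scenes is to process overlapping tiles independently and average the overlaps. The sketch below is only an illustration of that idea, not the procedure used in the experiments: the tile size, the overlap, and the function `despeckle_tiled` are assumptions, and `model` stands for any callable that maps a 2D array to a 2D array of the same shape.

```python
import numpy as np


def despeckle_tiled(image, model, tile=256, overlap=16):
    """Apply `model` tile by tile and average the outputs where tiles overlap."""
    rows, cols = image.shape
    out = np.zeros((rows, cols), dtype=float)
    count = np.zeros((rows, cols), dtype=float)
    step = tile - overlap                     # consecutive tiles share `overlap` pixels
    for r in range(0, rows, step):
        for c in range(0, cols, step):
            r1, c1 = min(r + tile, rows), min(c + tile, cols)
            # Each tile is independent, so tiles could be batched or sent to several devices.
            out[r:r1, c:c1] += model(image[r:r1, c:c1])
            count[r:r1, c:c1] += 1.0
    return out / count


rng = np.random.default_rng(0)
scene = rng.random((600, 500))                # hypothetical image
# An identity "model" is used here only to check that the stitching is lossless.
stitched = despeckle_tiled(scene, lambda t: t)
print(stitched.shape)
```

Averaging the overlapping borders is a simple way to reduce visible seams between tiles; a real network with a limited receptive field may instead discard the tile borders.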
5. Discussion
The proposed BM-NR-DL despeckling method combines block-matching and noise-referenced SAR deep learning despeckling. To our knowledge, it is the first such attempt in this domain. The proposed model shows good despeckling results in the previous experiments with both synthesized and real SAR data. Its despeckling result is visually pleasing, with a balance between speckle suppression and detail preservation, and its quantitative evaluation metrics are generally superior to those of the reference methods. These experiments also demonstrate the applicability of BM-NR-DL in various training and despeckling settings. The noisy image patches for network training can be acquired from either a single SAR image or multiple images from the same sensor. Synthesized noisy data or a “clean” image are no longer necessary, thus avoiding the domain gap introduced by synthesized data. The despeckling network also has good generalization capability. It can be trained specifically for the targeted noisy images, or it can be pre-trained with other images from the same sensor, thus saving calculation time for a specific mission.
Despite the overall promising results, the high return preservation of the proposed method can only be regarded as fair. By carefully inspecting the highlights in the tested images, we found that the structure of the highlights is largely preserved in the despeckled result, yet the radiometric information is distorted to some degree. In the experiments, the despeckling performance on highlights is very sensitive to the training settings, such as the block-matching search radius and the nature of the noisy training images. Extending the search radius has a positive impact on high return preservation. This observation may indicate that the current block-matching process has not found and constructed enough similar noisy image pairs that contain high returns. In addition, some of the training settings, such as λ in the training loss calculation and the similarity threshold, are set according to empirical test results. They may require repeated manual fine-tuning, which could be time-consuming. The training of the proposed deep network also requires more time than classical denoising approaches. Thus, practitioners may expect a longer denoising time if training an ad hoc network for the best denoising result. On the other hand, denoising with a pre-trained network is significantly quicker than classical denoising approaches.
Potential further investigations include using lookup tables for a wider range of block searching, such as that of [36], and introducing auto-adaptive similarity estimation according to different image blocks, such as that of [25]. Moreover, using lookup tables and variational methods may raise an interesting research question for future work: instead of the randomized index block selection in the current BM-NR-DL method, how can training image pairs be constructed that lead to better performance and generalization capability? Discovering rules for parameter setting could also facilitate this process and may lead to an automatic solution in the future.