1. Introduction
Image denoising is one of the basic tasks of computer vision and is of wide interest to academia and industry, as it can effectively improve image quality. The purpose of image denoising is to remove noise from a corrupted image and restore its original content as much as possible. It is often used as a preprocessing step to improve the practical performance of higher-level computer vision tasks [1]. Over the past few decades, many outstanding image denoising methods, as shown in Figure 1, have been proposed, including filtering-based [2,3], sparse-representation-based [4,5,6,7,8], external-prior-based [9,10,11,12], low-rank-representation-based [13,14], and deep-learning-based methods [15,16,17,18].
Filtering-based methods were the first techniques to be applied to image denoising and rely on the self-similarity of images. Well-known approaches include Gaussian filtering, mean filtering, and median filtering. These three methods assume that the pixels in an image do not exist in isolation and are related to neighboring pixels. However, Buades et al. [2] found that similar pixels are not limited to local areas, and making full use of the redundant information in an image can improve the denoising performance. Hence, they proposed the nonlocal means (NLM) filtering method, which extends existing smoothing filters. Although NLM can achieve good denoising performance, it needs to find a sufficient number of similar blocks when computing each pixel, which gives it high computational complexity. To solve this problem, Dabov et al. [3] proposed the block-matching and 3D filtering (BM3D) method, which has good denoising performance and fast computational speed.
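The nonlocal idea described above can be illustrated with a toy 1D example. This is a minimal sketch, not the implementation of Buades et al. [2]: the function name `nlm_1d`, the patch radius, and the filtering parameter `h` are assumptions chosen for illustration, and the search covers the whole signal rather than a limited window.

```python
import math


def nlm_1d(signal, patch_radius=1, h=0.5):
    """Toy nonlocal means on a 1D signal.

    Each sample is replaced by a weighted average of all samples, where the
    weight depends on how similar the surrounding patches are.
    """
    n = len(signal)

    def patch(i):
        # Patch centred on index i; border samples are replicated.
        return [signal[min(max(i + k, 0), n - 1)]
                for k in range(-patch_radius, patch_radius + 1)]

    patches = [patch(i) for i in range(n)]
    out = []
    for i in range(n):
        weights = []
        # Comparing every patch with every other patch is what makes
        # nonlocal means computationally expensive.
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(patches[i], patches[j]))
            weights.append(math.exp(-d2 / (h * h)))
        total = sum(weights)
        out.append(sum(w * s for w, s in zip(weights, signal)) / total)
    return out
```

The nested loop over all pairs of positions makes the quadratic cost visible, which is the complexity issue the paragraph above attributes to NLM.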
Sparse-representation-based methods are based on image sparsity and achieve image denoising by training an over-complete dictionary. A representative case is the K-singular value decomposition (KSVD) method using sparse representation [4]. Inspired by KSVD, Mairal et al. [5] combined image self-similarity with sparse coding to decompose similar patches using similar sparse patterns, thus forming the Learned Simultaneous Sparse Coding (LSSC) method. Although sparse representation models have shown good results in image denoising, the sparse representation of traditional models may not be accurate enough due to the degradation of the observed images. To further improve the performance of image denoising based on sparse representation, Dong et al. [6] proposed a nonlocally centralized sparse representation (NCSR), which transformed the denoising problem into a problem of suppressing sparse coding noise. In addition, because the sparse coding of images using a single transform can limit performance, Wen et al. [7] proposed a structured over-complete sparsifying transform model with block cosparsity (OCTOBOS). These methods [4,5,6,7] have exhibited good results in removing additive white Gaussian noise (AWGN); however, they struggle to achieve good performance on real noisy images. To achieve better denoising of real images, Xu et al. [8] proposed a trilateral weighted sparse coding scheme (TWSC).
External-prior-based methods realize image denoising by using the statistical properties of natural images. A representative method is the denoising method based on expected patch log likelihood (EPLL) proposed by Zoran and Weiss [9]. This method applies a Gaussian mixture model to learn prior knowledge from a large number of natural image patches and applies it to the denoising of other natural images. Similar to EPLL, Xu et al. [10] proposed a patch group prior-based denoising method (PGPD) that learns the self-similar features of natural images from groups of similar patches using a Gaussian mixture distribution. Inspired by EPLL, Chen et al. [11] proposed an external patch prior-guided internal clustering approach, named PCLR, which combines an external image prior with an internal self-similarity prior. To improve the texture restoration capability of image denoising, Zuo et al. [12] proposed a gradient histogram preservation method (GHP) based on texture enhancement. GHP improves texture recovery by constraining the gradient histogram of the denoised image to stay close to a reference gradient histogram estimated for the clean image.
Low-rank representation-based methods exploit the low-rank properties of natural images and achieve denoising by extracting their low-rank components. A typical case is the Weighted Nuclear Norm Minimization (WNNM) proposed by Gu et al. [
13]. Low-rank matrix factorization is also a method used to extract low-rank components from a dataset and is often applied in cases where the image size is large and its rank is much smaller than the length and width of the dataset. The most well-known method is the low-rank matrix factorization based on variational Bayesian (VBMFL), which was proposed by Zhao et al. [
14]. This method improves the robustness of the model to outliers by using a Laplace distribution to establish a noise model.
Deep-learning-based denoising methods are currently the most popular. They usually learn the direct mapping from the corrupted image to the clean image or the noise. Since deep-learning-based denoising methods do not rely on image priors (e.g., self-similarity, sparsity, gradient, statistical properties, and low-rank properties), they do not have to spend much time finding and processing similar blocks in the images. Thus, they not only achieve a good denoising performance but also have a fast inference speed. Schmidt et al. [
15] proposed a method based on a cascade of shrinkage fields (CSF) to improve the denoising performance while considering computational efficiency. Chen et al. [
16] extended conventional nonlinear reaction–diffusion models with several parametrized linear filters as well as several parametrized influence functions and proposed a trainable nonlinear reaction–diffusion method (TNRD). Although CSF and TNRD show good denoising performance, they can only provide the best denoising results at known noise levels. To solve the problem of blind image denoising, Zhang et al. [
17] proposed a deep learning method using a denoising convolutional neural network (DnCNN), which was the first application of residual learning to general image denoising. The application of residual learning to image denoising has greatly improved the denoising performance of networks and inspired many outstanding denoising methods based on deep learning [
17,
18,
19,
20,
21,
22]. In addition, Zhang et al. [
18] further improved the DnCNN and proposed a fast and flexible denoising convolutional neural network (FFDNet), which achieves a good trade-off between the inference speed and denoising performance by downsampling and manually inputting a noise estimation map. Binh et al. [
23] combined DnCNN with ResNet and proposed a convolutional denoising neural network called FlashLight CNN. A complex-valued deep convolutional neural network called CDNet was proposed by Quan et al. [
24], and it effectively improved the denoising performance of the network. Guan et al. [
25] proposed an image denoising method for remote sensing images called MRFENet. It demonstrated good denoising performance and preserved the edge details of the images. Zhang et al. [
26] utilized dilated convolutions to capture more contextual information and then proposed a hybrid denoising neural network called HDCNN to enhance the denoising performance of CNN networks in complex application scenarios. Tian et al. [
27] combined dynamic convolution, wavelet transform, and discriminative learning to propose a convolutional neural network based on the wavelet transform called Multi-stage Image Denoising CNN with the Wavelet Transform (MWDCNN). To reduce the parameter size and training burden of deep denoising networks, Tang et al. [
28] employed a cascaded residual network and proposed a lightweight, multi-scale, efficient convolutional neural network.
The results of most denoising methods are obtained directly from the fusion of high-level features, while low-level features containing texture and contour information are ignored, resulting in the loss of some important information. Furthermore, since the convolutional operation prefers to extract local features, it is difficult to extract global information such as textures and contours when the objects in the image are relatively large. To solve these problems, an N-shaped convolutional neural network, named NSNet, using multi-scale features is proposed in this paper. In this model, a 2D Haar wavelet is used to construct an image pyramid that contains high- and low-frequency components of the corrupted image at different scales. The multi-scale features are extracted from the image pyramid by a U-shaped convolutional network [
29], and the low- and high-level features are fused by skip connections in the U-shaped network. The 2D Haar wavelet is widely used in image denoising, and many scholars have achieved excellent denoising performance with it [
30,
31,
32,
33]. To verify the denoising performance of NSNet, the denoising of gray and color images was carried out at different noise levels and compared with existing denoising methods. The contributions of this work are summarized as follows:
- (1) An N-shaped convolutional neural network for extracting multi-scale information is proposed. The network exploits multi-scale information to compensate for the drawbacks of convolutional operations in extracting global features, which effectively improves the network’s ability to recover textures and contours.
- (2) A scheme for constructing image pyramids using a 2D Haar wavelet is proposed. The image pyramid is obtained by using a multi-scale 2D Haar wavelet, and each layer of the pyramid contains one low-frequency component and three high-frequency components. In image denoising, the high-frequency components can be used as an estimate of the noise level to facilitate denoising.
- (3) NSNet shows good denoising performance for AWGN at a noise level range of (0, 55) and good recovery of textures and contours. It provides a solution for applications that need not only denoising but also texture and contour recovery.
The rest of this paper is organized as follows. Section 2 presents the techniques involved in the proposed model. Section 3 describes the proposed NSNet and the construction of the image pyramid in detail. Section 4 presents the results of experiments, and Section 5 concludes the paper.
3. The Proposed NSNet Model
In this section, the proposed NSNet model is introduced in detail; its architecture is shown in
Figure 6. It mainly consists of a multi-scale input layer and a multi-scale feature extraction layer. The multi-scale input layer uses a 2D Haar wavelet to create an image pyramid, which decomposes the corrupted image into high- and low-frequency components at different scales. The multi-scale feature extraction layer uses a U-shaped convolutional network to extract features at different scales from the image pyramid. Additionally, NSNet sets the mean-squared error as the loss function and uses the residual learning strategy to learn the noise directly.
The 2D Haar wavelet can decompose the image into four sub-images, each with half the height and width of the original image. By using a 2D Haar wavelet to decompose the image, we can obtain

$$(y_{LL}^{1}, y_{LH}^{1}, y_{HL}^{1}, y_{HH}^{1}) = \mathcal{W}(y),$$

where $\mathcal{W}(\cdot)$ represents the 2D Haar wavelet, $y$ is the corrupted image, $y_{LL}^{1}$ is the low-frequency component, and $y_{LH}^{1}$, $y_{HL}^{1}$, and $y_{HH}^{1}$ are the high-frequency components. Then, the 2D Haar wavelet is applied once again to the low-frequency component $y_{LL}^{1}$ to obtain

$$(y_{LL}^{2}, y_{LH}^{2}, y_{HL}^{2}, y_{HH}^{2}) = \mathcal{W}(y_{LL}^{1}).$$

Finally, to obtain the image pyramid shown in Figure 7, the same operation is repeated twice, resulting in

$$(y_{LL}^{k}, y_{LH}^{k}, y_{HL}^{k}, y_{HH}^{k}) = \mathcal{W}(y_{LL}^{k-1}), \quad k = 3, 4.$$

The image pyramid contains images at five different scales, each of which corresponds to a different stage of the U-shaped convolutional network. Apart from the original scale, each scale $k$ contains a low-frequency component $y_{LL}^{k}$ and three high-frequency components $y_{LH}^{k}$, $y_{HL}^{k}$, and $y_{HH}^{k}$. As shown in Figure 7, the low-frequency component is close to the corrupted input image, while the high-frequency components contain a lot of noise and some textures, which can be considered an estimate of the noise level.
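The pyramid construction described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: images are nested lists with even height and width at every level, the orthonormal Haar normalization (division by 2) is assumed, and the names `haar2d` and `haar_pyramid` are hypothetical.

```python
def haar2d(img):
    """One level of the 2D Haar transform.

    Returns four sub-images (low, row_detail, col_detail, diag_detail),
    each with half the height and half the width of the input.
    """
    h, w = len(img), len(img[0])
    low, row_detail, col_detail, diag_detail = [], [], [], []
    for r in range(0, h, 2):
        lo_row, rd_row, cd_row, dd_row = [], [], [], []
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            lo_row.append((a + b + d + e) / 2)  # scaled local average
            rd_row.append((a + b - d - e) / 2)  # top row minus bottom row
            cd_row.append((a - b + d - e) / 2)  # left column minus right column
            dd_row.append((a - b - d + e) / 2)  # diagonal difference
        low.append(lo_row)
        row_detail.append(rd_row)
        col_detail.append(cd_row)
        diag_detail.append(dd_row)
    return low, row_detail, col_detail, diag_detail


def haar_pyramid(img, levels=4):
    """Apply the transform repeatedly to the low-frequency component."""
    pyramid = []
    low = img
    for _ in range(levels):
        low, row_detail, col_detail, diag_detail = haar2d(low)
        pyramid.append((low, row_detail, col_detail, diag_detail))
    return pyramid
```

With four levels plus the original image, the pyramid has five scales; a 48 × 48 patch, for example, yields sub-images of side 24, 12, 6, and 3.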
The image degradation model is established as $y = x + v$, where $y$ denotes the corrupted image, $x$ denotes the clean image, and $v$ denotes the noise. The proposed model inputs the corrupted image $y$ into the network to predict the noise $\hat{v} = \mathcal{F}(y; \Theta)$ and finally obtains the clean image $\hat{x} = y - \hat{v}$ through simple subtraction. The mean-squared error is used as the loss function:

$$L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathcal{F}(y_i; \Theta) - (y_i - x_i) \right\|_F^2,$$

where $\Theta$ represents the parameter set of the model, $N$ is the total number of images, and $x_i$ and $y_i$ represent the $i$th clean image and noisy image, respectively.
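The residual learning objective can be made concrete with a small numerical sketch. The code below is an illustration only: images are flattened to lists of pixel values, the function names are hypothetical, and the normalization by the number of images is an assumption, since a constant factor does not change the minimizer.

```python
def residual_loss(predicted_noise, clean, noisy):
    """Squared error between the predicted noise and the true residual
    (noisy - clean), averaged over the images in the batch."""
    total = 0.0
    for v_hat, x, y in zip(predicted_noise, clean, noisy):
        total += sum((p - (yi - xi)) ** 2 for p, xi, yi in zip(v_hat, x, y))
    return total / len(clean)


def restore(noisy, predicted_noise):
    """Clean-image estimate obtained by subtracting the predicted noise."""
    return [yi - p for yi, p in zip(noisy, predicted_noise)]
```

If the network predicted the noise exactly, the loss would be zero and `restore` would return the clean image; any texture that the network mistakes for noise is removed by the same subtraction.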
For convenience, the proposed model trained at a known noise level is named NSNet-S, and the model trained at an unknown noise level is named NSNet-B. The pseudo-code of the proposed method is shown in Algorithm 1.
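Algorithm 1 The algorithm of NSNet
Input: All training images D from the observed dataset, denoising mode (B or S), noise level $\sigma$, range of noise level $[\sigma_{\min}, \sigma_{\max}]$, maximum epoch $E$.
Output: The trained network $\mathcal{F}(\cdot; \Theta)$.
1: Initialize the model parameters $\Theta$ and the learning rate $\eta$;
2: Sample patches from D;
3: for epoch = 1 to $E$ do
4:   if epoch > 30 then
5:     Reduce the learning rate $\eta$;
6:   end if
7:   Set […];
8:   for each training patch $x_i$ do
9:     if mode == “B” then
10:      Set $\sigma$ randomly to an integer in the range $[\sigma_{\min}, \sigma_{\max}]$;
11:    end if
12:    Add Gaussian noise with noise level $\sigma$ to $x_i$: $y_i = x_i + v_i$, $v_i \sim \mathcal{N}(0, \sigma^2)$;
13:    Perform the multi-scale wavelet transform on $y_i$ to obtain the four pyramid levels $P_i^{1}$, $P_i^{2}$, $P_i^{3}$, $P_i^{4}$, each computed by applying the 2D Haar wavelet to the low-frequency component of the previous level;
14:    Predict the noise using the network with parameters $\Theta$: $\hat{v}_i = \mathcal{F}(y_i, P_i^{1}, \ldots, P_i^{4}; \Theta)$;
15:    Calculate the loss according to Equation (14);
16:    Compute the gradient of the loss with respect to $\Theta$;
17:  end for
18:  Update $\Theta$ using the computed gradient and the learning rate $\eta$;
19: end for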
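The noise-level selection and noise-injection steps of Algorithm 1 can be sketched as follows. This is a hedged illustration rather than the authors' training code: the helper names are hypothetical, pixel values are assumed to be on a 0 to 255 scale, and the endpoints of the noise-level range are treated as inclusive.

```python
import random


def pick_noise_level(mode, sigma=25, sigma_range=(0, 55), rng=random):
    """Return the noise level for one training sample.

    Mode "S" uses the fixed, known level; mode "B" (blind) draws a random
    integer level from the given range.
    """
    if mode == "B":
        return rng.randint(sigma_range[0], sigma_range[1])
    if mode == "S":
        return sigma
    raise ValueError("mode must be 'B' or 'S'")


def add_gaussian_noise(patch, sigma, rng=random):
    """Add zero-mean Gaussian noise with standard deviation sigma to a patch."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in patch]
```

Passing a seeded `random.Random` instance as `rng` makes the sampled levels and noise reproducible, which is convenient when comparing the S and B modes.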
4. Experimental Results
Gray image denoising and color image denoising were carried out to compare the denoising performance of the proposed NSNet with those of existing models, including NLM [2], BM3D [3], KSVD [4], NCSR [6], OCTOBOS [7], TWSC [8], EPLL [9], PGPD [10], PCLR [11], GHP [12], WNNM [13], CSF [15], TNRD [16], and FFDNet [18]. Moreover, two types of DnCNN [17] were also selected for comparison: DnCNN-S and DnCNN-B, which are trained at known and unknown noise levels, respectively.
4.1. Evaluation Metrics
The results of all denoising methods were analyzed quantitatively in terms of the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).
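(1) Supposing that the recovered image is $\hat{x}$ and the clean reference image is $x$, the PSNR is calculated as

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right),$$

where $\mathrm{MSE}$ is the mean-squared error between $\hat{x}$ and $x$, and $\mathrm{MAX}$ denotes the maximum value of the pixels in the image. In general, $\mathrm{MAX} = 255$ if each pixel is represented in 8-bit binary form or 1 if it is represented in 1-bit binary.
(2) The SSIM is calculated as

$$\mathrm{SSIM}(\hat{x}, x) = \frac{(2\mu_{\hat{x}}\mu_{x} + c_1)(2\sigma_{\hat{x}x} + c_2)}{(\mu_{\hat{x}}^2 + \mu_{x}^2 + c_1)(\sigma_{\hat{x}}^2 + \sigma_{x}^2 + c_2)},$$

where $\mu_{\hat{x}}$ and $\mu_{x}$ denote the means of $\hat{x}$ and $x$, respectively, and $\sigma_{\hat{x}}$ and $\sigma_{x}$ denote their standard deviations, while $\sigma_{\hat{x}x}$ denotes the covariance of $\hat{x}$ and $x$, and $c_1$ and $c_2$ are constants.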
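Both metrics can be sketched in plain Python on flattened images. The sketch makes several assumptions: the function names are hypothetical, SSIM is evaluated once over the whole image rather than averaged over local windows as is common in practice, and the constants follow the widely used convention $c_1 = (0.01\,\mathrm{MAX})^2$ and $c_2 = (0.03\,\mathrm{MAX})^2$, which the text above does not specify.

```python
import math


def psnr(recovered, reference, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    n = len(recovered)
    mse = sum((a - b) ** 2 for a, b in zip(recovered, reference)) / n
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def ssim_global(recovered, reference, max_value=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM computed from global image statistics."""
    n = len(recovered)
    mu_a = sum(recovered) / n
    mu_b = sum(reference) / n
    var_a = sum((a - mu_a) ** 2 for a in recovered) / n
    var_b = sum((b - mu_b) ** 2 for b in reference) / n
    cov = sum((a - mu_a) * (b - mu_b)
              for a, b in zip(recovered, reference)) / n
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator
```

Higher values are better for both metrics: PSNR grows without bound as the error vanishes, while SSIM reaches 1 only when the two images have identical statistics.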
4.2. Experimental Setting
For the ablation experiment, NSNet, NSNet without BN, Unet, and Unet without BN were compared. All compared methods were trained using 400 images of size 180 × 180 pixels, as mentioned in [
17]. The test sets were Set12, which is widely used in the evaluation of denoising methods, and BSD68 [
38]. In training the model, the patch size was set to 48 × 48, and 128 × 618 patches were cropped from the 400 images. The four methods were trained at noise levels of 15, 25, 35, 45, and 50. For a noise level of $\sigma$, the noise was generated by a Gaussian distribution with a mean of zero and a variance of $\sigma^2$.
For the denoising of gray images, the 400 images were still used as the training set, and 128 × 2934 patches of size 48 × 48 were cropped from them. Since most image denoising methods can only obtain the best denoising performance at a known noise level, to achieve a fair comparison, the proposed method was trained at an unknown noise level and at noise levels of 15, 25, and 50. The test sets were Set12 and BSD68, neither of which participated in model training.
For color image denoising, 432 images were selected from the color image dataset CBSD500 [
39] as training samples, and the remaining 68 images (CBSD68) were used as the test set. The test set also included Kodak24 [
40] and McMaster [
41]. In this experiment, 96 × 3900 patches were cropped from the 432 images to train the color image denoising model. The other settings were largely consistent with the settings used for gray image denoising. The specific settings of NSNet are shown in
Table 1.
When training NSNet-S, each clean image input to the model was corrupted by the same level of noise. When training NSNet-B, each clean image input to the model was corrupted by noise at a level drawn randomly from the range (0, 55). The Adam optimizer was used to tune the model with an initial learning rate of 0.001. The maximum training epoch was 50. After 30 epochs, the learning rate was adjusted to 0.0001. The size of each mini-batch was set to 128. The denoising network was trained in PyTorch, and all experiments were carried out in the PyCharm environment on a PC with a 3.30 GHz AMD Ryzen 9 5900HX CPU with Radeon Graphics and an NVIDIA GeForce RTX 3070 Laptop GPU.
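The training schedule described above amounts to a single step change in the learning rate. The sketch below restates it in code; the function names are hypothetical, and the assumption is that the reduced rate applies from epoch 31 onward, consistent with the condition "epoch > 30" in Algorithm 1.

```python
def learning_rate(epoch, initial=1e-3, reduced=1e-4, switch_epoch=30):
    """Step schedule: the initial rate up to switch_epoch, the reduced rate after."""
    return initial if epoch <= switch_epoch else reduced


def iterations_per_epoch(num_patches, batch_size=128):
    """Number of full mini-batches that fit in one pass over the patches."""
    return num_patches // batch_size


# Learning rate for each of the 50 training epochs (epochs are 1-indexed).
schedule = [learning_rate(epoch) for epoch in range(1, 51)]
```

For the gray-image setting, 128 × 2934 patches with a mini-batch size of 128 correspond to 2934 iterations per epoch.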
4.3. Ablation Experiment
This section describes the ablation experiment that was carried out to demonstrate the effectiveness of the main components of the proposed model. The experiment tested the denoising performance of NSNet, Unet, NSNet without BN, and Unet without BN at noise levels of 15, 25, 35, 45, and 50. The denoising results on the Set12 dataset and the gray version of BSD68 are shown in
Table 2, in which values with # and * represent the best and second-best denoising performance, respectively.
The denoising performance of NSNet is better than that of Unet at all noise levels, which shows that multi-scale input can improve the denoising performance. The results for NSNet and NSNet without BN show that the denoising performance of NSNet is greatly improved after adding the BN layer. As mentioned in [
29], the BN layer can improve the denoising performance of the neural network. At all noise levels, NSNet without BN achieves better denoising than Unet without BN, and its denoising performance is close to that of Unet at low noise levels, which indicates that the multi-scale input greatly improved the denoising performance of the model at low noise levels. With increases in the noise level, the structure of the compressed image is increasingly corrupted; thus, it cannot provide accurate information for the network. In this case, the performance of NSNet is similar to that of Unet. For example, the denoising performance of NSNet is similar to that of Unet at a noise level of 50.
4.4. Gray Image Denoising
In this experiment, AWGN was added to Set12 and the gray version of BSD68, and the noise levels were set to 15, 25, and 50. The denoising results of all methods on Set12 are shown in
Table 3,
Table 4 and
Table 5. Due to insufficient parameters, CSF cannot be tested at a noise level of 50.
Table 3 shows the denoising results of all methods at a noise level of 15. NSNet-S has a better denoising performance than other methods and obtained the first-ranked denoising performance on ten images. NSNet-S ranked a very close second in terms of denoising the image “House”. In this case, the denoising performance of NSNet-S is 1.86 dB higher than that of the worst method, NLM, on average, and 0.1 dB higher than that of DnCNN-S, on average, in terms of PSNR. The average results of all methods show that there is little difference in the denoising performance of most methods at a noise level of 15. At a low noise level, the self-similarity, sparsity, and low-rank properties of the image are still relatively complete, and the traditional denoising methods (e.g., BM3D, NCSR, TWSC, PCLR, WNNM) also achieve a good denoising performance.
Table 4 shows the denoising results obtained by all methods at a noise level of 25. With increases in the noise level, the best and second-best denoising performance was obtained with the deep-learning-based methods, which shows the superiority of deep learning in image denoising. As the noise level increases, the self-similarity and other features of the image are increasingly corrupted, which leads to the fact that the traditional image-prior-based denoising methods are no longer advantageous. The deep-learning-based denoising methods learn the potential noise directly from the corrupted image and rely less on the prior knowledge of the image. This allows them to achieve a good denoising performance, even at high noise levels. In this case, NSNet-S is ranked first in terms of denoising performance on ten images and second on one image. NSNet-B is ranked second in terms of denoising performance on three images. The denoising performance of NSNet-S is 1.99 dB higher than that of the worst method, NLM, and 0.17 dB higher than that of the second-best method, DnCNN-S, in terms of PSNR. Compared with the traditional denoising methods BM3D, NCSR, TWSC, and WNNM, NSNet-B is close to WNNM and outperforms BM3D, NCSR, and TWSC at a noise level of 25.
Table 5 shows the denoising results obtained by all methods at a noise level of 50. In this case, NSNet-S is ranked first in terms of denoising performance on eight images and second on two images. NSNet-B is ranked first in terms of denoising performance on two images and second on six images. Compared with other methods, NSNet-B provides a better denoising performance at a high noise level. The reason for this may be that an image with high noise will cause a greater deviation than one with low noise, and the network will pay more attention to the restoration of images with high noise when training NSNet-B. In this case, NSNet-S is 2.86 dB higher than the worst method, NLM, and 0.17 dB higher than the second-best method, FFDNet. NSNet-B also has an outstanding performance; its denoising performance surpasses that of DnCNN-B and WNNM and differs from that of FFDNet by only 0.03 dB.
In the above three experiments, NSNet shows a good denoising performance in most cases, although its performance on the image “Barbara” is not as good as that of traditional methods (e.g., TWSC, WNNM). “Barbara” contains rich, repetitive textures, and methods based on image self-similarity can exploit them effectively to achieve better denoising. Both texture and noise are high-frequency information; therefore, image denoising methods that use residual learning tend to treat texture as noise, which degrades the denoising performance of the proposed model on such images. In addition, the same experiments were conducted on the dataset BSD68 to better demonstrate the denoising performance of NSNet. The average PSNRs and SSIMs of all methods on Set12 and BSD68 are shown in
Table 6. Compared with other methods, the average denoising performance of NSNet-S on the two datasets is the best, and NSNet-B has a better denoising performance at a high noise level.
Figure 8 and
Figure 9 show the denoising performance of all compared methods on gray images at noise levels of 25 and 50, respectively. In order to facilitate the comparison of the denoising performance, the results were transformed into pseudo-color images. The red box in
Figure 8 shows the restoration effect of all compared methods on the “grass”. It can be seen that NSNet’s recovery of the “grass” is closer to the clean image than the other compared methods and results in sharper and clearer edges and textures. The red box in
Figure 9 further demonstrates the advantages of NSNet in edge and texture restoration. As shown in
Figure 9, compared to other methods, NSNet can not only make the edges and textures sharper but also restore more image details.
4.5. Color Image Denoising
The previous section compared the denoising performance of the methods on gray images, where BM3D, DnCNN, and FFDNet had the better denoising performance and computational speed. In the experiment described in this subsection, these methods were selected for a further comparative test of the denoising ability of the proposed model with color images. The datasets used here are the color versions of CBSD68, Kodak24, and McMaster.
The denoising performance of four methods (BM3D, DnCNN, FFDNet, and NSNet) at noise levels of 35, 45, and 55 is shown in
Table 7. The proposed NSNet achieves the best results in the denoising experiments with CBSD68 and Kodak24 and ranks second on McMaster.
In the denoising experiment with the dataset CBSD68, NSNet performed 0.81 dB better than BM3D, 0.09 dB better than DnCNN, and 0.1 dB better than FFDNet at a noise level of 35. It was 0.74 dB better than BM3D, 0.15 dB better than DnCNN, and 0.14 dB better than FFDNet at a noise level of 45, and 0.73 dB better than BM3D, 0.2 dB better than DnCNN, and 0.14 dB better than FFDNet at a noise level of 55. In the denoising experiment with the dataset Kodak24, NSNet was 0.78 dB better than BM3D, 0.22 dB better than DnCNN, and 0.11 dB better than FFDNet at a noise level of 35. It was 0.68 dB better than BM3D, 0.28 dB better than DnCNN, and 0.15 dB better than FFDNet at a noise level of 45, and 0.66 dB better than BM3D, 0.34 dB better than DnCNN, and 0.15 dB better than FFDNet at a noise level of 55. In the denoising experiment using the McMaster dataset, although the denoising performance of NSNet was not as good as that of FFDNet, the comparison of all methods in terms of SSIM shows that NSNet is able to recover the structure of the color image better.
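To put such decibel gaps in perspective, a PSNR difference can be converted into a ratio of mean-squared errors. Since PSNR is $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$, a gain of $\Delta$ dB on the same image corresponds to an MSE ratio of $10^{-\Delta/10}$. The helper below is a hypothetical illustration; the conversion holds per image and is only approximate for PSNR values averaged over a dataset.

```python
def mse_ratio_from_psnr_gain(gain_db):
    """MSE of the better method divided by the MSE of the other method,
    given the PSNR gain in dB measured on the same image."""
    return 10 ** (-gain_db / 10)


# Example: fraction by which the MSE drops for a 0.17 dB PSNR gain.
reduction = 1 - mse_ratio_from_psnr_gain(0.17)
```

Under this conversion, a gain of 0.17 dB corresponds to roughly 4% lower mean-squared error, which helps explain why small PSNR differences can still be visible in fine textures.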
Figure 10 shows the denoising results of four methods (BM3D, DnCNN, FFDNet, and NSNet) on one image of the public dataset CBSD68 at a noise level of 45. To better demonstrate the denoising performance of the proposed method, two representative parts are highlighted. These show that NSNet repairs the texture better than other methods.
Figure 11 shows the denoising results of four methods at a noise level of 40. NSNet has a more significant repair effect on “stone” in the image, and its recovery of textures is better than those of other methods.