1. Introduction
Digital images have found extensive applications in various fields, including medical imaging, remote sensing, and semantic segmentation [
1]. However, images captured by cameras often suffer from mixed noise, which degrades image quality and affects subsequent computer vision tasks [
2]. For example, hyperspectral images commonly exhibit a combination of additive white Gaussian noise (AWGN) and Poisson noise, while computed tomography images and complementary DNA microarray images experience a mixture of AWGN and impulse noise (IN) [
3]. As a result, the removal of mixed noise has become a critical and challenging problem that requires further investigation.
In recent years, researchers have proposed many effective methods for removing combined AWGN and IN. Since IN only affects some of an image’s pixel values [
4], early methods for mixed noise removal typically adopted a two-stage approach, where IN is first suppressed and then AWGN is removed. Garnett et al. [
5] introduced the rank-ordered absolute differences method for IN detection and integrated it into the bilateral filter framework [
6], which performs adaptive image denoising for both AWGN and IN. Cai et al. [
7] employed a variational framework with a
data fidelity term and the Mumford–Shah regularization term to remove mixed noise from images. However, this method, while preserving some edge properties, may lead to image oversmoothing as it only considers local image information. To address this, Xiao et al. [
8] proposed a
double-sparsity regularization-based method. They used the
term to remove IN and introduced the
term in an improved K-SVD algorithm to suppress residual noise after IN detection. Liu et al. [
9] developed a generalized weighted
method for AWGN-IN removal based on maximum likelihood estimation and sparse representation. Jiang et al. [
10] integrated the image sparse prior and non-local similar prior into a non-local sparse regularization term and proposed the weighted encoding with sparse nonlocal regularization (WESNR) method. Furthermore, Huang et al. [
11] considered both the non-local self-similarity and low-rank properties of natural images and proposed a Laplace scale mixture combined with a non-local low-rank regularization (LSLR) model. In the domain of hyperspectral images, Zhuang et al. [
12] proposed a method called FastHyMix, which estimates mixed noise by exploiting its high spectral correlation. This method enhances the accuracy of mixed noise estimation by utilizing a neural denoising network to learn image priors. Liu et al. [
13] proposed a mixed noise removal method that combines a deterministic low-rank prior with an implicit regularization scheme. This method approximates the low-rank prior of the image using the matrix logarithm norm and utilizes an implicit regularizer to preserve image details.
Deep learning has emerged as a promising approach for removing mixed noise due to its ability to adapt to complex data and establish relationships between noisy and clean images. Compared to traditional denoising methods, deep learning relies on large amounts of training data to learn stable nonlinear mappings, making it a data-driven approach. Several deep learning-based methods have been proposed for mixed noise removal. Islam et al. [
14] proposed a transfer learning approach called TL-CNN, which uses a convolutional neural network (CNN) to learn an end-to-end mapping from noisy to clean images. Abiko et al. [
15] introduced a blind denoising method called BDCNN, which is entirely based on a CNN. The network structure of BDCNN consists of 50 convolution (Conv) blocks, with the first 25 blocks used for IN removal and the last 25 blocks used for AWGN removal. Wang et al. [
16] incorporated a CNN regularizer into a traditional model-based variational approach, resulting in VA-CNN. VA-CNN utilizes the CNN-learned natural image prior to improve the variational method’s accuracy in estimating noise parameters. Jiang et al. [
17] proposed a non-local mean-based CNN (NM-CNN) method. This approach first detects the locations of outlier pixels using a median filter and replaces them with non-local mean values; subsequently, AWGN is removed using a CNN. Lyu et al. [
18] introduced a generative adversarial network (GAN) that employs generators and discriminators for feature extraction. This method incorporates a joint loss function based on image prior and visual perceptual metrics, further enhancing image denoising performance. Mafi et al. [
19] proposed a CNN architecture incorporating Conv, batch normalization (BN), and rectified linear units (ReLU) as basic components for mixed noise removal. In [
20], a serial attention module-based CNN method (SACNN) is proposed, which employs a serial attention module to better preserve texture details. Overall, these deep learning-based methods have demonstrated significant improvements compared to traditional denoising methods.
In summary, most existing methods for removing mixed noise do not fully utilize the local and global information of the image, resulting in the inaccurate modeling of complex noise during the denoising process. This leads to the deformation and distortion of the image structure, causing the loss of fine details. To address this issue, this paper proposes MSNUNet for mixed noise removal.
The key contributions of MSNUNet can be summarized as follows:
- 1.
We propose a nested UNet architecture based on multi-scale feature extraction for mixed noise removal. In MSNUNet, we introduce MSU-Subnet for multi-scale feature extraction. These multi-scale features contain rich local and global features, which help the model estimate noise more accurately and improve its robustness.
- 2.
We introduce MSCAM into the MSNUNet model to effectively aggregate multi-scale features. Additionally, MSCAM utilizes channel attention (CA) to enhance the extraction of important features, enabling the network to better preserve intricate textural details in images.
- 3.
Our experimental results demonstrate that MSNUNet achieves superior performance in terms of quality metrics compared to state-of-the-art methods and generates visually satisfying denoised images.
The remainder of this paper is organized as follows:
Section 2 describes the related work.
Section 3 presents the mixed noise model, while
Section 4 introduces the proposed MSNUNet model. In
Section 5, extensive experiments are conducted to evaluate the performance of MSNUNet, and the conclusions of the study are presented in
Section 6.
3. Noise Model
We assume that
is a noise-free image,
is a corresponding noisy image, and
is the pixel value at location
. For the AWGN model, the relationship between these elements can be written as
where
G is an independent and identically distributed zero-mean AWGN.
There are two types of IN: salt–pepper impulse noise (SPIN) and random value impulse noise (RVIN). SPIN corrupts the image with two fixed extreme values, the minimum of the pixel dynamic range for pepper noise and the maximum for salt noise, whereas RVIN corrupts the image with any value in that range. The SPIN and RVIN models are as follows.
SPIN model:
where
denotes the probability of SPIN.
RVIN model:
where
is a random pixel value in the range
at the location
and
denotes the probability of RVIN.
In this paper, three types of mixed noise models are considered [
20]:
- (1)
AWGN mixed with SPIN:
- (2)
AWGN mixed with RVIN:
- (3)
AWGN mixed with RVIN plus SPIN:
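The three mixed noise models can be illustrated with a small simulation. The following is a minimal sketch rather than the authors' implementation: it assumes an 8-bit dynamic range of [0, 255], assumes that each pixel is either hit by an impulse or receives AWGN, assumes that RVIN acts only on pixels not already hit by SPIN, and clips the result to the dynamic range. The function name `add_mixed_noise` and all parameter names are hypothetical.

```python
import random


def add_mixed_noise(image, sigma, spin_ratio=0.0, rvin_ratio=0.0,
                    d_min=0, d_max=255, seed=None):
    """Corrupt a 2-D list of pixel values with mixed noise.

    Each pixel becomes pepper, salt, or a random value (impulse noise),
    or otherwise receives zero-mean Gaussian noise with std `sigma`.
    """
    rng = random.Random(seed)
    noisy = []
    for row in image:
        out_row = []
        for pixel in row:
            u = rng.random()
            if u < spin_ratio / 2:
                value = d_min                          # pepper (SPIN)
            elif u < spin_ratio:
                value = d_max                          # salt (SPIN)
            elif u < spin_ratio + (1 - spin_ratio) * rvin_ratio:
                value = rng.uniform(d_min, d_max)      # RVIN (assumed composition)
            else:
                value = pixel + rng.gauss(0.0, sigma)  # AWGN only
            # Clipping to the dynamic range is an assumption of this sketch.
            out_row.append(min(max(value, d_min), d_max))
        noisy.append(out_row)
    return noisy


clean = [[128.0] * 8 for _ in range(8)]
awgn_spin = add_mixed_noise(clean, sigma=25, spin_ratio=0.30, seed=0)
awgn_rvin = add_mixed_noise(clean, sigma=20, rvin_ratio=0.10, seed=0)
awgn_rvin_spin = add_mixed_noise(clean, sigma=10, spin_ratio=0.40,
                                 rvin_ratio=0.10, seed=0)
```

Setting `rvin_ratio=0` gives case (1), setting `spin_ratio=0` gives case (2), and setting both to nonzero values gives case (3).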
4. MSNUNet
Most current mixed noise removal algorithms achieve good denoising results by utilizing local information or prior knowledge of the image. However, when the noise ratio increases or more complex noise, such as combined AWGN, SPIN, and RVIN, is encountered, the accuracy of these methods in estimating the noise distribution significantly decreases. This limitation hinders the model’s ability to accurately model the noise and ultimately leads to the loss of image texture details. Additionally, although existing deep learning-based denoising models have shown promising results, the majority of these models perform denoising on low-resolution image blocks without considering the global information of the image. This limitation disrupts the overall structure and consistency of the image, thereby restricting the denoising performance of the model. In this paper, we propose MSNUNet for mixed noise removal. Firstly, the MSU-Subnet is introduced to enable the network to process high-resolution images more deeply and generate diverse receptive fields, capturing rich local and global features. Then, by integrating the MSCAM into the nested UNet architecture, MSNUNet is able to aggregate local and global features, which assist the model in preserving the overall structure and consistency of the image, resulting in improved denoising performance. Lastly, by introducing a channel attention block (CAB) in the MSCAM, the model enhances its ability to learn and extract important features, thereby preserving more image details.
4.1. Overall Pipeline
Figure 2 shows the architecture of MSNUNet for denoising images corrupted by mixed noise. Given an image
, where
represents the spatial dimensions of the image, MSNUNet first applies a
Conv to extract low-level image features
, where
C denotes the number of channels. These features are then processed by a nine-block symmetric encoder–decoder structure to obtain the deep features
. The encoder blocks, including encoder1 and encoder2, and the decoder blocks, including decoder1 and decoder2, are filled with MSU-Subnet, a well-configured U-shaped subnetwork. By utilizing MSU-Subnet, MSNUNet effectively captures local and global features at multiple scales from high-resolution images. In the subsequent blocks with lower-resolution feature maps, downsampling further would result in the loss of valuable image information. To address this issue, we employ MSCAM, which combines local feature extraction with a CAB, allowing the network to extract local features while capturing channel correlations. This approach enables MSNUNet to efficiently extract multi-level features within blocks and aggregate multi-level features between blocks. Specifically, starting from the low-level feature
, the encoder gradually reduces the spatial size of the feature map while increasing the channel capacity. The decoder takes the latent feature
as input and learns the noise distribution while gradually recovering the image resolution. Pixel-unshuffling and pixel-shuffling operations are applied during feature downsampling and upsampling, respectively [
30]. The encoder is connected to the decoder through a skip connection to facilitate the image recovery process [
24]. Before this connection, the output of the decoder is upsampled, and a summation operation is used to ensure a consistent number of channels. These design choices have resulted in improved quality, as described in the experimental section (
Section 5). Finally, a
Conv is applied to the output features of the final decoder to generate the residual image
. This residual image is then added to the noisy image
I to obtain the restored image,
. In the following section, we will describe the core components of MSNUNet, including MSCAM and MSU-Subnet, which are used to extract multi-scale features from the image.
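The pixel-unshuffling and pixel-shuffling operations used for downsampling and upsampling are lossless rearrangements between spatial positions and channels. The sketch below is a framework-free illustration on nested lists with layout (channels, height, width). It assumes a scale factor of 2 and the channel ordering commonly used by deep learning libraries; the function names are hypothetical and this is not the authors' code.

```python
def pixel_unshuffle(x, r):
    """Rearrange (C, H*r, W*r) into (C*r*r, H, W): lower resolution, more channels."""
    channels, height, width = len(x), len(x[0]) // r, len(x[0][0]) // r
    return [
        [[x[c][h * r + i][w * r + j] for w in range(width)]
         for h in range(height)]
        for c in range(channels) for i in range(r) for j in range(r)
    ]


def pixel_shuffle(x, r):
    """Inverse rearrangement: (C*r*r, H, W) back into (C, H*r, W*r)."""
    channels, height, width = len(x) // (r * r), len(x[0]), len(x[0][0])
    return [
        [[x[c * r * r + (h % r) * r + (w % r)][h // r][w // r]
          for w in range(width * r)]
         for h in range(height * r)]
        for c in range(channels)
    ]


# A 1-channel 4x4 map becomes a 4-channel 2x2 map and is restored exactly.
feature = [[[row * 4 + col for col in range(4)] for row in range(4)]]
down = pixel_unshuffle(feature, 2)
assert pixel_shuffle(down, 2) == feature
```

Because no values are discarded, the encoder can reduce the spatial size without losing information, which is one plausible reason for preferring these operations over strided pooling.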
4.2. MSCAM
In recent years, CNNs have gained wide adoption in image processing due to their remarkable performance. The Conv block is a fundamental building block of CNNs and typically comprises Conv, BN, and ReLU layers. However, the standard Conv is inherently limited to extracting local features [
31], which poses a significant disadvantage for image processing. To overcome this limitation, MSNUNet utilizes MSU-Subnet to extract multi-scale features that are more diverse than those obtained through a standard Conv. These features consist of high-resolution feature maps with precise spatial information and low-resolution feature maps with reliable semantic information. To effectively integrate these rich features, we follow the component arrangement described in [
32] and propose a new module called MSCAM, depicted in
Figure 2b. MSCAM consists of two steps.
In step 1 of MSCAM, we apply layer normalization (LN) to normalize the input feature maps. The normalized feature maps are then processed using a
Conv, followed by a
depth-wise Conv (DConv). The DConv offers greater efficiency compared to traditional Conv as it has fewer parameters and lower computational costs. The Gaussian error linear unit (GELU) activation function is employed to implement the non-linear mapping relationship. The resulting feature maps are fed into the CAB, which captures the correlation between global feature channels. The internal structure of the CAB is illustrated in
Figure 3. Finally, a
Conv is utilized to aggregate these features and output them to the second step of the module.
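The efficiency argument for DConv can be checked with a quick parameter count. In the sketch below, the kernel size of 3 and the channel count of 64 are illustrative assumptions, not values taken from the paper, and the helper names are hypothetical.

```python
def conv_params(c_in, c_out, kernel, bias=False):
    """Weights of a standard convolution: every output channel sees every input channel."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def depthwise_conv_params(channels, kernel, bias=False):
    """Weights of a depth-wise convolution: one kernel per channel, no channel mixing."""
    return kernel * kernel * channels + (channels if bias else 0)


channels, kernel = 64, 3  # assumed values, for illustration only
standard = conv_params(channels, channels, kernel)
depthwise = depthwise_conv_params(channels, kernel)
print(f"standard: {standard}, depth-wise: {depthwise}, ratio: {standard // depthwise}x")
```

For equal input and output widths, the saving grows linearly with the number of channels, which is why a depth-wise layer is usually paired with a cheap point-wise Conv that restores the mixing across channels.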
The CAB implementation can be represented by the following equations [
29], where the input is denoted as
c and the output is denoted as
. These calculations are shown in Equations (
7)–(
9),
where the function
g in Equation (
7) represents the global average pooling. The input feature map,
c, undergoes global average pooling followed by a fully connected layer,
, to obtain
. Subsequently,
is passed through the ReLU activation function and another fully connected layer,
, to obtain the compressed image features,
. The Sigmoid function is then applied to
, yielding weight coefficients between channels. Finally, these weight coefficients are multiplied with the input
c to obtain the final result.
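The sequence of operations described for the CAB (global average pooling, a fully connected layer, ReLU, a second fully connected layer, Sigmoid, and channel-wise rescaling) can be sketched without any framework. This is an illustrative sketch under stated assumptions: biases are omitted, the weight matrices `w1` and `w2` and the channel reduction they imply are hypothetical, and the input is a nested list with layout (channels, height, width).

```python
import math


def channel_attention(c, w1, w2):
    """Rescale each channel of `c` by a learned weight in (0, 1).

    c:  list of C feature maps, each a list of rows
    w1: first fully connected layer, shape (C_reduced, C)
    w2: second fully connected layer, shape (C, C_reduced)
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0])) for fmap in c]
    # First fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # Second fully connected layer followed by Sigmoid.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    weights = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Multiply every value of a channel by that channel's weight.
    return [[[v * s for v in row] for row in fmap]
            for fmap, s in zip(c, weights)]


features = [[[2.0, 4.0], [6.0, 8.0]], [[1.0, 1.0], [1.0, 1.0]]]
w1 = [[1.0, 0.0]]        # 2 channels -> 1 hidden unit (assumed reduction)
w2 = [[10.0], [-10.0]]   # favors channel 0, suppresses channel 1
reweighted = channel_attention(features, w1, w2)
```

In this toy setting the first channel passes through almost unchanged while the second is driven toward zero, which mirrors how the CAB emphasizes informative channels.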
In step 2, we employ a Conv to enable the interaction of information among the diverse features acquired from the previous step.
The incorporation of the CAB in MSCAM enhances the network’s focus on important features [
33], leading to the improved utilization and preservation of image texture details. Furthermore, residual connections are introduced to enhance the network’s performance by aiding in the reconstruction of neglected high-frequency feature information.
4.3. MSU-Subnet
In current CNN designs, such as VGG [
34], ResNet [
35], and DenseNet [
36], small kernels of sizes of
or
are commonly used for feature extraction. However, these small kernels have limited receptive fields, resulting in shallow output feature maps that only capture local features and fail to capture global features. To overcome this limitation, we propose a novel multi-scale feature extraction subnetwork, namely MSU-Subnet, to capture multi-scale features in high-resolution feature maps.
Figure 2a showcases the structure of MSU-Subnet, which consists of three main components:
- (i)
A U-shaped encoder-decoder structure: The subnetwork takes intermediate feature maps with c channels as inputs and employs a U-shaped architecture with seven blocks to extract multi-scale features. The feature maps are progressively downsampled and then decoded back into a high-resolution feature map, again with c channels, through progressive upsampling, skip connections, and Convs; this avoids the loss of fine details encountered with direct upsampling at larger scales. Additionally, by extracting features from deeper levels, the network can capture more diverse receptive fields and richer local and global features. The extracted multi-scale features can represent noise details at various granularities, enabling the network to capture a more accurate noise distribution and enhancing the robustness of the model.
- (ii)
MSCAM: Serving as the base module for both MSU-Subnet and the entire network, MSCAM not only aggregates multi-scale features within the network but also utilizes CA to extract the correlation between feature channels. This allows the network to selectively attend to relevant features, thereby enhancing the effectiveness of feature extraction.
- (iii)
Residual connection: Experimental results indicate that the denoising quality of a network tends to decrease beyond a certain number of layers, potentially causing image degradation during network training. To mitigate this problem, residual connections are utilized to learn the residual mapping of the stacked layers, enabling the easier training of deeper networks.
By allowing the network to extract features at multiple scales, MSU-Subnet enhances the capabilities of feature extraction. In addition, the U-structure of MSU-Subnet offers a low computational overhead, as most operations and manipulations are applied to the downsampled feature maps.
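The claim that small kernels alone yield limited receptive fields, and that a U-shaped structure enlarges them, can be made concrete with the standard receptive-field recurrence. The layer stacks below are illustrative assumptions and do not reproduce the actual MSU-Subnet configuration; each downsampling step is modeled as a layer with kernel 2 and stride 2.

```python
def receptive_field(layers):
    """Receptive field of a stack of layers given as (kernel_size, stride) pairs."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


plain = [(3, 1), (3, 1), (3, 1)]                      # three stacked 3x3 Convs
u_shaped = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]   # same Convs with two 2x downsamplings
print(receptive_field(plain), receptive_field(u_shaped))
```

With the same three Convs, interleaving two downsampling steps more than doubles the receptive field, while the deeper Convs operate on feature maps with a quarter and a sixteenth of the original pixels. This is consistent with the low computational overhead attributed to the U-structure.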
4.4. Loss Function
In order to ensure a fairer and more robust denoising model, we employ the peak signal-to-noise ratio (PSNR) as a loss function for updating parameters [
37]. The loss function can be represented as
where
N and
M are the image dimensions. Additionally,
y and
x stand for a noise-free image and the corresponding noisy image, respectively. This metric is appealing for several reasons: it is easy to calculate, has a clear physical interpretation, and is mathematically convenient for optimization.
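A PSNR-based loss is typically implemented as the negative PSNR, so that minimizing the loss maximizes the PSNR. The sketch below is one common formulation and is not necessarily the exact expression used in the paper: it assumes that the loss compares the network output with the noise-free image, that pixel values lie in [0, 1], and that a small `eps` is added for numerical stability. The function names are hypothetical.

```python
import math


def mse(pred, target):
    """Mean squared error over all pixels of two equally sized 2-D lists."""
    count = sum(len(row) for row in target)
    total = sum((p - t) ** 2
                for pred_row, target_row in zip(pred, target)
                for p, t in zip(pred_row, target_row))
    return total / count


def psnr_loss(pred, target, max_val=1.0, eps=1e-8):
    """Negative PSNR in dB; lower values correspond to better reconstructions."""
    return -10.0 * math.log10(max_val ** 2 / (mse(pred, target) + eps))


clean = [[0.1, 0.1], [0.1, 0.1]]
rough = [[0.0, 0.0], [0.0, 0.0]]
close = [[0.09, 0.09], [0.09, 0.09]]
assert psnr_loss(close, clean) < psnr_loss(rough, clean)
```

Since PSNR is a monotonic function of the mean squared error, this loss shares its minimizers with the squared-error loss but rescales the gradients logarithmically.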
5. Experiment and Analysis
To evaluate the denoising performance of the proposed MSNUNet, we applied the model to three benchmark datasets: BSD100 [
38], Set12 [
39], and Urban100 [
40]. These datasets are widely used for image denoising tasks. The BSD100 [
38] dataset consists of 100 real-world images capturing diverse scenes. Similarly, the Set12 [
39] consists of 12 grayscale images that are widely used for image denoising tests. The Urban100 [
40] dataset contains 100 urban scene images with complex textures. These datasets provide valuable resources for evaluating and comparing the efficacy of various image denoising algorithms. Thus, we conducted experiments on these three datasets to ensure fairness and reliability in the results. In
Section 5.1, we discuss the experimental setup and provide details about the datasets utilized. The experimental results demonstrating the performance of the proposed MSNUNet are presented in
Section 5.2. In
Section 5.3, we discuss the proposed algorithm in comparison with other recent algorithms. Finally, in
Section 5.4, we present extensive ablation experiments to evaluate the key components of MSNUNet.
5.1. Experiment Setup and Datasets
Implementation details. The proposed MSNUNet is an end-to-end trainable model with no pre-trained network, implemented using PyTorch 1.8.0 and a single NVIDIA RTX 3090 GPU. Specifically, we trained the model for a total of 300,000 iterations using the Adam [
41] optimizer. Every 10,000 iterations corresponded to one epoch, so the entire training process comprised 30 epochs. The exponential decay rates $\beta_1$ and $\beta_2$ and the weight decay were set to 0.9, 0.9, and 0, respectively. The initial learning rate was set to
, gradually decreasing to
using the cosine annealing schedule [
42]. Training on patches while testing on full images leads to performance degradation and patch artifacts in the denoised images, which we addressed using a test-time local converter [
43].
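The learning rate schedule described above can be sketched as a single cosine decay over the whole run. The iteration counts come from the text, whereas the two learning rate values below are hypothetical placeholders, since the actual values are not recoverable here. The function name is also an assumption.

```python
import math

TOTAL_ITERATIONS = 300_000     # from the text
ITERATIONS_PER_EPOCH = 10_000  # from the text
LR_INITIAL = 1e-3              # hypothetical placeholder, not from the paper
LR_FINAL = 1e-6                # hypothetical placeholder, not from the paper


def cosine_annealing_lr(step, total_steps=TOTAL_ITERATIONS,
                        lr_max=LR_INITIAL, lr_min=LR_FINAL):
    """Learning rate after `step` iterations of a single cosine decay."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_term


for epoch in (0, 15, 30):
    step = epoch * ITERATIONS_PER_EPOCH
    print(f"epoch {epoch:2d}: lr = {cosine_annealing_lr(step):.3e}")
```

The schedule starts at the initial rate, passes through the midpoint of the two rates halfway through training, and reaches the final rate at the last iteration.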
Noise ratio. We considered three types of mixed noise: AWGN+SPIN, AWGN+RVIN, and AWGN+RVIN+SPIN. For the first type, the standard deviation of the AWGN ranged from 20 to 30 in steps of 5, and the SPIN ratios were set to 15%, 30%, and 40%. For the second type, the standard deviation of the AWGN ranged from 15 to 25 in steps of 5, and the RVIN ratio varied from 5% to 15% in steps of 5%. For the third type, the standard deviation of the AWGN ranged from 5 to 15 in steps of 5, the RVIN ratio varied from 5% to 15% in steps of 5%, and the SPIN ratio varied from 50% to 30% in decrements of 10%.
Training datasets. We trained the proposed MSNUNet on the DIV2K [
44] dataset, which consists of 800 high-quality images for the training set and 100 images for the validation set. These images have an average resolution of around
. Additionally, these datasets contain abundant details and intricate textures, making them well-suited for evaluating and comparing the performance of diverse image processing algorithms. The patch size and batch size were set to
and 8, respectively. We added three types of mixed noise to the patches for each of the 800 training images. In addition, the same three types of mixed noise used in the training set were applied to the 100 validation images.
5.2. Results
This section describes the extensive experiments performed to evaluate MSNUNet. For all three noise types, we compared MSNUNet with six competing algorithms, including two traditional methods (WESNR [
10] and LSLR [
11]) and four CNN-based methods (TL-CNN [
14], VA-CNN [
16], DeGAN [
18], and SACNN [
20]). After applying the competing methods, we calculated the PSNR and structural similarity index (SSIM) metrics of each method’s processing results to measure the effectiveness of the diverse mixed noise removal algorithms and evaluate the quality of the denoising results. In
Table 1,
Table 2 and
Table 3, for each noise level of a given test set, the first line shows the PSNR, and the second line shows the SSIM.
As shown in
Table 1, for the mixed AWGN+SPIN case, MSNUNet outperforms all competing methods in denoising performance.
Figure 4 illustrates the visual appearance of “Barbara” from the Set12 dataset.
Figure 4b shows an image of a parrot corrupted by AWGN+SPIN (AWGN standard deviation 25, SPIN ratio 30%), while
Figure 4c–i shows the processing results of the six compared algorithms and the proposed MSNUNet. Compared to the other methods, MSNUNet preserves more of the fine texture of the eye region, resulting in a significantly improved visualization.
Similarly, for the mixed AWGN+RVIN case, MSNUNet achieves the best denoising performance (
Table 2). With increasing RVIN proportions and decreasing AWGN proportions, the denoising performance of the competing algorithms, except for LSLR [
11], steadily improves, with the most significant improvements being observed for MSNUNet. The denoising results for image 24077 of the BSD100 dataset are shown in
Figure 5, where
Figure 5b is corrupted by AWGN+RVIN (AWGN standard deviation 20, RVIN ratio 10%). In
Figure 5, the visual effect obtained by MSNUNet is more pleasing than all the other results.
For the mixed AWGN+SPIN+RVIN case,
Table 3 reveals that our proposed MSNUNet delivers superior performance compared to nearly all the competing models at various mixed noise levels. Although the PSNR metric of SACNN [
20] exceeds that of MSNUNet at an AWGN standard deviation of 5, an RVIN ratio of 5%, and a SPIN ratio of 50%, our model exhibits exceptional denoising capability as the levels of AWGN and RVIN increase, highlighting its remarkable robustness. The denoising results for image 119082 of the BSD100 dataset are shown in
Figure 6.
Figure 6b shows the image corrupted by AWGN+SPIN+RVIN (AWGN standard deviation 10, RVIN ratio 10%, SPIN ratio 40%). As shown by the denoised images in
Figure 6, DeGAN [
18] and SACNN [
20] effectively eliminate mixed noise and reconstruct clear image structures. WESNR [
10] and LSLR [
11] cannot clearly reconstruct the image’s content, while TL-CNN [
14] and VA-CNN [
16] blur the texture and structure of the images; however, MSNUNet reconstructs fine textures more effectively than all other competing algorithms.
To evaluate the generality of MSNUNet, we tested the performance of the model on the color versions of the BSD100 [
38] and Urban100 [
40] datasets under six different mixed noise ratios. The results are shown in
Table 4. The denoising results of the proposed model are presented in
Figure 7 and
Figure 8. It can be observed that MSNUNet achieves impressive performance in denoising color images. This demonstrates the versatility and effectiveness of our approach in handling color image denoising tasks.
5.3. Discussion
As mentioned in
Section 5.2, the proposed MSNUNet algorithm demonstrates powerful performance in removing mixed noise, specifically AWGN+RVIN and AWGN + RVIN + SPIN. It surpasses state-of-the-art methods in terms of quality metrics and the visual appearance of the images. The effectiveness of MSNUNet in denoising can be attributed to its efficient extraction of multi-scale features. MSU-Subnet generates multiple receptive fields, providing rich local and global features for extraction. Given that mixed noise has a more complex distribution than a single source of noise, it is crucial to incorporate both local and global image information in the process of removing mixed noise. On the other hand, MSCAM utilizes a CAB to dynamically adjust the weights of individual channels in the global feature space. This helps in aggregating multi-scale features and allocating weights to relevant features based on their channel-wise correlations, thereby preserving more image details.
In the task of removing mixed noise, the network’s primary objective is to learn the mapping relationship between clean and noisy images. DeGAN [
18] used a generative adversarial network for this purpose. DeGAN [
18] utilizes a generator to learn the direct mappings between clean and noisy images, generating new images with a distribution similar to clean images. The generated images are then evaluated by a discriminator, providing feedback to the generator for generating more realistic clean images. However, the training process of GANs is unstable, leading to convergence difficulties for the generator and discriminator. Consequently, DeGAN [
18] performs poorly when image details are severely corrupted. In our approach, the symmetric UNet structure effectively extracts image features, and the residual connections stabilize the training process, preventing gradient vanishing.
In SACNN [
20], local image features are extracted using Conv, and a hybrid attention mechanism is employed to learn weights for the image, incorporating SA and CA. With the powerful feature extraction capability of Conv, SACNN [
20] effectively extracts local features and assigns weights through the hybrid attention mechanism in both the channel and spatial dimensions. This empowers the network to fully utilize valuable features and restore the corrupted details of the image. In our approach, we also employ CA to learn weights for feature information in the channel dimension. However, our method goes beyond SACNN [
20] by extracting multi-scale features from high-resolution images using MSU-Subnet, which includes both local and global features. Throughout the denoising process of MSNUNet, MSCAM is utilized to effectively aggregate multi-scale features. As a result, MSNUNet outperforms SACNN [
20] in denoising performance.
5.4. Ablation Studies
In our ablation experiments, the denoising model was trained for only 150,000 iterations. The tests were performed on BSD100 [
38] and analyzed for a challenging AWGN+SPIN case (AWGN standard deviation 25, SPIN ratio 30%).
Table 5,
Table 6 and
Table 7 report the performance improvements achieved by the various configurations. We describe the impact of each component below.
Effect of input size. We computed the FLOPs, PSNR, and SSIM for image sizes of
,
, and
, respectively. As shown in
Table 5, the PSNR gain becomes larger as the image size increases, and the FLOPs also increase. In this paper, we used patches with an image size of
as network inputs.
Effect of MSCAM. To fully validate MSCAM, we replaced all MSCAM blocks in MSNUNet with Conv-based Resblocks [
35], while the others were left unchanged. In addition, we removed DConv and CAB from MSCAM, resulting in two MSCAM variants, MSCAM-C and MSCAM-D.
Table 6(d) shows that MSCAM surpasses the standard Resblock (
Table 6(a)) by 0.37 dB. Furthermore, the locality introduced by DConv enhances the robustness of MSCAM, since its removal causes a decrease in PSNR (see
Table 6(b)). In addition, including CAB produces a noteworthy enhancement of 0.23 dB, as revealed in
Table 6(c). To further illustrate the effectiveness of MSCAM,
Figure 9b,c shows the two cases of MSNUNet equipped with Resblock and MSCAM, respectively. It is evident that MSNUNet equipped with MSCAM restores additional details of the human within the image, which confirms that our method is able to retain more texture details.
Effect of MSU-Subnet. Table 7(d) demonstrates that including a U-block design improves the denoising performance by 0.14 dB compared to a conventional U-network (see
Table 7(b)). Furthermore, replacing the standard Resblock with MSCAM in the U-block enhances the aggregation of multi-scale features, as shown by a decrease in PSNR upon its removal. Overall, MSNUNet contributes a substantial gain of 0.51 dB over the baseline (see
Table 7(a)).
6. Conclusions
In this paper, we propose MSNUNet, a mixed noise removal method based on a nested UNet and multi-scale feature extraction. In MSNUNet, the two-layer nested UNet structure extracts multi-scale features from high-resolution images and aggregates them efficiently using the proposed MSCAM. This approach not only models mixed noise accurately by leveraging the richer local and global information in the original image but also extracts important image features through CA, thus preserving more texture details. The experimental results demonstrate that MSNUNet achieves leading quality metrics and recovers finer textures than the competing methods.
In the future, we will further develop our work in two directions. First, we will focus on developing lightweight image denoising models. Because hardware budgets in real-world applications are limited, most current deep learning-based denoising models cannot be deployed on such devices. We will explore two ways to alleviate this issue: designing more efficient neural network components, such as DConv, to reduce computational complexity, and optimizing existing models using currently available lightweight techniques. Second, we plan to design a versatile model for low-level computer vision tasks.