1. Introduction
With the rapid advancement of medical technology, medical image denoising, a fundamental task in computer vision, has made significant progress. Despite improvements in imaging equipment, denoising continues to present challenges in real-world applications, and there is growing interest among researchers in developing effective medical image denoising algorithms. Denoising is not only an essential preprocessing step in medical imaging workflows but also directly affects the accuracy and reliability of subsequent analyses. Accurate image denoising is therefore crucial for enhancing the quality of medical images.
Traditional denoising methods often rely on filters to remove noise from images. While filter-based algorithms can be effective, they have several drawbacks: (1) each denoising task requires a specific model, (2) iterative denoising methods are time consuming, (3) they lack generality across data types, and (4) manual or semi-automatic parameter tuning is necessary. These limitations greatly constrain the performance of medical image denoising algorithms. In contrast, with significant advancements in deep learning for medical image denoising, deep-learning-based methods have achieved state-of-the-art performance by adaptively extracting complex features from medical image data.
Among these methods, convolutional neural networks (CNNs) [1,2,3] stand out for their ability to learn local patterns from extensive medical image datasets. For medical image denoising based on deep learning, U-Net-based CNNs [4] are widely used. Meanwhile, transformers [5,6,7] have shown impressive results on large datasets by relying less on predefined data assumptions during training. However, both CNNs and transformers face challenges: CNNs often struggle with limited receptive fields, which leads to suboptimal denoising outcomes, while transformers require intricate self-attention mechanisms to process input data, resulting in a high computational overhead and prolonged inference times. Therefore, the MLP-Mixer architecture, which is based on multilayer perceptrons (MLPs), emerges as a favorable balance between the two. Specifically, MLPs have lower data priors and larger receptive fields compared with CNNs, while also being simpler and computationally lighter than transformer networks. Moreover, MLPs primarily rely on matrix multiplication, which makes them highly efficient on GPUs designed for parallel processing. However, despite the excellent performance of MLPs in many tasks, they do not effectively account for the local information of images. In medical image denoising, the local self-similarity of images is a crucial attribute, and this limitation can hinder the performance of MLPs on such tasks.
Initially used in biomedical image segmentation, U-Net has also shown outstanding results in image denoising. Recent research has investigated ways to improve its denoising capabilities by combining U-Net with transformers or adversarial neural networks. Zhang et al. [7] merged U-Net with transformers, effectively utilizing global image features at different scales and employing local attention mechanisms to enhance the local image information. Huang et al. [8] integrated U-Net with adversarial networks to capture both global and local variations between denoised and original images, resulting in superior performance metrics. Despite their effectiveness in extracting multiscale feature maps through their encoder–decoder structure, these U-Net variants become more computationally demanding as additional basic modules are stacked. Additionally, conventional U-Nets primarily rely on skip connections, which restrict feature map interactions between different scales. These approaches fail to consider the structural and texture similarities present in both multiscale and same-scale feature maps, ultimately limiting their denoising capabilities.
To overcome these limitations, we propose the asymmetric multilayer perceptron U-Net (AsymUNet) for medical image denoising. AsymUNet uses an asymmetric U-Net architecture to reduce computational load, along with a multiscale feature fusion module (MSFFM) to enhance the information interaction between the encoder and decoder. Furthermore, spatially rearranged multilayer perceptron blocks serve as the core building blocks, effectively extracting local features by reorganizing the spatial structure of feature maps. Extensive experiments confirm that AsymUNet achieves superior performance metrics and visual quality compared with existing methods. The primary contributions of this study are summarized as follows:
1. We propose AsymUNet, a denoising algorithm for medical images based on an asymmetric U-Net framework and a spatially rearranged MLP. AsymUNet effectively reduces the computational load compared with conventional U-Net structures while maintaining excellent denoising performance.
2. We introduce an MSFFM that integrates feature information from all scales, including both multiscale and same-scale feature maps. This improvement in the decoder greatly enhances the denoising effectiveness and achieves higher performance metrics.
3. We use spatially rearranged MLP blocks as the core modules of our approach. These blocks extract local information from a spatial perspective and global information from a channel perspective, improving image feature representations and preserving image details. Additionally, MLPs rely on basic matrix multiplication for feature extraction, which yields superior inference speed in practical applications.
The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 introduces the AsymUNet model; Section 4 presents extensive experiments conducted to evaluate the performance of AsymUNet; and Section 5 concludes the paper with a summary of findings and potential future directions.
3. AsymUNet
In this section, we provide a detailed explanation of the proposed AsymUNet for denoising medical images, covering the overall model structure, the design of the multiscale feature fusion module (MSFFM), the design of the basic blocks, and the loss function used during training.
3.1. Main Framework
The U-Net architecture has gained widespread popularity in image denoising due to its skip connections, which preserve image details by transferring feature maps from the encoder to the decoder. However, this mechanism primarily operates on feature maps at matching scales, which can limit denoising performance. Moreover, integrating computationally intensive modules such as MLPs or transformers can further burden the network, reducing its efficiency. To address these challenges, we propose AsymUNet, a medical image denoising algorithm based on an asymmetric U-Net framework and spatially rearranged multilayer perceptrons. AsymUNet adopts an asymmetric U-Net structure to reduce the computational load and processing time while effectively capturing both global and local information. Additionally, it includes an MSFFM to enhance the integration of feature maps.
Figure 1 illustrates the overall structure of AsymUNet. To reduce the computational load, the asymmetric U-Net reduces the number of basic blocks in the encoder while enhancing the encoder structure to maintain the denoising performance despite the smaller model size. At each downsampling layer, the encoder incorporates feature maps from the degraded image as additional inputs, ensuring that texture information at different scales is preserved during downsampling. Effective feature map transfer between the encoder and decoder is crucial for denoising performance. To optimize the utilization of multiscale feature maps and enhance the denoising efficacy, we introduce an MSFFM that combines features from various scales and transfers them to the corresponding decoder layers; its detailed mechanism is explained in Section 3.2. As for the decoder design, we propose an architecture that integrates inputs from three different sources: the merged feature maps from the MSFFM, the restored features from the previous decoder layer, and the texture features extracted from the degraded image. By integrating these three feature maps, the decoder becomes better at reconstructing both the overall structure and the fine details of the image. This decoder architecture functions similarly to a compact neural network: by feeding comprehensive image features into the decoder, it significantly reduces the encoder's workload for feature extraction. To further improve computational efficiency, we introduce spatially rearranged MLP blocks, which rely on matrix multiplication, as fundamental components. These blocks restructure the spatial arrangement of features to extract local features and derive global features from a channel perspective, helping to achieve superior performance metrics; their detailed design is presented in Section 3.3. Finally, AsymUNet uses a three-layer encoder–decoder architecture; at the bottleneck, the feature maps extracted by the encoder are sent to spatially rearranged MLP blocks and are then decoded through the same number of stages as they were encoded.
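To make the overall data flow concrete, the following PyTorch sketch mirrors the three-level asymmetric encoder–decoder described above, with degraded-image features injected at each scale. The module names, block counts, and channel widths are illustrative placeholders, not the authors' implementation: simple convolutions stand in for the SRCMLP blocks and the MSFFM described in Sections 3.2 and 3.3.

```python
import torch
import torch.nn as nn

def block(ch):
    # Placeholder for the SRCMLP basic block described in Section 3.3.
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.GELU())

class AsymUNetSketch(nn.Module):
    """Three-level encoder-decoder; the encoder is shallower than the decoder,
    and downsampled copies of the degraded input are injected at every scale."""
    def __init__(self, ch=32):
        super().__init__()
        self.embed = nn.Conv2d(3, ch, 3, padding=1)        # shallow feature extraction
        # Encoder: deliberately fewer blocks than the decoder (the "asymmetry").
        self.enc1, self.enc2, self.enc3 = block(ch), block(ch * 2), block(ch * 4)
        self.down1 = nn.Conv2d(ch, ch * 2, 2, stride=2)
        self.down2 = nn.Conv2d(ch * 2, ch * 4, 2, stride=2)
        # Degraded-image feature extractors, one per scale.
        self.img2 = nn.Conv2d(3, ch * 2, 3, padding=1)
        self.img3 = nn.Conv2d(3, ch * 4, 3, padding=1)
        # Decoder: two blocks per level vs. one in the encoder (illustrative ratio).
        self.dec3 = nn.Sequential(block(ch * 4), block(ch * 4))
        self.up2 = nn.ConvTranspose2d(ch * 4, ch * 2, 2, stride=2)
        self.dec2 = nn.Sequential(block(ch * 2), block(ch * 2))
        self.up1 = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)
        self.dec1 = nn.Sequential(block(ch), block(ch))
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x):
        half = nn.functional.interpolate(x, scale_factor=0.5)
        quarter = nn.functional.interpolate(x, scale_factor=0.25)
        e1 = self.enc1(self.embed(x))
        e2 = self.enc2(self.down1(e1) + self.img2(half))     # inject image features
        e3 = self.enc3(self.down2(e2) + self.img3(quarter))
        d3 = self.dec3(e3)
        d2 = self.dec2(self.up2(d3) + e2 + self.img2(half))  # the MSFFM fuses here
        d1 = self.dec1(self.up1(d2) + e1)
        return self.out(d1) + x                              # residual denoising

x = torch.randn(1, 3, 64, 64)
print(AsymUNetSketch()(x).shape)  # torch.Size([1, 3, 64, 64])
```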
3.2. Multiscale Feature Fusion Module
Skip connections are essential in U-Net, as they enable crucial interaction between the encoder and decoder, significantly enhancing feature fusion. However, despite being efficient and effective, skip connections do not fully exploit the correlations between feature maps at different scales, which can restrict denoising performance. Research by Li et al. [27] shows that image feature maps exhibit commonalities on two levels: first, within images at the same scale, feature maps at corresponding positions display significant texture similarity; second, across images of varying scales, structural feature maps exhibit clear similarities. Building on these insights, this paper argues that preserving this intrinsic coherence is vital for achieving comprehensive and effective information integration in multiscale feature fusion.
To effectively reduce noise in medical images, we propose an MSFFM that merges image feature maps across different scales to preserve as much detail and structure as possible. Taking a mid-scale MSFFM as an example (Figure 2), this module integrates four inputs: the feature map $F^{i}$ from the decoder output at the current resolution, the feature map $F^{i+1}$ from the decoder output at a higher resolution, the feature map $F^{i-1}$ from the decoder output at a lower resolution, and the output from the preceding MSFFM. This fusion process not only maintains the unique structural characteristics of each scale but also effectively preserves the texture information at the current scale. The operations of the MSFFM are represented by Equation (1):

$$M^{i} = \mathrm{Fuse}\left(F^{i},\, F^{i+1},\, F^{i-1},\, M^{i-1}\right), \tag{1}$$

where $\mathrm{Fuse}(\cdot)$ denotes the operation of integrating these four feature maps, involving a spatial rearrangement MLP (SRCMLP) block and a channel attention module; $M^{i}$ represents the output of the MSFFM at the current resolution; and $M^{i-1}$ is the output from the previous MSFFM.
With the SRCMLP block, AsymUNet efficiently extracts detailed image feature maps. By introducing the channel attention mechanism, it is able to capture global information within these feature maps. The resulting output from this module not only contributes to the next higher resolution fusion feature module, but also significantly enhances the denoising capability of the decoder within our proposed framework.
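A minimal sketch of the fusion step in Equation (1) is given below, assuming a squeeze-and-excitation style channel attention and a 1×1 convolution as a stand-in for the SRCMLP block; the class and argument names are hypothetical, and the paper's exact attention variant is not specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    # Squeeze-and-excitation style channel attention (a common choice;
    # assumed here, not necessarily the authors' exact design).
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // r), nn.GELU(), nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))            # global average pool -> weights
        return x * w[:, :, None, None]

class MSFFMSketch(nn.Module):
    """Fuses feature maps from three adjacent resolutions plus the previous
    MSFFM output, at the current resolution (Equation (1))."""
    def __init__(self, ch_cur, ch_high, ch_low, ch_prev):
        super().__init__()
        total = ch_cur + ch_high + ch_low + ch_prev
        self.mix = nn.Conv2d(total, ch_cur, 1)     # stand-in for the SRCMLP block
        self.ca = ChannelAttention(ch_cur)

    def forward(self, f_cur, f_high, f_low, f_prev):
        h, w = f_cur.shape[-2:]
        f_high = F.interpolate(f_high, size=(h, w))   # downsample higher resolution
        f_low = F.interpolate(f_low, size=(h, w))     # upsample lower resolution
        f_prev = F.interpolate(f_prev, size=(h, w))
        fused = torch.cat([f_cur, f_high, f_low, f_prev], dim=1)
        return self.ca(self.mix(fused))

# Example: mid-scale fusion with illustrative channel widths.
m = MSFFMSketch(64, 32, 128, 128)
out = m(torch.randn(1, 64, 32, 32), torch.randn(1, 32, 64, 64),
        torch.randn(1, 128, 16, 16), torch.randn(1, 128, 16, 16))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```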
3.3. Spatial Rearrangement Multilayer Perceptron Block
In MLP-Mixer [28], the Mix-Token MLP block accepts tokens that have been linearly expanded as the input feature maps, and it uses fully connected layers to model the relationships between these tokens. However, the fixed input dimensions of the fully connected layers in the Mix-Token MLP block make it incompatible with the variable image sizes encountered in medical image processing. Furthermore, because the Mix-Token MLP block extracts global information directly from the token sequence on each pass, it risks overlooking crucial local information.
To address this issue, we propose an SRCMLP block as the foundational block of the model. Similar to MLP-Mixer, each SRCMLP consists of two sub-blocks: the rearrangement module (RC) and the channel MLP module (CMLP), which are designed to extract spatial and channel information from the input feature maps, respectively. Given an input feature $X$ with dimensions $H \times W \times C$, where $H$, $W$, and $C$ represent the height, width, and original channel count of the image, respectively, the operations of the SRCMLP can be expressed as shown in Equations (2) and (3):

$$Y = \mathrm{RC}(X) + X, \tag{2}$$

$$Z = \mathrm{CMLP}(Y) + Y, \tag{3}$$

where $Y$ and $Z$ represent the intermediate feature maps and the final output feature maps, respectively.

The specific operations of the RC module are shown in Figure 3. First, the spatial dimension of the input feature maps is divided into multiple fixed-size blocks according to a specified square region size. Then, the feature maps in these fixed-size blocks are reorganized. Suppose the specified side length for division is Len. The input feature map $X$ with shape $H \times W \times C$ is divided into $T = \frac{H}{\mathrm{Len}} \times \frac{W}{\mathrm{Len}}$ regions, each with a shape of $\mathrm{Len} \times \mathrm{Len} \times C$. Next, the feature maps of each region are reorganized, transforming the shape of the input feature map $X$ into $T \times (\mathrm{Len}^{2}C)$. By utilizing two fully connected layers and a batch normalization (BN) layer, the local information within the input feature maps is effectively extracted. Finally, the CMLP module is employed to derive global information from these inputs, ultimately enhancing the network's denoising performance.
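The rearrangement itself can be expressed with plain tensor reshapes. The following sketch implements the region split, token mixing with two fully connected layers and BN, and the inverse rearrangement; the hidden width and the GELU between the two layers are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class RCSketch(nn.Module):
    """Rearrangement module: splits the H x W plane into Len x Len regions,
    flattens each region into a token of length Len*Len*C, and mixes it with
    two fully connected layers and batch normalization."""
    def __init__(self, ch, length=4, hidden=None):
        super().__init__()
        d = length * length * ch
        hidden = hidden or d // 2          # assumed bottleneck ratio
        self.length = length
        self.mlp = nn.Sequential(nn.Linear(d, hidden), nn.BatchNorm1d(hidden),
                                 nn.GELU(), nn.Linear(hidden, d))

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.length
        # (B, C, H, W) -> (B*T, Len*Len*C), with T = (H/Len)*(W/Len) regions
        t = x.reshape(b, c, h // s, s, w // s, s)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, s * s * c)
        t = self.mlp(t)                    # local mixing within each region
        # Invert the rearrangement back to (B, C, H, W).
        t = t.reshape(b, h // s, w // s, s, s, c)
        return t.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)

x = torch.randn(2, 8, 16, 16)
print(RCSketch(8)(x).shape)  # torch.Size([2, 8, 16, 16])
```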
As shown in Figure 4, to capture more detailed image features, the CMLP module uses a fully connected layer to expand the feature map dimension. Furthermore, the inclusion of BN, GELU activation, and dropout improves the stability and performance of the network, enabling the model to converge quickly. Lastly, a fully connected layer is employed to restore the original number of channels.
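A corresponding sketch of the CMLP sub-block is shown below, treating each pixel as a token over the channel dimension; the expansion factor and dropout rate are illustrative choices, since the text does not fix them here.

```python
import torch
import torch.nn as nn

class CMLPSketch(nn.Module):
    """Channel MLP: expand the channel dimension, apply BN + GELU + dropout,
    then project back to C channels (the expansion factor is an assumption)."""
    def __init__(self, ch, expand=2, p=0.1):
        super().__init__()
        self.fc1 = nn.Linear(ch, ch * expand)
        self.bn = nn.BatchNorm1d(ch * expand)
        self.act = nn.GELU()
        self.drop = nn.Dropout(p)
        self.fc2 = nn.Linear(ch * expand, ch)

    def forward(self, x):                                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2).reshape(-1, c)       # one token per pixel
        t = self.fc2(self.drop(self.act(self.bn(self.fc1(t)))))
        return t.reshape(b, h * w, c).transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(2, 8, 16, 16)
print(CMLPSketch(8)(x).shape)  # torch.Size([2, 8, 16, 16])
```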
In the SRCMLP block, the fully connected layers account for most of the parameters and computation. Assume the input feature map has shape $H \times W \times C$. In the RC module, suppose the region side length after division is Len, so that rearrangement yields $T = \frac{H}{\mathrm{Len}} \times \frac{W}{\mathrm{Len}}$ tokens, each of dimension $\mathrm{Len}^{2}C$, and suppose the bottleneck of the two fully connected layers reduces this dimension to $D_{r}$. The parameter count for the rearrangement module is then $2\,\mathrm{Len}^{2}C D_{r}$, and the corresponding number of floating-point operations (FLOPs) is $2\,T\,\mathrm{Len}^{2}C D_{r} = 2HWCD_{r}$. Next, in the CMLP module, assume the bottleneck modifies the feature map dimension to $D_{c}$; the parameters and FLOPs for the CMLP are $2CD_{c}$ and $2HWCD_{c}$, respectively. The total parameter count and FLOPs are therefore $2C(\mathrm{Len}^{2}D_{r} + D_{c})$ and $2HWC(D_{r} + D_{c})$, respectively. If the fully connected layers were replaced by conventional $k \times k$ convolution layers, both totals would grow by a factor of roughly $k^{2}$ (e.g., $9\times$ for $3 \times 3$ kernels), which is significantly greater. Therefore, the SRCMLP has fewer parameters and FLOPs than an equivalent convolutional design. Moreover, because the MLP uses matrix multiplication instead of convolutional kernels to process feature maps, it benefits substantially from parallel computation on GPUs, significantly improving computational efficiency.
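The totals above can be sanity-checked numerically. The snippet below plugs in illustrative sizes, together with the assumed bottleneck widths $D_r$ and $D_c$, and compares against a 3×3 convolutional equivalent; the specific values are examples only.

```python
# Quick numerical check of the parameter/FLOP comparison above, with
# illustrative sizes (biases ignored; one multiply-accumulate = one FLOP).
H, W, C, Len = 64, 64, 32, 4
d = Len * Len * C                 # token length after rearrangement: Len^2 * C
T = (H // Len) * (W // Len)       # number of regions
Dr, Dc = d // 2, 2 * C            # assumed bottleneck / expansion widths

rc_params, rc_flops = 2 * d * Dr, 2 * T * d * Dr        # FLOPs = 2*H*W*C*Dr
cmlp_params, cmlp_flops = 2 * C * Dc, 2 * H * W * C * Dc
print("FC:  ", rc_params + cmlp_params, "params,", rc_flops + cmlp_flops, "FLOPs")

# A 3x3 convolution with the same channel widths has 9x the weights per layer,
# so both totals grow by roughly an order of magnitude.
print("Conv:", 9 * (rc_params + cmlp_params), "params,",
      9 * (rc_flops + cmlp_flops), "FLOPs")
```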
3.4. The Loss Function
Most image denoising algorithms primarily focus on calculating the loss between the output image and the clean image. However, in the U-Net architecture, the image is progressively restored through multiple hierarchical levels. If we only compute the loss between the final output layer and the clean image, we overlook the denoising processes at the intermediate stages of the network. To address this issue, AsymUNet introduces a multiscale loss function, which consists of multiple L1 loss functions. The L1 loss function can be represented by Equation (4) as follows:

$$\mathcal{L}_{1}\left(X, \hat{X}\right) = \frac{1}{T}\sum_{t=1}^{T}\left\| \hat{X}_{t} - X_{t} \right\|_{1}, \tag{4}$$

where $X$ is the set of original images, $\hat{X}$ is the set of images restored by AsymUNet, $X_{t}$ and $\hat{X}_{t}$ represent the $t$-th original image and denoised image, respectively, and $T$ is the number of samples.
Meanwhile, the multiscale loss function consists of three levels of loss functions, which can be represented by Equation (5):

$$\mathcal{L}_{ms} = \sum_{e=1}^{E}\mathcal{L}_{1}\left(X^{(e)}, \hat{X}^{(e)}\right), \tag{5}$$

where $E$ is the number of layers and $X^{(e)}$ and $\hat{X}^{(e)}$ denote the clean and restored images at the $e$-th scale. The $\mathcal{L}_{1}$ loss is accumulated between the input and output images at each layer of the network to account for the overall loss.
To improve the denoising performance, auxiliary loss terms can be incorporated into the loss function. For AsymUNet, this involves integrating a multiscale frequency reconstruction (MSFR) loss function with the existing multiscale loss function. The MSFR loss measures the L1 distance in the frequency domain, at different scales, between the original image and the denoised image. This relationship is represented by Equation (6):

$$\mathcal{L}_{msfr} = \sum_{e=1}^{E}\frac{1}{T}\sum_{t=1}^{T}\left\| F\big(\hat{X}_{t}^{(e)}\big) - F\big(X_{t}^{(e)}\big) \right\|_{1}, \tag{6}$$

where $F$ represents the fast Fourier transform (FFT), which converts the image signal into frequency-domain information.
The overall loss function for training AsymUNet consists of two components: $\mathcal{L}_{ms}$ and $\mathcal{L}_{msfr}$. To control the influence of the auxiliary term $\mathcal{L}_{msfr}$, a parameter $\lambda$ is introduced as a coefficient before $\mathcal{L}_{msfr}$. Therefore, the overall loss function can be represented as shown in Equation (7):

$$\mathcal{L} = \mathcal{L}_{ms} + \lambda\,\mathcal{L}_{msfr}. \tag{7}$$
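Equations (4)–(7) translate directly into a few lines of PyTorch. The sketch below assumes the network returns one restored image per decoder level and uses an illustrative weight for $\lambda$; the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def multiscale_loss(outputs, targets, lam=0.1):
    """Multiscale L1 + MSFR loss (Equation (7)); `outputs`/`targets` are lists
    of tensors, one per decoder level, and `lam` is an illustrative weight."""
    l_ms, l_msfr = 0.0, 0.0
    for out, tgt in zip(outputs, targets):
        l_ms = l_ms + F.l1_loss(out, tgt)                     # spatial-domain L1
        fo, ft = torch.fft.fft2(out), torch.fft.fft2(tgt)     # frequency domain
        l_msfr = l_msfr + (fo - ft).abs().mean()              # L1 on FFT coefficients
    return l_ms + lam * l_msfr

# Example with a three-level pyramid of restored/clean image pairs.
outs = [torch.randn(1, 3, s, s) for s in (64, 32, 16)]
tgts = [torch.randn(1, 3, s, s) for s in (64, 32, 16)]
print(multiscale_loss(outs, tgts))
```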
4. Experiments
Section 4.1 describes the experimental setup, Section 4.2 presents the experimental results, and Section 4.3 presents the results of the ablation experiments.
4.1. Experimental Settings
We conducted experiments with the proposed AsymUNet network on various medical image denoising tasks, including (1) Gaussian grayscale/color medical image denoising and (2) low-dose CT (LDCT) image denoising.
(1) Gaussian Grayscale/Color Medical Image Denoising: Images from the HAM10000 dataset [29] and the Chest X-ray dataset [30] are used as the training and testing datasets. During training, white Gaussian noise is added to the training set images, with the standard deviation ($\sigma$) randomly chosen from the interval [0, 55]. The HAM10000 dataset consists of 10,000 RGB images of pigmented skin lesions from various individuals, each sized 450 × 600 × 3. For training, 9400 images are randomly selected, leaving 300 images for testing. The Chest X-ray dataset was collected from pediatric patients aged one to five at the Guangzhou Women and Children's Medical Center. It includes 5216 images for training and 624 images for testing.
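For reference, the noise synthesis described above amounts to the following, assuming images in the 0–255 range; the function name is illustrative.

```python
import torch

def add_gaussian_noise(img):
    """Add white Gaussian noise with sigma drawn uniformly from [0, 55]
    (images assumed to be in the 0-255 range), matching the training setup."""
    sigma = torch.empty(1).uniform_(0, 55)
    return img + sigma * torch.randn_like(img)

clean = torch.rand(3, 128, 128) * 255
noisy = add_gaussian_noise(clean)
```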
(2) Low-Dose CT Medical Image Denoising: To evaluate the proposed AsymUNet on actual medical images, we used the Mayo Clinic CT dataset [31] in our experiments. This dataset was created for the 2016 Low Dose CT Grand Challenge, which was jointly organized by the National Institutes of Health, the American Association of Physicists in Medicine (AAPM), and the Mayo Clinic, with the purpose of comparing different LDCT image denoising algorithms. The dataset consists of 150 sets of projection data and corresponding image data obtained from clinical CT examinations using SOMATOM Definition CT systems. For our experiments, we chose 2167 pairs of quarter-dose and full-dose CT images for training and 100 pairs for testing.
The AsymUNet network is trained using the PyTorch framework on a workstation equipped with an Intel(R) Core(TM) i5-13600KF CPU @ 3.50 GHz and an NVIDIA RTX 3090 GPU. For all experiments, the training parameters are configured as follows: we employ the AdamW optimizer with $\beta_{1} = 0.9$, $\beta_{2} = 0.999$, and a weight decay of $1 \times 10^{-4}$. The loss function used is the L1 loss, and training proceeds for 300,000 iterations. The initial learning rate is set to $3 \times 10^{-4}$ and is gradually decreased to $1 \times 10^{-6}$ using a cosine annealing strategy. To facilitate smooth training, a progressive learning strategy is adopted: the patch size and batch size are initially set to 128 × 128 and 8, respectively, and as training reaches 92,000, 156,000, 204,000, 240,000, and 276,000 iterations, the patch size is progressively increased while the batch size is correspondingly reduced.
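A minimal sketch of this optimizer and schedule setup is shown below, using the hyperparameter values as given above; the one-layer model is a placeholder for AsymUNet, and the loop is a skeleton only.

```python
import torch

# Optimizer/schedule sketch matching the setup described in the text.
model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for AsymUNet
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.999), weight_decay=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt, T_max=300_000, eta_min=1e-6)          # cosine decay to the final LR

for step in range(3):                          # training loop skeleton
    opt.zero_grad()
    loss = model(torch.randn(1, 3, 64, 64)).abs().mean()   # placeholder loss
    loss.backward()
    opt.step()
    sched.step()
```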
4.2. Results
In this section, we demonstrate the denoising capabilities of the proposed AsymUNet on different types of medical images, including Gaussian grayscale and color images from dermatological and pulmonary datasets, as well as LDCT images.
(1) Denoising of Gaussian Grayscale/Color Medical Images: We evaluated the performance of AsymUNet against four competitive algorithms: the traditional BM3D [12] algorithm, as well as three deep-learning methods, namely DNCNN [1], ADL [18], and DRANet [2]. To assess the effectiveness of AsymUNet, we utilized the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) as evaluation metrics.
To improve the generalizability of the proposed AsymUNet, Gaussian white noise at various random levels was added to the training images. During the testing phase, the network's ability to generalize was assessed across five different noise levels. The experimental results are detailed in Table 1.
From the data presented in Table 1, it is evident that AsymUNet shows significant improvements in both PSNR and SSIM compared with the classical BM3D algorithm. This indicates that AsymUNet makes effective use of extensive medical image datasets to learn diverse denoising capabilities and can adaptively reduce noise through the application of these learned functions. Additionally, compared with the other deep-learning algorithms, AsymUNet consistently exhibits superior performance across the evaluation metrics, underscoring its robust ability to denoise medical images at various noise levels.
To showcase the visual capabilities of AsymUNet, Figure 5 and Figure 6 depict the results of AsymUNet alongside those of the comparison algorithms at a fixed noise level. The figures illustrate that AsymUNet outperforms the other methods in terms of clarity and preservation of image details. Whether applied to the HAM10000 color dataset or the Chest X-ray grayscale dataset, AsymUNet effectively restores image edges and enhances contrast, thereby preserving important image details.
(2) Low-Dose CT Image Denoising: This section highlights the effectiveness of AsymUNet in reducing noise in LDCT medical images. The model’s performance was evaluated using the Mayo Clinic LDCT Grand Challenge dataset from the AAPM.
Table 2 presents the test results comparing AsymUNet with WGAN [26], EDCNN [22], REDCNN [21], and Eformer [23] on the AAPM dataset. The results indicate that AsymUNet outperforms all comparison algorithms, underscoring its superior denoising capability for medical images. Additionally, Figure 7 visually demonstrates AsymUNet's performance on the AAPM dataset, showing its ability to effectively reduce noise while preserving crucial image details, thereby reducing the risk of misdiagnosis in clinical applications.
In real-world applications, the processing speed of neural networks is crucial; therefore, we also compared the inference times of different models. Figure 8 shows the total inference time for each model on the HAM10000 test set. Among these models, AsymUNet demonstrates the fastest inference speed, an advantage attributable to its asymmetric U-Net architecture and MLP-based structure, which together significantly reduce computation time.
4.3. Ablation Experiments
To further investigate the effectiveness of the various components, we carried out several ablation experiments. All experiments were performed on the Chest X-ray dataset using the same progressive training strategy. For testing, the test set from the Chest X-ray dataset was used, and all results were evaluated under Gaussian noise at a fixed noise level. The influence of each component is discussed in the subsequent sections.
(1) Multiscale Feature Fusion Module: In this subsection, we conducted experiments in which the skip connection part of the MSFFM in AsymUNet was removed, resulting in a model referred to as Baseline. We compared the experimental results of Baseline, Baseline + Skip Connection (SC), and Baseline + MSFFM. Table 3 illustrates that incorporating the MSFFM leads to better metrics than using only skip connections. This highlights that integrating feature maps across multiple scales and transmitting them to the decoder allows the model to utilize image features effectively, thereby achieving superior denoising performance.
(2) Spatial Rearrangement Multilayer Perceptron Block: To assess the effectiveness of the SRCMLP, we conducted an experiment in which the SRCMLP was removed from AsymUNet and replaced with a standard MLP, denoted as W/O SRCMLP. As indicated in Table 4, the denoising performance of the network decreased notably after the removal of the SRCMLP.
(3) Asymmetric U-Net: Table 5 presents the computational load and denoising performance of the asymmetric U-Net compared with a symmetric U-Net. It is evident that the asymmetric variant reduced the computational load by nearly a quarter while maintaining an equivalent denoising performance compared with its symmetric counterpart.
5. Conclusions
This paper presented AsymUNet, an asymmetric multilayer perceptron U-Net designed to improve denoising performance while reducing inference time. AsymUNet utilizes an asymmetric U-Net architecture, which contributes to its efficiency gains. Additionally, the model incorporates the SRCMLP as a foundational block, enhancing performance metrics and accelerating inference by using the RC module for local information extraction and the CMLP for global information extraction. Furthermore, inspired by the similarities in structure and texture across multiscale and same-scale feature maps, we introduced an MSFFM, which enhances information flow between the encoder and decoder and thereby improves the overall denoising effectiveness. By combining these advances, AsymUNet achieves state-of-the-art performance in both color/grayscale image denoising and LDCT image denoising tasks. Specifically, on the AAPM dataset, AsymUNet achieved PSNR and SSIM scores of 44.67 dB and 0.9864, respectively, an average PSNR improvement of 1.19 dB over its closest rival (Eformer).