1. Introduction
Compressed Sensing (CS) overcomes the limitations of the Nyquist sampling theorem, enabling the efficient reconstruction of signals sampled at rates significantly below the traditional Nyquist rate [1], particularly for signals that are inherently sparse or sparse in a specific transform domain [2]. This innovation has profound implications: it substantially reduces the cost of sensor data compression and mitigates the demands on transmission bandwidth and storage capacity. CS has found wide application, ranging from single-pixel cameras [3,4] to snapshot compressive imaging [5,6] and magnetic resonance imaging [7,8].
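To make the sampling model concrete, the following minimal sketch (our own illustration, with arbitrary sizes) measures a sparse signal with a random Gaussian matrix at a sampling rate of m/n = 0.25:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, k = 256, 64, 8          # signal length, measurements (m << n), sparsity level
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)  # k-sparse signal

Phi = rng.standard_normal((m, n)) / np.sqrt(m)  # random Gaussian measurement matrix
y = Phi @ x                                      # compressed measurements
print(y.shape)                                   # (64,) -- far fewer samples than n
```

CS theory states that, with high probability, the k-sparse signal x can be recovered from these m measurements even though the system is underdetermined.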
CS reconstruction methods can be broadly categorized into two main classes: traditional CS reconstruction methods [9,10,11,12,13,14,15,16] and deep-learning-based CS reconstruction methods [17,18,19,20,21]. Traditional CS reconstruction methods are designed around a priori knowledge of image sparsity, presuming that the signal is sparse in a particular transform domain [22,23]. These methods formulate signal reconstruction as an optimization problem within a sparse-model framework [12] and solve it iteratively using convex optimization methods, greedy algorithms, or Bayesian techniques. While traditional CS reconstruction methods offer strong convergence and theoretical guarantees, they suffer from computational intensity, slow reconstruction speeds, and limited reconstruction performance [24].
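For reference, the optimization problem these methods solve is typically written in the following standard form (the notation here is generic, not taken from this paper):

$$\hat{x} = \arg\min_{x} \; \frac{1}{2} \left\| \Phi x - y \right\|_2^2 + \lambda \left\| \Psi x \right\|_1$$

where $y$ denotes the CS measurements, $\Phi$ the measurement matrix, $\Psi$ a sparsifying transform, and $\lambda$ a regularization weight that balances data fidelity against sparsity.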
The computational complexity inherent in traditional CS reconstruction methods makes real-time image reconstruction difficult to achieve. To address this, deep learning methods, known for their strength in image processing, have been introduced into CS reconstruction. Deep-learning-based CS reconstruction algorithms fall into two primary categories: deep non-unfolding networks (DNUNs) [18,19,21,25,26] and deep unfolding networks (DUNs) [8,27,28,29,30,31,32,33]. A DNUN treats the reconstruction process as a black box, relying on a data-driven approach to build an end-to-end neural network for the CS reconstruction problem. In this paradigm, the Gaussian random measurement matrix used in traditional CS reconstruction methods is replaced with a learnable measurement network, and the reconstruction network is built around well-established deep learning models such as stacked denoising autoencoders [25], convolutional neural networks (CNNs) [18], or residual networks [26] that learn the mapping from CS measurements to reconstructed signals. Although DNUNs achieve real-time reconstruction and surpass traditional CS reconstruction methods, their entirely data-driven nature and lack of a strong theoretical foundation lead to high data dependency and poor interpretability.
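As an illustration only, the sketch below shows the typical DNUN pipeline in PyTorch: a strided convolution plays the role of a learnable measurement matrix, a 1 × 1 convolution with pixel shuffling provides the initial reconstruction, and a small residual CNN performs the deep reconstruction. The block size and CS ratio are hypothetical choices, not those of any cited network.

```python
import torch
import torch.nn as nn

class TinyDNUN(nn.Module):
    """Illustrative DNUN: learnable block sampling + CNN reconstruction."""
    def __init__(self, B=32, ratio=0.25):
        super().__init__()
        m = int(ratio * B * B)
        # Sampling: one stride-B conv acts as a learnable per-block measurement matrix.
        self.sample = nn.Conv2d(1, m, kernel_size=B, stride=B, bias=False)
        # Initial reconstruction: 1x1 conv back to B*B values per block, then reshape.
        self.init_rec = nn.Conv2d(m, B * B, kernel_size=1, bias=False)
        self.shuffle = nn.PixelShuffle(B)
        # Deep reconstruction: a small residual CNN.
        self.deep = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x):
        y = self.sample(x)                   # CS measurements
        x0 = self.shuffle(self.init_rec(y))  # initial reconstruction
        return x0 + self.deep(x0)            # residual refinement

out = TinyDNUN()(torch.randn(1, 1, 96, 96))
print(out.shape)  # torch.Size([1, 1, 96, 96])
```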
Conversely, a DUN combines traditional optimization methods with deep learning techniques, using an optimization algorithm as its theoretical guide: a fixed-depth neural network simulates a finite number of iterations of the optimization algorithm to produce the reconstructed signal. Many optimization algorithms, such as Approximate Message Passing (AMP) [34], the Iterative Shrinkage-Thresholding Algorithm (ISTA) [35], and the Alternating Direction Method of Multipliers (ADMM) [36], have been unfolded into DUNs, yielding superior reconstruction performance compared to DNUNs. Because it is grounded in theoretically guaranteed optimization algorithms, a DUN offers strong reconstruction performance and a degree of interpretability.
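To clarify what is being unfolded, here is a plain ISTA loop (a minimal sketch under generic notation, not any cited network): each iteration performs a gradient descent step followed by a proximal mapping (soft thresholding). A DUN replaces a fixed, small number of these iterations with network phases, which is the correspondence behind the gradient descent and proximal mapping modules discussed below.

```python
import numpy as np

def soft_threshold(v, theta):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def ista(Phi, y, steps=200, rho=None, lam=0.01):
    """Plain ISTA for min 0.5*||Phi x - y||^2 + lam*||x||_1."""
    if rho is None:
        rho = 1.0 / np.linalg.norm(Phi, 2) ** 2        # step size from the Lipschitz constant
    x = np.zeros(Phi.shape[1])
    for _ in range(steps):
        grad = Phi.T @ (Phi @ x - y)                   # gradient descent step
        x = soft_threshold(x - rho * grad, rho * lam)  # proximal mapping step
    return x
```

In a DUN, the step sizes and the proximal mapping are learned per phase, while the two-step structure of each iteration is preserved.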
Nonetheless, DUNs typically operate in single-channel form [27,28,29,30,37,38]: the feature maps in the deep reconstruction network are transmitted between phases and updated within each phase as single-channel images. This structural characteristic limits the characterization ability of the feature maps and ultimately degrades the network's reconstruction performance. Moreover, mainstream DUN methods [28,29,30,33,37,38] often rely on standard CNNs to build the reconstruction network, with each CNN using a uniform receptive field. The human visual system, by contrast, is a multi-channel model in which receptive fields of different sizes arise in its higher-order areas [39,40,41]. The single receptive field of a standard CNN is therefore inconsistent with the actual behavior of the human visual system, which hampers the CNN's characterization ability.
To address these limitations, this paper introduces two modules within the Deep Reconstruction Subnet (DRS) of our proposed Multi-channel and Multi-scale Unfolding Network (MMU-Net): the Attention-based Multi-channel Gradient Descent Module (AMGDM) and the Multi-scale Proximal Mapping Module (MPMM). These modules are designed to enhance feature characterization and representation in the DUN. AMGDM transmits feature maps in multi-channel form, both intra-stage and inter-stage, which enhances their characterization ability. Moreover, inspired by SK-Net [42], we introduce Adap-SKConv, an attention convolution kernel with a feature fusion mechanism, to obtain fused gradient terms with attention and further improve the feature representation in AMGDM. To address the limitation of single-scale CNNs, MPMM employs multi-scale CNNs: motivated by the different receptive fields in the higher-order areas of the human visual system, we adopt the Inception structure [43] and design a Multi-scale Block (MB) with multiple parallel convolutional branches in MPMM, which extract features at different receptive fields and thus enhance the network's representational capability.
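A minimal sketch of such an Inception-style multi-scale block is shown below; the four parallel branch kernels (1 × 1, 3 × 3, 5 × 5, 7 × 7) and the channel split are our illustrative assumptions, not MMU-Net's exact MB configuration:

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Illustrative Inception-style block: parallel branches with
    different receptive fields, fused back to the input width."""
    def __init__(self, ch=32):
        super().__init__()
        self.b1 = nn.Conv2d(ch, ch // 4, 1)             # 1x1 receptive field
        self.b2 = nn.Conv2d(ch, ch // 4, 3, padding=1)  # 3x3 receptive field
        self.b3 = nn.Conv2d(ch, ch // 4, 5, padding=2)  # 5x5 receptive field
        self.b4 = nn.Conv2d(ch, ch // 4, 7, padding=3)  # 7x7 receptive field
        self.fuse = nn.Conv2d(ch, ch, 1)                # merge the scales

    def forward(self, x):
        feats = torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
        return self.fuse(feats)

print(MultiScaleBlock()(torch.randn(1, 32, 33, 33)).shape)  # torch.Size([1, 32, 33, 33])
```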
The main contributions of this paper are as follows:
We introduce a novel end-to-end sampling and reconstruction network, named the Multi-channel and Multi-scale Unfolding Network (MMU-Net), comprising three integral components: the Sampling Subnet (SS), Initialize Subnet (IS), and Deep Reconstruction Subnet (DRS).
Within the Deep Reconstruction Subnet (DRS), the Attention-based Multi-channel Gradient Descent Module (AMGDM) is developed. This module introduces a multi-channel strategy that effectively addresses the challenge of limited feature map characterization associated with the conventional single-channel approach. Additionally, we design the Adap-SKConv attention convolution kernel with a feature fusion mechanism, enhancing the feature characterization of gradient terms. These innovations collectively contribute to a substantial improvement in the network’s reconstruction performance.
In DRS, we introduce the Multi-scale Proximal Mapping Module (MPMM). MPMM incorporates a Multi-scale Block (MB) featuring multiple parallel convolutional branches, facilitating the extraction of features across various receptive fields. This design yields multi-scale features, significantly enhancing the characterization capability of the convolutional neural network and thereby improving reconstruction performance.
Empirical evidence from a multitude of experiments demonstrates the superior performance of the proposed method in comparison to existing state-of-the-art networks. This extensive validation underscores the efficacy and rationality of our approach.
The rest of the paper is organized as follows. Section 2 reviews related work on DNUNs and DUNs. Section 3 presents the preliminary knowledge for this work, and Section 4 describes the framework and details of MMU-Net. Section 5 covers the experimental settings, baselines, comparisons with other state-of-the-art methods, and ablation experiments. Section 6 draws the conclusions of the study.
5. Experimental Results and Analysis
This section provides a comprehensive examination of the performance of our proposed MMU-Net. We begin by outlining our experimental settings, detailing the evaluation metrics used, and introducing the baseline methods. Subsequently, we delve into discussions that include an extended investigation, aiming to illustrate the efficacy of our method by addressing the following research questions:
RQ1: How does the performance of our proposed MMU-Net compare in accuracy to state-of-the-art CS reconstruction methods?
RQ2: What is the influence of the key components of the proposed AMGDM (including the multi-channel strategy and Adap-SKConv) in MMU-Net?
RQ3: What is the effect of the essential component of MPMM, the Multi-scale Block (MB), in MMU-Net?
5.1. Experimental Parameter Settings
In our experiments, we employ a dataset comprising 91 images, consistent with previous work [30]. These images are used for training, with the luminance components of 88,912 randomly extracted image blocks forming the training set. Our testing set encompasses three natural image datasets and a remote sensing image dataset: the natural image datasets are the widely recognized benchmarks Set11 [18], BSD100 [48], and Urban100 [49], and the remote sensing image dataset consists of eight images from the UC Merced Land Use Dataset [50].
For MMU-Net's configuration, we set the stage number to 13, use a batch size of 32, and run the training process for 300 epochs. During training, the network is optimized using the Adam optimizer [51] with exponential decay rates of 0.9 and 0.999 for the first- and second-moment estimates.
Our experiments are conducted using PyTorch 1.11, and the hardware setup comprises an Intel Core i7-12700F processor and an RTX 3070 GPU. To evaluate reconstruction quality, we use the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) [52], computed on the luminance components. In the results tables, the best-performing method is indicated in bold, and the second best is underlined.
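Note that 0.9 and 0.999 are Adam's moment-estimate decay rates (β1, β2), not a classical momentum and weight decay. A configuration sketch follows; the learning rate shown is purely an assumption for illustration, not the paper's value:

```python
import torch

# Hypothetical parameter list standing in for MMU-Net's parameters.
params = [torch.nn.Parameter(torch.randn(3, 3))]
# betas=(0.9, 0.999) are the first-/second-moment decay rates; lr=1e-4 is assumed.
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999))
```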
5.2. Evaluation Metrics
5.2.1. Peak Signal to Noise Ratio (PSNR)
PSNR is a widely used metric for evaluating image quality at the pixel level. It measures the quality of a reconstructed image in decibels (dB), with higher values indicating superior image quality. For images $x$ and $y$, both of size $m \times n$, the PSNR is computed as shown in Equation (13):

$$\mathrm{PSNR}(x, y) = 10 \log_{10} \frac{\mathrm{MAX}_x^2}{\mathrm{MSE}(x, y)} \tag{13}$$

Here, $\mathrm{MAX}_x$ is the maximum possible pixel value of image $x$, and $\mathrm{MSE}(x, y) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - y_{ij})^2$ denotes the mean square error between images $x$ and $y$.
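A direct implementation of Equation (13) (our sketch, written for 8-bit grayscale images):

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    """PSNR in dB following Equation (13); x and y are same-size arrays."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```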
5.2.2. Structural Similarity Index Measure (SSIM)
SSIM is a metric that assesses image quality by quantifying the structural similarity between two images, providing insights into brightness, contrast, and structure. SSIM values range from 0 to 1, where larger values indicate greater similarity between images. The SSIM between images $x$ and $y$ is calculated according to Equation (14):

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{14}$$

Here, $\mu_x$ and $\mu_y$ represent the mean values of images $x$ and $y$, $\sigma_x^2$ and $\sigma_y^2$ their variances, and $\sigma_{xy}$ the covariance between $x$ and $y$; $c_1$ and $c_2$ are constant terms that stabilize the division.
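The sketch below evaluates Equation (14) over a single global window; note that the reference SSIM implementation [52] averages the same expression over local Gaussian-weighted windows, so this simplified version is for illustration only:

```python
import numpy as np

def ssim_global(x, y, max_val=255.0):
    """Single-window SSIM following Equation (14), with the standard
    stabilizing constants c1 = (0.01*L)^2 and c2 = (0.03*L)^2."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```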
5.3. Baselines
To gauge the effectiveness of MMU-Net, we conducted comparative evaluations by contrasting it with five well-established baseline methods. In this section, we provide an overview of these baseline techniques and their specific characteristics:
AdapReconNet [18]: AdapReconNet adopts a matrix sampling approach for block-based image sampling. It uses a fully connected layer for initial image reconstruction and a variant of ReconNet for deep reconstruction. Notably, the sampling matrix remains unaltered during training, while the initial reconstruction subnetwork and deep reconstruction subnetwork are jointly trained.
CSNet+ [45]: CSNet+ employs a convolutional neural network to perform block-based uniform sampling and block-based initial image reconstruction, and it integrates a deep reconstruction subnetwork. During training, the sampling subnetwork, initial reconstruction subnetwork, and deep reconstruction subnetwork are trained jointly.
ISTA-Net+ [28]: ISTA-Net+ utilizes a fixed random Gaussian matrix for block-based image sampling and initial reconstruction. Deep image reconstruction is performed with an ISTA-based deep unfolding network. Like AdapReconNet, ISTA-Net+ keeps the sampling matrix constant throughout training and jointly trains the initial reconstruction and deep reconstruction subnetworks.
OPINE-Net+ [30]: OPINE-Net+ integrates a CNN for block-based uniform sampling and block-based initial image reconstruction, and it employs an ISTA-based deep unfolding network for the final image reconstruction. OPINE-Net+ extends the architecture of ISTA-Net+ by jointly training the sampling network, the initial reconstruction subnetwork, and the deep reconstruction subnetwork.
AMP-Net [29]: AMP-Net performs block-based image sampling and initial reconstruction with a sampling matrix that is initialized as a random Gaussian matrix. For the deep reconstruction phase, AMP-Net takes a denoising perspective, constructing a deep unfolding network based on the Approximate Message Passing algorithm. The sampling network, initial reconstruction subnetwork, and deep reconstruction subnetwork are trained jointly.
5.4. Comparison with State-of-the-Art Methods (RQ1)
5.4.1. Comparison in Natural Images
In this section, we compare MMU-Net with five state-of-the-art deep-learning-based CS reconstruction methods at four CS ratios (0.04, 0.1, 0.25, and 0.3) on the natural image datasets. The compared methods are AdapReconNet, CSNet+, ISTA-Net+, AMP-Net, and OPINE-Net+: AdapReconNet and CSNet+ are DNUNs, ISTA-Net+ and OPINE-Net+ are ISTA-based DUNs, and AMP-Net is an AMP-based DUN.
Table 2 presents the average PSNR and SSIM results of the six CS reconstruction methods (the five baselines and MMU-Net) on three datasets: Set11, BSDS68, and Urban100. The table shows that, across all four sampling rates, MMU-Net consistently outperforms the existing state-of-the-art CS reconstruction methods on Set11, BSDS68, and Urban100, confirming the efficacy of MMU-Net's network structure. Notably, the DUN-based CS reconstruction methods achieve significantly better average PSNR and SSIM than the DNUN-based methods, suggesting the superiority of the DUN framework for reconstruction performance.
Figure 5 displays the original lena256 and Parrots images from the Set11 dataset, along with the images reconstructed by the six CS reconstruction methods at a sampling rate of 0.1, with zoomed-in details of the reconstructed images. The visual comparison reveals that the images reconstructed by MMU-Net exhibit minimal block artifacts and superior visual quality. A closer examination of the magnified details of lena256 and Parrots underscores the richness of details and textures in MMU-Net's reconstructions. In summary, MMU-Net outperforms the five state-of-the-art CS reconstruction methods in average PSNR and SSIM while delivering superior visual quality.
5.4.2. Comparison in Remote Sensing Images
In this section, we assess the performance of MMU-Net on the UC Merced Land Use Dataset, a remote sensing image dataset. Based on our earlier findings favoring DUNs over DNUNs, we benchmark MMU-Net against three state-of-the-art DUNs: ISTA-Net+, AMP-Net, and OPINE-Net+. We evaluate the reconstruction quality at four different sampling rates (0.04, 0.1, 0.25, and 0.3), with results visualized in Figure 6 and presented in Table 3.
The table showcases the average PSNR and SSIM values of the reconstructed images for the four CS reconstruction methods across eight remote sensing images. The results in Table 3 indicate that the PSNR of MMU-Net's reconstructed images surpasses the second-best result by an average of 0.48 dB. Moreover, MMU-Net performs significantly better than the other three state-of-the-art CS reconstruction methods, underscoring the effectiveness of its network structure.
In Figure 6, we visually compare the reconstructed images and their corresponding originals at a sampling rate of 0.1 for various land-use classes. The lower-left corner of each image provides a magnified view of the area selected by the red box. As depicted in Figure 6, MMU-Net generates reconstructed images with clear contours and rich texture information. Importantly, it maintains the fidelity of small foreground targets even at lower sampling rates, ensuring that target positions and shapes remain undistorted. In summary, the proposed MMU-Net excels in average PSNR, SSIM, and visual quality, making it well suited for demanding tasks such as target recognition in remote sensing images.
5.5. Study of Computational Time
In the context of CS reconstruction, the model's reconstruction time and parameter count are crucial performance metrics: more complex network structures typically entail higher time complexity and more network parameters. In this section, two experiments are designed to validate the performance of MMU-Net. The first compares the average GPU running time and the number of network parameters of MMU-Net against five other CS reconstruction algorithms; comparison data are obtained by testing the same dataset in the same environment using the source code provided by the respective authors. The second measures the average GPU running time of MMU-Net on images of different sizes and examines how the running time grows as the image size increases.
Table 4 provides the average GPU running times required by the six CS reconstruction methods to reconstruct a 512 × 512 image at a sampling rate of 0.25. From the table, it is evident that the DNUN models, AdapReconNet and CSNet+, with relatively straightforward network architectures, exhibit shorter average running times than the DUN methods. In contrast, MMU-Net, the method proposed in this paper, incurs higher computation and storage costs than the other DUN methods, owing to its multi-scale network structure and higher network complexity. However, its running time remains within the same order of magnitude as the other methods, and, importantly, its reconstruction performance surpasses them.
Figure 7 and Table 5 give the average GPU running time of MMU-Net when reconstructing images of sizes 64 × 64, 128 × 128, 256 × 256, 512 × 512, and 1024 × 1024, respectively. From the right panel of Figure 7, it can be seen that the average GPU running time of MMU-Net scales nearly linearly with image size: even for large input images, the runtime does not surge.
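For reproducibility, GPU running times of this kind are usually measured along the following lines (our sketch; explicit synchronization is required because CUDA kernels execute asynchronously):

```python
import time
import torch

def avg_gpu_time(model, size, runs=20, device="cuda"):
    """Average wall-clock reconstruction time for a (1, 1, size, size) input."""
    model = model.to(device).eval()
    x = torch.randn(1, 1, size, size, device=device)
    with torch.no_grad():
        for _ in range(5):              # warm-up iterations (kernel compilation, caches)
            model(x)
        torch.cuda.synchronize()        # wait for queued GPU work before timing
        t0 = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()        # ensure all runs have finished
    return (time.time() - t0) / runs
```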
5.6. Ablation Studies and Discussions
In this section, we conduct ablation experiments to validate the effectiveness of the multi-channel strategy, Adap-SKConv, and the multi-scale strategy (MB).
5.6.1. Effectiveness of AMGDM (RQ2)
To assess the effectiveness of the multi-channel strategy and Adap-SKConv within the AMGDM module, we utilize four network modules, GDM-(a), GDM-(b), GDM-(c), and GDM-(d), which replace the gradient descent modules at the locations shown in Figure 1. These modules allow us to compare network performance in different scenarios.
GDM-(a) represents a single-channel module without an attention mechanism, similar to the GDM used in most ISTA-based DUNs. GDM-(b) is a multi-channel module without an attention mechanism. GDM-(c) is a multi-channel module with the CBAM (Convolutional Block Attention Module) attention mechanism in place of the Adap-SKConv proposed in this paper. GDM-(d) is a multi-channel module with Adap-SKConv, i.e., the AMGDM proposed in this paper. The network structure of each module is illustrated in Figure 8.
GDM-(b), GDM-(c), and GDM-(d) all adopt multi-channel structures, thereby eliminating the need for subsequent PMMs to perform single-channel and multi-channel transformations, which reduces information loss. GDM-(c) and GDM-(d) utilize different attention mechanisms.
Table 6 presents the average PSNR of these four modules on Set11 and the UC Merced Land Use Dataset at three different sampling rates.
From Table 6, we observe that the PSNR of the images reconstructed by GDM-(b) is, on average, 0.19 dB higher than that of GDM-(a) across the three sampling rates. This demonstrates that the multi-channel strategy proposed in this paper enhances the feature maps' characterization capability by mitigating the information loss caused by dimensionality reduction, ultimately improving network performance. Additionally, comparing GDM-(b) and GDM-(d) shows that the proposed Adap-SKConv contributes an average gain of 0.17 dB, confirming that its well-designed attention mechanism effectively enhances the information exchange between gradient terms and thereby improves reconstruction quality. Lastly, comparing GDM-(c) and GDM-(d), which pit the state-of-the-art CBAM attention mechanism against the proposed Adap-SKConv, we find that Adap-SKConv's two-input structure outperforms CBAM's single-input structure in facilitating information exchange between the gradient terms, enhancing feature map characterization and, consequently, the network's reconstruction results.
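For concreteness, the sketch below shows a two-input SK-style fusion in the spirit of Adap-SKConv: the two gradient-term feature maps are summed, squeezed into a global channel descriptor, and re-weighted with a softmax over the two branches. All layer sizes are our illustrative assumptions rather than MMU-Net's exact design.

```python
import torch
import torch.nn as nn

class TwoInputSKFusion(nn.Module):
    """Sketch of an SK-style two-input fusion: two gradient-term feature
    maps are combined with learned channel-wise branch attention."""
    def __init__(self, ch=32, r=4):
        super().__init__()
        self.squeeze = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                     # global channel descriptor
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
        )
        self.attn = nn.Conv2d(ch // r, 2 * ch, 1)        # logits for the two branches

    def forward(self, u1, u2):
        s = self.squeeze(u1 + u2)                        # fuse, then squeeze
        a = self.attn(s).view(-1, 2, u1.shape[1], 1, 1)  # (N, 2, C, 1, 1)
        w = torch.softmax(a, dim=1)                      # per-channel branch weights
        return w[:, 0] * u1 + w[:, 1] * u2               # attention-weighted fusion

u1, u2 = torch.randn(1, 32, 33, 33), torch.randn(1, 32, 33, 33)
print(TwoInputSKFusion()(u1, u2).shape)  # torch.Size([1, 32, 33, 33])
```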
5.6.2. Effectiveness of MB (RQ3)
In this section, we conduct ablation experiments on the Multi-scale Block to assess the effectiveness of the multi-scale strategy; the experimental results are included in Table 7.
We design and examine a single-scale module, Block-(1), and multi-scale modules Block-(2), Block-(3), and Block-(4), which comprise two, three, and four branches, respectively. Each of these modules is integrated into the network structure illustrated in Figure 1, replacing the corresponding sections. Among these modules, Block-(4) is the MB designed in this paper. The structures of the four Blocks are visualized in Figure 9.
As shown in Table 7, the average PSNR of the reconstructed images increases with the number of branches. This observation confirms that the multi-scale strategy enhances network performance by increasing the network's representation capability. However, as the number of branches increases, network complexity also rises, leading to longer training and reconstruction times. To strike a balance between performance and complexity, this paper selects Block-(4), with four branches, as the structure of the proposed MB.
6. Conclusions
In this paper, we introduced a novel approach to Compressed Sensing image reconstruction. Our proposed MMU-Net leverages innovative strategies to enhance feature map characterization and gradient term representation, ultimately improving reconstruction performance. Specifically, MMU-Net incorporates a multi-channel strategy that bolsters the network's ability to characterize feature maps effectively. In addition, the introduction of Adap-SKConv as the attention mechanism in the Gradient Descent Modules facilitates the exchange of information between gradient terms, improving representation capability. Furthermore, we introduced the Multi-scale Block, which enhances network characterization through a multi-scale structure capable of extracting features at different scales. Our extensive experimental results demonstrate the superior performance of MMU-Net compared with state-of-the-art reconstruction algorithms, achieving a harmonious balance between algorithmic complexity and reconstruction quality for CS of natural and remote sensing images. The MMU-Net framework proposed in this paper not only offers an effective solution for CS reconstruction in these domains but also opens up possibilities for a broad spectrum of applications in image processing and computer vision.
However, the proposed MMU-Net also has some limitations. First, because the network is built with multi-channel and multi-scale strategies, the model has a large number of parameters and requires further compression. Second, the proposed method adopts a block-based sampling strategy to improve sampling efficiency, which precludes global pixel interaction and limits overall performance; the feasibility of whole-image sampling needs further study. Future research can focus on further enhancing the performance of MMU-Net and exploring its applicability in diverse fields, promising continued advances in image reconstruction techniques and their broader utility.