Article

REMA: A Rich Elastic Mixed Attention Module for Single Image Super-Resolution

1 School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
2 School of Information, Shanghai Jian Qiao University, Shanghai 201306, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(13), 4145; https://doi.org/10.3390/s24134145
Submission received: 3 June 2024 / Revised: 19 June 2024 / Accepted: 24 June 2024 / Published: 26 June 2024
(This article belongs to the Special Issue Deep Learning-Based Image and Signal Sensing and Processing)

Abstract

Detail preservation is a major challenge for single image super-resolution (SISR). Many deep learning-based SISR methods focus on lightweight network design, but these may fall short in real-world scenarios where performance is prioritized over network size. To address these problems, we propose a novel plug-and-play attention module, rich elastic mixed attention (REMA), for SISR. REMA comprises the rich spatial attention module (RSAM) and the rich channel attention module (RCAM), both built on Rich Structure. Based on the results of our research on the module’s structure, size, performance, and compatibility, Rich Structure is proposed to enhance REMA’s adaptability to varying input complexities and task requirements. RSAM learns the mutual dependencies of multiple LR-HR pairs and multi-scale features, while RCAM accentuates key features through interactive learning, effectively addressing detail loss. Extensive experiments demonstrate that REMA significantly improves performance and compatibility in SR networks compared to other attention modules. The REMA-based SR network (REMA-SRNet) outperforms comparative algorithms in both visual effects and objective evaluation quality. Additionally, we find that module compatibility correlates with cardinality and in-branch feature bandwidth, and that networks with high effective parameter counts exhibit enhanced robustness across various datasets and scale factors in SISR.

1. Introduction

Single image super-resolution (SISR) aims to rebuild a high-resolution (HR) image from its low-resolution (LR) counterpart. It is widely used in digital multimedia, facial recognition, remote sensing image restoration, medical image processing, and other domains [1], and many SISR algorithms have been proposed, including interpolation-, reconstruction-, algebraic-characteristic-, and learning-based methods [2,3]. In recent years, there have been remarkable advancements in deep learning-based SISR algorithms. However, one of the major challenges of deep learning-based algorithms is high-frequency detail preservation. Numerous studies have proposed diverse algorithms to address this challenge, including residual learning [4,5], recursive structures [6,7,8], dense connections [9,10,11], and multi-path learning [12,13]. More recently, attention-based algorithms have gained prominence, especially since the rise of Transformer-based architectures, and many studies have proposed attention-based SISR methods [14,15,16,17] to restore details. Most of these studies design a task-specific SISR network built around attention rather than a plug-and-play attention module, which limits the flexibility of the resulting methods. Only a few researchers have proposed flexible attention modules for SR tasks [18,19,20], apart from directly plugging classic attention modules into SR networks [21,22].
In fact, many researchers have focused solely on size-oriented attention modules that enhance performance without increasing, or while even reducing, model complexity. In real-world scenarios, however, a significant number of tasks prioritize performance over size rather than favoring low-complexity modules with limited performance gains. Therefore, to address the requirements of various tasks effectively, a flexible module should encompass both size-oriented and performance-oriented characteristics, aspects that are rarely discussed. Moreover, according to our experimental results, some size-oriented modules function effectively within one network, yet their compatibility with other networks is not guaranteed. This raises a more general question in deep learning: why does a plug-and-play module work in one network but not in another, and what factors influence the performance of a plug-and-play module? These issues are not well studied.
To address the challenges mentioned above, we identify the influential factors affecting the performance of a plug-and-play module and propose the rich elastic mixed attention (REMA) module, a plug-and-play attention module for SISR. To ensure the module’s flexibility, we propose Rich Structure, which serves as the basic structure of REMA and allows seamless switching between size-oriented and performance-oriented modes to accommodate various requirements and ensure compatibility with different networks.
From the attention module’s perspective, it is essential to identify the key features affecting SR quality. Thus, we divide SISR into two steps: (1) upsampling the LR image to the target size; and (2) minimizing the difference between the resized image and the ground-truth image, succinctly referred to as ‘upscaling’ and ‘denoising’. An effective attention module should highlight key features throughout this process. Building on the structure of, and inspiration from, CBAM [23], REMA enhances key feature representation in these steps from the spatial and channel aspects by enriching the in-module feature pass-through. Using the proposed Rich Structure, REMA can seamlessly switch between size-oriented and performance-oriented modes, ensuring flexibility for different requirements by controlling the bandwidth of the in-module feature pass-through.
To evaluate the effectiveness of REMA, we integrate it into our proposed modified EDSR [4] and name the resulting model REMA-SRNet. Extensive experiments are conducted on commonly used SR benchmarks. We compare REMA with other comparative algorithms and plug-and-play attention modules. The results demonstrate the effectiveness of Rich Structure, REMA, and REMA-SRNet.
In summary, the main contributions of this paper are as follows:
  • We identify the key factors affecting the performance of a plug-and-play module and propose Rich Structure, enabling seamless switching between size-oriented and performance-oriented modes for a plug-and-play module to satisfy the diverse needs of different tasks.
  • We propose a SISR attention module based on Rich Structure, called REMA, consisting of RSAM and RCAM. RSAM employs a creative method to enhance performance by learning the LR-HR mapping mode and fusing multi-scale features. RCAM enhances overall performance through interactive learning, learning and reducing the noise caused by upsampling operations and the dimension and resolution changes introduced by convolution operations. REMA can be easily integrated into networks with various architectures and significantly improves detail reconstruction accuracy at different scale factors.
  • Extensive experiments demonstrate that REMA can carry a simple ResNet backbone SR network to the state of the art while balancing performance and model size. Moreover, the impact of the number of parameters on a module’s effectiveness and the overall networks’ robustness across different datasets and scale factors is comprehensively discussed in the experiments.
The remainder of this paper is organized as follows: Section 2 provides a brief overview of related work on deep learning-based SISR networks, attention modules, and attention-based SR models. In Section 3, we detail our proposed REMA, including problem analysis, overall structural design, and module architecture. Section 4 validates the effectiveness of our method, compares its performance with existing alternatives, and highlights its significant advantages. Finally, Section 5 summarizes the study and outlines directions for future work.

2. Related Works

2.1. Deep Learning-Based SR Methods

SRCNN is the first CNN-based end-to-end SISR network [24]. It interpolates the input image to the target size and employs three convolution layers to learn the non-linear LR-HR mapping. SRCNN preserves more details than traditional methods, leading to its widespread adoption. Subsequently, CNN-based SISR methods have gained popularity. Examples include ESPCN [25] and FSRCNN [26], which take LR images as inputs directly to reduce complexity and increase network speed. ESPCN uses sub-pixel convolutional layers as reconstruction layers, while FSRCNN employs deconvolution layers for HR reconstruction.
To enhance performance, many researchers have integrated techniques such as residual learning, dense connections, recursive structures, and multi-scale or multi-level fusion into their networks. For instance, Kim et al. proposed VDSR [5], which makes the network deeper through residual learning and gradient clipping to improve reconstruction quality. EDSR [4] employs more residual blocks without batch normalization layers to deepen the network and utilizes pixel shuffle to optimize reconstruction performance. Methods like DRRN [7] and DRCN [6] introduce recursive structures to share parameters among layers and deepen the network without significantly increasing the model size. Others, such as RCAN [27], implement a cascading mechanism on a residual network to reuse hierarchical features and balance the number of parameters and accuracy.
Additionally, MSRN [28] creates two sub-branches and uses convolutions of different sizes in a residual block, fusing features interactively to obtain multi-scale features. The multi-scale dense convolutional network (MDCN) [9] densely connects each layer in multi-scale residual blocks to fully utilize multi-scale features within the block. Moreover, ESRGCNN [29] adapts group convolutional residual blocks for multi-level feature fusion and computational cost reduction. UNetSR [30] directly realizes shallow–deep feature fusion via skip connections, akin to U-Net architecture.
According to these studies, dense connections, recursive learning, multi-scale or multi-level feature fusion, and other techniques share a common goal. They aim to efficiently create and learn features at different scales within the backbone structure, a critical aspect of improving CNN-based SISR algorithms.

2.2. Attention and Attention-Based SR Models

Attention is a method used to recalibrate the weights of input features in deep learning, aiding models in focusing on key features. In fact, attention-based modules find wide application in various computer vision tasks. The squeeze-and-excitation (SE) block [31] was introduced to adjust informative features within channels. Woo et al. [23] proposed the convolutional block attention module (CBAM), incorporating both channel and spatial attention to adjust feature weights. Coordinate attention (CA) [32] embeds positional information into channel attention, facilitating the capture of long-range dependencies while preserving precise positional information.
Attention-based methods are also prevalent in SISR tasks. RCAN [27] implements a residual-in-residual (RIR) structure with channel attention, enhancing performance by fusing high- and low-frequency features via skip connections. DRLN [10] combines densely connected layers with residual blocks and incorporates a Laplacian pyramid attention mechanism to enhance image quality. MCSN [33] employs a multi-scale feature fusion block (MSFFB) within its multi-scale channel and spatial attention module (CSAM) to facilitate multi-scale feature representation learning and enhance the feature selection ability of channel attention. PAN [18] employs a pixel attention module in the backbone and upscale layers, generating a 3D attention map at the pixel level to improve performance with fewer parameters. PRRN [34] incorporates a progressive representation recalibration block to extract meaningful features by utilizing pixel and channel information and employs a shallow channel attention mechanism for efficient channel importance learning. RNAN [35] proposes residual non-local attention to obtain non-local hybrid attention, further enhancing performance by adaptively adjusting the interdependence between feature channels. Dynamic attention, as used in the attention in attention network (A2N) [36], comprises non-attention branches and composite attention branches to dynamically suppress unnecessary attention adjustments. The non-local sparse attention network (NLSN) [20] reduces the computational cost of non-local attention via sparse attention. SwinIR [16] and Swin2SR [17] construct networks based on the Vision Transformer, achieving superior performance.
Few studies focus on plug-and-play attention modules for SISR tasks. Wang et al. [19] proposed the lightweight attention module BAM to suppress large-scale feature edge noise while retaining high-frequency features, which is the most relevant research to our topic. BAM includes the adaptive context attention module (ACAM) for noise reduction and the multi-scale spatial attention module (MSAM) for preserving high-frequency details.

3. Methodology

3.1. Motivation and the Overall Framework

In our proposed module, the objective is to cater to the requirements of both performance-prioritized and size-prioritized tasks. Therefore, the initial focus is on maximizing performance to meet the demands of performance-prioritized tasks; efforts are then directed toward controlling the module size to align with the needs of size-prioritized tasks. Consequently, parameter-friendly designs are deliberately set aside during the initial stage of the module design process. This concept permeates the entire module design and distinguishes our approach from others that opt for lightweight structures from the outset. However, this does not mean that module size is unimportant to us; rather, it is a question of parameter efficiency. A parameter-efficient module should not only trade fewer parameters for a limited performance improvement but also be able to boost performance with more parameters, achieving parameter efficiency globally. ‘Rich Structure’ is proposed for this purpose. Table 1 lists the nouns, abbreviations, and symbols used in the following text.

3.2. Module with Rich Structure

For a module, flexibility involves more than just being plug-and-play; it also involves robustness across different datasets and compatibility with networks of varying characteristics. Identifying the influential factors related to these aspects is crucial. Our experiments reveal that the key factors affecting plug-and-play module performance include the overall shape (cardinality, channel bandwidth, and depth) and task-specific effective algorithms. Hence, we propose REMA based on these considerations.
Current plug-and-play attention modules can be categorized into two types based on cardinality (the number of branches with feature transformation): single-branch modules like CBAM, SE, and PA, and multi-branch modules like CA and BAM. However, our experiments show that single-branch modules, which we define as having a plain structure, exhibit less performance improvement than most multi-branch modules when facing input features with higher complexity. Thus, our method is designed as a multi-branch structure to ensure compatibility.
Attention modules with multiple branches, such as Inception-like [37] and ResNeXt [38], or Res2Net-like blocks [39], may encounter challenges related to size-oriented designs, leading to reduced robustness and overall performance across various scale factors in the SISR task. These modules adopt a similar approach to parameter control. For instance, prevalent Inception-like modules split the input feature maps along the channel dimension, transform the features, and then concatenate them for fusion. Likewise, ResNeXt and Res2Net employ bottleneck or grouped convolution to split, transform, and aggregate or concatenate features in the final stage. They all follow a ’split–transform–aggregate or concatenate’ structure to balance performance and module size, utilizing the bottleneck structure to split input features. Additionally, single-branch attention modules utilize this structure to adjust their size. Figure 1 and Figure 2 illustrate how these methods split features or control module sizes using the dimension reduction ratio (r). In other words, the ‘bottleneck’ structure can become a performance bottleneck under certain conditions.
However, the issue does not lie solely with the bottleneck structure. In fact, the real concern that deserves more attention is why the focus remains solely on the reduction in dimensions, or in other words, why finding a ratio to minimize the model size while maintaining performance is the predominant research direction. What would occur if a similar bottleneck structure were employed but with increased dimensions, i.e., widening the bandwidth of channels for feature pass-through, rather than reducing it? Only a few studies have addressed this question, such as [40,41]; the authors approached the topic from the perspective of the entire backbone, comparing the widened residual and Inception block with a deeper backbone, demonstrating the effectiveness of widening the bandwidth of channels. Our experiments also prove this from the module perspective. In other words, switching between size-oriented and performance-oriented modules could be unified within the same framework.
Therefore, Rich Structure is proposed as a multi-branch structure with a bi-directionally adjustable channel bandwidth of features in each branch (Figure 2). Specifically, instead of using a ‘split–transform–concatenate/aggregate’ structure, our proposed method directly copies or rescales the inputs to different scales, transforms the features in each branch, and then aggregates them together. In other words, the structure is ‘copy/rescale–transform–aggregate’. The overall width of the features in our module is therefore much larger, and the module appears fatter, hence the name Rich Structure. On the other hand, dimension reduction (C/r, r ∈ [1, +∞)) is replaced with the proposed elastic adjuster (C × R, R ∈ (0, +∞)). When R ∈ (0, 1), the module functions akin to a ‘split–transform–concatenate/aggregate’ structure to fulfill the requirements of size-prioritized tasks. Conversely, when R ∈ [1, +∞), the module utilizes additional parameters to enhance performance. Thus, with the help of Rich Structure, REMA can seamlessly switch between size-oriented and performance-oriented modes to ensure flexibility for different requirements.
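To make the idea concrete, the following is a minimal PyTorch sketch of the ‘copy/rescale–transform–aggregate’ pattern with an elastic adjuster. The class and parameter names (RichBlock, ratio) are illustrative assumptions, not the authors’ implementation.

```python
# A minimal sketch of the 'copy/rescale–transform–aggregate' idea behind Rich Structure.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RichBlock(nn.Module):
    def __init__(self, channels: int, ratio: float = 1.0, scale: int = 2):
        super().__init__()
        self.scale = scale
        width = max(1, int(channels * ratio))  # elastic adjuster: C x R, R in (0, +inf)
        # Each branch receives a full copy (or rescaled copy) of the input,
        # rather than a split slice of its channels.
        self.branch_main = nn.Sequential(nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True))
        self.branch_up = nn.Sequential(nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True))
        self.branch_dn = nn.Sequential(nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(width, channels, 1)  # restore the original channel count

    def forward(self, x):
        h, w = x.shape[-2:]
        # copy/rescale: duplicate the input at different resolutions
        x_up = F.interpolate(x, scale_factor=self.scale, mode="bilinear", align_corners=False)
        x_dn = F.interpolate(x, scale_factor=1.0 / self.scale, mode="bilinear", align_corners=False)
        # transform each copy in its own branch, then bring them back to the input size
        y_main = self.branch_main(x)
        y_up = F.interpolate(self.branch_up(x_up), size=(h, w), mode="bilinear", align_corners=False)
        y_dn = F.interpolate(self.branch_dn(x_dn), size=(h, w), mode="bilinear", align_corners=False)
        # aggregate by element-wise addition and restore the channel dimension
        return self.fuse(y_main + y_up + y_dn)


# ratio < 1 behaves like a size-oriented 'split' width; ratio >= 1 widens the branches
block = RichBlock(channels=64, ratio=1.5, scale=2)
out = block(torch.randn(1, 64, 32, 32))   # -> torch.Size([1, 64, 32, 32])
```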

3.3. Rich Elastic Mixed Attention (REMA)

As mentioned above, the performance-related factors include the shape of the module and task-specific effective algorithms. For the former, we design Rich Structure to ensure compatibility with inputs of varying complexity and flexibility for different tasks. However, it is far less important than the latter. Thus, RSAM and RCAM are designed based on the characteristics of SISR, and Rich Structure amplifies their effectiveness. RSAM and RCAM function like miniature SR networks in REMA.
The goal of deep learning-based SISR tasks is to minimize the difference between the reconstructed image and the real HR image, which can be expressed by the following formula [42]:
\hat{\theta}_F = \arg\min_{\theta_F} L(I_{SR}, I_y) + \lambda \Phi(\theta)
where θ_F denotes the parameters of the SR model F, L denotes the loss between the reconstructed image I_SR and the ground-truth HR image I_y, and θ̂_F denotes the model parameters that minimize L. Φ(θ) is the regularization term, and λ serves as the trade-off parameter that adjusts the weight of the regularization term. In other words, the purpose of deep learning-based SISR models is to find the θ̂_F that makes I_SR as close to I_y as possible.
From the module perspective, the key is to identify features that deserve more attention during the process mentioned above. To simplify the problem, we decompose the HR reconstruction process into two steps: upscaling the LR image to the target size and eliminating the difference in details between the upscaled image and the real HR image. The process can be expressed in the following formula:
I_y = f(I_{LR}^{M_{up}}, D_{HR})
where I_LR refers to the low-resolution image and M_up is the LR-HR upscale mapping mode. D_HR denotes the difference between the upscaled LR image I_{LR}^{M_{up}} and I_y.
Obviously, the key to high-quality HR image reconstruction lies in the accurate estimation of M_up and D_HR. Therefore, inspired by CBAM, which enhances feature representation from both the spatial and channel aspects, we propose a rich spatial attention module (RSAM) and a rich channel attention module (RCAM) to improve SISR network performance. Unlike CBAM, we eschew a lightweight design and instead increase the cardinality, the in-branch channel dimensions, and the depth. Specifically for SISR tasks, CBAM and other lightweight attention modules lack sufficient room for feature maps of various resolutions to interact, which is crucial for estimating M_up and D_HR. Since learning the LR-HR mapping involves avoiding detail loss due to resolution changes, there should be at least one pair of feature maps with different resolutions. Therefore, a multi-branch structure is employed in both RSAM and RCAM to enrich the in-module feature pass-through and aid SISR networks in learning M_up and D_HR. A multi-branch structure also enables better multi-scale and multi-level feature fusion for enhancing long-range dependency learning [43], which has already been proven effective in other studies. Thus, we combine multi-scale fusion, LR-HR interactive learning, and attention mechanisms to propose REMA. To verify the effectiveness of REMA for SISR tasks, we apply REMA to a simple ResBlock-based SISR network named REMA-SRNet and compare it with other methods. RSAM and RCAM are applied in parallel in each ResBlock to enhance the backbone. Additionally, we fuse features from the LR image and integrate REMA into the reconstruction block to improve performance at high scale factors. The detailed structures of REMA and REMA-SRNet are illustrated in Figure 3.

3.4. Rich Spatial Attention Module (RSAM)

RSAM aims to enhance long-range feature extraction and non-linear LR-HR mapping mode M_up learning through dynamic multi-scale feature fusion with spatial attention. The main difference between RSAM and other widely used multi-scale feature fusion methods lies in the construction of the feature pyramid. As shown in Figure 4, in contrast to methods that utilize convolutions with different kernel sizes [44] or lightweight convolutions like dilated or factorized convolution [45] to learn and fuse features from the same feature maps, RSAM constructs the feature pyramid from rescaled input feature maps based on the scale factor.
Thus, regardless of how the scale factor changes, RSAM can learn M_up correctly. Within this module, RSAM generates three LR-HR pairs and scans them with receptive fields of the same size to fuse features. The advantage of our method is that such a design obtains multi-scale features while preserving the complete structural information of the LR-HR mapping, which is key to SISR tasks.
Specifically, RSAM dynamically upsamples and downsamples the input according to the scale factor. Two sub-branches are then created to accommodate each additional scale of the input. The rescaled feature maps in all branches are scanned by a 3 × 3 convolution to acquire multi-scale features, and RSAM generates a total of three sets of LR-HR mapping information. Assuming the scale factor is 2×, the generated mapping pairs are 2×, 2×, and 4×, as depicted in Figure 4. Finally, attention maps for the three scales are generated along the spatial dimensions, and the features from each branch are adjusted and enhanced for the LR-HR mapping mode. Further details are provided in Figure 5. The entire process is formulated as follows:
F_{sf}^{up} = \mu(F)
F_{sf}^{dn} = \eta(F)
M_{sf}^{main} = \sigma(c[\mathrm{AdpMixedPool}(cr(F, R))])
M_{sf}^{up} = \sigma(c[\mathrm{AdpMixedPool}_{\eta}(cr(F_{sf}^{up}, R))])
M_{sf}^{dn} = \sigma(c[\mathrm{AdpMixedPool}_{\mu}(cr(F_{sf}^{dn}, R))])
F_{sf}^{RSA} = c([M_{sf}^{main} \otimes F \oplus M_{sf}^{up} \otimes \eta(F_{sf}^{up}) \oplus M_{sf}^{dn} \otimes \mu(F_{sf}^{dn})], R)
The input feature map is F ∈ R^{C×H×W}. RSAM first resizes the input according to the scale factor: the upsampled and downsampled versions of F are F_{sf}^{up} ∈ R^{C×(H×sf)×(W×sf)} and F_{sf}^{dn} ∈ R^{C×(H/sf)×(W/sf)}, obtained through upsampling μ and downsampling η via bilinear interpolation, where sf denotes the target scale factor of the sampling operation (e.g., 2×, 4×, or 8× in our experiments). After a 3 × 3 convolution layer and ReLU activation cr(·), each pathway employs adaptive average- and max-pooling operations with scale adjustment, followed by concatenation along the channel axis with scale recovery (AdpMixedPool(·), AdpMixedPool_η(·), and AdpMixedPool_μ(·)). Subsequently, a 1 × 1 convolution layer c(·) and a Sigmoid operation σ generate the 2D spatial attention maps M_{sf}^{main} ∈ R^{H×W}, M_{sf}^{up} ∈ R^{H×W}, and M_{sf}^{dn} ∈ R^{H×W} for each pathway. Element-wise multiplication ⊗ is applied, the outputs of the branches are fused via element-wise addition ⊕, and the dimension is recovered by a 1 × 1 convolution c(·), resulting in the refined output F_{sf}^{RSA} of the input F. The adjustment of the channel bandwidth is performed by the first and last convolution layers: R denotes the ratio of the elastic adjuster, the output channel count of the first convolution is C × R, and the last convolution layer restores the channel count to that of the input.
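The following is a minimal PyTorch sketch of RSAM as formulated above. It is one plausible reading of the equations rather than the authors’ code: the class and attribute names are hypothetical, the mixed pooling is interpreted as channel-wise average- and max-pooling concatenated and rescaled to the input size, and the attention maps are applied to the rescaled copies of the input before aggregation and a final 1 × 1 fusion convolution.

```python
# Sketch of RSAM (hypothetical names; one plausible reading of the formulas above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RSAM(nn.Module):
    def __init__(self, channels: int, ratio: float = 1.0, scale_factor: int = 2):
        super().__init__()
        self.sf = scale_factor
        width = max(1, int(channels * ratio))          # elastic adjuster C x R
        def cr():                                      # 3x3 Conv + ReLU ("cr" in the text)
            return nn.Sequential(nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True))
        self.cr_main, self.cr_up, self.cr_dn = cr(), cr(), cr()
        # 1x1 conv turns the concatenated avg/max maps into a single-channel attention map
        self.att_main = nn.Conv2d(2, 1, 1)
        self.att_up = nn.Conv2d(2, 1, 1)
        self.att_dn = nn.Conv2d(2, 1, 1)
        self.fuse = nn.Conv2d(channels, channels, 1)   # final 1x1 conv

    @staticmethod
    def mixed_pool(x, size):
        # channel-wise average- and max-pooling, concatenated, rescaled to `size`
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True).values], dim=1)
        return F.interpolate(pooled, size=size, mode="bilinear", align_corners=False)

    def forward(self, x):
        h, w = x.shape[-2:]
        x_up = F.interpolate(x, scale_factor=self.sf, mode="bilinear", align_corners=False)
        x_dn = F.interpolate(x, scale_factor=1.0 / self.sf, mode="bilinear", align_corners=False)
        # spatial attention maps for the three LR-HR pairs, all recovered to (H, W)
        m_main = torch.sigmoid(self.att_main(self.mixed_pool(self.cr_main(x), (h, w))))
        m_up = torch.sigmoid(self.att_up(self.mixed_pool(self.cr_up(x_up), (h, w))))
        m_dn = torch.sigmoid(self.att_dn(self.mixed_pool(self.cr_dn(x_dn), (h, w))))
        # re-weight the rescaled copies of the input and aggregate them
        y = (m_main * x
             + m_up * F.interpolate(x_up, size=(h, w), mode="bilinear", align_corners=False)
             + m_dn * F.interpolate(x_dn, size=(h, w), mode="bilinear", align_corners=False))
        return self.fuse(y)
```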

3.5. Rich Channel Attention Module (RCAM)

After the upscaling mapping has been learned, RCAM operates in the denoising stage: it focuses on the pixels that differ from the ground-truth image after rescaling to the same size, aiming to capture such features effectively in order to minimize D_HR and to highlight them along the channel dimension as the number of channels changes. Similar to RSAM, RCAM creates a sub-branch that downscales the input. In this sub-branch, the number of channels is also adjusted alongside the scale, since differences may arise from both rescaling and convolution operations. The sub-branch establishes a middle level between layers for learning multi-level features interactively. Moreover, from a super-resolution perspective, it offers an intermediate layer for progressive sampling, enhancing reconstruction quality at high scale factors, a capability not offered by other channel-related attention modules (e.g., CAM and SE) (Figure 6). Further details are provided in Figure 7.
The entire RCAM process is formulated as follows:
F_{Main} = F
F_{Sub} = \eta(F)
f_{cr3}^{Main} = cr(cr(cr(F_{Main}, R)))
f_{cr3}^{Sub} = cr(cr(cr(F_{Sub})))
f_D = c([f_{cr3}^{Main} \ominus \mu(f_{cr3}^{Sub})], R)
M_{RCA} = \sigma(FC(\mathrm{ReLU}(FC(\mathrm{AvgPool}_{1\times1}(f_D)))))
F_{RCA} = F \otimes M_{RCA}
In our experiments, RCAM resizes the input feature F ∈ R^{C×H×W} and utilizes a Convolution-ReLU (CR) layer to create F_Sub ∈ R^{(C/2)×(H/2)×(W/2)} for the sub-branch. Subsequently, a CR layer cr(·) is employed for feature extraction, adjusting the scale and channel number to match the feature maps of the main pathway. Simultaneously, F_Main is filtered by three CR layers to retain features at the original resolution. The intermediate feature maps of the two pathways are denoted as f_{cr3}^{Main} and f_{cr3}^{Sub}, respectively. Following this, the features that exhibit significant differences (or noise), f_D, when the resolution changes are obtained via an element-wise subtraction operation ⊖. Subsequently, the spatial dimensions are compressed to 1 × 1 using adaptive average pooling AvgPool_{1×1}(·), followed by FC-ReLU-FC layers and the Sigmoid function σ to generate the attention map M_RCA of f_D, resulting in F_RCA as the adjusted output of the input F. As in RSAM, the adjustment of the channel bandwidth is performed by the first and last convolution layers: R denotes the ratio of the elastic adjuster, the output channel count of the first convolution is C × R, and the last convolution layer restores the channel count to that of the input.
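Below is a minimal PyTorch sketch of RCAM under the same caveats: the names are hypothetical, the sub-branch is interpreted as a half-resolution, half-channel copy refined by CR layers, and the channel attention is produced from the pooled main-sub difference by FC-ReLU-FC layers, following the formulas above.

```python
# Sketch of RCAM (hypothetical names; one plausible reading of the formulas above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RCAM(nn.Module):
    def __init__(self, channels: int, ratio: float = 1.0):
        super().__init__()
        width = max(1, int(channels * ratio))          # elastic adjuster C x R
        sub = max(1, channels // 2)
        def cr(cin, cout):                             # Conv + ReLU (CR layer)
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))
        # main pathway: three CR layers at the original resolution
        self.main = nn.Sequential(cr(channels, width), cr(width, width), cr(width, width))
        # sub pathway: downscaled, channel-reduced copy, then CR layers restoring the width
        self.sub_in = cr(channels, sub)
        self.sub = nn.Sequential(cr(sub, sub), cr(sub, sub), cr(sub, width))
        self.diff = nn.Conv2d(width, channels, 1)      # 1x1 conv restoring the channel count
        self.fc = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(inplace=True),
                                nn.Linear(channels, channels))

    def forward(self, x):
        h, w = x.shape[-2:]
        f_main = self.main(x)
        x_sub = self.sub_in(F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False))
        f_sub = F.interpolate(self.sub(x_sub), size=(h, w), mode="bilinear", align_corners=False)
        # features that differ between the two resolutions ("noise" caused by rescaling)
        f_d = self.diff(f_main - f_sub)
        # squeeze spatially, learn channel weights, and re-weight the original input
        m = torch.sigmoid(self.fc(F.adaptive_avg_pool2d(f_d, 1).flatten(1)))
        return x * m.view(x.size(0), -1, 1, 1)
```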

3.6. REMA-Based Backbone

As discussed, efficiently extracting features at various scales in the backbone is crucial for CNN-based SISR algorithms. In REMA-SRNet, REMA is integrated into the residual block (REMA ResBlock) in the backbone. As depicted in Figure 8, during the feature extraction process, input features in each residual block layer pass through and iteratively generate LR-HR image pairs within the layer. Compared with connection-based algorithms that achieve features from various scales through connection transfer, the REMA-based backbone provides richer features at diverse resolutions.
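As an illustration of this integration, the sketch below wires the RSAM and RCAM sketches from the previous sections into a residual block in parallel; the exact wiring and layer counts inside the block are assumptions, not the authors’ implementation.

```python
# Sketch of a REMA ResBlock: RSAM and RCAM (from the sketches above) applied in parallel.
import torch.nn as nn


class REMAResBlock(nn.Module):
    def __init__(self, channels: int, ratio: float = 1.0, scale_factor: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.rsam = RSAM(channels, ratio, scale_factor)   # spatial attention branch
        self.rcam = RCAM(channels, ratio)                 # channel attention branch

    def forward(self, x):
        f = self.body(x)
        # parallel attention: both modules refine the same features, then are fused residually
        return x + self.rsam(f) + self.rcam(f)
```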

4. Experiments

4.1. Implementation Details and Datasets

To assess the effectiveness of Rich Structure, REMA, and REMA-SRNet, we employ images from [46] for training and validation, following DIV2K’s default split. Evaluation metrics include the peak signal-to-noise ratio (PSNR, dB) and structural similarity (SSIM), computed in the RGB space, where higher values indicate superior reconstruction. The best models are selected based on the highest PSNR + SSIM on the DIV2K validation set and evaluated on five commonly used datasets (BSDS100 [47], Set14 [48], Set5 [49], Manga109 [50], and Urban100 [51]), plus an additional three datasets (Historical [52], PIRM [53], and General100 [26]) for a comprehensive study, under upscaling factors of 2×, 4×, and 8×. HR images are center-cropped to 256 × 256 patches and downscaled via bicubic interpolation to generate LR image pairs for training and testing, without any data augmentation. Optimization employs Adam with an initial learning rate of 0.0001, halved every 50 epochs, β1 = 0.9, β2 = 0.999, and ε = 10^{-6}. The batch size is set to 1, and training lasts 300 epochs, using PyTorch 2.0.0 on a desktop with an Intel i5-8600 CPU, 64 GB RAM, and an NVIDIA RTX 3090 GPU. The training loss function is the L1 loss:
L_1(P) = \frac{1}{N} \sum_{p \in P} |x(p) - y(p)|
where P represents the calculated area, p denotes a pixel position within P, and N is the number of pixels in P. x(p) and y(p) are the pixel values at position p in the prediction and the ground-truth (GT) map, respectively.
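A short sketch of this training configuration is given below (Adam with the stated hyperparameters, a step schedule halving the learning rate every 50 epochs, and L1 loss); model and train_loader are placeholders rather than parts of the released code.

```python
# Sketch of the training setup described above; `model` and `train_loader` are placeholders.
import torch
import torch.nn as nn

# model: an SR network such as REMA-SRNet; train_loader: yields (LR, HR) tensor pairs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-6)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)  # halve lr every 50 epochs
criterion = nn.L1Loss()   # mean absolute error over all pixels

for epoch in range(300):
    for lr_img, hr_img in train_loader:        # bicubic-downscaled LR / 256x256 HR pairs
        optimizer.zero_grad()
        sr_img = model(lr_img)
        loss = criterion(sr_img, hr_img)
        loss.backward()
        optimizer.step()
    scheduler.step()
```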

4.2. Evaluation Metrics

We evaluate SR images using two widely used metrics: the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). PSNR serves as an objective metric to assess image quality and measure the degree of difference between an original image and a compressed or distorted version. The PSNR calculation relies on mean square error (MSE), quantifying the squared differences between corresponding pixels in the original and reconstructed images. The formula for PSNR is as follows:
M_{MSE} = \frac{1}{WH} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} [X(i,j) - Y(i,j)]^2

PSNR = 10 \log_{10} \frac{X_{MAX}^2}{M_{MSE}}
where W and H are the width and height of the image, (i, j) denotes a pixel position, and X and Y denote the super-resolved image and the ground-truth image, respectively. X_MAX is the maximum pixel value, and M_MSE is the mean square error. Higher PSNR values indicate lower distortion and better image quality, typically ranging from 20 to 50 dB; values exceeding 30 dB are generally considered indicative of good image quality. Recognizing that PSNR is a limited indicator that fails to capture human subjective perception of images, we also utilize SSIM as an evaluation index. SSIM accounts for contrast, brightness, and structural similarity. The SSIM value at pixel position p is calculated as follows:
SSIM(p) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \cdot \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
Here, μ_x, μ_y, σ_x, σ_y, and σ_xy denote the means, standard deviations, and covariance of the pixels at position p in the prediction map and the ground-truth map. The constants C_1 and C_2 are included to prevent division by zero. The SSIM value falls within the range (0, 1), with values closer to 1 indicating a superior HR reconstruction effect.
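For reference, the two metrics can be computed as in the sketch below, assuming a recent scikit-image version and uint8 RGB images; the function name is illustrative.

```python
# Minimal evaluation sketch for PSNR and SSIM on RGB uint8 images (scikit-image >= 0.19).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def evaluate_pair(sr: np.ndarray, hr: np.ndarray):
    """sr, hr: HxWx3 uint8 images; returns (PSNR in dB, SSIM)."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, data_range=255, channel_axis=-1)
    return psnr, ssim
```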

4.3. Ablation Studies

In this section, ablation studies are conducted to verify the effectiveness of each part of REMA. The experiments span networks with various settings, scale factors, and integration positions, as well as comparisons with other attention modules in REMA-SRNet and other SISR networks. Meanwhile, the effectiveness of Rich Structure is verified by comparing it with REMA using Inception- and ResNeXt-like structures. Furthermore, the impact of parameter count on performance and robustness is discussed based on the experimental results.
The baseline model in our experiment is the proposed modified EDSR (replacing REMA ResBlock with residual blocks in REMA-SRNet). Specifically, the pixel shuffle layer is replaced by bilinear upsampling followed by 3 × 3 convolutions and Leaky ReLU layers, connected with the bilinear-upscaled LR input. To validate the proposed methods, we employ two sets of network configurations: default (64-16-64) as REMA-SRNet and alternative (40-16-40) as REMA-SRNet-M. The adjuster ratio R is set to 1 by default. For 2×, 4×, and 8× reconstruction, the number of reconstruction layers is 1, 2, and 3, respectively.
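A sketch of this modified reconstruction step is shown below; the block name and the Leaky ReLU slope of 0.2 are assumptions, not values stated by the authors.

```python
# Sketch of the bilinear-upsample + 3x3 conv + Leaky ReLU reconstruction step
# that replaces pixel shuffle in the baseline and REMA-SRNet.
import torch.nn as nn


class UpsampleBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.LeakyReLU(0.2, inplace=True))  # slope 0.2 is an assumption

    def forward(self, x):
        return self.conv(self.up(x))

# 2x, 4x, and 8x reconstruction use 1, 2, and 3 of these blocks, respectively.
```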
In all experiments, robustness is evaluated across all eight datasets. The Historical dataset specifically assesses the module’s capability in handling out-of-distribution (OOD) samples. Additionally, two sets of network configurations mentioned above (#C_In is 40 or 64) are employed to test module compatibility with varying complexities of input features.

4.3.1. Study of REMA in the Backbone

Figure 3 illustrates the utilization of REMA within a residual block of the backbone networks. To analyze the effectiveness of REMA, five models were constructed in addition to the baseline: RSAM, RCAM, RSAM-RCAM, RCAM-RSAM, and REMA, representing the backbone with RSAM alone, RCAM alone, the two modules combined in either order, and both employed together in parallel, respectively. The results of all the above ablation experiments are shown in Table 2.
The results indicate that employing only RSAM in the backbone enhances PSNR and SSIM across most datasets, except for the Set14 and Historical datasets when the input tensor has 40 channels. However, RCAM also underperforms in the Historical dataset, attributed to significant differences between the Historical dataset images and the distribution of training datasets. Configuring them in parallel (REMA) boosts performance across most datasets. Moreover, with a 64-channel input tensor, all models show significant performance improvements. Notably, using RSAM and RCAM separately substantially mitigates the performance reduction issue in the Historical dataset. Consequently, the backbone of REMA demonstrates performance improvements in the Historical dataset. Overall, these results affirm the effectiveness of our method.

4.3.2. Study of Rich Structure

Rich Structure is examined in this and the subsequent experiments, along with REMA. To verify the effectiveness of Rich Structure and REMA, we first compare REMA with other attention modules. This allows us to identify the key factors influencing a plug-and-play module and to demonstrate the superiority of Rich Structure and REMA. Additionally, we design ResNeXt and Inception versions of REMA to highlight the advantages of Rich Structure in terms of compatibility and flexibility compared with other popular module structures.

4.3.3. Comparison with Other Attention Modules

We compared the performance of REMA with other attention modules, including CBAM, SE, CA, and BAM, which were employed in the same way as REMA. Our experiment includes results for 40- and 64-channel input. To ensure a fair comparison, we set the dimension reduction to 1 (C/r, r = 1), meaning no channel compression is applied. The results are presented in Table 3.
The results indicate that for the 40-channel input, there is no significant difference in performance between REMA and the other attention modules, except CBAM. However, for the 64-channel input, REMA outperforms the other attention modules. Furthermore, comparing the overall improvement when the number of channels changes from 40 to 64, BAM and REMA show much higher gains than the other attention modules in the experiment, which is discussed in the next section. For the Historical dataset, except for REMA, all the other attention modules reduce performance after being integrated into the residual block.

4.3.4. Study of Plain, Multi-Branch, and Rich Structure

To elucidate the performance increment difference with increasing input complexity, we analyze these attention modules from a global structural perspective. According to Table 3, modules with a multi-branch structure exhibit a greater performance increase with the rise in input complexity compared to plain structures, except for CA. The primary distinction among these modules lies in their cardinality: 1 for plain modules (SE and CBAM), and 2, 2, and 5 for CA, BAM, and REMA, respectively. Based on the results, cardinality is positively correlated with the overall performance of modules for a 64-channel input. Thus, cardinality is an influential factor relating to module compatibility, and higher cardinality will enhance the module’s performance with the growth in input complexity.
However, cardinality is not the sole factor influencing performance. When comparing the results of CA and BAM, both with a cardinality of two, there exists a performance gap for the 64-channel input. The main difference lies in the in-branch bandwidth. In fact, CA also employs a split–transform–aggregate structure similar to Inception-like blocks. The distinction is that CA splits the features ( C × H × W ) along H and W rather than C, as shown in Figure 1b, while BAM and REMA directly map the complete input to branches. This implies that the in-branch features are less informative in CA compared to BAM and REMA.
Comparing BAM and REMA, both modules generate spatial and channel attention. The difference lies in our proposed algorithm, which not only enhances SR-related feature representation but also generates richer multi-scale and multi-level features than BAM. This is because BAM is a size-oriented module that balances performance and module size, leaving limited room and more constraints for algorithm design. Our proposed Rich Structure is designed to overcome this limitation; we delve into this topic in the following section. Therefore, in-branch feature richness and task-related algorithms are further influential factors. The richness is defined by the channel bandwidth of the in-branch features and the diversity of features (multi-scale and multi-level features).

4.3.5. Study of the Elastic Adjuster

For further investigation, we conducted an experiment to analyze the influence of the overall channel bandwidth on performance. The overall channel bandwidth of modules with plain structures, multi-branch structures, and our proposed Rich Structure differs significantly, with the plain structure being much slimmer than the others. We redesigned these modules, replacing dimension reduction with the elastic adjuster (C × R), where R is set to 3, widening the channel bandwidth by 3 times, to determine how bandwidth affects performance and to verify the effectiveness of the elastic adjuster in different attention modules. The results are presented in Table 4; a dedicated experiment on the elastic adjuster within REMA follows.
The results show that for the 40-channel input, the redesigned, wider CBAM and SE exhibit improvements on most datasets, bringing their performance close to that of the original CA and BAM, which previously performed better. This underscores the in-branch channel bandwidth, which ultimately determines the overall module width, as a key performance-related factor. These results also show how plain structures and dimension-reduction components realized by bottleneck structures limit their performance potential, proving that the proposed elastic adjuster, used alongside Rich Structure, can enhance performance when needed under certain conditions. However, for the 64-channel input, a reduction occurs in the wider modules, except for BAM. For BAM, the redesign results in improvements on half of the datasets and reductions on the others, with overall performance close to the original for the 64-channel input. This indicates a limit to increasing the in-branch channel bandwidth for further performance gains.

4.3.6. Study of the Elastic Adjuster in REMA

To analyze the effect of the in-branch channel bandwidth in REMA, we varied the elastic adjuster’s ratio from 0.5 to 1.5 and compared the performance of R ∈ [0.5, 1) and R ∈ [1, 1.5], representing the size-oriented and performance-oriented modes of REMA, respectively. The results are shown in Table 5.
The results indicate that the overall performance of size-oriented REMA is lower than that of the performance-oriented mode for the 40-channel input, showing the same trend as the widened versions of the other attention modules. However, for the 64-channel input, unlike the other widened attention modules, REMA can still benefit from the increased bandwidth on some datasets, including BSDS100, Manga109, Set14, and Urban100. Additionally, the performance gap between the lowest and highest values for the 64-channel input is not large, proving that REMA can ensure flexibility to meet different task requirements by switching the elastic adjuster.
There is still a limit to achieving more performance through parameter exchange. This limitation may stem from two aspects: input complexity and task-specific algorithms. Regarding the former, comparing the results of 40_1.5 and 64_0.6, it can be observed that they have similar numbers of parameters, yet 64_0.6 performs significantly better than 40_1.5, with the only difference being the number of input channels. This illustrates one of the reasons why models with more parameters do not always yield higher performance and why a plug-and-play module works in one network but not in another.
Concerning the latter, comparing REMA with the widened version of BAM (64_1.2), both having a multi-branch structure with the elastic adjuster and similar overall channel bandwidth (BAM: 2 × 3, REMA: 5 × 1.2), REMA outperforms BAM on all datasets. Furthermore, the results of R ∈ [0.5, 1) and R ∈ [1, 1.5] demonstrate that a more effective parameter exchange provides extra robustness on different datasets, although models with fewer parameters may perform better on certain datasets.

4.3.7. Size-Oriented vs. Performance-Oriented

In order to further investigate how lightweight structures affect performance, we compare Rich Structure (copy/rescale–transform–aggregate) with other size-oriented multi-branch designs. Specifically, we redesign REMA in Inception (split–transform–concatenate) and ResNeXt (split–transform–aggregate) styles. The split operation is achieved by setting the elastic adjuster to 1/3 in RSAM and 1/2 in RCAM so that the overall bandwidth matches that of the input feature. The main difference between the Inception and ResNeXt versions lies in the topology of the transforming branches, which is identical across branches in the ResNeXt version. Hence, we also propose an extra ResNeXt version that retains the multi-scale and multi-level feature fusion used in REMA, to verify its effectiveness.
To comprehensively discuss the parameter efficiency of size-oriented and performance-oriented structures, we also consider the scale factor, for two reasons. Firstly, from the SR task perspective, a higher scale factor makes SR inference more challenging. Secondly, from the network perspective, as the scale factor increases, the network becomes more prone to overfitting, since we generate training data by downsampling the ground-truth image at the target scale factor; the input patch becomes very small at 8× (32 × 32), potentially leading to overfitting for a module that performs well at 2× and 4×. In other words, 2×, 4×, and 8× represent three situations ranging from low to high difficulty for every parameter that influences performance. Additionally, performance on the Historical dataset receives more attention, as it represents an out-of-distribution (OOD) scenario. Therefore, we use these factors to test the module’s compatibility and robustness, with the experimental results presented in Table 6.
According to the results, Rich Structure outperforms the other versions of REMA. Although the performance of the Inception and ResNeXt_MS versions may be close to that of the Rich Structure version on certain datasets or upscale ratios, overall, the Rich Structure version demonstrates the best capability across different datasets and networks, with less likelihood of overfitting. Moreover, ResNeXt_MS shows better performance than ResNeXt at 2× and 4×, and their results are comparable at 8×, highlighting the effectiveness of the multi-scale and multi-level feature fusion strategy in REMA. These findings demonstrate the higher compatibility and robustness of our method compared with other popular size-oriented multi-branch structures when applied in the backbone. Again, the results demonstrate that extra effective parameters provide more robustness under different scale factors.

4.3.8. Study of REMA in the Reconstruction Layer

Additionally, given the application of REMA in reconstruction layers at high-scale factors, experiments are conducted at scale factors of 4× and 8×. Figure 3 illustrates the implementation of REMA in the reconstruction layer, with corresponding results shown in Table 7.
In summary, the significance of REMA in reconstruction blocks increases with larger scale factors. At 4×, it results in a performance decline on most datasets, leading to its exclusion from REMA-SRNet at 2× and 4×. At 8×, however, there is an improvement on most datasets when REMA is used in reconstruction, although the overall enhancement is less evident for the 64-channel input. Hence, REMA in the reconstruction layers improves performance at high scale ratios under specific conditions.

4.3.9. Study of REMA in Other SISR Networks

For further investigation, we incorporate REMA into UNet-SR, a super-resolution network based on the image segmentation network U-Net. UNet-SR employs skip connections for encoder–reconstruction feature fusion, enhancing reconstruction quality. We utilize this setup to assess REMA’s effectiveness in other networks and evaluate its impact on performance when integrated into skip connections. This extends the experiments beyond the backbone and reconstruction layers, as skip connections were not used in REMA-SRNet for varying-depth feature fusion. The results are summarized in Table 8.
The results show that the performance of REMA, when added to the skip connection, surpasses other attention modules at the same position, indicating that REMA remains effective in various SR models and positions. In fact, the number of input channels gradually expands, layer by layer, as it progresses from shallow to deep within the skip connections of UNet-SR. Thus, this also suggests that Rich Structure’s advantage becomes more pronounced when handling inputs with more filters, outperforming other attention modules.

4.3.10. Comparison with Other Comparative Methods

To comprehensively evaluate our methods, we compare REMA-SRNet (R = 1) with other SISR methods that employ similar approaches, such as residual, recursive, and multi-branch learning, as well as attention-based SR networks. Our experiments encompass both lightweight and large models, including VDSR [5], ESPCN [25], RCAN [27], PAN [18], A2N [36], DRLN [10], ESRGCNN [29], SwinIR [16], NLSN [20], and UNet-SR [30].
Table 9 displays the quantitative results for various scaling factors. In summary, compared with other state-of-the-art methods, REMA-SRNet outperforms the alternatives for 2×, 4×, and 8× upscaling on the benchmark datasets, showcasing its effectiveness. The trends in these results are discussed below from a parameter-efficiency perspective.
The results indicate that larger models do not necessarily deliver higher performance; in fact, size and performance show some positive correlation at 4×. As explained earlier, this is because complex models become prone to overfitting as the complexity of the training data decreases with the increasing scale factor. For instance, RCAN and DRLN achieve better results on certain datasets at 2× and 4× but perform worse than lightweight models like PAN and A2N at 8× due to overfitting. Conversely, while lightweight models may excel at specific scale factors or on specific datasets, they can be insufficient for performance-prioritized tasks or broad compatibility requirements. Thus, parameter efficiency means not only achieving intermediate results with few parameters but also attaining optimal results while keeping the overall model size in check. Among the models tested, only REMA-SRNet and SwinIR achieve this balance, and REMA-SRNet generally outperforms SwinIR while using only 60% of its parameters (Figure 9).

4.4. Visual Comparison of Different Models

We selected reconstructed images from the Urban100, BSDS100, General100, and SET14 datasets to compare reconstruction details. Figure 10 illustrates the HR effects of REMA-SRNet and other methods, highlighting smoother lines, the preservation of fine details, and improved textures in the reconstructed images. Specifically, the textures in the super-resolved images ’img_048’ and ’img_092’ by REMA-SRNet are more accurate, and the lines in ’monarch’ and ’62096’ are sharper compared to other methods.

5. Conclusions

To address the challenge of detail preservation in SISR tasks, we propose a plug-and-play attention module called REMA. Its core component, Rich Structure, is proposed based on extensive research into how different module structures impact size, compatibility, and performance; it allows REMA to seamlessly transition between being size-oriented and performance-oriented, depending on the specific requirements of the task. We separate the SR process into two steps, upsampling and denoising, and, building on Rich Structure, propose RSAM and RCAM to focus on the key factors of each step, respectively. RSAM focuses on the mutual dependency of multiple LR-HR pairs as well as multi-scale features, while RCAM uses interactive learning to emphasize key features, enhancing detail and noise differentiation and generating intermediate features for multi-level feature fusion. With RSAM and RCAM, REMA enhances the SISR process and the performance of deep learning-based networks while simultaneously improving long-range dependency learning. Together, these components alleviate the issues of algorithm inflexibility and detail loss.
Extensive experiments validate the effectiveness of REMA, showing significant improvements in performance and compatibility compared to other attention modules. Additionally, REMA-SRNet demonstrates superiority over other SISR networks. Our investigations into module compatibility reveal a correlation between cardinality, in-branch feature bandwidth, and compatibility. Further analysis indicates that networks with high effective parameter counts exhibit enhanced robustness across various datasets and scale factors.
Future work will continue to explore factors influencing the performance and robustness of modules and aim to improve super-resolution accuracy. We plan to introduce more metrics and explore higher super-resolution ratios, such as 16×. Our goal is to develop a plug-and-play module that can automatically adjust its structure and complexity, ensuring cost efficiency and reducing the need for manual parameter tuning to meet diverse requirements.

Author Contributions

X.G.: conceptualization, methodology, software, validation, data curation, writing—review and editing, visualization. Y.C.: writing—review and editing, supervision. W.T.: supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are publicly available.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (no. 61976132).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lepcha, D.C.; Goyal, B.; Dogra, A.; Goyal, V. Image super-resolution: A comprehensive review, recent trends, challenges and applications. Inf. Fusion 2023, 91, 230–260. [Google Scholar] [CrossRef]
  2. Zhu, L.; Zhan, S.; Zhang, H. Stacked U-shape networks with channel-wise attention for image super-resolution. Neurocomputing 2019, 345, 58–66. [Google Scholar] [CrossRef]
  3. Rashkevych, Y.; Peleshko, D.; Vynokurova, O.; Izonin, I.; Lotoshynska, N. Single-frame image super-resolution based on singular square matrix operator. In Proceedings of the 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), Kyiv, Ukraine, 29 May–2 June 2017; pp. 944–948. [Google Scholar] [CrossRef]
  4. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  5. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  6. Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and pAttern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645. [Google Scholar]
  7. Tai, Y.; Yang, J.; Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3147–3155. [Google Scholar]
  8. Han, X.; Wang, L.; Wang, X.; Zhang, P.; Xu, H. A Multi-Scale Recursive Attention Feature Fusion Network for Image Super-Resolution Reconstruction Algorithm. Sensors 2023, 23, 9458. [Google Scholar] [CrossRef] [PubMed]
  9. Li, J.; Fang, F.; Li, J.; Mei, K.; Zhang, G. MDCN: Multi-scale dense cross network for image super-resolution. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 2547–2561. [Google Scholar] [CrossRef]
  10. Anwar, S.; Barnes, N. Densely residual laplacian super-resolution. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1192–1204. [Google Scholar] [CrossRef] [PubMed]
  11. Liu, S.; Weng, X.; Gao, X.; Xu, X.; Zhou, L. A Residual Dense Attention Generative Adversarial Network for Microscopic Image Super-Resolution. Sensors 2024, 24, 3560. [Google Scholar] [CrossRef] [PubMed]
  12. Mehri, A.; Ardakani, P.B.; Sappa, A.D. MPRNet: Multi-path residual network for lightweight image super resolution. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2704–2713. [Google Scholar]
  13. Ji, K.; Lei, W.; Zhang, W. Multi-Scale and Multi-Path Networks for Simultaneous Enhancement and Super-Resolution of Underwater Images. In Proceedings of the 2023 IEEE 5th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 14–16 July 2023; pp. 715–720. [Google Scholar]
  14. Lu, Z.; Li, J.; Liu, H.; Huang, C.; Zhang, L.; Zeng, T. Transformer for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 457–466. [Google Scholar]
  15. Aslahishahri, M.; Ubbens, J.; Stavness, I. DARTS: Double Attention Reference-based Transformer for Super-resolution. arXiv 2023, arXiv:2307.08837. [Google Scholar]
  16. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
  17. Conde, M.V.; Choi, U.J.; Burchi, M.; Timofte, R. Swin2SR: Swinv2 transformer for compressed image super-resolution and restoration. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 669–687. [Google Scholar]
  18. Zhao, H.; Kong, X.; He, J.; Qiao, Y.; Dong, C. Efficient image super-resolution using pixel attention. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 56–72. [Google Scholar]
  19. Wang, F.; Hu, H.; Shen, C. BAM: A balanced attention mechanism for single image super resolution. arXiv 2021, arXiv:2104.07566. [Google Scholar] [CrossRef]
  20. Mei, Y.; Fan, Y.; Zhou, Y. Image super-resolution with non-local sparse attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3517–3526. [Google Scholar]
  21. Yu, B.; Lei, B.; Guo, J.; Sun, J.; Li, S.; Xie, G. Remote Sensing Image Super-Resolution via Residual-Dense Hybrid Attention Network. Remote Sens. 2022, 14, 5780. [Google Scholar] [CrossRef]
  22. Niu, C.; Nan, F.; Wang, X. A super resolution frontal face generation model based on 3DDFA and CBAM. Displays 2021, 69, 102043. [Google Scholar] [CrossRef]
  23. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  24. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  25. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  26. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14. Springer: Cham, Switzerland, 2016; pp. 391–407. [Google Scholar]
  27. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  28. Li, J.; Fang, F.; Mei, K.; Zhang, G. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 517–532. [Google Scholar]
  29. Tian, C.; Yuan, Y.; Zhang, S.; Lin, C.W.; Zuo, W.; Zhang, D. Image super-resolution with an enhanced group convolutional neural network. Neural Netw. 2022, 153, 373–385. [Google Scholar] [PubMed]
  30. Lu, Z.; Chen, Y. Single image super-resolution based on a modified U-net with mixed gradient loss. Signal Image Video Process. 2022, 16, 1143–1151. [Google Scholar] [CrossRef]
  31. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  32. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  33. Li, W.; Li, J.; Li, J.; Huang, Z.; Zhou, D. A lightweight multi-scale channel attention network for image super-resolution. Neurocomputing 2021, 456, 327–337. [Google Scholar]
  34. Wen, R.; Yang, Z.; Chen, T.; Li, H.; Li, K. Progressive representation recalibration for lightweight super-resolution. Neurocomputing 2022, 504, 240–250. [Google Scholar] [CrossRef]
  35. Zhang, Y.; Li, K.; Li, K.; Zhong, B.; Fu, Y. Residual non-local attention networks for image restoration. arXiv 2019, arXiv:1903.10082. [Google Scholar]
  36. Chen, H.; Gu, J.; Zhang, Z. Attention in attention network for image super-resolution. arXiv 2021, arXiv:2104.09497. [Google Scholar]
  37. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  38. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  39. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar]
  40. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146. [Google Scholar]
  41. Lee, Y.; Kim, H.; Park, E.; Cui, X.; Kim, H. Wide-residual-inception networks for real-time object detection. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, USA, 11–14 June 2017; pp. 758–764. [Google Scholar]
  42. Li, J.; Pei, Z.; Zeng, T. From beginner to master: A survey for deep learning-based single-image super-resolution. arXiv 2021, arXiv:2109.14335. [Google Scholar]
  43. Pang, Y.; Zhao, X.; Xiang, T.Z.; Zhang, L.; Lu, H. Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2160–2170. [Google Scholar]
  44. Luo, J.; Zhao, L.; Zhu, L.; Tao, W. Multi-scale receptive field fusion network for lightweight image super-resolution. Neurocomputing 2022, 493, 314–326. [Google Scholar] [CrossRef]
  45. Ke, G.; Lo, S.L.; Zou, H.; Liu, Y.F.; Chen, Z.Q.; Wang, J.K. CSINet: A Cross-Scale Interaction Network for Lightweight Image Super-Resolution. Sensors 2024, 24, 1135. [Google Scholar] [CrossRef] [PubMed]
  46. Agustsson, E.; Timofte, R. NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1122–1131. [Google Scholar] [CrossRef]
  47. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision, ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 416–423. [Google Scholar]
  48. Zeyde, R.; Elad, M.; Protter, M. On single image scale-up using sparse-representations. In Proceedings of the Curves and Surfaces: 7th International Conference, Avignon, France, 24–30 June 2010; Revised Selected Papers 7. Springer: Berlin/Heidelberg, Germany, 2012; pp. 711–730. [Google Scholar]
  49. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the 23rd British Machine Vision Conference (BMVC), Surrey, UK, 3–7 September 2012; pp. 135.1–135.10. [Google Scholar]
  50. Fujimoto, A.; Ogawa, T.; Yamamoto, K.; Matsui, Y.; Yamasaki, T.; Aizawa, K. Manga109 dataset and creation of metadata. In Proceedings of the 1st International Workshop on Comics Analysis, Processing and Understanding, Cancun, Mexico, 4 December 2016; pp. 1–5. [Google Scholar]
  51. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar]
  52. Wang, X.; Xie, L.; Yu, K.; Chan, K.C.; Loy, C.C.; Dong, C. BasicSR: Open Source Image and Video Restoration Toolbox. 2022. Available online: https://github.com/XPixelGroup/BasicSR (accessed on 1 June 2023).
  53. Blau, Y.; Mechrez, R.; Timofte, R.; Michaeli, T.; Zelnik-Manor, L. The 2018 PIRM challenge on perceptual image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018; pp. 334–355. [Google Scholar]
Figure 1. Illustration of dimension reduction in size-oriented attention modules. (a) Dimension reduction in SE-like modules, and the channel attention module of BAM and CBAM and their variants. (b) Dimension reduction in CA.
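As background for the dimension reduction illustrated in Figure 1a, the following is a minimal PyTorch sketch of SE-style channel attention with a reduction ratio r; the class name and default ratio are illustrative assumptions, not code from any of the compared modules.

```python
import torch
import torch.nn as nn

class SEChannelAttention(nn.Module):
    """SE-style channel attention with dimension reduction (cf. Figure 1a).

    The bottleneck (channels // reduction) is the dimension reduction used by
    size-oriented modules such as SE and the channel branches of BAM/CBAM.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # reduce C -> C/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # expand C/r -> C
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # re-weight channels
```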
Figure 2. Illustration of size-oriented and performance-oriented modules. (a) ResNeXt-like block. (b) Inception-like block. (c) Module with Rich Structure (ours).
Figure 3. Illustration of the proposed REMA-SRNet. The backbone of REMA-SRNet is based on residual blocks incorporating REMA (REMA ResBlock). The reconstruction layers utilize bilinear upsampling followed by a 3 × 3 convolution and Leaky ReLU layers. REMA is applied at 4× and 8× upscaling, with a long skip connection from the bilinear-upscaled LR input, followed by a 1 × 1 convolution for dimension alignment. SF denotes the scale factor. For 2×, 4×, and 8× reconstruction, the number of REMA ResBlocks is 16 and the number of reconstruction blocks is 1, 2, and 3, respectively.
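The reconstruction layer described in the Figure 3 caption can be sketched as follows. This is a minimal approximation based only on the caption (bilinear upsampling, a 3 × 3 convolution, Leaky ReLU, and 1/2/3 stacked blocks for 2×/4×/8×); the class names and the channel count of 64 are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ReconstructionBlock(nn.Module):
    """One 2x reconstruction step as described in the Figure 3 caption:
    bilinear upsampling -> 3x3 convolution -> Leaky ReLU.
    The channel count (64) is an assumption for illustration."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv(self.up(x)))

def make_reconstruction(scale_factor: int, channels: int = 64) -> nn.Sequential:
    # The caption stacks 1, 2, or 3 such blocks for 2x, 4x, and 8x, respectively.
    num_blocks = {2: 1, 4: 2, 8: 3}[scale_factor]
    return nn.Sequential(*[ReconstructionBlock(channels) for _ in range(num_blocks)])
```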
Figure 4. The difference in multi-scale feature generation between RSAM and other conventional methods (assuming the scale factor is 2×). (a) RSAM (ours) learns multi-scale features and LR-HR mapping together. (b) Conventional methods (like ASPP and Inception blocks) can only obtain multi-scale features.
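To make the contrast in Figure 4b concrete, the sketch below shows a conventional multi-scale block in the spirit of ASPP/Inception: parallel convolutions with different dilation rates whose outputs are concatenated and fused, producing multi-scale features only, with no LR-HR mapping learned jointly. All names and the choice of dilation rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Conventional multi-scale feature extraction (cf. Figure 4b):
    parallel dilated 3x3 convolutions, concatenated and fused by a 1x1 conv."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]  # one scale per branch
        return self.fuse(torch.cat(feats, dim=1))
```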
Figure 5. Illustration of RSAM.
Figure 6. Comparison of RCAM and other channel attention modules. (a) RCAM (ours); (b) other channel attention modules.
Figure 7. Illustration of RCAM.
Figure 8. Comparison of the REMA-based backbone and other methods. The trend of scale changes in the backbone is denoted by green waves.
Figure 9. Performance comparison between REMA-SRNet and other SISR methods on BSDS100 (2×). Our algorithms are highlighted in red.
Figure 10. Subjective quality assessment for 4× upscaling on the general images from four datasets. The best results are bold and underlined.
Table 1. Implications of nouns, abbreviations, and symbols.
Abbreviation/Symbol | Implication
REMA | rich elastic mixed attention
REMA-SRNet | REMA-based SR network
RSAM | rich spatial attention module
RCAM | rich channel attention module
AvgPool | average pooling
Adp Avg pool | adaptive average pooling
Adp mixed pool | adaptive average⊕maximum pooling
SF | scale factor
R | the ratio of the elastic adjuster
Conv | convolution
FC | fully connected layer
Concate | concatenate
 | element-wise multiplication
 | element-wise addition
 | element-wise subtraction
Table 2. The effect of each part of REMA in the backbone (4×). #C_In denotes the input tensor’s channel count. Numerical comparisons maintain precision to 12 decimal places, with the top two results highlighted in red and blue.
MODEL | #C_In | BSDS100 | General100 | Historical | Manga109 | PIRM | SET14 | SET5 | Urban100
(each dataset column reports PSNR / SSIM)
Baseline | 40 | 25.17 / 0.6993 | 28.45 / 0.8257 | 22.04 / 0.6696 | 26.36 / 0.8413 | 28.16 / 0.8918 | 25.84 / 0.7374 | 29.78 / 0.8651 | 22.73 / 0.6711
RCAM | 40 | 25.20 / 0.7002 | 28.52 / 0.8275 | 22.01 / 0.6678 | 26.48 / 0.8437 | 28.21 / 0.8930 | 25.85 / 0.7387 | 29.90 / 0.8667 | 22.84 / 0.6756
RSAM | 40 | 25.20 / 0.7009 | 28.50 / 0.8272 | 22.00 / 0.6703 | 26.44 / 0.8434 | 28.32 / 0.8941 | 25.84 / 0.7385 | 29.90 / 0.8674 | 22.74 / 0.6736
REMA | 40 | 25.23 / 0.7019 | 28.54 / 0.8285 | 22.00 / 0.6707 | 26.53 / 0.8457 | 28.22 / 0.8930 | 25.88 / 0.7394 | 30.01 / 0.8689 | 22.86 / 0.6775
Baseline | 64 | 25.22 / 0.7005 | 28.52 / 0.8263 | 22.03 / 0.6701 | 26.44 / 0.8435 | 28.29 / 0.8929 | 25.83 / 0.7376 | 29.88 / 0.8676 | 22.82 / 0.6753
RCAM | 64 | 25.24 / 0.7022 | 28.63 / 0.8297 | 22.03 / 0.6721 | 26.61 / 0.8475 | 28.35 / 0.8963 | 25.89 / 0.7401 | 30.04 / 0.8701 | 22.88 / 0.6803
RSAM | 64 | 25.24 / 0.7023 | 28.63 / 0.8294 | 22.03 / 0.6718 | 26.61 / 0.8473 | 28.43 / 0.8971 | 25.89 / 0.7400 | 29.99 / 0.8697 | 22.90 / 0.6801
REMA | 64 | 25.27 / 0.7026 | 28.67 / 0.8302 | 22.05 / 0.6722 | 26.65 / 0.8484 | 28.41 / 0.8962 | 25.97 / 0.7410 | 30.04 / 0.8700 | 22.92 / 0.6816
Table 3. Performance comparison in ResBlock between REMA and other attention modules. #C_In denotes the input tensor’s channel count. Numerical comparisons maintain precision to 12 decimal places, with the top two results highlighted in red and blue.
MODEL | #C_in | BSDS100 | General100 | Historical | Manga109 | PIRM | SET14 | SET5 | Urban100
(each dataset column reports PSNR / SSIM)
CBAM [23] | 40 | 25.20 / 0.6999 | 28.49 / 0.8270 | 22.01 / 0.6687 | 26.44 / 0.8435 | 28.17 / 0.8935 | 25.83 / 0.7370 | 29.89 / 0.8663 | 22.81 / 0.6755
SE [31] | 40 | 25.20 / 0.7010 | 28.51 / 0.8280 | 21.98 / 0.6695 | 26.51 / 0.8459 | 28.24 / 0.8931 | 25.87 / 0.7397 | 29.97 / 0.8686 | 22.78 / 0.6757
CA [32] | 40 | 25.24 / 0.7022 | 28.51 / 0.8276 | 21.96 / 0.6675 | 26.54 / 0.8457 | 28.12 / 0.8905 | 25.85 / 0.7394 | 29.96 / 0.8684 | 22.79 / 0.6768
BAM [19] | 40 | 25.22 / 0.7016 | 28.57 / 0.8284 | 21.96 / 0.6692 | 26.54 / 0.8464 | 28.30 / 0.8936 | 25.88 / 0.7396 | 29.95 / 0.8690 | 22.85 / 0.6775
REMA | 40 | 25.23 / 0.7019 | 28.54 / 0.8285 | 22.00 / 0.6707 | 26.53 / 0.8457 | 28.22 / 0.8930 | 25.88 / 0.7394 | 30.01 / 0.8689 | 22.86 / 0.6775
CBAM [23] | 64 | 25.21 / 0.7004 | 28.57 / 0.8281 | 22.01 / 0.6694 | 26.48 / 0.8443 | 28.30 / 0.8957 | 25.86 / 0.7384 | 29.87 / 0.8676 | 22.81 / 0.6767
SE [31] | 64 | 25.24 / 0.7016 | 28.61 / 0.8292 | 22.02 / 0.6714 | 26.55 / 0.8464 | 28.41 / 0.8961 | 25.92 / 0.7403 | 29.96 / 0.8691 | 22.89 / 0.6796
CA [32] | 64 | 25.22 / 0.7014 | 28.56 / 0.8281 | 22.01 / 0.6700 | 26.53 / 0.8457 | 28.45 / 0.8959 | 25.88 / 0.7397 | 29.97 / 0.8689 | 22.80 / 0.6780
BAM [19] | 64 | 25.24 / 0.7016 | 28.64 / 0.8294 | 22.01 / 0.6698 | 26.61 / 0.8476 | 28.36 / 0.8957 | 25.90 / 0.7400 | 30.01 / 0.8691 | 22.90 / 0.6804
REMA | 64 | 25.27 / 0.7026 | 28.67 / 0.8302 | 22.05 / 0.6722 | 26.65 / 0.8484 | 28.41 / 0.8962 | 25.97 / 0.7410 | 30.04 / 0.8700 | 22.92 / 0.6816
Table 4. Performance comparison between widened attention modules. #C_In denotes the input tensor’s channel count. _wide denotes the modules with widened channel bandwidth (×3) by the elastic adjuster, and R denotes the ratio of the elastic adjuster. The numerical comparisons are accurate to 12 decimal places, with the best result highlighted in red.
MODEL | #C_In | R | BSDS100 | General100 | Historical | Manga109 | PIRM | SET14 | SET5 | Urban100 (each dataset column reports PSNR / SSIM)
CBAM [23]40125.200.699928.490.827022.010.668726.440.843528.170.893525.830.737029.890.866322.810.6755
CBAM_wide40325.220.701428.520.827922.010.670726.470.843928.120.893125.870.739229.980.868622.810.6748
SE [31]40125.200.701028.510.828021.980.669526.510.845928.240.893125.870.739729.970.868622.780.6757
SE_wide40325.210.701328.540.827922.020.671126.490.845128.300.893525.890.739529.870.867322.800.6757
CA [32]40125.240.702228.510.827621.960.667526.540.845728.120.890525.850.739429.960.868422.790.6768
CA_wide40325.220.701428.580.828522.000.667726.570.845728.230.893325.870.739429.930.867822.790.6744
BAM [19]40125.220.701628.570.828421.960.669226.540.846428.300.893625.880.739629.950.869022.850.6775
BAM_wide40325.220.701628.530.828022.000.670826.540.845828.280.893525.870.739529.930.868722.820.6778
CBAM [23]64125.210.700428.570.828122.010.669426.480.844328.300.895725.860.738429.870.867622.810.6767
CBAM_wide64325.210.698528.470.826421.990.665326.450.843328.130.891725.840.737229.950.868622.820.6750
SE [31]64125.240.701628.610.829222.020.671426.550.846428.410.896125.920.740329.960.869122.890.6796
SE_wide64325.240.700528.560.827721.990.668226.570.845728.270.892625.880.738729.960.868422.880.6785
CA [32]64125.220.701428.560.828122.010.670026.530.845728.450.895925.880.739729.970.868922.800.6780
CA_wide64325.190.700228.510.827721.910.664926.490.845628.110.892025.830.738130.000.869422.770.6757
BAM [19]64125.240.701628.640.829422.010.669826.610.847628.360.895725.900.740030.010.869122.900.6804
BAM_wide64325.230.701628.620.829222.000.670326.640.848328.360.895325.900.740030.000.869322.910.6813
Table 5. The trend of performance changes with different ratios of the elastic adjuster under 4×. #C_in_R denotes the number of channels of the input and the elastic adjuster’s ratio. The results of different input widths are denoted by blue and green. Deeper colors represent higher values.
#C_in_R | #P (M) | BSDS100 | General100 | Historical | Manga109 | PIRM | SET14 | SET5 | Urban100 (each dataset column reports PSNR / SSIM)
40_0.51.2425.18910.700728.46320.826922.00670.670126.42010.844228.11060.892525.81110.738029.88760.867122.72610.6737
40_0.61.4225.20010.701128.51010.827421.97540.669626.44010.844828.25000.893225.80320.738229.90970.867622.79790.6760
40_0.71.6025.24160.701228.52830.827722.03230.670026.50910.845128.25270.892425.90230.738929.97640.867722.85150.6759
40_0.81.8025.22160.700728.46860.827021.99840.669726.47430.844328.05990.891825.81250.737829.84010.866822.79550.6753
40_0.92.0125.19790.700628.54780.827522.04860.670926.44140.843628.27610.893625.83910.738229.97060.868122.78440.6741
40_1.02.2325.23040.701928.53520.828522.00100.670726.53250.845728.22030.893025.88360.739430.00890.868922.85820.6775
40_1.12.4625.21250.701628.50240.827821.96620.670026.51810.845828.20020.892425.84420.739329.94100.868222.80430.6773
40_1.22.7125.22510.700828.50750.827722.03530.670726.48920.844528.19080.891825.86550.738529.93130.868022.80660.6755
40_1.32.9625.21370.701628.52260.828321.98330.670326.47960.845828.13770.892825.85140.739029.94910.868922.82510.6780
40_1.43.2325.20940.700928.51670.827522.01560.668926.46340.844028.16210.892725.88240.738930.01250.868322.78920.6743
40_1.53.5125.21200.701428.52260.828422.03920.672026.54780.846328.32060.894225.91140.740229.94120.868322.81240.6762
64_0.53.1725.25440.701428.65610.829722.02740.670226.64920.847928.38470.895425.94610.740430.01460.869222.91010.6806
64_0.63.5825.25430.701728.64540.829622.03690.671226.61630.847428.34650.895625.91490.740030.01630.869222.88790.6795
64_0.74.0225.25730.701728.62850.829822.01260.670626.58920.847228.33690.895125.91230.740030.01210.869822.90660.6803
64_0.84.5525.25740.702328.64990.830021.99490.670326.64930.848528.36410.895625.94390.740830.09060.870622.91130.6807
64_0.95.0525.25560.701928.65520.830122.01270.670926.60800.847428.35790.895725.94730.740530.01380.869622.89510.6801
64_1.05.6825.26750.702628.66920.830222.05250.672226.64820.848428.40640.896225.96860.741030.04470.870022.91860.6816
64_1.16.2325.26060.702028.65700.830421.99800.671326.62330.847728.33000.895425.94430.740730.07280.870722.92300.6813
64_1.26.8225.26680.702228.67520.830522.02580.670826.64420.848028.40110.895425.94410.740930.07370.870422.91750.6817
64_1.37.5225.26900.703328.65650.830521.99930.672026.66220.849228.36110.895925.93830.741430.03760.870422.92440.6823
64_1.48.1625.25990.702228.65570.830222.02260.671526.67170.848828.36950.895425.96900.740930.02910.869522.89930.6808
64_1.58.9625.26520.702128.64020.830022.02620.671026.65830.848228.43520.896625.90130.739830.04410.870222.92880.6814
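Tables 4 and 5 vary the elastic adjuster’s ratio R, which scales the in-branch channel bandwidth of the module (and hence its parameter count, #P). A minimal sketch of how such a width multiplier might be applied is given below; the helper names, the rounding rule, and the branch layout are our assumptions for illustration, not the authors’ implementation.

```python
import torch.nn as nn

def elastic_width(in_channels: int, ratio: float) -> int:
    """Scale the in-branch channel bandwidth by the elastic ratio R
    (e.g., R = 1.5 widens, R = 0.5 narrows). The rounding rule is assumed."""
    return max(1, int(round(in_channels * ratio)))

def make_branch(in_channels: int, ratio: float) -> nn.Sequential:
    """A 3x3 conv pair whose hidden width is set by R; illustrative only."""
    hidden = elastic_width(in_channels, ratio)
    return nn.Sequential(
        nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(hidden, in_channels, kernel_size=3, padding=1),
    )
```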
Table 6. Performance comparison between the Rich Structure (REMA), ResNeXt, and Inception versions of REMA. ResNeXt_MS represents the ResNeXt version of REMA with multi-scale and multi-level feature fusion. SF denotes the scale factor. The numerical comparisons are accurate to 12 decimal places. The best two results are highlighted in red and blue.
MODEL | SF | BSDS100 | General100 | Historical | Manga109 | PIRM | SET14 | SET5 | Urban100 (each dataset column reports PSNR / SSIM)
Inception [37]29.910.887935.140.951027.140.897534.590.963133.190.966530.930.892235.920.947928.460.8848
ResNeXt_MS29.930.888335.170.951127.190.898134.550.963033.260.966830.910.892435.970.948128.510.8854
ResNeXt [38]29.910.887935.170.951027.130.897434.590.963033.240.966830.920.892035.980.948128.490.8848
RichStructure29.940.888735.160.951127.250.899434.660.963233.270.966830.950.892835.970.948128.510.8860
Inception [37]25.260.702428.640.829822.040.672226.640.847828.420.896425.930.741129.970.869322.900.6804
ResNeXt_MS25.250.701528.640.829822.020.670126.610.847228.400.896125.920.740229.980.869722.880.6796
ResNeXt [38]25.250.701828.630.829322.010.670026.610.847128.380.895825.910.740429.970.869222.920.6798
Rich Structure25.270.702628.670.830222.050.672226.650.848428.410.896225.970.741030.040.870022.920.6816
Inception [37]22.370.541824.050.666119.240.449421.100.656125.150.808721.960.564124.840.719019.550.4710
ResNeXt_MS22.400.541724.030.664919.240.448921.060.652625.030.805121.990.564524.790.716119.590.4698
ResNeXt [38]22.340.540923.990.662519.240.448820.950.649225.000.796421.890.561924.740.715419.530.4689
Rich Structure22.400.543024.100.667519.240.451821.170.658024.990.806522.010.566324.830.719319.610.4735
Table 7. REMA in the reconstruction layer. SF denotes the scale factor. #C_in denotes the input tensor’s channel count. RB w/ REMA and RB w/o REMA denote the reconstruction block with and without REMA, respectively. The numerical comparisons are accurate to 12 decimal places. The best result is highlighted in red.
MODEL | SF | #C_in | BSDS100 | General100 | Historical | Manga109 | PIRM | SET14 | SET5 | Urban100 (each dataset column reports PSNR / SSIM)
RB w/o REMA4025.230.701928.540.828522.000.670726.530.845728.220.893025.880.739430.010.868922.860.6775
RB w/ REMA4025.210.701428.550.828522.000.670926.530.846428.340.894625.860.738929.940.868422.810.6764
RB w/o REMA6425.270.702628.670.830222.050.672226.650.848428.410.896225.970.741030.040.870022.920.6816
RB w/ REMA6425.250.701528.590.828822.060.671826.560.846428.390.895225.920.740029.960.868622.900.6795
RB w/o REMA4022.350.541424.010.660919.170.446721.030.650225.050.782721.900.562424.800.717319.460.4660
RB w/ REMA4022.370.542024.080.665919.220.447621.130.656525.090.805221.940.562024.920.718419.570.4714
RB w/o REMA6422.400.544724.090.668119.210.449821.170.660424.980.805422.040.569424.770.720919.610.4752
RB w/ REMA6422.400.543024.100.667519.240.451821.170.658024.990.806522.010.566324.830.719319.610.4735
Table 8. Comparison with other attention modules in UNet-SR under 4×. The numerical comparisons are accurate to 12 decimal places. The best two results are highlighted in red and blue.
MODEL | BSDS100 | General100 | Historical | Manga109 | PIRM | SET14 | SET5 | Urban100
(each dataset column reports PSNR / SSIM)
UNet-SR_CBAM | 24.93 / 0.6963 | 27.83 / 0.8150 | 21.86 / 0.6629 | 25.55 / 0.8263 | 27.98 / 0.8632 | 25.46 / 0.7299 | 29.50 / 0.8599 | 22.36 / 0.6586
UNet-SR_SE | 24.93 / 0.6961 | 27.98 / 0.8172 | 21.85 / 0.6631 | 25.61 / 0.8269 | 28.13 / 0.8785 | 25.44 / 0.7307 | 29.35 / 0.8574 | 22.38 / 0.6604
UNet-SR_CA | 24.92 / 0.6957 | 27.83 / 0.8161 | 21.86 / 0.6617 | 25.48 / 0.8239 | 28.08 / 0.8789 | 25.41 / 0.7291 | 29.26 / 0.8566 | 22.34 / 0.6575
UNet-SR_BAM | 24.92 / 0.6963 | 27.95 / 0.8171 | 21.85 / 0.6633 | 25.66 / 0.8282 | 28.05 / 0.8740 | 25.43 / 0.7301 | 29.28 / 0.8566 | 22.37 / 0.6600
UNet-SR_REMA | 25.01 / 0.6994 | 28.16 / 0.8220 | 21.90 / 0.6677 | 25.91 / 0.8335 | 28.27 / 0.8866 | 25.56 / 0.7338 | 29.51 / 0.8604 | 22.52 / 0.6663
Table 9. Performance comparison between REMA-SRNet and other comparative methods. HR images are center-cropped and downscaled via bicubic interpolation to generate LR image pairs for training and testing, without any data augmentation. PSNR and SSIM are computed in the RGB space. #P denotes the number of parameters (M). SF denotes the scale factor. The numerical comparisons are accurate to 12 decimal places. The best two results are highlighted in red and blue.
MODEL | #P | SF | BSDS100 | General100 | Historical | Manga109 | PIRM | SET14 | SET5 | Urban100 (each dataset column reports PSNR / SSIM)
VDSR [5]0.2129.550.881534.010.942526.860.889033.270.954732.600.960530.420.884235.160.942827.470.8672
ESPCN [25]0.0629.400.880133.800.940026.820.887633.060.952932.430.950530.260.884835.020.942127.050.8585
RCAN [27]16.2129.900.888035.080.950627.140.897534.420.962833.280.966230.870.892135.960.948028.490.8864
PAN [18]0.2529.860.886935.050.950227.190.897734.550.962933.230.966630.890.892035.910.947728.290.8818
A2N [36]0.9929.850.886834.990.950027.200.897734.500.962633.210.966430.880.891935.850.947528.260.8813
DRLN [10]32.8429.810.886534.910.949527.210.897434.380.962433.100.965230.790.890835.840.947528.210.8803
ESRGCNN [29]2.3129.700.884434.410.939627.040.891033.890.958232.860.903930.630.888535.490.944127.860.8746
SwinIR [16]10.429.920.888235.060.950727.130.897134.610.963433.220.966030.950.892336.030.948128.370.8844
NLSN [20]1.6329.860.887335.050.950427.090.896234.550.962733.110.965830.850.891336.000.948028.410.8846
UNetSR [30]8.129.410.883134.030.944226.800.891833.150.956432.600.963530.420.888735.130.944327.190.8645
REMA-SRNet-M2.229.890.887835.090.950527.180.897634.500.962733.300.966730.850.891735.820.947728.340.8833
REMA-SRNet5.6129.940.888735.160.951127.250.899434.660.963233.270.966830.950.892835.970.948128.510.8860
VDSR [5]0.2124.930.688127.830.810921.780.651825.300.813527.990.890925.370.721829.170.851422.310.6503
ESPCN [25]0.0724.860.686827.670.797821.860.644525.020.794027.670.841825.220.715729.070.841722.090.6354
RCAN [27]16.3525.170.699928.410.823221.950.664426.490.840328.100.888925.750.734729.830.864822.750.6740
PAN [18]0.2625.010.691728.060.816421.920.658425.650.824528.020.888925.500.727429.410.857322.460.6566
A2N [36]125.010.692028.100.817421.920.659425.690.825528.090.890225.520.727829.430.857722.480.6586
DRLN [10]32.9825.150.697928.380.822521.970.666326.320.839028.220.880825.770.735129.820.865022.710.6704
ESRGCNN [29]2.3125.060.694428.080.804621.920.653425.970.822928.050.790025.590.728329.480.853322.520.6588
SwinIR [16]10.4525.240.702528.570.828222.040.667526.680.847228.200.884225.920.740229.980.867922.830.6772
NLSN [20]1.7725.150.699228.330.822221.980.666626.330.834127.950.887225.720.733529.800.862322.610.6680
UNet-SR [30]8.1124.930.696327.970.816421.860.663525.620.826428.140.872725.410.729529.360.856622.390.6604
REMA-SRNet-M2.2325.230.701928.540.828522.000.670726.530.845728.220.893025.880.739430.010.868922.860.6775
REMA-SRNet5.6825.270.702628.670.830222.050.672226.650.848428.410.896225.970.741030.040.870022.920.6816
VDSR [5]0.2122.090.523023.420.638819.100.426320.330.601524.150.784121.620.534324.210.677819.140.4383
ESPCN [25]0.1122.190.527123.470.617919.150.420620.420.590124.350.675321.650.534624.380.671319.180.4342
RCAN [27]16.4922.270.535423.790.643519.120.434420.870.620824.580.762021.840.548324.630.693719.390.4547
PAN [18]0.2722.350.540723.950.662119.260.446920.950.648125.030.806321.870.559124.770.713719.490.4661
A2N [36]1.0122.340.540623.960.662219.250.446920.960.648825.000.805921.850.557624.770.713819.480.4662
DRLN [10]33.1222.300.538823.900.653719.070.439321.020.647624.790.764421.800.556224.610.709519.440.4648
ESRGCNN [29]2.3122.240.531823.790.635419.120.424520.790.617424.780.715521.810.548624.570.692119.360.4514
SwinIR [16]10.6822.380.541724.150.662919.190.441721.220.650124.880.785422.010.561725.040.717319.550.4680
NLSN [20]1.9122.140.529323.560.629318.940.424420.770.616624.480.756621.720.542124.450.687219.320.4495
UNetSR [30]6.7722.210.538423.700.659019.190.446020.720.645924.960.779521.730.557024.570.711419.370.4653
REMA-SRNet-M2.4722.370.542024.080.665919.220.447621.130.656525.090.805221.940.562024.920.718419.570.4714
REMA-SRNet6.2922.400.543024.100.667519.240.451821.170.658024.990.806522.010.566324.830.719319.610.4735
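The evaluation protocol stated in the Table 9 caption (LR inputs obtained by bicubic downscaling of center-cropped HR images, PSNR and SSIM computed in RGB space) can be reproduced roughly as follows; the use of Pillow and NumPy is our assumption, not a description of the authors’ code. SSIM can be computed analogously, e.g., with skimage.metrics.structural_similarity.

```python
import numpy as np
from PIL import Image

def make_lr(hr: Image.Image, scale: int) -> Image.Image:
    """Bicubic downscaling of an HR image to create its LR counterpart."""
    w, h = hr.size
    return hr.resize((w // scale, h // scale), resample=Image.BICUBIC)

def psnr_rgb(img1: Image.Image, img2: Image.Image) -> float:
    """PSNR computed directly in RGB space (not on the Y channel)."""
    a = np.asarray(img1, dtype=np.float64)
    b = np.asarray(img2, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)
```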
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
