4.1. Implementation Details and Datasets
To assess the effectiveness of Rich Structure, REMA, and REMA-SRNet, we employ images from DIV2K [46] for training and validation, following its default split. Evaluation metrics are the peak signal-to-noise ratio (PSNR, dB) and structural similarity (SSIM), computed in RGB space, where higher values indicate better reconstruction. The best models are selected by the highest PSNR + SSIM on the DIV2K validation set and evaluated on five commonly used benchmarks (BSDS100 [47], Set14 [48], Set5 [49], Manga109 [50], and Urban100 [51]) plus three additional datasets (Historical [52], PIRM [53], and General100 [26]) for a more comprehensive study, under upscaling factors of 2×, 4×, and 8×.
HR images are center-cropped into fixed-size patches and downscaled via bicubic interpolation to generate LR–HR image pairs for training and testing, without any data augmentation. Optimization employs Adam with an initial learning rate of 0.0001, halved every 50 epochs, with $\beta_1$ set to 0.9, $\beta_2$ set to 0.999, and a small $\epsilon$ for numerical stability. The batch size is set to 1, and training lasts 300 epochs, using PyTorch 2.0.0 on a desktop with an Intel Core i5-8600 CPU, 64 GB RAM, and an NVIDIA RTX 3090 GPU. The training loss is the L1 loss:
$$L_1 = \frac{1}{|P|} \sum_{p \in P} \left| \hat{Y}(p) - Y(p) \right|$$

where $P$ represents the area over which the loss is calculated, $p$ denotes a pixel's position within area $P$, and $\hat{Y}(p)$ and $Y(p)$ are the pixel values at position $p$ in the prediction and the ground-truth map, respectively.
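For concreteness, the following is a minimal PyTorch sketch of this training configuration (Adam with the stated learning-rate schedule and an L1 loss); the model and data loader are placeholders rather than the actual REMA-SRNet definition, and the real training loop may differ in detail.

import torch
import torch.nn as nn

# Placeholder model and data loader; the real REMA-SRNet and DIV2K pipeline are not shown here.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.LeakyReLU(), nn.Conv2d(64, 3, 3, padding=1))
train_loader = []  # stand-in for (lr_img, hr_img) pairs, batch size 1

# Adam with lr = 1e-4, beta1 = 0.9, beta2 = 0.999, as stated in the text.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# Learning rate halved every 50 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)
criterion = nn.L1Loss()  # pixel-wise L1 loss between SR prediction and HR ground truth

for epoch in range(300):
    for lr_img, hr_img in train_loader:
        optimizer.zero_grad()
        sr_img = model(lr_img)
        loss = criterion(sr_img, hr_img)
        loss.backward()
        optimizer.step()
    scheduler.step()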
4.2. Evaluation Metrics
We evaluate SR images using two widely used metrics: the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). PSNR serves as an objective metric to assess image quality and measure the degree of difference between an original image and a compressed or distorted version. The PSNR calculation relies on mean square error (MSE), quantifying the squared differences between corresponding pixels in the original and reconstructed images. The formula for PSNR is as follows:
$$\mathrm{MSE} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \left( X(i,j) - Y(i,j) \right)^2, \qquad \mathrm{PSNR} = 10 \cdot \log_{10}\!\left( \frac{\mathrm{MAX}^2}{\mathrm{MSE}} \right)$$

where $W$ and $H$ are the width and height of the image, $(i, j)$ indexes pixel positions, $X$ and $Y$ denote the super-resolved image and the ground-truth image, respectively, $\mathrm{MAX}$ is the maximum value of the pixel range, and $\mathrm{MSE}$ is the mean square error. Higher PSNR values indicate lower distortion and better image quality, typically ranging from 20 to 50 dB; values exceeding 30 dB are generally considered indicative of good image quality. Recognizing that PSNR is a limited indicator that fails to capture human subjective perception of images, we also use SSIM as an evaluation index. SSIM accounts for contrast, brightness, and structural similarity. The SSIM value at pixel position $p$ is computed as follows:
$$\mathrm{SSIM}(p) = \frac{\left( 2\mu_X \mu_Y + C_1 \right)\left( 2\sigma_{XY} + C_2 \right)}{\left( \mu_X^2 + \mu_Y^2 + C_1 \right)\left( \sigma_X^2 + \sigma_Y^2 + C_2 \right)}$$

Here, $\mu_X$, $\mu_Y$, $\sigma_X$, $\sigma_Y$, and $\sigma_{XY}$ denote the means, standard deviations, and covariance of the pixels at position $p$ in the prediction map and the ground-truth map. The constants $C_1$ and $C_2$ are included to prevent division by zero. The SSIM value is bounded above by 1, with values closer to 1 indicating a better HR reconstruction.
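For reference, a minimal NumPy sketch of the two metrics is given below; it follows the MSE/PSNR definitions above and a single-window form of SSIM with the conventional constants C1 = (0.01·MAX)^2 and C2 = (0.03·MAX)^2, which may differ from the exact implementation used for the reported numbers.

import numpy as np

def psnr(x, y, max_val=255.0):
    """PSNR between super-resolved image x and ground-truth image y (same shape)."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=255.0):
    """Single-window SSIM; practical implementations average SSIM over local windows instead."""
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))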
4.3. Ablation Studies
In this section, ablation studies are conducted to verify the effectiveness of each part of REMA. The experiments span networks with various settings, scale factors, and integration positions, as well as comparisons with other attention modules in REMA-SRNet and other SISR networks. Meanwhile, the effectiveness of Rich Structure is verified by comparing it with REMA using Inception- and ResNeXt-like structures. Furthermore, the impact of parameter count on performance and robustness is discussed based on the experimental results.
The baseline model in our experiments is a modified EDSR, obtained by replacing the REMA ResBlocks in REMA-SRNet with plain residual blocks. Specifically, the pixel shuffle layer is replaced by bilinear upsampling followed by 3 × 3 convolutions and Leaky ReLU layers, and the output is connected with the bilinear-upscaled LR input. To validate the proposed methods, we employ two network configurations: the default (64-16-64), denoted REMA-SRNet, and an alternative (40-16-40), denoted REMA-SRNet-M. The adjuster ratio R is set to 1 by default. For 2×, 4×, and 8× reconstruction, the number of reconstruction layers is 1, 2, and 3, respectively.
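A minimal sketch of the baseline's upsampling path, as described above, might look as follows; the channel widths, activation settings, and exact wiring are illustrative assumptions rather than the actual REMA-SRNet code.

import torch.nn as nn
import torch.nn.functional as F

class ReconstructionBlock(nn.Module):
    """One reconstruction stage: bilinear 2x upsampling + 3x3 convolution + Leaky ReLU."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.act(self.conv(x))

class BaselineHead(nn.Module):
    """Stacks 1/2/3 reconstruction blocks for 2x/4x/8x and adds the bilinear-upscaled LR input."""
    def __init__(self, channels=64, scale=2):
        super().__init__()
        n_blocks = {2: 1, 4: 2, 8: 3}[scale]
        self.scale = scale
        self.blocks = nn.Sequential(*[ReconstructionBlock(channels) for _ in range(n_blocks)])
        self.to_rgb = nn.Conv2d(channels, 3, kernel_size=3, padding=1)

    def forward(self, features, lr_img):
        out = self.to_rgb(self.blocks(features))
        skip = F.interpolate(lr_img, scale_factor=self.scale, mode="bilinear", align_corners=False)
        return out + skip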
In all experiments, robustness is evaluated across all eight datasets. The Historical dataset specifically assesses the module’s capability in handling out-of-distribution (OOD) samples. Additionally, two sets of network configurations mentioned above (#C_In is 40 or 64) are employed to test module compatibility with varying complexities of input features.
4.3.1. Study of REMA in the Backbone
Figure 3 illustrates the utilization of REMA within a residual block of the backbone networks. To analyze the effectiveness of REMA, five models were constructed in addition to the baseline: RSAM, RCAM, RSAM-RCAM, RCAM-RSAM, and REMA, which use RSAM alone, RCAM alone, the two modules in sequence (RSAM then RCAM, and RCAM then RSAM), and the two modules together in parallel, respectively. The results of these ablation experiments are shown in Table 2.
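To make the five configurations concrete, the sketch below shows one way the two attention modules could be wired into a residual block alone, in sequence, or in parallel; the RSAM/RCAM bodies are uniform placeholders, since only the wiring is being illustrated, not the modules' internals.

import torch
import torch.nn as nn

class PlaceholderAttn(nn.Module):
    """Stand-in attention: returns a uniform attention map with the input's shape."""
    def forward(self, x):
        return torch.ones_like(x)

class AttnResBlock(nn.Module):
    """Residual block whose `mode` selects how spatial/channel attention is wired in."""
    def __init__(self, channels, mode="parallel"):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.rsam = PlaceholderAttn()  # stand-in for the spatial attention module
        self.rcam = PlaceholderAttn()  # stand-in for the channel attention module
        self.mode = mode

    def forward(self, x):
        f = self.body(x)
        if self.mode == "rsam":             # RSAM only
            f = f * self.rsam(f)
        elif self.mode == "rcam":           # RCAM only
            f = f * self.rcam(f)
        elif self.mode == "rsam_rcam":      # RSAM followed by RCAM
            f = f * self.rsam(f)
            f = f * self.rcam(f)
        elif self.mode == "rcam_rsam":      # RCAM followed by RSAM
            f = f * self.rcam(f)
            f = f * self.rsam(f)
        else:                               # "parallel": both attentions applied to the same feature
            f = f * self.rsam(f) * self.rcam(f)
        return x + f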
The results indicate that employing only RSAM in the backbone enhances PSNR and SSIM on most datasets, except for Set14 and Historical when the input tensor has 40 channels. RCAM likewise underperforms on the Historical dataset, which we attribute to the large gap between the Historical images and the distribution of the training data. Configuring the two modules in parallel (REMA) boosts performance on most datasets. Moreover, with a 64-channel input tensor, all models show significant performance improvements; notably, the performance reduction on the Historical dataset is substantially mitigated even when RSAM and RCAM are used separately, and REMA in the backbone consequently improves performance on the Historical dataset. Overall, these results affirm the effectiveness of our method.
4.3.2. Study of Rich Structure
Rich Structure is examined, together with REMA, throughout the subsequent experiments. To verify the effectiveness of Rich Structure and REMA, we first compare REMA with other attention modules. This allows us to identify the key factors influencing a plug-and-play module and to demonstrate the superiority of Rich Structure and REMA. Additionally, we design ResNeXt and Inception versions of REMA to highlight the advantages of Rich Structure in terms of compatibility and flexibility compared to other popular module structures.
4.3.3. Comparison with Other Attention Modules
We compared the performance of REMA with other attention modules, including CBAM, SE, CA, and BAM, each employed in the same way as REMA. Our experiment includes results for 40- and 64-channel inputs. To ensure a fair comparison, we set the dimension-reduction ratio to 1 (C/r with r = 1), meaning no channel compression is applied. The results are presented in Table 3.
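As an illustration of the r = 1 setting, the following generic SE-style channel attention keeps the reduction ratio as a parameter; with r = 1 the hidden width equals the input width, i.e., no channel compression. This is a generic sketch, not the code of the compared modules.

import torch.nn as nn

class ChannelAttention(nn.Module):
    """Generic SE-style channel attention with reduction ratio r (C/r hidden units)."""
    def __init__(self, channels, r=1):
        super().__init__()
        hidden = max(channels // r, 1)   # r = 1 keeps the full channel bandwidth
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))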
The results indicate that for the 40-channel input, there is no significant difference between the performance of REMA and that of the other attention modules, except CBAM. However, for the 64-channel input, REMA outperforms the other attention modules. Furthermore, comparing the overall improvement when the number of channels increases from 40 to 64, BAM and REMA show a much larger gain than the other attention modules in the experiment, as discussed in the next section. On the Historical dataset, all attention modules except REMA reduce performance when integrated into the residual block.
4.3.4. Study of Plain, Multi-Branch, and Rich Structure
To elucidate why the performance increment differs as input complexity increases, we analyze these attention modules from a global structural perspective. According to Table 3, modules with a multi-branch structure exhibit a greater performance increase with rising input complexity than modules with a plain structure, except for CA. The primary distinction among these modules lies in their cardinality: 1 for the plain modules (SE and CBAM), and 2, 2, and 5 for CA, BAM, and REMA, respectively. Based on the results, cardinality is positively correlated with the overall performance of the modules for the 64-channel input. Thus, cardinality is an influential factor for module compatibility, and higher cardinality enhances a module's performance as input complexity grows.
However, cardinality is not the sole factor influencing performance. Comparing CA and BAM, both with a cardinality of two, there is a performance gap for the 64-channel input. The main difference lies in the in-branch bandwidth. In fact, CA also employs a split–transform–aggregate structure similar to Inception-like blocks; the distinction is that CA splits the features along H and W rather than along C, as shown in Figure 1b, while BAM and REMA map the complete input directly to each branch. This implies that the in-branch features are less informative in CA than in BAM and REMA.
Comparing BAM and REMA, both modules generate spatial and channel attention. The difference lies in our proposed algorithm, which not only enhances SR-related feature representation but also generates richer multi-scale and multi-level features compared to BAM. This is because BAM is a size-oriented module, balancing performance and module size, resulting in limited room and more constraints for algorithm design. Our proposed Rich Structure is designed to overcome this limitation. We will delve into this topic in the following section. Therefore, in-branch feature richness and task-related algorithms are other influential factors. The richness is defined by the channel bandwidth of the in-branch features and the diversity of the features (multi-scale and multi-level features).
4.3.5. Study of the Elastic Adjuster
For further investigation, we conducted an experiment to analyze the influence of overall channel bandwidth on performance. The overall channel bandwidth differs significantly among modules with plain structures, multi-branch structures, and our proposed Rich Structure, with the plain structure being much slimmer than the others. We redesigned these modules, replacing dimension reduction with the elastic adjuster, with R set to 3, i.e., widening the channel bandwidth by a factor of three, to determine how bandwidth affects performance and to verify the effectiveness of the elastic adjuster in different attention modules. The results are presented in Table 4; a dedicated experiment on the elastic adjuster within REMA follows in a later section.
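The redesign can be pictured as replacing the C → C/r bottleneck with a C → C·R projection. A hedged sketch of such an elastic-adjuster entry layer is given below, where R = 3 widens the in-branch bandwidth threefold and R < 1 would slim it; the actual REMA implementation may differ.

import torch.nn as nn

def elastic_adjuster(in_channels, ratio=1.0):
    """1x1 convolution that rescales the in-branch channel bandwidth by `ratio`.

    ratio > 1 widens the branch (performance-oriented), ratio < 1 slims it
    (size-oriented); ratio = 1 keeps the input bandwidth unchanged.
    """
    out_channels = max(int(round(in_channels * ratio)), 1)
    return nn.Conv2d(in_channels, out_channels, kernel_size=1)

# Example: widen a 64-channel branch by R = 3 (64 -> 192 channels).
widen = elastic_adjuster(64, ratio=3.0)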
The results show that for the 40-channel input, the redesigned, wider CBAM and SE improve on most datasets, bringing their performance close to that of the original CA and BAM, which previously performed better. This underscores the in-branch channel bandwidth, which ultimately determines the overall width of a module, as a key performance-related factor. These results show how plain structures and bottleneck-based dimension reduction limit the potential performance of these modules, and they confirm that the proposed elastic adjuster, used alongside Rich Structure, can enhance performance when needed under certain conditions. However, for the 64-channel input, the widened modules show a performance reduction, except for BAM, for which the redesign improves results on half of the datasets and reduces them on the others, with overall performance close to the original. This indicates that there is a limit to how much further performance can be gained by increasing the in-branch channel bandwidth.
4.3.6. Study of the Elastic Adjuster in REMA
To analyze the effect of in-branch channel bandwidth in REMA, we vary the elastic adjuster's ratio from 0.5 to 1.5 and compare the size-oriented and performance-oriented modes of REMA. The results are shown in Table 5.
The results indicate that the overall performance of size-oriented REMA is lower than that of performance-oriented REMA for the 40-channel input, showing the same trend as the widened versions of the other attention modules. However, for the 64-channel input, unlike the other widened attention modules, the widened REMA can still benefit from the increased bandwidth on some datasets, including BSDS100, Manga109, Set14, and Urban100. Additionally, the performance gap between the lowest and highest values for the 64-channel input is small, showing that REMA can flexibly meet different task requirements by adjusting the elastic adjuster's ratio.
There is still a limit to achieving more performance through parameter exchange. This limitation may stem from two aspects: input complexity and task-specific algorithms. Regarding the former, comparing the results of 40_1.5 and 64_0.6, it can be observed that they have similar numbers of parameters, yet 64_0.6 performs significantly better than 40_1.5, with the only difference being the number of input channels. This illustrates one of the reasons why models with more parameters do not always yield higher performance and why a plug-and-play module works in one network but not in another.
Concerning the latter, comparing REMA with the widened version of BAM (64_1.2), both of which have a multi-branch structure with the elastic adjuster and a similar overall channel bandwidth, REMA outperforms BAM on all datasets. Furthermore, the results demonstrate that a more effective parameter exchange provides extra robustness across datasets, although models with fewer parameters may perform better on certain individual datasets.
4.3.7. Size-Oriented vs. Performance-Oriented
To investigate further how lightweight structures affect performance, we compare Rich Structure (copy/rescale–transform–aggregate) with other size-oriented multi-branch designs. Specifically, we redesign REMA in Inception (split–transform–concatenate) and ResNeXt (split–transform–aggregate) styles. The split operation is achieved by setting the elastic adjuster to 1/3 in RSAM and 1/2 in RCAM, keeping the overall bandwidth the same as that of the input feature. The main difference between the Inception and ResNeXt versions lies in the topology of the transforming branches: in the Inception version the branches differ, whereas in the ResNeXt version they are identical. Hence, we also build an extra ResNeXt variant (ResNeXt_MS) that retains the multi-scale and multi-level feature fusion used in REMA, to verify its effectiveness.
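The topological difference can be sketched as follows: the size-oriented variants hand each branch a slice of the input, whereas Rich Structure feeds every branch the full-bandwidth input. The branch bodies below are identity placeholders, since only the split/copy topology is being illustrated.

import torch
import torch.nn as nn

def run_branches_split(x, branches):
    """ResNeXt/Inception style: split the input along C, one chunk per branch."""
    chunks = torch.chunk(x, len(branches), dim=1)       # each branch sees roughly C/k channels
    return torch.cat([b(c) for b, c in zip(branches, chunks)], dim=1)

def run_branches_copy(x, branches):
    """Rich Structure style: every branch receives the full-bandwidth input."""
    outs = [b(x) for b in branches]                     # each branch sees all C channels
    return torch.stack(outs, dim=0).sum(dim=0)          # aggregate branch outputs

branches = nn.ModuleList([nn.Identity() for _ in range(3)])  # placeholder branch bodies
x = torch.randn(1, 64, 32, 32)
y_split = run_branches_split(x, branches)   # (1, 64, 32, 32): concatenation of three slices
y_copy = run_branches_copy(x, branches)     # (1, 64, 32, 32): sum of full-width branch outputs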
To comprehensively discuss the parameter efficiency of size-oriented and performance-oriented structures, we also consider the scale factor, for two reasons. First, from the SR-task perspective, a higher scale factor makes SR inference more challenging. Second, from the network perspective, the network becomes more prone to overfitting as the scale factor increases, since we generate training data by downsampling the ground-truth image at the target scale factor; the input patch therefore becomes very small at 8×, potentially leading to overfitting for a module that performs well at 2× and 4×. In other words, 2×, 4×, and 8× represent three situations of increasing difficulty for every parameter that influences performance. Additionally, performance on the Historical dataset receives extra attention, as it represents an out-of-distribution (OOD) scenario. Therefore, we use these factors to test the module's compatibility and robustness, with the experimental results presented in Table 6.
According to the results, Rich Structure outperforms the other versions of REMA. Although Inception and ResNeXt_MS come close to the Rich Structure version on certain datasets or upscaling ratios, overall the Rich Structure version demonstrates the best capability across different datasets and networks, with less likelihood of overfitting. Moreover, ResNeXt_MS performs better than ResNeXt under 2× and 4×, and the two are comparable under 8×, highlighting the effectiveness of the multi-scale and multi-level feature fusion strategy in REMA. These findings demonstrate the higher compatibility and robustness of our method compared with other popular size-oriented multi-branch structures when applied in the backbone. Again, the results show that extra effective parameters can be exchanged for more robustness under different scale factors.
4.3.8. Study of REMA in the Reconstruction Layer
Additionally, given the application of REMA in the reconstruction layers at high scale factors, experiments are conducted at scale factors of 4× and 8×. Figure 3 illustrates the implementation of REMA in the reconstruction layer, with the corresponding results shown in Table 7.
In summary, the significance of REMA in the reconstruction blocks increases with larger scale factors. At 4×, it causes a performance decline on most datasets, which is why it is excluded from REMA-SRNet under 2× and 4×. At 8×, however, using REMA in reconstruction improves results on most datasets for one of the two channel configurations, while for the other the overall enhancement is less evident. Hence, REMA in the reconstruction layers improves performance at high scale ratios under specific conditions.
4.3.9. Study of REMA in Other SISR Networks
For further investigation, we incorporate REMA into UNet-SR, a super-resolution network based on the image segmentation network U-Net. UNet-SR employs skip connections for encoder–reconstruction feature fusion, enhancing reconstruction quality. We use this setup to assess REMA's effectiveness in another network and to evaluate its impact on performance when integrated into the skip connections. This extends the experiments beyond the backbone and reconstruction layers, since skip connections were not used in REMA-SRNet for fusing features of varying depth. The results are summarized in Table 8.
The results show that the performance of REMA, when added to the skip connection, surpasses that of other attention modules at the same position, indicating that REMA remains effective across SR models and integration positions. In fact, the number of input channels expands layer by layer from shallow to deep within the skip connections of UNet-SR, so this also suggests that Rich Structure's advantage becomes more pronounced when handling inputs with more filters, outperforming the other attention modules.
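A minimal sketch of this integration point is given below: an attention module refines the encoder feature before it is concatenated with the decoder feature in the skip connection. The layer names and the placeholder attention module are illustrative assumptions, not UNet-SR's actual code.

import torch
import torch.nn as nn

class AttentiveSkip(nn.Module):
    """Skip connection that refines the encoder feature with an attention module
    before the usual encoder/decoder concatenation in a U-Net-style SR network."""
    def __init__(self, attention: nn.Module):
        super().__init__()
        self.attention = attention  # e.g., REMA or another plug-and-play module

    def forward(self, enc_feat, dec_feat):
        enc_feat = self.attention(enc_feat)            # refine the skip feature
        return torch.cat([enc_feat, dec_feat], dim=1)  # fuse encoder and decoder features

# Usage sketch: skip = AttentiveSkip(nn.Identity()); fused = skip(enc, dec)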
4.3.10. Comparison with Other Comparative Methods
To comprehensively evaluate our methods, we compare REMA-SRNet (R = 1) with other SISR methods that employ similar approaches, such as residual, recursive, and multi-branch learning, as well as attention-based SR networks. Our experiments encompass both lightweight and large models, including VDSR [5], ESPCN [25], RCAN [27], PAN [18], A2N [36], DRLN [10], ESRGCNN [29], SwinIR [16], NLSN [20], and UNet-SR [30].
Table 9 displays the quantitative results for various scaling factors. In summary, REMA-SRNet outperforms the other state-of-the-art methods for 2×, 4×, and 8× upscaling on the benchmark datasets, showcasing its effectiveness. We further discuss the trends in these results from a parameter-efficiency perspective.
The results indicate that larger models do not necessarily deliver higher performance. In fact, size and performance show some positive correlation at 4×. As explained earlier, this is because complex models become prone to overfitting as the complexity of the training data decreases with increasing scale factor. For instance, RCAN and DRLN may achieve better results on certain datasets at 2× and 4× but perform worse than lightweight models such as PAN and A2N at 8× due to overfitting. Conversely, while lightweight models may excel at specific scale factors or on specific datasets, they may be insufficient for performance-prioritized tasks or broad compatibility requirements. Thus, parameter efficiency means not only achieving reasonable results with few parameters but also attaining optimal results while keeping the overall model size in check. Among the models tested, only REMA-SRNet and SwinIR achieve this balance, and REMA-SRNet generally outperforms SwinIR while using only 60% of its parameters (Figure 9).