A Multi-Branch Feature Extraction Residual Network for Lightweight Image Super-Resolution

Liu, Chunying; Wan, Xujie; Gao, Guangwei

doi:10.3390/math12172736

Open AccessArticle

A Multi-Branch Feature Extraction Residual Network for Lightweight Image Super-Resolution

by

Chunying Liu

¹,

Xujie Wan

¹ and

Guangwei Gao

^1,2,*

¹

Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing 210046, China

²

Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou 215006, China

^*

Author to whom correspondence should be addressed.

Mathematics 2024, 12(17), 2736; https://doi.org/10.3390/math12172736

Submission received: 6 July 2024 / Revised: 14 August 2024 / Accepted: 31 August 2024 / Published: 1 September 2024

(This article belongs to the Special Issue Advancement of Mathematical Methods in Feature Representation Learning for Artificial Intelligence, Data Mining and Robotics, 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

:

Single-image super-resolution (SISR) seeks to elucidate the mapping relationships between low-resolution and high-resolution images. However, high-performance network models often entail a significant number of parameters and computations, presenting limitations in practical applications. Therefore, prioritizing a light weight and efficiency becomes crucial when applying image super-resolution (SR) to real-world scenarios. We propose a straightforward and efficient method, the Multi-Branch Feature Extraction Residual Network (MFERN), to tackle lightweight image SR through the fusion of multi-information self-calibration and multi-attention information. Specifically, we have devised a Multi-Branch Residual Feature Fusion Module (MRFFM) that leverages a multi-branch residual structure to succinctly and effectively fuse multiple pieces of information. Within the MRFFM, we have designed the Multi-Scale Attention Feature Fusion Block (MAFFB) to adeptly extract features via convolution and self-calibration attention operations. Furthermore, we introduce a Dual Feature Calibration Block (DFCB) to dynamically fuse feature information using dynamic weight values derived from the upper and lower branches. Additionally, to overcome the limitation of convolution in solely extracting local information, we incorporate a Transformer module to effectively integrate global information. The experimental results demonstrate that our MFERN exhibits outstanding performance in terms of model parameters and overall performance.

Keywords:

single-image super-resolution; lightweight network; multi-branch feature extraction; transformer

MSC:

94A08

1. Introduction

Single-image super-resolution (SISR) poses a fundamental challenge in computer vision, seeking to enhance low-resolution (LR) input images to high-resolution (HR) ones. With the rise of deep neural networks, SISR has achieved remarkable success in generating high-quality outputs [1,2]. However, attaining such quality often entails the use of computationally intensive models, demanding substantial computing power, storage resources, and memory, consequently impeding both training and testing processes [3]. For instance, the RCAN [4] model, renowned for its performance in image SR tasks, comprises over 800 convolutional layers with approximately 15 million parameters, rendering it unsuitable for resource-constrained devices. Consequently, recent research has shifted towards developing lightweight SR network models to facilitate deployment on mobile devices.

Numerous methods utilize recursion techniques or parameter-sharing strategies to minimize the parameter count. For instance, the DRCN [5] and DRRN [6] utilized recursion mechanisms to explore lightweight model architectures. The CARN [4] employed weight sharing and group convolution to diminish the network parameters. The IMDN [7] leveraged residual feature extraction to enhance model efficiency. LCRCA [8] introduced a lightweight and efficient residual block that enhanced residual information within the same computational budget. The SFFN [9] advocated for the adoption of generic, lightweight, and efficient feature fusion blocks to replace conventional

1 \times 1

convolutions.

Traditional convolutional neural networks (CNNs) excel in extracting local features, but they often struggle in capturing the global context. For high-quality image reproduction, it is crucial to understand the overall structure and relationships within an image to generate detailed results. CNNs are typically less effective than self-attention mechanisms, like Transformers, in handling long-range dependencies. This limitation can affect the detail and coherence of image reconstruction. In contrast, Transformers excel in integrating global information, which enhances the image clarity and consistency. As a result, integrating Transformers into computer vision tasks has gained significant attention [10,11]. For instance, SwinIR [12] leveraged global information extraction and sliding windows to address the edge uncorrelation issue in SISR. Consequently, the synergistic combination of CNNs and Transformers, along with the fusion of local feature extraction and detailed texture recovery, plays a pivotal role in image reconstruction endeavors [13,14]. Many current SR model architectures adopt a serial structure, facilitating the gradual extraction of more comprehensive feature information to enhance the image quality. In models that integrate CNNs and Transformers, like LBNet [14], the backbone is bifurcated into two components, the CNN and Transformer, which are then connected serially to harness the strengths of both modalities, resulting in notable performance gains. However, Transformers can be resource-intensive, requiring significant computing power and memory, which can lead to inefficiencies in lightweight SISR tasks.

The pure CNN network lacks context modeling abilities, and simply adding a Transformer can introduce many parameters and computational challenges. To address this, we propose the Multi-Branch Feature Extraction Residual Network (MFERN), which combines CNNs and Transformers for effective yet lightweight image reconstruction. CNNs are strong in extracting local features, while Transformers capture the global context and long-term dependencies. By integrating these, the MFERN enhances the feature extraction comprehensiveness. Additionally, our series association approach increases the flexibility in processing features across different scales, adapting to various visual tasks while maintaining a lightweight model without heavy memory and computational demands. We establish a sequential connection between the CNN module and Transformer module using a tight coupling method to integrate both local and global features, thereby enhancing the model’s performance and generalization capabilities. Within the CNN component, we present the Multi-Branch Residual Feature Fusion Module (MRFFM), which mainly incorporates the Multi-Scale Attention Feature Fusion Block (MAFFB) and Attention Feature Fusion Block (AFFB) modules. These are integrated with a multi-residual branch structure following multiple channel splitting operations. In the MAFFB module, the Dual Feature Calibration Block (DFCB) dynamically combines feature information using combination coefficients and dynamic weights. Additionally, a Spatial Attention Calibration Module (SACB) and Channel Attention Calibration Module (CACB) are included to effectively extract spatial and channel information, respectively. The MRFFM module adeptly extracts and blends features of various scales. Furthermore, by reducing the number of channels and employing channel splitting, we ensure a lightweight model architecture without compromising the performance. Within the AFFB module, we merge the SACB and CACB modules to fuse features across different scales, introducing a multi-attention feature fusion approach.

The contributions of this paper are summarized as follows.

We introduce the Multi-Branch Residual Feature Fusion Module (MRFFM). This module uses distillation operations within a multi-residual branch structure to efficiently extract features while keeping the design lightweight. Two feature extraction modules are included to further enhance the feature extraction capabilities.
We design the Spatial Attention Calibration Block (SACB) and Channel Attention Calibration Block (CACB) to incorporate the attention mechanism into self-calibrated convolution. These blocks, combined with multi-information weighted fusion, enhance the model’s performance and generalization abilities.
We integrate a CNN with a Transformer to ensure the effective integration of local details and global information in the Multi-Branch Feature Extraction Residual Network (MFERN). Additionally, a dense connection structure is incorporated to enhance the reuse rate of feature information.

2. Related Work

2.1. SR Models

Given that the SRCNN [1], employing convolutional operations, outperforms existing SR methods, research on image SR models using CNNs has gained significant attention. However, simple CNN networks often struggle to achieve high-quality image detail feature reconstruction [3,4,15]. Consequently, exploring the integration of Transformers into image vision tasks has emerged as a primary focus. Transformers excel in handling global information features and offer superior performance compared to CNNs [16], with lower FLOP and parameter counts. Moreover, researchers have enhanced Transformers by expanding the receptive field [17], extracting diverse features [18], and expediting reasoning through approaches such as eliminating layer normalization and proposing simplified MHSA [19].

2.2. Lightweight SR Models

The research on lightweight SISR models aims to enable efficient image SR tasks on mobile devices [20,21]. Existing lightweight methods include efficient model structure designs [7,22], pruning or quantization techniques [23], and knowledge distillation [24]. Additionally, researchers have reduced the model size through methods such as weight sharing and channel grouping. For instance, strategies like channel splitting and layered distillation in models such as the IDN [25] and IMDN [7] enhance feature extraction. Recursive cascade learning aids in learning cross-layer feature representations [26], while some models reuse middle-layer features through recursive learning. Despite significant exploration, the unresolved thematic issues in lightweight SISR models require further investigation.

2.3. Attention Feature Fusion

Certain efforts aim to aggregate features of various dimensions across multiple visual tasks [27] to enhance the performance. In CNNs, researchers have integrated attentional mechanisms in the spatial and channel dimensions to enrich the feature expression, as seen in the SCA-CNN [28] and DANet [29]. In Transformers [30], spatial self-attention models the long-range dependencies between pixels. Additionally, some researchers have investigated the incorporation of channel attention into Transformers [31] to fuse spatial and channel information, thereby enhancing the modeling capabilities of Transformers.

3. Proposed Method

This section outlines the structure of our proposed Multi-Branch Feature Extraction Residual Network (MFERN). Initially, we introduce the sequential structure and the intensive connection operation between the CNN components and the Transformer backbone network. Subsequently, we describe the Multi-Branch Residual Feature Fusion Module (MRFFM) within the CNN block, encompassing the Multi-Scale Attention Feature Fusion Block (MAFFB) and Attention Feature Fusion Block (AFFB). Then, we introduce the Dual Feature Calibration Block (DFCB), Spatial Attention Calibration Module (SACB), and Channel Attention Calibration Module (CACB). Lastly, we present the specific characteristics of the Transformer.

3.1. Network Framework

As depicted in Figure 1, the Multi-Branch Feature Extraction Residual Network (MFERN) comprises a sequence of Multi-Branch Residual Feature Fusion Modules (MRFFM) and Transformer modules. We integrate the CNN with the Transformer module to effectively combine local and global feature information. This allows the model to better recover image texture details and reconstruct high-quality images. We denote the input LR image as

I_{L R}

, the model outputs as

I_{S R}

, and the HR images as

I_{H R}

. At the beginning of the model, a

3 \times 3

convolutional layer is utilized to extract shallow information:

F_{s f} = H_{s f} (I_{L R}),

(1)

where

H_{s f}

represents a

3 \times 3

convolutional layer (denoted as

F_{s f}

for shallow feature extraction). Subsequently,

F_{s f}

is forwarded to the CNN for local feature extraction. The network comprises four MRFFM modules, each consisting of three MAFFBs and one AFFB, enabling the extraction of additional local feature information through the multi-branch residual structure. The following formula expresses a portion of the CNN model operation:

F_{C N N} = H_{C N N} (F_{s f}),

(2)

where

H_{C N N}

represents CNN and local feature extraction, while

F_{C N N}

denotes the CNN output of local feature extraction. Once the local features of the image are obtained, they are sent to the Transformer module to extract global information.

F_{T r a n s} = H_{T r a n s} (F_{C N N}),

(3)

where

H_{T r a n s}

signifies the Transformer module, and

F_{T r a n s}

denotes feature reconstruction enhanced with global information. The reconstruction process can be expressed as

I_{S R} = H_{b u i l d 1} (F_{s f}) + H_{b u i l d 2} (F_{T r a n s}),

(4)

where

I_{S R}

denotes the ultimate output of the network,

H_{b u i l d 1}

represents the reconstruction module for

F_{s f}

, and

H_{b u i l d 2}

represents the reconstruction module for

F_{T r a n s}

. The reconstruction module comprises a

3 \times 3

convolutional layer and a pixel shuffle layer.

To ensure a fair comparison of the experimental results, we incorporate the

L_{1}

loss function to optimize our experimental model. For the training set

{\{I_{L R}^{i}, I_{H R}^{i}\}}_{i = 1}^{N}

consisting of N images, the objective of the MFERN model is to minimize the values defined by the following loss function formula:

L (Θ) = \frac{1}{N} \sum_{i = 1}^{N} {∥H_{M F E R N} (I_{L R}^{i}, Θ) - I_{H R}^{i}∥}_{1},

(5)

where

H_{M F E R N}

denotes the parameter set of the MFERN, and

{∥.∥}_{1}

is the

L_{1}

norm. The

Θ

indicates the parameter set of the proposed MFERN.

3.2. CNN Backbone

In the CNN segment, we introduce the Multi-Branch Residual Feature Fusion Module (MRFFM) to extract local information. All four modules utilize parameter-sharing technology to maintain the model’s lightweight nature. As depicted in Figure 1, the shallow feature

F_{s f}

sequentially traverses through the four MRFFMs, and each layer’s features can be reused via skip connections. Deep neural networks with many layers may face challenges during training due to small gradients in backpropagation. Skip connections facilitate gradient flow by directly transmitting input information to subsequent layers, stabilizing the gradient and simplifying network training. The above processes can be expressed as

F_{M_1} = F_{C_1} (H_{M_1} (F_{s f}) + F_{s f}),

(6)

F_{M_2} = F_{C_2} (H_{M_2} (F_{M_1}) + F_{M_1} + F_{s f}),

(7)

F_{M_3} = F_{C_3} (H_{M_3} (F_{M_2}) + F_{M_2} + F_{M_1} + F_{s f}),

(8)

F_{C N N} = H_{M_4} (F_{M_3}) + F_{M_3} + F_{M_2} + F_{M_1} + F_{s f},

(9)

where

H_{M_n}

represents the n-th MRFFM, and

F_{M_n}

denotes the feature information extracted by the n-th MRFFM.

F_{C_n}

signifies the n-th convolution operation.

F_{C N N}

represents the feature information extracted by the CNN framework.

The MRFFM comprises two additional modules, as shown in Figure 2: the Multi-Scale Attention Feature Fusion Block (MAFFB) and the Attentional Feature Fusion Block (AFFB). The method involves distilling multiple pieces of feature information to extract and fuse residual branch features multiple times, enhancing its feature expression capabilities. Within the MRFFM branch, we introduce the MAFFB, illustrated in the figure. The MAFFB includes the Dual Feature Calibration Block (DFCB), the Spatial Attention Calibration Module (SACB), and the Channel Attention Calibration Module (CACB).

Dual Feature Calibration Block (DFCB): As illustrated in Figure 3, within the DFCB, the features produced by the upper and lower branches are initially weighted and merged using combination coefficients (CC) [32]. The CC structure, as depicted in Figure 4, employs an attention mechanism to generate weight coefficients for adaptive information selection. Subsequently, the features pass through the Enhanced Spatial Attention (ESA) module with combined coefficients. This process extracts spatial information again, followed by input to two pooling layers simultaneously. Dynamic weight values are then derived post-convolution and activation, multiplying them with the branch features. Finally, the output is added back to the initial input. Through adaptive weights, dual pooling layers, and dynamic adaptive weight integration, the DFCB module excels in extracting valuable feature information efficiently. We denote the input of the DFCB as

X_{i n}^{d}

, and the aforementioned process can be expressed as

F_{o u t_u p_1} = H_{c o n v ↓} (X_{i n}^{d}),

(10)

F_{o u t_d o w n_1} = H_{c o n v ↓} (F_{o u t_u p_1}),

(11)

F_{o u t_u p_2} = F_{o u t_u p_1} + C_{i}^{u} (F_{o u t_d o w n_1}),

(12)

F_{o u t_d o w n_2} = F_{o u t_d o w n_1} + C_{i}^{d} (F_{o u t_u p_1}),

(13)

F_{E S A} = H_{E S A} (F_{o u t_u p_2} + F_{o u t_d o w n_2}),

(14)

F_{m i d_c o n c a t} = C o n c a t (H_{a v g} (F_{E S A}), H_{m a x} (F_{E S A})),

(15)

w 1, w 2 = H_{s p l i t} (H_{s i g m o i d} (H_{c o n v} (F_{m i d_c o n c a t}))),

(16)

F_{o u t}^{d} = w_{1} F_{o u t_u p_2} + w_{2} F_{o u t_d o w n_2} + X_{i n}^{d},

(17)

where

H_{c o n v_{↓}}

represents the operation of channel down-dimensioning and

H_{s i g m o i d}

signifies the sigmoid activation function utilized for nonlinear processing.

F_{o u t_u p_i}

and

F_{o u t_d o w n_i}

(i = 1, 2) represent the output of layer i of the upper and lower branches, respectively.

H_{E S A}

denotes the operation of the Efficient Spatial Attention (ESA) module, and

F_{E S A}

represents the output of the ESA.

C_{i}^{u}

and

C_{i}^{d}

represent the two combined coefficient learning mechanisms connecting the upper and lower branches.

H_{a v g}

and

H_{m a x}

represent the average pooling and max pooling functions.

F_{m i d_c o n c a t}

represents the fusion output that will pass through the features of the upper and lower branches of the two pooling layers.

w_{1}

and

w_{2}

stand for the dynamic weights of the two branches.

H_{s p l i t}

expresses the channel split function, and

F_{o u t}^{d}

represents the output of the DFCB unit.

Multi-Scale Attention Feature Fusion Block (MAFFB): The MAFFB initially utilizes a

1 \times 1

convolution to reduce the number of channels, followed by feature extraction through a Dual Feature Calibration Block (DFCB), as shown in Figure 2. The

1 \times 1

convolution conducts the linear combination of the pixels across the channels, maintaining the image’s spatial structure and modifying its depth. This process is beneficial for dimensionality adjustment, whether reducing or expanding it. Subsequently, after the number of channels is restored by

1 \times 1

convolution, two DFCB operations are sequentially employed for feature extraction. Features from the second

1 \times 1

convolution are processed through the upper and lower branches in a series of modules. Initially, feature extraction occurs in the upper branch’s first DFCB module, followed by output generation in the lower branch through convolution and the Spatial Attention Calibration (SACB) module. The fusion of the upper- and lower-branch features produces a spatial-level output. This output then undergoes processing via convolution and the Channel Attention Calibration (CACB) module for channel-level feature information. The intermediate feature output is combined with the adaptive weights from the upper branch’s first DFCB module. The MAFFB module employs spatial and channel self-calibrating attention to merge the information effectively at both levels, utilizing dynamic weights to boost the feature extraction efficiency and effectiveness. We denote the input of this module as

X_{i n}^{m}

. This process can be represented as

F_{o u t_1} = H_{c o n v_{↑}} (H_{D F C B_1} (H_{c o n v_{↓}} (X_{i n}^{m}))),

(18)

F_{o u t_n} = H_{D F C B_n} (F_{o u t_n - 1}) (n = 2, 3),

(19)

F_{m i d} = η_{1} (H_{S A C B} (H_{c o n v_3} (F_{o u t_1}))) + η_{2} F_{o u t_2},

(20)

F_{o u t}^{m} = η_{3} (H_{C A C B} (H_{c o n v_4} (F_{m i d}))) + η_{4} F_{o u t_3},

(21)

where

H_{c o n v_{↓}}

represents the operation of channel down-dimensioning in the first

1 \times 1

convolution, and

H_{c o n v_{↑}}

represents the operation of channel up-dimensioning in the second

1 \times 1

convolution.

F_{o u t_n}

represents the output of the n-th DFCB.

H_{D F C B_n}

denotes the operation of the n-th DFCB module.

F_{o u t_m i d}

represents the intermediate output of the MAFFB unit.

η_{i}

(i = 1, 2, 3, 4) denotes the adaptive weighted multipliers of the n-th output of the DFCB unit, and

F_{o u t}^{m}

represents the output of the MAFFB unit.

Channel (Spatial) Attention Calibration Block (CACB/SACB): In the CACB module, as shown in Figure 3, the raw input undergoes processing through three branches. The first branch employs basic convolution to reduce the number of channels while preserving the original features. The second branch focuses on extracting spatial information through a combination of convolutional and activation functions. The third branch enhances the feature extraction by incorporating channel attention. Subsequently, the outputs from the three branches are weighted to achieve multi-scale feature fusion. Similarly, the Spatial Attention Calibration Module (SACA) operates akin to the CACB, but with the channel attention (CA) module replaced by spatial attention (SA), enabling the extraction of more valuable spatial features. Spatial and channel attention mechanisms are integrated into self-calibrating convolution to dynamically establish relationships between each spatial position and its neighboring features. This enhancement improves the standard convolution layer’s performance by effectively broadening the receptive field of each spatial position without introducing additional parameters or escalating the model complexity. We denote the input of the unit as

X_{i n}^{c}

. The process can be expressed as

F_{o u t_l e f t} = H_{C A} (H_{r e l u} (H_{c o n v} (H_{s p l i t} (X_{i n}^{c})))),

(22)

F_{o u t_r i g h t} = H_{c o n v} (H_{r e l u} (H_{c o n v} (H_{s p l i t} (X_{i n}^{c})))),

(23)

F_{o u t_m i d} = H_{c o n v ↓} (X_{i n}^{c}),

(24)

F_{a d d} = β_{1} F_{o u t_l e f t} + β_{2} F_{o u t_m i d} + β_{3} F_{o u t_r i g h t},

(25)

F_{o u t}^{c} = H_{c o n v ↑} (F_{a d d}) + X_{i n}^{c},

(26)

where

F_{o u t_l e f t}

,

F_{o u t_r i g h t}

, and

F_{o u t_m i d}

represent the output of the left branch, right branch, and middle branch, respectively.

H_{C A}

refers to the channel attention operation.

H_{c o n v_{↓}}

represents the operation of channel down-dimensioning using

1 \times 1

convolution, while

H_{c o n v_{↑}}

denotes the channel up-dimensioning operation of the final

1 \times 1

convolution layer.

H_{r e l u}

signifies the Rectified Linear Unit (ReLU) activation function utilized for nonlinear processing.

F_{o u t}^{c}

represents the ultimate output of the CACB unit. The symbols

β_{i}

(i = 1, 2, 3) express the adaptive weighted multipliers applied to the outputs of the three branches within the CACB unit. The SACB operates similarly to the CACB, with the distinction that it replaces the channel attention mechanism with spatial attention to extract more beneficial spatial features.

Multi-Branch Residual Feature Fusion Module (MRFFM): In the MRFFM, as shown in Figure 2, the features undergo a sequence of three channel splitting operations, each resulting in two branches. One branch retains its feature, while the other is forwarded to the subsequent layer for additional feature extraction via convolution and MAFFB operations. A convolutional layer is integrated into the distillation connection segment to augment the dimensionality of the split channels. Subsequently, the features preserved post-split are concatenated and fused and then passed into the AFFB module. The AFFB module combines elements from both the SACB and CACB, as depicted in the figure. Initially, the original input undergoes channel operation splitting, followed by concatenating and fusing the features extracted by the SACB and CACB modules with the original features post-split. Finally, the weighted output is added. In the MRFFM, layered features from various residual branches are combined to integrate shallow and deep image features comprehensively. This method allows the model to concentrate effectively on important image features, increases the utilization of feature information, and enhances the restoration of intricate image details. We denote the input of the module as

F_{i n}^{m r}

. The aforementioned operations can be expressed as

F_{d i s t i l l e d_1}, F_{r e m a i n i n g_1} = H_{s p l i t_1} (H_{M A F F B_1} (H_{c o n v_1} (F_{i n}^{m r}))),

(27)

F_{d i s t i l l e d_2}, F_{r e m a i n i n g_2} = H_{s p l i t_2} (H_{M A F F B_2} (H_{c o n v_2} (F_{r e m a i n i n g_1}))),

(28)

F_{d i s t i l l e d_3}, F_{r e m a i n i n g_3} = H_{s p l i t_3} (H_{M A F F B_3} (H_{c o n v_3} (F_{r e m a i n i n g_2}))),

(29)

F_{d i s t i l l e d_4} = H_{c o n v_4} (F_{r e m a i n i n g_3}),

(30)

F_{c o n c a t} = C o n c a t (F_{d i s t i l l e d_1}, F_{d i s t i l l e d_2}, F_{d i s t i l l e d_3}, F_{d i s t i l l e d_4}),

(31)

F_{o u t}^{m r} = λ_{1} (H_{A F F B} (H_{c o n v_5} (F_{c o n c a t}))) + λ_{2} F_{i n}^{m r} .

(32)

Here,

F_{r e m a i n i n g_n}

represents the n-th remaining features,

F_{d i s t i l l e d_n}

denotes the n-th distilled features,

H_{M A F F B_n}

represents the n-th MAFFB unit,

H_{s p l i t_n}

expresses the n-th channel split function, and

F_{o u t}^{m r}

represents the output of the MRFFM. “Concat” denotes the fusion of features from the channel dimensions.

λ_{i}

(i = 1, 2) indicates the adaptive weight applied to the output by adding the features of the AFFB module and the input features.

H_{A F F B}

represents the operation of the AFFB, and

F_{c o n c a t}

represents the output of the Concat operation.

3.3. Transformer Backbone

The CNN component alone is insufficient in reconstructing high-quality images. Integrating local and global information is essential. Thus, a Transformer structure is incorporated to capture long-term image dependencies, and a recursive mechanism is introduced to leverage the Transformer’s performance benefits while ensuring minimal computational costs. Similar to ESRT, we initially pass the features through a linear layer, as shown in Figure 5, resulting in Q, K, and V values, which are subsequently split along the width and height dimensions. These are expressed as follows:

(Q_{1} \dots Q_{n}), (K_{1} \dots K_{n}), (V_{1} \dots V_{n}) = F_{S p l i t} (Q, K, V) .

(33)

Simultaneously, a feature reduction strategy is employed to further decrease the memory consumption. Each head of the multi-head attention (MHA) mechanism must execute a scaled dot product attention operation, followed by concatenating all outputs and applying a linear transformation to obtain the final output. The scaled dot product attention can be defined as

\begin{matrix} O_{i} = A t t e n t i o n (Q_{i}, K_{i}, V_{i}) = S o f t m a x (\frac{Q_{i} K_{i}^{T}}{\sqrt{d_{k}}}) V_{i}, \end{matrix}

(34)

\begin{matrix} A t t e n t i o n (Q, K, V) = C o n c a t (O_{1}, \dots, O_{n}), \end{matrix}

(35)

where Q, K, and V refer to the query matrix, key matrix, and value matrix, respectively, and

S o f t m a x

represents the softmax operation function.

4. Experiments

4.1. Datasets and Evaluation Metrics

In this experiment, we utilized DIV2K as the training dataset. To thoroughly assess the performance, we employed five benchmark datasets to validate the effectiveness of the MFERN, comprising Set5 [33], Set14 [34], Urban100 [35], BSDS100 [36], and Manga109 [37]. Furthermore, the PSNR and SSIM [38] served as evaluation metrics to gauge the performance of the SR images in the YCbCr color space and the Y channel.

4.2. Implementation Details

This study involved augmenting the training dataset with random rotations and horizontal flips at different angles to improve the data diversity. The model underwent training for 1000 epochs, utilizing the PyTorch framework and the Adam optimizer for updates. The initial learning rate was configured as 5 × 10⁻⁴ and decayed to 6.25 × 10⁻⁶, following the cosine annealing strategy. Within the model architecture, both the CNN and Transformer modules feature an input channel size of 32 channels. All experiments were conducted on an NVIDIA RTX 2080Ti GPU for three days.

4.3. Comparison with State-of-the-Art

We conducted a comparative analysis of the MFERN against prominent lightweight image SR models across established benchmark datasets. These included VDSR [15], IDN [25], CARN [39], IMDN [7], MADNet [40], DCDN [20], SMSR [41], ECBSR [21], LAPAR-A [42], HPUN-M [43], GLADSR [44], LCRCA [8], DRSAN-48s [45], LatticeNet [32], AFAN-M [46], and FDSCSR-S [47]. The quantitative assessment of the SR for

\times 2

,

\times 3

, and

\times 4

image SR is presented in Table 1. From the table, it is evident that our MFERN model demonstrates exceptional performance while maintaining a modest number of model parameters. Notably, significant improvements are observed across the Set14, B100, and Manga109 datasets. It is noteworthy that our model’s training on the RTX 2080Ti GPU, combined with its minimal parameter count, represents a key advantage for us.

Furthermore, we compared the MFERN’s visual effects with those of other models. In Figure 6, we select three images from the Set14 and Urban100 datasets, varying in resolution. These include both large and small images, focusing on areas with detailed textures. We compared our model’s reconstruction visually with the outputs from other advanced models, listing and comparing the PSNR and SSIM values for each image. Our findings show that the MFERN not only outperformed the other models in terms of the image quality per image but also excelled in restoring intricate details.

Upon closer examination of the

\times 2

resolution factor, we selected Urban100 (

2 \times

): img_062, where the reconfigured composition revealed a detailed texture. Despite minor blurring, the MFERN significantly outperformed the other methods in restoring the image textures. Similarly, at the

\times 3

resolution factor, we chose Urban100 (

3 \times

): img_048. The MFERN accurately restored the building’s exterior lines, enhancing the clarity. Notably, in Set14 (

4 \times

): barbara_HR_

\times 4

, focusing on the prominent texture details, such as the chair lines, the MFERN’s result closely resembled the HR image, while the other methods produced blurred results.

4.4. Ablation Studies

4.4.1. Effectiveness of MRFFM

To assess the efficacy of the MRFFM, we replaced the MRFFM module with the IMDB [7] and RFDB [49] modules, respectively. To ensure experimental integrity, all model parameters were adjusted to approximately 700K, and it was trained at a

\times 4

resolution and evaluated on the Manga109 test dataset. The results, presented in Table 2, indicate that our designed MAFFB module outperformed the other two modules within the same framework and with similar parameters. While the MRFFM had slightly higher computational demands compared to the other modules, the performance gain far outweighed this increase. Thus, the effectiveness of the MRFFM module is convincingly demonstrated.

4.4.2. AFFB Validity

To assess the validity of the AFFB module, we excluded it from the MRFFM module, trained the model at a

\times 4

image SR, and evaluated it on the BSD100 dataset. The test results are presented in Table 3, demonstrating that the AFFB module that we designed consistently enhances the network model’s performance.

4.4.3. Effectiveness of MAFFB

Subsequently, we assessed the effectiveness of each module within the MAFFB through ablation experiments. Initially, we conducted an ablation experiment to evaluate the efficacy of the MAFFB itself. Obtained after directly removing the MAFFB from the MRFFM module, the data presented in Table 3 illustrate a significant decrease in model performance, affirming its importance.

Additionally, we individually evaluated the effectiveness of the DFCB, SACB, and CACB modules within the MAFFB. These modules were removed one by one, and the model was trained under

\times 4

resolution factors and evaluated on the Urban100 dataset. The results are summarized in Table 4, indicating a notable decline in model performance upon the removal of these modules. This underscores the effectiveness of our designed modules.

4.4.4. Effectiveness of DFCB

To validate the effectiveness of the DFCB module structure that we designed, we conducted experiments by removing the CC structure and ESA module from the original DFCB module individually. These experiments were performed at a

\times 4

image SR and verified using the Set14 dataset. The results are presented in Table 5, demonstrating the rationality of our DFCB module design.

4.4.5. Dense Connection (DC)

We conducted an ablation experiment on the dense connection part of the overall structure, i.e., we removed the CNN part, the Trans part, and the DC structure of the entire network architecture, respectively, and verified its validity on the Urban100 dataset. As shown in Table 6, the data indicate that, in the three experiments, the number of model parameters with or without the DC structure and the amount of calculation did not change significantly. However, after the DC structure was added to both parts, the model performance was significantly improved, which verifies the validity of our model structure.

4.4.6. Comparison with Some Transformer-Based Methods

We benchtested the MFERN against several existing Transformer-based approaches, namely SwinIR [12], ESRT [13], and LBNet [14]. As shown in Table 7, the MFERN proposed by us does not exceed SwinIR [12] in performance. However, compared with SwinIR and using the Flickr2k dataset for training, we used DIV2k to train the model and the model was much lighter, achieving a good balance between performance and efficiency.

4.5. Model Complexity Studies

We comprehensively compared the model’s complexity with that of the existing methods, including the number of model parameters and computation, on the Set5 test set. Figure 7 shows the good balance of our MFERN with a small number of parameters and a fast execution time. It shows the efficiency and effectiveness of our model.

5. Conclusions

In this study, we introduce the Multi-Branch Feature Extraction Residual Network (MFERN) tailored to efficient image SR tasks. The CNN component of the MFERN comprises four Multi-Branch Residual Feature Fusion Modules (MRFFM), which optimize the model parameters through parameter sharing. Additionally, we employ a dense connection methodology to amalgamate information across all layers, enhancing the effective utilization of the feature data. The MRFFM integrates a multi-branch residual structure to maintain model efficiency while ensuring robust feature extraction. Moreover, we introduce the Multi-Scale Attention Feature Fusion Block (MAFFB) and Attentional Feature Fusion Block (AFFB), leveraging attention mechanisms and dynamic weight operations to extract valuable information and enhance the model performance through output fusion. Concurrently, by serially integrating CNN and Transformer modules, the MFERN adeptly addresses both local details and global features, thereby boosting the model performance. By effectively balancing the model size and performance, the MFERN efficiently tackles image SR tasks. Meanwhile, it is necessary to explore more effective ways to combine the strengths of CNNs and Transformers. Additionally, our future goal will be to develop even lighter models without compromising the performance.

Author Contributions

Methodology, C.L.; Writing—original draft, C.L. and X.W.; Writing—review & editing, G.G.; Supervision, G.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Open Fund Project of Provincial Key Laboratory for Computer Information Processing Technology (Soochow University) under Grant KJS2274.

Data Availability Statement

The results/data/figures in this manuscript have not been published elsewhere, nor are they under consideration by another publisher. The original contributions presented in the study are included in the article, and further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
Li, W.; Wang, M.; Zhang, K.; Li, J.; Li, X.; Zhang, Y.; Gao, G.; Deng, W.; Lin, C.W. Survey on Deep Face Restoration: From Non-blind to Blind and Beyond. arXiv 2023, arXiv:2309.15490. [Google Scholar]
Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27– 30 June 2016; pp. 1637–1645. [Google Scholar]
Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 624–632. [Google Scholar]
Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the ACM International Conference on Multimedia, Nice, France, 2–25 October 2019; pp. 2024–2032. [Google Scholar]
Peng, C.; Shu, P.; Huang, X.; Fu, Z.; Li, X. LCRCA: Image super-resolution using lightweight concatenated residual channel attention networks. Appl. Intell. 2022, 52, 10045–10059. [Google Scholar] [CrossRef]
Wang, Z.; Liu, Y.; Zhu, R.; Yang, W.; Liao, Q. Lightweight single image super-resolution with similar feature fusion block. IEEE Access 2022, 10, 30974–30981. [Google Scholar] [CrossRef]
Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
Yu, R.; Du, D.; LaLonde, R.; Davila, D.; Funk, C.; Hoogs, A.; Clipp, B. Cascade transformers for end-to-end person search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7267–7276. [Google Scholar]
Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
Lu, Z.; Li, J.; Liu, H.; Huang, C.; Zhang, L.; Zeng, T. Transformer for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 457–466. [Google Scholar]
Gao, G.; Wang, Z.; Li, J.; Li, W.; Yu, Y.; Zeng, T. Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer. In Proceedings of the International Joint Conference on Artificial Intelligence, Messe Wien, Vienna, 23–29 July 2022; pp. 661–669. [Google Scholar]
Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22367–22377. [Google Scholar]
Li, A.; Zhang, L.; Liu, Y.; Zhu, C. Feature modulation transformer: Cross-refinement of global representation via high-frequency prior for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 12514–12524. [Google Scholar]
Liu, Y.; Dong, H.; Liang, B.; Liu, S.; Dong, Q.; Chen, K.; Chen, F.; Fu, L.; Wang, F. Unfolding Once is Enough: A Deployment-Friendly Transformer Unit for Super-Resolution. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 7952–7960. [Google Scholar]
Li, Y.; Cao, J.; Li, Z.; Oh, S.; Komuro, N. Lightweight single image super-resolution with dense connection distillation network. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 1–17. [Google Scholar] [CrossRef]
Zhang, X.; Zeng, H.; Zhang, L. Edge-oriented convolution block for real-time super resolution on mobile devices. In Proceedings of the ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; pp. 4034–4043. [Google Scholar]
Gao, G.; Li, W.; Li, J.; Wu, F.; Lu, H.; Yu, Y. Feature distillation interaction weighting network for lightweight image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2022; Volume 36, pp. 661–669. [Google Scholar]
Li, H.; Yan, C.; Lin, S.; Zheng, X.; Zhang, B.; Yang, F.; Ji, R. Pams: Quantized super-resolution via parameterized max scale. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 564–580. [Google Scholar]
Lee, W.; Lee, J.; Kim, D.; Ham, B. Learning with privileged information for efficient image super-resolution. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 465–482. [Google Scholar]
Hui, Z.; Wang, X.; Gao, X. Fast and accurate single image super-resolution via information distillation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 723–731. [Google Scholar]
Tai, Y.; Yang, J.; Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3147–3155. [Google Scholar]
Li, W.; Li, J.; Gao, G.; Deng, W.; Yang, J.; Qi, G.J.; Lin, C.W. Efficient Image Super-Resolution with Feature Interaction Weighted Hybrid Network. arXiv 2022, arXiv:2212.14181. [Google Scholar]
Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667. [Google Scholar]
Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Li, W.; Guo, H.; Liu, X.; Liang, K.; Hu, J.; Ma, Z.; Guo, J. Efficient Face Super-Resolution via Wavelet-based Feature Enhancement Network. ACMMM 2024.
Luo, X.; Qu, Y.; Xie, Y.; Zhang, Y.; Li, C.; Fu, Y. Lattice network for lightweight image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4826–4842. [Google Scholar] [CrossRef] [PubMed]
Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference, Surrey, UK, 3–7 September 2012; pp. 135.1–135.10. [Google Scholar]
Zeyde, R.; Elad, M.; Protter, M. On single image scale-up using sparse-representations. In Proceedings of the International Conference on Curves and Surfaces, Oslo, Norway, 3–28 June 2012; pp. 711–730. [Google Scholar]
Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–5 June 2015; pp. 5197–5206. [Google Scholar]
Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the IEEE International Conference on Computer Vision, Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 416–423. [Google Scholar]
Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-based manga retrieval using manga109 dataset. Multimed. Tools Appl. 2017, 76, 21811–21838. [Google Scholar] [CrossRef]
Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
Ahn, N.; Kang, B.; Sohn, K.A. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 252–268. [Google Scholar]
Lan, R.; Sun, L.; Liu, Z.; Lu, H.; Pang, C.; Luo, X. MADNet: A fast and lightweight network for single-image super resolution. IEEE Trans. Cybern. 2020, 51, 1443–1453. [Google Scholar] [CrossRef] [PubMed]
Wang, L.; Dong, X.; Wang, Y.; Ying, X.; Lin, Z.; An, W.; Guo, Y. Exploring sparsity in image super-resolution for efficient inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4917–4926. [Google Scholar]
Li, W.; Zhou, K.; Qi, L.; Jiang, N.; Lu, J.; Jia, J. Lapar: Linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond. Adv. Neural Inf. Process. Syst. 2020, 33, 20343–20355. [Google Scholar]
Sun, B.; Zhang, Y.; Jiang, S.; Fu, Y. Hybrid pixel-unshuffled network for lightweight image super-resolution. arXiv 2022, arXiv:2203.08921. [Google Scholar] [CrossRef]
Zhang, X.; Gao, P.; Liu, S.; Zhao, K.; Li, G.; Yin, L.; Chen, C.W. Accurate and efficient image super-resolution via global-local adjusting dense network. IEEE Trans. Multimed. 2021, 23, 1924–1937. [Google Scholar] [CrossRef]
Park, K.; Soh, J.W.; Cho, N.I. A Dynamic Residual Self-Attention Network for Lightweight Single Image Super-Resolution. IEEE Trans. Multimed. 2023, 25, 907–918. [Google Scholar] [CrossRef]
Wang, L.; Li, K.; Tang, J.; Liang, Y. Image super-resolution via lightweight attention-directed feature aggregation network. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–23. [Google Scholar] [CrossRef]
Wang, Z.; Gao, G.; Li, J.; Yan, H.; Zheng, H.; Lu, H. Lightweight feature de-redundancy and self-calibration network for efficient image super-resolution. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–15. [Google Scholar] [CrossRef]
Wang, C.; Li, Z.; Shi, J. Lightweight image super-resolution with adaptive weighted learning network. arXiv 2019, arXiv:1904.02358. [Google Scholar]
Liu, J.; Tang, J.; Wu, G. Residual feature distillation network for lightweight image super-resolution. In Proceedings of the European Conference on Computer Vision Workshops, Glasgow, UK, 23–28 August 2020; pp. 41–55. [Google Scholar]

Figure 1. The architecture of the Multi-Branch Feature Extraction Residual Network (MFERN).

Figure 2. The architecture of the proposed Multi-Branch Residual Feature Fusion Module (MRFFM) and its components: the Multi-Scale Attention Feature Fusion Block (MAFFB) and the Attention Feature Fusion Block (AFFB).

Figure 3. The architecture of the Spatial Attention Calibration Block (SACB), Channel Attention Calibration Module (CACB), and Dual Feature Calibration Block (DFCB) units. In the DFCB,

C_{i}^{u}

and

C_{i}^{d}

denote the combination coefficient (CC) learning, which is elaborated in Figure 4.

Figure 3. The architecture of the Spatial Attention Calibration Block (SACB), Channel Attention Calibration Module (CACB), and Dual Feature Calibration Block (DFCB) units. In the DFCB,

C_{i}^{u}

and

C_{i}^{d}

denote the combination coefficient (CC) learning, which is elaborated in Figure 4.

Figure 4. The processes of combination coefficient (CC) learning and enhanced spatial attention (ESA).

Figure 5. The architecture of the Transformer block (Trans).

Figure 6. Visual comparisons of MFERN with AWSRN-M [48], CARN [39], CARN-M [39], FDIWN [22], MADNET [40], IMDN [7], and VDSR [15] on Set14 and Urban100 datasets.

Figure 7. Model execution time study on Set5 dataset (

\times 4

).

Figure 7. Model execution time study on Set5 dataset (

\times 4

).

Table 1. Average PSNR/SSIM values for scales

\times 2

,

\times 3

, and

\times 4

on the Set5, Set14, BSD100, Urban100, and Manga109 datasets. The best and second best indexes are highlighted in bold and underlined.

Table 1. Average PSNR/SSIM values for scales

\times 2

,

\times 3

, and

\times 4

on the Set5, Set14, BSD100, Urban100, and Manga109 datasets. The best and second best indexes are highlighted in bold and underlined.

Method	Scale	Params	Multi-Adds	Set5	Set14	BSD100	Urban100	Manga109
Method	Scale	Params	Multi-Adds	PSNR/SSIM	PSNR/SSIM	PSNR/SSIM	PSNR/SSIM	PSNR/SSIM
IDN [25]	$\times 2$	553K	124.6G	37.83/0.9600	33.30/0.9148	32.08/0.8985	31.27/0.9196	38.01/0.9749
CARN [39]		1592K	222.8G	37.76/0.9590	33.52/0.9166	32.09/0.8978	31.92/0.9256	38.32/0.9765
IMDN [7]		694K	158.8G	38.00/0.9605	33.63/0.9177	32.19/0.8996	32.17/0.9283	38.88/0.9774
MADNet [40]		878K	187.1G	37.85/0.9600	33.38/0.9161	32.04/0.8979	31.62/0.9233	-
DCDN [20]		756K	-	38.01/0.9606	33.52/0.9166	32.17/0.8996	32.16/0.9283	38.70/0.9773
SMSR [41]		985K	351.5G	38.00/0.9601	33.64/0.9179	32.17/0.8990	32.19/0.9284	38.76/0.9771
ECBSR [21]		596K	137.3G	37.90/0.9615	33.34/0.9178	32.10/0.9018	31.71/0.9250	-
LAPAR-A [42]		548K	171.0G	38.01/0.9605	33.62/0.9183	32.19/0.8999	32.10/0.9283	38.67/0.9772
HPUN-M [43]		492K	106.2G	38.03/0.9604	33.60/0.9185	32.20/0.9000	32.09/0.9282	38.83/0.9775
GLADSR [44]		812K	187.2G	37.99/0.9608	33.63/0.9179	32.16/0.8996	32.16/0.9283	-
LCRCA [8]		813K	186.0G	38.05/0.9607	33.65/0.9181	32.17/0.8994	32.19/0.9285	-
DRSAN-48s [45]		650K	150.0G	38.08/0.9609	33.62/0.9175	32.19/0.9002	32.16/0.9286	-
LatticeNet [32]		756K	169.5G	38.06/0.9607	33.70/0.9187	32.20/0.8999	32.25/0.9288	-
AFAN-M [46]		682K	163.4G	37.99/0.9605	33.57/0.9175	32.14/0.8994	32.08/0.9277	38.58/0.9769
FDSCSR-S [47]		466K	121.8G	38.02/0.9606	33.51/0.9174	32.18/0.8996	32.24/0.9288	38.67/0.9771
MFERN (Ours)		691K	173.2G	38.05/0.9609	33.67/0.9181	32.19/ 0.8998	32.27/0.9295	38.86/0.9777
IDN [25]	$\times 3$	553K	56.3G	34.11/0.9253	29.99/0.8354	28.95/0.8013	27.42/0.8359	32.71/0.9381
CARN [39]		1592K	118.8G	34.29/0.9255	30.29/0.8407	29.06/0.8493	28.06/0.8493	33.43/0.9427
IMDN [7]		703K	71.5G	34.36/0.9270	30.32/0.8417	29.09/0.8046	28.17/0.8519	33.61/0.9445
MADNet [40]		930K	88.4G	34.16/0.9253	30.21/0.8398	28.98/0.8023	27.77/0.8439	-
DCDN [20]		765K	-	34.41/0.9273	30.31/0.8417	29.08/0.8045	28.17/0.8520	33.54/0.9441
SMSR [41]		993K	156.8G	34.40/0.9270	30.33/0.8412	29.10/0.8050	28.25/0.8536	33.68/0.9445
LAPAR-A [42]		594K	114.0G	34.36/0.9267	30.34/0.8421	29.11/0.8054	28.15/0.8523	33.51/0.9441
HPUN-M [43]		500K	48.1G	34.39/0.9269	30.33/0.8420	29.11/0.8052	28.06/0.8508	33.54/0.9441
GLADSR [44]		821K	88.2G	34.41/0.9272	30.37/0.8418	29.08/0.8050	28.24/0.8537	-
LCRCA [8]		822K	83.6G	34.40/0.9269	30.36/0.8422	29.09/0.8049	28.21/0.8532	-
DRSAN-48s [45]		750K	78.0G	34.47/0.9274	30.35/0.8422	29.11/0.8060	28.26/0.8542	-
LatticeNet [32]		765K	76.3G	34.40/0.9272	30.32/0.8416	29.10/0.8049	28.19/0.8513	-
AFAN-M [46]		681K	80.8G	34.35/0.9263	30.31/0.8423	29.06/0.8053	28.11/0.8522	33.44/0.9440
FDSCSR-S [47]		471K	54.6G	34.42/0.9274	33.37/0.8429	29.10/0.8052	28.20/0.8532	33.55/0.9443
MFERN (Ours)		691K	76.8G	34.43/0.9276	30.36/0.8422	29.11/0.8058	28.27/0.8542	33.84/0.9460
IDN [25]	$\times 4$	553K	32.3G	31.82/0.8903	28.25/0.7730	27.41/0.7297	25.41/0.7632	29.41/0.8942
CARN [39]		1592K	90.9G	32.13/0.8937	28.60/0.7806	27.58/0.7349	26.07/0.7837	30.42/0.9070
IMDN [7]		715K	40.9G	32.20/0.8948	28.58/0.7811	27.56/0.7353	26.04/0.7838	30.45/0.9075
MADNet [40]		1002K	54.1G	31.95/0.8917	28.44/0.7780	27.47/0.7327	25.76/0.7746	-
DCDN [20]		777K	-	32.21/0.8949	28.57/0.7807	27.55/0.7356	26.09/0.7855	30.41/0.9072
SMSR [41]		1006K	89.1G	32.12/0.8932	28.55/0.7808	27.55/0.7351	26.11/0.7868	30.54/0.9085
ECBSR [21]		603K	34.7G	31.92/0.8946	28.34/0.7817	27.48/0.7393	25.81/0.7773	-
LAPAR-A [42]		659K	94.0G	32.15/0.8944	28.61/0.7818	27.61/0.7366	26.14/0.7871	30.42/0.9074
HPUN-M [43]		511K	27.7G	32.19/0.8946	28.61/0.7818	27.58/0.7364	26.04/0.7851	30.49/0.9078
GLADSR [44]		826K	52.6G	32.14/0.8940	28.62/0.7813	27.59/0.7361	26.12/0.7851	-
LCRCA [8]		834K	47.7G	32.20/0.8948	28.60/0.7807	27.57/0.7653	26.10/0.7851	-
DRSAN-48s [45]		730K	57.6G	32.25/0.8945	28.55/0.7817	27.59/0.7374	26.14/0.7875	-
LatticeNet [32]		777K	43.6G	32.18/0.8943	28.61/0.7812	27.57/0.7355	26.14/0.7844	-
AFAN-M [46]		692K	50.9G	32.18/0.8939	28.62/0.7826	27.58/0.7373	26.13/0.7876	30.45/0.9085
FDSCSR-S [47]		478K	31.1G	32.25/0.8959	28.61/0.7821	27.58/0.7367	26.12/0.7866	30.51/0.9087
MFERN (Ours)		691K	43.3G	32.25/0.8958	28.70/0.7837	27.63/0.7385	26.31/0.7921	30.77/0.9113

Table 2. Performance comparison of MRFFM with other basic modules on Manga109 dataset.

Scale	Method	Params	Multi-Adds	PSNR/SSIM
$\times 4$	MFERN + IMDB [7]	700.7K	12.7G	30.27/0.9054
	MFERN + RFDB [49]	679.6K	12G	30.52/0.9083
	MFERN + MRFFM (Ours)	691.4K	43.3G	30.77/0.9113

Table 3. Study of different units in MRFFM on BSD100 dataset.

Scale	MAFFB	AFFB	Params	Multi-Adds	PSNR/SSIM
$\times 4$	✗	✓	503.4K	4.8G	27.50/0.7343
	✓	✗	681.8K	41.1G	27.55/0.7363
	✓	✓	691.4K	43.3G	27.63/0.7385

Table 4. Study of different units in MAFFB on Urban100 dataset.

Scale	DFCB	SACB	CACB	Params	Multi-Adds	PSNR/SSIM
$\times 4$	✗	✓	✓	540.2K	13.1G	25.99/0.7815
	✓	✗	✓	678.1K	40.2G	26.27/0.7911
	✓	✓	✗	677.5K	40.3G	26.17/0.7875
	✓	✓	✓	691.4K	43.3G	26.31/0.7921

Table 5. Study of different units in DFCB on Set14 dataset.

Scale	CC	ESA	Params	Multi-Adds	PSNR/SSIM
$\times 4$	✗	✓	686.9K	43.3G	28.67/0.7834
	✓	✗	671.4K	42.1G	28.60/0.7814
	✓	✓	691.4K	43.3G	28.70/0.7939

Table 6. Study of DC on Urban100 dataset.

Scale	CNN-DC	Transformer-DC	Params	Multi-Adds	PSNR/SSIM
$\times 4$	✗	✓	691.4k	43.3G	26.18/0.7872
	✓	✗	691.4k	43.3G	26.29/0.7911
	✗	✗	691.4k	43.3G	26.16/0.7869
	✓	✓	691.4k	43.3G	26.37/0.8964

Table 7. Comparison with some Transformer-based methods (

\times 4

).

Table 7. Comparison with some Transformer-based methods (

\times 4

).

Method	Params	Multi-Adds	Set5	Set14	BSD100	Urban100	Manga109	Average
SwinIR [12]	897K	49.6G	32.44/0.8976	28.77/0.7858	27.69/0.7406	26.47/0.7980	30.92/0.9151	29.26/0.8274
ESRT [13]	751K	67.7G	32.19/0.8947	28.69/0.7833	27.69/0.7379	26.39/0.7962	30.75/0.9100	29.14/0.8244
LBNet [14]	742K	38.9G	32.29/0.8960	28.68/0.7832	27.62/0.7382	26.27/0.7906	30.76/0.9111	29.12/0.8238
MFERN (Ours)	691K	43.3G	32.25/0.8958	28.70/0.7837	27.63/0.7385	26.31/0.7921	30.77/0.9113	29.15/0.8243

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, C.; Wan, X.; Gao, G. A Multi-Branch Feature Extraction Residual Network for Lightweight Image Super-Resolution. Mathematics 2024, 12, 2736. https://doi.org/10.3390/math12172736

AMA Style

Liu C, Wan X, Gao G. A Multi-Branch Feature Extraction Residual Network for Lightweight Image Super-Resolution. Mathematics. 2024; 12(17):2736. https://doi.org/10.3390/math12172736

Chicago/Turabian Style

Liu, Chunying, Xujie Wan, and Guangwei Gao. 2024. "A Multi-Branch Feature Extraction Residual Network for Lightweight Image Super-Resolution" Mathematics 12, no. 17: 2736. https://doi.org/10.3390/math12172736

APA Style

Liu, C., Wan, X., & Gao, G. (2024). A Multi-Branch Feature Extraction Residual Network for Lightweight Image Super-Resolution. Mathematics, 12(17), 2736. https://doi.org/10.3390/math12172736

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Multi-Branch Feature Extraction Residual Network for Lightweight Image Super-Resolution

Abstract

1. Introduction

2. Related Work

2.1. SR Models

2.2. Lightweight SR Models

2.3. Attention Feature Fusion

3. Proposed Method

3.1. Network Framework

3.2. CNN Backbone

3.3. Transformer Backbone

4. Experiments

4.1. Datasets and Evaluation Metrics

4.2. Implementation Details

4.3. Comparison with State-of-the-Art

4.4. Ablation Studies

4.4.1. Effectiveness of MRFFM

4.4.2. AFFB Validity

4.4.3. Effectiveness of MAFFB

4.4.4. Effectiveness of DFCB

4.4.5. Dense Connection (DC)

4.4.6. Comparison with Some Transformer-Based Methods

4.5. Model Complexity Studies

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI