This section outlines the structure of our proposed Multi-Branch Feature Extraction Residual Network (MFERN). Initially, we introduce the sequential structure and the intensive connection operation between the CNN components and the Transformer backbone network. Subsequently, we describe the Multi-Branch Residual Feature Fusion Module (MRFFM) within the CNN block, encompassing the Multi-Scale Attention Feature Fusion Block (MAFFB) and Attention Feature Fusion Block (AFFB). Then, we introduce the Dual Feature Calibration Block (DFCB), Spatial Attention Calibration Module (SACB), and Channel Attention Calibration Module (CACB). Lastly, we present the specific characteristics of the Transformer.
3.1. Network Framework
As depicted in
Figure 1, the Multi-Branch Feature Extraction Residual Network (MFERN) comprises a sequence of Multi-Branch Residual Feature Fusion Modules (MRFFM) and Transformer modules. We integrate the CNN with the Transformer module to effectively combine local and global feature information. This allows the model to better recover image texture details and reconstruct high-quality images. We denote the input LR image as
, the model outputs as
, and the HR images as
. At the beginning of the model, a
convolutional layer is utilized to extract shallow information:
where
represents a
convolutional layer (denoted as
for shallow feature extraction). Subsequently,
is forwarded to the CNN for local feature extraction. The network comprises four MRFFM modules, each consisting of three MAFFBs and one AFFB, enabling the extraction of additional local feature information through the multi-branch residual structure. The following formula expresses a portion of the CNN model operation:
where
represents CNN and local feature extraction, while
denotes the CNN output of local feature extraction. Once the local features of the image are obtained, they are sent to the Transformer module to extract global information.
where
signifies the Transformer module, and
denotes feature reconstruction enhanced with global information. The reconstruction process can be expressed as
where
denotes the ultimate output of the network,
represents the reconstruction module for
, and
represents the reconstruction module for
. The reconstruction module comprises a
convolutional layer and a pixel shuffle layer.
To ensure a fair comparison of the experimental results, we incorporate the
loss function to optimize our experimental model. For the training set
consisting of
N images, the objective of the MFERN model is to minimize the values defined by the following loss function formula:
where
denotes the parameter set of the MFERN, and
is the
norm. The
indicates the parameter set of the proposed MFERN.
3.2. CNN Backbone
In the CNN segment, we introduce the Multi-Branch Residual Feature Fusion Module (MRFFM) to extract local information. All four modules utilize parameter-sharing technology to maintain the model’s lightweight nature. As depicted in
Figure 1, the shallow feature
sequentially traverses through the four MRFFMs, and each layer’s features can be reused via skip connections. Deep neural networks with many layers may face challenges during training due to small gradients in backpropagation. Skip connections facilitate gradient flow by directly transmitting input information to subsequent layers, stabilizing the gradient and simplifying network training. The above processes can be expressed as
where
represents the
n-th MRFFM, and
denotes the feature information extracted by the
n-th MRFFM.
signifies the
n-th convolution operation.
represents the feature information extracted by the CNN framework.
The MRFFM comprises two additional modules, as shown in
Figure 2: the Multi-Scale Attention Feature Fusion Block (MAFFB) and the Attentional Feature Fusion Block (AFFB). The method involves distilling multiple pieces of feature information to extract and fuse residual branch features multiple times, enhancing its feature expression capabilities. Within the MRFFM branch, we introduce the MAFFB, illustrated in the figure. The MAFFB includes the Dual Feature Calibration Block (DFCB), the Spatial Attention Calibration Module (SACB), and the Channel Attention Calibration Module (CACB).
Dual Feature Calibration Block (DFCB): As illustrated in
Figure 3, within the DFCB, the features produced by the upper and lower branches are initially weighted and merged using combination coefficients (CC) [
32]. The CC structure, as depicted in
Figure 4, employs an attention mechanism to generate weight coefficients for adaptive information selection. Subsequently, the features pass through the Enhanced Spatial Attention (ESA) module with combined coefficients. This process extracts spatial information again, followed by input to two pooling layers simultaneously. Dynamic weight values are then derived post-convolution and activation, multiplying them with the branch features. Finally, the output is added back to the initial input. Through adaptive weights, dual pooling layers, and dynamic adaptive weight integration, the DFCB module excels in extracting valuable feature information efficiently. We denote the input of the DFCB as
, and the aforementioned process can be expressed as
where
represents the operation of channel down-dimensioning and
signifies the sigmoid activation function utilized for nonlinear processing.
and
(
i = 1, 2) represent the output of layer
i of the upper and lower branches, respectively.
denotes the operation of the Efficient Spatial Attention (ESA) module, and
represents the output of the ESA.
and
represent the two combined coefficient learning mechanisms connecting the upper and lower branches.
and
represent the average pooling and max pooling functions.
represents the fusion output that will pass through the features of the upper and lower branches of the two pooling layers.
and
stand for the dynamic weights of the two branches.
expresses the channel split function, and
represents the output of the DFCB unit.
Multi-Scale Attention Feature Fusion Block (MAFFB): The MAFFB initially utilizes a
convolution to reduce the number of channels, followed by feature extraction through a Dual Feature Calibration Block (DFCB), as shown in
Figure 2. The
convolution conducts the linear combination of the pixels across the channels, maintaining the image’s spatial structure and modifying its depth. This process is beneficial for dimensionality adjustment, whether reducing or expanding it. Subsequently, after the number of channels is restored by
convolution, two DFCB operations are sequentially employed for feature extraction. Features from the second
convolution are processed through the upper and lower branches in a series of modules. Initially, feature extraction occurs in the upper branch’s first DFCB module, followed by output generation in the lower branch through convolution and the Spatial Attention Calibration (SACB) module. The fusion of the upper- and lower-branch features produces a spatial-level output. This output then undergoes processing via convolution and the Channel Attention Calibration (CACB) module for channel-level feature information. The intermediate feature output is combined with the adaptive weights from the upper branch’s first DFCB module. The MAFFB module employs spatial and channel self-calibrating attention to merge the information effectively at both levels, utilizing dynamic weights to boost the feature extraction efficiency and effectiveness. We denote the input of this module as
. This process can be represented as
where
represents the operation of channel down-dimensioning in the first
convolution, and
represents the operation of channel up-dimensioning in the second
convolution.
represents the output of the
n-th DFCB.
denotes the operation of the
n-th DFCB module.
represents the intermediate output of the MAFFB unit.
(
i = 1, 2, 3, 4) denotes the adaptive weighted multipliers of the
n-th output of the DFCB unit, and
represents the output of the MAFFB unit.
Channel (Spatial) Attention Calibration Block (CACB/SACB): In the CACB module, as shown in
Figure 3, the raw input undergoes processing through three branches. The first branch employs basic convolution to reduce the number of channels while preserving the original features. The second branch focuses on extracting spatial information through a combination of convolutional and activation functions. The third branch enhances the feature extraction by incorporating channel attention. Subsequently, the outputs from the three branches are weighted to achieve multi-scale feature fusion. Similarly, the Spatial Attention Calibration Module (SACA) operates akin to the CACB, but with the channel attention (CA) module replaced by spatial attention (SA), enabling the extraction of more valuable spatial features. Spatial and channel attention mechanisms are integrated into self-calibrating convolution to dynamically establish relationships between each spatial position and its neighboring features. This enhancement improves the standard convolution layer’s performance by effectively broadening the receptive field of each spatial position without introducing additional parameters or escalating the model complexity. We denote the input of the unit as
. The process can be expressed as
where
,
, and
represent the output of the left branch, right branch, and middle branch, respectively.
refers to the channel attention operation.
represents the operation of channel down-dimensioning using
convolution, while
denotes the channel up-dimensioning operation of the final
convolution layer.
signifies the Rectified Linear Unit (ReLU) activation function utilized for nonlinear processing.
represents the ultimate output of the CACB unit. The symbols
(
i = 1, 2, 3) express the adaptive weighted multipliers applied to the outputs of the three branches within the CACB unit. The SACB operates similarly to the CACB, with the distinction that it replaces the channel attention mechanism with spatial attention to extract more beneficial spatial features.
Multi-Branch Residual Feature Fusion Module (MRFFM): In the MRFFM, as shown in
Figure 2, the features undergo a sequence of three channel splitting operations, each resulting in two branches. One branch retains its feature, while the other is forwarded to the subsequent layer for additional feature extraction via convolution and MAFFB operations. A convolutional layer is integrated into the distillation connection segment to augment the dimensionality of the split channels. Subsequently, the features preserved post-split are concatenated and fused and then passed into the AFFB module. The AFFB module combines elements from both the SACB and CACB, as depicted in the figure. Initially, the original input undergoes channel operation splitting, followed by concatenating and fusing the features extracted by the SACB and CACB modules with the original features post-split. Finally, the weighted output is added. In the MRFFM, layered features from various residual branches are combined to integrate shallow and deep image features comprehensively. This method allows the model to concentrate effectively on important image features, increases the utilization of feature information, and enhances the restoration of intricate image details. We denote the input of the module as
. The aforementioned operations can be expressed as
Here,
represents the
n-th remaining features,
denotes the
n-th distilled features,
represents the
n-th MAFFB unit,
expresses the
n-th channel split function, and
represents the output of the MRFFM. “Concat” denotes the fusion of features from the channel dimensions.
(
i = 1, 2) indicates the adaptive weight applied to the output by adding the features of the AFFB module and the input features.
represents the operation of the AFFB, and
represents the output of the Concat operation.