Residual learning can be classified as a globally connected or a locally connected method according to the range of the skip connection. The globally connected method adds an interpolation-upsampled LR image to the model output at the end of the network. This method was proposed in the VDSR model so that the CNN learns only the residual between the conventionally interpolated image and the HR image. The locally connected method uses skip connections inside and around each convolution block. This structure is useful for extracting high-level features because it stabilizes learning even when models are stacked deeply, and it is commonly used in models that adopt ResNet [40], such as SRGAN [41] and enhanced deep super-resolution (EDSR) [42].
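To contrast the two connection ranges, the following sketch pairs ResNet-style local skips inside each block with a global skip that adds the interpolation-upsampled LR input to the final output; all module names and sizes are illustrative rather than taken from a specific model.

```python
import torch.nn as nn

# Illustrative contrast of local vs. global residual learning in SR.
class LocalResidualBlock(nn.Module):
    """Locally connected: a skip connection around one convolution block."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)   # local skip

class GlobalResidualSR(nn.Module):
    """Globally connected: the network predicts only the residual added
    to the interpolation-upsampled LR input at the network's end."""
    def __init__(self, channels=64, blocks=4):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.body = nn.Sequential(
            *[LocalResidualBlock(channels) for _ in range(blocks)])
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, upsampled_lr):
        # upsampled_lr: LR image already interpolated to the HR size
        return upsampled_lr + self.tail(self.body(self.head(upsampled_lr)))
```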
4.2.1. Upscaling Depth and Width of Model
Attempts have been made to improve performance in SR tasks by increasing network capacity; VDSR [28] and EDSR [42] are representative examples, both of which significantly increased network capacity compared with existing models.
VDSR [28] is based on a modified VGGNet [29] and uses a global skip connection between the input and output. The input LR image is upsampled to the HR scale through bicubic interpolation. VDSR was a ground-breaking model composed of 20 layers, significantly deeper than existing models. It converges effectively by applying a high learning rate together with gradient clipping at the start of learning.
EDSR [42] is based on a modified SRResNet [41]. In SR, an image has a fixed pixel value range; therefore, the BN layer is not required. Moreover, the use of the BN layer can degrade the information in the extracted features. As a result, Lim et al. [42] did not use the BN layer in EDSR, reducing GPU memory consumption during training by approximately 40%. Beyond this saving, the model's learning capacity was increased by scaling up both its width and depth [42]. The EDSR model won first place in the NTIRE 2017 challenge [43], and Lim et al. [42] demonstrated that network width and depth are strongly related to performance even in SR tasks, reporting PSNR and SSIM values close to those of models released since 2018. However, there is a limit to the performance improvement that can be realized by scaling up a model; furthermore, with large-scale models, inference is slow and the risk of overfitting increases [44,45].
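A minimal sketch of the resulting building block, i.e., an SRResNet residual block with the BN layers removed, is shown below; the residual scaling factor follows the stabilization trick used in the large EDSR model.

```python
import torch.nn as nn

# EDSR-style residual block: SRResNet's block without BN layers.
class EDSRBlock(nn.Module):
    def __init__(self, channels=64, res_scale=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.res_scale = res_scale  # damps the residual for wide models

    def forward(self, x):
        return x + self.body(x) * self.res_scale  # local skip connection
```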
4.2.2. Recursive Architecture
As shown in Figure 5, EDSR [42] demonstrated that expanding the depth and width of a network improves SR performance; however, the number of parameters increases significantly. By recursively applying the same convolution layer multiple times, a recursive architecture is designed to extract higher-level features while keeping the number of parameters small.
The deeply-recursive convolutional network (DRCN) [46] extracts features by applying the same convolutional layer several times. These features are connected directly to the reconstruction layer via skip connections to generate their respective sub-outputs, which are then combined to derive the final output. Because the same parameters and bias are used repeatedly, exploding and vanishing gradients become an issue. This gradient problem was addressed by two techniques: (i) averaging the features produced by the same convolution and (ii) applying a skip connection to the reconstruction layer.
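A weight-sharing recursion of this kind can be sketched as follows; the depth, channel width, and averaging of sub-outputs are illustrative of DRCN's idea rather than its exact configuration.

```python
import torch
import torch.nn as nn

# DRCN-flavored recursion: one shared convolution applied `depth` times,
# each recursion producing a sub-output through a skip connection to the
# input; the sub-outputs are averaged into the final prediction.
class RecursiveSR(nn.Module):
    def __init__(self, channels=64, depth=8):
        super().__init__()
        self.embed = nn.Conv2d(1, channels, 3, padding=1)
        self.shared = nn.Conv2d(channels, channels, 3, padding=1)  # reused weights
        self.reconstruct = nn.Conv2d(channels, 1, 3, padding=1)
        self.depth = depth

    def forward(self, x):
        h = torch.relu(self.embed(x))
        sub_outputs = []
        for _ in range(self.depth):
            h = torch.relu(self.shared(h))               # same layer every time
            sub_outputs.append(x + self.reconstruct(h))  # skip to the input
        return torch.stack(sub_outputs).mean(dim=0)      # average the sub-outputs
```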
As shown in Figure 6a, the very deep persistent memory network (MemNet) [47] receives a bicubic-upsampled LR image as input. The model directly transmits the input and the feature maps output by the memory blocks to the reconstruction module, where the feature maps are used to create intermediate SR images that are then fused into the final SR image. MemNet's convolutions take the pre-activation form of BN, ReLU, and convolution layers. A memory block comprises a recursive unit and a gate unit, where the recursive unit is a residual block with two convolution layers that the features pass through multiple times. The feature map output by each recursion and the outputs of the preceding memory blocks are directly connected to the gate unit (i.e., a 1 × 1 convolution), which is structured to remember features that would otherwise fade away whenever they pass through a layer.
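A compact sketch of such a memory block is given below; the recursion count and channel bookkeeping are illustrative assumptions, with the list `prev_block_outputs` standing in for the long-term memories arriving from earlier blocks.

```python
import torch
import torch.nn as nn

# MemNet-flavored memory block: a shared residual ("recursive") unit is
# applied several times, and all recursion outputs plus the memories of
# earlier blocks are fused by a 1x1-convolution gate unit.
class MemoryBlock(nn.Module):
    def __init__(self, channels=64, recursions=4, num_prev_blocks=0):
        super().__init__()
        self.residual = nn.Sequential(  # shared recursive unit
            nn.ReLU(inplace=True), nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True), nn.Conv2d(channels, channels, 3, padding=1),
        )
        # gate unit: 1x1 conv fusing short-term and long-term memories
        self.gate = nn.Conv2d(
            channels * (recursions + num_prev_blocks + 1), channels, 1)
        self.recursions = recursions

    def forward(self, x, prev_block_outputs=()):
        h, short_term = x, []
        for _ in range(self.recursions):
            h = h + self.residual(h)        # same residual block, reused
            short_term.append(h)
        memories = torch.cat(short_term + list(prev_block_outputs) + [x], dim=1)
        return self.gate(memories)          # gate decides what to remember
```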
The deep recursive residual network (DRRN) [49] uses the ResNet structure as a backbone, but the residual block is replaced by a recursive block in which several convolution layers are stacked. Unlike DRCN, DRRN recursively reuses the entire block rather than a single convolution layer. (To learn consistently, DRCN employs a multi-supervision strategy.) Owing to these structural characteristics, DRRN is a simpler model.
The dual-state recurrent network (DSRN) [44] performs upsampling and downsampling recursively using the same transposed convolution and convolution layers, in contrast to DRCN and DRRN, which recursively reuse a single convolution layer. The concept of performing recursive upsampling and downsampling is similar to that of DBPN [36]; unlike DBPN, however, the process is not densely connected. Compared with DRRN, DSRN's performance is similar at the ×2 and ×3 scales but slightly degraded at ×4, and its PSNR is about 1% lower than that of DBPN. Like DRCN, DSRN adopts a multi-supervision strategy, i.e., the final output is created by averaging all n intermediate outputs generated over the n recursions.
The non-local recurrent network (NLRN) [50] is a DL model for estimating non-local self-similarity, a property that was widely exploited in earlier image restoration work. Some features contain information that recurs across an image, which is referred to as self-similarity. A non-local module is used to generate feature correlations that determine this self-similarity: through 1 × 1 convolutions, the module extracts the correlation between each pixel and its neighborhood within a specific area of the feature map. In addition, NLRN takes advantage of the RNN architecture to increase parameter efficiency and to propagate correlations with neighboring pixels across adjacent recurrent states. Strong correlations under various degradations can be estimated through the inter-state flow of these feature correlations.
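The correlation-and-aggregation step can be sketched as a non-local block; for brevity this version attends over the whole feature map, whereas NLRN restricts the computation to a neighborhood, and the embedding sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Non-local module sketch: 1x1 convolutions embed the features, and a
# softmax over pairwise correlations mixes information across positions.
class NonLocalBlock(nn.Module):
    def __init__(self, channels=64, embed=32):
        super().__init__()
        self.theta = nn.Conv2d(channels, embed, 1)  # query embedding
        self.phi = nn.Conv2d(channels, embed, 1)    # key embedding
        self.g = nn.Conv2d(channels, embed, 1)      # value embedding
        self.out = nn.Conv2d(embed, channels, 1)

    def forward(self, x):
        b, _, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, E)
        k = self.phi(x).flatten(2)                     # (B, E, HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, E)
        attn = F.softmax(q @ k, dim=-1)                # pairwise correlations
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection
```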
The super-resolution feedback network (SRFBN) [48] operates a single feedback block recursively, as shown in Figure 6b. Similar to DBPN [36], the outputs of each convolution in the feedback block are densely connected via recursive upsampling and downsampling. Because the bicubic-upsampled input LR image is added to the output of the feedback block, the overall design can be considered a model that ultimately learns residuals. Although the performance for BI degradation did not differ significantly from that of EDSR [42], SRFBN generally outperformed the residual dense network (RDN) [24] on the complex degradation problems of BD and DN. SRFBN uses curriculum learning, which trains models on samples in a meaningful order, from easy to hard; as a result, SRFBN is well suited to complex SR degradation problems. Specifically, the model for BD (generated by complex degradation) is trained by comparing the first two of its four outputs with a Gaussian-blurred HR image (intermediate HR) under the L1 loss, and the last two outputs with the original HR image. After applying curriculum learning, SRFBN shows better results than RDN on the complex-degradation image SR problem [48]. Note that, by adopting a recursive architecture, SRFBN is constructed with only about 8% of the parameters of EDSR.
4.2.3. Densely Connected Architecture
In a densely connected architecture, the feature maps from each convolution block are transmitted to the inputs of the subsequent blocks, as in DenseNet [51]. This structure significantly reduces the number of parameters by enhancing feature reuse and mitigates the gradient vanishing problem in object classification tasks [51]. Low-dimensional features are particularly important in SR tasks because even they can carry high-frequency details (e.g., edges and textures) that must be restored in HR images [52]. Unlike ResNet's skip connection, a densely connected architecture concatenates features rather than simply adding them, ensuring that important features from low to high dimensions do not vanish while passing through the layers.
SRDenseNet [52] uses a post-upsampling method in which a network of dense blocks is followed by transposed convolutions. The dense block structure connects the output of the n-th convolution layer to every subsequent layer, from the (n+1)-th to the N-th, in a by-pass form. The extracted features can be transmitted to the bottom of the network without distortion because the feature map generated by each convolution in the dense block is concatenated with the feature maps transmitted through the by-passes and used as input to the next layer.
RDN [24] was modified on the basis of SRDenseNet, employing the residual dense block (RDB), which adds a skip connection to the dense block. The structures of the RDN and the RDB are shown in Figure 7a. The RDB is designed to learn the local pattern of an LR image using all of the outputs of the block immediately before reconstructing the SR image. Since the dense connections rapidly increase the number of channels, the channel count is reduced through a 1 × 1 convolution in the RDB.
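The following sketch combines the two ideas just described: SRDenseNet-style dense concatenation inside the block, RDN's 1 × 1 channel-reducing fusion, and the local skip connection; the growth rate and layer count are illustrative.

```python
import torch
import torch.nn as nn

# Residual dense block (RDB) sketch: densely concatenated convolutions,
# a 1x1 fusion that squeezes the grown channel count back down, and a
# local skip connection from the block input.
class ResidualDenseBlock(nn.Module):
    def __init__(self, channels=64, growth=32, layers=4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                nn.ReLU(inplace=True),
            )
            for i in range(layers)
        )
        # local feature fusion: 1x1 conv restores the channel count
        self.fuse = nn.Conv2d(channels + layers * growth, channels, 1)

    def forward(self, x):
        features = [x]
        for conv in self.convs:
            # each layer sees the concatenation of all previous outputs
            features.append(conv(torch.cat(features, dim=1)))
        return x + self.fuse(torch.cat(features, dim=1))  # local skip
```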
DBPN [36] uses a densely connected architecture and iteratively performs upsampling and downsampling, as shown in Figure 7b. This differs from existing models that perform upsampling only once. DBPN performs upsampling twice in the up-projection unit, which proceeds as follows:

$$H_0^t = (L^{t-1} * p_t)\uparrow_s, \quad L_0^t = (H_0^t * g_t)\downarrow_s, \quad e^t = L_0^t - L^{t-1}, \quad H_1^t = (e^t * q_t)\uparrow_s, \quad H^t = H_0^t + H_1^t,$$

where $*$ denotes convolution with the corresponding (de)convolutional filters $p_t$, $g_t$, and $q_t$, and $\uparrow_s$ and $\downarrow_s$ denote upsampling and downsampling by scale $s$. Here, $L^{t-1}$, $H_0^t$, and $L_0^t$ indicate the feature map whose dimensions were reduced by 1 × 1 convolution (cf. Equation (7)), the feature map upsampled to the HR scale, and the feature map downsampled to the LR scale, respectively. $e^t$ can be considered an upsampling error, as it is the difference from the original input feature map. The down-projection unit performs downsampling twice with the same structure. This approach demonstrated good performance in the ×8 BI track of the NTIRE 2018 challenge [53]. However, the structure is complex, and the computational cost increases as the number of parameters increases.
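A minimal sketch of the up-projection unit implementing these equations is given below; the kernel size, stride, and padding are chosen for ×2 scaling and are illustrative, and the activations of the original unit are omitted for brevity.

```python
import torch.nn as nn

# DBPN-style up-projection unit: upsample, downsample back, and
# upsample the resulting error to correct the first estimate.
class UpProjection(nn.Module):
    def __init__(self, channels=64, kernel=6, stride=2, padding=2):
        super().__init__()
        self.up0 = nn.ConvTranspose2d(channels, channels, kernel, stride, padding)
        self.down = nn.Conv2d(channels, channels, kernel, stride, padding)
        self.up1 = nn.ConvTranspose2d(channels, channels, kernel, stride, padding)

    def forward(self, l_prev):
        h0 = self.up0(l_prev)   # H_0^t: project LR features to the HR scale
        l0 = self.down(h0)      # L_0^t: project back down to the LR scale
        e = l0 - l_prev         # e^t: upsampling error at the LR scale
        h1 = self.up1(e)        # H_1^t: project the error to the HR scale
        return h0 + h1          # H^t: corrected HR feature map
```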
The enhanced super-resolution generative adversarial network (ESRGAN) [54] is based on SRResNet. First, the BN layer is removed, as in EDSR. Second, three dense blocks (each consisting of five convolution layers with leaky ReLU) are stacked in the residual block, with skip connections placed before and after each dense block. This modified architecture, the residual-in-residual dense block (RRDB), is used as the GAN generator in ESRGAN.
The cascading residual network (CARN) [55] modifies ResNet by applying group convolution and pointwise convolution. Whereas the standard residual block consists of convolutions and ReLU, the residual-E block in CARN stacks two group convolutions with ReLU and adjusts the number of channels by a 1 × 1 convolution. By stacking residual-E blocks and 1 × 1 convolutions alternately, the cascading block densely connects the outputs to form a single module, and the final network is constructed by alternately stacking cascading blocks and 1 × 1 convolutions. The number of parameters is reduced by replacing the standard convolutions with group convolutions while the dense connections reuse features as much as possible.
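The parameter saving comes from the grouped convolutions, as the brief sketch below shows; the group count is an illustrative choice, and more groups mean fewer parameters per layer.

```python
import torch.nn as nn

# CARN-style residual-E block: two grouped convolutions with ReLU and a
# 1x1 (pointwise) convolution for channel mixing, plus a skip connection.
class ResidualEBlock(nn.Module):
    def __init__(self, channels=64, groups=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),  # pointwise channel mixing
        )

    def forward(self, x):
        return x + self.body(x)
```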
4.2.4. GAN-Based Architecture
When only a pixel-wise loss function is used in an SR task, the fine texture of the generated image tends to be blurry. GAN-based SR models were proposed to address this problem. In adversarial learning, the generator produces an SR image and the discriminator distinguishes whether the image is real or fake. In general, such a model is built by adding a discriminator similar to VGGNet [29] to an existing SR model, which serves as the generator. Although the resulting images look better visually, the PSNR and SSIM values indicate deterioration.
SRGAN [41] is a GAN-based SR model that adopts SRResNet, a modified ResNet structure, as the generator and a VGGNet-like structure as the discriminator, as shown in Figure 8. Because existing models do not adequately represent fine-grained texture and produce overall blurry SR images, a feature-map-wise MSE loss was used rather than the pixel-wise MSE loss. Sub-pixel convolution is used for upsampling. The feature-map-wise MSE loss calculates errors by comparing the SR and HR images via the feature maps obtained by passing them through a pretrained VGG19 [41].
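Such a feature-map-wise (perceptual) loss can be sketched with a frozen pretrained VGG19 from torchvision; the layer cut used below is one common choice rather than SRGAN's exact configuration.

```python
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

# Feature-map-wise MSE loss: SR and HR images are passed through a
# frozen pretrained VGG19 and compared in feature space.
class VGGFeatureLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.vgg = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:36].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False   # VGG stays fixed during SR training
        self.mse = nn.MSELoss()

    def forward(self, sr, hr):
        # sr, hr: (B, 3, H, W) images normalized as VGG expects
        return self.mse(self.vgg(sr), self.vgg(hr))
```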
EnhanceNet [56] also grafted the feature-map-wise loss onto the GAN. The difference from SRGAN is that EnhanceNet uses nearest-neighbor upsampling, because checkerboard artifacts are generated when transposed convolution is used. The potential loss of information is prevented by applying residual learning globally, i.e., adding the bicubic-upsampled version of the input LR image to the output.
SRFeat [57] also uses a generator that adopts the ResNet structure. A 9 × 9 filter is used in the first convolution layer, and the output of each residual block is compressed through a 1 × 1 convolution and added through a skip connection immediately before sub-pixel upsampling. SRFeat attempts to maximize representation quality through a feature discriminator that applies GAN-based learning to feature maps, so that the generated feature maps more accurately represent real features. Three types of loss are employed to achieve this goal: (i) the perceptual loss (feature-map-wise MSE), (ii) the image-wise GAN loss, and (iii) the feature-map-wise GAN loss.
ESRGAN [54] also uses the feature-map-wise MSE loss based on VGG19. It differs from SRGAN in that the feature maps are compared before passing through the activation, which yields sharper edges and more visually pleasing results. In addition, Wang et al. [54] proposed a network interpolation technique given by $\theta_G^{INTERP} = (1 - \alpha)\,\theta_G^{PSNR} + \alpha\,\theta_G^{GAN}$ for $\alpha \in [0, 1]$, where $\theta_G^{PSNR}$ and $\theta_G^{GAN}$ denote the parameters trained using the pixel-wise loss and the GAN method, respectively. This method removes unpleasant artifacts and meaningless noise while retaining the high visual quality obtained through adversarial learning.
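Network interpolation amounts to blending the two trained checkpoints parameter by parameter, as the sketch below shows; the checkpoint paths and the value of alpha are hypothetical.

```python
import torch

# ESRGAN-style network interpolation between a PSNR-oriented model and a
# GAN-trained model: theta = (1 - alpha) * theta_psnr + alpha * theta_gan.
def interpolate_networks(psnr_ckpt, gan_ckpt, alpha=0.8):
    theta_psnr = torch.load(psnr_ckpt)  # trained with pixel-wise loss
    theta_gan = torch.load(gan_ckpt)    # trained adversarially
    return {
        name: (1 - alpha) * theta_psnr[name] + alpha * theta_gan[name]
        for name in theta_psnr
    }
```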
In super-resolution by neural texture transfer (SRNTT) [58], it is argued that the texture generated by GAN-based SR models is inevitably a fake texture that merely looks real. SRNTT attempted to address this problem by grafting a reference-based method onto a GAN. The Wasserstein GAN gradient penalty [59], which measures the distance between distributions, was used as the adversarial loss and was modified based on the L1 norm to achieve more stable learning than in existing GANs.
4.2.5. Reference-Based Architecture
SR is an ill-posed problem, as there may be multiple corresponding HR images for a single LR image [45]. To address this issue, SR methods that make use of similar textures in other images as a reference were proposed. Although this approach can produce more visually sophisticated results, the quality of the results may vary depending on the similarity of the referenced image.
The end-to-end reference-based super-resolution network using cross-scale warping (CrossNet) [32] first generates an SR image using an existing SR model and then obtains a feature map for similar texture by comparing the reference (Ref) image and the SR image with a flow estimator, a network that estimates optical flow. A slightly modified FlowNetS [60] was used as the flow estimator, which decodes a new SR image by fusing the Ref features with the features of the SR image generated by the existing SR model. EDSR [42] was used as the SR model, and U-Net [61] was used as the encoder and decoder in CrossNet. The Charbonnier penalty function [62] is used as the loss function that compares the SR and HR images. Although the flow estimator could be learned end-to-end, its loss was not explicitly defined.
SRNTT [58] calculates the similarity between LR image patches and reference image patches through a dot product of the feature maps extracted with VGGNet [29], as shown in Figure 9. The feature map extracted from an LR patch is then partially replaced with the feature map of the reference patch with the highest similarity.
The texture loss $\mathcal{L}_{tex}$ in Equation (11) was used for texture similarity training [58]:

$$\mathcal{L}_{tex} = \sum_{l} \lambda_l \left\| \mathrm{Gr}\big(\phi_l(I^{SR}) \odot S_l^*\big) - \mathrm{Gr}\big(M_l \odot S_l^*\big) \right\|_F, \qquad (11)$$

where $\|\cdot\|_F$ and $\phi_l$ denote the Frobenius norm and the feature space of layer $l$, respectively, and $M_l$ is the feature map swapped in from the Ref image. $\mathrm{Gr}(\cdot)$ computes the Gram matrix, and $\lambda_l$ is a normalization factor corresponding to the feature size of layer $l$ [58]. $S_l^*$ represents a weighting map for all LR patches calculated as the best matching score [58]. Compared to CrossNet, the texture loss for the Ref image $I^{Ref}$ is explicitly defined.
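A minimal sketch of this Gram-matrix texture comparison is given below; the tensors `m` and `s` stand in for $M_l$ and $S_l^*$, and the normalization by the feature size is a simplified stand-in for $\lambda_l$.

```python
import torch

# Gram-matrix texture loss in the spirit of Equation (11).
def gram_matrix(feat):
    # feat: (B, C, H, W) -> (B, C, C) channel correlation matrix
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2)

def texture_loss(sr_feat, m, s):
    # weight both feature maps by the matching-score map, then compare
    # their Gram matrices under the Frobenius norm
    g_sr = gram_matrix(sr_feat * s)
    g_ref = gram_matrix(m * s)
    b, c, h, w = sr_feat.shape
    lam = 1.0 / (c * h * w)  # normalization by the feature size
    return lam * torch.linalg.norm(g_sr - g_ref, dim=(1, 2)).mean()
```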
As shown in Equation (12), the total loss function of SRNTT consists of the pixel-wise MSE loss $\mathcal{L}_{rec}$, the feature-map-wise loss $\mathcal{L}_{per}$, the adversarial (WGAN-GP) loss $\mathcal{L}_{adv}$, and the texture loss $\mathcal{L}_{tex}$:

$$\mathcal{L} = \mathcal{L}_{rec} + \lambda_{per}\mathcal{L}_{per} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{tex}\mathcal{L}_{tex}. \qquad (12)$$

Unlike CrossNet, SR images can be created end-to-end, and textures with high similarity in the local patch are searched for and imported.
The texture transformer network for image super-resolution (TTSR) [63] captures the relevance between an LR image and a reference image using the Transformer architecture [64]. TTSR starts from the SRNTT model and removes all BN layers and the reference part. Whereas SRNTT employs a pretrained VGGNet [29] as a texture extractor, TTSR uses a learnable ConvNet (i.e., the learnable texture extractor, LTE) with five convolution and two pooling layers. The LTE is trained end-to-end and used to calculate the relevance (similarity) between the LR image and the reference image using $(Q, K, V)$ (query, key, value) attention in a feature-map-wise manner, where $Q$, $K$, and $V$ denote the LR↑ patch feature, the Ref↓↑ patch feature, and the Ref patch feature, respectively. Here, ↑ and ↓ represent bicubic upsampling and bicubic downsampling, respectively; i.e., Ref↓↑ means performing downsampling and upsampling sequentially to match the distribution of the Ref patches with that of the LR↑ patches.
TTSR transfers the textures of patches through the following process. (i) Relevance embedding: the hard/soft attention maps and the similarity are calculated using the normalized inner product of the LR↑ patch feature $Q$ and the Ref↓↑ patch feature $K$. (ii) Hard attention: the transferred texture features $T$ are generated using hard attention, by replacing the Ref↓↑ patch feature $K$ with the Ref patch feature $V$. (iii) Soft attention: after concatenating $T$ with the LR patch feature $F$ and performing a convolution on them, the result is multiplied element-wise with the soft attention map $S$ and added back to the LR patch feature $F$ as follows:

$$F_{out} = F + \mathrm{Conv}\big(\mathrm{Concat}(F, T)\big) \odot S.$$
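The three steps can be sketched over flattened patch features as follows; a linear layer stands in for the paper's convolutional fusion, and all shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# TTSR-style texture transfer over flattened patch features.
# q: (B, N_lr, D) LR-up patch features, k: (B, N_ref, D) Ref-down-up
# patch features, v: (B, N_ref, D) Ref patch features, f: (B, N_lr, D)
# LR features; `fuse` plays the role of Conv over Concat(F, T).
def transfer_textures(q, k, v, f, fuse: nn.Linear):
    # (i) relevance embedding: normalized inner product of Q and K
    rel = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(1, 2)
    # (ii) hard attention: each LR patch takes its best-matching Ref patch
    score, idx = rel.max(dim=-1)                          # (B, N_lr)
    t = torch.gather(v, 1, idx.unsqueeze(-1).expand(-1, -1, v.size(-1)))
    # (iii) soft attention: fuse, weight by the score map S, add back F
    s = score.unsqueeze(-1)
    return f + fuse(torch.cat([f, t], dim=-1)) * s        # F_out

# usage sketch: fuse = nn.Linear(2 * 64, 64) for D = 64
```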
4.2.6. Progressive Reconstruction Architecture
Since post-upsampling methods upsample the feature map to the final scale only once, at the end of the network, they cannot extract features in the HR image space. A progressive reconstruction architecture compensates for this problem by gradually upsampling the feature map in the middle of the network.
The sparse coding-based network (SCN) [31] is a model that simulates the conventional sparse coding concept using a CNN and has a structure that performs gradual upsampling. After a patch extraction layer, the model performs sparse coding using the learned iterative shrinkage and thresholding algorithm (LISTA) [65] subnetwork, followed by HR patch recovery and a combination of the output patches. The LISTA subnetwork operates in two recurrent stages, each consisting of two fully connected layers and an activation function with a specific threshold. In addition to the fully connected layers, the threshold value used for the activation is learned.
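A LISTA-style subnetwork can be sketched as follows; the dimensions, stage count, and initial threshold are illustrative, and the soft-thresholding plays the role of the thresholded activation described above.

```python
import torch
import torch.nn as nn

# LISTA sketch: learned linear maps with a learnable soft threshold,
# approximating iterative sparse coding z <- soft(W y + S z).
class LISTA(nn.Module):
    def __init__(self, input_dim=64, code_dim=128, stages=2):
        super().__init__()
        self.W = nn.Linear(input_dim, code_dim, bias=False)  # input projection
        self.S = nn.Linear(code_dim, code_dim, bias=False)   # recurrent map
        self.theta = nn.Parameter(torch.full((code_dim,), 0.1))  # learned threshold
        self.stages = stages

    def soft_threshold(self, z):
        return torch.sign(z) * torch.clamp(z.abs() - self.theta, min=0.0)

    def forward(self, y):
        b = self.W(y)
        z = self.soft_threshold(b)
        for _ in range(self.stages):
            z = self.soft_threshold(b + self.S(z))  # recurrent refinement
        return z  # sparse code used for HR patch recovery
```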
LapSRN [35] consists of two branches responsible for feature extraction and image reconstruction, respectively (Figure 10). In the feature extraction branch, LapSRN gradually upsamples the input image and extracts HR features from it. In the image reconstruction branch, residual connections between LR and HR enable the stable learning of the model. Transposed convolution is used as the upsampling method. Furthermore, because the model has several intermediate outputs, the Charbonnier loss, which is derived from the L1 loss, is used to control outliers effectively.
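The Charbonnier loss itself is a one-liner, sketched below with an illustrative value for the small constant eps.

```python
import torch

# Charbonnier loss: a smooth, outlier-robust variant of the L1 loss
# applied to each intermediate output of LapSRN.
def charbonnier_loss(sr, hr, eps=1e-3):
    return torch.sqrt((sr - hr) ** 2 + eps ** 2).mean()
```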
SRNTT and TTSR are also based on progressive reconstruction so that the feature maps of the reference image can be used at each scale.
4.2.8. Multi-Scale Receptive Field
While the 3 × 3 convolution filter is the most widely used, the multi-scale receptive field architecture uses various filter sizes, such as 5 × 5 and 7 × 7.
Models that use the multi-scale receptive field structure include CNF [68], CMSC [67], and MSRN [66]. In CNF [68], a multi-scale receptive field was realized by applying different filter sizes or numbers of layers to each SRCNN. In addition, an MSC module with stacked filter pairs (two 3 × 3; 3 × 3 and 5 × 5; two 5 × 5; and 3 × 3 and 7 × 7) was used in CMSC [67]. MSRN [66] operates two layers of 3 × 3 filters and two layers of 5 × 5 filters in parallel, as in the sketch below.
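The following block illustrates the MSRN-style idea of parallel 3 × 3 and 5 × 5 branches that exchange information before a 1 × 1 fusion; the exact wiring of the original block is simplified.

```python
import torch
import torch.nn as nn

# Multi-scale block sketch: parallel 3x3 and 5x5 branches, cross-scale
# concatenation between stages, 1x1 fusion, and a skip connection.
class MultiScaleBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.b3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.b5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.b3_2 = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.b5_2 = nn.Conv2d(2 * channels, channels, 5, padding=2)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        p3 = self.act(self.b3(x))
        p5 = self.act(self.b5(x))
        mixed = torch.cat([p3, p5], dim=1)   # exchange cross-scale context
        p3 = self.act(self.b3_2(mixed))
        p5 = self.act(self.b5_2(mixed))
        return x + self.fuse(torch.cat([p3, p5], dim=1))
```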
The advantage of these models is that they can capture contextual information at various scales. The disadvantages are that the number of parameters increases when filters larger than the common 3 × 3 are used, and that the model can become heavy because the multi-scale filters are often applied in parallel.