1. Introduction
High-resolution (HR) images with high perceptual quality are often required in applications such as video surveillance [1,2], face recognition [3], medical diagnosis [4], and remote sensing [5,6,7]. However, because sensors differ in capability, the quality of captured images can vary greatly and may fail to meet the requirements of subsequent applications. Super-resolution (SR) technology is an effective way to overcome the inherent resolution limitation of current sensor imaging systems [8]. The objective of super-resolution is to reconstruct an HR image from one or more low-resolution (LR) observation frames captured from different perspectives of the same scene. In general, the observed LR image can be modeled as a degraded representation of the HR image, which is degraded by warping, blurring, noise, and decimation [5]. According to the number of input LR images, conventional super-resolution approaches can be roughly categorized into single-frame super-resolution (SFSR) [9,10,11,12,13,14,15] and multi-frame super-resolution (MFSR) [16,17,18,19,20].
Multi-frame super-resolution reconstruction aims to merge the complementary information from different images to generate an image of higher spatial resolution. The problem was first formulated by Tsai and Huang [16] in the frequency domain to improve the spatial resolution of Landsat Thematic Mapper (TM) images. Over the past few decades, multi-frame super-resolution techniques have mainly been studied and improved in the spatial domain [17,18]. SR is considered an ill-posed inverse problem because, for each LR image, the space of plausible corresponding HR images is huge and scales up quadratically with the magnification factor [21]. Owing to their effectiveness and flexibility, regularized frameworks, which impose constraints on the solution space, have been the focus of most research [22]. The maximum a posteriori (MAP) estimation method transforms super-resolution reconstruction into an energy function optimization problem. Generally, the energy function consists of a data fidelity term that measures the model error between the degraded observations and the ideal image, and a regularization term that imposes prior knowledge to constrain the model and achieve a robust solution. However, the priors of these methods are hand-crafted from limited observations of specific image statistics, which may yield unsatisfactory results because the real constraint often deviates from the predefined priors. On the one hand, the ill-posed nature is particularly evident for large magnification factors, which makes sub-pixel alignment harder and leads to the absence of texture details in the reconstructed images. On the other hand, it is difficult to obtain sufficient LR images with non-redundant information to recover the aliased high-frequency components. Therefore, the performance of MFSR algorithms decreases rapidly with increasing magnification.
The mainstream SFSR algorithms include reconstruction-based [9], example-based [23], sparse representation-based [24], regression-based [11], and deep learning-based approaches [13,14,15,25]. With the rapid development of deep learning, the convolutional neural network (CNN) has come to dominate SR research due to its promising performance in terms of effectiveness and efficiency [26]. The pioneering SRCNN [12] applied a three-layer network to learn the non-linear mapping between LR patches and the corresponding HR patches. Since then, given the excellent learning capacity of CNNs, deep learning-based methods have been developed in various ways by using new architectures or more suitable loss functions. Improved networks [13] exploited residual learning (VDSR) [27] and recursive layers (DRCN) [28] to achieve outstanding performance for SFSR. The residual dense network (RDN) [14] combined residual learning and dense connections to fully utilize both shallow and deep features with over 100 layers. Recently, channel attention (RCAN) [29] and second-order channel attention (SAN) [15] were introduced to exploit feature correlations for superior performance. These end-to-end networks compute a series of feature maps from the LR image, culminating in one or more upsampling layers that construct the HR image. This approach is convenient because it automatically learns good features from massive quantities of data without much expertise or manual feature design. Nevertheless, many deep learning approaches assume that the training and test datasets are drawn from the same feature space with a similar distribution. Hence, the SR performance depends heavily on the consistency between the test data and the training data [8]. Meanwhile, learning-based methods directly generate high-resolution details from the learned mapping functions and the low-resolution input, and some unexpected artifacts may appear in the reconstructed results, especially for large magnification factors. Furthermore, the difficulty of estimating missing high-frequency details increases with the scale factor as the ambiguity between LR and HR grows.
In short, SR performance at a large scale factor remains a challenging problem for both the MFSR and SFSR approaches. On the one hand, model-based MFSR algorithms have difficulty recovering missing high-frequency details from the limited complementary information. On the other hand, at large upsampling scales, since insufficient information is available to recover such high-frequency components, deep learning-based SFSR methods may “hallucinate” the fine detail structure. Such hallucination can be very problematic in some critical applications. To deal with this challenge, some researchers [30,31] have proposed exploiting the complementary advantages of external and internal information to improve SR performance and perceptual visual quality. However, most deep learning-based video and multi-frame super-resolution methods cannot fully exploit the temporal and spatial correlations among multiple images, and their fusion modules do not adapt well to image sequences with weak temporal correlations [32]. Because of the limited information involved in the reconstruction model, these methods fall short of practical requirements.
The MFSR and SFSR methods extract missing details from different sources. SFSR extracts various feature maps representing the details of a target image, whereas MFSR provides multiple sets of feature maps from other images. The model-based MFSR methods and the deep learning-based SFSR procedures are therefore complementary to a large extent [33]. To combine the feature learning capacity of SFSR with the information fusion brought by MFSR, a few studies have proposed combining single-frame and multi-frame SR [34,35]. In [34], the input LR images are first magnified and recovered by a conventional MFSR method with a 4× scaling factor; then, an SFSR network is applied to the recovered result for artifact removal without magnification. The authors of [35] carried out the process in the inverse order to [34]: they passed the input LR images separately through the SFSR network and then applied a conventional MFSR method to the resulting images. Thus, the SFSR network in the former framework is used only as a filter to fine-tune the output of the MFSR method, while in the latter it is used to initialize the input of the MFSR method. Compared with traditional methods, the cascade model can capitalize on both inter-frame aliasing information and externally learned feature information, which notably improves the utilization of multiple images and external example data.
In this paper, we propose a novel two-step super-resolution reconstruction method that concatenates an L0-norm constrained reconstruction with an enhanced residual back-projection network. This cascade design offers considerable advantages for image SR, as it integrates the flexibility of the model-based method with the feature learning capacity of the learning-based method. Specifically, the L0-norm constrained reconstruction method takes multiple images as input to obtain an initial high-resolution image, and an enhanced residual back-projection network is then applied to the initial image to recover a more accurate result. The proposed cascade model leverages the information obtained from multiple low-resolution inputs and from the neural network, and outperforms the baseline SR methods that make up the cascade in both objective and perceptual quality measurements.
The rest of this paper is organized as follows. Section 2 introduces the variational model-based MFSR algorithm and the deep learning-based SFSR algorithm that are concatenated in the cascade model. We present the detailed experimental results for this multi/single-frame super-resolution cascade model in Section 3, followed by a discussion of the cascade strategy in Section 4. Finally, our conclusions are drawn in Section 5.
2. The Cascade Model for Image Super-Resolution
Most methods reconstruct HR images in one upsampling step, which increases the difficulty of reconstruction at large scaling factors. The Laplacian pyramid framework (LapSRN) [36] was proposed to progressively reconstruct images at multiple scales in one feed-forward pass. However, this network relies only on the limited features available in the LR space with a stack of single upsampling networks. Because insufficient information is available to restore such high frequencies, it is unrealistic to generate sharp HR images with fine detail at large scale factors.
The cascade model of MFSR and SFSR is proposed to obtain high-performance results for image super-resolution at large scaling factors. When the upscaling factor is an integer that can be factorized, such as 4, there are four structures for performing SR using MFSR, SFSR, or combinations of them, as shown in Figure 1. To the best of our knowledge, the question of how best to combine SFSR and MFSR has not been answered theoretically. Since the actual degradation is complex and varying, learning-based SFSR cannot fully simulate the image degradation process, which may cause incorrect results in actual reconstruction. To reduce error propagation, we suggest applying multi-frame reconstruction first and single-frame reconstruction second (MFSF-SR), while the opposite order, applying SFSR first and MFSR afterwards (SFMF-SR), is analyzed in detail in the subsequent discussion section.
The proposed cascade method consists of two main parts: the variational model-based MFSR and the deep learning-based SFSR. We concatenate the MFSR method with the SFSR method to progressively upsample images to the desired resolution. For the MFSR method, we employ an approach based on the L0-norm regularized intensity and gradient combined prior (L0RIG); for the SFSR method, we employ the enhanced residual back-projection network (ERBPN). Both are introduced in the following subsections.
2.1. Multi-Frame Super-Resolution via the L0-Norm Regularized Intensity and Gradient Combined Prior
As a typical inverse problem, SR is tightly coupled with the degradation model. Generally speaking, the HR image is inevitably corrupted by many factors in the acquisition process, including warping, blurring, subsampling, and additive noise [5]. Modeling this degradation allows us to reconstruct an output image beyond the Nyquist limit of the original imaging device. Super-resolution is an inherently ill-posed inverse problem because the information contained in the observed LR images is not sufficient to determine the HR image. Therefore, it is necessary to impose a specific regularization in order to obtain a stable solution. The model-based methods incorporate prior constraints and estimate the desired HR image by minimizing an objective function derived from the posterior probability.
The observation model relates the ideal HR image to be reconstructed to the observed LR images through a motion matrix, a blur matrix that covers the sensor blur, optical blur, and atmospheric turbulence, and a downsampling matrix, where we assume that the blur is the same for all images acquired of the same scene. The additive noise of the image observation model is usually assumed to be white Gaussian noise. Along each dimension, the HR image is larger than each LR image by the scaling factor. By changing the number of LR images, the model can be applied to either the MFSR or the SFSR task. The MAP-based solution model for the super-resolution problem can be represented by a generalized minimization cost function as follows:
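As a sketch of the generic form such a cost function takes, the block below writes the observation model and the MAP objective in assumed notation: $x$ for the HR image, $y_k$ for the $k$-th of $K$ LR frames, $D$, $B$, and $M_k$ for the downsampling, blur, and motion matrices, $n_k$ for the noise, $R(\cdot)$ for the prior, and $\lambda$ for the regularization parameter. Apart from $\lambda$, these symbols are illustrative choices and may differ from the authors' notation.

```latex
\begin{aligned}
y_k &= D\,B\,M_k\,x + n_k, \qquad k = 1,\dots,K,\\
\hat{x} &= \arg\min_{x}\; \sum_{k=1}^{K} \left\| y_k - D\,B\,M_k\,x \right\|_2^2 + \lambda\,R(x).
\end{aligned}
```

With $K = 1$ the same expression covers the single-frame case, which matches the remark that the model applies to both MFSR and SFSR.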
The first term of the cost function is the data fidelity term, which measures the reconstruction error to ensure that pixels in the reconstructed HR image are close to the real values; the second term is the regularization term, which encodes general prior information about the desired HR image to obtain a robust solution; and λ is the regularization parameter, which provides a tradeoff between the data fidelity term and the regularization term.
In the image processing field, Gaussian noise is the most common assumption because the noise generated in image acquisition usually follows a Gaussian distribution [22]. We assume the noise to be additive white Gaussian noise, so the fidelity term can be characterized by the L2-norm. For the regularization term, Laplacian [37], total variation (TV) [20], and Huber–Markov random field (HMRF) [38] regularization are usually considered first, due to their simplicity and efficiency. Building on the advantages of TV regularization, a combined image prior based on intensity and gradient was proposed for natural images [39], which describes the two-tone distribution characteristics of the gradient statistics. This expression is written as follows:
where ∇ is the gradient operator. As the intensity prior is based on independent pixels rather than the differences between neighboring pixels, it introduces significant noise and artifacts in image restoration. In contrast, the gradient prior is based on the differences between neighboring pixels, and thus enforces smooth results with fewer artifacts. Constraining both the intensity and the gradient exploits the statistical properties of natural images more fully. To effectively preserve detailed texture information and enhance the reconstructed image quality, the intensity and gradient combined prior is employed in the super-resolution reconstruction [40]. We propose an MFSR algorithm via an L0-norm regularized intensity and gradient combined prior (L0RIG) for integration into the cascade model.
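To make the combined prior concrete, the following minimal sketch counts the non-zero intensities and the non-zero gradients of an image, following the form σ‖x‖₀ + ‖∇x‖₀ commonly used for L0-regularized intensity and gradient priors. The weight `sigma`, the forward-difference gradient, the tolerance `tol`, and the function name `l0_intensity_gradient_prior` are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def l0_intensity_gradient_prior(x, sigma=1.0, tol=1e-8):
    """Return sigma * ||x||_0 + ||grad x||_0 for a 2-D image (illustrative)."""
    x = np.asarray(x, dtype=float)
    # Intensity term: number of pixels whose value is non-zero.
    intensity_count = np.count_nonzero(np.abs(x) > tol)
    # Forward differences, zero-padded at the far border to keep the image size.
    gx = np.zeros_like(x)
    gy = np.zeros_like(x)
    gx[:, :-1] = np.diff(x, axis=1)
    gy[:-1, :] = np.diff(x, axis=0)
    # Gradient term: number of pixels with a non-zero gradient vector.
    gradient_count = np.count_nonzero((np.abs(gx) > tol) | (np.abs(gy) > tol))
    return sigma * intensity_count + gradient_count

# A two-tone image with a single vertical edge has very few non-zero gradients.
two_tone = np.zeros((4, 4))
two_tone[:, 2:] = 1.0
prior_value = l0_intensity_gradient_prior(two_tone, sigma=0.5)
```

In this toy case the gradient term counts only the four pixels along the edge, which illustrates why such a prior favors piecewise-constant, sharp-edged images.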
Typically, the geometric registration and the blur can be estimated from the input data and used with the generative model to reconstruct the super-resolved image. Super-resolution becomes very limited without a good estimate of the blur and of the motion between the LR frames. In this work, we compute the warping matrix with the optical flow approach [41] and the blur matrix with the blind blur kernel estimation method [39]. To simplify Equation (1), the product of the downsampling, blur, and warping matrices can be regarded as a single system matrix. By substituting Equation (2) into Equation (1), the following minimization function for solving the MFSR model can be obtained:
Due to the L0 regularization term in Equation (3), the super-resolution model is nonconvex and difficult to solve. As is well known, variable splitting and alternating iterative optimization algorithms are typically used to solve variational models. Based on the variable splitting L0 minimization approach, we adopt the alternating direction method of multipliers (ADMM) algorithm [42] to solve the model. We introduce two auxiliary variables, representing the image intensity and the image gradient, respectively, to move these terms out of the non-differentiable L0-norm expression. The objective function can be rewritten as follows:
By transforming Equation (4) into an unconstrained problem with the augmented Lagrangian algorithm, it can be rewritten:
where the two penalty parameters are initially set to 0.001 and multiplied by 0.9 after each iteration to accelerate convergence. Equation (5) can be solved efficiently by alternately minimizing over the HR image and the two auxiliary variables, each with the other variables fixed. The flowchart of the MFSR via L0-norm regularized intensity and gradient combined prior (L0RIG) algorithm is illustrated in Figure 2.
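In splitting schemes of this kind, each L0 subproblem has a closed-form solution by hard thresholding: minimizing β(u − v)² + λ‖u‖₀ element-wise gives u = v where βv² ≥ λ and u = 0 elsewhere. The sketch below shows that step only; the names `lam` and `beta` and the exact weighting of the terms are assumptions, since the full update rules of the algorithm are not spelled out in the text.

```python
import numpy as np

def l0_hard_threshold(v, lam, beta):
    """Element-wise minimizer of beta * (u - v)**2 + lam * [u != 0]."""
    v = np.asarray(v, dtype=float)
    # Keeping an entry costs lam; setting it to zero costs beta * v**2.
    keep = beta * v ** 2 >= lam
    return np.where(keep, v, 0.0)

v = np.array([0.1, -0.5, 2.0])
u_loose = l0_hard_threshold(v, lam=1.0, beta=1.0)    # only the large entry survives
u_tight = l0_hard_threshold(v, lam=1.0, beta=10.0)   # a larger penalty keeps more entries
```

The larger the penalty parameter relative to the regularization weight, the closer the auxiliary variable stays to the quantity it represents.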
2.2. Single-Frame Super-Resolution Using Enhanced Residual Back-Projection Network
Inspired by the iterative back-projection framework, Haris et al. [43] proposed the deep back-projection network (DBPN), which iteratively uses error feedback from multiple up- and downscaling steps and achieves state-of-the-art SR performance at large scale factors. Since the iterative up/downsampling framework has the advantage of capturing the deep relationships between LR images and the corresponding HR images, it has become a promising framework in the field of SFSR [44].
Figure 3 illustrates the schematic pipeline of the proposed enhanced residual back-projection network (ERBPN), which builds on the basic architecture of the original DBPN [43]. The architecture of ERBPN consists of three parts, namely, the initial feature extraction module, the projection units, and the SR reconstruction module, as described below. Two modifications were made to the projection unit: (1) the down-projection unit was replaced with a downsampling unit; (2) the concatenation operation was replaced with a sequential feature fusion (SFF) operation. These improvements are explained further below.
The first part extracts shallow features from the input LR image with a convolution operation that maps the channels of the input LR image to a set of feature maps. Then, a 1 × 1 convolution layer is used for feature pooling and dimension reduction before the projection units.
The initial feature extraction is followed by a sequence of projection units, which alternate between constructing LR and HR feature maps. The projection units in our proposed framework comprise the up-projection unit and the downsampling unit. An iterative error feedback mechanism repeatedly estimates a correction and applies it to the current estimate of the LR and HR feature maps. Here, the projection errors are used to characterize or constrain the features in early layers. The up-projection unit, shown in Figure 4a, maps the LR feature maps to the HR feature maps. However, obtaining LR feature maps from HR feature maps is intuitively simple and does not require a projection unit based on the iterative error feedback mechanism. Therefore, we simplify the back-projection network with a downsampling unit for faster computation, which has a very simple structure with a convolution layer, as shown in Figure 4b. Note that the input feature maps are concatenated and fused through the sequential feature fusion (SFF) operation before entering each projection unit.
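The data flow of an up-projection unit with error feedback, following the DBPN design [43], can be sketched with fixed operators standing in for the learned layers. In this illustration, `upsample` (pixel replication) and `downsample` (block averaging) are hypothetical stand-ins for the learned deconvolution and convolution layers, so the sketch shows only how the projection error is fed back, not the trained behavior of ERBPN.

```python
import numpy as np

def upsample(x, s=2):
    """Stand-in for a learned deconvolution: replicate each pixel s x s times."""
    return np.kron(x, np.ones((s, s)))

def downsample(x, s=2):
    """Stand-in for a learned strided convolution: average each s x s block."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def up_projection(lr_feat, up=upsample, down=downsample):
    h0 = up(lr_feat)        # initial HR feature estimate
    l0 = down(h0)           # project the estimate back to LR space
    err = l0 - lr_feat      # projection error measured in LR space
    h1 = up(err)            # map the error to HR space
    return h0 + h1          # HR features combined with the error feedback

lr = np.arange(16, dtype=float).reshape(4, 4)
hr = up_projection(lr)
```

With these fixed operators the back-projection is exact, so the error term vanishes; with learned layers the error is generally non-zero and carries the feedback signal.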
The up-projection and downsampling units are densely connected to alleviate the vanishing gradient problem, produce improved features, and encourage feature reuse [14]. The input to each unit is the concatenation of the outputs from all previous units, so that feature maps are generated effectively. Generally speaking, the feature maps generated by different projection units contain different types of HR and LR components, with different impacts on the quality of the results. Therefore, it is necessary to treat these feature maps differently with a feature fusion module [45]. In our framework, the sequential feature fusion (SFF) operation is employed to handle the feature maps discriminatively, integrating them in a sequential manner. Figure 5 illustrates the SFF. Suppose that
represents the
input LR/HR feature map,
denotes the output of the
convolutional layer. Next, we obtain the following equation:
where
.
denotes the number of projection units,
represents the concatenation operation, and
denotes a convolution operation with a 3 × 3 kernel. It is worth pointing out that the SFF has discriminative ability because the feature maps generated by different projection units are processed at different depths of the network. Unlike other networks, our reconstruction directly exploits different types of LR-to-HR features without propagating them through up-projection layers.
Finally, we employ a global residual back-projection block structure. Residual learning helps the network converge faster and makes it easier for the network to generate only the difference between the HR image and the interpolated LR image [29], which addresses the performance degradation caused by the loss of detail across the many layers of a deep network. In our ERBPN framework, the LR image is taken as the input to reduce the computation time. At the last stage, all HR feature maps from the up-projection steps are concatenated and fused with the SFF, then added to the interpolated LR image to generate the final super-resolved image.
The last convolution layer, with a filter size of 3 × 3, is used for image reconstruction, and its output is taken as the reconstructed result of the network. The loss function measures the difference between the recovered SR images and the corresponding ground-truth HR images. The MSE loss between the ground-truth HR images and the reconstructed HR images, averaged over the N training images, is used as the objective function.
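In assumed notation, with $I_i^{HR}$ the $i$-th ground-truth image, $I_i^{SR}$ the corresponding network output, and $N$ the number of training images, an MSE objective of this kind takes the form below. The image symbols and the exact normalization are illustrative choices and may differ from the authors' formulation.

```latex
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left\| I_i^{HR} - I_i^{SR} \right\|_2^2
```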
2.3. Summary of the Proposed Cascade Model for Super-Resolution
In our work, the two-step super-resolution reconstruction method cascades the model-based MFSR method and the deep learning-based SFSR method described above. The MFSR with the L0-norm regularized intensity and gradient combined prior (L0RIG) and the SFSR via the enhanced residual back-projection network (ERBPN) are employed together to reconstruct a more accurate result. Specifically, we first take 16 low-resolution images as the input of the L0RIG method to reconstruct one intermediate super-resolved image, whose dimensions are 2× larger than those of the input LR images. Then, the intermediate super-resolved image is fed into the ERBPN framework to obtain a high-resolution result with better quality. The high-resolution result is 2× larger than the intermediate image, and hence 4× larger than the input LR images. Although we exemplify our super-resolution reconstruction method using a 4× scaling factor, it can be directly extended to other SR scaling factors. The schematic diagram of the proposed cascade method is illustrated in Figure 6.
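The two-stage data flow can be sketched as follows. Both stages are hypothetical placeholders: `mfsr_placeholder` fuses the frames by averaging and `sfsr_placeholder` enlarges by pixel replication, whereas the actual method uses L0RIG and ERBPN. The sketch is meant only to show how 16 LR frames pass through a 2× multi-frame stage and a 2× single-frame stage to reach 4×.

```python
import numpy as np

def mfsr_placeholder(lr_frames, scale=2):
    """Stand-in for the multi-frame stage: fuse all frames, then enlarge by `scale`."""
    fused = lr_frames.mean(axis=0)
    return np.kron(fused, np.ones((scale, scale)))

def sfsr_placeholder(image, scale=2):
    """Stand-in for the single-frame stage: enlarge one image by `scale`."""
    return np.kron(image, np.ones((scale, scale)))

def cascade_sr(lr_frames):
    """Multi-frame first, single-frame second: 2x followed by 2x gives 4x."""
    intermediate = mfsr_placeholder(lr_frames, scale=2)
    return sfsr_placeholder(intermediate, scale=2)

lr_frames = np.random.rand(16, 32, 32)   # 16 LR frames of size 32 x 32
sr_image = cascade_sr(lr_frames)         # one image of size 128 x 128
```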
3. Experiments
To confirm the effectiveness of the proposed MFSF-SR cascade model, this section presents experimental results on both synthesized and real images. We combine the multi-frame L0RIG method with the single-frame ERBPN method to upsample images progressively to the 4× scale factor. The proposed cascade method applies ERBPN directly to the output of L0RIG in a sequential manner: the L0RIG method first reconstructs an image from the LR images, and the resulting image is then independently enhanced using the ERBPN method to obtain a higher-quality output. The two baseline super-resolution reconstruction methods, L0RIG and ERBPN, are also implemented for comparison with the cascade method. In the simulation experiments, the performance of the proposed method under different noise levels is further investigated to verify its robustness to noise. The details are presented in the following sections.
3.1. Data and Training Details
The five grayscale HR images shown in Figure 7 were selected as the test images in the simulation experiments. For each test image, we generated a set of N = 16 images with different subpixel shifts applied before further degradation. Each synthetic sequence of 16 LR images was generated by applying isotropic Gaussian blur to the shifted HR images and then downsampling the rows and columns of each image by a factor of 4.
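A minimal sketch of this degradation pipeline is given below. The specific choices are assumptions for illustration: the 16 shifts are taken as all integer offsets on a 4 × 4 HR-pixel grid (that is, quarter-pixel shifts on the LR grid), shifting is circular, the blur standard deviation is 1.0, and no noise is added. The paper does not state the shifts or the blur width, so these values should not be read as the authors' settings.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    t = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma=1.0, radius=3):
    """Isotropic Gaussian blur applied separably along rows and columns."""
    k = gaussian_kernel_1d(sigma, radius)
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def make_lr_sequence(hr, scale=4, sigma=1.0):
    """Shift, blur, and decimate an HR image into scale * scale LR frames."""
    frames = []
    for dy in range(scale):
        for dx in range(scale):
            # An integer shift on the HR grid is a sub-pixel shift on the LR grid.
            shifted = np.roll(hr, shift=(dy, dx), axis=(0, 1))
            frames.append(gaussian_blur(shifted, sigma)[::scale, ::scale])
    return np.stack(frames)

hr_image = np.random.rand(64, 64)
lr_sequence = make_lr_sequence(hr_image)   # 16 frames, each 16 x 16
```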
In the reconstruction stage of L0RIG, the central frame of the LR sequence is chosen as the reference frame, and the initial HR image is obtained by bicubic interpolation. The regularization parameter λ is determined empirically from numerous experiments to produce the best performance. Since minimizing the objective function with the preconditioned conjugate gradient method usually converges within 30 iterations, the maximum iteration number is set to TS = 30.
In the ERBPN, the filter size in the up-projection unit varies with the scaling factor. For the 2× enlargement, we used a 6 × 6 convolutional layer with a stride of 2 and a padding of 2. For the 4× enlargement, we used an 8 × 8 convolutional layer with a stride of 4 and a padding of 2. In the training phase, we augmented the training data from the DIV2K dataset [46] with random 90°, 180°, and 270° rotations and horizontal and vertical flips [44]. In each mini-batch, 128 degraded LR images with a patch size of 64 × 64 were provided as inputs to the model, and the corresponding HR images served as the ground truth for calculating the loss. The models were optimized using the ADAM optimizer [47] with β1 = 0.9, β2 = 0.999, and ε = 10^−8. The initial learning rate was set to 10^−4 and then halved every 100 epochs. A total of 1000 epochs were used for training the models since more epochs did not bring further improvements. All experiments were implemented using Caffe framework version 1.0.0-rc3 and MATLAB R2022a on an Nvidia RTX GPU (Nvidia, Santa Clara, CA, USA).
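The kernel, stride, and padding settings quoted above are ones for which a transposed convolution enlarges its input by exactly the scaling factor, as the standard output-size formulas confirm. The sketch below checks this and also writes out the step learning-rate schedule; the function names are illustrative, and zero output padding and no dilation are assumed.

```python
def transposed_conv_output_size(n_in, kernel, stride, padding):
    """Output length of a transposed convolution (no output padding, no dilation)."""
    return (n_in - 1) * stride - 2 * padding + kernel

def conv_output_size(n_in, kernel, stride, padding):
    """Output length of a strided convolution (no dilation)."""
    return (n_in + 2 * padding - kernel) // stride + 1

def learning_rate(epoch, base_lr=1e-4, step=100, factor=0.5):
    """Step schedule: the rate is multiplied by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)

# A 64-pixel LR patch is enlarged exactly 2x and 4x by the two settings.
size_2x = transposed_conv_output_size(64, kernel=6, stride=2, padding=2)   # 128
size_4x = transposed_conv_output_size(64, kernel=8, stride=4, padding=2)   # 256
```

The same settings used in a strided convolution reduce the size by the same factor, which is why they suit paired up- and downsampling layers.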
Image enhancement or visual quality improvement can be subjective because the perception of image quality varies from person to person. For this reason, it is necessary to establish quantitative measures for comparing image enhancement algorithms. To assess the quality of the super-resolved results, two classical evaluation criteria, the peak signal-to-noise ratio (PSNR, in dB) and the structural similarity index measure (SSIM), were chosen to measure the performance of the different super-resolution methods [48]. For both measures, a higher value indicates better quality of the reconstructed image.
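For reference, PSNR follows directly from the mean squared error between the reference and the reconstruction. The sketch below assumes 8-bit images with a peak value of 255; the function name and the default peak are illustrative, and the paper does not state the intensity range it used.

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal size."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(reconstructed, dtype=float)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

reference = np.zeros((8, 8))
off_by_one = reference + 1.0              # every pixel differs by one gray level
score = psnr(reference, off_by_one)       # about 48.13 dB
```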
3.2. Experiments on Synthetic Data
L0RIG and ERBPN are the baseline methods of the proposed MFSF-SR. For comparison with the cascade model, each baseline reconstructs the image in a single upsampling step at an upscaling factor of 4, rather than step by step. For a fair comparison, we run the SFSR method on all 16 simulated LR images and compute the mean metric over the reconstruction outcomes; in this way, the method is fed with the same data as the MFSR method. Additionally, a bicubic interpolation of the LR reference frame is included for comparison.
Table 1 shows the quantitative performance comparison in terms of PSNR and SSIM for the five simulated images presented in Figure 7 with the different methods. For comparison, the L0RIG and ERBPN algorithms each reconstruct directly at 4× enlargement. The output of the cascade model is a super-resolved central frame four times the size of the original LR images.
We analyzed the simulated experimental results from both subjective and objective perspectives. Quantitatively, as displayed in Table 1, the proposed cascade model yields the best scores in the evaluation metrics among all the compared methods. In the experiment with the butterfly image, the PSNR values are 25.073 dB for L0RIG, 26.006 dB for ERBPN, and 26.863 dB for MFSF-SR. These quantitative results confirm the effectiveness of the MFSF-SR cascade model. From a subjective perspective, the red rectangles show zoomed regions of the restored images for comparing the qualitative performance of the different methods. L0RIG performs reasonably well, but some edges are oversmoothed. ERBPN produces good contrast through the up- and down-projection units, but there are some unnatural artifacts around fine edges. The result of the proposed MFSF-SR method contains more details and fewer blurred contours than those of L0RIG and ERBPN.
Furthermore, in the experiment with the parrot image, the PSNR value for the proposed MFSF-SR is 30.636 dB, which is 1.896 dB and 0.731 dB better than L0RIG and ERBPN, respectively. As displayed in Figure 8, images reconstructed with the MFSF-SR cascade model preserve the HR components, which contain more details, with few additional artifacts. As a simple comparison, in the bottom row of Figure 8, the enlarged region of the L0RIG result shows a misinterpreted area of diagonal stripes caused by ringing artifacts. This shows that MFSF-SR can preserve the low-frequency content and reliably restore the high-frequency details by combining the inter-frame information with the externally learned prior. From both the qualitative and quantitative analyses, most of the results show that MFSF-SR with a two-step reconstruction recovers more high-frequency information than the baseline methods at large magnification factors.
To further assess the robustness of the proposed method to different noise levels, the zebra image from the BSD68 dataset [49] was also selected as a synthesized test image, degraded by warping, blurring, downsampling, and additive white Gaussian noise (AWGN) at different noise levels. For the color image sequence of the synthesized zebra image, we first converted the color input to the YCbCr space and then reconstructed the luminance component with the super-resolution algorithm.
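The color-space step can be sketched as follows. The full-range BT.601 (JPEG-style) conversion used here is an assumption; the paper does not state which YCbCr variant was applied, and other implementations use a limited-range scaling. The function names are illustrative.

```python
import numpy as np

# Full-range BT.601 coefficients (assumed variant), for RGB values in [0, 255].
RGB_TO_YCBCR = np.array([
    [0.299, 0.587, 0.114],
    [-0.168736, -0.331264, 0.5],
    [0.5, -0.418688, -0.081312],
])

def rgb_to_ycbcr(rgb):
    """Convert an H x W x 3 RGB image to YCbCr (full-range BT.601)."""
    ycc = np.asarray(rgb, dtype=float) @ RGB_TO_YCBCR.T
    ycc[..., 1:] += 128.0   # center the two chroma channels
    return ycc

def luminance(rgb):
    """Return the Y channel, the component passed to the SR algorithm."""
    return rgb_to_ycbcr(rgb)[..., 0]

gray_pixel = np.array([[[100.0, 100.0, 100.0]]])
ycc_pixel = rgb_to_ycbcr(gray_pixel)   # a gray pixel keeps its value in Y
```

In pipelines of this kind, the chroma channels are typically enlarged by simple interpolation and recombined with the super-resolved luminance, although the text does not say how the chroma was handled here.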
The quantitative reconstruction results of the different methods on the color zebra image under different noise levels are shown in Table 2, where the proposed MFSF-SR method achieves favorable PSNR and SSIM results at all noise levels. Figure 9 shows the quantitative performance comparison in terms of PSNR and SSIM for the zebra images under different noise levels. Specifically, in the experiment with a noise variance of 0.005, the proposed method outperforms all the compared methods with a result of 29.22 dB, which is 0.907 dB and 1.325 dB better than L0RIG and ERBPN, respectively. Furthermore, the performance advantage is more pronounced at high noise levels, which indicates that the proposed method adapts effectively to different noise characteristics.
For these simulation experiments, Figure 10 shows the HR reconstruction results for the different methods at a scale factor of 4×. The green boxes show zoomed regions for comparing the performance of the different methods. As the partial enlargement shows, the L0RIG method achieves a better trade-off between removing noise and preserving edges, but it is not able to recover the lost fine details. Undesired edge artifacts can be found in the results of the ERBPN method, which produces artificial edges on flat surfaces and fails to suppress the noise in the details of the image. In Figure 10, the result of the proposed method shows clear details and fewer ringing effects. Specifically, distorted content, e.g., the stripes on the zebra, is finely restored by the proposed two-step cascade model. Overall, the MFSF-SR cascade model performs favorably compared with the baseline methods in this experiment. This demonstrates that cascading L0RIG and ERBPN substantially improves the final super-resolved image over each individual baseline method.
In conclusion, the qualitative and quantitative analyses show that, in most cases, the cascade model produces more high-frequency information than the L0RIG and ERBPN methods. The MFSF-SR method works better in both the noisy and noise-free cases. It can reliably recover high-frequency details with higher consistency and less contrast loss, while preserving strong edges and contours with few additional artifacts. Its results were perceived as the most informative and natural.
3.3. Experiments on Real Data
In addition to the above experiments on synthetic test images, we also conducted experiments on real images to demonstrate the effectiveness of the proposed MFSF-SR cascade model. The real grayscale image sequences Car and Eia are part of the Multi-Dimensional Signal Processing Research Group (MDSP) benchmark dataset [
50], which is the most widely used dataset for testing the performance of multi-frame super-resolution methods. In our experiment, 16 frames from these two image sequences were used as the low-resolution input images. The central frame in the sequence was set as the reference frame for the reconstruction.
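The frame selection described above can be sketched as follows. This is a minimal illustration only: the helper `split_reference` and the placeholder frame names are hypothetical. For an even number of frames the "central" frame is ambiguous, so the sketch assumes the frame at index `len(frames) // 2`, which may differ from the convention used in the experiment.

```python
def split_reference(frames):
    """Split a frame sequence into a central reference frame and the rest."""
    # Assumption: for an even frame count, take the upper of the two middle frames.
    ref_index = len(frames) // 2
    reference = frames[ref_index]
    others = frames[:ref_index] + frames[ref_index + 1:]
    return ref_index, reference, others


# Placeholder names standing in for 16 low-resolution frames.
frames = [f"frame_{i:02d}" for i in range(16)]
ref_index, reference, others = split_reference(frames)
print(ref_index, reference, len(others))
```

The remaining frames would then be registered to the reference frame before fusion.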
Since no ground-truth HR image is available for the real sequences, we used two no-reference image quality metrics, the natural image quality evaluator (NIQE) [
51] and the perception-based image quality evaluator (PIQE) [
52] to further evaluate the quality of the real image SR results. Smaller values of NIQE and PIQE indicate better SR results.
Figure 11 provides a visual comparison of the super-resolved results for the Car and Eia images with a magnification factor of 4. The red rectangles show zoomed regions of the restored images to compare the qualitative performance of the different methods. The experimental results on real image sequences show that our method improves both the objective metrics and the visual quality. The MFSF-SR method achieves comparable or even better performance than the baseline methods in terms of quantitative evaluation. For a real-world image, the downsampling kernel is unknown and complicated; thus, the performance of non-blind SR methods is severely affected. Nevertheless, our method can produce visually pleasing images and effectively suppress the errors caused by noise, registration, and poor estimation of the unknown PSF kernels.
From the top row of
Figure 11, we can observe that the Car sequence is a challenging example because the LR Car images are severely degraded by blur and noise, with a complicated noise model. The bicubic interpolation result is too blurry for the content to be recognized, while the L0RIG and ERBPN algorithms produce better visual results. Compared with bicubic interpolation, the other methods are more effective in improving spatial resolution because they use LR frame sequences or external prior knowledge in the reconstruction. With its L0-norm regularization constraint, the L0RIG algorithm favors a smooth result, but important edges and textures are also oversmoothed. In contrast, the result of ERBPN suffers from visible ghosting artifacts and is seriously affected by staircase effects. As expected, the MFSF-SR algorithm has the best visual performance, with clear edges and fewer artifacts, and it effectively removes noise in the smooth areas of the image. Meanwhile, as shown in the bottom row of
Figure 11, the proposed method gives the most visually pleasing results in terms of both sharpness and naturalness. The L0RIG algorithm suppresses noise well, but it over-smooths the image, resulting in the loss of edge information. In contrast, ERBPN produces results with sharp edges, but it cannot recover a clean HR image because of artifacts. In summary, the proposed MFSF-SR cascade model is capable of generating clean and sharp HR images at a large scale factor without any hallucination of fine details. It consistently demonstrated its effectiveness and superiority in the experiments conducted in this study.