1. Introduction
Remote sensing (RS) has become an indispensable technology for various applications, including agricultural surveys, global surface monitoring, and climate change detection. However, owing to the limitations of RS devices, atmospheric disturbances, and other uncertain factors, it is hard to obtain images at the desired resolution [1]. Low-resolution RS images are gradually becoming an obstacle to many advanced tasks, such as finer-scale land cover classification [2], object recognition [3], and precise road extraction [4].
Super-resolution (SR) methods are devoted to improving image resolution beyond the limits of the acquisition equipment [5]. Compared to updating image acquisition devices, SR has the advantages of low cost, easy implementation, and high efficiency. Remote sensing super-resolution (RS-SR) approaches can be roughly divided into two categories: single-image SR (SISR) and multi-image SR (MISR). The former requires only one image of the target scene to generate the high-resolution output [5], while the latter requires multiple images that differ in terms of satellite, viewing angle, or sensor. For example, the authors of [6] improved the spatial resolution of Landsat images by fusing them with SPOT5 images; the authors of [7] fused the information of multi-angle images for super-resolution; and the authors of [8,9] merged multispectral images with panchromatic images to generate images with both high spatial and high spectral resolution. MISR usually requires multiple low-resolution (LR) images of the same region as input, which are difficult to collect in practical applications. Additionally, the feature extraction and fusion processes across various resolutions and sensors are time-consuming, which restricts the application of these techniques in real scenarios. Therefore, SISR is typically used.
According to the evolutionary trend and complexity of the methods, we roughly divide SISR approaches into interpolation-based and machine learning-based approaches. Additionally, as deep learning methods (a subset of machine learning) have boomed in recent years and achieved great success in SISR, we treat deep learning-based methods as a third category, separate from traditional machine learning methods.
The main idea of interpolation-based SISR is to locate each pixel of an HR image to be restored in the corresponding LR images and to interpolate the pixel's value accordingly [10]. Bicubic, bilinear, and nearest-neighbor interpolation approaches are commonly used, and some novel interpolation methods are available [11,12,13]. Interpolation-based approaches offer a simple and fast way to improve image resolution [14]. However, they restore the missing values from a local perspective; thus, the generated images usually lack detailed information.
Machine-learning-based approaches attempt to overcome the shortcomings of interpolation-based approaches through a data-driven mechanism. The neighbor embedding method [15], sparsity-based methods [16,17,18,19], local regression [20], the self-similarity algorithm [21], anchored neighborhood regression [22,23], and naive Bayes [24] are effective machine learning SISR approaches. Accurate representation of image features is key to the success of machine learning methods [25]. However, the expressive ability of handcrafted features is limited; therefore, it is difficult to handle complex data with high quality and high resolution.
In recent years, deep learning-based methods have demonstrated powerful nonlinear expression and deep feature extraction capabilities. The authors of [26] first introduced the convolutional neural network (CNN) to solve SISR tasks and showed excellent performance. Subsequently, many CNN-based methods emerged, such as those in [27,28,29]. Residual learning [25,28,30,31,32,33,34,35] has been proposed to relieve the difficulty of training deeper networks and to improve SR performance. Some models use prior information, such as edges [36] and segmentation probability maps [37], to improve the details and fidelity of the super-resolved images. Deep Laplacian pyramid networks (LapSRN) [38], EnhanceNet [39], and the enhanced deep residual network (EDSR) [40] have all achieved great success in SISR.
However, several challenges remain. First, precisely preserving geographic information such as terrain, structure, and edge details is of great significance for RS-SR, as this information can strongly affect the accuracy of subsequent analyses. The image gradient, which sensitively reflects changes in the small details of an image [41], is highly important for RS images, and many RS applications have utilized this gradient information. For instance, the authors of [42] used the gradient map to represent the topographic surface, and the authors of [43] used the gradient map to extract object boundaries for satellite image classification.
Moreover, the available supervised learning methods perform much better when the test images and the training set are highly similar; however, if the test images differ substantially from the training set, the results may be strongly affected. Because RS images are acquired from different sensors, their spectral, temporal, and spatial resolutions are different. It is difficult to include all scenarios in the training set, which limits the practicality of the existing supervised models.
In view of the above problems, the deep gradient-aware network with image-specific enhancement (DGANet-ISE) method is proposed to obtain higher resolution RS images. Specifically, an enhanced deep residual network is constructed to learn the relationship between low-resolution (LR) and high-resolution (HR) images. During the training phase, a new gradient-aware loss is proposed for this network to promote the extraction of more high-frequency information and to generate HR images with more detail. Additionally, to cope with inexperienced images, an image-specific enhancement (ISE) approach is designed for the test phase to further improve SR performance. Each test LR image is input into DGANet to obtain a preliminary super-resolved HR image, which is then further improved via the ISE method. No additional information is needed in this module, which focuses on the specific characteristics of each single image to realize adaptive enhancement. In summary, our contributions are fourfold:
- (1)
A new SISR method, DGANet-ISE, which includes a deep gradient-aware network and an image-specific enhancement approach, is proposed to improve the spatial resolution of RS images.
- (2)
A deep gradient-aware network is proposed to model the complex relationship between LR and HR images. A new gradient-aware loss is designed in the training process to preserve more image gradient information in the super-resolved RS images.
- (3)
This paper proposes an image-specific enhancement approach to further improve the SISR performance of RS images and to enhance the generalization capability and the adaptability of our method when facing inexperienced images.
- (4)
Three data sets are used to evaluate the performance of DGANet-ISE. The experimental results of comparisons with 14 methods indicate the superiority of DGANet-ISE.
2. Materials and Methods
The objective of SISR is to construct an HR image from an LR image $I^{LR}$. Representing the target HR image and the estimated image as $I^{HR}$ and $\hat{I}^{HR}$, respectively, the more similar $I^{HR}$ and $\hat{I}^{HR}$ are, the better the SR effect is. The images have $c$ color channels, and $W$ and $H$ are the width and the height of the LR image, respectively. $t$ refers to the upscaling factor, so the HR image is of size $tW \times tH$. In this section, a detailed description of the proposed method is presented. In addition, the evaluation criteria and comparison methods are introduced.
2.1. Overview of the Proposed Method
In this work, a new RS-SR method, DGANet-ISE, is proposed, as illustrated in
Figure 1. The method mainly involves training and testing phases. The training phase aims at learning the complex relationships between LR images and the corresponding HR images. DGANet is the core element of this phase, and it is based on the enhanced residual network. This model employs residual and skip connections to devise a deep architecture, and it exhibits strong feature representation and nonlinear fitting abilities. As geographic information such as terrain and edges is highly important for RS image interpretation, the precise preservation of more geographic details in the super-resolved images should be a focus of research. However, $L_1$ and $L_2$ losses generate blurry images in generation problems [44]. Therefore, we innovatively propose a new gradient-aware loss to alleviate this problem. The gradient-aware loss imposes gradient constraints to focus on the high-frequency signals in RS images, such as boundaries, edges, and terrain. As such, the proposed model can generate HR images with more detailed geographic information.
In the testing phase, the trained DGANet model is applied to generate primary HR results. However, if the testing images differ substantially from the training set (e.g., the images collected from different satellites), the performances of existing supervised learning methods will be strongly affected. To address this problem, an unsupervised learning-based enhancement approach, ISE, is introduced to further improve the performance and generate final HR images. ISE uses each input LR image to supplement more global information, and it has several advantages: (a) no additional supplementary data are required; (b) it focuses on the specific information and features of each test image, i.e., image-specific enhancement; (c) it boosts the generalization performance on inexperienced data sets.
2.2. Structure of DGANet
DGANet is proposed to extract the deep features of an LR image and upscale LR feature maps into HR output, as shown in
Figure 2. The network contains different types of layers, including convolutional (Conv) layers, rectified linear unit (ReLU) layers, and pixel-shuffle layers. Based on these layers, residual blocks (ResBlocks) and upsample blocks (UpBlocks) are constructed. The ResBlocks are employed to build a deep network, and the UpBlocks are used to efficiently transform low-resolution feature maps to the high-resolution size.
ResBlock: In each ResBlock, there are two Conv layers and one ReLU layer. The output is obtained by summing the input and the result obtained through the Conv and ReLU layers, as shown in
Figure 2.
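As a concrete illustration, the ResBlock can be sketched in PyTorch as follows. This is a minimal sketch rather than the authors' released code; the feature width `n_feats = 64` is an assumption (the actual configuration is given in Table 1).

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: Conv -> ReLU -> Conv plus an identity skip (Figure 2)."""
    def __init__(self, n_feats=64):  # n_feats = 64 is an assumed channel width
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_feats, n_feats, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # The output is the sum of the input and the Conv-ReLU-Conv result
        return x + self.body(x)
```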
UpBlock: To transform the input feature maps to the target HR image, the UpBlocks are constructed based on one Conv layer and one pixel-shuffle layer.
Here, $r$ represents the upscale factor of a single block. The Conv layer transforms the feature maps to $r^2 c$ channels, and the pixel-shuffle layer is a periodic shuffling operator that rearranges the elements of a tensor of shape $H \times W \times r^2 c$ into a tensor of shape $rH \times rW \times c$ (Figure 3). Each UpBlock upscales by a factor of 2. Therefore, if the upscale factor is 2, one UpBlock is used; if the upscale factor is larger, multiple UpBlocks are used to increase the size gradually.
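A corresponding PyTorch sketch of the UpBlock is shown below, with $r = 2$ per block as stated above; `n_feats` is again an assumed channel width.

```python
import torch.nn as nn

class UpBlock(nn.Module):
    """Upsample block: the Conv expands channels by r^2, and PixelShuffle
    rearranges them into an r-times larger spatial grid (Figure 3)."""
    def __init__(self, n_feats=64, r=2):  # each UpBlock upscales 2x
        super().__init__()
        self.conv = nn.Conv2d(n_feats, n_feats * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)  # (N, r^2*c, H, W) -> (N, c, rH, rW)

    def forward(self, x):
        return self.shuffle(self.conv(x))
```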
According to
Figure 2, the whole network mainly consists of four modules: low-level features are extracted from the input image by Conv layers in the first module; the ResBlocks in the second module learn more complex and deeper features; the third module transforms the feature maps from the LR domain to the HR domain; and the last module is the global residual block, in which the input image is directly interpolated to high resolution to provide abundant global information about the input image.
Taking an upscale factor of 8 as an example (the size of the input LR image is 32 × 32), Table 1 gives an exact representation of the components of DGANet. The kernel size of each Conv layer is 3 × 3. Five ResBlocks are used to sufficiently extract the deep features in this network. For the global residual block, the bilinear interpolation algorithm is used to directly transform the input image into an image with the target resolution. In addition, the Adam optimizer was used to train the model; the loss function is introduced in the next section. The learning rate was 0.0001 and was halved every 40 epochs, and the batch size in the training phase was 32.
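Combining the blocks, the overall model and training setup can be sketched as follows (reusing the `ResBlock` and `UpBlock` sketches above). This is a hedged reconstruction of the described architecture, not the released implementation; `n_feats = 64` remains an assumption.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DGANet(nn.Module):
    """Four modules: shallow Conv features, five ResBlocks, chained 2x UpBlocks,
    and a bilinear global residual branch."""
    def __init__(self, scale=8, n_feats=64, n_resblocks=5, n_colors=3):
        super().__init__()
        self.scale = scale
        self.head = nn.Conv2d(n_colors, n_feats, 3, padding=1)  # module 1: low-level features
        self.body = nn.Sequential(*[ResBlock(n_feats) for _ in range(n_resblocks)])  # module 2
        self.up = nn.Sequential(*[UpBlock(n_feats) for _ in range(int(math.log2(scale)))])  # module 3
        self.tail = nn.Conv2d(n_feats, n_colors, 3, padding=1)

    def forward(self, x):
        out = self.tail(self.up(self.body(self.head(x))))
        # Module 4: global residual -- bilinear interpolation of the input image
        skip = F.interpolate(x, scale_factor=self.scale, mode="bilinear", align_corners=False)
        return out + skip

model = DGANet(scale=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate every 40 epochs, as described above
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.5)
```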
2.3. Gradient-Aware Loss
As images usually change quickly at the boundaries between objects, the image gradient is significant in boundary detection [41]. Therefore, gradient information is often used to extract object edges [43] and to reflect changes in surface topography [42] from RS images. The $L_1$ and $L_2$ losses are widely used in deep learning applications and are expressed as

$$L_1 = \frac{1}{n}\sum_{i=1}^{n}\left\| \hat{I}_i^{HR} - I_i^{HR} \right\|_1, \qquad L_2 = \frac{1}{n}\sum_{i=1}^{n}\left\| \hat{I}_i^{HR} - I_i^{HR} \right\|_2^2,$$

where $n$ is the number of samples in one batch. The batch size was 32 in this work.
Although they can accurately represent the global pixel difference between images, they usually lead to smooth and blurred results [45], and little attention is paid to the important structure and edge information of RS images. Therefore, we propose a new gradient-aware loss ($L_{GA}$) that facilitates the preservation of more edge and structure information and generates sharper HR images. The Sobel operator ($S$) [46] can effectively detect edges, and it enhances high spatial frequency details. Therefore, this operator is applied to generate the gradient maps of the target image $I^{HR}$ and the predicted image $\hat{I}^{HR}$:

$$G = S(I^{HR}), \qquad \hat{G} = S(\hat{I}^{HR}).$$

The gradient-aware loss is defined as the mean over the $n$ samples; the mean absolute error is used to evaluate the difference between the gradient maps $G$ and $\hat{G}$, as shown below:

$$L_{GA} = \frac{1}{n}\sum_{i=1}^{n}\left\| \hat{G}_i - G_i \right\|_1.$$
To maintain the balance between the global pixel error and the gradient error, a hyperparameter $\lambda$ is introduced. From the experiments, we found that $\lambda$ works best at 0.1 (as shown in Appendix A). Therefore, $\lambda = 0.1$ was used in this paper, and the overall loss function can be formulated as

$$L = L_1 + \lambda \cdot L_{GA}.$$
By using gradient-aware loss, more high-frequency information, such as edges and terrain, will be preserved in the super-resolution process, and sharper HR images with higher accuracy will be generated.
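For concreteness, the gradient-aware loss can be sketched in PyTorch as below. This is a hedged sketch: the Sobel kernels are standard, but the exact color handling, padding, and choice of pixel-wise term in the original implementation are assumptions.

```python
import torch
import torch.nn.functional as F

# Standard 3x3 Sobel kernels; SOBEL_Y is the transpose of SOBEL_X
SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel_gradients(img):
    """Apply the Sobel operator channel-wise to produce gradient maps."""
    c = img.shape[1]
    kx = SOBEL_X.to(img.device).repeat(c, 1, 1, 1)
    ky = SOBEL_Y.to(img.device).repeat(c, 1, 1, 1)
    gx = F.conv2d(img, kx, padding=1, groups=c)
    gy = F.conv2d(img, ky, padding=1, groups=c)
    return gx, gy

def gradient_aware_loss(sr, hr):
    """Mean absolute error between the gradient maps of SR and HR images."""
    sr_gx, sr_gy = sobel_gradients(sr)
    hr_gx, hr_gy = sobel_gradients(hr)
    return F.l1_loss(sr_gx, hr_gx) + F.l1_loss(sr_gy, hr_gy)

def total_loss(sr, hr, lam=0.1):
    # Overall loss: pixel-wise term plus the lambda-weighted gradient term
    return F.l1_loss(sr, hr) + lam * gradient_aware_loss(sr, hr)
```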
2.4. Image-Specific Enhancement
The performance of supervised methods largely depends on the training data set. If the test images differ substantially from the training set, the performance on inexperienced data will be greatly affected. As for RS images, they differ in terms of sensors, times, places, colors, types, and resolutions. It is hard to collect training samples that cover all scenarios, resulting in the insufficient generalization ability of many supervised SR methods. In this paper, the ISE algorithm is proposed to provide an effective solution to improve the generalization and adaptability of our method.
The core strategy of ISE is to back-project the error between the emulated and actual LR images onto the SR image and to update it iteratively. This approach is inspired by the iterative back-projection (IBP) algorithm [47]. The specific procedure of ISE is presented in Algorithm 1, where $k$ is the iteration number and $K$ is the largest iteration number. The inputs of ISE are the original LR input $I^{LR}$ and the predicted HR image $\hat{I}^{HR}$ obtained from DGANet. First, $\hat{I}^{HR}$ is downsampled to the LR domain ($\hat{I}^{LR}$). The difference image ($D^{LR}$) between $I^{LR}$ and $\hat{I}^{LR}$ is then calculated. Next, the difference image is upsampled to the HR size ($D^{HR}$), and $\hat{I}^{HR}$ is updated by adding $D^{HR}$ to the predicted image. This process is repeated until the difference is sufficiently small or $K$ has been reached. The bilinear interpolation algorithm was used for the upsampling and downsampling operations in this paper.
ISE is based on the assumption that, if the estimated HR image is closer to the target image, then the LR image $\hat{I}^{LR}$ derived from the estimated HR image should be more similar to the input LR image. By backward-projecting the difference image between $I^{LR}$ and $\hat{I}^{LR}$ onto the super-resolved HR image, more of this difference can be accounted for in the estimated HR image, and better SR results can thereby be obtained.
Algorithm 1. ISE
Input: LR input ($I^{LR}$); predicted HR image ($\hat{I}^{HR}$)
Output: Enhanced HR image ($\hat{I}^{HR}$)
Initialize $k = 1$
While $k \le K$ and $D^{LR}$ is not sufficiently small:
 1. $\hat{I}^{LR} = \mathrm{Downsample}(\hat{I}^{HR})$
 2. $D^{LR} = I^{LR} - \hat{I}^{LR}$
 3. $D^{HR} = \mathrm{Upsample}(D^{LR})$
 4. $\hat{I}^{HR} = \hat{I}^{HR} + D^{HR}$
 5. $k = k + 1$
Return $\hat{I}^{HR}$
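A minimal Python sketch of Algorithm 1 follows, using bilinear resampling as stated above; the maximum iteration count `max_iters` ($K$) and the stopping threshold `tol` are placeholders, as their values are not specified here.

```python
import torch.nn.functional as F

def ise(lr, sr, max_iters=10, tol=1e-4):
    """Image-specific enhancement: back-project the LR-domain error onto the
    super-resolved image and update it iteratively. lr, sr: (N, C, H, W)."""
    for _ in range(max_iters):
        # Step 1: downsample the current SR estimate to the LR domain
        sr_lr = F.interpolate(sr, size=lr.shape[-2:], mode="bilinear", align_corners=False)
        # Step 2: difference between the input LR image and the emulated LR image
        diff_lr = lr - sr_lr
        if diff_lr.abs().mean() < tol:  # difference is sufficiently small
            break
        # Steps 3-4: upsample the difference and add it to the SR estimate
        diff_hr = F.interpolate(diff_lr, size=sr.shape[-2:], mode="bilinear", align_corners=False)
        sr = sr + diff_hr
    return sr
```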
The main advantage of ISE is that it requires no additional images for training. Furthermore, as ISE focuses on the characteristics of each single image, image-specific information is used to further improve the quality of the super-resolved HR image. In this way, the limitations imposed by the training data set can be effectively alleviated.
2.5. Evaluation Criteria and Baselines
2.5.1. Evaluation Criteria
The mean squared error (MSE), peak signal-to-noise ratio (PSNR) [48], and structural similarity index (SSIM) [49] are used to evaluate the performance of the models, which are expressed as follows:

$$\mathrm{MSE} = \frac{1}{tW \cdot tH}\sum_{i=1}^{tW}\sum_{j=1}^{tH}\left( I^{HR}(i,j) - \hat{I}^{HR}(i,j) \right)^2,$$

$$\mathrm{PSNR} = 20 \log_{10}\left( \frac{MAX_I}{\mathrm{RMSE}} \right),$$

$$\mathrm{SSIM} = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},$$

where $I^{HR}$ is the target high-resolution image; $\hat{I}^{HR}$ is the super-resolved image generated from the low-resolution image; $tW$ and $tH$ are the width and the height of the HR image, respectively; $MAX_I$ represents the maximum pixel value in the original image; $\mathrm{RMSE}$ is the root mean squared error; $\mu_x$ and $\mu_y$ are the average pixel values of $I^{HR}$ and $\hat{I}^{HR}$, respectively; $\sigma_x^2$ and $\sigma_y^2$ are the variances of $I^{HR}$ and $\hat{I}^{HR}$, respectively; and $\sigma_{xy}$ is the covariance of $I^{HR}$ and $\hat{I}^{HR}$. Moreover, $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$, where both variables are used to stabilize the division with a weak denominator, and $L$ is the dynamic range of the pixel values. The default values of $k_1$ and $k_2$ are 0.01 and 0.03, respectively. MSE is commonly used to measure the error of super-resolved images; an MSE closer to 0 implies higher model accuracy. PSNR is measured in decibels (dB). SSIM is used to measure the similarity between two images [5]; the larger the values of PSNR and SSIM, the better the SR performance.
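As an illustration, the three criteria can be computed along the following lines (a sketch; SSIM is shown in its global single-window form for brevity, whereas the reference implementation [49] averages over local windows, so values will differ slightly):

```python
import numpy as np

def mse(hr, sr):
    """Mean squared error between the target and super-resolved images."""
    return np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)

def psnr(hr, sr, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 20 * log10(MAX_I / RMSE)."""
    return 20 * np.log10(max_val / np.sqrt(mse(hr, sr)))

def ssim_global(hr, sr, L=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM with the default k1 = 0.01 and k2 = 0.03."""
    x, y = hr.astype(np.float64), sr.astype(np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (x.var() + y.var() + c2)
    return num / den
```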
2.5.2. Methods to be Compared
Fourteen widely used SISR methods, which are shown in Table 2, were compared with DGANet-ISE. These SR methods use three-channel LR input images to generate super-resolved images; the input and output schemes were the same as those of the proposed DGANet-ISE. The theory and characteristics of these methods are described in the corresponding references shown in Table 2.
3. Results
In this section, the data sets and implementation details are first introduced. Subsequently, the performance of DGANet-ISE is verified through comparison with 14 different SR methods, and cross-database tests are performed to evaluate the generalization ability of our method.
3.1. Data Sets and Implementation Details
3.1.1. Data Sets
Three different data sets were employed to verify the effectiveness and superiority of DGANet-ISE.
RSI-CB256 [55] contains 35 categories and about 24,000 images, which were collected for scene classification. This data set is rather challenging, as the scenes differ widely. The images are 256 × 256 pixels with spatial resolutions of 0.3–3 m. Figure 4 shows samples from 10 categories of the RSI-CB256 data set.
UCMerced [56] consists of 21 classes of land use images, including agricultural, airplane, beach, buildings, etc. Each class has 100 images of 256 × 256 pixels, and the spatial resolution is 1 foot (approximately 0.3 m).
Landsat-test: Landsat images are widely used, and the super-resolution of these images is of great value for many applications, such as finer land cover monitoring. The test images used in this paper were Landsat 5 TM data of band 7 (2.08–2.35 µm), band 4 (0.76–0.90 µm), and band 2 (0.52–0.60 µm). The data set was downloaded from the Google Earth Engine (https://developers.google.com/earth-engine/datasets/catalog/LANDSAT_LT05_C01_T1). The spatial resolution is 30 m. During the experiment, we cut the imagery into approximately 400 small images of 256 × 256 pixels. This data set differs from the RSI-CB256 and UCMerced data sets in two aspects: (1) the spatial resolution of the Landsat data set is much coarser than that of the other two data sets, resulting in very different geographic information and scene content within the same image size; (2) the first two data sets are artificially curated into categories, while the Landsat-test data set consists of real, unselected scenes.
3.1.2. Implementation Details
As there are not enough corresponding LR–HR image pairs in reality, we downsampled the images using the bicubic interpolation (BCI) algorithm with a factor of $t$ to obtain LR images of different scales. Many studies, such as [29,35], have used BCI to generate LR–HR image pairs. The RSI-CB256 data set was used to compare our model with commonly used SISR methods, and all images were randomly divided into training samples (80%) and test samples (20%). Furthermore, in order to evaluate the generalization ability of our model, the Landsat-test data set was utilized for a cross-database test. That is, the models trained on RSI-CB256 were directly applied to SR experiments on Landsat-test without any tuning. This is difficult because the data were collected from different sensors and have different spatial resolutions.
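For instance, the LR–HR pairs can be generated along these lines (a sketch using Pillow; the authors' exact preprocessing pipeline is not specified):

```python
from PIL import Image

def make_lr_hr_pair(path, t=4):
    """Downsample an HR image by factor t with bicubic interpolation (BCI)
    to form an LR-HR training pair."""
    hr = Image.open(path).convert("RGB")  # e.g., a 256 x 256 RS image
    w, h = hr.size
    lr = hr.resize((w // t, h // t), Image.BICUBIC)
    return lr, hr
```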
In addition, the UCMerced data set was randomly split into two balanced halves for training and testing, according to [35], which is consistent with other RS-SR methods, including CNN-7 [25], LGCNet [25], and DCM [29]. Therefore, we compared our method with these RS-SR methods using the UCMerced data set.
The interpolation-based and deep learning-based methods were implemented in Python, and the machine learning-based methods were implemented in MATLAB. In addition, the models were used with the default settings suggested by the authors. The generation of LR images and the calculation of the evaluation criteria were implemented in the same Python environment to ensure the consistency and accuracy of the results.
3.2. Comparison with Baselines
In this section, we describe experiments on the RSI-CB256 and UCMerced data sets. UCMerced is widely used in RS-SR; therefore, we compared our method with the RS-SR methods reported in [29], which were also evaluated on the UCMerced data set. The upscale factors were 2 and 4, and the PSNR and SSIM results are shown in Table 3. In addition, Figure 5 provides a more vivid presentation of the comparison of the PSNR and SSIM results.
According to Table 3, DGANet-ISE yields the best results among the recent RS-SR methods. Although the PSNR values of our method and DCM are close, our SSIM results are higher than those of all of the compared approaches. The experimental results indicate that our method excels at reconstructing the structure of RS images.
Additionally, another experiment was conducted on the RSI-CB256 data set, and DGANet-ISE was compared with three kinds of SISR methods: interpolation-based methods (nearest-neighbor interpolation (NNI), bilinear interpolation (BLI), and BCI), machine learning-based methods (simple functions (SF), neighbor embedding (NE), and classical sparsity-based super-resolution (SCSR)), and deep learning-based methods (super-resolution convolutional neural network (SRCNN), efficient sub-pixel convolutional network (ESPCN), and enhanced deep super-resolution network (EDSR)). The experiments were conducted with three different upscale factors, i.e., t = 2, 4, and 8. The detailed results are presented in Table 4. In addition to the quantitative assessments, visual results are provided for a qualitative and intuitive evaluation. Three test images of a parking lot, a residence, and a dam were chosen as examples, and Figure 6, Figure 7 and Figure 8 show the super-resolved HR images for t = 2, 4, and 8, respectively.
From a global perspective, it is obvious that as the upscale factor increases, the accuracy gradually decreases, because achieving super-resolution from a lower-resolution image is much more difficult. Furthermore, according to Table 4, DGANet-ISE significantly outperforms the baselines. With the exception of DGANet-ISE, EDSR achieves the best results among the compared methods. The PSNR values of DGANet-ISE are 0.33 dB, 0.26 dB, and 0.18 dB higher than those of EDSR when t is 2, 4, and 8, respectively.
BCI, SCSR, and DGANet-ISE are the most prominent methods within the interpolation-based, machine learning-based, and deep learning-based categories, respectively. For example, when t = 2, the MSE values of BCI and SCSR are 69.74 and 48.01, respectively, which are much smaller than those of the other methods in the same categories. Among the deep learning methods, our proposed model exhibits superior performance on the SISR task: its MSE and PSNR results are the best among all of the methods. For instance, when t = 2, the error of our proposed method is the smallest (MSE = 31.26), while its PSNR (37.92 dB) and SSIM (0.9477) are the largest.
Focusing on the visual results (Figure 6, Figure 7 and Figure 8), the HR images super-resolved by the different methods vary in terms of their characteristics. For example, the images obtained from NNI and NE exhibit regular dense squares, similar to mosaics, which results in very blurred edges in these super-resolved images. The reason is that both algorithms rely heavily on the values of neighboring pixels or patches and ignore other significant structural details. NE performs better because it assumes that the LR patches and the corresponding HR patches share similar local geometries.
In addition, the images super-resolved via BLI and SCSR have very blurred boundaries (Figure 8d,h). A thorough analysis revealed that both the BLI and SCSR algorithms use linear features for SISR, which results in the emergence of stripes as t increases. The basic idea of SCSR is that a signal can be represented as a sparse linear combination with respect to an overcomplete dictionary, and its linear feature extraction operator behaves like a high-pass filter for the feature representation of LR image patches [52]. In this way, the SCSR images have sharper edges than those of NE and BLI.
The HR images obtained by the deep learning methods (Figure 7i–l) exhibit smoother textures than those of the other two kinds of methods. DGANet-ISE yields the most competitive visual results, as they are the most similar to their ground-truth HR counterparts. Our proposed DGANet-ISE recovers more texture and structure details, such as the edges of the dam in Figure 8; the edges of the dam generated by our method are very clear, whereas the edges are blurred or even missing in the other images. This is because the gradient-aware loss proposed in our method captures more gradient information from the RS images, which facilitates the generation of sharper edges and textures. In summary, the proposed method performs the best among all compared methods.
3.3. Cross-Database Test
To further evaluate the robustness and generalization capability of DGANet-ISE, the Landsat-test data set was used for a cross-database test; i.e., the models trained on the RSI-CB256 data set were directly used to super-resolve the Landsat-test data set without any fine-tuning or modification. The results of the comparison between the different approaches under the three upscale factors are presented in Table 5, and Figure 9 shows the visual results of the different methods on an example image.
The Landsat-test data set differs substantially from the training samples: the spatial resolution of the training samples is 0.3–3 m, while that of the Landsat-test images is 30 m, and the scene contents within the same image size are very different. Therefore, directly applying these supervised learning models to such test images poses a great challenge.
According to the results, our proposed method shows a reasonably satisfactory performance compared to the other methods. For instance, when t = 2, the PSNR is 40.70 dB, which is 0.64 dB higher than that of EDSR and 1.57 dB higher than that of SCSR. This is because the ISE algorithm focuses on the personalized features of each test image and provides a fast additional improvement to the super-resolved HR image obtained from the deep gradient-aware network. In summary, DGANet-ISE exhibits strong robustness and generalization capability on new data sets.