1. Introduction
High spatiotemporal resolution remote sensing (RS) images play a pivotal role in various applications, including, but not limited to, crop growth monitoring [1,2], land cover change detection [3,4,5,6,7,8], and land cover classification [9,10,11]. However, due to technical and budgetary constraints, obtaining RS data with both high spatial and high temporal resolution is challenging [12], which limits advanced RS applications. For instance, the Moderate Resolution Imaging Spectroradiometer (MODIS) provides observations at spatial resolutions ranging from 250 to 1000 m with a global revisit time of nearly one day. In comparison, Landsat acquires images at a higher spatial resolution of 30 m but with smaller scene coverage and a revisit time of up to 16 days. Although recent satellite systems, such as Sentinel-2, have made it easier to obtain time series of high-resolution RS images, challenges such as frequent cloud contamination persist [13]. To overcome the trade-off between spatial and temporal resolution in RS images, spatiotemporal fusion methods (STFMs) combine satellite images with low spatial resolution but high frequency (e.g., MODIS, referred to as coarse images) and satellite images with high spatial resolution but low frequency (e.g., Landsat, referred to as fine images) to create satellite image time series with both high spatial and high temporal resolution [14,15]. These fusion techniques give researchers and practitioners access to data with enhanced spatiotemporal characteristics, facilitating more accurate and comprehensive analysis in various environmental and land-related studies. To aid reader comprehension, the primary abbreviations used in this article are compiled in the Abbreviations section.
2. Related Works
The current STF methods are primarily categorized into three main approaches: weighted function-based, unmixing-based, and learning-based algorithms [16]. Weighted function-based methods employ linear combinations of the input image information to obtain refined pixel values. For example, STARFM [17] employs a moving window to search for pixels similar to the central pixel and assigns weights based on their spatial, spectral, and temporal similarities to reflect their respective contributions. ESTARFM [18] builds on STARFM by introducing variable conversion coefficients and modifying the similar-pixel search to improve performance at heterogeneous sites with many mixed pixels. OBSTFM [19], in turn, targets regions whose variations do not alter object shapes, taking the actual distribution of surface features into account: it applies segmentation to generate surface objects with good similarity and uniformity and then searches for and weighs similar pixels within each object. Unmixing-based methods posit that a coarse pixel comprises fine pixels of various land-cover types and employ linear spectral mixing theory to decompose the coarse pixels. Maselli [20] used a moving-window approach to account for spatiotemporal variations in pixel reflectance, applying distance weighting within the window so that pixels closer to the target pixel received higher weights. When selecting endmembers, Busetto et al. [21] considered both the spatial and spectral differences between pixels and determined each pixel's weight in the linear unmixing model from these similarities. There are also hybrid methods that integrate the two approaches above. FSDAF [22] addresses rapidly changing regions by using unmixing principles to obtain the residuals between the predicted fine image and the fine image on the reference date, and it incorporates weight functions to improve applicability in scenarios with fast-changing land cover types. The Fit-FC algorithm [23] combines model fitting (Fit), spatial filtering (F), and residual compensation (C) to handle scenes with significant changes while constraining the impact of the unmixing process on the results, further improving fusion accuracy.
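To make the weighted-function idea concrete, the following minimal NumPy sketch (a toy illustration, not the published STARFM algorithm; the window size and weighting scheme are assumptions) predicts a fine image at date tp from a fine image at t0 and the corresponding coarse pair:

```python
import numpy as np

def weighted_fusion_sketch(fine_t0, coarse_t0, coarse_tp, win=5):
    """Toy STARFM-style prediction: F_tp ~ F_t0 + (C_tp - C_t0),
    blended over a moving window with similarity/distance weights.
    Inputs are single-band float arrays resampled to a common grid."""
    h, w = fine_t0.shape
    pad = win // 2
    f0 = np.pad(fine_t0, pad, mode="reflect")
    c0 = np.pad(coarse_t0, pad, mode="reflect")
    cp = np.pad(coarse_tp, pad, mode="reflect")
    pred = np.zeros(fine_t0.shape, dtype=float)
    # inverse-distance weights: pixels nearer the window center count more
    yy, xx = np.mgrid[-pad:pad + 1, -pad:pad + 1]
    dist_w = 1.0 / (1.0 + np.hypot(yy, xx))
    for i in range(h):
        for j in range(w):
            fw = f0[i:i + win, j:j + win]
            cw0 = c0[i:i + win, j:j + win]
            cwp = cp[i:i + win, j:j + win]
            # spectral similarity: fine/coarse agreement at the base date
            spec_w = 1.0 / (1.0 + np.abs(fw - cw0))
            wgt = spec_w * dist_w
            wgt /= wgt.sum()
            # apply the coarse temporal change to the fine base image
            pred[i, j] = np.sum(wgt * (fw + (cwp - cw0)))
    return pred
```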
There are two main categories of learning-based methods: those based on sparse representation and those based on deep learning (DL). Algorithms based on sparse representation typically train coupled dictionaries for high and low spatial resolutions in either the image or the frequency domain, and reconstruct finely detailed images for the prediction date through sparse coding [24,25,26,27]. However, owing to the inherent computational complexity of sparse learning and the difficulty of extracting sufficient local structural information from large input patches, these methods struggle to accurately preserve object shapes [28].
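For intuition, here is a hypothetical sketch of the coupled-dictionary idea using scikit-learn: atoms are learned jointly on concatenated coarse/fine patches, and a new fine patch is reconstructed by sparse-coding its coarse counterpart. All patch sizes and hyperparameters are assumptions for illustration, not those of the cited methods:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
low_patches = rng.random((500, 64))    # e.g., 8x8 coarse patches, flattened
high_patches = rng.random((500, 256))  # e.g., 16x16 fine patches, flattened

# Learn a joint dictionary on concatenated low/high-resolution patches so
# that both resolutions share one sparse code per atom.
joint = np.hstack([low_patches, high_patches])
dl = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, random_state=0)
dl.fit(joint)
D_low, D_high = dl.components_[:, :64], dl.components_[:, 64:]

# Sparse-code new coarse patches against the low-resolution atoms...
codes = sparse_encode(low_patches[:5], D_low, algorithm="lasso_lars", alpha=1.0)
# ...and reconstruct the corresponding fine patches with the high-res atoms.
fine_pred = codes @ D_high
```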
DL-based methods can be categorized into three main groups: those based on Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Vision Transformers (ViTs) [29]. Convolution performs local cross-correlation over the image, allowing the model to learn relative spatial positional information, and CNN-based STFMs extract representative image features by stacking multiple convolutional layers. For example, Song et al. [30] employed a super-resolution convolutional neural network (SRCNN) to reconstruct fine images from their coarse counterparts, achieving remarkable improvements in image quality. To further enhance image details, DCSTFN [31] simultaneously extracts features from fine and coarse images and merges these features using equations that account for temporal ground cover changes. Considering the information loss inherent in deconvolution-based reconstruction, EDCSTFN [32] goes a step further by incorporating residual encoding blocks and employing a composite loss function to strengthen the network's learning capability, thereby improving the fidelity of the fused images. The two-stream convolutional neural network spatiotemporal fusion model (StfNet) incorporates temporal dependence to predict unknown fine difference images and establishes a temporal constraint on the relationship between time series, ensuring the uniqueness and authenticity of the fusion results [33]. GAN-based STFMs predict RS data through the interplay of a generator and a discriminator, driving the predicted data distribution as close as possible to the actual one. CycleGAN-STF [23] utilizes CycleGAN to select generated images and enhances the selected images with the wavelet transform. GAN-STFM [34] introduces a conditional GAN (CGAN) and switchable normalization to address the spatiotemporal fusion problem, reducing the required input data and enhancing model flexibility. MLFF-GAN [35] combines multi-level feature fusion with a GAN to generate fused images: it incorporates Adaptive Instance Normalization (AdaIN) blocks to learn the global distribution relationships between multi-temporal images and employs an Attention Module (AM) to learn local information weights for minor regional variations. A crucial component of ViT-based STFMs is the self-attention mechanism, which captures global information and compensates for the inherently narrower receptive fields of CNNs. MSNet is a multi-stream STFM based on ViT and CNN; it combines the global temporal correlation learning of the transformer with the feature extraction capability of convolutional neural networks through an average-weighted fusion approach [36]. SwinSTFM [13] is a novel algorithm based on the Swin Transformer [37] and linear spectral mixing theory; it fully leverages the Swin Transformer's advantages in feature extraction and integrates the unmixing theory into a model built on the self-attention mechanism.
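The global receptive field of self-attention comes at a cost that motivates the limitations discussed below: attention scores are computed between every pair of image tokens. A minimal, self-contained sketch (toy dimensions, identity projections for brevity) makes the quadratic growth visible:

```python
import torch

def self_attention(x):                      # x: (tokens, dim)
    """Minimal scaled dot-product self-attention over image tokens."""
    q, k, v = x, x, x                       # identity projections for brevity
    scores = q @ k.T / x.shape[-1] ** 0.5   # (tokens, tokens): quadratic size
    return torch.softmax(scores, dim=-1) @ v

tokens_256 = torch.randn(256, 32)    # 16x16 patch grid -> 256x256 score matrix
tokens_1024 = torch.randn(1024, 32)  # 32x32 patch grid -> 16x more score entries
_ = self_attention(tokens_256)
_ = self_attention(tokens_1024)
```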
Table 1 presents the strengths and weaknesses of the methods from different categories.
However, DL-based STFMs have certain limitations. Firstly, in ViT-based STFMs, self-attention often neglects fine-grained pixel-level internal structural features, leading to the loss of shallow-level features [28]. Moreover, the high resolution of RS images enlarges the input, and the computational complexity of self-attention grows quadratically with the input size. Secondly, STFMs based on CNNs and GANs typically use 2D convolutions for feature extraction, which introduces two limitations: (1) 2D convolutions may lose channel-dimension information, and (2) the resulting hybrid features may be ill-suited to reconstructing multispectral RS images whose spectral reflectance differs significantly across bands. The detailed analysis is depicted in Figure 1: the input multispectral RS images are passed through a 2D convolution-based encoder to generate hybrid features containing information from all bands of the input images, and a 2D convolution-based decoder then reconstructs the individual output image bands from these hybrid features. However, 2D convolution sums the convolution results of all input channels into a single-channel output feature map, losing channel-dimension information. Furthermore, multi-band prediction generates a comprehensive feature representation containing information from every band. Such hybrid features can serve subsequent image analysis and processing tasks, such as feature classification, target detection, and change detection, but they may lose or blur high-frequency detail, which hinders the reconstruction of multispectral RS images with significant differences in spectral reflectance between bands.
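The channel-mixing behavior just described can be observed directly from tensor shapes. The following PyTorch snippet is purely illustrative (the band count and channel widths are hypothetical, not our network's configuration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 6, 64, 64)            # 6 spectral bands as input channels

# 2D conv: every output map sums contributions from ALL input bands,
# so band-specific information is merged into hybrid features.
conv2d = nn.Conv2d(in_channels=6, out_channels=32, kernel_size=3, padding=1)
print(conv2d(x).shape)                   # torch.Size([1, 32, 64, 64])

# 3D conv: the band axis is kept as an explicit spatial-spectral dimension,
# so the kernel slides along it instead of collapsing it.
x3d = x.unsqueeze(1)                     # (batch, 1, bands, H, W)
conv3d = nn.Conv3d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
print(conv3d(x3d).shape)                 # torch.Size([1, 32, 6, 64, 64])
```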
To address the above issues, we propose an RS image STFM that combines single-band and multi-band prediction (SMSTFM). First, we employ a Single-Band Prediction Module (SBPM) to generate an initial fused image, thereby avoiding the generation of hybrid features. Next, we employ a Multi-Band Prediction Module (MBPM), which focuses on the channel-dimension information of the input images to enhance the details of the preliminary fused image. In the SBPM, feature extraction is mainly performed by convolutional modules based on ConvNeXt, an advanced convolutional module redesigned according to the ViT architecture [38]. Compared to ViT, it is concise in design, computationally efficient, and achieves comparable performance. In the MBPM, we replace the 2D convolutions in ConvNeXt with 3D convolutions to better extract spatial-spectral features. The main contributions of our study are summarized as follows:
The paper introduces an STFM to address two challenges within the current STF framework. Our proposed STFM consists of two key modules: the Single-Band Prediction Module (SBPM) and the Multi-Band Prediction Module (MBPM). The SBPM generates the initial fused image, eliminating the need for hybrid feature generation; the MBPM is then employed to extract channel-wise information that enhances the details of the preliminary fused image.
In the SBPM, feature extraction primarily relies on ConvNeXt owing to its concise architecture, computational efficiency, and strong performance relative to ViT. In the MBPM, we replace the 2D convolutions in ConvNeXt with 3D convolutions to better extract spatial-spectral features (a minimal sketch of such a block follows this list).
We evaluated and compared different models on three datasets with distinct characteristics: CIA, LGC, and Nanjing. The resolution difference between coarse and fine images is 16× for the former two and 3× for the latter.
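As referenced in the contribution list, the following is a minimal, hypothetical PyTorch sketch of a ConvNeXt-style block with its 2D convolutions replaced by 3D ones; all dimensions are illustrative and do not reproduce the exact MBPM design:

```python
import torch
import torch.nn as nn

class ConvNeXt3dBlockSketch(nn.Module):
    """Minimal ConvNeXt-style block with 2D convs swapped for 3D ones.
    Hyperparameters are illustrative, not the paper's exact MBPM design."""
    def __init__(self, dim: int):
        super().__init__()
        # depthwise 3D conv mixes spatial-spectral neighborhoods per channel
        self.dwconv = nn.Conv3d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # applied channels-last
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # pointwise expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # pointwise projection

    def forward(self, x):                        # x: (B, C, bands, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 4, 1)             # to channels-last
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 4, 1, 2, 3)             # back to channels-first
        return x + residual

block = ConvNeXt3dBlockSketch(dim=16)
out = block(torch.randn(1, 16, 6, 32, 32))       # input shape is preserved
```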
The remainder of this manuscript is organized as follows: Section 3 provides an overview of the SMSTFM, outlining its overall structure and specific internal modules. Section 4 presents our results, including a description of the datasets, the experimental procedures, and their analysis. Section 5 constitutes our discussion, while Section 6 offers the conclusion.
5. Discussion
In this section, we begin by conducting ablation experiments to explore the roles of various components within the SMSTFM. Following that, we provide a brief analysis of the efficiency of the SMSTFM compared to other STFMs.
Three ablation models are designed to demonstrate the advantages of the SMSTFM among deep learning-based methods. First, we train the SBPM independently, a standalone experiment referred to as "SBPM", which allows us to evaluate the SBPM without the additional components of the SMSTFM. To assess the impact of ConvNeXt on the network structure, we replace the ConvNeXt module with a standard convolution while keeping the rest of the structure unchanged; this sub-method is denoted "No-ConvNeXt". To examine the influence of Split-3d on the network structure, we remove Split-3d from the 3D-CBlock, denoting this sub-method "No-Split-3d". The results of the ablation experiments on the LGC dataset are listed in Table 5. The SMSTFM achieves the best results on almost all metrics and shows significant improvement over the SBPM-only variant, i.e., the model without the MBPM. This result highlights the importance of the supplementary information predicted by the MBPM. Furthermore, Figure 16 displays the prediction images of the SBPM and the SMSTFM. Although the SBPM shares a similar structure with the SMSTFM, its output noticeably lacks spectral details. This observation underscores the contribution of the MBPM, which enhances the spectral details and improves the overall fusion performance.
Table 6 illustrates the models' efficiency using two key metrics: the number of model parameters and Multiply-Accumulate Operations (MACs). The parameter count reflects a model's complexity and storage requirements, while the MACs measure its computational complexity and resource utilization. Models with fewer parameters and lower MACs are generally considered more efficient, requiring less storage space and fewer computational resources.
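Both metrics can be computed straightforwardly. The sketch below counts trainable parameters and derives the MACs of a single convolution layer from first principles (the layer sizes are hypothetical, chosen only for illustration):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Number of trainable parameters (the first metric in Table 6)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def conv2d_macs(layer: nn.Conv2d, out_h: int, out_w: int) -> int:
    """MACs of one Conv2d: one multiply-accumulate per kernel element,
    per input channel (within its group), per output channel and position."""
    kh, kw = layer.kernel_size
    return (layer.in_channels // layer.groups) * kh * kw \
        * layer.out_channels * out_h * out_w

conv = nn.Conv2d(6, 32, kernel_size=3, padding=1)
print(count_params(conv))          # 6*32*3*3 + 32 = 1760
print(conv2d_macs(conv, 64, 64))   # 6*3*3*32*64*64 = 7,077,888
```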
Table 6 shows that SwinSTFM, based on ViT, has the largest number of parameters, approximately 11 times that of the SMSTFM. Although the SMSTFM has more parameters than earlier models such as EDCSTFN and GAN-STFM, its parameter count is significantly smaller than those of recent models such as SwinSTFM and MLFF-GAN. In terms of MACs, the SMSTFM again falls within the moderate range and consumes notably fewer resources than SwinSTFM. It is worth noting that, although the SMSTFM is mid-range in efficiency, it achieves the highest prediction accuracy; it therefore strikes a good balance between model efficiency and accuracy.
6. Conclusions
This article proposes an innovative spatiotemporal fusion model, the SMSTFM, which combines single-band and multi-band prediction to achieve superior fusion results. Existing STFMs that use CNNs for feature extraction produce hybrid features and lose channel-dimension information, while ViT-based STFMs incur significant computational overhead. Compared to these models, the SMSTFM has the following advantages:
Our model addresses the issue of hybrid features and channel-dimension information loss by cascading the SBPM and the MBPM. The SBPM establishes a mapping from low-resolution to high-resolution images, generating preliminary fusion results without hybrid features. The MBPM then efficiently extracts channel-wise information from the preliminary fusion results to enhance fusion details.
ConvNeXt, designed based on the ViT architecture, and its variants serve as the feature extraction modules in our model. Compared to ViT, ConvNeXt maintains high performance while reducing computational cost.
Furthermore, significant performance improvements were observed on datasets with 16× and 3× resolution differences between coarse and fine images, highlighting the robustness and versatility of our proposed approach.
Our strategy for channel-wise feature extraction may serve as a valuable reference for tasks involving multispectral and hyperspectral remote sensing imagery. However, one limitation of the SMSTFM is that cascading the SBPM and MBPM may slow inference, an aspect we aim to improve in the future.