1. Introduction
High spatiotemporal resolution remote sensing (RS) images play a pivotal role in various applications, including, but not limited to, crop growth monitoring [1,2], land cover change detection [3,4,5,6,7,8], and land cover classification [9,10,11]. However, due to technical and budgetary constraints, obtaining RS data with both high spatial and high temporal resolution is challenging [12], which limits advanced RS applications. For instance, the Moderate Resolution Imaging Spectroradiometer (MODIS) provides observations at spatial resolutions ranging from 250 to 1000 m with a global revisit time of nearly one day. In comparison, Landsat acquires images at a higher spatial resolution of 30 m but with smaller scene coverage and a revisit time of up to 16 days. Although recent satellite systems, such as Sentinel-2, have made it easier to obtain time series of high-resolution RS images, challenges such as frequent cloud contamination persist [13]. To overcome the trade-off between spatial and temporal resolution in RS images, spatiotemporal fusion methods (STFMs) combine satellite images with low spatial resolution but high frequency (e.g., MODIS, referred to as coarse images) and satellite images with high spatial resolution but low frequency (e.g., Landsat, referred to as fine images) to create satellite image time series with both high spatial and high temporal resolution [14,15]. These fusion techniques give researchers and practitioners access to data with enhanced spatiotemporal characteristics, facilitating more accurate and comprehensive analysis in various environmental and land-related studies. To aid reader comprehension, the primary abbreviations used in this article are compiled in the Abbreviations section.
2. Related Works
The current STF methods are primarily categorized into three main approaches: weighted function-based, unmixing-based, and learning-based algorithms [16]. Weighted function-based methods employ linear combinations of the input image information to obtain refined pixel values. For example, STARFM [17] employs a moving window to search for pixels similar to the central pixel and assigns weights based on their spatial, spectral, and temporal similarities to reflect their respective contributions. ESTARFM [18] builds on STARFM by introducing variable conversion coefficients and modifying the similar-pixel search to improve performance at heterogeneous sites with many mixed pixels. OBSTFM [19], in turn, targets regions whose variations do not alter object shapes, taking the actual distribution of surface features into account: it applies segmentation to generate surface objects with good similarity and uniformity and then searches for and weighs similar pixels within each object. Unmixing-based methods posit that a coarse pixel comprises fine pixels of various land-cover types and employ linear spectral mixing theory to decompose the coarse pixels. Maselli [20] used a moving-window approach to account for spatiotemporal variations in pixel reflectance, applying distance weighting within the window so that pixels closer to the target pixel received higher weights. When selecting endmembers, Busetto et al. [21] considered both the spatial and spectral differences between pixels and determined each pixel's weight in the linear unmixing model from these similarities. There are also hybrid methods that integrate the two approaches above. FSDAF [22] addresses rapidly changing regions by using unmixing principles to obtain the residuals between the predicted fine image and the fine image on the reference date, and it incorporates weight functions to improve applicability in scenarios with fast-changing land cover types. The Fit-FC algorithm [23] combines model fitting (Fit), spatial filtering (F), and residual compensation (C) to handle scenes with significant changes while constraining the impact of the unmixing process on the results, further improving fusion accuracy.
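To make the weighted-function idea concrete, the following minimal NumPy sketch (a toy illustration, not the published STARFM algorithm; the window size and weighting scheme are assumptions) predicts a fine image at date tp from a fine image at t0 and the corresponding coarse pair:

```python
import numpy as np

def weighted_fusion_sketch(fine_t0, coarse_t0, coarse_tp, win=5):
    """Toy STARFM-style prediction: F_tp ~ F_t0 + (C_tp - C_t0),
    blended over a moving window with similarity/distance weights.
    Inputs are single-band float arrays resampled to a common grid."""
    h, w = fine_t0.shape
    pad = win // 2
    f0 = np.pad(fine_t0, pad, mode="reflect")
    c0 = np.pad(coarse_t0, pad, mode="reflect")
    cp = np.pad(coarse_tp, pad, mode="reflect")
    pred = np.zeros(fine_t0.shape, dtype=float)
    # inverse-distance weights: pixels nearer the window center count more
    yy, xx = np.mgrid[-pad:pad + 1, -pad:pad + 1]
    dist_w = 1.0 / (1.0 + np.hypot(yy, xx))
    for i in range(h):
        for j in range(w):
            fw = f0[i:i + win, j:j + win]
            cw0 = c0[i:i + win, j:j + win]
            cwp = cp[i:i + win, j:j + win]
            # spectral similarity: fine/coarse agreement at the base date
            spec_w = 1.0 / (1.0 + np.abs(fw - cw0))
            wgt = spec_w * dist_w
            wgt /= wgt.sum()
            # apply the coarse temporal change to the fine base image
            pred[i, j] = np.sum(wgt * (fw + (cwp - cw0)))
    return pred
```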
There are two main categories of learning-based methods: those based on sparse representation and those based on deep learning (DL). Algorithms based on sparse representation typically train coupled dictionaries for high and low spatial resolutions in either the image or the frequency domain, and reconstruct finely detailed images for the prediction date through sparse coding [24,25,26,27]. However, owing to the inherent computational complexity of sparse learning and the difficulty of extracting sufficient local structural information from large input patches, these methods struggle to accurately preserve object shapes [28].
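For intuition, here is a hypothetical sketch of the coupled-dictionary idea using scikit-learn: atoms are learned jointly on concatenated coarse/fine patches, and a new fine patch is reconstructed by sparse-coding its coarse counterpart. All patch sizes and hyperparameters are assumptions for illustration, not those of the cited methods:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
low_patches = rng.random((500, 64))    # e.g., 8x8 coarse patches, flattened
high_patches = rng.random((500, 256))  # e.g., 16x16 fine patches, flattened

# Learn a joint dictionary on concatenated low/high-resolution patches so
# that both resolutions share one sparse code per atom.
joint = np.hstack([low_patches, high_patches])
dl = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, random_state=0)
dl.fit(joint)
D_low, D_high = dl.components_[:, :64], dl.components_[:, 64:]

# Sparse-code new coarse patches against the low-resolution atoms...
codes = sparse_encode(low_patches[:5], D_low, algorithm="lasso_lars", alpha=1.0)
# ...and reconstruct the corresponding fine patches with the high-res atoms.
fine_pred = codes @ D_high
```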
DL-based methods can be categorized into three main groups: those based on Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Vision Transformers (ViTs) [29]. Convolution performs local cross-correlation over the image, allowing the model to learn relative spatial positional information, and CNN-based STFMs extract representative image features by stacking multiple convolutional layers. For example, Song et al. [30] employed a super-resolution convolutional neural network (SRCNN) to reconstruct fine images from their coarse counterparts, achieving remarkable improvements in image quality. To further enhance image details, DCSTFN [31] simultaneously extracts features from fine and coarse images and merges these features using equations that account for temporal ground cover changes. Considering the information loss inherent in deconvolution-based reconstruction, EDCSTFN [32] goes a step further by incorporating residual encoding blocks and employing a composite loss function to strengthen the network's learning capability, thereby improving the fidelity of the fused images. The two-stream convolutional neural network spatiotemporal fusion model (StfNet) incorporates temporal dependence to predict unknown fine difference images and establishes a temporal constraint on the relationship between time series, ensuring the uniqueness and authenticity of the fusion results [33]. GAN-based STFMs predict RS data through the interplay of a generator and a discriminator, driving the predicted data distribution as close as possible to the actual one. CycleGAN-STF [23] utilizes CycleGAN to select generated images and enhances the selected images with the wavelet transform. GAN-STFM [34] introduces a conditional GAN (CGAN) and switchable normalization to address the spatiotemporal fusion problem, reducing the required input data and enhancing model flexibility. MLFF-GAN [35] combines multi-level feature fusion with a GAN to generate fused images: it incorporates Adaptive Instance Normalization (AdaIN) blocks to learn the global distribution relationships between multi-temporal images and employs an Attention Module (AM) to learn local information weights for minor regional variations. A crucial component of ViT-based STFMs is the self-attention mechanism, which captures global information and compensates for the inherently narrower receptive fields of CNNs. MSNet is a multi-stream STFM based on ViT and CNN; it combines the global temporal correlation learning of the transformer with the feature extraction capability of convolutional neural networks through an average-weighted fusion approach [36]. SwinSTFM [13] is a novel algorithm based on the Swin Transformer [37] and linear spectral mixing theory; it fully leverages the Swin Transformer's advantages in feature extraction and integrates the unmixing theory into a model built on the self-attention mechanism.
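The global receptive field of self-attention comes at a cost that motivates the limitations discussed below: attention scores are computed between every pair of image tokens. A minimal, self-contained sketch (toy dimensions, identity projections for brevity) makes the quadratic growth visible:

```python
import torch

def self_attention(x):                      # x: (tokens, dim)
    """Minimal scaled dot-product self-attention over image tokens."""
    q, k, v = x, x, x                       # identity projections for brevity
    scores = q @ k.T / x.shape[-1] ** 0.5   # (tokens, tokens): quadratic size
    return torch.softmax(scores, dim=-1) @ v

tokens_256 = torch.randn(256, 32)    # 16x16 patch grid -> 256x256 score matrix
tokens_1024 = torch.randn(1024, 32)  # 32x32 patch grid -> 16x more score entries
_ = self_attention(tokens_256)
_ = self_attention(tokens_1024)
```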
Table 1 presents the strengths and weaknesses of the methods from different categories.
However, DL-based STFMs have certain limitations. Firstly, in ViT-based STFMs, self-attention often neglects fine-grained pixel-level internal structural features, leading to the loss of shallow-level features [28]. Moreover, the high resolution of RS images enlarges the input, and the computational complexity of self-attention grows quadratically with the input size. Secondly, STFMs based on CNNs and GANs typically use 2D convolutions for feature extraction, which introduces two limitations: (1) 2D convolutions may lose channel-dimension information, and (2) the resulting hybrid features may be ill-suited to reconstructing multispectral RS images whose spectral reflectance differs significantly across bands. The detailed analysis is depicted in Figure 1: the input multispectral RS images are passed through a 2D convolution-based encoder to generate hybrid features containing information from all bands of the input images, and a 2D convolution-based decoder then reconstructs the individual output image bands from these hybrid features. However, 2D convolution sums the convolution results of all input channels into a single-channel output feature map, losing channel-dimension information. Furthermore, multi-band prediction generates a comprehensive feature representation containing information from every band. Such hybrid features can serve subsequent image analysis and processing tasks, such as feature classification, target detection, and change detection, but they may lose or blur high-frequency detail, which hinders the reconstruction of multispectral RS images with significant differences in spectral reflectance between bands.
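The channel-mixing behavior just described can be observed directly from tensor shapes. The following PyTorch snippet is purely illustrative (the band count and channel widths are hypothetical, not our network's configuration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 6, 64, 64)            # 6 spectral bands as input channels

# 2D conv: every output map sums contributions from ALL input bands,
# so band-specific information is merged into hybrid features.
conv2d = nn.Conv2d(in_channels=6, out_channels=32, kernel_size=3, padding=1)
print(conv2d(x).shape)                   # torch.Size([1, 32, 64, 64])

# 3D conv: the band axis is kept as an explicit spatial-spectral dimension,
# so the kernel slides along it instead of collapsing it.
x3d = x.unsqueeze(1)                     # (batch, 1, bands, H, W)
conv3d = nn.Conv3d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
print(conv3d(x3d).shape)                 # torch.Size([1, 32, 6, 64, 64])
```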
To address the above issues, we propose an RS image STFM that combines single-band and multi-band prediction (SMSTFM). First, we employ a Single-Band Prediction Module (SBPM) to generate an initial fused image, thereby avoiding the generation of hybrid features. Next, we employ a Multi-Band Prediction Module (MBPM), which focuses on the channel-dimension information of the input images to enhance the details of the preliminary fused image. In the SBPM, feature extraction is mainly performed by convolutional modules based on ConvNeXt, an advanced convolutional module redesigned according to the ViT architecture [38]. Compared to ViT, it is concise in design, computationally efficient, and achieves comparable performance. In the MBPM, we replace the 2D convolutions in ConvNeXt with 3D convolutions to better extract spatial-spectral features. The main contributions of our study are summarized as follows:
The paper introduces an STFM to address two challenges within the current STF framework. Our proposed STFM consists of two key modules: the Single-Band Prediction Module (SBPM) and the Multi-Band Prediction Module (MBPM). The SBPM generates the initial fused image, eliminating the need for hybrid feature generation; the MBPM is then employed to extract channel-wise information that enhances the details of the preliminary fused image.
In the SBPM, feature extraction primarily relies on ConvNeXt owing to its concise architecture, computational efficiency, and strong performance relative to ViT. In the MBPM, we replace the 2D convolutions in ConvNeXt with 3D convolutions to better extract spatial-spectral features (a minimal sketch of such a block follows this list).
We evaluated and compared different models on three datasets with distinct characteristics: CIA, LGC, and Nanjing. The resolution difference between coarse and fine images is 16× for the former two and 3× for the latter.
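As referenced in the contribution list, the following is a minimal, hypothetical PyTorch sketch of a ConvNeXt-style block with its 2D convolutions replaced by 3D ones; all dimensions are illustrative and do not reproduce the exact MBPM design:

```python
import torch
import torch.nn as nn

class ConvNeXt3dBlockSketch(nn.Module):
    """Minimal ConvNeXt-style block with 2D convs swapped for 3D ones.
    Hyperparameters are illustrative, not the paper's exact MBPM design."""
    def __init__(self, dim: int):
        super().__init__()
        # depthwise 3D conv mixes spatial-spectral neighborhoods per channel
        self.dwconv = nn.Conv3d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # applied channels-last
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # pointwise expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # pointwise projection

    def forward(self, x):                        # x: (B, C, bands, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 4, 1)             # to channels-last
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 4, 1, 2, 3)             # back to channels-first
        return x + residual

block = ConvNeXt3dBlockSketch(dim=16)
out = block(torch.randn(1, 16, 6, 32, 32))       # input shape is preserved
```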
The remainder of this manuscript is organized as follows: Section 3 provides an overview of the SMSTFM, outlining its overall structure and specific internal modules. Section 4 presents our results, including a description of the datasets, the experimental procedures, and their analysis. Section 5 constitutes our discussion, while Section 6 offers the conclusion.
5. Discussion
In this section, we begin by conducting ablation experiments to explore the roles of various components within the SMSTFM. Following that, we provide a brief analysis of the efficiency of the SMSTFM compared to other STFMs.
Three ablation models are designed to demonstrate the advantages of the SMSTFM among deep learning-based methods. First, we train the SBPM independently, a standalone experiment referred to as "SBPM", which allows us to evaluate the SBPM without the additional components of the SMSTFM. To assess the impact of ConvNeXt on the network structure, we replace the ConvNeXt module with a standard convolution while keeping the rest of the structure unchanged; this sub-method is denoted "No-ConvNeXt". To examine the influence of Split-3d on the network structure, we remove Split-3d from the 3D-CBlock, denoting this sub-method "No-Split-3d". The results of the ablation experiments on the LGC dataset are listed in Table 5. The SMSTFM achieves the best results on almost all metrics and shows significant improvement over the SBPM-only variant, i.e., the model without the MBPM. This result highlights the importance of the supplementary information predicted by the MBPM. Furthermore, Figure 16 displays the prediction images of the SBPM and the SMSTFM. Although the SBPM shares a similar structure with the SMSTFM, its output noticeably lacks spectral details. This observation underscores the contribution of the MBPM, which enhances the spectral details and improves the overall fusion performance.
Table 6 illustrates the models' efficiency using two key metrics: the number of model parameters and Multiply-Accumulate Operations (MACs). The parameter count reflects a model's complexity and storage requirements, while the MACs measure its computational complexity and resource utilization. Models with fewer parameters and lower MACs are generally considered more efficient, requiring less storage space and fewer computational resources.
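Both metrics can be computed straightforwardly. The sketch below counts trainable parameters and derives the MACs of a single convolution layer from first principles (the layer sizes are hypothetical, chosen only for illustration):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Number of trainable parameters (the first metric in Table 6)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def conv2d_macs(layer: nn.Conv2d, out_h: int, out_w: int) -> int:
    """MACs of one Conv2d: one multiply-accumulate per kernel element,
    per input channel (within its group), per output channel and position."""
    kh, kw = layer.kernel_size
    return (layer.in_channels // layer.groups) * kh * kw \
        * layer.out_channels * out_h * out_w

conv = nn.Conv2d(6, 32, kernel_size=3, padding=1)
print(count_params(conv))          # 6*32*3*3 + 32 = 1760
print(conv2d_macs(conv, 64, 64))   # 6*3*3*32*64*64 = 7,077,888
```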
Table 6 shows that SwinSTFM, based on ViT, has the largest number of parameters, approximately 11 times that of the SMSTFM. Although the SMSTFM has more parameters than earlier models such as EDCSTFN and GAN-STFM, its parameter count is significantly smaller than those of recent models such as SwinSTFM and MLFF-GAN. In terms of MACs, the SMSTFM again falls within the moderate range and consumes notably fewer resources than SwinSTFM. It is worth noting that, although the SMSTFM is mid-range in efficiency, it achieves the highest prediction accuracy; it therefore strikes a good balance between model efficiency and accuracy.
6. Conclusions
This article proposes an innovative spatiotemporal fusion model, the SMSTFM, which combines single-band and multi-band prediction to achieve superior fusion results. Existing STFMs that use CNNs for feature extraction produce hybrid features and lose channel-dimension information, while ViT-based STFMs incur significant computational overhead. Compared to these models, the SMSTFM has the following advantages:
Our model addresses the issue of hybrid features and channel-dimension information loss by cascading the SBPM and the MBPM. The SBPM establishes a mapping from low-resolution to high-resolution images, generating preliminary fusion results without hybrid features. The MBPM then efficiently extracts channel-wise information from the preliminary fusion results to enhance fusion details.
ConvNeXt, designed based on the ViT architecture, and its variants serve as the feature extraction modules in our model. Compared to ViT, ConvNeXt maintains high performance while reducing computational cost.
Furthermore, significant performance improvements were observed on datasets with 16× and 3× resolution differences between coarse and fine images, highlighting the robustness and versatility of our proposed approach.
Our strategy for channel-wise feature extraction may serve as a valuable reference for tasks involving multispectral and hyperspectral remote sensing imagery. However, one limitation of the SMSTFM is that cascading the SBPM and MBPM may slow inference, an aspect we aim to improve in the future.