1. Introduction
Remote sensing imagery is widely used for land cover detection, land surface change monitoring, and the estimation of biophysical parameters [1,2]. While clouds provide useful information for weather forecasting and climate prediction [3,4], cloud cover both degrades the quality of optical remote sensing data and complicates subsequent data processing and downstream remote sensing analysis [5]. Moreover, the diverse types of clouds and intricate ground surfaces make it challenging to accurately distinguish clouds from mixed ground objects. Consequently, automatic cloud detection has become a pivotal step in the preprocessing of optical satellite remote sensing images, significantly improving their utilization.
Recently, a wide range of techniques for identifying clouds have been put forward. Rule-based algorithms extract clouds by exploiting the significant disparities in physical and spatial characteristics between clouds and most ground surfaces [6,7,8,9]. The ACCA algorithm [10] applies several spectral filters, with 32 fixed thresholds and three dynamic thresholds, to estimate the overall percentage of clouds in each Landsat 7 scene. The authors in [11] utilized 26 fixed thresholds defined on different features to delineate cloud and cloud shadow regions. The spectral indices known as the cloud index (CI) and cloud shadow index (CSI) were introduced in [12] to identify cloud and cloud shadow regions through threshold segmentation. In summary, threshold-based algorithms are simple and effective, making them well suited to small scenes or scenes without complicated objects. However, climate and ground surface conditions vary over time and space, so fixed thresholds may not be universally applicable to all areas and time frames, even for data from the same sensor. Moreover, adaptive thresholds are difficult to determine, especially in the presence of cloud-like surfaces such as deserts, fog, haze, and ice/snow cover.
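As a toy illustration of the threshold-based idea (not the CI/CSI formulation of [12]), the following sketch flags pixels that are bright and spectrally flat; the band choices and threshold values are hypothetical placeholders.

```python
import numpy as np

def threshold_cloud_mask(blue, nir, brightness_thresh=0.3, flatness_thresh=0.1):
    """Toy rule-based cloud mask.

    Clouds are typically bright and spectrally flat in the visible/NIR, so a
    pixel is flagged when its mean reflectance is high and the band difference
    is small. `blue` and `nir` are 2-D reflectance arrays in [0, 1]; both
    thresholds are illustrative placeholders, not values from the cited work.
    """
    brightness = (blue + nir) / 2.0
    flatness = np.abs(blue - nir)
    return (brightness > brightness_thresh) & (flatness < flatness_thresh)
```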
Furthermore, traditional machine learning algorithms are frequently employed in cloud detection. These methods generate cloud masks by carefully selecting relevant features and choosing an effective model. The authors in [13] proposed an algorithm that combines k-means clustering and random forest for cloud detection in Landsat images, achieving better results than FMask [14]. Researchers applied the support vector machine (SVM) to cloud detection based on a comparative analysis of the feature differences between clouds and backgrounds, and they verified the method on GF-1 and GF-2 satellite images [15]. An end-to-end PCANet-based cloud detection method for Landsat 8 images obtained cloud masks by applying an SVM to superpixels, with cloud boundaries then refined by a fully connected conditional random field (CRF) [16]. Traditional machine learning algorithms have obtained promising results through statistical analysis. However, they rely heavily on hand-crafted features or human-specified rules, making it difficult to design a universal template that handles the diversity of cloud types.
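A minimal sketch of this family of methods is shown below, assuming per-pixel spectral features and labels are already available; the random placeholder data and the scikit-learn SVM stand in for the feature selection and model choice described above, not the exact pipeline of [15].

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: one row per pixel, columns are hand-picked
# spectral features (e.g., band reflectances and simple band ratios).
X_train = np.random.rand(1000, 6)           # placeholder features
y_train = np.random.randint(0, 2, 1000)     # 1 = cloud, 0 = clear (placeholder labels)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

# Predict a per-pixel cloud mask for a new scene described by the same features.
X_scene = np.random.rand(128 * 128, 6)
cloud_mask = clf.predict(X_scene).reshape(128, 128)
```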
With the rapid advancement of deep learning, it has been widely adopted across various industries. Deep learning-based methods automatically extract features from data and achieve remarkable results without manual feature selection [17,18,19,20,21,22]. Numerous deep learning approaches have been proposed for cloud detection [23,24,25,26,27,28,29,30]. For example, DANet [31] utilizes spatial and channel attention to capture semantic interdependencies in different dimensions. CCNet [32] designs a crisscross attention module that harvests the contextual information of all pixels along their respective crisscross paths. CSD-Net [33] combines a multiscale global attention feature fusion module with a channel attention mechanism to refine the edges of cloud and cloud shadow masks. CSD-HFnet [34] distinguishes clouds from snow by combining fundamental features, obtained through the local binary pattern, the gray-level co-occurrence matrix, and superpixel segmentation, with deep semantic features acquired from a deep feature extraction network. BABFNet [35] introduces a boundary prediction branch to enhance cloud detection in confusing areas. CDUNet [36] uses a high-frequency feature extractor and multiscale convolutions to predict cloud masks. MAFANet [29] combines a multiscale strip pooling attention module, a multihead attention module, and a feature fusion module to acquire more accurate cloud and cloud shadow masks.
Nevertheless, CNN-based cloud detection models still have limitations owing to the diversity of cloud forms and the difficulty of learning long-range dependencies with convolution operations. Recent advances in the vision transformer (ViT) [37] have demonstrated its ability to learn long-range features and model global information effectively, resulting in satisfactory performance on image classification tasks [38]. To overcome the high memory demand of transformers, the Swin transformer [39] adopts a hierarchical design that limits self-attention computation to nonoverlapping windows. Using the Swin transformer as the backbone, Swin-Unet [40] and TransDeepLab [38] have outperformed other methods in medical image segmentation. To leverage the complementary strengths of CNNs and ViTs, researchers have proposed hybrid models that combine convolution and transformer architectures to extract both local and global features for image classification tasks [41,42,43]. He et al. [44] integrated the global dependencies of the Swin transformer into the features of the UNet encoder for remote sensing image segmentation. BuildFormer [45] designs a dual-path structure with a CNN and a transformer to extract spatial details and global context for building segmentation. Yuan et al. [46] proposed the LiteST-Net model to extract buildings from remote sensing images, simplifying the transformer's Q, K, and V matrices to reduce the model computation. Alrfou et al. [47] concatenated the feature maps of CNN and Swin transformer encoders as the input to the corresponding decoder and demonstrated that combining transformer and CNN encoders consistently outperforms using CNN encoders alone for image segmentation.
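To make the window-based attention idea concrete, the following simplified sketch partitions a feature map into nonoverlapping windows and applies standard multihead self-attention within each window; it omits the shifted windows, relative position bias, and patch merging of the full Swin transformer and is only an illustrative approximation.

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Self-attention restricted to nonoverlapping windows (simplified Swin idea).

    Assumes the feature map height and width are divisible by `window`
    and that `dim` is divisible by `heads`.
    """
    def __init__(self, dim: int, window: int = 7, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map
        B, H, W, C = x.shape
        w = self.window
        # Partition the map into (B * num_windows, w*w, C) token sequences.
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        tokens = x.reshape(-1, w * w, C)
        # Attention is computed only among tokens of the same window, so the
        # cost grows linearly with image size instead of quadratically.
        out, _ = self.attn(tokens, tokens, tokens)
        # Reverse the partition back to a (B, H, W, C) feature map.
        out = out.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)
```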
To obtain precise cloud boundaries and efficiently discern clouds from bright ground objects, this paper proposes a novel encoder–decoder network for cloud detection, named STCCD (a hybrid Swin transformer–CNN cloud detection network). The STCCD network has a parallel hybrid encoder that combines Swin transformer layers and convolution blocks. Within this encoder, the feature coupling module (FCM) exchanges features between the Swin transformer layers and the residual convolution blocks, allowing the encoder to effectively learn global representations while also capturing local features. Additionally, a feature fusion module based on the attention mechanism (FFMAM) is designed to fuse the feature maps output by the Swin transformer and convolution branches and to explore the relationships between channels. Next, we introduce an aggregation multiscale feature module (AMSFM) to extract multiscale features, equipping our network with the ability to recognize clouds at different scales. In the decoder, the first four layers each take three inputs: the output of the preceding decoder layer and the outputs of the corresponding residual convolution layer and Swin transformer layer. Finally, a boundary refinement module (BRM) captures edge details and refines the cloud detection result.
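As a rough structural sketch (not the paper's actual implementation), the following PyTorch-style module illustrates the data flow of one stage of the parallel hybrid encoder, in which the two branches exchange features through a coupling module; `swin_stage`, `conv_stage`, and `coupling_module` are hypothetical placeholders for the components described above.

```python
import torch.nn as nn

class HybridEncoderStage(nn.Module):
    """One stage of a parallel hybrid encoder: a Swin transformer branch and a
    residual convolution branch exchange information through a coupling module.

    The three submodules are assumed to be provided by the caller and stand in
    for the paper's Swin transformer layer, residual convolution block, and FCM.
    """
    def __init__(self, swin_stage: nn.Module, conv_stage: nn.Module,
                 coupling_module: nn.Module):
        super().__init__()
        self.swin_stage = swin_stage
        self.conv_stage = conv_stage
        self.fcm = coupling_module

    def forward(self, swin_feat, conv_feat):
        swin_feat = self.swin_stage(swin_feat)   # global representations
        conv_feat = self.conv_stage(conv_feat)   # local features
        # The coupling module lets each branch see the other's features before
        # they are passed to the next stage and to the decoder skip connections.
        swin_feat, conv_feat = self.fcm(swin_feat, conv_feat)
        return swin_feat, conv_feat
```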
The main contributions of this paper are as follows. Firstly, we propose a novel cloud detection framework, the STCCD network, with an encoder–decoder architecture that leverages a combination of Swin transformer layers and residual convolution blocks to obtain both global representations and local features. Secondly, the STCCD network includes two novel modules, the FFMAM and AMSFM, which exploit the interplay between different network levels and the characteristics of cloud pixels. Thirdly, the STCCD network achieves state-of-the-art performance on cloud detection benchmarks, as evidenced by its superior results on the GF1-WHU, SPARCS, AIR-CD, and L8-Biome datasets.
5. Conclusions
This paper introduces the STCCD network, an encoder–decoder network tailored for cloud detection, which has demonstrated remarkable performance. The STCCD network integrates several key modules that work together to achieve its overall performance. The dual branches combine Swin transformer and convolutional components: the convolutional operators extract local features, while the self-attention mechanisms within shifted windows capture global representations. The feature coupling module, in its various forms, facilitates the exchange between global representations and local features. Building on the two branches, the attention-based feature fusion module leverages multihead attention and channel attention to effectively fuse local and global features, enhancing the overall feature representation. The aggregation multiscale feature module extensively employs dilated convolutions, pooling layers, and spatial attention mechanisms to extract discriminative information from the fused features. The boundary refinement module fine-tunes the cloud mask boundaries, further improving the accuracy of the cloud detection results.
The experimental results demonstrate the effectiveness of the STCCD network in cloud detection across diverse datasets, including SPARCS, GF1-WHU, and AIR-CD. Quantitatively, the STCCD network achieved an overall accuracy (OA) above 97.59%, a mean intersection over union (MIoU) above 90.5%, and a cloud F1 score above 90.1% on these three datasets, evidencing its versatility and superior performance. Moreover, we conducted an extended experiment to verify the generalization capability of the STCCD model, and the results show that our model has high extensibility.
Future work will investigate the relationship between clouds and cloud shadows, building on our accurate cloud mask generation, to enhance cloud shadow extraction. Additionally, we will explore cloud detection methods based on domain adaptation to reduce the reliance on limited sample data.