Remote sensing imagery, such as that provided by the United States Geological Survey (USGS) Landsat satellites, has been used for decades in environmental protection, hazard analysis, and urban planning. The usefulness and applications of remote sensing imagery continue to expand as more image-based models and algorithms emerge, allowing more knowledge and information to be derived from satellite images. Because clouds are ubiquitous, cloud pixels are a persistent presence in such imagery, especially in tropical areas. One study, based on satellite data collected from July 2002 to April 2015, estimates that a large fraction of the earth's surface is typically covered by cloud [1]. The presence of cloud seriously limits the use of remote sensing images. Cloud areas appear as extremely bright pixels, which can cause problems in many remote sensing analyses, including incorrect land surface classification, inaccurate atmospheric correction, low-quality Aerosol Optical Depth (AOD) retrieval, and false land surface change detection [2].
2]. As a result, clouds are considered noise in most situations and are typically removed prior to further analysis, which makes cloud detection a crucial step for remote sensing image preprocessing.
Over the last few decades, a variety of methods have been developed for cloud detection. Let us briefly consider a few examples.
1.1. An Overview of Cloud Detection Approaches
Automated Cloud Cover Assessment (ACCA) [3] is a scene-specific cloud detection method developed for Landsat 7. It employs two passes through ETM+ data: the first establishes the reflective and thermal signatures of cloud and non-cloud areas in a scene, and the second identifies clouds in the remainder of the scene. This approach has difficulty identifying clouds over snow and brightly illuminated desert areas.
Zhu et al. [4] proposed Function of Mask (Fmask), an object-based cloud and cloud shadow detection approach for Landsat images with a reported accuracy of 96.41%. The authors later improved Fmask, increasing its performance for Landsat 4–7 and extending it to Landsat 8 and Sentinel-2 imagery [5].
Foga et al. [6] compared the performance of ACCA, LEDAPS CCA, and CFmask (a C implementation of Fmask) on three cloud validation datasets: IRISH, SPARCS, and Biome. Of the three algorithms, CFmask achieved the best overall accuracy.
Hughes et al. [7] proposed a machine learning approach for automated cloud and cloud shadow detection using neural networks and spatial post-processing techniques, which achieves lower cloud shadow omission error and cloud commission error than Fmask.
A multi-feature combined (MFC) approach was proposed for cloud detection in Chinese GaoFen-1 imagery, using spectral features in combination with geometric and texture features; it achieves 96.8% accuracy [8].
All of the aforementioned approaches are single-temporal: they require only one scene. In contrast, a multi-temporal approach (Tmask) has been proposed for cloud, cloud shadow, and snow detection using multiple images of the same location [9]. It fits a time series model that predicts a top-of-atmosphere (TOA) reflectance surface, which is then compared with the observed Landsat images to differentiate clouds, shadows, and snow. However, this approach requires at least 15 clear observations at each pixel to build a robust time series model, which makes it less applicable in areas with persistent snow or cloud cover.
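The core idea of such a multi-temporal approach can be sketched as follows. This is an illustrative simplification, not the Tmask implementation: a per-pixel harmonic model (annual sinusoid plus linear trend, all names and thresholds here are our own) is fit to a reflectance time series, and observations that deviate strongly from the predicted clear-sky value are flagged as cloud candidates.

```python
import numpy as np

# Illustrative sketch of a Tmask-style per-pixel time series model:
# fit r(t) = a0 + a1*t + b*cos(wt) + c*sin(wt) by least squares,
# then flag observations far from the prediction as cloud candidates.
def fit_harmonic(t_days, reflectance):
    w = 2 * np.pi / 365.25  # annual frequency
    X = np.column_stack([np.ones_like(t_days), t_days,
                         np.cos(w * t_days), np.sin(w * t_days)])
    coef, *_ = np.linalg.lstsq(X, reflectance, rcond=None)
    return coef, X

# Synthetic clear-sky series at a 16-day revisit, with one cloudy outlier
t = np.arange(0, 16 * 20, 16, dtype=float)
r = 0.1 + 0.05 * np.sin(2 * np.pi * t / 365.25)
r[7] = 0.6  # a bright cloud observation

coef, X = fit_harmonic(t, r)
residual = np.abs(r - X @ coef)
flagged = residual > 3 * np.median(residual)  # simple robust threshold
print(flagged[7])
```

The real Tmask uses multiple spectral bands and carefully derived thresholds; this sketch only shows why many clear observations per pixel are needed before the fitted surface becomes reliable.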
Candra et al. [10] proposed an automated cloud and cloud shadow detection method, named MCM, that uses multi-temporal Landsat 8 images. This approach exploits the reflectance differences between two images of the same location to identify clouds and cloud shadows, and is especially effective in tropical areas.
Deep learning techniques, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and graph convolutional networks (GCNs), have recently garnered much attention for remote sensing image classification tasks because they can extract high-level features from images. Zhu et al. [11] reviewed the major advances of deep learning in remote sensing. Xie et al. [12] proposed a multilevel cloud detection method based on simple linear iterative clustering (SLIC) and a deep convolutional neural network (CNN). This method achieves better results than both the scene learning-based approach of Zhenyu [13] and the progressive refinement scheme of Zhang et al. [14]. The authors of [15] proposed a cloud detection method (MSCN) based on Fully Convolutional Networks (FCNs) [16] that fuses multi-scale convolutional features, and is effective in snow-covered areas and areas with bright non-cloud objects.
Another CNN-based cloud detection method is trained on high-resolution WorldView-2 (WV-2) satellite images; it does not rely on SWIR or IR bands and can be applied to Sentinel imagery [17]. Zi et al. [18] proposed a novel cloud detection method for Landsat 8 images using simple linear iterative clustering (SLIC), PCA Network (PCANet), a Support Vector Machine (SVM), and a Conditional Random Field (CRF). This approach combines statistical models, classical machine learning methods, and a deep learning network into a robust cloud detection model and achieves accurate results. Yang et al. [19] proposed a CNN-based cloud detection method that uses thumbnails of remote sensing images instead of the original images. To handle the coarse resolution of the thumbnails, a cloud detection neural network with a feature pyramid module and boundary refinement blocks is employed to generate accurate cloud predictions. This work was further extended by Guo [20] to cloud and snow coexistence scenarios with a new model, DSnetV2.
In addition to cloud detection, deep learning techniques have been extensively applied to remote sensing image classification problems, especially for hyper-spectral images. Graph convolutional networks (GCNs) are an emerging network architecture that can model long-range spatial relations. Shahraki and Prasad [21] proposed a cascaded framework of 1-D CNNs and GCNs for hyper-spectral classification. Qin et al. [22] extended GCNs to consider both spatial and spectral neighbors. Pu et al. [23] proposed a GCN method based on localized graph convolutional filtering for hyper-spectral image classification. Traditional GCNs are computationally expensive because large spatial adjacency matrices must be constructed. Hong et al. [24] showed that miniGCNs can be trained in minibatch fashion for classification problems; miniGCNs are more robust and can handle out-of-sample data at lower computational cost than traditional GCNs.
Based on the data they require, cloud detection methods can be divided into single-temporal and multi-temporal approaches. A single-temporal approach identifies cloud pixels from imagery acquired at a single time, while a multi-temporal approach uses imagery of the same area from multiple comparable timeframes, identifying cloud pixels by comparing pixel differences between cloud-free and cloudy images [25]. Depending on the algorithm used, cloud detection can also be categorized into classical algorithm-based approaches and machine learning approaches [26]. Classical algorithms follow specific, predefined steps to transform input imagery into an output mask, such as Fmask [4]. Machine learning approaches, on the other hand, take advantage of existing data and learn from a training set without human intervention to generate an output [18].
Cloud detection remains challenging in several respects. First, cloud pixels are hard to distinguish from other "bright" areas, such as snow, using traditional rule-based approaches. Second, multi-temporal approaches require more than one image of the same location, which makes them less applicable to low temporal resolution satellites such as Landsat. Third, while deep learning-based approaches generally achieve better cloud detection accuracy, they require a high-performance GPU, which may not be available. Finally, some approaches require additional information beyond the spectral features to improve performance, which demands extra labor.
1.2. A Pixel-Based Approach Using an Ensemble of Learners
This paper has four distinct goals. (1) Propose a cloud detection modeling approach that uses only the 10 wavelength bands available at a single pixel as its information source, without any ancillary data. (2) Engineer important predictor features that can increase model accuracy. (3) Investigate the influence of training sample size on model accuracy, to find the smallest training sample size that balances training time against model accuracy. (4) Explore the importance ranking of predictors and the optimal hyper-parameter settings for cloud mask prediction.
In contrast to the whole-image-based approaches just described, and to mitigate the challenges just enumerated, in this study we take a pixel-based approach that requires only a single image. We use an ensemble machine learning approach, which has been tested with Landsat 8 imagery.
Compared with multi-temporal approaches [25], this single-temporal approach is more broadly applicable because Landsat data are inherently mono-temporal. Unlike classical approaches that need auxiliary data [4], this approach requires only the 10 wavelength band values of a single pixel of Landsat 8 imagery. Compared to machine learning approaches built on deep learning frameworks, the computational demands are modest, so no GPU is needed in this research. The proposed machine learning model uses an ensemble approach that simultaneously employs multiple decision tree learners for cloud detection. The input features are constructed in two ways. First, we optionally include unsupervised classification results from a self-organizing map (SOM) as one of the input features for ensemble model training. Second, five indices calculated from the 10 wavelength band signals are included as input features for model training.
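To illustrate how such band-derived indices are computed, the sketch below evaluates three widely used normalized-difference indices from per-pixel Landsat 8 reflectances. These particular indices (NDVI, NDSI, NDWI) and the reflectance values are illustrative examples only; the specific five indices used in this work are defined later in the paper.

```python
import numpy as np

# Illustrative normalized-difference index calculations for one pixel.
# The band reflectance values below are hypothetical examples.
def normalized_difference(a, b):
    return (a - b) / (a + b + 1e-10)  # small epsilon avoids 0/0

pixel = {"green": 0.12, "red": 0.10, "nir": 0.35, "swir1": 0.18}

ndvi = normalized_difference(pixel["nir"], pixel["red"])     # vegetation
ndsi = normalized_difference(pixel["green"], pixel["swir1"])  # snow
ndwi = normalized_difference(pixel["green"], pixel["nir"])    # water
features = np.array([ndvi, ndsi, ndwi])
print(np.round(features, 3))
```

Each index is a simple ratio of band values, so computing them adds negligible cost per pixel while giving the ensemble learners features that separate clouds from vegetation, snow, and water more directly than raw bands do.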
Four distinct training subsets were generated from the Biome cloud validation dataset for model training [6]. Models were generated from different sets of input features and different training samples, and their performance was compared against Fmask 4.0.
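The pixel-based ensemble idea can be sketched in a few lines. This is an illustrative stand-in, not the authors' exact model: it uses scikit-learn's random forest as a generic "multiple decision tree" ensemble, and synthetic data in which cloud pixels are simulated as uniformly bright across all 10 bands.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sketch of per-pixel ensemble cloud detection on synthetic data:
# each sample is one pixel's 10-band reflectance vector, labeled
# 0 (clear) or 1 (cloud). Clouds are simulated as bright pixels.
rng = np.random.default_rng(0)
n = 2000
clear = rng.uniform(0.0, 0.3, size=(n, 10))  # darker surface pixels
cloud = rng.uniform(0.4, 0.9, size=(n, 10))  # bright cloud pixels
X = np.vstack([clear, cloud])
y = np.repeat([0, 1], n)

# An ensemble of decision trees trained on per-pixel band vectors
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

test_pixel = np.full((1, 10), 0.7)  # a uniformly bright pixel
print(model.predict(test_pixel))    # expected class: cloud (1)
```

Because each prediction depends only on a single pixel's band vector, the model needs no spatial context, no second acquisition date, and no GPU, which is the practical appeal of the pixel-based formulation.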