1. Introduction
Satellite imagery provides a vital source of information for a wide range of remote-sensing applications. However, such imagery is often contaminated by clouds and cloud shadows. Screening clouds and shadows is therefore a critical preprocessing step that must be undertaken before any meaningful analysis can be performed on these images.
Traditional algorithms apply threshold functions to the spectral values of the image to generate cloud and shadow masks. In addition, metadata such as solar zenith and azimuth angles and elevation are also used in computing these masks. The Fmask algorithm is widely accepted in the remote-sensing community as a reliable method for generating cloud and cloud shadow masks [
1]. Though the initial version of the Fmask was developed for Landsat-5 and Landsat-7 imagery, subsequent publications adapted the algorithm for use in Landsat-8 and Sentinel-2 imagery and proposed further improvements [
2,
3]. CFmask, a derivative of the Fmask algorithm, is used by the U.S. Geological Survey (USGS) in their production environment. The Sen2Cor process used by the European Space Agency (ESA) to process its Sentinel-2 products generates a scene classification map that detects clouds, shadows, water and snow [
4]. However, the results of these algorithms are known to be erroneous when the underlying surface is brightly colored, for example, in the case of ice and snow, white building tops, salt lakes, etc. The detection of shadows is not reliable on darker surfaces. Water bodies are often misclassified as shadows and vice versa. In addition to Fmask and Sen2Cor, several texture-based and threshold-based approaches have been proposed [
5,
6,
7]. Early attempts at using data-driven methods, such as support vector machines [
8,
9], random forest [
10,
11] and Markov random fields [
12,
13] for this task have shown improvements over traditional rule-based methods. In this paper, we propose a deep-learning-based model that specializes in cloud detection in polar and high-mountain regions that are snow- and ice-covered for most of the year.
Over the past decade, deep convolutional neural networks (DCNNs) have made giant leaps in the field of computer vision and have found application in domains that require very high precision, such as medical imaging and autonomous driving [
14,
15]. In contrast to single pixel classification methods, DCNNs learn spatial relationships between the neighboring pixels in the image. Hence, the network can make use of texture and structural information, in addition to the spectral properties, to distinguish objects in the scene—this contributes towards improved detection performance. DCNNs are typically trained using large amounts of correctly labeled data. However, labeled data is very expensive because it is time-consuming to produce.
The open-access policy of the Copernicus programme has led to widespread use of optical Sentinel-2 data. However, no cloud mask is provided with these data that reliably detects clouds over bright targets or shadows over dark targets. While a few datasets for cloud detection exist, they were unsuitable for direct use in our multi-class segmentation problem. In this study, we propose a self-training framework to train our model. The trained model segments a given Sentinel-2 L1C scene into six classes: No-Data, Clear-Sky Land, Cloud, Shadow, Snow and Water. We used a large dataset with labels generated by the Fmask algorithm for training, and a small human-labeled dataset for validation. The validation dataset contains numerous examples where the Fmask classification produced incorrect labels.
When compared on our test dataset with widely used automatic methods such as Fmask 4 and Sen2Cor 2.8, the trained model showed significant improvements in various segmentation metrics. Interestingly, the model achieved better performance than its teacher, Fmask, which was used to automatically generate the labels for training.
3. Materials and Methods
3.1. Dataset
The Sentinel-2 L1C data used for this study was downloaded from the Copernicus Open Access Data Hub. The data was captured by the Sentinel-2A and Sentinel-2B satellites using their onboard multi-spectral instrument (MSI) that captures images in 13 frequency bands. The description of the bands used in this study is provided in
Table 1. The L1C data contains the top-of-atmosphere (TOA) reflectance values and each scene covers an area of 100 km × 100 km.
The sensing period for the images used in our study was between October 2019 and December 2020. We used 96 scenes in the training dataset, 22 scenes in the validation dataset and 23 scenes in the test dataset. The sites of these scenes were randomly selected in the polar and tundra regions. We also included three sites in mountainous terrain that were not in the polar and tundra region. The geographic distribution of the data is shown in
Figure 1. The tile-wise quarterly distribution of the Sentinel-2 scenes for the training, validation and test datasets is provided in
Table A1,
Table A2 and
Table A3, respectively.
The six classes that we used in our model:
No-Data,
Clear-Sky Land,
Cloud,
Shadow,
Snow and
Water, were the same class labels as used in the segmentation output of the Fmask algorithm. The
No-Data class was used to indicate pixels in the image that were saturated or defective and regions along the edge of the image where sensor data was not available. Such pixels can be identified using the quality indicator information provided along with the Sentinel-2 imagery. Hence, we did not include this class in the labeled dataset. The
Clear-Sky Land class was used for snow- and cloud-free land, which was usually rocky surfaces in the sites used for this study. The
Shadow class was used for all types of shadow, irrespective of the object that cast the shadow or the underlying surface on which the shadow fell. The samples in the validation set were selected to include many examples where the Fmask algorithm produced an incorrect label (as per our visual assessment). The labeling of the validation and test datasets was performed manually using the QGIS software [
41] package. In addition to the true-color image, the short-wave infrared band (B11) and the near-infrared band (B08) were used in our labeling process. Labeling of very small targets was occasionally performed using a thresholding operation in the local neighborhood of the target.
The Fmask labels were generated using Fmask 4.3. According to its authors, this version of Fmask offers substantially better cloud, cloud shadow, and snow detection results compared to the previous versions for Sentinel-2 imagery [
2,
3,
42]. The cloud buffer distance and the cloud shadow buffer distance were set to 0 in the Fmask configuration settings. All other configuration settings were kept at their default values, and the Fmask output was computed at a 20 m resolution (5490 × 5490 pixels).
We also compared our model performance against the results from Sen2Cor 2.8. The Sen2Cor algorithm is part of the European Space Agency’s (ESA) processing pipeline used for generating the L2A images [
4]. This algorithm produces a scene classification (SCL) image, in addition to various quality indicators and bottom-of-atmosphere (BOA) images. The resolution of the SCL was set to 20 m and no additional digital elevation model (DEM) was provided. While the Fmask segmented the image into the same six classes that we used for training our model, Sen2Cor provided a more comprehensive class organization. For example, the Sen2Cor scene classification mask has four classes for clouds: cloud high probability, cloud medium probability, cloud low probability and thin cirrus clouds. To perform the comparison, we combined multiple classes from the Sen2Cor mask into the six classes used in our model, as shown in
Table 2.
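To make this remapping step concrete, a minimal Python sketch is given below. The exact grouping of Sen2Cor SCL codes into our six classes is the one listed in Table 2; the assignment shown here is only an illustrative assumption.

import numpy as np

# Illustrative remapping of Sen2Cor SCL codes to the six model classes.
# SCL codes: 0 no data, 1 saturated/defective, 2 dark area, 3 cloud shadow, 4 vegetation,
# 5 not vegetated, 6 water, 7 unclassified, 8/9 cloud medium/high probability,
# 10 thin cirrus, 11 snow. The grouping below is an assumption; Table 2 gives the one used.
NO_DATA, CLEAR_SKY_LAND, CLOUD, SHADOW, SNOW, WATER = range(6)
SCL_TO_MODEL = {
    0: NO_DATA, 1: NO_DATA,
    2: SHADOW, 3: SHADOW,
    4: CLEAR_SKY_LAND, 5: CLEAR_SKY_LAND, 7: CLEAR_SKY_LAND,
    6: WATER,
    8: CLOUD, 9: CLOUD, 10: CLOUD,
    11: SNOW,
}

def remap_scl(scl: np.ndarray) -> np.ndarray:
    """Remap a Sen2Cor SCL array (20 m) to the six model classes."""
    lut = np.zeros(12, dtype=np.uint8)
    for code, model_class in SCL_TO_MODEL.items():
        lut[code] = model_class
    return lut[scl]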
As shown in
Table 1, the spectral bands in the Sentinel-2 L1C product are provided at different resolutions. We resampled all the bands to a resolution of 20 m using bicubic interpolation. Due to hardware limitations, it was not possible to train the network on the full resampled image (5490 × 5490 pixels). Hence, we split the images into smaller sub-scenes for training. Since the optical sensor data from the satellite may not always be aligned to fill the complete image, the edges of the image tend to have zero-valued pixels, i.e., no data. After splitting the image into sub-scenes, we discarded those sub-scenes in which all pixels were zero-valued. We applied zero-padding of one pixel around each sub-scene, which increased each spatial dimension of the network input by two pixels. Since the network did not have any fully connected layers (which would require a fixed input size), we were able to flexibly use a larger sub-scene size for prediction. In the prediction step, we used larger overlapping sub-scenes, and we also applied zero-padding, as in the training step. Before stitching together the network output from each sub-scene, we clipped out a three-pixel border around the sub-scene to eliminate the uncertain network predictions caused by the zero-padding [
39].
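The following Python sketch illustrates this tiling, padding and stitching procedure. The tile size and overlap used here are placeholder assumptions and not the values used in our experiments, and the model is assumed to return per-class probability maps with the same spatial size as its input.

import numpy as np

def predict_scene(model, image, tile=512, clip=3, n_classes=6):
    """image: (C, H, W) array of resampled 20 m bands; model returns (n_classes, h, w) probabilities."""
    _, H, W = image.shape
    probs_sum = np.zeros((n_classes, H, W), dtype=np.float32)
    weight = np.zeros((H, W), dtype=np.float32)
    stride = tile - 2 * clip                       # overlap so clipped borders are covered by neighbours
    for y in range(0, H, stride):
        for x in range(0, W, stride):
            y0, x0 = min(y, H - tile), min(x, W - tile)
            sub = image[:, y0:y0 + tile, x0:x0 + tile]
            sub = np.pad(sub, ((0, 0), (1, 1), (1, 1)))            # one-pixel zero-padding
            pred = model(sub)[:, 1:-1, 1:-1]                       # drop the padded border
            pred = pred[:, clip:-clip, clip:-clip]                 # clip the uncertain three-pixel border
            probs_sum[:, y0 + clip:y0 + tile - clip, x0 + clip:x0 + tile - clip] += pred
            weight[y0 + clip:y0 + tile - clip, x0 + clip:x0 + tile - clip] += 1.0
    return np.argmax(probs_sum / np.maximum(weight, 1e-6), axis=0)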
The input to the network consisted of seven channels: the Sentinel-2 RGB bands (B02, B03, B04), the near-infrared band (B08), the short-wave infrared bands (B11, B12) and the normalized difference snow index (NDSI). The NDSI is a well-known index that exploits the difference in the spectral reflectance of the green band (B03) and the short-wave infrared band (B11) to detect snow [
43]. The NDSI is computed pixel-wise as follows:
$$\mathrm{NDSI} = \frac{B03 - B11}{B03 + B11}$$
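As an illustration, the seven-channel input stack can be assembled as in the short Python sketch below; the dictionary of band arrays and the small epsilon added to the denominator are assumptions made for the sketch.

import numpy as np

def build_input(bands: dict) -> np.ndarray:
    """Stack B02, B03, B04, B08, B11, B12 and the NDSI into a (7, H, W) float array.
    `bands` is assumed to map band names to 2-D reflectance arrays resampled to 20 m."""
    green = bands["B03"].astype(np.float32)
    swir = bands["B11"].astype(np.float32)
    ndsi = (green - swir) / (green + swir + 1e-6)   # NDSI = (B03 - B11) / (B03 + B11)
    stack = [bands[b].astype(np.float32) for b in ("B02", "B03", "B04", "B08", "B11", "B12")]
    return np.stack(stack + [ndsi], axis=0)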
3.2. Neural Network Architecture
The U-Net [
14] and its variants are widely used neural network architectures for semantic segmentation. The network displayed in
Figure 2 consists of an encoder-decoder architecture. Each encoder block operates at twice the spatial resolution of its succeeding encoder but has half the number of convolution filters.
At each encoder and decoder block, there are two sequences, each comprising a convolution layer, a ReLU layer and a batch-normalization layer connected in series, followed by a spatial dropout layer. The convolution layers used here had a kernel size of 3 × 3 and a stride of 1.
The output of the encoder takes two paths. In one path, the output of the spatial dropout is downsampled by a factor of two in the spatial dimensions using a max-pool layer and is provided as input to the next encoder. The other path, referred to as the skip connection, is connected to the decoder block that operates at the same resolution on the other side of the network.
At each decoder, two inputs are received, and these inputs have to be concatenated into a single input before applying the layers of the decoder. The input received from the preceding layer is upsampled using a 2 × 2 up-convolution operation with a stride of 2, which doubles the spatial dimensions and halves the number of feature maps. The upsampled output is then concatenated with the input received via the skip connection from the encoder at the corresponding level (denoted as a blue box in the decoder block in
Figure 2). The decoder layers are then applied to the concatenated output and the result is provided as the input to the next decoder.
This sequence is repeated until the final decoder, which operates at the same spatial resolution as the input. At the final decoder, an additional 1 × 1 convolution and a softmax layer are applied to obtain the class probabilities for each pixel of the image. The network output has the same spatial dimensions as the input and has six channels, where each channel contains the probability map for the respective class. The segmentation map is generated by selecting the class with the highest probability at each pixel position.
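A minimal PyTorch sketch of these building blocks is given below. It follows the description above (two convolution-ReLU-batch-normalization sequences followed by spatial dropout, 2 × 2 up-convolution and skip-connection concatenation in the decoder), while the dropout rate and the use of padded convolutions are assumptions.

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two convolution-ReLU-batch-norm sequences followed by spatial dropout."""
    def __init__(self, in_ch, out_ch, p_drop=0.2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_ch),
            nn.Dropout2d(p_drop),              # spatial dropout: zeroes entire feature maps
        )

    def forward(self, x):
        return self.block(x)

class DecoderBlock(nn.Module):
    """Upsample, concatenate the skip connection, then apply the convolution block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # 2 x 2 up-convolution
        self.conv = ConvBlock(2 * out_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)        # skip connection from the encoder at the same level
        return self.conv(x)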
3.3. Self-Training Framework
We used a four-stage iterative approach to train our model. The network architecture and training data organization of the iterative approach is summarized in
Table 3. The training dataset, consisting of the sub-scenes described in Section 3.1, was divided randomly into four equal batches. In the first stage, a deep neural network with a modified U-Net architecture was trained using batch-1 of the training dataset. The
imperfect labels used in this stage were generated automatically using the Fmask algorithm. We evaluated the model performance on the validation dataset with human-labeled data after each training epoch, and, importantly, the human-labeled data was never used as training data. Based on the performance metrics, we selected the model from the best epoch, and called the selected model the
teacher model. Using the teacher model, we inferred labels for the data in batch-2 of the training dataset. We used only the labels for which the network confidence was greater than a threshold of 0.33 (twice the probability of a random guess among the six classes). We assigned label 0 (the
No-Data label) to pixels where the confidence was lower than this threshold. In the second stage, we trained another neural network model, called the
student model, using a combined dataset of batch-1 and batch-2. For the batch-1 data, we continued to use the Fmask labels exactly as in the previous stage. Once we had selected the best student model from stage 2 using the validation dataset, we made this student model our new teacher model. Using the new teacher model, we inferred new labels for batch-2 and batch-3 data. In each subsequent stage, we trained a new student model with a combined dataset of Fmask labels for batch-1, and model-inferred labels from the teacher model, i.e., the best model of the previous stage, for the remaining batches, as shown in
Table 3.
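The procedure can be summarized by the following Python-style sketch; the helper functions (train_model, predict_probs, select_best_epoch) are placeholders used only to illustrate the flow of labels between stages.

NO_DATA, CONF_THRESHOLD = 0, 0.33

def self_training(fmask_batch1, image_batches, val_data, n_stages=4):
    """image_batches[0] corresponds to batch-1; fmask_batch1 holds its Fmask labels."""
    teacher = None
    for stage in range(1, n_stages + 1):
        train_set = [(image_batches[0], fmask_batch1)]        # batch-1 always keeps its Fmask labels
        for b in range(1, stage):                             # batches 2..stage receive teacher labels
            probs = predict_probs(teacher, image_batches[b])  # per-pixel class probabilities
            labels = probs.argmax(axis=1)
            labels[probs.max(axis=1) < CONF_THRESHOLD] = NO_DATA   # discard low-confidence pixels
            train_set.append((image_batches[b], labels))
        student = train_model(train_set, stage=stage)         # larger architecture at later stages
        teacher = select_best_epoch(student, val_data)        # model selection on human-labeled data
    return teacher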
A new batch of data containing previously unseen images became available for training the model at each stage. Hence, the size of the training data increased linearly at each stage. A smaller model was used for the training in the first stage, and the model size was incrementally increased at each stage along with the training data. This was done to ensure that the model did not overfit on the smaller training data in the early stages. Hence, we defined two hyperparameters,
number of start filters and
depth, which we used to control the model size. The number of start filters was the number of filters used in the convolution layer present in the first encoder of the network. The number of filters in the subsequent encoders was defined as multiples of this parameter. The depth of the network set the number of encoder blocks used in the network. A network architecture with 32 start filters and a depth of 5 is illustrated in
Figure 2.
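As an example, the filter counts implied by these two hyperparameters can be derived as in the short sketch below, assuming the number of filters doubles at each encoder level, consistent with the halving described above.

def encoder_filters(start_filters: int, depth: int) -> list:
    """Number of convolution filters in each encoder block, from the first to the deepest."""
    return [start_filters * 2 ** level for level in range(depth)]

print(encoder_filters(32, 5))   # [32, 64, 128, 256, 512] for the architecture of Figure 2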
3.4. Training Implementation
3.4.1. Regularization Techniques
We employed the regularization techniques described below at each stage of the self-training framework to aid in generalization of the trained model.
We used an online augmentation strategy where we performed augmentation “on the fly” before the data was provided as input to the network. The training dataset already consisted of a large number of data samples, and using the online strategy helped us achieve diversity across each training epoch without an explosive increase in the dataset size. The augmentations used included cutout [
44], horizontal flip, vertical flip, and rotation in steps of 90°. These transformations were applied to randomly selected input samples, and more than one transformation could be applied to the same image.
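The augmentation step can be sketched as follows; the application probabilities and the cutout size are assumptions, and the label handling for cutout regions may differ from our implementation.

import numpy as np

def augment(image, label, rng=np.random.default_rng()):
    """image: (C, H, W), label: (H, W); more than one transform may be applied to the same sample."""
    if rng.random() < 0.5:                                    # horizontal flip
        image, label = image[:, :, ::-1], label[:, ::-1]
    if rng.random() < 0.5:                                    # vertical flip
        image, label = image[:, ::-1, :], label[::-1, :]
    k = int(rng.integers(0, 4))                               # rotation in steps of 90 degrees
    image, label = np.rot90(image, k, axes=(1, 2)), np.rot90(label, k)
    image = np.ascontiguousarray(image)
    if rng.random() < 0.5:                                    # cutout: zero a random square patch
        h, w = label.shape
        size = h // 8
        y, x = rng.integers(0, h - size), rng.integers(0, w - size)
        image[:, y:y + size, x:x + size] = 0
    return image, np.ascontiguousarray(label)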
It has been shown that element-wise dropout, while effective for fully connected layers, is not very effective for convolutional layers. The spatial dropout used in this study set entire feature maps of the output to zero instead of individual elements [
45]. The application of spatial dropout leads to learning from an ensemble of sub-networks consisting of randomly selected feature maps and hence helps in regularizing the trained model.
Each stage of the self-training framework was trained for 100 epochs. An early stopping strategy was employed that stopped the training if the average class accuracy on the validation dataset did not improve for 10 consecutive epochs. The Adam optimizer with weight decay was used for the training; the weight decay resulted in an L2-norm regularization of the network weights.
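A compact sketch of this training loop is shown below; the learning rate and weight decay values are placeholders rather than the settings used in our experiments, and average_class_accuracy is a placeholder evaluation helper.

import torch

def train_stage(model, train_loader, val_loader, loss_fn, max_epochs=100, patience=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)  # placeholder values
    best_acc, best_state, epochs_since_best = -1.0, None, 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        acc = average_class_accuracy(model, val_loader)       # validation on human-labeled data
        if acc > best_acc:
            best_acc, best_state, epochs_since_best = acc, model.state_dict(), 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:                 # early stopping
                break
    model.load_state_dict(best_state)
    return model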
3.4.2. Loss Function
As shown in
Table 4, the training data had a class imbalance problem. This could have resulted in the network predicting the abundant class more often than the classes that were scarce. Hence, we used a weighted cross-entropy function as the loss function for the training. The weights for each class were calculated using the median frequency balancing (MFB) method [
46]. It has been shown that this technique is effective in semantic segmentation of small targets in remote-sensing images [
47]. The weights were calculated using the frequency of each class in the training dataset. Let the set $C = \{1, 2, \ldots, M\}$ denote the set of the $M$ classes in the given segmentation problem. The frequency of class $i$, denoted as $f_i$, is the ratio of the number of pixels that are labeled as class $i$ to the total number of pixels. The median of the class frequencies in the set $C$ is denoted by $f_{\mathrm{median}}$. The weight $w_i$ for class $i$ is given by:
$$w_i = \frac{f_{\mathrm{median}}}{f_i}$$
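A minimal sketch of this weight computation over the training labels is given below; the epsilon guarding against empty classes is an implementation detail of the sketch.

import numpy as np

def mfb_weights(labels: np.ndarray, n_classes: int = 6) -> np.ndarray:
    """Median frequency balancing: w_i = median(f) / f_i computed over all training label pixels."""
    counts = np.bincount(labels.ravel(), minlength=n_classes).astype(np.float64)
    freqs = counts / counts.sum()                              # f_i: fraction of pixels of class i
    return np.median(freqs) / np.maximum(freqs, 1e-12)         # w_i = f_median / f_i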
The target label used for training has a one-hot encoding, i.e., it is a vector whose elements are all zero except for the element corresponding to the true class, which is set to one. The weighted cross-entropy loss is given by:
$$L = -\frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{M} w_i \, y_{ij} \log(p_{ij})$$
where $N$ is the number of training samples (pixels), $w_i$ is the weight of class $i$ calculated using the MFB method, $y_{ij}$ is the target label at the dimension corresponding to class $i$ for the training sample $j$, and $p_{ij}$ is the softmax output of the model at the dimension corresponding to class $i$ for the training sample $j$.
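In PyTorch, this weighted loss can be expressed with the built-in cross-entropy criterion, as in the sketch below; the class weights and tensor shapes are illustrative, and No-Data pixels could additionally be excluded with the ignore_index argument.

import torch
import torch.nn as nn

# Illustrative per-class weights (e.g., from median frequency balancing)
class_weights = torch.tensor([1.0, 0.8, 0.5, 4.0, 0.3, 2.5])

# CrossEntropyLoss applies the softmax internally, so the raw network outputs are passed directly.
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(2, 6, 64, 64)              # (batch, classes, H, W) dummy network output
targets = torch.randint(0, 6, (2, 64, 64))      # (batch, H, W) dummy integer class labels
loss = loss_fn(logits, targets)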
3.5. Validation Metrics
The performance metrics used for the model evaluation were computed from a pixel-level confusion matrix: Consider the confusion matrix for an
M-class segmentation problem shown in
Figure 3. Each element in the matrix, denoted by $n_{ij}$, is the total number of pixels that belong to class $i$ and are predicted as class $j$ by the model. For class $i$, $TP_i = n_{ii}$, $FP_i = \sum_{k \neq i} n_{ki}$ and $FN_i = \sum_{k \neq i} n_{ik}$ are the number of true positives, false positives and false negatives, respectively. Precision or user accuracy for class $i$ is defined as follows:
$$P_i = \frac{TP_i}{TP_i + FP_i}$$
i.e., the ratio of the number of pixels that were correctly predicted as class $i$ to the total number of pixels that were predicted as class $i$. Recall or producer accuracy for class $i$ is defined as follows:
$$R_i = \frac{TP_i}{TP_i + FN_i}$$
i.e., the ratio of the number of pixels that were correctly predicted as class $i$ to the total number of pixels that actually belong to class $i$, i.e., the ground truth.
Cases of over-segmentation and under-segmentation are often misinterpreted as good results when either recall or precision is analyzed individually. Hence, in semantic segmentation tasks, metrics such as the F1 score and the Intersection over Union (IoU) are the preferred evaluation metrics. The F1 score is the harmonic mean of the precision and recall:
$$F1_i = \frac{2 \, P_i \, R_i}{P_i + R_i}$$
Given the set of pixels that are predicted to be a particular class by the model, and the set of pixels that actually belong to that class (ground truth), the IoU is defined as the ratio of the size of the intersection to the size of the union of these two sets. The IoU can be expressed in terms of the number of true positives, false positives and false negatives as follows:
$$IoU_i = \frac{TP_i}{TP_i + FP_i + FN_i}$$
The overall performance of the model can be evaluated using the mean Intersection over Union (mIoU) and the total accuracy. The mIoU for a segmentation task with $M$ classes is defined as follows:
$$mIoU = \frac{1}{M} \sum_{i=1}^{M} IoU_i$$
i.e., the mean of the IoU metric computed over all classes. The total accuracy is then defined as follows:
$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{M} n_{ii}}{\sum_{i=1}^{M} \sum_{j=1}^{M} n_{ij}}$$
i.e., the ratio of the number of correctly labeled pixels to the total number of pixels.
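These metrics can be computed directly from the confusion matrix, as illustrated by the following sketch, where cm[i, j] counts pixels of true class i predicted as class j.

import numpy as np

def segmentation_metrics(cm: np.ndarray) -> dict:
    """Per-class precision, recall, F1 and IoU, plus mIoU and total accuracy, from a confusion matrix."""
    tp = np.diag(cm).astype(np.float64)
    fp = cm.sum(axis=0) - tp                     # predicted as class i but belonging to another class
    fn = cm.sum(axis=1) - tp                     # belonging to class i but predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1,
            "iou": iou, "miou": iou.mean(), "accuracy": tp.sum() / cm.sum()}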
4. Results
We compared the performance of our model with two widely used methods for cloud masking, Fmask and Sen2Cor.
The confusion matrices comparing the prediction of Fmask, Sen2Cor and our model with the true labels in the test dataset are presented in
Table 5,
Table 6 and
Table 7, respectively.
Table 8 shows that our model was able to achieve high precision and recall
simultaneously. In the case of Fmask 4 and Sen2Cor 2.8, the high values of precision or recall for certain classes, such as the class
Clear-Sky Land, were accompanied by lower values of recall or precision, respectively. Therefore, the overall segmentation performance suffered. In contrast, our model outperformed both of these methods in this regard; this is clearly reflected in the F1 score and IoU metrics shown in
Table 9 and in
Figure 4. We also observed that the performance for the class Water was slightly better for Fmask than for our model.
In
Figure 5,
Figure A1,
Figure A2 and
Figure A3, we provide several more examples of results from the test and validation datasets to visually demonstrate the model performance. The results in each figure are supplemented with a false-color composite image and the labeled polygons to aid in understanding the scene. The false-color composite image is composed of the short-wave infrared band (B11), the near-infrared band (B08) and the red band (B04) in the red, green and blue channels, respectively.
5. Discussion
In our study, we showed that, even with a small manually labeled validation dataset, the self-training framework enabled us to train a segmentation model using noisy labels from the Fmask algorithm. Our model outperformed two widely used cloud-screening methods, Sen2Cor and Fmask, and can be considered a better alternative to its teacher, the Fmask algorithm. The results showed that the model performed particularly well for the Cloud and Snow classes, which were the two prominent classes observed in the geographical sites that we used in our study.
Shadows in satellite imagery can be classified based on their source as cloud shadows or topographic shadows. Topographic shadows are the shadows that are cast by topographic features, such as mountains, and are static features that depend on acquisition geometry and solar position. In contrast, cloud shadows are dynamic features whose location and representation also depend on the prevailing meteorological conditions during image acquisition. The spectral characteristics of cloud shadows and topographic shadows are similar [
48]. Topographic shadows can be detected accurately using digital elevation models (DEMs) and solar angles. However, this information is not provided to the network explicitly. The Fmask algorithm also makes use of the image metadata, such as solar zenith and azimuth angles, along with the cloud detections, in order to predict cloud shadow pixels. However, visual inspection of the Fmask results showed that they were not always accurate. In
Figure 5, both Sen2Cor and Fmask tended to incorrectly label topographic shadows as
Water. Our model offered an improvement in this regard, correctly labeling them as Shadow in many instances. A suitable post-processing technique using the metadata and a DEM could be applied to separate cloud shadows from topographic shadows.
In
Table 10, we compare the performance metrics for different stages of the self-training framework. We observe that the performance of the model improved over the first three stages of the self-training framework and subsequently saturated in the last stage. We attribute this pattern to improvement in the quality of data labels as a result of our framework. The performance on the class
Water did not improve across the different stages. We know that shadows and water tend to have dark-colored pixels and it is often difficult to distinguish between them by visual inspection. We believe the network infers that the class
Water and the class
Shadow are similar, and, in cases of ambiguity, prefers
Shadow over
Water due to higher loss function weights used during training. As shown in
Figure 6, the network focused on improving the performance in the shadow class at each stage and this improvement was also reflected in the performance metrics.
We also trained two models using the conventional supervised training approach. The first model was trained using the same training dataset, but using the noisy Fmask labels, while the second model was trained with the small human-labeled dataset that was previously used for model selection in our proposed framework. The network architecture used was the same as the architecture from the model in Stage-4; we also used all regularization techniques discussed previously for training this model. Hence, the key difference between the supervised models and the self-trained model (Stage-4) was the labels used for the training. We present the performance metrics for these two models and our self-trained model in
Table 11. The results for the
clean supervised model show that the limited number of labeled examples in our validation dataset was insufficient for the model to generalize effectively. We believe that the large noisy dataset offered greater data diversity than the smaller clean dataset, and that this helped the
noisy model to outperform the
clean model. One might be inclined to argue that the improvement in performance across the stages of the self-training framework was due to the increase in the number of parameters used in the network architecture. However, the performance of the
noisy supervised model demonstrated that a larger network alone does not guarantee better performance. We observed that the improvement in data labels achieved through the self-training framework enabled the stage-2 and stage-3 models to outperform the supervised model despite their smaller network size. Hence, we believe our approach has great potential for training models with noisy labels.
Jeppesen et al. demonstrated the capability of neural networks to learn from noisy Fmask labels in their RS-Net model trained for cloud detection in Landsat Imagery using the supervised learning approach [
28]. Our study can be viewed as a next step in this research direction, and we also further extend this approach for the multi-class segmentation task. Similar to our research, Li et al. offered a solution for addressing the difficulty of obtaining large pixel-level annotated datasets for training neural networks [
36]. The block-level annotated dataset used for training their weakly supervised model can potentially simplify the annotation process in comparison to the more laborious pixel-level annotation. However, it may prove challenging to adapt this approach for multi-class segmentation, particularly for shadow detection. We used the simple yet effective U-Net architecture in our self-training framework. However, we note that the network used in our framework can easily be adapted to take advantage of recent advances in network architectures and training techniques, which have shown improved cloud-detection capabilities in supervised models.
6. Conclusions
Our study demonstrates the effectiveness of self-training neural networks in the Earth observation domain. The key challenge in this study was to train the model using the noisy labels from the Fmask cloud detection algorithm. Many other remote-sensing applications face the same difficulty with existing but imprecise training data, which often limits the deployment of deep-learning technologies.
Even though the proposed method offers better performance than Fmask and Sen2Cor, we believe that the performance can be improved further. In addition to the detection of shadows, the detection of thin clouds, particularly those above water bodies, can be improved. The Clear-Sky Land class is the most heterogeneous of the six classes in this investigation and is expected to pose some challenges when training a similar model that can be applied to all geographical environments. The use of other spectral indices in addition to the NDSI would allow the model to take advantage of the domain knowledge acquired through years of remote-sensing research.
Our classification strategy has the potential for more nuanced class separation and for integration into information retrieval algorithms. The use of other existing masks or datasets (e.g., ocean, lakes or vegetation) might lead to even higher precision on specific problems or over certain surfaces.