1. Introduction
Remote-sensing images provide important data support for various industries, but the acquired remote-sensing images are often accompanied by cloud cover that seriously affects the interpretation of ground targets [1,2]. Existing studies show that the mean annual cloud cover of global remote-sensing images is about 66% [3]. The presence of clouds hinders the application of remote-sensing images in Earth observation [4,5]. Therefore, accurately detecting clouds in remote-sensing images has become a prominent issue.
Recently, considerable research has been conducted on computer-based automatic cloud-segmentation techniques as a prerequisite for image analysis [6,7,8]. Existing cloud-detection methods can be divided into two categories: empirical-rule algorithms based on physical features, and machine-learning algorithms [9]. The majority of rule-based methods consist of a set of thresholds applied to one or more spectral bands of the image, or to extracted features that attempt to enhance the physical properties of clouds [10]. Generally speaking, rule-based methods are straightforward and simple to use, and when the spectral data provided by a satellite are sufficiently rich, they are effective for cloud classification. The Fmask algorithm is a classical rule-based method, which uses a threshold function to determine whether the spectrum corresponding to each pixel belongs to a cloud [11]. However, this method depends heavily on manually set thresholds and requires extensive trial-and-error experiments to obtain accurate values, which is costly. The Sen2Cor algorithm can generate cloud-recognition results with different probabilities through different waveband threshold conditions, but its recognition accuracy is not high, and it easily misidentifies bright surface features as clouds and mountain shadows as cloud shadows [12]. On the other hand, machine-learning approaches treat cloud detection as a statistical classification problem, using classifiers such as Random Forest (RF) [13] and Support Vector Machine (SVM) [14]. These methods rely on high-quality remote-sensing images and exploit their high resolution and distinct spatial characteristics to classify clouds by shape, texture, edges, and other features. Since a large number of empirically set thresholds and band selections are not required, machine-learning methods outperform rule-based methods when the quality of the training data is good enough [15]. However, most of these methods rely heavily on training samples, and their accuracy decreases significantly when the training samples are underrepresented or biased in distribution [16]. Deep neural networks are themselves a class of machine-learning methods. With the booming development of deep-learning technology, image-segmentation techniques from the field of computer vision have gradually been introduced into remote sensing for cloud detection [17,18,19]. Because deep networks can effectively mine multi-level texture features from images without requiring manual feature selection, they achieve higher accuracy than traditional machine-learning classifiers in cloud-detection tasks [20]. Although the emergence of deep learning has elevated cloud detection to a new level [21], allowing the high-dimensional features of remote-sensing images to be more fully exploited and yielding improvements in detection accuracy, the effect of the bit depth of remote-sensing images on the accuracy of deep-learning-based cloud-segmentation algorithms has not yet been discussed.
Bit depth, also known as color depth or pixel depth [22], describes the ability of a sensor to discriminate between objects viewed in the same part of the electromagnetic spectrum, and corresponds to the number of distinct data values that can be recorded in each band [23]. In other words, bit depth is the number of bits used to represent the value of each pixel in an image, and it is a critical factor in remote-sensing image analysis and interpretation [24]. According to the coding method, remote-sensing images with a bit depth of at least 10 bits are generally defined as high radiometric resolution images. High radiometric resolution remote-sensing images capture the structure and spectral information of ground features in greater detail, enhance the interpretability and reliability of the images, and improve the accuracy of remote-sensing analysis [25]. For example, the data captured by the Operational Land Imager (OLI) of Landsat 8 have better radiometric accuracy over a 12-bit dynamic range, which improves the overall signal-to-noise ratio; this compares with only 256 gray levels in the 8-bit instruments of Landsat 1–7. The improved signal-to-noise performance results in a better description of land-cover states and conditions, and the product is delivered in 16-bit unsigned integer format [26].
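As a quick reference for the figures above, a bit depth of b bits yields 2^b distinguishable gray levels per band: 2^8 = 256 levels for 8-bit data, 2^12 = 4096 levels for the 12-bit OLI dynamic range, and 2^16 = 65,536 levels for the delivered 16-bit product.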
In a study devoted to the impact of radiometric resolution on classification accuracy [27], a bagging classification tree was used to carry out the experiments. The authors found that the classification accuracy obtained from spectral indices and texture bands is more closely related to the radiometric resolution of those derived bands than to the radiometric resolution of the original remote-sensing image. In a paper on cloud detection with deep learning, Francis et al. [28] quantized 16-bit Landsat 8 images to 4 bits to verify the robustness of their model; their findings imply that high-precision data are not necessary for their cloud-masking approach to perform well. Ji et al. [29] used a novel fully convolutional network (FCN) to detect clouds and shadows in Landsat 8 images but converted the 16-bit images into 8-bit images by default during data preprocessing. Li et al. [17] utilized 16-bit GF-1 data for cloud detection, cloud removal, and cloud-coverage assessment, but all images were converted to 8-bit RGB images before being fed into the network. Beyond cloud segmentation, in other remote-sensing image-processing tasks such as water-body recognition, Song et al. [30] changed the bit depth of WorldView-3 images from 16 bit to 8 bit in the data-preprocessing stage to improve processing speed. In the task of marine-ranching recognition, Chen et al. [31] likewise changed the bit depth of GF-1 images from 16 bit to 8 bit during data preprocessing. In the field of biomedicine, Mahbod et al. [32] investigated the effect of image bit depth on the segmentation performance of cell-nucleus instances. Considering that remote-sensing images are acquired differently from biomedical images and have more complex content and a lower signal-to-noise ratio, the effect of bit depth on the segmentation of remote-sensing images is still worth exploring.
It can be observed from the existing literature that limited attention has been paid to the impact of bit depth on information extraction from remote-sensing images. In contrast, the influence of spatial [33,34] and spectral [35,36,37] resolution on classification accuracy and information-extraction capability has been studied extensively. Yet the results of cloud segmentation would be influenced by the bit depth of the remote-sensing images used [38]. In addition, prior research on bit depth has primarily focused on a single scenario. This study therefore aims to expand upon the existing literature by examining the influence of bit depth on cloud segmentation in remote-sensing imagery. To achieve this, we assessed its impact across eight distinct landscape classes: barren, forest, grass/crops, shrubland, snow/ice, urban, water, and wetlands.
This study is based on a comprehensive review of the literature and experimental analysis using a representative set of remote-sensing images. The findings provide insights into the optimal bit depth for cloud segmentation and inform best practices for remote-sensing image processing. More specifically, the three main contributions of this study are summarized as follows:
Unique focus on the impact of bit depth: While previous studies have largely focused on improving the performance of cloud-detection algorithms, our research specifically addresses the overlooked aspect of bit depth in remote-sensing images. By examining the relationship between bit depth and segmentation accuracy, we provide new insights into the importance of bit depth in cloud-detection tasks.
Comparative analysis of 8-bit and 16-bit remote-sensing images: Our study is among the first to systematically compare the performance of cloud detection using 8-bit and 16-bit remote-sensing images. This comparison not only highlights the differences in accuracy between the two types of images but also sheds light on the trade-offs between efficiency and accuracy in cloud-detection tasks.
Extensive evaluation across different surface landscapes: To ensure the generalizability of our findings, we have evaluated the performance of the UNet algorithm across different surface landscapes. This comprehensive evaluation helps to highlight the varying impact of bit depth on cloud-segmentation accuracy in diverse contexts.
2. Materials
Training deep-learning models requires a large number of high-quality remote-sensing images and corresponding labels. In this study, we utilize a widely recognized cloud-detection dataset, the Landsat 8 Biome Type Cloud Validation Dataset (L8 Biome) [1], which contains 96 Landsat 8 images sampled from around the world, each of size 8000 × 8000 pixels (30 m resolution), together with manually generated cloud masks. The dataset comprises eight distinct underlying surface types: barren, forest, grass, shrubland, snow, urban, water, and wetlands, with 12 images per category. To ensure the heterogeneity and diversity of the data, images covering different latitudes, longitudes, and cloud types were selected (Figure 1).
In this study, the focus was placed on cloud detection. Thus, the cloud masks of the L8 Biome dataset were consolidated into two categories: cloud and non-cloud. The binary cloud mask was generated by assigning a value of “0” to the “cloud shadow” and “clear” categories and a value of “1” to the “thin cloud” and “cloud” categories. Subsequently, the red, green, and blue bands were extracted from the original image to synthesize three-band images.
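To make the relabeling concrete, a minimal NumPy sketch is given below; the numeric class codes are placeholders for illustration and should be replaced with the values actually used in the L8 Biome mask products.

```python
import numpy as np

# Hypothetical class codes for the L8 Biome mask; replace with the
# values encoded in the downloaded products.
CLOUD_SHADOW, CLEAR, THIN_CLOUD, CLOUD = 64, 128, 192, 255

def binarize_mask(mask: np.ndarray) -> np.ndarray:
    """Collapse the four-class L8 Biome mask into cloud (1) / non-cloud (0)."""
    binary = np.zeros_like(mask, dtype=np.uint8)
    binary[np.isin(mask, (THIN_CLOUD, CLOUD))] = 1  # thin cloud + cloud -> 1
    # cloud shadow + clear remain 0
    return binary

def stack_rgb(red: np.ndarray, green: np.ndarray, blue: np.ndarray) -> np.ndarray:
    """Stack the red, green, and blue bands into a three-band (H, W, 3) image."""
    return np.dstack([red, green, blue])
```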
The data-preparation process for the remote-sensing images involved several steps. First, the 16-bit Landsat 8 images provided by L8 Biome were converted to 8-bit images using the raster dataset tool in ArcMap. To compare the performance of the two bit depths, sample datasets were created for both the 8-bit and 16-bit images (Figure 2). Due to limitations in computational resources, complete remote-sensing images were not used for the experiments; instead, small image blocks were cropped from the whole images. To minimize boundary issues, following Jeppesen et al. [39], each remote-sensing image was cropped into 512 × 512 sample blocks with 64-pixel buffer zones between adjacent windows, and blocks with more than 20% No-data pixels were discarded. To further increase the sample number and complexity, data-augmentation strategies, including random flips and rotations, were applied to the processed samples (Figure 3). Finally, we obtained 2248 images of size 512 × 512 for each bit depth. The dataset was divided into training and testing sets at a 4:1 ratio, giving 1798 training images and 450 testing images for each bit depth. Note that a small part of the training samples was used for validation. The training set was used to train the network model, the validation set was used to adjust model parameters during training, and the test set was used only to evaluate model performance and did not participate in training.
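The conversion and cropping described above were carried out with ArcMap; the following NumPy sketch only illustrates equivalent operations under our own assumptions (a simple linear stretch for the 16-bit to 8-bit conversion, a 64-pixel overlap between adjacent 512 × 512 windows, and flips/rotations for augmentation). It is not the exact procedure used in the experiments.

```python
import numpy as np

def to_8bit(img16: np.ndarray) -> np.ndarray:
    """Linearly rescale a 16-bit band to 0-255 (one possible 8-bit conversion)."""
    lo, hi = float(img16.min()), float(img16.max())
    scaled = (img16.astype(np.float32) - lo) / max(hi - lo, 1.0) * 255.0
    return scaled.astype(np.uint8)

def crop_patches(img, mask, size=512, stride=512 - 64, nodata=0, max_nodata=0.2):
    """Crop overlapping size x size patches (64-pixel overlap between windows)
    and drop patches with more than 20% No-data pixels.
    img: (H, W, 3) array, mask: (H, W) array."""
    patches = []
    h, w = img.shape[:2]
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            p_img = img[y:y + size, x:x + size]
            p_msk = mask[y:y + size, x:x + size]
            nodata_frac = (p_img == nodata).all(axis=-1).mean()
            if nodata_frac <= max_nodata:
                patches.append((p_img, p_msk))
    return patches

def augment(img, mask, rng=None):
    """Randomly flip and rotate (multiples of 90 degrees) an image/mask pair."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:
        img, mask = np.flip(img, axis=1), np.flip(mask, axis=1)  # horizontal flip
    if rng.random() < 0.5:
        img, mask = np.flip(img, axis=0), np.flip(mask, axis=0)  # vertical flip
    k = int(rng.integers(0, 4))
    return np.rot90(img, k, axes=(0, 1)), np.rot90(mask, k)
```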
3. Methods
In this section, we provide a detailed explanation of the methods and technical details employed in our experiments. First, we introduce UNet and its network details. Next, we discuss the experimental design, environment, and specific parameters. Lastly, we describe the evaluation metrics used to compare the performance of images with different bit depths. In the following subsections, we elaborate on each step in detail.
3.1. UNet Semantic Segmentation Algorithm
UNet [40] is a convolutional neural network (CNN) architecture that was developed from the FCN [41] architecture and first applied to biomedical image segmentation in 2015. The architecture was designed to address the challenge of segmenting objects with high variability in size, shape, and appearance. To achieve this, UNet combines a contracting path, which captures the context of the input image, with a symmetric expanding path that restores the detail of the segmented objects. This symmetric encoder–decoder structure is among the most frequently used designs in semantic segmentation of medical images. In remote sensing, the UNet architecture has been widely applied to various tasks, including cloud segmentation. Clouds can exhibit considerable variation in size, shape, and appearance, and UNet's skip connections help preserve detailed information from the input data, making it well suited for cloud segmentation, where the objects of interest can have complex internal structures and must be separated from the surrounding sky or land. Multiple studies have demonstrated the effectiveness of UNet for cloud segmentation [42,43].
The UNet architecture is depicted in Figure 4. The left half of the figure shows the encoding path, and the right half shows the decoding path. The encoding path comprises blocks with two 3 × 3 convolutions followed by a max-pooling layer; the convolution layers extract features from the input remote-sensing image, and the max-pooling layer halves the size of the feature map. The decoding path consists of blocks with a deconvolution, a skip connection, and two 3 × 3 convolutions, where the deconvolution layer doubles the spatial size of the feature map. To match the size of the output with that of the input remote-sensing image, bilinear interpolation is used to upsample the small-sized feature map to the original size. The skip connections transfer information from the encoding path to the decoding path, enhancing UNet's segmentation ability by compensating for information lost during downsampling. Finally, the segmentation result is obtained through channel reduction with a 1 × 1 convolution.
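For readers unfamiliar with the architecture, the following is a compact PyTorch sketch of the encoder/decoder blocks described above. The number of stages, the channel widths, the batch normalization, and the use of bilinear upsampling in place of deconvolution are simplifications of our own, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by BatchNorm and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class MiniUNet(nn.Module):
    """Simplified UNet: two encoder stages, a bottleneck, two decoder stages."""
    def __init__(self, in_ch=3, n_classes=2):
        super().__init__()
        self.enc1, self.enc2 = DoubleConv(in_ch, 64), DoubleConv(64, 128)
        self.bottleneck = DoubleConv(128, 256)
        self.dec2, self.dec1 = DoubleConv(256 + 128, 128), DoubleConv(128 + 64, 64)
        self.head = nn.Conv2d(64, n_classes, kernel_size=1)  # 1x1 conv for channel reduction

    def forward(self, x):
        e1 = self.enc1(x)                          # full resolution
        e2 = self.enc2(F.max_pool2d(e1, 2))        # 1/2 resolution
        b = self.bottleneck(F.max_pool2d(e2, 2))   # 1/4 resolution
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        d2 = self.dec2(torch.cat([up(b), e2], dim=1))   # skip connection from enc2
        d1 = self.dec1(torch.cat([up(d2), e1], dim=1))  # skip connection from enc1
        return self.head(d1)                        # per-pixel class logits
```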
3.2. Experimental Design
In order to assess the effect of bit depth on cloud-segmentation performance in various scenes, two sets of experiments were conducted. First, datasets consisting of 512 × 512 image sample blocks and corresponding masks were created for the 16-bit and 8-bit images, respectively, and the cloud-detection performance of networks trained with samples of each bit depth was evaluated for each single scene. The 16 cloud-detection networks trained with these training samples (eight surface types × two bit depths) were then applied to the test set to examine the impact of bit depth on cloud-segmentation performance in different scenes. Prior to network training, the pixels of the input images were normalized to the interval [0, 1] through Equation (1):

X_norm = (X − X_min) / (X_max − X_min),

where X denotes the original data, X_norm denotes the normalized data, and X_max and X_min denote the maximum and minimum values of the original data, respectively.
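A minimal sketch of this min-max normalization, applied identically to 8-bit and 16-bit inputs (the per-image computation of X_max and X_min is our assumption):

```python
import numpy as np

def minmax_normalize(img: np.ndarray) -> np.ndarray:
    """Scale pixel values to [0, 1] following Equation (1)."""
    x = img.astype(np.float32)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / max(x_max - x_min, 1e-6)  # guard against a constant image
```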
In the second set of experiments, with all other conditions kept constant, the effect of omitting image normalization was investigated. This was done to determine whether normalization affects the disparity between the results obtained for different bit depths.
3.3. Experimental Environment
In this experiment, a PyTorch-based implementation of the UNet semantic-segmentation model was run on a server with an RTX TITAN GPU with 24 GB of memory. The server was configured with an Intel Core i7-8700 processor clocked at 3.20 GHz and 16 GB of RAM, running a 64-bit Ubuntu 18.04 operating system. The programming language used was Python 3.8, and key libraries included GDAL, OpenCV, NumPy, Matplotlib, and PIL.
3.4. Experimental Details
During the optimization phase, we employed the Adam optimization algorithm with an initial learning rate of 0.001. In order to adjust the learning rate dynamically, we employed an exponential decay strategy with a decay rate of 0.9. The learning rate per epoch was calculated according to Equation (2):

l = l_initial × 0.9^epoch,

where l denotes the learning rate at a given epoch and l_initial denotes the initial learning rate.
Our study involved training all models for approximately 100 epochs until convergence was achieved. Due to memory constraints on the GPU, the batch size was set to 8, and the same parameter settings were maintained to fairly evaluate the performance of various approaches. The detailed steps of our cloud-detection model training and validation are presented in Algorithm 1.
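A hedged PyTorch sketch of this optimization setup (Adam with an initial learning rate of 0.001, exponential decay of 0.9 per epoch, batch size 8, roughly 100 epochs) is shown below; the loss function, the data loader, and the model class are placeholders rather than the exact training code.

```python
import torch
from torch.utils.data import DataLoader

model = MiniUNet(in_ch=3, n_classes=2).cuda()          # sketch model from Section 3.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Equation (2): l = l_initial * 0.9 ** epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
criterion = torch.nn.CrossEntropyLoss()

# train_dataset is a placeholder yielding (float image tensor, integer mask) pairs
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

for epoch in range(100):
    model.train()
    for images, masks in train_loader:
        images, masks = images.cuda(), masks.cuda().long()
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()  # decay the learning rate once per epoch
```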
Algorithm 1 Cloud-detection model of different bit depths.
Input: Remote-sensing image of either 8-bit or 16-bit depth
Output: Cloud mask of the image and evaluation metrics
1: Preprocess the input image to obtain 16-bit and 8-bit images
2: Split the image into overlapping patches
3: For each patch in the image:
4:     Feed the patch to the trained deep-learning model
5:     Obtain the predicted cloud mask of the patch
6: Merge the predicted cloud masks of all patches to obtain the final cloud mask of the entire image
7: Postprocess the final cloud mask to refine the results
8: Evaluate the performance of the deep-learning model using evaluation metrics such as accuracy, kappa, etc.
9: Output the final cloud mask of the image and the evaluation metrics
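A rough Python sketch of the patch-wise inference in steps 2–6 of Algorithm 1 is given below. The non-overlapping tiling, the zero padding of edge tiles, and the reuse of the minmax_normalize and model sketches from earlier sections are our own simplifications, and the post-processing of step 7 is omitted.

```python
import numpy as np
import torch

@torch.no_grad()
def predict_cloud_mask(model, image, patch=512):
    """Tile the image, predict each patch, and stitch the cloud mask back together.
    image: (H, W, 3) array of reflectance values."""
    model.eval()
    h, w, c = image.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            tile = image[y:y + patch, x:x + patch]
            ph, pw = tile.shape[:2]
            # Pad edge tiles up to the full patch size before inference
            padded = np.zeros((patch, patch, c), dtype=np.float32)
            padded[:ph, :pw] = minmax_normalize(tile)
            inp = torch.from_numpy(padded).permute(2, 0, 1).unsqueeze(0).cuda()
            pred = model(inp).argmax(dim=1).squeeze(0).cpu().numpy()
            mask[y:y + ph, x:x + pw] = pred[:ph, :pw]
    return mask
```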
3.5. Evaluation Metrics
In this paper, an objective evaluation of the test results is carried out using a confusion matrix. The matrix, of size M × M, records the number of elements classified as each category together with the number of true instances of that category, where M represents the total number of categories. Accuracy and kappa are calculated from the information contained in the confusion matrix using the following formulae:

Accuracy = (TP + TN) / (TP + TN + FP + FN),

Kappa = (N Σ_i x_ii − Σ_i x_i+ x_+i) / (N² − Σ_i x_i+ x_+i),

where TP refers to the number of samples correctly identified as clouds, TN denotes the number of samples correctly classified as non-clouds, FP represents non-cloud samples mistakenly categorized as clouds, and FN stands for cloud samples misclassified as non-clouds. Accuracy is the proportion of correctly predicted pixels among all pixels in the samples, and kappa measures the reduction in error achieved by a classification system compared to random chance. Kappa is calculated from the diagonal elements (x_ii), the row and column sums (x_i+ and x_+i) of the confusion matrix, and the total number of image elements (N). The kappa coefficient represents the extent to which the classifier outperforms a purely random classification and thus provides a measure of the effectiveness of the system.
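For concreteness, the sketch below computes overall accuracy and the kappa coefficient from a 2 × 2 confusion matrix; it is our own implementation of these standard formulas, not code from the experiments.

```python
import numpy as np

def accuracy_and_kappa(conf: np.ndarray):
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    n = conf.sum()
    observed = np.trace(conf) / n                                   # overall accuracy
    expected = (conf.sum(axis=1) * conf.sum(axis=0)).sum() / n**2   # chance agreement
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Example with hypothetical counts: TN=950, FP=50, FN=100, TP=900
# (rows: true non-cloud, true cloud; columns: predicted non-cloud, predicted cloud)
conf = np.array([[950,  50],
                 [100, 900]])
acc, kappa = accuracy_and_kappa(conf)   # acc = 0.925, kappa = 0.85
```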
5. Discussion
The primary objective of this study was to assess the impact of different bit depths on cloud-segmentation accuracy, conducting experiments on remote-sensing images of various surface types. The findings indicate that 8- and 16-bit images are similarly competitive in cloud segmentation. Nevertheless, the higher-bit-depth classification results presented in Figure 8 are notably better, accurately identifying and classifying clouds of various shapes and sizes. Overall, remote-sensing images with higher bit depth provide more stable and higher accuracy and kappa coefficients and perform better in cloud segmentation than lower-bit-depth images. However, it is important to note that higher-bit-depth images do not always provide more accurate results, as this also depends on the selected dataset and classification method; in other words, some images are easier to segment, resulting in higher segmentation scores, while others are more challenging, resulting in lower scores. The accuracy and kappa coefficients of the lower-bit-depth images varied by 2–4% in the snow and ice scenes, resulting in poorer classification. This may be because lower bit depths cause finer elements within pixels to be missed; these elements introduce higher intra-class variance, which may result in lower accuracy.
When comparing 16-bit with 8-bit images for the semantic segmentation of clouds, the primary difference in the visualization results is the level of detail and accuracy of the segmented clouds. Owing to their larger dynamic range, 16-bit images capture more detail and variation in the clouds, producing more accurate and detailed segmentation with clear boundaries and a clear distinction between clouds and the background. Furthermore, 16-bit images are less sensitive to noise, which reduces the impact of noise on the final segmentation results. In contrast, the limited dynamic range of 8-bit images can cause detail and variation in the clouds to be lost, leading to less accurate and detailed segmentation, less distinct boundaries, and a weaker separation between clouds and the background. Additionally, each band of an 8-bit image is quantized to 256 levels, whereas each band of a 16-bit image can represent 65,536 levels; this finer quantization helps capture more of the detail and variation within the clouds and thus contributes to more accurate and detailed semantic segmentation. In conclusion, semantic segmentation of clouds using 16-bit images can yield more accurate and detailed results, with clear boundaries and a clear distinction between clouds and the background, because of the increased dynamic range and finer quantization of pixel values, whereas segmentation based on 8-bit images tends to produce less accurate and less detailed results.
As shown in Table 1 of the results section, omitting the normalization step during training resulted in a decrease in accuracy and further widened the accuracy gap between 8- and 16-bit images. This highlights the significance of the normalization step in training deep-learning-based algorithms. Normalization is a preprocessing technique that scales pixel values to a specific range, typically [0, 1], which helps the algorithm converge faster and generalize better. It is particularly useful when dealing with images of varying intensities, as it reduces the effect of contrast differences between images. In the case of 8BN, normalization helps to mitigate some limitations of the lower bit depth, such as the reduced dynamic range and lower sensitivity to subtle variations in pixel values. On the other hand, increasing the bit depth from 8 to 16 bits yields a higher dynamic range and a more precise representation of pixel values, which can improve segmentation accuracy. In this study, the 16B results without normalization indeed show an improvement over the 8B results; however, the full potential of the 16-bit images might not have been realized in the absence of normalization. The similar results for 8BN and 16B in Table 1 indicate that normalization can help to bridge the performance gap between lower and higher bit depths to some extent, but this does not imply that normalization can completely substitute for the benefits of higher bit depth. In fact, when both 8- and 16-bit images are normalized (8BN and 16BN), the 16BN results show higher segmentation accuracy, suggesting that a combination of higher bit depth and normalization provides even better cloud-detection performance.
While normalizing images can make the weight distribution of a deep-learning model more stable, it may also affect training time. Normalization requires additional computation, which can consume valuable training time, especially when training large-scale models. In some cases, normalization can also reduce the dynamic range of the pixel values, making it more difficult for the model to distinguish between different land-cover types. Thus, while normalization can improve model performance and reliability, its potential impact on training time should be considered, especially when training on more complex datasets with more iterations. Further investigation is necessary to determine the extent of this impact and how it varies with different datasets, classification methods, and hyperparameters. To further explore the differences in segmentation performance of remote-sensing images with varying bit depths, we also analyzed training time. In our study, the training time for 16-bit images was indeed slightly longer than for 8-bit images. While the difference in Table 2 may not appear significant, the overall computational resources and time required should be considered in large-scale remote-sensing applications, where even small differences in training time can have a more substantial impact on efficiency. The normalization operation also affects efficiency: it reduces the range of pixel values, which can speed up training by making convergence faster, and it also affects cloud-detection accuracy by helping the algorithm generalize better across different image intensities. In summary, although Table 2 shows no large difference in training time between 8- and 16-bit images, efficiency considerations should account for the overall computational resources and time required in large-scale applications. It is worth noting that in our experiments, using 8-bit instead of 16-bit images did not significantly affect training time or GPU usage, since both cases used the default 32-bit floating-point training scheme. Mixed-precision training (using 16- and 32-bit floating point) is also possible, but our experiments found no significant change in training time or GPU usage, since we only varied the bit depth of the 8- and 16-bit input images. Nevertheless, this minor difference cannot be disregarded, as training time is also affected by the selected dataset, classification method, and hyperparameter tuning; with more complex datasets and more iterations, this gap could widen substantially, particularly given that we trained each model for only 100 epochs.
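As an aside, the mixed-precision option mentioned above can be enabled in PyTorch roughly as follows; this is a sketch built on the hypothetical training loop from Section 3.4, and, as noted, it made no significant difference in our setting.

```python
import torch

# model, optimizer, criterion, and train_loader as in the Section 3.4 sketch
scaler = torch.cuda.amp.GradScaler()

for images, masks in train_loader:
    images, masks = images.cuda(), masks.cuda().long()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run the forward pass in float16 where safe
        loss = criterion(model(images), masks)
    scaler.scale(loss).backward()            # scale the loss to avoid float16 underflow
    scaler.step(optimizer)
    scaler.update()
```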
In this study, we focused on evaluating the impact of bit depth on cloud-segmentation accuracy using a single deep-learning classification algorithm. This approach allowed us to specifically examine the effect of image bit depth on cloud segmentation while minimizing the impact of parameter tuning and training-sample considerations. However, to increase the generalizability and transferability of our findings, it may be useful to extend this research to other classification algorithms and datasets in future studies. It should also be noted that, while we focused on cloud segmentation in this study, the potential impact of image bit depth on other remote-sensing analysis tasks, such as water and building extraction, warrants further investigation.
6. Conclusions
Thanks to the continuous development of computer technology in the field of remote sensing, advanced algorithms, such as statistical methods, artificial neural networks, support-vector machines, and fuzzy-logic algorithms, have been developed and widely used. To some extent, this compensates for the low detection accuracy of traditional cloud-segmentation methods. In this study, we investigated the effect of image bit depth on the performance of DL-based cloud segmentation using different datasets. Our findings indicate that there is a difference in accuracy between models trained with 8- and 16-bit remote-sensing images, and that cloud-segmentation accuracy is also affected by the underlying surface type. Although 16-bit images are more accurate for cloud segmentation, training with 8-bit images is more efficient. The impact of image bit depth on other remote-sensing analysis tasks can be explored in future research. Our research offers valuable insights for remote-sensing practitioners, helping them make informed decisions about the optimal bit depth for cloud-detection tasks. By considering the balance between efficiency and accuracy, practitioners can select the most suitable bit depth for their specific application, ultimately leading to improved cloud-detection results.
In conclusion, using a higher bit depth, such as 16-bit images, and normalizing the images before semantic segmentation can significantly improve the accuracy of cloud-layer segmentation in remote-sensing images. The increased dynamic range and resolution in color values of 16-bit images can capture more details and variations within the clouds, which is crucial for accurate semantic segmentation. Normalizing the images can help remove variations in brightness and contrast, making the image more consistent and aiding the model to generalize better. Additionally, normalization can also reduce the impact of outliers or extreme values in the image. By utilizing these techniques, the accuracy and reliability of cloud-layer segmentation in remote-sensing applications can be greatly improved.