Article

CCE-UNet: Forest and Water Body Coverage Detection Method Based on Deep Learning: A Case Study in Australia’s Nattai National Forest

College of Mathematics and Computer Science, Zhejiang A&F University, Hangzhou 311300, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Forests 2024, 15(11), 2050; https://doi.org/10.3390/f15112050
Submission received: 9 October 2024 / Revised: 13 November 2024 / Accepted: 15 November 2024 / Published: 20 November 2024
(This article belongs to the Section Natural Hazards and Risk Management)

Abstract

Severe forest fires caused by extremely high temperatures have resulted in devastating disasters in the natural forest reserves of New South Wales, Australia. Traditional forest research methods rely primarily on manual field surveys, which have limited generalization capabilities. In order to monitor forest ecosystems more comprehensively, maintain the stability of the regional forest ecosystem, and track post-disaster ecological restoration efforts, this study employed high-resolution remote sensing imagery and proposed a semantic segmentation architecture named CCE-UNet. This architecture focuses on the precise identification of forest coverage while simultaneously monitoring the distribution of water resources in the area. It utilizes the Contextual Information Fusion Module (CIFM) and introduces a dual attention mechanism strategy to effectively filter background information and enhance image edge features. Meanwhile, it employs a multi-scale feature fusion algorithm to retain as much image detail and depth information as possible, achieving precise segmentation of forests and water bodies. We also trained seven existing semantic segmentation models as candidates for comparison. Experimental results show that the CCE-UNet architecture achieves the best performance in forest and water body segmentation tasks, with an MIoU of 91.07% and an MPA of 95.15%. This study provides strong technical support for the detection of forest and water body coverage in the region and is conducive to the monitoring and protection of the forest ecosystem.

1. Introduction

Forest ecosystems offer a wide range of ecological services and socioeconomic benefits. Forests occupy a central role in terrestrial ecosystems, covering extensive areas and featuring diverse species, complex structures, and abundant resources. However, the intensification of global climate change and environmental damage, such as forest fires, excessive deforestation, and the frequent occurrence of extreme climate events, is placing unprecedented pressure on this ecosystem [1,2,3]. These pressures threaten not only the health of forest ecosystems but may also have profound impacts on the global environment and economy [4]. In particular, forest coverage monitoring and water resource distribution monitoring can provide more comprehensive technical support for ecosystem restoration assessment [5]. In the face of natural disasters like wildfires, which cause widespread forest burns or damage, accurate forest coverage data provide an essential scientific basis for post-disaster rescue and vegetation restoration [6]. Additionally, long-term monitoring of forest cover changes enables the assessment of forest ecosystem health, supporting the protection and sustainable use of forest resources [7]. From the perspective of ecological service functions, forests and water bodies work together to conserve water sources and reduce soil erosion. Therefore, monitoring water bodies simultaneously can help assess the ecological service functions of forests. The monitoring of forest coverage and the distribution changes of water resources also helps to assess the impact of natural disasters such as fires and floods on forest ecosystems [8].
In the past, forest monitoring primarily relied on field surveys, a method that is costly and time-consuming, and in some areas, it is difficult to implement due to spatial or topographical constraints [9]. With the development of satellite sensor technology, high-resolution remote sensing satellite imagery offers a wide coverage range, which can effectively reduce the need for field surveys and provide the capability for periodic monitoring. As such, remote sensing now offers unprecedented capabilities for large-scale forest monitoring [10]. Forest coverage can reflect the overall health and biodiversity level of forest ecosystems [11]. The forest coverage data extracted from remote sensing satellite imagery can be used to monitor changes in forest area, vegetation types, and forest age structure, all of which are key parameters for assessing the state of ecosystems. Therefore, accurate extraction of forest coverage can provide a scientific basis for forest management and conservation [12]. For example, Liu et al. used Gaofen-2 satellite imagery to extract the characteristics of Ulmus tree canopies in the Otingdag Sandy Land area to monitor the Ulmus forest coverage in the region [13]. Kalinaki et al. utilized Sentinel-2 satellite imagery to monitor recent forest coverage in Brunei [14]. However, due to the complexity and variability of forest and water environments, accurately extracting forest and water information from remote sensing imagery and achieving efficient semantic segmentation remains a challenging issue.
A forest ecosystem is a complex biogeographical system that encompasses interdependent and interacting biotic and abiotic elements. Among these, the distribution of water resources is fundamental to the health and stability of the forest ecosystem. Therefore, monitoring and managing the distribution of water resources is crucial for the protection and sustainable use of the ecosystem [15]. In recent years, artificial intelligence has become an important tool for managing natural disasters, and the development of deep learning technology provides new possibilities for water body monitoring from remote sensing images [16]. Although deep learning technology has achieved remarkable results in remote sensing image segmentation, there are still some limitations [17]. The DeeplabV3+ [18] architecture may have limitations in processing complex scenes. The Unet [19] network may be insufficient for feature extraction. Swin-Transformer [20], TransUNet [21], and Swin-TransUNet [22] have high computational complexity. Segformer has shown overall promising performance, but it may have limitations in precise boundary positioning, especially in cases where object boundaries are blurred, or the scene is complex [23].
Currently, traditional forest monitoring methods rely heavily on manual field surveys, which not only consume significant human resources but also lack generalization capabilities. Similarly, monitoring the distribution of water resources faces numerous shortcomings when dealing with complex environments. Forests and water bodies are integral components of forest ecosystems. By utilizing satellite remote sensing imagery to simultaneously monitor forest coverage and water resource distribution, we can not only provide decision-making support for the utilization of forest resources in the region but also offer more comprehensive support for monitoring the restoration of forest ecosystems. Given the gap between traditional monitoring methods and currently popular deep learning approaches, achieving high-precision and rapid extraction of forest and water body features is the primary challenge of this study. Therefore, we propose the CCE-UNet semantic segmentation architecture to enhance the segmentation accuracy of forest and water body boundaries in remote sensing images. The model integrates a multi-scale scene information fusion algorithm and a dual attention mechanism strategy, utilizing high-resolution remote sensing images from the Google Earth Engine platform to achieve high-precision and rapid feature extraction of forests and water bodies. At the same time, it provides technical support for the monitoring of forest ecosystem restoration in the region.

2. Materials and Methods

2.1. Description of the Study Area

The research area of this experiment is the Nattai National Forest Reserve in Australia. As shown in Figure 1, the study area is located in the southwest of New South Wales, Australia, under the jurisdiction of Wollondilly Shire, approximately 100 km from Sydney. Covering an area of 53,000 hectares, it is situated between 152°30′ and 153°00′ east longitude and 31°30′ to 31°50′ south latitude. The park is renowned for its rugged and mountainous terrain. The reserve experiences a subtropical climate with hot and humid summers and mild and rainy winters. It is primarily composed of forests, rivers, and lakes, with forest trees mainly consisting of eucalyptus and various rainforest species. From July 2019 to March 2020, Australia suffered the most severe wildfire disaster in its history, affecting five states and two territories. The fire engulfed over 10 million hectares, destroyed more than 6000 buildings, and resulted in the death of about 1 billion animals, with New South Wales and Victoria being the hardest hit. Following the disaster, the Australian government announced an investment of AUD 50 million to aid the recovery of animals and plants and to mitigate the losses caused by the fires. The main causes of wildfires include climate change, drought, strong winds, and human activities. In 2019, Australia’s average temperature was 1.52 °C higher than the long-term average, and the average maximum temperature was 2.09 °C higher, setting a historical record. The persistent warming led to drier weather conditions, such as heatwaves, increasing the risk of wildfires. Prior to the wildfires, Australia had been experiencing a prolonged drought, which dried out the vegetation and provided ample fuel for the fires. Therefore, this study selected the core area of the wildfire outbreak in New South Wales, the Nattai National Forest Reserve.
We obtained high-definition satellite images of the area from 2018 to 2023 through the Google Earth Engine platform. Due to the slow growth rate of forest vegetation, the characteristics do not change significantly in a short period of time. Therefore, we extracted thumbnails of some disaster-affected areas from 2018 (before the fire), 2020 (after the fire), and 2022 (three years after the fire) for comparison. As shown in Figure 2, taking forest coverage as an example, we can observe that in 2018, the forest vegetation in the area was lush with a very high coverage rate; after the fire in 2020, the area showed large areas of exposed land, and the forest vegetation suffered devastating damage. Three years later, in 2022, as the ecosystem gradually recovered, the forest vegetation gradually covered the area again.

2.2. Data Preprocessing and Dataset Establishment

The data for this experiment were selected from images of the main disaster-affected area in the southern part of the Nattai National Park in New South Wales. We used the Google high-definition imagery collection for our research, which consists of high-definition color images (R/G/B) with a spatial resolution of 1 m. The timing of image capture can affect the results of forest and water body feature extraction, so it is crucial to set a time window in which forests and water bodies are easy to observe. To better extract forest and water body features, the imaging time window was set from early March to mid-April. During this period, the forest reserves in New South Wales have the least cloud cover, the tree characteristics are most distinct, and the coverage of forests and water bodies is also very suitable for remote sensing observation.
Due to minimal changes in forest vegetation coverage in the area over the years before the fire, in order to better observe the recovery of forests and water bodies in the area after the disaster, we selected remote sensing images from the year before the fire and the four years after the fire (i.e., 2018, 2020–2023). The original data were all high-definition remote sensing images without cloud cover in winter to prevent noise in the training model. Ultimately, we chose five remote sensing images that contain a large number of forests, water bodies, and open spaces from the southern part of the Nattai Forest Reserve in New South Wales, Australia. Considering the similar growth conditions of forest vegetation, we believe that there is no significant difference in the extraction of forests and water bodies between these images. All images used in this study were downloaded from the Google Earth Engine platform (https://developers.google.com/earth-engine/datasets/, accessed on 16 July 2024). We imported the downloaded raw data into ARCGIS (https://www.arcgis.com/, accessed on 16 July 2024) and performed bilinear interpolation for geometric correction while retaining the geographic information.
The construction of the semantic segmentation dataset was completed by four annotators. Specifically, we used ARCGIS’s rasterization tool to create underlying features for each original image, then set the classification for forests, water bodies, and the background, and added the polygon features of forests and water bodies to the ARCGIS vector layer. Considering the potential differences in the judgment of forest and water body boundaries by different annotators, the annotation process followed the principle of allowing partial overlap of polygon features at the boundaries of different classifications. The selected four images were preliminarily annotated, which formed the basis of our forest and water body semantic segmentation dataset. Then, we exported the layers with preliminary annotations and generated complete vector maps. Due to GPU memory limitations, the entire original image could not be directly input into the deep learning model for training and inference, so it was necessary to crop the annotated images. The original images were 24-bit color depth TIF format files, which we cropped into RGB channel images using the OpenCV library. The final image files were in JPG format, and the label files were in PNG format. After completing the above process, we selected 945 valid images. The dataset images include elements of forests and water bodies, as well as roads, land, and a few residential areas. The objects segmented in this experiment include forests, water bodies, and the background. The dataset contains original images and annotation data. Red pixels represent forest areas, green pixels represent water body areas, and black pixels represent the background. Each image and label file in the dataset was cropped into 256 pixels × 256 pixels, totaling 1890 pieces. As shown in Figure 3, to improve the model’s generalization ability and robustness, this study used four data augmentation methods to expand the dataset. Each original image was randomly scaled, flipped, rotated, and translated, resulting in a total of 4725 valid images and 4725 annotated images. The dataset was further divided randomly, at a ratio of 8:1:1, to form the training/validation/test sets for subsequent research.
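The tiling, augmentation, and splitting steps described above can be outlined in code. The following Python sketch is illustrative only: the file handling, crop logic, and augmentation parameters are assumptions, not the scripts actually used in this study.

```python
# Illustrative sketch of the tiling, augmentation, and 8:1:1 split described above.
# Paths, crop logic, and augmentation ranges are assumptions, not the authors' scripts.
import random
import cv2
import numpy as np

TILE = 256  # tile size in pixels, as used for the dataset

def tile_pair(image_path, label_path):
    """Crop an RGB image and its colour-coded label into 256 x 256 tile pairs."""
    img = cv2.imread(image_path)   # 24-bit colour image read as BGR
    lbl = cv2.imread(label_path)   # red = forest, green = water, black = background
    tiles = []
    h, w = img.shape[:2]
    for y in range(0, h - TILE + 1, TILE):
        for x in range(0, w - TILE + 1, TILE):
            tiles.append((img[y:y + TILE, x:x + TILE], lbl[y:y + TILE, x:x + TILE]))
    return tiles

def augment(img, lbl):
    """Apply one random geometric augmentation (scale, flip, rotate, translate)."""
    op = random.choice(["scale", "flip", "rotate", "translate"])
    if op == "flip":
        return cv2.flip(img, 1), cv2.flip(lbl, 1)
    if op == "rotate":
        m = cv2.getRotationMatrix2D((TILE / 2, TILE / 2), random.choice([90, 180, 270]), 1.0)
    elif op == "scale":
        m = cv2.getRotationMatrix2D((TILE / 2, TILE / 2), 0, random.uniform(0.8, 1.2))
    else:  # translate by a small random offset
        m = np.float32([[1, 0, random.randint(-20, 20)], [0, 1, random.randint(-20, 20)]])
    return (cv2.warpAffine(img, m, (TILE, TILE)),
            cv2.warpAffine(lbl, m, (TILE, TILE), flags=cv2.INTER_NEAREST))

def split(samples, ratios=(0.8, 0.1, 0.1)):
    """Randomly divide samples into training/validation/test sets at 8:1:1."""
    random.shuffle(samples)
    n_tr = int(len(samples) * ratios[0])
    n_va = int(len(samples) * ratios[1])
    return samples[:n_tr], samples[n_tr:n_tr + n_va], samples[n_tr + n_va:]
```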

2.3. Contextual Information Fusion Module

Multi-scale features provide richer contextual information, which is crucial for understanding objects and scene structures in images. Forest images have specific texture features, usually containing relatively complex textures. In contrast, water body images are mostly characterized by smooth water surfaces, and the two have a certain spatial relationship with each other. For example, wetlands or riverbank vegetation around water bodies are often adjacent to forests. For this purpose, we designed the Contextual Information Fusion Module (CIFM). Its multi-scale feature extraction enhances spatial relationship learning, effectively suppresses background noise, and improves the extraction of forest and water body features. The Non-Local attention mechanism improves the quality of feature representation by modeling the dependence between distant pixels in the image and can effectively capture global contextual information. To this end, we introduce the Non-Local attention mechanism and dilated convolutions with different dilation rates, apply Weight Norm to constrain the weights and counteract the excessive reuse of the same feature information caused by multi-scale feature fusion, and then use global max pooling to extract salient feature points, which further reduces the risk of overfitting and improves the generalization ability of the model. Here, Weight Norm is a weight normalization technique that improves the training stability of neural networks by normalizing the length of the weight vector.
This module is applied to the deep feature extraction stage of the encoder, with its core function being to extract deeper semantic information and fuse features of different scales, thereby enhancing the overall performance of the network. As shown in Figure 4, the CIFM module first uses the Non-Local attention mechanism to capture global context information on the input feature map, then performs a 1 × 1 convolution, and applies atrous convolutions with dilation rates of 4, 8, and 12 to extract features at different scales. Atrous convolutions with different dilation rates expand the receptive field and capture a wider range of contextual information. Before applying the dilated convolutions, we use Weight Norm to keep the norm of the weight matrix stable, improving training stability and speed while maintaining the direction of the weight vector, which benefits the training and generalization capabilities of the network. Global max pooling is then applied to extract salient features of the image, followed by another 1 × 1 convolution, and the result is upsampled and fused with the extracted multi-scale features. Weight Norm is applied again to the fused features to reduce noise and redundant information, further mitigating the issues of gradient vanishing or explosion. Finally, a 1 × 1 convolution is used to reduce the number of channels of the feature map, and ReLU is used to output the class probability of each pixel.
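For concreteness, the following PyTorch sketch shows one possible implementation of the CIFM described above, under stated assumptions: the simplified non-local block, the intermediate channel widths, and the bilinear upsampling of the pooled branch are our own choices, not the authors' released code.

```python
# Hedged PyTorch sketch of the CIFM described above; layer widths and the simplified
# non-local block are assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm

class NonLocalBlock(nn.Module):
    """Simplified non-local (self-attention) block over spatial positions."""
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels // 2, 1)
        self.phi = nn.Conv2d(channels, channels // 2, 1)
        self.g = nn.Conv2d(channels, channels // 2, 1)
        self.out = nn.Conv2d(channels // 2, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # B x HW x C/2
        k = self.phi(x).flatten(2)                     # B x C/2 x HW
        v = self.g(x).flatten(2).transpose(1, 2)       # B x HW x C/2
        attn = torch.softmax(q @ k, dim=-1)            # pairwise pixel affinities
        y = (attn @ v).transpose(1, 2).reshape(b, c // 2, h, w)
        return x + self.out(y)                         # residual connection

class CIFM(nn.Module):
    """Contextual Information Fusion Module: non-local context, multi-scale
    weight-normalized dilated convolutions, and a global max pooling branch."""
    def __init__(self, in_ch, out_ch, rates=(4, 8, 12)):
        super().__init__()
        self.non_local = NonLocalBlock(in_ch)
        self.branch1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.dilated = nn.ModuleList([
            weight_norm(nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r))
            for r in rates
        ])
        self.pool_branch = nn.Sequential(
            nn.AdaptiveMaxPool2d(1),                   # global max pooling
            nn.Conv2d(in_ch, out_ch, 1),
        )
        self.fuse = weight_norm(nn.Conv2d(out_ch * (2 + len(rates)), out_ch, 1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.non_local(x)
        feats = [self.branch1x1(x)] + [conv(x) for conv in self.dilated]
        pooled = self.pool_branch(x)
        pooled = F.interpolate(pooled, size=x.shape[2:], mode="bilinear",
                               align_corners=False)    # upsample to the map size
        feats.append(pooled)
        return self.act(self.fuse(torch.cat(feats, dim=1)))
```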

2.4. CBAM Module Based on Shallow Feature Application

In artificial intelligence and deep learning, the attention mechanism enables models to automatically learn and focus on task-relevant parts. To enhance the model’s baseline performance, we introduced the CBAM attention mechanism module during the initial stage of feature extraction. CBAM enhances the effective information in the feature map by sequentially applying the channel attention module and the spatial attention module. As shown in Figure 5, starting from the channel attention module, feature compression is first performed, and global average pooling and global maximum pooling are performed on the input feature map F to obtain two different feature descriptions respectively. The formula can be expressed as follows:
$$F_{avg} = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} F(h, w)$$

$$F_{max} = \max_{1 \le h \le H,\; 1 \le w \le W} F(h, w)$$

where $H$ and $W$ are the height and width of the feature map, respectively. $F_{avg}$ and $F_{max}$ are then each passed through a shared fully connected layer, and the output can be expressed as follows:

$$M_c = \sigma(W_1 \cdot F_{avg} + W_2 \cdot F_{max})$$

Among them, $W_1$ and $W_2$ are the weights of the fully connected layer, and $\sigma$ is the sigmoid activation function. Finally, the output $M_c$ of the fully connected layer is multiplied by the input feature map $F$ to obtain the channel-attention-weighted feature map $F'$. At this point, the channel attention module has completed feature compression, the shared network, and weight generation. Next is the spatial attention module, which takes the feature map $F'$ as input, performs maximum pooling and average pooling along the channel dimension, and combines the two resulting feature descriptions into the spatial attention map $M_s$. Its formula can be expressed as follows:

$$M_s = \sigma\left(\mathrm{conv}\left(\left[F'_{avg};\; F'_{max}\right]\right)\right)$$

Among them, $F'_{avg}$ and $F'_{max}$ are the average pooling and maximum pooling results of $F'$, respectively, and $\mathrm{conv}$ represents the convolution operation. Finally, $M_s$ is multiplied by $F'$ to obtain the spatial-attention-weighted feature map $F''$.
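A minimal CBAM sketch corresponding to the channel- and spatial-attention formulas above is given below; the reduction ratio and spatial kernel size are common defaults and are assumed rather than taken from the paper.

```python
# Minimal CBAM sketch following the channel- and spatial-attention formulas above;
# reduction ratio and spatial kernel size are common defaults, not values from the paper.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Shared MLP used for both the average- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel,
                                      padding=spatial_kernel // 2, bias=False)

    def forward(self, f):
        # Channel attention: M_c = sigmoid(W(F_avg) + W(F_max))
        f_avg = f.mean(dim=(2, 3), keepdim=True)
        f_max = f.amax(dim=(2, 3), keepdim=True)
        m_c = torch.sigmoid(self.mlp(f_avg) + self.mlp(f_max))
        f1 = f * m_c
        # Spatial attention: M_s = sigmoid(conv([F'_avg ; F'_max]))
        s_avg = f1.mean(dim=1, keepdim=True)
        s_max = f1.amax(dim=1, keepdim=True)
        m_s = torch.sigmoid(self.spatial_conv(torch.cat([s_avg, s_max], dim=1)))
        return f1 * m_s
```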
Generally, shallow features contain more details and texture information. In the initial stage of feature extraction, the shallow features of forest and water elements are dominated by texture and color, especially the edge contours of water areas, which are very obvious when bordering land or forests. Therefore, applying CBAM to the first and second layers of the CCE-UNet main architecture encoder has the following advantages:
(1) Improving attention to detailed information: Details in the shallow features of forest and water images are very important for accurate edge detection and fine-grained segmentation. Applying CBAM to shallow features enhances the network's attention to detailed features, thereby improving edge feature extraction for forests and water bodies;
(2) Optimizing early feature extraction: In the initial stage of the network, the feature maps are larger and contain more spatial information. CBAM's spatial attention module can use this information to emphasize important areas and provide more salient features for subsequent layers;
(3) Improving feature representativeness and network robustness: Introducing CBAM at a shallow level helps the network effectively suppress unimportant background information and focus on the key features of forest and water elements.

2.5. Architecture of ECA Module Based on Deep Feature Application

After the CIFM module performs scene information fusion, the network acquires feature information at different scales. In order to further extract local feature information, the ECA attention mechanism is introduced in the fourth and fifth layers of the encoder of the CCE-UNet architecture to improve the network performance and efficiency and assist the CIFM module in integrating scene information. The ECA attention mechanism is a lightweight channel attention mechanism. Its main idea is to use 1 × 1 convolutional layers instead of fully connected layers to learn the dependencies between channels. As shown in Figure 6, ECA first performs global average pooling on the input feature map X to obtain a channel-dimensional feature vector X a v g , which represents the average activation of each channel.
where $X_{avg}$ can be expressed as follows:

$$X_{avg} = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} X(h, w)$$

Among them, $H$ and $W$ are the height and width of the feature map, respectively. Next, the dependencies between channels are captured through a one-dimensional convolution whose kernel size is determined by an adaptive function (which balances the number of channels against performance); that is, local correlation is captured along the channel dimension. This relationship is represented by $k = \mathrm{AdaptiveFunction}(C)$, where $C$ is the number of channels and $k$ is the resulting kernel size. Here, we set $k$ to 5. The convolution kernel is then applied to the one-dimensional feature vector $X_{avg}$ to obtain the attention weight $P$, expressed as $P = \mathrm{conv1d}(X_{avg}, k)$. The sigmoid activation function then converts the output of the convolution into the attention weight $\alpha = \sigma(P)$. Finally, the attention weight $\alpha$ is multiplied by the original feature map $X$ to obtain the weighted feature map $\tilde{X}$. As the network deepens, the deep features of forest and water images contain richer semantic information. Forests are often located in specific geographic settings, and the surrounding geographic features (e.g., mountains and rivers) and land use types provide important contextual information about the deeper characteristics of a forest. In addition, forest patches usually present irregular shapes in images, and their sizes can range from a few pixels to a large portion of the entire image, which is shape information unique to forests. The deep features of water bodies mainly comprise shape and size, edge contours, reflections, and shadows. In particular, under sunlight, the reflective surface of a water body produces specific lighting and shadow effects in the image.
Therefore, to extract deeper features of forests and water bodies, we applied the ECA module in the fourth layer of the model, using its guidance of spatial attention relationships to effectively help the network reduce its dependence on irrelevant features of forest and water body elements. ECA is applied after the CIFM module in the fifth layer, which can help the network to strengthen the channel relationship when fusing multi-scale features from CIFM and assist in extracting local features containing more details, such as scene information of forests, water bodies, etc. This improves calculation efficiency and segmentation accuracy.
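The ECA steps described above (global average pooling, a 1-D convolution across channels with k = 5, and sigmoid weighting) can be sketched as follows; this follows the standard ECA formulation and is not the authors' exact implementation.

```python
# Hedged ECA sketch matching the steps above; k = 5 as stated in the text.
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # X_avg: one value per channel from global average pooling
        y = x.mean(dim=(2, 3))                     # B x C
        # 1-D convolution captures local cross-channel interaction
        p = self.conv(y.unsqueeze(1)).squeeze(1)   # B x C
        alpha = torch.sigmoid(p)                   # channel attention weights
        return x * alpha.unsqueeze(-1).unsqueeze(-1)
```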

2.6. General Architecture of the CCE-UNet Model

The CCE-UNet architecture incorporates the newly designed CIFM and introduces the attention mechanism modules CBAM and ECA. CBAM is applied to the shallow features of the first and second layers of the encoder because shallow features usually contain rich spatial detail information but lack global semantic information. The CBAM attention mechanism can help the model pay more attention to detailed information related to the target and suppress irrelevant background information, thereby improving the expression ability of shallow features. ECA is applied to the deep features of the fourth and fifth layers (for the fifth layer, CIFM is applied first and then ECA) because deep features usually contain rich semantic information, but some spatial detail information may be lost. The ECA attention mechanism can help the model pay more attention to the semantic information related to the target and suppress irrelevant interference, thereby improving the expression ability of deep features.
CIFM is applied to the output end of the encoder to capture multi-scale contextual information by using multiple convolution kernels with different Dilation rates in parallel, thereby enhancing the model’s segmentation effect on targets of different sizes. This module receives the feature map of the fifth layer of the encoder as input, which contains high-level semantic information.
The specific process is shown in Figure 7. First, the input image is passed through the encoder for feature extraction. The encoder consists of five layers, each including a convolution layer and a pooling layer. CBAM is applied after feature extraction in the first and second layers, and ECA is applied after feature extraction in the fourth and fifth layers. Through this series of convolution and pooling operations, the image is gradually downsampled while the number of channels gradually increases. After feature extraction, we obtain five feature maps of different scales, whose channel counts grow to 64, 128, 256, 512, and 1024, respectively. We then apply the CIFM module to the deepest feature map, the output of the fifth encoder layer, to capture contextual information at different scales through multi-scale atrous convolution. The decoder consists of four layers, processed from bottom to top. We first upsample the fifth-layer feature map processed by CIFM and concatenate it with the fourth-layer feature map of the encoder to obtain the feature map of the first decoder layer (512 × 512 × 1024). We then upsample and concatenate it with the third-layer feature map of the encoder to obtain the feature map of the second decoder layer (256 × 256 × 512), upsample and concatenate with the second encoder layer to obtain the feature map of the third decoder layer (128 × 128 × 256), and finally upsample and concatenate with the first encoder layer to obtain the feature map of the fourth decoder layer (64 × 64 × 128). In this way, the decoder gradually restores the feature map to the size of the input image and fuses feature information of different scales. At the output layer, we perform a 1 × 1 convolution on the feature map of the fourth decoder layer. The entire decoder effectively restores the detailed information of the image through upsampling and concatenation operations, achieving accurate segmentation results.
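As a high-level illustration of how these modules could be wired into the encoder, the sketch below assumes the CIFM, CBAM, and ECA classes outlined in the previous subsections; the double-convolution block and the stage wiring are simplifications for illustration, not the authors' exact layer definitions.

```python
# Sketch of one encoder stage of the architecture described above; the double
# convolution block and the attention placement follow the text, other details are assumed.
import torch.nn as nn

class EncoderStage(nn.Module):
    """One encoder stage: double convolution, optional attention, then 2x2 max pooling."""
    def __init__(self, in_ch, out_ch, attention=None):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.attention = attention if attention is not None else nn.Identity()
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.attention(self.block(x))   # stored as the skip connection for the decoder
        return skip, self.pool(skip)

# Stages 1-2 would receive CBAM, stage 4 ECA, and stage 5 CIFM followed by ECA; the
# decoder then repeatedly upsamples, concatenates the stored skips, and convolves,
# ending with a 1 x 1 convolution over the final feature map.
```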

3. Results and Discussion

This section presents the semantic segmentation results for forests and water bodies on this dataset. Table 1 lists the experimental environment for the training and testing phases.
Table 2 shows the experimental environment parameter configuration.

3.1. Assessment of Indicators

Evaluation Index: In order to demonstrate the performance of the CCE-UNet model in detail, MPA, MRecall, F1Score, and Mean Intersection over Union (MIoU) were selected as evaluation indices for comprehensive analysis. The calculation formulas for these indicators are as follows:
The MPA is actually a weighted average of precision, where the precision of each category is weighted according to the recall of that category.
$$\mathrm{MPA} = \frac{\sum_{i=1}^{C} \mathrm{Precision}_i \times \mathrm{Recall}_i}{\sum_{i=1}^{C} \mathrm{Recall}_i}$$

Among them, $C$ is the number of categories and $\mathrm{Precision}_i$ is the precision of the $i$-th category.
The average recall is the average of the predicted recalls for each category. It measures the model’s ability to identify target areas. Similar to Precision, the closer its value is to 1, the better the model’s performance. MRecall is actually a weighted average of recall, where the recall of each category is weighted according to the false positive rate ( F P R ) of that category.
$$\mathrm{MRecall} = \frac{\sum_{i=1}^{C} \mathrm{Recall}_i \times (1 - \mathrm{FPR}_i)}{\sum_{i=1}^{C} (1 - \mathrm{FPR}_i)}$$
The F1 Score is the harmonic mean of Precision and Recall.

$$F1\;\mathrm{Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The Mean Intersection over Union (MIoU) is one of the important indicators for evaluating the performance of image segmentation models. It intuitively reflects the degree of overlap between the segmentation results and the ground truth and is an important measure of the precision of a segmentation model.

$$\mathrm{MIoU} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{IoU}_i$$

Among them, $C$ is the total number of categories and $\mathrm{IoU}_i$ is the IoU of the $i$-th category, computed as $\mathrm{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i}$, where:
$TP_i$: true positives of the $i$-th class, i.e., the number of pixels predicted as class $i$ whose true class is also class $i$.
$FP_i$: false positives of the $i$-th class, i.e., the number of pixels predicted as class $i$ whose true class is not class $i$.
$FN_i$: false negatives of the $i$-th class, i.e., the number of pixels whose true class is class $i$ but that were not predicted as class $i$.
The value of MIoU ranges from 0 to 1. A higher value indicates a greater overlap between the segmentation result and the ground truth and, thus, a better segmentation performance.
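As an illustration, per-class IoU and MIoU can be computed from a confusion matrix accumulated over the test set, following the TP/FP/FN definitions above; this sketch assumes integer label maps and is not tied to the authors' evaluation code.

```python
# Illustrative computation of per-class IoU and MIoU from a confusion matrix,
# following the TP/FP/FN definitions above; array shapes and class count are assumptions.
import numpy as np

def confusion_matrix(pred, target, num_classes=3):
    """Accumulate a num_classes x num_classes confusion matrix from integer label maps."""
    mask = (target >= 0) & (target < num_classes)
    idx = num_classes * target[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(cm):
    """IoU_i = TP_i / (TP_i + FP_i + FN_i); MIoU is the unweighted mean over classes."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as class i but actually another class
    fn = cm.sum(axis=1) - tp   # actually class i but predicted as another class
    iou = tp / np.maximum(tp + fp + fn, 1)
    return iou.mean(), iou
```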
The Edge F1 Score is a metric used to evaluate the performance of image segmentation models in edge detection tasks. It is a variant of the F1 Score, specifically designed to measure the accuracy of a model in identifying object edges in images.

$$\mathrm{Edge\;F1\;Score} = \frac{2 \times TP_{edge}}{2 \times TP_{edge} + FP_{edge} + FN_{edge}}$$

$TP_{edge}$: the number of pixels on the boundary between the pair of categories that were correctly identified.
$FP_{edge}$: the number of pixels incorrectly identified as boundary pixels that do not actually lie on the boundary between the pair of categories.
$FN_{edge}$: the number of pixels that belong to the boundary between the pair of categories but were not correctly identified.

3.2. Loss Functions

The aim of forest segmentation is to classify each pixel into one of three categories, forest, water body, or surrounding background, based on its location. Therefore, this study introduced three loss functions to train the model. The first is Cross-Entropy Loss, which is a loss function commonly used in multi-classification. It measures the difference between the probability distribution predicted by the model and the probability distribution of the true label. The formula is as follows:
$$\mathrm{Cross\text{-}Entropy\;Loss} = -\sum_{i} y_i \log \hat{y}_i$$

where $y_i$ is the true label (0 or 1) and $\hat{y}_i$ is the probability predicted by the model.
The second is Focal Loss, which is a variant of Cross-Entropy Loss. It adjusts the weight of positive samples during the training process to solve the problem of sample imbalance. The formula is as follows:
$$\mathrm{Focal\;Loss} = -\alpha_t (1 - \hat{y}_t)^{\gamma} \log \hat{y}_t$$

Among them, $\alpha_t$ is the weight of positive samples, $\gamma$ is the exponent that adjusts the weighting of hard and easy samples, and $\hat{y}_t$ is the model's predicted probability for the true label.
The last one is Dice Loss, which is a measure based on area overlap and is particularly suitable for dealing with class imbalance problems. The formula is as follows:
$$\mathrm{Dice\;Loss} = 1 - \frac{2 \times |A \cap B|}{|A| + |B|}$$

Among them, $A$ and $B$ represent the predicted region and the ground-truth region, respectively.
In this study, after extensive testing and data comparisons, we selected the combination of Cross-Entropy Loss and Dice Loss for optimal performance. Cross-Entropy Loss helps the model learn better classification boundaries and, with sufficient sample size, effectively measures the model's classification ability for each category. Dice Loss focuses on the degree of overlap between the predicted region and the ground-truth region, which helps the model pay better attention to hard-to-classify samples during training, especially in the case of sample imbalance. The data in this study are mainly forest and water elements; the forest class in particular accounts for more than 40% of the pixels. In forest images based on remote sensing imagery, unclear boundaries have long been a problem, and the small proportion of water pixels in the dataset also presents an imbalanced sample. Combining these two loss functions therefore strikes a balance between classification accuracy and boundary matching accuracy: Cross-Entropy Loss helps the model learn better classification boundaries, while Dice Loss helps the model focus on hard-to-classify and imbalanced samples. This combination improves the model's performance on difficult samples, thereby improving its generalization ability and robustness.
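A hedged sketch of the Cross-Entropy plus Dice combination described above is shown below; the equal weighting of the two terms and the smoothing constant are assumptions, not values reported in the paper.

```python
# Sketch of the Cross-Entropy + Dice loss combination selected above; the equal
# weighting of the two terms and the smoothing constant eps are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CEDiceLoss(nn.Module):
    def __init__(self, num_classes=3, dice_weight=1.0, eps=1e-6):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.num_classes = num_classes
        self.dice_weight = dice_weight
        self.eps = eps

    def forward(self, logits, target):
        # logits: B x C x H x W; target: B x H x W with integer class indices
        ce = self.ce(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, self.num_classes).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(2, 3))
        union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
        dice = 1 - (2 * inter + self.eps) / (union + self.eps)   # soft Dice per class
        return ce + self.dice_weight * dice.mean()
```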

3.3. Parameter Settings and Ablation Experiments

To achieve better segmentation results, in the early stage of the experiment, we selected DeeplabV3+ (Backbone: MobilenetV2), Unet (Backbone: VGG), UNet (Backbone: ResNet50), Swin-Transformer, Segformer, Swin-UNet, TransUNet, and Swin-TransUNet for basic experimental comparison. Due to the excellent performance of Unet (Backbone: VGG), we selected it as the baseline model of this study, conducted a series of debugging and improvements on it, and finally proposed CCE-UNet based on the combination of CIFM and dual attention mechanisms. Therefore, we will discuss various improvements and tests conducted during the experiment.
First, we discuss the CIFM module. The core of the CIFM module involves introducing Non-Local attention in the data input stage to obtain global context information, performing feature extraction at different scales through atrous convolution to further capture context information, using global max pooling to enhance the expression of contextual information, and applying Weight Norm to normalize the weight vector, thereby improving the stability of network training. Therefore, to verify the superiority of CIFM, we compared it against SPP, whose principle is similar, and against the full family of ASPP modules. As shown in Table 3, ASPP v2 implements multi-scale analysis using multiple parallel dilated convolutions; the use of multi-scale blocks can effectively capture the multi-scale context information of objects. ASPP v3 adds low-dimensional down-sampling channels on the basis of the previous generation to increase the integration of low-level features, thereby improving the utilization efficiency of low-level feature information. ASPP v3+ uses a spatial pyramid pooling upsampling module to fuse full-channel features, improving the upsampling effect. LR-ASPP is a lightweight design based on the first-generation version, which significantly reduces computational complexity while being more suitable for mobile networks. These modules have produced remarkable results in previous semantic segmentation tasks. We integrated these modules into the fifth layer of the baseline model encoder, which processes the final output of the encoder part. The reason for integrating the modules at this stage is that the deepest features contain the richest information, which allows these modules to be exploited to their full potential.
As shown in Table 4, the experimental results indicate that the CIFM module performs the best, with an improvement of 1.18% in MIoU. After the baseline model was equipped with the ASPP v2, ASPP v3, and LR-ASPP modules, respectively, model accuracy increased, but the differences among them were minimal. However, the introduction of SPP and ASPP v3+ resulted in negative growth in the baseline model's performance. Based on this, we offer the following conjectures. The SPP module itself has limitations, such as a tendency to over-smooth during pyramid pooling, leading to the loss of important detail information and consequently degrading the model's performance. The case of ASPP v3+ warrants deeper consideration. Firstly, the integration of the BN module within ASPP can aid the convergence and stability of the model, thereby improving training speed and generalization ability, which is verified by ASPP v2. The introduction of depthwise separable convolutions (DSCs) can effectively reduce the number of parameters and computational complexity, maintaining or even improving the model's performance while reducing model complexity. However, the combination of DSCs and BN can easily lead to issues with feature normalization, causing redundant computation of related features. Particularly for high-resolution remote sensing images rich in semantic information, DSCs operate on a single channel, resulting in sparse feature maps and causing instability during normalization. One way to improve this is to introduce a more suitable normalization method to avoid the conflict between DSCs and BN. Therefore, in the design of CIFM, we added weight normalization to normalize the weight vectors and introduced Non-Local attention to obtain global context information. The final experimental results also show that the performance of CIFM is superior to the aforementioned feature extraction modules.
In the initial round of testing, the dilation rates of the atrous convolutions in the multi-scale feature fusion module defaulted to 6, 12, and 18, and we verified that CIFM achieved the best results on the baseline model. Next, by adjusting the dilation rates, we tested the performance of the baseline model under 16 groups of strategies. Since some of the performance differences were too small to reveal a trend, we filtered out strategies whose MIoU differed from the others by less than 0.15% and retained only those showing significant performance differences. As shown in Table 5, when the dilation rates of the atrous convolutions are 4, 8, and 12, CIFM achieves a further performance improvement on the baseline model, with MIoU improving by 1.92% compared to the baseline model.
In order to verify the performance improvement brought by the dual attention mechanism to the CCE-UNet model, this study selected different combination strategies for comparative experiments. Considering the architecture of the baseline model, experiments applied different attention mechanisms at various positions of the encoder according to their characteristics. It should be noted that the third layer of the encoder usually contains the transition from low-level to high-level features, which have already undergone preliminary abstraction and combination. Introducing the attention mechanism at this stage may cause the network to focus excessively on local features and ignore the global context of the entire feature map. Additionally, the attention mechanism is usually used to emphasize certain feature areas, while the main work of the decoder stage is pixel-level fusion and upsampling. Therefore, the role of the attention mechanism in the decoder may not be as obvious as in other stages, and the impact on the final segmentation result will be relatively small. Thus, comparative experiments will be carried out on layers 1, 2, 4, and 5 of the encoder. As shown in Table 6, the MPA and MIoU of the optimal strategy improved by 2.11% and 3.4%, respectively, compared to the baseline model. Consequently, the combined application of CBAM and ECA can effectively improve segmentation accuracy.
To verify the contribution of each enhancement module in the CCE-UNet model, the following ablation experiments were conducted. Table 7 shows the semantic segmentation performance for forests and water bodies under different model architectures. Model 0 represents the Unet baseline model whose backbone network is VGG. Model 1 applies the CIFM module to the baseline model. Model 2 applies both the CBAM and ECA attention mechanisms. Model 3 applies both CIFM and CBAM. Model 4 applies both CIFM and ECA. Model 5 includes all the enhancement modules and represents the CCE-UNet architecture proposed in this work. The CCE-UNet, integrating all three modules, achieves the best semantic segmentation performance, with an F1 Score of 95.12% and an MIoU of 91.07%. In the CCE-UNet model, each module complements the others without conflict, further improving semantic segmentation performance.
In the process of model training, the choice of optimizer and learning rate strategy is also key to model performance and generalization ability. To this end, we evaluated two optimizers, SGD and Adam, as well as two learning rate strategies, Cosine Annealing and Step Decay. During the experiments, we found that when using the Adam optimizer, the fluctuations in the loss curve of the CCE-UNet model were significantly smaller than with the SGD optimizer. However, when the Adam optimizer was combined with the Cosine Annealing strategy, the fluctuations reappeared; with Step Decay, the overfitting issue was eliminated. Therefore, after comparative tests of the different strategies, as shown in Table 8, the CCE-UNet model achieves the best performance when using the Adam optimizer with the Step Decay learning rate strategy.
Based on the current computational resources of the experimental platform, we conducted comparative experiments on the forest and water body datasets with the same batch size configuration (batch size = 8) to fairly compare the performance of each semantic segmentation model. We selected more lightweight architectures, such as DeeplabV3+ and Swin-Unet, as well as the Transformer series architectures that have demonstrated strong computational power in the field of semantic segmentation. Table 9 shows the backbone networks and basic parameters of each semantic segmentation model.
The experimental results indicate that the CCE-UNet architecture exhibits the best semantic segmentation performance on the forest and water body dataset. As shown in Table 10, its MIoU improved by 4.34% compared to the baseline model and by 11.31% compared to DeeplabV3+. This is mainly due to the CIFM module, designed around contextual information fusion, and the dual attention mechanism strategy applied at different stages of feature extraction. The information complementarity and reinforcement among these three modules enable the CCE-UNet to accurately segment forest and water body pixels in the overall task, thereby assisting in calculating forest coverage and water body coverage rates. In the past, following large-scale deforestation or various disasters, monitoring the recovery of forest vegetation typically required manual on-site surveys to collect local forest data, followed by prediction through time-series methods, which incurred significant human and time costs. By utilizing satellite remote sensing data and the CCE-UNet architecture, the coverage of forests and water bodies can be rapidly extracted, greatly reducing the consumption of human resources. This also provides an important technical foundation for post-disaster ecosystem recovery monitoring in the Nattai Forest Reserve.
In the face of remote sensing images with complex forest and water body boundaries, we additionally introduced the Edge F1 Score to compare the edge segmentation performance of the various models. We obtained the prediction maps of the forest and water body dataset from all models in batches. Then, using the Canny algorithm, we derived the Edge F1 Score for each category by comparing the ground truth with the prediction maps. As shown in Table 11, each category corresponds to the Edge F1 Score of one boundary scene: Category 1 is forest and background, Category 2 is water body and background, and Category 3 is forest and water body; a comprehensive Edge F1 Score is also reported for each model. Although the other models already demonstrate good edge segmentation performance, the CCE-UNet architecture outperforms them.
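The Canny-based edge comparison described above can be sketched as follows; the Canny thresholds, the per-class (rather than per-category-pair) masks, and the boundary-matching tolerance are assumptions rather than the authors' settings.

```python
# Hedged sketch of the Canny-based Edge F1 computation described above. The Canny
# thresholds, the per-class binary masks, and the 2-pixel matching tolerance are assumptions.
import cv2
import numpy as np

def edge_f1(gt_labels, pred_labels, class_id, tolerance=2):
    """Edge F1 for one class: compare Canny edges of ground-truth and predicted masks."""
    gt_mask = (gt_labels == class_id).astype(np.uint8) * 255
    pr_mask = (pred_labels == class_id).astype(np.uint8) * 255
    gt_edges = cv2.Canny(gt_mask, 100, 200) > 0
    pr_edges = cv2.Canny(pr_mask, 100, 200) > 0
    # Allow a small spatial tolerance when matching boundary pixels
    kernel = np.ones((2 * tolerance + 1, 2 * tolerance + 1), np.uint8)
    gt_near = cv2.dilate(gt_edges.astype(np.uint8), kernel) > 0
    pr_near = cv2.dilate(pr_edges.astype(np.uint8), kernel) > 0
    tp = np.logical_and(pr_edges, gt_near).sum()
    fp = np.logical_and(pr_edges, ~gt_near).sum()
    fn = np.logical_and(gt_edges, ~pr_near).sum()
    return 2 * tp / max(2 * tp + fp + fn, 1)
```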

3.4. Visualization of Segmentation Results

When the forest or water cover is too high or too low, the visualization cannot clearly show the differences between models. Therefore, the experiment selected scenes with a relatively balanced proportion of forests, water bodies, and background to visualize the segmentation results. In these figures, the red pixel areas are forest elements, the green pixel areas are water elements, and the regions marked by blue rectangular boxes are incorrectly segmented elements. Figure 8 compares the visualization results of CCE-UNet and CNN-based image segmentation networks. Among them, the baseline model misidentified many forest elements as water elements, DeeplabV3+ lost a large number of forest elements, and CCE-UNet stood out with its precise segmentation.
Figure 9 presents a visual comparison between CCE-UNet, Transformer-based variants, and hybrid Transformer and CNN architectures. The Swin-T architecture, despite its strong computational power, incorrectly identified some forest areas as water bodies, and the TransUNet architecture lost a significant amount of boundary information when segmenting water body elements. The lightweight Segformer and Swin-Unet each lost some forest elements to varying degrees. In comparison, the CCE-UNet architecture demonstrates clear advantages in segmentation performance on the forest and water body remote sensing dataset. In particular, it has a strong ability to capture the boundary characteristics between forests and water bodies, and its segmentation results are closest to the ground truth.

4. Discussion

To accurately and quickly extract the coverage information of forests and water bodies for monitoring the recovery of forest ecosystems after a disaster, we propose a semantic segmentation model based on situational information fusion and a dual attention mechanism. The model designs a feature enhancement module CIFM based on context information fusion. It acquires rich semantic information through the non-local algorithm and multi-scale feature fusion and uses weight normalization to suppress the issue of key feature reuse. This effectively enhances the boundary feature information of forests and water bodies, improving the robustness and generalization performance of the network. Moreover, to further strengthen the feature extraction of forests and water bodies, we have adopted a dual attention mechanism strategy. CBAM, which combines channel attention and spatial attention, is capable of capturing important features of images at different scales. Applying it in the early stage of feature extraction can effectively enhance the capability of feature representation and highlight the importance of detailed information, especially in the edge feature extraction work of forests and water bodies. The introduction of ECA in the subsequent deep feature extraction stage is to utilize its guidance on spatial attention relationships, effectively suppressing the reliance on irrelevant features within forest and water elements. It helps the network to strengthen channel relationships when fusing multi-scale features from CIFM and assists in extracting local features that contain more details, such as the scene information of forests and water bodies, thereby improving computational efficiency. The combination of CBAM and ECA effectively enhances network performance in the feature extraction stage. To verify the effectiveness of all modules in improving model performance, we selected deep-learning models that have performed well in semantic segmentation tasks in recent years and are widely used for comparative experiments. We trained each semantic segmentation model with the same dataset and used the same set of model performance evaluation metrics. The proposed method was evaluated fairly and objectively in terms of objective accuracy and subjective visual assessment. In the comprehensive performance evaluation comparison, the CCE-UNet architecture showed superior performance in the segmentation of forests and water bodies.
Therefore, we believe that the CCE-UNet model can be applied to similar multi-variable classification remote sensing image segmentation tasks, such as forest coverage measurement, water body feature extraction, and vegetation coverage monitoring. However, there are still some limitations to the study. For example, in the process of creating the dataset, using ARCGIS to label images requires a significant amount of time and labor. In addition, despite our unified labeling principles, there is still a certain degree of error in visual interpretation. In subsequent research, we need to further expand the dataset by trying to select high-definition remote sensing images from different regions or different satellites to train the model, thereby further improving the generalization ability of the network. As for the data labeling work, we will continue to learn relevant professional knowledge and try to invite experts in the field to verify the labels. We believe that by using more advanced technical means to re-label the dataset or using higher-resolution remote sensing image datasets, higher performance can be achieved through the method proposed in this study. Moreover, our classification of forest features primarily focuses on these two key objects: forests and water bodies. The segmented data can be applied to the forest ecological monitoring system, providing an important technical basis for forest vegetation monitoring and forest water storage monitoring, which is also part of the research content for future work. To detect forest coverage and water body coverage, we need to obtain clearer remote sensing images, accumulate more knowledge about forest vegetation monitoring and water resource monitoring, and then conduct more in-depth research on the restoration of the forest ecosystem. Despite these limitations, the CCE-UNet proposed in this study has shown promising results, and we believe it will play an important role in future forest ecosystem monitoring efforts.

5. Conclusions

In this study, we propose a semantic segmentation method for forests and water bodies based on high-resolution remote sensing images obtained through the GEE platform’s open-source imagery data to achieve large-scale remote sensing identification of forests and water bodies. Firstly, we constructed a forest and water body semantic segmentation dataset using Google high-definition imagery of the Nattai Nature Reserve area in New South Wales, Australia. Subsequently, we developed a deep learning-based forest and water body semantic segmentation model named CCE-UNet. Moreover, we designed a multi-scale feature fusion module called CIFM for this architecture and introduced a dual attention mechanism strategy with CBAM and ECA. After optimizing the loss function, we achieved optimal performance. Through continuous debugging and training, we achieved excellent performance on the forest and water body dataset, with an F1 score of 95.12% and MIoU of 91.07%. We compared it with seven existing mainstream semantic segmentation models. The experimental results indicate that our model outperforms the existing mainstream models in the task of segmenting forests and water bodies based on remote sensing images.
However, our proposed method is unable to further separate the overlapping parts at the edges in the semantic segmentation results. Considering the limitations of the current dataset’s volume, we plan to further expand the data in future research and include forest and water body images from different regions, enhancing the feature extraction capability of the feature backbone for overlapping shadow parts to achieve more accurate coverage extraction of forests and water bodies. Additionally, we intend to further refine the classification of tree species and water bodies and conduct subsequent research on canopy extraction and small water body feature extraction. Finally, we hope that it will play a significant role in future forest ecosystem monitoring efforts.

Author Contributions

B.H.: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data Curation, Writing—Original Draft. X.Y.: Conceptualization, Formal analysis, Investigation, Writing—Review and Editing, Visualization, Supervision, Project administration. L.M.: Methodology, Formal analysis, Investigation, Visualization. G.W.: Methodology, Formal analysis, Investigation, Visualization. P.W.: Methodology, Supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

This study utilized remote sensing data obtained through the Google Earth Engine (GEE) platform, including Landsat and Sentinel-2 satellite imagery provided by Google Earth. These remote sensing data are publicly accessible on the GEE platform. Researchers interested in accessing and utilizing these data can do so by visiting the GEE platform (https://earthengine.google.com/, accessed on 16 July 2024) and exploring the Sentinel datasets. The custom remote sensing semantic segmentation dataset created for this study is based on the aforementioned raw data. However, due to data processing and copyright restrictions, this dataset is not publicly available. If you are interested in the dataset from this study and wish to engage in academic exchange or collaboration, please contact Bangjun Huang via the following email: [email protected]. We would be happy to discuss the possibility of data sharing, provided it is in accordance with relevant research ethics and copyright regulations.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

Figure 1. Study area in Nattai National Park, New South Wales, Australia.
Figure 2. Example map of forest coverage in partially disaster-affected areas.
Figure 3. Examples of data augmentation.
Figure 4. Architecture of Contextual Information Fusion (CIFM) Module.
Figure 5. Architecture of a CBAM module.
Figure 6. Architecture of an ECA module.
Figure 7. Architecture of CCE-UNet.
Figure 8. Comparison of visualization segmentation effects (taking CNN as an example).
Figure 9. Comparison of visualization segmentation effects (using Transformer models and their variants as examples).
Table 1. Experimental environments.
Environment | Type
Operating System | Windows 11
Framework | PyTorch 2.0.0 and CUDA 12.1.66
Language | Python 3.11
CPU | AMD Ryzen 9 5900HX with Radeon Graphics
GPU | GeForce RTX 3070
Table 2. Training configuration.
Configuration | Value
Batch Size | 8
Optimizer | Adam
Learning Rate | 5 × 10⁻⁴
Upsampling | Bilinear
Epochs | 200
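The settings in Table 2 correspond to a conventional PyTorch training loop. The sketch below illustrates how such a configuration could be wired up; the toy model and random tensors are placeholders standing in for CCE-UNet and the Nattai dataset, and the snippet is not our released training script.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins so the sketch runs end to end: a toy model and random data
# replace CCE-UNet and the Nattai dataset, which are not reproduced here.
num_classes = 3                                       # illustrative class count
model = nn.Conv2d(3, num_classes, kernel_size=3, padding=1)
images = torch.randn(32, 3, 64, 64)
masks = torch.randint(0, num_classes, (32, 64, 64))
loader = DataLoader(TensorDataset(images, masks), batch_size=8, shuffle=True)  # batch size 8

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=5e-4)   # Adam, learning rate 5e-4 (Table 2)

for epoch in range(2):                                # Table 2 uses 200 epochs; 2 keeps the demo short
    for x, y in loader:
        logits = model(x)                             # the real decoder upsamples bilinearly
        loss = criterion(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```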
Table 3. Comparison of different versions of ASPP.
ASPP Version | Module Core Components
ASPP v2 | Atrous Convolution + SPP
ASPP v3 | Atrous Convolution + SPP + BN
ASPP v3+ | Atrous Convolution + SPP + BN + DESC
LR-ASPP | Lightweight ASPP_v2
Table 4. Performance comparison of feature enhancement modules.
Feature Enhancement Module | MPA (%) | MRecall (%) | MIoU (%)
Baseline model | 92.18 | 92.02 | 86.73
SPP | 91.77 | 90.5 | 85.73
ASPP v2 | 92.31 | 92.24 | 86.92
ASPP v3 | 92.56 | 92.44 | 87.03
ASPP v3+ | 91.99 | 91.89 | 86.32
LR-ASPP | 92.25 | 92.11 | 86.88
CIFM | 93.44 | 93.02 | 87.91
Table 5. Performance baseline of the CIFM module using different dilation rates.
CIFM | Dilations | MPA (%) | MRecall (%) | MIoU (%)
0 | 4, 6, 8 | 92.46 | 92.41 | 87.26
1 | 2, 6, 12 | 92.03 | 91.89 | 87.03
2 | 4, 8, 10 | 93.28 | 92.89 | 87.84
3 | 6, 12, 18 | 93.44 | 93.02 | 87.91
Ours | 4, 8, 12 | 93.89 | 93.62 | 88.65
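The best-performing setting in Table 5 uses dilation rates of 4, 8, and 12. As a rough, illustrative sketch only (the actual CIFM design is shown in Figure 4 and is not reproduced here), parallel atrous branches with those rates can be expressed in PyTorch as follows; the module and layer names are our own.

```python
import torch
from torch import nn

class DilatedBranches(nn.Module):
    """Parallel 3x3 atrous convolutions at the dilation rates selected in Table 5.
    This is an illustrative stand-in, not the published CIFM implementation."""
    def __init__(self, in_ch, out_ch, rates=(4, 8, 12)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # Each branch sees a different receptive field; concatenation fuses the scales.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

y = DilatedBranches(64, 64)(torch.randn(1, 64, 128, 128))  # -> (1, 64, 128, 128)
```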
Table 6. Performance comparison of different insertion strategies for the CBAM and ECA attention modules.
Model/Encoder | Layer 1 | Layer 2 | Layer 4 | Layer 5 | MPA (%) | MIoU (%)
Baseline | – | – | – | – | 92.18 | 86.73
CBAM-1 | ✓ | – | – | – | 92.39 | 87.01
CBAM-2 | – | ✓ | – | – | 92.43 | 87.16
CBAM-1,2 | ✓ | ✓ | – | – | 93.25 | 89.69
ECA-4 | – | – | ✓ | – | 91.29 | 86.85
ECA-5 | – | – | – | ✓ | 92.42 | 87.11
ECA-4,5 | – | – | ✓ | ✓ | 92.89 | 88.64
CBAM-1+ECA-4,5 | ✓ | – | ✓ | ✓ | 93.65 | 90.01
CBAM-2+ECA-4,5 | – | ✓ | ✓ | ✓ | 93.01 | 89.45
CBAM-1,2+ECA-4 | ✓ | ✓ | ✓ | – | 93.29 | 89.67
CBAM-1,2+ECA-5 | ✓ | ✓ | – | ✓ | 94.06 | 90.04
Ours | ✓ | ✓ | ✓ | ✓ | 94.29 | 90.13
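Because ECA is the lighter of the two attention modules compared in Table 6, we illustrate it with a short PyTorch sketch below. It follows the standard ECA formulation (global average pooling followed by a one-dimensional convolution across channels) with a fixed kernel size of 3, which is a simplifying assumption; the layer indices in the table refer to encoder stages at which such a block would be attached.

```python
import torch
from torch import nn

class ECA(nn.Module):
    """Efficient Channel Attention: a 1-D convolution over the pooled channel
    descriptor, following the standard ECA design (kernel size fixed at 3 here)."""
    def __init__(self, k_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)

    def forward(self, x):
        w = self.pool(x)                                  # (B, C, 1, 1) channel descriptor
        w = self.conv(w.squeeze(-1).transpose(1, 2))      # 1-D conv across channels
        w = torch.sigmoid(w.transpose(1, 2).unsqueeze(-1))
        return x * w                                      # reweight channels

feat = torch.randn(2, 256, 32, 32)                        # e.g. a deep encoder stage (Layer 4/5)
out = ECA()(feat)                                         # same shape, channel-reweighted
```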
Table 7. Model performance comparison for different combinations of the enhancement modules (CIFM, CBAM, and ECA).
Model | MPA (%) | MRecall (%) | F1 Score (%) | MIoU (%)
0 | 92.18 | 92.02 | 92.54 | 86.73
1 | 93.89 | 93.62 | 93.57 | 88.65
2 | 93.25 | 92.68 | 92.96 | 87.26
3 | 93.77 | 94.15 | 93.98 | 89.31
4 | 94.01 | 93.62 | 93.81 | 88.35
Ours | 95.15 | 95.09 | 95.12 | 91.07
Table 8. Performance baselines for different optimizers and learning rate strategies.
Optimizer | Learning Rate Scheduling Policy | MPA (%) | MIoU (%)
SGD | Cosine Annealing | 93.56 | 88.61
SGD | Step Decay | 93.99 | 89.46
Adam | Cosine Annealing | 94.78 | 89.95
Adam | Step Decay (Ours) | 95.15 | 91.07
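The two learning-rate policies compared in Table 8 map directly onto PyTorch's built-in schedulers. The sketch below shows how either could be attached to the Adam optimizer; the step size and decay factor are illustrative placeholders, as their exact values are not listed here.

```python
from torch import nn, optim

model = nn.Conv2d(3, 3, 3)                       # stand-in module
optimizer = optim.Adam(model.parameters(), lr=5e-4)

# Step decay (the configuration selected in Table 8): drop the LR by a fixed
# factor every few epochs. step_size and gamma are illustrative values.
step_sched = optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

# Cosine annealing (the alternative compared in Table 8): decay the LR along
# a cosine curve over the 200-epoch schedule.
cosine_sched = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

# In training, one of the schedulers would be stepped once per epoch, e.g.:
# for epoch in range(200):
#     train_one_epoch(...)
#     step_sched.step()
```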
Table 9. The number of parameters and FLOPs for each semantic segmentation model.
Model | Backbone | Params (M) | FLOPs (G)
Unet | VGG | 24.891 | 26.685
Unet | ResNet | 18.971 | 12.281
DeeplabV3+ | MobileNet-v2 | 5.814 | 13.219
TransUNet | R50-ViT-B_16 | 114.415 | 81.837
Swin-Transformer | Swin-T | 141.351 | 21.653
Swin-TransUNet | Swin-T | 41.342 | 91.497
Swin-UNet | Swin-T | 27.146 | 11.831
Segformer | R50-ViT-B_16 | 31.662 | 27.349
CCE-UNet | VGG | 30.930 | 28.995
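Parameter counts of the kind listed in Table 9 can be reproduced directly from a model instance, whereas FLOPs require a profiling tool (for example, the third-party thop or fvcore packages), which we only mention rather than demonstrate. The snippet below counts trainable parameters for a stand-in module; substituting the actual segmentation model would yield the values in the table.

```python
from torch import nn

# Stand-in module; replace with the actual segmentation model to reproduce Table 9.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params / 1e6:.3f} M")  # Table 9 reports values in millions
```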
Table 10. Semantic segmentation performance of each model, with batch size = 8.
Model | MPA (%) | MRecall (%) | F1 Score (%) | MIoU (%)
Unet (VGG) | 92.18 | 92.02 | 92.54 | 86.73
Unet (ResNet) | 92.06 | 91.88 | 92.39 | 86.66
DeeplabV3+ | 89.92 | 84.12 | 88.74 | 79.76
TransUNet | 92.09 | 91.23 | 92.37 | 86.25
Swin-Transformer | 90.49 | 90.68 | 90.79 | 83.28
Swin-TransUNet | 91.17 | 90.89 | 91.02 | 84.59
Swin-UNet | 91.99 | 91.39 | 91.68 | 86.05
Segformer | 92.11 | 91.96 | 92.03 | 86.62
CCE-UNet | 95.15 | 95.09 | 95.12 | 91.07
Table 11. Comparison of boundary F1 score performance.
Model | Category 1 | Category 2 | Category 3 | Edge F1 Score (%)
Unet (VGG) | 90.45 | 90.24 | 89.91 | 90.02
Unet (ResNet) | 89.45 | 88.62 | 87.25 | 88.44
DeeplabV3+ | 86.32 | 85.49 | 82.46 | 84.76
TransUNet | 88.66 | 87.43 | 85.75 | 87.28
Swin-Transformer | 89.10 | 89.59 | 88.67 | 89.12
Swin-TransUNet | 87.72 | 90.30 | 88.95 | 88.99
Swin-UNet | 89.41 | 89.66 | 87.33 | 88.80
Segformer | 89.27 | 91.18 | 89.48 | 89.98
CCE-UNet | 92.17 | 95.16 | 93.16 | 93.50
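Table 11 evaluates how faithfully each model recovers class boundaries. A common way to compute a boundary F1 score is to extract thin boundary masks from the prediction and the label and match them within a small pixel tolerance. The sketch below uses SciPy morphology with a 2-pixel tolerance; the tolerance and the morphological boundary definition are our assumptions here and may differ from the exact protocol used to produce the table.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def boundary_f1(pred, label, cls, tol=2):
    """Boundary F1 for one class: boundaries are the pixels removed by erosion,
    and a boundary pixel counts as matched if the other map has a boundary
    pixel within `tol` pixels (approximated here by dilation)."""
    def boundary(mask):
        return mask & ~binary_erosion(mask, iterations=1)

    bp = boundary(pred == cls)                 # predicted boundary pixels
    bl = boundary(label == cls)                # ground-truth boundary pixels
    bl_tol = binary_dilation(bl, iterations=tol)
    bp_tol = binary_dilation(bp, iterations=tol)

    precision = (bp & bl_tol).sum() / max(bp.sum(), 1)
    recall = (bl & bp_tol).sum() / max(bl.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

pred = np.random.randint(0, 3, (128, 128))
label = np.random.randint(0, 3, (128, 128))
print(boundary_f1(pred, label, cls=1))
```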