3.1. Assessment of Indicators
Evaluation Indices: To demonstrate the performance of the CCE-UNet model in detail, Mean Pixel Accuracy (MPA), Mean Recall (MRecall), F1 Score, and Mean Intersection over Union (MIoU) were selected as evaluation indices for comprehensive analysis. The calculation formulas for these indicators are as follows:
The MPA is a weighted average of precision, where the precision of each category is weighted according to the recall of that category:

MPA = \frac{1}{N} \sum_{i=1}^{N} P_i

where $N$ is the number of categories and $P_i$ is the accuracy of the $i$-th category.
The MRecall is the average of the per-category recalls. It measures the model's ability to identify target areas; similar to precision, the closer its value is to 1, the better the model's performance. MRecall is a weighted average of recall, where the recall of each category is weighted according to the false positive rate (FPR) of that category:

MRecall = \frac{1}{N} \sum_{i=1}^{N} R_i

where $R_i$ is the recall of the $i$-th category.
The F1 Score is the harmonic mean of precision and recall:

F1\ Score = \frac{2 \times Precision \times Recall}{Precision + Recall}
The Mean Intersection over Union (MIoU) is one of the most important indicators for evaluating the performance of image segmentation models. It intuitively reflects the degree of overlap between the segmentation results and the ground truth and is a key measure of the precision of a segmentation model.
MIoU = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i}

where $N$ is the total number of categories and $TP_i / (TP_i + FP_i + FN_i)$ is the IoU of the $i$-th category.
$TP_i$: true positives of the $i$-th class, i.e., the number of pixels predicted to be class $i$ whose true class is also class $i$.
$FP_i$: false positives of the $i$-th class, i.e., the number of pixels predicted to be class $i$ whose true class is not class $i$.
$FN_i$: false negatives of the $i$-th class, i.e., the number of pixels whose true class is class $i$ but which were not predicted as class $i$.
The value of MIoU ranges from 0 to 1. A higher value indicates a greater overlap between the segmentation result and the ground truth and, thus, a better segmentation performance.
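To make the relationship between these four indicators concrete, the following sketch computes them from a per-class confusion matrix. The function name and the three-class example values are illustrative only and are not taken from the study's implementation.

```python
import numpy as np

def segmentation_metrics(confusion: np.ndarray):
    """Compute MPA, MRecall, F1 Score, and MIoU from a confusion matrix.

    confusion[i, j] = number of pixels of true class i predicted as class j.
    """
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp   # predicted as class i, but true class differs
    fn = confusion.sum(axis=1) - tp   # true class i pixels that were missed
    eps = 1e-10

    precision = tp / (tp + fp + eps)  # per-class precision P_i
    recall = tp / (tp + fn + eps)     # per-class recall R_i
    iou = tp / (tp + fp + fn + eps)   # per-class IoU_i

    mpa = precision.mean()            # Mean Pixel Accuracy
    mrecall = recall.mean()           # Mean Recall
    f1 = 2 * mpa * mrecall / (mpa + mrecall + eps)  # harmonic mean of the two
    miou = iou.mean()                 # Mean Intersection over Union
    return mpa, mrecall, f1, miou

# Illustrative 3-class example (background, forest, water body)
cm = np.array([[900, 40, 10],
               [30, 950, 20],
               [5, 15, 480]])
print(segmentation_metrics(cm))
```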
The Edge F1 Score is a metric used to evaluate the performance of image segmentation models on edge detection tasks. It is a variant of the F1 Score, specifically designed to measure the accuracy of a model in identifying object edges in images:

Edge\ F1\ Score = \frac{2 \times TP_b}{2 \times TP_b + FP_b + FN_b}
$TP_b$: the number of pixels on the boundary between the pair of categories that were correctly identified as boundary.
$FP_b$: the number of pixels incorrectly identified as boundary that do not, in fact, lie on the boundary between the pair of categories.
$FN_b$: the number of pixels belonging to the boundary between the pair of categories that were not correctly identified.
3.2. Loss Functions
The aim of forest segmentation is to classify each pixel into one of three categories (forest, water body, or surrounding background) based on its location. Therefore, this study introduced three loss functions to train the model. The first is Cross-Entropy Loss, a loss function commonly used in multi-class classification. It measures the difference between the probability distribution predicted by the model and the probability distribution of the true labels. The formula is as follows:
L_{CE} = -\sum_{i=1}^{N} y_i \log(p_i)

where $y_i$ is the true label (0 or 1) and $p_i$ is the probability predicted by the model for class $i$.
The second is Focal Loss, which is a variant of Cross-Entropy Loss. It adjusts the weight of positive samples during the training process to solve the problem of sample imbalance. The formula is as follows:
L_{FL} = -\alpha (1 - p_t)^{\gamma} \log(p_t)

where $\alpha$ is the weight of positive samples, $\gamma$ is the exponent that adjusts the contribution of difficult and easy samples, and $p_t$ is the model's predicted probability for the true label.
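For reference, a minimal PyTorch sketch of this loss is given below; the defaults $\alpha = 0.25$ and $\gamma = 2$ are the values commonly used in the Focal Loss literature, not values reported in this study.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Multi-class focal loss.

    logits: (N, C, H, W) raw scores; target: (N, H, W) integer class labels.
    """
    log_p = F.log_softmax(logits, dim=1)                      # log-probability of every class
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log p_t of the true label
    pt = log_pt.exp()
    loss = -alpha * (1.0 - pt) ** gamma * log_pt              # down-weights easy samples
    return loss.mean()
```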
The last one is Dice Loss, which is a measure based on area overlap and is particularly suitable for dealing with class imbalance problems. The formula is as follows:
L_{Dice} = 1 - \frac{2 |X \cap Y|}{|X| + |Y|}

where $X$ and $Y$ represent the predicted area and the real area, respectively.
In this study, after extensive testing and data comparison, we selected the combination of Cross-Entropy Loss and Dice Loss for optimal performance. Cross-Entropy Loss helps the model learn better classification boundaries and, with a sufficient sample size, effectively measures the model's classification ability for each category. Dice Loss focuses on the degree of overlap between the predicted area and the real area, which helps the model pay attention to hard-to-classify samples during training, especially in the case of sample imbalance. The data in this study consist mainly of forest and water elements; the forest portion in particular accounts for more than 40% of the pixels. In forest imagery derived from remote sensing, unclear boundaries have been a long-standing problem, and the small proportion of water pixels in the dataset also presents an imbalanced sample. Combining these two loss functions strikes a balance between classification accuracy and boundary-matching accuracy: Cross-Entropy Loss drives the model toward better classification boundaries, while Dice Loss keeps it focused on the overlap of the predicted regions. This combination improves the model's performance on difficult samples, thereby improving its generalization ability and robustness.
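A minimal sketch of the combined objective is shown below, assuming a simple 1:1 weighting of the two terms; the actual weighting used in the study is not specified here, so dice_weight is an illustrative parameter.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss averaged over classes.

    logits: (N, C, H, W) raw scores; target: (N, H, W) integer class labels.
    """
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(0, 2, 3))              # per-class overlap
    total = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = (2 * inter + eps) / (total + eps)
    return 1.0 - dice.mean()

def combined_loss(logits, target, dice_weight=1.0):
    # Cross-Entropy drives per-pixel classification; Dice rewards region overlap.
    return F.cross_entropy(logits, target) + dice_weight * dice_loss(logits, target)
```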
3.3. Parameter Settings and Ablation Experiments
To achieve better segmentation results, in the early stage of the experiments we selected DeeplabV3+ (Backbone: MobilenetV2), UNet (Backbone: VGG), UNet (Backbone: ResNet50), Swin-Transformer, Segformer, Swin-UNet, TransUNet, and Swin-TransUNet for a basic experimental comparison. Owing to the excellent performance of UNet (Backbone: VGG), we selected it as the baseline model of this study, carried out a series of adjustments and improvements on it, and finally proposed CCE-UNet, which combines CIFM with a dual attention mechanism. Below, we discuss the various improvements and tests conducted during the experiments.
First, we discuss the CIFM module. The core of CIFM involves introducing Non-Local attention at the data input stage to obtain global context information, performing feature extraction at different scales through Atrous convolution to further capture context, using global maximum pooling to enhance the expression of contextual information, and applying Weight Norm to normalize the weight vectors, thereby improving the stability of network training. To verify the superiority of CIFM, we compared it with SPP, which operates on a similar principle, and with the full family of ASPP variants. As shown in Table 3, ASPP v2 implements multi-scale analysis using multiple parallel dilated convolutions with different rates; the use of multi-scale blocks can effectively capture the multi-scale context information of objects. ASPP v3 adds low-dimensional down-sampling channels to the previous generation to increase the integration of low-level features, thereby improving the utilization efficiency of low-level feature information. ASPP v3+ uses a spatial pyramid pooling upsampling module to fuse full-channel features, improving the upsampling effect. LR-ASPP is a lightweight design based on the first-generation version, which significantly reduces computational complexity and is better suited to mobile networks. These modules have produced remarkable results in previous semantic segmentation tasks. We integrated each of them into the fifth layer of the baseline model's encoder, which processes the final output of the encoder. The reason for integrating the modules at this stage is that the deepest features contain the richest information, which allows these modules to perform to their full potential.
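The exact internal layout of CIFM is not fully specified above, so the following PyTorch sketch is only one plausible reading of the description: Non-Local attention on the incoming features, parallel Atrous branches, a global max-pooling branch, and weight-normalized projections. All class names, channel widths, and wiring choices here are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class NonLocalBlock(nn.Module):
    """Simplified Non-Local attention (embedded Gaussian form)."""
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels // 2, 1)
        self.phi = nn.Conv2d(channels, channels // 2, 1)
        self.g = nn.Conv2d(channels, channels // 2, 1)
        self.out = nn.Conv2d(channels // 2, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)    # (N, HW, C/2)
        k = self.phi(x).flatten(2)                      # (N, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)        # (N, HW, C/2)
        attn = torch.softmax(q @ k, dim=-1)             # pairwise affinity over positions
        y = (attn @ v).transpose(1, 2).reshape(n, c // 2, h, w)
        return x + self.out(y)                          # residual connection

class CIFM(nn.Module):
    """One plausible reading of the CIFM block described in the text."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.non_local = NonLocalBlock(in_ch)
        self.branches = nn.ModuleList(
            [weight_norm(nn.Conv2d(in_ch, out_ch, 1))] +
            [weight_norm(nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r))
             for r in rates])
        self.pool_proj = weight_norm(nn.Conv2d(in_ch, out_ch, 1))
        self.fuse = weight_norm(nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1))

    def forward(self, x):
        x = self.non_local(x)                           # global context first
        feats = [b(x) for b in self.branches]           # multi-scale Atrous features
        gmp = torch.amax(x, dim=(2, 3), keepdim=True)   # global max pooling
        gmp = self.pool_proj(gmp).expand_as(feats[0])
        feats.append(gmp)
        return self.fuse(torch.cat(feats, dim=1))       # weight-normalized fusion
```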
As shown in Table 4, the experimental results indicate that the CIFM module performs best, with an improvement of 1.18% in MIoU. After the baseline model was equipped with the ASPP v2, ASPP v3, and LR-ASPP modules, respectively, model accuracy increased, but the differences among them were minimal. However, the introduction of SPP and ASPP v3+ led to negative growth in the baseline model's performance. Based on this, we offer the following conjectures. The SPP module itself has limitations, such as a tendency to over-smooth during pyramid pooling, leading to the loss of important detail information and consequently degrading the model's performance. The behaviour of ASPP v3+ warrants deeper consideration. Firstly, the integration of the BN module within ASPP can aid the convergence and stability of the model, thereby improving training speed and generalization ability, as verified by ASPP v2. The introduction of depthwise separable convolutions (DSCs) can effectively reduce the number of parameters and computational complexity, maintaining or even improving the model's performance while reducing model complexity. However, the combination of DSCs and BN can easily lead to issues with feature normalization, causing redundant computation of related features. In particular, for high-resolution remote sensing images rich in semantic information, DSCs operate on single channels, resulting in sparse feature maps and instability during normalization. One way to mitigate this is to introduce a more suitable normalization method that avoids this conflict between DSCs and BN. Therefore, in the design of CIFM, we added weight normalization to normalize the weight vectors and introduced Non-Local attention to obtain global context information. The final experimental results also show that the performance of CIFM is superior to the aforementioned feature extraction modules.
In the final round of testing, we used the default dilation rates of 6, 12, and 18 for the Atrous convolutions in the multi-scale feature fusion module and verified that CIFM achieved the best results on the baseline model. Next, by adjusting the dilation rates, we tested the performance of the baseline model under 16 groups of strategies. Since the performance differences in some of the data were too small to show a meaningful trend, we filtered out strategies whose MIoU differed by less than 0.15% and retained those that showed significant performance differences. As shown in Table 5, when the dilation rates of the Atrous convolutions are 4, 8, and 12, CIFM achieves a further performance improvement on the baseline model, with MIoU improving by 1.92% compared to the baseline.
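With the hypothetical CIFM sketch above, switching between these strategies amounts to changing a constructor argument:

```python
# Best strategy from Table 5: dilation rates 4, 8, 12.
# in_ch = 512 assumes the channel width of the VGG encoder's fifth stage.
cifm = CIFM(in_ch=512, out_ch=512, rates=(4, 8, 12))
```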
To verify the performance improvement brought by the dual attention mechanism to the CCE-UNet model, this study selected different combination strategies for comparative experiments. Considering the architecture of the baseline model, the experiments applied different attention mechanisms at various positions in the encoder according to their characteristics. It should be noted that the third layer of the encoder typically contains the transition from low-level to high-level features, which have already undergone preliminary abstraction and combination; introducing an attention mechanism at this stage may cause the network to focus excessively on local features and ignore the global context of the entire feature map. Additionally, attention mechanisms are usually used to emphasize certain feature areas, while the main work of the decoder stage is pixel-level fusion and upsampling, so the effect of attention in the decoder may be less pronounced than in other stages and its impact on the final segmentation result relatively small. Thus, comparative experiments were carried out on layers 1, 2, 4, and 5 of the encoder. As shown in Table 6, the MPA and MIoU of the optimal strategy improved by 2.11% and 3.4%, respectively, compared to the baseline model. Consequently, the combined application of CBAM and ECA can effectively improve segmentation accuracy.
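For reference, ECA in particular is very lightweight, which makes it cheap to test at several encoder layers. The sketch below, assuming the standard ECA formulation with a fixed 1-D kernel size of 3, shows the kind of module inserted in these experiments.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: a 1-D convolution over channel descriptors."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = x.mean(dim=(2, 3))                     # global average pooling -> (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # local cross-channel interaction
        return x * torch.sigmoid(y)[:, :, None, None]  # reweight channels
```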
To verify the contribution of each enhancement module in the CCE-UNet model to performance, the following ablation experiments were conducted.
Table 7 shows the semantic segmentation performance for forest and water bodies under different model architectures. Model 0 represents the UNet baseline model, whose backbone network is VGG. Model 1 adds the CIFM module to the baseline model. Model 2 incorporates both the CBAM and ECA attention mechanisms. Model 3 incorporates both CIFM and CBAM. Model 4 incorporates both CIFM and ECA. Model 5 includes all of the enhancement modules and represents the CCE-UNet architecture proposed in this work. CCE-UNet, integrating all three modules, achieves the best semantic segmentation performance, with an F1 Score of 95.12% and an MIoU of 91.07%. In the CCE-UNet model, each module complements the others without conflict, further improving semantic segmentation performance.
During model training, the choice of optimizer and learning rate strategy is also one of the keys to model performance and generalization ability. To this end, we introduced two optimizers, SGD and Adam, and two learning rate strategies, Cosine Annealing and Step Decay. During the experiments, we found that with the Adam optimizer the fluctuations in the loss curve of the CCE-UNet model were significantly smaller than with the SGD optimizer; however, when the Adam optimizer was combined with the Cosine Annealing strategy, the fluctuations appeared again, although overfitting was eliminated. After comparative tests of the different strategies, as shown in Table 8, the CCE-UNet model achieves the best performance when using the Adam optimizer with the Step Decay learning rate strategy.
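A minimal sketch of this training configuration in PyTorch follows; the learning rate, step size, decay factor, and epoch count are placeholders, and CCEUNet and train_one_epoch are hypothetical names standing in for the study's actual implementation.

```python
import torch

model = CCEUNet()  # hypothetical class standing in for the CCE-UNet implementation
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is a placeholder
# Step Decay: multiply the learning rate by gamma every step_size epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

num_epochs = 100  # placeholder epoch count
for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)  # hypothetical training loop
    scheduler.step()
```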
Based on the available computational resources of the experimental platform, we conducted comparative experiments on the forest and water body dataset with the same batch size configuration (batch size = 8) to fairly compare the performance of each semantic segmentation model. We selected more lightweight architectures, such as DeeplabV3+ and Swin-UNet, as well as Transformer-series architectures that have demonstrated strong capability in the field of semantic segmentation.
Table 9 shows the backbone networks and basic parameters of each semantic segmentation model.
The experimental results indicate that the CCE-UNet architecture exhibits the best semantic segmentation performance on the forest and water body dataset. As shown in Table 10, its MIoU improved by 4.34% compared to the baseline model and by 11.31% compared to DeeplabV3+. This is mainly due to the CIFM module, designed around contextual information fusion, and the dual attention mechanism strategy for feature extraction at different stages. The information complementarity and reinforcement among these three modules enable CCE-UNet to accurately segment forest and water body pixels in the overall task, thereby assisting in calculating forest and water body coverage rates. In the past, following large-scale deforestation or various disasters, monitoring the recovery of forest vegetation typically required manual on-site surveys to collect local forest data, followed by prediction through time-series methods, which incurred significant human and time costs. By utilizing satellite remote sensing data and the CCE-UNet architecture, the coverage of forests and water bodies can be extracted rapidly, greatly reducing the consumption of human resources. This also provides an important technical foundation for post-disaster ecosystem recovery monitoring in the Nattai Forest Reserve.
In the face of remote sensing images with complex forest and water body boundaries, we additionally introduced the Edge F1 Score to compare the performance of the various models on edge segmentation. We batch-generated the prediction maps for the forest and water body dataset on all models, then used the Canny algorithm to derive the Edge F1 Score for each category by comparing the ground truth with the prediction maps. As shown in Table 11, each category corresponds to the Edge F1 Score of one boundary scene: Category 1 is forest and background, Category 2 is water body and background, and Category 3 is forest and water body; a comprehensive Edge F1 Score is also given for each model. Although the other models already demonstrate good edge segmentation performance, the CCE-UNet architecture outperforms them all.
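A minimal sketch of this evaluation step is shown below, assuming OpenCV's Canny detector on binary per-category masks and a small matching tolerance between predicted and ground-truth edges; the Canny thresholds and tolerance radius are illustrative parameters, not values reported in the study.

```python
import cv2
import numpy as np

def edge_f1(pred_mask, gt_mask, tol=2):
    """Edge F1 Score between a predicted and a ground-truth binary mask.

    Boundaries are extracted with Canny; an edge pixel counts as matched if an
    edge of the other mask lies within `tol` pixels of it.
    """
    pred_edge = cv2.Canny(pred_mask.astype(np.uint8) * 255, 100, 200) > 0
    gt_edge = cv2.Canny(gt_mask.astype(np.uint8) * 255, 100, 200) > 0

    kernel = np.ones((2 * tol + 1, 2 * tol + 1), np.uint8)
    gt_dilated = cv2.dilate(gt_edge.astype(np.uint8), kernel) > 0
    pred_dilated = cv2.dilate(pred_edge.astype(np.uint8), kernel) > 0

    tp_p = np.logical_and(pred_edge, gt_dilated).sum()   # matched predicted edges
    tp_g = np.logical_and(gt_edge, pred_dilated).sum()   # matched ground-truth edges
    precision = tp_p / max(pred_edge.sum(), 1)
    recall = tp_g / max(gt_edge.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-10)
```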