1. Introduction
The forest ecosystem plays a significant role in the global ecosystem and human society. Forests provide habitat for numerous species and serve as the foundation of food chains [1,2]. They also regulate climate, maintain water sources, and prevent soil erosion. However, forest fires, as a severe ecological disturbance, have profound impacts on forest ecosystems and human society [3]. Forest fires give rise to a range of issues. Firstly, they disrupt the structure and functioning of forest ecosystems, leading to loss of biodiversity, habitat degradation, and disruption of ecological processes [4]. Secondly, forest fires release a substantial amount of carbon into the atmosphere, exacerbating global climate change [5]. Thirdly, fires trigger soil erosion and water source contamination, negatively affecting the sustainable utilization of water resources and ecosystem health. Moreover, forest fires lead to significant economic losses, safety risks, and health issues in human society. For instance, the recent Lahaina Fire in the United States resulted in the highest wildfire death toll since 1900, causing extensive casualties and property damage [6]. Therefore, accurate identification and effective monitoring of forest fires are crucial. In this regard, the importance of forest flame segmentation and recognition becomes evident. Accurate segmentation and recognition of forest flames contribute to real-time monitoring of forest fires, providing crucial information for emergency response and forest management decisions [7]. By applying advanced computer vision and deep learning techniques such as convolutional neural networks (CNNs) [8], the accuracy of forest flame recognition and segmentation can be enhanced, offering robust support for fire management. This helps in early detection and control of forest fires, minimizing damage to ecosystems and human society, and ensuring the sustainability of forest resources [7].
Remote sensing technology has advanced rapidly [9] and now plays a crucial role in the segmentation of forest fires, offering significant advantages [10,11]. Remote sensing provides high-resolution images that capture detailed information about the shape and boundaries of flames, enabling accurate segmentation of fire regions [12,13]. This is of paramount importance for assessing the scale, intensity, and impact of wildfires on forest ecosystems [13]. Furthermore, remote sensing allows for the acquisition of multi-temporal image data, facilitating the observation and monitoring of fire dynamics. By analyzing the temporal changes in flames, researchers can investigate fire propagation patterns, predict potential fire spread paths, and provide more accurate guidance for wildfire suppression operations [14]. In addition, remote sensing possesses extensive spatial coverage capabilities, enabling the coverage of large forested areas [15]. It acquires images from different angles and heights, providing comprehensive information for flame segmentation. Moreover, remote sensing offers real-time capabilities, enabling the timely acquisition and rapid analysis of fire images, thereby facilitating prompt response to fire incidents and the implementation of effective firefighting and rescue measures. Given these advantages, researchers have endeavored to utilize forest remote sensing images for forest flame segmentation [16,17].
Building upon the advantages of remote sensing technology in forest fire segmentation, the application of Convolutional Neural Networks (CNNs) has exhibited remarkable potential in enhancing the accuracy and efficiency of fire detection and segmentation [18]. CNNs, as a class of deep learning algorithms, have revolutionized computer vision tasks, including object recognition, image classification, and semantic segmentation. Leveraging the power of CNNs, researchers have made significant strides in effectively analyzing and extracting features from remote sensing images, enabling more precise and automated fire segmentation [17]. Eleni Tsalera et al. [19] propose a method that utilizes lightweight CNNs, such as SqueezeNet, ShuffleNet, MobileNetV2, and ResNet-50, for wildfire identification. Performance evaluation is conducted on multiple datasets with cross-dataset analysis, comparing computational resources and costs against ResNet-50. For contextualization purposes, ResNet-18 is employed for image semantic segmentation. The experimental results demonstrate a high accuracy of 96% and satisfactory performance across datasets. Furthermore, five classes from the CamVid dataset are identified for contextualizing wildfires. Zhihao Guan et al. [20] propose a novel approach for forest fire detection and segmentation. They introduce a channel-domain attention mechanism for image classification, achieving a classification accuracy of 93.65%. Additionally, they develop MaskSU R-CNN, a novel instance segmentation method, which exhibits a precision of 91.85%, a recall of 88.81%, an F1-score of 90.30%, and a mean intersection over union (mIoU) of 82.31%.
However, despite the good performance of convolutional neural network (CNN)-based semantic segmentation techniques on forest remote sensing datasets for forest flame applications, existing methods have not addressed certain challenges inherent in remote sensing datasets or the limitations of CNNs themselves.
Challenge 1: The limited receptive field of CNNs prevents the comprehensive extraction and utilization of information from the entire image, further exacerbating the neglect of flame features [21].
Challenge 2: As shown in Figure 1, because remote sensing images have high resolution, the flame region usually occupies only a small proportion of each image, so the model pays insufficient attention to the flame region and learns flame features incompletely.
Challenge 3: The scarcity of flame instances and the extremely imbalanced class distribution lead to long training time and elevated dataset requisites (encompassing a larger number of training images or images with more pronounced flame characteristics).
In response to these challenges, we propose corresponding designs to enhance the performance of the model and fully leverage the training data. To address Challenge 1, we incorporate a simple transformer architecture in the encoder part of the network to capture global features in both a parallel and a serial manner, and introduce the CBAM (Convolutional Block Attention Module) attention mechanism in the decoder part to enable comprehensive learning of the image and improve segmentation accuracy and detail preservation. For Challenge 2, we introduce an adaptive Copy-Paste data augmentation method to increase the presence of poorly learned classes, allowing for sufficient learning of these classes. For Challenge 3, we introduce the Dice loss, which emphasizes the flame region rather than the non-flame region, thereby improving model training speed.
As shown in Figure 2, our model is based on an encoder-decoder architecture: the encoder, chosen with both speed and performance in mind, is MobileNetV2, while the decoder adopts DeepLabV3+. Specifically, the approach first selects the image with the minimum confidence score from the current batch. Then, based on the confidence scores of the images in the current batch transformed into probabilities, another image is randomly chosen, and all pixels belonging to the flame category in this second image are copied and pasted onto the first image. The batch images are then passed through the transformer and through further feature extraction by the encoder, and the encoder features are concatenated with the features produced by the transformer; this step enhances feature richness and accuracy. The final stage comprises the decoder operations for label prediction and incorporates the Dice loss for backpropagation.
In this research paper, we have developed a novel network model based on unmanned aerial vehicle (UAV) remote sensing imagery, aimed at enhancing forest fire management and assessment. Its contributions can be categorized into two specific aspects:
Accurate Flame Detection and Localization: Our approach enables the direct segmentation of UAV-acquired remote sensing images, accurately identifying the presence of flames within the images. Simultaneously, it provides information regarding the shape and size of the flames. Even relatively small flames can be accurately recognized using our method, facilitating early flame detection and timely firefighting measures.
Fire Monitoring and Management: Managers can assess the fire situation and make informed decisions by analyzing the images segmented by our model. This facilitates the timely and accurate development of firefighting plans.
By integrating our approach with UAV technology, FlameTransNet provides managers with a convenient and efficient means of obtaining insights into forest conditions, reducing the labor costs associated with manual on-site inspections. Taking into account the forest environment and the dataset we have utilized, we believe that our technology holds significant potential for effective application in extensive forest regions such as Northern Arizona.
The remainder of this paper is organized as follows. Section 2 introduces the work of previous researchers and compares our work with theirs. Section 3 provides a detailed description of our model design and the methods employed. Section 4 describes our self-built dataset, the selected evaluation metrics, performance comparisons with mainstream semantic segmentation methods, and ablation experiments for each module. Finally, Section 5 summarizes our work and outlines future research directions.
3. Method
3.1. Proposed Framework
As shown in Figure 3, our proposed network is built upon an encoder-decoder architecture. In the encoder part, we employ the MobileNetV2 network for feature extraction. Prior to the MobileNetV2 network, we integrate a Transformer module to capture deep image information, while simultaneously incorporating a parallel Transformer module to preserve the spatial context of the image, thereby alleviating the limited receptive field issue of CNNs. This approach maximizes the extraction of image features. Moreover, during the fusion stage of low-level features between the encoder and decoder, we utilize the CBAM attention mechanism to further extract informative details from the lower-level features, enabling the model to pay more attention to the flame region and further enhancing its performance.
3.1.1. Encoder (MobileNetV2 Based)
MobileNet, a pioneering lightweight deep neural network devised by Google, was crafted to meet the demands of mobile and embedded devices. As illustrated in Figure 4, MobileNetV2 [28] represents a refined iteration of MobileNet, introducing a pivotal enhancement known as the Inverted Residual Block. This distinctive feature anchors the entirety of the MobileNetV2 architecture, facilitating its efficiency and effectiveness.
The Inverted Residual Block, a cornerstone of MobileNetV2, is carefully engineered to strike a balance between robust feature extraction and model lightweightness. Comprising two interconnected components, this innovative design leverages the strengths of MobileNetV2:
Main Branch (Left Side): This segment initiates with a 1 × 1 convolution, strategically employed to expand dimensionality without a significant surge in computational complexity. Following this, a 3 × 3 depthwise convolution is deployed for capturing intricate features, enhancing the network’s capacity to discern fine-grained patterns. The sequence culminates with another 1 × 1 convolution, skillfully tailored to compress dimensionality while retaining crucial information.
Residual Connection (Right Side): A defining aspect of the Inverted Residual Block, this pathway establishes a direct connection between input and output, thereby fostering information flow and facilitating gradient propagation. This architectural innovation significantly contributes to both model performance and training efficiency.
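For illustration, the following is a minimal PyTorch sketch of an inverted residual block in this spirit (1 × 1 expansion, 3 × 3 depthwise convolution, 1 × 1 projection, with a residual connection when shapes allow). The expansion ratio and channel sizes are common defaults and illustrative assumptions, not the exact torchvision implementation.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expand -> depthwise 3x3 -> project, with an optional residual connection."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # 1x1 expansion: raise dimensionality cheaply
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution: per-channel spatial filtering
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 projection: compress back to the output width (no activation)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

# Example: a block that keeps 32 channels at the same resolution
x = torch.randn(1, 32, 128, 128)
print(InvertedResidual(32, 32)(x).shape)  # torch.Size([1, 32, 128, 128])
```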
Given our commitment to maintaining robust feature extraction capabilities while minimizing model overhead, our selection of MobileNetV2 as the encoder aligns seamlessly with our objectives. By leveraging the strengths of the Inverted Residual Block, we can harness the advantages of MobileNetV2’s efficient and lightweight design, ensuring that our model strikes an optimal balance between computational efficiency and representation power.
3.1.2. Enhancing Feature Extraction and Expanding Receptive Field Using Transformer
In the context of flame semantic segmentation, where the flame regions are typically small and require accurate feature extraction, we propose a method that leverages the Transformer architecture [29,30] to capture a broader range of contextual information and enhance the representation of flame semantics.
The Transformer module (as shown in Figure 5), integrated into our flame semantic segmentation framework, consists of multiple TransformerEncoderLayer modules. These modules enable the network to effectively process the input data and extract discriminative features relevant to flame semantics. During feature extraction, the TransformerEncoderLayer module utilizes a self-attention mechanism to capture long-range dependencies between different regions in the input image. By attending to the entire image simultaneously, the Transformer can effectively capture the spatial context of the flame region and its surroundings, even when the flame region is small. By incorporating the Transformer architecture into our flame semantic segmentation framework, our method can extract flame-specific features by capturing extensive contextual information. This enables the model to better understand the spatial relationship between the flame region and its surroundings, leading to improved segmentation accuracy and performance.
In practical usage, we employ a combined approach by both concatenating and parallelizing Transformer modules in the encoder phase, aiming to effectively extract flame semantic features and address the limited receptive field issue commonly encountered in traditional convolutional neural networks.
In the encoder phase, we first concatenate a Transformer module to extract deep-level information from the images. By utilizing the self-attention mechanism, this Transformer module captures global contextual relationships, aiding in the understanding of spatial characteristics within the flame regions. However, considering that flame regions are typically small, relying solely on a single Transformer module may struggle to accurately capture subtle features.
To overcome this limitation, we further introduce a parallel Transformer module in the encoder phase. The parallel Transformer module aims to preserve the extensive spatial information of the images and provide a broader receptive field. By incorporating both concatenated and parallel Transformer modules, we can leverage the complementary aspects of different layers in feature representation, enabling a more comprehensive capture of the flame region’s semantic information.
By simultaneously concatenating and parallelizing Transformer modules, our proposed method harnesses the benefits of deep-level information and an expanded receptive field, thereby enhancing the capability to extract flame semantic features. This architectural design proves valuable in practical applications, augmenting the model’s understanding and accuracy in flame region analysis.
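One possible way to apply a Transformer encoder to CNN feature maps, both in series with and in parallel to the convolutional backbone, is sketched below: the spatial map is flattened into a token sequence, passed through stacked TransformerEncoderLayer modules, reshaped back, and the two paths are fused by channel concatenation. The channel width, number of layers, and the stand-in convolutional stage are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn

class FeatureMapTransformer(nn.Module):
    """Run a stack of TransformerEncoderLayers over a (B, C, H, W) feature map."""
    def __init__(self, channels, num_layers=2, nhead=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=nhead,
                                           dim_feedforward=2 * channels,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C): one token per position
        tokens = self.encoder(tokens)           # global self-attention across all positions
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Serial path: the transformer output feeds the CNN stage; parallel path: a transformer
# applied to the same input is kept alongside and fused by channel concatenation.
serial_tf = FeatureMapTransformer(channels=32)
parallel_tf = FeatureMapTransformer(channels=32)
cnn_stage = nn.Conv2d(32, 32, 3, padding=1)     # stand-in for a MobileNetV2 stage

feat = torch.randn(1, 32, 32, 32)
serial_out = cnn_stage(serial_tf(feat))
fused = torch.cat([serial_out, parallel_tf(feat)], dim=1)
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```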
The self-attention mechanism serves as the pivotal component within the Transformer encoder, playing a vital role in directing the model’s focus towards salient image regions based on their respective significance. This enables the network to emphasize critical information and tailor the extracted features to align with identified targets. In this mechanism, embedded patch vectors are transformed into three distinct vectors: query (Q), key (K), and value (V). The correlation between K and Q is assessed via dot product calculation. After normalization through scaling and a softmax function, the computed similarity values are used to weight the value vector, thereby obtaining semantic importance. Aggregating all semantic weights yields the self-attention feature. Ultimately, a feature map enriched with substantial information is derived through subsequent processing by a Multi-Layer Perceptron (MLP). This self-attention computation can be represented as follows:
\[ Z = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \]
where Z is the self-attention feature; \(\sqrt{d_k}\) is the scaling factor; Q is the query vector; K is the key vector; and V is the value vector.
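As a concrete reference, the scaled dot-product attention above can be written in a few lines of PyTorch. This is a minimal sketch: the learned projections that produce Q, K, and V and the subsequent MLP are omitted.

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Z = softmax(Q K^T / sqrt(d_k)) V, computed over the last two dimensions."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity between queries and keys
    weights = torch.softmax(scores, dim=-1)         # normalize to attention weights
    return weights @ V                              # weighted sum of the value vectors

# Example: 10 tokens with 64-dimensional embeddings
Q = K = V = torch.randn(10, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([10, 64])
```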
3.1.3. CBAM (Convolutional Block Attention Module) Attention Mechanism
The Convolutional Block Attention Module (CBAM) is an attention mechanism module that combines spatial and channel attention [31] in the convolutional blocks [32]. By integrating both spatial and channel attention mechanisms, CBAM offers improved performance compared to attention mechanisms that focus solely on channel attention, such as SENet [33].
Figure 6 illustrates the overall structure after incorporating the CBAM module. It can be observed that the output of the convolutional layers undergoes a channel attention module, which generates weighted results. Subsequently, the output passes through a spatial attention module before obtaining the final weighted results. The introduction of CBAM aims to enhance the features specifically related to the flame region.
The channel attention module processes the input feature map by applying global max pooling and global average pooling operations based on width and height. Subsequently, each pooled feature is fed through a Multi-Layer Perceptron (MLP). The output features from the MLPs are element-wise summed and passed through a sigmoid activation function to generate the final channel attention feature map. This channel attention feature map is then multiplied element-wise with the input feature map to produce the input features required for the spatial attention module.
The spatial attention module takes the output feature map from the channel attention module as its input. Firstly, a global max pooling and global average pooling operation are performed based on the channels. The results of these operations are then concatenated along the channel dimension. Subsequently, a convolutional operation is applied to reduce the dimensionality to a single channel. The resulting feature map is passed through a sigmoid function to generate the spatial attention feature. Finally, this feature map is multiplied element-wise with the input feature map of this module, yielding the final generated feature.
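A condensed PyTorch sketch of the channel and spatial attention steps described above is given below; the reduction ratio and the 7 × 7 spatial kernel are common defaults and are assumed here rather than taken from our experimental configuration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied multiplicatively."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Shared MLP for channel attention (applied to max- and average-pooled vectors)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Convolution over the concatenated max/avg maps for spatial attention
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # --- channel attention ---
        avg = self.mlp(x.mean(dim=(2, 3)))          # global average pooling over H, W
        mx = self.mlp(x.amax(dim=(2, 3)))           # global max pooling over H, W
        ca = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * ca
        # --- spatial attention ---
        avg_map = x.mean(dim=1, keepdim=True)       # average pooling over channels
        max_map = x.amax(dim=1, keepdim=True)       # max pooling over channels
        sa = torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
        return x * sa

feat = torch.randn(2, 64, 32, 32)
print(CBAM(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```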
As illustrated in Figure 7, the integration of the CBAM attention mechanism results in a model that focuses more on the flame’s characteristic regions. The fine details of the features become more pronounced, while attention to the background is reduced.
Discussion: In order to better extract fire-related features and improve model performance, we compared the effectiveness of various attention mechanisms, including SE (Squeeze-and-Excitation attention), CAM (Channel Attention Module), SAM (Spatial Attention Module), and CBAM (Convolutional Block Attention Module), as shown in Table 1 and Figure 8. Introducing different attention mechanisms proved beneficial for enhancing the model’s learning and segmentation of fire features, with CBAM exhibiting a better focus on fire-related characteristics. Considering factors such as model parameters and overfitting, we opted to solely employ the CBAM attention mechanism in this study.
3.1.4. Decoder (DeepLabV3+ Based)
In DeepLabV3+, the enhanced feature extraction network can be divided into two parts:
In the Encoder, the preliminary effective feature maps that have been compressed by a factor of four are processed using parallel Atrous Convolutions. These Atrous Convolutions are performed with different rates to extract features at multiple scales. The resulting feature maps are then merged and further compressed using 1 × 1 convolutions.
In the Decoder, the preliminary effective feature maps that have been compressed by a factor of two are adjusted in terms of channel dimensions using 1 × 1 convolutions. These adjusted feature maps are then stacked with the upsampled feature maps from the output of the Atrous Convolutions. Once the stacking is complete, two rounds of depth-wise separable convolution blocks are applied.
Additionally, DeepLabV3+ incorporates other important components such as the use of dilated convolutions (atrous convolutions) to capture multi-scale context information, the application of skip connections to combine features at different levels, and the utilization of depth-wise separable convolutions for efficient computation. These elements collectively contribute to the overall performance improvement and semantic segmentation accuracy achieved by DeepLabV3+.
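To make the parallel atrous convolutions concrete, the following is a minimal sketch of an ASPP-style block (several dilation rates in parallel, merged by a 1 × 1 convolution). The dilation rates and channel widths shown are commonly used defaults and stand in for our exact settings; the image-level pooling branch and the decoder’s depth-wise separable convolutions are omitted for brevity.

```python
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Parallel atrous convolutions at several rates, merged by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3 if r > 1 else 1,
                      padding=r if r > 1 else 0, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)  # fuse and compress

    def forward(self, x):
        # Each branch keeps the spatial size; features are concatenated along channels.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feat = torch.randn(1, 320, 32, 32)       # e.g. a deep MobileNetV2 feature map
print(SimpleASPP(320, 256)(feat).shape)  # torch.Size([1, 256, 32, 32])
```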
3.2. Adaptive Copy-Paste
The Copy-Paste augmentation method [34] involves pasting objects from one image onto another image, resulting in a diverse set of training data with various choices of source images, object instances, and paste locations. This simple strategy of randomly selecting and pasting objects at random locations has shown significant improvements in model performance across multiple settings.
However, the high-resolution and large-scale characteristics of remote sensing images mean that flame regions occupy only a limited proportion of most images. This leads to insufficient learning of flame-specific features, and random Copy-Paste cannot be relied upon to augment flame-related features, thereby hindering the improvement of model performance. To address this issue, we propose an adaptive Copy-Paste data augmentation method, which further trains the underrepresented flame regions and enhances model performance. Compared to traditional techniques such as resampling and undersampling, our method eliminates the need for manual hyperparameter tuning, reduces training time, and does not affect the size of the dataset.
As shown in Figure 9, we introduce a global confidence bank to store the confidence value of each image. Considering that our task aims to segment flame and non-flame regions, we use the Intersection over Union (IoU) metric for the flame category as the confidence measure for each image. Specifically, for each image, we initialize the confidence value to 0 and update it during each training iteration using an exponential moving average (EMA), as described in Equation (2):
\[ c_i \leftarrow \lambda\, c_i + (1-\lambda)\,\mathrm{IoU}_i^{\mathrm{fire}} \quad (2) \]
where \(c_i\) represents the confidence value for the i-th image, \(\mathrm{IoU}_i^{\mathrm{fire}}\) represents the IoU metric specific to the flame category, and the smoothing parameter \(\lambda\) is set to 0.98.
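A minimal sketch of this update, assuming a plain Python list serves as the confidence bank, `flame_iou` is the flame-class IoU measured on the current prediction, and the decay factor multiplies the stored value as in Equation (2); all names are illustrative.

```python
def update_confidence(conf_bank, idx, flame_iou, lam=0.98):
    """EMA update of the per-image confidence, following Equation (2)."""
    conf_bank[idx] = lam * conf_bank[idx] + (1.0 - lam) * flame_iou
    return conf_bank

bank = [0.0, 0.0, 0.0]
print(update_confidence(bank, 1, 0.6))  # [0.0, 0.012, 0.0] (approximately)
```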
Upon obtaining the confidence bank, during each training iteration, the confidences in the current confidence bank are first normalized to probabilities. A random image is then selected from the current batch according to these probabilities, and all pixels belonging to the flame category in this selected image are superimposed onto the image with the lowest confidence within the batch. Finally, Gaussian filtering [35] is applied to achieve edge-smoothing effects.
Simultaneously, following the validation of the batch, the confidence values corresponding to the images within the batch are updated using the post-validation IoU metric. This comprehensive approach ensures that the training process incorporates probabilistic image selection, targeted flame category augmentation, and refinement through confidence-based IoU updates.
The pseudo-code for the adaptive Copy-Paste method is shown in Algorithm 1.
Algorithm 1 Adaptive Copy-Paste Augmentation
Require: Batch of images B; confidence bank C
Ensure: Updated batch with pasted flame pixels
1: Initialize/update the confidence bank C
2: Normalize all confidences in C to probabilities
3: Select the image I_min with the lowest confidence in the batch
4: for each image I in the batch do
5:     if I ≠ I_min then
6:         Continue to the next image
7:     end if
8:     Randomly select a flame image I_s according to the normalized probabilities
9:     Paste all flame pixels from I_s onto I_min
10:    Apply a Gaussian filter to smooth the edges of I_min
11: end for
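For concreteness, a simplified NumPy sketch of Algorithm 1 is given below. It assumes each sample consists of an image array and a mask with flame pixels labeled 1, and it approximates the edge-smoothing step with a Gaussian blur of the whole target image; array shapes, field names, and the blur strength are illustrative assumptions rather than our exact implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def adaptive_copy_paste(images, masks, confidences, flame_id=1, sigma=1.0):
    """Paste flame pixels from a confidence-weighted random image onto the
    lowest-confidence image in the batch (simplified version of Algorithm 1)."""
    conf = np.asarray(confidences, dtype=np.float64)
    probs = conf / conf.sum() if conf.sum() > 0 else np.full(len(conf), 1.0 / len(conf))
    target = int(conf.argmin())                    # the image that is learned worst
    source = int(np.random.choice(len(images), p=probs))

    flame = masks[source] == flame_id              # flame pixels of the source image
    images[target][flame] = images[source][flame]  # copy-paste the flame region
    masks[target][flame] = flame_id

    images[target] = gaussian_filter(images[target], sigma=sigma)  # smooth pasted edges
    return images, masks

# Toy example: a batch of four 64x64 grayscale images with binary masks
imgs = [np.random.rand(64, 64) for _ in range(4)]
msks = [np.random.randint(0, 2, (64, 64)) for _ in range(4)]
imgs, msks = adaptive_copy_paste(imgs, msks, confidences=[0.3, 0.1, 0.5, 0.4])
```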
3.3. Dice Loss
In the context of fire segmentation, where the fire regions constitute a small proportion of the overall image, we introduce the Dice loss as a means to focus on and learn the fire-specific features.
The Dice loss is a widely used loss function in segmentation tasks, aiming to optimize the similarity between the predicted fire segmentation and the ground truth fire mask. It is derived from the Dice coefficient, which measures the overlap or similarity between two binary masks.
The Dice loss [36] is defined as 1 minus the Dice coefficient, and it serves as an objective function to guide the model towards producing more accurate fire segmentations. The Dice coefficient is computed as twice the intersection of the predicted fire mask and the ground truth fire mask, divided by the sum of their areas.
By incorporating the Dice loss during the training process, we encourage the model to focus on and accurately capture the fire regions. The Dice loss penalizes the discrepancies between the predicted and ground truth fire masks, guiding the model to better learn the fire-specific features and improve the segmentation performance.
Since the fire regions are sparse in each image, the Dice loss is particularly beneficial as it can effectively handle class imbalance. It emphasizes the intersection between the predicted and ground truth fire masks, enabling the model to learn the subtle details and boundaries of the fire regions, even in the presence of significant background regions.
The introduction of the Dice loss in our fire segmentation framework addresses the challenge of imbalanced class distribution and enables the model to effectively learn and focus on the fire regions. By optimizing the Dice loss, our model can achieve more accurate and precise fire segmentations, contributing to improved fire detection and analysis tasks.
The Dice loss can be described as follows:
\[ L_{\mathrm{Dice}} = 1 - \frac{2\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i} \]
where \(y_i\) and \(\hat{y}_i\) represent the label value and predicted value, respectively, for pixel i in an image, and N represents the total number of pixels, which is equal to the number of pixels in a single image multiplied by the batch size.
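A minimal PyTorch version of this loss is sketched below, assuming `y_true` holds binary per-pixel labels and `y_pred` holds predicted flame probabilities for the whole batch; a small epsilon is added for numerical stability (an assumption not stated in the equation above).

```python
import torch

def dice_loss(y_true, y_pred, eps=1e-6):
    """1 - Dice coefficient over all pixels in the batch."""
    y_true = y_true.reshape(-1).float()
    y_pred = y_pred.reshape(-1).float()
    intersection = (y_true * y_pred).sum()
    return 1.0 - (2.0 * intersection + eps) / (y_true.sum() + y_pred.sum() + eps)

# Example: 8 pixels, binary ground truth vs. predicted flame probabilities
y = torch.tensor([0, 0, 1, 1, 0, 1, 0, 0])
p = torch.tensor([0.1, 0.0, 0.8, 0.9, 0.2, 0.7, 0.0, 0.1])
print(dice_loss(y, p))  # ≈ 0.172
```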
4. Data and Experiments
4.1. Data Description
The quality of the dataset and its labels significantly influences the training results in fire semantic segmentation tasks. Therefore, we extracted a portion of the publicly available FLAME dataset and preprocessed it. Additionally, we collected forest fire images from remote sensing sources to create a custom dataset. This dataset not only preserves the flame features but also includes diverse background scenarios, enabling effective segmentation of complex forest conditions.
Specifically, we randomly selected 500 images from the FLAME dataset [37] and resized them to 512 × 512. Images without flame features were removed, and an additional 500 forest fire images of size 512 × 512 were collected from various regions using online sources. In total, the dataset comprises 1000 images. To better validate the effectiveness of our approach, we partitioned the dataset into training, validation, and testing sets in an 8:1:1 ratio. The specific distribution quantities are presented in Table 2.
Based on the visualization shown in Figure 10, our dataset exhibits severe class imbalance, which provides an opportunity to validate the effectiveness of our proposed method. Furthermore, we present several visualized images from our dataset in Figure 11.
Flame: The FLAME dataset is a vital resource for wildfire research, offering aerial imagery captured by UAVs and drones during controlled burns in Northern Arizona. It includes raw drone videos and thermal heatmaps. Aimed at fire classification and segmentation tasks, it provides 39,375 labeled frames for training, 8617 for testing, and 2003 pixel-annotated frames for segmentation. This dataset empowers advanced image analysis, aiding in understanding wildfire behavior for improved management, risk reduction, and ecological preservation.
4.2. Experimental Settings
The experimental settings were carefully configured as follows:
The input images were resized to a shape of [512, 512]. A batch size of 4 was used during training. The initial learning rate was set to 5 × 10, and a minimum learning rate of 0.01 times the initial learning rate was defined for learning rate decay. The optimization algorithm employed was Adam [38] (Adam combines momentum and RMSProp techniques to dynamically adjust learning rates for individual model parameters, making it effective for a variety of optimization tasks) with a momentum value of 0.9. No weight decay was applied in the training process. The learning rate decay strategy used was cosine annealing, where the learning rate decreases gradually over the course of training. These settings were chosen to ensure a balanced trade-off between model performance and computational efficiency. It is worth noting that we used the same experimental settings when comparing our approach with other state-of-the-art semantic segmentation models.
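This configuration translates roughly into the following PyTorch setup. The base learning rate, the epoch count, and the stand-in model are placeholders assumed for illustration, since the exact values are not all reproduced here; the Adam momentum of 0.9 corresponds to its first beta coefficient.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Conv2d(3, 2, 3, padding=1)          # stand-in for the segmentation network
base_lr = 5e-4                                 # placeholder: assumed order of magnitude
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr,
                             betas=(0.9, 0.999), weight_decay=0.0)
# Cosine annealing down to 1% of the initial learning rate
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0.01 * base_lr)

for epoch in range(100):
    # ... run training and validation for one epoch, calling optimizer.step() ...
    scheduler.step()
```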
4.3. Evaluation Metrics
To assess the effectiveness of our proposed method in forest fire segmentation, we employed several evaluation metrics, namely Intersection over Union (IoU) [39], Precision, and Recall, computed specifically for the fire class.
We computed the evaluation metrics using the confusion matrix generated by our improved model, which includes the pixel counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Specifically, IoU measures the similarity between the predicted fire/non-fire areas and the ground truth, while Precision and Recall evaluate the accuracy and completeness of our method. Our results demonstrate the superior performance of our proposed method in accurately segmenting forest flames, as evidenced by the higher values of these evaluation metrics.
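For reference, the fire-class metrics can be computed from these confusion-matrix counts as follows; the function and the pixel counts in the example are hypothetical and serve only to illustrate the formulas.

```python
def fire_class_metrics(tp, fp, tn, fn):
    """IoU, Precision, and Recall for the fire class from pixel counts."""
    iou = tp / (tp + fp + fn)          # overlap between prediction and ground truth
    precision = tp / (tp + fp)         # how many predicted fire pixels are correct
    recall = tp / (tp + fn)            # how many true fire pixels are recovered
    return iou, precision, recall

# Example with hypothetical pixel counts
print(fire_class_metrics(tp=8000, fp=1200, tn=250000, fn=900))
# (0.792..., 0.869..., 0.898...)
```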
4.4. Results and Analysis
In this section, we compare the performance of our proposed method with several state-of-the-art semantic segmentation approaches, specifically focusing on the IoU, Precision, and Recall metrics for the fire class. For our comparative experiments, we selected a set of representative semantic segmentation networks, which are described in detail below:
FCN (Fully Convolutional Network): FCN [40] is a semantic segmentation network that replaces fully connected layers with convolutional layers, enabling end-to-end pixel-level prediction. It utilizes upsampling and skip connections to capture both local and global context information, resulting in accurate and detailed segmentation maps.
PSPNet (Pyramid Scene Parsing Network): PSPNet [41] is a semantic segmentation model that incorporates a pyramid pooling module to capture multi-scale contextual information. By aggregating features from different pyramid levels, PSPNet effectively captures context at various scales, allowing for robust and precise segmentation of objects in complex scenes.
U-Net: U-Net [42] is a popular network architecture for biomedical image segmentation. It consists of an encoder-decoder structure with skip connections. The encoder captures contextual information, while the decoder recovers spatial details using skip connections. U-Net is known for its ability to handle limited training data and produce accurate segmentation results.
DeepLabV3+: DeepLabV3+ [43] is an advanced semantic segmentation model that combines the strengths of DeepLabV3 and a modified encoder-decoder architecture. It utilizes atrous convolution and a multi-scale feature fusion module to capture fine-grained details and context information. DeepLabV3+ also incorporates a spatial pyramid pooling module to handle objects at different scales. This network achieves state-of-the-art performance in semantic segmentation tasks.
The validation results are shown in Table 3 and Figure 12. Despite achieving state-of-the-art performance in current mainstream semantic segmentation tasks, networks such as DeepLabV3+ struggle in the specific context of forest fire segmentation due to the extreme class imbalance of the fire class, which hinders the effective learning of fire-specific features. On the other hand, U-Net, with its unique architecture, is capable of handling limited training data and producing accurate segmentation results; consequently, in the comparison of base models, U-Net outperforms the other base models in all metrics. Our proposed model, built upon the DeepLabV3+ framework, addresses these limitations through various design improvements. As a result, our model achieves a 6.67% improvement in IoU, a 5.23% improvement in Precision, and a 3.27% improvement in Recall compared to the base model. Furthermore, our model also surpasses U-Net in all metrics.
In Figure 13, we provide visual representations of the prediction results obtained from different models. To ensure the inclusion of diverse scenarios, we carefully selected four typical situations to evaluate the models’ performance. In the first column, which corresponds to images with a higher proportion of fire, our model demonstrates superior accuracy compared to the other models. Even the U-Net model shows noticeable false detections, whereas our model consistently produces relatively accurate predictions. Moving to the second column, where the fire class is less prominent, our model showcases remarkable completeness in comparison to the competing models. In the third column, depicting fire-absent scenarios, our model avoids false detections altogether. Furthermore, in the fourth column, which presents fire images with complex backgrounds involving objects such as humans, trees, and smoke, our model accurately delineates the fire regions. Taking all these distinct scenarios into consideration, our model consistently outperforms mainstream semantic segmentation networks in both quantitative analysis and qualitative evaluation, establishing its superior performance and reliability.
4.5. Ablation Experiments
To further validate the effectiveness of our proposed method, we conducted ablation experiments. Specifically, our method consists of the fusion of Transformer, the incorporation of CBAM, the adoption of the adaptive Copy-Paste data augmentation method (ACP), and the integration of Dice loss (DL).
In Table 4, we demonstrate the impact of incorporating each method on various performance metrics of the model. It can be observed that each of our proposed methods contributes to the improvement of model performance. Specifically, the inclusion of CBAM results in a 5.43% increase in IoU, and the introduction of the Transformer module leads to an additional 2.42% improvement in IoU. Subsequently, with the incorporation of the adaptive Copy-Paste method and the Dice loss, the performance of the model is further enhanced.