1. Introduction
Forests, as the “green heart” and “ecological barrier” of the Earth, play an irreplaceable role in climate regulation, water conservation, biodiversity preservation, and carbon sequestration. However, in recent years, the increasing frequency of wildfires has not only caused large-scale destruction of forest resources and loss of animal and plant habitats but also triggered various secondary disasters such as soil erosion and water loss, severely impacting local ecosystems and socioeconomic conditions. For instance, the Australian bushfires that began in late 2019 swept across approximately 20% of the country’s land area, resulting in 33 fatalities, displacing tens of thousands of people, destroying over 3000 homes, and burning through 24 million hectares. More than one billion mammals, birds, and reptiles perished, with at least 34 species driven to extinction. This catastrophic event, which lasted for nearly 200 days, is conservatively estimated to have caused USD 5 billion in health and property damages. More recently, on 19 February 2024, a wildfire broke out in Huaga Village, Huaga Township, Shuicheng District, Liupanshui City, Guizhou Province, China, spreading to nearby villages in Pu’an County, Qianxinan Prefecture. Tragically, during the firefighting efforts in Longyin Town, two young firefighters, both in their 20s, lost their lives.
2. Related Works
To mitigate the safety risks posed by mountain wildfires, researchers worldwide have been exploring detection and early warning methods using satellite remote sensing and unmanned aerial vehicle (UAV) aerial detection. Satellite remote sensing, with its advantages of wide coverage, periodicity, and multi-spectral capabilities, plays an indispensable role in mountain wildfire monitoring. Zhao et al. [
1] proposed a new framework for near-real-time, early-stage mountain wildfire detection based on Himawari-8 satellite imagery, which outperformed the JAXA fire detection product by integrating spatiotemporal spectral information. Similarly, Zhang et al. [
2] utilized Himawari-8 satellite data to construct a spatiotemporal spectral recursive neural network model, achieving accurate detection of small-scale, early-stage, daytime, and night-time mountain wildfires. In addition to geostationary satellites, polar-orbiting satellites such as MODIS and VIIRS have been widely applied in mountain wildfire monitoring. Ding et al. [
3] developed an adaptive mountain wildfire detection algorithm called DBTDW based on MODIS data, which demonstrated high applicability under various spatiotemporal conditions. Ji et al. [
4] coupled the Bidirectional Reflectance Distribution Function (BRDF) physical model with deep learning techniques to achieve near-real-time monitoring of mountain wildfires using geostationary satellite imagery. Although these methods have made progress in addressing wildfire early warning issues, they still face limitations in monitoring range, data processing delays, warning accuracy, high alert costs, and susceptibility to meteorological factors due to constraints in spatiotemporal resolution, weather conditions, and satellite transmission costs. These limitations hinder the early detection and timely response to wildfires. In contrast to satellite remote sensing, UAVs offer advantages such as flexibility, mobility, and the ability to acquire high-resolution images, leading to their increasing application in mountain wildfire detection. Mohapatra et al. [
5] reviewed recent advances in UAV applications for mountain wildfire detection, focusing on monitoring systems based on sensor nodes, UAV aerial photography, and ground camera networks. Moghadasi et al. [
6] proposed a method for continuous mountain wildfire detection and monitoring using rotary-wing UAV formations, optimizing UAV trajectory planning to achieve sustained observation of suspected fire areas and fire mapping. Qiao et al. [
7] designed a UAV-based mountain wildfire detection system using visible light/infrared cameras, coupling algorithms for smoke and flame segmentation, camera pose estimation, and feature matching to achieve early detection and distance localization of mountain wildfires. Chuang et al. [
8] proposed using UAV swarms carrying L-band SAR and optical sensors to obtain high-resolution real-scene images through tomographic imaging techniques, enabling early identification of mountain wildfire hazards by inverting changes in tree dielectric constants.
In recent years, the rapid development of artificial intelligence methods such as machine learning and deep learning, and their widespread application in forest fire detection, has significantly enhanced the capability to detect mountain wildfires. Machine learning methods primarily involve automatically mining multi-dimensional features of images, including spectral, textural, and spatiotemporal characteristics, to construct classification decision functions or rules for forest fires. Representative methods include decision trees [
9], random forests [
10], and support vector machines [
11]. Research has shown that machine learning methods can significantly improve fire detection accuracy in complex mountainous terrain conditions. For example, Bar et al. [
12] used Landsat-8 and Sentinel-2 medium-resolution optical satellite images from 2016 to 2019 to identify forest fire areas in the western Himalayan state of Uttarakhand, India, through the Google Earth Engine (GEE) platform. They applied unsupervised classification using the Weka clustering algorithm to identify the shape and pattern of fire areas, and employed supervised classification algorithms such as Classification and Regression Trees (CART), Random Forest (RF), and Support Vector Machine (SVM). Results showed that CART and RF algorithms achieved similarly high accuracy (97–100%) in identifying forest fire areas. Janiec et al. [
13] employed two machine learning classification methods—Maximum Entropy (MaxENT) and Random Forest—to analyze satellite images and products of different spatial and spectral resolutions (Landsat TM, Modis TERRA, GMTED2010, and VIIRS), vector data (OSM), and bioclimatic variables (WORLDCLIM). They found that the Random Forest prediction model was more effective in improving accuracy and reducing risk areas, while the MaxENT method showed lower accuracy. Mohajane et al. [
14] developed five new hybrid machine learning algorithms—Frequency Ratio-Multilayer Perceptron (FR-MLP), Frequency Ratio-Logistic Regression (FR-LR), Frequency Ratio-Classification and Regression Tree (FR-CART), Frequency Ratio-Support Vector Machine (FR-SVM), and Frequency Ratio-Random Forest (FR-RF)—for mapping forest fire susceptibility. The results demonstrated that these hybrid models significantly improved the accuracy and performance of forest fire susceptibility studies, with the FR-RF model performing best (AUC = 0.989).
Unlike machine learning methods that rely on manual feature extraction, deep learning methods primarily learn multi-scale, multi-level deep features directly from raw images, thereby enhancing model performance in complex scenarios for fire point identification. Currently, deep learning-based mountain wildfire detection methods can be broadly categorized into two main types. The first is based on convolutional neural networks (CNNs) for mountain wildfire detection. Deep CNNs, with their powerful feature extraction and semantic expression capabilities, have become a research hotspot in the field of mountain wildfire detection. Ahmad et al. [
15] proposed the FireXNet model for wildfire detection, which adopts a lightweight structure similar to MobileNetV3 and introduces SHAP interpretability analysis, achieving performance superior to models such as VGG16 and DenseNet201 on resource-constrained devices. Wang et al. [
16] proposed an efficient real-time forest fire detection model called FireDetn for complex scenarios, introducing multi-scale detection heads, transformer encoders, and multi-head attention mechanisms to enhance the ability to capture global feature information and contextual information, thereby improving average precision in complex scenarios. Johnston et al. [
17] thoroughly investigated the performance of YOLOv5 for real-time mountain wildfire detection on embedded systems, particularly the Raspberry Pi 4. Through performance comparisons with YOLOv3 and YOLOv3-tiny, their results showed that the proposed system achieved high detection accuracy, low power consumption, and strong adaptability to real environments. Mukhiddinov et al. [
18] proposed an improved YOLOv5-based UAV visual early mountain wildfire smoke detection system, enhancing network architecture and detection speed by adding a spatial pyramid pooling fast layer, applying a bidirectional feature pyramid network, and employing network pruning and transfer learning methods. Their experimental results demonstrated the effectiveness of the proposed method and its superiority over other single-stage and two-stage object detectors. Casas et al. [
19] conducted a comprehensive comparison of YOLO series models in smoke and mountain fire detection, utilizing multiple performance metrics including recall, precision, F1 score, and mean average precision. Their findings indicate that YOLOv5, YOLOv7, and YOLOv8 demonstrate relatively balanced performance across all metrics, while YOLO-NAS variants excel in recall but underperform in precision. This underscores the importance of considering specific model performance in relation to practical application requirements when selecting an appropriate model. He et al. [
20] proposed two improved mountain fire detection models based on YOLOv5, reducing model parameters by simplifying the original network structure’s neck and head, and eliminating backbone modules. Experimental results demonstrate that these lightweight models maintain high accuracy and recall while adapting to embedded devices, enabling real-time fire monitoring. Li et al. [
21] introduced LEF-YOLO, a lightweight mountain fire detection model. By incorporating MobileNetV3's bottleneck structure and depth-wise separable convolutions, they reduced model complexity. Multi-scale feature fusion strategies, coordinate attention, and spatial pyramid pooling-fast blocks were employed to enhance feature extraction and improve detection accuracy. The LEF-YOLO model exhibited superior detection performance on an extreme forest fire dataset, achieving 2.7 GFLOPs, 61 FPS, and 87.9% mAP. Gonçalves et al. [
22] compared the performance of multiple models, including YOLOv7 and YOLOv8, in wildfire smoke detection for both ground-level and aerial imagery. They also discussed the impact of complex scene factors on detection accuracy.
Semantic segmentation approaches for mountain fire detection aim to assign semantic labels to each pixel in an image, enabling precise delineation of fire-affected areas. Valero et al. [
23] proposed an accurate wildfire area segmentation method based on UAV thermal infrared videos. Their approach enhances video registration accuracy through trajectory stabilization, foreground histogram equalization, and multi-reference frame strategies. The KAZE feature matching algorithm is employed to achieve stable and accurate frame-by-frame segmentation of wildfire videos, supporting fire behavior analysis. Bouguettaya et al. [
24] reviewed recent deep learning algorithms applied to UAV wildfire smoke segmentation, focusing on methods based on semantic segmentation networks such as FCN, U-Net, and SegNet. They systematically summarized key indicators, including accuracy and computational efficiency. Muksimova et al. [
25] proposed a wildfire segmentation method based on a dual encoder-decoder structure. By improving residual modules and attention gate mechanisms, they enhanced the network’s multi-scale feature extraction capabilities, outperforming existing methods in terms of accuracy, speed, and robustness. To further improve wildfire segmentation accuracy and real-time performance, some researchers have introduced novel network structures such as transformers to this field. Ghali et al. [
26] employed TransUNet and TransFire, two transformer-based semantic segmentation networks, to achieve precise segmentation of wildfire areas in UAV aerial images, with F1 scores exceeding 99%. Garcia et al. [
27] proposed a multi-layer wildfire smoke segmentation method based on level set theory, optimizing contour smoothness and segmentation confidence to enhance model detection performance.
AI-based fire detection methods using RGB image recognition and thermal imaging have been widely adopted. However, thermal imaging-based fire detection methods are often limited to infrared camera imaging areas and primarily use a single standard deviation as a distinguishing feature, which weakens their early fire detection capability when disturbed. Additionally, the high cost of thermal imaging equipment makes it challenging to widely implement in large-scale environments such as forests. Moreover, thermal imaging systems require substantial data processing power to analyze the vast amounts of data collected, resulting in high power consumption and necessitating high-performance computing resources. In contrast, AI-based fire detection using RGB images employs standard cameras already widely used in most surveillance systems, making it cost-effective and easy to integrate. RGB-based fire detection also leverages modern deep learning models, enabling fast and accurate real-time monitoring with high inference speed and flexibility. Therefore, this paper focuses on in-depth research into deep learning-based fire detection using RGB images.
Fire image detection technologies based on satellite or UAV vision have achieved notable success in mountain fire monitoring. However, their practical application still faces several major obstacles. First, complex terrain and environmental factors make fire detection difficult: the varied mountain terrain and diverse surface coverage, combined with atmospheric interference from clouds, fog, and smoke, make it hard to accurately isolate fire signals from complex backgrounds, leading to frequent missed detections or false alarms. Second, the rapid evolution of forest fire scales requires more adaptable detection models: in the early stages of mountain fires, smoke and flame areas are small and undergo rapid spatiotemporal changes, easily blending with complex backgrounds. Detecting these small targets early is crucial for preventing large-scale wildfires and protecting the ecological environment, so detection algorithms need enhanced adaptability to overcome limitations in UAV viewing angles, changes in fire area scale, and visual obstructions. Third, large models' high computational resource demands pose deployment challenges for onboard equipment: complex deep learning algorithms result in high computational resource consumption and long processing delays, making it difficult to meet the emergency response needs of mountain fire disasters. In light of these challenges, this paper proposes the YOLO-CSQ rapid mountain fire detection model based on YOLOv8n for quick identification and early warning of mountain fires in complex scenarios. The main contributions are as follows:
- (1)
By adding the P2 layer output to YOLOv8n and introducing the CBAM (Convolutional Block Attention Module) attention mechanism in the four scales of the neck network, the model’s ability to independently extract and fuse feature information at small scales is improved. This enables effective capture of multi-scale features of smoke and flame targets and enhances information interaction across scales, fully utilizing semantic information at different levels to improve small target detection capabilities and localization accuracy for larger targets.
- (2)
The ShuffleNetV2 backbone network structure is improved by introducing depth-wise separable convolutions (DWConv), the CBAM attention mechanism, and the h-swish activation function, replacing the original CSPDarknet53 backbone network structure. This reduces the model's parameter count and computational complexity while maintaining high feature extraction capabilities in a more lightweight backbone network, effectively reducing the model size and making it more suitable for deployment on detection devices with limited computational resources. Most importantly, it enhances the model's ability to detect small targets.
- (3)
A Quadrupled-ASFF detection head is proposed, and the loss function is optimized to enhance the model's understanding of complex scenes (especially those with high background noise or small-scale, occluded targets), improving the balance between localization accuracy and detection precision. Additionally, the WIoU loss function is introduced to replace the original CIoU loss, which can over-penalize aspect-ratio differences between predicted and ground truth boxes and generalizes poorly in complex mountain fire scenes, thereby enhancing the model's localization accuracy.
The remainder of this paper is organized as follows: Section 2 reviews related work on wildfire detection; Section 3 introduces the theoretical background of the original YOLOv8 detection method; Section 4 describes the network and structural improvements; Section 5 presents the dataset preparation, classification, and experimental results; and Section 6 discusses the results and provides conclusions.
3. YOLOv8 Detection Algorithm
YOLO (You Only Look Once), a classic real-time object detection algorithm, has played a significant role in defect identification, safety early warning, and other fields since its proposal in 2015. The relatively mature iteration YOLOv8 [28], proposed by Ultralytics, is an improvement on YOLOv5 [29]. It likewise adopts a single-stage detection strategy, integrating object localization and classification tasks into an end-to-end convolutional neural network. Compared with traditional two-stage object detection algorithms such as R-CNN [30], it significantly improves both detection speed and efficiency. The YOLOv8 family includes five versions: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. Among them, YOLOv8n has the lowest complexity, maintaining high detection accuracy while offering the fastest inference speed, which makes it convenient to deploy on mobile or embedded devices. Considering the requirements of lightweight deployment on airborne platforms for mountain fire detection and the need for real-time, high-precision detection, this paper selects YOLOv8n as the baseline model.
The YOLOv8n network structure mainly consists of three parts: Backbone, Neck, and Head, as shown in
Figure 1. The Backbone uses the CSPDarknet53 network, which replaces the original CSP (Cross Stage Partial) module of YOLOv5 with the C2f (Cross Stage Partial Network Fusion) module. The C2f module adopts richer gradient flow connections, effectively improving the model's nonlinear representation ability while keeping the model lightweight, thereby better handling complex image features. In addition, the SPPF module is retained in the Backbone, which converts the input features into adaptive-size outputs through serial max pooling operations to better capture multi-scale features in the image. The Neck part adopts the Path Aggregation Network with a Feature Pyramid Network (PAN-FPN) [
31] structure to further fuse the features transmitted by the Backbone. Compared with the PAN-FPN structure in YOLOv5, YOLOv8n removes the convolutional structure computation after upsampling in PAN and replaces the original C3 module with the C2f module, constructing a top-down and bottom-up network structure. This improves the model’s feature fusion efficiency while complementing shallow location information and deep semantic information, ensuring the completeness and diversity of the output feature maps. The Head part adopts the same decoupled head structure as YOLOX [
32], separating the classification and regression branches, allowing each part of the model to focus on its specific task. By further processing the feature maps output by the Neck part, it predicts the location, category, and confidence of the target. Moreover, the anchor-free approach describes the detection target using key points or center points together with boundary information, which is better suited to detecting dense targets and targets with large scale variations in mountain smoke and fire detection.
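For orientation, the YOLOv8n baseline described above can be exercised through the Ultralytics Python API; the snippet below is a minimal, illustrative sketch in which the dataset configuration `wildfire.yaml`, the image `fire_scene.jpg`, and the training settings are placeholders rather than the configuration used in this work.

```python
# Minimal sketch of training and running the YOLOv8n baseline with the
# Ultralytics package. "wildfire.yaml" and "fire_scene.jpg" are placeholder
# paths; the epochs/imgsz values are generic defaults, not this paper's setup.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                     # pretrained YOLOv8n weights
model.train(data="wildfire.yaml", imgsz=640, epochs=100)
results = model.predict("fire_scene.jpg", conf=0.25)
for r in results:
    print(r.boxes.xyxy, r.boxes.conf, r.boxes.cls)  # boxes, scores, classes
```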
Although the YOLOv8 model has balanced overall performance and performs well in many general scenarios, it still faces some challenges in the actual detection process due to the complex environment of mountain fires. Firstly, the background information in mountain fire scenes is redundant, and factors such as smoke, trees, and terrain can interfere with the identification of fire conditions, requiring the model to have strong robustness and generalization ability. Secondly, the shape and size of mountain fires vary, and there is a problem of large-scale differences, requiring the model to take into account the detection of targets at different scales. Moreover, under night conditions or low visibility conditions, the image quality is poor, requiring the model to adapt to low light, blur, and other interference factors. Finally, real-time performance is one of the important requirements for mountain fire detection, requiring the model to maintain a high inference speed while ensuring detection accuracy to support real-time detection and early warning of mountain fires.
4. Methods
To further improve the detection efficiency of mountain fire images on UAV-borne visual platforms, this paper proposes the YOLO-CSQ object detection algorithm, based on the traditional YOLOv8n, for dense small-target mountain fire detection in complex scenes. The network structure is shown in
Figure 2. Firstly, by increasing the output of the P2 layer in YOLOv8n and introducing the CBAM (Convolutional Block Attention Module) attention mechanism in the neck network, the model’s ability to independently extract and fuse feature information at small-scale levels is improved. This enables effective capture of multi-scale features of smoke and flame targets, enhances information interaction between cross-scale features, and fully utilizes semantic information at different levels, thereby improving the detection ability of small targets and the localization accuracy of large-range targets. Secondly, an improved lightweight ShuffleNetV2 network is employed to replace the original backbone network. While introducing depthwise separable convolution and h-swish activation function, the CBAM attention mechanism is added before the P2 layer output to improve the model’s detection performance. Subsequently, an improved four-head ASFF (Adaptive Spatial Feature Fusion) detection head is introduced, which enhances the detection capability of multi-scale targets by adaptively adjusting the fusion weights of features at different scales. Finally, the WIoU loss function is introduced to replace the original CIoU loss function, considering the specific weight of each pair of bounding boxes. This provides a more flexible and refined evaluation mechanism for object detection tasks in complex scenes or extreme conditions, improving the model’s learning efficiency and generalization ability.
4.1. Improved Attention Mechanism
Due to the influence of illumination direction and the dynamic characteristics of flame and smoke targets, precise target detection in complex mountain fire scenes still faces considerable challenges, mainly reflected in the following three aspects. First, environmental conditions cause fluctuations in the range and intensity of illumination, leading to uneven brightness distribution in flame and smoke images, weakening the contrast between target and background, and making it harder for the model to detect true targets. Second, due to the rapid movement and deformation of flames and smoke in the fire scene, target edge contours become unclear, affecting the model's localization accuracy. Third, the complex terrain and rich vegetation cover in mountainous areas generate a large amount of redundant background information, making it difficult to distinguish smoke and fire targets from the background. To address these issues, some scholars have introduced attention mechanisms into object detection models to enhance useful features, suppress irrelevant ones, and enable the model to adaptively focus on key regions in the image, thereby improving detection accuracy.
CBAM (Convolutional Block Attention Module) [
33], as an attention mechanism widely used in computer vision tasks, mainly consists of two sub-modules: Channel Attention Module and Spatial Attention Module. The Channel Attention Module obtains global information of the feature map through global average pooling and global max pooling operations, and then uses a multi-layer perceptron (MLP) [
34] to learn the interdependencies between channels, generating channel weights to identify the importance of each channel in the feature map. The Spatial Attention Module generates spatial weights by applying convolutional operations, effectively focusing on the importance of each spatial region in the feature map. CBAM combines the channel attention and spatial attention weights with the original feature map through element-wise multiplication to enhance the features of important channels and spatial regions while suppressing the unimportant parts. This process significantly improves the quality of the feature map, providing richer and more useful information for subsequent tasks such as mountain fire detection. The structure of the CBAM attention module is shown in
Figure 3.
Assume the input feature map $F$ has dimensions $C \times H \times W$, where $C$ represents the number of channels, and $H$ and $W$ represent the height and width of the feature map, respectively. In the CBAM module, the channel attention module first applies Global Average Pooling (GAP) and Global Max Pooling (GMP) operations to obtain the global average and maximum values for each channel, generating two feature vectors $F_{avg}^{c}$ and $F_{max}^{c}$ with dimensions of $C \times 1 \times 1$. The structure of the channel attention mechanism is illustrated in Figure 4.
The global average pooling operation can be expressed as:

$$F_{avg}^{c}(k) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{k}(i, j)$$

The global maximum pooling operation can be expressed as:

$$F_{max}^{c}(k) = \max_{1 \le i \le H,\; 1 \le j \le W} x_{k}(i, j)$$

where $F_{avg}^{c}(k)$ represents the output result after applying average pooling to the $k$-th channel, $F_{max}^{c}(k)$ denotes the output result after performing max pooling on the $k$-th channel, and $x_{k}(i, j)$ represents the feature value at spatial position $(i, j)$ in the $k$-th channel of the input feature map.
After obtaining the two feature vector descriptors, the Channel Attention Module uses a shared multilayer perceptron (MLP) to transform $F_{avg}^{c}$ and $F_{max}^{c}$, learning the interdependencies between channels. To reduce the number of model parameters, the number of neurons in the hidden layer is set to a reduced value $C/r$, where the reduction rate $r$ determines the degree of neuron reduction. After processing each descriptor through the shared network, the outputs are aggregated through element-wise addition to obtain a unified output feature vector. The MLP output generates the channel weights $M_{c}(F)$ through the Sigmoid activation function, and the calculation formula is as follows:

$$M_{c}(F) = \sigma\left(W_{1}\left(W_{0}\left(F_{avg}^{c}\right)\right) + W_{1}\left(W_{0}\left(F_{max}^{c}\right)\right)\right)$$

where $\sigma$ represents the Sigmoid activation function, and $W_{0}$ and $W_{1}$ are the weight parameters of the multi-layer perceptron (MLP).
The Spatial Attention Module is mainly a complement to the Channel Attention Module. Its structure is shown in Figure 5. Firstly, average pooling and max pooling operations are performed along the channel dimension to reduce the channel dimensionality and generate two feature maps $F_{avg}^{s}$ and $F_{max}^{s}$, each with a size of $1 \times H \times W$. The two feature maps are then concatenated and sent to a convolutional layer for learning.

The average pooling operation along the channel dimension can be represented as:

$$F_{avg}^{s}(i, j) = \frac{1}{C} \sum_{k=1}^{C} x_{k}(i, j)$$

The max pooling operation along the channel dimension can be represented as:

$$F_{max}^{s}(i, j) = \max_{1 \le k \le C} x_{k}(i, j)$$

where $x_{k}$ represents the feature map of the $k$-th channel of the input feature map.
Subsequently, $F_{avg}^{s}$ and $F_{max}^{s}$ are concatenated along the channel dimension to obtain a feature map with a size of $2 \times H \times W$, and a $7 \times 7$ convolutional layer is applied to learn the interdependencies between different spatial locations, ultimately obtaining the attention weights $M_{s}(F)$ on the spatial dimension. The calculation process is as follows:

$$M_{s}(F) = \sigma\left(f^{7 \times 7}\left(\left[F_{avg}^{s}; F_{max}^{s}\right]\right)\right)$$

where $\sigma$ is the Sigmoid activation function, $f^{7 \times 7}$ represents processing using a convolutional kernel of size $7 \times 7$, and $[\,\cdot\,;\,\cdot\,]$ represents the concatenation operation along the channel dimension.

The original feature map is element-wise multiplied with the generated channel attention weights and spatial attention weights in turn to obtain the final weighted feature map:

$$F' = M_{c}(F) \otimes F, \qquad F'' = M_{s}(F') \otimes F'$$
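The channel and spatial attention computations above can be summarized in a short PyTorch sketch; it follows the standard CBAM formulation, with the reduction rate r = 16 and the 7 × 7 spatial kernel used as typical defaults rather than values reported in this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        # Shared MLP applied to both the GAP and GMP channel descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))               # GAP descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))                # GMP descriptor
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)  # channel weights M_c(F)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                # channel-wise average
        mx = x.amax(dim=1, keepdim=True)                 # channel-wise maximum
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s(F)

class CBAM(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.ca = ChannelAttention(channels, r)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # F' = M_c(F) * F
        return x * self.sa(x)   # F'' = M_s(F') * F'
```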
The CBAM attention mechanism effectively enhances the model’s feature extraction ability [
35]. Through adaptive adjustment of the weights of different channels and spatial locations in the feature map, the model’s representation ability of mountain fire region features is enhanced, the interference of complex backgrounds and noise is suppressed, and the feature utilization efficiency is improved, enabling the model to focus more on wildfire-related features. Concurrently, the model’s generalization ability is enhanced, enabling it to adapt to diverse wildfire types and shooting conditions. This results in more accurate and robust wildfire detection results. Furthermore, the computational complexity of the CBAM attention mechanism is relatively low, facilitating integrated operations. While improving wildfire detection performance, it maintains the model’s detection efficiency, meeting the requirements of practical applications.
4.2. Improved Backbone Network Architecture
The shapes of smoke and flames in mountain fire images are diverse, with blurred boundaries, often requiring a larger receptive field to capture their contextual information. Furthermore, the influence of illumination changes and background interference necessitates a more robust model for feature extraction. Although the traditional YOLOv8n's CSPDarknet53 backbone network performs well in general scenarios, it still struggles to adapt well to mountain fire target detection scenarios due to limitations in its receptive field range and feature extraction capabilities. In comparison to the traditional CSPDarknet53, ShuffleNetV2 [
36], as a lightweight backbone network, exhibits distinctive advantages in the context of mountain fire detection. The structure of the network is depicted in
Figure 6.
In channel splitting, the input feature channels are divided into two branches, each with an equal number of channels, with the objective of reducing memory access costs and improving computational efficiency. The formula representation is as follows:

$$c_{1} = c_{2} = \frac{c_{in}}{2}$$

where $c_{in}$ represents the number of input channels, and $c_{1}$ and $c_{2}$ represent the number of output channels for the two branches, respectively.
Subsequently, the two branches undergo processing through a series of convolutions, such as $1 \times 1$ point-wise convolutions and $3 \times 3$ depth-wise convolutions, in order to achieve a balance between computational complexity and model capacity. Finally, the results of the two processed branches are merged, and channel shuffling is performed to enhance information flow between different channels, thereby improving the model's representational power. In spatial downsampling operations, ShuffleNetV2 employs convolutions with a stride greater than 1 to reduce the spatial size of the feature map while maintaining feature richness.
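A compact PyTorch sketch of the channel split and channel shuffle operations described above (illustrative only; it assumes an even channel count and omits the per-branch convolutions):

```python
import torch

def channel_split(x):
    # Split the input channels into two equal branches: c1 = c2 = c_in / 2
    c = x.size(1) // 2
    return x[:, :c], x[:, c:]

def channel_shuffle(x, groups=2):
    # Interleave channels across groups so information mixes between branches
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

x = torch.randn(1, 64, 80, 80)            # dummy feature map
left, right = channel_split(x)            # two 32-channel branches
# ... per-branch convolutions would be applied here ...
merged = torch.cat([left, right], dim=1)  # merge the two branches
out = channel_shuffle(merged)             # enhance cross-branch information flow
```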
Nevertheless, ShuffleNetV2 still exhibits certain deficiencies in the context of mountain fire detection. Firstly, the network depth and receptive field of ShuffleNetV2 remain relatively limited, which presents a challenge in fully capturing long-range dependencies and global contextual information in mountain fire scenes. Secondly, ShuffleNetV2 is deficient in sufficient scale invariance, rendering it incapable of effectively handling smoke and fire targets of varying sizes. To address these issues, this paper proposes an enhanced ShuffleNetV2 backbone network structure, as illustrated in
Figure 7. The incorporation of depth-wise separable convolution (DWConv) [
37], the CBAM attention mechanism, and the h-swish activation function has led to a notable enhancement in the model’s performance with regard to mountain fire detection.
Firstly, in the feature extraction stage of the model’s backbone network, by introducing a set of depthwise separable convolution (DWConv) kernels combined with the original convolutional layers, the model’s receptive field is expanded. This allows the network to capture richer contextual information while only slightly increasing the computational load. It enhances the model’s ability to recognize irregular and boundary-blurred targets in mountain fire scenes, strengthens the robustness of feature extraction, and enables more precise handling of the complexity in mountain fire scenes.
Secondly, in the final stage of the backbone network, by integrating the CBAM attention mechanism, multi-scale parallel learning and cross-spatial information interaction are realized, establishing connections between features of different scales. This improves the model’s detection performance for multi-scale targets, particularly in capturing features of smoke and flames with varying scales, effectively increasing the detection rate of small and weak targets and the localization accuracy of large targets.
Finally, to further optimize the model’s performance, we replace the traditional ReLU activation function with the h-swish activation function. By leveraging its superior nonlinear expression ability and smoothness characteristics, it alleviates the gradient vanishing problem that may be caused by the ReLU function, enabling the model to converge quickly and enhancing its generalization ability. Moreover, introducing the h-swish activation function reduces the model’s dependence on cross-channel correlation and spatial correlation, allowing the model to focus more on channel information recognition and extraction, thereby improving the model’s accuracy and reliability in the mountain fire detection task.
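One plausible arrangement of such an improved unit is sketched below; the placement of the depth-wise convolution, the attention module (e.g., the CBAM block sketched in Section 4.1), and the h-swish activation is illustrative and may differ from the exact block configuration used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSwish(nn.Module):
    # h-swish(x) = x * ReLU6(x + 3) / 6: a smooth, inexpensive replacement for ReLU
    def forward(self, x):
        return x * F.relu6(x + 3.0) / 6.0

class ImprovedShuffleUnit(nn.Module):
    """Illustrative layout: channel split, then 1x1 conv -> 3x3 depth-wise conv
    (DWConv) -> 1x1 conv with h-swish on one branch, an optional attention
    module (e.g., CBAM), followed by merging and channel shuffle."""
    def __init__(self, channels, attention=None):
        super().__init__()
        c = channels // 2
        self.attention = attention if attention is not None else nn.Identity()
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), HSwish(),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),  # DWConv
            nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), HSwish(),
        )

    def forward(self, x):
        left, right = x.chunk(2, dim=1)               # channel split
        right = self.attention(self.branch(right))    # DWConv path (+ attention)
        out = torch.cat([left, right], dim=1)
        b, c, h, w = out.shape                        # channel shuffle (2 groups)
        return out.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```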
4.3. Improved Quadrupled-ASFF (Adaptive Spatial Feature Fusion) Detection Head Structure
Traditional object detection networks typically employ a fixed feature fusion strategy, combining feature maps of different scales with preset weights to generate final detection results. However, real-world mountain fire scenes are complex and varied, with smoke and fire targets often exhibiting significant multi-scale features. Using a “one-size-fits-all” fusion approach can ignore the specificity of targets at different scales, making it challenging to adapt to the multi-scale characteristics of mountain fire scenes, resulting in limited detection performance. To address this issue, researchers have proposed introducing an Adaptive Spatial Feature Fusion (ASFF) [
38] module to adaptively adjust the fusion weights of features at different scales, enhancing the model’s detection capability for multi-scale targets. The core idea of ASFF is to dynamically adjust the fusion weights of different feature maps based on the scale features of the targets, allowing the model to adaptively focus on targets of various scales.
Figure 8 depicts the ASFF module, which initially adjusts the feature maps $X^{1}$, $X^{2}$, and $X^{3}$ from disparate convolutional layers to a uniform spatial resolution through up-sampling or down-sampling. Subsequently, the module learns to generate weight maps ($\alpha$, $\beta$, and $\gamma$) for each feature map, which are employed to dynamically adjust the fusion weight of each feature map at each spatial location. The weight maps are normalized through a softmax layer to ensure that the sum of weights is 1. For each location $(i, j)$, the weight maps must satisfy the following conditions and weight calculation formula:

$$\alpha_{ij} + \beta_{ij} + \gamma_{ij} = 1, \qquad \alpha_{ij}, \beta_{ij}, \gamma_{ij} \in [0, 1]$$

$$\alpha_{ij} = \frac{e^{\lambda_{ij}^{\alpha}}}{e^{\lambda_{ij}^{\alpha}} + e^{\lambda_{ij}^{\beta}} + e^{\lambda_{ij}^{\gamma}}}$$

where $\alpha_{ij}$ (and analogously $\beta_{ij}$ and $\gamma_{ij}$) represents the weight map of the corresponding feature map, and $\lambda_{ij}^{\alpha}$, $\lambda_{ij}^{\beta}$, and $\lambda_{ij}^{\gamma}$ are the learned control logits.

After learning the weight maps, each feature map is weighted according to its corresponding weight map, and all the weighted feature maps are summed to form the final fused feature map $Y$. The calculation process of the fused feature map can be represented as:

$$Y_{ij} = \alpha_{ij} \cdot X_{ij}^{1} + \beta_{ij} \cdot X_{ij}^{2} + \gamma_{ij} \cdot X_{ij}^{3}$$

where $\cdot$ represents element-wise multiplication; $X_{ij}^{l}$ represents the adjusted feature map of the $l$-th scale; $\alpha_{ij}$, $\beta_{ij}$, and $\gamma_{ij}$ are the learned weight maps of the corresponding feature maps, determining their contribution to the fusion process; and $Y$ represents the final fused feature map.
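The adaptive fusion can be written compactly in PyTorch, as in the sketch below; following the standard ASFF construction, per-level 1 × 1 convolutions produce weight logits that are normalized with a softmax at every spatial location. The same scheme extends directly to four inputs, as used by the Quadrupled-ASFF head described next.

```python
import torch
import torch.nn as nn

class ASFFFusion(nn.Module):
    """Adaptive spatial feature fusion over n feature maps that have already
    been resized to a common resolution and channel width (illustrative)."""
    def __init__(self, channels, n_levels=3):
        super().__init__()
        # One 1x1 conv per level produces a single-channel weight logit map
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(n_levels)]
        )

    def forward(self, feats):  # feats: list of (B, C, H, W) tensors
        logits = torch.cat(
            [conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1
        )                                        # (B, n_levels, H, W)
        weights = torch.softmax(logits, dim=1)   # weights sum to 1 per location
        return sum(weights[:, i:i + 1] * f for i, f in enumerate(feats))

# Three rescaled feature maps -> one fused map; use n_levels=4 for the
# Quadrupled-ASFF variant with the additional P2-level input.
f1, f2, f3 = (torch.randn(1, 128, 80, 80) for _ in range(3))
fused = ASFFFusion(128, n_levels=3)([f1, f2, f3])
```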
Although ASFF can adaptively adjust the weights of feature fusion and enhance the model’s detection performance for multi-scale targets, mountain fire scenes contain a considerable number of minute targets (e.g., distant fire points, thin smoke), rendering it challenging to accurately localize and identify these targets solely through the original three scales of feature maps. To further enhance the model’s detection capability for tiny targets, this paper proposes an improved Quadrupled-ASFF detection head, as shown in
Figure 9.
The Quadrupled-ASFF detection head introduces an additional feature map $X^{0}$, corresponding to the P2 layer in the backbone network, on top of the original three scales of feature maps, $X^{1}$, $X^{2}$, and $X^{3}$. By increasing the output of the P2 layer, Quadrupled-ASFF is able to obtain richer detail information, which enables the detection of minute targets. Following the addition of the new prediction head, the weight calculation formula and feature fusion calculation formula are presented as follows:

Weight calculation formula:

$$\alpha_{ij} + \beta_{ij} + \gamma_{ij} + \delta_{ij} = 1, \qquad \alpha_{ij}, \beta_{ij}, \gamma_{ij}, \delta_{ij} \in [0, 1]$$

Feature fusion calculation formula:

$$Y_{ij} = \alpha_{ij} \cdot X_{ij}^{0} + \beta_{ij} \cdot X_{ij}^{1} + \gamma_{ij} \cdot X_{ij}^{2} + \delta_{ij} \cdot X_{ij}^{3}$$

The improved Quadrupled-ASFF prediction head effectively improves the model's detection capability for tiny targets, while maintaining high recognition performance for medium and large-sized targets, by increasing the output of the P2 layer. It also enhances the model's understanding of complex scenes, particularly in cases with high background noise or occlusion between targets, enabling more accurate localization and identification of tiny targets in complex mountain fire environments.
4.4. Improved Loss Function
The YOLOv8n algorithm employs the CIoU loss function as the default loss function for bounding box regression. The rationale behind the utilization of CIoU loss is to address the shortcomings of the IoU loss function, which is unable to provide effective gradients in specific instances, such as when two bounding boxes do not overlap. Furthermore, CIoU loss takes into account the distance between the center points of the bounding boxes and their aspect ratios, thereby enhancing the model's localization accuracy. The calculation formula for CIoU loss is as follows:

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - IoU) + v}$$

where $IoU$ represents the intersection over union of the predicted box and the ground truth box; $\rho\left(b, b^{gt}\right)$ is the Euclidean distance between the center points of the predicted and ground truth boxes; $c$ is the diagonal length of the smallest enclosing area containing the two boxes; $v$ considers the aspect ratio consistency of the predicted and ground truth boxes; $w$ and $h$ are the width and height of the predicted box, respectively, while $w^{gt}$ and $h^{gt}$ are the width and height of the ground truth box; and $\alpha$ is a weight parameter.
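For reference, the CIoU loss above can be computed as in the following sketch for boxes in (x1, y1, x2, y2) format; this is a minimal implementation of the standard formula, not the exact code used in this work.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for matched box pairs of shape (N, 4) in (x1, y1, x2, y2) format."""
    # IoU term
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Center-distance term: rho^2 / c^2 (c is the enclosing-box diagonal)
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its weight alpha
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```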
Despite the enhancements brought about by the CIoU loss function in bounding box regression performance, it still exhibits certain limitations. In the event of a significant discrepancy between the aspect ratio of the predicted and ground truth boxes, the model may be unduly penalized, which could have a detrimental impact on its learning efficiency and ultimate performance. Moreover, the CIoU loss function may lack the capacity to generalize in complex mountain fire scene conditions.
To further enhance the model’s performance in the mountain fire detection task, this paper introduces the WIoU loss function [
39] as a replacement for the original loss function. In calculating the IoU score, WIoU assigns differential importance to each pair of predicted and ground truth boxes by incorporating specific weights for the bounding boxes. This weighted strategy enables the model to evaluate the overlap quality between different bounding boxes in a more flexible and meticulous manner, making it effective for object detection tasks in complex scenes. The WIoU calculation formula is as follows:

$$WIoU = \frac{\sum_{i=1}^{N} w_{i} \cdot IoU\left(b_{i}, b_{i}^{gt}\right)}{\sum_{i=1}^{N} w_{i}}$$

where $N$ represents the number of annotated ground truth boxes; $b_{i}$ represents the coordinates of the $i$-th predicted box; $b_{i}^{gt}$ represents the coordinates of the $i$-th ground truth box; $IoU\left(b_{i}, b_{i}^{gt}\right)$ represents the IoU value between the corresponding predicted and ground truth boxes; and $w_{i}$ represents the weight value.
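Read literally, the formulation above amounts to a weighted average of per-box IoU scores; the sketch below follows that reading, assuming the weights w_i are supplied externally (e.g., reflecting box scale or quality). Note that the Wise-IoU loss proposed in [39] additionally incorporates a dynamic, distance-based focusing term, which is not shown here.

```python
import torch

def weighted_iou_score(pred, target, weights, eps=1e-7):
    """Weighted IoU over N matched box pairs in (x1, y1, x2, y2) format,
    following the formulation described in the text (illustrative only)."""
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    return (weights * iou).sum() / (weights.sum() + eps)  # weighted average IoU
```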