1. Introduction
Over the past few decades, frequent forest fires worldwide have caused devastating consequences. Due to climate change, the duration of extreme wildfire events has increased by over 20% [1], and wildfires pose a severe threat to human survival [2]. With the development of computer vision technology, researchers have explored its use for object detection in forest fire images [3], and there is an urgent need for an efficient and accurate forest fire detection method to identify fires faster.
Initially, the collection of forest fire information relied mainly on sensor technology [4], including multi-sensor fusion [5,6] and infrared sensors [7]. However, these methods are only effective within a limited geographical area and have high implementation costs. Remote sensing satellite technology is widely used for fire monitoring, but it is susceptible to weather conditions. The development of deep learning technology provides new solutions for forest fire detection [8,9].
Convolutional neural networks (CNNs) have been widely applied to the classification of fire smoke [10,11,12]. For example, Zhang et al. [13] proposed a multi-level joint CNN classifier, Majid et al. [14] used transfer learning and CNN models to locate fire areas, and Zhang et al. [15] designed a dual-channel CNN model for classifying fires of different sizes. However, the detection efficiency of these methods is limited; object detection technology offers greater accuracy and speed.
Object detection technology is mainly divided into two categories: two-stage detectors (R-CNN [16], Faster R-CNN [17], and FPN [18]) and one-stage detectors (such as SSD [19] and YOLO [20,21]). Among them, YOLO has received widespread attention. Alexandrov et al. [22] compared SSD and Faster R-CNN with YOLOv2, and the results indicated that YOLOv2 performed better. Zheng et al. [23] expanded the dataset and compared YOLOv3 with EfficientDet, showing that YOLOv3 performed outstandingly in real-time detection speed. Xu et al. [24] proposed a method combining YOLOv5 with classifiers. Chen et al. [25] proposed an improved version of YOLOv5s that constructs a BiFPN in the Neck and incorporates the CA attention mechanism to enhance detection performance. Wang et al. [26] improved on YOLOv6, using CBAM to enhance feature extraction and the CIoU loss function to reduce training time. Other studies have also improved on the YOLO series, but detection accuracy for forest fire images still needs improvement [27]. Owing to its excellent precision and speed, the YOLO series has attracted the attention of many researchers [28,29]. However, these methods still have limitations: missed detections of small targets lead to low precision, and improving accuracy typically requires more parameters.
Table 1 provides a comparative overview of the object detection techniques discussed above, highlighting their respective merits and limitations.
In recent years, many scholars have explored the application of fire detection to UAVs and edge computing devices [31,32]. UAV-mounted cameras have been used for target detection in forest fire images and videos, but problems remain in video processing and feature extraction [33,34]. YOLO series algorithms show strong results in target detection tasks; in particular, the release of YOLOv5 brought a qualitative leap to the YOLO series. Li et al. improved on YOLOv5 and achieved a certain breakthrough. Yang [35] proposed SIMCB-YOLO, an improved multistage forest fire smoke detection model based on YOLOv5, which improves smoke detection accuracy by adding a Swin Transformer and introduces a convolutional layer and a sequence of channel blocks to improve surface feature extraction. However, the added modules increase the model's parameter count, computational complexity, and memory requirements. Because existing algorithms have many parameters and edge computing devices have limited resources, deploying large-scale models is difficult; lightweight technology provides new opportunities here [36]. Huang et al. [30] proposed a lightweight neural network model and deployed it on edge computing devices for real-time monitoring. Chen [37] proposed a lightweight model for small-target smoke detection, which incorporates RepVGG in the backbone network to enhance feature extraction while achieving lossless compression of the model at the inference stage. In addition, forest fires change dynamically, so the resource demands on edge computing devices change as well: in the early stage of a fire, a few edge devices may suffice, but as the fire spreads, more resources must be allocated [38]. Zuo et al. [39] designed a dynamic UAV deployment scheme based on edge computing to adjust resources in real time.
Overall, an increase in parameters results in a larger output weight file, which negatively impacts real-time detection performance and complicates deployment. Thus, balancing accuracy and the number of parameters remains a key challenge. To address this challenge, this paper introduces an improved model based on YOLOv7-tiny [39], called Mcan-YOLO, designed to enhance detection precision without increasing the number of parameters. The main contributions of this work are as follows:
- (1) The AFPN module is introduced to improve detection precision across objects of different scales while reducing the model's parameter complexity. Additionally, the upsampling method is replaced with CARAFE, which boosts precision without adding extra parameters.
- (2) An attention mechanism, NAM, is incorporated into the Neck section, allowing the convolutional kernel to focus more precisely on target areas for feature extraction, thereby further enhancing detection precision. Following this analysis, the Mish activation function was chosen to improve both precision and the model's generalization ability.
- (3) To validate the effectiveness of the proposed model, a real forest fire dataset was constructed using the MSSIM metric to filter highly repetitive images (see the sketch after this list), and ablation experiments were performed to compare it with classical algorithms. Additionally, Grad-CAM was used to verify the model's attention to different regions, further confirming the method's effectiveness.
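As a concrete illustration of this filtering step, the following is a minimal sketch of SSIM-based near-duplicate removal; the `filter_near_duplicates` helper, the 0.90 threshold, and the 256 × 256 working size are illustrative assumptions rather than the authors' exact pipeline.

```python
# Hypothetical sketch of MSSIM-based near-duplicate filtering; the threshold
# and working size are assumptions for illustration, not the paper's values.
import cv2
from skimage.metrics import structural_similarity as ssim

def filter_near_duplicates(image_paths, threshold=0.90, size=(256, 256)):
    kept, last_gray = [], None
    for path in sorted(image_paths):
        img = cv2.imread(path)
        if img is None:
            continue  # skip unreadable files
        gray = cv2.cvtColor(cv2.resize(img, size), cv2.COLOR_BGR2GRAY)
        # keep the image only if it differs enough from the last kept one
        if last_gray is None or ssim(last_gray, gray) < threshold:
            kept.append(path)
            last_gray = gray
    return kept
```

Each candidate is compared in grayscale against the most recently kept image, so highly repetitive frames are dropped while diverse scenes are retained.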
This article is organized as follows: Section 2 provides a detailed description of the data collection process, improvement methods, evaluation metrics, and experimental environment parameters. In Section 3, the effectiveness of the proposed method is confirmed through ablation experiments and comparisons with classical models, with an in-depth analysis and discussion of the detection results. Section 4 discusses all the obtained results. Finally, Section 5 offers a summary and outlines future research directions.
3. Results
3.1. Ablation Experiment
3.1.1. Evaluations of Different Components
To evaluate the effectiveness of each module under consistent experimental conditions, we integrated different modules into the model and tested them on the same dataset. The detailed results of these experiments are presented in Table 4.
As shown in Table 4, the baseline model (YOLOv7-tiny) without any additional modules achieved moderate overall performance but performed worse on small object detection, as reflected by its mAP50s of 53.9%. Adding the AFPN module improved recall from 80.3% to 82.2% and also increased both mAP50 and mAP50s, demonstrating AFPN's effectiveness in enhancing recall and overall model performance. The CARAFE module further improved precision and recall, although its contribution to mAP50s was smaller. Incorporating the Mish activation function raised precision to 86.5%. When the NAM was added, all metrics improved, including an increase in mAP50s from 53.9% to 54.1%, indicating that the NAM enhanced the model's ability to detect small targets. The model achieved its best performance when all four modules (AFPN, CARAFE, Mish, and NAM) were combined: 90.9% precision, 86.8% recall, an 88.8% F1 score, 91.5% mAP50, and 63.2% mAP50s. This combination outperformed any individual module alone; for example, the combination of AFPN and CARAFE yielded an mAP50s of 62.0%, whereas AFPN and CARAFE alone achieved 57.0% and 54.0%, respectively.
Table 4 demonstrates that integrating AFPN, CARAFE, Mish, and NAM—either individually or in combination—improved the model’s overall performance, particularly in detecting small objects. Among these, AFPN and NAM were the most effective at improving recall and mAP50s, while the Mish activation function significantly boosted precision. When used together, these techniques allowed the model to achieve superior performance across all metrics, especially in mAP50s, which improved from 53.9% to 63.2% compared to the baseline. The integration of the NAM, with its spatial and channel attention mechanisms, significantly enhanced the model’s ability to detect small targets in early-stage forest fires, with minimal computational cost. This attention mechanism effectively allocates weights across channels and spatial areas, improving the model’s focus on fire-specific regions and reducing false negatives. AFPN enhanced multi-scale feature extraction by merging features from different layers, improving the model’s accuracy while reducing the number of parameters, thus demonstrating its advantages over YOLOv7-tiny’s original feature fusion network. The CARAFE operator, known for its low redundancy and strong feature fusion capabilities, improved the completeness of feature information during upsampling. Finally, the Mish activation function, with its smooth gradient properties, improved gradient flow within the deep network, leading to better accuracy and generalization in fire detection tasks.
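To illustrate the upsampling mechanism discussed above, the following is a minimal PyTorch sketch of the CARAFE idea (content-aware kernel prediction followed by neighborhood reassembly). The hyperparameters (k_up = 5, k_enc = 3, c_mid = 64) follow the published CARAFE defaults and are assumptions here; the exact configuration used in Mcan-YOLO may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Sketch of CARAFE upsampling: predict per-location reassembly kernels
    from the content, then combine each input neighborhood with them."""
    def __init__(self, channels, scale=2, k_up=5, k_enc=3, c_mid=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, c_mid, kernel_size=1)
        self.predict = nn.Conv2d(c_mid, (k_up * k_up) * scale * scale,
                                 kernel_size=k_enc, padding=k_enc // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # 1) content-aware kernels, one k_up x k_up kernel per output pixel
        kernels = self.predict(self.compress(x))        # (b, k^2*s^2, h, w)
        kernels = F.pixel_shuffle(kernels, self.scale)  # (b, k^2, s*h, s*w)
        kernels = F.softmax(kernels, dim=1)             # normalized weights
        # 2) gather the k_up x k_up neighborhood of every input location
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)
        patches = patches.view(b, c, self.k_up ** 2, h, w)
        patches = patches.repeat_interleave(self.scale, dim=3)
        patches = patches.repeat_interleave(self.scale, dim=4)
        # 3) reassemble: weighted sum of each neighborhood
        return (patches * kernels.unsqueeze(1)).sum(dim=2)  # (b, c, s*h, s*w)

# e.g., CARAFE(256)(torch.randn(1, 256, 20, 20)) -> torch.Size([1, 256, 40, 40])
```

Because the reassembly kernels are predicted from the features themselves, the upsampler can follow object boundaries instead of blurring them as fixed bilinear kernels do.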
Table 5 compares the detection performance of Mcan-YOLO with the original YOLOv7-tiny model across different categories (Overall, Fire, and Smoke). Mcan-YOLO outperformed YOLOv7-tiny in all key metrics, including precision, recall, F1 score, mAP50, and mAP50s. Specifically, Mcan-YOLO achieved a precision of 90.9% and a recall of 86.8% in the overall category, compared to 86.3% and 80.3% for YOLOv7-tiny. In the specific categories of fire and smoke detection, Mcan-YOLO also demonstrated higher accuracy and recall. Although Mcan-YOLO had a slight increase in parameters (0.3 M) and computational complexity (FLOPs) compared to YOLOv7-tiny, its performance improvements, particularly in small target detection (mAP50s), were significant, outperforming YOLOv7-tiny by nearly 10 percentage points. These results indicate that the additional modules in Mcan-YOLO effectively capture multi-scale information and enhance detection of both fire and smoke, providing superior detection performance while maintaining a relatively lightweight architecture.
To mitigate the risk of overfitting associated with a single division of training and test sets, we applied k-fold cross-validation [50] (k = 5), randomly dividing the dataset into five groups to train both YOLOv7-tiny and Mcan-YOLO. The results, shown in Table 6, indicate that Mcan-YOLO consistently outperformed YOLOv7-tiny across all folds, achieving an average mAP50 of 91.36%, compared to 87.07% for YOLOv7-tiny. Additionally, the standard deviations for both models (0.36% for Mcan-YOLO and 0.34% for YOLOv7-tiny) demonstrated good robustness with minimal performance fluctuation. These results highlight the superior stability and accuracy of Mcan-YOLO in k-fold cross-validation.
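A minimal sketch of the splitting and scoring procedure is shown below; `train_and_eval` is a placeholder for training one model on a fold and returning its mAP50, not the authors' actual training script.

```python
# Minimal 5-fold cross-validation sketch using scikit-learn.
from statistics import mean, stdev
from sklearn.model_selection import KFold

def cross_validate(image_paths, train_and_eval, seed=0):
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for fold, (train_idx, val_idx) in enumerate(kf.split(image_paths)):
        train = [image_paths[i] for i in train_idx]
        val = [image_paths[i] for i in val_idx]
        scores.append(train_and_eval(train, val))  # mAP50 on this fold
    return mean(scores), stdev(scores)  # average and standard deviation
```

Averaging the per-fold mAP50 and reporting its standard deviation, as in Table 6, makes the comparison robust to any single lucky or unlucky split.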
3.1.2. Effectiveness of NAM
Different attention mechanisms have varying impacts on the YOLO model, especially when processing complex scenes and multiscale objects. In this experiment, we compared four attention mechanisms: the NAM introduced in our model and three alternatives, CBAM, SE, and CA. As shown in Table 7, the NAM stood out by improving both precision and mAP50 while maintaining the same number of parameters as the baseline YOLOv7-tiny model. In contrast, CA and CBAM resulted in decreases in precision and mAP50, while SE provided only slight improvements. These results highlight the efficiency of the NAM in enhancing detection performance, particularly for small targets, without increasing model complexity.
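For reference, the channel-attention branch of NAM can be sketched as follows, following the published NAM formulation in which the absolute values of the BatchNorm scale factors serve as channel-importance weights; this is an illustrative sketch, and the exact integration into Mcan-YOLO's Neck may differ.

```python
import torch
import torch.nn as nn

class NAMChannelAttention(nn.Module):
    """Sketch of NAM channel attention: the learned BatchNorm scale factors
    measure channel importance and reweight the normalized features."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=True)

    def forward(self, x):
        residual = x
        x = self.bn(x)
        # normalize |gamma| into per-channel weights that sum to one
        gamma = self.bn.weight.abs()
        x = x * (gamma / gamma.sum()).view(1, -1, 1, 1)
        return torch.sigmoid(x) * residual  # gated residual output
```

Because the weights are read directly from an existing BatchNorm layer, the module adds essentially no parameters, consistent with the parameter counts reported in Table 7.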
The variation curve of mAP50 across epochs is presented in Figure 14, demonstrating that the NAM provides superior stability in terms of mAP50. Compared to other attention mechanisms, the NAM-enhanced model exhibits a faster growth trend and quicker convergence, indicating more efficient learning. This suggests that the NAM not only improves detection accuracy but also accelerates the training process, leading to more reliable and consistent performance over time.
3.1.3. Effectiveness of Mish
Different activation functions affect the model's performance and training process differently. To evaluate their suitability, three activation functions, namely Mish, HardSwish, and FReLU [51], were compared with LeakyReLU. The comparison results are shown in Figure 15. The results indicate that the Mish activation function reached a convergent state faster than the other functions. In the training loss curve across epochs, the model using Mish not only converged faster but also showed a similar trend on the validation set, indicating that Mish enhances both training efficiency and generalization, leading to more stable and reliable performance during validation.
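For reference, Mish is the smooth, non-monotonic activation

$$\mathrm{Mish}(x) = x \cdot \tanh\bigl(\mathrm{softplus}(x)\bigr) = x \cdot \tanh\bigl(\ln(1 + e^{x})\bigr),$$

whose smoothness, in contrast to the piecewise-linear LeakyReLU, underlies the well-behaved gradient flow associated with the faster convergence observed here.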
3.2. Comparative Experiment
To evaluate the effectiveness of Mcan-YOLO, we compared it against several mainstream detection algorithms, including Faster R-CNN, SSD, Li et al. [29], RepVGG-YOLOv7 [37], and YOLOv8n [52]. The comparative results are shown in Table 8. The results demonstrate that Mcan-YOLO consistently outperforms these models in both accuracy and efficiency. Specifically, Mcan-YOLO achieved a precision of 90.9% and an mAP50 of 91.5%, while maintaining a relatively small parameter count of 5.5 M and a computational cost of 14.0 GFLOPs.
While models like Faster R-CNN and SSD struggled with lower precision and larger model sizes, the YOLO-based models performed better overall. Faster R-CNN, despite being accurate in certain scenarios, suffered from a heavy computational load with 137.1 M parameters, making it unsuitable for real-time applications like forest fire detection. SSD, while faster, demonstrated lower accuracy, with a precision of 86.3% and an mAP50 of 76.8%, and its relatively large parameter count of 26.2 M still hindered its efficiency in resource-constrained environments. Among the YOLO-based models, RepVGG-YOLOv7 delivered high accuracy with a precision of 90.8% and an mAP50 of 91.4%, but at the cost of a significantly larger parameter count of 35.2 M and a computational load of 102.7 GFLOPs, limiting its practicality in real-time applications where both speed and model efficiency are crucial.
In contrast, Mcan-YOLO demonstrated a superior balance by incorporating additional modules that enhance multiscale information processing without significantly increasing the model size or computational cost. With a precision of 90.9%, mAP50 of 91.5%, and a parameter count of just 5.5 M, Mcan-YOLO achieved performance comparable to or better than RepVGG-YOLOv7 while being significantly more efficient. Additionally, its computational cost of 14.0 GFLOPs is far lower than RepVGG-YOLOv7, making it much more practical for real-time deployment. Notably, Mcan-YOLO also showed clear advantages in small target detection, which is critical for early fire and smoke recognition.
To visually compare the detection performance of each model, we selected representative images from the dataset for experimentation, as shown in Figure 16. Faster R-CNN and SSD struggled with false positives, often mislabeling firefighters with similar colors as fires, and both models exhibited low confidence levels. The method by Li et al. [29] missed thin smoke, while YOLOv8n produced false positives that affected detection clarity. RepVGG-YOLOv7 tended to miss small fire targets during detection. In contrast, Mcan-YOLO achieved higher detection confidence, particularly for small fires and smoke of varying shapes. Moreover, Mcan-YOLO effectively reduced the false detection rate for interfering objects in complex backgrounds, showcasing its robustness in challenging environments.
3.3. Visualization Analysis
This section evaluates the detection performance of the models in various scenarios. Using the same detection samples, the Mcan-YOLO model consistently demonstrated higher confidence in recognizing both fire and smoke. Typical images are presented in Figure 17. In the case of YOLOv7-tiny (Figure 17a,c), the confidence values for fire and smoke were 0.66 and 0.83, respectively. In contrast, Mcan-YOLO (Figure 17b,d) significantly improved these values to 0.88 and 0.94, highlighting its enhanced detection accuracy and reliability.
Additionally, the detection of small objects was analyzed. YOLOv7-tiny showed a tendency to miss small flames, whereas Mcan-YOLO demonstrated a very low miss rate, as illustrated in Figure 18. In the YOLOv7-tiny detection sample (Figure 18a), several small fire targets were not detected. In contrast, Mcan-YOLO (Figure 18b) successfully detected these small targets, such as the fire in the bottom right corner. Figure 18c,d provide a comparison for detecting sparse smoke, where Mcan-YOLO effectively recognized thin smoke that YOLOv7-tiny failed to detect.
To provide a more intuitive demonstration of the model's effectiveness, we employed Grad-CAM for feature map visualization experiments. Two images from the test set were selected to generate heatmaps, in which darker areas indicate a higher probability of being identified as the target and lighter areas represent the background. As illustrated in Figure 19a, Mcan-YOLO accurately captured the true shape of the target flames, precisely highlighting the areas most likely to contain fire. In contrast, in Figure 19b, where the colors of the smoke and background are similar, YOLOv7-tiny mistakenly labeled parts of the background as the target area. Mcan-YOLO, however, successfully distinguished the smoke from the background, correctly identifying the target area and avoiding mislabeling. These results demonstrate that Mcan-YOLO excels at identifying key targets even in challenging environments with complex backgrounds, confirming its robustness and accuracy in practical applications.
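As an illustrative sketch of the visualization procedure (not the exact code used here): Grad-CAM weights a chosen layer's activations by the global-average-pooled gradients of a scalar score. The `score_fn` argument below is a hypothetical hook for reducing the model output to one scalar; for a detector such as Mcan-YOLO this would be, for example, the class confidence of a selected prediction.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, score_fn):
    """Generic Grad-CAM sketch: global-average-pooled gradients weight the
    target layer's activations; score_fn maps model output to one scalar."""
    feats, grads = [], []
    fh = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    bh = target_layer.register_full_backward_hook(
        lambda m, gin, gout: grads.append(gout[0]))
    model.zero_grad()
    score = score_fn(model(image))  # scalar to explain, e.g. a class score
    score.backward()
    fh.remove(); bh.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # GAP over gradients
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize 0..1
    return F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                         align_corners=False)[0, 0]  # (H, W) heatmap
```

The resulting heatmap is overlaid on the input image, producing visualizations like those in Figure 19.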
In summary, the feature map visualizations show that Mcan-YOLO exhibits higher precision and robustness in object recognition, with its advantages being especially pronounced in complex backgrounds.
3.4. Generalization Experiment
To evaluate the generalization capability of Mcan-YOLO, we conducted experiments using the M4SFWD [53] dataset, which includes a variety of fire and smoke scenarios. The dataset consists of 3985 images with a resolution of 640 × 640. It was specifically designed to test model performance in diverse and challenging real-world conditions, making it well suited for assessing the robustness and adaptability of detection algorithms.
Table 9 presents the detection results of the various models in the generalization experiment. Mcan-YOLO performed strongly across all key metrics, including precision, recall, F1 score, and mAP50. With an mAP50 of 87.9%, Mcan-YOLO ranks just behind RepVGG-YOLOv7, which achieved the highest mAP50 of 89.1%. However, Mcan-YOLO outperforms the other models in balanced performance, particularly in recall and F1 score, making it a strong contender in both accuracy and generalization.
4. Discussion
Climate change has exacerbated the frequency and duration of forest fires, creating an urgent need for efficient and accurate detection methods, whereas traditional fire information collection methods are limited by geographical and weather conditions. Deep learning, especially convolutional neural networks (CNNs) and the YOLO series of object detection techniques, has improved the efficiency of fire and smoke classification but still faces the challenges of missed small-target detections and growing parameter requirements; further research and improvement are therefore needed in the field of forest fire image target detection.
Traditional sensor-based methods are often used for fire point and smoke detection, but they are expensive and effective only in a limited geographical area. With the rise of convolutional neural networks (CNNs), their use for fire point and smoke detection in forest fire images has attracted wide attention, although CNN models require high computational power and have limitations when used alone for forest fire image detection. The introduction of YOLO technology, with the high response speed and accuracy of the YOLO series models, has promoted its application to fire point and smoke detection, and improving the YOLO model has become one of the main approaches to forest fire image target detection. For example, the SIMCB-YOLO model improved on YOLOv5. Although such methods achieve high accuracy, their complex structures inevitably require more computational power. Lightweight technology provides a breakthrough for this problem, and lightweight models for target detection in forest fire images have become a research hotspot, such as the RepVGG-YOLOv7 model and the method proposed by Zuo et al. [39]. For a lightweight object detection model, finding the balance between accuracy and the number of parameters remains a difficult problem.
In this study, an improved forest fire and smoke detection model called Mcan-YOLO is proposed based on the YOLOv7-tiny architecture. The model accounts for the difficulty of detecting small fire and smoke targets in complex environments: it introduces the AFPN module to enhance the detection of objects at different scales and uses the CARAFE upsampling operator and the NAM attention module to further refine feature extraction, thereby obtaining higher detection accuracy with fewer parameters. The experimental results show that the Mcan-YOLO model outperformed the YOLOv7-tiny baseline in several key metrics: precision increased by 4.6%, recall by 6.5%, and mean average precision (mAP50) by 4.7%. The ablation experiments in Table 4 verified that each added module substantially improved fire point and smoke detection in forest fire images. The comparative experiments in Table 8 and Figure 16, together with the generalization experiments in Table 9, verify that the model not only has excellent detection capability on the experimental dataset but also achieves a balanced performance on the M4SFWD dataset that is competitive with other state-of-the-art models.
The Mcan-YOLO model proposed in this study outperforms the existing comparison models with its light weight and high accuracy. However, the training process for target detection requires a large amount of labelled data, and data labelling is time-consuming and labor-intensive. In particular, the boundaries of fire points and smoke in forest fire images are difficult to define accurately, which limits the method's application to some extent. In subsequent research, we will consider integrating semi-supervised and weakly supervised techniques into the Mcan-YOLO model to achieve higher accuracy while reducing the need for labelled data.
5. Conclusions
In this study, we proposed Mcan-YOLO, an improved version of YOLOv7-tiny, specifically designed for enhanced forest fire detection. Several key enhancements were introduced. First, the AFPN was integrated into the neck network for multi-scale detection, effectively balancing feature information across different scales while optimizing model parameters. Second, CARAFE was employed for upsampling, improving precision with fewer parameters. Third, the NAM was added to address the complex backgrounds of forest fire smoke, enhancing the model’s adaptability. Finally, the Mish activation function was introduced to improve the model’s convergence. Compared to YOLOv7-tiny, Mcan-YOLO achieved a 4.6% increase in precision for forest fire and smoke detection while also reducing the model’s parameters by 5%. A dedicated dataset was constructed for the experiments, and the MSSIM algorithm was used to eliminate highly repetitive images, ensuring a more diverse training set. The experimental results demonstrated that Mcan-YOLO outperformed other object detection methods, offering improved accuracy with fewer parameters. This makes Mcan-YOLO a more practical and efficient solution for real-time forest fire detection.
Future work can explore several avenues to further enhance detection reliability. First, improving the dataset remains a challenge, as real forest fire images are difficult to collect. Generative adversarial networks (GANs) could be used to generate synthetic yet realistic fire images to expand the dataset. Additionally, we will explore innovative lightweight techniques to seamlessly integrate Mcan-YOLO into emergency response systems, incorporating a human verification process to improve accuracy and minimize false alarms.