1. Introduction
Over the past few decades, frequent forest fires worldwide have caused devastating consequences. Due to climate change, the duration of extreme wildfire events has increased by over 20% [1], and wildfires pose a severe threat to human survival [2]. With the development of computer vision technology, researchers have explored its use for object detection in forest fire images [3], and there is an urgent need for an efficient and accurate forest fire detection method to identify fires faster.
Initially, the collection of forest fire information relied mainly on sensor technology [4], including multi-sensor fusion [5,6] and infrared sensors [7]. However, these methods are only effective within a limited geographical area and have high implementation costs. Remote sensing satellite technology is widely used for fire monitoring, but it is susceptible to weather conditions. The development of deep learning technology provides new solutions for forest fire detection [8,9].
Convolutional neural networks (CNNs) have been widely applied to the classification of fire smoke [10,11,12]. For example, Zhang et al. [13] proposed a multi-level joint CNN classifier, Majid et al. [14] used transfer learning and CNN models to locate fire areas, and Zhang et al. [15] designed a dual-channel CNN model for classifying fires of different sizes. However, the detection efficiency of these methods is limited; object detection technology offers greater accuracy and speed.
Object detection technology is mainly divided into two categories: two-stage detectors (R-CNN [16], Faster R-CNN [17], and FPN [18]) and one-stage detectors (such as SSD [19] and YOLO [20,21]). Among them, YOLO has received widespread attention. Alexandrov et al. [22] compared SSD and Faster R-CNN with YOLOv2, and the results indicated that YOLOv2 performed better. Zheng et al. [23] expanded the dataset and compared YOLOv3 with EfficientDet, showing that YOLOv3 performed outstandingly in real-time detection speed. Xu et al. [24] proposed a method combining YOLOv5 with classifiers. Chen et al. [25] proposed an improved version of YOLOv5s that constructs a BiFPN in the Neck and incorporates the CA attention mechanism to enhance detection performance. Wang et al. [26] improved on YOLOv6, using CBAM to enhance feature extraction and the CIoU loss function to reduce training time. Other studies have also improved on the YOLO series, but detection accuracy for forest fire images still needs improvement [27]. Owing to its excellent precision and speed, the YOLO series has attracted the attention of many researchers [28,29]. However, these methods still have limitations: missed detections of small targets lead to low precision, and improving accuracy typically requires more parameters.
Table 1 provides a comparative overview of the object detection techniques discussed above, highlighting their respective merits and limitations.
In recent years, many scholars have explored the application of fire detection to UAVs and edge computing devices [31,32]. UAV-mounted cameras have been used for target detection in forest fire images and videos, but problems remain in video processing and feature extraction [33,34]. YOLO series algorithms show strong results in target detection tasks; in particular, the release of YOLOv5 brought a qualitative leap to the YOLO series. Li et al. improved on YOLOv5 and achieved a certain breakthrough. Yang [35] proposed SIMCB-YOLO, an improved multistage forest fire smoke detection model based on YOLOv5, which improves smoke detection accuracy by adding a Swin Transformer and introduces a convolutional layer and a sequence of channel blocks to improve surface feature extraction. However, the added modules increase the model's parameter count, computational complexity, and memory requirements. Because existing algorithms have many parameters and edge computing devices have limited resources, deploying large-scale models is difficult; lightweight technology provides new opportunities here [36]. Huang et al. [30] proposed a lightweight neural network model and deployed it on edge computing devices for real-time monitoring. Chen [37] proposed a lightweight model for small-target smoke detection, which incorporates RepVGG in the backbone network to enhance feature extraction while achieving lossless compression of the model at the inference stage. In addition, forest fires change dynamically, so the resource demands on edge computing devices change as well: in the early stage of a fire, a few edge devices may suffice, but as the fire spreads, more resources must be allocated [38]. Zuo et al. [39] designed a dynamic UAV deployment scheme based on edge computing to adjust resources in real time.
Overall, an increase in parameters results in a larger output weight file, which negatively impacts real-time detection performance and complicates deployment. Thus, balancing accuracy and the number of parameters remains a key challenge. To address this challenge, this paper introduces an improved model based on YOLOv7-tiny [39], called Mcan-YOLO, designed to enhance detection precision without increasing the number of parameters. The main contributions of this work are as follows:
- (1) The AFPN module is introduced to improve detection precision across objects of different scales while reducing the model's parameter complexity. Additionally, the upsampling method is replaced with CARAFE, which boosts precision without adding extra parameters.
- (2) An attention mechanism, NAM, is incorporated into the Neck section, allowing the convolutional kernel to focus more precisely on target areas for feature extraction, thereby further enhancing detection precision. Following this analysis, the Mish activation function was chosen to improve both precision and the model's generalization ability.
- (3) To validate the effectiveness of the proposed model, a real forest fire dataset was constructed using the MSSIM metric to filter highly repetitive images (see the sketch after this list), and ablation experiments were performed to compare it with classical algorithms. Additionally, Grad-CAM was used to verify the model's attention to different regions, further confirming the method's effectiveness.
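As a concrete illustration of this filtering step, the following is a minimal sketch of SSIM-based near-duplicate removal; the `filter_near_duplicates` helper, the 0.90 threshold, and the 256 × 256 working size are illustrative assumptions rather than the authors' exact pipeline.

```python
# Hypothetical sketch of MSSIM-based near-duplicate filtering; the threshold
# and working size are assumptions for illustration, not the paper's values.
import cv2
from skimage.metrics import structural_similarity as ssim

def filter_near_duplicates(image_paths, threshold=0.90, size=(256, 256)):
    kept, last_gray = [], None
    for path in sorted(image_paths):
        img = cv2.imread(path)
        if img is None:
            continue  # skip unreadable files
        gray = cv2.cvtColor(cv2.resize(img, size), cv2.COLOR_BGR2GRAY)
        # keep the image only if it differs enough from the last kept one
        if last_gray is None or ssim(last_gray, gray) < threshold:
            kept.append(path)
            last_gray = gray
    return kept
```

Each candidate is compared in grayscale against the most recently kept image, so highly repetitive frames are dropped while diverse scenes are retained.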
This article is organized as follows: Section 2 provides a detailed description of the data collection process, improvement methods, evaluation metrics, and experimental environment parameters. In Section 3, the effectiveness of the proposed method is confirmed through ablation experiments and comparisons with classical models, with an in-depth analysis and discussion of the detection results. Section 4 discusses all the obtained results. Finally, Section 5 offers a summary and outlines future research directions.
3. Results
3.1. Ablation Experiment
3.1.1. Evaluations of Different Components
To evaluate the effectiveness of each module under consistent experimental conditions, we integrated different modules into the model and tested them on the same dataset. The detailed results of these experiments are presented in Table 4.
As shown in Table 4, the baseline model (YOLOv7-tiny) without any additional modules achieved moderate overall performance but performed worse on small object detection, as reflected by its mAP50s of 53.9%. Adding the AFPN module improved recall from 80.3% to 82.2% and also increased both mAP50 and mAP50s, demonstrating AFPN's effectiveness in enhancing recall and overall model performance. The CARAFE module further improved precision and recall, although its contribution to mAP50s was smaller. Incorporating the Mish activation function raised precision to 86.5%. When the NAM was added, all metrics improved, including an increase in mAP50s from 53.9% to 54.1%, indicating that the NAM enhanced the model's ability to detect small targets. The model achieved its best performance when all four modules (AFPN, CARAFE, Mish, and NAM) were combined: 90.9% precision, 86.8% recall, an 88.8% F1 score, 91.5% mAP50, and 63.2% mAP50s. This combination outperformed any individual module alone; for example, the combination of AFPN and CARAFE yielded an mAP50s of 62.0%, whereas AFPN and CARAFE alone achieved 57.0% and 54.0%, respectively.
Table 4 demonstrates that integrating AFPN, CARAFE, Mish, and NAM—either individually or in combination—improved the model’s overall performance, particularly in detecting small objects. Among these, AFPN and NAM were the most effective at improving recall and mAP50s, while the Mish activation function significantly boosted precision. When used together, these techniques allowed the model to achieve superior performance across all metrics, especially in mAP50s, which improved from 53.9% to 63.2% compared to the baseline. The integration of the NAM, with its spatial and channel attention mechanisms, significantly enhanced the model’s ability to detect small targets in early-stage forest fires, with minimal computational cost. This attention mechanism effectively allocates weights across channels and spatial areas, improving the model’s focus on fire-specific regions and reducing false negatives. AFPN enhanced multi-scale feature extraction by merging features from different layers, improving the model’s accuracy while reducing the number of parameters, thus demonstrating its advantages over YOLOv7-tiny’s original feature fusion network. The CARAFE operator, known for its low redundancy and strong feature fusion capabilities, improved the completeness of feature information during upsampling. Finally, the Mish activation function, with its smooth gradient properties, improved gradient flow within the deep network, leading to better accuracy and generalization in fire detection tasks.
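To illustrate the upsampling mechanism discussed above, the following is a minimal PyTorch sketch of the CARAFE idea (content-aware kernel prediction followed by neighborhood reassembly). The hyperparameters (k_up = 5, k_enc = 3, c_mid = 64) follow the published CARAFE defaults and are assumptions here; the exact configuration used in Mcan-YOLO may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Sketch of CARAFE upsampling: predict per-location reassembly kernels
    from the content, then combine each input neighborhood with them."""
    def __init__(self, channels, scale=2, k_up=5, k_enc=3, c_mid=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, c_mid, kernel_size=1)
        self.predict = nn.Conv2d(c_mid, (k_up * k_up) * scale * scale,
                                 kernel_size=k_enc, padding=k_enc // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # 1) content-aware kernels, one k_up x k_up kernel per output pixel
        kernels = self.predict(self.compress(x))        # (b, k^2*s^2, h, w)
        kernels = F.pixel_shuffle(kernels, self.scale)  # (b, k^2, s*h, s*w)
        kernels = F.softmax(kernels, dim=1)             # normalized weights
        # 2) gather the k_up x k_up neighborhood of every input location
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)
        patches = patches.view(b, c, self.k_up ** 2, h, w)
        patches = patches.repeat_interleave(self.scale, dim=3)
        patches = patches.repeat_interleave(self.scale, dim=4)
        # 3) reassemble: weighted sum of each neighborhood
        return (patches * kernels.unsqueeze(1)).sum(dim=2)  # (b, c, s*h, s*w)

# e.g., CARAFE(256)(torch.randn(1, 256, 20, 20)) -> torch.Size([1, 256, 40, 40])
```

Because the reassembly kernels are predicted from the features themselves, the upsampler can follow object boundaries instead of blurring them as fixed bilinear kernels do.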
Table 5 compares the detection performance of Mcan-YOLO with the original YOLOv7-tiny model across different categories (Overall, Fire, and Smoke). Mcan-YOLO outperformed YOLOv7-tiny in all key metrics, including precision, recall, F1 score, mAP50, and mAP50s. Specifically, Mcan-YOLO achieved a precision of 90.9% and a recall of 86.8% in the overall category, compared to 86.3% and 80.3% for YOLOv7-tiny. In the specific categories of fire and smoke detection, Mcan-YOLO also demonstrated higher accuracy and recall. Although Mcan-YOLO had a slight increase in parameters (0.3 M) and computational complexity (FLOPs) compared to YOLOv7-tiny, its performance improvements, particularly in small target detection (mAP50s), were significant, outperforming YOLOv7-tiny by nearly 10 percentage points. These results indicate that the additional modules in Mcan-YOLO effectively capture multi-scale information and enhance detection of both fire and smoke, providing superior detection performance while maintaining a relatively lightweight architecture.
To mitigate the risk of overfitting associated with a single division of training and test sets, we applied k-fold cross-validation [50] (k = 5), randomly dividing the dataset into five groups to train both YOLOv7-tiny and Mcan-YOLO. The results, shown in Table 6, indicate that Mcan-YOLO consistently outperformed YOLOv7-tiny across all folds, achieving an average mAP50 of 91.36%, compared to 87.07% for YOLOv7-tiny. Additionally, the standard deviations for both models (0.36% for Mcan-YOLO and 0.34% for YOLOv7-tiny) demonstrated good robustness with minimal performance fluctuation. These results highlight the superior stability and accuracy of Mcan-YOLO in k-fold cross-validation.
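A minimal sketch of the splitting and scoring procedure is shown below; `train_and_eval` is a placeholder for training one model on a fold and returning its mAP50, not the authors' actual training script.

```python
# Minimal 5-fold cross-validation sketch using scikit-learn.
from statistics import mean, stdev
from sklearn.model_selection import KFold

def cross_validate(image_paths, train_and_eval, seed=0):
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for fold, (train_idx, val_idx) in enumerate(kf.split(image_paths)):
        train = [image_paths[i] for i in train_idx]
        val = [image_paths[i] for i in val_idx]
        scores.append(train_and_eval(train, val))  # mAP50 on this fold
    return mean(scores), stdev(scores)  # average and standard deviation
```

Averaging the per-fold mAP50 and reporting its standard deviation, as in Table 6, makes the comparison robust to any single lucky or unlucky split.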
3.1.2. Effectiveness of NAM
Different attention mechanisms have varying impacts on the YOLO model, especially when processing complex scenes and multiscale objects. In this experiment, we compared four attention mechanisms: the NAM introduced in our model and three alternatives, CBAM, SE, and CA. As shown in Table 7, the NAM stood out by improving both precision and mAP50 while maintaining the same number of parameters as the baseline YOLOv7-tiny model. In contrast, CA and CBAM resulted in decreases in precision and mAP50, while SE provided only slight improvements. These results highlight the efficiency of the NAM in enhancing detection performance, particularly for small targets, without increasing model complexity.
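For reference, the channel-attention branch of NAM can be sketched as follows, following the published NAM formulation in which the absolute values of the BatchNorm scale factors serve as channel-importance weights; this is an illustrative sketch, and the exact integration into Mcan-YOLO's Neck may differ.

```python
import torch
import torch.nn as nn

class NAMChannelAttention(nn.Module):
    """Sketch of NAM channel attention: the learned BatchNorm scale factors
    measure channel importance and reweight the normalized features."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=True)

    def forward(self, x):
        residual = x
        x = self.bn(x)
        # normalize |gamma| into per-channel weights that sum to one
        gamma = self.bn.weight.abs()
        x = x * (gamma / gamma.sum()).view(1, -1, 1, 1)
        return torch.sigmoid(x) * residual  # gated residual output
```

Because the weights are read directly from an existing BatchNorm layer, the module adds essentially no parameters, consistent with the parameter counts reported in Table 7.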
The variation curve of mAP50 across epochs is presented in Figure 14, demonstrating that the NAM provides superior stability in terms of mAP50. Compared to other attention mechanisms, the NAM-enhanced model exhibits a faster growth trend and quicker convergence, indicating more efficient learning. This suggests that the NAM not only improves detection accuracy but also accelerates the training process, leading to more reliable and consistent performance over time.
3.1.3. Effectiveness of Mish
Different activation functions affect the model's performance and training process differently. To evaluate their suitability, three activation functions, namely Mish, HardSwish, and FReLU [51], were compared with LeakyReLU. The comparison results are shown in Figure 15. The results indicate that the Mish activation function reached a convergent state faster than the other functions. In the training loss curve across epochs, the model using Mish not only converged faster but also showed a similar trend on the validation set, indicating that Mish enhances both training efficiency and generalization, leading to more stable and reliable performance during validation.
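For reference, Mish is the smooth, non-monotonic activation

$$\mathrm{Mish}(x) = x \cdot \tanh\bigl(\mathrm{softplus}(x)\bigr) = x \cdot \tanh\bigl(\ln(1 + e^{x})\bigr),$$

whose smoothness, in contrast to the piecewise-linear LeakyReLU, underlies the well-behaved gradient flow associated with the faster convergence observed here.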
3.2. Comparative Experiment
To evaluate the effectiveness of Mcan-YOLO, we compared it against several mainstream detection algorithms, including Faster R-CNN, SSD, Li et al. [29], RepVGG-YOLOv7 [37], and YOLOv8n [52]. The comparative results are shown in Table 8. The results demonstrate that Mcan-YOLO consistently outperforms these models in both accuracy and efficiency. Specifically, Mcan-YOLO achieved a precision of 90.9% and an mAP50 of 91.5%, while maintaining a relatively small parameter count of 5.5 M and a computational cost of 14.0 GFLOPs.
While models like Faster R-CNN and SSD struggled with lower precision and larger model sizes, the YOLO-based models performed better overall. Faster R-CNN, despite being accurate in certain scenarios, suffered from a heavy computational load with 137.1 M parameters, making it unsuitable for real-time applications like forest fire detection. SSD, while faster, demonstrated lower accuracy, with a precision of 86.3% and an mAP50 of 76.8%, and its relatively large parameter count of 26.2 M still hindered its efficiency in resource-constrained environments. Among the YOLO-based models, RepVGG-YOLOv7 delivered high accuracy with a precision of 90.8% and an mAP50 of 91.4%, but at the cost of a significantly larger parameter count of 35.2 M and a computational load of 102.7 GFLOPs, limiting its practicality in real-time applications where both speed and model efficiency are crucial.
In contrast, Mcan-YOLO demonstrated a superior balance by incorporating additional modules that enhance multiscale information processing without significantly increasing the model size or computational cost. With a precision of 90.9%, mAP50 of 91.5%, and a parameter count of just 5.5 M, Mcan-YOLO achieved performance comparable to or better than RepVGG-YOLOv7 while being significantly more efficient. Additionally, its computational cost of 14.0 GFLOPs is far lower than RepVGG-YOLOv7, making it much more practical for real-time deployment. Notably, Mcan-YOLO also showed clear advantages in small target detection, which is critical for early fire and smoke recognition.
To visually compare the detection performance of each model, we selected representative images from the dataset for experimentation, as shown in Figure 16. Faster R-CNN and SSD struggled with false positives, often mislabeling firefighters with similar colors as fires, and both models exhibited low confidence levels. The method by Li et al. [29] missed thin smoke, while YOLOv8n produced false positives that affected detection clarity. RepVGG-YOLOv7 tended to miss small fire targets during detection. In contrast, Mcan-YOLO achieved higher detection confidence, particularly for small fires and smoke of varying shapes. Moreover, Mcan-YOLO effectively reduced the false detection rate for interfering objects in complex backgrounds, showcasing its robustness in challenging environments.
3.3. Visualization Analysis
This section evaluates the detection performance of the models in various scenarios. Using the same detection samples, the Mcan-YOLO model consistently demonstrated higher confidence in recognizing both fire and smoke. Typical images are presented in Figure 17. In the case of YOLOv7-tiny (Figure 17a,c), the confidence values for fire and smoke were 0.66 and 0.83, respectively. In contrast, Mcan-YOLO (Figure 17b,d) significantly improved these values to 0.88 and 0.94, highlighting its enhanced detection accuracy and reliability.
Additionally, the detection of small objects was analyzed. YOLOv7-tiny showed a tendency to miss small flames, whereas Mcan-YOLO demonstrated a very low miss rate, as illustrated in Figure 18. In the YOLOv7-tiny detection sample (Figure 18a), several small fire targets were not detected. In contrast, Mcan-YOLO (Figure 18b) successfully detected these small targets, such as the fire in the bottom right corner. Figure 18c,d provide a comparison for detecting sparse smoke, where Mcan-YOLO effectively recognized thin smoke that YOLOv7-tiny failed to detect.
To provide a more intuitive demonstration of the model's effectiveness, we employed Grad-CAM for feature map visualization experiments. Two images from the test set were selected to generate heatmaps, in which darker areas indicate a higher probability of being identified as the target and lighter areas represent the background. As illustrated in Figure 19a, Mcan-YOLO accurately captured the true shape of the target flames, precisely highlighting the areas most likely to contain fire. In contrast, in Figure 19b, where the colors of the smoke and background are similar, YOLOv7-tiny mistakenly labeled parts of the background as the target area. Mcan-YOLO, however, successfully distinguished the smoke from the background, correctly identifying the target area and avoiding mislabeling. These results demonstrate that Mcan-YOLO excels at identifying key targets even in challenging environments with complex backgrounds, confirming its robustness and accuracy in practical applications.
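As an illustrative sketch of the visualization procedure (not the exact code used here): Grad-CAM weights a chosen layer's activations by the global-average-pooled gradients of a scalar score. The `score_fn` argument below is a hypothetical hook for reducing the model output to one scalar; for a detector such as Mcan-YOLO this would be, for example, the class confidence of a selected prediction.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, score_fn):
    """Generic Grad-CAM sketch: global-average-pooled gradients weight the
    target layer's activations; score_fn maps model output to one scalar."""
    feats, grads = [], []
    fh = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    bh = target_layer.register_full_backward_hook(
        lambda m, gin, gout: grads.append(gout[0]))
    model.zero_grad()
    score = score_fn(model(image))  # scalar to explain, e.g. a class score
    score.backward()
    fh.remove(); bh.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # GAP over gradients
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize 0..1
    return F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                         align_corners=False)[0, 0]  # (H, W) heatmap
```

The resulting heatmap is overlaid on the input image, producing visualizations like those in Figure 19.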
In summary, the feature map visualizations show that Mcan-YOLO exhibits higher precision and robustness in object recognition, with its advantages being especially pronounced in complex backgrounds.
3.4. Generalization Experiment
To evaluate the generalization capability of Mcan-YOLO, we conducted experiments using the M4SFWD [53] dataset, which includes a variety of fire and smoke scenarios. The dataset consists of 3985 images with a resolution of 640 × 640. It was specifically designed to test model performance in diverse and challenging real-world conditions, making it well suited for assessing the robustness and adaptability of detection algorithms.
Table 9 presents the detection results of the various models in the generalization experiment. Mcan-YOLO performed strongly across all key metrics, including precision, recall, F1 score, and mAP50. With an mAP50 of 87.9%, Mcan-YOLO ranks just behind RepVGG-YOLOv7, which achieved the highest mAP50 of 89.1%. However, Mcan-YOLO outperforms the other models in balanced performance, particularly in recall and F1 score, making it a strong contender in both accuracy and generalization.
4. Discussion
Climate change has exacerbated the frequency and duration of forest fires, creating an urgent need for efficient and accurate detection methods, whereas traditional fire information collection methods are limited by geographical and weather conditions. Deep learning, especially convolutional neural networks (CNNs) and the YOLO series of object detection techniques, has improved the efficiency of fire and smoke classification but still faces the challenges of missed small-target detections and growing parameter requirements; further research and improvement are therefore needed in the field of forest fire image target detection.
Traditional sensor-based methods are often used for fire point and smoke detection, but they are expensive and effective only in a limited geographical area. With the rise of convolutional neural networks (CNNs), their use for fire point and smoke detection in forest fire images has attracted wide attention, although CNN models require high computational power and have limitations when used alone for forest fire image detection. The introduction of YOLO technology, with the high response speed and accuracy of the YOLO series models, has promoted its application to fire point and smoke detection, and improving the YOLO model has become one of the main approaches to forest fire image target detection. For example, the SIMCB-YOLO model improved on YOLOv5. Although such methods achieve high accuracy, their complex structures inevitably require more computational power. Lightweight technology provides a breakthrough for this problem, and lightweight models for target detection in forest fire images have become a research hotspot, such as the RepVGG-YOLOv7 model and the method proposed by Zuo et al. [39]. For a lightweight object detection model, finding the balance between accuracy and the number of parameters remains a difficult problem.
In this study, an improved forest fire and smoke detection model called Mcan-YOLO is proposed based on the YOLOv7-tiny architecture. The model accounts for the difficulty of detecting small fire and smoke targets in complex environments: it introduces the AFPN module to enhance the detection of objects at different scales and uses the CARAFE upsampling operator and the NAM attention module to further refine feature extraction, thereby obtaining higher detection accuracy with fewer parameters. The experimental results show that the Mcan-YOLO model outperformed the YOLOv7-tiny baseline in several key metrics: precision increased by 4.6%, recall by 6.5%, and mean average precision (mAP50) by 4.7%. The ablation experiments in Table 4 verified that each added module substantially improved fire point and smoke detection in forest fire images. The comparative experiments in Table 8 and Figure 16, together with the generalization experiments in Table 9, verify that the model not only has excellent detection capability on the experimental dataset but also achieves a balanced performance on the M4SFWD dataset that is competitive with other state-of-the-art models.
The Mcan-YOLO model proposed in this study outperforms the existing comparison models with its light weight and high accuracy. However, the training process for target detection requires a large amount of labelled data, and data labelling is time-consuming and labor-intensive. In particular, the boundaries of fire points and smoke in forest fire images are difficult to define accurately, which limits the method's application to some extent. In subsequent research, we will consider integrating semi-supervised and weakly supervised techniques into the Mcan-YOLO model to achieve higher accuracy while reducing the need for labelled data.
5. Conclusions
In this study, we proposed Mcan-YOLO, an improved version of YOLOv7-tiny, specifically designed for enhanced forest fire detection. Several key enhancements were introduced. First, the AFPN was integrated into the neck network for multi-scale detection, effectively balancing feature information across different scales while optimizing model parameters. Second, CARAFE was employed for upsampling, improving precision with fewer parameters. Third, the NAM was added to address the complex backgrounds of forest fire smoke, enhancing the model’s adaptability. Finally, the Mish activation function was introduced to improve the model’s convergence. Compared to YOLOv7-tiny, Mcan-YOLO achieved a 4.6% increase in precision for forest fire and smoke detection while also reducing the model’s parameters by 5%. A dedicated dataset was constructed for the experiments, and the MSSIM algorithm was used to eliminate highly repetitive images, ensuring a more diverse training set. The experimental results demonstrated that Mcan-YOLO outperformed other object detection methods, offering improved accuracy with fewer parameters. This makes Mcan-YOLO a more practical and efficient solution for real-time forest fire detection.
Future work can explore several avenues to further enhance detection reliability. First, improving the dataset remains a challenge, as real forest fire images are difficult to collect. Generative adversarial networks (GANs) could be used to generate synthetic yet realistic fire images to expand the dataset. Additionally, we will explore innovative lightweight techniques to seamlessly integrate Mcan-YOLO into emergency response systems, incorporating a human verification process to improve accuracy and minimize false alarms.