1. Introduction
Forests are a cornerstone of national economic development and are crucial to the sustainable development of agroforestry. Nevertheless, the occurrence of forest fires worldwide has been on the rise in recent years due to extreme weather and dry climates, with severe consequences in China [1,2]. According to reports, a forest fire broke out on 30 March 2020 in Xichang, Liangshan Prefecture, destroying 792 hectares of forest and causing a direct economic loss of CNY 97.31 million, 19 deaths, and three injuries. Forest fires also produce large amounts of smoke and dust, which severely pollute the environment. Hence, establishing an efficient forest fire detection system is of utmost importance: detecting fires early allows for a quick response and suppression, reducing harm to lives and property. Timely detection and control of fires also protects forest resources, maintains the sustainable development of agroforestry, and promotes a healthy ecological environment [3].
With the progression of computer vision technology, image processing and pattern recognition have found important applications in forest fire smoke detection. Avudaiammal et al. [4] proposed extracting perceptual dynamic features using a color model of forest fires together with features such as wavelet energy and grayscale covariance matrices, which were then used to train machine learning classifiers. Sheng et al. [5] proposed a deep belief network (DBN) fire detection method based on statistical image features to address the challenges of different fire stages in complex environments; it extracts fire features in the time, frequency, and time–frequency domains and classifies them with a DBN. Bakri et al. [6] proposed a color pixel classification algorithm that separates fire pixels from the background and detects fire using image enhancement techniques and color models. To improve the early detection of forest fire smoke, Han et al. [7] proposed a semantic segmentation method with multi-color spatial feature fusion that extracts complementary smoke features by combining multi-scale features and attention mechanisms. Despite the progress made by these feature extraction and classification-based techniques, challenges remain: the color, contour, and dynamic texture of the detected objects make it difficult to extract fire and smoke features accurately, which reduces fire-smoke recognition accuracy.
With the swift progress of deep learning in forest fire and smoke detection, researchers have begun to actively explore the application of various deep learning models in this scenario [8,9]. Convolutional neural network (CNN)-based forest fire smoke detection methods have a significant advantage: their more powerful feature extraction networks can extract richer, higher-level, and more abstract semantic features, thus effectively improving detection performance [10,11]. Zhang et al. [12] developed a multi-scale feature extraction model (MS-FRCNN) suitable for detecting small targets in forest fires. Huang et al. [13] proposed GXLD, a lightweight forest fire detection technique that relies on YOLOX-L and defogging algorithms to address fog-related disruptions. Avula et al. [14] improved fire detection efficacy and reduced false detections by introducing fuzzy entropy optimization thresholds and spatial transformations to CNNs. Xue et al. [15] proposed a small-target forest fire detection model, addressing the problem that models struggle to learn effective information from forest fire images captured at long distances, where recognition is difficult due to the complex texture of smoke and interference factors in the forest environment. Li et al. [16] proposed a high-precision edge-focused detection network that enhances the extraction of global texture features by introducing a Swin multidimensional window extractor (SMWE), reduces redundant information using a guillotine feature pyramid network (GFPN), and reduces boundary blurring using a contour-adaptive loss function. Chen et al. [17] proposed a lightweight forest fire and smoke early detection method based on GS-YOLOv5, which effectively reduces the model parameters and false-alarm rate and improves detection accuracy by introducing the Super-SPPF structure, the C3Ghost module, and the coordinate attention module. Compared with traditional fire smoke detection algorithms, deep learning-based algorithms have superior feature extraction networks, which yields more comprehensive semantic feature details and significantly enhances detection performance.
Previous research has tended to rely on a single cue, fire or smoke, as the basis for detection; however, this approach has several limitations. Given the intricate and diverse nature of forest surroundings, smoke can reduce the visibility of fire, leading to false or missed alarms. In large wildfires, smoke can act as a visual barrier to fire, making accurate fire detection difficult. In the literature [18,19], authors have explored the limitations of single fire or smoke detection, emphasizing that environmental disturbances such as light variations and meteorological conditions may affect the visibility of fire and smoke, which in turn reduces detection accuracy. To tackle these challenges, this paper adopts a joint fire and smoke detection strategy, detecting both simultaneously to enhance the precision and dependability of forest fire and smoke detection.
The above studies have limitations, such as low detection rates, poor real-time performance, high computational complexity, and the ability to detect only fire or smoke rather than both jointly. To solve these problems, this paper proposes an efficient, accurate, and real-time method for the joint detection of forest fire and smoke, and presents a lightweight detection network, SmokeFireNet, which adopts ShuffleNetV2 as the backbone and combines advanced techniques such as FPN, PAN, RFB, ECA, and DySample. Compared with existing research, SmokeFireNet has the following advantages. (1) Joint detection: SmokeFireNet detects fire and smoke at the same time, fully considering the mutual influence between the two and improving the accuracy and reliability of detection. (2) Lightweight design: SmokeFireNet adopts the lightweight ShuffleNetV2 backbone and combines it with DySample and other techniques to reduce the model's computational complexity and number of parameters, meeting the demands of real-time detection. (3) Multi-scale feature fusion: SmokeFireNet introduces the FPN and PAN structures to fuse features at different scales, better capturing the contextual information of fire and smoke and improving detection accuracy. (4) Attention mechanisms: SmokeFireNet introduces the RFB and ECA mechanisms to better capture the detailed features of fire and smoke while suppressing irrelevant features, further improving detection accuracy. The experimental results show that SmokeFireNet outperforms other mainstream target detection algorithms in terms of average precision, frame rate, and computational complexity, providing effective technical support for forest fire prevention and new ideas and methods for future research.
The rest of the paper is organized as follows. In Section 2, we describe in detail the construction of the forest fire and smoke dataset, the architecture of the SmokeFireNet model, and the model performance evaluation metrics. In Section 3, we compare the experimental results of different models, analyzing in detail the comparison of attention mechanisms, the performance of the model at different resolutions, the ablation experiments, the comparison of different data augmentation methods, and the detection results of the model. In Section 4, we discuss the limitations of the current model and outline potential future research directions. In Section 5, we conclude the paper and summarize the key contributions of SmokeFireNet to the field of forest fire and smoke detection.
2. Materials and Methods
2.1. Forest Fire Smoke Dataset
The datasets utilized in this paper were obtained primarily from institutions such as the Machine Intelligence Laboratory of the University of Salerno in Italy, the CV&PR Lab of Keimyung University in South Korea, and the Yuan Fei Niu Team Lab. The collected video data were processed and filtered frame by frame, yielding 1776 images of forest fire smoke. The dataset constructed in this study is mainly used to detect smoke and fire in the spreading stage of a forest fire. A selection of sample images from the dataset is illustrated in Figure 1.
2.2. Data Enhancement
The forest fire smoke dataset obtained above is too small to support the effective training and evaluation of deep learning models. To address this, this paper adopts three image enhancement methods tailored to the characteristics of forest fires. These methods generate more diverse image samples, thus improving the robustness and accuracy of the model in different scenarios [20].
- (1)
Wind Dynamic Orientation Adjustment (WDOA)
During the construction of the aerial forest fire dataset, WDOA is used to simulate the changes in target objects at different orientations and angles caused by wind effects and different shooting angles. Wind produces irregular fire patterns and smoke distributions, so images taken at different angles and orientations show different visual characteristics. Modeling the effect of wind on fire and smoke morphology in this way increases the diversity of the dataset and enables the model to better understand forest fires at different angles and orientations.
- (2)
Illumination Condition Modulation Simulation (ICMS)
During the construction of the aerial forest fire dataset, images taken at different times may be subject to varying lighting conditions, resulting in different brightness and contrast. Simulating light condition modulation makes the images in the dataset more diverse and allows the model to learn the characteristics of fire under different lighting. A model trained in this way is more robust and better able to adapt to the lighting conditions encountered in practical applications.
- (3)
Photographic Equipment Vibration Simulation (PEVS)
During the construction of the aerial forest fire dataset, image blurring can be caused by UAV motion or instability during shooting. Simulating this blurring helps the model cope with problems that may be encountered in the real world and improves its robustness and generalization ability in real scenarios. A minimal sketch of all three augmentations is given below.
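As a concrete illustration, the following Python sketch shows one plausible way to realize the three augmentations with OpenCV; the rotation range, brightness/contrast jitter ranges, and blur kernel sizes are illustrative assumptions, since the paper does not specify exact parameters.

```python
import cv2
import numpy as np

def wdoa(img, max_angle=15):
    """WDOA sketch: random rotation/flip to mimic wind-driven orientation changes."""
    h, w = img.shape[:2]
    angle = np.random.uniform(-max_angle, max_angle)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), borderMode=cv2.BORDER_REFLECT)
    if np.random.rand() < 0.5:          # random horizontal flip
        img = cv2.flip(img, 1)
    return img

def icms(img, alpha_range=(0.6, 1.4), beta_range=(-40, 40)):
    """ICMS sketch: contrast (alpha) and brightness (beta) jitter."""
    alpha = np.random.uniform(*alpha_range)
    beta = np.random.uniform(*beta_range)
    return cv2.convertScaleAbs(img, alpha=alpha, beta=beta)

def pevs(img, max_kernel=9):
    """PEVS sketch: linear motion-blur kernel to mimic UAV shake."""
    k = np.random.randint(3, max_kernel + 1)
    kernel = np.zeros((k, k), dtype=np.float32)
    kernel[k // 2, :] = 1.0 / k         # horizontal motion blur
    return cv2.filter2D(img, -1, kernel)
```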
Example images after data augmentation are shown in Figure 2. These images cover a wide range of states and scenarios of forest fires and smoke, including fire and smoke under different lighting conditions and at different angles. Such a dataset enriches the diversity of samples, which helps the model better understand and adapt to forest fire and smoke situations in different environments and improves its generalization ability and detection accuracy.
After data augmentation, we obtained a dataset containing 4000 images and labeled the images, in which the number of labels for “fire” and “smoke” is 2946 and 2624, respectively. In the experiments, the dataset is divided into training and validation sets in the ratio of 8:2.
Table 1 shows the number of images and labels in the training and validation sets.
2.3. SmokeFireNet
In this paper, we present SmokeFireNet, a network designed to detect forest fire smoke, as shown in Figure 3. SmokeFireNet consists of an input, backbone, neck, and head. To curtail the computational complexity of the network and the number of model parameters, SmokeFireNet adopts the lightweight ShuffleNetV2 as its feature extraction backbone, which meets the requirements of devices with limited computational resources while maintaining efficient performance. The feature pyramid network (FPN) [21] and path aggregation network (PAN) [22] structures are introduced in the neck to fuse features from different layers, along with ECA, RFB, and DySample. Finally, the head layer performs classification and regression on the multi-scale feature layers by adapting the channel count, enabling precise detection and localization of fire and smoke targets.
2.3.1. Backbone
ShuffleNet [23] is a lightweight convolutional neural network for computationally limited devices. It uses pointwise group convolution to reduce the amount of computation and the number of parameters, and a channel shuffle operation to exchange feature information between groups, which group convolution alone would keep isolated. However, ShuffleNetV1 [23] violates the four efficient network design principles proposed in the literature; therefore, Zhang et al. [24] proposed the ShuffleNetV2 network using a channel splitting operation. In Figure 4a, the basic unit uses a stride of 1. The feature channels are divided into two parts by a channel split operation: one branch retains the original features, and the other extracts features through 1 × 1 and 3 × 3 convolutions. Through this channel division, the two branches exchange information. In Figure 4b, the spatial downsampling unit with a stride of 2 does not use channel splitting; instead, downsampling is performed by depthwise convolution with a stride of 2. The two branches are then merged, halving the spatial size of the features while doubling the number of channels.
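For reference, a minimal PyTorch sketch of the two units in Figure 4 might look as follows; layer widths and ordering follow the original ShuffleNetV2 design and may differ in detail from the authors' implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Interleave channels across groups so the two branches exchange information."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ShuffleV2Unit(nn.Module):
    """Sketch of the ShuffleNetV2 unit: stride 1 splits channels (Figure 4a,
    assumes in_ch == out_ch); stride 2 runs both branches on the full input
    and halves the spatial size while doubling channels (Figure 4b)."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.stride = stride
        branch_ch = out_ch // 2
        if stride == 2:  # downsampling shortcut: 3x3 depthwise + 1x1 pointwise
            self.branch1 = nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, 2, 1, groups=in_ch, bias=False),
                nn.BatchNorm2d(in_ch),
                nn.Conv2d(in_ch, branch_ch, 1, bias=False),
                nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))
        in2 = in_ch if stride == 2 else in_ch // 2
        self.branch2 = nn.Sequential(
            nn.Conv2d(in2, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True),
            nn.Conv2d(branch_ch, branch_ch, 3, stride, 1, groups=branch_ch, bias=False),
            nn.BatchNorm2d(branch_ch),
            nn.Conv2d(branch_ch, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        if self.stride == 1:
            x1, x2 = x.chunk(2, dim=1)       # channel split (Figure 4a)
            out = torch.cat((x1, self.branch2(x2)), dim=1)
        else:
            out = torch.cat((self.branch1(x), self.branch2(x)), dim=1)
        return channel_shuffle(out)          # shuffle for cross-branch information
```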
The structure of the backbone network is shown in Figure 5. It begins with an initial downsampling of the input image using a standard 3 × 3 convolution and max pooling to reduce the image resolution and extract initial features. The network then passes through three stages, each containing a repeated stack of ShuffleNet basic and downsampling units. A unit with a stride of 2 is a downsampling unit that further extracts features while reducing the output size; a unit with a stride of 1 is a basic unit that extracts features without downsampling. After the three stages of feature extraction, the network adjusts the channel counts of the outputs of stages 1 to 3 by convolution, finally obtaining three effective feature layers of sizes (80 × 80 × 64), (40 × 40 × 128), and (20 × 20 × 256), labeled C1, C2, and C3, respectively.
2.3.2. Neck
In convolutional neural networks, feature map dimensions shrink progressively as the number of layers increases, which can blur the position information of small fire targets and lead to missed detections. To tackle this challenge, this study adopts a multi-scale fusion strategy in the neck network and introduces a feature pyramid structure, shown in Figure 6. The backbone network generates multi-scale features (denoted C1, C2, C3) at different layers. The FPN fuses higher-level features with lower-level features in a top-down manner, through upsampling and lateral connections, to compensate for the deficiency of semantic information in the lower-level features. On this basis, the path aggregation network (PAN) introduces bottom-up paths that pass shallow features to higher layers, which enhances the network's ability to localize targets at different scales, especially small ones.
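A schematic sketch of this top-down/bottom-up fusion is given below; the 1 × 1 lateral reductions, additive merging, and unified channel width are illustrative assumptions rather than the exact neck configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNPAN(nn.Module):
    """Sketch of the neck's top-down (FPN) and bottom-up (PAN) fusion.
    Channel counts follow the backbone outputs C1-C3 (64, 128, 256)."""
    def __init__(self, chs=(64, 128, 256), out_ch=64):
        super().__init__()
        self.lat = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in chs)   # lateral 1x1
        self.down = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, 2, 1) for _ in chs[:-1])

    def forward(self, c1, c2, c3):
        # top-down FPN: upsample deeper features and fuse with shallower ones
        p3 = self.lat[2](c3)
        p2 = self.lat[1](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        p1 = self.lat[0](c1) + F.interpolate(p2, scale_factor=2, mode="nearest")
        # bottom-up PAN: push localization-rich shallow features back up
        n1 = p1
        n2 = p2 + self.down[0](n1)
        n3 = p3 + self.down[1](n2)
        return n1, n2, n3   # multi-scale features passed to the detection head
```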
DySample [11] is a novel dynamic upsampler that takes point sampling as its core idea and achieves an efficient, lightweight upsampling process through content-aware sample point generation and offset range control. The biggest difference between DySample and other upsampling operators is its point-sampling design, which gives it significant advantages in weight, efficiency, and ease of use. Compared with kernel-based dynamic upsamplers, DySample eliminates the need for complex kernel generation and convolution operations, has fewer parameters, lower computation and memory footprints, and an inference speed close to that of bilinear interpolation. At the same time, DySample needs only a low-resolution feature map as input, without high-resolution guidance features, making it more flexible and easier to use.
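The point-sampling idea can be illustrated as follows: offsets are predicted from the low-resolution input, squashed to a controlled range, added to a regular grid, and the output is gathered with grid_sample. This is a simplified sketch only; the official DySample includes further details (e.g., grouped offsets and specific initialization) not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Simplified point-based dynamic upsampling in the spirit of DySample."""
    def __init__(self, ch, scale=2, offset_range=0.25):
        super().__init__()
        self.scale = scale
        self.range = offset_range
        # predict one (dx, dy) offset per upsampled position
        self.offset = nn.Conv2d(ch, 2 * scale * scale, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        s = self.scale
        # content-aware offsets, squashed to a small controlled range (in pixels)
        off = torch.tanh(self.offset(x)) * self.range          # (n, 2*s*s, h, w)
        off = F.pixel_shuffle(off, s)                          # (n, 2, h*s, w*s)
        # base sampling grid in normalized [-1, 1] coordinates
        ys = torch.linspace(-1, 1, h * s, device=x.device)
        xs = torch.linspace(-1, 1, w * s, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).expand(n, -1, -1, -1)
        # convert pixel offsets to normalized grid units and sample points
        step = torch.tensor([2.0 / (w * s), 2.0 / (h * s)], device=x.device)
        grid = grid + off.permute(0, 2, 3, 1) * step
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```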
To enhance the network's ability to recognize forest fires and smoke and to capture their contextual properties and overall attributes, this study introduces a receptive field block (RFB) [25] after feature C3 to broaden the network's receptive field. As shown in Figure 7, the RFB module mainly consists of multi-branch small-kernel convolutional layers and dilated convolutional layers. The small-kernel layers include 3 × 3, 1 × 3, and 3 × 1 convolutions, which help reduce the number of parameters and the computational burden of the model. In addition, to improve feature resolution, each standard convolutional branch is paired with a dilated convolutional branch, and each dilated convolutional layer is given a different dilation rate. The dilated layers mimic the eccentricity effect of human vision, allowing the network to capture complex image details more comprehensively. The RFB module finally concatenates the features from all branches and integrates them into one convolutional feature set. Through this feature concatenation and integration, the RFB module extracts image features more efficiently, providing a more accurate and efficient feature description for forest fire smoke detection.
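The following sketch captures the branch layout described above: small kernels (3 × 3, 1 × 3, 3 × 1) paired with dilated 3 × 3 convolutions at increasing rates, then concatenation and fusion. The exact number of branches and channel widths in Figure 7 are assumptions here.

```python
import torch
import torch.nn as nn

class RFBSketch(nn.Module):
    """Sketch of a receptive field block with three parallel branches."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        mid = out_ch // 4
        def conv(i, o, k, p, d=1):
            return nn.Sequential(nn.Conv2d(i, o, k, padding=p, dilation=d, bias=False),
                                 nn.BatchNorm2d(o), nn.ReLU(inplace=True))
        self.b0 = nn.Sequential(conv(in_ch, mid, 1, 0),
                                conv(mid, mid, 3, 1, 1))                 # dilation 1
        self.b1 = nn.Sequential(conv(in_ch, mid, 1, 0),
                                conv(mid, mid, (1, 3), (0, 1)),
                                conv(mid, mid, 3, 3, 3))                 # dilation 3
        self.b2 = nn.Sequential(conv(in_ch, mid, 1, 0),
                                conv(mid, mid, (3, 1), (1, 0)),
                                conv(mid, mid, 3, 5, 5))                 # dilation 5
        self.fuse = nn.Conv2d(3 * mid, out_ch, 1)
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        out = torch.cat((self.b0(x), self.b1(x), self.b2(x)), dim=1)
        return torch.relu(self.fuse(out) + self.shortcut(x))   # residual fusion
```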
Fire and smoke targets in forest images often occupy few pixels, and their information is easily lost, so the network can easily overlook these small targets, resulting in missed and false detections. To solve this problem and enhance the network's perception of fire and smoke targets, especially small ones, this paper adds the ECA mechanism [26] to the output features C2 and C3 of the backbone network and to the PAN, respectively, as shown in Figure 6. ECA adaptively weights the feature responses along the channel dimension, enabling the network to focus on the important feature channels and improving its perception of forest fire and smoke.
A schematic depiction of the ECA mechanism is given in Figure 8. The ECA mechanism first distills a global representation of each channel through global average pooling to capture the channel's core information. The size of a 1D convolution kernel is then adaptively determined from the number of channels, so inter-channel dependencies can be captured without adding extra parameters. Next, this kernel is applied to the pooled features to model local inter-channel relationships. A sigmoid activation transforms these features into weights between 0 and 1, reflecting the importance of each channel. Finally, the weights are multiplied with the original feature maps, dynamically recalibrating them to retain important feature information and suppress noise. The ECA mechanism is concise and efficient, enhancing the network's focus on key features and improving the accuracy and efficiency of feature extraction.
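Because ECA is fully specified by its pooling, adaptive 1D convolution, and sigmoid gating, it can be written compactly; the kernel-size rule below uses gamma = 2 and b = 1 as in the original ECA paper.

```python
import math
import torch
import torch.nn as nn

class ECASketch(nn.Module):
    """Sketch of efficient channel attention: global average pooling, a 1D
    convolution whose kernel size adapts to the channel count, and sigmoid
    gating of the original feature map."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))
        k = k if k % 2 else k + 1                  # kernel size must be odd
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        y = self.pool(x)                               # (n, c, 1, 1): per-channel summary
        y = self.conv(y.squeeze(-1).transpose(1, 2))   # local cross-channel interaction
        y = torch.sigmoid(y.transpose(1, 2).unsqueeze(-1))
        return x * y                                   # recalibrate channels
```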
2.3.3. Loss Function
Designing the loss function is integral to a target detection algorithm: it measures the discrepancy between predictions and labels, and minimizing it improves the model parameters and hence the accuracy of the detected targets. The loss function of this model consists of three components: bounding box regression loss, confidence regression loss, and classification loss.
- (1)
Bounding Box Prediction
This paper calculates the bounding box loss using SIoU_Loss [27], a loss function that considers not only the overlapping area, distance, and aspect ratio but also the vector angle between the ground-truth and predicted boxes. Compared with loss functions such as CIoU, DIoU [28], and GIoU [29], SIoU_Loss redefines the loss by dividing it into four parts: angle loss, distance loss, shape loss, and IoU loss. A schematic of the bounding box losses is shown in Figure 9.
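For readers implementing the loss, a sketch of SIoU's four-part decomposition, following Gevorgyan's published formulation, is given below; the exact variant used in this paper may differ in details such as the angle definition or the shape exponent theta.

```python
import math
import torch

def siou_loss(pred, target, theta=4, eps=1e-7):
    """Sketch of SIoU loss for boxes in (x1, y1, x2, y2) format."""
    # intersection over union
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # smallest enclosing box and center offsets
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    dx = (target[:, 0] + target[:, 2] - pred[:, 0] - pred[:, 2]) / 2
    dy = (target[:, 1] + target[:, 3] - pred[:, 1] - pred[:, 3]) / 2

    # angle cost: rewards alignment of the center line with an axis
    sigma = torch.sqrt(dx ** 2 + dy ** 2) + eps
    sin_alpha = torch.abs(dy) / sigma
    angle = 1 - 2 * torch.sin(torch.arcsin(sin_alpha.clamp(-1, 1)) - math.pi / 4) ** 2

    # distance cost, modulated by the angle cost
    gamma = 2 - angle
    dist = (1 - torch.exp(-gamma * (dx / (cw + eps)) ** 2)) + \
           (1 - torch.exp(-gamma * (dy / (ch + eps)) ** 2))

    # shape cost: penalizes width/height mismatch
    ww = torch.abs(w1 - w2) / torch.max(w1, w2)
    wh = torch.abs(h1 - h2) / torch.max(h1, h2)
    shape = (1 - torch.exp(-ww)) ** theta + (1 - torch.exp(-wh)) ** theta

    return 1 - iou + (dist + shape) / 2
```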
- (2)
Confidence Prediction
In this paper, the detected objects fall into only two categories, forest fire and smoke, so the confidence regression loss adopts the binary cross-entropy loss function, which is defined as follows:

$$L_{conf} = -\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \hat{C}_i \log C_i + \left(1 - \hat{C}_i\right) \log \left(1 - C_i\right) \right] - \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left[ \hat{C}_i \log C_i + \left(1 - \hat{C}_i\right) \log \left(1 - C_i\right) \right]$$

where $C_i$ and $\hat{C}_i$ denote the predicted forest fire smoke confidence and the true label, respectively; $B$ denotes the number of a priori frames corresponding to each grid point; and $\mathbb{1}_{ij}^{noobj}$ denotes that no forest fire smoke target to be detected exists in this a priori frame.
- (3)
Class Prediction
The classification loss also uses the binary cross-entropy loss function, which is defined as follows:

$$L_{cls} = -\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \sum_{c \in \text{classes}} \left[ \hat{p}_i(c) \log p_i(c) + \left(1 - \hat{p}_i(c)\right) \log \left(1 - p_i(c)\right) \right]$$

where $p_i(c)$ denotes the category prediction probability of the prior frame, $\hat{p}_i(c)$ denotes the category label of the labeled frame, and $\mathbb{1}_{ij}^{obj}$ indicates whether the prior frame at grid coordinates $(i, j)$ is a positive sample, taking the value one for positive samples and zero otherwise.
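In code, both terms reduce to masked binary cross-entropy. In the sketch below, the tensor shapes, the boolean mask convention, and the no-object weighting lambda_noobj are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def detection_bce_losses(pred_conf, true_conf, pred_cls, true_cls,
                         obj_mask, lambda_noobj=0.5):
    """Sketch of the confidence and classification BCE terms.
    pred_conf/true_conf: (N, A) sigmoid probabilities and 0/1 labels;
    pred_cls/true_cls:   (N, A, C) per-class probabilities and labels;
    obj_mask:            (N, A) bool, True where an a-priori frame holds a target."""
    noobj_mask = ~obj_mask
    conf_loss = F.binary_cross_entropy(pred_conf[obj_mask], true_conf[obj_mask]) \
        + lambda_noobj * F.binary_cross_entropy(pred_conf[noobj_mask],
                                                true_conf[noobj_mask])
    cls_loss = F.binary_cross_entropy(pred_cls[obj_mask], true_cls[obj_mask])
    return conf_loss, cls_loss
```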
2.4. Training
In the experimental phase, to ensure the reproducibility and reliability of the results, this study was conducted in the following strictly configured environment: Python 3.9 as the programming language on a Windows 11 operating system; the PyTorch 2.0.1 deep learning framework for model training and evaluation; and an NVIDIA GeForce RTX 3060 GPU together with a 12th-generation Intel Core i3-12100F CPU (Intel Corporation, Santa Clara, CA, USA) for computation.
To train efficient forest fire and smoke detection models, a set of experimental parameters was selected. The model is trained for 300 epochs; the batch size is set to 32, balancing training efficiency and memory requirements. The image size is 640 × 640 to ensure that the model adapts to different target sizes. The initial learning rate is 0.01, and the SGD optimization algorithm is used so that the model converges quickly and achieves a high level of performance. During training, we use data augmentation and regularization techniques to improve the generalization ability and robustness of the model. Data augmentation increases the diversity of the training data by simulating variations in light, color, angle, position, and size, enabling the model to better adapt to various scenarios. Regularization techniques prevent overfitting by limiting model complexity and adjusting the training data distribution: for example, weight decay prevents model overfitting, and label smoothing reduces the model's dependence on training data labels. Setting these parameters appropriately effectively enhances the performance of the model in real applications.
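The stated configuration translates directly into a standard PyTorch training loop, as sketched below. SmokeFireNet and FireSmokeDataset are placeholder class names, and the momentum and weight decay values are assumed typical settings not stated explicitly in the text.

```python
import torch
from torch.utils.data import DataLoader

# Training configuration as described above: 300 epochs, batch size 32,
# 640x640 inputs, SGD with initial learning rate 0.01.
model = SmokeFireNet()                       # placeholder model class
loader = DataLoader(FireSmokeDataset(img_size=640),  # placeholder dataset class
                    batch_size=32, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)  # assumed values

for epoch in range(300):
    for images, targets in loader:
        loss = model(images, targets)        # assumed to return the combined loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```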
In addition, to ensure the performance of the model under different input image sizes, its adaptability and robustness can be assessed by testing multiple resolutions. Therefore, in our experiments we set up five different resolutions: 224 × 224, 416 × 416, 640 × 640, 800 × 800, and 1024 × 1024. The performance of the model under different conditions is evaluated using images at these resolutions to select the model and input resolution most suitable for a particular application scenario.
2.5. Evaluation Metrics
In this paper, the performance of the model is measured using mAP@0.5, which measures the average precision of the model at an IoU threshold of 0.5 over the different target categories. In other words, it evaluates how accurately the model localizes and classifies targets in the image. Higher mAP@0.5 values indicate better target detection performance. In this paper, we denote the AP@0.5 of forest fire and smoke as APfire and APsmoke, respectively, and the overall model mAP@0.5 as APall.
To meet the real-time and accuracy requirements of forest fire and smoke detection, we use the GFLOPs and FPS of the model as additional evaluation metrics to assess its computational efficiency and processing speed in real applications. GFLOPs measure the computational complexity of the model, while FPS indicates the number of image frames the model can process per second. Higher GFLOPs mean that the model is more computationally intensive and takes longer to perform inference. Therefore, in resource-constrained situations, the GFLOPs and FPS of a model must be weighed together to ensure a balance between performance and efficiency.
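These two metrics can be measured as sketched below, assuming the third-party thop package for FLOP counting; the warm-up and iteration counts are arbitrary choices, and `model` stands for any trained detector.

```python
import time
import torch
from thop import profile   # pip install thop

model.eval()
dummy = torch.randn(1, 3, 640, 640)

# GFLOPs: multiply-accumulate count for one 640x640 forward pass
flops, params = profile(model, inputs=(dummy,))
print(f"GFLOPs: {flops / 1e9:.1f}, Params: {params / 1e6:.2f}M")

# FPS: average latency over repeated forward passes
with torch.no_grad():
    for _ in range(10):                 # warm-up iterations
        model(dummy)
    start = time.time()
    for _ in range(100):
        model(dummy)
print(f"FPS: {100 / (time.time() - start):.0f}")
```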
3. Results
3.1. Comparison of Different Detection Models
In order to verify the effectiveness of SmokeFireNet, we compare it with the current mainstream target detection algorithms: Faster R-CNN [30], SSD [31], YOLOv5, YOLOv7 [32], ShuffleNetv2 [23], and MobileNetv3 [33]. The results of the comparison experiments are shown in Table 2.
From the experimental data, SmokeFireNet performs well on all indicators, especially average precision (AP): its APall reaches 86.2, APfire 82.3, and APsmoke 90, the highest values among the compared models. In addition, SmokeFireNet achieves a frame rate of 114 FPS, second only to ShuffleNetv2's 121, and maintains a low computational cost of 8.4 GFLOPs, striking a balance between performance and efficiency. In contrast, Faster R-CNN performs well in terms of average precision but has the lowest FPS at 98 and the highest computational complexity at 24.5 GFLOPs, making it unsuitable for real-time applications. YOLOv5, YOLOv7-tiny, and MobileNetv3 also perform well on some metrics, but when accuracy, frame rate, and computational complexity are considered together, they fall short of SmokeFireNet, which is undoubtedly the best overall choice.
3.2. Performance Comparison of Different Attention Mechanism Models
Different attention mechanisms have their own characteristics and performance, so comparing them can help identify the best-performing mechanism for the forest fire detection task. By evaluating their impact on model performance, we can select the most suitable attention mechanism. To validate the impact of the attention mechanism used in this paper, this section introduces several alternatives, including GAM (global attention mechanism) [34], CBAM (convolutional block attention module) [35], SimAM (simple attention mechanism) [36], and ECA (efficient channel attention) [26], generates the corresponding feature maps, and visualizes them as heat maps. The visualization results are shown in Figure 10.
These visualization results show that, after the introduction of the ECA mechanism, the model focuses more on the region to be detected, whereas the baseline network without an attention mechanism also attends to irrelevant regions. At the same time, we find that the other attention mechanisms do not focus fully on the region to be detected and are less effective for the task in this paper. Therefore, ECA is selected as the attention mechanism in this paper.
3.3. Experimental Comparison of Different Resolutions
To further analyze the effect of different input sizes on the model, this section tests the model with input images at five different resolutions: 224 × 224, 416 × 416, 640 × 640, 800 × 800, and 1024 × 1024. Higher-resolution images usually contain more detailed information, so the model can obtain richer visual features from them; low-resolution images may lose subtle features, degrading model performance. By evaluating images at different resolutions, we can gain insight into the model's behavior under different input conditions and determine its adaptability and robustness to images of different sizes. This analysis helps optimize the design of the model and improve its performance and reliability across application scenarios. The specific experimental results are shown in Table 3.
Analyzing the experimental results in Table 3, we find that the average precision (AP) of the model increases as the resolution of the input image increases. Higher-resolution images provide more detail and clarity, allowing the model to capture the features and shape of the target more accurately, so higher resolution usually yields higher detection accuracy. However, increased resolution comes with increased computational cost, including inference time and required computational resources. Conversely, at lower resolutions the computational load is reduced and inference is faster, but the loss of image detail degrades the model's detection accuracy for fire and smoke. Since smoke occupies a larger pixel area in low-resolution images, while forest fires usually contain many small targets, the accuracy drop for smoke is smaller than that for fire. In summary, low-resolution inputs are better suited to embedded devices and resource-constrained environments. Trading off detection accuracy against detection speed, an input resolution of 640 × 640 is selected for training in this paper, maintaining real-time performance while preserving high detection accuracy.
3.4. Ablation Experiments
To validate the importance of ShuffleNetv2, FPN, PAN, ECA, RFB, and DySample for the forest fire smoke detection task and to explore their impact on model performance, this section evaluates the model by adding the ShuffleNetv2, FPN, PAN, RFB, DySample, and ECA modules step by step. Introducing these modules incrementally allows us to analyze their contributions to model performance, understand their role in improving detection results, and obtain useful guidance for further optimization of the model. The ablation experiment results are shown in Table 4.
As the ablation data in the table show, each performance metric (APall, APfire, and APsmoke) improves significantly as the model structure is refined, while GFLOPs increase accordingly. The base ShuffleNetv2 model performs lowest, with APall, APfire, and APsmoke of 80.7, 76.8, and 84.6, respectively, at 6.6 GFLOPs. Adding FPN and PAN raises APall to 81.9 and APfire and APsmoke to 77.7 and 86.1, with the computational cost rising to 7.3 GFLOPs. Adding the RFB module further increases APall to 83.9 and APfire and APsmoke to 78.4 and 89.4, at 7.9 GFLOPs. Adding the DySample module brings a notable gain: APall rises to 85.6, APfire to 81.6, and APsmoke slightly to 89.6, at 8.3 GFLOPs. Finally, adding the ECA module yields the highest performance, with APall, APfire, and APsmoke of 86.2, 82.3, and 90, respectively, and a slight increase in computational cost to 8.4 GFLOPs. Each module improves model performance to a different degree, and the final combination achieves the best results, indicating that these modules together significantly enhance the overall performance of the model.
These ablation experimental results show that the gradual introduction of new components and techniques can effectively improve the performance of the forest fire smoke detection model, and the final model combines a variety of advanced techniques to provide a reliable solution for real application scenarios.
3.5. Detection Effect under Different Data Augmentation
The dataset used in this paper was processed with three data enhancement methods: WDOA, ICMS, and PEVS. WDOA simulates the effects of wind on the target object and the changes in its orientation and angle in the image introduced by different shooting angles. ICMS simulates fire scenes under different lighting conditions so that the model can better adapt to varied lighting environments. PEVS simulates the image blurring caused by motion or instability during UAV shooting, increasing the model's robustness to motion blur. Through these data enhancement methods, we improve the generalization ability of the model across diverse scenarios, making its application in real environments more robust and reliable.
Figure 11 shows the forest fire smoke detection results of the model trained with data enhancement in the three scenarios above, further validating the effectiveness of data augmentation and the generalization ability of the model.
3.6. Forest Fire and Smoke Detection Performance Analysis
In analyzing the experimental results, we observed the detection performance under different scenarios. Figure 12a shows a smoke-only case, in which the model effectively detects the smoke regions; a detector focused only on fire would struggle to detect the forest fire in this scenario. In Figure 12b, smoke acts as a visual barrier to the fire, making the fire difficult to detect; a single fire detection strategy could miss such incidents. We also considered the detection of small targets: Figure 12c,d illustrate the detection of tiny fires, with the model accurately pinpointing their positions. These results demonstrate the strong detection capability of the proposed model across diverse scenarios: it discriminates between smoke and fire and achieves accurate detection even for small target objects.
4. Discussion
In this study, we presented SmokeFireNet, an efficient and accurate network for forest fire smoke detection. By adopting the lightweight ShuffleNetV2 backbone, multi-scale feature fusion, the RFB module, the ECA mechanism, and the DySample upsampling operation, SmokeFireNet achieves accurate identification of fire and smoke while keeping the model lightweight enough to satisfy real-time detection requirements. In addition, the data enhancement methods further improve the robustness and adaptability of the model.
The proposed SmokeFireNet model performs well in forest fire and smoke detection and has clear advantages over other mainstream target detection algorithms in terms of AP, FPS, and GFLOPs. SmokeFireNet achieves the highest values of APall, APfire, and APsmoke, at 86.2%, 82.3%, and 90%, respectively, proving its accuracy in fire and smoke detection. Its frame rate of 114 FPS is second only to ShuffleNetv2 and much higher than the other models, meeting the demands of real-time detection. The model's 8.4 GFLOPs are far lower than Faster R-CNN's and comparable to ShuffleNetv2's and MobileNetv3's, ensuring high performance while keeping the design lightweight.
The SmokeFireNet model has achieved remarkable results in forest fire and smoke detection, but it still has some limitations: it relies on high-quality datasets, its adaptability in complex scenarios needs to be improved, and there is still room for optimization in terms of model weight and real-time performance. To further enhance the usefulness of SmokeFireNet in forest fire prevention, future work can focus on the following areas. Firstly, the performance of the model in different geographical regions and fire conditions is crucial for its practical application. Although our results show high accuracy, future research should examine the robustness of the model in varied environments, such as dense forests, arid areas, and different climatic conditions, which may require testing on datasets covering these specific conditions. Secondly, the adaptability of SmokeFireNet in complex scenarios, such as nighttime detection, detection in dense fog, or smoke recognition under tree cover, still needs improvement. This can be achieved by employing more advanced data enhancement techniques and introducing domain adaptation methods to fine-tune the model for specific environmental conditions.