1. Introduction
Fire outbreaks pose significant threats to safety and property worldwide, necessitating the development of rapid and reliable detection systems. The growth of large-scale and complex construction projects, alongside rapid economic development, presents substantial challenges to fire management and control. Traditional fire detection methods [1,2], such as smoke and heat sensors, are crucial in early warning systems; however, they often cannot visually confirm the presence of a fire, leading to response delays or false alarms. The advent of computer vision in fire detection represents a transformative approach [3] that enables the automatic recognition and localization of fires in real time through visual inputs.
According to the US Fire Administration (USFA), in 2023 the United States faced significant fire incidents across various categories [4]. Approximately 222,000 vehicle fires, primarily involving highway vehicles, were reported, resulting in considerable property damage. Additionally, over 522,500 structural fires were recorded [5], with most occurring in residential areas, underscoring persistent fire safety concerns in homes. Furthermore, the country experienced 55,571 wildfires that burned approximately 2.63 million acres. Although these wildfires were less severe than those in previous years, they still posed major challenges, particularly in fire-prone states such as California and Arizona.
Computer vision is a field of artificial intelligence that uses algorithms to interpret and understand the visual world. In fire detection, computer vision systems analyze digital images or video feeds to identify distinct fire characteristics, such as movement, color, and smoke patterns [6]. This capability not only enhances the accuracy of fire detection but also aids in assessing the severity and spread of fires, providing crucial data for emergency response efforts. However, the application of computer vision to fire detection presents unique challenges. Fires can vary significantly in appearance based on the materials involved and environmental conditions, such as lighting and weather, which can affect the visibility and distinguishability of fire and smoke. Furthermore, the dynamic and unpredictable nature of fires requires robust algorithms capable of rapid and adaptive responses.
Convolutional neural networks (CNNs) are adept at autonomously learning and identifying intricate images. These algorithms have garnered significant attention owing to their superior performance in various applications, including visual search, autonomous driving, and medical imaging [7]. Consequently, researchers have explored the integration of CNNs into image-based fire detection and advanced self-learning algorithms to capture fire-related image characteristics [8]. Enhancements have been made to leading models such as AlexNet, Inception, VGG, and residual network (ResNet) for developing flame and smoke detection algorithms [9]. Additionally, the integration of time-series data has been implemented to enhance algorithmic performance [10]. VGG-Net has been modified to detect smoke and flames simultaneously [11]. The effectiveness of five networks, including GoogleNet and Modified GoogleNet, has been demonstrated for forest fire detection using unmanned aerial vehicles [12]. A novel method was proposed to identify moving objects and generate proposal regions by employing dynamic background updates and a dark channel prior; a CNN-based method was then employed for smoke detection in these regions [13]. Furthermore, Gaussian mixture (MOG) background modeling was used to distinguish between the background and foreground, enabling the use of a cascade model for smoke region identification; finally, CaffeNet was implemented to detect smoke in the identified regions [14].
Recent advances in deep learning have significantly improved the accuracy and efficiency of fire detection systems, leading to extensive research and development within this field. Deep learning, particularly using CNNs, has revolutionized visual recognition by providing powerful methods for complex tasks such as fire detection. However, the unique challenges of fire detection, such as varying lighting, changing smoke density, and environmental clutter, require specialized approaches beyond the typical CNN applications. To overcome these obstacles, researchers are developing adaptive algorithms that can adjust to the changing conditions of fire scenes in real-time, thereby improving the detection accuracy and reliability.
Numerous studies have increasingly focused on utilizing CNNs for the automated detection of fire scenes. For instance, ref. [15] developed a compact CNN that incorporated a multi-attention mechanism to achieve high precision and recall in real-time fire detection scenarios. This approach effectively reduces the processing time and adapts to dynamic fire characteristics, making it suitable for deployment in drones and mobile surveillance systems. Similarly, ref. [16] introduced a dual fire attention network (DFAN), which employs dual attention mechanisms to enhance feature extraction and fire detection accuracy in complex scenes where the fire is partially obscured or mimics other hot objects. Ref. [17] explored the application of CNNs in real-time fire detection systems, highlighting the effectiveness of deep learning in distinguishing fire characteristics from similar natural phenomena, such as sunlight, and thereby improving problem-solving in computer vision within dynamic and complex environments, such as industrial settings.
The concept of multiscale feature extraction has proven critical in recognizing fires of various sizes and intensities. The paper [18] introduced a multiscale prediction model that utilizes feature maps from deep convolutional layers for fire image recognition. The authors developed a feature block to compress these maps spatially, enabling the extraction of information across multiple scales. Ref. [19] demonstrated how a multiscale approach can improve the sensitivity of fire detection models by adapting them to the varying scales of fire and smoke observed in different settings. These methods help distinguish fire from fire-like phenomena, such as sun reflections or vehicle headlights, which are often mistaken for fires by simpler models.
Attention mechanisms have been another area of focus because they help models concentrate on the most salient features of an image, thereby enhancing detection accuracy. The paper [20] developed the PACNN, a model that modifies traditional convolution to incorporate pixel-specific attention and enhance smoke image classification. This approach allows for the dynamic adjustment of pixel weights, improves feature diversity, and expands the representation space. Ref. [21] successfully implemented channel and spatial attention mechanisms in their models, improving the discriminative capability of CNNs against complex backgrounds that might otherwise trigger false alarms.
In addition to single-model approaches, ensemble methods have garnered attention for their ability to combine predictions from multiple models to improve overall accuracy. Ref. [22] explored the use of ensemble learning with CNNs pre-trained on diverse datasets to robustly classify fire and non-fire images across various environments, significantly reducing false positives. Ref. [23] introduced an adversarial fusion network that enhances smoke detection by merging abstract and detailed features to learn more discriminative representations. This approach significantly enhances both the performance and generalizability of the system.
In this paper, we introduce a hybrid deep learning model for fire scene classification that advances fire detection technology, with several key contributions:
- We integrated ResNet50, VGG16, and EfficientNet-B3 to ensure robust detection across various fire scenarios.
- Our approach enhances the sensitivity to fires of different sizes and intensities, significantly improving early detection capabilities.
- We employed spatial and channel attention mechanisms to reduce false positives and increase fire-detection precision.
- Our model combines predictions from multiple architectures, boosting accuracy and ensuring generalization across diverse environmental conditions.
- Computational complexity and processing speed were optimized to facilitate real-time applications on various hardware platforms.
- We rigorously tested our model on a comprehensive dataset to verify its performance in real-world scenarios.
- We provide practical deployment recommendations tailored to various environments to enhance the model's utility and effectiveness in the field.
To ensure clarity in evaluating the proposed model’s contributions, we structured this paper by separating the presentation of the model’s standalone results from its comparative analysis. The Results section summarizes the key performance metrics specific to the developed model, highlighting its efficacy across different fire scenarios. The Discussion section follows with a comparative analysis, contextualizing our model performance relative to state-of-the-art (SOTA) techniques. This approach provides readers with a clear understanding of the model’s strengths and situates our findings within the broader landscape of fire detection technologies.
The remainder of this article is organized as follows: Section 2 introduces the proposed methodology and details the backbone architecture, multiscale feature extraction, and the attention mechanisms employed. Section 3 outlines the experimental setup, including descriptions of the datasets, implementation details, and results; it also compares the results with basic CNN models and SOTA approaches. Section 4 presents a discussion of the findings and suggestions. Finally, Section 5 concludes the article with recommendations for future directions.
2. The Proposed Methodology
In this section, we present a novel model for fire scene classification designed to operate effectively in various environments such as factories, urban areas, and forests. The model integrates advanced deep learning techniques, including multiscale feature extraction, attention mechanisms, and ensemble learning, to deliver a robust, accurate, and efficient solution for real-time fire detection, as shown in Figure 1. The following subsections describe the key components and methodologies of the proposed model.
Figure 1 illustrates a hybrid CNN architecture designed to classify different types of fires, such as wildfires, building fires, and car fires. It integrates features from well-known CNN models, namely ResNet50 [24], VGG16 [25], and EfficientNet-B3 [26]. The architecture employs multiple convolutional layers, attention modules (spatial and channel), and pooling operations to enhance feature extraction and integration. A fully connected layer follows feature concatenation to classify the input images based on fire type, demonstrating a complex yet efficient approach to image-based classification tasks.
2.1. Backbone
The backbone of the proposed model utilized pre-trained CNNs, namely ResNet50 [24], VGG16 [25], and EfficientNet-B3 [26]. These networks are widely recognized for their powerful feature-extraction capabilities, which are essential for various computer vision tasks, including image classification, object detection, and segmentation.
2.1.1. ResNet 50
ResNet50, a 50-layer deep CNN, introduced the innovative concept of residual learning. This architecture is part of the residual network (ResNet) family, which achieved notable success by winning the ImageNet large-scale visual recognition challenge (ILSVRC) in 2015. The key characteristic of ResNet50 is its use of residual blocks that allow the network to learn residual functions with reference to the layer inputs. This design effectively addresses the vanishing gradient problem, a common issue in deep network training. By enabling an easier gradient flow, residual blocks facilitate the training of considerably deeper networks compared with traditional architectures. ResNet50 was designed to balance depth and computational efficiency. With 50 layers, it can extract detailed and abstract features from images, making it highly suitable for complex detection tasks, such as identifying fire and smoke scenes. The network begins with an input layer that typically accepts images resized to 224 × 224 pixels. It then processes these images through multiple convolutional layers that extract features at various levels of abstraction. In addition to the convolutional layers, ResNet50 incorporates residual blocks. These blocks include shortcut connections (or skip connections), which add the input of a layer to the output of a deeper layer. This mechanism promotes better gradient flow through the network, mitigating vanishing gradients and enabling the construction of very deep networks without performance degradation. Pooling layers, such as max pooling, are employed to reduce the spatial dimensions of feature maps, ensuring computational efficiency while preserving important features. Finally, the network includes fully connected layers that consolidate the extracted features and produce the final output.
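As a minimal illustration of residual learning (a simplified block rather than ResNet50's exact bottleneck design), the following PyTorch sketch shows how a shortcut connection adds a layer's input to a deeper layer's output:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: the input is added to the block output
    via a shortcut (skip) connection, easing gradient flow in deep networks."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # shortcut path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                  # residual addition
        return self.relu(out)

# Example: a 56 x 56 feature map with 64 channels passes through with an unchanged shape.
x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```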
2.1.2. VGG16
VGG16 is another widely used deep CNN that stands out for its simplicity and effectiveness. Developed by the Visual Geometry Group at the University of Oxford, VGG16 is notable for its deep but straightforward architecture. The network consists of 16 weight layers, including 13 convolutional layers and three fully connected layers. One of its distinguishing features is the use of small 3 × 3 receptive fields throughout the network, which allows fine details in the images to be captured. Despite its depth, the VGG16 architecture is characterized by a uniform layer structure, with each convolutional layer followed by a ReLU activation function and max-pooling layers placed intermittently to downsample the spatial dimensions of the feature maps. The simplicity of the VGG16 architecture, combined with its depth, makes it highly effective for extracting high-quality features from images. This capability is crucial for tasks that require the precise identification of complex patterns, such as distinguishing between different types of fire and smoke in varied environments. By using these pre-trained networks, the proposed model benefits from the extensive training and optimization performed on large-scale datasets such as ImageNet. This pre-training enables the model to leverage robust high-level feature representations, significantly enhancing its ability to accurately classify fire and smoke scenes in real-world applications.
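The following sketch reproduces the VGG16 design pattern on a small scale (only the first two of its five convolutional stages), illustrating the stacked 3 × 3 convolutions and 2 × 2 max pooling described above:

```python
import torch
import torch.nn as nn

# Minimal sketch of the VGG16 design pattern: stacked 3x3 convolutions with
# ReLU activations, followed by 2x2 max pooling to halve spatial dimensions.
# Only the first two stages are shown, not the full 13-convolution network.
vgg_style_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 3, 224, 224)
print(vgg_style_stem(x).shape)  # torch.Size([1, 128, 56, 56])
```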
2.1.3. EfficientNet-B3
EfficientNet-B3 is part of the EfficientNet family and balances high accuracy with computational efficiency. Introduced by Google in 2019, EfficientNet models use a compound scaling method that uniformly scales the depth, width, and resolution of a network, optimizing performance with fewer parameters and lower computational costs. EfficientNet-B3 contains approximately 12 million parameters, significantly fewer than other high-performing models, while maintaining competitive accuracy. It is structured with convolutional layers and employs mobile inverted bottleneck (MBConv) blocks and squeeze-and-excitation optimization, contributing to its superior performance in image recognition tasks.
A defining feature of EfficientNet-B3 is its use of the Swish activation function, which improves its ability to capture complex patterns. In addition, EfficientNet-B3 leverages Automated Machine Learning (AutoML) to fine-tune its architecture and hyperparameters, thereby enhancing its efficiency and accuracy. EfficientNet-B3's architecture is well suited for detailed image analysis, such as identifying subtle differences in fire and smoke scenes. The model begins with an input layer that processes images resized to 300 × 300 pixels. It progresses through several layers of convolution and pooling, extracting increasingly abstract features. The final layers comprise fully connected layers that consolidate the features to produce the final output. By using EfficientNet-B3, the model benefits from a robust pre-trained architecture capable of delivering high-quality performance in real-world applications while maintaining computational efficiency.
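As an illustration of how these pre-trained backbones can be obtained as feature extractors, the sketch below uses torchvision; the weights argument follows recent torchvision releases and may need adjusting for older versions, and a 224 × 224 input is used for brevity even though EfficientNet-B3 is typically trained at 300 × 300:

```python
import torch
import torchvision.models as models

# Load the three ImageNet-pretrained backbones (string weight names follow
# recent torchvision releases; adjust to the installed version if needed).
resnet = models.resnet50(weights="IMAGENET1K_V1")
vgg = models.vgg16(weights="IMAGENET1K_V1")
effnet = models.efficientnet_b3(weights="IMAGENET1K_V1")

# Strip the classification heads so each network acts as a feature extractor.
resnet_features = torch.nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 2048, 7, 7)
vgg_features = vgg.features                                           # -> (B, 512, 7, 7)
effnet_features = effnet.features                                     # -> (B, 1536, 7, 7)

x = torch.randn(1, 3, 224, 224)
for name, net in [("resnet50", resnet_features), ("vgg16", vgg_features),
                  ("efficientnet_b3", effnet_features)]:
    with torch.no_grad():
        print(name, net(x).shape)
```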
2.2. Multiscale Feature Extraction
Multiscale feature extraction is a crucial component of the proposed model that enables the detection of fire and smoke at various resolutions and scales. This approach enhances the model's ability to recognize patterns in diverse environments, from close-up details to broader scenes. The multiscale feature extraction process involves adding layers to the base pre-trained CNNs (ResNet50, VGG16, and EfficientNet-B3) that capture features at different scales. This is achieved by employing convolutional layers with varying kernel sizes and applying pooling operations in multiple stages. The input image $I$ is passed through the backbone model to extract the initial features. Let $F$ denote the output of the last convolutional layer of the pre-trained network:

$$F = f_{\text{backbone}}(I)$$

where $F \in \mathbb{R}^{H \times W \times C}$, with $H$, $W$, and $C$ representing the height, width, and number of channels of the feature map, respectively. To capture features at multiple scales, we applied convolutional layers with different kernel sizes to $F$. Let $k_1, k_2, \ldots, k_n$ be the kernel sizes. The feature maps at different scales $F_i$ were computed as follows:

$$F_i = \sigma(W_i * F + b_i), \quad i = 1, \ldots, n$$

where $*$ denotes the convolution operation, $W_i$ and $b_i$ are the weights and biases of the $i$-th convolutional layer, and $\sigma$ is the activation function. The resulting feature maps $F_1, \ldots, F_n$ have the same spatial dimensions but capture features at different scales. Pooling operations are applied to reduce spatial dimensions while preserving important features; maximum and average pooling are commonly used. Let $\text{Pool}(\cdot)$ denote the pooling operation:

$$P_i = \text{Pool}(F_i)$$

This operation helps to reduce the computational load and enhances the ability of our model to detect patterns at various scales by aggregating features over regions. The feature maps from different scales were concatenated to form a comprehensive feature representation:

$$F_{\text{ms}} = \text{Concat}(P_1, P_2, \ldots, P_n)$$

where $\text{Concat}(\cdot)$ denotes the concatenation operation along the channel dimension. The resulting feature map $F_{\text{ms}}$ combines information from multiple scales, enhancing the capability of the model to recognize fire and smoke patterns at various resolutions. To ensure stability during training, batch normalization was applied to the concatenated feature maps as follows:

$$F_{\text{norm}} = \text{BN}(F_{\text{ms}})$$

Batch normalization accelerates the training process and improves the generalization performance of our model. The normalized multiscale feature map $F_{\text{norm}}$ is then passed to the subsequent layers of the network for further processing, including the attention mechanisms and classification layers. The combination of convolutional layers with varying kernel sizes and pooling operations at different stages allows the model to capture a rich set of features that are crucial for accurate fire and smoke detection. This multiscale feature extraction process enhances the robustness of the model and its ability to operate effectively in diverse and complex environments.
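To make the above steps concrete, the following PyTorch sketch applies parallel convolutions with different kernel sizes to a backbone feature map, pools each branch, concatenates along the channel dimension, and applies batch normalization; the kernel sizes and channel counts are illustrative assumptions rather than the exact configuration used in our experiments:

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Sketch of the multiscale step: parallel convolutions with different
    kernel sizes, pooling, channel-wise concatenation, and batch normalization.
    Kernel sizes and output channels are illustrative assumptions."""
    def __init__(self, in_channels: int, out_channels: int = 256,
                 kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels, k, padding=k // 2),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        ])
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)      # P_i = Pool(F_i)
        self.bn = nn.BatchNorm2d(out_channels * len(kernel_sizes))

    def forward(self, f):                                       # f: backbone feature map F
        scales = [self.pool(branch(f)) for branch in self.branches]
        f_ms = torch.cat(scales, dim=1)                         # F_ms = Concat(P_1..P_n)
        return self.bn(f_ms)                                    # F_norm = BN(F_ms)

# Example on a ResNet50-like feature map of shape (1, 2048, 7, 7).
f = torch.randn(1, 2048, 7, 7)
print(MultiScaleBlock(2048)(f).shape)  # torch.Size([1, 768, 3, 3])
```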
2.3. Attention Mechanism
Attention mechanisms enhance the performance of deep learning models by allowing them to focus on the most relevant parts of the input data. In our proposed model, we integrate both spatial and channel attention mechanisms to improve the accuracy and robustness of fire scene classification. These mechanisms enable the model to prioritize significant features, thereby reducing false alarms and enhancing the detection precision.
2.3.1. Spatial Attention Mechanism
The spatial attention mechanism helps the model to focus on important regions within the input image, such as areas where fires are likely to be present. This is crucial for fire detection because fires often occupy specific parts of an image. A convolution operation is applied to the feature maps $F$ to generate a single-channel attention map $M_s \in \mathbb{R}^{H \times W}$. This operation is followed by a sigmoid activation function to ensure that the attention values are between 0 and 1:

$$M_s = \sigma(\text{Conv}(F))$$

where $\sigma$ denotes the sigmoid activation function. The original feature maps $F$ are multiplied element-wise by the attention map $M_s$ to obtain the attention-weighted feature maps $F_s$:

$$F_s = F \odot M_s$$

where $\odot$ denotes element-wise multiplication. The resulting feature map $F_s$ emphasizes the regions deemed important by the attention mechanism.
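A minimal PyTorch sketch of this spatial attention step is given below; the convolution kernel size is an assumed hyperparameter rather than a value reported in the paper:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention step: a convolution reduces the feature
    map to a single-channel map M_s, a sigmoid bounds it to [0, 1], and the
    input is reweighted element-wise (kernel size is an assumption)."""
    def __init__(self, in_channels: int, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):
        m_s = self.sigmoid(self.conv(f))   # M_s = sigma(Conv(F)), shape (B, 1, H, W)
        return f * m_s                     # F_s = F (*) M_s, broadcast over channels

f = torch.randn(1, 768, 7, 7)
print(SpatialAttention(768)(f).shape)  # torch.Size([1, 768, 7, 7])
```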
2.3.2. Channel Attention Mechanism
The channel attention mechanism aims to highlight the most significant channels (or feature maps) in the input data, helping the model to prioritize the features that are most relevant for fire detection. Global average pooling is performed on the feature maps $F$ to generate a channel descriptor $z \in \mathbb{R}^{C}$, which reduces the spatial dimensions while retaining the number of channels:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{i,j,c}$$

for each channel $c$ in $F$. The channel descriptor $z$ is passed through two fully connected (FC) layers, followed by a sigmoid activation function, to generate the channel attention vector $M_c$:

$$M_c = \sigma\big(W_2\, \delta(W_1 z)\big)$$

where $W_1$ and $W_2$ are the weights of the fully connected layers and $\delta$ is the rectified linear unit activation function. The original feature maps $F$ are multiplied element-wise by the channel attention vector $M_c$ to obtain the attention-weighted feature maps $F_c$:

$$F_c = F \odot M_c$$
The resulting feature maps prioritize the channels with the most critical information for fire detection. By combining spatial and channel attention mechanisms, our model effectively highlights the most relevant features, both spatially and across channels. This dual-attention approach allowed the model to focus on key aspects of the input data, enhancing its ability to accurately classify fire scenes.
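A minimal PyTorch sketch of this channel attention step is given below; the reduction ratio between the two fully connected layers is an assumed hyperparameter:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention step: global average pooling yields a
    channel descriptor z, two FC layers with ReLU and sigmoid produce the
    attention vector M_c, and channels are reweighted (the reduction ratio
    is an assumption)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)              # z_c = mean over H, W
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, f):
        b, c, _, _ = f.shape
        z = self.gap(f).view(b, c)                      # channel descriptor z
        m_c = self.fc(z).view(b, c, 1, 1)               # M_c = sigma(W2 ReLU(W1 z))
        return f * m_c                                  # F_c = F (*) M_c

f = torch.randn(1, 768, 7, 7)
print(ChannelAttention(768)(f).shape)  # torch.Size([1, 768, 7, 7])
```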
2.4. Ensemble Learning
Ensemble learning is a powerful technique that combines the strengths of multiple models to improve overall performance and robustness. In our proposed fire scene classification model, we employed an ensemble of pre-trained CNNs to leverage their diverse feature extraction capabilities. This approach enhances the accuracy and generalization of the model, particularly in complex and varied environments. Three CNN architectures were selected for the ensemble: ResNet50, VGG16, and EfficientNet-B3. These models were chosen because of their complementary strengths in terms of feature extraction, and each was pre-trained on the ImageNet dataset, providing a strong starting point for feature extraction. For each input image, the selected pre-trained CNNs independently processed the image to extract feature maps. The feature extraction process was enhanced by incorporating attention mechanisms. Spatial attention was applied to the feature maps from each CNN to emphasize relevant spatial regions, focusing on areas where fires are likely to be present. Channel attention was applied to highlight the important channels in the feature maps, prioritizing the critical features necessary for accurate fire detection. Once the attention-weighted feature maps were obtained from each CNN, we concatenated these features to form a unified representation. This fusion process combines the strengths of each model to capture a wide range of features and patterns relevant to fire detection. The attention-weighted feature maps $F_1$, $F_2$, and $F_3$ from the three backbones are concatenated along the channel dimension as follows:

$$F_{\text{fused}} = \text{Concat}(F_1, F_2, F_3)$$

This results in a comprehensive feature map that integrates the spatial and channel information from all three networks. The fused feature map $F_{\text{fused}}$ is passed through a fully connected layer, followed by a softmax layer, for classification. The high-dimensional fused feature map is transformed into a lower-dimensional feature vector $V$ suitable for classification:

$$V = W F_{\text{fused}} + b$$

The fully connected layer learns to combine features from the different networks effectively, and the softmax layer produces the final classification scores indicating the likelihood of the presence of fire in the image, $S = \text{SoftMax}(V)$. The entire model was trained end-to-end using a dataset of labeled fire and non-fire images. We employed the Adam optimizer with a cross-entropy loss function to minimize classification errors. Techniques such as dropout and weight decay were used to prevent overfitting and ensure that the model generalizes well to unseen data. The model was evaluated on a validation set after each epoch to monitor performance and adjust hyperparameters when necessary. By integrating ensemble learning with attention mechanisms, the proposed model effectively captures and emphasizes the features relevant to fire scene classification. This comprehensive approach enhances the robustness and accuracy of the model and its ability to generalize across diverse datasets.
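The following sketch illustrates the fusion and classification stage described above; the channel counts correspond to torchvision's default backbone outputs, the number of classes matches the four categories used in our dataset, and the multiscale and attention stages are omitted for brevity, so this is a simplified illustration rather than the full model:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Sketch of the fusion stage: attention-weighted feature maps from the
    three backbones are pooled, concatenated along the channel dimension, and
    passed through a fully connected layer with softmax. Channel counts and
    class count are assumptions consistent with the text."""
    def __init__(self, channels=(2048, 512, 1536), num_classes: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(sum(channels), num_classes)

    def forward(self, f_resnet, f_vgg, f_effnet):
        feats = [self.pool(f).flatten(1) for f in (f_resnet, f_vgg, f_effnet)]
        fused = torch.cat(feats, dim=1)             # F_fused = Concat(F_1, F_2, F_3)
        logits = self.fc(fused)                     # V = W F_fused + b
        # During training, the raw logits would typically be fed to
        # nn.CrossEntropyLoss; softmax is shown here to match S = SoftMax(V).
        return torch.softmax(logits, dim=1)

head = FusionHead()
s = head(torch.randn(1, 2048, 7, 7), torch.randn(1, 512, 7, 7), torch.randn(1, 1536, 7, 7))
print(s.shape, s.sum().item())  # torch.Size([1, 4]), probabilities summing to ~1.0
```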
In analyzing the results, we present a structured approach. The Results section highlights all outcomes and metrics unique to the developed model, emphasizing its standalone performance. The Discussion section subsequently includes a comparative analysis with existing fire detection methods, showcasing the model’s competitive advantages and providing insights into its contributions to the field. This organization supports a thorough understanding of the model’s efficacy and relevance.
3. Experimental Setup and Results
This section outlines the experimental methods and settings used to evaluate the fire detection model. It describes the process of sourcing image data, detailing the types and resolutions of images used. The model’s performance is benchmarked against conventional CNN models and is further compared with SOTA fire detection systems, focusing on computational efficiency and speed.
3.1. Experimental Setup
3.1.1. Datasets
High-quality fire incident datasets are notably scarce, and publicly available datasets often lack the quality required for thoroughly evaluating and analyzing new methodologies. Consequently, we sourced our data predominantly from various online platforms to evaluate the model. The core dataset comprises images with a resolution of 224 × 224 pixels in four principal categories: wildfires, building fires, vehicle fires, and non-fire images. Each category included 3000 images for both training and testing purposes. The comprehensive dataset details for all 12,000 images are presented in Table 1 and illustrated in Figure 2.
This robust dataset allowed us to thoroughly test our model across various scenarios, supporting the reliability and accuracy of our findings. Consequently, the substantial data volume in each category strengthens the generalizability of our results and validates the effectiveness of the proposed fire detection framework. We developed and validated our model using these diverse datasets and secured the necessary permissions for the copyrighted images utilized in our study.
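A minimal sketch of how such a four-class image dataset could be loaded for training is shown below; the directory layout, folder names, and augmentation choices are illustrative assumptions rather than the exact pipeline used in our experiments:

```python
import torch
from torchvision import datasets, transforms

# Hypothetical directory layout: one folder per category
# (building_fire, non_fire, vehicle_fire, wildfire) under data/fire/train.
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    # ImageNet statistics, consistent with the pre-trained backbones.
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder("data/fire/train", transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

print(train_set.classes)  # e.g. ['building_fire', 'non_fire', 'vehicle_fire', 'wildfire']
```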
3.1.2. Implementation Details
All our experiments were conducted on a system equipped with an NVIDIA GeForce RTX 4060 Ti GPU. The testing hardware consisted of an Intel® Core™ i7-13700F CPU operating at 3.60 GHz. The software environment for testing included CUDA version 11.8, cuDNN version 8.8, and Python version 3.10. Our method was implemented using PyTorch [27], with the Adamax optimizer employed for training. Adamax, a variant of the Adam optimizer [28], introduces a learning rate cap that can be particularly beneficial for stabilizing the training process in deep learning applications. The optimizer's parameters were configured with a learning rate of 0.001, betas set to (0.9, 0.999), and epsilon set to 10⁻⁸.
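For reference, a minimal sketch of this optimizer configuration in PyTorch is shown below; the placeholder model stands in for the full hybrid network described in Section 2:

```python
import torch

# Placeholder model; the actual hybrid network is built as described in Section 2.
model = torch.nn.Linear(10, 4)

# Adamax configured with the hyperparameters reported above.
optimizer = torch.optim.Adamax(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),
    eps=1e-8,
)
```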
3.1.3. Evaluation Metrics
This section outlines the evaluation metrics employed when assessing fire detection models. Commonly used metrics in this domain include recall, precision, accuracy, and the $F_1$-score, each of which serves a distinct purpose. Recall is defined as the ratio of correctly detected positive samples to the total number of actual positive samples, whereas precision is the ratio of correctly identified positive samples to the total number of samples classified as positive. Accuracy, which is the most widely used performance metric, evaluates the overall effectiveness of a model by representing the ratio of correctly detected samples to the total number of samples in a dataset. The mathematical formulations for these metrics are provided in Equation (13):

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{13}$$

Here, true positive ($TP$) and true negative ($TN$) indicate the correct classifications of positive and negative cases, respectively, whereas false positive ($FP$) and false negative ($FN$) represent incorrect classifications. The $F_1$-score combines recall and precision into a single metric, as calculated using Equation (14):

$$F_1 = \frac{2 \times P \times R}{P + R} \tag{14}$$

where $R$ stands for recall and $P$ refers to precision.
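As a quick illustration of how these quantities are computed in practice (a minimal sketch for the binary fire versus non-fire case; the function name and example labels are ours, not part of any released code), recall, precision, accuracy, and the F1 score can be derived directly from the confusion-matrix counts:

```python
import numpy as np

def classification_metrics(y_true, y_pred, positive=1):
    """Recall, precision, accuracy, and F1 computed from predicted and true
    labels, following Equations (13) and (14)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, accuracy, f1

# Toy example with five labeled samples.
print(classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```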
3.2. Comparison with CNN Baseline Models
In our quantitative analysis, we evaluated the performance of our model against contemporary leading techniques in CNN-based fire detection. Utilizing four baseline datasets, including a dataset we developed, we showcased our model's aptness for classifying and pinpointing fire scenes. We performed an ablation study using various baseline CNN models to ensure a rigorous assessment. The performance metrics on existing datasets are detailed in Table 1, while the effects of modifications made to our model using our dataset are shown in Table 2. These findings confirm that our model surpasses current advanced methods in accuracy (ACC), precision (P), recall (R), and F1 score on both the existing and newly introduced datasets.
We conducted further ablation studies to identify the most effective configuration of our model, experimenting with different backbone models and assessing the impact of our dual fire attention approach on various deep features. These experimental modifications are documented in Table 2.
For backbone feature extraction and fire scene classification, we employed several standard CNN models, including MobileNetV2 [29], Xception [30], NASNetMobile [31], InceptionV3 [32], and ResNet50. Using our dataset, we enhanced these models with a dual fire-attention module to improve the precision of fire area classification and localization. As indicated in Table 2 (first block), integrating the dual fire-attention module with the baseline CNN models led to superior performance over models relying solely on deep features. This is because fire scene classification is inherently more complex than generic ImageNet classification; incorporating attention modules helps capture the unique features of objects more effectively, thus enhancing classification accuracy (ACC). Among the evaluated models, InceptionV3 paired with a SoftMax classifier yielded the highest performance owing to its superior feature extraction abilities. Additionally, Table 2 illustrates that the most favorable outcomes were produced by our model, whereas the least favorable were produced by Xception and NASNetMobile, attributed to their features' poor adaptability and limited relevance to fire scene classification. The table also reports the ACC of our streamlined model, designed to reduce the parameter count and size without compromising effectiveness. This compressed version of our model delivered the second-best performance across most metrics except precision, owing to some non-fire scenes being incorrectly identified as fires. Nonetheless, this model achieved a commendable F1 score, indicating a well-balanced classification capability for fire and non-fire scenes.
3.3. Comparison with SOTA Models
Our model underwent comparisons with various advanced methodologies in terms of computational complexity, model dimensions, and inference durations. The computational load and size of a model are pivotal factors influencing the inference time of any deep learning architecture. We assessed the time complexity of our model and its compressed version against five distinguished lightweight models: MAFire-Net [15], EFDNet [19], E-FireNet [21], DFAN [16], and Grad-CAM [22]. To gauge computational complexity, we examined the mega floating-point operations (MFLOPs) and the framework sizes in megabytes, as detailed in Table 3. Generally, a greater number of MFLOPs and larger model sizes were correlated, as documented. We conducted a feasibility study of both the original and condensed versions of our model. Our compression technique effectively reduced the model's size and MFLOP count by up to 50%, albeit with a minor reduction in ACC. Furthermore, we evaluated the inference times of our model and these leading-edge methods on multiple hardware platforms, including a standard CPU and an edge device, specifically a Raspberry Pi Model 4 equipped with a quad-core Cortex-A72 64-bit system-on-chip operating at 1.5 GHz and 4 GB of main memory.
The proposed model demonstrated superior performance across multiple metrics. With 122.58 MFLOPs, it exhibited lower computational complexity than MAFire-Net (125.71 MFLOPs), EFDNet (125.24 MFLOPs), DFAN (141.25 MFLOPs), and the Grad-CAM-based model (1500 MFLOPs). Only E-FireNet, with 300 MFLOPs, has a lower complexity, indicating that the proposed model is computationally efficient. In terms of model size, the proposed model is significantly more compact, occupying only 40.78 MB. This was the smallest size among the compared models, with the next smallest being E-FireNet at 43.52 MB. Other models, such as MAFire-Net, DFAN, and the Grad-CAM-based model, have considerably larger sizes of 74.43 MB, 83.63 MB, and 84.54 MB, respectively. EFDNet, although relatively small at 58.21 MB, still exceeds the proposed model in memory consumption. Regarding inference speed, the proposed model achieved the highest CPU throughput, at 16.84 FPS. Overall, the proposed model reduces computational complexity and model size while enhancing inference speed, making it a more efficient choice for fire scene classification than existing SOTA methods.
The results of our proposed hybrid deep learning model are presented in terms of key performance metrics, emphasizing its standalone achievements in fire scene classification. Metrics such as accuracy, precision, recall, and F1 score were used to evaluate the model's efficacy across different fire types (wildfires, building fires, vehicle fires, and non-fire scenes), while computational efficiency was assessed to determine the model's suitability for real-time applications. Our model achieved high accuracy, precision, and recall, outperforming conventional CNN models in classification accuracy. Additionally, introducing spatial and channel attention mechanisms contributed to a reduction in false positives, particularly in complex scenes, leading to improved F1 scores. These metrics underscore the robustness of our approach in diverse scenarios, confirming the effectiveness of multiscale feature extraction and ensemble learning in enhancing detection capabilities. Further analysis revealed that the proposed model's computational complexity remains manageable for real-time applications, particularly on high-performance hardware platforms. We reduced the parameter count through model compression techniques without a significant loss of performance, making the model adaptable to resource-constrained environments.
4. Discussion and Future Work
In this section, we contextualize our model performance by comparing it with existing SOTA fire detection models, emphasizing both the strengths and limitations of our approach. The development of this hybrid architecture is crucial to the advancement of deep learning models aimed at environmental scene classification, specifically in challenging and dynamic conditions such as fires. By integrating multiscale feature extraction with dual spatial and channel attention mechanisms, our model emphasizes relevant visual features while maintaining efficiency. This construction supports a highly adaptable framework that can be generalized to other types of scene detection, such as flood or smoke classification. Our use of ensemble learning—combining ResNet50, VGG16, and EfficientNet-B3 architectures—improves feature diversity, which is essential for accurately detecting a broad range of fire types and conditions. This architectural innovation can serve as a prototype for future research on hybrid models in AI, especially in real-time monitoring systems where computational efficiency and accuracy are paramount. This model represents a significant accomplishment in fire modeling, addressing longstanding challenges in false positive reduction, especially under varying environmental conditions. The implementation of dual attention mechanisms marks a leap forward in fire detection by allowing the model to discern between fire and non-fire elements with high precision, even in visually complex backgrounds where false alarms are common. The model’s ability to adapt to multiple fire types enhances its applicability in public safety and disaster management, providing more reliable data for emergency responders and reducing response times. This approach contributes a more nuanced understanding of fire dynamics to the field, offering potential improvements in areas such as wildfire monitoring, industrial safety, and urban fire prevention systems.
Our model demonstrates a competitive advantage over techniques such as MAFire-Net, EFDNet, and DFAN, particularly in terms of computational efficiency and model size. For instance, while MAFire-Net and DFAN are effective in real-time detection, they exhibit higher computational demands than our hybrid model, which achieved a lower MFLOP count and a more compact memory footprint. This efficiency allows for faster inference times, positioning our model as a viable option for edge devices and mobile platforms. Compared to Grad-CAM-based approaches, which rely heavily on interpretability through attention maps, our dual-attention mechanism provides a targeted improvement in accuracy while maintaining simplicity in the overall architecture. Furthermore, the ensemble learning approach utilized in our model offers a balanced fusion of diverse feature extraction methods, enhancing generalization across varied fire scenarios—something that single-architecture models, such as EFDNet, may struggle to achieve in complex, real-world settings. Beyond its immediate application in fire detection, this model design and performance contribute to the larger field of AI-driven modeling, especially in environmental monitoring and disaster response. The combination of ensemble learning and multiscale feature extraction serves as an innovative example of leveraging multiple neural networks to handle complex visual tasks under real-time constraints. The success of our approach in handling high-dimensional data with reduced computational demand has broad implications for AI research, suggesting pathways to more efficient yet powerful models applicable across fields such as autonomous driving, medical diagnostics, and security surveillance. The model’s adaptability, due to the inclusion of attention mechanisms, sets a standard for future AI-based models to balance interpretability and precision, fostering advances in transparent, reliable AI solutions.
Our model’s main strength lies in its ability to deliver high accuracy with relatively low computational complexity, making it suitable for real-time applications. The integration of multiscale feature extraction and dual attention mechanisms helps minimize false positives, particularly in dynamic and visually challenging environments. Moreover, our ensemble learning approach ensures that the model is robust across diverse datasets, offering improved adaptability in different fire detection contexts. Despite its advantages, our model’s dependence on high-performance hardware for optimal functioning limits its deployment in resource-constrained settings. Future research could explore more lightweight versions of the model, potentially utilizing federated learning or edge computing to enhance portability and efficiency on low-power devices. Additionally, while the dual-attention mechanism improves detection precision, it may require further optimization for less uniform fire scenes. The proposed model sets a new benchmark for fire detection accuracy and efficiency, yet also opens avenues for future improvements in adaptability and computational demands. Using ensemble learning improves its robustness and reliability, effectively handling the dynamic nature of fire scenes across diverse environments.
Despite its effectiveness, the dependency of the model on extensive computational resources and diverse training datasets may limit its deployment in resource-constrained settings. Future research should focus on developing more resource-efficient versions suitable for portable devices and exploring the incorporation of real-time data from IoT devices to boost detection capabilities. In addition, investigating new architectures and training techniques can further enhance the adaptability and performance of the model.
The implementation of this technology can significantly impact public safety by improving the accuracy of fire detection systems and reducing response times, thereby mitigating fire-related damage and enhancing preventive measures in fire-prone areas.
Future research will focus on optimizing the model architecture to reduce computational demands while maintaining high accuracy. Integration of emergent technologies, such as federated learning and edge computing, could potentially decentralize and accelerate processing capabilities, making real-time analytics more feasible on edge devices. Additionally, incorporating adaptive learning algorithms could improve the responsiveness of the model to new and evolving fire behaviors without requiring complete retraining. The achievements of this study validate the importance of constructing hybrid architectures in enhancing both accuracy and efficiency in critical applications like fire detection. The model not only demonstrates a sophisticated understanding of fire scene dynamics but also provides a scalable framework adaptable to other modeling tasks within environmental and safety-focused AI applications. By addressing the challenges unique to fire detection—such as false positives in dynamic scenes and real-time processing demands—this research establishes a foundation for further innovation in AI modeling. The significance of our work extends beyond fire detection; it sets a precedent in AI model design, positioning this study as a notable contribution to both the specific domain of fire modeling and the broader field of intelligent environmental monitoring.