Author Contributions
Conceptualization, X.W., L.C. and S.Z.; Methodology, X.W.; Software, X.W.; Formal analysis, X.W., L.C., S.Z., Y.J., L.T. and Y.Z.; Data curation, X.W.; Writing—original draft preparation, X.W.; Writing—review and editing, X.W., L.C., S.Z., Y.J., L.T. and Y.Z.; Visualization, X.W.; Supervision, L.C. and S.Z. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Four fire indicators for China: fire data from recent years, including the number of accidents, economic losses, deaths, and injuries.
Figure 2.
The YOLOv5 structure follows the same overall framework as the rest of the YOLO series. It comprises the Input terminal, Backbone, Neck, and Output terminal.
Figure 3.
Diagram of the FPN+PAN sampling structure.
Figure 4.
Block diagram of IoU.
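As a quick reference for the quantity in Figure 4, here is a minimal sketch of the IoU computation for two axis-aligned boxes (the function name and the (x1, y1, x2, y2) box format are illustrative):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = area A + area B - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two partially overlapping boxes: intersection 25, union 175
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```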
Figure 5.
The overall network structure of CAGSA-YOLO. White blocks are parts of the original model that remain unchanged; gray blocks are the parts improved in this study.
Figure 6.
The overall framework of CARAFE (Content-Aware Reassembly of Features). CARAFE consists of a kernel prediction module and a content-aware reassembly module.
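To make the two modules concrete, the following is a minimal PyTorch sketch of CARAFE at the usual scale factor of 2. It follows the published formulation (channel compression, kernel prediction with an encoder of kernel size k_enc, softmax normalization, and reassembly over a k_up × k_up neighborhood), but the class and argument names are illustrative rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Sketch of CARAFE: kernel prediction + content-aware reassembly."""
    def __init__(self, c, c_mid=64, scale=2, k_enc=3, k_up=5):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        # Kernel prediction module: compress channels, then predict one
        # k_up x k_up reassembly kernel for every upsampled position.
        self.compress = nn.Conv2d(c, c_mid, kernel_size=1)
        self.encoder = nn.Conv2d(c_mid, (scale * k_up) ** 2,
                                 kernel_size=k_enc, padding=k_enc // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # Predict and normalize the reassembly kernels.
        ker = F.pixel_shuffle(self.encoder(self.compress(x)), s)  # (b, k^2, sh, sw)
        ker = F.softmax(ker, dim=1)
        # Gather the k x k source neighborhood around each position ...
        nbr = F.unfold(x, k, padding=k // 2).view(b, c * k * k, h, w)
        # ... and repeat it to the target resolution (nearest mapping).
        nbr = F.interpolate(nbr, scale_factor=s, mode="nearest")
        nbr = nbr.view(b, c, k * k, h * s, w * s)
        # Content-aware reassembly: weighted sum over each neighborhood.
        return (ker.unsqueeze(1) * nbr).sum(dim=2)  # (b, c, sh, sw)

# Example: upsample a 20 x 20 feature map to 40 x 40.
y = CARAFE(c=64)(torch.randn(1, 64, 20, 20))
print(y.shape)  # torch.Size([1, 64, 40, 40])
```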
Figure 7.
Ghost convolution frame diagram. (a) Ordinary convolution; (b) Ghost convolution. In Ghost convolution, part of the feature map is generated by an ordinary convolution with fewer kernels, the remaining part is generated by cheap linear operations, and the two parts are concatenated, increasing the channels and expanding the features at low cost.
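A minimal PyTorch sketch of this idea, assuming the common YOLOv5-style variant in which half of the output channels come from the primary convolution and the cheap operation is a 5 × 5 depthwise convolution (names are illustrative):

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Sketch of Ghost convolution: primary conv + cheap depthwise 'ghosts'."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_ = c_out // 2  # intrinsic channels from the primary convolution
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())
        # Cheap operation: a 5x5 depthwise conv generates the ghost maps
        self.cheap = nn.Sequential(
            nn.Conv2d(c_, c_, 5, 1, 2, groups=c_, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        # Concatenate intrinsic and ghost feature maps to reach c_out channels
        return torch.cat([y, self.cheap(y)], dim=1)

# Example: expand 64 -> 128 channels at roughly half the usual convolution cost
print(GhostConv(64, 128)(torch.randn(1, 64, 40, 40)).shape)  # (1, 128, 40, 40)
```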
Figure 8.
Overall framework of SA (shuffle attention). The channel dimension is first grouped into multiple sub-features; the sub-features are then processed by the complementary channel and spatial attention modules and recombined through the shuffle unit.
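A minimal PyTorch sketch of this grouping-and-shuffle scheme, following the published SA formulation with an assumed group count of 8 (class and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Sketch of SA: per-group channel + spatial attention, then shuffle."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)  # channels per branch within a group
        self.cw = nn.Parameter(torch.zeros(1, c, 1, 1))  # channel-branch scale
        self.cb = nn.Parameter(torch.ones(1, c, 1, 1))   # channel-branch bias
        self.sw = nn.Parameter(torch.zeros(1, c, 1, 1))  # spatial-branch scale
        self.sb = nn.Parameter(torch.ones(1, c, 1, 1))   # spatial-branch bias
        self.gn = nn.GroupNorm(c, c)                     # spatial statistics
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b * self.groups, -1, h, w)
        xc, xs = x.chunk(2, dim=1)  # split each group into two branches
        # Channel attention: global average pooling -> affine -> sigmoid gate
        xc = xc * self.sigmoid(xc.mean((2, 3), keepdim=True) * self.cw + self.cb)
        # Spatial attention: group norm -> affine -> sigmoid gate
        xs = xs * self.sigmoid(self.gn(xs) * self.sw + self.sb)
        x = torch.cat([xc, xs], dim=1).view(b, c, h, w)
        # Channel shuffle so information can flow across groups
        return x.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

# Example: attention over a 64-channel feature map, shape is preserved
y = ShuffleAttention(64)(torch.randn(1, 64, 40, 40))
print(y.shape)  # torch.Size([1, 64, 40, 40])
```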
Figure 9.
Common fire safety tools. These tools are added to the fire detection dataset and integrated into a new fire safety dataset.
Figure 10.
Category labeling. The fire safety dataset was expanded to eight categories.
Figure 11.
LabelImg software annotation example.
Figure 12.
Number of labels and label borders.
Figure 13.
Label position and label size.
Figure 14.
Comparison of the model before and after improvement. (a) shows that although the mAP of CAGSA-YOLO rises more slowly than that of the original model early in training, it reaches a higher final accuracy. (b) shows that the loss of the improved model decreases faster and converges to a smaller value, indicating more accurate detection.
Figure 15.
The original YOLOv5s model for detection.
Figure 16.
Improved CAGSA-YOLO model for detection.
Table 1.
Main symbols’ abbreviations and definitions.
| Symbol | Definition |
| --- | --- |
| P | Precision |
| R | Recall |
| mAP | Mean average precision |
| Param | Number of model parameters |
| GFLOPs | Giga floating-point operations per second |
| CARAFE | Content-Aware Reassembly of Features |
| $k_{encoder}$ | Controls the kernel size of the encoder part |
| $k_{up}$ | Controls the kernel size of the reassembly (recombiner) part |
| SA | Shuffle attention |
| CAGSA-YOLO | YOLOv5s + CARAFE + scale + Ghost + SA |
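For reference, the standard definitions behind P, R, and mAP, with TP, FP, and FN the numbers of true positives, false positives, and false negatives, and N the number of classes:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_0^1 P(R)\,\mathrm{d}R, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i
```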
Table 2.
Anchor assignment.
| Feature map size | P3/8 (80 × 80) | P4/16 (40 × 40) | P5/32 (20 × 20) |
| --- | --- | --- | --- |
| Receptive field size | Medium | Large | Larger |
| Anchor frame sizes | [10, 13], [16, 30], [33, 23] | [30, 61], [62, 45], [59, 119] | [116, 90], [156, 198], [373, 326] |
Table 3.
Comparison of upsampling methods.
| Method | AP | AP50 | AP75 | APS | APM | APL |
| --- | --- | --- | --- | --- | --- | --- |
| **Faster R-CNN** | | | | | | |
| Nearest | 36.5 | 58.4 | 39.3 | 21.3 | 40.3 | 47.2 |
| Bilinear | 36.7 | 58.7 | 39.7 | 21.0 | 40.5 | 47.5 |
| Deconv | 36.4 | 58.2 | 39.2 | 21.3 | 39.9 | 46.5 |
| CARAFE | 37.8 | 60.1 | 40.8 | 23.2 | 41.2 | 48.2 |
| **Mask R-CNN** | | | | | | |
| Nearest | 32.7 | 55.0 | 34.8 | 17.7 | 35.9 | 44.4 |
| Bilinear | 34.2 | 55.9 | 36.4 | 18.5 | 37.5 | 46.2 |
| Deconv | 34.2 | 55.5 | 36.3 | 17.6 | 37.8 | 46.7 |
| CARAFE | 34.7 | 56.2 | 37.1 | 18.2 | 37.9 | 47.5 |
Table 4.
Anchor assignment with N = 12: the model is divided into four detection layers, and the original 9 anchors are expanded to 12.
| Feature map size | 160 × 160 | 80 × 80 | 40 × 40 | 20 × 20 |
| --- | --- | --- | --- | --- |
| Receptive field size | Small | Medium | Large | Larger |
| Anchor frame sizes (N = 12) | [5, 6], [8, 14], [15, 11] | [10, 13], [16, 30], [33, 23] | [30, 61], [62, 45], [59, 119] | [116, 90], [156, 198], [373, 326] |
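For illustration, the 12 anchors of Table 4 written as (width, height) pairs per detection layer; in YOLOv5 these would be listed under the anchors entry of the model YAML. Assuming the usual 640 × 640 input, the 160 × 160 map corresponds to a stride of 4:

```python
# Anchor boxes (width, height in pixels) for the four detection layers.
anchors = [
    [(5, 6), (8, 14), (15, 11)],          # 160 x 160: smallest objects
    [(10, 13), (16, 30), (33, 23)],       # 80 x 80:  medium objects
    [(30, 61), (62, 45), (59, 119)],      # 40 x 40:  large objects
    [(116, 90), (156, 198), (373, 326)],  # 20 x 20:  largest objects
]
```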
Table 5.
Comparison of different attention mechanisms. The data, partially quoted from Zhang et al.'s article, reflect the superiority of SA.
| Attention method | Backbone | Param | GFLOPs |
| --- | --- | --- | --- |
| SE-Net [26] | ResNet-50 | 28.088 M | 4.130 |
| CBAM [27] | ResNet-50 | 28.090 M | 4.139 |
| ECA-Net [28] | ResNet-50 | 25.557 M | 4.127 |
| SA-Net [25] | ResNet-50 | 25.557 M | 4.125 |
Table 6.
Initialization parameter settings. In training, the batch size (bs) is set to 32, the number of epochs to 300, the initial learning rate (lr0) to 0.01, the final OneCycleLR learning rate (lrf) to 0.2, the SGD momentum to 0.937, and the optimizer weight decay (weight_decay) to 0.0005.
| bs | Epoch | lr0 | lrf | Momentum | Weight_decay |
| --- | --- | --- | --- | --- | --- |
| 32 | 300 | 0.01 | 0.2 | 0.937 | 0.0005 |
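As a sketch of how Table 6 maps onto YOLOv5's configuration (lr0, lrf, momentum, and weight_decay are keys in the hyperparameter YAML, while batch size and epochs are training arguments):

```python
# Table 6 settings as the matching YOLOv5 hyperparameter keys.
hyp = {
    "lr0": 0.01,             # initial learning rate (SGD)
    "lrf": 0.2,              # final OneCycleLR learning rate factor
    "momentum": 0.937,       # SGD momentum
    "weight_decay": 0.0005,  # optimizer weight decay
}
train_args = {"batch_size": 32, "epochs": 300}
```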
Table 7.
Study of the CARAFE parameters. The upsampling in YOLOv5s was replaced with CARAFE, and test experiments were performed for $k_{encoder}$ and $k_{up}$ under different parameter settings.
| $k_{encoder}$ | $k_{up}$ | P | R | mAP | Param | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 3 | 90.9% | 80.2% | 83.8% | 7,102,173 | 16.4 |
| 1 | 5 | 92.3% | 79.5% | 84.1% | 7,110,493 | 16.4 |
| 1 | 7 | 91.7% | 79.9% | 83.4% | 7,122,973 | 16.4 |
| 3 | 3 | 92.5% | 79.2% | 83.9% | 7,139,037 | 16.5 |
| 3 | 5 | 93.0% | 78.8% | 83.5% | 7,212,893 | 16.6 |
| 3 | 7 | 90.3% | 80.2% | 83.4% | 7,323,677 | 16.8 |
| 5 | 5 | 91.2% | 79.8% | 83.2% | 7,417,693 | 17.0 |
| 5 | 7 | 92.8% | 78.9% | 83.7% | 7,725,085 | 17.6 |
| 7 | 7 | 92.2% | 79.8% | 83.7% | 8,327,197 | 18.8 |
Table 8.
Comparison of attention mechanisms. The improved YOLOv5s model is combined with each attention mechanism in turn to verify the superiority of SA.
| Attention method | P | R | mAP | Param | GFLOPs |
| --- | --- | --- | --- | --- | --- |
| SE-Net [26] | 88.8% | 79.5% | 84.8% | 6,680,552 | 18.9 |
| CBAM [27] | 88.4% | 79.8% | 84.0% | 6,680,650 | 18.9 |
| ECA-Net [28] | 88.3% | 80.0% | 84.2% | 6,648,045 | 19.0 |
| SA-Net [25] | 89.7% | 80.1% | 85.1% | 6,647,976 | 18.9 |
Table 9.
Comparison of algorithm experiments. The improved algorithm achieves a mean average precision 0.9% higher than YOLOv3, 8% higher than YOLOv3-tiny, 11.1% higher than YOLOv4-tiny, and 1.7% higher than YOLOv5s.
| Model | Size (MB) | P | R | mAP | Param | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv3 | 123.5 | 90.1% | 81.1% | 84.2% | 61,535,125 | 154.7 |
| YOLOv3-tiny | 17.4 | 79.3% | 76.0% | 77.1% | 8,682,862 | 12.9 |
| YOLOv4-tiny | 23.6 | 57.2% | 77.1% | 74.0% | 5,890,286 | 16.2 |
| YOLOv5s | 14.4 | 93.7% | 78.5% | 83.4% | 7,072,789 | 16.3 |
| CAGSA-YOLO | 13.8 | 89.7% | 80.1% | 85.1% | 6,647,976 | 18.9 |
Table 10.
Comparison of ablation experiments. The improvements are added step by step to verify that each method is effective and that the model is genuinely improved.
| Model | Size (MB) | P | R | mAP | Param | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv5s | 14.4 | 93.7% | 78.5% | 83.4% | 7,072,789 | 16.3 |
| YOLOv5s + CARAFE | 14.5 | 92.3% | 79.5% | 84.1% | 7,110,493 | 16.4 |
| YOLOv5s + CARAFE + scale | 15.0 | 89.7% | 79.7% | 84.5% | 7,265,704 | 19.3 |
| YOLOv5s + CARAFE + scale + Ghost | 13.8 | 87.7% | 81.4% | 84.5% | 6,647,784 | 18.9 |
| CAGSA-YOLO (YOLOv5s + CARAFE + scale + Ghost + SA) | 13.8 | 89.7% | 80.1% | 85.1% | 6,647,976 | 18.9 |