The CNN-based fire detection approaches mentioned in the previous section only fine-tune existing CNNs such as GoogLeNet, SqueezeNet, VGG, and MobileNet. The major drawback of these approaches is their large number of layers and large model size, which makes it difficult to deploy the trained models onto resource-constrained embedded devices such as the Raspberry Pi or Jetson Nano at a reasonable frame rate. Moreover, existing approaches do not support localizing the fire regions in real time. A unified framework that both detects and locates the fire in real time is therefore desirable.
Therefore, we propose a cost-effective fire detection and localization framework that can be deployed on embedded platforms to detect and localize indoor fires in real time. We first present the overall architecture of the system and then describe the fire detection module and the fire localization module in detail.
3.2. Fire Detection Framework
Inspired by the fully convolutional one-stage object detection model [20], we propose a CNN architecture for fire detection and segmentation.
Figure 2 shows the CNN architecture, which mainly contains three parts: the backbone, the path aggregation network, and the detection head. The backbone is used as a feature-extracting network. During training, we use a pre-trained backbone to extract features from the input images. The features are then passed into the path aggregation network, which aggregates features from different layers. Finally, the aggregated features are passed into the detection head to predict the bounding boxes and segmentation masks. The detection head comprises two parts: the classification head, which predicts the class of the object (fire or background), and the regression head, which predicts the bounding boxes of the fire. The segmentation head predicts the segmentation masks of the fire, which are used to calculate the coordinates of the fire. In object detection, pre-defined anchor boxes are used in frameworks such as Faster R-CNN [21], YOLO [22], and SSD [23]. Anchor-free methods have been proposed to overcome the limitations of anchor-based methods: the bounding boxes are predicted directly from the feature maps. In this paper, we use the fully convolutional one-stage approach to predict the bounding boxes, which reduces the number of hyper-parameters and the computational complexity.
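To make the anchor-free formulation concrete, the sketch below decodes boxes in the FCOS-like style, where each feature-map location predicts its distances to the four box sides; the tensor shapes and names are illustrative assumptions rather than the exact implementation used in this paper.

```python
import torch

def decode_anchor_free_boxes(distances: torch.Tensor, stride: int) -> torch.Tensor:
    """Decode per-location (left, top, right, bottom) distance predictions
    into (x1, y1, x2, y2) boxes in image coordinates.

    distances: tensor of shape (H, W, 4) produced by the regression head
    stride:    down-sampling factor of this feature level (e.g., 8, 16, 32)
    """
    h, w, _ = distances.shape
    # Centers of the feature-map cells projected back to image coordinates.
    ys = (torch.arange(h, dtype=torch.float32) + 0.5) * stride
    xs = (torch.arange(w, dtype=torch.float32) + 0.5) * stride
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")

    l, t, r, b = distances.unbind(dim=-1)
    x1, y1 = cx - l, cy - t
    x2, y2 = cx + r, cy + b
    return torch.stack([x1, y1, x2, y2], dim=-1)  # (H, W, 4)
```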
We evaluate four networks, namely EfficientNet, ShuffleNet, RepVGG, and CSPNet, as the feature-extracting backbone. The feature extractor encodes the network's input into a specific feature representation, a technique commonly used in image segmentation and object detection tasks. The four candidate backbones are as follows:
EfficientNet: With the fast development of embedded systems, CNNs are nowadays commonly developed at a fixed resource budget and then scaled up for better accuracy if more resources become available. To identify the effect of the network depth $d$, width $w$, and resolution $r$ on the model accuracy, Tan and Le formalized this optimization problem as shown in Equation (1) [5]. Here $\mathcal{N}$ denotes the CNN, which can be represented as a list of composing layers. The optimization problem seeks the maximum accuracy that the CNN architecture $\mathcal{N}$ can achieve under the restrictions of target memory and target FLOPs. A neural architecture search is used to find the best EfficientNet architecture. This paper uses EfficientNet-Lite0 as the backbone model and uses the second, fourth, and sixth stages as output to the Feature Pyramid Network (FPN).
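As an illustration of how multi-scale backbone features can be exposed to the FPN, the sketch below uses the timm library's efficientnet_lite0 with features_only=True; the chosen out_indices correspond to strides 8, 16, and 32 and are an assumption, not necessarily the exact stage mapping used in this paper.

```python
import timm
import torch

# EfficientNet-Lite0 backbone returning intermediate feature maps only.
# out_indices=(2, 3, 4) selects the stride-8, -16, and -32 feature maps,
# a common choice for feeding a feature pyramid network.
backbone = timm.create_model(
    "efficientnet_lite0",
    pretrained=True,
    features_only=True,
    out_indices=(2, 3, 4),
)

x = torch.randn(1, 3, 320, 320)         # dummy input image
c3, c4, c5 = backbone(x)                # multi-scale features for the FPN
print([f.shape for f in (c3, c4, c5)])  # strides 8 / 16 / 32
```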
ShuffleNet: The ShuffleNet architecture utilizes two new operations, pointwise group convolution and channel shuffle, to reduce the computation cost while maintaining model accuracy. The architecture is mainly composed of a stack of ShuffleNet units grouped into three stages. The image is first fed into a convolution layer with a stride of 2; the output is then fed into the subsequent stages, with the number of output channels doubled at each stage. Results show that ShuffleNet can achieve better model accuracy than comparable models on the ARM platform [6].
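A minimal sketch of the channel shuffle operation is shown below, assuming an NCHW tensor layout; it interleaves the channels across groups so that information can flow between the grouped convolutions.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Shuffle channels across groups (NCHW layout)."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    # Reshape to (N, groups, C // groups, H, W), swap the group axes,
    # then flatten back so channels from different groups are interleaved.
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# Example: 8 channels shuffled across 2 groups.
out = channel_shuffle(torch.randn(1, 8, 16, 16), groups=2)
```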
RepVGG: RepVGG has a VGG-style architecture in which every layer takes the output of its only preceding layer as input and feeds its output into its only following layer. Only convolution (Conv) and ReLU layers are used in the model. RepVGG has five stages; the output is fed into the following stages, and the output channels are doubled at each stage. A sketch of one such stage is given after this paragraph.
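The sketch below builds one plain VGG-style stage, assuming 3x3 convolutions; each stage starts with a stride-2 layer, and the channel widths (which double from stage to stage) are illustrative rather than the configuration used in this paper.

```python
import torch.nn as nn

def plain_stage(in_ch: int, out_ch: int, num_blocks: int) -> nn.Sequential:
    """A plain VGG-style stage: the first layer halves the resolution
    (stride 2); every layer only sees the output of the layer before it."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
              nn.ReLU(inplace=True)]
    for _ in range(num_blocks - 1):
        layers += [nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# Five stages with the channel count doubling from stage to stage (illustrative widths).
body = nn.Sequential(
    plain_stage(3, 48, 1),
    plain_stage(48, 96, 2),
    plain_stage(96, 192, 4),
    plain_stage(192, 384, 4),
    plain_stage(384, 768, 1),
)
```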
CSPNet: CSPNet is a backbone designed to reduce heavy inference computation by integrating the feature maps from the beginning and the end of a network stage [8]. Within a stage, only part of the feature maps passes through the ResBlocks [24], while the rest bypasses them and is merged back at the end of the stage, which avoids duplicated gradient computation. Table 1 shows our custom configuration for the CSPNet used in this paper.
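A minimal sketch of this cross-stage structure is given below, assuming a simple residual block; half of the channels pass through the ResBlocks while the other half bypass them and are concatenated at the end of the stage.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """A simple 3x3 residual block used inside the CSP stage."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class CSPStage(nn.Module):
    """Cross-stage partial stage: one half of the channels goes through the
    ResBlocks, the other half is carried over and merged by concatenation."""
    def __init__(self, ch: int, num_blocks: int):
        super().__init__()
        half = ch // 2
        self.blocks = nn.Sequential(*[ResBlock(half) for _ in range(num_blocks)])
        self.merge = nn.Conv2d(ch, ch, 1)  # 1x1 transition after concatenation

    def forward(self, x):
        a, b = x.chunk(2, dim=1)           # split channels into two parts
        return self.merge(torch.cat([self.blocks(a), b], dim=1))

out = CSPStage(ch=64, num_blocks=2)(torch.randn(1, 64, 40, 40))
```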
Feature maps from different layers of the backbone are then fed into the FPN, which outputs proportionally sized feature maps for the different levels in a fully convolutional fashion. The feature pyramid network builds feature pyramids inside the deep convolutional network and is widely used in object detection tasks. This paper uses the Path Aggregation Network (PAN) [25] as the FPN to build the feature pyramids. The construction of PAN mainly involves two pathways. The bottom-up pathway computes a feature hierarchy consisting of feature maps at several scales. The top-down pathway hallucinates higher-resolution features by upsampling spatially coarser feature maps from higher pyramid levels; these features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up and top-down pathways. The PAN configuration used in this paper, including the output stages, the activation function, and the PAN input and output settings, is listed in Table 2.
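The sketch below outlines a simplified PAN neck with a top-down and a bottom-up pass; the 96-channel width, nearest-neighbour upsampling, and strided convolutions for downsampling are assumptions made for illustration, not necessarily the configuration in Table 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplePAN(nn.Module):
    """Top-down (FPN) pass followed by a bottom-up (PAN) pass."""
    def __init__(self, in_channels=(40, 112, 320), out_ch=96):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.downsample = nn.ModuleList(
            [nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1) for _ in in_channels[:-1]]
        )

    def forward(self, feats):                      # feats: [C3, C4, C5], strides 8/16/32
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample coarser maps and add them to finer ones.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        # Bottom-up pathway: downsample finer maps and add them to coarser ones.
        outs = [laterals[0]]
        for i in range(1, len(laterals)):
            outs.append(laterals[i] + self.downsample[i - 1](outs[-1]))
        return outs                                # [P3, P4, P5]

pan = SimplePAN()
p3, p4, p5 = pan([torch.randn(1, 40, 40, 40),
                  torch.randn(1, 112, 20, 20),
                  torch.randn(1, 320, 10, 10)])
```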
In image classification and localization, the focal loss (FL) is usually used to measure classification accuracy, while the Dirac delta distribution is used to model the box locations in object detection. Li et al. proposed the Generalized Focal Loss (GFL) to learn a joint representation of localization quality and classification [26]. The focal loss is typically used for one-stage detectors and supports only discrete labels such as 0 and 1, whereas the proposed joint representation of localization quality and classification takes continuous values ranging from 0 to 1. This paper uses the two forms of GFL, namely the Quality Focal Loss (QFL) and the Distribution Focal Loss (DFL), as the loss functions to train the model.
GFL uses a float target $y \in [0, 1]$ in place of the standard one-hot category label. Here, $y = 0$ denotes a negative sample, while $0 < y \le 1$ denotes a positive sample with quality score $y$. FL consists of two parts: the cross-entropy part $-\log(p_t)$ and a dynamic scaling factor $(1 - p_t)^{\gamma}$. QFL extends FL to support the joint representation of localization quality and classification, as presented in Equation (2). The cross-entropy part is extended to its complete form $-[(1 - y)\log(1 - \sigma) + y\log(\sigma)]$, where $\sigma$ is the output of the multiple binary classifications with the sigmoid operator, and the scaling factor is extended to $|y - \sigma|^{\beta}$, the absolute distance between the estimation $\sigma$ and the continuous label $y$. In this paper, we use QFL in this form to train the classification branch of the model.
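A sketch of QFL under these definitions is shown below; sigma is the sigmoid output of the classification head, the target is the continuous quality score in [0, 1], and beta = 2 is the value recommended in [26] (the exact setting used in this paper may differ).

```python
import torch
import torch.nn.functional as F

def quality_focal_loss(logits: torch.Tensor, target: torch.Tensor,
                       beta: float = 2.0) -> torch.Tensor:
    """Quality Focal Loss: binary cross entropy against a continuous quality
    target, scaled by |y - sigma|^beta so the loss focuses on poorly
    estimated examples."""
    sigma = logits.sigmoid()
    # Complete form of the cross-entropy part with a soft target y in [0, 1].
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    # Dynamic scaling factor: absolute distance between estimate and target.
    scale = (target - sigma).abs().pow(beta)
    return (scale * ce).sum()

loss = quality_focal_loss(torch.randn(4, 1),
                          torch.tensor([[0.0], [0.3], [0.7], [1.0]]))
```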
DFL handles the regression of the bounding boxes, whose targets are the relative offsets from the location of the object to the four sides of its bounding box. Instead of modelling the target as a Dirac delta distribution, DFL learns a general discrete distribution over the regression range. First, a sampling technique converts the continuous domain into a discrete one: if the range is discretized into $\{y_0, y_1, \ldots, y_n\}$, the estimated regression value is $\hat{y} = \sum_{i=0}^{n} P(y_i)\, y_i$, where $\sum_{i=0}^{n} P(y_i) = 1$. The DFL for the two successive values $y_i$ and $y_{i+1}$ that enclose a continuous label $y$ is shown in Equation (3). As seen from the equation, DFL forces the network to rapidly focus on the values near the label by enlarging the probabilities of $y_i$ and $y_{i+1}$.
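A sketch of DFL under this discretization is shown below; it distributes the continuous target between its two neighbouring bins and applies cross entropy with the corresponding linear weights, following the formulation in [26] (the bin count and target units are illustrative).

```python
import torch
import torch.nn.functional as F

def distribution_focal_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Distribution Focal Loss.

    logits: (N, n_bins) scores over the discretized range {0, 1, ..., n_bins - 1}
    target: (N,) continuous regression targets, assumed to lie inside the range
    """
    left = target.floor().long()        # y_i, the bin just below the target
    right = left + 1                    # y_{i+1}, the bin just above
    w_left = right.float() - target     # weight grows as the target approaches y_i
    w_right = target - left.float()     # weight grows as the target approaches y_{i+1}
    loss = (F.cross_entropy(logits, left, reduction="none") * w_left
            + F.cross_entropy(logits, right, reduction="none") * w_right)
    return loss.mean()

# Example: 8 bins, targets expressed in units of the feature stride.
loss = distribution_focal_loss(torch.randn(4, 8), torch.tensor([0.2, 3.7, 5.5, 6.9]))
```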
For the detection head, we therefore use QFL and DFL as the loss functions. The other configurations are shown in Table 3: the number of input channels and feature channels is set to 96, the strides are set to [8, 16, 32], and batch normalization is used as the normalization approach.
3.3. Fire Localization Framework
The fire detection framework trains a model that can be deployed to the embedded system for real-time surveillance-video processing, and the inference module calculates the position of the fire in the current image in pixels. Next, we propose a two-step real-world fire localization framework that maps the position of the fire in the frames of the two surveillance cameras to its location in the real-world setting.
Figure 3 shows our two-step framework for real-time fire localization. The first step is camera resectioning and geometric camera calibration. Camera resectioning estimates the parameters of a pinhole camera model that approximates the camera which produced a given photograph or video. Single-camera resectioning yields the camera's intrinsic parameters, such as the focal length and the principal point. Here we use $f_x$ and $f_y$ to denote the focal lengths in pixels and $c_x$ and $c_y$ to denote the principal point of the camera. After single-camera resectioning, we perform stereo calibration to obtain the extrinsic parameters, which describe the coordinate transformation from 3D world coordinates to 3D camera coordinates. Here $R$ is the rotation matrix used to perform the rotation in Euclidean space, and $T$ is the position of the origin of the world coordinate system expressed in the camera-centered coordinate system. Typically, the rotation matrix $R$ is a $3 \times 3$ matrix and the translation vector $T$ is a $3 \times 1$ matrix. Let $(u, v)$ be the position of the fire in the video in pixels, and $(X, Y, Z)$ be the location of the fire in the real-world setting. The relationship between $(u, v)$ and $(X, Y, Z)$ satisfies Equation (4), where $s$ is the projective transformation's arbitrary scaling.
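For reference, the pinhole-camera relation that Equation (4) expresses has the standard form below (written here in the usual OpenCV convention; the notation in Equation (4) may differ slightly):

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= \underbrace{\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{\text{intrinsics}}
  \underbrace{\begin{bmatrix} R \mid T \end{bmatrix}}_{\text{extrinsics}}
  \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
```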
Furthermore, actual lenses usually exhibit radial and tangential distortion. After the single-camera resectioning and the stereo calibration of the two cameras, we compute the rectification transforms for each calibrated stereo camera, which gives us the radial and tangential distortion coefficients. Accounting for these coefficients makes the localization framework more accurate.
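A sketch of this first step with OpenCV is given below; it assumes that chessboard calibration images have already been processed so that objpoints (3D board corners) and imgpoints_left / imgpoints_right (their pixel locations in each camera) are available, and that image_size is the frame resolution. These names, and the choice of OpenCV functions, are illustrative assumptions.

```python
import cv2

def calibrate_stereo_pair(objpoints, imgpoints_left, imgpoints_right, image_size):
    """Step 1: intrinsic calibration of each camera, then stereo calibration
    and rectification. Returns intrinsics, distortion coefficients, the
    extrinsics (R, T), and the rectified projection matrices."""
    # Intrinsics (f_x, f_y, c_x, c_y) and distortion coefficients per camera.
    _, K1, d1, _, _ = cv2.calibrateCamera(objpoints, imgpoints_left, image_size, None, None)
    _, K2, d2, _, _ = cv2.calibrateCamera(objpoints, imgpoints_right, image_size, None, None)

    # Extrinsics between the two cameras: rotation R (3x3) and translation T (3x1).
    _, K1, d1, K2, d2, R, T, _, _ = cv2.stereoCalibrate(
        objpoints, imgpoints_left, imgpoints_right,
        K1, d1, K2, d2, image_size, flags=cv2.CALIB_FIX_INTRINSIC)

    # Rectification transforms and projection matrices for each camera.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, d1, K2, d2, image_size, R, T)
    return K1, d1, K2, d2, R, T, P1, P2, Q
```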
The first step of the fire localization framework focuses on calibrating the two cameras. In the second step, we use an anchor point to calculate the relative coordinates of the fire position with respect to that anchor point. First, we determine a fixed anchor point and find its pixel coordinates. We then calculate the real-world coordinates of the anchor point. The inference module of the trained model detects and localizes the fire in each frame and obtains its pixel area, which is then converted to the real-world coordinates of the fire. The relative coordinates of the fire are given by Equation (5).
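A sketch of this second step is shown below; it assumes the rectified projection matrices P1 and P2 from the calibration step, recovers 3D points by stereo triangulation, and treats Equation (5) as a simple coordinate difference with respect to the anchor. The function names and inputs are illustrative rather than the paper's exact implementation.

```python
import cv2
import numpy as np

def to_world(P1, P2, pt_left, pt_right):
    """Triangulate one pixel correspondence (a point in each camera) into 3D."""
    pts4d = cv2.triangulatePoints(P1, P2,
                                  np.float32(pt_left).reshape(2, 1),
                                  np.float32(pt_right).reshape(2, 1))
    return (pts4d[:3] / pts4d[3]).ravel()          # homogeneous -> (X, Y, Z)

def fire_relative_to_anchor(P1, P2, anchor_px, fire_px):
    """Relative real-world coordinates of the fire with respect to the anchor.

    anchor_px / fire_px: ((u, v) in the left image, (u, v) in the right image).
    """
    anchor_xyz = to_world(P1, P2, *anchor_px)
    fire_xyz = to_world(P1, P2, *fire_px)
    return fire_xyz - anchor_xyz                   # relative coordinates (difference from the anchor)
```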