1. Introduction
Ship fires are one of the main hazards that endanger the safety of ships. Because ships sail far from land for long periods at sea, the fire risk is high and difficult to deal with, and fires often cause incalculable losses. For example, a massive fire and explosion aboard the 4400 TEU Hanjin Pennsylvania in 2002 caused nearly USD 100 million in losses for the insurer. In July 2012, the 6732 TEU container ship MSC Flaminia caught fire in the Atlantic Ocean, killing three sailors. In 2018, a fire aboard the Maersk Honam in the Indian Ocean generated the largest container ship general average claim in shipping history, with the final insurance payout expected to exceed USD 500 million. In January 2019, the 7150 TEU container ship Yantian Express was also involved in a major fire in the Atlantic Ocean. According to the analysis of the Ship Safety Risk Report 2022, the shipping industry has shown a more positive safety trend over the past decade, but as ships age, fires on board are on the rise. In the past five years, fires and explosions have overtaken shipwrecks and collisions as the number one cause of marine insurance losses. With the rapid development of ports and shipping, ships are becoming larger, more environmentally friendly, more specialized, and more intelligent, and overall ship safety has improved considerably. However, the statistics show no corresponding improvement in ship fire accidents.
The engine room is the core compartment that provides power to the ship and keeps it operating normally. However, because of the compartment's complex structure and the combustible materials inside it, 75% of all ship fires start in the engine room, and nearly two-thirds of engine room fires occur in the main and auxiliary engines or their related components, such as turbochargers. Engine room fire detection is therefore extremely important. A method that can quickly and accurately detect an engine room fire reduces the damage to people and property caused by ship accidents and can also play a positive role in further improving ship damage control systems and fire prevention and control technology.
Traditional fire detection generally relies on data collected from various types of indoor sensors: when a measured value reaches the sensor's preset threshold, an alarm is issued. Early sensor technology was mainly based on “point sensors” triggered by particle activation, relying on heat, gas, flame, smoke, and other important fire signatures [1]. Depending on what they detect, these sensors can be divided into smoke sensors, temperature sensors, photosensitive sensors, and special gas sensors. Such detectors can sense fires from multiple angles, but they also suffer from limited detection range, accuracy, and response speed. The particles must travel a certain distance to reach the sensor; for example, smoke detectors are usually installed on a building ceiling, and when a fire has just started it produces only a small amount of smoke, which is not enough to trigger the alarm. The alarm is triggered only after the smoke reaches a certain concentration and rises up to the sensor, and this time difference may allow the fire to spread to the point where it cannot be controlled. Simply put, if a fire is far from the sensor, it may not be detected immediately [2]. In addition, sensors are easily affected by factors such as light and dust, which can cause misjudgments. In fact, both flames and smoke have distinctive static and dynamic characteristics, such as color and movement, and point sensors do not use these important features to detect fires [3]. Traditional fire detectors thus each have their own advantages, but their disadvantages are also obvious: almost all of them can be limited by the environment, and deploying multiple sensors increases the cost and difficulty of placement.
In recent years, with the rapid development of computer vision and image processing technology, the gradual improvement of hardware computing power, and the popularization of video surveillance networks, attention has gradually shifted to video-based fire detection. Video fire detection based on deep learning has become a popular research field because of its fast response and high accuracy, and algorithms based on convolutional neural networks can extract deeper image features, making them suitable for fire detection in complex spaces such as engine rooms. On the one hand, modern ships are becoming more intelligent and automated, and their video surveillance systems are increasingly mature, which makes it possible to use surveillance video and deep learning for fire detection in the engine room. On the other hand, video-based fire detection has already been applied successfully in many fields; it can detect indoor scenes, such as office fires, as well as outdoor scenes, such as forest fires, which lays a foundation for its application on ships. Deep-learning-based fire detection mainly takes the engine room scene as input through real-time video surveillance. Compared with traditional sensor-based detection, it has the following advantages:
It has a wide detection range and fast reaction speed. Fire detection technology based on deep learning can respond to a fire at its early stage;
It can track and predict the flames and evaluate their development trend, paving the way for targeted fire extinguishing in the unmanned engine rooms of future ships;
Compared with traditional fire detectors, which can only provide limited information, it can record the fire process completely, contributing to subsequent accident investigation and future safety regulations;
It is convenient to arrange. Only a few surveillance cameras are needed to monitor the entire engine room;
It has low environmental requirements and can also be compatible with other types of fire detectors.
In summary, the fire detection technology based on deep learning has obvious advantages, and it has very important research significance and practical application value in ensuring the safety of navigation and improving the automation and intelligence of the cabin.
Although fire detection based on deep learning has developed rapidly, this method is still in the initial stage in its application and research of the ship engine room. The main problems are as follows:
There is too little image data of cabin fires. Deep learning is an algorithm based on big data: the more samples used in model training, the more features the model learns and the better it performs. However, there are currently no public datasets related to ship cabin fires, and ship engine room fire images are difficult to obtain, so a large-scale cabin fire dataset cannot be built;
The complex environment inside the ship's engine room may affect the performance of the algorithm in real scenes. A model may perform well in training but only moderately well in testing. In particular, the engine room is large and contains a great deal of equipment, complex pipelines, and many red or yellow signs that are similar in color to fire. If the algorithm is not sufficiently robust, fire detection and localization may be affected.
In response to the above problems, this article proposes a fire detection algorithm based on the improved YOLOv7-tiny [
4]. The specific work is as follows:
We used the 3D virtual engine room simulator of Dalian Maritime University to collect engine room fire images. Combined with real fire images from other scenarios, we constructed a ship engine room fire dataset for training the models.
We improve the original YOLOv7-tiny algorithm. More specifically, partial convolution (PConv) is used to replace part of the regular convolution modules in the original network to reduce redundant computation. The coordinate attention (CA) mechanism is then added to strengthen the feature extraction ability of the model and increase its robustness. Finally, the SCYLLA-IoU (SIoU) loss function is used to accelerate training convergence and improve inference accuracy.
The remainder of this paper is organized as follows: the second part discusses related work; the third part introduces the improved method; the fourth part presents the experimental verification; and the fifth part summarizes the work and discusses future research.
2. Related Work
The technology of fire detection using vision can be mainly divided into traditional detection methods based on image processing and detection methods based on deep learning. Fire recognition based on image processing technology mainly obtains the fire area by analyzing and extracting the dynamic and static characteristics of the flames, such as color features, shapes, and other appearance features; then, the extracted features are passed into the machine learning algorithm for recognition. Chen et al. [
5] proposed an early fire warning method based on video processing. The basic idea is to extract fire pixels and smoke pixels using chromaticity and disorder measurements based on the RGB (red, green, blue) model. Wu et al. [
6] proposed a dynamic fire detection algorithm for surveillance video based on the combination of radiation domain feature models. Foggia et al. [
7] proposed a method to detect fires by analyzing videos captured by surveillance cameras, which combined the complementary information of color, shape change, and motion analysis through multiple expert systems. Arthur K et al. [
8] conducted research on video flame segmentation and recognition; in that work, the Otsu multi-threshold algorithm and Rayleigh distribution analysis were used to segment the image and obtain an accurate, clear flame image, which was then identified by combining the nearest-neighbor algorithm with the flame centroid feature. S. R. Vijayalakshmi et al. [
9] proposed a method integrating color, space, time, and motion information to locate fire areas in video frames. The method fuses the characteristics of the smoke with the burning flame to remove the false fire area. Binti Zaidi et al. [
10] developed a fire identification system based on the consistency rules of R, G, B, Y, Cb, and Cr component values in images. Wang et al. [
11] combined the dynamic and static characteristics of the flame in the video for fire detection, effectively reducing the impact of environmental factors on the results. Although researchers have conducted many studies of smoke and flame images, these methods rely on only a few simple image features, which are not enough to cover complex fire types and scenarios.
In recent years, with the rapid development of deep learning, it has been successfully applied to many fields, such as image classification, object detection, speech recognition, natural language processing, and so on. Its applications range from simple classification of cats and dogs to complex medicine. For example, Yuan et al. [
12] used convolutional neural networks (CNN) and bidirectional long short-term memory (BiLSTM) to predict anticancer peptides (ACPs). Shervan Fekri-Ershad et al. [
13] proposed a multi-layer perceptron (MLP) neural network with deep features to analyze cell images. The fire detection method based on convolutional neural networks (CNN) has also been widely used and has significant potential advantages, such as fast response speed, wide detection range, high precision, and low detection cost. At present, there are many excellent target detection models; the one-stage algorithms include OverFeat [
14], Single Shot MultiBox Detector (SSD) [
15], R-SSD [
16], YOLO (you only look once) series [
4,
17,
18,
19,
20], etc. The two-stage algorithms include R-CNN [
21], Fast R-CNN [
22], Faster R-CNN [
23], Mask R-CNN [
24], and so on. Barmpoutis et al. [
25] used the Faster R-CNN model to identify fires, achieving high detection accuracy, but the speed was not fast enough. Shen et al. [
26] used the YOLOv1 model to achieve flame detection, but there is still much room for improvement. Qian et al. [
27] introduced channel-wise pruning technology to reduce the number of parameters in YOLOv3, making it more suitable for fire monitoring systems. Wang et al. [
28] proposed a lightweight detector, Light-YOLOv4, which balances performance and efficiency and achieves good detection performance and speed in embedded scenarios. Wu et al. [
29] improved the SPP module and activation function of YOLOv5 to increase the robustness and reliability of fire detection. Xue et al. [
30] introduced the convolutional block attention module (CBAM) and bidirectional feature pyramid network (BiFPN) into YOLOv5, which improved the detection of small targets in forest fires. The application of deep-learning-based fire detection on ships is still at an early stage. Wang et al. [
2] proposed a video-based ship flame and smoke detection method to overcome the shortcomings of traditional fire detection equipment. First, the fuzzy C-means clustering algorithm is used to create a dominant flame color lookup table (DFCLT); then, changed regions are extracted from the video frames and an alarm is triggered by comparing the pixels. Park et al. [
31] further tested the performance of Tiny-YOLOv2 for fire detection in the environment of the ship engine room. Wu et al. [
32] proposed an improved YOLOv4-tiny algorithm for accurate and efficient ship fire detection. They improved the detection accuracy of small targets by adding a detection layer and a squeeze-and-excitation (SE) attention module to the network. In general, detection methods based on deep learning can not only automatically extract image details and features but also learn deeper features of the targets, giving them a better feature extraction capability than traditional image processing techniques.
3. Methods
This section elaborates on the network structure of YOLOv7-tiny and the improvement methods, including improving the backbone network, adding the attention mechanism, and modifying the loss function.
3.1. The Model Structure of YOLOv7-Tiny Network
Created in 2022, YOLOv7 [
4] is a relatively new member of the YOLO series. It is a target detection network with high speed and precision that is easy to train and deploy, and it surpasses previous object detectors in both speed and accuracy in the range of 5 FPS to 160 FPS. In order to better meet the requirements of real-time fire detection, the YOLOv7-tiny model, which has the smallest amount of computation and the fewest parameters, is chosen as the baseline model in this paper.
Figure 1 shows the structure of YOLOv7-tiny.
According to the structure diagram, the YOLOv7-tiny network consists of an input layer, a backbone network, and a head network. The input part uses techniques such as adaptive image scaling and Mosaic data augmentation, and the image size is uniformly converted to 640 × 640 × 3. Mosaic data augmentation stitches four pictures together through random scaling, cropping, arrangement, and other processing; it increases the number of targets, enriches the diversity of the data, and improves both the robustness of the network and its ability to detect small objects.
In the backbone network, the picture goes through a series of CBL modules, ELAN modules, and MP modules to reduce the length and width, and increase the number of channels. The module structure is shown in
Figure 2. The CBL module consists of a convolutional layer, a BN (batch normalization) layer, and an activation function. Unlike YOLOv5, YOLOv7-tiny uses LeakyReLU as the activation function, which evolved from the ReLU (rectified linear unit) activation function. Compared with ReLU, LeakyReLU handles negative values better: its output is not zero over the negative interval, which avoids the problem that some network parameters can never be updated. The formula is as follows:
LeakyReLU(x) = x for x ≥ 0, and LeakyReLU(x) = αx for x < 0, where α is a small positive slope.
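To make the CBL structure concrete, the following is a minimal PyTorch sketch of a convolution, batch normalization, and LeakyReLU block; the class name, the default 3 × 3 kernel, and the 0.1 negative slope are illustrative assumptions rather than values taken from the official implementation.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BatchNorm + LeakyReLU block, as described for YOLOv7-tiny.
    The 0.1 negative slope is an assumption for illustration."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        padding = kernel_size // 2  # keep the spatial size when stride = 1
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=padding, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Example: a 640 x 640 RGB input, as used by the network's input layer.
if __name__ == "__main__":
    x = torch.randn(1, 3, 640, 640)
    print(CBL(3, 32)(x).shape)  # torch.Size([1, 32, 640, 640])
```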
The ELAN module is an efficient network architecture that lets the network learn more features and become more robust by controlling the shortest and longest gradient paths. It has two branches: one passes through a convolution to change the number of channels, while the other uses several convolutions for feature extraction, and the four resulting feature maps are finally concatenated to produce the output. Features are extracted in the backbone network and then fused in the head network. The fusion part of YOLOv7-tiny is similar to that of YOLOv5 and uses the PANet structure. The SPPCSPC module in the head network obtains multi-scale target information while keeping the size of the feature map unchanged. The function of SPP is to increase the receptive field: it obtains different receptive fields through max pooling so that the algorithm can adapt to images of different resolutions. As the structure diagram shows, the four different max-pooling branches allow it to handle objects of different sizes and to distinguish between large and small targets. The CSP structure first divides the features into two parts: one part is processed conventionally, the other is processed with the SPP structure, and the two parts are finally merged together.
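To illustrate the SPP idea of parallel max-pooling branches with growing receptive fields, here is a simplified sketch (not the full SPPCSPC module); the kernel sizes 5, 9, and 13 and the surrounding 1 × 1 fusion convolution are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SimpleSPP(nn.Module):
    """Simplified spatial pyramid pooling: parallel max-pools with different
    kernel sizes (stride 1, 'same' padding) keep the feature-map size while
    enlarging the receptive field; the branches are concatenated and fused."""
    def __init__(self, channels, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes]
        )
        # 1x1 convolution fuses the original branch plus the pooled branches
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, kernel_size=1)

    def forward(self, x):
        branches = [x] + [pool(x) for pool in self.pools]
        return self.fuse(torch.cat(branches, dim=1))

# Usage: the spatial size is unchanged, only the receptive field grows.
if __name__ == "__main__":
    feat = torch.randn(1, 256, 20, 20)
    print(SimpleSPP(256)(feat).shape)  # torch.Size([1, 256, 20, 20])
```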
YOLOv7-tiny mainly calculates three loss terms: the bounding box loss, the classification loss, and the target confidence loss. BCEWithLogitsLoss (binary cross-entropy loss with logits) is used for the target confidence loss and the classification loss, and the CIoU loss is used for the box loss. During post-processing in the target detection phase, YOLOv7-tiny uses non-maximum suppression (NMS) to filter the candidate boxes and eliminate redundant ones, which ensures that the algorithm ultimately keeps only one detection box for each object.
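As a reference for this post-processing step, the sketch below applies confidence filtering followed by non-maximum suppression using torchvision; the 0.25 confidence and 0.45 IoU thresholds are illustrative assumptions, not the values used in our experiments.

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes, scores, conf_thres=0.25, iou_thres=0.45):
    """Keep one box per object: drop low-confidence candidates, then apply
    non-maximum suppression so overlapping boxes of the same object collapse
    to the highest-scoring one.

    boxes:  (N, 4) tensor in (x1, y1, x2, y2) format
    scores: (N,) tensor of confidence scores
    """
    keep = scores > conf_thres                 # confidence filtering
    boxes, scores = boxes[keep], scores[keep]
    kept_idx = nms(boxes, scores, iou_thres)   # indices of surviving boxes
    return boxes[kept_idx], scores[kept_idx]

# Example: two heavily overlapping candidates for the same flame region.
if __name__ == "__main__":
    b = torch.tensor([[10., 10., 50., 50.], [12., 11., 52., 49.], [200., 200., 240., 260.]])
    s = torch.tensor([0.9, 0.6, 0.8])
    print(filter_detections(b, s)[0])  # the 0.6 box is suppressed
```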
3.2. Partial Convolution
Whether on land or on board, real-time fire detection is very important: a fast network model can detect fires earlier and greatly reduce the harm they cause. Many researchers are therefore working on designing fast neural networks. Rather than relying on more expensive equipment, they tend to design fast networks that are cost-effective, so a great deal of work has focused on reducing computational complexity, which is measured by the number of floating-point operations (FLOPs). MobileNet [
33], ShuffleNet [
34], and GhostNet [
35] utilize depthwise convolution (DWConv) or group convolution (GConv) to extract features, and these networks all have low FLOPs. However, Chen et al. [
36] found that they are not actually fast enough, because these operators often suffer from increased memory access while reducing FLOPs. Therefore, a new partial convolution (PConv) was proposed to extract spatial features more effectively by reducing redundant computation and memory access at the same time. The various convolution forms are shown in
Figure 3.
As shown in
Figure 4, the feature maps are highly similar between different channels, which brings about computational redundancy.
Figure 3c shows how PConv works. To reduce computational redundancy, it simply applies a regular convolution to part of the input channels for feature extraction and keeps the rest of the channels unchanged. The regular convolution and the depthwise convolution have FLOPs of h × w × k² × c² and h × w × k² × c, respectively, while the FLOPs of the partial convolution are h × w × k² × cp², where cp is the number of channels that are actually convolved.
When cp/c = 1/4, the FLOPs of the partial convolution are 1/16 of those of the regular convolution. Thus, PConv achieves lower FLOPs than regular convolution and higher FLOPs than depthwise convolution.
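A quick numerical check of the FLOPs expressions above (the dimensions in this sketch are arbitrary and chosen only for illustration):

```python
# FLOPs comparison for regular, depthwise, and partial convolution,
# using the expressions given above.
h, w, k, c = 80, 80, 3, 256
cp = c // 4  # partial convolution acts on 1/4 of the channels

flops_regular = h * w * k**2 * c**2      # regular convolution
flops_depthwise = h * w * k**2 * c       # depthwise convolution
flops_partial = h * w * k**2 * cp**2     # partial convolution

print(flops_partial / flops_regular)     # 0.0625, i.e., 1/16
print(flops_partial > flops_depthwise)   # True: still above depthwise convolution
```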
In this paper, we combine the ELAN module in the backbone network with PConv to extract features more efficiently. As shown in
Figure 5, we replaced some CBL modules in the ELAN module with PConv.
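The following is a minimal PyTorch sketch of a partial convolution layer in the spirit of [36]: a regular convolution is applied to the first cp channels and the remaining channels are passed through unchanged. The identifier names and the default 1/4 partial ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a regular k x k convolution is applied to only the
    first (1/ratio) of the input channels; the remaining channels are passed
    through unchanged, which reduces FLOPs and memory access."""
    def __init__(self, channels, kernel_size=3, ratio=4):
        super().__init__()
        self.dim_conv = channels // ratio            # channels that are convolved
        self.dim_untouched = channels - self.dim_conv
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, kernel_size,
                              stride=1, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

# The output keeps the same shape as the input feature map.
if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)
    print(PConv(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```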
3.3. Coordinate Attention (CA) Mechanism
The attention mechanism is a concept proposed to mimic the human nervous system. It has been widely used in various fields of deep learning in recent years and has shown great success in tasks of image segmentation, speech recognition, and natural language processing. Because of bottlenecks in information processing, humans selectively focus on some information while ignoring others. Similarly, when a neural network processes a large quantity of input information, it quickly focuses on some of the key information for processing, which is the attention mechanism. Its essence is to enhance the useful feature information and suppress the useless information to improve detection accuracy. The attention mechanism can be generally divided into a channel attention mechanism and a spatial attention mechanism, such as the squeeze-and-excitation attention (SE) [
37] and convolutional block attention modules (CBAM) [
38], etc. The SE module only considers the information between channels while ignoring the location information. Although CBAM is improved, it still lacks the ability of long-distance relation extraction. In contrast, the coordinate attention (CA) mechanism [
39] not only captures inter-channel information but also considers orientation-related positional information, which helps the model locate and identify targets more accurately. In addition, it is flexible and lightweight enough to be easily plugged into the core modules of mobile networks. Due to the complexity and variability of the engine room environment, the model needs a stronger ability to express the characteristics of flames, and the two advantages of coordinate attention mentioned above greatly assist our model. Therefore, we add the attention mechanism to the backbone network. The structures of the three attention mechanisms are shown in
Figure 6, where (a) is SE, (b) is CBAM, and (c) is the coordinate attention (CA) mechanism.
Different from SE in
Figure 6a and CBAM in Figure 6b, in order to avoid losing the spatial information of objects, CA does not directly use global max pooling and global average pooling; instead, it is composed of two similar parallel stages, and the location information is embedded into the channel attention by processing the features along the height and width directions.
The specific operation is to first pool the input feature map of size C × H × W along the X direction and the Y direction, respectively, to generate feature maps of size C × H × 1 and C × 1 × W. The calculation formula is as follows:
z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i),  z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w)
where z_c^h(h) is the output of the channel numbered c at height h, and z_c^w(w) is the output of the channel numbered c at width w.
Secondly, the width and height feature maps with the global receptive field are concatenated and passed through a 1 × 1 convolution, which reduces the number of channels to C/r. After batch normalization, the result is fed into a nonlinear activation function to obtain an intermediate feature map f of size C/r × 1 × (W + H). The calculation formula is as follows:
f = δ(F1([z^h, z^w]))
where F1 is a 1 × 1 convolution transformation, [·,·] is the concatenation operation, and δ is a nonlinear activation function.
Then, the feature map f is decomposed into two independent tensors f^h and f^w, and two 1 × 1 convolutions are used to obtain feature maps g^h and g^w with the same number of channels as the input x. The sigmoid activation function is used to obtain the attention weights along the height direction and the width direction of the feature map. The calculation formula is as follows:
g^h = σ(F_h(f^h)),  g^w = σ(F_w(f^w))
where F_h and F_w are 1 × 1 convolution transformations and σ is the sigmoid function.
Finally, the original feature map is weighted by multiplication, and the output y of the CA attention module is obtained. The formula is as follows:
y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j)
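A compact PyTorch sketch of the coordinate attention computation described above (direction-wise pooling, a shared 1 × 1 convolution, a split, and sigmoid gating) is given below; the reduction ratio of 32 and the use of ReLU for the nonlinearity δ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention: pool along W and H separately, embed the location
    information through a shared 1x1 convolution, then gate the input with
    per-direction sigmoid attention weights."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # -> C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # -> C x 1 x W
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)                # stands in for the nonlinearity delta
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.size()
        x_h = self.pool_h(x)                            # n x c x h x 1
        x_w = self.pool_w(x).permute(0, 1, 3, 2)        # n x c x w x 1
        y = torch.cat([x_h, x_w], dim=2)                # n x c x (h + w) x 1
        y = self.act(self.bn1(self.conv1(y)))           # f = delta(BN(F1([z_h, z_w])))
        f_h, f_w = torch.split(y, [h, w], dim=2)
        f_w = f_w.permute(0, 1, 3, 2)                   # back to n x mid x 1 x w
        g_h = torch.sigmoid(self.conv_h(f_h))           # attention along the height
        g_w = torch.sigmoid(self.conv_w(f_w))           # attention along the width
        return x * g_h * g_w                            # y_c(i, j) = x_c(i, j) * g_h * g_w

if __name__ == "__main__":
    x = torch.randn(1, 64, 40, 40)
    print(CoordAtt(64)(x).shape)  # torch.Size([1, 64, 40, 40])
```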
3.4. SIoU Loss
The loss function is crucial in the process of model training because it measures the gap between the model's predictions and the actual data and thus determines how well the model performs. A proper loss function helps the model converge faster and obtain better results during training. Traditional losses such as GIoU [
40], DIoU [
41], and CIoU [
41] only consider the distance, overlap area, and aspect ratio between the prediction box and the ground truth but do not consider the angle between the real box and the prediction box, which leads to a slow convergence speed. In order to solve the above problems, Gevorgyan proposed the SCYLLA-IoU (SIoU) loss function [
42]. The SIoU loss function redefines the penalty metric by considering the vector angle between the required regressions, which greatly speeds up the convergence of training: the prediction box first moves to the nearest axis (x or y) and is then regressed along that axis. SIoU consists of four parts:
- 1.
Angle cost
The angle cost diagram is shown in
Figure 7, where B and B^GT are the prediction box and the ground truth box, α and β are the angles to the horizontal and vertical directions, respectively, c_h is the height difference between the center points of the prediction box and the ground truth box, and σ is the distance between the center points of the prediction box and the ground truth box.
The regression direction of the prediction box is determined by the magnitude of the angle. To achieve this, the following strategy is used to optimize the angle parameter: if α ≤ π/4, the convergence process first minimizes α; otherwise, it minimizes β.
The angle cost is calculated as follows:
Λ = 1 − 2 × sin²(arcsin(x) − π/4), where x = c_h/σ = sin α.
- 2.
Distance cost
The distance cost diagram is shown in
Figure 8.
The distance cost is redefined according to the angle cost, and its calculation formula is as follows:
Δ = Σ_{t = x, y} (1 − e^(−γρ_t))
where ρ_x = ((b_cx^gt − b_cx)/c_w)², ρ_y = ((b_cy^gt − b_cy)/c_h)², and γ = 2 − Λ; here c_w and c_h are the width and height of the smallest enclosing box of the prediction box and the ground truth box.
- 3.
Shape cost
The shape cost is calculated as follows:
Ω = Σ_{t = w, h} (1 − e^(−ω_t))^θ
where ω_w = |w − w^gt|/max(w, w^gt) and ω_h = |h − h^gt|/max(h, h^gt); w, h and w^gt, h^gt are the widths and heights of the prediction box and the ground truth box, respectively, and θ controls the degree of attention paid to the shape cost.
- 4.
IoU cost
The IoU cost is shown in Figure 9, where
IoU = |B ∩ B^GT|/|B ∪ B^GT|.
In summary, the SIoU loss function is defined as follows:
L_SIoU = 1 − IoU + (Δ + Ω)/2.
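For illustration, the sketch below assembles the four components into an SIoU box loss for boxes in (x1, y1, x2, y2) format; the shape exponent θ = 4 and the small eps constant are our assumptions, and this is a simplified reference implementation rather than the exact code used in training.

```python
import math
import torch

def siou_loss(pred, target, theta=4.0, eps=1e-7):
    """Sketch of the SIoU box loss for boxes in (x1, y1, x2, y2) format.
    Returns 1 - IoU + (distance_cost + shape_cost) / 2 per box pair."""
    # Widths, heights, and center points of both boxes
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    cx1, cy1 = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cx2, cy2 = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2

    # IoU cost
    inter_w = (torch.min(pred[..., 2], target[..., 2]) - torch.max(pred[..., 0], target[..., 0])).clamp(0)
    inter_h = (torch.min(pred[..., 3], target[..., 3]) - torch.max(pred[..., 1], target[..., 1])).clamp(0)
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Width and height of the smallest enclosing box
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0]) + eps
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1]) + eps

    # Angle cost: Lambda = 1 - 2 * sin^2(arcsin(sin_alpha) - pi/4)
    sigma = torch.sqrt((cx2 - cx1) ** 2 + (cy2 - cy1) ** 2) + eps
    sin_alpha = (torch.abs(cy2 - cy1) / sigma).clamp(max=1 - eps)
    angle_cost = 1 - 2 * torch.sin(torch.arcsin(sin_alpha) - math.pi / 4) ** 2

    # Distance cost: Delta = sum over x, y of (1 - exp(-gamma * rho_t)), gamma = 2 - Lambda
    gamma = 2 - angle_cost
    rho_x = ((cx2 - cx1) / cw) ** 2
    rho_y = ((cy2 - cy1) / ch) ** 2
    dist_cost = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # Shape cost: Omega = sum over w, h of (1 - exp(-omega_t))^theta
    omega_w = torch.abs(w1 - w2) / torch.max(w1, w2).clamp(min=eps)
    omega_h = torch.abs(h1 - h2) / torch.max(h1, h2).clamp(min=eps)
    shape_cost = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return 1 - iou + (dist_cost + shape_cost) / 2

# Example: one predicted box against one ground truth box.
if __name__ == "__main__":
    p = torch.tensor([[10.0, 10.0, 60.0, 60.0]])
    t = torch.tensor([[12.0, 15.0, 58.0, 66.0]])
    print(siou_loss(p, t))
```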