1. Introduction
The heat release rate (HRR) is defined as the amount of heat released by a combustion system per unit of time. It reflects the characteristics and risks of a fire and is an important indicator for assessing the danger level of fires. HRR is widely used in the safety design of building fires and firefighting operations [1]. In the laboratory, two common methods for measuring the HRR of a fire scene are the combustion rate method based on fuel mass loss [2] and the calorimetry method based on oxygen consumption [3]. However, these methods require expensive and complex equipment and cannot predict the HRR at future moments. Consequently, monitoring the early stages of an actual fire and predicting future HRR from current data, in order to judge the development scale of indoor fires and provide early warnings, has become one of the most pressing scientific problems in fire research.
In numerous fire tests and actual fire scenes, closed-circuit television cameras and mobile device cameras are frequently used to obtain fire videos, record changes in flames and smoke, and assess related fire parameters [4,5,6]. The fire frame images extracted from these videos contain data about the behavior and characteristics of the fire, including the size, color, brightness, and oscillation frequency of the flames and smoke, as well as their changes over time. A comprehensive analysis of fire scene images can therefore yield crucial insights into the progression of a fire.
The field of artificial intelligence (AI) has witnessed remarkable advances in recent years, significantly enhancing the capabilities of image analysis. AI methods have been extensively utilized in diverse domains, including image recognition [7] and object detection [8]. Additionally, AI techniques have been employed to identify implicit information in fire images and predict the evolution of fires and smoke. For instance, Hodges et al. [9] employed a transposed convolutional neural network (TCNN) to predict spatially resolved temperature and velocity in compartment fires. Wu et al. [10,11,12] utilized deep learning methods to predict the development and smoke propagation of tunnel fires, demonstrating the potential of intelligent firefighting systems in laboratory-scale tunnel models. Su et al. [13] employed AI trained on smoke images derived from numerical fire simulations to assist performance-based fire engineering design, applicable to atrium design. Ghosh et al. [14] proposed a hybrid deep learning model combining Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for forest fire detection, providing new insights into computer vision-based forest fire detection. Choi et al. [15] employed CNNs for semantic image segmentation in wildfire scenes. Ban et al. [16] developed a deep learning-based framework to monitor the development of wildfires in real time, including under complex conditions such as smoke, clouds, and nighttime. Wang et al. [17] generated a large compartment fire database using CFD models, obtained numerically simulated smoke images (front and side dual views) produced outside buildings, and used VGG16 to extract smoke features under different building fire scenarios. They established a relationship between the smoke features (based on external fire information) and HRR, thereby predicting the HRR of fires inside buildings. Wang et al. [18] also used the NIST database [19,20] to construct a large fire scene image database, extracting continuous fire scene images from experimental videos, and proposed an AI image fire calorimetry method using the VGG16 deep learning model, achieving real-time prediction of fire HRR.
Previous research has concentrated on target detection tasks for flames or smoke, real-time analysis, or estimation of basic parameters governing fire development (such as HRR). However, studies on methods for predicting future fire parameters are extremely rare. Moreover, traditional video-based fire detection methods mostly analyze flames or smoke alone, ignoring the coexistence of flames and smoke in fire scenes. Flames are the direct result of combustion, manifesting as a glowing, heat-releasing gasification phenomenon, while smoke is a byproduct of combustion, appearing as a collection of suspended particles produced by oxidation [21], as illustrated in Figure 1 (derived from the NIST database). Therefore, considering the joint characteristics of flames and smoke is of significant importance for improving the accuracy and practicality of fire HRR predictions.
In summary, this paper aims to predict the future transient heat release rate (HRR) of fire scenes at the next moment/frame based on continuous fire scene images of flames and smoke and their temporal information. It integrates deep learning methods such as Bi-LSTM and Attention [22,23,24] (Att-BiLSTM), comprehensively modeling the temporal relationships between fire scene image features. To construct a large-scale fire scene image dataset, this paper utilized fire scene videos from the NIST public database. Continuous fire scene images were extracted from these experimental videos in chronological order and annotated with HRR. These images were then preprocessed for training the deep learning model. Finally, the proposed HRR prediction method was applied to other fire scene experiments to verify its generalization ability and reliability in predicting future transient fire HRR.
3. Methods
Due to the temporal correlation and non-linear relationship between fire HRR and the flames and smoke, predicting the future transient HRR of fires with high accuracy is challenging. Deep learning can capture the features of fire image sequences through the automatic training of deep neural networks, thus addressing these issues. The selection of an appropriate architecture is of paramount importance for image time-series tasks. Conventional convolutional neural networks (CNNs), such as the Visual Geometry Group network (VGG) [30] and Residual Networks (ResNets) [31], exhibit certain limitations in processing image time-series data. While CNNs are effective at processing static images, they are harder to apply to time-series data: they are primarily concerned with spatial feature extraction and thus lack the capacity to model temporal dynamics. Furthermore, CNNs typically require a substantial quantity of training data, as otherwise they are susceptible to overfitting [32]. In image sequence tasks, CNNs cannot effectively capture the temporal dependencies between frames, which is a significant limitation for tasks that require temporal contextual information.
The proposed Att-BiLSTM model is capable of processing data in sequences and weighting them simultaneously, effectively resolving issues related to sequence correlation and non-linear relationships. The Bi-LSTM model comprises two independent Long Short-Term Memory networks (LSTMs), enabling the network to consider both forward and backward information and thus facilitating the handling of long-term dependencies in image sequences [33]. Attention enhances the temporal information of the target, allowing the model to learn and determine the areas of focus, thereby concentrating on the most useful information with limited resources and achieving better prediction accuracy [34].
3.1. Bi-LSTM Layer
Long Short-Term Memory (LSTM) [35] is a special type of Recurrent Neural Network (RNN) that introduces a structure known as 'memory cells' to address the vanishing and exploding gradient problems that arise during the training of long sequences. Each LSTM unit comprises an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a candidate cell state $\tilde{c}_t$, a cell state $c_t$, and a hidden state $h_t$, as illustrated in Figure 4. The input gate $i_t$ determines whether the current input information is written into the cell state $c_t$. The forget gate $f_t$ decides whether the information in the cell state is to be forgotten. The output gate $o_t$ determines whether the information in the memory cell is output. The computation is as follows:

$$i_t = \sigma\left(W_i \cdot (h_{t-1} \oplus x_t) + b_i\right)$$
$$f_t = \sigma\left(W_f \cdot (h_{t-1} \oplus x_t) + b_f\right)$$
$$o_t = \sigma\left(W_o \cdot (h_{t-1} \oplus x_t) + b_o\right)$$
$$\tilde{c}_t = \tanh\left(W_c \cdot (h_{t-1} \oplus x_t) + b_c\right)$$
$$c_t = f_t \times c_{t-1} + i_t \times \tilde{c}_t$$
$$h_t = o_t \times \tanh(c_t)$$

In this context, σ represents the sigmoid function, ⊕ denotes the concatenation operator, and + and × symbolize element-wise addition and multiplication operations, respectively. $W_x$ and $b_x$ are the weight matrix and bias vector for gate $x$, respectively.
The Bi-LSTM network structure comprises a forward and a backward LSTM. It considers both past and future information, enabling the model to better capture the contextual relationships and long-distance dependencies within sequence data. There is evidence that Bi-LSTM outperforms standard LSTM in many domains, including time series prediction [36], phoneme classification [37], and others.
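To make the Bi-LSTM data flow concrete, the following minimal PyTorch sketch (an illustration under assumed dimensions, not the implementation used in this paper) runs a bidirectional LSTM over a sequence of per-frame feature vectors; the forward and backward hidden states are concatenated at each time step:

```python
import torch
import torch.nn as nn

# Minimal sketch: a Bi-LSTM over a sequence of per-frame feature vectors.
# The feature size (64) and hidden size (128) are illustrative assumptions,
# not the paper's reported dimensions.
bilstm = nn.LSTM(input_size=64, hidden_size=128,
                 num_layers=1, batch_first=True, bidirectional=True)

frames = torch.randn(1, 10, 64)       # (batch, t=10 frames, features)
outputs, (h_n, c_n) = bilstm(frames)  # outputs concatenate forward and backward states
print(outputs.shape)                  # torch.Size([1, 10, 256]) = 2 x hidden_size
```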
3.2. Attention Layer
In recent years, the attention mechanism [38] has been widely applied in the field of deep learning, inspired by the human visual attention mechanism. Its core idea is to gradually shift focus from all information to the key points, that is, to allocate higher weights to important information, reasonably redistributing attention so that irrelevant information is ignored and the required information is amplified. In particular, the attention mechanism computes the similarity between the query and each key to obtain a weight, normalizes these weights, and finally takes the weighted sum of the normalized weights and the corresponding values:

$$\mathrm{Attention}(Q, S) = \sum_{i=1}^{L_x} \mathrm{softmax}\big(\mathrm{Sim}(Q, K_i)\big) \cdot V_i$$

where $L_x$ represents the length of the data source. The core idea and basic structure of the attention mechanism are illustrated in Figure 5a. In the model and structure depicted in Figure 5b, x represents the input sequence, and h denotes the hidden state, which contains information from the input sequence and can be considered a vector representation of the input sequence x, while α represents the weight coefficients and y is the output.
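The following PyTorch sketch illustrates such a temporal attention layer over Bi-LSTM hidden states. It is a minimal additive-attention variant written for illustration; its layer names and sizes are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Illustrative additive attention over Bi-LSTM hidden states.

    Scores each time step, softmax-normalizes the scores into weights alpha,
    and returns the weighted sum of the hidden states as a context vector.
    """
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim) hidden states from the Bi-LSTM
        e = torch.tanh(self.score(h))   # (batch, seq_len, 1) raw scores
        alpha = F.softmax(e, dim=1)     # normalized weights over time steps
        return (alpha * h).sum(dim=1)   # (batch, hidden_dim) weighted sum

# Usage: pool a (1, 10, 256) Bi-LSTM output into a single context vector.
attn = TemporalAttention(256)
context = attn(torch.randn(1, 10, 256))  # -> torch.Size([1, 256])
```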
3.3. Model Input
This paper employs a sliding window mechanism with a stride of 1, taking each sequence of t frames of fire scene images ($x_1$, $x_2$, …, $x_t$) as a group of inputs, while considering the information of the frames before and after each image, to predict the HRR of the fire scene at frame $t+1$ (i.e., 1/30 s into the future). In order to fully capture the temporal relationships between image sequences and achieve more accurate prediction performance, this paper evaluates multiple values of t. Selecting a t-value that is too small (e.g., 1–8) may prevent the model from capturing sufficient information about the dynamically changing fire scene, resulting in underfitting. Conversely, selecting a t-value that is too large (e.g., 12 or more) may introduce excessive noise, increase the computational complexity and time cost, and lead to overfitting [39]. Accordingly, this paper selects t = 9, 10, 11 for comparison experiments and analyzes the main indicators, including the goodness of fit (R²), mean square error (MSE), and root mean square error (RMSE), as well as other parameters, as shown in Table 1. R² ranges over [0, 1]; the closer the value is to 1, the better the regression fits the observed values. The remaining four criteria range over [0, +∞): they equal 0 when the predicted values perfectly match the true values, and larger values indicate larger errors and a poorer model.
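A minimal sketch of this windowing and of the evaluation metrics, assuming 30 fps video and illustrative array shapes (not the paper's code), is as follows:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def sliding_windows(frames: np.ndarray, hrr: np.ndarray, t: int = 10):
    """Build (window, target) pairs with stride 1: frames i..i+t-1 -> HRR at frame i+t."""
    X = np.stack([frames[i:i + t] for i in range(len(frames) - t)])
    y = hrr[t:len(frames)]
    return X, y

# Illustrative 30 fps sequence: 300 grayscale frames of 108 x 50 pixels.
frames = np.random.rand(300, 108, 50)
hrr = np.random.rand(300)
X, y = sliding_windows(frames, hrr, t=10)  # X: (290, 10, 108, 50), y: (290,)

# Metrics of the kind reported in Table 1, computed here on dummy predictions.
y_pred = y + np.random.normal(0, 0.01, size=y.shape)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)
```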
As illustrated in Table 1, when the number of input image sequences is 10 (i.e., t = 10), the coefficient of determination (R²) of the model reaches a maximum of 0.99700, higher than 0.96960 (t = 9) and 0.97036 (t = 11). Figure 6 presents a comparative analysis of the performance metrics, including MSE, RMSE, and MAE; the colors orange, yellow, and green represent t = 9, 10, and 11, respectively, and the vertical axis represents the magnitude of each metric. With the exception of the RMSE of the model at t = 10, which is marginally elevated in comparison with the other two groups, all prediction performance metrics at t = 10 are superior to those of the other two groups. The experimental data indicate that while the results are favorable for t = 9, 10, and 11, the combined performance across the prediction metrics initially increases and then declines, peaking at t = 10, which represents the optimal overall result: the requisite information is captured while overfitting and underfitting are avoided.
3.4. Prediction Process
Initially, the fire scene images undergo transformations and other preprocessing operations (transforms layer) to enhance data quality. Subsequently, the preprocessed image sequences are input into the Bi-LSTM layer to extract spatiotemporal features of the image sequences. Finally, Attention is added at the output of the Bi-LSTM to strengthen the temporal information of the target, thereby identifying key features of fire scene images at different time points. This improves the prediction accuracy of future transient HRR.
Figure 7 presents the network architecture of the Att-BiLSTM, which comprises two pathways. The upper pathway accepts HRR labels as input, providing the supervisory signal the model needs to learn correct predictions during training. This pathway comprises three hidden layers, with 128, 256, and 256 units, respectively; its input dimension is 1 × 10, and after a linear transformation and a Dropout layer, the output dimension is 1 × 240. The lower pathway is mainly composed of three Bi-LSTM layers and one Attention layer. Its input is a preprocessed sequence of fire scene images with dimensions of 1 × 10 × 108 × 50, which is reduced to 1 × 16 through linear activation and a Dropout layer. The two pathways are concatenated in the Connect layer, forming a 256-dimensional vector. This vector is then reduced from 256 dimensions to 1 dimension, i.e., the predicted value of the future transient HRR, through a Fully Connected layer (FC) and a Dropout layer.
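The following PyTorch sketch mirrors the two-pathway architecture described above, using the dimensions stated in the text (10 frames of 108 × 50 images, a 240-dimensional label branch, and a 256-dimensional concatenated vector); internal details such as the Bi-LSTM hidden size and the exact attention form are assumptions, not the paper's released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttBiLSTM(nn.Module):
    """Hedged sketch of the two-pathway Att-BiLSTM described in the text.

    Outer dimensions follow the paper (image branch -> 16-dim, label branch
    -> 240-dim, concatenated 256-dim vector -> 1 predicted HRR value); the
    hidden size and attention details are illustrative assumptions.
    """
    def __init__(self, hidden: int = 128):
        super().__init__()
        # Upper pathway: 10 recent HRR labels -> 128 -> 256 -> 256 -> 240.
        self.label_mlp = nn.Sequential(
            nn.Linear(10, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 240), nn.Dropout(0.05),
        )
        # Lower pathway: flattened 108 x 50 frames -> 3 stacked Bi-LSTM layers.
        self.bilstm = nn.LSTM(108 * 50, hidden, num_layers=3,
                              batch_first=True, bidirectional=True)
        self.attn_score = nn.Linear(2 * hidden, 1)  # additive attention scores
        self.img_proj = nn.Sequential(nn.Linear(2 * hidden, 16), nn.Dropout(0.05))
        # Connect layer: 240 + 16 = 256 -> predicted future transient HRR.
        self.head = nn.Linear(256, 1)

    def forward(self, frames: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 10, 108, 50); labels: (batch, 10)
        b, t = frames.shape[:2]
        h, _ = self.bilstm(frames.reshape(b, t, -1))       # (b, 10, 2*hidden)
        alpha = F.softmax(torch.tanh(self.attn_score(h)), dim=1)
        context = (alpha * h).sum(dim=1)                   # (b, 2*hidden)
        img_feat = self.img_proj(context)                  # (b, 16)
        lbl_feat = self.label_mlp(labels)                  # (b, 240)
        return self.head(torch.cat([lbl_feat, img_feat], dim=1))  # (b, 1)

model = AttBiLSTM()
pred = model(torch.randn(1, 10, 108, 50), torch.randn(1, 10))  # -> shape (1, 1)
```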
The network has approximately 15 million parameters (14,912,385), effectively modeling the temporal relationships between image sequences and the relationship between image data and heat release rate. The upper pathway employs the ReLU activation function three times, while the lower pathway uses Tanh, Softmax, and ReLU as activation functions between the Attention layer and the Connect layer. The paper employs the Mean Squared Error (MSE) as the loss function and the coefficient of determination (R²) as an evaluation metric, assessing the fit between predicted and actual values through residual and control charts. To prevent overfitting, both Dropout layers are set to 0.05. Training was conducted over 20 epochs on a server equipped with an RTX 4090 GPU (24 GB), taking approximately 3 h. The training results indicate that the network is capable of modeling the time dependency of fire scene image sequences and learning the importance of the inputs. It captures the long-term dependencies of image sequences and effectively processes the dynamic changes of flames and smoke in the fire scene, enabling it to reliably predict the future transient HRR of the fire scene.
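A hedged sketch of the corresponding training loop, continuing the model sketch above, might look as follows; the optimizer and learning rate are assumptions, since the text specifies only the MSE loss, the dropout rate, and the 20 training epochs:

```python
import torch
import torch.nn as nn

# Training sketch for the AttBiLSTM model defined above. The single-batch
# `loader` is a dummy stand-in for a real DataLoader yielding
# (frames, labels, target_hrr) triples; Adam and lr=1e-4 are assumptions.
model = AttBiLSTM()
loader = [(torch.randn(4, 10, 108, 50), torch.randn(4, 10), torch.randn(4))]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

model.train()
for epoch in range(20):                    # 20 epochs, as stated in the text
    for frames, labels, target_hrr in loader:
        optimizer.zero_grad()
        pred = model(frames, labels).squeeze(1)  # (batch,)
        loss = criterion(pred, target_hrr)
        loss.backward()
        optimizer.step()
```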
5. Discussion
In order to assess the model's predictive performance on data outside the training set, this paper selected a series of fire test cases with varying ranges of combustion HRR from the NIST fire calorimetry database. These test cases were not included in the 20% validation set partitioned during model training; in other words, they represent new, unknown samples for the trained deep learning model. The model must therefore utilize the knowledge acquired during the training phase to predict the future transient HRR values of these unfamiliar fire scenes and unknown combustibles. This more accurately reflects the model's ability to generalize to real-world scenarios.
5.1. High-Brightness Fire Scenes
Changes in the brightness of the fire scene environment can result in variations in the brightness and contrast of fire scene images, potentially affecting the model’s ability to extract and analyze image features.
Figure 10a,b illustrate images captured under daylight or strong light exposure and under lower-brightness conditions, respectively. The brightness of the fire scene environment is therefore a crucial factor influencing the model's capacity to generalize in HRR prediction. To assess the model's predictive performance in high-brightness fire scenes, this paper selected three experiments from the NIST database with higher brightness conditions, burning items such as cardboard boxes (Figure 11a), rubber trash bins (Figure 11b), and plastic chairs (Figure 11c). These burning items have the same thermal parameters as those in the training dataset, but the experimental conditions differ in brightness, allowing the robustness of the model's HRR predictions to be examined. Figure 11 presents the results, with the left column showing scatter plots and the right column showing line charts.
The results demonstrate that even in high-brightness fire scenes, the deep learning model can accurately predict the HRR of different combustibles (Figure 11), with all R² values exceeding 0.97. The residual plots and result comparison charts indicate a good fit and hence high prediction accuracy. The residual plot and result comparison chart for the cardboard box experiment (Figure 11a) are slightly inferior to those of the other two experiments. This may be attributed to the relatively limited number of frames in the cardboard fire scene video, which prevented the full utilization of temporal relationships between images. Overall, the model demonstrates a certain degree of adaptability to changes in fire scene brightness, maintaining a high degree of consistency between predicted results and actual measurements under increased brightness. This supports the application of the model in various complex brightness environments.
5.2. Complex Combustibles
The presence of complex combustibles increases the difficulty of predicting the actual HRR of fire scenes. Such combustibles may be in different states, such as solids, liquids, and gases, leading to diversity in the characteristics of flames and smoke, such as differences in color and shape. To assess the reliability of the model's HRR predictions in fire scenes with complex combustibles, this paper selected three typical complex-combustible fire scenarios from the NIST database for validation: a "box-type gas burner" (Figure 12a), a "utility cart with a laptop and printer" (Figure 12b), and "propanol liquid" (Figure 12c) [33]. The three sets of experiments simulate the complexity of combustibles in actual fire scenes and are used to test the model's generalization ability. Figure 12 presents the results, with the left column showing scatter plots and the right column showing line charts.
As illustrated in Figure 12, we compared the actual HRR with the deep learning model's predictions for the three groups of fire scenarios with complex combustible characteristics. Despite the complexity of the fuel load and the fire spread process in these experiments, the results demonstrate that the model can reasonably predict the HRR of fire scenes, with all R² values exceeding 0.94, reflecting the changing trend of HRR during combustion. This indicates that the deep learning model has strong adaptability and predictive capability for HRR in complex fire scenarios.
However, when the fire enters the high-heat-release peak phase, both sets of experiments show a certain overestimation bias in the model's predictions. This may be attributed to the distribution of samples near the high heat release peak in the training dataset: the fire scene videos in the NIST database are time-lapse, with more scenes (frames) of small-to-medium HRR than of the peak HRR stage, so the model tends to overestimate at the peak HRR stage of large fires. In addition, the deep learning model exhibits some inaccuracy in predicting the absolute value of HRR, with a tendency to underestimate, which may be related to the scale and quality of the training dataset. These experimental results demonstrate that the proposed deep learning-based method for predicting the future transient HRR of fires can effectively utilize the features and temporal relationships of fire scene images, providing good predictive capability for future fuel combustion in fire scenes without the need for additional equipment or sensors. Although the prediction accuracy and applicability of the model remain limited by the scale of the training data, there is sufficient evidence that the model can effectively predict the future transient HRR of fires.
5.3. Comparative Analysis with Similar Studies
The advent of sophisticated deep learning models has facilitated remarkable advances in the application of image recognition and computer vision techniques to fire detection and the predictive analysis of fire parameters. While existing fire target detection methods [14,15,16] can identify fires in real time and issue timely warnings, their assessment of the current state and future trend of the fire still relies on empirical judgment and lacks quantitative analysis of professional fire parameters; consequently, these methods have limitations. In contrast, analyzing real-time fire parameters (e.g., heat release rate) from video images allows for a more intuitive and reliable assessment of the degree of fire danger. For example, Wang et al. [18] constructed a large-scale fire image database using the NIST database and successfully predicted the heat release rate of fires in real time by extracting continuous fire images from experimental videos and combining them with the VGG16 deep learning model. Nevertheless, for the goal of anticipating how a fire will develop, real-time analysis alone remains limited. In this study, we propose a future transient heat release rate (HRR) prediction method based on fire video images, which complements and extends the traditional fire target detection and real-time fire parameter analysis tasks. This paper thus explores the feasibility of forward-looking analysis and prediction in the field of fire prevention and control, presenting new ideas and methods for fire monitoring and emergency response. The approach enhances the understanding of fire trends and provides more accurate data support for monitoring and preventing fires.
5.4. Applications in Intelligent Firefighting
The experiments described above have demonstrated the effectiveness of the proposed Att-BiLSTM model in predicting the future transient heat release rate (HRR) of fires. This suggests that deep learning-based technologies are poised to become a key component of intelligent firefighting systems and could be applied in actual firefighting operations. The development of fires inside buildings is significantly affected by the limited space available: the incomplete exposure of the burning area leads to a lack of oxygen and slow air flow, which together make the fire more stable in its initial stage and slow the expansion of the burning area compared to outdoor fires. The particular characteristics of this combustion environment necessitate more sophisticated fire response strategies and safety assessment methodologies [40]. As illustrated in Figure 13, when an indoor fire occurs, the following sequence takes place: first, video images of the indoor fire are collected in real time using cameras such as CCTV and smartphones; next, the fire images are uploaded to a cloud database via the network; finally, the streaming fire images are input into the deep learning model, which outputs the predicted value of the future fire HRR. This method offers the potential to simulate and predict the development of fire situations, thereby providing an earlier warning period and enabling more effective response and command decisions. It also assists in optimizing the allocation of firefighting resources, enhancing the protection of personnel and reducing property loss. This method represents a significant advance in the field of intelligent firefighting.
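As a rough illustration of how the streaming stage of such a pipeline might look in code (all names here are hypothetical, and `model` refers to the earlier AttBiLSTM sketch):

```python
import collections
import torch

# Hedged sketch of the streaming pipeline in Figure 13: a rolling buffer keeps
# the 10 most recent preprocessed frames, and the trained model is queried as
# each new frame arrives. `hrr_history` stands in for the recent HRR estimates
# obtained via AI image calorimetry; the frame source is likewise hypothetical.
window = collections.deque(maxlen=10)

def predict_next_hrr(model, frame: torch.Tensor, hrr_history: torch.Tensor):
    """frame: (108, 50) preprocessed image; hrr_history: (10,) recent HRR values."""
    window.append(frame)
    if len(window) < 10:
        return None  # not enough temporal context yet
    frames = torch.stack(list(window)).unsqueeze(0)  # (1, 10, 108, 50)
    with torch.no_grad():
        return model(frames, hrr_history.unsqueeze(0)).item()
```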
Although this method has achieved satisfactory results in laboratory environments, some issues and deficiencies still require further refinement and optimization. Firstly, to enhance the predictive performance and adaptability of this method in various fire scenarios, the database of fire scene images needs to be expanded and enriched to cover a wider range of real fire situations and scales. Secondly, since flames are three-dimensional and camera images capture only two-dimensional projections, it is difficult to obtain depth information about flames. Future models should therefore consider using multi-angle camera images to reconstruct the three-dimensional form of the fire scene, combine temporal information, and extract more features. Thirdly, the current method uses AI image calorimetry [41] to obtain the HRR of fire video images in real time; this approach has the advantage of not relying on additional instruments and being low-cost, but it may also introduce a certain degree of error.
6. Conclusions
This paper proposes a deep learning model that integrates Bi-LSTM and Attention mechanisms, capable of simultaneously processing and weighting data in sequences. This model effectively addresses the temporal correlation and non-linear relationships between fire HRR and images of flames and smoke. The contributions of this paper include the following aspects:
A new end-to-end method for predicting future fire HRR is proposed. By inputting fire scene images and corresponding HRR label data into the Att-BiLSTM model and employing a sliding window mechanism, it is possible to achieve continuous output of future transient fire HRR predictions.
In the preprocessing of fire scene images, the quality of images is enhanced while reasonably preserving the information of flames and smoke. This is achieved by fully considering their coexistence characteristics and their impact on fire HRR.
The model’s generalization ability and reliability were tested in high-brightness environments and fire scenes with complex combustibles. The experimental results demonstrate that the model can accurately predict future transient HRR of fire scenes and can also simulate and predict the development trend of fire situations to a certain extent.
This paper presents a novel method of using deep learning technology to predict the future transient HRR of fires, which has broad application prospects and high potential value for the development of future intelligent firefighting systems. In future work, we intend to further improve the deep learning model by introducing more image features and inter-frame information and by considering the combined effects of more influencing factors, thereby enhancing the accuracy and horizon of predictions for future fire HRR.