1. Introduction
China advocates coal washing and quality-improvement processing and promotes the low-carbon development of coal energy. Coal slime is a byproduct of the coal washing process and is characterized by fine particle size, high water-holding capacity, high viscosity, high ash content, and low heating value [1]. If coal slime is not disposed of correctly, it harms the environment. The harmful substances contained in coal slime can seep into the ground and contaminate the soil. If coal slime is discharged into rivers and lakes, it severely pollutes the water. Circulating fluidized bed (CFB) combustion technology, one of the clean coal technologies, adapts well to medium- and low-quality coal fuels, with combustion efficiency reaching 95-99% [2]. Therefore, CFB coal slime blending is recognized as the best way to handle coal slime.
However, coal slime blending significantly affects the stability of CFB units. Ideally, the water content of the coal slime is constant in each pump of the coal slime pumping system. In actual operation, falling coal slime splashes water onto the level meter and triggers an alarm. The alarm stops the discharge screw, emptying the pump. As a result, the uneven mixture of coal slime and water causes significant fluctuations in boiler input energy. The bed pressure of the coal-slime-blending unit fluctuates widely because of the uneven coal feed in the furnace. In pant-leg CFB units, unstable bed pressure can lead to inventory overturn accidents, threatening the stable operation of the unit. Accurate prediction models improve the safety of unit operation. By obtaining future information through forecasting, operators can adjust their operations in time and thereby reduce accidents. Therefore, it is necessary to monitor and predict bed pressure. Most studies of bed pressure focus only on mechanism research. Li [3] established the experimental relationship between the lateral pressure difference of the upper furnace and the lateral material exchange flow rate, and found that this pressure difference plays a critical role in material fluctuation, which was also verified by Sun [4]. Many researchers have studied bed pressure through solid flow models of the furnace [5]. Yang [6] developed a computational particle fluid dynamics model to simulate solid exchange behavior between the two half beds of a bench-scale two-dimensional dual-leg fluidized bed. Wang [7] proposed an empirical correlation of the non-uniformity index to quantify the non-uniformity of gas–solid two-phase flow. Liu [8] used a computational particle fluid dynamics model of a pilot-scale CFB to numerically simulate its gas–solid flow characteristics. Gungor [9] established a particle-based prediction model of the CFB axial pressure distribution to predict bed pressure. Current research also shows that the airflow rate in the furnace significantly influences the bed pressure [10]. However, these mechanism models have complex structures, and some of the data they require are difficult to obtain. In addition, the empty-pump phenomenon makes the measured feed quantity deviate from the actual value. The mechanism models therefore struggle to predict bed pressure accurately in coal-slime-blending CFB units.
Driven by the development of intelligent algorithms, data-driven modeling technology for complex industrial processes is attracting the attention of researchers [11,12]. Data-driven models can be applied and migrated without considering the design parameters of the units; they only need operational data to adjust and train their parameters. The error backpropagation neural network (BPNN) [13], the least squares support vector machine (LSSVM) [14], and other algorithms are commonly used in industrial processes [15,16]. Because industrial data are time-dependent, algorithms that analyze the temporal relationships in such data are better suited to modeling typical industrial processes. The emergence of the recurrent neural network (RNN) and the long short-term memory (LSTM) neural network improved the ability of neural network models to extract temporal information from sequential data, and both are widely used in industrial forecasting. Xia [17] proposed a method for predicting renewable energy generation and power load in univariate and multivariate scenarios using an improved stacked gated recurrent unit neural network; the method uses various weather parameters to predict wind power generation and historical energy consumption data to predict power load. Shahid [18] proposed a model combining LSTM and a genetic algorithm to predict short-term wind power generation, providing accurate, reliable, and robust predictions for seven wind farms in Europe. Inapakurthi [19] used RNN and LSTM to capture the dynamic trends of 15 environmental parameters, including particulate matter and atmospheric pollutants that cause long-term health hazards. In addition to these models, attention models have also received attention from researchers. The idea of attention originated in studies of visual perception in the 1990s. In 2014, the Google DeepMind team [20] used an attention mechanism in an RNN model to classify images and achieved excellent results. Attention mechanisms are used in many areas, such as machine translation [21], image recognition [22], and language emotion analysis [23]. In recent years, attention has also been widely used in forecasting. Azam [24] proposed a hybrid deep learning method based on bidirectional long short-term memory and a multi-head self-attention mechanism, which accurately predicts the day-ahead locational marginal price and system load. By analyzing the formation mechanism of NOx and the reaction mechanism of the SCR reactor, Xie [25] proposed a sequence-to-sequence dynamic prediction model that can fit multivariable, coupled, nonlinear systems with large delays. Shih [26] proposed a novel attention mechanism that selects the relevant time series and used its frequency-domain information for multivariate forecasting.
Influenced by the successful application of temporal pattern attention [26], an improved attention layer is proposed in this paper. The model uses Gaussian convolution kernels in the convolutional neural network, enabling it to extract different inertial and delay properties. Scaled dot-product attention [27] is used for the attention computation. The prediction method in this paper applies differential processing to the input and output of the model and improves the way the model is trained. The main contributions of this paper are summarized as follows:
(1) The delayed relationship between variables is considered in the correlation analysis. The best combination of input variables is selected through comparison experiments.
(2) The differential values of the input variables are used as additional input features, which reduces the effect of input autocorrelation on the model. The differential value of the bed pressure is used as the predictive target to improve the model's ability to learn the target value.
(3) The attention layer uses Gaussian convolution kernels to extract the inertial and delay properties of the features. The convolution operation allows the model to learn different inertial and delay properties for different features, improving its prediction performance.
(4) Based on the principle of the attention mechanism, the model is trained in segments. The first training segment enables the query vector in the attention layer to learn the target information well. The parameters of the attention layer are updated in the second training segment.
The structure of this paper is as follows: Section 2 describes the background and principle of the method used in this paper. In Section 3, the proposed forecast method is applied to actual operational data. In Section 4, the proposed method is used for an ablation study and a comparison of prediction models. The ablation study verifies the effectiveness of the model structure and the data processing method, and the proposed model outperforms the other algorithm models in the comparison. Finally, Section 5 summarizes the conclusions of the study and highlights the significant findings.
3. Structuring of the Proposed Forecast Framework
In this section, the production data from the LinHuan Zhongli #1 330 MW circulating fluidized bed unit in China illustrate the prediction modeling steps. The boiler's slime blending system was designed and manufactured by PUTZMEISTER, Germany. It is a typical coal slime pumping system consisting of a coal slime silo, a bin-bottom slide frame, a flushing water pressure pump, and four coal slime pumping pipelines. The coal slime guns are arranged 2.6 m above the air distribution board, two on each of the left and right side walls of the boiler, placed horizontally and symmetrically. Each coal slime gun comprises the gun body, a ball valve, a gate valve, and a safety valve. This paper uses three datasets for prediction and comparison. Datasets 1 and 2 both come from the operational data from 18 to 20 June 2018; dataset 1 contains only data from regular operation, while the test set of dataset 2 contains fault data. Dataset 3 is the operating data of the unit between 12 and 14 September 2015. All datasets have a sampling interval of 5 s and a prediction step of 6. Each dataset contains 39,600 samples, divided into training, validation, and test sets of 36,000, 1600, and 2000 samples, in that order. The prediction task is illustrated with the left bed pressure. The three datasets are described as follows:
Dataset 1: The entire dataset is from regular operation. The time range of the dataset was from 18 June 2018, 0:21 to 20 June 2018, 7:21.
Dataset 2: The test set contains an inventory overturn accident. The time range of the dataset was from 18 June 2018, 14:05 to 20 June 2018, 21:05.
Dataset 3: The dataset contains normal operation data. The time range of the dataset was from 12 September 2015, 0:21 to 14 September 2015, 7:21.
The flow of the prediction method in this paper is shown in Figure 6. The variable screening method is described in Section 2.3. The hyperparameters of the model are adjusted according to the errors on the validation set. The method was implemented in Python, and the algorithm models used the PyTorch and scikit-learn frameworks. All experiments were carried out on a machine with an Intel Core i9-10900K CPU and an RTX 3080 GPU, with CUDA and cuDNN accelerating the algorithm models.
3.1. Prediction Model Structure
The prediction model structure proposed in this paper is shown in Figure 7. After variable screening, the variable data are input into the network model. The number of input features is m/2. The input data are differenced to form additional model inputs, and these differential values are also normalized. The step length of the differential value equals the prediction step length, which yields the final model input with m features. In this model, the LSTM part is composed of a stack of N identical layers. Each layer has two sub-layers: the first is an LSTM layer with 2m units, and the second is an LSTM layer with m units. The attention model in this paper operates on the data after they pass through the LSTM part. The final output is formed by the dense layer, the batch normalization layer, and the activation function. The output of the model is the differential value between the current time and the predicted time.
The dense layer is a fully connected layer with one neuron, and the activation function is the sigmoid function, whose expression is shown in Equation (9). The ResNet (residual network) structure and a batch normalization layer were added to the model structure. They are described as follows:
ResNet: This network structure, proposed in 2015 [35], significantly promoted the development of deep learning models. It was introduced to solve the degradation problem in neural networks, in which performance drops rapidly after more layers are added. With ResNet, the parameters of deep neural networks can be optimized more easily.
Batch Normalization: This network layer was proposed in 2015 to address internal covariate shift (ICS) [36]. Subsequent research found that it also smooths the optimization landscape and regularizes the model. Therefore, a batch normalization layer is used before the final output layer to accelerate the convergence of the model.
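As a rough illustration of this structure, the following PyTorch sketch stacks the two LSTM sub-layers, applies scaled dot-product attention with the last-moment state as the query, and finishes with the dense layer, batch normalization, and sigmoid output. It is a minimal sketch under simplifying assumptions: the Gaussian-kernel convolution of the proposed attention layer is omitted, and the residual placement and layer sizes are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class BedPressureModel(nn.Module):
    """Simplified sketch: N stacked blocks of two LSTM sub-layers
    (2m then m units) with a residual connection, scaled dot-product
    attention, then dense -> batch normalization -> sigmoid."""
    def __init__(self, m: int, n_layers: int = 1):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.ModuleDict({
                "lstm1": nn.LSTM(m, 2 * m, batch_first=True),
                "lstm2": nn.LSTM(2 * m, m, batch_first=True),
            })
            for _ in range(n_layers)
        )
        self.dense = nn.Linear(m, 1)   # fully connected layer, one neuron
        self.bn = nn.BatchNorm1d(1)    # batch normalization before the output
        self.act = nn.Sigmoid()

    def forward(self, x):              # x: (batch, time, m)
        for blk in self.blocks:
            h, _ = blk["lstm1"](x)
            h, _ = blk["lstm2"](h)
            x = x + h                  # residual (ResNet-style) connection
        q = x[:, -1:, :]               # last-moment state as the query
        scores = torch.softmax(
            q @ x.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1
        )                              # scaled dot-product attention weights
        ctx = (scores @ x).squeeze(1)  # attention context vector
        return self.act(self.bn(self.dense(ctx)))

model = BedPressureModel(m=12, n_layers=2)
y = model(torch.randn(4, 30, 12))      # 4 windows, 30 time steps, 12 features
print(y.shape)                         # torch.Size([4, 1])
```

The single sigmoid output corresponds to the normalized differential value between the current time and the predicted time.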
3.2. Segmented Training
In this work, a segmented training approach was applied to learn the parameters of the model based on the computation of the attention layer. In the original temporal pattern attention calculation process, the attention layer in this paper focuses on the last-moment state output
of the LSTM layer. In the early stages of training, slow convergence of the attention layer parameter learning occurs due to the information confusion in
. Therefore, this paper uses a segmented training approach to train the parameters of the model. The model is divided into an LSTM layer part and an output layer part, using the attention layer as the dividing point. A fully connected layer with one unit is added in the first segment after the
. A sigmoid activation function is used for the final output. After the training, the fully connected layer is replaced with the output layer part of this paper. In the second segment of training, the parameters of the LSTM layer part are fixed, and only the parameters of the output layer part are trained. The training process is shown in
Figure 8.
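The two-segment schedule can be sketched as follows. This is a minimal sketch with toy data; the layer sizes and the `run_segment` helper are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

m = 12                                   # number of input features (illustrative)
lstm_part = nn.LSTM(m, m, batch_first=True)
output_part = nn.Sequential(nn.Linear(m, 1), nn.BatchNorm1d(1), nn.Sigmoid())

def run_segment(parameters, forward, x, y, epochs=5):
    """One training segment with the AdamW optimizer."""
    opt = torch.optim.AdamW(parameters)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(forward(x), y)
        loss.backward()
        opt.step()
    return loss.item()

x = torch.randn(32, 30, m)               # toy batch: 32 windows of 30 steps
y = torch.rand(32, 1)                    # toy normalized differential targets

# Segment 1: train the LSTM part through a temporary one-unit head
# (fully connected layer + sigmoid) on the last-moment state output.
head = nn.Sequential(nn.Linear(m, 1), nn.Sigmoid())
seg1 = lambda inp: head(lstm_part(inp)[0][:, -1, :])
run_segment(list(lstm_part.parameters()) + list(head.parameters()), seg1, x, y)

# Segment 2: freeze the LSTM part and train only the output layer part.
for p in lstm_part.parameters():
    p.requires_grad = False              # LSTM parameters are fixed
seg2 = lambda inp: output_part(lstm_part(inp)[0][:, -1, :])
final_loss = run_segment(output_part.parameters(), seg2, x, y)
```

In the real model, the output layer part would include the attention layer; here it is reduced to a dense, batch-normalization, and sigmoid stack for brevity.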
3.3. Performance Assessment
To assess the prediction performance under different experimental scenarios, two performance metrics are selected: mean absolute error (MAE) and mean absolute percentage error (MAPE). They evaluate the performance of different models on the prediction results and can be expressed as follows:
Mean absolute error (MAE):
MAE = (1/n) Σ_{i=1}^{n} |ŷ_i − y_i|
Mean absolute percentage error (MAPE):
MAPE = (1/n) Σ_{i=1}^{n} (|ŷ_i − y_i| / ȳ) × 100%
where ŷ_i is the predicted value; y_i is the target value; n is the number of prediction data; ȳ is the average value of y_i. Generally, lower values of MAE and MAPE indicate better performance on the prediction task.
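The two metrics can be computed directly, for example with NumPy. In this small sketch, MAPE is normalized by the mean target value, matching the symbol definitions above.

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean absolute error."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean(np.abs(y_pred - y_true))

def mape(y_pred, y_true):
    """Mean absolute percentage error, normalized by the mean target value."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean(np.abs(y_pred - y_true) / np.mean(y_true)) * 100.0

print(mae([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))   # 0.333...
print(mape([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))  # 16.66...
```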
3.4. Data Standardization and Differential Processing
An input that is too large or too small can easily lead to unstable gradient values; data normalization effectively avoids this problem. This paper uses min-max scaling to normalize the data linearly, mapping them to the range 0 to 1, which benefits the training of the network model parameters. The min-max scaling formula is as follows:
x* = (x − x_min) / (x_max − x_min)
where x_min and x_max represent the minimum and maximum values in the data; x* represents the data after standardization; x represents the data before standardization.
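A minimal NumPy sketch of this scaling:

```python
import numpy as np

def min_max_scale(x):
    """Linearly map the data to the range [0, 1] by min-max scaling."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

scaled = min_max_scale([10.0, 15.0, 20.0])
print(scaled)  # [0.  0.5 1. ]
```

In practice, the minimum and maximum are taken from the training set only, so the same mapping can be applied to the validation and test sets.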
The prediction method in this paper uses differential prediction, as stated in Section 2.1. The differential processing is divided into differential processing of the target value and differential processing of the input. The differential processing of the target value is shown below:
Δy_t = y_{t+k} − y_t
where y_{t+k} is the value of the predicted target at time t + k; y_t is the value of the predicted target at time t; Δy_t is the differential value between the two moments. The prediction model obtains the value of y_{t+k} by predicting Δy_t. In this paper, k is set to 6.
In addition, the prediction method in this paper also performs differential processing on the input of the model. The differential processing for the input is shown in Equation (2).
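The k-step differential processing can be sketched as follows (k = 6 as in this paper; a toy series stands in for the actual operational data):

```python
import numpy as np

def difference(series, k=6):
    """k-step differential values: delta[t] = series[t + k] - series[t]."""
    series = np.asarray(series, dtype=float)
    return series[k:] - series[:-k]

y = np.arange(10.0)          # toy target series
delta = difference(y, k=6)
print(delta)                 # [6. 6. 6. 6.]
# The model predicts delta; the forecast is reconstructed as y[t] + delta[t].
```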
3.5. Variable Filtering
The method described in this paper uses dataset 1 to filter the variables. The control quantities of the CFB unit are used as input features for the model, and the differential values of these variables are also input as features. The correlation coefficients between the input features and the differential value were calculated using the method in Section 2.3, whose three parameters (including l) are set to 100, 200, and 4000, respectively. Table 1 shows the correlation coefficients of the input variables, sorted in descending order.
Screening variables by empirical thresholds is unreliable, so comparison experiments are used instead. The experiments first rank the variable features by their correlation coefficients. Experimental groups are then constructed by adding features in order of importance. Finally, the prediction accuracy on the validation set is used as the basis for variable selection.
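The experimental-group construction can be sketched as follows. The feature names and the `fit_and_score` scorer are hypothetical stand-ins for training the model on a feature subset and reading off its validation-set error.

```python
def build_experimental_groups(ranked_features):
    """EG-i uses the top-i features ranked by correlation coefficient."""
    return [ranked_features[:i] for i in range(1, len(ranked_features) + 1)]

def select_features(ranked_features, fit_and_score):
    """Pick the group with the lowest validation error.

    `fit_and_score` is a hypothetical stand-in that trains the model on
    a feature subset and returns its validation-set MAE."""
    groups = build_experimental_groups(ranked_features)
    scores = [fit_and_score(group) for group in groups]
    return groups[scores.index(min(scores))]

# Illustrative short names for the ranked variables.
ranked = ["fuel", "pa_right", "sa_upper_left", "pa_left",
          "sa_lower_right", "sa_upper_right", "sa_lower_left"]
# Toy scorer: pretend the validation MAE is minimized with 6 features.
best = select_features(ranked, lambda group: abs(len(group) - 6))
print(best)   # the top-6 feature subset
```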
Table 2 shows the prediction performance of each experimental group on the validation set, where EG-i indicates that the top i most important features are used as the model input. The same model hyperparameter settings, shown in Table 3, were used for all experimental groups. Table 2 presents the mean and standard deviation over three runs for all experimental groups, with the best performance in boldface.
As can be seen from Table 2, the input features of EG-6 achieved the best prediction accuracy on the validation set, so they are taken as the result of the variable filtering. The selected variables are the fuel quantity, the primary air quantity on the right side, the secondary air quantity at the upper left, the primary air quantity on the left side, the secondary air quantity at the lower right, and the secondary air quantity at the upper right.
3.6. Model Prediction Results
The model is trained using the segmented training approach. The parameters of the LSTM layer are trained first: the output of the LSTM layer at the last time step is passed through a fully connected layer, and the sigmoid activation function produces the final output. After the first training segment, the LSTM parameters are migrated into the model of this paper. In the second training segment, only the parameters of the output layer part are trained. The hyperparameters of the model were selected using the validation set and are shown in Table 4.
The model was trained on 36,000 training samples. In the first training segment, the average training time was 650 s and training stopped after an average of 134 epochs; in the second segment, the average training time was 260 s and training stopped after an average of 83 epochs. The entire training process uses an early stopping strategy, and the optimizer is AdamW [37].
Figure 9, Figure 10, and Table 5 show the prediction results of the model on the test set. Figures 9 and 10 present the mean values of the predicted results, and Table 5 shows the mean and standard deviation over three replicate experiments. In Figure 9, the target value is the actual differential value; in Figure 10, it is the actual bed pressure value 30 s ahead. The results show that the model achieves accurate predictions on all three datasets. The prediction results on dataset 2 show that the model can still capture bed pressure changes well even when the bed pressure changes suddenly.