4.2. TCN for Furnace Temperature Prediction
Since the heating process was continuous, a TCN was used as the prediction model in this paper. We optimized the structure and parameters of the TCN model proposed in [19], and the experimental results show that the improved TCN model performs better in time series prediction.
We set the size of the 1-D convolution kernel to $k = 2$; as mentioned before, the receptive field of a TCN with dilation rates $2^{0}, 2^{1}, \ldots, 2^{n-1}$ can be expressed as
$$R = 1 + (k - 1)\sum_{i=0}^{n-1} 2^{i} = 2^{n}.$$
We used a sliding window of size 28 as the input, so the receptive field had to satisfy $2^{n} \geq 28$, giving $n = 5$; therefore, the maximum dilation rate of the TCN was $2^{4} = 16$, and the dilation rates used in this paper were [1, 2, 4, 8, 16]. The structure of the TCN is shown in
Figure 9a. After the hidden layer with dilation rate D = 16, we added a hidden layer C to optimize the TCN structure. As shown in Figure 9b, this layer does not use the residual structure; instead, two convolution layers are used to extract features. This hidden layer can extract more effective features and thus improve the results. The proposed TCN structure consists of an input layer, an initial convolution layer, five residual blocks, hidden layer C, and finally a fully connected layer, for 56 layers in total. We chose Keras [37] as our deep learning framework for training the TCN model. Considering the effect of the rectified linear unit (ReLU) activation function on the output, and in order to keep the input and output variances consistent, He-normal [38] was selected as the weight initialization method. After simple tuning of the TCN parameters, the number of convolution kernels was set to 64 per layer and the dropout rate to 0. When training the TCN model, we set the number of epochs to 100 and selected Adam [39] as the optimizer to adapt the learning rate.
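For illustration, the following is a minimal Keras sketch of the structure described above: an initial convolution, five residual blocks with dilation rates [1, 2, 4, 8, 16], hidden layer C without a residual connection, and a final Add layer combining the six sources (cf. Figure 9a,b). The window size of 28 and the 62 control variables are taken from the text; the exact layer wiring and the output head are our assumptions, not the authors' original code.

```python
# Minimal sketch, assuming tensorflow.keras; the layer wiring is illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

WINDOW, N_FEATURES = 28, 62   # sliding window and control variables (from the text)
FILTERS, KERNEL = 64, 2       # 64 kernels per layer, kernel width 2

def residual_block(x, dilation):
    """Residual block: two dilated causal convolutions plus a skip connection."""
    y = layers.Conv1D(FILTERS, KERNEL, padding="causal", dilation_rate=dilation,
                      activation="relu", kernel_initializer="he_normal")(x)
    y = layers.Conv1D(FILTERS, KERNEL, padding="causal", dilation_rate=dilation,
                      activation="relu", kernel_initializer="he_normal")(y)
    return layers.add([x, y])

inputs = layers.Input(shape=(WINDOW, N_FEATURES))
x = layers.Conv1D(FILTERS, 1, kernel_initializer="he_normal")(inputs)  # initial convolution

skips = []
for d in [1, 2, 4, 8, 16]:
    x = residual_block(x, d)
    skips.append(x)

# Hidden layer C: two convolution layers without the residual structure.
c = layers.Conv1D(FILTERS, KERNEL, padding="causal", activation="relu",
                  kernel_initializer="he_normal")(x)
c = layers.Conv1D(FILTERS, KERNEL, padding="causal", activation="relu",
                  kernel_initializer="he_normal")(c)

# Final Add layer: the five residual-block outputs plus hidden layer C.
merged = layers.add(skips + [c])
outputs = layers.Dense(1)(layers.Flatten()(merged))  # fully connected output

model = models.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
# model.fit(x_train, y_train, epochs=100)
```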
In order to test this network structure, we first used the source domain dataset to compare its prediction performance with several other commonly used prediction models, which included LSTM, gated recurrent units (GRU) [
40], convolutional neural network-LSTM (CNN-LSTM), and bi-directional LSTM (BiLSTM) [
41]. LSTM is an improved recurrent neural network (RNN) that exhibits excellent performance in processing time series data. GRU simplifies the LSTM structure for faster training. BiLSTM is based on LSTM and can learn long-term dependencies from both the forward and backward directions of the time series. CNN-LSTM is a combination of convolutional neural networks and LSTM and has performed well in most studies. The detailed information of these networks is shown in Table 2, where the parameters were determined by grid search.
In order to measure the prediction results of each model, we chose the root mean square error (RMSE) and the mean absolute error (MAE) as the main evaluation indicators of the regression prediction. RMSE and MAE are defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_{i} - \hat{y}_{i}\right|$$
where $n$ represents the number of samples, $y_{i}$ represents the observation of the $i$th sample, and $\hat{y}_{i}$ represents the predicted value of the $i$th sample. The smaller the RMSE and MAE values, the higher the prediction accuracy and the better the model performance.
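For concreteness, the two metrics can be computed directly, e.g., with NumPy (a minimal sketch, not the authors' evaluation code):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean square error, as defined above.
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    # Mean absolute error, as defined above.
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
```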
As mentioned earlier, the data of heating zone 1 were used as the source domain data: the first 70% of the data were used as the training set, and the remaining 30% of the data were used as the test set. The mean values of
RMSE and
MAE after multiple training procedures of these models are shown in
Table 3. From the data shown in the table, it is clear that our proposed TCN model performed best in this case. Although the hybrid CNN-LSTM model has been reported as a relatively advanced sequence processing model, it did not stand out in this case. In addition, GRU and BiLSTM performed better than LSTM.
As mentioned earlier, we redesigned the structure of the TCN. As shown in
Figure 9a, the last Add layer combines six sources of information: the hidden layers with dilation rates of 1, 2, 4, 8, and 16, and hidden layer C. Among them, hidden layer C was added after the hidden layer with a dilation rate of 16 in order to further extract the features of the previous hidden layers. The initial TCN model is shown in Figure 9c; its final Add layer combines the information of only five hidden layers, with dilation rates of 1, 2, 4, 8, and 16. Therefore, our improved TCN can extract more comprehensive features and obtain better results.
To further verify the improved structure, we used the University of California, Irvine (UCI) open-source Beijing PM2.5 dataset for experimental verification. The dataset is described in Table 4. The time period of the data is from 1 January 2010 to 31 December 2014, with a one-hour interval between each datum. We used the attributes measured in the past 12 h, including dew point, temperature, pressure, and so on, to predict the PM2.5 concentration in the next hour. We took the data of the first 3.5 years as the training set and the remaining data as the test set. Since only 12 h of historical data were needed, we set the dilation rates to [1, 2, 4, 8], and the other parameters were the same as for the furnace temperature prediction.
Table 5 compares the scores of the initial TCN and the improved TCN. The experimental results show that the improved TCN outperforms the initial TCN in time series prediction.
In addition to the improvement of the TCN structure, we used the idea of transfer learning to optimize the TCN parameters. We regard this as a method of neural network weight initialization. The specific optimization process is to freeze the shallow weights of the TCN and then use the training set to update the parameters of the unfrozen high-level layers, which we call self-TL-TCN. First of all, we used the training set in the source domain (heating zone 1) to train a preliminary TCN model. The TCN structure proposed in this paper contains 56 layers, but only the weights and biases of the convolution layers need to be updated. We began freezing from layer 20 of the pre-trained TCN, that is, from the hidden layer with a dilation rate of 4. Finally, we used the training set in the source domain to update the parameters of the unfrozen layers to complete the fine-tuning. The reasons for this are as follows: (1) the optimized model is given more appropriate initial weights than the He-normal method provides, and (2) the convolution layers with high dilation rates in the preliminary TCN model may cause information loss. The training process is shown in
Figure 10.
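A minimal Keras sketch of this freeze-and-fine-tune step, assuming the model object from the earlier sketch and a chosen freezing depth (the variable names here are hypothetical):

```python
import tensorflow as tf

def self_transfer(pretrained, n_frozen, x_train, y_train):
    """Freeze the first n_frozen layers of the pre-trained TCN and
    fine-tune the remaining layers on the source-domain training set."""
    for layer in pretrained.layers[:n_frozen]:
        layer.trainable = False   # shallow features stay fixed
    for layer in pretrained.layers[n_frozen:]:
        layer.trainable = True    # high-level features are updated
    # Re-compile so the new trainable flags take effect.
    pretrained.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
    pretrained.fit(x_train, y_train, epochs=100, verbose=0)
    return pretrained

# e.g., self_tl_tcn = self_transfer(model, 29, x_train_src, y_train_src)
```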
The performance of the TCN after parameter initialization under different numbers of frozen layers is shown in Figure 11. It can be seen that the lowest RMSE and MAE scores were obtained when the number of frozen layers was 29. Therefore, when optimizing the preliminary TCN model, the best prediction performance is achieved by freezing the first 29 layers and fine-tuning the parameters of the upper layers. The self-TL-TCN model, after the two optimizations of structure and parameters, is the final source domain model in this study.
After determining the optimal number of frozen layers, we used the test set of the source domain for evaluation. Figure 12 shows the prediction results of the TCN model before and after optimization. It can be seen from the figure that the proposed self-TL-TCN achieves higher prediction accuracy.
To verify the effectiveness of the proposed network, we again used the Beijing PM2.5 dataset. Since the past 12 h of data were used as the historical input, the TCN had 46 layers, and we began freezing from layer 20.
Figure 13 shows the scores of the self-TL-TCN and the initial TCN under different numbers of frozen layers. The comparison results between the optimized TCN model and the original TCN model are shown in Table 6. As can be seen from the figure, using self-transfer learning to initialize the parameters achieves better prediction results.
4.3. Transfer Learning for Furnace Temperature Prediction
As mentioned before, a heating furnace system has multiple heating zones, and the temperature of each zone needs to be predicted accurately. However, because the heating furnace is a complex controlled object whose temperature curve is nonlinear and unstable, among other characteristics, the temperatures of some heating zones are difficult to predict accurately; for a zone such as zone 2, for example, a neural network alone is ineffective. In addition, because all heating zones are in the same furnace, the heating processes are similar. As shown in Table 1, the control variables of each heating zone are very similar, with at most nine variables differing. Therefore, we believe that the knowledge learned by the neural network in each heating zone is also similar, so we can transfer the knowledge learned by the neural network in a heating zone whose temperature can be accurately predicted to the remaining heating zones. There are 10 heating zones in this case; if we built a separate model for each one, different heating zones might require different neural network models, which would undoubtedly increase the computational cost. This is another reason that we used transfer learning.
Since our goal was to build models that can solve different tasks on the same industrial equipment, we divide the data collected by industrial sensors into two types. In the first, different tasks share a control system; that is, the same variables control different tasks, such as the prediction of the water content and the temperature at the outlet of a dryer in a tobacco factory, where the control variables of the two tasks are the same. In the second, different control systems control different task requirements. For the first kind of sensor data, because the features of each domain are the same, the fine-tuning method was used to complete the knowledge transfer. For the second kind, we used a generative adversarial loss to complete the domain adaptation. We took the sensor data of the heating furnace as an example to verify the performance of the two methods. To the best of our knowledge, this is also the first time that transfer learning has been applied to the prediction of furnace temperature.
As mentioned before, because all heating zones are located in the same furnace, there is high similarity between the target domains and the source domain. For example, only one control variable differs between zone 1 and zone 2; the difference between zone 1 and zone 10 is the largest, but even there only nine variables differ out of the 62 control variables in each heating zone. Therefore, we used the fine-tuning method shown in
Figure 14 to complete the transfer.
By traversing all hidden layers of the network, the optimal number of frozen layers for each heating zone was determined, as shown in Table 7. The reason that the numbers of frozen layers in the table are small is the previously mentioned lack of extrapolation capability of neural networks: the distributions of the test set and training set differ considerably, which means that more parameters need to be updated to obtain better results. The temperature distribution of each zone is shown in
Appendix A,
Figure A1.
When the features of the source domain and the target domain differ, we used a generative adversarial network (GAN) to align the features, as shown in Figure 15. The source domain data were fed into the discriminator as the real data. The discriminator uses three fully connected layers as a binary classifier to judge the data source; to prevent overfitting, a dropout layer was added between the second and third layers. The first two layers use ReLU as the activation function, and the third layer uses Sigmoid. In this case, the generator of the GAN was used as the feature extractor, and we used 1-D convolution to extract the target domain features. The generator generates time series features with higher similarity to the source domain in order to replace the original target domain features. When the discriminator cannot judge whether the data come from the generator or are real, the features extracted by the generator have high similarity with the features of the source domain. In addition, this case is supervised learning; in order to further improve the prediction accuracy, we need to use the target variable of the target domain to fine-tune the target model after using the GAN for feature alignment. Therefore, we added a fine-tuning strategy after the GAN to generate the final target model.
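The following is a minimal Keras sketch of this adversarial alignment scheme. The layer sizes, batch handling, and variable names are illustrative assumptions; only the overall design (a 1-D convolutional generator as feature extractor, and a three-layer fully connected discriminator with dropout between the second and third layers) follows the text.

```python
# Minimal sketch of the GAN-based feature alignment, assuming tensorflow.keras.
import numpy as np
from tensorflow.keras import layers, models

WINDOW, N_FEATURES = 28, 62  # sliding window and control variables (from the text)

# Generator = feature extractor: 1-D convolutions over target-domain windows.
generator = models.Sequential([
    layers.Conv1D(64, 2, padding="causal", activation="relu",
                  input_shape=(WINDOW, N_FEATURES)),
    layers.Conv1D(N_FEATURES, 2, padding="causal"),
])

# Discriminator: three fully connected layers acting as a binary classifier
# (source domain = real), with dropout between the second and third layers.
discriminator = models.Sequential([
    layers.Flatten(input_shape=(WINDOW, N_FEATURES)),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Stacked model: train the generator to make target features look "real".
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(src_batch, tgt_batch):
    # 1) Update the discriminator on real (source) vs. generated features.
    fake = generator.predict(tgt_batch, verbose=0)
    discriminator.train_on_batch(src_batch, np.ones((len(src_batch), 1)))
    discriminator.train_on_batch(fake, np.zeros((len(fake), 1)))
    # 2) Update the generator so the discriminator labels its output as real.
    gan.train_on_batch(tgt_batch, np.ones((len(tgt_batch), 1)))
```

After this alignment, the generated target features replace the original ones, and the target model is fine-tuned with the target variable, as described above.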
We established TCN, LSTM, BiLSTM, GRU, and CNN-LSTM models on the nine target domains and chose the best-performing model per target domain to optimize. The optimization method uses self-transfer learning to change the network weight initialization. In addition, we compared against the transfer learning method based on BiLSTM proposed by Jun Ma et al. [30], which greatly improved air quality prediction results. Specifically, the initial BiLSTM was first established on the source domain and optimized by self-transfer to obtain self-TL-BiLSTM; then, the target domain model TL-BiLSTM was established based on transfer learning. We compared the prediction results of these models with those of the two methods proposed in this paper.
Table 8 shows the RMSE scores of each model applied to nine target domains.
Table 9 shows the MAE scores. The last column of the two tables is obtained by comparing these two methods with the best-performing model without knowledge transfer.
Figure 16 presents two histograms of the RMSE and MAE scores for these models applied to the nine target heating zones.
The following information can be obtained from these two tables and histograms:
The performance of the TCN model is better than that of the RNN variants in almost all zones; GRU outperforms TCN only in heating zone 9.
The self-TL-TCN proposed by us has better performance than the common TCN model, and the self-TL-GRU likewise outperforms the common GRU model. This means that network performance can be improved by changing the initial weights of the network based on the transfer learning idea.
The two transfer learning frameworks proposed by us can effectively solve the problem of large prediction errors in some heating zones, greatly reducing the prediction error. Zones 10 and 9 show the largest error reductions, with RMSE reduced by 57.38% and 43.97% and MAE reduced by 51.63% and 50.59%, respectively. TL-BiLSTM also performs better than the models without knowledge transfer in each zone, but worse than our models.
In addition, the reason that the fine-tuning method performs better in zone 4, zone 7, zone 9, and zone 10 is that the original target features are more similar to the source domain features than the newly generated features are. Table 10 shows the similarity between the generated features and the source domain, as well as the similarity between the original features and the source domain. We used the Pearson correlation coefficient to measure the similarity, and we only measured the similarity between the source domain and the target domain. It can be seen that the similarity between the original data of the above four zones and the source domain is higher. Of course, both fine-tuning and GAN-TL perform better than no knowledge transfer. That is to say, fine-tuning is a good solution when time series with the same features are transferred; when time series with different features are transferred, the GAN-TL method that we propose is a suitable solution.
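For reference, the similarity measure used in Table 10 can be computed as follows (a minimal sketch on synthetic placeholder series, not the study's data):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
source_feature = rng.normal(size=1000)                        # placeholder source-domain feature
target_feature = source_feature + rng.normal(0.0, 0.5, 1000)  # placeholder target-domain feature

similarity, _ = pearsonr(source_feature, target_feature)      # Pearson correlation coefficient
print(round(similarity, 3))
```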
Figure 17 compares the prediction results based on transfer learning with the prediction results without knowledge transfer for all target domains. These prediction results confirm the previously mentioned shortcoming of neural networks with regard to extrapolation: when the test set differs greatly from the training set, good prediction results cannot be produced, as in zone 2 and zone 6. However, we can solve this problem through transfer learning. To the best of our knowledge, this is the first time that transfer learning has been proposed to solve the non-extrapolation problem of neural networks.
In order to measure the performance of the proposed transfer learning framework more comprehensively, the complexity of each model is calculated. The complexity of a convolutional network is defined as follows:
$$\mathrm{Complexity} \sim O\left(\sum_{l=1}^{D} \frac{K_{l}}{\mathrm{Stride}_{l}} \cdot C_{l-1} \cdot C_{l}\right)$$
where $D$ represents the number of convolutional layers, $l$ indexes the $l$-th convolutional layer, $K_{l}$ represents the size of the convolution kernel, $\mathrm{Stride}_{l}$ represents the step length, and $C_{l}$ represents the number of output channels. In this paper, each convolution layer has 64 convolution kernels. Since this is one-dimensional convolution, the input dimension is 62 and the size of the convolution kernel is 62 × 2.
This paper takes LSTM as an example to illustrate the computational complexity of recurrent neural networks. The complexity is defined as follows:
$$\mathrm{Complexity} \sim O\left(\sum_{l=1}^{D} 4\left(X_{l} H_{l} + H_{l}^{2} + H_{l}\right)\right)$$
where $D$ represents the number of hidden layers, $l$ indexes the $l$-th hidden layer, $X_{l}$ represents the input dimension, and $H_{l}$ represents the hidden layer size.
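As a worked illustration of the two complexity formulas (as reconstructed above), the sums can be tallied programmatically; the layer configurations below are placeholders rather than the exact ones behind Table 11:

```python
# Minimal sketch: evaluate the two complexity formulas above.
def conv_complexity(conv_layers):
    # conv_layers: list of (kernel_size, stride, in_channels, out_channels)
    return sum(k // s * c_in * c_out for k, s, c_in, c_out in conv_layers)

def lstm_complexity(lstm_layers):
    # lstm_layers: list of (input_dim, hidden_size); an LSTM cell has 4 gates
    return sum(4 * (x * h + h * h + h) for x, h in lstm_layers)

# Placeholder configurations: kernel width 2, stride 1, 64 kernels per layer.
print(conv_complexity([(2, 1, 62, 64)] + [(2, 1, 64, 64)] * 12))
print(lstm_complexity([(62, 64), (64, 64)]))
```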
Table 11 shows the calculation results of the complexity of each model based on the above two formulas. At the same time, we calculated the running time of each model on the same computer configuration, as shown in
Table 12.
From the above two tables, combined with the prediction results, we can see that the framework proposed in this paper has the best performance in terms of both runtime and prediction results. On the one hand, the TCN-based prediction framework is a stack of convolutional layers whose kernels are shared within each layer, which greatly accelerates computation. On the other hand, the target model only needs to train the unfrozen parameters, which greatly reduces the number of parameters to update.
In order to verify the idea that using transfer learning between similar tasks produces better results than not using it, we again used the aforementioned open-source Beijing PM2.5 data for experimental verification. We previously conducted prediction at hourly intervals; however, at larger time resolutions, such as days or weeks, it is difficult to achieve high-precision predictions. Therefore, we used the research methods in this paper to transfer the knowledge learned at a smaller time resolution to a larger one; that is, we transferred the knowledge learned at hourly intervals to predicting air quality at daily intervals. First, we re-sampled the original data at daily intervals to form the target domain data. Since the target data are simply the original data re-sampled at a coarser resolution, the features remain the same, so we used the fine-tuning method for knowledge transfer. A grid search determined that the prediction of the target domain was best when the first 23 layers of the source domain model were frozen.
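The re-sampling step can be expressed, for instance, with pandas; this is a minimal sketch in which the file name is a placeholder for the UCI dataset:

```python
import pandas as pd

# Load the hourly UCI Beijing PM2.5 data (placeholder path; the real
# dataset has year/month/day/hour columns plus the measured attributes).
df = pd.read_csv("beijing_pm25.csv")
df.index = pd.to_datetime(df[["year", "month", "day", "hour"]])

# Target domain: the same series re-sampled at daily intervals.
daily = df.resample("D").mean(numeric_only=True)
```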
Table 13 compares the scores of each model. It can be seen from the table that the method proposed in this paper can improve the prediction accuracy of the PM2.5 concentration at large time resolutions; compared with no transfer learning, transfer learning achieves better results.
To further demonstrate the rationality of our source domain selection and the non-extrapolation shortcoming of neural networks, and guided by the temperature distribution diagram of Figure A1 in Appendix A, we chose heating zone 3, which has a relatively uniform temperature distribution, as an alternative source domain. Compared with the other zones, the distributions of the test set and training set in heating zone 3 are more uniform, but less suitable than those in heating zone 1. We used the fine-tuning method to transfer knowledge.
Table 14 and
Table 15 show the prediction results of the target domain when zone 3 is the source domain.
Table 16 compares the transfer results with zone 3 as the source domain against those with zone 1 as the source domain. From
Table 14,
Table 15 and
Table 16, it can be seen that, compared with zone 3 as the source domain, the scores of the heating zones are significantly better when zone 1 is the source domain, except for the RMSE of zone 2. There is even a negative transfer phenomenon when the knowledge learned by the neural network is transferred from zone 3 to zone 9. Therefore, this experiment verifies the rationality of our source domain selection.
Through this experiment, we can establish a source-selection standard for industrial sensor data: when the data distribution of the test set lies outside that of the training set, or the two distributions differ greatly, using such data as the source domain may cause negative transfer, because the neural network cannot extrapolate. This standard is applicable not only to the heating furnace data but also to other sensor data, and could even be extended to more applications.
4.4. Discussion on Whether the Framework Is Overfitted
To verify whether the presented transfer learning framework is “overfitted” to one concrete situation, we conducted experiments covering the following three scenarios: (1) transfer between different pieces of equipment at the same time, (2) transfer within the same piece of equipment at different times, and (3) transfer between different pieces of equipment at different times.
First, we conducted knowledge transfer between different pieces of equipment at the same time. Calling the above heating furnace “heating furnace 1”, we transferred the knowledge learned in zone 1 of heating furnace 1 to heating furnace 2. The sensor data of the two heating furnaces were collected at the same time. We selected three heating zones from the preheating section, heating section, and soaking section of furnace 2, namely zone 1, zone 5, and zone 10, to verify the transfer results.
Table 17 shows the comparison between the proposed TL-TCN framework and the existing models.
It can be concluded from the above table that the prediction results based on TL-TCN are greatly improved compared to those obtained without transfer learning. This proves the reliability of the proposed framework for transfer between different devices. In addition, we selected three heating zones from the three heating sections of heating furnace 2 for knowledge transfer, which also shows that the TL-TCN-based framework can transfer the knowledge learned from one source domain to any heating zone of a different heating furnace.
Secondly, we conducted knowledge transfer within the same piece of equipment at different times. The previous data were collected from 10:00 on 24 January to 10:00 on 25 January 2019. We transferred the knowledge learned from these data to data collected from 0:00 on 24 February to 0:00 on 25 February 2019, an interval of one month between the two datasets. Both datasets were collected from heating furnace 1. The source domain was still heating zone 1, and the target domains were heating zone 1, heating zone 5, and heating zone 10.
Table 18 shows the scores after knowledge transfer.
It can be seen from the above table that the proposed framework can be applied to transfer within the same equipment at different times. Moreover, the source domain model was not recalibrated but only fine-tuned with historical data from before the target data were acquired on 24 February. This experiment proves that the knowledge learned by the proposed framework can be applied to data acquired one month later.
Finally, we transferred the knowledge learned in the source domain of this article to the data of different equipment at different times. The target domain data were collected from furnace 2 between 0:00 on 24 February and 0:00 on 25 February. We also selected zone 1, zone 5, and zone 10 as the target domains.
Table 19 shows the prediction scores of each model.
It can be seen from the table that the proposed framework can be applied to transfer between different pieces of equipment at different times; again, we did not recalibrate the source domain model but only used the target domain data to fine-tune it. However, it can be concluded from the above three experiments that the improvement from knowledge transfer across both different equipment and different times is smaller. This is a main direction of future research.