1. Introduction
Soft Sensors (SSs) are mathematical models of industrial processes able to estimate hard-to-measure variables (i.e., quality variables) by exploiting their dependence on easy-to-measure variables (i.e., quantity variables) [1,2]. SSs are widely adopted in industrial processes to improve process monitoring and control. Real-time quality variable estimation is necessary when quality variables are measured with large delays or require time-consuming laboratory analysis. In these cases, the design of an SS allows for increasing the performance of feedback control strategies. SSs are widely diffused in process industries, such as refineries [3], chemical plants [4], cement kilns [5], power plants [6], pulp and paper mills [7], food processing [8], polymerization processes [9], and wastewater treatment systems [10].
SS implementation often requires the use of black-box nonlinear dynamical identification strategies, which use data collected from the distributed control system [11] and stored in the historical database. To achieve this aim, machine learning (ML) techniques are mostly used, ranging from Support Vector Regression [12], Partial Least Squares [13], and classical multilayer perceptrons [1,14,15,16,17] to more recent deep architectures, such as deep belief networks [9,18,19,20], long short-term memory networks (LSTMs) [21,22], and stacked autoencoders [23,24,25,26]. Bayesian approaches [27], Gaussian Process Regression [28], Extreme Learning Machines [29], and adaptive methods [30,31,32] are also used.
Data-driven SS design can be summarized in the following steps, which are typical of the system identification procedure [33]:
data acquisition, selection, and pre-processing;
model class selection;
model order selection;
model identification; and
model validation.
The design phase involves many open problems and time-consuming tasks [34,35]. Among these, we can mention input-variable choice; model-class selection (e.g., linear/nonlinear, static/dynamic, time-variant/invariant); model-order design; and model-structure and hyperparameter selection. Another relevant problem is known as labeled data scarcity. In fact, conventional supervised learning algorithms, usually adopted in SS design, require the use of labeled data. While quantity variables are sampled at a fast rate, the corresponding quality variables are, in general, infrequently measured. This issue can be addressed by using semi-supervised learning, which exploits unlabeled data in an unsupervised training phase and labeled data in a supervised fine-tuning phase [19,36,37].
Since some industrial processes present high nonlinearity and intrinsic dynamical dependencies between input and output variables, feed-forward artificial neural networks (ANNs) require the use of tapped delay lines (TDLs) for the I/O variables [1].
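For illustration, a TDL can be unrolled offline into a regressor matrix, so that a static feed-forward network learns the process dynamics from past I/O samples. The following Python sketch uses hypothetical function and parameter names and arbitrary delay orders:

```python
import numpy as np

def tapped_delay_regressors(u, y, n_u=3, n_y=3):
    """Build regressors [u(t-1),...,u(t-n_u), y(t-1),...,y(t-n_y)]
    so that a static feed-forward ANN can model the dynamics."""
    start = max(n_u, n_y)
    rows = []
    for t in range(start, len(y)):
        past_u = [u[t - k] for k in range(1, n_u + 1)]
        past_y = [y[t - k] for k in range(1, n_y + 1)]
        rows.append(past_u + past_y)
    X = np.asarray(rows)     # regressor matrix, one row per time instant
    target = y[start:]       # one-step-ahead targets y(t)
    return X, target
```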
As an alternative, Recurrent Neural Networks (RNNs) can be used to capture temporal dynamic behaviors. In such networks, connections between hidden units are included within the same level and towards previous levels, making their output influenced by both the current and previous time instants. RNNs can therefore extract the sequential information available in the input data and can show better performance when modeling industrial processes.
To capture long-term dependencies among the variables, Long Short-Term Memory (LSTM) networks have been introduced. They contain memory cells that can store information for long periods of time during the training phase.
A common problem with recurrent networks is the large number of hyperparameters to be optimized. Hyperparameters directly control the behavior of the training algorithm, and their correct setting strongly impacts the performance of the final model. Different hyperparameter optimization search strategies have been proposed in the literature, such as grid search, genetic algorithms, Bayesian optimization, and Tree-structured Parzen estimators [38,39,40]. However, hyperparameter optimization is an extremely computation- and time-consuming task.
The outcome of the SS design process is a model tailored to the specific dataset adopted in the learning procedure, which should, therefore, cover all the working points of the plant. In general, the obtained model is not scalable to other processes without adaptation. Developing an SS for a similar process therefore requires a new design procedure.
In an effort to reduce the computational time required to design an SS for similar processes, model transferability plays a key role. Transfer learning (TL) focuses on storing the knowledge gained while learning a task from a source domain and utilizing it for a different but related problem, defined as the target domain [41]. TL techniques can be divided into three classes: inductive, transductive, and unsupervised transfer learning [42,43]. In inductive TL, labeled data in the target domain are required to induce a predictive model to be used in the target domain. In transductive TL methods, no labeled data are available in the target domain, while they are available in the source domain. Unsupervised transfer learning focuses on solving unsupervised learning tasks, such as clustering and dimensionality reduction, in both the source and the target domains; no labeled data are used. A scheme of the differences among the TL methods is reported in Figure 1.
TL methods are widely diffused in applications such as classification, image processing, and natural language processing, as described in the next section. However, there are only a few studies that investigate TL applications to industrial processes, both for SS design [44] and for fault detection and diagnosis [45,46].
In our work, we focus on inductive transfer learning for SS design, when a limited number of labeled data is available in the target domain. Two different strategies are proposed. The first strategy, called the fine-tuned transferred model (FTTM), consists of performing only a fine-tuning, with the dataset belonging to the target domain, of the network weights of the optimal model designed in the source domain. The second strategy, called the transferred hyperparameters model (THM), is based on adopting only the optimal hyperparameters identified in the source domain to train the SS in the target domain, starting from random initial weights. RNN- and LSTM-based SSs are considered and compared in regard to their transferability properties. The use of the proposed techniques allows, at the same time, to reduce the time needed to design an SS for a similar process and to cope with the problem of labeled data scarcity.
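As a schematic illustration of the two strategies, consider the following Python/PyTorch sketch (the experiments in this paper were run in MATLAB; function names such as `fttm`, `thm`, and `build_model` are hypothetical):

```python
import copy
import torch

def train(model, target_loader, epochs, lr):
    """Supervised training on the (small) labeled target dataset."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for x, y in target_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

def fttm(source_model, target_loader, epochs=10, lr=1e-4):
    """Fine-Tuned Transferred Model: start from the weights of the
    optimal source-domain SS and fine-tune them on target data."""
    return train(copy.deepcopy(source_model), target_loader, epochs, lr)

def thm(build_model, source_hyperparams, target_loader, epochs=50, lr=1e-3):
    """Transferred Hyperparameters Model: reuse only the optimal
    source-domain hyperparameters; weights start from random values."""
    return train(build_model(**source_hyperparams), target_loader, epochs, lr)
```

In both cases, the costly hyperparameter search on the target domain is skipped; the two functions differ only in whether the source weights are retained.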
The transferability of SSs between two similar industrial processes is considered in our work. A Sulfur Recovery Unit (SRU) from a refinery located in Sicily (Italy) is considered as a case study. It is a highly nonlinear process with dynamic dependencies between input/output variables [47], and it consists of different lines that work in parallel. RNN- and LSTM-based SSs have been designed for two lines of the process (i.e., SRU line 2 and SRU line 4). The transferability of models designed for SRU line 4 (i.e., the source domain) to SRU line 2 (i.e., the target domain) has been investigated.
The main contributions of this work are summarized as follows: (1) The TL methodologies are applied to SS design, a topic rarely considered in the TL research field. (2) The transferability of nonlinear dynamical models is considered. This aspect is relevant both in the field of SS design, which often considers static models, and in TL applications. (3) Two dynamical neural models (i.e., RNN and LSTM) are designed and compared in regard to model accuracy and transferability. The trade-off between the performance and the computational time required to transfer the SS from the source to the target domain is analyzed. (4) Two different TL approaches are presented and discussed. Both techniques have the advantage of avoiding the time-expensive procedure of hyperparameter selection for the target dataset. (5) A real-world industrial case study is considered. (6) The presented framework is successfully applied in two different scenarios, to outline the advantages in the presence of labeled data scarcity in the target domain. To underline this aspect, analyses including datasets of similar size and datasets with a reduced amount of data for the target domain are reported.
The remainder of this paper is organized as follows: In Section 2, the state of the art on TL is reported; in Section 3, the RNN and LSTM structures are explained in detail, along with the SS implementation; in Section 4, the proposed TL methods are introduced; in Section 5, the case study is presented, and the numerical results of the TL procedures are reported and discussed. Conclusions are finally drawn in Section 6.
2. Related Works
Transfer learning is a relevant topic in the ML field, especially with reference to deep learning strategies. Most of the theoretical results and applications belong to the area of classification, including fault detection applications, while only a few results are available for SSs and regression estimation. In this section, some related works are briefly introduced. Examples of applications in different research areas, such as image classification [48,49], text classification [50], and biometrics [51], can be found in the literature. In Reference [52], a new multi-source deep transfer neural network algorithm, based on a convolutional neural network (CNN) and a multi-source TL technique, is proposed and evaluated on several classification benchmarks. A systematic analysis of computational intelligence-based TL techniques is reported in Reference [53]. Methods based on neural networks, Bayesian systems, and fuzzy logic are described in that paper, along with applications in the fields of language processing, computer vision, biology, finance, and business management. A structured description of the application fields and methodologies related to TL can also be found in References [41,42,53,54]. Some metrics suitable for evaluating the distance between domains are reported in Reference [43].
Applications in the industrial field, related to TL for process monitoring, mostly deal with fault detection tasks. In Reference [55], an application to a gearbox fault dataset, based on CNNs, is presented. A CNN is trained on large datasets to learn hierarchical features from raw data; both the architecture and the weights of the pre-trained CNN are then transferred to a new task using a fine-tuning procedure, and different TL strategies are compared to analyze feature transferability from the different levels of the structure. In Reference [56], a TL method for gas turbine fault diagnosis, based on CNNs and support vector machines, is proposed. The scarcity of information related to faults is addressed by applying a feature mapping method that reuses the internal layers of a CNN trained on the normal dataset. Another interesting approach of TL applied to CNNs is reported in Reference [57]. The proposed method addresses a qualitative tool condition monitoring problem, using computer vision, CNNs, and TL approaches to teach the machines the conformity of the component-producing tool. In Reference [58], a fault diagnosis method based on variational mode decomposition, multi-scale permutation entropy, and feature-based TL is proposed and applied to the vibration signals of wind turbines. In Reference [59], a linear discriminant analysis (LDA)-based deep transfer network is proposed for fault classification of chemical processes, such as the Tennessee Eastman benchmark and real hydrocracking processes. A maximum mean discrepancy-based loss function is used to extract similar latent features and reduce the discrepancy of distributions between the source and target data. Domain-Adversarial Neural Networks are introduced as a domain-adaptive TL technique in Reference [60] to implement transferable fault diagnosis. A fault diagnosis system is also developed in Reference [61] using an LSTM model, based on instance TL, to reduce the differences in the probability distributions of the source and the target domains. Other applications in the field of fault diagnosis are reported in References [62,63].
A few applications to SS design have been proposed in very recent works. In Reference [64], a domain adaptation soft sensing framework for multi-grade chemical processes is discussed, where a limited number of labeled samples is available for some operating grades. An adversarial transfer learning SS is proposed to reduce the data distribution discrepancy between different grades, thus allowing for a supervised SS development. A similar approach, based on extreme learning machines, is proposed in Reference [65] to develop an SS for a simulated continuous stirred tank reactor and an industrial polyethylene process. In Reference [25], a data-driven model based on deep dynamic feature extraction and transfer methods is applied to build a virtual sensor for cement quality prediction. A large unlabeled dataset is used to extract nonlinear dynamic features, along with a limited labeled dataset; the features are then transferred to a regression model, called eXtreme Gradient Boosting, for output prediction, and a model updating strategy is also proposed to include online data samples. In Reference [66], an instance-based TL method is combined with a boosting decision tree. The procedure is adopted to estimate wind power generation and uses correlated zones of the source domain to realize instance-based transfer learning.
3. Theory Fundamentals
In this section, a description of the RNN and LSTM architectures, used to identify the data-driven nonlinear dynamical models, is reported. Moreover, the details on the structures adopted in this work are provided.
3.1. Recurrent Neural Networks
RNNs [67] are widely used to capture the temporal dynamic behavior of time sequences [68]. RNNs can make use of past states and past information for the present state estimation, making them suitable for sequence processing, such as natural language, handwriting recognition, and speech recognition [69,70,71]. Such a property makes this type of network able to identify dynamical models of industrial processes.
The intrinsic dynamic structure of the RNN allows it to avoid the regressor selection procedure needed when using static networks; it is also not necessary to feed past I/O samples into the input layer. RNNs have the same structure as multilayer perceptrons, with the difference that, in an RNN, neuron connections are also included between previous levels and within the same level. This forms a directed graph along a temporal sequence since, at each instant, the nodes connected through a recurrent connection receive inputs from both the current and the previous states, based on the dependencies created in the network.
The connections between the output of a layer and the input of a previous one are realized by applying a real-valued time delay between them. Such delays are implemented with TDL blocks.
Figure 2 shows an RNN with two hidden layers, with input delays $d_u$, internal recurrent connection delays $d_r$, and output recurrent connection delays $d_o$.
Given a layer ℓ, its output $y^{\ell}(t)$ is given by
$$y^{\ell}(t) = f^{\ell}\!\left(n^{\ell}(t)\right),$$
where $f^{\ell}$ is the activation function, namely the hyperbolic tangent $\tanh(\cdot)$ in the hidden layers (i.e., $\ell = 1$, $\ell = 2$) and the linear function in the output layer (i.e., $\ell = 3$). The input signals $n^{\ell}(t)$ are given by the following equations:
$$
\begin{aligned}
n^{1}(t) &= \mathbf{IW}\,\bar{u}(t) + \mathbf{LW}_{1,1}\,\bar{y}^{1}(t) + \mathbf{LW}_{1,2}\,\bar{y}^{2}(t) + \mathbf{LW}_{1,3}\,\bar{y}^{3}(t) + \mathbf{b}^{1},\\
n^{2}(t) &= \mathbf{LW}_{2,1}\,y^{1}(t) + \mathbf{LW}_{2,2}\,\bar{y}^{2}(t) + \mathbf{b}^{2},\\
n^{3}(t) &= \mathbf{LW}_{3,2}\,y^{2}(t) + \mathbf{LW}_{3,3}\,\bar{y}^{3}(t) + \mathbf{b}^{3}.
\end{aligned}
$$
Matrix $\mathbf{IW}$ contains the weights of the inputs; $\mathbf{LW}_{1,1}$ is the internal feedback weight matrix in layer 1; $\mathbf{LW}_{1,2}$ is the external feedback weight matrix from layer 2; $\mathbf{LW}_{1,3}$ is the external feedback weight matrix from layer 3 (i.e., the output layer); $\mathbf{LW}_{2,1}$ is the layer weight matrix between layer 2 and layer 1; $\mathbf{LW}_{2,2}$ is the internal feedback weight matrix in layer 2; $\mathbf{LW}_{3,2}$ is the matrix of the weights of the output layer; and $\mathbf{LW}_{3,3}$ is the matrix of the internal feedback weights in the output layer. The vectors $\mathbf{b}^{1}$, $\mathbf{b}^{2}$, and $\mathbf{b}^{3}$ contain the bias values for layers 1 and 2 and the output layer, respectively. In particular, the vector $\bar{u}(t)$ is built from the input vector at time $t$ and the consecutive tapped input delays, while each vector $\bar{y}^{\ell}(t)$ is built from the output of layer ℓ delayed by up to $d_r$ (or $d_o$) steps and fed back to itself.
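As an illustration, a single forward step of this two-hidden-layer structure can be sketched in Python/NumPy (a simplified sketch assuming one tap per feedback connection; the dictionary keys and function name are hypothetical, and the actual experiments were run in MATLAB):

```python
import numpy as np

def rnn_step(u_bar, y1_bar, y2_bar, y3_bar, W, b):
    """One time step of the two-hidden-layer RNN described above.

    u_bar            : tapped-delayed input vector
    y1_bar..y3_bar   : delayed outputs fed back from layers 1-3
    W, b             : dicts holding the weight matrices and bias vectors
    """
    n1 = (W["IW"] @ u_bar           # delayed process inputs
          + W["LW11"] @ y1_bar      # internal feedback of layer 1
          + W["LW12"] @ y2_bar      # external feedback from layer 2
          + W["LW13"] @ y3_bar      # external feedback from the output layer
          + b["b1"])
    y1 = np.tanh(n1)                # hyperbolic tangent in hidden layer 1
    n2 = W["LW21"] @ y1 + W["LW22"] @ y2_bar + b["b2"]
    y2 = np.tanh(n2)                # hyperbolic tangent in hidden layer 2
    n3 = W["LW32"] @ y2 + W["LW33"] @ y3_bar + b["b3"]
    return y1, y2, n3               # linear activation in the output layer
```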
The RNNs are here trained with both the Levenberg-Marquardt (LM) algorithm [72] and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [73].
However, standard RNNs have difficulties in learning long-term dependencies because they are easily affected by the vanishing or exploding gradient problem [74]. This issue occurs when the gradient becomes vanishingly small, to the point of preventing the weights from changing value, or, vice versa, when it increases exponentially, making the derivatives diverge.
3.2. Long Short-Term Memory Network
LSTM networks have been introduced as a variant of standard RNNs to deal with such issues [75]. The basic hidden units of RNNs are replaced with LSTM units, enabling the network to handle the vanishing and exploding gradient problem when learning long-term dependencies [76]. LSTM units consist of memory cells and three nonlinear gates that selectively retain current information that is relevant and forget past information that is not. This type of network is mostly used in language modeling, time series prediction, speech recognition, and video analysis [77,78,79,80]. An LSTM unit is shown in Figure 3.
Given a time instant $t$, the state of the unit consists of the hidden (or output) state $h_t$, which contains the output for that time instant, and the cell state $c_t$, which contains information learned from previous time instants. They are computed using $h_{t-1}$ and $c_{t-1}$ from the previous time step. At each time step, $c_t$ is updated by adding or removing information through the gates. The blocks that form the LSTM unit and control the next state are the following:
Forget gate (f), that controls the level of cell state reset;
Cell candidate (g), that adds information to cell state;
Input gate (i), that controls the level of cell state update;
Output gate (o), that controls the level of cell state added to the hidden state.
The following are the learnable parameters of an LSTM layer:
Input weights: $W = [W_i;\, W_f;\, W_g;\, W_o]$;
Recurrent weights: $R = [R_i;\, R_f;\, R_g;\, R_o]$;
Biases: $b = [b_i;\, b_f;\, b_g;\, b_o]$.
The states of the blocks of the LSTM unit at the time instant $t$ can be written as:
$$
\begin{aligned}
f_t &= \sigma\!\left(W_f x_t + R_f h_{t-1} + b_f\right),\\
g_t &= \tanh\!\left(W_g x_t + R_g h_{t-1} + b_g\right),\\
i_t &= \sigma\!\left(W_i x_t + R_i h_{t-1} + b_i\right),\\
o_t &= \sigma\!\left(W_o x_t + R_o h_{t-1} + b_o\right),
\end{aligned}
$$
where $\sigma(x) = (1 + e^{-x})^{-1}$ denotes the sigmoid function, and $\tanh$ the hyperbolic tangent function.
The cell state $c_t$ and the output state $h_t$ at each time instant are updated as:
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t),$$
where ⊙ denotes the Hadamard product, the pointwise multiplication operator for two vectors.
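For illustration, the gate and state equations can be collected into a single time-step function. The following NumPy sketch is not the MATLAB implementation used in the experiments, and the dictionary keys are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, R, b):
    """One LSTM time step; W, R, and b hold the input weights, recurrent
    weights, and biases of each gate, keyed by the gate names above."""
    f = sigmoid(W["f"] @ x_t + R["f"] @ h_prev + b["f"])  # forget gate
    g = np.tanh(W["g"] @ x_t + R["g"] @ h_prev + b["g"])  # cell candidate
    i = sigmoid(W["i"] @ x_t + R["i"] @ h_prev + b["i"])  # input gate
    o = sigmoid(W["o"] @ x_t + R["o"] @ h_prev + b["o"])  # output gate
    c = f * c_prev + i * g      # cell state update (Hadamard products)
    h = o * np.tanh(c)          # hidden (output) state
    return h, c
```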
Given the learning rate $\alpha$, the standard SGD algorithm updates the network parameters $\theta$ (weights and biases) to minimize the loss function $E(\theta)$ by taking small steps at each iteration $k$ in the direction of the negative gradient of the loss as follows:
$$\theta_{k+1} = \theta_k - \alpha \nabla E(\theta_k).$$
SGDM adds a momentum term to reduce the possible oscillation along the path of steepest descent towards the optimum:
$$\theta_{k+1} = \theta_k - \alpha \nabla E(\theta_k) + \gamma\left(\theta_k - \theta_{k-1}\right).$$
The term $\gamma$ determines the contribution of the previous gradient step to the current iteration. Even though the Adam optimizer [81] is more computationally efficient, SGDM showed better performance in our applications.
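In implementation terms, the SGDM update is usually written with a velocity variable that stores the previous parameter update $\theta_k - \theta_{k-1}$. The following Python sketch (illustrative names) is equivalent to the formula above:

```python
import numpy as np

def sgdm_update(theta, grad, velocity, lr=0.01, momentum=0.9):
    """One SGDM step. Since `velocity` equals the previous update
    theta_k - theta_{k-1}, this reproduces
    theta_{k+1} = theta_k - lr * grad + momentum * (theta_k - theta_{k-1})."""
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity
```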
3.3. Model Description
In this section, some notation and technical details used in the remainder of the paper are reported.
Let us denote the model implemented by a generic network (i.e., RNN or LSTM) as $\mathcal{M}(\mathbf{w}, \mathbf{h})$, where $\mathbf{w}$ is a vector containing all the network weights and biases, and $\mathbf{h}$ contains all the model hyperparameters, thus describing the network structure. In the case of the RNN, the hyperparameters are the number of neurons in each hidden layer and the input, internal, and output recurrent connection delays introduced in Section 3.1.
The LSTM model training involves the optimization of the following hyperparameters:
Number of hidden units in the LSTM layer;
Number of hidden neurons in the fully connected layer;
Dropout probability value.
Dropout, a technique to prevent over-fitting in deep neural networks, has been applied. It consists of randomly disconnecting a certain percentage of neurons during training by setting their outgoing edges to 0 at each epoch. In this way, at each update during the training phase, each neuron has a probability of being dropped out and missing the training step [82]. Among the available hyperparameter search strategies, a grid search approach was preferred in both the RNN and LSTM cases.
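A grid search of this kind can be sketched as follows; the search ranges are hypothetical, and `build_and_eval` stands for a user-supplied routine that trains a model with the given hyperparameters and returns its validation correlation coefficient:

```python
import itertools

def grid_search(build_and_eval, grid):
    """Exhaustive search over all hyperparameter combinations in `grid`,
    keeping the combination with the highest validation CC."""
    best_cc, best_hp = float("-inf"), None
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        hp = dict(zip(keys, values))
        cc = build_and_eval(**hp)
        if cc > best_cc:
            best_cc, best_hp = cc, hp
    return best_hp, best_cc

# Hypothetical search ranges for the LSTM hyperparameters listed above:
lstm_grid = {
    "num_hidden_units": [10, 20, 50],
    "fc_neurons": [5, 10, 20],
    "dropout_prob": [0.1, 0.3, 0.5],
}
```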
Model performances were evaluated through the correlation coefficient (CC), mean absolute error (MAE), and root mean square error (RMSE) between the actual output and the predicted output over the test data as follows:
$$\mathrm{CC} = \frac{\operatorname{cov}(Y, \hat{Y})}{\sigma_Y\, \sigma_{\hat{Y}}}, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2},$$
where $\operatorname{cov}(\cdot,\cdot)$ is the covariance, $\sigma$ the standard deviation, $N$ the number of samples, $y_i$ and $\hat{y}_i$ the actual and predicted output samples, and $Y$ and $\hat{Y}$ the corresponding vectors. To select the optimal SS and to compare the different methodologies reported here, the CC is considered.
The experiments were performed on a laptop with an Intel i5 @2.4GHz CPU and 8 GB RAM. The software environment is Mathworks MATLAB 2020a on Windows 10 Pro 64-bit.
6. Conclusions
The proposed work investigates the problem of dynamical model transferability in developing SSs for industrial applications. Two different model transferability solutions were investigated. The first strategy (i.e., the fine-tuned transferred model, FTTM) consists of adopting the SS designed for the source domain after a fine-tuning of the network weights on the target domain. The second strategy (i.e., the transferred hyperparameters model, THM) is based on adopting the optimal hyperparameters identified for the source dataset to train a new SS on the target dataset. Both techniques have been implemented by using, as SS structures, two dynamical neural networks: an RNN and an LSTM.
The RNN-based SS showed better performance than the LSTM network in developing the optimal model, with a comparable computational effort. The obtained results showed that the FTTM reached the best performance in terms of the correlation coefficient between the estimated SS outputs and the measured ones. In regard to the comparison between the two ML approaches, the LSTM showed a greater capability of maintaining, after the transfer process, performance similar to that of the full optimization procedure. Another relevant achievement of the proposed procedures is the possibility of coping with the problem of labeled data scarcity: in the case of a limited labeled dataset, the LSTM showed the best transfer capabilities with the FTTM, with respect to the RNN-based model. The results obtained with hyperparameter transfer are instead not satisfactory, confirming that the FTTM is more suitable as a TL procedure.
The proposed application has shown the suitability of TL procedures in the field of SSs for industrial processes, where dynamical nonlinear models are of interest. This is a relevant achievement in the SS research field, since it allows one to greatly reduce the computational complexity of SS design. In detail, the use of TL made it possible to leave out the time-consuming phases of model-class and order selection and of hyperparameter optimization. Another relevant aspect of the proposed procedures, with respect to those reported in the literature, is their simplicity: the FTTM and THM algorithms can, in fact, be implemented directly by industrial technicians, without the help of a system identification or ML expert. A set of labeled data in the target domain is, however, required.
The proposed TL procedures preserve the knowledge of the structure of the model dynamics from the source process to the target one. If the sensors available on the target process differ from those installed on the source one, this characteristic should still guarantee good performance, provided that the measured quantity variables are strictly related in the two domains. In the proposed application, the suitability of a TL approach was assured by the knowledge of the experts, who assessed the similarity between the two considered processes. In more general cases, this is not always easy to establish. A limitation of the proposed approaches consists, therefore, in the lack of a procedure able to assess the possibility of successfully applying TL to a given process by looking only at the available datasets. Further research will be devoted to the introduction of proper metrics able to quantify the distribution distance between the source and target domains. This preliminary analysis should be able to guarantee the applicability of TL and, eventually, estimate the expected performance.