1. Introduction
Soft sensors are inferential models used for the online prediction of quality variables [1,2,3]. Often, the online measurement of such variables is difficult due to high costs or the lack of a physical sensor. In such cases, the quality variables are measured by means of laboratory analysis: an operator collects a sample during production and sends it away for analysis. Meanwhile, the operator cannot act on the process in response to the most recent sample, whose result is not yet known, leading to possible production loss, or at least quality loss. Once the analysis result returns, the operator can make a decision, but it might be too late. The goal of soft sensor technology is to solve this issue. It builds a data-driven model that relates the operational process variables to the quality variable. By doing so, it allows the online inference of the quality variables, allowing operators to get earlier information about the quality of the process and to take corrective actions on time. The common steps to deploy a soft sensor are: data cleaning/synchronization [4], feature selection [5], model learning/validation [2], and model maintenance [6]. During the learning phase, it is beneficial to take the process properties into consideration as much as possible. For example, in the case of multimode or multiphase processes, such information can be used in the modeling efforts. Multimode operational characteristics exist due to external factors, such as changes in feedstock, production, operating conditions, or the external environment. Multiphase characteristics are commonly found in batch processes, where a batch cycle of production comprises a series of phases, each with its own characteristics [7]. Several authors advocate incorporating these multiple operating characteristics into the modeling phase. From now on, each mode/phase of a multimode-multiphase process will be referred to as an operating mode.
The modeling of multimode-multiphase processes for quality prediction comprises two main steps. The first step is the characterization of the operating modes, which can be done manually [8] or be inferred from data [9]. The second step is the learning of the models for each mode. In [10], the authors derived and proposed the use of a mixture of probabilistic principal component regression (MPPCR) for multimode data. In [11], a two-phase partial least squares (PLS) modeling approach was proposed: in the first phase, a separate intra-phase PLS model is learned for each phase, and in the second phase a series of inter-phase PLS models is learned to capture the relationships among different phases. In [12], in a case study for Mooney viscosity prediction in a rubber-mixing process, the authors employed a fuzzy C-means clustering algorithm to cluster the samples into different subclasses, and then trained single Gaussian process regression (GPR) models for each subclass. In [7], the main idea was to learn individual PLS models for each phase and each mode, with the distinction between modes and phases made manually by experts. Following this, in [13] the authors successfully implemented a Gaussian mixture regression (GMR) model to handle the multimode data in a penicillin process. In [14], the authors incorporated the PLS model into the Gaussian mixture model (GMM) framework for quality prediction and fault detection in an industrial hot strip mill process. In [15], the authors extended the GMM framework to its semi-supervised form for dealing with incomplete data and multimode processes. Other approaches model multimode processes using just-in-time learning (JITL) [16,17]. In [18], the authors proposed a robust GMR based on the Student's-t distribution for dealing with noise and outliers in the data.
This paper proposes a mixture of experts (MoE) methodology with the following two characteristics: (1) the characterization of the different modes and (2) the learning of the models, both performed in a single unified manner. MoE consists of a set of experts and gates, and is applied for modeling heterogeneous or multimode data. The experts are assigned to the modes, while the gates define the boundaries between the experts.
Figure 1 illustrates the MoE.
MoE was introduced in [19,20], where the experts were neural network models and the gates were modeled by a softmax function. Since then, several extensions and variants of the MoE model have been proposed (see the review paper [21]). A variant of MoE is the mixture of linear experts (MoLE), wherein the experts are multivariate linear models. MoLE has the universal approximation property, works for nonlinear modeling, is interpretable, and can automatically infer the different modes. All these characteristics make MoLE suitable for modeling multimode industrial data. However, the estimation of MoLE is infeasible in the presence of collinearity or a small number of samples. Moreover, MoLE cannot handle irrelevant variables or perform feature selection. However, these are common requirements when dealing with industrial data. To address the collinearity issue, in [9], the authors proposed the use of partial least squares (PLS) for modeling the gates and experts, defining the Mixture of Partial Least Squares Experts (Mix-PLS), which has been successfully applied to two industrial multimode-multiphase processes. In a short paper, Ref. [22] reported results of MoLE with an elastic net penalty for modeling a multiphase polymerization process.
Beyond the industrial context, authors have applied regularization to MoLE models [23,24,25,26] to perform feature selection and to allow learning with a small number of samples. The regularized MoLE methods reported in the literature use the $\ell_1$ penalty, also known as the least absolute shrinkage and selection operator (Lasso) [27]; the $\ell_2$ penalty, also known as ridge regression (RR) or Tikhonov regularization [28]; a compromise between the $\ell_1$ and $\ell_2$ norms [29], also known as the elastic net (EN); and the smoothly clipped absolute deviation (Scad) [30]. In [23], the authors used an RR penalty for the gates and Lasso and Scad penalties for the experts in MoLE. They reported successful results, both in performance and in finding parsimonious models, and compared the results with an RR-regularized linear model. Similarly, Ref. [24] applied the Lasso penalty to gates and experts for classification problems; their regularized MoLE performed better than ordinary MoLE and state-of-the-art classifiers. In [26], the authors employed the Lasso penalty for the experts and the EN penalty for the gates, and reported better results than ordinary MoLE. In [31], the authors employed a proximal-Newton expectation maximization (EM) to avoid instability while learning the gates. In [32], the authors discuss theoretical aspects of the use of Lasso in MoLE. All results report regularization as a viable approach to improve the performance of MoLE when dealing with irrelevant variables and a small number of samples.
The goal of this paper is to assess the performance of regularized MoLE models for quality prediction in multimode-multiphase processes. For this purpose, a regularized version of MoLE based on the EN penalty has been derived, which has a flexible regularization form. From it, this paper derives three regularized MoLE models, referred to as MoLE-Lasso, MoLE-RR, and MoLE-EN. In addition, a set of experiments with all regularized MoLE models and the Mix-PLS [9] was run and analyzed. The experiments were run on real multiphase polymerization data, for predicting two quality variables, with a total of 31 batches. The performance of the MoLE models with respect to the number of batches used for training was evaluated. The results show that MoLE-Lasso gives the most stable results, even when learning with only a few batches. On the other hand, the Mix-PLS tends to perform better as the number of batches increases. Finally, all regularized MoLE models were also compared to different state-of-the-art models. The results show that the regularized MoLE is a valid choice for modeling multimode process data.
The paper is organized as follows. Section 2 presents the proposed regularized MoLE. Section 3 presents the experimental results. Finally, Section 4 gives concluding remarks.
2. Regularized MoLE
In this section, the regularized MoLE is derived. First, an introduction to MoLE is given. Afterwards, the derivation of the regularized MoLE and its learning procedure are presented.
2.1. Notation
In this paper, finite random variables are represented by capital letters and their values by the corresponding lowercase letters, e.g., random variable A and corresponding value a. Matrices and vectors are represented by boldface capital letters, e.g., $\mathbf{A}$, and boldface lowercase letters, e.g., $\mathbf{a}$, respectively. The input and output/target variables are defined as $\mathbf{X}$ and Y, respectively. The input variable can take n different values, $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, and similarly Y takes the values $\{y_1, \ldots, y_n\}$.
2.2. Definition
The MoLE follows the divide-and-conquer principle. In the learning phase, the input space is divided into soft regions and a linear "expert" is learned for each region, while the gates assign soft boundaries to the experts' domains. The output for a new data sample is given by the weighted combination of the experts' outputs, where the new sample is assigned to a specific region or to an overlap of regions. The MoLE output is given by

$$\hat{y} = \sum_{p=1}^{P} g_p(\mathbf{x}, \mathbf{v}_p) \, f_p(\mathbf{x}, \mathbf{w}_p), \qquad (1)$$

where $\mathbf{x}$ is the vector of input variables, P is the total number of experts, $f_p(\mathbf{x}, \mathbf{w}_p)$ is the output of expert p, and $g_p(\mathbf{x}, \mathbf{v}_p)$ is the gate output for expert p. The gates assign mixture proportions to the experts, and satisfy the constraints $0 \le g_p(\mathbf{x}, \mathbf{v}_p) \le 1$ and $\sum_{p=1}^{P} g_p(\mathbf{x}, \mathbf{v}_p) = 1$. From now on, $g_p(\mathbf{x}, \mathbf{v}_p)$ and $f_p(\mathbf{x}, \mathbf{w}_p)$ are denoted by their shortened versions $g_p$ and $f_p$, respectively. Expert p has the linear form $f_p = \mathbf{w}_p^{\mathsf{T}} \mathbf{x}$, where $\mathbf{w}_p$ is the vector of regression coefficients of linear expert p; the bias has been omitted to simplify the derivation. For the gate, the following softmax function is employed:

$$g_p = \frac{\exp\!\left( \mathbf{v}_p^{\mathsf{T}} \mathbf{x} \right)}{\sum_{l=1}^{P} \exp\!\left( \mathbf{v}_l^{\mathsf{T}} \mathbf{x} \right)}, \qquad (2)$$

where $\mathbf{v}_p$ is the gate parameter vector of expert p. This softmax function satisfies the required gate constraints. The MoLE formulation fits multimode-multiphase processes well: each mode can be modeled by an expert p, while the gates define the boundaries between the different modes. MoLE can infer the modes from data, or can use expert knowledge to define the number of modes. It has a very flexible format and established learning algorithms. However, MoLE models cannot deal with collinearity or irrelevant variables. The next section discusses how to apply regularization to MoLE models.
2.3. Formulation
The MoLE approximates the true probability density function (PDF), $p(y \mid \mathbf{x})$, by a superposition of PDFs:

$$p(y \mid \mathbf{x}; \boldsymbol{\theta}) = \sum_{p=1}^{P} g_p \, p(y \mid \mathbf{x}; \boldsymbol{\theta}_p), \qquad (3)$$

where $p(y \mid \mathbf{x}; \boldsymbol{\theta}_p)$ is the conditional PDF of expert p, governed by the parameters $\boldsymbol{\theta}_p = \{\mathbf{w}_p, \sigma_p^2\}$, where $\sigma_p^2$ is the variance of the error of expert p, which is assumed to be zero mean. The collection of gate parameters is represented by $\mathbf{V} = \{\mathbf{v}_1, \ldots, \mathbf{v}_P\}$. The collection of all parameters is defined as $\boldsymbol{\theta} = \{\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_P, \mathbf{V}\}$.
The direct estimation of $\boldsymbol{\theta}$ by maximum likelihood is not feasible in closed form. Instead, the expectation maximization (EM) algorithm is commonly employed [33]. The EM algorithm uses a two-step iterative procedure to perform the likelihood maximization of MoLE. In the first step, called the expectation step (E-Step), the expectation of the log-likelihood (ELL) is evaluated, while in the second step, called the maximization step (M-Step), new parameters are determined by maximum likelihood estimation from the ELL. Therefore, the maximization of the likelihood in EM is achieved through successive improvements of the ELL; see [33] for further details on the application of the EM algorithm.
Since the EM is an iterative procedure, the superscript t is used to indicate the t-th iteration of the EM algorithm, e.g., $\boldsymbol{\theta}^{(t)}$ is the vector of estimated MoLE parameters at iteration t. The ELL at the t-th E-Step, $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$, is given by

$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \underbrace{\sum_{i=1}^{n} \sum_{p=1}^{P} r_{ip}^{(t)} \log p(y_i \mid \mathbf{x}_i; \boldsymbol{\theta}_p)}_{Q_{\mathbf{w}}} + \underbrace{\sum_{i=1}^{n} \sum_{p=1}^{P} r_{ip}^{(t)} \log g_p(\mathbf{x}_i, \mathbf{v}_p)}_{Q_{\mathbf{V}}}, \qquad (4)$$

where $Q_{\mathbf{w}}$ and $Q_{\mathbf{V}}$ account for the expert and gate contributions to the ELL, respectively, and

$$r_{ip}^{(t)} = \frac{g_p(\mathbf{x}_i, \mathbf{v}_p^{(t)}) \, p(y_i \mid \mathbf{x}_i; \boldsymbol{\theta}_p^{(t)})}{\sum_{l=1}^{P} g_l(\mathbf{x}_i, \mathbf{v}_l^{(t)}) \, p(y_i \mid \mathbf{x}_i; \boldsymbol{\theta}_l^{(t)})} \qquad (5)$$

is the responsibility, which accounts for the probability of sample $\mathbf{x}_i$ belonging to the region covered by expert p; from now on, $r_{ip}^{(t)}$ will also be referred to as $r_{ip}$. The convergence of the ELL can be measured by computing the ELL difference $\Delta Q^{(t)}$ between two consecutive iterations. The complete derivation of the ELL, and its relation to the log-likelihood of (3), can be found in ([9], Section 4).
In the M-Step, the objective is to estimate the new parameters $\boldsymbol{\theta}^{(t+1)}$ that maximize Equation (4). This is stated as the following optimization problem:

$$\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} \; Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}). \qquad (6)$$

The EM guarantees that the ELL, and consequently the log-likelihood of Equation (3), increases monotonically. The EM algorithm runs until the convergence of the ELL, where the convergence detection condition is defined as $\Delta Q^{(t)} < \varepsilon$, where $\varepsilon$ is a convergence threshold.
The algorithmic solution is shown in Algorithm 1.

Algorithm 1 Expectation maximization (EM) for mixture of linear experts (MoLE) learning.

1: procedure MoLE($\varepsilon$)
2:    Initialize $\boldsymbol{\theta}^{(0)}$, $t \leftarrow 0$, Done ← FALSE
3:    while not Done do
4:        E-Step: compute the responsibilities $r_{ip}^{(t)}$, using Equation (5).
5:        if $t > 0$ then
6:            if $\Delta Q^{(t)} < \varepsilon$ then ▹ Check convergence.
7:                Done ← TRUE
8:            end if
9:        end if
10:       M-Step: compute $\boldsymbol{\theta}^{(t+1)}$ by solving Equation (6). ▹ Find new parameters.
11:       $t \leftarrow t + 1$
12:   end while
13:   return $\boldsymbol{\theta}^{(t)}$
14: end procedure

As input, the MoLE learning procedure receives the convergence threshold $\varepsilon$, and it outputs the learned parameters $\boldsymbol{\theta}$.
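As an illustration of the E-Step, the following sketch (an assumption of this text rather than the authors' code; it assumes Gaussian expert noise and uses NumPy/SciPy, with illustrative function names) computes the responsibilities of Equation (5) for all samples at once:

```python
import numpy as np
from scipy.stats import norm

def responsibilities(X, y, W, V, sigma2):
    """E-Step: probability of each sample belonging to the region of each expert.

    X      : (n, d) inputs,  y : (n,) targets
    W, V   : (P, d) expert and gate parameter matrices
    sigma2 : (P,)   error variance of each expert
    """
    scores = X @ V.T                                    # (n, P) gate activations
    G = np.exp(scores - scores.max(axis=1, keepdims=True))
    G /= G.sum(axis=1, keepdims=True)                   # softmax gates g_p(x_i)
    lik = norm.pdf(y[:, None], loc=X @ W.T,             # Gaussian likelihood of each expert
                   scale=np.sqrt(sigma2)[None, :])
    R = G * lik                                         # numerator of Equation (5)
    return R / R.sum(axis=1, keepdims=True)             # normalize over experts
```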
2.4. Regularization on Experts
The next step is to solve the maximization (6) for the experts. The experts' contribution to the ELL, $Q_{\mathbf{w}}$, can be further decomposed to account for the contribution of each expert separately, as follows:

$$Q_{\mathbf{w}} = \sum_{p=1}^{P} Q_{\mathbf{w}_p}, \quad \text{with} \quad Q_{\mathbf{w}_p} = \sum_{i=1}^{n} r_{ip} \log p(y_i \mid \mathbf{x}_i; \boldsymbol{\theta}_p),$$

where $Q_{\mathbf{w}_p}$ is the individual contribution of expert p to the ELL. The new vector of expert coefficients $\mathbf{w}_p^{(t+1)}$ is the one that maximizes $Q_{\mathbf{w}_p}$, which is the solution of the following weighted least squares problem:

$$\mathbf{w}_p^{(t+1)} = \arg\min_{\mathbf{w}_p} \sum_{i=1}^{n} r_{ip} \left( y_i - \mathbf{w}_p^{\mathsf{T}} \mathbf{x}_i \right)^2 = \left( \mathbf{X}^{\mathsf{T}} \mathbf{R}_p \mathbf{X} \right)^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{R}_p \mathbf{y},$$

where $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^{\mathsf{T}}$, $\mathbf{y} = [y_1, \ldots, y_n]^{\mathsf{T}}$, and $\mathbf{R}_p = \operatorname{diag}(r_{1p}, \ldots, r_{np})$ is the matrix of responsibilities of expert p. However, in the presence of collinearity, the inverse $(\mathbf{X}^{\mathsf{T}} \mathbf{R}_p \mathbf{X})^{-1}$ becomes ill conditioned. To overcome this situation, an EN regularization term is added to penalize the loss function:

$$\mathbf{w}_p^{(t+1)} = \arg\min_{\mathbf{w}_p} \sum_{i=1}^{n} r_{ip} \left( y_i - \mathbf{w}_p^{\mathsf{T}} \mathbf{x}_i \right)^2 + \lambda_p P_{\alpha}(\mathbf{w}_p), \qquad (11)$$

where $\lambda_p$ is the regularization parameter that controls the sparsity of the solution, and $P_{\alpha}(\mathbf{w}_p) = \alpha \|\mathbf{w}_p\|_1 + \frac{1-\alpha}{2} \|\mathbf{w}_p\|_2^2$ is the EN penalty. The EN regularization allows the use of the Lasso penalty when $\alpha = 1$, the RR penalty when $\alpha = 0$, or the EN penalty when $0 < \alpha < 1$ (a trade-off between Lasso and RR). The Lasso penalty is known to promote sparse solutions by shrinking the regression coefficients towards zero, being adequate when dealing with a large number of inputs. However, the Lasso penalty does not consider the group effect, i.e., for correlated features, it will tend to select one input while shrinking the coefficients of the others to zero. The RR penalty alleviates the ill-conditioned nature of $\mathbf{X}^{\mathsf{T}} \mathbf{R}_p \mathbf{X}$ by adding a regularization factor to it. In RR, all coefficients contribute to the prediction. The EN penalty integrates the benefits of both: it provides sparse solutions (Lasso penalty) while handling the group effect (RR penalty).
The error variance update becomes the solution of the following maximization problem:

$$\sigma_p^{2\,(t+1)} = \arg\max_{\sigma_p^{2}} \; Q_{\mathbf{w}_p},$$

which is equal to

$$\sigma_p^{2\,(t+1)} = \frac{\sum_{i=1}^{n} r_{ip} \left( y_i - \mathbf{w}_p^{(t+1)\,\mathsf{T}} \mathbf{x}_i \right)^2}{\sum_{i=1}^{n} r_{ip}},$$

where the updated $\mathbf{w}_p^{(t+1)}$ is used to compute the updated variance term.
The optimization problem in Equation (11) can be solved using the coordinate gradient descent algorithm described in [34]. The coordinate gradient descent algorithm minimizes the loss function one coordinate at a time. It converges to the optimal value if the loss function is convex and its non-differentiable part is separable across coordinates, conditions that hold for the loss function (11). The updated coefficient of variable j and expert p equals

$$w_{pj} \leftarrow \frac{S\!\left( \sum_{i=1}^{n} r_{ip}\, x_{ij} \left( y_i - \tilde{f}_p^{(j)}(\mathbf{x}_i) \right),\; \lambda_p \alpha \right)}{\sum_{i=1}^{n} r_{ip}\, x_{ij}^2 + \lambda_p (1 - \alpha)},$$

where $\tilde{f}_p^{(j)}(\mathbf{x}_i) = \sum_{l \neq j} w_{pl} x_{il}$ is the fitted value of local expert p without the contribution of variable j, and $S(\cdot, \cdot)$ is the soft-threshold operator, given by

$$S(z, \gamma) = \operatorname{sign}(z)\,(|z| - \gamma)_{+} = \begin{cases} z - \gamma, & z > \gamma \\ z + \gamma, & z < -\gamma \\ 0, & |z| \leq \gamma. \end{cases}$$

Further details on the derivation of EN learning by coordinate gradient descent can be found in [35]. For the experiments, the glmnet package [34] has been used. It is a computationally efficient procedure that uses cyclical coordinate descent, computed along a regularization path, to solve the EN problem.
Another issue is how to select a proper value of $\lambda_p$, which controls the overfitting and the sparsity of the solution. For this purpose, the Bayesian information criterion (BIC) is adopted, which measures the trade-off between accuracy and complexity and has the following format:

$$\mathrm{BIC}_p(\lambda_p) = -2 \sum_{i=1}^{n} r_{ip} \log p(y_i \mid \mathbf{x}_i; \boldsymbol{\theta}_p) + \mathrm{df}_p \log(n_p),$$

where $n_p = \sum_{i=1}^{n} r_{ip}$ is the number of effective samples of expert p, and $\mathrm{df}_p$ is the number of degrees of freedom of expert p, which is equal to the number of non-zero elements in $\mathbf{w}_p$. The BIC has a tendency to penalize complex models, due to the $\log(n_p)$ multiplicative factor, giving preference to simpler models. Thus, the selected value of $\lambda_p$ is the one that minimizes the value of $\mathrm{BIC}_p(\lambda_p)$.
2.5. Regularization on Gates
In the experts' update, the EN regularization was easily added to penalize the experts' parameters (Section 2.4). During the gates' learning, on the other hand, the application of the regularization term is not as explicit. The contribution of the gates to the ELL, $Q_{\mathbf{V}}$, is given by

$$Q_{\mathbf{V}} = \sum_{i=1}^{n} \sum_{p=1}^{P} r_{ip} \log g_p(\mathbf{x}_i, \mathbf{v}_p). \qquad (17)$$

The solution for the new parameters $\mathbf{v}_p^{(t+1)}$ by direct maximization of Equation (17) is not straightforward. Instead, the iterative re-weighted least squares (IRLS) method is employed. The IRLS algorithm works in the following way. First, define the auxiliary variable $\mathbf{v}_p^{(k)}$ as the auxiliary gate parameter at the k-th iteration of the IRLS algorithm. It is updated as follows:

$$\mathbf{v}_p^{(k+1)} = \left( \mathbf{X}^{\mathsf{T}} \mathbf{H}_p^{(k)} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{H}_p^{(k)} \mathbf{z}_p^{(k)}, \qquad (18)$$

where $\mathbf{H}_p^{(k)} = \operatorname{diag}\!\left( g_p(\mathbf{x}_i, \mathbf{v}_p^{(k)}) \left( 1 - g_p(\mathbf{x}_i, \mathbf{v}_p^{(k)}) \right) \right)$ is the diagonal matrix of IRLS weights, and $\mathbf{z}_p^{(k)} = \mathbf{X} \mathbf{v}_p^{(k)} + \left( \mathbf{H}_p^{(k)} \right)^{-1} \left( \mathbf{r}_p - \mathbf{g}_p^{(k)} \right)$ is the working response built from the responsibilities $\mathbf{r}_p = [r_{1p}, \ldots, r_{np}]^{\mathsf{T}}$ and the current gate outputs $\mathbf{g}_p^{(k)}$. The IRLS algorithm runs until convergence; the employed convergence detection condition is a sufficiently small change in the gate parameters between consecutive iterations. At the end of the algorithm, the new gate parameter is updated as $\mathbf{v}_p^{(t+1)} = \mathbf{v}_p^{(K)}$, where K is the last iteration of the IRLS algorithm. In practice, few iterations K are necessary in the IRLS algorithm. The IRLS solution can become unstable due to the ill-conditioned inverse $(\mathbf{X}^{\mathsf{T}} \mathbf{H}_p^{(k)} \mathbf{X})^{-1}$. Many authors have proposed alternatives to IRLS to overcome this problem, such as the proximal-Newton EM in [31], which avoids the matrix inversion in the gates' update. Here, a regularization term is added, which works well in practice. Specifically, the term $\tau \mathbf{I}$ is added such that the inverse becomes $(\mathbf{X}^{\mathsf{T}} \mathbf{H}_p^{(k)} \mathbf{X} + \tau \mathbf{I})^{-1}$. In the experiments, a small value of $\tau$ is used.
However, similarly to the experts, the solution for the gates at each iteration of the IRLS becomes ill conditioned in the presence of collinearity. Thus, through the closed-form solution of the inner loops of the IRLS algorithm in Equation (18), the results derived for the experts are mimicked for the gates. With this modification, the value of $\mathbf{v}_p^{(k+1)}$ to be found at the k-th IRLS iteration, with the EN regularization added, becomes

$$\mathbf{v}_p^{(k+1)} = \arg\min_{\mathbf{v}_p} \sum_{i=1}^{n} h_{ip}^{(k)} \left( z_{ip}^{(k)} - \mathbf{v}_p^{\mathsf{T}} \mathbf{x}_i \right)^2 + \lambda_p^{g} P_{\alpha}(\mathbf{v}_p), \qquad (19)$$

where $h_{ip}^{(k)}$ and $z_{ip}^{(k)}$ are the i-th IRLS weight and working response of gate p, and $\lambda_p^{g}$ is the regularization parameter of gate p. Then, at each iteration of the IRLS, an EN penalty is added to the loss function; in total, $P \times K$ EN problems are solved at each M-Step. In a way similar to the case of the experts, the solution of (19) can be obtained by using the coordinate gradient descent algorithm described in [34]. In that case,

$$v_{pj} \leftarrow \frac{S\!\left( \sum_{i=1}^{n} h_{ip}^{(k)}\, x_{ij} \left( z_{ip}^{(k)} - \tilde{g}_p^{(j)}(\mathbf{x}_i) \right),\; \lambda_p^{g} \alpha \right)}{\sum_{i=1}^{n} h_{ip}^{(k)}\, x_{ij}^2 + \lambda_p^{g} (1 - \alpha)},$$

where $\tilde{g}_p^{(j)}(\mathbf{x}_i) = \sum_{l \neq j} v_{pl} x_{il}$ is the fitted value of gate p without the contribution of variable j.
The major issue here is to find the most appropriate $\lambda_p^{g}$ for each gate p at each iteration of the IRLS algorithm, where the value of $\lambda_p^{g}$ controls the sparsity of the solution. For this purpose, the BIC is adopted, which measures the trade-off between accuracy and complexity and, for the gates, has the following format:

$$\mathrm{BIC}_p^{g}(\lambda_p^{g}) = -2 \sum_{i=1}^{n} r_{ip} \log g_p(\mathbf{x}_i, \mathbf{v}_p) + \mathrm{df}_p^{g} \log(n_p),$$

where $n_p = \sum_{i=1}^{n} r_{ip}$ is the number of effective samples of gate p, and $\mathrm{df}_p^{g}$ is the number of degrees of freedom of gate p, which is equal to the number of non-zero elements in $\mathbf{v}_p$. Thus, the selected value of $\lambda_p^{g}$ is the one that minimizes the BIC. The IRLS algorithm with the EN penalty is described in Algorithm 2.
Algorithm 2 Iterative re-weighted least squares (IRLS) with elastic net (EN) regularization.

1: procedure IRLS-EN($\mathbf{X}$, $\mathbf{r}_p$, $\mathbf{v}_p^{(t)}$, $\alpha$, $\tau$)
2:    Initialize $\mathbf{v}_p^{(0)} \leftarrow \mathbf{v}_p^{(t)}$, $k \leftarrow 0$, Done ← FALSE
3:    while not Done do
4:        Compute the IRLS weights $\mathbf{H}_p^{(k)}$ and the working response $\mathbf{z}_p^{(k)}$.
5:        Select $\lambda_p^{g}$ by the BIC and compute $\mathbf{v}_p^{(k+1)}$ by solving the EN problem (19).
6:        if the change in $\mathbf{v}_p^{(k+1)}$ is below the convergence threshold then
7:            Done ← TRUE
8:        end if
9:        $k \leftarrow k + 1$
10:   end while
11:   return $\mathbf{v}_p^{(t+1)} \leftarrow \mathbf{v}_p^{(k)}$ ▹ t: EM index; k: IRLS index.
12: end procedure
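A compact sketch of the update of one gate is shown below. It follows the one-vs-rest working-response form written above and again substitutes scikit-learn's `ElasticNet` for glmnet; the function names, the weight clipping, and the fixed regularization strength are simplifications assumed for illustration (in the full procedure, $\lambda_p^{g}$ would be selected by the BIC at each inner iteration):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def softmax_gates(X, V):
    """Softmax gate outputs for all samples and experts, shape (n, P)."""
    scores = X @ V.T
    G = np.exp(scores - scores.max(axis=1, keepdims=True))
    return G / G.sum(axis=1, keepdims=True)

def irls_en_gate(X, R, V, p, lam, alpha, tol=1e-4, max_iter=20):
    """IRLS update of the p-th gate parameter vector with an EN penalty.

    X : (n, d) inputs,  R : (n, P) responsibilities,  V : (P, d) gate parameters
    """
    V = V.copy()
    for _ in range(max_iter):
        g_p = softmax_gates(X, V)[:, p]                # current gate outputs
        h = np.clip(g_p * (1.0 - g_p), 1e-6, None)     # IRLS weights, kept away from zero
        z = X @ V[p] + (R[:, p] - g_p) / h             # working response
        model = ElasticNet(alpha=lam, l1_ratio=alpha,
                           fit_intercept=False, max_iter=10000)
        model.fit(X, z, sample_weight=h)               # weighted EN solve of the inner loop
        v_new = model.coef_
        converged = np.linalg.norm(v_new - V[p]) < tol
        V[p] = v_new
        if converged:                                  # IRLS convergence check
            break
    return V[p]
```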
2.6. Model Selection and Stop Condition
In ordinary MoLE learning, the ELL is employed as the measure of convergence. Here, in addition to the ELL, the BIC criterion is employed to select the best MoLE architecture along the EM iterations. In that case, the selected parameters are the ones for which the BIC is minimal, instead of the ELL. For this purpose, at each EM iteration, the BIC criterion is computed:

$$\mathrm{BIC}^{(t)} = -2 \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i; \boldsymbol{\theta}^{(t)}) + \mathrm{df}^{(t)} \log(n),$$

where $\mathrm{df}^{(t)}$ is the total number of non-zero parameters in the experts and gates at iteration t. Smaller values of the BIC mean better models. The BIC will increase as the complexity of the MoLE architecture increases. This allows the selection of less complex architectures and helps to overcome the problem of overfitting in the prediction phase. The ELL measures the convergence of the MoLE, while the BIC is considered the criterion to select the best model. Here, the selection of the number of experts is not a concern; it is assumed that this information is known a priori and comes from the process to be modeled. However, this criterion can also be employed for that model selection task. In [36], there is a short discussion on different model selection approaches for MoE.
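The resulting stop/selection logic can be summarized by the following sketch; the helpers `em_step` and `compute_bic` are hypothetical wrappers around the E-Step, the regularized M-Step, and the BIC criterion described above:

```python
import numpy as np

def train_mole(X, y, theta0, eps=1e-6, max_iter=100):
    """Run EM until the ELL converges; keep the iterate with the smallest BIC."""
    theta, ell_prev = theta0, -np.inf
    best_theta, best_bic = theta0, np.inf
    for _ in range(max_iter):
        theta, ell = em_step(X, y, theta)      # hypothetical: one E-Step + regularized M-Step
        bic = compute_bic(X, y, theta)         # hypothetical: BIC criterion of Section 2.6
        if bic < best_bic:                     # keep the best (lowest-BIC) model so far
            best_theta, best_bic = theta, bic
        if abs(ell - ell_prev) < eps:          # ELL convergence check
            break
        ell_prev = ell
    return best_theta
```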
2.7. Different Regularized MoLE
The regularized MoLE learning presented in the previous sections is used to derive three main regularized MoLE models for the experimental part. First, $\alpha = 1$ is considered to derive the MoLE-Lasso; then $0 < \alpha < 1$ is used to derive the MoLE-EN; and $\alpha = 0$ is used to derive the MoLE-RR. The selection of the regularization parameters follows the BIC procedure, as previously discussed. The source code will be made available at the author's github page (www.github.com/faasouza (accessed on 19 February 2021)).
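For instance, if the hypothetical `train_mole` interface sketched above were extended with the EN mixing parameter, the three variants would differ only in that argument (the intermediate value 0.5 is purely illustrative):

```python
# Assuming X, y, and an initial parameter set theta0 are available as before.
mole_lasso = train_mole(X, y, theta0, alpha=1.0)   # pure L1 penalty (Lasso)
mole_en    = train_mole(X, y, theta0, alpha=0.5)   # EN trade-off (illustrative value)
mole_rr    = train_mole(X, y, theta0, alpha=0.0)   # pure L2 penalty (ridge)
```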