Upon completion of data clustering, the labeled clusters are used as input to the LSTM-MLP network to predict abnormal chemical working conditions. The LSTM component captures temporal dependencies in the data, while the MLP component performs a non-linear mapping from input to output.
2.3.1. LSTM
Long short-term memory (LSTM) is a type of recurrent neural network (RNN) that has gained wide popularity in time-series prediction due to its ability to remember patterns over long sequences [50]. By using gates to regulate the flow of information, LSTM models are designed to capture temporal dependencies in data [51]. A typical LSTM network consists of multiple memory blocks, or cells, that enable the memorization of information. The input, forget, and output gates play crucial roles in regulating the flow of information into and out of the cells [52], as shown in Figure 2.
The mathematical notation used in the LSTM model is described as follows: the input at time $t$ is represented by $x_t$, and $h_{t-1}$ represents the previous hidden state. The sigmoid activation function $\sigma$ is used to control the flow of information through the model. $W$ and $U$ are matrices representing the weights associated with the input and hidden states, respectively, while $b$ is a vector representing the bias term. The operator $\odot$ denotes element-wise multiplication, which is used together with the input gate, forget gate, and output gate when updating the cell and hidden states.

In the LSTM cell, the input gate assesses the importance of new information carried by the input data through the equation

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i).$$

Meanwhile, the forget gate determines whether to retain or delete historical information through

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f).$$

Lastly, the output gate determines which information to output:

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o).$$

The cell state and hidden state are then updated element-wise as $c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$ and $h_t = o_t \odot \tanh(c_t)$.
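As a concrete illustration of the gate equations above, the following minimal NumPy sketch implements a single LSTM step; the names `W`, `U`, and `b` follow the notation in the text, and the dictionary layout and shapes are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. For each gate k in {'i', 'f', 'o', 'c'}:
    W[k]: (hidden, input), U[k]: (hidden, hidden), b[k]: (hidden,)."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate state
    c_t = f_t * c_prev + i_t * c_tilde  # element-wise (odot) cell-state update
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t
```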
2.3.2. Bi-LSTM
The long short-term memory (LSTM) neural network is a unidirectional network used for long- and short-term pattern recognition in time-series data. In contrast, the bidirectional LSTM (Bi-LSTM) network trains a forward and a backward LSTM network, which are then combined in the output layer [53]. This approach allows the Bi-LSTM network to capture temporal dependencies from both past and future time steps, resulting in a more comprehensive and complete representation of time-series data than the unidirectional LSTM [54]. A schematic of the Bi-LSTM network structure is presented in Figure 3.
2.3.3. Bi-LSTM-MLP Fusion Method
Data reshaping is a crucial preprocessing step in building an effective Bi-LSTM model. This involves determining the number of LSTM layers, the number of neurons in each layer, and the input and output dimensions of the network. To incorporate anomalous observations after the clustering stage, the cluster-assignment sequence must be reformatted into a structure compatible with the input of the Bi-LSTM-MLP model. A sliding-window approach is employed to create fixed-length sequences, where the length of the window corresponds to the number of time steps in the input sequence. For instance, if the length of the input sequence is set to 10, the first input sequence contains the cluster assignments of data points 1–10, the second contains those of data points 2–11, and so forth. The reshaped data are then represented as a three-dimensional tensor with dimensions $(N, T, K)$, where $N$ denotes the number of input sequences, $T$ is the length of each input sequence, and $K$ is the number of clusters. A suitable partitioning ratio is employed to divide the input data into training, testing, and validation sets, depending on the size of the dataset and the complexity of the model. The reshaping is carried out using three key parameters: sample size, time step, and number of features, with the reshaping function varying by use case, as sketched below.
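A minimal sketch of this reshaping step, assuming one-hot encoded cluster assignments; the helper name `reshape_with_sliding_window` is hypothetical.

```python
import numpy as np

def reshape_with_sliding_window(assignments, window, n_clusters):
    """Turn a 1-D cluster-assignment sequence into an (N, T, K) tensor:
    N overlapping windows, T time steps each, K one-hot cluster features."""
    onehot = np.eye(n_clusters)[assignments]              # (len, K)
    windows = [onehot[i:i + window]
               for i in range(len(assignments) - window + 1)]
    return np.stack(windows)                              # (N, T, K)

# With window = 10, the first sample covers points 1-10, the next 2-11, etc.
labels = np.array([0, 2, 1, 1, 0, 2, 0, 1, 2, 0, 1, 2])
X = reshape_with_sliding_window(labels, window=10, n_clusters=3)
print(X.shape)  # (3, 10, 3)
```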
In this study, a Bi-LSTM-MLP model is proposed for processing the reshaped data. Specifically, the input sequence is first passed through the Bi-LSTM layer in a bidirectional manner to capture the temporal dependence between cluster assignments. The output of this layer is a sequence of hidden states $H$ with dimensions $(T, 2u)$, where $u$ is the number of hidden cells in the Bi-LSTM layer.
The hidden state for each time step is calculated as follows. The input sequence is processed in a bidirectional manner by the forward and backward LSTM cells, denoted $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$, respectively. At timestep $t$, the input $x_t$ is used to compute the forward and backward hidden states $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, as well as the forward and backward cell states $\overrightarrow{c}_t$ and $\overleftarrow{c}_t$. The output hidden state for timestep $t$ is obtained by concatenating the forward and backward hidden states as $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$, where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation.
Following the Bi-LSTM layer, the output hidden states are passed to a multilayer perceptron (MLP) layer for mapping to the output space. The MLP layer, comprising one or more fully connected layers activated with functions such as the sigmoid, generates a vector $y$ of dimension $m$, where $m$ represents the number of output units in the MLP layer.

At each timestep $t$, the output is determined by first flattening the hidden state $h_t$ to obtain a one-dimensional vector of dimension $2u$, where $2u$ corresponds to the concatenated forward and backward hidden state size. The flattened hidden state is then passed through the MLP layer with appropriate weights and biases, yielding the output at timestep $t$, which can be represented as $y_t = \mathrm{MLP}(h_t)$. The MLP layer is instantiated with the necessary weights and biases before being used in the model architecture.
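A compact PyTorch sketch of this architecture is given below. PyTorch itself is an assumption (the text does not name a framework), as is feeding only the final timestep's concatenated hidden state to the MLP; layer sizes are placeholders for values determined experimentally.

```python
import torch
import torch.nn as nn

class BiLSTMMLP(nn.Module):
    """Bi-LSTM layer followed by an MLP head, as described in the text."""
    def __init__(self, n_clusters, hidden_units, mlp_units, n_outputs):
        super().__init__()
        # bidirectional=True concatenates forward/backward states -> 2*hidden_units
        self.bilstm = nn.LSTM(input_size=n_clusters, hidden_size=hidden_units,
                              batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_units, mlp_units),
            nn.Sigmoid(),                       # sigmoid activation, per the text
            nn.Linear(mlp_units, n_outputs),
        )

    def forward(self, x):              # x: (N, T, K) one-hot cluster windows
        h, _ = self.bilstm(x)          # h: (N, T, 2*hidden_units)
        return self.mlp(h[:, -1, :])   # map the last hidden state to the outputs
```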
Within the MLP, from the input layer through the hidden layers, the output of each layer is used as the input to the next, and each layer follows the usual calculation rule: the product of the weights and the inputs plus the bias, passed through an activation, i.e., $a^{(l)} = f(W^{(l)} a^{(l-1)} + b^{(l)})$. The number of layers itself is determined through experiments.
The loss function plays a crucial role in training the Bi-LSTM-MLP model. After obtaining the output vector $y$ by passing the hidden states through the MLP layer, the model compares it to the true labels and computes the loss. The choice of loss function depends on the task at hand: mean squared error for regression or cross-entropy loss for classification. The loss is denoted as $\mathcal{L}(\hat{y}, y)$, where $\hat{y}$ is the prediction and $y$ the true label.
To train the model, an appropriate optimization algorithm, such as stochastic gradient descent or the Adam optimizer, is used to minimize the loss function. Unlike model parameters, which are learned from the training data, hyperparameters are set before the model begins learning. They frequently need to be optimized, which involves experimenting with different values before settling on a set that maximizes learning performance. In this study, the model's hyperparameters, including the learning rate, number of hidden units, and number of epochs, are fine-tuned via cross-validation.
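Continuing the sketch above (reusing the `BiLSTMMLP` class), a training loop with the Adam optimizer might look as follows; the synthetic data, batch size, learning rate, and epoch count are illustrative stand-ins for values chosen by cross-validation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 128 windows of 10 steps over 3 one-hot clusters.
X = torch.randn(128, 10, 3)
y = torch.randn(128, 1)
train_loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = BiLSTMMLP(n_clusters=3, hidden_units=64, mlp_units=32, n_outputs=1)
criterion = torch.nn.MSELoss()  # MSE for regression; CrossEntropyLoss for classification
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is a tuned hyperparameter

for epoch in range(20):         # the epoch count is likewise tuned via cross-validation
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)  # compare predictions with true labels
        loss.backward()                  # backpropagate
        optimizer.step()                 # Adam step minimizing the loss
```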
Once trained, the model can be used to make predictions on new data by feeding the encoded cluster assignments through the Bi-LSTM and MLP layers, where the MLP layer produces the predicted labels for the input data. To obtain accurate predictions, it is essential to transform the predicted values back to the original scale using the inverse of the scaling function applied during preprocessing. By doing so, the predictions can be compared to the original data to obtain an accuracy metric.
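For instance, if a scikit-learn `MinMaxScaler` was used during preprocessing (an assumption made for illustration), the rescaling step is simply its `inverse_transform`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_targets = np.array([[2.0], [8.0], [5.0], [11.0]])
scaled = scaler.fit_transform(train_targets)     # scaling applied before training

preds_scaled = np.array([[0.4], [0.9]])          # model outputs on the scaled scale
preds = scaler.inverse_transform(preds_scaled)   # back to the original units
print(preds)                                     # now comparable to the raw data
```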
To assess the proposed model, a held-out test set is utilized to estimate its performance. The effectiveness of the model is then validated through the computation of metrics such as the root mean square error (RMSE),

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}.$$

The RMSE measures the average difference between the predicted and actual values, with the square root taken so that the units of measurement match those of the original values. Because the data are scaled to $[0, 1]$ during preprocessing, the resulting RMSE also ranges between 0 and 1, with 0 being the most favorable outcome.
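A direct NumPy implementation of this metric:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

print(rmse([0.2, 0.5, 0.7], [0.25, 0.45, 0.75]))  # 0.05
```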
Additionally, the trend of the RMSE is used to monitor the working condition of the chemical process. By monitoring fluctuations in the residual RMSE, the operating status of the process can be accurately identified. When the trend exceeds the threshold level or remains at it, it can be inferred that the chemical process has changed and is in an abnormal working condition, as in the sketch below.
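A minimal sketch of this monitoring rule; the residual values and the 0.15 threshold are hypothetical, since in practice the threshold is chosen for the specific process.

```python
import numpy as np

# Residual RMSE computed over consecutive prediction windows (illustrative values).
rmse_trend = np.array([0.04, 0.05, 0.05, 0.18, 0.21])
threshold = 0.15                     # hypothetical threshold for this example

abnormal = rmse_trend >= threshold   # flag windows at or above the threshold
print(abnormal)                      # [False False False  True  True]
```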