For the time-series forecasting of turbomachinery performance, missing data are commonly encountered due to sensor failure or human error, and they greatly degrade the quality of time-series modeling and prediction. Therefore, it is important to fill in the missing time-series data before forecasting. Turbomachinery operation datasets, meanwhile, often have the following characteristics: noisy time series, non-significant periodicity, and high, correlated dimensionality (multi-task). Hence, we need to integrate multiple methods, including data denoising, dimensionality reduction, normalization, imputation, and prediction. The flowchart in Figure 1 presents the process for time-series forecasting using datasets collected from turbomachinery. Suppose that we have collected scarce time-series data (the target task) regarding the turbomachinery performance indicator of interest, and that related, sufficient time-series datasets (auxiliary tasks) collected from other performance indicators are available. The proposed MT-LSTM first performs data denoising using the discrete wavelet packet transform (DWPT), followed by dimensionality reduction via probabilistic principal component analysis (PPCA). Thereafter, the MT-LSTM conducts the data imputation of the target task through the MTGP. Finally, an LSTM model is built upon the imputed data to forecast the future values of the target task. The following sections elaborate on the key ingredients of the proposed MT-LSTM model.
2.1. Data Denoising
In the complex working environment of turbomachinery, sensor time-series data usually contain a wide range of defects, for example, noise, errors, and incoherence. Hence, it is necessary to remove the noise and uncertainty from the data before modeling. In past decades, scholars have proposed many methods to remove noise from sensor data; however, few studies have addressed multiple defects simultaneously. In this paper, we propose a Bayesian wavelet denoising approach and use multiscale wavelet analysis to examine the signal context at multiple resolutions.
Specifically, we adopt the discrete wavelet packet transform (DWPT) to denoise and multiscale-analyze the original time series. In DWPT, a given one-dimensional time-series signal, $x(t)$, with $N$ historical data points, is simultaneously decomposed into a sequence of scaling coefficients, $c_{j,k}$, and wavelet coefficients, $d_{j,k}$. Then, we can use the inverse wavelet transform and the DWPT coefficients to reconstruct the time series as follows:

\hat{x}(t) = \sum_{k} c_{J,k}\,\phi_{J,k}(t) + \sum_{j=1}^{J} \sum_{k} d_{j,k}\,\psi_{j,k}(t)    (1)

where $\hat{x}(t)$ is the reconstructed time series; $k$ and $j$ are the time and frequency indices, respectively; $\phi_{J,k}(t)$ is the scaling function; $\psi_{j,k}(t)$ is the wavelet function; and the double summation means that the wavelet subspace and scaling subspace are simultaneously decomposed into second-level subspaces, which reveal the time and frequency resolution of the signal.
For the benefit of generality, we let $d_{j,k}$ denote the $k$-th decomposed coefficient at the $j$-th level ($j = 0, 1, \ldots, J - 1$; $k = 0, 1, \ldots, N/2^{j} - 1$). According to the Bayes rule, the posterior distribution of the cleaned coefficients, $\theta_{j,k}$, can be expressed as

\theta_{j,k} \mid d_{j,k}, \gamma_{j,k} \sim \gamma_{j,k}\,\mathcal{N}\!\left(\frac{\tau_j^2}{\sigma^2 + \tau_j^2}\, d_{j,k},\ \frac{\sigma^2 \tau_j^2}{\sigma^2 + \tau_j^2}\right) + (1 - \gamma_{j,k})\,\delta_0    (2)

where the symbol ~ represents a distribution; $\mathcal{N}(\mu, \sigma^2)$ denotes a normal distribution, where $\mu$ is the mean and $\sigma^2$ is the variance; $d_{j,k}$ denotes the decomposed coefficient, whose distribution is $d_{j,k} \sim \mathcal{N}(\theta_{j,k}, \sigma^2)$; $\delta_0$ denotes a point mass at zero; $\gamma_{j,k}$ is a binary random variable with independent Bernoulli distribution, i.e., $P(\gamma_{j,k} = 1) = 1 - P(\gamma_{j,k} = 0) = p_j$, which decides whether the coefficients $\theta_{j,k}$ are zero ($\gamma_{j,k} = 0$) or nonzero ($\gamma_{j,k} = 1$); and the variance $\tau_j^2$ denotes the magnitude of $\theta_{j,k}$ at the $j$-th decomposition level. For all the coefficients in the $j$-th level, $p_j$ and $\tau_j^2$ may be assigned the same values; $\sigma$ is the noise standard deviation, which is estimated from the wavelet coefficients of the first-level DWPT decomposition by dividing the median of their absolute values by the factor 0.6745.
Then, the Bayesian hypothesis test is applied to threshold the decomposed coefficients by determining whether to accept the null hypothesis, $H_0$: $\theta_{j,k} = 0$. The thresholding rule is defined as follows:

\hat{\theta}_{j,k} = \frac{\tau_j^2}{\sigma^2 + \tau_j^2}\, d_{j,k}\, I(\eta_{j,k} < 1)    (3)

where $\eta_{j,k} = P(\gamma_{j,k} = 0 \mid d_{j,k}) / P(\gamma_{j,k} = 1 \mid d_{j,k})$ is the posterior odds ratio of $\gamma_{j,k} = 0$ versus $\gamma_{j,k} = 1$, and $I(\cdot)$ denotes an indicator function. When $\eta_{j,k} < 1$, $I(\cdot)$ is equal to unity; thus, $H_0$ is rejected and the posterior mean $\hat{\theta}_{j,k}$ estimates the coefficient $\theta_{j,k}$. If not, $I(\cdot)$ is equal to zero, and the coefficient is removed. The cleaned time series $\hat{x}(t)$ based on Equation (1) is then reconstructed by using the acquired coefficients $\hat{\theta}_{j,k}$ from Equation (3).
In summary, we can decompose the time-series data into various levels of wavelet coefficients through the DWPT analysis. Then, a Bayesian hypothesis test is carried out on each level of wavelet coefficients to eliminate possible defects. To avoid over-denoising, the posterior-odds-ratio Bayesian test offers a simple criterion for deciding whether the decomposed coefficients contain any defects. Meanwhile, the DWPT provides more coefficients representing additional subtle details of the time-series signal, thus avoiding under-denoising.
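To make the procedure concrete, the following Python sketch couples a PyWavelets packet decomposition with the posterior-odds thresholding of Equations (2) and (3). It is a minimal illustration, not the exact implementation used here: the prior probability p and the moment-based estimate of $\tau_j^2$ are assumptions introduced for the example.

```python
import numpy as np
import pywt
from scipy.stats import norm

def bayes_wavelet_denoise(x, wavelet="db4", level=3, p=0.5):
    """Denoise a 1-D series by DWPT decomposition plus posterior-odds
    thresholding; p and the moment estimate of tau^2 are assumptions."""
    wp = pywt.WaveletPacket(data=x, wavelet=wavelet, mode="symmetric",
                            maxlevel=level)
    # Robust noise estimate from the first-level detail coefficients:
    # sigma = median(|d_{1,k}|) / 0.6745 (the MAD rule).
    sigma = max(np.median(np.abs(wp["d"].data)) / 0.6745, 1e-12)

    for node in wp.get_level(level, "natural"):
        d = node.data
        # Per-level signal variance tau^2 (method of moments, floored at 0).
        tau2 = max(np.mean(d ** 2) - sigma ** 2, 0.0)
        if tau2 == 0.0:
            node.data = np.zeros_like(d)   # level judged to be pure noise
            continue
        # Log posterior odds of H0 (theta = 0) versus H1 (theta != 0);
        # a negative value rejects H0, cf. Equation (3).
        log_odds = (np.log((1 - p) / p)
                    + norm.logpdf(d, 0.0, sigma)
                    - norm.logpdf(d, 0.0, np.sqrt(sigma ** 2 + tau2)))
        shrink = tau2 / (sigma ** 2 + tau2)   # posterior-mean shrinkage
        node.data = np.where(log_odds < 0, shrink * d, 0.0)
    return wp.reconstruct(update=True)[: len(x)]
```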
2.2. Dimensionality Reduction
In the analysis of the multivariate time-series data of turbomachinery, owing to the correlation between variables and the inherent uncertainty of the data, an excessive number of variable dimensions increases the computational cost and the analysis complexity. After cleaning the multivariate time-series data, the PPCA method is used mainly to (1) reduce the data dimensionality, (2) decouple the multivariate correlation, and (3) filter the data uncertainty. The method plays an important role in improving the accuracy of the subsequent data analysis and reducing the computation time.
Let $\mathbf{x}$ represent a $d$-dimensional real vector and let $\mathbf{X}$ represent the $n \times d$ data matrix: there are $d$ variables, and each variable contains $n$ cleaned time-series data points, $\hat{x}(t)$. Let $\mathbf{Z}$ represent the $n \times q$ latent data matrix and $\mathbf{z}$ represent $q$ unobservable latent variables (factors), whose rows contain the corresponding positions in the latent space. All $\mathbf{z}$ are usually defined as independently distributed Gaussian variables obeying zero mean and unit variance, i.e., $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The latent variable model combines the measurable variables $\mathbf{x}$ in the related data matrix $\mathbf{X}$ with the corresponding uncorrelated latent variable matrix $\mathbf{Z}$. Then, the form of the Gaussian distribution is

\mathbf{x} \mid \mathbf{z} \sim \mathcal{N}\!\left(\mathbf{W}\mathbf{z} + \boldsymbol{\mu},\ \sigma^2 \mathbf{I}\right)    (4)

where $\sigma^2 \mathbf{I}$ is the isotropic noise covariance, the $d \times q$ weight matrix $\mathbf{W}$ represents the relation between $\mathbf{x}$ and $\mathbf{z}$, and the $d$ mean values that are acquired from $\mathbf{X}$ form the parameter vector $\boldsymbol{\mu}$, i.e., $\boldsymbol{\mu} = \bar{\mathbf{x}}$. In the lower-dimensional latent space, the reduced data $\hat{\mathbf{z}}$ can be expressed as

\hat{\mathbf{z}} = \mathbf{M}^{-1} \mathbf{W}_{\mathrm{ML}}^{\top} (\mathbf{x} - \boldsymbol{\mu})    (5)

where $\mathbf{M} = \mathbf{W}_{\mathrm{ML}}^{\top} \mathbf{W}_{\mathrm{ML}} + \sigma_{\mathrm{ML}}^2 \mathbf{I}$, and $\mathbf{W}_{\mathrm{ML}}$ and $\sigma_{\mathrm{ML}}^2$ are the maximum-likelihood estimates of $\mathbf{W}$ and $\sigma^2$. Equation (5) is the mean of the posterior distribution $p(\mathbf{z} \mid \mathbf{x})$. The $q$-dimensional principal component (PC) matrix $\mathbf{Z}$ is used in the following data-driven modeling.
2.3. MTGP Model
In the data analysis of turbomachinery, missing data are often caused by damaged sensors, insufficient equipment maintenance, power failures, the aging of equipment hardware and software, communication interruptions, and equipment recently put into operation. Data imputation and augmentation effectively improve the size and quality of the training data and are therefore crucial to the performance of deep learning models for time-series data. We explore a method for time-series data imputation using the multi-task GP (MTGP) model.
We assume a given dataset $\mathcal{T} = \{x, y\}$, where $x = \{x_1, \ldots, x_n\}$ is specified as the index set, such as the observed times, and $y = \{y_1, \ldots, y_n\}$ contains the corresponding values of the observed training data. The purpose is to learn a regression model $y = f(x) + \varepsilon$, where $f(x)$ is a latent function and $\varepsilon$ is a noise term. Given the test indices $x_*$, we can predict the unseen test observations $y_*$.
Gaussian processes (GP) are useful for time-series modeling [30] because they can accommodate irregular and uncertain observations and provide probabilistic predictions. A GP model supposes that $f(x)$ can be expressed as a probability distribution over functions:

f(x) \sim \mathcal{GP}\!\left(m(x),\ k(x, x')\right)    (6)

where $m(x)$ is the mean function and $k(x, x')$ is a covariance function describing the coupling between the inputs. The prior knowledge of the function behavior that we want to model can be encoded by modifying the covariance function. For the covariance of the vector $x$, the size of the covariance matrix $\mathbf{K}$ is $n \times n$, and the covariance function provides its elements, $K_{ij} = k(x_i, x_j)$. The covariance function usually satisfies Mercer's theorem; that is, $\mathbf{K}$ must be symmetric and positive semidefinite. Affine transformations of basic covariance functions can be used to construct complex covariance functions.

For a given set $\mathcal{T}$, we can obtain the predictions at the test indices $x_*$ by calculating the conditional distribution $p(y_* \mid x_*, \mathcal{T})$, which is a Gaussian distribution:

y_* \mid x_*, \mathcal{T} \sim \mathcal{N}(\mu_*, \Sigma_*)    (7)

where $\mu_*$ is the mean and $\Sigma_*$ is the variance. Without loss of generality, it is usually assumed that the mean function $m(x)$ is zero. Thus, $\mu_*$ and $\Sigma_*$ can be expressed as

\mu_* = K(x_*, x)\,[K(x, x) + \sigma_n^2 \mathbf{I}]^{-1}\, y    (8)

\Sigma_* = K(x_*, x_*) - K(x_*, x)\,[K(x, x) + \sigma_n^2 \mathbf{I}]^{-1}\, K(x, x_*)    (9)
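Equations (8) and (9) translate directly into a few lines of linear algebra, as sketched below. The squared-exponential covariance and the fixed noise level are illustrative assumptions, not choices prescribed by the method.

```python
import numpy as np

def rbf_kernel(a, b, ell=1.0, sf2=1.0):
    """Squared-exponential covariance k(x, x') for 1-D index vectors."""
    sq = (a[:, None] - b[None, :]) ** 2
    return sf2 * np.exp(-0.5 * sq / ell ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-2):
    """Closed-form GP posterior, Equations (8) and (9), zero mean assumed."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_train, x_test)            # K(x, x_*)
    Kss = rbf_kernel(x_test, x_test)            # K(x_*, x_*)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks.T @ alpha                         # Equation (8)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)   # Equation (9)
    return mean, cov
```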
Furthermore, we assume that $x = \{x^{(1)}, \ldots, x^{(m)}\}$ are the training indices and $y = \{y^{(1)}, \ldots, y^{(m)}\}$ are the observations for $m$ tasks, where task $l$ has $n_l$ training points. In an actual situation, we may encounter a task-specific dataset whose number of training indices is far lower than those of the other tasks; this scarcity is precisely the challenge for the imputation problem.
The problem of modeling $m$ tasks simultaneously (e.g., multiple time series) motivates the extension of the GP to the MTGP model, where all tasks share the same index set $x$. Training an STGP (single-task GP) model for each task separately is a straightforward method, as shown in Figure 2a. For the modeling of $m$ tasks via MTGP, to reveal the affiliation of index $x_i$ and observation $y_i$ to task $l$, we have to add a label $l_i \in \{1, \ldots, m\}$ to the model as an additional input, as illustrated in Figure 2b.
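This label augmentation can be sketched as follows: each training index is paired with its task label before entering the covariance function. The three tasks, their sampling times, and the scarcity of task 0 are hypothetical values chosen for illustration.

```python
import numpy as np

# Three tasks with task-specific numbers of observations; task 0
# (the target task) is deliberately scarce.
x0 = np.array([0.0, 4.0, 9.0])           # scarce target task
x1 = np.linspace(0.0, 10.0, 50)          # auxiliary task 1
x2 = np.linspace(0.0, 10.0, 60)          # auxiliary task 2

# Stack (index, label) pairs into one augmented input matrix.
X_aug = np.vstack([
    np.column_stack([x, np.full(len(x), l)])
    for l, x in enumerate([x0, x1, x2])
])
print(X_aug.shape)                        # (113, 2): index plus task label
```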
In MTGP, two independent covariance functions, $k^t$ and $k^x$, can be assumed, such that

k\big((x, l), (x', l')\big) = k^t(l, l')\, k^x(x, x')    (10)

where $k^t$ denotes the correlation between tasks and only depends on the labels $l$ and $l'$, and $k^x$ denotes the temporal covariance function within a task and only depends on the indices $x$ and $x'$. Let $x^{(l)} = x$ for $l = 1, \ldots, m$; the covariance matrix $\mathbf{K}$ can then be expressed as

\mathbf{K} = \mathbf{K}^t(l, \theta^t) \otimes \mathbf{K}^x(x, \theta^x)    (11)

where $\theta = \{\theta^t, \theta^x\}$; $\otimes$ is the Kronecker product; $\theta^t$ and $\theta^x$ are vectors including the hyperparameters for $k^t$ and $k^x$; $\mathbf{K}^t$ has a size of $m \times m$; and $\mathbf{K}^x$ has a size of $n \times n$. Thus, the matrix $\mathbf{K}$ has a size of $mn \times mn$. We name $\mathbf{K}^t$ the correlation matrix. In practice, to adapt the model to the usage of task-specific training data, we can relax the assumption $x^{(l)} = x$ for $l = 1, \ldots, m$.
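A minimal construction of the joint covariance of Equation (11) is sketched below. The task correlation matrix $\mathbf{K}^t$ is fixed by hand purely for illustration; in practice, its entries are hyperparameters learned from the data (e.g., through a Cholesky parameterization), and the squared-exponential temporal kernel is likewise an assumed choice.

```python
import numpy as np

def mtgp_covariance(x, Kt, ell=1.0, sf2=1.0):
    """Builds K = K^t (Kronecker) K^x of Equation (11) for m tasks
    sharing the same index set x."""
    sq = (x[:, None] - x[None, :]) ** 2
    Kx = sf2 * np.exp(-0.5 * sq / ell ** 2)   # n-by-n temporal covariance
    return np.kron(Kt, Kx)                    # mn-by-mn joint covariance

# Example: 3 strongly correlated tasks observed at 100 shared time stamps.
x = np.linspace(0.0, 10.0, 100)
Kt = np.array([[1.0, 0.8, 0.6],
               [0.8, 1.0, 0.7],
               [0.6, 0.7, 1.0]])
K = mtgp_covariance(x, Kt)                    # shape (300, 300)
```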
This model has some useful features: the training indices may be task-specific (i.e., training data may be observed at task-specific times); the correlation between the tasks is learned automatically when fitting the covariance function in Equation (11); and the framework assumes that the tasks have analogous temporal characteristics and therefore share the temporal hyperparameters.
In summary, as MTGP is not a parametric approach, the number of parameters can grow as more observation data are collected. The advantage of the MTGP model over other comparable methods, such as support vector regression (SVR), is that it can easily express prior knowledge of function behavior (e.g., periodicity or smoothness). Given multiple time series, the goal is to increase the overall modeling accuracy by employing a single MTGP model rather than several STGPs. MTGPs have been used for financial time series [31], environmental sensor networks [32], and the classification of ECG signals [33]. It is important to note that MTGPs can include nonuniformly sampled time series in a single model without further downsampling or interpolation. In addition, it is easy to include prior knowledge between time series, such as time shifts or assumptions about similar temporal behavior. Thus, the method has many applications, such as prediction, correlation analysis, the estimation of time shifts between signals, and the compensation of missing data.
2.4. LSTM Model
It is very important to predict the operation trend for turbomachinery reliability. Recurrent neural networks (RNN) can predict trends quickly and accurately. However, the traditional RNN suffers from vanishing or exploding gradients and cannot capture long-term dependence, which limits its practical application [34]. To solve this problem, we established a prediction model based on long short-term memory (LSTM). The LSTM is an improvement on the RNN [35] and is composed of multiple recursively connected recurrent units. Three gate structures, i.e., the forget gate, input gate, and output gate, constitute the recurrent unit of the LSTM [23]. The advantages of LSTM are mainly embodied in the following aspects: (1) the three gating structures can learn long-term memory information and provide a solution to long-term dependence; and (2) the sigmoid and tanh functions are combined to form the activation functions in LSTM, which keeps the gradient almost unchanged during backpropagation, avoiding vanishing or exploding gradients and greatly accelerating the convergence of the model.
The LSTM neural network accepts a variety of input forms. Here, we consider univariate time-series data. We suppose that observations are made at uniformly distributed discrete time points and that $x_t$ is the variable observed at time point $t$. These random variables serve as the input of the LSTM neural network.
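One common (though not the only) way to arrange a univariate series into LSTM training samples is a sliding-window scheme, sketched below under the assumption of uniform sampling; the lookback length is a hypothetical choice.

```python
import numpy as np

def make_windows(series, lookback):
    """Shape a univariate series into (samples, lookback, 1) inputs and
    next-step targets for LSTM training."""
    X = np.stack([series[i:i + lookback]
                  for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X[..., None], y   # add a feature axis for the LSTM input

series = np.sin(np.linspace(0.0, 20.0, 200))
X, y = make_windows(series, lookback=12)
print(X.shape, y.shape)      # (188, 12, 1) (188,)
```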
According to Figure 3, at time $t$, the LSTM receives three inputs: the present input of the network, $x_t$; the previous output, $h_{t-1}$; and the previous cell state, $c_{t-1}$. At the same time, the LSTM has two outputs: the output $h_t$ and the cell state $c_t$. Note that $x_t$, $h_t$, and $c_t$ are all vectors. The LSTM has three gate-like structures that filter the information passing through [36,37].
The forget gate controls how much information of the previous cell state $c_{t-1}$ is passed to the cell state $c_t$ at the present time:

f_t = \sigma\big(\mathbf{W}_f \cdot [h_{t-1}, x_t] + b_f\big)    (12)

where $\sigma(\cdot)$ is the sigmoid activation function, $\mathbf{W}_f$ is the weight matrix of the forget gate, and $b_f$ is the bias.
The input gate controls how much information of the input $x_t$ is passed to the cell state $c_t$ at the present time (i.e., it is used to update the cell state):

i_t = \sigma\big(\mathbf{W}_i \cdot [h_{t-1}, x_t] + b_i\big)    (13)

\tilde{c}_t = \tanh\big(\mathbf{W}_c \cdot [h_{t-1}, x_t] + b_c\big)    (14)

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t    (15)

where $i_t$ is the input gate, $\mathbf{W}_i$ is the weight matrix of the input gate, and $b_i$ is its bias; $\tilde{c}_t$ is the input (candidate) cell state, $\mathbf{W}_c$ is the weight matrix of the cell state, $b_c$ is the bias of the cell state, and $\odot$ denotes element-wise multiplication.
The output gate determines how much information of the cell state $c_t$ is passed to the present output $h_t$:

o_t = \sigma\big(\mathbf{W}_o \cdot [h_{t-1}, x_t] + b_o\big)    (16)

h_t = o_t \odot \tanh(c_t)    (17)

where $o_t$ is the output gate, $\mathbf{W}_o$ is the weight matrix of the output gate, and $b_o$ is the bias. The final output of the LSTM is thus determined jointly by the output gate and the cell state.
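A single forward step of this recurrent unit can be written compactly by stacking the pre-activations of the four branches into one affine map, as in the minimal NumPy sketch below; it illustrates Equations (12)-(17) only and is not the trained model used later in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step implementing Equations (12)-(17). W maps the
    concatenated [h_{t-1}, x_t] to the stacked pre-activations of the
    forget, input, candidate, and output branches; b is the bias."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.size
    f = sigmoid(z[0 * H:1 * H])         # forget gate, Eq. (12)
    i = sigmoid(z[1 * H:2 * H])         # input gate, Eq. (13)
    c_tilde = np.tanh(z[2 * H:3 * H])   # candidate cell state, Eq. (14)
    o = sigmoid(z[3 * H:4 * H])         # output gate, Eq. (16)
    c_t = f * c_prev + i * c_tilde      # cell-state update, Eq. (15)
    h_t = o * np.tanh(c_t)              # output, Eq. (17)
    return h_t, c_t

# Example usage with hidden size 8 and a scalar input.
rng = np.random.default_rng(0)
H, D = 8, 1
W, b = rng.normal(size=(4 * H, H + D)) * 0.1, np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(np.array([0.5]), h, c, W, b)
```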