Monitoring signals of a mechanical system are mainly determined by the physical model of the system, and they may be affected by equipment degradation, operating conditions, and external environmental factors. Once the status of the equipment changes, for example through normal deterioration or mechanical failure, the signal changes as well. Monitoring signals are digital signals that can be regarded as time series. This section introduces the compression, symbolization, and classification procedures for time series.
2.1. Perceptually Important Point
A time series is a sequence of data points, and each point influences the overall movement shape of the series to a different degree. Some points determine the overall movement shape, while others have little influence and can even be discarded. The Perceptually Important Point (PIP) technique looks for the points that have the greatest influence on the overall movement shape of the time series [
According to the PIP framework, a time series $X = [x_1, x_2, x_3, \ldots, x_n]$ can be represented by a PIP series $P = [p_1, p_2, p_3, \ldots, p_m]$, $m \ll n$. The first two important points, $p_1$ and $p_2$, are the first and last points of the original series, $x_1$ and $x_n$. The next important point, $p_3$, is the remaining point with the maximum impact on the movement shape of the original time series. The influence is measured by the vertical distance from a point to the line connecting its adjacent important points; that is, the third important point is the point with the maximum vertical distance to the line connecting $x_1$ and $x_n$. The fourth important point is the point remaining in $X$ with the maximum vertical distance to the line connecting its adjacent important points, either $p_1$ and $p_3$ or $p_3$ and $p_2$. The process continues until $m$ important points have been located; points identified in earlier iterations are considered more important than points identified later [16].

The process of measuring the distance to adjacent important points is depicted in Figure 1. The curve is a time series of six points, $X = [x_1, x_2, \ldots, x_6]$, whose first and last points are regarded as $p_1$ and $p_2$. First, the slope of the line connecting $p_1$ and $p_2$ is calculated by Equation (1); then the vertical distance $d_i$ from each remaining point to the line connecting its adjacent important points is calculated by Equation (2), and the point with the maximum vertical distance is regarded as $p_3$. When the fourth important point is determined, the vertical distance of a point between $p_1$ and $p_3$ is measured to the line connecting $p_1$ and $p_3$, and the vertical distance of a point between $p_3$ and $p_2$ is measured to the line connecting $p_3$ and $p_2$. After all of the important points are determined, they are rearranged according to their index in the original time series, and the resulting series is the PIP series. The PIP series depicts the movement shape of the original time series and replaces it in the subsequent analysis.
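The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `pip_indices` is hypothetical, the horizontal ordinate is assumed to be the sample index, and ties in distance are broken in favor of the earliest point.

```python
import numpy as np

def pip_indices(y, m):
    """Return the sorted indices of m perceptually important points of y,
    using the vertical distance to the line between adjacent important points."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    if not 2 <= m <= n:
        raise ValueError("m must satisfy 2 <= m <= len(y)")
    x = np.arange(n, dtype=float)      # ASSUMPTION: sample index as horizontal ordinate
    selected = [0, n - 1]              # p1 and p2: first and last points
    while len(selected) < m:
        selected.sort()
        best_idx, best_dist = -1, -1.0
        # Examine every gap between two adjacent important points.
        for left, right in zip(selected[:-1], selected[1:]):
            if right - left < 2:
                continue               # no candidate points inside this gap
            slope = (y[right] - y[left]) / (x[right] - x[left])
            inner = np.arange(left + 1, right)
            line = y[left] + slope * (x[inner] - x[left])
            dist = np.abs(y[inner] - line)     # vertical distance d_i
            k = int(np.argmax(dist))
            if dist[k] > best_dist:
                best_idx, best_dist = int(inner[k]), float(dist[k])
        selected.append(best_idx)
    # Rearranged by index in the original series, as the text describes.
    return sorted(selected)
```

Calling `pip_indices(y, 20)` on a 200-point series would return the positions of a 20-point PIP series; `y[pip_indices(y, 20)]` gives the corresponding values.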
Table 1 shows the scheme of the Perceptually Important Point technique. In Table 1, $(x_i, y_i)$ are the horizontal and vertical ordinates of the points in the original series, and $d_i$ is the vertical distance of the $i$-th point to the line connecting its adjacent important points.
Here, a simulated signal generated in MATLAB is used to validate the performance of the algorithm. Figure 2a shows the waveform of the simulated signal, $x = \sin(4\pi t) + \cos(6\pi t)$, to which 10% Gaussian white noise is added. The series contains 200 points, and the time interval between adjacent points is 0.01 s. The time series is then compressed by the PIP technique; Figure 2b shows the waveform of the resulting PIP series, whose length is 20. As the compression results show, the PIP series clearly depicts the overall movement shape of the simulated signal even when the number of important points is one-tenth of the length of the original time series.
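For readers who want to reproduce a comparable test signal, the following Python sketch generates it with NumPy instead of MATLAB. The function name and the seed are hypothetical, and "10% Gaussian white noise" is interpreted here as noise whose standard deviation is 10% of the standard deviation of the clean signal; the source does not state which definition was used.

```python
import numpy as np

def simulated_signal(n=200, dt=0.01, noise_ratio=0.10, seed=0):
    """Return (t, x) with x = sin(4*pi*t) + cos(6*pi*t) plus Gaussian white noise."""
    t = np.arange(n) * dt
    clean = np.sin(4 * np.pi * t) + np.cos(6 * np.pi * t)
    rng = np.random.default_rng(seed)
    # ASSUMPTION: noise standard deviation = noise_ratio * std of the clean signal.
    noise = rng.normal(0.0, noise_ratio * clean.std(), size=n)
    return t, clean + noise
```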
Time series compression inevitably causes information loss, which has a negative impact on time series analysis. To clarify the applicability of time series compression methods and the influence of the length of the PIP series, this paper proposes a metric of information loss, called the reconstruction error. For each point $x_i$ in the original series there is a corresponding point on the PIP curve (which need not be an important point) with the same horizontal ordinate; its vertical ordinate $c_i$ is obtained by Equation (3). Here, $(x_L, y_L)$ and $(x_R, y_R)$ are the horizontal and vertical ordinates of the important points immediately to the left and right of $x_i$, so the corresponding point of $x_i$ can be expressed as $(x_i, c_i)$. The ratio of deviation $e_i$, calculated by Equation (4), is defined as the local reconstruction error, and the mean of the local reconstruction errors over a time series, obtained by Equation (5), is defined as the reconstruction error. The reconstruction error reflects the residual level between the PIP series and the original series.
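The reconstruction error can be sketched as follows. The corresponding point on the PIP curve is obtained by linear interpolation between the neighboring important points, which is what the description of Equation (3) implies. The normalization of the local error is an assumption: Equation (4) is not reproduced here, so this sketch divides the absolute deviation by the range of the original series. The function name `reconstruction_error` is hypothetical.

```python
import numpy as np

def reconstruction_error(y, pip_idx):
    """Return (E, e): mean and local reconstruction errors of a PIP series.

    y       : original series (vertical ordinates)
    pip_idx : indices of the important points; should include the first
              and last index so that every point lies between two of them
    """
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    pip_idx = np.asarray(sorted(pip_idx), dtype=int)
    # Vertical ordinate c_i of the corresponding point on the PIP curve
    # (linear interpolation between the left and right important points).
    c = np.interp(x, x[pip_idx], y[pip_idx])
    # ASSUMPTION: deviation is normalized by the range of the original series.
    scale = y.max() - y.min()
    if scale == 0.0:
        return 0.0, np.zeros_like(y)
    e = np.abs(y - c) / scale          # local reconstruction error
    return float(e.mean()), e          # E is the mean of the local errors
```

Evaluating `reconstruction_error` for a range of PIP lengths gives a curve of error against length that can be used to choose the compression level.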
The procedure for calculating the reconstruction error is depicted in Figure 3a, where the blue line is the original series and the red line is the waveform of the PIP series. Figure 3b is an enlarged view of the region marked in Figure 3a, where $e_i$ is the local reconstruction error of point $x_i$. After the local reconstruction errors of all points are calculated, the reconstruction error $E$ is obtained by Equation (5). Figure 4 gives the relation between the reconstruction error and the length of the PIP series. The reconstruction error of the simulated signal is extremely high when the PIP series is short. As the length increases, the reconstruction error decreases sharply; once the length exceeds 25, the decrease slows down, and an overly long PIP series may aggravate noise disturbance and reduce computational efficiency. Therefore, in this case, compressing the original signal into a PIP series of 20–30 points is appropriate. How to determine the density of important points in practical applications is discussed in Section 4.
2.2. Time Series Symbolization
Time series symbolization is a transformation of the original time series from the phase space into a symbolic space. It is performed by partitioning the space into a finite set of regions that are labeled with symbols. The procedure helps to reduce the disturbance of environmental factors, facilitate pattern recognition, and increase computational efficiency. The scheme of PIP series symbolization is introduced in this section.
First, the symbolic space is partitioned. The mean value $\mu$ and standard deviation $\sigma$ of the PIP series are calculated, and the number of regions $k$ in the symbolic space is determined. The mean value $\mu$ serves as the center of the symbolic space, and fractiles of the standard deviation serve as region boundaries. The space is partitioned into $k$ regions by a set of fractiles of the standard deviation $F = [f_1, f_2, \ldots, f_k]$, $f_i > f_{i-1} > 0$. Each region is labeled with a symbol, so the number of symbols equals the number of regions $k$, and the symbol set can be expressed as $SY = [sy_1, sy_2, \ldots, sy_k]$. Second, the important points are transformed into symbols according to their location in the symbolic space. Each important point is encoded with the symbol of the region in which it is located, so the PIP series $P = [p_1, p_2, p_3, \ldots, p_m]$ is transformed into the symbolic series $S = [s_1, s_2, s_3, \ldots, s_m]$.
Here, an example illustrates the procedure: the simulated signal of Section 2.1 is transformed into a symbolic series based on the $3\sigma$ criterion. The PIP series of the simulated signal is expressed as $P = [p_1, p_2, \ldots, p_{20}]$. The mean value is $\mu = 0.0696$, the standard deviation is $\sigma = 1.0323$, the number of regions is $k = 3$, the set of fractiles of the standard deviation is $F = [1, 2, 3]$, and the symbol set is $SY = [A, B, C]$. The results are shown in Figure 5, where the red dotted lines are the partition boundaries and the three symmetric regions are labeled A, B, and C. Every point in $P$ is encoded with the symbol of the region in which it is located. For instance, the values of the first and second points are 1.004 and 1.213, which lie on either side of the boundary $\mu + \sigma = 1.1019$, so they are encoded as A and B, respectively. The whole symbolic series obtained is ABAABBCBBBBBAABCBBBB.
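The encoding step of this example can be sketched in Python as follows. The function name `symbolize` is hypothetical, and two details are assumptions not stated in the text: a point lying exactly on a boundary is assigned to the inner region, and a point farther than the last fractile from the mean is assigned to the outermost symbol.

```python
import numpy as np

def symbolize(points, mu, sigma, fractiles=(1.0, 2.0, 3.0), symbols="ABC"):
    """Encode each point by the symmetric sigma band that |point - mu| falls in."""
    # Distance from the center of the symbolic space, in units of sigma.
    dev = np.abs(np.asarray(points, dtype=float) - mu) / sigma
    # Index of the first fractile that is >= dev (boundary goes to the inner region).
    idx = np.searchsorted(fractiles, dev, side="left")
    # ASSUMPTION: points beyond the last fractile get the outermost symbol.
    idx = np.minimum(idx, len(symbols) - 1)
    return "".join(symbols[i] for i in idx)

# First two important points of the worked example in the text.
print(symbolize([1.004, 1.213], mu=0.0696, sigma=1.0323))  # prints "AB"
```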
In practical fault diagnosis, the symbolization scheme has a key influence on classification accuracy. Symbolic series from different operating conditions may differ greatly under one symbolization scheme and be almost identical under another. There are two major symbolization rules, uniform partitioning and the maximum entropy rule, but neither takes advantage of the distribution information of the original series. In practice, determining the partition scheme from the distribution characteristics of the PIP series helps to maximize the difference between the symbolic series of different operating conditions and thus improves classification accuracy. This paper introduces a Genetic Algorithm to search for the optimal partition scheme of the symbolic space, which maximizes the difference between reference series from different operating conditions and reduces the probability of confusion.
Two simulated signals are used to illustrate the search for the optimal scheme. First, they are compressed into PIP series; the probability density curves of their important points are shown in Figure 6. The distributions of the important points are quite different, so this difference must be kept in the symbolization procedure. The Genetic Algorithm is able to achieve this goal.
The symbolic space is divided into six regions. The numbers of important points in each region are denoted $a_1$–$a_6$ and $b_1$–$b_6$, which gives two distribution vectors: $A = [a_1, a_2, \ldots, a_i]$, $B = [b_1, b_2, \ldots, b_i]$, $i = 6$. The Genetic Algorithm is then used to determine the partition nodes, with the reciprocal of the Euclidean distance between the two vectors as the fitness function, shown in Equation (6). A smaller fitness value indicates a bigger difference between the two reference symbolic series. The partition scheme obtained by the Genetic Algorithm is therefore regarded as the one that maximizes the difference between the two reference symbolic series.
Note that if the numbers of the individual symbols are unbalanced, data underflow may occur and computational efficiency may be reduced. Thus, a linear constraint must be added based on the characteristics of the distribution; in this example, the constraint is that the width of each region is no less than 0.5. The algorithm of time series symbolization is shown in Table 2, where Fit is the value of the fitness function.
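The fitness evaluation and the width constraint can be sketched as below; the Genetic Algorithm search itself is omitted, and any optimizer that minimizes `fitness` subject to the constraint could be substituted. All function names are hypothetical. The sketch assumes that six regions are defined by five partition nodes with the two outer regions unbounded, so the width constraint is checked only between neighboring nodes.

```python
import numpy as np

def region_counts(points, nodes):
    """Count the important points falling in each of len(nodes) + 1 regions."""
    nodes = np.asarray(nodes, dtype=float)
    idx = np.searchsorted(nodes, np.asarray(points, dtype=float), side="right")
    return np.bincount(idx, minlength=len(nodes) + 1)

def fitness(points_a, points_b, nodes):
    """Reciprocal of the Euclidean distance between the two distribution vectors.
    A smaller value means the two reference series differ more."""
    a = region_counts(points_a, nodes)
    b = region_counts(points_b, nodes)
    dist = np.linalg.norm(a - b)
    return np.inf if dist == 0 else 1.0 / dist

def satisfies_width_constraint(nodes, min_width=0.5):
    """Linear constraint: every bounded region is at least min_width wide.
    ASSUMPTION: the two outer regions are unbounded and therefore not checked."""
    return bool(np.all(np.diff(nodes) >= min_width))
```

A candidate partition would be accepted only if `satisfies_width_constraint(nodes)` holds, and among the accepted candidates the one with the smallest `fitness` would be kept.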
2.3. Hidden Markov Model
The Hidden Markov Model (HMM) is an effective tool for characterizing a time series and was first introduced and studied in the late 1960s [17]. It has a wide range of applications in speech recognition, economic analysis, and mechanical engineering owing to its solid mathematical foundation and well-developed algorithms [19]. The components of an HMM can be described as follows:
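Hidden states: the hidden states are defined as $W = [w_1, w_2, \ldots, w_M]$, and the state at time $t$ is defined as $q_t$;
Observations: the observations are the real output of the system. Let $V = [v_1, v_2, \ldots, v_N]$ be the set of observation symbols, and the observation at time $t$ is defined as $o_t$;
State transition probability matrix $H$: $H = \{h_{ij}\}$, $h_{ij} = P(q_{t+1} = w_j \mid q_t = w_i)$, the sum of each row is 1;
The observation probability vector $L$: $L = \{l_j(k)\}$, $l_j(k) = P(o_t = v_k \mid q_t = w_j)$, $1 \le j \le M$, $1 \le k \le N$, the sum of the vector for each state is 1; and,
The initial state distribution $\pi$: $\pi = \{\pi_i\}$, $\pi_i = P(q_1 = w_i)$.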
Subsequently, the HMM can be specified by $\pi$, $H$, and $L$, and the model can be expressed as $\lambda = (\pi, H, L)$. The topological structure of the HMM is shown in Figure 7. Three assumptions are associated with the use of HMM theory [
The probability of the state at a given time t only depends on the state of previous time t − 1;
The state transition probabilities are independent of the actual time at which the transition occurs; and,
The current observation only depends on the current state and it is independent of the previous observations.
There are three basic algorithms for HMMs: the Forward–Backward procedure, the Viterbi algorithm, and the Baum–Welch algorithm. The Forward–Backward procedure estimates the probability that the observed sequence is generated by a given model $\lambda$; the Viterbi algorithm estimates the optimal state sequence $Q = (q_1, q_2, \ldots, q_t)$ when a model $\lambda$ and an observation sequence $O = (o_1, o_2, \ldots, o_t)$ are given; the Baum–Welch algorithm re-estimates the HMM parameters, outputting the parameters $\lambda$ that maximize the probability of the given observation sequence [21]. The advantage of the HMM is that both the symbol of each important point and the transition rules between adjacent points are used in the analysis, so more information can be mined from the original time series. Therefore, HMM needs less monitoring information than other methods in fault classification.
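As an illustration of the second of these algorithms, the following is a minimal log-domain Viterbi decoder for a discrete HMM. It is a generic textbook sketch, not code from the paper; the argument names `pi`, `H`, and `L` stand for the initial state distribution, the state transition matrix, and the observation probabilities, and observations are assumed to be integer symbol indices (for example A, B, C mapped to 0, 1, 2).

```python
import numpy as np

def viterbi(obs, pi, H, L):
    """Return the most likely hidden state sequence for a discrete HMM.

    obs : sequence of integer observation symbols
    pi  : (M,) initial state distribution
    H   : (M, M) transition matrix, H[i, j] = P(next state j | state i)
    L   : (M, N) observation matrix, L[j, k] = P(symbol k | state j)
    """
    pi, H, L = (np.asarray(a, dtype=float) for a in (pi, H, L))
    with np.errstate(divide="ignore"):         # log(0) = -inf is acceptable here
        log_pi, log_H, log_L = np.log(pi), np.log(H), np.log(L)
    T, M = len(obs), len(pi)
    delta = np.empty((T, M))                   # best log-probability ending in each state
    back = np.zeros((T, M), dtype=int)         # best predecessor of each state
    delta[0] = log_pi + log_L[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_H     # scores[i, j]: from state i to j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(M)] + log_L[:, obs[t]]
    # Trace the best path backwards from the most likely final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```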
In hydraulic pump fault diagnosis, a model for each operating condition is first established from reference symbolic series using the Baum–Welch algorithm. Testing samples are then input to the models, and the operating condition whose model outputs the maximal probability is regarded as the operating condition of the testing sample.
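The classification step can be sketched as follows, assuming that one discrete HMM per operating condition has already been trained (the Baum–Welch training itself is not shown). The probability of a symbolic series under each model is computed with a scaled forward pass, and the condition with the highest log-likelihood is returned. The function names and the layout of the `models` dictionary are hypothetical.

```python
import numpy as np

def log_likelihood(obs, pi, H, L):
    """Scaled forward algorithm: log P(O | model) for a discrete HMM.

    obs : sequence of integer observation symbols
    pi  : (M,) initial distribution; H : (M, M) transitions; L : (M, N) observations
    """
    pi, H, L = (np.asarray(a, dtype=float) for a in (pi, H, L))
    alpha = pi * L[:, obs[0]]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ H) * L[:, obs[t]]
        scale = alpha.sum()
        if scale == 0.0:
            return -np.inf                 # the sequence is impossible under this model
        log_p += np.log(scale)
        alpha = alpha / scale              # rescale to avoid numerical underflow
    return log_p

def classify(obs, models):
    """Return the label of the model with the highest log-likelihood.

    models : dict mapping a condition label to a (pi, H, L) tuple (hypothetical layout)
    """
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))
```

Working with log-likelihoods rather than raw probabilities keeps the comparison stable for long symbolic series, whose probabilities would otherwise underflow.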