In this section, we present the proposed fault diagnosis method based on structural feature selection of heterogeneous fault features. Briefly speaking, we first exploit the inner structure of the features and then choose the most representative fault features for building the diagnosis model. We adopt the following assumption: features with strong relatedness have a short weight distance. We then construct a multi-objective 0–1 programming problem by making the feature weights in the same group as close as possible while making the feature weights in different groups as different as possible. By solving this problem, the optimal inner structure of the fault features can be determined adaptively, and a set of fault features with good discriminant ability can be chosen for model training.
3.1. Model Construction
Given a set of i.i.d. training samples, we obtain the weight coefficient vector of the d-dimensional features by running a linear SVM. Suppose that there exists some group structure among the features. To conduct structural feature selection, we therefore need to assign all d features into G groups such that the weight distance between features in the same group is as small as possible while the distance between features in different groups is as large as possible. We first present the expression of the intra-group weight distance:
where G is the total number of groups and m is the number of features in group g. Equation (3) gives the sum of the weight distances between every pair of features within each group. Obviously, minimizing Equation (3) drives features with close weights into the same group.
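As a concrete check of this criterion, Equation (3) can be evaluated by brute force for a given grouping. The sketch below uses invented weights and groupings; in practice, the weight vector would come from the trained linear SVM (for example, the coef_ attribute of scikit-learn's LinearSVC).

```python
# Brute-force computation of Equation (3); the weights and groupings are
# invented examples, while in practice the weight vector would come from a
# trained linear SVM (e.g., the coef_ attribute of scikit-learn's LinearSVC).

def intra_group_distance(weights, groups):
    """Sum of squared weight distances over every ordered pair of features
    that share a group, i.e., Equation (3) with squared distances."""
    total = 0.0
    for members in groups:
        for i in members:
            for j in members:
                total += (weights[i] - weights[j]) ** 2
    return total

w = [0.9, 1.0, -0.5, -0.4]                        # toy feature weights
good = intra_group_distance(w, [[0, 1], [2, 3]])  # close weights together
bad = intra_group_distance(w, [[0, 2], [1, 3]])   # dissimilar weights together
# The grouping that pairs close weights yields the smaller sum.
```

As expected, the grouping that collects features with similar weights produces a far smaller intra-group distance sum than the mixed grouping.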
Equation (3) cannot be minimized directly. To obtain a feasible minimization target, we introduce the group assignment matrix Q, whose elements q_tg ∈ {0, 1} indicate whether the t-th feature is assigned to group g. Moreover, let Λ_g be the diagonal matrix whose diagonal elements are q_1g, q_2g, …, q_dg. It is clear that the sum of Λ_g over all groups equals I, where I stands for the identity matrix. We also set the weight distance matrix E as:
where the entries of E are built from the feature weight vector w, i.e., E_ij = (w_i − w_j)^2 is the squared weight distance between features i and j. Using the elements E_ij, we can re-write Equation (3) as:
Because each feature belongs to exactly one group, we first give the following definition:
Definition 1 (Intra-group weight distance sum).
The summation constraint guarantees that each feature can be assigned to one and only one group. Obviously, under this constraint, Equation (5) equals a standard formulation of the weight distance. It is clear that the smaller the value of f is, the better the grouping performance is. Similarly, to measure the inter-group distance, we have the following formulation:
where m is the number of features in group g and n is the number of features that are not in group g. Equation (6) gives the sum of the inter-group feature distances. Obviously, maximizing Equation (6) pushes features in different groups farther away from each other, that is, makes them more divergent. Similarly, we use the group assignment matrix Q to express the inter-group feature divergence and re-write Equation (6) as:
where the first term is the intra-group weight distance matrix for group g and the second term is the beyond-group weight distance matrix for group g; consequently, their combination gives the inter-group divergence matrix for group g. According to matrix theory, we have:
Based on the derivation above, we give the following definition:
Definition 2 (Inter-group weight distance sum).
Equation (9) gives the sum of the distances from each feature to all features in other groups. Obviously, maximizing Equation (9) assigns features with divergent weights to different groups. Please note that Equation (5) measures the intra-group feature relatedness while Equation (9) measures the inter-group feature divergence. To exploit the inner structure based on the feature distances, we need to minimize Equation (5) and maximize Equation (9) simultaneously. The final target is to seek a group assignment, i.e., the matrix Q, that makes the intra-group weight distance sum as small as possible while making the inter-group weight distance sum as large as possible, i.e.,

3.2. Solving Method
Considering the group assignment matrix
Q, Equation (
10) is a multi-objective 0–1 programming problem. Since seeking the optimal solution of such a problem is NP-hard (i.e., no deterministic polynomial-time algorithm is known for it), we choose an evolutionary algorithm to seek a numerically approximate solution. Taking the 0–1 constraint into account, we can adapt a traditional multi-objective optimization algorithm by adjusting its search space. Compared with single-objective optimization, a multi-objective optimization problem involves more than one objective function to be optimized simultaneously and seeks one or more solutions (generally called Pareto-optimal) that balance all objectives [
28]. In this case, the objective functions generally conflict, so no single solution simultaneously optimizes every objective.
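To make the two conflicting objectives concrete, the sketch below evaluates both terms of Equation (10) for one candidate grouping. It assumes squared weight differences as the distance measure and encodes a candidate as one group index per feature; the function and variable names are illustrative, not the paper's.

```python
# Sketch of evaluating both objectives of Equation (10) for one candidate
# grouping; squared weight differences serve as the distance measure, and a
# candidate is encoded as one group index per feature (names are illustrative).

def fitness(weights, assignment):
    """Return (f1, f2): the intra-group weight distance sum, to be minimized,
    and the negated inter-group sum, so Equation (10) is a pure minimization."""
    d = len(weights)
    intra = inter = 0.0
    for i in range(d):
        for j in range(d):
            dist = (weights[i] - weights[j]) ** 2
            if assignment[i] == assignment[j]:
                intra += dist
            else:
                inter += dist
    return intra, -inter

w = [0.9, 1.0, -0.5, -0.4]
f1, f2 = fitness(w, [0, 0, 1, 1])   # close weights share a group
# f1 is small while -f2, the inter-group spread, is large for a good grouping.
```

Negating the inter-group sum turns the "minimize one, maximize the other" target into two objectives that are both minimized, which is the form evolutionary solvers expect.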
For the sake of better understanding, we first provide some general concepts. The multi-objective optimization problem can be expressed as follows [
28]:
min F(x) = (f_1(x), f_2(x), …, f_k(x)), x ∈ Ω
where x is a solution belonging to the feasible region Ω.
In mathematical terms, a feasible solution x1 is said to dominate another solution x2, written as x1 ≺ x2, if:
- (1)
f_i(x1) ≤ f_i(x2) for all indices i ∈ {1, 2, …, k}, and
- (2)
f_j(x1) < f_j(x2) for at least one index j.
Due to the conflicting objectives, there exist several Pareto-optimal solutions, which together construct the Pareto front. A solution is called nondominated, Pareto-optimal or noninferior if it is not dominated by any other solution. In other words, none of the objective functions can be improved in value without degrading some of the other objective values.
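The dominance relation above translates directly into code; a minimal check for the minimization case might look as follows (illustrative only).

```python
# Minimal Pareto-dominance test for minimization: f1 dominates f2 if it is no
# worse in every objective and strictly better in at least one.

def dominates(f1, f2):
    """True if objective vector f1 Pareto-dominates f2 (minimization)."""
    return (all(a <= b for a, b in zip(f1, f2))
            and any(a < b for a, b in zip(f1, f2)))

# (1.0, 2.0) dominates (1.0, 3.0): equal in the first objective and strictly
# better in the second. (1.0, 3.0) and (2.0, 2.0) are incomparable: neither
# dominates the other, so both could sit on the same Pareto front.
```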
Because of its insensitivity to the shape and continuity of the Pareto front, an evolutionary algorithm can approximate a non-convex or non-continuous optimal front well, so it is suitable for solving the multi-objective optimization problem. In recent decades, several nature-inspired evolutionary algorithms have proved very efficient in solving multi-objective problems. Among these algorithms, particle swarm optimization (PSO) [
29], proposed by J. Kennedy and R. Eberhart, tries to find an optimal solution by mimicking the social behavior of a bird flock. Compared with other evolutionary algorithms, PSO has advantages such as few parameters and fast convergence. Moreover, due to its simple structure, PSO has been successfully extended to multi-objective optimization problems. Therefore, in this work we apply the multi-objective PSO, named MOPSO [
30], to solve Equation (
10). Please note that this work emphasizes the application of MOPSO rather than the development of the MOPSO algorithm itself.
The basic ideas of classical PSO and MOPSO can be found in the literature [
29,
30]. Here we provide the main steps of the application of MOPSO in this work, as follows. A brief flowchart of MOPSO can also be found in
Figure 3.
Step 1. Initialize the swarm size and the number of generations. Initialize the present location and fitness value of each particle by randomly generating a group assignment matrix
Q for each particle. The intra-group weight distance sum shown in Equation (
5) and the negative of the inter-group weight distance sum shown in Equation (
9) are taken as the fitness functions to evaluate the goodness of a solution. Please note that the fitness value from Equation (
9) must be negated, as shown in Equation (10), so that both objectives are minimized.
Step 2. Update the noninferior archive by adding the particles which are nondominated by others.
Step 3. Update the particles’ velocity and position by following the best particle in the swarm (gBest) and best individual particle (xBest). Here gBest is randomly selected from the noninferior archive.
Step 4. Calculate the fitness value of new particles. Update xBest by checking the domination relationship between the current xBest and new particles.
Step 5. Update the noninferior archive by merging xBest into the current noninferior archive with domination relationship checked.
Step 6. Go to Step 3 or stop if a stop criterion is satisfied.
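Steps 1–6 can be sketched as a compact loop. This is a simplified toy illustration, not the authors' implementation: the weights, swarm size, iteration count and coefficient values are placeholder assumptions, and out-of-bound positions are clamped rather than pulled back by a random reverse step.

```python
import math
import random

# Simplified toy sketch of Steps 1-6, not the authors' implementation: the
# weights, swarm size, iteration count, and coefficients are placeholder
# assumptions, and the archive handling is kept deliberately naive.
random.seed(0)
W = [0.9, 1.0, -0.5, -0.4, 0.85]     # toy feature weights (assumed input)
D, G = len(W), 2                     # number of features, number of groups
SWARM, ITERS, C1, C2 = 20, 40, 0.8, 0.8

def decode(pos):
    """Ceiling of each coordinate gives the feature's group index."""
    return [min(G, max(1, math.ceil(p))) for p in pos]

def fitness(pos):
    """(intra-group sum, negated inter-group sum): both to be minimized."""
    g = decode(pos)
    intra = inter = 0.0
    for i in range(D):
        for j in range(D):
            dist = (W[i] - W[j]) ** 2
            if g[i] == g[j]:
                intra += dist
            else:
                inter += dist
    return (intra, -inter)

def dominates(a, b):
    """Pareto dominance for minimization of objective vectors a and b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

# Step 1: random positions in [0, G], zero velocities.
X = [[random.uniform(0, G) for _ in range(D)] for _ in range(SWARM)]
V = [[0.0] * D for _ in range(SWARM)]
xbest = [p[:] for p in X]
archive = []                         # noninferior (nondominated) archive

def update_archive(pos):
    if pos in archive:
        return
    f = fitness(pos)
    if any(dominates(fitness(a), f) for a in archive):
        return                       # dominated by an archive member
    archive[:] = [a for a in archive if not dominates(f, fitness(a))]
    archive.append(pos[:])

for p in X:                          # Step 2
    update_archive(p)

for k in range(ITERS):               # Steps 3-6
    w = 1.2 - (1.2 - 0.2) * k / ITERS          # linearly decreasing weight
    for i in range(SWARM):
        gbest = random.choice(archive)         # gBest drawn from the archive
        for j in range(D):
            V[i][j] = (w * V[i][j]
                       + C1 * random.random() * (xbest[i][j] - X[i][j])
                       + C2 * random.random() * (gbest[j] - X[i][j]))
            X[i][j] = min(G, max(0.0, X[i][j] + V[i][j]))  # clamp to bounds
        if dominates(fitness(X[i]), fitness(xbest[i])):    # Step 4
            xbest[i] = X[i][:]
        update_archive(xbest[i])               # Step 5

# Pick the archive member with the smallest intra-group sum for inspection.
best = min(archive, key=lambda p: fitness(p)[0])
```

After the loop, `decode(best)` yields one candidate grouping from the noninferior archive; a real implementation would inspect the whole Pareto front rather than a single member.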
In the framework described above, three tricks in solving Equation (
10) need to be highlighted:
(1) In the initialization of the particles, the dimension of a particle equals the number of features, i.e., the ith particle's position is x_i = (x_i1, x_i2, …, x_id). To assign the jth feature to group m, the value of ⌈x_ij⌉ should be m (1 ≤ m ≤ G), where G is the pre-assigned number of groups. Through this encoding, every particle corresponds one-to-one to a group assignment matrix Q.
(2) The velocity of the jth feature of the ith particle in the kth iteration is updated by [
29]:
v_ij^(k+1) = w · v_ij^(k) + c1 · r1 · (xBest_ij − x_ij^(k)) + c2 · r2 · (gBest_j − x_ij^(k))
Considering trick (1), the position of the jth feature of the ith particle, which indicates the group assignment index, is then updated by:
x_ij^(k+1) = ⌈ x_ij^(k) + v_ij^(k+1) ⌉
where:
⌈t⌉ = min{ n ∈ ℤ : n ≥ t }
In Equations (11)–(13), v_ij is the velocity of the jth feature, c1 and c2 are values that weight the contribution of the individual and social information, r1 and r2 are uniformly distributed random numbers, xBest_i is the best previous position of the ith particle, gBest is the best particle in the swarm, and ⌈t⌉ denotes the top integral (ceiling) function of t.
In our experiment, all particle positions X are randomly initialized in the range [0, G], and the corresponding velocities V are initialized to 0. In the search process, the value of the velocity is influenced by w, c1 and c2. To guarantee the search efficiency, we set the value of w to decrease linearly from 1.2 to 0.2, while c1 and c2 are both set to 0.8. Consequently, the value scope of the velocity V is easily determined, roughly [−X, X]. We mainly restrict the search scope of X: if a particle exceeds the bound, we pull it back into the search scope by adding a random value in the reverse direction.
(3) To improve the global search ability, the linearly decreasing weight w in Equation (11) is updated by:
w = wmax − (wmax − wmin) · k / MaxIt
where k is the current iteration number, MaxIt is the maximum iteration number, and wmin and wmax are the minimal and maximal values of the weight w, respectively.
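The update rules (11)–(13) and the linearly decreasing weight can be checked numerically for a single particle dimension. The helper names below are ours, and r1, r2 are fixed instead of random purely so the result is reproducible.

```python
import math

# Numeric check of the update rules (11)-(14) for a single particle dimension;
# the coefficient values are those stated in the text, the helper names ours.

def inertia(k, max_it, w_max=1.2, w_min=0.2):
    """Linearly decreasing inertia weight of the schedule above."""
    return w_max - (w_max - w_min) * k / max_it

def step(x, v, xbest, gbest, w, c1=0.8, c2=0.8, r1=0.5, r2=0.5, G=3):
    """One velocity/position update; the ceiling of the new position is the
    feature's group index (r1, r2 are fixed here for reproducibility)."""
    v_new = w * v + c1 * r1 * (xbest - x) + c2 * r2 * (gbest - x)
    x_new = min(G, max(0.0, x + v_new))
    return x_new, v_new, math.ceil(x_new)

x_new, v_new, group = step(x=1.3, v=0.0, xbest=1.3, gbest=2.6, w=inertia(0, 100))
# The particle moves toward gBest, and its new group index is ceil(x_new).
```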
After assigning the features into different groups, we need to choose representative features from each group. In this work, information gain is introduced to conduct this selection. Information gain is an effective tool to evaluate the relatedness between two variables. For a classification problem, the information gain of a feature x with respect to the classification label Y is defined as the change in the information entropy H of the label Y once this feature is given, as follows:
IG(x; Y) = H(Y) − H(Y | x)
The bigger the information gain is, the stronger the relatedness between the feature and the classification label. As a result, we choose the feature with the biggest information gain in each group as the representative one, and finally combine these representatives into a discriminative feature set. Using this feature set, we can apply SVM to construct the fault diagnosis model.
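The selection step can be sketched as follows. The toy features and labels are invented, and continuous fault features would need to be discretized before computing information gain this way.

```python
import math
from collections import Counter

# Sketch of the final selection step: pick, within each group, the feature with
# the largest information gain IG = H(Y) - H(Y|X) toward the label Y. Features
# are assumed discrete here; continuous ones would be discretized first.

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Information gain of a discrete feature toward the labels."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        sub = [y for x, y in zip(feature, labels) if x == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

y  = [0, 0, 1, 1]
fa = [0, 0, 1, 1]   # perfectly predicts the label: IG = H(Y) = 1 bit
fb = [0, 1, 0, 1]   # independent of the label: IG = 0
group = {"fa": fa, "fb": fb}
best = max(group, key=lambda name: info_gain(group[name], y))
# best names the representative feature chosen from this toy group.
```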