In machine learning, the prediction performance of a classifier increases with the number of features used [52]. However, once the feature set is oversaturated, redundant features degrade the prediction performance of the classifier. It is therefore necessary to select, from the original dataset, the features that contribute most to the prediction performance of the classifier; in our case, this means retaining the features that contribute considerably to the identification accuracy of the RF classifier. Feature selection eliminates irrelevant and redundant features, thereby reducing the number of features, shortening training and running time, and improving identification accuracy [53]. We studied existing feature selection methods and adopted methods based on the Pearson correlation coefficient and the p-value. We performed feature selection on the dataset after min–max standardization processing and analyzed the feature selection effects. Furthermore, we effectively combined the Pearson-correlation-based and p-value-based methods and propose a new multi-index-fusion feature selection method. In this way, effective features that contribute considerably to the identification accuracy of the RF classifier were selected, and a high-quality material dataset was constructed. The new feature selection method effectively improved the prediction performance of the RF classifier on the material dataset and considerably improved the identification accuracy of existing loose particle material identification.
3.3.1. Feature Selection Method Based on Pearson Correlation Coefficient
Pearson correlation is also called product–moment correlation. The Pearson correlation coefficient can be used to measure the linear relationship between the features of each column in the material dataset [54]. The greater the absolute value of the Pearson correlation coefficient, i.e., the closer the coefficient is to 1 or −1, the stronger the correlation between the two variables used in the calculation; the closer the coefficient is to 0, the weaker the correlation between the two variables [55]. Assuming that there are two variables, $X$ and $Y$, the Pearson correlation coefficient between them can be calculated as follows [56]:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y}$$

In the formula, $E(\cdot)$ represents the mathematical expectation, and $\mathrm{Cov}(X,Y)$ represents the covariance between the two variables.
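As a quick illustration of the coefficient's behavior, the following minimal sketch computes it for two nearly linearly related variables using SciPy's `pearsonr`; the data values are hypothetical:

```python
import numpy as np
from scipy import stats

# Two hypothetical variables with a nearly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r, p = stats.pearsonr(x, y)
print(f"r = {r:.4f}, p = {p:.4g}")  # r close to 1: strong positive linear correlation
```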
When using the feature selection method based on the Pearson correlation coefficient on the material dataset, we treated the label as a fixed variable and each other column feature as the second variable. Thus, the Pearson correlation coefficient between each column feature and the label can be calculated. In this case, the closer the calculated coefficient is to 1 or −1, the more important the column feature used in the calculation; values closer to 0 indicate that the column feature is relatively less important. We used Pandas to calculate the Pearson correlation coefficients between each column feature and the label in the material dataset and drew a heat map, as shown in Figure 4.
In the heat map shown in Figure 4, the lighter the color, the weaker the correlation between the two features; the darker the color, the stronger the correlation. The diagonal from the upper left corner to the lower right corner represents the correlation of each feature with itself, so its color is the darkest. Taking this diagonal as the dividing line, the two resulting triangular areas are identical; both express the correlations between features. The value in each square represents the calculated Pearson correlation coefficient between the two features on the abscissa and ordinate of that square. For example, the first row of squares represents the Pearson correlation coefficients between the label and itself and between the label and each of the fourteen features. It can be seen that the correlation between the label and the individual features is weak.
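The following is a minimal sketch of how such a correlation matrix and heat map can be produced with Pandas and seaborn; the file name `material_dataset.csv` and the `label` column name are hypothetical placeholders:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Material dataset after min-max standardization; column names are hypothetical
df = pd.read_csv("material_dataset.csv")

# Pairwise Pearson correlation between all columns, including the label
corr = df.corr(method="pearson")

# Heat map of the correlation matrix (cf. Figure 4)
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="Blues", vmin=-1, vmax=1, annot=True, fmt=".2f", square=True)
plt.tight_layout()
plt.show()
```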
Furthermore, by setting the threshold on the absolute value of the Pearson correlation coefficient to 0.1, we selected and retained three features: energy density (MD), spectral centroid (mainHz), and cepstral coefficient (MSF). We processed the dataset after min–max standardization processing and retained only the column data corresponding to these three features to form a new dataset. The RF classifier was used to make predictions on it, and the achieved average identification accuracy was 48.76%. Compared with the identification accuracy achieved by the RF classifier on the dataset after min–max standardization processing, this is a significant decrease. The reason is that the Pearson-correlation-based method selected very few features, so a considerable amount of the material information contained in the original dataset was lost.
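Continuing the sketch above, the threshold-based selection itself reduces to a simple filter (the threshold value 0.1 is from the text; everything else is illustrative):

```python
# |Pearson r| between each feature and the label (drop the label's self-correlation)
r_with_label = corr["label"].drop("label").abs()

# Keep only features whose |r| reaches the threshold
threshold = 0.1
selected = r_with_label[r_with_label >= threshold].index.tolist()
print(selected)  # on this dataset: the MD, mainHz, and MSF columns
```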
The Pearson correlation coefficient describes the linear correlation between variables. It can also be seen from Figure 4 that the linear correlation between the label and the features in the material dataset is weak. Therefore, we further investigated the non-linear correlation between the label and the features.
3.3.2. Feature Selection Method Based on p-Value
Hypothesis testing, also known as statistical hypothesis testing, involves first making a certain hypothesis and then collecting data by sampling to make a statistical inference about whether the hypothesis should be rejected or accepted [57]. In feature selection, the question addressed by hypothesis testing is whether a feature has a relationship with the response variable. In this paper, the response variable is the label, and the null hypothesis is that "a feature in the material dataset has no relationship with the label". It is necessary to test each feature and determine whether it has a significant relationship with the label. To some extent, this is the same detection logic as that of the Pearson-correlation-based feature selection method described above. Specifically, if the correlation between a feature and the label is too weak, the null hypothesis that the feature has no relationship with the label is accepted. If a feature is sufficiently relevant to the label, the null hypothesis can be rejected, and the feature is considered related to the label.
The p-value is a common evaluation index in hypothesis testing. It is a decimal between 0 and 1 that represents the probability of the observed data arising by chance under the null hypothesis. The lower the p-value, the stronger the justification for rejecting the null hypothesis [58]. That is, in the material dataset, the lower the p-value of a feature, the greater the probability that the feature is related to the label and the more it should be retained.
Commonly used hypothesis testing methods include the Z-test, t-test, chi-square test, and F-test. In this article, we chose the t-test. A t-test uses t-distribution theory to infer the probability of a difference so as to judge whether the difference between two means is significant. In this article, the material dataset had been established, and the feature data of the dataset were known. From another point of view, the built material dataset contained only a limited amount of feature data, which could not fully reflect the values and distribution of all feature data. Therefore, for such normally distributed data with a finite number of samples and an unknown population standard deviation, the t-test is most appropriate. By contrast, the Z-test is a hypothesis testing method for normally distributed data with a known population mean and variance; the chi-square test is used for categorical variables, whereas the feature values in this article were continuous rather than discrete categories; and the F-test is a hypothesis testing method for a known statistical model based on variance information. Therefore, none of these three methods were suitable for this research. The results of hypothesis testing can be seen as a description of the non-linear relationship between the label and the features in the material dataset. Therefore, by studying the hypothesis testing results, the analysis of this non-linear relationship can be completed.
In machine learning, the customary threshold for the p-value is 0.05; i.e., features with a p-value less than 0.05 are worth preserving. Therefore, we calculated the p-value of each feature and set the screening threshold at 0.05 to filter out unqualified features in the material dataset. Ultimately, all fourteen features passed the threshold, the same number as before, indicating that this method alone had little selective effect. This feature selection result suggests that the non-linear correlation between the label and the features in the material dataset is strong. Therefore, we considered a comprehensive analysis that makes use of both the linear and the non-linear correlation between the label and the features.
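The per-feature test can be sketched as follows. This is a minimal illustration that, for simplicity, compares only the first two material classes (with more classes, pairwise tests would be run); it reuses the hypothetical `df` from the earlier sketch and is not the authors' exact implementation:

```python
from scipy import stats

feature_cols = [c for c in df.columns if c != "label"]
alpha = 0.05

# Split the rows by material class; for illustration, compare the first two classes
groups = [g for _, g in df.groupby("label")]

p_values = {}
for col in feature_cols:
    # Welch's two-sample t-test: does the feature's mean differ between classes?
    t_stat, p = stats.ttest_ind(groups[0][col], groups[1][col], equal_var=False)
    p_values[col] = p

# Keep features whose p-value falls below the 0.05 threshold
kept = [c for c, p in p_values.items() if p < alpha]
print(kept)
```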
3.3.3. Multi-Index-Fusion Feature Selection Method
According to the feature selection results based on the Pearson correlation coefficient, the number of selected features was too small to construct a dataset containing sufficient information; the selection condition was too strict, and few features qualified. According to the feature selection results based on the p-value, all features in the material dataset were selected; the selection condition was too broad to filter out poor performers. Judging from the correlations between the label and the features in the material dataset, the non-linear correlation is strong and the linear correlation is weaker. From a global point of view, an extreme bias towards either correlation leads to an unsatisfactory feature selection effect. Therefore, we needed a feature selection method that considers both correlations comprehensively and keeps them in a balanced state.
Based on the analysis and summary of the feature selection effects of the above two methods, we combined the Pearson correlation coefficient and the p-value to design a new multi-index-fusion feature selection method. In this method, a single evaluation index is no longer used to evaluate and select features. Instead, the two evaluation indices together evaluate the features in the dataset, and the final evaluation of each feature is obtained after comprehensive consideration of the two evaluation results. According to these results, we selected the features with excellent performance. The specific implementation steps of the proposed method are as follows:
Step 1: Equation (6) was used to calculate the absolute values of the Pearson correlation coefficients between the features and the label in the material dataset, expressed as $P_i$, where $i$ is the feature number, in the same order as listed in Table 3.

$$P_i = \left| E\!\left[ z_y \, z_{x_i} \right] \right| = \left| E\!\left[ \frac{y - \mu_y}{\sigma_y} \cdot \frac{x_i - \mu_{x_i}}{\sigma_{x_i}} \right] \right| \quad (6)$$

In the formula, $z_y$, $\mu_y$, and $\sigma_y$ are the standard score, mean, and standard deviation of the label $y$, respectively; and $z_{x_i}$, $\mu_{x_i}$, and $\sigma_{x_i}$ are the standard score, mean, and standard deviation of feature $x_i$, respectively.
Step 2: Based on the obtained absolute values $P_i$ of the Pearson correlation coefficient between each feature and the label, we ranked the features from largest to smallest $P_i$. In this way, we obtained the first ranking number of each feature, expressed as $A_i$.
Step 3: In the above process, we used a single evaluation index (the Pearson correlation coefficient) to rank the fourteen features. Next, we used the second index (the p-value) to evaluate the fourteen features in the same way. We calculated the p-values of all features in the dataset, expressed as $p_i$, and ranked the features from the smallest to the largest $p_i$. In this way, we obtained the second ranking number of each feature, expressed as $B_i$.
At this point, both indices had been used to rank the fourteen features. Finally, a comprehensive analysis based on the two ranking results was required to achieve the final evaluation of the fourteen features.
Step 4: For each feature, we accumulated the two ranking numbers to obtain fourteen cumulative sums, expressed as $S_i = A_i + B_i$. We ranked the features from the smallest to the largest $S_i$ and thus obtained the comprehensive ranking number of each feature, expressed as $C_i$.

It should be noted that when several features share the same $S_i$, we applied the following supplementary rule: the smaller the first ranking number $A_i$, the smaller (i.e., higher-priority) the comprehensive ranking $C_i$ assigned. For example, a feature with $A_i = 3$ and $B_i = 5$ obtains the same cumulative sum as a feature with $A_i = 6$ and $B_i = 2$. However, under the supplementary rule, because the former has $A_i = 3$ and the latter has $A_i = 6$, the comprehensive ranking $C_i$ of the feature with $A_i = 3$ and $B_i = 5$ is smaller (better) than that of the feature with $A_i = 6$ and $B_i = 2$.
Step 5: Experience with feature selection shows that when the selected features account for more than half of the features in the dataset, the classifier can achieve an ideal prediction effect on the dataset built from them. Therefore, combining this with the grid search method, we retained the top eight to top fourteen features according to the comprehensive ranking number $C_i$ and formed seven datasets. We applied the RF classifier to make predictions on each dataset, obtaining multiple identification accuracies. The dataset on which the RF classifier achieved the highest identification accuracy identifies the optimal combination of features; in this way, the optimal feature selection result was obtained.
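The full procedure can be sketched as follows. This reuses the hypothetical `df`, `corr`, and `p_values` objects from the earlier sketches and illustrates the ranking-and-fusion logic rather than the authors' exact implementation:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Step 1: |Pearson r| between each feature and the label (P_i)
P = corr["label"].drop("label").abs()
# Step 2: first ranking number A_i (largest |r| gets rank 1)
A = P.rank(ascending=False, method="min").astype(int)

# Step 3: second ranking number B_i (smallest p-value gets rank 1)
p = pd.Series(p_values)
B = p.rank(ascending=True, method="min").astype(int)

# Step 4: cumulative sum S_i; ties broken by the better (smaller) Pearson rank A_i
S = A + B
order = pd.DataFrame({"S": S, "A": A}).sort_values(["S", "A"]).index.tolist()

# Step 5: evaluate the top-8 .. top-14 feature subsets with the RF classifier
results = {}
for k in range(8, 15):
    X, y = df[order[:k]], df["label"]
    results[k] = cross_val_score(RandomForestClassifier(), X, y, cv=10).mean()

best_k = max(results, key=results.get)
print(f"best subset: top-{best_k} features, accuracy = {results[best_k]:.4f}")
```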
We applied the multi-index-fusion feature selection method to the material dataset after min–max standardization processing and obtained the rankings of the fourteen features at the different feature selection stages, as described in Table 7.
According to the comprehensive rankings in Table 7, we formed seven new datasets from the combinations of the top eight to top fourteen features and applied the grid search method. We applied the RF classifier to make ten predictions on each of the seven datasets and found that the RF classifier achieved the highest average identification accuracy on the dataset formed by the following twelve features: pulse area (s), degree of symmetry between left and right (dczy), pulse rise proportion (Tp), duration (Tl), energy density (MD), pulse ratio (ZB), area ratio (dp), spectral centroid (mainHz), variance (var), cepstral coefficient (MSF), cepstral coefficient difference (MSFcha), and zero-crossing rate (zerorate). The highest identification accuracy was 64.46%. The multi-index-fusion feature selection method thus retained twelve columns of feature data in the material dataset, fewer than the original fourteen columns, and the achieved identification accuracy was significantly improved compared with that before selection.
Table 8 lists the feature selection effects of the methods based on the Pearson correlation coefficient and the p-value, as well as the multi-index-fusion feature selection method. It can be seen from the table that, compared with the identification accuracy of 63.51% achieved by the RF classifier on the material dataset containing fourteen features, the RF classifier achieved an identification accuracy of 64.46% on the material dataset containing twelve features after feature selection. Despite the removal of two columns of feature data from the material dataset, the identification accuracy achieved by the RF classifier improved by 0.95 percentage points. We also found that, compared with using a single evaluation index, the multi-index-fusion feature selection method organically combined the two evaluation indices and achieved an average identification accuracy higher than either index alone. In other words, the average identification accuracy of the multi-index-fusion feature selection method is higher than that of traditional filter-based feature selection. This demonstrates the superiority of the new feature selection method on the material dataset.
The proposed feature optimization method for material identification of loose particles inside sealed relays achieved a highest identification accuracy of 64.46% on the material dataset, a significant improvement over the 59.63% obtained in our previous study.
Table 9 lists the identification accuracies achieved by the RF classifier in the missing value processing stage, standardization and normalization processing stage, and feature selection stage.