In this section, we first test the effectiveness of semi-supervised attribute reduction and then compare the proposed model with other semi-supervised learning methods. All experiments are conducted on a computer running the Windows 10 operating system, configured with an Intel(R) Core(TM) i7-7700K CPU @ 4.20 GHz and 16 GB RAM.
4.3. Effectiveness of the Proposed Co-Training Model
To demonstrate its performance, the proposed model is compared with classical semi-supervised methods, including self-training, co-training, and their extensions.
Classic self-training is a self-learning model: it first trains a base classifier on the labeled data and then iteratively selects confident samples from the unlabeled pool to learn from until a stopping condition is met. Co-training is a multi-view model in which two classifiers learn from each other on unlabeled data, but it requires that the two views be sufficient and independent, a condition that is usually difficult to satisfy in practical problems. Fortunately, the work of Nigam et al. [47] demonstrated that even if the raw data are randomly split into two attribute subsets, the co-training classifiers can still learn from unlabeled samples. Therefore, we divided the condition attribute set of each dataset into two disjoint subsets by splitting the attributes in half. In addition, for a more comprehensive comparison, we evaluated self-training in two settings: self-training with a single view and self-training with two randomly divided views. Moreover, we evaluated single-view self-training both on data after attribute reduction and on data without attribute reduction. The settings for these models are shown in Table 4.
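Following Nigam et al.'s observation, the two views for ST-2V and CT-2V are produced by randomly halving the attribute set. A minimal sketch of such a split is given below; the function name and NumPy-based implementation are our assumptions, not the paper's code:

```python
import numpy as np

def random_two_view_split(n_attributes, seed=0):
    """Randomly halve the condition attributes into two disjoint views.

    Hypothetical helper illustrating the half-splitting used for the
    two-view models (ST-2V, CT-2V); when n_attributes is odd, the second
    view receives the extra attribute.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_attributes)      # random order of attribute indices
    half = n_attributes // 2
    return np.sort(idx[:half]), np.sort(idx[half:])

# Example: splitting a 9-attribute condition set into two disjoint views.
view_a, view_b = random_two_view_split(9)
```

Each base classifier is then trained only on the attribute columns of its own view.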
In Table 4, ST-1V and ST-2V denote single-view and two-view self-training, respectively, and ST-1VR denotes single-view self-training after attribute reduction. For the reduction, and to enable a comprehensive comparison with the proposed model, we adopt the semi-supervised neighborhood discriminant index, a filter method that combines the supervised neighborhood discriminant index with unsupervised Laplacian information. CT-2V represents classical co-training, while CT-TWD denotes the proposed three-way co-training model. To learn from useful unlabeled samples, a threshold parameter needs to be set for ST-1V, ST-1VR, ST-2V, and CT-2V, and the model proposed in this study requires two pairs of parameters: the first pair is obtained from the Bayesian minimum-risk decision, while the second pair is calculated from the defined normalized entropy. For a simple and fair comparison, these parameters are all empirically set to 0.75 and 0.55 for the Bayesian pair and to 0.80 and 0.95 for the normalized-entropy pair. For ST-1V and ST-2V, an unlabeled sample whose prediction probability is greater than 0.75 is selected for learning. For CT-2V, an unlabeled sample is used for learning when the prediction probability of one classifier is greater than 0.75 and the probability predicted by the other classifier is less than 0.55. For the CT-TWD in this study, the thresholds 0.80 and 0.95 are used to decide whether an unlabeled sample is useful, uncertain, or useless: when the average normalized entropy of the two classifiers for an unlabeled sample is less than 0.80, the sample is considered useful; when it lies between 0.80 and 0.95, the sample is determined to be uncertain; and when it is greater than 0.95, the sample is considered useless. In the experiments, two types of classifiers, i.e., the K-nearest neighbor (KNN) classifier and the naive Bayes classifier, are used to evaluate the performance of the selected methods. Given a label rate of 10%, the results of the different methods are shown in Table 5 and Table 6.
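The three-way selection rule described above can be sketched in code. The normalized entropy is assumed here to be the Shannon entropy scaled by log K for K classes; the paper's exact definition and the helper names are our assumptions, while the thresholds 0.80 and 0.95 follow the setting above:

```python
import numpy as np

def normalized_entropy(p):
    """Shannon entropy of a class-probability vector, scaled to [0, 1].

    Assumed form H(p) / log(K), where K is the number of classes; the
    paper's defined normalized entropy may differ in detail.
    """
    p = np.asarray(p, dtype=float)
    k = len(p)
    nz = p[p > 0]                      # treat 0 * log(0) as 0
    return float(-np.sum(nz * np.log(nz)) / np.log(k))

def three_way_label(p1, p2, low=0.80, high=0.95):
    """Classify an unlabeled sample by the average normalized entropy of
    the two base classifiers' predicted class distributions."""
    e = 0.5 * (normalized_entropy(p1) + normalized_entropy(p2))
    if e < low:
        return "useful"       # both classifiers are confident
    if e <= high:
        return "uncertain"    # deferred for later rounds
    return "useless"          # too ambiguous to learn from
```

A sample on which both classifiers are nearly certain thus falls into the useful region, while near-uniform predictions are discarded as useless.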
In Table 5 and Table 6, the symbols "initial" and "final" denote the error rates of each model when trained on the labeled data only and after learning from the unlabeled data, respectively. All "initial" and "final" results are averaged over 10-fold cross-validation. In addition, for ease of comparison, the lowest error rates are marked in bold. Table 7 and Table 8 provide the computation time of the different comparison methods with the KNN and naive Bayes classifiers. The row "avg." represents the average error rates of the selected models computed over all the datasets.
By observing Table 5, Table 6, Table 7 and Table 8, it can be found that at a 10% label rate, the initial performance of ST-1VR is better than that of ST-1V, and on some datasets, such as "ttt" (33.26%) and "vowel" (32.85%) in Table 5 and "cmc" (36.14%) in Table 6, it even exceeds that of the proposed CT-TWD, which shows the effectiveness of attribute reduction for semi-supervised learning. However, the improvement of both ST-1VR and ST-1V after learning from unlabeled samples is not significant, and on many datasets their performance even deteriorates. The two-view self-training (ST-2V) can learn useful information from unlabeled samples and outperforms the first two models. Combining Table 7 and Table 8, the computation time of ST-2V is greater than that of ST-1VR and ST-1V, which indicates that two views perform better than a single view but require additional time to process. For most datasets, the classifier retrained on unlabeled samples performs better than the classifier trained on labeled data only, and the two-view co-training model (CT-2V) achieves better performance with the KNN and naive Bayes classifiers, improving by 2.50% and 2.20%, respectively, owing to the mutual learning between its two classifiers. Its average error rates with both classifiers are lower than those of ST-1V, ST-1VR, and ST-2V, which demonstrates the stability of CT-2V. On the other hand, CT-2V must train two classifiers simultaneously, resulting in a slightly longer computation time. However, the results in Table 5 and Table 6 show that CT-2V, with average error rates of 33.42% on the KNN classifier and 31.65% on the naive Bayes classifier, still lags well behind the proposed CT-TWD, which achieves 30.60% and 29.61%, respectively. In terms of computation time, although the average time (avg.) of CT-TWD in Table 7 and Table 8 is relatively large, at 4.0440 s with the KNN classifier and 2.5763 s with the naive Bayes classifier, the additional cost is clearly acceptable given the good performance of CT-TWD.
To compare the differences among the methods more comprehensively, we also conduct experiments at different label rates, and the results are shown in Figure 2 and Figure 3.
As can be seen in Figure 2 and Figure 3, the proposed CT-TWD can learn from unlabeled samples and achieves impressive performance against the other models. ST-1V is a single-view semi-supervised learning model, and the experiments show that it performs poorly on most datasets; on some, such as "lymph" with the KNN classifier and "frogs" with the naive Bayes classifier, its performance is even worse at higher label rates. This may be because the initially labeled data are not representative, so the classifier mislabels unlabeled samples during training; it then learns the wrong classification information, which results in poor generalization of the final model. ST-1VR is also a single-view self-training model but performs attribute reduction on the dataset. Although its overall performance is modest, it outperforms ST-1V, which shows the effectiveness of the attribute reduction based on the semi-supervised neighborhood discriminant index. However, owing to the limitations of the single-view setting, ST-1VR still ends with poor final performance on datasets such as "cmc" and "lymph" with the KNN classifier. ST-2V is a multi-view self-training model that trains its base classifiers on randomly split attribute subsets of the raw dataset and uses a threshold to select useful samples for retraining, but its final performance is also unsatisfactory: on the one hand, its two classifiers are self-taught; on the other hand, the poor quality of the randomly partitioned attribute subsets also contributes to the disappointing results. Although CT-2V lets two base classifiers learn from each other through unlabeled samples to improve performance, its two subspaces are likewise randomly divided from the dataset; the resulting classifiers are therefore unstable, and CT-2V performs better only on some datasets.
Different from the selected comparison models, CT-TWD uses the three-way co-decision model during training to classify unlabeled samples as useful, uncertain, or useless. The training set of each classifier is updated only when the unlabeled samples are judged useful and have a positive impact on model performance. Such a sample selection mechanism ensures that CT-TWD can effectively use unlabeled samples to improve performance on most datasets at different label rates. For example, the proposed model achieves an improvement of 22.03% at a 30% label rate on the "vowel" dataset with the KNN classifier and an improvement of 13.55% at a 40% label rate on the "wine" dataset with the naive Bayes classifier, illustrating the potential of the proposed model for partially labeled data.
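The guarded training-set update, in which selected samples are kept only when they do not hurt performance, can be sketched as follows. This is a simplified sketch under our own assumptions: the validation-error criterion, the function names, and the minimal 1-NN stand-in for the KNN base learner are all illustrative, not the paper's implementation:

```python
import numpy as np

class OneNN:
    """Minimal 1-nearest-neighbor classifier, a stand-in for the KNN
    base learner used in the experiments."""
    def fit(self, X, y):
        self.X, self.y = np.asarray(X, float), np.asarray(y)
        return self
    def predict(self, X):
        X = np.asarray(X, float)
        d = np.linalg.norm(X[:, None, :] - self.X[None, :, :], axis=2)
        return self.y[np.argmin(d, axis=1)]

def accept_if_improving(X_train, y_train, X_new, y_new, X_val, y_val):
    """Tentatively add pseudo-labeled samples; keep them only if the
    validation error does not increase (hypothetical criterion)."""
    err = lambda clf: float(np.mean(clf.predict(X_val) != y_val))
    base = OneNN().fit(X_train, y_train)
    X_cand = np.vstack([X_train, X_new])
    y_cand = np.concatenate([y_train, y_new])
    cand = OneNN().fit(X_cand, y_cand)
    if err(cand) <= err(base):
        return X_cand, y_cand      # accept: useful samples join the pool
    return X_train, y_train        # reject: keep the original training set
```

Correctly pseudo-labeled samples pass the check and enlarge the training set, while mislabeled ones that raise the validation error are discarded.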
It should be noted that for some datasets, such as "cmc" with the naive Bayes classifier and "lymph" with the KNN classifier, the performance of the methods tends to decrease as the label rate increases. This is likely because the labeled data are not representative enough, which limits the performance of the classifier as the data scale grows. Unlike the other models, the CT-TWD proposed in this study assigns pseudo labels of 0 and 1 to unlabeled samples to form two views of the data, so that both views retain the discriminative ability of the raw dataset. The base classifiers obtained by CT-TWD are therefore robust, which allows the model to perform well across all the datasets.