Article

Discriminatory Target Learning: Mining Significant Dependence Relationships from Labeled and Unlabeled Data

1 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
2 Faculty of Science, Engineering & Built Environment, Deakin University Geelong, Burwood, VIC 3125, Australia
3 Changzhou College of Information Technology, Changzhou 213164, China
4 College of Computer Science and Technology, Jilin University, Changchun 130012, China
* Author to whom correspondence should be addressed.
Entropy 2019, 21(5), 537; https://doi.org/10.3390/e21050537
Submission received: 27 April 2019 / Revised: 20 May 2019 / Accepted: 24 May 2019 / Published: 26 May 2019
(This article belongs to the Special Issue Information Theoretic Measures and Their Applications)

Abstract

Machine learning techniques have shown superior predictive power, among which Bayesian network classifiers (BNCs) have remained of great interest due to their capacity to demonstrate complex dependence relationships. Most traditional BNCs tend to build only one model to fit the training instances by analyzing independence between attributes using conditional mutual information. However, for different class labels, the conditional dependence relationships may be different rather than invariant when attributes take different values, which may result in classification bias. To address this issue, we propose a novel framework, called discriminatory target learning, which can be regarded as a tradeoff between the probabilistic model learned from the unlabeled instance at the uncertain end and that learned from labeled training data at the certain end. The final model can discriminately represent the dependence relationships hidden in the unlabeled instance with respect to different possible class labels. Taking the k-dependence Bayesian classifier as an example, experimental comparison on 42 publicly available datasets indicated that the final model achieved competitive classification performance compared to state-of-the-art learners such as Random forest and averaged one-dependence estimators.

1. Introduction

With the rapid development of computer technologies, business and government organizations create large amounts of data, which need to be processed and analyzed. Over the past decade, to satisfy the urgent need of mining the knowledge hidden in these data, numerous machine learning models [1,2] (e.g., decision tree [3], Bayesian network [4,5], support vector machine [6] and neural network [7]) have been proposed.
To mine all the “right” knowledge that exists in a database, researchers have mainly proposed two kinds of learning strategies. (1) Increase structure complexity to represent more dependence relationships, e.g., convolutional neural network [8] and k-dependence Bayesian classifier (KDB) [9]. However, as structure complexity grows, overfitting will inevitably appear, which results in redundant dependencies and performance degradation. Sometimes the overly complex structures hide the internal working mechanism and leave such models criticized as “black boxes”. (2) Build an ensemble of several individual members with relatively simple network structures, e.g., Random forest [10] and averaged one-dependence estimators (AODE) [11]. Ensembles can generally perform better than any individual member. However, it is difficult or even impossible to give a clear semantic explanation of the combined result, since the working mechanisms of the individual members may differ greatly. In practice, people would rather use models with simple and easy-to-explain structures, e.g., decision tree [12] and Naive Bayes (NB) [13,14,15], although they may perform worse.
Bayesian networks (BNs) have long been a popular medium for graphically representing the probabilistic dependencies that exist in a domain. Recently, work on Bayesian methods for classification has grown enormously. Numerous Bayesian network classifiers (BNCs) [9,16,17,18,19,20] have been proposed to mine the significant dependence relationships implicated in training data. With solid theoretic support, they have strong potential to be effective for practical application in a number of massive and complex data-intensive fields such as medicine [21], astronomy [22], biology [23], and so on. A central concern for a BNC is to learn the conditional dependence relationships encoded in the network structure. Some BNCs, e.g., KDB, use conditional mutual information $I(X_i;X_j|Y)$ to measure the conditional dependence between $X_i$ and $X_j$, which is defined as follows [24],
$$I(X_i;X_j|Y)=\sum_{x_i}\sum_{x_j}\sum_{y}P(x_i,x_j,y)\log\frac{P(x_i,x_j|y)}{P(x_i|y)P(x_j|y)}=\sum_{x_i}\sum_{x_j}\sum_{y}I(x_i;x_j|y) \qquad (1)$$
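For illustration, the following minimal Python sketch estimates $I(X_i;X_j|Y)$ of Equation (1) from the empirical counts of a small labeled dataset. It assumes discrete attributes and maximum-likelihood estimates; the toy data and function names are hypothetical and do not come from the experimental software described in Section 4.

```python
from collections import Counter
from math import log

def conditional_mutual_information(data, i, j):
    """Estimate I(X_i; X_j | Y) of Equation (1) from labeled instances (x_1, ..., x_n, y)."""
    N = len(data)
    c_y = Counter(row[-1] for row in data)                      # counts of y
    c_iy = Counter((row[i], row[-1]) for row in data)           # counts of (x_i, y)
    c_jy = Counter((row[j], row[-1]) for row in data)           # counts of (x_j, y)
    c_ijy = Counter((row[i], row[j], row[-1]) for row in data)  # counts of (x_i, x_j, y)

    cmi = 0.0
    for (xi, xj, y), n in c_ijy.items():
        p_joint = n / N                        # P(x_i, x_j, y)
        p_ij_y = n / c_y[y]                    # P(x_i, x_j | y)
        p_i_y = c_iy[(xi, y)] / c_y[y]         # P(x_i | y)
        p_j_y = c_jy[(xj, y)] / c_y[y]         # P(x_j | y)
        cmi += p_joint * log(p_ij_y / (p_i_y * p_j_y))
    return cmi

# Toy labeled dataset with two binary attributes and a binary class label.
toy = [(0, 0, 'a'), (0, 0, 'a'), (1, 1, 'a'), (1, 0, 'b'), (0, 1, 'b'), (1, 1, 'b')]
print(conditional_mutual_information(toy, 0, 1))
```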
For example, $I(X_i;X_j|Y)=0$ indicates that attributes $X_i$ and $X_j$ are conditionally independent. However, in practice, for any specific event or data point, the situation is much more complex. Taking the Waveform dataset as an example, attributes $X_{15}$ and $X_{16}$ are conditionally dependent, since $I(X_{15};X_{16}|Y)>0$ always holds. Figure 1 shows the distributions of $I(x_{15};x_{16}|y_i)$, where $i\in\{1,2,3\}$. As can be seen, there exist some positive values of $I(x_{15};x_{16}|y_1)$ and $I(x_{15};x_{16}|y_2)$. However, for the class label $y_3$, the negative or zero values of $I(x_{15};x_{16}|y_3)$ account for a high proportion of all values. That is, for different class labels, the conditional dependence relationships may be different rather than invariant when attributes take different values. We argue that most BNCs (e.g., NB and KDB), which build only one model to fit the training instances, cannot capture this difference and cannot flexibly represent the dependence relationships, especially those hidden in unlabeled instances.
Scientific data can be massive, and labeled training data may account for only a small portion. In this paper, we propose a novel learning framework, called discriminatory target learning, for achieving better classification performance and a higher level of dependence representation without increasing structure complexity. KDB is taken as an example to illustrate the basic idea and prove the feasibility of discriminatory target learning. By redefining mutual information and conditional mutual information, we build a “precise” model kdbi for each unlabeled instance $x$ with respect to class label $y_i$. The ensemble of kdbi, i.e., kdbe, can finely describe the dependency relationships hidden in $x$. The final ensemble of kdbe and regular KDB can fully and discriminately describe the dependence relationships in the training data and the unlabeled instance.
The rest of the paper is organized as follows: Section 2 introduces some state-of-the-art BNCs. Section 3 introduces the basic idea of discriminatory target learning. Experimental study on 42 UCI machine learning datasets is presented in Section 4, including a comparison with seven algorithms. The final section draws conclusions and outlines some directions for further research.

2. Bayesian Network Classifiers

The structure of a BN over the random variables $\{X_1,\ldots,X_n\}$ is a directed acyclic graph (DAG), which represents each attribute in a given domain as a node in the graph and the dependencies between these attributes as arcs connecting the respective nodes. Thus, independencies are represented by the lack of arcs connecting particular nodes. BNs are powerful tools for knowledge representation and inference under conditions of uncertainty. BNs were considered as classifiers only after the discovery of NB, a very simple kind of BN based on the conditional independence assumption, which is surprisingly effective and efficient for inference [5]. The success of NB has led to research on Bayesian network classifiers (BNCs), including tree-augmented naive Bayes (TAN) [16], averaged one-dependence estimators (AODE) [18] and the k-dependence Bayesian classifier (KDB) [9,17].
Let each instance $x$ be characterized by n values $\{x_1,\ldots,x_n\}$ for attributes $\{X_1,\ldots,X_n\}$, and let class label $y\in\{y_1,\ldots,y_m\}$ be the value of class variable Y. NB assumes that the predictive attributes are conditionally independent of each other given the class label, that is,
$$P(x_1,\ldots,x_n|y)=\prod_{i=1}^{n}P(x_i|y)$$
Correspondingly, for any value pair of two arbitrary attributes $X_i$ and $X_j$, $P(x_i,x_j|y)=P(x_i|y)P(x_j|y)$ always holds. From Equation (1), there will be $I(X_i;X_j|Y)=0$, and this explains why there exists no arc between attributes in NB. However, in the real world, the situation is much more complex when considering different specific events or data points. We now formalize our notion of the spectrum of point dependency relationships in Bayesian classification.
Definition 1.
For unlabeled data point $x=\{x_1,\ldots,x_n\}$, the conditional dependence between $X_i$ and $X_j$ ($1\le i,j\le n$) with respect to label y on point $x$ is measured by pointwise y-conditional mutual information, which is defined as follows,
$$I(x_i;x_j|y)=P(x_i,x_j,y)\log\frac{P(x_i,x_j|y)}{P(x_i|y)P(x_j|y)}=P(x_i,x_j,y)\log\frac{P(x_i|x_j,y)}{P(x_i|y)} \qquad (2)$$
Equation (2) is a modified version of the pointwise conditional mutual information that is applicable to a labeled data point [25]. By comparing Equations (1) and (2), $I(X_i;X_j|Y)$ is the summation of $I(x_i;x_j|y)$ over all possible values of $X_i$, $X_j$ and Y. Traditional BNCs, e.g., TAN and KDB, use $I(X_i;X_j|Y)$ to roughly measure the conditional dependence between $X_i$ and $X_j$. $I(X_i;X_j|Y)$ is non-negative, and $I(X_i;X_j|Y)>0$ iff $X_i$ and $X_j$ are conditionally dependent given Y. However, only taking $I(X_i;X_j|Y)=0$ as the criterion for identifying conditional independence is too strict for BN learning and may lead to classification bias, since $I(x_i;x_j|y)\le 0$ may hold for a specific data point $x$. That may be the main reason why NB performs better in some research domains. To address this issue, in this paper $I(x_i;x_j|y)$ is applied to measure the extent to which $X_i$ and $X_j$ are relatively conditionally dependent when $P(x_i|x_j,y)>P(x_i|y)$, or relatively conditionally independent or irrelevant when $P(x_i|x_j,y)<P(x_i|y)$, respectively.
Definition 2.
For unlabeled data point $x=\{x_1,\ldots,x_n\}$ with respect to label y, if $I(x_i;x_j|y)>0$ ($1\le i,j\le n$), then $X_i$ and $X_j$ are y-conditionally dependent on point $x$; if $I(x_i;x_j|y)=0$, then they are y-conditionally independent on point $x$; and if $I(x_i;x_j|y)<0$, then they are y-conditionally irrelevant on point $x$.
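The pointwise quantities of Definitions 1 and 2 can be computed directly from the probabilities in Equation (2). The following sketch is an illustrative implementation under the assumption that the required joint and marginal probabilities have already been estimated; the numeric inputs are hypothetical.

```python
from math import log

def pointwise_cmi(p_xi_xj_y, p_xi_y, p_xj_y, p_y):
    """I(x_i; x_j | y) from Equation (2), given the joint and marginal probabilities."""
    p_ij_given_y = p_xi_xj_y / p_y
    p_i_given_y = p_xi_y / p_y
    p_j_given_y = p_xj_y / p_y
    return p_xi_xj_y * log(p_ij_given_y / (p_i_given_y * p_j_given_y))

def point_relationship(i_value, eps=1e-12):
    """Name the relationship on a point according to Definition 2."""
    if i_value > eps:
        return "y-conditionally dependent"
    if i_value < -eps:
        return "y-conditionally irrelevant"
    return "y-conditionally independent"

# Hypothetical probabilities for one attribute-value pair and one label.
v = pointwise_cmi(p_xi_xj_y=0.10, p_xi_y=0.20, p_xj_y=0.25, p_y=0.40)
print(v, point_relationship(v))
```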
TAN maintains the structure of NB and allows each attribute to have at most one parent. Thus, the number of arcs encoded in TAN is $n-1$. During the construction of the maximum weighted spanning tree, TAN sorts the arcs between arbitrary attributes $X_i$ and $X_j$ by comparing $I(X_i;X_j|Y)$, and adds them in turn to the network structure if no cycle appears. KDB further relaxes NB's independence assumption and can represent an arbitrary degree of dependence while capturing much of the computational efficiency of NB. KDB first sorts the attributes by comparing mutual information $I(X_i;Y)$, which is defined as follows [24],
$$I(X_i;Y)=\sum_{x_i}\sum_{y}P(x_i,y)\log\frac{P(x_i,y)}{P(x_i)P(y)} \qquad (3)$$
Suppose the attribute order is $\{X_1,\ldots,X_n\}$. By comparing $I(X_i;X_j|Y)$, $X_i$ selects its parents, e.g., $X_j$, from the attributes that rank before it in the order. KDB requires that $X_i$ have $\min(i-1,k)$ parents, so there will be $\min(i-1,k)$ arcs between $X_i$ and its parents. The number of arcs encoded in KDB is $nk-\frac{k^2}{2}-\frac{k}{2}$ and grows as k grows. Thus, KDB can represent more dependency relationships than TAN. Neither TAN nor KDB evaluates the extent to which conditional dependencies are weak enough to be neglected. They simply specify the maximum number of parents that attribute $X_i$ can have before structure learning. Some arcs corresponding to weak conditional dependencies will inevitably be added to the network structure. The prior and joint probabilities in Equations (1) and (3) are estimated from the training data as follows:
$$P(y)=\frac{1}{N}\,\mathrm{Count}(Y=y)\qquad P(x_j)=\frac{1}{N}\,\mathrm{Count}(X_j=x_j)$$
$$P(x_j,y)=\frac{1}{N}\,\mathrm{Count}(X_j=x_j,Y=y)\qquad P(x_i,x_j,y)=\frac{1}{N}\,\mathrm{Count}(X_i=x_i,X_j=x_j,Y=y)$$
where N is the number of training instances. Then, $P(x_j|y)$ and $P(x_i,x_j|y)$ in Equations (1) and (3) can be computed as follows:
$$P(x_j|y)=\frac{P(x_j,y)}{P(y)}\qquad P(x_i,x_j|y)=\frac{P(x_i,x_j,y)}{P(y)}$$
Sahami [9] suggested that, if k is large enough to capture all the “right” conditional dependencies that exist in a database, a classifier would be expected to achieve optimal Bayesian accuracy. However, as k grows, KDB will encode more weak dependency relationships, which correspond to smaller values of $I(X_i;X_j|Y)$. That increases the risk of negative values of $I(x_i;x_j|y)$ and may introduce redundant dependencies, which will mitigate the positive effect of the significant dependencies that correspond to positive values of $I(x_i;x_j|y)$. On the other hand, conditional mutual information $I(X_i;X_j|Y)$ cannot finely measure the conditional dependencies hidden in different data points. An arc between $X_i$ and $X_j$ in a BNC learned from training data corresponds to a positive value of $I(X_i;X_j|Y)$ and represents strong conditional dependence between $X_i$ and $X_j$. However, for a specific labeled instance $d=\{x_1,\ldots,x_n,y_1\}$, $I(x_i;x_j|y_1)\le 0$ may hold. Then, $X_i$ and $X_j$ are $y_1$-conditionally independent or irrelevant on point $d$ and the arc between $X_i$ and $X_j$ should be removed. For an unlabeled instance, the possible dependency relationships between nodes may differ greatly with respect to different class labels.
Thus, BNCs with highly complex network structures do not necessarily beat those with simple ones. The conditional dependencies that hold for the training data in general do not necessarily hold for each instance. BNCs should discriminate between conditionally dependent and irrelevant relationships for different data points. Besides, a BNC should represent all possible spectrums of point dependency relationships that correspond to different class labels for dependence analysis.

3. Discriminatory Target Learning

In probabilistic classification, Bayes optimal classification suggests that, if we can determine the conditional probability distribution $P(y|x)$ with the true distribution available, where y is one of the m class labels and $x$ is the n-dimensional data point $x=\{x_1,x_2,\ldots,x_n\}$ that represents an observed instance, then we can achieve the theoretically optimal classification. $P(y|x)$ can be described by an unrestricted Bayesian network, as shown in Figure 2a. By applying arc reversal, Shachter [26] proposed to produce the equivalent dependence structure shown in Figure 2b. The problem is then reduced to estimating the conditional probability $P(x|y)$. Figure 2a,b represents two inference processes that run in opposite directions. Figure 2a indicates the causality that runs from the state of $\{X_1,\ldots,X_n\}$ (the cause) to the state of Y (the effect). In contrast, if the causality runs in the opposite direction, as shown in Figure 2b, and the state of Y (the effect) is uncertain, the dependencies between the predictive attributes (the causes) should be tuned to match different states of Y. That is, the restricted BNC shown in Figure 2b presupposes the class label first, and then the conditional dependencies between attributes can verify the presupposition.
For different class labels or presuppositions, the conditional dependencies should be different. It is not reasonable that, no matter what the effect (class label) is, the relationships between the causes (predictive attributes) remain the same. Consider an unlabeled instance $x=\{x_1,\ldots,x_n\}$; if $I(x_i;x_j|y)>0$, then the conditional dependence between $X_i$ and $X_j$ on data point $x$ with respect to class label y is reasonable; otherwise it should be neglected. Since the class label for $x$ is uncertain and there are m labels available, we take $x$ as the target and learn an ensemble of m micro BNCs, i.e., bnce = {bnc1, ⋯, bncm}, each of which fully describes the conditional dependencies between the attribute values in $x$ with respect to a different class label. The linear combiner is used for models that output real-valued numbers and is thus applicable to bnce. The ensemble probability estimate for bnce is
$$\hat{P}(y_i|x,\mathrm{bnc}^e)=\frac{P(y_i,x|\mathrm{bnc}_i)}{\sum_{i=1}^{m}P(y_i,x|\mathrm{bnc}_i)}.$$
bnce may overfit the unlabeled instance and underfit the training data. In contrast, a regular BNC learned from training data may underfit the unlabeled instance. Thus, they are complementary in nature. After training bnce and the regular BNC, the final ensemble, which estimates the class membership probabilities by averaging both predictions, is generated. The framework of discriminatory target learning is shown in Figure 3.
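A minimal sketch of the linear combiner and the final averaging step described above is given below; the joint probabilities and the regular-KDB posterior used in the example are hypothetical placeholders.

```python
def ensemble_posterior(joint_probs):
    """Linear combiner over the m micro models: each entry is P(y_i, x | bnc_i).
    Returns normalized class-membership estimates, one per class label."""
    total = sum(joint_probs)
    return [p / total for p in joint_probs]

def combine_with_regular_model(micro_posterior, regular_posterior):
    """Average the micro-ensemble estimate with the regular BNC estimate,
    as in the final ensemble of the discriminatory target learning framework."""
    return [(a + b) / 2.0 for a, b in zip(micro_posterior, regular_posterior)]

# Hypothetical joint probabilities P(y_i, x | bnc_i) for a 3-class problem,
# and a hypothetical posterior from a regular KDB trained on the labeled data.
micro = ensemble_posterior([2.1e-4, 0.7e-4, 1.2e-4])
regular = [0.55, 0.15, 0.30]
print(combine_with_regular_model(micro, regular))
```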
Because in practice it is hardly possible to find the true distribution $P(x|y)$ from data, KDB approximates $P(x|y)$ by allowing for the modeling of arbitrarily complex dependencies between attributes. The pseudocode of KDB is shown in Algorithm 1.
Algorithm 1 Structure learning of KDB.
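The pseudocode of Algorithm 1 appears as a figure in the original layout. The following Python sketch reconstructs the KDB structure-learning steps from the description in the text and in [9]; it is an illustrative reconstruction under the assumption that the mutual information and conditional mutual information scores have been precomputed, not the authors' implementation.

```python
def kdb_structure(mi, cmi, n, k):
    """Sketch of KDB structure learning (Algorithm 1): sort attributes by I(X_i; Y),
    then let each attribute pick its min(rank, k) most conditionally dependent
    predecessors as parents.

    mi[i]     : I(X_i; Y) for attribute i
    cmi[i][j] : I(X_i; X_j | Y) for attributes i and j
    Returns a dict mapping each attribute to its parent attributes (besides Y)."""
    order = sorted(range(n), key=lambda i: mi[i], reverse=True)
    parents = {}
    for pos, xi in enumerate(order):
        predecessors = order[:pos]
        # Choose the min(pos, k) predecessors with the largest I(X_i; X_j | Y).
        chosen = sorted(predecessors, key=lambda xj: cmi[xi][xj], reverse=True)[:k]
        parents[xi] = chosen
    return parents

# Hypothetical scores for a 4-attribute problem with k = 2.
mi = [0.30, 0.10, 0.25, 0.05]
cmi = [[0.00, 0.02, 0.08, 0.01],
       [0.02, 0.00, 0.05, 0.03],
       [0.08, 0.05, 0.00, 0.04],
       [0.01, 0.03, 0.04, 0.00]]
print(kdb_structure(mi, cmi, n=4, k=2))
```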
From the definition of $I(X_i;Y)$ in Equation (3), we can have
$$I(X_i;Y)=\sum_{y}\sum_{x_i}P(x_i,y)\log\frac{P(x_i,y)}{P(x_i)P(y)}=\sum_{y}\sum_{x_i}P(x_i,y)\log\frac{P(y|x_i)}{P(y)}$$
Definition 3.
For unlabeled data point $x=\{x_1,\ldots,x_n\}$, the dependence between $x_i$ ($1\le i\le n$) and any given label y is measured by pointwise y-mutual information, which is defined as follows,
$$I(x_i;y)=P(x_i,y)\log\frac{P(x_i,y)}{P(x_i)P(y)}=P(x_i,y)\log\frac{P(y|x_i)}{P(y)}. \qquad (8)$$
Equation (8) is a modified version of the pointwise mutual information that is applicable to a labeled data point [25]. The prior and joint probabilities in Equations (2) and (8) are estimated as follows
$$\hat{P}(y)=\frac{1}{N+1}\Big[\mathrm{Count}(Y=y)+\frac{1}{m}\Big]\qquad \hat{P}(x_j)=\frac{1}{N+1}\Big[\mathrm{Count}(X_j=x_j)+\frac{1}{m}\Big]$$
$$\hat{P}(x_j,y)=\frac{1}{N+1}\Big[\mathrm{Count}(X_j=x_j,Y=y)+\frac{1}{m}\Big]\qquad \hat{P}(x_i,x_j,y)=\frac{1}{N+1}\Big[\mathrm{Count}(X_i=x_i,X_j=x_j,Y=y)+\frac{1}{m}\Big] \qquad (9)$$
Conditional probabilities in Equations (2) and (8) can be estimated by:
$$\hat{P}(x_j|y)=\frac{\hat{P}(x_j,y)}{\hat{P}(y)}\qquad \hat{P}(x_i,x_j|y)=\frac{\hat{P}(x_i,x_j,y)}{\hat{P}(y)}\qquad \hat{P}(y|x_i)=\frac{\hat{P}(x_i,y)}{\hat{P}(x_i)}$$
Similar to the Laplace correction [27], the main idea behind Equation (9) is equivalent to creating a “pseudo” training set P by adding to the training data a new instance $\{x_1,\ldots,x_n\}$ with multiple labels, assuming that the probability that this new instance belongs to class y is $1/m$ for each $y\in\{y_1,\ldots,y_m\}$.
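For illustration, the smoothed estimates of Equation (9) can be computed from raw counts as in the following minimal sketch, which assumes discrete attributes; the toy data are hypothetical.

```python
from collections import Counter

def smoothed_estimates(data, i, j, m):
    """Smoothed probability estimates in the spirit of Equation (9): every count
    receives an extra 1/m, as if one pseudo-instance with a uniform class
    distribution were appended to the N training instances."""
    N = len(data)
    c_y = Counter(row[-1] for row in data)
    c_xj = Counter(row[j] for row in data)
    c_xj_y = Counter((row[j], row[-1]) for row in data)
    c_xi_xj_y = Counter((row[i], row[j], row[-1]) for row in data)

    p_y = {y: (c + 1.0 / m) / (N + 1) for y, c in c_y.items()}
    p_xj = {x: (c + 1.0 / m) / (N + 1) for x, c in c_xj.items()}
    p_xj_y = {key: (c + 1.0 / m) / (N + 1) for key, c in c_xj_y.items()}
    p_xi_xj_y = {key: (c + 1.0 / m) / (N + 1) for key, c in c_xi_xj_y.items()}
    return p_y, p_xj, p_xj_y, p_xi_xj_y

# Toy data of the form (x_0, x_1, y) with m = 2 class labels.
toy = [(0, 1, 'a'), (1, 1, 'a'), (1, 0, 'b'), (0, 0, 'b')]
p_y, p_xj, p_xj_y, p_xi_xj_y = smoothed_estimates(toy, 0, 1, m=2)
print(p_y, p_xj_y[(1, 'a')])
```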
Definition 4.
For unlabeled data point $x=\{x_1,\ldots,x_n\}$ with respect to label y, if $I(x_i;y)>0$ ($1\le i\le n$), then $X_i$ is y-dependent on point $x$; if $I(x_i;y)=0$, then $X_i$ is y-independent on point $x$; and if $I(x_i;y)<0$, then $X_i$ is y-irrelevant on point $x$.
KDB uses $I(X_i;Y)$ to sort the attributes and $I(X_i;X_j|Y)$ to measure the conditional dependence. Similarly, for an unlabeled instance $x=\{x_1,\ldots,x_n\}$, the corresponding micro KDB with respect to class label $y_t$, called kdbt, uses $I(x_i;y_t)$ (see Equation (8)) to sort the attribute values and $I(x_i;x_j|y_t)$ (see Equation (2)) to measure the conditional dependence. The learning procedure of kdbt is shown in Algorithm 2.
Algorithm 2 Structure learning of kdbt with respect to class label y t .
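As with Algorithm 1, the pseudocode of Algorithm 2 appears as a figure. The sketch below reconstructs the learning procedure of kdbt from the textual description: attribute values are ordered by pointwise $y_t$-mutual information, and parents are selected by pointwise $y_t$-conditional mutual information, keeping only positive scores in line with Definition 2. It is an illustrative reconstruction with hypothetical scores, not the authors' implementation.

```python
def kdbt_structure(pmi, pcmi, n, k):
    """Sketch of Algorithm 2: build the micro model kdbt for one unlabeled
    instance and one candidate label y_t.

    pmi[i]     : I(x_i; y_t) of Equation (8)
    pcmi[i][j] : I(x_i; x_j | y_t) of Equation (2)
    Returns a dict mapping each attribute value to its parent values (besides y_t)."""
    order = sorted(range(n), key=lambda i: pmi[i], reverse=True)
    parents = {}
    for pos, xi in enumerate(order):
        # Only predecessors with positive pointwise conditional mutual information
        # are reasonable parents (Definition 2); keep at most min(pos, k) of them.
        candidates = [xj for xj in order[:pos] if pcmi[xi][xj] > 0.0]
        chosen = sorted(candidates, key=lambda xj: pcmi[xi][xj], reverse=True)[:k]
        parents[xi] = chosen
    return parents

# Hypothetical pointwise scores for a 4-value instance with k = 2.
pmi = [0.04, -0.01, 0.06, 0.02]
pcmi = [[0.00, -0.01, 0.03, 0.01],
        [-0.01, 0.00, 0.02, -0.02],
        [0.03, 0.02, 0.00, 0.01],
        [0.01, -0.02, 0.01, 0.00]]
print(kdbt_structure(pmi, pcmi, n=4, k=2))
```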
Breiman [28] revealed that ensemble learning brings an improvement in accuracy only to “unstable” learning algorithms, in the sense that small variations in the training set would lead them to produce very different models. bnce is obviously an example of such a learner. For the individual members of kdbe, the difference in network structure is the result of a change of $I(x_i;y)$ or $I(x_i;x_j|y)$ ($1\le i,j\le n$), or, more precisely, of the conditional probabilities defined in Equations (2) and (8). Given an unlabeled instance $x=\{x_1,\ldots,x_n\}$ and binary class labels $y_1$ and $y_2$, if $I(x_i;y_1)>0$, i.e., $P(y_1|x_i)>P(y_1)$, then $X_i$ is $y_1$-dependent on $x$. Because $P(y_2)=1-P(y_1)$ and $P(y_2|x_i)=1-P(y_1|x_i)$, we have
$$P(y_1|x_i)>P(y_1)\;\Longrightarrow\;1-P(y_1|x_i)<1-P(y_1)\;\Longrightarrow\;P(y_2|x_i)<P(y_2)$$
and
$$I(x_i;y_2)=P(x_i,y_2)\log\frac{P(y_2|x_i)}{P(y_2)}<0$$
Thus, $X_i$ is $y_2$-irrelevant on $x$: $X_i$ plays totally different roles in its relationships with different class labels on the same instance. Supposing that $I(x_i;y_1)>0$ before a small variation in the training set and $I(x_i;y_1)<0$ after it, the attribute values will be re-sorted and, correspondingly, the network structures of kdb1 and kdb2 for $x$ will change greatly. This sensitivity to variation allows kdbe to finely describe the dependencies hidden in $x$. Figure 4 shows examples of kdb1 and kdb2 corresponding to class labels $y_1$ and $y_2$, respectively. If the decision of the final ensemble is $y_1$, then we use Figure 4a for dependence analysis; otherwise, we use Figure 4b instead. The attribute values annotated in black correspond to positive values of $I(x_i;y_t)$ ($t=1$ or 2) and should be focused on.
KDB requires training time complexity of $O(n^2Nmv^2)$ (dominated by the calculation of $I(X_i;X_j|Y)$) and classification time complexity of $O(n^2Nm)$ [9] for classifying a single unlabeled instance, where n is the number of attributes, N is the number of data instances, m is the number of class labels, and v is the maximum number of discrete values that an attribute may take. Discriminatory target learning requires no additional training time, thus the training time complexity of the final ensemble is the same as that of regular KDB. At classification time, it requires $O(n^2Nm)$ to calculate $I(x_i;x_j|y)$, and the same time complexity for classifying a single unlabeled instance.

4. Experiments and Results

We compared the performance of our proposed methods, kdbe and KDBe, with several state-of-the-art classifiers. We analyzed the performance in terms of zero-one loss, root mean square error (RMSE), bias and variance on 42 natural domains from the UCI Machine Learning Repository [29]. These datasets are described in Table 1, in ascending order of number of instances. The structure of this section is as follows: we discuss our experimental methodology and evaluation functions in detail in Section 4.1. Section 4.2 includes comparisons with three classic single-structure BNCs, namely NB, TAN and KDB, as well as one ensemble BNC, AODE. Then, in Section 4.3, KDBe is compared with Random forest with 100 decision trees. Section 4.4 presents a global comparison of all learners considered by applying the Friedman and Nemenyi tests.

4.1. Experimental Methodology and Evaluation Function

The experiments for all BNCs used C++ software (NetBeans 8.0.2) specially designed to deal with classification problems. Each algorithm was tested on each dataset using 10-fold cross validation. All experiments were conducted on a desktop computer (Dell Vostro 2667, Changchun, China) with an Intel(R) Core(TM) i3-6100 CPU @ 3.70 GHz, 64 bits and 4096 MB of memory.
  • Win/Draw/Lose (W/D/L) Record: When two algorithms were compared, we counted the number of datasets for which one algorithm performed better, equally well or worse than the other on a given measure. We considered there to be a significant difference if the outcome of a one-tailed binomial sign test was less than 0.05.
  • Missing Values: Missing values for qualitative attributes were replaced with modes, and those for quantitative attributes were replaced with means from the training data.
  • Numeric Attributes: For each dataset, we used MDL (Minimum Description Length) discretization [30] to discretize numeric attributes.
  • Dataset Sizes: Datasets were categorized in terms of their sizes. That is, datasets with instances <1000, ≥1000 and <10,000, ≥10,000 were denoted as small size, medium size and large size, respectively. We report results on these sets to discuss suitability of a classifier for datasets of different sizes.
  • Zero-one loss: Zero-one loss can be used to measure the extent to which a learner correctly identifies the class label of an unlabeled instance. Supposing y and $\hat{y}$ are the true class label and the one generated by a learning algorithm, respectively, given M unlabeled test instances, the zero-one loss function is defined as
    $$\xi(y,\hat{y})=\sum_{i=1}^{M}\frac{1-\varrho(y_i,\hat{y}_i)}{M},$$
    where $\varrho(y_i,\hat{y}_i)=1$ if $y_i=\hat{y}_i$ and 0 otherwise.
  • Bias and variance: The bias-variance decomposition proposed by Kohavi and Wolpert [31] provides valuable insights into the components of the zero-one loss of learned classifiers. Bias measures how closely the classifier can describe the decision boundary, and is defined as
    $$bias=\frac{1}{2}\sum_{\hat{y},y\in Y}\big[P(\hat{y}|x)-P(y|x)\big]^{2},$$
    where $x$ is any combination of attribute values. Variance measures the sensitivity of the classifier to variations in the training data, and is defined as
    $$variance=\frac{1}{2}\Big[1-\sum_{\hat{y}\in Y}P(\hat{y}|x)^{2}\Big].$$
  • RMSE: For each instance, RMSE accumulates the squared error, where the error is the difference between 1.0 and the probability estimated by the classifier for the true class of the instance, and then takes the square root of the mean of the sum, which is defined as
    $$RMSE=\sqrt{\frac{1}{s}\sum_{i=1}^{s}\big(1-P(\hat{y}|x)\big)^{2}},$$
    where s is the number of training instances. A minimal sketch of the zero-one loss and RMSE computations is given after this list.
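The sketch below is an illustrative implementation of the zero-one loss and RMSE definitions above; the predictions and probability estimates in the example are hypothetical and not taken from the experiments.

```python
from math import sqrt

def zero_one_loss(true_labels, predicted_labels):
    """Fraction of instances whose predicted label differs from the true one."""
    M = len(true_labels)
    return sum(1 for y, y_hat in zip(true_labels, predicted_labels) if y != y_hat) / M

def rmse(true_labels, probability_estimates):
    """Square root of the mean squared difference between 1.0 and the probability
    assigned to the true class of each instance."""
    s = len(true_labels)
    sq = sum((1.0 - probs[y]) ** 2 for y, probs in zip(true_labels, probability_estimates))
    return sqrt(sq / s)

# Hypothetical predictions for three instances and two class labels.
truth = ['a', 'b', 'a']
preds = ['a', 'a', 'a']
probs = [{'a': 0.9, 'b': 0.1}, {'a': 0.6, 'b': 0.4}, {'a': 0.7, 'b': 0.3}]
print(zero_one_loss(truth, preds), rmse(truth, probs))
```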

4.2. KDBe Versus Classic BNCs

We compared KDBe with several classic BNCs, namely NB, TAN, KDB and AODE. Sahami [9] proposed the notion of the k-dependence BNC, which allows each attribute $X_i$ to have a maximum of k attributes as parents. NB and TAN are, respectively, 0-dependence and 1-dependence BNCs. To clarify the effect of dependence complexity, we set $k=2$ for both KDB and KDBe.

4.2.1. Zero-One Loss and RMSE Results

The detailed results in terms of zero-one loss and RMSE are shown in Table A1 and Table A2 in Appendix A, respectively. Table 2 and Table 3 show W/D/L records summarizing the relative zero-one loss and RMSE of the different BNCs. When $k=2$, NB, TAN and KDB can, respectively, represent 0, $n-1$ and $2n-3$ conditional dependencies, where n is the number of predictive attributes. As shown in Table 1, since $n>3$ holds for all datasets, $2n-3>n-1$ also holds. Thus, KDB can represent the largest number of dependencies among them. With respect to zero-one loss, NB represents no conditional dependencies due to its independence assumption and performed the worst in general. As the dependence degree or structure complexity increased, KDB was competitive compared to NB and TAN. AODE performed better than the other single-structure BNCs due to its ensemble mechanism. Surprisingly, kdbe had significantly better zero-one loss performance than NB, TAN and KDB. When discriminatory target learning was introduced for the discovery of dependencies that exist in different unlabeled instances, the final ensemble KDBe possessed a significant advantage over the other classifiers. For example, KDBe beat KDB in 26 domains and lost in only three in terms of zero-one loss. RMSE-wise, KDBe still performed the best. For instance, KDBe enjoyed a significant advantage over TAN (20/19/3). When compared to KDB, KDBe also achieved superior performance, with 17 wins and 5 losses.
To make the experimental results more intuitive from the viewpoints of the ensemble mechanism and structure complexity, Figure 5a,c shows the comparisons of KDBe, KDB and AODE in terms of zero-one loss, whereas Figure 5b,d shows the comparisons for RMSE. The red squared symbols indicate significant advantages of KDBe over the other BNCs. In Figure 5a,b, only two points are far above the diagonal line, thus the negative effect caused by discriminatory target learning was negligible. In contrast, many more points are below the diagonal line, which means that discriminatory target learning worked effectively in most cases. A notable case is the Waveform dataset, where discriminatory target learning helped to substantially reduce classification error, with a reduction from 0.0256 to 0.0193 in zero-one loss and from 0.1145 to 0.0901 in RMSE. When comparing KDBe with AODE, it can be seen in Figure 5c,d that there are still many points below the diagonal line, which means that KDBe enjoyed a significant advantage over AODE. For example, a notable case is our largest dataset, Localization, where the zero-one loss of KDBe (0.2743) was much lower than that of AODE (0.3596).

4.2.2. Bias and Variance Results

The detailed results in terms of bias and variance are shown in Table A3 and Table A4 in Appendix A, respectively. The W/D/L records with respect to bias and variance are shown in Table 4 and Table 5, respectively. We can observe in Table 4 that the ensemble classifiers, i.e., AODE and kdbe, performed better than TAN but worse than KDB, although these results were not always statistically significant. NB still performed the worst. A high-dependence structure or an ensemble construction strategy could help reduce the bias, and jointly applying both helped KDBe reduce bias significantly. For example, KDBe performed better than TAN (26/9/7) and KDB (11/27/4).
In terms of variance, since the network structures of NB and AODE are fixed and irrelevant to variations of the training data, the independence assumption helped reduce the variance significantly. KDB was the most sensitive to variations in the training data among all classifiers. As discussed in Section 3, discriminatory target learning makes kdbe underfit the training data and overfit the unlabeled instance. When kdbe was integrated with regular KDB, discriminatory target learning helped to reduce the variance, and the final ensemble classifier, i.e., KDBe, performed best behind only NB and AODE.

4.2.3. Time Comparison

We compared KDBe with the other classic BNCs in terms of training and classification time. Since kdbe is a part of KDBe, we excluded it from this experiment. Figure 6a,b shows the training and classification time comparisons for all BNCs. Each bar represents the sum of time over the 42 datasets in a 10-fold cross-validation experiment. No parallelization techniques were used in any case. As discussed in Section 3, discriminatory target learning requires no additional training time, thus the training time complexity of KDBe was the same as that of regular KDB. Due to the structure complexity, KDBe and KDB required a bit more time for training than the other BNCs. With respect to classification time, KDBe took a little more time than the other BNCs. The reason is that KDBe learns kdbe for each unlabeled test instance, while the other BNCs only need to calculate the joint probabilities directly. In general, discriminatory target learning helped to significantly improve the classification performance of its base classifier at the cost of a small increase in time consumption, which is perfectly acceptable.

4.3. KDBe Versus Random Forest

To further illustrate the performance of our proposed discriminatory target learning framework, we compared KDBe with a powerful learner, Random forest. Random forest (RF) is a combination of decision tree predictors, where each tree is trained on data selected at random but with replacement from the original data [10]. As the number of trees in the forest becomes large, the classification error of the forest tends to converge to a limit. RF is an effective tool for prediction. RF can process high-dimensional data (that is, data with many features) without performing feature selection. Furthermore, due to its random mechanism, RF has the capacity to deal with imbalanced datasets or data with numerous missing values. Moreover, the framework in terms of the strength of the individual predictors and their correlations gives insight into the ability of RF to predict [10]. Because of its high classification accuracy, RF has been applied to many scientific fields, e.g., ecology and agriculture [32]. In our experiment, RF with 100 decision trees was used. The detailed results of RF in terms of zero-one loss, RMSE, bias and variance can be found in Table A1, Table A2, Table A3 and Table A4 in Appendix A, respectively. Table 6 shows the W/D/L records for different dataset sizes. When zero-one loss was compared, KDBe won more frequently than RF, especially on small and medium datasets: the results were 10/4/3 on small datasets and 7/4/4 on medium datasets. The reason may lie in that 100 decision trees are complex and tend to overfit the training data. RMSE-wise, KDBe also performed better than RF, with 16 wins and 11 losses. The bias and variance comparison of KDBe and RF (Table 6) suggested that KDBe is a low-variance and high-bias classifier, so one can expect it to work extremely well on small and medium datasets. This is evident in the zero-one loss and RMSE comparisons in Table 6. KDBe beat RF on 26 datasets and lost on 12 datasets with respect to variance. Thus, the advantages of KDBe over RF in terms of zero-one loss and RMSE can be attributed to the change in variance. Since the variance term increases as an algorithm becomes more sensitive to changes in the labeled training data, discriminatory target learning obviously helped to alleviate the negative effect caused by overfitting.
Besides, we display the time comparisons between KDBe and RF in Figure 7. It is obvious that KDBe enjoyed a great advantage over RF in terms of training time on datasets of all sizes. This advantage can be attributed to the fact that KDBe only learns a regular KDB for each dataset during the training phase, while RF needs to train 100 decision trees. When comparing classification time, the performance of KDBe and RF showed a slight reversal: learning kdbe for each unlabeled test instance made KDBe take a bit more time than RF. However, on small and medium datasets, the advantage of RF over KDBe was not significant. To conclude, on small and medium datasets, KDBe had significantly better zero-one loss and better RMSE than RF. This was packaged with KDBe's far superior training times and competitive classification times, which makes KDBe an excellent alternative to RF, especially for dealing with small and medium datasets.

4.3.1. Discussion

RF has been applied to several scientific fields and associated research areas [32] because of its high classification accuracy. However, RF is more negatively affected by dataset size in terms of computational consumption (memory and time) than BNCs [19]. Furthermore, due to its random mechanism, RF is sometimes criticized for the difficulty of giving a clear semantic explanation of the combined result output by numerous decision trees. In contrast, our proposed discriminatory target learning framework considers not only the dependence relationships that exist in the training data, but also those hidden in unlabeled test instances, which makes the final model highly interpretable. KDBe outperformed RF in terms of zero-one loss, RMSE and variance, especially on small and medium size datasets, while RF beat KDBe in terms of bias. Moreover, RF required substantially more time for training and KDBe took a bit more time for classifying.
To illustrate the better interpretability of KDBe compared to RF, we took a medical diagnostic application as an example. The Heart-disease-c dataset (http://archive.ics.uci.edu/ml/datasets/Heart+Disease) from the UCI Machine Learning Repository was collected from the Cleveland Clinic Foundation and contains 13 attributes and two class labels. The detailed description of this dataset is shown in Table 7. The zero-one loss results of KDB, RF and KDBe are 0.2244, 0.2212 and 0.2079, respectively. The KDB learned from training data can describe the general conditional dependencies, while for a certain instance only some of these dependence relationships may hold rather than all the dependencies shown in KDB. In contrast, kdbe can encode the most probable local conditional dependencies hidden in one single test instance. We argue that an ideal situation is that KDB and kdbe are complementary to each other for classification and may focus on different key points. To demonstrate this, we randomly took an instance from the Heart-disease-c dataset as an example; the instance is $T=\{x_0=57, x_1=1, x_2=3, x_3=150, x_4=168, x_5=0, x_6=0, x_7=174, x_8=0, x_9=1.6, x_{10}=3, x_{11}=0, x_{12}=3\}$. Figure 8 and Figure 9 show the structural difference between KDB and the submodels of kdbe. For KDB, by comparing the mutual information $I(X;Y)$, $\{X_6, X_1, X_{12}\}$ are the first three key attributes for this dataset. There are 23 arcs in the structure of KDB, which represent the conditional dependencies between predictive attributes. However, the values of $I(X_8;X_1|Y)$, $I(X_8;X_6|Y)$, $I(X_9;X_1|Y)$ and $I(X_9;X_6|Y)$ are all 0. For the instance $T$, in Figure 9, we can easily see that the structure of kdbe differed greatly from that of KDB. The true class label for $T$ is $y_1$. KDB misclassified $T$, while KDBe correctly classified the instance. Thus, we can use Figure 9a for dependence analysis. By comparing the pointwise $y_1$-mutual information, $\{x_{12}, x_{11}, x_7\}$ are the first three key attribute values for $T$. It is worth mentioning that $X_1$ ranked second in KDB, whereas $x_1$ ranked last in kdby1. Furthermore, there were only 15 arcs in kdby1, which means that some redundant dependencies were eliminated. In general, KDBe can utilize the knowledge learned from the training data and from unlabeled test instances by building different models, which is obviously suitable for precision medical diagnosis.

4.3.2. Imbalanced Datasets

There are 15 imbalanced datasets in our experiments, which are annotated with the symbol “*” in Table 1. To demonstrate that KDBe has the capacity to deal with imbalanced datasets, we conducted a set of experiments to compare the performance of KDBe with RF in terms of the extended Matthews correlation coefficient (MCC). The MCC provides a balanced measure for skewed datasets by taking into account the class distribution [33]. The classification results can be shown in the form of a confusion matrix as follows:
$$\begin{pmatrix} N_{11} & \cdots & N_{1m} \\ \vdots & \ddots & \vdots \\ N_{m1} & \cdots & N_{mm} \end{pmatrix}$$
Each diagonal entry $N_{ii}$ of the matrix gives the number of instances whose true class was $Y_i$ and that were assigned to $Y_i$, where $1\le i\le m$. Each off-diagonal entry $N_{ij}$ gives the number of instances whose true class was $Y_i$ but that were assigned to $Y_j$, where $i\ne j$ and $1\le i,j\le m$. Given the confusion matrix, the extended MCC can be calculated as follows,
$$MCC=\frac{\sum_{i}\sum_{j}\sum_{l}\big(N_{ii}N_{jl}-N_{ij}N_{li}\big)}{\sqrt{\sum_{i}\Big(\sum_{j}N_{ij}\Big)\Big(\sum_{i'\ne i}\sum_{j'}N_{i'j'}\Big)}\;\sqrt{\sum_{i}\Big(\sum_{j}N_{ji}\Big)\Big(\sum_{i'\ne i}\sum_{j'}N_{j'i'}\Big)}}$$
Note that the MCC reaches its best value at 1, which represents a perfect prediction, and its worst value at −1, which indicates total disagreement between the predicted and observed classifications. Figure 10 shows the scatter plot of KDBe and RF in terms of MCC. We can see that many points fall close to the diagonal line, which means that KDBe achieved competitive results compared with RF. Furthermore, there are three points far above the diagonal line, which means KDBe enjoyed significant advantages on these datasets. A notable case is the Dis dataset, annotated in red, where the MCC of KDBe (0.4714) was much higher than that of RF (0.3710). In general, KDBe has the capacity to handle imbalanced datasets.
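For reference, the extended MCC can be computed from the confusion matrix with a few lines of code. The sketch below uses an algebraically equivalent simplification of the formula above (the triple sum collapses into row, column and trace totals); the confusion matrix in the example is hypothetical.

```python
from math import sqrt

def extended_mcc(conf):
    """Multiclass Matthews correlation coefficient from an m x m confusion matrix N."""
    m = len(conf)
    s = sum(sum(row) for row in conf)            # total number of instances
    trace = sum(conf[i][i] for i in range(m))    # correctly classified instances
    row = [sum(conf[i]) for i in range(m)]       # instances whose true class is i
    col = [sum(conf[i][j] for i in range(m)) for j in range(m)]  # instances predicted as j

    # Numerator: trace*s - sum_i row_i*col_i, algebraically equal to the triple sum above.
    numerator = trace * s - sum(row[i] * col[i] for i in range(m))
    den_true = sqrt(sum(row[i] * (s - row[i]) for i in range(m)))
    den_pred = sqrt(sum(col[i] * (s - col[i]) for i in range(m)))
    if den_true == 0 or den_pred == 0:
        return 0.0
    return numerator / (den_true * den_pred)

# Hypothetical 3-class confusion matrix.
N = [[50, 3, 2],
     [4, 40, 6],
     [1, 5, 30]]
print(extended_mcc(N))
```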

4.4. Global Comparison of All Classifiers

In this section, to assess whether the overall differences in performance of these learners were statistically significant, we employed the Friedman test [34] and the post-hoc Nemenyi test, as recommended by Demšar [35]. The Friedman test is a non-parametric test for multiple hypothesis testing. It ranks the algorithms for each dataset separately: the best performing algorithm gets rank 1, the second best rank 2, and so on. In case of ties, average ranks are assigned. The null hypothesis is that all of the algorithms perform equivalently and there is no significant difference in terms of average ranks. The Friedman statistic can be computed as follows:
$$\chi_F^2=\frac{12}{Nt(t+1)}\sum_{j=1}^{t}R_j^2-3N(t+1),$$
where $R_j=\sum_i r_i^j$ and $r_i^j$ is the rank of the j-th of t algorithms on the i-th of N datasets. The Friedman statistic is distributed according to the $\chi^2$ distribution with $t-1$ degrees of freedom. Thus, for any pre-determined level of significance α, the null hypothesis will be rejected if $\chi_F^2>\chi_\alpha^2$. The critical value of $\chi_\alpha^2$ for $\alpha=0.05$ with six degrees of freedom is 12.592. The Friedman statistics for zero-one loss and RMSE were 53.65 and 60.49, both larger than 12.592. Hence, the null hypothesis was rejected. According to the detailed ranks shown in Table A5 and Table A6 in Appendix A, Figure 11 plots the average ranks across all datasets, along with the standard deviation for each learner. When assessing the calibration of the probability estimates using zero-one loss, KDBe obtained the lowest average rank of 2.5952, followed by kdbe with 3.5595 and RF with 3.7024 (very close to that of AODE). When assessing performance using RMSE, KDBe still performed the best, followed by RF with 3.4285 and AODE with 3.7500. We found NB at the other extreme on both measures, with average ranks of 5.8690 and 5.9523 out of a total of seven learners.
Since we rejected the null hypothesis, the Nemenyi test was used to further analyze which pairs of algorithms were significantly different in terms of the average ranks of the Friedman test. The performance of two classifiers is significantly different if their corresponding average ranks differ by at least the critical difference ($CD$):
$$CD=q_\alpha\sqrt{\frac{t(t+1)}{6N}}, \qquad (16)$$
where the critical value q α for α = 0.05 and t = 7 is 2.949. Given seven algorithms and 42 datasets, we used Equation (16) to calculate C D and the result is 1.3902. The learners in Figure 12 are plotted on the red line on the basis of their average ranks, corresponding to the nodes on the top black line. If two algorithms had no significant difference, they were connected by a line. As shown in Figure 12a, we easily found that KDBe had a significantly lower average zero-one loss rank than NB, TAN and KDB. KDBe also achieved lower average zero-one loss rank than kdbe, RF and AODE, but not significantly so. When RMSE was considered, KDBe still performed the best and the rank of KDBe was significantly lower than that of KDB, providing solid evidence for the effectiveness of our proposed discriminatory target learning framework.
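The Friedman statistic and the Nemenyi critical difference can be reproduced from the rank sums in Table A5, as in the sketch below; it is an illustrative calculation (small discrepancies with the reported statistic may arise from tie handling), not the statistical software used in the experiments.

```python
from math import sqrt

def friedman_statistic(rank_sums, N, t):
    """Friedman chi-square statistic from the per-algorithm rank sums R_j
    (N datasets, t algorithms), following the formula above."""
    return 12.0 / (N * t * (t + 1)) * sum(r * r for r in rank_sums) - 3.0 * N * (t + 1)

def nemenyi_cd(q_alpha, N, t):
    """Critical difference of the Nemenyi test, Equation (16)."""
    return q_alpha * sqrt(t * (t + 1) / (6.0 * N))

# Rank sums for zero-one loss from Table A5 (NB, TAN, KDB, AODE, RF, kdbe, KDBe).
R = [246.5, 179.5, 175.5, 160.5, 155.5, 149.5, 109.0]
print(friedman_statistic(R, N=42, t=7))      # close to the 53.65 reported in the text
print(nemenyi_cd(q_alpha=2.949, N=42, t=7))  # approximately 1.3902
```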

5. Conclusions

A lack of explanatory insight into the relative influence of the random variables greatly restricts the application domain of machine learning techniques. By redefining mutual information and conditional mutual information, the framework of discriminatory target learning can help fully and discriminately describe the dependency relationships in unlabeled instances and labeled training data. The kdbe learned from an unlabeled instance and the regular KDB learned from training data are different but complementary in nature, which helps further improve classification performance. Discriminatory target learning can be expected to work for different types of BNCs with different dependency complexities. Exploration of the application of discriminatory target learning to other kinds of machine learning techniques, e.g., decision trees or support vector machines, is an area for future work.

Author Contributions

Formal analysis, M.M.; Funding acquisition, M.-H.S.; Methodology, L.-M.W.; Resources, L.-M.W.; Validation, H.L.; Writing—original draft, Z.-Y.D.; Writing—review and editing, Z.-Y.D.

Acknowledgments

This work was supported by the National Science Foundation of China, Grant No. 61272209 and No. 61872164.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Tables of the Experimental Section

The best results in each row of each table are annotated with bold font.
Table A1. Experimental results of average zero-one loss.
Dataset NB TAN KDB AODE RF kdbe KDBe
Contact-lenses 0.3750 0.3750 0.2500 0.3750 0.3438 0.3750 0.2917
Labor 0.0351 0.0526 0.0351 0.0526 0.0939 0.0351 0.0351
Echocardiogram 0.3359 0.3282 0.3435 0.3206 0.3489 0.3130 0.3130
Lymphography 0.1486 0.1757 0.2365 0.1689 0.2132 0.1757 0.2095
Sonar 0.2308 0.2212 0.2452 0.2260 0.2067 0.2452 0.2308
Glass-id 0.2617 0.2196 0.2196 0.2523 0.2132 0.2243 0.2150
New-thyroid 0.0512 0.0651 0.0698 0.0465 0.0816 0.0605 0.0605
Heart-disease-c 0.1815 0.2079 0.2244 0.2013 0.2212 0.1947 0.2079
Soybean-large 0.1238 0.1107 0.0879 0.0782 0.1107 0.1270 0.0814
Ionosphere 0.1054 0.0684 0.0741 0.0741 0.0766 0.0912 0.0655
Dermatology 0.0191 0.0328 0.0656 0.0164 0.0367 0.0546 0.0519
House-votes-84 0.0943 0.0552 0.0506 0.0529 0.0416 0.0575 0.0437
Chess 0.1125 0.0926 0.0998 0.0998 0.1074 0.0926 0.0926
Soybean 0.0893 0.0469 0.0556 0.0469 0.0703 0.0542 0.0527
Breast-cancer-w 0.0258 0.0415 0.0744 0.0358 0.0386 0.0401 0.0629
Tic-tac-toe 0.3069 0.2286 0.2035 0.2651 0.2115 0.1931 0.2004
Vowel 0.4242 0.1303 0.1818 0.1495 0.1674 0.1788 0.1626
Car 0.1400 0.0567 0.0382 0.0816 0.0772 0.0596 0.0411
Mfeat-mor 0.3140 0.2970 0.3060 0.3145 0.3000 0.3015 0.3035
Segment 0.0788 0.0390 0.0472 0.0342 0.0413 0.0355 0.0433
Hypothyroid 0.0149 0.0104 0.0107 0.0136 0.0122 0.0092 0.0095
Kr-vs-kp 0.1214 0.0776 0.0416 0.0842 0.0128 0.0460 0.0382
Dis 0.0159 0.0159 0.0138 0.0130 0.0133 0.0127 0.0122
Hypo 0.0138 0.0141 0.0114 0.0095 0.0122 0.0098 0.0098
Sick 0.0308 0.0257 0.0223 0.0273 0.0263 0.0270 0.0233
Abalone 0.4762 0.4587 0.4563 0.4472 0.4823 0.4534 0.4484
Waveform-5000 0.2006 0.1844 0.2000 0.1462 0.1558 0.1782 0.1756
Phoneme 0.2615 0.2733 0.1984 0.2392 0.1789 0.3139 0.1931
Wall-following 0.1054 0.0554 0.0401 0.0370 0.0216 0.0398 0.0387
Page-blocks 0.0619 0.0415 0.0391 0.0338 0.0309 0.0323 0.0322
Satellite 0.1806 0.1214 0.1080 0.1148 0.1085 0.1265 0.1052
Thyroid 0.1111 0.0720 0.0706 0.0701 0.0750 0.0586 0.0642
Pendigits 0.1181 0.0321 0.0294 0.0200 0.0339 0.0202 0.0248
Sign 0.3586 0.2755 0.2539 0.2821 0.2038 0.2685 0.2419
Nursery 0.0973 0.0654 0.0289 0.0730 0.0248 0.0509 0.0356
Magic 0.2239 0.1675 0.1637 0.1752 0.1674 0.1716 0.1598
Letter-recog 0.2525 0.1300 0.0986 0.0883 0.0902 0.0675 0.0861
Adult 0.1592 0.1380 0.1383 0.1493 0.1204 0.1315 0.1316
Shuttle 0.0039 0.0015 0.0009 0.0008 0.0005 0.0006 0.0007
Connect-4 0.2783 0.2354 0.2283 0.2420 0.1875 0.2337 0.2268
Waveform 0.0220 0.0202 0.0256 0.0180 0.1558 0.0194 0.0193
Localization 0.4955 0.3575 0.2964 0.3596 0.2976 0.2659 0.2743
Table A2. Experimental results of average RMSE.
Dataset NB TAN KDB AODE RF kdbe KDBe
Contact-lenses 0.3778 0.4496 0.3639 0.4066 0.4098 0.4086 0.3825
Labor 0.1420 0.2185 0.1685 0.1900 0.2824 0.3647 0.2271
Echocardiogram 0.4896 0.4886 0.4889 0.4903 0.4574 0.4782 0.4813
Lymphography 0.2446 0.2684 0.3031 0.2478 0.2701 0.2729 0.2680
Sonar 0.4421 0.4131 0.4084 0.4285 0.3518 0.4071 0.3959
Glass-id 0.3540 0.3332 0.3395 0.3439 0.3146 0.3311 0.3275
New-thyroid 0.1544 0.1731 0.1797 0.1614 0.1560 0.1689 0.1714
Heart-disease-c 0.3743 0.3775 0.3963 0.3659 0.3696 0.3572 0.3690
Soybean-large 0.1032 0.0963 0.0858 0.0858 0.1143 0.1051 0.0856
Ionosphere 0.3157 0.2615 0.2714 0.2506 0.2403 0.2822 0.2523
Dermatology 0.0631 0.0851 0.1206 0.0692 0.1303 0.1857 0.1313
House-votes-84 0.2997 0.2181 0.1969 0.1994 0.1846 0.1962 0.1847
Chess 0.2944 0.2594 0.2615 0.2725 0.2771 0.2937 0.2642
Soybean 0.0933 0.0642 0.0654 0.0656 0.0922 0.0754 0.0643
Breast-cancer-w 0.1570 0.1928 0.2497 0.1848 0.1796 0.2194 0.2137
Tic-tac-toe 0.4309 0.4023 0.3772 0.3995 0.2916 0.3830 0.3693
Vowel 0.2270 0.1271 0.1582 0.1425 0.1581 0.1685 0.1516
Car 0.2252 0.1617 0.1379 0.2005 0.1782 0.1749 0.1505
Mfeat-mor 0.2086 0.1940 0.1974 0.1985 0.2074 0.1948 0.1954
Segment 0.1398 0.0967 0.1034 0.0879 0.1061 0.0957 0.0919
Hypothyroid 0.1138 0.0955 0.0937 0.1036 0.0770 0.0979 0.0913
Kr-vs-kp 0.3022 0.2358 0.1869 0.2638 0.1268 0.2626 0.2091
Dis 0.1177 0.1103 0.1024 0.1080 0.1011 0.1074 0.1021
Hypo 0.0766 0.0738 0.0671 0.0650 0.0715 0.0719 0.0635
Sick 0.1700 0.1434 0.1382 0.1572 0.1487 0.1489 0.1394
Abalone 0.4630 0.4250 0.4277 0.4193 0.4539 0.4220 0.4220
Waveform-5000 0.3348 0.2947 0.3149 0.2659 0.3036 0.2950 0.2869
Phoneme 0.0880 0.0902 0.0784 0.0885 0.0731 0.0952 0.0783
Wall-following 0.2177 0.1586 0.1363 0.1292 0.1206 0.1315 0.1210
Page-blocks 0.1450 0.1187 0.1128 0.1021 0.0974 0.0972 0.0991
Satellite 0.2400 0.1851 0.1777 0.1800 0.1682 0.1865 0.1644
Thyroid 0.0967 0.0746 0.0744 0.0745 0.0770 0.0674 0.0679
Pendigits 0.1427 0.0725 0.0687 0.0568 0.0979 0.0793 0.0646
Sign 0.3984 0.3505 0.3334 0.3524 0.3104 0.3468 0.3300
Nursery 0.1766 0.1385 0.1121 0.1571 0.1010 0.1372 0.1217
Magic 0.3974 0.3461 0.3470 0.3541 0.3571 0.3514 0.3411
Letter-recog 0.1184 0.0860 0.0768 0.0707 0.0896 0.0756 0.0685
Adult 0.3409 0.3076 0.3089 0.3245 0.3274 0.3021 0.3015
Shuttle 0.0298 0.0182 0.0146 0.0126 0.0142 0.0121 0.0125
Connect-4 0.3587 0.3315 0.3247 0.3370 0.3057 0.3409 0.3279
Waveform 0.1176 0.0951 0.1145 0.0860 0.0799 0.0999 0.0901
Localization 0.2390 0.2095 0.1960 0.2095 0.1939 0.1834 0.1846
Table A3. Experimental results of average bias.
Dataset NB TAN KDB AODE RF kdbe KDBe
Contact-lenses 0.2163 0.1825 0.3175 0.2850 0.1748 0.1863 0.2850
Labor 0.0289 0.0211 0.0279 0.0347 0.0409 0.0184 0.0279
Echocardiogram 0.2844 0.2642 0.3065 0.2751 0.2256 0.2602 0.2686
Lymphography 0.0902 0.1027 0.1041 0.0933 0.1288 0.0951 0.0996
Sonar 0.1672 0.1646 0.1686 0.1696 0.1045 0.1829 0.1762
Glass-id 0.2901 0.2756 0.2713 0.2785 0.1348 0.2730 0.2732
New-thyroid 0.0290 0.0277 0.0348 0.0277 0.0285 0.0279 0.0396
Heart-disease-c 0.1297 0.1263 0.1299 0.1138 0.1304 0.1128 0.1274
Soybean-large 0.1070 0.1422 0.1086 0.0648 0.1213 0.1717 0.1112
Ionosphere 0.1220 0.0804 0.0855 0.0744 0.0624 0.0912 0.0862
Dermatology 0.0079 0.0274 0.0489 0.0055 0.0190 0.0541 0.0451
House-votes-84 0.0899 0.0410 0.0258 0.0430 0.0327 0.0457 0.0301
Chess 0.1413 0.1437 0.1119 0.1290 0.0548 0.1265 0.1192
Soybean 0.1015 0.0522 0.0491 0.0524 0.0586 0.0971 0.0502
Breast-cancer-w 0.0187 0.0384 0.0449 0.0338 0.0301 0.0221 0.0348
Tic-tac-toe 0.2614 0.1746 0.1367 0.2005 0.0270 0.1434 0.1390
Vowel 0.3301 0.1942 0.1745 0.1895 0.0756 0.1845 0.1736
Car 0.0937 0.0478 0.0387 0.0556 0.0389 0.0389 0.0374
Mfeat-mor 0.2624 0.2077 0.2142 0.2477 0.2311 0.2223 0.2166
Segment 0.0785 0.0491 0.0453 0.0367 0.0253 0.0387 0.0419
Hypothyroid 0.0116 0.0104 0.0096 0.0094 0.0516 0.0090 0.0094
Kr-vs-kp 0.1107 0.0702 0.0417 0.0747 0.0063 0.0434 0.0407
Dis 0.0165 0.0193 0.0191 0.0170 0.0203 0.0192 0.0191
Hypo 0.0092 0.0124 0.0077 0.0071 0.0083 0.0098 0.0073
Sick 0.0246 0.0207 0.0198 0.0224 0.0194 0.0254 0.0196
Abalone 0.4180 0.3126 0.3033 0.3201 0.3257 0.3195 0.3132
Waveform-5000 0.1762 0.1232 0.1157 0.1235 0.1114 0.1219 0.1147
Phoneme 0.2216 0.2394 0.1572 0.2207 0.1102 0.2927 0.1551
Wall-following 0.0951 0.0491 0.0257 0.0251 0.0122 0.0296 0.0245
Page-blocks 0.0451 0.0308 0.0280 0.0251 0.0217 0.0277 0.0264
Satellite 0.1746 0.0950 0.0808 0.0902 0.0874 0.1011 0.0802
Thyroid 0.0994 0.0587 0.0553 0.0611 0.0516 0.0493 0.0531
Pendigits 0.1095 0.0314 0.0207 0.0228 0.0216 0.0196 0.0189
Sign 0.3257 0.2420 0.2161 0.2531 0.1540 0.2322 0.2132
Nursery 0.0928 0.0521 0.0281 0.0651 0.0086 0.0400 0.0322
Magic 0.2111 0.1252 0.1241 0.1600 0.1244 0.1323 0.1265
Letter-recog 0.2207 0.1032 0.0806 0.0876 0.0490 0.0700 0.0732
Adult 0.1649 0.1312 0.1220 0.1437 0.1109 0.1240 0.1226
Shuttle 0.0040 0.0008 0.0007 0.0006 0.0006 0.0006 0.0006
Connect-4 0.2660 0.2253 0.2022 0.2264 0.1427 0.2169 0.2075
Waveform 0.0219 0.0152 0.0210 0.0156 0.0158 0.0172 0.0161
Localization 0.4523 0.3106 0.2134 0.3129 0.2047 0.2027 0.2038
Table A4. Experimental results of average variance.
Dataset NB TAN KDB AODE RF kdbe KDBe
Contact-lenses 0.1713 0.1925 0.1700 0.1275 0.2013 0.2138 0.1775
Labor 0.0395 0.0632 0.0721 0.0179 0.0758 0.0605 0.0721
Echocardiogram 0.1272 0.1265 0.1400 0.1319 0.1469 0.1374 0.1337
Lymphography 0.0343 0.1116 0.1408 0.0476 0.1352 0.0927 0.1249
Sonar 0.0907 0.1165 0.1199 0.0942 0.1189 0.0983 0.1107
Glass-id 0.0930 0.1075 0.1189 0.1004 0.1089 0.1101 0.1099
New-thyroid 0.0161 0.0272 0.0385 0.0230 0.0365 0.0285 0.0351
Heart-disease-c 0.0248 0.0479 0.0582 0.0357 0.0718 0.0466 0.0498
Soybean-large 0.0783 0.1176 0.0982 0.0842 0.1373 0.0921 0.0947
Ionosphere 0.0242 0.0401 0.0581 0.0385 0.0582 0.0344 0.0497
Dermatology 0.0216 0.0513 0.0684 0.0199 0.0685 0.0746 0.0648
House-votes-84 0.0066 0.0170 0.0197 0.0094 0.0179 0.0164 0.0168
Chess 0.0401 0.0486 0.0531 0.0415 0.0626 0.0423 0.0447
Soybean 0.0302 0.0654 0.0439 0.0326 0.0606 0.0509 0.0406
Breast-cancer-w 0.0010 0.0337 0.0504 0.0134 0.0101 0.0199 0.0425
Tic-tac-toe 0.0455 0.0824 0.1125 0.0513 0.0590 0.0813 0.0951
Vowel 0.2542 0.2445 0.2325 0.2344 0.1093 0.2337 0.2255
Car 0.0520 0.0376 0.0434 0.0438 0.0456 0.0447 0.0379
Mfeat-mor 0.0622 0.1020 0.1031 0.0677 0.1351 0.0882 0.0960
Segment 0.0259 0.0294 0.0381 0.0255 0.0191 0.0291 0.0344
Hypothyroid 0.0031 0.0034 0.0024 0.0034 0.0279 0.0034 0.0024
Kr-vs-kp 0.0186 0.0152 0.0111 0.0186 0.0077 0.0076 0.0077
Dis 0.0069 0.0005 0.0011 0.0071 0.0021 0.0005 0.0003
Hypo 0.0051 0.0071 0.0069 0.0049 0.0046 0.0078 0.0060
Sick 0.0047 0.0051 0.0043 0.0042 0.0082 0.0052 0.0035
Abalone 0.0682 0.1693 0.1769 0.1544 0.1865 0.1511 0.1633
Waveform-5000 0.0259 0.0690 0.0843 0.0410 0.0528 0.0625 0.0666
Phoneme 0.1215 0.1828 0.1064 0.1343 0.0818 0.1850 0.1052
Wall-following 0.0211 0.0288 0.0294 0.0242 0.0112 0.0266 0.0278
Page-blocks 0.0135 0.0143 0.0177 0.0124 0.0110 0.0115 0.0146
Satellite 0.0139 0.0367 0.0455 0.0363 0.0251 0.0388 0.0406
Thyroid 0.0205 0.0257 0.0272 0.0235 0.0279 0.0220 0.0235
Pendigits 0.0157 0.0200 0.0236 0.0127 0.0148 0.0157 0.0198
Sign 0.0313 0.0386 0.0596 0.0378 0.0593 0.0572 0.0488
Nursery 0.0085 0.0168 0.0195 0.0105 0.0193 0.0179 0.0168
Magic 0.0174 0.0490 0.0491 0.0297 0.0512 0.0407 0.0440
Letter-recog 0.0471 0.0591 0.0709 0.0448 0.0492 0.0440 0.0619
Adult 0.0069 0.0165 0.0285 0.0116 0.0425 0.0141 0.0185
Shuttle 0.0009 0.0004 0.0003 0.0004 0.0004 0.0004 0.0003
Connect-4 0.0156 0.0149 0.0309 0.0222 0.0534 0.0215 0.0222
Waveform 0.0009 0.0053 0.0037 0.0025 0.0068 0.0021 0.0035
Localization 0.0460 0.0594 0.1099 0.0580 0.1106 0.0897 0.0955
Table A5. Ranks in terms of zero-one loss of different learners.
Dataset NB TAN KDB AODE RF kdbe KDBe
Contact-lenses 5.0 5.0 1.0 7.0 3.0 5.0 2.0
Labor 2.5 5.0 2.5 6.0 7.0 2.5 2.5
Echocardiogram 5.0 4.0 6.0 3.0 7.0 1.5 1.5
Lymphography 1.0 3.5 7.0 2.0 6.0 3.5 5.0
Sonar 4.5 2.0 6.5 3.0 1.0 6.5 4.5
Glass-id 7.0 3.5 3.5 6.0 1.0 5.0 2.0
New-thyroid 2.0 5.0 6.0 1.0 7.0 3.5 3.5
Heart-disease-c 1.0 4.5 7.0 3.0 6.0 2.0 4.5
Soybean-large 6.0 4.5 3.0 1.0 4.5 7.0 2.0
Ionosphere 7.0 2.0 3.0 4.0 5.0 6.0 1.0
Dermatology 2.0 3.0 7.0 1.0 4.0 6.0 5.0
House-votes-84 7.0 5.0 3.0 4.0 1.0 6.0 2.0
Chess 7.0 2.0 4.0 5.0 6.0 2.0 2.0
Soybean 7.0 1.0 5.0 2.0 6.0 4.0 3.0
Breast-cancer-w 1.0 5.0 7.0 2.0 3.0 4.0 6.0
Tic-tac-toe 7.0 5.0 3.0 6.0 4.0 1.0 2.0
Vowel 7.0 1.0 6.0 2.0 4.0 5.0 3.0
Car 7.0 3.0 1.0 6.0 5.0 4.0 2.0
Mfeat-mor 6.0 1.0 5.0 7.0 2.0 3.0 4.0
Segment 7.0 3.0 6.0 1.0 4.0 2.0 5.0
Hypothyroid 7.0 3.0 4.0 6.0 5.0 1.0 2.0
Kr-vs-kp 7.0 5.0 3.0 6.0 1.0 4.0 2.0
Dis 6.5 6.5 5.0 3.0 4.0 2.0 1.0
Hypo 6.0 7.0 4.0 1.0 5.0 2.5 2.5
Sick 7.0 3.0 1.0 6.0 4.0 5.0 2.0
Abalone 6.0 5.0 4.0 2.0 7.0 3.0 1.0
Waveform-5000 7.0 5.0 6.0 1.0 2.0 4.0 3.0
Phoneme 5.0 6.0 3.0 4.0 1.0 7.0 2.0
Wall-following 7.0 6.0 5.0 2.0 1.0 4.0 3.0
Page-blocks 7.0 6.0 5.0 4.0 1.0 3.0 2.0
Satellite 7.0 5.0 2.0 4.0 3.0 6.0 1.0
Thyroid 7.0 5.0 3.0 4.0 6.0 1.0 2.0
Pendigits 7.0 5.0 4.0 1.5 6.0 1.5 3.0
Sign 7.0 5.0 3.0 6.0 1.0 4.0 2.0
Nursery 7.0 5.0 2.0 6.0 1.0 4.0 3.0
Magic 7.0 4.0 2.0 6.0 3.0 5.0 1.0
Letter-recog 7.0 6.0 5.0 3.0 4.0 1.0 2.0
Adult 7.0 4.0 5.0 6.0 1.0 2.0 3.0
Shuttle 7.0 6.0 5.0 4.0 1.0 2.0 3.0
Connect-4 7.0 5.0 3.0 6.0 1.0 4.0 2.0
Waveform 5.0 4.0 6.0 1.0 7.0 3.0 2.0
Localization 7.0 5.0 3.0 6.0 4.0 1.0 2.0
Sum of ranks 246.5 179.5 175.5 160.5 155.5 149.5 109.0
Table A6. Ranks in terms of RMSE of different learners.
Dataset | NB | TAN | KDB | AODE | RF | kdbe | KDBe
Contact-lenses | 2.0 | 7.0 | 1.0 | 4.0 | 6.0 | 5.0 | 3.0
Labor | 1.0 | 4.0 | 2.0 | 3.0 | 6.0 | 7.0 | 5.0
Echocardiogram | 5.0 | 4.0 | 7.0 | 6.0 | 1.0 | 2.0 | 3.0
Lymphography | 1.0 | 4.0 | 7.0 | 2.0 | 5.0 | 6.0 | 3.0
Sonar | 7.0 | 5.0 | 4.0 | 6.0 | 1.0 | 3.0 | 2.0
Glass-id | 7.0 | 4.0 | 5.0 | 6.0 | 1.0 | 3.0 | 2.0
New-thyroid | 1.0 | 6.0 | 7.0 | 3.0 | 2.0 | 4.0 | 5.0
Heart-disease-c | 5.0 | 6.0 | 7.0 | 2.0 | 4.0 | 1.0 | 3.0
Soybean-large | 5.0 | 4.0 | 3.0 | 2.0 | 7.0 | 6.0 | 1.0
Ionosphere | 7.0 | 4.0 | 5.0 | 2.0 | 1.0 | 6.0 | 3.0
Dermatology | 1.0 | 3.0 | 4.0 | 2.0 | 5.0 | 7.0 | 6.0
House-votes-84 | 7.0 | 6.0 | 4.0 | 5.0 | 1.0 | 3.0 | 2.0
Chess | 7.0 | 1.0 | 2.0 | 4.0 | 5.0 | 6.0 | 3.0
Soybean | 7.0 | 1.0 | 4.0 | 3.0 | 6.0 | 5.0 | 2.0
Breast-cancer-w | 1.0 | 4.0 | 7.0 | 3.0 | 2.0 | 6.0 | 5.0
Tic-tac-toe | 7.0 | 6.0 | 3.0 | 5.0 | 1.0 | 4.0 | 2.0
Vowel | 7.0 | 1.0 | 5.0 | 2.0 | 4.0 | 6.0 | 3.0
Car | 7.0 | 3.0 | 1.0 | 6.0 | 5.0 | 4.0 | 2.0
Mfeat-mor | 7.0 | 1.0 | 5.0 | 4.0 | 6.0 | 2.0 | 3.0
Segment | 7.0 | 4.0 | 5.0 | 1.0 | 6.0 | 3.0 | 2.0
Hypothyroid | 7.0 | 4.0 | 3.0 | 6.0 | 1.0 | 5.0 | 2.0
Kr-vs-kp | 7.0 | 4.0 | 2.0 | 6.0 | 1.0 | 5.0 | 3.0
Dis | 7.0 | 6.0 | 3.0 | 5.0 | 1.0 | 4.0 | 2.0
Hypo | 7.0 | 6.0 | 3.0 | 2.0 | 4.0 | 5.0 | 1.0
Sick | 7.0 | 3.0 | 2.0 | 6.0 | 4.0 | 5.0 | 1.0
Abalone | 7.0 | 4.0 | 5.0 | 1.0 | 6.0 | 2.5 | 2.5
Waveform-5000 | 7.0 | 3.0 | 6.0 | 1.0 | 5.0 | 4.0 | 2.0
Phoneme | 4.0 | 6.0 | 3.0 | 5.0 | 1.0 | 7.0 | 2.0
Wall-following | 7.0 | 6.0 | 5.0 | 3.0 | 1.0 | 4.0 | 2.0
Page-blocks | 7.0 | 6.0 | 5.0 | 4.0 | 2.0 | 1.0 | 3.0
Satellite | 7.0 | 5.0 | 3.0 | 4.0 | 2.0 | 6.0 | 1.0
Thyroid | 7.0 | 4.0 | 5.0 | 3.0 | 6.0 | 1.0 | 2.0
Pendigits | 7.0 | 4.0 | 3.0 | 1.0 | 6.0 | 5.0 | 2.0
Sign | 7.0 | 5.0 | 3.0 | 6.0 | 1.0 | 4.0 | 2.0
Nursery | 7.0 | 5.0 | 2.0 | 6.0 | 1.0 | 4.0 | 3.0
Magic | 7.0 | 2.0 | 3.0 | 5.0 | 6.0 | 4.0 | 1.0
Letter-recog | 7.0 | 5.0 | 4.0 | 2.0 | 6.0 | 3.0 | 1.0
Adult | 7.0 | 3.0 | 4.0 | 5.0 | 6.0 | 2.0 | 1.0
Shuttle | 7.0 | 6.0 | 5.0 | 3.0 | 4.0 | 1.0 | 2.0
Connect-4 | 7.0 | 4.0 | 3.0 | 5.0 | 1.0 | 6.0 | 2.0
Waveform | 7.0 | 4.0 | 6.0 | 2.0 | 1.0 | 5.0 | 3.0
Localization | 7.0 | 5.5 | 4.0 | 5.5 | 3.0 | 1.0 | 2.0
Sum of ranks | 250.0 | 178.5 | 170.0 | 157.5 | 144.0 | 173.5 | 102.5
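The per-dataset ranks in Tables A5 and A6, and the rank sums at the bottom of each table, can be reproduced by ranking the seven learners on every dataset and averaging ranks over ties (which is why values such as 1.5 or 2.5 appear). The Python sketch below illustrates this together with the Friedman test [34]; the `loss` matrix is only a random placeholder standing in for the actual 42 × 7 zero-one-loss (or RMSE) values, with columns assumed to be ordered NB, TAN, KDB, AODE, RF, kdbe, KDBe.

```python
import numpy as np
from scipy.stats import rankdata, friedmanchisquare

# Placeholder for the real 42 x 7 matrix of zero-one-loss (or RMSE) values;
# columns assumed ordered NB, TAN, KDB, AODE, RF, kdbe, KDBe. Lower is better.
rng = np.random.default_rng(0)
loss = rng.random((42, 7))

# Rank the learners within each dataset: rank 1 = lowest loss; ties receive
# averaged ranks, reproducing values such as 2.5 in Tables A5 and A6.
ranks = np.apply_along_axis(rankdata, 1, loss)
rank_sums = ranks.sum(axis=0)   # corresponds to the "Sum of ranks" rows
avg_ranks = ranks.mean(axis=0)  # the average ranks plotted in Figure 11

# Friedman test over the seven learners, one sample of 42 losses per learner.
stat, p_value = friedmanchisquare(*(loss[:, k] for k in range(loss.shape[1])))
print(rank_sums, avg_ranks, stat, p_value)
```

Dividing the reported rank sums by 42 gives the average ranks; for example, 109.0/42 ≈ 2.60 for KDBe under zero-one loss.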

References

1. Efron, B.; Hastie, T. Computer Age Statistical Inference, 1st ed.; Cambridge University Press: New York, NY, USA, 2016; ISBN 9781107149892.
2. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2009.
3. Liu, Z.T.; Wu, M.; Cao, W.H.; Mao, Z.W. Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing 2018, 273, 271–280.
4. Chen, S.L.; Martinez, A.M.; Webb, G.I.; Wang, L.M. Sample-based attribute selective AnDE for large data. IEEE Trans. Knowl. Data Eng. 2016, 29, 1–14.
5. Langley, P. Induction of recursive Bayesian classifiers. In Proceedings of the 8th European Conference on Machine Learning, Vienna, Austria, 5–7 April 1993; pp. 153–164.
6. Wang, S.Q.; Liu, Q.; Zhu, E.; Porikli, F.; Yin, J. Hyperparameter selection of one-class support vector machine by self-adaptive data shifting. Pattern Recogn. 2018, 74, 198–211.
7. Orlandi, A. Multiple objectives optimization for an EBG common mode filter by using an artificial neural network. IEEE Trans. Electromagn. Compat. 2018, 60, 507–512.
8. Chen, F.C.; Jahanshahi, M.R. NB-CNN: Deep learning-based crack detection using convolutional neural network and Naive Bayes data fusion. IEEE Trans. Ind. Electron. 2018, 65, 4392–4400.
9. Sahami, M. Learning limited dependence Bayesian classifiers. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 335–338.
10. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
11. Zheng, F.; Webb, G.I.; Suraweera, P.; Zhu, L. Subsumption resolution: An efficient and effective technique for semi-naive Bayesian learning. Mach. Learn. 2012, 87, 93–125.
12. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
13. Maron, M.E.; Kuhns, J.L. On relevance, probabilistic indexing, and information retrieval. J. ACM 1960, 7, 216–244.
14. Minsky, M. Steps toward artificial intelligence. Trans. Inst. Radio Eng. 1961, 49, 8–30.
15. Grenaade, U. Pattern Classification and Scene Analysis. J. Am. Stat. Assoc. 1974, 69, 829.
16. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 1997, 29, 131–163.
17. Taheri, S.; Mammadov, M. Structure learning of Bayesian networks using global optimization with applications in data classification. Optim. Lett. 2015, 9, 931–948.
18. Zheng, Z.J.; Webb, G.I. Lazy learning of Bayesian rules. Mach. Learn. 2000, 41, 53–84.
19. Martinez, A.M.; Webb, G.I.; Chen, S.L.; Zaidi, N.A. Scalable learning of Bayesian network classifiers. J. Mach. Learn. Res. 2013, 1, 1–30.
20. Wang, L.M.; Zhao, H.Y.; Sun, M.H.; Ning, Y. General and local: Averaged k-dependence Bayesian classifiers. Entropy 2015, 17, 4134–4154.
21. Zhou, G.; Lu, M.Q.; Li, D.J.; Gao, B.A.; Guo, R. Identification of differentially expressed molecular functions associated with breast cancer using Gibbs sampling. Oncol. Lett. 2017, 14, 7489–7494.
22. Liu, R.; Jin, H.M.; Duan, F.Q. Spectral classification based on Bayes decision. Spectrosc. Spect. Anal. 2010, 30, 838–841.
23. Marcot, B.G. Common quandaries and their practical solutions in Bayesian network modeling. Ecol. Model. 2017, 358, 1–9.
24. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656.
25. Church, K.W.; Hanks, P. Word association norms, mutual information, and lexicography. Comput. Linguist. 1990, 16, 22–29.
26. Shachter, R.D. Evaluating influence diagrams. Oper. Res. 1986, 34, 871–882.
27. Park, S.H.; Fuernkranz, J. Efficient implementation of class-based decomposition schemes for Naive Bayes. Mach. Learn. 2014, 96, 295–309.
28. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
29. Newman, D.; Hettich, S.; Blake, C.; Merz, C. UCI Repository of Machine Learning Databases. Available online: http://archive.ics.uci.edu/ml/index.php (accessed on 26 May 2019).
30. Fayyad, U.M.; Irani, K.B. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambery, France, 28 August–3 September 1993; pp. 1022–1029.
31. Kohavi, R.; Wolpert, D. Bias plus variance decomposition for zero-one loss functions. In Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, 3–6 July 1996; pp. 275–283.
32. Tyralis, H.; Papacharalampous, G.; Langousis, A. A brief review of random forests for water scientists and practitioners and their recent history in water resources. Water 2019, 11, 910.
33. Gorodkin, J. Comparing two K-category assignments by a K-category correlation coefficient. Comput. Biol. Chem. 2004, 28, 367–374.
34. Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 1940, 11, 86–92.
35. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
Figure 1. The distributions of $I(x_{15};x_{16}\mid y_i)$ on the Waveform dataset, where $i \in \{1,2,3\}$. The x-axis represents the index of each instance; the y-axis represents the value of $I(x_{15};x_{16}\mid y_i)$.
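Figure 1 plots a per-instance, class-conditional dependence measure between attributes $x_{15}$ and $x_{16}$. One plausible way to obtain such per-instance values, sketched below purely for illustration, is to evaluate the pointwise (value-specific) conditional mutual information $\log \frac{P(x_{15},x_{16}\mid y_i)}{P(x_{15}\mid y_i)P(x_{16}\mid y_i)}$ at each instance's observed attribute values; the exact quantity used by the authors is defined in the body of the paper and may differ, and all names below are ours.

```python
import numpy as np

def pointwise_cmi(X, y, i, j, x_i, x_j, y_val, alpha=1.0):
    """Pointwise conditional mutual information of the observed value pair
    (x_i, x_j) given class y_val, estimated with Laplace smoothing.
    Illustrative estimator only, not the authors' exact formula."""
    mask = (y == y_val)
    n = mask.sum()
    ki = len(np.unique(X[:, i]))   # cardinality of attribute i
    kj = len(np.unique(X[:, j]))   # cardinality of attribute j
    p_ij = (((X[mask, i] == x_i) & (X[mask, j] == x_j)).sum() + alpha) \
           / (n + alpha * ki * kj)
    p_i = ((X[mask, i] == x_i).sum() + alpha) / (n + alpha * ki)
    p_j = ((X[mask, j] == x_j).sum() + alpha) / (n + alpha * kj)
    return np.log2(p_ij / (p_i * p_j))

# For each (discretized) Waveform instance, the measure could be evaluated
# under every candidate class label, mirroring the three curves of Figure 1.
# X_disc and y_labels are hypothetical discretized data and class arrays:
# values = [[pointwise_cmi(X_disc, y_labels, 15, 16, row[15], row[16], c)
#            for c in np.unique(y_labels)] for row in X_disc]
```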
Figure 2. Example of (a) unrestricted BNC, and (b) restricted BNC.
Figure 3. The framework of discriminatory target learning.
Figure 4. Example of (a) kdb1, and (b) kdb2.
Figure 5. Scatter plot of zero-one loss and RMSE comparisons for KDBe, KDB and AODE.
Figure 6. Training and classification time comparisons for BNCs.
Figure 7. Training and classification time comparisons between KDBe and RF.
Figure 8. The structure of KDB on the Heart-disease-c dataset.
Figure 9. The structure of submodels of kdbe.
Figure 10. The scatter plot of KDBe and RF in terms of MCC. The Dis dataset, a notable case where KDBe enjoys a significant advantage, is annotated in red.
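Figure 10 compares KDBe and RF using the Matthews correlation coefficient (MCC); for multi-class problems this is commonly computed as Gorodkin's K-category correlation [33]. The sketch below shows one standard way to compute it from a confusion matrix; the function and variable names are ours, not the authors'.

```python
import numpy as np

def multiclass_mcc(conf):
    """K-category correlation coefficient (multi-class MCC) from a K x K
    confusion matrix `conf` (rows = true classes, columns = predictions)."""
    conf = np.asarray(conf, dtype=float)
    t = conf.sum(axis=1)   # true counts per class
    p = conf.sum(axis=0)   # predicted counts per class
    c = np.trace(conf)     # correctly classified instances
    s = conf.sum()         # total instances
    num = c * s - np.dot(t, p)
    den = np.sqrt(s**2 - np.dot(p, p)) * np.sqrt(s**2 - np.dot(t, t))
    return num / den if den != 0 else 0.0

# Example with a hypothetical 3-class confusion matrix:
print(multiclass_mcc([[50, 2, 3], [4, 45, 6], [1, 2, 57]]))
```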
Figure 11. Average ranks in terms of zero-one loss and RMSE for all learners.
Figure 12. Nemenyi test in terms of zero-one loss and RMSE for all learners.
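For Figure 12, the Nemenyi critical difference (CD) can be recovered from the average ranks implied by Tables A5 and A6. The worked value below is a sketch that assumes the conventional two-tailed critical value $q_{0.05} \approx 2.949$ for $k = 7$ learners, as tabulated in Demšar [35]; the authors' exact setting may differ.

$$ CD = q_{0.05}\sqrt{\frac{k(k+1)}{6N}} = 2.949\sqrt{\frac{7 \times 8}{6 \times 42}} \approx 1.39. $$

Under this reading, two learners whose average ranks differ by more than about 1.39 would be judged significantly different at the 0.05 level; for example, NB (246.5/42 ≈ 5.87) versus KDBe (109.0/42 ≈ 2.60) in terms of zero-one loss.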
Table 1. Datasets. Imbalanced datasets are annotated with the symbol “*”.
Index | Dataset | Instance | Attribute | Class | Index | Dataset | Instance | Attribute | Class
1 | Contact-lenses | 24 | 4 | 3 | 22 | Kr-vs-kp | 3196 | 36 | 2
2 | Labor | 57 | 16 | 2 | 23 | Dis * | 3772 | 29 | 2
3 | Echocardiogram | 131 | 6 | 2 | 24 | Hypo | 3772 | 29 | 4
4 | Lymphography | 148 | 18 | 4 | 25 | Sick * | 3772 | 29 | 2
5 | Sonar | 208 | 60 | 2 | 26 | Abalone * | 4177 | 8 | 3
6 | Glass-id | 214 | 9 | 3 | 27 | Waveform-5000 | 5000 | 40 | 3
7 | New-thyroid * | 215 | 5 | 3 | 28 | Phoneme | 5438 | 7 | 50
8 | Heart-disease-c | 303 | 13 | 2 | 29 | Wall-following | 5456 | 24 | 4
9 | Soybean-large | 307 | 35 | 19 | 30 | Page-blocks | 5473 | 10 | 5
10 | Ionosphere * | 351 | 34 | 2 | 31 | Satellite * | 6435 | 36 | 6
11 | Dermatology | 366 | 34 | 6 | 32 | Thyroid | 9169 | 29 | 20
12 | House-votes-84 * | 435 | 16 | 2 | 33 | Pendigits | 10,992 | 16 | 10
13 | Chess * | 551 | 39 | 2 | 34 | Sign | 12,546 | 8 | 3
14 | Soybean * | 683 | 35 | 19 | 35 | Nursery | 12,960 | 8 | 5
15 | Breast-cancer-w | 699 | 9 | 2 | 36 | Magic | 19,020 | 10 | 2
16 | Tic-tac-toe | 958 | 9 | 2 | 37 | Letter-recog | 20,000 | 16 | 26
17 | Vowel | 990 | 13 | 11 | 38 | Adult * | 48,842 | 14 | 2
18 | Car * | 1728 | 6 | 4 | 39 | Shuttle * | 58,000 | 9 | 7
19 | Mfeat-mor | 2000 | 6 | 10 | 40 | Connect-4 | 67,557 | 42 | 3
20 | Segment | 2310 | 19 | 7 | 41 | Waveform * | 100,000 | 21 | 3
21 | Hypothyroid * | 3163 | 25 | 2 | 42 | Localization | 164,860 | 5 | 11
Table 2. W/D/L comparison results of zero-one loss on all datasets.
W/D/L | NB | TAN | KDB | AODE | kdbe
TAN | 29/7/6 | - | - | - | -
KDB | 30/5/7 | 20/9/13 | - | - | -
AODE | 33/5/4 | 16/14/12 | 20/6/16 | - | -
kdbe | 30/5/7 | 17/18/7 | 20/11/11 | 13/15/14 | -
KDBe | 34/3/5 | 23/13/6 | 26/13/3 | 22/10/10 | 14/20/8
Table 3. W/D/L comparison results of RMSE on all datasets.
W/D/L | NB | TAN | KDB | AODE | kdbe
TAN | 32/4/6 | - | - | - | -
KDB | 32/4/6 | 16/19/7 | - | - | -
AODE | 29/9/4 | 16/19/7 | 15/15/12 | - | -
kdbe | 30/5/7 | 9/21/12 | 11/17/14 | 7/19/16 | -
KDBe | 34/3/5 | 20/19/3 | 17/20/5 | 17/17/8 | 21/21/0
Table 4. W/D/L comparison results of bias on all datasets.
W/D/L | NB | TAN | KDB | AODE | kdbe
TAN | 30/5/7 | - | - | - | -
KDB | 30/5/7 | 25/9/8 | - | - | -
AODE | 32/7/3 | 18/14/10 | 15/4/23 | - | -
kdbe | 31/3/8 | 20/10/12 | 15/8/19 | 16/11/15 | -
KDBe | 32/3/7 | 26/9/7 | 11/27/4 | 21/13/8 | 17/18/7
Table 5. W/D/L comparison results of variance on all datasets.
W/D/L | NB | TAN | KDB | AODE | kdbe
TAN | 4/3/35 | - | - | - | -
KDB | 8/1/33 | 9/7/26 | - | - | -
AODE | 9/8/25 | 30/8/4 | 34/3/5 | - | -
kdbe | 7/1/34 | 19/13/10 | 30/4/8 | 6/10/26 | -
KDBe | 8/2/32 | 16/12/14 | 34/8/0 | 7/4/31 | 12/9/21
Table 6. W/D/L records between KDBe and RF.
Metric | All | Small | Medium | Large
Zero-one loss | 20/10/12 | 10/4/3 | 7/4/4 | 3/2/5
RMSE | 16/15/11 | 4/9/4 | 8/4/3 | 4/2/4
Bias | 11/11/20 | 5/1/11 | 5/5/5 | 1/5/4
Variance | 26/4/12 | 11/3/3 | 7/1/7 | 8/0/2
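Tables 2–6 report pairwise win/draw/loss (W/D/L) records over the 42 datasets; Table 6 additionally splits them by dataset size, and the implied 17/15/10 partition is consistent with grouping the datasets in Table 1 by instance count (roughly fewer than 1,000, 1,000–10,000, and more than 10,000 instances), although the exact thresholds and the criterion for declaring a win are defined in the body of the paper. The sketch below is therefore only illustrative: it tallies W/D/L from per-dataset losses with a simple tolerance rather than a significance test, and the size thresholds are assumptions.

```python
import numpy as np

def wdl(loss_a, loss_b, sizes, eps=1e-6):
    """Illustrative W/D/L tally of learner A against learner B.
    loss_a, loss_b: per-dataset losses (lower is better); sizes: instance counts.
    A 'win' here is simply a lower loss beyond a small tolerance; the paper may
    instead use a statistical test, so treat the counts as a rough sketch."""
    loss_a, loss_b = np.asarray(loss_a, float), np.asarray(loss_b, float)
    sizes = np.asarray(sizes)
    outcome = np.where(loss_a < loss_b - eps, "W",
                       np.where(loss_a > loss_b + eps, "L", "D"))
    groups = {
        "All": np.ones(sizes.shape, dtype=bool),
        "Small": sizes < 1000,                         # assumed threshold
        "Medium": (sizes >= 1000) & (sizes <= 10000),  # assumed threshold
        "Large": sizes > 10000,                        # assumed threshold
    }
    return {name: tuple(int((outcome[m] == o).sum()) for o in ("W", "D", "L"))
            for name, m in groups.items()}

# Hypothetical usage with per-dataset zero-one losses of KDBe and RF:
# print(wdl(loss_kdbe, loss_rf, instance_counts))
```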
Table 7. Description of the Heart-disease-c dataset.
Attribute | Description | Symbol
age | real value | X_0
sex | male or female, {0,1} | X_1
cp | chest pain type (angina, abnang, notang, asympt), {1,2,3,4} | X_2
trestbps | resting blood pressure, real value | X_3
chol | cholesterol, real value | X_4
fbs | fasting blood sugar < 120 (true or false), {0,1} | X_5
restecg | resting electrocardiographic results (norm, abn, hyper), {0,1,2} | X_6
thalach | maximum heart rate achieved, real value | X_7
exang | exercise induced angina (true or false), {0,1} | X_8
oldpeak | ST depression induced by exercise relative to rest, real value | X_9
slope | the slope of the peak exercise ST segment (up, flat, down), {1,2,3} | X_10
ca | number of vessels colored, real value | X_11
thal | thal (norm, fixed, rever), {3,6,7} | X_12
class | 0 for healthy, 1 for sick | Y
