1. Introduction
In many classification scenarios and datasets, achieving satisfactory detection performance is a critical problem [1]. In such scenarios, machine learning practitioners naturally turn to ensemble classification, which combines multiple base classifiers (learners) to reach sufficiently accurate decisions. Among ensemble techniques, majority voting has gained significant attention from the research community because of its effectiveness, simplicity, and democratic style of combining the individual decisions [2]. Unfortunately, the construction of ensemble classifiers has largely been empirical: an ensemble is first constructed and, if its performance is not satisfactory, it is discarded. Contrary to the empirical approach, the analytical approach is deductive: a mathematical model is formulated first, and the ensemble classifier is then constructed accordingly. In contrast to the unguided, trial-driven effort of empirical construction, the analytical approach is purposeful, goal-oriented, and systematic. Additionally, once an analytical model is built, it can be reused to construct future models, which is not possible with the empirical approach [3].
In many applications related to video surveillance, driving assistance, pedestrian detection, and disease diagnostics, there are additional challenges because of their cost-sensitive nature. For instance, in the surveillance of human-prohibited areas, a false negative (missing a human detection) carries a far higher cost than a false positive. Similarly, cancer diagnosis is critical: missing a cancer tumor can result in severe damage, even the death of the patient, whereas falsely detecting a tumor in a healthy person costs only additional tests, in which the person can be re-identified as healthy. Likewise, in automatic driving and pedestrian detection systems, missing a human may result in serious injury or death. Consequently, such tasks need a cost-sensitive detection system to meet the required objectives [4,5].
In imbalanced machine learning datasets, the number of positive and negative samples differs significantly, which biases classification decisions towards the majority class [6,7]. Unfortunately, many datasets from cost-sensitive applications are significantly imbalanced. Moreover, the number of positive samples in these datasets is critically smaller than the number of negative samples, which results in higher false negative rates, the very outcome that carries the higher penalty in cost-sensitive applications [8].
Some cost-sensitive classification techniques are available in the literature. Unfortunately, these contributions mainly focus on either the classification technique [9,10] or data sampling [6,11]. For example, Zadrozny et al. [12] associated weights with the training examples of each class to achieve cost-sensitive learning. Likewise, Krawczyk et al. [13] used cost-sensitive analysis for breast thermography, whereas Nguyen et al. [6] addressed the classifier's tendency to favor the majority class in imbalanced datasets by comparing several data resampling techniques and optimizing the cost ratio. Similarly, Singh et al. [14] employed transfer learning for imbalanced breast cancer classification, but a dedicated component to handle imbalanced classification is missing from their methodology. Likewise, Saleeman IV et al. [15] employed Spark for multiclass imbalanced classification. In addition to these conventional learning techniques, a few deep learning-based techniques are also present in the literature. For example, Almarshdi et al. [16] proposed a hybrid deep learning solution for imbalanced classification, but, unfortunately, innovation in how to tackle imbalanced data is missing.
Contrary to using a single classifier [6,12,13], Liangyuan et al. [17] employed a cost-sensitive ensemble learning method using majority voting. Fan et al. [18] proposed a pruning mechanism for base classifiers to minimize the computational cost of cost-sensitive ensemble learning. However, these techniques do not consider the role of imbalanced datasets in biasing the classifier [17,18,19].
In this line of research, some machine learning researchers have focused on the role of imbalanced datasets in the design of ensemble learning models [20]. For example, Zhang et al. [8] proposed an ensemble method for class-imbalanced datasets that splits the majority class into various subsets and then trains a different base learner on the minority class samples paired with each majority-class subset. Similarly, Yuan et al. [21] oversampled the dataset, used standard AdaBoost [22], and then applied a genetic algorithm (GA) to optimize the weights of the base classifiers. Ali et al. [23] proposed a GentleBoost ensemble for breast cancer by oversampling the minority samples; their work considers the probability of occurrence of each training sample to incorporate cost effects. Hou et al. [24] employed dynamic classifier selection [25] to propose a computationally intensive dynamic ensemble classifier, META-DESKNN-MI, which uses SMOTE to fix the class imbalance in the training set. Although Xu and Chetia [26] proposed an efficient implementation of dynamic ensemble classification, these remain empirical ensembles, and, additionally, ensemble selection and class imbalance are treated separately.
Unfortunately, these approaches are empirical, and thereby they require the construction of an ensemble classifier which, if its performance is not satisfactory, is discarded in favor of another ensemble strategy. There is a significant deficiency in the literature of analytical analysis prior to the construction of ensemble classifiers. In this research, the problem of designing an analytical model to estimate and predict the performance of ensemble classification schemes such as majority voting and logical disjunction is considered. For this design, the formulations are derived using the concepts of conditional joint probability. The output label of a base learner is treated as a random variable with different probabilities for true negative (TN), false negative (FN), true positive (TP), and false positive (FP). This is the central aspect of this research. Although the nature of these formulations and derivations is generic, cost-sensitive and imbalanced datasets are considered to evaluate the forecasted performance of ensemble classification using the derived formulations. In the experiments, it is shown, both with the analytical model and with experimental observations, that in imbalanced datasets with cost-sensitive scenarios, logical disjunction outperforms the contemporary majority voting ensemble, thus providing a simple alternative in such scenarios.
2. The Formulation to Predict the Performance of Ensemble Classification
In classification, a training dataset is used to learn the feature space. After the training process is completed, a test sample is fed into the classifier and it predicts its output label. The output belongs to true positive (TP), true negative (TN), false positive (FP), or false negative (FN). It is to note that the nature of this output is random, since feeding a number of testing samples generates a random sequence from the set $\{TP, TN, FP, FN\}$, and thus this set acts as the sample space of this random experiment [27,28]. Using this concept, the methodology used for the derivation of the probability of true positive for the majority voting and logical disjunction ensembles is presented in Figure 1.
2.1. Probability Perspective of Classifier Outputs
This methodology is presented for binary classification problems, wherein the output decision belongs to one of four categories, i.e., $TP$, $TN$, $FP$, and $FN$. Thus, the set $S = \{TP, TN, FP, FN\}$ is the sample space. Considering the output as a random variable, the probabilities (relative frequencies) of these events are in fact the classification performance measures (TPR, TNR, FPR, FNR) [29], as follows in Equation (1):
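A plausible form of Equation (1), assuming the standard confusion-matrix definitions in which each rate is conditioned on the true class of the sample (these exact definitions are an assumption made here for completeness), is:

\[
P(TP) = TPR = \frac{TP}{TP + FN}, \qquad P(FN) = FNR = \frac{FN}{TP + FN},
\]
\[
P(FP) = FPR = \frac{FP}{FP + TN}, \qquad P(TN) = TNR = \frac{TN}{FP + TN}.
\]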
2.2. Ensemble Classifiers
In machine learning, the majority voting ensemble classification technique has gained the attention of the research community because of its effectiveness and simplicity. The enhanced accuracies of ensemble classification are explained by Condorcet’s jury theorem [30], which states (for binary classification):
“If individual base classifiers each have a probability greater than 0.5 of classifying correctly, then increasing the number of base classifiers increases the probability of correct classification by majority voting, and it approaches 1.
If individual base classifiers each have a probability less than 0.5 of classifying correctly, then increasing the number of base classifiers decreases the probability of correct classification by majority voting, and it approaches 0.”
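As an illustration of the theorem, the following minimal Python sketch (the probabilities 0.6 and 0.4 are hypothetical and not taken from this study) computes the probability that a majority of n independent base classifiers, each correct with probability p, makes the correct decision:

import math

def majority_voting_accuracy(p, n):
    # Probability that a strict majority of n independent base classifiers,
    # each correct with probability p, votes for the correct label (n odd).
    k_min = n // 2 + 1
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Accuracy grows towards 1 for p > 0.5 and shrinks towards 0 for p < 0.5:
for n in (1, 3, 5, 11, 51):
    print(n, round(majority_voting_accuracy(0.6, n), 4), round(majority_voting_accuracy(0.4, n), 4))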
In addition to majority voting, this research formulates an analytical model for logical disjunction-based decision aggregation. Although the derivation is generic, three base learners are considered for majority voting (MV) and two for logical disjunction (LD), purely for the sake of simplicity.
2.3. Mutual Dependency
It is to note that, since the output of a classifier over this sample space is considered a random variable, there is mutual dependence among the base learners. For example, if one base learner's prediction is a true positive, then another base learner's prediction is either a true positive or a false negative. This is because, if one base learner's prediction is a true positive, then the sample is surely positive, and thus the other base learners' predictions can be neither true negative nor false positive. Thereby, the output predictions of the base learners are not mutually independent. This important mutual dependency has to be considered while formulating the conditional probability distributions for both ensemble classifiers.
Consider $y_1$ and $y_2$ as the random variables associated with the outputs of the base classifiers $C_1$ and $C_2$, respectively. Thereby, if $y_1 = TP$, then $y_2$ (given $y_1 = TP$) is either $TP$ or $FN$. Similarly, if $y_1$ is either $TN$ or $FP$, then $y_2$ is also either $TN$ or $FP$. These mutual dependencies are summarized in Table 1.
Consider $n_1$ and $n_2$ as the numbers of observations of each output for the base classifiers $C_1$ and $C_2$, respectively, as summarized in Table 2.
2.4. Formulation
In majority voting of three base learners $C_1$, $C_2$, and $C_3$, the ensemble decision is true positive if at least two base learner decisions are true positive. Thereby, in majority voting, the outcome is true positive when either all three base learner outputs are true positive or any two of the three base learner outputs are true positive. Considering $P_1$, $P_2$, and $P_3$ as the probability mass functions of the three individual classifier outputs $y_1$, $y_2$, and $y_3$, the probability of majority voting giving a true positive, $P_{MV}(TP)$, is derived using the concept of the joint conditional probability distribution. In this derivation, $P(y_1 = TP, y_2 = TP, y_3 = TP)$ means the joint probability of the events $y_1 = TP$ (base learner $C_1$ gives $TP$), $y_2 = TP$ ($C_2$ gives $TP$), and $y_3 = TP$ ($C_3$ gives $TP$). In these derivations, the formula of joint probability for three events, $P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid A, B)$, is to be kept in mind. It is to note that if the outputs of the base learners $C_1$ and $C_2$ are true positive, then surely the sample is positive. Thus, if a sample is positive, then base learner $C_3$ has only two options as output, i.e., either to declare it a true positive or a false negative. Thus, $P(y_3 = TP \mid y_1 = TP, y_2 = TP) = TPR_3$, and in a similar fashion $P(y_2 = TP \mid y_1 = TP) = TPR_2$ and $P(y_2 = FN \mid y_1 = TP) = FNR_2$. Using these formulations, $P(y_1 = TP, y_2 = TP, y_3 = TP)$ is to be computed as in Equation (2).
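A plausible form of Equation (2), obtained from the chain rule above with the conditional rates reduced to the class-conditional rates of the individual learners (an assumption of this sketch), is:

\[
P(y_1 = TP,\; y_2 = TP,\; y_3 = TP) = P(y_1 = TP)\, TPR_2\, TPR_3 ,
\]

where, under the same class-conditional reading, $P(y_1 = TP)$ reduces to $TPR_1$.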
Similarly, by computing $P(y_1 = TP, y_2 = TP, y_3 = FN)$, $P(y_1 = TP, y_2 = FN, y_3 = TP)$, and $P(y_1 = FN, y_2 = TP, y_3 = TP)$, the probability of majority voting giving a true positive, $P_{MV}(TP)$, is to be computed as in Equation (3). Figure 1a is additionally helpful in this derivation.
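Summing the four joint terms gives a plausible closed form of Equation (3), again under the assumption that each learner's rate depends only on the true class of the sample:

\[
P_{MV}(TP) = TPR_1\, TPR_2\, TPR_3 + TPR_1\, TPR_2\, FNR_3 + TPR_1\, FNR_2\, TPR_3 + FNR_1\, TPR_2\, TPR_3 .
\]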
In logical disjunction of two base learners $C_1$ and $C_2$, the ensemble decision is true positive if at least one base learner decision is true positive. Thereby, in logical disjunction, the outcome is true positive when either both base learner outputs are true positive or exactly one base learner output is true positive. Thus, the probability of logical disjunction giving a true positive, $P_{LD}(TP)$, is to be computed as in Equation (4).
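Under the same class-conditional reading, a plausible form of Equation (4) is:

\[
P_{LD}(TP) = TPR_1\, TPR_2 + TPR_1\, FNR_2 + FNR_1\, TPR_2 = 1 - FNR_1\, FNR_2 .
\]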
In majority voting of three base learners $C_1$, $C_2$, and $C_3$, the ensemble decision is false negative if at least two base learner decisions are false negative. Thereby, in majority voting, the outcome is false negative when either all three base learner outputs are false negative or any two of the three base learner outputs are false negative. Thus, the probability of majority voting giving a false negative, $P_{MV}(FN)$, is to be computed as in Equation (5).
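By the same reasoning as for Equation (3), a plausible form of Equation (5) is:

\[
P_{MV}(FN) = FNR_1\, FNR_2\, FNR_3 + FNR_1\, FNR_2\, TPR_3 + FNR_1\, TPR_2\, FNR_3 + TPR_1\, FNR_2\, FNR_3 .
\]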
In logical disjunction of two base learners $C_1$ and $C_2$, the ensemble decision is false negative only if both base learner decisions are false negative. Thus, the probability of logical disjunction giving a false negative, $P_{LD}(FN)$, is to be computed as in Equation (6).
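Correspondingly, a plausible form of Equation (6) is:

\[
P_{LD}(FN) = FNR_1\, FNR_2 .
\]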
In majority voting of three base learners $C_1$, $C_2$, and $C_3$, the ensemble decision is false positive if at least two base learner decisions are false positive. Thereby, in majority voting, the outcome is false positive when either all three base learner outputs are false positive or any two of the three base learner outputs are false positive. Thus, the probability of majority voting giving a false positive, $P_{MV}(FP)$, is to be computed as in Equation (7).
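For negative samples, the dependency argument restricts the remaining learners' outputs to $FP$ or $TN$, so a plausible form of Equation (7) is:

\[
P_{MV}(FP) = FPR_1\, FPR_2\, FPR_3 + FPR_1\, FPR_2\, TNR_3 + FPR_1\, TNR_2\, FPR_3 + TNR_1\, FPR_2\, FPR_3 .
\]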
In logical disjunction of two base learners $C_1$ and $C_2$, the ensemble decision is false positive if at least one base learner decision is false positive. Thereby, in logical disjunction, the outcome is false positive when either both base learner outputs are false positive or exactly one base learner output is false positive. Thus, the probability of logical disjunction giving a false positive, $P_{LD}(FP)$, is to be computed as in Equation (8).
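Analogously to Equation (4), a plausible form of Equation (8) is:

\[
P_{LD}(FP) = FPR_1\, FPR_2 + FPR_1\, TNR_2 + TNR_1\, FPR_2 = 1 - TNR_1\, TNR_2 .
\]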
In majority voting of three base learners $C_1$, $C_2$, and $C_3$, the ensemble decision is true negative if at least two base learner decisions are true negative. Thereby, in majority voting, the outcome is true negative when either all three base learner outputs are true negative or any two of the three base learner outputs are true negative. Thus, the probability of majority voting giving a true negative, $P_{MV}(TN)$, is to be computed as in Equation (9).
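By symmetry with Equation (5), a plausible form of Equation (9) is:

\[
P_{MV}(TN) = TNR_1\, TNR_2\, TNR_3 + TNR_1\, TNR_2\, FPR_3 + TNR_1\, FPR_2\, TNR_3 + FPR_1\, TNR_2\, TNR_3 .
\]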
In logical disjunction of two base learners $C_1$ and $C_2$, the ensemble decision is true negative only if both base learner decisions are true negative. Thus, the probability of logical disjunction giving a true negative, $P_{LD}(TN)$, is to be computed as in Equation (10).
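Correspondingly, a plausible form of Equation (10) is:

\[
P_{LD}(TN) = TNR_1\, TNR_2 .
\]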
3. Experiments and Results
To evaluate the proposed analytical formulations for predicting the performance of ensemble classification, datasets from the UCI machine learning repository have been considered. To establish another interesting property of the proposed formulations, imbalanced datasets have been chosen. In addition to the significant difference between the number of samples in each class, these datasets are also cost-sensitive, i.e., there is a different cost for falsely predicting a positive (minority class) sample than for a negative (majority class) sample, as shown in Table 3. Since negative samples are in the majority, the base learners have a tendency to predict the negative class more often than the positive class, and thus $FNR > FPR$. In addition to this, the cost of a false negative is greater than that of a false positive ($cost_{FN} > cost_{FP}$), where negative means being healthy and positive means being diseased. In this regard, these datasets create the intensified scenario of cost-sensitive imbalanced classification ($FNR > FPR$ with $cost_{FN} > cost_{FP}$). From this perspective, the proposed analytical formulations are evaluated on four different datasets, as described in the following subsections.
3.1. Breast Cancer Dataset
This dataset was generated by the Institute of Oncology, University Medical Center Ljubljana, Yugoslavia. This binary dataset is described by 9 medical attributes, and it includes 85 positive (recurrence-events) and 201 negative (no-recurrence-events) instances of cancer patients [31]. The observed confusion matrices of the individual and ensemble classifiers are shown in Table 4, where the positive class means a person has breast cancer, whereas the negative class means a person is normal. Using these confusion matrices, the observed probabilities are compared with the predicted probabilities computed from the proposed formulations, as shown in Table 5.
3.2. Wilt Dataset
This dataset was generated from a remote sensing study for detecting diseased trees using Quickbird (satellite) imagery. It is a highly imbalanced dataset containing 74 positive (diseased trees) and 4265 negative (normal trees) instances [32]. The observed confusion matrices of the individual and ensemble classifiers are shown in Table 6, where the positive class means a tree is diseased, whereas the negative class means the tree is normal. Using these confusion matrices, the observed probabilities are compared with the predicted probabilities computed from the proposed formulations, as shown in Table 7.
3.3. Haberman’s Survival Dataset
This dataset concerns the survival of patients of Billings Hospital, Chicago, who underwent breast surgery because of cancer. This dataset is described by three features, and it includes 81 positive (the patient died within 5 years after the surgery) and 225 negative (the patient survived 5 years or longer after the surgery) instances [33]. The observed confusion matrices of the individual and ensemble classifiers are shown in Table 8, where the positive class means the patient did not survive 5 years after the surgery, whereas the negative class means the patient survived. Using these confusion matrices, the observed probabilities are compared with the predicted probabilities computed from the proposed formulations, as shown in Table 9.
3.4. Thoracic Surgery Dataset
This dataset concerns the survival of patients of the Wroclaw Thoracic Surgery Center who underwent major lung resections because of primary lung cancer. This dataset contains 70 positive (the patient died within 1 year after the surgery) and 400 negative (the patient survived 1 year or longer after the surgery) instances. The observed confusion matrices of the individual and ensemble classifiers are shown in Table 10, where the positive class means the patient did not survive 1 year after the surgery, whereas the negative class means the patient survived. Using these confusion matrices, the observed probabilities are compared with the predicted probabilities computed from the proposed formulations, as shown in Table 11.
3.5. Discussion & Analysis
The experimental results in Table 5, Table 7, Table 9 and Table 11 are presented as graphs in Figure 2 to facilitate the comparison. From these tables and the figure, it is to note that the predicted performances ($P_{MV}$ and $P_{LD}$) of the ensemble classifiers match the observed performances. These observations are quite encouraging and validate the effectiveness of the proposed formulations for analytical analysis prior to the actual development of an ensemble classifier. Thus, the proposed analytical analysis is quite helpful for deciding which base learners to choose and how many base learners to use. A wise and early decision in this regard saves time, contrary to the empirical approach in which a model is first constructed and then discarded if it is not satisfactory. This is the main benefit of the proposed formulations.
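To illustrate how such a prediction can be made before an ensemble is ever constructed, the following minimal Python sketch evaluates the closed forms sketched above for Equations (3)-(10); the function names and the numeric rates are hypothetical and not taken from the paper's tables:

def predict_majority_voting(tpr, fnr, fpr, tnr):
    # Predicted outcome probabilities for 3-learner majority voting, given
    # per-learner rates (each argument is a list of length 3), assuming the
    # class-conditional rates described in Section 2.4.
    t1, t2, t3 = tpr
    n1, n2, n3 = fnr
    f1, f2, f3 = fpr
    v1, v2, v3 = tnr
    p_tp = t1*t2*t3 + t1*t2*n3 + t1*n2*t3 + n1*t2*t3  # Equation (3) sketch
    p_fn = n1*n2*n3 + n1*n2*t3 + n1*t2*n3 + t1*n2*n3  # Equation (5) sketch
    p_fp = f1*f2*f3 + f1*f2*v3 + f1*v2*f3 + v1*f2*f3  # Equation (7) sketch
    p_tn = v1*v2*v3 + v1*v2*f3 + v1*f2*v3 + f1*v2*v3  # Equation (9) sketch
    return p_tp, p_fn, p_fp, p_tn

def predict_logical_disjunction(tpr, fnr, fpr, tnr):
    # Predicted outcome probabilities for 2-learner logical disjunction
    # (each argument is a list of length 2).
    p_fn = fnr[0] * fnr[1]  # Equation (6) sketch
    p_tn = tnr[0] * tnr[1]  # Equation (10) sketch
    return 1 - p_fn, p_fn, 1 - p_tn, p_tn  # (P_TP, P_FN, P_FP, P_TN)

# Hypothetical base-learner rates (TPR, FNR, FPR, TNR per learner):
tpr, fnr = [0.55, 0.60, 0.50], [0.45, 0.40, 0.50]
fpr, tnr = [0.10, 0.15, 0.12], [0.90, 0.85, 0.88]
print(predict_majority_voting(tpr, fnr, fpr, tnr))
print(predict_logical_disjunction(tpr[:2], fnr[:2], fpr[:2], tnr[:2]))

Comparing the two outputs reproduces the qualitative behaviour discussed below: disjunction yields a lower predicted false negative probability at the cost of a higher predicted false positive probability.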
Examining the nature of the logical disjunction and majority voting ensembles in Equations (3)–(10), it is to note that logical disjunction classifies a sample as positive if any base learner classifies it as positive. This is contrary to majority voting, which needs a majority of base learner decisions to label it as positive. Thereby, logical disjunction decreases the false negative rate at the cost of an increase in the false positive rate, as shown in Table 5, Table 6, Table 7, Table 8, Table 9, Table 10 and Table 11 and Figure 2. Given the cost of false negatives as compared to false positives in disease diagnosis, this tradeoff is quite useful. In these datasets, the scenario is $cost_{FN} > cost_{FP}$, and thereby logical disjunction has been beneficial. In the contrary scenario of $cost_{FP} > cost_{FN}$, logical conjunction would be beneficial.
3.6. Conclusions
This research starts from the consideration of the true positive rate, false negative rate, false positive rate, and true negative rate as probabilities associated with the base learners. Using this information, the concept of conditional joint probability has been applied to derive an analytical model that predicts the performance of ensemble classification techniques such as majority voting and logical disjunction. The derivation of the analytical model shows that the performance of such ensembles can be predicted, from the individual performances of the base learners, even before their actual construction. The experimental observations justify these performance predictions. This analytical approach supports purposeful effort in the construction of an appropriate ensemble classifier, contrary to the empirical approach, which is a trial-based mechanism. Additionally, in the analysis and comparison of the predicted and observed performances, it is observed that for highly imbalanced datasets, logical disjunction is a more appropriate choice than conventional majority voting for ensemble classification. Furthermore, in the case of cost-sensitive classification with highly imbalanced datasets, logical disjunction is even more appropriate than majority voting. This study also shows that the unwanted classification effects of highly imbalanced datasets can be mitigated using logical disjunction-based ensemble classification, contrary to the conventional under-sampling and over-sampling solutions.