1. Introduction
Naive Bayes, within the realm of Bayesian network classifiers, has garnered significant interest and ranks among the top ten traditional algorithms in data mining [
1,
2,
3,
4]. Naive Bayes assumes that the attributes of a given category are independent of each other. This assumption simplifies the computation of the likelihood function and makes it easy to predict the sample category by maximizing the posterior probability. Given a test sample
x with vector
, Naive Bayes predicts the class of the given test sample as follows:
where
d is the number of attributes,
is the attribute value of the
jth attribute,
y is a specific value in random variable
Y, and
is the class label of
x predicted by the Bayesian network classifier.
Despite its popularity, the Naive Bayes algorithm often fails to consider the correlation between features, resulting in inaccurate classification. In response to the limitations of the Naive Bayes algorithm, the AODE (Averaged One-Dependence Estimators) algorithm emerged [
5]. The AODE algorithm is built upon the Bayesian network, which considers the relationship between features when constructing the model. Unlike the Naive independence assumption, AODE does not assume independence between features. In order to consider dependencies between attributes in a limited scope while keeping the network structure simple, AODE allows dependencies between attributes and assumes that they all depend on a common parent attribute, forming a One-Dependence Estimator (ODE) [
6]. Then, by rotating all attributes as parent attributes, the average of the posterior probabilities is used to predict the class of the sample, thus it achieves good results in classification tasks.
In AODE, to enhance both the performance and robustness of classification algorithms, some researchers have proposed cross-validation risk minimization strategies, among which the leave-one-out cross-validation (LOOCV) technique [
7] is a commonly used method for cross-validation risk minimization. For example, Chen et al. [
8] pointed out that the performance of classification algorithms can be evaluated more accurately by a cross-risk minimization strategy to avoid overfitting of training data. The cross-validation risk minimization strategy is a technique used for evaluating and selecting models by estimating the generalization error of the model using cross-validation during model training to select the optimal model. By introducing the cross-validation risk minimization strategy, the AODE algorithm can better adapt to the characteristics of different datasets and improve the generalization ability of the classifier. However, existing cross-validation risk minimization strategies do not consider the differences in attributes in classification decisions, so this paper proposes a Model Selection-based Weighted AODE (SWAODE) algorithm. The SWAODE algorithm adopts the mutual information as the weights for each ODE and evaluates and selects these weighted sub-models using a leave-one-out cross-validation (LOOCV) technique to determine the best model. The AODE algorithm’s classification efficiency is greatly enhanced by this technique, which also boasts strong robustness and broad applicability.
The main contributions of this paper are as follows:
1. The variability between ODEs and between sub-models is fully taken into account by weighting each ODE and selecting the sub-models in this paper. In this way, the quality of each ODE can be evaluated more finely, and the optimal set of models can be selected, which provides a new perspective for the optimization of the AODE algorithm.
2. We propose a new Model Selection-based Weighted AODE (SWAODE) algorithm, which effectively combines the advantages of weighting and model selection. The goal of the SWAODE algorithm is to enhance the performance and robustness of the AODE classification algorithm. The SWAODE algorithm is able to classify data more accurately and improve the model’s ability to generalize by integrating weighting and model selection strategies.
3. This paper compares the SWAODE algorithm with other advanced algorithms using 70 datasets from the UCI repository [
9], along with conducting ablation experiments. Experimental results indicate the superiority of the SWAODE algorithm over other advanced algorithms.
The sections of this paper are structured as follows: In
Section 1, we introduce relevant research focused on improving AODE.
Section 2 discusses AODE and the process of model selection. The SWAODE algorithm is outlined in
Section 3. In
Section 4, we provide a detailed description of the experimental setup and its results. Finally, our conclusions are presented in
Section 5.
3. AODE and Model Selection Analysis
We discuss the Averaged One-Dependence Estimators (AODE) algorithm and its model selection process in this section.
In order to make the paper more readable, we summarize all the symbols that are defined in the paper in
Table 1 for quick reference and understanding by the reader.
3.1. Constructing the AODE Model
AODE only allows one dependence between attributes; attribute
can only depend on some attribute
and category
Y, where
is called the parent attribute of
. At the same time, in order to keep the computation simple, it is assumed that all attributes depend on a common parent attribute
which constitutes a Bayesian network called One-Dependence Estimator (ODE) [
6]. Based on this ODE, the joint probability
can be estimated as:
To eliminate the bias introduced by the selection of the parent attribute, it is allowed that all the attributes can be used as the parent attribute in turn, thus obtaining
d ODEs, and finally, the posteriori probabilities estimated from these
d ODE are averaged to obtain the posterior probability estimate of the sample. Thus, the AODE algorithm calculates the joint probability as:
where
can be obtained from the ratio of
and
, so only the basic probabilities
and
need to be estimated, which can be obtained by an
M-estimation:
where
is the frequency of occurrence of the parameter item in the training dataset.
is the number of values of attribute
,
c is the number of categories, and
m is the smoothing parameter in the M-estimation, which is a commonly used parameter estimation method. By introducing the smoothing parameter
m, the M-estimation can prevent the probability estimates from being zero and improves the robustness of the estimates.
The frequency
F containing class labels and attribute values can be realized in a three-dimensional table in practical implementation, where the first and second dimensions represent the values taken for the first and second attributes. The third dimension represents the values taken for the category, and the values in the table record the frequency values on the values of that dimension. Assuming that there are two attributes
and two categories
where
has two attribute values,
has three attribute values, the frequency table is shown in
Table 2.
The training process of AODE is described by Algorithm 1.
Algorithm 1 AODE training process |
- Require:
Set of training data - Ensure:
Frequency table F containing a combination of category labels and two attributes - 1:
All frequencies in the frequency table F are initialized to zero - 2:
for each training sample from do - 3:
Obtain the class label y for sample and set i and j to zero - 4:
while do - 5:
Read the value of attribute in sample - 6:
while do - 7:
Read the value of attribute in sample - 8:
Find the frequency at positions , , y in the frequency table F and increase by one - 9:
- 10:
end while - 11:
- 12:
end while - 13:
end for - 14:
return the finally calculated frequency table F
|
Algorithm 1 reveals that the AODE training process’s time complexity is influenced by the quantity of samples and attributes, where there are d parent and child attributes, so the total time complexity of the AODE training process is , where d denotes the number of attributes, and n denotes the number of samples for each attribute. AODE typically exhibits a marginally greater time complexity compared to NB, as AODE represents a singular attribute dependency, aligning more closely with the real data. Therefore, the classification performance is significantly improved compared with NB.
3.2. Model Selection-Based AODE
To fully present the Model Selection-based AODE(SAODE) algorithm in this section, the model space is first constructed here. Then, the attributes are ranked based on mutual information. Finally, the best model is selected using the leave-one-out cross-validation error.
3.2.1. Building the Model Space
When constructing the AODE model space, we introduce the threshold
. When a particular value
of the parent attribute in the training data occurs more often than or as often as a threshold
, the ODE model corresponding to that value is included in the computation of the AODE model. If we choose the former
r attributes as parent attributes and the former
s attributes as child attributes, where
, the AODE model is approximated by:
where
is the frequency of
, and
is the minimum frequency that takes the value
. The AODE algorithm with the inclusion of a threshold
improves the overall performance and prediction accuracy of the model by ensuring that the number of samples for the parent attribute is sufficient, avoiding the problems of high variance and unreliable conditional probability estimation caused by data sparsity. This mechanism enables AODE to maintain high predictive stability and reliability in the face of uneven data distribution. When both
r and
s are equal to
d, it can be seen from the formula that when calculating
, at most
subsets of attributes are created as sub-models.
All these approximate AODE models are just a small extension of the previous model. For example, is obtained by adding the child attribute to . All of these models can be applied to test instances in a single nested computation. Thus, all models can be evaluated efficiently.
3.2.2. Attribute Sorting
Constructing a model of the later attributes depends on the model of the earlier attributes when constructing the AODE model space. Therefore, this method of nesting models depends on the order of the attributes. Thus, here, the mutual information is used to sort the attributes. The mutual information is calculated as:
where
is the entropy of
X,
is the conditional entropy,
is the joint probability of
x and
y, and
and
are the probabilities of
x and
y, respectively. Therefore, the MI is used as an indicator of the correlation between attribute
X and category
Y. The larger the value of the MI, the stronger the correlation between attribute
X and category
Y is indicated.
An advantage of employing the MI is its ability to efficiently compute the MI between attributes and classes within a single training session. While the MI can identify the discriminative power of individual attributes, it cannot directly assess the discriminative power of combinations of attributes. However, this shortcoming is compensated for the fact that the ranking based on MI can be searched in a wide model space.
3.2.3. Model Selection
For evaluating the distinctiveness of different models and preventing overfitting, leave-one-out cross-validation errors are employed. Through gradual cross-validation, the impact of the absent sample in each fold is deducted from the frequency table to create a model excluding that sample. The technique offers a lower bias estimate of the generalization error and assesses the model using a single training dataset.
In addition, as shown in Equation (6), these models are nested together, with each model being a straightforward extension of another, providing an effective means of evaluating them. That is, these models can be evaluated simultaneously during their construction for the training samples missed in each fold.
Among the more common methods used to evaluate the model selection are the 0–1 loss, Root-Mean-Square Error (RMSE), LogLoss, and AUC value. For example, Chen [
19] proposed the RMSE as a criterion for model evaluation, where a lower RMSE indicated a better model. Therefore, we also use the RMSE as a criterion for model evaluation as a way to select the optimal model in
Section 3.
4. Model Selection-Based Weighted AODE
In this section, our focus is on the weighting strategy for AODE and the methodology for model selection on the weighted AODE model, given that we have already described the construction methodology of the AODE model in detail in
Section 2.
4.1. Weighting the AODE Model
The contribution of each ODE to the final classification result may be different in the AODE algorithm, and certain sub-models may discriminate more accurately for specific categories while others may perform weakly. Therefore, weighting each sub-model can more accurately reflect its importance in the overall classification process, thus improving the overall model performance [
11].
The classification ability of ODEs composed of different parent attributes
is different, so different weights can be applied to different ODEs [
11]. Thus, the formula transforms into:
where
is the weight of the
jth (
) ODE, where the weights are obtained by calculating the MI through Equation (8). When attribute
X and category
Y are completely independent, the MI is 0, indicating that there is no information sharing or dependency between them.
Weighting each ODE can better improve the performance and robustness of the overall model, thus enhancing the reliability and validity of the model in practical applications.
4.2. Model Selection for WAODE
We first construct the model space of WAODE in this subsection. Then, the attributes are ranked according to the MI. Finally, we use the RMSE as the cross-validation error and select the optimal sub-model by minimizing the RMSE.
4.2.1. Building the Model Space
As shown in Equation (6), for the WAODE algorithm, where we also introduce the threshold
, the joint probability is given by:
Similar to the AODE model’s construction in
Section 2, all of these approximate WAODE models constitute the model space depicted in
Table 3, since each model is simply a small extension of the previous one. Thus, every model is capable of being assessed with efficiency.
4.2.2. Attribute Sorting
In constructing the WAODE model space, the model for later attributes is dependent on the model for earlier attributes, as shown in
Table 3. This approach of nested models is influenced by the order in which the attributes are considered. To address this, we utilize the MI to rank the attributes. Additionally, we observe that the sorting process also involves selecting attributes, and by sorting first, we can more easily identify the attributes that have a significant impact on categorization. To calculate the MI, we use Equation (8).
4.2.3. Model Selection
We used a 10-fold CV in our experiments to make the results more objective, and we used the LOOCV error as a criterion for model selection.
Figure 1 describes the relationship between LOOCV and 10-fold CV. The test set in the 10-fold CV loops through the 10 folds of samples. At the same time, the test instance in LOOCV loops through all the LOOCV instances.
LOOCV errors were used to evaluate model distinctiveness and prevent overfitting by excluding one sample at a time from the training data and assessing the model without it. This method provides a lower bias estimate of the generalization error and evaluates the model using nearly all available data for training.
The 0–1 loss (ZOL) and the Root-Mean-Square Error (RMSE) are the most common evaluation criteria used as model selection. The 0–1 loss simply assigns “0” to correct classifications and “1” to misclassifications, considering all misclassifications as equally undesirable. However, the RMSE is more sensitive to the severity of misclassification, so it is able to make more fine-grained probabilistic predictions. The RMSE can be expressed as:
where
is the true class of sample
. The smaller the RMSE, the smaller the discrepancy between the model’s predictions and the true labels. Compared to the 0–1 loss, the RMSE is able to assess model uncertainty on a continuous basis rather than simply telling us whether the model is classified correctly or not. Meanwhile, the RMSE penalizes model uncertainty more strictly, so it provides a more fine-grained calibration metric for probability estimation. Consequently, the RMSE was employed to assess potential models in our study.
Therefore, the process of choosing the best model can be framed as the following optimization problem:
where
can be computed by first estimating
from training set
as in Equation (9), and then normalizing across all possible
y’s.
4.3. Algorithm Description
Utilizing the aforementioned method, we formulated a training algorithm for the Selection-based Weighted AODE (SWAODE) model, as shown in Algorithm 2.
Algorithm 2 Training algorithm for Model Selection-based Weighted AODE (SWAODE) |
- 1:
First pass: Form the table of joint frequencies of all combinations of x attribute values and the class label as in Algorithm 1 - 2:
Compute the mutual information - 3:
Weighting each ODE and rank the attributes - 4:
Second pass: Perform LOOCV on the sample set - 5:
for sample do - 6:
Remove sample x from the frequency table - 7:
Obtain category for sample x and set i and j to zero - 8:
Build models for AODE - 9:
for do - 10:
for do - 11:
Predict sample x using all models in Equation (9) - 12:
Accumulate the squared error for each model - 13:
- 14:
end for - 15:
- 16:
end for - 17:
Add sample x back to the frequency table - 18:
end for - 19:
Compute the root-mean-square error for each model - 20:
Select the model with the lowest RMSE
|
The SWAODE algorithm for weighting and model selection needs to consider the time complexity involved in computing the MI and LOOCV, respectively, where the total time complexity of calculating the MI in the WAODE-MI algorithm is , and the time complexity of model selection with LOOCV is , so the total time complexity of the SWAODE algorithm is , which is almost similar to the time complexity of the SAODE algorithm, where d is the number of attributes, n is the number of samples, and c is the number of categories.
5. Experiments and Discussion
We ran the above algorithm on 70 datasets from the UCI repository [
9]. The comprehensive features of the datasets are presented in
Table 4, arranged in an increasing sequence based on the count of instances. The experiments were carried out based on the high-performance computing platform of Nanjing Audit University, the computing node CPU was an Intel E5, the amount of memory was 188 G, and the operating system was CentOS7.9-x64. The algorithms were based on the Petal Machine Learning Platform [
19] implemented in C++. Compared to the well-known machine learning experimental platform Weka [
20], the Petal platform has one significant difference: missing values are viewed as a single value in Petal, whereas the Weka system employs means (numerical attributes) or modes (discrete attributes) instead.
5.1. Comparison on ZOL
In this experiment, in order to verify the performance of the SWAODE algorithm, we compared it with classical algorithms such as NB [
1], KDB (k = 1) [
18], AODE [
5], WAODE-MI [
11], WAODE-KL [
21], and so on. We adopted ZOL as the evaluation index, where the loss was one when the sample was misclassified and zero when it was correctly classified, and then we calculated the percentage of total loss in the total number of test samples in order to comprehensively assess the performance of different algorithms in the classification task. The W/D/L metrics tracked the number of wins, draws, and losses for each algorithm across multiple datasets, allowing for a comparison of their performance on the same dataset. For instance, SWAODE demonstrated strong performance with 52 wins, 5 draws, and 13 losses when compared to NB, providing an objective assessment of the algorithms’ respective strengths and weaknesses. Through this professional evaluation method, we can more comprehensively and objectively assess the advantages of the SWAODE algorithm over other algorithms and provide a scientific basis for its performance evaluation, as shown in
Table 5. Meanwhile, in order to facilitate the observation of SWAODE’s experimental data, we bolded the row where SWAODE is located in all subsequent tables and presents the experimental data for each dataset in Appendix
Table A1.
The analysis in
Table 5 reveals that the SWAODE algorithm outperformed other advanced algorithms. Compared to the AODE algorithm, the SWAODE algorithm achieved 39 wins, 8 draws, and 23 losses, representing a significant improvement. Additionally, the weighted AODE classification algorithm also showed improvement when different weights were assigned. The WAODE-KL algorithm, which uses KL divergence as weights, achieved 37 wins, 13 draws, and 20 losses compared to the AODE algorithm, demonstrating a clear advantage. However, even the excellent WAODE-KL algorithm did not surpass our new algorithm, the SWAODE algorithm. In comparison, the SWAODE algorithm achieved 32 wins, 13 draws, and 24 losses, showing a clear advantage. Overall, our SWAODE algorithm demonstrated strong performance and brought significant improvement to the classification task.
We also represented the scatter plot of SWAODE with respect to WAODE-MI in terms of ZOL in
Figure 2. Points above the diagonal represent datasets whose ZOL values are lower than those of WAODE-MI. It can be found that SWAODE consistently provided better predictions than the regular WAODE-MI in a statistically significant way.
5.2. Comparison on LogLoss
When assessing the effectiveness of the SWAODE algorithm, it is common to use LogLoss as an evaluation metric. LogLoss is a widely used metric for evaluating the predictive accuracy of a classification model. It measures the deviation between the model’s predicted probability for each sample and the actual label. To compare the SWAODE algorithm with other advanced algorithms, their LogLoss values on a test dataset can be calculated and visualized. Additionally, the W/D/L (win/draw/loss) metric can be used to analyze the strengths and weaknesses of different algorithms in the experimental results. By comparing the LogLoss values of the SWAODE algorithm with those of other algorithms, the strengths and weaknesses on different datasets can be determined, as shown in
Table 6. Meanwhile, we also show the experimental data in detail in Appendix
Table A2.
According to the data analysis in
Table 6, the SWAODE algorithm presented excellent performance on LogLoss. In the comparison with AODE, it achieved 48 wins/1 draw/21 losses. In addition, the SWAODE algorithm also performed outstandingly compared to the weighted AODE algorithm. The SWAODE algorithm beat the WAODE-MI algorithm and the WAODE-KL algorithm on 42 of the 70 datasets, respectively. These results show that the SWAODE algorithm is adaptable to various datasets and outperforms other algorithms in most cases. Therefore, the SWAODE algorithm is a very efficient algorithm for AODE improvement.
Meanwhile, we also represented the scatter plot of SWAODE with respect to WAODE-MI in terms of LogLoss in
Figure 3. It can be found that SWAODE consistently provided better predictions than the regular WAODE-MI algorithm in a statistically significant way.
5.3. Ablation Studies
To delve deeper into the necessity of weighting and model selection for the AODE classification algorithm, we conducted two ablation study experiments to validate its impact in this section, again using W/D/L (win/draw/loss) as the measure. These experiments aimed to dissect the performance of the SWAODE algorithm in the absence of weighting and model selection, thus highlighting the crucial role of weighting and model selection in improving the classification performance of SWAODE. In our experiments, we implemented the WAODE-MI algorithm, which uses MI as a weight, and the SAODE algorithm, which performs model selection on AODE. The SWAODE algorithm was compared with these two algorithms in terms of ZOL and LogLoss metrics.
According to
Table 7, the SWAODE algorithm achieved 34 wins/12 draws/24 losses and 42 wins/7 draws/21 losses in the two comparisons with the WAODE-MI algorithm. It also performed well in the comparison with SAODE, achieving 27 wins/22 draws/21 losses and 40 wins/4 draws/26 losses, respectively. Therefore, we can conclude that both weighting and model selection are necessary and indispensable in the SWAODE algorithm, and the algorithm is able to fully draw on the advantages of weighting and model selection to greatly improve the classification performance of the AODE algorithm.
6. Conclusions
This study proposed a new AODE classification algorithm, the SWAODE algorithm, which aimed to solve the problem of existing cross-validation risk minimization strategies not considering the difference in attributes in classification decisions. The core idea of the algorithm lay in first weighting each ODE in the AODE which used the MI values as the weights. Subsequently, a leave-one-out cross-validation (LOOCV) method was used to perform model selection on these weighted sub-models in order to select the optimal model. Experimental results indicated the SWAODE algorithm markedly surpassed other well-known popular classification algorithms on multiple datasets, exhibiting higher classification efficiency and generalization ability.
However, we recognize that this is only one aspect of model selection and that many potential extensions deserve further exploration. The next step of our work will focus on exploring the extension of attribute-weighted AODE classification models. Overall, further exploration of attribute-weighted AODE classification models is a challenging but promising research direction. By delving into this area, we hope to bring innovative ideas and tools to research related to machine learning and data mining.