1. Introduction
Due to its simplicity, efficiency, and effectiveness, naive Bayes (NB) has been widely used to analyze and solve many scientific and engineering problems, such as text classification [1,2], the resistance of buildings [3], the identification of areas susceptible to flooding [4], and urban flooding prediction [5]. Text classification is the task of assigning a text document to a pre-specified class, and it has been widely applied in many real-world fields, such as spam filtering and short message service (SMS) filtering [2,6]. With the exponential growth of text data in various fields, text classification has attracted more and more attention from researchers in recent years. To address text classification tasks, text documents are generally characterized by all of the words that occur in them. Because of the large numbers of documents, the large numbers of words, and the strong dependencies among these words, accurate and fast text classification presents unique challenges.
Without question, treating each word as a Boolean variable is the simplest approach to applying machine learning to text classification. Based on this idea, multi-variate Bernoulli naive Bayes (BNB) [7] was proposed as the first statistical language model. BNB represents a document using a vector of binary feature variables that indicates whether or not each word occurs in the document and thus ignores the frequency information of each occurring word. To capture this frequency information, multinomial naive Bayes (MNB) [8] was proposed. Ref. [8] proved that MNB achieves, on average, a 27% reduction in the error rate compared to BNB at any vocabulary size. However, when the number of training documents of one class is much greater than those of the others, MNB tends to select poor weights for the decision boundary. To balance the amount of training data per class and to address the problem of skewed training data, a complement variant of MNB called complement NB (CNB) was proposed [9]. As a combination of MNB and CNB, OVA [9] was proposed.
Given a test document $d$, which is generally represented by a word vector $\langle w_1, w_2, \ldots, w_m \rangle$, MNB, CNB, and OVA classify it with Equations (1)–(3), respectively.
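In their standard log-space forms, which Equations (1)–(3) are assumed to follow (cf. [8,9]), the three decision rules can be written as:

c_{\mathrm{MNB}}(d) = \arg\max_{c \in C} \Big[ \log P(c) + \sum_{i=1}^{m} f_i \log P(w_i \mid c) \Big]  (1)

c_{\mathrm{CNB}}(d) = \arg\max_{c \in C} \Big[ \log P(c) - \sum_{i=1}^{m} f_i \log P(w_i \mid \overline{c}) \Big]  (2)

c_{\mathrm{OVA}}(d) = \arg\max_{c \in C} \Big[ \log P(c) + \sum_{i=1}^{m} f_i \big( \log P(w_i \mid c) - \log P(w_i \mid \overline{c}) \big) \Big]  (3)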
where $c$ is each possible class label, $C$ is the set of all classes, $\overline{c}$ is the complement class of $c$ (i.e., all classes except $c$), $m$ is the number of different words in the text collection, $w_i$ is the $i$th word that occurs in $d$, and $f_i$ is the frequency count of the word $w_i$ in $d$. The prior probabilities $P(c)$ and $P(\overline{c})$ are computed in Equations (4) and (5), respectively, and the conditional probabilities $P(w_i \mid c)$ and $P(w_i \mid \overline{c})$ are computed in Equations (6) and (7), respectively.
where $s$ is the number of classes, $n$ is the number of training documents, $c_j$ is the class label of the $j$th training document, $f_{ji}$ is the frequency count of the $i$th word in the $j$th training document, and $\delta(c_j, c)$ and $\delta(c_j, \overline{c})$ are two indicator functions defined in Equations (8) and (9).
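Under the usual Laplace smoothing, Equations (4)–(9) are assumed to take the following forms (a sketch; the exact normalizations used by the authors may differ slightly):

P(c) = \frac{\sum_{j=1}^{n} \delta(c_j, c) + 1}{n + s}  (4)

P(\overline{c}) = \frac{\sum_{j=1}^{n} \delta(c_j, \overline{c}) + 1}{n + s}  (5)

P(w_i \mid c) = \frac{\sum_{j=1}^{n} f_{ji}\, \delta(c_j, c) + 1}{\sum_{i'=1}^{m} \sum_{j=1}^{n} f_{ji'}\, \delta(c_j, c) + m}  (6)

P(w_i \mid \overline{c}) = \frac{\sum_{j=1}^{n} f_{ji}\, \delta(c_j, \overline{c}) + 1}{\sum_{i'=1}^{m} \sum_{j=1}^{n} f_{ji'}\, \delta(c_j, \overline{c}) + m}  (7)

\delta(c_j, c) = 1 \text{ if } c_j = c, \text{ and } 0 \text{ otherwise}  (8)

\delta(c_j, \overline{c}) = 1 \text{ if } c_j \neq c, \text{ and } 0 \text{ otherwise}  (9)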
Due to their simplicity, efficiency, and efficacy, MNB and its variants, including CNB and OVA, have been widely used for text classification. However, as in naive Bayes (NB), the assumption of conditional independence among the attributes (i.e., features) that they require is usually violated and therefore reduces their classification accuracy. To alleviate this assumption, many approaches have been proposed, which can be divided into five categories [10,11]: (1) feature weighting; (2) feature selection; (3) instance weighting; (4) instance selection; (5) structure extension.
Among these approaches, structure extension has attracted far less attention from researchers. To the best of our knowledge, only structure-extended multinomial naive Bayes (SEMNB) [12] has been proposed so far. SEMNB averages all weighted super-parent one-dependence multinomial estimators and is therefore an ensemble learning model. In this paper, we propose a single model called hidden multinomial naive Bayes (HMNB). HMNB creates a hidden parent for each feature, which synthesizes the influences of all the other qualified features. To learn HMNB, we propose a simple but effective learning algorithm that does not incur a high-computational-complexity structure-learning process. Our improved idea can also be used to improve CNB and OVA, and the resulting models are simply denoted as HCNB and HOVA, respectively. Extensive experiments on eleven benchmark text classification datasets show that the proposed HMNB, HCNB, and HOVA significantly outperform their state-of-the-art competitors.
To sum up, the main contributions of our work include the following:
We conducted a comprehensive survey on MNB extensions. Based on the survey, existing work can be divided into five categories: feature weighting, feature selection, instance weighting, instance selection, and structure extension.
We found that structure extension has attracted much less attention from researchers, and only SEMNB has been proposed so far; however, SEMNB is an ensemble learning model.
We proposed a single model called hidden MNB (HMNB) by adapting the well-known hidden NB (HNB). HMNB creates a hidden parent for each feature, which synthesizes the influences of all of the other qualified features. To learn HMNB, we proposed a simple but effective learning algorithm that does not incur a high-computational-complexity structure-learning process. Using the same idea, we also proposed HCNB and HOVA.
The extensive experiments on eleven benchmark text classification datasets validate the effectiveness of HMNB, HCNB, and HOVA.
The remainder of this paper is organized as follows.
Section 2 conducts a compact survey on five categories of existing approaches.
Section 3 describes our proposed models in detail.
Section 4 presents the experimental setup and results.
Section 5 draws conclusions and outlines the main directions for future work.
3. The Proposed Models
Structure extension is not new to the Bayesian learning community, and especially not to the semi-naive Bayesian learning community [26,27]. Researchers have proposed many state-of-the-art structure-extended naive Bayes models, such as tree-augmented naive Bayes (TAN) [24] and its variants [28,29]. However, when the structure extension approach is applied to high-dimensional text classification data, a key issue that must be addressed is its high-computational-complexity structure-learning process. This is the main reason why structure extension has attracted less attention from researchers. To the best of our knowledge, only structure-extended multinomial naive Bayes (SEMNB) [12] has been proposed so far. SEMNB averages all weighted super-parent one-dependence multinomial estimators and thus skillfully avoids a high-computational-complexity structure-learning process. Extensive experiments on a large number of text classification datasets have validated its effectiveness. However, SEMNB is undeniably an ensemble learning model. Therefore, a simple but effective single model that does not incur a high-computational-complexity structure-learning process is still desirable. This is our paper's main motivation.
To maintain NB's simplicity and efficiency while alleviating its assumption of the attributes' conditional independence, hidden naive Bayes (HNB) [30] has achieved remarkable classification performance. Inspired by the success of HNB, in this paper, we adapt it to text classification tasks and call the adapted model hidden multinomial naive Bayes (HMNB). In HMNB, a hidden parent $w_{hp_i}$ is created for each present word $w_i$, which combines the influences of all of the other present qualified words $w_j$ ($j \neq i$). Now, given a test document $d$, HMNB classifies it by using Equation (10).
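Equation (10) is assumed to take the MNB-style form of Equation (1), with the hidden-parent conditional in place of $P(w_i \mid c)$:

c(d) = \arg\max_{c \in C} \Big[ \log P(c) + \sum_{i=1}^{m} f_i \log P(w_i \mid w_{hp_i}, c) \Big]  (10)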
where the hidden-parent conditional probability $P(w_i \mid w_{hp_i}, c)$ is computed with Equation (11).
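Following the weighted-average form used in HNB [30], Equation (11) is assumed to be (the weights $W_j$ are defined below):

P(w_i \mid w_{hp_i}, c) = \frac{\sum_{j=1, j \neq i}^{m} W_j \, P(w_i \mid w_j, c)}{\sum_{j=1, j \neq i}^{m} W_j}  (11)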
where $W_j$ indicates the importance of each possible parent word $w_j$ in the hidden parent $w_{hp_i}$. For simplicity, we therefore define it as the gain ratio $GR(D, w_j)$ of the word $w_j$ when it splits the training data $D$. At the same time, we only select the words $w_j$ whose gain ratio is above the average $\overline{GR}$ of all words as potential parents. The detailed calculation formulas are given in Equations (12) and (13).
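Assuming the standard C4.5-style gain ratio computed over the binary absence/presence split induced by $w_j$ (the authors' exact definitions may differ slightly):

GR(D, w_j) = \frac{Entropy(D) - \sum_{v \in \{0,1\}} \frac{|D_v|}{|D|}\, Entropy(D_v)}{-\sum_{v \in \{0,1\}} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}}  (12)

\overline{GR} = \frac{1}{m} \sum_{j=1}^{m} GR(D, w_j)  (13)

Here, $Entropy(\cdot)$ denotes the class entropy of a document set.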
where $v \in \{0, 1\}$ and $D_v$ is the subset of $D$ in which the word $w_j$ takes the value $v$: $v = 0$ indicates the absence of $w_j$, and $v = 1$ indicates the presence of $w_j$.
Now, the only thing left is the efficient calculation of $P(w_i \mid w_j, c)$, i.e., the conditional probability that $w_i$ appears given $w_j$ and $c$. It is well known that estimating $P(w_i \mid w_j, c)$ directly from $D$ requires tables of the joint frequencies of each pair of words and each class, whose size is quadratic in the vocabulary size. For text classification tasks, $m$ (the vocabulary size of the text collection) is often far too large to store such tables, from which the conditional probability $P(w_i \mid w_j, c)$ would be estimated. At the same time, text data usually take the form of a sparse matrix, and therefore the number of different words present in a given document $d$, simply denoted by $m_d$, is much smaller than $m$. Therefore, as in SEMNB [12], we also transform a part of the training space consumption into classification time consumption. In more detail, we move the step of computing $P(w_i \mid w_j, c)$ from the training stage to the classification stage. At the classification stage, when a test document $d$ is predicted, $P(w_i \mid w_j, c)$ is computed according to $D$ and $d$. More specifically, given a word $w_j$ in $d$, we only select the training documents in which $w_j$ occurs to compute $P(w_i \mid w_j, c)$ by using Equation (14), so that the required space depends only on the much smaller $m_d$ rather than on $m$.
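A plausible Laplace-smoothed form for Equation (14), estimated only from $D_j$ (the training documents in which $w_j$ occurs), where $f_{ki}$ denotes the frequency count of $w_i$ in training document $d_k$ and $c_k$ is its class label:

P(w_i \mid w_j, c) = \frac{\sum_{d_k \in D_j} f_{ki}\, \delta(c_k, c) + 1}{\sum_{i'=1}^{m} \sum_{d_k \in D_j} f_{ki'}\, \delta(c_k, c) + m}  (14)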
In summary, the whole algorithm for learning HMNB is partitioned into a training algorithm (HMNB-Training) and a classification algorithm (HMNB-Classification), which are described by Algorithms 1 and 2, respectively. In what follows, $n$ is the number of training documents, $m$ is the number of different words in the text collection, $s$ is the number of classes, and $m_d$ is the number of different words present in a given document $d$. The training time of Algorithm 1 is dominated by estimating the class priors and the gain ratio of every word, which requires a pass over the training data, while the classification time of Algorithm 2 additionally grows with $m_d$, because the pairwise conditional probabilities of the words present in $d$ are estimated at classification time.
Algorithm 1: HMNB-Training (D).
Input: $D$—training data
Output: $P(c)$ and $W_j$ ($j = 1, 2, \ldots, m$)
1: for each class $c$ do
2:   Use Equation (4) to compute $P(c)$ from $D$;
3: end for
4: for each word $w_j$ ($j = 1, 2, \ldots, m$) from $D$ do
5:   Compute $GR(D, w_j)$ using Equation (12);
6: end for
7: Compute the averaged gain ratio $\overline{GR}$ of all words using Equation (13);
8: if $GR(D, w_j) > \overline{GR}$ then
9:   $W_j = GR(D, w_j)$;
10: else
11:   $W_j = 0$;
12: end if
13: Return $P(c)$ and $W_j$ ($j = 1, 2, \ldots, m$)
Algorithm 2: HMNB-Classification (d, D, $P(c)$, $W_j$).
Input: $d$—a test document, $D$—training data, and the computed $P(c)$ and $W_j$
Output: the predicted class label of $d$
1: for each word $w_i$ ($i = 1, 2, \ldots, m_d$) in $d$ do
2:   for each word $w_j$ ($j = 1, 2, \ldots, m_d$) in $d$ do
3:     Denote all training documents in which $w_j$ occurs as $D_j$;
4:     for each class $c$ do
5:       Compute $P(w_i \mid w_j, c)$ from $D_j$ using Equation (14);
6:     end for
7:   end for
8: end for
9: Use $W_j$ and $P(w_i \mid w_j, c)$ to compute $P(w_i \mid w_{hp_i}, c)$ with Equation (11);
10: Use $P(c)$ and $P(w_i \mid w_{hp_i}, c)$ to predict the class label of $d$ with Equation (10);
11: Return the predicted class label
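To make Algorithms 1 and 2 concrete, the following is a minimal NumPy sketch of the HMNB training and classification steps, written from the textual description above (gain-ratio weights with the above-average filter, Laplace smoothing, and pairwise conditional probabilities estimated at classification time from the documents containing each candidate parent). All function and variable names are our own illustration, not the authors' implementation, and the exact smoothing may differ from Equations (4)–(14).

import numpy as np

def entropy(counts):
    """Entropy (in bits) of a vector of non-negative counts."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts / total
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def train_hmnb(X, y, n_classes):
    """X: (n_docs, n_words) term-count matrix; y: integer class labels in [0, n_classes)."""
    n_docs, n_words = X.shape
    # Laplace-smoothed class priors (cf. Equation (4)).
    class_counts = np.bincount(y, minlength=n_classes).astype(float)
    priors = (class_counts + 1.0) / (n_docs + n_classes)
    # Gain ratio of each word, treating the word as a binary absent/present split of D.
    base_h = entropy(class_counts)
    gain_ratio = np.zeros(n_words)
    present = X > 0
    for j in range(n_words):
        mask = present[:, j]
        n_p = int(mask.sum())
        n_a = n_docs - n_p
        if n_p == 0 or n_a == 0:
            continue  # degenerate split: leave the gain ratio at 0
        h_p = entropy(np.bincount(y[mask], minlength=n_classes).astype(float))
        h_a = entropy(np.bincount(y[~mask], minlength=n_classes).astype(float))
        gain = base_h - (n_p / n_docs) * h_p - (n_a / n_docs) * h_a
        split_info = entropy(np.array([n_p, n_a], dtype=float))
        gain_ratio[j] = gain / split_info
    # Only words whose gain ratio is above the average keep a non-zero weight.
    weights = np.where(gain_ratio > gain_ratio.mean(), gain_ratio, 0.0)
    return priors, weights

def classify_hmnb(doc, X, y, priors, weights):
    """doc: (n_words,) term-count vector of the test document; returns the predicted class."""
    n_docs, n_words = X.shape
    n_classes = len(priors)
    present_idx = np.flatnonzero(doc > 0)
    log_scores = np.log(priors).copy()
    for c in range(n_classes):
        Xc = X[y == c]                          # training documents of class c
        for i in present_idx:
            # Hidden-parent conditional: weighted average of P(w_i | w_j, c) over the
            # other qualified words w_j that are present in the test document.
            num = den = 0.0
            for j in present_idx:
                if j == i or weights[j] == 0.0:
                    continue
                Dj = Xc[Xc[:, j] > 0]           # class-c documents in which w_j occurs
                p_ij = (Dj[:, i].sum() + 1.0) / (Dj.sum() + n_words)  # Laplace smoothing
                num += weights[j] * p_ij
                den += weights[j]
            if den > 0.0:
                p = num / den
            else:
                # No qualified parent available: fall back to the plain MNB estimate.
                p = (Xc[:, i].sum() + 1.0) / (Xc.sum() + n_words)
            log_scores[c] += doc[i] * np.log(p)
    return int(np.argmax(log_scores))

A dense term-count matrix is used here for clarity; an actual implementation (e.g., within WEKA, as in the experiments below) would rely on sparse counts and cache per-class statistics.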
Our improved idea can also be used to improve CNB and OVA. The resulting models are denoted as HCNB and HOVA, respectively. Given a test document $d$, HCNB and HOVA use Equations (15) and (16) to classify it.
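By analogy with Equations (2), (3), and (10), Equations (15) and (16) are assumed to take the following forms:

c_{\mathrm{HCNB}}(d) = \arg\max_{c \in C} \Big[ \log P(c) - \sum_{i=1}^{m} f_i \log P(w_i \mid w_{hp_i}, \overline{c}) \Big]  (15)

c_{\mathrm{HOVA}}(d) = \arg\max_{c \in C} \Big[ \log P(c) + \sum_{i=1}^{m} f_i \big( \log P(w_i \mid w_{hp_i}, c) - \log P(w_i \mid w_{hp_i}, \overline{c}) \big) \Big]  (16)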
where $P(w_i \mid w_{hp_i}, \overline{c})$ is computed with Equation (17).
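Presumably the complement analogue of Equation (11):

P(w_i \mid w_{hp_i}, \overline{c}) = \frac{\sum_{j=1, j \neq i}^{m} W_j \, P(w_i \mid w_j, \overline{c})}{\sum_{j=1, j \neq i}^{m} W_j}  (17)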
where $P(w_i \mid w_j, \overline{c})$ is computed with Equation (18).
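Presumably the complement analogue of Equation (14), again estimated only from $D_j$:

P(w_i \mid w_j, \overline{c}) = \frac{\sum_{d_k \in D_j} f_{ki}\, \delta(c_k, \overline{c}) + 1}{\sum_{i'=1}^{m} \sum_{d_k \in D_j} f_{ki'}\, \delta(c_k, \overline{c}) + m}  (18)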
Similarly to HMNB, the algorithms for learning HCNB and HOVA are also partitioned into training algorithms (HCNB-Training and HOVA-Training) and classification algorithms (HCNB-Classification and HOVA-Classification). They are described by Algorithms 3–6, respectively. From Algorithms 3–6, we can see that the time complexities of HCNB and HOVA are almost the same as that of HMNB.
Algorithm 3: HCNB-Training (D).
Input: $D$—training data
Output: $P(\overline{c})$ and $W_j$ ($j = 1, 2, \ldots, m$)
1: for each class $c$ do
2:   Use Equation (5) to compute $P(\overline{c})$ from $D$;
3: end for
4: for each word $w_j$ ($j = 1, 2, \ldots, m$) from $D$ do
5:   Compute $GR(D, w_j)$ using Equation (12);
6: end for
7: Compute the averaged gain ratio $\overline{GR}$ of all words using Equation (13);
8: if $GR(D, w_j) > \overline{GR}$ then
9:   $W_j = GR(D, w_j)$;
10: else
11:   $W_j = 0$;
12: end if
13: Return $P(\overline{c})$ and $W_j$ ($j = 1, 2, \ldots, m$)
Algorithm 4: HCNB-Classification (d, D, $P(\overline{c})$, $W_j$).
Input: $d$—a test document, $D$—training data, and the computed $P(\overline{c})$ and $W_j$
Output: the predicted class label of $d$
1: for each word $w_i$ ($i = 1, 2, \ldots, m_d$) in $d$ do
2:   for each word $w_j$ ($j = 1, 2, \ldots, m_d$) in $d$ do
3:     Denote all training documents in which $w_j$ occurs as $D_j$;
4:     for each class $c$ do
5:       Compute $P(w_i \mid w_j, \overline{c})$ from $D_j$ using Equation (18);
6:     end for
7:   end for
8: end for
9: Use $W_j$ and $P(w_i \mid w_j, \overline{c})$ to compute $P(w_i \mid w_{hp_i}, \overline{c})$ with Equation (17);
10: Use $P(\overline{c})$ and $P(w_i \mid w_{hp_i}, \overline{c})$ to predict the class label of $d$ with Equation (15);
11: Return the predicted class label
Algorithm 5: HOVA-Training (D).
Input: $D$—training data
Output: $P(c)$, $P(\overline{c})$, and $W_j$ ($j = 1, 2, \ldots, m$)
1: for each class $c$ do
2:   Use Equation (4) to compute $P(c)$ from $D$;
3:   Use Equation (5) to compute $P(\overline{c})$ from $D$;
4: end for
5: for each word $w_j$ ($j = 1, 2, \ldots, m$) from $D$ do
6:   Compute $GR(D, w_j)$ using Equation (12);
7: end for
8: Compute the averaged gain ratio $\overline{GR}$ of all words using Equation (13);
9: if $GR(D, w_j) > \overline{GR}$ then
10:   $W_j = GR(D, w_j)$;
11: else
12:   $W_j = 0$;
13: end if
14: Return $P(c)$, $P(\overline{c})$, and $W_j$ ($j = 1, 2, \ldots, m$)
Algorithm 6: HOVA-Classification (d, D, $P(c)$, $P(\overline{c})$, $W_j$).
Input: $d$—a test document, $D$—training data, and the computed $P(c)$, $P(\overline{c})$, and $W_j$
Output: the predicted class label of $d$
1: for each word $w_i$ ($i = 1, 2, \ldots, m_d$) in $d$ do
2:   for each word $w_j$ ($j = 1, 2, \ldots, m_d$) in $d$ do
3:     Denote all training documents in which $w_j$ occurs as $D_j$;
4:     for each class $c$ do
5:       Compute $P(w_i \mid w_j, c)$ from $D_j$ using Equation (14);
6:       Compute $P(w_i \mid w_j, \overline{c})$ from $D_j$ using Equation (18);
7:     end for
8:   end for
9: end for
10: Use $W_j$ and $P(w_i \mid w_j, c)$ to compute $P(w_i \mid w_{hp_i}, c)$ with Equation (11);
11: Use $W_j$ and $P(w_i \mid w_j, \overline{c})$ to compute $P(w_i \mid w_{hp_i}, \overline{c})$ with Equation (17);
12: Use $P(c)$, $P(\overline{c})$, $P(w_i \mid w_{hp_i}, c)$, and $P(w_i \mid w_{hp_i}, \overline{c})$ to predict the class label of $d$ with Equation (16);
13: Return the predicted class label
4. Experiments and Results
To validate the effectiveness of the proposed HMNB, HCNB, and HOVA, we designed and conducted three groups of experiments. The first group compared HMNB with MNB, MNB, GRSMNB, DWMNB, MNBTree, and SEMNB. The second group compared HCNB with CNB, CNB, GRSCNB, DWCNB, CNBTree, and SECNB. The third group compared HOVA with OVA, OVA, GRSOVA, DWOVA, OVATree, and SEOVA. We used the existing implementations of MNB and CNB in the Waikato Environment for Knowledge Analysis (WEKA) platform [31] and implemented all of the other models on the same platform [31].
We conducted our three groups of experiments on eleven well-known text classification datasets published on the homepage of the WEKA platform [31], which cover a wide range of text classification characteristics. Table 1 lists detailed information about these eleven datasets. All of them were obtained from OHSUMED-233445, Reuters-21578, TREC, and the WebACE project, and Ref. [32] originally converted them into term counts.
Table 2, Table 3 and Table 4 compare the accuracy of each model on each dataset, where each entry is the classification accuracy averaged over ten runs of 10-fold cross-validation. We then used two-tailed t-tests at the 95% confidence level [33] to compare the proposed HMNB, HCNB, and HOVA with each of their competitors. In these tables, the symbols • and ∘ denote a statistically significant improvement or degradation with respect to the corresponding competitor, respectively. The averaged classification accuracies and the win/tie/loss (W/T/L) records are summarized at the bottom of the tables. The averaged classification accuracy of each model across all datasets provides a gross indicator of relative classification performance in addition to the other statistics. Each W/T/L record indicates that, compared to the corresponding competitor, HMNB, HCNB, or HOVA won on W datasets, tied on T datasets, and lost on L datasets.
Based on the accuracy comparisons presented in Table 2, Table 3 and Table 4, we then used the KEEL software [34] to perform the Wilcoxon signed-rank test [35,36] in order to thoroughly compare each pair of models. The Wilcoxon signed-rank test ranks the differences in the performance of two classification models on each dataset, ignoring the signs, and compares the rank sums of the positive ($R^+$) and negative ($R^-$) differences [35,36]. According to the table of exact critical values for the Wilcoxon test, for the chosen confidence level and the eleven datasets used here, we speak of two classification models as being “significantly different” if the smaller of $R^+$ and $R^-$ is equal to or less than 11, and thus, we reject the null hypothesis.
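As an illustration, the rank sums $R^+$ and $R^-$ and the associated p-value can also be computed with SciPy instead of KEEL; the accuracy vectors below are hypothetical placeholders, not the values reported in Tables 2–4.

import numpy as np
from scipy.stats import rankdata, wilcoxon

# Hypothetical per-dataset accuracies of two models on eleven datasets (placeholders only).
acc_a = np.array([0.86, 0.91, 0.83, 0.88, 0.79, 0.92, 0.85, 0.90, 0.81, 0.87, 0.84])
acc_b = np.array([0.83, 0.90, 0.80, 0.86, 0.78, 0.90, 0.83, 0.88, 0.80, 0.85, 0.82])

# Rank the absolute per-dataset differences and sum the ranks of the positive and
# negative differences (R+ and R-), as described for the Wilcoxon signed-rank test.
diff = acc_a - acc_b
ranks = rankdata(np.abs(diff))
r_plus = ranks[diff > 0].sum() + 0.5 * ranks[diff == 0].sum()
r_minus = ranks[diff < 0].sum() + 0.5 * ranks[diff == 0].sum()
print(f"R+ = {r_plus}, R- = {r_minus}, min = {min(r_plus, r_minus)}")

# SciPy's built-in two-sided test gives the corresponding p-value.
stat, p_value = wilcoxon(acc_a, acc_b, alternative="two-sided")
print(f"p-value = {p_value:.4f}")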
Table 5, Table 6 and Table 7 summarize the corresponding comparison results. In these tables, ∘ denotes that the model in the column improves on the model in the corresponding row, and • denotes that the model in the row improves on the model in the corresponding column; the lower and upper diagonals report the results at two different significance levels. From all of the above comparison results, we can draw the following highlights:
The average accuracy of HMNB on the eleven datasets is 85.60%, which is notably higher than those of MNB (83.18%), MNB (82.39%), GRSMNB (84.23%), DWMNB (83.72%), MNBTree (82.59%), and SEMNB (84.16%). HMNB substantially outperforms MNB (eight wins and zero losses), MNB (eleven wins and zero losses), GRSMNB (six wins and one loss), DWMNB (five wins and one loss), MNBTree (eight wins and zero losses), and SEMNB (six wins and one loss).
The average accuracy of HCNB on eleven datasets is 86.17%, which is notably higher than those of CNB (83.8%), CNB (84.29%), GRSCNB (83.36%), DWCNB (85.39%), CNBTree (83.81%), and SECNB (84.28%). HCNB substantially outperforms CNB (eight wins and zero losses), CNB (eight wins and zero losses), GRSCNB (eight wins and zero losses), DWCNB (three wins and one loss), CNBTree (six wins and zero losses), and SECNB (six wins and zero losses).
The average accuracy of HOVA on the eleven datasets is 85.44%, which is notably higher than those of OVA (84.51%), OVA (83.56%), GRSOVA (84.7%), DWOVA (84.97%), OVATree (83.54%), and SEOVA (84.12%). HOVA substantially outperforms OVA (four wins and one loss), OVA (six wins and zero losses), GRSOVA (four wins and zero losses), DWOVA (three wins and one loss), OVATree (five wins and zero losses), and SEOVA (seven wins and zero losses).
In addition, according to the results of the Wilcoxon test, HMNB significantly outperforms MNB, MNB, GRSMNB, DWMNB, MNBTree, and SEMNB. HCNB significantly outperforms CNB, CNB, GRSCNB, CNBTree, and SECNB. HOVA significantly outperforms OVA, OVA, OVATree, and SEOVA. All of these comparison results validate the effectiveness of the proposed HMNB, HCNB, and HOVA.
Finally, we conducted the Wilcoxon signed-rank test [35,36] to compare each pair of HMNB, HCNB, and HOVA. The detailed comparison results are shown in Table 8. From these results, we can see that HMNB is almost tied with HCNB and HOVA, while HCNB is notably better than HOVA. Considering the simplicity of the models, HMNB and HCNB could be appropriate choices.
5. Conclusions and Future Study
To alleviate MNB's assumption of the features' conditional independence, this paper proposed a single model called hidden MNB (HMNB) by adapting the well-known hidden NB (HNB). HMNB creates a hidden parent for each feature that synthesizes the influences of all of the other qualified features. To learn HMNB, we proposed a simple but effective learning algorithm that does not incur a high-computational-complexity structure-learning process. Our improved idea can also be used to improve CNB and OVA, and the resulting models are simply denoted as HCNB and HOVA, respectively. Extensive experiments show that the proposed HMNB, HCNB, and HOVA significantly outperform their state-of-the-art competitors.
In the proposed HMNB, HCNB, and HOVA, how the weight (importance) of each possible parent word is defined is crucial. Currently, we directly use the gain ratio of each possible parent word with respect to splitting the training data to define this weight, which is somewhat rough. We believe that using more sophisticated methods, such as the expectation-maximization (EM) algorithm, could further improve their classification performance and strengthen their advantage. This is a main topic for future study. In addition, to reduce the training space complexity, we transform a part of the training space consumption into classification time consumption, which leads to a relatively high classification time complexity. Therefore, improving the efficiency of the proposed models is another interesting topic for future study.