1. Introduction
Large-margin kernel machines (e.g., SVMs) are recognized as state-of-the-art algorithms in machine learning applications. They are broadly applied in several domains, such as text categorization, spam filtering, and RNA function prediction. However, since these methods typically work on an implicitly defined feature space by resorting to the well-known kernel trick, the resulting models are hard to interpret.
This last aspect is often crucial in specific application areas, such as the medical ones, in which a simple predictive answer is not enough. Since interpretability is a requirement for the acceptance of these black-box models by end users, in the last decade several methods have been introduced for rule extraction from SVMs (see [1] for a recent survey). The majority of the proposed approaches try to extract if–then rules over the input variables, a task which is generally hard when common kernels, e.g., the polynomial and RBF ones, are used.
On the other hand, Decision Trees (DTs), thanks to their easy logical interpretation, are much appreciated, especially by non-expert users. The shortcoming of DTs is that, in general, they are not as accurate as more complex methods. In the case of binary-valued input data, an alternative approach to making SVMs more interpretable consists in defining features that are easy to interpret, for example, features that are propositional (i.e., logical) formulas applied to the input vectors. In particular, Boolean kernels are kernel functions in which the binary input vectors are mapped into an embedding space formed by propositional formulas of the input variables and, in such a space, the dot product is performed.
More formally, a Boolean kernel function is defined as
$$\kappa(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle,$$
where $\phi : \{0,1\}^{n} \to \{0,1\}^{N}$ is the embedding Boolean function, with $N$ the number of Boolean formulas in the feature space. When the input space is binary, the linear kernel can be seen as a limit case of a Boolean kernel in which the features simply represent the Boolean literals themselves.
In this paper, we propose a new family of Boolean kernels able to produce feature spaces composed of arbitrarily complex propositional formulas. In particular, we first introduce the basic Boolean kernels [2], namely the conjunctive kernel and the disjunctive kernel, for both the monotone and the non-monotone case. On top of these kernels, we then propose more complex kernels, such as the Disjunctive Normal Form kernel and the Conjunctive Normal Form kernel. For all the proposed kernels, an efficient method to compute them is provided. We assess the quality of the proposed kernels in terms of classification accuracy on several categorical datasets, and we compare their performance against state-of-the-art kernels, such as the RBF kernel, and other Boolean kernels proposed in the literature.
The remainder of this paper is structured as follows: in Section 2, we give an overview of the existing work related to Boolean kernels. In Section 3, we present the proposed Boolean kernel family and, in Section 4, we discuss their computational complexity. Finally, Section 5 reports all the experiments performed on several benchmark categorical datasets.
2. Related Work
Sadohara [3] was the first to introduce the idea of a Boolean kernel. In that work, the concept of Boolean kernel actually refers to a single kernel called the DNF kernel. Specifically, he proposed an SVM for learning Boolean functions: since every Boolean (i.e., logical) function can be expressed in terms of Disjunctive Normal Form (DNF) formulas, the proposed kernel creates a feature space containing all possible conjunctions of negated or non-negated Boolean variables. For instance, the feature space for a DNF over two variables, e.g., $x_1$ and $x_2$, contains the following features: $x_1$, $x_2$, $\neg x_1$, $\neg x_2$, $x_1 \wedge x_2$, $x_1 \wedge \neg x_2$, $\neg x_1 \wedge x_2$, and $\neg x_1 \wedge \neg x_2$. The resulting decision function of a kernel machine which employs the DNF kernel can be represented as a weighted linear sum of conjunctions (Representer Theorem [4,5]), which in turn can be seen as a kind of "soft" DNF.
Formally, the DNF kernel between $\mathbf{x}, \mathbf{z} \in \mathbb{R}^{n}$ is defined as
$$\kappa_{\mathrm{DNF}}(\mathbf{x}, \mathbf{z}) = -1 + \prod_{i=1}^{n} (2 x_i z_i - x_i - z_i + 2),$$
while its monotone (i.e., without negations) form is the following
$$\kappa_{\mathrm{mDNF}}(\mathbf{x}, \mathbf{z}) = -1 + \prod_{i=1}^{n} (x_i z_i + 1).$$
By restricting the domain of the vectors to $\{0,1\}^{n}$, the computation of the kernels is simplified as follows
$$\kappa_{\mathrm{DNF}}(\mathbf{x}, \mathbf{z}) = -1 + 2^{\langle \mathbf{x}, \mathbf{z} \rangle + \langle \bar{\mathbf{x}}, \bar{\mathbf{z}} \rangle}, \qquad \kappa_{\mathrm{mDNF}}(\mathbf{x}, \mathbf{z}) = -1 + 2^{\langle \mathbf{x}, \mathbf{z} \rangle},$$
where $\bar{\mathbf{x}} = \mathbf{1}_n - \mathbf{x}$ is the vector with all the binary entries swapped. Note that the sum $\langle \mathbf{x}, \mathbf{z} \rangle + \langle \bar{\mathbf{x}}, \bar{\mathbf{z}} \rangle$ simply counts the number of common "bits" between $\mathbf{x}$ and $\mathbf{z}$. These last two kernels were independently discovered in [6,7].
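For binary inputs, these closed forms are straightforward to compute. Below is a minimal sketch assuming NumPy; the function names are our own, not part of any library:

```python
import numpy as np

def mdnf_kernel(x, z):
    """Monotone DNF kernel on binary vectors: -1 + 2^<x, z>."""
    return -1 + 2.0 ** np.dot(x, z)

def dnf_kernel(x, z):
    """DNF kernel on binary vectors: -1 + 2^(<x, z> + <~x, ~z>)."""
    return -1 + 2.0 ** (np.dot(x, z) + np.dot(1 - x, 1 - z))

x = np.array([1, 0, 1, 1])
z = np.array([1, 1, 0, 1])
print(mdnf_kernel(x, z), dnf_kernel(x, z))  # 3.0 3.0
```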
A drawback of this type of kernels is the exponential growth of the size of the feature space with respect to the number of involved variables, i.e., $3^{n} - 1$ for $n$ variables. To give the possibility of controlling the size of the feature space, Sadohara et al. [8] proposed a variation of the DNF kernel in which only conjunctions with up to $d$ variables (i.e., $d$-ary conjunctions) are considered. Over binary vectors, this kernel, dubbed d-DNF kernel, is defined as
$$\kappa_{d\text{-}\mathrm{DNF}}(\mathbf{x}, \mathbf{z}) = \sum_{c=1}^{d} \binom{\langle \mathbf{x}, \mathbf{z} \rangle + \langle \bar{\mathbf{x}}, \bar{\mathbf{z}} \rangle}{c},$$
and, trivially, if $d = n$, then $\kappa_{d\text{-}\mathrm{DNF}} = \kappa_{\mathrm{DNF}}$. A nice property of the d-DNF kernel is that it yields a nested sequence of hypothesis spaces, i.e., $\mathcal{H}_{1\text{-}\mathrm{DNF}} \subseteq \mathcal{H}_{2\text{-}\mathrm{DNF}} \subseteq \dots \subseteq \mathcal{H}_{n\text{-}\mathrm{DNF}}$. Thus, choosing a degree $d$ (also known as "arity") for the kernel implicitly means controlling the capacity of the hypothesis space, which is a very important aspect in learning. The same "trick" can be applied to the monotone d-DNF kernel [9]:
$$\kappa_{d\text{-}\mathrm{mDNF}}(\mathbf{x}, \mathbf{z}) = \sum_{c=1}^{d} \binom{\langle \mathbf{x}, \mathbf{z} \rangle}{c}.$$
Instead of limiting the number of involved variables, Zhang et al. [10] proposed a parametric version of the DNF and mDNF kernels. Specifically, given $\sigma > 0$ and $\mathbf{x}, \mathbf{z} \in \{0,1\}^{n}$, then
$$\kappa_{\sigma\text{-}\mathrm{DNF}}(\mathbf{x}, \mathbf{z}) = -1 + \prod_{i=1}^{n} (\sigma x_i z_i + \sigma \bar{x}_i \bar{z}_i + 1), \qquad \kappa_{\sigma\text{-}\mathrm{mDNF}}(\mathbf{x}, \mathbf{z}) = -1 + \prod_{i=1}^{n} (\sigma x_i z_i + 1).$$
The parameter $\sigma$ induces an inductive bias towards simpler or more complex DNF formulas. Specifically, for $\sigma$ in the range $(0, 1)$ a bias towards shorter DNFs is given, while for $\sigma > 1$ the bias is more towards longer DNFs. When $\sigma = 1$, then $\kappa_{\sigma\text{-}\mathrm{DNF}} = \kappa_{\mathrm{DNF}}$, and the same holds for the monotone DNF kernel. In the following, we refer to these kernels as the $\sigma$-DNF and $\sigma$-mDNF kernels, respectively. Zhang et al. also proved that, for binary input vectors, the polynomial kernel, i.e., $\kappa_p(\mathbf{x}, \mathbf{z}) = (\langle \mathbf{x}, \mathbf{z} \rangle + c)^{p}$ with $c \geq 0$, is a Boolean kernel, even though they did not provide any formal definition of Boolean kernel. Nonetheless, an important observation is that the embedding space of a polynomial kernel is composed of all the monomials (that are conjunctions) up to degree $p$. Thus, the only difference between the polynomial kernel and the d-DNF kernel lies in the weights associated with the features. It is also noteworthy that, in the binary case, the embedding of the polynomial kernel contains sets of equivalent features: since $x_i^2 = x_i$ for $x_i \in \{0,1\}$, e.g., for $p = 2$, the value of the feature $x_i^2$ (in the feature space) is equivalent to that of the feature $x_i$.
A kernel related to the polynomial one is the all-subsets kernel [11,12], defined as
$$\kappa_{\subseteq}(\mathbf{x}, \mathbf{z}) = \prod_{i=1}^{n} (x_i z_i + 1),$$
which considers a space with a feature for each subset of the input variables, including the empty subset. It is different from the polynomial kernel because it does not limit the number of considered monomials, and it gives the same weight to all the features. It is easy to see that the all-subsets kernel and the monotone DNF kernel are actually the same kernel up to the constant 1, i.e., $\kappa_{\mathrm{mDNF}}(\mathbf{x}, \mathbf{z}) = \kappa_{\subseteq}(\mathbf{x}, \mathbf{z}) - 1$.
Both the polynomial and the all-subsets kernel have limited control of which features they use and how they are weighted. The polynomial kernel uses only monomials of degree up to p with a weighting scheme depending on a parameter (c). The all-subsets, instead, makes use of the monomials corresponding to all possible subsets of the n input variables.
A restricted version of the all-subsets kernel is the ANOVA kernel [11], in which the embedding space is formed by monomials of a fixed degree $d$ without repetition. For example, given $\mathbf{x} \in \{0,1\}^{3}$ over the variables $x_1, x_2, x_3$, the feature space of the all-subsets kernel would be made of the features $x_1$, $x_2$, $x_3$, $x_1 x_2$, $x_1 x_3$, $x_2 x_3$, $x_1 x_2 x_3$ and ∅, while for the ANOVA kernel of degree 2 it would be composed of $x_1 x_2$, $x_1 x_3$ and $x_2 x_3$. Formally, the ANOVA kernel of degree $d$ is defined as follows
$$\kappa_{d}^{A}(\mathbf{x}, \mathbf{z}) = \sum_{1 \leq i_1 < i_2 < \dots < i_d \leq n} \, \prod_{j=1}^{d} x_{i_j} z_{i_j},$$
where the tuples $(i_1, \dots, i_d)$ range over all the possible sets of indices of cardinality $d$ taken from the set $\{1, \dots, n\}$.
In [13,14], Boolean kernels are used for studying the learnability of logical formulas, specifically DNF formulas, using maximum margin algorithms, such as SVM. In particular, in [14], the authors showed the learning limitations of some Boolean kernels inside the PAC (Probably Approximately Correct) framework. Moreover, in [13], Kowalczyk et al. proposed a generalization of the Sadohara mDNF kernel. A special case of this kernel is represented by the $\sigma$-mDNF kernel.
From the application point of view, Boolean kernels have been successfully applied to many learning tasks, such as face recognition [15,16], spam filtering [17], load forecasting [18], and generic binary classification tasks [8,10].
3. A New Family of Boolean Kernels
In this section, we propose a new family of Boolean kernels whose feature spaces are very easy to understand, since they are based on logic. Specifically, the features are logical formulas (of a fixed form) over the input Boolean variables.
Firstly, we present the most basic Boolean kernel and then, for both the monotone and the non-monotone case, we propose kernels which mimic the conjunctive operator (and) and the disjunctive operator (or). Then, we provide an efficient way to combine these "base" Boolean kernels to obtain more complex ones, such as kernels whose feature spaces are composed of normal form formulas.
Throughout the paper, unless specified otherwise, we refer to vectors $\mathbf{x}, \mathbf{z}$ as binary (Boolean) vectors of dimension $n$, i.e., $\mathbf{x}, \mathbf{z} \in \{0,1\}^{n}$. We also use $S_{\mathbf{x}}$ and $S_{\mathbf{z}}$ as the set interpretations of those vectors, i.e., $S_{\mathbf{x}} = \{i \mid x_i = 1\}$, while the set $U = \{1, \dots, n\}$ indicates the universal set. Given a set $S$, we refer to its $i$-th element with $S^{(i)}$ for some enumeration of the elements of $S$. With the notation $\binom{S}{k}$, we express the set of all the subsets of $S$ of cardinality $k$. It is worth noticing that, for any binary vector $\mathbf{x}$, $\langle \mathbf{x}, \mathbf{x} \rangle = |S_{\mathbf{x}}|$ holds, which is the number of ones contained in it. For the sake of brevity, we refer to this quantity with $|\mathbf{x}|$. Moreover, $\mathbf{1}_n$ denotes the $n$-dimensional vector with all entries equal to 1, and with the notation $\langle \cdot, \cdot \rangle$ we refer to the dot product. The symbol ⊙ denotes the entry-wise multiplication between matrices. Finally, given an embedding function $\phi : \{0,1\}^{m} \to \{0,1\}^{n}$, $|\phi|$ denotes the dimension of its codomain, i.e., $n$.
For each of the proposed kernels, the embedding function is provided in the general form $\phi_{B} : \mathbf{x} \mapsto (b(\mathbf{x}))_{b \in B}$, where $B$ is a set of Boolean functions (formulas) over the variables of $\mathbf{x}$, such that $b(\mathbf{x})$ is the truth value associated with the application of the formula $b$ to $\mathbf{x}$. For example, let $\mathbf{x} = (1, 0)$ and $b : \mathbf{x} \mapsto x_1 \wedge x_2$, then $b(\mathbf{x}) = 1 \wedge 0 = 0$, that is, false. For the sake of simplicity, for each new kernel, only the set $B$ from which the Boolean formulas are taken is defined.
3.1. Monotone Boolean Kernels
A Boolean function (or formula) f is called monotone if replacing a 0 (i.e., false) with a 1 (i.e., true) in the input can only increase f's value, i.e., the truth value can only change from false to true. In other words, a formula f is monotone if it does not contain any not operator.
3.1.1. Monotone Literal Kernel
In logic, a literal is an atomic formula or its negation. Here, we are in the monotone setting, so we refer to a literal only in its positive form. In this case, the embedding is formed by Boolean literals taken from the set $B_{\mathrm{mL}} = \{\mathbf{x} \mapsto x_i \mid 1 \leq i \leq n\}$. Hence, the monotone Literal (mL) kernel, $\kappa_{\mathrm{mL}}$, will count how many true (i.e., positive) input literals the vectors have in common. Actually, $\kappa_{\mathrm{mL}}(\mathbf{x}, \mathbf{z})$ is exactly the linear kernel $\langle \mathbf{x}, \mathbf{z} \rangle$, which simply performs the dot product between the two input binary vectors.
3.1.2. Monotone Conjunctive Kernel
In Boolean algebra, given two Boolean variables $x, z \in \{0,1\}$, the conjunction (i.e., and) between $x$ and $z$, denoted by $x \wedge z$, is satisfied if and only if $x = z = 1$, that is, if and only if both variables are true. Given two vectors $\mathbf{x}, \mathbf{z} \in \{0,1\}^{n}$, the monotone Conjunctive (mC) kernel [19] $\kappa_{\mathrm{mC}}^{c}$ counts how many monotone conjunctions of the input variables, of a fixed arity $c$, are satisfied in both $\mathbf{x}$ and $\mathbf{z}$. In particular, the embedding is defined by Boolean formulas taken from $B_{\mathrm{mC}}^{c} = \{\mathbf{x} \mapsto \bigwedge_{i \in S} x_i \mid S \in \binom{U}{c}\}$, which represent all the possible conjunctions of $c$ literals (i.e., variables) taken from $\mathbf{x}$. The dimension of the resulting feature space is $\binom{n}{c}$, that is, the number of all the combinations of $c$ different variables taken from the input $n$-dimensional space. To count all the possible conjunctions of $c$ variables satisfied in both $\mathbf{x}$ and $\mathbf{z}$, we need to calculate the number of combinations of $c$ monomials that can be formed by using the variables that are positive in both vectors, whose number is $\langle \mathbf{x}, \mathbf{z} \rangle$. Hence, we obtain:
$$\kappa_{\mathrm{mC}}^{c}(\mathbf{x}, \mathbf{z}) = \binom{\langle \mathbf{x}, \mathbf{z} \rangle}{c}.$$
It is easy to see that for binary vectors $\kappa_{\mathrm{mC}}^{c}$ is actually the ANOVA kernel of degree $c$ [11].
As shown in [19], we can express the Sadohara mDNF kernel [3] as a linear combination of mC-kernels of arity $c \in \{1, \dots, n\}$ as in the following:
$$\kappa_{\mathrm{mDNF}}(\mathbf{x}, \mathbf{z}) = 2^{\langle \mathbf{x}, \mathbf{z} \rangle} - 1 = \sum_{c=1}^{n} \binom{\langle \mathbf{x}, \mathbf{z} \rangle}{c} = \sum_{c=1}^{n} \kappa_{\mathrm{mC}}^{c}(\mathbf{x}, \mathbf{z}).$$
A similar construction also holds for the d-mDNF kernel.
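As a concrete illustration, the following is a minimal sketch of the mC-kernel, assuming NumPy and SciPy (the function names are our own); it also checks the linear-combination identity above on a toy pair of vectors:

```python
import numpy as np
from scipy.special import binom

def mc_kernel(x, z, c):
    """Monotone Conjunctive kernel: binom(<x, z>, c), i.e., the number of
    monotone conjunctions of arity c satisfied by both binary vectors."""
    return binom(np.dot(x, z), c)

x = np.array([1, 0, 1, 1])
z = np.array([1, 1, 0, 1])
n = len(x)

# The sum of mC-kernels over all arities recovers the Sadohara mDNF kernel.
assert sum(mc_kernel(x, z, c) for c in range(1, n + 1)) == 2 ** np.dot(x, z) - 1
```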
3.1.3. Monotone Disjunctive Kernel
The disjunction of two Boolean variables $x, z \in \{0,1\}$, denoted by $x \vee z$, is not satisfied if and only if $x = z = 0$, that is, if and only if both variables are false or, in other words, it is satisfied anytime at least one of the variables is true. The monotone Disjunctive (mD) kernel [19], $\kappa_{\mathrm{mD}}^{d}$, counts how many monotone disjunctions, of a fixed arity $d$, are satisfied in both $\mathbf{x}$ and $\mathbf{z}$. The embedding of this kernel is defined by Boolean formulas taken from $B_{\mathrm{mD}}^{d} = \{\mathbf{x} \mapsto \bigvee_{i \in S} x_i \mid S \in \binom{U}{d}\}$, with a feature space of dimension $\binom{n}{d}$. To explain how to count the common positive disjunctions in both $\mathbf{x}$ and $\mathbf{z}$, we can rely on the analogy between binary vectors and sets. An active disjunction of $d$ literals for $\mathbf{x}$ can be defined as a set of $d$ elements taken from the universe $U$, let us call it $S \in \binom{U}{d}$, such that $S \cap S_{\mathbf{x}} \neq \emptyset$. Anytime $S \cap S_{\mathbf{x}} \neq \emptyset$ and $S \cap S_{\mathbf{z}} \neq \emptyset$ (potentially with $S \cap S_{\mathbf{x}} \neq S \cap S_{\mathbf{z}}$), then $S$ is an active subset for both $\mathbf{x}$ and $\mathbf{z}$. Using this interpretation, we can say that the value of the kernel is the number of active subsets $S$ in common between $\mathbf{x}$ and $\mathbf{z}$. We can count the number of these subsets in a negative fashion. Starting from the number of all possible subsets, which is $\binom{n}{d}$, we remove the inactive subsets for $\mathbf{x}$ and for $\mathbf{z}$. An inactive subset for $\mathbf{x}$ is a set that does not contain any element of $S_{\mathbf{x}}$, and the number of this kind of sets is $\binom{n - |\mathbf{x}|}{d}$. Analogously, we can do the same for $\mathbf{z}$. Now, we have removed twice the subsets formed by elements taken from $U \setminus (S_{\mathbf{x}} \cup S_{\mathbf{z}})$, and hence we need to add their contribution once, that is, $\binom{n - |\mathbf{x}| - |\mathbf{z}| + \langle \mathbf{x}, \mathbf{z} \rangle}{d}$. We can now define $\kappa_{\mathrm{mD}}^{d}$ as
$$\kappa_{\mathrm{mD}}^{d}(\mathbf{x}, \mathbf{z}) = \binom{n}{d} - \binom{n - |\mathbf{x}|}{d} - \binom{n - |\mathbf{z}|}{d} + \binom{n - |\mathbf{x}| - |\mathbf{z}| + \langle \mathbf{x}, \mathbf{z} \rangle}{d}.$$
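A direct implementation of this inclusion–exclusion count, under the same assumptions as the previous sketch:

```python
import numpy as np
from scipy.special import binom

def md_kernel(x, z, d):
    """Monotone Disjunctive kernel: inclusion-exclusion count of the subsets
    of d variables containing at least one active variable of both x and z."""
    n = len(x)
    nx, nz, nxz = np.dot(x, x), np.dot(z, z), np.dot(x, z)
    return (binom(n, d) - binom(n - nx, d) - binom(n - nz, d)
            + binom(n - nx - nz + nxz, d))
```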
3.2. Non-Monotone Boolean Kernels
In contrast to the monotone case, non-monotone Boolean formulas can contain negated variables, e.g., $\neg x_i$; thus, the mL-kernel is not expressive enough to be the simplest non-monotone Boolean kernel because it does not consider negated variables.
3.2.1. Non-Monotone Literal Kernel
To include the contribution of the negated variables, we need to add to the mL-kernel the number of false variables in common between $\mathbf{x}$ and $\mathbf{z}$. This can be calculated by the negation kernel, defined as
$$\kappa_{\mathrm{N}}(\mathbf{x}, \mathbf{z}) = \langle \bar{\mathbf{x}}, \bar{\mathbf{z}} \rangle = n - |\mathbf{x}| - |\mathbf{z}| + \langle \mathbf{x}, \mathbf{z} \rangle,$$
in which the embedding is defined by Boolean functions taken from $B_{\mathrm{N}} = \{\mathbf{x} \mapsto \neg x_i \mid 1 \leq i \leq n\}$. Hence, the non-monotone Literal (L) kernel, $\kappa_{\mathrm{L}}$, counts how many true and false variables $\mathbf{x}$ and $\mathbf{z}$ have in common, and it is defined as a sum of kernels [11] as in the following:
$$\kappa_{\mathrm{L}}(\mathbf{x}, \mathbf{z}) = \kappa_{\mathrm{mL}}(\mathbf{x}, \mathbf{z}) + \kappa_{\mathrm{N}}(\mathbf{x}, \mathbf{z}) = n - |\mathbf{x}| - |\mathbf{z}| + 2 \langle \mathbf{x}, \mathbf{z} \rangle.$$
3.2.2. Non-Monotone Conjunctive Kernel
The non-monotone Conjunctive (C) kernel counts how many non-monotone conjunctions of a certain arity $c$ are satisfied in both $\mathbf{x}$ and $\mathbf{z}$. The embedding is defined by Boolean functions in the set $B_{\mathrm{C}}^{c}$ of all the non-monotone conjunctions of $c$ literals. Formally, every such conjunction is identified by a set of variables $S \in \binom{U}{c}$ together with a choice, for each variable in $S$, of whether it appears positive or negated. Since we are considering conjunctions of literals, this kernel counts how many combinations of $c$ (possibly negated) common variables there are between $\mathbf{x}$ and $\mathbf{z}$. Thus, relying on these considerations and on the definition of $\kappa_{\mathrm{L}}$, we can finally define the C-kernel as follows:
$$\kappa_{\mathrm{C}}^{c}(\mathbf{x}, \mathbf{z}) = \binom{\kappa_{\mathrm{L}}(\mathbf{x}, \mathbf{z})}{c}.$$
Similar to the monotone case, we can express the Sadohara DNF kernel [3] as a linear combination of C-kernels of arity $c \in \{1, \dots, n\}$ as in the following:
$$\kappa_{\mathrm{DNF}}(\mathbf{x}, \mathbf{z}) = 2^{\kappa_{\mathrm{L}}(\mathbf{x}, \mathbf{z})} - 1 = \sum_{c=1}^{n} \kappa_{\mathrm{C}}^{c}(\mathbf{x}, \mathbf{z}).$$
An analogous construction also holds for the d-DNF kernel.
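A minimal sketch of the L-kernel and the C-kernel built on top of it, under the same assumptions as the earlier snippets:

```python
import numpy as np
from scipy.special import binom

def l_kernel(x, z):
    """Non-monotone Literal kernel: common true plus common false variables."""
    return np.dot(x, z) + np.dot(1 - x, 1 - z)

def c_kernel(x, z, c):
    """Non-monotone Conjunctive kernel: binom(L-kernel, c)."""
    return binom(l_kernel(x, z), c)
```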
An analogous construction also holds for the d-DNF kernel.
3.2.3. Non-Monotone Disjunctive Kernel
The non-monotone Disjunctive (D) kernel counts how many non-monotone disjunctions, of a certain arity $d$, are satisfied in both $\mathbf{x}$ and $\mathbf{z}$. The embedding is defined by Boolean formulas in the set $B_{\mathrm{D}}^{d}$ of all the non-monotone disjunctions of $d$ literals, each identified, as in the previous section, by a set $S \in \binom{U}{d}$ and a negation assignment over its variables. As for the monotone case, we derive the kernel function in a negative fashion. The number of every possible combination of $d$ variables, each possibly appearing in its negated form, is $2^{d} \binom{n}{d}$. For both $\mathbf{x}$ and $\mathbf{z}$, we have to discard the combinations that are false, which are exactly $\binom{n}{d}$, because for each set of $d$ different variables there exists only one assignment of the negations such that the disjunction is false. For example, given the variables $x_1 = 1$ and $x_2 = 0$, only the disjunction $\neg x_1 \vee x_2$ is false, and all the other $2^{2} - 1$ negation assignments are true. Then, we have to re-add the false combinations that have been discarded twice, that is, the combinations made with variables that are false in both vectors. This can be seen as the opposite of what the C-kernel computes but, since we generate all the possible combinations with all the possible negation assignments, the counting is actually the same as the C-kernel. Finally, we can define $\kappa_{\mathrm{D}}^{d}$ as
$$\kappa_{\mathrm{D}}^{d}(\mathbf{x}, \mathbf{z}) = \binom{n}{d} (2^{d} - 2) + \binom{\kappa_{\mathrm{L}}(\mathbf{x}, \mathbf{z})}{d}.$$
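The corresponding sketch, again assuming NumPy and SciPy and our own naming:

```python
import numpy as np
from scipy.special import binom

def d_kernel(x, z, d):
    """Non-monotone Disjunctive kernel, derived in the negative fashion
    described above: all clauses, minus those false for x and those false
    for z, re-adding those false for both (counted as in the C-kernel)."""
    n = len(x)
    common = np.dot(x, z) + np.dot(1 - x, 1 - z)  # value of the L-kernel
    return binom(n, d) * (2 ** d - 2) + binom(common, d)
```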
3.3. Boolean Kernels Combination
Given the Boolean kernels defined in the previous sections, we have now all the basic elements to build new Boolean kernels that represent a specific Boolean concept.
Table 1 shows a summary of all the presented Boolean kernels. It is easy to see that all the kernels are functions of dot products of the input vectors, which allows us to create new kernels by replacing those dot products with other Boolean kernels. The logical interpretation of the new kernel depends on how the base kernels are combined. In the following, we present some new Boolean kernels generated by using the above-mentioned method.
3.3.1. DNF Kernels
A disjunctive normal form (DNF) is a normalization of a logical formula which is a disjunction of conjunctive clauses. In the monotone case, both the conjunctive and the disjunctive clauses are monotone, i.e., the literals appear only in their positive form. Since DNFs are disjunctions of conjunctive clauses, we can combine the embedding maps of the mC-kernel and the mD-kernel in this way
$$\phi_{\mathrm{mDNF}}^{(d,c)} : \mathbf{x} \mapsto (\phi_{\mathrm{mD}}^{d} \circ \phi_{\mathrm{mC}}^{c})(\mathbf{x}),$$
obtaining the desired feature space for the monotone DNF, which leads to the definition of the mDNF-kernel as in the following:
$$\kappa_{\mathrm{mDNF}}^{(d,c)}(\mathbf{x}, \mathbf{z}) = \binom{N_c}{d} - \binom{N_c - \kappa_{\mathrm{mC}}^{c}(\mathbf{x}, \mathbf{x})}{d} - \binom{N_c - \kappa_{\mathrm{mC}}^{c}(\mathbf{z}, \mathbf{z})}{d} + \binom{N_c - \kappa_{\mathrm{mC}}^{c}(\mathbf{x}, \mathbf{x}) - \kappa_{\mathrm{mC}}^{c}(\mathbf{z}, \mathbf{z}) + \kappa_{\mathrm{mC}}^{c}(\mathbf{x}, \mathbf{z})}{d}.$$
Note that we need to know the dimension of the feature space of the mC-kernel, that is, $N_c = |\phi_{\mathrm{mC}}^{c}| = \binom{n}{c}$. The features of this kernel are actually monotone DNF formulas with exactly $d$ disjunctions of conjunctive clauses of arity $c$. In the non-monotone case, we proceed in a similar way but, since the conjunctive clauses are non-monotone, we have to use the C-kernel instead of the mC-kernel. So, the feature space can be created by composing the embedding map of the mD-kernel with the one of the C-kernel, that is,
$$\phi_{\mathrm{DNF}}^{(d,c)} : \mathbf{x} \mapsto (\phi_{\mathrm{mD}}^{d} \circ \phi_{\mathrm{C}}^{c})(\mathbf{x}).$$
Consequently, the DNF-kernel is defined as the same expression with every $\kappa_{\mathrm{mC}}^{c}$ replaced by $\kappa_{\mathrm{C}}^{c}$ and with $N_c = |\phi_{\mathrm{C}}^{c}|$.
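Since the composition amounts to a substitution of dot products, the mDNF(d, c) kernel can be sketched directly from the pieces above; this is a sketch derived by that substitution, not the reference implementation:

```python
import numpy as np
from scipy.special import binom

def mdnf_kernel_dc(x, z, d, c):
    """mDNF(d, c) kernel: the mD-kernel formula evaluated on mC-kernel values,
    with the universe size replaced by the mC feature-space dimension."""
    N = binom(len(x), c)              # dimension of the mC feature space
    kxx = binom(np.dot(x, x), c)      # mC-kernel between x and itself
    kzz = binom(np.dot(z, z), c)
    kxz = binom(np.dot(x, z), c)
    return (binom(N, d) - binom(N - kxx, d) - binom(N - kzz, d)
            + binom(N - kxx - kzz + kxz, d))
```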
3.3.2. CNF Kernels
A logic formula is in conjunctive normal form (CNF) if it is composed of conjunctions of disjunctive clauses. Clearly, in its monotone form, it does not contain any negated literal. By using a similar approach as for the mDNF-kernel, the feature space is defined by the function
$$\phi_{\mathrm{mCNF}}^{(c,d)} : \mathbf{x} \mapsto (\phi_{\mathrm{mC}}^{c} \circ \phi_{\mathrm{mD}}^{d})(\mathbf{x}),$$
and hence in the kernel function we replace the dot products inside the mC-kernel with the mD-kernel as in the following:
$$\kappa_{\mathrm{mCNF}}^{(c,d)}(\mathbf{x}, \mathbf{z}) = \binom{\kappa_{\mathrm{mD}}^{d}(\mathbf{x}, \mathbf{z})}{c}.$$
The resulting feature space is composed of monotone CNF formulas with exactly $c$ conjunctions of disjunctive clauses of arity $d$. By swapping the mD-kernel with the D-kernel, we can easily obtain the non-monotone CNF kernel:
$$\kappa_{\mathrm{CNF}}^{(c,d)}(\mathbf{x}, \mathbf{z}) = \binom{\kappa_{\mathrm{D}}^{d}(\mathbf{x}, \mathbf{z})}{c},$$
which has the associated embedding function
$$\phi_{\mathrm{CNF}}^{(c,d)} : \mathbf{x} \mapsto (\phi_{\mathrm{mC}}^{c} \circ \phi_{\mathrm{D}}^{d})(\mathbf{x}).$$
For the sake of brevity, in the rest of the paper, we indicate the mDNF-kernel with d disjunctions of conjunctive clauses of arity c with either the notation mDNF(d,c)-kernel or simply mDNF(d,c), and analogously for the other normal form kernels.
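The CNF kernels are one-liners on top of the disjunctive sketches given earlier (they reuse `md_kernel` and `d_kernel` from the previous snippets):

```python
from scipy.special import binom

def mcnf_kernel_cd(x, z, c, d):
    """mCNF(c, d) kernel: the mC-kernel evaluated on mD-kernel values."""
    return binom(md_kernel(x, z, d), c)

def cnf_kernel_cd(x, z, c, d):
    """Non-monotone CNF kernel: swap the mD-kernel for the D-kernel."""
    return binom(d_kernel(x, z, d), c)
```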
4. Computational Complexity
The computational complexity of the Boolean kernels described in the previous sections is bounded by the complexity of the calculation of the binomial coefficient, which is linear in the arity $k$ of the combinations. Hence, computing an entire kernel matrix over $n$-dimensional examples taken from a dataset with $l$ examples leads to a complexity of $\mathcal{O}(l^{2}(n + k))$, since each entry requires a dot product and a binomial coefficient. Even though it is not possible to reduce such complexity, we can take advantage of the recursive nature of the binomial coefficient to compute the kernels in a recursive fashion. By doing so, it is possible to compute higher degree kernel matrices by leveraging kernel matrices of lower degrees.
Let the matrix $\mathbf{K}_{0} \in \mathbb{N}^{l \times l}$ be the base (Boolean) kernel matrix over the dataset $\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^{l}$, such that $(\mathbf{K}_{0})_{ij} = \langle \phi(\mathbf{x}^{(i)}), \phi(\mathbf{x}^{(j)}) \rangle$, for $1 \leq i, j \leq l$ and some embedding $\phi$ with Boolean codomain. Then, we can recursively define the Boolean kernels for both the monotone and the non-monotone case as described in the following.
4.1. mC-Kernel
By definition, the mC-kernel matrix of arity 1, that is, $\mathbf{K}_{\mathrm{mC}}^{1}$, is equivalent to the base kernel matrix $\mathbf{K}_{0}$. By using $\mathbf{K}_{\mathrm{mC}}^{1}$ as the base case, and relying on the identity $\binom{m}{c} = \binom{m}{c-1} \frac{m - c + 1}{c}$, we can recursively define the mC-kernel matrix as
$$\mathbf{K}_{\mathrm{mC}}^{c} = \frac{1}{c} \, \mathbf{K}_{\mathrm{mC}}^{c-1} \odot \left( \mathbf{K}_{0} - (c - 1) \mathbf{J} \right),$$
where $\mathbf{J}$ denotes the $l \times l$ all-ones matrix.
4.2. mD-Kernel
Let us define the matrices $\mathbf{M}$, $\mathbf{M}_{\mathbf{x}}$, $\mathbf{M}_{\mathbf{z}}$ and $\mathbf{M}_{\mathbf{xz}}$ with entries
$$(\mathbf{M})_{ij} = n, \quad (\mathbf{M}_{\mathbf{x}})_{ij} = n - |\mathbf{x}^{(i)}|, \quad (\mathbf{M}_{\mathbf{z}})_{ij} = n - |\mathbf{x}^{(j)}|, \quad (\mathbf{M}_{\mathbf{xz}})_{ij} = n - |\mathbf{x}^{(i)}| - |\mathbf{x}^{(j)}| + (\mathbf{K}_{0})_{ij}.$$
Then, we define, recursively in its parts, the mD-kernel matrix $\mathbf{K}_{\mathrm{mD}}^{d}$ as
$$\mathbf{K}_{\mathrm{mD}}^{d} = \mathbf{B}^{d} - \mathbf{B}_{\mathbf{x}}^{d} - \mathbf{B}_{\mathbf{z}}^{d} + \mathbf{B}_{\mathbf{xz}}^{d},$$
where each part is the entry-wise binomial coefficient of the corresponding matrix, computed recursively as
$$\mathbf{B}_{*}^{d} = \frac{1}{d} \, \mathbf{B}_{*}^{d-1} \odot \left( \mathbf{M}_{*} - (d - 1) \mathbf{J} \right),$$
with the corresponding base cases $\mathbf{B}^{1} = \mathbf{M}$, $\mathbf{B}_{\mathbf{x}}^{1} = \mathbf{M}_{\mathbf{x}}$, $\mathbf{B}_{\mathbf{z}}^{1} = \mathbf{M}_{\mathbf{z}}$ and $\mathbf{B}_{\mathbf{xz}}^{1} = \mathbf{M}_{\mathbf{xz}}$.
4.3. C-Kernel
By relying on the previous definition of $\mathbf{M}_{\mathbf{xz}}$ and the base case kernel matrix
$$\mathbf{K}_{\mathrm{C}}^{1} = \mathbf{M}_{\mathbf{xz}} + \mathbf{K}_{0},$$
whose entries are the values of the L-kernel, the C-kernel matrix can be recursively defined by
$$\mathbf{K}_{\mathrm{C}}^{c} = \frac{1}{c} \, \mathbf{K}_{\mathrm{C}}^{c-1} \odot \left( \mathbf{K}_{\mathrm{C}}^{1} - (c - 1) \mathbf{J} \right).$$
4.4. D-Kernel
Using the previous definitions of the matrices $\mathbf{B}^{d}$ and $\mathbf{K}_{\mathrm{C}}^{d}$, we can define the D-kernel matrix, recursively in its parts, as
$$\mathbf{K}_{\mathrm{D}}^{d} = (2^{d} - 2) \, \mathbf{B}^{d} + \mathbf{K}_{\mathrm{C}}^{d}.$$
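All the recursions above boil down to the same entry-wise binomial update. As an illustration, here is a minimal NumPy sketch for the mC-kernel matrices (the names are ours, not pyros's API):

```python
import numpy as np

def mc_kernel_matrices(X, c_max):
    """mC-kernel matrices for arities 1..c_max via the entry-wise recursion
    binom(m, c) = binom(m, c-1) * (m - c + 1) / c, applied to K0 = X X^T."""
    K0 = (X @ X.T).astype(float)   # base (linear) kernel matrix, arity 1
    kernels = [K0]
    for c in range(2, c_max + 1):
        kernels.append(kernels[-1] * (K0 - (c - 1)) / c)
    return kernels
```

Note that the recursion vanishes exactly when the arity exceeds the dot product, so no special handling of small entries is needed.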
Both the standard and the recursive definitions have been implemented in a freely available Python module, pyros, available at the following URL: https://github.com/makgyver/pyros.
5. Evaluation
5.1. Evaluation Protocol
In all experiments, the kernels have been normalized using the well-known formula
$$\tilde{\kappa}(\mathbf{x}, \mathbf{z}) = \frac{\kappa(\mathbf{x}, \mathbf{z})}{\sqrt{\kappa(\mathbf{x}, \mathbf{x}) \, \kappa(\mathbf{z}, \mathbf{z})}}.$$
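Applied to a precomputed kernel matrix, this normalization is a two-liner (a minimal sketch, assuming NumPy):

```python
import numpy as np

def normalize(K):
    """Normalize a kernel matrix so every example has unit self-similarity."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```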
We assessed the proposed Boolean kernels by using the SVM as the kernel machine, and we compared them with the linear kernel, the RBF kernel, the (monotone) DNF kernel proposed by Sadohara [3], the d-DNF kernel [9], the σ-mDNF kernel [10] and the Tanimoto kernel. Both validation and test were evaluated in terms of the Area Under the ROC Curve (AUC). The validation method used is a five-fold nested cross-validation. For each dataset, the test was repeated 10 times and the average performances were recorded. Specifically, we validated the misclassification cost parameter C of the SVM; for each of the proposed kernels, we validated both the conjunctive arity c and the disjunctive arity d (when applicable); for the RBF kernel, we validated the hyper-parameter γ, while for the d-mDNF kernel we fixed d. Finally, for the σ-mDNF kernel, we validated σ. All the experiments were implemented in Python using the machine learning module Scikit-learn [20]. The source code is freely available at https://github.com/makgyver/pyros.
The benchmark datasets used for the experiments are reported in Table 2. These datasets are freely available from the UCI repository [21] and the KEEL repository [22]. We selected datasets with binary or categorical features, and for each of them the following preprocessing steps were performed:
Instances with missing attributes were removed.
Categorical features, including the binary ones, were mapped into binary features by means of the one-hot encoding [23]. This preprocessing keeps, for every example in the dataset, the same number of ones or, in other words, every input vector has the same Euclidean norm (see the sketch after this list).
Non-binary tasks were artificially transformed into binary ones, by arranging the classes into two groups while trying to keep the number of instances balanced.
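As a small illustration of the one-hot step, here is a sketch on a hypothetical toy dataset; the `sparse_output` flag assumes scikit-learn >= 1.2 (older versions use `sparse`):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy dataset with two categorical attributes.
X = np.array([["red", "small"],
              ["blue", "large"],
              ["red", "large"]])

Xb = OneHotEncoder(sparse_output=False).fit_transform(X)
# Each row has exactly one 1 per original attribute, so every encoded
# vector has the same number of ones (here 2) and the same Euclidean norm.
print(Xb.sum(axis=1))  # [2. 2. 2.]
```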
5.2. Experimental Results
The first set of experiments assessed the quality of the proposed kernels. The average AUCs over 10 runs as well as the standard deviations are reported in
Table 3. The last row of each table summarizes the average rank achieved by all kernels over the benchmark datasets.
It is evident from the tables that the normal form (NF) kernels perform on average better than the other Boolean kernels, with the only exception of the C-kernel on the AUC metric. This is reasonable since normal form kernels are a generalization of the (m)C-kernel and the (m)D-kernel. However, after the validation procedure, there is no guarantee that a NF kernel is always better than its corresponding base kernels, as underlined by the very good performance of the C-kernel. Another interesting observation is that both the D-kernel and the mD-kernel are almost always the worst performing ones, and this can be explained by the fact that they are less expressive than the competing kernels.
Since the NF kernels have shown very good performance, we built, for each normal form, the average kernel over all the degrees, that is:
$$\bar{\kappa}_{\mathrm{NF}}(\mathbf{x}, \mathbf{z}) = \frac{1}{C \cdot D} \sum_{c=1}^{C} \sum_{d=1}^{D} \tilde{\kappa}_{\mathrm{NF}}^{(c,d)}(\mathbf{x}, \mathbf{z}),$$
where $\tilde{\kappa}$ denotes the normalized kernel. In this way, given a normal form, the feature space of the resulting kernel contains all the normal form formulas for every pair of degrees $(c, d)$ with $1 \leq c \leq C$ and $1 \leq d \leq D$. Since we have fixed the maximum degrees C and D for all the kernels, and the average kernel has no other hyper-parameters, it did not require any further validation.
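A sketch of this construction, reusing the `normalize` helper above and any NF kernel with signature `(x, z, c, d)`, such as the hypothetical `mcnf_kernel_cd` from the previous section:

```python
import numpy as np

def average_nf_kernel(X, C, D, nf_kernel):
    """Average of the normalized NF kernel matrices over all degree pairs
    (c, d) with 1 <= c <= C and 1 <= d <= D; once C and D are fixed,
    no hyper-parameter is left to validate."""
    l = len(X)
    K_avg = np.zeros((l, l))
    for c in range(1, C + 1):
        for d in range(1, D + 1):
            K = np.array([[nf_kernel(x, z, c, d) for z in X] for x in X])
            K_avg += normalize(K)
    return K_avg / (C * D)
```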
The average AUCs over the 10 runs as well as the standard deviations are reported in
Table 4.
In general, we can see that the normal form kernels achieve performances that are almost always comparable to or higher than those of the competing kernels. Good performance is also achieved by the d-mDNF kernel, which, as shown above, is a summation of mC-kernels.
An interesting observation can be made regarding the monks-1 dataset, whose target concept can be expressed by an mDNF rule. The obtained results on monks-1 show that most of the proposed Boolean kernels achieve an AUC of 100%, while all the other kernels are not able to achieve this perfect score. This is because these Boolean kernels contain the target formula, or a set of related formulas, in their feature space.
All the reported results are further confirmed by other experiments with other metrics, such as precision, recall, and F1; however, they are not reported here for space reasons.
6. Conclusions
We present a family of Boolean kernels designed to have feature spaces composed of logical formulas that can be exploited to interpret the solution of a large-margin kernel machine, such as the SVM. For all the kernels, we provide the logical interpretation and an efficient way to compute them. Experimental results on many categorical datasets show that the presented kernels achieve performance comparable to that of state-of-the-art kernels (such as the RBF) and other Boolean kernels. In particular, we observed that, in general, the kernels corresponding to normal forms achieve very good performance. In the future, we aim to develop methods able to learn how to combine different Boolean kernels, for example, by means of Multiple Kernel Learning algorithms. Moreover, we also aim to find efficient and effective ways to interpret the solutions of SVMs based on Boolean kernels.