1. Introduction
Explainable Artificial Intelligence has become a central topic in current research [
1,
2]. The recent pervasive deployment of machine learning solutions has greatly increased the need for transparent and easy-to-explain models. Many efforts have been devoted to improving the interpretability of complex and non-linear models such as (deep) neural networks [
3]. Despite these efforts, we are still far from being able to systematically interpret what happens under the hood. Still, there are specific applications in which explainability is not only a desideratum but a necessity. For instance, in medical applications, physicians need specific and comprehensible reasons to accept the prediction given by a machine. For this reason, models like Decision Trees (DTs) are widely used by non-expert users thanks to their easy logical interpretation. The shortcoming of DTs is that, in general, they are inferior to neural networks or Support Vector Machines (SVMs) in terms of predictive accuracy.
Neural networks are not the only (non-linear) models that are hard to interpret. Kernel machines, like SVMs [
4,
5,
6], typically work on an implicitly defined feature space by resorting to the well-known kernel trick. The use of such an implicit representation clearly harms the interpretability of the resulting model. Moreover, these feature spaces are often high-dimensional, which makes the task even harder. Nonetheless, in the last decade, several methods have been introduced for extracting rules from SVMs [
7]. Typically, these techniques try to build if-then-else rules over the input variables, but the task is not trivial since the feature space might not be easy to deal with in terms of explanatory capabilities.
Given binary-valued input data, a possible approach to make SVM more interpretable consists of defining features that are easy to interpret, for example, features that are logical rules over the input. To this end, recently, a novel Boolean kernel (BK) framework has been proposed [
8]. This framework is theoretically well-founded, and the authors also provide efficient methods to compute these kernels. Boolean kernels are kernel functions in which the input vectors are mapped into an embedding space formed by logical rules over the input variables, and, in such a space, the dot product is performed. In other words, these kernels compute the number of logical rules of a fixed form over the input variables that are satisfied in both input vectors. To show that Boolean kernels can greatly improve the interpretability of SVMs, in [
9] a proof-of-concept method based on a genetic algorithm has been proposed. This algorithm extracts from the hypothesis of an SVM the features (of the feature space) that are most influential in the decision. Since those features are logical rules over the input, they can be used to explain the SVM's decision.
However, the main drawback of Boolean kernels is their limited logical expressiveness. Although the Boolean kernel framework allows defining Disjunctive Normal Form (DNF) kernels, only a subset of all possible DNFs can be formed. This limitation is due to the way Boolean kernels are computed, and in
Section 3.1 we show that such a limitation cannot be overcome using the BK framework itself.
For this reason, in this paper, we improve upon BKs by introducing a new family of kernels, called Propositional Kernels [
10]. Differently from BKs, with Propositional Kernels it is possible to define kernels for any propositional logic formula (formulas with bound variables are included in the feature space, but the feature space cannot be restricted to them). Starting from the main limitations of the BK framework, we show step-by-step how Propositional kernels are computed. Then, we provide a mathematical definition for all possible binary logical operations as well as a general procedure to compose base Propositional kernels to form any Propositional kernel. This "composability" makes Propositional kernels far more expressive than BKs.
In the experimental section, we evaluate the effectiveness of these new kernels on several artificial and benchmark categorical data sets. Moreover, we propose a heuristic, based on a theoretically grounded criterion, to select good propositional kernels given the data set. We empirically show that this criterion is usually able to select, among a set of randomly generated Propositional kernels, the one that performs best. This framework is open-source, and it is publicly available at
https://github.com/makgyver/propositional_kernels (accessed on 6 August 2021).
The rest of the paper is structured as follows. In
Section 2 we review the related work, while in
Section 3 we provide an overview of the Boolean kernel framework, with particular emphasis on the limitations of such a framework. Then,
Section 4 presents the main ideas behind Propositional kernels and how to compute all kernels corresponding to binary and unary logical operations.
Section 5 demonstrates how to compose base Propositional kernels to obtain any possible Propositional kernel. The empirical evaluation is presented in
Section 6, and finally,
Section 7 wraps up the paper and provides some possible future directions.
2. Related Work
The concept of Boolean kernel, i.e., kernels whose feature space represents Boolean formulas over the input variables, was introduced in the early 2000s by Ken Sadohara [
11]. He proposed an SVM for learning Boolean functions: since every Boolean (i.e., logic) function can be expressed in terms of Disjunctive Normal Form (DNF) formulas, the proposed kernel creates a feature space containing all possible conjunctions of negated or non-negated Boolean variables.
For instance, given , the feature space of a DNF kernel with a single conjunction contains the following features: .
Formally, the DNF and the monotone DNF (mDNF) kernel, i.e., DNF with no negated variables, between
are defined as
Sadohara’s DNF kernel works on real vectors, thus conjunctions are multiplications and disjunctions are summations. Using such a DNF kernel, the resulting decision function of a kernel machine can be represented as a weighted linear sum of conjunctions (Representer Theorem [
12,
13]), which in turn can be interpreted as a “soft” DNF. The DNF kernel, restricted to binary inputs, has been independently discovered in [
14,
15]. Its formulation is a simplified version of Sadohara's DNF kernel:
A drawback of these types of kernels is the exponential growth of the size of the feature space w.r.t. the number of involved variables, i.e., for n variables. Thus, the similarity between two examples is influenced equally by simple DNFs over the input variables and by very complex ones. To allow control over the size of the feature space, i.e., over the involved DNFs, Sadohara et al. [
16] proposed a variation of the DNF kernel in which only conjunctions with up to
d variables (i.e.,
d-ary conjunctions) are considered. Over binary vectors, this kernel, dubbed d-DNF kernel, is defined as
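The explicit formulas for these kernels appear in the original equations, which are not reproduced here. Purely as an illustrative aid, the counting view described above (the number of conjunctions satisfied by both binary inputs) suggests the following sketch; the function names are ours and the formulas are our reconstruction, not the paper's verbatim definitions.

```python
import numpy as np
from scipy.special import comb

def dnf_kernel(x, z):
    # Conjunctions of (possibly negated) literals are satisfied by both x and z only on
    # variables where the two vectors agree; every non-empty subset of those gives one feature.
    s = np.dot(x, z) + np.dot(1 - x, 1 - z)
    return 2.0 ** s - 1.0

def mdnf_kernel(x, z):
    # Monotone variant: only non-negated literals, hence only commonly active variables count.
    return 2.0 ** np.dot(x, z) - 1.0

def d_dnf_kernel(x, z, d):
    # d-DNF variant: only conjunctions with at most d literals are counted.
    s = np.dot(x, z) + np.dot(1 - x, 1 - z)
    return sum(comb(s, k) for k in range(1, d + 1))
```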
Following the same idea as Sadohara, Zhang et al. [
17] proposed a parametric version of the DNF kernel for controlling the influence of the involved conjunctions. Specifically, given
and
, then
where
induces an inductive bias towards simpler or more complex DNF formulas.
An important observation is that the embedding space of a classical non-homogeneous polynomial kernel of degree p is composed of all the monomials (i.e., conjunctions) up to degree p. Thus, the only difference between the polynomial kernel and the d-DNF kernel lies in the weights associated with the features.
A kernel closely related to the polynomial kernel is the all-subset kernel [
18,
19], defined as
This kernel has a feature space composed of all possible subsets of the input variables, including the empty set. It differs from the polynomial kernel in that it does not limit the number of considered monomials/subsets, and all features are equally weighted. We can observe that the all-subset kernel and the monotone DNF kernel are actually the same kernel up to the constant , i.e., . A common issue of both the polynomial and the all-subsets kernel is that they offer limited control over which features are used and how they are weighted.
A well-known variant of the all-subset kernel is the ANOVA kernel [
18] in which the feature space is formed by monomials of a fixed degree without repetitions. For instance, given
the feature space induced by the all-subset kernel would have the features
and
∅, while the feature space of the ANOVA kernel of degree 2 would be made up of
and
.
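Both kernels admit simple closed-form computations. The following sketch (our own rendering, with hypothetical function names) shows the standard product form of the all-subsets kernel and the usual dynamic-programming recursion for the ANOVA kernel of degree d. For binary inputs, the all-subsets kernel reduces to 2 raised to the dot product, which makes the relation with the monotone DNF kernel mentioned above apparent.

```python
import numpy as np

def all_subsets_kernel(x, z):
    # Feature space: all subsets of the variables (monomials), including the empty set.
    return float(np.prod(1.0 + x * z))

def anova_kernel(x, z, d):
    # ANOVA kernel of degree d: monomials over exactly d distinct variables, computed with
    # the standard recursion K[k, j] = K[k, j-1] + x_j * z_j * K[k-1, j-1].
    n = len(x)
    K = np.zeros((d + 1, n + 1))
    K[0, :] = 1.0
    for k in range(1, d + 1):
        for j in range(1, n + 1):
            K[k, j] = K[k, j - 1] + x[j - 1] * z[j - 1] * K[k - 1, j - 1]
    return float(K[d, n])
```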
Boolean kernels have also been used for studying the learnability of logical formulae using maximum margin algorithms, such as SVM [
20,
21]. Specifically, [
21] shows the learning limitations of some Boolean kernels inside the PAC (Probably Approximately Correct) framework. From a more practical standpoint, Boolean kernels have been successfully applied to many learning tasks, such as face recognition [
22,
23], spam filtering [
24], load forecasting [
25], and on generic binary classification tasks [
16,
17].
3. Boolean Kernels for Categorical Data
Recently, a novel Boolean kernel framework for binary inputs, i.e.,
, has been proposed [
8]. Differently from the kernels described in the previous section, this novel Boolean kernel family [
8] defines feature spaces formed by specific types of Boolean formulas, e.g., only disjunctions of a certain degree or DNFs with a specific structure. This framework offers a theoretically grounded approach for composing base Boolean kernels, namely conjunctive and disjunctive kernels, to construct kernels representing DNF formulas.
Specifically, the monotone conjunctive kernel (mC-kernel) and the monotone Disjunctive kernel (mD-kernel) of degree
d between
are defined as follows:
The mC-kernel counts the number of (monotone) conjunctions of d distinct variables that are satisfied in both and . Similarly, the mD-kernel computes the number of (monotone) disjunctions of d distinct variables that are satisfied in both and . Boolean kernel functions heavily rely on the binomial coefficient because they consider clauses without repeated variables, e.g., is not considered in the feature space of an mC-kernel of degree 2.
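The exact definitions are given in the original equations (not reproduced here). Following the counting description above, an equivalent computation can be sketched as below; this is our reconstruction under the stated interpretation, with function names of our choosing.

```python
import numpy as np
from scipy.special import comb

def mc_kernel(x, z, d):
    # Monotone conjunctive kernel: a conjunction of d distinct variables is true in both
    # x and z iff all its variables are active in both, hence choose d common ones.
    return comb(np.dot(x, z), d)

def md_kernel(x, z, d):
    # Monotone disjunctive kernel, by inclusion-exclusion on the disjunctions of d distinct
    # variables that are falsified by x and/or by z.
    n = len(x)
    zeros_x, zeros_z = n - x.sum(), n - z.sum()
    common_zeros = np.dot(1 - x, 1 - z)
    return comb(n, d) - comb(zeros_x, d) - comb(zeros_z, d) + comb(common_zeros, d)
```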
Starting from the mD- and mC-kernel (and their non-monotone counterparts) we can construct (m)DNF and (m)CNF kernels. The core observation is that DNFs are disjunctions of conjunctive clauses, and CNFs are conjunctions of disjunctive clauses. Thus, we can reuse the definitions of the mD- and mC-kernel to construct mDNF/mCNF kernels by replacing the dot products (which represent the linear/literal kernel) with the kernel corresponding to the appropriate Boolean operation. For example, if we want to construct an mDNF-kernel composed of three conjunctive clauses of degree 2, we need to substitute the dot product in (
2) with
, obtaining
where the degree of the disjunction is 3, i.e.,
, and the degree of the conjunctions is 2, i.e.,
. The size of the input space for the disjunction becomes the size of the feature space of the mC-kernel of degree 2, i.e.,
. In its generic form, the mDNF-kernel(d, c) is defined as
This kernel computes the number of mDNF formulas with exactly d conjunctive clauses of degree c that are satisfied in both and . This way of building (m)DNF and (m)CNF kernels imposes a structural homogeneity on the formulas, i.e., all elements in a clause have the same structure. Let us consider again the mDNF-kernel(3,2): each conjunctive clause has exactly two elements, and these elements are literals. As we will discuss in the next section, the BKs' structural homogeneity is a limitation that cannot be overcome using this framework.
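Following the substitution just described (dot products replaced by the mC-kernel and the input dimension replaced by the size of its feature space), a sketch of the mDNF-kernel(d, c) could look as follows; it reuses the hypothetical `mc_kernel` from the sketch above and is our reconstruction, not the paper's exact equation.

```python
from scipy.special import comb

def mdnf_kernel(x, z, d, c):
    # mD-kernel of degree d applied on top of the mC-kernel of degree c: the "input space"
    # of the disjunction becomes the feature space of the mC-kernel, of size C(n, c).
    n = len(x)
    N = comb(n, c)
    kxz, kxx, kzz = mc_kernel(x, z, c), mc_kernel(x, x, c), mc_kernel(z, z, c)
    return (comb(N, d)
            - comb(N - kxx, d)
            - comb(N - kzz, d)
            + comb(N - kxx - kzz + kxz, d))
```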
Besides its theoretical value, this kernel family greatly improves the interpretability of kernel machines, as shown in [
9], and achieves state-of-the-art performance on both classification tasks [
26] and on top-N item recommendation tasks [
27].
In this work, we improve upon [
8] by showing the limitations of this framework and providing a well-founded solution. The family of kernels presented here, called Propositional kernels, allows building kernels whose feature spaces represent (almost) any logical proposition. In the remainder of this paper, we will use the term Boolean kernels to refer to the kernel family presented in [
8].
3.1. Limitations of the Boolean Kernels
Before diving into the details of how to compute Propositional kernels, we show the limitations of the Boolean kernel framework [
8], and how Propositional kernels overcome them.
Boolean kernels are designed to produce interpretable feature spaces composed of logical formulas over the input variables. However, the set of possible logical formulas that can be formed using Boolean kernels is limited. In particular, two aspects limit the logical expressiveness of the Boolean kernels [
8]:
- (i)
BKs do not consider clauses with the same variable repeated more than once (this refers to the disjunctive and conjunctive BKs). Even though the features considered by BKs are, from a logical point of view, appropriate, they make the kernel definition cumbersome by introducing the binomial coefficient to generate all combinations without repetitions. For instance, the mC-kernel of degree 2 between
can be unfolded as
Thus, the binomial coefficient is introduced to take into account only clauses with no repeated variables, e.g., avoiding clauses like .
- (ii)
BKs are structurally “homogeneous”: each Boolean concept, described by a feature in the feature space, is homogeneous in its clauses. For example, an mDNF-kernel(3,2) creates mDNF formulas that are disjunctions of three conjunctive clauses of two variables. So, every single conjunctive clause is structurally homogeneous to the others (each of them has exactly 2 variables). It is not possible to form an mDNF of the form where different conjunctive clauses have different degrees.
For these reasons, in this paper, we propose a framework to produce kernels with a feature space that can potentially express any logical proposition over the input variables. To accomplish our goal, we need to overcome the limitations of the Boolean kernels, and we also have to provide a way to construct any possible logical formula.
Overcoming the first limitation of the Boolean kernels, i.e., no repeated variables in the formulas, is very simple: it is sufficient to include any possible combination, even those that are logically a contradiction or a tautology, e.g.,
or
. Regarding the homogeneity, some further considerations are needed. Let us assume we want to create a kernel function such that its feature space is composed of monotone DNFs of the form
, using the Boolean kernels. The embedding map of an mDNF-kernel [
8] is defined as the composition of the embedding maps of the mD-kernel and the mC-kernel as:
where we omitted the degrees and put a ~ over the functions to emphasize that we do not want to be linked to specific degrees. By this definition, there is no way to get a feature space with only formulas like
f because we would need conjunctive clauses with different degrees, which is not feasible. Now, let us say we redefine
, in such a way that it contains both conjunctions of degree 1 and degree 2, for instance, by summing an mC-kernel of degree 1 and an mC-kernel of degree 2 (the sum of two kernels induces a feature vector that is the concatenation of the kernels’ feature vectors). The resulting mapping
would not create an embedding space with only
f-like formulas anyway, because it would also contain formulas like
. Unfortunately, we cannot overcome this last issue using Boolean kernels in the way they are defined.
The main problem originates from the basic idea behind Boolean kernels, that is, creating logical formulas by “reusing” the same set of inputs in each clause. Let us consider the simple case of a disjunctive kernel of degree 2. Given an input binary vector , both literals in the disjunction are taken from , thus they are by definition structurally identical (i.e., they are both literals). Now, let us consider an mDNF kernel with d disjunctions and conjunctive clauses of c literals. We have seen that an mDNF kernel is defined as the mD-kernel applied to the output of an mC-kernel. Hence, firstly, the conjunctive kernel embedding is computed, i.e., , where all features have the same structure. Then, the disjunctive kernel embedding over is computed, creating the final feature space. This case is the same as the previous example, with the only difference that the input features are conjunctive clauses (with the same form) rather than literals. Thus, it is evident that, by construction, all the clauses are structurally homogeneous.
4. Propositional Kernels
Based on the observations made in the previous section, we now give the intuition behind the construction of Propositional kernels.
Let us take into consideration logical formulas of the form
where ⊗ is some binary Boolean operation over the variables
. To construct formulas in such a way that
a is taken from a set of variables and
b from (possibly) another set, we need to consider two different input Boolean domains, which we call
A and
B, respectively. These domains are generally intended as sets of other Boolean formulas over the input variables. Now, given an input vector
, we map
into both the domains
A and
B, i.e.,
and
, and then we perform the Boolean function
by taking one variable from
and one from
.
Figure 1 graphically shows the just described procedure.
Formally, we can define a generic propositional embedding function for the logical operator ⊗ over the domains
A and
B as:
and consequently the corresponding kernel function
is defined by
with
. The kernel
counts how many logical formulas of the form
, with
a taken from the feature space of
and
b taken from the feature space of
, are satisfied in both
and
. To check whether this formulation is ideally able to build kernels for a generic Boolean formula, let us reconsider the example of the previous section, i.e.,
. If we assume that the domain
A contains all the formulas of the form
, while the domain
B contains single literals (actually it corresponds to the input space), then by using Equation (
4) we can define a kernel for
f by simply posing
. In the next section, we expand upon the idea shown in the example above, proving that we can implement any propositional formula that does not contain bound literals (we will discuss this limitation). However, we need to design a method to compute it without explicitly constructing any of the involved spaces (except for the input space).
4.1. Construction of a Propositional Kernel
Since we want to be as general as possible, we need to define a constructive method for generating and computing a Propositional kernel, rather than a specific formulation for any possible formula. To do this, we use the fact that Boolean formulas can be defined as strings generated by a context-free grammar.
Definition 1 (Grammar for propositional logic [
28]).
Formulas in propositional logic are derived from a context-free grammar whose terminals are the propositional variables and the logical connectives, and whose productions generate a literal, the negation of a formula, or a binary operation between two formulas. A formula is a word that can be derived from the non-terminal F.
Starting from the grammar , we can define the Propositional kernel framework by providing a kernel for each production (that we call “base” Propositional kernels), and then, with simple combinations, we can build any Propositional kernel by following the rules of the grammar.
The first production, i.e., , is trivial since it corresponds to the literal kernel (or monotone literal kernel in the Boolean kernels' jargon), i.e., the linear kernel. Similarly, the second production, i.e., , which represents the negation, corresponds to the NOT kernel (or negation kernel for Boolean kernels), , which is simply the linear kernel between the “inverses” of the input vectors, i.e., , where is an n-dimensional vector with all entries equal to 1.
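In matrix form over a binary data matrix X (rows are examples), these two base kernels can be written as follows; the matrix formulation and the function names are ours.

```python
import numpy as np

def literal_kernel(X):
    # Literal (linear) kernel: counts the variables that are true in both examples.
    return X @ X.T

def not_kernel(X):
    # NOT kernel: linear kernel between the complemented inputs, i.e., counts the
    # variables that are false in both examples.
    return (1.0 - X) @ (1.0 - X).T
```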
The third and last production, i.e.,
, represents a generic binary operation between logical formulas and/or literals. To be general with respect to the operation
, we need to distinguish the two operands and we cannot make any assumption about what the operator represents. For these reasons, we will refer to the first and the second operand with
A and
B, respectively. Regarding the operation ⊗, we consider a generic truth table as in
Table 1, where
is the truth value given all possible combinations of
.
The kernel we want to define for the operation ⊗ has exactly the form of the kernel
previously described (Equation (
4)): each operand is taken from (potentially) different spaces, i.e., different formulas, and the kernel counts how many of these logical operations are satisfied in both the input vectors.
Since we have to count the common true formulas, we need to take into account the configurations of
a and
b that generate a true value for
, and given those true configurations we have to consider all the true joint configurations between the inputs. In other words, a formula can be true for
for a certain configuration, while it can also be true for
for another configuration. For instance, let the formula be a disjunction of degree 2, i.e.,
. Then, given a feature of the embedding space, this can be true for
because its
a-part is true, and vice versa for
. To clarify this last concept, please consider
Table 2.
It is evident that the value of the kernel is the sum over all the possible joint configurations of the common true target values between
and
. To compute this, for each row of
Table 2, we calculate the true formulas in
and
for the configuration corresponding to the row. For example, in the first row of the table we have to count all the common true formulas such that the features in
A and in
B are false in both
and
, and this can be computed by:
which is actually the product of the NOT kernels in the domain
A and
B, that is:
This computation can be generalized over all 16 possible configurations, i.e., the rows of the joint truth table, by the following formula
where
and
is defined as
and the definition is analogous for
.
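Equation (5) itself is not reproduced here; the following sketch captures the computation just described, namely summing, over the joint configurations that make the connective true for both inputs, the product of the per-domain counts obtained by inclusion-exclusion. The truth-table representation and all names are ours.

```python
import numpy as np

def config_counts(K, n):
    # For a domain with kernel matrix K and feature-space size n, count the features that are
    # (true, true), (true, false), (false, true) and (false, false) on each pair of inputs.
    d = np.diag(K)
    return {(1, 1): K,
            (1, 0): d[:, None] - K,
            (0, 1): d[None, :] - K,
            (0, 0): n - d[:, None] - d[None, :] + K}

def prop_kernel(truth, KA, nA, KB, nB):
    # Generic Propositional kernel for a binary connective described by `truth`,
    # a dict mapping (a, b) truth assignments to 0/1 outputs.
    cA, cB = config_counts(KA, nA), config_counts(KB, nB)
    K = np.zeros_like(KA, dtype=float)
    for (ax, bx), vx in truth.items():          # configuration on the first input
        for (az, bz), vz in truth.items():      # configuration on the second input
            if vx and vz:
                K += cA[(ax, az)] * cB[(bx, bz)]
    return K

# The conjunction and the exclusive disjunction as truth tables.
AND = {(1, 1): 1, (1, 0): 0, (0, 1): 0, (0, 0): 0}
XOR = {(1, 1): 0, (1, 0): 1, (0, 1): 1, (0, 0): 0}
```

With AND, only one joint configuration survives and the result is the entry-wise product of the two kernel matrices; with XOR, four terms survive. This matches the worked examples reported below.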
In the worst case, that is, when the joint truth table has 1 in every configuration, the formula has 16 non-zero terms. However, we have to underline that only a small set of operations requires computing the corresponding kernel via Equation (
5), since we can use logic equivalences and apply them with the Propositional kernels. The only exceptions where the logic equivalences do not hold for the Propositional kernels are when there are constraints on the variables. For example, in logic we can express the xor operation by means of
and,
or and
not, i.e.,
, but this cannot be done with kernels since we have no way to fix the relations between the first conjunctive clause and the second conjunctive clause. It is worth noticing that this is expected behavior since the Propositional kernels have a feature space with all possible arrangements of the variables in the target formula. Indeed, we can define a Propositional kernel that represents the formula
which contains the
xor-equivalent formula (when
and
) along with all other formulas of that form. Still, the Propositional
XOR kernel can be defined, but not by means of the formula above (see the next section). In practical terms, this limitation should not be critical since we do not know in advance which type of formula is needed, and it makes sense to try generic formulas rather than constrained ones. In all other cases, logic equivalences hold, e.g., De Morgan's laws and the double negation rule, and this allows us to compute, for example, the implication kernel in terms of the disjunctive Propositional kernel and the NOT kernel.
In the following we provide a couple of examples of Equation (
5) for computing the Propositional kernels.
Example 1 (Conjunction).
The truth table of the conjunction has only one true output, namely when both operands are true. Hence, there exists a unique non-zero term in the summation, corresponding to the configuration in which both operands are true in both inputs. This leads to a formulation which is actually the number of possible conjunctions that are satisfied in both input vectors, and it can be defined as the product between the number of common true formulas in A and the number of common true formulas in B, that is, the product of the kernels defined on the two domains.
Example 2 (Exclusive disjunction).
The truth table of the exclusive disjunction has two true outputs, namely when a and b have different truth values. So, in this case, the joint truth table has four non-zero terms which, through simple algebraic manipulations, can be rewritten in terms of the kernels and the NOT kernels applied on the domains A and B.
4.2. Propositional Kernels' Definition
In this section, we will refer to and as two generic kernel functions of the type , such that and , where and .
Using the procedure shown in the previous section and simple mathematical/logical simplifications, we can provide a more compact kernel definition for all 16 possible truth tables.
Table 3 summarizes the definition of the Propositional kernels for all the unary and binary logical operations. Note that in the table the kernels corresponding to the following propositions are missing:
⊤ (logical true),
⊥ (logical false),
, and
. Both
⊤ and
⊥ are not included because they are defined as the constant matrix
and
, respectively. For and , it is sufficient to swap the roles of A and B in the definitions provided for and . Finally, the literal and the NOT operator have the same definition for B, but in the table we report only the definition for A.
4.3. Relation with Boolean Kernels
The main difference between Boolean kernels [
8] and Propositional kernels has already been discussed, i.e., Boolean kernels have a homogeneous structure while Propositional kernels do not. This difference implies that there exists a discrepancy between the same type of kernels of the two families. For instance, consider the simple conjunction
, and compare the Boolean monotone conjunctive kernel (
) with the propositional
kernel (
):
The core difference lies in the combinations of features considered by the two kernels. On the one hand, only considers the combinations without repetition of the input variables, thus excluding conjunctions like and counting the conjunction only once. On the other hand, takes into consideration all the possible pairs of variables, including and both and . Although Propositional kernels are computed on “spurious” combinations, this is crucial to allow the overall framework to work, and it overcomes the limitations of Boolean kernels.
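In our notation (which may differ from the paper's), the contrast for degree-2 conjunctions over binary inputs can be summarized as follows.

```latex
% Boolean mC-kernel of degree 2: unordered pairs of distinct common-active variables.
% Propositional AND kernel with both operands taken from the literal space: all ordered
% pairs, repetitions included, i.e., the product of the two literal kernels.
\[
  \kappa_{\mathrm{mC}}^{2}(\mathbf{x},\mathbf{z}) = \binom{\langle \mathbf{x},\mathbf{z}\rangle}{2},
  \qquad
  \kappa_{\wedge}(\mathbf{x},\mathbf{z}) = \kappa_{A}(\mathbf{x},\mathbf{z})\,\kappa_{B}(\mathbf{x},\mathbf{z})
  = \langle \mathbf{x},\mathbf{z}\rangle^{2}
  \quad (A = B = \text{literals}).
\]
```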
5. Propositional Kernels’ Composition
All the definitions provided in
Table 3 can be used as building blocks to construct kernels for any propositional formula. Specifically, we use the grammar
to construct the parsing tree of the target propositional formula and then we apply the kernels defined in
Table 3 in a bottom-up fashion starting from the leaves of the parsing tree up to the root.
Leaves, which correspond to the only terminal production in the grammar, are always associated with the literal kernel. Then, moving up the tree, any of the other "base" Propositional kernels can be encountered, and the corresponding kernel is applied, where the generic domain symbols A and/or B are substituted with the domain (i.e., the logical proposition) resulting from the subtree rooted in the current node. Let us explain this procedure through an example.
Example 3 (Constructing the Propositional kernel for the proposition (a ∧ ¬b) ∨ ¬c).
The construction of the kernel follows the structure of the parsing tree, depicted in Figure 2, from the leaves to the root. The leaves, as previously said, correspond to the literal kernel between the input vectors, that is, the linear kernel. Then, the NOT kernel is applied to both b and c. Afterwards, the left branch applies the AND kernel between the input a and the latent representation of the NOT kernel on b. Finally, the root applies the OR kernel between the representations of the two branches. The OR kernel is defined in terms of AND kernels applied to the negations of the two expressions involved in the disjunction. Unsurprisingly, computing the negation of a negated expression is equivalent to computing the expression itself, which shows that the logical double negation rule applies to Propositional kernels. Moreover, when expanding the negated conjunction we can show that De Morgan's law also holds. Finally, we put everything together to obtain the kernel for the whole proposition.
Algorithm 1 shows the pseudo-code of the just-described procedure. In the algorithm, each unary kernel is a function that takes a Propositional kernel matrix and the dimension of its feature space, while each binary kernel takes two Propositional kernel matrices and the dimensions of their feature spaces. All the kernel functions return a tuple made up of a kernel matrix and its feature space dimension.
Algorithm 1: compute_kernel: compute a Propositional kernel (for a data set X) given a propositional parsing tree.
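Since the pseudo-code of Algorithm 1 is not reproduced here, the following minimal sketch illustrates the kind of bottom-up composition it describes, using the base kernels discussed earlier; the tree encoding, the De Morgan-based OR, and all names are our own choices, not the paper's implementation.

```python
import numpy as np

def literal(X):
    # Literal (linear) kernel and the size of its feature space (the number of variables).
    return X @ X.T, X.shape[1]

def prop_not(K, n):
    # NOT kernel: features of the underlying space that are false in both inputs.
    d = np.diag(K)
    return n - d[:, None] - d[None, :] + K, n

def prop_and(KA, nA, KB, nB):
    # AND kernel: a pair (a, b) is true in both inputs iff both a and b are.
    return KA * KB, nA * nB

def prop_or(KA, nA, KB, nB):
    # OR kernel obtained via De Morgan's law: a OR b = NOT(NOT a AND NOT b).
    return prop_not(*prop_and(*prop_not(KA, nA), *prop_not(KB, nB)))

BINARY = {"and": prop_and, "or": prop_or}

def compute_kernel(tree, X):
    # Bottom-up visit of a parsing tree given as nested tuples; leaf labels are only
    # structural, since every leaf corresponds to the literal kernel over all variables.
    if isinstance(tree, str):
        return literal(X)
    if tree[0] == "not":
        return prop_not(*compute_kernel(tree[1], X))
    op, left, right = tree
    return BINARY[op](*compute_kernel(left, X), *compute_kernel(right, X))

# Example 3: kernel for (a AND NOT b) OR NOT c over a binary data matrix X.
X = np.random.randint(0, 2, size=(50, 12)).astype(float)
K, dim = compute_kernel(("or", ("and", "a", ("not", "b")), ("not", "c")), X)
```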
6. Propositional Kernels’ Application
In this section, we evaluate Propositional kernels on binary classification tasks. Specifically, we conducted two sets of experiments: one on artificial data sets and one on benchmark data sets. In the first set of experiments, we assessed the benefit of using Propositional kernels on artificial data sets created with the following procedure:
- (1)
Generate a binary matrix (with examples on the rows) where each row represents a distinct assignment of the n binary variables (i.e., );
- (2)
Generate a random logical proposition f over the n variables using Algorithm 2;
- (3)
Create the vector of labels such that iff , 0 otherwise.
This “controlled” test aims at showing that the Propositional kernel corresponding to the rule that generates the labels guarantees good classification performance.
Algorithm 2: generate_rule: Algorithm for generating the parsing tree of a random propositional rule.
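Algorithm 2 itself is not shown here; purely as an illustrative stand-in (not the paper's procedure), a recursive generator of the following kind produces random parsing trees compatible with the tuple encoding used in the sketch above, where a stopping probability p biases the distribution towards shorter rules.

```python
import random

CONNECTIVES = ["and", "or"]   # illustrative subset of the binary connectives

def generate_rule(n_vars, p=0.5, max_depth=4, depth=0):
    # With probability p (or when the maximum depth is reached) emit a possibly negated
    # literal; otherwise expand into a binary connective with two random subtrees.
    if depth >= max_depth or random.random() < p:
        leaf = f"x{random.randrange(n_vars)}"
        return ("not", leaf) if random.random() < 0.5 else leaf
    op = random.choice(CONNECTIVES)
    return (op, generate_rule(n_vars, p, max_depth, depth + 1),
                generate_rule(n_vars, p, max_depth, depth + 1))
```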
In these experiments we created the artificial data sets using Algorithm 2 with
and
. We chose these values to obtain relatively short formulas that explain the labels.
Figure 3 shows the distribution of the length (i.e., number of leaves in the parsing tree) of the formulas over 1000 generations.
As evident from the figure, the used parametrization gives a bias towards short rules (i.e., 2 or 3 literals), and the probability of having longer rules quickly decays. In terms of explainability, shorter rules are easier to understand than longer and more complex ones. For this reason, we perform this set of experiments using this rule-length distribution.
Given the data set
, we compared the AUC of the Propositional kernel implementing the rule that generated
(called
Target kernel) against 10 Propositional kernels that implement a randomly generated rule using the same approach (with the same parameters) described in Algorithm 2. The used kernel machine is a hard-margin SVM, and we set n = 10, which generates a total of 1024 (= 2^10) examples, of which 30% have been held out as the test set. We performed the comparison varying the size of the training set, and the overall procedure has been repeated 100 times. The average results over all the repetitions are reported in
Figure 4.
The plots show that the target kernel can achieve an AUC > 0.99 with only 100 examples (i.e., ∼10% of the whole data set), and it achieves almost perfect classification accuracy with 200 examples. The performance of the other Propositional kernels remains on average about 3% behind, and never achieves perfect classification despite using all the training examples. These results show that using a Propositional kernel implementing the rule governing the labeling ensures high classification accuracy.
However, in real classification problems, we do not know beforehand whether and which rule over the input variables produces the labels. Thus, we propose a simple yet efficient heuristic to select an appropriate Propositional kernel while avoiding an extensive validation. Given the combinatorial nature of the space of possible formulas, we cannot systematically generate all possible Propositional kernels. Moreover, even assuming that we can select a subset of possibly good Propositional kernels, performing validation over all such kernels could be very demanding, especially on large-scale data sets.
Thus, our proposal is based on two desiderata:
- 1.
we are interested in generating kernels that can be helpful to interpret the solution of a kernel machine;
- 2.
we want to avoid performing an extensive and time-consuming validation.
To cope with our first goal, we generate kernels using the same procedure used in the previous set of experiments, i.e., kernels based on rules generated by Algorithm 2 with the same parameters. In this way, we may enforce a bias toward formulas of shorter length to ensure decent explainability. Nonetheless, by acting on
p and
we can also compute very complex formulas if we do not care much about explainability. On the one hand, by acting on the value of
p, it is possible to shift the peak of the distribution towards longer (
) or shorter (
) rules (
Figure 5). On the other hand, by changing
it is possible to change the range of the degree. High values of
decrease the range, while smaller values (
) would allow having rules with hundreds of literals (
Figure 5).
Moreover, Algorithm 2 can be substituted with any other procedure that randomly generates rules, which are then used to compute the kernels. The kernel generation procedure should be designed to leverage any a priori knowledge about the data set at hand. However, in these experiments, we assume no prior knowledge about the data sets, and the provided bias aims at producing (potentially) effective but easy-to-explain rules.
To avoid the validation procedure, we suggest selecting, from the pool of generated kernels, the one that minimizes the radius-margin ratio [
29,
30,
31,
32], that is the ratio between the radius of the minimum enclosing ball that contains all the training examples and the margin observed on training data. A similar strategy has been previously employed in [
30,
32]. The intuition is that by minimizing the radius-margin ratio we aim at selecting a representation with a large margin and a small radius, for which we can expect a tighter generalization error bound and, consequently, better performance. To validate this procedure, we performed a set of experiments on 10 categorical benchmark data sets whose details are reported in
Table 4.
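The following sketch shows one way to estimate the quantity used by the heuristic, namely the squared radius of the minimum enclosing ball (here approximated with a simple Frank-Wolfe loop on its dual) multiplied by the squared inverse margin of a hard-margin SVM (approximated with a large C). This is our own illustration of the criterion, not the paper's code.

```python
import numpy as np
from sklearn.svm import SVC

def squared_radius(K, n_iter=1000):
    # Approximate the minimum-enclosing-ball dual with Frank-Wolfe:
    #   max_beta  beta . diag(K) - beta^T K beta,  with beta on the probability simplex.
    m = K.shape[0]
    beta = np.full(m, 1.0 / m)
    d = np.diag(K)
    for t in range(n_iter):
        grad = d - 2.0 * K @ beta
        i = int(np.argmax(grad))            # best vertex of the simplex
        step = 2.0 / (t + 2.0)
        beta = (1.0 - step) * beta
        beta[i] += step
    return float(beta @ d - beta @ K @ beta)

def radius_margin_ratio(K, y, C=1e6):
    # R^2 * ||w||^2, i.e., squared radius over squared margin, for a (nearly) hard-margin SVM.
    svm = SVC(kernel="precomputed", C=C).fit(K, y)
    a = svm.dual_coef_[0]                    # alpha_i * y_i on the support vectors
    sv = svm.support_
    w_norm2 = float(a @ K[np.ix_(sv, sv)] @ a)
    return squared_radius(K) * w_norm2
```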
In these experiments, we used a soft-margin SVM. We validated the hyper-parameter
C in the set
using nested 5-fold cross-validation, and all kernels have been normalized, that is, given a kernel matrix
we computed its normalized version
as
where
is an
n-dimensional vector containing the diagonal of
.
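A minimal sketch of the standard kernel normalization, which divides each entry by the geometric mean of the corresponding diagonal entries (our rendering of the step described above).

```python
import numpy as np

def normalize_kernel(K):
    # Standard normalization: K'[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j]).
    d = np.diag(K)
    return K / np.sqrt(np.outer(d, d))
```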
Given a data set, we randomly generated 30 Propositional kernels and for each of them, we trained a soft-margin SVM. In
Figure 6, we show, for each data set, the achieved AUC w.r.t. the radius-margin ratio of each of the 30 kernels. As a reference, we also indicate with a red dotted line the performance of the linear kernel as well as the performance (blue dotted line) of the best performing Boolean Kernel according to [
8]. We tested all monotone and non-monotone Boolean kernels (i.e., disjunctive, conjunctive, CNF and DNF) up to degree 5 for both the conjunctive and the disjunctive clauses. The best Boolean kernel is selected according to the AUC on the test set.
From the plots, and from the correlation analysis reported in
Table 5 it is evident that the minimal radius-margin ratio represents a good, although not perfect, heuristic. There are three cases in which the correlation between low radius-margin ratio and high AUC is not statistically significant, namely for
house-votes,
spect and
splice. However, in these data sets, it seems that almost all of the randomly generated Propositional kernels have similar performance, which makes the heuristic not useful but still not harmful. There are also two data sets (
monks and
primary-tumor) in which the negative correlation is weakly significant, yet the kernel with the lowest radius-margin ratio is competitive: best performance in
monks and ∼0.5% inferior to the best AUC in
primary-tumor.
In almost all data sets the linear kernel performs poorly with respect to the Propositional kernels. Nonetheless, on both
spect and
house-votes the linear kernel performs decently. In particular, on
spect the generated Propositional kernels seem to be sub-optimal and the best performance is achieved by Boolean kernels. We argue that on this specific data set kernels with low complexity (e.g., low spectral ratio [
31]) tend to perform better. Indeed, the tested Boolean kernels implement simple combinations of disjunctions and conjunctions, while Propositional kernels are generally more complex given the wider range of operations that they can implement. This is further supported by the good performance of the linear kernel.
Despite the just-mentioned exception, Propositional kernels achieve the highest AUCs in all other data sets, exceeding 98% AUC in all but primary-tumor. This particular classification task seems to be more difficult than the others, and this is further underlined by the poor performance of both the linear kernel and the best Boolean kernel.
It is also noteworthy that in all data sets, except primary-tumor, promoters and soybean, the radius-margin ratio achieved by the linear kernel was out of scale w.r.t. the ones in the plots. This is due to the fact that such data sets are not linearly separable.