1. Introduction
Conditional independence is one of the main concepts in statistics which plays a fundamental role in such areas as causal inference, dependence analysis, and graphical modelling. In this review, we discuss how measures of dependence considered in information theory are used in classification and regression problems to choose predictors which significantly influence the outcome. This is a vital application of the information-theoretic approach to variable selection, also known as feature selection. Let us stress, however, that information-theoretic measures are frequently applied in many other problems in Machine Learning and Statistics, in contexts other than feature selection. Some representative examples include data visualisation (t-SNE, [1]), clustering [2], Independent Component Analysis (ICA, [3], Chapter 15), Variational Inference ([4], Chapter 10), and Natural Language Modelling [5], among others.
In the era of big data, the problem of choosing a small group of variables, from the pool of all potential variables, which accurately describes the changes in the response is gaining in importance. This is needed for a better understanding of the studied phenomena, as well as for the construction of tractable models for them. Moreover, feature selection is instrumental in avoiding the curse of dimensionality, when a prohibitive amount of data is required for adequate fitting of the model ([6], Chapter 2), and in avoiding overfitting (ibid., Chapter 7). This is also important for prediction, as classifiers which avoid using inactive predictors are usually less variable. Feature selection is also frequently applied because of cost considerations, especially when measuring some of the potentially useful predictors is costly or inconvenient.
Another competing direction of research with a similar aim is dimensionality reduction, in particular variable extraction, which transforms given variables to obtain a lean group of new predictors. Primary examples of such methods are Principal Components Regression and Partial Least Squares, see, e.g., [
6], and more recent approaches based on neural networks [
7].
Here, we focus on an important group of selectors, called
filters, which, during the selection process, do not take into account the classifier (or regression estimator) which will incorporate selected variables. In contrast,
wrappers select features to optimise performance of a specific classifier under consideration (see, e.g., methods which also yield ranking of features, such as [
8,
9,
10]). The property that filters are model-agnostic makes them a universal tool which can be used for any classifier or regression estimator.
In this paper, we try to present a critical and, unavoidably, partial assessment of the state of the art for this problem, focusing on results which can be formally proved, and showing the motivation, advantages, and limitations of the presented solutions, including the most recent ones. Additionally, we show that the majority of information-based feature selection criteria can be viewed as truncated, or truncated and weighted, versions of the Möbius decomposition. This sheds new light on similarities and differences between the frequently employed criteria. Moreover, based on recent results, asymptotic distributions of their plug-in empirical counterparts are discussed. Such results are needed to construct tests of conditional independence which control the probability of false alarms. Recent advances in resampling, yielding conditionally independent samples and providing an alternative way to solve this problem, are also discussed. By this, hopefully, a new insight is added to the existing reviews (see, e.g., [
11,
12,
13,
14]).
The paper is organised as follows. We first discuss in Section 2 the main information-theoretic objects and their interplay, focusing on the Möbius decomposition of conditional mutual information (CMI). In Section 3, we introduce and discuss the feature selection problem from the information theory perspective, based on the measures of dependence introduced in the previous section. The concept of the Markov Blanket of a target variable, being the aim of feature selection as the minimal set of predictors containing the whole information about it, is investigated here. In Section 4, we study feature selection criteria related to CMI, stressing that most of them are naturally related to the Möbius decomposition. Moreover, this section introduces another group of selectors based on variational bounds of information measures. In the following Section 5, we discuss the interplay between various CMI-related measures. The next sections cover the approximate distributions of the introduced measures (Section 6) and an alternative method of assessing their distributions using resampling schemes (Section 7). These properties are vital for the performance of conditional independence tests, which are building blocks of feature selection algorithms. In Section 8, Markov Blanket discovery algorithms are described. Section 9 discusses feature selection in the continuous case, and Section 10 covers the related problem of interaction detection, stressing how it can be incorporated into the feature selection approach.
2. Conditional Mutual Information and Related Measures
We start with discussing the properties and the role that conditional mutual information (CMI) plays in variable selection based on information-theoretic concepts. We will focus on the discrete finite case, meaning that the considered variables take a finite number of discrete values. The continuous case is briefly reviewed in
Section 9. In the following,
Y will denote the class variable whereas
X and
Z (possibly multivariate) are features which will be used to predict
Y. For basic information theoretic concepts we refer to [
15,
16].
2.1. Mutual Information
Definition 1 ([
15], p. 46)
. The Mutual Information (MI) between Y and X is defined as
I(Y;X) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = H(Y) - H(Y|X),   (1)
where H(Y) and H(Y|X) are the entropy and the conditional entropy, respectively. The second equality in (
1) motivates the name Information Gain, which is also used for I(Y;X). It also yields an intuitive meaning for I(Y;X) as the decrease in the variability of Y, measured by its entropy, when the information about another variable X is available. We remark that I(Y;X) is also frequently denoted by MI(Y;X). Note that the semicolon in I(Y;X) indicates between which variables dependence is considered: either between Y and X in the case of I(Y;X), or between Y and (X,Z) in the case of I(Y;(X,Z)).
I(Y;X) evaluates how much the joint distribution P_{(X,Y)} of (X,Y) differs from the product P_X ⊗ P_Y of their marginal distributions. As I(Y;X) = 0 corresponds to independence of X and Y, I(Y;X) can be considered a measure of the strength of dependence between Y and X. It follows from the definition that ([15], Section 2.3)
I(Y;X) = KL( P_{(X,Y)} || P_X ⊗ P_Y ),
where the Kullback–Leibler (KL) divergence between distributions P and Q is defined as
KL(P || Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}.
We note that the KL divergence is closely related to the Maximum Likelihood (ML) method, and the popular feature selection method, the Akaike Information Criterion (AIC), is derived as a bias-corrected ML [17]. We also remark that the Kullback–Leibler divergence can be replaced by another pseudo-distance between probability distributions, resulting in a different measure of dependence.
It follows from the properties of the KL divergence that I(Y;X) is non-negative and is equal to zero if, and only if, P_{(X,Y)} = P_X ⊗ P_Y, i.e., when Y and X are independent. Moreover, it follows from the definition (1) that Mutual Information is symmetric: I(Y;X) = I(X;Y). It is also easily seen that (see formula (2.45) in [15])
I(Y;X) = H(X) + H(Y) - H(X,Y).
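To make the plug-in estimation concrete, the following minimal Python sketch (our illustration, not code from the cited works) computes the empirical mutual information of two discrete samples using the identity I(Y;X) = H(X) + H(Y) - H(X,Y), with frequencies substituted for the probabilities.

```python
import numpy as np

def entropy(counts):
    """Entropy (in nats) of the empirical distribution given by a vector of counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(x, y):
    """Plug-in estimate of I(Y;X) = H(X) + H(Y) - H(X,Y) for discrete samples x, y."""
    x, y = np.asarray(x), np.asarray(y)
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xs.size, ys.size))
    np.add.at(joint, (xi, yi), 1.0)          # contingency table of (X, Y)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())

# toy check: Y = X with X uniform on {0, 1}, so I(Y;X) = H(Y) = log 2
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10_000)
print(mutual_information(x, x))              # approximately 0.693
```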
2.2. Conditional Mutual Information
We denote by P_{X|Z=z} the conditional distribution of X given Z = z and define I(Y;X|Z=z) as the strength of dependence between Y and X given Z = z. Thus, I(Y;X|Z=z) = 0 means that Y and X are independent given that Z equals z. Now we define the conditional mutual information.
Definition 2 ([15], p. 49). The conditional mutual information (CMI) is
I(Y;X|Z) = \sum_{z} P(Z = z) I(Y;X|Z=z).   (3)
Thus, the conditional mutual information is the mutual information of Y and X given Z = z averaged over the values of Z. It is a measure of dependence between Y and X given the knowledge of Z. From the properties of the Kullback–Leibler divergence it follows that
I(Y;X|Z) ≥ 0, with I(Y;X|Z) = 0 if, and only if, X and Y are conditionally independent given Z.
This is a powerful property, not satisfied by other measures of dependence, such as the partial correlation coefficient in the case of continuous random variables. The conditional independence of X and Y given Z will be denoted by X ⊥⊥ Y | Z and abbreviated to CI. We note that since I(Y;X|Z) is defined as a probabilistic average of I(Y;X|Z=z) over z, it follows that
I(Y;X|Z) = 0 if, and only if, I(Y;X|Z=z) = 0 for every z with P(Z = z) > 0.
Thus, formally, testing conditional independence of X and Y given Z is equivalent to testing unconditional independence of X and Y on every stratum {Z = z}. This, however, does not make the problem an easy task, as sometimes claimed, since performing such tests simultaneously on each stratum (global null), even when we have sufficient data to do it, would result in a lack of control of the probability of false signals (multiple testing problem).
The above definitions can be naturally extended to the case of random vectors (i.e., when X, Y, and Z are multivariate) by using multivariate probability mass functions instead of univariate ones. It is also easily seen that the following chain formula holds
I(Y;(X,Z)) = I(Y;Z) + I(Y;X|Z).   (4)
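A corresponding sketch for the conditional mutual information follows directly from definition (3): the mutual information within each stratum {Z = z} is averaged with weights P(Z = z). The code below (again our illustration, reusing mutual_information() from the previous sketch) assumes a one-dimensional discrete Z.

```python
import numpy as np

def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(Y;X|Z) as the average of within-stratum MI, cf. (3)."""
    x, y, z = map(np.asarray, (x, y, z))
    cmi_value = 0.0
    for zv in np.unique(z):
        idx = (z == zv)
        # weight P(Z = z) times the plug-in estimate of I(Y;X | Z = z) on the stratum
        cmi_value += idx.mean() * mutual_information(x[idx], y[idx])
    return cmi_value

# toy example: X, Z fair coins, Y = X xor Z; then I(Y;X) is ~0 but I(Y;X|Z) is ~log 2
rng = np.random.default_rng(1)
x = rng.integers(0, 2, 20_000)
z = rng.integers(0, 2, 20_000)
y = x ^ z
print(mutual_information(x, y), conditional_mutual_information(x, y, z))
```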
2.3. Interaction Information
The 3-way interaction information of, possibly multivariate, X, Y, and Z also plays an important role in feature selection.
Definition 3. The 3-way interaction information is defined as
II(X;Z;Y) = I(Y;X|Z) - I(Y;X) = I(Y;(X,Z)) - I(Y;X) - I(Y;Z).   (5)
The second equality, stemming from (4), neatly explains the idea of interaction information: we want to evaluate the synergistic effect (positive or negative) of both X and Z influencing Y, disregarding the sum of their individual effects. Thus, from the evaluated strength of the overall dependence between Y and the pair (X,Z), measured by I(Y;(X,Z)), we subtract the strength of the individual dependences of Y on X and of Y on Z, measured by I(Y;X) and I(Y;Z), respectively. Although it is not immediately clear from the definition, II(X;Z;Y) is actually a symmetric function of its arguments. However, it can be positive, negative, or 0. If it is positive, X and Z are said to be complementary with respect to Y, or interacting positively in influencing Y. A negative value of II(X;Z;Y) indicates redundancy among the features or their inhibition.
We define the 2-way interaction information as II(X;Y) = I(X;Y). Definition (5) can be generalised in a recursive way to define the k-way interaction information (see [18,19,20] for alternative equivalent definitions). For any subset of indices S ⊆ {1, …, p}, X_S will stand for the sub-vector of X_i's with indices in S. Abusing the notation, X_i ∈ S will mean that i ∈ S. Let |S| denote the cardinality of S. We define the k-way interaction information II(X_S), for |S| = k, so that the chain formula holds.
Definition 4. The k-way interaction information is defined as
II(X_1; …; X_k) = II(X_1; …; X_{k-1} | X_k) - II(X_1; …; X_{k-1}),
where, consistently with the definition of the conditional mutual information in (3), we define
II(X_1; …; X_{k-1} | X_k) = \sum_{x_k} P(X_k = x_k) II(X_1; …; X_{k-1} | X_k = x_k).
Using the expansions of II and CMI in terms of entropies, the following formula for the k-way II is obtained
II(X_S) = - \sum_{T ⊆ S} (-1)^{|S| - |T|} H(X_T),
where H(X_∅) = 0 for the empty set. In particular, we have
II(X_1; X_2; Y) = - H(X_1) - H(X_2) - H(Y) + H(X_1,X_2) + H(X_1,Y) + H(X_2,Y) - H(X_1,X_2,Y).
Note that when, e.g., X_1 and X_2 are independent copies of a random variable having the Bernoulli distribution with p = 1/2 and Y = X_1 XOR X_2, we obtain II(X_1;X_2;Y) = log 2 > 0, as in this case I(Y;X_1) = I(Y;X_2) = 0, whereas I(Y;X_1|X_2) = log 2. In the following, the so-called Möbius expansion (see, e.g., [20]) plays a crucial role.
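The synergy in this kind of example can be verified numerically. The sketch below (our illustration; the XOR target is the assumed toy model discussed above) computes II(X_1;X_2;Y) = I(Y;X_1|X_2) - I(Y;X_1), reusing the two estimators defined in the previous sketches.

```python
import numpy as np

def interaction_information(x1, x2, y):
    """Plug-in 3-way interaction information II(X1;X2;Y) = I(Y;X1|X2) - I(Y;X1), cf. (5)."""
    return conditional_mutual_information(x1, y, x2) - mutual_information(x1, y)

rng = np.random.default_rng(2)
x1 = rng.integers(0, 2, 50_000)
x2 = rng.integers(0, 2, 50_000)
y = x1 ^ x2                                  # each feature alone is useless, together they determine Y
print(interaction_information(x1, x2, y))    # close to log 2: strong positive interaction (synergy)
```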
Theorem 1. The conditional mutual information satisfies the equation
I(Y;X|X_S) = \sum_{T ⊆ S} II(X_T, X, Y),   (8)
where II(X_∅, X, Y) = I(Y;X), and, conversely,
II(X_S, X, Y) = \sum_{T ⊆ S} (-1)^{|S| - |T|} I(Y;X|X_T).   (9)
We stress that the sums range over all 2^{|S|} subsets of S. For a description of set functions for which (8) and (9) are equivalent, see [20]. Note that we carefully distinguish between the 3-way interaction information II(X_S, X, Y), with the multivariate X_S as the third component, and the (|S|+2)-way interaction information between X, Y and all individual components X_i, i ∈ S. Equality (8) can be restated, in view of II(X_∅, X, Y) = I(Y;X) and (6), as
I(Y;X|X_S) = I(Y;X) + \sum_{∅ ≠ T ⊆ S} II(X_T, X, Y).   (10)
Finally, note that it follows from (9) that II(X_S, X, Y) = 0 provided X and Y are conditionally independent given any subvector X_T of X_S, including the empty one (i.e., when X and Y are also unconditionally independent). As we shall see in Section 4.1, the Möbius expansion (8) is a natural starting point to introduce CMI-related criteria for feature selection.
Let us indicate that for any classifier Ŷ(X) of the class variable Y for g classes, its unconditional probability of error P_e = P(Ŷ(X) ≠ Y) can be related to the conditional entropy H(Y|X) by means of Fano's inequality ([15,21])
H(Y|X) ≤ h(P_e) + P_e \log(g - 1),
where h(·) is the binary entropy function, which, using I(Y;X) = H(Y) - H(Y|X), can be written as
H(Y) - I(Y;X) ≤ h(P_e) + P_e \log(g - 1).
This shows that when I(Y;X) decreases, i.e., Y and X become less dependent, the lower bound on the probability of error of any classifier increases.
3. Feature Selection
In this section we discuss an objective of feature selection, the concept of Markov Blanket and its properties, as well as a greedy search for active features.
3.1. General Considerations: Characterisations of the Markov Blanket
Suppose now that we consider the class variable Y and the p-dimensional discrete vector X = (X_1, …, X_p) of all available predictors. Our aim is to describe Y using those features among X_1, …, X_p which jointly influence Y. Note that this is a different task than selecting features which individually affect Y. Namely, we want to find out which features can be discarded without loss of information about Y when the remaining features are taken into account. Thus, our aim is to check which features become redundant in the presence of other features. Moreover, we would like to investigate synergy between active features, that is, a possible additional effect due to their simultaneous acting. Joint informativeness is, thus, crucial for feature selection. In this context, we define a minimal set of active predictors.
Definition 5. We define a minimal subset of active predictors as a minimal subset T* ⊆ F = {1, …, p}, such that
I(Y; X_{T*}) = I(Y; X_F), or, equivalently, I(Y; X_{F∖T*} | X_{T*}) = 0,   (11)
where X_T denotes the subvector of X with indices in the set T. Minimality is meant in the set-theoretic sense, i.e., T* is minimal if there is no proper subset of it which satisfies (11). Note that the second equality in (11) is due to the chain formula, as it follows that I(Y; X_F) = I(Y; X_T) + I(Y; X_{F∖T} | X_T) for any T ⊆ F.
The problem of determining T* is a feature selection problem stated in the information-theoretic setting, as X_{T*} may be considered as the minimal set describing adequately the overall dependence of Y on the available vector of predictors. Another possible way of defining a small subset of predictors which contains ’almost’ all information about the target Y is, for a given value of a hyperparameter ε > 0, to consider the smallest subset T, such that
I(Y; X_F) - I(Y; X_T) ≤ ε.
The other variant of the problem is its constrained version, when a solution is sought for a specific cardinality k of the feature set, which involves \binom{p}{k} evaluations of mutual information. We also refer to the related Information Bottleneck problem, in which the information in X is compressed to a random variable M satisfying a given compression condition making it easy to transmit, which has maximal mutual information with Y, see, e.g., [22].
Note that since
is monotone in the second coordinate (
for
), we have
thus the RHS can be used as a criterion when looking for
([
23]).
It is shown in [
24] that when the distribution of
is determined by a subvector
corresponding to a certain
and is parametrised by
, finding a maximiser of
is a first step of maximising the log-likelihood
over
T and
under so called filter assumption stating that optimisations over
T and over
are independent. The problem of uniqueness of
satisfying (11) is important and sometimes disregarded in feature selection.
Remark 1. Note that although (11) is trivially satisfied for T = F, it does not mean that the minimal set in the sense of inclusion is uniquely defined. An obvious example is constructed by taking any X_1 such that {1} satisfies (11) and letting X_2 = f(X_1), where f is any transform which maps the set of values of X_1 onto itself in a one-to-one manner. In this case, I(Y;X_2) = I(Y;X_1); thus, both subsets {1} and {2} satisfy (11) and are minimal. Moreover, note that I(Y; X_1 | X_2) = I(Y; X_2 | X_1) = 0. However, uniqueness of the minimal set satisfying (11) holds for the case of continuous X and binary Y under strict positiveness assumptions stating that the density of X is positive almost everywhere with respect to the Lebesgue measure and that 0 < P(Y = 1 | X = x) < 1 (see [25], p. 97). We now discuss properties of the set T*. The first one is obtained by noticing that, in view of the chain formula, for T* defined in (11) we have I(Y; X_{F∖T*} | X_{T*}) = 0. This is equivalent to stating that X_{T*} is the so-called Markov Blanket of Y.
Definition 6. The Markov Blanket of the target variable Y is the minimal subset MB(Y) ⊆ F, such that
Y ⊥⊥ X_{F∖MB(Y)} | X_{MB(Y)},   (13)
i.e., I(Y; X_{F∖MB(Y)} | X_{MB(Y)}) = 0 (a minimal set satisfying (13) is also called a Markov boundary). Thus X_{MB(Y)} shields Y from the rest of the predictors in the sense that they become irrelevant once X_{MB(Y)} is known. Again, MB(Y) does not need to be uniquely defined. In order to discuss properties of MB(Y), we introduce the concept of strong relevancy.
Definition 7. X_i is a strongly relevant feature provided I(Y; X_i | X_{F∖{i}}) > 0.
and it can be restated as
[
24]. Intuitively, strongly relevant features should be included in
as, after exclusion of such features, the dependence of
Y on
is not adequately described. In the class of General Linear Models, under minimal conditions, features which are not strongly relevant are exactly those for which the corresponding regression coefficient equals 0 (see [
26], Proposition 2.2). The following facts concerning
MB(Y) have been formally proved:
Theorem 2. Assume that the Markov Blanket MB(Y) exists. Then, we have:
- (i)
T* coincides with the Markov Blanket MB(Y);
- (ii)
for any T ⊆ F, Y ⊥⊥ X_T | X_{MB(Y)} if, and only if, T ∩ MB(Y) = ∅;   (14)
- (iii)
Every strongly relevant feature belongs to MB(Y).
The first statement is justified above. Part (ii) is due to [27] and indicates that all subsets of F are partitioned into two parts: the first consisting of sets T disjoint with MB(Y), for which Y ⊥⊥ X_T | X_{MB(Y)} holds, and the remaining ones, which are conditionally dependent on Y given X_{MB(Y)}.
We note that (iii) can be justified by the following simple reasoning: it is enough to show that if
then
is not strongly relevant. Indeed, we have
where the second equality follows from the chain formula and the third from the fact that
and the chain formula again. Thus, from the second equality it follows that
whence
is not strongly relevant. It is not true in general that
consists only of strongly relevant features. Note that in the last example both
and
substituted for
satisfy (13), but neither of them is strongly relevant as
and
. In the case when
X is continuous and
Y is binary, this can be proved for strictly positive distributions defined in Remark 1 (cf. Theorem 10 in [
28]).
Ref. [27] introduces a concept of the m-Markov Blanket of Y for m ≤ p, for which the equivalence in (14) is satisfied for any subset T such that |T| ≤ m. As for m = p the condition |T| ≤ p is vacuous, the p-Markov Blanket of Y is MB(Y).
3.2. Greedy Feature Selection
In view of (i) of Theorem 2, finding MB(Y) is equivalent to finding the minimal set T satisfying
I(Y; X_{F∖T} | X_T) = 0.
This is a difficult task when p is large, as it would involve checking this condition for all 2^p subsets of F. The problem is frequently replaced by its greedy selection analogue: given S, defined as the set of indices of features already selected, one determines
argmax_{i ∉ S} I(Y; X_i | X_S) = argmax_{i ∉ S} [ I(Y; X_{S ∪ {i}}) - I(Y; X_S) ],
where the second equality follows from (4). The feature having the largest mutual information with Y is chosen first, and the sequential search is stopped when for all i ∉ S we have I(Y; X_i | X_S) = 0. Note that, in view of the chain equality, the maximised criterion equals
I(Y; X_i | X_S) = I(Y; X_i) - I(X_i; X_S) + I(X_i; X_S | Y) = I(Y; X_i) + II(X_i, X_S, Y),   (16)
where II(X_i, X_S, Y) is the three-way interaction information of X_i, X_S, and Y. The first two terms of the first equality are called the relevance and the redundancy of X_i with respect to Y and X_S, respectively. Note that in the maximisation process of I(Y; X_i | X_S) over i ∉ S, the redundancy of X_i with respect to X_S may be outweighed by the magnitude of the conditional information which X_i contains about X_S within classes: I(X_i; X_S | Y). The RHS of (16) involves II(X_i, X_S, Y). In the next section, we show that it is frequently replaced by algebraic expressions involving interaction informations II(X_T, X_i, Y) for |T| = k, where k does not exceed 2. This is motivated by (8) and allows for easier estimation of the expression.
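As an illustration of this greedy scheme, the following sketch (our simplified implementation, with the exact condition I(Y;X_i|X_S) = 0 replaced by a small threshold on the plug-in estimate rather than a formal test) performs forward selection on discrete data.

```python
import numpy as np

def entropy_rows(arr):
    """Entropy (nats) of the empirical distribution of the rows of a 2-D array."""
    _, counts = np.unique(arr, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def cmi(x, y, Z):
    """Plug-in I(Y;X|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z); Z is an (n, k) array, k may be 0."""
    x, y = x.reshape(-1, 1), y.reshape(-1, 1)
    hz = entropy_rows(Z) if Z.shape[1] else 0.0
    return (entropy_rows(np.hstack([x, Z])) + entropy_rows(np.hstack([y, Z]))
            - hz - entropy_rows(np.hstack([x, y, Z])))

def greedy_select(X, y, eps=1e-3):
    """Forward selection: repeatedly add argmax_i I(Y;X_i|X_S), stop when the best gain <= eps."""
    n, p = X.shape
    selected = []
    while True:
        Z = X[:, selected] if selected else np.empty((n, 0))
        gains = [cmi(X[:, j], y, Z) if j not in selected else -np.inf for j in range(p)]
        best = int(np.argmax(gains))
        if gains[best] <= eps:      # remaining features look conditionally independent of Y
            return selected
        selected.append(best)

# toy data: Y is determined by features 0 and 1 only; feature 2 is pure noise
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(20_000, 3))
y = X[:, 0] + 2 * X[:, 1]
print(greedy_select(X, y))          # typically [0, 1] or [1, 0]
```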
5. Interplay between CMI and CMI-Related Criteria
The following result states conditions under which CMI and one of the introduced criteria coincide. As before, X denotes a feature from F∖S considered as a candidate to be added to S.
Theorem 3. - (i)
Assume that all features are conditionally independent given class (naive Bayes assumption): and features in S are conditionally independent given any not chosen feature (i.e., belonging to ). Then, for , differs from by a factor which does not depend on X.
- (ii)
Assume that all k-way interaction informations for (). Then, for any .
- (iii)
Ref. [24] If are conditionally independent given X, for and additionally they are conditionally independent given X and Y, then differs from by a factor which does not depend on X. - (iv)
Ref. [48] Assume that for any we have and additionally . Then, .
For completeness, the proof is included in the
Appendix A. Note that the conditions imposed by (i) are very stringent. There are no known probabilistic vectors satisfying
for
apart from simple situations, such as
, which obviously does not hold when
is the set of chosen predictors.
We remark that under conditions of (i)
, as due to the naive Bayes assumption
. Additionally, it follows from (21) that
provided that the naive Bayes condition holds, thus in order to have
under rather strong assumptions of (iv), we have to assume additionally naive Bayes condition. This indicates that the last equality can hold only under restrictive conditions and is not true in general (see [
24], Section 4.1).
Assumptions in (iii) and (iv) are not compatible as they lead to different forms of criteria equivalent to
criterion. It is argued in [
46] that the only plausible graph representation of the probability structure for which the conditions of (iii) hold is a graph with edges
. In this case, due to data-processing inequality, we have
. This means, however, that
X should have been chosen
before any features from
S.
Additional results of a similar flavour to Theorem 3 are discussed in [
38,
49].
It follows from the preceding discussion that the criteria introduced in Section 4.1 are formally obtained by truncation (or truncation and weighting) of terms in the Möbius expansion. Their analytical properties, concerning how well they approximate CMI, have yet to be established.
6. Asymptotic Distributions of Information-Theoretic Measures
Sequential feature selection for predicting target variable
Y typically involves checking whether a new candidate feature
X is a valuable addition to the set of already chosen features
. This is usually based on testing whether null hypothesis
holds and its rejection is interpreted as an indication that
X carries an additional information about
Y to that provided by
. To this end,
or its modified versions are used. The usual strategy, following the statistical testing approach, is to derive the asymptotic distribution, under the null, of the CMI estimator or of its modifications described in Section 4. This distribution is used as a benchmark against which the value of the statistic for the sample under consideration is compared to obtain the asymptotic
p-value of the test. In the following, we describe the asymptotic distribution of
and
-related criteria under
. A competing approach to approximate the distribution of the considered statistic under
based on resampling is described in
Section 7.
6.1. Asymptotic Distribution of
We assume that X, Y, and Z take I, J, and K possible values, respectively. As Z is, in general, a multidimensional vector, K is the number of all possible combinations of values of its coordinates. We let p(x,y,z) = P(X = x, Y = y, Z = z) and consider the case when all probabilities p(x,y,z) are positive. It is assumed throughout that the estimation of I(Y;X|Z) and related quantities is based on n independent identically distributed (iid) samples from the distribution of (X,Y,Z). Construction of estimators of I(Y;X|Z) relies on plugging in frequencies in place of the unknown probabilities, e.g., p̂(x,y,z) = n(x,y,z)/n, where n is the sample size and n(x,y,z) is the number of samples equal to (x,y,z). We will consider only frequencies as estimators of discrete probabilities. Other estimators exist, e.g., regularised versions of sample frequencies, for which the regularisation reflects a level of departure from conditional independence, see [
38,
50]. The following known result is frequently used in dependence analysis (compare [
51], see also [
52]).
Theorem 4. - (i)
Assume that I(Y;X|Z) > 0. Then, we have
√n ( Î(Y;X|Z) - I(Y;X|Z) ) → N(0, σ²)  in distribution,
where Î(Y;X|Z) is the plug-in estimator of I(Y;X|Z) and σ² = Var( log [ p(X,Y,Z) p(Z) / ( p(X,Z) p(Y,Z) ) ] ).
- (ii)
Assume that I(Y;X|Z) = 0. Then,
2n Î(Y;X|Z) → χ²_{(I-1)(J-1)K}  in distribution,
where (I-1)(J-1)K is the number of degrees of freedom of the limiting chi-square distribution.
The frequently applied test of conditional independence is based on the useful fact (ii) above: under conditional independence the asymptotic distribution does not depend on the distribution of (X,Y,Z) and is chi-square with a known number of degrees of freedom. On the other hand, when conditional independence does not hold, the limiting distribution of the plug-in estimator is normal, with the variance depending on the underlying probability distribution. We stress that the speeds of convergence of the estimator to I(Y;X|Z) are different in the two cases: the rate is n^{-1} under conditional independence and n^{-1/2} otherwise.
The test based on
is a popular tool in dependence analysis, in particular for Markov Blanket discovery (see
Section 8). It has different names among which
test is the most popular (see, e.g., [
53]). Additionally, in the literature
denotes the second order approximation of
, which turns out to be the conditional chi-square test and has the same asymptotic distribution as
.
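The sketch below illustrates how the χ² benchmark from (ii) is used in practice: the plug-in statistic 2n·Î(Y;X|Z) is compared with the chi-square distribution with (I−1)(J−1)K degrees of freedom. This is our illustrative code (not the authors' implementation), reusing cmi() and entropy_rows() from the greedy-selection sketch and SciPy's chi-square distribution.

```python
import numpy as np
from scipy.stats import chi2

def ci_test(x, y, Z):
    """Asymptotic CI test: returns (2n * I_hat(Y;X|Z), asymptotic p-value)."""
    x, y, Z = np.asarray(x), np.asarray(y), np.asarray(Z)
    n = len(x)
    stat = 2.0 * n * cmi(x, y, Z)
    I, J = np.unique(x).size, np.unique(y).size
    K = np.unique(Z, axis=0).shape[0]        # number of observed strata of Z
    df = (I - 1) * (J - 1) * K
    return stat, chi2.sf(stat, df)

# toy data with X and Y conditionally independent given Z
rng = np.random.default_rng(4)
z = rng.integers(0, 3, size=(5_000, 1))
x = (z[:, 0] + rng.integers(0, 2, 5_000)) % 3
y = (z[:, 0] + rng.integers(0, 2, 5_000)) % 3
print(ci_test(x, y, z))                      # p-value typically not small: no evidence against CI
```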
6.2. Asymptotic Distribution of Modified Criteria
Asymptotic distribution of the modified criteria related to
can be also derived. Let
be a plug-in version of
defined in (20). Moreover, let
be a vector of probabilities of dimension
,
corresponding vector of fractions and
be a function which represents
as a function of
p, i.e.,
. For example for
criterion (see (22)) the corresponding function is defined as
where
denotes coordinate of
. We state here the result on asymptotic distribution of
proved in [
29]. Let
denote an element of matrix
with row index
and column index
.
Theorem 5. - (i)
We havewhere and is a matrix consisting of elements . - (ii)
If thenwhere V follows distribution, and is a Hessian of f.
In particular, we have the following result for
[
54]:
Corollary 1. Let Y be binary.
- (i)
If thenwhere - (ii)
We . In this case,where V and H are defined in Theorem 2, are iid and are eigenvalues of the matrix which has the following elements
It follows from Theorem 1 that if for any T ⊆ S we have that I(Y;X|X_T) = 0, i.e., Y and X are conditionally independent given X_T, then the asymptotic distribution is that of the quadratic form specified in (ii); otherwise, the distribution is normal.
The main advantage of using modified criteria instead of
is that their estimation does not require as large samples as for
itself. Note that for modifications of order
k, conditioning involves
k-dimensional strata, whereas for
p-dimensional strata are considered. However, the modified criteria considered in Section 4.1 suffer from the fact that, under the hypothesis of conditional independence, the asymptotic distribution of the empirical criterion is not uniquely determined and has to be estimated from the sample. As in the case of CMI, the asymptotic distribution can be either normal (when the corresponding population term is non-zero for at least one conditioning subset) or coincide with the distribution of a quadratic form. Which type of asymptotic distribution is valid can be decided using the resampling schemes briefly discussed in Section 7. The chosen distribution can be used as a benchmark to test the conditional independence hypothesis (see, e.g., [
29,
55]). Alternatively, testing of
can be replaced by testing
for any
at each stage of forward feature selection procedure and the benchmark distribution can be obtained by approximating eigenvalues of
M specified in (ii) (see [
54]) or using scaled and shifted
squared distribution
, where
are estimated from the sample [
56]. Note that, although
and
are not equivalent, in the case of faithful distributions (see Section 9.2 for the definition of faithfulness) the former implies the latter, as conditional independence is inherited by supersets of the conditioning set in such a case.
7. Resampling Schemes
When a conditional independence test is used for which, as is very common, the exact or even the asymptotic distribution of the test statistic under conditional independence is not known, the usual practice is to use conditional randomisation (CR) or resampling schemes to approximate this distribution and to use the resulting approximation as the benchmark distribution, as described at the beginning of Section 6. Analogously as before, comparison of the value of the statistic based on the observed sample with the benchmark distribution is used to calculate the CR or resampling p-value. This approach can also be used as an alternative way of assessing the distribution of the statistic even when its approximate distribution is known. We briefly discuss these procedures, first for the discrete case, then for the continuous one.
Conditional Randomisation (CR). In the case of the CR approach, we assume that the probability mass function (or density) of X given Z = z, denoted p(·|z), is known. Then, given a sample (X_i, Y_i, Z_i), i = 1, …, n, one generates a CR sample (X*_i, Y_i, Z_i), i = 1, …, n, where the X*_i are independently sampled from the p.m.f. or density p(·|Z_i) for i = 1, …, n. Let T be a test statistic which we would like to use for testing CI, and consider M generated CR samples. We define the permutation p-value as
p̂ = ( 1 + \sum_{m=1}^{M} 1{ T_m ≥ T_0 } ) / ( M + 1 ),   (32)
where T_0 is the value of the statistic on the original sample and T_m its value on the m-th CR sample. It follows that p̂ is a valid p-value, i.e., conditionally on the observed values of (Y,Z), its distribution is uniform on the set { 1/(M+1), 2/(M+1), …, 1 } under CI and, thus, P( p̂ ≤ α ) ≤ α for every α. This follows from the following simple fact: suppose that CI holds; then, given the observed (Y_i, Z_i), i = 1, …, n, we have the equality in distribution T_0 =_d T_m (see, e.g., [57]). For recent improvements of the CR method, in which the resampled configurations are chosen in an appropriate structured way instead of purely at random, see [58]. In the case when p(·|z) is unknown, CR sampling is replaced by the Conditional Permutation or Bootstrap X methods.
Conditional permutation (CP). The conditional permutation method is similar to the CR method and differs in that, for each value z taken by the elements of (Z_i), we consider the stratum of the sample corresponding to this value, namely { i : Z_i = z }. The CP sample is obtained from the original sample by replacing X_i, for i in this stratum, by X_{π(i)}, where π is a randomly chosen permutation of the stratum's indices. Thus, on every stratum we randomly permute the values of the corresponding X independently of the values of Y. Once M CP samples are obtained independently in this fashion, we calculate the p-value based on them analogously as before.
Conditional Bootstrap X (CB.X). Instead of permuting the values of X on each stratum, we draw a bootstrap sample from the X observations, that is, we sample them with replacement as many times as the size of the stratum. The remaining steps are as previously described.
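The following sketch (our illustration, reusing cmi() from the greedy-selection sketch as the test statistic) implements the CP scheme for a discrete Z: within every stratum {Z = z} the X values are permuted independently of Y, and the p-value is computed as in (32).

```python
import numpy as np

def cp_pvalue(x, y, z, M=199, seed=0):
    """Conditional permutation p-value for the CI hypothesis, using cmi() as the statistic."""
    rng = np.random.default_rng(seed)
    x, y, z = map(np.asarray, (x, y, z))
    Z = z.reshape(len(z), -1)
    t_obs = cmi(x, y, Z)
    exceed = 0
    for _ in range(M):
        x_perm = x.copy()
        for zv in np.unique(Z, axis=0):
            idx = np.where((Z == zv).all(axis=1))[0]
            x_perm[idx] = x[rng.permutation(idx)]    # permute X within the stratum Z = zv
        exceed += cmi(x_perm, y, Z) >= t_obs
    return (1 + exceed) / (M + 1)

rng = np.random.default_rng(5)
z = rng.integers(0, 3, 2_000)
x = (z + rng.integers(0, 2, 2_000)) % 3
y = (z + rng.integers(0, 2, 2_000)) % 3              # X and Y are conditionally independent given Z
print(cp_pvalue(x, y, z))                            # p-value roughly uniform, typically not small
```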
We discuss now the resampling for continuous predictors. Feature selection for this case is covered briefly in Section 9. We remark here that the aim, the methodology of solutions, and the criteria considered are very similar; the main differences consist in the technical problems of adequate estimation of information-theoretic measures in the continuous case.
In the case of Z being continuous, CR methods work as stated; however, the situation is more complicated for the CP and CB.X methods, as those depend on transforming the strata of the sample, which degenerate to single observation points in this case. One possible solution is to sample from some estimate of p(·|z), e.g., a kernel estimate; however, this requires a large number of observations for each value of z. The following proposal of constructing pseudo-samples, whose distribution is close to that of the original sample under CI, has been suggested by [59] and consists in using observations having close values of Z. More specifically, consider an observation (X_i, Y_i, Z_i) and find the nearest observation to Z_i in the Z space, say Z_j, with the corresponding triple (X_j, Y_j, Z_j). Then, the observation (X_i, Y_i, Z_i) is replaced in the resampling sample by (X_j, Y_i, Z_i).
Another technique for data augmentation, applicable in both the discrete and continuous cases, is the knock-off construction, which we now describe. Consider the regression problem involving the vector X = (X_1, …, X_p) of features and the response Y. The vector X̃ = (X̃_1, …, X̃_p) is the knock-off vector for X in this regression problem if X̃ is independent of Y given X and, moreover, for any S ⊆ {1, …, p} we have the following equality in distribution [57]:
(X, X̃)_{swap(S)} =_d (X, X̃),
where swap(S) means swapping the entries X_i and X̃_i for any i ∈ S. Thus, in the above sense, the X̃_i are interchangeable with the X_i and independent of Y at the same time.
The construction of knock-offs is complicated in general, feasible as of now only in special cases, such as the Gaussian case, and requires knowledge of the distribution of X. Thus, even more information is needed than in the case of the CR approach. However, the gains are considerable, as for natural classes of statistics measuring the influence of predictors on the response one can compare the performance of the X_i with that of their knock-offs and, on this basis, decide which ones are not strongly relevant. Intuitively, when the performance of X_i, measured, e.g., by the absolute value of an estimated regression coefficient, is comparable to that of its knock-off, then such a variable should be discarded. This approach yields a bound on the FDR of strongly relevant features (Theorem 3 in [26]). Thus, in the cases discussed in Section 3, when the set of strongly relevant features is exactly equal to MB(Y), this approach yields a bound on the FDR of its recovery. We stress that, in this way, one can analyse the properties of the whole procedure of MB determination, and not only those of the individual steps in the algorithm. An additional advantage is that, unlike for the resampling schemes above, only one sample of knock-offs needs to be generated.
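For intuition on how knock-offs can be generated in the Gaussian case, the following sketch shows the standard second-order (equi-correlated) model-X construction for X ~ N(μ, Σ) with known μ and Σ. This is a generic recipe from the knock-off literature, given here as our illustration; it is not necessarily the exact construction used in the works cited above.

```python
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, seed=0):
    """Equi-correlated Gaussian model-X knock-offs: sample X_tilde | X from the conditional normal."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # equi-correlated choice of s, shrunk slightly to keep the conditional covariance positive definite
    s_val = 0.999 * min(2.0 * np.linalg.eigvalsh(Sigma).min(), Sigma.diagonal().min())
    D = np.diag(np.full(p, s_val))
    Sigma_inv_D = np.linalg.solve(Sigma, D)                 # Sigma^{-1} D
    cond_mean = mu + (X - mu) @ (np.eye(p) - Sigma_inv_D)   # E[X_tilde | X]
    cond_cov = 2.0 * D - D @ Sigma_inv_D                    # Cov[X_tilde | X]
    L = np.linalg.cholesky(cond_cov)
    return cond_mean + rng.standard_normal((n, p)) @ L.T

# toy usage with an AR(1)-type covariance structure
rng = np.random.default_rng(6)
p = 5
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=1_000)
X_tilde = gaussian_knockoffs(X, np.zeros(p), Sigma)
print(np.corrcoef(X[:, 0], X_tilde[:, 0])[0, 1])            # correlated with, but not equal to, X_0
```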
8. Markov Blanket Discovery Algorithms
We discuss now Markov Blanket discovery algorithms. Their aim is to solve a feature selection problem posed in
Section 3 (see Equation (11)) applying
or
approximations introduced in
Section 4 for which theoretical guarantees are discussed in
Section 5. Such algorithms are used as building blocks for conditional independence tests, based either on asymptotic distributions of test statistics discussed in
Section 6 or approximations of their distributions based on resampling covered in
Section 7.
We discuss three representative examples of Markov Blanket discovery algorithms. Other examples include [
60] and algorithms using an assumed faithful Directed Acyclic Graph (DAG) representation of the underlying probability structure (for DAG faithfulness of a probability distribution based on d-separation, see, e.g., [4]), such as HITON-MB [
61], MMMB [
62], IPC-MB [
63], and STMB [
64].
The GS (Grow and Shrink) algorithm [65]. It consists of two phases. A specific ordering of the variables in F is considered, e.g., the variables may be ordered according to the value of I(Y;X_i) and, then, the variable having the largest value of the mutual information is the first variable chosen. Denoting by S the current set of chosen variables, we pick the first variable (in the considered ordering) which depends on Y given X_S (i.e., I(Y;X_i|X_S) > 0) and we add it to S, then repeat the step. When there is no longer such a variable among the candidates, the first phase is terminated. We denote the resulting chosen set by S. In the second phase, we remove from S, again using the considered ordering, any variable X_i which is not strongly relevant with respect to the current set, i.e., such that I(Y; X_i | X_{S∖{i}}) = 0; the set remaining after this phase is the output of the algorithm.
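A compact sketch of the two phases is given below (our simplified illustration: the exact CI checks are replaced by comparing plug-in CMI estimates with a small threshold, whereas in practice the tests from Section 6 or Section 7 would be used; cmi() is reused from the greedy-selection sketch).

```python
import numpy as np

def grow_shrink(X, y, eps=1e-2):
    """Grow-and-Shrink Markov Blanket discovery with plug-in CMI and a fixed threshold eps."""
    n, p = X.shape
    empty = np.empty((n, 0))
    order = np.argsort([-cmi(X[:, j], y, empty) for j in range(p)])   # order by I(Y;X_j), descending
    S = []
    # growing phase: repeatedly add the first variable (in the ordering) still dependent on Y given X_S
    changed = True
    while changed:
        changed = False
        for j in order:
            if j not in S and cmi(X[:, j], y, X[:, S] if S else empty) > eps:
                S.append(int(j))
                changed = True
                break
    # shrinking phase: remove variables that are not strongly relevant w.r.t. the currently chosen set
    for j in list(S):
        rest = [k for k in S if k != j]
        if cmi(X[:, j], y, X[:, rest] if rest else empty) <= eps:
            S.remove(j)
    return S

# toy data: only features 0 and 1 influence Y; features 2 and 3 are noise
rng = np.random.default_rng(7)
X = rng.integers(0, 2, size=(20_000, 4))
y = X[:, 0] + 2 * X[:, 1]
print(grow_shrink(X, y))             # typically [0, 1] or [1, 0]
```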
The IAMB (Incremental Association Markov Blanket) [
66]. The algorithm is similar to GS with one important difference in the first step. Namely, it disregards initial ordering and
S is augmented by the most plausible candidate, that is the variable realising
, provided it is not conditionally independent from
Y given
([
27], where
.
A further variant differs from GS in the growing phase only. Namely, instead of individual variables with indices in F∖S being considered as possible candidates, all subsets T of size not exceeding m are taken into account, and a check is performed whether they are conditionally dependent on Y given the current S. If this holds, S is augmented by T.
It is proved in [
27] that
algorithm yields the
m-Markov Blanket (see
Section 3 for definition of
m-Markov Blanket) of
Y. Thus, for
we have that the output of
is
. However, since, in the growing phase, all subsets of
of size not exceeding
m have to be checked, which is computationally intensive, in practice only k subsets, for k large enough, are checked at every step.
Note that for such results we assume that condition
or
can be verified. In practice, it is impossible to check conditional independence of
X and
Y given the set of variables already chosen without error and this has to be replaced with the appropriate test discussed in
Section 6. Obviously, as such a test has to be performed at each step, one does not control probability of including false discoveries. False discoveries are dealt with in the second phase, but no formal results exist concerning how likely recovering
is in the case of practical algorithm. However, see [
67] for the results concerning the PC algorithm in the Gaussian case and knock-off construction discussed in
Section 7.
Another possibility is to omit phase two and to stop phase one early, applying stopping rules devised for multiple testing problems to control the Family-Wise Error Rate or the False Discovery Rate (see, e.g., [68], where a stopping rule using the Holm procedure has been applied to the CIFE criterion). These methods work promisingly in practice; however, due to the fact that the set S augmented at each step is data-dependent, formal results on the consistency of such procedures are not available in this case either.
There is no definitive study of the empirical performance of the presented criteria and/or selection procedures; note that any criterion considered can be used with any feature selection method described above, thus creating a large number of filter methods. Moreover, those can be evaluated using their classification performance, as well as, for synthetic datasets, according to their ability to choose active predictors. Thus, we discuss only the reports on the simplest greedy procedures consisting in the application of the discussed criteria on synthetic datasets with a number of chosen features fixed in advance. Authors of [24] discuss, among other things, results for datasets coming from the NIPS Feature Selection Data Challenge, Gisette, and Madelon (procedures stopped at 200 variables in the first case and 20 in the second). Two interesting points arise from the study: the strong performance of
, which was the second best and co-winner in those cases with respect to balanced accuracy (BA) and the number of features chosen, along with the strong performance of
(winner in the case of Gisette and together with
co-winner in the case of Madelon). The overall strong performance of
was also confirmed in [
68,
69]. The second important observation is the poor performance of CMI, which, due to scarcity of data, fails to detect conditionally dependent variables very early on. This is because, for a large conditioning set, the pertaining test becomes very conservative (see also [
70] for discussion of this property).
10. Interaction Detection
Detection of existing interactions between features in influencing the outcome
Y is an important task of data analysis; primary examples being gene–gene interactions in Genome-Wide Association Studies (GWAS) and gene–environment interactions. In this section, we show how information-theoretic measures introduced before can be applied to solve this problem. Moreover, we indicate that interaction detection problems may be viewed as a special case of feature selection. Various indices have been proposed to measure the strength of interactions, one of the most popular tools being interaction information discussed in
Section 2 and
Section 4.1. There it was considered as a tool to define new selection criteria by truncating and possibly weighing summands in the Möbius expansion of
. Here,
is adopted as a main tool in detection of interactions. Its most popular competitor is the value of Likelihood Ratio Test statistics for two nested logistic models: an additive model with no interactions and a saturated model taking all possible interactions into account. However, it has been shown in [
68] that for the logistic model, when the predictors are independent and at least one of the interaction terms is non-zero, then the interaction information is non-zero, but not vice versa. This shows that there are situations when interactions are present and are detected by II but would go undetected using the logistic model approach. II is applied as the main tool to find interactions in the AMBIENCE package [79], and the BOOST package uses the so-called Kirkwood approximation discussed below, which is closely related to II [80].
include Multifactor Dimensionality Reduction (MDR) [
81] and Restricted Partitioning Method (RPM) [
82].
Now, we will discuss two additional properties of Interaction Information which are useful in interaction analysis, focusing on its three-way variant applied to detect the synergistic effect of two variables, X_1 and X_2, say, in determining the target Y.
From the first equality in (5) it follows that when both variables are jointly independent of Y, i.e., I(Y;(X_1,X_2)) = 0, then II(X_1;X_2;Y) = 0, as in this case I(Y;X_1|X_2) = I(Y;X_1) = 0. Although the converse is not true, it fails only in adversarially constructed examples and, thus, testing II(X_1;X_2;Y) = 0 is usually replaced by testing the (more stringent) hypothesis I(Y;(X_1,X_2)) = 0.
The other property is the insightful representation
II(X_1;X_2;Y) = \sum_{x_1,x_2,y} p(x_1,x_2,y) \log \frac{p(x_1,x_2,y)}{p_K(x_1,x_2,y)},
where
p_K(x_1,x_2,y) = \frac{p(x_1,x_2) p(x_1,y) p(x_2,y)}{p(x_1) p(x_2) p(y)}
is the Kirkwood Superposition Approximation, a positive, but not necessarily summing up to 1, mass function supported on the values (x_1,x_2,y) of (X_1,X_2,Y). It can be shown that, for η denoting the summary mass of p_K, when II(X_1;X_2;Y) = 0 and η ≤ 1, then p is equal to its Kirkwood approximation [68].
Let us discuss two commonly used methods of testing that II(X_1;X_2;Y) = 0. The most popular one is based on Han's approximation [20], derived under the stringent assumption of overall independence, which results in considering as the benchmark distribution the chi-squared distribution with (I-1)(J-1)(K-1) degrees of freedom, where I, J, and K are the numbers of values of X_1, X_2, and Y, respectively. This, however, may lead to a large number of false signals when the pertaining test is employed, as overall independence is only a very special situation in which II vanishes. It is shown in [69] that when (X_1, X_2) is independent of Y, similarly to other measures of conditional dependence discussed in Section 4.1, the approximate distribution is a weighted chi-square distribution. Its complete description has been given for X_1 and X_2 having three values each and Y being binary, which includes the typical GWAS situation of two loci with two co-dominant alleles. The properties of the corresponding test have not been investigated yet.
The other method uses permuting the Y values of the considered sample, and in this way obtaining M random samples satisfying the hypothesis of independence of Y and (X_1, X_2). Then, the permutation p-value can be calculated as in (32), with the test statistic recomputed on the permuted samples. One can also use the chi-square distribution, with the number of degrees of freedom estimated from the permuted samples, as the benchmark distribution, analogously to [70]. The advantage of the latter approach is that the number of permuted samples M may be much smaller than in the case of the permutation test.
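The permutation approach can be sketched as follows (our illustration, reusing interaction_information() from the earlier sketch): Y is permuted M times, which enforces independence of Y from (X_1, X_2) and hence a null value of the interaction information, and the p-value is computed as in (32).

```python
import numpy as np

def ii_permutation_pvalue(x1, x2, y, M=199, seed=0):
    """Permutation p-value for the hypothesis II(X1;X2;Y) = 0 (two-sided, |II| as the statistic)."""
    rng = np.random.default_rng(seed)
    x1, x2, y = map(np.asarray, (x1, x2, y))
    t_obs = abs(interaction_information(x1, x2, y))
    exceed = sum(abs(interaction_information(x1, x2, rng.permutation(y))) >= t_obs
                 for _ in range(M))
    return (1 + exceed) / (M + 1)

rng = np.random.default_rng(8)
x1 = rng.integers(0, 2, 5_000)
x2 = rng.integers(0, 2, 5_000)
y = x1 ^ x2                                            # purely synergistic signal
print(ii_permutation_pvalue(x1, x2, y))                # very small p-value: interaction detected
```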
The main challenge in detecting significant interactions among potential predictors is usually the computational burden of the procedure. Indeed, any method discussed should in principle include an interaction term for any pair of predictors. This requires enlarging the set F by hot-encoded interactions, i.e., the fitted model will contain p(p+1)/2 nominal predictors. This is practically infeasible; e.g., for 100 Single Nucleotide Polymorphisms (SNPs) being our predictors, this would yield a number of needed tests of the order of 5 × 10^3 at each step of the selection procedure. The frequently adopted approach is to screen the set of variables first and then apply a more sophisticated procedure to the chosen variables only.
Consider two other possible solutions. The first one uses the premise that significant interactions may arise only among variables which are themselves significant. Thus, any of the methods described in Section 8 can be used for the predictors in F, and then the two methods described above are applicable to all interactions of the chosen features, using the Bonferroni adjustment for multiple testing or the Holm method [83]. The problematic aspect of this approach is the exclusion of possible interactions between two features when only one of them has a non-negligible main effect.
The other group of methods, for which BOOST proposed in [80] is a representative example, screens interactions by a modified version of the LRT test, which consists in replacing the likelihood of a fitted additive model by the likelihood of the normalised Kirkwood approximation, which can be computed very quickly. The pairs for which the value of the LRT statistic modified in this way exceeds a certain threshold are retained, and an exact LRT test is then performed for all such remaining pairs. The weak side of this approach is the choice of the threshold in the first stage, which needs to be set by a rule of thumb, as the properties of the modified LRT test remain unknown.
11. Conclusions
This paper reviews the main ideas concerning feature selection using information-theoretic tools, which revolve around detecting conditional independence and measuring the strength of conditional dependence. As estimation of CMI, which is a natural and powerful tool for this endeavour, encounters problems when the dimensionality of the task is large, a natural way out is to look for its good substitutes. Several ways in which such substitutes are constructed are discussed, and the current knowledge about their properties is presented. It is argued that selection criteria based on truncation of the Möbius decomposition make quite far-reaching compromises concerning the dependence structure of the feature sets, whereas the properties of approximations based on variational approaches are yet to be established. Additionally, major Markov Blanket discovery algorithms are constructed under the assumption that conditional independence or its absence can be established without error, when, in reality, an intrinsic feature of any practical CI test applied at each stage is its fallibility. Therefore, further efforts, both theoretical and experimental, are needed to understand the advantages, drawbacks, and interrelations of the up-to-date developments. The paper presents some recently established tools which can be used for this purpose, such as asymptotic distributions of
-related criteria discussed in
Section 6. They are used to construct tests of conditional independence with approximate control of probability of false signals. The problem of existence of interactions can be similarly approached using results in [
69].
Another challenging task is to extend the general approaches discussed here, such as the construction of feature selection criteria by truncation of the Möbius expansion or the variational approach, to multilabel classification, when Y is a multivariate binary vector. Several criteria, such as
or
, have been generalised to this case already (cf. [
84,
85], respectively, see also [
86] for an approach based on Han’s inequality and [
87] for the review), but the general approach, e.g., in the spirit of [
24], is still missing. Another important challenge here seems to be the construction of feature selection criteria which efficiently take into account the dependence between the coordinates of the response.