The gold standard for causal inference is experimentation. Deliberately changing one variable while keeping all other variables constant tests for three necessary conditions of a causal association: temporal precedence of the cause over the effect, the existence of a physical influence, and finally, the distinction between an apparent direct association and a “real” direct association [1]. When experiments, or interventions, are not possible, other methods are needed to test whether the aforementioned conditions are met. An important class of methods are those that use graphical models reflecting the statistical (conditional) independence relations resulting from observational data [2,3]. If the data are causally sufficient, i.e., there are no unobserved variables, causal associations can be inferred if the Markov property is applicable [2]. The vertices in the related graph represent the variables; the directed edges represent the existence and directionality of the causal associations. For a graph to be causal, the faithfulness assumption [2] must be satisfied. This assumption entails that the edges in a probabilistic graphical model reflect dependencies in the data that do not result as a consequence of other pathways. For example, in a system comprising three variables, independence of $X$ and $Z$ given $Y$, denoted as $X \perp Z \mid Y$, implies that the causal structure is a cascade. The causal skeleton, the causal graph with undirected edges, equals $X - Y - Z$. When testing for conditional independence, a multivariate approach is therefore needed; it is deemed impossible to infer a causal skeleton using bivariate measures. An issue with multivariate approaches is the “curse of dimensionality” [4].
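To make this concrete, the following minimal numerical sketch (in Python; the noise levels and sample size are illustrative choices of ours, not taken from the cited references) simulates a noisy binary cascade $X \rightarrow Y \rightarrow Z$ and shows that the bivariate mutual information $I(X;Z)$ is clearly positive, whereas the conditional mutual information $I(X;Z \mid Y)$ is approximately zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a noisy binary cascade X -> Y -> Z (10% bit-flip noise per link;
# these numbers are illustrative assumptions).
n = 100_000
x = rng.integers(0, 2, n)
y = np.where(rng.random(n) < 0.9, x, 1 - x)
z = np.where(rng.random(n) < 0.9, y, 1 - y)

def mi(a, b):
    """Plug-in estimate of the mutual information I(A;B) in bits."""
    joint = np.histogram2d(a, b, bins=2)[0] / len(a)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask]))

def cmi(a, b, c):
    """I(A;B|C) = sum_c P(C=c) I(A;B|C=c) for a binary conditioning variable."""
    return sum((c == v).mean() * mi(a[c == v], b[c == v]) for v in (0, 1))

print(f"I(X;Z)     = {mi(x, z):.3f} bits")     # clearly positive
print(f"I(X;Z | Y) = {cmi(x, z, y):.3f} bits")  # approximately zero
```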
In this letter, we summarize our work that potentially minimizes the impact of the curse of dimensionality and enables us, in some cases, to use bivariate analysis to discover the causal skeleton for multivariate systems. Our approach is based on information theory [5]. An implicit assumption of this theory of communication is the existence of a mechanism that enables repeatability in communication, i.e., information theory could be used to infer causality. This idea is not new; see for example [6,7]. Mutual information, the information theoretic measure of association between random variables [8], arises from data transmission over a noisy discrete memoryless communication channel, or channel in short. In our approach, an edge is modeled as a channel. If the maximal amount of information that channel can transfer, the so-called channel capacity [8], equals zero, no direct causal relation can exist between the input and output of the channel, and the edge is not shown in the graph. Using an additional measure of association, path-based mutual information, or path information in short, we show that for a system comprising three variables, pair-wise determined measures can differentiate between direct and indirect associations. Because of the symmetry of mutual information, directionality cannot follow from information theory, i.e., we can only use it to discover the causal skeleton. If one so wishes, one could use the fact that dependence between otherwise independent causes is induced when conditioning on the collider of a v-structure, e.g., $X \not\perp Z \mid Y$ for the v-structure $X \rightarrow Y \leftarrow Z$. See for example the “Fast Causal Inference” (FCI) method described in [2]. Here we focus on the discovery of the causal skeleton. The assumptions underlying our approach are the same assumptions used in information theory: stationarity and ergodicity of the data [5]. We furthermore assume that the systems under consideration are noisy.
A foundational aspect of our approach is the earlier mentioned discrete memoryless channel. This channel transforms the probability mass function of the input data into the probability mass function of the output data via a linear transformation. In a memoryless channel, the output solely depends on its input, i.e., it encodes the Markov property. The linear map is the channel specific transition probability matrix [8]. The data are represented as discrete random variables and denoted with uppercase letters, e.g., $X$ and $Y$. The related realizations are represented by the lowercase letters, i.e., $x$ and $y$. Assuming a fixed, e.g., a lexicographical, ordering of the related sample spaces or alphabets, a one-to-one relationship exists between the realizations $x$ and $y$ and their positions in the respective alphabets. When using contra-variant and covariant notation, i.e., superscript and subscript notation, it is immediately clear from the equation whether to interpret $x$ and $y$ as indices or as realizations. With $p^x$ the probability that $X = x$, $q^y$ the probability $P(Y = y)$, and $\mathcal{A}^y_x$ the transition probability $P(Y = y \mid X = x)$, the channel transforms an input probability mass function into an output probability mass function according to

$$q^y = \sum_x \mathcal{A}^y_x\, p^x. \qquad (1)$$

As of now, we will use the Einstein summation convention, i.e., summation is implied by indices that appear both as contra-variant and covariant indices. Equation (1) can therefore be written as

$$q^y = \mathcal{A}^y_x\, p^x. \qquad (2)$$

When transforming the input and/or output alphabets via a linear transformation, the transition probability matrix, represented using the contra-variant and covariant notation, transforms as a proper $(1,1)$ tensor [9]. We denote a tensor with calligraphic font, e.g., $\mathcal{A}$.
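A minimal numerical illustration of Equation (2) follows (the transition matrix and input distribution are arbitrary choices of ours): the channel maps any valid input probability mass function to a valid output probability mass function.

```python
import numpy as np

# Hypothetical channel with a 3-letter input and a 2-letter output alphabet:
# A[y, x] = P(Y = y | X = x), so every column sums to one.
A = np.array([[0.9, 0.2, 0.5],
              [0.1, 0.8, 0.5]])
p = np.array([0.5, 0.3, 0.2])  # input pmf p^x

q = A @ p                      # q^y = A^y_x p^x, Equation (2)
print(q, q.sum())              # non-negative and sums to one
```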
Let us now consider a cascade comprising three variables, the chain $X \rightarrow Y \rightarrow Z$. As we know, the mutual information for any pair of random variables, e.g., $X$ and $Z$, equals

$$I(X;Z) = \sum_{x,z} p(x,z)\,\log\frac{p(x,z)}{p(x)\,p(z)}.$$

Because the mediator $Y$ is not an explicit variable in this equation, it is unknown how $Y$ influences $I(X;Z)$ in this chain. The resulting uncertainty is reflected in the data processing inequality, $I(X;Z) \leq \min\{I(X;Y),\, I(Y;Z)\}$, where $I(X;Y)$ and $I(Y;Z)$ are the mutual informations of the constituting direct associations of the chain. There is equality if one or more associations are due to a noiseless channel [8]. With observational data more information is available, allowing us to be more specific, because each association is represented by a tensor, and these tensors contain the transition probabilities that can be determined from the data. We now make a small sidestep and introduce some new terminology and notation. The tensor representing the association between the beginning and the end of this chain is referred to as “the tensor of the path $X \rightarrow Y \rightarrow Z$”. A path with one or more mediators is called an indirect path, while a path without mediators is a so-called direct path. Using the tensors of the constituting direct paths $X \rightarrow Y$ and $Y \rightarrow Z$, denoted as $\mathcal{A}^y_x$ and $\mathcal{B}^z_y$ respectively, and the associated expressions for the linear transformations analogous to Equation (1), it follows that for the chain:

$$\mathcal{C}^z_x = \mathcal{B}^z_y\, \mathcal{A}^y_x \qquad (3)$$

(see Appendix A). The mutual information resulting from transmission along the path therefore equals

$$I_{X \rightarrow Y \rightarrow Z}(X;Z) = \sum_{x,z} p_{X \rightarrow Y \rightarrow Z}(x,z)\,\log\frac{p_{X \rightarrow Y \rightarrow Z}(x,z)}{p(x)\,q_{X \rightarrow Y \rightarrow Z}(z)},$$

with $q_{X \rightarrow Y \rightarrow Z}(z) = \mathcal{C}^z_x\, p^x$ and $p_{X \rightarrow Y \rightarrow Z}(x,z) = \mathcal{C}^z_x\, p^x$ (no summation over $x$ implied). The index $X \rightarrow Y \rightarrow Z$ indicates that the probability distributions are associated with the indirect path $X \rightarrow Y \rightarrow Z$. The dynamics within the chain involving the mediator $Y$ is now captured by the matrix product $\mathcal{B}^z_y \mathcal{A}^y_x$. If the three variables in the chain $X \rightarrow Y \rightarrow Z$ are the only three variables in the system, then $I(X;Z) = I_{X \rightarrow Y \rightarrow Z}(X;Z)$. The direct association between $X$ and $Z$ is a consequence of the indirect path, so an edge between $X$ and $Z$ should be removed; in other words, the graph needs to be pruned. In Appendix B it is proven that the existence of an additional association between $X$ and $Z$, mediated by a fourth variable, does not violate the validity of Equation (3), but that the path information does not equal the mutual information: $I_{X \rightarrow Y \rightarrow Z}(X;Z) \neq I(X;Z)$. Instead of comparing the information measures, we use the tensors. Inequality of the path information and the mutual information implies that $\mathcal{D}^z_x$, the tensor of the direct path $X \rightarrow Z$, does not equal the tensor of the indirect path: $\mathcal{D}^z_x \neq \mathcal{C}^z_x$. Therefore, comparison of the tensors of the direct and indirect paths enables, under the assumption of causal sufficiency, differentiating between direct and indirect associations using pair-wise determined tensors for a system comprising three variables. It is of course possible that the tensor of the direct association equals the tensor of the indirect association purely by chance. Using the simplified probabilistic model from Appendix C, it can be shown however that this probability decreases with increasing cardinalities $n_X$ and $n_Z$ of the input and output alphabets, respectively.
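The pruning criterion based on Equation (3) can be made concrete with a small simulation (all transition probabilities below are our own toy choices, not values from the text): for a pure chain, the tensor of the direct association $X \rightarrow Z$, estimated from the samples, coincides with the matrix product of the constituting tensors up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy chain X -> Y -> Z with assumed binary transition tensors.
A = np.array([[0.8, 0.3],   # A[y, x] = P(Y = y | X = x)
              [0.2, 0.7]])
B = np.array([[0.6, 0.1],   # B[z, y] = P(Z = z | Y = y)
              [0.4, 0.9]])
C = B @ A                   # tensor of the indirect path X -> Y -> Z, Eq. (3)

# Sample the chain and estimate the tensor of the direct association X -> Z.
n = 200_000
x = rng.integers(0, 2, n)
y = (rng.random(n) < A[1, x]).astype(int)   # P(Y = 1 | X = x) = A[1, x]
z = (rng.random(n) < B[1, y]).astype(int)

counts = np.zeros((2, 2))
np.add.at(counts, (z, x), 1)
D = counts / counts.sum(axis=0)             # D[z, x] = estimated P(Z = z | X = x)

print(np.round(C, 3))   # indirect-path tensor
print(np.round(D, 3))   # equals C up to sampling noise, so the edge X - Z is pruned
```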
The approach is also applicable to forks. Consider for example the fork $Y \leftarrow X \rightarrow Z$. The indirect association between $Y$ and $Z$ is either a consequence of the path $Y \rightarrow X \rightarrow Z$, or the path $Z \rightarrow X \rightarrow Y$. Because Equation (1) expresses the law of total probability [10], a stochastic tensor always exists for a path, irrespective of the direction in which this path is traversed. We use the following naming convention: if the tensor of the path $X \rightarrow Y$ is denoted as $\mathcal{A}^y_x$, then the stochastic tensor associated with the path $Y \rightarrow X$ is denoted as $\tilde{\mathcal{A}}^x_y$, i.e., $\tilde{\mathcal{A}}^x_y = P(X = x \mid Y = y)$. The tensor $\tilde{\mathcal{A}}^x_y$ is not the inverse of $\mathcal{A}^y_x$, because the inverse of a stochastic tensor is in general not a stochastic tensor itself and therefore does not represent a channel. This naming convention allows us to write the linear transformations associated with the two paths as $q^z = \mathcal{B}^z_x \tilde{\mathcal{A}}^x_y\, p^y$ and $q^y = \mathcal{A}^y_x \tilde{\mathcal{B}}^x_z\, p^z$, respectively. Because a fork can be considered a chain, and the path information is independent of how the chain is traversed [11], $I_{Y \rightarrow X \rightarrow Z}(Y;Z) = I_{Z \rightarrow X \rightarrow Y}(Z;Y)$, where the index indicates that the probability distributions are associated with the path $Y \rightarrow X \rightarrow Z$; these are identical to the probability distributions associated with the path $Z \rightarrow X \rightarrow Y$. Like mutual information, path information cannot be used to infer directionality either.
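The reverse-path tensor follows from Bayes’ rule. The sketch below (with assumed numbers, in our notation) verifies that it is stochastic yet differs from the matrix inverse:

```python
import numpy as np

# Given A[y, x] = P(Y = y | X = x) and the input pmf p^x, the tensor of the
# reverse path Y -> X is  Atilde[x, y] = P(X = x | Y = y) = A[y, x] p[x] / q[y],
# with q = A p (Bayes' rule).
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
p = np.array([0.6, 0.4])
q = A @ p

Atilde = (A * p[None, :]).T / q[None, :]

print(Atilde.sum(axis=0))       # columns sum to one: a stochastic tensor
print(np.round(Atilde @ A, 3))  # not the identity: Atilde is not inverse(A)
```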
Extending the approach to systems comprising more than three variables is straightforward, because a mediator, say $\boldsymbol{Y}$, could also be a set of variables. If the tensor of the path $X \rightarrow \boldsymbol{Y} \rightarrow Z$ equals the tensor of the path $X \rightarrow Z$, the association between $X$ and $Z$ is indirect. The tensors of the paths from or towards $\boldsymbol{Y}$ are multivariate. However, a so-called multivariate pruning step is not needed if all but one of the indirect paths have channel capacities that do not differ significantly from zero. A sketch of how a set of mediators can be treated as a single composite variable is given below.
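The following minimal sketch (the construction and numbers are our own illustration) flattens a pair of mediators into one composite variable, after which its path tensor is estimated like any bivariate tensor:

```python
import numpy as np

# A set of mediators {Y1, Y2} is treated as one composite variable Y with a
# 4-letter alphabet by flattening the joint index.
def composite(y1, y2, n2=2):
    """Map the pair (y1, y2) to a single symbol in {0, ..., 2*n2 - 1}."""
    return y1 * n2 + y2

rng = np.random.default_rng(2)
n  = 50_000
x  = rng.integers(0, 2, n)
y1 = (rng.random(n) < np.where(x == 1, 0.8, 0.2)).astype(int)
y2 = (rng.random(n) < np.where(x == 1, 0.7, 0.3)).astype(int)
y  = composite(y1, y2)          # one mediator with a 4-letter alphabet

# The multivariate tensor of the path X -> {Y1, Y2} is then estimated exactly
# like any bivariate tensor, with columns conditioned on x.
counts = np.zeros((4, 2))
np.add.at(counts, (y, x), 1)
T = counts / counts.sum(axis=0)
print(np.round(T, 3))           # columns sum to one
```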
The concepts discussed allow us to discover the causal skeleton in three steps. First, using the observational data, an undirected graph is inferred in which each edge represents a channel capable of transferring information, i.e., only edges for which the channel capacities differ significantly from zero are shown. The channel capacity is calculated using the Blahut-Arimoto algorithm, which uses a transition probability matrix as an input [12]. In the next step, the resulting undirected graph is pruned using the bivariate tensors that were determined in the first step. The pruned graph is the input for the third step. In this step, the channel capacity of each indirect path is calculated for any start and end point connected by a direct path and two or more indirect paths. If two or more indirect paths have a channel capacity larger than zero, the multivariate pruning step is performed. The output of the third step is the final causal skeleton.
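For reference, a minimal sketch of the standard Blahut-Arimoto iteration [12], written here in Python for illustration:

```python
import numpy as np

def blahut_arimoto(W, tol=1e-9, max_iter=10_000):
    """Capacity (in bits) of a discrete memoryless channel with transition
    matrix W, where W[y, x] = P(Y = y | X = x) and columns sum to one."""
    n_y, n_x = W.shape
    p = np.full(n_x, 1.0 / n_x)              # start from a uniform input pmf
    for _ in range(max_iter):
        q = W @ p                            # output pmf induced by p
        # D[x] = KL( W[:, x] || q ), the divergence achieved by input symbol x
        ratio = np.divide(W, q[:, None], out=np.ones_like(W), where=W > 0)
        D = np.sum(np.where(W > 0, W * np.log2(ratio), 0.0), axis=0)
        p_new = p * np.exp2(D)
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    q = W @ p
    ratio = np.divide(W, q[:, None], out=np.ones_like(W), where=W > 0)
    D = np.sum(np.where(W > 0, W * np.log2(ratio), 0.0), axis=0)
    return float(p @ D)                      # capacity estimate at the optimum

# A binary symmetric channel with 10% crossover: capacity 1 - H(0.1) ~ 0.531 bits.
print(blahut_arimoto(np.array([[0.9, 0.1],
                               [0.1, 0.9]])))
```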
We applied our approach to a toy data set with a well-defined ground truth, the “LUng CAncer Simple set”, LUCAS0 [13]. The data in this set were generated artificially by causal Bayesian networks and comprise 12 binary variables, each containing 2000 samples. To infer the skeleton, the probability transition tensors and the 95% confidence intervals were determined for all pairs of variables. For the latter, Jeffreys interval estimation for a binomial proportion was used [14]. The calculations were performed on a 2010 Mac mini with a 2.4 GHz Intel Core 2 Duo processor and 8 GB RAM. The proposed framework was implemented in MATLAB R2018b. The processing time was 1.9 s.
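The interval estimation step can be sketched as follows (in Python; the count in the usage line is a hypothetical example, not a value from LUCAS0):

```python
from scipy.stats import beta

def jeffreys_interval(k, n, level=0.95):
    """Equal-tailed Jeffreys credible interval for a binomial proportion,
    based on the Beta(k + 1/2, n - k + 1/2) posterior."""
    a = (1.0 - level) / 2.0
    return (beta.ppf(a, k + 0.5, n - k + 0.5),
            beta.ppf(1.0 - a, k + 0.5, n - k + 0.5))

# Hypothetical example: a transition observed 130 times out of 2000 samples.
print(jeffreys_interval(130, 2000))   # interval around the point estimate 0.065
```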
In Figure 1, the result of the bivariate pruning step is depicted. After this pruning step, which results in the solid edges, the direct association between {Lung Cancer} and {Fatigue} is the only candidate for which a multivariate pruning step could be needed, because this is the only direct association that could arise as a result of two indirect paths. However, the channel capacity of the indirect path {Lung Cancer} $\rightarrow$ {Genetics} $\rightarrow$ {Attention Disorder} $\rightarrow$ {Fatigue} does not differ significantly from zero. Because the tensor of the direct association does not equal the resulting tensor of the second indirect path {Lung Cancer} $\rightarrow$ {Coughing} $\rightarrow$ {Fatigue} either, no multivariate pruning step is needed. The inferred causal skeleton should therefore be equal to the skeleton of the ground truth, which is indeed the case.
To conclude, with the addition of path information, causal skeletons are a consequence of information theoretic considerations. Because directionality can be inferred from the causal skeleton using aspects from the earlier mentioned FCI method, one could argue that information theory is inherently a theory of causation. It furthermore follows from the data processing inequality that in a noisy system the channel capacity of a cascade decreases with increasing path length; we speculate that for larger, noisy systems, the path-based method is therefore less sensitive to the curse of dimensionality, as illustrated by our example where bivariate analysis sufficed. This will be the subject of future research. It is interesting to note that our approach is also applicable to time series. For example, transfer entropy [15] is proven to result from transmission over a set of discrete memoryless channels with associated tensors [16]. In this case, directionality is also inferred.