1. Introduction
Philosophers have debated at length whether causality is a subject that should be treated probabilistically or deterministically. This resulted in the development of different inferential systems and views on reality. Pure logic dealt with inferences about deterministic truths [1,2]. Probabilistic reasoning has been developed to allow for uncertainty in inferences about deterministic truths [3,4], to make inferences about probabilistic truths [5,6], or to imply the existence of associated deterministic truths [7,8,9,10,11]. Probabilistic theories about causality were developed throughout the 20th century, with notable contributions by Reichenbach, Good, and Suppes [12]. At the same time, however, the classical model of physics maintained its position as a role model for other sciences, which led researchers, including those concerned with human behavior and economic systems, to reject ideas about probabilistic causation, often opting instead to reason probabilistically about deterministic truths.
In modern physics, the standard equations of quantum mechanics suggest that reality is, in fact, better described by probability laws [
13]. The outcome of the Bohr–Einstein debates settled on the assertion that these probability laws are a result of a real indeterminacy and that reality itself is probabilistic (One may also argue that this is simply a correct
exposition of the theory and not necessarily of the physical world, as more complete theories may yet be discovered). Ref. [
14] provides an alternative interpretation of quantum physics in which the probability laws are statistical results of the development of completely determined, but hidden, variables. At a macroscopic level, deterministic laws and contingencies induce associated probabilistic laws (Contingencies is a term used by Ref. [
14] to refer to independent factors that may exist outside the scope of what is treated by the laws under consideration, and which do not follow necessarily from anything that may be specified under the context of these laws). In particular, by broadening the context of the processes under consideration, new laws that govern some of the contingencies can be found. This inevitably leads to new contingencies: a process that repeats indefinitely. For this reason, any theory about reality that embraces either deterministic law or chance, to the exclusion of the other, is inherently incomplete. Regardless of one’s position on real indeterminism, it holds, according to this logic, that any natural process that arises deterministically must also satisfy statistical laws that are more general, and so any complete theory about interesting real-world phenomena must be probabilistic.
In a probabilistic view of reality, cause and consequence are related by probability laws rather than laws of logical truths. A theory about probabilistic causality can, therefore, be stated in terms of the properties of the true measure that describes a process stochastically. The theory of causation developed here is that a causal relationship exists if there exists a true probability measure that produces a non-empty stochastic sequence that describes the directly caused effects from perturbations in one variable in terms of the responses in another. The paper shows that ideas about causality, including the direction, statistical significance, and economic relevance of effects, may be tested by formulating a statistical model that correctly describes observed data and evaluating its dynamic properties. In practice, this means that the inference is conducted with a best approximation of the true probability measure. It is the position of the paper that, in order to demonstrate that causality runs from a potential causal variable to the target variable, one must develop a best approximation of the true probability measure that uses the potential causal variable and a best approximation that excludes it. The analysis should then (1) conclude whether the first modeled measure is closer to the true measure, and (2) test that the two modeled measures are not equivalent. Practical routines to do so are discussed, and an example is provided using random forest (RF) regressions and daily data on yield spreads. The application tests how uncertainty around short- and long-term inflation expectations interacts with spreads in the daily Bitcoin price, a digital asset with a predetermined finite supply that has been characterized as a new potential inflation hedge. The results are contrasted with those obtained with standard linear Granger causality tests. It is shown that the suggested approaches not only lead to better predictive models, but also to more plausible, parsimonious descriptions of possible causal flows.
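To make this two-model procedure concrete, the following minimal sketch (in Python, with synthetic data standing in for the daily Bitcoin and inflation-uncertainty series of the application, and with column names that are purely illustrative) fits a random forest with and without the candidate causal variable, compares out-of-sample losses to judge which modeled measure is closer to the truth, and applies a Kolmogorov–Smirnov-type test to check that the two modeled measures are not equivalent; the paper's actual estimation routine may differ in its details.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import ks_2samp

# Synthetic stand-in for the daily data of the application: 'driver' plays the role
# of an inflation-uncertainty measure and 'target' that of the Bitcoin spread.
rng = np.random.default_rng(0)
n = 1000
driver = rng.normal(size=n)
target = np.zeros(n)
for t in range(1, n):
    target[t] = 0.5 * target[t - 1] + 0.4 * driver[t - 1] + rng.normal(scale=0.5)

df = pd.DataFrame({'target': target, 'driver': driver})
df = df.assign(target_l1=df['target'].shift(1), driver_l1=df['driver'].shift(1)).dropna()

y = df['target'].to_numpy()
X_restricted = df[['target_l1']].to_numpy()                  # own memory only
X_unrestricted = df[['target_l1', 'driver_l1']].to_numpy()   # memory plus candidate cause
cut = int(0.8 * len(y))                                      # time-ordered train/test split

def oos_errors(X):
    """Fit on the first 80% of the sample and return out-of-sample prediction errors."""
    model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X[:cut], y[:cut])
    return y[cut:] - model.predict(X[cut:])

e0, e1 = oos_errors(X_restricted), oos_errors(X_unrestricted)

# (1) Which approximation is closer to the true measure (lower out-of-sample loss)?
print('MSE without driver:', np.mean(e0 ** 2))
print('MSE with driver   :', np.mean(e1 ** 2))
# (2) Are the two modeled measures distinguishable? Compare the two out-of-sample
# error distributions with a Kolmogorov-Smirnov-type test.
print(ks_2samp(e0, e1))
```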
The focus on approximating a correct stochastic representation of the DGP (data generating process) as a means of learning about true causal linkages is different from the approaches that try to simulate laboratory conditions by testing for statistical differences in control groups, such as those described by [15,16]. The focus on obtaining a correct functional representation of the data is also different from attributing the presence of causal relationships directly to the values of parameters representing averages in treatment groups; see for instance [17,18,19] on this approach. Placing emphasis on the need for accurate statistical models for the full data distribution when conducting causal analysis introduces an obvious weakness: it is generally accepted that all empirical models are mis-specified to some degree and are likely never correctly specified. The true process, after all, is unknown in practice; this is the reason to conduct analyses in the first place. The aim to develop correct models can therefore be seen as an idealistic one that is difficult to put into practice. However, it is still valuable to understand the role of the correct-specification assumption in causal analysis. It is commonly taught that mis-specification leads to residual dependencies that violate the assumptions made by general central limit theorems needed to obtain correct standard errors; see for example chapter 2 in [20]. However, more general estimation theory for dependent processes, such as that developed and discussed by [21,22,23,24,25], may help correct standard-error estimation, but it does not remedy the issue that the structural response of the model is incorrect [26]. Such theories correct the variance estimator when the underlying model is wrong; they do not make the model’s structural response a correct description of the data.
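As a hedged illustration of this point (a simulated example of my own, not the paper's application), the sketch below fits a linear model to a nonlinear, persistent process; a heteroskedasticity-and-autocorrelation-consistent (HAC) covariance estimator changes the reported standard errors, but the fitted structural response remains the same mis-specified linear map.

```python
import numpy as np
import statsmodels.api as sm

# The true response of y to x is nonlinear, and x is persistent, so a linear fit
# produces serially dependent residuals.
rng = np.random.default_rng(1)
n = 2000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal(scale=0.5)
y = np.sin(2.0 * x) + rng.normal(scale=0.3, size=n)

X = sm.add_constant(x)
fit_iid = sm.OLS(y, X).fit()                                    # assumes well-behaved errors
fit_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 10})

print(fit_iid.bse, fit_hac.bse)        # the standard errors change ...
print(fit_iid.params, fit_hac.params)  # ... but the (incorrect) linear response does not
```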
The paper builds on contributions of others in the following lines of research. The views on causality developed in the paper are related to the information theoretic view on testing causal theories, as discussed by [
27,
28,
29,
30], which, as here, emphasizes model parsimony. The line of reasoning is inspired by the work of [
31,
32], who emphasized the importance of a probabilistic formulation of economic theories and warned against the use of statistical methods without any reference to a stochastic process. The paper also emphasizes the importance of the overall model response, and, thus, on focusing on system behavior, rather than on isolated parameters that make no reference to a wider economic system. This has previously been advocated by [
33]. The main result of the paper is that convincing statements about partial causal linkages must be underpinned by an accurate model of broader reality, even if the interest is in inference and not prediction per se. In order to do so, researchers must, as shall be discussed, pay due attention to distinguishing between direct causal impacts and system memory and take note of developments in the field of predictive modeling.
The plan of the paper is as follows. Section 2 develops definitions for probabilistic causality in terms of true probability measures using a flexible type of dynamical system that covers many processes observed in economics, physics, finance, and related fields of study. Section 3 discusses approximating this true probability measure as an act of minimizing divergence between the modeled probability measure and the true probability measure, while Section 4 forges the link between statistical divergence and distance. This draws the connections between distance-minimization and the use of maximum likelihood criteria. Section 5 provides practical considerations and applies the theory. Finally, Section 6 concludes. Proofs are provided in Appendix A.
2. Causality in Terms of True Probability Measures
Notation will be as follows.
Notation 1. ℕ, ℤ, and ℝ, respectively, denote the sets of natural, integer, and real numbers. If is a set, denotes the Borel-σ algebra over , and , alternatively denoted as , is the Cartesian product of T copies of . Definitional equivalence is denoted , which is to be distinguished from ≡ denoting equivalence, for example in the functional sense. For two maps, f and g, their composition arises from their point-wise application and is denoted , and f⁻¹ is the inverse function of f. The tensor product is denoted ⊗. The notation μ ≪ ν is used to indicate that μ is absolutely continuous with respect to ν, i.e., if μ and ν are two measures on the same measurable space , μ is absolutely continuous with respect to ν if μ(A) = 0 for every set A for which ν(A) = 0, or, as an example, if ν is the counting measure on and μ is the Lebesgue measure, then . It is also said that ν is dominating
μ when μ ≪ ν, see for instance ([34] p. 574). Finally, the empty set ∅
is also used in the context of an empty sequence, which sometimes would be notated as in the literature. Directional causality is interesting when at least two sequences are considered. Specifically, when the focus is on a
T-period sequence
, that is a subset of the realized path of the
-variate stochastic sequence
for events in the event space
. (That is,
. The random sequence
is a Borel-
-measurable map
. In this,
denotes the Cartesian product of infinite copies of
and
with
, and
denotes the Borel-
algebra on the finite dimensional cylinder set of
, see Theorem 10.1 of [
35], p. 159). As always, the complete probability space of interest is described by a triplet
, with
as the
-field defined on the event space.
is used here informally as a placeholder for a collection of probability measures, as we shall introduce the exact probability measures of interest shortly.
If
is considered as a univariate sequence independent from causal drivers, then for every event
, the stochastic sequence
would live on the probability space
where
assigns probability to all elements of
. In a similar fashion, one can consider
as the subset of the realized path of the
-variate stochastic sequence
indexed by identical
t for events
(i.e.,
and the random sequence
is a Borel-
-measurable map
.) If
would live similarly isolated from outside influence, then for every
, the stochastic sequence
would operate on a space
where
assigns probability to all the elements of
. We have a system of two unrelated sequences (This naturally covers the most common auto-regression case, only stated for
here,
, where
is unobserved. The linear auto-regression case is obtained when
is a scaled identity function.):
As we shall see, an important aspect of causal analysis is to rule out the possibility that the observed data is generated by Equation (
1). As such, it is important to comment on a number of properties. First, in this system of equations, the functions
and
are intentionally not indexed by
t. This does not imply that these functions cannot possess complex time-varying properties; it only limits the discussion to observation-driven models (to the exclusion of parameter-driven models), in which time-varying parameters arise as nonlinear functions of the data. An example would be the threshold models considered by [
36,
37], in which parameter values are allowed to differ across regimes in the data. The choice to restrict the discussion is made because it is intuitively easier to conceive of causal effects in an observation-driven context where observations represent verifiable values describing different states of real-world phenomena. At the same time, it has been shown that parametric observation-driven models can produce time-varying parameters of a wide class of nonlinear models [
38] and that the forecasting power of such models may be on-par with parameter-driven models, even if the latter are correctly specified [
39]. Moreover, Refs. [
20,
40,
41] show how observation-driven models may be used to not only investigate how observations impact future observations, but also future parameter values, which may empirically be interesting if those parameters carry an economic interpretation. Finally, many popular machine learning algorithms, such as neural networks, can be reduced to equations that show how parameter values change according to levels in the data [
42].
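As a minimal observation-driven illustration (with parameter values chosen purely for exposition, not taken from [36,37]), a self-exciting threshold autoregression lets the autoregressive parameter switch with the level of the previous observation, so the time-varying parameter is itself a function of the data:

```python
import numpy as np

# Self-exciting threshold AR(1): the "time-varying parameter" phi_t is driven by the
# previous observation rather than by an unobserved parameter process.
rng = np.random.default_rng(2)
n = 500
y = np.zeros(n)
for t in range(1, n):
    phi_t = 0.9 if y[t - 1] < 0.0 else 0.3   # regime-dependent persistence
    y[t] = phi_t * y[t - 1] + rng.normal()
```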
While the dynamics in Equation (
1) may be nonlinear, the notation is too restrictive to nest long-memory processes. In particular, the state at time
t is only a function of the previous state at time
, or
if the model were generalized to
p-order lags, but not of the full history. Vanishing dependence, implied under contraction conditions [
43], is often key to verifying irreducibility and continuity [
44] and proving the ergodicity of time series [
45]. Proving the ergodicity of a model is needed to obtain an estimation theory under an assumption of correct specification [
20,
24]. Later, multivariate models will be considered, in which case long-memory properties may arise, for example, when time-varying parameters in one of the functions are a function of past data as well as of past values of those time-varying parameters.
If interrelated stochastic sequences are at the center of inference, additional building blocks are required to describe the processes. This increases the potential complexity of
and
, but it also allows one to distinguish between causality, non-causality, and feedback. Consider the stochastic system:
In this multivariate context,
and
will be referred to as the direct causal maps, while
and
control the memory properties within each channel.
When and are analyzed individually, the properties of and are of key interest. They carry information on the future positions of and , and provide predictability without considering outside influence directly. However, correct causal inference around the interdependencies of and may be preferred over developing predictive capabilities that can result from many configurations within the parameter space that are associated with untrue probability measures. The properties of and determine the direction in which effects move. Verifying their properties is central to causality studies. The functions and , on the other hand, play a central role in the system’s responses to external impulses by shaping the memory of the initial causal impact of a sequence of interventions, even after that sequence turns inactive.
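To fix ideas, the following sketch simulates a concrete linear stand-in for such a system (the names g_x, g_y for the memory maps and c_yx, c_xy for the direct causal maps, as well as the parameter values, are my own illustrative choices and not the paper's specification) and records, at each step, the part of one variable that is directly mapped from the previous state of the other:

```python
import numpy as np

g_x = lambda x: 0.6 * x      # memory within the x channel
g_y = lambda y: 0.4 * y      # memory within the y channel
c_yx = lambda y: 0.3 * y     # direct causal map from y to x
c_xy = lambda x: 0.0 * x     # set to zero: uni-directional causality from y to x

rng = np.random.default_rng(3)
n = 1000
x, y = np.zeros(n), np.zeros(n)
caused_x = np.zeros(n)       # the directly caused part of x, stripped of its own memory
for t in range(1, n):
    caused_x[t] = c_yx(y[t - 1])
    x[t] = g_x(x[t - 1]) + caused_x[t] + rng.normal(scale=0.5)
    y[t] = g_y(y[t - 1]) + c_xy(x[t - 1]) + rng.normal(scale=0.5)

# With these choices, caused_x is a non-degenerate sequence while the corresponding
# directly caused part of y is identically zero, matching uni-directional causality.
```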
The functions that control memory properties within channels in some sense determine how the past reverberates into the future, and specifying correct empirical equivalents to
and
is as crucial to the inference about the causal interdependencies as is specifying mechanisms for the action of interest (it would be more general to write Equation (
2) with
and
and with
. In this case, for instance, the dependence of
on its own past,
, is allowed to vary based on the levels in past data. However, under this notation, one could, at any point in time, decompose the change in one variable into effects attributed to memory and outside influence separately, which the simplified notation in Equation (
2) is intended to focus on). In fact, as Ref. [
46] points out, systems may be dominated by memory and the influence of the causal components on the overall process may be small, in which case predictive power can be obtained without specifying any causal maps and by focusing solely on memory. Conversely, this also suggests that one must obtain a model for the memory process to isolate the causal impacts themselves, suggesting that long-memory applications in which causal inference is of interest must develop a high degree of predictive power, even if prediction is not needed for policy purposes. This can be made clearer by considering the following:
with
and
defined as
and
. Given the realized sequences
and
generated by Equation (
2), the sequential system of Equation (
3) moves forward in time as the one-step-ahead directly caused parts of
and
that are filtered from the reverberating effects of
and
. More specifically, while
partially consists of memory, there is a part,
, that, at any point, is directly mapped from the previous state of
, while, at the same time,
consists partially of memory and a part
directly generated from the last position of
. In this view, directional causality can be stated in terms of whether (
3) produces any values, i.e., by diagnosing whether there is any statistically significant signal from initial causal impulses left after all memory properties have been stripped from the data. Importantly, the system reveals that, by the definitions of
and
, obtaining appropriate estimates for
and
involves
and
being modeled correctly as
and
are not observed and only result as functions from the observable processes
and
. Moreover, if
and
are triggered by an event, then it is possible, by a process of infinite backward substitution, to write Equation (
3) as an infinite chain initialized in the infinite past. Plugging in the equalities
and
and defining the random functions
and
, one can write
Repeating infinitely, and extending infinitely in the direction
,
and
are the maps that generate
and
infinitely after
and
have been generated into infinity. Subscript
has been used, here, to mark the initialization points. This shows that
can be written as a sequence of iterating functional operations that are all defined on
, and
defined on
in a similar way (Equation (
5) reveals that the sequences that constitute the directly caused parts of
and
are ultimately dependent on the values at which the observable process has been initialized. That is, the entire causal pathway depends on the initial impact. In practice, one cannot observe all impacts, including those that occurred in the infinite past, and assurance is required that the initialization effect of the causal pathway is, asymptotically, irrelevant). For ease of notation, let us write
where bold-faced
is used to refer to the entire sequence of functional operations
up to
t, starting in the infinite past
. This highlights that generating the unobserved quantities
and
from the observed quantities
and
by back substitution eventually involves the unobserved quantities
and
. This means that some feasible form of approximation is needed, since time series data are, in practice, almost never recorded from the beginning of the process.
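The following sketch illustrates, under an assumed contraction-type condition (an AR(1) recursion with coefficient strictly inside the unit interval, my own illustrative choice), why such an approximation can be feasible: two copies of the same recursion started at very different initial values and driven by the same innovations converge geometrically to the same path, so the unobserved initialization becomes asymptotically irrelevant.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
eps = rng.normal(size=n)     # the same innovations feed both copies of the recursion
phi = 0.7                    # |phi| < 1 plays the role of the contraction condition

def simulate(y0):
    y = np.empty(n)
    y[0] = y0
    for t in range(1, n):
        y[t] = phi * y[t - 1] + eps[t]
    return y

gap = np.abs(simulate(0.0) - simulate(25.0))
print(gap[[1, 10, 50, 100]])  # the gap equals 25 * phi**t and shrinks geometrically
```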
Note first that
is a
-measurable mapping, and
is a
-measurable mapping. The sequence
thus lives on
, where
is induced according to
, and
lives on
, where
is induced according to
, see [
47] p. 118 and [
48] p. 115. The notation shows that the probability measures underlying the stochastic causal sequences result from the functional behavior of the entire system. In particular, the causal sequences can be written as recursive direct effects from another variable that itself consists of memory and causal effects, and the probability measures underlying the causal sequences are thus induced by the functional relationships that describe all dynamical dependencies. This is important to the extent that many causal studies focus on one single marginal dependency, while, from the measure-theoretic perspective developed here, the wider system within which any single process operates is of importance to the analysis. This suggests that researchers must pay attention to referencing the workings of a broader system when designing their models for inference, something [
33] has also argued. Moreover, it has been argued (see [
49] for discussion) that probabilistic definitions of causality are not strictly causal in the sense that they do not provide insight into the origin of the probability law that regulates the process of interest, and that a (correct) time-series model only describes (correctly) the probabilistic behavior as the outcome of that unknown causal origin. The notation here, however, shows explicitly the relation between the functional behavior of a system and its induced probability measure that assigns probability to all possible outcomes. This suggests that such critiques rather relate to disagreements around the level of detail in the structure of a model, which in turn would be guided by the research question of interest and the availability of detailed data. Particularly, dynamical systems in economics are often modeled using aggregate macro-economic data that do not have the same granularity as micro-economic data containing information about the behaviors of individual economic agents.
In many cases, a researcher is not able to observe all the relevant variables. When a third, possibly unobserved external variable,
, with effect
, is considered, the researcher is confronted with the situation that
If
is unobserved, it can still be approximated as a difference combination of
and
. To obtain an approximated sequence of the
true sequence to condition empirical counterparts for
and
on, one can work with:
Equation (
8) suggests writing Equation (
7) in terms of
and
only by defining
as a difference combination of
and
(Apart from stability conditions imposed on the endogenous process, one requires also that the exogenous impacts enter the system in some suitable manner, which, for example, requires that
and
are appropriately bounded. Following the same arguments that resulted in Equation (
5), the initialization of the exogenous impacts
should similarly not carry information influential in the empirical estimates of
and
, conditional on partial information). This allows us to define the spaces and measures in terms of
and
when the multivariate process includes further variables, in this case,
. If the process is invertible, one can write, by aggregating the functions:
For every
, the map
is
-measurable and
lives on the space
where the probability measure
is induced by
on
according to the point-wise application of
and the inverse of
. (
). Similar arguments follow for
. This tells us that, in the general case of multivariate dependencies and in the presence of possibly unobserved variables, the probability measures underlying the individual sequences are possibly a result of those of the other sequences. This means the space of empirical candidates for the probability measure
that underlies the joint process
operates on
. (The sequence realizes under the events
,
, where
and
, with
, and the probability measure of the joint process
is thus defined on the product
-algebra
(see, [
47] p. 119)).
Regardless, the measure
is induced by functional relations of Equation (
2), which, as was shown, can be decomposed into memory and causal subsystems. One can thus state causality conditions, based on the measures that describe the directly caused effects represented by Equation (
6). In particular, one can keep the focus on
and
, bearing in mind that they are lower-level constituents of
on which, in turn, the complete estimation objective will be defined.
Definition 1 (Non-causality). The stochastic sequences and are not causally related if and are null measures, such that and .
Definition 2 (Uni-directional Causality). Causality runs uni-directionally from the stochastic sequence to another stochastic sequence (vice versa), if is a null measure, and is a non-null measure, such that and (vice versa).
Definition 3 (Bi-directional Causality). The stochastic sequence is causal with respect to and is causal with respect to , if and are both non-null measures, such that and .
Respectively, conditioning on impacts in , these probabilistic causality definitions can thus be understood broadly as:
- 1.
Whenever an intervention in occurs, there is no chance that reacts as a result of that.
- 2.
Whenever an intervention in occurs, there is positive chance that reacts as a result of that.
- 3.
Whenever an intervention in occurs, there is positive chance that reacts as a result of that. Subsequently there is positive chance that reacts to this initial reaction, a probabilistic process that repeats recursively.
Remark 1. With null measures, it is meant that the stochastic sequence describing the directly caused effects from one variable to the other takes values in the empty set with probability 1. This is because the functions that induce the probability measure cancel out; hence, they can be removed from the equations, resulting in a probability measure that is not induced by any remaining rule or relationship. In practice, one can test whether or , where here denotes the probability measure induced by the functional relationships in Equation (1) and denotes the probability measure induced by the functional relationships in Equation (2), to test whether exists. A practical test is a Kolmogorov–Smirnov-type test.
3. Limit Divergence on the Space of Modeled Probability Measures
The definitions of causality, in terms of the lower-level components of , suggest that correct causal statements can be obtained empirically by extracting relevant counterparts to and from a relevant counterpart to , and investigating the stochastic sequences produced by these modeled measures. For such an approach to be of relevance in an empirical context, one must ensure that the concepts introduced adequately transfer over from the true measure to a modeled measure . The focus is therefore shifted towards detailing how can be approximated as a minimally divergent measure relative to , drawing on approximation theory to construct equivalence around the true measure under an axiom of correct specification.
For some event
, a realized
T-period sequence
consisting of sequences
and
can be observed. The
true function
, consists of our main functions of interest
and
that in turn are composed of
and
that are of particular interest to the researcher focused on causality, but possibly also functions
and
that shape the responses of an initial causal effect. The exact properties are generally unknown to the observer, but one can design a parameterization mapping that learns the behavior of
and
when exposed to sufficient data. To learn from the data an approximation of
and
, one can postulate a model
with
f:
as our postulated model function and
as the modeled data. In the context of parametric inference, the parameter space
is of finite dimensionality, but also in the nonparametric case, the vector
indexes parametric models nested by the nonparametric model, each inducing its own probability measure, and
indexes families of parametric models, each inducing a space of parametric functions generated under
. In this discussion, a compact set of potential hypotheses is considered, limiting the inference to parametric models. The arguments can be extended to the nonparametric case by focusing on a compact subset
of solutions (For example, by letting
grow as
, hence focusing on the case
, see for example [
50]). Alternatively, one can use priors or penalties that discard
such that any solution of the criterion necessarily falls within a compact subset of the space, see [
20] p. 210 and [
24]. Let
f be
-measurable
so that
is
-measurable
and
.
is our space of parametric functions defined on
generated under
under the injective
where
. Under any
true probability measure
, every potential parameter vector included in the parameter space
induces a probability measure
indexed by
on
, according to
. Thus, for every potential parameter vector included in the parameter space
, there is a triplet
that describes the probability space of modeled data under
. The triplet
is, thus, itself an element of the measure spaces indexed by
across all
. Given the
true probability measure
on
, this process is summarized by a functional
, that maps elements from the space of parametric functions generated by the entire parameter space
, onto the space
of probability measures defined on the sets of
generated by
through
.
Now,
is generally not only unknown, but for a finite
there is no guarantee that
, implying that, in many empirical applications, one is concerned with the situation where
. However, if
, one can learn all about
by uncovering the properties of
f, given that a sufficient number of observations is available. (As discussed in the literature on mis-specification, even when the axiom of correct specification is abandoned,
f may converge to a function that produces the optimal conditional density, which may reveal properties of
). Let
, be the extremum estimate for
as judged by the criterion
. Trivially,
and
. To see that under correct specification it is possible to approximate the
true function
in terms of equivalence (in the sense of function equivalence [
51] p. 288), one can write the criterion function also as a function of the
true function and the postulated model
in which use is made of the fact that
and
.
The discussion further evolves toward showing that the element in that is closest to minimizes a divergence metric that results from a transformation of the limit criterion that measures the divergence between the true density and the density implied by the model. Note that is induced by the proposed candidates for ; studies on causality thus rely on flexible model design as the researcher determines which hypotheses are considered in a study by exerting control over . Naturally, if , then produces a larger . This suggests that minimizing this divergence metric over as large a as possible results in selecting at a point in that attains equivalence to only when is large enough to produce a correctly specified hypothesis set. Note that the definition of , as our space of parametric functions generated under , under the injective and the functional that induces the space of probability measures, is defined on the sample space . This highlights that the correct specification argument, , not only stresses flexible parameterization in the sense that parameterized dependencies can take on many values, but also in the sense of using correct data (Indeed, the potential parameters that would interact with data that is not used are essentially treated as zero, so the focus on using correct data is implicitly already contained in the standard statements of correct specification that focus directly on the dimensions of . The distinction is nevertheless useful because nonparametric models are often popularized as methods to reduce mis-specification bias as becomes infinite dimensional, but this does not imply correct specification if important data is missing). When little is known about f, one is thus not only concerned with flexibility in terms of the type of parametric functions generated under , but also with the variables on which the modeled measures are defined. When these concerns are appropriately addressed, testing for causality amounts to deciding, based on the approximation, whether the best approximation of the true model suggests (1) that and live in isolation, (2) unidirectional causality, or (3) that produces feedback.
To turn this problem into a selection problem that can be solved by divergence minimization w.r.t. the true measure, first introduce the limit criterion by taking and working with the modeled data as the minimizer of the criterion. Specifically, let the limit criterion be evaluated at with and with the criterion as a measure of divergence on the true probability measure and the modeled measure. More specifically, . By definition of as a divergence on the space that contains and , the element is thus the minimizer of that divergence.
Moreover, in the parameter sense, in the function sense (in terms of a divergence metric on the true function), and in the measure sense (in terms of a divergence metric on the true probability measure), are equivalent limits under the same consistency result. To see this, it is convenient to focus once more on the target and write , with , to make clear that the criterion establishes a divergence on , which is, in turn, induced by through according to . This ensures that our statement on the probability measure is relevant under standard consistency results that are focused on the convergence of an estimated parameter vector toward , while, equivalently, the impulse response functions (IRFs) converge to the true IRFs at . This implies that the decision between Definitions 1–3 can be read from the responses produced by the IRF that minimizes divergence w.r.t. the true IRF.
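A hedged sketch of how such responses can be read from a fitted model is given below: a generalized impulse response is computed by perturbing the candidate causal input once, iterating the fitted one-step-ahead map forward, and differencing against the unperturbed path. The function f_hat here is a simple placeholder for whatever estimated map the researcher has obtained (for example, a wrapped random forest predictor), not the paper's estimator.

```python
import numpy as np

f_hat = lambda y_lag, x_lag: 0.5 * y_lag + 0.4 * x_lag   # placeholder for a fitted map
x_path = np.zeros(20)                                    # baseline path for the driver

def propagate(x_seq, y0=0.0):
    """Iterate the fitted one-step-ahead map forward along a given driver path."""
    y = [y0]
    for t in range(1, len(x_seq)):
        y.append(f_hat(y[-1], x_seq[t - 1]))
    return np.array(y)

shocked = x_path.copy()
shocked[0] = 1.0                                         # one-unit impulse in the driver at t = 0
irf = propagate(shocked) - propagate(x_path)             # response of the target, net of the memory-only path
print(irf[:6])                                           # identically zero iff the fitted map carries no causal channel
```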
Not necessary, but convenient for a proof that holds easily in practical situations, is to assume the existence of a strictly increasing function that ensures the existence of a transformation of the limit criterion into a metric, , with r being a continuous and strictly increasing function. For convenience, all assumptions are summarized in Assumption 1.
Assumption 1. For a limit criterion of the form , is a divergence. Assume there exists a continuous strictly increasing function such that is a metric. The functional is injective and .
Proposition 1. Assume 1, then the following are equivalent limits:
- 1.
,
- 2.
,
- 3.
,
- 4.
,
- 5.
.
Remark 2. Dropping the axiom of correct specification implies , hence, the equivalences of 3–5 are now w.r.t. item 2.
The equivalences in Proposition 1 ensure that, for a correctly specified model , the element results not only in functional equivalence between the model and the true model (item 3), but also in zero divergence between the probability measures and (item 4). Moreover, it follows that at , the empirically estimated probability measure is equivalent to in the sense that there is zero distance between the two (item 5).
Remark 3. Proposition 1 is applicable to a large class of extremum estimators, even those not initially conceived as minimizers of distance. In particular, it is often possible to find a divergence on the space of probability measures. For example, method of moments estimators are naturally defined in terms of features of the underlying probability measures. In Section 4, an example is given using Kullback–Leibler divergence, for which penalized likelihood is an estimator. In this case, the squared Hellinger distance can be shown to be a lower bound.
Corollary 1 now delivers that our definitions, set on the
true measures, transfer to modeled probability measures in the limit for correctly specified cases. It is well known that standard consistency proofs also apply to approximate extremum estimators; therefore, additionally assuming that
a.s. is sufficient for a consistency result, together with the uniqueness of
within the compact hypothesis space
(Note that, under the axiom of correct-specification, consistency results require suitable forms of stability defined on the process rather than the data. While we have loosely remarked on the fact that the non-parametric case of an infinite dimensional
is easily allowed, stability of highly nonlinear multivariate time series is a difficult separate topic. Regardless, Refs. [
44,
45] provide ergodicity results for a large class of nonlinear time series that include non-parametric ones. The conditions require the nonlinearities to be sufficiently smooth. Specific stability results have also been established for certain neural network models, for example by [
52]). This implies that our causality conditions on the
true measures transfer to the approximate not only in the limit, but also for large
T under standard regularity conditions. Essentially, this is the setting considered by Ref. [
11]. Summarized:
Corollary 1. Given a true probability measure , and an equivalent modeled probability measure in the sense that , there are four possibilities for causality:
- 1.
There is no causation if and adhere to Definition 1.
- 2.
causes if the probability measure adheres to Definition 2.
- 3.
causes if the probability measure adheres to Definition 2.
- 4.
There is bi-directional causality if and adhere to Definition 3.
Finally, in the case of a mis-specified model, Proposition 2 implies that the divergence between the optimal probability measure as judged by the criterion and the true probability measure attains a minimum at a strictly positive value . In this case, the quantity determines how “close” the empirical claim is to the true hypothesis about causality. While it is difficult to make claims about this quantity, it is evident that minimizing may involve widening in the direction of by increasing the dimensionality of and allowing flexibility while investigating a wide range of data. Regardless of the value of , the following holds.
Proposition 2. If , then . However, is still the pseudo-true parameter that minimizes over Θ. Therefore, is the probability measure minimally divergent from within . As such, it follows that, from all the potential probability measures in , the measure closest to is supportive of one out of in Corollary 1, based on the properties of and as the best approximations. provides the best approximation of the true causal measure across all the hypotheses considered.
This leads to the following collection of results.
Corollary 2. Given a true probability measure , and a non-equivalent, but pseudo-true modeled probability measure, , in the sense that has attained a non-zero minimum, there are four possible optimal hypotheses about causality, as judged by the criterion:
- 1.
There is no causation if and adhere to Definition 1.
- 2.
causes if the probability measure adheres to Definition 2.
- 3.
causes if the probability measure adheres to Definition 2.
- 4.
There is bi-directional causality if and adhere to Definition 3.
Respectively, conditioning on interventions in , the results can be understood as:
- 1.
Whenever an intervention in occurs, our best hypothesis is that there is no chance that reacts as a result of that.
- 2.
Whenever an intervention in occurs, our best hypothesis is that there is positive chance that reacts as a result of that.
- 3.
Whenever an intervention in occurs, our best hypothesis is that there is positive chance that reacts as a result of that, and these interactions continue to repeat with positive probability.
4. Limit Squared Hellinger Distance
Both Corollaries 1 and 2 assume that an appropriate transformation of the limit criterion exists that provides us with a metric or norm. This assumption allows us to make use of the classical theorems on existence and uniqueness of best approximations that have been naturally obtained for metric, normed, and inner product spaces [
53]. While this retains the simplicity of the argument, it also shows that a direct interpretation of Corollaries 1 and 2 can be obtained within the framework of maximum likelihood. Let us first define the criterion function as the maximum likelihood estimator:
Note that this conforms to
with
and
. It can be shown that, under this definition with
, the criterion
is a measure of divergence
on the
true probability measure and the modeled measure. Specifically, we can introduce a divergence
as follows. Let
and
be, respectively, the
true density evaluated under the
true parameter and a modeled density at
, evaluated under the estimated parameter, both at time
t, with respect to the Lebesgue measure (such that they are probability density functions); then the following is a divergence from the true probability measure to the modeled probability measure (Kullback–Leibler divergence, see [
54]):
Naturally,
with equality if and only if
almost everywhere, i.e., when the probability measures are the same (this is known as Gibbs’ inequality and can be verified by applying Jensen’s inequality).
Kullback–Leibler divergence is not a distance metric of the kind used in Corollaries 1 and 2 to establish equivalences by partitioning into classes of zero-distance points. In particular, it is asymmetric
and the triangle inequality is also not satisfied. However, it has the product–density property
for
, and
defined similarly. Hence, the MLE is an unbiased estimator of minimized Kullback–Leibler divergence:
Note that, under standard assumptions, a law of large numbers can be applied to obtain the convergence; hence, by maximizing the log likelihood, we minimize Kullback–Leibler divergence. Now, we need to find a continuous scaling function,
r, to ensure that it also minimizes the distance between the
true measure and the modeled measure so that we may reach zero at
. Alternatively, we can find the distance metric directly. We argued above that Kullback–Leibler divergence is not a proper distance (in particular, it is not symmetric and does not satisfy the triangle inequality). However, it is notably useful to specify
directly as the Hellinger distance between a modeled probability measure and the true probability measure [
55]:
Specifically, the squared Hellinger distance provides a lower bound for the Kullback–Leibler divergence. Therefore, maximizing log likelihood implies minimizing Kullback–Leibler divergence, which implies minimizing the Hellinger distance. This is easily seen by the following:
Proposition 3. The squared Hellinger distance provides a lower bound to Kullback–Leibler divergence:
Remark 4 below highlights that these notions do not just apply to the standard real-valued time series settings considered by Granger, but can apply to the explicit probability modeling of binary outcomes as well. Remark 4 further clarifies a result that has so far only been presented implicitly: that the probabilistic truth identified at the discussed zero-distance point may allow for a base level of entropy to exist even when all functional relationships in the process have been accounted for in a model.
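Before turning to Remark 4, the bound in Proposition 3 can be illustrated numerically. The sketch below uses the convention H²(p, q) = ½ ∫ (√p − √q)² dx (the paper's exact normalization is an assumption here, and the bound holds under either common convention) and two Gaussian densities standing in for the true and modeled densities.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import simpson

x = np.linspace(-12.0, 12.0, 20001)
p = norm.pdf(x, loc=0.0, scale=1.0)     # stand-in for the true density
q = norm.pdf(x, loc=1.0, scale=1.5)     # stand-in for the modeled density

kl = simpson(p * np.log(p / q), x=x)                        # Kullback-Leibler divergence D(p || q)
h2 = 0.5 * simpson((np.sqrt(p) - np.sqrt(q)) ** 2, x=x)     # squared Hellinger distance
print(h2, '<=', kl, h2 <= kl)                               # the squared Hellinger distance is the smaller quantity
```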
Remark 4. While the paper has implicitly alluded to modeling continuous real-valued processes through the notational conventions, the connections between true probability and modeled probability are also easily made by focusing on an explicit binary outcome problem. Define cross-entropy for two discrete probability distributions p and q with the same support:
H(p, q) = −Σ_x p(x) log q(x) = D_KL(p‖q) + H(p),
in which D_KL(p‖q) is Kullback–Leibler divergence, or the relative entropy of q with respect to p, and H(p) is the entropy of p. Now if p = (y, 1 − y) and q = (ŷ, 1 − ŷ), we can rewrite cross-entropy as
H(p, q) = −y log ŷ − (1 − y) log(1 − ŷ),
or, for predictions ŷ(x; θ) generated under a set of parameters θ and a predictor x, as
H(p, q) = −y log ŷ(x; θ) − (1 − y) log(1 − ŷ(x; θ)).
Remember that the maximum likelihood estimator maximizes the likelihood of the data under some probabilistic model. The correct likelihood in the case of binary classification is Bernoulli:
P(y | x; θ) = ŷ(x; θ)^y (1 − ŷ(x; θ))^(1−y),
which results in the likelihood function
L(θ) = Π_t ŷ(x_t; θ)^(y_t) (1 − ŷ(x_t; θ))^(1−y_t).
Taking logs then gives the following log likelihood function
ℓ(θ) = Σ_t [ y_t log ŷ(x_t; θ) + (1 − y_t) log(1 − ŷ(x_t; θ)) ].
This shows that negative log likelihood is proportional to Kullback–Leibler divergence and differs by the basic entropy in the data, which is constant. Maximizing the likelihood of a binary model can, thus, be understood as minimizing statistical distance toward a true probability measure; the minimum value is determined by the entropy in the observed data.
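A small numerical sketch of this decomposition (with simulated probabilities of my own, not the paper's data) confirms that the cross-entropy between true Bernoulli probabilities p and modeled probabilities q equals the Kullback–Leibler term plus the entropy of p, the latter being the fixed floor that remains even for a perfectly specified model.

```python
import numpy as np

rng = np.random.default_rng(5)
p = rng.uniform(0.05, 0.95, size=10_000)                              # true conditional success probabilities
q = np.clip(p + rng.normal(scale=0.05, size=p.size), 0.01, 0.99)      # an imperfect model of those probabilities

cross_entropy = np.mean(-p * np.log(q) - (1 - p) * np.log(1 - q))
kl = np.mean(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))
entropy = np.mean(-p * np.log(p) - (1 - p) * np.log(1 - p))

print(cross_entropy, kl + entropy)   # identical up to floating-point error
# The expected negative Bernoulli log likelihood of observed 0/1 outcomes equals this
# cross-entropy, so maximizing the likelihood minimizes the KL term; the entropy term
# is a property of the data and cannot be removed by any model.
```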