Intersection Information Based on Common Randomness
Abstract: The introduction of the partial information decomposition generated a flurry of proposals for defining an intersection information that quantifies how much of “the same information” two or more random variables specify about a target random variable. As of yet, none is wholly satisfactory. A palatable measure of intersection information would provide a principled way to quantify slippery concepts, such as synergy. Here, we introduce an intersection information measure based on the Gács–Körner common random variable that is the first to satisfy the coveted target monotonicity property. Our measure is imperfect, too, and we suggest directions for improvement.

1. Introduction
Partial information decomposition (PID) [1] is an immensely suggestive framework for deepening our understanding of multivariate interactions, particularly our understanding of informational redundancy and synergy. In general, one seeks a decomposition of the mutual information that n predictors X1, . . ., Xn convey about a target random variable, Y. The intersection information is a function that calculates the information that every predictor conveys about the target random variable; the name draws an analogy with intersections in set theory. An anti-chain lattice of redundant, unique and synergistic partial information is then built from the intersection information.
As an intersection information measure, [1] proposes the quantity:

Imin(X1, . . ., Xn : Y) ≡ Σ_y Pr(y) min_i DKL[ Pr(Xi | y) ‖ Pr(Xi) ],

where DKL is the Kullback–Leibler divergence. Although Imin is a plausible choice for the intersection information, it has several counterintuitive properties that make it unappealing [2]. In particular, Imin is not sensitive to the possibility that differing predictors, Xi and Xj, can reduce uncertainty about Y in nonequivalent ways. Moreover, the min operator effectively treats all uncertainty reductions as the same, causing it to overestimate the ideal intersection information. The search for an improved intersection information measure continued through [3–5], and today, a widely accepted intersection information measure remains undiscovered.
Here, we do not definitively solve this problem, but explore a candidate intersection information based on the so-called common random variable [6]. Whereas Shannon mutual information is relevant to communication channels with arbitrarily small error, the entropy of the common random variable (also known as the zero-error information) is relevant to communication channels without error [7]. We begin by proposing a measure of intersection information for the simpler zero-error information case. This is useful in and of itself, because it provides a template for exploring intersection information measures. Then, we modify our proposal, adapting it to the Shannon mutual information case.
The next section introduces several definitions, some notation and a necessary lemma. We extend and clarify the desired properties for intersection information. Section 3 introduces zero-error information and its intersection information measure. Section 4 uses the same methodology to produce a novel candidate for the Shannon intersection information. Section 5 shows the successes and shortcoming of our candidate intersection information measure using example circuits and diagnoses the shortcoming’s origin. Section 6 discusses the negative values of the resulting synergy measure and identifies its origin. Section 7 summarizes our progress towards the ideal intersection information measure and suggests directions for improvement. Appendices are devoted to technical lemmas and their proofs, to which we refer in the main text.
2. Preliminaries
2.1. Informational Partial Order and Equivalence
We assume an underlying probability space on which we define random variables denoted by capital letters (e.g., X, Y and Z). In this paper, we consider only random variables taking values on finite spaces. Given random variables X and Y, we write X ≼ Y to signify that there exists a measurable function, f, such that X = f(Y ) almost surely (i.e., with probability one). In this case, following the terminology in [8], we say that X is informationally poorer than Y ; this induces a partial order on the set of random variables. Similarly, we write X ≽ Y if Y ≼ X, in which case we say X is informationally richer than Y.
If X and Y are such that X ≼ Y and X ≽ Y, then we write X ≅ Y. In this case, again following [8], we say that X and Y are informationally equivalent. In other words, X ≅ Y if and only if one can relabel the values of X to obtain a random variable that is equal to Y almost surely and vice versa.
This “information equivalence” can easily be shown to be an equivalence relation, and it partitions the set of all random variables into disjoint equivalence classes. The ≼ ordering is invariant within these equivalence classes in the following sense. If X ≼ Y and Y ≅ Z, then X ≼ Z. Similarly, if X ≼ Y and X ≅ Z, then Z ≼ Y. Moreover, within each equivalence class, the entropy is invariant, as shown in Section 2.2.
2.2. Information Lattice
Next, we follow [8] and consider the join and meet operators. These operators were defined for information elements, which are σ-algebras or, equivalently, equivalence classes of random variables. We deviate from [8] slightly and define the join and meet operators for random variables.
Given random variables X and Y, we define X ⋎ Y (called the join of X and Y) to be an informationally poorest (“smallest” in the sense of the partial order ≼) random variable such that X ≼ X ⋎ Y and Y ≼ X ⋎ Y. In other words, if Z is such that X ≼ Z and Y ≼ Z, then X ⋎ Y ≼ Z. Note that X ⋎ Y is unique only up to equivalence with respect to ≅. In other words, X ⋎ Y does not define a specific, unique random variable. Nonetheless, standard information-theoretic quantities are invariant over the set of random variables satisfying the condition specified above. For example, the entropy of X ⋎ Y is invariant over the entire equivalence class of random variables satisfying the condition above. Similarly, the inequality Z ≼ X ⋎ Y does not depend on the specific random variable chosen, as long as it satisfies the condition above. Note that the pair (X, Y) is an instance of X ⋎ Y.
In a similar vein, given random variables X and Y, we define X ⋏ Y (called the meet of X and Y ) to be an informationally richest random variable (“largest” in the sense of ≽ ), such that X ⋏ Y ≼ X and X ⋏ Y ≼ Y. In other words, if Z is such that Z ≼ X and Z ≼ Y, then Z ≼ X ⋏ Y. Following [6], we also call X ⋏ Y the common random variable of X and Y.
An algorithm for computing an instance of the common random variable between two random variables is provided in [7]; it generalizes straightforwardly to n random variables. One can also take intersections of the σ-algebras generated by the random variables that define the meet.
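As a concrete illustration (this is not the algorithm of [7]; the function names, the dictionary representation of the joint distribution and the small example are our own assumptions), the following Python sketch computes an instance of X ⋏ Y by linking outcomes x and y whenever Pr(x, y) > 0 and labeling the connected components of the resulting bipartite graph:

```python
from collections import defaultdict
import itertools

def common_random_variable(pxy):
    """Return (f, g): labelings of the X- and Y-alphabets such that
    f[x] == g[y] whenever Pr(x, y) > 0.  The shared label is an instance of
    the common random variable X ⋏ Y (unique only up to relabeling).

    pxy: dict mapping (x, y) -> probability over finite alphabets."""
    adj = defaultdict(set)              # bipartite support graph
    for (x, y), p in pxy.items():
        if p > 0:
            adj[("x", x)].add(("y", y))
            adj[("y", y)].add(("x", x))

    label = {}                          # connected-component index per node
    next_label = 0
    for start in list(adj):
        if start in label:
            continue
        stack = [start]
        label[start] = next_label
        while stack:
            node = stack.pop()
            for neighbor in adj[node]:
                if neighbor not in label:
                    label[neighbor] = next_label
                    stack.append(neighbor)
        next_label += 1

    f = {x: lab for (side, x), lab in label.items() if side == "x"}
    g = {y: lab for (side, y), lab in label.items() if side == "y"}
    return f, g

# Example: X = (C, A) and Y = (C, B), where C, A, B are independent fair
# coins.  The meet recovers the shared coin C (up to relabeling).
pxy = {((c, a), (c, b)): 1 / 8 for c, a, b in itertools.product((0, 1), repeat=3)}
f, g = common_random_variable(pxy)
assert all(f[x] == g[y] for (x, y) in pxy)   # f(X) = g(Y) almost surely
assert len(set(f.values())) == 2             # exactly one bit of common randomness
```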
The ⋎ and ⋏ operators satisfy the algebraic properties of a lattice [8]. In particular, the following hold:
commutative laws: X ⋎ Y ≅ Y ⋎ X and X ⋏ Y ≅ Y ⋏ X;
associative laws: X ⋎ (Y ⋎ Z) ≅ (X ⋎ Y) ⋎ Z and X ⋏ (Y ⋏ Z) ≅ (X ⋏ Y) ⋏ Z;
absorption laws: X ⋎ (X ⋏ Y) ≅ X and X ⋏ (X ⋎ Y) ≅ X;
idempotent laws: X ⋎ X ≅ X and X ⋏ X ≅ X;
generalized absorption laws: if X ≼ Y, then X ⋎ Y ≅ Y and X ⋏ Y ≅ X.
Finally, the partial order ≼ is preserved under ⋎ and ⋏, i.e., if X ≼ Y, then X ⋎ Z ≼ Y ⋎ Z and X ⋏ Z ≼ Y ⋏ Z.
Let H(·) represent the entropy function and H (·|·) the conditional entropy. We denote the Shannon mutual information between X and Y by I(X:Y ). The following results highlight the invariance and monotonicity of the entropy and conditional entropy functions with respect to ≅= and ≼ [8]. Given that X ≼ Y if and only if X = f(Y ), these results are familiar in information theory, but are restated here using the current notation:
- (a) If X ≅ Y, then H(X) = H(Y), H(X|Z) = H(Y|Z), and H(Z|X) = H(Z|Y).
- (b) If X ≼ Y, then H(X) ≤ H(Y), H(X|Z) ≤ H(Y|Z), and H(Z|X) ≥ H(Z|Y).
- (c) X ≼ Y if and only if H(X|Y) = 0.
2.3. Desired Properties of Intersection Information
We denote 𝕀(X:Y) as a nonnegative measure of information between X and Y. For example, 𝕀 could be the Shannon mutual information, i.e., 𝕀(X:Y) ≡ I(X:Y). Alternatively, we could take 𝕀 to be the zero-error information. Yet other possibilities include the Wyner common information [9] or the quantum mutual information [10]. Generally, though, we require that 𝕀(X:Y) = 0 if Y is a constant, which is satisfied by both the zero-error and Shannon information.
For a given choice of 𝕀, we seek a function that captures the amount of information about Y that every one of the predictors X1, . . ., Xn conveys. We say that I∩ is an intersection information for 𝕀 if it satisfies self-redundancy, (SR): for a single predictor, I∩(X:Y) = 𝕀(X:Y). There are currently 11 intuitive properties that we wish the ideal intersection information measure, I∩, to satisfy. Some are new (namely, lower bound (LB), strong monotonicity (M1) and equivalence-class invariance (Eq)), but most were introduced earlier, in various forms, in [1–5]. They are as follows:
(GP) Global positivity: I∩(X1, . . ., Xn :Y ) ≥ 0.
(Eq) Equivalence-class invariance: I∩(X1, . . ., Xn :Y ) is invariant under substitution of Xi (for any i = 1, . . ., n) or Y by an informationally equivalent random variable.
(TM) Target monotonicity: If Y ≼ Z, then I∩(X1, . . ., Xn :Y ) ≤ I∩(X1, . . ., Xn :Z).
(M0) Weak monotonicity: I∩(X1, . . ., Xn, W :Y ) ≤ I∩(X1, . . ., Xn :Y ) with equality if there exists a Z ∈ {X1, . . ., Xn} such that Z ≼ W.
(S0) Weak symmetry: I∩(X1, . . ., Xn :Y ) is invariant under reordering of X1, . . ., Xn.
The next set of properties relates the intersection information to the chosen measure of information, 𝕀.
(LB) Lower bound: If Q ≼ Xi for all i = 1, . . ., n, then I∩(X1, . . ., Xn : Y) ≥ 𝕀(Q:Y). Note that X1 ⋏ · · · ⋏ Xn is a valid choice for Q. Furthermore, given that we require I∩(X:Y) = 𝕀(X:Y), it follows that (M0) implies (LB): repeated use of the equality condition of (M0) gives I∩(Q, X1, . . ., Xn : Y) = I∩(Q:Y) = 𝕀(Q:Y), while (M0) itself gives I∩(Q, X1, . . ., Xn : Y) ≤ I∩(X1, . . ., Xn : Y).
(Id) Identity: I∩(X, Y : X ⋎ Y) = 𝕀(X : Y).
(LP0) Weak local positivity: For n = 2 predictors, the derived “partial information” terms defined in [1] and described in Section 5 are nonnegative. If both (GP) and (M0) are satisfied, as well as I∩(X1, X2 : Y) ≥ 𝕀(X1 : Y) + 𝕀(X2 : Y) − 𝕀(X1 ⋎ X2 : Y), then (LP0) is satisfied.
Finally, we have the “strong” properties:
(M1) Strong monotonicity: I∩(X1, . . ., Xn, W :Y ) ≤ I∩(X1, . . ., Xn :Y ) with equality if there exists Z ∈ {X1, . . ., Xn, Y } such that Z ≼ W.
(S1) Strong symmetry: I∩(X1, . . ., Xn :Y ) is invariant under reordering of X1, . . ., Xn, Y.
(LP1) Strong local positivity: For all n, the derived “partial information” defined in [1] is nonnegative.
Properties (LB), (M1) and (Eq) are introduced for the first time here. However, (Eq) is satisfied by most information-theoretic quantities and is implicitly assumed by others. Though absent from our list, it is worthwhile to also consider continuity and chain rule properties, in analogy with the mutual information [4,11].
3. Candidate Intersection Information for Zero-Error Information
3.1. Zero-Error Information
Introduced in [7], the zero-error information, or Gács–Körner common information, is a stricter variant of Shannon mutual information. Whereas the mutual information, I(A:B), quantifies the magnitude of information A conveys about B with an arbitrarily small error ε > 0, the zero-error information, denoted I0(A:B), quantifies the magnitude of information A conveys about B with exactly zero error, i.e., ε = 0. The zero-error information between A and B equals the entropy of the common random variable A ⋏ B:

I0(A:B) ≡ H(A ⋏ B).    (1)
Zero-error information has several notable properties, but the most salient is that it is nonnegative and bounded from above by the mutual information:

0 ≤ I0(A:B) ≤ I(A:B).
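To illustrate the definition and this bound, here is a small self-contained Python sketch (the helper names and the dictionary representation are assumptions of this illustration, not notation from the paper): for two bits that agree 99% of the time, the support graph is connected, so the common random variable is constant and I0 vanishes even though the mutual information is close to one bit.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy (bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(pab):
    """I(A:B) in bits for a joint dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in pab.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b])) for (a, b), p in pab.items() if p > 0)

def zero_error_information(pab):
    """I0(A:B) = H(A ⋏ B): entropy of the connected-component label of the
    support graph that links a and b whenever Pr(a, b) > 0."""
    parent = {}
    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    for (a, b), p in pab.items():
        if p > 0:
            parent[find(("a", a))] = find(("b", b))
    component = defaultdict(float)
    for (a, b), p in pab.items():
        if p > 0:
            component[find(("a", a))] += p
    return entropy(component)

# Two bits that agree 99% of the time: nearly one bit of mutual information,
# but the support graph is connected, so there is no zero-error information.
noisy = {(0, 0): 0.495, (1, 1): 0.495, (0, 1): 0.005, (1, 0): 0.005}
print(mutual_information(noisy))        # ~0.919 bits
print(zero_error_information(noisy))    # 0.0 bits
```

Deleting the two disagreement cells disconnects the support graph and I0 jumps to a full bit; this discontinuity resurfaces for I⋏ in Section 5.3.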
3.2. Intersection Information for Zero-Error Information
For the zero-error information case (i.e., 𝕀 = I0), we propose the zero-error intersection information, I⁰⋏, as the maximum zero-error information, I0(Q:Y), that a random variable, Q, conveys about Y, subject to Q being a function of each predictor X1, . . ., Xn:

I⁰⋏(X1, . . ., Xn : Y) ≡ max_Q I0(Q : Y), subject to Q ≼ Xi for all i = 1, . . ., n.    (2)
In Lemma 7 of Appendix C, it is shown that the common random variable across all predictors is the maximizing Q. This simplifies Equation (2) to:

I⁰⋏(X1, . . ., Xn : Y) = I0(X1 ⋏ · · · ⋏ Xn : Y) = H(X1 ⋏ · · · ⋏ Xn ⋏ Y).    (3)
Most importantly, I⁰⋏ satisfies nine of the 11 desired properties from Section 2.3, leaving only (LP0) and (LP1) unsatisfied. See Lemmas 1, 2 and 3 in Appendix A for details.
4. Candidate Intersection Information for Shannon Information
In the last section, we defined an intersection information for zero-error information that satisfies the vast majority of the desired properties. This is a solid start, but an intersection information for Shannon mutual information remains the goal. Towards this end, we use the same method as in Equation (2), leading to I⋏, our candidate intersection information measure for Shannon mutual information:

I⋏(X1, . . ., Xn : Y) ≡ max_Q I(Q : Y), subject to Q ≼ Xi for all i = 1, . . ., n.    (4)
In Lemma 8 of Appendix C, it is shown that Equation (4) simplifies to:

I⋏(X1, . . ., Xn : Y) = I(X1 ⋏ · · · ⋏ Xn : Y).    (5)
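For small, finite alphabets, Equations (3) and (5) can be evaluated directly. The sketch below is our own illustration (the function names, the dict-of-tuples representation and the component-labeling shortcut for the n-way meet are assumptions of this sketch, not constructs from the paper): values of the chosen variables are linked whenever they co-occur with positive probability, the connected-component label serves as their common random variable, and the two candidates are then its entropy jointly with Y (for I⁰⋏) or its mutual information with Y (for I⋏). The usage example is the RdnXor distribution of Section 5.2.

```python
import math
import itertools
from collections import defaultdict

def meet_labels(joint, coords):
    """Instance of the meet (⋏) of the variables at positions `coords` of each
    outcome tuple: values are linked whenever they co-occur with positive
    probability, and the connected-component label is the common random variable."""
    parent = {}
    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    for o, p in joint.items():
        if p > 0:
            first = (coords[0], o[coords[0]])
            for i in coords[1:]:
                parent[find(first)] = find((i, o[i]))
    return {o: find((coords[0], o[coords[0]])) for o, p in joint.items() if p > 0}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(pab):
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in pab.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b])) for (a, b), p in pab.items() if p > 0)

def shannon_meet_intersection(joint):
    """I⋏(X1, ..., Xn : Y) = I(X1 ⋏ ... ⋏ Xn : Y), as in Equation (5).
    `joint` maps (x1, ..., xn, y) tuples to probabilities."""
    n = len(next(iter(joint))) - 1
    q = meet_labels(joint, list(range(n)))
    pqy = defaultdict(float)
    for o, p in joint.items():
        if p > 0:
            pqy[(q[o], o[-1])] += p
    return mutual_information(pqy)

def zero_error_meet_intersection(joint):
    """I⁰⋏(X1, ..., Xn : Y) = H(X1 ⋏ ... ⋏ Xn ⋏ Y), as in Equation (3)."""
    m = len(next(iter(joint)))          # fold the target Y in as well
    q = meet_labels(joint, list(range(m)))
    pq = defaultdict(float)
    for o, p in joint.items():
        if p > 0:
            pq[q[o]] += p
    return entropy(pq)

# Example RdnXor (Section 5.2): X1 = (R, A), X2 = (R, B), Y = (R, A xor B)
# with R, A, B independent fair bits.  The meet of X1 and X2 is the shared
# bit R, and both candidates assign one bit of intersection information.
rdnxor = {((r, a), (r, b), (r, a ^ b)): 1 / 8
          for r, a, b in itertools.product((0, 1), repeat=3)}
print(shannon_meet_intersection(rdnxor))     # 1.0
print(zero_error_meet_intersection(rdnxor))  # 1.0
```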
Unfortunately, I⋏ does not satisfy as many of the desired properties as I⁰⋏ does. However, our candidate, I⋏, still satisfies six of the 11 properties (in addition to (SR)); most importantly, the coveted (TM) that, until now, had not been satisfied by any proposed measure. See Lemmas 4, 5 and 6 in Appendix B for details. Table 1 lists the desired properties satisfied by Imin, I⋏ and I⁰⋏. For reference, we also include Ired, the proposed measure from [3].
Lemma 9 in Appendix C, combined with the bound I0 ≤ I, allows a comparison of the three subject intersection information measures:

I⁰⋏(X1, . . ., Xn : Y) ≤ I⋏(X1, . . ., Xn : Y) ≤ Imin(X1, . . ., Xn : Y).
Despite not satisfying (LP0), I⋏ remains an important stepping-stone towards the ideal Shannon I∩. First, I⋏ captures what is inarguably redundant information (the common random variable); this makes I⋏ necessarily a lower bound on any reasonable redundancy measure. Second, it is the first proposal to satisfy target monotonicity. Lastly, I⋏ is the first measure to reach intuitive answers in many canonical situations, while also being generalizable to an arbitrary number of inputs.
5. Three Examples Comparing Imin and I⋏
Example Unq illustrates how Imin gives undesirable (some claim fatally so [2]) decompositions of redundant and synergistic information. Examples Unq and RdnXor illustrate I⋏’s successes, and example ImperfectRdn illustrates I⋏’s paramount deficiency. For each, we give the joint distribution Pr(x1, x2, y), a diagram and the decomposition derived from setting Imin or I⋏ as the I∩ measure. At each lattice junction, the left number is the I∩ value of that node, and the number in parentheses is the I∂ value (this is the same notation used in [4]). Readers unfamiliar with the n = 2 partial information lattice should consult [1]; in short, I∂ measures the magnitude of “new” information at a node of the lattice beyond what is available at the nodes below it. Specifically, the mutual information between the pair, X1 ⋎ X2, and Y decomposes into four terms:

I(X1 ⋎ X2 : Y) = I∂(X1, X2 : Y) + I∂(X1 : Y) + I∂(X2 : Y) + I∂(X1 ⋎ X2 : Y).    (6)
In order, the terms are the redundant information that X1 and X2 both provide about Y, the unique information that X1 provides about Y, the unique information that X2 provides about Y and, finally, the synergistic information that X1 and X2 jointly convey about Y. Each of these quantities can be written in terms of standard mutual information and the intersection information, I∩, as follows:

I∂(X1, X2 : Y) = I∩(X1, X2 : Y)
I∂(X1 : Y) = I(X1 : Y) − I∩(X1, X2 : Y)
I∂(X2 : Y) = I(X2 : Y) − I∩(X1, X2 : Y)
I∂(X1 ⋎ X2 : Y) = I(X1 ⋎ X2 : Y) − I(X1 : Y) − I(X2 : Y) + I∩(X1, X2 : Y).    (7)
These quantities occupy the bottom, left, right and top nodes of the lattice diagrams, respectively. Except for ImperfectRdn, measures I⋏ and I⁰⋏ reach the same decomposition for all presented examples.
5.1. Example Unq (Figure 1)
The desired decomposition for example Unq is two bits of unique information; X1 uniquely specifies one bit of Y, and X2 uniquely specifies the other bit of Y. The chief criticism of Imin in [2] was that Imin calculated one bit of redundancy and one bit of synergy for Unq (Figure 1c). We see that unlike Imin, I⋏ satisfyingly arrives at two bits of unique information. This is easily seen from the inequality

I⋏(X1, X2 : Y) = I(X1 ⋏ X2 : Y) ≤ H(X1 ⋏ X2) = I0(X1 : X2) ≤ I(X1 : X2).
Therefore, as I(X1 :X2) = 0, we have I⋏ (X1, X2 :Y) = 0 bits leading to I∂(X1 : Y) = 1 bit and I∂(X2 : Y ) = 1 bit (Figure 1d).
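This bookkeeping can be checked in a few lines. The helper below is our own illustration (the function and argument names are hypothetical); it evaluates the four terms of Equation (7) from the three mutual informations and a chosen I∩ value, here the Unq values with I∩ = I⋏ = 0:

```python
def pid_two_predictors(i_x1, i_x2, i_joint, i_cap):
    """Partial information terms of Equation (7) for n = 2 predictors.

    i_x1    = I(X1 : Y),      i_x2  = I(X2 : Y),
    i_joint = I(X1 ⋎ X2 : Y), i_cap = I∩(X1, X2 : Y)."""
    return {
        "redundant {X1}{X2}": i_cap,
        "unique {X1}": i_x1 - i_cap,
        "unique {X2}": i_x2 - i_cap,
        "synergistic {X1 X2}": i_joint - i_x1 - i_x2 + i_cap,
    }

# Example Unq: I(X1:Y) = I(X2:Y) = 1 bit, I(X1 ⋎ X2:Y) = 2 bits, and
# I⋏(X1, X2:Y) = 0, giving two bits of unique information (Figure 1d).
print(pid_two_predictors(1.0, 1.0, 2.0, i_cap=0.0))
# {'redundant {X1}{X2}': 0.0, 'unique {X1}': 1.0, 'unique {X2}': 1.0, 'synergistic {X1 X2}': 0.0}
```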
5.2. Example RdnXor (Figure 2)
In [2], RdnXor was an example where Imin shined by reaching the desired decomposition of one bit of redundancy and one bit of synergy. We see that I⋏ finds this same answer. I⋏ extracts the common random variable within X1 and X2—the r/R bit—and calculates the mutual information between the common random variable and Y to arrive at I⋏(X1, X2 :Y ) = 1 bit.
5.3. Example ImperfectRdn (Figure 3)
ImperfectRdn highlights the foremost shortcoming of I⋏: it does not detect “imperfect” or “lossy” correlations between X1 and X2. Given (LP0), we can determine the desired decomposition analytically. First, I(X1 ⋎ X2 : Y) = I(X1 : Y) = 1 bit, and thus, I(X2 : Y | X1) = 0 bits. Since this conditional mutual information is the sum of the synergy I∂(X1 ⋎ X2 : Y) and the unique information I∂(X2 : Y), both quantities must also be zero. Then, the redundant information I∂(X1, X2 : Y) = I(X2 : Y) − I∂(X2 : Y) = I(X2 : Y) = 0.99 bits. Having determined three of the four partial informations, we compute the final unique information: I∂(X1 : Y) = I(X1 : Y) − 0.99 = 0.01 bits.
How well do Imin and I⋏ match the desired decomposition of ImperfectRdn? We see that Imin calculates the desired decomposition (Figure 3c); however, I⋏ does not (Figure 3d). Instead, I⋏ calculates zero redundant information, i.e., I⋏(X1, X2 : Y) = 0 bits. This unpleasant answer arises because Pr(X1 = 0, X2 = 1) > 0. If this probability were zero, both I⋏ and Imin would reach the desired one bit of redundant information. Due to the nature of the common random variable, I⋏ sees only the “deterministic” correlations between X1 and X2; add even an iota of noise between X1 and X2, and I⋏ plummets to zero. This highlights the fact that I⋏ is not continuous: an arbitrarily small change in the probability distribution can produce a discontinuous jump in the value of I⋏. As with traditional information measures, such as the entropy and the mutual information, it may be desirable to have an I∩ measure that is continuous over the probability simplex.
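The failure can be reproduced numerically. Since Figure 3 is not reproduced here, the sketch below uses our own stand-in for an ImperfectRdn-type distribution (a fair bit Y, X1 = Y, and X2 equal to X1 except for a 1% flip), so the exact values differ slightly from the 0.99 bits quoted above; the qualitative conclusion, zero redundancy and negative synergy under I⋏, is the same.

```python
import math
from collections import defaultdict

def mi(joint, a, b):
    """I(A:B) in bits, where a and b pick coordinates out of each outcome tuple
    of the dict {outcome: prob} (e.g., a=(0,) and b=(2,), or a=(0, 1))."""
    pab, pa, pb = defaultdict(float), defaultdict(float), defaultdict(float)
    for o, p in joint.items():
        va, vb = tuple(o[i] for i in a), tuple(o[i] for i in b)
        pab[(va, vb)] += p
        pa[va] += p
        pb[vb] += p
    return sum(p * math.log2(p / (pa[va] * pb[vb]))
               for (va, vb), p in pab.items() if p > 0)

# An ImperfectRdn-style distribution (our stand-in for Figure 3, not the
# paper's exact table): Y is a fair bit, X1 = Y, and X2 copies X1 except
# that it is flipped to 0 on a small fraction of the X1 = 1 outcomes.
eps = 0.01
imperfect = {(0, 0, 0): 0.5, (1, 1, 1): 0.5 - eps, (1, 0, 1): eps}  # (x1, x2, y)
i_x1, i_x2 = mi(imperfect, (0,), (2,)), mi(imperfect, (1,), (2,))
i_joint = mi(imperfect, (0, 1), (2,))

# The (x1, x2) support {(0,0), (1,1), (1,0)} forms a connected graph, so
# X1 ⋏ X2 is constant and I⋏(X1, X2 : Y) = 0.
i_cap = 0.0
synergy = i_joint - i_x1 - i_x2 + i_cap
print(i_x1, i_x2, i_joint)   # 1.0, ~0.93, 1.0
print(synergy)               # ~ -0.93 bits: negative synergy, as in Section 6
```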
To summarize, ImperfectRdn shows that when there are additional “imperfect” correlations between A and B, i.e., I(A:B|A ⋏ B) > 0, I⋏ sometimes underestimates the ideal I∩(A, B:Y ).
6. Negative Synergy
In ImperfectRdn, we saw I⋏ calculate a synergy of −0.99 bits (Figure 3d). What does this mean? Could negative synergy be a “real” property of Shannon information? When n = 2, it is fairly easy to diagnose the cause of negative synergy from the expression for I∂(X1 ⋎ X2 : Y) in Equation (7). Given (GP), negative synergy occurs if and only if

I(X1 ⋎ X2 : Y) < I(X1 : Y) + I(X2 : Y) − I∩(X1, X2 : Y) ≡ I∪(X1, X2 : Y),    (8)
where I∪ is dual to I∩ and related to it by the inclusion-exclusion principle. For arbitrary n, this is I∪(X1, . . ., Xn : Y) ≡ Σ_{∅ ≠ S ⊆ {X1, . . ., Xn}} (−1)^(|S|+1) I∩(S1, . . ., S|S| : Y). The intuition behind I∪ is that it represents the aggregate information that the sources, X1, . . ., Xn, contribute about Y without considering synergies and without double-counting redundancies.
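A generic sketch of this inclusion-exclusion sum (our own illustration; the callable interface is an assumption) is shown below, checked against the n = 2 case with the ImperfectRdn values from Section 5.3:

```python
from itertools import combinations

def union_information(sources, i_cap):
    """I∪(X1, ..., Xn : Y) via inclusion-exclusion over nonempty subsets,
    where i_cap(subset) returns I∩(subset : Y) for a tuple of sources."""
    total = 0.0
    for size in range(1, len(sources) + 1):
        for subset in combinations(sources, size):
            total += (-1) ** (size + 1) * i_cap(subset)
    return total

# n = 2 check: I∪(X1, X2 : Y) = I(X1:Y) + I(X2:Y) - I∩(X1, X2:Y).
# With the ImperfectRdn numbers (I∩ = I⋏ = 0), I∪ exceeds I(X1 ⋎ X2 : Y),
# which is exactly the negative-synergy condition of Equation (8).
mi_single = {("X1",): 1.0, ("X2",): 0.99}
i_cap = lambda s: mi_single[s] if len(s) == 1 else 0.0
print(union_information(("X1", "X2"), i_cap))   # 1.99 > I(X1 ⋎ X2 : Y) = 1
```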
From Equation (8), we see that negative synergy occurs when I∩ is small, probably too small. Equivalently, negative synergy occurs when the joint random variable conveys less about Y than the sources, X1 and X2, convey separately; mathematically, when I(X1 ⋎ X2 : Y) < I∪(X1, X2 : Y). On the face of it, this sounds strange. No structure “disappears” when X1 and X2 are combined by the ⋎ operator: setting Z ≅ X1 ⋎ X2, there are always functions f1 and f2 such that X1 ≅ f1(Z) and X2 ≅ f2(Z). Therefore, if an I∩ measure does not satisfy (LP0), it is too strict.
This means that our measure, I⁰⋏, does not account for the full zero-error informational overlap between I0(X1 : Y) and I0(X2 : Y). This is shown by the example Subtle (Figure 4), where I⁰⋏ calculates a synergy of −0.252 bits. Defining a zero-error I∩ that satisfies (LP0) is a matter of ongoing research.
7. Conclusions and Path Forward
We made incremental progress on several fronts towards the ideal Shannon I∩.
7.1. Desired Properties
We have expanded, tightened and grounded the desired properties for I∩. Particularly,
(LB) highlights an uncontentious, yet tighter lower bound on I∩ than (GP).
Inspired by the fact that I∩(X1 : Y) = 𝕀(X1 : Y) and (M0) together imply (LB), we introduced (M1) as a desired property.
We introduced (Eq) to make explicit what was previously an implicit assumption and to better ground one’s thinking.
7.2. A New Measure
Based on the Gács–Körner common random variable, we introduced a new Shannon I∩ measure. Our measure, I⋏, is theoretically principled and the first to satisfy (TM). A point to keep in mind is that our intersection information is zero whenever the distribution Pr(x1, x2, y) has full support; this dependence on structural zeros is inherited from the common random variable.
7.3. How to Improve
We identified where I⋏ fails; it does not detect “imperfect” correlations between X1 and X2. One next step is to develop a less stringent I∩ measure that satisfies (LP0) for ImperfectRdn, while still satisfying (TM). Satisfying continuity would also be a good next step.
Contrary to our initial expectation, the example Subtle showed that I⁰⋏ does not satisfy (LP0). This matches a result from [4], which shows that (LP0), (S1), (M0) and (Id) cannot all be simultaneously satisfied, and it suggests that I⁰⋏ is too strict. What kind of zero-error informational overlap is I⁰⋏ not capturing? The answer is of paramount importance. The next step is to formalize exactly what is required of a zero-error I∩ to satisfy (LP0). From Subtle, we can likewise see that, within zero-error information, (Id) and (LP0) are incompatible.
Acknowledgments
Virgil Griffith thanks Tracey Ho, and Edwin K. P. Chong thanks Hua Li for valuable discussions. While intrepidly pulling back the veil of ignorance, Virgil Griffith was funded by a Department of Energy Computational Science Graduate Fellowship; Edwin K. P. Chong was funded by Colorado State University’s Information Science & Technology Center; Ryan G. James and James P. Crutchfield were funded by Army Research Office grant W911NF-12-1-0234; Christopher J. Ellison was funded by a subaward from the Santa Fe Institute under a grant from the John Templeton Foundation.
Appendix
Most of these proofs follow directly from the lattice properties and from the invariance and monotonicity properties with respect to ≅ and ≼.
A. Properties of I⁰⋏
Lemma 1
I⁰⋏(X1, . . ., Xn : Y) satisfies (GP), (Eq), (TM), (M0), and (S0), but not (LP0).
Proof
(GP) follows immediately from the nonnegativity of the entropy. (Eq) follows from the invariance of entropy within the equivalence classes induced by ≅. (TM) follows from the monotonicity of the entropy with respect to ≼. (M0) also follows from the monotonicity of the entropy, but now applied to ⋏iXi ⋏ W ⋏ Y ≼ ⋏iXi ⋏ Y. If there exists some j such that Xj ≼ W, then generalized absorption says that ⋏iXi ⋏ W ⋏ Y ≅ ⋏iXi ⋏ Y, and thus, we have the equality condition. (S0) is a consequence of the commutativity of the ⋏ operator. To see that (LP0) is not satisfied by I⁰⋏, we point to the example Subtle (Figure 4), which has negative synergy. One can also rewrite (LP0) as the supermodularity law for common information, which is known to be false in general (see [8], Section 5.4).
Lemma 2
I⁰⋏(X1, . . ., Xn : Y) satisfies (LB), (SR), and (Id).
Proof
For (LB), note that Q ≼ X1 ⋏ · · · ⋏ Xn for any Q obeying Q ≼ Xi for i = 1, . . ., n. Then, apply the monotonicity of the entropy. (SR) is trivially true given Lemma 7 and the definition of zero-error information. Finally, (Id) follows from the absorption law and the invariance of the entropy.
Lemma 3
I⁰⋏(X1, . . ., Xn : Y) satisfies (M1) and (S1), but not (LP1).
Proof
(M1) follows using the absorption and monotonicity of the entropy in nearly the same way that (M0) does. (S1) follows from commutativity, and (LP1) is false, because (LP0) is false.
B. Properties of I⋏
The proofs here are nearly identical to those used for I⁰⋏.
Lemma 4
I⋏ (X1, . . ., Xn :Y ) satisfies (GP), (Eq), (TM), (M0), and (S0), but not (LP0).
Proof
(GP) follows from the nonnegativity of the mutual information. (Eq) follows from the invariance of the mutual information within equivalence classes. (TM) follows from the data processing inequality. (M0) follows from applying the monotonicity of the mutual information I(Y : · ) to ⋏iXi ⋏ W ≼ ⋏iXi. If there exists some j such that Xj ≼ W, then generalized absorption says that ⋏iXi ⋏ W ≅ ⋏iXi, and thus, we have the equality condition. (S0) follows from commutativity, and a counterexample for (LP0) is given by ImperfectRdn (Figure 3).
Lemma 5
I⋏ (X1, . . ., Xn :Y ) satisfies (LB) and (SR), but not (Id).
Proof
For (LB), note that Q ≼ X1 ⋏ · · · ⋏ Xn for any Q obeying Q ≼ Xi for i = 1, . . ., n. Then, apply the monotonicity of the mutual information I(Y : · ). (SR) is trivially true given Lemma 8. Finally, (Id) does not hold: since X ⋏ Y ≼ X ⋎ Y, we have I⋏(X, Y : X ⋎ Y) = I(X ⋏ Y : X ⋎ Y) = H(X ⋏ Y), which is, in general, smaller than I(X : Y).
Lemma 6
I⋏ (X1, . . ., Xn :Y ) does not satisfy (M1), (S1), or (LP1).
Proof
(M1) is false due to a counterexample provided by ImperfectRdn (Figure 3), where I⋏(X2 : Y) = 0.99 bits, but I⋏(X2, Y : Y) = 0 bits. (S1) is false, since, in general, I⋏(X, X : Y) ≠ I⋏(X, Y : X). Finally, (LP1) is false, due to (LP0) being false.
C. Miscellaneous Results
Lemma 7
Simplification of I⁰⋏: I⁰⋏(X1, . . ., Xn : Y) = I0(X1 ⋏ · · · ⋏ Xn : Y) = H(X1 ⋏ · · · ⋏ Xn ⋏ Y).
Proof
Recall that I0(Q:Y ) ≡ H(Q ⋏ Y ), and note that ⋏iXi is a valid choice for Q. By definition, ⋏iXi is the richest possible Q, and so, monotonicity with respect to ≼ then guarantees that H( ⋏iXi ⋏ Y ) ≥ H(Q ⋏ Y ).
Lemma 8
Simplification of I⋏: I⋏(X1, . . ., Xn : Y) = I(X1 ⋏ · · · ⋏ Xn : Y).
Proof
Note that ⋏iXi is a valid choice for Q. By definition, ⋏iXi is the richest possible Q, and so, monotonicity with respect to ≼ then guarantees that I(Q:Y ) ≤ I(⋏iXi :Y ).
Lemma 9
I⋏ (X1, . . ., Xn :Y ) ≤ Imin (X1, . . ., Xn : Y )
Proof
We need only show that I(⋏iXi : Y) ≤ Imin(X1, . . ., Xn : Y). This can be restated in terms of the specific information: I(⋏iXi : y) ≤ min_i I(Xi : y) for each y. Since the specific information is monotone with respect to ≼ (cf. Section 2.2 or [8]), it follows that I(⋏iXi : y) ≤ I(Xj : y) for every j.
Conflicts of Interest
The authors declare no conflicts of interest.
Author Contributions
Each of the authors contributed to the design, analysis, and writing of the study.
References
- Williams, P.L.; Beer, R.D. Nonnegative Decomposition of Multivariate Information. 2010; arXiv:1004.2515 [cs.IT].
- Griffith, V.; Koch, C. Quantifying Synergistic Mutual Information. In Guided Self-Organization: Inception; Prokopenko, M., Ed.; Springer: Berlin, Germany, 2014.
- Harder, M.; Salge, C.; Polani, D. Bivariate measure of redundant information. Phys. Rev. E 2013, 87, 012130.
- Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J. Shared Information—New Insights and Problems in Decomposing Information in Complex Systems. In Proceedings of the European Conference on Complex Systems 2012, Brussels, Belgium, 3–7 September 2012; Gilbert, T., Kirkilionis, M., Nicolis, G., Eds.; Springer Proceedings in Complexity; Springer: Berlin, Germany, 2013; pp. 251–269.
- Lizier, J.; Flecker, B.; Williams, P. Towards a synergy-based approach to measuring information modification. In Proceedings of the 2013 IEEE Symposium on Artificial Life (ALIFE), Singapore, 16–17 April 2013; pp. 43–51.
- Gács, P.; Körner, J. Common information is far less than mutual information. Probl. Control Inform. Theory 1973, 2, 149–162.
- Wolf, S.; Wullschleger, J. Zero-error information and applications in cryptography. Proc. IEEE Inform. Theory Workshop 2004, 4, 1–6.
- Li, H.; Chong, E.K.P. On a Connection between Information and Group Lattices. Entropy 2011, 13, 683–708.
- Wyner, A.D. The common information of two dependent random variables. IEEE Trans. Inform. Theory 1975, 21, 163–179.
- Cerf, N.J.; Adami, C. Negative Entropy and Information in Quantum Mechanics. Phys. Rev. Lett. 1997, 79, 5194–5197.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley: New York, NY, USA, 1991.
Property | Imin | Ired | I⋏ | I⁰⋏
---|---|---|---|---
(GP) Global Positivity | ✓ | ✓ | ✓ | ✓
(Eq) Equivalence-Class Invariance | ✓ | ✓ | ✓ | ✓
(TM) Target Monotonicity | | | ✓ | ✓
(M0) Weak Monotonicity | ✓ | | ✓ | ✓
(S0) Weak Symmetry | ✓ | ✓ | ✓ | ✓
(LB) Lower Bound | ✓ | ✓ | ✓ | ✓
(Id) Identity | | ✓ | | ✓
(LP0) Weak Local Positivity | ✓ | ✓ | |
(M1) Strong Monotonicity | | | | ✓
(S1) Strong Symmetry | | | | ✓
(LP1) Strong Local Positivity | ✓ | | |
© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).