1. Introduction
Let $X^n = (X_1, \ldots, X_n)$ be an independent and identically distributed (i.i.d.) uniform Bernoulli source and $Y^n$ be the output of $X^n$ through a memoryless binary symmetric channel with crossover probability $\alpha$. Recently, Courtade and Kumar conjectured that the most informative Boolean function is a dictator function.
Conjecture 1 ([1]). For any Boolean function $f : \{0,1\}^n \to \{0,1\}$, we have:
$$I(f(X^n); Y^n) \le 1 - H(\alpha),$$
where the maximum is achieved by a dictator function, i.e., $f(x^n) = x_i$ for some $i \in \{1, \ldots, n\}$.
Note that $H(\cdot)$ is the binary entropy function. Although there has been some progress in this line of work [2,3], this simple conjecture still remains open. There are also a number of variations of this conjecture. Weinberger and Shayevitz [4] considered the optimal Boolean function under quadratic loss. Huleihel and Ordentlich [5] considered the complementary case of quantizing the channel outputs to $n-1$ bits and showed that $I(X^n; g(Y^n)) \le (n-1)(1 - H(\alpha))$ for all $g : \{0,1\}^n \to \{0,1\}^{n-1}$. Nazer et al. focused on information distilling quantizers [6], which can be seen as a generalized version of the above problem.
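As a quick numerical illustration of Conjecture 1, the following Python sketch computes $I(f(X^n); Y^n)$ by brute force for $n = 3$ and compares a dictator function with the majority function; the helper names channel_dist and mutual_information are illustrative and not part of the development that follows.

```python
import itertools
import numpy as np

def channel_dist(x, alpha):
    # P_{Y^n | X^n = x} for a memoryless BSC(alpha), as a vector over {0,1}^n.
    ys = itertools.product([0, 1], repeat=len(x))
    return np.array([np.prod([(1 - alpha) if xi == yi else alpha
                              for xi, yi in zip(x, y)]) for y in ys])

def mutual_information(f, n, alpha):
    # I(f(X^n); Y^n) in bits, with X^n i.i.d. Bern(1/2).
    joint = {}  # P(f(X^n) = b, Y^n = y), accumulated over x
    for x in itertools.product([0, 1], repeat=n):
        b = f(x)
        joint[b] = joint.get(b, 0) + channel_dist(x, alpha) / 2 ** n
    p_y = sum(joint.values())
    return sum(np.sum(row * np.log2(row / (row.sum() * p_y)))
               for row in joint.values())

n, alpha = 3, 0.1
h = -alpha * np.log2(alpha) - (1 - alpha) * np.log2(1 - alpha)
dictator = lambda x: x[0]
majority = lambda x: int(sum(x) >= 2)
print(mutual_information(dictator, n, alpha), 1 - h)  # both equal 1 - H(alpha)
print(mutual_information(majority, n, alpha))         # smaller for this alpha
```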
Many of these works are based on Fourier-analytic techniques, including the original paper [1]. In this paper, we suggest an alternative approach, namely the information geometric approach. The mutual information can naturally be expressed in terms of Kullback–Leibler (KL) divergences. Thus, it can be shown that maximizing the mutual information is equivalent to clustering probability measures under the KL divergence.
In the equivalent clustering problem, the center of a cluster is the arithmetic mean of measures. We also describe the role of the geometric mean of measures (with appropriate normalization) in this setting. To the best of our knowledge, the geometric mean of measures has received less attention in the literature. We propose an equivalent formulation of the conjecture using the geometric mean of measures. Note that the geometric mean also allows us to connect Conjecture 1 to another well-known clustering problem.
The rest of the paper is organized as follows. In Section 2, we briefly review the Jensen–Shannon divergence and $i$-compressedness. In Section 3, we provide an equivalent clustering problem of probability measures. We introduce the geometric mean of measures in Section 4. We conclude this paper in Section 5.
Notations
$\mathcal{X}$ denotes the alphabet set of random variable X, and $\mathcal{P}(\mathcal{X})$ denotes the set of measures on $\mathcal{X}$. $X^n$ denotes a random vector $(X_1, X_2, \ldots, X_n)$, while $x^n$ denotes a specific realization of it. If it is clear from the context, $P_{Y^n|x^n}$ denotes the conditional distribution of $Y^n$ given $X^n = x^n$, i.e., $P_{Y^n|X^n = x^n}$. Similarly, $P_{Y^n|A}$ denotes the conditional distribution of $Y^n$ given $X^n \in A$, i.e., $P_{Y^n|X^n \in A}$. Let $\{0,1\}^n$ be the set of all binary sequences of length n. For $A \subseteq \{0,1\}^n$ and $z^n \in \{0,1\}^n$, the shifted version of A is denoted by $A \oplus z^n = \{x^n \oplus z^n : x^n \in A\}$, where ⊕ is an element-wise XOR operator. The arithmetic mean of measures in the set $\{P_{Y^n|x^n} : x^n \in A\}$ is denoted by $P_{Y^n|A}$. For $1 \le i \le n$, let $A_{i,0}$ be the set of elements in A that satisfy $x_i = 0$, i.e., $A_{i,0} = \{x^n \in A : x_i = 0\}$, and $A_{i,1}$ is defined in a similar manner. A length n binary vector $e_i$ denotes the vector with $x_i = 1$ and $x_j = 0$ for all $j \neq i$.
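For concreteness, the conditional measures and their arithmetic mean can be generated numerically as in the following minimal Python sketch; the names cond_dist and P_A are illustrative rather than notation used in the paper.

```python
import itertools
import numpy as np

alpha, n = 0.1, 3
ys = list(itertools.product([0, 1], repeat=n))

def cond_dist(x):
    # P_{Y^n|x^n}: a memoryless BSC(alpha) applied coordinate-wise to x^n.
    return np.array([np.prod([(1 - alpha) if xi == yi else alpha
                              for xi, yi in zip(x, y)]) for y in ys])

A = [x for x in itertools.product([0, 1], repeat=n) if x[0] == 0]  # {x^n : x_1 = 0}
# Arithmetic mean of measures in A; under uniform X^n this is the
# conditional distribution of Y^n given X^n in A.
P_A = sum(cond_dist(x) for x in A) / len(A)
print(np.isclose(P_A.sum(), 1.0))   # True: P_A is a probability measure
```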
2. Preliminaries
2.1. Jensen–Shannon Divergence
For $P_1, P_2 \in \mathcal{P}(\mathcal{X})$ and weights $\pi_1, \pi_2 \ge 0$ such that $\pi_1 + \pi_2 = 1$, the Jensen–Shannon (JS) divergence of two measures $P_1$ and $P_2$ is defined as:
$$\mathrm{JS}_{\pi}(P_1, P_2) = H(\pi_1 P_1 + \pi_2 P_2) - \pi_1 H(P_1) - \pi_2 H(P_2).$$
It is not hard to show that the following definition is equivalent:
$$\mathrm{JS}_{\pi}(P_1, P_2) = \pi_1 D(P_1 \| \bar{P}) + \pi_2 D(P_2 \| \bar{P}), \qquad (3)$$
where $\bar{P} = \pi_1 P_1 + \pi_2 P_2$.
Lin proposed a generalized JS divergence [7]:
$$\mathrm{JS}_{\pi}(P_1, \ldots, P_k) = H\Big(\sum_{i=1}^{k} \pi_i P_i\Big) - \sum_{i=1}^{k} \pi_i H(P_i),$$
where $\pi = (\pi_1, \ldots, \pi_k)$ is a weight vector such that $\sum_{i=1}^{k} \pi_i = 1$. Similar to Equation (3), it has an equivalent definition:
$$\mathrm{JS}_{\pi}(P_1, \ldots, P_k) = \sum_{i=1}^{k} \pi_i D(P_i \| \bar{P}),$$
where $\bar{P} = \sum_{i=1}^{k} \pi_i P_i$.
Topsøe [8] pointed out an interesting property, the so-called compensation identity. It states that for any distribution Q,
$$\sum_{i=1}^{k} \pi_i D(P_i \| Q) = \sum_{i=1}^{k} \pi_i D(P_i \| \bar{P}) + D(\bar{P} \| Q). \qquad (6)$$
Throughout the paper, we often use Equation (6) directly without the notion of the JS divergence.
Remark 1. The generalized JS divergence can be expressed as a mutual information involving the mixture distribution. Let Z be a random variable that takes the value $i \in \{1, \ldots, k\}$ with probability $\pi_i$, and let $X \sim P_i$ conditioned on $Z = i$. Then, it is not hard to show that:
$$\mathrm{JS}_{\pi}(P_1, \ldots, P_k) = I(X; Z).$$
However, we introduced the generalized JS divergence to emphasize the information geometric perspective of our problem.
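Both the equivalence of the two definitions and the compensation identity (6) are easy to check numerically. The following Python sketch does so on a small alphabet with randomly drawn measures; it is only a sanity check, with illustrative helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

def entropy(p):
    return float(-np.sum(p * np.log2(p)))

P = rng.dirichlet(np.ones(4), size=3)           # three measures on a 4-letter alphabet
pi = rng.dirichlet(np.ones(3))                  # weights summing to one
mix = pi @ P                                    # arithmetic mean (mixture)

# two equivalent forms of the generalized JS divergence
js_kl = sum(w * kl(p, mix) for w, p in zip(pi, P))
js_ent = entropy(mix) - sum(w * entropy(p) for w, p in zip(pi, P))
print(np.isclose(js_kl, js_ent))                # True

# compensation identity (6): sum_i pi_i D(P_i || Q) = JS + D(mix || Q) for any Q
Q = rng.dirichlet(np.ones(4))
lhs = sum(w * kl(p, Q) for w, p in zip(pi, P))
print(np.isclose(lhs, js_kl + kl(mix, Q)))      # True
```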
2.2. $i$-Compressed Sets
Let A be a subset of $\{0,1\}^n$ and $\{1, 2, \ldots, n\}$ be the set of indexes. For $i \in \{1, \ldots, n\}$, the $i$-section of A is defined as:
$$A_i(z) = \big\{ x_i : x^n \in A,\ (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = z \big\}, \quad z \in \{0,1\}^{n-1}.$$
The set A is called $i$-compressed if $A_i(z)$ is an initial segment of the lexicographical ordering for all $z \in \{0,1\}^{n-1}$. For example, if A is $i$-compressed for some $i$, then $A_i(z)$ should be one of:
$$\emptyset, \quad \{0\}, \quad \{0, 1\}.$$
It simply says that if $x^n \in A$ and $x_i = 1$, then $x^n \oplus e_i \in A$.
Courtade and Kumar showed that it is enough to consider $i$-compressed sets.
Theorem 1 ([1]). Let $\mathcal{F}_n$ be the set of functions $f$ for which $\{x^n : f(x^n) = 1\}$ is $i$-compressed for all $i$ with $1 \le i \le n$. In maximizing $I(f(X^n); Y^n)$, it is sufficient to consider functions $f \in \mathcal{F}_n$.
In this paper, we often restrict our attention to functions in the set $\mathcal{F}_n$.
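The coordinate-wise reading of $i$-compressedness given above is straightforward to test in code. The sketch below, with the illustrative helper is_i_compressed, checks it for the compressed dictator set $\{x^n : x_1 = 0\}$ and for a set that violates the condition.

```python
import itertools

def is_i_compressed(A, i):
    # If x^n in A has x_i = 1, the vector with that bit flipped to 0 must be in A.
    return all(x[:i] + (0,) + x[i + 1:] in A for x in A if x[i] == 1)

n = 3
dictator = {x for x in itertools.product([0, 1], repeat=n) if x[0] == 0}
print(all(is_i_compressed(dictator, i) for i in range(n)))   # True

B = {(0, 0, 0), (1, 1, 0)}   # (0, 1, 0) is missing, so B is not compressed
print(all(is_i_compressed(B, i) for i in range(n)))          # False
```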
3. Approach via Clustering
In this section, we provide an interesting approach toward Conjecture 1 via clustering. More precisely, we formulate an equivalent clustering problem.
3.1. Equivalence to Clustering
The following theorem establishes the relation between the original conjecture and a clustering problem.
Theorem 2. Let $f : \{0,1\}^n \to \mathcal{Z}$ and $Z = f(X^n)$ be an induced random variable. Then,
$$I(Z; Y^n) = I(X^n; Y^n) - \mathbb{E}\big[ D\big(P_{Y^n|X^n} \,\big\|\, P_{Y^n|Z}\big) \big].$$
The proof of the theorem is provided in Appendix A. Note that:
$$\mathbb{E}\big[ D\big(P_{Y^n|X^n} \,\big\|\, P_{Y^n|Z}\big) \big] = \sum_{x^n} P_{X^n}(x^n)\, D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|f(x^n)}\big),$$
which is a weighted mean of $D(P_{Y^n|x^n} \| P_{Y^n|f(x^n)})$ for $x^n \in \{0,1\}^n$. The KL divergence $D(P_{Y^n|x^n} \| P_{Y^n|f(x^n)})$ is a distance from each element to the cluster center. This implies that maximizing $I(f(X^n); Y^n)$ is equivalent to clustering the measures $\{P_{Y^n|x^n} : x^n \in \{0,1\}^n\}$ under KL divergence. Since the KL divergence is a Bregman divergence, all clusters are separated by a hyperplane [9].
In this paper, we focus on Boolean functions $f : \{0,1\}^n \to \{0,1\}$, where $X^n$ is i.i.d. Bern(1/2).
Corollary 1. Let $f : \{0,1\}^n \to \{0,1\}$ and $B = f(X^n)$ be a binary random variable. The equivalent clustering problem is minimizing:
$$\mathbb{E}\big[ D\big(P_{Y^n|X^n} \,\big\|\, P_{Y^n|B}\big) \big] = \frac{1}{2^n} \sum_{x^n \in \{0,1\}^n} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|f(x^n)}\big).$$
Let $A = \{x^n : f(x^n) = 1\}$; then we can simplify the objective further. The cluster center $P_{Y^n|A}$ is the arithmetic mean of measures in the set $\{P_{Y^n|x^n} : x^n \in A\}$. Then, we have:
$$\mathbb{E}\big[ D\big(P_{Y^n|X^n} \,\big\|\, P_{Y^n|B}\big) \big] = \frac{1}{2^n} \Big( \sum_{x^n \in A} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|A}\big) + \sum_{x^n \in A^c} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|A^c}\big) \Big).$$
For simplicity, let:
$$D(A) = \sum_{x^n \in A} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|A}\big),$$
which is the sum of distances from each element in A to the cluster center. In short, finding the most informative Boolean function f is equivalent to finding the set $A \subseteq \{0,1\}^n$ that minimizes $D(A) + D(A^c)$.
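The decomposition behind this equivalence can be verified numerically: combining the compensation identity with $I(X^n; Y^n) = n(1 - H(\alpha))$ gives $I(f(X^n); Y^n) = n(1 - H(\alpha)) - 2^{-n}\big(D(A) + D(A^c)\big)$. The Python sketch below checks this for $n = 3$ and the majority set; the helper names are illustrative.

```python
import itertools
import numpy as np

n, alpha = 3, 0.1
xs = list(itertools.product([0, 1], repeat=n))

def channel_dist(x):
    return np.array([np.prod([(1 - alpha) if xi == yi else alpha
                              for xi, yi in zip(x, y)]) for y in xs])

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

P = {x: channel_dist(x) for x in xs}
p_y = sum(P.values()) / 2 ** n                   # marginal of Y^n

def cluster_cost(S):
    # D(S): sum of KL divergences from each member to the arithmetic mean of S
    center = sum(P[x] for x in S) / len(S)
    return sum(kl(P[x], center) for x in S)

def mutual_info(A):
    # I(1{X^n in A}; Y^n), computed as sum_b P(b) D(P_{Y^n|b} || P_{Y^n})
    mi = 0.0
    for S in (A, [x for x in xs if x not in A]):
        mi += (len(S) / 2 ** n) * kl(sum(P[x] for x in S) / len(S), p_y)
    return mi

h = -alpha * np.log2(alpha) - (1 - alpha) * np.log2(1 - alpha)
A = [x for x in xs if sum(x) >= 2]               # majority set, as an example
Ac = [x for x in xs if x not in A]
lhs = mutual_info(A)
rhs = n * (1 - h) - (cluster_cost(A) + cluster_cost(Ac)) / 2 ** n
print(np.isclose(lhs, rhs))                      # True
```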
Remark 2. Conjecture 1 implies that $A = \{x^n : x_i = 0\}$ minimizes (19). Furthermore, Theorem 1 implies that it is enough to consider A that are $i$-compressed for all i.
For any $A \subseteq \{0,1\}^n$, Equation (6) implies that:
$$\sum_{x^n \in A} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n}\big) = D(A) + |A| \, D\big(P_{Y^n|A} \,\big\|\, P_{Y^n}\big).$$
Note that $\sum_{x^n \in \{0,1\}^n} D(P_{Y^n|x^n} \| P_{Y^n})$ does not depend on A, and therefore, we have the following theorem.
Theorem 3. For any $\alpha \in (0, 1/2)$, minimizing $D(A) + D(A^c)$ over $A \subseteq \{0,1\}^n$ is equivalent to maximizing:
$$|A| \, D\big(P_{Y^n|A} \,\big\|\, P_{Y^n}\big) + |A^c| \, D\big(P_{Y^n|A^c} \,\big\|\, P_{Y^n}\big).$$
The above theorem provides an alternative problem formulation of the original conjecture.
3.2. Connection to Clustering under Hamming Distance
In this section, we consider the duality between the above clustering problem under the KL divergence and clustering on $\{0,1\}^n$ under the Hamming distance. The following theorem shows that the KL divergence on $\{P_{Y^n|x^n} : x^n \in \{0,1\}^n\}$ corresponds to the Hamming distance on $\{0,1\}^n$.
Theorem 4. For all $x^n, \tilde{x}^n \in \{0,1\}^n$, we have:
$$D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|\tilde{x}^n}\big) = d_H(x^n, \tilde{x}^n) \, D\big(\mathrm{Bern}(\alpha) \,\big\|\, \mathrm{Bern}(1-\alpha)\big),$$
where $d_H(x^n, \tilde{x}^n)$ denotes the Hamming distance between $x^n$ and $\tilde{x}^n$.
This theorem implies that the distance between two measures $P_{Y^n|x^n}$ and $P_{Y^n|\tilde{x}^n}$ is proportional to the Hamming distance between the two binary vectors $x^n$ and $\tilde{x}^n$. The proof of the theorem is provided in Appendix B. Note that the KL divergence $D(P_{Y^n|x^n} \| P_{Y^n|\tilde{x}^n})$ is symmetric on this family of measures.
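In bits, the per-coordinate constant is $D(\mathrm{Bern}(\alpha) \| \mathrm{Bern}(1-\alpha)) = (1 - 2\alpha)\log_2\frac{1-\alpha}{\alpha}$, and the relation is easy to confirm numerically, as in the following Python sketch (helper names are illustrative).

```python
import itertools
import numpy as np

n, alpha = 4, 0.2
ys = list(itertools.product([0, 1], repeat=n))

def channel_dist(x):
    return np.array([np.prod([(1 - alpha) if xi == yi else alpha
                              for xi, yi in zip(x, y)]) for y in ys])

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

c = (1 - 2 * alpha) * np.log2((1 - alpha) / alpha)   # D(Bern(a) || Bern(1-a)) in bits
rng = np.random.default_rng(1)
for _ in range(5):
    x1, x2 = (tuple(rng.integers(0, 2, n)) for _ in range(2))
    d_h = sum(a != b for a, b in zip(x1, x2))
    print(np.isclose(kl(channel_dist(x1), channel_dist(x2)), c * d_h))  # True
```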
In the above duality, we have a mapping between $\{0,1\}^n$ and $\{P_{Y^n|x^n} : x^n \in \{0,1\}^n\}$; more precisely, $x^n \mapsto P_{Y^n|x^n}$. This mapping naturally suggests an equivalent clustering problem of n-dimensional binary vectors. However, the cluster center $P_{Y^n|A}$ is not an element of $\{P_{Y^n|x^n} : x^n \in \{0,1\}^n\}$ in general. In order to formulate an equivalent clustering problem, we need to answer the question “Which n-dimensional vector corresponds to $P_{Y^n|A}$?”. A naive approach is to extend the set of binary vectors to $[0,1]^n$ with a corresponding vector distance instead of the Hamming distance. In such a case, the goal is to map $P_{Y^n|A}$ to the arithmetic mean of the binary vectors in the set A. If this were true, we could further simplify the problem into the problem of clustering the hypercube in $\mathbb{R}^n$. However, the following example shows that this naive extension is not valid.
Example 1. One can choose two sets of binary vectors A and B such that the arithmetic mean of the binary vectors in A and that in B are the same; however, $P_{Y^n|A}$ is not equal to $P_{Y^n|B}$.
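Since the specific sets used in Example 1 are not reproduced here, the following Python sketch illustrates the same phenomenon with two sets chosen purely for illustration.

```python
import itertools
import numpy as np

alpha, n = 0.1, 3
ys = list(itertools.product([0, 1], repeat=n))

def channel_dist(x):
    return np.array([np.prod([(1 - alpha) if xi == yi else alpha
                              for xi, yi in zip(x, y)]) for y in ys])

A = [(0, 0, 0), (1, 1, 0)]      # illustrative sets, not the ones from Example 1
B = [(1, 0, 0), (0, 1, 0)]
print(np.allclose(np.mean(A, axis=0), np.mean(B, axis=0)))   # True: same vector mean
mix_A = sum(channel_dist(x) for x in A) / len(A)
mix_B = sum(channel_dist(x) for x in B) / len(B)
print(np.allclose(mix_A, mix_B))                             # False: different measures
```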
Furthermore, the conjectured optimal set is not the optimum choice when clustering the hypercube under such a vector distance. Instead, we need to consider the set of measures directly. The following theorem provides a bit of geometric structure among the measures.
Theorem 5. For all and ,where . The proof of the theorem is provided in
Appendix C. Since
for all
, Theorem 5 immediately implies the following corollary.
Corollary 2. For all and ,where is a convex hull of measures in the set A. This is a triangle inequality that can be useful when we consider the clustering problem of measures.
4. Geometric Mean of Measures
In the previous section, we formulated a clustering problem that is equivalent to the original problem of maximizing mutual information. In this section, we provide another approach using a geometric mean of measures. We define the geometric mean of measures formally and derive a nontrivial conjecture that is equivalent to Conjecture 1.
4.1. Definition of the Geometric Mean of Measures
For measures $P_1, \ldots, P_k \in \mathcal{P}(\mathcal{X})$ and weights $\pi_1, \ldots, \pi_k \ge 0$ such that $\sum_{i=1}^{k} \pi_i = 1$, we considered the sum of KL divergences in (6):
$$\sum_{i=1}^{k} \pi_i D(P_i \| Q). \qquad (28)$$
We also observed that (28) is minimized when Q is the arithmetic mean of the measures.
Since the KL divergence is asymmetric, it is natural to consider the sum of KL divergences in the other direction:
$$\sum_{i=1}^{k} \pi_i D(Q \| P_i).$$
Compared to the arithmetic mean that minimizes (28), the weighted product $\prod_{i=1}^{k} P_i^{\pi_i}$ can be considered as a geometric mean of measures. However, $\prod_{i=1}^{k} P_i^{\pi_i}$ is not a measure in general, and normalization is required. With a normalizing constant s, we can define the geometric mean of measures by:
$$P_G(x) = \frac{1}{s} \prod_{i=1}^{k} P_i(x)^{\pi_i},$$
where s is a constant so that $\sum_{x \in \mathcal{X}} P_G(x) = 1$, i.e.,
$$s = \sum_{x \in \mathcal{X}} \prod_{i=1}^{k} P_i(x)^{\pi_i}.$$
Then, we have:
$$\sum_{i=1}^{k} \pi_i D(Q \| P_i) = D(Q \| P_G) - \log s,$$
which is minimized when $Q = P_G$. Thus, for all Q,
$$\sum_{i=1}^{k} \pi_i D(Q \| P_i) = D(Q \| P_G) + \sum_{i=1}^{k} \pi_i D(P_G \| P_i).$$
The above result provides a geometric compensation identity. This also implies that $s \le 1$.
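The geometric compensation identity can be checked numerically in the same way as (6). The following Python sketch verifies, for randomly drawn measures, that $\sum_i \pi_i D(Q \| P_i) = D(Q \| P_G) - \log s$ and that $s \le 1$ (all logarithms in bits); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

P = rng.dirichlet(np.ones(5), size=3)            # three measures on a 5-letter alphabet
pi = rng.dirichlet(np.ones(3))                   # weights summing to one

unnorm = np.prod(P ** pi[:, None], axis=0)       # pointwise weighted geometric mean
s = unnorm.sum()                                 # normalizing constant
P_geo = unnorm / s                               # normalized geometric mean of measures

# geometric compensation identity: sum_i pi_i D(Q || P_i) = D(Q || P_geo) - log2(s),
# so the left-hand side is minimized at Q = P_geo with value -log2(s) >= 0.
Q = rng.dirichlet(np.ones(5))
lhs = sum(w * kl(Q, p) for w, p in zip(pi, P))
print(np.isclose(lhs, kl(Q, P_geo) - np.log2(s)))   # True
print(s <= 1)                                        # True
```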
Remark 3. If $k = 2$ and $(\pi_1, \pi_2) = (\alpha, 1 - \alpha)$, s is called the α-Chernoff coefficient, and it is called the Bhattacharyya coefficient when $\alpha = 1/2$. The summation $\sum_{i=1}^{k} \pi_i D(P_G \| P_i) = -\log s$ is known as the α-Chernoff divergence. For more details, please see [10,11] and the references therein. Under this definition, we can find the geometric mean of measures in the set $\{P_{Y^n|x^n} : x^n \in A\}$ with uniform weights $1/|A|$ by:
$$P^{(G)}_{Y^n|A}(y^n) = \frac{1}{s_A} \prod_{x^n \in A} P_{Y^n|x^n}(y^n)^{1/|A|},$$
where:
$$s_A = \sum_{y^n \in \{0,1\}^n} \prod_{x^n \in A} P_{Y^n|x^n}(y^n)^{1/|A|}.$$
Remark 4. The original conjecture is that a Boolean function f with $f(x^n) = x_i$ maximizes the mutual information $I(f(X^n); Y^n)$. The geometric mean of measures indexed by the corresponding set satisfies the following property. Note that the geometric mean of measures in the set $\{P_{Y^n|x^n} : x_i = 0\}$ satisfies:
$$P^{(G)}_{Y^n|A} = P_{Y^n|A}.$$
4.2. Main Results
So far, we have seen two means of measures, $P_{Y^n|A}$ and $P^{(G)}_{Y^n|A}$. It is natural to ask if they are equal. Our main theorem provides a connection to Conjecture 1.
Theorem 6. Suppose A is a nontrivial subset of $\{0,1\}^n$ (i.e., $A \neq \emptyset$ and $A \neq \{0,1\}^n$), and A is $i$-compressed for all $i$. Then, $A = \{x^n : x_i = 0\}$ for some i if and only if $P_{Y^n|A} = P^{(G)}_{Y^n|A}$ and $P_{Y^n|A^c} = P^{(G)}_{Y^n|A^c}$.
The proof of the theorem is provided in Appendix D. Theorem 6 implies that the following conjecture is equivalent to Conjecture 1.
Conjecture 2. Let $f$ be a Boolean function such that $A = \{x^n : f(x^n) = 1\}$ is nontrivial and $i$-compressed for all $i$. Then, $I(f(X^n); Y^n)$ is maximized if and only if $P_{Y^n|A} = P^{(G)}_{Y^n|A}$ and $P_{Y^n|A^c} = P^{(G)}_{Y^n|A^c}$.
Remark 5. One of the main challenges of this problem is that the conjectured optimal sets are extremes, i.e., $A = \{x^n : x_i = 0\}$ for some i. Our main theorem provides an alternative conjecture that seems more natural in the context of optimization.
Remark 6. It is clear that $P_{Y^n|A} = P^{(G)}_{Y^n|A}$ holds if $|A| = 1$. Thus, both conditions $P_{Y^n|A} = P^{(G)}_{Y^n|A}$ and $P_{Y^n|A^c} = P^{(G)}_{Y^n|A^c}$ are needed to guarantee $A = \{x^n : x_i = 0\}$ for some i.
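Theorem 6 can be probed numerically for small n. The Python sketch below compares the arithmetic and geometric means of the measures indexed by a compressed dictator set and by another compressed set; for the dictator set (and, by symmetry, its complement) the two means coincide, while for the other set they do not. The helper names are illustrative.

```python
import itertools
import numpy as np

n, alpha = 3, 0.1
xs = list(itertools.product([0, 1], repeat=n))

def channel_dist(x):
    return np.array([np.prod([(1 - alpha) if xi == yi else alpha
                              for xi, yi in zip(x, y)]) for y in xs])

def arith_mean(A):
    return sum(channel_dist(x) for x in A) / len(A)

def geo_mean(A):
    g = np.prod([channel_dist(x) ** (1 / len(A)) for x in A], axis=0)
    return g / g.sum()

dictator = [x for x in xs if x[0] == 0]          # A = {x^n : x_1 = 0}
low_weight = [x for x in xs if sum(x) <= 1]      # another compressed set

print(np.allclose(arith_mean(dictator), geo_mean(dictator)))       # True
print(np.allclose(arith_mean(low_weight), geo_mean(low_weight)))   # False
```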
4.3. Property of the Geometric Mean
We can derive a new identity by combining the original and the geometric compensation identities. For $A, B \subseteq \{0,1\}^n$, let $D(A, B)$ be:
$$D(A, B) = \sum_{x^n \in A} \sum_{\tilde{x}^n \in B} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|\tilde{x}^n}\big).$$
Then,
$$D(A, B) = \sum_{\tilde{x}^n \in B} \Big( D(A) + |A| \, D\big(P_{Y^n|A} \,\big\|\, P_{Y^n|\tilde{x}^n}\big) \Big) = |B| \, D(A) + |A| \sum_{\tilde{x}^n \in B} D\big(P_{Y^n|A} \,\big\|\, P_{Y^n|\tilde{x}^n}\big),$$
where the first equality is because of the compensation identity (6). As we discussed in Section 4.1, the second term of the right-hand side is:
$$|A| \sum_{\tilde{x}^n \in B} D\big(P_{Y^n|A} \,\big\|\, P_{Y^n|\tilde{x}^n}\big) = |A|\,|B| \Big( D\big(P_{Y^n|A} \,\big\|\, P^{(G)}_{Y^n|B}\big) - \log s_B \Big).$$
More interestingly, we can apply the original and geometric compensation identities together:
$$D(A, B) = |B| \, D(A) + |A|\,|B| \, D\big(P_{Y^n|A} \,\big\|\, P^{(G)}_{Y^n|B}\big) - |A|\,|B| \log s_B.$$
From Theorem 4, we have $D(P_{Y^n|x^n} \| P_{Y^n|\tilde{x}^n}) = D(P_{Y^n|\tilde{x}^n} \| P_{Y^n|x^n})$, and therefore, we can switch A and B. If we let $B = A$, we have:
$$D(A, A) = |A| \, D(A) + |A|^2 \, D\big(P_{Y^n|A} \,\big\|\, P^{(G)}_{Y^n|A}\big) - |A|^2 \log s_A.$$
Note that $D(A, A)$ is similar to a known clustering objective. In the clustering literature, the min-sum clustering problem [12] is minimizing the sum of all edges in each cluster. Using $D(A, A)$, we can describe the binary min-sum clustering problem on $\{0,1\}^n$ as minimizing $D(A, A) + D(A^c, A^c)$.
4.4. Another Application of the Geometric Mean
Using the geometric mean of measures, we can rewrite the clustering problem in a different form. Recall that $D(A) = \sum_{x^n \in A} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|A}\big)$. Then, we have:
Let
be the geometric mean of measures in the set
, i.e.,
where:
The sum of the results is:
This can be considered as a dual of Theorem 3.
Remark 7. Let $A = \{x^n : x_i = 0\}$, which is the candidate of the optimizer.
5. Concluding Remarks
In this paper, we have proposed a number of different formulations of the most informative Boolean function conjecture. Most of them are based on the information geometric approach. Furthermore, we focused on the (normalized) geometric mean of measures, which can simplify the problem formulation. More precisely, we showed that Conjecture 1 is true if and only if the maximum-achieving f satisfies the following property: “the arithmetic and geometric means of measures are the same for both $A = \{x^n : f(x^n) = 1\}$, as well as $A^c$.”