1. Introduction
Let $X^n = (X_1, \ldots, X_n)$ be an independent and identically distributed (i.i.d.) uniform Bernoulli source and $Y^n$ be the output of $X^n$ through a memoryless binary symmetric channel with crossover probability $\alpha$. Recently, Courtade and Kumar conjectured that the most informative Boolean function is a dictator function.
Conjecture 1 ([1]). For any Boolean function $f : \{0,1\}^n \to \{0,1\}$, we have:
$$I(f(X^n); Y^n) \le 1 - H(\alpha),$$
where the maximum is achieved by a dictator function, i.e., $f(x^n) = x_i$ for some $i \in \{1, \ldots, n\}$.
Note that $H(\cdot)$ is the binary entropy function. Although there has been some progress in this line of work [2,3], this simple conjecture still remains open. There are also a number of variations of this conjecture. Weinberger and Shayevitz [4] considered the optimal Boolean function under quadratic loss. Huleihel and Ordentlich [5] considered the complementary case of quantizing the channel outputs to $n-1$ bits and showed that $I(X^n; g(Y^n)) \le (n-1)(1 - H(\alpha))$ for all $g : \{0,1\}^n \to \{0,1\}^{n-1}$. Nazer et al. focused on information distilling quantizers [6], which can be seen as a generalized version of the above problem.
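As a quick numerical illustration of Conjecture 1, the following Python sketch computes $I(f(X^n); Y^n)$ by brute force for $n = 3$ and compares a dictator function with the majority function; the helper names channel_dist and mutual_information are illustrative and not part of the development that follows.

```python
import itertools
import numpy as np

def channel_dist(x, alpha):
    # P_{Y^n | X^n = x} for a memoryless BSC(alpha), as a vector over {0,1}^n.
    ys = itertools.product([0, 1], repeat=len(x))
    return np.array([np.prod([(1 - alpha) if xi == yi else alpha
                              for xi, yi in zip(x, y)]) for y in ys])

def mutual_information(f, n, alpha):
    # I(f(X^n); Y^n) in bits, with X^n i.i.d. Bern(1/2).
    joint = {}  # P(f(X^n) = b, Y^n = y), accumulated over x
    for x in itertools.product([0, 1], repeat=n):
        b = f(x)
        joint[b] = joint.get(b, 0) + channel_dist(x, alpha) / 2 ** n
    p_y = sum(joint.values())
    return sum(np.sum(row * np.log2(row / (row.sum() * p_y)))
               for row in joint.values())

n, alpha = 3, 0.1
h = -alpha * np.log2(alpha) - (1 - alpha) * np.log2(1 - alpha)
dictator = lambda x: x[0]
majority = lambda x: int(sum(x) >= 2)
print(mutual_information(dictator, n, alpha), 1 - h)  # both equal 1 - H(alpha)
print(mutual_information(majority, n, alpha))         # smaller for this alpha
```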
Many of these works are based on Fourier-analytic techniques, including the original paper [1]. In this paper, we suggest an alternative approach, namely the information geometric approach. The mutual information can naturally be expressed in terms of Kullback–Leibler (KL) divergences. Thus, it can be shown that maximizing the mutual information is equivalent to clustering probability measures under the KL divergence.
In the equivalent clustering problem, the center of a cluster is the arithmetic mean of measures. We also describe the role of the geometric mean of measures (with appropriate normalization) in this setting. To the best of our knowledge, the geometric mean of measures has received less attention in the literature. We propose an equivalent formulation of the conjecture using the geometric mean of measures. Note that the geometric mean also allows us to connect Conjecture 1 to another well-known clustering problem.
The rest of the paper is organized as follows. In Section 2, we briefly review the Jensen–Shannon divergence and $i$-compressedness. In Section 3, we provide an equivalent clustering problem of probability measures. We introduce the geometric mean of measures in Section 4. We conclude this paper in Section 5.
Notations
$\mathcal{X}$ denotes the alphabet set of random variable X, and $\mathcal{P}(\mathcal{X})$ denotes the set of measures on $\mathcal{X}$. $X^n$ denotes a random vector $(X_1, X_2, \ldots, X_n)$, while $x^n$ denotes a specific realization of it. If it is clear from the context, $P_{Y^n|x^n}$ denotes the conditional distribution of $Y^n$ given $X^n = x^n$, i.e., $P_{Y^n|X^n = x^n}$. Similarly, $P_{Y^n|A}$ denotes the conditional distribution of $Y^n$ given $X^n \in A$, i.e., $P_{Y^n|X^n \in A}$. Let $\{0,1\}^n$ be the set of all binary sequences of length n. For $A \subseteq \{0,1\}^n$ and $z^n \in \{0,1\}^n$, the shifted version of A is denoted by $A \oplus z^n = \{x^n \oplus z^n : x^n \in A\}$, where ⊕ is an element-wise XOR operator. The arithmetic mean of measures in the set $\{P_{Y^n|x^n} : x^n \in A\}$ is denoted by $P_{Y^n|A}$. For $1 \le i \le n$, let $A_{i,0}$ be the set of elements in A that satisfy $x_i = 0$, i.e., $A_{i,0} = \{x^n \in A : x_i = 0\}$, and $A_{i,1}$ is defined in a similar manner. A length n binary vector $e_i$ denotes the vector with $x_i = 1$ and $x_j = 0$ for all $j \neq i$.
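For concreteness, the conditional measures and their arithmetic mean can be generated numerically as in the following minimal Python sketch; the names cond_dist and P_A are illustrative rather than notation used in the paper.

```python
import itertools
import numpy as np

alpha, n = 0.1, 3
ys = list(itertools.product([0, 1], repeat=n))

def cond_dist(x):
    # P_{Y^n|x^n}: a memoryless BSC(alpha) applied coordinate-wise to x^n.
    return np.array([np.prod([(1 - alpha) if xi == yi else alpha
                              for xi, yi in zip(x, y)]) for y in ys])

A = [x for x in itertools.product([0, 1], repeat=n) if x[0] == 0]  # {x^n : x_1 = 0}
# Arithmetic mean of measures in A; under uniform X^n this is the
# conditional distribution of Y^n given X^n in A.
P_A = sum(cond_dist(x) for x in A) / len(A)
print(np.isclose(P_A.sum(), 1.0))   # True: P_A is a probability measure
```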
2. Preliminaries
2.1. Jensen–Shannon Divergence
For $P_1, P_2 \in \mathcal{P}(\mathcal{X})$ and weights $\pi_1, \pi_2 \ge 0$ such that $\pi_1 + \pi_2 = 1$, the Jensen–Shannon (JS) divergence of two measures $P_1$ and $P_2$ is defined as:
$$\mathrm{JS}_{\pi}(P_1, P_2) = H(\pi_1 P_1 + \pi_2 P_2) - \pi_1 H(P_1) - \pi_2 H(P_2).$$
It is not hard to show that the following definition is equivalent:
$$\mathrm{JS}_{\pi}(P_1, P_2) = \pi_1 D(P_1 \| \bar{P}) + \pi_2 D(P_2 \| \bar{P}), \qquad (3)$$
where $\bar{P} = \pi_1 P_1 + \pi_2 P_2$.
Lin proposed a generalized JS divergence [7]:
$$\mathrm{JS}_{\pi}(P_1, \ldots, P_k) = H\Big(\sum_{i=1}^{k} \pi_i P_i\Big) - \sum_{i=1}^{k} \pi_i H(P_i),$$
where $\pi = (\pi_1, \ldots, \pi_k)$ is a weight vector such that $\sum_{i=1}^{k} \pi_i = 1$. Similar to Equation (3), it has an equivalent definition:
$$\mathrm{JS}_{\pi}(P_1, \ldots, P_k) = \sum_{i=1}^{k} \pi_i D(P_i \| \bar{P}),$$
where $\bar{P} = \sum_{i=1}^{k} \pi_i P_i$.
Topsøe [8] pointed out an interesting property, the so-called compensation identity. It states that for any distribution Q,
$$\sum_{i=1}^{k} \pi_i D(P_i \| Q) = \sum_{i=1}^{k} \pi_i D(P_i \| \bar{P}) + D(\bar{P} \| Q). \qquad (6)$$
Throughout the paper, we often use Equation (6) directly without the notion of the JS divergence.
Remark 1. The generalized JS divergence can be expressed as a mutual information involving the mixture distribution. Let Z be a random variable that takes the value $i \in \{1, \ldots, k\}$ with probability $\pi_i$, and let $X \sim P_i$ conditioned on $Z = i$. Then, it is not hard to show that:
$$\mathrm{JS}_{\pi}(P_1, \ldots, P_k) = I(X; Z).$$
However, we introduced the generalized JS divergence to emphasize the information geometric perspective of our problem.
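Both the equivalence of the two definitions and the compensation identity (6) are easy to check numerically. The following Python sketch does so on a small alphabet with randomly drawn measures; it is only a sanity check, with illustrative helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

def entropy(p):
    return float(-np.sum(p * np.log2(p)))

P = rng.dirichlet(np.ones(4), size=3)           # three measures on a 4-letter alphabet
pi = rng.dirichlet(np.ones(3))                  # weights summing to one
mix = pi @ P                                    # arithmetic mean (mixture)

# two equivalent forms of the generalized JS divergence
js_kl = sum(w * kl(p, mix) for w, p in zip(pi, P))
js_ent = entropy(mix) - sum(w * entropy(p) for w, p in zip(pi, P))
print(np.isclose(js_kl, js_ent))                # True

# compensation identity (6): sum_i pi_i D(P_i || Q) = JS + D(mix || Q) for any Q
Q = rng.dirichlet(np.ones(4))
lhs = sum(w * kl(p, Q) for w, p in zip(pi, P))
print(np.isclose(lhs, js_kl + kl(mix, Q)))      # True
```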
2.2. $i$-Compressed Sets
Let A be a subset of $\{0,1\}^n$ and $\{1, 2, \ldots, n\}$ be the set of indexes. For $i \in \{1, \ldots, n\}$, the $i$-section of A is defined as:
$$A_i(z) = \big\{ x_i : x^n \in A,\ (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = z \big\}, \quad z \in \{0,1\}^{n-1}.$$
The set A is called $i$-compressed if $A_i(z)$ is an initial segment of the lexicographical ordering for all $z \in \{0,1\}^{n-1}$. For example, if A is $i$-compressed for some $i$, then $A_i(z)$ should be one of:
$$\emptyset, \quad \{0\}, \quad \{0, 1\}.$$
It simply says that if $x^n \in A$ and $x_i = 1$, then $x^n \oplus e_i \in A$.
Courtade and Kumar showed that it is enough to consider $i$-compressed sets.
Theorem 1 ([1]). Let $\mathcal{F}_n$ be the set of functions $f$ for which $\{x^n : f(x^n) = 1\}$ is $i$-compressed for all $i$ with $1 \le i \le n$. In maximizing $I(f(X^n); Y^n)$, it is sufficient to consider functions $f \in \mathcal{F}_n$.
In this paper, we often restrict our attention to functions in the set $\mathcal{F}_n$.
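The coordinate-wise reading of $i$-compressedness given above is straightforward to test in code. The sketch below, with the illustrative helper is_i_compressed, checks it for the compressed dictator set $\{x^n : x_1 = 0\}$ and for a set that violates the condition.

```python
import itertools

def is_i_compressed(A, i):
    # If x^n in A has x_i = 1, the vector with that bit flipped to 0 must be in A.
    return all(x[:i] + (0,) + x[i + 1:] in A for x in A if x[i] == 1)

n = 3
dictator = {x for x in itertools.product([0, 1], repeat=n) if x[0] == 0}
print(all(is_i_compressed(dictator, i) for i in range(n)))   # True

B = {(0, 0, 0), (1, 1, 0)}   # (0, 1, 0) is missing, so B is not compressed
print(all(is_i_compressed(B, i) for i in range(n)))          # False
```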
3. Approach via Clustering
In this section, we provide an interesting approach toward Conjecture 1 via clustering. More precisely, we formulate an equivalent clustering problem.
3.1. Equivalence to Clustering
The following theorem establishes the relation between the original conjecture and a clustering problem.
Theorem 2. Let $f : \{0,1\}^n \to \mathcal{Z}$ and $Z = f(X^n)$ be an induced random variable. Then,
$$I(Z; Y^n) = I(X^n; Y^n) - \mathbb{E}\big[ D\big(P_{Y^n|X^n} \,\big\|\, P_{Y^n|Z}\big) \big].$$
The proof of the theorem is provided in Appendix A. Note that:
$$\mathbb{E}\big[ D\big(P_{Y^n|X^n} \,\big\|\, P_{Y^n|Z}\big) \big] = \sum_{x^n} P_{X^n}(x^n)\, D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|f(x^n)}\big),$$
which is a weighted mean of $D(P_{Y^n|x^n} \| P_{Y^n|f(x^n)})$ for $x^n \in \{0,1\}^n$. The KL divergence $D(P_{Y^n|x^n} \| P_{Y^n|f(x^n)})$ is a distance from each element to the cluster center. This implies that maximizing $I(f(X^n); Y^n)$ is equivalent to clustering the measures $\{P_{Y^n|x^n} : x^n \in \{0,1\}^n\}$ under KL divergence. Since the KL divergence is a Bregman divergence, all clusters are separated by a hyperplane [9].
In this paper, we focus on Boolean functions $f : \{0,1\}^n \to \{0,1\}$, where $X^n$ is i.i.d. Bern(1/2).
Corollary 1. Let $f : \{0,1\}^n \to \{0,1\}$ and $B = f(X^n)$ be a binary random variable. The equivalent clustering problem is minimizing:
$$\mathbb{E}\big[ D\big(P_{Y^n|X^n} \,\big\|\, P_{Y^n|B}\big) \big] = \frac{1}{2^n} \sum_{x^n \in \{0,1\}^n} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|f(x^n)}\big).$$
Let $A = \{x^n : f(x^n) = 1\}$; then we can simplify the objective further. The cluster center $P_{Y^n|A}$ is the arithmetic mean of measures in the set $\{P_{Y^n|x^n} : x^n \in A\}$. Then, we have:
$$\mathbb{E}\big[ D\big(P_{Y^n|X^n} \,\big\|\, P_{Y^n|B}\big) \big] = \frac{1}{2^n} \Big( \sum_{x^n \in A} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|A}\big) + \sum_{x^n \in A^c} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|A^c}\big) \Big).$$
For simplicity, let:
$$D(A) = \sum_{x^n \in A} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|A}\big),$$
which is the sum of distances from each element in A to the cluster center. In short, finding the most informative Boolean function f is equivalent to finding the set $A \subseteq \{0,1\}^n$ that minimizes $D(A) + D(A^c)$.
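The decomposition behind this equivalence can be verified numerically: combining the compensation identity with $I(X^n; Y^n) = n(1 - H(\alpha))$ gives $I(f(X^n); Y^n) = n(1 - H(\alpha)) - 2^{-n}\big(D(A) + D(A^c)\big)$. The Python sketch below checks this for $n = 3$ and the majority set; the helper names are illustrative.

```python
import itertools
import numpy as np

n, alpha = 3, 0.1
xs = list(itertools.product([0, 1], repeat=n))

def channel_dist(x):
    return np.array([np.prod([(1 - alpha) if xi == yi else alpha
                              for xi, yi in zip(x, y)]) for y in xs])

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

P = {x: channel_dist(x) for x in xs}
p_y = sum(P.values()) / 2 ** n                   # marginal of Y^n

def cluster_cost(S):
    # D(S): sum of KL divergences from each member to the arithmetic mean of S
    center = sum(P[x] for x in S) / len(S)
    return sum(kl(P[x], center) for x in S)

def mutual_info(A):
    # I(1{X^n in A}; Y^n), computed as sum_b P(b) D(P_{Y^n|b} || P_{Y^n})
    mi = 0.0
    for S in (A, [x for x in xs if x not in A]):
        mi += (len(S) / 2 ** n) * kl(sum(P[x] for x in S) / len(S), p_y)
    return mi

h = -alpha * np.log2(alpha) - (1 - alpha) * np.log2(1 - alpha)
A = [x for x in xs if sum(x) >= 2]               # majority set, as an example
Ac = [x for x in xs if x not in A]
lhs = mutual_info(A)
rhs = n * (1 - h) - (cluster_cost(A) + cluster_cost(Ac)) / 2 ** n
print(np.isclose(lhs, rhs))                      # True
```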
Remark 2. Conjecture 1 implies that $A = \{x^n : x_i = 0\}$ minimizes (19). Furthermore, Theorem 1 implies that it is enough to consider A that are $i$-compressed for all i.
For any $A \subseteq \{0,1\}^n$, Equation (6) implies that:
$$\sum_{x^n \in A} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n}\big) = D(A) + |A| \, D\big(P_{Y^n|A} \,\big\|\, P_{Y^n}\big).$$
Note that $\sum_{x^n \in \{0,1\}^n} D(P_{Y^n|x^n} \| P_{Y^n})$ does not depend on A, and therefore, we have the following theorem.
Theorem 3. For any $\alpha \in (0, 1/2)$, minimizing $D(A) + D(A^c)$ over $A \subseteq \{0,1\}^n$ is equivalent to maximizing:
$$|A| \, D\big(P_{Y^n|A} \,\big\|\, P_{Y^n}\big) + |A^c| \, D\big(P_{Y^n|A^c} \,\big\|\, P_{Y^n}\big).$$
The above theorem provides an alternative problem formulation of the original conjecture.
3.2. Connection to Clustering under Hamming Distance
In this section, we consider the duality between the above clustering problem under the KL divergence and clustering on $\{0,1\}^n$ under the Hamming distance. The following theorem shows that the KL divergence on $\{P_{Y^n|x^n} : x^n \in \{0,1\}^n\}$ corresponds to the Hamming distance on $\{0,1\}^n$.
Theorem 4. For all $x^n, \tilde{x}^n \in \{0,1\}^n$, we have:
$$D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|\tilde{x}^n}\big) = d_H(x^n, \tilde{x}^n) \, D\big(\mathrm{Bern}(\alpha) \,\big\|\, \mathrm{Bern}(1-\alpha)\big),$$
where $d_H(x^n, \tilde{x}^n)$ denotes the Hamming distance between $x^n$ and $\tilde{x}^n$.
This theorem implies that the distance between two measures $P_{Y^n|x^n}$ and $P_{Y^n|\tilde{x}^n}$ is proportional to the Hamming distance between the two binary vectors $x^n$ and $\tilde{x}^n$. The proof of the theorem is provided in Appendix B. Note that the KL divergence $D(P_{Y^n|x^n} \| P_{Y^n|\tilde{x}^n})$ is symmetric on this family of measures.
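In bits, the per-coordinate constant is $D(\mathrm{Bern}(\alpha) \| \mathrm{Bern}(1-\alpha)) = (1 - 2\alpha)\log_2\frac{1-\alpha}{\alpha}$, and the relation is easy to confirm numerically, as in the following Python sketch (helper names are illustrative).

```python
import itertools
import numpy as np

n, alpha = 4, 0.2
ys = list(itertools.product([0, 1], repeat=n))

def channel_dist(x):
    return np.array([np.prod([(1 - alpha) if xi == yi else alpha
                              for xi, yi in zip(x, y)]) for y in ys])

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

c = (1 - 2 * alpha) * np.log2((1 - alpha) / alpha)   # D(Bern(a) || Bern(1-a)) in bits
rng = np.random.default_rng(1)
for _ in range(5):
    x1, x2 = (tuple(rng.integers(0, 2, n)) for _ in range(2))
    d_h = sum(a != b for a, b in zip(x1, x2))
    print(np.isclose(kl(channel_dist(x1), channel_dist(x2)), c * d_h))  # True
```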
In the above duality, we have a mapping between $\{0,1\}^n$ and $\{P_{Y^n|x^n} : x^n \in \{0,1\}^n\}$; more precisely, $x^n \mapsto P_{Y^n|x^n}$. This mapping naturally suggests an equivalent clustering problem of n-dimensional binary vectors. However, the cluster center $P_{Y^n|A}$ is not an element of $\{P_{Y^n|x^n} : x^n \in \{0,1\}^n\}$ in general. In order to formulate an equivalent clustering problem, we need to answer the question “Which n-dimensional vector corresponds to $P_{Y^n|A}$?”. A naive approach is to extend the set of binary vectors to $[0,1]^n$ with a corresponding vector distance instead of the Hamming distance. In such a case, the goal is to map $P_{Y^n|A}$ to the arithmetic mean of the binary vectors in the set A. If this were true, we could further simplify the problem into the problem of clustering the hypercube in $\mathbb{R}^n$. However, the following example shows that this naive extension is not valid.
Example 1. One can choose two sets of binary vectors A and B such that the arithmetic mean of the binary vectors in A and that in B are the same; however, $P_{Y^n|A}$ is not equal to $P_{Y^n|B}$.
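Since the specific sets used in Example 1 are not reproduced here, the following Python sketch illustrates the same phenomenon with two sets chosen purely for illustration.

```python
import itertools
import numpy as np

alpha, n = 0.1, 3
ys = list(itertools.product([0, 1], repeat=n))

def channel_dist(x):
    return np.array([np.prod([(1 - alpha) if xi == yi else alpha
                              for xi, yi in zip(x, y)]) for y in ys])

A = [(0, 0, 0), (1, 1, 0)]      # illustrative sets, not the ones from Example 1
B = [(1, 0, 0), (0, 1, 0)]
print(np.allclose(np.mean(A, axis=0), np.mean(B, axis=0)))   # True: same vector mean
mix_A = sum(channel_dist(x) for x in A) / len(A)
mix_B = sum(channel_dist(x) for x in B) / len(B)
print(np.allclose(mix_A, mix_B))                             # False: different measures
```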
Furthermore, the conjectured optimal set is not the optimum choice when clustering the hypercube under such a vector distance. Instead, we need to consider the set of measures directly. The following theorem provides a bit of geometric structure among the measures.
Theorem 5. For all and ,where . The proof of the theorem is provided in
Appendix C. Since
for all
, Theorem 5 immediately implies the following corollary.
Corollary 2. For all and ,where is a convex hull of measures in the set A. This is a triangle inequality that can be useful when we consider the clustering problem of measures.
4. Geometric Mean of Measures
In the previous section, we formulated a clustering problem that is equivalent to the original problem of maximizing mutual information. In this section, we provide another approach using a geometric mean of measures. We define the geometric mean of measures formally and derive a nontrivial conjecture that is equivalent to Conjecture 1.
4.1. Definition of the Geometric Mean of Measures
For measures $P_1, \ldots, P_k \in \mathcal{P}(\mathcal{X})$ and weights $\pi_1, \ldots, \pi_k \ge 0$ such that $\sum_{i=1}^{k} \pi_i = 1$, we considered the sum of KL divergences in (6):
$$\sum_{i=1}^{k} \pi_i D(P_i \| Q). \qquad (28)$$
We also observed that (28) is minimized when Q is the arithmetic mean of the measures.
Since the KL divergence is asymmetric, it is natural to consider the sum of KL divergences in the other direction:
$$\sum_{i=1}^{k} \pi_i D(Q \| P_i).$$
Compared to the arithmetic mean that minimizes (28), the weighted product $\prod_{i=1}^{k} P_i^{\pi_i}$ can be considered as a geometric mean of measures. However, $\prod_{i=1}^{k} P_i^{\pi_i}$ is not a measure in general, and normalization is required. With a normalizing constant s, we can define the geometric mean of measures by:
$$P_G(x) = \frac{1}{s} \prod_{i=1}^{k} P_i(x)^{\pi_i},$$
where s is a constant so that $\sum_{x \in \mathcal{X}} P_G(x) = 1$, i.e.,
$$s = \sum_{x \in \mathcal{X}} \prod_{i=1}^{k} P_i(x)^{\pi_i}.$$
Then, we have:
$$\sum_{i=1}^{k} \pi_i D(Q \| P_i) = D(Q \| P_G) - \log s,$$
which is minimized when $Q = P_G$. Thus, for all Q,
$$\sum_{i=1}^{k} \pi_i D(Q \| P_i) = D(Q \| P_G) + \sum_{i=1}^{k} \pi_i D(P_G \| P_i).$$
The above result provides a geometric compensation identity. This also implies that $s \le 1$.
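The geometric compensation identity can be checked numerically in the same way as (6). The following Python sketch verifies, for randomly drawn measures, that $\sum_i \pi_i D(Q \| P_i) = D(Q \| P_G) - \log s$ and that $s \le 1$ (all logarithms in bits); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

P = rng.dirichlet(np.ones(5), size=3)            # three measures on a 5-letter alphabet
pi = rng.dirichlet(np.ones(3))                   # weights summing to one

unnorm = np.prod(P ** pi[:, None], axis=0)       # pointwise weighted geometric mean
s = unnorm.sum()                                 # normalizing constant
P_geo = unnorm / s                               # normalized geometric mean of measures

# geometric compensation identity: sum_i pi_i D(Q || P_i) = D(Q || P_geo) - log2(s),
# so the left-hand side is minimized at Q = P_geo with value -log2(s) >= 0.
Q = rng.dirichlet(np.ones(5))
lhs = sum(w * kl(Q, p) for w, p in zip(pi, P))
print(np.isclose(lhs, kl(Q, P_geo) - np.log2(s)))   # True
print(s <= 1)                                        # True
```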
Remark 3. If $k = 2$ and $(\pi_1, \pi_2) = (\alpha, 1 - \alpha)$, s is called the α-Chernoff coefficient, and it is called the Bhattacharyya coefficient when $\alpha = 1/2$. The summation $\sum_{i=1}^{k} \pi_i D(P_G \| P_i) = -\log s$ is known as the α-Chernoff divergence. For more details, please see [10,11] and the references therein. Under this definition, we can find the geometric mean of measures in the set $\{P_{Y^n|x^n} : x^n \in A\}$ with uniform weights $1/|A|$ by:
$$P^{(G)}_{Y^n|A}(y^n) = \frac{1}{s_A} \prod_{x^n \in A} P_{Y^n|x^n}(y^n)^{1/|A|},$$
where:
$$s_A = \sum_{y^n \in \{0,1\}^n} \prod_{x^n \in A} P_{Y^n|x^n}(y^n)^{1/|A|}.$$
Remark 4. The original conjecture is that a Boolean function f with $f(x^n) = x_i$ maximizes the mutual information $I(f(X^n); Y^n)$. The geometric mean of measures indexed by the corresponding set satisfies the following property. Note that the geometric mean of measures in the set $\{P_{Y^n|x^n} : x_i = 0\}$ satisfies:
$$P^{(G)}_{Y^n|A} = P_{Y^n|A}.$$
4.2. Main Results
So far, we have seen two means of measures, $P_{Y^n|A}$ and $P^{(G)}_{Y^n|A}$. It is natural to ask if they are equal. Our main theorem provides a connection to Conjecture 1.
Theorem 6. Suppose A is a nontrivial subset of $\{0,1\}^n$ (i.e., $A \neq \emptyset$ and $A \neq \{0,1\}^n$), and A is $i$-compressed for all $i$. Then, $A = \{x^n : x_i = 0\}$ for some i if and only if $P_{Y^n|A} = P^{(G)}_{Y^n|A}$ and $P_{Y^n|A^c} = P^{(G)}_{Y^n|A^c}$.
The proof of the theorem is provided in Appendix D. Theorem 6 implies that the following conjecture is equivalent to Conjecture 1.
Conjecture 2. Let $f$ be a Boolean function such that $A = \{x^n : f(x^n) = 1\}$ is nontrivial and $i$-compressed for all $i$. Then, $I(f(X^n); Y^n)$ is maximized if and only if $P_{Y^n|A} = P^{(G)}_{Y^n|A}$ and $P_{Y^n|A^c} = P^{(G)}_{Y^n|A^c}$.
Remark 5. One of the main challenges of this problem is that the conjectured optimal sets are extremes, i.e., $A = \{x^n : x_i = 0\}$ for some i. Our main theorem provides an alternative conjecture that seems more natural in the context of optimization.
Remark 6. It is clear that $P_{Y^n|A} = P^{(G)}_{Y^n|A}$ holds if $|A| = 1$. Thus, both conditions $P_{Y^n|A} = P^{(G)}_{Y^n|A}$ and $P_{Y^n|A^c} = P^{(G)}_{Y^n|A^c}$ are needed to guarantee $A = \{x^n : x_i = 0\}$ for some i.
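Theorem 6 can be probed numerically for small n. The Python sketch below compares the arithmetic and geometric means of the measures indexed by a compressed dictator set and by another compressed set; for the dictator set (and, by symmetry, its complement) the two means coincide, while for the other set they do not. The helper names are illustrative.

```python
import itertools
import numpy as np

n, alpha = 3, 0.1
xs = list(itertools.product([0, 1], repeat=n))

def channel_dist(x):
    return np.array([np.prod([(1 - alpha) if xi == yi else alpha
                              for xi, yi in zip(x, y)]) for y in xs])

def arith_mean(A):
    return sum(channel_dist(x) for x in A) / len(A)

def geo_mean(A):
    g = np.prod([channel_dist(x) ** (1 / len(A)) for x in A], axis=0)
    return g / g.sum()

dictator = [x for x in xs if x[0] == 0]          # A = {x^n : x_1 = 0}
low_weight = [x for x in xs if sum(x) <= 1]      # another compressed set

print(np.allclose(arith_mean(dictator), geo_mean(dictator)))       # True
print(np.allclose(arith_mean(low_weight), geo_mean(low_weight)))   # False
```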
4.3. Property of the Geometric Mean
We can derive a new identity by combining the original and the geometric compensation identities. For $A, B \subseteq \{0,1\}^n$, let $D(A, B)$ be:
$$D(A, B) = \sum_{x^n \in A} \sum_{\tilde{x}^n \in B} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|\tilde{x}^n}\big).$$
Then,
$$D(A, B) = \sum_{\tilde{x}^n \in B} \Big( D(A) + |A| \, D\big(P_{Y^n|A} \,\big\|\, P_{Y^n|\tilde{x}^n}\big) \Big) = |B| \, D(A) + |A| \sum_{\tilde{x}^n \in B} D\big(P_{Y^n|A} \,\big\|\, P_{Y^n|\tilde{x}^n}\big),$$
where the first equality is because of the compensation identity (6). As we discussed in Section 4.1, the second term of the right-hand side is:
$$|A| \sum_{\tilde{x}^n \in B} D\big(P_{Y^n|A} \,\big\|\, P_{Y^n|\tilde{x}^n}\big) = |A|\,|B| \Big( D\big(P_{Y^n|A} \,\big\|\, P^{(G)}_{Y^n|B}\big) - \log s_B \Big).$$
More interestingly, we can apply the original and geometric compensation identities together:
$$D(A, B) = |B| \, D(A) + |A|\,|B| \, D\big(P_{Y^n|A} \,\big\|\, P^{(G)}_{Y^n|B}\big) - |A|\,|B| \log s_B.$$
From Theorem 4, we have $D(P_{Y^n|x^n} \| P_{Y^n|\tilde{x}^n}) = D(P_{Y^n|\tilde{x}^n} \| P_{Y^n|x^n})$, and therefore, we can switch A and B. If we let $B = A$, we have:
$$D(A, A) = |A| \, D(A) + |A|^2 \, D\big(P_{Y^n|A} \,\big\|\, P^{(G)}_{Y^n|A}\big) - |A|^2 \log s_A.$$
Note that $D(A, A)$ is similar to a known clustering objective. In the clustering literature, the min-sum clustering problem [12] is minimizing the sum of all edges in each cluster. Using $D(A, A)$, we can describe the binary min-sum clustering problem on $\{0,1\}^n$ as minimizing $D(A, A) + D(A^c, A^c)$.
4.4. Another Application of the Geometric Mean
Using the geometric mean of measures, we can rewrite the clustering problem in a different form. Recall that $D(A) = \sum_{x^n \in A} D\big(P_{Y^n|x^n} \,\big\|\, P_{Y^n|A}\big)$. Then, we have:
Let
be the geometric mean of measures in the set
, i.e.,
where:
The sum of the results is:
This can be considered as a dual of Theorem 3.
Remark 7. Let $A = \{x^n : x_i = 0\}$, which is the candidate of the optimizer.
5. Concluding Remarks
In this paper, we have proposed a number of different formulations of the most informative Boolean function conjecture. Most of them are based on the information geometric approach. Furthermore, we focused on the (normalized) geometric mean of measures, which can simplify the problem formulation. More precisely, we showed that Conjecture 1 is true if and only if the maximum-achieving f satisfies the following property: “the arithmetic and geometric means of measures are the same for both $A = \{x^n : f(x^n) = 1\}$, as well as $A^c$.”