A Forward-Reverse Brascamp-Lieb Inequality: Entropic Duality and Gaussian Optimality

Liu, Jingbo; Courtade, Thomas A.; Cuff, Paul W.; Verdú, Sergio

doi:10.3390/e20060418

Open AccessArticle

A Forward-Reverse Brascamp-Lieb Inequality: Entropic Duality and Gaussian Optimality

by

Jingbo Liu

¹,

Thomas A. Courtade

^2,*,

Paul W. Cuff

³ and

Sergio Verdú

¹

Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA

²

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720-1770, USA

³

Renaissance Technologies, LLC 600 Route 25A East Setauket, New York, NY 11733, USA

^*

Author to whom correspondence should be addressed.

Entropy 2018, 20(6), 418; https://doi.org/10.3390/e20060418

Submission received: 30 March 2018 / Revised: 25 May 2018 / Accepted: 25 May 2018 / Published: 30 May 2018

(This article belongs to the Special Issue Entropy and Information Inequalities)

Download

Browse Figures

Versions Notes

Abstract

:

Inspired by the forward and the reverse channels from the image-size characterization problem in network information theory, we introduce a functional inequality that unifies both the Brascamp-Lieb inequality and Barthe’s inequality, which is a reverse form of the Brascamp-Lieb inequality. For Polish spaces, we prove its equivalent entropic formulation using the Legendre-Fenchel duality theory. Capitalizing on the entropic formulation, we elaborate on a “doubling trick” used by Lieb and Geng-Nair to prove the Gaussian optimality in this inequality for the case of Gaussian reference measures.

Keywords:

Brascamp-Lieb inequality; hypercontractivity; functional-entropic duality; Gaussian optimality; network information theory; image size characterization

1. Introduction

The Brascamp-Lieb inequality and its reverse [1] concern the optimality of Gaussian functions in a certain type of integral inequality. (Not to be confused with the “variance Brascamp-Lieb inequality” (cf. [2,3,4]), which generalizes the Poincaré inequality). These inequalities have been generalized in various ways since their discovery, nearly 40 years ago. A modern formulation due to Barthe [5] may be stated as follows:

Brascamp-Lieb Inequality and Its Reverse

([5] Theorem 1). Let E,

E_{1}

, …,

E_{m}

be Euclidean spaces and

B_{i} : E \to E_{i}

be linear maps. Let

{(c_{i})}_{i = 1}^{m}

and D be positive real numbers. Then, the Brascamp-Lieb inequality:

\begin{matrix} \int \prod_{i = 1}^{m} f_{i}^{c_{i}} (B_{i} x) d x \leq D \prod_{i = 1}^{m} {(\int f_{i} (x_{i}) d x_{i})}^{c_{i}}, \end{matrix}

(1)

for all nonnegative measurable functions

f_{i}

on

E_{i}

,

i = 1, \dots, m

, holds if and only if it holds whenever

f_{i}

,

i = 1, \dots, m

are centered Gaussian functions (a centered Gaussian function is of the form

x \mapsto exp (r - x^{⊤} A x)

, where

A

is a positive semidefinite matrix and

r \in R

). Similarly, for F a positive real number, the reverse Brascamp-Lieb inequality, also known as Barthe’s inequality (

B_{i}^{*}

denotes the adjoint of

B_{i}

),

\begin{matrix} \int sup_{(y_{i}) : \sum_{i = 1}^{m} c_{i} B_{i}^{*} y_{i} = x} \prod_{i = 1}^{m} f_{i}^{c_{i}} (y_{i}) d x \geq F \prod_{i = 1}^{m} {(\int f_{i} (y_{i}) d y_{i})}^{c_{i}}, \end{matrix}

(2)

for all nonnegative measurable functions

f_{i}

on

E_{i}

,

i = 1, \dots, m

, holds if and only if it holds for all centered Gaussian functions.

For surveys on the history of both the Brascamp-Lieb inequality and Barthe’s inequality and their applications, see, e.g., [6,7]. The Brascamp-Lieb inequality can be seen as a generalization of several other inequalities, including Hölder’s inequality, the sharp Young inequality, the Loomis-Whitney inequality, the entropy power inequality (cf. [6] or the survey paper [8]), hypercontractivity and the logarithmic Sobolev inequality [9]. Furthermore, the Prékopa-Leindler inequality can be seen as a special case of Barthe’s inequality. Due in part to their utility in establishing impossibility bounds, these functional inequalities have attracted much attention in information theory [10,11,12,13,14,15,16,17], theoretical computer science [18,19,20,21,22] and statistics [23,24,25,26,27,28], to name only a small subset of the literature. Over the years, various proofs of these inequalities have been proposed [1,29,30,31,32,33,34]. Among these, Lieb’s elegant proof [29], which is very close to one of the techniques that will be used in this paper, employs a doubling trick that capitalizes on the rotational invariance property of the Gaussian function: if f is a one-dimensional Gaussian function, then:

\begin{matrix} f (x) f (y) = f (\frac{x - y}{\sqrt{2}}) f (\frac{x + y}{\sqrt{2}}) . \end{matrix}

(3)

Since (1) and (2) have the same structure modulo the direction of the inequality, a common viewpoint is to consider (1) and (2) as dual inequalities. This viewpoint successfully captures the geometric aspects of (1) and (2). Indeed, it is known that:

\begin{matrix} D \cdot F = 1 \end{matrix}

(4)

as long as

D, F < \infty

[5]. Moreover, both D and F are equal to one under Ball’s geometric condition [35]:

E_{1}

, …,

E_{m}

are dimension one, and:

\begin{matrix} \sum_{i = 1}^{m} c_{i} B_{i} B_{i}^{*} = I \end{matrix}

(5)

is the identity matrix. While fruitful, this “dual” viewpoint does not fully explain the asymmetry between the forward and the reverse inequalities: there is a sup in (2), but not in (1).

This paper explores a different viewpoint. In particular, we propose a single inequality that unifies (1) and (2). Accordingly, we should reverse both sides of (2) to make the inequality sign consistent with (1). To be concrete, let us first observe that (1) and (2) can be respectively restated in the following more symmetrical forms (with changes of certain symbols):

For all nonnegative functions g and $f_{1}, \dots, f_{m}$ such that:

$\begin{matrix} g (x) \leq \prod_{i = 1}^{m} f_{j}^{c_{j}} (B_{j} x), \forall x, \end{matrix}$

(6)

we have:

$\begin{matrix} \int_{E} g \leq D \prod_{j = 1}^{m} {(\int_{E_{j}} f_{j})}^{c_{j}} . \end{matrix}$

(7)
For all nonnegative measurable functions $g_{1}, \dots g_{l}$ and f such that:

$\begin{matrix} \prod_{i = 1}^{l} g_{i}^{b_{i}} (z_{i}) \leq f (\sum_{i = 1}^{l} b_{i} B_{i}^{*} z_{i}), \forall z_{1}, \dots, z_{l}, \end{matrix}$

(8)

we have:

$\begin{matrix} \prod_{i = 1}^{l} {(\int_{E_{i}} g_{i})}^{b_{i}} \leq D \int_{E} f . \end{matrix}$

(9)

Note that in both cases, the optimal choice of one function (f or g) can be explicitly computed from the constraints, hence the conventional formulations in (1) and (2). Generalizing further, we can consider the following problem: Let

X

,

Y_{1}, \dots, Y_{m}

,

Z_{1}, \dots, Z_{l}

be measurable spaces. Consider measurable maps

ϕ_{j} : X \to Y_{j}

,

j = 1, \dots, m

and

ψ : X \to Z_{i}

,

i = 1, \dots, l

. Let

b_{1}, \dots, b_{l}

and

c_{1}, \dots, c_{m}

be nonnegative real numbers. Let

ν_{1}, \dots, ν_{l}

be measures on

Z_{1}, \dots, Z_{l}

and

μ_{1}, \dots, μ_{m}

be measures on

Y_{1}, \dots, Y_{m}

, respectively. What is the smallest

D > 0

such that for all nonnegative

f_{1}, \dots, f_{m}

on

Y_{1}, \dots Y_{m}

and

g_{1}, \dots, g_{l}

on

Z_{1}, \dots, Z_{l}

satisfying:

\begin{matrix} \prod_{i = 1}^{l} g_{i}^{b_{i}} (ψ_{i} (x)) \leq \prod_{j = 1}^{m} f_{j}^{c_{j}} (ϕ_{j} (x)), \forall x, \end{matrix}

(10)

we have:

\begin{matrix} \prod_{i = 1}^{l} {(\int g_{i} d ν_{i})}^{b_{i}} \leq D \prod_{j = 1}^{m} {(\int f_{j} d μ_{j})}^{c_{j}} ? \end{matrix}

(11)

Except for special case of

l = 1

(resp.

m = 1

), it is generally not possible to deduce a simple expression from (10) for the optimal choice of

g_{i}

(resp.

f_{j}

) in terms of the rest of the functions. We will refer to (11) as a forward-reverse Brascamp-Lieb inequality.

One of the motivations for considering multiple functions on both sides of (11) comes from multiuser information theory: independently, but almost simultaneously with the discovery of the Brascamp-Lieb inequality in mathematical physics, in the late 1970s, information theorists including Ahslwede, Gács and Körner [36,37] invented the image-size technique for proving strong converses in source and channel networks. An image-size inequality is a characterization of the tradeoff of the measures of certain sets connected by given random transformations (channels); we refer the interested readers to [37] for expositions on the image-size problem. Although not the way treated in [36,37], an image-size inequality can essentially be obtained from a functional inequality similar to (11) by taking the functions to be (roughly speaking) the indicator functions of sets. In the case of (10), the forward channels

ϕ_{1}, \dots, ϕ_{m}

and the reverse channels

ψ_{1}, \dots, ψ_{l}

degenerate into deterministic functions. In this paper, motivated by information theoretic applications similar to those of the image-size problems, we will consider further generalizations of (11) to the case of random transformations. Since the functional inequality is not restricted to indicator functions, it is strictly stronger than the corresponding image-size inequality. As a side remark, [38] uses functional inequalities that are variants of (11) together with a reverse hypercontractivity machinery to improve the image-size plus the blowing-up machinery of [39] and shows that the non-indicator function generalization is crucial for achieving the optimal scaling of the second-order rate expansion.

Of course, to justify the proposal of (11), we must also prove that (11) enjoys certain nice mathematical properties; this is the main goal of the present paper. Specifically, we focus on two aspects of (11): equivalent entropic formulation and Gaussian optimality.

In the mathematical literature, e.g., [32,36,40,41,42,43,44,45,46], it is known that certain integral inequalities are equivalent to inequalities involving relative entropies. In particular, Carlen, Loss and Lieb [47] and Carlen and Cordero-Erausquin [32] proved that the Brascamp-Lieb inequality is equivalent to the superadditivity of relative entropy. In this paper, we prove that the forward-reverse Brascamp-Lieb inequality (11) also has an entropic formulation, which turns out to be very close to the rate region of certain multiuser information theory problems (but we will clarify the difference in the text). In fact, Ahlswede, Csiszár and Körner [37,39] essentially derived image-size inequalities from similar entropic inequalities. Because of the reverse part, the proof of the equivalence of (11) and corresponding entropic inequality is more involved than the forward case considered in [32] beyond the case of finite

X

,

Y_{j}

,

Z_{i}

, and certain machinery from min-max theory appears necessary. In particular, the proof involves a novel use of the Legendre-Fenchel duality theory. Next, we give a basic version of our main result on the functional-entropic duality (more general versions will be given later). In order to streamline its presentation, all formal definitions of notation are postponed to Section 2.

Theorem 1 (Dual formulation of the forward-reverse Brascamp-Lieb inequality).

Assume that:

(i): m and l are positive integers; $d \in R$ , $X$ is a compact metric space;
(ii): $b_{i} \in (0, \infty)$ , $ν_{i}$ is a finite Borel measure on a Polish space $Z_{i}$ , and $Q_{Z_{i} | X}$ is a random transformation from $X$ to $Z_{i}$ , for each $i = 1, \dots, l$ ;
(iii): $c_{j} \in (0, \infty)$ , $μ_{j}$ is a finite Borel measure on a Polish space $Y_{j}$ , and $Q_{Y_{j} | X}$ is a random transformation from $X$ to $Y_{i}$ , for each $j = 1, \dots, m$ ;
(iv): For any ${(P_{Z_{i}})}_{i = 1}^{l}$ such that $\sum_{i = 1}^{l} D (P_{Z_{i}} ∥ ν_{i}) < \infty$ , there exists $P_{X}$ such that $P_{X} \to Q_{Z_{i} | X} \to P_{Z_{i}}$ , $i = 1, \dots, l$ and $\sum_{j = 1}^{m} D (P_{Y_{j}} ∥ μ_{j}) < \infty$ , where $P_{X} \to Q_{Y_{j} | X} \to P_{Y_{j}}$ , $j = 1, \dots, m$ .

Then, the following two statements are equivalent:

1.: If the nonnegative continuous functions $(g_{i})$ , $(f_{j})$ are bounded away from zero and satisfy:

$\begin{matrix} \sum_{i = 1}^{l} b_{i} Q_{Z_{i} | X} (g_{i}) \leq \sum_{j = 1}^{m} c_{j} Q_{Y_{j} | X} (f_{j}) \end{matrix}$

(12)

then:

$\begin{matrix} \prod_{i = 1}^{l} {(\int g_{i} d ν_{i})}^{b_{i}} \leq exp (d) \prod_{j = 1}^{m} {(\int f_{j} d μ_{j})}^{c_{j}} \end{matrix}$

(13)
2.: For any $(P_{Z_{i}})$ such that $D (P_{Z_{i}} ∥ ν_{i}) < \infty$ (of course, this assumption is not essential (if we adopt the convention that the infimum in (14) is $+ \infty$ when it runs over an empty set)), $i = 1, \dots, l$ ,

$\begin{matrix} \sum_{i = 1}^{l} b_{i} D (P_{Z_{i}} ∥ ν_{i}) + d \geq inf_{P_{X}} \sum_{j = 1}^{m} c_{j} D (P_{Y_{j}} ∥ μ_{j}) \end{matrix}$

(14)

where $P_{X} \to Q_{Y_{j} | X} \to P_{Y_{j}}$ , $j = 1, \dots, m$ , and the infimum is over $P_{X}$ such that $P_{X} \to Q_{Z_{i} | X} \to P_{Z_{i}}$ , $i = 1, \dots, l$ .

Next, in a similar vein as the proverbial result that “Gaussian functions are optimal” for the forward or the reverse Brascamp-Lieb inequality, we show in this paper that Gaussian functions are also optimal for the forward-reverse Brascamp-Lieb inequality, particularized to the case of Gaussian reference measures and linear maps. The proof scheme is based on rotational invariance (3), which can be traced back in the functional setting to Lieb [29]. More specifically, we use a variant for the entropic setting introduced by Geng and Nair [48], thereby taking advantage of the dual formulation of Theorem 1.

Theorem 2.

Consider

b_{1}, \dots, b_{l}, c_{1}, \dots, c_{m}, D \in (0, \infty)

. Let

E_{1}, \dots, E_{l}, E^{1}, \dots, E^{m}

be Euclidean spaces, and let

B_{j i} : E_{i} \to E^{j}

be a linear map for each

i \in {1, \dots, l}

and

j \in {1, \dots, m}

. Then, for all continuous functions

f_{j} : E^{j} \to [0, + \infty)

,

g_{i} : E_{i} \to [0, \infty)

satisfying:

\begin{matrix} \prod_{i = 1}^{l} g_{i}^{b_{i}} (x_{i}) \leq \prod_{j = 1}^{m} f_{j}^{c_{j}} (\sum_{i = 1}^{l} B_{j i} x_{i}), \forall x_{1}, \dots, x_{l}, \end{matrix}

(15)

we have:

\begin{matrix} \prod_{i = 1}^{l} {(\int g_{i})}^{b_{i}} \leq D \prod_{j = 1}^{m} {(\int f_{j})}^{c_{j}}, \end{matrix}

(16)

if and only if for all centered Gaussian functions

f_{1}, \dots, f_{m}, g_{1}, \dots, g_{l}

satisfying (15), we have (16).

As mentioned, in the literature on the forward or the reverse Brascamp-Lieb inequalities, it is known that a certain geometric condition (5) ensures that the best constant equals one. Now, for the forward-reverse inequality, there is a simple example where the best constant equals one:

Example 1.

Let l be a positive integer, and let

M : = {(m_{j i})}_{1 \leq j \leq l, 1 \leq i \leq l}

be an orthogonal matrix. For any nonnegative continuous functions

{(f_{j})}_{j = 1}^{l}

{(g_{i})}_{i = 1}^{l}

on

R

such that:

\begin{matrix} \prod_{i = 1}^{l} g_{i} (x_{i}) \leq \prod_{j = 1}^{l} f_{j} (\sum_{i = 1}^{l} m_{j i} x_{i}), \forall x^{l} \in R^{l}, \end{matrix}

(17)

we have:

\begin{matrix} \prod_{i = 1}^{l} \int g_{i} (x) d x \leq \prod_{i = 1}^{l} \int f_{j} (x) d x . \end{matrix}

(18)

The rest of the paper is organized as follows: Section 2 defines the notation and reviews some basic theory of convex duality. Section 3 proves Theorem 1 and also presents its extensions to the settings of noncompact spaces or general reverse channels. Section 4 proves the Gaussian optimality in the entropic formulation, with the caveat that a certain “non-degenerate” assumption is imposed to ensure the existence of extremizers. At the end of Section 4, we give a proof sketch of Example 1 and also propose a generalization of the example. To completely prove Theorem 2, in Appendix F, we use a limiting argument to drop the non-degenerate assumption and apply the equivalence between the functional and entropic formulations.

2. Review of the Legendre-Fenchel Duality Theory

Our proof of the equivalence of the functional and the entropic inequalities uses the Legendre-Fenchel duality theory, a topic from convex analysis. Before getting into that, a recap of some basics on the duality of topological vector spaces seems appropriate. Unless otherwise indicated, we assume Polish spaces and Borel measures. Recall that metric space. It enjoys several nice properties that we use heavily in this section, including the Prokhorov theorem and the Riesz-Kakutani theorem. Of course, the Polish space assumption covers the cases of Euclidean and discrete spaces (endowed with the Hamming metric, which induces the discrete topology, making every function on the discrete set continuous), among others. Readers interested in discrete spaces only may refer to the (much simpler) argument in [49] based on the KKT condition.

Notation 1.

Let

X

be a topological space.

$C_{c} (X)$ denotes the space of continuous functions on $X$ with a compact support;
$C_{0} (X)$ denotes the space of all continuous functions f on $X$ that vanish at infinity (i.e., for any $ϵ > 0$ , there exists a compact set $K \subseteq X$ such that $| f (x) | < ϵ$ for $x \in X \ K$ );
$C_{b} (X)$ denotes the space of bounded continuous functions on $X$ ;
$M (X)$ denotes the space of finite signed Borel measures on $X$ ;
$P (X)$ denotes the space of probability measures on $X$ .

We consider

C_{c}

,

C_{0}

and

C_{b}

as topological vector spaces, with the topology induced from the sup norm. The following theorem, usually attributed to Riesz, Markov and Kakutani, is well known in functional analysis and can be found in, e.g., [50,51].

Theorem 3 (Riesz-Markov-Kakutani).

If

X

is a locally compact, σ-compact Polish space, the dual (the dual of a topological vector space consists of all continuous linear functionals on that space, which is naturally also topological vector space (with the weak

^{*}

topology)) of both

C_{c} (X)

and

C_{0} (X)

is

M (X)

.

Remark 1.

The dual space of

C_{b} (X)

can be strictly larger than

M (X)

, since it also contains those linear functionals that depend on the “limit at infinity” of a function

f \in C_{b} (X)

(originally defined for those f that do have a limit at infinity and then extended to the whole

C_{b} (X)

by the Hahn-Banach theorem; see, e.g., [50]).

Of course, any

μ \in M (X)

is a continuous linear functional on

C_{0} (X)

or

C_{c} (X)

, given by:

\begin{matrix} f \mapsto \int f d μ \end{matrix}

(19)

where f is a function in

C_{0} (X)

or

C_{c} (X)

. As is well known, Theorem 3 states that the converse is also true under mild regularity assumptions on the space. Thus, we can view measures as continuous linear functionals on a certain function space (in fact, some authors prefer to construct measure theory by defining a measure as a linear functional on a suitable measure space; see Lax [50] or Bourbaki [52]); this justifies the shorthand notation:

\begin{matrix} μ (f) : = \int f d μ \end{matrix}

(20)

which we employ in the rest of the paper. This viewpoint is the most natural for our setting since in the proof of the equivalent formulation of the forward-reverse Brascamp-Lieb inequality, we shall use the Hahn-Banach theorem to show the existence of certain linear functionals.

Definition 1.

Let

Λ : C_{b} (X) \to (- \infty, + \infty]

be a lower semicontinuous, proper convex function. Its Legendre-Fenchel transform

Λ^{*} : C_{b} {(X)}^{*} \to (- \infty, + \infty]

is given by:

\begin{matrix} Λ^{*} (ℓ) : = sup_{u \in C_{b} (X)} [ℓ (u) - Λ (u)] . \end{matrix}

(21)

Let

ν

be a nonnegative finite Borel measure on a Polish space

X

, and define the convex functional on

C_{b} (X)

:

\begin{matrix} Λ (f) & : = log ν (exp (f)) \end{matrix}

(22)

\begin{matrix} = log \int exp (f) d ν . \end{matrix}

(23)

Then, note that the relative entropy has the following alternative definition: for any

μ \in M (X)

,

\begin{matrix} D (μ ∥ ν) : = sup_{f \in C_{b} (X)} [μ (f) - Λ (f)] \end{matrix}

(24)

which agrees with the more familiar definition

D (μ ∥ ν) : = μ (log \frac{d μ}{d ν})

when

ν

is a probability measure, by the Donsker-Varadhan formula (cf. [53] Lemma 6.2.13). If

μ

is not a probability measure, then

D (μ ∥ ν)

as defined in (24) is

+ \infty

.

Given a bounded linear operator

T : C_{b} (Y) \to C_{b} (X)

, the dual operator

T^{*} : C_{b} {(X)}^{*} \to C_{b} {(Y)}^{*}

is defined in terms of:

\begin{matrix} T^{*} μ_{X} & : C_{b} (Y) \to R; \\ f & \mapsto μ_{X} (T f), \end{matrix}

(25)

for any

μ_{X} \in C_{b} {(X)}^{*}

. Since

P (X) \subseteq M (X) \subseteq C_{b} {(X)}^{*}

, T is said to be a conditional expectation operator if

T^{*} P \in P (Y)

for any

P \in P (X)

. The operator

T^{*}

is defined as the dual of a conditional expectation operator T and, in a slight abuse of terminology, is said to be a random transformation from

X

to

Y

.

For example, in the notation of Theorem 1, if

g \in C_{b} (Y)

and

Q_{Y | X}

is a random transformation from

X

to

Y

, the quantity

Q_{Y | X} (g)

is a function on

X

, defined by taking the conditional expectation. Furthermore, if

P_{X} \in P (X)

, we write

P_{X} \to Q_{Y | X} \to P_{Y}

to indicate that

P_{Y} \in P (Y)

is the measure induced on

Y

by applying

Q_{Y | X}

to

P_{X}

.

Remark 2.

From the viewpoint of category theory (see for example [54,55]),

C_{b}

is a functor from the category of topological spaces to the category of topological vector spaces, which is contra-variant because for any continuous,

ϕ : X \to Y

(morphism between topological spaces), we have

C_{b} (ϕ) : C_{b} (Y) \to C_{b} (X)

,

u \mapsto u \circ f

where

u \circ ϕ

denotes the composition of two continuous functions, reversing the arrows in the maps (i.e., the morphisms). On the other hand,

M

is a covariant functor and

M (ϕ) : M (X) \to M (Y)

,

μ \mapsto μ \circ ϕ^{- 1}

, where

μ \circ ϕ^{- 1} (B) : = μ (ϕ^{- 1} (B))

for any Borel measurable

B \subseteq Y

. “Duality” itself is a contra-variant functor between the category of topological spaces (note the reversal of arrows in Figure 1). Moreover,

C_{b} {(X)}^{*} = M (X)

and

C_{b} {(ϕ)}^{*} = M (ϕ)

if

X

and

Y

are compact metric spaces and

ϕ : X \to Y

is continuous. Definition 2 can therefore be viewed as the special case where ϕ is the projection map:

Definition 2.

Suppose

ϕ : Z_{1} \times Z_{2} \to Z_{1}, (z_{1}, z_{2}) \mapsto z_{1}

is the projection to the first coordinate.

$C_{b} (ϕ) : C_{b} (Z_{1}) \to C_{b} (Z_{1} \times Z_{2})$ is called a canonical map, whose action is almost trivial: it sends a function of $z_{i}$ to itself, but viewed as a function of $(z_{1}, z_{2})$ .
$M (ϕ) : M (Z_{1} \times Z_{2}) \to M (Z_{1})$ is called marginalization, which simply takes a joint distribution to a marginal distribution.

The Fenchel-Rockafellar duality (see [40] Theorem 1.9, or [56] in the case of finite dimensional vector spaces) usually refers to the

k = 1

special case of the following result.

Theorem 4.

Assume that A is a topological vector space whose dual is

A^{*}

. Let

Θ_{j} : A \to R \cup {+ \infty}

,

j = 0, 1, \dots, k

, for some positive integer k. Suppose there exist some

{(u_{j})}_{j = 1}^{k}

and

u_{0} : = - (u_{1} + \dots + u_{k})

such that:

\begin{matrix} Θ_{j} (u_{j}) < \infty, j = 0, \dots, k \end{matrix}

(26)

and

Θ_{0}

is upper semicontinuous at

u_{0}

. Then:

\begin{matrix} - inf_{ℓ \in A^{*}} [\sum_{j = 0}^{k} Θ_{j}^{*} (ℓ)] = inf_{u_{1}, \dots, u_{k} \in A} [Θ_{0} (- \sum_{j = 1}^{k} u_{j}) + \sum_{j = 1}^{k} Θ_{j} (u_{j})] . \end{matrix}

(27)

For completeness, we provide a proof of this result, which is based on the Hahn-Banach theorem (Theorem 5) and is similar to the proof of [40] Theorem 1.9.

Proof.

Let

m_{0}

be the right side of (27). The ≤ part of (27) follows trivially from the (weak) min-max inequality since:

\begin{matrix} m_{0} & = inf_{u_{0}, \dots, u_{k} \in A} sup_{ℓ \in A^{*}} \{\sum_{j = 0}^{k} Θ_{j} (u_{j}) - ℓ (\sum_{j = 0}^{k} u_{j})\} \\ \geq sup_{ℓ \in A^{*}} inf_{u_{0}, \dots, u_{k} \in A} \{\sum_{j = 0}^{k} Θ_{j} (u_{j}) - ℓ (\sum_{j = 0}^{k} u_{j})\} \\ = - inf_{ℓ \in A^{*}} [\sum_{j = 0}^{k} Θ_{j}^{*} (ℓ)] . \end{matrix}

(28)

It remains to prove the ≥ part, and it suffices to assume without loss of generality that

m_{0} > - \infty

. Note that (26) also implies that

m_{0} < + \infty

. Define convex sets:

\begin{matrix} C_{j} & : = {(u, r) \in A \times R : r > Θ_{j} (u)}, j = 0, \dots, k; \end{matrix}

(29)

\begin{matrix} B & : = {(0, m) \in A \times R : m \leq m_{0}} . \end{matrix}

(30)

Observe that these are nonempty sets because of (26). Furthermore,

C_{0}

has a nonempty interior by the assumption that

Θ_{0}

is upper semicontinuous at

u_{0}

. Thus, the Minkowski sum:

\begin{matrix} C : = C_{0} + \dots + C_{k} \end{matrix}

(31)

is a convex set with a nonempty interior. Moreover,

C \cup B = \emptyset

. By the Hahn-Banach theorem (Theorem 5), there exists

(ℓ, s) \in A^{*} \times R

such that:

\begin{matrix} s m \leq ℓ (\sum_{j = 0}^{k} u_{j}) + s \sum_{j = 0}^{k} r_{j} . \end{matrix}

(32)

For any

m \leq m_{0}

and

(u_{j}, r_{j}) \in C_{j}

,

j = 0, \dots, k

. From (30), we see (32) can only hold when

s \geq 0

. Moreover, from (26) and the upper semicontinuity of

Θ_{0}

at

u_{0}

, we see that the

\sum_{j = 0}^{k} u_{j}

in (32) can take a value in a neighborhood of

0 \in A

; hence,

s \neq 0

. Thus, by dividing s on both sides of (32) and setting

ℓ \leftarrow - ℓ / s

, we see that:

\begin{matrix} m_{0} & \leq inf_{u_{0}, \dots, u_{k} \in A} [- ℓ (\sum_{j = 0}^{k} u_{j}) + \sum_{j = 0}^{k} Θ_{j} (u_{j})] \\ = - [\sum_{j = 0}^{k} Θ_{j}^{*} (ℓ)] \end{matrix}

(33)

which establishes ≥ in (27). ☐

Theorem 5 (Hahn-Banach)

Let C and B be convex, nonempty disjoint subsets of a topological vector space A.

1.: If the interior of C is non-empty, then there exists $ℓ \in A^{*}$ , $ℓ \neq 0$ such that:

$\begin{matrix} sup_{u \in B} ℓ (u) \leq inf_{u \in C} ℓ (u) . \end{matrix}$

(34)
2.: If A is locally convex, B is compact and C is closed, then there exists $ℓ \in A^{*}$ such that:

$\begin{matrix} sup_{u \in B} ℓ (u) < inf_{u \in C} ℓ (u) . \end{matrix}$

(35)

Remark 3.

The assumption in Theorem 5 that C has a nonempty interior is only necessary in the infinite dimensional case. However, even if A in Theorem 4 is finite dimensional, the assumption in Theorem 4 that

Θ_{0}

is upper semicontinuous at

u_{0}

is still necessary, because this assumption was not only used in applying Hahn-Banach, but also in concluding that

s \neq 0

in (32).

3. The Entropic-Functional Duality

In this section, we prove Theorem 1 and some of its generalizations.

3.1. Compact $X$

We first state a duality theorem for the case of compact spaces to streamline the proof. Later, we show that the argument can be extended to a particular non-compact case (Theorem 1 is not included in the conference paper [49], but was announced in the conference presentation). Our proof based on the Legendre-Fenchel duality (Theorem 4) was inspired by the proof of the Kantorovich duality in the theory of optimal transportation (see [40] Chapter 1, where the idea was credited to Brenier).

Recall from Section 2 that a random transformation (a mapping between probability measures) is formally the dual of a conditional expectation operator. Suppose

P_{Y_{j} | X} = T_{j}^{*}

,

j = 1, \dots, m

and

P_{Z_{i} | X} = S_{i}^{*}

,

i = 1, \dots, l

.

Proof of Theorem 1.

We can safely assume

d = 0

below without loss of generality (since otherwise, we can always substitute

μ_{1} \leftarrow exp (\frac{d}{c_{1}}) μ_{1}

).

1)⇒2)

This is the nontrivial direction, which relies on certain (strong) min-max type results. In Theorem 4, put (in (36),

u \leq 0

means that u is pointwise non-positive):

\begin{matrix} Θ_{0} : u \in C_{b} (X) \mapsto \{\begin{matrix} 0 & u \leq 0; \\ + \infty & otherwise . \end{matrix} \end{matrix}

(36)

Then,

\begin{matrix} Θ_{0}^{*} : π \in M (X) \mapsto \{\begin{matrix} 0 & π \geq 0; \\ + \infty & otherwise . \end{matrix} \end{matrix}

(37)

For each

j = 1, \dots, m

, set:

\begin{matrix} Θ_{j} (u) : = c_{j} inf log μ_{j} (exp (\frac{1}{c_{j}} v)) \end{matrix}

(38)

where the infimum is over

v \in C_{b} (Y)

such that

u = T_{j} v

; if there is no such v, then

Θ_{j} (u) : = + \infty

as a convention. Observe that:

$Θ_{j}$ is convex: indeed, given arbitrary $u^{0}$ and $u^{1}$ , suppose that $v^{0}$ and $v^{1}$ respectively achieve the infimum in (38) for $u^{0}$ and $u^{1}$ (if the infimum is not achievable, the argument still goes through by the approximation and limit argument). Then, for any $α \in [0, 1]$ , $v^{α} : = (1 - α) v^{0} + α v^{1}$ satisfies $u^{α} = T_{j} v^{α}$ where $u^{α} : = (1 - α) u^{0} + α u^{1}$ . Thus, the convexity of $Θ_{j}$ follows from the convexity of the functional in (23);
$Θ_{j} (u) > - \infty$ for any $u \in C_{b} (X)$ . Otherwise, for any $P_{X}$ and $P_{Y_{j}} : = T_{j}^{*} P_{X}$ , we have:

$\begin{matrix} D (P_{Y_{j}} ∥ μ_{j}) & = sup_{v} {P_{Y_{j}} (v) - log μ_{j} (exp (v))} \end{matrix}$

(39)

$\begin{matrix} = sup_{v} {P_{X} (T_{j} v) - log μ_{j} (exp (v))} \end{matrix}$

(40)

$\begin{matrix} = sup_{u \in C_{b} (X)} \{P_{X} (u) - \frac{1}{c_{j}} Θ_{j} (c_{j} u)\} \end{matrix}$

(41)

$\begin{matrix} = + \infty \end{matrix}$

(42)

which contradicts the assumption that $\sum_{j = 1}^{m} c_{j} D (P_{Y_{j}} ∥ μ_{j}) < \infty$ in the theorem;
From Steps (39)–(41), we see $Θ_{j}^{*} (π) = c_{j} D (T_{j}^{*} π ∥ μ_{j})$ for any $π \in M (X)$ , where the definition of $D (\cdot ∥ μ_{j})$ is extended using the Donsker-Varadhan formula (that is, it is infinite when the argument is not a probability measure).

Finally, for the given

{(P_{Z_{i}})}_{i = 1}^{l}

, choose:

\begin{matrix} Θ_{m + 1} : u \in C_{b} (X) \mapsto \{\begin{matrix} \sum_{i = 1}^{l} P_{Z_{i}} (w_{i}) & if u = \sum_{i = 1}^{l} S_{i} w_{i} for some w_{i} \in C_{b} (Z_{i}); \\ + \infty & otherwise . \end{matrix} \end{matrix}

(43)

Notice that:

$Θ_{m + 1}$ is convex;
$Θ_{m + 1}$ is well defined (that is, the choice of $(w_{i})$ in (43) is inconsequential). Indeed, if ${(w_{i})}_{i = 1}^{l}$ is such that $\sum_{i = 1}^{l} S_{i} w_{i} = 0$ , then:

$\begin{matrix} \sum_{i = 1}^{l} P_{Z_{i}} (w_{i}) & = \sum_{i = 1}^{l} S_{i}^{*} P_{X} (w_{i}) \\ = \sum_{i = 1}^{l} P_{X} (S_{i} w_{i}) \\ = 0, \end{matrix}$

(44)

where $P_{X}$ is such that $S_{i}^{*} P_{X} = P_{Z_{i}}$ , $i = 1, \dots, l$ , whose existence is guaranteed by the assumption of the theorem. This also shows that $Θ_{m + 1} > - \infty$ .
$\begin{matrix} Θ_{m + 1}^{*} (π) & : = sup_{u} {π (u) - Θ_{m + 1} (u)} \\ = sup_{w_{1}, \dots, w_{l}} \{π (\sum_{i = 1}^{l} S_{i} w_{i}) - \sum_{i = 1}^{l} P_{Z_{i}} (w_{i})\} \\ = sup_{w_{1}, \dots, w_{l}} \{\sum_{i = 1}^{l} S_{i}^{*} π (w_{i}) - \sum_{i = 1}^{l} P_{Z_{i}} (w_{i})\} \\ = \{\begin{matrix} 0 & if S_{i}^{*} π = P_{Z_{i}}, i = 1, \dots, l; \\ + \infty & otherwise . \end{matrix} \end{matrix}$

(45)

Invoking Theorem 4 (where the

u_{j}

in Theorem 4 can be chosen as the constant function

u_{j} \equiv 1

,

j = 1, \dots, m + 1

):

\begin{matrix} inf_{π : π \geq 0, S_{i}^{*} π = P_{Z_{i}}} \sum_{j = 1}^{m} c_{j} D (T_{j}^{*} π ∥ μ_{j}) \\ = - inf_{v^{m}, w^{l} : \sum_{j = 1}^{m} T_{j} v_{j} + \sum_{i = 1}^{l} S_{i} w_{i} \geq 0} [\sum_{j = 1}^{m} c_{j} log μ_{j} (exp (\frac{1}{c_{j}} v_{j})) + \sum_{i = 1}^{l} P_{Z_{i}} (w_{i})] \end{matrix}

(46)

where

v^{m}

denotes the collection of the functions

v_{1}, \dots, v_{m}

, and similarly for

w^{l}

. Note that the left side of (46) is exactly the right side of (14). For any

ϵ > 0

, choose

v_{j} \in C_{b} (Y_{j})

,

j = 1, \dots, m

and

w_{i} \in C_{b} (Z_{i})

,

i = 1, \dots, l

such that

\sum_{j = 1}^{m} T_{j} v_{j} + \sum_{i = 1}^{l} S_{i} w_{i} \geq 0

and:

\begin{matrix} ϵ - \sum_{j = 1}^{m} c_{j} log μ_{j} (exp (\frac{1}{c_{j}} v_{j})) - \sum_{i = 1}^{l} P_{Z_{i}} (w_{i}) > inf_{π : π \geq 0, S_{i}^{*} π = P_{Z_{i}}} \sum_{j = 1}^{m} c_{j} D (T_{j}^{*} π ∥ μ_{j}) \end{matrix}

(47)

Now, invoking (13) with

f_{j} : = exp (\frac{1}{c_{j}} v_{j})

,

j = 1, \dots, m

and

g_{i} : = exp (- \frac{1}{b_{i}} w_{i})

,

i = 1, \dots, l

, we upper bound the left side of (47) by:

\begin{matrix} ϵ - \sum_{i = 1}^{l} b_{i} log ν_{i} (g_{i}) + \sum_{i = 1}^{l} b_{i} P_{Z_{i}} (log g_{i}) \leq ϵ + \sum_{i = 1}^{l} b_{i} D (P_{Z_{i}} ∥ ν_{i}) \end{matrix}

(48)

where the last step follows by the Donsker-Varadhan formula. Therefore, (14) is established since

ϵ > 0

is arbitrary.

2)⇒1)

Since

ν_{i}

is finite and

g_{i}

is bounded by assumption, we have

ν_{i} (g_{i}) < \infty

,

i = 1, \dots, l

. Moreover, (13) is trivially true when

ν_{i} (g_{i}) = 0

for some i, so we will assume below that

ν_{i} (g_{i}) \in (0, \infty)

for each i. Define

P_{Z_{i}}

by:

\begin{matrix} \frac{d P_{Z_{i}}}{d ν_{i}} = \frac{g_{i}}{ν_{i} (g_{i})}, i = 1, \dots, l . \end{matrix}

(49)

Then, for any

ϵ > 0

,

\begin{matrix} \sum_{i = 1}^{l} b_{i} log ν_{i} (g_{i}) & = \sum_{i = 1}^{l} b_{i} [P_{Z_{i}} (log g_{i}) - D (P_{Z_{i}} ∥ ν_{i})] \end{matrix}

(50)

\begin{matrix} < \sum_{j = 1}^{m} c_{j} P_{Y_{j}} (log f_{j}) + ϵ - \sum_{j = 1}^{m} c_{j} D (P_{Y_{j}} ∥ μ_{j}) \end{matrix}

(51)

\begin{matrix} \leq ϵ + \sum_{j = 1}^{m} c_{j} log μ_{j} (f_{j}) \end{matrix}

(52)

where:

(51) uses the Donsker-Varadhan formula, and we have chosen $P_{X}$ , $P_{Y_{j}} : = T_{j}^{*} P_{X}$ , $j = 1, \dots, m$ such that:

$\begin{matrix} \sum_{i = 1}^{l} b_{i} D (P_{Z_{i}} ∥ ν_{i}) > \sum_{j = 1}^{m} c_{j} D (P_{Y_{j}} ∥ μ_{j}) - ϵ \end{matrix}$

(53)
(52) also follows from the Donsker-Varadhan formula.

The result follows since

ϵ > 0

can be arbitrary.

☐

Remark 4.

Condition (iv) in the theorem imposes a rather strong assumption on

(S_{i})

: for simplicity, consider the case where

| X |, | Z_{i} | < \infty

. Then, Condition (iv) assumes that for any

(P_{Z_{i}})

, there exists

P_{X}

such that

P_{Z_{i}} = S_{i}^{*} P_{X}

. This assumption is certainly satisfied when

(S_{i})

are induced by coordinate projections; the case of

l = 1

and

P_{Z | X}

being a reverse erasure channel gives a simple example where

P_{Z | X}

is not a deterministic map.

Next, we give a generalization of Theorem 1, which alleviates the restriction on

(S_{i})

:

Theorem 6.

Theorem 1 continues to hold if Condition (iv) therein is weakened to the following:

For any $P_{X}$ such that $D (S_{i}^{*} P_{X} ∥ ν_{i}) < \infty$ , $i = 1, \dots, l$ , there exists ${\tilde{P}}_{X}$ such that $S_{i}^{*} {\tilde{P}}_{X} = S_{i}^{*} P_{X}$ for each i and $\sum_{j = 1}^{m} c_{j} D (T_{j}^{*} {\tilde{P}}_{X} ∥ μ_{j}) < \infty$ for each j.

and the conclusion of the theorem will be replaced by the equivalence of the following two statements:

1.: For any nonnegative continuous functions $(g_{i})$ , $(f_{j})$ bounded away from zero and such that:

$\begin{matrix} \sum_{i = 1}^{l} b_{i} S_{i} log g_{i} \leq \sum_{j = 1}^{m} c_{j} T_{j} log f_{j} \end{matrix}$

(54)

we have:

$\begin{matrix} inf_{({\tilde{g}}_{i}) : \sum_{i = 1}^{l} b_{i} S_{i} log {\tilde{g}}_{i} \geq \sum_{i = 1}^{l} b_{i} S_{i} log g_{i}} \prod_{i = 1}^{l} ν_{i}^{b_{i}} ({\tilde{g}}_{i}) \leq exp (d) \prod_{j = 1}^{m} μ_{j}^{c_{j}} (f_{j}) . \end{matrix}$

(55)
2.: For any $(P_{X})$ such that $D (S_{i}^{*} P_{X} ∥ ν_{i}) < \infty$ , $i = 1, \dots, l$ ,

$\begin{matrix} \sum_{i = 1}^{l} b_{i} D (S_{i}^{*} P_{X} ∥ ν_{i}) + d \geq inf_{{\tilde{P}}_{X} : S_{i}^{*} {\tilde{P}}_{X} = S_{i}^{*} P_{X}} \sum_{j = 1}^{m} c_{j} D (T_{j}^{*} {\tilde{P}}_{X} ∥ μ_{j}) . \end{matrix}$

(56)

In Appendix A, we show that Theorem 6 indeed recovers Theorem 1 for the more restricted class of random transformations.

Proof.

Here, we mention the parts of the proof that need to be changed: upon specifying

(f_{j})

and

(g_{i})

right after (47), we select

({\tilde{g}}_{i})

such that:

\begin{matrix} \sum_{i = 1}^{l} b_{i} S_{i} log {\tilde{g}}_{i} & \geq \sum_{i = 1}^{l} b_{i} S_{i} log g_{i} \end{matrix}

(57)

\begin{matrix} \sum_{i = 1}^{l} b_{i} log ν_{i} ({\tilde{g}}_{i}) & \leq \sum_{j = 1}^{m} c_{j} log μ_{j} (f_{j}) + ϵ . \end{matrix}

(58)

Then, in lieu of (59), we upper-bound the left side of (47) by:

\begin{matrix} 2 ϵ - \sum_{i = 1}^{l} b_{i} log ν_{i} ({\tilde{g}}_{i}) + \sum_{i = 1}^{l} b_{i} P_{Z_{i}} (log {\tilde{g}}_{i}) \leq 2 ϵ + \sum_{i = 1}^{l} b_{i} D (P_{Z_{i}} ∥ ν_{i}) \end{matrix}

(59)

which establishes the 1)⇒2) part. For the other direction, for each

i \in {1, 2, \dots, l}

, define:

\begin{matrix} Λ_{i} (u) : = inf_{{\tilde{g}}_{i} > 0 : b_{i} S_{i} log {\tilde{g}}_{i} = u} b_{i} log ν_{i} ({\tilde{g}}_{i}) . \end{matrix}

(60)

Then, following essentially the same proof as that of

Θ_{j}

in (38), we see that

Λ_{i}

is proper convex and:

\begin{matrix} Λ_{i}^{*} (π) = b_{i} D (S_{i}^{*} π ∥ μ_{j}) . \end{matrix}

(61)

Moreover, let:

\begin{matrix} Λ_{l + 1} (u) : = \{\begin{matrix} 0 & if u = - \sum b_{i} S_{i} log g_{i}; \\ + \infty & otherwise . \end{matrix} \end{matrix}

(62)

Then,

Λ_{l + 1}^{*} (π) = - \sum b_{i} S_{i}^{*} π (log g_{i})

. Using the Legendre-Fenchel duality, we see that for any

ϵ > 0

,

\begin{matrix} inf_{({\tilde{g}}_{i}) : \sum_{i = 1}^{l} b_{i} S_{i} log {\tilde{g}}_{i} \geq \sum_{i = 1}^{l} b_{i} S_{i} log g_{i}} \sum_{i = 1}^{l} b_{i} log ν_{i} ({\tilde{g}}_{i}) \end{matrix}

\begin{matrix} = inf_{u_{1}, \dots, u_{l + 1}} \{Θ_{0} (- \sum_{i = 1}^{l + 1} u_{i}) + \sum_{i = 1}^{l + 1} Λ_{i} (u_{i})\} \end{matrix}

(63)

\begin{matrix} = sup_{π} \{- \sum_{i = 0}^{l + 1} Θ_{i}^{*} (π)\} \end{matrix}

(64)

\begin{matrix} = sup_{π \geq 0} \{- \sum_{i = 1}^{l + 1} Θ_{i}^{*} (π)\} \end{matrix}

(65)

\begin{matrix} = sup_{π \geq 0} \{\sum_{i = 1}^{l} b_{i} S_{i}^{*} π (log g_{i}) - \sum_{i = 1}^{l} b_{i} D (S_{i}^{*} π ∥ ν_{i})\} \end{matrix}

(66)

\begin{matrix} \leq \sum_{i = 1}^{l} b_{i} S_{i}^{*} P_{X} (log g_{i}) - \sum_{i = 1}^{l} b_{i} D (S_{i}^{*} P_{X} ∥ ν_{i}) + ϵ \end{matrix}

(67)

\begin{matrix} \leq \sum_{j = 1}^{m} c_{j} T_{j}^{*} {\tilde{P}}_{X} (log f_{j}) - \sum_{j = 1}^{m} c_{j} D (T_{j}^{*} {\tilde{P}}_{X} ∥ μ_{j}) + 2 ϵ \end{matrix}

(68)

\begin{matrix} \leq 2 ϵ + \sum_{j = 1}^{m} c_{j} log μ_{j} (f_{j}) \end{matrix}

(69)

where:

To see (67), we note that the sup in (66) can be restricted to $π$ , which is a probability measure, since otherwise, the relative entropy terms in (66) are $+ \infty$ by its definition via the Donsker-Varadhan formula. Then, we select $P_{X}$ such that (67) holds.
In (68), we have chosen ${\tilde{P}}_{X}$ such that:

$\begin{matrix} S_{i}^{*} {\tilde{P}}_{X} & = S_{i}^{*} P_{X}, 1 \leq i \leq l; \end{matrix}$

(70)

$\begin{matrix} \sum_{i = 1}^{l} b_{i} D (S_{i}^{*} P_{X}) & > \sum_{j = 1}^{m} c_{j} D (T_{j}^{*} {\tilde{P}}_{X} ∥ μ_{j}) - ϵ, \end{matrix}$

(71)

and then applied the assumption (54). The result follows since $ϵ > 0$ can be arbitrary.
☐

Remark 5.

The infimum in (14) is in fact achievable: for any

(P_{Z_{i}})

, there exists a

P_{X}

that minimizes

\sum_{j = 1}^{m} c_{j} D (P_{Y_{j}} ∥ μ_{j})

subject to the constraints

S_{i}^{*} P_{X} = P_{Z_{i}}

,

i = 1, \dots m

, where

P_{Y_{j}} : = T_{j}^{*} P_{X}

,

j = 1, \dots, m

. Indeed, since the singleton

{P_{Z_{i}}}

is weak

^{*}

-closed and

S_{i}^{*}

is weak

^{*}

-continuous (Generally, if

T : A \to B

is a continuous map between two topologically vector spaces, then

T^{*} : B^{*} \to A^{*}

is a weak

^{*}

continuous map between the dual spaces. Indeed, if

y_{n} \to y

is a weak

^{*}

-convergent subsequence in

B^{*}

, meaning

y_{n} (b) \to y (b)

for any

b \in B

, then, we must have

T^{*} y_{n} (a) = y_{n} (T a) \to y (T a) = T^{*} y (a)

for any

a \in A

, meaning that

T^{*} y_{n}

converges to

T^{*} y

in the weak

^{*}

topology.), the set

⋂_{i = 1}^{l} {(S_{i}^{*})}^{- 1} P_{Z_{i}}

is weak

^{*}

-closed in

M (X)

; hence, its intersection with

P (X)

is weak

^{*}

-compact in

P (X)

, because

P (X)

is weak

^{*}

-compact by (a simple version for the setting of a compact underlying space

X

of) the Prokhorov theorem [57]. Moreover, by the weak

^{*}

-lower semicontinuity of

D (\cdot ∥ μ_{j})

(easily seen from the variational formula/Donsker-Varadhan formula of the relative entropy, cf. [58]) and the weak

^{*}

-continuity of

T_{j}^{*}

,

j = 1, \dots, m

, we see that

\sum_{j = 1}^{m} c_{j} D (T_{j}^{*} P_{X} ∥ μ_{j})

is weak

^{*}

-lower semicontinuous in

P_{X}

, and hence, the existence of a minimizing

P_{X}

is established.

Remark 6.

Abusing the terminology from min-max theory, Theorem 1 may be interpreted as a “strong duality” result, which establishes the equivalence of two optimization problems. The 1)⇒2) part is the non-trivial direction, which requires regularity on the spaces. In contrast, the 2)⇒1) direction can be thought of as a “weak duality”, which establishes only a partial relation, but holds for more general spaces.

3.2. Noncompact $X$

Our proof of 1)⇒2) in Theorem 1 makes use of the Hahn-Banach theorem and hence relies crucially on the fact that the measure space is the dual of the function space. Naively, one might want to extend the the proof to the case of locally compact

X

by considering

C_{0} (X)

instead of

C_{b} (X)

, so that the dual space is still

M (X)

. However, this would not work: consider the case when

X = Z_{1} \times, \dots, \times Z_{l}

and each

S_{i}

is the canonical map. Then,

Θ_{m + 1} (u)

as defined in (43) is

+ \infty

unless

u \equiv 0

(because

u \in C_{0} (X)

requires that u vanishes at infinity); thus,

Θ_{m + 1}^{*} \equiv 0

. Luckily, we can still work with

C_{b} (X)

; in this case,

ℓ \in C_{b} {(X)}^{*}

may not be a measure, but we can decompose it into

ℓ = π + R

where

π \in M (X)

and R is a linear functional “supported at infinity”. Below, we use the techniques in [40] (Chapter 1.3) to prove a particular extension of Theorem 1 to a non-compact case.

Theorem 7.

Theorem 1 still holds if

The assumption that $X$ is a compact metric space is relaxed to the assumption that it is a locally compact and σ-compact Polish space;
$X = \prod_{i = 1}^{l} Z_{i}$ and $S_{i} : C_{b} (Z_{i}) \to C_{b} (X)$ , $i = 1, \dots, l$ are canonical maps (see Definition 2).

Proof.

The proof of the “weak duality” part 2)⇒1) still works in the noncompact case, so we only need to explain what changes need to be made in the proof of the 1)⇒2) part. Let

Θ_{0}

be defined as before, in (36). Then, for any

ℓ \in C_{b} {(X)}^{*}

,

\begin{matrix} Θ_{0}^{*} (ℓ) = sup_{u \leq 0} ℓ (u) \end{matrix}

(72)

which is zero if ℓ is nonnegative (in the sense that

ℓ (u) \geq 0

for every

u \geq 0

), and

+ \infty

otherwise. This means that when computing the infimum on the left side of (27), we only need to take into account those nonnegative ℓ.

Next, let

Θ_{m + 1}

be also defined as before. Then, directly from the definition, we have:

\begin{matrix} Θ_{m + 1}^{*} (ℓ) & = \{\begin{matrix} 0 & if ℓ (\sum_{i} S_{i} w_{i}) = \sum_{i} P_{Z_{i}} (w_{i}), \forall w_{i} \in C_{b} (Z_{i}), i = 1, \dots l; \\ + \infty & otherwise . \end{matrix} \end{matrix}

(73)

For any

ℓ \in C_{b}^{*} (X)

. Generally, the condition in the first line of (73) does not imply that ℓ is a measure. However, if ℓ is also nonnegative, then using a technical result in [40] Lemma 1.25, we can further simplify:

\begin{matrix} Θ_{m + 1}^{*} (ℓ) & = \{\begin{matrix} 0 & if ℓ \in M (X) and S_{i}^{*} ℓ = P_{Z_{i}}, i = 1, \dots, l; \\ + \infty & otherwise . \end{matrix} \end{matrix}

(74)

This further shows that when we compute the left side of (27), the infimum can be taken over ℓ, which is a coupling of

(P_{Z_{i}})

. In particular, if ℓ is a probability measure, then

Θ_{j}^{*} (ℓ) = c_{j} D (T_{j}^{*} ℓ ∥ μ_{j})

still holds with the

Θ_{j}

defined in (38),

j = 1, \dots, m

. Thus, the rest of the proof can proceed as before. ☐

Remark 7.

The second assumption is made in order to achieve (74) in the proof.

4. Gaussian Optimality

Recall that the conventional Brascamp-Lieb inequality and its reverse ((1) and (2)) state that centered Gaussian functions exhaust such inequalities, and in particular, verifying those inequalities is reduced to a finite dimensional optimization problem (only the covariance matrices in these Gaussian functions are to be optimized). In this section, we show that similar results hold for the forward-reverse Brascamp-Lieb inequality, as well. Our proof uses the rotational invariance argument mentioned in Section 1. Since the forward-reverse Brascamp-Lieb inequality has dual representations (Theorem 7), in principle, the rotational invariance argument can be applied either to the functional representation (as in Lieb’s paper [29]) or the entropic representation (as in Geng-Nair [48]). Here, we adopt the latter approach. We first consider a certain “non-degenerate” case where the existence of an extremizer is guaranteed. Then, Gaussian optimality in the general case follows by a limiting argument (Appendix F), establishing Theorem 2.

4.1. Non-Degenerate Forward Channels

This subsection focuses on the following case:

Assumption 1.

Fix Lebesgue measures ${(μ_{j})}_{j = 1}^{m}$ and Gaussian measures ${(ν_{i})}_{i = 1}^{l}$ on $R$ ;
non-degenerate (Definition 3 below) linear Gaussian random transformation ${(P_{Y_{j} | X})}_{j = 1}^{m}$ (where $X : = (X_{1}, \dots, X_{l})$ ) associated with conditional expectation operators ${(T_{j})}_{j = 1}^{m}$ ;
${(S_{i})}_{i = 1}^{l}$ are induced by coordinate projections;
positive $(c_{j})$ and $(b_{i})$ .

Definition 3.

We say

(Q_{Y_{1} | X}, \dots, Q_{Y_{m} | X})

is non-degenerate if each

Q_{Y_{j} | X = 0}

is an

n_{j}

-dimensional Gaussian distribution with an invertible covariance matrix.

Given Borel measures

P_{X_{i}}

on

R

,

i = 1, \dots, l

, define:

\begin{matrix} F_{0} ((P_{X_{i}})) : = inf_{P_{X}} \sum_{j = 1}^{m} c_{j} D (P_{Y_{j}} ∥ μ_{j}) - \sum_{i = 1}^{l} b_{i} D (P_{X_{i}} ∥ ν_{i}) \end{matrix}

(75)

where the infimum is over Borel measures

P_{X}

that have

(P_{X_{i}})

as marginals. Note that (75) is well defined since the first term cannot be

+ \infty

under the non-degenerate assumption, and the second term cannot be

- \infty

. The aim of this subsection is to prove the following:

Theorem 8.

{sup}_{(P_{X_{i}})} F_{0} ((P_{X_{i}}))

, where the supremum is over Borel measures

P_{X_{i}}

on

R

, and

i = 1, \dots, l

, is achieved by some Gaussian

{(P_{X_{i}})}_{i = 1}^{l}

, in which case the infimum in (75) is achieved by some Gaussian

P_{X}

.

Naturally, one would expect that Gaussian optimality can be established when

{(μ_{j})}_{j = 1}^{m}

and

{(ν_{i})}_{i = 1}^{l}

are either Gaussian or Lebesgue. We made the assumption that the former is Lebesgue and the latter is Gaussian so that certain technical conditions can be justified more easily. More precisely, the following observation shows that we can regularize the distributions by a second moment constraint for free:

Proposition 1.

{sup}_{(P_{X_{i}})} F_{0} ((P_{X_{i}}))

is finite and there exist

σ_{i}^{2} \in (0, \infty)

,

i = 1, \dots, l

such that it equals:

\begin{matrix} sup_{(P_{X_{i}}) : E [X_{i}^{2}] \leq σ_{i}^{2}} F_{0} ((P_{X_{i}})) . \end{matrix}

(76)

Proof.

When

μ_{j}

is Lebesgue and

P_{Y_{j} | X}

is non-degenerate,

D (P_{Y_{j}} ∥ μ_{j}) = - h (P_{Y_{j}}) \leq - h (P_{Y_{j}} | X)

is bounded above (in terms of the variance of the additive noise of

P_{Y_{j} | X}

). Moreover,

D (P_{X_{i}} ∥ ν_{i}) \geq 0

when

ν_{i}

is Gaussian, so

{sup}_{(P_{X_{i}})} F_{0} ((P_{X_{i}})) < \infty

. Further, choosing

(P_{X_{i}}) = (ν_{i})

and using the covariance matrix to lower bound the first term in (75) show that

{sup}_{(P_{X_{i}})} F_{0} ((P_{X_{i}})) > - \infty

.

To see (76), notice that:

\begin{matrix} D (P_{X_{i}} ∥ ν_{i}) & = D (P_{X_{i}} ∥ ν_{i}^{'}) + E [ı_{ν_{i}^{'} ∥ ν_{i}} (X)] \\ = D (P_{X_{i}} ∥ ν_{i}^{'}) + D (ν_{i}^{'} ∥ ν_{i}) \\ \geq D (ν_{i}^{'} ∥ ν_{i}) \end{matrix}

(77)

where

ν_{i}^{'}

is a Gaussian distribution with the same first and second moments as

X_{i} \sim P_{X_{i}}

. Thus,

D (P_{X_{i}} ∥ ν_{i})

is bounded below by some function of the second moment of

X_{i}

, which tends to ∞ as the second moment of

X_{i}

tends to ∞. Moreover, as argued in the preceding paragraph, the first term in (75) is bounded above by some constant depending only on

(P_{Y_{j} | X})

. Thus, we can choose

σ_{i}^{2} > 0

,

i = 1, \dots, l

large enough such that if

E [X_{i}^{2}] > σ_{i}^{2}

for some of i, then

F_{0} ((P_{X_{i}})) < {sup}_{(P_{X_{i}})} F_{0} ((P_{X_{i}}))

, irrespective of the choices of

P_{X_{1}}, \dots, P_{X_{i - 1}}, P_{X_{i + 1}}, \dots, P_{X_{l}}

. Then, these

σ_{1}, \dots, σ_{l}

are as desired in the proposition. ☐

The non-degenerate assumption ensures that the supremum is achieved:

Proposition 2.

Under Assumption 1,

1.: For any ${(P_{X_{i}})}_{i = 1}^{l}$ , the infimum in (75) is attained by some Borel $P_{X}$ .
2.: If ${(P_{Y_{j} | X^{l}})}_{j = 1}^{m}$ are non-degenerate (Definition 3), then the supremum in (76) is achieved by some Borel ${(P_{X_{i}})}_{i = 1}^{l}$ .

The proof of Proposition 2 is given in Appendix E. After taking care of the existence of the extremizers, we get into the tensorization properties, which are the crux of the proof:

Lemma 1.

Fix

(P_{X_{i}^{(1)}})

,

(P_{X_{i}^{(2)}})

,

(μ_{j})

,

(T_{j})

,

(c_{j}) \in {[0, \infty)}^{m}

, and let

S_{j}

be induced by coordinate projections. Then:

\begin{matrix} inf_{P_{X^{(1, 2)}} : S_{i}^{* \otimes 2} P_{X^{(1, 2)}} = P_{X_{i}^{(1)}} \times P_{X_{i}^{(2)}}} \sum_{j = 1}^{m} c_{j} D (P_{Y_{j}^{(1, 2)}} ∥ μ_{j}^{\otimes 2}) = \sum_{t = 1, 2} \sum_{j = 1}^{m} c_{j} inf_{P_{X^{(t)}} : S_{i}^{*} P_{X^{(t)}} = P_{X_{i}^{(t)}}} D (P_{Y_{j}^{(t)}} ∥ μ_{j}) \end{matrix}

(78)

where for each j,

\begin{matrix} P_{Y_{j}^{(1, 2)}} : = T_{j}^{* \otimes 2} P_{X^{(1, 2)}} \end{matrix}

(79)

on the left side and:

\begin{matrix} P_{Y_{j}^{(t)}} : = T_{j}^{* \otimes 2} P_{X^{(t)}} \end{matrix}

(80)

on the right side,

t = 1, 2

.

Proof.

We only need to prove the nontrivial ≥ part. For any

P_{X^{(1, 2)}}

on the left side, choose

P_{X^{(t)}}

on the right side by marginalization. Then:

\begin{matrix} D (P_{Y_{j}^{(1, 2)}} ∥ μ_{j}^{\otimes 2}) - \sum_{t} D (P_{Y_{j}^{(t)}} ∥ μ_{j}) & = I (Y_{j}^{(1)}; Y_{j}^{(2)}) \\ \geq 0 \end{matrix}

(81)

for each j. ☐

We are now ready to show the main result of this section.

Proof of Theorem 8.

Assume that $(P_{X_{i}^{(1)}})$ and $(P_{X_{i}^{(2)}})$ are maximizers of $F_{0}$ (possibly equal). Let $P_{X_{i}^{1, 2}} : = P_{X_{i}^{(1)}} \times P_{X_{i}^{(2)}}$ . Define:

$\begin{matrix} X^{+} & : = \frac{1}{\sqrt{2}} (X^{(1)} + X^{(2)}); \end{matrix}$

(82)

$\begin{matrix} X^{-} & : = \frac{1}{\sqrt{2}} (X^{(1)} - X^{(2)}) . \end{matrix}$

(83)

Define $(Y_{j}^{+})$ and $(Y_{j}^{-})$ analogously. Then, $Y_{j}^{+} | {X^{+} = x^{+}, X^{-} = x^{-}} \sim Q_{Y_{j} | X = x^{+}}$ is independent of $x^{-}$ , and $Y_{j}^{-} | {X^{+} = x^{+}, X^{-} = x^{-}} \sim Q_{Y_{j} | X = x^{-}}$ is independent of $x^{+}$ .
Next, we perform the same algebraic expansion as in the proof of tensorization:

$\begin{matrix} \sum_{t = 1}^{2} F_{0} ({(P_{X_{i}^{(t)}})}_{i = 1}^{l}) & = inf_{P_{X^{(1, 2)}} : S_{j}^{* \otimes 2} P_{X^{(1, 2)}} = P_{X_{j}^{(1, 2)}}} \sum_{j} c_{j} D (P_{Y_{j}^{(1, 2)}} ∥ μ_{j}^{\otimes 2}) - \sum_{i} b_{i} D (P_{X_{i}^{(1, 2)}} ∥ ν_{i}^{\otimes 2}) \end{matrix}$

(84)

$\begin{matrix} = inf_{P_{X^{+} X^{-}} : S_{j}^{* \otimes 2} P_{X^{+} X^{-}} = P_{X_{j}^{+} X_{j}^{-}}} \sum_{j} c_{j} D (P_{Y_{j}^{+} Y_{j}^{-}} ∥ μ_{j}^{\otimes 2}) - \sum_{i} b_{i} D (P_{X_{i}^{+} X_{i}^{-}} ∥ ν_{i}^{\otimes 2}) \\ \leq inf_{P_{X^{+} X^{-}} : S_{j}^{* \otimes 2} P_{X^{+} X^{-}} = P_{X_{j}^{+} X_{j}^{-}}} \sum_{j} c_{j} [D (P_{Y_{j}^{+}} ∥ μ_{j}) + D (P_{Y_{j}^{-} | X^{+}} ∥ μ_{j} | P_{X^{+}})] \end{matrix}$

(85)

$\begin{matrix} - \sum_{i} b_{i} [D (P_{X_{i}^{+}} ∥ ν_{i}) + D (P_{X_{i}^{-} | X_{i}^{+}} ∥ ν_{i} | P_{X_{i}^{+}})] \end{matrix}$

(86)

$\begin{matrix} \leq \sum_{j} c_{j} [D (P_{Y_{j}^{+}}^{⋆} ∥ μ_{j}) + D (P_{Y_{j}^{-} | X^{+}}^{⋆} ∥ μ_{j} | P_{X^{+}}^{⋆})] \\ - \sum_{i} b_{i} [D (P_{X_{i}^{+}}^{⋆} ∥ ν_{i}) + D (P_{X_{i}^{-} | X^{+}}^{⋆} ∥ ν_{i} | P_{X^{+}}^{⋆})] \end{matrix}$

(87)

$\begin{matrix} = F_{0} ({(P_{X_{i}^{+}}^{⋆})}_{i = 1}^{l}) + \int F_{0} ({(P_{X_{i}^{-} | X^{+}}^{⋆})}_{i = 1}^{l}) d P_{X^{+}}^{⋆} \end{matrix}$

(88)

$\begin{matrix} \leq \sum_{t = 1}^{2} F_{0} ({(P_{X_{i}^{(t)}})}_{i = 1}^{l}) \end{matrix}$

(89)

where:
- (84) uses Lemma 1.
- (86) is because of the Markov chain $Y_{j}^{+} - X^{+} - Y_{j}^{-}$ (for any coupling).
- In (87), we selected a particular instance of coupling $P_{X^{+} X^{-}}$ , constructed as follows: first, we select an optimal coupling $P_{X^{+}}$ for given marginals $(P_{X_{i}^{+}})$ . Then, for any $x^{+} = {(x_{i}^{+})}_{i = 1}^{l}$ , let $P_{X^{-} | X^{+} = x^{+}}$ be an optimal coupling of $(P_{X_{i}^{-} | X_{i}^{+} = x_{i}^{+}})$ (for a justification that we can select optimal coupling $P_{X^{-} | X^{+} = x^{+}}$ in a way that $P_{X^{-} | X^{+}}$ is indeed a regular conditional probability distribution, see [7]). With this construction, it is apparent that $X_{i}^{+} - X^{+} - X_{i}^{-}$ , and hence:
  
  $\begin{matrix} D (P_{X_{i}^{-} | X_{i}^{+}} ∥ ν_{i} | P_{X_{i}^{+}}) = D (P_{X_{i}^{-} | X^{+}} ∥ ν_{i} | P_{X^{+}}) . \end{matrix}$
  
  (90)
- (88) is because in the above, we have constructed the coupling optimally.
- (89) is because $(P_{X_{i}}^{(t)})$ maximizes $F_{0}$ , $t = 1, 2$ .
Thus, in the expansions above, equalities are attained throughout. Using the differentiation technique as in the case of forward inequality, for almost all $(b_{i})$ , $(c_{j})$ , we have:

$\begin{matrix} D (P_{X_{i}^{-} | X_{i}^{+}} ∥ ν_{i} | P_{X_{i}^{+}}) & = D (P_{X_{i}^{+}} ∥ ν_{i}) \end{matrix}$

(91)

$\begin{matrix} = D (P_{X_{i}^{-}} ∥ ν_{i}), \forall i \end{matrix}$

(92)

where (92) is because by symmetry, we can perform the algebraic expansions in a different way to show that $(P_{X_{i}^{-}})$ is also a maximizer of $F_{0}$ . Then, $I (X_{i}^{+}; X_{i}^{-}) = D (P_{X_{i}^{-} | X_{i}^{+}} ∥ ν_{i} | P_{X_{i}^{+}}) - D (P_{X_{i}^{-}} ∥ ν_{i}) = 0$ , which, combined with $I (X_{i}^{(1)}; X_{i}^{(2)})$ , shows that $X_{i}^{(1)}$ and $X_{i}^{(2)}$ are Gaussian with the same covariance. Lastly, using Lemma 1 and the doubling trick, one can show that the optimal coupling is also Gaussian.
☐

4.2. Analysis of Example 1 Using Gaussian Optimality

We note that Example 1 is a rather simple setting, where (17) can be proven by integrating the two sides of (18) and applying the change of variables, noting that the absolute value of the Jacobian equals one. Nevertheless, it is illuminating to give an alternative proof using the Gaussian optimality result, as a proof of concept. In this section, we only give a proof sketch where certain “technicalities” are not justified. Details of the justifications are deferred to Appendix F.

Proof sketch for the claim in Example 1.

By duality (Theorem 7), it suffices to prove the corresponding entropic inequality. The Gaussian optimality result in Theorem 8 assumed Gaussian reference measures on the output and non-degenerate forward channels in order to simplify the proof of the existence of minimizers; however, supposing that Gaussian optimality extends beyond those technical conditions, we see that it suffices to prove that for any centered Gaussian

(P_{X_{i}})

,

\begin{matrix} \sum_{i = 1}^{l} h (P_{X_{i}}) \leq sup_{P_{X^{l}}} \sum_{j = 1}^{l} h (P_{Y_{j}}) \end{matrix}

(93)

where the supremum is over Gaussian

P_{X^{l}}

with the marginals

P_{X_{1}}, \dots, P_{X_{l}}

and

Y_{j} : = \sum_{i = 1}^{l} m_{j i} X_{i}

. Let

a_{i} : = E [X_{i}^{2}]

, and choose

P_{X^{l}} = \prod_{i = 1}^{l} P_{X_{i}}

; we see that (93) holds if:

\begin{matrix} \sum_{i = 1}^{l} log a_{i} \leq \sum_{j = 1}^{l} log (\sum_{i = 1}^{l} m_{j i}^{2} a_{i}), \forall a_{i} > 0, i = 1, \dots, l, \end{matrix}

(94)

where

(a_{i})

are the eigenvalues and

{(\sum_{i = 1}^{l} m_{j i} a_{i})}_{i = 1}^{l}

are the diagonal entries of the matrix:

\begin{matrix} M diag {(a_{i})}_{1 \leq i \leq l} M^{⊤} . \end{matrix}

(95)

Therefore, (94) holds. ☐

A generalization of Example 1 is as follows.

Proposition 3.

For any orthogonal matrix

M : = {(m_{j i})}_{1 \leq j \leq l, 1 \leq i \leq l}

with nonzero entries, we claim that there exists a neighborhood

U

of the uniform probability vector

(\frac{1}{l}, \dots, \frac{1}{l})

, such that for any

(b_{1}, \dots, b_{l})

and

(c_{1}, \dots, c_{l})

in

U

, the best constant D in the FR-BLinequality (16) equals

exp (H (c^{l}) - H (b^{l}))

where

H (\cdot)

is the entropy functional.

The proposition generalizes the claim in Example 1. Indeed, observe that there is no loss of generality in assuming that

(b_{1}, \dots, b_{l})

and

(c_{1}, \dots, c_{l})

are probability vectors, since by dimensional analysis, we see that the best constant is infinite unless

\sum_{i = 1}^{l} b_{i} = \sum_{j = 1}^{l} c_{j}

; and it is also clear that the best constant is invariant when each

b_{i}

and

c_{j}

is multiplied by the same positive number. Moreover, any orthogonal matrix can be approximated by a sequence of orthogonal

M

with nonzero entries, for which the neighborhood

U

shrinks, but always contains the uniform probability vector

(\frac{1}{l}, \dots, \frac{1}{l})

.

Proof sketch for Proposition 3.

Note that along the same lines as (94), the best constant in the FR-BL inequality equals:

\begin{matrix} D = sup_{a^{l} \in Δ} \frac{\prod_{i = 1}^{l} a_{i}^{b_{i}}}{{sup}_{A ⪰ 0 : A_{i i} = a_{i}} \prod_{j = 1}^{l} {[M A M^{⊤}]}_{j j}^{c_{j}}} \end{matrix}

(96)

where without loss of generality, we assumed

a^{l} \in Δ

is in the probability simplex. We first observe that if the positive semidefinite constraint

A ⪰ 0

in (96) were nonexistent, then the sup in the denominator in (96) would equal

\prod_{j = 1}^{l} c_{j}^{c_{j}}

, and consequently, (96) would equal

exp (H (c^{l}) - H (b^{l}))

, for any

b^{l}, c^{l} \in Δ

not necessarily close to the uniform probability vector. Indeed, fixing

A_{i i} = a_{i}, i = 1, \dots, l

, the linear map from the off-diagonal entries to the diagonal entries of

{MAM}^{⊤}

is onto the space of l-vectors whose entries sum to one; proof of the surjectivity can be reduced to checking the fact that the only diagonal matrix that commutes with

M

is a multiple of the identity matrix. Then, the sup in the denominator is achieved when

{[M A M^{⊤}]}_{j j} = c_{j}, j = 1 \dots l

, which is independent of

a^{l}

.

Next, we argue that the constraint

A ⪰ 0

in (96) is not active when

b^{l}

and

c^{l}

are close to the uniform vector. Denote by

U (t)

the set of l-vectors whose distance (say in total variation) to the uniform vector

(\frac{1}{l}, \dots, \frac{1}{l})

is at most t. Observe that:

There exists $t > 0$ such that for every $a^{l} \in U (t)$ ,

$\begin{matrix} sup_{A ⪰ 0 : A_{i i} = a_{i}} \prod_{j = 1}^{l} {[M A M^{⊤}]}_{j j} = 1 / l^{l} \end{matrix}$

(97)

which follows by continuity and the fact that when $a^{l}$ is uniform, the sup (97) is achieved at the strictly positive definite $A = l^{- 1} I$ .
When $b^{l} = c^{l} = (\frac{1}{l}, \dots, \frac{1}{l})$ is the uniform probability vector, (96) equals one, which is uniquely achieved by $a^{l} = (\frac{1}{l}, \dots, \frac{1}{l})$ . To see the uniqueness, take $A$ to be diagonal in the denominator and observe that the denominator is strictly bigger than the numerator when the diagonals of $M A M^{⊤}$ are not a permutation of $a^{l}$ . Then, since the extreme value of a continuous functions is achieved on a compact set, we can find $ϵ > 0$ such that:

$\begin{matrix} \frac{\prod_{i = 1}^{l} a_{i}^{1 / l}}{{sup}_{A ⪰ 0 : A_{i i} = a_{i}} \prod_{j = 1}^{l} {[M A M^{⊤}]}_{j j}^{1 / l}} < 1 - ϵ \end{matrix}$

(98)

for any $a^{l} \notin U (t / 2)$ .
Finally, by continuity, we can choose $s \in (0, t / 2)$ small enough such that for any $b^{l}, c^{l} \in U (s)$ ,

$\begin{matrix} \frac{\prod_{i = 1}^{l} a_{i}^{b_{i}}}{{sup}_{A ⪰ 0 : A_{i i} = a_{i}} \prod_{j = 1}^{l} {[M A M^{⊤}]}_{j j}^{c_{j}}} & < 1 - ϵ / 2, \forall a^{l} \notin U (t / 2); \end{matrix}$

(99)

$\begin{matrix} sup_{A ⪰ 0 : A_{i i} = a_{i}} \prod_{j = 1}^{l} {[M A M^{⊤}]}_{j j}^{c_{j}} & = sup_{A : A_{i i} = a_{i}} \prod_{j = 1}^{l} {[M A M^{⊤}]}_{j j}^{c_{j}}, \forall a^{l} \in U (t / 2); \end{matrix}$

(100)

$\begin{matrix} exp (H (c^{l}) - H (b^{l})) & > 1 - ϵ / 2 . \end{matrix}$

(101)

Taking the neighborhood

U (s)

proves the claim. ☐

5. Relation to Hypercontractivity and Its Reverses

As alluded to before and illustrated by Figure 2, the forward-reverse Brascamp-Lieb inequality generalizes several other inequalities from functional analysis and information theory; a more complete discussion on these relationships can be found in [7]. In this section, we focus on hypercontractivity and show how its three cases all follow from Theorem 1. Among these, the case in Section 5.3 can be regarded as an instance of the forward-reverse inequality that cannot be reduced to either the forward or the reverse inequality alone. It is also interesting to note that, from the viewpoint of the forward-reverse Brascamp-Lieb inequality, in each of the three special cases, there ought to be three functions involved in the functional formulation; however, the optimal choice of one function can be computed from the other two. Therefore, the conventional functional formulations of the three cases of hypercontractivity involve only two functions, making it non-obvious to find a unifying inequality.

5.1. Hypercontractivity

Fix a joint probability distribution

Q_{Y_{1} Y_{2}}

and nonnegative continuous functions

F_{1}

and

F_{2}

on

Y_{1}

and

Y_{2}

, respectively, both bounded away from zero. In Theorem 1, take

l \leftarrow 1

,

m \leftarrow 2

,

b_{1} \leftarrow 1

,

d \leftarrow 0

,

f_{1} \leftarrow F_{1}^{\frac{1}{c_{1}}}

,

f_{2} \leftarrow F_{2}^{\frac{1}{c_{2}}}

,

ν_{1} \leftarrow Q_{Y_{1} Y_{2}}

,

μ_{1} \leftarrow Q_{Y_{1}}

,

μ_{2} \leftarrow Q_{Y_{2}}

. Furthermore, put

Z_{1} = X = (Y_{1}, Y_{2})

, and let

T_{1}

and

T_{2}

be the canonical maps (Definition 2). The measure spaces and the random transformations are as shown in Figure 3.

The constraint (12) translates to:

\begin{matrix} g_{1} (y_{1}, y_{2}) \leq F_{1} (y_{1}) F_{2} (y_{2}), \forall y_{1}, y_{2} \end{matrix}

(102)

and the optimal choice of

g_{1}

is when the equality is achieved. We thus obtain the equivalence between:

\begin{matrix} ∥ F_{1} ∥_{\frac{1}{c_{1}}} {∥ F_{2} ∥}_{\frac{1}{c_{2}}} \geq E [F_{1} (Y_{1}) F_{2} (Y_{2})], \forall F_{1} \in L^{\frac{1}{c_{1}}} (Q_{Y_{1}}), F_{2} \in L^{\frac{1}{c_{2}}} (Q_{Y_{2}}) \end{matrix}

(103)

and:

\begin{matrix} \forall P_{Y_{1} Y_{2}}, D (P_{Y_{1} Y_{2}} ∥ Q_{Y_{1} Y_{2}}) \geq c_{1} D (P_{Y_{1}} ∥ Q_{Y_{1}}) + c_{2} D (P_{Y_{2}} ∥ Q_{Y_{2}}) . \end{matrix}

(104)

By a standard dense-subspace argument, we see that it is inconsequential that

F_{1}

and

F_{2}

in (103) are not assumed to be continuous, nor bounded away from zero. It is also easy to see that the nonnegativity of

F_{1}

and

F_{2}

is inconsequential for (103).

This equivalence can also be obtained from Theorem 1. By Hölder’s inequality, (103) is equivalent to saying that the norm of the linear operator sending

F_{1} \in L^{\frac{1}{c_{1}}} (Q_{Y_{1}})

to

E [F_{1} (Y_{1}) | Y_{2} = \cdot] \in L^{\frac{1}{1 - c_{2}}} (Q_{Y_{2}})

does not exceed one. The interesting case is

\frac{1}{1 - c_{2}} > \frac{1}{c_{1}}

, hence the name hypercontractivity. The equivalent formulation of hypercontractivity was shown in [44] using a different proof via the method of types/typicality, which requires that

| Y_{1} |, | Y_{2} | < \infty

. In contrast, the proof based on the nonnegativity of relative entropy removes this constraint, allowing one to prove Nelson’s Gaussian hypercontractivity from the information-theoretic formulation (see [7]).

5.2. Reverse Hypercontractivity (Positive Parameters)

By “positive parameters” we mean the

b_{1}

and

b_{2}

in (107) are positive.

Let

Q_{Z_{1} Z_{2}}

be a given joint probability distribution, and let

G_{1}

and

G_{2}

be nonnegative functions on

Z_{1}

and

Z_{2}

, respectively, both bounded away from zero. In Theorem 1, take

l \leftarrow 2

,

m \leftarrow 1

,

c_{1} \leftarrow 1

,

d \leftarrow 0

,

g_{1} \leftarrow G_{1}^{\frac{1}{b_{1}}}

,

g_{2} \leftarrow G_{2}^{\frac{1}{b_{2}}}

,

μ_{1} \leftarrow Q_{Z_{1} Z_{2}}

,

ν_{1} \leftarrow Q_{Z_{1}}

,

ν_{2} \leftarrow Q_{Z_{2}}

. Furthermore, put

Y_{1} = X = (Z_{1}, Z_{2})

, and let

S_{1}

and

S_{2}

be the canonical maps (Definition 2). The measure spaces and the random transformations are as shown in Figure 4.

Note that the constraint (12) translates to:

\begin{matrix} f_{1} (z_{1}, z_{2}) \geq G_{1} (z_{1}) G_{2} (z_{2}), \forall z_{1}, z_{2} . \end{matrix}

(105)

and the equality case yields the optimal choice of

f_{1}

for (13). By Theorem 1, we thus obtain the equivalence between:

\begin{matrix} ∥ G_{1} ∥_{\frac{1}{b_{1}}} {∥ G_{2} ∥}_{\frac{1}{b_{2}}} \leq E [G_{1} (Z_{1}) G_{2} (Z_{2})], \forall G_{1}, G_{2} \end{matrix}

(106)

and:

\begin{matrix} \forall P_{Z_{1}}, P_{Z_{2}}, \exists P_{Z_{1} Z_{2}}, D (P_{Z_{1} Z_{2}} ∥ Q_{Z_{1} Z_{2}}) \leq b_{1} D (P_{Z_{1}} ∥ Q_{Z_{1}}) + b_{2} D (P_{Z_{2}} ∥ Q_{Z_{2}}) . \end{matrix}

(107)

Note that in this setup, if

Z_{1}

and

Z_{2}

are finite, then Condition (iv) in Theorem 1 is equivalent to

Q_{Z_{1} Z_{2}} ≪ Q_{Z_{1}} \times Q_{Z_{2}}

. The equivalent formulations of reverse hypercontractivity were observed in [59], where the proof is based on the method of types.

5.3. Reverse Hypercontractivity (One Negative Parameter)

By “one negative parameter” we mean the

b_{1}

is positive and

- c_{2}

is negative in (111).

In Theorem 1, take

l \leftarrow 1

,

m \leftarrow 2

,

c_{1} \leftarrow 1

,

d \leftarrow 0

. Let

Y_{1} = X = (Z_{1}, Y_{2})

, and let

S_{1}

and

T_{2}

be the canonical maps (Definition 2). Suppose that

Q_{Z_{1} Y_{2}}

is a given joint probability distribution, and set

μ_{1} \leftarrow Q_{Z_{1} Y_{2}}

,

ν_{1} \leftarrow Q_{Z_{1}}

,

μ_{2} \leftarrow Q_{Y_{2}}

in Theorem 1. Suppose that F and G are arbitrary nonnegative continuous functions on

Y_{2}

and

Z_{1}

, respectively, which are bounded away from zero. Take

g_{1} \leftarrow G^{\frac{1}{b_{1}}}

,

f_{2} \leftarrow F^{- \frac{1}{c_{2}}}

. in Theorem 1. The measure spaces and the random transformations are as shown in Figure 5.

The constraint (12) translates to:

\begin{matrix} f_{1} (z_{1}, y_{2}) \geq G (z_{1}) F (y_{2}), \forall z_{1}, y_{2} . \end{matrix}

(108)

Note that (13) translates to:

\begin{matrix} {∥ G ∥}_{\frac{1}{b_{1}}} \leq Q_{Y_{2} Z_{1}} (f_{1}) Q_{Y_{2}}^{c_{2}} (F^{- \frac{1}{c_{2}}}) \end{matrix}

(109)

for all F, G and

f_{1}

satisfying (108). It suffices to verify (109) for the optimal choice

f_{1} = G F

, so (109) is reduced to:

\begin{matrix} {∥ F ∥}_{\frac{1}{- c_{2}}} {∥ G ∥}_{\frac{1}{b_{1}}} \leq E [F (Y_{2}) G (Z_{1})], \forall F, G . \end{matrix}

(110)

By Theorem 1, (110) is equivalent to:

\begin{matrix} \forall P_{Z_{1}}, \exists P_{Z_{1} Y_{2}}, D (P_{Z_{1} Y_{2}} ∥ Q_{Z_{1} Y_{2}}) \leq b_{1} D (P_{Z_{1}} ∥ Q_{Z_{1}}) + (- c_{2}) D (P_{Y_{2}} ∥ Q_{Y_{2}}) . \end{matrix}

(111)

Inequality (110) is called reverse hypercontractivity with a negative parameter in [45], where the entropic version (111) is established for

| Z_{1} |, | Y_{2} | < \infty

using the method of types. Multiterminal extensions of (110) and (111) (called the reverse Brascamp-Lieb type inequality with negative parameters in [45]) can also be recovered from Theorem 1 in the same fashion, i.e., we move all negative parameters to the other side of the inequality so that all parameters become positive.

In summary, from the viewpoint of Theorem 1, the results in Section 5.1, Section 5.2 and Section 5.3 are degenerate special cases, in the sense that in any of the three cases, the optimal choice of one of the functions in (13) can be explicitly expressed in terms of the other functions; hence, this “hidden function” disappears in (103), (106) or (110).

Author Contributions

All the authors have contributed to the problem formulation, refinement, structuring or editing of the paper. Most of the sections were written by J.L. Parts of the sections on the existence of the minimizer and the Gaussian optimality were written by T.A.C.

Acknowledgments

This work was supported in part by NSF Grants CCF-1528132, CCF-0939370 (Center for Science of Information), CCF-1319299, CCF-1319304, CCF-1350595 and AFOSR FA9550-15-1-0180. Jingbo Liu would like to thank Elliott H. Lieb for teaching the Brascamp-Lieb inequality, as well as some techniques used in this paper in his graduate class.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Recovering Theorem 1 from Theorem 6 as a Special Case

Assume that

P_{X} \to (P_{Z_{i}})

is surjective. Let

1_{Z_{i}}

denote the constant one function on

Z_{i}

. Define:

\begin{matrix} C : = \{(w_{i}) : w_{i} \in C_{b} (Z_{i}), \sum_{i = 1}^{l} inf_{z_{i}} w_{i} (z_{i}) \geq 0\}, \end{matrix}

(A1)

which is a closed convex cone in

C_{b} (Z_{1}) \times \dots \times C_{b} (Z_{l})

. Given

(g_{i})

, we show that

\sum_{i = 1}^{l} b_{i} S_{i} log {\tilde{g}}_{i} \geq \sum_{i = 1}^{l} b_{i} S_{i} log g_{i}

implies:

\begin{matrix} {(b_{i} log {\tilde{g}}_{i} - b_{i} log g_{i})}_{i = 1}^{l} \in C . \end{matrix}

(A2)

Indeed, we can verify that the dual cone:

\begin{matrix} C^{*} & : = \{(π_{i}) : \sum_{i = 1}^{l} π_{i} (w_{i}) \geq 0, \forall (w_{i}) \in C\} \\ = \{λ (P_{Z_{1}}, \dots, P_{Z_{l}}) : λ \geq 0\} . \end{matrix}

(A3)

Under the surjectivity assumption, we see:

\begin{matrix} \sum_{i = 1}^{l} π_{i} (b_{i} log {\tilde{g}}_{i} - b_{i} log g_{i}) \geq 0, \forall (π_{i}) \in C^{*} . \end{matrix}

(A4)

Now, if (A2) is not true, by the Hahn-Banach theorem (Theorem 5), we find

π_{i} \in M (Z_{i})

,

i = 1, \dots, l

such that:

\begin{matrix} \sum_{i = 1}^{l} π_{i} (b_{i} log {\tilde{g}}_{i} - b_{i} log g_{i}) < inf_{(w_{i}) \in C} \sum_{i = 1}^{l} π_{i} (w_{i}) \end{matrix}

(A5)

so the right side of (A5) is not

- \infty

. Since

C

is a cone containing the origin, the right side of (A5) hence must be nonnegative, and we conclude that

(π_{i}) \in C^{*}

. However, then, (A5) contradicts (A4).

Appendix B. Existence of Weakly-Convergent Couplings

This section proves an auxiliary result which will be used in Appendix C.

Lemma A1.

Suppose that for each

i = 1, \dots, l

,

P_{X_{i}}

is a Borel measure on

R

and

P_{X_{i}}^{(n)}

converges weakly to some absolutely continuous (with respect to the Lebesgue measure)

P_{X_{i}}

as

n \to \infty

. If

P_{X}

is a coupling of

{(P_{X_{i}})}_{1 \leq i \leq l}

, then, upon extraction of a subsequence, there exist couplings

P_{X}^{(n)}

for

{(P_{X_{i}}^{(n)})}_{1 \leq i \leq l}

that converge weakly to

P_{X}

as

n \to \infty

.

Proof.

For each integer

k \geq 1

, define the random variable

W_{i}^{[k]} : = ϕ_{k} (X_{i})

where

ϕ_{k} : R \to R \cup {e}

is the following “dyadic quantization function”:

\begin{matrix} ϕ_{k} : x \mapsto \{\begin{matrix} ⌊ 2^{k} x ⌋ & | x | \leq k, x \notin 2^{- k} Z; \\ e & otherwise, \end{matrix} \end{matrix}

(A6)

and let

W^{[k]} : = {(W_{i}^{[k]})}_{i = 1}^{l}

. Denote by

W^{[k]} : = {- k 2^{k}, \dots, k 2^{k} - 1, e}

the set from which

W_{i}^{[k]}

takes values. Note that since

P_{X_{i}}

is assumed to be absolutely continuous, the set of “dyadic points” has measure zero:

\begin{matrix} P_{X_{i}} (⋃_{k = 1}^{\infty} 2^{- k} Z) = 0, i = 1, \dots, l . \end{matrix}

(A7)

Since

P_{X_{i}}^{(n)} \to P_{X_{i}}

weakly and the assumption in the preceding paragraph precluded any positive mass on the quantization boundaries under

P_{X_{i}}

, for each

k \geq 1

, there exists some

n : = n_{k}

large enough such that:

\begin{matrix} P_{W_{i}^{[k]}}^{(n)} (w) & \geq (1 - \frac{1}{k}) P_{W_{i}^{[k]}} (w), \end{matrix}

(A8)

for each i and

w \in W^{[k]}

. Now, define a coupling

P_{W^{[k]}}^{(n)}

compatible with the

{(P_{W_{i}^{[k]}}^{(n)})}_{i = 1}^{l}

induced by

{(P_{X_{i}}^{(n)})}_{i = 1}^{l}

, as follows:

\begin{matrix} P_{W^{[k]}}^{(n)} : = (1 - \frac{1}{k}) P_{W^{[k]}} + k^{l - 1} \prod_{i = 1}^{l} (P_{W_{i}^{[k]}}^{(n)} - (1 - \frac{1}{k}) P_{W_{i}^{[k]}}) . \end{matrix}

(A9)

Observe that (A9) is a well-defined probability measure because of (A8) and indeed has marginals

{(P_{W_{i}^{[k]}}^{(n)})}_{i = 1}^{l}

. Moreover, by the triangle inequality, we have the following bound on the total variation distance:

\begin{matrix} |P_{W^{[k]}}^{(n)} - P_{W^{[k]}}| \leq \frac{2}{k} . \end{matrix}

(A10)

Next, construct

P_{X}^{(n)}

(we use

{P |}_{A}

to denote the restriction of a probability measure P on measurable set

A

, that is

{P |}_{A} (B) : = P (A \cap B)

for any measurable

B

):

\begin{matrix} P_{X}^{(n)} : = \sum_{w^{l} \in W^{[k]} \times \dots \times W^{[k]}} \frac{P_{W^{[k]}}^{(n)} (w^{l})}{\prod_{i = 1}^{l} P_{W_{i}^{[k]}}^{(n)} (w_{i})} \prod_{i = 1}^{l} P_{X_{i}}^{(n)} |_{ϕ_{k}^{- 1} (w_{i})} . \end{matrix}

(A11)

Observe that

P_{X}^{(n)}

defined in (A11) is compatible with the

P_{W^{[k]}}^{(n)}

defined in (A9) and indeed has marginals

{(P_{X_{i}}^{(n)})}_{i = 1}^{l}

. Since

n : = n_{k}

can be made increasing in k, we have constructed the desired sequence

{(P_{X}^{(n_{k})})}_{k = 1}^{\infty}

converging weakly to

P_{X}

. Indeed, for any bounded open dyadic cube (that is, a cube whose corners have coordinates being multiples of

2^{- k}

where k is some integer)

A

, using (A10) and the assumption (A7), we conclude:

\begin{matrix} \underset{k \to \infty}{lim inf} P_{X}^{(n_{k})} (A) \geq P_{X} (A) . \end{matrix}

(A12)

Moreover, since bounded open dyadic cubes form a countable basis of the topology in

R^{l}

, we see that (A12) actually holds for any open set

A

. By writing

A

as a countable union of dyadic cubes, using the continuity of measure to pass to a finite disjoint union, and then apply (A12), as desired. ☐

Appendix C. Upper Semicontinuity of the Infimum

Using Lemma A1 in Appendix B, we prove the following result, which will be used in Appendix E.

Corollary A1.

Consider non-degenerate

(P_{Y_{j} | X})

. For each

n \geq 1

,

i = 1, \dots, l

,

P_{X_{i}}^{(n)}

is a Borel measure on

R

, whose second moment is bounded by

σ_{i}^{2} < \infty

. Assume that

P_{X_{i}}^{(n)}

converges to some absolutely continuous

P_{X_{i}}^{⋆}

for each i. Then:

\begin{matrix} \underset{n \to \infty}{lim sup} inf_{P_{X} : S_{i}^{*} P_{X} = P_{X_{i}}^{(n)}} \sum_{j = 1}^{m} c_{j} D (T_{j}^{*} P_{X} ∥ μ_{j}) \leq inf_{P_{X} : S_{i}^{*} P_{X} = P_{X_{i}}^{⋆}} \sum_{j = 1}^{m} c_{j} D (T_{j}^{*} P_{X} ∥ μ_{j}) . \end{matrix}

(A13)

Proof.

By passing to a convergent subsequence, we may assume that the limit on the left side of (A13) exists. For any coupling

P_{X}^{⋆}

of

(P_{X_{i}}^{⋆})

, by invoking Lemma A1 and passing to a subsequence, we find a sequence of couplings

P_{X}^{(n)}

of

(P_{X_{i}}^{(n)})

that converges weakly to

P_{X}^{⋆}

. It is known that under a moment constraint, the differential entropy of the output distribution of a non-degenerate Gaussian channel enjoys weak continuity in the input distribution (see, e.g., [48] Proposition 18, [60] Theorem 7, or [61] Theorem 1 and Theorem 2). Thus:

\begin{matrix} lim_{n \to \infty} \sum_{j = 1}^{m} c_{j} D (T_{j}^{*} P_{X}^{(n)} ∥ μ_{j}) = \sum_{j = 1}^{m} c_{j} D (T_{j}^{*} P_{X} ∥ μ_{j}) \end{matrix}

(A14)

and (A13) follows since

P_{X}^{⋆}

was arbitrarily chosen. ☐

Appendix D. Weak Semicontinuity of Differential Entropy under a Moment Constraint

This section proves the following result, which will be used in Appendix E.

Lemma A2.

Suppose

(P_{X_{n}})

is a sequence of distributions on

R^{d}

converging weakly to

P_{X^{⋆}}

, and:

\begin{matrix} E [X_{n} X_{n}^{⊤}] ⪯ Σ \end{matrix}

(A15)

for all n. Then

\begin{matrix} \underset{n \to \infty}{lim sup} h (X_{n}) \leq h (X^{⋆}) . \end{matrix}

(A16)

Remark A1.

The result fails without the condition (A15). Furthermore, related results when the weak convergence is replaced with pointwise convergence of density functions and certain additional constraints were shown in [61] (Theorem 1 and Theorem 2) (see also the proof of [48] (Theorem 5)). Those results are not applicable here since the density functions of

X_{n}

do not converge pointwise. They are applicable for the problems discussed in [48] because the density functions of the output of the Gaussian random transformation enjoy many nice properties due to the smoothing effect of the “good kernel”.

Proof.

It is well known that in metric spaces and for probability measures, the relative entropy is weakly lower semicontinuous (cf. [58]). This fact and a scaling argument immediately show that, for any

r > 0

,

\begin{matrix} \underset{n \to \infty}{lim sup} h (X_{n} | ∥ X_{n} ∥ \leq r) \leq h (X^{⋆} | ∥ X^{⋆} ∥ \leq r) . \end{matrix}

(A17)

Let

p_{n} (r) : = P [∥ X_{n} ∥ > r]

, then (A15) implies:

\begin{matrix} E [X X^{⊤} | ∥ X_{n} ∥ > r] \leq \frac{1}{p_{n} (r)} Σ . \end{matrix}

(A18)

Therefore, since the Gaussian distribution maximizes differential entropy given a second moment upper bound, we have:

\begin{matrix} h (X_{n} | ∥ X_{n} ∥ > r) \leq \frac{1}{2} log \frac{{(2 π)}^{d} e | Σ |}{p_{n} (r)} . \end{matrix}

(A19)

Since

{lim}_{r \to \infty} {sup}_{n} p_{n} (r) = 0

by (A15) and due to Chebyshev’s inequality, (A19) implies that:

\begin{matrix} lim_{r \to \infty} sup_{n} p_{n} (r) h (X_{n} | ∥ X_{n} ∥ > r) = 0 . \end{matrix}

(A20)

The desired result follows from (A17), (A20) and the fact that:

\begin{matrix} h (X_{n}) = p_{n} (r) h (X_{n} | ∥ X_{n} ∥ > r) + (1 - p_{n} (r)) h (X_{n} | ∥ X_{n} ∥ \leq r) + h (p_{n} (r)) . \end{matrix}

(A21)

☐

Appendix E. Proof of Proposition 2

For any $ϵ > 0$ , by the continuity of measure, there exists $K > 0$ such that:

$\begin{matrix} P_{X_{i}} ([- K, K]) \geq 1 - \frac{ϵ}{l}, i = 1, \dots, l . \end{matrix}$

(A22)

By the union bound,

$\begin{matrix} P_{X} ({[- K, K]}^{l}) \geq 1 - ϵ \end{matrix}$

(A23)

wherever $P_{X}$ is a coupling of $(P_{X_{i}})$ . Now, let $P_{X}^{(n)}$ , $n = 1, 2, \dots$ be such that:

$\begin{matrix} lim_{n \to \infty} \sum_{j = 1}^{m} c_{j} D (P_{Y_{j}}^{(n)} ∥ μ_{j}) = inf_{P_{X}} \sum_{j = 1}^{m} c_{j} D (P_{Y_{j}} ∥ μ_{j}) \end{matrix}$

(A24)

where $P_{Y_{j}} : = T_{j}^{*} P_{X}$ , $j = 1, \dots, m$ . The sequence $(P_{X}^{(n)})$ is tight by (A23). Thus, invoking the Prokhorov theorem and by passing to a subsequence, we may assume that $(P_{X}^{(n)})$ converges weakly to some $P_{X}^{⋆}$ . Therefore, $P_{Y_{j}}^{(n)}$ converges to $P_{Y_{j}}^{⋆}$ weakly, and by the semicontinuity property in Lemma A2, we have:

$\begin{matrix} \sum_{j = 1}^{m} c_{j} D (P_{Y_{j}}^{⋆} ∥ μ_{j}) \leq lim_{n \to \infty} \sum_{j = 1}^{m} c_{j} D (P_{Y_{j}}^{(n)} ∥ μ_{j}) \end{matrix}$

(A25)

establishing that $P_{X}^{⋆}$ is an infimizer.
Suppose ${(P_{X_{i}}^{(n)})}_{1 \leq i \leq l, n \geq 1}$ is such that $E [X_{i}^{2}] \leq σ_{i}^{2}$ , $X_{i} \sim P_{X_{i}}^{(n)}$ , where $(σ_{i})$ is as in Proposition 1 and:

$\begin{matrix} lim_{n \to \infty} F_{0} ({(P_{X_{i}}^{(n)})}_{i = 1}^{l}) = sup_{(P_{X_{i}}) : Σ_{X_{i}} ⪯ σ_{i}^{2}} F_{0} ({(P_{X_{i}})}_{i = 1}^{l}) . \end{matrix}$

(A26)

The regularization on the covariance implies that for each i, ${(P_{X_{i}}^{(n)})}_{n \geq 1}$ is a tight sequence. Thus, upon the extraction of subsequences, we may assume that for each i, ${(P_{X_{i}}^{(n)})}_{n \geq 1}$ converges to some $P_{X_{i}}^{⋆}$ . We have the moment bound:

$\begin{matrix} E [X_{i}^{2}] & = lim_{K \to \infty} E [min {X_{i}^{2}, K}] \end{matrix}$

(A27)

$\begin{matrix} = lim_{K \to \infty} E [min {{(X_{i}^{(n)})}^{2}, K}] \end{matrix}$

(A28)

$\begin{matrix} \leq σ_{i}^{2} \end{matrix}$

(A29)

where $X_{i} \sim P_{X_{i}}^{⋆}$ and $X_{i}^{(n)} \sim P_{X_{i}}^{(n)}$ . Then, by Lemma A2,

$\begin{matrix} \sum_{i} b_{i} D (P_{X_{i}}^{⋆} ∥ ν_{i}) \leq lim_{n \to \infty} \sum_{i} b_{i} D (P_{X_{i}}^{(n)} ∥ ν_{i}) \end{matrix}$

(A30)

Under the covariance regularization and the nondegenerateness assumption, we showed in Proposition 1 that the value of (76) cannot be $+ \infty$ or $- \infty$ . This implies that we can assume (by passing to a subsequence) that $P_{X_{i}}^{(n)} ≪ λ$ , $i = 1, \dots, l$ , since otherwise $F ((P_{X_{i}})) = - \infty$ . Moreover, since ${(\sum_{j} c_{j} D (P_{Y_{j}}^{(n)} ∥ μ_{j}))}_{n \geq 1}$ is bounded above under the nondegenerateness assumption, the sequence ${(\sum_{i} b_{i} D (P_{X_{i}}^{(n)} ∥ ν_{i}))}_{n \geq 1}$ must also be bounded from above, which implies, using (A30), that:

$\begin{matrix} \sum_{i} b_{i} D (P_{X_{i}}^{⋆} ∥ ν_{i}) < \infty . \end{matrix}$

(A31)

In particular, we have $P_{X_{i}}^{⋆} ≪ λ$ for each i. Now, Corollary A1 shows that:

$\begin{matrix} inf_{P_{X} : S_{i}^{*} P_{X} = P_{X_{i}}^{⋆}} \sum_{j} c_{j} D (T_{j}^{*} P_{X} ∥ μ_{j}) & \geq lim_{n \to \infty} inf_{P_{X} : S_{i}^{*} P_{X} = P_{X_{i}}^{(n)}} \sum_{j} c_{j} D (T_{j}^{*} P_{X} ∥ μ_{j}) \end{matrix}$

(A32)

Thus, (A30) and (A32) show that $(P_{X_{i}}^{⋆})$ is in fact a maximizer.

Appendix F. Gaussian Optimality in Degenerate Cases: A Limiting Argument

This section proves Theorem 2. We first give a proof for the choice of parameters in Example 1, merely for the sake of notational simplicity, and then discuss how to extend the argument.

Appendix F.1. Proof of the Claim in Example 1

The proof will be based on Theorem 8, which assumes non-degenerate forward channels and Gaussian measures on the output of the reverse channels. To that end, we will adopt an approximation argument. For each

j = 1, \dots, l

, define the linear operator

T_{j}^{ϵ}

by:

\begin{matrix} (T_{j}^{ϵ} ϕ) (x_{1}, \dots, x_{l}) : = E [ϕ (\sum_{i = 1}^{l} m_{j i} x_{i} + N_{ϵ})] \end{matrix}

(A33)

for any measurable function

ϕ

on

R

, where

N_{ϵ} \sim N (0, ϵ)

. Let

γ_{\frac{1}{ϵ}} : = N (0, ϵ^{- 1})

, and note that the density of

\sqrt{\frac{2 π}{ϵ}} γ_{\frac{1}{ϵ}}

converges pointwise to that of the Lebesgue measure.

Lemma A3.

For any

ϵ > 0

, let

(T_{j}^{ϵ})

be defined as in (A33). Then, for any Borel

P_{X_{i}} ≪ λ

,

i = 1, \dots, l

,

\begin{matrix} \sum_{i = 1}^{l} D (P_{X_{i}} ∥ γ_{\frac{1}{ϵ}}) - \frac{l}{2} log \frac{2 π}{ϵ} \geq inf_{P_{X^{l}} : S_{i}^{*} P_{X^{l}} = P_{X_{i}}} \{- \sum_{j = 1}^{l} h (T_{j}^{ϵ *} P_{X^{l}})\} . \end{matrix}

(A34)

Proof.

By Theorem 8, it suffices to prove (A34) when

P_{X_{i}}

is Gaussian, and from (A34), it is easy to see that it suffices to prove the case of the centered Gaussian. Let

P_{X_{i}} = N (0, a_{i})

,

i = 1, \dots, l

. We can upper bound the right side of (A34) by taking

P_{X^{l}} = P_{X_{1}} \times P_{X_{l}}

instead of the infimum, so it suffices to prove that:

\begin{matrix} \frac{ϵ}{2} \sum_{i = 1}^{l} a_{i} - \frac{1}{2} \sum_{i = 1}^{l} log a_{i} \geq - \frac{1}{2} \sum_{j = 1}^{l} log (\sum_{i = 1}^{l} m_{j i}^{2} a_{i} + ϵ) \end{matrix}

(A35)

for any

ϵ, a_{1}, \dots, a_{l} \in (0, \infty)

. This is implied by the

ϵ = 0

case, which we proved in (94). ☐

By the duality of the forward-reverse Brascamp-Lieb inequality (Theorem 7) , we conclude from Lemma A3 that:

Lemma A4.

For any

ϵ > 0

and nonnegative continuous

(f_{j})

,

(g_{i})

satisfying:

\begin{matrix} \sum_{i = 1}^{l} log g_{i} (x_{i}) \leq \sum_{j = 1}^{l} (T_{j}^{ϵ} log f_{j}) (x^{l}), \forall x^{l} \in R^{l}, \end{matrix}

(A36)

we have:

\begin{matrix} {(\frac{2 π}{ϵ})}^{\frac{l}{2}} \prod_{i = 1}^{l} \int g_{i} d γ_{\frac{1}{ϵ}} \leq \prod_{i = 1}^{l} \int f_{j} (x) d x . \end{matrix}

(A37)

Now, suppose that the claim in Example 1 is not true; then there are nonnegative continuous

(f_{j})

and

(g_{i})

satisfying (17) while:

\begin{matrix} \prod_{i = 1}^{l} \int g_{i} (x) d x > \prod_{i = 1}^{l} \int f_{j} (x) d x, \end{matrix}

(A38)

By the standard approximation argument, we can assume, without loss of generality, that:

\begin{matrix} g_{i} (x) & = 0, \forall x : | x | \geq R, 1 \leq i \leq l; \end{matrix}

(A39)

\begin{matrix} f_{j} (x) & \geq δ e^{- x^{2}}, \forall 1 \leq j \leq l, \end{matrix}

(A40)

for some R sufficiently large and

δ > 0

sufficiently small. Note that for any

x^{l} \in {[- R, R]}^{l}

,

\begin{matrix} \sum_{i = 1}^{l} m_{j i} x_{i} \in [- \sqrt{l} R, \sqrt{l} R] . \end{matrix}

(A41)

Since

log f_{j}

is uniformly continuous on

[- 2 \sqrt{l} R, 2 \sqrt{l} R]

for each j and since we assumed (A40), we have:

\begin{matrix} lim_{ϵ \to 0} inf_{x^{l} \in {[- R, R]}^{l}} \{\sum_{j = 1}^{l} (T_{j}^{ϵ} log f_{j}) (x^{l}) - \sum_{j = 1}^{l} (T_{j}^{0} log f_{j}) (x^{l})\} \geq 0 . \end{matrix}

(A42)

However, since we assumed (17) and (A39), we must also have:

\begin{matrix} lim_{ϵ \to 0} η_{ϵ} \geq 0 \end{matrix}

(A43)

where:

\begin{matrix} η_{ϵ} : = inf_{x^{l} \in R^{l}} \{\sum_{j = 1}^{l} (T_{j}^{ϵ} log f_{j}) (x^{l}) - \sum_{i = 1}^{l} log g_{i} (x_{i})\} . \end{matrix}

(A44)

Put:

\begin{matrix} {\tilde{g}}_{1}^{ϵ} & : = exp (η_{ϵ}) g_{1}, \end{matrix}

(A45)

\begin{matrix} {\tilde{g}}_{i}^{ϵ} & : = g_{i}, i = 1, \dots, l . \end{matrix}

(A46)

Then,

({\tilde{g}}_{i}^{ϵ})

and

(f_{j})

satisfy the constraint (A36) for any

ϵ > 0

. By applying the monotone convergence theorem and then Lemma A4,

\begin{matrix} \prod_{i = 1}^{l} \int g_{i} (x_{i}) d x_{i} & \leq lim_{ϵ \to 0} {(\frac{2 π}{ϵ})}^{\frac{l}{2}} \prod_{i = 1}^{l} \int {\tilde{g}}_{i}^{ϵ} d γ_{\frac{1}{ϵ}} \end{matrix}

(A47)

\begin{matrix} \leq \prod_{i = 1}^{l} \int f_{j} (x) d x \end{matrix}

(A48)

which violates the hypothesis (A38), as desired.

Appendix F.2. Proof of Theorem 2

The limiting argument can be extended to the vector case to prove Theorem 2. Specifically, for each

j = 1, \dots, m

, define

T_{j}^{ϵ}

the same as (A33) except that

N_{ϵ} \sim N (0, ϵ I)

, where

I

is the identity matrix whose dimension is clear from the context (equal to

dim (E^{j})

here), and let

P_{Y_{j} | X_{1} \dots X_{l}}^{ϵ}

be the dual operator. For each

i = 1, \dots, l

, let

ν_{i}^{ϵ} : = {(\frac{2 π}{ϵ})}^{\frac{1}{2} dim (E_{i})} \cdot N (0, ϵ^{- 1} I)

, whose density convergences pointwise to that of

ν_{i}^{0}

, defined as the Lebesgue measure on

E_{i}

. Define:

\begin{matrix} d^{ϵ} : = sup \{\sum_{i = 1}^{l} b_{i} log ν_{i}^{ϵ} (g_{i}) - \sum_{j = 1}^{m} c_{j} log \int f_{j}\} \end{matrix}

(A49)

where the supremum is over nonnegative continuous functions

f_{1}, \dots, f_{m}

and

g_{1}, \dots, g_{l}

such that the summands in (A49) are finite and:

\begin{matrix} \sum_{i = 1}^{l} b_{i} log g_{i} (x_{i}) \leq \sum_{j = 1}^{m} c_{j} (T_{j}^{ϵ} log f_{j}) (x_{1}, \dots, x_{l}), \forall x_{1}, \dots, x_{l} . \end{matrix}

(A50)

The same limiting argument (A38)–(A48) extended to the vector case shows that:

\begin{matrix} d^{0} \leq lim_{ϵ ↓ 0} d^{ϵ} . \end{matrix}

(A51)

Next, define

F_{0}^{ϵ} (\cdot)

for

(μ_{j})

,

(ν_{i}^{ϵ})

and

P_{Y_{j} | X_{1} \dots X_{l}}^{ϵ}

, similarly to (75). The entropic⇒functional argument shows that:

\begin{matrix} d^{ϵ} \leq sup_{P_{X_{1}}, \dots, P_{X_{l}}} F_{0}^{ϵ} (P_{X_{1}}, \dots, P_{X_{l}}) . \end{matrix}

(A52)

However, Theorem 8 based on the rotational invariance of the Gaussian measure can be extended to the vector case, so for any

ϵ > 0

,

\begin{matrix} sup_{P_{X_{1}}, \dots, P_{X_{l}}} F_{0}^{ϵ} (P_{X_{1}}, \dots, P_{X_{l}}) = sup_{P_{X_{1}}, \dots, P_{X_{l}} c . G .} F_{0}^{ϵ} (P_{X_{1}}, \dots, P_{X_{l}}), \end{matrix}

(A53)

where c.G. means that the supremum on the right side is over centered Gaussian measures. The fact that centered distributions exhaust the supremum follows easily from the definition of

F_{0}

. Moreover, from the definitions, it is easy to see that

F_{0}^{ϵ}

is monotonically decreasing in

ϵ

, and in particular:

\begin{matrix} sup_{P_{X_{1}}, \dots, P_{X_{l}} c . G .} F_{0}^{ϵ} (P_{X_{1}}, \dots, P_{X_{l}}) \leq sup_{P_{X_{1}}, \dots, P_{X_{l}} c . G .} F_{0}^{0} (P_{X_{1}}, \dots, P_{X_{l}}) . \end{matrix}

(A54)

To finish the proof with the above chain of inequalities, it only remains to show that the right side of (A54) equals the supremum in (A49) with

(f_{j})

(g_{j})

taken over center Gaussian functions. This follows by similar steps as the proof of the functional⇒entropic part of Theorem 1. We briefly mention how the idea works: suppose A is the linear space defined as the Cartesian product of

R

and the set of

n \times n

symmetric matrices. Let

Λ (\cdot)

be the convex functional on A defined by:

\begin{matrix} Λ (r, M) & : = ln \int {exp}_{e} (r + x^{⊤} M x) d x \end{matrix}

(A55)

\begin{matrix} = \{\begin{matrix} r + \frac{n}{2} ln π - \frac{1}{2} ln | - M | & M ⪯ 0, \\ + \infty & otherwise . \end{matrix} \end{matrix}

(A56)

The dual space of A is itself, and

Λ^{*}

is given by:

\begin{matrix} Λ^{*} (s, H) = sup_{r, M ⪰ 0} {s r + Tr (H^{⊤} M) - Λ (r, M)} . \end{matrix}

(A57)

Then,

Λ^{*} (s, H) = + \infty

if

s \neq 1

, and:

\begin{matrix} Λ^{*} (1, H) & = sup_{M ⪯ 0} \{Tr (H^{⊤} M) - \frac{n}{2} ln π + \frac{1}{2} ln | - M |\} . \end{matrix}

(A58)

The supremum in (A58) equals

+ \infty

if

H

is not positive-semidefinite. However, if

H

is positive-semidefinite, the supremum equals

- \frac{1}{2} ln 2 π e | H |

, which is equal to the relative entropy between

N (0, H)

and the Lebesgue measure (supremum achieved when

M = - {(2 H)}^{- 1}

). Since the proof of Theorem 1, in essence, only uses the duality between convex functionals, the same algebraic steps therein also establish the desired matrix optimization identity.

References

Brascamp, H.J.; Lieb, E.H. Best constants in Young’s inequality, its converse, and its generalization to more than three functions. Adv. Math. 1976, 20, 151–173. [Google Scholar] [CrossRef]
Brascamp, H.J.; Lieb, E.H. On extensions of the Brunn-Minkowski and Prékopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. J. Funct. Anal. 1976, 22, 366–389. [Google Scholar] [CrossRef]
Bobkov, S.G.; Ledoux, M. From Brunn-Minkowski to Brascamp-Lieb and to logarithmic Sobolev inequalities. Geom. Funct. Anal. 2000, 10, 1028–1052. [Google Scholar] [CrossRef]
Cordero-Erausquin, D. Transport inequalities for log-concave measures, quantitative forms and applications. arXiv, 2015; arXiv:1504.06147. [Google Scholar]
Barthe, F. On a reverse form of the Brascamp-Lieb inequality. Invent. Math. 1998, 134, 335–361. [Google Scholar] [CrossRef]
Bennett, J.; Carbery, A.; Christ, M.; Tao, T. The Brascamp-Lieb inequalities: finiteness, structure and extremals. Geom. Funct. Anal. 2008, 17, 1343–1415. [Google Scholar] [CrossRef]
Liu, J.; Courtade, T.A.; Cuff, P.; Verdú, S. Information theoretic perspectives on Brascamp-Lieb inequality and its reverse. arXiv, 2017; arXiv:1702.06260. [Google Scholar]
Gardner, R. The Brunn-Minkowski inequality. Bull. Am. Math. Soc. 2002, 39, 355–405. [Google Scholar] [CrossRef]
Gross, L. Logarithmic Sobolev inequalities. Am. J. Math. 1975, 97, 1061–1083. [Google Scholar] [CrossRef]
Erkip, E.; Cover, T.M. The efficiency of investment information. IEEE Trans. Inf. Theory Mar. 1998, 44, 1026–1040. [Google Scholar] [CrossRef]
Courtade, T. Outer bounds for multiterminal source coding via a strong data processing inequality. In Proceedings of the IEEE International Symposium on Information Theory, Istanbul, Turkey, 7–12 July 2013; pp. 559–563. [Google Scholar]
Polyanskiy, Y.; Wu, Y. Dissipation of information in channels with input constraints. IEEE Trans. Inf. Theory 2016, 62, 35–55. [Google Scholar] [CrossRef]
Polyanskiy, Y.; Wu, Y. A Note on the Strong Data-Processing Inequalities in Bayesian Networks. Available online: http://arxiv.org/pdf/1508.06025v1.pdf (accessed on 25 August 2015).
Liu, J.; Cuff, P.; Verdú, S. Key capacity for product sources with application to stationary Gaussian processes. IEEE Trans. Inf. Theory 2016, 62, 984–1005. [Google Scholar]
Liu, J.; Cuff, P.; Verdú, S. Secret key generation with one communicator and a one-shot converse via hypercontractivity. In Proceedings of the IEEE International Symposium on Information Theory, Hong Kong, China, 14–19 June 2015; pp. 710–714. [Google Scholar]
Xu, A.; Raginsky, M. Converses for distributed estimation via strong data processing inequalities. In Proceedings of the IEEE International Symposium on Information Theory, Hong Kong, China, 14–19 June 2015; pp. 2376–2380. [Google Scholar]
Kamath, S.; Anantharam, V. On non-interactive simulation of joint distributions. arXiv, 2015; arXiv:1505.00769. [Google Scholar]
Kahn, J.; Kalai, G.; Linial, N. The influence of variables on Boolean functions. In Proceedings of the 29th Annual Symposium on Foundations of Computer Science, White Plains, NY, USA, 24–26 October 1988; pp. 68–80. [Google Scholar]
Ganor, A.; Kol, G.; Raz, R. Exponential separation of information and communication. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), Philadelphia, PA, USA, 18–21 Otctober 2014; pp. 176–185. [Google Scholar]
Dvir, Z.; Hu, G. Sylvester-Gallai for arrangements of subspaces. arXiv, 2014; arXiv:1412.0795. [Google Scholar]
Braverman, M.; Garg, A.; Ma, T.; Nguyen, H.L.; Woodruff, D.P. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. arXiv, 2015; arXiv:1506.07216. [Google Scholar]
Garg, A.; Gurvits, L.; Oliveira, R.; Wigderson, A. Algorithmic aspects of Brascamp-Lieb inequalities. arXiv, 2016; arXiv:1607.06711. [Google Scholar]
Talagrand, M. On Russo’s approximate zero-one law. Ann. Probab. 1994, 22, 1576–1587. [Google Scholar] [CrossRef]
Friedgut, E.; Kalai, G.; Naor, A. Boolean functions whose Fourier transform is concentrated on the first two levels. Adv. Appl. Math. 2002, 29, 427–437. [Google Scholar] [CrossRef]
Bourgain, J. On the distribution of the Fourier spectrum of Boolean functions. Isr. J. Math. 2002, 131, 269–276. [Google Scholar] [CrossRef]
Mossel, E.; O’Donnell, R.; Oleszkiewicz, K. Noise stability of functions with low influences: Invariance and optimality. Ann. Math. 2010, 171, 295–341. [Google Scholar] [CrossRef]
Garban, C.; Pete, G.; Schramm, O. The Fourier spectrum of critical percolation. Acta Math. 2010, 205, 19–104. [Google Scholar] [CrossRef]
Duchi, J.C.; Jordan, M.; Wainwright, M.J. Local privacy and statistical minimax rates. In Proceedings of the IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), Berkeley, CA, USA, 26–29 October 2013; pp. 429–438. [Google Scholar]
Lieb, E.H. Gaussian kernels have only Gaussian maximizers. Invent. Math. 1990, 102, 179–208. [Google Scholar] [CrossRef]
Barthe, F. Optimal Young’s inequality and its converse: A simple proof. Geom. Funct. Anal. 1998, 8, 234–242. [Google Scholar] [CrossRef]
Barthe, F.; Cordero-Erausquin, D. Inverse Brascamp-Lieb inequalities along the Heat equation. In Geometric Aspects of Functional Analysis; Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 2004; Volume 1850, pp. 65–71. [Google Scholar]
Carlen, E.A.; Cordero-Erausquin, D. Subadditivity of the entropy and its relation to Brascamp-Lieb type inequalities. Geom. Funct. Anal. 2009, 19, 373–405. [Google Scholar] [CrossRef]
Barthe, F.; Cordero-Erausquin, D.; Ledoux, M.; Maurey, B. Correlation and Brascamp-Lieb inequalities for Markov semigroups. Int. Math. Res. Notices 2011, 2011, 2177–2216. [Google Scholar] [CrossRef]
Lehec, J. Short probabilistic proof of the Brascamp-Lieb and Barthe theorems. Can. Math. Bull. 2014, 57, 585–587. [Google Scholar] [CrossRef]
Ball, K. Volumes of sections of cubes and related problems. In Geometric Aspects of Functional Analysis; Springer: Berlin/Heidelberg, Germany, 1989; pp. 251–260. [Google Scholar]
Ahlswede, R.; Gács, P. Spreading of sets in product spaces and hypercontraction of the Markov operator. Ann. Probab. 1976, 4, 925–939. [Google Scholar] [CrossRef]
Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
Liu, J.; van Handel, R.; Verdú, S. Beyond the Blowing-Up Lemma: Sharp Converses via Reverse Hypercontractivity. In Proceedings of the IEEE International Symposium on Information Theory, Aachen, Germany, 25–30 June 2017; pp. 943–947. [Google Scholar]
Ahlswede, R.; Gács, P.; Körner, J. Bounds on conditional probabilities with applications in multi-user communication. Probab. Theory Relat. Fields 1976, 34, 157–177. [Google Scholar] [CrossRef]
Villani, C. Topics in Optimal Transportation; American Mathematical Soc.: Providence, RI, USA, 2003; Volume 58. [Google Scholar]
Atar, R.; Merhav, N. Information-theoretic applications of the logarithmic probability comparison bound. IEEE Trans. Inf. Theory 2015, 61, 5366–5386. [Google Scholar] [CrossRef]
Radhakrishnan, J. Entropy and counting. In Kharagpur Golden Jubilee Volume; Narosa: New Delhi, India, 2001. [Google Scholar]
Madiman, M.M.; Tetali, P. Information inequalities for joint distributions, with interpretations and applications. IEEE Trans. Inf. Theory 2010, 56, 2699–2713. [Google Scholar] [CrossRef]
Nair, C. Equivalent Formulations of Hypercontractivity Using Information Measures; International Zurich Seminar: Zurich, Switzerland, 2014. [Google Scholar]
Beigi, S.; Nair, C. Equivalent characterization of reverse Brascamp-Lieb type inequalities using information measures. In Proceedings of the IEEE International Symposium on Information Theory, Barcelona, Spain, 10–15 July 2016. [Google Scholar]
Bobkov, S.G.; Götze, F. Exponential integrability and transportation cost related to Logarithmic Sobolev inequalities. J. Funct. Anal. 1999, 163, 1–28. [Google Scholar] [CrossRef]
Carlen, E.A.; Lieb, E.H.; Loss, M. A sharp analog of Young’s inequality on S^N and related entropy inequalities. J. Geom. Anal. 2004, 14, 487–520. [Google Scholar] [CrossRef]
Geng, Y.; Nair, C. The capacity region of the two-receiver Gaussian vector broadcast channel with private and common messages. IEEE Trans. Inf. Theory 2014, 60, 2087–2104. [Google Scholar] [CrossRef]
Liu, J.; Courtade, T.A.; Cuff, P.; Verdú, S. Brascamp-Lieb inequality and its reverse: An information theoretic view. In Proceedings of the IEEE International Symposium on Information Theory, Barcelona, Spain, 10–15 July 2016; pp. 1048–1052. [Google Scholar]
Lax, P.D. Functional Analysis; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2002. [Google Scholar]
Tao, T. 245B, Notes 12: Continuous Functions on Locally Compact Hausdorff Spaces. Available online: https://terrytao.wordpress.com/2009/03/02/245b-notes-12-continuous-functions-on-locally-compact-hausdorff-spaces/ (accessed on 2 March 2009).
Bourbaki, N. Intégration; (Chaps. I-IV, Actualités Scientifiques et Industrielles, no. 1175); Hermann: Paris, France, 1952. [Google Scholar]
Dembo, A.; Zeitouni, O. Large Deviations Techniques and Applications; Springer: Berlin, Germany, 2009; Volume 38. [Google Scholar]
Lane, S.M. Categories for the Working Mathematician; Springer: New York, NY, USA, 1978. [Google Scholar]
Hatcher, A. Algebraic Topology; Tsinghua University Press: Beijing, China, 2002. [Google Scholar]
Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 2015. [Google Scholar]
Prokhorov, Y.V. Convergence of random processes and limit theorems in probability theory. Theory Probab. Its Appl. 1956, 1, 157–214. [Google Scholar] [CrossRef]
Verdú, S. Information Theory; In preparation; 2018. [Google Scholar]
Kamath, S. Reverse hypercontractivity using information measures. In Proceedings of the 53rd Annual Allerton Conference on Communications, Control and Computing, Champaign, IL, USA, 30 September–2 October 2015; pp. 627–633. [Google Scholar]
Wu, Y.; Verdú, S. Functional properties of minimum mean-square error and mutual information. IEEE Trans. Inf. Theory 2012, 58, 1289–1301. [Google Scholar] [CrossRef]
Godavarti, M.; Hero, A. Convergence of differential entropies. IEEE Trans. Inf. Theory 2004, 50, 171–176. [Google Scholar] [CrossRef]

Figure 1. Diagrams for Theorem 1.

Figure 2. The forward-reverse Brascamp-Lieb inequality generalizes several other functional inequalities/information theoretic inequalities. For more discussions on these relations, see the extended version [7].

Figure 3. Diagram for hypercontractivity.

Figure 4. Diagram for reverse hypercontractivity.

Figure 5. Diagram for reverse hypercontractivity with one negative parameter.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, J.; Courtade, T.A.; Cuff, P.W.; Verdú, S. A Forward-Reverse Brascamp-Lieb Inequality: Entropic Duality and Gaussian Optimality. Entropy 2018, 20, 418. https://doi.org/10.3390/e20060418

AMA Style

Liu J, Courtade TA, Cuff PW, Verdú S. A Forward-Reverse Brascamp-Lieb Inequality: Entropic Duality and Gaussian Optimality. Entropy. 2018; 20(6):418. https://doi.org/10.3390/e20060418

Chicago/Turabian Style

Liu, Jingbo, Thomas A. Courtade, Paul W. Cuff, and Sergio Verdú. 2018. "A Forward-Reverse Brascamp-Lieb Inequality: Entropic Duality and Gaussian Optimality" Entropy 20, no. 6: 418. https://doi.org/10.3390/e20060418

APA Style

Liu, J., Courtade, T. A., Cuff, P. W., & Verdú, S. (2018). A Forward-Reverse Brascamp-Lieb Inequality: Entropic Duality and Gaussian Optimality. Entropy, 20(6), 418. https://doi.org/10.3390/e20060418

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Forward-Reverse Brascamp-Lieb Inequality: Entropic Duality and Gaussian Optimality

Abstract

1. Introduction

2. Review of the Legendre-Fenchel Duality Theory

3. The Entropic-Functional Duality

3.1. Compact $X$

3.2. Noncompact $X$

4. Gaussian Optimality

4.1. Non-Degenerate Forward Channels

4.2. Analysis of Example 1 Using Gaussian Optimality

5. Relation to Hypercontractivity and Its Reverses

5.1. Hypercontractivity

5.2. Reverse Hypercontractivity (Positive Parameters)

5.3. Reverse Hypercontractivity (One Negative Parameter)

Author Contributions

Acknowledgments

Conflicts of Interest

Appendix A. Recovering Theorem 1 from Theorem 6 as a Special Case

Appendix B. Existence of Weakly-Convergent Couplings

Appendix C. Upper Semicontinuity of the Infimum

Appendix D. Weak Semicontinuity of Differential Entropy under a Moment Constraint

Appendix E. Proof of Proposition 2

Appendix F. Gaussian Optimality in Degenerate Cases: A Limiting Argument

Appendix F.1. Proof of the Claim in Example 1

Appendix F.2. Proof of Theorem 2

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

A Forward-Reverse Brascamp-Lieb Inequality: Entropic Duality and Gaussian Optimality

Abstract

1. Introduction

2. Review of the Legendre-Fenchel Duality Theory

3. The Entropic-Functional Duality

3.1. Compact X

3.2. Noncompact X

4. Gaussian Optimality

4.1. Non-Degenerate Forward Channels

4.2. Analysis of Example 1 Using Gaussian Optimality

5. Relation to Hypercontractivity and Its Reverses

5.1. Hypercontractivity

5.2. Reverse Hypercontractivity (Positive Parameters)

5.3. Reverse Hypercontractivity (One Negative Parameter)

Author Contributions

Acknowledgments

Conflicts of Interest

Appendix A. Recovering Theorem 1 from Theorem 6 as a Special Case

Appendix B. Existence of Weakly-Convergent Couplings

Appendix C. Upper Semicontinuity of the Infimum

Appendix D. Weak Semicontinuity of Differential Entropy under a Moment Constraint

Appendix E. Proof of Proposition 2

Appendix F. Gaussian Optimality in Degenerate Cases: A Limiting Argument

Appendix F.1. Proof of the Claim in Example 1

Appendix F.2. Proof of Theorem 2

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.1. Compact $X$

3.2. Noncompact $X$