Article

Two Measures of Dependence

Signal and Information Processing Laboratory, ETH Zurich, 8092 Zurich, Switzerland
*
Author to whom correspondence should be addressed.
Entropy 2019, 21(8), 778; https://doi.org/10.3390/e21080778
Submission received: 5 July 2019 / Revised: 2 August 2019 / Accepted: 5 August 2019 / Published: 8 August 2019
(This article belongs to the Special Issue Information Measures with Applications)

Abstract

Two families of dependence measures between random variables are introduced. They are based on the Rényi divergence of order α and the relative α -entropy, respectively, and both dependence measures reduce to Shannon’s mutual information when their order α is one. The first measure shares many properties with the mutual information, including the data-processing inequality, and can be related to the optimal error exponents in composite hypothesis testing. The second measure does not satisfy the data-processing inequality, but appears naturally in the context of distributed task encoding.

1. Introduction

The solutions to many information-theoretic problems can be expressed using Shannon’s information measures such as entropy, relative entropy, and mutual information. Other problems require Rényi’s information measures, which generalize Shannon’s. In this paper, we analyze two Rényi measures of dependence, J α ( X ; Y ) and K α ( X ; Y ) , between random variables X and Y taking values in the finite sets X and Y , with α ∈ [ 0 , ∞ ] being a parameter. (Our notation is similar to the one used for the mutual information: technically, J α ( · ) and K α ( · ) are functions not of X and Y, but of their joint probability mass function (PMF) P X Y .) For α ∈ [ 0 , ∞ ] , we define J α ( X ; Y ) and K α ( X ; Y ) as
J_\alpha(X;Y) \triangleq \min_{(Q_X,Q_Y) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{Y})} D_\alpha\bigl(P_{XY} \,\big\|\, Q_X Q_Y\bigr),  (1)
K_\alpha(X;Y) \triangleq \min_{(Q_X,Q_Y) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{Y})} \Delta_\alpha\bigl(P_{XY} \,\big\|\, Q_X Q_Y\bigr),  (2)
where P ( X ) and P ( Y ) denote the set of all PMFs over X and Y , respectively; D α ( P Q ) denotes the Rényi divergence of order α (see (50) ahead); and Δ α ( P Q ) denotes the relative α -entropy (see (55) ahead). As shown in Proposition 7, J α ( X ; Y ) and K α ( X ; Y ) are in fact closely related.
The measures J α ( X ; Y ) and K α ( X ; Y ) have the following operational meanings (see Section 3): J α ( X ; Y ) is related to the optimal error exponents in testing whether the observed independent and identically distributed (IID) samples were generated according to the joint PMF P X Y or an unknown product PMF; and K α ( X ; Y ) appears as a penalty term in the sum-rate constraint of distributed task encoding.
The measures J α ( X ; Y ) and K α ( X ; Y ) share many properties with Shannon’s mutual information [1], and both are equal to the mutual information when α is one. Except for some special cases, we have no closed-form expressions for J α ( X ; Y ) or K α ( X ; Y ) . As illustrated in Figure 1, unless α is one, the minimum in the definitions of J α ( X ; Y ) and K α ( X ; Y ) is typically not achieved by Q X = P X and Q Y = P Y . (When α is one, then the minimum is always achieved by Q X = P X and Q Y = P Y ; this follows from Proposition 8 and the fact that D 1 ( P X Y Q X Q Y ) = Δ 1 ( P X Y Q X Q Y ) = D ( P X Y Q X Q Y ) .)
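To make the definition in (1) concrete, the following minimal Python sketch (not part of the original paper; the toy PMF, the order α = 2, and the grid resolution are arbitrary illustrative choices) approximates J_α(X;Y) for a binary pair by a brute-force search over product PMFs and compares it with the value of D_α at the product of the marginals.

```python
import numpy as np

def renyi_div(P, Q, alpha):
    """Rényi divergence D_alpha(P || Q) in bits, for alpha not in {0, 1, inf}.

    Assumes supp(P) is contained in supp(Q) whenever alpha > 1."""
    m = P > 0
    return np.log2(np.sum(P[m]**alpha * Q[m]**(1.0 - alpha))) / (alpha - 1.0)

# Toy joint PMF over a 2x2 alphabet (rows: x, columns: y) and an arbitrary order.
P_XY = np.array([[0.5, 0.2],
                 [0.1, 0.2]])
alpha = 2.0

# Value of D_alpha at the product of the marginals.
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)
at_marginals = renyi_div(P_XY.ravel(), np.outer(P_X, P_Y).ravel(), alpha)

# Brute-force approximation of J_alpha: minimize over a grid of product PMFs Q_X x Q_Y.
grid = np.linspace(0.005, 0.995, 199)
J_approx = min(renyi_div(P_XY.ravel(),
                         np.outer([qx, 1 - qx], [qy, 1 - qy]).ravel(),
                         alpha)
               for qx in grid for qy in grid)

print(f"D_alpha at the marginals: {at_marginals:.4f} bits")
print(f"approximate J_alpha     : {J_approx:.4f} bits")  # typically strictly smaller when alpha != 1
```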
The rest of this paper is organized as follows. In Section 2, we review other generalizations of the mutual information. In Section 3, we discuss the operational meanings of J α ( X ; Y ) and K α ( X ; Y ) . In Section 4, we recall the required Rényi information measures and prove some preparatory results. In Section 5, we state the properties of J α ( X ; Y ) and K α ( X ; Y ) . In Section 6, we prove these properties.

2. Related Work

The measure J α ( X ; Y ) was discovered independently of the authors of the present paper by Tomamichel and Hayashi [2] (Equation (58)), who derived some of its properties for the case α > 1/2 in [2] (Appendix A-C).
Other Rényi-based measures of dependence appeared in the past. Notable are those by Sibson [3], Arimoto [4], and Csiszár [5], respectively denoted by I α s ( · ) , I α a ( · ) , and I α c ( · ) :
I_\alpha^{\mathrm{s}}(X;Y) \triangleq \frac{\alpha}{\alpha-1} \log \sum_y \Bigl[\sum_x P(x)\, P(y|x)^{\alpha}\Bigr]^{1/\alpha}  (3)
= \min_{Q_Y} D_\alpha(P_{XY} \,\|\, P_X Q_Y),  (4)
I_\alpha^{\mathrm{a}}(X;Y) \triangleq H_\alpha(X) - H_\alpha(X|Y)  (5)
= \frac{\alpha}{\alpha-1} \log \sum_y \Bigl[\sum_x \frac{P(x)^{\alpha}}{\sum_{x' \in \mathcal{X}} P(x')^{\alpha}}\, P(y|x)^{\alpha}\Bigr]^{1/\alpha},  (6)
I_\alpha^{\mathrm{c}}(X;Y) \triangleq \min_{Q_Y} \sum_x P(x)\, D_\alpha(P_{Y|X=x} \,\|\, Q_Y),  (7)
where, throughout the paper, log ( · ) denotes the base-2 logarithm; D α ( P Q ) denotes the Rényi divergence of order α (see (50) ahead); H α ( X ) denotes the Rényi entropy of order α (see (45) ahead); and H α ( X | Y ) denotes the Arimoto–Rényi conditional entropy [4,6,7], which is defined for positive α other than one as
H_\alpha(X|Y) \triangleq \frac{\alpha}{1-\alpha} \log \sum_y \Bigl[\sum_x P(x,y)^{\alpha}\Bigr]^{1/\alpha}.  (8)
(Equation (4) follows from Proposition 9 ahead, and (6) follows from (45) and (8).) An overview of I α s ( · ) , I α a ( · ) , and I α c ( · ) is provided in [8]. Another Rényi-based measure of dependence can be found in [9] (Equation (19)):
I_\alpha^{\mathrm{t}}(X;Y) \triangleq D_\alpha(P_{XY} \,\|\, P_X P_Y).  (9)
The relation between I α c ( X ; Y ) , J α ( X ; Y ) , and I α s ( X ; Y ) for α > 1 was established recently:
Proposition 1
([10] (Theorem IV.1)). For every PMF P X Y and every α > 1 ,
I_\alpha^{\mathrm{c}}(X;Y) \le J_\alpha(X;Y)  (10)
\le I_\alpha^{\mathrm{s}}(X;Y).  (11)
Proof. 
This is proved in [10] for a measure-theoretic setting. Here, we specialize the proof to finite alphabets. We first prove (10):
J_\alpha(X;Y) = \min_{Q_Y} \min_{Q_X} D_\alpha(P_{XY} \,\|\, Q_X Q_Y)  (12)
= \min_{Q_Y} \frac{\alpha}{\alpha-1} \log \sum_x \Bigl[\sum_y P(x,y)^{\alpha}\, Q_Y(y)^{1-\alpha}\Bigr]^{1/\alpha}  (13)
= \min_{Q_Y} \frac{\alpha}{\alpha-1} \log \sum_x P(x) \Bigl[\sum_y P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\Bigr]^{1/\alpha}  (14)
\ge \min_{Q_Y} \frac{\alpha}{\alpha-1} \sum_x P(x) \log \Bigl[\sum_y P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\Bigr]^{1/\alpha}  (15)
= \min_{Q_Y} \sum_x P(x)\, \frac{1}{\alpha-1} \log \sum_y P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}  (16)
= I_\alpha^{\mathrm{c}}(X;Y),  (17)
where (12) follows from the definition of J α ( X ; Y ) in (1); (13) follows from Proposition 9 ahead with the roles of Q X and Q Y swapped; (15) follows from Jensen’s inequality because log ( · ) is concave and because α α 1 > 0 ; and (17) follows from the definition of I α c ( X ; Y ) in (7).
We next prove (11):
J_\alpha(X;Y) = \min_{Q_X, Q_Y} D_\alpha(P_{XY} \,\|\, Q_X Q_Y)  (18)
\le \min_{Q_Y} D_\alpha(P_{XY} \,\|\, P_X Q_Y)  (19)
= I_\alpha^{\mathrm{s}}(X;Y),  (20)
where (18) follows from the definition of J α ( X ; Y ) in (1), and (20) follows from (4). □
Many of the above Rényi information measures coincide when they are maximized over P X with P Y | X held fixed: for every conditional PMF P Y | X and every positive α other than one,
\max_{P_X} I_\alpha^{\mathrm{a}}(P_X P_{Y|X}) = \max_{P_X} I_\alpha^{\mathrm{s}}(P_X P_{Y|X})  (21)
= \max_{P_X} I_\alpha^{\mathrm{c}}(P_X P_{Y|X}),  (22)
where P X P Y | X denotes the joint PMF of X and Y; (21) follows from [4] (Lemma 1); and (22) follows from [5] (Proposition 1). It was recently established that, for α > 1 , this is also true for J α ( X ; Y ) :
Proposition 2
([10] (Theorem V.1)). For every conditional PMF P Y | X and every α > 1 ,
\max_{P_X} J_\alpha(P_X P_{Y|X}) = \max_{P_X} I_\alpha^{\mathrm{s}}(P_X P_{Y|X}).  (23)
Proof. 
By Proposition 1, we have for all α > 1
\max_{P_X} I_\alpha^{\mathrm{c}}(P_X P_{Y|X}) \le \max_{P_X} J_\alpha(P_X P_{Y|X})  (24)
\le \max_{P_X} I_\alpha^{\mathrm{s}}(P_X P_{Y|X}).  (25)
By (22), the left-hand side (LHS) of (24) is equal to the right-hand side (RHS) of (25), so (24) and (25) both hold with equality. □
Dependence measures can also be based on the f-divergence D f ( P ‖ Q ) [11,12,13]. Every convex function f : ( 0 , ∞ ) → ℝ satisfying f ( 1 ) = 0 induces a dependence measure, namely
I_f(X;Y) \triangleq D_f(P_{XY} \,\|\, P_X P_Y)  (26)
= \sum_{x,y} P(x)\, P(y)\, f\!\left(\frac{P(x,y)}{P(x)\, P(y)}\right),  (27)
where (27) follows from the definition of the f-divergence. (For f ( t ) = t log t , I f ( X ; Y ) is the mutual information.) Such dependence measures are used for example in [14], and a construction equivalent to (27) is studied in [15].

3. Operational Meanings

In this section, we discuss the operational meaning of J α ( X ; Y ) in hypothesis testing (Section 3.1) and of K α ( X ; Y ) in distributed task encoding (Section 3.2).

3.1. Testing Against Independence and J α ( X ; Y )

Consider the hypothesis testing problem of guessing whether an observed sequence of pairs was drawn IID from some given joint PMF P X Y or IID from some unknown product distribution. Thus, based on a sequence of pairs of random variables { ( X i , Y i ) } i = 1 n , two hypotheses have to be distinguished:
0)
Under the null hypothesis, ( X 1 , Y 1 ) , , ( X n , Y n ) are IID according to P X Y .
1)
Under the alternative hypothesis, ( X 1 , Y 1 ) , , ( X n , Y n ) are IID according to some unknown PMF of the form Q X Y = Q X Q Y , where Q X and Q Y are arbitrary PMFs over X and Y , respectively.
Associated with every deterministic test T n : X n × Y n → { 0 , 1 } and pair ( Q X , Q Y ) are the type-I error probability P X Y × n [ T n ( X n , Y n ) = 1 ] and the type-II error probability ( Q X Q Y ) × n [ T n ( X n , Y n ) = 0 ] , where R X Y × n [ A ] denotes the probability of an event A when { ( X i , Y i ) } i = 1 n are IID according to R X Y . We seek sequences of tests whose worst-case type-II error probability decays exponentially faster than 2^{−n E_Q} . To be more specific, for a fixed E Q ∈ ℝ , denote by T ( E Q ) the set of all sequences of deterministic tests { T n } n = 1 ∞ for which
\liminf_{n \to \infty}\ \min_{Q_X, Q_Y}\ -\frac{1}{n} \log (Q_X Q_Y)^{\times n}\bigl[T_n(X^n, Y^n) = 0\bigr] > E_Q,  (28)
where log ( · ) denotes the base-2 logarithm. Note that (28) implies—but is not equivalent to—that for n sufficiently large, ( Q X Q Y ) × n [ T n ( X n , Y n ) = 0 ] ≤ 2^{−n E_Q} for all ( Q X , Q Y ) ∈ P ( X ) × P ( Y ) . For a fixed E Q ∈ ℝ , the optimal type-I error exponent that can be asymptotically achieved under the constraint (28) is given by
E_P(E_Q) \triangleq \sup_{\{T_n\}_{n=1}^{\infty} \in \mathcal{T}(E_Q)}\ \liminf_{n \to \infty}\ -\frac{1}{n} \log P_{XY}^{\times n}\bigl[T_n(X^n, Y^n) = 1\bigr].  (29)
The measure J α ( X ; Y ) appears as follows: In [2] (first part of (57)), it is shown that for E Q sufficiently close to I ( X ; Y ) ,
E_P(E_Q) = \sup_{\alpha \in (\frac{1}{2}, 1]} \frac{1-\alpha}{\alpha} \bigl(J_\alpha(X;Y) - E_Q\bigr),  (30)
and in [16] (Theorem 3), it is shown that for all E Q R ,
E_P^{**}(E_Q) = \sup_{\alpha \in (0, 1]} \frac{1-\alpha}{\alpha} \bigl(J_\alpha(X;Y) - E_Q\bigr),  (31)
where E P * * ( · ) denotes the Fenchel biconjugate of E P ( · ) . In general, the Fenchel biconjugation cannot be omitted because sometimes [16] (Equation (11) and Example 14)
E_P(E_Q) > E_P^{**}(E_Q).  (32)
For large values of E Q , the optimal type-I error tends to one as n tends to infinity. In this case, the type-I strong-converse exponent [17,18], which is defined for a sequence of tests { T n } n = 1 as
SC_P \triangleq \limsup_{n \to \infty}\ -\frac{1}{n} \log \Bigl(1 - P_{XY}^{\times n}\bigl[T_n(X^n, Y^n) = 1\bigr]\Bigr),  (33)
measures how fast the type-I error tends to one as n tends to infinity (smaller values correspond to lower error probabilities). For a fixed E Q R , the optimal type-I strong-converse exponent that can be asymptotically achieved under the constraint (28) is given by
SC_P(E_Q) \triangleq \inf_{\{T_n\}_{n=1}^{\infty} \in \mathcal{T}(E_Q)}\ \limsup_{n \to \infty}\ -\frac{1}{n} \log \Bigl(1 - P_{XY}^{\times n}\bigl[T_n(X^n, Y^n) = 1\bigr]\Bigr).  (34)
In [2] (second part of (57)), it is shown that for E Q sufficiently close to I ( X ; Y ) ,
SC_P(E_Q) = \sup_{\alpha > 1} \frac{1-\alpha}{\alpha} \bigl(J_\alpha(X;Y) - E_Q\bigr).  (35)
Here, the same ( 1 − α ) / α · ( J α ( X ; Y ) − E Q ) expression appears as in (30) and (31), but with a different set of α ’s to optimize over.

3.2. Distributed Task Encoding and K α ( X ; Y )

The task-encoding problem studied in [19] can be extended to a distributed setting as follows [20]: A source { ( X i , Y i ) } i = 1 ∞ emits pairs of random variables ( X i , Y i ) taking values in a finite alphabet X × Y . For a fixed rate pair ( R X , R Y ) ∈ ℝ ≥ 0 ² and a positive integer n, the sequences { X i } i = 1 n and { Y i } i = 1 n are described separately using 2^{n R_X} and 2^{n R_Y} labels, respectively. The decoder produces a list comprising all the pairs ( x n , y n ) whose description matches the given labels, and the goal is to minimize the ρ -th moment of the list size as n tends to infinity (for some ρ > 0 ).
For a fixed ρ > 0 , a rate pair ( R X , R Y ) R 0 2 is called achievable if there exists a sequence of encoders { ( f n , g n ) } n = 1 ,
f_n \colon \mathcal{X}^n \to \{1, \dots, 2^{n R_X}\},  (36)
g_n \colon \mathcal{Y}^n \to \{1, \dots, 2^{n R_Y}\},  (37)
such that the ρ -th moment of the list size tends to one as n tends to infinity, i.e.,
\lim_{n \to \infty} \mathbb{E}\bigl[|L(X^n, Y^n)|^{\rho}\bigr] = 1,  (38)
where
L(x^n, y^n) \triangleq \bigl\{(\tilde{x}^n, \tilde{y}^n) \in \mathcal{X}^n \times \mathcal{Y}^n : f_n(\tilde{x}^n) = f_n(x^n)\ \wedge\ g_n(\tilde{y}^n) = g_n(y^n)\bigr\}.  (39)
For a memoryless source and a fixed ρ > 0 , rate pairs in the interior of the region R ( ρ ) defined next are achievable, while those outside R ( ρ ) are not achievable [20] (Theorem 1). The region R ( ρ ) is defined as the set of all rate pairs ( R X , R Y ) satisfying the following inequalities simultaneously:
R_X \ge H_{\frac{1}{1+\rho}}(X),  (40)
R_Y \ge H_{\frac{1}{1+\rho}}(Y),  (41)
R_X + R_Y \ge H_{\frac{1}{1+\rho}}(X,Y) + K_{\frac{1}{1+\rho}}(X;Y),  (42)
where H α ( X ) denotes the Rényi entropy of order α (see (45) ahead).
To better understand the role of K α ( X ; Y ) , suppose that the sequences { X i } i = 1 n and { Y i } i = 1 n were allowed to be described jointly using 2^{n R_X} · 2^{n R_Y} = 2^{n ( R_X + R_Y )} labels. Then, by [19] (Theorem I.2), all rate pairs ( R X , R Y ) ∈ ℝ ≥ 0 ² satisfying the following inequality with strict inequality would be achievable, while those not satisfying the inequality would not:
R_X + R_Y \ge H_{\frac{1}{1+\rho}}(X,Y).  (43)
Comparing (42) and (43), we see that the measure K α ( X ; Y ) appears as a penalty term on the sum-rate constraint incurred by requiring that the sequences be described separately as opposed to jointly.
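As a rough numerical illustration of the rate region (40)–(42) (this sketch is not from the paper; the toy source, the choice ρ = 1, and the grid resolution are assumptions made here purely for illustration), one can compute the single-letter Rényi entropies and approximate the sum-rate penalty K_{1/(1+ρ)}(X;Y) by a brute-force search over product PMFs using the relative α-entropy of (55):

```python
import numpy as np

def renyi_entropy(P, alpha):
    """Rényi entropy H_alpha(P) in bits for alpha not in {0, 1, inf}; P is a 1-D PMF."""
    return np.log2(np.sum(P**alpha)) / (1.0 - alpha)

def relative_alpha_entropy(P, Q, alpha):
    """Relative alpha-entropy Delta_alpha(P || Q) in bits, following (55).

    Assumes alpha not in {0, 1, inf} and that Q is strictly positive."""
    m = P > 0
    return (alpha / (1 - alpha) * np.log2(np.sum(P[m] * Q[m]**(alpha - 1)))
            + np.log2(np.sum(Q**alpha))
            - 1.0 / (1 - alpha) * np.log2(np.sum(P**alpha)))

# Toy memoryless source over a 2x2 alphabet and a moment order rho (arbitrary choices).
P_XY = np.array([[0.5, 0.2],
                 [0.1, 0.2]])
rho = 1.0
alpha = 1.0 / (1.0 + rho)

P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)
H_X = renyi_entropy(P_X, alpha)
H_Y = renyi_entropy(P_Y, alpha)
H_XY = renyi_entropy(P_XY.ravel(), alpha)

# Brute-force approximation of the sum-rate penalty K_alpha(X;Y) from definition (2).
grid = np.linspace(0.005, 0.995, 199)
K = min(relative_alpha_entropy(P_XY.ravel(),
                               np.outer([qx, 1 - qx], [qy, 1 - qy]).ravel(),
                               alpha)
        for qx in grid for qy in grid)

print(f"R_X  >= {H_X:.3f} bits")   # constraint (40)
print(f"R_Y  >= {H_Y:.3f} bits")   # constraint (41)
print(f"R_X + R_Y >= {H_XY + K:.3f} bits (vs. {H_XY:.3f} bits for joint description)")  # (42) vs. (43)
```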

4. Preliminaries

Throughout the paper, log ( · ) denotes the base-2 logarithm, X and Y are finite sets, P X Y denotes a joint PMF over X × Y , Q X denotes a PMF over X , and Q Y denotes a PMF over Y . We use P and Q as generic PMFs over a finite set X . We denote by supp ( P ) ≜ { x ∈ X : P ( x ) > 0 } the support of P, and by P ( X ) the set of all PMFs over X . When clear from the context, we often omit sets and subscripts: for example, we write min Q X , Q Y for min ( Q X , Q Y ) ∈ P ( X ) × P ( Y ) , ∑ x for ∑ x ∈ X , P ( x ) for P X ( x ) , and P ( y | x ) for P Y | X ( y | x ) . Whenever a conditional probability P ( y | x ) is undefined because P ( x ) = 0 , we define P ( y | x ) ≜ 1 / | Y | . We denote by 𝟙 { condition } the indicator function that is one if the condition is satisfied and zero otherwise. In the definitions below, we use the following conventions:
\frac{0}{0} \triangleq 0, \qquad \frac{p}{0} \triangleq \infty \ \text{for } p > 0, \qquad 0 \log 0 \triangleq 0, \qquad \beta \log 0 \triangleq -\infty \ \text{for } \beta > 0.  (44)
The Rényi entropy of order α [21] is defined for positive α other than one as
H_\alpha(X) \triangleq \frac{1}{1-\alpha} \log \sum_x P(x)^{\alpha}.  (45)
For α being zero, one, or infinity, we define by continuous extension of (45)
H_0(X) \triangleq \log |\mathrm{supp}(P)|,  (46)
H_1(X) \triangleq H(X),  (47)
H_\infty(X) \triangleq -\log \max_x P(x),  (48)
where H ( X ) is the Shannon entropy. With this extension to α { 0 , 1 , } , the Rényi entropy satisfies the following basic properties:
Proposition 3
([5]). Let P be a PMF. Then,
(i) 
For all α ∈ [ 0 , ∞ ] , H α ( X ) ≤ log | X | . If α ∈ ( 0 , ∞ ] , then H α ( X ) = log | X | if and only if X is distributed uniformly over X .
(ii) 
The mapping α H α ( X ) is nonincreasing on [ 0 , ] .
(iii) 
The mapping α H α ( X ) is continuous on [ 0 , ] .
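For instance, the following short Python sketch (illustrative only; the PMF and the orders are arbitrary choices made here) evaluates H_α for several orders, including the extensions (46)–(48), and its output is consistent with the monotonicity and boundedness stated in Proposition 3.

```python
import numpy as np

def renyi_entropy(P, alpha):
    """Rényi entropy H_alpha(P) in bits, including the limits alpha -> 0, 1, inf."""
    P = np.asarray(P, dtype=float)
    P = P[P > 0]                        # terms with P(x) = 0 are dropped (convention 0 log 0 = 0)
    if alpha == 0:
        return np.log2(P.size)          # (46): log |supp(P)|
    if alpha == 1:
        return -np.sum(P * np.log2(P))  # (47): Shannon entropy
    if np.isinf(alpha):
        return -np.log2(P.max())        # (48): min-entropy
    return np.log2(np.sum(P**alpha)) / (1.0 - alpha)  # (45)

P = [0.5, 0.25, 0.125, 0.125]
for a in [0, 0.5, 1, 2, 4, np.inf]:
    print(f"H_{a}(X) = {renyi_entropy(P, a):.4f} bits")
# The printed values are nonincreasing in alpha and at most log2(4) = 2 bits,
# as stated in Proposition 3.
```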
The relative entropy (or Kullback–Leibler divergence) is defined as
D(P \,\|\, Q) \triangleq \sum_x P(x) \log \frac{P(x)}{Q(x)}.  (49)
The Rényi divergence of order α [21,22] is defined for positive α other than one as
D_\alpha(P \,\|\, Q) \triangleq \frac{1}{\alpha-1} \log \sum_x P(x)^{\alpha}\, Q(x)^{1-\alpha},  (50)
where we read P ( x )^α Q ( x )^{1−α} as P ( x )^α / Q ( x )^{α−1} if α > 1 . For α being zero, one, or infinity, we define by continuous extension of (50)
D_0(P \,\|\, Q) \triangleq -\log \sum_{x \in \mathrm{supp}(P)} Q(x),  (51)
D_1(P \,\|\, Q) \triangleq D(P \,\|\, Q),  (52)
D_\infty(P \,\|\, Q) \triangleq \log \max_x \frac{P(x)}{Q(x)}.  (53)
With this extension to α { 0 , 1 , } , the Rényi divergence satisfies the following basic properties:
Proposition 4.
Let P and Q be PMFs. Then,
(i) 
For all α ∈ [ 0 , 1 ) , D α ( P ‖ Q ) is finite if and only if | supp ( P ) ∩ supp ( Q ) | > 0 . For all α ∈ [ 1 , ∞ ] , D α ( P ‖ Q ) is finite if and only if supp ( P ) ⊆ supp ( Q ) .
(ii) 
For all α ∈ [ 0 , ∞ ] , D α ( P ‖ Q ) ≥ 0 . If α ∈ ( 0 , ∞ ] , then D α ( P ‖ Q ) = 0 if and only if P = Q .
(iii) 
For every α [ 0 , ] , the mapping Q D α ( P Q ) is continuous.
(iv) 
The mapping α D α ( P Q ) is nondecreasing on [ 0 , ] .
(v) 
The mapping α D α ( P Q ) is continuous on [ 0 , ] .
Proof. 
Part (i) follows from the definition of D α ( P Q ) and the conventions (44), and Parts (ii)–(v) are shown in [22]. □
The Rényi divergence for negative α is defined as
D_\alpha(P \,\|\, Q) \triangleq \frac{1}{\alpha-1} \log \sum_x Q(x)^{1-\alpha}\, P(x)^{\alpha}.  (54)
(We use negative α only in Lemma 19. More about negative orders can be found in [22] (Section V). For other applications of negative orders, see [23] (Proof of Theorem 1 and Example 1).)
The relative α -entropy [24,25] is defined for positive α other than one as
\Delta_\alpha(P \,\|\, Q) \triangleq \frac{\alpha}{1-\alpha} \log \sum_x P(x)\, Q(x)^{\alpha-1} + \log \sum_x Q(x)^{\alpha} - \frac{1}{1-\alpha} \log \sum_x P(x)^{\alpha},  (55)
where we read P ( x ) Q ( x )^{α−1} as P ( x ) / Q ( x )^{1−α} if α < 1 . The relative α -entropy appears in mismatched guessing [26], mismatched source coding [26] (Theorem 8), and mismatched task encoding [19] (Section IV). It also arises in robust parameter estimation and constrained compression settings [25] (Section II). For α being zero, one, or infinity, we define by continuous extension of (55)
\Delta_0(P \,\|\, Q) \triangleq \begin{cases} \log \dfrac{|\mathrm{supp}(Q)|}{|\mathrm{supp}(P)|} & \text{if } \mathrm{supp}(P) \subseteq \mathrm{supp}(Q), \\ \infty & \text{otherwise}, \end{cases}  (56)
\Delta_1(P \,\|\, Q) \triangleq D(P \,\|\, Q),  (57)
\Delta_\infty(P \,\|\, Q) \triangleq \log \frac{\max_x P(x)}{|\operatorname{argmax}(Q)|^{-1} \sum_{x \in \operatorname{argmax}(Q)} P(x)},  (58)
where argmax ( Q ) ≜ { x ∈ X : Q ( x ) = max x ′ ∈ X Q ( x ′ ) } and | argmax ( Q ) | is the cardinality of this set. With this extension to α ∈ { 0 , 1 , ∞ } , the relative α -entropy satisfies the following basic properties:
Proposition 5.
Let P and Q be PMFs. Then,
(i) 
For all α ∈ [ 0 , 1 ] , Δ α ( P ‖ Q ) is finite if and only if supp ( P ) ⊆ supp ( Q ) . For all α ∈ ( 1 , ∞ ) , Δ α ( P ‖ Q ) is finite if and only if | supp ( P ) ∩ supp ( Q ) | > 0 .
(ii) 
For all α ∈ [ 0 , ∞ ] , Δ α ( P ‖ Q ) ≥ 0 . If α ∈ ( 0 , ∞ ) , then Δ α ( P ‖ Q ) = 0 if and only if P = Q .
(iii) 
For every α ( 0 , ) , the mapping Q Δ α ( P Q ) is continuous.
(iv) 
The mapping α Δ α ( P Q ) is continuous on [ 0 , ] .
(Part (i) differs from [19] (Proposition IV.1), where the conventions for α > 1 differ from ours. Our conventions are compatible with [24,25], and, as stated in Part (iii), they result in the continuity of the mapping Q Δ α ( P Q ) .)
Proof of Proposition 5.
Part (i) follows from the definition of Δ α ( P Q ) in (55) and the conventions (44). For α ( 0 , 1 ) ( 1 , ) , Part (ii) follows from [19] (Proposition IV.1); for α = 1 , Part (ii) holds because Δ 1 ( P Q ) = D ( P Q ) ; and for α { 0 , } , Part (ii) follows from the definition of Δ α ( P Q ) . Part (iii) follows from the definition of Δ α ( P Q ) , and Part (iv) follows from [19] (Proposition IV.1). □
In the rest of this section, we prove some auxiliary results that we need later (Propositions 6–9). We first establish the relation between D α ( P Q ) and Δ α ( P Q ) .
Proposition 6
([26] (Section V, Property 4)). Let P and Q be PMFs, and let α > 0 . Then,
\Delta_\alpha(P \,\|\, Q) = D_{1/\alpha}(\tilde{P} \,\|\, \tilde{Q}),  (59)
where the PMFs P ˜ and Q ˜ are given by
\tilde{P}(x) \triangleq \frac{P(x)^{\alpha}}{\sum_{x' \in \mathcal{X}} P(x')^{\alpha}},  (60)
\tilde{Q}(x) \triangleq \frac{Q(x)^{\alpha}}{\sum_{x' \in \mathcal{X}} Q(x')^{\alpha}}.  (61)
Proof. 
If α = 1 , then (59) holds because P ˜ = P , Q ˜ = Q , and Δ 1 ( P Q ) = D 1 ( P Q ) = D ( P Q ) . Now let α 1 . Because P ˜ ( x ) and Q ˜ ( x ) are zero if and only if P ( x ) and Q ( x ) are zero, respectively, the LHS of (59) is finite if and only if its RHS is finite. If D 1 / α ( P ˜ Q ˜ ) is finite, then (59) follows from a simple computation. □
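The identity (59) is also easy to check numerically. The sketch below (illustrative; the random PMFs and the order α = 1.7 are arbitrary choices made here) evaluates Δ_α(P‖Q) directly from (55) and compares it with D_{1/α}(P̃‖Q̃) computed from (50), (60), and (61).

```python
import numpy as np

def renyi_div(P, Q, alpha):
    """Rényi divergence D_alpha(P || Q) in bits, alpha not in {0, 1, inf}."""
    m = P > 0
    return np.log2(np.sum(P[m]**alpha * Q[m]**(1.0 - alpha))) / (alpha - 1.0)

def relative_alpha_entropy(P, Q, alpha):
    """Relative alpha-entropy Delta_alpha(P || Q) in bits, following (55)."""
    m = P > 0
    return (alpha / (1 - alpha) * np.log2(np.sum(P[m] * Q[m]**(alpha - 1)))
            + np.log2(np.sum(Q**alpha))
            - 1.0 / (1 - alpha) * np.log2(np.sum(P**alpha)))

def tilt(P, alpha):
    """The tilted PMF of (60)/(61): P(x)^alpha / sum_x' P(x')^alpha."""
    T = P**alpha
    return T / T.sum()

rng = np.random.default_rng(0)
P = rng.random(5); P /= P.sum()
Q = rng.random(5); Q /= Q.sum()
alpha = 1.7

lhs = relative_alpha_entropy(P, Q, alpha)
rhs = renyi_div(tilt(P, alpha), tilt(Q, alpha), 1.0 / alpha)
print(lhs, rhs)   # the two values agree, as claimed in (59)
```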
In light of Proposition 6, J α ( X ; Y ) and K α ( X ; Y ) are related as follows:
Proposition 7.
Let P X Y be a joint PMF, and let α > 0 . Then,
K_\alpha(X;Y) = J_{1/\alpha}(\tilde{X}; \tilde{Y}),  (62)
where the joint PMF of X ˜ and Y ˜ is given by
\tilde{P}_{XY}(x,y) \triangleq \frac{P_{XY}(x,y)^{\alpha}}{\sum_{(x',y') \in \mathcal{X} \times \mathcal{Y}} P_{XY}(x',y')^{\alpha}}.  (63)
Proof. 
Let α > 0 . For fixed PMFs Q X and Q Y , define the transformed PMFs Q X Q Y ˜ , Q ˜ X , and Q ˜ Y as
\widetilde{Q_X Q_Y}(x,y) \triangleq \frac{[Q_X(x)\, Q_Y(y)]^{\alpha}}{\sum_{(x',y') \in \mathcal{X} \times \mathcal{Y}} [Q_X(x')\, Q_Y(y')]^{\alpha}},  (64)
\tilde{Q}_X(x) \triangleq \frac{Q_X(x)^{\alpha}}{\sum_{x' \in \mathcal{X}} Q_X(x')^{\alpha}},  (65)
\tilde{Q}_Y(y) \triangleq \frac{Q_Y(y)^{\alpha}}{\sum_{y' \in \mathcal{Y}} Q_Y(y')^{\alpha}}.  (66)
Then,
K_\alpha(X;Y) = \min_{Q_X, Q_Y} \Delta_\alpha(P_{XY} \,\|\, Q_X Q_Y)  (67)
= \min_{Q_X, Q_Y} D_{1/\alpha}(\tilde{P}_{XY} \,\|\, \widetilde{Q_X Q_Y})  (68)
= \min_{Q_X, Q_Y} D_{1/\alpha}(\tilde{P}_{XY} \,\|\, \tilde{Q}_X \tilde{Q}_Y)  (69)
= \min_{Q_X, Q_Y} D_{1/\alpha}(\tilde{P}_{XY} \,\|\, Q_X Q_Y)  (70)
= J_{1/\alpha}(\tilde{X}; \tilde{Y}),  (71)
where (67) holds by the definition of K α ( X ; Y ) ; (68) follows from Proposition 6; (69) holds because Q X Q Y ˜ = Q ˜ X Q ˜ Y ; (70) holds because the transformations (65) and (66) are bijective on the set of PMFs over X and Y , respectively; and (71) holds by the definition of J α ( X ; Y ) . □
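The relation (62) can likewise be checked numerically for small alphabets. In the sketch below (illustrative; the toy PMF, α = 2, and the grid resolution are arbitrary choices made here), K_α(X;Y) is approximated by a brute-force search over product PMFs using (55), and J_{1/α}(X̃;Ỹ) is approximated in the same way using (50) applied to the tilted PMF (63); the two values agree up to the grid resolution.

```python
import numpy as np

def renyi_div(P, Q, alpha):
    """Rényi divergence D_alpha(P || Q) in bits, alpha not in {0, 1, inf}."""
    m = P > 0
    return np.log2(np.sum(P[m]**alpha * Q[m]**(1.0 - alpha))) / (alpha - 1.0)

def relative_alpha_entropy(P, Q, alpha):
    """Relative alpha-entropy Delta_alpha(P || Q) in bits, following (55)."""
    m = P > 0
    return (alpha / (1 - alpha) * np.log2(np.sum(P[m] * Q[m]**(alpha - 1)))
            + np.log2(np.sum(Q**alpha))
            - 1.0 / (1 - alpha) * np.log2(np.sum(P**alpha)))

def brute_force_min(P_XY, div, alpha, n_grid=199):
    """min over product PMFs Q_X x Q_Y of div(P_XY || Q_X Q_Y) for a 2x2 alphabet."""
    grid = np.linspace(0.005, 0.995, n_grid)
    return min(div(P_XY.ravel(),
                   np.outer([qx, 1 - qx], [qy, 1 - qy]).ravel(),
                   alpha)
               for qx in grid for qy in grid)

P_XY = np.array([[0.5, 0.2],
                 [0.1, 0.2]])
alpha = 2.0

K_alpha = brute_force_min(P_XY, relative_alpha_entropy, alpha)   # definition (2)
P_tilted = P_XY**alpha / np.sum(P_XY**alpha)                     # tilted PMF (63)
J_inv = brute_force_min(P_tilted, renyi_div, 1.0 / alpha)        # definition (1)
print(K_alpha, J_inv)   # agree up to the grid resolution, as in (62)
```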
The next proposition provides a characterization of the mutual information that parallels the definitions of J α ( X ; Y ) and K α ( X ; Y ) . Because D 1 ( P Q ) = Δ 1 ( P Q ) = D ( P Q ) , this also shows that J α ( X ; Y ) and K α ( X ; Y ) reduce to the mutual information when α is one.
Proposition 8
([27] (Theorem 3.4)). Let P X Y be a joint PMF. Then, for all PMFs Q X and Q Y ,
D(P_{XY} \,\|\, Q_X Q_Y) \ge D(P_{XY} \,\|\, P_X P_Y),  (72)
with equality if and only if Q X = P X and Q Y = P Y . Thus,
I(X;Y) = \min_{Q_X, Q_Y} D(P_{XY} \,\|\, Q_X Q_Y).  (73)
Proof. 
A simple computation reveals that
D(P_{XY} \,\|\, Q_X Q_Y) = D(P_{XY} \,\|\, P_X P_Y) + D(P_X \,\|\, Q_X) + D(P_Y \,\|\, Q_Y),  (74)
which implies (72) because D ( P ‖ Q ) ≥ 0 with equality if and only if P = Q . Thus, (73) holds because I ( X ; Y ) = D ( P X Y ‖ P X P Y ) . □
The last proposition of this section is about a precursor to J α ( X ; Y ) , namely, the minimization of D α ( P X Y Q X Q Y ) with respect to Q Y only, which can be carried out explicitly. (This proposition extends [5] (Equation (13)) and [2] (Lemma 29).)
Proposition 9.
Let P X Y be a joint PMF and Q X a PMF. Then, for every α ∈ ( 0 , 1 ) ∪ ( 1 , ∞ ) ,
\min_{Q_Y} D_\alpha(P_{XY} \,\|\, Q_X Q_Y) = \frac{\alpha}{\alpha-1} \log \sum_y \Bigl[\sum_x P(x,y)^{\alpha}\, Q_X(x)^{1-\alpha}\Bigr]^{1/\alpha},  (75)
with the conventions of (44). If the RHS of (75) is finite, then the minimum is achieved uniquely by
Q_Y^*(y) = \frac{\bigl[\sum_x P(x,y)^{\alpha}\, Q_X(x)^{1-\alpha}\bigr]^{1/\alpha}}{\sum_{y' \in \mathcal{Y}} \bigl[\sum_x P(x,y')^{\alpha}\, Q_X(x)^{1-\alpha}\bigr]^{1/\alpha}}.  (76)
For α = ∞,
\min_{Q_Y} D_\infty(P_{XY} \,\|\, Q_X Q_Y) = \log \sum_y \max_x \frac{P(x,y)}{Q_X(x)},  (77)
with the conventions of (44). If the RHS of (77) is finite, then the minimum is achieved uniquely by
Q_Y^*(y) = \frac{\max_x [P(x,y)/Q_X(x)]}{\sum_{y' \in \mathcal{Y}} \max_x [P(x,y')/Q_X(x)]}.  (78)
Proof. 
We first treat the case α ( 0 , 1 ) ( 1 , ) . If the RHS of (75) is infinite, then the conventions imply that D α ( P X Y Q X Q Y ) is infinite for every Q Y P ( Y ) , so (75) holds. Otherwise, if the RHS of (75) is finite, then the PMF Q Y * given by (76) is well-defined, and a simple computation shows that for every Q Y P ( Y ) ,
D_\alpha(P_{XY} \,\|\, Q_X Q_Y) = \frac{\alpha}{\alpha-1} \log \sum_y \Bigl[\sum_x P(x,y)^{\alpha}\, Q_X(x)^{1-\alpha}\Bigr]^{1/\alpha} + D_\alpha(Q_Y^* \,\|\, Q_Y).  (79)
The only term on the RHS of (79) that depends on Q Y is D α ( Q Y * Q Y ) . Because D α ( Q Y * Q Y ) 0 with equality if and only if Q Y = Q Y * (Proposition 4), (79) implies (75) and (76).
The case α = is analogous: if the RHS of (77) is infinite, then the LHS of (77) is infinite, too; and if the RHS of (77) is finite, then the PMF Q Y * given by (78) is well-defined, and a simple computation shows that for every Q Y P ( Y ) ,
D_\infty(P_{XY} \,\|\, Q_X Q_Y) = \log \sum_y \max_x \frac{P(x,y)}{Q_X(x)} + D_\infty(Q_Y^* \,\|\, Q_Y).  (80)
The only term on the RHS of (80) that depends on Q Y is D ( Q Y * Q Y ) . Because D ( Q Y * Q Y ) 0 with equality if and only if Q Y = Q Y * (Proposition 4), (80) implies (77) and (78). □
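Proposition 9 is convenient computationally because the inner minimization over Q_Y is available in closed form. The following sketch (illustrative; the toy PMF, Q_X, and α are arbitrary choices made here) evaluates (75) and the minimizer (76) and cross-checks them against a one-dimensional grid search.

```python
import numpy as np

def renyi_div(P, Q, alpha):
    """Rényi divergence D_alpha(P || Q) in bits, alpha not in {0, 1, inf}."""
    m = P > 0
    return np.log2(np.sum(P[m]**alpha * Q[m]**(1.0 - alpha))) / (alpha - 1.0)

def inner_min(P_XY, Q_X, alpha):
    """Closed-form min over Q_Y of D_alpha(P_XY || Q_X Q_Y) and its minimizer, per (75)-(76)."""
    # s(y) = [ sum_x P(x,y)^alpha Q_X(x)^(1-alpha) ]^(1/alpha)
    s = np.sum(P_XY**alpha * Q_X[:, None]**(1.0 - alpha), axis=0)**(1.0 / alpha)
    value = alpha / (alpha - 1.0) * np.log2(np.sum(s))
    return value, s / s.sum()

P_XY = np.array([[0.5, 0.2],
                 [0.1, 0.2]])
Q_X = np.array([0.6, 0.4])
alpha = 2.0

val, Q_Y_star = inner_min(P_XY, Q_X, alpha)

# Cross-check against a grid search over Q_Y for the binary case.
grid = np.linspace(0.001, 0.999, 999)
val_grid = min(renyi_div(P_XY.ravel(),
                         np.outer(Q_X, [qy, 1 - qy]).ravel(),
                         alpha)
               for qy in grid)
print(val, val_grid)   # the closed form (75) matches the grid search
print(Q_Y_star)        # the unique minimizer (76)
```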

5. Two Measures of Dependence

We state the properties of J α ( X ; Y ) in Theorem 1 and those of K α ( X ; Y ) in Theorem 2. The enumeration labels in the theorems refer to the lemmas in Section 6 where the properties are proved. (The enumeration labels are not consecutive because, in order to avoid forward references in the proofs, the order of the results in Section 6 is not the same as here.)
Theorem 1.
Let X, X 1 , X 2 , Y, Y 1 , Y 2 , and Z be random variables taking values in finite sets. Then:
(Lemma 1)
For every α [ 0 , ] , the minimum in the definition of J α ( X ; Y ) exists and is finite.
The following properties of the mutual information I ( X ; Y ) [28] (Chapter 2) are also satisfied by J α ( X ; Y ) :
(Lemma 2)
For all α [ 0 , ] , J α ( X ; Y ) 0 . If α ( 0 , ] , then J α ( X ; Y ) = 0 if and only if X and Y are independent (nonnegativity).
(Lemma 3)
For all α [ 0 , ] , J α ( X ; Y ) = J α ( Y ; X ) (symmetry).
(Lemma 4)
If X Y Z form a Markov chain, then J α ( X ; Z ) J α ( X ; Y ) for all α [ 0 , ] (data-processing inequality).
(Lemma 12)
If the pairs ( X 1 , Y 1 ) and ( X 2 , Y 2 ) are independent, then J α ( X 1 , X 2 ; Y 1 , Y 2 ) = J α ( X 1 ; Y 1 ) + J α ( X 2 ; Y 2 ) for all α [ 0 , ] (additivity).
(Lemma 13)
For all α ∈ [ 0 , ∞ ] , J α ( X ; Y ) ≤ log | X | with equality if and only if ( α ∈ [ 1/2 , ∞ ] , X is distributed uniformly over X , and H ( X | Y ) = 0 ) .
(Lemma 14)
For every α [ 1 , ] , J α ( X ; Y ) is concave in P X for fixed P Y | X .
Moreover:
(Lemma 5)
J 0 ( X ; Y ) = 0 .
(Lemma 6)
Let f : { 1 , … , | X | } → X and g : { 1 , … , | Y | } → Y be bijective functions, and let A be the | X | × | Y | matrix whose Row-i Column-j entry A i , j equals the square root of P X Y ( f ( i ) , g ( j ) ) . Then,
J_{1/2}(X;Y) = -2 \log \sigma_1(\mathrm{A}),  (81)
where σ 1 ( A ) denotes the largest singular value of A . (Because the singular values of a matrix are invariant under row and column permutations, the result does not depend on f or g.)
(Lemma 7)
J 1 ( X ; Y ) = I ( X ; Y ) .
(Lemma 8)
For all α > 0 ,
(1-\alpha)\, J_\alpha(X;Y) = \min_{R_{XY} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})} \bigl[(1-\alpha)\, D(R_{XY} \,\|\, R_X R_Y) + \alpha\, D(R_{XY} \,\|\, P_{XY})\bigr].  (82)
Thus, being the minimum of concave functions in α, the mapping α ( 1 α ) J α ( X ; Y ) is concave on ( 0 , ) .
(Lemma 9)
The mapping α J α ( X ; Y ) is nondecreasing on [ 0 , ] .
(Lemma 10)
The mapping α J α ( X ; Y ) is continuous on [ 0 , ] .
(Lemma 11)
If X = Y with probability one, then
J_\alpha(X;Y) = \begin{cases} \frac{\alpha}{1-\alpha}\, H_\infty(X) & \text{if } \alpha \in [0, \tfrac{1}{2}], \\ H_{\frac{\alpha}{2\alpha-1}}(X) & \text{if } \alpha \in (\tfrac{1}{2}, \infty), \\ H_{1/2}(X) & \text{if } \alpha = \infty. \end{cases}  (83)
The minimization problem in the definition of J α ( X ; Y ) has the following characteristics:
(Lemma 15)
For every α [ 1 2 , ] , the mapping ( Q X , Q Y ) D α ( P X Y Q X Q Y ) is convex, i.e., for all λ , λ [ 0 , 1 ] with λ + λ = 1 , all Q X , Q X P ( X ) , and all Q Y , Q Y P ( Y ) ,
D α P X Y ( λ Q X + λ Q X ) ( λ Q Y + λ Q Y ) λ D α ( P X Y Q X Q Y ) + λ D α ( P X Y Q X Q Y ) .
For α [ 0 , 1 2 ) , the mapping need not be convex.
(Lemma 16)
Let α ∈ ( 0 , 1 ) ∪ ( 1 , ∞ ) . If ( Q X * , Q Y * ) achieves the minimum in the definition of J α ( X ; Y ) , then there exist positive normalization constants c and d such that
Q_X^*(x) = c \Bigl[\sum_y P(x,y)^{\alpha}\, Q_Y^*(y)^{1-\alpha}\Bigr]^{1/\alpha} \quad \forall\, x \in \mathcal{X},  (85)
Q_Y^*(y) = d \Bigl[\sum_x P(x,y)^{\alpha}\, Q_X^*(x)^{1-\alpha}\Bigr]^{1/\alpha} \quad \forall\, y \in \mathcal{Y},  (86)
with the conventions of (44). The case α = ∞ is similar: if ( Q X * , Q Y * ) achieves the minimum in the definition of J ∞ ( X ; Y ) , then there exist positive normalization constants c and d such that
Q_X^*(x) = c \max_y \frac{P(x,y)}{Q_Y^*(y)} \quad \forall\, x \in \mathcal{X},  (87)
Q_Y^*(y) = d \max_x \frac{P(x,y)}{Q_X^*(x)} \quad \forall\, y \in \mathcal{Y},  (88)
with the conventions of (44). (If α = 1 , then Q X * = P X and Q Y * = P Y by Proposition 8.) Thus, for all α ∈ ( 0 , ∞ ] , both inclusions supp ( Q X * ) ⊆ supp ( P X ) and supp ( Q Y * ) ⊆ supp ( P Y ) hold.
(Lemma 20)
For every α ( 1 2 , ] , the mapping ( Q X , Q Y ) D α ( P X Y Q X Q Y ) has a unique minimizer. This need not be the case when α [ 0 , 1 2 ] .
The measure J α ( X ; Y ) can also be expressed as follows:
(Lemma 17)
For all α ( 0 , ] ,
J α ( X ; Y ) = min Q X ϕ α ( Q X ) ,
where ϕ α ( Q X ) is defined as
ϕ α ( Q X ) min Q Y D α ( P X Y Q X Q Y )
and is given explicitly as follows: for α ( 0 , 1 ) ( 1 , ) ,
ϕ α ( Q X ) = α α 1 log y [ x P ( x , y ) α Q X ( x ) 1 α ] 1 α ,
with the conventions of (44); and for α { 1 , } ,
ϕ 1 ( Q X ) = D ( P X Y Q X P Y ) ,
ϕ ( Q X ) = log y max x P ( x , y ) Q X ( x ) ,
with the conventions of (44). For every α [ 1 2 , ] , the mapping Q X ϕ α ( Q X ) is convex. For α ( 0 , 1 2 ) , the mapping need not be convex.
(Lemma 18)
For all α ( 0 , 1 ) ( 1 , ] ,
J α ( X ; Y ) = min R X Y P ( X × Y ) ψ α ( R X Y ) i f α ( 0 , 1 ) , max R X Y P ( X × Y ) ψ α ( R X Y ) i f α ( 1 , ] ,
where
ψ α ( R X Y ) D ( R X Y R X R Y ) + α 1 α D ( R X Y P X Y ) i f α ( 0 , 1 ) ( 1 , ) , D ( R X Y R X R Y ) D ( R X Y P X Y ) i f α = .
For every α ( 1 , ] , the mapping R X Y ψ α ( R X Y ) is concave. For all α ( 1 , ] and all R X Y P ( X × Y ) , the statement J α ( X ; Y ) = ψ α ( R X Y ) is equivalent to ψ α ( R X Y ) = D α ( P X Y R X R Y ) .
(Lemma 19)
For all α ( 0 , 1 ) ( 1 , ) ,
J α ( X ; Y ) = min R X P X 1 α 1 D α α 1 ( P X R X ) α E 0 1 α α , R X ,
where the minimization is over all PMFs R X satisfying R X P X i . e . , supp ( R X ) supp ( P X ) ; D α ( P Q ) for negative α is given by (54); and Gallager’s E 0 function [29] is defined as
E 0 ( ρ , R X ) log y [ x R X ( x ) P ( y | x ) 1 1 + ρ ] 1 + ρ .
We now move on to the properties of K α ( X ; Y ) . Some of these properties are derived from their counterparts of J α ( X ; Y ) using the relation K α ( X ; Y ) = J 1 / α ( X ˜ ; Y ˜ ) described in Proposition 7.
Theorem 2.
Let X, X 1 , X 2 , Y, Y 1 , Y 2 , and Z be random variables taking values in finite sets. Then:
(Lemma 21)
For every α [ 0 , ] , the minimum in the definition of K α ( X ; Y ) in (2) exists and is finite.
The following properties of the mutual information I ( X ; Y ) are also satisfied by K α ( X ; Y ) :
(Lemma 22)
For all α [ 0 , ] , K α ( X ; Y ) 0 . If α ( 0 , ) , then K α ( X ; Y ) = 0 if and only if X and Y are independent (nonnegativity).
(Lemma 23)
For all α [ 0 , ] , K α ( X ; Y ) = K α ( Y ; X ) (symmetry).
(Lemma 34)
If the pairs ( X 1 , Y 1 ) and ( X 2 , Y 2 ) are independent, then K α ( X 1 , X 2 ; Y 1 , Y 2 ) = K α ( X 1 ; Y 1 ) + K α ( X 2 ; Y 2 ) for all α [ 0 , ] (additivity).
(Lemma 35)
For all α [ 0 , ] , K α ( X ; Y ) log | X | .
Unlike the mutual information, K α ( X ; Y ) does not satisfy the data-processing inequality:
(Lemma 36)
There exists a Markov chain X Y Z for which K 2 ( X ; Z ) > K 2 ( X ; Y ) .
Moreover:
(Lemma 24)
For all α ( 0 , ) ,
K α ( X ; Y ) + H α ( X , Y ) = min Q X , Q Y log M α 1 α ( Q X , Q Y ) ,
where M β ( Q X , Q Y ) is the following weighted power mean [30] (Chapter III): For β R { 0 } ,
M β ( Q X , Q Y ) [ x , y P ( x , y ) [ Q X ( x ) Q Y ( y ) ] β ] 1 β ,
where for β < 0 , we read P ( x , y ) [ Q X ( x ) Q Y ( y ) ] β as P ( x , y ) / [ Q X ( x ) Q Y ( y ) ] β and use the conventions (44); and for β = 0 , using the convention 0 0 = 1 ,
M 0 ( Q X , Q Y ) x , y [ Q X ( x ) Q Y ( y ) ] P ( x , y ) .
(Lemma 25)
For α = 0 ,
K 0 ( X ; Y ) = log | supp ( P X P Y ) | | supp ( P X Y ) |
min Q X , Q Y log max ( x , y ) supp ( P X Y ) 1 Q X ( x ) Q Y ( y ) log | supp ( P X Y ) |
= lim α 0 K α ( X ; Y ) ,
where in the RHS of (102), we use the conventions (44). The inequality can be strict, so α K α ( X ; Y ) need not be continuous at α = 0 .
(Lemma 26)
K 1 ( X ; Y ) = I ( X ; Y ) .
(Lemma 27)
Let f : { 1 , , | X | } X and g : { 1 , , | Y | } Y be bijective functions, and let B be the | X | × | Y | matrix whose Row-i Column-j entry B i , j equals P X Y ( f ( i ) , g ( j ) ) . Then,
K_2(X;Y) = -2 \log \sigma_1(\mathrm{B}) - H_2(X,Y),  (104)
where σ 1 ( B ) denotes the largest singular value of B . (Because the singular values of a matrix are invariant under row and column permutations, the result does not depend on f or g.)
(Lemma 28)
K ( X ; Y ) = 0 .
(Lemma 29)
The mapping α K α ( X ; Y ) need not be monotonic on [ 0 , ] .
(Lemma 30)
The mapping α K α ( X ; Y ) + H α ( X , Y ) is nonincreasing on [ 0 , ] .
(Lemma 31)
The mapping α K α ( X ; Y ) is continuous on ( 0 , ] . (See Lemma 25 for the behavior at α = 0 .)
(Lemma 32)
If X = Y with probability one, then
K_\alpha(X;Y) = \begin{cases} 2\, H_{\frac{\alpha}{2-\alpha}}(X) - H_\alpha(X) & \text{if } \alpha \in [0, 2), \\ \frac{\alpha}{\alpha-1}\, H_\infty(X) - H_\alpha(X) & \text{if } \alpha \in [2, \infty), \\ 0 & \text{if } \alpha = \infty. \end{cases}  (105)
(Lemma 33)
For every α ( 0 , 2 ) , the mapping ( Q X , Q Y ) Δ α ( P X Y Q X Q Y ) in the definition of K α ( X ; Y ) in (2) has a unique minimizer. This need not be the case when α { 0 } [ 2 , ] .

6. Proofs

In this section, we prove the properties of J α ( X ; Y ) and K α ( X ; Y ) stated in Section 5.
Lemma 1.
For every α [ 0 , ] , the minimum in the definition of J α ( X ; Y ) exists and is finite.
Proof. 
Let α [ 0 , ] . Then inf Q X , Q Y D α ( P X Y Q X Q Y ) is finite because D α ( P X Y P X P Y ) is finite and because the Rényi divergence is nonnegative. The minimum exists because the set P ( X ) × P ( Y ) is compact and the mapping ( Q X , Q Y ) D α ( P X Y Q X Q Y ) is continuous. □
Lemma 2.
For all α [ 0 , ] , J α ( X ; Y ) 0 . If α ( 0 , ] , then J α ( X ; Y ) = 0 if and only if X and Y are independent (nonnegativity).
Proof. 
The nonnegativity follows from the definition of J α ( X ; Y ) because the Rényi divergence is nonnegative for α [ 0 , ] . If X and Y are independent, then P X Y = P X P Y , and the choice Q X = P X and Q Y = P Y in the definition of J α ( X ; Y ) achieves J α ( X ; Y ) = 0 . Conversely, if J α ( X ; Y ) = 0 , then there exist PMFs Q X * and Q Y * satisfying D α ( P X Y Q X * Q Y * ) = 0 . If, in addition, α ( 0 , ] , then P X Y = Q X * Q Y * by Proposition 4, and hence X and Y are independent. □
Lemma 3.
For all α [ 0 , ] , J α ( X ; Y ) = J α ( Y ; X ) (symmetry).
Proof. 
The definition of J α ( X ; Y ) is symmetric in X and Y. □
Lemma 4.
If X Y Z form a Markov chain, then J α ( X ; Z ) J α ( X ; Y ) for all α [ 0 , ] (data-processing inequality).
Proof. 
Let X Y Z form a Markov chain, and let α [ 0 , ] . Let Q ^ X and Q ^ Y be PMFs that achieve the minimum in the definition of J α ( X ; Y ) , so
J α ( X ; Y ) = D α ( P X Y Q ^ X Q ^ Y ) .
Define the PMF Q ^ Z as
Q ^ Z ( z ) y Q ^ Y ( y ) P Z | Y ( z | y ) .
(As noted in the preliminaries, we define P Z | Y ( z | y ) 1 / | Z | when P Y ( y ) = 0 .) We show below that
D α ( P X Z Q ^ X Q ^ Z ) D α ( P X Y Q ^ X Q ^ Y ) ,
which implies the data-processing inequality because
J α ( X ; Z ) D α ( P X Z Q ^ X Q ^ Z )
D α ( P X Y Q ^ X Q ^ Y )
= J α ( X ; Y ) ,
where (109) holds by the definition of J α ( X ; Z ) ; (110) follows from (108); and (111) follows from (106).
The proof of (108) is based on the data-processing inequality for the Rényi divergence. Define the conditional PMF A X Z | X Y as
A X Z | X Y ( x , z | x , y ) 𝟙 { x = x } P Z | Y ( z | y ) .
If ( X , Y ) P X Y , then the marginal distribution of X and Z is
( P X Y A X Z | X Y ) ( x , z ) = x , y P X Y ( x , y ) A X Z | X Y ( x , z | x , y )
= y P X Y ( x , y ) P Z | Y ( z | y )
= y P X Y ( x , y ) P Z | X Y ( z | x , y )
= P X Z ( x , z ) ,
where (114) follows from (112); and (115) holds because X, Y, and Z form a Markov chain. If ( X , Y ) Q ^ X Q ^ Y , then the marginal distribution of X and Z is
( Q ^ X Q ^ Y A X Z | X Y ) ( x , z ) = x , y Q ^ X ( x ) Q ^ Y ( y ) A X Z | X Y ( x , z | x , y )
= y Q ^ X ( x ) Q ^ Y ( y ) P Z | Y ( z | y )
= Q ^ X ( x ) Q ^ Z ( z ) ,
where (118) follows from (112), and (119) follows from (107). Finally, we are ready to prove (108):
D α ( P X Z Q ^ X Q ^ Z ) = D α ( P X Y A X Z | X Y ) ( Q ^ X Q ^ Y A X Z | X Y )
D α ( P X Y Q ^ X Q ^ Y ) ,
where (120) follows from (116) and (119), and where (121) follows from the data-processing inequality for the Rényi divergence [22] (Theorem 9). □
Lemma 5.
J 0 ( X ; Y ) = 0 .
Proof. 
By Lemma 2, J 0 ( X ; Y ) 0 , so it suffices to show that J 0 ( X ; Y ) 0 . Let ( x ^ , y ^ ) X × Y satisfy P X Y ( x ^ , y ^ ) > 0 . Define the PMF Q ^ X as Q ^ X ( x ) 𝟙 { x = x ^ } and the PMF Q ^ Y as Q ^ Y ( y ) 𝟙 { y = y ^ } . Then, D 0 ( P X Y Q ^ X Q ^ Y ) = 0 , so J 0 ( X ; Y ) 0 by the definition of J 0 ( X ; Y ) . □
Lemma 6.
Let f : { 1 , … , | X | } → X and g : { 1 , … , | Y | } → Y be bijective functions, and let A be the | X | × | Y | matrix whose Row-i Column-j entry A i , j equals the square root of P X Y ( f ( i ) , g ( j ) ) . Then,
J_{1/2}(X;Y) = -2 \log \sigma_1(\mathrm{A}),  (122)
where σ 1 ( A ) denotes the largest singular value of A . (Because the singular values of a matrix are invariant under row and column permutations, the result does not depend on f or g.)
Proof. 
By the definitions of J α ( X ; Y ) and the Rényi divergence,
J_{1/2}(X;Y) = -2 \log \max_{Q_X, Q_Y} \sum_{x,y} \sqrt{Q_X(x)\, P(x,y)\, Q_Y(y)}.  (123)
The claim follows from (123) because
max Q X , Q Y x , y Q X ( x ) P ( x , y ) Q Y ( y ) = max u 2 = v 2 = 1 u T A v
= max v 2 = 1 A v 2
= σ 1 ( A ) ,
where u and v are column vectors with | X | and | Y | elements, respectively; (124) is shown below; (125) follows from the Cauchy–Schwarz inequality | u T A v | u 2 A v 2 , which holds with equality if u and A v are linearly dependent; and (126) holds because the spectral norm of a matrix is equal to its largest singular value [31] (Example 5.6.6).
We now prove (124). Let u and v be vectors that satisfy u 2 = v 2 = 1 , and define the PMFs Q ^ X and Q ^ Y as Q ^ X ( x ) u f 1 ( x ) 2 and Q ^ Y ( y ) v g 1 ( y ) 2 , where f 1 and g 1 denote the inverse functions of f and g, respectively. Then,
u T A v = i , j u i A i , j v j
i , j | u i | A i , j | v j |
= x , y Q ^ X ( x ) P ( x , y ) Q ^ Y ( y )
max Q X , Q Y x , y Q X ( x ) P ( x , y ) Q Y ( y ) ,
where (128) holds because all the entries of A are nonnegative, and in (129), we changed the summation variables to x f ( i ) and y g ( j ) . It remains to show that equality can be achieved in (128) and (130). To that end, let Q X * and Q Y * be PMFs that achieve the maximum on the RHS of (130), and define the vectors u and v as u i Q X * ( f ( i ) ) 1 / 2 and v j Q Y * ( g ( j ) ) 1 / 2 . Then, u 2 = v 2 = 1 , and (128) and (130) hold with equality, which proves (124). □
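Lemma 6 gives a direct way to evaluate J_{1/2} numerically via a singular-value decomposition. The sketch below (illustrative; the toy PMF and the grid resolution are arbitrary choices made here) computes −2 log σ_1(A) with A_{i,j} = √(P_XY(i,j)) and compares it with a brute-force evaluation of the definition (1) at α = 1/2.

```python
import numpy as np

def renyi_div(P, Q, alpha):
    """Rényi divergence D_alpha(P || Q) in bits, alpha not in {0, 1, inf}."""
    m = P > 0
    return np.log2(np.sum(P[m]**alpha * Q[m]**(1.0 - alpha))) / (alpha - 1.0)

P_XY = np.array([[0.5, 0.2],
                 [0.1, 0.2]])

# Lemma 6: J_{1/2}(X;Y) = -2 log sigma_1(A) with A_{i,j} = sqrt(P_XY(i,j)).
sigma_1 = np.linalg.svd(np.sqrt(P_XY), compute_uv=False)[0]
J_half_svd = -2.0 * np.log2(sigma_1)

# Brute-force check of definition (1) at alpha = 1/2 over a grid of product PMFs.
grid = np.linspace(0.005, 0.995, 199)
J_half_grid = min(renyi_div(P_XY.ravel(),
                            np.outer([qx, 1 - qx], [qy, 1 - qy]).ravel(),
                            0.5)
                  for qx in grid for qy in grid)
print(J_half_svd, J_half_grid)   # the two values agree up to the grid resolution
```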
Lemma 7.
J 1 ( X ; Y ) = I ( X ; Y ) .
Proof. 
This follows from Proposition 8 because D 1 ( P X Y Q X Q Y ) in the definition of J 1 ( X ; Y ) is equal to D ( P X Y Q X Q Y ) . □
Lemma 8.
For all α > 0 ,
( 1 α ) J α ( X ; Y ) = min R X Y P ( X × Y ) ( 1 α ) D ( R X Y R X R Y ) + α D ( R X Y P X Y ) .
Thus, being the minimum of concave functions in α, the mapping α ( 1 α ) J α ( X ; Y ) is concave on ( 0 , ) .
Proof. 
For α = 1 , (131) holds because D ( R X Y P X Y ) 0 with equality if R X Y = P X Y . For α ( 0 , 1 ) ,
( 1 α ) J α ( X ; Y ) = min Q X , Q Y ( 1 α ) D α ( P X Y Q X Q Y )
= min Q X , Q Y min R X Y ( 1 α ) D ( R X Y Q X Q Y ) + α D ( R X Y P X Y )
= min R X Y ( 1 α ) D ( R X Y R X R Y ) + α D ( R X Y P X Y ) ,
where (132) holds by the definition of J α ( X ; Y ) ; (133) follows from [22] (Theorem 30); and (134) follows from Proposition 8 after swapping the minima.
For α > 1 , define the sets
Q { ( Q X , Q Y ) P ( X ) × P ( Y ) : supp ( Q X Q Y ) = X × Y } ,
R { R X Y P ( X × Y ) : supp ( R X Y ) supp ( P X Y ) } .
Then,
( 1 α ) J α ( X ; Y ) = sup ( Q X , Q Y ) Q ( 1 α ) D α ( P X Y Q X Q Y )
= sup ( Q X , Q Y ) Q min R X Y R ( Q X , Q Y ) Q ( 1 α ) D ( R X Y Q X Q Y ) + α D ( R X Y P X Y )
= min R X Y R ( Q X , Q Y ) Q sup ( Q X , Q Y ) Q ( 1 α ) D ( R X Y Q X Q Y ) + α D ( R X Y P X Y )
= min R X Y P ( X × Y ) ( 1 α ) D ( R X Y R X R Y ) + α D ( R X Y P X Y ) ,
where (137) follows from the definition of J α ( X ; Y ) because 1 α < 0 and because the mapping ( Q X , Q Y ) D α ( P X Y Q X Q Y ) is continuous; (138) follows from [22] (Theorem 30); (139) follows from a minimax theorem and is justified below; and (140) follows from Proposition 8, a continuity argument, and the observation that D ( R X Y P X Y ) is infinite if R X Y R .
We now verify the conditions of Ky Fan’s minimax theorem [32] (Theorem 2), which will establish (139). (We use Ky Fan’s minimax theorem because it does not require that the set Q be compact, and having a noncompact set Q helps to guarantee that the function f defined next takes on finite values only. A brief proof of Ky Fan’s minimax theorem appears in [33].) Let the function f : R × Q R be defined by the expression in square brackets in (139), i.e.,
f ( R X Y , Q X , Q Y ) ( 1 α ) D ( R X Y Q X Q Y ) + α D ( R X Y P X Y ) .
We check that
(i)
the sets Q and R are convex;
(ii)
the set R is compact;
(iii)
the function f is real-valued;
(iv)
for every ( Q X , Q Y ) Q , the function f is continuous in R X Y ;
(v)
for every ( Q X , Q Y ) Q , the function f is convex in R X Y ; and
(vi)
for every R X Y R , the function f is concave in the pair ( Q X , Q Y ) .
Indeed, Parts (i) and (ii) are easy to see; Part (iii) holds because both relative entropies on the RHS of (141) are finite by our definitions of Q and R ; and to show Parts (iv)–(vi), we rewrite f as:
f ( R X Y , Q X , Q Y ) = H ( R X Y ) α x , y R X Y ( x , y ) log P ( x , y ) = + ( α 1 ) x R X ( x ) log Q X ( x ) + ( α 1 ) y R Y ( y ) log Q Y ( y ) .
From (142), we see that Part (iv) holds by our definitions of Q and R ; Part (v) holds because the entropy is a concave function (so H ( R X Y ) is convex), because linear functionals of R X Y are convex, and because the sum of convex functions is convex; and Part (vi) holds because the logarithm is a concave function and because a nonnegative weighted sum of concave functions is concave. (In Ky Fan’s theorem, weaker conditions than Parts (i)–(vi) are required, but it is not difficult to see that Parts (i)–(vi) are sufficient.)
The last claim, namely, that the mapping α ( 1 α ) J α ( X ; Y ) is concave on ( 0 , ) , is true because the expression in square brackets on the RHS of (131) is concave in α for every R X Y and because the pointwise minimum preserves the concavity. □
Lemma 9.
The mapping α J α ( X ; Y ) is nondecreasing on [ 0 , ] .
Proof. 
This is true because for every α , α [ 0 , ] with α α ,
min Q X , Q Y D α ( P X Y Q X Q Y ) min Q X , Q Y D α ( P X Y Q X Q Y ) ,
which holds because the Rényi divergence is nondecreasing in α (Proposition 4). □
Lemma 10.
The mapping α J α ( X ; Y ) is continuous on [ 0 , ] .
Proof. 
By Lemma 8, the mapping α ( 1 α ) J α ( X ; Y ) is concave on ( 0 , ) , thus it is continuous on ( 0 , ) , which implies that α J α ( X ; Y ) is continuous on ( 0 , 1 ) ( 1 , ) .
We next prove the continuity at α = 0 . Let Q X * and Q Y * be PMFs that achieve the minimum in the definition of J 0 ( X ; Y ) . Then, for all α 0 ,
D 0 ( P X Y Q X * Q Y * ) = J 0 ( X ; Y )
J α ( X ; Y )
D α ( P X Y Q X * Q Y * ) ,
where (145) holds because α J α ( X ; Y ) is nondecreasing (Lemma 9), and (146) holds by the definition of J α ( X ; Y ) . The Rényi divergence is continuous in α (Proposition 4), so (144)–(146) and the sandwich theorem imply that J α ( X ; Y ) is continuous at α = 0 .
We continue with the continuity at α = . Define
τ min ( x , y ) supp ( P X Y ) P ( x , y ) .
Then, for all α > 1 ,
J ( X ; Y ) J α ( X ; Y )
= min Q X , Q Y 1 α 1 log x , y P ( x , y ) P ( x , y ) α 1 [ Q X ( x ) Q Y ( y ) ] α 1
min Q X , Q Y 1 α 1 log max x , y τ P ( x , y ) α 1 [ Q X ( x ) Q Y ( y ) ] α 1
= 1 α 1 log τ + min Q X , Q Y log max x , y P ( x , y ) Q X ( x ) Q Y ( y )
= 1 α 1 log τ + J ( X ; Y ) ,
where (148) holds because α J α ( X ; Y ) is nondecreasing (Lemma 9), and (149) and (152) hold by the definitions of J α ( X ; Y ) and the Rényi divergence. The RHS of (152) tends to J ( X ; Y ) as α tends to infinity, so J α ( X ; Y ) is continuous at α = by the sandwich theorem.
It remains to show the continuity at α = 1 . Let α ( 3 4 , 1 ) ( 1 , 5 4 ) , and let δ | 1 α | ( 0 , 1 4 ) . Then, for all PMFs Q X and Q Y ,
2 δ D α ( P X Y Q X Q Y ) 2 δ D 1 δ ( P X Y Q X Q Y )
= x , y P ( x , y ) [ Q X ( x ) Q Y ( y ) P ( x , y ) ] δ
= x , y P ( x , y ) [ P X ( x ) P Y ( y ) P ( x , y ) ] δ [ Q X ( x ) Q Y ( y ) P X ( x ) P Y ( y ) ] δ
{ x , y P ( x , y ) [ P X ( x ) P Y ( y ) P ( x , y ) ] 2 δ } 1 2 · { x , y P ( x , y ) [ Q X ( x ) Q Y ( y ) P X ( x ) P Y ( y ) ] 2 δ } 1 2
{ x , y P ( x , y ) [ P X ( x ) P Y ( y ) P ( x , y ) ] 2 δ } 1 2
= 2 δ D 1 2 δ ( P X Y P X P Y ) ,
where (153) holds because 1 δ α and because the Rényi divergence is nondecreasing in α (Proposition 4); (156) follows from the Cauchy–Schwarz inequality; and (157) holds because
{ x , y P ( x , y ) [ Q X ( x ) Q Y ( y ) P X ( x ) P Y ( y ) ] 2 δ } 1 2 { x P X ( x ) [ Q X ( x ) P X ( x ) ] 4 δ } 1 4 · { y P Y ( y ) [ Q Y ( y ) P Y ( y ) ] 4 δ } 1 4
= 2 δ D 1 4 δ ( P X Q X ) · 2 δ D 1 4 δ ( P Y Q Y )
1 ,
where (159) follows from the Cauchy–Schwarz inequality, and (161) holds because 1 4 δ > 0 and because the Rényi divergence is nonnegative for positive orders (Proposition 4). Thus, for all α ( 3 4 , 5 4 ) ,
D 1 2 | 1 α | ( P X Y P X P Y ) min Q X , Q Y D α ( P X Y Q X Q Y )
= J α ( X ; Y )
D α ( P X Y P X P Y ) ,
where (162) follows from (158) if α 1 and from Proposition 8 if α = 1 ; and (164) holds by the definition of J α ( X ; Y ) . The Rényi divergence is continuous in α (Proposition 4), thus (162)–(164) and the sandwich theorem imply that J α ( X ; Y ) is continuous at α = 1 . □
Lemma 11.
If X = Y with probability one, then
J α ( X ; Y ) = α 1 α H ( X ) i f α [ 0 , 1 2 ] , H α 2 α 1 ( X ) i f α > 1 2 , H 1 2 ( X ) i f α = .
Proof. 
We show below that (165) holds for α ( 0 , 1 ) ( 1 , ) . Thus, (165) holds also for α { 0 , 1 , } because both its sides are continuous in α : its LHS by Lemma 10, and its RHS by the continuity of the Rényi entropy (Proposition 3).
Fix α ( 0 , 1 ) ( 1 , ) . Then,
J α ( X ; Y ) = min Q X min Q Y D α ( P X Y Q X Q Y )
= min Q X α α 1 log y [ x P ( x , y ) α Q X ( x ) 1 α ] 1 α
= min Q X α α 1 log x P X ( x ) Q X ( x ) 1 α α ,
where (166) follows from Proposition 9, and (168) holds because
P X Y ( x , y ) = P X ( x ) if x = y , 0 otherwise .
First consider the case α > 1 2 . Define γ x P X ( x ) α 2 α 1 . Then, for all Q X P ( X ) ,
α α 1 log x P X ( x ) Q X ( x ) 1 α α = α α 1 log x γ γ 1 P X ( x ) α 2 α 1 2 α 1 α Q X ( x ) 1 α α
= 2 α 1 α 1 log γ + D 2 α 1 α ( γ 1 P X α 2 α 1 Q X )
= H α 2 α 1 ( X ) + D 2 α 1 α ( γ 1 P X α 2 α 1 Q X ) ,
where (171) holds because x γ 1 P X ( x ) α 2 α 1 is a PMF. Because 2 α 1 α > 0 , Proposition 4 implies that D ( 2 α 1 ) / α ( P Q ) 0 with equality if Q = P . This together with (168) and (172) establishes (165).
Now consider the case α ( 0 , 1 2 ] . For all Q X P ( X ) ,
x P X ( x ) Q X ( x ) 1 α α x P X ( x ) Q X ( x )
x [ max x P X ( x ) ] Q X ( x )
= max x P X ( x ) ,
where (173) holds because Q X ( x ) [ 0 , 1 ] for all x X and because 1 α α 1 . The inequalities (173) and (174) both hold with equality when Q X ( x ) = 𝟙 { x = x * } , where x * X is such that P X ( x * ) = max x P X ( x ) . Thus,
max Q X x P X ( x ) Q X ( x ) 1 α α = max x P X ( x ) .
Now (165) follows:
J α ( X ; Y ) = min Q X α α 1 log x P X ( x ) Q X ( x ) 1 α α
= α α 1 log max Q X x P X ( x ) Q X ( x ) 1 α α
= α α 1 log max x P X ( x )
= α 1 α H ( X ) ,
where (177) follows from (168); (178) holds because α α 1 < 0 ; (179) follows from (176); and (180) follows from the definition of H ( X ) . □
Lemma 12.
If the pairs ( X 1 , Y 1 ) and ( X 2 , Y 2 ) are independent, then J α ( X 1 , X 2 ; Y 1 , Y 2 ) = J α ( X 1 ; Y 1 ) + J α ( X 2 ; Y 2 ) for all α [ 0 , ] (additivity).
Proof. 
Let the pairs ( X 1 , Y 1 ) and ( X 2 , Y 2 ) be independent. For α ( 0 , 1 ) ( 1 , ) , we establish the lemma by showing the following two inequalities:
J α ( X 1 , X 2 ; Y 1 , Y 2 ) J α ( X 1 ; Y 1 ) + J α ( X 2 ; Y 2 ) ,
J α ( X 1 , X 2 ; Y 1 , Y 2 ) J α ( X 1 ; Y 1 ) + J α ( X 2 ; Y 2 ) .
Because J α ( X ; Y ) is continuous in α (Lemma 10), this will also establish the lemma for α { 0 , 1 , } .
To show (181), let Q X 1 * and Q Y 1 * be PMFs that achieve the minimum in the definition of J α ( X 1 ; Y 1 ) , and let Q X 2 * and Q Y 2 * be PMFs that achieve the minimum in the definition of J α ( X 2 ; Y 2 ) , so
J α ( X 1 ; Y 1 ) = D α ( P X 1 Y 1 Q X 1 * Q Y 1 * ) ,
J α ( X 2 ; Y 2 ) = D α ( P X 2 Y 2 Q X 2 * Q Y 2 * ) .
Then, (181) holds because
J α ( X 1 , X 2 ; Y 1 , Y 2 ) D α ( P X 1 X 2 Y 1 Y 2 Q X 1 * Q X 2 * Q Y 1 * Q Y 2 * )
= D α ( P X 1 Y 1 Q X 1 * Q Y 1 * ) + D α ( P X 2 Y 2 Q X 2 * Q Y 2 * )
= J α ( X 1 ; Y 1 ) + J α ( X 2 ; Y 2 ) ,
where (185) holds by the definition of J α ( X 1 , X 2 ; Y 1 , Y 2 ) as a minimum; (186) follows from a simple computation using the independence hypothesis P X 1 X 2 Y 1 Y 2 = P X 1 Y 1 P X 2 Y 2 ; and (187) follows from (183) and (184).
To establish (182), we consider the cases α > 1 and α < 1 separately, starting with α > 1 . Let Q ^ X 1 X 2 and Q ^ Y 1 Y 2 be PMFs that achieve the minimum in the definition of J α ( X 1 , X 2 ; Y 1 , Y 2 ) , so
J α ( X 1 , X 2 ; Y 1 , Y 2 ) = D α ( P X 1 X 2 Y 1 Y 2 Q ^ X 1 X 2 Q ^ Y 1 Y 2 ) .
Define the function f : X 1 × Y 1 R { } as
f ( x 1 , y 1 ) x 2 , y 2 P X 2 Y 2 ( x 2 , y 2 ) α Q ^ X 2 | X 1 ( x 2 | x 1 ) Q ^ Y 2 | Y 1 ( y 2 | y 1 ) 1 α ,
and let ( x 1 , y 1 ) X 1 × Y 1 be such that
f ( x 1 , y 1 ) = min x 1 , y 1 f ( x 1 , y 1 ) .
Define the PMFs Q X 2 and Q Y 2 as
Q X 2 ( x 2 ) Q ^ X 2 | X 1 ( x 2 | x 1 ) ,
Q Y 2 ( y 2 ) Q ^ Y 2 | Y 1 ( y 2 | y 1 ) .
Then,
2 ( α 1 ) J α ( X 1 , X 2 ; Y 1 , Y 2 ) = 2 ( α 1 ) D α ( P X 1 X 2 Y 1 Y 2 Q ^ X 1 X 2 Q ^ Y 1 Y 2 )
= x 1 , x 2 , y 1 , y 2 P X 1 Y 1 ( x 1 , y 1 ) P X 2 Y 2 ( x 2 , y 2 ) α Q ^ X 1 X 2 ( x 1 , x 2 ) Q ^ Y 1 Y 2 ( y 1 , y 2 ) 1 α
= x 1 , y 1 P X 1 Y 1 ( x 1 , y 1 ) α Q ^ X 1 ( x 1 ) Q ^ Y 1 ( y 1 ) 1 α f ( x 1 , y 1 )
x 1 , y 1 P X 1 Y 1 ( x 1 , y 1 ) α Q ^ X 1 ( x 1 ) Q ^ Y 1 ( y 1 ) 1 α f ( x 1 , y 1 )
= 2 ( α 1 ) D α ( P X 1 Y 1 Q ^ X 1 Q ^ Y 1 ) + ( α 1 ) D α ( P X 2 Y 2 Q X 2 Q Y 2 ) ,
where (193) follows from (188); (194) holds by the independence hypothesis P X 1 X 2 Y 1 Y 2 = P X 1 Y 1 P X 2 Y 2 ; (195) follows from (189); (196) follows from (190); and (197) follows from (191) and (192). Taking the logarithm and multiplying by 1 α 1 > 0 establishes (182):
J α ( X 1 , X 2 ; Y 1 , Y 2 ) D α ( P X 1 Y 1 Q ^ X 1 Q ^ Y 1 ) + D α ( P X 2 Y 2 Q X 2 Q Y 2 )
J α ( X 1 ; Y 1 ) + J α ( X 2 ; Y 2 ) ,
where (199) holds by the definition of J α ( X 1 ; Y 1 ) and J α ( X 2 ; Y 2 ) .
The proof of (182) for α ( 0 , 1 ) is essentially the same as for α > 1 : Replace the minimum in (190) by a maximum. Inequality (196) is then reversed, but (198) continues to hold because 1 α 1 < 0 . Inequality (199) also continues to hold, and (198) and (199) together imply (182).
Lemma 13.
For all α [ 0 , ] , J α ( X ; Y ) log | X | with equality if and only if ( α [ 1 2 , ] , X is distributed uniformly over X , and H ( X | Y ) = 0 ) .
Proof. 
Throughout the proof, define X X . We first show that J α ( X ; Y ) log | X | for all α [ 0 , ] :
J α ( X ; Y ) J α ( X ; X )
J ( X ; X )
= H 1 2 ( X )
log | X | ,
where (200) follows from the data-processing inequality (Lemma 4) because X X Y form a Markov chain; (201) holds because J α ( X ; X ) is nondecreasing in α (Lemma 9); (202) follows from Lemma 11; and (203) follows from Proposition 3.
We now show that (200)–(203) can hold with equality only if the following conditions all hold:
(1)
α [ 1 2 , ] ;
(ii)
X is distributed uniformly over X ; and
(iii)
H ( X | Y ) = 0 , i.e., for every y supp ( P Y ) , there exists an x X for which P ( x | y ) = 1 .
Indeed, if α < 1 2 , then Lemma 11 implies that
J α ( X ; X ) = α 1 α H ( X ) .
Because α 1 α < 1 for such α ’s and because H ( X ) log | X | (Proposition 3), the RHS of (204) is strictly smaller than log | X | . This, together with (200), shows that Part (i) is a necessary condition. The necessity of Part (ii) follows from (203): if X is not distributed uniformly over X , then (203) holds with strict inequality (Proposition 3). As to the necessity of Part (iii),
J α ( X ; Y ) J ( X ; Y )
= min Q X min Q Y D ( P X Y Q X Q Y )
= min Q X log y max x P ( x , y ) Q X ( x )
log y max x P ( y ) P ( x | y ) 1 / | X |
= log | X | + log y P ( y ) max x P ( x | y )
log | X | ,
where (205) holds because J α ( X ; Y ) is nondecreasing in α (Lemma 9); (207) follows from Proposition 9; and (208) follows from choosing Q X to be the uniform distribution. The inequality (210) is strict when Part (iii) does not hold, so Part (iii) is a necessary condition.
It remains to show that when Parts (i)–(iii) all hold, J α ( X ; Y ) = log | X | . By (203), J α ( X ; Y ) log | X | always holds, so it suffices to show that Parts (i)–(iii) together imply J α ( X ; Y ) log | X | . Indeed,
J α ( X ; Y ) J 1 2 ( X ; Y )
J 1 2 ( X ; X )
= H ( X )
= log | X | ,
where (211) holds because Part (i) implies that α 1 2 and because J α ( X ; Y ) is nondecreasing in α (Lemma 9); (212) follows from the data-processing inequality (Lemma 4) because Part (iii) implies that X Y X form a Markov chain; (213) follows from Lemma 11; and (214) follows from Part (ii). □
Lemma 14.
For every α [ 1 , ] , J α ( X ; Y ) is concave in P X for fixed P Y | X .
Proof. 
We prove the claim for α ( 1 , ) ; for α { 1 , } the claim will then hold because J α ( X ; Y ) is continuous in α (Lemma 10).
Fix α ( 1 , ) . Let λ , λ [ 0 , 1 ] with λ + λ = 1 , let P X and P X be PMFs, let P Y | X be a conditional PMF, and define f : X × P ( Y ) R { } as
f ( x , Q Y ) [ y P Y | X ( y | x ) α Q Y ( y ) 1 α ] 1 α .
Denoting J α ( X ; Y ) by J α ( P X P Y | X ) ,
J α ( λ P X + λ P X ) P Y | X
= min Q Y min Q X D α ( λ P X + λ P X ) P Y | X Q X Q Y
= min Q Y α α 1 log x [ y [ λ P X ( x ) + λ P X ( x ) ] α P Y | X ( y | x ) α Q Y ( y ) 1 α ] 1 α
= min Q Y α α 1 log x [ λ P X ( x ) + λ P X ( x ) ] [ y P Y | X ( y | x ) α Q Y ( y ) 1 α ] 1 α
= min Q Y α α 1 log [ λ x P X ( x ) f ( x , Q Y ) + λ x P X ( x ) f ( x , Q Y ) ]
min Q Y α α 1 [ λ log x P X ( x ) f ( x , Q Y ) + λ log x P X ( x ) f ( x , Q Y ) ]
λ min Q Y α α 1 log x P X ( x ) f ( x , Q Y ) + λ min Q Y α α 1 log x P X ( x ) f ( x , Q Y )
= λ J α ( P X P Y | X ) + λ J α ( P X P Y | X ) ,
where (217) follows from Proposition 9 with the roles of Q X and Q Y swapped; (220) holds because log ( · ) is concave; (221) holds because optimizing Q Y separately cannot be worse than optimizing a common Q Y ; and (222) can be established using steps similar to (216)–(218). □
Lemma 15.
For every α [ 1 2 , ] , the mapping ( Q X , Q Y ) D α ( P X Y Q X Q Y ) is convex, i.e., for all λ , λ [ 0 , 1 ] with λ + λ = 1 , all Q X , Q X P ( X ) , and all Q Y , Q Y P ( Y ) ,
D α P X Y ( λ Q X + λ Q X ) ( λ Q Y + λ Q Y ) λ D α ( P X Y Q X Q Y ) + λ D α ( P X Y Q X Q Y ) .
For α [ 0 , 1 2 ) , the mapping need not be convex.
Proof. 
We establish (223) for α [ 1 2 , 1 ) and for α ( 1 , ) , which also establishes (223) for α { 1 , } because the Rényi divergence is continuous in α (Proposition 4). Afterwards, we provide an example where (223) is violated for all α [ 0 , 1 2 ) .
We begin with the case where α [ 1 2 , 1 ) :
2 ( α 1 ) λ D α ( P X Y Q X Q Y ) + ( α 1 ) λ D α ( P X Y Q X Q Y ) = [ x , y P ( x , y ) α [ Q X ( x ) Q Y ( y ) ] 1 α ] λ · [ x , y P ( x , y ) α [ Q X ( x ) Q Y ( y ) ] 1 α ] λ
λ x , y P ( x , y ) α [ Q X ( x ) Q Y ( y ) ] 1 α + λ x , y P ( x , y ) α [ Q X ( x ) Q Y ( y ) ] 1 α
= x , y P ( x , y ) α [ λ Q X ( x ) 1 α λ Q Y ( y ) 1 α + λ Q X ( x ) 1 α λ Q Y ( y ) 1 α ]
x , y P ( x , y ) α λ Q X ( x ) 2 ( 1 α ) + λ Q X ( x ) 2 ( 1 α ) λ Q Y ( y ) 2 ( 1 α ) + λ Q Y ( y ) 2 ( 1 α )
x , y P ( x , y ) α [ λ Q X ( x ) + λ Q X ( x ) ] 1 α λ Q Y ( y ) 2 ( 1 α ) + λ Q Y ( y ) 2 ( 1 α )
x , y P ( x , y ) α [ λ Q X ( x ) + λ Q X ( x ) ] 1 α [ λ Q Y ( y ) + λ Q Y ( y ) ] 1 α
= 2 ( α 1 ) D α ( P X Y ( λ Q X + λ Q X ) ( λ Q Y + λ Q Y ) ) ,
where (225) follows from the arithmetic mean-geometric mean inequality; (227) follows from the Cauchy–Schwarz inequality; and (228) and (229) hold because the mapping z z 2 ( 1 α ) is concave on R 0 for α [ 1 2 , 1 ) . Taking the logarithm and multiplying by 1 α 1 < 0 establishes (223).
Now, consider α ( 1 , ) . Then,
2 ( α 1 ) D α ( P X Y ( λ Q X + λ Q X ) ( λ Q Y + λ Q Y ) ) = x , y P ( x , y ) α [ λ Q X ( x ) + λ Q X ( x ) ] 1 α [ λ Q Y ( y ) + λ Q Y ( y ) ] 1 α
x , y P ( x , y ) α [ Q X ( x ) λ Q X ( x ) λ ] 1 α [ Q Y ( y ) λ Q Y ( y ) λ ] 1 α
= x , y P ( x , y ) α Q X ( x ) Q Y ( y ) ( 1 α ) λ Q X ( x ) Q Y ( y ) ( 1 α ) λ
[ x , y P ( x , y ) α Q X ( x ) Q Y ( y ) 1 α ] λ · [ x , y P ( x , y ) α Q X ( x ) Q Y ( y ) 1 α ] λ
= 2 ( α 1 ) λ D α ( P X Y Q X Q Y ) + ( α 1 ) λ D α ( P X Y Q X Q Y ) ,
where (232) follows from the arithmetic mean-geometric mean inequality and the fact that the mapping z z 1 α is decreasing on R > 0 for α > 1 , and (233) follows from Hölder’s inequality. Taking the logarithm and multiplying by 1 α 1 > 0 establishes (223).
Finally, we show that the mapping ( Q X , Q Y ) D α ( P X Y Q X Q Y ) does not need to be convex for α [ 0 , 1 2 ) . Let X be uniformly distributed over { 0 , 1 } , and let Y = X . Then, for all α [ 0 , 1 2 ) ,
D α P X Y ( 0 . 5 , 0 . 5 ) ( 0 . 5 , 0 . 5 ) > 0 . 5 D α P X Y ( 1 , 0 ) ( 1 , 0 ) + 0 . 5 D α P X Y ( 0 , 1 ) ( 0 , 1 ) ,
because the LHS of (236) is equal to log 2 , and the RHS of (236) is equal to α 1 α log 2 . □
Lemma 16.
Let α ( 0 , 1 ) ( 1 , ) . If ( Q X * , Q Y * ) achieves the minimum in the definition of J α ( X ; Y ) , then there exist positive normalization constants c and d such that
Q X * ( x ) = c [ y P ( x , y ) α Q Y * ( y ) 1 α ] 1 α x X ,
Q Y * ( y ) = d [ x P ( x , y ) α Q X * ( x ) 1 α ] 1 α y Y ,
with the conventions of (44). The case α = is similar: if ( Q X * , Q Y * ) achieves the minimum in the definition of J ( X ; Y ) , then there exist positive normalization constants c and d such that
Q X * ( x ) = c max y P ( x , y ) Q Y * ( y ) x X ,
Q Y * ( y ) = d max x P ( x , y ) Q X * ( x ) y Y ,
with the conventions of (44). (If α = 1 , then Q X * = P X and Q Y * = P Y by Proposition 8.) Thus, for all α ( 0 , ] , both inclusions supp ( Q X * ) supp ( P X ) and supp ( Q Y * ) supp ( P Y ) hold.
Proof. 
If ( Q X * , Q Y * ) achieves the minimum in the definition of J α ( X ; Y ) , then
min Q Y D α ( P X Y Q X * Q Y ) = D α ( P X Y Q X * Q Y * ) .
Hence, (238) and (240) follow from (76) and (78) of Proposition 9 because D α ( P X Y Q X * Q Y * ) = J α ( X ; Y ) is finite. Swapping the roles of Q X and Q Y establishes (237) and (239). For α ( 0 , 1 ) ( 1 , ) the claimed inclusions follow from (237) and (238); for α = from (239) and (240); and for α = 1 from Proposition 8. □
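The fixed-point equations (237) and (238) also suggest a simple numerical procedure: alternate the two updates until they stabilize. The following Python sketch is ours, not part of the paper; it assumes the joint PMF is given as a NumPy array whose rows and columns all carry positive probability. For α ≥ 1/2, where the objective is convex by Lemma 15, the iteration can be expected to approach J α ( X ; Y ) ; for smaller α it may only return a local minimum.

import numpy as np

def J_alpha(P, alpha, iters=500):
    """Approximate J_alpha(X;Y) by alternating the fixed-point updates
    (237) and (238); P is the |X| x |Y| joint PMF, logarithms are natural."""
    assert alpha > 0 and alpha != 1
    Qx, Qy = P.sum(axis=1), P.sum(axis=0)                # start at the marginals
    for _ in range(iters):
        Qx = (P**alpha @ Qy**(1 - alpha))**(1 / alpha)   # update (237)
        Qx /= Qx.sum()
        Qy = (Qx**(1 - alpha) @ P**alpha)**(1 / alpha)   # update (238)
        Qy /= Qy.sum()
    s = (P**alpha * np.outer(Qx, Qy)**(1 - alpha)).sum()
    return np.log(s) / (alpha - 1)                       # D_alpha(P_XY, Qx x Qy)

For instance, the pair of Figure 1 corresponds to the call J_alpha(np.array([[0.8, 0.0], [0.0, 0.2]]), 2.0).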
Lemma 17.
For all α ( 0 , ] ,
J α ( X ; Y ) = min Q X ϕ α ( Q X ) ,
where ϕ α ( Q X ) is defined as
ϕ α ( Q X ) min Q Y D α ( P X Y Q X Q Y )
and is given explicitly as follows: for α ( 0 , 1 ) ( 1 , ) ,
ϕ α ( Q X ) = α α 1 log y [ x P ( x , y ) α Q X ( x ) 1 α ] 1 α ,
with the conventions of (44); and for α { 1 , } ,
ϕ 1 ( Q X ) = D ( P X Y Q X P Y ) ,
ϕ ( Q X ) = log y max x P ( x , y ) Q X ( x ) ,
with the conventions of (44). For every α [ 1 2 , ] , the mapping Q X ϕ α ( Q X ) is convex. For α ( 0 , 1 2 ) , the mapping need not be convex.
Proof. 
We first establish (242) and (244)–(246): (242) follows from the definition of J α ( X ; Y ) ; (244) and (246) follow from Proposition 9; and (245) holds because
min Q Y D ( P X Y Q X Q Y ) = min Q Y [ D ( P X Y Q X P Y ) + D ( P Y Q Y ) ]
= D ( P X Y Q X P Y ) ,
where (247) follows from a simple computation, and (248) holds because D ( P Y Q Y ) 0 with equality if Q Y = P Y .
We now show that the mapping Q X ϕ α ( Q X ) is convex for every α [ 1 2 , ] . To that end, let α [ 1 2 , ] , let λ , λ [ 0 , 1 ] with λ + λ = 1 , and let Q X , Q X P ( X ) . Let Q ^ Y and Q ^ Y be PMFs that achieve the minimum in the definitions of ϕ α ( Q X ) and ϕ α ( Q X ) , respectively. Then,
ϕ α ( λ Q X + λ Q X ) D α P X Y ( λ Q X + λ Q X ) ( λ Q ^ Y + λ Q ^ Y )
λ D α ( P X Y Q X Q ^ Y ) + λ D α ( P X Y Q X Q ^ Y )
= λ ϕ α ( Q X ) + λ ϕ α ( Q X ) ,
where (249) holds by the definition of ϕ α ( · ) ; (250) holds because D α ( P X Y Q X Q Y ) is convex in the pair ( Q X , Q Y ) for α [ 1 2 , ] (Lemma 15); and (251) follows from our choice of Q ^ Y and Q ^ Y .
Finally, we show that the mapping Q X ϕ α ( Q X ) need not be convex for α ( 0 , 1 2 ) . Let X be uniformly distributed over { 0 , 1 } , and let Y = X . Then, for all α ( 0 , 1 2 ) ,
ϕ α ( 0 . 5 , 0 . 5 ) > 0 . 5 ϕ α ( 1 , 0 ) + 0 . 5 ϕ α ( 0 , 1 ) ,
because the LHS of (252) is equal to log 2 , and the RHS of (252) is equal to ( α / ( 1 − α ) ) log 2 , which is smaller than log 2 for every α ∈ ( 0 , 1 2 ) . □
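As a quick numerical illustration of this counterexample (our check, not part of the paper), ϕ α can be evaluated directly from (244); with α = 0.3, the midpoint value is log 2 ≈ 0.693 while the average of the two endpoint values is ( α / ( 1 − α ) ) log 2 ≈ 0.297.

import numpy as np

def phi(P, Qx, alpha):
    """phi_alpha(Q_X) from (244); P is the joint PMF, logarithms are natural."""
    inner = (P**alpha * Qx[:, None]**(1 - alpha)).sum(axis=0)   # one term per y
    return alpha / (alpha - 1) * np.log((inner**(1 / alpha)).sum())

P = np.array([[0.5, 0.0], [0.0, 0.5]])   # X uniform over {0, 1}, Y = X
a = 0.3                                  # any alpha in (0, 1/2)
print(phi(P, np.array([0.5, 0.5]), a))                    # log 2
print(0.5 * phi(P, np.array([1.0, 0.0]), a)
      + 0.5 * phi(P, np.array([0.0, 1.0]), a))            # (a/(1-a)) log 2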
Lemma 18.
For all α ( 0 , 1 ) ( 1 , ] ,
J α ( X ; Y ) = min R X Y ∈ P ( X × Y ) ψ α ( R X Y ) if α ∈ ( 0 , 1 ) , and J α ( X ; Y ) = max R X Y ∈ P ( X × Y ) ψ α ( R X Y ) if α ∈ ( 1 , ∞ ] ,
where
ψ α ( R X Y ) ≜ D ( R X Y R X R Y ) + ( α / ( 1 − α ) ) D ( R X Y P X Y ) if α ∈ ( 0 , 1 ) ∪ ( 1 , ∞ ) , and ψ ∞ ( R X Y ) ≜ D ( R X Y R X R Y ) − D ( R X Y P X Y ) if α = ∞ .
For every α ( 1 , ] , the mapping R X Y ψ α ( R X Y ) is concave. For all α ( 1 , ] and all R X Y P ( X × Y ) , the statement J α ( X ; Y ) = ψ α ( R X Y ) is equivalent to ψ α ( R X Y ) = D α ( P X Y R X R Y ) .
Proof. 
For α ( 0 , 1 ) ( 1 , ) , (253) follows from Lemma 8 by dividing by 1 α , which is positive or negative depending on whether α is smaller than or greater than one. For α = , we establish (253) as follows: By Lemma 10, its LHS is continuous at α = . We argue below that its RHS is continuous at α = , i.e., that
lim α max R X Y ψ α ( R X Y ) = max R X Y ψ ( R X Y ) .
Because (253) holds for α ( 1 , ) and because both its sides are continuous at α = , it must also hold for α = .
We now establish (255). Let R X Y * be a PMF that achieves the maximum on the RHS of (255). Then, for all α > 1 ,
ψ ∞ ( R X Y * ) = max R X Y ψ ∞ ( R X Y )
≥ max R X Y ψ α ( R X Y )
≥ ψ α ( R X Y * ) ,
where (257) holds because, by (254), ψ ∞ ( R X Y ) = ψ α ( R X Y ) + ( 1 / ( α − 1 ) ) D ( R X Y P X Y ) ≥ ψ α ( R X Y ) for all R X Y ∈ P ( X × Y ) . By (254), α ↦ ψ α ( R X Y * ) is continuous at α = ∞ , so the RHS of (258) approaches ψ ∞ ( R X Y * ) as α tends to infinity, and (255) follows from the sandwich theorem.
We now show that R X Y ψ α ( R X Y ) is concave for α ( 1 , ] . A simple computation reveals that for all α ( 1 , ) ,
ψ α ( R X Y ) = H ( R X ) + H ( R Y ) + ( 1 / ( α − 1 ) ) H ( R X Y ) + ( α / ( α − 1 ) ) ∑ x , y R X Y ( x , y ) log P ( x , y ) .
Because the entropy is a concave function and because a nonnegative weighted sum of concave functions is concave, this implies that ψ α ( R X Y ) is concave in R X Y for α ( 1 , ) . By (254), α ψ α ( R X Y ) is continuous at α = , so ψ α ( R X Y ) is concave in R X Y also for α = .
We next show that if α ( 1 , ] and ψ α ( R X Y ) = D α ( P X Y R X R Y ) , then J α ( X ; Y ) = ψ α ( R X Y ) . Let α ( 1 , ] , and let R X Y be a PMF that satisfies ψ α ( R X Y ) = D α ( P X Y R X R Y ) . Then,
ψ α ( R X Y ) J α ( X ; Y )
D α ( P X Y R X R Y ) ,
where (260) follows from (253), and (261) holds by the definition of J α ( X ; Y ) . Because ψ α ( R X Y ) is equal to D α ( P X Y R X R Y ) , both inequalities hold with equality, which implies the claim.
Finally, we show that if α ( 1 , ] and J α ( X ; Y ) = ψ α ( R X Y ) , then ψ α ( R X Y ) = D α ( P X Y R X R Y ) . We first consider α ( 1 , ) . Let R X Y be a PMF that satisfies J α ( X ; Y ) = ψ α ( R X Y ) , and let Q X * and Q Y * be PMFs that achieve the minimum in the definition of J α ( X ; Y ) . Then,
J α ( X ; Y ) = ψ α ( R X Y )
= D ( R X Y R X R Y ) + α 1 α D ( R X Y P X Y )
D ( R X Y Q X * Q Y * ) + α 1 α D ( R X Y P X Y )
D α ( P X Y Q X * Q Y * )
= J α ( X ; Y ) ,
where (264) follows from Proposition 8, and (265) follows from [22] (Theorem 30). Thus, all inequalities hold with equality. Because (264) holds with equality, Q X * = R X and Q Y * = R Y by Proposition 8. Hence, ψ α ( R X Y ) = D α ( P X Y Q X * Q Y * ) = D α ( P X Y R X R Y ) as desired. We now consider α = . Here, (262)–(266) remain valid after replacing α 1 α by 1 . (Now, (265) follows from a short computation.) Consequently, ψ α ( R X Y ) = D α ( P X Y R X R Y ) holds also for α = .
Lemma 19.
For all α ( 0 , 1 ) ( 1 , ) ,
J α ( X ; Y ) = min R X P X 1 α 1 D α α 1 ( P X R X ) α E 0 1 α α , R X ,
where the minimization is over all PMFs R X satisfying R X P X i . e . , supp ( R X ) supp ( P X ) ; D α ( P Q ) for negative α is given by (54); and Gallager’s E 0 function [29] is defined as
E 0 ( ρ , R X ) ≜ − log ∑ y [ ∑ x R X ( x ) P ( y | x ) 1 / ( 1 + ρ ) ] 1 + ρ .
Proof. 
Let α ( 0 , 1 ) ( 1 , ) , and define the set R { R X P ( X ) : supp ( R X ) supp ( P X ) } . We establish (267) by showing that for all R X R ,
1 α 1 D α α 1 ( P X R X ) α E 0 1 α α , R X J α ( X ; Y ) ,
with equality for some R X R .
Fix R X R . If the LHS of (269) is infinite, then (269) holds trivially. Otherwise, define the PMF Q ^ X as
Q ^ X ( x ) P X ( x ) α α 1 R X ( x ) 1 1 α x X P X ( x ) α α 1 R X ( x ) 1 1 α ,
where we use the convention that 0 α α 1 · 0 1 1 α = 0 . (The RHS of (270) is finite whenever the LHS of (269) is finite.) Then, (269) holds because
J α ( X ; Y ) = min Q X α α 1 log y [ x P ( x , y ) α Q X ( x ) 1 α ] 1 α
α α 1 log y [ x P ( x , y ) α Q ^ X ( x ) 1 α ] 1 α
= log x P X ( x ) α α 1 R X ( x ) 1 1 α + α α 1 log y [ x R X ( x ) P ( y | x ) α ] 1 α
= 1 α 1 D α α 1 ( P X R X ) α E 0 1 α α , R X ,
where (271) follows from Lemma 17, and (273) follows from (270) using some algebra. It remains to show that there exists an R X R for which (272) holds with equality. To that end, let Q X * be a PMF that achieves the minimum on the RHS of (271), and define the PMF R X as
R X ( x ) P X ( x ) α Q X * ( x ) 1 α x X P X ( x ) α Q X * ( x ) 1 α ,
where we use the convention that 0 α · 0 1 α = 0 . Because supp ( Q X * ) supp ( P X ) (Lemma 16), the definitions (275) and (270) imply that Q ^ X = Q X * . Hence, (272) holds with equality for this R X R .
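Gallager's E 0 function in (268) is straightforward to evaluate numerically; the sketch below is ours, not the paper's, and it assumes the channel law P Y | X is supplied as a row-stochastic NumPy array W with W[x, y] = P ( y | x ) .

import numpy as np

def E0(rho, Rx, W):
    """Gallager's E_0(rho, R_X) from (268); logarithms are natural."""
    inner = Rx @ W**(1 / (1 + rho))          # one term per output letter y
    return -np.log((inner**(1 + rho)).sum())

# example call: E0(0.5, np.array([0.5, 0.5]), np.array([[0.9, 0.1], [0.2, 0.8]]))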
Lemma 20.
For every α ( 1 2 , ] , the mapping ( Q X , Q Y ) D α ( P X Y Q X Q Y ) has a unique minimizer. This need not be the case when α [ 0 , 1 2 ] .
Proof. 
First consider α ( 1 2 , 1 ) . Let ( Q X * , Q Y * ) and ( Q ^ X , Q ^ Y ) be pairs of PMFs that both minimize ( Q X , Q Y ) D α ( P X Y Q X Q Y ) . We establish uniqueness by arguing that ( Q X * , Q Y * ) and ( Q ^ X , Q ^ Y ) must be identical. Observe that
J α ( X ; Y ) D α P X Y ( 0 . 5 Q X * + 0 . 5 Q ^ X ) ( 0 . 5 Q Y * + 0 . 5 Q ^ Y )
0 . 5 D α ( P X Y Q X * Q Y * ) + 0 . 5 D α ( P X Y Q ^ X Q ^ Y )
= J α ( X ; Y ) ,
where (276) holds by the definition of J α ( X ; Y ) , and (277) follows from Lemma 15. Hence, (277) holds with equality, which implies that (228) in the proof of Lemma 15 holds with equality, i.e.,
x , y P ( x , y ) α 0 . 5 Q X * ( x ) 2 ( 1 α ) + 0 . 5 Q ^ X ( x ) 2 ( 1 α ) 0 . 5 Q Y * ( y ) 2 ( 1 α ) + 0 . 5 Q ^ Y ( y ) 2 ( 1 α ) = x , y P ( x , y ) α [ 0 . 5 Q X * ( x ) + 0 . 5 Q ^ X ( x ) ] 1 α 0 . 5 Q Y * ( y ) 2 ( 1 α ) + 0 . 5 Q ^ Y ( y ) 2 ( 1 α ) .
We first argue that Q X * = Q ^ X . Since Q X * and Q ^ X are PMFs, it suffices to show that Q X * ( x ) = Q ^ X ( x ) for every x supp ( Q ^ X ) . Let x ^ supp ( Q ^ X ) . Because supp ( Q ^ X ) supp ( P X ) (Lemma 16), there exists a y ^ Y such that P ( x ^ , y ^ ) > 0 . Again by Lemma 16, this implies that Q ^ Y ( y ^ ) > 0 . Because the mapping z z 2 ( 1 α ) is strictly concave on R 0 for α ( 1 2 , 1 ) , it follows from (279) that Q X * ( x ^ ) = Q ^ X ( x ^ ) . Swapping the roles of Q X and Q Y , we obtain that Q Y * = Q ^ Y .
For α = 1 , the minimizer is unique by Proposition 8 because D 1 ( P X Y Q X Q Y ) = D ( P X Y Q X Q Y ) .
Now consider α ( 1 , ] . Here, we establish uniqueness via the characterization of J α ( X ; Y ) provided by Lemma 18. Let ψ α ( R X Y ) be defined as in Lemma 18. Let R X Y be a PMF that satisfies J α ( X ; Y ) = ψ α ( R X Y ) , and let ( Q X * , Q Y * ) be a pair of PMFs that minimizes ( Q X , Q Y ) D α ( P X Y Q X Q Y ) . If α ( 1 , ) , then (264) in the proof of Lemma 18 holds with equality, i.e.,
D ( R X Y R X R Y ) + α 1 α D ( R X Y P X Y ) = D ( R X Y Q X * Q Y * ) + α 1 α D ( R X Y P X Y ) .
Because the LHS of (280) is finite, Proposition 8 implies that Q X * = R X and Q Y * = R Y , thus the minimizer is unique. As shown in the proof of Lemma 18, (280) remains valid for α = after replacing α 1 α by 1 , thus the same argument establishes the uniqueness for α = .
Finally, we show that, for α [ 0 , 1 2 ] , the mapping ( Q X , Q Y ) D α ( P X Y Q X Q Y ) can have more than one minimizer. Let X be uniformly distributed over { 0 , 1 } , and let Y = X . Then, for all α [ 0 , 1 2 ] ,
J α ( X ; Y ) = ( α / ( 1 − α ) ) log 2
= D α P X Y ( 1 , 0 ) ( 1 , 0 )
= D α P X Y ( 0 , 1 ) ( 0 , 1 ) ,
where (281) follows from Lemma 11. □
Lemma 21.
For every α [ 0 , ] , the minimum in the definition of K α ( X ; Y ) in (2) exists and is finite.
Proof. 
Let α [ 0 , ] , and denote by U X and U Y the uniform distribution over X and Y , respectively. Then inf Q X , Q Y Δ α ( P X Y Q X Q Y ) is finite because Δ α ( P X Y U X U Y ) is finite and because the relative α -entropy is nonnegative (Proposition 5). For α ( 0 , ) , the minimum exists because the set P ( X ) × P ( Y ) is compact and the mapping ( Q X , Q Y ) Δ α ( P X Y Q X Q Y ) is continuous. For α { 0 , } , the minimum exists because ( Q X , Q Y ) Δ α ( P X Y Q X Q Y ) takes on only a finite number of values: if α = 0 , then Δ α ( P X Y Q X Q Y ) depends on Q X Q Y only via supp ( Q X Q Y ) X × Y ; and if α = , then Δ α ( P X Y Q X Q Y ) depends on Q X Q Y only via argmax ( Q X Q Y ) X × Y . □
Lemma 22.
For all α ∈ [ 0 , ∞ ] , K α ( X ; Y ) ≥ 0 . If α ∈ ( 0 , ∞ ) , then K α ( X ; Y ) = 0 if and only if X and Y are independent (nonnegativity).
Proof. 
The nonnegativity follows from the definition of K α ( X ; Y ) because the relative α -entropy is nonnegative for α [ 0 , ] (Proposition 5). If X and Y are independent, then P X Y = P X P Y , and the choice Q X = P X and Q Y = P Y in the definition of K α ( X ; Y ) achieves K α ( X ; Y ) = 0 . Conversely, if K α ( X ; Y ) = 0 , then there exist PMFs Q X * and Q Y * satisfying Δ α ( P X Y Q X * Q Y * ) = 0 . If, in addition, α ( 0 , ) , then P X Y = Q X * Q Y * by Proposition 5, and hence X and Y are independent. □
Lemma 23.
For all α [ 0 , ] , K α ( X ; Y ) = K α ( Y ; X ) (symmetry).
Proof. 
The definition of K α ( X ; Y ) is symmetric in X and Y. □
Lemma 24.
For all α ( 0 , ) ,
K α ( X ; Y ) + H α ( X , Y ) = min Q X , Q Y log M ( α − 1 ) / α ( Q X , Q Y ) ,
where M β ( Q X , Q Y ) is the following weighted power mean [30] (Chapter III): For β ∈ R ∖ { 0 } ,
M β ( Q X , Q Y ) ≜ [ ∑ x , y P ( x , y ) [ Q X ( x ) Q Y ( y ) ] β ] − 1 / β ,
where for β < 0 , we read P ( x , y ) [ Q X ( x ) Q Y ( y ) ] β as P ( x , y ) / [ Q X ( x ) Q Y ( y ) ] − β and use the conventions (44); and for β = 0 , using the convention 0 0 = 1 ,
M 0 ( Q X , Q Y ) ≜ ∏ x , y [ Q X ( x ) Q Y ( y ) ] − P ( x , y ) .
Proof. 
Let α ( 0 , ) , and define the PMF P ˜ X Y as
P ˜ X Y ( x , y ) P X Y ( x , y ) α ( x , y ) X × Y P X Y ( x , y ) α .
Then,
K α ( X ; Y ) = J 1 α ( X ˜ ; Y ˜ )
= min Q X , Q Y D 1 α ( P ˜ X Y Q X Q Y ) ,
where (288) follows from Proposition 7, and (289) follows from the definition of J 1 / α ( X ˜ ; Y ˜ ) . A simple computation reveals that for all PMFs Q X and Q Y ,
D 1 α ( P ˜ X Y Q X Q Y ) = log M α 1 α ( Q X , Q Y ) H α ( X , Y ) .
Hence, (284) follows from (289) and (290). □
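The proof's first step (288), namely K α ( X ; Y ) = J 1 / α ( X ˜ ; Y ˜ ) for the tilted PMF (287), also gives a direct way to compute K α numerically for α ∈ ( 0 , 1 ) ∪ ( 1 , ∞ ) by reusing a routine for J, such as the J_alpha sketch given after Lemma 16. The following lines are ours, not the paper's.

import numpy as np

def K_alpha(P, alpha, iters=500):
    """Approximate K_alpha(X;Y) via Proposition 7:
    K_alpha(X;Y) = J_{1/alpha}(X~;Y~) for the tilted PMF of (287)."""
    P_tilt = P**alpha
    P_tilt = P_tilt / P_tilt.sum()           # P~(x,y) proportional to P(x,y)^alpha
    return J_alpha(P_tilt, 1 / alpha, iters=iters)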
Lemma 25.
For α = 0 ,
K 0 ( X ; Y ) = log ( | supp ( P X P Y ) | / | supp ( P X Y ) | )
≤ min Q X , Q Y log max ( x , y ) ∈ supp ( P X Y ) [ 1 / ( Q X ( x ) Q Y ( y ) ) ] − log | supp ( P X Y ) |
= lim α ↓ 0 K α ( X ; Y ) ,
where in the RHS of (292), we use the conventions (44). The inequality can be strict, so α K α ( X ; Y ) need not be continuous at α = 0 .
Proof. 
We first prove (291). Recall that
Δ 0 ( P X Y Q X Q Y ) = log ( | supp ( Q X Q Y ) | / | supp ( P X Y ) | ) if supp ( P X Y ) ⊆ supp ( Q X Q Y ) , and ∞ otherwise .
Observe that Δ 0 ( P X Y Q X Q Y ) is finite only if supp ( P X ) ⊆ supp ( Q X ) and supp ( P Y ) ⊆ supp ( Q Y ) . For such PMFs Q X and Q Y , we have | supp ( Q X Q Y ) | ≥ | supp ( P X P Y ) | . Thus, for all PMFs Q X and Q Y ,
Δ 0 ( P X Y Q X Q Y ) ≥ log ( | supp ( P X P Y ) | / | supp ( P X Y ) | ) .
Choosing Q X = P X and Q Y = P Y achieves equality in (295), which establishes (291).
We now show (292). Let Q X and Q Y be the uniform distributions over supp ( P X ) and supp ( P Y ) , respectively. Then,
log max ( x , y ) ∈ supp ( P X Y ) [ 1 / ( Q X ( x ) Q Y ( y ) ) ] − log | supp ( P X Y ) | = log ( | supp ( P X P Y ) | / | supp ( P X Y ) | ) ,
and hence (292) holds.
We next establish (293). To that end, define
τ min ( x , y ) supp ( P X Y ) P ( x , y ) .
We bound K α ( X ; Y ) + H α ( X , Y ) as follows: For all α ( 0 , 1 ) ,
K α ( X ; Y ) + H α ( X , Y ) = min Q X , Q Y α 1 α log x , y P ( x , y ) [ Q X ( x ) Q Y ( y ) ] α 1 α
min Q X , Q Y α 1 α log ( x , y ) supp ( P X Y ) τ [ Q X ( x ) Q Y ( y ) ] α 1 α
min Q X , Q Y α 1 α log max ( x , y ) supp ( P X Y ) τ [ Q X ( x ) Q Y ( y ) ] α 1 α
= min Q X , Q Y log max ( x , y ) supp ( P X Y ) 1 Q X ( x ) Q Y ( y ) α 1 α log 1 τ ,
where (298) follows from Lemma 24. Similarly, for all α ( 0 , 1 ) ,
K α ( X ; Y ) + H α ( X , Y ) = min Q X , Q Y α 1 α log x , y P ( x , y ) [ Q X ( x ) Q Y ( y ) ] α 1 α
min Q X , Q Y α 1 α log max ( x , y ) supp ( P X Y ) [ Q X ( x ) Q Y ( y ) ] α 1 α
= min Q X , Q Y log max ( x , y ) supp ( P X Y ) 1 Q X ( x ) Q Y ( y ) ,
where (302) is the same as (298). Now (293) follows from (301), (304), and the sandwich theorem because lim α 0 α 1 α log 1 τ = 0 and because lim α 0 H α ( X , Y ) = log | supp ( P X Y ) | (Proposition 3).
Finally, we provide an example for which (292) holds with strict inequality. Let X = { 1 , 2 , 3 } , let Y = { 1 , 2 } , and let ( X , Y ) be uniformly distributed over { ( 1 , 1 ) , ( 2 , 2 ) , ( 3 , 1 ) } . The LHS of (292) then equals log 2 . Using
Q X ( x ) ≜ 0.28 if x ∈ { 1 , 3 } and 0.44 if x = 2 ,
Q Y ( y ) ≜ 0.60 if y = 1 and 0.40 if y = 2 ,
we see that the RHS of (292) is upper bounded by log ( 5.952 / 3 ) , which is smaller than log 2 . □
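The bound claimed for this example is easy to check numerically; the following lines are our verification, not part of the paper.

import numpy as np

support = [(1, 1), (2, 2), (3, 1)]                     # supp(P_XY), P uniform on it
Qx = {1: 0.28, 2: 0.44, 3: 0.28}
Qy = {1: 0.60, 2: 0.40}
bound = max(np.log(1 / (Qx[x] * Qy[y])) for x, y in support) - np.log(3)
print(bound, np.log(2))                                # about 0.685 < 0.693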
Lemma 26.
K 1 ( X ; Y ) = I ( X ; Y ) .
Proof. 
The claim follows from Proposition 8 because Δ 1 ( P X Y Q X Q Y ) in the definition of K 1 ( X ; Y ) is equal to D ( P X Y Q X Q Y ) . □
Lemma 27.
Let f : { 1 , , | X | } X and g : { 1 , , | Y | } Y be bijective functions, and let B be the | X | × | Y | matrix whose Row-i Column-j entry B i , j equals P X Y ( f ( i ) , g ( j ) ) . Then,
K 2 ( X ; Y ) = − 2 log σ 1 ( B ) − H 2 ( X , Y ) ,
where σ 1 ( B ) denotes the largest singular value of B . (Because the singular values of a matrix are invariant under row and column permutations, the result does not depend on f or g.)
Proof. 
Let ( X ˜ , Y ˜ ) be distributed according to the joint PMF
P ˜ X Y ( x , y ) ≜ [ β P X Y ( x , y ) ] 2 ,
where
β ≜ [ ∑ x , y P X Y ( x , y ) 2 ] − 1 / 2 .
Then,
K 2 ( X ; Y ) = J 1 / 2 ( X ˜ ; Y ˜ )
= − 2 log σ 1 ( β B )
= − 2 log ( β σ 1 ( B ) )
= − 2 log σ 1 ( B ) − H 2 ( X , Y ) ,
where (310) follows from Proposition 7; (311) follows from Lemma 6 and (308); (312) holds because β > 0 ; and (313) follows from the definition of H 2 ( X , Y ) . □
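In code, Lemma 27 reduces the computation of K 2 to a singular-value decomposition. The sketch below is ours; it assumes the joint PMF is supplied as the matrix B of the lemma and uses natural logarithms.

import numpy as np

def K2(P):
    """K_2(X;Y) = -2 log sigma_1(B) - H_2(X,Y) as in Lemma 27; natural logs."""
    sigma1 = np.linalg.svd(P, compute_uv=False)[0]   # largest singular value
    H2 = -np.log((P**2).sum())                       # Renyi entropy of order 2 of (X,Y)
    return -2 * np.log(sigma1) - H2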
Lemma 28.
K ∞ ( X ; Y ) = 0 .
Proof. 
Let the pair ( x ^ , y ^ ) be such that P ( x ^ , y ^ ) = max x , y P ( x , y ) , and define the PMFs Q ^ X and Q ^ Y as Q ^ X ( x ) = 𝟙 { x = x ^ } and Q ^ Y ( y ) = 𝟙 { y = y ^ } . Then, Δ ∞ ( P X Y Q ^ X Q ^ Y ) = 0 , so K ∞ ( X ; Y ) ≤ 0 . Because K ∞ ( X ; Y ) ≥ 0 (Lemma 22), this implies K ∞ ( X ; Y ) = 0 . □
Lemma 29.
The mapping α K α ( X ; Y ) need not be monotonic on [ 0 , ] .
Proof. 
Let P X Y be such that supp ( P X Y ) = X × Y and I ( X ; Y ) > 0 . Then,
K 0 ( X ; Y ) = 0 ,
K 1 ( X ; Y ) > 0 ,
K ( X ; Y ) = 0 ,
which follow from Lemmas 25, 26, and 28, respectively. Thus, α K α ( X ; Y ) is not monotonic on [ 0 , ] . □
Lemma 30.
The mapping α K α ( X ; Y ) + H α ( X , Y ) is nonincreasing on [ 0 , ] .
Proof. 
We first show the monotonicity for α ( 0 , ) . To that end, let α , α ( 0 , ) with α α , and let M β ( Q X , Q Y ) be defined as in (285) and (286). Then, for all PMFs Q X and Q Y ,
M α 1 α ( Q X , Q Y ) M α 1 α ( Q X , Q Y ) ,
which follows from the power mean inequality [30] (III 3.1.1 Theorem 1) because α 1 α α 1 α . Hence,
K α ( X ; Y ) + H α ( X , Y ) = min Q X , Q Y log M α 1 α ( Q X , Q Y )
min Q X , Q Y log M α 1 α ( Q X , Q Y )
= K α ( X ; Y ) + H α ( X , Y ) ,
where (318) and (320) follow from Lemma 24, and (319) follows from (317).
The monotonicity extends to α = 0 because
K 0 ( X ; Y ) + H 0 ( X , Y ) lim α 0 K α ( X ; Y ) + H 0 ( X , Y )
= lim α 0 [ K α ( X ; Y ) + H α ( X , Y ) ] ,
where (321) follows from Lemma 25, and (322) holds because α H α ( X , Y ) is continuous at α = 0 (Proposition 3).
The monotonicity extends to α = because for all α ( 0 , ) ,
K α ( X ; Y ) + H α ( X , Y ) H α ( X , Y )
H ( X , Y )
= K ( X ; Y ) + H ( X , Y ) ,
where (323) holds because K α ( X ; Y ) 0 (Lemma 22); (324) holds because H α ( X , Y ) is nonincreasing in α (Proposition 3); and (325) holds because K ( X ; Y ) = 0 (Lemma 28). □
Lemma 31.
The mapping α K α ( X ; Y ) is continuous on ( 0 , ] . (See Lemma 25 for the behavior at α = 0 .)
Proof. 
Because α H α ( X , Y ) is continuous on [ 0 , ] (Proposition 3), it suffices to show that the mapping α K α ( X ; Y ) + H α ( X , Y ) is continuous on ( 0 , ] . We first show that it is continuous on ( 0 , 1 ) ( 1 , ) by showing that α 1 1 α K α ( X ; Y ) + H α ( X , Y ) is concave and hence continuous on ( 0 , ) . For a fixed α ( 0 , ) , let ( X ˜ , Y ˜ ) be distributed according to the joint PMF
P ˜ X Y ( x , y ) P X Y ( x , y ) α ( x , y ) X × Y P X Y ( x , y ) α .
Then, for all α ( 0 , ) ,
1 1 α K α ( X ; Y ) + H α ( X , Y ) = 1 1 α J 1 α ( X ˜ ; Y ˜ ) + 1 1 α H α ( X , Y )
= min R X Y 1 1 α D ( R X Y R X R Y ) + 1 α D ( R X Y P ˜ X Y ) + 1 1 α H α ( X , Y )
= min R X Y 1 1 α D ( R X Y R X R Y ) + 1 1 α H ( R X Y ) + D ( R X Y P X Y ) ,
where (327) follows from Proposition 7; (328) follows from Lemma 8; and (329) follows from a short computation. For every R X Y P ( X × Y ) , the expression in square brackets on the RHS of (329) is concave in α because the mapping α 1 1 α is concave on ( 0 , ) and because D ( R X Y R X R Y ) and H ( R X Y ) are nonnegative. The pointwise minimum preserves the concavity, thus the LHS of (327) is concave in α and hence continuous in α ( 0 , ) . This implies that α K α ( X ; Y ) + H α ( X , Y ) and hence α K α ( X ; Y ) is continuous on ( 0 , 1 ) ( 1 , ) .
We now establish continuity at α = . Let ( x ^ , y ^ ) be such that P ( x ^ , y ^ ) = max x , y P ( x , y ) ; define the PMFs Q ^ X and Q ^ Y as Q ^ X ( x ) 𝟙 { x = x ^ } and Q ^ Y ( y ) 𝟙 { y = y ^ } ; and let M β ( Q X , Q Y ) be defined as in (285). Then, for all α ( 1 , ) ,
K ( X ; Y ) + H ( X , Y ) K α ( X ; Y ) + H α ( X , Y )
log M α 1 α ( Q ^ X , Q ^ Y )
= α α 1 H ( X , Y )
= K ( X ; Y ) + α α 1 H ( X , Y ) ,
where (330) holds because K α ( X ; Y ) + H α ( X , Y ) is nonincreasing in α (Lemma 30); (331) follows from Lemma 24; (332) follows from the definitions of M β ( Q X , Q Y ) in (285) and H ( X , Y ) in (46); and (333) holds because K ( X ; Y ) = 0 (Lemma 28). Because lim α α α 1 = 1 , (330)–(333) and the sandwich theorem imply that α K α ( X ; Y ) + H α ( X , Y ) is continuous at α = . This and the continuity of α H α ( X , Y ) at α = (Proposition 3) establish the continuity of α K α ( X ; Y ) at α = .
It remains to show the continuity at α = 1 . Let α ( 4 5 , 1 ) ( 1 , 4 3 ) , and define δ | α 1 | α ( 0 , 1 4 ) . (These definitions ensure that on the RHS of (340) ahead, 1 4 δ will be positive.) Let M β ( Q X , Q Y ) be defined as in (285) and (286). Then, for all PMFs Q X and Q Y ,
M α 1 α ( Q X , Q Y ) M δ ( Q X , Q Y )
= [ x , y P ( x , y ) [ P X ( x ) P Y ( y ) ] δ [ Q X ( x ) Q Y ( y ) P X ( x ) P Y ( y ) ] δ ] 1 δ
[ x , y P ( x , y ) [ P X ( x ) P Y ( y ) ] 2 δ ] 1 2 δ · [ x , y P ( x , y ) [ Q X ( x ) Q Y ( y ) P X ( x ) P Y ( y ) ] 2 δ ] 1 2 δ
[ x , y P ( x , y ) [ P X ( x ) P Y ( y ) ] 2 δ ] 1 2 δ
= M 2 δ ( P X , P Y ) ,
where (334) follows from the power mean inequality [30] (III 3.1.1 Theorem 1) because α 1 α δ ; (336) follows from the Cauchy–Schwarz inequality; and (337) holds because
[ x , y P ( x , y ) [ Q X ( x ) P X ( x ) ] 2 δ [ Q Y ( y ) P Y ( y ) ] 2 δ ] 1 2 δ [ x P X ( x ) [ Q X ( x ) P X ( x ) ] 4 δ ] 1 4 δ · [ y P Y ( y ) [ Q Y ( y ) P Y ( y ) ] 4 δ ] 1 4 δ
= 2 D 1 4 δ ( P X Q X ) · 2 D 1 4 δ ( P Y Q Y )
1 ,
where (339) follows from the Cauchy–Schwarz inequality, and (341) holds because 1 4 δ > 0 and because the Rényi divergence is nonnegative for positive orders (Proposition 4). Thus, for all α ( 4 5 , 4 3 ) ,
log M 2 | α 1 | α ( P X , P Y ) min Q X , Q Y log M α 1 α ( Q X , Q Y )
log M α 1 α ( P X , P Y ) ,
where (342) follows from (338) if α 1 and from Proposition 8 and a simple computation if α = 1 . By Lemma 24, this implies that for all α ( 4 5 , 4 3 ) ,
log M 2 | α 1 | α ( P X , P Y ) K α ( X ; Y ) + H α ( X , Y )
log M α 1 α ( P X , P Y ) .
Because β M β ( P X , P Y ) is continuous at β = 0 [30] (III 1 Theorem 2(b)), (344)–(345) and the sandwich theorem imply that α K α ( X ; Y ) + H α ( X , Y ) is continuous at α = 1 . This and the continuity of α H α ( X , Y ) at α = 1 (Proposition 3) establish the continuity of α K α ( X ; Y ) at α = 1 . □
Lemma 32.
If X = Y with probability one, then
K α ( X ; Y ) = 2 H α / ( 2 − α ) ( X ) − H α ( X ) if α ∈ [ 0 , 2 ) ; ( α / ( α − 1 ) ) H ∞ ( X ) − H α ( X ) if α ∈ [ 2 , ∞ ) ; and 0 if α = ∞ .
Proof. 
We first treat the cases α = 0 , α = 1 , and α = . For α = 0 , (346) holds because
K 0 ( X ; Y ) = log | supp ( P X P Y ) | | supp ( P X Y ) |
= log | supp ( P X ) |
= H 0 ( X ) ,
where (347) follows from Lemma 25, and (348) holds because the hypothesis Pr [ X = Y ] = 1 implies that | supp ( P X P Y ) | = | supp ( P X ) | 2 and | supp ( P X Y ) | = | supp ( P X ) | . For α = 1 , (346) holds because K 1 ( X ; Y ) = I ( X ; Y ) (Lemma 26) and because Pr [ X = Y ] = 1 implies that I ( X ; Y ) = H ( X ) = H 1 ( X ) . For α = , (346) holds because K ( X ; Y ) = 0 (Lemma 28).
Now let α ( 0 , 1 ) ( 1 , ) , and let ( X ˜ , Y ˜ ) be distributed according to the joint PMF
P ˜ X Y ( x , y ) P X Y ( x , y ) α ( x , y ) X × Y P X Y ( x , y ) α
= P X ( x ) α x X P X ( x ) α 𝟙 { x = y } ,
where (351) holds because P X Y ( x , y ) = P X ( x ) 𝟙 { x = y } for all x X and all y Y . If α < 2 , then (346) holds because
K α ( X ; Y ) = J 1 α ( X ˜ ; Y ˜ )
= H 1 2 α ( X ˜ )
= 2 α 1 α log x [ P X ( x ) α x X P X ( x ) α ] 1 2 α
= 2 H α 2 α ( X ) H α ( X ) ,
where (352) follows from Proposition 7; (353) follows from Lemma 11 because Pr [ X ˜ = Y ˜ ] = 1 and because 1 α > 1 2 ; and (355) follows from a simple computation. If α 2 , then (346) holds because
K α ( X ; Y ) = J 1 α ( X ˜ ; Y ˜ )
= 1 α 1 H ( X ˜ )
= 1 α 1 log max x P X ( x ) α x X P X ( x ) α
= α α 1 H ( X ) H α ( X ) ,
where (356) follows from Proposition 7; (357) follows from Lemma 11 because Pr [ X ˜ = Y ˜ ] = 1 and because 1 α 1 2 ; and (359) follows from a simple computation. □
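For α ∈ ( 0 , 1 ) ∪ ( 1 , 2 ) , the closed form of Lemma 32 can be checked against the general-purpose routines sketched earlier (J_alpha after Lemma 16 and K_alpha after Lemma 24); the snippet below is our check under those assumptions, not part of the paper.

import numpy as np

def renyi_H(p, a):
    """Rényi entropy of order a of the PMF p (a != 1); natural logarithms."""
    return np.log((p**a).sum()) / (1 - a)

px = np.array([0.7, 0.2, 0.1])
P = np.diag(px)                          # X = Y with probability one
a = 1.5                                  # any alpha in (0, 2) other than 1
closed_form = 2 * renyi_H(px, a / (2 - a)) - renyi_H(px, a)
print(closed_form, K_alpha(P, a))        # the two values should agree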
Lemma 33.
For every α ( 0 , 2 ) , the mapping ( Q X , Q Y ) Δ α ( P X Y Q X Q Y ) in the definition of K α ( X ; Y ) in (2) has a unique minimizer. This need not be the case when α { 0 } [ 2 , ] .
Proof. 
Let α ( 0 , 2 ) . By Proposition 7, K α ( X ; Y ) = J 1 / α ( X ˜ ; Y ˜ ) , where the pair ( X ˜ , Y ˜ ) is distributed according to the joint PMF P ˜ X Y defined in Proposition 7. The mapping ( Q X , Q Y ) D 1 / α ( P ˜ X Y Q X Q Y ) in the definition of J 1 / α ( X ˜ ; Y ˜ ) has a unique minimizer by Lemma 20 because 1 α > 1 2 . By Proposition 6, there is a bijection between the minimizers of D 1 / α ( P ˜ X Y Q X Q Y ) and Δ α ( P X Y Q X Q Y ) , so the mapping ( Q X , Q Y ) Δ α ( P X Y Q X Q Y ) also has a unique minimizer.
We next show that for α { 0 } [ 2 , ] , the mapping ( Q X , Q Y ) Δ α ( P X Y Q X Q Y ) can have more than one minimizer. Let X be uniformly distributed over { 0 , 1 } , and let Y = X . Then, by Lemma 32,
K α ( X ; Y ) = log 2 if α = 0 ; ( 1 / ( α − 1 ) ) log 2 if α ∈ [ 2 , ∞ ) ; and 0 if α = ∞ .
If α = 0 , then it follows from the definition of Δ 0 ( P Q ) in (56) that Δ 0 ( P X Y Q X Q Y ) = log 2 whenever supp ( Q X ) = supp ( Q Y ) = { 0 , 1 } , so the minimizer is not unique. Otherwise, if α [ 2 , ] , it can be verified that
Δ α P X Y ( 1 , 0 ) ( 1 , 0 ) = Δ α P X Y ( 0 , 1 ) ( 0 , 1 )
= ( 1 / ( α − 1 ) ) log 2 if α ∈ [ 2 , ∞ ) , and 0 if α = ∞ ,
so the minimizer is not unique in this case either. □
Lemma 34.
If the pairs ( X 1 , Y 1 ) and ( X 2 , Y 2 ) are independent, then K α ( X 1 , X 2 ; Y 1 , Y 2 ) = K α ( X 1 ; Y 1 ) + K α ( X 2 ; Y 2 ) for all α [ 0 , ] (additivity).
Proof. 
We first treat the cases α = 0 and α = . For α = 0 , the claim is true because
K 0 ( X 1 , X 2 ; Y 1 , Y 2 ) = log | supp ( P X 1 X 2 P Y 1 Y 2 ) | | supp ( P X 1 X 2 Y 1 Y 2 ) |
= log | supp ( P X 1 P Y 1 ) | · | supp ( P X 2 P Y 2 ) | | supp ( P X 1 Y 1 ) | · | supp ( P X 2 Y 2 ) |
= K 0 ( X 1 ; Y 1 ) + K 0 ( X 2 ; Y 2 ) ,
where (363) and (365) follow from Lemma 25, and (364) follows from the independence hypothesis P X 1 X 2 Y 1 Y 2 = P X 1 Y 1 P X 2 Y 2 . For α = , the claim is true because K ( X ; Y ) = 0 (Lemma 28).
Now let α ( 0 , ) , and let ( X ˜ 1 , X ˜ 2 , Y ˜ 1 , Y ˜ 2 ) be distributed according to the joint PMF
P ˜ X 1 X 2 Y 1 Y 2 ( x 1 , x 2 , y 1 , y 2 ) P X 1 X 2 Y 1 Y 2 ( x 1 , x 2 , y 1 , y 2 ) α x 1 , x 2 , y 1 , y 2 P X 1 X 2 Y 1 Y 2 ( x 1 , x 2 , y 1 , y 2 ) α
= P X 1 Y 1 ( x 1 , y 1 ) α x 1 , y 1 P X 1 Y 1 ( x 1 , y 1 ) α · P X 2 Y 2 ( x 2 , y 2 ) α x 2 , y 2 P X 2 Y 2 ( x 2 , y 2 ) α ,
where (366) follows from the independence hypothesis P X 1 X 2 Y 1 Y 2 = P X 1 Y 1 P X 2 Y 2 . Then,
K α ( X 1 , X 2 ; Y 1 , Y 2 ) = J 1 α ( X ˜ 1 , X ˜ 2 ; Y ˜ 1 , Y ˜ 2 )
= J 1 α ( X ˜ 1 ; Y ˜ 1 ) + J 1 α ( X ˜ 2 ; Y ˜ 2 )
= K α ( X 1 ; Y 1 ) + K α ( X 2 ; Y 2 ) ,
where (368) and (370) follow from Proposition 7, and (369) follows from Lemma 12 because the pairs ( X ˜ 1 , Y ˜ 1 ) and ( X ˜ 2 , Y ˜ 2 ) are independent by (367). □
Lemma 35.
For all α ∈ [ 0 , ∞ ] , K α ( X ; Y ) ≤ log | X | .
Proof. 
For α = 0 , this is true because
K 0 ( X ; Y ) = log ( | supp ( P X P Y ) | / | supp ( P X Y ) | )
≤ log ( | X | · | supp ( P Y ) | / | supp ( P X Y ) | )
≤ log | X | ,
where (371) follows from Lemma 25. For α ( 0 , ) , the claim is true because
K α ( X ; Y ) = J 1 α ( X ˜ ; Y ˜ )
≤ log | X | ,
where (374) follows from Proposition 7, and (375) follows from Lemma 13. For α = , the claim is true because K ( X ; Y ) = 0 (Lemma 28). □
Lemma 36.
There exists a Markov chain X Y Z for which K 2 ( X ; Z ) > K 2 ( X ; Y ) .
Proof. 
Let the Markov chain X Y Z be given by
P X Y ( x , y ) :     y = 0     y = 1
x = 0:                0.6       0
x = 1:                0         0.4

P Z | Y ( z | y ) :   z = 0     z = 1
y = 0:                0.9       0.1
y = 1:                0         1
Using Lemma 27, we see that K 2 ( X ; Z ) ≈ 0.605 bits, which is larger than K 2 ( X ; Y ) ≈ 0.531 bits. □
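The two values quoted above can be reproduced with the K2 routine sketched after Lemma 27; the following lines are our verification of the example, not part of the paper.

import numpy as np

P_XY = np.array([[0.6, 0.0],
                 [0.0, 0.4]])
W_ZgY = np.array([[0.9, 0.1],            # P_{Z|Y}(z|y): rows indexed by y, columns by z
                  [0.0, 1.0]])
P_XZ = P_XY @ W_ZgY                      # joint PMF of (X, Z) under the Markov chain
to_bits = 1 / np.log(2)
print(K2(P_XZ) * to_bits, K2(P_XY) * to_bits)   # about 0.605 > 0.531 (in bits)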

Author Contributions

Writing–original draft preparation, A.L. and C.P.; writing–review and editing, A.L. and C.P.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
2. Tomamichel, M.; Hayashi, M. Operational interpretation of Rényi information measures via composite hypothesis testing against product and Markov distributions. IEEE Trans. Inf. Theory 2018, 64, 1064–1082.
3. Sibson, R. Information radius. Z. Wahrscheinlichkeitstheorie verw. Geb. 1969, 14, 149–160.
4. Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Topics in Information Theory; Csiszár, I., Elias, P., Eds.; North-Holland Publishing Company: Amsterdam, The Netherlands, 1977; pp. 41–52. ISBN 0-7204-0699-4.
5. Csiszár, I. Generalized cutoff rates and Rényi's information measures. IEEE Trans. Inf. Theory 1995, 41, 26–34.
6. Fehr, S.; Berens, S. On the conditional Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 6801–6810.
7. Sason, I.; Verdú, S. Arimoto–Rényi conditional entropy and Bayesian M-ary hypothesis testing. IEEE Trans. Inf. Theory 2018, 64, 4–25.
8. Verdú, S. α-mutual information. In Proceedings of the 2015 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6.
9. Tridenski, S.; Zamir, R.; Ingber, A. The Ziv–Zakai–Rényi bound for joint source-channel coding. IEEE Trans. Inf. Theory 2015, 61, 4293–4315.
10. Aishwarya, G.; Madiman, M. Remarks on Rényi versions of conditional entropy and mutual information. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 1117–1121.
11. Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial; now Publishers: Hanover, MA, USA, 2004; ISBN 978-1-933019-05-5.
12. Liese, F.; Vajda, I. On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 2006, 52, 4394–4412.
13. Sason, I.; Verdú, S. f-divergence inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006.
14. Jiao, J.; Han, Y.; Weissman, T. Dependence measures bounding the exploration bias for general measurements. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 1475–1479.
15. Ziv, J.; Zakai, M. On functionals satisfying a data-processing theorem. IEEE Trans. Inf. Theory 1973, 19, 275–283.
16. Lapidoth, A.; Pfister, C. Testing against independence and a Rényi information measure. In Proceedings of the 2018 IEEE Information Theory Workshop (ITW), Guangzhou, China, 25–29 November 2018; pp. 1–5.
17. Han, T.S.; Kobayashi, K. The strong converse theorem for hypothesis testing. IEEE Trans. Inf. Theory 1989, 35, 178–180.
18. Nakagawa, K.; Kanaya, F. On the converse theorem in statistical hypothesis testing. IEEE Trans. Inf. Theory 1993, 39, 623–628.
19. Bunte, C.; Lapidoth, A. Encoding tasks and Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 5065–5076.
20. Bracher, A.; Lapidoth, A.; Pfister, C. Distributed task encoding. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 1993–1997.
21. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; Volume 1, pp. 547–561.
22. van Erven, T.; Harremoës, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
23. Sason, I.; Verdú, S. Improved bounds on lossless source coding and guessing moments via Rényi measures. IEEE Trans. Inf. Theory 2018, 64, 4323–4346.
24. Ashok Kumar, M.; Sundaresan, R. Minimization problems based on relative α-entropy I: Forward projection. IEEE Trans. Inf. Theory 2015, 61, 5063–5080.
25. Ashok Kumar, M.; Sundaresan, R. Minimization problems based on relative α-entropy II: Reverse projection. IEEE Trans. Inf. Theory 2015, 61, 5081–5095.
26. Sundaresan, R. Guessing under source uncertainty. IEEE Trans. Inf. Theory 2007, 53, 269–287.
27. Polyanskiy, Y.; Wu, Y. Lecture Notes on Information Theory. 2017. Available online: http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf (accessed on 18 August 2017).
28. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006; ISBN 978-0-471-24195-9.
29. Gallager, R.G. Information Theory and Reliable Communication; John Wiley & Sons: Hoboken, NJ, USA, 1968; ISBN 978-0-471-29048-3.
30. Bullen, P.S. Handbook of Means and Their Inequalities; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2003; ISBN 978-1-4020-1522-9.
31. Horn, R.A.; Johnson, C.R. Matrix Analysis, 2nd ed.; Cambridge University Press: Cambridge, UK, 2013; ISBN 978-0-521-83940-2.
32. Fan, K. Minimax theorems. Proc. Natl. Acad. Sci. USA 1953, 39, 42–47.
33. Borwein, J.M.; Zhuang, D. On Fan's minimax theorem. Math. Program. 1986, 34, 232–234.
Figure 1. (Left) J α ( X ; Y ) and D α ( P X Y P X P Y ) versus α . (Right) K α ( X ; Y ) and Δ α ( P X Y P X P Y ) versus α . In both plots, X is Bernoulli with Pr ( X = 1 ) = 0.2 , and Y is equal to X.