1. Introduction
Shannon’s information measures (entropy, conditional entropy, the Kullback divergence or relative entropy, and mutual information) are ubiquitous both because they arise as operational fundamental limits of various communication or statistical inference problems, and because they are functionals that have become fundamental in the development and advancement of probability theory itself. Over a half century ago, Rényi [1] introduced a family of information measures extending those of Shannon [2], parametrized by an order parameter $\alpha$. Rényi’s information measures are also fundamental: indeed, they are (for $\alpha \neq 1$) just monotone functions of $L^\alpha$-norms, whose relevance or importance in any field that relies on analysis need not be justified. Furthermore, they show up in probability theory, PDE, functional analysis, additive combinatorics, and convex geometry (see, e.g., [3,4,5,6,7,8,9]), in ways where understanding them as information measures instead of simply as monotone functions of $L^\alpha$-norms is fruitful. For example, there is an intricate story of parallels between entropy power inequalities (see, e.g., [10,11,12]), Brunn-Minkowski-type volume inequalities (see, e.g., [13,14,15]) and sumset cardinality inequalities (see, e.g., [16,17,18,19,20]), which is clarified by considering logarithms of volumes and Shannon entropies as members of the larger class of Rényi entropies. It is also recognized now that Rényi’s information measures show up as fundamental operational limits in a range of information-theoretic or statistical problems (see, e.g., [21,22,23,24,25,26,27]). Therefore, there has been considerable interest in developing the theory surrounding Rényi’s information measures (which is far less well developed than the Shannon case), and there has been a steady stream of recent papers [27,28,29,30,31,32,33,34,35,36] elucidating their properties beyond the early work of [37,38,39,40]. This paper, part of which was presented at ISIT 2019 [41], is a further contribution along these lines.
More specifically, three notions of Rényi mutual information have been considered in the literature (usually named after Sibson, Arimoto and Csiszár) for discrete alphabets. Sibson’s definition has also been considered for abstract alphabets, but Arimoto’s definition has not. Indeed, Verdú [31] asserts: “One shortcoming of Arimoto’s proposal is that its generalization to non-discrete alphabets is not self-evident.” The reason it is not self-evident is that, although there is an obvious generalized definition, the mutual information arising from this notion depends on the choice of reference measure on the abstract alphabet, which is not a desirable property. Nonetheless, the perspective taken in this note is that it is still interesting to develop the properties of the abstract Arimoto conditional Rényi entropy, keeping in mind the dependence on the reference measure. The Sibson definition is then just a special case of the Arimoto definition in which we choose a particular, special reference measure.
Our main motivation comes from considering various notions of Rényi capacity. While certain equivalences have been shown between various such notions by Csiszár [21] for finite alphabets and Nakiboğlu [36,42] for abstract alphabets, the equivalences and relationships are further extended in this note.
This paper is organized in the following manner. In Section 2 below, we begin by defining conditional Rényi entropy for random variables taking values in a Polish space. Section 3 presents a variational formula for the conditional Rényi entropy in terms of Rényi divergence, which will be a key ingredient in several results later. Basic properties that the abstract conditional Rényi entropy satisfies, akin to its discrete version, are proved in Section 4, including descriptions for the special orders $1$ and $\infty$, monotonicity in the order, reduction of entropy upon conditioning, and a version of the chain rule. Section 5 discusses and compares several notions of $\alpha$-mutual information. The various notions of channel capacity arising out of different notions of $\alpha$-mutual information are studied in Section 6, which are then compared using results from the preceding section.
2. Definition of Conditional Rényi Entropies
Let $S$ be a Polish space and $\mathcal{B}$ its Borel $\sigma$-algebra. We fix a $\sigma$-finite reference measure $\gamma$ on $(S,\mathcal{B})$. Our study of entropy, and in particular all $L^p$ spaces we talk about, will be with respect to this measure space, unless stated otherwise.
Definition 1. Let $X$ be an $S$-valued random variable with density $f$ with respect to $\gamma$. For $\alpha \in (0,1)\cup(1,\infty)$, we define the Rényi entropy of $X$ of order $\alpha$ by
$$h_\alpha(X) = \frac{1}{1-\alpha}\log\int_S f^{\alpha}\,d\gamma.$$
It will be convenient to write down the Rényi entropy as
$$h_\alpha(X) = -\log N_\alpha(X),$$
where
$$N_\alpha(X) = \left(\int_S f^{\alpha}\,d\gamma\right)^{\frac{1}{\alpha-1}}$$
will be called the Rényi probability of order $\alpha$ of $X$.
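As a quick numerical illustration (ours, not part of the original text) of Definition 1, consider a finite alphabet with $\gamma$ taken to be counting measure, so that the density $f$ is just the probability mass function. The helper names `renyi_entropy` and `renyi_probability` below are hypothetical.

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy h_alpha of a pmf p (gamma = counting measure,
    so the density f is the pmf itself)."""
    if abs(alpha - 1.0) < 1e-12:          # alpha -> 1 limit: Shannon entropy
        return -sum(x * math.log(x) for x in p if x > 0)
    return math.log(sum(x ** alpha for x in p)) / (1.0 - alpha)

def renyi_probability(p, alpha):
    """N_alpha(X) = exp(-h_alpha(X)), the Rényi probability of order alpha."""
    return math.exp(-renyi_entropy(p, alpha))

p = [0.5, 0.25, 0.125, 0.125]
# h_alpha is non-increasing in alpha (monotonicity in the order)
hs = [renyi_entropy(p, a) for a in (0.5, 1.0, 2.0, 10.0)]
assert all(hs[i] >= hs[i + 1] - 1e-12 for i in range(len(hs) - 1))
# alpha -> 1 recovers the Shannon entropy continuously
assert abs(renyi_entropy(p, 1.0 + 1e-6) - renyi_entropy(p, 1.0)) < 1e-4
```

For the uniform distribution every order gives the same value, $h_\alpha = \log |S|$, so $N_\alpha$ is the common point mass $1/|S|$; this is the sense in which $N_\alpha$ behaves like a "probability".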
Let $T$ be another Polish space with a fixed $\sigma$-finite measure $\eta$ on its Borel $\sigma$-algebra. Now suppose $X, Y$ are, respectively, $S$- and $T$-valued random variables with a joint density $F$ w.r.t. the reference measure $\gamma\otimes\eta$. We will denote the marginals of $F$ on $S$ and $T$ by $f$ and $g$, respectively. This in particular means that $X$ has density $f$ w.r.t. $\gamma$ and $Y$ has density $g$ w.r.t. $\eta$. Just as for the Rényi probability of $X$, one can define the Rényi probability of the conditional $X$ given $Y=y$ by the expression
$$N_\alpha(X\mid Y=y) = \left(\int_S \left(\frac{F(x,y)}{g(y)}\right)^{\alpha}\gamma(dx)\right)^{\frac{1}{\alpha-1}}.$$
The following generalizes ([30], Definition 2).
Definition 2. Let $\alpha\in(0,1)\cup(1,\infty)$. We define the conditional Rényi entropy in terms of a weighted mean of conditional Rényi probabilities,
$$h_\alpha(X\mid Y) = -\log N_\alpha(X\mid Y),$$
where
$$N_\alpha(X\mid Y) = \left(\int_T N_\alpha(X\mid Y=y)^{\frac{\alpha-1}{\alpha}}\, g(y)\,\eta(dy)\right)^{\frac{\alpha}{\alpha-1}}.$$
We can re-write $N_\alpha(X\mid Y)$ as
$$N_\alpha(X\mid Y) = \left(\int_T \left\| \frac{F(\cdot,y)}{g(y)} \right\|_{L^\alpha(\gamma)} g(y)\,\eta(dy)\right)^{\frac{\alpha}{\alpha-1}},$$
which is the expected $L^\alpha(\gamma)$ norm of the conditional density under the measure $g\,d\eta$, raised to a power which is the Hölder conjugate of $\alpha$. Using Fubini’s theorem, the formula for $N_\alpha(X\mid Y)$ can be further written down only in terms of the joint density,
$$N_\alpha(X\mid Y) = \left(\int_T \left(\int_S F(x,y)^{\alpha}\,\gamma(dx)\right)^{\frac{1}{\alpha}}\eta(dy)\right)^{\frac{\alpha}{\alpha-1}}.$$
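The equivalence of the weighted-mean form and the joint-density form can be checked numerically; the sketch below (our illustration, with hypothetical helper names, counting reference measures on both alphabets) computes $h_\alpha(X\mid Y)$ both ways for a small joint pmf.

```python
import math

def cond_renyi_entropy_joint(F, alpha):
    """h_alpha(X|Y) from the joint pmf F[x][y] via the joint-density form:
    (alpha/(1-alpha)) * log sum_y (sum_x F(x,y)^alpha)^(1/alpha)."""
    nx, ny = len(F), len(F[0])
    Z = sum(sum(F[x][y] ** alpha for x in range(nx)) ** (1.0 / alpha)
            for y in range(ny))
    return (alpha / (1.0 - alpha)) * math.log(Z)

def cond_renyi_entropy_weighted(F, alpha):
    """Same quantity as a weighted mean of conditional Rényi probabilities."""
    nx, ny = len(F), len(F[0])
    total = 0.0
    for y in range(ny):
        g = sum(F[x][y] for x in range(nx))          # marginal pmf of Y
        cond = [F[x][y] / g for x in range(nx)]      # conditional pmf of X|Y=y
        n_cond = sum(c ** alpha for c in cond) ** (1.0 / (alpha - 1.0))
        total += g * n_cond ** ((alpha - 1.0) / alpha)
    return (alpha / (1.0 - alpha)) * math.log(total)

F = [[0.30, 0.10],
     [0.05, 0.25],
     [0.10, 0.20]]
for a in (0.5, 2.0, 5.0):
    assert abs(cond_renyi_entropy_joint(F, a) -
               cond_renyi_entropy_weighted(F, a)) < 1e-10
```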
Remark 1. Suppose, for each $y\in T$, $\mu_y$ denotes the conditional distribution of $X$ given $Y=y$, i.e., the probability measure on $S$ with density $F(\cdot,y)/g(y)$ with respect to $\gamma$. Then the conditional Rényi entropy can be written as
$$h_\alpha(X\mid Y) = \frac{\alpha}{1-\alpha}\log\int_T e^{\frac{\alpha-1}{\alpha}\,D_\alpha(\mu_y\|\gamma)}\, g(y)\,\eta(dy),$$
where $D_\alpha$ denotes Rényi divergence (see Definition 3). When $X$ and $Y$ are independent random variables one can easily check that $N_\alpha(X\mid Y) = N_\alpha(X)$, therefore $h_\alpha(X\mid Y) = h_\alpha(X)$ as expected. Since the independence of $X$ and $Y$ means that all the conditionals $\mu_y$ are equal to the distribution of $X$, the fact that $h_\alpha(X\mid Y) = h_\alpha(X)$ in this case can also be verified from the expression in Remark 1. The converse is also true, i.e., $h_\alpha(X\mid Y) = h_\alpha(X)$ implies the independence of $X$ and $Y$, if $\alpha\in(0,1)\cup(1,\infty)$. This is noted later in Corollary 2.
Clearly, unlike conditional Shannon entropy, the conditional Rényi entropy is not the average Rényi entropy of the conditional distributions. The average Rényi entropy of the conditional distributions,
$$\tilde h_\alpha(X\mid Y) = \int_T h_\alpha(\mu_y)\, g(y)\,\eta(dy),$$
has been proposed as a candidate for conditional Rényi entropy; however, it does not satisfy some properties one would expect such a notion to satisfy, like monotonicity (see [30]). When $\alpha > 1$ it follows from Jensen’s inequality that $h_\alpha(X\mid Y) \le \tilde h_\alpha(X\mid Y)$, while the inequality is reversed when $\alpha < 1$.
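The direction of the Jensen comparison is easy to observe numerically. The following sketch (our illustration; counting reference measures, hypothetical helper name) computes both quantities for a small joint pmf.

```python
import math

def arimoto_vs_average(F, alpha):
    """Return (h_alpha(X|Y), g-weighted average of h_alpha over the
    conditionals mu_y) for a joint pmf F[x][y]."""
    nx, ny = len(F), len(F[0])
    Z, avg = 0.0, 0.0
    for y in range(ny):
        g = sum(F[x][y] for x in range(nx))
        cond = [F[x][y] / g for x in range(nx)]
        s = sum(c ** alpha for c in cond)
        Z += g * s ** (1.0 / alpha)                 # contributes to h_alpha(X|Y)
        avg += g * math.log(s) / (1.0 - alpha)      # g-weighted h_alpha(mu_y)
    return (alpha / (1.0 - alpha)) * math.log(Z), avg

F = [[0.30, 0.10],
     [0.05, 0.25],
     [0.10, 0.20]]
h_big, avg_big = arimoto_vs_average(F, 3.0)      # alpha > 1
h_small, avg_small = arimoto_vs_average(F, 0.5)  # alpha < 1
assert h_big <= avg_big + 1e-12      # h_alpha(X|Y) <= average for alpha > 1
assert h_small >= avg_small - 1e-12  # reversed for alpha < 1
```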
3. Relation to Rényi Divergence
We continue to consider an $S$-valued random variable $X$ and a $T$-valued random variable $Y$ with a given joint distribution with density $F$ with respect to $\gamma\otimes\eta$. Densities, etc., are with respect to the fixed reference measures on the state spaces, unless mentioned otherwise.
Let $\mu$ be a Borel probability measure with density $p$ on a Polish space, and let $\nu$ be a Borel measure with density $q$ on the same space, both with respect to a common reference measure $\gamma$.
Definition 3 (Rényi divergence).
Suppose $\alpha\in(0,1)\cup(1,\infty)$. Then, the Rényi divergence of order $\alpha$ between the measures $\mu$ and $\nu$ is defined as
$$D_\alpha(\mu\|\nu) = \frac{1}{\alpha-1}\log\int p^{\alpha} q^{1-\alpha}\,d\gamma.$$
For the orders $\alpha\in\{0,1,\infty\}$ the Rényi divergence is defined by the respective limits.
Definition 4. - 1.
$D_0(\mu\|\nu) = \lim_{\alpha\downarrow 0} D_\alpha(\mu\|\nu)$;
- 2.
$D_1(\mu\|\nu) = \lim_{\alpha\to 1} D_\alpha(\mu\|\nu)$; and
- 3.
$D_\infty(\mu\|\nu) = \lim_{\alpha\uparrow\infty} D_\alpha(\mu\|\nu)$.
Remark 2. These definitions are independent of the reference measure γ.
Remark 3. $D_\alpha(\mu\|\nu) = \infty$ for $\alpha \ge 1$ if $\mu(A) > 0$ for some set $A$ with $\nu(A) = 0$, i.e., if $\mu$ is not absolutely continuous with respect to $\nu$.
See [29]. The conditional Rényi entropy can be written in terms of Rényi divergence from the joint distribution, using a generalized Sibson identity we learnt from B. Nakiboğlu [43] (also see [36], and [38], where a version of this identity appears to originate). The proof for abstract alphabets presented here is also due to B. Nakiboğlu [43], and simplifies our original proof [41] of the second formula below.
Theorem 1. Let $X, Y$ be random variables taking values in the spaces $S, T$ respectively. We assume they are jointly distributed with density $F$ with respect to the product reference measure $\gamma\otimes\eta$.
For $\alpha\in(0,1)\cup(1,\infty)$ and any probability measure $\lambda$ absolutely continuous with respect to $\eta$, we have
$$D_\alpha\left(P_{X,Y}\,\|\,\gamma\otimes\lambda\right) = D_\alpha\left(P_{X,Y}\,\|\,\gamma\otimes\lambda^{*}\right) + D_\alpha\left(\lambda^{*}\,\|\,\lambda\right),$$
where $P_{X,Y}$ denotes the joint distribution of $(X,Y)$, $\lambda^{*}$ is the measure having density
$$\frac{d\lambda^{*}}{d\eta}(y) = \frac{1}{Z}\left(\int_S F(x,y)^{\alpha}\,\gamma(dx)\right)^{\frac{1}{\alpha}}$$
with respect to $\eta$, and $Z = \int_T\left(\int_S F(x,y)^{\alpha}\,\gamma(dx)\right)^{\frac{1}{\alpha}}\eta(dy)$ is the normalization factor. As a consequence, we have
$$h_\alpha(X\mid Y) = -\min_{\lambda}\, D_\alpha\left(P_{X,Y}\,\|\,\gamma\otimes\lambda\right),$$
the minimum being attained at $\lambda = \lambda^{*}$. Proof. Suppose $\lambda$ has density $h$ with respect to $\eta$. Then $\gamma\otimes\lambda$ has density $h(y)$ with respect to $\gamma\otimes\eta$. Now, for $\alpha\in(0,1)\cup(1,\infty)$,
$$
\begin{aligned}
D_\alpha\left(P_{X,Y}\,\|\,\gamma\otimes\lambda\right)
&= \frac{1}{\alpha-1}\log\int_T\left(\int_S F(x,y)^{\alpha}\,\gamma(dx)\right) h(y)^{1-\alpha}\,\eta(dy)\\
&= \frac{1}{\alpha-1}\log\int_T \left(Z\,\frac{d\lambda^{*}}{d\eta}(y)\right)^{\alpha} h(y)^{1-\alpha}\,\eta(dy)\\
&= \frac{\alpha}{\alpha-1}\log Z + D_\alpha\left(\lambda^{*}\,\|\,\lambda\right)\\
&= D_\alpha\left(P_{X,Y}\,\|\,\gamma\otimes\lambda^{*}\right) + D_\alpha\left(\lambda^{*}\,\|\,\lambda\right).
\end{aligned}
$$
The case $\alpha = 1$ is straightforward and well-known, and the optimal $\lambda$ in this case is the distribution of $Y$. □
Remark 4. The identities above and the measure $\lambda^{*}$ are independent of the reference measure η; η is only used to write out the Rényi divergence concretely in terms of densities.
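The decomposition in Theorem 1 can be checked numerically on a finite alphabet. The sketch below (our illustration; counting reference measures, hypothetical helper names) forms $\lambda^{*}$ from a joint pmf and verifies the identity for an arbitrary $\lambda$.

```python
import math

def renyi_div(p, q, alpha):
    """D_alpha(p||q) for pmfs p, q (counting reference measure)."""
    return math.log(sum(pi ** alpha * qi ** (1.0 - alpha)
                        for pi, qi in zip(p, q) if pi > 0)) / (alpha - 1.0)

def sibson_decomposition(F, lam, alpha):
    """Return both sides of
    D(P_XY || gamma x lam) = D(P_XY || gamma x lam*) + D(lam* || lam)
    for a joint pmf F[x][y]; gamma is counting measure on the x-alphabet."""
    nx, ny = len(F), len(F[0])
    w = [sum(F[x][y] ** alpha for x in range(nx)) ** (1.0 / alpha)
         for y in range(ny)]
    Z = sum(w)
    lam_star = [wy / Z for wy in w]          # optimal output measure lambda*
    joint = [F[x][y] for x in range(nx) for y in range(ny)]
    ref = lambda m: [m[y] for x in range(nx) for y in range(ny)]  # gamma x m
    lhs = renyi_div(joint, ref(lam), alpha)
    rhs = renyi_div(joint, ref(lam_star), alpha) + renyi_div(lam_star, lam, alpha)
    return lhs, rhs

F = [[0.30, 0.10],
     [0.05, 0.25],
     [0.10, 0.20]]
for a in (0.5, 2.0, 4.0):
    lhs, rhs = sibson_decomposition(F, [0.7, 0.3], a)
    assert abs(lhs - rhs) < 1e-10
```

Since $D_\alpha(\lambda^{*}\|\lambda)\ge 0$, the decomposition makes the optimality of $\lambda^{*}$ evident.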
5. Notions of α-Mutual Information
Arimoto [39] used his conditional Rényi entropy to define a mutual information, which we extend to the general setting as follows.
Definition 5. Let X be an S-valued random variable and let Y be a T-valued random variable, with a given joint distribution. Then, we define
$$I_\alpha^{\gamma}(X\rightsquigarrow Y) = h_\alpha(X) - h_\alpha(X\mid Y),$$
where the entropies are computed with respect to the reference measure $\gamma$ on $S$. We use the squiggly arrow to emphasize the lack of symmetry in X and Y, but nonetheless to distinguish it from the notation for directed mutual information, which is usually written with a straight arrow. By Corollary 2, for $\alpha\in(0,1)\cup(1,\infty)$, $I_\alpha^{\gamma}(X\rightsquigarrow Y) = 0$ if and only if X and Y are independent. Therefore $I_\alpha^{\gamma}(X\rightsquigarrow Y)$, for any choice of reference measure $\gamma$, can be seen as a measure of dependence between X and Y.
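On a finite alphabet with counting reference measure, Definition 5 is easy to compute, and the vanishing under independence is visible directly. A small sketch (our illustration, hypothetical helper name):

```python
import math

def arimoto_mi(F, alpha):
    """I_alpha^gamma(X ~> Y) = h_alpha(X) - h_alpha(X|Y) for a joint pmf F,
    with gamma = counting measure on the x-alphabet."""
    nx, ny = len(F), len(F[0])
    f = [sum(F[x][y] for y in range(ny)) for x in range(nx)]  # marginal of X
    h_x = math.log(sum(p ** alpha for p in f)) / (1.0 - alpha)
    Z = sum(sum(F[x][y] ** alpha for x in range(nx)) ** (1.0 / alpha)
            for y in range(ny))
    h_x_given_y = (alpha / (1.0 - alpha)) * math.log(Z)
    return h_x - h_x_given_y

indep = [[0.2 * 0.6, 0.2 * 0.4],   # product joint: X, Y independent
         [0.8 * 0.6, 0.8 * 0.4]]
dep = [[0.45, 0.05],               # strongly correlated joint
       [0.05, 0.45]]
for a in (0.5, 2.0):
    assert abs(arimoto_mi(indep, a)) < 1e-12
    assert arimoto_mi(dep, a) > 0.01
```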
Let us discuss a little further the validity of $I_\alpha^{\gamma}(X\rightsquigarrow Y)$ as a dependence measure. If the conditional distributions are denoted by $\mu_y$ as in Remark 1, then using the fact that $h_\alpha(Z) = -D_\alpha(P_Z\|\gamma)$ for any random variable $Z$ with a density with respect to $\gamma$, we have for any $\alpha\in(0,1)\cup(1,\infty)$ that
$$I_\alpha^{\gamma}(X\rightsquigarrow Y) = \frac{\alpha}{\alpha-1}\log\int_T e^{\frac{\alpha-1}{\alpha}\,D_\alpha(\mu_y\|\gamma)}\, g(y)\,\eta(dy) - D_\alpha(P_X\|\gamma).$$
Furthermore, since $P_X$ is the mixture $\int_T \mu_y\, g(y)\,\eta(dy)$ of the conditional distributions, we may also write
$$I_\alpha^{\gamma}(X\rightsquigarrow Y) = \frac{\alpha}{\alpha-1}\log\int_T e^{\frac{\alpha-1}{\alpha}\,D_\alpha(\mu_y\|\gamma)}\, g(y)\,\eta(dy) - D_\alpha\!\left(\int_T \mu_y\, g(y)\,\eta(dy)\,\Big\|\,\gamma\right).$$
Note that Rényi divergence is convex in the second argument (see ([29], Theorem 12)), and the last equation suggests that Arimoto’s mutual information can be seen as a quantification of a convexity gap.
One can also see clearly from the above expressions why this quantity controls, at least for $\alpha\in(0,1)$, the dependence between $X$ and $Y$: indeed, one has for any $\alpha\in(0,1)$ and any $t>0$ that
$$\mathbb{P}\left\{ D_\alpha(\mu_Y\|\gamma) \le D_\alpha(P_X\|\gamma) + t \right\} \le e^{\frac{\alpha-1}{\alpha}\left( I_\alpha^{\gamma}(X\rightsquigarrow Y) - t \right)},$$
where the inequality comes from Markov’s inequality, and we use the first identity above to evaluate $\int_T e^{\frac{\alpha-1}{\alpha}\,D_\alpha(\mu_y\|\gamma)}\, g(y)\,\eta(dy)$. Thus, when $I_\alpha^{\gamma}(X\rightsquigarrow Y)$ is large, the probability that the conditional distributions of $X$ given $Y$ cluster at around the same “Rényi divergence” distance from the reference measure $\gamma$ as the unconditional distribution of $X$ (which is of course a mixture of the conditional distributions) is small, suggesting a significant “spread” of the conditional distributions and therefore strong dependence. This is illustrated in Figure 1. Thus, despite the dependence of $I_\alpha^{\gamma}(X\rightsquigarrow Y)$ on the reference measure $\gamma$, it does guarantee strong dependence when it is large (at least for $\alpha\in(0,1)$). When $\alpha>1$ we have $\frac{\alpha-1}{\alpha}>0$, and consequently the upper bound $e^{\frac{\alpha-1}{\alpha}\left( I_\alpha^{\gamma}(X\rightsquigarrow Y) - t \right)}$ is at least $1$ whenever $t \le I_\alpha^{\gamma}(X\rightsquigarrow Y)$, making the inequality trivial.
The “mutual information” quantity $I_\alpha^{\gamma}(X\rightsquigarrow Y)$ clearly depends on the choice of the reference measure $\gamma$. Nonetheless, there are three families of Rényi mutual information that are independent of the choice of reference measure, which we now introduce.
Definition 6. Fix $\alpha\in(0,1)\cup(1,\infty)$.
- 1.
The Lapidoth-Pfister α-mutual information is defined as
$$J_\alpha(X;Y) = \min_{P,\,Q}\, D_\alpha\left(P_{X,Y}\,\|\,P\otimes Q\right),$$
where the minimum runs over probability measures $P$ on $S$ and $Q$ on $T$. - 2.
The Augustin-Csiszár α-mutual information is defined as
$$I^{C}_\alpha(X;Y) = \min_{Q}\int_S D_\alpha\left(P_{Y\mid X=x}\,\|\,Q\right) P_X(dx),$$
the minimum running over probability measures $Q$ on $T$. - 3.
Sibson’s α-mutual information is defined as
$$I^{S}_\alpha(X;Y) = \min_{Q}\, D_\alpha\left(P_{X,Y}\,\|\,P_X\otimes Q\right),$$
the minimum again running over probability measures $Q$ on $T$.
The quantity $J_\alpha(X;Y)$ was recently introduced by Lapidoth and Pfister as a measure of independence in [45] (cf. [25,27,32]). The Augustin-Csiszár mutual information was originally introduced in [40] by Udo Augustin with a slightly different parametrization, and gained much popularity following Csiszár’s work in [21]. For a discussion of early work on this quantity and its applications, also see [42] and the references therein. Both [40] and [42] treat abstract alphabets; however, the former is limited to $\alpha\in(0,1)$ while the latter treats all $\alpha\in(0,\infty)$. Sibson’s definition originates in [38], where he introduces $I^{S}_\alpha(X;Y)$ in the form of an information radius (see, e.g., [33]), which is often written in terms of Gallager’s function (from [46]). Since all the quantities in the above definition are stated in terms of Rényi divergences not involving the reference measure $\gamma$, they themselves are independent of the reference measure. Their relationship with the Rényi divergence also shows that all of them are non-negative. Moreover, putting $P\otimes Q = P_X\otimes P_Y$ in the expression for $J_\alpha(X;Y)$, and $Q = P_Y$ in the expressions for $I^{C}_\alpha(X;Y)$ and $I^{S}_\alpha(X;Y)$, when $X$ and $Y$ are independent shows that they all vanish under independence.
While these notions of mutual information are certainly not equal to $I^{\gamma}_\alpha(X\rightsquigarrow Y)$ in general when $\alpha\neq 1$, they do have a direct relationship with conditional Rényi entropies, obtained by varying the reference measure.
Since $\min_{Q} D_\alpha(P_{X,Y}\|P\otimes Q) = -h^{P}_\alpha(X\mid Y)$ by Theorem 1 (applied with the probability measure $P$ as the reference measure on $S$; we use a superscript to indicate the reference measure with respect to which an entropy is computed), where all optimizations are done over probability measures, we can write Lapidoth and Pfister’s mutual information as
$$J_\alpha(X;Y) = -\max_{P}\, h^{P}_\alpha(X\mid Y).$$
Note that it is symmetric by definition: $J_\alpha(X;Y) = J_\alpha(Y;X)$, which is why we do not use squiggly arrows to denote it. By writing down Rényi divergence as Rényi entropy w.r.t. a reference measure, Augustin-Csiszár’s $I^{C}_\alpha(X;Y)$ can be recast in a similar form, this time using the average Rényi entropy of the conditionals instead of Arimoto’s conditional Rényi entropy,
$$I^{C}_\alpha(X;Y) = -\max_{Q}\, \tilde h^{Q}_\alpha(Y\mid X),$$
the maximum being over probability measures $Q$ on $T$. In light of Theorem 1, Sibson’s mutual information can clearly be written in terms of conditional Rényi entropy as
$$I^{S}_\alpha(X;Y) = -h^{P_X}_\alpha(X\mid Y).$$
This leads to the observation that Sibson’s mutual information can be seen as a special case of Arimoto’s mutual information, when the reference measure is taken to be the distribution of $X$:
$$I^{S}_\alpha(X;Y) = I^{P_X}_\alpha(X\rightsquigarrow Y).$$
For the sake of comparison with the corresponding expression for $I^{C}_\alpha(X;Y)$, we also write $J_\alpha(X;Y)$ as
$$J_\alpha(X;Y) = -\max_{Q}\, h^{Q}_\alpha(Y\mid X),$$
the maximum now being over probability measures $Q$ on $T$.
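The optimality of the measure $\lambda^{*}$ of Theorem 1 (specialized to $\gamma = P_X$) in the minimization defining Sibson’s mutual information can be checked numerically. The sketch below is our illustration (counting reference measures, hypothetical helper names): it compares the closed-form value with the divergence at the optimizer and at a few suboptimal output measures.

```python
import math

def renyi_div(p, q, alpha):
    return math.log(sum(pi ** alpha * qi ** (1.0 - alpha)
                        for pi, qi in zip(p, q) if pi > 0)) / (alpha - 1.0)

def sibson_mi(F, alpha):
    """Closed form: (alpha/(alpha-1)) log sum_y (sum_x f(x) W(y|x)^alpha)^(1/alpha)."""
    nx, ny = len(F), len(F[0])
    f = [sum(row) for row in F]
    s = sum(sum(f[x] * (F[x][y] / f[x]) ** alpha for x in range(nx)) ** (1.0 / alpha)
            for y in range(ny))
    return (alpha / (alpha - 1.0)) * math.log(s)

def div_from_product(F, Q, alpha):
    """D_alpha(P_XY || P_X x Q)."""
    nx, ny = len(F), len(F[0])
    f = [sum(row) for row in F]
    p = [F[x][y] for x in range(nx) for y in range(ny)]
    q = [f[x] * Q[y] for x in range(nx) for y in range(ny)]
    return renyi_div(p, q, alpha)

F = [[0.30, 0.10],
     [0.05, 0.25],
     [0.10, 0.20]]
alpha = 2.0
nx = len(F)
f = [sum(row) for row in F]
# lambda* of Theorem 1 with gamma = P_X
w = [sum(f[x] * (F[x][y] / f[x]) ** alpha for x in range(nx)) ** (1.0 / alpha)
     for y in range(2)]
q_star = [wy / sum(w) for wy in w]
val_star = div_from_product(F, q_star, alpha)
assert abs(val_star - sibson_mi(F, alpha)) < 1e-10
for Q in ([0.5, 0.5], [0.8, 0.2], [0.3, 0.7]):   # no other Q does better
    assert div_from_product(F, Q, alpha) >= val_star - 1e-12
```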
The following inequality, which relates the three families when $\alpha\in(1,\infty)$, turns out to be quite fruitful.
Theorem 3. Let $\alpha\in(1,\infty)$. Then $I^{C}_\alpha(X;Y) \le J_\alpha(X;Y) \le I^{S}_\alpha(X;Y)$.
Proof. Suppose $\alpha\in(1,\infty)$. Then, $h^{Q}_\alpha(Y\mid X) \le \tilde h^{Q}_\alpha(Y\mid X)$ for every probability measure $Q$ on $T$ (by the Jensen comparison of Section 2), so that
$$I^{C}_\alpha(X;Y) = -\max_{Q}\,\tilde h^{Q}_\alpha(Y\mid X) \le -\max_{Q}\, h^{Q}_\alpha(Y\mid X) = J_\alpha(X;Y).$$
Moreover,
$$J_\alpha(X;Y) = -\max_{P}\, h^{P}_\alpha(X\mid Y) \le -h^{P_X}_\alpha(X\mid Y) = I^{S}_\alpha(X;Y),$$
which completes the proof. □
Remark 7. When $\alpha\in(0,1)$, from the straightforward observation $J_\alpha(X;Y) \le I^{S}_\alpha(X;Y)$ and [21], we have $J_\alpha(X;Y) \le I^{S}_\alpha(X;Y) \le I^{C}_\alpha(X;Y)$.
We note that the relation between $I^{S}_\alpha$ and $I^{C}_\alpha$ (in the finite alphabet case) goes back to Csiszár [21]. In the next and final section, we explore the implications of Theorem 3 for various notions of capacity.
6. Channels and Capacities
We begin by defining channels and capacities. Throughout this section, assume $\alpha\in(0,1)\cup(1,\infty)$.
Definition 7. Let $(A,\mathcal{A})$ and $(B,\mathcal{B})$ be measurable spaces. A function $W\colon A\times\mathcal{B}\to[0,1]$ is called a probability kernel or a channel from the input space $(A,\mathcal{A})$ to the output space $(B,\mathcal{B})$ if
- 1.
For all $a\in A$, the function $W(a,\cdot)$ is a probability measure on $(B,\mathcal{B})$, and
- 2.
For every $B_0\in\mathcal{B}$, the function $W(\cdot,B_0)$ is an $\mathcal{A}$-measurable function.
In our setting, the conditional distributions $\mu_y$ of $X$ given $Y=y$ define a channel $W$ from $T$ to $S$. In terms of this, one can write a density-free expression for the conditional Rényi entropy:
$$h_\alpha(X\mid Y) = \frac{\alpha}{1-\alpha}\log\int_T e^{\frac{\alpha-1}{\alpha}\,D_\alpha(W_y\|\gamma)}\, P_Y(dy),$$
where $W_y = W(y,\cdot) = \mu_y$ denotes the output distribution of the channel on input $y$.
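The density-free (kernel) form agrees with the joint-density formula of Section 2; the sketch below (our illustration, counting reference measures, hypothetical helper names) computes $h_\alpha(X\mid Y)$ both from the kernel $y\mapsto\mu_y$ together with $P_Y$, and from the joint pmf.

```python
import math

def renyi_div_to_counting(p, alpha):
    """D_alpha(p || gamma) for a pmf p, with gamma = counting measure."""
    return math.log(sum(x ** alpha for x in p if x > 0)) / (alpha - 1.0)

def cond_entropy_via_kernel(kernel, p_y, alpha):
    """Density-free form: h_alpha(X|Y) from the kernel y -> mu_y and P_Y."""
    s = sum(g * math.exp((alpha - 1.0) / alpha * renyi_div_to_counting(mu, alpha))
            for mu, g in zip(kernel, p_y))
    return (alpha / (1.0 - alpha)) * math.log(s)

def cond_entropy_via_joint(F, alpha):
    """Joint-density form from Section 2."""
    nx, ny = len(F), len(F[0])
    Z = sum(sum(F[x][y] ** alpha for x in range(nx)) ** (1.0 / alpha)
            for y in range(ny))
    return (alpha / (1.0 - alpha)) * math.log(Z)

F = [[0.30, 0.10],
     [0.05, 0.25],
     [0.10, 0.20]]
p_y = [sum(F[x][y] for x in range(3)) for y in range(2)]
kernel = [[F[x][y] / p_y[y] for x in range(3)] for y in range(2)]  # mu_y
for a in (0.5, 2.0):
    assert abs(cond_entropy_via_kernel(kernel, p_y, a) -
               cond_entropy_via_joint(F, a)) < 1e-10
```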
Definition 8. Let $(B,\mathcal{B})$ be a measurable space and $\mathcal{A}$ a set of probability measures on $\mathcal{B}$. Following [36], define the order-α Rényi radius of $\mathcal{A}$ relative to a probability measure $\mu$ by
$$R_\alpha(\mathcal{A}\,\|\,\mu) = \sup_{\nu\in\mathcal{A}} D_\alpha(\nu\|\mu).$$
The order-α Rényi radius of $\mathcal{A}$ is defined as
$$R_\alpha(\mathcal{A}) = \inf_{\mu}\, R_\alpha(\mathcal{A}\,\|\,\mu),$$
the infimum being over probability measures $\mu$ on $\mathcal{B}$. Given a joint distribution of $(X,Y)$, one can consider the quantities $J_\alpha(X;Y)$, $I^{C}_\alpha(X;Y)$ and $I^{S}_\alpha(X;Y)$ as functions of an “input” distribution and a channel. For example, one can consider $I^{S}_\alpha(X;Y)$ as a function of the distribution $P_X$ of $X$ and the probability kernel formed by the conditional distributions of $Y$ given $X=x$. Under this interpretation, we first define various capacities.
Definition 9. Given a channel W from X to Y, we define capacities of order α by
- 1.
$C^{LP}_\alpha(W) = \sup_{P} J_\alpha(X;Y)$,
- 2.
$C^{C}_\alpha(W) = \sup_{P} I^{C}_\alpha(X;Y)$, and
- 3.
$C^{S}_\alpha(W) = \sup_{P} I^{S}_\alpha(X;Y)$,
where in each case the supremum runs over input distributions $P$ (i.e., distributions of $X$), and $(X,Y)$ is given the joint distribution induced by $P$ and $W$. We also write $R_\alpha(W)$ for the order-α Rényi radius of the family $\{W(x,\cdot) : x\in S\}$ of output distributions of $W$.
Theorem 3 allows us to extend ([31], Theorem 5) to include the capacity based on the Lapidoth-Pfister mutual information.
Theorem 4. Let $\alpha\in(1,\infty)$, and fix a channel $W$ from $X$ to $Y$. Then,
$$C^{C}_\alpha(W) \le C^{LP}_\alpha(W) \le C^{S}_\alpha(W) = R_\alpha(W).$$
Proof. Theorem 3 implies that, for every input distribution $P$,
$$I^{C}_\alpha(X;Y) \le J_\alpha(X;Y) \le I^{S}_\alpha(X;Y),$$
that is,
$$C^{C}_\alpha(W) \le C^{LP}_\alpha(W) \le C^{S}_\alpha(W).$$
It was shown by Csiszár [21] in the finite alphabet setting (in fact, he showed this for all $\alpha\in(0,\infty)$) that $C^{S}_\alpha(W)$ coincides with the Rényi radius $R_\alpha(W)$. Nakiboğlu demonstrates $C^{S}_\alpha(W) = R_\alpha(W)$ in [36] and studies the corresponding question for $C^{C}_\alpha(W)$ in [42] for abstract alphabets. Putting all this together, we have
$$C^{C}_\alpha(W) \le C^{LP}_\alpha(W) \le C^{S}_\alpha(W) = R_\alpha(W).$$
Finally, using the symmetry of $J_\alpha$ and Theorem 3 in a similar fashion again, we get the analogous bounds with the roles of $X$ and $Y$ interchanged. This completes the proof. □
The two inequalities in the last theorem cannot be improved to equalities; this follows from a counterexample communicated to the authors by C. Pfister. Note that this theorem corrects Theorem V.1 in [41].
Since $J_\alpha$ is no longer sandwiched between $I^{C}_\alpha$ and $I^{S}_\alpha$ when $\alpha\in(0,1)$, the same argument cannot be used to deduce the equality of the various capacities in this case. However, when $\alpha\in(0,1)$, a direct demonstration proves that the Lapidoth-Pfister capacity of a channel equals the Rényi radius when the state spaces are finite.
Theorem 5. Let $\alpha\in(0,1)$, and fix a channel $W$ from $X$ to $Y$, where $X$ and $Y$ take values in finite sets $S$ and $T$ respectively. Then, $C^{LP}_\alpha(W) = R_\alpha(W)$. Proof. We continue using integral notation instead of summation. Note that,
$$C^{LP}_\alpha(W) = \sup_{P}\,\min_{\mu} f(P,\mu), \qquad f(P,\mu) = \min_{\lambda}\, D_\alpha\left(P_{X,Y}\,\|\,\lambda\otimes\mu\right),$$
where $P_{X,Y}$ is the joint distribution induced by the input distribution $P$ and the channel $W$, $\lambda$ ranges over probability measures on $S$, and $\mu$ over probability measures on $T$. We consider a function $g$, obtained from $f$ by a suitable monotone transformation, defined on the product of the probability simplices on $S$ and $T$.
Observe that the function $g$ has the same minimax properties as $f$. We make the following observations about this function.
$g$ is linear in $P$.
$g$ is convex in $\mu$. This follows from the proof in ([27], Lemma 17).
$g$ is continuous in each of the variables $P$ and $\mu$. Continuity in $\mu$ follows from continuity of $D_\alpha$ in the second coordinate (see, for example, [29]), whereas continuity in $P$ is a consequence of linearity of the integral (summation).
The above observations ensure that we can apply von Neumann’s convex minimax theorem to $g$, and therefore to $f$, to conclude that
$$\sup_{P}\,\min_{\mu} f(P,\mu) = \min_{\mu}\,\sup_{P} f(P,\mu).$$
For a fixed $\mu$ however, $\sup_{P} f(P,\mu) = \sup_{x\in S} D_\alpha(W(x,\cdot)\,\|\,\mu)$ (the RHS is clearly bigger than the LHS; for the other direction use the measures $\delta_{x_n}$, where $(x_n)$ is a supremum-achieving sequence for the RHS). This shows that when $\alpha\in(0,1)$ the capacity coming from the Lapidoth-Pfister mutual information equals the Rényi radius if the state spaces are finite. □
Though we do not treat capacities coming from Arimoto’s mutual information in this paper due to its dependence on a reference measure, a remark can be made in this regard following B. Nakiboğlu’s [43] observation that Arimoto’s mutual information w.r.t. $\gamma$ of a joint distribution $P_{X,Y}$ can be written as the Sibson mutual information of some input probability measure $P$ and the channel $W$ from $X$ to $Y$ corresponding to $P_{X,Y}$. Let $f$ and $g$ denote the marginal densities of $P_{X,Y}$. As before, there is a reference measure $\gamma$ on the state space $S$ of $X$. Let $P$ denote the probability measure on $S$ with density $\frac{f^{\alpha}}{\int_S f^{\alpha}\,d\gamma}$ w.r.t. $\gamma$. Then a calculation shows that
$$I^{\gamma}_\alpha(X\rightsquigarrow Y) = I^{S}_\alpha(P; W),$$
where $I^{S}_\alpha(P;W)$ denotes the Sibson mutual information computed from the joint distribution induced by the input $P$ and the channel $W$.
Therefore, it follows that if a reference measure $\gamma$ is fixed, then the capacity of order $\alpha$ of a channel $W$ calculated from Arimoto’s mutual information will be no greater than the capacity based on Sibson’s mutual information (which equals the Rényi radius of $W$).
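This identity can also be checked numerically in the finite setting. The sketch below is our illustration (counting-measure references, hypothetical helper names): it compares Arimoto’s mutual information of a joint pmf with the Sibson mutual information of the input $P \propto f^{\alpha}$ through the induced channel.

```python
import math

def arimoto_mi(F, alpha):
    """I_alpha^gamma(X ~> Y), gamma = counting measure (as in Definition 5)."""
    nx, ny = len(F), len(F[0])
    f = [sum(row) for row in F]
    h_x = math.log(sum(p ** alpha for p in f)) / (1.0 - alpha)
    Z = sum(sum(F[x][y] ** alpha for x in range(nx)) ** (1.0 / alpha)
            for y in range(ny))
    return h_x - (alpha / (1.0 - alpha)) * math.log(Z)

def sibson_mi_of_input(P, W, alpha):
    """Sibson MI of input distribution P through channel W[x][y]."""
    ny = len(W[0])
    s = sum(sum(P[x] * W[x][y] ** alpha for x in range(len(P))) ** (1.0 / alpha)
            for y in range(ny))
    return (alpha / (alpha - 1.0)) * math.log(s)

F = [[0.30, 0.10],
     [0.05, 0.25],
     [0.10, 0.20]]
f = [sum(row) for row in F]
W = [[F[x][y] / f[x] for y in range(2)] for x in range(3)]  # channel Y|X
for a in (0.5, 2.0, 4.0):
    escort = [p ** a for p in f]
    escort = [e / sum(escort) for e in escort]   # input with density f^a / int f^a
    assert abs(arimoto_mi(F, a) - sibson_mi_of_input(escort, W, a)) < 1e-10
```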