1. Introduction
Information geometry, the study of statistical models equipped with a differentiable structure, was pioneered by the work of Rao [1], and gained maturity with the work of Amari and many others [2,3,4]. It has been successfully applied in many different areas, such as statistical inference, machine learning, signal processing or optimization [4,5]. In appropriate statistical models, the differentiable structure is induced by a (statistical) divergence. The Kullback–Leibler divergence induces a Riemannian metric, called the Fisher–Rao metric, and a pair of dual connections, the exponential and mixture connections. A statistical model endowed with the Fisher–Rao metric is called a (classical) statistical manifold. Amari also considered a family of α-divergences that induce a family of α-connections.
Much research in recent years has focused on the geometry of non-standard statistical models [6,7,8]. These models are defined in terms of a deformed exponential (also called ϕ-exponential). In particular, κ-exponential models and q-exponential families are investigated in [9,10]. Non-parametric (or infinite-dimensional) φ-families were introduced by the authors in [11,12]; they generalize exponential families in the non-parametric setting [13,14,15,16]. Based on the similarity between exponential and φ-families, we defined the so-called φ-divergence, of which the Kullback–Leibler divergence is a particular case. Statistical models equipped with a geometric structure induced by φ-divergences, which are called generalized statistical manifolds, are investigated in [17,18]. With respect to the connections induced by the φ-divergence, parametric φ-families are dually flat.
The
φ-divergence is intrinsically related to the
-model of Zhang, which was proposed in [
19,
20], extended to the infinite-dimensional setting in [21], and explained in more detail in [22,23]. For instance, the metric induced by
φ-divergence and the
-generalization of the Fisher–Rao metric, for the choices
and
, differ by a conformal factor.
Among many attempts to generalize Kullback–Leibler divergence, Rényi divergence [24] is one of the most successful, having found many applications [25]. In the present paper, we propose a generalization of Rényi divergence, which we use to define a family of α-connections. This generalization is based on an interpretation of Rényi divergence as a kind of normalizing function. To generalize Rényi divergence, we consider functions satisfying some suitable conditions; a function for which these conditions hold is called a φ-function. In a limiting case, the generalized Rényi divergence reduces to the φ-divergence. In [17,18], the φ-divergence gives rise to a pair of dual connections $D^{(-1)}$ and $D^{(1)}$. We show that the connection $D^{(\alpha)}$ induced by the generalization of Rényi divergence satisfies the convex combination $D^{(\alpha)} = \frac{1-\alpha}{2} D^{(-1)} + \frac{1+\alpha}{2} D^{(1)}$.
Eguchi in [26] investigated a geometry based on a normalizing function similar to the one used in the generalization of Rényi divergence. In [26], results were derived supposing that this normalizing function exists; conditions for its existence were not given. In the present paper, the existence of the normalizing function is ensured by the conditions in the definition of φ-functions.
The rest of the paper is organized as follows. In Section 2, φ-functions are introduced and some of their properties are discussed. The Rényi divergence is generalized in Section 3. In Section 4, we investigate the geometry induced by the generalization of Rényi divergence. Section 4.2 provides evidence of the role of the generalized Rényi divergence in φ-families.
2. φ-Functions
Rényi divergence is defined in terms of the exponential function (to be more precise, the logarithm). A way of generalizing Rényi divergence is to replace the exponential function by another function, which satisfies some suitable conditions. To a function for which these conditions hold, we give the name φ-function. In this section, we define and investigate some properties of φ-functions.
Let $(T, \Sigma, \mu)$ be a measure space. Although we do not restrict our analysis to a particular measure space, the reader can think of T as the set of real numbers $\mathbb{R}$, Σ as the Borel σ-algebra on $\mathbb{R}$, and μ as the Lebesgue measure. We can also consider T to be a discrete set, in which case μ is the counting measure.
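We say that $\varphi\colon \mathbb{R} \to (0, \infty)$ is a φ-function if the following conditions are satisfied:
- (a1) $\varphi(\cdot)$ is convex;
- (a2) $\lim_{u \to -\infty} \varphi(u) = 0$ and $\lim_{u \to \infty} \varphi(u) = \infty$;
- (a3) there exists a measurable function $u_0\colon T \to (0, \infty)$ such that
$$\int_T \varphi(c + \lambda u_0)\, d\mu < \infty, \quad \text{for all } \lambda > 0, \tag{1}$$
for each measurable function $c\colon T \to \mathbb{R}$ satisfying $\int_T \varphi(c)\, d\mu = 1$.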
Thanks to condition (a3), we can generalize Rényi divergence using φ-functions. These conditions first appeared in [12], where the authors constructed non-parametric φ-families of probability distributions. We remark that if T is finite, condition (a3) is always satisfied.
Examples of functions $\varphi\colon \mathbb{R} \to (0, \infty)$ satisfying (a1)–(a3) abound. An example of great relevance is the exponential function $\varphi(u) = \exp(u)$, which satisfies conditions (a1)–(a3) with $u_0 = 1_T$ (the constant function equal to 1 on T). Another example of a φ-function is the Kaniadakis’ κ-exponential [12,27,28].
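Example 1. The Kaniadakis’ κ-exponential $\exp_\kappa\colon \mathbb{R} \to (0, \infty)$ for $\kappa \in [-1, 1]$ is defined as
$$\exp_\kappa(u) = \begin{cases} \bigl(\kappa u + \sqrt{1 + \kappa^2 u^2}\bigr)^{1/\kappa}, & \text{if } \kappa \neq 0, \\ \exp(u), & \text{if } \kappa = 0, \end{cases}$$
whose inverse is the so-called Kaniadakis’ κ-logarithm $\ln_\kappa\colon (0, \infty) \to \mathbb{R}$, which is given by
$$\ln_\kappa(u) = \begin{cases} \dfrac{u^{\kappa} - u^{-\kappa}}{2\kappa}, & \text{if } \kappa \neq 0, \\ \ln(u), & \text{if } \kappa = 0. \end{cases}$$
It is clear that $\exp_\kappa(\cdot)$ satisfies (a1) and (a2). Let $u_0\colon T \to (0, \infty)$ be any measurable function for which $\int_T \exp_\kappa(u_0)\, d\mu < \infty$. We will show that, for $\kappa \neq 0$, $\exp_\kappa(\cdot)$ satisfies expression (1). For any $u \in \mathbb{R}$ and $\alpha \geq 1$, we can write
$$\exp_\kappa(\alpha u) = \bigl(|\kappa| \alpha u + \sqrt{1 + \kappa^2 \alpha^2 u^2}\bigr)^{1/|\kappa|} \leq \bigl(|\kappa| \alpha u + \alpha \sqrt{1 + \kappa^2 u^2}\bigr)^{1/|\kappa|} = \alpha^{1/|\kappa|} \exp_\kappa(u),$$
where we used that $\exp_\kappa(\cdot) = \exp_{-\kappa}(\cdot)$. Then, we conclude that $\int_T \exp_\kappa(\alpha u_0)\, d\mu < \infty$ for all $\alpha \geq 0$. Fix any measurable function $c\colon T \to \mathbb{R}$ such that $\int_T \exp_\kappa(c)\, d\mu = 1$. For each $\lambda > 0$, we have
$$\int_T \exp_\kappa(c + \lambda u_0)\, d\mu \leq \frac{1}{2} \int_T \exp_\kappa(2c)\, d\mu + \frac{1}{2} \int_T \exp_\kappa(2\lambda u_0)\, d\mu \leq 2^{1/|\kappa| - 1} \int_T \exp_\kappa(c)\, d\mu + \frac{1}{2} \int_T \exp_\kappa(2\lambda u_0)\, d\mu < \infty,$$
which shows that $\exp_\kappa(\cdot)$ satisfies (a3). Therefore, the Kaniadakis’ κ-exponential is an example of a φ-function. The restriction that $\int_T \varphi(c)\, d\mu = 1$ can be weakened, as asserted in the next result.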
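The two functions of Example 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `exp_kappa`, `ln_kappa` and `scaling_bound_holds` are hypothetical, and the last helper checks numerically the scaling bound $\exp_\kappa(\alpha u) \leq \alpha^{1/|\kappa|} \exp_\kappa(u)$ for $\alpha \geq 1$ and $\kappa \neq 0$, which is assumed here to be the inequality behind the argument of the example.

```python
import math


def exp_kappa(u: float, kappa: float) -> float:
    """Kaniadakis kappa-exponential; equals exp(u) when kappa == 0."""
    if kappa == 0.0:
        return math.exp(u)
    return (kappa * u + math.sqrt(1.0 + (kappa * u) ** 2)) ** (1.0 / kappa)


def ln_kappa(x: float, kappa: float) -> float:
    """Kaniadakis kappa-logarithm, the inverse of exp_kappa on (0, inf)."""
    if kappa == 0.0:
        return math.log(x)
    return (x ** kappa - x ** (-kappa)) / (2.0 * kappa)


def scaling_bound_holds(u: float, a: float, kappa: float) -> bool:
    """Check exp_kappa(a*u) <= a**(1/|kappa|) * exp_kappa(u) for a >= 1, kappa != 0."""
    lhs = exp_kappa(a * u, kappa)
    rhs = a ** (1.0 / abs(kappa)) * exp_kappa(u, kappa)
    # Small relative slack absorbs floating-point rounding.
    return lhs <= rhs * (1.0 + 1e-12)
```

For instance, `ln_kappa(exp_kappa(0.7, 0.5), 0.5)` recovers 0.7 up to rounding, and `exp_kappa(u, kappa)` agrees with `exp_kappa(u, -kappa)`, which is the symmetry used in the example.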
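Lemma 1. Let $\tilde{c}\colon T \to \mathbb{R}$ be any measurable function such that $\int_T \varphi(\tilde{c})\, d\mu < \infty$. Then, $\int_T \varphi(\tilde{c} + \lambda u_0)\, d\mu < \infty$ for all $\lambda > 0$.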
Proof. Notice that if
, then
for some
. From the definition of
, it follows that
, where
. Now assume that
. Consider any measurable set
with measure
. Let
be a measurable function supported on
A satisfying
, where
. Defining
, we see that
. By the definition of
, we can write
which is the desired result. ☐
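As a consequence of Lemma 1, condition (a3) can be replaced by the following one:
(a3’) There exists a measurable function $u_0\colon T \to (0, \infty)$ such that
$$\int_T \varphi(c + \lambda u_0)\, d\mu < \infty, \quad \text{for all } \lambda > 0, \tag{2}$$
for each measurable function $c\colon T \to \mathbb{R}$ for which $\int_T \varphi(c)\, d\mu < \infty$.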
Without the equivalence between conditions (a3) and (a3’), we could not generalize Rényi divergence in the manner we propose. In fact, φ-functions could be defined directly in terms of (a3’), without mentioning (a3). We chose to begin with (a3) because this condition appeared initially in [12].
Not every function $\varphi\colon \mathbb{R} \to (0, \infty)$ satisfying conditions (a1) and (a2) also satisfies condition (a3). An example of a function that fails (a3) is given below.
Example 2. Assume that the underlying measure μ is σ-finite and non-atomic. This is the case of the Lebesgue measure. Let us consider the functionwhich clearly is convex, and satisfies the limits and . Given any measurable function , we will find a measurable function with , for which expression (2) is not satisfied. For each , we definewhere . Because , we can find a sub-sequence such that According to (Lemma 8.3 in [29]) , there exists a sub-sequence and pairwise disjoint sets for which Let us define , where and is any measurable function such that for and . Observing thatwe get On the other hand,which shows that (2) is not satisfied.
3. Generalization of Rényi Divergence
In this section, we provide a generalization of Rényi divergence, which is given in terms of a φ-function. This generalization also depends on a parameter $\alpha \in [-1, 1]$; for $\alpha = \pm 1$, it is defined as a limit. Supposing that the underlying φ-function is continuously differentiable, we show that this limit exists and results in the φ-divergence [12]. In what follows, all probability distributions are assumed to have positive density. In other words, they belong to the collection
$$\mathcal{P}_\mu = \Bigl\{ p \in L^0 : p > 0 \text{ and } \int_T p\, d\mu = 1 \Bigr\}, \tag{3}$$
where $L^0$ is the space of all real-valued, measurable functions on T, with equality μ-a.e. (μ-almost everywhere).
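The Rényi divergence of order $\alpha \in (-1, 1)$ between two probability distributions p and q in $\mathcal{P}_\mu$ is defined as
$$D_R^{(\alpha)}(p \,\|\, q) = \frac{4}{\alpha^2 - 1} \log\Bigl( \int_T p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}\, d\mu \Bigr). \tag{4}$$
For $\alpha = \pm 1$, the Rényi divergence is defined by taking a limit:
$$D_R^{(-1)}(p \,\|\, q) = \lim_{\alpha \downarrow -1} D_R^{(\alpha)}(p \,\|\, q), \tag{5}$$
$$D_R^{(1)}(p \,\|\, q) = \lim_{\alpha \uparrow 1} D_R^{(\alpha)}(p \,\|\, q). \tag{6}$$
Under some conditions, the limits in (5) and (6) are finite-valued, and converge to the Kullback–Leibler divergence. In other words,
$$D_R^{(-1)}(p \,\|\, q) = D_R^{(1)}(q \,\|\, p) = D_{KL}(p \,\|\, q),$$
where $D_{KL}(p \,\|\, q)$ denotes the Kullback–Leibler divergence between p and q, which is given by
$$D_{KL}(p \,\|\, q) = \int_T p \log\Bigl( \frac{p}{q} \Bigr) d\mu.$$
These conditions are stated in Proposition 1, given at the end of this section, for the case involving the generalized Rényi divergence.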
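The limiting behaviour just described can be checked numerically on a finite set T with the counting measure. The sketch below assumes that expression (4) has the form $D_R^{(\alpha)}(p \,\|\, q) = \frac{4}{\alpha^2 - 1} \log \sum_t p_t^{(1-\alpha)/2} q_t^{(1+\alpha)/2}$; the function names and the two distributions are hypothetical choices made only for illustration.

```python
import math


def renyi_alpha(p, q, alpha):
    """Renyi divergence in the symmetric parametrization, for -1 < alpha < 1."""
    s = sum(pi ** ((1 - alpha) / 2) * qi ** ((1 + alpha) / 2)
            for pi, qi in zip(p, q))
    return 4.0 / (alpha ** 2 - 1.0) * math.log(s)


def kl(p, q):
    """Kullback-Leibler divergence between strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Two illustrative distributions on a three-point set.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

eps = 1e-6
near_minus_one = renyi_alpha(p, q, -1 + eps)  # should approach kl(p, q)
near_plus_one = renyi_alpha(p, q, 1 - eps)    # should approach kl(q, p)
```

Under this parametrization, swapping the arguments corresponds to changing the sign of α, so `renyi_alpha(p, q, a)` and `renyi_alpha(q, p, -a)` agree; this is the kind of symmetry between the two limits that the text refers to.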
The Rényi divergence in its standard form is given by
Expression (
4) is related to this form by
Beyond the change of variables, which results in
α ranging in
, expressions (
4) and (
7) differ by the factor
. We opted to insert the term
so that some kind of symmetry could be maintained when the limits
and
are considered. In addition, the geometry induced by the version (
4) conforms with Amari’s notation [
5].
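The Rényi divergence $D_R^{(\alpha)}(\cdot \,\|\, \cdot)$ can be defined for every $\alpha \in \mathbb{R}$. However, for $\alpha \notin [-1, 1]$, the expression (4) may not be finite-valued for every p and q in $\mathcal{P}_\mu$. To avoid some technicalities, we just consider $\alpha \in [-1, 1]$.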
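Given p and q in $\mathcal{P}_\mu$, let us define
$$\kappa(\alpha) = -\log \int_T p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}\, d\mu, \quad \text{for each } \alpha \in [-1, 1],$$
which can be used to express the Rényi divergence as
$$D_R^{(\alpha)}(p \,\|\, q) = \frac{4}{1 - \alpha^2}\, \kappa(\alpha).$$
The function $\kappa(\alpha)$, which depends on p and q, can be defined as the unique non-negative real number for which
$$\int_T \exp\Bigl( \frac{1-\alpha}{2} \ln(p) + \frac{1+\alpha}{2} \ln(q) + \kappa(\alpha) \Bigr) d\mu = 1. \tag{8}$$
The function $\kappa(\alpha)$ plays the role of a normalizing term. The generalization of Rényi divergence that we propose is based on the interpretation of $\kappa(\alpha)$ given in (8). Instead of the exponential function, we consider a φ-function in (8).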
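Fix any φ-function $\varphi\colon \mathbb{R} \to (0, \infty)$. Given any p and q in $\mathcal{P}_\mu$, we take $\kappa(\alpha) = \kappa(\alpha; p, q) \geq 0$ so that
$$\int_T \varphi\Bigl( \frac{1-\alpha}{2} \varphi^{-1}(p) + \frac{1+\alpha}{2} \varphi^{-1}(q) + \kappa(\alpha) u_0 \Bigr) d\mu = 1, \tag{9}$$
or, in other words, the term inside the integral is a probability distribution in $\mathcal{P}_\mu$. The existence and uniqueness of $\kappa(\alpha)$ as defined in (9) is guaranteed by condition (a3’).
We define a generalization of the Rényi divergence of order $\alpha \in (-1, 1)$ as
$$D_\varphi^{(\alpha)}(p \,\|\, q) = \frac{4}{1 - \alpha^2}\, \kappa(\alpha). \tag{10}$$
For $\alpha = \pm 1$, this generalization is defined as a limit:
$$D_\varphi^{(-1)}(p \,\|\, q) = \lim_{\alpha \downarrow -1} D_\varphi^{(\alpha)}(p \,\|\, q), \tag{11}$$
$$D_\varphi^{(1)}(p \,\|\, q) = \lim_{\alpha \uparrow 1} D_\varphi^{(\alpha)}(p \,\|\, q). \tag{12}$$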
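On a finite set T with the counting measure, the normalizing term $\kappa(\alpha)$ can be computed by a one-dimensional root search, because the integral in (9) is increasing in $\kappa(\alpha)$ and, by convexity of φ, does not exceed 1 at $\kappa(\alpha) = 0$. The following sketch uses bisection; it assumes the normalizing condition $\sum_t \varphi\bigl(\tfrac{1-\alpha}{2}\varphi^{-1}(p_t) + \tfrac{1+\alpha}{2}\varphi^{-1}(q_t) + \kappa(\alpha) u_0(t)\bigr) = 1$ and the scaling $D_\varphi^{(\alpha)} = \frac{4}{1-\alpha^2}\kappa(\alpha)$. The names `normalizer` and `generalized_renyi`, and the default choice $u_0 \equiv 1$, are hypothetical.

```python
import math


def exp_kappa(u, kappa):
    """Kaniadakis kappa-exponential (kappa != 0), used here as a phi-function."""
    return (kappa * u + math.sqrt(1.0 + (kappa * u) ** 2)) ** (1.0 / kappa)


def ln_kappa(x, kappa):
    """Inverse of exp_kappa on (0, inf)."""
    return (x ** kappa - x ** (-kappa)) / (2.0 * kappa)


def normalizer(c, u0, phi, tol=1e-13):
    """Solve sum_t phi(c[t] + k * u0[t]) = 1 for k >= 0 by bisection."""
    def total(k):
        return sum(phi(ci + k * ui) for ci, ui in zip(c, u0))

    lo, hi = 0.0, 1.0
    while total(hi) < 1.0:      # enlarge the bracket until the sum reaches 1
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def generalized_renyi(p, q, alpha, phi, phi_inv, u0=None):
    """Generalized Renyi divergence of order alpha in (-1, 1) on a finite set."""
    if u0 is None:
        u0 = [1.0] * len(p)     # assumption: u0 is the constant function 1
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    c = [a * phi_inv(pi) + b * phi_inv(qi) for pi, qi in zip(p, q)]
    return 4.0 / (1.0 - alpha ** 2) * normalizer(c, u0, phi)
```

With `phi=math.exp` and `phi_inv=math.log` the routine reproduces the classical Rényi divergence in the parametrization above; passing `lambda u: exp_kappa(u, 0.5)` and `lambda x: ln_kappa(x, 0.5)` gives a deformed version.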
The cases $\alpha = \pm 1$ are related to a generalization of the Kullback–Leibler divergence, the so-called φ-divergence, which was introduced by the authors in [12]. The
φ-divergence is given by (It was pointed out to us by an anonymous referee that this form of divergence is a special case of the
-divergence for
and
(see Section 3.5 in [
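19]) apart from a conformal factor, which is the denominator of (13)):
$$D_\varphi(p \,\|\, q) = \frac{\displaystyle \int_T \frac{\varphi^{-1}(p) - \varphi^{-1}(q)}{(\varphi^{-1})'(p)}\, d\mu}{\displaystyle \int_T \frac{u_0}{(\varphi^{-1})'(p)}\, d\mu}. \tag{13}$$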
Under some conditions, the limit in (11) or (12) is finite-valued and converges to the φ-divergence:
$$D_\varphi^{(-1)}(p \,\|\, q) = D_\varphi^{(1)}(q \,\|\, p) = D_\varphi(p \,\|\, q). \tag{14}$$
To show (14), we make use of the following result.
Lemma 2. Assume that is continuously differentiable. If for , the expressionis satisfied for all , then the derivative of exists at any , and is given bywhere . Proof. For
and
, define
The function
is defined implicitly by
. If we show that
- (i)
the function is continuous in a neighborhood of ,
- (ii)
the partial derivatives and exist and are continuous at ,
- (iii)
and ,
then by the Implicit Function Theorem
is differentiable at
, and
We begin by verifying that
is continuous. For fixed
and
, set
. Denoting
, we can write
for every
and
. Because the function on the right-hand side of (
18) is integrable, we can apply the Dominated Convergence Theorem to conclude that
Now, we will show that the derivative of
with respect to
α exists and is continuous. Consider the difference
where
. Represent by
the function inside the integral sign in (
19). For fixed
and
, denote
,
, and
. Because
is convex and increasing, it follows that
where
. Observing that
f is integrable, we can use the Dominated Convergence Theorem to get
and then
For
and
, the function inside the integral sign in (
20) is dominated by
f. As a result, a second use of the Dominated Convergence Theorem shows that
is continuous at
:
Using similar arguments, one can show that
exists and is continuous at any
and
, and is given by
Clearly, expression (
21) implies that
for all
and
.
We proved that items (i)–(iii) are satisfied. As a consequence, the derivative of
exists at any
. Expression (
16) for the derivative of
follows from (
17), (
20) and (
21). ☐
As an immediate consequence of Lemma 2, we get the proposition below.
Proposition 1. Assume that is continuously differentiable.- (a)
If, for some , expression (15) is satisfied for all , then - (b)
If, for some , expression (15) is satisfied for all , then
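The convergence asserted in Proposition 1 can be illustrated numerically on a finite set T with the counting measure. The sketch below takes the κ-exponential with $\kappa = 0.5$ as φ and $u_0 \equiv 1$ (both hypothetical choices), and assumes that the φ-divergence is $D_\varphi(p \,\|\, q) = \sum_t \bigl(\varphi^{-1}(p_t) - \varphi^{-1}(q_t)\bigr)\varphi'(\varphi^{-1}(p_t)) \big/ \sum_t u_0(t)\, \varphi'(\varphi^{-1}(p_t))$, which uses $1/(\varphi^{-1})'(p) = \varphi'(\varphi^{-1}(p))$.

```python
import math

KAPPA = 0.5  # illustrative deformation parameter


def phi(u):
    """kappa-exponential, playing the role of the phi-function."""
    return (KAPPA * u + math.sqrt(1.0 + (KAPPA * u) ** 2)) ** (1.0 / KAPPA)


def phi_prime(u):
    """Derivative of the kappa-exponential."""
    return phi(u) / math.sqrt(1.0 + (KAPPA * u) ** 2)


def phi_inv(x):
    """kappa-logarithm, the inverse of phi."""
    return (x ** KAPPA - x ** (-KAPPA)) / (2.0 * KAPPA)


def normalizer(c, tol=1e-14):
    """Bisection for k >= 0 with sum_t phi(c[t] + k) = 1 (u0 = 1)."""
    lo, hi = 0.0, 1.0
    while sum(phi(ci + hi) for ci in c) < 1.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(phi(ci + mid) for ci in c) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def generalized_renyi(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    c = [a * phi_inv(pi) + b * phi_inv(qi) for pi, qi in zip(p, q)]
    return 4.0 / (1.0 - alpha ** 2) * normalizer(c)


def phi_divergence(p, q):
    fp = [phi_inv(x) for x in p]
    fq = [phi_inv(x) for x in q]
    num = sum((s - t) * phi_prime(s) for s, t in zip(fp, fq))
    den = sum(phi_prime(s) for s in fp)  # u0 = 1
    return num / den


p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
gap_minus = abs(generalized_renyi(p, q, -1 + 1e-4) - phi_divergence(p, q))
gap_plus = abs(generalized_renyi(p, q, 1 - 1e-4) - phi_divergence(q, p))
```

Both gaps are expected to be small and to shrink as α moves closer to −1 or 1, in line with items (a) and (b).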
4. Generalized Statistical Manifolds
Statistical manifolds consist of a collection of probability distributions endowed with a metric and
α-connections, which are defined in terms of the derivative of
. In a generalized statistical manifold, the metric and connection are defined in terms of
. Instead of the logarithm, we consider the inverse
of a
φ-function. Generalized statistical manifolds were introduced by the authors in [
17,
18]. Among examples of the generalized statistical manifold, (parametric)
φ-families of probability distributions are of greatest importance. The non-parametric counterpart was investigated in [
11,
12]. The metric in
φ-families can be defined as the Hessian of a function; i.e.,
φ-families are Hessian manifolds [
30]. In [
17,
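18], the φ-divergence gives rise to a pair of dual connections $D^{(-1)}$ and $D^{(1)}$; then, for $\alpha \in (-1, 1)$, the α-connection $D^{(\alpha)}$ is defined as the convex combination $D^{(\alpha)} = \frac{1-\alpha}{2} D^{(-1)} + \frac{1+\alpha}{2} D^{(1)}$. In the present paper, we show that the connection induced by $D_\varphi^{(\alpha)}(\cdot \,\|\, \cdot)$, the generalization of Rényi divergence, corresponds to $D^{(\alpha)}$.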
4.1. Definitions
Let
be a
φ-function. A
generalized statistical manifold is a collection of probability distributions
, indexed by parameters
in a one-to-one relation, such that
- (m1)
Θ is a domain (open and connected set) in ;
- (m2)
is differentiable with respect to θ;
- (m3)
the matrix
defined by
is positive definite at each
, where
- (m4)
the operations of integration with respect to μ and differentiation with respect to commute in all calculations found below, which are related to the metric and connections.
The matrix equips with a metric. By the chain rule, the tensor related to is invariant under change of coordinates. The (classical) statistical manifold is a particular case in which and .
We introduce a notation similar to Equation (
23) that involves higher order derivatives of
. For each
, we define
We also use
,
and
to denote
for
, respectively. The notation (
24) appears in expressions related to the metric and connections.
Using property (m4), we can find an alternate expression for
as well as an identification involving tangent spaces. The matrix
can be equivalently defined by
As a consequence of this equivalence, the tangent space
can be identified with
, the vector space spanned by
, and endowed with the inner product
. The mapping
defines an isometry between
and
.
To verify (
25), we differentiate
, with respect to
, to get
Now, differentiating with respect to
, we obtain
and then (
25) follows. In view of (
26), we notice that every vector
belonging to
satisfies
.
The metric
gives rise to a Levi–Civita connection ∇ (i.e., a torsion-free, metric connection), whose corresponding Christoffel symbols
are given by
Using expression (
25) to calculate the derivatives in (
27), we can express
As we will show later, the Levi–Civita connection ∇ corresponds to the connection derived from the divergence with .
4.2. φ-Families
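Let $c\colon T \to \mathbb{R}$ be a measurable function for which $p = \varphi(c)$ is a probability density in $\mathcal{P}_\mu$. Fix measurable functions $u_1, \dots, u_n\colon T \to \mathbb{R}$. A (parametric) φ-family $\mathcal{F}_c = \{ p_\theta : \theta \in \Theta \}$, centered at $p = \varphi(c)$, is a set of probability distributions in $\mathcal{P}_\mu$, whose members can be written in the form
$$p_\theta = \varphi\Bigl( c + \sum_{i=1}^n \theta^i u_i - \psi(\theta) u_0 \Bigr), \quad \text{for each } \theta = (\theta^i) \in \Theta, \tag{28}$$
where $\psi\colon \Theta \to [0, \infty)$ is a normalizing function, which is introduced so that expression (28) defines a probability distribution belonging to $\mathcal{P}_\mu$.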
The functions
are not arbitrary. They are chosen to satisfy the following assumptions:
- (i)
are linearly independent,
- (ii)
, and
- (iii)
there exists such that , for all .
Moreover, the domain
is defined as the set of all vectors
for which
Condition (i) implies that the mapping defined by (
28) is one-to-one. Assumption (ii) makes ψ a non-negative function. Indeed, by the convexity of
, along with (ii), we can write
which implies
. By condition (iii), the domain Θ is an open neighborhood of the origin. If the set
T is finite, condition (iii) is always satisfied. One can show that the domain Θ is open and convex. Moreover, the normalizing function
ψ is also convex (or strictly convex if
is strictly convex). Conditions (ii) and (iii) also appear in the definition of non-parametric
φ-families. For further details, we refer to [
11,
12].
In a
φ-family
, the matrix
given by (
22) or (
25) can be expressed as the Hessian of
ψ. If
is strictly convex, then
is positive definite. From
it follows that
.
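In the classical case $\varphi = \exp$ with $u_0 \equiv 1$, a φ-family is an exponential family, ψ is its log-partition function, and the statement that the metric is the Hessian of ψ becomes the familiar identity between the Fisher information and $\psi''$. The sketch below checks this for a one-parameter family on a three-point set; the reference distribution `p0`, the statistic `u` and the function names are hypothetical, and the Hessian is approximated by a central finite difference. Centering of `u` is not needed for this particular check.

```python
import math

p0 = [0.2, 0.5, 0.3]   # hypothetical reference distribution p = exp(c)
u = [-1.0, 0.0, 2.0]   # hypothetical statistic u_1


def psi(theta):
    """Normalizing function of the family p_theta = p0 * exp(theta*u - psi)."""
    return math.log(sum(pi * math.exp(theta * ui) for pi, ui in zip(p0, u)))


def density(theta):
    z = psi(theta)
    return [pi * math.exp(theta * ui - z) for pi, ui in zip(p0, u)]


def fisher_metric(theta):
    """Variance of u under p_theta, i.e. the classical Fisher information."""
    p = density(theta)
    mean = sum(pi * ui for pi, ui in zip(p, u))
    return sum(pi * (ui - mean) ** 2 for pi, ui in zip(p, u))


def hessian_psi(theta, h=1e-4):
    """Central finite-difference approximation of the second derivative of psi."""
    return (psi(theta + h) - 2.0 * psi(theta) + psi(theta - h)) / h ** 2
```

Comparing `fisher_metric(0.3)` with `hessian_psi(0.3)` gives agreement up to the finite-difference error.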
The next two results show how the generalization of Rényi divergence and the φ-divergence are related to the normalizing function in φ-families.
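Proposition 2. In a φ-family $\mathcal{F}_c$, the generalization of Rényi divergence for $\alpha \in (-1, 1)$ can be expressed in terms of the normalizing function ψ as follows:
$$D_\varphi^{(\alpha)}(p_\theta \,\|\, p_\vartheta) = \frac{4}{1 - \alpha^2} \Bigl[ \frac{1-\alpha}{2} \psi(\theta) + \frac{1+\alpha}{2} \psi(\vartheta) - \psi\Bigl( \frac{1-\alpha}{2} \theta + \frac{1+\alpha}{2} \vartheta \Bigr) \Bigr], \tag{29}$$
for all $\theta, \vartheta \in \Theta$.
Proof. Recall the definition of $\kappa(\alpha)$ as the real number for which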
Using expression (
28) for probability distributions in
, we can write
The last equality is a consequence of the domain Θ being convex. Thus, it follows that
By the definition of
, we get (
29). ☐
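For $\varphi = \exp$ and $u_0 \equiv 1$, Proposition 2 can be verified directly on a finite set: the Rényi divergence between two members of an exponential family must equal the scaled "convexity gap" of ψ. The sketch below assumes that (29) reads $D^{(\alpha)}(p_\theta \,\|\, p_\vartheta) = \frac{4}{1-\alpha^2}\bigl[\tfrac{1-\alpha}{2}\psi(\theta) + \tfrac{1+\alpha}{2}\psi(\vartheta) - \psi(\tfrac{1-\alpha}{2}\theta + \tfrac{1+\alpha}{2}\vartheta)\bigr]$; all names and numerical values are hypothetical.

```python
import math

p0 = [0.2, 0.5, 0.3]                       # hypothetical center p = exp(c)
u_raw = [-1.0, 0.0, 2.0]
mean0 = sum(pi * ui for pi, ui in zip(p0, u_raw))
u = [ui - mean0 for ui in u_raw]           # centering enforces assumption (ii) for phi = exp


def psi(theta):
    return math.log(sum(pi * math.exp(theta * ui) for pi, ui in zip(p0, u)))


def density(theta):
    z = psi(theta)
    return [pi * math.exp(theta * ui - z) for pi, ui in zip(p0, u)]


def renyi_alpha(p, q, alpha):
    """Renyi divergence in the symmetric parametrization, -1 < alpha < 1."""
    s = sum(pi ** ((1 - alpha) / 2) * qi ** ((1 + alpha) / 2)
            for pi, qi in zip(p, q))
    return 4.0 / (alpha ** 2 - 1.0) * math.log(s)


def renyi_via_psi(theta, vartheta, alpha):
    """Right-hand side of the assumed form of (29)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    gap = a * psi(theta) + b * psi(vartheta) - psi(a * theta + b * vartheta)
    return 4.0 / (1.0 - alpha ** 2) * gap


direct = renyi_alpha(density(0.4), density(-0.7), 0.3)
from_psi = renyi_via_psi(0.4, -0.7, 0.3)
```

The two values `direct` and `from_psi` agree up to rounding, and the non-negativity of the divergence is just the convexity of ψ.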
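Proposition 3. In a φ-family $\mathcal{F}_c$, the φ-divergence is related to the normalizing function ψ by the equality
$$D_\varphi(p_\theta \,\|\, p_\vartheta) = \psi(\vartheta) - \psi(\theta) - \nabla\psi(\theta) \cdot (\vartheta - \theta), \tag{30}$$
for all $\theta, \vartheta \in \Theta$.
Proof. To show (30), we use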
which is a consequence of (Lemma 10 in [
12]). In view of
, expression (
13) with
and
results in
Inserting into (
31) the difference
we get expression (
30). ☐
In Proposition 2, the expression on the right-hand side of Equation (29) defines a divergence in its own right, which was investigated by Jun Zhang in [19]. Proposition 3 asserts that the φ-divergence $D_\varphi(p_\theta \,\|\, p_\vartheta)$ coincides with the Bregman divergence [31,32] associated with the normalizing function ψ for points ϑ and θ in Θ. Because ψ is convex and attains a minimum at $\theta = 0$, it follows that $\nabla\psi(\theta) = 0$ at $\theta = 0$. As a result, equality (30) reduces to $D_\varphi(p \,\|\, p_\theta) = \psi(\theta)$.
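The Bregman form of Proposition 3 can again be checked in the classical case $\varphi = \exp$, $u_0 \equiv 1$, where the φ-divergence is the Kullback–Leibler divergence. The sketch assumes that (30) reads $D_\varphi(p_\theta \,\|\, p_\vartheta) = \psi(\vartheta) - \psi(\theta) - \psi'(\theta)(\vartheta - \theta)$ for a one-parameter family; the distribution `p0`, the statistic `u` and the function names are hypothetical.

```python
import math

p0 = [0.2, 0.5, 0.3]                       # hypothetical center p = exp(c)
u_raw = [-1.0, 0.0, 2.0]
mean0 = sum(pi * ui for pi, ui in zip(p0, u_raw))
u = [ui - mean0 for ui in u_raw]           # centered, so psi'(0) = 0


def psi(theta):
    return math.log(sum(pi * math.exp(theta * ui) for pi, ui in zip(p0, u)))


def density(theta):
    z = psi(theta)
    return [pi * math.exp(theta * ui - z) for pi, ui in zip(p0, u)]


def psi_prime(theta):
    """Derivative of psi: the mean of u under p_theta."""
    return sum(pi * ui for pi, ui in zip(density(theta), u))


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def bregman(theta, vartheta):
    """Bregman divergence of psi, as in the assumed form of (30)."""
    return psi(vartheta) - psi(theta) - psi_prime(theta) * (vartheta - theta)


lhs = kl(density(0.4), density(-0.7))
rhs = bregman(0.4, -0.7)
at_center = kl(p0, density(0.9))           # expected to equal psi(0.9)
```

Because `u` is centered, `psi_prime(0.0)` vanishes and the divergence from the center of the family to $p_\theta$ reduces to `psi(theta)`, matching the closing remark of the paragraph above.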
4.3. Geometry Induced by
In this section, we assume that
is continuously differentiable and strictly convex. The latter assumption guarantees that
The generalized Rényi divergence induces a metric
in generalized statistical manifolds
. This metric is given by
To show that this expression defines a metric, we have to verify that is invariant under change of coordinates, and is positive definite. The first claim follows from the chain rule. The positive definiteness of is a consequence of Proposition 4, which is given below.
Proposition 4. The metric induced by coincides with the metric given by (22) or (25). Proof. Fix any
. Applying the operator
to
where
, we obtain
which results in
By the standard differentiation rules, we can write
Noticing that
for
, the second term on the right-hand side of Equation (
34) vanishes, and then
If we use the notation introduced in (
24), we can write
It remains to show the case
. Comparing (
13) and (
23), we can write
We use the equivalent expressions
which follows from condition (
32), to infer that
Because
, we conclude that the metric defined by (
22) coincides with the metric induced by
and
. ☐
In generalized statistical manifolds, the generalized Rényi divergence
induces a connection
, whose Christoffel symbols
are given by
Because
, it follows that
and
are mutually dual for any
. In other words,
and
satisfy the relation
. A development involving expression (
35) results in
and
For , the Christoffel symbols can be written as a convex combination of and , as asserted in the next result.
Proposition 5. The Christoffel symbols induced by the divergence satisfy the relation Proof. For
, equality (
39) follows trivially. Thus, we assume
. By (
34), we can write
Applying
to the first term on the right-hand side of (
40), and then equating
, we obtain
Similarly, if we apply
to the second term on the right-hand side of (
40), and make
, we get
Collecting (
41) and (
42), we can write
where we used
Expression (
39) follows from (
37), (
38) and (
43). ☐
5. Conclusions
In [17,18], the authors introduced a pair of dual connections $D^{(-1)}$ and $D^{(1)}$ induced by the φ-divergence. The main motivation of the present work was to find a (non-trivial) family of α-divergences, whose induced α-connections are convex combinations of $D^{(-1)}$ and $D^{(1)}$. As a result of our efforts, we proposed a generalization of Rényi divergence. The connection $D^{(\alpha)}$ induced by the generalization of Rényi divergence satisfies the relation $D^{(\alpha)} = \frac{1-\alpha}{2} D^{(-1)} + \frac{1+\alpha}{2} D^{(1)}$. To generalize Rényi divergence, we made use of properties of φ-functions. This makes evident the importance of φ-functions in the geometry of non-standard models. In standard statistical manifolds, even though Amari’s α-divergence and Rényi divergence (with $\alpha \in [-1, 1]$) do not coincide, they induce the same family of α-connections. This striking result requires further investigation. Future work should focus on how the generalization of Rényi divergence is related to Zhang’s
-divergence, and also how the present proposal is related to the model presented in [
33].