Nonparametric Information Geometry: From Divergence Function to Referential-Representational Biduality on Statistical Manifolds

Department of Psychology and Department of Mathematics, University of Michigan, 530 Church Street, Ann Arbor, MI 48109, USA
Entropy 2013, 15(12), 5384-5418; https://doi.org/10.3390/e15125384
Submission received: 3 July 2013 / Revised: 11 October 2013 / Accepted: 22 October 2013 / Published: 4 December 2013

Abstract:
Divergence functions are the non-symmetric “distance” on the manifold, M θ , of parametric probability density functions over a measure space, ( X , μ ) . Classical information geometry prescribes, on M θ : (i) a Riemannian metric given by the Fisher information; (ii) a pair of dual connections (giving rise to the family of α-connections) that preserve the metric under parallel transport by their joint actions; and (iii) a family of divergence functions (α-divergence) defined on M θ × M θ , which induce the metric and the dual connections. Here, we construct an extension of this differential geometric structure from M θ (that of parametric probability density functions) to the manifold, M , of non-parametric functions on X, removing the positivity and normalization constraints. The generalized Fisher information and α-connections on M are induced by an α-parameterized family of divergence functions, reflecting the fundamental convex inequality associated with any smooth and strictly convex function. The infinite-dimensional manifold, M , has zero curvature for all these α-connections; hence, the generally non-zero curvature of M θ can be interpreted as arising from an embedding of M θ into M . Furthermore, when a parametric model (after a monotonic scaling) forms an affine submanifold, its natural and expectation parameters form biorthogonal coordinates, and such a submanifold is dually flat for α = ± 1 , generalizing the results of Amari’s α-embedding. The present analysis illuminates two different types of duality in information geometry, one concerning the referential status of a point (measurable function) expressed in the divergence function (“referential duality”) and the other concerning its representation under an arbitrary monotone scaling (“representational duality”).

1. Introduction

Information geometry is a differential geometric study of the manifold of probability measures or probability density functions [1]. Its role in understanding asymptotic inference was summarized in [2,3,4,5,6,7]. Information geometric methods have been applied to many areas of interest to statisticians, such as the study of estimating functions (e.g., [8,9]) and invariant priors for Bayesian inference (e.g., [10,11]), and to the machine learning community, such as the natural gradient descent method [12,13], support vector machine [14], boosting [15], turbo decoding [16], etc.
The differential geometric structure of statistical models with finite parameters is now well understood. Consider a family of probability functions (i.e., probability measures on discrete support or probability density functions on continuous support) parameterized by θ = [θ^1, …, θ^n]. The collection of such probability functions, where each function is indexed by a point, θ ∈ R^n, forms a manifold, M_θ, under suitable conditions. Rao [17] identified Fisher information as the Riemannian metric for M_θ. Efron [18], through investigating a one-parameter family of statistical models, elucidated the meaning of curvature for asymptotic statistical inference and pointed out its flatness for the exponential model. In his reaction to Efron's work, Dawid [19] invoked the differential geometric notion of linear connections on a manifold as preserving parallelism during vector transportation and pointed out other possible constructions of linear connections on M_θ, in addition to the non-flat Levi-Civita connection associated with the Fisher metric. Amari [2,20], in his path-breaking work, systematically advanced the theory of information geometry by constructing a parametric family of α-connections, Γ^(α), α ∈ R, along with a dualistic interpretation of α ↔ −α as conjugate connections on the manifold, M_θ. The e-connection (α = 1) vanishes (i.e., becomes identically zero) on the manifold of the exponential family of probability functions under natural parameters, whereas the m-connection (α = −1) vanishes on the manifold of the mixture family of probability functions under mixture parameters. Therefore, not only do Γ^(±1) have zero curvature for both the exponential and mixture families, but affine coordinates were found that render Γ^(1) and Γ^(−1) themselves identically zero for the exponential and mixture families, respectively.
This classic information geometry dealing with parametric statistical models has been investigated in the non-parametric setting using the tools of infinite-dimensional analysis [21,22,23], with non-parametric Fisher information given by [23]. This is made possible, because topological issues were resolved by the pioneering work of [24] using the theory of Orlicz space for charting the exponential statistical manifold. Zhang and Hasto [25] characterized the probability manifold modeled on an ambient affine space via functional equations and generalized exponential charts. The goal of the present paper is to extend these non-parametric results by showing links among three inter-connected mathematical topics that underlie information geometry, namely: (i) divergence functions measuring the non-symmetric distance of any two points (density or measurable functions) on the manifold (the referential duality); (ii) convex analysis and the associated Legendre–Fenchel transformation linking the natural and expectation parameters of parametric models (the representational duality); and (iii) the resulting dual Riemannian structure involving the Fisher metric and the family of α-connections. Results in the parametric setting were summarized in [26].
The Riemannian manifold of parametric statistical models is a special kind, one that involves dual (also known as conjugate) connections; historically, such a mathematical theory was independently developed to investigate hypersurface immersion (see [27,28]). Lauritzen [29] characterized the general differential geometric context under which a one-parameter family of α-connections arises, as well as the meaning of conjugacy for a pair of connections on statistical manifolds [30]. Kurose [31,32] and then Matsuzoe [33,34] elucidated information geometry from an affine differential geometric perspective. See also [35] for a generalized notion of conjugate connections. It was Eguchi [36,37,38] who provided a generic way for inducing a metric and a pair of conjugate connections from an arbitrary divergence (what he called "contrast") function. The current exposition will build on this "Eguchi relation" between the metric and conjugate connections of the Riemannian manifold, M_θ, and the divergence function defined on M_θ × M_θ.
The main results of this paper include the introduction of an α-parametric family of divergence functionals on measurable functions (including probability functions) using any smooth and strictly convex function, and the induction, from such divergences, of a metric and a family of conjugate connections that resemble, but generalize, the Fisher information proper and the α-connections proper. In particular, we derive explicit expressions of the metric and conjugate connections on the infinite-dimensional manifold of all functions defined on the same support of the sample space. When finite-dimensional affine embedding is allowed, our formulae reduce to the familiar ones associated with the exponential family established in [2]. We carefully delineate two senses of duality associated with such manifolds, one related to the reference/comparison status of any pair of points (functions) and the other related to properly scaled representations of them.

1.1. Parametric Information Geometry Revisited

Here, we briefly summarize the well-known results of parametric information geometry in the classical (as opposed to quantum) sense. The motivation is two-fold. First, by reviewing the basic parametric results, we want to make sure that any generalization of the framework of information geometry will reduce to those formulae under appropriate conditions. Secondly, understanding how a divergence function is related to the dual Riemannian structure will enable us to approach the infinite-dimensional case by analogy, that is, through constructing more general classes of divergence functionals defined on function spaces.

1.1.1. Riemannian Manifold, Fisher Metric and α-Connections

Let (X, μ) be a measure space with σ-algebra built upon the atoms, dζ, of X. Let M_μ denote the space of probability density functions, p : X → R₊ (with R₊ = R⁺ ∪ {0}), defined on the sample space, X, with background measure dμ = μ(dζ):
$$\mathcal{M}_\mu = \left\{\, p(\zeta) : E_\mu\{p(\zeta)\} = 1;\; p(\zeta) > 0,\ \forall \zeta \in \mathcal{X} \,\right\}$$
Here, and throughout this paper, E_μ{·} = ∫_X {·} dμ denotes the expectation of a measurable function (in curly brackets) with respect to the background measure, μ. We also denote E_p{·} = ∫_X {·} p dμ.
A parametric family of density functions, p(·|θ), called a parametric statistical model, is the association of a density function, θ ↦ p(·|θ), to each n-dimensional vector, θ = [θ^1, …, θ^n]. The space of parametric statistical models forms a Riemannian manifold (where θ is treated as the local chart):
$$\mathcal{M}_\theta = \{\, p(\zeta|\theta) \in \mathcal{M}_\mu : \theta \in \Theta \subseteq \mathbf{R}^n \,\} \subset \mathcal{M}_\mu$$
with the so-called Fisher metric [17]:
$$g_{ij}(\theta) = E_\mu\left\{ p(\zeta|\theta)\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^i}\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^j} \right\}$$
and α-connections [20,39]:
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu\left\{ p(\zeta|\theta) \left( \frac{1-\alpha}{2}\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^i}\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^j} + \frac{\partial^2 \log p(\zeta|\theta)}{\partial \theta^i \partial \theta^j} \right) \frac{\partial \log p(\zeta|\theta)}{\partial \theta^k} \right\}$$
with the α-connections satisfying the dualistic relation:
$$\Gamma^{*(\alpha)}_{ij,k}(\theta) = \Gamma^{(-\alpha)}_{ij,k}(\theta)$$
Here, * denotes the conjugate (dual) connection. Recall that, in general, a metric is a bilinear map on the tangent space, and an affine connection is used to define parallel transport of vectors. The conjugacy in a pair of connections, Γ ↔ Γ*, is defined by their jointly preserving the metric when each acts on one of the two tangent vectors; that is, when the tangent vectors undergo parallel transport according to Γ or Γ*, respectively. Equivalently, and perhaps more fundamentally, the pair of conjugate connections preserves the dual pairing of vectors in the tangent space with co-vectors in the cotangent space [30]. Any Riemannian manifold with its metric, g, and conjugate connections, Γ, Γ*, given in the form of Equations (3)–(5) is called a statistical manifold (in the narrower sense) and is denoted as {M_θ, g, Γ^(±α)}. In the broader sense, a statistical manifold {M, g, Γ, Γ*} is a differentiable manifold equipped with a Riemannian metric, g, and a pair of torsion-free conjugate connections, Γ ≡ Γ^(1), Γ* ≡ Γ^(−1), without necessarily requiring g and Γ, Γ* to take the forms of Equations (3)–(5).
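As a concrete illustration of the Fisher metric, Equation (3), the following minimal numeric sketch (an illustrative addition, not code from the paper; plain NumPy assumed) estimates g_ij for a Gaussian family with coordinates θ = (mean, log-std) by quadrature over a grid standing in for (X, μ):

```python
import numpy as np

def log_p(zeta, theta):
    # log density of a Gaussian family parameterized by (mean, log-std)
    mean, log_std = theta
    std = np.exp(log_std)
    return -0.5 * ((zeta - mean) / std) ** 2 - log_std - 0.5 * np.log(2 * np.pi)

def fisher_metric(theta, h=1e-5):
    zeta = np.linspace(-10.0, 10.0, 20001)   # grid approximating the sample space
    dz = zeta[1] - zeta[0]
    p = np.exp(log_p(zeta, theta))
    scores = []                               # d log p / d theta^i via central differences
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = h
        scores.append((log_p(zeta, theta + e) - log_p(zeta, theta - e)) / (2 * h))
    # g_ij = E_mu { p * score_i * score_j }, Riemann-sum approximation of Equation (3)
    return np.array([[np.sum(p * si * sj) * dz for sj in scores] for si in scores])

print(fisher_metric(np.array([0.0, 0.0])))    # approx [[1, 0], [0, 2]]
```

The printed values match the known Fisher information of the Gaussian in these coordinates (1/σ² for the mean, 2 for the log-std), which serves as a sanity check on the quadrature.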

1.1.2. Exponential Family, Mixture Family and Their Generalization

An exponential family of probability density functions is defined as:
$$p^{(e)}(\zeta|\theta) = \exp\left( F_0(\zeta) + \sum_i \theta^i F_i(\zeta) - \Phi(\theta) \right)$$
where θ is its natural parameter, F_i(ζ) (i = 1, …, n) is a set of linearly independent functions with the same support in X, and the cumulant generating function ("potential function"), Φ(θ), is:
$$\Phi(\theta) = \log E_\mu\left\{ \exp\left( F_0(\zeta) + \sum_i \theta^i F_i(\zeta) \right) \right\}$$
Substituting Equation (6) into Equations (3) and (4), the Fisher metric and the α-connections are simply:
$$g_{ij}(\theta) = \frac{\partial^2 \Phi(\theta)}{\partial \theta^i \partial \theta^j}$$
and:
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = \frac{1-\alpha}{2}\, \frac{\partial^3 \Phi(\theta)}{\partial \theta^i \partial \theta^j \partial \theta^k}$$
whereas the Riemann curvature tensor (of an α-connection) is given by ([2], p. 106):
$$R^{(\alpha)}_{ij\mu\nu}(\theta) = \frac{1-\alpha^2}{4} \sum_{l,k} \left( \Phi_{il\nu}\Phi_{jk\mu} - \Phi_{il\mu}\Phi_{jk\nu} \right) \Phi^{lk}$$
where Φ^{ij} = g^{ij} is the matrix inverse of g_{ij} and subscripts of Φ indicate partial derivatives. Therefore, the α-connection for the exponential family is dually flat when α = ±1. In particular, all components of Γ^(1)_{ij,k} vanish, due to Equation (9), on the manifold formed by p^(e)(·|θ), in which the natural parameter, θ, serves as the local coordinates.
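A quick numeric check of Equation (8) (a hedged toy example, not from the paper): for the Bernoulli family in its natural parameter, p(ζ|θ) = exp(θζ − Φ(θ)) with Φ(θ) = log(1 + e^θ), the Fisher information computed directly should equal the Hessian Φ″(θ):

```python
import numpy as np

def phi(theta):                       # cumulant generating function, Equation (7)
    return np.log1p(np.exp(theta))

theta, h = 0.7, 1e-4
hess = (phi(theta + h) - 2 * phi(theta) + phi(theta - h)) / h ** 2

mu = 1.0 / (1.0 + np.exp(-theta))     # expectation parameter, Phi'(theta)
# direct Fisher information: E_p[(zeta - mu)^2] over zeta in {0, 1}
fisher = (1 - mu) * (0 - mu) ** 2 + mu * (1 - mu) ** 2
print(hess, fisher)                   # both approx mu * (1 - mu)
```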
On the other hand, the mixture family:
$$p^{(m)}(\zeta|\theta) = \sum_i \theta^i F_i(\zeta)$$
when viewed as a manifold charted by its mixture parameter, θ, with the constraints Σ_i θ^i = 1 and ∫_X F_i(ζ) dμ = 1, turns out to have identically zero Γ^(−1)_{ij,k}. The connections, Γ^(1) and Γ^(−1), are also called the exponential and mixture connections, or e- and m-connection, respectively. The exponential family and the mixture family are special cases of the α-family [1,2] of density functions, p(ζ|θ), whose denormalization satisfies (with constant κ):
$$l^{(\alpha)}(\kappa p) = F_0(\zeta) + \sum_i \theta^i F_i(\zeta)$$
under the α-embedding function, l^(α) : R⁺ → R, defined as:
$$l^{(\alpha)}(t) = \begin{cases} \log t & \alpha = 1 \\[4pt] \dfrac{2}{1-\alpha}\, t^{(1-\alpha)/2} & \alpha \neq 1 \end{cases}$$
The α-embedding of a probability density function plays an important role in Tsallis statistics; see, e.g., [40]. Under α-embedding, the denormalized density functions form the so-called α-affine manifold [1], p.46. The Fisher metric and α-connections, under such α-representation, have the following expressions:
$$g_{ij}(\theta) = E_\mu\left\{ \frac{\partial\, l^{(\alpha)}(p(\cdot|\theta))}{\partial \theta^i}\, \frac{\partial\, l^{(-\alpha)}(p(\cdot|\theta))}{\partial \theta^j} \right\}$$
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu\left\{ \frac{\partial^2\, l^{(\alpha)}(p(\cdot|\theta))}{\partial \theta^i \partial \theta^j}\, \frac{\partial\, l^{(-\alpha)}(p(\cdot|\theta))}{\partial \theta^k} \right\}$$
Clearly, on an α-affine manifold with any given α value, the components of Γ^(α) are all identically zero by virtue of the definition of the α-family, Equation (12); hence, the (±α)-connections are dually flat.
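For concreteness, here is a small helper (an illustrative sketch in plain NumPy, not from the paper) implementing the α-embedding l^(α) of Equation (13); note that for α → 1 the power branch approaches log t only up to an additive constant, which is why the α = 1 case is defined separately:

```python
import numpy as np

def l_alpha(t, alpha):
    """alpha-embedding l^(alpha) of Equation (13)."""
    t = np.asarray(t, dtype=float)
    if np.isclose(alpha, 1.0):
        return np.log(t)
    return 2.0 / (1.0 - alpha) * t ** ((1.0 - alpha) / 2.0)

# affine-equivalent limit: l^(alpha)(t) - l^(alpha)(1) -> log t as alpha -> 1
t = 2.5
print(l_alpha(t, 0.9999) - l_alpha(1.0, 0.9999), np.log(t))   # nearly equal
```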

1.2. Divergence Function and Induced Statistical Manifold

It is well known that the statistical manifold, {M_θ, g, Γ^(±α)}, with Fisher information as the metric, g, and the (±α)-connections, Γ^(±α), as conjugate connections, can be induced from a parametric family of divergence functions called "α-divergence". Here, we briefly review the link of divergence functions to the dual Riemannian geometry of statistical manifolds.

1.2.1. Kullback-Leibler Divergence, Bregman Divergence and α-Divergence

Divergence functions are distance-like quantities; they measure the directed (non-symmetric) difference of two probability density functions in the infinite-dimensional function space or of two points in a finite-dimensional vector space of the parameters of a statistical model. An example is the Kullback-Leibler divergence (also known as KL cross-entropy) between two probability densities, p, q ∈ M_μ, here expressed in its extended form (i.e., without requiring p and q to be normalized):
$$K(p,q) = \int_{\mathcal{X}} \left( q - p - p \log\frac{q}{p} \right) d\mu = K^*(q,p)$$
with a unique, global minimum of zero when p = q. For the exponential family, Equation (6), the expression (16) takes the form of the so-called Bregman divergence [41], defined on Θ × Θ ⊆ R^n × R^n:
$$B_\Phi(\theta_p, \theta_q) = \Phi(\theta_p) - \Phi(\theta_q) - \langle \theta_p - \theta_q, \nabla\Phi(\theta_q) \rangle$$
where Φ is the potential function (7), ∇ is the gradient operator and ⟨·,·⟩ denotes the standard bilinear form (pairing) of a vector with a co-vector. The Bregman divergence (17) expresses the directed distance between two members, p and q, of the exponential family as indexed, respectively, by the two parameters, θ_p and θ_q.
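The following hedged sketch (Bernoulli family; an illustration we add here, not the paper's code) checks numerically that the KL divergence between two exponential-family members coincides with the Bregman divergence of the potential Φ, with the arguments swapped relative to the order in Equation (17):

```python
import numpy as np

def phi(th):       return np.log1p(np.exp(th))        # potential, Equation (7)
def grad_phi(th):  return 1.0 / (1.0 + np.exp(-th))   # expectation parameter

def bregman(th_a, th_b):                               # Equation (17)
    return phi(th_a) - phi(th_b) - (th_a - th_b) * grad_phi(th_b)

def kl_bernoulli(th_p, th_q):                          # KL(p || q), pulled back to theta
    mp, mq = grad_phi(th_p), grad_phi(th_q)
    return mp * np.log(mp / mq) + (1 - mp) * np.log((1 - mp) / (1 - mq))

th_p, th_q = 0.3, -1.2
print(kl_bernoulli(th_p, th_q), bregman(th_q, th_p))   # agree to machine precision
```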
A generalization of the Kullback-Leibler divergence is the α-divergence, defined as:
$$A^{(\alpha)}(p,q) = \frac{4}{1-\alpha^2}\, E_\mu\left\{ \frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q - p^{(1-\alpha)/2}\, q^{(1+\alpha)/2} \right\}$$
measuring the directed distance between any two density functions, p and q. It is easily seen that:
$$\lim_{\alpha \to -1} A^{(\alpha)}(p,q) = K(p,q) = K^*(q,p)$$
$$\lim_{\alpha \to 1} A^{(\alpha)}(p,q) = K^*(p,q) = K(q,p)$$
Note that traditionally (see [2,20]), the term (1−α)/2 p + (1+α)/2 q is replaced by 1 in the integrand of Equation (18), and the term q − p is absent in the integrand of Equation (16); this is trivially true when p, q are probability densities with a normalization of one. Zhu and Rohwer [42,43], in what they called the δ-divergence, δ = (1−α)/2, supplied these extra terms as the "extended" forms of the α-divergence and of the Kullback-Leibler divergence. The importance of these terms will be seen later (Section 2.2).
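A numeric illustration of the limit in Equation (19) on a small discrete measure space (a sketch under the assumption that p and q need not be normalized, exercising the extended forms):

```python
import numpy as np

def alpha_div(p, q, a):                 # extended alpha-divergence, Equation (18)
    integrand = ((1 - a) / 2 * p + (1 + a) / 2 * q
                 - p ** ((1 - a) / 2) * q ** ((1 + a) / 2))
    return 4.0 / (1 - a ** 2) * integrand.sum()

def extended_kl(p, q):                  # extended KL divergence, Equation (16)
    return (q - p + p * np.log(p / q)).sum()

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.4])           # deliberately not normalized (sums to 1.2)
print(alpha_div(p, q, -0.9999), extended_kl(p, q))   # nearly equal
```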
Note that, strictly speaking, when the underlying space is a finite-dimensional vector space, that is, the space, R^n, for the parameters, θ, of a statistical model, p(·|θ), then the term "divergence function" is appropriate. However, if the underlying space is infinite-dimensional (possibly uncountable), that is, the manifold, M_μ, of non-parametric probability densities, p and q, then the term "divergence functional" seems more appropriate. The latter implicitly defines a divergence function (through pullback) if the probability densities are embedded into a finite-dimensional submanifold, M_θ, in the case of a parametric statistical model, p(·|θ). As an example, for the exponential family, Equation (6), the Kullback-Leibler divergence, Equation (16), in terms of p and q, implicitly defines a divergence in terms of θ_p, θ_q, i.e., the Bregman divergence, Equation (17). In the following, we use the term divergence function when we intend to blur the distinction between whether it is defined on the finite-dimensional vector space or on the infinite-dimensional function space and, in the latter case, whether it is pulled back onto the finite-dimensional submanifold. We will, however, use the term divergence functional when we emphasize the infinite-dimensional setting sans parametric embedding.
In general, a divergence function (also called “contrast function”) is non-negative for all p , q and vanishes only when p = q ; it is assumed to be sufficiently smooth. A divergence function will induce a Riemannian metric, g, in the form of Equation (3) by its second order properties and a pair of conjugate connections, Γ , Γ * , in the forms of Equations (4) and (5) by its third order properties (relations were first formulated by Eguchi [36,37], which we are going to review next).

1.2.2. Induced Dual Riemannian Geometry

Let M be a Riemannian manifold endowed with a metric tensor field, g, whose restriction to any point, p, is a symmetric, positive bilinear form, ⟨·,·⟩, on T_p(M) × T_p(M). Here, T_p(M) denotes the space of all tangent vectors at the point, p ∈ M, and Σ(M) denotes the collection of all vector fields on M. Then:
$$g(u,v) = \langle u, v \rangle$$
with u, v ∈ Σ(M). Let w ∈ Σ(M) be another vector field, and let d_w denote the directional derivative (of a function, vector field, etc.) along the direction corresponding to w (taken at any given point, p, if explicitly written out). An affine connection, ∇, is a map, Σ(M) × Σ(M) → Σ(M), (w, u) ↦ ∇_w u, that is linear in both u and w and, moreover, F(M)-linear in w, but not in u. A pair of connections, ∇, ∇*, are said to be conjugate to each other if:
$$d_w\, g(u,v) = \langle \nabla_w u, v \rangle + \langle u, \nabla^*_w v \rangle$$
or in component form, denoted by Γ , Γ * :
$$\partial_k g_{ij} = \Gamma_{ki,j} + \Gamma^*_{kj,i}$$
The "contravariant" form, Γ^l_{ij}, of the affine connection, defined by:
$$\nabla_{\partial_i} \partial_j = \sum_l \Gamma^l_{ij}\, \partial_l$$
is related to the "covariant" form, Γ_{ij,k}, through:
$$\sum_l g_{lk}\, \Gamma^l_{ij} = \Gamma_{ij,k}$$
The Riemannian metric, g, and conjugate connections, ∇, ∇*, on a statistical manifold can be induced by a divergence function, D : M × M → R₊, which, by definition, satisfies:
(i)
D(p,q) ≥ 0 for all p, q ∈ M, with equality holding iff p = q;
(ii)
$$(d_u)_p\, D(p,q)\big|_{p=q} = (d_v)_q\, D(p,q)\big|_{p=q} = 0;$$
(iii)
$$-(d_u)_p (d_v)_q\, D(p,q)\big|_{p=q} \ \text{is positive definite;}$$
where the subscript, p or q, indicates that the directional derivative is taken with respect to the first or second argument of D(p,q), respectively, along the direction, u or v. Eguchi [36,37] showed that any such divergence function, D, satisfying (i)–(iii) will induce a Riemannian metric, g, and a pair of connections, ∇, ∇*, via:
$$g(u,v) = -(d_u)_p (d_v)_q\, D(p,q)\big|_{p=q}$$
$$\langle \nabla_w u, v \rangle = -(d_w)_p (d_u)_p (d_v)_q\, D(p,q)\big|_{p=q}$$
$$\langle u, \nabla^*_w v \rangle = -(d_w)_q (d_v)_q (d_u)_p\, D(p,q)\big|_{p=q}$$
In index-laden component forms, they are:
$$g_{ij} = -(\partial_i)_p (\partial_j)_q\, D(p,q)\big|_{p=q}$$
$$\Gamma_{ij,k} \equiv \langle \nabla_{\partial_i} \partial_j, \partial_k \rangle = -(\partial_i)_p (\partial_j)_p (\partial_k)_q\, D(p,q)\big|_{p=q}$$
$$\Gamma^*_{ij,k} \equiv \langle \partial_k, \nabla^*_{\partial_i} \partial_j \rangle = -(\partial_i)_q (\partial_j)_q (\partial_k)_p\, D(p,q)\big|_{p=q}$$
Equations (26)–(28) in coordinate-free form, or Equations (29)–(31) in index-laden form, link a divergence function, D, to the Riemannian metric, g, and conjugate connections, ∇, ∇*; henceforth, they will be called the Eguchi relation. It is easily verifiable that they satisfy Equation (22) or Equation (23), respectively. These relations are the stepping stones going from a divergence function, which defines (generally) non-symmetric distances between pairs of points on a manifold in the large, to the dual Riemannian geometric structure on the same manifold in the small. To apply to the infinite-dimensional context, we provide a proof (in Section 4) for the coordinate-free version, Equations (26)–(28). This will allow us to first construct divergence functionals on the infinite-dimensional function space (the Kullback-Leibler divergence being a special example) and then derive explicit expressions for the non-parametric Riemannian metric and conjugate connections by explicating d_u, d_v, d_w.
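To make the Eguchi relation concrete, the sketch below (an illustrative addition, not the paper's code) induces the metric from the KL divergence pulled back to the Bernoulli natural parameter, approximating the mixed second derivative of Equation (29) by finite differences, and compares it with the Fisher information:

```python
import numpy as np

def grad_phi(th): return 1.0 / (1.0 + np.exp(-th))

def kl(th_p, th_q):                       # KL divergence pulled back to theta
    mp, mq = grad_phi(th_p), grad_phi(th_q)
    return mp * np.log(mp / mq) + (1 - mp) * np.log((1 - mp) / (1 - mq))

def induced_metric(th, h=1e-4):
    # g = - d/dth_p d/dth_q D(th_p, th_q) at th_p = th_q = th, Equation (29)
    return -(kl(th + h, th + h) - kl(th + h, th - h)
             - kl(th - h, th + h) + kl(th - h, th - h)) / (4 * h * h)

th = 0.5
mu = grad_phi(th)
print(induced_metric(th), mu * (1 - mu))  # both equal the Fisher information
```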

1.3. Goals and Approach

Our goals in this paper are several-fold. First, we want to provide a unified perspective for the divergence functions encountered in the literature. There are two broad classes, those defined on the infinite-dimensional function space and those defined on the finite-dimensional vector space. The former class includes the one-parameter family of α-divergence (equivalently, the δ-divergence in [42,43]) and the family of Jensen differences related to the Shannon entropy function [44], both specializing to the Kullback-Leibler divergence as a limiting case. The latter class includes the Bregman divergence [41], also called "geometric divergence" [32], which turns out to be identical to the "canonical divergence" [1] on a dually flat manifold expressed in a pair of biorthogonal coordinates; those coordinates are induced by a pair of conjugate convex functions via the Legendre–Fenchel transform [2,20]. The authors of [15] recently investigated an infinite-dimensional version of the Bregman divergence, called the U-divergence. It will be shown in this paper that all of the above-mentioned divergence functions can be understood as convex inequalities associated with some real-valued, strictly convex function defined on R (for the infinite-dimensional case) or R^n (for the finite-dimensional case), with the convex mixture parameter assuming the role of α in the induced α-connection. Note that α ↔ −α in such divergence functions corresponds to an exchange of the two points the divergence functions measure (generally in a non-symmetric fashion), while α ↔ −α in the induced connections corresponds to the conjugacy operation for the pairing of two metric-compatible connections. Hence, our approach to divergence functions from convex analysis will address both of these aspects coherently, and an intimate relation between these two senses of duality is expected to emerge from our formulation (see below).
The second goal of our paper is to provide a more general form for the Fisher metric, Equation (3), and the α-connections, Equation (4) (or equivalently, Equations (14) and (15) under α-embedding), while still staying within the framework of [29] in characterizing statistical manifolds. One specific aim is to derive explicit expressions for the Fisher metric and α-connections for the infinite-dimensional case. In the past, an infinite-dimensional expression for the α-connection, ∇^(α), as a mixture of ∇^(1) and ∇^(−1), has emerged, but was given only implicitly, with its interpretation debated [22,23]. Our approach exploits the coordinate-free version of the Eguchi relations, Equations (26)–(28), directly and derives the Fisher metric and α-connections from the general form of divergence functions mentioned in the last paragraph. The affine connection, ∇^(α), is formulated as the covariant derivative, which is characterized by a bilinear form. Since our divergence functional will be defined on the infinite-dimensional manifold, M, without restricting the underlying functions (individual points on M) to be normalized and positively-valued, the affine connections we derive are expected to have zero Riemann curvature, as those in the ambient space. From this perspective, statistical curvature (the curvature of a statistical manifold) can be viewed as an embedding curvature, that is, curvature arising out of restricting to the submanifold, M_μ, of normalized and positive-valued functions (i.e., the non-parametric statistical manifold), and further to the finite-dimensional submanifold, M_θ (i.e., parametric statistical models).
Our third goal here is to clarify some fundamental issues in information geometry, including the meaning of duality and its relation to submanifold embedding. In its original development starting from [19], the flatness of the e-connection (or m-connection) is with respect to a particular family of density functions, namely, the exponential family (or mixture family). Later, Amari [2,20] generalized this observation to any α-family (i.e., a density function that is, after denormalization, affine under α-embedding): the α-connection is flat (indeed, Γ^(α)_{ij,k} vanishes) for the α-affine manifold (which reduces to the exponential model for α = 1 and the mixture model for α = −1). One may be led to infer that the α parameter in the α-connection and the α parameter in α-embedding are one and the same and, thereby, conclude that ∇^(1)-flatness (or ∇^(−1)-flatness) is exclusively associated with the exponential family expressed in its natural parameter (or the mixture family expressed in its mixture parameter). Here, we point out that these conclusions are unwarranted: the flatness of an α-connection and the embedding of a probability function into an affine submanifold under α-representation are two related, but separate, issues. We will show that the α-connections on the infinite-dimensional ambient manifold, M, which contains the manifold of probability density functions, M_μ, as a submanifold, have zero (ambient) curvature for all α values. For finite-dimensional parametric statistical models, it is known that the α-connection will not, in general, have zero curvature even when α = ±1. Here, we will give precise conditions under which ∇^(±1) will be dually flat—i.e., when the denormalized statistical model can be affinely embedded under any ρ-representation, where a strictly increasing function ρ : R → R generalizes the α-embedding function (13). In such cases, there exists a strictly convex potential function, akin to Equation (7) for the exponential statistical model, that will reduce the Fisher metric and α-connections to the forms of Equations (8) and (9). One may define the natural parameter and expectation parameter that are dual to each other and that form biorthogonal coordinates for the underlying manifold, just as for the exponential family.
Our analysis will clarify two different kinds of duality in information geometry, one related to the different statuses of a reference probability function and a comparison probability function (referential duality), the other related to the representation of each probability function via a pair of conjugate scalings (representational duality). Roughly speaking, the (±1)-duality reflects the former, whereas the e/m-duality reflects the latter. Previously, they were not distinguished; in our analysis, we are able to disambiguate these two senses of duality. For instance, we are able to devise a two-parameter family of divergence functions, where the two parameters play distinct roles in the induced geometry, one capturing referential duality and the other capturing representational duality. Interestingly, this two-parameter family of connections still takes the same form as the α-connection proper (with a single parameter), indicating that this extension is still within [29]'s conceptualization of dual connections in information geometry.
The technical challenge that we have to overcome in our derivations is doing calculus in the infinite-dimensional setting. Consider the set of measurable functions from X to R, which, in the presence of charts modeled on (open) subsets, {E_i}_{i∈I}, of a Banach space, forms a manifold, M, of infinite dimension. Each point on M is a function, p : X → R, over the sample space, X; and each chart, U ⊂ M, is afforded with a bijective map to the Banach space with a suitable norm (e.g., Orlicz space, as adopted by [21,22,23,24,45]). For non-parametric statistical models, [24] provided exponential charts modeled on Orlicz spaces, which were followed by the rest of the above-referenced works. We do not restrict ourselves to probability density functions and work, in general, with measurable functions (without positivity and normalization requirements); we treat probability functions as forming a submanifold in M defined by the positivity and normalization conditions. This approach gives us certain advantages in deriving, from divergence functions directly, the Riemannian geometry on M, whereby M serves as an ambient space to embed a statistical manifold, M_μ, as a submanifold in a standard way (by restricting the tangent vector field of M). The usual interpretation of the affine connection on M_μ as the projection of a natural connection on M is then "borrowed" over from the finite-dimensional setting to this infinite-dimensional setting. Our approach follows that of [46], which treats the infinite-dimensional manifold as a generic C^∞-Banach manifold and uses the theory of canonical spray (and the Morse-Palais Lemma) to construct the Riemannian metric and affine connections on such manifolds. However, we fell short of providing a topology on M as induced from the divergence functions and comparing it with the one endowed by [24]. In particular, the conditions under which M_μ forms a proper submanifold of M remain to be identified. Neither have we addressed topological issues for the well-definedness of conjugate connections on such infinite-dimensional manifolds. We refer the readers to [23], who investigated whether the entire family of α-connections is well-defined for M endowed with the same topology.
The structure of the rest of the paper is as follows. Section 2 will deal with information geometry under the infinite-dimensional setting and Section 3 under the finite-dimensional setting. For ease of presentation, results will be provided in the main text, while their proofs will be deferred to Section 4. Section 5 closes with a discussion of the implications of the current framework. A preliminary report of this work was presented to IGAIA2 (Tokyo) and appeared in [47].

2. Information Geometry on Infinite-Dimensional Function Space

In this section, we first review the basic apparatus of the differentiable manifold with particular emphasis paid to the infinite-dimensional (non-parametric) setting (Section 2.1). We then define a family of divergence functionals based on convex analysis (Section 2.2) and use them to induce the dual Riemannian geometry on the infinite-dimensional manifold (Section 2.3). The section is concluded with an investigation of a special case of homogeneous divergence, called ( α , β ) -divergence, in which the two parameters play distinct, but interrelated, roles for referential duality and representational duality, thereby generalizing the familiar α-divergence in a sensible way (Section 2.4).

2.1. Differentiable Manifold in the Infinite-Dimensional Setting

Let U be an open set on the base manifold, M, containing a representative point, x_0, and F : U → R a smooth function defined on this local patch, U ⊂ M. The set of smooth functions on M is denoted F(M). A curve, t ↦ x(t), on the manifold is a collection of points, {x(t) ∈ U : t ∈ [0,1]}, whereas a tangent vector (or simply "vector"), v, at x_0 ∈ U represents an equivalence class of curves passing through x_0 = x(0), all with the same direction and speed as specified by the vector, v = dx/dt|_{t=0}. We use T_{x_0}(M) to denote the space of all tangent vectors ("tangent space") at a given x_0; it is obviously a vector space. The tangent manifold, TM, is then the collection of tangent spaces for all points on M: TM = {T_x(M), x ∈ M}. A vector field, v(x), is the association of a vector, v, to each point, x, of the manifold, M; it is a cross-section of TM. The set of all smooth vector fields on M is denoted Σ(M). The tangent vector, v, acting on a function, F, will yield a scalar, denoted d_v F, called the directional derivative of F:
$$d_v F = \lim_{t \to 0} \frac{1}{t}\left( F(x(t)) - F(x_0) \right)$$
The tangent vector, v, acting on a vector field, u ( x ) , is defined analogously:
$$d_v u = \lim_{t \to 0} \frac{1}{t}\left( u(x(t)) - u(x_0) \right)$$
In our setting, given a measure space, (X, μ), where samples are drawn from the set X and μ is the background measure, we call any function that maps X → R a ζ-function. The set of all ζ-functions forms a vector space, where vector addition is point-wise, (f_1 + f_2)(ζ) = f_1(ζ) + f_2(ζ), and scalar multiplication is simple multiplication, (cf)(ζ) = c f(ζ). We now consider the set of all ζ-functions with common support, which is assumed to form a manifold, denoted as M. A typical point of this manifold denotes a specific ζ-function, p(ζ) : ζ ↦ p(ζ), defined over X, the sample space, which is infinite-dimensional or even uncountable in general. Under a suitable topology (e.g., [24]), the set of all such points forms a manifold. On this manifold, any function, F : p ↦ F(p), is referred to (in this paper) as a ζ-functional, because it takes in a ζ-function, p(·), and outputs a scalar. The set of ζ-functionals on M is denoted F(M). (Note that "ζ-function" and "ζ-functional" are both functions (also called "maps" or "mappings") in the mathematical sense, with pre-specified domains and ranges. We make the distinction that the ζ-function refers to a real-valued function (e.g., density functions, random variables) defined on the sample space, X, and ζ-functional refers to a mapping from one or more ζ-functions to a real number.) A curve on M passing through a typical point, p, is nothing but a one-parameter family of ζ-functions, denoted as p(ζ|t), with p(ζ|0) = p. Here, ·|t is read as "given t", "indexed by t", or "parameterized by t"—a one-parameter family of ζ-functions, p(ζ|t), is formed as t varies. For each fixed t, p(ζ|t) is a function, X × I → R. More generally, p(ζ|θ), where θ = [θ^1, …, θ^n] ∈ R^n, is a ζ-function indexed by n parameters, θ^1, …, θ^n. As θ varies, p(ζ|θ) represents a finite-dimensional submanifold, M̃_θ ⊂ M, where:
$$\tilde{\mathcal{M}}_\theta = \{\, p(\zeta|\theta) \in \mathcal{M} : \theta \in \mathbf{R}^n \,\} \subset \mathcal{M}$$
In this paper, they are referred to as parametric models (and parametric statistical model if p ( ζ | θ ) is normalized and positive-valued).
In the infinite-dimensional setting, the following tangent vector, v:
$$v(\zeta) = \frac{\partial p(\zeta|t)}{\partial t}\bigg|_{t=0}$$
is also a ζ-function. When the tangent vector, v, operates on the ζ-functional F ( p ) :
$$d_v(F(p)) = \lim_{t \to 0} \frac{F(p(\cdot|t)) - F(p(\cdot|0))}{t}$$
the outcome is another ζ-functional of both p ( ζ ) and v ( ζ ) and linear in the latter. A particular ζ-functional of interest in this paper is of the following form:
$$F(p) = \int_{\mathcal{X}} f(p(\zeta))\, d\mu = E_\mu\{ f(p(\cdot)) \}$$
where f : R → R is a strictly convex function defined on the real line. In this case, p(ζ|t) = p(ζ) + v(ζ)t + O(t²), so:
$$d_v(F(p)) = \int_{\mathcal{X}} f'(p(\zeta))\, v(\zeta)\, d\mu$$
which is linear in v ( · ) .
A vector field, as a cross-section of TM, takes p(ζ) and associates with it a ζ-function. We denote a vector field as u(ζ|p) ∈ Σ(M), where the variable following the "|" sign indicates that u depends on the point, p(ζ), an element of the base manifold, M (we could also write it as u(p(ζ))(ζ) or u_p(ζ)). Though the vector fields defined above are not necessarily smooth, we will concentrate on smooth ones below. Of particular interest to us is the vector field, ρ(p(ζ)), for some strictly increasing function, ρ : R → R.
Differentiation of smooth vector fields can be defined analogously. The directional derivative, d v u , of a vector field, u ( ζ | p ) , which is a ζ-function also dependent on p ( ζ ) , in the direction of v = v ( ζ ) , which is another ζ-function, is:
$$d_v u(\zeta|p) = \lim_{t \to 0} \frac{u(\zeta \,|\, p(\zeta|t)) - u(\zeta \,|\, p(\zeta))}{t}$$
Note that d v u is another ζ-function; that is why we can write d v u ( ζ | p ) also as ( d v u ) ( ζ ) . As an example, the derivative of the vector field, ρ ( p ( ζ ) ) , where ρ : R R , in the direction of v ( ζ ) is:
$$d_v\, \rho(p(\zeta)) = \lim_{t \to 0} \frac{\rho(p(\zeta|t)) - \rho(p(\zeta))}{t} = \rho'(p(\zeta))\, v(\zeta)$$
With differentiation of vector fields defined, one can define the covariant derivative operation, ∇_w. When operating on a ζ-functional, the covariant derivative is simply the directional derivative (along the direction w):
$$\nabla_w F(p) = d_w F(p)$$
When operating on a vector field, say u(ζ|p), ∇_w is defined as (see [46]):
$$\nabla_w u = d_w u + B(\,\cdot \,|\, w(\cdot|p),\, u(\cdot|p))$$
where B : Σ(M) × Σ(M) → Σ(M) is a ζ-function, bilinear in the two tangent vectors (ζ-functions), w and u; it is the infinite-dimensional counterpart of the Christoffel symbol, Γ (for finite dimensions). We denote the conjugate covariant derivative, ∇*_w (as defined by Equation (22)), in terms of B* (with the asterisk denoting conjugacy):
$$(\nabla^*_w u)(\zeta) = (d_w u)(\zeta) + B^*(\zeta \,|\, w(\zeta|p(\zeta)),\, u(\zeta|p(\zeta)))$$
(here, we write out the explicit dependency on ζ).
The Riemann curvature tensor, R, which measures the curvature of a connection, ∇ (as specified by B), is defined by the map, Σ(M) × Σ(M) × Σ(M) → Σ(M):
$$R(u,v,w) = R(u,v)w = \nabla_u \nabla_v w - \nabla_v \nabla_u w - \nabla_{[u,v]} w$$
where:
$$[u,v] = d_u v - d_v u$$
The torsion tensor, T : Σ(M) × Σ(M) → Σ(M), is given by:
$$T(u,v) = \nabla_u v - \nabla_v u - [u,v]$$

2.2. D^(α)-Divergence, a Family of Generalized Divergence Functionals

Divergence functionals are defined with respect to a pair of ζ-functions in an infinite-dimensional function space. A divergence functional, D : M × M → R₊, maps two ζ-functions to a non-negative real number. To the extent that ζ-functions can be parameterized by finite-dimensional vectors, θ ∈ R^n, a divergence functional on M × M will implicitly induce a divergence function on the parameter space, which is a subset of R^n × R^n. In this section, we will discuss the general form of the divergence functional and the associated infinite-dimensional manifold. Finite-dimensional embeddings of ζ-functions (i.e., parametric models) will be discussed in Section 3.

2.2.1. Fundamental Convex Inequality and Divergence

We start our exposition by reviewing the notion of a convex function on the real line, f : R → R. We recall the fundamental convex inequality that defines a strictly convex function, f:
$$f\left( \frac{1-\alpha}{2}\gamma + \frac{1+\alpha}{2}\delta \right) \leq \frac{1-\alpha}{2} f(\gamma) + \frac{1+\alpha}{2} f(\delta)$$
for all γ, δ ∈ R and all α ∈ (−1, 1), with equality holding if and only if γ = δ. Geometrically, the value of the function, f, at any point, ϵ, in between two end points, γ and δ, lies on or below the chord connecting its values at these two points. This property of a strictly convex function can also be stated in elementary algebra as the Chord Theorem, namely:
$$\frac{f(\epsilon) - f(\gamma)}{\epsilon - \gamma} \leq \frac{f(\delta) - f(\gamma)}{\delta - \gamma} \leq \frac{f(\delta) - f(\epsilon)}{\delta - \epsilon}$$
where:
$$\epsilon = \frac{1-\alpha}{2}\gamma + \frac{1+\alpha}{2}\delta$$
(here, we assume γ ≤ ϵ ≤ δ without loss of generality). In fact, the slope, (f(δ) − f(γ))/(δ − γ), is an increasing function of both δ and γ. The slopes of the chords connecting the point ϵ to either end point are, respectively:
$$L^{(\alpha)}(\gamma,\delta) = \frac{f(\delta) - f(\epsilon)}{\delta - \epsilon} = \frac{1}{\delta - \gamma}\, \frac{2}{1-\alpha} \left( f(\delta) - f\left( \frac{1-\alpha}{2}\gamma + \frac{1+\alpha}{2}\delta \right) \right)$$
$$\hat{L}^{(\alpha)}(\gamma,\delta) = \frac{f(\gamma) - f(\epsilon)}{\gamma - \epsilon} = \frac{1}{\delta - \gamma}\, \frac{2}{1+\alpha} \left( f\left( \frac{1-\alpha}{2}\gamma + \frac{1+\alpha}{2}\delta \right) - f(\gamma) \right)$$
with skew symmetry:
$$L^{(\alpha)}(\gamma,\delta) = \hat{L}^{(-\alpha)}(\delta,\gamma), \qquad \hat{L}^{(\alpha)}(\gamma,\delta) = L^{(-\alpha)}(\delta,\gamma)$$
As α : −1 → 1 (i.e., as the point ϵ moves from γ to δ, the two fixed ends), both L^(α)(γ,δ) and L̂^(α)(γ,δ) are increasing functions of α, but the Chord Theorem dictates that the latter is always no greater than the former. In fact, their difference has a non-negative value:
$$0 \leq L^{(\alpha)}(\gamma,\delta) - \hat{L}^{(\alpha)}(\gamma,\delta) = L^{(-\alpha)}(\delta,\gamma) - \hat{L}^{(-\alpha)}(\delta,\gamma) = \frac{1}{\delta - \gamma}\, \frac{4}{1-\alpha^2} \left( \frac{1-\alpha}{2} f(\gamma) + \frac{1+\alpha}{2} f(\delta) - f\left( \frac{1-\alpha}{2}\gamma + \frac{1+\alpha}{2}\delta \right) \right)$$
Though the above is obviously valid for α ∈ [−1, 1], it can be shown that it is also valid for any α ∈ R.
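This extension beyond α ∈ [−1, 1] is easy to spot-check numerically. In the sketch below (f = exp is our arbitrary choice of strictly convex function), both the prefactor 4/(1 − α²) and the bracketed convexity gap change sign together outside [−1, 1] (where the convex mixture becomes an extrapolation), so their product stays non-negative:

```python
import numpy as np

def gap(gamma, delta, alpha, f=np.exp):
    # scaled convexity gap: 4/(1 - alpha^2) * [mixture of f - f of mixture]
    mix = (1 - alpha) / 2 * gamma + (1 + alpha) / 2 * delta
    return 4 / (1 - alpha ** 2) * ((1 - alpha) / 2 * f(gamma)
                                   + (1 + alpha) / 2 * f(delta) - f(mix))

for a in (-0.5, 0.5, 2.0, -3.0):
    print(a, gap(0.3, 1.7, a) >= 0)     # True in every case
```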
The fundamental convex inequality applies to any two real numbers, γ , δ . We can treat γ , δ as the values of two functions, p , q : X R , evaluated at any particular sample point, ζ, that is, γ = p ( ζ ) , δ = q ( ζ ) . This allows us to define the following family of divergence functionals (see [48]).
Proposition 1. Let f : R → R be smooth and strictly convex, and let ρ : R → R be strictly increasing. For any two ζ-functions, p, q, and any α ∈ R:
$$D^{(\alpha)}_{f,\rho}(p,q) = \frac{4}{1-\alpha^2}\, E_\mu\left\{ \frac{1-\alpha}{2}\, f(\rho(p)) + \frac{1+\alpha}{2}\, f(\rho(q)) - f\left( \frac{1-\alpha}{2}\rho(p) + \frac{1+\alpha}{2}\rho(q) \right) \right\}$$
is non-negative, and it equals zero if and only if p(ζ) = q(ζ) almost everywhere.
Proof. See Section 4.
Proposition 1 constructs a family (parameterized by α) of divergence functionals, D^(α), for two ζ-functions, in which referential duality is embodied as:
$$D^{(\alpha)}_{f,\rho}(p,q) = D^{(-\alpha)}_{f,\rho}(q,p)$$
Its definition involves a strictly increasing function ρ, which can be taken to be the identity function if necessary. The reason ρ is introduced will be clear in the next subsection, where we introduce the notion of conjugate-scaled representations. Furthermore, in order to ensure that the integrals in Equation (54) are well defined, we require p , q to be elements of the set:
$$\{\, p(\zeta) : E_\mu\{ f(\rho(p)) \} < \infty \,\}$$
The D^(α)-divergence was first introduced in [48]. It generalizes the familiar α-divergence, Equation (18): take f(p) = e^p and ρ(p) = log p; then, D^(α)_{f,ρ}(p,q) = A^(α)(p,q). The D^(α)-divergence becomes the U-divergence [15] when f(p) = U(p), ρ(p) = (U′)^{−1}(p) and α → −1, for any strictly convex and strictly increasing U : R⁺ → R. It is well known that the U-divergence, when taking
$$U(t) = \frac{1}{\beta}\left( 1 + (\beta - 1)\,t \right)^{\beta/(\beta-1)} \qquad (\beta \neq 0, 1)$$
specializes to β-divergence [49], defined as:
$$B^{(\beta)}(p,q) = E_\mu\left\{ p\, \frac{p^{\beta-1} - q^{\beta-1}}{\beta - 1} - \frac{p^\beta - q^\beta}{\beta} \right\}$$
and that both the α- and β-divergences specialize to the Kullback-Leibler divergence as α → ±1 and β → 1, respectively.
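The generality of Equation (54) is easy to exercise numerically. In this sketch (a toy discrete measure space; the choice f = exp, ρ = log is the special case named above), the generic D^(α)_{f,ρ} reproduces the α-divergence of Equation (18):

```python
import numpy as np

def d_alpha(p, q, alpha, f=np.exp, rho=np.log):
    # D^(alpha)_{f,rho} of Equation (54) on a discrete measure space
    rp, rq = rho(p), rho(q)
    mix = (1 - alpha) / 2 * rp + (1 + alpha) / 2 * rq
    integrand = (1 - alpha) / 2 * f(rp) + (1 + alpha) / 2 * f(rq) - f(mix)
    return 4.0 / (1 - alpha ** 2) * integrand.sum()

def alpha_div(p, q, a):                # alpha-divergence, Equation (18)
    return 4.0 / (1 - a ** 2) * ((1 - a) / 2 * p + (1 + a) / 2 * q
                                 - p ** ((1 - a) / 2) * q ** ((1 + a) / 2)).sum()

p = np.array([0.2, 0.5, 0.3]); q = np.array([0.4, 0.4, 0.4])
print(d_alpha(p, q, 0.3), alpha_div(p, q, 0.3))     # identical
```

Swapping in other strictly convex f and strictly increasing ρ (e.g., powers) yields other members of the family without changing the non-negativity argument.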

2.2.2. Conjugate-Scaled Representations of Measurable Functions

In one dimension, any strictly convex function, f : R → R, can be written as an integral of a strictly increasing function, g, and vice versa: f(δ) = ∫_γ^δ g(t) dt + f(γ), with g′(t) > 0. The convex (Legendre–Fenchel) conjugate, f* : R → R, defined by:
$$f^*(t) = t\, (f')^{-1}(t) - f\left( (f')^{-1}(t) \right)$$
has the integral expression f*(λ) = ∫_{g(γ)}^λ g^{−1}(t) dt + f*(g(γ)), with g^{−1} also strictly monotonic and γ, δ, λ ∈ R. (Here, the monotonicity condition replaces the requirement of a positive semi-definite Hessian in the case of a convex function of several variables.) The Legendre–Fenchel inequality:
$$f(\delta) + f^*(\lambda) - \delta\lambda \geq 0$$
can be cast as Young's inequality:
$$\int_\gamma^\delta g(t)\, dt + \int_{g(\gamma)}^\lambda g^{-1}(t)\, dt + \gamma\, g(\gamma) \geq \delta\lambda$$
with equality holding if and only if λ = g(δ). The conjugate function, f*, which is also strictly convex, satisfies (f*)* = f and (f*)′ = (f′)^{−1}.
We introduce the notion of the ρ-representation of a ζ-function, p(·), by defining a mapping, p ↦ ρ(p), for a strictly increasing function, ρ : R → R. We say that a τ-representation of a ζ-function, p ↦ τ(p), is conjugate to the ρ-representation with respect to a smooth and strictly convex function, f : R → R, if:
$$\tau(p) = f'(\rho(p)) = \left( (f^*)' \right)^{-1}(\rho(p)) \iff \rho(p) = (f')^{-1}(\tau(p)) = (f^*)'(\tau(p))$$
As an example, we may let ρ(p) = l^(α)(p) be the α-representation, where l^(α) is given by Equation (13); the conjugate representation is then the (−α)-representation, τ(p) = l^(−α)(p):
$$\rho(t) = l^{(\alpha)}(t) \iff \tau(p) = l^{(-\alpha)}(p)$$
In this case:
$$f(t) = \frac{2}{1+\alpha}\left( \frac{1-\alpha}{2}\, t \right)^{2/(1-\alpha)}, \qquad f^*(t) = \frac{2}{1-\alpha}\left( \frac{1+\alpha}{2}\, t \right)^{2/(1+\alpha)}$$
so that:
$$f(\rho(p)) = \frac{2}{1+\alpha}\, p, \qquad f^*(\tau(p)) = \frac{2}{1-\alpha}\, p$$
both linear in p. More generally, strictly increasing functions from R → R form a group, with functional composition as the group composition operation and the functional inverse as the group inverse operation. That is, (i) for any two strictly increasing functions, ρ_1, ρ_2, their functional composition, ρ_2 ∘ ρ_1, is strictly increasing; (ii) the functional inverse, ρ^{−1}, of any strictly increasing function, ρ, is also strictly increasing; (iii) there exists a strictly increasing function, ι, the identity function, such that ρ ∘ ρ^{−1} = ρ^{−1} ∘ ρ = ι. From this perspective, f′ = τ ∘ ρ^{−1} and (f*)′ = ρ ∘ τ^{−1}, encountered above, are themselves two mutually inverse, strictly increasing functions.
If, in the above discussion, τ ∘ ρ^{−1} is further assumed to be strictly convex, that is:
$$\frac{1-\alpha}{2}\, \tau(\rho^{-1}(\gamma)) + \frac{1+\alpha}{2}\, \tau(\rho^{-1}(\delta)) \geq \tau\left( \rho^{-1}\left( \frac{1-\alpha}{2}\gamma + \frac{1+\alpha}{2}\delta \right) \right)$$
for any γ, δ ∈ R and α ∈ (−1, 1), then by applying τ^{−1} to both sides of the inequality and renaming ρ^{−1}(γ) as γ and ρ^{−1}(δ) as δ, we obtain:
$$\tau^{-1}\left( \frac{1-\alpha}{2}\tau(\gamma) + \frac{1+\alpha}{2}\tau(\delta) \right) \geq \rho^{-1}\left( \frac{1-\alpha}{2}\rho(\gamma) + \frac{1+\alpha}{2}\rho(\delta) \right)$$
This is to say:
$$M^{(\alpha)}_\tau(\gamma,\delta) \geq M^{(\alpha)}_\rho(\gamma,\delta)$$
with equality holding, if and only if γ = δ , where:
$$M^{(\alpha)}_\rho(\gamma,\delta) = \rho^{-1}\left( \frac{1-\alpha}{2}\rho(\gamma) + \frac{1+\alpha}{2}\rho(\delta) \right)$$
is the quasi-linear mean of the two numbers, γ and δ. Therefore, the following is also a divergence functional (see further discussion in Section 2.4):
$$\frac{4}{1-\alpha^2} \int_{\mathcal{X}} \left( \tau^{-1}\left( \frac{1-\alpha}{2}\tau(p(\zeta)) + \frac{1+\alpha}{2}\tau(q(\zeta)) \right) - \rho^{-1}\left( \frac{1-\alpha}{2}\rho(p(\zeta)) + \frac{1+\alpha}{2}\rho(q(\zeta)) \right) \right) d\mu$$
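A small sketch of the quasi-linear mean of Equation (70) (the function names are ours, for illustration): for ρ = log it is the weighted geometric mean that appears in the α-divergence, while the identity map gives the weighted arithmetic mean:

```python
import numpy as np

def quasi_linear_mean(gamma, delta, alpha, rho, rho_inv):
    # M^(alpha)_rho of Equation (70), for a strictly increasing rho with known inverse
    return rho_inv((1 - alpha) / 2 * rho(gamma) + (1 + alpha) / 2 * rho(delta))

g, d, a = 2.0, 8.0, 0.0
print(quasi_linear_mean(g, d, a, np.log, np.exp))             # geometric mean: 4.0
print(quasi_linear_mean(g, d, a, lambda t: t, lambda t: t))   # arithmetic mean: 5.0
```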

2.2.3. Canonical Divergence

The use of a pair of conjugate convex functions, f and f*, allows us to define, in parallel with D^(α)_{f,ρ}(p,q) given in Equation (54), the conjugate family, D^(α)_{f*,τ}(p,q). The two families turn out to have the same form when α = ±1; this is the so-called canonical divergence.
Taking the limit, α → −1, the inequality Equation (53) becomes:
$$f(\delta) - f(\gamma) - (\delta - \gamma)\, f'(\gamma) \geq 0$$
where f is strictly convex. A similar inequality is obtained when α → 1. Hence, the divergence functionals, D^(±1)_{f,ρ}(p,q), take the form:
$$D^{(-1)}_{f,\rho}(p,q) = E_\mu\left\{ f(\rho(q)) - f(\rho(p)) - (\rho(q) - \rho(p))\, f'(\rho(p)) \right\} \quad (73)$$
$$= E_\mu\left\{ f^*(\tau(p)) - f^*(\tau(q)) - (\tau(p) - \tau(q))\, (f^*)'(\tau(q)) \right\} = D^{(-1)}_{f^*,\tau}(q,p) \quad (74)$$
$$D^{(1)}_{f,\rho}(p,q) = E_\mu\left\{ f(\rho(p)) - f(\rho(q)) - (\rho(p) - \rho(q))\, f'(\rho(q)) \right\} \quad (75)$$
$$= E_\mu\left\{ f^*(\tau(q)) - f^*(\tau(p)) - (\tau(q) - \tau(p))\, (f^*)'(\tau(p)) \right\} = D^{(1)}_{f^*,\tau}(q,p) \quad (76)$$
The canonical divergence functional, A : M × M → R₊, is defined (with the aid of a pair of conjugate representations) as:
$$A_f(\rho(p), \tau(q)) = E_\mu\left\{ f(\rho(p)) + f^*(\tau(q)) - \rho(p)\, \tau(q) \right\} = A_{f^*}(\tau(q), \rho(p))$$
where ∫_X f(ρ(p)) dμ can be called the (generalized) cumulant generating functional and ∫_X f*(τ(p)) dμ the (generalized) entropy functional. Thus, a dualistic relation exists between α = 1 ↔ α = −1 and between (f, ρ) ↔ (f*, τ):
$$D^{(1)}_{f,\rho}(p,q) = D^{(-1)}_{f,\rho}(q,p) = D^{(1)}_{f^*,\tau}(q,p) \quad (78)$$
$$= A_f(\rho(p), \tau(q)) = A_{f^*}(\tau(q), \rho(p)) \quad (79)$$
We can see that, under the conjugate (±α)-representations of Equation (64), A_f is simply the α-divergence proper, A^(α):
$$A_f(\rho(p), \tau(q)) = A^{(\alpha)}(p,q)$$
In fact:
$$\frac{1-\alpha^2}{4}\, A^{(\alpha)}(u,v) = E_\mu\left\{ \frac{1-\alpha}{2}\, u^{2/(1-\alpha)} + \frac{1+\alpha}{2}\, v^{2/(1+\alpha)} - uv \right\} \geq 0$$
is an expression of Young's inequality between the two functions, u = p^{(1−α)/2} and v = q^{(1+α)/2}, under the conjugate exponents, 2/(1−α) and 2/(1+α).

2.3. Geometry Induced by the D ( α ) -Divergence

The last two sections showed that the divergence functional, D^(α), constructed on M according to Equation (54), generalizes the α-divergence in a sensible way. Now, we investigate the metric and conjugate connections that such divergence functionals induce; this is accomplished by invoking the Eguchi relations, Equations (26)–(28).
Proposition 2. At any given p ∈ M and for any vector fields, u, v ∈ Σ(M):
(i)
the metric tensor field, g : Σ(M) × Σ(M) → F(M), is given by:
$$g(u,v) = E_\mu\left\{ g(p(\zeta))\, u(\zeta|p(\zeta))\, v(\zeta|p(\zeta)) \right\}$$
where:
$$g(t) = f''(\rho(t))\, (\rho'(t))^2$$
(ii)
the family of covariant derivatives (connections), ∇^(α) : Σ(M) × Σ(M) → Σ(M), is given as:
$$\nabla^{(\alpha)}_w u = (d_w u)(\zeta) + b^{(\alpha)}(p(\zeta))\, u(\zeta|p(\zeta))\, w(\zeta|p(\zeta))$$
where:
$$b^{(\alpha)}(t) = \frac{1-\alpha}{2}\, \frac{f'''(\rho(t))\, \rho'(t)}{f''(\rho(t))} + \frac{\rho''(t)}{\rho'(t)}$$
(iii)
the family of conjugate covariant derivatives is:
$$\nabla^{*(\alpha)}_w u = (d_w u)(\zeta) + b^{(-\alpha)}(p(\zeta))\, u(\zeta|p(\zeta))\, w(\zeta|p(\zeta))$$
Proof. See Section 4.
Note that the g ( · ) term in Equation (82) and the b ( α ) ( · ) term in covariant derivatives Equation (84) depend on p, the point on the base manifold, where the metric and covariant derivatives are evaluated. They both depend on the auxiliary “scaling functions”, f and ρ. We may cast them into an equivalent, dually symmetric form as follows.
Corollary 3. The g ( · ) function in expressing the metric Equation (82) and b ( α ) ( · ) in expressing the covariant derivatives Equation (84) can be expressed in dualistic forms:
$$g(t) = \rho'(t)\, \tau'(t)$$
and:
$$b^{(\alpha)}(t) = \frac{d}{dt}\left( \frac{1+\alpha}{2} \log \rho'(t) + \frac{1-\alpha}{2} \log \tau'(t) \right)$$
Proof. See Section 4.
Corollary 3 makes it immediately evident that the Riemannian metrics induced by D^(α)_{f,ρ}(p,q) and by D^(α)_{f*,τ}(p,q) are identical for all α values, while the connections (covariant derivatives) induced by the two families of divergences are conjugate to each other, expressed as α ↔ −α. This implies that the conjugacy embodied in the definition of the pair of connections is related to both referential duality and representational duality.
It can be proven that covariant derivatives of the kind in Equation (84) are both curvature-free and torsion-free.
Proposition 4. For the entire family of covariant derivatives indexed by α (α ∈ R):
(i)
the Riemann curvature tensor, R^(α)(u,v,w) ≡ 0;
(ii)
the torsion tensor, T^(α)(u,v) ≡ 0.
Proof. See Section 4.
In other words, the manifold, M , has zero-curvature and zero-torsion for all α. As such, it can serve as an ambient manifold to embed the manifold, M μ , of non-parametric probability density functions and the manifold, M θ , of parametric density functions, and any curvature on M μ or M θ may be interpreted as arising from embedding or restriction to a lower dimensional space. See, also, [50] for a discussion of curvatures of statistical manifolds.

2.4. Homogeneous ( α , β ) -Divergence and the Induced Geometry

Suppose that f is, in addition to being strictly convex, strictly increasing. We may set ρ(t) = f^{−1}(εt) ⟺ f(t) = ε ρ^{−1}(t), so that the divergence functional becomes:
$$D^{(\alpha)}_\rho(p,q) = \frac{4\varepsilon}{1-\alpha^2} \int_{\mathcal{X}} \left( \frac{1-\alpha}{2}\, p(\zeta) + \frac{1+\alpha}{2}\, q(\zeta) - \rho^{-1}\left( \frac{1-\alpha}{2}\rho(p(\zeta)) + \frac{1+\alpha}{2}\rho(q(\zeta)) \right) \right) d\mu$$
Now, the second term in the integrand is just the quasi-linear mean, M^(α)_ρ, introduced in Equation (70), where ρ is strictly increasing and concave here. As an example, take ρ(p) = log p and ε = 1; then, M^(α)_ρ(p,q) = p^{(1−α)/2} q^{(1+α)/2}, and D^(α)_ρ(p,q) is the α-divergence, Equation (18), while:
$$D^{(1)}_\rho(p,q) = \int_{\mathcal{X}} \left( p - q - (\rho(p) - \rho(q))\, (\rho^{-1})'(\rho(q)) \right) d\mu = D^{(-1)}_\rho(q,p)$$
is an immediate generalization of the extended Kullback-Leibler divergence in Equation (16).
If we impose a homogeneity requirement (κ ∈ R⁺) on D^(α)_ρ:
$$D^{(\alpha)}_\rho(\kappa p, \kappa q) = \kappa\, D^{(\alpha)}_\rho(p,q)$$
then (see [48]) ρ(p) = l^(β)(p); so, Equation (89) becomes a two-parameter family:
$$D^{(\alpha,\beta)}(p,q) \equiv \frac{4}{1-\alpha^2}\, \frac{2}{1+\beta}\, E_\mu\left\{ \frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q - \left( \frac{1-\alpha}{2}\, p^{(1-\beta)/2} + \frac{1+\alpha}{2}\, q^{(1-\beta)/2} \right)^{2/(1-\beta)} \right\}$$
Here, (α, β) ∈ [−1,1] × [−1,1], and ε = 2/(1+β) in Equation (89) is chosen to make D^(α,β)(p,q) well defined for β = −1. We call this family the (α, β)-divergence; it belongs to the general class of f-divergence studied by [51]. Note that the α parameter encodes referential duality, and the β parameter encodes representational duality. When either α = ±1 or β = ±1, the one-parameter version of the generic alpha-connection results. The family, D^(α,β), is then a generalization of Amari's α-divergence, Equation (18), with:
$$\lim_{\alpha \to 1} D^{(\alpha,\beta)}(p,q) = A^{(\beta)}(p,q) \quad (93)$$
$$\lim_{\alpha \to -1} D^{(\alpha,\beta)}(p,q) = A^{(-\beta)}(p,q) \quad (94)$$
$$\lim_{\beta \to 1} D^{(\alpha,\beta)}(p,q) = A^{(\alpha)}(p,q) \quad (95)$$
$$\lim_{\beta \to -1} D^{(\alpha,\beta)}(p,q) = J^{(\alpha)}(p,q) \quad (96)$$
where J ( α ) denotes the Jensen difference discussed by [44]:
$$J^{(\alpha)}(p,q) \equiv \frac{4}{1-\alpha^2}\, E_\mu\left\{ \frac{1-\alpha}{2}\, p\log p + \frac{1+\alpha}{2}\, q\log q - \left( \frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q \right) \log\left( \frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q \right) \right\}$$
J^(α) reduces to the Kullback-Leibler divergence, Equation (16), when α → ±1. Lastly, we note that in D^(α,β), when either α or β equals zero, the Levi-Civita connection results.
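The limit in Equation (95) can be verified numerically; the sketch below (toy discrete measure space, plain NumPy; an illustration we add here) evaluates D^(α,β) near β = 1 and compares it with the α-divergence proper:

```python
import numpy as np

def d_alpha_beta(p, q, alpha, beta):   # (alpha, beta)-divergence, Equation (92)
    lam = (1 - beta) / 2
    mix = ((1 - alpha) / 2 * p ** lam + (1 + alpha) / 2 * q ** lam) ** (1 / lam)
    integrand = (1 - alpha) / 2 * p + (1 + alpha) / 2 * q - mix
    return 4 / (1 - alpha ** 2) * 2 / (1 + beta) * integrand.sum()

def alpha_div(p, q, a):                # alpha-divergence, Equation (18)
    return 4 / (1 - a ** 2) * ((1 - a) / 2 * p + (1 + a) / 2 * q
                               - p ** ((1 - a) / 2) * q ** ((1 + a) / 2)).sum()

p = np.array([0.2, 0.5, 0.3]); q = np.array([0.4, 0.4, 0.4])
print(d_alpha_beta(p, q, 0.3, 0.9999), alpha_div(p, q, 0.3))   # nearly equal
```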
Note that the divergence given by Equation (92), which first appeared in [48], was called the "(α, β)-divergence" in [47]. Cichocki et al. [52], following the review of α-, β- and γ-divergences in [53], introduced the following two-parameter family:
$$D^{\alpha,\beta}_{AB}(p,q) = -\frac{1}{\alpha\beta}\, E_\mu\left\{ p^\alpha q^\beta - \frac{\alpha}{\alpha+\beta}\, p^{\alpha+\beta} - \frac{\beta}{\alpha+\beta}\, q^{\alpha+\beta} \right\}$$
and called it ( α , β ) -divergence. Essentially, it is α-divergence under β- (power) embedding:
$$D^{\alpha,\beta}_{AB}(p,q) = (\alpha+\beta)^{-2}\, A^{\left(\frac{\beta-\alpha}{\alpha+\beta}\right)}\!\left( p^{\alpha+\beta}, q^{\alpha+\beta} \right)$$
Clearly, by taking f(t) = e^t, ρ(t) = (α+β) log t and renaming (β−α)/(α+β) as α, D^{α,β}_{AB} is a special case of D^(α)_{f,ρ}(p,q), as is the D^(α,β)(p,q) of Zhang [47,48]. The two definitions of the (α,β)-divergence, both special cases of D^(α)_{f,ρ}(p,q), differ only by an l^(β)-embedding of the density functions, leading to a superficial difference in the homogeneity/scaling of the divergence function.
With respect to the geometry induced from the ( α , β ) -divergence of Equation (92), we have the following result.
Proposition 5. The metric, g, and the affine connections (covariant derivatives), ∇^(α,β), corresponding to the (α,β)-divergence are given by:
$$g(u,v) = \int_{\mathcal{X}} \frac{1}{p}\, u\, v\, d\mu$$
$$\nabla^{(\alpha,\beta)}_u v = d_u v - \frac{1+\alpha\beta}{2p}\, u\, v$$
$$\nabla^{*(\alpha,\beta)}_u v = d_u v - \frac{1-\alpha\beta}{2p}\, u\, v$$
where u, v ∈ Σ(M) and p = p(ζ) is the point at which g and ∇ are evaluated.
Proof. The proof is immediate upon substituting Equations (64) and (65) into Equations (83) and (85). ⋄
This is to say, with respect to the ( α , β ) -divergence, the product of the two parameters, α β , acts as the “alpha” parameter in the family of induced connections, so:
$$\nabla^{*(\alpha,\beta)} = \nabla^{(-\alpha,\beta)} = \nabla^{(\alpha,-\beta)}$$
Taking the limit β → 1 in ∇^(α,β) yields Amari's one-parameter family of α-connections in the infinite-dimensional setting, taking the very simple form:
$$\nabla^{(\alpha)}_u v = d_u v - \frac{1+\alpha}{2p}\, u\, v$$
The same is true when α → 1 in ∇^(α,β) (the connections are then indexed by β, of course).

3. Parametric Statistical Manifold As Finite-Dimensional Embedding

3.1. Finite-Dimensional Parametric Models

Now, we restrict attention to a finite-dimensional submanifold of measurable functions whose ρ-representations are parameterized by θ = [θ^1, …, θ^n] ∈ R^n. In this case, the divergence functional of two functions, p and q, assumed to be specified, respectively, by θ_p and θ_q in the parametric model, becomes an implicit function of θ_p and θ_q. In other words, through introducing parametric models (i.e., a finite-dimensional submanifold) of the infinite-dimensional manifold of measurable functions, we arrive at a divergence function defined ("pulled back") over the vector space. We denote the ρ-representation of a parameterized measurable function as ρ(p(ζ|θ)) and the corresponding divergence function by D(θ_p, θ_q). It is important to realize that, while f(·) is strictly convex, F(p) = ∫_X f(p(ζ|θ)) dμ is, in general, not at all convex in θ.

3.1.1. Riemannian Geometry of Parametric Models

The parametric family of functions, p ( ζ | θ ) , forms a submanifold of M defined by:
$$\tilde{\mathcal{M}}_\theta = \{\, p(\zeta|\theta) \in \mathcal{M} : \theta \in \mathbf{R}^n \,\}$$
where p ( ζ | θ ) is a ζ-function indexed by θ, i.e., θ is treated as a parameter to specify a ζ-function. M ˜ θ is a finite-dimensional submanifold of M . We also denote the manifold of a parametric statistical model as:
M θ = { p ( ζ | θ ) M μ : θ Θ R n }
The θ values themselves, called the natural parameter of the parametric (statistical) model, p ( · | θ ) , are coordinates for M ˜ θ (or M θ ). The tangent vector fields, u , v , w , of M in the directions that are also tangent for M ˜ θ (or M θ ) take the form:
u = p ( ζ | θ ) θ i , v = p ( ζ | θ ) θ k , w = p ( ζ | θ ) θ j
The following proposition gives the metric and the family of α-connections in the parametric case. For convenience, we denote $\rho(\zeta,\theta) \equiv \rho(p(\zeta|\theta))$, $\tau(\zeta,\theta) \equiv \tau(p(\zeta|\theta))$ in this subsection.
Proposition 6. For parametric models $p(\zeta|\theta)$, the metric tensor takes the form:
$$ g_{ij}(\theta) = \mathrm{E}_{\mu}\!\left[ f''(\rho(\zeta,\theta))\, \frac{\partial \rho(\zeta,\theta)}{\partial \theta^{i}} \frac{\partial \rho(\zeta,\theta)}{\partial \theta^{j}} \right] $$
and the α-connections take the form:
$$ \Gamma_{ij,k}^{(\alpha)}(\theta) = \mathrm{E}_{\mu}\!\left[ \frac{1-\alpha}{2}\, f'''(\rho(\zeta,\theta))\, A_{ijk} + f''(\rho(\zeta,\theta))\, B_{ijk} \right] $$
$$ \Gamma_{ij,k}^{*(\alpha)}(\theta) = \mathrm{E}_{\mu}\!\left[ \frac{1+\alpha}{2}\, f'''(\rho(\zeta,\theta))\, A_{ijk} + f''(\rho(\zeta,\theta))\, B_{ijk} \right] $$
where:
$$ A_{ijk}(\zeta,\theta) = \frac{\partial \rho(\zeta,\theta)}{\partial \theta^{i}} \frac{\partial \rho(\zeta,\theta)}{\partial \theta^{j}} \frac{\partial \rho(\zeta,\theta)}{\partial \theta^{k}}, \quad B_{ijk}(\zeta,\theta) = \frac{\partial^{2} \rho(\zeta,\theta)}{\partial \theta^{i} \partial \theta^{j}} \frac{\partial \rho(\zeta,\theta)}{\partial \theta^{k}} $$
Proof. See Section 4.
Note that strict convexity of f requires that $f'' > 0$; thereby, the positive-definiteness of $g_{ij}(\theta)$ is guaranteed. Clearly, the α-connections form conjugate pairs $\Gamma_{ij,k}^{*(\alpha)}(\theta) = \Gamma_{ij,k}^{(-\alpha)}(\theta)$.
As an example, take the embedding $f(t) = e^{t}$ and $\rho(p) = \log p$, with $\tau(p) = p$, the identity function; then, the expressions in Proposition 6 reduce to the Fisher information and the α-connections of the exponential family in Equations (3) and (4).
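A minimal numerical sketch of this reduction, assuming (for illustration) the Bernoulli family in its natural parameter, for which the Fisher information is $\sigma(\theta)(1-\sigma(\theta))$ with σ the logistic function:
```python
import numpy as np

def density(theta):
    # Bernoulli family in the natural parameter: p(1|theta) = sigmoid(theta)
    s = 1.0 / (1.0 + np.exp(-theta))
    return np.array([1.0 - s, s])            # p(zeta|theta) on support {0, 1}

def metric_prop6(theta, h=1e-5):
    # g(theta) = E_mu[ f''(rho) (d rho/d theta)^2 ] with f''(rho(p)) = p
    p = density(theta)
    drho = (np.log(density(theta + h)) - np.log(density(theta - h))) / (2 * h)
    return (p * drho * drho).sum()

theta = 0.3
s = 1.0 / (1.0 + np.exp(-theta))
print(metric_prop6(theta), s * (1.0 - s))    # both ~0.2445: Fisher information
```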
Corollary 7. In dualistic form, the metric and α-connections are:
$$ g_{ij}(\theta) = \mathrm{E}_{\mu}\!\left[ \frac{\partial \rho(\zeta,\theta)}{\partial \theta^{i}} \frac{\partial \tau(\zeta,\theta)}{\partial \theta^{j}} \right] $$
$$ \Gamma_{ij,k}^{(\alpha)}(\theta) = \mathrm{E}_{\mu}\!\left[ \frac{1-\alpha}{2}\, \frac{\partial^{2} \tau(\zeta,\theta)}{\partial \theta^{i} \partial \theta^{j}} \frac{\partial \rho(\zeta,\theta)}{\partial \theta^{k}} + \frac{1+\alpha}{2}\, \frac{\partial^{2} \rho(\zeta,\theta)}{\partial \theta^{i} \partial \theta^{j}} \frac{\partial \tau(\zeta,\theta)}{\partial \theta^{k}} \right] $$
$$ \Gamma_{ij,k}^{*(\alpha)}(\theta) = \mathrm{E}_{\mu}\!\left[ \frac{1+\alpha}{2}\, \frac{\partial^{2} \tau(\zeta,\theta)}{\partial \theta^{i} \partial \theta^{j}} \frac{\partial \rho(\zeta,\theta)}{\partial \theta^{k}} + \frac{1-\alpha}{2}\, \frac{\partial^{2} \rho(\zeta,\theta)}{\partial \theta^{i} \partial \theta^{j}} \frac{\partial \tau(\zeta,\theta)}{\partial \theta^{k}} \right] $$
Proof. See Section 4.
An immediate consequence of this corollary is as follows. If we construct the divergence function $D_{f^{*},\tau}^{(\alpha)}(\theta_p, \theta_q)$, then the induced metric, $\tilde{g}_{ij}$, and the induced conjugate connections, $\tilde{\Gamma}_{ij,k}^{(\alpha)}, \tilde{\Gamma}_{ij,k}^{*(\alpha)}$, will be related to those induced from $D_{f,\rho}^{(\alpha)}(\theta_p, \theta_q)$ (and denoted without the tilde) via:
$$ \tilde{g}_{ij}(\theta) = g_{ij}(\theta) $$
with:
$$ \tilde{\Gamma}_{ij,k}^{(\alpha)}(\theta) = \Gamma_{ij,k}^{(-\alpha)}(\theta), \quad \tilde{\Gamma}_{ij,k}^{*(\alpha)}(\theta) = \Gamma_{ij,k}^{(\alpha)}(\theta) $$
So, the difference between using $D_{f,\rho}^{(\alpha)}(\theta_p, \theta_q)$ and $D_{f^{*},\tau}^{(\alpha)}(\theta_p, \theta_q)$ reflects a conjugacy in the ρ- and τ-scalings of $p(\zeta|\theta)$. Corollary 7 says that the conjugacy in the connection pair $\Gamma \leftrightarrow \Gamma^{*}$ reflects, in addition to the referential duality $\theta_p \leftrightarrow \theta_q$, the representational duality between the ρ-scaling and the τ-scaling of a ζ-function:
$$ \Gamma_{ij,k}^{*(\alpha)}(\theta) = \tilde{\Gamma}_{ij,k}^{(\alpha)}(\theta) $$

3.1.2. Example: The Parametric ( α , β ) -Manifold

We have introduced the two-parameter family of divergence functionals $D^{(\alpha,\beta)}(p,q)$ in Section 2.4. Now, pulling back to $\tilde{\mathcal{M}}_{\theta}$ (or to $\mathcal{M}_{\theta}$), we have the two-parameter family of divergence functions $D^{(\alpha,\beta)}(\theta_p, \theta_q)$ defined by:
$$ D^{(\alpha,\beta)}(\theta_p, \theta_q) = D_{f,\rho}^{(\alpha,\beta)}(p(\cdot|\theta_p), p(\cdot|\theta_q)) $$
There are two ways to reduce to Amari's alpha-divergence (indexed by β here to avoid confusion): (i) take $\alpha = 1$ and $\rho(p) = l^{(\beta)}(p) \leftrightarrow \tau(p) = l^{(-\beta)}(p)$; or (ii) take $\alpha = -1$ and $\rho(p) = l^{(-\beta)}(p) \leftrightarrow \tau(p) = l^{(\beta)}(p)$.
Corollary 8. The metric and affine connections for the parametric $(\alpha,\beta)$-manifold are:
$$ g_{ij}(\theta) = \mathrm{E}_{p}\!\left[ \frac{\partial \log p}{\partial \theta^{i}} \frac{\partial \log p}{\partial \theta^{j}} \right] $$
$$ \Gamma_{ij,k}^{(\alpha,\beta)}(\theta) = \mathrm{E}_{p}\!\left[ \left( \frac{\partial^{2} \log p}{\partial \theta^{i} \partial \theta^{j}} + \frac{1-\alpha\beta}{2}\, \frac{\partial \log p}{\partial \theta^{i}} \frac{\partial \log p}{\partial \theta^{j}} \right) \frac{\partial \log p}{\partial \theta^{k}} \right] $$
$$ \Gamma_{ij,k}^{*(\alpha,\beta)}(\theta) = \mathrm{E}_{p}\!\left[ \left( \frac{\partial^{2} \log p}{\partial \theta^{i} \partial \theta^{j}} + \frac{1+\alpha\beta}{2}\, \frac{\partial \log p}{\partial \theta^{i}} \frac{\partial \log p}{\partial \theta^{j}} \right) \frac{\partial \log p}{\partial \theta^{k}} \right] $$
Proof. Direct substitution of the expressions for $\rho(p)$ and $\tau(p)$. ⋄
This two-parameter family of affine connections, $\Gamma_{ij,k}^{(\alpha,\beta)}(\theta)$, indexed now by the numerical product $\alpha\beta \in [-1,1]$, is actually the alpha-connection proper (i.e., the one-parameter family of its generic form; see [29]):
$$ \Gamma_{ij,k}^{(\alpha,\beta)}(\theta) = \Gamma_{ij,k}^{(-\alpha,-\beta)}(\theta) $$
with biduality compactly expressed as:
$$ \Gamma_{ij,k}^{*(\alpha,\beta)}(\theta) = \Gamma_{ij,k}^{(-\alpha,\beta)}(\theta) = \Gamma_{ij,k}^{(\alpha,-\beta)}(\theta) $$

3.2. Affine Embedded Submanifold

We now define the notion of ρ-affinity. A parametric model, $p(\zeta|\theta)$, is said to be ρ-affine if its ρ-representation can be embedded into a finite-dimensional affine space, i.e., if there exists a set of linearly independent functions $\lambda_{i}(\zeta)$ over the same support, $X$, such that:
$$ \rho(p(\zeta|\theta)) = \sum_{i} \theta^{i} \lambda_{i}(\zeta) $$
As noted in Section 3.1.1, the parameter $\theta = [\theta^{1}, \cdots, \theta^{n}] \in \Theta$ is its natural parameter.
For any measurable function, $p(\zeta)$, the projection of its τ-representation onto the functions $\lambda_{i}(\zeta)$,
$$ \eta_{i} = \int_{X} \tau(p(\zeta))\, \lambda_{i}(\zeta)\, d\mu $$
forms a vector $\eta = [\eta_{1}, \cdots, \eta_{n}] \in \mathbb{R}^{n}$. We call η the expectation parameter of $p(\zeta)$ and the functions $\lambda(\zeta) = [\lambda_{1}(\zeta), \cdots, \lambda_{n}(\zeta)]$ the affine basis functions.
The above notion of ρ-affinity is a generalization of α-affine manifolds [1,2], where the ρ- and τ-representations are just the α- and (−α)-representations, respectively. Note that elements of a ρ-affine manifold need not be probability models; rather, after denormalization, probability models can become ρ-affine. The issue of normalization will be discussed in Section 5.

3.2.1. Biorthogonality of Natural and Expectation Parameters

Proposition 9. When a parametric model is ρ-affine,
(i)
the function:
$$ \Phi(\theta) = \int_{X} f(\rho(p(\zeta|\theta)))\, d\mu $$
is strictly convex;
(ii)
the divergence functional, $D_{f,\rho}^{(\alpha)}(p,q)$, takes the form of the divergence function:
$$ D_{\Phi}^{(\alpha)}(\theta_p, \theta_q) = \frac{4}{1-\alpha^{2}} \left( \frac{1-\alpha}{2}\, \Phi(\theta_p) + \frac{1+\alpha}{2}\, \Phi(\theta_q) - \Phi\!\left( \frac{1-\alpha}{2}\, \theta_p + \frac{1+\alpha}{2}\, \theta_q \right) \right) $$
(iii)
the metric tensor, the affine connections and the Riemann curvature tensor take the forms:
$$ g_{ij}(\theta) = \Phi_{ij}; \qquad \Gamma_{ij,k}^{(\alpha)}(\theta) = \frac{1-\alpha}{2}\, \Phi_{ijk} = \Gamma_{ij,k}^{*(-\alpha)}(\theta) $$
$$ R_{ij\mu\nu}^{(\alpha)}(\theta) = \frac{1-\alpha^{2}}{4} \sum_{l,k} \left( \Phi_{il\nu}\Phi_{jk\mu} - \Phi_{il\mu}\Phi_{jk\nu} \right) \Phi^{lk} = R_{ij\mu\nu}^{*(\alpha)}(\theta) $$
Here, $\Phi_{ij}, \Phi_{ijk}$ denote, respectively, the second and third partial derivatives of $\Phi(\theta)$:
$$ \Phi_{ij} = \frac{\partial^{2} \Phi(\theta)}{\partial \theta^{i} \partial \theta^{j}}, \qquad \Phi_{ijk} = \frac{\partial^{3} \Phi(\theta)}{\partial \theta^{i} \partial \theta^{j} \partial \theta^{k}} $$
and $\Phi^{ij}$ is the matrix inverse of $\Phi_{ij}$.
Proof. See Section 4.
Recall that, for a convex function of several variables, $\Phi: \mathbb{R}^{n} \to \mathbb{R}$, its convex conjugate, $\Phi^{*}$, is defined through the Legendre–Fenchel transform:
$$ \Phi^{*}(\eta) = \langle \eta, (\nabla\Phi)^{-1}(\eta) \rangle - \Phi((\nabla\Phi)^{-1}(\eta)) $$
where $\nabla\Phi$ stands for the gradient (sub-differential) of Φ and $\langle \cdot, \cdot \rangle$ denotes the standard inner product. It can be shown that the function $\Phi^{*}$ is also convex and has Φ as its conjugate, $(\Phi^{*})^{*} = \Phi$. The Hessian (second derivatives) of a strictly convex function (Φ and $\Phi^{*}$) is positive-definite. The Legendre–Fenchel inequality Equation (131) can be expressed using the dual variables, $\theta, \eta$, as:
$$ \Phi(\theta) + \Phi^{*}(\eta) - \sum_{i} \eta_{i} \theta^{i} \geq 0 $$
where equality holds, if and only if:
$$ \theta = (\nabla\Phi^{*})(\eta) = (\nabla\Phi)^{-1}(\eta) \iff \eta = (\nabla\Phi)(\theta) = (\nabla\Phi^{*})^{-1}(\theta) $$
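As a minimal numerical sketch of this conjugacy (the Bernoulli cumulant generating function is our assumed example, not one used in the text), take $\Phi(\theta) = \log(1 + e^{\theta})$, whose conjugate is the negative entropy $\Phi^{*}(\eta) = \eta \log \eta + (1-\eta) \log(1-\eta)$ on $0 < \eta < 1$:
```python
import numpy as np

Phi = lambda t: np.log1p(np.exp(t))                  # assumed convex potential
Phi_star = lambda e: e * np.log(e) + (1 - e) * np.log(1 - e)
grad_Phi = lambda t: 1.0 / (1.0 + np.exp(-t))        # eta = dPhi/dtheta (sigmoid)

theta = 1.2
for eta in (0.2, 0.5, grad_Phi(theta)):
    # Fenchel inequality: Phi(theta) + Phi*(eta) - theta*eta >= 0
    print(eta, Phi(theta) + Phi_star(eta) - theta * eta)
# equality holds only at eta = grad_Phi(theta); the gradient maps invert:
eta = grad_Phi(theta)
print(np.log(eta / (1.0 - eta)))                     # (grad Phi*)(eta) == theta
```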
Corollary 10. For a ρ-affine manifold:
(i)
define
$$ \tilde{\Phi}(\theta) = \int_{X} f^{*}(\tau(p(\zeta|\theta)))\, d\mu $$
then $\Phi^{*}(\eta) \equiv \tilde{\Phi}((\nabla\Phi)^{-1}(\eta))$ is the convex (Legendre–Fenchel) conjugate of $\Phi(\theta)$;
(ii)
the pair of convex functions, $\Phi, \Phi^{*}$, form a pair of “potentials” to induce $\eta, \theta$:
$$ \frac{\partial \Phi(\theta)}{\partial \theta^{i}} = \eta_{i} \iff \frac{\partial \Phi^{*}(\eta)}{\partial \eta_{i}} = \theta^{i} $$
(iii)
the expectation parameter, $\eta \in \Xi$, and the natural parameter, $\theta \in \Theta$, form biorthogonal coordinates:
$$ \frac{\partial \eta_{i}}{\partial \theta^{j}} = g_{ij}(\theta) \iff \frac{\partial \theta^{i}}{\partial \eta_{j}} = \tilde{g}^{ij}(\eta) $$
where $\tilde{g}^{ij}(\eta)$ is the matrix inverse of $g_{ij}(\theta)$, the metric tensor of the parametric (statistical) manifold.
Proof. See Section 4.
Note that, while the function $\Phi(\theta)$ can be viewed as the generalized cumulant generating function (or partition function), the function $\Phi^{*}(\eta)$ is the generalized entropy function. For an exponential family, the two are well known to form a one-to-one correspondence; either can be used to index a density function of the exponential family.
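A concrete sketch of Corollary 10, assuming (for illustration) the exponential case $f(t) = e^{t}$, $\rho = \log$, $\tau = \mathrm{id}$ (so that $f^{*}(t) = t \log t - t$), a counting measure on the two-point space $\{0, 1\}$ and the single affine basis function $\lambda(\zeta) = \zeta$:
```python
import numpy as np

theta = 0.7
zeta = np.array([0.0, 1.0])                 # two-point sample space {0, 1}
p = np.exp(theta * zeta)                    # rho(p) = log p = theta * lambda(zeta)

Phi = p.sum()                               # Phi(theta) = sum_zeta exp(theta*zeta)
eta = (p * zeta).sum()                      # eta = sum_zeta tau(p) * lambda(zeta)

h = 1e-6                                    # (ii): eta = dPhi/dtheta
dPhi = (np.exp((theta + h) * zeta).sum() - np.exp((theta - h) * zeta).sum()) / (2 * h)
print(eta, dPhi)                            # agree

# (i): Phi~(theta) = sum_zeta f*(tau(p)) with f*(t) = t*log(t) - t equals the
# Legendre conjugate Phi*(eta) = theta*eta - Phi(theta) at eta = dPhi/dtheta
Phi_tilde = (p * np.log(p) - p).sum()
print(Phi_tilde, theta * eta - Phi)         # agree

# (iii): d eta/d theta equals the metric g = d^2 Phi/d theta^2 = e^theta here
deta = ((np.exp((theta + h) * zeta) * zeta).sum()
        - (np.exp((theta - h) * zeta) * zeta).sum()) / (2 * h)
print(deta, np.exp(theta))                  # agree
```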

3.2.2. Dually Flat Affine Manifolds

When $\alpha = \pm 1$, part (iii) of Proposition 9 dictates that all components of the curvature tensor vanish, i.e., $R_{ij\mu\nu}^{(\pm 1)}(\theta) = 0$. In this case, there exists a coordinate system under which either $\Gamma_{ij,k}^{*(1)}(\theta) = 0$ or $\Gamma_{ij,k}^{(1)}(\theta) = 0$. This is the well-studied “dually flat” parametric statistical manifold [1,2,20], on which divergence functions have a unique, canonical form.
Corollary 11. When $\alpha \to \pm 1$, $D_{\Phi}^{(\alpha)}$ reduces to the Bregman divergence Equation (17):
$$ D_{\Phi}^{(-1)}(\theta_p, \theta_q) = D_{\Phi}^{(1)}(\theta_q, \theta_p) = \Phi(\theta_q) - \Phi(\theta_p) - \langle \theta_q - \theta_p, \nabla\Phi(\theta_p) \rangle = B_{\Phi}(\theta_q, \theta_p) $$
$$ D_{\Phi}^{(1)}(\theta_p, \theta_q) = D_{\Phi}^{(-1)}(\theta_q, \theta_p) = \Phi(\theta_p) - \Phi(\theta_q) - \langle \theta_p - \theta_q, \nabla\Phi(\theta_q) \rangle = B_{\Phi}(\theta_p, \theta_q) $$
or equivalently, to the canonical divergence functions:
$$ D_{\Phi}^{(1)}(\theta_p, (\nabla\Phi)^{-1}(\eta_q)) = \Phi(\theta_p) + \Phi^{*}(\eta_q) - \langle \theta_p, \eta_q \rangle \equiv A_{\Phi}(\theta_p, \eta_q) $$
$$ D_{\Phi}^{(-1)}((\nabla\Phi)^{-1}(\eta_p), \theta_q) = \Phi(\theta_q) + \Phi^{*}(\eta_p) - \langle \eta_p, \theta_q \rangle \equiv A_{\Phi^{*}}(\eta_p, \theta_q) $$
Proof. Immediate by substitution using the definition Equation (131). ⋄
The canonical divergence, A Φ ( θ p , η q ) , based on the Legendre–Fenchel inequality was introduced by [2,20], where the functions, Φ , Φ * , the cumulant generating functions of an exponential family, were referred to as dual “potential” functions. This form Equation (139) is “canonical”, because it is uniquely specified in a dually flat manifold using a pair of biorthogonal coordinates.
We point out that there are two kinds of duality associated with the divergence defined on a dually flat statistical manifold: one between $D_{\Phi}^{(1)} \leftrightarrow D_{\Phi}^{(-1)}$ and between $D_{\Phi^{*}}^{(1)} \leftrightarrow D_{\Phi^{*}}^{(-1)}$; the other between $D_{\Phi}^{(1)} \leftrightarrow D_{\Phi^{*}}^{(-1)}$ and between $D_{\Phi}^{(-1)} \leftrightarrow D_{\Phi^{*}}^{(1)}$. The first kind is related to the duality in the choice of the reference versus the comparison status of the two points ($\theta_p$ versus $\theta_q$) in computing the value of the divergence and is, hence, called “referential duality”. The second kind is related to the duality in the choice of the representation of a point as a vector in the parameter versus the gradient space (θ versus η) in the expression of the divergence function and is, hence, called “representational duality”. More concretely:
$$ D_{\Phi}^{(1)}(\theta_p, \theta_q) = D_{\Phi^{*}}^{(1)}(\nabla\Phi(\theta_q), \nabla\Phi(\theta_p)) = D_{\Phi^{*}}^{(-1)}(\nabla\Phi(\theta_p), \nabla\Phi(\theta_q)) = D_{\Phi}^{(-1)}(\theta_q, \theta_p) $$
The biduality is compactly reflected in the canonical divergence as:
$$ A_{\Phi}(\theta_p, \eta_q) = A_{\Phi^{*}}(\eta_q, \theta_p) $$
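A minimal sketch of these relations, reusing the assumed Bernoulli potential $\Phi(\theta) = \log(1 + e^{\theta})$ from the sketch above; for this exponential-family potential, the common value is also the Kullback-Leibler divergence between the corresponding Bernoulli distributions:
```python
import numpy as np

Phi = lambda t: np.log1p(np.exp(t))                    # assumed convex potential
Phi_star = lambda e: e * np.log(e) + (1 - e) * np.log(1 - e)
grad_Phi = lambda t: 1.0 / (1.0 + np.exp(-t))          # eta = grad Phi = sigmoid

def A_can(theta_p, eta_q):
    # canonical divergence A_Phi(theta_p, eta_q)
    return Phi(theta_p) + Phi_star(eta_q) - theta_p * eta_q

def B_breg(x, y):
    # Bregman divergence B_Phi(x, y)
    return Phi(x) - Phi(y) - (x - y) * grad_Phi(y)

theta_p, theta_q = 0.4, 1.5
eta_q = grad_Phi(theta_q)
print(A_can(theta_p, eta_q), B_breg(theta_p, theta_q))  # equal, per Corollary 11
# biduality: A_Phi(theta_p, eta_q) = A_{Phi*}(eta_q, theta_p), term by term,
# since (Phi*)* = Phi
print(Phi_star(eta_q) + Phi(theta_p) - eta_q * theta_p)
# the same number is the KL divergence between the Bernoulli distributions
# with success probabilities sigmoid(theta_q) and sigmoid(theta_p)
sp, sq = grad_Phi(theta_p), grad_Phi(theta_q)
print(sq * np.log(sq / sp) + (1 - sq) * np.log((1 - sq) / (1 - sp)))
```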

4. Proofs

Proof of the Eguchi Relations, Equations (26)–(28). Assume the divergence function, $D$, is Fréchet differentiable up to third order. For any two points, $p, q \in \mathcal{M}$, and two vector fields, $u, v \in \Sigma(\mathcal{M})$, let us denote $G_{(p,q)}(u,v) = (d_{u})_{p} (d_{v})_{q} D(p,q)$, the mixed second derivative, which is bilinear in $u, v$. Then, $\langle u, v \rangle_{p} = -G_{(p,p)}(u,v)$. Suppressing the dependency on $(u,v)$ in G, we take the directional derivative with respect to $w \in \Sigma(\mathcal{M})$:
$$ d_{w} G_{(p,q)} = (d_{w})_{p} G_{(p,q)} + (d_{w})_{q} G_{(p,q)} $$
and then evaluate at $q = p$ to obtain:
$$ d_{w} \langle u, v \rangle = \langle \nabla_{w} u, v \rangle + \langle u, \nabla_{w}^{*} v \rangle $$
which is the defining Equation (22) for conjugate connections. Therefore, what remains to be checked is whether ∇ as defined by:
$$ \langle \nabla_{w} u, v \rangle = -(d_{w})_{p} (d_{u})_{p} \tilde{G}_{(p,q)}(v) \big|_{p=q} $$
indeed transforms as an affine connection (and similarly for $\nabla^{*}$), where $\tilde{G}_{(p,q)}(v) = (d_{v})_{q} D(p,q)$ is linear in v. It is easy to verify that $\nabla_{w}(u_{1} + u_{2}) = \nabla_{w} u_{1} + \nabla_{w} u_{2}$, $\nabla_{w_{1}+w_{2}} u = \nabla_{w_{1}} u + \nabla_{w_{2}} u$ and $\nabla_{fw} u = f \nabla_{w} u$ for $f \in \mathcal{F}$. We need only to prove:
$$ \langle \nabla_{w}(f u), v \rangle = f \langle \nabla_{w} u, v \rangle + \langle u, v \rangle\, d_{w} f $$
which is immediately obtained from:
$$ (d_{w})_{p} (f\, d_{u})_{p} \tilde{G}_{(p,q)} = f \cdot (d_{w})_{p} (d_{u})_{p} \tilde{G}_{(p,q)} + (d_{u})_{p} (\tilde{G}_{(p,q)})\, (d_{w} f)_{p} $$
Proof of Proposition 1. We only need to prove that, for a strictly convex function, $f: \mathbb{R} \to \mathbb{R}$, and $\alpha \in \mathbb{R}$, the quantity:
$$ d_{f}^{(\alpha)}(\gamma, \delta) = \frac{4}{1-\alpha^{2}} \left( \frac{1-\alpha}{2}\, f(\gamma) + \frac{1+\alpha}{2}\, f(\delta) - f\!\left( \frac{1-\alpha}{2}\, \gamma + \frac{1+\alpha}{2}\, \delta \right) \right) $$
is non-negative for all real numbers, $\gamma, \delta \in \mathbb{R}$, with $d_{f}^{(\alpha)}(\gamma, \delta) = 0$, if and only if $\gamma = \delta$.
Clearly, for any $\alpha \in (-1, 1)$, $1 - \alpha^{2} > 0$; so, from the fundamental convex inequality Equation (46), $d_{f}^{(\alpha)}(\gamma, \delta) \geq 0$ for all $\gamma, \delta \in \mathbb{R}$, with equality holding, if and only if $\gamma = \delta$. When $\alpha > 1$, we rewrite $\delta = \frac{2}{\alpha+1}\, \lambda + \frac{\alpha-1}{\alpha+1}\, \gamma$, where $\lambda = \frac{1-\alpha}{2}\gamma + \frac{1+\alpha}{2}\delta$, as a convex mixture of λ and γ (i.e., $\frac{2}{\alpha+1} = \frac{1-\alpha'}{2}$, $\frac{\alpha-1}{\alpha+1} = \frac{1+\alpha'}{2}$ with $\alpha' \in (-1,1)$). Strict convexity of f guarantees:
$$ \frac{2}{\alpha+1}\, f(\lambda) + \frac{\alpha-1}{\alpha+1}\, f(\gamma) \geq f(\delta) $$
or, moving the left-hand side to the right-hand side:
$$ \frac{2}{1+\alpha} \left( \frac{1-\alpha}{2}\, f(\gamma) + \frac{1+\alpha}{2}\, f(\delta) - f\!\left( \frac{1-\alpha}{2}\, \gamma + \frac{1+\alpha}{2}\, \delta \right) \right) \leq 0 $$
This, along with $1 - \alpha^{2} < 0$, proves the non-negativity $d_{f}^{(\alpha)}(\gamma, \delta) \geq 0$ for $\alpha > 1$, with equality holding, if and only if $\lambda = \gamma$, i.e., $\gamma = \delta$. The case of $\alpha < -1$ is similarly proven by applying Equation (47) to the three points, γ, λ and their convex mixture $\delta = \frac{2}{1-\alpha}\, \lambda + \frac{-1-\alpha}{1-\alpha}\, \gamma$. Finally, the continuity of $d_{f}^{(\alpha)}(\gamma, \delta)$ with respect to α guarantees that the above claim is also valid in the case of $\alpha = \pm 1$. ⋄
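A quick numerical sketch of Proposition 1, with the strictly convex function $f(t) = e^{t}$ assumed for illustration:
```python
import numpy as np

def d_f(alpha, gamma, delta, f=np.exp):
    # the quantity of Proposition 1 for a strictly convex f (f = exp assumed)
    mix = 0.5 * (1 - alpha) * gamma + 0.5 * (1 + alpha) * delta
    return 4.0 / (1.0 - alpha**2) * (0.5 * (1 - alpha) * f(gamma)
                                     + 0.5 * (1 + alpha) * f(delta) - f(mix))

for alpha in (-3.0, -0.5, 0.5, 3.0):     # values inside and outside (-1, 1)
    print(alpha, d_f(alpha, 0.2, 1.5), d_f(alpha, 1.5, 1.5))
    # first value is strictly positive in every case; second is exactly 0
```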
Proof of Proposition 2. With respect to Equation (54), note that $(d_{u})_{p}$ means that the functional derivative is with respect to p only (the point q is treated as fixed):
$$ (d_{u})_{p} D_{f,\rho}^{(\alpha)}(p,q) = \frac{2}{1+\alpha} \int_{X} \left( f'(\rho(p)) - f'\!\left( \frac{1-\alpha}{2}\rho(p) + \frac{1+\alpha}{2}\rho(q) \right) \right) \rho'(p)\, u\, d\mu $$
Applying the functional derivative $(d_{v})_{q}$, now with respect to q only, to the above equation yields:
$$ (d_{v})_{q} (d_{u})_{p} D_{f,\rho}^{(\alpha)}(p,q) = -\int_{X} f''\!\left( \frac{1-\alpha}{2}\rho(p) + \frac{1+\alpha}{2}\rho(q) \right) \rho'(p)\, \rho'(q)\, u\, v\, d\mu $$
Setting $p = q$ and invoking Equation (26) yields Equation (82) with Equation (83).
Next, applying $(d_{w})_{p}$ to Equation (152), and realizing that $u, v$ are both vector fields:
$$ (d_{w})_{p} \left( (d_{v})_{q} (d_{u})_{p} D_{f,\rho}^{(\alpha)}(p,q) \right) = -\int_{X} \frac{1-\alpha}{2}\, f'''\!\left( \frac{1-\alpha}{2}\rho(p) + \frac{1+\alpha}{2}\rho(q) \right) (\rho'(p))^{2}\, \rho'(q)\, u\, v\, w\, d\mu - \int_{X} f''\!\left( \frac{1-\alpha}{2}\rho(p) + \frac{1+\alpha}{2}\rho(q) \right) \rho'(q)\, v\, \big( \rho''(p)\, u\, w + \rho'(p)\, (d_{w} u) \big)\, d\mu $$
Setting $p = q$, invoking Equation (27) and:
$$ g(\nabla_{w} u, v) = \int_{X} f''(\rho(p))\, (\rho'(p))^{2}\, (\nabla_{w} u)\, v(\zeta|p)\, d\mu $$
and realizing that $v(\zeta|p)$ can be arbitrary, we have:
$$ f''(\rho)\, (\rho')^{2}\, \nabla_{w}^{(\alpha)} u = \frac{1-\alpha}{2}\, f'''(\rho)\, (\rho')^{3}\, u\, w + f''(\rho)\, \rho'\, \big( \rho''\, u\, w + \rho'\, (d_{w} u) \big) $$
where we have short-handed ρ for $\rho(p(\zeta))$. Remember that $\nabla_{w}^{(\alpha)} u$ is a ζ-function; the above equation yields:
$$ \nabla_{w}^{(\alpha)} u = d_{w} u + \frac{1-\alpha}{2}\, \frac{f'''(\rho)}{f''(\rho)}\, \rho'\, u\, w + \frac{\rho''}{\rho'}\, u\, w = d_{w} u + \left( \frac{1-\alpha}{2}\, \frac{f'''(\rho)}{f''(\rho)}\, \rho' + \frac{\rho''}{\rho'} \right) u\, w $$
Thus, we obtain Equation (84) with Equation (85). The expression for $\nabla^{*(\alpha)}$ is obtained analogously. ⋄
Proof of Corollary 3. From the identities:
$$ f''(\rho) = \frac{\tau'}{\rho'}, \qquad f'''(\rho) = \frac{\rho'\tau'' - \rho''\tau'}{(\rho')^{3}} $$
we obtain Equations (87) and (88) after substitution. ⋄
Proof of Proposition 4. We first derive a general formula for the Riemann curvature tensor for the infinite-dimensional manifold, since the one given by a popular textbook ([46], p. 226) appears to miss some terms. From Equation (42):
$$ d_{u}(\nabla_{v} w) = d_{u}(d_{v} w) + B(d_{u} v, w) + B(v, d_{u} w) + (d_{u} B)(v, w) $$
so that:
$$ \nabla_{u}(\nabla_{v} w) = d_{u}(\nabla_{v} w) + B(u, \nabla_{v} w) = \big( d_{u}(d_{v} w) + B(d_{u} v, w) + B(v, d_{u} w) + (d_{u} B)(v, w) \big) + \big( B(u, d_{v} w) + B(u, B(v, w)) \big) $$
Here, $d_{u} B = B_{u}$ refers to the derivative on the B-form itself and not on its $v, w$ arguments. The expression for $\nabla_{v}(\nabla_{u} w)$ simply exchanges $u \leftrightarrow v$ in the above. Now:
$$ \nabla_{[u,v]} w = d_{[u,v]} w + B([u,v], w) $$
where $[u,v] = d_{u} v - d_{v} u$ is a vector field, such that:
$$ d_{[u,v]} w = d_{u}(d_{v} w) - d_{v}(d_{u} w) $$
Substituting these into Equation (44), we get a general expression of the Riemann curvature tensor in the infinite-dimensional setting:
$$ R(u,v,w) = B(u, B(v,w)) - B(v, B(u,w)) + (d_{u} B)(v,w) - (d_{v} B)(u,w) $$
The expression for $T(u,v)$ in Equation (46) becomes:
$$ T(u,v) = B(u,v) - B(v,u) $$
In the current case, B evaluated at $p(\zeta)$ is the bilinear form:
$$ B(u,v) = b^{(\alpha)}(p(\zeta))\, u(\zeta|p)\, v(\zeta|p) $$
Substituting this into the above, and realizing that $(d_{u} B)(v,w)$ is simply $(b^{(\alpha)})'\, u\, v\, w$, we immediately have $R^{(\alpha)}(u,v,w) = 0$, as well as $T^{(\alpha)}(u,v) = 0$. ⋄
Proof of Proposition 6. Given Equation (107) as the tangent vector fields for parametric models with holonomic coordinates θ, we note that:
$$ d_{u} \rho = \rho'(p)\, u = \rho'(p)\, \frac{\partial p}{\partial \theta^{i}} = \frac{\partial \rho(p)}{\partial \theta^{i}} $$
$$ d_{w} \rho = \rho'(p)\, w = \rho'(p)\, \frac{\partial p}{\partial \theta^{j}} = \frac{\partial \rho(p)}{\partial \theta^{j}} $$
so Equation (108) follows. Next, from:
$$ d_{w} u = \frac{\partial^{2} p}{\partial \theta^{i} \partial \theta^{j}} $$
we have:
$$ \rho''(p)\, u\, w + \rho'(p)\, (d_{w} u) = \rho''(p)\, \frac{\partial p}{\partial \theta^{i}} \frac{\partial p}{\partial \theta^{j}} + \rho'(p)\, \frac{\partial^{2} p}{\partial \theta^{i} \partial \theta^{j}} = \frac{\partial}{\partial \theta^{i}} \left( \rho'(p)\, \frac{\partial p}{\partial \theta^{j}} \right) = \frac{\partial}{\partial \theta^{i}} \frac{\partial \rho}{\partial \theta^{j}} = \frac{\partial^{2} \rho}{\partial \theta^{i} \partial \theta^{j}} $$
Observing $\Gamma_{ij,k} = \langle \nabla_{w} u, v \rangle$, expression (109) results after substituting the above derived expressions into Equation (84) with Equation (85). ⋄
Proof of Corollary 7. Applying Equations (166) and (167) to Equation (82) with Equation (87) immediately yields Equation (112). Next, from Corollary 3:
$$ b^{(\alpha)}(t) = \frac{1-\alpha}{2}\, \frac{\tau''(t)}{\tau'(t)} + \frac{1+\alpha}{2}\, \frac{\rho''(t)}{\rho'(t)} $$
It follows that:
$$ \Gamma_{ij,k}^{(\alpha)} = \langle \nabla_{w}^{(\alpha)} u, v \rangle = \int_{X} \left( \frac{1-\alpha}{2} \big( \rho'\tau''\, u\, w + \rho'\tau'\, d_{w} u \big) + \frac{1+\alpha}{2} \big( \rho''\tau'\, u\, w + \rho'\tau'\, d_{w} u \big) \right) v\, d\mu = \int_{X} \left( \rho' v\, \frac{1-\alpha}{2}\, d_{w}(d_{u} \tau) + \tau' v\, \frac{1+\alpha}{2}\, d_{w}(d_{u} \rho) \right) d\mu = \int_{X} \left( \frac{1-\alpha}{2}\, (d_{v} \rho)\, d_{w}(d_{u} \tau) + \frac{1+\alpha}{2}\, (d_{v} \tau)\, d_{w}(d_{u} \rho) \right) d\mu $$
Note that, given the holonomic coordinates of Equation (107):
$$ d_{w}(d_{u} \rho) = \frac{\partial}{\partial \theta^{j}} \frac{\partial \rho(p)}{\partial \theta^{i}} = \frac{\partial^{2} \rho(p)}{\partial \theta^{i} \partial \theta^{j}} $$
Substituting into Equation (84) with Equation (88) yields Equations (113) and (114). ⋄
Proof of Proposition 9. The assumption Equation (124) implies that $\partial\rho/\partial\theta^{i} = \lambda_{i}(\zeta)$, so from Equation (108):
$$ \frac{\partial^{2} \Phi(\theta)}{\partial \theta^{i} \partial \theta^{j}} = \int_{X} f''(\rho(p(\zeta|\theta)))\, \lambda_{i}(\zeta)\, \lambda_{j}(\zeta)\, d\mu $$
That the above expression is positive-definite is seen by observing:
$$ \sum_{i} \sum_{j} \frac{\partial^{2} \Phi(\theta)}{\partial \theta^{i} \partial \theta^{j}}\, \xi^{i} \xi^{j} = \int_{X} f''(\rho(p(\zeta|\theta))) \left( \sum_{i} \lambda_{i}(\zeta)\, \xi^{i} \right)^{2} d\mu > 0 $$
for any $\xi = [\xi^{1}, \cdots, \xi^{n}] \in \mathbb{R}^{n}$, due to the linear independence of the $\lambda_{i}$ components and the strict convexity of f. Hence, $\Phi(\theta)$ is strictly convex in θ, proving (i). An immediate consequence is that expression (127) is non-negative and vanishes, if and only if $\theta_p = \theta_q$. This establishes (ii), i.e., $D_{\Phi}^{(\alpha)}(\theta_p, \theta_q)$ is a divergence function. Part (iii) follows from a straightforward application of the Eguchi relations, Equations (29)–(31). ⋄
Proof of Corollary 10. First, since $f'(\rho(t)) = \tau(t)$, we have the identity:
$$ f^{*}(\tau(p(\zeta|\theta))) + f(\rho(p(\zeta|\theta))) = f'(\rho(p(\zeta|\theta)))\, \rho(p(\zeta|\theta)) $$
From (126), taking a derivative with respect to $\theta^{i}$, while noting that $p(\zeta|\theta)$ satisfies Equation (124), gives:
$$ \frac{\partial \Phi(\theta)}{\partial \theta^{i}} = \int_{X} f'\!\left( \sum_{j} \theta^{j} \lambda_{j}(\zeta) \right) \lambda_{i}(\zeta)\, d\mu = \int_{X} \tau(p(\zeta|\theta))\, \lambda_{i}(\zeta)\, d\mu = \eta_{i} $$
and that:
$$ \sum_{i} \theta^{i}\, \frac{\partial \Phi(\theta)}{\partial \theta^{i}} - \Phi(\theta) = \int_{X} \left( f'\!\left( \sum_{j} \theta^{j} \lambda_{j}(\zeta) \right) \sum_{i} \theta^{i} \lambda_{i}(\zeta) - f\!\left( \sum_{j} \theta^{j} \lambda_{j}(\zeta) \right) \right) d\mu = \int_{X} f^{*}(\tau(p(\zeta|\theta)))\, d\mu = \tilde{\Phi}(\theta) $$
It follows from Equation (131) that $\Phi^{*}$, as defined in (i), is the conjugate of Φ, and that the relation in (ii) is the basic Legendre–Fenchel duality. Finally, the biorthogonality of η and θ, as expressed by (iii), also becomes evident on account of (ii). ⋄

5. Discussions

This paper constructs a family of divergence functionals, induced by any smooth and strictly convex function, to measure the non-symmetric “distance” between two measurable functions defined on a sample space. Subject to an arbitrary monotone scaling, the divergence functional induces a Riemannian manifold with a metric tensor generalizing the conventional Fisher information and a pair of conjugate connections generalizing the conventional $(\pm\alpha)$-connections. Such manifolds manifest biduality: referential duality (in choosing a reference point) and representational duality (in choosing a monotone scale). The $(\alpha,\beta)$-divergence we gave as an example of this bidualistic structure extends the α-divergence, with α and β representing referential duality and representational duality, respectively. It induces the conventional Fisher metric and the conventional α-connections (with $\alpha\beta$ as a single parameter). Finally, for the ρ-affine submanifold, a pair of conjugate potentials exists, inducing the natural and expectation parameters as biorthogonal coordinates on the manifold.
Our approach demonstrates an intimate connection between convex analysis and information geometry. The divergence functionals (and the divergence functions in the finite-dimensional case) are associated with the fundamental convex inequality of a convex function, $f: \mathbb{R} \to \mathbb{R}$ (or $\Phi: \mathbb{R}^{n} \to \mathbb{R}$), with the convex mixture coefficient serving as the α-parameter in the induced geometry. Referential duality is associated with $\alpha \leftrightarrow -\alpha$, and representational duality is associated with the convex conjugacy $f \leftrightarrow f^{*}$ (or $\Phi \leftrightarrow \Phi^{*}$). Thus, our analysis reveals that the $e/m$-duality and the $(\pm 1)$-duality, which have been used almost interchangeably in the current literature, are not the same thing!
This kind of referential duality (originating from the non-symmetric status of a referent and a comparison object), while common in psychological and behavioral contexts [54,55], has always been implicitly acknowledged in statistics. Formal investigation of such non-symmetry between a reference probability distribution and a comparison probability distribution in constructing divergence functions leads to the framework of preferred point geometry [56,57,58,59,60,61]. Preferred point geometry reformulates Amari's [20] expected geometry and Barndorff-Nielsen's [3] observed geometry by studying the product manifold, $\mathcal{M}_{\theta} \times \mathcal{M}_{\theta}$, formed by an ordered pair of probability densities, $(p,q)$, and defining a family of Riemannian metrics on the product manifold. The precise relation of the preferred point approach to our approach to referential duality needs future exploration.
With respect to representational duality, it is worth mentioning the field of affine differential geometry, which studies the hypersurface realization of a dual Riemannian manifold involving a pair of conjugate connections (see [27,28]). Affine immersion of statistical manifolds was investigated in [31,32,33,34]. [62,63,64,65] further illuminated a conformal structure that arises when the (normalized) probability density functions undergo the $l^{(\alpha)}$-embedding. Such an embedding appears in the context of Tsallis statistics, where Shannon entropy and Kullback-Leibler cross-entropy (divergence) are generalized to a one-parameter family of entropies and cross-entropies (see, e.g., [40]). We demonstrated ([48], and here) that the ρ-affine manifold (Section 3.2) carries an α-Hessian structure [26], a generalization of the Hessian manifold [66,67]. It remains to be illuminated whether a conformal structure arises for ρ-affine probability density functions after normalization.
It should be noted that, while any divergence function uniquely determines a statistical manifold (in the broad sense of [29]), the converse is not true. Though a statistical manifold equipped with an arbitrary metric tensor and a pair of conjugate, torsion-free connections always admits a divergence function [68], the divergence is not unique in general, except when the connections are dually flat, in which case it is uniquely determined as the canonical divergence. In this sense, there is nothing special about our use of the $D^{(\alpha)}$-divergence, apart from the fact that it generalizes familiar divergences (including the α-divergence in particular). Rather, the $D^{(\alpha)}$-divergence is merely a vehicle for us to derive the underlying dual Riemannian geometry. It remains to be elucidated why the convex mixture parameter turns out to be the α-parameter in the family of connections of the induced geometry. It seems that our generalizations of the Fisher metric and of the conjugate α-connections hinge on this miraculous identification. The generalization from α-affinity/embedding to ρ-affinity/embedding, with the resulting generalized biorthogonality between natural and expectation parameters, is akin to generalizing the $L^{p}$ space to the $L^{\Phi}$ (i.e., Orlicz) space, which is an entirely different matter. Future research will further clarify these fundamental relations between convexity, conjugacy and duality in non-parametric (and parametric) information geometry.

6. Conclusions

We constructed an extension of parametric information geometry to the non-parametric setting by studying the manifold $\mathcal{M}$ of non-parametric functions on a sample space (without positivity and normalization constraints). The generalized Fisher information and the α-connections on $\mathcal{M}$ are induced by an α-parameterized family of divergence functions, reflecting the fundamental convex inequality associated with any smooth and strictly convex function. Parametric models are recovered as submanifolds of $\mathcal{M}$. We also generalized Amari's α-embedding to affine submanifolds under an arbitrary monotone embedding and showed that the natural and expectation parameters of such a submanifold form biorthogonal coordinates and that the submanifold is dually flat for $\alpha = \pm 1$. Our analysis illuminates two different types of duality in information geometry, one concerning the referential status of a point (measurable function) expressed in the divergence function (“referential duality”) and the other concerning its representation under an arbitrary monotone scaling (“representational duality”).

Acknowledgments

This paper is based on work presented at the Second International Symposium on Information Geometry and Its Applications (IGAIA2), Tokyo, Japan, in 2005, which appeared in preliminary form as [47]. The author thanks two anonymous reviewers for constructive feedback. The author was supported by research grants NSF 0631541 and ARO W911NF-12-1-0163 during the revision and final production of this work.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: Oxford, UK, 2000. [Google Scholar]
  2. Amari, S. Differential Geometric Methods in Statistics; Springer-Verlag: New York, NY, USA, 1985. [Google Scholar]
  3. Barndorff-Nielsen, O.E. Parametric Statistical Models and Likelihood; Springer-Verlag: Heidelberg, Germany, 1988. [Google Scholar]
  4. Barndorff-Nielsen, O.E.; Cox, R.D.; Reid, N. The role of differential geometry in statistical theory. Int. Stat. Rev. 1986, 54, 83–96. [Google Scholar] [CrossRef]
  5. Kass, R.E. The geometry of asymptotic inference (with discussion). Stat. Sci. 1989, 4, 188–234. [Google Scholar] [CrossRef]
  6. Kass, R.E.; Vos, P.W. Geometric Foundation of Asymptotic Inference; John Wiley and Sons: New York, NY, USA, 1997. [Google Scholar]
  7. Murray, M.K.; Rice, J.W. Differential Geometry and Statistics; Chapman & Hall: London, UK, 1993. [Google Scholar]
  8. Amari, S.; Kumon, M. Estimation in the presence of infinitely many nuisance parameters — Geometry of estimating functions. Ann. Stat. 1988, 16, 1044–1068. [Google Scholar] [CrossRef]
  9. Henmi, M.; Matsuzoe, H. Geometry of pre-contrast functions and non-conservative estimating functions. In Proceedings of the International Workshop on Complex Structures, Integrability, and Vector Fields, Sofia, Bulgaria, 13–17 September 2010; Volume 1340, pp. 32–41.
  10. Matsuzoe, H.; Takeuchi, J.; Amari, S. Equiaffine structures on statistical manifolds and Bayesian statistics. Differ. Geom. Its Appl. 2006, 109, 567–578. [Google Scholar] [CrossRef]
  11. Takeuchi, J.; Amari, S. α-Parallel prior and its properties. IEEE Trans. Inf. Theory 2005, 51, 1011–1023. [Google Scholar] [CrossRef]
  12. Amari, S. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276. [Google Scholar] [CrossRef]
  13. Yang, H.H.; Amari, S. Complexity issues in natural gradient descent method for training multilayer perceptrons. Neural Comput. 1998, 10, 2137–2157. [Google Scholar] [CrossRef] [PubMed]
  14. Amari, S.I.; Wu, S. Improving support vector machine classifiers by modifying kernel functions. Neural Networks 1999, 12, 783–789. [Google Scholar] [CrossRef]
  15. Murata, N.; Takenouchi, T.; Kanamori, T.; Eguchi, S. Information geometry of U-Boost and Bregman divergence. Neural Comput. 2004, 16, 1437–1481. [Google Scholar] [CrossRef] [PubMed]
  16. Ikeda, S.; Tanaka, T.; Amari, S. Information geometry of turbo and low-density parity-check codes. IEEE Trans. Inf. Theory 2004, 50, 1097–1114. [Google Scholar] [CrossRef]
  17. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91. [Google Scholar]
  18. Efron, B. Defining the curvature of a statistical problem (with application to second order efficiency) (with discussion). Ann. Stat. 1975, 3, 1189–1242. [Google Scholar] [CrossRef]
  19. Dawid, A.P. Discussion to Efron’s paper. Ann. Stat. 1975, 3, 1231–1234. [Google Scholar]
  20. Amari, S. Differential geometry of curved exponential families—Curvatures and information loss. Ann. Stat. 1982, 10, 357–385. [Google Scholar] [CrossRef]
  21. Cena, A. Geometric Structures on the Non-Parametric Statistical Manifold. Ph.D. Thesis, Università degli Studi di Milano, Milano, Italy, 2003. [Google Scholar]
  22. Gibilisco, P.; Pistone, G. Connections on non-parametric statistical manifolds by Orlicz space geometry. Infin. Dimens. Anal. Quantum Probab. Relat. Top. 1998, 1, 325–347. [Google Scholar] [CrossRef]
  23. Grasselli, M. Dual connections in nonparametric classical information geometry. Ann. Inst. Stat. Math. 2010, 62, 873–896. [Google Scholar] [CrossRef]
  24. Pistone, G.; Sempi, C. An infinite dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 1995, 33, 1543–1561. [Google Scholar] [CrossRef]
  25. Zhang, J.; Hasto, P. Statistical manifold as an affine space: A functional equation approach. J. Math. Psychol. 2006, 50, 60–65. [Google Scholar] [CrossRef]
  26. Zhang, J.; Matsuzoe, H. Dualistic Differential Geometry Associated with a Convex Function. In Advances in Applied Mathematics and Global Optimization; Gao, D.Y., Sherali, H.D., Eds.; Springer: New York, NY, USA, 2009; Volume III, Chapter 13; pp. 439–466. [Google Scholar]
  27. Nomizu, K.; Sasaki, T. Affine Differential Geometry—Geometry of Affine Immersions; Cambridge University Press: Cambridge, MA, USA, 1994. [Google Scholar]
  28. Simon, U.; Schwenk-Schellschmidt, A.; Viesel, H. Introduction to the Affine Differential Geometry of Hypersurfaces; University of Tokyo Press: Tokyo, Japan, 1991. [Google Scholar]
  29. Lauritzen, S. Statistical manifolds. In Differential Geometry in Statistical Inference; Amari, S., Barndorff-Nielsen, O., Kass, R., Lauritzen, S., Rao, C.R., Eds.; IMS: Hayward, CA, USA, 1987; Volume 10, pp. 163–216. [Google Scholar]
  30. Lauritzen, S. Conjugate connections in statistical theory. In Proceedings of the Workshop on Geometrization of Statistical Theory; Dodson, C.T.J., Ed.; University of Lancaster: Lancaster, UK, 1987; pp. 33–51. [Google Scholar]
  31. Kurose, T. Dual connections and affine geometry. Math. Z 1990, 203, 115–121. [Google Scholar] [CrossRef]
  32. Kurose, T. On the divergences of 1-conformally flat statistical manifolds. Tôhoku Math. J. 1994, 46, 427–433. [Google Scholar] [CrossRef]
  33. Matsuzoe, H. On realization of conformally-projectively flat statistical manifolds and the divergences. Hokkaido Math. J. 1998, 27, 409–421. [Google Scholar] [CrossRef]
  34. Matsuzoe, H. Geometry of contrast functions and conformal geometry. Hokkaido Math. J. 1999, 29, 175–191. [Google Scholar]
  35. Calin, O.; Matsuzoe, H.; Zhang, J. Generalization of conjugate connections. In Trends in Differential Geometry, Complex Analysis, and Mathematical Physics; In Proceedings of the 9th International Workshop on Complex Structures, Integrability, and Vector Fields, Sofia, Bulgaria, 25–29 August 2008; pp. 24–34.
  36. Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat. 1983, 11, 793–803. [Google Scholar] [CrossRef]
  37. Eguchi, S. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math. J. 1985, 15, 341–391. [Google Scholar]
  38. Eguchi, S. Geometry of minimum contrast. Hiroshima Math. J. 1992, 22, 631–647. [Google Scholar]
  39. Chentsov, N.N. Statistical Decision Rules and Optimal Inference; AMS: Providence, RI, USA, 1982. [Google Scholar]
  40. Naudts, J. Generalised exponential families and associated entropy functions. Entropy 2008, 10, 131–149. [Google Scholar] [CrossRef]
  41. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
  42. Zhu, H.Y.; Rohwer, R. Bayesian invariant measurements of generalization. Neural Process. Lett. 1995, 2, 28–31. [Google Scholar] [CrossRef]
  43. Zhu, H.Y.; Rohwer, R. Measurements of generalisation based on information geometry. In Mathematics of Neural Networks: Models Algorithms and Applications; In Proceedings of the Mathematics of Neural Networks and Applications (MANNA 1995); Oxford, UK, 3–7 July 1995, Ellacott, S.W., Mason, J.C., Anderson, I.J., Eds.; Kluwer: Boston, MA, USA, 1997; pp. 394–398. [Google Scholar]
  44. Rao, C.R. Differential Metrics in Probability Spaces. In Differential Geometry in Statistical Inference; Amari, S., Barndorff-Nielsen, O., Kass, R., Lauritzen, S., Rao, C.R., Eds.; IMS: Hayward, CA, USA, 1987; Volume 10, Lecture; pp. 217–240. [Google Scholar]
  45. Pistone, G.; Rogantin, M.P. The exponential statistical manifold: Mean parameters, orthogonality and space transformations. Bernoulli 1999, 5, 721–760. [Google Scholar] [CrossRef]
  46. Lang, S. Differential and Riemannian Manifolds; Springer-Verlag: New York, NY, USA, 1995. [Google Scholar]
  47. Zhang, J. Referential Duality and Representational Duality on Statistical Manifolds. In Proceedings of the Second International Symposium on Information Geometry and Its Applications, Tokyo, Japan, 12–16 December 2005; pp. 58–67.
  48. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195. [Google Scholar] [CrossRef] [PubMed]
  49. Basu, A.; Harris, I.R.; Hjort, N.; Jones, M. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559. [Google Scholar] [CrossRef]
  50. Zhang, J. A note on curvature of alpha-connections of a statistical manifold. Ann. Inst. Stat. Math. 2007, 59, 161–170. [Google Scholar] [CrossRef]
  51. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 1967, 2, 229–318. [Google Scholar]
  52. Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170. [Google Scholar] [CrossRef]
  53. Cichocki, A.; Amari, S. Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef]
  54. Zhang, J. Dual scaling between comparison and reference stimuli in multidimensional psychological space. J. Math. Psychol. 2004, 48, 409–424. [Google Scholar] [CrossRef]
  55. Zhang, J. Referential duality and representational duality in the scaling of multi-dimensional and infinite-dimensional stimulus space. In Measurement and Representation of Sensations: Recent Progress in Psychological Theory; Dzhafarov, E., Colonius, H., Eds.; Lawrence Erlbaum Associates: Mahwah, NJ, USA, 2006. [Google Scholar]
  56. Critchley, F.; Marriott, P.; Salmon, M. Preferred point geometry and statistical manifolds. Ann. Stat. 1993, 21, 1197–1224. [Google Scholar] [CrossRef]
  57. Critchley, F.; Marriott, P.; Salmon, M. Preferred point geometry and the local differential geometry of the Kullback-Leibler divergence. Ann. Stat. 1994, 22, 1587–1602. [Google Scholar] [CrossRef]
  58. Critchley, F.; Marriott, P.K.; Salmon, M. On preferred point geometry in statistics. J. Stat. Plan. Inference 2002, 102, 229–245. [Google Scholar] [CrossRef]
  59. Marriott, P.; Vos, P. On the global geometry of parametric models and information recovery. Bernoulli 2004, 10, 639–649. [Google Scholar] [CrossRef]
  60. Zhu, H.-T.; Wei, B.-C. Some notes on preferred point α-geometry and α-divergence function. Stat. Probab. Lett. 1997, 33, 427–437. [Google Scholar] [CrossRef]
  61. Zhu, H.-T.; Wei, B.-C. Preferred point α-manifold and Amari’s α-connections. Stat. Probab. Lett. 1997, 36, 219–229. [Google Scholar] [CrossRef]
  62. Ohara, A. Geometry of distributions associated with Tsallis statistics and properties of relative entropy minimization. Phys. Lett. A 2007, 370, 184–193. [Google Scholar] [CrossRef]
  63. Ohara, A.; Matsuzoe, H.; Amari, S. A dually flat structure on the space of escort distributions. J. Phys. Conf. Ser. 2010, 201, No. 012012. [Google Scholar] [CrossRef]
  64. Amari, S.; Ohara, A. Geometry of q-exponential family of probability distributions. Entropy 2011, 13, 1170–1185. [Google Scholar] [CrossRef]
  65. Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually-flat and conformal geometry. Physica A 2012, 391, 4308–4319. [Google Scholar] [CrossRef]
  66. Shima, H. Compact locally Hessian manifolds. Osaka J. Math. 1978, 15, 509–513. [Google Scholar]
  67. Shima, H.; Yagi, K. Geometry of Hessian manifolds. Differ. Geom. Its Appl. 1997, 7, 277–290. [Google Scholar] [CrossRef]
  68. Matumoto, T. Any statistical manifold has a contrast function—On the C3-functions taking the minimum at the diagonal of the product manifold. Hiroshima Math. J. 1993, 23, 327–332. [Google Scholar]
