1. Introduction
Information divergence plays a central role in integrating statistics, information science, statistical physics and machine learning. Let ℱ be the space of all probability density functions with a common support with respect to a carrier measure Λ on a data space. Usually Λ is taken as the Lebesgue measure or the counting measure, corresponding to continuous and discrete random variables, respectively. The most typical example of information divergence is the Kullback-Leibler divergence

D0(f, g) = ∫ f(x) log{f(x)/g(x)} dΛ(x)

on ℱ × ℱ, which is decomposed into the difference of the cross entropy C0(f, g) and the diagonal entropy H0(f), that is, D0(f, g) = C0(f, g) − H0(f). The entropy H0(f) is nothing but the Boltzmann-Gibbs-Shannon entropy.
In effect, D0(f, g) connects the maximum likelihood [1,2] and the maximum entropy [3]. If we take a canonical statistic t(X), then the maximum entropy distribution under a moment constraint for t(X) belongs to the exponential model associated with t(X),

M(e) = { f0(x, θ) = exp{θ⊤ t(x) − κ0(θ)} : θ ∈ Θ },    (1)

where κ0(θ) = log ∫ exp{θ⊤ t(x)} dΛ(x) and Θ = {θ : κ0(θ) < ∞}. In this context, the statistic t(X) is minimally sufficient in the model, and the maximum likelihood estimator (MLE) for the parameter θ of the model is in one-to-one correspondence with t(X); see [4] for the convex geometry. If we consider the expectation parameter μ = 𝔼_{f0(·,θ)}{t(X)} in place of θ, then for a given random sample X1, ···, Xn, the MLE for μ is given by the sample mean of the t(Xi)'s, that is, μ̂ = (1/n) Σ_{i=1}^{n} t(Xi).
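As a small numerical illustration (not taken from the paper), the following Python sketch checks this property for the Poisson family written in exponential-family form with t(x) = x: the MLE of the expectation parameter μ is the sample mean, and the natural parameter is recovered as θ̂ = log μ̂.

import numpy as np

# Minimal numerical sketch (not from the paper): the Poisson family in the
# exponential-family form f0(x, theta) = exp{theta * x - kappa0(theta)} with
# respect to the carrier measure (1/x!) times the counting measure has
# t(x) = x and kappa0(theta) = exp(theta), so the MLE of the expectation
# parameter mu = E[t(X)] is the sample mean, and theta_hat = log(mu_hat).
rng = np.random.default_rng(0)
x = rng.poisson(lam=3.0, size=1000)   # random sample X_1, ..., X_n

mu_hat = np.mean(x)                   # MLE of the expectation parameter mu
theta_hat = np.log(mu_hat)            # corresponding natural parameter
print(f"mu_hat = {mu_hat:.3f}, theta_hat = {theta_hat:.3f}")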
We define two kinds of geodesic curves connecting f and g in ℱ. We call the curve

ft(x) = (1 − t) f(x) + t g(x),  t ∈ [0, 1],    (2)

the mixture geodesic. Alternatively, we call the curve

ft(x) = exp{ (1 − t) log f(x) + t log g(x) − κ(t) },  t ∈ [0, 1],    (3)

the exponential geodesic, where κ(t) = log ∫ f(x)^{1−t} g(x)^{t} dΛ(x). We denote by Γ(m) and Γ(e) the two linear connections induced by the mixture and exponential geodesic curves on ℱ, which we call the mixture connection and the exponential connection on ℱ, respectively; see [5,6]. Thus all tangent vectors on a mixture geodesic curve are parallel to each other with respect to Γ(m), and all tangent vectors on an exponential geodesic curve are parallel to each other with respect to Γ(e). It is well known that
M(e) is totally exponential-geodesic, that is, for any f0(x, θ0) and f0(x, θ1) in M(e) the exponential geodesic curve connecting f0(x, θ0) and f0(x, θ1) lies in M(e). In effect we observe that ft(x) = f0(x, θt) with θt = (1 − t)θ0 + tθ1; thus f0(·, θt) ∈ M(e) for all t ∈ (0, 1) because Θ is a convex set. Alternatively, consider a mixture model M(m) with mixing parameter π. Then M(m) is totally mixture-geodesic, because the mixture geodesic curve connecting fπ0 and fπ1 is fπt with πt = (1 − t)π0 + tπ1, which is in M(m) for any t ∈ (0, 1).
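The two geodesics can be computed explicitly. The following Python sketch (an illustration with assumed Gaussian endpoints, not from the paper) evaluates the mixture geodesic in Equation (2) and the exponential geodesic in Equation (3) on a grid, with the normalizer κ(t) obtained by numerical integration.

import numpy as np

# Hypothetical illustration: mixture and exponential geodesics between two
# Gaussian densities f and g, evaluated on a grid; kappa(t) is obtained by
# numerical integration (trapezoidal rule), as in Equation (3).
x = np.linspace(-10.0, 10.0, 4001)
f = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)        # N(1, 1)
g = np.exp(-0.5 * (x + 2.0) ** 2 / 4.0) / np.sqrt(8 * np.pi)  # N(-2, 4)

def mixture_geodesic(t):
    return (1 - t) * f + t * g                                 # Equation (2)

def exponential_geodesic(t):
    log_ft = (1 - t) * np.log(f) + t * np.log(g)
    kappa_t = np.log(np.trapz(np.exp(log_ft), x))              # kappa(t)
    return np.exp(log_ft - kappa_t)                            # Equation (3)

for t in (0.0, 0.5, 1.0):
    # both curves integrate to 1 for every t
    print(t, np.trapz(mixture_geodesic(t), x), np.trapz(exponential_geodesic(t), x))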
We discuss generalized entropy and divergence measures with applications to statistical models and estimation. There have been recent developments in the generalization of the Boltzmann-Shannon entropy and the Kullback-Leibler divergence. We focus on the U-divergence with a generator function U, in which the U-divergence is separated into the difference between a cross entropy and a diagonal entropy. We observe a dualistic property associated with the U-divergence between the statistical model and the estimation. The U-loss function is given by an empirical approximation of the U-divergence based on a given dataset under a statistical model, and the U-estimator is defined by minimization of the U-loss function over the parameter space. On the other hand, the diagonal entropy leads to a maximum entropy distribution on a mean equal space; we call the resulting family of distributions the U-model. In accordance with this, the U-divergence leads to a pair of the U-model and the U-estimator as statistical model and estimation. The typical example is U(t) = exp(t), which is associated with the Kullback-Leibler divergence D0(f, g) and generates the pair of an exponential family M(e) and the minus log-likelihood function.
This aspect is characterized as a minimax game between a decision maker and Nature. The paper is organized as follows. Section 2 introduces the class of U-divergence measures. The information geometric framework associated with a divergence measure is given in Section 3. Section 4 discusses the maximum entropy model with respect to the U-diagonal entropy. The minimum divergence method via the U-divergence is discussed in Section 5. We next explore the duality between maximum U-entropy and minimum U-divergence in Section 6. Finally, we discuss the relation to robust statistics by minimum divergence and a future problem on MaxEnt in Section 7.
2. U-Divergence
A class of information divergences is constructed from a generator function U via a simple employment of conjugate convexity; see [7]. We introduce a class 𝒰 of generator functions U that are strictly increasing and strictly convex on ℝ. For U in 𝒰 we consider the conjugate convex function defined on ℝ+,

U*(t) = sup_{s∈ℝ} { st − U(s) },

and hence U*(t) = tξ(t) − U(ξ(t)), where ξ(t) is the inverse function of the derivative of U(s), or equivalently (dU/ds)(ξ(t)) = t. The existence of ξ(t) is guaranteed by the assumption that U is in 𝒰, and we observe the important property that the derivative of U* is the inverse of the derivative of U, that is,

(d/dt) U*(t) = ξ(t).    (4)
The conjugate function U* of U is reflexive, that is, U** = U. By definition, for any s ∈ ℝ and t ∈ ℝ+,

st ≤ U(s) + U*(t),    (5)

with equality if and only if s = ξ(t). We consider an information divergence functional using the generator function U,

DU(f, g) = ∫ [ U(ξ(g(x))) − U(ξ(f(x))) − f(x){ ξ(g(x)) − ξ(f(x)) } ] dΛ(x),    (6)

called the U-divergence. We can easily confirm that DU(f, g) satisfies the first axiom of a distance function, since the integrand in Equation (6) is always nonnegative and equals 0 if and only if f(x) = g(x), because of Equation (5). It follows from the construction that DU(f, g) is decomposed into CU(f, g) and HU(f) such that

DU(f, g) = CU(f, g) − HU(f).
Here

CU(f, g) = ∫ { U(ξ(g(x))) − f(x) ξ(g(x)) } dΛ(x)

is called the U-cross entropy, and

HU(f) = −∫ U*(f(x)) dΛ(x)    (7)

is called the U-diagonal entropy. We can write HU(f) = ∫ { U(ξ(f(x))) − f(x) ξ(f(x)) } dΛ(x) by the definition of U*, which equals the diagonal value CU(f, f). We note that the U-divergence is expressed as

DU(f, g) = ∫ [ U*(f(x)) − U*(g(x)) − { f(x) − g(x) } ξ(g(x)) ] dΛ(x)

because of Equation (4), which implies that U* plays the role of a generator function in place of U. In fact, this is also called the U*-Bregman divergence, cf. [8,9].
The first example of U is U0(s) = exp(s), which leads to ξ(t) = log t and U0*(t) = t log t − t. Thus the U0-divergence, U0-cross entropy and U0-diagonal entropy equal D0(f, g), C0(f, g) and H0(f) as defined in the Introduction, respectively. As the second example we consider

Uβ(s) = (1/(1 + β)) (1 + βs)^{(β+1)/β},

where β is a scalar. The conjugate function becomes

Uβ*(t) = (1/β) { t^{β+1}/(β + 1) − t }.

Then the generator function Uβ is associated with the β-power cross entropy

Cβ(f, g) = −(1/β) ∫ f(x) g(x)^{β} dΛ(x) + (1/(β + 1)) ∫ g(x)^{β+1} dΛ(x) + 1/β,

the β-power diagonal entropy

Hβ(f) = −(1/(β(β + 1))) ∫ f(x)^{β+1} dΛ(x) + 1/β,

and the β-power divergence Dβ(f, g) = Cβ(f, g) − Hβ(f), that is,

Dβ(f, g) = (1/β) ∫ f(x){ f(x)^{β} − g(x)^{β} } dΛ(x) − (1/(β + 1)) ∫ { f(x)^{β+1} − g(x)^{β+1} } dΛ(x).

The class of β-power divergence functionals includes the Kullback-Leibler divergence in the limiting sense that lim_{β→0} Dβ(f, g) = D0(f, g). If β = 1, then D1(f, g) = (1/2) ∫ {f(x) − g(x)}^{2} dΛ(x), which is half of the squared L2 norm. If we take the limit of β to −1, then Dβ(f, g) becomes the Itakura-Saito divergence

D−1(f, g) = ∫ [ f(x)/g(x) − log{f(x)/g(x)} − 1 ] dΛ(x),

which is widely applied in signal processing and speech recognition, cf. [10–12].
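For a concrete check (a hypothetical numerical example, not from the paper), the following Python snippet evaluates Dβ(f, g) for two discrete distributions under the counting measure and confirms the limiting behaviour: for small β it approaches the Kullback-Leibler divergence, and for β = 1 it equals half the squared L2 norm.

import numpy as np

# Hypothetical numerical check of the beta-power divergence between two discrete
# distributions (counting measure), following the decomposition
# D_beta(f, g) = (1/beta) sum f (f^beta - g^beta)
#                - (1/(beta+1)) sum (f^{beta+1} - g^{beta+1}).
def beta_divergence(f, g, beta):
    if beta == 0.0:                      # Kullback-Leibler limit
        return np.sum(f * np.log(f / g))
    return (np.sum(f * (f**beta - g**beta)) / beta
            - np.sum(f**(beta + 1) - g**(beta + 1)) / (beta + 1))

f = np.array([0.2, 0.5, 0.3])
g = np.array([0.3, 0.4, 0.3])

print(beta_divergence(f, g, 1e-6))                             # close to the KL divergence
print(beta_divergence(f, g, 0.0))
print(beta_divergence(f, g, 1.0), 0.5 * np.sum((f - g)**2))    # half squared L2 norm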
The β-power divergence Dβ(p, q) was proposed in [13]; the β-power entropy Hβ corresponds to the Tsallis q-entropy under the relation q = β + 1, cf. [14–16]. Tsallis entropy is connected with spin glass relaxation, dissipative optical lattices and so on, beyond the classical statistical physics associated with the Boltzmann-Shannon entropy H0(p). See also [17,18] for the power entropy in the field of ecology. We will discuss the statistical properties of the minimum β-divergence method in the presence of outliers departing from a supposed model, cf. [19–21]. The robustness performance is elucidated by an appropriate selection of β. Beyond the robustness perspective, a property of spontaneous learning applied to clustering analysis is the focus of [22]; see also [23] for nonnegative matrix analysis.
The third example of a generator function is Uη(s) = (1 − η) exp(s) − ηs with a scalar η. This generator function leads to the η-cross entropy Cη(f, g) and the η-entropy Hη(f), so that the η-divergence is Dη(f, g) = Cη(f, g) − Hη(f); see [24–27] for applications to pattern recognition. Obviously, if we take the limit of η to 0, then Cη(f, g), Hη(f) and Dη(f, g) converge to C0(f, g), H0(f) and D0(f, g), respectively. A mislabeled model is derived as a maximum η-entropy distribution with a moment constraint if we consider a binary regression model. See [25,27] for a detailed discussion.
3. Geometry Associated with U-Divergence
We investigate geometric properties associated with the U-divergence, which will support the discussion in subsequent sections. Let us arbitrarily fix a statistical model M = {fθ(x) : θ ∈ Θ} embedded in the total space ℱ with mild regularity conditions. In fact, we consider the mixture geodesic curve C(m), the exponential geodesic curve C(e), the mixture model M(m) and the exponential model M(e) as typical examples of M. There are difficulties in defining ℱ as a differentiable manifold of infinite dimension, because the constraint of positivity on the support is intractable in the sense of the topology; see Section 2 of [6] for a detailed discussion and historical remarks. On the other hand, if we confine ourselves to a statistical model M, then we can formulate M as a finite dimensional manifold, as in the following discussion. Thus, we produce a path geometry in which, for any two elements f and g of ℱ, a class of geodesic curves connecting f and g, including C(m) and C(e), is introduced, so that the corresponding class of geodesic subspaces, such as M(m) and M(e), is derived.
3.1. Riemannian Metric and Linear Connections
We view the statistical model M as a d-dimensional differentiable manifold with the coordinate θ = (θ1, ···, θd). Any information divergence is associated with a Riemannian metric and a pair of dual linear connections; see [28,29] for a detailed discussion. We focus on the geometry generated by the U-divergence DU(f, g) as follows. The Riemannian metric at fθ of M is given by

G(U)_{ij}(θ) = ∫ ∂ifθ(x) ∂jξ(fθ(x)) dΛ(x),    (10)

and the linear connections are

Γ(U)_{ij,k}(θ) = ∫ ∂i∂jfθ(x) ∂kξ(fθ(x)) dΛ(x)    (11)

and

*Γ(U)_{ij,k}(θ) = ∫ ∂i∂jξ(fθ(x)) ∂kfθ(x) dΛ(x),    (12)

where ∂i = ∂/∂θi; see the Appendix for the derivation. Now we can assert the following theorem under an assumption for ℱ: let f be arbitrarily fixed in ℱ; if ∫ a(x){g(x) − f(x)} dΛ(x) = 0 for any g of ℱ, then a(x) is constant in x almost everywhere with respect to Λ.
Theorem 1. Let Γ(U) be the linear connection defined in Equation (11). Then any Γ(U)-geodesic curve is equal to the mixture geodesic curve defined in Equation (2).

Proof. Let C(U) := {ft(x) : t ∈ (0, 1)} be a Γ(U)-geodesic curve with f0 = f and f1 = g. We consider a model defined by fθ(x) = (1 − s + u) ft(x) + (s − u) g(x), where θ = (s, t, u). Then we observe that, if u = s, the corresponding component of the Γ(U)-geodesic equation vanishes identically for any g of ℱ. It follows from the assumption for ℱ that (d²/dt²) ft(x) = c almost everywhere with respect to Λ, which is solved by

ft(x) = (1 − t) f(x) + t g(x) + (c/2) t(t − 1)

from the endpoint conditions for C(U). We observe that c = 0 because ft(x) ∈ ℱ, which concludes that C(U) equals the mixture geodesic. The proof is complete.
This property is elemental in characterizing the U-divergence class, and it is closely related to the empirical reducibility discussed in a subsequent section. The assumption for ℱ holds if the carrier measure Λ is the Lebesgue measure or the counting measure.
On the other hand, for a *Γ(U)-geodesic curve {ft(x) : t ∈ (0, 1)} with f0 = f and f1 = g, we consider an embedding into a 2-dimensional model fθ(x) with θ = (s, t), where u(s) = (d/ds)U(s) and κθ is a normalizing constant chosen so that ∫ fθ(x) dΛ(x) = 1. By definition fθ(x) = ft(x) if s = t. This leads to an identity holding almost everywhere with respect to Λ, which is solved by

ft(x) = u( (1 − t) ξ(f(x)) + t ξ(g(x)) − κt ).

We confirm that, if U = exp, then the *Γ(U)-geodesic curve reduces to the exponential geodesic curve defined in Equation (3). □
3.2. Generalized Pythagorean Theorems
We next consider the Pythagorean theorem based on the U-divergence as an extension of the result associated with the Kullback-Leibler divergence in [6].

Theorem 2. Let p, q and r be in ℱ. We connect p with q by the mixture geodesic

ft(x) = (1 − t) q(x) + t p(x),  t ∈ [0, 1].

Alternatively we connect r and q by the *Γ(U)-geodesic curve

rs(x) = u( (1 − s) ξ(q(x)) + s ξ(r(x)) − κs ),  s ∈ [0, 1].    (15)

The two curves {ft(x) : t ∈ [0, 1]} and {rs(x) : s ∈ [0, 1]} orthogonally intersect at q with respect to the Riemannian metric G(U) defined in Equation (10) if and only if

DU(p, r) = DU(p, q) + DU(q, r).    (16)

Proof. A straightforward calculus yields that

∫ { p(x) − q(x) }{ ξ(r(x)) − ξ(q(x)) } dΛ(x) = DU(p, q) + DU(q, r) − DU(p, r).    (17)

By the definition of G(U) we see that the inner product at q of the tangent vectors of the two curves is nothing but the left side of Equation (17) when the two curves are embedded into a 2-dimensional model fθ with θ = (t, s). Hence the orthogonality assumption is equivalent to Equation (16), which completes the proof.
Remark 1. We remark a further property such that, for any s and t in [0, 1],

DU(ft, rs) = DU(ft, q) + DU(q, rs).

If U = exp, then Theorem 2 reduces to the Pythagorean theorem with the Kullback-Leibler divergence as shown in [6]. Consider two geodesic subspaces M(m) and M(U) containing q, where M(m) is totally mixture-geodesic and M(U) is totally *Γ(U)-geodesic. For any m-geodesic curve C(m) in M(m) and U-geodesic curve *C(U) in M(U) passing through q, we assume that C(m) and *C(U) orthogonally intersect at q in the sense of the Riemannian metric G(U). Then, for any p ∈ M(m) and r ∈ M(U),

DU(p, r) = DU(p, q) + DU(q, r),

in which the two-way projection is associated as

q = argmin_{p′∈M(m)} DU(p′, r) = argmin_{r′∈M(U)} DU(p, r′).

First we confirm a reduction property of the Kullback-Leibler divergence to the framework of information geometry such that (G(D0), Γ(D0), *Γ(D0)) = (G, Γ(m), Γ(e)), where G is the information metric. Second we return to the case of the β-power divergence, which reduces to a special case of Theorem 2. Consider the mixture geodesic curve connecting q and p and the β-power geodesic curve connecting q and r. Then we observe for the Riemannian metric G(β) generated by the β-power divergence that the inner product of the two tangent vectors at q is proportional to ∫ (p − q)(r^{β} − q^{β}) dΛ. We observe that, if C(m) and C(β) orthogonally intersect at q, then

Dβ(p, r) = Dβ(p, q) + Dβ(q, r).
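The identity in Equation (17) can be verified numerically. The following Python sketch (a hypothetical check with randomly generated discrete distributions) confirms, for the β-power case with ξ(t) = (t^{β} − 1)/β, that the inner-product term equals Dβ(p, q) + Dβ(q, r) − Dβ(p, r), so that orthogonality at q is exactly the Pythagorean relation of Theorem 2.

import numpy as np

# Hypothetical numerical check of the identity behind Theorem 2 for the
# beta-power case: with xi(t) = (t^beta - 1)/beta,
#   sum (p - q)(xi(r) - xi(q)) = D_beta(p, q) + D_beta(q, r) - D_beta(p, r).
beta = 0.5

def xi(t):
    return (t**beta - 1.0) / beta

def D_beta(f, g):
    return (np.sum(f * (f**beta - g**beta)) / beta
            - np.sum(f**(beta + 1) - g**(beta + 1)) / (beta + 1))

rng = np.random.default_rng(1)
p, q, r = (rng.dirichlet(np.ones(5)) for _ in range(3))

lhs = np.sum((p - q) * (xi(r) - xi(q)))
rhs = D_beta(p, q) + D_beta(q, r) - D_beta(p, r)
print(lhs, rhs)     # the two values agree up to rounding error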
4. Maximum Entropy Distribution
The maximum entropy principle is based on the Boltzmann-Shannon entropy, in which the maximum entropy distribution is characterized by an exponential model. The maximum entropy method has been widely employed in fields such as natural language processing and ecological analysis. However, other types of entropy measures have been proposed in different fields, such as the Hill diversity index, the Gini-Simpson index and the Tsallis entropy, cf. [14,17,18]. We have introduced the class of U-entropy functionals, which includes all the entropy measures mentioned above. In this section, we discuss the maximum entropy distribution based on an arbitrarily fixed U-entropy.
We check a finite discrete case with K + 1 cells as a special situation, where ℱ reduces to the K-dimensional probability simplex. The maximum U-entropy distribution is defined by

f* = argmax_{f} HU(f),

where the maximization is taken over the simplex. We observe that the stationarity condition requires ξ(f*i) to be constant in i, which implies

f*i = 1/(K + 1)

for i = 1, ···, K + 1. Therefore the maximum U-entropy distribution f* is the uniform distribution on the simplex for any generator function U.
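The following Python sketch (a hypothetical check, with the β-power generator and β = 0.5 chosen for illustration) compares the U-diagonal entropy of the uniform distribution with that of randomly drawn points of the simplex; no candidate exceeds the uniform value.

import numpy as np

# Hypothetical check of the finite discrete case: on the simplex with K + 1 cells,
# the U-diagonal entropy is maximized by the uniform distribution; here we use the
# beta-power generator with beta = 0.5.
beta = 0.5

def H_beta(f):
    # U-diagonal entropy for U_beta, up to the additive constant 1/beta
    return -np.sum(f**(beta + 1)) / (beta * (beta + 1))

K = 4
uniform = np.full(K + 1, 1.0 / (K + 1))

rng = np.random.default_rng(2)
candidates = rng.dirichlet(np.ones(K + 1), size=10000)

print(H_beta(uniform))                            # entropy of the uniform distribution
print(np.max([H_beta(f) for f in candidates]))    # never exceeds the uniform value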
In general the U-entropy is an unbounded functional on ℱ unless the data space is finite discrete. For this reason we introduce a moment constraint as follows. Let t(X) be a k-dimensional statistic vector. Henceforth we assume that 𝔼f{‖t(X)‖^{2}} < ∞ for all f of ℱ. We consider the mean equal space for t(X),

Γ(τ) = { f ∈ ℱ : 𝔼f{t(X)} = τ },

where τ is a fixed vector in ℝk. By definition Γ(τ) is totally mixture-geodesic, that is, if f and g are in Γ(τ), then (1 − t)f + tg is also in Γ(τ) for any t ∈ (0, 1).
Theorem 3. Let

fU* = argmax_{f∈Γ(τ)} HU(f),

where HU(f) is the U-diagonal entropy defined in Equation (7). Then the maximum U-entropy distribution is given by

fU*(x) = u( θ⊤ t(x) − κU(θ) ),    (20)

where κU(θ) is the normalizing factor and θ is a parameter vector determined by the moment constraint 𝔼_{fU*}{t(X)} = τ.

Proof. The Euler-Lagrange functional is given by the U-diagonal entropy HU(f) augmented with Lagrange multiplier terms for the constraints ∫ f dΛ = 1 and 𝔼f{t(X)} = τ. If gτ ∈ Γ(τ) and ft = (1 − t)fU* + t gτ, then ft ∈ Γ(τ), and the derivative of HU(ft) at t = 0 can be evaluated along this direction. The equation in Equation (21) yields that

∫ ξ(fU*(x)) { gτ(x) − fU*(x) } dΛ(x) = 0

for any gτ(x) in Γ(τ), which concludes Equation (20). Since ξ(t) is an increasing function, we observe that

(d²/dt²) HU(ft) = −∫ { gτ(x) − fU*(x) }^{2} ξ′(ft(x)) dΛ(x) ≤ 0

for any t ∈ [0, 1], which implies the inequality in Equation (21). Since gτ ∈ Γ(τ), we observe that 𝔼_{gτ}{t(X)} = 𝔼_{fU*}{t(X)} = τ, so that CU(gτ, fU*) = HU(fU*). Therefore we can confirm that HU(fU*) ≥ HU(gτ) for any gτ ∈ Γ(τ), since

HU(fU*) − HU(gτ) = CU(gτ, fU*) − HU(gτ) = DU(gτ, fU*),

which is nonnegative by the definition of the U-divergence. The proof is complete.
Here we give a definition of the model of maximum U-entropy distributions as follows.

Definition 1. We define a k-dimensional model

MU = { fU(x, θ) = u( θ⊤ t(x) − κU(θ) ) : θ ∈ Θ },    (23)

which is called the U-model, where Θ = {θ ∈ ℝk : κU(θ) < ∞}.
The Naudts' deformed exponential family, discussed from a statistical physics viewpoint as in [15], is closely related to the U-model. The one-parameter family {rs(x) : s ∈ [0, 1]} defined in Equation (15) is a one-dimensional U-model, and the geodesic subspace M(U) in Section 3.2 is a K-dimensional U-model. For a U-model MU defined in Equation (23), the parameter θ is an affine parameter for the linear connection *Γ(U) defined in Equation (12). In fact, we observe from the definition in Equation (12) that

*Γ(U)_{ij,k}(θ) = −∂i∂jκU(θ) ∫ ∂kfU(x, θ) dΛ(x),

which is identically 0 for all θ ∈ Θ since ∫ fU(x, θ) dΛ(x) = 1. We have a geometric understanding of the U-model similar to that of the exponential model discussed in the Introduction.
Theorem 4. Assume for U(t) that U‴(t) > 0 for any t in ℝ. Then, the U-model is totally *Γ(U)-geodesic.

Proof. For arbitrarily fixed θ1 and θ2 in Θ, we define the U-geodesic curve connecting fU(x, θ1) and fU(x, θ2) such that, for λ ∈ (0, 1),

fλ(x) = u( λ ξ(fU(x, θ1)) + (1 − λ) ξ(fU(x, θ2)) − κ(λ) )

with a normalizing factor κ(λ), which is written as fλ(x) = fU(x, θλ), where θλ = λθ1 + (1 − λ)θ2. Hence it suffices to show that θλ ∈ Θ for all λ ∈ (0, 1), where Θ is defined in Definition 1. We look at the identity ∫ fU(x, θ) dΛ(x) = 1, which follows from the fact that fU(x, θ) is a probability density function. The first derivative gives

∫ u′(θ⊤ t(x) − κU(θ)) { ti(x) − ∂iκU(θ) } dΛ(x) = 0,

and the second derivative gives

∂i∂jκU(θ) ∫ u′(θ⊤ t(x) − κU(θ)) dΛ(x) = ∫ u″(θ⊤ t(x) − κU(θ)) { ti(x) − ∂iκU(θ) }{ tj(x) − ∂jκU(θ) } dΛ(x).    (24)

The identity in Equation (24) shows that the Hessian of κU(θ) is proportional to a Gramian matrix, which implies that κU(θ) is convex in θ. Since κU(θλ) ≤ λ κU(θ1) + (1 − λ) κU(θ2) and θ1 and θ2 are in Θ, we have κU(θλ) < ∞. This concludes that θλ ∈ Θ for any λ ∈ (0, 1), which completes the proof.
We discuss a typical example with the power entropy Hβ(f); see [15,30–34] from the viewpoint of statistical physics. First we consider a mean equal space of univariate distributions on (0, ∞) with a two-component statistic t(x), depending on β and a shape parameter κ, such that lim_{β→0} t(x) = (x, (κ − 1) log x). To obtain the maximum entropy distribution with respect to Hβ we consider the Euler-Lagrange function with Lagrange multiplier parameters θ and λ. This yields the maximum entropy distribution fβ(·, θ), where θ is determined by μ such that 𝔼_{fβ(·,θ)}{t(X)} = μ. A gamma distribution is defined by the density function

f(x) = x^{κ−1} exp(−x/σ) / {Γ(κ) σ^{κ}},  x > 0,

which arises as the maximum entropy distribution in the limit of β to 0. Second, we consider the case of multivariate distributions, where the moment constraints are, for a fixed p-dimensional vector μ and a matrix V of size p × p,

𝔼f{X} = μ,  𝔼f{(X − μ)(X − μ)⊤} = V.

If we consider the limit case of β to 0, then Hβ(f) reduces to the Boltzmann-Shannon entropy and the maximum entropy distribution is the Gaussian distribution with the density function

f(x) = (2π)^{−p/2} det(V)^{−1/2} exp{ −(1/2)(x − μ)⊤ V^{−1} (x − μ) }.

In general we deduce that, for β in an appropriate range, the maximum β-power entropy distribution uniquely exists and its density function is proportional to a power of a quadratic function of x. See [35,36] for the detailed discussion and [37,38] for the discussion on group invariance. Thus, if β > 0, then the maximum β-power entropy distribution has a compact ellipsoidal support. The typical case is β = 2, which is called the Wigner semicircle distribution. On the other hand, if β < 0, the maximum β-power entropy distribution has the full support ℝp and equals a p-variate t-distribution with a degree of freedom depending on β.
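As an illustration (with hypothetical constants, not taken from the paper), the following Python sketch normalizes a density of the truncated power form {1 − λ(x − μ)²}_{+}^{1/β} on a grid for β = 2, showing the compact support and the resulting mean and variance numerically.

import numpy as np

# Illustrative sketch (hypothetical constants): for beta > 0 the maximum
# beta-power entropy density with given mean and variance has the truncated form
# f(x) proportional to {1 - lam * (x - mu)^2}_+^{1/beta}, so its support is the
# compact interval |x - mu| <= 1/sqrt(lam); here it is normalized numerically.
beta = 2.0          # beta = 2 corresponds to the Wigner semicircle shape
mu, lam = 0.0, 0.25

x = np.linspace(mu - 1 / np.sqrt(lam), mu + 1 / np.sqrt(lam), 20001)
kernel = np.clip(1.0 - lam * (x - mu) ** 2, 0.0, None) ** (1.0 / beta)
f = kernel / np.trapz(kernel, x)           # normalize to a probability density

print(np.trapz(f, x))                      # ~1.0
print(np.trapz(x * f, x))                  # mean ~ mu
print(np.trapz((x - mu) ** 2 * f, x))      # variance implied by lam and beta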
5. Minimum Divergence Method
We have shown a variety of U-divergence functionals using various generator functions, for which the minimum divergence methods are applied to analyses in statistics and statistical machine learning. In effect the U-cross entropy CU(f, g) is convex-linear in f, that is,

CU( Σ_{j} λj fj, g ) = Σ_{j} λj CU(fj, g)

for any λj > 0 with Σ_{j} λj = 1. This is closely related to the characteristic property that the linear connection Γ(U) associated with the U-divergence is equal to the mixture connection Γ(m), as discussed in Theorem 1. Furthermore, for a fixed g, CU(f, g) can be viewed as a functional of the distribution F in place of the density f as follows:

CU(F, g) = ∫ U(ξ(g(x))) dΛ(x) − ∫ ξ(g(x)) dF(x),

where F is the probability distribution generated from f(x). If we assume to have a random sequence X1, ···, Xn from the density function f(x), then the U-cross entropy is approximated by CU(F̄n, g), where F̄n is the empirical distribution based on the data X1, ···, Xn, that is, F̄n(B) = (1/n) Σ_{i=1}^{n} I(Xi ∈ B) for any Borel measurable set B. By definition,

CU(F̄n, g) = ∫ U(ξ(g(x))) dΛ(x) − (1/n) Σ_{i=1}^{n} ξ(g(Xi)).    (25)

Consequently, if we model g by a model function f(·, θ), then the right side of Equation (25) depends only on the data set {X1, ···, Xn} and the parameter θ, without any knowledge of the underlying density function f(x). This gives the empirical approximation, which is advantageous over other classes of divergence measures. The minimum U-divergence method is directly applied by minimizing the empirical approximation with respect to θ. We note that minimizing the divergence is equivalent to minimizing the cross entropy, since the diagonal entropy is just a constant in θ. In particular, in the classical case U0(s) = exp(s),

C0(F̄n, f(·, θ)) = 1 − (1/n) Σ_{i=1}^{n} log f(Xi, θ),

which is equivalent to the minus log-likelihood function.
Let X1, ···, Xn be independently and identically distributed from an underlying density function f(x), which is approximated by a statistical model M = {f(x, θ) : θ ∈ Θ}. The U-loss function is introduced by

LU(θ) = bU(θ) − (1/n) Σ_{i=1}^{n} ξ(f(Xi, θ)),

where bU(θ) = ∫ U(ξ(f(x, θ))) dΛ(x). We call θ̂U = argmin_{θ∈Θ} LU(θ) the U-estimator for the parameter θ. By definition 𝔼f{LU(θ)} = CU(F, f(·, θ)) for all θ in Θ, which implies that LU(θ) almost surely converges to CU(F, f(·, θ)) as n goes to ∞. Let us define a statistical functional

θU(F) = argmin_{θ∈Θ} CU(F, f(·, θ)),

where CU(F, g) is CU(f, g) written with f replaced by the probability distribution F generated from f. Then θU(F) is model-consistent, that is, θU(Fθ) = θ for any θ ∈ Θ, because

CU(Fθ, f(·, θ′)) ≥ CU(Fθ, f(·, θ))

with equality if and only if θ′ = θ, where Fθ is the probability distribution induced from f(x, θ). Hence the U-estimator θ̂U is asymptotically consistent. The estimating function is given by

sU(x, θ) = (∂/∂θ) ξ(f(x, θ)) − (∂/∂θ) bU(θ).

Consequently we confirm that sU(x, θ) is unbiased in the sense that 𝔼_{f(·,θ)}{sU(X, θ)} = 0.
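A minimal Python sketch of the U-estimator with the β-power generator for the mean of a N(θ, 1) model is given below (an illustration, not from the paper); the data contain a few outliers, and the β-estimator obtained by minimizing the empirical U-loss stays close to the true mean while the sample mean is pulled away.

import numpy as np
from scipy.optimize import minimize_scalar

# Minimal sketch (not from the paper) of the U-estimator with the beta-power
# generator for the mean of a N(theta, 1) model; this is the density power
# divergence estimator. The model term b_U(theta) is constant in theta here,
# but it is kept for completeness.
beta = 0.5
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(8.0, 1.0, 5)])  # 5% outliers

def density(x, theta):
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

def beta_loss(theta):
    # b_U(theta) = integral of f(x, theta)^{1+beta}/(1+beta) over x
    b_U = (2 * np.pi) ** (-beta / 2) / ((1 + beta) ** 0.5 * (1 + beta))
    # minus the sample mean of xi_beta(f(X_i, theta)) = {f^beta - 1}/beta
    return b_U - np.mean((density(x, theta) ** beta - 1.0) / beta)

theta_hat = minimize_scalar(beta_loss, bounds=(-5, 10), method="bounded").x
print(theta_hat, np.mean(x))   # the beta-estimator stays near 0; the sample mean is pulled up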
We next investigate the asymptotic normality of the U-estimator. The estimating equation for the U-estimator is given by

(1/n) Σ_{i=1}^{n} sU(Xi, θ̂U) = 0,

and a Taylor approximation of this equation around θU(F) gives the asymptotic approximation for √n {θ̂U − θU(F)}, where the coefficient matrices converge almost surely by the strong law of large numbers as n goes to ∞. If the underlying density function is in the model M, that is, f(x) = f(x, θ), then it follows from the model consistency of θU(F) that θU(F) = θ, which implies that √n (θ̂U − θ) converges in distribution to a normal distribution with mean zero and a sandwich-type covariance matrix determined by the expected gradient matrix and the variance matrix of the estimating function sU(X, θ). If the generator function is taken as U(s) = exp(s), then the U-estimator reduces to the MLE with asymptotic normality N(0, G(θ)^{−1}), where G(θ) is the Fisher information matrix for θ.
Consider the U-estimator for the parameter θ of the exponential model M(e) in Equation (1). In particular we are concerned with possible outliers contaminating the exponential model, and hence an ε-contamination model is defined as

Fε(x) = (1 − ε) F0(x, θ) + ε δy(x),

where ε, 0 < ε < 1, is a sufficiently small constant, F0(x, θ) is the cumulative distribution function of the exponential model, and δy(x) denotes the degenerate distribution at y. The influence function for the U-estimator is given by the derivative of θU(Fε) with respect to ε at ε = 0; see [19,20,27]. Thus we can check the robustness of the U-estimator by whether the influence function is bounded in y or not. For example, if we adopt U(s) = (1 + βs)^{1/β}, then the influence function is proportional to

{ t(y) − μ } f0(y, θ)^{β} − b(θ, β),    (27)

where b(θ, β) = ∫ { t(x) − μ } f0(x, θ)^{β} dΛ(x). Thus, if β > 0, then the influence function is confirmed to be bounded in y for most cases, including the normal, exponential and Poisson distribution models, since the term {t(y) − μ} f0(y, θ)^{β} in Equation (27) is bounded in y for these models. On the other hand, if β = 0, that is, for the maximum likelihood estimator, the influence function is unbounded, because the term t(y) − μ is unbounded in y for these models.
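The boundedness argument can be seen numerically. The following Python sketch (assuming the standard normal model with t(y) = y and μ = 0) evaluates the term {t(y) − μ} f0(y, θ)^{β} for increasing y: it decays to zero for β > 0 and grows without bound for β = 0.

import numpy as np

# Small numerical illustration (assumptions: standard normal model, t(y) = y,
# mu = 0): the term {t(y) - mu} f0(y, theta)^beta stays bounded for beta > 0,
# while for beta = 0 it grows linearly in y.
def influence_term(y, beta):
    f0 = np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)   # N(0, 1) density
    return (y - 0.0) * f0**beta

y = np.array([1.0, 5.0, 10.0, 50.0])
print(influence_term(y, beta=0.5))   # decays to 0 as |y| grows: bounded
print(influence_term(y, beta=0.0))   # equals y - mu: unbounded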
6. Duality of Maximum Entropy and Minimum Divergence
In this section, we discuss a dualistic interplay between statistical model and estimation. In the statistical literature, maximum likelihood estimation has a special position over other estimation methods in the sense of efficiency, invariance and sufficiency, while various candidates for the statistical model have been explored in the presence of misspecification. For example, we frequently consider a Laplace distribution for estimating a Gaussian mean, which leads to the sample median as the maximum likelihood estimator for the mean of the Laplace distribution. In this sense, there is an imbalance in the treatment of the model and the estimator. In principle, we can select arbitrarily different generator functions U0 and U1 so that the U1-estimation gives consistency under the U0-model. A natural question is: what happens if we consider the U-estimation under the U-model?
Let MU be a U-model defined by

MU = { fU(x, θ) = u( θ⊤ t(x) − κU(θ) ) : θ ∈ Θ },    (28)

where Θ = {θ ∈ ℝk : κU(θ) < ∞}. The U-loss function under the U-model for a given data set {X1, ···, Xn} is defined by

LU(θ) = bU(θ) − (1/n) Σ_{i=1}^{n} ξ(fU(Xi, θ)),

which is reduced to

LU(θ) = bU(θ) − θ⊤ t̄ + κU(θ),    (29)

where t̄ = (1/n) Σ_{i=1}^{n} t(Xi) and bU(θ) = ∫ U(θ⊤ t(x) − κU(θ)) dΛ(x). The estimating equation is given by (∂/∂θ) LU(θ) = 0, which is written as

𝔼_{fU(·,θ)}{t(X)} = t̄.

Hence, if we consider the U-estimator for the parameter η given by the transformation of θ defined by ϕ(θ) = 𝔼_{fU(·,θ)}{t(X)}, then the U-estimator η̂U is nothing but the sample mean t̄. Here we confirm that the transformation ϕ(θ) is one-to-one as follows. The Jacobian matrix of the transformation is given by

(∂/∂θ⊤) ϕ(θ) = ∫ u′(θ⊤ t(x) − κU(θ)) { t(x) − ∂θκU(θ) }{ t(x) − ∂θκU(θ) }⊤ dΛ(x),

since the first identity for MU leads to

∫ u′(θ⊤ t(x) − κU(θ)) { t(x) − ∂θκU(θ) } dΛ(x) = 0.

Therefore, we conclude that the Jacobian matrix is symmetric and positive-definite, since u′(t) is a positive function by the assumption of convexity of U, which implies that ϕ(θ) is one-to-one. Consequently, the estimator θ̂U for θ is given by ϕ^{−1}(t̄). We summarize these results in the following theorem.
Theorem 5. Let MU be a U-model with a canonical statistic t(X) as defined in Equation (28). Then the U-estimator for the expectation parameter η of t(X) is always t̄, where t̄ = (1/n) Σ_{i=1}^{n} t(Xi).
Remark 2. We remark that an empirical Pythagorean relation holds among the U-loss values, which gives another proof that θ̂U equals ϕ^{−1}(t̄). The statistic t̄ is a sufficient statistic in the sense that the U-loss function LU(θ) is a function of t̄, as in Equation (29). Accordingly, the U-estimator under the U-model is a function only of t̄ from the observations X1, ···, Xn.
In this extension, the MLE is a function of t̄ under the exponential model with the canonical statistic t(X). Let us look at the case of the β-power divergence. Under the β-power model given by

fβ(x, θ) = { 1 + β( θ⊤ t(x) − κβ(θ) ) }^{1/β},

the β-power loss function is written as

Lβ(θ) = bβ(θ) − θ⊤ t̄ + κβ(θ),

where bβ(θ) = (1/(1 + β)) ∫ fβ(x, θ)^{1+β} dΛ(x). The β-power estimator for the expectation parameter of t(X) is exactly given by t̄.
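The following Python sketch (a hypothetical finite discrete example with t(x) = x and β = 0.5) illustrates Theorem 5: minimizing the β-power loss under the β-power model returns a fitted distribution whose expectation of t(X) coincides with the sample mean t̄.

import numpy as np
from scipy.optimize import brentq, minimize_scalar

# Hypothetical illustration of Theorem 5 for a beta-power model on the finite
# sample space {0, 1, 2, 3, 4} with t(x) = x and beta = 0.5: the beta-estimator
# under the beta-model matches the model expectation of t(X) to t_bar.
beta = 0.5
xs = np.arange(5.0)                       # support points
rng = np.random.default_rng(4)
data = rng.choice(xs, size=200, p=[0.1, 0.2, 0.3, 0.25, 0.15])
t_bar = data.mean()

def kappa(theta):
    # normalizing factor: the model probabilities must sum to 1
    total = lambda k: np.sum(np.clip(1 + beta * (theta * xs - k), 0, None) ** (1 / beta)) - 1
    return brentq(total, -50.0, 50.0)

def model(theta):
    k = kappa(theta)
    return np.clip(1 + beta * (theta * xs - k), 0, None) ** (1 / beta), k

def beta_loss(theta):
    f, k = model(theta)
    b = np.sum(f ** (1 + beta)) / (1 + beta)       # b_beta(theta)
    return b - theta * t_bar + k                   # Equation (29) specialized

theta_hat = minimize_scalar(beta_loss, bounds=(-2, 2), method="bounded").x
f_hat, _ = model(theta_hat)
print(np.sum(xs * f_hat), t_bar)                   # the two means agree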
7. Discussion
We have concentrated on elucidating the dual structure of the U-estimator under the U-model, a perspective that extends the relation of maximum likelihood to the exponential model with a functional degree of freedom. Thus, we explore a rich and practical class of duality structures; however, there remains an unsolved problem when we directly treat the space ℱ as a differentiable manifold; see [39] for an infinite dimensional exponential family. The approach here is not a direct extension to an infinite dimensional manifold but a path geometry in the following sense: for any pair of elements of ℱ, the geodesic curve connecting the pair is represented in an explicit form in the class of *Γ(U) connections in our context.
The U-divergence approach was the first attempt to introduce a dually flat structure on ℱ that is different from the alpha-geometry. However, there are many related studies. For example, a nonparametric information geometry on the space of all functions without constraints of positivity and normalization is discussed in Zhang [40]. Amari [41] characterizes the (ρ, τ)-divergence with a decomposable dually flat structure; see also [42]. If ρ is the identity function and τ(s) = (d/ds)U(s), then the (ρ, τ)-divergence is nothing but the U-divergence. In effect we confine ourselves to discussing the U-divergence class for the sake of the direct estimability of the U-estimator.
The duality between the maximum entropy and the minimum divergence has been explored via the minimax theorem for a zero-sum game between a decision maker and Nature. The pay-off function is taken to be the U-cross entropy, in which Nature tries to maximize the pay-off function under the mean equal constraint, while the decision maker tries to minimize the pay-off function. The equilibrium is given by the minimax solution, which is the maximum U-entropy distribution; see [43] for the extensive discussion and the relation with Bayesian robustness. The observation explored in this paper is closely related to this minimax argument; however, we focus on the duality between the statistical model and estimation, where the minimum U-divergence leads to a projection onto the U-model.
In principle, the U-estimator is applicable to any statistical model, since the U-loss function is written in terms of a sample, just as the log-likelihood function is. If the choice of the model is different from the U-model, then the U-estimator has a different performance from the present situation. For example, we may consider an exponential model (U(s) = exp(s)) and a β-estimator (U(s) = (1 + βs)^{1/β}) in order to obtain a robustness property against outlying observations, cf. [19,20]. In such situations, the duality property is no longer valid, since the β-estimator for the parameter of the exponential model is not a function of the sufficient statistic t̄ defined in Theorem 5. Thus, we have to pay attention to aspects other than the duality structure in the presence of outliers or misspecification of the statistical model. Furthermore, another type of divergence measure, including the projective power divergence, is recommended to achieve super robustness, cf. [21,44].
We have presented a method of generalized maximum entropy based on the proposed entropy measures, as an extension of the classical maximum entropy method based on the Boltzmann-Gibbs-Shannon entropy. Practical applications of MaxEnt are actively pursued in ecological and computational linguistic research based on the classical maximum entropy, cf. [45,46]. Difficult aspects have been discussed, in which MaxEnt is apt to over-learn on data sets because it basically employs the maximum likelihood estimator. There is great potential for the proposed method to contribute to these research fields and overcome these difficulties by selecting an appropriate generator function. A detailed discussion is beyond the scope of the present paper; however, we will take up this challenge in the near future with concrete objectives motivated by real data analysis.