1. Introduction
Information divergence plays a central role in integrating statistics, information science, statistical physics and machine learning. Let ℱ be the space of all probability density functions with a common support with respect to a carrier measure Λ on a data space. Usually Λ is taken as the Lebesgue measure or the counting measure, corresponding to continuous and discrete random variables, respectively. The most typical example of information divergence is the Kullback-Leibler divergence

D0(f, g) = ∫ f(x) log{f(x)/g(x)} dΛ(x)

on ℱ × ℱ, which is decomposed into the difference of the cross entropy C0(f, g) and the diagonal entropy H0(f), that is, D0(f, g) = C0(f, g) − H0(f). The entropy H0(f) is nothing but the Boltzmann-Gibbs-Shannon entropy.
In effect, D0(f, g) connects the maximum likelihood [1,2] and the maximum entropy [3]. If we take a canonical statistic t(X), then the maximum entropy distribution under a moment constraint for t(X) belongs to the exponential model associated with t(X),

M(e) = { f0(x, θ) = exp{θ⊤ t(x) − κ0(θ)} : θ ∈ Θ },    (1)

where κ0(θ) = log ∫ exp{θ⊤ t(x)} dΛ(x) and Θ = {θ : κ0(θ) < ∞}. In this context, the statistic t(X) is minimally sufficient in the model, and the maximum likelihood estimator (MLE) for the parameter θ of the model is in one-to-one correspondence with t(X); see [4] for the convex geometry. If we consider the expectation parameter μ = 𝔼_{f0(·,θ)}{t(X)} in place of θ, then for a given random sample X1, ···, Xn, the MLE for μ is given by the sample mean of the t(Xi)'s, that is, μ̂ = (1/n) Σ_{i=1}^{n} t(Xi).
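As a small numerical illustration (not taken from the paper), the following Python sketch checks this property for the Poisson family written in exponential-family form with t(x) = x: the MLE of the expectation parameter μ is the sample mean, and the natural parameter is recovered as θ̂ = log μ̂.

import numpy as np

# Minimal numerical sketch (not from the paper): the Poisson family in the
# exponential-family form f0(x, theta) = exp{theta * x - kappa0(theta)} with
# respect to the carrier measure (1/x!) times the counting measure has
# t(x) = x and kappa0(theta) = exp(theta), so the MLE of the expectation
# parameter mu = E[t(X)] is the sample mean, and theta_hat = log(mu_hat).
rng = np.random.default_rng(0)
x = rng.poisson(lam=3.0, size=1000)   # random sample X_1, ..., X_n

mu_hat = np.mean(x)                   # MLE of the expectation parameter mu
theta_hat = np.log(mu_hat)            # corresponding natural parameter
print(f"mu_hat = {mu_hat:.3f}, theta_hat = {theta_hat:.3f}")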
We define two kinds of geodesic curves connecting f and g in ℱ. We call the curve

ft(x) = (1 − t) f(x) + t g(x),  t ∈ [0, 1],    (2)

the mixture geodesic. Alternatively, we call the curve

ft(x) = exp{ (1 − t) log f(x) + t log g(x) − κ(t) },  t ∈ [0, 1],    (3)

the exponential geodesic, where κ(t) = log ∫ f(x)^{1−t} g(x)^{t} dΛ(x). We denote by Γ(m) and Γ(e) the two linear connections induced by the mixture and exponential geodesic curves on ℱ, which we call the mixture connection and the exponential connection on ℱ, respectively; see [5,6]. Thus all tangent vectors on a mixture geodesic curve are parallel to each other with respect to Γ(m), and all tangent vectors on an exponential geodesic curve are parallel to each other with respect to Γ(e). It is well known that
M(e) is totally exponential-geodesic, that is, for any f0(x, θ0) and f0(x, θ1) in M(e) the exponential geodesic curve connecting f0(x, θ0) and f0(x, θ1) lies in M(e). In effect we observe that ft(x) = f0(x, θt) with θt = (1 − t)θ0 + tθ1; thus f0(·, θt) ∈ M(e) for all t ∈ (0, 1) because Θ is a convex set. Alternatively, consider a mixture model M(m) with mixing parameter π. Then M(m) is totally mixture-geodesic, because the mixture geodesic curve connecting fπ0 and fπ1 is fπt with πt = (1 − t)π0 + tπ1, which is in M(m) for any t ∈ (0, 1).
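The two geodesics can be computed explicitly. The following Python sketch (an illustration with assumed Gaussian endpoints, not from the paper) evaluates the mixture geodesic in Equation (2) and the exponential geodesic in Equation (3) on a grid, with the normalizer κ(t) obtained by numerical integration.

import numpy as np

# Hypothetical illustration: mixture and exponential geodesics between two
# Gaussian densities f and g, evaluated on a grid; kappa(t) is obtained by
# numerical integration (trapezoidal rule), as in Equation (3).
x = np.linspace(-10.0, 10.0, 4001)
f = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)        # N(1, 1)
g = np.exp(-0.5 * (x + 2.0) ** 2 / 4.0) / np.sqrt(8 * np.pi)  # N(-2, 4)

def mixture_geodesic(t):
    return (1 - t) * f + t * g                                 # Equation (2)

def exponential_geodesic(t):
    log_ft = (1 - t) * np.log(f) + t * np.log(g)
    kappa_t = np.log(np.trapz(np.exp(log_ft), x))              # kappa(t)
    return np.exp(log_ft - kappa_t)                            # Equation (3)

for t in (0.0, 0.5, 1.0):
    # both curves integrate to 1 for every t
    print(t, np.trapz(mixture_geodesic(t), x), np.trapz(exponential_geodesic(t), x))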
We discuss generalized entropy and divergence measures with applications to statistical models and estimation. There have been recent developments in the generalization of the Boltzmann-Shannon entropy and the Kullback-Leibler divergence. We focus on the U-divergence with a generator function U, in which the U-divergence is separated into the difference between a cross entropy and a diagonal entropy. We observe a dualistic property associated with the U-divergence between the statistical model and the estimation. The U-loss function is given by an empirical approximation of the U-divergence based on a given dataset under a statistical model, and the U-estimator is defined by minimization of the U-loss function over the parameter space. On the other hand, the diagonal entropy leads to a maximum entropy distribution on a mean equal space; we call the resulting family of distributions the U-model. In accordance with this, the U-divergence leads to a pair of the U-model and the U-estimator as statistical model and estimation. The typical example is U(t) = exp(t), which is associated with the Kullback-Leibler divergence D0(f, g) and generates the pair of an exponential family M(e) and the minus log-likelihood function.
This aspect is characterized as a minimax game between a decision maker and Nature. The paper is organized as follows. Section 2 introduces the class of U-divergence measures. The information geometric framework associated with a divergence measure is given in Section 3. Section 4 discusses the maximum entropy model with respect to the U-diagonal entropy. The minimum divergence method via the U-divergence is discussed in Section 5. We next explore the duality between maximum U-entropy and minimum U-divergence in Section 6. Finally, we discuss the relation to robust statistics by minimum divergence and a future problem on MaxEnt in Section 7.
2. U-Divergence
A class of information divergences is constructed from a generator function U via a simple employment of conjugate convexity; see [7]. We introduce a class 𝒰 of generator functions U that are strictly increasing and strictly convex on ℝ. For U in 𝒰 we consider the conjugate convex function defined on ℝ+,

U*(t) = sup_{s∈ℝ} { st − U(s) },

and hence U*(t) = tξ(t) − U(ξ(t)), where ξ(t) is the inverse function of the derivative of U(s), or equivalently (dU/ds)(ξ(t)) = t. The existence of ξ(t) is guaranteed by the assumption that U is in 𝒰, and we observe the important property that the derivative of U* is the inverse of the derivative of U, that is,

(d/dt) U*(t) = ξ(t).    (4)
The conjugate function U* of U is reflexive, that is, U** = U. By definition, for any s ∈ ℝ and t ∈ ℝ+,

st ≤ U(s) + U*(t),    (5)

with equality if and only if s = ξ(t). We consider an information divergence functional using the generator function U,

DU(f, g) = ∫ [ U(ξ(g(x))) − U(ξ(f(x))) − f(x){ ξ(g(x)) − ξ(f(x)) } ] dΛ(x),    (6)

called the U-divergence. We can easily confirm that DU(f, g) satisfies the first axiom of a distance function, since the integrand in Equation (6) is always nonnegative and equals 0 if and only if f(x) = g(x), because of Equation (5). It follows from the construction that DU(f, g) is decomposed into CU(f, g) and HU(f) such that

DU(f, g) = CU(f, g) − HU(f).
Here

CU(f, g) = ∫ { U(ξ(g(x))) − f(x) ξ(g(x)) } dΛ(x)

is called the U-cross entropy, and

HU(f) = −∫ U*(f(x)) dΛ(x)    (7)

is called the U-diagonal entropy. We can write HU(f) = ∫ { U(ξ(f(x))) − f(x) ξ(f(x)) } dΛ(x) by the definition of U*, which equals the diagonal value CU(f, f). We note that the U-divergence is expressed as

DU(f, g) = ∫ [ U*(f(x)) − U*(g(x)) − { f(x) − g(x) } ξ(g(x)) ] dΛ(x)

because of Equation (4), which implies that U* plays the role of a generator function in place of U. In fact, this is also called the U*-Bregman divergence, cf. [8,9].
The first example of U is U0(s) = exp(s), which leads to ξ(t) = log t and U0*(t) = t log t − t. Thus the U0-divergence, U0-cross entropy and U0-diagonal entropy equal D0(f, g), C0(f, g) and H0(f) as defined in the Introduction, respectively. As the second example we consider

Uβ(s) = (1/(1 + β)) (1 + βs)^{(β+1)/β},

where β is a scalar. The conjugate function becomes

Uβ*(t) = (1/β) { t^{β+1}/(β + 1) − t }.

Then the generator function Uβ is associated with the β-power cross entropy

Cβ(f, g) = −(1/β) ∫ f(x) g(x)^{β} dΛ(x) + (1/(β + 1)) ∫ g(x)^{β+1} dΛ(x) + 1/β,

the β-power diagonal entropy

Hβ(f) = −(1/(β(β + 1))) ∫ f(x)^{β+1} dΛ(x) + 1/β,

and the β-power divergence Dβ(f, g) = Cβ(f, g) − Hβ(f), that is,

Dβ(f, g) = (1/β) ∫ f(x){ f(x)^{β} − g(x)^{β} } dΛ(x) − (1/(β + 1)) ∫ { f(x)^{β+1} − g(x)^{β+1} } dΛ(x).

The class of β-power divergence functionals includes the Kullback-Leibler divergence in the limiting sense that lim_{β→0} Dβ(f, g) = D0(f, g). If β = 1, then D1(f, g) = (1/2) ∫ {f(x) − g(x)}^{2} dΛ(x), which is half of the squared L2 norm. If we take the limit of β to −1, then Dβ(f, g) becomes the Itakura-Saito divergence

D−1(f, g) = ∫ [ f(x)/g(x) − log{f(x)/g(x)} − 1 ] dΛ(x),

which is widely applied in signal processing and speech recognition, cf. [10–12].
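For a concrete check (a hypothetical numerical example, not from the paper), the following Python snippet evaluates Dβ(f, g) for two discrete distributions under the counting measure and confirms the limiting behaviour: for small β it approaches the Kullback-Leibler divergence, and for β = 1 it equals half the squared L2 norm.

import numpy as np

# Hypothetical numerical check of the beta-power divergence between two discrete
# distributions (counting measure), following the decomposition
# D_beta(f, g) = (1/beta) sum f (f^beta - g^beta)
#                - (1/(beta+1)) sum (f^{beta+1} - g^{beta+1}).
def beta_divergence(f, g, beta):
    if beta == 0.0:                      # Kullback-Leibler limit
        return np.sum(f * np.log(f / g))
    return (np.sum(f * (f**beta - g**beta)) / beta
            - np.sum(f**(beta + 1) - g**(beta + 1)) / (beta + 1))

f = np.array([0.2, 0.5, 0.3])
g = np.array([0.3, 0.4, 0.3])

print(beta_divergence(f, g, 1e-6))                             # close to the KL divergence
print(beta_divergence(f, g, 0.0))
print(beta_divergence(f, g, 1.0), 0.5 * np.sum((f - g)**2))    # half squared L2 norm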
The β-power divergence Dβ(p, q) was proposed in [13]; the β-power entropy Hβ corresponds to the Tsallis q-entropy under the relation q = β + 1, cf. [14–16]. Tsallis entropy is connected with spin glass relaxation, dissipative optical lattices and so on, beyond the classical statistical physics associated with the Boltzmann-Shannon entropy H0(p). See also [17,18] for the power entropy in the field of ecology. We will discuss the statistical properties of the minimum β-divergence method in the presence of outliers departing from a supposed model, cf. [19–21]. The robustness performance is elucidated by an appropriate selection of β. Beyond the robustness perspective, a property of spontaneous learning applied to clustering analysis is the focus of [22]; see also [23] for nonnegative matrix analysis.
The third example of a generator function is Uη(s) = (1 − η) exp(s) − ηs with a scalar η. This generator function leads to the η-cross entropy Cη(f, g) and the η-entropy Hη(f), so that the η-divergence is Dη(f, g) = Cη(f, g) − Hη(f); see [24–27] for applications to pattern recognition. Obviously, if we take the limit of η to 0, then Cη(f, g), Hη(f) and Dη(f, g) converge to C0(f, g), H0(f) and D0(f, g), respectively. A mislabeled model is derived as a maximum η-entropy distribution with a moment constraint if we consider a binary regression model. See [25,27] for a detailed discussion.
3. Geometry Associated with U-Divergence
We investigate geometric properties associated with the U-divergence, which will support the discussion in subsequent sections. Let us arbitrarily fix a statistical model M = {fθ(x) : θ ∈ Θ} embedded in the total space ℱ with mild regularity conditions. In fact, we consider the mixture geodesic curve C(m), the exponential geodesic curve C(e), the mixture model M(m) and the exponential model M(e) as typical examples of M. There are difficulties in defining ℱ as a differentiable manifold of infinite dimension, because the constraint of positivity on the support is intractable in the sense of the topology; see Section 2 of [6] for a detailed discussion and historical remarks. On the other hand, if we confine ourselves to a statistical model M, then we can formulate M as a finite dimensional manifold, as in the following discussion. Thus, we produce a path geometry in which, for any two elements f and g of ℱ, a class of geodesic curves connecting f and g, including C(m) and C(e), is introduced, so that the corresponding class of geodesic subspaces, such as M(m) and M(e), is derived.
3.1. Riemannian Metric and Linear Connections
We view the statistical model M as a d-dimensional differentiable manifold with the coordinate θ = (θ1, ···, θd). Any information divergence is associated with a Riemannian metric and a pair of dual linear connections; see [28,29] for a detailed discussion. We focus on the geometry generated by the U-divergence DU(f, g) as follows. The Riemannian metric at fθ of M is given by

G(U)_{ij}(θ) = ∫ ∂ifθ(x) ∂jξ(fθ(x)) dΛ(x),    (10)

and the linear connections are

Γ(U)_{ij,k}(θ) = ∫ ∂i∂jfθ(x) ∂kξ(fθ(x)) dΛ(x)    (11)

and

*Γ(U)_{ij,k}(θ) = ∫ ∂i∂jξ(fθ(x)) ∂kfθ(x) dΛ(x),    (12)

where ∂i = ∂/∂θi; see the Appendix for the derivation. Now we can assert the following theorem under an assumption for ℱ: let f be arbitrarily fixed in ℱ; if ∫ a(x){g(x) − f(x)} dΛ(x) = 0 for any g of ℱ, then a(x) is constant in x almost everywhere with respect to Λ.
Theorem 1. Let Γ(U) be the linear connection defined in Equation (11). Then any Γ(U)-geodesic curve is equal to the mixture geodesic curve defined in Equation (2).

Proof. Let C(U) := {ft(x) : t ∈ (0, 1)} be a Γ(U)-geodesic curve with f0 = f and f1 = g. We consider a model defined by fθ(x) = (1 − s + u) ft(x) + (s − u) g(x), where θ = (s, t, u). Then we observe that, if u = s, the corresponding component of the Γ(U)-geodesic equation vanishes identically for any g of ℱ. It follows from the assumption for ℱ that (d²/dt²) ft(x) = c almost everywhere with respect to Λ, which is solved by

ft(x) = (1 − t) f(x) + t g(x) + (c/2) t(t − 1)

from the endpoint conditions for C(U). We observe that c = 0 because ft(x) ∈ ℱ, which concludes that C(U) equals the mixture geodesic. The proof is complete.
This property is elemental in characterizing the U-divergence class, and it is closely related to the empirical reducibility discussed in a subsequent section. The assumption for ℱ holds if the carrier measure Λ is the Lebesgue measure or the counting measure.
On the other hand, for a *Γ(U)-geodesic curve {ft(x) : t ∈ (0, 1)} with f0 = f and f1 = g, we consider an embedding into a 2-dimensional model fθ(x) with θ = (s, t), where u(s) = (d/ds)U(s) and κθ is a normalizing constant chosen so that ∫ fθ(x) dΛ(x) = 1. By definition fθ(x) = ft(x) if s = t. This leads to an identity holding almost everywhere with respect to Λ, which is solved by

ft(x) = u( (1 − t) ξ(f(x)) + t ξ(g(x)) − κt ).

We confirm that, if U = exp, then the *Γ(U)-geodesic curve reduces to the exponential geodesic curve defined in Equation (3). □
3.2. Generalized Pythagorean Theorems
We next consider the Pythagorean theorem based on the U-divergence as an extension of the result associated with the Kullback-Leibler divergence in [6].

Theorem 2. Let p, q and r be in ℱ. We connect p with q by the mixture geodesic

ft(x) = (1 − t) q(x) + t p(x),  t ∈ [0, 1].

Alternatively we connect r and q by the *Γ(U)-geodesic curve

rs(x) = u( (1 − s) ξ(q(x)) + s ξ(r(x)) − κs ),  s ∈ [0, 1].    (15)

The two curves {ft(x) : t ∈ [0, 1]} and {rs(x) : s ∈ [0, 1]} orthogonally intersect at q with respect to the Riemannian metric G(U) defined in Equation (10) if and only if

DU(p, r) = DU(p, q) + DU(q, r).    (16)

Proof. A straightforward calculus yields that

∫ { p(x) − q(x) }{ ξ(r(x)) − ξ(q(x)) } dΛ(x) = DU(p, q) + DU(q, r) − DU(p, r).    (17)

By the definition of G(U) we see that the inner product at q of the tangent vectors of the two curves is nothing but the left side of Equation (17) when the two curves are embedded into a 2-dimensional model fθ with θ = (t, s). Hence the orthogonality assumption is equivalent to Equation (16), which completes the proof.
Remark 1. We remark a further property such that, for any s and t in [0, 1],

DU(ft, rs) = DU(ft, q) + DU(q, rs).

If U = exp, then Theorem 2 reduces to the Pythagorean theorem with the Kullback-Leibler divergence as shown in [6]. Consider two geodesic subspaces M(m) and M(U) containing q, where M(m) is totally mixture-geodesic and M(U) is totally *Γ(U)-geodesic. For any m-geodesic curve C(m) in M(m) and U-geodesic curve *C(U) in M(U) passing through q, we assume that C(m) and *C(U) orthogonally intersect at q in the sense of the Riemannian metric G(U). Then, for any p ∈ M(m) and r ∈ M(U),

DU(p, r) = DU(p, q) + DU(q, r),

in which the two-way projection is associated as

q = argmin_{p′∈M(m)} DU(p′, r) = argmin_{r′∈M(U)} DU(p, r′).

First we confirm a reduction property of the Kullback-Leibler divergence to the framework of information geometry such that (G(D0), Γ(D0), *Γ(D0)) = (G, Γ(m), Γ(e)), where G is the information metric. Second we return to the case of the β-power divergence, which reduces to a special case of Theorem 2. Consider the mixture geodesic curve connecting q and p and the β-power geodesic curve connecting q and r. Then we observe for the Riemannian metric G(β) generated by the β-power divergence that the inner product of the two tangent vectors at q is proportional to ∫ (p − q)(r^{β} − q^{β}) dΛ. We observe that, if C(m) and C(β) orthogonally intersect at q, then

Dβ(p, r) = Dβ(p, q) + Dβ(q, r).
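The identity in Equation (17) can be verified numerically. The following Python sketch (a hypothetical check with randomly generated discrete distributions) confirms, for the β-power case with ξ(t) = (t^{β} − 1)/β, that the inner-product term equals Dβ(p, q) + Dβ(q, r) − Dβ(p, r), so that orthogonality at q is exactly the Pythagorean relation of Theorem 2.

import numpy as np

# Hypothetical numerical check of the identity behind Theorem 2 for the
# beta-power case: with xi(t) = (t^beta - 1)/beta,
#   sum (p - q)(xi(r) - xi(q)) = D_beta(p, q) + D_beta(q, r) - D_beta(p, r).
beta = 0.5

def xi(t):
    return (t**beta - 1.0) / beta

def D_beta(f, g):
    return (np.sum(f * (f**beta - g**beta)) / beta
            - np.sum(f**(beta + 1) - g**(beta + 1)) / (beta + 1))

rng = np.random.default_rng(1)
p, q, r = (rng.dirichlet(np.ones(5)) for _ in range(3))

lhs = np.sum((p - q) * (xi(r) - xi(q)))
rhs = D_beta(p, q) + D_beta(q, r) - D_beta(p, r)
print(lhs, rhs)     # the two values agree up to rounding error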
4. Maximum Entropy Distribution
The maximum entropy principle is based on the Boltzmann-Shannon entropy, in which the maximum entropy distribution is characterized by an exponential model. The maximum entropy method has been widely employed in fields such as natural language processing and ecological analysis. However, other types of entropy measures have been proposed in different fields, such as the Hill diversity index, the Gini-Simpson index and the Tsallis entropy, cf. [14,17,18]. We have introduced the class of U-entropy functionals, which includes all the entropy measures mentioned above. In this section, we discuss the maximum entropy distribution based on an arbitrarily fixed U-entropy.
We check a finite discrete case with K + 1 cells as a special situation, where ℱ reduces to the K-dimensional probability simplex. The maximum U-entropy distribution is defined by

f* = argmax_{f} HU(f),

where the maximization is taken over the simplex. We observe that the stationarity condition requires ξ(f*i) to be constant in i, which implies

f*i = 1/(K + 1)

for i = 1, ···, K + 1. Therefore the maximum U-entropy distribution f* is the uniform distribution on the simplex for any generator function U.
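The following Python sketch (a hypothetical check, with the β-power generator and β = 0.5 chosen for illustration) compares the U-diagonal entropy of the uniform distribution with that of randomly drawn points of the simplex; no candidate exceeds the uniform value.

import numpy as np

# Hypothetical check of the finite discrete case: on the simplex with K + 1 cells,
# the U-diagonal entropy is maximized by the uniform distribution; here we use the
# beta-power generator with beta = 0.5.
beta = 0.5

def H_beta(f):
    # U-diagonal entropy for U_beta, up to the additive constant 1/beta
    return -np.sum(f**(beta + 1)) / (beta * (beta + 1))

K = 4
uniform = np.full(K + 1, 1.0 / (K + 1))

rng = np.random.default_rng(2)
candidates = rng.dirichlet(np.ones(K + 1), size=10000)

print(H_beta(uniform))                            # entropy of the uniform distribution
print(np.max([H_beta(f) for f in candidates]))    # never exceeds the uniform value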
In general the U-entropy is an unbounded functional on ℱ unless the data space is finite discrete. For this reason we introduce a moment constraint as follows. Let t(X) be a k-dimensional statistic vector. Henceforth we assume that 𝔼f{‖t(X)‖^{2}} < ∞ for all f of ℱ. We consider the mean equal space for t(X),

Γ(τ) = { f ∈ ℱ : 𝔼f{t(X)} = τ },

where τ is a fixed vector in ℝk. By definition Γ(τ) is totally mixture-geodesic, that is, if f and g are in Γ(τ), then (1 − t)f + tg is also in Γ(τ) for any t ∈ (0, 1).
Theorem 3. Let

fU* = argmax_{f∈Γ(τ)} HU(f),

where HU(f) is the U-diagonal entropy defined in Equation (7). Then the maximum U-entropy distribution is given by

fU*(x) = u( θ⊤ t(x) − κU(θ) ),    (20)

where κU(θ) is the normalizing factor and θ is a parameter vector determined by the moment constraint 𝔼_{fU*}{t(X)} = τ.

Proof. The Euler-Lagrange functional is given by the U-diagonal entropy HU(f) augmented with Lagrange multiplier terms for the constraints ∫ f dΛ = 1 and 𝔼f{t(X)} = τ. If gτ ∈ Γ(τ) and ft = (1 − t)fU* + t gτ, then ft ∈ Γ(τ), and the derivative of HU(ft) at t = 0 can be evaluated along this direction. The equation in Equation (21) yields that

∫ ξ(fU*(x)) { gτ(x) − fU*(x) } dΛ(x) = 0

for any gτ(x) in Γ(τ), which concludes Equation (20). Since ξ(t) is an increasing function, we observe that

(d²/dt²) HU(ft) = −∫ { gτ(x) − fU*(x) }^{2} ξ′(ft(x)) dΛ(x) ≤ 0

for any t ∈ [0, 1], which implies the inequality in Equation (21). Since gτ ∈ Γ(τ), we observe that 𝔼_{gτ}{t(X)} = 𝔼_{fU*}{t(X)} = τ, so that CU(gτ, fU*) = HU(fU*). Therefore we can confirm that HU(fU*) ≥ HU(gτ) for any gτ ∈ Γ(τ), since

HU(fU*) − HU(gτ) = CU(gτ, fU*) − HU(gτ) = DU(gτ, fU*),

which is nonnegative by the definition of the U-divergence. The proof is complete.
Here we give a definition of the model of maximum U-entropy distributions as follows.

Definition 1. We define a k-dimensional model

MU = { fU(x, θ) = u( θ⊤ t(x) − κU(θ) ) : θ ∈ Θ },    (23)

which is called the U-model, where Θ = {θ ∈ ℝk : κU(θ) < ∞}.
The Naudts' deformed exponential family, discussed from a statistical physics viewpoint as in [15], is closely related to the U-model. The one-parameter family {rs(x) : s ∈ [0, 1]} defined in Equation (15) is a one-dimensional U-model, and the geodesic subspace M(U) in Section 3.2 is a K-dimensional U-model. For a U-model MU defined in Equation (23), the parameter θ is an affine parameter for the linear connection *Γ(U) defined in Equation (12). In fact, we observe from the definition in Equation (12) that

*Γ(U)_{ij,k}(θ) = −∂i∂jκU(θ) ∫ ∂kfU(x, θ) dΛ(x),

which is identically 0 for all θ ∈ Θ since ∫ fU(x, θ) dΛ(x) = 1. We have a geometric understanding of the U-model similar to that of the exponential model discussed in the Introduction.
Theorem 4. Assume for U(t) that U‴(t) > 0 for any t in ℝ. Then, the U-model is totally *Γ(U)-geodesic.

Proof. For arbitrarily fixed θ1 and θ2 in Θ, we define the U-geodesic curve connecting fU(x, θ1) and fU(x, θ2) such that, for λ ∈ (0, 1),

fλ(x) = u( λ ξ(fU(x, θ1)) + (1 − λ) ξ(fU(x, θ2)) − κ(λ) )

with a normalizing factor κ(λ), which is written as fλ(x) = fU(x, θλ), where θλ = λθ1 + (1 − λ)θ2. Hence it suffices to show that θλ ∈ Θ for all λ ∈ (0, 1), where Θ is defined in Definition 1. We look at the identity ∫ fU(x, θ) dΛ(x) = 1, which follows from the fact that fU(x, θ) is a probability density function. The first derivative gives

∫ u′(θ⊤ t(x) − κU(θ)) { ti(x) − ∂iκU(θ) } dΛ(x) = 0,

and the second derivative gives

∂i∂jκU(θ) ∫ u′(θ⊤ t(x) − κU(θ)) dΛ(x) = ∫ u″(θ⊤ t(x) − κU(θ)) { ti(x) − ∂iκU(θ) }{ tj(x) − ∂jκU(θ) } dΛ(x).    (24)

The identity in Equation (24) shows that the Hessian of κU(θ) is proportional to a Gramian matrix, which implies that κU(θ) is convex in θ. Since κU(θλ) ≤ λ κU(θ1) + (1 − λ) κU(θ2) and θ1 and θ2 are in Θ, we have κU(θλ) < ∞. This concludes that θλ ∈ Θ for any λ ∈ (0, 1), which completes the proof.
We discuss a typical example with the power entropy Hβ(f); see [15,30–34] from the viewpoint of statistical physics. First we consider a mean equal space of univariate distributions on (0, ∞) with a two-component statistic t(x), depending on β and a shape parameter κ, such that lim_{β→0} t(x) = (x, (κ − 1) log x). To obtain the maximum entropy distribution with respect to Hβ we consider the Euler-Lagrange function with Lagrange multiplier parameters θ and λ. This yields the maximum entropy distribution fβ(·, θ), where θ is determined by μ such that 𝔼_{fβ(·,θ)}{t(X)} = μ. A gamma distribution is defined by the density function

f(x) = x^{κ−1} exp(−x/σ) / {Γ(κ) σ^{κ}},  x > 0,

which arises as the maximum entropy distribution in the limit of β to 0. Second, we consider the case of multivariate distributions, where the moment constraints are, for a fixed p-dimensional vector μ and a matrix V of size p × p,

𝔼f{X} = μ,  𝔼f{(X − μ)(X − μ)⊤} = V.

If we consider the limit case of β to 0, then Hβ(f) reduces to the Boltzmann-Shannon entropy and the maximum entropy distribution is the Gaussian distribution with the density function

f(x) = (2π)^{−p/2} det(V)^{−1/2} exp{ −(1/2)(x − μ)⊤ V^{−1} (x − μ) }.

In general we deduce that, for β in an appropriate range, the maximum β-power entropy distribution uniquely exists and its density function is proportional to a power of a quadratic function of x. See [35,36] for the detailed discussion and [37,38] for the discussion on group invariance. Thus, if β > 0, then the maximum β-power entropy distribution has a compact ellipsoidal support. The typical case is β = 2, which is called the Wigner semicircle distribution. On the other hand, if β < 0, the maximum β-power entropy distribution has the full support ℝp and equals a p-variate t-distribution with a degree of freedom depending on β.
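As an illustration (with hypothetical constants, not taken from the paper), the following Python sketch normalizes a density of the truncated power form {1 − λ(x − μ)²}_{+}^{1/β} on a grid for β = 2, showing the compact support and the resulting mean and variance numerically.

import numpy as np

# Illustrative sketch (hypothetical constants): for beta > 0 the maximum
# beta-power entropy density with given mean and variance has the truncated form
# f(x) proportional to {1 - lam * (x - mu)^2}_+^{1/beta}, so its support is the
# compact interval |x - mu| <= 1/sqrt(lam); here it is normalized numerically.
beta = 2.0          # beta = 2 corresponds to the Wigner semicircle shape
mu, lam = 0.0, 0.25

x = np.linspace(mu - 1 / np.sqrt(lam), mu + 1 / np.sqrt(lam), 20001)
kernel = np.clip(1.0 - lam * (x - mu) ** 2, 0.0, None) ** (1.0 / beta)
f = kernel / np.trapz(kernel, x)           # normalize to a probability density

print(np.trapz(f, x))                      # ~1.0
print(np.trapz(x * f, x))                  # mean ~ mu
print(np.trapz((x - mu) ** 2 * f, x))      # variance implied by lam and beta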
5. Minimum Divergence Method
We have shown a variety of U-divergence functionals using various generator functions, for which the minimum divergence methods are applied to analyses in statistics and statistical machine learning. In effect the U-cross entropy CU(f, g) is convex-linear in f, that is,

CU( Σ_{j} λj fj, g ) = Σ_{j} λj CU(fj, g)

for any λj > 0 with Σ_{j} λj = 1. This is closely related to the characteristic property that the linear connection Γ(U) associated with the U-divergence is equal to the mixture connection Γ(m), as discussed in Theorem 1. Furthermore, for a fixed g, CU(f, g) can be viewed as a functional of the distribution F in place of the density f as follows:

CU(F, g) = ∫ U(ξ(g(x))) dΛ(x) − ∫ ξ(g(x)) dF(x),

where F is the probability distribution generated from f(x). If we assume to have a random sequence X1, ···, Xn from the density function f(x), then the U-cross entropy is approximated by CU(F̄n, g), where F̄n is the empirical distribution based on the data X1, ···, Xn, that is, F̄n(B) = (1/n) Σ_{i=1}^{n} I(Xi ∈ B) for any Borel measurable set B. By definition,

CU(F̄n, g) = ∫ U(ξ(g(x))) dΛ(x) − (1/n) Σ_{i=1}^{n} ξ(g(Xi)).    (25)

Consequently, if we model g by a model function f(·, θ), then the right side of Equation (25) depends only on the data set {X1, ···, Xn} and the parameter θ, without any knowledge of the underlying density function f(x). This gives the empirical approximation, which is advantageous over other classes of divergence measures. The minimum U-divergence method is directly applied by minimizing the empirical approximation with respect to θ. We note that minimizing the divergence is equivalent to minimizing the cross entropy, since the diagonal entropy is just a constant in θ. In particular, in the classical case U0(s) = exp(s),

C0(F̄n, f(·, θ)) = 1 − (1/n) Σ_{i=1}^{n} log f(Xi, θ),

which is equivalent to the minus log-likelihood function.
Let X1, ···, Xn be independently and identically distributed from an underlying density function f(x), which is approximated by a statistical model M = {f(x, θ) : θ ∈ Θ}. The U-loss function is introduced by

LU(θ) = bU(θ) − (1/n) Σ_{i=1}^{n} ξ(f(Xi, θ)),

where bU(θ) = ∫ U(ξ(f(x, θ))) dΛ(x). We call θ̂U = argmin_{θ∈Θ} LU(θ) the U-estimator for the parameter θ. By definition 𝔼f{LU(θ)} = CU(F, f(·, θ)) for all θ in Θ, which implies that LU(θ) almost surely converges to CU(F, f(·, θ)) as n goes to ∞. Let us define a statistical functional

θU(F) = argmin_{θ∈Θ} CU(F, f(·, θ)),

where CU(F, g) is CU(f, g) written with f replaced by the probability distribution F generated from f. Then θU(F) is model-consistent, that is, θU(Fθ) = θ for any θ ∈ Θ, because

CU(Fθ, f(·, θ′)) ≥ CU(Fθ, f(·, θ))

with equality if and only if θ′ = θ, where Fθ is the probability distribution induced from f(x, θ). Hence the U-estimator θ̂U is asymptotically consistent. The estimating function is given by

sU(x, θ) = (∂/∂θ) ξ(f(x, θ)) − (∂/∂θ) bU(θ).

Consequently we confirm that sU(x, θ) is unbiased in the sense that 𝔼_{f(·,θ)}{sU(X, θ)} = 0.
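A minimal Python sketch of the U-estimator with the β-power generator for the mean of a N(θ, 1) model is given below (an illustration, not from the paper); the data contain a few outliers, and the β-estimator obtained by minimizing the empirical U-loss stays close to the true mean while the sample mean is pulled away.

import numpy as np
from scipy.optimize import minimize_scalar

# Minimal sketch (not from the paper) of the U-estimator with the beta-power
# generator for the mean of a N(theta, 1) model; this is the density power
# divergence estimator. The model term b_U(theta) is constant in theta here,
# but it is kept for completeness.
beta = 0.5
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(8.0, 1.0, 5)])  # 5% outliers

def density(x, theta):
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

def beta_loss(theta):
    # b_U(theta) = integral of f(x, theta)^{1+beta}/(1+beta) over x
    b_U = (2 * np.pi) ** (-beta / 2) / ((1 + beta) ** 0.5 * (1 + beta))
    # minus the sample mean of xi_beta(f(X_i, theta)) = {f^beta - 1}/beta
    return b_U - np.mean((density(x, theta) ** beta - 1.0) / beta)

theta_hat = minimize_scalar(beta_loss, bounds=(-5, 10), method="bounded").x
print(theta_hat, np.mean(x))   # the beta-estimator stays near 0; the sample mean is pulled up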
We next investigate the asymptotic normality of the U-estimator. The estimating equation for the U-estimator is given by

(1/n) Σ_{i=1}^{n} sU(Xi, θ̂U) = 0,

and a Taylor approximation of this equation around θU(F) gives the asymptotic approximation for √n {θ̂U − θU(F)}, where the coefficient matrices converge almost surely by the strong law of large numbers as n goes to ∞. If the underlying density function is in the model M, that is, f(x) = f(x, θ), then it follows from the model consistency of θU(F) that θU(F) = θ, which implies that √n (θ̂U − θ) converges in distribution to a normal distribution with mean zero and a sandwich-type covariance matrix determined by the expected gradient matrix and the variance matrix of the estimating function sU(X, θ). If the generator function is taken as U(s) = exp(s), then the U-estimator reduces to the MLE with asymptotic normality N(0, G(θ)^{−1}), where G(θ) is the Fisher information matrix for θ.
Consider the U-estimator for the parameter θ of the exponential model M(e) in Equation (1). In particular we are concerned with possible outliers contaminating the exponential model, and hence an ε-contamination model is defined as

Fε(x) = (1 − ε) F0(x, θ) + ε δy(x),

where ε, 0 < ε < 1, is a sufficiently small constant, F0(x, θ) is the cumulative distribution function of the exponential model, and δy(x) denotes the degenerate distribution at y. The influence function for the U-estimator is given by the derivative of θU(Fε) with respect to ε at ε = 0; see [19,20,27]. Thus we can check the robustness of the U-estimator by whether the influence function is bounded in y or not. For example, if we adopt U(s) = (1 + βs)^{1/β}, then the influence function is proportional to

{ t(y) − μ } f0(y, θ)^{β} − b(θ, β),    (27)

where b(θ, β) = ∫ { t(x) − μ } f0(x, θ)^{β} dΛ(x). Thus, if β > 0, then the influence function is confirmed to be bounded in y for most cases, including the normal, exponential and Poisson distribution models, since the term {t(y) − μ} f0(y, θ)^{β} in Equation (27) is bounded in y for these models. On the other hand, if β = 0, that is, for the maximum likelihood estimator, the influence function is unbounded, because the term t(y) − μ is unbounded in y for these models.
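The boundedness argument can be seen numerically. The following Python sketch (assuming the standard normal model with t(y) = y and μ = 0) evaluates the term {t(y) − μ} f0(y, θ)^{β} for increasing y: it decays to zero for β > 0 and grows without bound for β = 0.

import numpy as np

# Small numerical illustration (assumptions: standard normal model, t(y) = y,
# mu = 0): the term {t(y) - mu} f0(y, theta)^beta stays bounded for beta > 0,
# while for beta = 0 it grows linearly in y.
def influence_term(y, beta):
    f0 = np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)   # N(0, 1) density
    return (y - 0.0) * f0**beta

y = np.array([1.0, 5.0, 10.0, 50.0])
print(influence_term(y, beta=0.5))   # decays to 0 as |y| grows: bounded
print(influence_term(y, beta=0.0))   # equals y - mu: unbounded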
6. Duality of Maximum Entropy and Minimum Divergence
In this section, we discuss a dualistic interplay between statistical model and estimation. In the statistical literature, maximum likelihood estimation has a special position over other estimation methods in the sense of efficiency, invariance and sufficiency, while various candidates for the statistical model have been explored in the presence of misspecification. For example, we frequently consider a Laplace distribution for estimating a Gaussian mean, which leads to the sample median as the maximum likelihood estimator for the mean of the Laplace distribution. In this sense, there is an imbalance in the treatment of the model and the estimator. In principle, we can select arbitrarily different generator functions U0 and U1 so that the U1-estimation gives consistency under the U0-model. A natural question is: what happens if we consider the U-estimation under the U-model?
Let MU be a U-model defined by

MU = { fU(x, θ) = u( θ⊤ t(x) − κU(θ) ) : θ ∈ Θ },    (28)

where Θ = {θ ∈ ℝk : κU(θ) < ∞}. The U-loss function under the U-model for a given data set {X1, ···, Xn} is defined by

LU(θ) = bU(θ) − (1/n) Σ_{i=1}^{n} ξ(fU(Xi, θ)),

which is reduced to

LU(θ) = bU(θ) − θ⊤ t̄ + κU(θ),    (29)

where t̄ = (1/n) Σ_{i=1}^{n} t(Xi) and bU(θ) = ∫ U(θ⊤ t(x) − κU(θ)) dΛ(x). The estimating equation is given by (∂/∂θ) LU(θ) = 0, which is written as

𝔼_{fU(·,θ)}{t(X)} = t̄.

Hence, if we consider the U-estimator for the parameter η given by the transformation of θ defined by ϕ(θ) = 𝔼_{fU(·,θ)}{t(X)}, then the U-estimator η̂U is nothing but the sample mean t̄. Here we confirm that the transformation ϕ(θ) is one-to-one as follows. The Jacobian matrix of the transformation is given by

(∂/∂θ⊤) ϕ(θ) = ∫ u′(θ⊤ t(x) − κU(θ)) { t(x) − ∂θκU(θ) }{ t(x) − ∂θκU(θ) }⊤ dΛ(x),

since the first identity for MU leads to

∫ u′(θ⊤ t(x) − κU(θ)) { t(x) − ∂θκU(θ) } dΛ(x) = 0.

Therefore, we conclude that the Jacobian matrix is symmetric and positive-definite, since u′(t) is a positive function by the assumption of convexity of U, which implies that ϕ(θ) is one-to-one. Consequently, the estimator θ̂U for θ is given by ϕ^{−1}(t̄). We summarize these results in the following theorem.
Theorem 5. Let MU be a U-model with a canonical statistic t(X) as defined in Equation (28). Then the U-estimator for the expectation parameter η of t(X) is always t̄, where t̄ = (1/n) Σ_{i=1}^{n} t(Xi).
Remark 2. We remark that an empirical Pythagorean relation holds among the U-loss values, which gives another proof that θ̂U equals ϕ^{−1}(t̄). The statistic t̄ is a sufficient statistic in the sense that the U-loss function LU(θ) is a function of t̄, as in Equation (29). Accordingly, the U-estimator under the U-model is a function only of t̄ from the observations X1, ···, Xn.
In this extension, the MLE is a function of t̄ under the exponential model with the canonical statistic t(X). Let us look at the case of the β-power divergence. Under the β-power model given by

fβ(x, θ) = { 1 + β( θ⊤ t(x) − κβ(θ) ) }^{1/β},

the β-power loss function is written as

Lβ(θ) = bβ(θ) − θ⊤ t̄ + κβ(θ),

where bβ(θ) = (1/(1 + β)) ∫ fβ(x, θ)^{1+β} dΛ(x). The β-power estimator for the expectation parameter of t(X) is exactly given by t̄.
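The following Python sketch (a hypothetical finite discrete example with t(x) = x and β = 0.5) illustrates Theorem 5: minimizing the β-power loss under the β-power model returns a fitted distribution whose expectation of t(X) coincides with the sample mean t̄.

import numpy as np
from scipy.optimize import brentq, minimize_scalar

# Hypothetical illustration of Theorem 5 for a beta-power model on the finite
# sample space {0, 1, 2, 3, 4} with t(x) = x and beta = 0.5: the beta-estimator
# under the beta-model matches the model expectation of t(X) to t_bar.
beta = 0.5
xs = np.arange(5.0)                       # support points
rng = np.random.default_rng(4)
data = rng.choice(xs, size=200, p=[0.1, 0.2, 0.3, 0.25, 0.15])
t_bar = data.mean()

def kappa(theta):
    # normalizing factor: the model probabilities must sum to 1
    total = lambda k: np.sum(np.clip(1 + beta * (theta * xs - k), 0, None) ** (1 / beta)) - 1
    return brentq(total, -50.0, 50.0)

def model(theta):
    k = kappa(theta)
    return np.clip(1 + beta * (theta * xs - k), 0, None) ** (1 / beta), k

def beta_loss(theta):
    f, k = model(theta)
    b = np.sum(f ** (1 + beta)) / (1 + beta)       # b_beta(theta)
    return b - theta * t_bar + k                   # Equation (29) specialized

theta_hat = minimize_scalar(beta_loss, bounds=(-2, 2), method="bounded").x
f_hat, _ = model(theta_hat)
print(np.sum(xs * f_hat), t_bar)                   # the two means agree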
7. Discussion
We have concentrated on elucidating the dual structure of the U-estimator under the U-model, a perspective that extends the relation of maximum likelihood to the exponential model with a functional degree of freedom. Thus, we explore a rich and practical class of duality structures; however, there remains an unsolved problem when we directly treat the space ℱ as a differentiable manifold; see [39] for an infinite dimensional exponential family. The approach here is not a direct extension to an infinite dimensional manifold but a path geometry in the following sense: for any pair of elements of ℱ, the geodesic curve connecting the pair is represented in an explicit form in the class of *Γ(U) connections in our context.
The U-divergence approach was the first attempt to introduce a dually flat structure on ℱ that is different from the alpha-geometry. However, there are many related studies. For example, a nonparametric information geometry on the space of all functions without constraints of positivity and normalization is discussed in Zhang [40]. Amari [41] characterizes the (ρ, τ)-divergence with a decomposable dually flat structure; see also [42]. If ρ is the identity function and τ(s) = (d/ds)U(s), then the (ρ, τ)-divergence is nothing but the U-divergence. In effect we confine ourselves to discussing the U-divergence class for the sake of the direct estimability of the U-estimator.
The duality between the maximum entropy and the minimum divergence has been explored via the minimax theorem for a zero-sum game between a decision maker and Nature. The pay-off function is taken to be the U-cross entropy, in which Nature tries to maximize the pay-off function under the mean equal constraint, while the decision maker tries to minimize the pay-off function. The equilibrium is given by the minimax solution, which is the maximum U-entropy distribution; see [43] for the extensive discussion and the relation with Bayesian robustness. The observation explored in this paper is closely related to this minimax argument; however, we focus on the duality between the statistical model and estimation, where the minimum U-divergence leads to a projection onto the U-model.
In principle, the U-estimator is applicable to any statistical model, since the U-loss function is written in terms of a sample, just as the log-likelihood function is. If the choice of the model is different from the U-model, then the U-estimator has a different performance from the present situation. For example, we may consider an exponential model (U(s) = exp(s)) and a β-estimator (U(s) = (1 + βs)^{1/β}) in order to obtain a robustness property against outlying observations, cf. [19,20]. In such situations, the duality property is no longer valid, since the β-estimator for the parameter of the exponential model is not a function of the sufficient statistic t̄ defined in Theorem 5. Thus, we have to pay attention to aspects other than the duality structure in the presence of outliers or misspecification of the statistical model. Furthermore, another type of divergence measure, including the projective power divergence, is recommended to achieve super robustness, cf. [21,44].
We have presented a method of generalized maximum entropy based on the proposed entropy measures, as an extension of the classical maximum entropy method based on the Boltzmann-Gibbs-Shannon entropy. Practical applications of MaxEnt are actively pursued in ecological and computational linguistic research based on the classical maximum entropy, cf. [45,46]. Difficult aspects have been discussed, in which MaxEnt is apt to over-learn on data sets because it basically employs the maximum likelihood estimator. There is great potential for the proposed method to contribute to these research fields and overcome these difficulties by selecting an appropriate generator function. A detailed discussion is beyond the scope of the present paper; however, we will take up this challenge in the near future with concrete objectives motivated by real data analysis.