
Consistency and Generalization Bounds for Maximum Entropy Density Estimation

1 Kno.e.sis Center, Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435, USA
2 Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada
3 Visa Inc., San Francisco, CA 94128, USA
* Author to whom correspondence should be addressed.
Entropy 2013, 15(12), 5439-5463; https://doi.org/10.3390/e15125439
Submission received: 9 July 2013 / Revised: 13 November 2013 / Accepted: 3 December 2013 / Published: 9 December 2013
(This article belongs to the Special Issue Maximum Entropy and Bayes Theorem)

Abstract: We investigate the statistical properties of maximum entropy density estimation, both for the complete data case and the incomplete data case. We show that under certain assumptions, the generalization error can be bounded in terms of the complexity of the underlying feature functions. This allows us to establish the universal consistency of maximum entropy density estimation.

1. Introduction

The maximum entropy (ME) principle, originally proposed by Jaynes in 1957 [1], is an effective method for combining different sources of evidence from complex, yet structured, natural systems. It has since been widely applied in science, engineering and economics. In machine learning, ME was first popularized by Della Pietra et al. [2], who applied it to induce overlapping features for a Markov random field model of natural language. Later, it was applied to other machine learning areas, such as information fusion [3] and reinforcement learning [4]. It is now well known that for complete data, the ME principle is equivalent to maximum likelihood estimation (MLE) in a Markov random field. In fact, these two problems are exact duals of one another. Recently, Wang et al. [5] proposed the latent maximum entropy (LME) principle to extend Jaynes’ maximum entropy principle to deal with hidden variables, and demonstrated its effectiveness in many statistical models, such as mixture models, Boltzmann machines and language models [6]. We show that LME is different from both Jaynes’ maximum entropy principle and maximum likelihood estimation, but it often yields better estimates in the presence of hidden variables and limited training data. This paper investigates the statistical properties of maximum entropy density estimation for both the complete and incomplete data cases.
Large sample asymptotic convergence results for MLE are typically based on point estimation analysis [7] in parametric models. Although point estimators have been extensively studied in the statistics literature since Fisher, these analyses typically do not consider generalization ability. Vapnik and Chervonenkis famously reformulated the problem of MLE for density estimation in the framework of empirical risk minimization and provided the first necessary and sufficient conditions for consistency [8,9]. However, the model they considered is still in a Fisher–Wald parametric setting. Barron and Sheu [10] considered a density estimation problem very similar to the one we address here, but only restricted to the one-dimensional case within a bounded interval. Their analysis cannot be easily generalized to a high dimensional case. Recently, Dudik et al. [11] analyzed regularized maximum entropy density estimation with inequality constraints and derived generalization bounds for this model. However, once again, their analysis does not easily extend beyond the specific model considered.
Some researchers have studied the consistency of maximum likelihood estimators under the Hellinger divergence [12], which is a particularly convenient measure for studying maximum likelihood estimation in a general distribution-free setting. However, Kullback–Leibler divergence is a more natural measure for probability distributions and is closely related to the perplexity measure used in language modeling and speech recognition research [13,14]. Moreover, convergence in the Kullback–Leibler divergence always establishes consistency in terms of Hellinger divergence [12], but not vice versa. Therefore, we concentrate on using the Kullback–Leibler divergence in our analysis.
In this paper, we investigate consistency and generalization bounds for maximum entropy density estimation with respect to the Kullback–Leibler divergence. The main technique we use in our analysis is Rademacher complexity, first used by Koltchinskii and Panchenko [15] to analyze the generalization error of combined classification methods, such as boosting, support vector machines and neural networks. Since then, the convenience of Rademacher analysis has been exploited by many to analyze various learning problems in classification and regression. For example, Rakhlin et al. [16] have used this technique to derive risk bounds for the density estimation of mixture models, which basically belong to directed graphical models using a conditional parameterization. Here, we use the Rademacher technique to analyze the generalization error of maximum entropy density estimation for general Markov random fields.

2. Maximum Entropy Density Estimation: Complete Data

Let $X \in \mathcal{X}$ be a random variable. Given a set of feature functions $F(x) = \{ f_1(x), \ldots, f_N(x) \}$ specifying properties one would like to match in the data, the maximum entropy principle states that we should select a probability model, $p(x)$, from the space of all probability distributions, $\mathcal{P}(x)$, over $\mathcal{X}$, to maximize entropy subject to the constraint that the feature expectations are preserved:

$$\max_{p(x) \in \mathcal{P}(x)} \; -\int_{x \in \mathcal{X}} p(x) \log p(x) \, \mu(dx)$$

$$\text{s.t.} \quad \int_{x \in \mathcal{X}} f_i(x) p(x) \, \mu(dx) = \int_{x \in \mathcal{X}} f_i(x) p_0(x) \, \mu(dx); \quad i = 1, \ldots, N,$$

where $p_0(x)$ denotes the unknown underlying true density and $\mu$ denotes a given $\sigma$-finite measure on $\mathcal{X}$. If $\mathcal{X}$ is finite or countably infinite, then $\mu$ is the counting measure, and integrals reduce to sums. If $\mathcal{X}$ is a subset of a finite dimensional space, $\mu$ is the Lebesgue measure. If $\mathcal{X}$ is a combination of both cases, $\mu$ will be a combination of both measures. The dual problem is:

$$\min_{\lambda \in \Omega} \; -\int_{x \in \mathcal{X}} p_0(x) \log p_\lambda(x) \, \mu(dx)$$

where $p_\lambda(x) = \Phi_\lambda^{-1} \exp\left( \sum_{i=1}^N \lambda_i f_i(x) \right)$ and $\Phi_\lambda = \int_{x \in \mathcal{X}} \exp\left( \sum_{i=1}^N \lambda_i f_i(x) \right) \mu(dx)$ is a normalizing constant that ensures $\int_{x \in \mathcal{X}} p_\lambda(x) \, \mu(dx) = 1$.
We will use the following notation and terminology throughout the analysis below. Define:

$$\Omega = \left\{ \lambda \in \mathbb{R}^N : \int_{x \in \mathcal{X}} \exp\left( \sum_{i=1}^N \lambda_i f_i(x) \right) \mu(dx) < \infty \right\}$$

and let:

$$\mathcal{E}(x) = \left\{ p_\lambda(x) \in \mathcal{P}(x) : p_\lambda(x) = \frac{1}{\Phi_\lambda} \exp\left( \sum_{i=1}^N \lambda_i f_i(x) \right), \; \lambda \in \Omega \right\}$$

denote the exponential family induced by the set of feature functions. The restriction, $\lambda \in \Omega$, will guarantee that the maximum likelihood estimate is an interior point of the set of $\lambda$'s for which $p_\lambda(x)$ is defined. The optimal solution, $p_{\hat\lambda}(x)$, of Equation (1) or Equation (3) is called the information projection [10,17] of $p_0(x)$ to the exponential family, $\mathcal{E}(x)$.
In practice, the true distribution, $p_0(x)$, is not known, but instead, a collection of data $\tilde{X} = (x_1, \ldots, x_M)$ sampled from $p_0(x)$ is given. Therefore, instead of using the true distribution, $p_0(x)$, we use the empirical distribution, $\tilde{p}(x)$, to calculate the feature expectations. The ME principle then becomes:

$$\max_{p(x) \in \mathcal{P}} \; -\int_{x \in \mathcal{X}} p(x) \log p(x) \, \mu(dx)$$

$$\text{s.t.} \quad \int_{x \in \mathcal{X}} f_i(x) p(x) \, \mu(dx) = \sum_{x \in \tilde{X}} \tilde{p}(x) f_i(x); \quad i = 1, \ldots, N,$$

and the dual problem using these constraints becomes:

$$\min_{\lambda \in \Omega} \; -\frac{1}{M} \sum_{j=1}^M \log p_\lambda(x_j)$$

The optimal solution, $p_{\lambda^*}(x)$, of Equation (4) or Equation (6) is the information projection of $\tilde{p}(x)$ to the exponential family, $\mathcal{E}(x)$; see Figure 1.
Figure 1. In the space of all probability distributions, $\mathcal{P}(x)$ over $\mathcal{X}$, $p_{\hat\lambda}(x)$ is the information projection of the true (but unknown) distribution, $p_0(x)$, to the exponential family, $\mathcal{E}$, and $p_{\lambda^*}(x)$ is the information projection of the empirical distribution, $\tilde{p}(x)$, to $\mathcal{E}$.
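To make the empirical dual in Equation (6) concrete, the following minimal sketch fits $\lambda^*$ on a toy discrete domain by plain gradient descent; the dual gradient is simply the gap between model and empirical feature expectations. The domain, features, data and step size are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal sketch of the empirical ME dual (Equation (6)) on a toy discrete
# domain (mu = counting measure). All specifics here are illustrative.
X = np.arange(6)                                  # domain {0, ..., 5}
F = np.stack([X / 5.0, (X / 5.0) ** 2])           # N x |X| matrix of features f_i(x)
data = np.array([0, 1, 1, 2, 3, 3, 4, 5])         # M samples from the unknown p_0
emp_mean = F[:, data].mean(axis=1)                # empirical feature expectations

lam = np.zeros(F.shape[0])
for _ in range(2000):                             # gradient descent on the convex dual
    logits = lam @ F
    p = np.exp(logits - logits.max())
    p /= p.sum()                                  # p_lambda(x) = exp(<lam, f(x)>) / Phi
    lam -= 0.5 * (F @ p - emp_mean)               # gradient: model minus empirical means

print("lambda*:", lam)
print("max constraint gap:", np.abs(F @ p - emp_mean).max())
```

At convergence, the constraint gap vanishes, illustrating that the dual minimizer $p_{\lambda^*}(x)$ matches the empirical feature expectations exactly, as the primal constraints require.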

2.1. Consistency and Generalization Bounds of Estimation Error

There are two standard ways to measure the quality of the estimate, $p_{\lambda^*}(x)$.
One approach, based on the Kullback–Leibler divergence, was first considered by Barron and Sheu [10]. Basically, it uses the following well-known Pythagorean property (see Lemma 3 in [10]):

$$D(p_0(x) \,\|\, p_\lambda(x)) = D(p_0(x) \,\|\, p_{\hat\lambda}(x)) + D(p_{\hat\lambda}(x) \,\|\, p_\lambda(x)) \qquad \forall \, p_\lambda(x) \in \mathcal{E}$$

and, in particular, the decomposition:

$$D(p_0(x) \,\|\, p_{\lambda^*}(x)) = D(p_0(x) \,\|\, p_{\hat\lambda}(x)) + D(p_{\hat\lambda}(x) \,\|\, p_{\lambda^*}(x))$$

Here, the first term, $D(p_0(x) \,\|\, p_{\hat\lambda}(x))$, is the approximation error introduced by the bias of the set of feature functions, $F$, which measures how closely the feature functions are able to approximate the true probability distribution. The second term, $D(p_{\hat\lambda}(x) \,\|\, p_{\lambda^*}(x))$, is the estimation error for densities in the exponential family, introduced by the variance of using a finite sample size. These two terms resemble the bias-variance tradeoff in least squares cost estimation [18]. The approximation error always exists unless the set of feature functions, $F$, is rich enough [19]. In this paper, we assume that the set of feature functions is given and study how close $p_{\hat\lambda}(x)$ is to $p_{\lambda^*}(x)$.
Another way to evaluate the quality of $p_{\lambda^*}(x)$ is to measure the difference between the best expected likelihood and the empirical likelihood of the estimate, which is a more desirable approach, because we can directly calculate the empirical likelihood of the estimate from the training data.

Both approaches, in fact, fall under the umbrella of Vapnik's empirical risk minimization principle [8] for density estimation. Here, we use the first approach to show that the maximum entropy solution converges to the best possible solution, $p_{\hat\lambda}(x)$, and the second approach to show that the value of the empirical likelihood converges to the maximum expected likelihood.

2.1.1. Maximum Entropy Principle

Exploiting tools from empirical process theory, including symmetrization, concentration and contraction inequalities, Koltchinskii and Panchenko [15] were able to give bounds on the generalization error of boosting, support vector machines (SVMs) and neural networks. We show through the following theorem that we can use similar tools to establish generalization bounds for the estimation error of maximum entropy density estimation. All proofs can be found in the Appendix.
Theorem 1. 
Assume $\sup_{\lambda \in \Omega} \|\lambda\|_1 < \infty$ and $\sup_{f \in F, x \in \mathcal{X}} \|f(x)\|_\infty < \infty$. Then, there exist $0 < \zeta < \alpha < \infty$, such that, for any $\eta \in (0,1)$, with a probability of at least $1 - \eta$,

$$D(p_{\hat\lambda}(x) \,\|\, p_{\lambda^*}(x)) = D(p_0(x) \,\|\, p_{\lambda^*}(x)) - D(p_0(x) \,\|\, p_{\hat\lambda}(x)) \le \frac{4 C_1}{\sqrt{M}} E_{\tilde{X}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon \right] + C_2 \sqrt{\frac{2 \log(1/\eta)}{M}}$$

where $M$ is the number of instances, $N(F(x), \epsilon, d_x)$ is the random covering number [20,21] for linear combinations of functions in $F(x)$ at scale $\epsilon$ with empirical Euclidean distance $d_x$ on sample $\tilde{X}$, and $C_1$ and $C_2$ are positive constants that do not depend on the instance.
If the linear combination of feature functions belongs to a VC-subgraph class with Vapnik–Chervonenkis (VC) dimension $V$, it is well known that the Dudley integral can be bounded in terms of $V$ [12,20,22].
Rakhlin et al. [16] first applied a similar technique to derive generalization bounds for the density estimation of mixture models. Here, we apply the technique to derive bounds for maximum entropy density estimation in general Markov random fields. Even though the analysis we use is standard, our results show that the generalization bound is not related to the log partition function, but instead is upper bounded by the covering number of linear combinations of feature functions. This is an important observation, because the log partition function—a fundamental quantity associated with any graph-structured distribution—is NP-hard to compute in general [23].
Although the assumption that $\sup_{f \in F, x \in \mathcal{X}} \|f(x)\|_\infty < \infty$ seems rather restrictive, most of the graphical models studied in machine learning [23] are, in fact, discrete Markov random fields, which always satisfy this condition.
To eliminate the assumption of boundedness on the parameters and feature functions, we use a result adapted from [24].
Theorem 2. 
Assume that there exists a positive number, $K(F)$, such that for all $\tau > 0$,

$$\log E_{p_0(x)} \left[ p_{\lambda_\tau}(x)^{2\tau} + \frac{1}{p_{\lambda_\tau}(x)^{2\tau}} \right] \le (\tau K(F))^2$$

where $\lambda_\tau$ are the parameters such that $E_{p_{\lambda_\tau}(x)}(f(x)) = f(x)$. Then, for all $\lambda \in \Omega$, we have, with a probability of at least $1 - \eta$,

$$E_{p_0(x)} \log p_\lambda(x) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(x_j) \le E_{\tilde{X}} \sup_{\lambda \in \Omega, f(x) \in F(x)} \left( E_{p_0(x)} \langle \lambda, f(x) \rangle - E_{\tilde{p}(x)} \langle \lambda, f(x) \rangle \right) + K(F) \sqrt{\frac{2 \log(1/\eta)}{M}}$$
By the above result, Theorem 1 can be proven with $C_2$ replaced by $K(F)$. Since the value, $K(F)$, is hard to determine in practice, we state our results in terms of the bound on feature functions instead. Nevertheless, the reader should keep in mind that this bound can be replaced by $K(F)$ in all our results.
Using the result of [21], the following theorem gives a useful bound on the covering number.
Theorem 3. 
Assume $\sup_{\lambda \in \Omega} \|\lambda\|_2 < a$ and $\sup_{f \in F, x \in \mathcal{X}} \|f(x)\|_2 < b$. Then:

$$\log_2 N(F(x), \epsilon, d_x) \le \frac{a^2 b^2}{\epsilon^2} \min\left( \log_2(2M+1), \log_2(2N+1) \right)$$
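To see how this bound feeds into Theorem 1, one can plug it into Dudley's entropy integral; converting $\log_2$ to the natural logarithm costs a factor of $\log 2$ inside the square root, and a short calculation gives:

$$\int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon \le ab \sqrt{\log 2 \cdot \min\left( \log_2(2M+1), \log_2(2N+1) \right)} \int_\zeta^\alpha \frac{d\epsilon}{\epsilon} = ab \sqrt{\log 2 \cdot \min\left( \log_2(2M+1), \log_2(2N+1) \right)} \, \log \frac{\alpha}{\zeta}$$

so the estimation-error term in Theorem 1 scales as $O\left( ab \sqrt{\log \min(2M+1, 2N+1)} / \sqrt{M} \right)$.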
We then have the following corollaries. The first is a direct consequence of Theorem 1. The second provides a generalization bound on the expected value of the estimation error, which can be shown using the proof in [16,20].
Corollary 1. 
Universal consistency: If $\int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon$ is bounded, then as $M \to \infty$, $p_{\lambda^*}(x)$ will converge to $p_{\hat\lambda}(x)$ in terms of the Kullback–Leibler divergence with rate $O(1/\sqrt{M})$, independent of the form of the true distribution, $p_0$.
Corollary 2. 
Under the conditions of Theorem 1, the generalization bound of the expected estimation error is:
$$E_{\tilde{X}} D(p_{\hat\lambda}(x) \,\|\, p_{\lambda^*}(x)) = E_{\tilde{X}} \left[ D(p_0(x) \,\|\, p_{\lambda^*}(x)) - D(p_0(x) \,\|\, p_{\hat\lambda}(x)) \right] \le \frac{4 C_1}{\sqrt{M}} E_{\tilde{X}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon \right] + C_2 \sqrt{\frac{2}{M}}$$
Next, we turn to the second approach to evaluating the quality of $p_{\lambda^*}(x)$; namely, by measuring the difference between the expected log-likelihood and the empirical log-likelihood. This is the approach used by Dudik et al. [11].

Define $L_0(\lambda) = \int_{x \in \mathcal{X}} p_0(x) \log p_\lambda(x) \, \mu(dx)$ and $\tilde{L}(\lambda) = \frac{1}{M} \sum_{j=1}^M \log p_\lambda(x_j)$. Then, from Theorem 1 and the McDiarmid concentration inequality, we obtain the following result for the sample log-likelihood, which is similar to Dudik et al. [11].
Theorem 4. 
There exist $0 < \zeta < \alpha < \infty$, such that, with a probability of at least $1 - \eta$:

$$|L_0(\hat\lambda) - \tilde{L}(\lambda^*)| = \left| E_{p_0(x)} \log p_{\hat\lambda}(x) - \frac{1}{M} \sum_{j=1}^M \log p_{\lambda^*}(x_j) \right| \le \frac{6 C_1}{\sqrt{M}} E_{\tilde{X}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon \right] + 3 C_2 \sqrt{\frac{\log(1/\eta)}{2M}}$$
By the Glivenko–Cantelli theorem [9], we know that the empirical distribution converges to the true distribution. Therefore, under certain conditions, the entropy of the empirical distribution will also converge to the entropy of the true distribution. Whenever such convergence holds, we can combine this with Theorem 4 to show that $D(p_0(x) \,\|\, p_{\hat\lambda}(x)) - D(\tilde{p}(x) \,\|\, p_{\lambda^*}(x)) \to 0$ as $M \to \infty$, establishing a stronger form of consistency than Corollary 1.

2.1.2. Regularized Maximum Entropy Principle

In many statistical modeling settings, the constraints used in the ME principle are subject to errors from the empirical nature of the data. This is particularly true in domains with sparse, high dimensional data. One way to gain robustness, though, is to relax the constraints and add a penalty to the entropy, leading to the regularized ME (RME) principle [25]:
$$\max_{p(x), a} \; -\int_{x \in \mathcal{X}} p(x) \log p(x) \, \mu(dx) - U(a)$$

$$\text{s.t.} \quad \int_{x \in \mathcal{X}} p(x) f_i(x) \, \mu(dx) = \sum_x \tilde{p}(x) f_i(x) + a_i; \quad i = 1, \ldots, N$$

Here, $a = (a_1, \ldots, a_N)^T$, $a_i$ is the error for each constraint, and $U: \mathbb{R}^N \to \mathbb{R}$ is a cost function that has a value of zero at zero. The function penalizes any constraint violations, and can be used to penalize deviations in the more reliably observed constraints to a greater degree than deviations in less reliably observed constraints.
The dual problem of RME becomes:
$$\min_{\lambda \in \Omega} \; -\frac{1}{M} \sum_{j=1}^M \log p_\lambda(x_j) + U^*(\lambda)$$

which is equivalent to maximum a posteriori (MAP) estimation with prior $U^*(\lambda)$. Note that the prior, $U^*(\lambda)$, is derived from the penalty function, $U$, over errors $a$, by setting $U^*$ to the convex (Fenchel) conjugate [26,27] of $U$, i.e., $U^*(\lambda) = \sup_a \left[ \langle \lambda, a \rangle - U(a) \right]$. This function is always convex, regardless of the convexity of $U$. Conversely, given the convex conjugate cost function, $U^*$, the corresponding penalty function, $U$, can be derived by using the property of Fenchel biconjugation [26]; that is, the conjugate of the conjugate of a convex function is the original convex function, $U = U^{**}$.
Conventionally, $U(a)$ is chosen to be nonnegative and have a minimum of zero at zero. To illustrate, consider a quadratic penalty $U(a) = \sum_{i=1}^N \frac{1}{2} \sigma_i^2 a_i^2$. Here, the convex conjugate $U^*(\lambda) = \sum_{i=1}^N \frac{\lambda_i^2}{2 \sigma_i^2}$ can be determined by setting $a_i = \frac{\lambda_i}{\sigma_i^2}$, which specifies a Gaussian prior on $\lambda$. A different example can be obtained by considering the Laplacian prior on $\lambda$, $U^*(\lambda) = \|\lambda\|_1 = \sum_{i=1}^N |\lambda_i|$, which leads to the penalty function:

$$U(a) = \begin{cases} 0 & \|a\|_\infty = \max_{i=1,\ldots,N} |a_i| \le 1 \\ \infty & \text{otherwise} \end{cases}$$

that forces hard inequality constraints.
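The following sketch adapts the earlier toy example to the RME dual in Equation (18) with the quadratic penalty above; the prior variances, domain and data are again illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the RME dual (Equation (18)): negative log-likelihood plus
# the Gaussian-prior term U*(lambda) = sum_i lambda_i^2 / (2 sigma_i^2).
X = np.arange(6)
F = np.stack([X / 5.0, (X / 5.0) ** 2])
data = np.array([0, 1, 1, 2, 3, 3, 4, 5])
emp_mean = F[:, data].mean(axis=1)
sigma2 = np.array([1.0, 1.0])                     # assumed per-feature prior variances

lam = np.zeros(F.shape[0])
for _ in range(2000):
    logits = lam @ F
    p = np.exp(logits - logits.max()); p /= p.sum()
    lam -= 0.5 * ((F @ p - emp_mean) + lam / sigma2)   # dual gradient + prior gradient

# Stationarity gives E_model[f] - E_emp[f] = -lambda / sigma^2, i.e. the relaxed
# constraints are met with error a_i = lambda_i / sigma_i^2, up to sign convention.
logits = lam @ F
p = np.exp(logits - logits.max()); p /= p.sum()
print("residual:", F @ p - emp_mean, " -lambda/sigma^2:", -lam / sigma2)
```

Compared with the unregularized sketch, the constraints are now deliberately violated by a small amount controlled by the prior variances, trading constraint fit for a smaller $\|\lambda\|$.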
However, there is an important aspect of the validity of choosing a legal (log) prior, $U^*$, which has been ignored in previous studies on the RME principle [11,25,28,29]. To see this, consider the following. By plugging the true distribution, $p_0(x)$, instead of the empirical distribution, into Equation (16), one obtains:

$$\max_{p(x), a} \; -\int_{x \in \mathcal{X}} p(x) \log p(x) \, \mu(dx) - U(a)$$

$$\text{s.t.} \quad \int_{x \in \mathcal{X}} p(x) f_i(x) \, \mu(dx) = \int_{x \in \mathcal{X}} p_0(x) f_i(x) \, \mu(dx) + a_i; \quad i = 1, \ldots, N$$

The dual problem is:

$$\min_{\lambda \in \Omega} \; -\int_{x \in \mathcal{X}} p_0(x) \log p_\lambda(x) \, \mu(dx) + U^*(\lambda)$$

Since $p_{\hat\lambda}(x)$ is the information projection of $p_0(x)$ to the exponential family, $\mathcal{E}(x)$, we must have $U(\hat{a}) = U^*(\hat\lambda) = 0$ and $\hat{a} = 0$, where $\hat{a}$ denotes the $a$ corresponding to $\hat\lambda$. Moreover, as a penalty term, the prior, $U^*(\lambda)$, should be nonnegative with a minimum of zero. We call a prior satisfying these conditions a legal prior. Both the standard Gaussian prior $U^*(\lambda) = \sum_{i=1}^N \frac{\lambda_i^2}{2 \sigma_i^2}$ and the standard Laplacian prior $U^*(\lambda) = \|\lambda\|_1$ fail to satisfy $U^*(\hat\lambda) = 0$. The correct choices for these priors should be $U^*(\lambda - \hat\lambda) = \sum_{i=1}^N \frac{(\lambda_i - \hat\lambda_i)^2}{2 \sigma_i^2}$ for a Gaussian prior and $U^*(\lambda - \hat\lambda) = \|\lambda - \hat\lambda\|_1$ for a Laplacian prior, respectively. Consequently, $U(a)$ should be chosen as $U(a) + \langle a, \hat\lambda \rangle$. Note that $U(a) + \langle a, \hat\lambda \rangle$ does have a value of zero at zero, but it is no longer nonnegative.
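A one-step Fenchel calculation confirms this choice: for any penalty $U$,

$$\sup_a \left[ \langle \lambda, a \rangle - \left( U(a) + \langle a, \hat\lambda \rangle \right) \right] = \sup_a \left[ \langle \lambda - \hat\lambda, a \rangle - U(a) \right] = U^*(\lambda - \hat\lambda)$$

so the shifted penalty $U(a) + \langle a, \hat\lambda \rangle$ induces exactly the shifted prior $U^*(\lambda - \hat\lambda)$, which is nonnegative and vanishes at $\lambda = \hat\lambda$, as a legal prior requires.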
If we let $p_{\lambda^\circ}$ denote the solution to Equation (17), we then obtain the following generalization bound on the estimation error without any restrictions on $U(a)$ or $U^*(\lambda)$.
Theorem 5. 
Assume $\sup_{\lambda \in \Omega} \|\lambda\|_1 < \infty$ and $\sup_{f \in F, x \in \mathcal{X}} \|f(x)\|_\infty < \infty$. Then, there exist $0 < \zeta < \alpha < \infty$, such that, with a probability of at least $1 - \eta$,

$$D(p_{\hat\lambda}(x) \,\|\, p_{\lambda^\circ}(x)) = D(p_0(x) \,\|\, p_{\lambda^\circ}(x)) - D(p_0(x) \,\|\, p_{\hat\lambda}(x)) \le \frac{4 C_1}{\sqrt{M}} E_{\tilde{X}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon \right] + C_2 \sqrt{\frac{2 \log(1/\eta)}{M}} + U^*(\hat\lambda) - U^*(\lambda^\circ)$$
We then have the following result that guarantees the universal consistency of RME.
Corollary 3. 
Universal consistency: If $\int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon$ is bounded and $U^*(\hat\lambda) \le U^*(\lambda^\circ)$, then as $M \to \infty$, $p_{\lambda^\circ}(x)$ will converge to $p_{\hat\lambda}(x)$ in terms of their Kullback–Leibler divergence with rate $O(1/\sqrt{M})$, for any true distribution, $p_0(x)$, and prior distribution.
If we choose $U^*(\lambda)$ to be a legal prior, then we also have a further result.
Corollary 4. 
Assume $\sup_{\lambda \in \Omega} \|\lambda\|_1 < \infty$, $\sup_{f \in F, x \in \mathcal{X}} \|f(x)\|_\infty < \infty$ and that $U^*(\lambda)$ is a legal prior. Then, there exist $0 < \zeta < \alpha < \infty$, such that, with a probability of at least $1 - \eta$,

$$D(p_{\hat\lambda}(x) \,\|\, p_{\lambda^\circ}(x)) = D(p_0(x) \,\|\, p_{\lambda^\circ}(x)) - D(p_0(x) \,\|\, p_{\hat\lambda}(x)) \le \frac{4 C_1}{\sqrt{M}} E_{\tilde{X}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon \right] + C_2 \sqrt{\frac{2 \log(1/\eta)}{M}} - U^*(\lambda^\circ)$$

Moreover, as $M \to \infty$, both $U^*(\lambda^\circ) \to 0$ and $p_{\lambda^\circ}(x)$ will converge to $p_{\hat\lambda}(x)$ in terms of their Kullback–Leibler divergence with rate $O(1/\sqrt{M})$, for any true distribution, $p_0(x)$, and prior distribution.
The legal prior guarantees RME's consistency without introducing any regularization parameter, as commonly done in machine learning [9,19]. Since $U^*(\lambda^\circ) \ge 0$ for any legal prior, this result shows that RME always obtains a lower generalization bound than ME, even without assuming the truth of the prior.
Using the result of Theorem 5 and the McDiarmid concentration inequality, we are able to derive a generalization bound on the difference between the best expected log-likelihood and the log of the MAP probability. This holds for any regularization cost function, as long as $U^*(\hat\lambda) \le U^*(\lambda^\circ)$, without requiring that $U^*(\lambda)$ be convex. We note that Dudik et al. [11] gave a similar result, but only for the special case of using the $l_1$ norm as a penalty in the dual formulation; see Equation (17).

For a broad family of regularization functions, i.e., log priors in Equation (17), such that the Hessian matrices with respect to $\lambda$ (the prior information matrices) are positive definite, we can improve the convergence rate from $O(1/\sqrt{M})$ to $O(1/M)$. The techniques needed to prove the $O(1/M)$ convergence rate were first proposed by Zhang [19,30,31] to show the consistency and generalization bounds of various classification methods based on convex risk optimization with the $l_2$-penalty.
Define the convex function:
$$L_\lambda(x) = -\log p_\lambda(x) = -\log \frac{\exp(\langle \lambda, f(x) \rangle)}{\int_{x'} \exp(\langle \lambda, f(x') \rangle) \, \mu(dx')} = \log \int_{x'} \exp\left( \langle \lambda, f(x') - f(x) \rangle \right) \mu(dx')$$
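For later use in Lemma 1 and the proof of Theorem 6, note that differentiating under the integral gives the gradient of this loss as the gap between the model feature expectations and the observed features; for a discrete domain, this immediately yields the bound invoked below:

$$\nabla_\lambda L_\lambda(x) = E_{p_\lambda(x')}\left[ f(x') \right] - f(x), \qquad \|\nabla_\lambda L_\lambda(x)\| \le \sup_{x', x''} \|f(x') - f(x'')\| \cdot \left( 1 - p_\lambda(x) \right) = C \left( 1 - \exp(-L_\lambda(x)) \right)$$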
Consider training samples $\tilde{X}_{M+1} = (x_1, \ldots, x_{M+1})$. Let $\tilde\lambda$ be the solution of Equation (24) with training data $\tilde{X}_{M+1}$:

$$\min_{\lambda \in \Omega} \; \frac{1}{M} \sum_{j=1}^{M+1} L_\lambda(x_j) + U^*(\lambda)$$

Let $\tilde\lambda^k$ be the solution of Equation (24) with training sample $x_k$ removed from the set, $\tilde{X}_{M+1}$. We have the following lemma that extends the quadratic case considered in [19,31]. We prove the result in its full generality, though our proof is essentially a variant of the stability results by Zhang [19,30,31].
Lemma 1. 
Assume the Hessian matrix of $U^*(\lambda)$ is positive definite with smallest eigenvalue $\kappa$. Then, $\|\tilde\lambda - \tilde\lambda^k\|_2 \le \frac{2}{\kappa M} \|\nabla L_{\tilde\lambda}(x_k)\|$.
Finally, we can obtain the following leave-one-out error bound, which has a convergence rate of $O(1/M)$.
Theorem 6. 
Let $C = \sup_{x, x'} \|f(x) - f(x')\|$. When the legal prior, $U^*(\lambda)$, is chosen such that its Hessian matrix with respect to $\lambda$ is positive definite with smallest eigenvalue $\kappa$, the expected generalization error can then be bounded as:

$$E_{\tilde{X}} D(p_{\hat\lambda}(x) \,\|\, p_{\lambda^\circ}(x)) = E_{\tilde{X}} \left[ D(p_0(x) \,\|\, p_{\lambda^\circ}(x)) - D(p_0(x) \,\|\, p_{\hat\lambda}(x)) \right] \le \frac{2 C^2}{\kappa M} \left( 1 - \exp(-E_X L_{\hat\lambda}(X)) \right)$$

3. Maximum Entropy Density Estimation: Incomplete Data

Here, we consider a variable to be "latent" or "hidden" if it is never observed, but causally affects the observations [6]. In practice, many of the natural patterns we wish to classify are the result of causal processes that have a hidden hierarchical structure, yielding data that does not report the value of latent variables [6]. Obtaining fully labeled data is tedious or impossible in many realistic cases. This motivates us to propose an unsupervised statistical learning technique, the latent maximum entropy (LME) principle, which is still formulated in terms of maximizing entropy, except that we must now change the problem formulation to respect hidden variables.

Following the terminology of the expectation-maximization (EM) algorithm [6], let $Y \in \mathcal{Y}$ be the observed incomplete data, $Z \in \mathcal{Z}$ be the missing data and $X \in \mathcal{X}$ be the random variable denoting the complete data. That is, $X = (Y, Z)$. For example, $Y$ might be observed natural language in the form of text and $Z$ might be its missing syntactic and semantic information. If we let $p(x)$ and $p(y)$ denote the densities of $X$ and $Y$, respectively, and let $p(z|y)$ denote the conditional density of $Z$ given $Y$, then $p(y) = \int_{z \in \mathcal{Z}} p(x) \, \mu(dz)$, and $p(x) = p(y) p(z|y)$. The LME principle can be stated as follows.
Given features $f_1(x), \ldots, f_N(x)$, specifying the properties that we would like to match in the data, select a joint probability model, $p(x)$, from the space of all probability distributions, $\mathcal{P}(x)$, over $\mathcal{X}$, to maximize the entropy:

$$\max_{p(x) \in \mathcal{P}} \; -\int_{x \in \mathcal{X}} p(x) \log p(x) \, \mu(dx)$$

$$\text{s.t.} \quad \int_{x \in \mathcal{X}} f_i(x) p(x) \, \mu(dx) = \sum_{y \in \tilde{Y}} \tilde{p}(y) \int_{z \in \mathcal{Z}} f_i(x) p(z|y) \, \mu(dz), \quad i = 1, \ldots, N$$
where $x = (y, z)$, $\tilde{p}(y)$ is the empirical distribution of the observed data, $\tilde{Y} = (y_1, \ldots, y_M)$ denotes the set of observed $Y$ values and $p(z|y)$ is the conditional distribution of latent variables given the observed data. Intuitively, the constraints specify that we require the expectations of $f_i(X)$ in the joint model to match their empirical expectations on the incomplete data, $Y$, taking into account the structure of the implied dependence of the unobserved component, $Z$, on $Y$.

Note that the conditional distribution, $p(z|y)$, implicitly encodes the latent structure and is a nonlinear mapping of $p(x)$. That is, $p(z|y) = \frac{p(y, z)}{\int_{z' \in \mathcal{Z}} p(y, z') \, \mu(dz')} = \frac{p(x)}{\int_{z' \in \mathcal{Z}} p(x') \, \mu(dz')}$, where $x = (y, z)$ and $x' = (y, z')$ by definition. Clearly, $p(z|y)$ is a nonlinear function of $p(x)$ because of the division.
Unfortunately, there is no simple optimal solution for $p(x)$ in Equation (26) and Equation (27). However, a good approximation can be obtained by restricting the model to $p_\lambda(x) = \Phi_\lambda^{-1} \exp\left( \sum_{i=1}^N \lambda_i f_i(x) \right)$, where $\Phi_\lambda = \int_{x \in \mathcal{X}} \exp\left( \sum_{i=1}^N \lambda_i f_i(x) \right) \mu(dx)$ is the normalization constant. This restriction provides a free parameter, $\lambda_i$, for each feature function, $f_i(x)$. By adopting such a "log-linear" restriction, it turns out that we can formulate a practical algorithm for approximately satisfying the LME principle.
In [5], we exploited the following connection between LME and maximum likelihood estimation (MLE) to derive a practical training algorithm.
Theorem 7. 
[5] Under the log-linear assumption, maximizing the likelihood of log-linear models on incomplete data is equivalent to satisfying the feasibility constraints of the LME principle. That is, the only distinction between MLE and LME in log-linear models is that, among local maxima (feasible solutions), LME selects the model with the maximum entropy, whereas MLE selects the model with the maximum likelihood.
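To illustrate this connection, the sketch below alternates between computing the latent-side expectations in Equation (27) under the current model (an E-step) and fitting the joint log-linear model to those expectations (an M-step), until a feasible point is reached. The tiny domain, random features and observed sample are invented for the example; this illustrates the fixed-point characterization only, and is not the exact training algorithm of [5].

```python
import numpy as np

# Hedged EM-style sketch for reaching a feasible point of the LME constraints
# (Equation (27)) in a log-linear model on a toy discrete joint domain Y x Z.
Y, Z = 3, 2
F = np.random.RandomState(0).rand(4, Y, Z)        # f_i(y, z) for N = 4 features
y_obs = np.array([0, 0, 1, 2, 2, 2])              # observed incomplete data
p_emp = np.bincount(y_obs, minlength=Y) / len(y_obs)

lam = np.zeros(F.shape[0])
for _ in range(100):                              # outer fixed-point (EM) loop
    logits = np.tensordot(lam, F, axes=1)         # <lam, f(y, z)> on the joint grid
    p_joint = np.exp(logits - logits.max()); p_joint /= p_joint.sum()
    p_z_given_y = p_joint / p_joint.sum(axis=1, keepdims=True)
    # E-step: right-hand side of Equation (27) under the current conditional
    target = np.einsum('y,iyz,yz->i', p_emp, F, p_z_given_y)
    # M-step: match the joint model expectations to the fixed target
    for _ in range(50):
        logits = np.tensordot(lam, F, axes=1)
        p_joint = np.exp(logits - logits.max()); p_joint /= p_joint.sum()
        lam -= 0.5 * (np.einsum('iyz,yz->i', F, p_joint) - target)

logits = np.tensordot(lam, F, axes=1)
p_joint = np.exp(logits - logits.max()); p_joint /= p_joint.sum()
print("feasibility gap:", np.abs(np.einsum('iyz,yz->i', F, p_joint) - target).max())
```

Each feasible fixed point is a stationary point of the incomplete-data likelihood; per Theorem 7, LME then selects the feasible model with maximum joint entropy, whereas MLE selects the one with maximum likelihood.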
Define $L_\lambda(y) = -\sum_{y \in \tilde{Y}} \tilde{p}(y) \log p_\lambda(y)$ and $H(\lambda, \lambda) = -\sum_{y \in \tilde{Y}} \tilde{p}(y) \int_{z \in \mathcal{Z}} p_\lambda(z|y) \log p_\lambda(z|y) \, \mu(dz)$. The latter is the conditional entropy of the hidden variables over the observed sample data $\tilde{Y}$, which measures the uncertainty of the hidden variables. Then, in the case where $\lambda$ is a feasible log-linear solution according to Equation (27), we have the following relationship between the likelihood over observed data and the entropy of the joint model.
Corollary 5. 
[5] If $\lambda$ is in the set of feasible solutions, then:

$$L_\lambda(y) = H(p_\lambda(x)) - H(\lambda, \lambda)$$
We will use the following notation and terminology throughout the analysis below. Denote the manifold of the nonlinear constraint set in Equation (27) as $\mathcal{C}$. We then define $p_{\hat\lambda}(y)$, with $\hat\lambda = \arg\max_{\lambda \in \Omega} E_{p_0(y)} \log p_\lambda(y)$, as the nearest point in terms of $D(p_0(y) \,\|\, p_\lambda(y))$ from $p_0(y)$ to the marginalized exponential family over $z$, using $p_\lambda(y) = \int_{z \in \mathcal{Z}} p_\lambda(y, z) \, \mu(dz)$, where $p_\lambda(x) \in \mathcal{E}(x)$; see Figure 2.
Figure 2. The operator, $T$, denotes the marginalization of $p(x)$ over $z$ and maps the entire space of all probability distributions, $\mathcal{P}(x)$ over $\mathcal{X}$, into the space of all probability distributions, $\mathcal{P}(y)$ over $\mathcal{Y}$. Here, $p_{\hat\lambda}(y)$ is the information projection of $p_0(y)$ to the marginalized exponential family, $\mathcal{E}$; $p_{\lambda^*}(y)$ is the information projection of $\tilde{p}(y)$ to the marginalized exponential family, $\mathcal{E}$; and $p_{\lambda^\diamond}(y)$ is the distribution that, in the joint model space, has the highest entropy among the intersection points of the exponential family, $\mathcal{E}$, and the nonlinear constraint set, $\mathcal{C}$.
In [29,32], we formulated the regularized latent maximum entropy (RLME) principle as the following:

$$\max_{p(x), a} \; -\int_{x \in \mathcal{X}} p(x) \log p(x) \, \mu(dx) - U(a)$$

subject to:

$$\int_{x \in \mathcal{X}} p(x) f_i(x) \, \mu(dx) = \sum_y \tilde{p}(y) \int_z p(z|y) f_i(y, z) \, \mu(dz) + a_i; \quad i = 1, \ldots, N$$

Again, $a = (a_1, \ldots, a_N)$, where $a_i$ is the error for each constraint, and $U: \mathbb{R}^N \to \mathbb{R}$ is a convex function with its minimum at zero.

The standard maximum a posteriori (MAP) estimate minimizes the negative penalized log-likelihood $R(\lambda) = -\sum_y \tilde{p}(y) \log p_\lambda(y) + U^*(\lambda)$.
Our key result in [29,32] is that locally minimizing $R(\lambda)$ is equivalent to satisfying the feasibility constraints in Equation (30) of the RLME principle.
Theorem 8. 
[29,32] Under the log-linear assumption, locally maximizing the posterior probability of log-linear models on incomplete data is equivalent to satisfying the feasibility constraints of the RLME principle. That is, the only distinction between MAP and RLME in log-linear models is that, among local maxima (feasible solutions), RLME selects the model with the maximum regularized entropy, whereas MAP selects the model with the maximum posterior probability.
Corollary 6. 
[29,32] If $\lambda$ is in the set of feasible solutions of Equation (30), then $R(\lambda) = H(p_\lambda) - U(a) - H(\lambda, \lambda)$.

3.1. Consistency and Generalization Bounds for Estimation Error

To measure the quality of the maximum likelihood and maximum entropy estimates, we do not consider the divergence of the models in the original joint space, $\mathcal{P}(x)$. Instead, we consider the marginalized models in the observed data space, $\mathcal{P}(y)$. However, to measure the divergence between models in the observed data space, we have to take the difference of $D(p_0(y) \,\|\, p_{\lambda^*}(y))$ and $D(p_0(y) \,\|\, p_{\hat\lambda}(y))$, even though, technically, the Pythagorean property no longer holds in this case. Nevertheless, this still gives a useful measure of the approximation quality.

3.1.1. Maximum Likelihood Estimate

We first establish consistency and provide generalization bounds for the maximum likelihood density estimate, $p_{\lambda^*}(y)$. Note that if we attempt to use a technique similar to the complete data case here, we will obtain a bound that is governed by the covering number of the log of the marginal feature functions, $F(y) = \int_{z \in \mathcal{Z}} \exp(\langle \lambda, f(y, z) \rangle) \, \mu(dz)$. Bounding the covering number of the log-int-exp class, $\log F(y)$, is more difficult than bounding it for $F(y)$ directly. To cope with this issue, we use the refined version of the Rademacher comparison inequality proposed in [24] to eliminate the log function. This is a slightly different approach than that taken by Rakhlin et al. [16], who, instead, use the contraction technique of [15,20,33] to derive bounds for mixture model density estimation. Here, we pursue a streamlined analysis that avoids working with the likelihood ratio (also, see [34]), hence avoiding the second application of contraction, which results in tighter constants.
Theorem 9. 
Assume for all $\lambda \in \Omega$ and for all $y \in \mathcal{Y}$, we have $0 < a \le F(y) \le b$. Then, there exist $0 < \zeta < \alpha < \infty$ and positive constants, $C_3, C_4$, such that, with a probability of at least $1 - \eta$, for any dataset of size $M$ drawn from $p_0(y)$,

$$D(p_0(y) \,\|\, p_{\lambda^*}(y)) - D(p_0(y) \,\|\, p_{\hat\lambda}(y)) \le \frac{4 C_3}{\sqrt{M}} E_{\tilde{Y}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(y), \epsilon, d_y)} \, d\epsilon \right] + C_4 \sqrt{\frac{2 \log(1/\eta)}{M}}$$

where $N(F(y), \epsilon, d_y)$ is the random covering number of the marginal feature functions, $F(y) = \int_{z \in \mathcal{Z}} \exp(\langle \lambda, f(y, z) \rangle) \, \mu(dz)$, at scale $\epsilon$ with empirical Euclidean distance $d_y$ on sample data $\tilde{Y}$.
Similarly, we can eliminate the assumption of boundedness on the parameters and the marginal feature functions, $F(y)$, by using a result adapted from [24].
Theorem 10. 
Assume that there exists a positive number, $K(F)$, such that for all $\tau > 0$:

$$\log E_{p_0(y)} \left[ p_{\lambda_\tau}(y)^{2\tau} + \frac{1}{p_{\lambda_\tau}(y)^{2\tau}} \right] \le (\tau K(F))^2$$

where $\lambda_\tau$ are the parameters achieving the maximum of $p_\lambda(y)^{2\tau} + \frac{1}{p_\lambda(y)^{2\tau}}$ subject to $E_{p_{\lambda_\tau}(x)}(f(x)) = E_{p_{\lambda_\tau}(z|y)}(f(y, z))$. Then, for all $\lambda \in \Omega$, we have, with a probability of at least $1 - \eta$,

$$E_{p_0(y)} \log p_\lambda(y) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(y_j) \le E_{\tilde{Y}} \sup_{\lambda \in \Omega} \left( E_{p_0(y)} \log F(y) - E_{\tilde{p}(y)} \log F(y) \right) + K(F) \sqrt{\frac{2 \log(1/\eta)}{M}}$$
Using the above result, Theorem 9 can be proven with $C_4$ replaced by $K(F)$. Again, since the value, $K(F)$, is hard to determine in practice, we will state our results below in terms of a bound on the feature functions; but, as before, the reader should bear in mind that this bound can be replaced by $K(F)$.
From the results above, we can establish the following consistency property.
Corollary 7. 
Universal consistency: If $\int_\zeta^\alpha \sqrt{\log N(F(y), \epsilon, d_y)} \, d\epsilon$ is bounded, then $p_{\lambda^*}(y)$ will converge to $p_{\hat\lambda}(y)$ (in terms of the difference of the Kullback–Leibler divergence to the true distribution, $p_0(y)$) with rate $O(1/\sqrt{M})$, regardless of the form of the true distribution, $p_0(y)$.
Similar to the complete data case, using the result of Theorem 9 and the McDiarmid concentration inequality, we are also able to derive the generalization bound for the difference of the best expected log-likelihood and the maximum empirical log-likelihood.

3.1.2. Latent Maximum Entropy Estimate

Let $p_{\lambda^\diamond}(y)$ denote the maximum entropy estimate of Equation (26) over the exponential family, $\mathcal{E}$. We use techniques similar to those of the complete data regularized maximum entropy case (Section 2.1.2) to prove consistency and generalization bounds for the latent maximum entropy density estimate, $p_{\lambda^\diamond}(y)$.
Theorem 11. 
(LME principle) Assume for all $\lambda \in \Omega$ and for all $y \in \mathcal{Y}$, we have $0 < a \le F(y) \le b$. Then, there exist $0 < \zeta < \alpha < \infty$, such that, with a probability of at least $1 - \eta$,

$$D(p_0(y) \,\|\, p_{\lambda^\diamond}(y)) - D(p_0(y) \,\|\, p_{\hat\lambda}(y)) \le \frac{4 C_3}{\sqrt{M}} E_{\tilde{Y}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(y), \epsilon, d_y)} \, d\epsilon \right] + C_4 \sqrt{\frac{2 \log(1/\eta)}{M}} + E_{\tilde{p}(y)} \log \frac{p_{\hat\lambda}(y)}{p_{\lambda^\diamond}(y)}$$
Using this result, we can then easily establish the following consistency property.
Corollary 8. 
Universal consistency: If $\int_\zeta^\alpha \sqrt{\log N(F(y), \epsilon, d_y)} \, d\epsilon$ is bounded and also $E_{\tilde{p}(y)} \log p_{\hat\lambda}(y) \le E_{\tilde{p}(y)} \log p_{\lambda^\diamond}(y)$, then $p_{\lambda^\diamond}(y)$ will converge to $p_{\hat\lambda}(y)$ (in terms of the difference of the Kullback–Leibler divergence to the true distribution, $p_0(y)$) with rate $O(1/\sqrt{M})$, for any true distribution, $p_0(y)$.
Corollary 8 gives a sufficient condition, i.e., $E_{\tilde{p}(y)} \log p_{\hat\lambda}(y) \le E_{\tilde{p}(y)} \log p_{\lambda^\diamond}(y)$, that leads to the universal consistency of latent maximum entropy estimation. This perhaps partially explains our observations of experimental results on synthetic data conducted in [5].

Note that in the proof of Theorem 11 and Corollary 8, it is not necessary to restrict $p_{\lambda^\diamond}$ to be the model that has the global maximum joint entropy over all feasible log-linear solutions. It turns out that the conclusion still holds for all feasible log-linear models, $p_\lambda(y)$, that have greater empirical log-likelihood, $E_{\tilde{p}(y)} \log p_\lambda(y)$, than the empirical log-likelihood, $E_{\tilde{p}(y)} \log p_{\hat\lambda}(y)$, of the optimal expected log-likelihood estimate, $p_{\hat\lambda}(y)$. That is, as the sample size grows, any of these feasible log-linear models will converge to $p_{\hat\lambda}(y)$ (in terms of the difference of the Kullback–Leibler divergence to the true distribution, $p_0(y)$) with rate $O(1/\sqrt{M})$.

3.1.3. Maximum a Posteriori Estimate

In a similar manner, it is straightforward to obtain the following generalization bound for the MAP estimate, $p_{\lambda^\circ}(y)$.
Theorem 12. 
(MAP principle) Assume for all $\lambda \in \Omega$ and for all $y \in \mathcal{Y}$, $0 < a \le F(y) \le b$. Then, with a probability of at least $1 - \eta$,

$$D(p_0(y) \,\|\, p_{\lambda^\circ}(y)) - D(p_0(y) \,\|\, p_{\hat\lambda}(y)) \le \frac{4 C_3}{\sqrt{M}} E_{\tilde{Y}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(y), \epsilon, d_y)} \, d\epsilon \right] + C_4 \sqrt{\frac{2 \log(1/\eta)}{M}} + U^*(\hat\lambda) - U^*(\lambda^\circ)$$
By the above theorem, one can easily obtain the following consistency result.
Corollary 9. 
Universal consistency: If $\int_\zeta^\alpha \sqrt{\log N(F(y), \epsilon, d_y)} \, d\epsilon$ is bounded and $U^*(\hat\lambda) \le U^*(\lambda^\circ)$, then $p_{\lambda^\circ}(y)$ will converge to $p_{\hat\lambda}(y)$ in terms of the difference of the Kullback–Leibler divergence to the true distribution, $p_0(y)$, with rate $O(1/\sqrt{M})$, without assuming the form of the true distribution, $p_0(y)$, or the true prior distribution.

3.1.4. Regularized Latent Maximum Entropy Estimate

We can also, in a similar manner, establish the following generalization bound for the RLME estimate, $p_{\lambda^\diamond}(y)$.
Theorem 13. 
(RLME principle) Assume for all $\lambda \in \Omega$ and for all $y \in \mathcal{Y}$, $0 < a \le F(y) \le b$. Then, with a probability of at least $1 - \eta$,

$$D(p_0(y) \,\|\, p_{\lambda^\diamond}(y)) - D(p_0(y) \,\|\, p_{\hat\lambda}(y)) \le \frac{4 C_3}{\sqrt{M}} E_{\tilde{Y}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(y), \epsilon, d_y)} \, d\epsilon \right] + C_4 \sqrt{\frac{2 \log(1/\eta)}{M}} + E_{\tilde{p}(y)} \log p_{\hat\lambda}(y) - E_{\tilde{p}(y)} \log p_{\lambda^\diamond}(y) + U^*(\hat\lambda) - U^*(\lambda^\diamond)$$
By the above theorem, we can easily obtain the following consistency result.
Corollary 10. 
Universal consistency: If $\int_\zeta^\alpha \sqrt{\log N(F(y), \epsilon, d_y)} \, d\epsilon$ is bounded and $E_{\tilde{p}(y)} \log p_{\hat\lambda}(y) + U^*(\hat\lambda) \le E_{\tilde{p}(y)} \log p_{\lambda^\diamond}(y) + U^*(\lambda^\diamond)$, then $p_{\lambda^\diamond}(y)$ will converge to $p_{\hat\lambda}(y)$ in terms of the difference of the Kullback–Leibler divergence to the true distribution, $p_0(y)$, with rate $O(1/\sqrt{M})$, without assuming the form of the true distribution, $p_0(y)$, or the true prior distribution.

4. Conclusions

We have investigated the statistical properties of using the maximum entropy principle for density estimation, in both the complete and incomplete data cases. For complete data, maximum entropy is equivalent to maximum likelihood estimation in a Markov random field. Here, we derived bounds on the generalization error based on the complexity of linear combinations of feature functions, and used this to establish a form of universal consistency. We then provided a similar analysis for regularized maximum entropy estimation, which, interestingly, yields a better generalization bound (and maintains consistency) for any legal prior. Moreover, if the information matrix of the prior is positive definite, we can further show that the convergence rate can be improved to $O(1/M)$ instead of $O(1/\sqrt{M})$. For incomplete data, maximum entropy is no longer equivalent to maximum likelihood estimation, and the analysis becomes more difficult. Nevertheless, we established bounds on the generalization error of maximum likelihood in terms of the complexity of the marginalized feature functions, again achieving a form of universal consistency. With additional assumptions, we were able to extend this analysis to latent maximum entropy estimation and to prove its universal consistency, as well. Analogous conclusions can be drawn for the regularized settings. Finally, we note that an alternative analysis can be based on replacing the Kullback–Leibler divergence with the more general Bregman distance [35,36]. The analysis here can be easily extended to this more general setting.
In our future work, we are planning to study the trade-off between approximation error and estimation error to select the best set of feature functions.

Acknowledgments

We thank Dale Schuurmans for his assistance and Tong Zhang for his comments. The research was supported by the Alberta Innovates Centre for Machine Learning, University of Alberta, Canada.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix

In the appendix, we give proofs of the theorems, lemmas and corollaries.
Proof of Theorem 1.
Proof. The techniques we use are quite standard [15,16,20,37] and have appeared in many papers. To be concise, following lecture notes 14–15 in [37], the first key technique we use is the method of bounded differences [38]. Define:

$$h(x_1, \ldots, x_M) = \sup_{p_\lambda(x) \in \mathcal{E}} \left| E_{p_0(x)} \log p_\lambda(x) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(x_j) \right| = \sup_{\lambda \in \Omega, f \in F} \left| E_{p_0(x)} \langle \lambda, f(x) \rangle - E_{\tilde{p}(x)} \langle \lambda, f(x) \rangle \right|$$

where the second equality holds because the log partition function, $\log \Phi_\lambda$, cancels in the difference.
Then:

$$\left| h(x_1, \ldots, x_M) - h(x_1, \ldots, x_{k-1}, x_k', x_{k+1}, \ldots, x_M) \right| \le \sup_{\lambda \in \Omega, f \in F} \frac{1}{M} \left| \langle \lambda, f(x_k) - f(x_k') \rangle \right| \le \frac{2}{M} \sup_{\lambda \in \Omega} \|\lambda\|_1 \sup_{x \in \mathcal{X}} \|f(x)\|_\infty = \frac{C_2}{M}$$
By the McDiarmid concentration inequality [38], we have:

$$P\left( h(x_1, \ldots, x_M) - E_{\tilde{X}} h(x_1, \ldots, x_M) \ge \delta \right) \le \exp\left( \frac{-2 \delta^2}{\sum_{j=1}^M (C_2/M)^2} \right) = \exp\left( \frac{-2 M \delta^2}{C_2^2} \right)$$

Let $\eta = \exp\left( \frac{-2 M \delta^2}{C_2^2} \right)$, i.e., $\delta = C_2 \sqrt{\frac{\log(1/\eta)}{2M}}$. Then:

$$P\left( h(x_1, \ldots, x_M) - E_{\tilde{X}} h(x_1, \ldots, x_M) \ge C_2 \sqrt{\frac{\log(1/\eta)}{2M}} \right) \le \eta$$
Therefore, with a probability of at least $1 - \eta$,

$$\sup_{p_\lambda(x) \in \mathcal{E}(x)} \left| E_{p_0(x)} \log p_\lambda(x) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(x_j) \right| \le E_{\tilde{X}} \sup_{\lambda \in \Omega, f(x) \in F(x)} \left| E_{p_0(x)} \langle \lambda, f(x) \rangle - E_{\tilde{p}(x)} \langle \lambda, f(x) \rangle \right| + C_2 \sqrt{\frac{\log(1/\eta)}{2M}}$$
Next, we use the symmetrization technique of [33,39], which states that if:

$$Z(\tilde{X}) = \sup_{g \in G} \left| E g(x) - \frac{1}{M} \sum_{j=1}^M g(x_j) \right| \quad \text{and} \quad R(\tilde{X}) = \sup_{g \in G} \left| \frac{1}{M} \sum_{j=1}^M \omega_j g(x_j) \right|$$

then:

$$E_{\tilde{X}} Z(\tilde{X}) \le 2 E_{\tilde{X}, \omega} R(\tilde{X})$$

where $\omega = (\omega_1, \ldots, \omega_M)$ is a Rademacher sequence. We then have, with a probability of at least $1 - \eta$,

$$\sup_{p_\lambda(x) \in \mathcal{E}(x)} \left| E_{p_0(x)} \log p_\lambda(x) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(x_j) \right| \le 2 E_{\tilde{X}, \omega} \sup_{\lambda \in \Omega, f \in F} \left| \frac{1}{M} \sum_{j=1}^M \omega_j \langle \lambda, f(x_j) \rangle \right| + C_2 \sqrt{\frac{\log(1/\eta)}{2M}}$$
A classical result of Dudley establishes that Rademacher averages over linear combinations in $F(x)$ are bounded by Dudley's entropy integral [16,20,22,37],

$$E_\omega \sup_{\lambda \in \Omega, f(x) \in F(x)} \left| \frac{1}{M} \sum_{j=1}^M \omega_j \langle \lambda, f(x_j) \rangle \right| \le \frac{C_1}{\sqrt{M}} \int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon$$

where $0 < \zeta < \alpha < \infty$; provided, as observed in [20,40], that $\sup_{\lambda \in \Omega} \|\lambda\|$ and $\sup_{f \in F, x \in \mathcal{X}} \|f(x)\|$ are bounded. One can then show that, with a probability of at least $1 - \eta$,

$$\sup_{p_\lambda(x) \in \mathcal{E}(x)} \left| E_{p_0(x)} \log p_\lambda(x) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(x_j) \right| \le \frac{2 C_1}{\sqrt{M}} E_{\tilde{X}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon \right] + C_2 \sqrt{\frac{\log(1/\eta)}{2M}}$$
Therefore:

$$\begin{aligned} D(p_{\hat\lambda}(x) \,\|\, p_{\lambda^*}(x)) &= D(p_0(x) \,\|\, p_{\lambda^*}(x)) - D(p_0(x) \,\|\, p_{\hat\lambda}(x)) \quad \text{by Equation (8)} \\ &= \left( E_{p_0(x)} \log p_{\hat\lambda}(x) - E_{\tilde{p}(x)} \log p_{\hat\lambda}(x) \right) + \left( E_{\tilde{p}(x)} \log p_{\lambda^*}(x) - E_{p_0(x)} \log p_{\lambda^*}(x) \right) + \left( E_{\tilde{p}(x)} \log p_{\hat\lambda}(x) - E_{\tilde{p}(x)} \log p_{\lambda^*}(x) \right) \\ &\le 2 \sup_{p_\lambda(x) \in \mathcal{E}(x)} \left| E_{p_0(x)} \log p_\lambda(x) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(x_j) \right| + \frac{1}{M} \sum_{j=1}^M \log \frac{p_{\hat\lambda}(x_j)}{p_{\lambda^*}(x_j)} \\ &\le 2 \sup_{p_\lambda(x) \in \mathcal{E}(x)} \left| E_{p_0(x)} \log p_\lambda(x) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(x_j) \right| \end{aligned}$$

where the second inequality comes from the fact that $\frac{1}{M} \sum_{j=1}^M \log \frac{p_{\hat\lambda}(x_j)}{p_{\lambda^*}(x_j)} \le 0$, since $p_{\lambda^*}(x)$ has maximum likelihood in the exponential family, $\mathcal{E}(x)$. Therefore, with a probability of at least $1 - \eta$,

$$D(p_{\hat\lambda}(x) \,\|\, p_{\lambda^*}(x)) = D(p_0(x) \,\|\, p_{\lambda^*}(x)) - D(p_0(x) \,\|\, p_{\hat\lambda}(x)) \le \frac{4 C_1}{\sqrt{M}} E_{\tilde{X}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon \right] + C_2 \sqrt{\frac{2 \log(1/\eta)}{M}}$$ ☐
Proof of Theorem 2.
Proof. Choose the function to be $\log p_\lambda(x)$, $\lambda \in \Omega$; then $\cosh(2\tau \log p_\lambda(x)) = \frac{1}{2} \left( p_\lambda(x)^{2\tau} + \frac{1}{p_\lambda(x)^{2\tau}} \right)$. By Theorem 3 in [24], we have that if there exists a positive number, $K(F)$, such that for all $\tau > 0$,

$$\log E_{p_0(x)} \sup_{\lambda \in \Omega} \left[ p_\lambda(x)^{2\tau} + \frac{1}{p_\lambda(x)^{2\tau}} \right] \le (\tau K(F))^2$$

then, for all $\lambda \in \Omega$, with a probability of at least $1 - \eta$, (14) will hold.

Next, take the derivative of $p_\lambda(x)^{2\tau} + \frac{1}{p_\lambda(x)^{2\tau}}$ with respect to $\lambda$ and set this to zero. After some routine calculation, we obtain:

$$\left( p_\lambda(x)^{2\tau - 1} - \frac{1}{p_\lambda(x)^{2\tau + 1}} \right) p_\lambda(x) \left( E_{p_\lambda(x)}(f_i(x)) - f_i(x) \right) = 0; \quad i = 1, \ldots, N$$

Thus, $\lambda_\tau = \arg\sup_{\lambda \in \Omega} \left[ p_\lambda(x)^{2\tau} + \frac{1}{p_\lambda(x)^{2\tau}} \right]$ are those $\lambda$, such that $E_{p_\lambda(x)}(f(x)) = f(x)$, which can be uniquely determined by the maximum entropy approach for each fixed $x$. ☐
Proof of Theorem 3.
Proof. Define $u_i = f_i(x)$, $i = 1, \ldots, N$. We consider real-valued linear function classes of the following form:

$$L(\lambda, u) = \lambda \cdot u = \sum_{i=1}^N \lambda_i u_i$$

Zhang showed that $\log_2 N(L(\lambda, u), \epsilon, d_u) \le \frac{a^2 b^2}{\epsilon^2} \log_2(2N+1)$ (Theorem 3 in [21]) and $\log_2 N(L(\lambda, u), \epsilon, d_u) \le \frac{a^2 b^2}{\epsilon^2} \log_2(2M+1)$ (Corollary 3 in [21]). Since $N(L(\lambda, u), \epsilon, d_u) = N(F(x), \epsilon, d_x)$, we have the conclusion. ☐
Proof of Theorem 4.
Proof.

$$|L_0(\hat\lambda) - \tilde{L}(\lambda^*)| = \left| E_{p_0(x)} \log p_{\hat\lambda}(x) - \frac{1}{M} \sum_{j=1}^M \log p_{\lambda^*}(x_j) \right| \le \left| E_{p_0(x)} \log p_{\hat\lambda}(x) - E_{p_0(x)} \log p_{\lambda^*}(x) \right| + \left| E_{p_0(x)} \log p_{\lambda^*}(x) - \frac{1}{M} \sum_{j=1}^M \log p_{\lambda^*}(x_j) \right|$$

Thus, combining the inequality of Equation (9) and the uniform bound of Equation (12), we have, with probability $1 - \eta$,

$$|L_0(\hat\lambda) - \tilde{L}(\lambda^*)| \le \frac{6 C_1}{\sqrt{M}} E_{\tilde{X}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon \right] + 3 C_2 \sqrt{\frac{\log(1/\eta)}{2M}}$$ ☐
Proof of Theorem 5.
Proof. The proof is similar to the ME case. Consider the chain of inequalities:

$$\begin{aligned} D(p_{\hat\lambda}(x) \,\|\, p_{\lambda^\circ}(x)) &= D(p_0(x) \,\|\, p_{\lambda^\circ}(x)) - D(p_0(x) \,\|\, p_{\hat\lambda}(x)) \quad \text{by Equation (7)} \\ &= \left( E_{p_0(x)} \log p_{\hat\lambda}(x) - E_{\tilde{p}(x)} \log p_{\hat\lambda}(x) \right) + \left( E_{\tilde{p}(x)} \log p_{\lambda^\circ}(x) - E_{p_0(x)} \log p_{\lambda^\circ}(x) \right) + \left( E_{\tilde{p}(x)} \log p_{\hat\lambda}(x) - E_{\tilde{p}(x)} \log p_{\lambda^\circ}(x) \right) \\ &\le 2 \sup_{p_\lambda(x) \in \mathcal{E}} \left| E_{p_0(x)} \log p_\lambda(x) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(x_j) \right| + \left( \left[ E_{\tilde{p}(x)} \log p_{\hat\lambda}(x) - U^*(\hat\lambda) \right] - \left[ E_{\tilde{p}(x)} \log p_{\lambda^\circ}(x) - U^*(\lambda^\circ) \right] \right) + U^*(\hat\lambda) - U^*(\lambda^\circ) \end{aligned}$$

where the second inequality follows from the fact that $\left[ E_{\tilde{p}(x)} \log p_{\hat\lambda}(x) - U^*(\hat\lambda) \right] - \left[ E_{\tilde{p}(x)} \log p_{\lambda^\circ}(x) - U^*(\lambda^\circ) \right] \le 0$, since $p_{\lambda^\circ}$ maximizes the a posteriori objective over the exponential family, $\mathcal{E}$. Therefore, with a probability of at least $1 - \eta$,

$$D(p_{\hat\lambda}(x) \,\|\, p_{\lambda^\circ}(x)) = D(p_0(x) \,\|\, p_{\lambda^\circ}(x)) - D(p_0(x) \,\|\, p_{\hat\lambda}(x)) \le \frac{4 C_1}{\sqrt{M}} E_{\tilde{X}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(x), \epsilon, d_x)} \, d\epsilon \right] + C_2 \sqrt{\frac{2 \log(1/\eta)}{M}} + U^*(\hat\lambda) - U^*(\lambda^\circ)$$ ☐
Proof of Corollary 4.
Proof. Since $U^*(\lambda)$ is a legal prior, $U^*(\hat\lambda) = 0$. We thus have the inequality by the last theorem. As $M \to \infty$, the right-hand side of the last inequality goes to $-U^*(\lambda^\circ)$, which is nonpositive; however, the left-hand side is nonnegative. Therefore, we must have $U^*(\lambda^\circ) \to 0$. ☐
Proof of Lemma 1.
Proof. By the definition of $\tilde\lambda$, we have:

$$\frac{1}{M} \sum_{i=1}^{M+1} \nabla L_{\tilde\lambda}(x_i) + \nabla U^*(\tilde\lambda) = 0$$

The Bregman divergence of $L_\lambda$ for the Markov random field model can be written as:

$$B_{L(x)}(\lambda_1, \lambda_2) = L_{\lambda_2}(x) - L_{\lambda_1}(x) - \nabla L_{\lambda_1}(x)^T (\lambda_2 - \lambda_1)$$

which is always nonnegative. Then, we have:

$$\frac{1}{M} \sum_{i=1, i \ne k}^{M+1} L_{\tilde\lambda^k}(x_i) \ge \frac{1}{M} \sum_{i=1, i \ne k}^{M+1} \left[ L_{\tilde\lambda^k}(x_i) - B_{L(x_i)}(\tilde\lambda, \tilde\lambda^k) \right] = \frac{1}{M} \sum_{i=1, i \ne k}^{M+1} L_{\tilde\lambda}(x_i) + \frac{1}{M} \sum_{i=1, i \ne k}^{M+1} \nabla L_{\tilde\lambda}(x_i)^T (\tilde\lambda^k - \tilde\lambda)$$

By Taylor's expansion, we know there exists $\theta \in \Omega$, such that:

$$U^*(\tilde\lambda^k) = U^*(\tilde\lambda) + \nabla U^*(\tilde\lambda)^T (\tilde\lambda^k - \tilde\lambda) + \frac{1}{2} (\tilde\lambda^k - \tilde\lambda)^T \nabla^2 U^*(\theta) (\tilde\lambda^k - \tilde\lambda) \ge U^*(\tilde\lambda) + \nabla U^*(\tilde\lambda)^T (\tilde\lambda^k - \tilde\lambda) + \frac{\kappa}{2} \|\tilde\lambda^k - \tilde\lambda\|_2^2$$

where the last inequality is due to the assumption that the Hessian matrix of $U^*(\lambda)$ is positive definite with smallest eigenvalue $\kappa > 0$.

Furthermore, note that by the definition of $\tilde\lambda^k$, we have:

$$\frac{1}{M} \sum_{i=1, i \ne k}^{M+1} L_{\tilde\lambda}(x_i) + U^*(\tilde\lambda) \le \frac{1}{M} \sum_{i=1, i \ne k}^{M+1} L_{\tilde\lambda^k}(x_i) + U^*(\tilde\lambda^k)$$

Combining the three inequalities in Equation (42), Equation (43) and Equation (44), we then obtain:

$$\frac{\kappa}{2} \|\tilde\lambda - \tilde\lambda^k\|_2^2 \le \frac{1}{M} \sum_{i=1, i \ne k}^{M+1} \nabla L_{\tilde\lambda}(x_i)^T (\tilde\lambda - \tilde\lambda^k) + \nabla U^*(\tilde\lambda)^T (\tilde\lambda - \tilde\lambda^k) \le \left\| \frac{1}{M} \sum_{i=1, i \ne k}^{M+1} \nabla L_{\tilde\lambda}(x_i) + \nabla U^*(\tilde\lambda) \right\| \, \|\tilde\lambda - \tilde\lambda^k\| = \frac{1}{M} \|\nabla L_{\tilde\lambda}(x_k)\| \, \|\tilde\lambda - \tilde\lambda^k\|$$

The last equality follows from Equation (41). By canceling $\|\tilde\lambda - \tilde\lambda^k\|$ from the above inequality, we obtain the desired bound. ☐
Proof of Theorem 6.
Proof. We use the same leave-one-out technique as in [19,30,31]. It follows from Lemma 1 and the gradient bound above that:

$$\|\tilde\lambda^k - \tilde\lambda\| \le \frac{2}{\kappa M} \|\nabla L_{\tilde\lambda}(x_k)\| \le \frac{2C}{\kappa M} \left( 1 - \exp(-L_{\tilde\lambda}(x_k)) \right)$$

Therefore:

$$L_{\tilde\lambda^k}(x_k) - L_{\tilde\lambda}(x_k) \le \frac{2 C^2}{\kappa M} \left( 1 - \exp(-L_{\tilde\lambda}(x_k)) \right)$$

After summing over $k$ from one to $M+1$, then taking the expectation with respect to the training data and using Jensen's inequality, we obtain:

$$\begin{aligned} E_{\tilde{X}} E_X L_{\lambda^\circ}(X) = E_{\tilde{X}_{M+1}} E_{x_k} L_{\tilde\lambda^k}(x_k) &\le E_{\tilde{X}_{M+1}} \frac{1}{M+1} \sum_{k=1}^{M+1} L_{\tilde\lambda}(x_k) + \frac{2 C^2}{\kappa M} \left( 1 - E_{\tilde{X}_{M+1}} \frac{1}{M+1} \sum_{k=1}^{M+1} \exp(-L_{\tilde\lambda}(x_k)) \right) \\ &\le E_{\tilde{X}_{M+1}} \frac{1}{M+1} \sum_{k=1}^{M+1} L_{\tilde\lambda}(x_k) + \frac{2 C^2}{\kappa M} \left( 1 - \exp\left( -E_{\tilde{X}_{M+1}} \frac{1}{M+1} \sum_{k=1}^{M+1} L_{\tilde\lambda}(x_k) \right) \right) \end{aligned}$$

Since $U^*(\lambda)$ is a legal prior, $U^*(\lambda) \ge 0$ and $U^*(\hat\lambda) = 0$, and since $\tilde\lambda$ is the optimal solution of Equation (24), we have:

$$E_{\tilde{X}_{M+1}} \frac{1}{M+1} \sum_{k=1}^{M+1} L_{\tilde\lambda}(x_k) \le \frac{M}{M+1} E_{\tilde{X}_{M+1}} \left[ \frac{1}{M} \sum_{k=1}^{M+1} L_{\tilde\lambda}(x_k) + U^*(\tilde\lambda) \right] \le \frac{M}{M+1} E_{\tilde{X}_{M+1}} \left[ \frac{1}{M} \sum_{k=1}^{M+1} L_{\hat\lambda}(x_k) + U^*(\hat\lambda) \right] = E_X L_{\hat\lambda}(X)$$

Combining the inequalities in Equation (45) and Equation (46) yields:

$$E_{\tilde{X}} D(p_{\hat\lambda}(x) \,\|\, p_{\lambda^\circ}(x)) = E_{\tilde{X}} \left[ D(p_0(x) \,\|\, p_{\lambda^\circ}(x)) - D(p_0(x) \,\|\, p_{\hat\lambda}(x)) \right] = E_{\tilde{X}} E_X L_{\lambda^\circ}(X) - E_X L_{\hat\lambda}(X) \le \frac{2 C^2}{\kappa M} \left( 1 - \exp(-E_X L_{\hat\lambda}(X)) \right)$$ ☐
Proof of Theorem 9.
Proof. By working with $p_\lambda(y)$ and using the same techniques as before, i.e., the McDiarmid concentration inequality [38] and symmetrization [33,39], we have, with a probability of at least $1 - \eta$,

$$\sup_{\lambda \in \Omega} \left| E_{p_0(y)} \log p_\lambda(y) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(y_j) \right| \le 2 E_{\tilde{Y}, \omega} \sup_{\lambda \in \Omega} \left| \frac{1}{M} \sum_{j=1}^M \omega_j \log F_\lambda(y_j) \right| + C_4 \sqrt{\frac{\log(1/\eta)}{2M}}$$

Now, we apply the refined version of the Rademacher comparison inequality proposed in Theorem 7 of Meir and Zhang [24], which says that for $l$-Lipschitz functions $\phi_i: \mathbb{R} \to \mathbb{R}$, $i = 1, \ldots, M$, one obtains the inequality:

$$E_\omega \sup_{f \in F} \sum_{i=1}^M \omega_i \phi_i(f_i) \le l \, E_\omega \sup_{f \in F} \sum_{i=1}^M \omega_i f_i$$

By the arguments in [20,33], it is easy to show that the absolute value version is valid as:

$$E_\omega \sup_{f \in F} \left| \sum_{i=1}^M \omega_i \phi_i(f_i) \right| \le 2 l \, E_\omega \sup_{f \in F} \left| \sum_{i=1}^M \omega_i f_i \right|$$

Let $\phi(x) = \log(x)$, where $a \le x \le b$. It is easy to verify that $\phi(x)$ is $\frac{1}{a}$-Lipschitz, so:

$$E_{\tilde{Y}, \omega} \sup_{\lambda \in \Omega} \left| \frac{1}{M} \sum_{j=1}^M \omega_j \log F_\lambda(y_j) \right| \le \frac{2}{a} E_{\tilde{Y}, \omega} \sup_{\lambda \in \Omega} \left| \frac{1}{M} \sum_{j=1}^M \omega_j F_\lambda(y_j) \right|$$

Combining this inequality with that of Dudley's entropy integral, one can then establish that, with a probability of at least $1 - \eta$,

$$\sup_{\lambda \in \Omega} \left| E_{p_0(y)} \log p_\lambda(y) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(y_j) \right| \le \frac{2 C_3}{\sqrt{M}} \int_\zeta^\alpha \sqrt{\log N(F(y), \epsilon, d_y)} \, d\epsilon + C_4 \sqrt{\frac{\log(1/\eta)}{2M}}$$

where $0 < \zeta < \alpha < \infty$. These two quantities depend on $\sup_{\lambda \in \Omega} \|\lambda\|$, $a$ and $b$. Furthermore, $C_3$ is the constant in the bound of Rademacher averages over linear combinations in $F(y)$, obtained by Dudley's entropy integral modified by $a$ and $b$. Finally, the constant, $C_4$, depends on $a$, $b$ and $\eta$.

Putting the pieces together, with a probability of at least $1 - \eta$, we obtain:

$$\begin{aligned} D(p_0(y) \,\|\, p_{\lambda^*}(y)) - D(p_0(y) \,\|\, p_{\hat\lambda}(y)) &= \left( E_{p_0(y)} \log p_{\hat\lambda}(y) - E_{\tilde{p}(y)} \log p_{\hat\lambda}(y) \right) + \left( E_{\tilde{p}(y)} \log p_{\lambda^*}(y) - E_{p_0(y)} \log p_{\lambda^*}(y) \right) + \left( E_{\tilde{p}(y)} \log p_{\hat\lambda}(y) - E_{\tilde{p}(y)} \log p_{\lambda^*}(y) \right) \\ &\le 2 \sup_{\lambda \in \Omega} \left| E_{p_0(y)} \log p_\lambda(y) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(y_j) \right| + \frac{1}{M} \sum_{j=1}^M \log \frac{p_{\hat\lambda}(y_j)}{p_{\lambda^*}(y_j)} \\ &\le \frac{4 C_3}{\sqrt{M}} E_{\tilde{Y}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(y), \epsilon, d_y)} \, d\epsilon \right] + C_4 \sqrt{\frac{2 \log(1/\eta)}{M}} \end{aligned}$$ ☐
Proof of Theorem 10.
Proof. Choose the function to be $\log p_\lambda(y)$, $\lambda \in \Omega$. Then, $\cosh(2\tau \log p_\lambda(y)) = \frac{1}{2} \left( p_\lambda(y)^{2\tau} + \frac{1}{p_\lambda(y)^{2\tau}} \right)$. By Theorem 3 in [24], we have that if there exists a positive number, $K(F)$, such that for all $\tau > 0$,

$$\log E_{p_0(y)} \sup_{\lambda \in \Omega} \left[ p_\lambda(y)^{2\tau} + \frac{1}{p_\lambda(y)^{2\tau}} \right] \le (\tau K(F))^2$$

then, for all $\lambda \in \Omega$, with a probability of at least $1 - \eta$, (41) will hold.

Next, we take the derivative of $p_\lambda(y)^{2\tau} + \frac{1}{p_\lambda(y)^{2\tau}}$ with respect to $\lambda$ and set this to zero. After some routine calculation, we obtain, for $i = 1, \ldots, N$:

$$\left( p_\lambda(y)^{2\tau - 1} - \frac{1}{p_\lambda(y)^{2\tau + 1}} \right) p_\lambda(y) \left( E_{p_\lambda(x)}(f_i(x)) - E_{p_\lambda(z|y)}(f_i(y, z)) \right) = 0$$

Thus, $\lambda_\tau = \arg\sup_{\lambda \in \Omega} \left[ p_\lambda(y)^{2\tau} + \frac{1}{p_\lambda(y)^{2\tau}} \right]$ are those parameters such that $E_{p_{\lambda_\tau}(x)}(f(x)) = E_{p_{\lambda_\tau}(z|y)}(f(y, z))$. Moreover, they achieve the maximum of $p_\lambda(y)^{2\tau} + \frac{1}{p_\lambda(y)^{2\tau}}$. Even though these parameters can be uniquely determined for each fixed $y$, unlike the complete data case, they may not be the same as the MLE or LME estimates, due to the existence of multiple feasible solutions. ☐
Proof of Theorem 11.
Proof. The constants are the same as in Theorem 9, and the proof is similar.

$$\begin{aligned} D(p_0(y) \,\|\, p_{\lambda^\diamond}(y)) - D(p_0(y) \,\|\, p_{\hat\lambda}(y)) &= \left( E_{p_0(y)} \log p_{\hat\lambda}(y) - E_{\tilde{p}(y)} \log p_{\hat\lambda}(y) \right) + \left( E_{\tilde{p}(y)} \log p_{\lambda^\diamond}(y) - E_{p_0(y)} \log p_{\lambda^\diamond}(y) \right) + \left( E_{\tilde{p}(y)} \log p_{\hat\lambda}(y) - E_{\tilde{p}(y)} \log p_{\lambda^\diamond}(y) \right) \\ &\le 2 \sup_{\lambda \in \Omega} \left| E_{p_0(y)} \log p_\lambda(y) - \frac{1}{M} \sum_{j=1}^M \log p_\lambda(y_j) \right| + \frac{1}{M} \sum_{j=1}^M \log \frac{p_{\hat\lambda}(y_j)}{p_{\lambda^\diamond}(y_j)} \end{aligned}$$

Therefore, by using the inequality in Equation (48), we have, with a probability of at least $1 - \eta$,

$$D(p_0(y) \,\|\, p_{\lambda^\diamond}(y)) - D(p_0(y) \,\|\, p_{\hat\lambda}(y)) \le \frac{4 C_3}{\sqrt{M}} E_{\tilde{Y}} \left[ \int_\zeta^\alpha \sqrt{\log N(F(y), \epsilon, d_y)} \, d\epsilon \right] + C_4 \sqrt{\frac{2 \log(1/\eta)}{M}} + E_{\tilde{p}(y)} \log \frac{p_{\hat\lambda}(y)}{p_{\lambda^\diamond}(y)}$$ ☐

References

  1. Jaynes, E. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620–630. [Google Scholar] [CrossRef]
  2. Della Pietra, S.; Della Pietra, V.; Lafferty, J. Inducing features of random fields. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 380–393. [Google Scholar] [CrossRef]
  3. Palmieri, F.; Ciuonzo, D. Objective priors from maximum entropy in data classification. Inf. Fusion 2013, 14, 186–198. [Google Scholar] [CrossRef]
  4. Ziebart, B.; Maas, A.; Bagnell, J.; Dey, A. Maximum Entropy Inverse Reinforcement Learning. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI), Chicago, IL, USA, 13–17 July 2008; pp. 1433–1438.
  5. Wang, S.; Schuurmans, D.; Zhao, Y. The latent maximum entropy principle. ACM Trans. Knowl. Discov. Data 2012, 6, Article 8, 1–42. [Google Scholar] [CrossRef]
  6. McLachlan, G.; Krishnan, T. The EM Algorithm and Extensions, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2008. [Google Scholar]
  7. Lehmann, E.L.; Casella, G. Theory of Point Estimation; Springer-Verlag: New York, NY, USA, 1998. [Google Scholar]
  8. Vapnik, V.; Chervonenkis, A. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognit. Image Anal. 1991, 1, 283–305. [Google Scholar]
  9. Vapnik, V.N. Statistical Learning Theory; Wiley-Interscience: Hoboken, NJ, USA, 1998. [Google Scholar]
  10. Barron, A.; Sheu, C. Approximation of density functions by sequences of exponential families. Ann. Stat. 1991, 19, 1347–1369. [Google Scholar] [CrossRef]
  11. Dudik, M.; Phillips, S.; Schapire, R. Performance Guarantees for Regularized Maximum Entropy Density Estimation. In Proceedings of The 17th Annual Conference on Learning Theory (COLT), Banff, Canada, 1–4 July 2004; pp. 472–486.
  12. Van de Geer, S. Empirical Processes in M-Estimation; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
  13. Jelinek, F. Statistical Methods for Speech Recognition; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  14. Rosenfeld, R. A maximum entropy approach to adaptive statistical language modeling. Comput. Speech Lang. 1996, 10, 187–228. [Google Scholar] [CrossRef]
  15. Koltchinskii, V.; Panchenko, D. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Stat. 2002, 30, 1–50. [Google Scholar]
  16. Rakhlin, A.; Panchenko, D.; Mukherjee, S. Risk bounds for mixture density estimation. ESAIM: Probab. Stat. 2005, 9, 220–229. [Google Scholar] [CrossRef]
  17. Csiszar, I. I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 1975, 3, 146–158. [Google Scholar] [CrossRef]
  18. Barron, A. Approximation and estimation bounds for artificial neural networks. Mach. Learn. 1994, 14, 115–133. [Google Scholar] [CrossRef]
  19. Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat. 2004, 32, 56–85. [Google Scholar] [CrossRef]
  20. Panchenko, D. Class Lecture Notes of MIT 18.465: Statistical Learning Theory; Massachusetts Institute of Technology: Cambridge, MA, USA, 2002. [Google Scholar]
  21. Zhang, T. Covering number bounds of certain regularized linear function classes. J. Mach. Learn. Res. 2002, 2, 527–550. [Google Scholar]
  22. Dudley, R. Uniform Central Limit Theorems; Cambridge University Press: New York, NY, USA, 1999. [Google Scholar]
  23. Wainwright, M.; Jordan, M. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 2008, 1, 1–305. [Google Scholar] [CrossRef]
  24. Meir, R.; Zhang, T. Generalization error bounds for Bayesian mixture algorithms. J. Mach. Learn. Res. 2003, 4, 839–860. [Google Scholar]
  25. Chen, S.F.; Rosenfeld, R. A survey of smoothing techniques for ME models. IEEE Trans. Speech Audio Process. 2000, 8, 37–50. [Google Scholar] [CrossRef]
  26. Borwein, J.; Lewis, A. Convex Analysis and Nonlinear Optimization: Theory and Examples; Springer-Verlag: New York, NY, USA, 2000. [Google Scholar]
  27. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  28. Lebanon, G.; Lafferty, J. Boosting and maximum likelihood for exponential models. Adv. Neural Inf. Process. Syst. 2001, 14, 447–454. [Google Scholar]
  29. Wang, S.; Schuurmans, D.; Peng, F.; Zhao, Y. Learning mixture models with the regularized latent maximum entropy principle. IEEE Trans. Neural Netw. 2004, 15, 903–916. [Google Scholar] [CrossRef] [PubMed]
  30. Zhang, T. Leave-one-out bounds for kernel methods. Neural Comput. 2003, 15, 1397–1437. [Google Scholar] [CrossRef]
  31. Zhang, T. Class-size independent generalization analysis of some discriminative multi-category classification methods. Adv. Neural Inf. Process. Syst. 2004, 17, 1625–1632. [Google Scholar]
  32. Wang, S.; Schuurmans, D.; Peng, F.; Zhao, Y. Combining statistical language models via the latent maximum entropy principle. Mach. Learn. J. 2005, 60, 229–250. [Google Scholar] [CrossRef]
  33. Ledoux, M.; Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes; Springer-Verlag: Berlin and Heidelberg, Germany, 1991. [Google Scholar]
  34. Opper, M.; Haussler, D. Worst Case Prediction over Sequences under Log Loss. In The Mathematics of Information Coding, Extraction and Distribution; Cybenko, G., O’Leary, D., Rissanen, J., Eds.; Springer-Verlag: New York, NY, USA, 1998. [Google Scholar]
  35. Lafferty, J.; Della Pietra, V.; Della Pietra, S. Statistical Learning Algorithms Based on Bregman Distance. In Proceedings of the Canadian Workshop on Information Theory, Toronto, Ontario, Canada, 3–6 June 1997; pp. 77–80.
  36. Wang, S.; Schuurmans, D. Learning Continuous Latent Variable Models with Bregman Divergences. In Proceedings of The 14th International Conference on Algorithmic Learning Theory (ALT), Sapporo, Japan, 17–19 October 2003; pp. 190–204.
  37. Bartlett, P. Class Lecture Notes of UC-Berkeley CS281B/Stat241B: Statistical Learning Theory; University of California, Berkeley: Berkeley, CA, USA, 2003. [Google Scholar]
  38. McDiarmid, C. On the Method of Bounded Differences. In Surveys in Combinatorics; Siemons, J., Ed.; Cambridge University Press: Cambridge, UK, 1989; pp. 148–188. [Google Scholar]
  39. Van der Vaart, A.; Wellner, J. Weak Convergence and Empirical Processes; Springer-Verlag: New York, NY, USA, 1996. [Google Scholar]
  40. Zhang, T.; Yu, B. Boosting with early stopping: Convergence and consistency. Ann. Stat. 2005, 33, 1538–1579. [Google Scholar] [CrossRef]
