1. Introduction
A number of statistical inference problems of significant contemporary interest, such as text classification, language modeling and DNA microarray analysis, require inferences based on observed sequences of symbols in which the sequence length, or sample size, is comparable to or even smaller than the size of the set of symbols, the alphabet. For instance, language models for speech recognition estimate distributions over English words using text samples much smaller than the vocabulary.
Inference in this setting has received a lot of attention, from Laplace [1–3] in the 18th century, to Good [4] in the mid-20th century, to an explosion of work in the statistics [5–13], information theory [14–17,21,22] and machine learning [23–26] communities in the last few decades. A major strand in the information theory literature on the subject has been based on the notion of patterns. The pattern of a sequence characterizes the repeat structure in the sequence, which is the information that can be described well (see Orlitsky et al. [27] for formal characterizations of this idea). The statistical literature has emphasized the importance of exchangeability, which generalizes the notion of independence.
We consider measures over infinite sequences X1, X2, . . ., where the Xi come from a countable (infinite) set, the alphabet. Let 𝒫 be the collection of all distributions over countable (potentially infinite) alphabets. For p ∈ 𝒫, let p^(n) denote the product distribution corresponding to an independent and identically distributed (i.i.d.) sample X1, . . . , Xn, where Xi ~ p. Let 𝒫^(n) be the collection of all such i.i.d. distributions on length-n sequences drawn from countable alphabets. Let 𝒫^∞ be the collection of all measures over infinite sequences of symbols X1, X2, . . . from countable alphabets, where the {Xi} are i.i.d. according to some distribution in 𝒫. The measures are constructed by extending to the Borel sigma algebra the i.i.d. probability assignments on finite-length sequences, namely those in 𝒫^(n), n ≥ 1. We call 𝒫^∞ the set of i.i.d. measures.
Based on a sample X1, . . . , Xn from an unknown p^(n) ∈ 𝒫^(n) (or, equivalently, the corresponding measure in 𝒫^∞), we want to create an estimator qn, which assigns probabilities to length-n sequences. We are interested in the behavior of the sequence of estimators {qn : n = 1, 2, . . .}. With some abuse of notation, we will use q to denote the estimator qn when the sample size n is clear from context. We want qn to approximate p^(n) well; in particular, we would like qn to neither overestimate nor underestimate the probability of sequences of length n under the true p^(n).
Suppose that there exist Rn > 0 and An > 0, such that for each p ∈ 𝒫, the probability under p^(n) that qn overestimates the probability of the observed length-n sequence by a factor larger than Rn is at most 1/An. If An is any function of n that grows sufficiently fast with n (fast enough that ∑n 1/An converges), then by the Borel–Cantelli lemma, the sequence {qn} does not asymptotically overestimate probabilities of length-n sequences by a factor larger than Rn with probability one, no matter what measure p ∈ 𝒫^∞ generated the sequences.
Protecting against underestimation is not so simple. The redundancy of an estimator qn (defined formally in Section 2.4) on a length-n sequence x1, . . . , xn measures how closely qn(x1, . . . , xn) matches the largest probability assigned to x1, . . . , xn by any distribution in 𝒫^(n). The redundancy of the estimator itself is obtained either by maximizing this quantity over all sequences or by taking its expectation over all sequences. Ideally, we want the estimator redundancy to grow sublinearly in the sequence length n, so that the per-sample redundancy vanishes as n → ∞. If so, we call the estimator universal for 𝒫^∞. Redundancy thus captures how well q performs against the collection 𝒫^∞, but the connections between estimation problems and compression run deeper.
In this paper, we consider estimators formed by taking a measure (prior) on 𝒫. Different priors induce different distributions on the data X1, . . . , Xn. We think of the prior as randomly choosing a distribution p in 𝒫, and the observed data is generated i.i.d. according to this p. How much information about the underlying distribution p can we obtain from the data (assuming we know the prior)? Indeed, a well-known result [28–30] proves that the redundancy of the best possible estimator for 𝒫^∞ equals the maximum (over all choices of priors) information (in bits) that is present about the underlying source in a length-n sequence generated in this manner.
Redundancy is well defined for finite alphabets; recent work [16] has formalized a similar framework for countably infinite alphabets. This framework is based on the notion of patterns of sequences, which abstract the identities of the symbols and indicate only the relative order of appearance. For example, the pattern of FEDERER is 1232424, while that of PATTERN is 1233456. The crux of the idea is that instead of considering the set of measures 𝒫^∞ over infinite sequences, we consider the set of measures induced over the patterns of the sequences. Our estimate qΨ is then a measure over patterns. While the variables in the sequence are i.i.d., the corresponding pattern corresponds to an exchangeable random partition. We can associate a predictive distribution with the pattern probability estimator qΨ: it estimates the distribution of Xn+1 given the previous observations, assigning a probability to the event that Xn+1 is "new" (has not appeared in X1, . . . , Xn), as well as probabilities to the events that Xn+1 takes one of the values seen so far.
The above view of estimation also appears in the statistical literature on Bayesian nonparametrics that focuses on exchangeability. Kingman [31] advocated the use of exchangeable random partitions to accommodate the analysis of data from an alphabet that is not bounded or known in advance. A more detailed discussion of the history and philosophy of this problem can be found in the works of Zabell [11,32] collected in [33]. One of the most popular exchangeable random partition processes is the "Chinese restaurant process" [10], which is a special case of the Poisson–Dirichlet or Pitman–Yor process [13,34]. These processes can be viewed as prior distributions on the set of all discrete distributions that can be used as the basis for estimating probabilities and computing predictive distributions.
In this paper, we evaluate the performance of the sequential estimators corresponding to these exchangeable partition processes. As described before, 𝒫 is the collection of all distributions over countable (potentially infinite) alphabets, and 𝒫^∞ is the collection of all i.i.d. measures with single-letter marginals in 𝒫. Let 𝒫_Ψ be the collection of all measures over patterns induced by measures in 𝒫^∞. We evaluate the redundancy of estimators based on the Chinese restaurant process (CRP), the Pitman–Yor (PY) process and the Ewens sampling formula against 𝒫_Ψ.
In the context of sequential estimation, early work [16] showed that for the collection 𝒫_Ψ of measures over patterns, universal estimators do exist: the redundancy grows only as O(n^{1/2}), so the normalized (per-symbol) redundancy vanishes. More recent work [18,19] proved tight bounds on worst-case and average redundancy; these results show that there are sequential estimators with redundancy of O(n^{1/3}). However, these estimators are computationally intensive and (generally speaking) infeasible in practice. Acharya et al. [20] demonstrated a linear-time estimator with average redundancy O(n^{1/2}), improving over the earlier constructions achieving O(n^{2/3}) [16]. By contrast, estimators such as the CRP or the PY estimator were not developed in a universal compression framework, but have been very successful from a practical standpoint. The goal of this paper is to understand these Bayesian estimators from the universal compression perspective.
For the case of the estimators studied in nonparametric Bayesian statistics, our results show that they are in general neither weakly nor strongly universal when compressing patterns or equivalently exchangeable random partitions. While the notion of redundancy is in some sense different from other measures of accuracy, such as the concentration of the posterior distribution about the true distribution, the parameters of the CRPs or the PY processes that do compress well often correspond to the maximum likelihood estimates obtained from the sample.
Because we choose to measure redundancy in the worst case over p, the underlying alphabet size may be arbitrarily large with respect to the sample size n. Consequently, for a fixed sample of size n, the number of distinct symbols could be large, for example a constant fraction of n. The CRP and PY estimators do not have good redundancy against such samples, since these are not the cases the estimators were designed for. However, we can show that a mixture of estimators corresponding to CRP estimators is weakly universal. This mixture is made by combining individually optimized CRP estimators that (implicitly) assume a bound on the support of p. If such a bound is known in advance, we can derive much tighter bounds on the redundancy. In this setting, the two-parameter Poisson–Dirichlet (or Pitman–Yor) estimator is superior to the estimator derived from the Chinese restaurant process.
In order to describe our results, we require a variety of definitions from different research communities. In the next section, we describe this preliminary material and place it in context before describing the main results in Section 3.
2. Preliminaries
In this paper, we use the "big-O" notation. A function f(n) = O(g(n)) if there exists a positive constant C such that, for sufficiently large n, |f(n)| ≤ C|g(n)|. A function f(n) = Ω(g(n)) if there exists a positive constant C′ such that, for sufficiently large n, |f(n)| ≥ C′|g(n)|. A function f(n) = Θ(g(n)) if f(n) = O(g(n)) and f(n) = Ω(g(n)).
Let 𝒫_k denote the set of all probability distributions on alphabets of size k, let 𝒫_∞ be the set of all probability distributions on countably infinite alphabets, and let

𝒫 = 𝒫_∞ ∪ ⋃_{k≥1} 𝒫_k

be the set of all discrete distributions, irrespective of support and support size.
For a fixed p, let X1, . . . , Xn be a sequence drawn i.i.d. according to p. We denote the pattern of X1, . . . , Xn by ψn = ψ1ψ2 · · · ψn. The pattern is formed by taking ψ1 = 1 and, for 2 ≤ i ≤ n,

ψi = ψj if Xi = Xj for some j < i, and ψi = 1 + max{ψ1, . . . , ψi−1} otherwise.

For example, the pattern of FEDERER is 1232424. Let Ψn be the set of all patterns of length n. We write p(ψn) for the probability that a length-n sequence generated by p has pattern ψn. For a pattern ψn, we write φμ for the number of symbols that appear μ times in ψn, and m = ∑μ φμ is the number of distinct symbols in ψn. We call φμ the prevalence of μ. Thus, for FEDERER, we have φ1 = 2, φ2 = 1, φ3 = 1 and m = 4.
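To make these definitions concrete, the following small Python sketch (our own illustrative code; the function names are not from the paper) computes the pattern of a sequence and its prevalences φμ.

from collections import Counter

def pattern(seq):
    """Return the pattern of seq: each symbol is replaced by the order of its first appearance."""
    first_seen = {}
    psi = []
    for x in seq:
        if x not in first_seen:
            first_seen[x] = len(first_seen) + 1
        psi.append(first_seen[x])
    return psi

def prevalences(psi):
    """Return {mu: phi_mu}: the number of distinct symbols appearing exactly mu times."""
    multiplicities = Counter(psi)            # how many times each distinct symbol appears
    return dict(Counter(multiplicities.values()))

print(pattern("FEDERER"))                    # [1, 2, 3, 2, 4, 2, 4]
print(prevalences(pattern("FEDERER")))       # {1: 2, 3: 1, 2: 1}, i.e., phi_1 = 2, phi_2 = 1, phi_3 = 1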
2.1. Exchangeable Partition Processes
An exchangeable random partition is a sequence (Cn : n ∈ ℕ), where Cn is a random partition of the set [n] = {1, 2, . . . , n}, satisfying the following conditions: (i) the probability that Cn is a particular partition depends only on the vector (s1, s2, . . . , sn), where sk is the number of parts of size k in the partition; and (ii) the realizations of the sequence are consistent, in that all of the parts of Cn are also parts of the partition Cn+1, except that the new element n + 1 may either form a new part of Cn+1 by itself or join one of the existing parts of Cn.
For a sequence X1, . . . , Xn from a discrete alphabet, one can partition the set [n] into component sets {Ax}, where Ax = {i : Xi = x} collects the positions in which x has appeared. When the {Xi} are drawn i.i.d. from a distribution in 𝒫, the corresponding sequence of random partitions is called a paintbox process.
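As a small illustration (our own code, with hypothetical names), the partition of [n] induced by a sequence, with parts Ax = {i : Xi = x}, can be computed as follows.

from collections import defaultdict

def induced_partition(seq):
    """Return the parts A_x = {i : X_i = x} of the partition of [n] induced by the sequence."""
    parts = defaultdict(list)
    for i, x in enumerate(seq, start=1):     # positions are 1, ..., n
        parts[x].append(i)
    return list(parts.values())

print(induced_partition("FEDERER"))          # [[1], [2, 4, 6], [3], [5, 7]]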
The remarkable Kingman representation theorem [8] states that the probability measure induced by any exchangeable random partition is a mixture of paintbox processes, where the mixture is taken using a probability measure ("prior" in Bayesian terminology) on the class of paintbox processes. Since each paintbox process corresponds to a discrete probability measure (the one from which i.i.d. Xi produce that paintbox process), the prior may be viewed as a distribution on the set of probability measures on a countable alphabet. For technical reasons, the alphabet is assumed to be hybrid, with a discrete part as well as a continuous part, and one also needs to work with the space of ordered probability vectors (see [35] for details).
2.2. Dirichlet Priors and Chinese Restaurant Processes
Not surprisingly, special classes of priors give rise to special classes of exchangeable random partitions. One particularly nice class of priors on the set of probability measures on a countable alphabet is that of the Poisson–Dirichlet priors [5,36,37] (sometimes called Dirichlet processes, since they live on the infinite-dimensional space of probability measures and generalize the usual finite-dimensional Dirichlet distribution).
The Chinese restaurant process (or CRP) is related to the so-called Griffiths–Engen–McCloskey (GEM) distribution with parameter θ, denoted by GEM(θ). Consider W1, W2, . . . drawn i.i.d. according to a Beta(1, θ) distribution, and set

p1 = W1,   pj = Wj ∏_{i<j} (1 − Wi) for j ≥ 2.

This can be interpreted as follows: take a stick of unit length and break it into pieces of size W1 and 1 − W1. Now take the piece of size 1 − W1 and break off a W2 fraction of it. Continuing in this way, the lengths of the broken-off pieces form a distribution on a countably infinite set. The distribution of the sequence p = (p1, p2, . . .) is the GEM(θ) distribution.
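The stick-breaking construction can be simulated directly; the following sketch (our own code) draws the first k atoms of a GEM(θ) random distribution, leaving the mass of the unbroken remainder of the stick unassigned.

import random

def gem_sample(theta, k, seed=0):
    """Sample the first k stick-breaking weights of a GEM(theta) distribution."""
    rng = random.Random(seed)
    probs, remaining = [], 1.0
    for _ in range(k):
        w = rng.betavariate(1.0, theta)      # W_i ~ Beta(1, theta)
        probs.append(remaining * w)          # p_j = W_j * prod_{i<j} (1 - W_i)
        remaining *= 1.0 - w
    return probs

p = gem_sample(theta=5.0, k=10)
print(p, sum(p))                             # the remaining stick, 1 - sum(p), is still unbroken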
2.3. Pattern Probability Estimators
Given a sample X1, . . . , Xn with pattern ψn, we would like to produce a pattern probability estimator. This is a function of the form q(ψn+1|ψn) that assigns a probability to each symbol previously seen in ψn, as well as a probability of seeing a new symbol. In this paper, we will investigate two different pattern probability estimators based on Bayesian models.
The Ewens sampling formula [38–40], which has its origins in theoretical population genetics, is a formula for the probability mass function of a marginal of a CRP corresponding to a fixed population size. In other words, it specifies the probability of an exchangeable random partition of [n] that is obtained when one uses the Poisson–Dirichlet PD(θ) prior to mix paintbox processes. Because of the equivalence between patterns and exchangeable random partitions, it estimates the probability of a pattern ψn via the following formula:

q_θ(ψn) = θ^{m−1} ∏_μ ((μ − 1)!)^{φμ} / [(θ + 1)(θ + 2) · · · (θ + n − 1)].
Recall that φμ is the number of symbols that appear μ times in ψn. In particular, the predictive distribution associated with the Ewens sampling formula or Chinese restaurant process is:

q(ψn+1 = m + 1 | ψn) = θ / (n + θ)   and   q(ψn+1 = j | ψn) = μj / (n + θ) for 1 ≤ j ≤ m,

where m is the number of distinct symbols in ψn and μj is the number of times the symbol j appears in ψn.
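The following sketch (our own code) implements the predictive rule displayed above: given the pattern observed so far, it returns the probability of each previously seen symbol and of a new symbol.

from collections import Counter

def crp_predictive(psi, theta):
    """CRP/Ewens predictive probabilities given the pattern psi observed so far (theta > 0)."""
    n = len(psi)
    counts = Counter(psi)                               # mu_j for each symbol j seen so far
    probs = {j: mu / (n + theta) for j, mu in counts.items()}
    probs["new"] = theta / (n + theta)                  # probability that the next symbol is new
    return probs

print(crp_predictive([1, 2, 3, 2, 4, 2, 4], theta=1.0))
# symbol 2 (seen 3 times): 3/8; symbols 1 and 3 (seen once): 1/8 each; symbol 4: 2/8; new: 1/8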
More generally, one can define the Pitman–Yor predictor (for α ∈ [0, 1] and θ > − α) as:

q(ψn+1 = m + 1 | ψn) = (θ + mα) / (n + θ)   and   q(ψn+1 = j | ψn) = (μj − α) / (n + θ) for 1 ≤ j ≤ m,

where m is the number of distinct symbols in ψn. The probability assigned by the Pitman–Yor predictor to a pattern ψn is therefore:

q_{α,θ}(ψn) = (θ + α)(θ + 2α) · · · (θ + (m − 1)α) ∏_μ (Γ(μ − α)/Γ(1 − α))^{φμ} / [(θ + 1)(θ + 2) · · · (θ + n − 1)].

Note that Γ(μ − α)/Γ(1 − α) = (μ − α − 1)(μ − α − 2) · · · (1 − α).
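For a sanity check of the displayed formula, the pattern probability can also be computed by multiplying the sequential predictions; the sketch below (our own code) does this for the Pitman–Yor predictor, with α = 0 recovering the CRP/Ewens case.

from collections import Counter

def py_pattern_prob(psi, alpha, theta):
    """Probability the Pitman-Yor (alpha, theta) predictor assigns to the whole pattern psi."""
    prob, counts = 1.0, Counter()
    for i, s in enumerate(psi):              # i symbols have been seen before this step
        m = len(counts)                      # number of distinct symbols so far
        if s in counts:
            prob *= (counts[s] - alpha) / (i + theta)     # repeat of a symbol seen counts[s] times
        else:
            prob *= (theta + m * alpha) / (i + theta)     # a new symbol
        counts[s] += 1
    return prob

psi = [1, 2, 3, 2, 4, 2, 4]                  # pattern of FEDERER
print(py_pattern_prob(psi, alpha=0.0, theta=1.0))         # CRP(theta = 1)
print(py_pattern_prob(psi, alpha=0.5, theta=1.0))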
2.4. Strong Universality Measures: Worst-Case and Average
How should we measure the quality of a pattern probability predictor q? We investigate two criteria here: the worst-case and the average-case redundancy. The redundancy of q on a given pattern ψn is:

log ( sup_{p ∈ 𝒫} p(ψn) / q(ψn) ).

The worst-case redundancy of q is defined to be:

R̂(q) = max_{ψn ∈ Ψn} log ( sup_{p ∈ 𝒫} p(ψn) / q(ψn) ).

Recall that p(ψn) just denotes the probability that a length-n sequence generated by p has pattern ψn; it is unnecessary to specify the support here. Since the set of length-n patterns is finite, there is no need for a supremum in the outer maximization above. The worst-case redundancy is often referred to as the per-sequence redundancy, as well.

The average-case redundancy replaces the maximum over patterns with an expectation over patterns drawn according to p:

R̄(q) = sup_{p ∈ 𝒫} ∑_{ψn ∈ Ψn} p(ψn) log ( p(ψn) / q(ψn) ) = sup_{p ∈ 𝒫} D( p(ψn) || q(ψn) ),

where D(·||·) is the Kullback–Leibler divergence or relative entropy. That is, the average-case redundancy is nothing but the worst-case Kullback–Leibler divergence between the pattern distribution induced by p and the predictor q.
A pattern probability estimator is considered "good" if the worst-case or average-case redundancy is sublinear in n, that is, R̂(q)/n → 0 or R̄(q)/n → 0 as n → ∞, respectively. Succinctly put, redundancy that is sublinear in n implies that the underlying probability of a sequence can be estimated accurately almost surely. Redundancy is one way to measure the "frequentist" properties of the Bayesian approaches we consider in this paper and refers to the compressibility of the distribution from an information-theoretic perspective.
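As a toy numerical illustration of these definitions (our own construction, not from the paper), one can restrict the supremum over p to uniform distributions on k ≤ K symbols; the resulting quantities are then only lower bounds on R̂(q) and R̄(q) for a CRP(θ) pattern estimator, but they show how the two criteria are computed.

from collections import Counter, defaultdict
from itertools import product
from math import log2

def pattern(seq):
    idx = {}
    return tuple(idx.setdefault(x, len(idx) + 1) for x in seq)

def crp_pattern_prob(psi, theta):
    prob, counts = 1.0, Counter()
    for i, s in enumerate(psi):
        prob *= (counts[s] if s in counts else theta) / (i + theta)
        counts[s] += 1
    return prob

n, K, theta = 4, 5, 1.0
induced = {}                                  # k -> distribution over patterns under uniform-k
for k in range(1, K + 1):
    dist = defaultdict(float)
    for seq in product(range(k), repeat=n):
        dist[pattern(seq)] += k ** (-n)
    induced[k] = dist

patterns = set().union(*induced.values())
worst = max(log2(max(induced[k].get(psi, 0.0) for k in induced) / crp_pattern_prob(psi, theta))
            for psi in patterns)
average = max(sum(p * log2(p / crp_pattern_prob(psi, theta)) for psi, p in induced[k].items())
              for k in induced)
print(f"lower bound on worst-case redundancy (n={n}): {worst:.3f} bits")
print(f"lower bound on average-case redundancy (n={n}): {average:.3f} bits")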
As mentioned in the Introduction, redundancy differs from notions, such as the concentration of the posterior distribution about the true distribution. However, the parameters of the CRPs or the PY processes that compress well often correspond to the maximum likelihood (ML) estimates from the sample.
2.5. Weak Universality
In the previous section, we considered guarantees that hold over the entire model class; both the worst case and the average case involve taking a supremum over the entire model class. Therefore, these strong guarantees, whether average or worst-case, hold uniformly over the model class. However, as we will see, exchangeable estimators, in particular the Chinese restaurant process and Pitman–Yor estimators, are tuned by the choice of their parameters towards specific kinds of sources rather than all i.i.d. models. This behavior is better captured by looking at the model-dependent convergence of the exchangeable estimators, which is known as weak universality. Specifically, let ℘∞ be a collection of i.i.d. measures over infinite sequences, and let ℘Ψ be the collection of measures induced on patterns by ℘∞. We say an estimator q is weakly universal for the class ℘Ψ if for all p ∈ ℘∞:

lim_{n→∞} (1/n) D( p(ψn) || q(ψn) ) = 0.
4. Weak Universality
In this section, we show how to modify the CRP or PY estimators to obtain weakly universal estimators. The CRP and PY cases are identical; therefore, we only work out the CRP case.
For all i ≥ 1 and j ≥ 2, let c_{i,j} > 0 be weights chosen so that ∑_{i,j} c_{i,j} = 1 (for instance, c_{i,j} proportional to 1/(i^2 j^2)). Let q_{i,j} be the CRP measure over patterns with θ = i/log j. Consider the following measure over patterns of infinite sequences that assigns, for all n and all patterns ψn of length n, the probability:

q*(ψn) = ∑_{i,j} c_{i,j} q_{i,j}(ψn).
We will show that q* is a weakly universal measure over patterns of i.i.d. sequences.
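The sketch below (our own code) evaluates a truncated version of this mixture: it uses the example weights c_{i,j} proportional to 1/(i^2 j^2) mentioned above, sums over a finite grid of (i, j) only, and takes the logarithm in θ = i/log j to be natural, all of which are our assumptions rather than choices fixed by the excerpt.

from collections import Counter
from math import log

def crp_pattern_prob(psi, theta):
    prob, counts = 1.0, Counter()
    for i, s in enumerate(psi):
        prob *= (counts[s] if s in counts else theta) / (i + theta)
        counts[s] += 1
    return prob

def q_star_truncated(psi, I=50, J=50):
    """Truncated mixture of CRP pattern estimators with theta = i / log(j)."""
    weights = {(i, j): 1.0 / (i * i * j * j) for i in range(1, I + 1) for j in range(2, J + 1)}
    total = sum(weights.values())                         # renormalize the truncated weights
    return sum((w / total) * crp_pattern_prob(psi, i / log(j)) for (i, j), w in weights.items())

print(q_star_truncated((1, 2, 3, 2, 4, 2, 4)))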
To do so, we will need the following two lemmas. Lemma 4 is a useful "folk" inequality that we believe is attributed to Minkowski. Lemma 5 relates the expected number of distinct symbols in length-n sequences of an i.i.d. process to its entropy and is of independent interest. The result not only strengthens a similar result in [43], but also provides a different and more compact proof.
Lemma 4
For n ≥ 1, let x1 ≥ x2 ≥ · · · ≥ xn ≥ 0 and y1 ≥ y2 ≥ · · · ≥ yn ≥ 0 be two sorted sequences. Then:

x1y1 + x2y2 + · · · + xnyn ≥ (1/n)(x1 + · · · + xn)(y1 + · · · + yn) ≥ x1yn + x2yn−1 + · · · + xny1.
Proof
The left inequality of the lemma follows by noting that:

(x1 + · · · + xn)(y1 + · · · + yn) = ∑_{k=0}^{n−1} ∑_{l=1}^{n} xl yl+k,

where the index l + k is taken modulo n, and that each inner sum ∑l xl yl+k is maximized at k = 0, since both sequences are sorted in the same direction. The right inequality of the lemma can be proven similarly, but will not be used in the paper.
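A quick numerical sanity check of the left inequality, in the Chebyshev-sum form we have written above (our reconstruction), is the following.

import random

rng = random.Random(0)
for _ in range(1000):
    n = rng.randint(1, 20)
    x = sorted((rng.random() for _ in range(n)), reverse=True)
    y = sorted((rng.random() for _ in range(n)), reverse=True)
    lhs = n * sum(a * b for a, b in zip(x, y))            # n * sum_l x_l y_l
    rhs = sum(x) * sum(y)                                 # (sum_l x_l)(sum_l y_l)
    assert lhs >= rhs - 1e-12
print("left inequality held on all random trials")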
Lemma 5
For all discrete i.i.d. processes P with entropy rate (or marginal entropy) H, let Mn be the random variable counting the number of distinct symbols in a sample of length n drawn from P. The following bound holds:

E[Mn] ≤ 1 + nH / log n,

where the entropy H and the logarithm are taken in the same base.
Proof
Let P(i) = pi. We begin by noting that:

H = ∑i pi log(1/pi) = log e · ∑i pi ∑_{k=1}^{∞} (1 − pi)^k / k,

where the second equality follows by the Taylor series expansion:

ln(1/x) = ∑_{k=1}^{∞} (1 − x)^k / k,   0 < x ≤ 1.

The right summation in the equation above is bounded below as follows:

∑i pi ∑_{k=1}^{∞} (1 − pi)^k / k ≥ ∑_{k=1}^{n} (1/k) ∑i pi (1 − pi)^k ≥(a) (1/n) (∑_{k=1}^{n} 1/k) (∑_{k=1}^{n} ∑i pi (1 − pi)^k) ≥ (ln n / n) ∑i (1 − pi)(1 − (1 − pi)^n),

where (a) follows from Minkowski's inequality in Lemma 4 and the last inequality because ∑_{k=1}^{n} 1/k ≥ ln n and ∑_{k=1}^{n} pi (1 − pi)^k = (1 − pi)(1 − (1 − pi)^n). Thus,

H ≥ log e · (ln n / n) ∑i (1 − pi)(1 − (1 − pi)^n) ≥ (log n / n) (E[Mn] − 1),

where for the second inequality, we use E[Mn] = ∑i (1 − (1 − pi)^n) and ∑i pi(1 − (1 − pi)^n) ≤ ∑i pi ≤ 1. Rearranging gives the bound of the lemma.
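The identity E[Mn] = ∑i (1 − (1 − pi)^n) used in the proof is easy to check numerically; the sketch below (our own code) does so for a truncated geometric marginal and also prints 1 + nH/log n, the bound in the form we have reconstructed above.

import random
from math import log2

def geometric_probs(r, cutoff=10_000):
    return [(1 - r) * r ** (i - 1) for i in range(1, cutoff + 1)]

def expected_distinct_exact(p, n):
    return sum(1 - (1 - pi) ** n for pi in p)             # E[M_n] = sum_i (1 - (1 - p_i)^n)

def expected_distinct_mc(p, n, trials=2000, seed=0):
    rng = random.Random(seed)
    return sum(len(set(rng.choices(range(len(p)), weights=p, k=n))) for _ in range(trials)) / trials

p, n = geometric_probs(0.7), 1000
H = -sum(pi * log2(pi) for pi in p if pi > 0)             # marginal entropy in bits
print("exact E[M_n]       :", expected_distinct_exact(p, n))
print("Monte Carlo E[M_n] :", expected_distinct_mc(p, n))
print("1 + n*H/log2(n)    :", 1 + n * H / log2(n))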
Theorem 6
[Weak universality for CRP mixtures] For all discrete i.i.d. processes p ∈ 𝒫^∞ with a finite entropy rate,

lim_{n→∞} (1/n) D( p(ψn) || q*(ψn) ) = 0.

That is, q* is weakly universal.
Proof
We write the divergence between p and q* in (15) as the expected log ratio and condition on the value of Mn; this gives the decomposition (18). Consider the estimator q_{i,j} in q* corresponding to i = Mn and j = n; this is the CRP estimator with θ = Mn/log n. From the proof of Theorem 1, we have the two-term bound (19), and we bound the two terms in (19) separately in the different regimes for Mn.

The result of Theorem 1 says that if Mn is small enough for that theorem to apply, then the corresponding contribution to (18) is o(n).

If Mn falls outside this regime, we first apply Markov's inequality using the previous lemma (Lemma 5) to bound the probability of this event; for all finite-entropy processes, this probability goes to zero as n → ∞. Looking at the term in q* corresponding to θ = Mn/log n and using the fact that Mn ≤ n, we see that the first term in (19) is upper bounded as O(log n). For the second term, we appeal to (9) in the proof of Theorem 1. Plugging these bounds into (18) shows that D(p(ψn) || q*(ψn)) = o(n), which proves the theorem.
The preceding theorem shows that the mixture of CRP estimators q* is weakly universal. However, note that q* is not itself a CRP estimator. An identical construction is possible for the PY estimators, as well. The convergence of the weakly universal q* depends on the entropy of the source, as well as on the number of distinct symbols in a sample of size n.
While it would be tempting to predict the performance of the estimator q* for larger sample sizes N ≥ n, such a task requires a more careful analysis. In general, it may be impossible to non-trivially bound the number of distinct symbols MN using a smaller sample of size n, as the following example shows.
Example 1
Let N = n^2. Consider a set containing the following two distributions: (i) p over ℕ, which assigns probability 1 − 1/n^{3/2} = 1 − 1/N^{3/4} to the atom 1 and splits the remaining probability 1/N^{3/4} equally among the elements of the set {2, . . . , N^2 + 1}; and (ii) p′, which simply assigns probability one to the atom 1. A sample of size n from either p or p′ is 1^n with probability at least 1 − 1/N^{1/4}, no matter what the underlying source is; therefore, we cannot distinguish between these sources with probability 1 − 1/N^{1/4} from a sample of size n.

However, a sample of size N from p has Θ(N^{1/4}) distinct symbols on average, while a sample from p′ will contain only one element. It follows that if all we know is that the unknown distribution comes from {p, p′}, then, with high probability under the unknown source, we cannot predict from a sample of length n whether the number of symbols in a sample of size N will remain one or not. Furthermore, by changing the ratio of n and N (and, therefore, the probability of the symbol 1 under p), we can make the expected number of symbols in a length-N sample under p as large as we want.
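A small simulation (our own code) of Example 1 with n = 100, hence N = n^2 = 10,000, illustrates the point: length-n samples from p and p′ are almost always indistinguishable, while a length-N sample from p already contains on the order of N^{1/4} = 10 distinct symbols.

import random

n = 100
N = n * n
rng = random.Random(0)

def sample_p(size):
    """p: the atom 1 with probability 1 - N^(-3/4); the rest spread evenly over {2, ..., N^2 + 1}."""
    rare = N ** (-0.75)
    return [1 if rng.random() > rare else rng.randint(2, N * N + 1) for _ in range(size)]

def sample_p_prime(size):
    return [1] * size                                     # p' puts all of its mass on the atom 1

print("size-n sample from p  is all ones?", set(sample_p(n)) == {1})          # True with prob. about 0.9
print("size-n sample from p' is all ones?", set(sample_p_prime(n)) == {1})    # always True
print("distinct symbols in a size-N sample from p:", len(set(sample_p(N))))   # around N**0.25 + 1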
However, it is possible to impose restrictions on the class of distributions that allow us to ensure that we can predict the number of symbols in longer samples. In future work, we will borrow from the data-derived consistency formulations of [44] to characterize when we will be able to predict the number of symbols in longer samples.