1. Introduction
A number of statistical inference problems of significant contemporary interest, such as text classification, language modeling and DNA microarray analysis, are what are called large alphabet problems. They require inference on sequences of symbols where the symbols come from a set (alphabet) whose size is comparable to, or even larger than, the sequence length. For instance, language models for speech recognition estimate distributions over English words using text samples much smaller than the English vocabulary.
An abstraction behind several of these problems is universal compression over large alphabets. The general idea here is to model the problem at hand with a collection of models instead of a single distribution. The model underlying the data is assumed or known to belong to a collection 𝒫, but the exact identity of the model remains unknown. Instead, we aim to use a universal description of the data.
The universal description uses more bits on average (averaged over the random sample) than if the underlying model were known, and the additional number of bits used by the universal description is called the redundancy against the true model. The average excess bits over the entropy of the true model will be referred to as the model redundancy for that model. Since one does not know the true model in general, a common approach is to consider the collection redundancy, or simply redundancy, which is the supremum of the model redundancy taken over all models of the collection.
Typically, we look at sequences of i.i.d. symbols; therefore, we usually refer to the redundancy of distributions over length n sequences obtained by i.i.d. sampling from distributions in 𝒫. The length n of the sequences considered will typically be referred to as the sample size.
The nuances of prediction, compression or estimation where the alphabet size and sample size are roughly equal are not well captured by studying a collection over a finite alphabet as the sample size is increased to infinity. Rather, they are better captured when we begin with a countably infinite support and let the sample size approach infinity, or when we let the alphabet size scale as a function of the sample size. However, the collection of all i.i.d. distributions over countably infinite supports has infinite redundancy, rendering most estimation or prediction problems impossible. Therefore, there are several alternative formulations to tackle language modeling, classification and estimation questions over large alphabets.
Patterns: One line of work is the patterns approach [1], which considers the compression of the pattern of a sequence rather than the sequence itself. Patterns abstract the identities of symbols and indicate only the relative order of appearance. For example, the pattern of TATTLE is 121134, while that of HONOLULU is 12324545. The point to note is that patterns of length n i.i.d. sequences can be compressed (no matter what the underlying countably infinite alphabet is) with redundancy that grows sublinearly in n [1]; therefore, the excess bits needed to describe patterns are asymptotically vanishing per symbol encoded. Indeed, insights learned in this line of work will be used to understand the compression of sequences, as well, in this paper.
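As an aside, the pattern of a sequence is straightforward to compute. The following Python sketch (illustrative only; the function name is ours) indexes each symbol by its order of first appearance and reproduces the two examples above.

```python
def pattern(seq):
    """Return the pattern of seq: each symbol is replaced by the index
    (starting at 1) of its first appearance, so symbol identities are
    abstracted away and only the order of appearance remains."""
    first_seen = {}                      # symbol -> index of first appearance
    out = []
    for s in seq:
        if s not in first_seen:
            first_seen[s] = len(first_seen) + 1
        out.append(first_seen[s])
    return out

# The examples from the text:
assert pattern("TATTLE") == [1, 2, 1, 1, 3, 4]          # i.e., 121134
assert pattern("HONOLULU") == [1, 2, 3, 2, 4, 5, 4, 5]  # i.e., 12324545
```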
Envelope on Model Classes: A second line of work considers restricted model classes for applications, particularly those where the collection of models can be described in terms of an envelope [2]. This approach leads to an understanding of worst-case formulations. In particular, we are interested in the result that if the worst-case regret (a different and more stringent formulation than the redundancy described here) of describing a single sample is finite, then the per-symbol redundancy diminishes to zero. We will interpret this result towards the end of the Introduction. While envelope classes are usually chosen so that they are compressible in the worst case, a natural extension is the possibility of choosing classes that are only average-case, but not worst-case, compressible. For this, we need to understand how the single-letter average-case redundancy of a class influences the redundancy of compressing strings sampled i.i.d. from distributions in the class, which is the focus of this paper.
Data-derived Consistency: A third line of work forgoes the uniform convergence framework underlying redundancy or regret formulations. This is useful for large or infinite alphabet model collections that have poor or no redundancy guarantees, and it asks a question that cannot be answered with the approaches above. In this line of work, one obtains results on the model redundancy described above instead of the (collection) redundancy. For example, a model collection is said to be weakly compressible if there is a universal measure that ensures that, for all models, the model redundancy normalized by the sample size (per-symbol) diminishes to zero. The rate at which the per-symbol model redundancy diminishes to zero depends on the underlying model and for some models could be arbitrarily slower than for others. Hence, for a given block length n, however large, there may be no non-trivial guarantee that holds over the entire model collection, unlike in the redundancy formulation.
However, if we add the additional constraint that we should estimate the rate of convergence from the data, we get the data-derived consistency formulations in [3]. Fundamental to further research in this direction is a better understanding of how the single-letter redundancy (that of 𝒫) relates to the redundancy of length n strings (that of 𝒫ⁿ). The primary theme of this paper is to collect such results on the redundancy of classes over countably infinite support.
In the fixed alphabet setting, this connection is well understood. If the alphabet has size k, the redundancy of 𝒫 is easily seen to be always finite (in fact, ≤ log k), and that of 𝒫ⁿ scales as ((k − 1)/2) log n. However, when 𝒫 does not have finite support, the above bounds are meaningless.
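For intuition, the ((k − 1)/2) log n growth is easy to observe numerically through the worst-case (Shtarkov) regret of the fixed-alphabet i.i.d. class, which upper bounds the redundancy and grows at the same rate. A minimal Python sketch (our own illustration, with k = 3 assumed):

```python
import math

def shtarkov_bits(n, k=3):
    """log2 of the Shtarkov sum for length-n i.i.d. strings over k symbols:
    sum over count vectors (n_1, ..., n_k) of multinomial(n; counts) times
    prod_a (n_a/n)^{n_a}, the maximum-likelihood probability of the type."""
    total = 0.0
    for n1 in range(n + 1):
        for n2 in range(n - n1 + 1):
            counts = (n1, n2, n - n1 - n2)
            log_term = math.lgamma(n + 1)        # log n!
            for c in counts:
                log_term -= math.lgamma(c + 1)   # minus log n_a!
                if c:
                    log_term += c * math.log(c / n)
            total += math.exp(log_term)
    return math.log2(total)

for n in (10, 100, 1000):
    print(f"n={n:5d}  worst-case regret = {shtarkov_bits(n):7.3f} bits,  "
          f"(k-1)/2 * log n = {(3 - 1) / 2 * math.log2(n):7.3f} bits")
```

Both columns grow at the same ((k − 1)/2) log n rate, differing only in an additive constant.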
Redundancy Capacity Theorem: On the other hand, the redundancy of a collection 𝒫 over a countably infinite support may be infinite. In this paper, we let ℤ+ = {1, 2, 3, ...} be the set of positive integers and ℕ = {0, 1, 2, ...} be the set of non-negative integers. However, what about the case where the redundancy of a collection 𝒫 over ℤ+ is finite? Now, a well-known redundancy-capacity argument [4] can be used to interpret the redundancy, equating it to the amount of information we can get about the source from the data. In this case, finite (respectively, infinite) redundancy of 𝒫 implies that a single symbol contains a finite (respectively, infinite) amount of information about the model.
The natural question then is the following. If a collection 𝒫 over ℤ+ has finite redundancy, does it imply that the redundancy of length n i.i.d. strings from 𝒫 grows sublinearly? Equivalently, do finite redundancy collections behave similarly to their fixed alphabet counterparts? If true, roughly speaking, such a result would inform us that as the universal encoder sees more and more of the sequence, it learns less and less about the underlying model. This would be in line with our intuition that seeing more data pins down the model; therefore, the more data we have already seen, the less there is to learn. Yet, as we will show, that is not the case.
Results: To understand these nuances, we first show that if the redundancy of a collection 𝒫 of distributions over ℤ+ is finite, then 𝒫 is tight. This turns out to be a useful tool to check if the redundancy is finite in [3], for example.
However, in a departure from worst-case regret formulations, such as in [2], we demonstrate that it is possible for a class 𝒫 to have finite redundancy, yet the asymptotic per-symbol redundancy of strings sampled i.i.d. from 𝒫 is bounded away from zero. Therefore, roughly speaking, no matter how much of the sequence the universal encoder has seen, it learns at least a constant number of bits about the underlying model each time it sees an additional symbol. No matter how much data we see, there is more to learn about the underlying model! We finally obtain a sufficient condition on a class 𝒫, such that the asymptotic per-symbol redundancy of length n i.i.d. strings diminishes to zero.
3. Redundancy and Tightness
We focus on the single-letter redundancy in this section and explore the connections between the single-letter redundancy of a collection 𝒫 and the tightness of 𝒫.
Lemma 1. A collection 𝒫 over ℕ with bounded single-letter redundancy is tight. Namely, if the single-letter redundancy of 𝒫 is finite, then for any γ > 0, there is a finite N_γ, such that:

sup_{p∈𝒫} p({x : x > N_γ}) ≤ γ.
Proof. Since 𝒫 has bounded single-letter redundancy, fix a distribution q over ℕ, such that:

sup_{p∈𝒫} D(p‖q) < ∞.

We define R ≝ sup_{p∈𝒫} D(p‖q), where D(p‖q) is the Kullback–Leibler distance between p and q. We will first show that for all p ∈ 𝒫 and any m > 0,

p({x : p(x) ≥ 2^m q(x)}) ≤ (R + (2 log e)/e)/m.   (2)
To see Equation (2), let S be the set of all x ∈ ℕ, such that p(x) < q(x). A well-known convexity argument shows that the partial contribution to the KL divergence from S satisfies:

∑_{x∈S} p(x) log (p(x)/q(x)) ≥ −(2 log e)/e,

and hence:

∑_{x∉S} p(x) log (p(x)/q(x)) ≤ R + (2 log e)/e.

Then, Equation (2) follows by a simple application of Markov's inequality, since every x with p(x) ≥ 2^m q(x) contributes at least m p(x) to the left side above.
We will now use Equation (2) to complete the proof of the lemma. Let m* be the smallest integer, such that (R + (2 log e)/e)/m* < γ/2, and let N_γ be any number, such that q({x : x > N_γ}) ≤ γ/2^{m*+1}. Specifically, we will show that for all γ > 0,

sup_{p∈𝒫} p({x : x > N_γ}) ≤ γ.

Equivalently, for all γ > 0 and p ∈ 𝒫, we show that:

p({x : x > N_γ}) ≤ γ.
We prove the above by partitioning q′s tail, namely the numbers {x : x > N_γ}, into two parts.
- (i) the set W_1 ≝ {x : x > N_γ and p(x) ≥ 2^{m*} q(x)}. Clearly:

W_1 ⊆ {x : p(x) ≥ 2^{m*} q(x)},

and thus:

p(W_1) ≤ (R + (2 log e)/e)/m* < γ/2,

where the left inequality follows from Equation (2) (applied with m = m*) and the right one from the definition of m*.
- (ii) the set W_2 ≝ {x : x > N_γ and p(x) < 2^{m*} q(x)}. Clearly:

W_1 ∪ W_2 = {x : x > N_γ},

and therefore:

p({x : x > N_γ}) = p(W_1) + p(W_2).

By definition, all x ∈ W_2 satisfy p(x) < 2^{m*} q(x), or that p(x) ≤ q(x) 2^{m*}. Hence, we have:

p(W_2) ≤ 2^{m*} q({x : x > N_γ}) ≤ 2^{m*} γ/2^{m*+1} = γ/2.
The lemma follows.
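To make Equation (2) concrete, the following sketch checks the bound numerically for an assumed toy "collection" consisting of a single geometric p, with q uniform over a truncated support; both choices are illustrative assumptions, not part of the proof.

```python
import math

# Assumed toy setup: the "collection" is a single geometric p, and q is
# uniform on the truncated support {1, ..., N}.  R then plays the role of
# sup_{p in P} D(p || q), which here is just D(p || q).
N = 1000
p = [0.5 ** x for x in range(1, N + 1)]   # p(x) = 2^{-x}
q = [1.0 / N] * N                         # q(x) = 1/N

R = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

for m in (2, 5, 8, 10):
    lhs = sum(pi for pi, qi in zip(p, q) if pi >= 2 ** m * qi)
    rhs = (R + 2 * math.log2(math.e) / math.e) / m
    assert lhs <= rhs
    print(f"m={m:2d}  p({{x : p(x) >= 2^m q(x)}}) = {lhs:.4f} <= {rhs:.4f}")
```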
The converse is not necessarily true. Tight collections need not have finite single-letter redundancy, as the following example demonstrates.
Construction: Consider the following collection 𝒫 of distributions over ℤ+. First, partition the set of positive integers into the sets T_i, i ∈ ℕ, where:

T_i ≝ {2^i, 2^i + 1, ..., 2^{i+1} − 1}.

Note that |T_i| = 2^i. Now, 𝒫 is the collection of all possible distributions that can be formed as follows: for all i ∈ ℕ, pick exactly one element of T_i and assign probability 1/((i + 1)(i + 2)) to the chosen element of T_i. (Choosing the support as above implicitly assumes the axiom of choice.) Note that the set 𝒫 is uncountably infinite.
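For concreteness, a member of 𝒫 can be generated explicitly. The sketch below (an illustration; the truncation at i_max is ours, since the actual support is infinite) picks one element from each T_i and verifies the telescoping of the probabilities 1/((i + 1)(i + 2)).

```python
import random

def sample_member(i_max, seed=0):
    """Pick one element x_i from each T_i = {2^i, ..., 2^{i+1} - 1} and give
    it probability 1/((i+1)(i+2)).  Truncation at i_max keeps the example
    finite; the untruncated construction sums to exactly one."""
    rng = random.Random(seed)
    return {rng.randrange(2 ** i, 2 ** (i + 1)): 1 / ((i + 1) * (i + 2))
            for i in range(i_max + 1)}

p = sample_member(i_max=30)
# Telescoping: sum_{i=0}^{m} 1/((i+1)(i+2)) = 1 - 1/(m+2), so the mass
# captured up to T_30 is 1 - 1/32, and every tail {x >= 2^m} has mass
# exactly 1/(m+1) -- uniformly over all members, which is tightness.
assert abs(sum(p.values()) - (1 - 1 / 32)) < 1e-12
```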
Corollary 2. The set of distributions 𝒫 is tight.

Proof. For all p ∈ 𝒫 and all m ∈ ℕ,

p({x : x ≥ 2^m}) = ∑_{i≥m} 1/((i + 1)(i + 2)) = 1/(m + 1),

namely, all tails are uniformly bounded over the collection 𝒫. Put another way, for all δ > 0 and all distributions p ∈ 𝒫,

p({x : x ≥ 2^{⌈1/δ⌉}}) ≤ δ.
On the other hand:
Proposition 1. The collection 𝒫 does not have finite redundancy.

Proof. Suppose q is any distribution over ℤ+. We will show that there exists p ∈ 𝒫, such that:

D(p‖q) = ∑_x p(x) log (p(x)/q(x))

is not finite. Since the entropy of every p ∈ 𝒫 is finite, we just have to show that for any distribution q over ℤ+, there exists p ∈ 𝒫, such that:

∑_x p(x) log (1/q(x))

is not finite.
Consider any distribution q over ℤ+. Observe that for all i, |T_i| = 2^i. It follows that for all i, there is x_i ∈ T_i, such that:

q(x_i) ≤ 1/2^i.

However, by construction, 𝒫 contains a distribution p* that has for its support precisely the set {x_i : i ∈ ℕ} identified above. Furthermore, p* assigns:

p*(x_i) = 1/((i + 1)(i + 2)).

Therefore,

∑_i p*(x_i) log (1/q(x_i)) ≥ ∑_i i/((i + 1)(i + 2)),

and the sum on the right diverges. The KL divergence from p* to q is therefore not finite, and the proposition follows, since q is arbitrary.
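The pigeonhole step above is easy to simulate. The sketch below assumes, for illustration, the particular competitor q(x) ∝ 1/x²; it locates in each T_i an element x_i with q(x_i) ≤ 2^{−i} and shows the partial sums of ∑_i p*(x_i) log (1/q(x_i)) growing without bound.

```python
import math

c = 6 / math.pi ** 2            # normalizer: sum_{x >= 1} 1/x^2 = pi^2/6

def q(x):                       # the assumed competitor distribution
    return c / x ** 2

total = 0.0
for i in range(50):
    # q gives T_i total mass at most one and |T_i| = 2^i, so some element
    # of T_i has q-probability at most 2^{-i}; for this particular q, the
    # minimum over T_i sits at its largest element.
    x_i = 2 ** (i + 1) - 1
    assert q(x_i) <= 2 ** (-i)
    # p* places mass 1/((i+1)(i+2)) on x_i; its coding cost under q is:
    total += (1 / ((i + 1) * (i + 2))) * math.log2(1 / q(x_i))
    if i in (1, 9, 29, 49):
        print(f"partial sum through T_{i}: {total:6.2f} bits")
# The i-th term behaves like 2/(i+2), a divergent series: D(p*||q) = infinity.
```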
4. Length n Redundancy
We study how the single-letter properties of a collection 𝒫 of distributions influence the compression of length n strings obtained by i.i.d. sampling from distributions in 𝒫. Namely, we try to characterize when the length n redundancy of 𝒫^∞ grows sublinearly in the block length n.
Lemma 3. Let 𝒫 be a collection of distributions over a countable support 𝒳. For some m ∈ ℤ+, consider m pairwise disjoint subsets S_i ⊂ 𝒳 (1 ≤ i ≤ m), and let δ > 1/2. If there exist p_1, ..., p_m ∈ 𝒫, such that:

p_i(S_i) ≥ δ for all 1 ≤ i ≤ m,

then for all distributions q over 𝒳,

max_{1≤i≤m} D(p_i‖q) ≥ δ log m − h(δ),

where h denotes the binary entropy function. In particular, if there are an infinite number of disjoint sets S_i, i ∈ ℤ+, and distributions p_i ∈ 𝒫, such that p_i(S_i) ≥ δ, then the redundancy is infinite.
Proof. This is a simplified formulation of the distinguishability concept in [4]. For a proof, see, e.g., [12].
4.1. Counterexample
We now show that it is possible for the single-letter redundancy of a collection of distributions to be finite, yet the asymptotic per-symbol redundancy (the length n redundancy of 𝓑^∞ normalized by n) remains bounded away from zero as the block length goes to infinity. To show this, we obtain such a collection 𝓑.

Construction: As before, partition the set ℤ+ into T_i = {2^i, ..., 2^{i+1} − 1}, i ∈ ℕ. Recall that T_i has 2^i elements. For all 0 < ∊ ≤ 1, let:

n_∊ ≝ ⌈1/∊⌉.

Let 1 ≤ j ≤ 2^{n_∊}, and let p_{∊,j} be a distribution on ℤ+ that assigns probability 1 − ∊ to the number one (or, equivalently, to the set T_0) and ∊ to the j-th smallest element of T_{n_∊}, namely the number 2^{n_∊} + j − 1. 𝓑 (mnemonic for binary, since every distribution has a support of size two) is the collection of distributions p_{∊,j} for all 0 < ∊ ≤ 1 and 1 ≤ j ≤ 2^{n_∊}. 𝓑^∞ is the set of measures over infinite sequences of numbers corresponding to i.i.d. sampling from 𝓑.
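Concretely, each member of 𝓑 is a two-point distribution. The helper below (ours, for illustration; it follows the reconstruction n_∊ = ⌈1/∊⌉ above) builds p_{∊,j} explicitly.

```python
import math

def p_eps_j(eps, j):
    """The two-point distribution p_{eps,j}: mass 1 - eps on the number one
    and mass eps on the j-th smallest element of T_{n_eps}."""
    assert 0 < eps <= 1
    n_eps = math.ceil(1 / eps)
    assert 1 <= j <= 2 ** n_eps
    return {1: 1 - eps, 2 ** n_eps + j - 1: eps}

p = p_eps_j(eps=0.25, j=5)      # n_eps = 4, so the spike sits at 2^4 + 4 = 20
assert p == {1: 0.75, 20: 0.25}
# Every member has a two-point support, so its entropy is at most one bit;
# the entropies over B are uniformly bounded even though B is "spread out".
```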
We first verify that the single-letter redundancy of 𝓑 is finite.
Proposition 2. Let q be a distribution that assigns q(1) = 1/2 and, for all i ≥ 1 and all j ∈ T_i,

q(j) = 1/(2^{i+1} i(i + 1)).

Then, for every p_{∊,j} ∈ 𝓑:

D(p_{∊,j}‖q) = (1 − ∊) log (2(1 − ∊)) + ∊ log (∊ 2^{n_∊+1} n_∊(n_∊ + 1)) ≤ 1 + 3 + 2 log 3 < 8,

where we used n_∊ ≤ 1/∊ + 1. Hence, the single-letter redundancy of 𝓑 is finite.
However, the redundancy of compressing length n sequences from 𝓑^∞ scales linearly with n.
Proposition 3. For all n ∈ ℤ+,

R_n(𝓑^∞) ≥ (1 − 1/e) n − 1.

Proof. Let {1^n} denote the set containing only the length n sequence of all ones. For all n, define 2^n pairwise disjoint subsets S_i of ℤ+^n, 1 ≤ i ≤ 2^n, where:

S_i ≝ {1, 2^n + i − 1}^n − {1^n}

is the set of all length n strings containing at most two distinct numbers (one and 2^n + i − 1) and at least one occurrence of 2^n + i − 1. Clearly, for distinct i and j between one and 2^n, S_i and S_j are disjoint. Furthermore, since n_{1/n} = n, the i.i.d. measure p^n_{1/n,i} induced by p_{1/n,i} ∈ 𝓑 assigns S_i the probability:

p^n_{1/n,i}(S_i) = 1 − (1 − 1/n)^n ≥ 1 − 1/e.

From Lemma 3, it follows that the length n redundancy of 𝓑^∞ is lower bounded by:

(1 − 1/e) log 2^n − h(1 − 1/e) ≥ (1 − 1/e) n − 1.
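Numerically, the separation required by Lemma 3 is comfortably met. A quick check of p^n_{1/n,i}(S_i) = 1 − (1 − 1/n)^n (illustrative only):

```python
import math

# p^n_{1/n,i}(S_i) = 1 - (1 - 1/n)^n: the chance that n i.i.d. draws from
# p_{1/n,i} include at least one spike symbol 2^n + i - 1 (the draws always
# stay inside {1, 2^n + i - 1}, the support of p_{1/n,i}).
for n in (1, 2, 10, 100, 1000):
    prob = 1 - (1 - 1 / n) ** n
    assert prob >= 1 - 1 / math.e        # the delta > 1/2 needed in Lemma 3
    print(f"n={n:5d}  p^n(S_i) = {prob:.4f}")   # decreases toward 1 - 1/e
```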
In a preview of what is to come, we notice that though the single-letter redundancy of the class 𝓑 over ℤ+ is finite, the single-letter tail redundancy, as described in the equation below, does not diminish to zero; namely, for all M:

inf_q sup_{p∈𝓑} ∑_{x>M} p(x) log (1/q(x)) ≥ 1.
In fact, in the next section, we relate the diminishing of the single-letter tail redundancy above to the sublinear growth of the i.i.d. length n redundancy.
4.2. Sufficient Condition
In this section, we show a sufficient condition on the single-letter marginals of 𝒫 and its redundancy that ensures that the i.i.d. length n redundancy of 𝒫^∞ grows sublinearly with n. This condition is, however, not necessary, and the characterization of a condition that is both necessary and sufficient remains open.
For all ∊ > 0, let A_{p,∊} be the set of all elements in the support of p with probability ≥ ∊, and let T_{p,∊} = ℤ+ − A_{p,∊}. Let G_0 = {ϕ}, where ϕ denotes the empty string. For all i, the sets:

G_i ≝ {x^i ∈ ℤ+^i : A_{p,δ_i} ⊆ {x_1, ..., x_i}},  with δ_i ≝ min{1, 2 log(i + 1)/i},

collect the length i strings containing every symbol whose probability under p is at least δ_i, where, in a minor abuse of notation, we use {x_1, ..., x_i} to denote the set of distinct symbols in the string x^i. Let B_0 = {}, and let B_i ≝ ℤ+^i − G_i. Observe from an argument similar to the coupon collector problem that p(B_i) diminishes to zero as i grows; more precisely:
Lemma 4. For all i ≥ 2,

p(B_{i−1}) ≤ (1/δ_{i−1}) (1 − δ_{i−1})^{i−1}.

Proof. The proof follows from an elementary union bound: a string x^{i−1} lies in B_{i−1} only if at least one symbol of A_{p,δ_{i−1}} fails to appear in it; there are at most 1/δ_{i−1} such symbols, and each one fails to appear in i − 1 i.i.d. draws with probability at most (1 − δ_{i−1})^{i−1}.
Theorem 5. Suppose 𝒫 is a collection of distributions over ℤ+. Let the entropy be uniformly bounded over the entire collection, and in addition, let the redundancy of the collection be finite. Namely, we will denote:

Ĥ ≝ sup_{p∈𝒫} H(p) < ∞  and  R ≝ inf_q sup_{p∈𝒫} D(p‖q) < ∞.

Recall that for any distribution p, the set T_{p,δ} denotes the elements of the support of p, all of whose probabilities are < δ. Let q1 be a distribution over ℤ+, such that:

lim_{δ→0} sup_{p∈𝒫} ∑_{x∈T_{p,δ}} p(x) log (1/q1(x)) = 0.   (3)

Then, the redundancy of length n distributions obtained by i.i.d. sampling from distributions in 𝒫, denoted by R_n(𝒫^∞), grows sublinearly:

lim_{n→∞} R_n(𝒫^∞)/n = 0.
Remark. If the conditions of the theorem are met, we can always assume, without loss of generality, that there is a distribution q1 that satisfies (3) and simultaneously has finite redundancy. To see this, suppose q′ satisfies the finite-redundancy condition, namely:

sup_{p∈𝒫} D(p‖q′) < ∞,

while a different distribution q″ satisfies the second, tail-redundancy condition,

lim_{δ→0} sup_{p∈𝒫} ∑_{x∈T_{p,δ}} p(x) log (1/q″(x)) = 0.

It is easy to verify that the distribution q that assigns to any x ∈ ℤ+,

q(x) = (q′(x) + q″(x))/2,

satisfies both conditions simultaneously, since log (1/q(x)) exceeds neither log (1/q′(x)) nor log (1/q″(x)) by more than one bit.
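The one-bit overhead claimed for the mixture is a pointwise fact about log (1/q), easily spot-checked (an illustration, not a proof):

```python
import math

# Mixing two codes costs at most one extra bit pointwise: with
# q = (q' + q'')/2, we have q(x) >= max(q'(x), q''(x))/2, and hence
# log2(1/q(x)) <= 1 + min(log2(1/q'(x)), log2(1/q''(x))).
for a, b in ((0.5, 1e-9), (1e-6, 1e-3), (0.25, 0.25)):
    mix = (a + b) / 2
    assert math.log2(1 / mix) <= 1 + min(math.log2(1 / a), math.log2(1 / b))
```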
Proof. In what follows, x^i represents a string x_1, ..., x_i, and x^0 denotes the empty string. For all n, we denote Ψ(x^n) = ψ_1, ..., ψ_n and Ψ(X^n) = Ψ_1, ..., Ψ_n; we write 1{·} for the indicator function and call X_i new if X_i ∉ {X_1, ..., X_{i−1}}.

We will construct q, such that sup_{p∈𝒫} D(p^n‖q)/n → 0. Recall that q_Ψ is the optimal universal pattern encoder over patterns of i.i.d. sequences defined in Section 2.2, whose redundancy R^Ψ_n against patterns of length n i.i.d. sequences grows sublinearly in n [1]. Furthermore, recall that the redundancy of 𝒫 is finite and that q1 is the universal distribution over ℤ+ that attains redundancy R for 𝒫; by the remark above, we take q1 to satisfy (3) and D(p‖q1) ≤ R + 1 for all p ∈ 𝒫 simultaneously.

The universal encoder q is now defined as follows:

q(x^n) ≝ q_Ψ(Ψ(x^n)) ∏_{i : x_i new} q1(x_i).

Equivalently, we define for all x^{i−1} ∈ ℤ+^{i−1} and all ψ_i, such that ψ^{i−1} = Ψ(x^{i−1}),

q(x_i|x^{i−1}) ≝ q_Ψ(ψ_i|ψ^{i−1}) if x_i ∈ {x_1, ..., x_{i−1}},  and  q(x_i|x^{i−1}) ≝ q_Ψ(ψ_i|ψ^{i−1}) q1(x_i) otherwise.

Namely, we use an optimal universal pattern encoder over patterns of i.i.d. sequences and encode any new symbol using a universal distribution over ℤ+. We now bound the redundancy of q as defined above. We have for all p ∈ 𝒫,

E log (p(X^n)/q(X^n)) = E log (p(X^n)/q_Ψ(Ψ(X^n))) + ∑_{i=1}^n E [1{X_i new} log (1/q1(X_i))].

Since ψ_1 is always one, p(ψ_1) = q_Ψ(ψ_1) = 1, and X_1 is always new; its cost is a single-letter term bounded uniformly over 𝒫:

E log (1/q1(X_1)) = D(p‖q1) + H_p ≤ R + 1 + Ĥ.

The first term, normalized by n, can be upper bounded as follows:

(1/n) E log (p(X^n)/q_Ψ(Ψ(X^n))) = (1/n) (−n H_p + E log (1/q_Ψ(Ψ(X^n)))) ≤ (1/n) (−n H_p + H(Ψ(X^n)) + R^Ψ_n) ≤ R^Ψ_n/n,

where we define H_p as:

H_p ≝ ∑_x p(x) log (1/p(x)),

and the last inequality follows, since the pattern is a function of the sequence:

H(Ψ(X^n)) ≤ H(X^n) = n H_p.

As R^Ψ_n grows sublinearly, this term vanishes when normalized by n, uniformly over 𝒫. Now for i ≥ 2,

E [1{X_i new} log (1/q1(X_i))] ≤ E [1{X^{i−1} ∈ G_{i−1}} 1{X_i new} log (1/q1(X_i))] + E [1{X^{i−1} ∈ B_{i−1}} log (1/q1(X_i))].

Then, since on G_{i−1} every symbol of A_{p,δ_{i−1}} has already appeared, a new symbol X_i necessarily lies in T_{p,δ_{i−1}}; and since X_i is independent of X^{i−1},

E [1{X_i new} log (1/q1(X_i))] ≤ ∑_{x∈T_{p,δ_{i−1}}} p(x) log (1/q1(x)) + p(B_{i−1}) (R + 1 + Ĥ).

We have split the length i − 1 sequences into the sets G_{i−1} and B_{i−1} and used separate bounds on each set that hold uniformly over the entire model collection. From Condition (3) of the theorem, since δ_i → 0 as i → ∞, we have that:

lim_{i→∞} sup_{p∈𝒫} ∑_{x∈T_{p,δ_i}} p(x) log (1/q1(x)) = 0.

Therefore, we have:

lim_{n→∞} (1/n) ∑_{i=2}^n sup_{p∈𝒫} ∑_{x∈T_{p,δ_{i−1}}} p(x) log (1/q1(x)) (a)= lim_{i→∞} sup_{p∈𝒫} ∑_{x∈T_{p,δ_i}} p(x) log (1/q1(x)) = 0.

The equality (a) follows from Cesàro's lemma, asserting that for any sequence {a_i, i ∈ ℤ+} with a_i < ∞ for all i, if lim_{i→∞} a_i exists, then:

lim_{n→∞} (1/n) ∑_{i=1}^n a_i = lim_{i→∞} a_i.

For the second term, observe that:

E [1{X^{i−1} ∈ B_{i−1}} log (1/q1(X_i))] = p(B_{i−1}) E log (1/q1(X_i)) ≤ p(B_{i−1}) (R + 1 + Ĥ),

and, furthermore,

p(B_{i−1}) ≤ (1/δ_{i−1}) (1 − δ_{i−1})^{i−1},

where the last inequality is from Lemma 4; its right side diminishes to zero as i → ∞, uniformly over 𝒫. Therefore, as before, by Cesàro's lemma,

lim_{n→∞} (1/n) ∑_{i=2}^n p(B_{i−1}) (R + 1 + Ĥ) = 0,

as well. Combining the three contributions, sup_{p∈𝒫} D(p^n‖q)/n → 0, and the theorem follows.
A few comments about the conditions in Theorem 5 are in order. Neither condition automatically implies the other. The set 𝓑 of distributions in Section 4.1 is an example where every distribution has finite entropy (indeed, the entropies are uniformly bounded) and the redundancy of 𝓑 is finite, yet the tail condition (3) fails and the per-symbol redundancy of 𝓑^∞ stays bounded away from zero. We will now construct another set 𝒬 of distributions, such that every distribution in 𝒬 has finite entropy and the redundancy of 𝒬 is finite, while the requirements of Theorem 5 again do not all hold. At the same time, the length n redundancy of 𝒬^∞ grows only sublinearly. This is therefore also an example showing that the conditions in Theorem 5 are only sufficient, but, in fact, not necessary. It is yet open to find a condition on single-letter marginals that is both necessary and sufficient for the asymptotic per-symbol redundancy to diminish to zero.
Construction: 𝒬 is a countable collection of distributions p_k, k ∈ ℤ+, on ℕ, where:

p_k(0) = 1 − 1/k²,  and  p_k(x) = 1/(k² 2^{k³}) for all x ∈ T_{k³}.

The entropy of p_k ∈ 𝒬 is therefore h(1/k²) + k, finite for every k, though not uniformly bounded over 𝒬. Note that the redundancy of 𝒬 is finite, too. To see this, first note that, the supports T_{k³} being disjoint:

∑_{x∈ℤ+} sup_{k∈ℤ+} p_k(x) = ∑_{k∈ℤ+} 1/k² < ∞.   (5)

Now, letting:

R+ ≝ log (∑_{x∈ℤ+} sup_{k∈ℤ+} p_k(x)),

observe that the distribution:

q(0) = 1/2,  and  q(x) = sup_{k∈ℤ+} p_k(x)/2^{R+ + 1} for all x ∈ ℤ+,

satisfies for all p_k ∈ 𝒬:

D(p_k‖q) ≤ R+ + 2,

implying that the redundancy of 𝒬 is ≤ R+ + 2. Furthermore, Equation (5) implies that the worst-case regret is finite, and from [2], the length n redundancy of 𝒬^∞ grows sublinearly. Now, pick an integer m ∈ ℤ+. We have for all p ∈ 𝒬,

p({x : x ≥ 2^{m³}}) ≤ 1/m²,

yet, for all k ≥ m and any distribution q over ℤ+, by convexity,

∑_{x∈T_{k³}} p_k(x) log (1/q(x)) ≥ (1/k²) log (2^{k³}/q(T_{k³})) ≥ k,

and the elements of T_{k³} all exceed 2^{m³}; the tail-redundancy condition (3) therefore fails. Thus, the per-symbol length n redundancy of 𝒬^∞ diminishes to zero, while 𝒬 does not satisfy all of the requirements of Theorem 5. Therefore, the conditions of Theorem 5 are only sufficient, not necessary.