Article

On the Impossibility of Learning the Missing Mass

by Elchanan Mossel 1,† and Mesrob I. Ohannessian 2,*,†

1 Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
2 Toyota Technological Institute at Chicago, Chicago, IL 60637, USA
* Author to whom correspondence should be addressed.
† This work was conducted when both authors were visiting the Information Theory Program, 13 January–15 May 2015, at the Simons Institute for the Theory of Computing, University of California, Berkeley.
Entropy 2019, 21(1), 28; https://doi.org/10.3390/e21010028
Submission received: 18 June 2018 / Revised: 25 November 2018 / Accepted: 6 December 2018 / Published: 2 January 2019
(This article belongs to the Special Issue Entropy and Information Inequalities)

Abstract

This paper shows that one cannot learn the probability of rare events without imposing further structural assumptions. The event of interest is that of obtaining an outcome outside the coverage of an i.i.d. sample from a discrete distribution. The probability of this event is referred to as the “missing mass”. The impossibility result can then be stated as: the missing mass is not distribution-free learnable in relative error. The proof is semi-constructive and relies on a coupling argument using a dithered geometric distribution. Via a reduction, this impossibility also extends to both discrete and continuous tail estimation. These results formalize the folklore that in order to predict rare events without restrictive modeling, one necessarily needs distributions with “heavy tails”.

1. Introduction

Given data consisting of $n$ i.i.d. observations $X_1, \ldots, X_n$ from an unknown distribution $p$ over the positive integers $\mathbb{N}_+$, we traditionally compute the empirical distribution:
$$\hat{p}_n(x) := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i = x\}.$$
To estimate the probability $p(E) := \sum_{x \in E} p(x)$ of an event $E \subset \mathbb{N}_+$, we could use $\hat{p}_n(E) := \sum_{x \in E} \hat{p}_n(x)$. This works well for abundantly represented events, but not as well for rare events. An unequivocally rare event is the set of symbols that are missing in the data,
$$E_n := \{x \in \mathbb{N}_+ : \hat{p}_n(x) = 0\}.$$
The probability of this (random) event is denoted by the missing mass:
$$M_n(X_1, \ldots, X_n) := p(E_n) = \sum_{x \in \mathbb{N}_+} p(x)\, \mathbf{1}\{\hat{p}_n(x) = 0\}.$$
The question we strive to answer in this paper is: “Can we learn the missing mass when $p$ is an arbitrary distribution on $\mathbb{N}_+$?” Definition 1 phrases this precisely in the learning framework.
Definition 1.
An estimator is a sequence of functions $\hat{M}_n(x_1, \ldots, x_n) : \mathbb{N}_+^n \to [0,1]$. We say that an estimator learns the missing mass in relative error with respect to a family $\mathcal{P}$ of distributions, if for every $p \in \mathcal{P}$ and every $\epsilon, \delta > 0$ there exists $n_0(p, \epsilon, \delta)$ such that for all $n > n_0(p, \epsilon, \delta)$:
$$\mathbb{P}_p\left[\left|\frac{\hat{M}_n(X_1, \ldots, X_n)}{M_n(X_1, \ldots, X_n)} - 1\right| < \epsilon\right] > 1 - \delta.$$
The learning is said to be distribution-free, if $\mathcal{P}$ consists of all distributions on $\mathbb{N}_+$.
In this framework, our question becomes whether we can distribution-free learn the missing mass in relative error. It is obvious that the empirical estimator $\hat{p}_n(E_n)$ gives us the trivial answer of 0, and cannot learn the missing mass. A popular alternative is the Good–Turing estimator of the missing mass, which is the fraction of singletons in the data:
$$G_n := \sum_{x \in \mathbb{N}_+} \frac{1}{n}\, \mathbf{1}\{n \hat{p}_n(x) = 1\}.$$
The Good–Turing estimator has many interpretations. Its original derivation by Good [1] uses an empirical-Bayes perspective. It can also be thought of as a leave-one-out cross-validation estimator, which contributes to the missing set if and only if the holdout appears exactly once in the data. Fundamentally, $G_n$ derives its form and its various properties from the simple fact that:
$$\mathbb{E}[G_n] = \sum_{x \in \mathbb{N}_+} p(x)\, (1 - p(x))^{n-1} = \mathbb{E}[M_{n-1}].$$
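To make these quantities concrete, here is a small simulation sketch (ours, purely illustrative and not from the paper); the distribution, sample size, and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# A heavy-tailed distribution on {1, ..., K}: p(x) proportional to x^(-1/alpha), alpha in (0, 1).
K, alpha, n = 10_000, 0.5, 2_000
p = np.arange(1, K + 1, dtype=float) ** (-1.0 / alpha)
p /= p.sum()

sample = rng.choice(np.arange(1, K + 1), size=n, p=p)
counts = np.bincount(sample, minlength=K + 1)[1:]      # counts[x-1] = number of occurrences of x

missing_mass = p[counts == 0].sum()                    # M_n: true mass of unseen symbols
good_turing = (counts == 1).sum() / n                  # G_n: fraction of singletons

print(f"M_n = {missing_mass:.4f}, G_n = {good_turing:.4f}")  # typically close for this heavy tail
```

For heavy-tailed distributions such as this power law, $G_n$ tracks $M_n$ well in relative error; as discussed next, this is provably not the case for geometric distributions.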
A study of $G_n$ in the learning framework was first undertaken by McAllester and Schapire [2] and continued later by McAllester and Ortiz [3]. Some further refinement and insight was also given later by Berend and Kontorovich [4]. These works focused on additive error. Ohannessian and Dahleh [5] shifted the attention to relative error, establishing the learning property of the Good–Turing estimator with respect to the family of heavy-tailed (roughly power-law) distributions, e.g., $p(x) \propto x^{-1/\alpha}$ with $\alpha \in (0,1)$. This work also showed that Good–Turing fails to learn the missing mass for geometric distributions, and therefore does not achieve distribution-free learning. More recently, Ben-Hamou et al. [6] provided a comprehensive and tight set of concentration inequalities, which can be interpreted in the current framework, and which further demonstrate that Good–Turing can learn with respect to heavier-than-geometric light tails, e.g., the family that includes $p(x) \propto 2^{-x^\gamma}$ with $\gamma \in (0,1)$ (see the definition in Section 4.3 and Remark 4.3 in that paper), in addition to power laws.
These results leave open the important question: does there exist some other estimator that can learn the missing mass in relative error for any distribution p? Our contributions are:
  • We prove that there are no such estimators, thus providing the first such “no free lunch” theorem for learning about rare events. The first insight to glean from this impossibility result is that one is justified in using further structural assumptions. Furthermore, the proof relies on an implicit construction that uses a dithered geometric distribution. In doing so, it shows that the failure of the Good–Turing estimator for light-tailed distributions is not a weakness of the procedure, but is rather due to a fundamental barrier. Conversely, the success of Good–Turing for heavier-than-geometric tails and power laws shows its universality, in some restricted sense. In particular, in concrete support of folklore (e.g., [7]), we can state that for estimating probabilities of rare events, heavy tails are both necessary and sufficient.
  • We extend this result to continuous tail estimation.
  • We show, on a positive note, that upon restricting to parametric light-tailed families learning may be possible. In particular, we show that for the geometric family the natural plug-in estimator learns the missing mass in relative error. As an ancillary result, we prove an instance-by-instance convergence rate, which can be interpreted as a weak sample complexity. For this, we establish some sharp concentration results for the gaps in geometric distributions, which may be of independent interest.
The paper is organized as follows. In Section 2, we present our main result, with a detailed exposition of the proof. In Section 3, we discuss weak versus strong learnability, give an immediate extension to continuous tail estimation, show that parametric light-tailed learning is possible, comment further on the Good–Turing estimator, and concisely place this result in the context of a chief motivating application, computational linguistics. Lastly, we conclude in Section 4 with a summary and open questions.

2. Main Result

Our main result is stated as follows. The rest of this section is dedicated to its detailed proof.
Theorem 1.
There exists a positive $\epsilon > 0$ and a strictly increasing sequence $(n_k)_{k = 1, 2, \ldots}$, such that for every estimator $\hat{M}_n$ there exists a distribution $p$, such that for all $k$:
$$\mathbb{P}_p\left[\left|\frac{\hat{M}_{n_k}}{M_{n_k}} - 1\right| > \epsilon\right] > \epsilon.$$
In particular, it follows that it is impossible to perform distribution-free learning of the missing mass in relative error.
Remark 1.
Our proof below implies the statement of the theorem with $\epsilon = 10^{-4}$ and $n_k = 6.5 \times 2^k$, but we did not make an honest effort to optimize these parameters.

2.1. Proof Outline

Consider the family $\mathcal{P}_{\beta,m}$ of $(\beta, m)$-dithered geometric$(\tfrac{1}{2})$ distributions, where the mass of each outcome beyond a symbol $m$ of a geometric$(\tfrac{1}{2})$ random variable is divided between two symbols, with a fraction $\beta$ in one and $1 - \beta$ in the other. The individual distributions in this family differ only by which of each pair of such symbols gets which fraction. More precisely:
Definition 2.
The $(\beta, m)$-dithered geometric$(\tfrac{1}{2})$ family, for a given choice of $\beta \in (0, \tfrac{1}{2})$ and $m \in \mathbb{N}_+$, is a collection of distributions parametrized by the dithering choices $\theta \in \{\beta, 1-\beta\}^{\mathbb{N}_+}$, $\theta := (\theta_1, \theta_2, \ldots, \theta_j, \ldots)$, as follows:
$$\mathcal{P}_{\beta,m} = \left\{ p_\theta :\ p_\theta(x) = \frac{1}{2^x},\ x = 1, \ldots, m;\quad p_\theta(m + 2j - 1) = \frac{\theta_j}{2^{m+j}},\ \ p_\theta(m + 2j) = \frac{1 - \theta_j}{2^{m+j}},\ j \in \mathbb{N}_+;\quad \theta \in \{\beta, 1-\beta\}^{\mathbb{N}_+} \right\}. \tag{2}$$
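For concreteness, here is a small sketch (ours, not from the paper) that builds a finite truncation of such a $p_\theta$ and checks that it sums to nearly one; the function name and the truncation depth J are arbitrary illustration choices:

```python
from fractions import Fraction

def dithered_geometric(theta, beta=Fraction(1, 4), m=1):
    """Return {symbol: probability} for the (beta, m)-dithered geometric(1/2) p_theta,
    truncated at the length of the dithering sequence theta (entries True -> beta, False -> 1 - beta)."""
    p = {x: Fraction(1, 2**x) for x in range(1, m + 1)}
    for j, pick_beta in enumerate(theta, start=1):
        frac = beta if pick_beta else 1 - beta
        p[m + 2*j - 1] = frac / 2**(m + j)          # theta_j / 2^(m+j)
        p[m + 2*j] = (1 - frac) / 2**(m + j)        # (1 - theta_j) / 2^(m+j)
    return p

J = 20                                              # truncation depth (illustrative)
p_theta = dithered_geometric([True, False] * (J // 2))
print(float(sum(p_theta.values())))                 # ~1, missing only the 2^-(m+J) tail
```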
The intuition of the proof of Theorem 1 is that within such light-tailed families, two distributions may have very similar samples and thus estimated values, yet have significantly different true values of the missing mass. This follows the general methodology of many statistical lower bounds. We now state the outline of the proof. We choose a subsequence of the form $n_k = C\, 2^k$. We set $\beta = 1/4$, $m = 1$, and $C = 6.5$. The value of $\epsilon > 0$ is made explicit in the proof, and depends only on these choices. We proceed by induction.
  • We show that there exists $\theta_1^*$ such that for all $\theta$ with $\theta_1 = \theta_1^*$ we have for $n = n_1$:
$$\mathbb{P}_{p_\theta}\left[\left|\frac{\hat{M}_n}{M_n} - 1\right| > \epsilon\right] > \epsilon. \tag{3}$$
  • Then, at every step $k > 1$:
    (H) We start with $(\theta_1^*, \ldots, \theta_{k-1}^*)$ such that for all $\theta$ with $(\theta_1, \ldots, \theta_{k-1}) = (\theta_1^*, \ldots, \theta_{k-1}^*)$, Inequality (3) holds for $n = n_1, \ldots, n_{k-1}$.
    (*) We then show that for at least one of $\tilde{\theta} = \beta$ or $\tilde{\theta} = 1 - \beta$, for all $\theta$ with $(\theta_1, \ldots, \theta_k) = (\theta_1^*, \ldots, \theta_{k-1}^*, \tilde{\theta})$, Inequality (3) holds additionally for $n = n_k$. We select $\theta_k^*$ to be the corresponding $\tilde{\theta}$.
  • This induction produces an infinite sequence $\theta^* \in \{\beta, 1-\beta\}^{\mathbb{N}_+}$, and the desired distribution in Theorem 1 can be chosen as $p = p_{\theta^*}$, since it is readily seen to satisfy the claim for each $n_k$, by construction.

2.2. Proof Details

We skip the proof of the base case, since it is mostly identical to that of the induction step. Therefore, in what follows we assume that $(\theta_1^*, \ldots, \theta_{k-1}^*)$ satisfies the inductive hypothesis (H), and we would like to prove that the selection in (*) can always be done. Let us denote the two choices of parameters by
$$\theta := (\theta_1^*, \ldots, \theta_{k-1}^*, \beta, \theta_{k+1}, \ldots),$$
and
$$\theta' := (\theta_1^*, \ldots, \theta_{k-1}^*, 1 - \beta, \theta'_{k+1}, \ldots),$$
and let us refer to $(\theta_{k+1}, \ldots)$ and $(\theta'_{k+1}, \ldots)$ as the trailing parameters. What we show in the remainder of the proof is that with two arbitrary sets of trailing parameters, we cannot have two simultaneous violations of Inequality (3) (for both $\theta$ and $\theta'$). That is, we cannot have both:
$$\mathbb{P}_{p_\theta}\left[\left|\frac{\hat{M}_{n_k}}{M_{n_k}} - 1\right| > \epsilon\right] < \epsilon \quad \text{and} \quad \mathbb{P}_{p_{\theta'}}\left[\left|\frac{\hat{M}_{n_k}}{M_{n_k}} - 1\right| > \epsilon\right] < \epsilon. \tag{4}$$
This is stated in Lemma 3, in the last portion of this section. To see why this is sufficient to show that the selection in (*) can be done, consider first the case in which Inequality (3) with $n = n_k$ is upheld for both $\theta$ and $\theta'$ with any two sets of trailing parameters. In this case we can arbitrarily choose $\theta_k^*$ to be either $\beta$ or $1 - \beta$, since the induction step is satisfied. We can therefore focus on the case in which this fails. That is, for either $\theta$ or $\theta'$ a choice of trailing parameters can be made such that Inequality (3) with $n = n_k$ is not satisfied, and therefore one of the two cases in (4) holds [say, for example, for $\theta$]. Fix the corresponding trailing parameters [in this example, $(\theta_{k+1}, \ldots)$]. Then, for any choice of the other set of trailing parameters [in this example, $(\theta'_{k+1}, \ldots)$], Lemma 3 precludes a violation of Inequality (3) for $n = n_k$ by the other choice [in this example, $\theta'$]. Therefore this choice can be selected for $\theta_k^*$ [in this example, $\theta_k^* = 1 - \beta$].
By using the coupling device and restricting ourselves to a pivotal event, we formalize the aforementioned intuition that the estimator may not distinguish between two separated missing mass values, and deduce that both statements in (4) cannot hold simultaneously.

2.2.1. Coupling

Definition 3.
A coupling between two distributions $p$ and $p'$ on $\mathbb{N}_+$ is a joint distribution $q$ on $\mathbb{N}_+^2$, such that the first and second marginal distributions of $q$ revert back to $p$ and $p'$, respectively.
Couplings are useful because probabilities of events on each side may be evaluated on the joint probability space, while forcing events of interest to occur in an orchestrated fashion. Going back to our induction step and the specific choices $\theta$ and $\theta'$ with arbitrary trailing parameters, we perform the following coupling. For $(x, x') \in \mathbb{N}_+^2$, let
$$q(x, x') = \begin{cases} p_\theta(x) = p_{\theta'}(x) & \text{if } x = x' < m + 2k - 1; \\ \beta / 2^{m+k} & \text{if } x = x' = m + 2k - 1, \text{ or if } x = x' = m + 2k; \\ (1 - 2\beta) / 2^{m+k} & \text{if } x = m + 2k \text{ and } x' = m + 2k - 1; \\ 2^{m+k}\, p_\theta(x)\, p_{\theta'}(x') & \text{if } x, x' > m + 2k; \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$
It is easy to verify that $q$ in Equation (5) is a coupling between $p_\theta$ and $p_{\theta'}$ as in Definition 3. Now let us observe the consequences of this choice. If $X, X'$ are generated according to $q$, then if either is in $\{1, \ldots, m + 2k - 2\}$, both values are identical. If either is in $\{m + 2k + 1, \ldots\}$ then so is the other, and in that case the two values are conditionally independent. If either is in $\{m + 2k - 1, m + 2k\}$, so is the other, and the conditional probabilities (rows indexed by $x$, columns by $x'$) are given by:

                      x' = m + 2k - 1     x' = m + 2k
  x = m + 2k - 1            β                  0
  x = m + 2k              1 - 2β               β
Now consider coupled data $(X_i, X'_i)_{i = 1, \ldots, n}$ generated as i.i.d. samples from $q$. It follows that, marginally, the $X$-sequence is i.i.d. from $p_\theta$, and the $X'$-sequence is i.i.d. from $p_{\theta'}$. Any event $B$ that is exclusively $X$-measurable, or $B'$ that is exclusively $X'$-measurable, has the same probability under the coupled measure. That is,
$$\mathbb{P}_{p_\theta}(B) = \mathbb{P}_q(B) := q^n\left(B \times \mathbb{N}_+^n\right)$$
and
$$\mathbb{P}_{p_{\theta'}}(B') = \mathbb{P}_q(B') := q^n\left(\mathbb{N}_+^n \times B'\right).$$
In what follows we work only with coupled data, and simply use the shorthand $\mathbb{P}$ to mean $\mathbb{P}_q$.
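A direct sampler for this coupling may help make its structure concrete. The following sketch is ours (not from the paper); it hard-codes m = 1, k = 2, β = 1/4 and one arbitrary choice of trailing parameters, draws X from p_θ, then draws X' from the conditional law q(x' | x), exhibiting the three regimes (identical values below m + 2k - 1, the small 2x2 block, and the conditionally independent tail):

```python
import random

m, k, beta = 1, 2, 0.25

def sample_p(theta_k, rng):
    """One draw from a dithered geometric(1/2), dithering with beta everywhere except theta_k at pair k."""
    level = 1
    while rng.random() < 0.5:        # geometric(1/2) level: P(level = L) = 2^-L
        level += 1
    if level <= m:
        return level
    j = level - m
    frac = theta_k if j == k else beta
    return m + 2*j - 1 if rng.random() < frac else m + 2*j

def sample_coupled_pair(rng):
    x = sample_p(beta, rng)                       # X ~ p_theta (theta_k = beta)
    if x <= m + 2*k - 1:
        return x, x                               # identical region (q puts no mass on (m+2k-1, m+2k))
    if x == m + 2*k:
        # conditional on X = m+2k: X' = m+2k-1 w.p. (1-2b)/(1-b), else X' = m+2k
        return x, (m + 2*k - 1 if rng.random() < (1 - 2*beta) / (1 - beta) else m + 2*k)
    # tail region: X' drawn independently from the p_theta' tail beyond m+2k
    xp = sample_p(1 - beta, rng)
    while xp <= m + 2*k:
        xp = sample_p(1 - beta, rng)
    return x, xp

rng = random.Random(0)
pairs = [sample_coupled_pair(rng) for _ in range(10_000)]
print(sum(x == xp for x, xp in pairs) / len(pairs))   # most pairs coincide
```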

2.2.2. Pivotal Event

The event we would like to work under is that of the coupled samples being identical, while exactly covering the range $1, \ldots, m + 2k - 1$. Let us call this the pivotal event and denote it by:
$$A_k := \bigcap_{i=1}^{n_k} \{X_i = X'_i\} \;\cap\; \left\{ \{X_1, \ldots, X_{n_k}\} = \{1, \ldots, m + 2k - 1\} \right\}. \tag{6}$$
The reason $A_k$ interests us is that it encapsulates the aforementioned intuition.
Lemma 1.
Under event $A_k$, the coupled missing masses are distinctly separated,
$$\frac{M_{n_k}}{M'_{n_k}} = \frac{2 - \beta}{1 + \beta},$$
while any estimator cannot distinguish the coupled samples,
$$\hat{M}_{n_k} = \hat{M}'_{n_k}.$$
Proof. 
The confusion of any estimator is simply due to the fact that under $A_k$, the coupling forces all samples to be identical, $X_i = X'_i$ for all $i = 1, \ldots, n_k$.
Thus $\hat{M}_{n_k} = \hat{M}'_{n_k}$, since estimators depend only on the samples and not on the probabilities.
The missing masses, on the other hand, do depend on both the samples and the probabilities, and thus they differ. But the event $A_k$ makes the set of missing symbols simply the tail $\{m + 2k, m + 2k + 1, \ldots\}$, so we can compute the missing masses exactly:
$$M_{n_k} = p_\theta(m + 2k) + \sum_{x = m + 2k + 1}^{\infty} p_\theta(x) = \frac{1 - \theta_k}{2^{m+k}} + \frac{1}{2^{m+k}} = (2 - \beta)\, 2^{-m-k}, \quad \text{and}$$
$$M'_{n_k} = p_{\theta'}(m + 2k) + \sum_{x = m + 2k + 1}^{\infty} p_{\theta'}(x) = \frac{1 - \theta'_k}{2^{m+k}} + \frac{1}{2^{m+k}} = (1 + \beta)\, 2^{-m-k},$$
where the $\frac{1}{2^{m+k}}$ follows from the usual geometric sum, and we used $\theta_k = \beta$ and $\theta'_k = 1 - \beta$. The claim follows. □
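As a quick arithmetic sanity check (ours, not part of the proof), the two masses and their ratio can be verified exactly for the parameters β = 1/4 and m = 1 used below:

```python
from fractions import Fraction

beta, m, k = Fraction(1, 4), 1, 3                      # k is arbitrary for this check

tail = Fraction(1, 2**(m + k))                         # mass of all symbols strictly beyond m+2k
M  = (1 - beta) / 2**(m + k) + tail                    # theta_k = beta    => p_theta(m+2k)  = (1-beta)/2^(m+k)
Mp = beta / 2**(m + k) + tail                          # theta'_k = 1-beta => p_theta'(m+2k) = beta/2^(m+k)

print(M, Mp, M / Mp)                                   # (2-beta)/2^(m+k), (1+beta)/2^(m+k), ratio 7/5
```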
We now show that $A_k$ always has probability bounded away from zero.
Lemma 2.
For $\beta = 1/4$, $m = 1$, $C = 6.5$ and $n_k = C\, 2^k$, there exists a positive constant $\eta > 0$ that does not depend on $\theta$, such that for all $k$, $\mathbb{P}(A_k) > \eta$. We can explicitly set $\eta = 2 \cdot 10^{-4}$.
Proof. 
Please note that $A_k$ in Equation (6) overspecifies the event. In fact, forcing only the exact coverage of $\{1, \ldots, m + 2k - 1\}$ is sufficient, since this in turn implies that the coupled samples are identical. Recalling the coupling of Equation (5), this is immediate for symbols in $\{1, \ldots, m + 2k - 2\}$, and follows for $m + 2k - 1$ since $m + 2k$ is not allowed in this event. We can then write $A_k = A_{k,1} \cap A_{k,2}$, dividing the exact coverage into localization to the range and representation of each symbol by at least one sample:
$$A_{k,1} = \left\{ \bigcup_{i=1}^{n_k} \{X_i\} \subseteq \{1, \ldots, m + 2k - 1\} \right\} \ (\text{localization}), \qquad A_{k,2} = \left\{ \bigcup_{i=1}^{n_k} \{X_i\} \supseteq \{1, \ldots, m + 2k - 1\} \right\} \ (\text{representation}).$$
Let $\alpha$ be the probability that $(X_i, X'_i)$, for a given $i$, lies in $\{(1,1), \ldots, (m + 2k - 1, m + 2k - 1)\}$. From the coupling in Equation (5) and the structure of the dithered family in Equation (2), we see that the diagonal terms up to $m + 2k - 2$ sum to the first $m + k - 1$ terms of a geometric$(\tfrac{1}{2})$, and the coupling assigns $\beta / 2^{m+k}$ to $(m + 2k - 1, m + 2k - 1)$, thus:
$$\alpha = \sum_{x=1}^{m + 2k - 1} q(x, x) = 1 - \frac{1}{2^{m+k-1}} + \frac{\beta}{2^{m+k}}.$$
We can then explicitly compute:
$$\mathbb{P}(A_{k,1}) = \alpha^{n_k} = \left(1 - \frac{1}{2^{m+k-1}} + \frac{\beta}{2^{m+k}}\right)^{n_k} =: \eta_1(k).$$
Meanwhile, the complement of $A_{k,2}$ is the event that at least one of $\{(1,1), \ldots, (m + 2k - 1, m + 2k - 1)\}$ does not appear, that is $A_{k,2}^c = \bigcup_{x=1}^{m + 2k - 1} \{x \notin \{X_i\}\}$. Conditionally on $A_{k,1}$, the occurrence probabilities of these symbols are simply normalized by $\alpha$, that is $\mathbb{P}(x \notin \{X_i\} \mid A_{k,1}) = \left(1 - q(x,x)/\alpha\right)^{n_k}$. Thus, by a union bound, we have:
$$\begin{aligned} \mathbb{P}(A_{k,2} \mid A_{k,1}) &= 1 - \mathbb{P}(A_{k,2}^c \mid A_{k,1}) \ge 1 - \sum_{x=1}^{m+2k-1} \mathbb{P}(x \notin \{X_i\} \mid A_{k,1}) = 1 - \sum_{x=1}^{m+2k-1} \left(1 - q(x,x)/\alpha\right)^{n_k} \\ &\ge 1 - \sum_{x=1}^{m+2k-1} \left(1 - q(x,x)\right)^{n_k} \\ &= 1 - \sum_{x=1}^{m} \left(1 - \frac{1}{2^x}\right)^{n_k} - \sum_{j=1}^{k-1} \left[\left(1 - \frac{\beta}{2^{m+j}}\right)^{n_k} + \left(1 - \frac{1-\beta}{2^{m+j}}\right)^{n_k}\right] - \left(1 - \frac{\beta}{2^{m+k}}\right)^{n_k} \\ &\ge 1 - \sum_{x=1}^{m} \left(1 - \frac{1}{2^x}\right)^{n_k} - 2 \sum_{j=1}^{k-1} \left(1 - \frac{\beta}{2^{m+j}}\right)^{n_k} - \left(1 - \frac{\beta}{2^{m+k}}\right)^{n_k} =: \eta_2(k), \end{aligned}$$
where the last inequality follows from the fact that $\beta < \tfrac{1}{2}$. Therefore,
$$\mathbb{P}(A_k) = \mathbb{P}(A_{k,1} \cap A_{k,2}) = \mathbb{P}(A_{k,1})\, \mathbb{P}(A_{k,2} \mid A_{k,1}) \ge \eta_1(k)\, \eta_2(k) \ge \inf_{k \ge 1} \eta_1(k)\, \eta_2(k) =: \eta.$$
We now use our choices of $\beta = 1/4$, $m = 1$, $C = 6.5$, and $n_k = C\, 2^k$ to bound this worst case. In particular, we can verify that $\eta \ge 2 \cdot 10^{-4}$, and it follows as claimed that the probability of the pivotal event is always bounded away from zero. □
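The final numerical claim can be checked mechanically. The following sketch (ours, not part of the paper's argument) evaluates η₁(k)·η₂(k), as defined above, over a range of k; in our runs the minimum is about 2.5 × 10⁻⁴, consistent with the stated bound η ≥ 2 · 10⁻⁴:

```python
import math

beta, m, C = 0.25, 1, 6.5   # the choices made in Lemma 2, with n_k = C * 2^k

def eta1(k):
    n = C * 2**k
    # alpha = 1 - 2^-(m+k-1) + beta * 2^-(m+k);  eta_1(k) = alpha^n
    return math.exp(n * math.log1p(-2.0**-(m + k - 1) + beta * 2.0**-(m + k)))

def eta2(k):
    n = C * 2**k
    s = sum(math.exp(n * math.log1p(-2.0**-x)) for x in range(1, m + 1))
    s += 2 * sum(math.exp(n * math.log1p(-beta * 2.0**-(m + j))) for j in range(1, k))
    s += math.exp(n * math.log1p(-beta * 2.0**-(m + k)))
    return 1 - s

vals = [eta1(k) * eta2(k) for k in range(1, 41)]
print(min(vals))            # about 2.5e-4 in our runs (attained near k = 3)
assert min(vals) > 2e-4     # the bound eta >= 2e-4 used in Lemma 2
```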

2.2.3. Induction Step

We now combine all the elements presented thus far to complete the proof of Theorem 1 by establishing the following claim, which we have shown in the beginning of the detailed proof section to be sufficient for the validity of the induction step. In particular, we restate Equation (4) under the coupling of Equation (5).
Lemma 3.
Let
$$\theta := (\theta_1^*, \ldots, \theta_{k-1}^*, \beta, \theta_{k+1}, \ldots), \quad \text{and} \quad \theta' := (\theta_1^*, \ldots, \theta_{k-1}^*, 1 - \beta, \theta'_{k+1}, \ldots),$$
with arbitrary trailing parameters $(\theta_{k+1}, \ldots)$ and $(\theta'_{k+1}, \ldots)$. Let $q$ be the coupling of Equation (5), and let $B_k = \left\{ \left| \hat{M}_{n_k} / M_{n_k} - 1 \right| > \epsilon \right\}$ and $B'_k = \left\{ \left| \hat{M}'_{n_k} / M'_{n_k} - 1 \right| > \epsilon \right\}$. Then, given our choices of $\beta = 1/4$, $m = 1$, $C = 6.5$ and $n_k = C\, 2^k$, if $\epsilon < 10^{-4}$ we cannot simultaneously have
$$\mathbb{P}_q(B_k) < \epsilon \quad \text{and} \quad \mathbb{P}_q(B'_k) < \epsilon.$$
Proof. 
Please note that this choice of $\epsilon$ means that $\epsilon < \eta/2$, where $\eta$ is as in Lemma 2. Recall the pivotal event $A_k$, and assume, for the sake of contradiction, that both probability bounds $\mathbb{P}(B_k) < \epsilon$ and $\mathbb{P}(B'_k) < \epsilon$ hold. Please note that if $B_k^c$ holds, it means that
$$\frac{\hat{M}_{n_k}}{M_{n_k}} \in (1 - \epsilon, 1 + \epsilon), \tag{7}$$
and similarly if $B'^c_k$ holds, it means that
$$\frac{\hat{M}'_{n_k}}{M'_{n_k}} \in (1 - \epsilon, 1 + \epsilon). \tag{8}$$
Our hypothesis thus asserts that these events each have probability at least $1 - \epsilon$, under $p_\theta$ and $p_{\theta'}$ respectively, and hence that the estimator is effectively $(1 \pm \epsilon)$-close to the true value of the missing mass in both cases. Yet we know that this is violated under the pivotal event, which occurs with positive probability. We now formalize this contradiction.
By Lemma 2, we have that:
$$\mathbb{P}(B_k \mid A_k) = \frac{\mathbb{P}(A_k \cap B_k)}{\mathbb{P}(A_k)} \le \frac{\mathbb{P}(B_k)}{\mathbb{P}(A_k)} \le \frac{\epsilon}{\eta}, \qquad \mathbb{P}(B'_k \mid A_k) = \frac{\mathbb{P}(A_k \cap B'_k)}{\mathbb{P}(A_k)} \le \frac{\mathbb{P}(B'_k)}{\mathbb{P}(A_k)} \le \frac{\epsilon}{\eta},$$
and consequently
$$\mathbb{P}(B_k^c \cap B'^c_k \mid A_k) \ge 1 - \frac{2\epsilon}{\eta} > 0, \tag{9}$$
where the last inequality is strict, by the choice of $\epsilon < \eta/2$.
On the other hand, recall that by Lemma 1, under $A_k$ we have:
$$\hat{M}_{n_k} = \hat{M}'_{n_k} \quad \text{and} \quad \frac{M_{n_k}}{M'_{n_k}} = \frac{2 - \beta}{1 + \beta} = \frac{7}{5}.$$
By combining this with Equations (7) and (8), we can now see that if $\frac{1 + \epsilon}{1 - \epsilon} < \frac{7}{5}$, which is satisfied by any choice of $\epsilon < 1/6$, in particular ours, then if $B_k^c$ occurs, then $B'_k$ occurs, and conversely if $B'^c_k$ occurs, then $B_k$ occurs. For example, say $B'^c_k$ occurs; then $\hat{M}'_{n_k} / M'_{n_k} < 1 + \epsilon$, and
$$\frac{\hat{M}_{n_k}}{M_{n_k}} = \frac{\hat{M}'_{n_k}}{\frac{7}{5}\, M'_{n_k}} < \frac{5}{7}(1 + \epsilon) < 1 - \epsilon,$$
implying that $B_k^c$ cannot hold, thus $B_k$ occurs. The end result is that under event $A_k$, $B_k^c$ and $B'^c_k$ cannot occur at the same time, and thus:
$$\mathbb{P}(B_k^c \cap B'^c_k \mid A_k) = 0.$$
This contradicts the bound in (9), and establishes the lemma. □

3. Discussions

3.1. Weak Versus Strong Distribution-Free Learning

Arguably, a more common notion of learning is a strong version of Definition 1, where the sample complexity is a function of the distribution class rather than the instance. Formally:
Definition 4.
We say that an estimator learns the missing mass in relative error strongly with respect to a family $\mathcal{P}$ of distributions, if for every $\epsilon > 0$ and $\delta \in (0,1)$ there exists $n_0(\mathcal{P}, \epsilon, \delta)$ such that for all $p \in \mathcal{P}$ and all $n > n_0(\mathcal{P}, \epsilon, \delta)$:
$$\mathbb{P}_p\left[\left|\frac{\hat{M}_n(X_1, \ldots, X_n)}{M_n(X_1, \ldots, X_n)} - 1\right| < \epsilon\right] > 1 - \delta.$$
The learning is said to be strongly distribution-free, if $\mathcal{P}$ consists of all distributions on $\mathbb{N}_+$.
The distinction here is similar to that between uniform and pointwise convergence. Clearly, the existence of a strong learner implies the existence of a weak learner. Conversely, since we have shown that there is no weak distribution-free learner, there is also no strong distribution-free learner. However, the ability to choose a different distribution at every sample size $n$ makes it very easy to show this corollary directly. For example, we can consider two distributions $p$ and $q$ with $p(1) = 1 - \frac{1}{n^2}$ and $q(1) = 1 - \frac{1}{100 n^2}$, both of which would, with overwhelming probability, produce a length-$n$ sequence consisting entirely of this first symbol. Thus any estimator would need to predict the same missing mass with high probability. However, the probabilities of the remaining symbols differ by a factor of 100 between the two models, and thus any estimator would be misguided in at least one of the two cases.
The relevance of the current contribution is rooted in the plausible yet misguided optimism that although we may not do well in such a worst-case paradigm, there is more hope if we first fix the instance and then study asymptotics. Our “no free lunch” theorem indeed shows the more subtle fact that there are always bad instances for every estimator, and thus even such weak learning is fundamentally impossible.
Such a contrast between weak and strong learning has also been appreciated in the classical learning literature, notably in the work of Antos and Lugosi [8]. The notions there are framed in the negative, which is why the weak/strong terminology is reversed. A traditional minimax lower bound in that context states that for any sequence of concept learners $\hat{g}_n$, at each $n$ we can find a distribution for which the expected cumulative classification error is lower bounded by the complexity of the concept class. Analogously, not being able to learn strongly as in Definition 4 means that for any estimator $\hat{M}_n$, at each $n$ we can find a distribution for which the relative error stays away from zero. By demanding a strong performance from a learner/estimator, we are able to give only a weak guarantee. In particular, it could be too loose for a fixed distribution that does not vary with $n$. Antos and Lugosi [8] contribute by giving lower bounds that hold infinitely often for any sequence of concept learners $\hat{g}_n$, but for a distribution choice that is adversarial in advance, fixed for all $n$. Analogously, not being able to learn (weakly) as in Definition 1 means that for any estimator $\hat{M}_n$ there exists a distribution, fixed for all $n$, for which the relative error stays away from zero for infinitely many $n$. The lower bounds of Antos and Lugosi [8] can now be tighter, which is why they call their results strong minimax lower bounds. In the context of the present paper, of course, the lower bounds correspond to the impossibility result, which is thus stronger since learning fails even for a fixed distribution.

3.2. Generalization to Continuous Tails

A problem closely related to learning the missing mass is that of estimating the tail of a probability distribution. In the simplest setting, the data consist of $Y_1, \ldots, Y_n$ that are i.i.d. samples from a continuous distribution on $\mathbb{R}$. Let $F$ be the cumulative distribution function. The task in question is that of estimating the tail probability
$$W_n = 1 - F\left(\max_{i = 1, \ldots, n} Y_i\right),$$
that is, the probability that a new sample exceeds the maximum of all samples seen in the data.
One can immediately see the similarity with the missing mass problem, as both problems concern estimating probabilities of underrepresented events. We can use essentially the same learning framework given by Definition 1, and prove a completely parallel impossibility result.
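To fix ideas, here is a tiny illustration (ours) of $W_n$ when $F$ is known, using an exponential distribution, for which $1 - F(y) = e^{-y}$; in practice $F$ is unknown, and the naive empirical tail estimate beyond the sample maximum is identically zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
y = rng.exponential(scale=1.0, size=n)   # Y_i ~ Exp(1), so 1 - F(y) = exp(-y)

w_n = np.exp(-y.max())                   # true tail probability beyond the sample maximum
print(w_n)                               # roughly 1/n on average, for any continuous F
```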
Theorem 2.
There exist $\epsilon > 0$ and a subsequence $(n_k)_{k = 1, 2, \ldots}$, such that for every estimator $\hat{W}_n$ of $W_n$ there exists a continuous distribution $F$, such that for all $k$:
$$\mathbb{P}_F\left[\left|\frac{\hat{W}_{n_k}}{W_{n_k}} - 1\right| > \epsilon\right] > \epsilon.$$
In particular, it follows that it is impossible to perform distribution-free learning of the tail probability in relative error.
Proof. 
(Sketch) The discrete version of this theorem is a trivial extension of Theorem 1, since in the proof of the latter the pivotal event forced the missing mass to be a tail probability. The potential strengthening of Theorem 2 comes from insisting on a continuous $F$. The same techniques may be adapted to this case, for example by dithering an exponential distribution: a base exponential density is divided into intervals, and mass is moved between pairs of adjacent intervals by scaling the density, in the same way that $\beta$ dithers the geometric. The adversarial distribution for a given estimator can then be chosen from this family. In order not to repeat the same arguments, however, we instead prove this result via a reduction: we show that discrete tail estimation can be reduced to continuous tail estimation, and since the former is impossible, so is the latter. The details can be found in Appendix A. □
Theorem 2 gives a concrete justification of why it is important to make regularity assumptions when extrapolating distribution tails. This is of course the common practice of extreme value theory; see, for example, [9]. Some impossibility results concerning the even more challenging problem of estimating the density of the maximum were already known [10], but to the best of our knowledge this is the first result asserting such an impossibility for tail probability estimation as well.

3.3. Learning in Various Families

Ben-Hamou et al. [6] (Corollary 5.3) give a very clean sufficient condition defining a learnable family, which encompasses the one covered by Ohannessian and Dahleh [5].
Theorem 3
([6]). Denote the expected numbers of single-occurrence and double-occurrence symbols by $\Phi_{n,1}$ and $\Phi_{n,2}$ respectively:
$$\Phi_{n,1} := \mathbb{E}\left[\sum_{x \in \mathbb{N}_+} \mathbf{1}\{n \hat{p}_n(x) = 1\}\right] = \sum_{x \in \mathbb{N}_+} n\, p(x)\, [1 - p(x)]^{n-1}, \quad \text{and}$$
$$\Phi_{n,2} := \mathbb{E}\left[\sum_{x \in \mathbb{N}_+} \mathbf{1}\{n \hat{p}_n(x) = 2\}\right] = \sum_{x \in \mathbb{N}_+} \frac{n(n-1)}{2}\, p(x)^2\, [1 - p(x)]^{n-2}.$$
Let $\mathcal{H}$ be the family defined by:
$$\mathcal{H} := \left\{ p :\ \Phi_{n,1} \to \infty \ \text{ and } \ \frac{\Phi_{n,2}}{\Phi_{n,1}} \ \text{remains bounded as } n \to \infty \right\}.$$
The Good–Turing estimator learns the missing mass in relative error with respect to $\mathcal{H}$.
The proof relies on power moment concentration inequalities (such as Chebyshev’s). The property $\Phi_{n,1} \to \infty$ embodies the heavy-tailed nature, since it says that rare events occur often. The condition that $\Phi_{n,2}/\Phi_{n,1}$ (i.e., its lim sup) remains bounded is a smoothness condition, since $\Phi_{n,2}$ roughly captures the variance of the number of singletons (see [6], Proposition 3.3). For us, this is instructive because one can readily verify that the condition of Theorem 3 fails for geometric (and dithered geometric) distributions. We can thus see that, in some sense, Good–Turing captures a maximal family of learnable distributions. In particular, we now know that the complement of $\mathcal{H}$ is not learnable.
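To get a feel for the condition defining $\mathcal{H}$, the following sketch (ours) evaluates $\Phi_{n,1}$ and $\Phi_{n,2}/\Phi_{n,1}$ for a truncated power law and for a geometric distribution; for the former $\Phi_{n,1}$ grows without bound, while for the latter it stays bounded, which is consistent with Good–Turing learning in the first case and not in the second:

```python
import numpy as np

def phis(p, n):
    """Expected numbers of singletons and doubletons, Phi_{n,1} and Phi_{n,2}."""
    phi1 = np.sum(n * p * (1 - p) ** (n - 1))
    phi2 = np.sum(0.5 * n * (n - 1) * p**2 * (1 - p) ** (n - 2))
    return phi1, phi2

K = 200_000
x = np.arange(1, K + 1, dtype=float)

power_law = x ** -2.0; power_law /= power_law.sum()   # roughly p(x) ~ x^(-1/alpha) with alpha = 1/2
geometric = 0.5 ** x; geometric /= geometric.sum()    # geometric(1/2), truncated

for n in (10**3, 10**4, 10**5):
    f1, f2 = phis(power_law, n)
    g1, g2 = phis(geometric, n)
    print(f"n={n:>6}  power law: Phi1={f1:8.1f}, Phi2/Phi1={f2/f1:.2f} | "
          f"geometric: Phi1={g1:.2f}, Phi2/Phi1={g2/g1:.2f}")
```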
Considering how sparse the dithered geometric family is, the failure of any estimator to learn the missing mass with respect to it may seem discouraging. (Please note that Theorem 1 holds even if the estimator is aware that this is the class it is paired with.) However, if we restrict ourselves to smooth parametric light-tailed families, then the outlook can be brighter. We illustrate this with the case of the geometric family.
Theorem 4.
Let $\mathcal{G}$ be the class of geometric distributions, parametrized by $\alpha \in (0,1)$:
$$p_\alpha(x) = (1 - \alpha)\, \alpha^{x-1}, \quad \text{for } x \in \mathbb{N}_+.$$
Let $\hat{\alpha}_n = 1 - \frac{1}{\hat{\mu}_n}$, where $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$, be the maximum likelihood estimator of the parameter, and define the plug-in estimator:
$$\check{M}_n = \sum_{x \in \mathbb{N}_+} (1 - \hat{\alpha}_n)\, \hat{\alpha}_n^{x-1}\, \mathbf{1}\{n \hat{p}_n(x) = 0\}.$$
Then $\check{M}_n$ learns the missing mass in relative error with respect to $\mathcal{G}$.
Proof. 
(Sketch) The proof consists of pushing forward the convergence of the parameter to that of the entire distribution using continuity arguments, and then specializing to the missing mass. The details can be found in the Appendix B. □
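The following simulation sketch (ours, for illustration) implements this plug-in estimator and compares it, together with Good–Turing, against the true missing mass of a geometric distribution; the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.9, 500                                    # geometric parameter and sample size (illustrative)

# Sample from p_alpha(x) = (1 - alpha) * alpha^(x-1) on {1, 2, ...}.
x = rng.geometric(p=1 - alpha, size=n)

seen = np.unique(x)
mu_hat = x.mean()
alpha_hat = 1 - 1 / mu_hat                             # MLE of alpha

# True missing mass M_n and plug-in estimate: total mass one minus the mass of the seen symbols.
M_n = 1 - np.sum((1 - alpha) * alpha ** (seen - 1))
M_check = 1 - np.sum((1 - alpha_hat) * alpha_hat ** (seen - 1))
G_n = np.sum(np.bincount(x)[1:] == 1) / n              # Good-Turing, for comparison

print(f"M_n = {M_n:.4f}, plug-in = {M_check:.4f}, Good-Turing = {G_n:.4f}")
```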

3.4. Learning the Missing Mass in Additive Error and Learning Other Related Quantities

As mentioned in the introduction, a good part of the work on learning the missing mass has focused on additive error [2,3,4]. Recently, minimax lower bounds were given for the additive error in [11] and [12]. Note however that relative error bounds cannot be deduced from these (nor in any other way, given the impossibility established here). A problem related to learning the missing mass in relative error is that of learning a distribution under KL-divergence loss. This averages all log-relative errors (missing or otherwise). This averaging scales the log-relative errors by the rare probability and attenuates the kind of gaps discussed in our present context. One thus hopes to have more optimistic results. Indeed, the Good–Turing estimator was recently shown to be adaptive/competitive for distribution learning in KL-divergence [13]. A similar result in the context of distribution learning in total variation was given in [14]. In the language of Section 3.2, being competitive can be understood as an intermediate characterization between weak and strong learning. Lastly, one could be interested in learning other properties of distributions that are intimately related to the rare component. Entropy and support size are two of these. In [15] a traditional minimax bound was established for these quantities, which was then further distilled in [16]. Another related problem is predicting the growth of the support as more observations are made. This was characterized very precisely in [17], where one can find further pointers on this very old problem. Some of these results may give the impression that nothing further can be gained from structural assumptions. However, rates can generally be refined whenever such structure exists. See for example [18] for refined competitive rates in distribution estimation and [19] for similar results in the predictive/compression setting. These results use tail characterizations, similar to those in extreme value theory [10].

3.5. N-Gram Models and Bayesian Perspectives

One of the prominent applications of estimating the missing mass has been to computational linguistics. In that context, it is known as smoothing and is used to estimate N-gram transition probabilities. The importance of accurately estimating the missing mass, in particular in a relative-error sense, comes from the fact that N-grams are used to score test sentences using log-likelihoods. Test sentences often contain transitions that are never seen in the training corpus, and thus in order for the inferred log-likelihoods to accurately track the true log-likelihood, these rare transitions need to be assigned meaningful values, ideally as close to the truth as possible. As such, various forms of smoothing, including Good–Turing estimation, have become an essential ingredient of many practical algorithms, such as the popular method proposed by Kneser and Ney [20].
In the context of N-gram learning, a separate Bayesian perspective was also proposed. One of the earliest works to introduce this was that of MacKay and Peto [21], using a Dirichlet prior. This was shown not to be very effective, and we now understand that this is because (1) the Dirichlet process produces light tails, while language is often heavy-tailed, and (2) even if language were light-tailed, rare probabilities are hard to learn for large light-tailed families. The natural progression of these Bayesian models led to the use of the two-parameter Poisson-Dirichlet prior [22], which was initially suggested by Teh [23]. Despite employing sophisticated inference techniques, the missing mass estimator that resulted from these models closely followed the Good–Turing estimator (for a sharp analysis of this correspondence, see Falahatgar et al. [18]). In light of the present work, this is not surprising, since the two-parameter Poisson-Dirichlet process almost surely produces heavy-tailed distributions, and any two algorithms that learn the missing mass are bound to have the same qualitative behavior.

4. Summary

In this paper, we have considered the problem of learning the missing mass, which is the probability of all unseen symbols in an i.i.d. draw from an unknown discrete distribution. We have phrased this in the probabilistic framework of learning. Our main contribution was to show that it is not possible to learn the missing mass in a completely distribution-free fashion.
In other words, no single estimator can do well for all distributions. We have given a detailed account of the proof, emphasizing the intuition of how failure can occur in large light-tailed families. We have also placed this work in a greater context, through some discussions and extensions of the impossibility result to continuous tail probability estimation, and by showing that smaller, parametric, light-tailed families may be learnable.
An initial impetus for this paper, and its core message, is that assuming further structure is necessary in order to learn rare events. Further structure, of course, is nothing more than a form of regularization. This is a familiar notion to the computational learning community, but for a long time the Good–Turing estimator enjoyed favorable analysis that focused on additive error, and evaded this kind of treatment. The essential ill-posedness of the problem was uncovered by studying relative error. But lower bounds cannot be deduced from the failure of particular algorithms. Our result thus completes the story, and we can now shift our attention to studying the landscape that is revealed.
The most basic set of open problems concerns establishing families that allow learning of the missing mass. We have seen in this paper some such families, including the heavy-tailed family learnable by the Good–Turing estimator, and simple smooth parametric families, learnable using plug-in estimators. How do we characterize such families more generally? The next layer of questions concerns establishing convergence rates, i.e., strong learnability, via both lower and upper bounds. The fact that a family of distributions allows learning does not mean that such rates can be established. This is because any estimator may be faced with arbitrarily slow convergence, by varying the distribution in the family. In other words we may be faced with a lack of uniformity. How do we control the convergence rate? Lastly, when learning is not possible, we may want to establish how gracefully an estimator can be made to fail. Understanding these limitations and accounting for them can be critical to the proper handling of data-scarce learning problems.

Author Contributions

Conceptualization, M.I.O.; methodology, E.M.; formal analysis, M.I.O. and E.M.; writing, M.I.O.

Funding

The first author was supported by NSF grants DMS 1106999 and CCF 1320105, by DOD ONR grant N00014-14-1-0823, and by grant 328025 from the Simons Foundation. The second author was supported by funds from the California Institute for Telecommunications and Information Technology (Calit2) while a postdoctoral researcher hosted by Alon Orlitsky at the University of California, San Diego.

Acknowledgments

The authors are grateful to the Simons Institute at the University of California, Berkeley for hosting the programs that make such collaborations possible.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. This Appendix Presents the Proof of Theorem 2

Let us first consider the tail estimation problem in the discrete case. $W_n$ is still well-defined here, but in order to clearly delineate the discrete case, let us define/rename the tail probability as:
$$T_n := \sum_{x > \max_i X_i} p(x), \tag{A1}$$
and let $\hat{T}_n$ denote an arbitrary estimator of $T_n$. It is immediate that $T_n$ is identical to $W_n$ in the discrete case. Next, note that in the proof of Theorem 1, the pivotal event $A_k$ given by Equation (6) forces the missing mass to be a tail probability (see the proof of Lemma 1). That is, under $A_k$, $T_{n_k} = M_{n_k}$. It therefore follows that Lemma 1 holds with $M_n$ and $\hat{M}_n$ replaced by $T_n$ and $\hat{T}_n$ respectively. Lemma 2 only depends on the definition of the pivotal event, not the missing mass. And the argument of Lemma 3 follows identically, with again $M_n$ and $\hat{M}_n$ replaced by $T_n$ and $\hat{T}_n$ everywhere in the statement and its proof. Consequently, the proof of Theorem 1 produces the following parallel result. There exist $\gamma > 1$ and a sequence $(n_k)_{k = 1, 2, \ldots}$ (both universal) such that for every $\hat{T}_n$ we can find a distribution $p^*$ (which does depend on $\hat{T}_n$), such that for all $k$ (a slight notational change):
$$\mathbb{P}_{p^*}\left[\frac{\hat{T}_{n_k}}{T_{n_k}} \notin [\gamma^{-1}, \gamma]\right] > \gamma - 1. \tag{A2}$$
So essentially, Theorem 2 is a direct corollary of our main result if we allow any distribution on the real line, including discrete ones. However, we want to show that the same result holds even over the family of properly continuous distributions, which have a density on $\mathbb{R}$ with respect to the Lebesgue measure (if the density is not otherwise restricted). The proof of this fact could be done by paralleling the proof thus far, dithering a continuous distribution instead. However, for the sake of novelty, we prove it instead via a reduction from discrete to continuous.
We first make a minor generalization of the discrete framework. Observe that nowhere in the proof of Theorem 1 was $\hat{M}_n$ required to be a deterministic function of the observations. So $\hat{M}_n$ and $\hat{T}_n$ could be randomized, i.e., depend on an additional random element $\zeta_n$ that is independent of the observations and of any of the parameters of the problem (see Definition 1). More rigorously, we include this randomness in the coupling of the proof by simply letting $\zeta_n = \zeta'_n$ always. This way, the samples being identical still implies that the values of the estimators are identical. This is the only property of the estimator we used, and thus all the arguments follow.
The rest of the proof essentially reduces the ability to learn the discrete tail with a randomized estimator to the ability to learn the continuous tail. The claimed impossibility follows from this reduction. Let us thus assume, for the sake of contradiction, that there exists $\hat{W}_n$ such that for all continuous distributions $F$ and for all $\eta > 1$:
$$\mathbb{P}_F\left[\frac{\hat{W}_n}{W_n} \notin [\eta^{-1}, \eta]\right] \to 0, \quad \text{as } n \to \infty. \tag{A3}$$
Next, recall that by construction we know, more precisely, that $p^*$ from earlier can be chosen from the $(\beta = \tfrac{1}{4}, m = 1)$-dithered geometric$(\tfrac{1}{2})$ family. It is easy to check that for all such distributions, the tail is comparable to the probability of its preceding symbol. In particular, it is not much smaller: there exists $\kappa < \infty$ such that for all $p$ in this family
$$\max_x \frac{p(x)}{\sum_{x' > x} p(x')} \le \kappa,$$
and we can always choose $\kappa > \gamma - 1$ (by capping from below). For reasons that will shortly become apparent, choose any $m$ such that
$$\left(1 - \frac{\gamma - 1}{\kappa}\right)^m < \frac{\gamma - 1}{2}.$$
Now let $G$ be any continuous distribution on $[0,1]$. Denote by $\bar{G} := 1 - G$ the corresponding tail probability function, and let $\bar{G}^{-1}(t) := \inf\{z : \bar{G}(z) \le t\}$ denote its left inverse. Note two properties: $\bar{G}(z) > t$ and $z < \bar{G}^{-1}(t)$ are equivalent, and $\bar{G}(\bar{G}^{-1}(t)) = t$.
Let $Z_{i,\ell}$ be i.i.d. draws from $G$, for $i = 1, \ldots, n$ and $\ell = 1, \ldots, m$. Let $\zeta_n = (Z_{i,\ell})_{i = 1, \ldots, n;\ \ell = 1, \ldots, m}$, and for a given observation sequence $x_1, \ldots, x_n$ in $\mathbb{N}_+^n$, define the randomized estimator:
$$\hat{T}_n(x_1, \ldots, x_n, \zeta_n) := \min_{\ell = 1, \ldots, m} \hat{W}_n(x_1 + Z_{1,\ell}, \ldots, x_n + Z_{n,\ell}). \tag{A4}$$
For this specific choice of estimator, let $p^*$ be the distribution that yields the impossibility result in Equation (A2). Let $X_i$ be i.i.d. draws from $p^*$, independent of the infinite array $(Z_{i,\ell})$. Observe that then, for each $\ell$, the sequence $Y_{i,\ell} := X_i + Z_{i,\ell}$ is i.i.d. according to a continuous distribution $F_\ell$, and has its own tail probability $W_{n,\ell} := 1 - F_\ell(\max_i Y_{i,\ell})$. Also, let the estimate of this tail probability for each $\ell$ be denoted by $\hat{W}_{n,\ell} := \hat{W}_n(Y_{1,\ell}, \ldots, Y_{n,\ell})$. Accordingly, based on our assumption, the limiting property in Equation (A3) holds and for any $\eta > 1$:
$$\mathbb{P}\left[\frac{\hat{W}_{n,\ell}}{W_{n,\ell}} \notin [\eta^{-1}, \eta]\right] \to 0, \quad \text{as } n \to \infty.$$
Please note that if for all $\ell$ we have $\eta^{-1} < \frac{a_\ell}{b_\ell} < \eta$, then also $\eta^{-1} < \frac{\min_\ell a_\ell}{\min_\ell b_\ell} < \eta$. Thus:
$$\mathbb{P}\left[\frac{\min_\ell \hat{W}_{n,\ell}}{\min_\ell W_{n,\ell}} \notin [\eta^{-1}, \eta]\right] \le \mathbb{P}\left[\bigcup_{\ell = 1}^{m} \left\{\frac{\hat{W}_{n,\ell}}{W_{n,\ell}} \notin [\eta^{-1}, \eta]\right\}\right] \le m\, \mathbb{P}\left[\frac{\hat{W}_{n,1}}{W_{n,1}} \notin [\eta^{-1}, \eta]\right] \to 0, \quad \text{as } n \to \infty. \tag{A5}$$
This means that $\min_\ell \hat{W}_{n,\ell} \equiv \hat{T}_n$ (recall the definition in Equation (A4)) is a good estimator of $\min_\ell W_{n,\ell}$. The contradiction we are after is to show that $\hat{T}_n$ thus defined is also a good estimator of $T_n$ (given by Equation (A1) with $p = p^*$), so we need to compare $\min_\ell W_{n,\ell}$ and $T_n$ and show that they are close.
Let us first fix one instance of $\ell$. For clarity, let us momentarily drop $\ell$ from the subscripts. For one such fixed instance, note that $W_n$ is a continuous tail that is coupled with $T_n$: they are related through the common values of $X$. For notational convenience, let $X_{\max} := \max_i X_i$ and $Y_{\max} := \max_i Y_i$. Now observe that $T_n$ is equal to $\sum_{x > X_{\max}} p^*(x)$ and it is exactly the $F$-measure of the interval $[X_{\max} + 1, \infty)$, by construction. On the other hand, $W_n$ is the $F$-measure of the interval $[Y_{\max}, \infty)$. Since $Y_{\max} \in [X_{\max}, X_{\max} + 1]$, it follows that $W_n \ge T_n$. How much larger can it get? By at most $p^*(X_{\max})$. More formally, we can write $F(y) = \sum_x p^*(x)\, G(y - x)$. Thus
$$W_n - T_n = \int_{Y_{\max}}^{X_{\max} + 1} F(\mathrm{d}y) = \sum_x p^*(x) \int_{Y_{\max}}^{X_{\max} + 1} G(\mathrm{d}(y - x)) = p^*(X_{\max})\, \bar{G}(Y_{\max} - X_{\max}),$$
since the inner integral is non-zero only for $x = X_{\max}$. Define the (random) set of maximizing observations as $I_{\max} = \{i : X_i = X_{\max}\}$. Now, let us bound the probability of any particular excess beyond $p^*(X_{\max})$ times a factor $t \in (0,1)$:
$$\begin{aligned} \mathbb{P}\left\{ W_n > T_n + p^*(X_{\max}) \cdot t \right\} &= \mathbb{P}\left\{ \bar{G}(Y_{\max} - X_{\max}) > t \right\} = \mathbb{P}\left\{ Y_{\max} - X_{\max} < \bar{G}^{-1}(t) \right\} = \mathbb{P}\left\{ \forall i \in I_{\max},\ Z_i < \bar{G}^{-1}(t) \right\} \\ &= \mathbb{E}\left[ \mathbb{P}\left( \bigcap_{i \in I_{\max}} \{Z_i < \bar{G}^{-1}(t)\} \,\Big|\, X_1, \ldots, X_n \right) \right] = \mathbb{E}\left[ \mathbb{P}\left\{ Z < \bar{G}^{-1}(t) \right\}^{|I_{\max}|} \right] \\ &\le \mathbb{P}\left\{ Z < \bar{G}^{-1}(t) \right\} = G(\bar{G}^{-1}(t)) = 1 - t, \end{aligned} \tag{A6}$$
where we conditioned on all the $X_i$, used the independence of the $Z_i$’s from the $X_i$’s to write the probability of the intersection as a product, used the fact that the $Z_i$ have the same distribution as a generic $Z$, and finally bounded the exponent $|I_{\max}|$ (the only term still depending on the $X_i$’s, and thus the only one affected by the outer expectation) below by 1. The only loss here is from this replacement. This, however, can be shown to be rather tight. For intuition, note that a geometric$(\tfrac{1}{2})$ has only a single maximizing observation in expectation, i.e., $\mathbb{E}[|I_{\max}|] \approx 1$. This is not good news, since $p^*(X_{\max}) / T_n = p^*(X_{\max}) / \sum_{x > X_{\max}} p^*(x)$ is bounded away from zero in the dithered geometric family, and thus this shows that we cannot expect $T_n$ to be arbitrarily close to a single $W_n$ with probability that is arbitrarily large. This is true regardless of the choice of $G$. This is the motivation behind taking the smallest of the $m$ continuous versions in the reduction, which restores the needed maneuverability for the approximation. Indeed, now restoring the $\ell$-notation:
$$\begin{aligned} \mathbb{P}\left\{ \min_{\ell = 1, \ldots, m} W_{n,\ell} > T_n + p^*(X_{\max}) \cdot t \right\} &= \mathbb{P}\left[ \bigcap_{\ell = 1}^{m} \left\{ W_{n,\ell} > T_n + p^*(X_{\max}) \cdot t \right\} \right] = \mathbb{P}\left[ \bigcap_{\ell = 1}^{m} \bigcap_{i \in I_{\max}} \left\{ Z_{i,\ell} < \bar{G}^{-1}(t) \right\} \right] \\ &= \mathbb{E}\left[ \mathbb{P}\left( \bigcap_{\ell = 1}^{m} \bigcap_{i \in I_{\max}} \{Z_{i,\ell} < \bar{G}^{-1}(t)\} \,\Big|\, X_1, \ldots, X_n \right) \right] = \mathbb{E}\left[ \mathbb{P}\left\{ Z < \bar{G}^{-1}(t) \right\}^{m |I_{\max}|} \right] \\ &\le \mathbb{P}\left\{ Z < \bar{G}^{-1}(t) \right\}^m = G(\bar{G}^{-1}(t))^m = (1 - t)^m, \end{aligned}$$
where the only notable observation is that the $m$ replicated versions compound the number of $Z$'s that must deviate. The rest is derived in the same way as in Equation (A6). Finally, using the fact that $\frac{p^*(X_{\max})}{T_n} = \frac{p^*(X_{\max})}{\sum_{x > X_{\max}} p^*(x)} \le \max_x \frac{p^*(x)}{\sum_{x' > x} p^*(x')} \le \kappa$, we have:
$$\mathbb{P}\left\{ \frac{\min_{\ell = 1, \ldots, m} W_{n,\ell}}{T_n} > 1 + \kappa t \right\} \le (1 - t)^m.$$
Specializing to $1 + \kappa t = \gamma$, using the fact that we chose $m$ such that $\left(1 - \frac{\gamma - 1}{\kappa}\right)^m < \frac{\gamma - 1}{2}$, and recalling that the ratio cannot be less than 1, we have for all $n$ that:
$$\mathbb{P}\left[ \frac{\min_{\ell = 1, \ldots, m} W_{n,\ell}}{T_n} \notin [\gamma^{-1}, \gamma] \right] < \frac{\gamma - 1}{2}. \tag{A8}$$
Also specializing Equation (A5) to $\eta = \gamma$, we have that for $n$ large enough:
$$\mathbb{P}\left[ \frac{\min_{\ell = 1, \ldots, m} W_{n,\ell}}{\hat{T}_n} \notin [\gamma^{-1}, \gamma] \right] \le \frac{\gamma - 1}{2}. \tag{A9}$$
By combining these two approximations, our reduction is complete. Namely, given our assumption (A3) that we can estimate the continuous tail, we have that for $n$ large enough we can estimate the discrete tail to our desired accuracy:
$$\mathbb{P}\left[ \frac{\hat{T}_n}{T_n} \notin [\gamma^{-1}, \gamma] \right] < \gamma - 1,$$
which clearly contradicts the impossibility (A2) of estimating the discrete tail. More precisely, since the continuous vs. discrete tail approximation in Equation (A8) does not depend on the assumption, it must be Equation (A9) that fails for each $n_k$. Recalling the bound of Equation (A5), we must have:
$$m\, \mathbb{P}\left[ \frac{\hat{W}_{n_k}}{W_{n_k}} \notin [\gamma^{-1}, \gamma] \right] \ge \mathbb{P}\left[ \frac{\min_{\ell} W_{n_k,\ell}}{\hat{T}_{n_k}} \notin [\gamma^{-1}, \gamma] \right] > \frac{\gamma - 1}{2}.$$
Finally, set $\epsilon = \min\left\{ \gamma - 1,\ 1 - \gamma^{-1},\ \frac{\gamma - 1}{2m} \right\}$ and combine with the above to obtain the exact claim of the theorem. Namely, for this absolute constant $\epsilon$ and subsequence $(n_k)$, given any $\hat{W}_n$ we can construct $F$ as we did (with an arbitrary $G$), such that $\mathbb{P}_F\left[ \frac{\hat{W}_{n_k}}{W_{n_k}} \notin [1 - \epsilon, 1 + \epsilon] \right] > \epsilon$ for all $k$.

Appendix B. This Appendix Presents the Proof of Theorem 4

Appendix B.1. Notation and Outline

Let us first set some notation. Recall that the mean of the geometric distribution $p_\alpha(x) = (1 - \alpha)\, \alpha^{x-1}$ is $\mu = \frac{1}{1 - \alpha}$ and its variance is $\sigma^2 = \frac{\alpha}{(1 - \alpha)^2}$. Let us write the empirical mean and our parameter estimate, respectively, as follows:
$$\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad \hat{\alpha}_n = 1 - \frac{1}{\hat{\mu}_n}.$$
The plug-in probability estimate can be expressed as:
$$\check{p}_n(x) := (1 - \hat{\alpha}_n)\, \hat{\alpha}_n^{x-1}.$$
Using our notation for the missing symbols, $E_n := \{x \in \mathbb{N}_+ : \hat{p}_n(x) = 0\}$, the missing mass is
$$M_n = p_\alpha(E_n) = \sum_{x \in E_n} (1 - \alpha)\, \alpha^{x-1},$$
and the suggested plug-in estimator can be written as
$$\check{M}_n := \check{p}_n(E_n) = \sum_{x \in E_n} (1 - \hat{\alpha}_n)\, \hat{\alpha}_n^{x-1}.$$
The following proof first establishes the convergence of the parameter estimate and then pushes it forward to the entire distribution, specializing in particular to the missing mass. For the latter, we establish some basic localization properties of the punctured segment of a geometric sample coverage. This is related to the general study of gaps, see, for example, [24].
We have the following elementary convergence property for the parameter.
Lemma A1
(Parameter Convergence). Let $\delta > 0$, and define:
$$\epsilon_n := \frac{\sqrt{\frac{\alpha}{\delta n}} \cdot \max\left\{1, \frac{1-\alpha}{\alpha}\right\}}{1 - \sqrt{\frac{\alpha}{\delta n}}}.$$
Then, for every $n > \frac{\alpha}{\delta}$, we have that with probability greater than $1 - \delta$:
$$\left|\frac{\hat{\alpha}_n}{\alpha} - 1\right| \le \epsilon_n \quad \text{and} \quad \left|\frac{1 - \hat{\alpha}_n}{1 - \alpha} - 1\right| \le \epsilon_n.$$
If we let $\eta_n = \epsilon_n / (1 - \epsilon_n)$, we can also write this as
$$\frac{1}{1 + \eta_n} \le \frac{\hat{\alpha}_n}{\alpha} \le 1 + \eta_n \quad \text{and} \quad \frac{1}{1 + \eta_n} \le \frac{1 - \hat{\alpha}_n}{1 - \alpha} \le 1 + \eta_n.$$
Proof. 
From Chebyshev’s inequality, we know that for all $\delta > 0$:
$$\mathbb{P}\left[\, |\hat{\mu}_n - \mu| \le \frac{\sigma}{\sqrt{\delta n}} \,\right] \ge 1 - \delta.$$
We now simply have to verify that $|\hat{\mu}_n - \mu| \le \frac{\sigma}{\sqrt{\delta n}}$ implies that both $\left|\frac{\hat{\alpha}_n}{\alpha} - 1\right|$ and $\left|\frac{1 - \hat{\alpha}_n}{1 - \alpha} - 1\right|$ are smaller than $\epsilon_n$. Indeed, using $\hat{\mu}_n \ge \mu - \frac{\sigma}{\sqrt{\delta n}}$:
$$\left|\frac{\hat{\alpha}_n}{\alpha} - 1\right| = \left|\frac{(\hat{\mu}_n - 1)\, \mu}{\hat{\mu}_n\, (\mu - 1)} - 1\right| = \left|(\hat{\mu}_n - \mu)\, \frac{1}{\hat{\mu}_n\, (\mu - 1)}\right| \le |\hat{\mu}_n - \mu|\, \frac{1}{\left(\mu - \frac{\sigma}{\sqrt{\delta n}}\right)(\mu - 1)}$$
and
$$\left|\frac{1 - \hat{\alpha}_n}{1 - \alpha} - 1\right| = \left|\frac{\mu}{\hat{\mu}_n} - 1\right| = \left|(\mu - \hat{\mu}_n)\, \frac{1}{\hat{\mu}_n}\right| \le |\hat{\mu}_n - \mu|\, \frac{1}{\mu - \frac{\sigma}{\sqrt{\delta n}}}.$$
Finally, since $|\hat{\mu}_n - \mu| \le \frac{\sigma}{\sqrt{\delta n}}$, both of these bounds are smaller than:
$$\frac{\sigma}{\sqrt{\delta n}} \cdot \frac{1}{\left(\mu - \frac{\sigma}{\sqrt{\delta n}}\right) \min\{1, \mu - 1\}},$$
which, after substituting $\mu = \frac{1}{1-\alpha}$ and $\sigma = \frac{\sqrt{\alpha}}{1-\alpha}$, is equal to $\epsilon_n$. The expression with $\eta_n$ follows from $1 - \epsilon_n = \frac{1}{1 + \eta_n}$ and $1 + \eta_n > 1 + \epsilon_n$. □
It follows from Lemma A1 that, with probability greater than $1 - \delta$, we have the following pointwise convergence of the distribution:
$$(1 + \eta_n)^{-x}\, (1 - \alpha)\, \alpha^{x-1} \le \check{p}_n(x) \le (1 + \eta_n)^{x}\, (1 - \alpha)\, \alpha^{x-1}.$$
Since the rate of this convergence is not uniform, we need to exercise care when specializing to particular events. We focus on the missing symbols’ event. We have:
$$\frac{\sum_{x \in E_n} (1 + \eta_n)^{-x}\, (1 - \alpha)\, \alpha^{x-1}}{\sum_{x \in E_n} (1 - \alpha)\, \alpha^{x-1}} \le \frac{\check{M}_n}{M_n} = \frac{\check{p}_n(E_n)}{p_\alpha(E_n)} \le \frac{\sum_{x \in E_n} (1 + \eta_n)^{x}\, (1 - \alpha)\, \alpha^{x-1}}{\sum_{x \in E_n} (1 - \alpha)\, \alpha^{x-1}}. \tag{A10}$$
The event E n is inconvenient to sum over, because it has points spread out randomly. This is particularly true for its initial portion, where the samples “puncture” it. It is more convenient to approximate this segment in order to bound Equation (A10). We now formalize this notion, via the following definition.
Definition A1
(Punctured Segment). The punctured segment of a sample is the part between the end of the first contiguous coverage and the end of the total coverage. Its extremities are:
$$V_n^- := \min E_n \quad \text{and} \quad V_n^+ := \max E_n^c.$$
We have the following localization property for the punctured segment of samples from a geometric distribution.
Lemma A2
(Localization of the Punctured Segment). Let $X_1, \ldots, X_n$ be samples from a geometric distribution $p_\alpha(x) = (1 - \alpha)\, \alpha^{x-1}$ on $\mathbb{N}_+$. Let $V_n^-$ and $V_n^+$ be the extremities of the punctured segment as defined in Definition A1. Then, for all $u > \left(\frac{\alpha}{1-\alpha}\right)^2$, we have:
$$\mathbb{P}\left\{ V_n^- < \log_{1/\alpha}(n) - \log_{1/\alpha}(u) \right\} < 2\, e^{-\frac{1-\alpha}{\alpha} u} < \frac{\alpha}{(1-\alpha)\, u}, \qquad \mathbb{P}\left\{ V_n^+ > \log_{1/\alpha}(n) + 1 + \log_{1/\alpha}(u) \right\} < \frac{1}{u}.$$
In particular, for $\delta < \frac{1-\alpha}{\alpha^2}$, we have that with probability greater than $1 - \delta$:
$$V_n^- \ge \log_{1/\alpha}(n) - \log_{1/\alpha}\frac{1}{(1-\alpha)\delta} \quad \text{and} \quad V_n^+ \le \log_{1/\alpha}(n) + 1 + \log_{1/\alpha}\frac{1}{(1-\alpha)\delta}.$$
Proof. 
Given an integer $a \ge 2$, the event $V_n^- < a$ implies that at least one of the symbols below $a$ did not appear in the sample. By the union bound, we thus have that:
$$\mathbb{P}\{V_n^- < a\} \le \sum_{x=1}^{a-1} \left(1 - (1-\alpha)\,\alpha^{x-1}\right)^n \le \sum_{x=1}^{a-1} \exp\left(-(1-\alpha)\, n\, \alpha^{x-1}\right) = \sum_{\ell=1}^{a-1} \exp\left(-(1-\alpha)\, n\, \alpha^{a-1-\ell}\right),$$
where we changed the summation index to $\ell = a - x$. By specializing to $a(u,n) = \left\lfloor \log_{1/\alpha}(n) + 1 - \log_{1/\alpha}(u) \right\rfloor$, so that $n\, \alpha^{a(u,n) - 1 - \ell} \ge u\, \alpha^{-\ell}$:
$$\mathbb{P}\left\{ V_n^- < \log_{1/\alpha}(n) - \log_{1/\alpha}(u) \right\} \le \mathbb{P}\left\{ V_n^- < a(u,n) \right\} \le \sum_{\ell=1}^{\infty} \exp\left(-(1-\alpha)\, \alpha^{-\ell}\, u\right).$$
Lastly, if $u > \left(\frac{\alpha}{1-\alpha}\right)^2$, one can show by induction that $(1-\alpha)\, \alpha^{-\ell}\, u \ge \frac{1-\alpha}{\alpha}\, u + \ell - 1$. This turns the sum into a geometric series, giving:
$$\mathbb{P}\left\{ V_n^- < \log_{1/\alpha}(n) - \log_{1/\alpha}(u) \right\} \le e^{-\frac{1-\alpha}{\alpha} u} \sum_{\ell=1}^{\infty} e^{-(\ell - 1)} < 2\, e^{-\frac{1-\alpha}{\alpha} u} < \frac{\alpha}{(1-\alpha)\, u}.$$
Next, note that $V_n^+$ is nothing but the maximum of the samples. Thus, given an integer $b \in \mathbb{N}_+$, the event $V_n^+ > b$ is the complement of the event that all the samples are at $b$ or below. Since the total probability of the range $1, \ldots, b$ is $1 - \alpha^b$, we thus have:
$$\mathbb{P}\{V_n^+ > b\} = 1 - (1 - \alpha^b)^n.$$
If we now specialize to $b(u,n) = \left\lceil \log_{1/\alpha}(n) + \log_{1/\alpha}(u) \right\rceil$, we have that:
$$\mathbb{P}\left\{ V_n^+ > \log_{1/\alpha}(n) + 1 + \log_{1/\alpha}(u) \right\} \le \mathbb{P}\left\{ V_n^+ > b(u,n) \right\} \le 1 - \left(1 - \alpha^{\log_{1/\alpha}(n) + \log_{1/\alpha}(u)}\right)^n = 1 - \left(1 - \frac{1}{u\, n}\right)^n < \frac{1}{u}.$$
For the last part of the claim, we let $u = \frac{1}{(1-\alpha)\delta}$, followed by a union bound over the two analyzed events. This gives that at least one of the two events holds with probability at most $\frac{1}{u} + \frac{\alpha}{(1-\alpha)\, u} = \delta$, and therefore neither holds with probability at least $1 - \delta$, as desired. □
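As an empirical illustration (ours, not part of the proof), one can simulate geometric samples and observe that the punctured segment indeed concentrates around $\log_{1/\alpha}(n)$ within an O(1) window; the parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n, trials = 0.8, 10_000, 200
log_base = np.log(n) / np.log(1 / alpha)               # log_{1/alpha}(n)

v_minus, v_plus = [], []
for _ in range(trials):
    x = rng.geometric(p=1 - alpha, size=n)
    seen = set(x.tolist())
    v_plus.append(max(seen))                           # V_n^+: largest observed symbol
    v = 1
    while v in seen:                                    # V_n^-: smallest missing symbol
        v += 1
    v_minus.append(v)

print(f"log_1/alpha(n) = {log_base:.2f}")
print(f"V_n^- in [{min(v_minus)}, {max(v_minus)}], V_n^+ in [{min(v_plus)}, {max(v_plus)}]")
```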

Appendix B.2. Completing the Proof

We now put together the pieces of the proof of Theorem 4. To show that our estimator learns the missing mass in relative error with respect to $\mathcal{G}$, we prove the following equivalent statement. Fix $\delta > 0$ and $\eta > 0$. We show that for $n$ large enough, with probability greater than $1 - 2\delta$, we have:
$$\frac{1}{1 + \eta} < \frac{\check{M}_n}{M_n} < 1 + \eta.$$
Without loss of generality, to satisfy the conditions of Lemmas A1 and A2, we restrict ourselves to $\delta < \frac{1-\alpha}{\alpha^2}$ (we can always choose a smaller $\delta$ than specified) and $n > \frac{\alpha}{\delta}$ (we can always ask for $n$ to be larger). As such, we have that with probability at least $1 - 2\delta$, both events of Lemmas A1 and A2 occur. We work under the intersection of these events.
We give the details only for the right tail of the convergence; all the steps can be directly paralleled for the left tail. To see why the punctured segment is a useful notion, we claim that the following quantity upper bounds the right-hand side of Equation (A10):
$$\frac{\sum_{x > V_n^+} (1 + \eta_n)^{x}\, (1-\alpha)\, \alpha^{x-1}}{\sum_{x > V_n^+} (1-\alpha)\, \alpha^{x-1}} = (1 + \eta_n)^{V_n^+}\, \frac{\sum_{y \in \mathbb{N}_+} (1 + \eta_n)^{y}\, (1-\alpha)\, \alpha^{y-1}}{\sum_{y \in \mathbb{N}_+} (1-\alpha)\, \alpha^{y-1}} = (1 + \eta_n)^{V_n^+}\, \frac{(1-\alpha)(1 + \eta_n)}{1 - \alpha (1 + \eta_n)}, \tag{A11}$$
where for the first equality we have used the change of variable $y = x - V_n^+$ and simplified the common factors of $\alpha$ in the numerator and denominator (the denominator of the middle expression equals 1), and for the second equality we have used the moment generating function of the geometric distribution, $\mathbb{E}[e^{sX}] = (1-\alpha)\, e^s / (1 - \alpha e^s)$, evaluated at $e^s = 1 + \eta_n$. To prove this claim, we proceed by induction, starting at step $t = 1$ with the set $G^{(1)} := \{V_n^+ + 1, V_n^+ + 2, \ldots\} \subseteq E_n$, adding at every step $t$ the largest element $z^{(t)}$ of $E_n$ not yet in $G^{(t-1)}$ to obtain $G^{(t)}$, and proving that:
$$\frac{\sum_{x \in G^{(t)}} (1 + \eta_n)^{x}\, (1-\alpha)\, \alpha^{x-1}}{\sum_{x \in G^{(t)}} (1-\alpha)\, \alpha^{x-1}} \le \frac{\sum_{x \in G^{(t-1)}} (1 + \eta_n)^{x}\, (1-\alpha)\, \alpha^{x-1}}{\sum_{x \in G^{(t-1)}} (1-\alpha)\, \alpha^{x-1}}.$$
We use the following basic property: for positive real numbers $a_1, b_1, a_2, b_2$, the following three inequalities are equivalent (these are mediant inequalities):
$$(i)\ a_1/b_1 \le a_2/b_2, \qquad (ii)\ a_1/b_1 \le (a_1 + a_2)/(b_1 + b_2), \qquad (iii)\ (a_1 + a_2)/(b_1 + b_2) \le a_2/b_2.$$
For the base case, let $a_2 = \sum_{x \in G^{(1)}} (1 + \eta_n)^{x} (1-\alpha) \alpha^{x-1}$ and $b_2 = \sum_{x \in G^{(1)}} (1-\alpha) \alpha^{x-1}$. We then choose the largest $z^{(1)} \in E_n \setminus G^{(1)}$ and we let $a_1 = (1 + \eta_n)^{z^{(1)}} (1-\alpha) \alpha^{z^{(1)} - 1}$ and $b_1 = (1-\alpha) \alpha^{z^{(1)} - 1}$. From (A11), noting that the last fraction there is always greater than 1, it follows that $a_2 / b_2 > (1 + \eta_n)^{V_n^+} > (1 + \eta_n)^{z^{(1)}} = a_1 / b_1$. We can thus add $z^{(1)}$ to the sum, and obtain $(a_1 + a_2)/(b_1 + b_2) \le a_2 / b_2$, establishing the base case. Please note that this also shows that $(a_1 + a_2)/(b_1 + b_2) \ge a_1 / b_1 = (1 + \eta_n)^{z^{(1)}}$. We pass this property down by induction, and we can assume it holds true at every step.
To continue the induction at step $t$, let $a_2 = \sum_{x \in G^{(t-1)}} (1 + \eta_n)^{x} (1-\alpha) \alpha^{x-1}$ and $b_2 = \sum_{x \in G^{(t-1)}} (1-\alpha) \alpha^{x-1}$. As noted, we may assume that $a_2 / b_2 \ge (1 + \eta_n)^{z^{(t-1)}}$ from the previous induction step. We then choose the largest $z^{(t)} \in E_n \setminus G^{(t-1)}$ and we let $a_1 = (1 + \eta_n)^{z^{(t)}} (1-\alpha) \alpha^{z^{(t)} - 1}$ and $b_1 = (1-\alpha) \alpha^{z^{(t)} - 1}$. Since $z^{(t)} < z^{(t-1)}$, it follows that $a_2 / b_2 \ge (1 + \eta_n)^{z^{(t-1)}} > (1 + \eta_n)^{z^{(t)}} = a_1 / b_1$. We can thus add $z^{(t)}$ to the sum, and obtain $(a_1 + a_2)/(b_1 + b_2) \le a_2 / b_2$, as desired. Note that this also shows that $(a_1 + a_2)/(b_1 + b_2) \ge a_1 / b_1 = (1 + \eta_n)^{z^{(t)}}$, and the induction is complete.
By combining this result with the analogous argument on the left side, we have shown that we can replace Equation (A10) by
$$\frac{\sum_{x \ge V_n^-} (1 + \eta_n)^{-x}\, (1-\alpha)\, \alpha^{x-1}}{\sum_{x \ge V_n^-} (1-\alpha)\, \alpha^{x-1}} \le \frac{\check{M}_n}{M_n} = \frac{\check{p}_n(E_n)}{p_\alpha(E_n)} \le \frac{\sum_{x > V_n^+} (1 + \eta_n)^{x}\, (1-\alpha)\, \alpha^{x-1}}{\sum_{x > V_n^+} (1-\alpha)\, \alpha^{x-1}},$$
or equivalently by
$$(1 + \eta_n)^{-V_n^- + 1}\, \frac{(1-\alpha)(1 + \eta_n)^{-1}}{1 - \alpha (1 + \eta_n)^{-1}} \le \frac{\check{M}_n}{M_n} \le (1 + \eta_n)^{V_n^+}\, \frac{(1-\alpha)(1 + \eta_n)}{1 - \alpha (1 + \eta_n)}. \tag{A12}$$
In Lemma A1 we have set
$$\eta_n = \frac{\epsilon_n}{1 - \epsilon_n}, \quad \text{with} \quad \epsilon_n := \frac{\sqrt{\frac{\alpha}{\delta n}} \cdot \max\left\{1, \frac{1-\alpha}{\alpha}\right\}}{1 - \sqrt{\frac{\alpha}{\delta n}}}.$$
On the other hand, by Lemma A2, we have that
$$V_n^+ \le \log_{1/\alpha}(n) + 1 + \log_{1/\alpha}\frac{1}{(1-\alpha)\delta} \quad \text{and} \quad V_n^- \ge \log_{1/\alpha}(n) - \log_{1/\alpha}\frac{1}{(1-\alpha)\delta}.$$
It follows that both bounds in Equation (A12) converge to 1, at a rate of roughly $\log(n)/\sqrt{n}$, instead of the parametric rate $1/\sqrt{n}$. Regardless, for any desired $\eta > 0$, there exists a large enough $n$ beyond which, with probability greater than $1 - 2\delta$, we satisfy:
$$\frac{1}{1 + \eta} \le \frac{\check{M}_n}{M_n} \le 1 + \eta.$$
This establishes that $\check{M}_n$ learns $M_n$, as desired.

References

  1. Good, I.J. The population frequencies of species and the estimation of population parameters. Biometrika 1953, 40, 237–264. [Google Scholar] [CrossRef]
  2. McAllester, D.A.; Schapire, R.E. On the Convergence Rate of Good–Turing Estimators; COLT: Hartford, CT, USA, 2000; pp. 1–6. [Google Scholar]
  3. McAllester, D.A.; Ortiz, L.E. Concentration inequalities for the missing mass and for histogram rule error. J. Mach. Learn. Res. 2003, 4, 895–911. [Google Scholar]
  4. Berend, D.; Kontorovich, A. On the concentration of the missing mass. Electron. Commun. Probab. 2013, 18, 1–7. [Google Scholar] [CrossRef]
  5. Ohannessian, M.I.; Dahleh, M.A. Rare Probability Estimation Under Regularly Varying Heavy Tails. JMLR Proc. 2012, 23, 21.1–21.24. [Google Scholar]
  6. Ben-Hamou, A.; Boucheron, S.; Ohannessian, M.I. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli 2017, 23, 249–287. [Google Scholar] [CrossRef] [Green Version]
  7. Taleb, N.N. The Black Swan: The Impact of the Highly Improbable; Random House: London, UK, 2008. [Google Scholar]
  8. Antos, A.; Lugosi, G. Strong minimax lower bounds for learning. Mach. Learn. 1998, 30, 31–56. [Google Scholar] [CrossRef]
  9. Beirlant, J.; Goegebeur, Y.; Segers, J.; Teugels, J. Statistics of Extremes: Theory and Applications; Wiley: Hoboken, NJ, USA, 2004. [Google Scholar]
  10. Beirlant, J.; Devroye, L. On the impossibility of estimating densities in the extreme tail. Stat. Probab. Lett. 1999, 43, 57–64. [Google Scholar] [CrossRef]
  11. Rajaraman, N.; Thangaraj, A.; Suresh, A.T. Minimax risk for missing mass estimation. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 3025–3029. [Google Scholar]
  12. Acharya, J.; Bao, Y.; Kang, Y.; Sun, Z. Improved Bounds for Minimax Risk of Estimating Missing Mass. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 326–330. [Google Scholar]
  13. Orlitsky, A.; Suresh, A.T. Competitive distribution estimation: Why is Good–Turing good. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2018; pp. 2143–2151. [Google Scholar]
  14. Valiant, G.; Valiant, P. Instance optimal learning of discrete distributions. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, Cambridge, MA, USA, 19–21 June 2016; pp. 142–155. [Google Scholar]
  15. Valiant, G.; Valiant, P. Estimating the unseen: an n/log (n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, San Jose, CA, USA, 6–8 June 2011; pp. 685–694. [Google Scholar]
  16. Wu, Y.; Yang, P. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. arXiv, 2015; arXiv:1504.01227. [Google Scholar]
  17. Orlitsky, A.; Suresh, A.T.; Wu, Y. Optimal prediction of the number of unseen species. Proc. Natl. Acad. Sci. USA 2016, 113, 13283–13288. [Google Scholar] [CrossRef]
  18. Falahatgar, M.; Ohannessian, M.I.; Orlitsky, A.; Pichapati, V. The power of absolute discounting: All-dimensional distribution estimation. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 6660–6669. [Google Scholar]
  19. Boucheron, S.; Gassiat, E.; Ohannessian, M.I. About adaptive coding on countable alphabets: Max-stable envelope classes. IEEE Trans. Inf. Theory 2015, 61, 4948–4967. [Google Scholar] [CrossRef]
  20. Kneser, R.; Ney, H. Improved smoothing for m-gram language modeling. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Detroit, MI, USA, 9–12 May 1995; pp. 679–682. [Google Scholar]
  21. MacKay, D.; Peto, L. A hierarchical Dirichlet language model. Nat. Lang. Eng. 1995, 1, 289–307. [Google Scholar] [CrossRef]
  22. Pitman, J.; Yor, M. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Ann. Probab. 1997, 25, 855–900. [Google Scholar] [CrossRef]
  23. Teh, Y.W. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (ACL), Sydney, Australia, 17–18 July 2006; pp. 985–992. [Google Scholar]
  24. Louchard, G.; Prodinger, H. On gaps and unoccupied urns in sequences of geometrically distributed random variables. Discret. Math. 2008, 308, 1538–1562. [Google Scholar] [CrossRef] [Green Version]
