On the Impossibility of Learning the Missing Mass
Abstract
1. Introduction
- We prove that there are no such estimators, thus providing a first “no free lunch” theorem for learning about rare events. The first insight to glean from this impossibility result is that one is justified in using further structural assumptions. Furthermore, the proof relies on an implicit construction that uses a dithered geometric distribution. In doing so, it shows that the failure of the Good–Turing estimator for light-tailed distributions is not a weakness of the procedure, but is rather due to a fundamental barrier. Conversely, the success of Good–Turing for heavier-than-geometric distributions and power laws shows its universality, in a restricted sense. In particular, in concrete support of folklore (e.g., [7]), we can state that for estimating probabilities of rare events, heavy tails are both necessary and sufficient.
- We extend this result to continuous tail estimation.
- We show, on a positive note, that upon restricting to parametric light-tailed families, learning may be possible. In particular, we show that for the geometric family the natural plug-in estimator learns the missing mass in relative error. As an ancillary result, we prove an instance-by-instance convergence rate, which can be interpreted as a weak sample complexity. For this, we establish some sharp concentration results for the gaps in geometric distributions, which may be of independent interest. (A minimal code sketch of both the Good–Turing and the plug-in estimators follows this list.)
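To make the two estimators concrete, here is a minimal sketch in Python (not taken from the paper; the function names and the parametrization of the geometric, p(x) = (1 − q)q^x on {0, 1, 2, …}, are our choices). The Good–Turing estimator approximates the missing mass by the fraction of the sample made up of singletons, N1/n, while the plug-in estimator for the geometric family fits q by maximum likelihood and returns the fitted mass of the unseen symbols.

```python
import numpy as np

def good_turing_missing_mass(sample):
    """Good-Turing estimate: the fraction of the sample made up of singletons, N1/n."""
    _, counts = np.unique(sample, return_counts=True)
    n1 = np.sum(counts == 1)        # number of symbols seen exactly once
    return n1 / len(sample)

def geometric_plugin_missing_mass(sample):
    """Plug-in estimate for the geometric family p(x) = (1 - q) * q**x on {0, 1, 2, ...}:
    fit q by maximum likelihood, then return the fitted mass of the unseen symbols."""
    sample = np.asarray(sample)
    q_hat = sample.mean() / (1.0 + sample.mean())   # MLE of q under this parametrization
    observed = np.unique(sample)
    return 1.0 - np.sum((1.0 - q_hat) * q_hat ** observed)

# Quick sanity check against the exact missing mass of a known geometric.
rng = np.random.default_rng(0)
q, n = 0.7, 10_000
sample = rng.geometric(1.0 - q, size=n) - 1          # numpy's geometric starts at 1; shift to {0, 1, ...}
true_mm = 1.0 - np.sum((1.0 - q) * q ** np.unique(sample))
print(good_turing_missing_mass(sample), geometric_plugin_missing_mass(sample), true_mm)
```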
2. Main Result
2.1. Proof Outline
- We show that there is a choice of the construction's parameters such that, for any estimator and any sufficiently large sample size, the resulting distribution forces a constant relative error with probability bounded away from zero.
- Then, at every step of the induction, we apply this guarantee at the current sample size to extend the construction.
- This induction produces an infinite sequence, and the desired distribution in Theorem 1 can be chosen as the one this sequence determines, since it is readily seen to satisfy the claim at each step, by construction.
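The barrier the construction exploits can already be seen empirically for the plain Good–Turing estimator under a fixed geometric distribution: the missing mass and the singleton count fluctuate on the same small scale, and their ratio does not settle near 1. Below is a minimal Monte Carlo sketch of this light-tailed behavior (an illustration only, with arbitrary parameter values; it is not the dithered construction used in the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
q, n, trials = 0.5, 100_000, 200

ratios = []
for _ in range(trials):
    # n i.i.d. draws from the geometric p(x) = (1 - q) * q**x on {0, 1, 2, ...}
    sample = rng.geometric(1.0 - q, size=n) - 1
    values, counts = np.unique(sample, return_counts=True)
    n1 = np.sum(counts == 1)                              # singletons
    true_mm = 1.0 - np.sum((1.0 - q) * q ** values)       # exact missing mass of this sample
    ratios.append((n1 / n) / true_mm)

# Relative-error consistency would make these ratios concentrate around 1;
# under a geometric (light) tail they stay spread out across trials.
print("Good-Turing / true missing mass: min={:.2f} median={:.2f} max={:.2f}".format(
    np.min(ratios), np.median(ratios), np.max(ratios)))
```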
2.2. Proof Details
2.2.1. Coupling
2.2.2. Pivotal Event
2.2.3. Induction Step
3. Discussions
3.1. Weak Versus Strong Distribution-Free Learning
3.2. Generalization to Continuous Tails
3.3. Learning in Various Families
3.4. Learning the Missing Mass in Additive Error and Learning Other Related Quantities
3.5. N-Gram Models and Bayesian Perspectives
4. Summary
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Appendix A. Proof of Theorem 2
Appendix B. Proof of Theorem 4
Appendix B.1. Notation and Outline
Appendix B.2. Completing the Proof
References
- Good, I.J. The population frequencies of species and the estimation of population parameters. Biometrika 1953, 40, 237–264.
- McAllester, D.A.; Schapire, R.E. On the Convergence Rate of Good–Turing Estimators; COLT: Hartford, CT, USA, 2000; pp. 1–6.
- McAllester, D.A.; Ortiz, L.E. Concentration inequalities for the missing mass and for histogram rule error. J. Mach. Learn. Res. 2003, 4, 895–911.
- Berend, D.; Kontorovich, A. On the concentration of the missing mass. Electron. Commun. Probab. 2013, 18, 1–7.
- Ohannessian, M.I.; Dahleh, M.A. Rare probability estimation under regularly varying heavy tails. JMLR Proc. 2012, 23, 21.1–21.24.
- Ben-Hamou, A.; Boucheron, S.; Ohannessian, M.I. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli 2017, 23, 249–287.
- Taleb, N.N. The Black Swan: The Impact of the Highly Improbable; Random House: London, UK, 2008.
- Antos, A.; Lugosi, G. Strong minimax lower bounds for learning. Mach. Learn. 1998, 30, 31–56.
- Beirlant, J.; Goegebeur, Y.; Segers, J.; Teugels, J. Statistics of Extremes: Theory and Applications; Wiley: Hoboken, NJ, USA, 2004.
- Beirlant, J.; Devroye, L. On the impossibility of estimating densities in the extreme tail. Stat. Probab. Lett. 1999, 43, 57–64.
- Rajaraman, N.; Thangaraj, A.; Suresh, A.T. Minimax risk for missing mass estimation. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 3025–3029.
- Acharya, J.; Bao, Y.; Kang, Y.; Sun, Z. Improved bounds for minimax risk of estimating missing mass. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 326–330.
- Orlitsky, A.; Suresh, A.T. Competitive distribution estimation: Why is Good–Turing good. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; pp. 2143–2151.
- Valiant, G.; Valiant, P. Instance optimal learning of discrete distributions. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, Cambridge, MA, USA, 19–21 June 2016; pp. 142–155.
- Valiant, G.; Valiant, P. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, San Jose, CA, USA, 6–8 June 2011; pp. 685–694.
- Wu, Y.; Yang, P. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. arXiv 2015, arXiv:1504.01227.
- Orlitsky, A.; Suresh, A.T.; Wu, Y. Optimal prediction of the number of unseen species. Proc. Natl. Acad. Sci. USA 2016, 113, 13283–13288.
- Falahatgar, M.; Ohannessian, M.I.; Orlitsky, A.; Pichapati, V. The power of absolute discounting: All-dimensional distribution estimation. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 6660–6669.
- Boucheron, S.; Gassiat, E.; Ohannessian, M.I. About adaptive coding on countable alphabets: Max-stable envelope classes. IEEE Trans. Inf. Theory 2015, 61, 4948–4967.
- Kneser, R.; Ney, H. Improved smoothing for m-gram language modeling. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Detroit, MI, USA, 9–12 May 1995; pp. 679–682.
- MacKay, D.; Peto, L. A hierarchical Dirichlet language model. Nat. Lang. Eng. 1995, 1, 289–307.
- Pitman, J.; Yor, M. The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. Ann. Probab. 1997, 25, 855–900.
- Teh, Y.W. A hierarchical Bayesian language model based on Pitman–Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (ACL), Sydney, Australia, 17–18 July 2006; pp. 985–992.
- Louchard, G.; Prodinger, H. On gaps and unoccupied urns in sequences of geometrically distributed random variables. Discret. Math. 2008, 308, 1538–1562.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).