1. Introduction
Modern information processing involves huge amounts of highly varied data. With the rapid expansion of data volume, traditional centralized data processing has gradually become unable to meet current needs, which motivates distributing the processing across all computers on a network.
When dealing with large amounts of data, it is inevitable that some observations are contaminated; such observations are generally called outliers. Outliers lead to low accuracy or high sensitivity in data processing tasks. Naturally, inferring probability density functions from contaminated samples is an important problem. Correspondingly, a dataset containing no outliers is called sane.
The Median-of-Mean (MoM) method is an effective way to deal with contaminated data: it divides the original data into several blocks, calculates the mean of each block, and then takes the median of these block means. The literature on MoM methods can be traced back to Ref. [
1]. In recent years, MoM methods have been widely used in the field of machine learning. For example, Ref. [
2] used the MoM method to design estimators for kernel mean embedding and maximum mean discrepancy with excessive resistance properties to outliers; Ref. [
3] applied the MoM method to achieve the optimal trade-off between accuracy and confidence under minimal assumptions in the classical statistical learning/regression problem; Ref. [
4] introduced an MoM method for robust machine learning without deteriorating the estimation properties of a given estimator which is also easily computable in practice; Ref. [
5] introduced a robust nonparametric density estimator combining the popular Kernel Density Estimation method and the Median-of-Means principle.
When using MoM methods to deal with contaminated data, the data often do not exhibit clear normal-distribution characteristics but rather the more general sub-Gaussian property; thus, non-asymptotic techniques are needed. Non-asymptotic inference is particularly advantageous in the finite-sample setting. Especially in the field of machine learning, non-asymptotic inference can establish rigorous error bounds for a given learning procedure (see Refs. [
6,
7,
8]). Moreover, when working with data, the exact distribution is often unknown; this calls for studying more general distribution classes such as sub-Gaussian, sub-exponential, heavy-tailed, and bounded distributions. For example, Ref. [
9] studied the non-asymptotic concentration of the heteroskedastic Wishart-type matrices; Ref. [
10] constructed sub-Gaussian estimators of a mean vector under adversarial contamination and heavy-tailed data by Median-of-Mean versions of the Stahel–Donoho outlyingness and of Median Absolute Deviation functions; Ref. [
11] obtained the deconvolution for some singular density errors via a combinatorial Median-of-Mean approach and assessed the estimator quality by establishing non-asymptotic risk bounds.
To obtain a clear picture of robust estimation from a non-asymptotic viewpoint, we mainly study variance-dependent MoM methods based on the binomial tail probability, covering both the uncontaminated and the contaminated case. The paper proceeds as follows. We first provide a variance-dependent deviation inequality for the MoM estimator by using bounds on binomial tails with unbounded samples, whose bound is tighter than the classical Hoeffding-based bound (see
Section 2). Then, by the variance-dependent MoM inequality, we obtain the generalization bound via entropic complexity (see
Section 3.1) and the non-asymptotic property via the sub-Gaussian intrinsic moment norm (see
Section 3.2). Finally, the variance-dependent MoM inequality with contaminated data is illustrated in
Section 4.
2. Variance-Dependent Median-of-Mean Estimator without Outliers
The MoM method was originally introduced on page 242 of Ref. [
1]; it improves on the empirical mean for heavy-tailed distributions while inheriting its efficiency for light-tailed distributions. The MoM estimator is derived as follows.
Without loss of generality, suppose that the sample data $X_1, X_2, \dots, X_n$ are decomposed into $K$ blocks $\mathcal{B}_1, \dots, \mathcal{B}_K$, with each block including $B$ observations, that is to say, $n = KB$. We first compute the mean of each block, which leads to the estimators
$$\hat{\mu}_k = \frac{1}{B} \sum_{i \in \mathcal{B}_k} X_i, \qquad k = 1, \dots, K,$$
and each estimator is based on $B$ observations. Then, the MoM estimator is given by the median of all these estimators, i.e.,
$$\hat{\mu}_{\mathrm{MoM}} := \mathrm{median}\left(\hat{\mu}_1, \dots, \hat{\mu}_K\right).$$
It turns out that, even under very mild conditions on the distribution, the MoM estimator enjoys a nice concentration inequality in the finite-sample case.
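To make the construction above concrete, the following Python sketch is purely illustrative: the block number, the Pareto data-generating distribution, and the random seed are arbitrary choices, not part of the theory. It computes the block means and their median, and compares the MoM estimate with the empirical mean on a heavy-tailed sample whose mean is known.

```python
import numpy as np

def median_of_means(x, K):
    """Median-of-Means estimate of the mean of x using K blocks."""
    x = np.asarray(x, dtype=float)
    B = x.size // K                       # observations per block; n = K * B is assumed
    block_means = x[:K * B].reshape(K, B).mean(axis=1)   # \hat{mu}_1, ..., \hat{mu}_K
    return np.median(block_means)                        # the MoM estimator

# Toy comparison on a heavy-tailed sample with a known mean.
rng = np.random.default_rng(0)
alpha = 2.5                                   # tail index > 2, so the variance is finite
sample = rng.pareto(alpha, size=10_000)       # Lomax/Pareto II sample with mean 1/(alpha - 1)
true_mean = 1.0 / (alpha - 1.0)
print("empirical mean error:", abs(sample.mean() - true_mean))
print("MoM (K = 50) error:  ", abs(median_of_means(sample, K=50) - true_mean))
```

In such heavy-tailed settings the MoM estimate is typically closer to the true mean than the empirical mean, which is exactly the behavior quantified by the concentration inequalities below.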
Given the i.i.d. sample $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$, using Hoeffding's inequality, Proposition 1 in Ref. [12] produces the following concentration inequality:
$$\mathbb{P}\left(\left|\hat{\mu}_{\mathrm{MoM}} - \mu\right| > t\right) \le \exp\left\{-2K\left(\frac{1}{2} - \frac{\sigma^2}{Bt^2}\right)^2\right\}, \qquad t \ge \sigma\sqrt{\frac{2}{B}};$$
see the detailed description in Remark 1.
When additional conditions are imposed on the distribution under consideration, tighter bounds can be obtained, such as our result based on binomial tails (Theorem 1).
In practice, the data must be split into blocks, and the minimum number of samples per block is often a concern because it involves a trade-off between efficiency and robustness; from a statistical point of view, the effect of the variance should also be taken into account. The following theorem accounts for the effect of the variance on the partitioning and yields the variance-dependent MoM inequality.
Theorem 1. Given the i.i.d. samples $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$, for , there exist and , such that . Then, the MoM estimator has the following concentration inequality:
where
A powerful feature of Theorem 1 is that the samples $X_i$ can be unbounded in this case. In addition, finite-sample exponential concentration is not easy to obtain if only the variance is assumed to exist (see Ref. [
13]). Moreover, Theorem 1 provides the basis for further obtaining the inequality with outliers. In the process of proving the theorem, we use the following lemma.
Lemma 1 (Theorem 1 of [
14]).
Suppose , and . If , then
where , and $\mathrm{KL}(a, p)$ is the KL divergence between Bernoulli distributions with parameters $a$ and $p$. If $an$ is not an integer, the bound still holds, but it can be tightened by replacing $a$ with $\lceil an \rceil / n$. A small numerical illustration of binomial-tail bounds of this kind is sketched below; we then give a detailed proof of Theorem 1.
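The following sketch is a minimal numerical check. It uses only the classical Chernoff bound $\mathbb{P}(S \ge k) \le \exp\{-n\,\mathrm{KL}(k/n \,\|\, p)\}$ for $k/n > p$ and $S \sim \mathrm{Binomial}(n, p)$, a weaker and well-known ancestor of the sharper bound of Ref. [14] quoted in Lemma 1; the values of $n$, $p$, and the thresholds are arbitrary illustrative choices.

```python
import math
from scipy.stats import binom

def kl_bernoulli(a, p):
    """KL divergence between Bernoulli(a) and Bernoulli(p), for a, p in (0, 1)."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

# Exact binomial tails versus the classical Chernoff-KL bound
# P(S >= k) <= exp(-n * KL(k/n || p)) for k/n > p, where S ~ Binomial(n, p).
# The bound of Ref. [14] quoted in Lemma 1 refines bounds of this type.
n, p = 200, 0.1
for k in (40, 60, 100):                       # thresholds with k/n > p
    exact = binom.sf(k - 1, n, p)             # P(S >= k)
    chernoff = math.exp(-n * kl_bernoulli(k / n, p))
    print(f"k = {k:3d}:  exact tail = {exact:.3e}   Chernoff-KL bound = {chernoff:.3e}")
```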
Proof of Theorem 1. First, observe that the event $\{|\hat{\mu}_{\mathrm{MoM}} - \mu| > t\}$ implies that at least $K/2$ of the block means $\hat{\mu}_1, \dots, \hat{\mu}_K$ have to be outside distance $t$ from $\mu$. Namely,
$$\left\{\left|\hat{\mu}_{\mathrm{MoM}} - \mu\right| > t\right\} \subseteq \left\{\sum_{k=1}^{K} \mathbb{1}\left\{\left|\hat{\mu}_k - \mu\right| > t\right\} \ge \frac{K}{2}\right\}.$$
Here, it is assumed that $K$ is an even number. When $K$ is an odd number, take at least $(K+1)/2$ instead, and the same argument applies. For convenience of writing, the proof below only treats the case of at least $K/2$, while the case of $(K+1)/2$ is no different.
Define $Z_k := \mathbb{1}\left\{\left|\hat{\mu}_k - \mu\right| > t\right\}$ for $k = 1, \dots, K$, and let $p_t := \mathbb{P}\left(\left|\hat{\mu}_k - \mu\right| > t\right)$. Note the theorem condition and Chebyshev's inequality (see p. 239 in Ref. [15]), which imply that there exist quantities as in the statement of Theorem 1 controlling $p_t$. In fact, the detailed derivation process is as follows:
$$p_t = \mathbb{P}\left(\left|\hat{\mu}_k - \mu\right| > t\right) \le \frac{\operatorname{Var}(\hat{\mu}_k)}{t^2} = \frac{\sigma^2}{B t^2}.$$
The random variables $Z_1, \dots, Z_K$ are i.i.d. because the samples $X_1, \dots, X_n$ are i.i.d. Applying Lemma 1 (with , , and in Lemma 1) to the summation $\sum_{k=1}^{K} Z_k$ gives
where .
Setting
for
satisfies Equation (
2); then,
When
, we set
so that
. Then, it follows that
for
and
.
Now, taking
gives
The function is a monotonically decreasing function, so its maximum is .
This then leads to the final result:
□
Remark 1. The classical result based on Hoeffding's inequality (see Proposition 1 in Ref. [12]) shows that
Similarly, to obtain a sharp constant, one can consider ; then, and the function achieves its unique maximum point at with . It follows that
Remark 2. The efficient interval of $t$ is an interesting issue. By the construction of , it follows that since .
Remark 3. In Theorem 1
, we substitute into inequality (
1)
to produce
Since , we have
This result is better than the bound of level-dependent sub-Gaussian estimators. Of course, our conditions are more stringent (see Proposition 12 in Ref. [16]).

3. Applications
In this section, we use the proposed sharper concentration inequalities for MoM estimators to develop two applications in statistical machine learning.
3.1. Concentration for Supremum of Variance-Dependent MoM Empirical Processes
Let and , where is a ball of the Lipschitz functions space and is a constant. Let .
To derive the concentration inequality for the supremum of variance-dependent MoM empirical processes, the following auxiliary Lemma 2 is necessary, whose proof is trivial and thus omitted.
Lemma 2. for where means the value of the function at the midpoint of the domain, and the same is true for .
By Lemma 2, for
, we have
Let
be a
-covering of
w.r.t.
. It is well-known that there exist constants
and
, such that
where
denotes the number of
-balls of radius
needed to cover class
, and
is a universal constant depending only on
.
Put
for simplicity. By definition of
, for
, s.t.
Then, by Theorem 1, the union bound for
gives that
Put
, i.e.,
; then, for and , we obtain the concentration inequality for the supremum of the variance-dependent MoM empirical process.
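The union-bound argument above can be mimicked numerically. The sketch below is purely illustrative and hypothetical: the finite family of clipped-distance functions merely plays the role of an $\epsilon$-covering of a Lipschitz ball, the standard normal design and the Monte Carlo approximation of $\mathbb{E} f(X)$ are arbitrary choices, and no claim is made that they match the constants in the bound above.

```python
import numpy as np

def median_of_means(x, K):
    blocks = np.asarray(x, dtype=float)[: (len(x) // K) * K].reshape(K, -1)
    return np.median(blocks.mean(axis=1))

rng = np.random.default_rng(1)
n, K = 5_000, 25
X = rng.standard_normal(n)

# Hypothetical finite net: clipped distance functions f_c(x) = min(|x - c|, 3),
# which are 1-Lipschitz and bounded, indexed by a grid of centers c.
centers = np.linspace(-3.0, 3.0, 13)
f = lambda x, c: np.minimum(np.abs(x - c), 3.0)

# Per-function MoM estimates and the supremum deviation over the net;
# E f(X) is approximated by an independent Monte Carlo sample.
X_mc = rng.standard_normal(1_000_000)
sup_dev = max(
    abs(median_of_means(f(X, c), K) - f(X_mc, c).mean()) for c in centers
)
print("sup over the net of the MoM deviation:", sup_dev)
```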
3.2. Concentration for Variance-Dependent MoM Intrinsic Moment Norm
A centered random variable $X$ is called sub-Gaussian if
$$\mathbb{E}\, e^{sX} \le e^{\sigma^2 s^2 / 2} \quad \text{for all } s \in \mathbb{R},$$
where the quantity $\sigma > 0$ is named the sub-Gaussian parameter. In non-asymptotic statistics, because collected sub-Gaussian data are often unstable, it is sometimes not possible to use the empirical moment-generating function directly to estimate variance-type parameters of sub-Gaussian distributions (see Ref. [
17]). This requires us to use the sub-Gaussian intrinsic moment norm for estimation. The definition of intrinsic moment norm is as follows.
Definition 1 (Intrinsic moment norm, see Definition 2 in Ref. [
17]).
The sub-Gaussian intrinsic moment norm is defined as
$$\|X\|_{M} := \sup_{k \ge 1} \left( \frac{\mathbb{E}\left[X^{2k}\right]}{(2k-1)!!} \right)^{1/(2k)},$$
where $(2k-1)!! = (2k-1)(2k-3)\cdots 3 \cdot 1$ for $k \ge 1$. As the amount of computation increases, so does the importance of the distributed MoM approach, with the corresponding intrinsic moment norm estimator defined below.
Definition 2 (see Equation (7) in Ref. [
17]).
Let and be the number of samples in the $s$-th block. The MoM estimator for the sub-Gaussian intrinsic moment norm is given by
where .
Definition 3. For any and , and .
Theorem 2. Suppose that, for and , there exists , such that , where is a finite constant sequence. Then, we have
and
Remark 4. Let ; we then obtain distributed samples that satisfy Theorem 2.
Remark 5. The key coefficient . In fact, the key coefficient of Theorem 3 in Ref. [
17]
without outliers is −0.125, as long as is taken. This means that our bound is better than the bound in Ref. [
17].
Proof of Theorem 2. From Definitions 1 and 2, we have
and
Recall that
and
are the sequences s.t.
and
for any
and
.
For the first inequality of Theorem 2, we have, by (
7),
where the last inequality is by Theorem 1 and the assumption in Theorem 2.
Let
. For the second inequality of Theorem 2, the definition of
implies
where the last inequality is by Theorem 1 and the assumption in Theorem 2. □
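As a closing illustration of this subsection, the following sketch estimates the intrinsic moment norm by a MoM-type procedure. It assumes the norm takes the form $\sup_{k \ge 1}\left(\mathbb{E}[X^{2k}]/(2k-1)!!\right)^{1/(2k)}$ of Definition 1 and, hypothetically, that the MoM step replaces each population moment by the median of per-block empirical moments; the truncation level k_max and the block number are arbitrary choices and need not coincide with the exact construction of Definition 2.

```python
import numpy as np
from scipy.special import factorial2

def mom_intrinsic_moment_norm(x, K, k_max=10):
    """MoM-style estimate of the intrinsic moment norm (illustrative form):
    for each k, take the median over blocks of the empirical 2k-th moments,
    normalize by (2k - 1)!!, and maximize over k = 1, ..., k_max."""
    x = np.asarray(x, dtype=float)
    blocks = x[: (x.size // K) * K].reshape(K, -1)
    values = []
    for k in range(1, k_max + 1):
        block_moments = np.mean(blocks ** (2 * k), axis=1)     # per-block 2k-th moments
        med = np.median(block_moments)                         # the MoM step
        values.append((med / factorial2(2 * k - 1)) ** (1.0 / (2 * k)))
    return max(values)

rng = np.random.default_rng(2)
x = rng.normal(0.0, 2.0, size=20_000)      # N(0, 4): the intrinsic moment norm equals 2
print("MoM intrinsic moment norm estimate:", mom_intrinsic_moment_norm(x, K=20))
```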
4. Concentration for Variance-Dependent MoM with Distribution-Free Outliers
In the field of big data and artificial intelligence, much work involves dealing with abnormal data. Sometimes we cannot identify each outlier directly, but we can obtain a rough idea of the total number of outliers. For example, there may be abnormal economic activities in a certain region; the specific companies or persons involved may not be known for the time being, yet the total number of companies and the total population in the region are still known.
Based on such information, accurately estimating the characteristics of a sample containing outliers is an important problem. In this section, we study the variance-dependent MoM estimator with outliers in the following theorem.
Theorem 3. Suppose that
(H.1) The sample contains i.i.d. inliers with finite mean and finite variance , together with outliers upon which no assumption is made.
(H.2) Set , where is the number of blocks containing at least one outlier and is the number of sane blocks containing no outlier. For , there exists a function , such that and , where .
Then, for , we have
Remark 6. For the numbers and , when one divides the $n$ samples evenly into $K$ blocks, an extreme case is to assume that the unfavorable blocks are entirely filled with outliers, say blocks, and the favorable blocks contain no outliers, say blocks; then, one has .
Remark 7. For the function , we can write down a concrete expression to show that such a function exists, for example, , where . However, such an expression is not unique, so stating the theorem with a generic function is more appropriate.
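As a hedged illustration of Theorem 3 and Remark 6 (the contamination size, the outlier magnitudes, and the Gaussian inliers below are arbitrary choices, on which the theorem itself makes no assumption), the following simulation contaminates a sane sample with a known number of outliers and uses a block number larger than twice that number, so that sane blocks remain in the majority even if every outlier falls into a different block.

```python
import numpy as np

def median_of_means(x, K):
    blocks = np.asarray(x, dtype=float)[: (len(x) // K) * K].reshape(K, -1)
    return np.median(blocks.mean(axis=1))

rng = np.random.default_rng(3)
n, n_outliers = 10_000, 40
inliers = rng.normal(loc=1.0, scale=2.0, size=n - n_outliers)   # finite mean (= 1) and variance
outliers = rng.uniform(1e3, 1e4, size=n_outliers)               # arbitrary, no assumption made
sample = rng.permutation(np.concatenate([inliers, outliers]))   # outlier positions unknown

K = 100          # K > 2 * n_outliers: even if every outlier falls in a different block,
                 # the sane blocks are still a strict majority
print("empirical mean:", sample.mean())               # heavily distorted by the outliers
print("MoM estimate:  ", median_of_means(sample, K))  # remains close to the true mean 1.0
```

The empirical mean is dragged far from the true value by even this small fraction of outliers, while the MoM estimate stays close to it, which is the qualitative content of Theorem 3.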
In fact, there is an adaptive way to generate block number K, but we do not show the specific calculation here; see Ref. [
18] for more detail. Now, we give a detailed proof of Theorem 3.
Proof of Theorem 3. If, among the sane blocks, the number of blocks whose sample mean is no more than $t$ from the population mean $\mu$ is at least , then the distance between the MoM estimator and the population mean $\mu$ is no more than $t$, which is mathematically expressed as follows: for , we have
Further, the following formula is established:
From the condition (H.4), we have
and
Applying Theorem 2 in Ref. [
14], we can obtain the lower bound of Formula (
11), i.e.,
where
,
and
On the other hand, by Chebyshev’s inequality (see p. 239 in Ref. [
15]), we have
Thus,
and
. Because of
and
, the inequality (
13) can be written as
where
The inequality (
16) holds, for example, if
is sufficiently close to 1/2.
From
, we have the minimum bound of
, i.e.,
Substituting Equation (
17) into Equation (
15), we have
Further, due to Relation (
14), we have
and
; then, the inequality (
18) can be bounded as
□
5. Conclusions
In this paper, we obtain bounds for variance-dependent MoM estimation based on the binomial tail probability, covering both the uncontaminated and the contaminated case. The non-asymptotic bounds for the uncontaminated MoM estimator have been shown to be sharper than the existing traditional Hoeffding-based results. As a next step, we will continue to investigate bounds for variance-dependent MoM estimation with outliers under sub-Gaussian or Weibull distributions. Compared with traditional exponential family distributions, it is more practical to study inequalities for these distributions (see Refs. [
19,
20]). We further plan to study application problems with a practical background.
Author Contributions
Conceptualization, G.T. and Y.L.; methodology, G.T. and Y.L.; formal analysis, B.T.; writing—original draft preparation, G.T.; writing—review and editing, Y.L. and J.L.; supervision, B.T.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by China Postdoctoral Science Foundation 2023M733852.
Data Availability Statement
This paper does not use any data.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Nemirovskij, A.S.; Yudin, D.B. Problem Complexity and Method Efficiency in Optimization; John Wiley & Sons Ltd.: Hoboken, NJ, USA, 1983.
- Lerasle, M.; Szabó, Z.; Mathieu, T.; Lecué, G. MONK outlier-robust mean embedding estimation by median-of-means. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3782–3793.
- Lugosi, G.; Mendelson, S. Risk minimization by median-of-means tournaments. J. Eur. Math. Soc. 2019, 22, 925–965.
- Lecué, G.; Lerasle, M. Robust machine learning by median-of-means: Theory and practice. Ann. Stat. 2020, 48, 906–931.
- Humbert, P.; Le Bars, B.; Minvielle, L. Robust kernel density estimation with median-of-means principle. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; p. 9444.
- Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019; Volume 48.
- Zhang, H.; Chen, S.X. Concentration Inequalities for Statistical Inference. Commun. Math. Res. 2021, 37, 1–85.
- Zhang, H.; Lei, X. Growing-dimensional Partially Functional Linear Models: Non-asymptotic Optimal Prediction Error. Phys. Scr. 2023, 98, 095216.
- Cai, T.T.; Han, R.; Zhang, A.R. On the non-asymptotic concentration of heteroskedastic Wishart-type matrix. Electron. J. Probab. 2022, 27, 1–40.
- Depersin, J.; Lecué, G. On the robustness to adversarial corruption and to heavy-tailed data of the Stahel–Donoho median of means. Inf. Inference J. IMA 2023, 12, 814–850.
- Marteau, C.; Sart, M. Deconvolution for some singular density errors via a combinatorial median of means approach. Math. Stat. Learn. 2023, 6, 51–85.
- Chen, Y. A Short Note on the Median-of-Means Estimator; University of Washington: Seattle, WA, USA, 2020; Available online: https://faculty.washington.edu/yenchic/short_note/note_MoM.pdf (accessed on 12 November 2020).
- Minsker, S. U-statistics of growing order and sub-Gaussian mean estimators with sharp constants. arXiv 2022, arXiv:2202.11842.
- Ferrante, G.C. Bounds on Binomial Tails With Applications. IEEE Trans. Inf. Theory 2021, 67, 8273–8279.
- Alsmeyer, G. Chebyshev's Inequality. In International Encyclopedia of Statistical Science; Lovric, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2011.
- Lerasle, M. Lecture Notes: Selected Topics on Robust Statistical Learning Theory. arXiv 2019, arXiv:1908.10761.
- Zhang, H.; Wei, H.; Cheng, G. Tight Non-asymptotic Inference via Sub-Gaussian Intrinsic Moment Norm. arXiv 2023, arXiv:2303.07287.
- Depersin, J.; Lecué, G. Robust sub-Gaussian estimation of a mean vector in nearly linear time. Ann. Stat. 2022, 50, 511–536.
- Hallinan, A.J., Jr. A review of the Weibull distribution. J. Qual. Technol. 1993, 25, 85–93.
- Xu, L.; Yao, F.; Yao, Q.; Zhang, H. Non-Asymptotic Guarantees for Robust Statistical Learning under Infinite Variance Assumption. J. Mach. Learn. Res. 2023, 24, 1–46.