1. Introduction
In classical statistical inference, the number of observations is usually known. If observations are collected in a fixed time span or we lack observations, the sample size may be a realization of a random variable. The number of failed devices in the warranty period, the number of new infections each week in a flu season, the number of daily customers in a supermarket or the number of traffic accidents per year are all random numbers.
Interest in studying samples with a random number of observations has grown steadily over the past few years. In medical research, the authors of [1,2,3] examine ANOVA models with unknown sample sizes for the analysis of fixed one-way effects in order to avoid false rejection. Applications of orthogonal mixed models to situations with samples of a random number of observations of a Poisson or binomial distributed random variable are presented. Based on a random number of observations, the authors of [4], Al-Mutairi and Raqab [5] and Barakat et al. [6] examined the mean with known and unknown variances and the variance in the normal model, confidence intervals for quantiles and prediction intervals for future observations for generalized order statistics. An overview of statistical inference for samples with random sample sizes, together with some applications, is given in [4]; see also the references therein.
When the non-random sample size is replaced by a random variable, the asymptotic features of statistics can change radically, as shown by Gnedenko [7]. The monograph by Gnedenko and Korolev [8] deals with limit distributions for randomly indexed sequences and their applications.
General transfer theorems for asymptotic expansions of the distribution function of statistics based on samples with non-random sample sizes to their analogues for samples of random sizes are proven in [9,10]. In these papers, rates of convergence and first-order expansions are proved for asymptotically normal statistics. The results depend on the rates of convergence with which the distributions of the normalized random sample sizes approach the corresponding limit distribution.
The difficulty of obtaining second-order expansions for the normalized random sample sizes beyond the rates of convergence was overcome by Christoph et al. [11]. Second-order expansions were proved by the authors of [11,12] for the random mean and the median of samples with random sample sizes, and by the authors of [13,14] for three geometric statistics of Gaussian vectors (the length of a vector, the distance between two vectors and the angle between two vectors associated with their correlation coefficient) when the dimension of the vectors is random.
The classical Chebyshev–Edgeworth expansions strongly influenced the development of asymptotic statistics. The fruitful interactions between Chebyshev–Edgeworth expansions and bootstrap methods are demonstrated in [15]. Detailed reviews of applications of Chebyshev–Edgeworth expansions in statistics were given by, e.g., Bickel [16] and Kolassa [17]. If the arithmetic mean of independent random variables is considered as the statistic, only the expected value and the dispersion are taken into account in the central limit theorem or in the Berry–Esseen inequalities. Two further important characteristics of random variables, skewness and kurtosis, have great influence on second-order expansions, provided that the corresponding moments exist. The Cornish–Fisher inversion of the Chebyshev–Edgeworth expansion allows the approximation of the quantiles of the test statistics used, for example, in many hypothesis tests. In [11], Theorems 3 and 6, and [12], Corollaries 6.2 and 6.3, Cornish–Fisher expansions for the random mean and median from samples with random sample sizes are obtained. In the same way, Cornish–Fisher expansions for the quantiles of the statistics considered in the present paper can be derived from the corresponding Chebyshev–Edgeworth expansions.
In the present paper, we continue our research on approximations when the sample sizes are random. To the best of our knowledge, Chebyshev–Edgeworth-type expansions for asymptotically chi-square statistics have not yet been proven in the literature when the sample sizes are random.
The article is structured as follows. Section 2 describes statistical models with random numbers of observations, the assumptions about statistics and random sample sizes and transfer propositions from samples with non-random to random sample sizes. Section 3 presents statistics with non-random sample sizes with Chebyshev–Edgeworth expansions based on standard normal or chi-square distributions. Corresponding expansions of the negative binomial or discrete Pareto distributions as random sample sizes are considered in Section 4. Section 5 describes the influence of non-random, random or mixed normalization factors on the limit distributions of the examined statistics that are based on samples with random sample sizes. Besides the common Student’s t, normal and Laplace distributions, inverse Pareto, generalized gamma and generalized Laplace as well as weighted sums of generalized gamma distributions also occur as limit laws. The main results for statistic families with different normalization factors and examples are given in Section 6. To prove statements about a family of statistics, formal constructions for the expansions are worked out in Section 7, which are used in Section 8 to prove the theorems. Conclusions are drawn in Section 9. We leave four auxiliary lemmas to Appendix A.
2. Statistical Models with a Random Number of Observations
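Let $X_1, X_2, \ldots \in \mathbb{R}$ and $N_1, N_2, \ldots \in \mathbb{N}_+ = \{1, 2, \ldots\}$ be random variables defined on a common probability space $(\Omega, \mathcal{A}, \mathbb{P})$. The random variables $X_1, \ldots, X_m$ denote the observations and form the random sample with a non-random sample size $m \in \mathbb{N}_+$. Let $T_m := T_m(X_1, \ldots, X_m)$ be some statistic obtained from the sample $\{X_1, \ldots, X_m\}$. Consider now the sample $X_1, \ldots, X_{N_n}$. The random variable $N_n \in \mathbb{N}_+$ denotes the random size of the underlying sample, that is, the random number of observations, depending on a parameter $n \in \mathbb{N}_+$. We suppose for each $n \in \mathbb{N}_+$ that $N_n$ is independent of the random variables $X_1, X_2, \ldots$ and that $N_n \to \infty$ in probability as $n \to \infty$.
Let $T_{N_n}$ be a statistic obtained from a random sample $X_1, \ldots, X_{N_n}$ defined as
$$T_{N_n}(\omega) := T_{N_n(\omega)}\big(X_1(\omega), \ldots, X_{N_n(\omega)}(\omega)\big) \quad \text{for all } \omega \in \Omega \text{ and } n \in \mathbb{N}_+.$$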
2.1. Assumptions on Statistics and Random Sample Sizes
In what follows, we restrict ourselves to those terms in the expansions that are used below.
We assume that the following condition for the statistic with from a sample with non-random sample size is fulfilled:
Assumption 1. There are differentiable functions for all distribution function and bounded functions , and real numbers , and so that for all integers
Remark 1. In contrast to Bening et al. [10], the differentiability of , and is only required for . In the present article, in addition to the normal distribution, the chi-square distribution with p degrees of freedom is used as , which is not differentiable in if or .
The distribution functions of the normalized random variables satisfy the following condition:
Assumption 2. A distribution function with , a function of bounded variation , a sequence and real numbers and exist so that for all integers
2.2. Transfer Proposition from Samples with Non-Random to Random Sample Sizes
Assumptions 1 and 2 allow the construction of expansions for distributions of normalized random-size statistics based on approximate results for fixed-size normalized statistics in (1) and for the random size in (2).
Proposition 1. Suppose and the statistic and the sample size satisfy Assumptions 1 and 2. Then, for all , the following inequality applies: where and are given in (1) and (2). The constants do not depend on n.
General transfer theorems with more terms are proved in [9,10] for .
Remark 2. The approximation function is not a polynomial in and . The domain of integration in (4) depends on . Some of the integrals in (4) could tend to infinity with as . The following statement clarifies the problem.
Proposition 2. In addition to the conditions of Proposition 1, let the following conditions be satisfied on the functions and , depending on the rate of convergence in (2): Then, for the function defined in (4), one has with
Remark 3. The lower limit of integration in to in (10) and (11) depends on . If the sample size is negative binomial distributed with, e.g., or and (see (28) below), then both and have order and not or , as it seems at first glance.
Remark 4. The additional conditions (6)–(8) make it possible to extend the integration range of the integrals in (9) from to .
Proof of Propositions 1 and 2. The proof of Proposition 1 follows along arguments similar to those of the more general Transfer Theorem 3.1 in [10] for . The proof was also adapted by Christoph and Ulyanov [13] to negative . Therefore, Proposition 1 applies to .
The present Propositions 1 and 2 differ from Theorems 1 and 2 in [13] only by the additional term and the added condition (7) to estimate this additional term. Therefore, the details are omitted here. □
Remark 5. In Appendix 2 of the monograph by Gnedenko and Korolev [8], asymptotic expansions for generalized Cox processes are proved (see Theorems A2.6.1–A2.6.3). As random sample size, the authors considered a Cox process controlled by a Poisson process (also known as a doubly stochastic Poisson process) and proved asymptotic expansions for the random sum , where are independent identically distributed random variables. For each , the random variables are independent. The above-mentioned theorems are close to Proposition 1. The structure of the functions in (4) and the bounds on the right-hand side of inequality (3) in Proposition 1 differ from the corresponding terms in Theorems A2.6.1–A2.6.3. Thus, the bounds contain little o-terms.
4. Chebyshev–Edgeworth Expansions for Distributions of Normalized Random Sample Sizes
As in the articles by, e.g., Bening et al. [9,10], Christoph et al. [11,12] and Christoph and Ulyanov [13,14], we consider as random sample sizes the negative binomial random variable and the maximum of n independent discrete Pareto random variables where and are parameters.
“The negative binomial distribution is one of the two leading cases for count models, it accommodates the overdispersion typically observed in count data (which the Poisson model cannot)” [26]. Moreover, and tends to the gamma distribution with identical shape and rate parameters .
On the other hand, the mean for the discrete Pareto-like variable does not exist, yet tends to the inverse exponential distribution with scale parameter .
Remark 8. The authors of [1,2,3,4,27], among others, considered the binomial or Poisson distributions as random number N of observations. If is binomial (with parameters n and ) or Poisson (with rate ) distributed, then tends to the distribution degenerate at 1 as . Therefore, Assumption 2 for the Transfer Proposition 1 is not fulfilled. On the other hand, since binomial or Poisson sample sizes are asymptotically normally distributed, if the statistic is also asymptotically normally distributed, then so is the statistic (see [28]). Chebyshev–Edgeworth expansions for lattice distributed random variables exist so far only with bounds of small-o or large-O order (see [29]). For (2) in Assumption 2, computable error bounds are required because the constant in (3) depends on (see also Remark 6 on large-O bounds and computable error bounds).
4.1. The Random Sample Size Has Negative Binomial Distribution with Success Probability
The sample size
has a negative binomial distribution shifted by 1 with the parameters
and
, the probability mass function
and
. Bening and Korolev [
30] and Schluter and Trede [
26] showed
where
is the gamma distribution function with its density
In addition to the expansion of , a bound of the negative moment in (3) is required, where is the rate of convergence of the Chebyshev–Edgeworth expansion for in (1).
Proposition 3. Suppose that and the discrete random variables have probability mass function (28) with . Then, for all , where the constant does not depend on n and Moreover, the negative moments fulfill the estimate for all , and the convergence rate in the case cannot be improved.
Proof. In [10] (Formula (21)) and in [31] (Formula (11)), the convergence rate is reported for the case . In [11] (Theorem 1), the Chebyshev–Edgeworth expansion for is proved. In the case , for the geometrically distributed random variable with success probability the proof is straightforward:
where and . Hence, (31) holds for .
In [12] (Corollary 4.2), leading terms for the negative moments of are derived, which lead to (34). □
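The convergence of the normalized negative binomial sample size to its gamma limit can be illustrated numerically. The following Python sketch is not part of the original derivation and rests on several assumptions: the shifted negative binomial law is parametrized with success probability 1/n, so that its mean is r(n − 1) + 1; this mean is used as the normalizing sequence; the shape parameter r is restricted to positive integers so that the gamma distribution function has the elementary Erlang form; and the function names `erlang_cdf` and `sup_distance` are hypothetical.

```python
import math


def erlang_cdf(x: float, r: int) -> float:
    """CDF of the gamma law with integer shape r and rate r (mean 1)."""
    if x <= 0.0:
        return 0.0
    z = r * x
    term, total = 1.0, 1.0  # k = 0 term of the finite Poisson sum
    for k in range(1, r):
        term *= z / k
        total += term
    return 1.0 - math.exp(-z) * total


def sup_distance(n: int, r: int, x_max: float = 30.0) -> float:
    """Largest gap between P(N_n / E N_n <= x) and the gamma limit on (0, x_max].

    Assumed model: P(N_n = j) = Gamma(j + r - 1) / (Gamma(j) Gamma(r))
                               * p**r * (1 - p)**(j - 1),  j = 1, 2, ...,  p = 1 / n.
    """
    p = 1.0 / n
    g_n = r * (n - 1) + 1      # E N_n under the assumed parametrization
    pmf = p ** r               # P(N_n = 1)
    cdf_left, worst = 0.0, 0.0
    for j in range(1, int(x_max * g_n) + 1):
        cdf_right = cdf_left + pmf
        limit = erlang_cdf(j / g_n, r)
        # A step function and a continuous CDF differ most at the jump points,
        # so both one-sided values are compared with the limit there.
        worst = max(worst, abs(cdf_left - limit), abs(cdf_right - limit))
        cdf_left = cdf_right
        pmf *= (j + r - 1) / j * (1.0 - p)  # recurrence P(N_n = j + 1) / P(N_n = j)
    return worst


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, sup_distance(n, r=2))
```

Under these assumptions, the computed distance for n = 1000 is markedly smaller than for n = 10. The distance is evaluated only at the lattice points of the normalized variable, which is sufficient because the limit distribution function is continuous and increasing.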
Remark 9. The negative binomial random variables satisfy (2) in Assumption 2 and the additional conditions (6), (7) and (8) in Proposition 2 with , , and . The jumps of the distribution function only affect the function in the term .
4.2. The Random Sample Size Is the Maximum of n Independent Discrete Pareto Variables
We consider the continuous Pareto Type II (Lomax) distribution function
The discrete Pareto II distribution
is obtained by discretizing the continuous Pareto distribution
, :
. The random variable
is the discrete counterpart on the positive integers to the continuous random variable
. Both random variables
and
have shape parameter 1 and scale parameter
(see [
32]). The discrete Pareto distributed
has probability mass and distribution functions:
Let
be a sequence of independent random variables with the common distribution function (
35). Define
The random variable is extremely spread over the positive integers.
Proposition 4. Consider the discrete random variable with distribution function (36). Then, where does not depend on n and is defined in (33). Moreover, where for the order of the bound is optimal.
The Chebyshev–Edgeworth expansion (37) is proved in [11] (Theorem 4). In [12] (Corollary 5.2), leading terms for the negative moments are derived, which lead to (39).
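The approach of the normalized maximum of discrete Pareto variables to the inverse exponential law can be checked in the same numerical spirit. The sketch below is an illustration under stated assumptions, not the authors' computation: it assumes that a single discrete Pareto variable has distribution function k/(s + k) on the positive integers, that the maximum of n independent copies is normalized by n, and that the limit distribution function is exp(−s/y); the function name `sup_distance_pareto_max` is hypothetical.

```python
import math


def sup_distance_pareto_max(n: int, s: float = 1.0, y_max: float = 50.0) -> float:
    """Largest gap between P(N_n / n <= y) and exp(-s / y) on (0, y_max].

    Assumed model: N_n is the maximum of n independent discrete Pareto
    variables with P(Y <= k) = k / (s + k), hence P(N_n <= k) = (k / (s + k))**n.
    """
    worst = 0.0
    cdf_left = 0.0  # P(N_n <= 0) = 0
    for k in range(1, int(y_max * n) + 1):
        cdf_right = (k / (s + k)) ** n
        limit = math.exp(-s * n / k)  # inverse exponential CDF at y = k / n
        # Compare both one-sided values of the step function at the jump point.
        worst = max(worst, abs(cdf_left - limit), abs(cdf_right - limit))
        cdf_left = cdf_right
    return worst


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, sup_distance_pareto_max(n))
```

The range is truncated at y_max because both distribution functions approach 1 only slowly, which reflects the heavy tail of the limit law; the reported value is therefore the maximal gap on the truncated range only.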
Remark 10. Let the random variable be exponentially distributed with rate parameter . Then, is an inverse exponentially distributed random variable with the continuous distribution function . Both and are heavy-tailed with shape parameter 1.
Remark 11. Since and for all , we choose as normalizing factor for in (37).
Remark 12. The random sample sizes satisfy (2) in Assumption 2 and the additional conditions (6)–(8) in Proposition 2 with , , and . The jumps of the distribution function only affect the function in the term .
Remark 13. Lyamin [33] proved a bound for integers .
5. Limit Distributions of Statistics with Random Sample Sizes Using Different Scaling Factors
The statistic from a sample with non-random sample size fulfills condition (1) in Assumption 1. Instead of the non-random sample size m, we consider a random sample size satisfying condition (2) in Assumption 2. Let be a sequence with as . Consider the scaling factor by the statistics with if and or if and . Then, conditioning on and using (1) and (2), we have
If there exists a limit distribution of as , then it has to be a scale mixture with parent distribution and positive mixing parameter :
(see, e.g., [23,34], Chapter 13, and [19] and the references therein).
Remark 14. Formula (40) shows that different normalization factors at lead to different scale mixtures of the limit distribution of the normalized statistics .
5.1. The Case and
The statistics (15), (18) and (21) considered in Section 3.1 have normal approximations . The limit distribution for the normalized random sample size is the gamma distribution with density (30). We investigate the dependence of the limit distributions in as for .
(i) If , then the limit distribution is Student's t distribution having density
(ii) If , the limit law is the standard normal law with density .
(iii) For , the generalized Laplace distributions occur with density (see [13], Section 5.1.3):
where is the Macdonald function of order α or modified Bessel function of the third kind with index . The function is also sometimes called a modified Bessel function of the second kind of order . For properties of these functions, see, e.g., Chapter 51 in [35] or the Appendix on Bessel functions in [36].
If , the so-called Sargan densities and their distribution functions are computable in closed form (see Formulas (63)–(65) below in Section 7):
where for .
The double exponential or standard Laplace density is with variance 1 and distribution function given in (43). The Sargan distributions are therefore a generalisation of the standard Laplace distribution.
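Two of the scale mixtures discussed in this subsection can be verified numerically. The sketch below is an illustration under assumptions rather than a reproduction of the stripped formulas: it assumes that the mixing variable Y follows the gamma law with shape and rate both equal to r, that the Student case corresponds to mixing the normal distribution function at x·Y^{1/2}, and that the Laplace case corresponds to mixing at x·Y^{−1/2} with r = 1. The well-known facts used for comparison are that the first mixture is Student's t with 2r degrees of freedom and the second is the Laplace law with variance 1. All function names are hypothetical, and the integral is approximated by a plain midpoint rule.

```python
import math


def gamma_density(y: float, r: float) -> float:
    """Density of the gamma law with shape r and rate r (mean 1)."""
    return r ** r / math.gamma(r) * y ** (r - 1) * math.exp(-r * y)


def normal_density(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def mixture_density(x: float, r: float, power: float,
                    upper: float = 40.0, steps: int = 200_000) -> float:
    """Midpoint-rule value of  int_0^upper y^power * phi(x * y^power) * g_{r,r}(y) dy."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        scale = y ** power
        total += scale * normal_density(x * scale) * gamma_density(y, r)
    return total * h


def student_t_density(x: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + x * x / df) ** (-(df + 1) / 2)


def laplace_density(x: float) -> float:
    """Laplace density with variance 1."""
    return math.exp(-math.sqrt(2.0) * abs(x)) / math.sqrt(2.0)


if __name__ == "__main__":
    x = 1.0
    print(mixture_density(x, 2, 0.5), student_t_density(x, 4))
    print(mixture_density(x, 1, -0.5), laplace_density(x))
```

The midpoint rule is chosen only for transparency. With power = −0.5 the integrand is singular at y = 0 when x = 0, so the check is meant for x ≠ 0.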
5.2. The Case and
The statistics considered in Section 3.2 asymptotically approach the chi-square distribution . The limit distribution for the normalized random sample size is the inverse exponential distribution .
(i) If , then the generalized gamma distribution occurs with density :
where the Macdonald function already appears in Formula (42) with different and argument. For , where m is an integer, the Macdonald function has a closed form (see Formulas (63)–(65) below in Section 7). Therefore, if is an odd number, then the density may be calculated in closed form. The distribution functions with density functions for and are
Remark 15. Functions in (45) are Weibull density and distribution functions, in (46) there are density and distribution functions of a generalized gamma distribution, but and are even more general. The family of generalized gamma distributions contains many absolutely continuous distributions concentrated on the non-negative half-line.
Remark 16. The generalized gamma distribution corresponds to the density where α and r are the two shape parameters and λ the scale parameter. The density representation (48) is suggested in the work of Korolev and Zeifman [37] or Korolev and Gorshenin [38], and many special cases are listed therein. In addition to, e.g., Gamma and Weibull distributions (with α > 0), inverse Gamma, Lévy and Fréchet distributions (with α < 0) also belong to that family of generalized gamma distributions.
Remark 17. The Weibull density in (45) is . Moreover, . The densities , are weighted sums of generalized gamma distribution with different shape parameters r, e.g.,
(ii) If , the standard normal law is the limit distribution with density .
(iii) If , the inverse Pareto distribution occurs as limit distribution with shape parameter , scale parameter and density :
In [39], a robust and efficient estimator for the shape parameter of the inverse Pareto distribution and applications are given.
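The appearance of an inverse Pareto limit in case (iii) can be made plausible with a short numerical check. The sketch below rests on assumptions, since the corresponding formulas are not reproduced here: it assumes that the mixing variable Y has the inverse exponential distribution function exp(−s/y), so that U = 1/Y is exponential with rate s, and that the relevant mixture is E[G_p(xU)], where G_p is the chi-square distribution function with p degrees of freedom. A Laplace-transform computation then gives (x/(x + 2s))^{p/2}, an inverse Pareto law with shape p/2 and scale 2s. The restriction to even p and all function names are choices made only for this illustration.

```python
import math


def chi2_cdf_even(z: float, p: int) -> float:
    """Chi-square CDF for an even number p of degrees of freedom (closed form)."""
    if z <= 0.0:
        return 0.0
    half = z / 2.0
    term, total = 1.0, 1.0  # k = 0 term of the finite Poisson sum
    for k in range(1, p // 2):
        term *= half / k
        total += term
    return 1.0 - math.exp(-half) * total


def mixture_cdf(x: float, p: int, s: float,
                upper: float = 60.0, steps: int = 100_000) -> float:
    """Midpoint-rule value of E[G_p(x U)] for U exponential with rate s."""
    h = upper / (s * steps)  # integrate u over [0, upper / s]
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        total += chi2_cdf_even(x * u, p) * s * math.exp(-s * u)
    return total * h


def inverse_pareto_cdf(x: float, shape: float, scale: float) -> float:
    """Inverse Pareto CDF (x / (x + scale))**shape for x > 0."""
    return (x / (x + scale)) ** shape if x > 0.0 else 0.0


if __name__ == "__main__":
    x, p, s = 3.0, 4, 1.5
    print(mixture_cdf(x, p, s), inverse_pareto_cdf(x, p / 2, 2 * s))
```

For x = 3, p = 4 and s = 1.5, the closed form equals (3/6)^2 = 0.25, which the numerical mixture should reproduce up to the quadrature error.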
9. Conclusions
Chebyshev–Edgeworth expansions are derived for the distributions of various statistics from samples with random sample sizes. The construction of these asymptotic expansions is based on the given asymptotic expansions for the distributions of statistics of samples with fixed sample sizes as well as those of the distributions of the random sample sizes.
The asymptotic laws are scale mixtures of the underlying standard normal or chi-square distributions with gamma or inverse exponential mixing distributions. The results hold for a whole family of asymptotically normal or chi-squared statistics since a formal construction of asymptotic expansions is developed. In addition to the random sample size, a normalization factor for the examined statistics also has a significant influence on the limit distribution. As limit laws, Student's t, standard normal, Laplace, inverse Pareto, generalized gamma, generalized Laplace and weighted sums of generalized gamma distributions occur. As statistics, the random mean, the scale-mixed normalized Student t-distribution and the Student’s t-statistic under non-normality with normal limit law, as well as Hotelling’s generalized and scale mixture of chi-squared statistics with chi-square limit laws, are considered. The bounds for the corresponding residuals are presented in terms of inequalities.