1. Introduction
Let $X_1, \dots, X_k$ be a random sample from an $m$-dimensional population. The data set can be regarded as $k$ vectors or points in $m$-dimensional space. Recently, there has been significant interest in high-dimensional data sets, where the dimension is large. In a high-dimensional setting, it is assumed that either (i)
m tends to infinity and
k is fixed, or (ii) both
m and
k tend to infinity. Case (i) is related to high-dimensional low sample size (HDLSS) data. One of the first results for HDLSS data appeared in Hall et al. [
1]. It became the basis of research in mathematical statistics for the analysis of high-dimensional data, see, e.g., Fujikoshi et al. [
2]. Such analysis is an important part of the currently fashionable area of data analysis called
Big data. Scientific areas where these settings have proven to be very useful include genetics and other types of cancer research, neuroscience, and also image and shape analysis. See a recent survey on HDLSS asymptotics and its applications in Aoshima et al. [
3].
To examine the features of the data set, it is necessary to study the asymptotic behavior of three functions: the length $\|X_i\|$ of an $m$-dimensional observation vector, the distance $\|X_i - X_j\|$ between any two independent observation vectors, and the angle $\operatorname{ang}(X_i, X_j)$ between these vectors at the population mean. Assuming that the $X_i$'s are a sample from $N(0, I_m)$, it was shown in Hall et al. [1] that for HDLSS data the three geometric statistics satisfy the following relations:
$$\|X_i\| = \sqrt{m} + O_p(1), \quad i = 1, \dots, k, \tag{1}$$
$$\|X_i - X_j\| = \sqrt{2m} + O_p(1), \quad i \ne j, \tag{2}$$
$$\operatorname{ang}(X_i, X_j) = \frac{\pi}{2} + O_p(m^{-1/2}), \quad i \ne j, \tag{3}$$
where $\|\cdot\|$ is the Euclidean norm and $O_p$ denotes the stochastic order. These interesting results imply that the data converge to the vertices of a deterministic regular simplex. These properties were extended to non-normal samples under some assumptions (see Hall et al. [
1] and Aoshima et al. [
3]). In Kawaguchi et al. [
4], the relations (
1)–(3) were refined by constructing second order asymptotic expansions for distributions of all three basic statistics. The refinements of (
1) and (2) were achieved by using the idea of Ulyanov et al. [
5] who obtained the computable error bounds of order
for the chi-squared approximation of transformed chi-squared random variables with
m degrees of freedom.
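The geometric picture behind the results of Hall et al. can be checked numerically. The following is a minimal simulation sketch, not part of the cited work: it assumes standard normal data, and the function name `hdlss_geometry` and its parameters are hypothetical. It draws a few vectors in dimension m and reports how far the lengths, pairwise distances and pairwise angles are from $\sqrt{m}$, $\sqrt{2m}$ and $\pi/2$.

```python
import numpy as np


def hdlss_geometry(m, k=5, seed=0):
    """Lengths, pairwise distances and pairwise angles of k draws from N(0, I_m).

    Illustrative helper (hypothetical name); assumes standard normal data.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((k, m))            # k observation vectors in R^m
    lengths = np.linalg.norm(x, axis=1)
    i, j = np.triu_indices(k, 1)               # all index pairs with i < j
    distances = np.linalg.norm(x[i] - x[j], axis=1)
    cosines = np.sum(x[i] * x[j], axis=1) / (lengths[i] * lengths[j])
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return lengths, distances, angles


if __name__ == "__main__":
    for m in (100, 10_000, 100_000):
        lengths, distances, angles = hdlss_geometry(m)
        print(
            f"m={m:>7}: "
            f"max |length - sqrt(m)| = {np.abs(lengths - np.sqrt(m)).max():.3f}, "
            f"max |dist - sqrt(2m)| = {np.abs(distances - np.sqrt(2 * m)).max():.3f}, "
            f"max |angle - pi/2| = {np.abs(angles - np.pi / 2).max():.4f}"
        )
```

In such runs the first two deviations should stay bounded as m grows while the angle deviation shrinks roughly like $m^{-1/2}$, which is the behavior the stochastic order statements describe.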
The aim of the present paper is to study approximations for the third statistic under the generalized assumption that m is a realization of a random variable, say $N_n$, which represents the sample dimension and is independent of the observation vectors. This problem is closely related to approximations of statistics constructed from samples of random size, in particular to the corresponding problem for the sample correlation coefficient.
The use of samples with random sample sizes has been steadily growing over the years. For an overview of statistical inferences with a random number of observations and some applications, see Esquível et al. [
6] and the references cited therein. Gnedenko [
7] considered the asymptotic properties of the distributions of sample quantiles for samples of random size. In Nunes et al. [
8] and Nunes et al. [
9], unknown sample sizes are assumed in medical research for analysis of one and more than one-way fixed effects ANOVA models to avoid false rejections, obtained when using the classical fixed size
F-tests. Esquível et al. [
6] considered inference for the mean with known and unknown variance and inference for the variance in the normal model. Prediction intervals for the future observations for generalized order statistics and confidence intervals for quantiles based on samples of random sizes are studied in Barakat et al. [
10] and Al-Mutairi and Raqab [
11], respectively. They illustrated their results with a real biometric data set, the duration of remission of leukemia patients treated by one drug. The present paper continues the authors' studies on non-asymptotic analysis of approximations for statistics based on random size samples. In Christoph et al. [
12], second order expansions for the normalized random sample sizes are proved, see below Propositions 1 and 2. These results allow for proving second order asymptotic expansions of random sample mean in Christoph et al. [
12] and random sample median in Christoph et al. [
13]. See also Chapters 1 and 9 in Fujikoshi and Ulyanov [
14].
The structure of the paper is the following. In
Section 2, we describe the relation between
and
. We recall also previous approximation results proved for distributions of
and
.
Section 3 is on general transfer theorems, which allow us to construct asymptotic expansions for distributions of randomly normalized statistics on the basis of approximation results for non-randomly normalized statistics and for the random size of the underlying sample, see Theorems 1 and 2.
Section 4 contains the auxiliary lemmas. Some of them are of independent interest. For example, Lemma 3 gives upper bounds for the negative order moments of a random variable having a negative binomial distribution. We formulate and discuss the main results in
Section 5 and
Section 6. In Theorems 3–8, we construct the second order Chebyshev–Edgeworth expansions for distributions of
and
in the random setting. Depending on the type of normalization, we get three different limit distributions: normal, Laplace, or Student’s
t-distribution. All proofs are given in
Appendix A.
2. Sample Correlation Coefficient, Angle between Vectors and Their Normal Approximations
We slightly simplify notation. Let
and
be two vectors from an
m-dimensional normal distribution
with zero mean, identity covariance matrix
and the sample correlation coefficient
Under the null hypothesis
: {
and are uncorrelated}, the so-called null density
of
is given in Johnson, Kotz and Balakrishnan [
15], Chapter 32, Formula (32.7):
for
, where
denotes the indicator function of a set
A.
Note and for ,
is two point distributed with ,
is U-shaped with and
is uniform with density .
Moreover, for , the density function is unimodal.
Consider now the standardized correlation coefficient
with some correcting real constant
having density
which converges with
as
to the standard normal density
and by Konishi [
16],
Section 4, Formula (4.1) as
:
where
is the standard normal distribution function. Note that in Konishi [
16] the sample size (in our case the dimension of vectors) is
and
with Konishi’s correcting constant
. Moreover, (
7) follows from the more general Theorem 2.2 in the mentioned paper for independent components in the pairs
,
.
In Christoph et al. [
17], computable error bounds of approximations in (
7) with
and
of order
for all
are proved:
and
where for some
constants
and
are calculated and presented in Table 1 in Christoph et al. [
17]: i.e.,
,
and
,
.
Usually, the asymptotic for
is (
9), where
since it is related to the
t-distributed statistic
. With the correcting constant
, one term in the asymptotic in (
8) vanishes.
In order to use a transfer theorem from non-random to random dimension of the vectors, we prefer (
7) with
. In a similar manner as proving (
8) and (
9) in Christoph et al. [
17], one can verify the following inequalities for
:
Let us consider now the connection between the correlation coefficient
and the angle
of the involved vectors
:
Hall et al. [
1] showed that under the given conditions
where
denotes the stochastic order. Since
the computable error bounds for
follow from the computable error bounds for
.
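The link between the sample correlation coefficient and the angle can be explored with a short simulation. This is a hedged sketch under stated assumptions: the correlation coefficient is taken in its uncentered form (the mean is known to be zero), the angle is taken as its arccosine, and the correcting constant in the standardization is set to zero. The names `corr_and_angle` and `std_normal_cdf` are hypothetical.

```python
import numpy as np
from math import erf, sqrt


def std_normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def corr_and_angle(m, reps=10_000, seed=0):
    """Simulate the correlation coefficient and angle of two N(0, I_m) vectors.

    Assumption: uncentered correlation (zero mean is known), angle = arccos(R).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((reps, m))
    y = rng.standard_normal((reps, m))
    r = np.sum(x * y, axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    )
    theta = np.arccos(np.clip(r, -1.0, 1.0))
    return r, theta


if __name__ == "__main__":
    m = 200
    r, theta = corr_and_angle(m)
    for x in (-1.0, 0.0, 1.0):
        p_r = np.mean(np.sqrt(m) * r <= x)
        p_theta = np.mean(np.sqrt(m) * (theta - np.pi / 2) <= x)
        print(f"x={x:+.1f}: corr {p_r:.4f}, angle {p_theta:.4f}, "
              f"normal {std_normal_cdf(x):.4f}")
```

Both empirical distribution functions should be close to the standard normal one, since the angle deviates from $\pi/2$ by the arcsine of the correlation coefficient, which is nearly linear near zero.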
For any fixed constant
and arbitrary
x with
, we obtain for the angle
:
because
is symmetric and
.
Equation (
12) shows the connection between the correlation coefficient
and the angle
among the vectors involved. In Christoph et al. [
17], the computable error bounds of the approximation in (
8) are used to obtain similar bounds for the approximation of the angle between two vectors, defined in (
11). Here, the approximation (
10) and (
12) with
lead for any
and for
to
Many authors investigated limit theorems for the sums of random vectors when their dimension tends to infinity, see, e.g., Prokhorov [
18]. In (
6) and (
7), the dimension
m of the vectors
and
tends to infinity.
Now, we consider the correlation coefficient of vectors
and
, where the non-random dimension
m is replaced by a random dimension
depending on some natural parameter
and
is independent of
and
for any
. Define
3. Statistical Models with a Random Number of Observations
Let
and
be random variables on the same probability space
. Let
be the random size of the underlying sample, i.e., the random number of observations, which depends on a parameter
. We suppose for each
that
is independent of random variables
and
in probability as
. Let
be some statistic of a sample with
non-random sample size . Define the random variable
for every
:
i.e.,
is some statistic obtained from a random sample
.
The randomness of the sample size may crucially change asymptotic properties of
, see, e.g., Gnedenko [
7] or Gnedenko and Korolev [
19].
3.1. Random Sums
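Many models lead to random sums and random means
$$S_{N_n} = \sum_{k=1}^{N_n} X_k \quad \text{and} \quad M_{N_n} = \frac{1}{N_n} \sum_{k=1}^{N_n} X_k. \tag{14}$$
A fundamental introduction to asymptotic distributions of random sums is given in Döbler [20].
It is worth mentioning that the choice of the scaling factor for $S_{N_n}$ affects the type of limit distribution. In fact, consider the random sum $S_{N_n}$ given in (14). For the sake of convenience, let $X_1, X_2, \dots$ be independent standard normal random variables and let $N_n$ be geometrically distributed with $\mathbb{E} N_n = n$ and independent of $X_1, X_2, \dots$. Then, one has
$$\mathbb{P}\Big(\frac{1}{\sqrt{N_n}}\, S_{N_n} \le x\Big) = \Phi(x) \quad \text{for all } n, \tag{15}$$
$$\mathbb{P}\Big(\frac{1}{\sqrt{\mathbb{E} N_n}}\, S_{N_n} \le x\Big) \to \int_{-\infty}^{x} \frac{1}{\sqrt{2}}\, e^{-\sqrt{2}\,|u|}\, du \quad \text{as } n \to \infty, \tag{16}$$
$$\mathbb{P}\Big(\frac{\sqrt{\mathbb{E} N_n}}{N_n}\, S_{N_n} \le x\Big) \to \int_{-\infty}^{x} (2 + u^2)^{-3/2}\, du \quad \text{as } n \to \infty. \tag{17}$$
We have three different limit distributions. The suitably scaled geometric sum is standard normal distributed or tends to the Laplace distribution with variance 1, depending on whether we take the random scaling factor $1/\sqrt{N_n}$ or the non-random scaling factor $1/\sqrt{\mathbb{E} N_n}$, respectively. Moreover, we get the Student distribution with two degrees of freedom as the limit distribution if we use scaling with the mixed factor $\sqrt{\mathbb{E} N_n}/N_n$. Similar results also hold for the normalized random mean $M_{N_n}$.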
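The effect of the three normalizations can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the geometric size lives on {1, 2, ...} with mean n, and the scalings are the random factor $1/\sqrt{N}$, the non-random factor $1/\sqrt{n}$ and the mixed factor $\sqrt{n}/N$. The function name `geometric_sum_scalings` is hypothetical. It compares the probability of the interval [-1, 1] with the closed-form values for the standard normal law, the Laplace law with variance 1, and Student's t-distribution with two degrees of freedom.

```python
import numpy as np
from math import erf, exp, sqrt


def geometric_sum_scalings(n, reps=200_000, seed=0):
    """Simulate a geometric sum of N(0,1) variables under three scalings.

    Assumption: N is geometric on {1, 2, ...} with mean n (success prob. 1/n).
    """
    rng = np.random.default_rng(seed)
    size = rng.geometric(1.0 / n, size=reps)
    # Conditionally on N, a sum of N iid N(0,1) variables is N(0, N),
    # so it can be sampled directly without summing N terms.
    total = np.sqrt(size) * rng.standard_normal(reps)
    random_scaled = total / np.sqrt(size)      # random factor
    fixed_scaled = total / np.sqrt(n)          # non-random factor
    mixed_scaled = np.sqrt(n) * total / size   # mixed factor
    return random_scaled, fixed_scaled, mixed_scaled


if __name__ == "__main__":
    a, b, c = geometric_sum_scalings(1000)
    targets = {
        "normal": erf(1.0 / sqrt(2.0)),        # P(|Z| <= 1)
        "Laplace": 1.0 - exp(-sqrt(2.0)),      # variance-1 Laplace
        "Student t2": 1.0 / sqrt(3.0),         # t with 2 degrees of freedom
    }
    for (name, target), sample in zip(targets.items(), (a, b, c)):
        print(f"{name:>10}: simulated {np.mean(np.abs(sample) <= 1):.4f}, "
              f"limit {target:.4f}")
```

Sampling the sum conditionally on the size keeps the simulation cheap; it uses exactly the conditioning argument that gives the exact normality under the random scaling.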
Assertion (
15) is obtained by conditioning and the stability of the normal law. Moreover, using Stein’s method, quantitative Berry–Esseen bounds in (
15) and (16) for arbitrary centered random variables
with
were proved in (Chen et al. [
21], Theorem 10.6), (Döbler [
20] Theorems 2.5 and 2.7) and (Pike and Ren [
22] Theorem 3), respectively. Statement (17) follows from (Bening and Korolev [
23] Theorem 2.1).
First order asymptotic expansions are obtained for the distribution function of random sample mean and random sample median constructed from a sample with two different random sizes in Bening et al. [
24] and in the conference paper Bening et al. [
25]. The authors make use of the rate of convergence of
to the limit distribution
with some
. In Christoph et al. [
12], second order expansions for the normalized random sample sizes are proved, see below Propositions 1 and 2. These results allow for proving second order asymptotic expansions of random sample mean in Christoph et al. [
12] and random sample median in Christoph et al. [
13].
3.2. Transfer Proposition from Non-Random to Random Sample Sizes
Consider now the statistic , where the dimension of the vectors is a random number .
In order to avoid overly long expressions while preserving the necessary accuracy, we limit ourselves to obtaining limit distributions and terms of order in the following non-asymptotic approximations with bounds of order for some .
We suppose that the following condition on the statistic with is met for a non-random sample size :
Condition 1. There exist a differentiable bounded function with and real numbers ,
such that for all integer where .
Remark 1. Relations (10) and (13) give examples of statistics for which Condition 1 is met. For other examples of multivariate statistics of this kind, see Chapters 14–16 in Fujikoshi et al. [2].
Suppose that the limiting behavior of the distribution functions of the normalized random size is described by the following condition.
Condition 2. There exist a distribution function with ,
a function of bounded variation ,
a sequence and real numbers and such that for all integer
Remark 2. In Propositions 1 and 2 below, we give examples of discrete random variables for which Condition 2 is met.
Conditions 1 and 2 allow us to construct asymptotic expansions for distributions of randomly normalized statistics on the basis of approximation results for normalized fixed-size statistics (see relation (
18)) and for the random size of the underlying sample (see relation (
19)). As a result, we obtain the following transfer theorem.
Theorem 1. Let and both Conditions 1 and 2 be satisfied. Then, the following inequality holds for all :
where are given in (18) and (19). The constants do not depend on n.
Remark 3. Later, we use only the cases .
Remark 4. The domain of integration in (21) depends on . Thus, it is not clear how is represented as a polynomial in and . To overcome this problem (see (26)), we prove the following theorem.
Theorem 2. Under the conditions of Theorem 1 and the additional conditions on the functions and , depending on the convergence rate b in (19):
we obtain for the function defined in (21):
with
Remark 5. The additional conditions (23) and (24) make it possible to extend the integration range from to in the integrals in (26).
4. Auxiliary Propositions and Lemmas
Consider the standardized correlation coefficient (
5) having density (
6) with correcting real constant
and standardized angle
, see (
12). By (
10) and (
13) for
, we have
and for the angle
between the vectors for
where (
27) and (
28) for
and
are trivial and
does not depend on
m.
Suppose
with
or
when (
27) or (
28) are considered. Since a product of polynomials in
x with
is always bounded, numerical calculus leads to
Hence, Condition 1 of the transfer Theorem 1 is satisfied for the statistics and with and .
Next, we estimate
defined in (
22).
Lemma 1. Let a sequence with as . Then, with some , we obtain with and :
In the next subsection, we consider the cases when the random dimension is negative binomial distributed with success probability .
4.1. Negative Binomial Distribution as Random Dimension of the Normal Vectors
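Let the random dimension $N_n$ of the underlying normal vectors be negative binomial distributed (shifted by 1) with parameters $1/n$ and $r > 0$, having probability mass function
$$\mathbb{P}(N_n = j) = \frac{\Gamma(j + r - 1)}{\Gamma(j)\,\Gamma(r)} \Big(\frac{1}{n}\Big)^{r} \Big(1 - \frac{1}{n}\Big)^{j-1}, \quad j = 1, 2, \dots, \tag{29}$$
with $\mathbb{E} N_n = r(n-1) + 1$. Then, $\mathbb{P}(N_n / \mathbb{E} N_n \le x)$ tends to the Gamma distribution function $G_{r,r}(x)$ with identical shape and rate parameters $r > 0$, having density
$$g_{r,r}(x) = \frac{r^{r}}{\Gamma(r)}\, x^{r-1} e^{-r x}\, \mathbb{I}_{(0,\infty)}(x), \quad x \in \mathbb{R}.$$
If the statistic is asymptotically normal, the limit distribution of the standardized statistic with random size $N_n$ is Student's $t$-distribution $S_{2r}(x)$ having density
$$s_{\nu}(x) = \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)} \Big(1 + \frac{x^2}{\nu}\Big)^{-(\nu+1)/2}, \quad \nu > 0,\ x \in \mathbb{R},$$
with $\nu = 2r$, see Bening and Korolev [23] or Schluter and Trede [26].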
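The gamma limit of the normalized negative binomial dimension can be checked by simulation. The sketch below is illustrative only and rests on an assumed parametrization: the dimension is 1 plus a negative binomial count of failures before the r-th success with success probability 1/n, so that its mean is r(n-1)+1. The function name `normalized_negative_binomial` is hypothetical. For r = 2 the limiting gamma law with shape and rate 2 has distribution function $1 - e^{-2x}(1 + 2x)$, mean 1 and variance 1/2, which gives simple reference values.

```python
import numpy as np
from math import exp


def normalized_negative_binomial(n, r, reps=200_000, seed=0):
    """Sample N / E[N] for a negative binomial dimension shifted by 1.

    Assumed parametrization: N = 1 + (failures before the r-th success),
    success probability 1/n, hence E[N] = r * (n - 1) + 1.
    """
    rng = np.random.default_rng(seed)
    failures = rng.negative_binomial(r, 1.0 / n, size=reps)
    size = 1 + failures
    return size / (r * (n - 1) + 1)


if __name__ == "__main__":
    r = 2.0
    w = normalized_negative_binomial(500, r)
    print(f"mean     {w.mean():.4f}  (gamma limit: 1)")
    print(f"variance {w.var():.4f}  (gamma limit: {1 / r:.4f})")
    for x in (0.5, 1.0, 2.0):
        limit = 1.0 - exp(-2.0 * x) * (1.0 + 2.0 * x)   # valid for r = 2 only
        print(f"P(N/EN <= {x}) = {np.mean(w <= x):.4f}, gamma limit {limit:.4f}")
```

With r = 1 the dimension is geometric and the gamma limit reduces to the standard exponential law, which matches the Student limit with two degrees of freedom obtained for geometric sums.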
Proposition 1. Let , let the discrete random variable have probability mass function (29) and . For and all there exists a real number such that
where
Figure 1 shows the approximation of
by
and
.
Remark 6. The convergence rate for is given in Bening et al. [24] or Gavrilenko et al. [27]. The Edgeworth expansion for is proved in Christoph et al. [12], Theorem 1. The jumps of the sample size have an effect only on the function in the term . The negative binomial random variable satisfies Condition 2 of the transfer Theorem 1 with , , and .
Lemma 2. In Theorem 2, the additional conditions (23) and (24) are satisfied with , , and . Moreover, one has for and , with or :
In addition to the expansion of
a bound of
is required, where
is the rate of convergence of the Edgeworth expansion for
, see (
18).
Lemma 3. Let , and let the random variable be defined by (29). Then,
and the convergence rate in case cannot be improved.
4.2. Maximum of n Independent Discrete Pareto Random Variables Is the Dimension of the Normal Vectors
Let
be discrete Pareto II distributed with parameter
, having probability mass and distribution functions
which is a particular class of a general model of discrete Pareto distributions, obtained by discretizing continuous Pareto II (Lomax) distributions on the integers, see Buddana and Kozubowski [
28].
Now, let
, be independent random variables with the same distribution (
38). Define for
and
the random variable
It should be noted that the distribution of is extremely spread out on the positive integers.
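How spread out this maximum is can be seen in a small simulation. The sketch below is a hedged illustration: the formulas for the discrete Pareto II law are not legible in the text above, so it assumes the distribution function $\mathbb{P}(Y \le k) = k/(s+k)$ for $k = 1, 2, \dots$, under which the maximum of n independent copies has distribution function $(k/(s+k))^n$ and the maximum divided by n has the limit law $e^{-s/y}$ for $y > 0$. The function name `max_discrete_pareto` is hypothetical.

```python
import numpy as np
from math import exp


def max_discrete_pareto(n, s, reps=200_000, seed=0):
    """Sample the maximum of n iid discrete Pareto II variables, divided by n.

    Assumption: P(Y <= k) = k / (s + k), k = 1, 2, ..., so that
    P(max <= k) = (k / (s + k))**n. Inverse transform sampling is used.
    """
    rng = np.random.default_rng(seed)
    # With E ~ Exp(1) and U = exp(-E) uniform, the smallest k satisfying
    # (k / (s + k))**n >= U is ceil(s / expm1(E / n)).
    e = np.maximum(rng.exponential(size=reps), 1e-300)   # guard against E == 0
    size = np.ceil(s / np.expm1(e / n))
    return size / n


if __name__ == "__main__":
    n, s = 1000, 1.0
    w = max_discrete_pareto(n, s)
    for y in (0.5, 1.0, 5.0):
        print(f"P(N/n <= {y}) = {np.mean(w <= y):.4f}, "
              f"limit exp(-s/y) = {exp(-s / y):.4f}")
    print("quantiles 50%, 99%, 99.99% of N:",
          np.quantile(w * n, [0.5, 0.99, 0.9999]))
```

The widely separated upper quantiles reflect the heavy tail: under the assumed law the maximum has no finite mean.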
In Christoph et al. [
12], the following Edgeworth expansion was proved:
Proposition 2. Let the discrete random variable have distribution function (39). For , fixed and all , there exists a real number such that
where is defined in (33).
Remark 7. The continuous function with parameter is the distribution function of the inverse exponential random variable , where is exponentially distributed with rate parameter . Both and are heavy tailed with shape parameter 1.
Remark 8. Therefore, for all and . Moreover:
First absolute pseudo moment ,
Absolute difference moment
for , see Lemma 2 in Christoph et al. [12].
On pseudo moments and some of their generalizations, see Chapter 2 in Christoph and Wolf [29].
Lemma 4. In the transfer Theorem 2, the additional conditions (23) and (24) are satisfied with , , and . Moreover, one has for and , with or :
Lemma 5. For the random size with probabilities (39), with reals and arbitrarily small and , we have