1. Introduction
The statistical problem of comparing treatments with a control population has been an active area of research for nearly eighty years. One of the earliest studies to propose a formal statistical design for comparing treatments with a control is reported in [1]. Soon after, Ref. [2] investigated this problem for normal means and binomial proportions using the idea of spacing between treatments. Ref. [3] extended this work by exploring multiple comparisons and formulating a procedure for carrying out comparisons with a control population. The idea of spacing was further refined in [4], which formally conceptualized the “indifference zone” formulation for selecting, with a predetermined probability, the best normal population from a group of several normally distributed populations whenever the parameters lie in the preference zone. In the statistical literature, the region outside the indifference zone is referred to as the preference zone. Also in the 1950s, another formulation of the problem of selecting or isolating the best population was proposed in [5]; it did not restrict the selection to the preference zone but instead carried out the selection over the entire parameter space. This formulation, known as the “subset-selection formulation”, selects a subset of the populations, of random size, that includes the best treatment with a prespecified probability. A number of researchers have studied this problem under various requirements and goals and with various sampling methodologies. One such formulation that has been studied extensively in the literature is the one in which the experimenter wants the selected population to be some “specified amount better” than a reference treatment, which is referred to as a control or standard. This area of research is typically known as the problem of “comparisons with a control” or the “partition problem” in the statistical literature. For the partition problem, one formulation that has been used by a number of practitioners and researchers is the one introduced in [6] for populations that follow a normal distribution.
In Section 2, we summarize the formulation of [6] and review current research in the area. In Section 3, we propose a distribution-free version of the formulation in [6], together with a purely sequential methodology, and derive its first-order asymptotic properties. In Section 4, we study the performance of the proposed non-parametric procedure for different values of the design constants, to see how the asymptotic results in Theorem 1 compare with the observed values when the procedure is simulated for small and moderate sample sizes. In Section 5, we provide an example to illustrate an application of the proposed non-parametric purely sequential procedure.
2. Normal Populations Case
Assume that we have (k + 1) independently distributed normal populations, denoted as
, with respective means
and a common variance
. We will assume that all the parameters are unknown. The population
is referred to as the control or standard population. The formulation presented in [
6] starts by mathematically defining the “good” and “bad” populations based on the input from practitioners or experts in the area of the application.
Next, for fixed but arbitrary constants,
and
, with
, ref. [
6] defined the “good” and “bad” populations via three sets by adopting the indifference-zone formulation of [4], as defined below:
The set
is termed the set of “good” populations, while the set
is termed the set of “bad” populations. Note that the two constants
and
are determined based on the input of experts in the area specifying how much better or worse a population has to be compared to the control to be termed as a good population or a bad population. The goal in [
6] was to partition the populations that belong to
or
correctly with the prespecified probability. On the other hand, the set
is termed the indifference-zone set, and the experimenter is indifferent to the correct partition of the populations that fall in the set
. The partition problem is designed to partition the set
into two mutually disjoint sets
and
, with high accuracy, so that all populations in
fall inside
and all populations in
fall inside
. That is, when all the populations in
or
are partitioned correctly, then such a partition is defined as a correct decision (CD). Mathematically, let us denote by
the probability of correct decision that the experimenter wants to achieve. Note that
, as the probability of a correct classification by random guessing is
for each of the
k populations.
Next, using a sampling design, determine
N as the sample size from each of the
k populations and from the control population, and compute the sample mean
from
. Define
; then, the decision rule proposed by [
6] to partition all the populations in
took the following form:
Ref. [
6] has shown that if the sample size
N satisfies
, and we partition the
k populations according to the partition rule (
2), then
Note that
,
when
k is even and
when
k is odd, and the
covariance matrix
is given by
and
b is a constant satisfying the integral equation given by
Ref. [
6] has tabulated the values of design constant
b for various choices of
k and
. For the unknown
case, ref. [
6] also constructed a two-stage and a purely sequential procedure.
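As a rough illustration of the kind of decision rule described above, the following sketch assumes that the rule labels a population "good" when the difference between its sample mean and the control's sample mean exceeds the midpoint of the two design constants, and "bad" otherwise. The exact form of rule (2) is not reproduced in the text here, so this form, together with the names `partition_by_midpoint`, `delta1`, and `delta2` and the toy data, is an assumption made purely for illustration.

```python
from statistics import mean


def partition_by_midpoint(control, treatments, delta1, delta2):
    """Split treatments into (good, bad) lists of 1-based indices.

    Assumed rule (hypothetical, for illustration only): treatment i is
    "good" when mean(treatment_i) - mean(control) exceeds the midpoint
    d = (delta1 + delta2) / 2, and "bad" otherwise.
    """
    if not delta1 < delta2:
        raise ValueError("delta1 must be smaller than delta2")
    d = (delta1 + delta2) / 2.0
    control_mean = mean(control)
    good, bad = [], []
    for i, sample in enumerate(treatments, start=1):
        if mean(sample) - control_mean > d:
            good.append(i)
        else:
            bad.append(i)
    return good, bad


# Toy data (made up): equal sample sizes from the control and two treatments.
control = [10.1, 9.8, 10.0, 10.3]
treatments = [
    [12.0, 11.7, 12.4, 12.1],  # clearly above the control
    [9.9, 10.2, 10.0, 9.7],    # about the same as the control
]
good, bad = partition_by_midpoint(control, treatments, delta1=0.5, delta2=1.5)
print(good, bad)
```

Using a single cutoff at the midpoint means that populations in the indifference zone are still assigned to one of the two sets, which is consistent with the experimenter being indifferent to how they are classified.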
For the normal distributions case, ref. [
7] constructed several multistage methodologies focusing on the second-order asymptotic expansions. For references on the partition problem for binomial treatments, the reader is referred to [
8]. In [
9], a generalization of “Tong’s formulation” was introduced so that the treatments that fall between the “good” and “bad” treatments can be partitioned as a separately identifiable group by introducing two indifference zones. Ref. [
10] extended this generalization by constructing an asymptotically unbiased fine-tuned purely sequential procedure to guarantee the probability requirement.
Next, we construct a non-parametric procedure to partition the k populations with respect to a control population. The procedure does not require the populations to be normally distributed, although we assume that the unknown distributions are symmetric. In Section 3, we propose a distribution-free version of the formulation in [6], together with a purely sequential methodology, and derive its first-order asymptotic properties.
3. Non-Parametric Partition Problem
Assume that we are given (k + 1)
independent populations
where the control population is denoted as
. Assume that the cumulative distribution function (cdf) of
is
for
. We will assume the cdf
is continuous and symmetric. Note that the function
and all the centers of symmetry, namely,
,
, ⋯,
are assumed to be unknown. Following [
6], we have defined below what an experimenter may define as “good” and “bad” populations compared to a control based on the input from experts in the area of application. As in
Section 2 for the normal populations, we will partition all
k populations by comparing the centers of symmetry
,
with the control population’s center of symmetry
to define the sets of “good” and “bad” populations, with a probability of correct decision (
) of at least
. As before,
.
Based on the input from experts in the area, the statistical design would start by selecting two arbitrary but fixed design constants,
and
, with
. Next, as in [
6], we define three subsets for
following the idea of spacing from the indifference-zone formulation of [4], as follows:
Note that
and
are the sets of “good” populations and “bad” populations, respectively, whereas
is the set of populations the experimenter would be indifferent to. We define two constants based on
and
as
and
. Let
denote a class of symmetric and continuous distributions that satisfy some regularity conditions, to be specified later in this section. Next, we propose a purely sequential procedure for the partition problem described in (5). The procedure starts with an initial sample of
observations from all the (
k + 1) populations. Next, implementing the “vector-at-a-time” sampling procedure, we will sample one additional observation at a time from each of the (
k + 1) populations according to the stopping rule defined below in (7). Having recorded an independent sample
, a sample of size
n from
,
, a statistic
, to be defined below, is proposed to estimate the center of symmetry
,
. The estimator
has an asymptotic normal distribution. That is,
, as
for
,
. Note that the unknown constant
A is a finite and positive function of
F. For the literature on non-parametric procedures in the area of selecting the best population, the reader is referred to [11]. One may also refer to [12], which constructed a non-parametric accelerated sequential procedure to select the population with the largest center of symmetry.
Based on a sample of size
n, the decision rule is to compare each
with
,
, and then partition the
k populations following the partition rule given by:
Next, as in [
11], we will assume that the following regularity conditions are satisfied by the unknown distribution
and the purely sequential stopping rule, which is implemented to obtain the sample size
N:
Regularity Conditions: We will assume the following three conditions hold for all and :
a.s. as where is a standardized average of independent and identically distributed random variables having a finite second moment and .
For an estimator of A, as , we have a.s.
The set is uniformly integrable.
Next, following [
7], one can obtain that
is asymptotically at least
if the sample size
n is at least
. Here, “
b” is a constant, as reported earlier, which is a function of
k and
. Let us denote
. The expression
is known as the optimal sample size. However, it is unknown as
A is unknown. Next, to estimate
A, a purely sequential procedure is constructed which satisfies the correct decision probability requirement and has
whenever
and the unknown cdf
, as
. The purely sequential procedure starts with
m observations from each population, and it samples one observation from all (
k + 1) populations at a time, according to the stopping rule:
where
, an estimator of
A, is computed using the control and all
k populations. Also,
depends on the estimator of the center of symmetry
. Next, we present a theorem that establishes the first-order properties of the proposed purely sequential procedure (
7).
Theorem 1. The purely sequential procedure defined in (7), under the assumptions as outlined above, satisfies the following properties for all and :
- (i) monotonically as a.s.
- (ii) as .
- (iii) a.s.
- (iv) as .
Proof. We start with an estimator
for the center of symmetry. Based on a sample of size
n, let
denote the Hodges–Lehmann estimator for the center of symmetry
of the
ith population
. That is, it is the sample median of the
quantities
for
,
j,
;
. Then, the estimator of
is given by
where
are the ordered
for
and for
. The sequences
and
are specified as
where
is defined as the largest integer less than or equal to
x.
is defined by
for some
. The Hodges–Lehmann estimator has been used extensively in statistical literature, and it is well known that
is a consistent estimator of the center of symmetry. The reader is referred to [
13] for details.
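The Hodges–Lehmann estimator described above can be sketched in a few lines as the median of the pairwise (Walsh) averages. Including the pairs with equal indices, which gives n(n + 1)/2 averages for a sample of size n, is the usual convention for the one-sample estimator and is assumed here; the function name `hodges_lehmann` and the toy data are illustrative only.

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(sample):
    """Median of the Walsh averages (x_j + x_l) / 2 over all pairs j <= l."""
    if not sample:
        raise ValueError("sample must be non-empty")
    walsh = [(x + y) / 2.0 for x, y in combinations_with_replacement(sample, 2)]
    return median(walsh)


# A single gross outlier pulls the sample mean (22.0) far more than the
# Hodges-Lehmann estimate.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
print(hodges_lehmann(data))  # 3.0
```

The brute-force construction uses on the order of n² averages, which is adequate for the small and moderate sample sizes considered here.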
Next, note that
w.p. 1 if
, that is
is non-decreasing in
. Now, assumption 1.1 [13] in the regularity conditions leads to part (i). Part (ii) follows by applying the monotone convergence theorem. Since the stopping rule is
then the basic inequality simplifies to
Now, multiply
throughout (
10) and take limits as
; this leads to part (iii). For the population
, the statistic
is proposed to estimate
. For
, we have
where
for
,
for
,
. If we define the
covariance matrix
by
then
Equation (
12) gives the infimum of the
for the set of all configurations such that there are
r populations from
(bad populations) and
populations from
(good populations). The right side of (
12) achieves a minimum over all
under the least favorable configuration (LFC). Let
be the solution of the equation
Also, for any real number
c and
q, let
where the
covariance matrix
is such that
Define
then
which leads to
i.e.,
, which is part (iv). This completes the proof of the theorem. □
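A minimal sketch of a vector-at-a-time sequential scheme of this kind is given below. It is not the procedure in (7) itself. It assumes that sampling stops at the first n ≥ m with n ≥ (b·Â/a)², that the estimator Â of A is the rescaled length of a Wilcoxon signed-rank confidence interval built from the ordered Walsh averages, that the per-population estimates are pooled by averaging their squares, and that a population is called "good" when its Hodges–Lehmann estimate exceeds the control's by more than a cutoff d. The function names, the 95% interval level, the value of b (borrowed from the simulation section as a placeholder and not necessarily matched to k = 2), and the demo centers are all hypothetical.

```python
import math
import random
from statistics import median

Z = 1.959964  # standard normal quantile for the assumed 95% interval


def walsh_averages(sample):
    """Sorted pairwise averages (x_j + x_l) / 2 for j <= l."""
    n = len(sample)
    return sorted((sample[j] + sample[l]) / 2.0
                  for j in range(n) for l in range(j, n))


def estimate_scale(sample):
    """Assumed estimator of A: rescaled length of a signed-rank interval."""
    n = len(sample)
    w = walsh_averages(sample)
    total = n * (n + 1) // 2
    spread = Z * math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    lo = max(1, math.floor(total / 2.0 - spread))
    hi = min(total, math.floor(total / 2.0 + spread) + 1)
    return math.sqrt(n) * (w[hi - 1] - w[lo - 1]) / (2.0 * Z)


def sequential_partition(draw, k, a, b, d, m=10, max_n=5000):
    """Sample one observation per population until n >= (b * A_hat / a)^2."""
    samples = [[draw(i) for _ in range(m)] for i in range(k + 1)]
    n = m
    while n < max_n:
        # Pool the (k + 1) scale estimates by averaging their squares (assumed).
        pooled_sq = sum(estimate_scale(s) ** 2 for s in samples) / (k + 1)
        if n >= b * b * pooled_sq / (a * a):
            break
        for i, s in enumerate(samples):
            s.append(draw(i))
        n += 1
    centers_hat = [median(walsh_averages(s)) for s in samples]
    good = [i for i in range(1, k + 1) if centers_hat[i] - centers_hat[0] > d]
    bad = [i for i in range(1, k + 1) if i not in good]
    return n, good, bad


rng = random.Random(2024)
centers = [0.0, 5.0, -4.0]  # control, one clearly good, one clearly bad


def draw(i):
    # The difference of two unit exponentials is a Laplace(0, 1) variate.
    return centers[i] + rng.expovariate(1.0) - rng.expovariate(1.0)


n_stop, good, bad = sequential_partition(draw, k=2, a=0.5, b=2.44177, d=0.5)
print(n_stop, good, bad)
```

Because the same data stream is reused at every step, a larger value of `a` can only make the rule stop at the same sample size or earlier, which mirrors the role of the half-width of the indifference zone in the optimal sample size.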
4. Monte Carlo Simulation Results
In this section, we report a Monte Carlo simulation study in which the purely sequential procedure (7) was replicated independently 5000 times for different values of the design constants, to study how the asymptotic results in Theorem 1 compare with the observed values when the procedure is simulated for small and moderate sample sizes. In our simulation study, we considered
independent populations and one control population. To construct the LFC, we generated
populations with the center of symmetry equal to
, and the remaining
populations are generated to have the center of symmetry as
. The control population is generated to have the center of symmetry as
. Without loss of generality, we set
. For
and
, the value of the constant
b equals 2.44177 from [
6]. Next, we considered the following symmetric distributions: normal distribution, Laplace distribution, t-distribution, uniform distribution, and a mixture of two normal distributions. For these distributions, the parameter
is given by
is the density function for normal distribution, Laplace distribution, t-distribution, uniform distribution and a mixture of two normal distributions, respectively. In our simulations,
, the Laplace distribution with
, t-distribution with
,
, and two normal mixture distributions:
and
were used here.
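For orientation, the scale parameter of the Hodges–Lehmann estimator can be evaluated numerically from a density. The sketch below assumes the standard asymptotic form A = 1/(√12 ∫ f²(x) dx) and the relation n* = (bA/a)² between the optimal sample size and the half-width a; neither formula is reproduced in the text here, so both are labeled assumptions. The unit-scale normal and Laplace densities, the integration limits, and all function names are illustrative choices rather than the settings used in the study.

```python
import math


def asymptotic_scale(int_f_squared):
    """Assumed form: A = 1 / (sqrt(12) * integral of f(x)^2 dx)."""
    return 1.0 / (math.sqrt(12.0) * int_f_squared)


def integrate_square(pdf, lower, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of pdf(x)^2 on [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(pdf(lower + (i + 0.5) * width) ** 2 for i in range(steps))


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def laplace_pdf(x):
    """Laplace density with location 0 and scale 1."""
    return 0.5 * math.exp(-abs(x))


def half_width(n_star, scale, b):
    """Assumed relation n* = (b * A / a)^2, solved for a."""
    return b * scale / math.sqrt(n_star)


A_normal = asymptotic_scale(integrate_square(normal_pdf, -12.0, 12.0))
A_laplace = asymptotic_scale(integrate_square(laplace_pdf, -40.0, 40.0))
print(A_normal, A_laplace)

# Half-widths implied by the optimal sample sizes used in the study,
# under the assumed relation and the unit-scale normal density.
for n_star in (50, 100, 200, 400, 800):
    print(n_star, round(half_width(n_star, A_normal, b=2.44177), 4))
```

Under the assumed form, the closed-form values are √(π/3) for the standard normal and 2/√3 for the unit-scale Laplace, which the numerical integration should reproduce closely.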
After obtaining the value of
for each distribution, the value of
was determined by
. The values of
which we selected were 50, 100, 200, 400, and 800. For each value of
, the corresponding value of
was obtained, and those values are summarized in Tables 1–6. As described earlier, the estimator
defined in (
8) is used to estimate the unknown parameter
. Note that the purely sequential rule does not rely upon the knowledge of
. Next, we generated data from the normal distribution with
, Laplace distribution with
, t-distribution with
, uniform distribution, and two mixed normal distributions given by
and
, respectively. Note that the Hodges–Lehmann estimator holds for
. In the simulations, we have considered several possible choices of the
and studied the impact of
on the estimation of
. The simulation results are reported in Tables 1–6.
From Tables 1 and 2, note that the purely sequential procedure (7) oversamples by roughly two to three observations when the population is normally distributed and by just under 10 observations for the Laplace distribution. Also, note that the estimated probability of correct selection is below the target value of 0.95 for the normal case. However, for the Laplace distribution, the estimated probability of correct selection matches the target value of 0.95 quite well. This behavior should not come as a surprise. The Hodges–Lehmann estimator is more appropriate when the distribution has tails longer than those of the normal distribution. That is, when the distribution is close to normal, the partition procedures designed for normally distributed populations, such as the ones described in [7], are more appropriate. However, if the tails are significantly longer than the normal tails, as for the Laplace distribution, then the non-parametric partition procedures are more appropriate.
In
Table 3, the underlying distribution is the t-distribution with 5 degrees of freedom. This distribution has tails longer than those of a normal distribution but shorter than those of the Laplace distribution. Note that the estimated probability of correct selection is somewhat below the target value of 0.95
for smaller values of
. However, as
increases, the estimated probability of correct selection is approaching the target value of 0.95.
Next, we considered the uniform distribution, which has tails even shorter than the normal tails. One will note that the estimated probability of correct selection is well below the target value of 0.95. This behavior is again in line with the earlier comments in this section about the Hodges–Lehmann estimator being more appropriate when the distribution has tails longer than those of the normal distribution. Next, we considered two mixtures of two normal populations. The first is a mixture with somewhat long tails, in which the first component has a variance of 1 and the second has a variance of 2. The second mixture again combines two normal populations, but the two variances, 1 and 5, are farther apart. Intuitively, these two mixture cases are symmetric but are not unimodal like the normal distribution or the other distributions considered earlier. The two tables below again exhibit the same behavior: the longer the tails, the better the performance of the Hodges–Lehmann estimator.
5. An Example
In this section, we study the performance of the non-parametric sequential procedure via a real-world dataset. Ref. [
14] conducted a pilot investigation to see if active exercise can preserve walking beyond the 2nd month. In this experiment, newborn children were randomly assigned to one of four treatment groups: (1) active exercise group; (2) passive exercise group; (3) no exercise group (these were observed weekly); and (4) control group (observed once after 8 weeks). Traditionally, 12 months has been regarded as the mean age at which infants begin to walk. The statistical analysis confirmed that the walking data are normally distributed with roughly equal variances. We adopted a
improvement as significant and anything other than
as not significant. We took
months,
months,
, and the starting sample size
. The data were analyzed via the following three procedures: (1) two-stage procedure of [
6]; (2) purely sequential procedure of [
7]; (3) non-parametric sequential procedure proposed in this manuscript. Additional samples, as needed, were generated via simple random sampling with replacement (SRSWR) and saved so that the same data were used for all the procedures. Note that all three sampling methodologies yielded the same result: the active exercise group was partitioned as better than the control, while the passive and no exercise groups were partitioned as bad compared to the control, since the improvement was lower than
. The sample size for these five methodologies is reported in
Table 7. One will note that the sample size was somewhat larger for the non-parametric sequential procedure, and it increased further when the parameter
was increased. However, this was to be expected, since the data are normally distributed in this case, and the procedures based on the normal distribution assumption are bound to perform better. As the simulations indicate, the true advantage of the non-parametric procedure arises when the data are not normal and have long tails.