1. Introduction
The term representative points (RPs) indicates a set of supporting points with corresponding probabilities, which can be used as the best approximation of a
d-dimensional probability distribution. Representative points can be regarded as a discretization of a continuous distribution, and are expected to retain as much information as possible. In the univariate case,
X is considered to be a population random variable with cumulative distribution function (cdf)
, a discrete random variable
Z is defined to approximate
X with probability mass function (pmf) by a set of supporting points
(
) with probabilities
, where
and
. In the literature, there are several approaches to choosing the supporting point set
z. For example, a set of random samples from
can be viewed as a representative of the distribution; Fang and Wang [
1] suggest generating representative points based on the number theoretic method. In 1957, Cox [
2] proposed the idea of using the mean squared error (MSE) to measure the loss of information from
, where
The point set
such that
attains its minimum is called the set of mean squared error representative points (MSE-RPs) of
. MSE-RPs have many good properties and have been applied in fields such as signal compression (Gersho and Gray [
3]), numerical integration (Pagès [
4,
5]), the simulation of stochastic differential equations (Gobet et al. [
6]; El Amri et al. [
7]), statistical simulation (Fang et al. [
8], Fang et al. [
9]) and clothing standard settings (Fang and He [
10]; Flury [
11]). To compute MSE-RPs for different distributions, several effective numerical methods have been proposed. The Fang-He algorithm (Fang and He [
10]) calculates MSE-RPs by solving a system of nonlinear equations; the Lloyd I algorithm (Lloyd [
12]), the LBG algorithm (Linde et al. [
13]) and the Competitive Learning Vector Quantization algorithm (Pagès [
5]) obtain MSE-RPs by iterating over a long training sequence of data; Tarpey’s self-consistency algorithm (Tarpey [
14]) applies the idea of the
k-means algorithm to generate MSE-RPs; and Chakraborty et al. [
15] provide an accelerated algorithm based on Newton’s method. When the number of MSE-RPs (
k) is large, obtaining MSE-RPs becomes computationally intensive. Fang and He [
10] discuss the optimal choice of
k.
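The training-sequence idea behind the Lloyd-type methods mentioned above can be illustrated with a minimal sketch. It is not the implementation used in any of the cited works: the function names `lloyd_1d` and `empirical_mse`, the starting values, and the stopping rule are all assumptions made for illustration, and the training data are simulated with Python's `random.gammavariate` (shape 2, scale 1).

```python
import random


def lloyd_1d(sample, k, max_iter=200, tol=1e-12):
    """Lloyd-type iteration on a one-dimensional training sample.

    Returns (points, probs): k supporting points in increasing order and
    the fraction of the training sample that falls in each cell.
    """
    data = sorted(sample)
    n = len(data)
    # Assumed starting values: k roughly equally spaced order statistics.
    points = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    counts = [0] * k
    for _ in range(max_iter):
        sums = [0.0] * k
        counts = [0] * k
        j = 0
        # In one dimension the nearest-point cells are intervals whose
        # boundaries are the midpoints between adjacent points.
        for x in data:
            while j < k - 1 and x > 0.5 * (points[j] + points[j + 1]):
                j += 1
            sums[j] += x
            counts[j] += 1
        # Replace each point by the average of its cell (keep it if the cell is empty).
        new_points = [sums[i] / counts[i] if counts[i] else points[i] for i in range(k)]
        shift = max(abs(a - b) for a, b in zip(points, new_points))
        points = new_points
        if shift < tol:
            break
    return points, [c / n for c in counts]


def empirical_mse(sample, points):
    """Average squared distance from each observation to its nearest point."""
    return sum(min((x - z) ** 2 for z in points) for x in sample) / len(sample)


rng = random.Random(0)
training = [rng.gammavariate(2.0, 1.0) for _ in range(5000)]  # shape 2, scale 1
points, probs = lloyd_1d(training, k=3)
print(points, probs, empirical_mse(training, points))
```

Because every point is the average of its own cell, the probability-weighted mean of the returned points reproduces the mean of the training sample, which is a convenient sanity check.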
Recently, the properties of MSE-RPs for some distributions have been studied in detail, including the normal distribution (Fang et al. [
8]), the mixed normal distribution (Fang et al. [
9] and Li et al. [
16]), the arcsine distribution (Jiang et al. [
17]) and the exponential distribution (Xu et al. [
18]). A general relationship between MSE-RPs and the population distribution can be found in the work of Fei [
19] and Fang et al. [
9]. The study of the gamma distribution’s MSE-RPs (gamma MSE-RPs) can be traced back to Fu [
20], which discusses the existence of gamma MSE-RPs and establishes an algorithm for computing these points. Because the gamma distribution is one of the most important distributions in statistics and probability theory, it is worth taking a closer look at gamma MSE-RPs and their merits. The contributions of this paper are as follows:
New theoretical results prove the uniqueness of gamma MSE-RPs;
Gamma MSE-RPs are found to outperform other types of representative points in parameter estimation;
A new standardization technique is proposed to improve the estimation performance of random samples from the gamma distribution.
Our discussion will focus on these three perspectives.
Section 2 provides some preliminary knowledge of the gamma distribution and different types of representative points for readers to access our content easily.
Section 3 gives some theoretical discussion on the existence and uniqueness of gamma MSE-RPs. An algorithm for generating gamma MSE-RPs is recommended.
Section 4 compares the performance of three typical gamma representative points in parameter estimation and simulation. The results demonstrate that gamma MSE-RPs have an advantage over other representative points in many scenarios.
Section 5 introduces a new Harrell–Davis standardization technique. Simulation studies show that the standardized samples perform better than random samples in estimation and can be used to generate gamma MSE-RPs.
Section 6 provides a real clinical data analysis and illustrates that the standardization technique yields efficient estimates for gamma parameters.
3. The Existence and Uniqueness of Gamma MSE-RPs
Let a random variable
with
and
be the set of supporting points of
X. To minimize
, we take the partial derivatives of (
1), which gives
where
is the pdf of the gamma distribution (
2). When
, the system of equations (
7) has only one equation
Obviously, it has one solution
, which is the only representative point. When
, MSE-RPs exist if the system of equations (
7) has a solution. After several transformations, (
7) becomes
where
is the cdf. Theorem 1 shows that the system of equations (
8) has a solution:
Theorem 1.
For given , the equation has a solution if and only if .
For given , the equation has a solution when , where is the th representative point in the set of gamma MSE-RPs, which has .
For a given , the equation has a solution.
Theorem 1 guarantees the existence of gamma MSE-RPs. Its proof is provided in
Appendix A. For the special case
, the existence can be proved using statements 1 and 3 of Theorem 1. Next, we show the uniqueness of gamma MSE-RPs in Theorem 2.
Theorem 2. Suppose . For any , the set of gamma MSE-RPs is unique if .
The proof of Theorem 2 is provided in
Appendix A. As a result, these two theorems guarantee the existence and uniqueness of gamma MSE-RPs. Furthermore, throughout this paper, gamma MSE-RPs are generated based on the self-consistency algorithm [
22]. The details of this algorithm are provided in
Appendix B.
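As an illustration of the self-consistency idea, the following sketch alternates between cell boundaries (midpoints of adjacent points) and conditional means within each cell. It is a minimal sketch, not the algorithm of Appendix B, which is not reproduced here. It assumes a shape-scale parameterization of the gamma distribution, and the names `gamma_cdf` and `gamma_mse_rps`, the starting values, and the tolerance are hypothetical. It relies on the standard identity that the partial expectation of a gamma variable with shape a over an interval equals its mean times the probability of that interval under a gamma distribution with shape a + 1 and the same scale.

```python
import math


def gamma_cdf(x, a):
    """Cdf of a unit-scale gamma distribution with shape a (power series)."""
    if x <= 0.0:
        return 0.0
    if math.isinf(x):
        return 1.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total):
        term *= x / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(-x + a * math.log(x) - math.lgamma(a)))


def gamma_mse_rps(a, scale, k, tol=1e-10, max_iter=10000):
    """Fixed-point iteration for k self-consistent points of Gamma(shape=a, scale)."""
    mean = a * scale
    # Assumed starting values: k points spread between 0 and twice the mean.
    z = [mean * (2 * i + 1) / k for i in range(k)]
    p = [1.0 / k] * k
    for _ in range(max_iter):
        # Cell boundaries are the midpoints of adjacent points.
        edges = [0.0] + [0.5 * (z[i] + z[i + 1]) for i in range(k - 1)] + [math.inf]
        f0 = [gamma_cdf(e / scale, a) for e in edges]        # cell probabilities
        f1 = [gamma_cdf(e / scale, a + 1.0) for e in edges]  # partial expectations
        p = [f0[i + 1] - f0[i] for i in range(k)]
        # Conditional mean of X within each cell.
        new_z = [mean * (f1[i + 1] - f1[i]) / p[i] for i in range(k)]
        shift = max(abs(u - v) for u, v in zip(z, new_z))
        z = new_z
        if shift < tol:
            break
    return z, p


z, p = gamma_mse_rps(a=2.0, scale=1.5, k=3)
print(z, p)
```

The power series for the cdf is adequate for moderate arguments; for large shape parameters or far-tail boundaries a library routine for the regularized incomplete gamma function would be a more robust choice.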
4. Gamma MSE-RPs in Parameter Estimation and Simulation
This section compares the performances of gamma MSE-RPs with other types of representative points, i.e., NT-RPs and MC-RPs, in terms of parameter estimation and simulation. Recall that random variable
and
Z is a discrete approximation of
X. The mean, variance, skewness and kurtosis of
Z are
By the method of moments, we have
which are the point estimators of
a and
b in
. As
Z is a discrete approximation of
X, it is expected that the moments of
Z and estimates in (
12) are close to the moments of
X,
a and
b accordingly. The following theorem shows some connections between gamma MSE-RPs and the corresponding
.
Theorem 3. Let with , and let be a set of gamma MSE-RPs of with corresponding probabilities in (4); then,
The proof of Theorem 3 is provided in
Appendix A. Note that Theorem 3 holds not only for the gamma distribution but also for all continuous population distributions. Next, moments and estimates in (
12) are calculated from MSE-RPs, NT-RPs, and MC-RPs of different
. We choose three typical shapes of gamma distributions (
, monotone decreasing;
, right skewed; and
, bell-shaped; their pdfs are plotted in
Figure 1) and consider three sizes of representative points (
). The first parts of
Table 1,
Table 2 and
Table 3 summarize the results in different scenarios. The last line of each table presents the moments and parameters of
. It is clear that if
k is fixed, the moments and estimates of MSE-RPs are closer to the true values than other representative points. Moreover, we can observe that the means of MSE-RPs are almost equal to the means of
in all scenarios; when
k becomes large, the moments and estimates of MSE-RPs converge to the true values much faster than other representative points. These results are consistent with the description in Theorem 3.
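The moment calculations for a discrete approximation can be sketched as follows. The formulas in (12) are not reproduced in this text, so the code uses the standard method-of-moments estimators for a gamma distribution under an assumed shape-scale parameterization; under a shape-rate parameterization the second estimator would be mean/variance instead. The function names and the use of non-excess kurtosis are illustrative assumptions.

```python
import math


def discrete_moments(z, p):
    """Mean, variance, skewness and (non-excess) kurtosis of a discrete variable
    that takes value z[i] with probability p[i]."""
    mean = sum(pi * zi for pi, zi in zip(p, z))
    var = sum(pi * (zi - mean) ** 2 for pi, zi in zip(p, z))
    skew = sum(pi * (zi - mean) ** 3 for pi, zi in zip(p, z)) / math.sqrt(var) ** 3
    kurt = sum(pi * (zi - mean) ** 4 for pi, zi in zip(p, z)) / var ** 2
    return mean, var, skew, kurt


def gamma_method_of_moments(mean, var):
    """Method-of-moments estimates (shape, scale), assuming mean = shape * scale
    and variance = shape * scale ** 2."""
    return mean * mean / var, var / mean


# A toy two-point distribution, chosen only so the arithmetic is easy to follow.
mean, var, skew, kurt = discrete_moments([1.0, 3.0], [0.5, 0.5])
shape_hat, scale_hat = gamma_method_of_moments(mean, var)
print(mean, var, skew, kurt, shape_hat, scale_hat)
```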
Next, we compare the estimation performance of samples drawn from representative points. We take samples from different shapes of gamma distributions (
,
and
), as well as their representative points with different sizes (
). With the sample size set to
and the sampling repeated
times for each scenario, the method of moments estimates (
and
) and maximum likelihood estimates (
and
) are calculated. Define
as the average proportional deviation between the estimates and the parameters. The second parts of
Table 1,
Table 2 and
Table 3 show that MSE-RPs samples have the smallest average proportional deviation in most of the selected scenarios.
Table A1 and
Table A2 in
Appendix C give medians and 95% empirical confidence intervals of
,
,
and
. In this simulation study, we observe that the point estimates of
a and
b from MSE-RPs samples generally have good estimation accuracy with both the moment and maximum likelihood methods. Meanwhile, when
k is large, the estimation performance of MSE-RPs samples is similar to that of samples from the corresponding
. It is also worth mentioning that when
, the proportional deviations
and
are much smaller than
and
. That is, when the size of gamma MSE-RPs is small, it is better to estimate parameters using the method of moments.
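The simulation design described above can be sketched in a few lines. The exact definition of the average proportional deviation is not reproduced in this text, so the code assumes it is the mean of |estimate - parameter| / parameter over the repetitions. The three points and probabilities used below are arbitrary illustrative values, not gamma MSE-RPs, and all function names are hypothetical.

```python
import random


def sample_from_rps(points, probs, n, rng):
    """Draw n observations from the discrete distribution on the representative points."""
    return rng.choices(points, weights=probs, k=n)


def mom_shape(sample):
    """Method-of-moments estimate of the gamma shape parameter: mean^2 / variance."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n
    return mean * mean / var


def average_proportional_deviation(estimates, true_value):
    """Assumed definition: average of |estimate - true| / |true|."""
    return sum(abs(e - true_value) for e in estimates) / (len(estimates) * abs(true_value))


rng = random.Random(1)
points, probs = [0.5, 2.0, 5.0], [0.4, 0.4, 0.2]  # illustrative values only
draw = sample_from_rps(points, probs, n=50, rng=rng)
estimates = [mom_shape(sample_from_rps(points, probs, 50, rng)) for _ in range(200)]
print(average_proportional_deviation(estimates, true_value=2.0))
```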
5. Generating MSE-RPs from Harrell–Davis Standardized Samples
This section discusses how to generate MSE-RPs from a gamma-distributed sample. A commonly used approach has two steps as follows:
Calculate the maximum likelihood estimates (MLEs) for a and b, namely and , based on the sample dataset;
Generate MSE-RPs from the gamma distribution with the estimated parameters, i.e., .
As we know, the representativeness of MSE-RPs depends on the estimates of the gamma parameters: more accurate estimates produce better representativeness. However, if a random sample does not represent the population well, the estimates may deviate substantially from the true parameters, and the MSE-RPs generated from them are then poor representatives of the population distribution. This usually occurs when the sample size is small or medium. Next, we introduce a new Harrell–Davis (HD) standardization technique that can reduce the effect of randomness in samples. This technique transforms a random sample into a set of HD quantile estimators and then treats these estimators as a new “sample”. Recall that a set of quantiles with equal probability is a set of NT-RPs for the population; a similar idea is utilized for sample standardization.
Definition 1 (HD standardized sample).
Let be a set of sample data from a gamma distribution; set , which is called the HD standardized sample of x, where is the th HD quantile estimator defined in (6), and ().
Note here that is not a random sample because are not independent. However, since the quantile estimators are equiprobable (), the set is treated as an arbitrarily selected sample, which can be used to calculate MLEs for a and b.
A new approach to generating MSE-RPs is proposed as follows:
Obtain the HD standardized sample;
Calculate the MLEs for a and b, namely and , based on the HD standardized sample;
Generate MSE-RPs from .
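The first step of this approach can be sketched as follows. Equation (6) is not reproduced in this text, so the code uses the standard Harrell–Davis estimator, a weighted sum of order statistics whose weights are increments of a Beta((n+1)p, (n+1)(1-p)) cdf. The probability levels p_i = (2i-1)/(2n) are an assumption made for illustration, as are all function names. The regularized incomplete beta function is evaluated with a textbook continued-fraction expansion; the remaining two steps would apply any gamma MLE routine to the returned values.

```python
import math


def _beta_cf(a, b, x, max_iter=500, eps=3e-16, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def hd_quantile(sorted_x, prob):
    """Harrell-Davis estimate of the prob-th quantile from a sorted sample."""
    n = len(sorted_x)
    a, b = (n + 1) * prob, (n + 1) * (1.0 - prob)
    cdf = [beta_cdf(i / n, a, b) for i in range(n + 1)]
    return sum((cdf[i + 1] - cdf[i]) * sorted_x[i] for i in range(n))


def hd_standardized_sample(x):
    """HD quantile estimates at the assumed levels (2i - 1) / (2n), i = 1, ..., n."""
    xs = sorted(x)
    n = len(xs)
    return [hd_quantile(xs, (2 * i - 1) / (2 * n)) for i in range(1, n + 1)]


print(hd_standardized_sample([0.8, 2.9, 1.4, 0.3, 5.1, 2.2]))
```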
Next, a simulation study is provided to show the good performance of HD standardized samples in parameter estimation. Consider three gamma distributions (
,
and
) and three different sample sizes (
). In each scenario, a total of
random samples are generated and their HD standardized samples are obtained. The MLEs are calculated for each sample/standardized sample and summarized in
Table 4. This shows that the means of estimates from HD standardized samples are closer to the true value in most scenarios. Moreover, the estimates from HD standardized samples appear to have smaller standard deviations than those from random samples. We conclude that HD standardized samples outperform random samples in terms of estimation accuracy and stability based on these results. Therefore, it is recommended to use the new three-step approach to generate MSE-RPs. Here, a comparison study between the MSE-RPs generated by random samples and HD-samples is provided. The estimates (
and
) in
Table 4 are used to generate gamma MSE-RPs.
Table 5 summarizes the results when
with the size of MSE-RPs
. It shows that the moments of gamma MSE-RPs from HD samples are close to the moments of the original
. Meanwhile, the method of moments estimates in (
12) are obtained. The estimates from HD samples are more accurate than those from random samples. This conclusion is generally valid when
.
It is noteworthy that the HD standardization technique can also be applied in resampling. Consider another simulation study with the same settings as
Table 4. We resample from each sample/standardized sample using
and calculate the MLEs. The means and standard deviations of the resampled MLEs are summarized in
Table 6. This shows that estimates from standardized samples are generally more accurate and have smaller standard deviations when resampling.
6. Real Data Illustration
In this section, we consider a real-world dataset and illustrate the HD standardization technique proposed in the previous section. In this clinical study, 97 Swiss females (
) aged 70–74 inclusive at the time of diagnosis of dementia (a form of mental disorder) were studied for survival times (in years) by Elandt–Johnson and Johnson [
23]. These data were analyzed by Ozonur and Paul [
24] using the likelihood ratio test and score test with
p-values 0.233 and 0.140, which are greater than 0.05. Both tests suggest that the two-parameter gamma distribution adequately fits the dementia data.
Point estimates (MLE) and the bootstrap interval estimates [
25] based on the original sample data and the corresponding HD sample are calculated. The approximate (
) bootstrap percentile interval is defined as
In practice, we resample the original data times to obtain 1000 replications of the parameter estimates (i.e., and for the gamma distribution) with . These estimates are sorted; the 25th value is used as the lower bound and the 975th value as the upper bound. The MLEs based on the HD standardized sample are and with confidence intervals and . These confidence intervals are shorter than those based on the original sample data, where and with confidence intervals and .
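The percentile rule described above (sort the replications and read off the 25th and 975th values out of 1000) can be sketched as follows. The function name and its arguments are hypothetical, and the example uses the sample mean on made-up numbers as a stand-in estimator; the study itself uses gamma MLEs, which are not implemented here.

```python
import random


def bootstrap_percentile_interval(data, estimator, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample with replacement, re-estimate,
    sort the replications, and take the alpha/2 and 1 - alpha/2 order statistics."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(estimator(rng.choices(data, k=n)) for _ in range(n_boot))
    lower_rank = max(1, round(n_boot * alpha / 2))             # 25 when n_boot = 1000
    upper_rank = min(n_boot, round(n_boot * (1 - alpha / 2)))  # 975 when n_boot = 1000
    return reps[lower_rank - 1], reps[upper_rank - 1]


def sample_mean(xs):
    return sum(xs) / len(xs)


# Made-up numbers for illustration only; these are not the dementia data.
toy_data = [1.2, 0.7, 3.4, 2.1, 5.6, 0.9, 2.8, 4.3, 1.5, 3.0]
lower, upper = bootstrap_percentile_interval(toy_data, sample_mean)
print(lower, upper)
```

Passing the estimator as a function keeps the resampling logic independent of how the parameters are estimated, so the same routine can be applied to the original sample and to the HD standardized sample.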
7. Concluding Remarks
In the first part of this paper, the existence and uniqueness of gamma MSE-RPs are proved using two different approaches. An effective algorithm is recommended for the generation of gamma MSE-RPs. The second part of this paper compares gamma MSE-RPs with other representative points in terms of parameter estimation and simulation. This shows that the moments and estimates based on gamma MSE-RPs are the closest to the true values in different scenarios. In addition, samples from gamma MSE-RPs show good general estimation accuracy. The last part of this paper introduces the new HD standardization technique. When a gamma-distributed sample is at hand, we recommend first transforming it into the HD standardized sample and then using it to estimate gamma parameters or generate MSE-RPs.
In future work, we would like to study whether the MSE-RPs of other distributions also perform well in parameter estimation. It would also be interesting to explain theoretically how the HD standardization technique reduces the randomness in samples.