1. Introduction
The density ratio model (DRM) was first introduced by Anderson [1] and later popularized by Qin and Zhang [2], who established the relationship between the two-sample DRM and the logistic regression model in case–control studies. The DRM models, in a semi-parametric way, the difference between two independent samples. Assume that $\{x_{01}, \dots, x_{0n_0}\}$ and $\{x_{11}, \dots, x_{1n_1}\}$ are two samples independently drawn from two cumulative distribution functions $F_0$ and $F_1$. The DRM postulates that

$$dF_1(x) = \exp\{\alpha + \beta^{\top} q(x)\}\, dF_0(x), \qquad (1)$$

where $q(x)$ is a d-dimensional pre-specified basis function while $\alpha$ and $\beta$ are unknown parameters. We can also generalize the DRM to the $(m+1)$-sample case as follows:

$$dF_k(x) = \exp\{\alpha_k + \beta_k^{\top} q(x)\}\, dF_0(x), \qquad (2)$$

where $(\alpha_0, \beta_0) = (0, 0)$ for $k = 0, 1, \dots, m$. Even though the form of $F_0$ is unspecified, many parametric distribution families satisfy the DRM, including the normal, exponential, and gamma distributions, among others.
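As a concrete illustration (our own sketch, not from the paper), the normal family satisfies the DRM with the basis function $q(x) = (x, x^2)$: the log density ratio of any two normal densities is linear in this basis, which the following numerical check confirms.

```python
# Numerical check that the normal family satisfies the DRM with
# q(x) = (x, x^2): log{f1(x)/f0(x)} is exactly linear in (1, x, x^2).
import numpy as np
from scipy.stats import norm

f0 = norm(loc=0.0, scale=1.0)      # baseline F0 = N(0, 1)
f1 = norm(loc=1.5, scale=2.0)      # F1 = N(1.5, 4)

x = np.linspace(-5.0, 5.0, 201)
log_ratio = f1.logpdf(x) - f0.logpdf(x)

# Fit log_ratio = alpha + beta1 * x + beta2 * x^2 by least squares;
# the fit is exact up to floating-point error.
X = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(X, log_ratio, rcond=None)
residual = np.max(np.abs(X @ coef - log_ratio))
print(residual < 1e-10)  # True: the ratio is log-linear in q(x)
```

The same check fails for, e.g., a normal versus a Cauchy density, which is why the choice of basis function matters in practice.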
Due to its flexibility and utility, the DRM has attracted increasing attention. Zhang [3] proposed a weighted Kolmogorov–Smirnov type statistic to test the validity of the DRM based on case–control data. Qin [4] and Zou et al. [5] applied the DRM to the semi-parametric mixture model and developed test statistics based on the empirical likelihood function. Zhang [6] derived the quantile estimator under a two-sample semi-parametric model, and Chen and Liu [7] generalized the estimator to the multiple-sample case. Another problem of interest is to test the homogeneity of the DRM, that is, to test whether $F_0 = F_1 = \cdots = F_m$. Fokianos et al. [8] outlined a method based on the classical normal-based one-way analysis of variance. Cai et al. [9] studied the properties of the dual empirical likelihood ratio tests for general hypotheses on the parameters. Moreover, let $F$ be the initial cumulative distribution function (cdf) of a population, and $F^{w}$ be the cdf of the weighted distribution of $F$, so that their densities are connected to each other as follows:

$$f^{w}(x) = \frac{w(x)\, f(x)}{E[w(X)]},$$

where $X$ is a random variable with density $f$. In the context of the DRM, the weight function is $w(x) = \exp\{\alpha + \beta^{\top} q(x)\}$. Thus, the DRM lies in the context of weighted distributions, which have many applications in various fields. The problem of detecting or estimating the weight function $w$ is of interest in the framework of weighted distributions; see Patil and Rao [10], Rao [11,12] and Lele and Keim [13].
Recent research on the DRM has mainly used the empirical likelihood function, which we briefly introduce below. Given the $m+1$ samples, the likelihood function of the model (2) has the form

$$L = \prod_{k=0}^{m} \prod_{j=1}^{n_k} dF_k(x_{kj}) = \prod_{k=0}^{m} \prod_{j=1}^{n_k} \exp\{\alpha_k + \beta_k^{\top} q(x_{kj})\}\, dF_0(x_{kj}).$$

Here $F_0$ is restricted to a discretized distribution supported on the observed data, with $p_{kj} = dF_0(x_{kj})$ constrained by $p_{kj} \ge 0$, $\sum_{k,j} p_{kj} = 1$, and $\sum_{k,j} p_{kj} \exp\{\alpha_s + \beta_s^{\top} q(x_{kj})\} = 1$ for $s = 1, \dots, m$. Then, the Lagrangian multipliers described in Qin and Lawless [14] are used to obtain the maximum empirical likelihood estimates of the parameters. However, the type-I error of the empirical likelihood ratio test cannot be well controlled in finite samples. To deal with this problem, Wang et al. [15] suggested using a nonparametric bootstrap procedure. However, the computational cost of the bootstrap procedure is non-negligible, especially when m is large.
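In the two-sample case, the connection with logistic regression established by Qin and Zhang [2] offers a simple route to the slope estimate. The sketch below (simulated data and a basis $q(x) = (x, x^2)$ of our choosing, not the paper's Lagrangian computation) estimates $\beta$ by fitting a logistic regression of the sample label on $q(x)$:

```python
# Estimating beta in a two-sample DRM via the logistic-regression
# connection (a sketch with simulated data; not the paper's code).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, size=500)   # sample from F0 = N(0, 1)
x1 = rng.normal(1.0, 1.0, size=500)   # sample from F1 = N(1, 1)

x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(500), np.ones(500)])  # sample labels
Q = np.column_stack([np.ones_like(x), x, x ** 2])  # (1, q(x))

def neg_loglik(theta):
    # Negative log-likelihood of the prospective logistic model
    # P(y = 1 | x) = expit(alpha* + beta' q(x)).
    eta = Q @ theta
    return np.sum(np.logaddexp(0.0, eta)) - eta[y == 1].sum()

beta = minimize(neg_loglik, np.zeros(3), method="BFGS").x[1:]
# For N(0, 1) vs N(1, 1) the true value is beta = (1, 0).
print(beta)
```

The fitted slope should be close to $(1, 0)$ here, since the true log density ratio of $N(1,1)$ to $N(0,1)$ is $x - 1/2$.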
We also notice that there is increasing interest in the case when there are zero values in the samples. This phenomenon arises in many research fields such as meteorology, health, economics, and the life sciences; see Tu and Zhou [16], Muralidharan and Kale [17] and Kassahun-Yimer et al. [18]. For example, in a meteorology study, a group of zero observations may correspond to a number of dry days on which no rainfall is recorded. Another example occurs in dietary intake studies, where zero observations may arise for food components that are consumed episodically. In the examples mentioned above, samples consist of two parts: the zero observations and the positive observations. This kind of distribution is also called a semicontinuous distribution, which has the form

$$F(x) = p\, I(x \ge 0) + (1 - p)\, G(x),$$

where p indicates the probability of drawing a zero observation and $G$ is a positive and continuous distribution. We recommend the reviews of Neelon et al. [19,20] for more details. In this paper, we adopt the DRM for the continuous parts, as this choice benefits from the advantages introduced above. Thus, the model becomes

$$F_k(x) = p_k\, I(x \ge 0) + (1 - p_k)\, G_k(x), \qquad (3)$$

where $dG_k(x) = \exp\{\alpha_k + \beta_k^{\top} q(x)\}\, dG_0(x)$ for $k = 0, 1, \dots, m$, and I is the indicator function.
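Drawing from such a semicontinuous distribution is straightforward: a Bernoulli draw decides between a zero and a draw from the continuous part. The following sketch uses an assumed log-normal continuous part and p = 0.3 (illustrative values, not from the paper):

```python
# Sampling from a semicontinuous mixture
# F(x) = p * I(x >= 0) + (1 - p) * G(x), with G log-normal (assumed).
import numpy as np

rng = np.random.default_rng(1)
p, n = 0.3, 10_000                       # P(zero observation), sample size

is_zero = rng.random(n) < p              # Bernoulli part of the mixture
positive = rng.lognormal(mean=0.0, sigma=1.0, size=n)  # continuous part G
sample = np.where(is_zero, 0.0, positive)

print(np.mean(sample == 0))              # empirical zero fraction, near p
```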
A two-part test is proposed to test the homogeneity of the model (3), which is a fundamental problem in real applications. For example, different distributions of precipitation in certain areas across years may influence the strategy of agricultural irrigation. Furthermore, in colorectal cancer clinical trials, it is important to compare the efficacy and safety of two or more treatment arms; see Lachenbruch [21], Su et al. [22], Smith et al. [23] and Wang and Tu [24]. The two-part test consists of a test for the binomial distribution and another for the continuous responses. For the two-sample case, Wang et al. [15] suggested that the former be a $\chi^2$ test while the latter can be a Wilcoxon–Mann–Whitney rank-sum test or a two-sample t-test. For the multiple-sample case, the latter can be replaced by a Kruskal–Wallis rank-sum test or an ANOVA F-test; see, for example, Wilcox [25], Hallstrom [26] and Pauly et al. [27]. However, to the best of our knowledge, the tests mentioned above may perform badly in heteroskedastic cases.
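The classical two-part procedure just described can be sketched with scipy as follows (our illustration on simulated homogeneous data, not the paper's modified Wald test): a $\chi^2$ test on the zero counts plus a Kruskal–Wallis test on the positive parts.

```python
# Two-part homogeneity test: chi-square on zero proportions,
# Kruskal-Wallis on the positive observations (illustrative sketch).
import numpy as np
from scipy.stats import chi2_contingency, kruskal

rng = np.random.default_rng(2)
# Three homogeneous semicontinuous samples: 40% zeros, log-normal positives.
samples = [np.where(rng.random(200) < 0.4, 0.0,
                    rng.lognormal(0.0, 1.0, 200)) for _ in range(3)]

# Part 1: chi-square test for equal zero probabilities across samples.
counts = np.array([[np.sum(s == 0), np.sum(s > 0)] for s in samples])
p_binary = chi2_contingency(counts)[1]

# Part 2: Kruskal-Wallis rank-sum test on the positive observations.
p_positive = kruskal(*[s[s > 0] for s in samples]).pvalue

print(p_binary, p_positive)  # combine the two parts, e.g. via Bonferroni
```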
In this paper, we propose an efficient method based on the exponential family of distributions. First, the problem of testing homogeneity is transformed into testing the equality of the mean parameters. Secondly, a Wald test statistic is proposed to test this equality. Since the baseline distribution $F_0$ is unknown, we modify the Wald test statistic based on the sample from $F_0$. This modified statistic has a simple closed form and we show that it converges in distribution to a $\chi^2$ distribution under the null hypothesis. We also give the local asymptotic power. Thirdly, the Bernoulli distribution can itself be regarded as a DRM, which yields the combined modified Wald test for the semicontinuous case. Finally, simulation studies illustrate that the computational cost of the modified Wald test is much lower than that of the bootstrap procedure, while it controls the type-I error better than the empirical likelihood ratio test. Moreover, the power of the modified Wald test is competitive.
The rest of the paper is organized as follows. In Section 2, we propose the method for testing the homogeneity of the two-sample model for both continuous and semicontinuous distributions. In Section 3, we generalize the result to multiple-sample cases. In Section 4, we illustrate the performance of the modified Wald test and compare it with the empirical likelihood ratio test through simulations. We consider a real data example to show the practicability of our method and give the conclusions in the last section.
4. Simulation Study
In our simulations we compare three tests. In addition to the modified Wald test we propose, denoted by “MWT”, the others are the dual empirical likelihood ratio test proposed by Cai et al. [9] and the empirical likelihood ratio test using the bootstrap procedure proposed by Wang et al. [15], denoted by “DELRT” and “BELRT”, respectively. We aim to show that our modified Wald test is applicable in a variety of cases. In the first simulation study, we consider the case when the number of populations is large, and we compare the performances and computational costs of the three tests. It can be seen that MWT controls the type-I error better than DELRT while taking much less time than BELRT. In the second, we consider three normal distributions with the same scale and study how the tests perform as the location parameters change, so that the three populations vary from identical to totally different; Figure 1 shows clearly how the three tests perform. In the third simulation study we verify Remark 3 in our context, which describes an interesting way in which the power is affected by the sample sizes under certain alternative hypotheses. In the last, we consider the semicontinuous case when the continuous part is either a log-normal or a gamma distribution; the same parameter settings are also considered by Wang et al. [15]. From Figure 2 and Figure 3, we can see that our method is competitive.
4.1. Scenario 1
We consider the DRM with up to 11 populations. Let $F_0$ be the standard normal distribution while the rest are normal distributions with scale fixed to 1 and a common location parameter. We choose the same sample size (up to 50) for all the populations and generate repetitions for each combination of the number of populations and the location parameter. Then, we calculate the type-I error of the three statistics under the null hypothesis and their power under the alternatives at the 5% significance level. The results are shown in Table 1 and Table 2, respectively.
It can be seen that the type-I error of DELRT is not as well controlled as that of the other two, while the type-I error and power of MWT are similar to those of BELRT. However, the computational cost of MWT is much smaller. For DELRT and the modified Wald test, a single repetition takes no more than 40 s even in the largest setting. For the bootstrap procedure, in contrast, a single repetition takes nearly 4 h using the “for” loop in the R programming language in the smaller settings, 12 h in the intermediate one, and nearly a whole day in the largest. Certainly, parallel computation can accelerate this, but the running time remains a big challenge. The modified Wald test statistic we propose is a promising compromise, especially when the number of populations is large: it controls the type-I error better than DELRT while retaining a similar computational cost.
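The type-I error figures reported here come from a Monte Carlo loop of the usual form. As a self-contained sketch (with scipy's ANOVA F-test standing in for MWT/DELRT/BELRT, which are not reproduced here), the rejection rate under the null should land near the nominal 5%:

```python
# Monte Carlo estimation of the type-I error of a homogeneity test
# (ANOVA F-test used as a stand-in for the tests compared in the paper).
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(3)
M, n, alpha = 2000, 50, 0.05             # repetitions, sample size, level

rejections = 0
for _ in range(M):
    groups = [rng.normal(0.0, 1.0, n) for _ in range(3)]  # null: identical
    rejections += f_oneway(*groups).pvalue < alpha

print(rejections / M)                    # estimated type-I error, near 0.05
```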
4.2. Scenario 2
In the second simulation study, we show how our test statistic performs in the case of three continuous populations. We choose the three populations as normal distributions with scale equal to 1 and with location parameters equally spaced around 0. We then increase the spacing, starting from 0.2, to see how the test statistics perform as the three distributions vary from “similar” to “totally different”. We consider equal sample sizes up to 50 for several choices of the location spacing. We generate M = 10,000 repetitions for each case and show the comparison of the three statistics in Table 3 and Figure 1. In the figure, “MWT”, “DELRT”, and “BELRT” denote the modified Wald test, the dual empirical likelihood ratio test, and the bootstrap empirical likelihood ratio test, respectively.
Figure 1.
Type-I error and power (%) of the three statistics in simulation two for different sample sizes.
It can be seen that the modified Wald test controls the type-I error nicely in this case, even when the sample size is small. The power of the modified Wald test is always smaller than that of DELRT, owing to its better control of the type-I error; however, the disparity gradually vanishes as the sample size and the differences between the populations increase.
4.3. Scenario 3
In this simulation study, we verify the conclusion in Remark 3. The total sample size n is fixed, and $m = 2$ and 4 (i.e., the three- and five-sample cases) are under consideration. We choose different sample-size allocations for both cases and compare the resulting powers. We fix $F_0$ to a normal, a log-normal, and a gamma distribution in turn. The remaining populations are chosen to be the same type of distribution as $F_0$ with different parameter values, up to 0.7 for the normal and log-normal cases and up to 1.6 for the location parameter in the gamma case. For each sample-size allocation and parameter setting, we generate M = 100,000 repetitions and calculate the power. The details are given in Table 4 and Table 5. The symbols I to VIII in Table 5 denote the different sample sizes shown in Table 6.
It can be seen that the conclusion in Remark 3 essentially holds. Clearly, one of the sample sizes has the biggest impact on the power, while the rest of the sample sizes do not seem to have much influence. This can be seen quite clearly from the comparison of the first four sample-size settings in the three-sample case, and of cases I and II and cases V and VI in the five-sample case.
4.4. Scenario 4
In this simulation study, we consider the semicontinuous case. We adopt the same parameter settings as in Wang et al. [15]. Assume that the samples are generated from the model (3) for $k = 0, 1, \dots, m$, where the continuous parts $G_k$ are all log-normal or gamma distributions. The parameters are presented in Table 7. Each of the LN and GAM rows in the first column denotes a mixture model whose continuous part follows a log-normal or gamma distribution, and $p_k$ denotes the probability of drawing a zero observation for the kth population. LN$(\mu, \sigma^2)$ denotes a log-normal distribution whose associated normal distribution has mean $\mu$ and variance $\sigma^2$, and GAM$(a, b)$ denotes a gamma distribution with shape parameter a and scale parameter b. We consider both equal and unequal sample sizes. For every parameter setting, we generate M = 10,000 repetitions. We calculate the type-I error of testing homogeneity at the 5% significance level for the null settings and the power for the rest of the parameter settings. The type-I errors of the three statistics are shown in Table 8 while the powers are shown in Table 9 and Table 10, respectively, for the log-normal and gamma cases. To give a better view, we plot the powers of the three statistics in Figure 2 and Figure 3. It can be seen that the results are competitive.
Figure 2.
Power (%) for testing homogeneity at significance level 0.05 when data are generated from the LN settings in Table 7.
Figure 3.
Power (%) for testing homogeneity at significance level 0.05 when data are generated from the GAM settings in Table 7.
5. Real Data Sample
In this section, we employ the real data example suggested by Wang et al. [15], which is available from the website of the University of Waterloo weather station data archive (http://weather.uwaterloo.ca/data.html, accessed on 1 June 2023). We focus on the data recording the daily precipitation measurements (in millimeters) on the North Campus of the University of Waterloo, Canada, and investigate whether the precipitation distribution has changed over the past few years.
Following Wang et al. [15], to reduce the time dependence among the observations, we take every fourth measurement into our analysis, i.e., we only use the observations on days 1, 5, 9, …, 361, which gives a sample size of 91 for each year. We then consider two cases, one from 2003 to 2006 and the other from 2008 to 2012, hoping to learn how the precipitation distribution has changed over these years. Some summaries of the samples are given below:
From 2003 to 2006, the estimates of the probability of dry days are (0.30, 0.40, 0.42, 0.42) while those of 2008 to 2012 are (0.45, 0.49, 0.43, 0.38, 0.40).
The sample means of 2003 to 2006 are (2.05, 3.54, 3.40, 3.50) while those of 2008 to 2012 are (3.42, 1.37, 2.29, 4.08, 3.09).
The sample variances are (17.52, 41.07, 76.10, 59.50) and (95.19, 13.53, 18.35, 73.83, 59.76), respectively.
For each null and alternative hypothesis, we fit the data to both the log-normal and the gamma mixture under the assumption of the density ratio model using maximum likelihood estimation. The details are given in Table 11 below. There is a small difference between our parameter estimates and those of Wang et al. [15]; this may be caused by a mistake in summarizing the data of the year 2003. The first LN and GAM rows give the parameters under the null hypothesis for the case of 2003 to 2006, and the following rows give those for 2008 to 2012. The rest of the parameters are for the alternative hypotheses.
We apply the modified Wald test to the log-normal and gamma null hypotheses, respectively. The test statistic is 21.65 for the log-normal mixture and 24.02 for the gamma mixture. Both statistics are larger than the corresponding 95% chi-square quantile, which is 15.51, so the null hypothesis is rejected at the significance level 0.05. We then move on to the case of the 5 years from 2008 to 2012, where the result is quite different. The test statistic for the log-normal mixture is 11.70, while that for the gamma mixture is 9.95; both are smaller than the corresponding 95% chi-square quantile, which is 18.3074, which means that the null hypothesis cannot be rejected at the significance level 0.05. The two analyses above indicate that the precipitation distribution of the area was changing from 2003 to 2006, but may have remained unchanged over 2008 to 2012.
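The two critical values quoted in this section, 15.51 and 18.3074, match (up to rounding) the 95% quantiles of the chi-square distributions with 8 and 10 degrees of freedom. Assuming these are the reference distributions, they can be reproduced with scipy:

```python
# Reproducing the chi-square critical values at the 0.05 level
# (degrees of freedom 8 and 10 assumed from the quoted quantiles).
from scipy.stats import chi2

print(round(chi2.ppf(0.95, df=8), 2))    # 15.51
print(round(chi2.ppf(0.95, df=10), 3))   # 18.307
```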