1. Introduction
Overdispersed count data with excess zeros frequently occur in various fields such as natural science (e.g., the number of torrential rainfall incidences at the Daegu and Busan rain gauge stations in South Korea [
1]), medical science (e.g., the DMFT (decayed, missing, and filled teeth) index in dentistry [
2], the number of falls by people with Parkinson’s disease [
3], and the number of daily COVID-19 deaths in Thailand [
4]), and insurance (e.g., the frequency of health insurance claims [
5]). Although the Poisson model is widely used to analyze such discrete data, the strong assumption of the equality of the mean and variance implies that it is inadequate for modeling data where the variance is larger than the mean, which is termed “overdispersion” [
6] and can arise in various ways.
One common source of overdispersion is when the observed counts contain excess zeros, which has motivated the creation of modified count models such as the zero-inflated (ZI) and hurdle models. These two classes can be considered finite mixture models of two subpopulations (components): (1) the observations contain zeros with a probability
, and (2) the observations occur with a probability
from the baseline distribution of a ZI model and the zero-truncated probability mass function (pmf) (the nonzero part) of a hurdle model. One of the most popular ZI models is the ZI Poisson (ZIP) model, where the Poisson distribution is the baseline distribution [
7].
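To make the distinction between the two classes concrete, the following minimal Python sketch evaluates a ZI pmf and a hurdle pmf with a Poisson baseline. The symbols `omega` (the extra probability of a zero) and `lam` (the Poisson rate) and the function names are assumptions introduced here for illustration; they are not taken from the cited works.

```python
import math

def poisson_pmf(x, lam):
    """Baseline Poisson pmf."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def zip_pmf(x, omega, lam):
    """ZI Poisson: a point mass at zero mixed with the full Poisson baseline."""
    base = (1.0 - omega) * poisson_pmf(x, lam)
    return omega + base if x == 0 else base

def hurdle_poisson_pmf(x, omega, lam):
    """Hurdle Poisson: zeros come only from the point mass;
    positive counts come from the zero-truncated Poisson pmf."""
    if x == 0:
        return omega
    return (1.0 - omega) * poisson_pmf(x, lam) / (1.0 - poisson_pmf(0, lam))

# Hypothetical parameter values, chosen only for illustration.
omega, lam = 0.3, 2.0
for x in range(4):
    print(x, round(zip_pmf(x, omega, lam), 4), round(hurdle_poisson_pmf(x, omega, lam), 4))
```

In the ZI model a zero can come from either component, so its probability exceeds `omega`; in the hurdle model the probability of a zero is exactly `omega`.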
However, overdispersion can also arise when large numbers of zeros and ones occur simultaneously. Since the ZI model is not appropriate for modeling such data, it has been extended to the zero-and-one-inflated (ZOI) model, which has been combined with several other distributions to produce the ZOI Poisson (ZOIP), ZOI geometric, ZOI negative binomial-beta exponential (ZOINB-BE), and ZOIP–Lindley mixed distributions. Zhang et al. [
8] explored the ZOIP distribution extended from the ZIP distribution [
9]. Five equivalent stochastic representations for a ZOIP random variable were presented, and their important distributional properties were derived. Tang et al. [
10] provided a ZOIP model and analyzed two real datasets of Legionnaires’ disease cases in Singapore and accidental deaths in Detroit, USA, both of which contained high proportions of zeros and ones. They used the data augmentation method to obtain the maximum likelihood estimators (MLEs) via the EM algorithm and Bayesian estimations via Gibbs sampling. By obtaining the lowest Akaike information criterion (AIC), deviance information criterion (DIC), and widely applicable information criterion (WAIC) values for the ZOIP, they reported that it was more appropriate for these datasets than the ZIP model. Liu et al. [
11] analyzed two real datasets of Legionnaires’ disease cases in Singapore and accidental deaths in Detroit, USA by using the ZOIP model with both the maximum likelihood and Bayesian estimation. They found that the ZOIP model was more appropriate than the ZIP model for analyzing these two real datasets. Liu et al. [
12] studied the number of daily accidental deaths in 1994, which were available from the NMMAPS database, by using a ZOIP regression model and investigated the maximum likelihood and Bayesian estimation, the expectation maximization algorithm, the generalized expectation maximization algorithm, and Gibbs’ sampling to estimate its parameters. They found that the ZOIP regression model was more appropriate than the ZIP model for analyzing this accidental deaths dataset, as it attained lower AIC, Bayesian information criterion (BIC), DIC, and WAIC values. Xiao et al. [
13] constructed a ZOI geometric distribution regression model and introduced Pólya-gamma latent variables in the Bayesian inference, and they found that it was more suitable for analyzing a doctoral dissertation dataset than the ZOIP regression model. Jornsatian and Bodhisuwan [
14] presented a ZOI negative binomial-beta exponential (ZOINB-BE) distribution, investigated some of its important properties, and analyzed three real datasets: the number of visits to the doctor in Germany in 1998 (the COUNT package in the R programming suite), the number of accidental injuries in the US in 2001 [
15,
16], and the number of monthly crimes in Greece from 1982 to 1993 [
17,
18]. They found that the ZOINB-BE distribution was most appropriate to fit these data by attaining the lowest log-likelihood, AIC, mean absolute error, and root mean squared error values. Tajuddin et al. [
19] introduced the ZOIP–Lindley distribution and developed MLEs and method-of-moment estimators for its parameters. They found that it was the most appropriate for analyzing two datasets: the number of criminal acts [
20] and the number of stillbirths in New Zealand white rabbits [
21]. Mohammadi et al. [
22] introduced a zero-and-one inflated INAR(1) process with a Poisson–Lindley distribution and analyzed the number of abortions of animals reported monthly, which contained large proportions of zeros and ones. They found that the proposed model had the best fit based on the AIC, BIC, log-likelihood, and root mean square differences between the observation and prediction (RMS) criteria.
Aside from these mixed distributions, another one is the two-parameter discrete cosine geometric (CG) distribution, which was proposed by Chesneau [
23]. This belongs to the family of weighted geometric distributions, with its pmf given by
where
If
, then we can obtain
, and
X follows the standard geometric distribution. The weighting makes the CG distribution more flexible than the standard geometric distribution, and the CG distribution is better for analyzing overdispersed data than the Poisson, geometric, negative binomial (NB), and weighted NB distributions. Junnumtuam et al. [
24] extended it and proposed the ZICG distribution (with CG as the baseline distribution), and they reported that it is appropriate for fitting overdispersed count data containing excess zeros, such as the number of daily positive COVID-19 cases at the Tokyo 2020 Olympic Games.
Since studies on ZOI count data have gained much research interest, and the CG distribution is appropriate for overdispersed data, we can extend it to the novel, four-parameter, discrete ZOICG distribution and derive some of its statistical properties such as the pmf, moment-generating function (mgf), mean, variance, and Fisher information. Moreover, its parameters are estimated by deriving their confidence intervals (CIs). There are several examples of applying CIs to analyze ZI and ZOI count data. Liu et al. [
11] considered both point and interval estimation for the parameters of a ZOIP model and compared Bayesian estimation using either the Jeffreys or reference prior with the MLE method via Monte Carlo simulation. The results indicated that the Bayesian estimates performed slightly better when the sample size was small or moderate. Tian et al. [
25] proposed CIs for the mean of a zero-and-one inflated population by using the jackknife empirical likelihood and adjusted jackknife empirical likelihood methods. Wald CIs were constructed for the parameters in the Bernoulli component of the ZIP and hurdle models in [
26] and for the ZIP mean in [
27].
This motivated us to study CIs for the parameters of the ZOICG distribution when it is used to model overdispersed data that contain a large proportion of zeros and ones and have a high index of dispersion. Five methods were considered: a Wald CI based on the MLE, equal-tailed Bayesian CIs based on the uniform or Jeffreys prior, and highest posterior density (HPD) intervals based on the uniform or Jeffreys prior. Furthermore, real data containing excess zeros and ones (the number of new daily COVID-19 deaths in Luxembourg in 2020) were used to investigate the efficacies of these methods.
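To illustrate how the two Bayesian interval types differ, the following minimal Python sketch computes an equal-tailed interval and an HPD interval from a set of posterior draws. The function names, the order-statistic conventions, and the artificial right-skewed draws are assumptions made for illustration; they are not taken from the implementation used in this study.

```python
import math

def equal_tailed_interval(draws, alpha=0.05):
    """Interval between the alpha/2 and 1 - alpha/2 empirical quantiles."""
    s = sorted(draws)
    n = len(s)
    lo = s[math.floor((alpha / 2) * (n - 1))]
    hi = s[math.ceil((1 - alpha / 2) * (n - 1))]
    return lo, hi

def hpd_interval(draws, alpha=0.05):
    """Shortest window of consecutive sorted draws holding about (1 - alpha) of them."""
    s = sorted(draws)
    n = len(s)
    m = max(1, math.ceil((1 - alpha) * n - 1e-9))  # number of draws inside the window
    best = min(range(n - m + 1), key=lambda i: s[i + m - 1] - s[i])
    return s[best], s[best + m - 1]

# Hypothetical right-skewed "posterior": quantiles of a unit exponential distribution.
draws = [-math.log(1 - (i + 0.5) / 1000) for i in range(1000)]
et = equal_tailed_interval(draws)
hpd = hpd_interval(draws)
print("equal-tailed:", et, "length", et[1] - et[0])
print("HPD:         ", hpd, "length", hpd[1] - hpd[0])
```

For a symmetric posterior the two intervals nearly coincide, whereas for a skewed posterior the HPD interval is the shorter of the two.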
4. Simulation Study
The performance of the ZOICG model in Equation (
15) was measured by using Monte Carlo simulation. The sample size was set to
n = 30, 50, or 100 with the proviso that
. The probability of real zeros (
) was set to 0.1 or 0.2. The probability of real ones (
) was set to 0.1, 0.2, or 0.3 while
was set to (0.5,0.7) or (0.9,1.5), and the nominal confidence level
was set to 0.95. Each simulation was replicated 3000 times, and the samples were generated by using the Metropolis–Hastings algorithm with 10,000 samples and a burn-in of 3000. The criteria for comparing the efficiencies of the CIs were their coverage probabilities (CPs) and average lengths (ALs). For a particular scenario, the best-performing CI was the one with a CP close to or greater than the nominal level of 0.95 and the shortest AL.
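The two comparison criteria can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and the stand-in experiment estimates a simple Bernoulli proportion with a Wald CI rather than the ZOICG parameters studied here.

```python
import math
import random

def coverage_and_length(intervals, true_value):
    """Return (CP, AL) for a list of (lower, upper) intervals."""
    hits = sum(1 for lo, hi in intervals if lo <= true_value <= hi)
    total_len = sum(hi - lo for lo, hi in intervals)
    return hits / len(intervals), total_len / len(intervals)

def wald_interval(successes, n, z=1.959964):
    """Wald CI for a proportion: estimate +/- z * estimated standard error."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# Stand-in experiment (NOT the ZOICG model): a proportion from n Bernoulli trials.
rng = random.Random(1)
true_p, n, reps = 0.2, 100, 3000
intervals = []
for _ in range(reps):
    successes = sum(rng.random() < true_p for _ in range(n))
    intervals.append(wald_interval(successes, n))

cp, al = coverage_and_length(intervals, true_p)
print(f"CP = {cp:.3f}, AL = {al:.3f}")
```

The CP is the fraction of replications whose interval contains the true value, and the AL is the mean interval width over the same replications.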
The CP and AL results are reported in
Table 1 and
Table 2, respectively. For the interval estimation of the parameter
, the CPs of all five methods were close to the nominal level of 0.95, especially when the sample size was large. For the interval estimation of the parameter
, an equal-tailed, two-sided Bayesian CI based on the uniform prior provided CPs equal to or more than 0.95 in almost all cases, but when the sample size was large (
), the CPs for all of the methods were similarly close to the nominal level of 0.95. For the interval estimation of parameter
p, when (
) was (0.5,0.7), only the equal-tailed two-sided Bayesian CI and HPD interval based on the uniform prior performed well (CP > 0.95), with the HPD interval based on the uniform prior having the shortest AL. However, when (
) was (0.9,1.5), all five methods performed similarly. For the interval estimation of
, when (
) was (0.5,0.7), all four Bayesian methods performed well (CP > 0.95), with the HPD interval based on the uniform prior providing the shortest AL. When (
) was (0.9,1.5), none of the methods produced CPs equal to or more than 0.95, with the Wald CI providing the worst performance. In general, the ALs decreased when the sample size was increased and when (
) was increased from (0.5,0.7) to (0.9,1.5).
5. The Efficacies of the Methods with Real Data
The number of new COVID-19 cases and deaths reported each day by country is available from the European Centre for Disease Prevention and Control (ECDC; accessed on 13 September 2022
https://www.ecdc.europa.eu/en/publications-data/data-daily-new-cases-covid-19-eueea-country). In this empirical study, the numbers of new COVID-19 deaths per day in Luxembourg from 24 February 2020 to 31 December 2020, a period of 312 days that included 167 days with 0 deaths and 47 days with 1 death, are presented in
Table 3 and
Figure 1. From the descriptive statistics for the dataset (
Table 4), the mean and variance were 1.6314 and 5.9570, respectively, which were used to calculate the index of dispersion (the variance divided by the mean) to be 3.6514. Since the index of dispersion was larger than one, this dataset was clearly overdispersed. The appropriate model was checked by comparing the AIC and corrected AIC (AICc) values of nine distributions: ZOICG, ZICG, CG, ZIP, ZIG, ZINB, Poisson, geometric, and NB (
Table 5). Those of the ZOICG were very similar (1038.372 and 1038.502) and the lowest, indicating that it provided the best fit for the data.
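The two summary quantities in this paragraph can be checked with a short Python sketch. It assumes the standard small-sample correction AICc = AIC + 2k(k + 1)/(n − k − 1), with k = 4 parameters for the ZOICG model and n = 312 observations; the function names are hypothetical.

```python
def index_of_dispersion(mean, variance):
    """Variance-to-mean ratio; values above 1 indicate overdispersion."""
    return variance / mean

def aicc(aic, k, n):
    """Corrected AIC, assuming the standard small-sample correction term."""
    return aic + 2 * k * (k + 1) / (n - k - 1)

# Values reported in the text: mean, variance, ZOICG AIC, k = 4 parameters, n = 312 days.
iod = index_of_dispersion(1.6314, 5.9570)
zoicg_aicc = aicc(aic=1038.372, k=4, n=312)
print(round(iod, 4), round(zoicg_aicc, 3))
```

With the rounded mean and variance, the ratio agrees with the reported 3.6514 up to rounding in the last digit, and the correction term 40/307 ≈ 0.130 reproduces the reported AICc of 1038.502.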
The
CIs for the parameters of the ZOICG model obtained by using the five methods are provided in
Table 6. The lower and upper bounds for parameters
and
provided by all of the methods were similar, with the lower bound for the parameter
p via the Wald CI being slightly lower and the
Wald-based CI for the parameter
being much lower than the others.
6. Discussion
In this paper, a novel four-parameter discrete distribution called the ZOICG distribution is proposed, and its statistical properties are derived. As the model in Equation (
6) has a complex pmf, it is more difficult to calculate parameters
and
by using the MLEs than by reparameterizing, and so the ZOICG model in Equation (
15) is needed. This model requires that parameters
and
are independent of parameters
p and
, which means that the proportion of data containing zeros and ones should be treated separately from the rest for the convenience of estimation. From the simulation results, it is clear that all of the methods could detect zeros and ones (
and
) with coverage probabilities close to the nominal level. The MLEs of parameters
p and
did not have a closed form, so a Newton-type algorithm was applied to compute them numerically. However, this algorithm proved unsuitable, especially for parameter
. Hence, Bayesian analysis was required in this study. However, the derivations of the Bayesian estimates for parameters
p and
were still complex, since their marginal posterior distributions did not have a closed form. Hence, random-walk Metropolis–Hastings steps within a Gibbs sampling algorithm were applied to generate samples for these two parameters.
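A random-walk Metropolis–Hastings step of this kind can be sketched in Python as follows. This is a one-parameter stand-in and not the sampler used in this study: the target is the posterior of a simple proportion under a uniform prior, the count of 167 zeros in 312 days is borrowed from Section 5 purely as example data, and the function names, step size, and starting value are assumptions.

```python
import math
import random

def log_posterior(p, zeros, n):
    """Log posterior (up to a constant) of a proportion under a uniform prior."""
    if not 0.0 < p < 1.0:
        return -math.inf  # no posterior mass outside (0, 1)
    return zeros * math.log(p) + (n - zeros) * math.log(1.0 - p)

def random_walk_mh(log_target, start, step, n_iter, burn_in, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric normal proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    kept = []
    for it in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        if it >= burn_in:
            kept.append(current)  # keep only post-burn-in draws
    return kept

# 10,000 iterations with the first 3000 discarded (one reading of the simulation settings).
draws = random_walk_mh(lambda p: log_posterior(p, 167, 312),
                       start=0.5, step=0.05, n_iter=10_000, burn_in=3_000)
print("posterior mean:", sum(draws) / len(draws))
```

For this stand-in target the exact posterior is Beta(168, 146), with mean 168/314, so the sampler's output can be checked against a known answer. In a Gibbs scheme, a step like this would be applied to each parameter in turn, conditional on the current values of the others.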
The simulation results show that the Bayesian methods generally performed better than the Wald CI based on the MLE. Moreover, the Bayesian methods based on the uniform prior were more efficient than those based on the Jeffreys prior. Since numerical optimization is sometimes unsuitable for a complex model, the Wald-based CI provided the worst estimates. For real data containing excess zeros and ones (new daily COVID-19 deaths in Luxembourg in 2020), the proposed ZOICG model attained the lowest AIC and AICc values among the nine candidate distributions, and so it is recommended in these circumstances. In future research, further characteristics and properties of the ZOICG distribution and additional statistical methods for estimating its parameters should be investigated.