1. Introduction
The high occurrence of zero-valued observations in insurance claim data is a well-documented phenomenon. Traditional count models, such as the Poisson or Negative Binomial, struggle to represent insurance claim data accurately because the observed frequency of zeros exceeds what these models predict, producing excess dispersion.
Perumean-Chaney et al. (2013) emphasized the importance of considering the excess zeros to achieve satisfactory modeling of both zero and non-zero counts. In the statistical literature, two main approaches have been developed to address datasets with a large number of zeros: Hurdle models and Zero-Inflated models. Hurdle models, initially proposed by Mullahy (1986), adopt a truncated-at-zero approach, as seen in truncated Poisson and Negative Binomial models. Zero-Inflated models, introduced by Lambert (1992), utilize a mixture model approach, separating the population into two groups: one with only zero outcomes and the other with outcomes from an ordinary count distribution. Notable examples include Zero-Inflated Poisson and Zero-Inflated Negative Binomial regressions.
The generic Zero-Inflated distribution is defined as:
$$P(Y = y) = \begin{cases} p + (1-p)\,f(0), & y = 0,\\[2pt] (1-p)\,f(y), & y = 1, 2, \ldots, \end{cases}$$
where $p$ denotes the probability of extra zeros and $f$ can be any count distribution, such as Poisson or Negative Binomial. If $f$ is a Poisson density, this model simplifies to the Zero-Inflated Poisson (ZIP) model, expressed by the density:
$$P(Y = y) = \begin{cases} p + (1-p)\,e^{-\lambda}, & y = 0,\\[2pt] (1-p)\,\dfrac{e^{-\lambda}\lambda^{y}}{y!}, & y = 1, 2, \ldots \end{cases}$$
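For concreteness, the ZIP density can be transcribed directly into code. The following is a minimal sketch in Mathematica (the language used for the computations in this paper); `zipPMF` is our own illustrative name, not part of the paper's code:

```mathematica
(* ZIP probability mass function, a direct transcription of the density
   above; zipPMF is an illustrative name, not from the paper. *)
zipPMF[y_Integer, lam_, p_] := If[y == 0,
  p + (1 - p) Exp[-lam],
  (1 - p) Exp[-lam] lam^y/y!]

zipPMF[0, 2., 0.3]  (* -> 0.394735, the inflated probability of a zero *)
```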
In cases where covariates are linked with both the probability $p$ of a structural zero and the mean $\lambda$ of the Poisson component, logistic regression is used to model $p$, and log-linear regression is applied to model $\lambda$. This analytical framework is referred to as ZIP regression, which, while not the primary focus of this study, serves as a foundation for our exploration.
In the insurance sector, the adoption of Zero-Inflated models is widespread, reflecting the distinctive characteristics of insurance data, which often comprise both actual zeros (indicating no claims) and potential claims. Various studies exemplify this application trend:
Mouatassim and Ezzahid (2012) applied Zero-Inflated Poisson regression to a private health insurance dataset, using the EM algorithm to maximize the log-likelihood function. Chen et al. (2019) introduced a penalized Poisson regression for subgroup analysis in claim frequency data, implementing an ADMM algorithm for optimization. Zhang et al. (2022) developed a multivariate zero-inflated hurdle model for multivariate count data with extra zeros, employing the EM algorithm for parameter estimation.
Ghosh et al. (2006) delved into a Bayesian analysis of Zero-Inflated power series (ZIPS) regression models, employing the log link function to relate the mean of the power series distribution to covariates and the logit function for modeling $p$. They proposed beta and power series-specific priors for the unknown parameters, overcoming analytical challenges through Monte Carlo simulation-based techniques. Their findings highlighted the superiority of the Bayesian approach over traditional methods.
Recent trends also show an inclination towards integrating machine learning techniques with Zero-Inflated models.
Zhou et al. (2022) proposed modeling Tweedie's compound Poisson distribution using EMTboost. Lee (2021) used cyclic coordinate descent optimization for Zero-Inflated Poisson regression, addressing saddle points with Delta boosting. Meng et al. (2022) introduced innovative approaches using Gradient Boosted Decision Trees for training Zero-Inflated Poisson regression models.
To the best of our knowledge, the most recent Bayesian analysis was conducted by Angers and Biswas (2003). In their study, the authors discussed a Zero-Inflated generalized Poisson model that includes three parameters, namely $p$, $\theta$, and $\lambda$, the latter two being the generalized Poisson parameters. The Bayesian analysis employs a conditional uniform prior for the parameters $p$ and $\theta$, given $\lambda$, while Jeffreys' prior is used for $\lambda$. The authors concluded that analytical integration was not feasible, leading to the use of Monte Carlo integration with importance sampling for parameter estimation. In contrast, our study utilizes beta and gamma priors to provide enhanced flexibility in the shapes of prior distributions, offering closed formulas for Bayes estimators, the predictive density, and predictive expected values. This approach diverges from the regression-centric literature on the ZIP model.
Boucher et al. (2007) present models that generalize count distributions to incorporate time dependency, in which the dependence between contracts of the same insureds can be modeled with Bayesian and frequentist approaches based on generalizations of the Poisson and negative binomial distributions.
The structure of the paper is organized as follows: In Section 2, we present the derivations of the Maximum Likelihood Estimators (MLEs). Section 3 employs gamma and beta distributions as prior distributions for the parameters $\lambda$ and $p$, respectively. This section also elaborates on the derivation of the predictive density $f(z \mid y_1, \ldots, y_n)$, the conditional expectation $E(Z \mid y_1, \ldots, y_n)$, and an approximation for the percentile of the predictive distribution. Here, $Y_1, \ldots, Y_n$ signifies a random sample from a Zero-Inflated Poisson ZIP$(\lambda, p)$ distribution, and $z$ represents an observed value of the future observation $Z$. Section 4 summarizes the outcomes of the simulation studies and introduces a data-driven approach for selecting the hyper-parameter values of the prior distributions. Section 5 is devoted to the analysis of a real dataset, demonstrating the Bayesian inference methodology introduced in this work. Finally, Section 6 offers brief concluding remarks about the study. The Mathematica code utilized for the simulation studies and the computations performed on the real data is available upon request from the authors.
4. Simulation
Simulation studies were conducted to evaluate the accuracy of Bayesian and Maximum Likelihood estimators for the parameters $\lambda$ and $p$. The simulation studies proceeded as follows:
Step 1: Generate samples of the sizes reported in the tables, the largest being $n = 100$, from the Zero-Inflated Poisson distribution, using the selected "true" values of the parameters $\lambda$ and $p$ listed in the tables.
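As an illustration of Step 1, a minimal Mathematica sketch for drawing one such sample is given below; `zipSample` is our own name:

```mathematica
(* Draw a ZIP(lambda, p) sample of size n: with probability p emit a
   structural zero, otherwise draw from Poisson(lambda). *)
zipSample[n_Integer, lam_, p_] := Table[
  If[RandomReal[] < p, 0, RandomVariate[PoissonDistribution[lam]]],
  {n}]

data = zipSample[100, 2., 0.3];  (* one simulated sample *)
```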
Step 2: For each set of "true" parameter values of $\lambda$ and $p$, the MLEs are determined using the equations outlined in Section 2, namely
$$\frac{\hat{\lambda}}{1 - e^{-\hat{\lambda}}} = \frac{S}{n - n_0}, \qquad \hat{p} = \frac{n_0/n - e^{-\hat{\lambda}}}{1 - e^{-\hat{\lambda}}},$$
where $n_0$ is the number of zeros in the sample and $S$ is the sum of the non-zero values.
Step 3: Selecting optimal hyper-parameter values is crucial for obtaining accurate Bayes estimates.
Recall that the conjugate prior distribution for $\lambda$ is a gamma$(\alpha, \beta)$ distribution. Given that $E(\lambda) = \alpha/\beta$, we let $\beta = \alpha/\lambda$, or in other words, $E(\lambda) = \lambda$. This ensures that the prior distribution of $\lambda$ is centered at the selected "true" value for $\lambda$. By substituting this into the $\mathrm{Var}(\lambda)$ formula, we can deduce that, to achieve high accuracy for the Bayes estimate of $\lambda$, the hyper-parameters $\alpha$ and $\beta$ should be selected in such a way that they are consistent with the expected value and variance of $\lambda$. Since
$$\mathrm{Var}(\lambda) = \frac{\alpha}{\beta^2} = \frac{\lambda^2}{\alpha},$$
we therefore choose a large value for $\alpha$ to ensure that $\mathrm{Var}(\lambda)$ is small, and let $\beta = \alpha/\lambda$ to ensure that $E(\lambda) = \lambda$. It is noted that there is no unique pair of hyper-parameter values for $\alpha$ and $\beta$; any pair that meets the above criteria should suffice. However, simulation studies confirm that a larger value of $\alpha$ provides a more accurate Bayes estimate.
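A quick symbolic check of this choice (a sketch only; note that Mathematica's GammaDistribution is parameterized by shape and scale, so the rate $\beta = \alpha/\lambda$ enters as the scale $\lambda/\alpha$):

```mathematica
(* With rate beta = alpha/lam (scale lam/alpha), the gamma prior is
   centered at lam and its variance shrinks as alpha grows. *)
prior = GammaDistribution[alpha, lam/alpha];
Simplify[{Mean[prior], Variance[prior]}, Assumptions -> alpha > 0 && lam > 0]
(* -> {lam, lam^2/alpha} *)
```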
Recall that $p$ follows a beta$(a, b)$ distribution, which is also considered a conjugate prior. A similar rationale is used for selecting the hyper-parameters $a$ and $b$. We have
$$E(p) = \frac{a}{a+b}, \qquad \mathrm{Var}(p) = \frac{ab}{(a+b)^2(a+b+1)}.$$
By substituting $b = a(1-p)/p$ into $\mathrm{Var}(p)$ and after a few algebraic manipulations, we obtain
$$\mathrm{Var}(p) = \frac{p^2(1-p)}{a+p}.$$
Note that $\mathrm{Var}(p)$ is a decreasing function of $a$. Therefore, for a given value of $p$, to make $\mathrm{Var}(p)$ small, we choose a large value for $a$.
The goal is to select $a$ and $b$ such that the true selected value of $p$ in the simulation studies is closely approximated by its expected value. That is, $E(p) = a/(a+b) = p$, or $b = a(1-p)/p$, together with a large value for $a$ to minimize $\mathrm{Var}(p)$. Although multiple pairs of hyper-parameter values can meet these conditions, larger values of $a$ have been shown to yield more accurate Bayes estimates of $p$.
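Putting the two choices together, a small helper (our own sketch; in the data-driven variant of this section, the true values would be replaced by the MLEs) computes all four hyper-parameters from target values of $\lambda$ and $p$ and the chosen concentrations $\alpha$ and $a$:

```mathematica
(* Hyper-parameters for the gamma(alpha, beta) and beta(a, b) priors:
   beta = alpha/lam centers the gamma prior at lam, b = a (1 - p)/p
   centers the beta prior at p; larger alpha and a shrink both variances. *)
chooseHypers[lam_, p_, alpha_, a_] :=
 <|"alpha" -> alpha, "beta" -> alpha/lam, "a" -> a, "b" -> a (1 - p)/p|>

chooseHypers[2., 0.3, 1000, 500]
(* -> <|"alpha" -> 1000, "beta" -> 500., "a" -> 500, "b" -> 1166.67|> *)
```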
The smaller the ASE, the more accurate the estimate. Note that the ASE of each parameter is computed separately, based on 1000 simulated samples.
In the simulation studies, it is important to note that $n$ and the "true" values of $\lambda$ and $p$ should be selected so that the nonlinear Equation (9) has a solution. Recall from Section 2 the equations
$$\frac{\hat{\lambda}}{1 - e^{-\hat{\lambda}}} = \bar{y}_{+}, \qquad \hat{p} = \frac{n_0/n - e^{-\hat{\lambda}}}{1 - e^{-\hat{\lambda}}}$$
for finding the MLEs of $\lambda$ and $p$. The first equation is nonlinear in $\lambda$ and can be written as
$$\lambda = \bar{y}_{+}\left(1 - e^{-\lambda}\right),$$
where $\bar{y}_{+} = S/(n - n_0)$ is the sample mean of the non-zero values in the data, $S$ is the sum of the non-zero values, and $n_0$ is the number of zeros. Mathematica can be used to find its unique positive solution.
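A minimal sketch of this root-finding step, reusing `zipSample` from Step 1 (`zipMLE` is our own name; the equations are those displayed above):

```mathematica
(* MLEs for ZIP data: solve the nonlinear equation for lambda with
   FindRoot, then recover p from the observed zero frequency. *)
zipMLE[data_List] := Module[{n, n0, s, ybarPlus, lam, lamHat, pHat},
  n = Length[data];
  n0 = Count[data, 0];
  s = Total[data];
  ybarPlus = N[s/(n - n0)];   (* sample mean of the non-zero values *)
  lamHat = lam /. FindRoot[lam == ybarPlus (1 - Exp[-lam]), {lam, ybarPlus}];
  pHat = (n0/n - Exp[-lamHat])/(1 - Exp[-lamHat]);
  {lamHat, pHat}]

zipMLE[zipSample[100, 2., 0.3]]  (* e.g., {1.95, 0.29}; varies by sample *)
```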
Recall that the expected value of $Y$ is given by
$$E(Y) = (1 - p)\lambda.$$
By equating the first population moment of ZIP$(\lambda, p)$ to the first sample moment, we obtain $(1 - p)\lambda = S/n$, which reduces to
$$p = 1 - \frac{S}{n\lambda}.$$
The above relationship between $n$, $\lambda$, and $p$ provides guidance for selecting the "true" values in the simulation studies. For small values of $n$ and $\lambda$, a large value for $p$ must be selected: when $n\lambda$ is small, we expect a substantial number of zeros in the data, so $p$, which governs the percentage of zeros, must be large. The value of $p$ also depends on $S$, the sum of the non-zero values in the data. For fixed $n$ and $\lambda$, a value $S < n\lambda$ (reasonable, since the zeros keep $S$ small) yields an admissible $p$ in $(0, 1)$, whereas $S > n\lambda$ (not expected) gives $p < 0$, which is not an accepted value.
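A numeric illustration of this admissibility constraint (the input values below are our own hypothetical choices, not those used in the paper's tables):

```mathematica
(* Moment-based p = 1 - S/(n lam): only S <= n lam yields a valid
   probability. The numbers below are hypothetical illustrations. *)
pFromMoments[s_, n_, lam_] := 1 - s/(n lam)

pFromMoments[8, 20, 0.5]   (* -> 0.2, admissible *)
pFromMoments[15, 20, 0.5]  (* -> -0.5, not an accepted value *)
```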
Simulation studies have confirmed that when $n$ and $\lambda$ are small, but the selected value of $p$ is also small, Equation (9) fails to have a solution.
Furthermore, the mean and the square root of the Average Squared Error ($\sqrt{\mathrm{ASE}}$) for each parameter are detailed in Table 1, Table 2, Table 3 and Table 4. For instance, $\sqrt{\mathrm{ASE}}$ for the MLE of $\lambda$ is defined as
$$\sqrt{\mathrm{ASE}} = \sqrt{\frac{1}{1000}\sum_{i=1}^{1000}\left(\hat{\lambda}_i - \lambda\right)^2},$$
where $\hat{\lambda}_i$ is the MLE of $\lambda$ computed from the $i$-th simulated sample.
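In code, the same quantity is a one-liner (reusing `zipMLE` and `zipSample` from the sketches above):

```mathematica
(* Square root of the Average Squared Error of a list of estimates
   against the true parameter value. *)
sqrtASE[estimates_List, true_] := Sqrt[Mean[(estimates - true)^2]]

(* sqrt-ASE of the MLE of lambda over 1000 simulated samples *)
sqrtASE[First /@ Table[zipMLE[zipSample[100, 2., 0.3]], {1000}], 2.]
```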
As previously discussed, there are multiple options for selecting hyper-parameter values. In Table 1, Table 2, Table 3 and Table 4, two sets of hyper-parameter values are utilized, following the proposed method for their selection. These tables demonstrate that as the sample size increases, the Average Squared Error (ASE) for the Maximum Likelihood Estimator (MLE) decreases, as anticipated. Notably, for smaller sample sizes, the Bayes estimator exhibits superior accuracy compared to the MLE. Furthermore, for larger values of $a$ and $\alpha$, the accuracies of the Bayes estimators for $p$ and $\lambda$ are enhanced. For instance, in Table 1, two sets of hyper-parameters are selected for one pair of true values of $\lambda$ and $p$. Set 1, with the smaller hyper-parameter values, yields a $\sqrt{\mathrm{ASE}}$ of 0.02014 for $\lambda$ and 0.00475 for $p$, while Set 2, with the larger values, yields 0.00504 for $\lambda$ and 0.00134 for $p$ at the same sample size. It is important to note that the Bayes estimators' formulas (8) include $n$ in their denominators. Thus, for a larger sample size $n$, the Average Squared Error for both Bayes estimators somewhat increases, yet they still significantly surpass the MLE in accuracy.
A sensitivity analysis, https://www.investopedia.com/terms/s/sensitivityanalysis.asp (accessed on 15 April 2024), tests how independent variables, under a set of assumptions, influence the outcome of a dependent variable. This section theoretically demonstrates that the accuracy of the Bayesian estimates, as measured by $\sqrt{\mathrm{ASE}}$, improves with larger values of $\alpha$ and $a$.
The simulation results, presented in Table 1, Table 2, Table 3 and Table 4, align with our theoretical assertions, utilizing two distinct sets of hyper-parameters. To corroborate these findings further, a sensitivity test was executed for one specific set of "true" values of $\lambda$, $p$, and $n$. Under these stipulations, we considered a group of small hyper-parameter values ($\alpha$ increasing from 271 to 273 and $a$ from 211 to 213) and a group of large hyper-parameter values ($\alpha$ from 1146 to 1148 and $a$ from 796 to 798), in each case generating 1000 samples of size $n$ from the ZIP model with the same "true" values of $\lambda$ and $p$. The outcomes of the sensitivity test are summarized in Table 5.
The table reveals a substantial decrease in $\sqrt{\mathrm{ASE}}$ for the estimates of both $\lambda$ and $p$ when transitioning from smaller to larger hyper-parameter values. For example, moving from the small values ($\alpha = 271$, $a = 211$) to the large values ($\alpha = 1146$, $a = 796$), $\sqrt{\mathrm{ASE}}$ changes from 0.04149 to 0.01198 for $\lambda$ and from 0.01414 to 0.00470 for $p$. This trend is also evident within each group. For the smaller hyper-parameter group, the $\sqrt{\mathrm{ASE}}$ for $\lambda$ diminishes from 0.04149 to 0.04058, and the $\sqrt{\mathrm{ASE}}$ for $p$ from 0.01414 to 0.01377, as the hyper-parameter values incrementally increase by approximately 0.7% for $\alpha$ on each step (from 271 to 273) and 0.9% for $a$ on each step (from 211 to 213). A similar pattern is observed for the larger group, even though the increments for $\alpha$ and $a$ are only about 0.17% on each step ($\alpha$ from 1146 to 1148) and 0.25% on each step ($a$ from 796 to 798), respectively. This sensitivity analysis validates that the precision of Bayesian estimates is contingent upon the choice of hyper-parameters, with larger $\alpha$ and $a$ values enhancing estimate accuracy in terms of $\sqrt{\mathrm{ASE}}$.
As shown in Table 5, a gradual increase in the hyper-parameters produces a corresponding steady decrease in $\sqrt{\mathrm{ASE}}$. For the practical application of the proposed method, it is advisable to establish a stopping rule for the increments in the hyper-parameters. Specifically, the increase in hyper-parameters should continue until no significant reduction in $\sqrt{\mathrm{ASE}}$ is observed; a sketch of such a rule follows.
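The stopping rule might be implemented as follows. This is only a sketch: `sqrtASEpair` is a hypothetical user-supplied function returning the pair of $\sqrt{\mathrm{ASE}}$ values for given hyper-parameters (for instance, computed from formulas (8) over 1000 simulated samples), and the step size of 2 mirrors the increments used in Table 5:

```mathematica
(* Increase alpha and a in fixed steps until neither sqrt-ASE improves
   by more than tol. sqrtASEpair[alpha, a] is a hypothetical
   user-supplied function returning {sqrt-ASE for lambda, sqrt-ASE for p}. *)
stopWhenFlat[sqrtASEpair_, alpha0_, a0_, step_ : 2, tol_ : 10^-4] :=
 Module[{alpha = alpha0, a = a0, prev, curr},
  prev = sqrtASEpair[alpha, a];
  curr = sqrtASEpair[alpha + step, a + step];
  While[Max[prev - curr] > tol,
   alpha += step; a += step;
   prev = curr;
   curr = sqrtASEpair[alpha + step, a + step]];
  {alpha, a}  (* last hyper-parameter pair with a significant gain *)
  ]
```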
6. Conclusions
Many insurance claims datasets exhibit a high frequency of no claims. To analyze such data, researchers have proposed the use of Zero-Inflated models. These models incorporate parameters with covariates using link functions and are referred to as Zero-Inflated Regression Models. Various methods, including Maximum Likelihood (ML), Bayesian, and Decision Tree approaches, among others, have been employed to fit the data. A significant distinction of this research from prior studies is the introduction of a novel Bayesian approach for the Zero-Inflated Poisson model without covariates. This study develops the statistical ZIP model by estimating the unknown parameters $\lambda$ and $p$. To our knowledge, similar research has not been documented in the literature. We derive analytical solutions for the Maximum Likelihood estimators of the unknown parameters $\lambda$ and $p$. Additionally, we present analytical closed-form solutions for the Bayes estimators of these parameters by selecting conjugate prior distributions: gamma for $\lambda$ and beta for $p$, respectively. The comparison between the ML and Bayesian methods indicates that the Bayesian method, utilizing a data-driven approach (which employs the MLEs of the parameters $p$ and $\lambda$ to select hyper-parameter values), surpasses the ML method in accuracy. We derived the predictive distribution based on the posterior distribution, predicting possible future observations from past observed values and calculating percentiles. Furthermore, we demonstrate that larger values of the hyper-parameters $\alpha$ and $a$ enhance the accuracy of the Bayesian estimates; these findings are confirmed through a sensitivity test. The real-life data from the synthetic auto telematics dataset and simulated data from a specified Zero-Inflated Poisson model, analyzed using the methods proposed in this paper, validated the goodness-of-fit (GOF) to the ZIP model based on both Chi-square and Score tests. However, the proposed approach has limitations: (1) the data must adequately fit the Zero-Inflated Poisson model in real applications; (2) the sample size must be sufficiently large so that the MLEs of the parameters $p$ and $\lambda$ are accurate; and (3) the selection of hyper-parameter values must align with the MLEs, as elaborated in Section 4 (Step 3). Future research could explore non-traditional discrete Time Series modeling for Zero-Inflated data to forecast the number of claims at specific future points and extend the Bayesian analysis to the Zero-Inflated Negative Binomial (ZINB) model without covariates. Furthermore, in this article the parameters $\lambda$ and $p$ are assumed to be independent. A future line of improvement would be to introduce a copula that models the dependence between the parameters, as a small value of $\lambda$ corresponds to a large value of $p$.