1. Introduction
We analyse a bivariate model based on the Sarmanov distribution with marginal Beta distributions. These marginals are specified based on a generalized linear model (Beta-GLM) or Beta regression as defined by Ferrari and Cribari-Neto [
1]. The objective is to fit data defined in the
interval.
Many authors have analysed bivariate Beta distributions (see, for example, [
2,
3,
4,
5]). However, these distributions pose several difficult challenges: their generalization to higher dimensions and their specification as a generalized linear model are not straightforward. The Sarmanov distribution provides a way to address these challenges.
Originally, the Sarmanov distribution in its bivariate form was introduced by Sarmanov [
6], its multivariate version was suggested by Lee [
7] and was generalized by Bairamov et al. [
8]. Its use to model the bivariate behaviour of random variables with a marginal
distribution was proposed by Gupta and Wong [
3]. These authors defined the five parameter bivariate Beta distribution from what is known as Morgenstern’s distribution [
9] with marginal Beta, which is a particular case of the Sarmanov distribution.
The bivariate Sarmanov distribution is characterized by its flexibility in the marginal distributions and, furthermore, given that its functional form establishes that the marginals are clearly separated from the dependency model, the specification in terms of a bivariate generalized linear model turns out to be natural. Generalizing from two dimensions to higher dimensions is simple—(see [
10] for an example of a trivariate Sarmanov distribution specified as a generalized linear model with Negative Binomial marginals).
In this work, we show an application of the bivariate Sarmanov distribution with Beta marginals generalised linear model to predict two of the most relevant telematics variables in motor insurance [
11]. Telematics variables are obtained from GPS/inertial devices installed in vehicles and they provide an abundant source of information to motor insurers. In our case study, a bivariate model is specified, for the proportion of kilometres driven above the posted speed limit and the proportion of kilometres driven at night. These two variables seem to be related, but researchers have not yet been able to find a good way to understand their association. The explanatory variables are the characteristics of the insured policyholder and the vehicle. The database used in our application has already been analysed in various works published in statistical, transport and risk analysis journals (see [
11,
12,
13,
14,
15,
16,
17]). In all previous studies, the two telematics variables that we analyse here were used as predictors of the accident rate, and they were assumed to be uncorrelated.
In
Section 2, the new bivariate Sarmanov model is specified and the particular case with marginal Beta-GLM with a domain in the
interval is analysed; the estimation method is also discussed. The results of our case study are shown in
Section 3. Finally,
Section 4 contains the conclusions.
2. The Models
Let
be a bivariate random vector that, for convenience, is defined in
. Its distribution depends on a set of
k quantitative or binary covariates, whose values are represented by the vector
,
, where
is a constant term. The relationship between
and the covariates is given by the linear predictor
, where
,
, are vectors of parameters to be estimated. To simplify the notation, the covariates are assumed to be the same for
and
, and so the vector of explanatory variables is denoted as
x. The bivariate probability density function (pdf) associated with the Sarmanov distribution is:
where
is the dependence parameter and
,
, are bounded kernel functions. For the function defined in (
1) to be a pdf, the following conditions must hold:
and
For given values of
,
, we define:
Taking into account the condition defined in (
3), bounds can be defined for the dependency parameter
. However, as this parameter does not depend on the linear predictor, new extreme values are defined as:
and
, so that the bounds of the dependency parameter are:
The previous condition holds for every vector of covariates
x, which implies that the dependency parameter must be located within the narrowest bounds. In practice, we will assume that the vectors observed in the sample dataset lead to the entire domain of values of linear predictors
,
. In the insurance context, where we will discuss our illustration, we assume that all possible risk profiles that can be insured by the company are already present in the portfolio.
For each vector of covariate observations
x, we can also obtain the covariance between the dependent variables as:
where
. The correlation is obtained by dividing by the product of standard deviations.
There exist many possible specifications for the kernel functions
,
(see [
18] [for a description of kernel functions proposed in the literature). When fitting the bivariate Beta distribution without covariates, Gupta and Wong [
3] propose a kernel function such as
, where
is the cumulative distribution function (cdf). This specification has the advantage that the bounds for the dependency parameter are given by
for any vector
x. However, the previous model does not allow obtaining closed expressions for some magnitudes of interest, such as the conditioned moments. In this work, we propose to use kernels
, where
r is a value to be determined by the analyst. Next, some results obtained for the particular case of the Sarmanov distribution with marginal
distribution with
are analyzed. These cases intuitively correspond to a situation of linear dependency, controlled by the dependence parameter
.
2.1. The Bivariate Beta GLM Model
The pdf of a random variable
Y with
distribution, with parameters
, is:
and its cdf is:
where
and
are the Gamma and Beta functions, respectively, and
is the incomplete Beta function.
The Beta regression was proposed by Ferrari and Cribari-Neto [
1], with the reparametrization
and
, so that:
where
, with
, and
, with
, where
is the scale parameter. We note that, given the values of
and
, it holds that
. The specification as GLM is defined as (note that we use
to emphasize that
depends on the linear predictor):
where
is a link function that can be defined in different ways, in this work, we use the logit link,
.
To simplify the notation from now on, we eliminate the linear predictors in the conditioned part. The pdf associated with the bivariate random vector
with a Sarmanov distribution and Beta GLM marginals that will be called the Sarmanov-Beta-GLM is ():
For the previous expression to be a pdf, the dependency parameter must be located within the bounds defined in (
4), which, for the kernel functions that we propose, are:
The bivariate cdf associated with a Sarmanov-Beta-GLM is obtained directly from the double integral of the bivariate pdf defined in (
6):
where
.
Proposition 1. The conditioned pdf is:and similarly for . Integrating the previous expression, the conditional cdf is obtained as Proof. The conditioned pdf is obtained directly as
Integrating the result of
in (
9), we obtain:
In addition, since
then, by substituting the previous expression in (
11), then (
10) follows directly. □
The conditioned quantile is obtained from the inverse of expression (
10), for which a numerical method (such as Newton’s method) can be used.
Proposition 2. The conditional expectation is:where is the variance, which also depends on the vector of covariates. Similarly, can be found. Proof. The conditional expectation is obtained directly by solving the integral:
Likewise, the corresponding result is obtained for . □
Proposition 3. From (5), the conditional covariance which depends on the vector of covariates x is:and the correlation is: Proof. Note that the covariance and the correlation are calculated directly if, in expression (
5), we see that:
□
The dependence parameter of the model proposed in Gupta and Wong [
3], which uses kernel functions
, is located in the interval
and is the same for all
x. Our proposal bounds the dependence parameter to the narrowest interval among those obtained from all
x. However, the advantage of our proposal is that our model allows for obtaining closed expressions for some magnitudes of interest such as bivariate moments (covariance) and conditional moments. In the numerical analysis section, we also compare the correlations estimated from our model and that of Gupta and Wong [
3].
2.2. Estimation
In practice, we start from a bivariate sample of
n observations. Let us denote the sample information as
,
, where for each
i we know the values of the covariates
. Our objective is to estimate the parameter vectors
, the scale parameters,
and
, and the dependency parameter
, from the maximization of the logarithm of the likelihood function associated with the Sarmanov distribution:
The maximization of (
15) cannot be carried out directly without considering that the parametric space is restricted and, in addition, as it was shown in expression (
4), the bounds of the dependence parameter are closely related to the parameters of the marginals. Thus, in the maximization process, infeasible solutions will often be reached unless a careful numerical procedure is specifically designed. One way to address these difficulties is to rely on the IFM (Inference from Margin) method that has been widely used in the estimation of copulas see [
19] [for a review]. For the estimation of the Sarmanov distribution, the IFM was already used by Bolancé and Vernic [
10] for the case of GLM marginals with Negative Binomial distributions.
The IFM method is implemented as follows:
Inicialization. The parameters for the marginals are estimated as:
For the initial estimation, function betareg() of betareg R package is used. With these parameters of the marginals, we start the iterative process in the two steps described below.
Step 1. Given the estimated marginal parameters in iteration
and taking into account the limits of the dependence parameter
defined in (
4), with function
optim() and the
L-BFGS-B method using
R, we estimate
from the maximization of the likelihood function given fixed values of the marginal parameters, which is:
where
is the likelihood as a function of
given the estimated parameters for the marginals in iteration
.
Step 2. Given the estimated dependency parameter
in step 1, the marginal parameters are re-estimated in iteration
m as:
where
is the likelihood as a function of the marginal parameters given the dependence parameter estimated in step 1. The above maximization is also performed with function
optim() and the
L-BFGS-B method of
R.
Steps 1 and 2 described above are repeated until reaching the convergence criterion based on the differences between parameter estimates obtained in two consecutive iterations.
Remark 1. In the initialization process, if the dependent variables contain zeros or ones, the following correction was proposed by Smithson and Verkuilen [20]. In practice, the algorithm described above is based on the optimization of conditional likelihood functions and not on the likelihood function defined in (
15). However, in the last stage, the parameters estimated with the IFM method can be used as initial parameters in the process of maximizing the full likelihood function defined in (
15). For this purpose, function
optim() and method
L-BFGS-B of
R are used again.
Remark 2. To estimate the Sarmanov model proposed by Gupta and Wong [3], it is not necessary to use the two-step process, since the bounds of the dependence parameter do not depend on the parameters of the marginal distributions. 3. Numerical Analysis
We analyse a database corresponding to a car insurance portfolio, in which part of the variables have been measured via a telematic system. The objective of our analysis is to model the joint behaviour of the percentage of kilometres driven above the posted speed limits (
) and percentage of kilometres driven at night (
). It is well known that both variables are related to the risk of having an accident. In
Table 1, we show the main descriptive statistics of the dependent variables and the covariates used in the modelling process. For the estimation of the Sarmanov-Beta-GLM, the dependent variables have been transformed as indicated in Remark 1 in
Section 2.2. Furthermore, to avoid very low coefficient values due to the scale of some covariates, variables age (
), age of driving license (
) and age of the vehicle (
) have been divided by 10; the vehicle power variable (
) is divided by 100 and the total annual distance driven in kilometres (
) is divided by 1000. In addition, note that, in this study, we have included a variable denoting the driver’s gender (
) and an indicator of private garage (
) as covariates.
The last row of
Table 1 shows the Pearson correlation between the two dependent variables. This correlation is compared with the corresponding parameter estimate obtained from the Sarmanov model with marginal Beta proposed here and with the one proposed by Gupta and Wong [
3], from now on the GW model. With this objective,
Table 2 shows the dependence parameters estimated with both models, and the AIC and BIC statistics without including the covariates and including them. Using expression (
14) and without covariates, from the dependence parameters
, it can be deduced that the estimated correlation is
, which is within the confidence interval of the Pearson correlation as shown in the last row of
Table 1. On the contrary, if we use the five parameter Beta distribution, the (residual) correlation that is obtained from the numerical calculation of expression (
5) is practically zero. This means that the association is captured by the bivariate model. Comparing both models, with and without covariates, using the AIC and BIC statistics, the results of
Table 2 show that the fit is better for the model proposed here than it is for the GW model.
Table 3 shows the results of our Sarmanov-Beta-GLM using different vectors of covariates. Model I includes all the explanatory variables, among which we have the age (
), the age of the driving license (
) and the total distance driven annually (
), these three variables are associated with driving experience. To analyze the robustness of the results, in Model II, age (
) is eliminated, and, in addition, in Model III, the age of a driver’s license (
) is also eliminated. The results of Model I show that the effect of age is negative on both
and
that the effect of the driver’s license age is positive on
and negative on
and the effect of total distance,
, is positive on both dependent variables. By eliminating age (
) in Model II, the signs of the parameters associated with
and
are maintained, although the value is smaller in the case of
and remains practically the same for
. After eliminating variables
and
, we see that the effect of the total annual distance driven remains practically the same. If we observe the effects of the rest of covariates, these are practically the same in models I, II, and III. A man driver (
) with a powerful vehicle (
) would have larger
and
than the rest, all other characteristics being the same. However, using parking at night (
) has a positive effect on the percentage of speeding distance (
) and a negative effect on the percentage of night-time driving (
); the opposite happens with the age of the vehicle (
). The effect of
indicates that, when the vehicle is older, drivers tends to diminish the percent of speed driving, while night-time driving is larger.
To visualize the dependence between and in different quantiles, the following three examples of insured drivers are graphically analysed:
Profile 1 corresponds to a 27-year-old man, who drives about 7000 kilometres per year, with a 7-year-old driving license, with parking, with a vehicle of about 8 years and 100 HP.
Profile 2 corresponds to a 20-year-old man, who drives about 4000 kilometres per year, with a 2-year-old driving license, with parking, with a vehicle of about 2 years and 75 HP.
Profile 3 corresponds to a 36-year-old man, who drives about 10,000 kilometres per year, with a 15-year-old driving license, without parking, with a vehicle of about 15 years and 200 HP.
Profile 1 represents the average insured individual of the portfolio; Profile 2 is a younger man driver, less experienced than Profile 1 and with a newer and less powerful vehicle; finally, Profile 3 is an older man driver, more experienced than Profile 1 and an older and more powerful vehicle.
Figure 1 represents different quantiles of the variable kilometres driven above the speed limit (
) in the
y-axis given the values of the percentage of kilometres driven at night (
) for Profile 1 in the
x-axis. Quantiles have been obtained from the expression (
10). Note that, if the dependence parameter was zero, all the curves would remain constant. The adjusted dependence structure results in the represented conditional quantiles having a negative nonlinear relationship and, furthermore, the curves for the different quantile levels are non-parallel.
Figure 1 indicates that, for Profile 1, the higher the percentage of kilometres driven at night (
), the greater the caution in driving and, therefore, the lower the percentage of distance driven above the speed limits (
). The same quantiles at
(plot on the left) and
(plot on the right) confidence levels are represented in
Figure 2. These plots show that the curves are non-parallel and that Profile 3 is the most risky, followed by Profiles 1 and 2.