1. Introduction
Data whose response falls in the interval such as indices, proportions, or rates appear very frequently in different fields of knowledge, mainly the areas of social sciences, engineering, economic sciences, and medicine. Some practical examples of these types of data are the proportion of patients who die from a certain disease or virus (SARS-CoV-2, Diabetes, HIV or Cancer) in a country or city; the Human Development Index or the illiteracy rate in a region or country; the proportion of deaths due to exposure to smoking or other exposure factors; the mortality rate from traffic accidents in a city; the percentage of items that do not meet the minimum requirements in an assembly line; and the portion of income that a family spends on entertainment.
For the analysis of data such as those described above, statistical methodologies developed from distributions with support in the interval
are required. In this sense, several probability distributions and regression models have been proposed; see Ferrari and Cribari-Neto [
1], Kumaraswamy [
2], Martínez-Flórez et al. [
3,
4], Kieschnick and Mccullough [
5], Mazucheli et al. [
6,
7].
The univariate sinh-normal (SHN) distribution introduced by Rieck and Nedelman [
8] has received special attention for modeling material-fatigue-related problems. The probability density function (pdf) of this distribution is given by:
where
and
are shape and location parameters, respectively;
is a scale parameter, and
is the pdf of the normal distribution. The pdf in (
1) can also be written as:
where
is a derivative of
The distribution function in (
2) is denoted by
Several extensions of the SHN distribution have been studied by numerous authors, for example, Martínez-Flórez et al. [
9] proposed the extended generalized SHN distribution, which has great applicability in the fit of datasets presenting high skewness and bimodality simultaneously; Lemonte [
10] introduced an extension named skewed log-Birnbaum–Saunders (log-BS) regression model which is based on the asymmetric SHN distribution proposed by Leiva et al. [
11]. The log-BS regression model is suitable for fitting data with high degrees of skewness. On the other hand, Moreno et al. [
12] proposed a generalization of the Birnbaum–Saunders (BS) distribution [
13] that affords flexibility for fitting data with greater skewness and kurtosis compared with other distributions.
Multivariate extensions of the SHN distribution have also been considered, for instance, by Martínez-Flórez et al. [
14], Díaz-García and Domínguez-Molina [
15], Lemonte [
16], Marchant et al. [
17] and, recently, by Martínez-Flórez et al. [
18], among others.
For modeling material fatigue data, the most widely known distribution is the BS whose pdf is given by:
where
,
is a shape parameter, and
is a scale parameter and the median distribution. The distribution in (
3) is denoted by
. Rieck and Nedelman [
8] showed that, if
, then
follows a BS distribution with parameters
and
. From this last relationship between the BS and SHN distributions, the SHN regression model can be formulated as follows: if
is a vector of covariates such that,
for
, and,
then, the model in (
4) is known as the log-linear BS regression model. More details about this regression model can be found in Rieck and Nedelman [
8].
The BS distribution has great applicability to analyze data in several areas of knowledge, such as biology, medicine, engineering, etc.; however, so far, no extension of the BS distribution has been proposed to study the modeling of rates and proportions, i.e., of a random variable in the unit interval
from the BS model. In response to this special case, Mazucheli et al. [
19] presented an extension of the BS distribution for fitting random variables in the unit interval
. The pdf of this model is given by:
where
,
is the shape parameter and
is a scale parameter. Based on the work of Mazucheli et al. [
19], Martínez-Flórez et al. [
3] studied the unit sinh-normal (USHN) distribution to deal with the problem of bounded observations on the interval
which has a pdf given by:
where
,
is a shape parameter,
is a location parameter, and
is a scale parameter. The natural extension to the case of the model which considers covariates is the USHN linear regression (USHNR) model. The USHNR model is defined by considering a set of
p explanatory variables that are denoted by
and, such that,
where
is a
p dimensional vector of unknown parameters and
. More details can be found in [
3].
The main objective of this work is to introduce a new multivariate probability distribution capable of modeling data in the region . The new distribution is obtained from the extension of the univariate skewed unit-sinh-normal distribution and to do so, we rely on the conditionally specified distributions methodology introduced by Arnold. In addition, from the new distribution, we propose the multivariate unit-sinh-normal skewed regression model, which allows modeling data in the region through linear predictors. The new proposals are useful for the analysis of data on proportions, rates, or indices that arise in different fields of knowledge, such as those described at the beginning of this section. The results of a simulation presented in this work also show that these methodologies are viable alternatives to those existing in the current statistical literature.
This paper is organized as follows. In
Section 2, the multivariate skew-normal distribution is revised, and its main properties are commented on. In
Section 3, the new multivariate skewed unit sinh-normal distribution is introduced. Some properties are also derived, and the value of the coefficient correlation for the bivariate case is presented for some selected values of the parameter distribution.
Section 4 presents the extension of the USHN to the case of the multivariate regression model and its respective statistical inference. Finally, two applications with real data to illustrate the applicability of the proposed methodologies and a small simulation study are presented in
Section 5.
2. Multivariate Skew-Normal Distribution
The multivariate skew-normal (SN) distribution was studied by Arnold et al. [
20] by using the theory of conditionally specified distributions; see [
21]. The construction of the multivariate SN distribution is as follows: for each
define the vector
to be the
dimensional random vector obtained from
by deleting
In parallel, for a real vector
,
is obtained by deleting the
jth coordinate
of
. Now, suppose that, for each
the conditional distribution of
given
is a SN distribution with a parameter which is a function of
Thus, it is assumed for each
j that
The joint pdf of
is given by
In the distribution (
8), the marginal densities follow a standard normal distribution, that is, for
, and
the conditional distribution follows a SN distribution, (see Azzalini [
22]) of parameter
, with the pdf given by:
From this distribution, Lemonte et al. [
23] presented the multivariate Birnbaum–Saunders distribution, whose joint pdf is given by:
where
for
, with
and
being the shape and scale parameters, respectively. The distribution (
9) is denoted by
Another extension based on the multivariate SN model of Arnold et al. [
20] is the multivariate asymmetric SHN distribution, studied by Martínez-Flórez et al. [
24], whose joint pdf is given by:
where
, and
for
is a derivative of
with respect to
;
and
are shape and scale parameters, respectively, and
are location and asymmetry parameters, respectively. The distribution in (
11) is denoted by
with
and
Although MVBS and MVSHN distributions, which are defined in
and
respectively, can be used to fit sets of random variables whose domain is the unit interval
, these are not appropriate given the support of these distributions and the support of a bounded random variable vector. In the statistical literature, there are few distributions studied to fit sets of variables in the unit interval, that is, whose domain of definition is
, which can be useful to fit rates and proportions. The interest for these type of distributions has been very little; we highlight the works of Cepeda et al. [
25], Souza and Moura [
26] and Lemonte and Moreno-Arenas [
27], among others.
3. Multivariate Skewed Unit-Sinh-Normal Distribution
Following the idea of Arnold et al. [
20], Lemonte et al. [
23] and Martínez-Flórez et al. [
24], in this section, a multivariate extension of the SHN distribution to fit vectors of rates and proportions is proposed, which is named multivariate skewed USHN distribution (MVSUSHN). The construction of the MVSUSHN is as follows: for
, let
where
for
Then, the joint pdf of the vector with MVSUSHN distribution is given by,
where
and
for
is the derivative of
with respect to
;
and
are shape and scale parameters, and
and
are location and asymmetry parameters, respectively. The MVSUSHN is denoted as
with
and
. For
, the case of independence is obtained, that is, the product of the pdf of USHN random variables studied by Martínez-Flórez et al. [
3]. It follows that the parameter
is directly associated with the correlation parameter. The
Figure 1 shows the contours of the bivariate skewed USHN (BVSUSHN) distribution for some selected values of the parameters, while the
Figure 2 presents the shape of the density function for particular values of the parameter of the the BVSUSHN distribution.
The following theorem provides the marginal and conditional distributions of the MVSUSHN distribution.
Theorem 1. If then,
- (1)
for
- (2)
The conditional pdf of is given by - (3)
The cumulative distribution function (cdf) of is given by where is the Owen function; see [28].
Proof. - (1)
For
and applying the integral over all subindex
k (given by
), other than
j, we obtain,
Now, using the transformation
for all
where the second last result follows from Arnold et al. [
20].
- (2)
Let
then, with the transformation
, it is found that
and
and, by the Transformation Theorem, it follows:
- (3)
It has that
through the transformation
, it follows that
where the last equality follows the properties of the cdf of the SN distribution, which is widely known in the literature.
□
Martínez-Flórez et al. [
3] showed that, when
, the random variable
converges to a standard normal distribution. From here, if
with
, then
Furthermore, if
for
, then
If
then,
for all
, then,
(see [
3]). It can be seen that, if
, the multivariate standard USHN distribution follows, which is denoted by
In this case, the marginals are standard USHN, and the variables
for
follow the SHN distribution of Rieck and Nedelman [
8].
In order to study the unimodal distribution, let
, and suppose that
, and
. By differentiating the logarithm of the conditional distributions and equaling to zero, it has
the following equations are obtained:
Multiplying the equation in (
15) by
and the equation in (
16) by
, and subtracting these two results, it follows that
Note that by letting
and substituting in (
17), it follows:
Therefore,
is a trivial solution of the Equation (
17). Then, by replacing
in (
15), it has
from which results the function
Then, by applying the transformation , the bivariate distribution with conditional SN distributions is obtained.
The bivariate distribution with conditional asymmetric USHN is a one-to-one transformation. The
values for which the SBVUSHN distribution is unimodal are the same as for which the bivariate SN distribution is unimodal and, according to Arnold et al. [
20], the bivariate SN distribution is unimodal for
. One can note that the equation
is similar to the equation found by Arnold et al. [
20] for the BVSN distribution. The modes of the BVSUSHN distribution can be obtained by solving the equation
and
Moments and Correlation
The covariance for the random variables
and
is given by:
where the moment product
for two random variables is given by
and
with
,
, and
being the third-order function of Bessel defined by
A proof of the result in (
18) can be seen in the
Appendix A. Using (
18) the variances:
are obtained for
; thus, the correlation coefficient is obtained from:
To compute
, it is necessary to use numerical methods to determine the simple moments and the product moments. It can be shown that for this distribution
, whereby
. To study the range of the correlation coefficient,
was evaluated for some parameter values. The values taken for the parameters:
and
,
,
, and
varying. The
Table 1 shows the values for the parameters
,
, and the values of
for a pair of variables
and
The values were obtained for the case of
and the case of
results for symmetry, given that
As can be seen, the case of independence occurs for
, i.e.,
, then according to the
Table 1, for the values
, which leads us to the conclusion that for this model
.
4. Multivariate Skewed USHN Regression Model
This section presents an extension of the USHN regression model for the case of multiple bounded response variables (rates and proportions). Suppose that we have
q variables measuring rates or proportions in a sample of size
i.e., for
, we have the vector of dimension
Assume also that there are
p explanatory variables
where, for
,
and there is a matrix
associated to the
ith observed response
with
for
, a
p dimensional vector of values of the explanatory variables. For the vector of response variables, we use the operator
, which transforms matrices into a column vector from the columns of the matrix. Therefore,
Thus, the MVSUSHN regression model is given by
where
being
is a vector of unknown parameters of dimension
p, and the vectors
for
are vectors of independent and identically distributed random variables such that
where
and
, a vector of zeros of dimension
It follows that,
From this result, we have by the theorem shown above that
for
i.e., each marginal follows a USHN regression model. Thus, defining
, where
for
is a matrix of size
and
of dimension
, then the MVSUSHN can be represented by
where
,
,
taking values
with
with
a vector of dimension
, i.e.,
is a vector of dimension
being
is an error vector with
where
with
it follows that
for
Statistical Inference
For
and
, define:
and
; with
and
where
and
for
. The log-likelihood function for the parameter vector
is
where
To obtain the score function, denoted
we took the derivative of the log-likelihood function with respect to each of the parameters, so the elements of the score function are given by
where
, for
, and
and;
.
The maximum likelihood estimator (MLE) for
and
are the solutions to the equations
,
and
for
, which require numerical procedures. To start the iterative process, the least squares estimates can be used for the
, i.e.,
, from where
, while for the
, the initial values could be implemented
, where
. With these initial values, and assuming them to be the true values of the parameters, a one-dimensional function for the parameter
can be obtained, which can be estimated by some numerical method such as uniroot from R Development Core Team [
29].
The elements of the observed information matrix that are calculated as minus the second derivative of the log-likelihood function with respect to the parameters, denoted by
, are given by
The explicit expressions of these elements are presented in the
Appendix B. The elements of the Fisher information matrix (denoted
) are given by the expectation of the elements of the observed information matrix, that is
these are calculated numerically, therefore, the information matrix is expressed by
When
, the case of the independence of univariate USHN distribution is obtained; thus, it follows that:
is a diagonal block matrix where
and
is the error function given by:
see Rieck and Nedelman [
8];
where
with
,
and
. Expectations in the above expressions must be calculated numerically,
is a matrix of zeros,
with
,
with
,
is a vector of size
is a vector of size
is a vector of size
q and
The rows (or columns) of the matrix
are linearly independent, so the determinant is different than zero; this guarantees the existence of the inverse of
. Hence, for large samples, the MLE
of
is asymptotically normal, that is,
resulting that the asymptotic variance of the MLE
is the inverse of
. The approximation to the
can be used to construct confidence intervals for
y
these are given by
and
where
is on the diagonal of the matrix
for each parameter and
is the quantil
of the standard normal distribution.
The hypothesis of interest
can be tested by the statistic test (for large n)
where
is the vector
without the parameter
and
is the MLE of
restricted on
5. Numerical Results
In this section, the results of the simulation study and two applications to illustrate the applicability of the proposed models are presented.
5.1. Simulation Study
In order to study the behavior of the MLE of the parameter vector of the MVSUSHN regression model, a small Monte Carlo simulation study with covariates in the model was carried out. We considered
and
for
. For
the case of independence; the results for each model can be seen in the studies conducted by Martínez-Flórez et al. [
3].
We used the bivariate regression model, and we took the values: , and and , while and The sample size was and 200, and the number of iterations was 5000. We studied the absolute value of the relative bias (RB), root of mean square error (RMSE), length of confidence interval (LCI), and coverage probability (CP). To generate the samples, we performed the following algorithm:
- (1)
Generate a uniform random and a random number with distribution .
- (2)
Generate with the inverse of the standard normal function.
- (3)
Let .
- (4)
Compute .
- (5)
Generate another uniform random number (independent of ) and also with distribution
- (6)
Compute the error such that where is the inverse function of the standard skew-normal and is the inverse of the hyperbolic sine function.
- (7)
Let . This algorithm is generated n times, finally obtaining the USHN bivariate random sample.
The results of the simulations can be seen in the
Table 2. In general, it can be seen that the RB, RSME, and LCI of the model parameters decrease as
n increases; this decrease is slower for some parameters. It can also be seen that the CP increases as
n increases.
5.2. Illustration 1
To show the relevance of the MVSUSHN distributio, a real data set from a study conducted by Freeman [
30] on drunk driving legislation and traffic fatalities in 48 states in the United States of America (USA) during the period from 1980 to 2004 is considered. The database is available in the wooldridge library by Shea and Brown [
31] of the software R Development Core Team [
29] under the name of driving, and it contains information associated with current legislation, accident records, and some demographic characteristics. For this illustration, the unemployment rate variables (
) and the percent of the population aged 14 through 24 (
) were used. The bivariate beta (BVBeta) model of Cepeda et al. [
25], the bivariate Johnson SB (BVJSB) model of Lemonte and Moreno-Arenas [
27], and the BVSUSHN model were fitted. The BVBeta distribution of Cepeda et al. [
25] is based on the e Farlie–Gumbel–Morgentern copula; see Nelsen [
32], and their joint pdf is given by
where
and
correspond, respectively, to the pdf and cdf of the beta distributions of parameters
and
for
and
. To compare models, we used the Akaike information criterion (AIC) of [
33] and the Bayesian information criterion (BIC) of [
34], defined, respectively, by
where
p is the number of parameters and
is the log-likelihood function evaluated at the MLEs of parameters. The best model is the one with the smallest AIC or BIC.
For fitting the bivariate model, we used the optim function of R Development Core Team [
29]. The parameter estimates of these models, accompanied by their standard errors in parentheses, obtained using the maximum likelihood method, are given in the
Table 3. According to the AIC and BIC criteria, the BVSUSHN model presents the best fit.
The graphs in the
Figure 3 show the contours of the fitted models. For the BVSUSHN distribution, we have that
for
and
, then it follows that
and
where
and
are a column vector of size 2 and an identity matrix of size
, respectively. The statistic of the Kolmogorov–Smirnov (KS) goodness-of-fit test joint with the respective
-values for the marginal distributions for the BVSJB, BVBeta, and BVSUSHN distribution are presented in
Table 3. From here, it can be see that the BVSUSHN distribution shows a good fit compared with the BVSJB and BVBeta models.
5.3. Illustration 2
In the second illustration, the USHN bivariate regression model is fitted. The real data were taken from
http://www.pe.undp.org (accessed on 31 August 2022), and they correspond to measurements made on 195 districts in Peru. In this illustration, the interest is to model the human development index (
) and illiteracy rate (
) as functions of the proportion of people with high poverty level (HPL) by using the MVSUSHN regression model. As far as poverty is concerned, poor are identified as those unable to obtain minimum required calorie per day to keep body and soul together. We have that
and
with
The MLEs of the vector of model parameters, with standard errors in parenthesis are given by
, and
and
where
and
are a column vector of size 2 and an identity matrix of size
. To study the model fit, we perform the Kolmogorov–Smirnov test for the bivariate vector
.
For the multivariate Kolmogorov–Smirnov test of goodness of fit proposed by Justel [
35], special for the case of a bivariate distribution, which we denote by BKS (bivariate Kolmogorov–Smirnov), the statistic is given by
where
is the empirical distribution function of the sample, and
is some specified distribution function. When the
distribution is unknown, the Kolmogorov–Smirnov statistic is defined by
where
by using the transformations
,
, and
by using the transformations
and
where
is the empirical distribution function of the sample. For the special case of the BVSUSHN model,
which is less than
(for
), which is the critical value given by Justel [
35] at level of 5%. Therefore, it is concluded that the MVSUSHN model fits the data set well.
We also performed univariate Kolmogorov–Smirnov goodness-of-fit tests for with yielding the test statistics of with and with , indicating that the marginals show a good fit.
The
Figure 4 shows the envelope plots for the marginal and contour distributions for the residuals of the fitted model. For the envelope plot, we used the martingale residual transformation,
, proposed by Barros et al. [
36]. These residuals are defined by
where
is the martingale residual proposed by Ortega et al. [
37], where
indicates whether the
i-th observation is censored or not, respectively,
denotes the sign of
, and
represents the survival function evaluated at
, where
are the MLE for
.
6. Concluding Remarks
Diverse distributions to deal with the problem of bounded data on the interval were proposed, with great applicability in all fields of knowledge, especially in the social sciences, humanities, medicine, and engineering, among others. However, few proposals have been developed in the statistical literature to jointly model two or more variables, such as those described above, and especially that incorporate covariates to explain the variability of the variables of interest.
In this paper, a new multivariate distribution was introduced from the conditionally specified distributions methodology useful for modeling responses in the region jointly. The new distribution, which is absolutely continuous, is called the skewed log-Birnbaum–Saunders distribution and is also extended to the case of regression models. For the multivariate distribution, the marginal densities and conditional distributions were presented, and the Fisher information matrix of the multivariate regression model was also presented. For the estimation of the parameters in the models, a classical approach was used together with the maximum likelihood method. A small Monte Carlo simulation study was carried out to study the benefits and limitations of the new methodologies, which allows us to conclude that the parameter estimators behave asymptotically well. Two applications with real data to illustrate the usefulness of the introduced methodologies showed great flexibility to model data in for the particular case , which makes them excellent alternatives to existing methodologies in the statistical literature.