1. Introduction
In statistical literature, several authors have focused their attention on developing new and more flexible statistical distributions by using suitable transformation techniques (see, for example, [1,2,3]). Most of the obtained distributions deal with continuous random variables with unbounded support. Only in recent years has attention been devoted to filling the existing gap with respect to distributions with bounded support, in order to meet the need to describe empirical phenomena whose realizations cover limited ranges. Indeed, these kinds of data naturally arise in different contexts, such as rates, proportions, and percentages, but only a few models, such as the widely used beta model, the Kumaraswamy model [4], the Topp–Leone model [5], the arcsine model [6], the standard two-sided power model [7], and a few more (see [8,9]), were available in the past to describe them. Many others are very recent proposals. In particular, in the last decade, there have been many works in this field, mostly on models belonging to the class of so-called unit distributions. These models describe data with support in the unit interval and are often obtained by applying transformations to random variables. They include the unit-Burr III [10,11], unit-Lindley [12], unit-Gompertz [13], unit-Burr XII [14], unit inverse Gaussian [15], the arcsecant hyperbolic normal model [16], and logit slash [17], to name a few, as well as some new families of distributions [18,19].
The first aim of this work is to describe a general structure based on the random variable transformation technique, which includes most of the distributions for data on the unit interval already present in the literature. Moreover, other members of this class are obtained by considering different transformations. Particular attention is placed on building regression models starting from unit distributions; this allows us to evaluate the impact of covariates on response variables with bounded support and to consider alternatives to the most used regression models for unit data.
In a recent paper, [20] proposed a unified procedure to construct distribution functions in the (0,1) interval from the composition of two random variables with the same support, which turned out to be a special case of the
family introduced by [21]. Our approach differs from the one just mentioned in that it does not require the knowledge of a second distribution function or a second quantile function. Furthermore, we envisaged a reparameterization and construction of the regression models on the indicators of interest.
The rest of the paper is organized as follows. In Section 2, we define the general class of distributions and derive the expressions for the distribution and probability density functions. Quantiles, moments, and general expressions for the hazard and reverse hazard rates are given. Particular cases of distributions belonging to the general class are described in Section 3, starting with the Dagum random variable and considering two particular kinds of transformations. Maximum likelihood estimation is discussed in Section 4. Section 5 shows how the proposed models can be employed in a regression perspective. Finally, in Section 6, two examples of applications are shown.
2. General Framework
Many of the recently suggested distributions, proposed for modeling data belonging to the unit interval, can be described by resorting to a single probabilistic structure based on a simple technique of a random variable transformation.
To this end, let Y be a random variable (rv) with a distribution function (df) and probability density function (pdf) , where is the parameter vector and , . Let be the application that identifies the transformation of the rv Y into a new variable V, assuming values . In general, the distribution of V could also be characterized by a vector of parameters , i.e., .
In the present paper, in order to simplify the discussion, we assume that the boundaries of the support of
V are finite, i.e.,
and
, and we assume that the function
is continuous, differentiable, and monotone over
. Consequently,
is invertible and its inverse
is differentiable on
:
Knowing the distribution function of Y and considering the transformation , it is easy to obtain the distribution function of V and its characteristics, such as quantiles and moments. Moreover, it is typical in the literature to study the behavior of the hazard function (hf) and the reverse hazard function (rhf) , with the aim of evaluating the flexibility of a distribution. Therefore, in the following, we obtain some general expressions of characteristics and properties for distributions belonging to this class. In doing this, we distinguish two cases, depending on whether is an increasing or a decreasing monotonic function.
- (1)
is an increasing monotonic function:
the
of
V is given by:
and, by (
1), we can obtain the
of
V as
Moreover, let
be the
p-th quantile of
Y, with
. It is easy to verify that, from (
2), the
q-th quantile of
V is as follows:
with
.
The general expressions for
and
functions are, respectively, given by:
- (2)
is a decreasing monotonic function.
In this case, with a little algebra, we can determine the quantities previously considered. In particular, the
df and
pdf of
V, respectively, are as follows:
and the quantile of order
q is as follows:
The hf and rhf are calculated accordingly.
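For reference, the standard change-of-variable results behind the two cases can be written compactly. The notation is an assumption made for illustration, since the original symbols are not reproduced here: $h$ denotes the transformation, $h^{-1}$ its inverse, $F_Y$ and $f_Y$ the df and pdf of Y, $y_p$ the p-th quantile of Y, and $\tau_V$, $\rho_V$ the hf and rhf of V.

```latex
% Case (1): h increasing
F_V(v) = F_Y\!\left(h^{-1}(v)\right), \qquad
f_V(v) = f_Y\!\left(h^{-1}(v)\right)\frac{d\,h^{-1}(v)}{dv}, \qquad
v_q = h(y_q).

% Case (2): h decreasing
F_V(v) = 1 - F_Y\!\left(h^{-1}(v)\right), \qquad
f_V(v) = -\,f_Y\!\left(h^{-1}(v)\right)\frac{d\,h^{-1}(v)}{dv}, \qquad
v_q = h(y_{1-q}).

% Hazard and reverse hazard functions (both cases)
\tau_V(v) = \frac{f_V(v)}{1 - F_V(v)}, \qquad
\rho_V(v) = \frac{f_V(v)}{F_V(v)}.
```

In the decreasing case the derivative of $h^{-1}$ is negative, so the leading minus sign keeps the density non-negative.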
We can use different methods, known in the literature, to determine the moment of order r.
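As a rough numerical illustration of the framework, the following sketch builds the df and quantile function of V from those of Y and approximates a moment by integrating the quantile function. The function names and the exponential example are hypothetical choices made here for illustration; they are not taken from the paper.

```python
import math

def transformed_cdf(cdf_y, h_inv, increasing):
    """df of V = h(Y), built from the df of Y and the inverse of h."""
    if increasing:
        return lambda v: cdf_y(h_inv(v))
    # A decreasing h reverses the ordering, so the upper tail of Y is used.
    return lambda v: 1.0 - cdf_y(h_inv(v))

def transformed_quantile(quantile_y, h, increasing):
    """Quantile function of V = h(Y), built from the quantile function of Y."""
    if increasing:
        return lambda q: h(quantile_y(q))
    return lambda q: h(quantile_y(1.0 - q))

def moment_from_quantile(quantile_v, r, n=10_000):
    """Approximate E[V^r] as the integral of quantile_v(q)^r over (0, 1)
    with a midpoint rule on n cells."""
    return sum(quantile_v((i + 0.5) / n) ** r for i in range(n)) / n

# Illustrative choice (an assumption): Y ~ Exponential(rate), h(y) = exp(-y).
rate = 2.0
cdf_y = lambda y: 1.0 - math.exp(-rate * y)
quantile_y = lambda p: -math.log(1.0 - p) / rate
h = lambda y: math.exp(-y)
h_inv = lambda v: -math.log(v)

cdf_v = transformed_cdf(cdf_y, h_inv, increasing=False)
quantile_v = transformed_quantile(quantile_y, h, increasing=False)
mean_v = moment_from_quantile(quantile_v, r=1)
```

For this particular choice, V has df v^rate on (0,1), so the output can be checked by hand.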
We should note that most of the proposals in the literature can be thought of as particular cases of the comprehensive framework described earlier. For example, the most used transformations in the cases of positive
rvs are as follows:
and
. On the other hand, the most common transformation, when
Y assumes a real value, is
, as in the case of the logit slash model. Moreover,
was used in the context of non-monotonic rv transformations to obtain the arcsecant hyperbolic normal model, which, strictly speaking, does not belong to the general framework proposed here, although the framework can still be applied with minor mathematical adjustments. We should note that, in general, any distribution function
can be used to transform the rv Y into a new variable
.
Table 1 summarizes a classification of some unit distributions proposed in the literature, according to the used transformation.
In many application contexts, researchers often focus on specific aspects when characterizing a distribution, such as quantiles, location measures (mode, median, mean), variability indicators, etc. For this reason, when possible, it is useful to express the distribution as a function of such characteristics. The utility derives from the fact that, with appropriate methodological tools, it is possible to construct regressive models on the characteristics of interest with the aim of inspecting the possible determinants of the phenomenon under investigation (see [
28,
29]). Each characteristic and/or indicator is, in general, a function of the vector of the distribution parameters, let us say
, with reference to the unit’s distribution function (
2). If
is a vector of dimension
p and the system
has a unique finite solution, say,
then the unit-distribution function
represents a reparameterization in terms of indicators and/or characteristics of interest of the distribution in (
2).
3. Two Kinds of Unit-Dagum Distributions
In this section, two different transformations of the widely used Dagum rv [30,31] are described. Given the ability of the Dagum model to fit real data, the resulting new models may potentially be more flexible than unit distributions that have already appeared in the literature.
The
df and
pdf of Dagum
rv Y are given, respectively, by:
and
with
and
. In particular, the vector of parameters of Dagum distribution (hereafter,
) is
, where
represents a scale parameter and
and
are shape parameters.
The Dagum model is positively skewed and it can be unimodal or zero-modal, depending on
or
. In particular, the mode is given by
It is easy to verify that the
q-th quantile is
therefore, the expression of the median is explicit:
It is also possible to obtain the expression of the
r-th moment, as follows
which exists for
. Here,
indicates the complete beta function.
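For convenience, the Dagum quantities mentioned in this section can be sketched as follows. The symbols are an assumption: the parameterization with $\beta, \delta > 0$ as shape parameters and $\lambda > 0$ as the scale-type parameter is a common one in the Dagum literature, but the original notation is not reproduced here.

```latex
F_Y(y) = \left(1 + \lambda y^{-\delta}\right)^{-\beta}, \qquad
f_Y(y) = \beta\lambda\delta\, y^{-\delta-1}\left(1 + \lambda y^{-\delta}\right)^{-\beta-1},
\qquad y > 0,

y_{\mathrm{mode}} = \lambda^{1/\delta}\left(\frac{\beta\delta - 1}{\delta + 1}\right)^{1/\delta}
\quad (\beta\delta > 1),

y_q = \lambda^{1/\delta}\left(q^{-1/\beta} - 1\right)^{-1/\delta}, \qquad
y_{0.5} = \lambda^{1/\delta}\left(2^{1/\beta} - 1\right)^{-1/\delta},

E\!\left(Y^r\right) = \beta\,\lambda^{r/\delta}\,
B\!\left(\beta + \frac{r}{\delta},\; 1 - \frac{r}{\delta}\right), \qquad r < \delta .
```

The quantile follows by solving $F_Y(y_q) = q$ for $y_q$, and the mode by setting the derivative of $\log f_Y$ to zero.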
3.1. The First Kind of Unit-Dagum Distribution
In this section, we consider the hyperbolic secant transformation:
In particular, it is simple to verify that, for
, it is a monotonic decreasing function with
,
and
. Furthermore, it is known that the inverse hyperbolic secant is given by
.
Taking into account the characteristics of the proposed transformation, the distribution function of the new
rv V is given by
with
and
(hereafter,
). From (
1), after simple algebra, we obtain the first derivative of the inverse of
:
and, consequently, the
pdf of
rv:
where
.
Figure 1 shows various behaviors of the
pdf for the type I unit-Dagum model, according to different values of parameters.
The
q-th quantile of the
distribution, by (
10), is
In the following proposition, we show that the r-th moment of the type I unit-Dagum distribution can be expressed in terms of moments of the Dagum distribution.
Proposition 1. The r-th moment of has the following expression:
The hf and rhf are given, respectively, by
and
The hazard rate function of the type I unit-Dagum model for some values of parameters is shown in
Figure 2.
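The quantities of the type I unit-Dagum model can be evaluated numerically along the following lines. This is a minimal sketch under stated assumptions: the Dagum df is taken as (1 + lam * y^(-delta))^(-beta), the transformation is V = sech(Y), and all function names are hypothetical.

```python
import math

def dagum_cdf(y, beta, lam, delta):
    return (1.0 + lam * y ** (-delta)) ** (-beta)

def dagum_pdf(y, beta, lam, delta):
    return (beta * lam * delta * y ** (-delta - 1.0)
            * (1.0 + lam * y ** (-delta)) ** (-beta - 1.0))

def dagum_quantile(p, beta, lam, delta):
    return lam ** (1.0 / delta) * (p ** (-1.0 / beta) - 1.0) ** (-1.0 / delta)

def arcsech(v):
    """Inverse hyperbolic secant for 0 < v < 1."""
    return math.log((1.0 + math.sqrt(1.0 - v * v)) / v)

def udagum1_cdf(v, beta, lam, delta):
    # sech is decreasing on y > 0, so the upper tail of Y is used.
    return 1.0 - dagum_cdf(arcsech(v), beta, lam, delta)

def udagum1_pdf(v, beta, lam, delta):
    # |d arcsech(v) / dv| = 1 / (v * sqrt(1 - v^2))
    jacobian = 1.0 / (v * math.sqrt(1.0 - v * v))
    return dagum_pdf(arcsech(v), beta, lam, delta) * jacobian

def udagum1_quantile(q, beta, lam, delta):
    # q-th quantile of V is sech of the (1 - q)-th quantile of Y.
    return 1.0 / math.cosh(dagum_quantile(1.0 - q, beta, lam, delta))

def udagum1_hazard(v, beta, lam, delta):
    return udagum1_pdf(v, beta, lam, delta) / (1.0 - udagum1_cdf(v, beta, lam, delta))

def udagum1_reverse_hazard(v, beta, lam, delta):
    return udagum1_pdf(v, beta, lam, delta) / udagum1_cdf(v, beta, lam, delta)

# Hypothetical parameter values, chosen only to exercise the functions.
params = (2.0, 1.5, 3.0)
median_v = udagum1_quantile(0.5, *params)
```

A quick sanity check is that `udagum1_cdf(udagum1_quantile(q, ...), ...)` returns `q` up to rounding error.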
We propose a possible reparameterization of the type I unit-Dagum distribution in terms of the median and the
quantile. It is possible to verify that the system
presents the following unique solution:
The corresponding distribution function is
with
,
and
for
.
3.2. A Second Kind of Unit-Dagum Distribution
In this section, we consider the monotonic decreasing transformation , with , and , . The inverse is given by .
The distribution function of
V is given by
with
and
(hereafter,
). From (
1), after simple algebra, we obtain the first derivative of the inverse of
:
and, consequently, the
pdf of
rv:
It is worth noting that the distribution in (
23) can be viewed as an extension of the unit-Burr III obtained by [
11], using the same transformation. Indeed, the Dagum model has one more parameter than the Burr III, namely a scale parameter; thus, by putting
, the unit-Burr III is obtained. Although the unit-Burr III has already been studied in the literature, the
parameter is essential for carrying out the reparameterization and building the regression model, as will be seen later; therefore, we consider here the type II unit-Dagum distribution, which retains the scale parameter.
Figure 3 shows various behaviors of the
pdf for the type II unit-Dagum model according to different parameter values.
The
q-th quantile of the
distribution, by (
10), is
It can be readily verified that the r-th moment of the type II unit-Dagum distribution coincides with the Laplace transform of the Dagum distribution and it can be expressed in terms of moments of the Dagum distribution.
Proposition 2. The r-th moment of has the following expression:
The hf and rhf are given, respectively, by
and
The hazard rate function of the type II unit-Dagum model for some values of parameters is shown in
Figure 4.
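A comparable sketch for the type II model is given below. It assumes that the transformation is V = exp(-Y), which is consistent with the Laplace-transform remark above but is not shown explicitly in the text here; the Dagum parameterization and all function names are likewise assumptions made for illustration.

```python
import math

def dagum_cdf(y, beta, lam, delta):
    return (1.0 + lam * y ** (-delta)) ** (-beta)

def dagum_pdf(y, beta, lam, delta):
    return (beta * lam * delta * y ** (-delta - 1.0)
            * (1.0 + lam * y ** (-delta)) ** (-beta - 1.0))

def dagum_quantile(p, beta, lam, delta):
    return lam ** (1.0 / delta) * (p ** (-1.0 / beta) - 1.0) ** (-1.0 / delta)

def udagum2_cdf(v, beta, lam, delta):
    # exp(-y) is decreasing, so the upper tail of Y is used.
    return 1.0 - dagum_cdf(-math.log(v), beta, lam, delta)

def udagum2_pdf(v, beta, lam, delta):
    # |d(-log v) / dv| = 1 / v
    return dagum_pdf(-math.log(v), beta, lam, delta) / v

def udagum2_quantile(q, beta, lam, delta):
    return math.exp(-dagum_quantile(1.0 - q, beta, lam, delta))

def udagum2_moment(r, beta, lam, delta, n=10_000):
    """Approximate E[V^r] = E[exp(-r * Y)] with a midpoint rule
    applied to the quantile function of V."""
    total = 0.0
    for i in range(n):
        total += udagum2_quantile((i + 0.5) / n, beta, lam, delta) ** r
    return total / n

# Hypothetical parameter values, chosen only to exercise the functions.
params = (2.0, 1.5, 3.0)
mean_v = udagum2_moment(1, *params)
```

Because V lies in (0,1), all moments exist and decrease with r, which gives a simple plausibility check on the numerical approximation.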
It is easy to verify that a possible reparameterization of the type II unit-Dagum distribution in terms of the median and the
quantile can be obtained as a solution of the following system:
that presents the following unique solution
The corresponding distribution function is as follows:
with
,
and
for
.
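One way to see how such a reparameterization can work is to solve the two quantile equations numerically. The sketch below rests on several assumptions that are not confirmed by the text here: the transformation V = exp(-Y), the Dagum df (1 + lam * y^(-delta))^(-beta), and the choice of keeping beta as a free shape parameter while lam and delta are expressed through the median and a q-th quantile. Function names are hypothetical.

```python
import math

def udagum2_quantile(q, beta, lam, delta):
    """q-th quantile of V = exp(-Y), Y Dagum (assumed parameterization)."""
    y = lam ** (1.0 / delta) * ((1.0 - q) ** (-1.0 / beta) - 1.0) ** (-1.0 / delta)
    return math.exp(-y)

def params_from_median_and_quantile(median, v_q, q, beta):
    """Recover (lam, delta) from the median and the q-th quantile of V,
    for a given beta. Requires q != 0.5 and quantiles ordered
    consistently with q (v_q > median when q > 0.5)."""
    a = -math.log(median)                    # Dagum quantile of order 0.5
    b = -math.log(v_q)                       # Dagum quantile of order 1 - q
    A = 2.0 ** (1.0 / beta) - 1.0
    B = (1.0 - q) ** (-1.0 / beta) - 1.0
    # From a^delta = lam / A and b^delta = lam / B:
    delta = math.log(B / A) / math.log(a / b)
    lam = A * a ** delta
    return lam, delta

# Round trip with hypothetical values.
beta, lam, delta, q = 2.0, 1.5, 3.0, 0.9
median = udagum2_quantile(0.5, beta, lam, delta)
v_q = udagum2_quantile(q, beta, lam, delta)
lam_back, delta_back = params_from_median_and_quantile(median, v_q, q, beta)
```

Under the same assumptions, the type I case follows the same algebra with `-math.log(.)` replaced by the inverse hyperbolic secant.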
4. Inference
In this section, we use the maximum likelihood (
ML) method to estimate the parameters of type I and type II unit-Dagum distributions under the hypothesis of homogeneity of the statistical units, i.e., assuming that there are no systematic factors (covariates), which make the observations heterogeneous. To this end, we first rewrite the probability density functions (
14) and (
23), in a single expression as follows
where
in the case of the type I unit-Dagum distribution or
in the case of the type II unit-Dagum distribution. Let
be a random sample of size
n from (
31), the log-likelihood function for
is as follows:
Differentiating
with respect to
,
, and
, respectively, we obtain the components of the score vector
, where
and setting the components of the score vector equal to zero, we obtain the system of likelihood equations, whose solution gives the
ML estimates
of the parameter vector
. The system does not admit any explicit solution; therefore, the
ML estimates
can only be obtained by means of numerical procedures.
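Since the likelihood equations have no closed-form solution, the log-likelihood is typically coded directly and handed to a numerical optimizer. The sketch below writes a single log-likelihood for both models in terms of the inverse transformation; it assumes the Dagum df (1 + lam * y^(-delta))^(-beta), the inverse transformations arcsech(v) (type I) and -log(v) (type II), and a small hypothetical data vector. None of the names come from the paper.

```python
import math

def g_type1(v):
    """Inverse transformation for type I: arcsech(v)."""
    return math.log((1.0 + math.sqrt(1.0 - v * v)) / v)

def abs_dg_type1(v):
    return 1.0 / (v * math.sqrt(1.0 - v * v))

def g_type2(v):
    """Inverse transformation for type II (assumed): -log(v)."""
    return -math.log(v)

def abs_dg_type2(v):
    return 1.0 / v

def loglik(params, sample, g, abs_dg):
    """Log-likelihood of (beta, lam, delta) for observations in (0, 1)."""
    beta, lam, delta = params
    if min(params) <= 0.0:
        return -math.inf                     # outside the parameter space
    total = 0.0
    for v in sample:
        y = g(v)
        total += (math.log(beta * lam * delta)
                  - (delta + 1.0) * math.log(y)
                  - (beta + 1.0) * math.log1p(lam * y ** (-delta))
                  + math.log(abs_dg(v)))
    return total

# Hypothetical observations in (0, 1), for illustration only.
data = [0.12, 0.25, 0.31, 0.38, 0.44, 0.52, 0.61, 0.70]
ll_type1 = loglik((2.0, 1.5, 3.0), data, g_type1, abs_dg_type1)
ll_type2 = loglik((2.0, 1.5, 3.0), data, g_type2, abs_dg_type2)
```

In practice the negated function could be passed to a general-purpose optimizer, for example after a log transformation of the parameters so that positivity is enforced automatically.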
Confidence intervals and hypothesis tests for
can be constructed using the usual asymptotic properties of the maximum likelihood estimators. In particular, we highlight that the expected Fisher information matrix of the parameter vector
coincides with the expected Fisher information matrix of
of the Dagum distribution (see
Appendix A.3). This means that when constructing confidence intervals and hypothesis tests for the parameters of type I and II models of the unit-Dagum distribution, we can use the asymptotic variance and covariance matrix calculated in [
32,
33].
5. Unit-Dagum Regression Models
An important aspect to investigate is how heterogeneity among statistical units impacts possible measures of interest, such as median and extreme quantiles, simultaneously and directly. Given the particular nature of the dependent variable, this leads us to consider a regression approach where the response variable is defined on the unit interval.
The literature on this theme is wide and often deals with two different possibilities: properly transforming data to map the (0,1) interval to the real line and then using a common regression analysis, or choosing a suitable distribution and defining the relations among distribution parameters and covariates. Regarding the first kind of approach, various transformations are possible, and the logit is the most popular, but as [34] underlines, transformations can be inappropriate since the heteroscedasticity and skewness in data are not properly handled; moreover, the interpretation of results is possible only on the transformed scale. On the other hand, the second approach is nowadays preferred and widely explored, with different existing proposals based on various distributions and response variables. For example, when the attention is focused on the mean, the most popular distribution is the beta [35], but other possibilities are represented by simplex [36], log-Bilal [27], log-Lindley [37], log-weighted exponential [38], and unit gamma [39], to cite a few. When the focus is on the median or, in general, on the distribution quantiles, regression models can be based on Kumaraswamy [40], Johnson-t [41], log-extended exponential-geometric [42], L-logistic [43], or unit-type distributions (see, for example, [14,22,44]). Our proposal fits into the latter approach.
Specifically, given a sample of n observations, for each statistical unit i (), we observe the individual dependent variable value and the sets of individual covariates supposedly related to indicators and summarized in the vectors, , for . The three sets of covariates , , and are not necessarily the same, and, even if equal, their impact on the corresponding indicator may be different.
The vectors for , , define the rows of three block matrices of . Each one refers to the covariates affecting the indicator .
Each indicator, analogous to generalized linear models, is then related to the covariates, through an appropriate link function
, as follows:
The link functions are chosen to guarantee suitable restrictions on the parameter space, considering if
is positive or varies on
. The elements of the vector
are the unknown regression coefficients related to the
individual characteristics to be estimated, applying the maximum likelihood method. By using the reformulation of unit-Dagum models in terms of indicators of interest, as shown in expressions (
20) and (
29), it is possible to relate the new parameters, such as the median and
q-th quantile, to individual characteristics. In particular, observing that the solutions given in (
20) and (
29) are functions of the indicators of interest, i.e.,
,
for the type I unit-Dagum distribution and
,
for the type II unit-Dagum distribution, and specifying the indicators of interest as functions of the covariates
,
and
, from (
21) and (
30), for the
i-th observation, we can rewrite the
pdfs as functions of the regression coefficients
,
, and
. Similar to what was done previously, we use a single structure to represent type I and type II unit-Dagum distributions, simultaneously, as follows:
where
,
in the case of the type I unit-Dagum, and
,
in the type II unit-Dagum distribution. Putting
, by (
37), the
i-th element of the log-likelihood function is
Remembering that the parameters
,
, and
are functions of the vector
of the dimension
, the
equation of the likelihood system is given by
for
and
. The partial derivatives in system (
39) are given in
Appendix A.4.
The system of likelihood equations does not admit an explicit solution; therefore, the ML estimates for and can only be obtained by means of numerical procedures. Under the usual regularity conditions, the known asymptotic properties of the maximum likelihood method ensure that , where is the asymptotic variance–covariance matrix and is the Fisher information matrix, given by , where is the Hessian matrix of the second partial derivatives of the log-likelihood function. Elements of the matrix are not reported here for reasons of space, but are available upon request.
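The role of the link functions can be illustrated with a small sketch. It assumes, in line with the applications discussed later, a logit link for indicators living in (0,1), such as the median and an upper quantile, and a log link for a positive shape parameter. The covariate values, coefficient values, and function names are hypothetical.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link, written to avoid overflow for large |eta|."""
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def linear_predictor(x, coef):
    """Inner product of one covariate row with its coefficient vector."""
    return sum(xj * cj for xj, cj in zip(x, coef))

def unit_indicators(x_median, x_quantile, x_shape,
                    coef_median, coef_quantile, coef_shape):
    """Map covariates of one statistical unit to (median, quantile, shape)."""
    median = inv_logit(linear_predictor(x_median, coef_median))        # in (0, 1)
    quantile = inv_logit(linear_predictor(x_quantile, coef_quantile))  # in (0, 1)
    shape = math.exp(linear_predictor(x_shape, coef_shape))            # positive
    return median, quantile, shape

# Hypothetical unit: intercept plus one covariate for the first two indicators,
# intercept only for the shape parameter.
x = [1.0, 0.8]
median_i, quantile_i, shape_i = unit_indicators(
    x, x, [1.0],
    coef_median=[-0.5, -0.4],
    coef_quantile=[0.9, -0.2],
    coef_shape=[0.3],
)
```

Nothing in this sketch forces the fitted upper quantile to exceed the fitted median for every covariate value; that ordering would need to be checked on the estimated model.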
6. Applications
In order to show the potential of the proposed models, we consider two well-known and widely used datasets of unit-interval data contained in the R package betareg, namely household food expenditures and reading skills. In particular, the household food expenditure data concern the proportion of income spent on food for 38 households living in a large U.S. city and contain information on the perceived income and the number of persons living in the household. The reading skills dataset refers to the scores obtained in a test on reading accuracy involving 44 Australian children, including 19 dyslexic subjects and 25 non-dyslexic subjects. Moreover, the dyslexia status of each child and information regarding the nonverbal intelligence quotient (iq) are available.
These datasets were used by [34] to describe the implementation of beta regression in the R system and to underline the advantage of this kind of regression over the linear one when data belong to the unit interval. Therefore, as a further aim of this section, we compare the performance of the unit-Dagum regression models with that of the widely used beta regression. Indeed, both methodologies allow us to evaluate, among other aspects, the impact of covariates on measures of central tendency, namely the mean in the case of the beta regression and the median in the case of the unit-Dagum regression. It is worth noting that when data exhibit skewness, the median should be preferred as the centrality measure. Therefore, the proposed regression could be more appropriate in some cases.
6.1. Modeling Food/Income and Accuracy Data
In this section, we consider the proportion of income spent on food and the scores regarding reading accuracy. The corresponding empirical distributions are shown in Figure 5. To evaluate the adequacy of the proposed models in describing the considered data, the maximum likelihood estimates (MLEs) of the parameters for the I-UDa and II-UDa densities reported in (14) and (23) are obtained, along with the corresponding standard errors and the values of the Akaike information criterion (AIC). Moreover, we compare the obtained results with the analogous ones for the beta and Kumaraswamy (KW) models, which are likely the most used models for data on bounded support. Table 2 presents the obtained results. Both the AIC values and the inspection of Figure 5 suggest that the proposed models describe the considered data better than the beta and KW distributions. In particular, the lowest AIC value for the food expenditure data is obtained with the type II unit-Dagum model, while, for the reading skills data, the type I unit-Dagum model attains the lowest value, far from those of the beta and KW models. We should note that the chosen data are very different from each other in terms of distribution shape, so these examples give us the possibility of testing the flexibility of our models and their ability to properly reproduce different characteristics of the phenomena, such as unimodality, increasing density, asymmetry, and fat tails.
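For readers who want to reproduce this kind of comparison, the AIC is computed from the maximized log-likelihood and the number of estimated parameters. The helper below is a minimal sketch; the numbers in the example are hypothetical and are not the values reported in Table 2.

```python
def aic(max_loglik, n_params):
    """Akaike information criterion: 2k - 2 * (maximized log-likelihood)."""
    return 2 * n_params - 2.0 * max_loglik

# Hypothetical comparison of a three-parameter and a two-parameter model.
aic_model_a = aic(max_loglik=10.0, n_params=3)
aic_model_b = aic(max_loglik=8.0, n_params=2)
preferred = "A" if aic_model_a < aic_model_b else "B"
```

Lower values indicate a better trade-off between fit and complexity, which is why negative values of large magnitude are favored in the comparisons above.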
6.2. Considering the Covariates: The Regression Models
In this section, we consider both type I and type II unit-Dagum distributions according to a regressive perspective and we compare their performances with results from the well-known beta regression.
To this end, we also take into account the covariates and the results reported in [34], corresponding to the best beta regression model for each dataset. We should note that, as can be seen from Figure 5, both the food/income proportions and the reading accuracy scores show an asymmetric distribution; therefore, attention is placed on the median rather than the mean, since the median may be a more appropriate measure of central tendency in this case.
Food expenditure data
For the first dataset, information on household income and the number of people living in the household is available. Starting with the reparameterizations reported in (
19) and (
28), we consider the effect of these covariates on the median and 90th quantile, according to the regression models described in
Section 5. Since both the indicators assume values in the unit interval, a logit-link function is used to relate the median and 90th quantiles to the covariates. Moreover, we consider an intercept term related to the
indicator through a log-link function, which is suitable for positive indicators. The ML estimates of the coefficients, their standard errors, and results from the Wald test are reported in
Table 3. In both models, we find that the median and 90th quantiles of the proportions spent on food decrease as income increases, while the number of persons living in a household shows a positive significant effect on the 90th quantile, ceteris paribus. Moreover, both models outperformed the beta regression in terms of AIC (−88.37 for beta regression), with the best results obtained for II-UDa regression. A comparison between empirical and fitted curves reported in
Figure 6 confirms these results. In particular, here, two different curves are shown for each model. Indeed, through the regression approach and the resulting estimates, it is possible to consider the behaviors of density functions for different covariate values. The depicted curves refer to the median and
indicators for the I-UDa and II-UDa models, and to the mean and dispersion parameters for the beta model, when income and the number of persons are equal to the average level observed for
and
, respectively (
;
vs.
;
). This allows us to evaluate the ability of the models to describe the right distribution tail, as well as the central tendency.
Reading skills data
In the reading skills dataset, in addition to information regarding the presence of dyslexia,
z scores for the nonverbal intelligence quotient (iq) test are available. Therefore, we can consider the effects of these characteristics on the median and 90th quantiles of reading accuracy scores, by specifying a logit-link function to relate indicators and covariates. In particular, as suggested by [
34], we consider an interaction term between iq and dyslexia. Once again, we relate an intercept term to
, using a log-link function. Similarly to what was obtained by [
34] for the regression on the mean, we find significant main and interaction effects on the median for dyslexia and iq, for both the I-UDa and II-UDa models. Specifically, the results reported in
Table 4 confirm the positive effect of iq and the negative effects of dyslexia and the interaction term. Moreover, we also find a significant negative effect of dyslexia on the 90th quantile.
In this case, the model with the best performance in terms of AIC is the I-UDa one, but both of the proposed models show lower values than the beta regression (AIC = −117.8).
Figure 7 shows the comparisons among empirical and fitted distributions for dyslexic and non-dyslexic subjects, considering an average iq level that is equal to −0.653 for dyslexic subjects and 0.4966 for control subjects.
7. Concluding Remarks
In this paper, we show that many of the existing proposals on probability distributions for data in the unit interval can be viewed as particular cases of a general class of models, obtained using rv transformation techniques. Expressions for the distribution and density functions of the class are given, and its principal characteristics are derived. Through a proper choice of transformation, it is possible to obtain new distribution functions on bounded support, whose characteristics are easy to derive. Indeed, two new distributions are proposed, starting with the Dagum model and considering two different transformations. The resulting models are particularly flexible, as is evident from the behavior of their densities and hazard functions for different sets of parameter values.
We also considered the possibility of reparameterizing the distributions in order to express them in terms of the indicators of interest. In particular, we obtained models that depend on the median and quantile; this gave us the opportunity to relate these quantities to covariates, according to a regressive perspective. Given the particular nature of the involved variables, this led us to consider the regression approach, where the response variable was defined on the unit interval. Therefore, the proposed methodology can be considered as an alternative to other approaches that are often employed when the response variable represents proportions, rates, or percentages. Furthermore, considering regression on the median could be more appropriate in the presence of asymmetry. The applications on two different datasets allowed us to evaluate the behaviors of the suggested models and compare their performances with the most widely used approach in this context, namely the beta regression. The obtained findings are encouraging since both models seem to be very competitive.