1. Introduction
In recent statistical literature, new probability distributions have been introduced as extensions of other known distributions. This methodology serves as a foundation for generating new families of distributions applicable across various fields. Among other authors, this approach was utilized by Eugene et al. [
1] to propose the Beta-G class of distributions. Subsequently, Silva et al. [
2] introduced the modified Weibull beta distribution families and the Weibull beta geometry, as noted by Cordeiro et al. [
3]. Moreover, building upon this methodology, Cordeiro and de Castro [
4] defined the Kumaraswamy-G class of distributions, followed by the suggestion of the Kumaraswamy modified Weibull by Cordeiro et al. [
5]. Similarly, Zografos and Balakrishnan [
6] and Ristić and Balakrishnan [
7] presented a new family of distributions generated by gamma random variables, leading to the development of the Gamma-Generated-Logistic distributions by Castellares et al. [
8] and the Gamma-Birnbaum–Saunders distributions by Cordeiro et al. [
9].
On the other hand, Martínez-Flórez et al. [
10] examined the exponentiated-skew-normal distribution. Similarly, Martínez-Flórez et al. [
11] proposed the family of proportional hazard distributions based on the distribution of the minimum in the sample. All of these new families of distributions have proven useful in analyzing responses of interest by adjusting both linear and nonlinear regression models. For instance, the regression model with skew-normal distributed errors Azzalini [
12] has been widely utilized. Furthermore, extensions of regression models have been recommended, assuming errors follow the exposed skew-normal distributions (Martínez-Flórez et al. [
10]; Martínez-Flórez et al. [
13]). Moreover, these distribution families have been extended to encompass the case of the Birnbaum distribution [
14] and the Birnbaum–Saunders log-linear regression model proposed by Rieck and Nedelman [
15], showcasing the extensive range of symmetric and asymmetric families available in the literature.
All the works previously presented, and many others that have not been mentioned here, are appropriate in cases where the response variable has its support in the set of real numbers or has positive support, while very few works focus on the problem of dealing with dichotomous data. In this particular case, the issue is addressed based on non-linear functions or link functions such as the logistic regression model, known in the statistical literature as the logit model, or the non-linear alternative based on the cumulative distribution function (CDF) of the normal density, called the probit model. Thus, the limited existence of proposals in the statistical literature for the analysis of dichotomous or polytomous responses through link functions used in other types of models becomes evident.
In practice, the regression model with a dichotomous response (logistic model) has been widely used in several areas of knowledge. In the educational area, for example, it can be used to predict the probability of a student dropping out based on their academic performance, age of entry, the educational level of their parents, number of siblings, etc. In the health sector, certain patient characteristics and the application of specific treatments can be analyzed using the model to understand the connection between the patient and the implemented treatment, including the probability or odds of survival. Similarly, in finance, based on characteristics such as sex, age, race, income, and educational level, the behavior of investors can be predicted. These models are also utilized for classifying individuals into certain groups according to the predicted probability of a specific event occurring.
In this article, a new regression model is proposed to address research with dichotomous response variables. This novel model can be applied to various fields, including medicine, finance, education, and the social sciences. Our proposal is grounded in the family of proportional hazard distributions, specifically utilizing an extension of the logistic distribution within this family.
The remainder of this work is organized as follows:
Section 2 provides a brief description of the logistic distribution and its associated regression model.
Section 3 describes the proportional hazard and proportional hazard logistic distributions, along with some of their most important properties. In
Section 4, the proportional hazard logistic regression model is introduced. Additionally, the statistical inference process is performed using a classical approach, presenting the score function and the elements of the observed information matrix.
Section 5 presents an application of the introduced model to student dropout data. Finally, the conclusions of the paper are presented in
Section 6.
2. Logistic Distribution
A continuous random variable with a logistic distribution has a probability density function (PDF) given by
where
denotes the hyperbolic secant function. The shape of the logistic distribution is similar to the shape of the normal density, with heavier tails and greater kurtosis than the normal distribution.
The cumulative distribution function (CDF) of a random variable with a logistic distribution is given by
while its survival and hazard functions can be written as
where tanh denotes the hyperbolic tangent function.
The extension of the logistic distribution to the location-scale case is achieved by using the transformation with and . This is denoted by , where represents the location parameter and the scale. Since this distribution is symmetric, then , , the asymmetry coefficient is zero, and its excess kurtosis is equal to . Finally, the p-th percentile, for , of this distribution is given by .
Associated with the logistic distribution is the logistic regression model, which is used to explain the probability of success of a random variable with a binomial distribution when there is a set of covariates that explain this probability (see Agresti [
16]). In essence, the logistic regression model is given by
where
represents a vector of covariates,
is the vector of model coefficients (unknown values that must be estimated), and
is a Bernoulli random variable with parameter
.
3. Hazard Proportional Distribution
In recent decades, families of asymmetric distributions have been introduced for fitting data with tails heavier or lighter than the normal distribution. As is well known, in the presence of high degrees of skewness and/or kurtosis, inferential processes based on the assumption of normality are inadequate. Similarly, while the elliptical family may provide a solution for distributions with heavy tails, it fails to address the issue of asymmetry in the data under study.
The skew-normal (SN) distribution, introduced by Azzalini [
12], is defined by the PDF given as
where
and
represent the PDF and CDF of the standard normal distribution, respectively, and
is a skewness parameter. The distribution is denoted by
. In addition to the work of Azzalini [
12], the SN distribution described in (
2) has been extensively studied by Henze [
17], Pewsey [
18], Chiogna [
19], and Gómez et al. [
20], among others.
Building on the work of Lehmann [
21], Martínez-Flórez et al. [
11] investigated another family of asymmetric univariate distributions called the proportional hazard. The PDF of this distribution is given by
where
is a positive real number, and
F is a continuous CDF with continuous PDF
f. This distribution is denoted by
. The hazard function associated with the density
is
where
represents the hazard function related to the density
f. When
and
, the distribution is called proportional hazard normal, denoted by
. The PDF is given by
where
is the survival function associated with the PDF
. This model serves as an alternative to accommodate data with asymmetry and kurtosis that fall outside the ranges allowed by the normal distribution.
The CDF of the
distribution is given by:
By varying the
parameter, Martínez-Flórez et al. [
11] found that the range of asymmetry and kurtosis coefficients,
and
, respectively, of the variable Z∼PHN(
) falls within the intervals
and
. These ranges exhibit better skewness and kurtosis properties than those of the SN distribution. Additionally, Martínez-Flórez et al. [
11] demonstrated that the information matrix of the PHN distribution in the location-scale case, denoted as
, is nonsingular. A particular case of the proportional hazard family is discussed below.
Proportional Hazard Logistic Distribution
The proportional hazard logistic (PHL) distribution, denoted by
, is defined by the PDF given as
Its respective CDF is given by
while the survival and hazard functions can be expressed as
and
respectively, where
is the hazard function of the logistic distribution.
Figure 1 illustrates the behavior of the CDF and the survival function for different values of the parameter
. It is noteworthy that, for
, the CDF corresponds to that of the logistic distribution. Moreover, the hazard function of the PHL is a multiple of the hazard function of the logistic distribution. Additionally, the adjustment of the CDF of the PHL distribution is more flexible than that of the logistic distribution. Similarly, it is observed that for
, the survival function converges more slowly (indicating a higher probability of survival) towards zero compared to the survival function of the logistic distribution, whereas for values greater than zero, the convergence to zero is faster (indicating a lower probability of survival).
The
r-th moment of the random variable
is given by:
From Expression (
10), the moments of orders 1, 2, 3, and 4 of the PHL distribution can be derived, facilitating the numerical calculation of its mean, variance, skewness, and kurtosis coefficients.
4. Proportional Hazard Logistic Regression Model
Assuming the regression model:
where
represents a set of covariates,
denotes a set of unknown coefficients, and
∼
. It then follows that
Yi∼PHL
, for
.
However, when
Y is a dichotomous random variable with values zero and one, the model errors are not independent and do not satisfy the assumption of homoscedasticity. Additionally, it cannot be ensured that
is bounded by 0 and 1.
For this reason, it is necessary to determine a distribution function
such that
The function is known as a link function, and since it must ensure that the prediction lies between 0 and 1, it is commonly chosen as the distribution function of certain random variables studied in classical probability theory literature.
The link functions
typically utilized in practice are the CDF of the logistic distribution, resulting in the logit model, and the CDF of the normal distribution, resulting in the probit model. Due to their mathematical and computational complexity, the logit model is generally preferred over the probit model in practical applications. A notable commonality between these two models is their symmetric CDF, which can be a limitation in scenarios where the probability of success for response variable
Y exhibits asymmetric behavior. Moreover, both distributions have limitations in accurately modeling certain probabilities in their tails. As illustrated in
Figure 1, the CDF of the proportional hazard logistic distribution displays asymmetric behavior. Additionally, the inclusion of the parameter
allows for modeling the probabilities in its tails. This parameter enhances the flexibility of the probability of success compared to the logit and probit functions, suggesting the potential for more precise adjustment of the probability of success for the variables under study.
Referring to
as the CDF of the PHL, it follows that
From this, it is obtained that
For
, it follows that
which will be referred to as the logit complement
root transformation.
4.1. Properties of the PHL Regression Model
Given the structure of the probability function included in this new model, some statistics of interest are calculated for the interpretation of the parameters. Then, the odds
are given by
Thus, the relative risk
or odds ratio, to compare the profile of individuals
i and
k, is given by
This expression is used when there are profiles of different individuals, or when the profiles only differ in the
jth variable. Thus, to estimate the relative risk in the
ith individual when the
jth variable is increased by one unit, denoted as
, while keeping the value of the rest of the variables constant, we have the expression
This represents the odds or the number of times the risk of the event occurring increases (or decreases) when the variable increases by one unit.
4.2. Maximum Likelihood Estimation
Given a random sample
of a random variable
Y with distribution
∼
, and considering a set of covariates
, the likelihood function is expressed as
Then, the log-likelihood function is given by
The score function,
with
, which is calculated as the first derivative of the log-likelihood function concerning the parameters, is given by
for
, and
The elements of the observed information matrix,
, defined as minus the Hessian matrix (matrix of second derivatives concerning the parameters), are given by:
The elements of the information matrix, which are obtained from the expected value of the elements of the observed information matrix,
, are given by
When
, we obtain
, and the information matrix can be written as
where
is the diagonal matrix
, and
is a vector with elements
.
Letting
, we obtain that the determinant of the information matrix is given by:
Thus, the information matrix is non-singular, which guarantees the existence of the variance–covariance matrix of the vector of maximum likelihood estimators (MLE)
. It can also be concluded that the variance–covariance matrix of the MLE can be written as:
Therefore, for large sample sizes, we have
meaning that the distribution of the vector of estimators is consistent and asymptotically normal, with a covariance matrix equal to the inverse of the Fisher information matrix.
Confidence intervals for coefficients
of level
can be obtained from the expression
. Additionally, the adequacy of the proportional hazard logistic regression (PHLR) model can be evaluated through hypothesis testing:
for at least one
.
We can use the deviance function given by
with distribution
. Similarly, two models can be compared: one complete with
r variables
, and another with
q variables
through the test statistic
for which we have
. This same statistic is useful to test the significance of the remaining
variables that were not included in the model with
q variables.
One of the strategies to validate the good fit of the logistic regression model is to analyze the proportion of correct classification that the fitted model achieves. Letting
be the group of observations with
, and
be the group of observations with
then, using Bayes’ Theorem, the probability of classifying an individual into group
given the information of the explanatory variables
is given by
Thus, when performing the calculations for our model, we have
Similarly, is defined. In this case, the decision will be to classify the ith individual into if ; that is, if .
To evaluate the predictive capacity of the proportional hazard logistic regression model, the overall accuracy of the model can be calculated, which is defined as the proportion of individuals that are correctly classified, as well as the sensitivity or true positive rate of the model (TPR), defined as the number of correctly classified individuals from group divided by the total number of correctly classified ones (from and ). Similarly, the false negative rate (FNR) of the model is defined as , among others.
5. Case Study: Students’ Dropout Data
The data for this application consist of a sample of 413 students from the Department of Mathematics and Statistics of the University of Córdoba, which were obtained from the SPADIES System of the Ministry of National Education of Colombia (MNE). The response variable in this application takes the values
(if program dropout) or
(if non-dropout). The explanatory variables considered are
(cumulative general average, CGA),
character of the school (CS) of the student where they studied, taking values
(if the student comes from an official school), and
(if not), and
the number of periods enrolled (NPE). The logistic regression (LR) and proportional hazard logistic regression (PHLR) models were fitted. The results of the fitted models, obtained using the R Development Core Team [
22] package, are given in
Table 1.
The results of the model fit indicate that the variables CGA and NPE are significant, whereas the variable CS does not significantly explain the probability of university student dropout.
To compare the fitted models, we employ the Akaike Information Criterion (AIC) Akaike [
23], corrected AIC (CAIC), and the Bayesian Information Criterion (BIC) by Hastie and Tibshirani [
24], given by
and
where
p is the number of parameters in the model and
n is the sample size. The results favor the PHLR model based on AIC, CAIC, and BIC values.
To compare the proportional hazard logistic regression (PHLR) model with the logistic regression model, we conduct the hypothesis test
using the likelihood ratio statistic
where
and
represent the likelihood functions of the logistic and PHL models, respectively. Upon numerical evaluation, we obtain
which exceeds the value of
. The PHL model exhibits the best fit compared to the logistic model.
Carrying out the hypothesis test of the significance of the explanatory variables
for at least one
, we have
therefore, the null hypothesis is rejected. Similarly, for the hypothesis test
it follows that
Therefore, the null hypothesis is not rejected, meaning the variable character of the school is not significant in the model. However, academic differences are observed in the classroom between students who come from official schools and those from private schools, with the latter demonstrating better preparation.
Note that in the proportional hazard logistic regression model, the case corresponds to the logistic distribution. However, the hypothesis test vs. , which is performed using the likelihood ratio statistic, is rejected. This means that the parameter is significantly different from one, and must be considered to explain the behavior of the data. Moreover, the AIC, CAIC, and BIC criteria are favorable to the logistic proportional hazard model when compared with the usual logistic regression model. All of the above allows us to conclude that the proportional hazard logistic regression model fits better.
So, the fitted model is given as follows:
Now, the sample is divided into two subsamples. The first one, called the training sample, corresponds to 70% of the total sample, and the second one is the prediction sample (30% of the sample). From this partition, the following results for the fitted PHLR model are obtained.
According to the results in
Table 2, the accuracy is 77.23%, the sensitivity rate is 67.05%, and the specificity rate is 100%.
On the other hand,
Table 3 shows the performance of the PHLR model for different values of the
parameter.
Table 3 shows the skewness and kurtosis coefficients of the proportional hazard logistic model for different
values. The results indicate that the model can fit data with both negative and positive skewness, which is an advantage over traditional logistic models. Additionally, the PHLR model can fit data with varying degrees of kurtosis, both high and low.
Diagnostic analysis is a technique to detect possible influential observations and aberrant or extraneous data. In the case of the logistic model, this technique has certain similarities with the general diagnostic analysis of regression models. However, given that the response variable only takes the values 0 and 1, a somewhat unusual situation arises. Certain difficulties may arise if there is a large number of zeros (or ones) when one expects to find few zeros or ones, which can be a sign of a lack of fit in the model. In the case of the PHLR model, the diagnostic analysis could be carried out using the Pearson residuals,
the square of which is the
ith component of the Pearson chi-square statistic, the residual deviance
which is an adapted version of Cook’s distance for the case of the logistic regression model (see Christensen [
25]). When
,
, while if
,
.
For the student dropout data, the residual deviance graph for the PHLR model is presented in
Figure 2. Note that in this graph, there are no observations with high values of the residuals, which indicates that the model has a good fit. Likewise, the graph of the PHL distribution for the fitted probabilities is shown. Note that there are five values falling within the +2.5/−2.5 range in
Figure 2b, and six values in
Figure 2c, indicating that these observations are not extremely influential. Additionally, there are no observations outside the confidence bands in the envelope graphs (
Figure 3b), suggesting that the PHLR model effectively handles observations that deviate slightly from the +2/−2 range.
The
envelope graphs generated for the logistic and proportional hazard logistic models are presented in
Figure 3a and
Figure 3b, respectively. It is observed that the proportional hazard logistic regression model presents a better fit than the logistic regression model.
6. Conclusions
In this work, we have proposed the PHLR, a nonlinear regression model that captures complex relationships between independent variables and the response variable, particularly in the case of dichotomous data where the relationships cannot be adequately represented by a straight line. The flexibility of the PHLR model allows for a better fit to the data compared to linear models or even the logistic model.
The information matrix of the PHLR model is non-singular, ensuring that the parameters are uniquely estimable, avoiding linear dependency among them, and allowing for the proper calculation of the variance–covariance of the estimators. This guarantees the convergence of optimization and estimation algorithms, and ensures that the maximum likelihood estimators have desirable asymptotic properties, such as asymptotic normality.
In terms of information criteria such as AIC, CAIC, and BIC, the PHLR model shows a better fit than the logistic model for the analyzed student dropout data. The logistic model is revealed as a special case of the PHLR model. Additionally, the PHLR model demonstrates a good rate of correct classifications in the studied data. An alternative for the diagnostic analysis of model errors has also been proposed, offering useful tools for its implementation in educational problems or other contexts with dichotomous responses.