Dichotomous Proportional Hazard Regression Model: A Case Study on Students’ Dropout

Martínez-Flórez, Guillermo; Tovar-Falón, Roger; Barrera-Causil, Carlos

doi:10.3390/math12142170


by Guillermo Martínez-Flórez ¹,†, Roger Tovar-Falón ¹,† and Carlos Barrera-Causil ²,*,†

¹ Departamento de Matemáticas y Estadística, Universidad de Córdoba, Montería 230002, Colombia

² Grupo de Investigación Davinci, Facultad de Ciencias Exactas y Aplicadas, Instituto Tecnológico Metropolitano, Medellín 050034, Colombia

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Mathematics 2024, 12(14), 2170; https://doi.org/10.3390/math12142170

Submission received: 11 June 2024 / Revised: 8 July 2024 / Accepted: 9 July 2024 / Published: 11 July 2024


Abstract:

In problems involving binary classification, researchers often encounter data suitable for modeling dichotomous responses. These scenarios include medical diagnostics, where outcomes are classified as “disease” or “no disease”, and credit scoring in finance, determining whether a loan applicant is “high risk” or “low risk”. Dichotomous response models are also useful in many other areas for estimating binary responses. The logistic regression model is one option for modeling dichotomous responses; however, other statistical models may be required to improve the quality of fits. In this paper, a new regression model is proposed for cases where the response variable is dichotomous. This novel, non-linear model is derived from the cumulative distribution function of the proportional hazard distribution, and is suitable for modeling binary responses. Statistical inference is performed using a classical approach with the maximum likelihood method for the proposed model. Additionally, it is demonstrated that the introduced model has a non-singular information matrix. The results of a simulation study, along with an application to student dropout data, show the great potential of the proposed model in practical and everyday situations.

Keywords:

dichotomous response; logistic regression; maximum likelihood estimation; proportional hazard distribution

MSC:

62J12

1. Introduction

In recent statistical literature, new probability distributions have been introduced as extensions of other known distributions. This methodology serves as a foundation for generating new families of distributions applicable across various fields. Among other authors, this approach was utilized by Eugene et al. [1] to propose the Beta-G class of distributions. Subsequently, Silva et al. [2] introduced the beta modified Weibull family of distributions and the beta Weibull geometric distribution, as noted by Cordeiro et al. [3]. Moreover, building upon this methodology, Cordeiro and de Castro [4] defined the Kumaraswamy-G class of distributions, followed by the suggestion of the Kumaraswamy modified Weibull distribution by Cordeiro et al. [5]. Similarly, Zografos and Balakrishnan [6] and Ristić and Balakrishnan [7] presented a new family of distributions generated by gamma random variables, leading to the development of the Gamma-Generated-Logistic distributions by Castellares et al. [8] and the Gamma-Birnbaum–Saunders distributions by Cordeiro et al. [9].

On the other hand, Martínez-Flórez et al. [10] examined the exponentiated skew-normal distribution. Similarly, Martínez-Flórez et al. [11] proposed the family of proportional hazard distributions based on the distribution of the minimum in the sample. All of these new families of distributions have proven useful in analyzing responses of interest by fitting both linear and nonlinear regression models. For instance, the regression model with skew-normal distributed errors (Azzalini [12]) has been widely utilized. Furthermore, extensions of regression models have been recommended, assuming that the errors follow exponentiated skew-normal distributions (Martínez-Flórez et al. [10]; Martínez-Flórez et al. [13]). Moreover, these distribution families have been extended to encompass the case of the Birnbaum–Saunders distribution [14] and the Birnbaum–Saunders log-linear regression model proposed by Rieck and Nedelman [15], showcasing the extensive range of symmetric and asymmetric families available in the literature.

All the works previously presented, and many others that have not been mentioned here, are appropriate in cases where the response variable has its support in the set of real numbers or has positive support, while very few works focus on the problem of dealing with dichotomous data. In this particular case, the issue is addressed based on non-linear functions or link functions such as the logistic regression model, known in the statistical literature as the logit model, or the non-linear alternative based on the cumulative distribution function (CDF) of the normal density, called the probit model. Thus, the limited existence of proposals in the statistical literature for the analysis of dichotomous or polytomous responses through link functions used in other types of models becomes evident.

In practice, the regression model with a dichotomous response (logistic model) has been widely used in several areas of knowledge. In the educational area, for example, it can be used to predict the probability of a student dropping out based on their academic performance, age of entry, the educational level of their parents, number of siblings, etc. In the health sector, certain patient characteristics and the application of specific treatments can be analyzed using the model to understand the connection between the patient and the implemented treatment, including the probability or odds of survival. Similarly, in finance, based on characteristics such as sex, age, race, income, and educational level, the behavior of investors can be predicted. These models are also utilized for classifying individuals into certain groups according to the predicted probability of a specific event occurring.

In this article, a new regression model is proposed to address research with dichotomous response variables. This novel model can be applied to various fields, including medicine, finance, education, and the social sciences. Our proposal is grounded in the family of proportional hazard distributions, specifically utilizing an extension of the logistic distribution within this family.

The remainder of this work is organized as follows: Section 2 provides a brief description of the logistic distribution and its associated regression model. Section 3 describes the proportional hazard and proportional hazard logistic distributions, along with some of their most important properties. In Section 4, the proportional hazard logistic regression model is introduced. Additionally, the statistical inference process is performed using a classical approach, presenting the score function and the elements of the observed information matrix. Section 5 presents an application of the introduced model to student dropout data. Finally, the conclusions of the paper are presented in Section 6.

2. Logistic Distribution

A continuous random variable with a logistic distribution has a probability density function (PDF) given by

f_{L} (z) = \frac{\exp (- z)}{{(1 + \exp (- z))}^{2}} = \frac{1}{4} \operatorname{sech}^{2} (\frac{z}{2}), \quad z \in \mathbb{R},

(1)

where sech denotes the hyperbolic secant function. The shape of the logistic distribution is similar to the shape of the normal density, with heavier tails and greater kurtosis than the normal distribution.

The cumulative distribution function (CDF) of a random variable with a logistic distribution is given by

F_{L} (z) = \frac{exp (z)}{1 + exp (z)} = \frac{1}{2} + \frac{1}{2} tanh (\frac{z}{2}),

while its survival and hazard functions can be written as

S_{L} (z) = \frac{1}{1 + \exp (z)} = \frac{1}{2} - \frac{1}{2} \tanh (\frac{z}{2})

and

h_{L} (z) = \frac{\exp (z)}{1 + \exp (z)},

where tanh denotes the hyperbolic tangent function.

The extension of the logistic distribution to the location-scale case is achieved by using the transformation $Y = μ + σ Z$ with $μ \in \mathbb{R}$ and $σ > 0$. This is denoted by $Y \sim L (μ, σ)$, where $μ$ represents the location parameter and $σ$ the scale parameter. Since this distribution is symmetric about $μ$, we have $E (Y) = μ$ and $Var (Y) = \frac{π^{2}}{3} σ^{2}$, the asymmetry coefficient is zero, and the excess kurtosis is equal to $\frac{6}{5}$. Finally, the quantile of order $p$, for $0 < p < 1$, of this distribution is given by $y_{p} = μ + σ \log (p / (1 - p))$.
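The location-scale quantities above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative choices and do not come from the paper.

```python
import math


def logistic_cdf(y, mu=0.0, sigma=1.0):
    """CDF of L(mu, sigma), written as 1/2 + 1/2 * tanh(z / 2) with z = (y - mu) / sigma."""
    z = (y - mu) / sigma
    return 0.5 + 0.5 * math.tanh(z / 2.0)


def logistic_quantile(p, mu=0.0, sigma=1.0):
    """Quantile of order p: y_p = mu + sigma * log(p / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return mu + sigma * math.log(p / (1.0 - p))


def logistic_variance(sigma=1.0):
    """Variance of L(mu, sigma): (pi^2 / 3) * sigma^2."""
    return (math.pi ** 2 / 3.0) * sigma ** 2
```

Evaluating `logistic_cdf(logistic_quantile(p, mu, sigma), mu, sigma)` should return `p` up to rounding, which confirms that the quantile formula inverts the CDF.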

Associated with the logistic distribution is the logistic regression model, which is used to explain the probability of success of a random variable with a binomial distribution when there is a set of covariates that explain this probability (see Agresti [16]). In essence, the logistic regression model is given by

p_{i} = \Pr (Y_{i} = 1 ∣ x_{1}, x_{2}, \dots, x_{p}) = \frac{exp (x_{i}^{⊤} β)}{1 + exp (x_{i}^{⊤} β)}

where $x = {(1, x_{1}, \dots, x_{p})}^{⊤}$ represents a vector of covariates (with $x_{i}$ denoting its value for the $i$-th observation), $β = {(β_{0}, β_{1}, \dots, β_{p})}^{⊤}$ is the vector of model coefficients (unknown values that must be estimated), and $Y_{i}$ is a Bernoulli random variable with parameter $p_{i}$.

3. Proportional Hazard Distribution

In recent decades, families of asymmetric distributions have been introduced for fitting data with tails heavier or lighter than the normal distribution. As is well known, in the presence of high degrees of skewness and/or kurtosis, inferential processes based on the assumption of normality are inadequate. Similarly, while the elliptical family may provide a solution for distributions with heavy tails, it fails to address the issue of asymmetry in the data under study.

The skew-normal (SN) distribution, introduced by Azzalini [12], is defined by the PDF given as

φ (z; λ) = 2 ϕ (z) Φ (λ z), \quad z \in \mathbb{R},

(2)

where $ϕ$ and $Φ$ represent the PDF and CDF of the standard normal distribution, respectively, and $λ$ is a skewness parameter. The distribution is denoted by $Z \sim SN (λ)$. In addition to the work of Azzalini [12], the SN distribution described in (2) has been extensively studied by Henze [17], Pewsey [18], Chiogna [19], and Gómez et al. [20], among others.

Building on the work of Lehmann [21], Martínez-Flórez et al. [11] investigated another family of asymmetric univariate distributions called the proportional hazard. The PDF of this distribution is given by

φ_{F} (z; α) = α f (z) {\{1 - F (z)\}}^{α - 1}, \quad z \in \mathbb{R},

(3)

where $α$ is a positive real number, and $F$ is a continuous CDF with continuous PDF $f$. This distribution is denoted by $PHF (α)$. The hazard function associated with the density $φ_{F}$ is

h_{φ_{F}} (x; α) = α h_{f} (x),

where $h_{f} = f / (1 - F)$ represents the hazard function related to the density $f$. When $F = Φ (\cdot)$ and $f = ϕ (\cdot)$, the distribution is called proportional hazard normal, denoted by $PHN (α)$. Its PDF is given by

φ_{Φ} (z; α) = α ϕ (z) {\{S (z)\}}^{α - 1}, \quad z \in \mathbb{R},

(4)

where $S (z) = 1 - Φ (z)$ is the survival function associated with the PDF $ϕ (\cdot)$. This model serves as an alternative to accommodate data with asymmetry and kurtosis that fall outside the ranges allowed by the normal distribution.

The CDF of the $PHN (α)$ distribution is given by

F_{Φ} (z; α) = 1 - {\{S (z)\}}^{α}, \quad z \in \mathbb{R} .

(5)

By varying the $α$ parameter, Martínez-Flórez et al. [11] found that the asymmetry and kurtosis coefficients, $\sqrt{β_{1}}$ and $β_{2}$, respectively, of the variable $Z \sim PHN (α)$ fall within the intervals $(- 1.1578, 0.9918)$ and $(1.1513, 4.3023)$. These ranges exhibit better skewness and kurtosis properties than those of the SN distribution. Additionally, Martínez-Flórez et al. [11] demonstrated that the information matrix of the PHN distribution in the location-scale case, denoted as $PHN (μ, σ, α)$, is nonsingular. A particular case of the proportional hazard family is discussed below.

Proportional Hazard Logistic Distribution

The proportional hazard logistic (PHL) distribution, denoted by $PHL (α)$, is defined by the PDF given as

\begin{aligned} φ_{HL} (x; α) &= α \frac{\exp (x)}{{(1 + \exp (x))}^{α + 1}} \\ &= \frac{α}{4} \operatorname{sech}^{2} (\frac{x}{2}) {[\frac{1}{2} - \frac{1}{2} \tanh (\frac{x}{2})]}^{α - 1} . \end{aligned}

(6)

Its respective CDF is given by

\begin{aligned} F_{HL} (x; α) &= 1 - \frac{1}{{(1 + \exp (x))}^{α}} \\ &= 1 - {[\frac{1}{2} - \frac{1}{2} \tanh (\frac{x}{2})]}^{α}, \end{aligned}

(7)

while the survival and hazard functions can be expressed as

S_{HL} (x; α) = \frac{1}{{(1 + \exp (x))}^{α}} = {[\frac{1}{2} - \frac{1}{2} \tanh (\frac{x}{2})]}^{α}

(8)

and

\begin{aligned} h_{HL} (x; α) &= α \frac{\exp (x)}{1 + \exp (x)} \\ &= \frac{α}{4} \frac{\operatorname{sech}^{2} (\frac{x}{2})}{\frac{1}{2} - \frac{1}{2} \tanh (\frac{x}{2})} = α h_{L} (x), \end{aligned}

(9)

respectively, where $h_{L} (x)$ is the hazard function of the logistic distribution.
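The PDF, CDF, survival, and hazard functions of the PHL distribution can be evaluated as in the sketch below. It uses only the Python standard library; the function names and the helper `log1pexp` (a numerically stable version of log(1 + exp(x))) are illustrative assumptions, not part of the paper.

```python
import math


def log1pexp(x):
    """Stable evaluation of log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def phl_survival(x, alpha):
    """S_HL(x; alpha) = (1 + exp(x))^(-alpha)."""
    return math.exp(-alpha * log1pexp(x))


def phl_cdf(x, alpha):
    """F_HL(x; alpha) = 1 - (1 + exp(x))^(-alpha)."""
    return -math.expm1(-alpha * log1pexp(x))


def phl_pdf(x, alpha):
    """phi_HL(x; alpha) = alpha * exp(x) / (1 + exp(x))^(alpha + 1)."""
    return alpha * math.exp(x - (alpha + 1.0) * log1pexp(x))


def phl_hazard(x, alpha):
    """h_HL(x; alpha) = alpha * exp(x) / (1 + exp(x)) = alpha * h_L(x)."""
    return alpha * math.exp(x - log1pexp(x))
```

With `alpha = 1` these functions reduce to their logistic counterparts, and the ratio `phl_pdf(x, alpha) / phl_survival(x, alpha)` agrees with `phl_hazard(x, alpha)` for moderate values of `x`.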

Figure 1 illustrates the behavior of the CDF and the survival function for different values of the parameter $α$. It is noteworthy that, for $α = 1$, the CDF corresponds to that of the logistic distribution. Moreover, the hazard function of the PHL is a multiple of the hazard function of the logistic distribution. Additionally, the CDF of the PHL distribution is more flexible than that of the logistic distribution. Similarly, it is observed that for $α = 0.75$, the survival function converges towards zero more slowly than the survival function of the logistic distribution (indicating a higher probability of survival), whereas for values of $α$ greater than one, the convergence to zero is faster (indicating a lower probability of survival).

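The $r$-th moment of the random variable $Y \sim PHL (α)$ is given by

E (Y^{r}) = α {(- 1)}^{r} \int_{1}^{\infty} \frac{\log^{r} (u - 1)}{{(u - 1)}^{2}} {(1 - u^{- 1})}^{α + 1} d u,

(10)

which follows from the substitution $u = 1 + \exp (- y)$ in $\int_{- \infty}^{\infty} y^{r} φ_{HL} (y; α) d y$.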

From Expression (10), the moments of orders 1, 2, 3, and 4 of the PHL distribution can be derived, facilitating the numerical calculation of its mean, variance, skewness, and kurtosis coefficients.
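As an illustration of this numerical calculation, the sketch below approximates the moments by integrating $y^{r} φ_{HL} (y; α)$ directly with a composite Simpson rule. It is only one possible approach: the integration limits, the number of steps, and all function names are assumptions chosen for illustration, and the interval would need to be widened for small values of $α$, where the right tail decays slowly.

```python
import math


def _log1pexp(x):
    """Stable evaluation of log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def phl_pdf(y, alpha):
    """PHL density: alpha * exp(y) / (1 + exp(y))^(alpha + 1)."""
    return alpha * math.exp(y - (alpha + 1.0) * _log1pexp(y))


def phl_moment(r, alpha, lower=-50.0, upper=50.0, steps=20000):
    """Approximate E(Y^r) with composite Simpson's rule on [lower, upper]."""
    if steps % 2:
        raise ValueError("steps must be even")
    h = (upper - lower) / steps
    total = 0.0
    for k in range(steps + 1):
        y = lower + k * h
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        total += weight * (y ** r) * phl_pdf(y, alpha)
    return total * h / 3.0


def phl_summary(alpha):
    """Return mean, variance, skewness and kurtosis from the first four raw moments."""
    m1, m2, m3, m4 = (phl_moment(r, alpha) for r in (1, 2, 3, 4))
    var = m2 - m1 ** 2
    mu3 = m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3
    mu4 = m4 - 4.0 * m1 * m3 + 6.0 * m1 ** 2 * m2 - 3.0 * m1 ** 4
    return m1, var, mu3 / var ** 1.5, mu4 / var ** 2
```

For `alpha = 1` the summary should reproduce the logistic values stated in Section 2: zero mean, variance π²/3, zero skewness, and excess kurtosis 6/5 (kurtosis 4.2).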

4. Proportional Hazard Logistic Regression Model

Consider the regression model

Y_{i} = X_{i}^{⊤} β + ε_{i} = μ_{i} + ε_{i}, \quad i = 1, 2, \dots, n,

(11)

where $X = {(1, x_{1}, \dots, x_{p})}^{⊤}$ represents a set of covariates, $β = {(β_{0}, β_{1}, \dots, β_{p})}^{⊤}$ denotes a set of unknown coefficients, and $ε_{i} \sim PHL (0, σ, α)$. It then follows that $Y_{i} \sim PHL (μ_{i}, σ, α)$, for $i = 1, 2, \dots, n$.

However, when $Y$ is a dichotomous random variable with values zero and one, the model errors are not independent and do not satisfy the assumption of homoscedasticity. Additionally, it cannot be ensured that $E (Y_{i} ∣ x_{1}, \dots, x_{p}) = \Pr (Y_{i} = 1 ∣ x_{1}, \dots, x_{p})$ is bounded by 0 and 1.

For this reason, it is necessary to determine a distribution function $G (\cdot)$ such that

\Pr (Y_{i} = 1 ∣ x_{1}, \dots, x_{p}) = p_{i} = G (Y_{i} = 1 ∣ x_{1}, \dots, x_{p}) .

The function $G (\cdot ∣ x_{1}, \dots, x_{p})$ is known as a link function, and since it must ensure that the prediction lies between 0 and 1, it is commonly chosen as the distribution function of certain random variables studied in classical probability theory literature.

The link functions $G (\cdot ∣ x_{1}, \dots, x_{p})$ typically utilized in practice are the CDF of the logistic distribution, resulting in the logit model, and the CDF of the normal distribution, resulting in the probit model. Because of its mathematical and computational simplicity, the logit model is generally preferred over the probit model in practical applications. A notable commonality between these two models is their symmetric CDF, which can be a limitation in scenarios where the probability of success for the response variable $Y$ exhibits asymmetric behavior. Moreover, both distributions have limitations in accurately modeling certain probabilities in their tails. As illustrated in Figure 1, the CDF of the proportional hazard logistic distribution displays asymmetric behavior. Additionally, the inclusion of the parameter $α$ allows for modeling the probabilities in its tails. This parameter enhances the flexibility of the probability of success compared to the logit and probit functions, suggesting the potential for more precise adjustment of the probability of success for the variables under study.

Taking $G (\cdot ∣ x_{1}, \dots, x_{p})$ to be the CDF of the PHL distribution, it follows that

\begin{aligned} \Pr (Y_{i} = 1 ∣ x_{1}, \dots, x_{p}) &= p_{i} = G (Y_{i} = 1 ∣ x_{1}, \dots, x_{p}) \\ &= 1 - \frac{1}{{(1 + \exp (x_{i}^{⊤} β))}^{α}} \\ &= 1 - {[\frac{1}{2} - \frac{1}{2} \tanh (\frac{x_{i}^{⊤} β}{2})]}^{α} . \end{aligned}

From this, it is obtained that

\begin{aligned} \Pr (Y_{i} = 0 ∣ x_{1}, \dots, x_{p}) &= 1 - \Pr (Y_{i} = 1 ∣ x_{1}, \dots, x_{p}) \\ &= \frac{1}{{(1 + \exp (x_{i}^{⊤} β))}^{α}} \\ &= {[\frac{1}{2} - \frac{1}{2} \tanh (\frac{x_{i}^{⊤} β}{2})]}^{α} . \end{aligned}

For $p_{i} = \Pr (Y_{i} = 1 ∣ x_{1}, \dots, x_{p})$, it follows that

\log (\frac{1 - {(1 - p_{i})}^{1 / α}}{{(1 - p_{i})}^{1 / α}}) = x_{i}^{⊤} β, \quad i = 1, 2, \dots, n,

(12)

which will be referred to as the logit complement $α$-root transformation.
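The relationship between the linear predictor and the success probability, together with its inverse in (12), can be sketched as follows. The function names are hypothetical and the code relies only on the Python standard library.

```python
import math


def _log1pexp(x):
    """Stable evaluation of log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def phl_probability(eta, alpha):
    """p = 1 - (1 + exp(eta))^(-alpha), where eta = x' beta is the linear predictor."""
    return -math.expm1(-alpha * _log1pexp(eta))


def phl_linear_predictor(p, alpha):
    """Logit complement alpha-root transformation (12): recovers eta from p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    s = (1.0 - p) ** (1.0 / alpha)  # equals 1 / (1 + exp(eta))
    return math.log((1.0 - s) / s)
```

For `alpha = 1` the transformation reduces to the usual logit, `log(p / (1 - p))`, and applying `phl_linear_predictor` to the output of `phl_probability` returns the original linear predictor up to rounding.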

4.1. Properties of the PHL Regression Model

Given the structure of the probability function included in this new model, some statistics of interest are calculated for the interpretation of the parameters. The odds $odds (x_{1}, x_{2}, x_{3}, \dots, x_{p}) = odds (x)$ are given by

\begin{aligned} odds (x_{i}) &= \frac{\Pr (Y_{i} = 1 ∣ x_{1}, x_{2}, \dots, x_{p})}{1 - \Pr (Y_{i} = 1 ∣ x_{1}, x_{2}, \dots, x_{p})} \\ &= {(1 + \exp (x_{i}^{⊤} β))}^{α} - 1 . \end{aligned}

Thus, the relative risk ($RR$) or odds ratio, to compare the profiles of individuals $i$ and $k$, is given by

RR (i, k) = \frac{odds (x_{i})}{odds (x_{k})} = \frac{{(1 + \exp (x_{i}^{⊤} β))}^{α} - 1}{{(1 + \exp (x_{k}^{⊤} β))}^{α} - 1} .

This expression is used when there are profiles of different individuals, or when the profiles only differ in the $j$-th variable. Thus, to estimate the relative risk in the $i$-th individual when the $j$-th variable is increased by one unit, denoted as $x_{j} + 1$, while keeping the value of the rest of the variables constant, we have the expression

\frac{odds (x_{1}, \dots, x_{j - 1}, x_{j} + 1, x_{j + 1}, \dots, x_{p})}{odds (x_{1}, \dots, x_{j - 1}, x_{j}, x_{j + 1}, \dots, x_{p})} = \frac{{(1 + \exp (β_{j}) \exp (x_{i}^{⊤} β))}^{α} - 1}{{(1 + \exp (x_{i}^{⊤} β))}^{α} - 1} .

This ratio represents the number of times the odds of the event occurring increase (or decrease) when the variable $x_{j}$ increases by one unit. Unlike in the logistic model, where this ratio equals $\exp (β_{j})$, here it depends on the covariate profile unless $α = 1$.
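A small sketch of these quantities is given below, written in terms of the linear predictor $x_{i}^{⊤} β$ (called `eta` in the code). The function names are hypothetical and only the Python standard library is used.

```python
import math


def _log1pexp(x):
    """Stable evaluation of log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def phl_odds(eta, alpha):
    """odds = (1 + exp(eta))^alpha - 1."""
    return math.expm1(alpha * _log1pexp(eta))


def odds_ratio(eta_i, eta_k, alpha):
    """Ratio of the odds for two profiles with linear predictors eta_i and eta_k."""
    return phl_odds(eta_i, alpha) / phl_odds(eta_k, alpha)


def unit_increase_odds_ratio(eta, beta_j, alpha):
    """Odds ratio when x_j increases by one unit, so that eta becomes eta + beta_j."""
    return odds_ratio(eta + beta_j, eta, alpha)
```

When `alpha = 1`, `unit_increase_odds_ratio` returns `exp(beta_j)` for every `eta`, as in ordinary logistic regression; for other values of `alpha` the result changes with `eta`.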

4.2. Maximum Likelihood Estimation

Given independent observations $y_{1}, y_{2}, \dots, y_{n}$ of random variables $Y_{i} \sim Bin (1, p_{i})$, and considering a set of covariates $x_{1}, x_{2}, \dots, x_{p}$, the likelihood function is expressed as

L_{PHL} (β, α ∣ X, Y) = \prod_{i = 1}^{n} p_{i}^{y_{i}} {(1 - p_{i})}^{1 - y_{i}} .

Then, the log-likelihood function is given by

\begin{aligned} ℓ_{PHL} (β, α ∣ X, Y) &= \sum_{i = 1}^{n} [y_{i} \log (p_{i}) + (1 - y_{i}) \log (1 - p_{i})] \\ &= \sum_{i = 1}^{n} y_{i} \log ({(1 + \exp (x_{i}^{⊤} β))}^{α} - 1) - α \sum_{i = 1}^{n} \log (1 + \exp (x_{i}^{⊤} β)) . \end{aligned}

(13)

The score function, $U (β, α) = (U (β), U (α))$ with $U (β) = (U (β_{0}), U (β_{1}), U (β_{2}), \dots, U (β_{p}))$, which is calculated as the first derivative of the log-likelihood function with respect to the parameters, is given by

\begin{aligned} U (β_{j}) &= α \sum_{i = 1}^{n} x_{i j} y_{i} \exp (x_{i}^{⊤} β) \frac{{(1 + \exp (x_{i}^{⊤} β))}^{α - 1}}{{(1 + \exp (x_{i}^{⊤} β))}^{α} - 1} \\ &\quad - α \sum_{i = 1}^{n} x_{i j} \frac{\exp (x_{i}^{⊤} β)}{1 + \exp (x_{i}^{⊤} β)} \end{aligned}

(14)

for $j = 0, 1, 2, \dots, p$, and

\begin{aligned} U (α) &= \sum_{i = 1}^{n} y_{i} \frac{{(1 + \exp (x_{i}^{⊤} β))}^{α} \log (1 + \exp (x_{i}^{⊤} β))}{{(1 + \exp (x_{i}^{⊤} β))}^{α} - 1} \\ &\quad - \sum_{i = 1}^{n} \log (1 + \exp (x_{i}^{⊤} β)) . \end{aligned}

(15)
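The log-likelihood (13) and its gradient can be coded compactly. The sketch below is not the authors' implementation (the paper reports using R); it is a standard-library Python illustration with hypothetical function names. It uses the fact that differentiating (13) gives $U (β_{j}) = α \sum_{i} x_{i j} q_{i} (y_{i} / p_{i} - 1)$ and $U (α) = \sum_{i} \log (1 + \exp (x_{i}^{⊤} β)) (y_{i} / p_{i} - 1)$, where $q_{i} = \exp (x_{i}^{⊤} β) / (1 + \exp (x_{i}^{⊤} β))$.

```python
import math


def _log1pexp(x):
    """Stable evaluation of log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def phl_loglik(beta, alpha, X, y):
    """Log-likelihood (13); X is a list of covariate rows that include the leading 1."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        a = _log1pexp(eta)                        # log(1 + exp(eta))
        log_p = math.log(-math.expm1(-alpha * a))  # log p_i
        total += yi * log_p + (1 - yi) * (-alpha * a)
    return total


def phl_score(beta, alpha, X, y):
    """Gradient of (13) with respect to beta (a list) and alpha (a float)."""
    u_beta = [0.0] * len(beta)
    u_alpha = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        a = _log1pexp(eta)
        p = -math.expm1(-alpha * a)   # p_i = 1 - (1 + exp(eta))^(-alpha)
        q = math.exp(eta - a)         # exp(eta) / (1 + exp(eta))
        w = yi / p - 1.0
        for j, v in enumerate(xi):
            u_beta[j] += alpha * q * w * v
        u_alpha += a * w
    return u_beta, u_alpha
```

A quick way to validate such code is to compare `phl_score` with central finite differences of `phl_loglik` on a small artificial data set; in practice the two functions would be handed to a general-purpose optimizer to obtain the maximum likelihood estimates.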

The elements of the observed information matrix, $κ (θ)$ with $θ = {(β^{⊤}, α)}^{⊤}$, defined as minus the Hessian matrix (matrix of second derivatives with respect to the parameters), are given by

\begin{aligned} κ_{β_{j} β_{k}} &= α \sum_{i = 1}^{n} x_{i j} x_{i k} \frac{\exp (x_{i}^{⊤} β)}{{(1 + \exp (x_{i}^{⊤} β))}^{2}} [1 + \frac{y_{i}}{p_{i}^{2}} (- p_{i} (1 + \exp (x_{i}^{⊤} β)) + \exp (x_{i}^{⊤} β) (p_{i} + α (1 - p_{i})))], \\ κ_{β_{j} α} &= \sum_{i = 1}^{n} x_{i j} \frac{\exp (x_{i}^{⊤} β)}{1 + \exp (x_{i}^{⊤} β)} [1 - \frac{y_{i}}{p_{i}^{2}} (p_{i} - α \log (1 + \exp (x_{i}^{⊤} β)) (1 - p_{i}))], \\ κ_{α α} &= \sum_{i = 1}^{n} y_{i} \frac{1 - p_{i}}{p_{i}^{2}} \log^{2} (1 + \exp (x_{i}^{⊤} β)) . \end{aligned}

The elements of the information matrix, which are obtained from the expected value of the elements of the observed information matrix, $I (θ) = E (κ (θ))$, follow from $E (Y_{i}) = p_{i}$ and are given by

\begin{aligned} i_{β_{j} β_{k}} &= α^{2} \sum_{i = 1}^{n} x_{i j} x_{i k} \frac{1 - p_{i}}{p_{i}} {(\frac{\exp (x_{i}^{⊤} β)}{1 + \exp (x_{i}^{⊤} β)})}^{2}, \\ i_{β_{j} α} &= α \sum_{i = 1}^{n} x_{i j} \frac{1 - p_{i}}{p_{i}} \frac{\exp (x_{i}^{⊤} β)}{1 + \exp (x_{i}^{⊤} β)} \log (1 + \exp (x_{i}^{⊤} β)), \\ i_{α α} &= \sum_{i = 1}^{n} \frac{1 - p_{i}}{p_{i}} \log^{2} (1 + \exp (x_{i}^{⊤} β)) . \end{aligned}

When $α = 1$, we obtain $p_{i} = \frac{\exp (x_{i}^{⊤} β)}{1 + \exp (x_{i}^{⊤} β)}$, and the information matrix can be written as

I_{F} (θ) = \begin{pmatrix} X^{⊤} W X & X^{⊤} M \\ M^{⊤} X & d \end{pmatrix},

(16)

where $W$ is the diagonal matrix $W = diag (p_{i} (1 - p_{i}))$, $i = 1, 2, \dots, n$, $M$ is a vector with elements $m_{i} = (1 - p_{i}) \log (1 + \exp (x_{i}^{⊤} β))$, and $d = \sum_{i = 1}^{n} \frac{1 - p_{i}}{p_{i}} \log^{2} (1 + \exp (x_{i}^{⊤} β))$. The determinant of the information matrix is then given by

|I (θ)| = d |X^{⊤} (W - d^{- 1} M M^{⊤}) X| \neq 0 .

Thus, the information matrix is non-singular, which guarantees the existence of the variance–covariance matrix of the vector of maximum likelihood estimators (MLE) $\hat{θ}$. It can also be concluded that the variance–covariance matrix of the MLE can be written as

Σ = I^{- 1} (\hat{θ}) .

Therefore, for large sample sizes, we have

\hat{θ} \overset{d}{\to} N_{p + 2} (θ, Σ),

meaning that the vector of estimators is consistent and asymptotically normal, with a covariance matrix equal to the inverse of the Fisher information matrix.

Confidence intervals of level $100 (1 - ψ) \%$ for the coefficients $θ_{r}$ can be obtained from the expression ${\hat{θ}}_{r} \mp z_{1 - ψ / 2} \sqrt{\hat{σ} ({\hat{θ}}_{r})}$, where $\hat{σ} ({\hat{θ}}_{r})$ denotes the estimated variance of ${\hat{θ}}_{r}$ (the corresponding diagonal element of $Σ$) and $z_{1 - ψ / 2}$ is the standard normal quantile. Additionally, the adequacy of the proportional hazard logistic regression (PHLR) model can be evaluated through the hypothesis test

H_{0} : β_{1} = β_{2} = \dots = β_{p} = 0 \quad vs \quad H_{1} : β_{j} \neq 0,

for at least one $j = 1, \dots, p$.

We can use the deviance function given by

G_{p} = - 2 (ℓ (β_{0}, α) - ℓ (\hat{β}, α)),

where $ℓ (β_{0}, α)$ is the log-likelihood of the model containing only the intercept. Under $H_{0}$, $G_{p}$ is asymptotically distributed as $χ_{p}^{2}$. Similarly, two models can be compared: one complete with $r$ variables $(β_{r})$, and another with $q$ $(q < r)$ variables $(β_{q})$, through the test statistic

G_{r - q} = - 2 (ℓ ({\hat{β}}_{q}, α) - ℓ ({\hat{β}}_{r}, α)),

for which we have $G_{r - q} \sim χ_{r - q}^{2}$ asymptotically. This same statistic is useful to test the significance of the remaining $r - q$ variables that were not included in the model with $q$ variables.

One of the strategies to validate the good fit of the logistic regression model is to analyze the proportion of correct classification that the fitted model achieves. Let $G_{1}$ be the group of observations with $Y_{i} = 1$, and $G_{2}$ be the group of observations with $Y_{i} = 0$. Then, using Bayes' Theorem, the probability of classifying an individual into group $G_{1}$ given the information of the explanatory variables $x_{1}, x_{2}, \dots, x_{p}$ is given by

\Pr (G_{1} ∣ x) = \frac{p_{1} \times \Pr (x ∣ G_{1})}{p_{1} \times \Pr (x ∣ G_{1}) + p_{2} \times \Pr (x ∣ G_{2})},

where $p_{1}$ and $p_{2}$ denote the prior probabilities of groups $G_{1}$ and $G_{2}$, respectively. Thus, when performing the calculations for our model, we have

\Pr (G_{1} ∣ x) = 1 - \frac{p_{2}}{(p_{2} - p_{1}) + p_{1} {(1 + \exp (x_{i}^{⊤} β))}^{α}} .

Similarly, $\Pr (G_{2} ∣ x)$ is defined. In this case, the decision will be to classify the $i$-th individual into $G_{1}$ if $\Pr (G_{1} ∣ x) > \Pr (G_{2} ∣ x)$; that is, if $\Pr (G_{1} ∣ x) > 0.5$.

To evaluate the predictive capacity of the proportional hazard logistic regression model, the overall accuracy of the model can be calculated, which is defined as the proportion of individuals that are correctly classified, as well as the sensitivity or true positive rate of the model (TPR), defined as the number of correctly classified individuals from group $G_{1}$ divided by the total number of individuals in $G_{1}$. Similarly, the false negative rate (FNR) of the model is defined as $(1 - TPR)$, among others.
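These classification measures can be computed from predicted probabilities as in the following sketch. The function names and the dictionary keys are illustrative choices, and the code assumes that both groups are present in the data so that no denominator is zero.

```python
def classify(probabilities, threshold=0.5):
    """Assign an observation to G1 (label 1) when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def classification_rates(y_true, y_pred):
    """Accuracy, sensitivity (TPR), false negative rate and specificity.

    Assumes y_true contains at least one 1 and at least one 0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, c in pairs if t == 1 and c == 1)
    fn = sum(1 for t, c in pairs if t == 1 and c == 0)
    tn = sum(1 for t, c in pairs if t == 0 and c == 0)
    fp = sum(1 for t, c in pairs if t == 0 and c == 1)
    tpr = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tpr,
        "fnr": 1.0 - tpr,
        "specificity": tn / (tn + fp),
    }
```

For example, `classification_rates([1, 1, 1, 0, 0], classify([0.9, 0.6, 0.4, 0.2, 0.7]))` summarizes a toy set of five observations in which one member of each group is misclassified.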

5. Case Study: Students’ Dropout Data

The data for this application consist of a sample of 413 students from the Department of Mathematics and Statistics of the University of Córdoba, obtained from the SPADIES System of the Ministry of National Education of Colombia (MNE). The response variable in this application takes the values $Y = 1$ (program dropout) or $Y = 0$ (non-dropout). The explanatory variables considered are $x_{1}$ = cumulative general average (CGA); $x_{2}$ = character of the school (CS) where the student studied, taking the value 1 if the student comes from an official school and 0 otherwise; and $x_{3}$ = number of periods enrolled (NPE). The logistic regression (LR) and proportional hazard logistic regression (PHLR) models were fitted. The results of the fitted models, obtained using the R software [22], are given in Table 1.

The results of the model fit indicate that the variables CGA and NPE are significant, whereas the variable CS does not significantly explain the probability of university student dropout.

To compare the fitted models, we employ the Akaike Information Criterion (AIC; Akaike [23]), the corrected AIC (CAIC), and the Bayesian Information Criterion (BIC; Hastie and Tibshirani [24]), given by

A I C = - 2 \times \hat{ℓ} (\cdot) + 2 p,

C A I C = - 2 \times \hat{ℓ} (\cdot) + 2 p (1 + \frac{n + 2}{n - p - 2})

and

B I C = - 2 \times \hat{ℓ} (\cdot) + p log (n)

where p is the number of parameters in the model and n is the sample size. The results favor the PHLR model based on AIC, CAIC, and BIC values.
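The three criteria are straightforward to compute once the maximized log-likelihood is available. The sketch below implements the formulas exactly as written above; the function names are hypothetical, and the parameter count used in the usage note ($p = 5$, for $β_{0}, β_{1}, β_{2}, β_{3}$ and $α$) is an assumption made for illustration rather than a value taken from Table 1.

```python
import math


def aic(loglik, p):
    """AIC = -2 * loglik + 2 * p."""
    return -2.0 * loglik + 2.0 * p


def caic(loglik, p, n):
    """CAIC as defined in the text: -2 * loglik + 2 * p * (1 + (n + 2) / (n - p - 2))."""
    return -2.0 * loglik + 2.0 * p * (1.0 + (n + 2.0) / (n - p - 2.0))


def bic(loglik, p, n):
    """BIC = -2 * loglik + p * log(n)."""
    return -2.0 * loglik + p * math.log(n)
```

For instance, `aic(-142.9585, 5)` evaluates the AIC for the PHLR log-likelihood reported later in this section under the assumed parameter count; smaller values of each criterion indicate the preferred model.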

To compare the proportional hazard logistic regression (PHLR) model with the logistic regression model, we conduct the hypothesis test

H_{0} : α = 1 \quad vs \quad H_{1} : α \neq 1,

using the likelihood ratio statistic

Λ_{1} = \frac{L_{L} (θ)}{L_{PHL} (θ^{*})},

where $L_{L} (\cdot)$ and $L_{PHL} (\cdot)$ represent the likelihood functions of the logistic and PHL models, respectively. Upon numerical evaluation, we obtain

- 2 \log (Λ_{1}) = - 2 (- 151.2 + 142.955) = 16.49,

which exceeds the value of $χ_{1, 95 \%}^{2} = 3.84$. Hence, $H_{0}$ is rejected, and the PHL model exhibits a better fit than the logistic model.
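The arithmetic of this test can be reproduced with a few lines of standard-library Python. The function names are hypothetical; the p-value uses the identity $\Pr (χ_{1}^{2} > x) = \operatorname{erfc} (\sqrt{x / 2})$, which holds for one degree of freedom only.

```python
import math


def lr_statistic(loglik_null, loglik_alt):
    """-2 log(Lambda) = -2 * (loglik under H0 - loglik under H1)."""
    return -2.0 * (loglik_null - loglik_alt)


def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Log-likelihoods reported in the text: logistic (H0) and PHL (H1) models.
statistic = lr_statistic(-151.2, -142.955)
p_value = chi2_1df_pvalue(statistic)
reject_h0 = statistic > 3.84  # 95% critical value quoted in the text
```

The resulting statistic matches the value 16.49 reported above, and the associated p-value is far below 0.05.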

Carrying out the hypothesis test of the significance of the explanatory variables

H_{0} : β_{1} = β_{2} = β_{3} = 0 \quad vs \quad H_{1} : β_{j} \neq 0,

for at least one $j = 1, 2, 3$, we have

G_{3}^{2} = - 2 (- 262.4158 + 142.9585) = 238.9147 > χ_{0.05, 3}^{2} = 7.8147;

therefore, the null hypothesis is rejected. Similarly, for the hypothesis test

H_{0} : β_{2} = 0 \quad vs \quad H_{1} : β_{2} \neq 0,

it follows that

\begin{aligned} G_{1}^{2} &= - 2 (ℓ (β_{0}, β_{1}, β_{3}, α) - ℓ (β_{0}, β_{1}, β_{2}, β_{3}, α)) \\ &= - 2 (- 143.96 + 142.9585) = 2.003 < χ_{0.05, 1}^{2} = 3.84 . \end{aligned}

Therefore, the null hypothesis is not rejected, meaning the variable character of the school is not significant in the model. However, academic differences are observed in the classroom between students who come from official schools and those from private schools, with the latter demonstrating better preparation.

Note that in the proportional hazard logistic regression model, the case $α = 1$ corresponds to the logistic regression model. However, the hypothesis $H_{0} : α = 1$ (vs. $H_{1} : α \neq 1$), tested using the likelihood ratio statistic, is rejected. This means that the parameter $α$ is significantly different from one, and must be considered to explain the behavior of the data. Moreover, the AIC, CAIC, and BIC criteria favor the proportional hazard logistic model over the usual logistic regression model. All of the above allows us to conclude that the proportional hazard logistic regression model fits better.

So, the fitted model is given as follows:

P (Y = 1 ∣ x_{1}, x_{2}, x_{3}) = \frac{{(1 + e^{1.374 - 0.860 x_{1} + 0.355 x_{2} - 0.201 x_{3}})}^{11.93} - 1}{{(1 + e^{1.374 - 0.860 x_{1} + 0.355 x_{2} - 0.201 x_{3}})}^{11.93}}
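To illustrate how the fitted model would be used for prediction, the sketch below evaluates the estimated dropout probability for a given student profile. The coefficients are those of the fitted equation above; the function name and the example profile in the usage note are hypothetical.

```python
import math

# Estimates taken from the fitted PHLR equation above.
BETA = (1.374, -0.860, 0.355, -0.201)  # intercept, CGA, CS, NPE
ALPHA = 11.93


def phlr_dropout_probability(cga, official_school, periods):
    """Estimated P(Y = 1 | x1, x2, x3) = 1 - (1 + exp(eta))^(-alpha)."""
    eta = BETA[0] + BETA[1] * cga + BETA[2] * official_school + BETA[3] * periods
    # Stable log(1 + exp(eta))
    if eta > 0:
        a = eta + math.log1p(math.exp(-eta))
    else:
        a = math.log1p(math.exp(eta))
    return -math.expm1(-ALPHA * a)
```

A call such as `phlr_dropout_probability(3.5, 1, 4)` returns the estimated probability for a hypothetical student with a CGA of 3.5, from an official school, enrolled for four periods. Because the coefficients of CGA and NPE are negative, the estimated probability decreases as either variable increases.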

Now, the sample is divided into two subsamples. The first one, called the training sample, corresponds to 70% of the total sample, and the second one is the prediction sample (30% of the sample). From this partition, the following results for the fitted PHLR model are obtained.

According to the results in Table 2, the accuracy is 77.23%, the sensitivity rate is 67.05%, and the specificity rate is 100%.

On the other hand, Table 3 shows the skewness and kurtosis coefficients of the proportional hazard logistic model for different values of the $α$ parameter. The results indicate that the model can fit data with both negative and positive skewness, which is an advantage over traditional logistic models. Additionally, the PHLR model can fit data with varying degrees of kurtosis, both high and low.

Diagnostic analysis is a technique to detect possible influential observations and aberrant or extraneous data. In the case of the logistic model, this technique has certain similarities with the general diagnostic analysis of regression models. However, given that the response variable only takes the values 0 and 1, a somewhat unusual situation arises. Certain difficulties may arise if there is a large number of zeros (or ones) when one expects to find few zeros or ones, which can be a sign of a lack of fit in the model. In the case of the PHLR model, the diagnostic analysis could be carried out using the Pearson residuals,

{\tilde{r}}_{i} = \frac{y_{i} - {\hat{p}}_{i}}{\sqrt{{\hat{p}}_{i} (1 - {\hat{p}}_{i})}},

the square of which is the $i$-th component of the Pearson chi-square statistic, and the residual deviance

t_{D_{i}} = sign ({\tilde{r}}_{i}) \sqrt{2 [y_{i} \log (\frac{y_{i}}{{\hat{p}}_{i}}) + (1 - y_{i}) \log (\frac{1 - y_{i}}{1 - {\hat{p}}_{i}})]},

with the convention $0 \log 0 = 0$, which is an adapted version of Cook's distance for the case of the logistic regression model (see Christensen [25]). When $y_{i} = 0$, $t_{D_{i}} = sign ({\tilde{r}}_{i}) \sqrt{- 2 \log (1 - {\hat{p}}_{i})}$, while if $y_{i} = 1$, $t_{D_{i}} = sign ({\tilde{r}}_{i}) \sqrt{- 2 \log ({\hat{p}}_{i})}$.
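Both residuals are simple to compute from the observed responses and the fitted probabilities. The following sketch uses hypothetical function names and the two special cases given above, so that no logarithm of zero is ever evaluated; it assumes fitted probabilities strictly between 0 and 1.

```python
import math


def pearson_residual(y, p_hat):
    """(y - p_hat) / sqrt(p_hat * (1 - p_hat)) for a binary response y."""
    return (y - p_hat) / math.sqrt(p_hat * (1.0 - p_hat))


def deviance_residual(y, p_hat):
    """Signed deviance residual, using the special cases for y = 0 and y = 1."""
    if y == 1:
        # The Pearson residual is positive when y = 1, so the sign is +1.
        return math.sqrt(-2.0 * math.log(p_hat))
    # The Pearson residual is negative when y = 0, so the sign is -1.
    return -math.sqrt(-2.0 * math.log(1.0 - p_hat))
```

Observations whose residuals are large in absolute value (for instance, beyond the ±2 or ±2.5 bands discussed for Figure 2) would be the ones to inspect more closely.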

For the student dropout data, the residual deviance graph for the PHLR model is presented in Figure 2. Note that in this graph, there are no observations with high values of the residuals, which indicates that the model has a good fit. Likewise, the graph of the PHL distribution for the fitted probabilities is shown. Note that there are five values falling within the +2.5/−2.5 range in Figure 2b, and six values in Figure 2c, indicating that these observations are not extremely influential. Additionally, there are no observations outside the confidence bands in the envelope graphs (Figure 3b), suggesting that the PHLR model effectively handles observations that deviate slightly from the +2/−2 range.

The $r M T_{i}$ envelope graphs generated for the logistic and proportional hazard logistic models are presented in Figure 3a and Figure 3b, respectively. The proportional hazard logistic regression model presents a better fit than the logistic regression model.
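The idea behind an envelope plot can be illustrated with a simplified simulation. The sketch below is an assumption-laden illustration, not the procedure used for Figure 3: it uses deviance residuals instead of $r M T_{i}$, it keeps the fitted probabilities fixed rather than refitting the model for each simulated sample, and all names and data are hypothetical.

```python
import numpy as np


def deviance_residuals(y, p_hat):
    """Deviance residuals for a 0/1 response."""
    contrib = np.where(y == 1, -2.0 * np.log(p_hat), -2.0 * np.log1p(-p_hat))
    return np.sign(y - p_hat) * np.sqrt(contrib)


def simulated_envelope(p_hat, n_sim=99, level=0.95, seed=0):
    """Pointwise bands for the ordered absolute residuals.

    Simplification: responses are simulated from the fitted probabilities
    and the model is NOT refitted for each replicate.
    """
    rng = np.random.default_rng(seed)
    p_hat = np.asarray(p_hat, dtype=float)
    sims = np.empty((n_sim, p_hat.size))
    for b in range(n_sim):
        # Draw a Bernoulli response vector from the fitted probabilities.
        y_star = (rng.random(p_hat.size) < p_hat).astype(float)
        sims[b] = np.sort(np.abs(deviance_residuals(y_star, p_hat)))
    tail = (1.0 - level) / 2.0
    lower = np.quantile(sims, tail, axis=0)
    upper = np.quantile(sims, 1.0 - tail, axis=0)
    return lower, upper


# Hypothetical fitted probabilities and responses (not the dropout data).
rng = np.random.default_rng(1)
p_hat = rng.uniform(0.1, 0.9, size=50)
y = (rng.random(50) < p_hat).astype(float)

observed = np.sort(np.abs(deviance_residuals(y, p_hat)))
lower, upper = simulated_envelope(p_hat)
n_outside = int(np.sum((observed < lower) | (observed > upper)))
```

A model that describes the data well should leave few or no ordered residuals outside the band, which is the reading applied to Figure 3b.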

6. Conclusions

In this work, we have proposed the PHLR, a nonlinear regression model that captures complex relationships between independent variables and the response variable, particularly in the case of dichotomous data where the relationships cannot be adequately represented by a straight line. The flexibility of the PHLR model allows for a better fit to the data compared to linear models or even the logistic model.
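The extra flexibility comes from the shape parameter α. As a minimal Python sketch, assume the PHL response probability takes the usual proportional hazard form $1 - (1 - F_{L}(η))^{α}$, where $F_{L}$ is the logistic CDF and η is the linear predictor; this form and the function names are assumptions made for illustration and should be checked against the model definition in Section 4.

```python
import numpy as np


def logistic_cdf(eta):
    """Standard logistic CDF."""
    return 1.0 / (1.0 + np.exp(-eta))


def phl_cdf(eta, alpha):
    """Assumed PHL CDF: 1 - (1 - F_L(eta)) ** alpha, with alpha > 0."""
    return 1.0 - (1.0 - logistic_cdf(eta)) ** alpha


# Hypothetical grid of linear-predictor values.
eta = np.linspace(-4.0, 4.0, 9)

p_logistic = logistic_cdf(eta)
p_phl_alpha1 = phl_cdf(eta, 1.0)   # coincides with the logistic curve
p_phl_alpha3 = phl_cdf(eta, 3.0)   # rises faster than the logistic curve
```

Under this assumed form, setting α = 1 recovers the logistic response curve, while other values of α make the curve asymmetric.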

The information matrix of the PHLR model is non-singular, so the parameters are uniquely estimable, there is no linear dependency among them, and the variance–covariance matrix of the estimators can be properly calculated. This supports the stability of optimization and estimation algorithms and, under the usual regularity conditions, ensures that the maximum likelihood estimators have desirable asymptotic properties, such as asymptotic normality.

In terms of the information criteria AIC, CAIC, and BIC, the PHLR model shows a better fit than the logistic model for the analyzed student dropout data. The logistic model is a special case of the PHLR model. Additionally, the PHLR model achieves a good rate of correct classification in the studied data. An alternative for the diagnostic analysis of model errors has also been proposed, offering useful tools for educational problems or other contexts with dichotomous responses.
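For reference, the three criteria can be computed from a maximized log-likelihood, the number of parameters, and the sample size. The sketch below assumes the common definitions AIC = −2ℓ + 2k, BIC = −2ℓ + k log n, and CAIC = −2ℓ + k(log n + 1); the paper may define CAIC differently, and the numerical inputs are hypothetical rather than taken from the fitted models.

```python
import math


def information_criteria(loglik, k, n):
    """Return AIC, BIC and CAIC under the definitions assumed above."""
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * math.log(n)
    caic = -2.0 * loglik + k * (math.log(n) + 1.0)
    return {"AIC": aic, "BIC": bic, "CAIC": caic}


# Hypothetical inputs: log-likelihood, parameter count, sample size.
crit = information_criteria(loglik=-150.0, k=5, n=400)
```

Smaller values indicate a better trade-off between fit and complexity, so models are compared criterion by criterion on the same data.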

Author Contributions

Conceptualization, R.T.-F. and C.B.-C.; Methodology, G.M.-F. and R.T.-F.; Software, G.M.-F., R.T.-F. and C.B.-C.; Validation, G.M.-F., R.T.-F. and C.B.-C.; Formal analysis, G.M.-F., R.T.-F. and C.B.-C.; Investigation, G.M.-F., R.T.-F. and C.B.-C.; Resources, G.M.-F. and R.T.-F.; Data curation, G.M.-F., R.T.-F. and C.B.-C.; Writing—original draft, R.T.-F. and C.B.-C.; Writing—review & editing, G.M.-F., R.T.-F. and C.B.-C.; Visualization, G.M.-F., R.T.-F. and C.B.-C.; Supervision, G.M.-F., R.T.-F. and C.B.-C.; Project administration, G.M.-F. and R.T.-F.; Funding acquisition, G.M.-F. and R.T.-F. All authors have read and agreed to the published version of the manuscript.

Funding

The research of G. Martínez-Flórez and R. Tovar-Falón was supported by the project: Estudio de la deserción en los programas de pregrado de la Universidad de Córdoba usando diferentes metodologías estadísticas, FCB-06-22. Universidad de Córdoba, Colombia.

Data Availability Statement

Details about the available data are given in Section 5.

Acknowledgments

G. Martínez-Flórez and R. Tovar-Falón acknowledge the support given by Universidad de Córdoba, Montería, Colombia. C. Barrera-Causil extends their sincere gratitude to the Instituto Tecnológico Metropolitano (ITM).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Eugene, N.; Lee, C.; Famoye, F. Beta-normal Distribution and Its Applications. Commun. Stat.-Theory Methods 2002, 31, 497–512.
2. Silva, G.O.; Ortega, E.M.M.; Cordeiro, G.M. The Beta Modified Weibull Distribution. Lifetime Data Anal. 2010, 16, 409–430.
3. Cordeiro, G.M.; Silva, G.O.; Ortega, E.M.M. The Beta-Weibull Geometric Distribution. Statistics 2013, 47, 817–834.
4. Cordeiro, G.M.; de Castro, M. A New Family of Generalized Distributions. J. Stat. Comput. Simul. 2011, 81, 883–898.
5. Cordeiro, G.M.; Ortega, E.M.M.; Silva, G.O. The Kumaraswamy Modified Weibull Distribution: Theory and Applications. J. Stat. Comput. Simul. 2014, 84, 1387–1411.
6. Zografos, K.; Balakrishnan, N. On Families of Beta- and Generalized Gamma-generated Distributions and Associated Inference. Stat. Methodol. 2009, 6, 344–362.
7. Ristić, M.M.; Balakrishnan, N. The Gamma-exponentiated Exponential Distribution. J. Stat. Comput. Simul. 2012, 82, 1191–1206.
8. Castellares, F.; Santos, M.A.C.; Montenegro, L.C.; Cordeiro, G.M. A Gamma-Generated Logistic Distribution: Properties and Inference. Am. J. Math. Manag. Sci. 2015, 34, 14–39.
9. Cordeiro, G.M.; Lima, M.C.S.; Cysneiros, A.H.M.A.; Pascoa, M.A.R.; Pescim, R.R.; Ortega, E.M.M. An Extended Birnbaum–Saunders Distribution: Theory, Estimation, and Applications. Commun. Stat.-Theory Methods 2016, 45, 2268–2297.
10. Martínez-Flórez, G.; Bolfarine, H.; Gómez, H.W. Skew-normal alpha power model. Statistics 2014, 48, 1414–1428.
11. Martínez-Flórez, G.; Moreno-Arenas, G.; Vergara-Cardozo, S. Properties and inference for proportional hazard models. Rev. Colomb. Estadística 2013, 36, 95–114.
12. Azzalini, A. A class of distributions which includes the normal ones. Scand. J. Stat. 1985, 12, 171–178.
13. Martínez-Flórez, G.; Bolfarine, H.; Gómez, H.W. Likelihood-based inference for the power regression model. SORT-Stat. Oper. Res. Trans. 2015, 39, 187–208.
14. Birnbaum, Z.W.; Saunders, S.C. A New Family of Life Distributions. J. Appl. Probab. 1969, 6, 319–327.
15. Rieck, J.R.; Nedelman, J.R. A log-linear model for the Birnbaum-Saunders distribution. Technometrics 1991, 33, 51–60.
16. Agresti, A. Categorical Data Analysis; John Wiley & Sons Inc.: Hoboken, NJ, USA, 2002.
17. Henze, N. A probabilistic representation of the skew-normal distribution. Scand. J. Stat. 1986, 13, 271–275.
18. Pewsey, A. Problems of inference for Azzalini’s skew-normal distribution. J. Appl. Stat. 2000, 27, 859–870.
19. Chiogna, M. Notes on estimation problems with scalar skew–normal distributions. Stat. Methods Appl. 2005, 14, 331–341.
20. Gómez, H.W.; Venegas, O.; Bolfarine, H. Skew-symmetric distributions generated by the distribution function of the normal distribution. Environmetrics 2007, 18, 395–407.
21. Lehmann, E.L. The power of rank tests. Ann. Math. Stat. 1953, 24, 23–43.
22. R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2022; Available online: http://www.R-project.org (accessed on 6 March 2024).
23. Akaike, H. A new look at statistical model identification. IEEE Trans. Autom. Contr. 1974, 19, 716–722.
24. Hastie, T.J.; Tibshirani, R.J. Generalized Additive Models, 1st ed.; Chapman and Hall/CRC: New York, NY, USA, 1990.
25. Christensen, R. Log-Linear Models and Logistic Regression; Springer: New York, NY, USA, 1997.

Figure 1. (a) CDF for $α = 0.75$ (solid line), 1 (dotted line), 2 (dashed line), and 3 (dotted-dashed line). (b) Survival function for $α = 0.75$ (solid line), 1 (dotted line), 2 (dashed line), and 3 (dotted-dashed line).

Figure 2. (a) Fitted PHLR model. (b) Residual deviance for the fitted PHLR. (c) Residual deviance for the LR model.

Figure 3. Envelope plots for $r M T_{i}$: (a) LR model and (b) PHLR model.

Table 1. Parameter estimation of LR and PHLR models (standard errors of the estimates are given in parentheses).

Model	${\hat{β}}_{0}$	${\hat{β}}_{1}$	${\hat{β}}_{2}$	${\hat{β}}_{3}$	$\hat{α}$	AIC	CAIC	BIC
LR	7.290	−1.576	0.442	−0.270		310.4	330.5	326.5
se	(1.220)	(0.384)	(0.423)	(0.035)
PHLR	1.374	−0.860	0.355	−0.201	11.93	295.9	325.0	316.0
se	(1.206)	(0.340)	(0.293)	(0.025)	(5.301)

Table 2. Model predictive capacity.

Actual/Forecast	$\hat{y} = 1$	$\hat{y} = 0$	Total
$y = 1$	57	28	85
$y = 0$	0	38	38
Total	57	66	123
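The counts in Table 2 translate into the usual classification summaries. The short Python sketch below treats $y = 1$ as the positive class, which is an assumption about labeling; the variable names are illustrative, and only the four cell counts come from the table.

```python
# Cell counts from Table 2 (rows: actual, columns: forecast).
tp, fn = 57, 28   # actual y = 1: forecast 1, forecast 0
fp, tn = 0, 38    # actual y = 0: forecast 1, forecast 0

total = tp + fn + fp + tn

accuracy = (tp + tn) / total       # overall rate of correct classification
sensitivity = tp / (tp + fn)       # share of actual y = 1 forecast as 1
specificity = tn / (tn + fp)       # share of actual y = 0 forecast as 0
precision = tp / (tp + fp)         # share of forecast 1 that are actual 1

print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, "
      f"specificity={specificity:.3f}, precision={precision:.3f}")
# prints: accuracy=0.772, sensitivity=0.671, specificity=1.000, precision=1.000
```

About 77% of the 123 cases are classified correctly; every forecast of $\hat{y} = 1$ is correct, while 28 of the 85 cases with $y = 1$ are forecast as 0.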

Table 3. Skewness and kurtosis of the PHLR model for different $α$ values.

$α$	0.050	0.125	0.250	0.500	0.750	1.000	1.500
Skewness	0.355	0.032	0.160	0.135	0.058	0.000	−0.081
Kurtosis	1.673	2.159	2.716	2.974	2.988	3.000	3.031
$α$	2.500	5.000	10.000	20.000	30.000	50.000	100.000
Skewness	−0.179	−0.303	−0.410	−0.501	−0.546	−0.597	−0.655
Kurtosis	3.090	3.201	3.331	3.469	3.548	3.644	3.765

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
