1. Introduction
For survival data, the outcome variable is usually the time until the occurrence of an event of interest. A common characteristic of this type of data is the presence of censoring, that is, when the event of interest is not observed for some subjects before the study is finished. Furthermore, this variable depends on one or more explanatory variables (covariables), which have characteristics of the sample under study. Cox’s proportional hazards and accelerated failure time (AFT) models are two common tools in time-to-event modeling. The first class of models has the strong assumption of proportional risks, which is often invalid, so the effects of the covariables on the risk function are examined which can lead to difficult interpretations. The second class assumes that an association exists between the predictors and the survival time, permitting a direct interpretation of the effects of the covariables on lifetimes.
Nevertheless, these methods can fail to capture the heterogeneity of the effects of the covariables. In this respect, the quantile regression (QR) (Koenker and Bassett [
1]) can be an alternative to these models, enabling evaluation of the heterogeneous effects of the predictors via analysis of different quantiles. This method involves modeling the quantiles of the survival time and links them to the covariables, providing some advantages, such as:
Possible identification and inference under the heterogeneous effects of the covariables for different quantiles, thus supplying more complete information about the covariables and more flexibly controlling for the heterogeneity caused by them;
Flexibility regarding the assumption of proportional risks;
Provision of a direct interpretation of the results, that is, between the survival time and the covariables of interest;
Possible analysis of different quantiles, allowing identification of the different effects of the covariables on individuals with different risks; and
Robustness with respect to outliers in the regression models.
Originally, the QR methods are based on minimizing weighted absolute residuals [
1] without any probability distribution, and the estimation of the parameters occurs by means of linear programming algorithms.
Although this approach is very flexible, some challenges such as: (i) the quantile crossing, that is, when two or more estimated quantile curves cross or overlap, causing difficulty in interpretability; and (ii) the drawback of the inability to apply parametric inference tools led to the search for other methods. Regarding the quantile crossing problem, we can verify alternative methods, such as: semiparametric models [
2], the support vector (SV) regression approach [
3], and a joint quantile estimation approach [
4,
5]. The weighted absolute residuals estimators coincide with the maximum likelihood estimators (MLEs), when the response follows a skewed Laplace distribution, so the initial association of a continuous distribution to the QR models was based on it (Koenker and Machado [
6]).
In the context of censored data, an extensive bibliography can be mentioned, for example: Peng and Huang [
7] developed a QR approach for survival data subject to conditionally independent censoring, Wang and Wang [
8] proposed a locally weighted censored QR approach following the redistribution-of-mass idea and employed a local reweighing scheme. Zarean et al. [
9] used the censored QR for determining overall survival and risk factors in esophageal cancer. Yang [
10] presented a new approach for censored QR estimation, and Du et al. [
11] developed estimation procedures for partially linear QR models, where some of the responses were censored by another random variable. Further, Xue et al. [
12] addressed these limitations by using both simulated examples and data from National Wilms Tumor clinical trials to illustrate proper interpretation of the censored QR model and the differences and advantages of the model compared to the Cox proportional hazard model. Hong et al. [
13] provided a practical guide for using QR for right-censored outcome data with covariates of low or high dimensionality, and De Backer et al. [
14] studied a novel approach for the estimation of quantiles when facing potential right-censoring of the responses. Recently, De Backer et al. [
15] investigated a new procedure for estimating a linear QR with possibly right-censored responses; Qiu et al. [
16] considered the QR model for survival data with missing censoring indicators; Yazdani et al. [
17] introduced the QR approach for modelling failure time and investigated the covariate effects for different quantiles; Peng [
18] provided a comprehensive review of statistical methods for performing QR with different types of survival data; Hsu et al. [
19] studied regression models for interval censored data using quantile coefficient functions via a set of parametric basis functions; He et al. [
20] provided a unified analysis of the smoothed sequential estimator and its penalized counterpart for increasing dimensions in censored QR; and Wei [
21] introduced a discussion about QR for censored data in haematopoietic cell transplant research. Note that all these articles cited in the QR with censored data did not use parametric models or use the skewed Laplace distribution (see [
17]), whose estimators coincide.
Subsequently, other distributions were proposed by re-parameterizing them in terms of the quantile function (qf). Recent papers involving models for non-censored data based on other distributions can be mentioned: log-extended exponential-geometric [
22]; Birnbaum–Saunders [
23,
24]; discrete generalized half-normal [
25]; transmuted unit-Rayleigh [
26]; unit-Burr-XII [
27]; unit-Chen [
28]; log-symmetric [
29]; arcsecant hyperbolic Weibull [
30]; and Dagum and Singh–Maddala [
31] distributions. However, there is a relative lack in the literature of models for censored data in the parametric context: generalized Gompertz [
32] and skew-t [
33].
It is well known that the hazard rate function can assume different forms, which has led to the proposal of a large number of new distributions with the purpose of obtaining greater flexibility of data modeling, for example, Ref. [
34]. In this sense, we introduce a QR regression model based on a reparameterized, exponentiated, odd log-logistic Weibull (EOLLW) distribution. It has two extra shape parameters, thus enabling the modeling of different forms of hazard rate functions, as well as data with positive or negative symmetric or asymmetric bimodal shapes, making it an alternative to the mixture models commonly used in the presence of bimodality. Another important feature of the new QR model is that it has as special cases: the exponentiated Weibull and odd log-logistic Weibull QR models. A detailed discussion of the theoretical foundations is given in analysis of survival data with concrete applications. The maximum likelihood method is adopted, and several simulations evaluate the behavior of these estimators under some scenarios. Additionally, we show that the model can establish functional relations of the covariables with other parameters, including scale and kurtosis, besides the quantile parameter.
The paper is structured as follows.
Section 2 introduces a reparametrization of the EOLLW distribution based on quantiles.
Section 3 addresses some mathematical properties. The proposed QR regression model, and some classic inference methods to estimate the parameters are addressed in
Section 4. Some simulations are reported in
Section 5.
Section 6 provides a real application for the new regression model.
Section 7 ends with a brief conclusion.
2. The Reparameterized EOLLW Distribution
Let
be a parent cumulative distribution function (cdf) and
be its associated probability density function (pdf), both functions of a parameter vector
. The cdf of the exponentiated odd log-logistic (EOLL-G) family is given by (Alizadeh et al. [
35]) (for
)
where
and
are two extra shape parameters.
The pdf corresponding to Equation (
1) has the form
Henceforth, let
be a random variable with density function (
2).
The EOLL-G family reduces to the OLL-G class when
(Gleaton and Lynch [
36]), and to the exponentiated (Exp-G) family (Mudholkar et al., [
37]) when
. Clearly, it becomes the parent
when
.
The EOLLW distribution is defined from (
2) by taking the parent Weibull
respectively, where
,
is a scale parameter, and
is a shape parameter.
The cdf of the random variable
follows from Equations (
1) and (
3)
Based on Equations (
2) and (
3), the pdf of
X becomes
The hazard rate function corresponding to (
5) is
.
By inverting (
1), the qf of
X reduces to
where
(
) is the qf of the Weibull distribution, namely
Thus, we rewrite the
th quantile (
6) as
We can easily obtain the quartiles: first quartile (Q(0.25)), median (Q(0.5)), and third quartile (Q(0.75)).
We define a reparametrization of the pdf (
5) as a function of the
th quantile (
6), where the scale
becomes
is the location, and
th is the quantile of
X (assumed known).
By substituting (
9) into Equation (
4), the reparameterized cdf of
X reduces to
where
.
By simple differentiation, the reparameterized pdf of
X has the form
Henceforth, we redefine
as a random variable with pdf (
11), where
is fixed.
Figure 1 displays plots of the pdf of
X for some
values, thus showing its asymmetry and bimodality.
The qf of
X is obtained by replacing (
9) in Equation (
8)
4. The EOLLW QR Model for Censored Data
A new regression model is defined from the reparametrized EOLLW density (
11), and two systematic components for the parameters
and
(for
)
where
and
are unknown parameter vectors, and
is the explanatory variable vector. Thus, the heteroscedasticity is modeled via
.
The EOLLW QR model is defined by Equations (
11) and (
18), where
and
are unknown constants, and it has as special models:
Rhe exponentiated Weibull (EW) QR model for ;
the odd log-logistic Weibull (OLLW) QR model for ;
and the Weibull QR model for .
Consider a sample
of independent observations, where each random response is defined by
,
(censoring indicator), where
denotes the indicator function. We consider non-informative censoring and the observed lifetimes and censoring times are independent given
. Let
F and
C be the sets of individuals for which
is the lifetime or censoring time, respectively. Conventional likelihood estimation techniques can be applied here. The log-likelihood function for the vector
from model (
18) has the form
where
,
,
is the density (
11),
is the survival function, and
is the cdf (
10) of
. The total log-likelihood function for
can be expressed as
where
and
is the number of uncensored observations (failures).
The
gamlss package in
R [
42] is used to find the maximum likelihood estimate
of
. This package comes from the general class of generalized additive models for location, scale and shape (GAMLSS) (Rigby and Stasinopoulos [
43]). These models allow all parameters of a distribution to be modeled as a function of covariates, such as non-parametric, parametric and/or additive smooth functions. Furthermore, they do not have the restriction that the response distribution belongs to a given family such as the exponential family. The package basically has two algorithms:
CG (Cole and Green [
44]) and
RS (Rigby and Stasinopoulos [
43]), whose acronyms come from the names of the authors. These algorithms are stable and do not require precise initial values to guarantee convergence. For this reason, we work with the
RS algorithm with initial values for
and
obtained from the fitted Weibull QR model (
). Compared to the
CG algorithm,
RS is faster for larger datasets and does not use the expected value of cross derivatives, which can be useful when these values are equal to zero. For more details of the algorithms, see [
43].
The codes for the reparametric EOLLW distribution in the GAMLSS framework are available at
https://github.com/gabrielamrodrigues/EOLLW_quantiles (accessed on 10 February 2023). Following this approach, different regression models can be constructed by incorporating non-parametric smoothing functions, random effects, or other additive terms to the predictors.
Under conditions that are fulfilled for the parameter vector in the interior of the parameter space but not on the boundary, the asymptotic distribution of is multivariate normal , where is the information matrix. The asymptotic covariance matrix of can be approximated by the inverse of the observed information matrix . The approximate multivariate normal distribution for can be used in the classical way to construct approximate confidence regions for some parameters in .
We can use the likelihood ratio (LR) statistic for comparing some sub-models with the EOLLW QR model. We consider the partition , where is the subset of parameters of interest and is the subset of remaining parameters. The LR statistic for testing the null hypothesis versus the alternative hypothesis is given by , where and are the estimates under the null and alternative hypotheses, respectively. The statistic w is asymptotically (as ) distributed as , where k is the dimension of the subset of parameters of interest.
The standard maximum likelihood techniques can be adopted for the proposed regression, such as the quantile residuals (
) (Dunn and Smyth [
45]), namely
where
and
is the inverse cumulative standard normal distribution.
5. Simulation Study
A simulation study is carried out to verify the accuracy of the MLEs in the EOLLW QR model for the quartiles
,
and
, and approximate censoring percentages 0%, 10% and 50%. Just one covariate
Binomial
is included in the systematic components:
For each combination, replicas of sizes , 300 and 500 are generated. The true values used are: , , , , and .
The inverse transformation method is used to generate the lifetimes
from the EOLLW
distribution, and the censoring times
are determined from a uniform distribution
, where
k controls the censoring percentages. For each scenario, the Average Estimates (AEs), Biases and Mean Square Errors (MSEs) of the MLEs are calculated from:
where
. The software
is used and Algorithm 1 presents the simulation steps.
Algorithm 1: Simulation study |
|
Table 1,
Table 2 and
Table 3 report the findings. For all scenarios, the AEs converge to the true parameter values, and the biases and MSEs decrease when
n increases. These facts indicate that the consistency of the estimators hold. In addition, this behavior is verified even for high censoring percentages. We also found the empirical coverage probabilities (CPs) corresponding to the 95% confidence intervals calculated from the simulations.
Table 4 reports CPs values which approach to the nominal level.
6. Application to Gastric Cancer Data
Gastric cancer is the 5th most common cancer worldwide. There are more than one million new cases of this cancer every year, and it ranked as the 2nd leading cause of mortality from cancer in the world. We consider a survival dataset of patients suffering from gastric adenocarcinoma treated by surgery at Helsinki University Hospital in Finland [
46] (available at
https://doi.org/10.5061/dryad.hb62394, accessed on 29 November 2022 [
47]), which contains 301 individuals with approximate censoring of 60%. Here we consider two covariables. The first corresponds to the classification of Lauren (
Figure 2a). Various pathological classifications of the disease exist, but that of Lauren is the most common. Originally developed in the 1960s, the classification system adopted cell structural components to separate the patients in three types: well differentiated (non-cardia/intestinal), poorly differentiated (cardia/diffuse), and mixed disease [
48]. Based on histology, the two leading types of gastric cancer are diffuse and intestinal [
49]. These two types are reflected in the dataset. The second covariable corresponds to the presence of distant metastasis (M1 disease) (
Figure 2b). Many patients diagnosed with gastric cancer present distant metastasis, implying a very poor prognosis, generally indicating prophylactic rather than curative treatment ([
50,
51]). The objective here is to verify the effects of the covariables in different quantiles, so as to obtain a more complete view of this dataset.
Table 5 gives a descriptive summary, which includes the mean times, median times and times for the first and third quartiles. We can observe differences for the Lauren classification covariate: between the quantiles, the average time, and the Lauren 1 and Lauren 2 levels. However, we note subtle differences for the presence of distant metastases covariate. Then, the variables considered are (
):
survival time (in years);
: censoring indicator (0 = censored, 1 = observed);
: Lauren classification (1 = intestinal, 2 = diffuse), defined by a dummy variable (0 = intestinal, 1 = diffuse);
: Presence of distant metastases (pm) (1 = yes, 0 = no)
Regression Model
We compare the EOLLW QR model with the nested OLLW, Exp-W and Weibull models under three systematic components:
We consider the following quantiles:
,
,
,
and
.
Table 6 reports the Akaike information criterion (AIC) values for the fitted QR regression models. The EOLLW QR model under structure
gives the lowest values for these quantiles.
Table 7 gives three likelihood ratio (LR) statistics (
p-values in parentheses), thus indicating that the EOLLW QR model under structure
is better than the others. Thus, we can consider this model as the predictive model.
Figure 3 displays the MLEs and the corresponding confidence intervals along with the interval
, and
Table 8 gives the MLEs and their standard errors (SEs) for the quantiles
,
,
,
and
at the significance level of 5%. The following facts can be mentioned:
The effect of the Lauren classification 2 in comparison with 1 is decreasing along the quantiles and its confidence interval shows significant estimates for all quantiles. These results corroborate with those point quantiles reported in
Table 8.
The effect of the presence of distant metastasis is rising along the quantiles. Its confidence interval includes zero in the interval , thus indicating that the covariable is not significant for these quantiles. These results can be noted by the non-significant p-values for and .
For the parameters and , the estimates are significant for both quantiles, thus indicating that those covariables influence the variability of the survival times.
The estimates corresponding to the shape parameters and are also significant for all quantiles.
Residual Analysis
Figure 4,
Figure 5,
Figure 6,
Figure 7 and
Figure 8 provide the normal probability plots of the
’s in Equation (
20) under structure
for some quantiles. They reveal that the EOLLW QR model is the best among the fitted models. Further, they approximately follow a standard normal distribution, thus indicating adequate fits.
Figure 9 shows the index plot of the
’s for the EOLLW QR model under structure
. There are few points outside the interval
for both quantiles, and a random pattern around zero which show that these models are very adequate to the current data.