1. Introduction
Suppose that G is a cumulative distribution function (cdf) defined on the real line. Several papers have proposed composing a unit distribution with G (a parent cdf) to produce a new cdf. Eugene et al. (2002) [1] combined the cdf of the beta distribution with G to create the Beta-G model with cdf
$$F(x) = I_{G(x)}(a, b), \quad a, b > 0,$$
where $I_{y}(a, b)$ is the regularized incomplete beta function. Alexander et al. (2012) [2] and Nadarajah et al. (2014b) [3] generalized the Beta-G to the generalized-Beta-G and the modified-Beta-G. Cordeiro and Castro (2011) [4] developed the Kumaraswamy-G model by combining the Kumaraswamy cdf $1 - (1 - x^{a})^{b}$, $0 < x < 1$, with the parent cdf G.
Based on a valid cdf,
for
, for any continuous distribution, we can construct a unit distribution as a truncated version of
with a cdf (monotonically increasing with
and
) given by
The truncated-
G (TG) model is constructed by composing this truncated version of the cdf (or its associated survival function
) with a parent cdf
G (or its associated survival function
) to give the parent distribution additional modeling ability and produce a new family of univariate distributions with cdfs (monotonically increasing with
and
) given by
A list of TG models is given in
Table 1.
In this paper, we generate a new family of continuous distributions using a truncated version of the Lindley distribution.
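As an illustration of the construction, the following minimal Python sketch composes a truncated Lindley cdf with a Weibull parent. It assumes that the unit distribution is obtained by restricting the Lindley cdf to (0, 1) and rescaling by its value at 1, and that the parent cdf is plugged directly into it; the exact parameterization of the model developed in Section 3 may differ. The function names and the scale/shape parameterization of the Weibull parent are hypothetical choices made for this example.

```python
import math


def lindley_cdf(x, theta):
    """Lindley cdf on x > 0: 1 - (1 + theta*x/(1+theta)) * exp(-theta*x)."""
    return 1.0 - (1.0 + theta * x / (1.0 + theta)) * math.exp(-theta * x)


def truncated_lindley_cdf(u, theta):
    """Assumed unit version: Lindley cdf restricted to (0, 1), i.e. F(u)/F(1)."""
    if u <= 0.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return lindley_cdf(u, theta) / lindley_cdf(1.0, theta)


def weibull_cdf(x, scale, shape):
    """Parent Weibull cdf (assumed scale/shape parameterization)."""
    if x <= 0.0:
        return 0.0
    return 1.0 - math.exp(-((x / scale) ** shape))


def tl_weibull_cdf(x, theta, scale, shape):
    """Composition of the unit cdf with the parent cdf G(x)."""
    return truncated_lindley_cdf(weibull_cdf(x, scale, shape), theta)
```

Because the inner function maps the real line into (0, 1) and the outer function is a cdf on (0, 1), the composition is again a cdf, which is the property the truncated-G construction relies on.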
The new family is motivated by failure time analysis. Although many distributions are already available for this purpose, they do not always capture the behavior of the data being analyzed: each distribution carries its own assumptions and properties, and no single distribution fits all scenarios. A new family is therefore useful in situations where none of the existing options provides a good fit, and it may also offer advantages over existing ones in terms of interpretability, flexibility, or computational efficiency.
In summary, the new family widens the set of candidate models for failure time analysis, allowing researchers to choose the model most appropriate for their data and research objectives.
On the other hand, in several research areas (medical, engineering, biology, agronomy, etc.), the failure times are affected by explanatory variables. In this paper, we propose a regression model with censored observations, based on the truncated Lindley–Weibull distribution, which is a feasible alternative for modeling failure time data. Also, different simulation studies are presented to study the behavior of maximum likelihood estimation (MLE), as well as the residual analysis of the proposed regression model. The paper is structured as follows:
Section 2 describes the unit truncated Lindley distribution which is the main component of the proposed new model. We discuss its properties, including moments, mode, quantile function (qf), mean deviations, and generating function.
Section 3 discusses the proposed TLG model (linear representation, properties, shapes of the TLG, stochastic representation, truncated Lindley–Weibull (TLW) submodel and estimation of the parameters using the maximum likelihood method). In
Section 4, we propose a regression model based on the TLW distribution and estimate its parameters using maximum likelihood. Also, we perform some simulation studies for the TLW regression model under different sample sizes and censoring proportions. The TLW regression model application is illustrated by examining four real datasets in
Section 5. Finally,
Section 6 summarizes the results and presents the conclusions.
4. The TLW Regression Model with Censored Data and Two Systematic Components
Statistical analysis of lifetimes is an important topic in areas such as medicine, biology, epidemiology, and engineering. Failure time refers to the time until the occurrence of an event of interest, such as death, the appearance of a tumor, the development of a disease, or the breakdown of an electronic component.
We relate the parameters and k to
covariates by the logarithm link function
respectively, where
and
denote the vectors of regression coefficients and
.
The survival function of
is given by
where
Equation (
19) is referred to as the TLW parametric regression model. This regression model opens new possibilities for fitting many different types of data.
Consider a sample
of
n independent observations, where each random response is defined by
, where
are the censoring times and
are the observed lifetimes. We assume non-informative censoring such that the observed lifetimes and censoring times are independent. Let
F and
C be the sets of individuals for which
is the lifetime or censoring, respectively. The total log-likelihood function for
reduces to
where
r is the number of uncensored observations (failures) and
. By maximizing the log-likelihood (
20), the MLE of the vector of unknown parameters can be calculated. We use the R software to determine
.
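The structure of a censored log-likelihood with two log-linked systematic components can be sketched as follows. Since the TLW density and survival function are not reproduced here, the sketch uses the parent Weibull distribution as a stand-in. The function names, the single covariate, and the assignment of the two linear predictors to the Weibull scale and shape are assumptions made for illustration. Failures contribute the log-density and censored observations contribute the log-survival function.

```python
import math


def weibull_logpdf(t, scale, shape):
    """Log-density of the stand-in Weibull model at t > 0."""
    z = t / scale
    return math.log(shape / scale) + (shape - 1.0) * math.log(z) - z ** shape


def weibull_logsf(t, scale, shape):
    """Log-survival function of the stand-in Weibull model."""
    return -((t / scale) ** shape)


def censored_loglik(times, deltas, xs, beta_scale, beta_shape):
    """Total log-likelihood under non-informative right censoring.

    deltas[i] = 1 marks an observed lifetime and 0 a censored time.
    beta_scale and beta_shape are (intercept, slope) pairs; the log link
    keeps both parameters positive.
    """
    total = 0.0
    for t, d, x in zip(times, deltas, xs):
        scale = math.exp(beta_scale[0] + beta_scale[1] * x)
        shape = math.exp(beta_shape[0] + beta_shape[1] * x)
        if d == 1:
            total += weibull_logpdf(t, scale, shape)   # failure term
        else:
            total += weibull_logsf(t, scale, shape)    # censored term
    return total
```

In practice this function would be passed (with its sign reversed) to a numerical optimizer; replacing the two Weibull helpers by the TLW log-density and log-survival function would give the likelihood of the proposed model.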
4.1. Residual Analysis
For the TLW regression model with censored observations, we present two types of residuals to evaluate deviations from the error assumptions and detect outliers. The deviance residuals have been used more frequently in the literature because they take into account the information of censored times. The TLW regression model can also use these residuals. A reliable method for detecting atypical observations and confirming that the fitted model is adequate is to plot the deviance residual against the observed times. It is possible to express the deviance residual as
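$$r_{D_i} = \operatorname{sign}(r_{M_i}) \left\{ -2 \left[ r_{M_i} + \delta_i \log(\delta_i - r_{M_i}) \right] \right\}^{1/2},$$
where $r_{M_i} = \delta_i + \log \hat{S}(t_i)$ is the martingale residual, $\delta_i = 1$ means that the observation is uncensored, $\delta_i = 0$ means that the observation is censored and $\hat{S}(t_i)$ is the survival function (19) evaluated at the MLEs.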
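A minimal sketch of how these residuals can be computed is given below. It assumes the standard definitions for right-censored data, namely a martingale residual equal to the censoring indicator plus the logarithm of the fitted survival probability, and a deviance residual obtained from it by the usual signed square-root transformation. The function names are hypothetical, and the fitted survival probabilities are taken as inputs, so any fitted model can supply them.

```python
import math


def martingale_residual(delta, surv):
    """delta + log(S), with delta = 1 (failure) or 0 (censored), 0 < S < 1."""
    return delta + math.log(surv)


def deviance_residual(delta, surv):
    """Signed square-root transformation of the martingale residual."""
    r_m = martingale_residual(delta, surv)
    # The logarithmic term only enters for uncensored observations.
    log_term = math.log(delta - r_m) if delta == 1 else 0.0
    inner = r_m + delta * log_term
    # inner is non-positive in exact arithmetic; guard against round-off.
    magnitude = math.sqrt(max(0.0, -2.0 * inner))
    return math.copysign(1.0, r_m) * magnitude
```

For a censored observation the deviance residual is always non-positive, since its martingale residual equals the log-survival probability. Plotting these values against the observed times gives the diagnostic plot described above.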
4.2. Simulation Study
To verify the accuracy of the MLEs of the TLW regression model, we carried out a simulation study for different censoring percentages and sample sizes
, 300, and 500. For each sample size, we carried out
N = 1000 replicates and considered the approximate censoring percentages: 0%, 10% and 30%. A covariate
binomial
is included from the following systematic components:
The inverse transformation method is used to obtain the lifetimes from the TLW distribution, and the censoring times are determined from a uniform distribution , where controls the censoring percentages. The true values used for generation are , , , , and .
The results are assessed for
from MABs, MSEs, and AEs given in (
18), where here
. The simulation process is given by:
(i) Generate binomial ;
(ii) Calculate and ;
(iii) Generate TLW ;
(iv) Generate ;
(v) Calculate the survival times ;
(vi) If , then ; otherwise, , for , where is the censoring indicator.
(vii) Calculate AEs, biases, and MSEs.
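Steps (i) to (vi) above can be sketched in Python as follows. The quantile function of the TLW distribution is not reproduced in this section, so the sketch draws lifetimes from the parent Weibull distribution by the inverse transformation method as a stand-in. The Bernoulli success probability of 0.5, the uniform censoring distribution on (0, tau), the coefficient values, and all function names are assumptions made for illustration and are not the settings of the study.

```python
import math
import random


def weibull_quantile(u, scale, shape):
    """Inverse of the Weibull cdf, used for inverse-transform sampling."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def simulate_censored_sample(n, beta_scale, beta_shape, tau, rng):
    """Return a list of (time, delta, x) with uniform random censoring."""
    sample = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0                    # (i) covariate
        scale = math.exp(beta_scale[0] + beta_scale[1] * x)   # (ii) log links
        shape = math.exp(beta_shape[0] + beta_shape[1] * x)
        y = weibull_quantile(rng.random(), scale, shape)      # (iii) lifetime
        c = rng.uniform(0.0, tau)                             # (iv) censoring time
        t = min(y, c)                                         # (v) observed time
        delta = 1 if y <= c else 0                            # (vi) indicator
        sample.append((t, delta, x))
    return sample


rng = random.Random(2023)
data = simulate_censored_sample(300, (0.5, 0.3), (0.2, -0.1), 5.0, rng)
censored_fraction = 1.0 - sum(d for _, d, _ in data) / len(data)
```

The upper limit tau of the censoring distribution plays the role of the tuning constant that controls the censoring percentage: smaller values censor more observations. Step (vii) would repeat this generation N times, fit the model to each replicate, and summarize the estimates.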
Table 6 displays these values. It is verified that for all scenarios the averages of the estimates approach the true values of the parameters and the MABs and MSEs decrease as the sample size increases. These results illustrate that the estimates are consistent, even at higher censoring percentages.
5. Data Analysis
To demonstrate the superiority of the new distribution over some other models, we use two real datasets originating from different fields. We compare the fits of the TLW model to those of the parent Weibull model (W), the Kumaraswamy–Weibull model (KW) from Cordeiro and Castro (2011) [
4], the Weibull–Weibull model (WW) from Alzaatreh et al. [
18], the Geometric–Poisson–Weibull model (GPW) from Nadarajah et al. (2013) [
19], the Poisson–Weibull model (PW) from Ristic and Nadarajah (2013) [
5], the beta-Weibull model (BW) from Eugene et al. (2002) [
1], the Marshall–Olkin–Weibull model (MOW) from Marshall and Olkin (1997) [
20] and the exponentiated generalized Weibull model (EGW) from Cordeiro et al. (2013) [
21]. The cdfs of these models are provided in
Appendix B. The parameter estimates are computed by maximizing (
17) using the BFGS method available in
the AdequacyModel package in the R software [
22].
The considered models are compared according to a collection of statistics (AIC, CAIC, BIC, HQIC, and the negative maximized log-likelihood), which assess the relative fit of these models to a dataset.
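For reference, the following sketch shows how such statistics are commonly computed from a maximized log-likelihood, the number of estimated parameters k, and the sample size n. The definitions of these criteria are not given in this section, so the formulas are assumptions: in particular, CAIC is taken here as the small-sample corrected AIC, which is one common convention, whereas other authors use the acronym for the consistent AIC. The function name is hypothetical.

```python
import math


def information_criteria(loglik, k, n):
    """Common information criteria; smaller values indicate a better fit.

    loglik: maximized log-likelihood, k: number of parameters, n: sample size.
    Requires n > k + 1 and n > e so that all terms are well defined.
    """
    minus_two_ll = -2.0 * loglik
    return {
        "AIC": minus_two_ll + 2.0 * k,
        # Assumed convention: small-sample corrected AIC.
        "CAIC": minus_two_ll + 2.0 * k * n / (n - k - 1),
        "BIC": minus_two_ll + k * math.log(n),
        "HQIC": minus_two_ll + 2.0 * k * math.log(math.log(n)),
    }


# Illustrative values only, not results from the paper.
example = information_criteria(loglik=-100.0, k=3, n=64)
```

All four criteria add a penalty that grows with k to the same goodness-of-fit term, so they can rank models differently only when the models have different numbers of parameters.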
We also performed an application of the TLW regression model considering censored data. We compared different systematic components for the proposed new regression model and the Weibull regression model. In this part we use the RS algorithm in the
gamlss package in the R software to maximize the log-likelihood function (
20) and we use the AIC and global deviance (GD) statistics to select the most suitable models.
The first dataset, reported by Barakat et al. (2014) [23], consists of the average July temperatures (°C) for Neuenburg, Switzerland, from 1864 to 1993. The observations are as follows.
19.0 | 20.1 | 18.4 | 17.4 | 19.7 | 21.0 | 21.4 | 19.2 | 19.9 | 20.4 | 20.9 | 17.2 | 20.2 |
17.8 | 18.1 | 15.6 | 19.4 | 21.7 | 16.2 | 16.4 | 19.0 | 20.6 | 19.0 | 20.7 | 15.8 | 17.7 |
16.8 | 17.1 | 18.1 | 18.4 | 18.7 | 18.7 | 18.4 | 19.2 | 18.0 | 18.7 | 20.7 | 19.4 | 19.2 |
17.4 | 22.0 | 21.4 | 19.3 | 16.8 | 18.2 | 16.2 | 15.9 | 22.1 | 17.5 | 15.3 | 16.5 | 17.4 |
17.0 | 18.3 | 18.3 | 15.3 | 18.2 | 21.5 | 17.0 | 21.6 | 18.2 | 18.1 | 17.6 | 18.2 | 22.6 |
19.9 | 17.1 | 17.2 | 17.3 | 19.4 | 20.1 | 20.1 | 17.0 | 19.4 | 17.5 | 16.8 | 17.0 | 19.9 |
18.2 | 19.2 | 18.5 | 20.8 | 19.5 | 21.1 | 15.8 | 21.3 | 21.2 | 18.8 | 22.3 | 18.6 | 16.8 |
18.2 | 17.2 | 18.4 | 18.7 | 21.1 | 16.3 | 17.4 | 18.0 | 19.5 | 21.2 | 16.8 | 17.4 | 20.7 |
18.4 | 19.8 | 18.7 | 20.5 | 18.3 | 18.2 | 18.2 | 19.2 | 20.2 | 18.2 | 17.4 | 19.2 | 16.3 |
17.4 | 20.3 | 23.4 | 19.2 | 20.2 | 19.3 | 19.0 | 18.8 | 20.3 | 19.7 | 20.7 | 19.6 | 18.1 |
The MLEs and 95% CIs for the model parameters are shown in
Table 7.
Table 8 reports the adequacy statistics of the considered models.
According to Table 8, the TLW model attains the lowest AIC, CAIC, BIC, HQIC, and minus log-likelihood among the fitted models. Therefore, it may be a viable option for modeling these data.
Figure 3 compares the empirical and fitted distributions of the data, displaying the histogram and fitted pdf, the fitted and empirical cdfs, the P–P plot, and the Q–Q plot, respectively, to graphically explain the appropriateness of the TLW for modeling these data.
The second dataset consists of the breaking stress of 64 single carbon fibers of gauge length 10 mm, reported by Cheng and Traylor (1970) [24]. The observations are as follows.
1.901 | 2.132 | 2.203 | 2.228 | 2.257 | 2.35 | 2.361 | 2.396 | 2.397 | 2.4450 | 2.454 |
2.454 | 2.474 | 2.518 | 2.522 | 2.525 | 2.532 | 2.575 | 2.614 | 2.616 | 2.618 | 2.624 |
2.659 | 2.675 | 2.738 | 2.74 | 2.856 | 2.917 | 2.928 | 2.937 | 2.937 | 2.977 | 2.996 |
3.03 | 3.125 | 3.139 | 3.145 | 3.22 | 3.223 | 3.235 | 3.243 | 3.264 | 3.272 | 3.294 |
3.332 | 3.346 | 3.377 | 3.408 | 3.435 | 3.493 | 3.501 | 3.537 | 3.554 | 3.562 | 3.628 |
3.852 | 3.871 | 3.886 | 3.971 | 4.024 | 4.027 | 4.225 | 4.395 | 5.02 | | |
Table 9 displays the MLEs and 95% CIs for the model parameters. According to
Table 10, the TLW model attains the lowest AIC, CAIC, BIC, HQIC, and minus log-likelihood among the fitted models. Therefore, it may be a viable option for modeling these data.
Figure 4 compares the empirical and fitted distributions of the data, displaying the histogram and fitted pdf, the fitted and empirical cdfs, the P–P plot and the Q–Q plot to graphically demonstrate the appropriateness of the TLW for modeling these data.
In this application, we consider the regression model for censored data. The dataset refers to patients hospitalized with COVID-19, a disease caused by a new coronavirus named severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). The epidemiological data were compiled by the Health Information System of the Brazilian government and are available at
https://opendatasus.saude.gov.br/dataset/srag-2020 (accessed on 1 May 2023).
This study involved 195 patients hospitalized in the city of Campinas, state of São Paulo, in May 2020, with infection confirmed by RT-PCR and classified as SARS caused by COVID-19. The survival time was the time in days from the date of first symptoms to the date of evolution of the case, either death (failure) or end of observation (censoring). The censoring percentage was 56.92%, and the following variables were considered:
: observed time (in days);
: censoring indicator ( censored, observed lifetime);
: sex ( male, female);
: age (in years).
There were 110 male patients (56.41%), of whom 42 (38.18%) died, while among the 85 female patients (43.59%) there were 42 deaths (49.41%).
Figure 5a presents the Kaplan–Meier survival curve broken down by sex. It can be seen that men had a higher risk of death.
Figure 5b depicts the histogram of the ages, where the greatest frequency was in the category from 50 to 75 years old.
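For readers unfamiliar with the Kaplan–Meier estimator used in these survival curves, a minimal pure-Python sketch of the product-limit computation is given below. The function name and the toy data are hypothetical; the sketch is not the code used to produce the figures. At each distinct failure time the current survival estimate is multiplied by one minus the ratio of deaths to individuals at risk, and censored observations only reduce the risk set.

```python
def kaplan_meier(times, events):
    """Product-limit estimate as a list of (failure time, survival).

    events[i] = 1 marks a failure and 0 a censored observation.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations tied at the same time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


# Toy data for illustration only.
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Applying the function separately to the observations of each group gives stratified curves of the kind shown for the two sexes.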
We compared the TLW regression model with the Weibull regression model based on the following systematic components:
Table 11 reports the values of the selection criteria of the models, in which the
-TLW model was superior to the others. We also compared this model with the
-Weibull model by means of the residuals in
Figure 6. In turn,
Figure 6a,c illustrate the residuals versus the index of the observations, showing that both models have residuals with random behavior around zero, and no point is outside the interval
. Nevertheless,
Figure 6b,d indicate that the TLW model behaved better, with all the points within the simulated envelope, denoting its superiority. Finally, we illustrate the Kaplan–Meier curves and estimated survival curves in
Figure 7 for the TLW model, showing that this model is able to capture the non-proportional curves of this dataset. The results of this model are shown in
Table 12. Some conclusions can be obtained as follows.
Interpretations for :
A significant difference exists between men and women in relation to survival time (men have shorter survival). Various other studies have also indicated significant differences between the sexes (see [
25,
26]);
The survival time declines with advancing age. This result corroborates the findings of several studies that have indicated that older age is a predictor of higher mortality caused by COVID-19 (see [
27,
28,
29]).
Interpretations for k:
A significant difference exists between men and women with regard to the variability in the survival time;
In relation to age, the variability in survival time increased with older age of the patients.
In this application, we consider the regression model for uncensored data. The data refer to bananas of the species Musa acuminata from a plantation in the Philippines. A total of 194 banana tiers were chosen randomly, and the numerical values of the RGB colors (red, green, and blue) were obtained from hardware-captured images of four banana classes (extra class, class I, class II, and reject), which contain 65, 49, 30, and 50 samples, respectively. The dataset is available in the repository:
https://data.mendeley.com/datasets/zk3tkxndjw/2 (accessed on 20 May 2023) and more details can be seen in [
30]. Each banana tier sample was captured with a white background in six different views: front, back, left, right, top, and bottom views. Here, we consider the values of B in front view.
Figure 8 displays a boxplot by class, in which differences between the colors according to the class can be observed.
The variables considered are :
We verified the relationship between colors and classes from the TLW and Weibull models according to the following systematic components:
Table 13 displays the AIC and GD values for these fitted models, in which it can be seen that the
-TLW model obtained the lowest values and can therefore be chosen as the best model. In addition, we compare the
-TLW and the
-Weibull models by means of the quantile residuals (
Figure 9). These plots agree with the results of
Table 13: there is a high percentage of points outside the confidence band of the Weibull model (
Figure 9e) and there are also many deviations from the confidence band in the worm plot (
Figure 9f).
Finally,
Table 14 presents MLEs, SEs, and
p-values of the model
-TLW, in which classes I, II, and extra are compared with the reject class. The following conclusions can be drawn: there is a significant difference between the color of class I and that of the reject class. The effect is positive, that is, class I presented higher color values. Class II and the extra class do not differ significantly from the reject class. The colors of the extra class and class I affect the shape of the distribution relative to the reject class.