1. Introduction
In this paper, we consider a regression analysis of multivariate interval-censored failure time data where the censoring may be informative. Interval-censored data arise when the failure time of interest is known or observed only to belong to an interval instead of being observed exactly (Finkelstein, 1986 [1]; Sun, 2006 [2]). It is apparent that one can treat right-censored data as a special case of interval-censored data. Multivariate interval-censored data occur when a failure time study involves more than one related failure time of interest for which only interval-censored data are available. Among others, one often faces multivariate interval-censored data in epidemiological studies and clinical trials. Informative censoring occurs if the censoring mechanism or the underlying process generating observations is related to the failure times of interest (Kalbfleisch and Prentice, 2002 [3]; Sun, 2006 [2]).
An example of informative censoring is given by a clinical trial or periodic follow-up study on a failure event, such as death, for which some symptoms may occur before the event itself. In this situation, a study subject may tend to pay more clinical visits when the symptoms occur rather than following the pre-specified schedule. Many authors have pointed out that, with informative censoring, an analysis that ignores it could lead to seriously biased estimators or misleading analysis results (Wang et al., 2010 [4]; Sun, 2006 [2]; Zhang et al., 2005 [5], 2007 [6]). For example, Sun (1999) [7] studied the issue for univariate current status data, a special case of interval-censored data where the observed interval includes either zero or infinity, and showed that the analysis could yield misleading results if the informative censoring is ignored or treated as non-informative censoring.
A large amount of literature has been established for the regression analysis of multivariate interval-censored failure time data or their special cases, multivariate current status data and bivariate interval-censored data (Chen et al., 2007 [8], 2009 [9], 2013 [10]; Goggins and Finkelstein, 2000 [11]; Shen, 2015 [12]; Tong et al., 2008 [13]; Wang et al., 2008 [14]; Zeng et al., 2017 [15]; Zhang et al., 2009 [16]; Zhou et al., 2017 [17]). For this, three types of methods are commonly used: the copula model approach, the marginal model-based approach and the frailty model-based method. The first employs various copula models to characterize the relationship between the correlated failure times of interest (Wang et al., 2008 [14]; Zhang et al., 2009 [16]); among others, Sun and Ding (2019) [18] discussed this for bivariate cases under the framework of the two-parameter Archimedean copula model. The second mainly focuses on the marginal distribution and places no assumption on the correlation between the failure times of interest (Wei et al., 1989 [19]); authors who developed such methods include Chen et al. (2007) [8], Chen et al. (2013) [10] and Tong et al. (2008) [13].
The frailty model-based approach generally employs the frailty or latent variable to model the correlation between the correlated failure times. It has the advantage of allowing one to directly estimate the correlation. One main shortcoming of most of the existing methods for multivariate interval-censored data is that they assume independent or non-informative interval censoring, and it is apparent that this may not hold in practice as discussed above. In this paper, we will adopt the frailty model-based approach to develop a new estimation procedure that allows for dependent or informative interval censoring.
Several authors have considered the regression analysis of univariate informatively interval-censored failure time data. For example, Zhang et al. (2005) [5], Wang et al. (2010) [4] and Wang et al. (2018) [20] investigated the problem for current status data, case II interval-censored data and case K interval-censored data, respectively. Case II means that each study subject is observed only twice, while case K refers to the situation where each subject is observed at a sequence of observation times, which is much more general (Sun, 2006 [2]). As mentioned above, most of the existing methods for multivariate interval-censored data apply only to the situation with independent interval censoring, except that of Yu et al. (2022) [21], who considered only case II interval-censored data under the additive hazards model. In this paper, the focus will be on case K multivariate interval-censored data with informative censoring, and the proposed methods apply to much more general situations than those of Yu et al. (2022) [21].
More specifically, in Section 2, some notation and assumptions will first be introduced, as well as the data structures. In the proposed method, we focus on the case where the failure time of interest marginally follows a general class of semiparametric transformation models. The proposed approximate maximum likelihood estimation procedure will be presented in Section 3, and for its implementation, a novel EM algorithm will be developed. The asymptotic properties of the resulting estimators of the regression parameters will be given in Section 4. Section 5 will present some simulation results obtained from a study performed to evaluate the performance of the proposed method, and they indicate that it performs as expected. In Section 6, we apply the proposed methodology to a set of real data arising from an AIDS clinical trial, and Section 7 contains some discussion and concluding remarks.
2. Assumptions and Background
In this section, we first introduce some notation and background and then describe the model and data structure. Suppose that there is a failure time study consisting of n independent subjects and concerning M failure events of interest that may be related. Define the failure time of interest and the p-dimensional vector of covariates, both related to the ith subject and event m. Furthermore, for each subject, suppose that there exists a sequence of potential observation times, whose number may differ across subjects, and a follow-up or stopping time. For simplicity, we assume that for each subject, the observation times for the different failure events are the same; the proposed method below can be easily generalized to more general situations.
For subject i, define the point process describing the observation process on the subject, which jumps only at the observation times. Note that for the situation considered here, we have M + 1 processes: the M underlying failure time processes of interest and the observation process, and as mentioned above, the focus below will be on the case where they may be related (Ma et al., 2015 [22]; Wang et al., 2016 [23]; Zhang et al., 2007 [6]). To describe their relationships and the possible covariate effects on them, we assume that there exists a vector of latent variables and another latent variable with mean zero such that, given the covariates and the latent variables, the cumulative hazard function of each failure time has the form of model (1). Here, the transformation function is known and strictly increasing, the baseline cumulative hazard function is unknown, the covariate vector in the model contains 1 and part of the covariates, and the vector of regression parameters is unknown.
For the observation process, it will be assumed to be a non-homogeneous Poisson process whose intensity function, given the covariates and the latent variable, has the form of model (2). Here, the baseline intensity function is unknown and continuous, and the model involves a vector of regression parameters as in model (1). In the following, it will be assumed that, given the covariates and the latent variables, the failure times are independent of one another and independent of the observation process. Moreover, the follow-up time is assumed to be independent of the other quantities. We point out that models (1) and (2) have been commonly used in the analysis of failure time data (Klein and Moeschberger, 2003 [24]) and the analysis of event history data (Cook and Lawless, 2007 [25]), respectively. The parameter linking the two models denotes the degree of correlation between the failure time process and the observation process, and the two processes are independent if this parameter equals zero.
The semiparametric transformation model (1) is quite general and includes many specific models. In particular, one can express the transformation function as a class of frailty-induced transformations, obtained by integrating out a frailty variable with nonnegative support whose density appears in the integrand. By setting this density to be the gamma density with mean 1 and variance r, one obtains the class of logarithmic transformations G_r(x) = log(1 + r x)/r with G_0(x) = x (Chen et al., 2002 [26]). In particular, it yields the proportional odds model with r = 1 and gives the proportional hazards model with r = 0.
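The logarithmic transformation family above can be sketched numerically; the function name and parameterization below are ours, under the assumption that the family takes the Chen et al. (2002) form G_r(x) = log(1 + rx)/r with G_0(x) = x:

```python
import math

def G(x, r):
    """Logarithmic transformation G_r(x) = log(1 + r*x) / r for r > 0,
    with the limiting case G_0(x) = x (proportional hazards)."""
    if r == 0.0:
        return x  # limit as r -> 0
    return math.log(1.0 + r * x) / r
```

Here r = 1 gives G(x) = log(1 + x), the proportional odds model, and small r approaches the identity, the proportional hazards model.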
To describe the observed data, define the indicators of whether each failure time belongs to the intervals formed by successive observation times. In the following, it will be assumed that the observed data consist of these indicators along with the observation times and covariates; that is, we observe case K interval-censored data (Sun, 2006 [2]).
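To make the observed-data structure concrete, the shortest interval bracketing a failure time can be found from a subject's ordered observation times; this helper is purely illustrative and not part of the paper's notation:

```python
import bisect

def bracket(t, obs_times):
    """Return (L, R], the shortest observed interval bracketing the
    failure time t, given the ordered observation times obs_times.
    R = inf encodes a right-censored observation; L = 0 a left-censored one."""
    j = bisect.bisect_left(obs_times, t)   # first index with obs_times[j] >= t
    L = obs_times[j - 1] if j > 0 else 0.0
    R = obs_times[j] if j < len(obs_times) else float("inf")
    return (L, R)
```

For example, a failure at time 1.5 with visits at 1, 2 and 3 yields the interval (1, 2].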
3. Maximum Likelihood Estimation
3.1. Estimation Procedure
Now, we discuss inference about models (1) and (2), and for this, we will propose a two-step, or approximate, maximum likelihood estimation procedure by following Huang and Wang (2004) [27] and Wang et al. (2016) [23]. More specifically, we will first consider the estimation of model (2) and then the estimation of the regression parameters of interest. The first step is based on the following two facts. One is that one can easily show that, given the covariates and the latent variable, the number of observation times follows a Poisson distribution whose mean is determined by model (2). The other is that the observation times can be seen as the order statistics of a set of i.i.d. random variables whose density function depends on neither the covariates nor the latent variable, which suggests that this density can be estimated nonparametrically. In the resulting estimator, the relevant quantities are the ordered and distinct values of the observation times, the number of observation times equal to each distinct value, and the number of observation times satisfying the at-risk condition among all subjects.
Under the assumptions above, it is easy to derive an identity for the conditional mean of the number of observation times. This yields a class of estimating equations for the regression parameters in model (2), with some chosen weights. Let the estimator given by the estimating equations above be obtained; it then suggests a natural plug-in estimator of the baseline mean function of the observation process.
Now, consider the estimation of the regression parameters in model (1). For this, note that if the latent variables were known, it would be natural to maximize the likelihood function based on the observed indicators, in which one factor denotes the density function of the latent variables, assumed to be known up to a vector of parameters. Define, for each failure time, the shortest observed time interval that brackets it; the likelihood function above can then be rewritten in terms of these intervals. By following Huang and Wang (2004) [27] and others, it is natural to estimate the unknown parameters and functions by the values that maximize the approximated likelihood function.
For the maximization of the approximated likelihood, note that it involves the unknown baseline cumulative hazard functions and integrations over the latent variables. To deal with the former, we propose to adopt a nonparametric approach. More specifically, for each failure event, consider the ordered sequence of all observed interval endpoints and assume that the baseline cumulative hazard function is a step function that jumps only at these points, with unknown jump sizes; the approximated likelihood can then be expressed in terms of the jump sizes. In the following, we will develop an EM algorithm for the maximization, with the focus on the situation where the latent vector follows a multivariate normal distribution with a covariance matrix depending on a q-dimensional unknown parameter. The algorithm is valid for other distributions, and some comments on this will be given below. It is worth pointing out that, as mentioned above, the idea discussed here has been used by Huang and Wang (2004) [27] and Wang et al. (2016) [23], among others. However, the problem discussed here is different from, and much more general than, those in the existing literature.
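The step-function representation of a baseline cumulative hazard described above can be sketched as follows; the class name and interface are ours, assuming only that the function jumps at ordered points with nonnegative jump sizes:

```python
import bisect

class StepCumHaz:
    """Nonparametric step-function representation of a baseline
    cumulative hazard: jump sizes jumps[k] at ordered points times[k]."""
    def __init__(self, times, jumps):
        self.times = list(times)   # ordered jump locations
        self.cum = []              # running cumulative sums of the jumps
        s = 0.0
        for j in jumps:
            s += j
            self.cum.append(s)
    def __call__(self, t):
        # value at t is the sum of all jumps at points <= t
        k = bisect.bisect_right(self.times, t)
        return self.cum[k - 1] if k > 0 else 0.0
```

In the EM algorithm below, the jump sizes play the role of the unknown parameters of this representation.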
3.2. EM Algorithm
In this subsection, we will develop an EM algorithm for the maximization of the approximated likelihood, and for this, we first discuss the data augmentation. Let the latent vectors denote a random sample of size n from their assumed density; the observed likelihood function can then be rewritten accordingly. Moreover, introduce latent Poisson random variables, with appropriate means given the latent vectors, defined so that the maximization of (4) is equivalent to maximizing the likelihood function based on the augmented data. Based on this, for the development of the EM algorithm, it is natural to use these latent quantities to augment the observed data. As a consequence, one can derive the resulting pseudo complete-data log-likelihood function.
Now, we consider the E-step of the EM algorithm. At each iteration, given the current parameter estimates, we need to determine the required conditional expectations under the multivariate normal distribution with the current covariance matrix. To calculate the conditional expectations of the augmented quantities given the observed data, we need to employ the joint density of the latent variables given the observed data, which is proportional to the product of the complete-data likelihood contributions; the conditional expectations of the latent Poisson variables given the latent vectors and the observed data have explicit forms. In the M-step, we can employ the Newton-Raphson method to update the regression parameters based on the corresponding score equation. For the jump sizes of the baseline cumulative hazard functions, we have closed-form expressions, and to estimate the covariance parameter, one can maximize the relevant part of the expected complete-data log-likelihood with the jump sizes held at their updated values.
Now, we summarize the EM algorithm described above as follows.
Step 1. Choose initial estimates of the parameters and the jump sizes.
Step 2. At each iteration, calculate the required conditional expectations by using, for example, the Gaussian quadrature method.
Step 3. Update the jump sizes by (6) with the current estimates, then update the regression parameters by maximizing the corresponding objective function, and estimate the covariance parameter by employing the one-step Newton-Raphson method.
Step 4. Repeat Steps 2-3 until convergence, that is, until the absolute difference of the log-likelihood values between two consecutive iterations is less than a given small positive value.
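Steps 1-4 above can be sketched as a generic EM loop; the callables e_step, m_step and loglik are placeholders for the problem-specific computations of Section 3.2, not functions defined in the paper:

```python
def run_em(e_step, m_step, loglik, theta0, tol=1e-6, max_iter=500):
    """Generic EM loop: alternate E- and M-steps until the change in
    the observed-data log-likelihood falls below tol (Step 4 above)."""
    theta = theta0
    ll_old = loglik(theta)
    for _ in range(max_iter):
        expectations = e_step(theta)   # E-step: conditional expectations
        theta = m_step(expectations)   # M-step: parameter update
        ll_new = loglik(theta)
        if abs(ll_new - ll_old) < tol: # convergence criterion of Step 4
            break
        ll_old = ll_new
    return theta
```

In the paper's setting, e_step would evaluate the conditional expectations by Gaussian quadrature and m_step would perform the updates of Step 3.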
4. Asymptotic Properties
Let the quantities defined above denote the proposed estimator and the corresponding true parameter values. In this section, we will establish the asymptotic properties of the proposed estimator, and for this, we first describe the regularity conditions needed.
Condition 1. The true value belongs to a known compact set , where denotes a compact set of , a compact set in , and a compact set of in the domain of γ such that is a positive-definite matrix with eigenvalues bounded away from zero and ∞. In addition, the true value is continuously differentiable with positive derivatives in .
Condition 2. The covariate vector and are bounded in .
Condition 3. For the transformation function , assume that it is twice continuously differentiable on with , and .
Condition 4. Assume that for any smooth function and . Here, denotes the jth derivative of with respect to γ.
Condition 5. If there exists a vector u and some constants such that the stated equality holds for each of these values, then u and the constants are all zero, where the zero denotes a vector of zeros of the appropriate dimension.
Condition 6. Assume that for the follow-up time and latent variable u, the stated moment conditions hold, where the longest study time is finite, the variance of the latent variable is bounded, and there exists a small positive constant such that the relevant bound holds almost surely. Moreover, for the follow-up time and u, the function involved is continuous.
Note that Conditions 1 and 2 are standard in survival analysis, and it is easy to check that Condition 3 on the transformation function holds for the logarithmic family and the Box-Cox family. Moreover, Condition 4 holds for modeling multivariate data with frailty models, and Condition 5 is required for the identifiability of the model. In addition, Condition 6 describes the relationship between the latent variable u and the parameters of interest. Most of the conditions above are purely technical and hold in general, in particular for periodic follow-up studies.
In the following, the Euclidean norm is used, together with the usual empirical-process notation for a function f and a random variable X with distribution P. The following two theorems give the asymptotic properties of the proposed estimators.
Theorem 1. Suppose that Conditions 1-6 hold. Then, as the sample size tends to infinity, the proposed estimators are consistent almost surely.
Theorem 2. Suppose that Conditions 1-6 hold. Then, as the sample size tends to infinity, the proposed estimator of the regression parameters is asymptotically normal, with the covariance matrix given in Appendix A.
We will sketch the proofs of the results described above in Appendix A. For inference about the regression parameters, it is apparent that one needs to estimate the covariance matrix, and one can see from Appendix A that it would be difficult to derive a consistent plug-in estimator. Thus, we propose to employ the profile likelihood approach to estimate the covariance matrix (Murphy and van der Vaart, 2000 [28]). Specifically, consider the set of all step functions with nonnegative jumps at the observed time points and define the profile log-likelihood by maximizing over this set. Then, one can estimate the covariance matrix by the negative inverse of the matrix whose (j, k)th element is the second-order difference of the profile log-likelihood, computed with perturbations along the jth and kth canonical vectors and a step size of order n^{-1/2}. Note that to calculate the profile log-likelihood at each perturbed parameter value, one can reuse the proposed EM algorithm with the regression parameters held fixed, in which case the only step is to explicitly evaluate the quantities needed to update the jump sizes. The iteration generally converges quickly when the proposed estimator is used as the initial value.
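Assuming a routine pl for the profile log-likelihood is available (e.g., by running the EM algorithm with the regression parameters held fixed), the second-difference curvature estimator above can be sketched as follows; the function name and arguments are illustrative:

```python
import numpy as np

def profile_info(pl, theta_hat, n, c=1.0):
    """Estimate minus the Hessian of the profile log-likelihood pl at
    theta_hat by second-order finite differences with step h = c/sqrt(n).
    The covariance estimate is the inverse of the returned matrix."""
    p = len(theta_hat)
    h = c / np.sqrt(n)
    e = np.eye(p)
    info = np.empty((p, p))
    pl0 = pl(theta_hat)
    for j in range(p):
        for k in range(p):
            # (j,k)th second-order difference of the profile log-likelihood
            info[j, k] = -(pl(theta_hat + h * e[j] + h * e[k])
                           - pl(theta_hat + h * e[j])
                           - pl(theta_hat + h * e[k]) + pl0) / h**2
    return info
```

For a quadratic log-likelihood the finite differences are exact, which gives a simple sanity check of the implementation.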
5. A Simulation Study
In this section, we present some of the simulation results obtained from a study performed to evaluate the finite-sample performance of the proposed method, with the focus on the estimation of the regression parameters. In the study, we considered the situation with two correlated failure times of interest and two covariates. For the covariates, it was assumed that the first covariate follows the Bernoulli distribution with success probability 0.5 and the second covariate follows a uniform distribution. To generate the true failure times, we first set the relevant variance parameter to one and generated the latent variables from normal distributions, one with mean 0 and variance 1. Then, given the covariates and the latent variables, the failure times were generated under model (1) for the two choices of the transformation function.
For the generation of the observation process and the observed data, we first assumed that the follow-up times follow the uniform distribution over the interval [2, 3] and generated the numbers of observation times from the Poisson distribution given the follow-up times and latent variables. Given the numbers of observation times, we took the observation times to be the order statistics of random samples from the uniform distribution over the follow-up period. In the following, we considered two sets of true values for the regression parameters, corresponding to the two model specifications. The results given below are based on sample sizes of up to 400 with 1000 replications.
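Under the setup just described, one subject's observation scheme might be generated as follows; the mean parameter mu stands in for the intensity expression in the text, whose exact form is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)

def gen_observation_times(mu, low=2.0, high=3.0):
    """Generate one subject's observation scheme as in the simulation:
    follow-up time C ~ Uniform[low, high], number of observations
    K ~ Poisson(mu) (taken >= 1), and observation times equal to the
    order statistics of K draws from Uniform(0, C)."""
    C = rng.uniform(low, high)
    K = max(1, rng.poisson(mu))
    U = np.sort(rng.uniform(0.0, C, size=K))
    return C, U
```

Each replication of the study would repeat this generation for every subject, with mu depending on that subject's covariates and latent variable.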
Table 1 gives the results on the estimation of the regression parameters given by the proposed estimation procedure under the setup described above. Here, we calculated the estimated bias (Bias), given by the average of the estimates minus the true value, the sample standard error (SSE) of the estimates, the average of the estimated standard errors (ESE) and the empirical coverage probability (CP). The results suggest that the proposed estimator of the regression parameters is essentially unbiased and that the variance estimation based on the profile likelihood approach is reasonable. Furthermore, the results on the empirical coverage probabilities indicate that the normal approximation to the distribution of the proposed estimator appears to be appropriate. In addition, the results improved in general with increasing sample size, as expected.
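The four summary metrics just defined can be computed from the replication results as follows; this is a sketch, and the function and argument names are ours:

```python
import numpy as np

def summarize(estimates, ses, true_value, level=1.96):
    """Summary metrics of the simulation tables: Bias, SSE, ESE and the
    empirical coverage probability (CP) of the 95% normal CI."""
    est = np.asarray(estimates)
    se = np.asarray(ses)
    bias = est.mean() - true_value                       # Bias
    sse = est.std(ddof=1)                                # sample SD of estimates
    ese = se.mean()                                      # average estimated SE
    cp = np.mean(np.abs(est - true_value) <= level * se) # coverage
    return {"Bias": bias, "SSE": sse, "ESE": ese, "CP": cp}
```

Agreement between SSE and ESE, together with CP near the nominal level, is what the tables are read for.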
As mentioned before, the proposed estimation procedure can be applied to any distribution for the latent variables. To see this, we repeated the study above, except that we generated the latent variables from a uniform distribution; Table 2 presents the obtained results on the estimation of the regression parameters. One can see that they are similar to those given in Table 1 and again suggest that the proposed approach works well for the situations considered. To assess the performance of the proposed approach with different types of covariates, we also repeated the study giving Table 1, except that both covariates were assumed to follow the standard normal distribution, and present the obtained results in Table 3. They indicate that the proposed estimation procedure seems to be robust to different types of covariates.
Note that the proposed estimation procedure assumes that the observation process is a non-homogeneous Poisson process, and one may be interested in the performance of the proposed method when this assumption does not hold. To see this, we repeated the study giving Table 1, except that the observation processes were assumed to be mixed Poisson processes, with the frailties in the intensity functions generated from a gamma distribution. Table 4 presents the results on the estimation of the regression parameters given by the proposed approach, and they indicate that the approach seems to be robust to such observation processes.
For the initial values in the EM algorithm, we set the parameters and jump sizes to simple starting values. It is worth pointing out that we did try other initial values, and the proposed EM algorithm seems to be robust with respect to their selection; in other words, we did not encounter any non-convergence issues in the simulation study. We also considered some other setups, including multivariate cases and the case with more than one covariate, and obtained similar results.
6. An Application
In this section, the estimation procedure proposed in the previous sections is applied to the set of bivariate interval-censored data arising from an AIDS clinical trial, AIDS Clinical Trials Group 181, described in Goggins and Finkelstein (2000) [11]. The study concerns the opportunistic infection cytomegalovirus (CMV) and examined the study patients periodically. At each clinical visit or observation, among other information, blood and urine samples were collected and tested to detect the presence of the CMV virus in the sample, commonly referred to as the shedding of the virus. In addition, for each patient, the CD4 count, which indicates the status of a person's immune system and is commonly used to measure the stage of HIV infection, was recorded at the entry time. For the analysis here, we are mainly interested in whether the baseline CD4 count, an indicator of the initial stage of HIV disease, is related to the CMV shedding risk in either blood or urine.
The data set consists of 204 subjects, who belong to two groups based on their baseline CD4 counts: either less than 75 or otherwise. More specifically, the two groups have 111 and 93 patients, respectively. On the observation of the CMV shedding times, some patients gave left-censored observations and some gave right-censored observations; the others provided interval-censored observations, given by the last negative and first positive test dates. That is, we have bivariate interval-censored data on the CMV shedding times in the blood and urine. The percentage of right-censored observations for the CMV shedding times is higher in the blood than in the urine, which indicates that the CMV shedding risk in the urine may be higher than that in the blood. For the application of the proposed estimation procedure, let the two failure times denote the CMV shedding times in the blood and urine associated with the ith patient, respectively, and define the covariate to be 1 if the ith subject's baseline CD4 count was less than 75 and 0 otherwise. As in the simulation study, we used the same model specifications for all patients.
Table 5 presents the estimation results given by the proposed approach for different combinations of the transformation parameters, including the estimated covariate effects, the estimated standard errors (SE) and the p-values for testing no covariate effect (P). In addition, we calculated the Akaike Information Criterion (AIC; Akaike, 1973 [29]) and the Bayesian Information Criterion (BIC; Schwarz, 1978 [30]) for the selection of the optimal model. One can see from the table that the AIC and BIC values are quite close for all combinations of the transformation parameters, and the same is true for the estimated effects. Choosing the combination that corresponds to the proportional hazards models for both failure times, the estimated covariate effects and their estimated standard errors suggest that the patients with lower CD4 counts at baseline experienced CMV shedding in both blood and urine significantly earlier. To provide a graphical view of the difference between the CMV shedding in the blood and urine, Figure 1 presents the estimates of the baseline marginal survival functions given by the proposed method for the CMV shedding times in the blood and urine, respectively. They suggest that, as discussed above, the CMV shedding in the urine occurred much earlier than in the blood.
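For reference, the AIC and BIC used for the model selection above reduce to simple formulas in the maximized log-likelihood, the number of free parameters k and the sample size n:

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2 * loglik."""
    return 2.0 * k - 2.0 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion: k * log(n) - 2 * loglik."""
    return k * math.log(n) - 2.0 * loglik
```

Lower values of either criterion indicate a preferred model; BIC penalizes extra parameters more heavily for the sample sizes considered here.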
In addition, the proposed method yielded estimates of the correlation parameters with estimated standard errors indicating that the observation process was significantly correlated with the CMV shedding times in both blood and urine; that is, we had dependent or informative censoring. To investigate the effects of informative censoring on the estimated covariate effects, we also fitted the model under independent interval censoring and obtained the corresponding estimates and standard errors, along with the p-values for testing no covariate effects. Although these results are similar to those given above, it is apparent that the independent-censoring analysis underestimated the effects of the baseline CD4 count on the risks of CMV shedding.
7. Discussion and Concluding Remarks
In the preceding sections, the regression analysis of case K multivariate interval-censored failure-time data was discussed under a general class of semiparametric transformation models in the presence of informative censoring. For the problem, an approximate maximum likelihood estimation procedure was proposed and the resulting estimators of the regression parameters were shown to be consistent and asymptotically normal. In the method, the frailty approach was employed to characterize the informative censoring as well as the relationship among the correlated failure times of interest. To implement the proposed approach, a novel EM algorithm was developed and the numerical studies indicated that the proposed method works well in practical situations. In addition, it was applied to a set of real bivariate interval-censored data arising from an AIDS clinical trial.
The proposed approach can be seen as a generalization of the method given by Zeng et al. (2017) [15] that allows for informative interval censoring, which can occur quite often, as discussed above and in the literature. In particular, it has been shown that in the presence of informative censoring, an analysis that ignores it could lead to biased or misleading results and conclusions. The proposed method has the advantages that it does not impose a particular assumption on the distribution of the latent variables and that it is quite flexible and can be easily implemented. Moreover, the type of data considered here includes most types of failure time data discussed in the literature as special cases, and model (1) gives many commonly used models.
As discussed above, although model (1) is quite flexible, it may not be straightforward to choose an optimal model for a given set of data, and one commonly used procedure for this is to apply the AIC or BIC. As an alternative, one may prefer to develop a model-checking or data-driven technique. However, this may be difficult, and such a method does not seem to exist even for simple types of multivariate interval-censored data. It is worth noting that instead of the proposed approximate maximum likelihood estimation method, one may consider a full maximum likelihood estimation procedure. For this, however, one would need to specify or postulate some distributions for the latent variables, which may be hard to verify, and the implementation would also be much more complicated.