1. Introduction
In some practical scenarios, we often need to select useful models from a candidate model set. A popular approach to address this issue is model selection. Methods such as the Akaike Information Criterion (AIC) [
1], Mallows’ Cp [
2] and Bayesian Information Criterion (BIC) [
3] are designed to identify the best model. However, in cases where a single model does not receive strong support from the data, these model-selection methods may overlook valuable information from other candidate models, leading to issues of model-selection uncertainty and bias [
4].
To tackle these challenges and enhance prediction accuracy, several model-averaging techniques have been developed to leverage all information from the candidate models. Taking inspiration from AIC and BIC, Buckland et al. [
5] proposed smoothed AIC (SAIC) and smoothed BIC (SBIC) methods based on AIC and BIC, respectively. Hansen [
6] introduced the Mallows model-averaging (MMA) estimator, obtaining weights through the minimization of Mallow’s Cp criterion. The MMA estimator asymptotically attains the minimum squared error among the model-averaging estimators in its class. Subsequently, Wan et al. [
7] relaxed the constraints of Hansen [
6], allowing for non-nested candidate models and continuous weights. In practical applications, many datasets exhibit heteroscedasticity. Therefore, it is essential to explore model-averaging methods tailored for heteroscedastic settings. Firstly, Hensen and Racine [
8] proposed Jackknife model averaging (JMA), which determines weights by minimizing a cross-validation criterion. JMA significantly reduces Mean Squared Error (MSE) compared to MMA when errors are heteroscedastic. Secondly, Liu and Okui [
9] modified the MMA method proposed by Hensen [
6] to make it suitable for heteroscedastic scenarios. Furthermore, Zhao et al. [
10] extended [
6]’s work by estimating the covariance matrix based on the weighted average of squared residuals corresponding to all candidate models. This approach improves the model average estimator under heteroskedasticity settings.
In survival analysis, the accelerated failure time (AFT) model provides a straightforward description of how covariates directly impact survival time and has consequently garnered widespread attention. There are several parameter-estimation methods for the Accelerated Failure Time (AFT) model, including Miller’s estimator [
11], Buckley–James estimator [
12], Koul–Susarla–Van Ryzin (KSV) estimator [
13] and WLS estimator [
14]. However, all these methods assume that the censoring indicator is observable. Therefore, Wang and Dinse [
15] improved the KSV estimator to make it adaptable to situations where the censoring indicator is missing.
Under practical conditions, it is common to encounter situations where only the observed time is available and it is uncertain whether the event of interest has occurred. In such cases, data suffer from missingness in the censoring indicator. For example, in a clinical trial for lung cancer, a patient may die for unknown reasons and while the survival time is observed, it is uncertain whether the patient died specifically due to lung cancer. This situation leads to missingness in the censoring indicator. Previous studies have mainly addressed the issue of missingness in the censoring indicator under a specific model. Research on model averaging for right-censored data typically assumes that the censoring indicator is observable. Therefore, this paper adopts the inverse probability weighting method proposed by [
15] to construct the response variable. Through appropriate weight-selection criteria, weights are chosen to build the model-averaged estimator for the accelerated failure time model. It significantly enhances the predictive performance of the model and mitigates the bias introduced by the selection of a single model. Compared to previous research, this paper makes two main contributions: First, it introduces a novel model-averaging method for the case of missingness in the censoring indicator. Second, the paper allows for heteroscedasticity and employs model-averaging techniques to estimate variance.
The remaining sections of this paper are organized as follows. In
Section 2, we commence by introducing the notation and progressively delineate the methodology and associated theoretical properties of the proposed model-averaging approach. In
Section 3, we report the Monte Carlo simulation results. In
Section 4, we assess the predictive performance of the proposed model-averaging method against other approaches using the real-world Acute Myeloid Leukemia (AML) dataset. In
Section 5, we provide a comprehensive summary of the entire paper and suggest future research directions in this area. All theorem proofs will be presented in
Appendix A.
2. Methodology and Theoretical Property
We denote
,
, where
T represents the survival time and
V denotes the censored time.
denotes the covariate matrix for
n independent observations, where
. The accelerated failure time model can be expressed as follows:
where
is the random error with
and
.
We assume that there are
M candidate models in the candidate model set. Where the
mth candidate model contains
covariates. Following [
7], these candidate model forms are non-nested. The
mth candidate model is
for
. The matrix form of (
2) is
where
is an
dimensional full column-rank matrix,
,
,
.
In the case of right censored data, the response variable
might be censored, making it unobservable. We only observe
, where
and the censoring indicator
. Define a missingness indicator
which is 1 if
is observed and is 0 otherwise. When the censoring indicators are missing, the observed data are
. For simplicity, we set
. In this paper, similar to [
15], we assume the missing mechanism for
to be:
This assumption is more stringent than the missing at random (MAR) condition yet less restrictive than the assumption of missing completely at random (MCAR).
Koul et al. [
13] introduced a method that involves synthetic data for constructing linear regression models. Wang and Dinse [
15] extended [
13]’s method to address the situation where censoring indicators are missing. In our work, we follow the approach proposed by [
15] to construct a response in the form of inverse probability weighting, specifically:
where
,
.
represents the cumulative distribution function of
C. It is easy to observe that under the missing data mechanism in this paper:
Similar to Equation (
2), we have:
where
,
. This is expressed in matrix form as:
where
,
. And then the weighted least squares estimator of
:
where
.
Let
; subsequently, the estimation for the mth candidate model
is given by:
where
. Denote weight vector
, belonging to the set
. The model-averaging estimator of
is defined as follows:
for any
, where
.
Define the square loss function
, where
denotes the Euclidean norm. Then the risk function is defined as:
where
. The derivation of (
10) is as follows:
Regarding the choice of weights, a natural approach is to minimize the risk function to obtain the optimal weights. However, as shown in Equation (
11), we recognize that the risk function includes the unknowns
, which makes it infeasible to directly minimize the risk function to obtain the optimal weights. Therefore, we replace
with
and seek an unbiased estimator of the risk function as the criterion for weight selection.
Define the criterion for weight selection as
It is not difficult to observe that
. By disregarding a term that is independent of
,
serves as an unbiased estimator of the risk function.
In practice,
,
and
are usually unknown; therefore, we need to estimate them. Firstly regarding the estimation of
, it is usually estimated by the Logit model. Suppose
is estimated by the parametric model
, where
. By the maximum likelihood estimation method, we can obtain the parameter estimate
for the parameter
.
usually can be estimated nonparametrically by
where
is a kernel function and
is a bandwidth sequence. Next, we define
,
estimated nonparametrically by
where
is a kernel function and
is a bandwidth sequence. We adopt the following estimator of
:
where
denotes the rank of
.
Next, replacing
,
and
with
,
and
, we have:
And the corresponding weight selection criterion is as follows:
where
. The weights for minimizing
are given by:
Then, we enumerate the necessary regularity conditions for the asymptotic optimality.
- (C1)
Let and , where is the cumulative distribution function of . Assume that .
- (C2)
There exists a positive constant k such that .
- (C3)
Denote
and
is an
unit vector in which the mth element is 1 and the others are 0. For some integer
and some positive constant
k such that
, assume
- (C4)
There exists such that , .
- (C5)
and are bounded.
- (C6)
and .
- (C7)
Let , denote the ith diagonal element of . There exists a constant c such that .
Condition (C1) is utilized in [
16] and it ensures that
is not equal to 0. Condition (C2) mandates that the conditional expectation of
remains within bounded limits, in line with assumptions seen in prior research, including [
7,
17]. Condition (C3) is a requirement commonly found in model-averaging literature (e.g., [
7,
18]). Condition (C4) mandates the non-degeneracy of the covariance matrix
as
. Similar assumptions can also be found in [
9,
10]. Similar to [
15], Conditions (C5) and (C6) impose constraints on the bounds of
,
and bandwidth, respectively. Condition (C7) is frequently employed in the analysis of the asymptotic optimality of cross-validation methods, as seen in prior works like [
8].
Theorem 1. Under Conditions (C1) to (C6), Theorem 1 establishes the asymptotic optimality of the model-averaging procedure employing weights , as its squared loss converges to that of the infeasible best possible model average estimator.
In most cases,
is unknown and needs to be estimated. We estimate
using residuals derived from the model-averaging process:
. Specifically, the estimator of
is
where
.
In the existing literature on model averaging, most estimates of variance are predominantly derived from the largest candidate model, as exemplified by works such as [
6,
16]. In contrast, our approach, following [
10], leverages information from all candidate models for estimation rather than relying on a single model. Such an estimation method is more robust. Replacing
by
in (
13),
becomes
The weights that minimize
are as follows:
This weight selection criterion
is a cubic function of
w.
Theorem 2. Under Conditions (C1) to (C7), 3. Simulation
In the simulation study, we generate data from the accelerated failure time (AFT) model,
, where
; the observations of
are generated from a multivariate normal distribution with zero mean and covariance matrix
with
. The errors
follow normal distribution
. By varying the value of
, we allow
to range from 0.1 to 0.9. This variance specification closely resembles that of [
8]. However, we introduce a small constant, 0.01, to ensure that the variances remain strictly positive. The censoring time
is generated from
. By varying the value of
, we achieve censoring rates (CRs) of approximately
,
. We set sample sizes n = 150, 300. Here, our model configuration is set in a nested form, meaning the first m models include the first m regressors. The number of candidate models M was set to be
, where
denote the smallest integer greater than
x.
Based on the missing mechanism described in this paper, we assume that the probability of missing censoring indicators, denoted as
, is determined via a logistic model:
. Following [
15], we employed the uniform kernel function
for
and
otherwise. Additionally, we used the biweight kernel function
for
and
otherwise. The bandwidths were
. We estimated
under the logistic model:
. As highlighted by [
19], when the data on
are completely (or quasi-completely) separated, the maximum likelihood estimate of
does not exist. In our simulation setup, the number of covariates significantly exceeds the sample size. Therefore, we employ the lasso method to estimate the parameters.
We compare the proposed Model-Averaging method for the Missing Censoring Indicators in the Heteroscedastic setting (HCIMA) with other classical model-selection and model-averaging methods in this article. Brief descriptions of these methods are provided below:
The model-selection methods rely on AIC and BIC, where the AIC and BIC criterion for the mth model are defined as follows:
and
where
.
Model methods based on SAIC and SBIC: The weights for the mth candidate model are given by:
where
,
.
Additionally, we compare our approach with the method that estimates the variance using the maximum candidate model (MCIMA). And the specifics of variance estimation and weight selection in their approach are as follows:
where
.
In the simulation, we utilize the Mean Squared Error (MSE) to evaluate the performance of various methods, where the MSE is defined as . We present the mean of MSEs from 500 replications.
Figure 1 and
Figure 2, respectively, show the Mean Squared Error (MSE) values for various methods across 500 repetitions under different censored rates and sample sizes, with missing rates of 20% and 40%. In terms of Mean Squared Error (MSE), our proposed HCIMA method outperforms other approaches. Additionally, the MCIMA method performs better than existing methods in all cases except for when compared to HCIMA. Furthermore, it is evident that SAIC and SBIC outperform their respective AIC and BIC counterparts, further highlighting the advantages of model-averaging methods.
Comparing
Figure 1 and
Figure 2, it is observed that the MSE at MR = 20% is slightly higher than at MR = 40%. The reason for this occurrence is that when
, the signs of
and
are opposite. As MR increases, the occurrence of the
situation decreases. Although this result may seem counterintuitive, it does not affect the performance of the method proposed in this paper, which still keeps its advantages in this case.
4. Real Data Analysis
In this section, we assess the predictive performance of our proposed HCIMA method using the real Acute Myeloid Leukemia (AML) dataset. This dataset contains 672 samples, including 97 variables such as patient age, survival time, gender, race, mutation count, etc. For more specific information about this dataset, we refer the reader to
https://www.cbioportal.org/study/clinicalData?id=aml_ohsu_2018 (accessed on 13 December 2023).
We selected ten variables for analysis: Cause Of Death, Age, Sex, Overall Survival Status, Overall Survival Months (Survival Time), Number of Cumulative Treatment Stages, Cumulative Treatment Regimen Count, Mutation Count, Platelet Count and WBC (White Bloodcell Count). After removing rows with missing values, we retained a total of 396 samples. We treat samples with unknown causes of death as missing censoring indicators. Among these 396 samples, 76 have unknown causes of death and 167 samples are still alive after the clinical trial ends. Therefore, the missing rate is approximately 19% and the censoring rate is 42%. We focus on the impact of seven variables, excluding “Cause Of Death” and “Overall Survival Status” on Survival Time. Therefore, we can construct non-nested candidate models.
We randomly select data from
samples as the training dataset, while the remaining
samples are used as the testing dataset. We set the training dataset size to 50%, 60%, 70% and 80% of the total dataset size, respectively. Following [
16,
20], we employed the normalized mean squared prediction error (NMSPE) as the performance metric:
where
represents the predicted value and
denotes the value of
for the mth model.
We calculate the mean, the standard deviation and the optimal rate of each method over these 1000 repetitions. Specifically, the optimal rate refers to the frequency at which the minimum value is achieved across these 1000 repetitions.
Table 1 displays the mean, optimal rates and standard deviations of NMSPE for each method over 1000 repetitions. Consistent with the simulation results, the HCIMA method exhibits the lowest average NMSPE and standard deviation and the highest optimal rate. The MCIMA method also performs well, ranking second after HCIMA. This indicates that the proposed model-averaging methods in this paper demonstrate superior predictive performance compared to other approaches.
5. Discussion
To address the uncertainty in model selection and enhance predictive accuracy, this paper proposes a novel model-averaging approach for the accelerated failure time model with missing indicators. Moreover, we establish asymptotic optimality under certain mild conditions. In Monte Carlo simulations, the method proposed in this paper exhibits lower mean squared errors compared to other model-selection and model-averaging methods. Empirical results demonstrate that the proposed method has a lower NMSPE compared to other approaches, indicating its superior predictive performance. This further underscores the applicability of the proposed method to real-life data scenarios with missing censoring indicators.
In this paper, we introduce the inverse probability weighted form of response variable proposed in [
15]. The primary advantage of this form of response variable lies in its double robustness, making it less susceptible to the impact of model misspecification (if
or
is misspecified). However, as mentioned in [
15], its drawback, compared to synthetic response [
13], regression calibration and imputation [
15], is a larger variance. Yet, in practical scenarios, the harm caused by model misspecification often outweighs the harm of higher variance. Therefore, in our work, we follow the recommendation of [
15] to use the inverse probability weighted form of the response variable. A future research direction is to further enhance this response variable for better applicability in the context of missing censoring indicators.
As far as we know, there is currently very limited research on model averaging for missing censoring indicators. Therefore, there are still many questions that deserve further investigation. There is potential for extending our approach to high-dimensional data in terms of data and in terms of models, exploration into partial linear models, generalized linear models and other extensions could be pursued.