1. Introduction
Multi-source data are attracting increasing attention in scientific research. A practical problem with multi-source data is block-wise missingness. Our work is motivated by the block-wise missingness in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data when investigating the biomarkers associated with Alzheimer’s Disease (AD). In the ADNI study, healthy elderly subjects, as well as subjects with normal cognition (NC), mild cognitive impairment (MCI), or AD, were recruited to identify neuroimaging measures, cognitive measures, and biomarkers that can effectively detect cognitive and functional changes in a timely manner [1]. The ADNI data exhibit a block-wise missing structure owing to the long duration of the study, the high cost of certain measurements, and other factors. Besides the ADNI data, datasets with a block-wise missing structure exist across many other fields, including environmental science, sociology, and economics. For example, a block-wise missing structure appears in human mortality data integrated from Italy and Switzerland [2] and in credit data collected from various institutions (Lan and Jiang [3]; Li et al. [4]).
Statistical analysis with missing covariates has been widely studied due to the prevalence of missing values in many datasets. Common methods for dealing with missing data include complete case analysis, maximum likelihood, inverse probability weighting, and imputation. While complete case analysis is the easiest approach to implement, it has several drawbacks, such as potential bias in certain cases and a significant loss of information when the proportion of missingness is high. The maximum likelihood approach (e.g., Sabbe et al. [5]; Bondarenko and Raghunathan [6]; Audigier et al. [7]; von Hippel and Bartlett [8]) requires specifying the distribution of the variables, which is unknown and unverifiable in practice. Inverse probability weighting (e.g., Chen et al. [9]; Creemers et al. [10]; Zubizarreta [11]; Hughes et al. [12]) relies heavily on the information from complete cases, which can be problematic when the fraction of completely observed subjects is small.
Two major challenges with the above ADNI data are the high proportion of missingness and the large number of covariates, which make complete case analysis and the maximum likelihood approach inefficient. In addition to these two challenges, weighting methods cannot handle the presence of multiple missing patterns. Compared to these methods with notable limitations, imputation methods are more appropriate for the ADNI data. Recently, multi-source data with block-wise missingness, exemplified by the ADNI data, have drawn extensive attention in statistical research. Ref. [13] developed a classification framework accomplished in three steps: feature selection, sample selection, and matrix completion. Ref. [2] proposed a dimension reduction method called generalized integrative principal component analysis (GIPCA). Under the assumption that each data source follows the same type of distribution in the exponential family, GIPCA decomposes the overall effect into joint and individual effects across data sources. Ref. [14] imputed the missing data using a factor structure model, which accounts for the correlation between predictors and does not depend on the missing mechanism. Ref. [15] developed a multiple block-wise imputation (MBI) approach by constructing estimating functions based on both complete and incomplete observations. Other related works include [4,16,17].
However, these methods are not applicable to longitudinal studies. When applied to the ADNI data, they use only the baseline measurement for each patient and simply discard the follow-up measurements. Thus, these methods are inefficient for the ADNI data since they fail to take account of within-subject correlations. In this paper, we aim to develop a method for variable selection when integrating longitudinal data from multiple sources in the presence of a block-wise missing structure. We impute the block-wise missing data multiple times by using the information from both subjects with complete observations and subjects with missing values. We construct estimating equations based on the imputed data and incorporate working correlation matrices to account for within-cluster correlation. With the generalized method of moments, we integrate data from multiple sources and identify the relevant variables by introducing a penalty term.
This paper is organized as follows. Section 2 describes the setup and formalizes the proposed method. In Section 3, we study the asymptotic properties of the proposed estimator. In Section 4, we develop an algorithm to implement the proposed method, followed by a simulation study in Section 5 to evaluate its performance. In Section 6, we apply the proposed method to the ADNI study. Section 7 provides further discussion.
3. Asymptotic Properties
In this section, we investigate the asymptotic properties of the proposed estimator. In Section 3.1, we assume the sample size $n$ increases while the number of parameters $p$ is fixed, and demonstrate that the proposed estimator is $\sqrt{n}$-consistent and asymptotically normal. As the sample size goes to infinity, the proposed method selects the relevant variables with probability tending to 1. We also show that the proposed estimator is asymptotically more efficient than the single imputation method, by incorporating information from samples with missing values. In Section 3.2, we suppose both the sample size $n$ and the number of parameters $p$ increase, but $n$ increases faster than $p$. We show that consistency and sparsity still hold with diverging $p$. Without loss of generality, we assume the true coefficient vector $\boldsymbol{\beta}_0$ can be partitioned into two parts, i.e., $\boldsymbol{\beta}_0 = (\boldsymbol{\beta}_{10}^{\top}, \boldsymbol{\beta}_{20}^{\top})^{\top}$, where $\boldsymbol{\beta}_{10}$ corresponds to relevant variables with non-zero true values, while $\boldsymbol{\beta}_{20}$ consists of coefficients of irrelevant variables with zero true values. For any function $g$, we use $g'(x_0)$ to denote the first derivative of $g$ evaluated at $x_0$; we use similar notation for higher-order derivatives.
3.1. Fixed Number of Parameters
To establish the asymptotic properties of the proposed estimator in the setting of an increasing sample size and a fixed number of parameters, we require the following regularity conditions:
- C.1
and , for any , , and , where the inner expectation is with respect to the imputed values.
- C.2
All the variance matrices and exist, for any and .
- C.3
Let . For any and , and the fourth moment of exists.
- C.4
, for any and .
- C.5
The penalty function satisfies:
- (a)
;
- (b)
;
- (c)
.
- C.6
, where and , with and being a diagonal matrix of dimension , each element equal to .
C.1–C.3 are conditions requiring the existence of moments, which are easily satisfied. C.4 requires that the imputed conditional mean converges in probability to the true conditional mean, which is satisfied as long as the imputation model is correctly specified and the missing mechanism is either missing completely at random or missing at random. C.5 is a standard condition for the SCAD penalty, which is commonly used in variable selection methods (Gao et al. [24]; Cho and Qu [25]; Tian et al. [26]). More specifically, (a) ensures that the sparsity property holds, (b) and (c) ensure that the consistency property holds, and (c) also guarantees that the objective function (1) is dominated by its first term. C.6 is used to establish the asymptotic normality of the proposed estimator.
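For concreteness, the standard SCAD penalty of Fan and Li is one penalty satisfying C.5; since its display is not reproduced in this version of the text, we recall that it is defined through its derivative,
$$
p'_{\lambda}(\theta) = \lambda \left\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_{+}}{(a-1)\lambda}\, I(\theta > \lambda) \right\}, \qquad \theta > 0,
$$
where $a > 2$ (with $a = 3.7$ a common default), $(x)_{+} = \max(x, 0)$, and $p_{\lambda}(0) = 0$.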
Theorem 1. Under C.1–C.5, there exists a local minimizer $\hat{\boldsymbol{\beta}}$ of the objective function such that $\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\| = O_p(n^{-1/2})$.
Theorem 1 establishes the existence of a minimizer of the objective function that converges to the true coefficients at the rate $n^{-1/2}$ as the sample size increases. Next, we demonstrate that this estimator possesses the sparsity property and that the estimator of the non-zero coefficients is asymptotically normal, as outlined in the following theorem.
Theorem 2. Under C.1–C.5, if and there exists a sequence such that as , where , then the proposed estimator satisfies the following properties:
(Sparsity) $P(\hat{\boldsymbol{\beta}}_2 = \mathbf{0}) \to 1$;
(Asymptotic Normality) Let and ; if C.6 holds, then .
The sparsity of the proposed estimator guarantees that the probability of selecting the true model approaches 1. Theorem 2 also gives the asymptotic normality of $\hat{\boldsymbol{\beta}}_1$, the estimator of the coefficients of the relevant variables, which allows us to estimate its variance if and are known. In practice, however, these are unknown. We can obtain the empirical variance–covariance matrix of $\hat{\boldsymbol{\beta}}_1$ by replacing with and with , i.e., . Next, we compare the empirical variance of the proposed estimator with that of the single imputation approach.
Theorem 3. Suppose a single imputation based on complete cases is used, and let denote the asymptotic covariance matrix of as . Then, under the conditions of Theorem 2, is positive semi-definite.
Theorem 3 states that the proposed estimator is asymptotically more efficient than the single imputation approach, as it incorporates information from incomplete cases during imputation. The result is intuitive because the proposed method brings more samples into the imputation process.
3.2. Diverging Number of Parameters
In this subsection, we consider the setting where the sample size $n$ and the number of coefficients $p$ increase simultaneously. For certain properties to remain true, we require that $n$ increase faster than $p$. We replace the notation $p$ by $p_n$ to indicate that the number of parameters increases with the sample size. We make the following assumptions:
- D.1
For any
,
and:
- D.2
There exist an open ball around and a constant $M$ such that each entry of is bounded by $M$, for any in this open ball.
- D.3
The penalty function satisfies:
- (a)
;
- (b)
;
- (c)
.
D.1 and D.2 are analogous to C.1–C.4. D.3 is the modification of C.5 for a diverging number of parameters.
Theorem 4. Under D.1–D.3, if , there exists a local minimizer $\hat{\boldsymbol{\beta}}$ of the objective function such that $\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\| = O_p(\sqrt{p_n/n})$.
From Theorem 4, we find that consistency still holds for the proposed estimator, even with a diverging number of parameters. Not surprisingly, the convergence rate is no longer $O_p(n^{-1/2})$, but $O_p(\sqrt{p_n/n})$. We also require that $p_n$ increase more slowly than $n$ to ensure the model remains sparse; that is, the majority of the coefficients are zero.
Theorem 5. Under D.1–D.3, if , , and as , then with probability tending to 1, the estimator satisfies $\hat{\boldsymbol{\beta}}_2 = \mathbf{0}$.
Theorem 5 establishes the sparsity of the proposed estimator with a diverging number of parameters. This property guarantees that the proposed method can still select the true model with probability approaching 1, even when the number of parameters diverges.
4. Implementation
Since directly minimizing the objective function is difficult, we adopt an iterative procedure inspired by the implementation in [27], which combined the minorization–maximization algorithm [28] with the Newton–Raphson algorithm. Given the current estimate of and the tuning parameter , the objective function can be locally approximated (up to a constant term) by:
where:
and is a sufficiently small number (e.g., ). Thus, searching for the estimator that minimizes the objective function is equivalent to finding an estimator that minimizes (2). Notice that both and are unknown. Fortunately, by Lemma S2 in the Supplementary Materials, can be approximated by:
and can be approximated by:
Plugging (3) and (4) into (2) and applying the Newton–Raphson algorithm, we obtain the following formula for updating :
We repeat the above procedure until is smaller than a pre-specified threshold or a pre-specified maximum number of iterations is reached.
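As a minimal sketch of this scheme, the code below combines a local quadratic (minorization) approximation of the SCAD penalty with Newton–Raphson updates, under the simplifying assumption of a penalized least-squares objective in place of the GMM criterion above; the names `scad_deriv` and `penalized_newton`, the starting value, and the perturbation `eps` are illustrative choices, not the paper's implementation.

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    """Derivative p'_lam(theta) of the SCAD penalty for theta >= 0."""
    theta = np.abs(theta)
    return lam * ((theta <= lam).astype(float)
                  + np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam)
                  * (theta > lam))

def penalized_newton(X, y, lam, eps=1e-6, tol=1e-6, max_iter=100):
    """Minimize 0.5*||y - X b||^2 + n * sum_j p_lam(|b_j|) by combining a
    local quadratic (minorization) approximation of the penalty with
    Newton-Raphson updates."""
    n, _ = X.shape
    beta = np.linalg.pinv(X.T @ X) @ (X.T @ y)    # unpenalized starting value
    for _ in range(max_iter):
        # local quadratic approximation: weight p'_lam(|b_j|) / (eps + |b_j|),
        # where eps keeps the weight finite near zero coefficients
        w = scad_deriv(beta, lam) / (eps + np.abs(beta))
        H = X.T @ X + n * np.diag(w)              # approximate Hessian
        g = X.T @ (X @ beta - y) + n * w * beta   # approximate gradient
        beta_new = beta - np.linalg.pinv(H) @ g   # Moore-Penrose inverse, as in the text
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```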
It is known that the sample covariance matrix may be singular in some cases [29]. To overcome the difficulty of computing the inverse of , we adopt the Moore–Penrose generalized inverse, which exists and is unique for any matrix.
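In NumPy, for example, the Moore–Penrose inverse of a possibly singular matrix is available directly:

```python
import numpy as np

S = np.array([[1.0, 1.0],
              [1.0, 1.0]])   # rank-deficient: the ordinary inverse does not exist
S_pinv = np.linalg.pinv(S)   # Moore-Penrose generalized inverse
print(S_pinv)                # [[0.25 0.25]
                             #  [0.25 0.25]]
```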
In the implementation of the proposed method, we select the tuning parameter with the extended Bayesian information criterion (EBIC) proposed by [30]:
where is the number of parameters of the model fitted with tuning parameter and is the residual sum of squares over all the missing patterns, with:
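Because the EBIC display above is not reproduced here, the sketch below uses the common form of [30], $\mathrm{EBIC}(\lambda) = n \log(\mathrm{RSS}_{\lambda}/n) + \mathrm{df}_{\lambda}(\log n + 2\gamma \log p)$, as a working assumption; `fit_fn` stands for any penalized fitting routine, such as the `penalized_newton` sketch above.

```python
import numpy as np

def ebic(rss, df, n, p, gamma=0.5):
    """EBIC in the common form n*log(RSS/n) + df*(log n + 2*gamma*log p)."""
    return n * np.log(rss / n) + df * (np.log(n) + 2.0 * gamma * np.log(p))

def select_lambda(X, y, lam_grid, fit_fn, gamma=0.5, zero_tol=1e-3):
    """Fit the penalized model over a grid of tuning parameters and return
    the lambda minimizing EBIC."""
    n, p = X.shape
    best_lam, best_score = None, np.inf
    for lam in lam_grid:
        beta = fit_fn(X, y, lam)
        df = int(np.sum(np.abs(beta) > zero_tol))   # number of selected variables
        rss = float(np.sum((y - X @ beta) ** 2))
        score = ebic(rss, df, n, p, gamma)
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam
```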
5. Simulation
In this section, we conduct a simulation study to compare the variable selection performance of the proposed method against complete case analysis (CC) with the SCAD penalty, single imputation (SI) with the SCAD penalty, and the penalized generalized estimating equations (PGEE). We use the same data structure as shown in Figure 1, with three missing patterns and three sources. The number of measurements is set to three throughout this section. We replicate the simulation 100 times and use the false positive rate (FPR) and false negative rate (FNR) to evaluate the performance of each method; these reflect the proportion of irrelevant covariates that are falsely selected and the proportion of relevant covariates that fail to be selected, respectively. In the tuning parameter selection procedure, the EBIC parameter $\gamma$ was set to 0.5. At the end of the iterative algorithm in Section 4, an estimated coefficient is treated as zero if its absolute value falls below a small pre-specified threshold.
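For reference, the FPR and FNR as defined here can be computed with a small helper such as the following (the boolean encoding is illustrative):

```python
import numpy as np

def fpr_fnr(selected, truth):
    """FPR: fraction of irrelevant covariates falsely selected.
    FNR: fraction of relevant covariates that fail to be selected.
    Both arguments are boolean arrays marking non-zero coefficients."""
    selected, truth = np.asarray(selected), np.asarray(truth)
    fpr = np.sum(selected & ~truth) / max(np.sum(~truth), 1)
    fnr = np.sum(~selected & truth) / max(np.sum(truth), 1)
    return fpr, fnr
```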
In the first setting, we simulate a dataset with a small proportion of complete cases, where the missing rate is around 87%. The data with a continuous outcome are generated from the linear model $y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + \epsilon_{ij}$, where $\mathbf{x}_{ij}$ is a vector consisting of 30 covariates. Here, each source consists of 10 covariates, with the first two covariates in each source having non-zero coefficients. One covariate is time-fixed and generated from the standard normal distribution, whereas the other covariates are time-varying and follow a multivariate normal distribution with mean zero and an exchangeable covariance matrix with marginal variance 1 and correlation coefficient 0.5. We generate the random error $\epsilon_{ij}$ from a multivariate normal distribution with mean 0 and an exchangeable covariance matrix with marginal variance 1 and correlation coefficient $\rho$. We always assume the true within-cluster correlation structure is known and consider $\rho$ to be 0.3, 0.5, and 0.7 in each setting, corresponding to mild, moderate, and strong within-cluster correlation. Samples are then sequentially drawn, with probability proportional to an increasing function of the baseline covariate values from source 1, and assigned to pattern 1, pattern 2, and pattern 3, respectively. Thus, subjects with higher covariate values from source 1 at baseline are more likely to be assigned to pattern 1, followed by pattern 2 and then pattern 3. This data-generating process implies a MAR mechanism for the missing covariates. Table 1 summarizes the performance of each method for the three values of $\rho$. All methods effectively control the FNR; however, the FPR of the proposed method is lower than that of the other three methods. In other words, the proposed method selects most of the relevant variables while controlling the error of selecting irrelevant ones. In addition, we notice that the proposed method utilizes within-cluster correlation more effectively than PGEE, since its performance improves as the within-cluster correlation becomes stronger. These results demonstrate the superiority of the proposed method when the percentage of complete cases in the block-missing data is small.
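The sketch below illustrates one way to generate data of this kind. The true coefficient values, pattern sizes, and the exact form of the assignment probabilities are not reproduced above, so the choices here (coefficients of 0.5, equal pattern sizes, a softmax over baseline source-1 covariates, and exchangeable correlation taken across covariates at each time point) are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def exchangeable_cov(dim, var=1.0, rho=0.5):
    """Exchangeable (compound-symmetry) covariance matrix."""
    return var * ((1.0 - rho) * np.eye(dim) + rho * np.ones((dim, dim)))

def simulate_setting1(n=300, p=30, t=3, rho_err=0.5):
    """Illustrative data generation: a linear model with t = 3 measurements,
    30 covariates from three sources, exchangeable within-cluster error
    correlation, and MAR missing-pattern assignment."""
    beta = np.zeros(p)
    beta[[0, 1, 10, 11, 20, 21]] = 0.5            # assumed non-zero coefficients
    # covariates: correlated draws per (subject, time); first one made time-fixed
    X = rng.multivariate_normal(np.zeros(p), exchangeable_cov(p, rho=0.5), size=(n, t))
    X[:, :, 0] = rng.standard_normal(n)[:, None]  # time-fixed covariate
    # errors correlated across the t measurements within each subject
    eps = rng.multivariate_normal(np.zeros(t), exchangeable_cov(t, rho=rho_err), size=n)
    y = X @ beta + eps                            # (n, t) response matrix
    # MAR assignment: larger baseline source-1 values favor pattern 1
    score = X[:, 0, :10].sum(axis=1)
    prob = np.exp(score) / np.exp(score).sum()
    order = rng.choice(n, size=n, replace=False, p=prob)
    pattern = np.empty(n, dtype=int)
    pattern[order] = np.repeat([1, 2, 3], [n // 3, n // 3, n - 2 * (n // 3)])
    return X, y, pattern
```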
In the second setting, we continue to investigate the proposed method’s performance with a continuous outcome, but we proportionally increase the sample size in each missing pattern to demonstrate the method’s effectiveness in larger samples. The results are reported in Table 2. Unsurprisingly, the FPR and FNR of all methods decrease compared with the first setting. We observe that the performance of PGEE is very close to that of the single imputation method, while the proposed method has a much lower FPR. Meanwhile, complete case analysis is still the worst option, since its improvement is minor as the sample size increases, and even negligible when the within-cluster correlation is strong. Therefore, the proposed method maintains an appealing performance at larger sample sizes. The results from this setting further verify the efficiency gain of the proposed method, which incorporates more information from the missing data than single imputation does.
In the third setting, we consider a correlated binary outcome. The data are generated from a binary analogue of the model in the first setting, where $\mathbf{x}_{ij}$ is a vector consisting of 15 covariates. Here, each source consists of five covariates, with the first covariate in each source having a non-zero coefficient. One covariate is time-fixed and generated from the standard normal distribution, whereas the other covariates are time-varying and follow a multivariate normal distribution with mean zero and an exchangeable covariance matrix with marginal variance 1 and correlation coefficient 0.5. We generate the random error from a multivariate normal distribution with mean 0 and an exchangeable covariance matrix with marginal variance 1 and correlation coefficient 0.3. The results are summarized in Table 3. Although PGEE outperforms the other methods in terms of FPR, its performance in FNR is poor. In contrast, the proposed method achieves a better balance between FPR and FNR, and we again observe its better overall performance.
6. Application
We apply our proposed method to the ADNI study. This study was launched in 2003 and has so far undertaken three phases, ADNI 1, ADNI GO/2, and ADNI 3; it is designed to develop effective treatments that can slow or stop the progression of AD. Our goal is to identify sensitive biomarkers of early-stage AD from three data sources: magnetic resonance imaging (MRI), positron emission tomography (PET), and cerebrospinal fluid (CSF). We choose the mini-mental state examination (MMSE) [31] score as the response variable, which has been widely used in the early diagnosis of AD [32]. The MRI data were analyzed by UCSF, which performed cortical reconstruction and volumetric segmentation with FreeSurfer. The processed MRI data primarily summarize average cortical thickness, the standard deviation of cortical thickness, the volumes of cortical parcellations, the volumes of specific white matter parcellations, and the total surface area of the cortex [33]. The PET data were processed by UCB, and quantitative variables were obtained as the standard uptake value ratio (SUVR) of amyloid florbetapir. The CSF data were acquired by the ADNI Biomarker Core and the Laboratory Medicine and Center for Neurodegenerative Diseases Research at UPENN. Block-wise missingness emerged in these data: less than half of the patients lacked MRI measurements, few patients missed PET measurements, and only a small proportion of patients had CSF measurements. One reason for the block-wise missing data is that obtaining CSF measurements requires more invasive procedures (such as lumbar puncture), which the majority of patients refuse. The goal of this analysis is to identify biomarkers that are highly predictive of MMSE.
We use only the ADNI GO/2 dataset and consider measurements at baseline, month 24, and month 48, since the majority of patients have records at these time points. We also notice that there are some low-quality data, such as records missing the baseline measurement or belonging to a missing pattern with few patients. For simplicity of analysis, we discard these low-quality data, which leads to a study cohort of 669 patients. Among them, 280 patients miss the measurement at month 24 and 487 patients miss the measurement at month 48. There are 340 features in the MRI data, 229 features in the PET data, and 3 features in the CSF data. These three datasets and the MMSE data are joined by a unique identifier, the “visit code”, provided by the ADNI study. In total, we have three missing patterns.
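As an illustration of this merging step, a pandas sketch with hypothetical file and column names (the actual ADNI tables and identifiers differ) might look like:

```python
import pandas as pd

# Hypothetical file and column names; the actual ADNI tables differ.
mri = pd.read_csv("mri_features.csv")    # 340 MRI features + visit code
pet = pd.read_csv("pet_features.csv")    # 229 PET features + visit code
csf = pd.read_csv("csf_features.csv")    # 3 CSF features + visit code
mmse = pd.read_csv("mmse.csv")           # MMSE response + visit code

# Left joins on the shared identifier keep every MMSE record and leave
# unmeasured sources as NaN blocks, preserving the block-wise structure.
data = (mmse.merge(mri, on="VISCODE", how="left")
            .merge(pet, on="VISCODE", how="left")
            .merge(csf, on="VISCODE", how="left"))
```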
Table 4 describes the missing pattern of this dataset. The number of patients with fully observed variables is 63, with a missing rate around 90.6%. From this extremely high proportion of missing data, we will see how the proposed method can substantially improve the prediction ability by incorporating the information of related samples with missing values. To assess the predictive performance of the proposed method, data are randomly split into a test data with a sample size 30 (roughly 5%) and the remaining data as training data, where the test data are drawn from the data with fully observed variables (missing pattern 1). This random split process was replicated 30 times. A variable is marked as a potential predictor of AD if its absolute coefficient value is greater than 0.01.
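A minimal sketch of this split procedure, assuming `data` is an array of subject records and `pattern` records each subject's missing pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

def split_once(data, pattern, test_size=30):
    """Draw the test set from the fully observed subjects (pattern 1);
    all remaining subjects form the training set."""
    complete_idx = np.where(pattern == 1)[0]
    test_idx = rng.choice(complete_idx, size=test_size, replace=False)
    train_mask = np.ones(len(pattern), dtype=bool)
    train_mask[test_idx] = False
    return data[train_mask], data[test_idx]

# replicated 30 times, as in the analysis:
# splits = [split_once(data, pattern) for _ in range(30)]
```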
Table 5 summarizes the average number of biomarkers selected by each method, along with the most frequently selected biomarkers. We also report the post-model-selection p-values. Our method successfully identifies biomarkers that align with findings reported in the existing Alzheimer’s Disease research literature. Unlike PGEE, the other three methods consistently select amyloid-β as a biomarker of AD; its accumulation in cells is an early event of AD [34]. Phosphorylated tau, another widely accepted biomarker, has been validated by multiple large-scale, multi-center studies [35]. Studies have found that neurons in AD patients are more likely to be lost around the superior temporal sulcus [36]. Two distinct normalization methods for the summary standardized uptake value ratio (SUVR) of the florbetapir tracer, using the composite reference region and the whole cerebellum reference region, may potentially serve as AD biomarkers [37]. Besides these biomarkers, the proposed method additionally identifies several well-established and potential biomarkers. The sizes of the regions of interest (ROIs) in the left and right precuneus areas of the cortex, as well as the cortical volume of the left precuneus, summarize the health status of the precuneus, which may atrophy in the early stage of AD. The size and volume of the ROIs in the left and right inferior lateral ventricles reflect disease progression (Bartos et al. [38]; Song et al. [39]). Although white matter changes in cerebral or subcortical areas can appear in other neurological conditions and in normal aging, their connections with AD potentially make them useful biomarkers for distinguishing AD from normality, especially when considered along with other biomarkers in future investigations. While the surface area of the left caudal middle frontal region and the cortical volume of the right caudal anterior cingulate are both associated with AD, more research is required to further explore these associations.
7. Discussion
It is well known that variable selection is critical for model robustness, estimator stability and efficiency, and precise prediction. However, another non-negligible issue when integrating longitudinal studies is missingness in the covariates, especially block-wise missingness. Specifically, with block-wise missing data, the percentage of complete observations is relatively small, while traditional statistical methods rely heavily on the information from complete cases. In this paper, we develop a new method that extends the MBI approach to longitudinal studies under block-wise missingness. Under certain regularity conditions, the desirable properties of consistency, sparsity, and asymptotic normality still hold. In addition, the proposed method demonstrates superior efficiency compared to the single imputation approach. It is worth noting that dropout missingness is also very common in longitudinal studies and typically causes bias. In future work, it will be of great interest to develop methods that handle dropout missingness and incorporate inverse probability weighting into the proposed method.
One limitation of this paper is that we assume a homogeneous missing pattern across measurements within a single patient. Although this assumption may be restrictive in real data analysis, it is not hard to fulfill in multi-source data.