1. Introduction
Variable screening has been demonstrated to be a computationally fast and efficient tool for solving many problems in ultrahigh dimensions. For example, in many scientific areas, such as biological genetics, finance and econometrics, we may collect ultrahigh-dimensional data sets (e.g., biomarkers, financial factors, assets and stocks), where the number of predictors p greatly exceeds the sample size n. Theoretically, ultrahigh dimension often refers to the setting in which the dimensionality p and the sample size n satisfy the relationship log p = O(n^a) for some constant a ∈ (0, 1). Variable screening is able to reduce the computational cost, to avoid the instability of algorithms, and to improve the estimation accuracy; these issues arise in the variable selection approaches based on the LASSO [1], SCAD [2,3] or MCP [4] for ultrahigh-dimensional data. Since the seminal work of [5], which pioneered the sure independence screening (SIS) procedure, many variable screening approaches have been documented over the last fifteen years, including model-based methods (e.g., [6,7,8,9,10,11]) and model-free methods [12,13,14,15,16,17,18,19,20]. These papers have shown that, with probability approaching one, the set of selected predictors contains the set of all truly important predictors.
Most marginal approaches focus only on developing effective and robust measures to characterize the marginal association between the response and an individual predictor. However, these methods do not take into consideration the influence of conditional variables or confounding factors on the response. A direct application of SIS is relatively crude, since SIS may perform poorly when predictors are highly correlated with each other. Some predictors that are weakly relevant or irrelevant marginally, but jointly correlated with the response, may be excluded from the final model after applying marginal screening methods. This will result in a high false positive rate (FPR). To surmount this weakness, an iterated screening algorithm or a penalization-based variable selection is usually offered as a refined follow-up step (e.g., [5,10]).
Conditional variable screening can be viewed as an important extension of marginal screening. It accounts for conditional information when calculating the marginal screening utility. There is relatively little work on it in the literature. To name a few, Ref. [21] proposed a conditional SIS (CIS) procedure to improve the performance of SIS, because conditioning on some correlated variables may boost the rank of a marginally weak predictor and reduce the number of false negatives. The paper [22] proposed a confounder-adjusted screening method for high-dimensional censored data, in which additional environmental confounders are regarded as conditional variables. The researchers in [23] studied variable screening by incorporating within-subject correlation for ultrahigh-dimensional longitudinal data, where they used some baseline variables as conditional variables. Ref. [24] proposed a conditional distance correlation-based screening via a kernel smoothing method, while [25] further presented a screening procedure based on conditional distance correlation, which is similar to [24] in methodology but differs in theory. Additionally, Ref. [11] developed a conditional quantile correlation-based screening approach using the B-spline smoothing technique. However, in [11,24,25], among others, the conditional variable considered is only univariate. Further, Ref. [21] focuses on generalized linear models and cannot handle heavy-tailed data. In this regard, we aim to develop a screener that behaves more robustly to outliers and heavy-tailed data, and simultaneously accommodates more than one conditional variable. On the choice of conditional variables, one can rely on prior knowledge, such as published research or the experience of experts in relevant subjects. When no prior knowledge is available, one can apply some marginal screening approach, such as SIS or its robust variants, to select several top-ranked predictors as conditional variables.
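As one concrete way to carry out this default choice, the following minimal Python sketch ranks predictors by a robust marginal association, here Kendall's tau as one possible robust variant of SIS, and takes the top-ranked predictors as conditional variables; the function name and the default size k = 3 are illustrative choices of ours, not prescriptions from the paper.

```python
import numpy as np
from scipy.stats import kendalltau

def default_conditioning_set(y, X, k=3):
    # Rank predictors by the absolute Kendall's tau with the response
    # (a robust marginal utility) and return the k top-ranked indices
    # to serve as conditional variables when no prior knowledge exists.
    scores = np.array([abs(kendalltau(X[:, j], y)[0]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]
```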
On the other hand, to the best of our knowledge, several works have considered multiple conditional variables based on distinct partial correlations. For instance, Ref. [26] proposed a thresholded partial correlation approach to select significant variables in linear regression models. Additionally, Ref. [17] presented a screening procedure on the basis of the quantile partial correlation of [27], and they referred to the procedure as QPC-SIS. More recently, Ref. [28] proposed a copula partial correlation-based screening approach. It is worth noting that the partial correlation used in both [17,28] removes the effect of conditional variables on the response and each predictor by fitting two parametric models with a linear structure. However, this manner may be ineffective, especially when the conditional variables have a nonlinear influence on the response. This motivates us to work out a flexible way to control the impact of conditional variables. Meanwhile, we also take into account the issue of robustness to outlying or heavy-tailed responses in this paper.
This paper contributes a robust and flexible conditional variable screening procedure via a partial correlation coefficient, which is a non-trivial extension of [17]. First of all, in order to precisely control conditional variables, we propose a nonparametric definition of QPC, which extends that of [17] and allows for more flexibility. Specifically, we first fit two nonparametric additive models to remove the effect of the conditional variables on the response and on an individual predictor, where we use the B-spline smoothing technique to estimate the nonparametric functions. This can be viewed as a nonparametric adjustment for controlling conditional variables. From these fits we obtain two residuals, on which a quantile correlation can be calculated to formulate a nonparametric QPC. Second, we use this quantity as the screening utility in variable screening. This procedure can be implemented rapidly. We refer to this procedure as the nonparametric quantile partial correlation-based screening, denoted as NQPC-SIS. Third, theoretically, we establish the sure screening property for NQPC-SIS under some mild conditions. Compared to [17], our approach is more flexible and our theory on the sure screening property is more challenging to derive. Moreover, our screening idea can be easily transferred to existing screening methods that use other popular partial correlations.
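To fix ideas, here is a minimal Python sketch of one NQPC evaluation under stated assumptions: the response is adjusted by an additive quantile regression on the conditional variables, the predictor by an additive least squares regression, both built on B-spline bases, and a quantile correlation (cf. [27]) is computed on the two residuals. All function names are ours, and the paper's exact Equations (4) and (7) may differ in detail (e.g., in how the predictor is adjusted).

```python
import numpy as np
import statsmodels.api as sm
from patsy import dmatrix

def additive_spline_basis(Z, df=5):
    # Additive B-spline design matrix for the conditional variables Z (n x q).
    cols = [np.asarray(dmatrix(f"bs(x, df={df}) - 1", {"x": Z[:, k]}))
            for k in range(Z.shape[1])]
    return sm.add_constant(np.column_stack(cols))

def qcor(u, v, tau):
    # Sample quantile correlation of u and v at level tau (cf. [27]):
    # cov(psi, v) / sqrt(var(psi) * var(v)), psi = tau - 1{u - Q_tau(u) < 0}.
    psi = tau - (u - np.quantile(u, tau) < 0).astype(float)
    return np.cov(psi, v)[0, 1] / np.sqrt(psi.var() * v.var())

def nqpc(y, xj, Z, tau=0.5, df=5):
    # Nonparametric QPC of y and predictor xj given Z: remove the additive
    # effect of Z from both variables, then correlate the residuals.
    B = additive_spline_basis(Z, df)
    e_y = y - sm.QuantReg(y, B).fit(q=tau).predict(B)  # quantile adjustment
    e_x = xj - sm.OLS(xj, B).fit().predict(B)          # mean adjustment
    return qcor(e_y, e_x, tau)
```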
The remainder of the paper is organized as follows. In Section 2, the NQPC-SIS is introduced. The technical conditions needed are listed and the asymptotic properties are established in Section 3. Section 4 provides an iterative algorithm for a further refinement. Numerical studies and an empirical analysis of a real data set are carried out in Section 5. Concluding remarks are given in Section 6. All the proofs of the main results are relegated to Appendix A.
3. Theoretical Properties
To state our theoretical results, we first introduce some notation. Throughout the rest of the paper, for any matrix A, we use ||A||, ||A||_∞, and λ_min(A) and λ_max(A) to stand for the operator norm, the infinity norm, and the minimum and maximum eigenvalues (for a symmetric matrix A), respectively. In addition, for any vector v, ||v|| means the Euclidean norm.
Denote by the population screening utility the nonparametric QPC given in Equation (4), and by its B-spline approximation the quantity given in Equation (7), with the corresponding sample version defined accordingly. Before we establish the uniform convergence of the sample utility to the population utility, we first investigate the bound of the gap between the population utility and its B-spline approximation, which is helpful to understand the marginal signal level after applying the B-spline approximation to the population utility. We need the following conditions:
- (B1) Let 𝒳_j denote the support of covariate X_j. There exist some positive constants such that the B-spline approximation error of each nonparametric component is of order K_n^{-d} uniformly over 𝒳_j and over j, where K_n is the number of B-spline basis functions and d is defined in condition (C1) below.
- (B2) There exist some positive constants that uniformly bound the variances of the quantities given in (4) and (8), respectively.
- (B3) In a neighborhood of the conditional quantile of interest, the conditional density of Y given the covariates is bounded on their support, uniformly in j.
- (B4) The marginal utilities of the truly active predictors are uniformly bounded below by c n^{-κ} for some positive constant c and some κ ∈ [0, 1/2).
Condition (B1) is the approximation error condition imposed on the nonparametric functions in the B-spline smoothing literature (e.g., [11,30,31]). Condition (B2) requires the two variances involved to be uniformly bounded. Condition (B3) implies that there exists a finite constant such that, in a small neighborhood of the quantile of interest, the conditional density bound holds uniformly. Condition (B4) guarantees that the marginal signal of the active components in the model does not vanish. These conditions are similar to those in [17].
Proposition 1. Under conditions (B1)–(B3), there exists a positive constant such that the gap between the population utility and its B-spline approximation is uniformly bounded by a multiple of K_n^{-d}. In addition, if condition (B4) further holds, then the approximated marginal signal of the active components remains bounded away from zero, provided that K_n^{-d} is of smaller order than the signal level n^{-κ}.
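For orientation, bounds of this type in the B-spline screening literature typically take the following schematic form, where ρ_j denotes the population NQPC, ρ_j^K its B-spline approximation, M_* the active set and K_n the number of basis functions; these symbols are generic placeholders rather than the paper's exact display:

\[
\max_{1\le j\le p}\bigl|\rho_j^{K}-\rho_j\bigr| \le C\,K_n^{-d},
\qquad
\min_{j\in\mathcal{M}_*}\bigl|\rho_j^{K}\bigr| \ge \frac{c_1}{2}\,n^{-\kappa}
\quad \text{provided that } K_n^{-d}=o\bigl(n^{-\kappa}\bigr).
\]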
To establish the sure screening property, we make the following assumptions:
- (C1) The nonparametric functions belong to a class of smooth functions whose rth derivatives exist and are Lipschitz of order c on the support of the corresponding covariate, that is, |g^{(r)}(s) − g^{(r)}(t)| ≤ K|s − t|^c for some positive constant K, where r is a non-negative integer and c ∈ (0, 1] is such that d = r + c > 0.5.
- (C2) The joint density of the conditional variables is bounded above and below by two positive numbers. The density of each predictor X_j is bounded away from zero and infinity uniformly in j; that is, there exist two positive constants such that the density of X_j lies between them on its support.
- (C3) There exist two positive constants such that the moments of each predictor are uniformly bounded for every j.
- (C4) The conditional density of Y given the covariates satisfies the Lipschitz condition of first order and is bounded between some positive constants for any y in a neighborhood of the conditional quantile of interest, uniformly in j.
- (C5) There exist some positive constants that uniformly bound the variances of the quantities entering the marginal utilities. Furthermore, assume that a similar uniform bound holds for some constant.
- (C6) There exists some constant bounding from below the marginal signal level of the truly active variables after the B-spline approximation.
Condition (C1) is a smoothness assumption on the nonparametric components that is standard in the B-spline-related literature ([7,32]). Condition (C3) is a moment constraint on each of the predictors. Conditions (C2), (C4) and (C5) are similar to those imposed in [17]. Condition (C6) is assumed to ensure that the marginal signal level of the truly active variables is not too weak after the B-spline approximation. The above conditions are standard in the variable screening literature (e.g., [17,28]).
According to the properties of normalized B-splines and under conditions (C1) and (C2) (cf. [33,34]), we can obtain the fact that, for each basis function and each j, there exist positive constants, independent of j and n, that bound the second moments of the normalized basis functions from below and above. The following lemma bounds the eigenvalues of the B-spline basis matrix from below and from above. This result extends Lemma 3 of [32] from a fixed dimension to a diverging dimension, which may be of independent interest to some readers.
Lemma 1. Suppose that conditions (C1) and (C2) hold; then the minimum and maximum eigenvalues of the B-spline basis matrix are bounded below and above by quantities that depend on the number of basis functions and the number of conditional variables. This result reveals that the number of conditional variables plays an important role in bounding the eigenvalues of the B-spline basis matrix: when it goes to infinity rapidly, the minimum eigenvalue of the basis matrix will degrade to zero very quickly at an exponential rate. Hence, for the following results to hold, the divergence rate of the number of conditional variables cannot achieve a polynomial order of n, but can be of the order of log n.
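As a point of reference, eigenvalue bounds for additive B-spline designs typically take the following schematic form, where B is the stacked basis vector, q_n the number of conditional variables, K_n the number of basis functions, and c_1, c_2, γ generic constants (placeholder notation, not the paper's):

\[
c_1\,\gamma^{\,q_n}\,K_n^{-1}\;\le\;\lambda_{\min}\bigl(E\{BB^{\top}\}\bigr)\;\le\;\lambda_{\max}\bigl(E\{BB^{\top}\}\bigr)\;\le\;c_2\,K_n^{-1},
\qquad 0<\gamma<1,
\]

so that the minimum eigenvalue decays exponentially in q_n, which is why q_n may grow at most logarithmically in n.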
Theorem 1. Suppose that conditions (B1)–(B4) and (C1)–(C5) hold, and assume that the rate conditions on the number of basis functions and on the number of conditional variables are satisfied.
- (i) There exist some positive constants such that, for a deviation level of order n^{-κ} and sufficiently large n, the estimated utilities concentrate uniformly around their B-spline population counterparts with an exponential probability bound, where one of the constants is given in Lemma 1.
- (ii) In addition, if condition (C6) is further satisfied, then by choosing the screening threshold of order n^{-κ}, the selected model contains all truly active variables with probability tending to one for sufficiently large n.
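Sure screening statements of this kind are usually displayed in the following schematic form, with ρ̂_j the estimated utility, M̂ the selected set and c, c' generic constants (again placeholder notation):

\[
P\Bigl(\max_{1\le j\le p}\bigl|\widehat{\rho}_j-\rho_j^{K}\bigr|\ge c\,n^{-\kappa}\Bigr)
\le O\Bigl(p\,K_n\exp\bigl(-c'\,n^{1-2\kappa}/K_n^{2}\bigr)\Bigr),
\qquad
P\bigl(\mathcal{M}_*\subseteq\widehat{\mathcal{M}}\bigr)\to 1 .
\]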
Theorem 1 establishes the sure screening property: all the relevant variables can be recruited into the final model with probability going to one. The probability bound in the property is free of the ambient dimensionality, but depends on the signal level and the number of basis functions. Though this ensures that NQPC-SIS retains all important predictors with high probability, noisy variables can also be included by NQPC-SIS. Ideally, selection consistency, i.e., the selected set coinciding exactly with the true active set when n is sufficiently large, can be realized by the choice of the screening threshold according to Theorem 1. This property can also be achieved by assuming a signal gap between the active and inactive variables; however, that would be too restrictive to check in practice. Similar to [17], we may instead impose a mild growth condition for some constant to control the false selection rate. With this condition, we can obtain the following property to control the size of the selected model.
Theorem 2. Under the conditions of Theorem 1, by choosing the screening threshold as in Theorem 1(ii), and if the growth condition above holds for some constant, then there exist some positive constants such that, with probability tending to one for sufficiently large n, the size of the selected model is at most of polynomial order in n. This theorem reveals that, after an application of the NQPC-SIS, the dimensionality can be reduced from an exponential order to a polynomial size of n while at the same time retaining all the important predictors with probability approaching one.
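Model-size bounds of this kind usually take the following schematic form, where ξ is a generic rate constant (placeholder notation, not the paper's):

\[
P\Bigl(\bigl|\widehat{\mathcal{M}}\bigr|\le O\bigl(n^{2\kappa+\xi}\bigr)\Bigr)
\ge 1-O\Bigl(p\,K_n\exp\bigl(-c\,n^{1-2\kappa}/K_n^{2}\bigr)\Bigr).
\]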
4. Algorithm for NQPC-SIS
To make the NQPC-SIS practically applicable, for each j, we need to specify the conditional set. We note that a sequential test was developed in [17] to identify the conditional set via an application of Fisher's Z-transformation [35] and partial correlation. In this section, we provide a two-stage procedure based on a nonparametric additive quantile regression model, which can be viewed as complementary to [17].
To reduce the computational burden, we first apply the quantile-adaptive model-free feature screening (Qa-SIS) proposed by [13] to select a moderately sized subset of top-ranked predictors, where the number of basis functions used in Qa-SIS is fixed in advance and ⌊a⌋ denotes the largest integer not exceeding a. Second, for each j, we take the conditional set to be this subset with the jth variable removed if it belongs to the subset, and the subset itself otherwise. Third, we carry out a variable selection with the SCAD penalty [2] based on an additive quantile regression model for the corresponding data set, and then a small reduced subset is obtained. Such a two-stage procedure can help to find the conditional subset for the jth variable and will be incorporated in the following algorithm. With a slight abuse of notation, we use d_n to denote the screening threshold parameter of the NQPC-SIS; in other words, for the NQPC-SIS, we select the d_n covariates that correspond to the d_n largest NQPCs.
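A minimal Python sketch of this two-stage construction is given below, building on the helpers from the earlier sketches. Since a SCAD-penalized additive quantile regression solver is not available in standard Python libraries, the sketch substitutes scikit-learn's L1-penalized QuantileRegressor as a stand-in for the SCAD step; the subset size and tuning constants are illustrative, not the paper's.

```python
import numpy as np
import statsmodels.api as sm
from patsy import dmatrix
from sklearn.linear_model import QuantileRegressor

def qa_sis_utility(y, xj, tau=0.5, df=5):
    # Qa-SIS marginal utility (cf. [13]): mean squared distance between the
    # fitted spline conditional quantile of y given xj and the unconditional
    # tau-th quantile of y.
    B = sm.add_constant(np.asarray(dmatrix(f"bs(x, df={df}) - 1", {"x": xj})))
    fitted = sm.QuantReg(y, B).fit(q=tau).predict(B)
    return np.mean((fitted - np.quantile(y, tau)) ** 2)

def conditional_set(y, X, j, tau=0.5, df=5, m=None, alpha=0.1):
    n, p = X.shape
    m = m or int(np.floor(n / np.log(n)))        # illustrative subset size
    # Stage 1: Qa-SIS keeps the m top-ranked predictors.
    util = np.array([qa_sis_utility(y, X[:, k], tau, df) for k in range(p)])
    keep = list(np.argsort(util)[::-1][:m])
    if j in keep:
        keep.remove(j)                            # S_j excludes the jth variable
    # Stage 2: penalized additive quantile regression on the kept variables;
    # an L1 penalty is used here as a stand-in for the SCAD penalty of [2].
    B = np.column_stack([np.asarray(dmatrix(f"bs(x, df={df}) - 1", {"x": X[:, k]}))
                         for k in keep])
    coef = QuantileRegressor(quantile=tau, alpha=alpha, solver="highs").fit(B, y).coef_
    groups = coef.reshape(len(keep), df)          # df spline coefficients per variable
    return [k for k, g in zip(keep, groups) if np.abs(g).max() > 1e-8]
```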
Algorithm 1 has the same spirit as the QPCS algorithm of [17], who demonstrated empirically that the QPCS algorithm outperforms their QTCS and QFR algorithms. In the implementation, we choose small numbers of basis functions for the two smoothing steps, which does not exclude other choices. According to our limited simulation experience, this choice works satisfactorily. The numbers of basis functions we take cannot be too large, due to the use of B-spline basis approximations. Theoretically, we need to specify the number of basis functions so that it diverges with n at the rate required by the theory, while a small fixed value is sufficient in practice.
Algorithm 1 The implementation of NQPC-SIS.
- 1: Given the data, we set a pre-specified number d_n and an initial selected set.
- 2: For the initial step,
  - (2a) update the conditional set of each candidate variable by the two-stage procedure;
  - (2b) update the selected set, where the added variable index is defined as the one maximizing the magnitude of the estimated NQPC.
- 3: For each subsequent step,
  - (3a) update the conditional sets given the variables selected so far;
  - (3b) update the selected set, where the added variable index is such that its estimated NQPC, given the current selected set, is the largest in magnitude.
- 4: Repeat Step 3 until d_n variables are selected. The final selected set is the output of the NQPC-SIS.
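Putting the pieces together, the following Python sketch implements one plausible reading of Algorithm 1, reusing nqpc and qa_sis_utility from the earlier sketches; the initialization rule, the simplified conditioning on the current selected set, and the stopping size d_n = ⌊n/log n⌋ are our illustrative choices, not necessarily the paper's.

```python
def nqpc_sis(y, X, tau=0.5, df=5, d_n=None):
    n, p = X.shape
    d_n = d_n or int(np.floor(n / np.log(n)))    # illustrative screening size
    # Step 1: initialize with the variable having the largest marginal utility.
    active = [int(np.argmax([qa_sis_utility(y, X[:, j], tau, df)
                             for j in range(p)]))]
    # Steps 2-4: greedily add the predictor with the largest |NQPC|.
    while len(active) < d_n:
        rest = [j for j in range(p) if j not in active]
        # Condition on the currently selected variables (a simplification of
        # the per-variable two-stage sets S_j described in Section 4).
        Z = X[:, active]
        scores = [abs(nqpc(y, X[:, j], Z, tau, df)) for j in rest]
        active.append(rest[int(np.argmax(scores))])
    return active
```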