Article

Model Selection for High Dimensional Nonparametric Additive Models via Ridge Estimation

1 Department of Mathematics, Harbin Institute of Technology, Harbin 150001, China
2 Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen 518055, China
3 Department of Mathematics, Southern University of Science and Technology, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(23), 4551; https://doi.org/10.3390/math10234551
Submission received: 2 November 2022 / Revised: 22 November 2022 / Accepted: 28 November 2022 / Published: 1 December 2022
(This article belongs to the Special Issue Statistical Methods in Data Science and Applications)

Abstract

In ultrahigh dimensional data analysis, nonparametric additive models face increasing challenges in maintaining both good computational performance and sound statistical properties. To overcome them, we introduce a model selection methodology for high dimensional nonparametric additive models. Our approach is to propose a novel group screening procedure via nonparametric smoothing ridge estimation (GRIE) to measure the importance of each covariate. It is then combined with the sure screening property of GRIE and the model selection property of the extended Bayesian information criterion (EBIC) to select suitable sub-models in nonparametric additive models. Theoretically, we establish the strong consistency of model selection for the proposed method. Extensive simulations and two real datasets illustrate the outstanding performance of the GRIE-EBIC method.

1. Introduction

With the advances in information technology, high-dimensional data arise in various fields such as biology, chemistry, economics, finance, genetics, and neuroscience. A common sparsity assumption is that only a few features are truly related to the response. Under this assumption, many variable selection approaches based on regularized M-estimation have been developed, including, but not limited to, the Lasso [1], SCAD [2], the Dantzig selector [3], and MCP [4]. However, these penalized methods share two limitations: a heavy computational burden and unstable variable selection performance in high-dimensional settings [5].
To avoid these limitations, correlation ranking has become one of the most popular ways to rapidly reduce the dimensionality of the feature space. Fan and Lv [6] proposed sure independence screening (SIS), which utilizes the marginal Pearson correlation between each predictor and the response in Gaussian linear regression. Fan et al. [7] extended the idea of Pearson correlation ranking to ranking by marginal smooth estimation strength and proposed the nonparametric independence screening (NIS) method. Meanwhile, Zhu et al. [8] considered the marginal correlation between each predictor and the conditional cumulative distribution function of the response and developed a model-free screening method. In practice, however, strong correlations often exist among the predictors, so important predictors may be only jointly, rather than marginally, correlated with the response. Hence, marginal correlation ranking may miss some important variables. To reduce the effect of correlation among the predictors, several forward variable screening methods based on prediction rankings have been introduced. Wang [9] proposed the forward regression (FR) algorithm, which sequentially adds the predictor that most reduces the residual sum of squares. Cheng et al. [10] applied forward regression to high dimensional varying coefficient models and proposed the forward-BIC screening method. Zhong et al. [11] further extended forward regression to ultrahigh-dimensional nonparametric additive models. Based on the cumulative divergence (CD), Zhou et al. [12] proposed a forward screening procedure that accounts for the joint effects among covariates during feature screening.
Next, let us turn to the specific model. In this paper, we are interested in nonlinear regression. It is well known that when there is extensive nonlinear dependence between the response and the predictors, traditional (partial) linear models cannot detect it. Although fully nonparametric regression can capture such nonlinear dependence accurately, it suffers from the curse of dimensionality and a heavy computational burden in high dimensions. To simplify the modeling, we consider nonparametric additive models, which were introduced by Hastie and Tibshirani [13] and are defined as follows,
$$y = \sum_{j=1}^{p_n} m_j(x_j) + \epsilon, \qquad (1)$$
where $y$ is the response variable, $x_j$ is the $j$th covariate, $m_j$ is an unknown function for $j = 1, \ldots, p_n$, and $\epsilon$ is the random error. This additive combination of univariate functions can detect nonlinear dependence easily, but its good statistical properties and high computational performance hold only in low dimensions. For ultrahigh dimensions, one of the most popular ways to keep them working well is the two-stage approach: first perform model selection in a fast and efficient way while retaining all the important features in the reduced feature space, and then refit the reduced model. In the following, we focus on the methodology of model selection for ultrahigh dimensional nonparametric additive models. In this field, the last decade has seen a growing trend toward smooth-group penalized methods; see [14,15,16,17]. However, these methods may involve tuning parameters, which bring a heavy computational burden and unstable results in high dimensions. The forward feature selection procedure proposed by [11] for ultrahigh dimensional nonparametric additive models does not involve any initial parameters. In addition, model-free methods have been developed recently: based on the cumulative divergence (CD), Zhou et al. [12] proposed a forward screening procedure that accounts for the joint effects among covariates during feature screening. Both of these methods screen the remaining candidate indexes into the sub-models through forward procedures, and this kind of forward-searching algorithm also leads to a high computational burden. Furthermore, the correlation assumptions in previous studies ignore the fact that predictors are often correlated in high-dimensional feature spaces. In particular, an unimportant covariate $x_\ell$ with $m_\ell \equiv 0$ in the nonparametric additive model (1) may be strongly correlated with the residual $y - \sum_{j \in \mathcal{M}} m_j(x_j)$ for a given index set $\mathcal{M} \subset \{1, \ldots, p_n\}$, which implies that their methodologies may screen quite a few unimportant features into the sub-models.
To address these limitations, we first propose a group screening procedure via nonparametric smoothing ridge estimation (GRIE), motivated by the theoretical properties and outstanding simulation performance of the ridge estimator in [18]. The core idea of GRIE is to measure the importance of each covariate by combining the ridge estimator with group contributions. The details are as follows. We begin by fitting a ridge regression with B-spline smoothing and treating the spline basis corresponding to each covariate as a group. Next, we evaluate the group contribution of each covariate by the magnitude of its group estimator. Lastly, we sort the covariates by their group contributions in descending order. To further conduct model selection, we propose the refined GRIE-EBIC method, which combines GRIE with the extended Bayesian information criterion (EBIC) of [19]. The GRIE-EBIC method searches among the predictors with the largest group contributions using the EBIC.
Compared with other feature selection methods for nonparametric additive models, the GRIE-EBIC method has the following advantages: (1) the joint correlation among covariates is taken into account, and the strong marginal correlation assumption between the response and the important predictors is relaxed; (2) the calculation is simple, with low computational complexity; (3) it enjoys strong consistency of feature screening, which implies that the true features can be recovered exactly with probability tending to one, a property not shared by other stepwise feature screening methods such as the forward additive regression in [11] and the forward screening in [12].
The rest of the paper is organized as follows. In Section 2, we introduce the GRIE screening procedure, the GRIE-EBIC method, and its algorithm. In Section 3, we establish the sure screening property of the GRIE screening procedure and the strong consistency of screening by the GRIE-EBIC. In Section 4, we present the performance of our proposed algorithm through simulation studies. In Section 5, we apply our methodology to fit two real datasets to further illustrate the performance of our proposed method. The first is based on Boston housing, while the second is related to Arabidopsis thaliana gene data. A conclusion is given in Section 6. The proofs are in Appendix A.
Notation 
Let $\mathbf{A}$ be an $m \times l$ matrix and $\mathcal{M}$ be any subset of $\{1, 2, \ldots, l\}$, for any positive integers $m$ and $l$; then $\mathbf{A}_{\mathcal{M}}$ is the submatrix of $\mathbf{A}$ formed by the columns with indexes in $\mathcal{M}$. We write $\lambda_{\min}(\mathbf{A})$ and $\lambda_{\max}(\mathbf{A})$ for the minimum and maximum eigenvalues of a symmetric matrix $\mathbf{A}$, respectively, and $\mathbf{I}_m$ for the $m \times m$ identity matrix. We define $P_{\lambda,\mathbf{A}} = \mathbf{A}^\top(\mathbf{A}\mathbf{A}^\top + \lambda \mathbf{I}_m)^{-1}\mathbf{A}$, where $\lambda$ is some positive constant, $\mathbf{A}^\top$ denotes the transpose of $\mathbf{A}$, and $\mathbf{A}^\top$ is a column full rank $l \times m$ matrix with $m \le l$. When $\lambda = 0$, $P_{\mathbf{A}} = \mathbf{A}^\top(\mathbf{A}\mathbf{A}^\top)^{-1}\mathbf{A}$, which is the projection onto the column space of $\mathbf{A}^\top$. In addition, $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^\top$ denotes the unit vector with a one in the $i$th position and zeros elsewhere. For a vector $a = (a_1, a_2, \ldots, a_n)^\top \in \mathbb{R}^n$, the $L_2$ norm of $a$ is $\|a\|_2 = \sqrt{a^\top a}$.
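As a small illustration of this notation, the ridge-type operator $P_{\lambda,\mathbf{A}}$ could be computed in R as follows (a minimal sketch; the helper name proj_ridge is ours and not from the paper's code):

```r
# Ridge-type projection operator P_{lambda,A} = A' (A A' + lambda I_m)^{-1} A
# for an m x l matrix A; with lambda = 0 this is the ordinary projection P_A.
proj_ridge <- function(A, lambda = 0) {
  m <- nrow(A)
  t(A) %*% solve(A %*% t(A) + lambda * diag(m), A)
}
```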

2. Methodology

Suppose we have the random sample $\{(y_i, x_{i,1}, \ldots, x_{i,p_n}) : i = 1, \ldots, n\}$ generated from the population model (1). Then the nonparametric additive model can be rewritten as:
$$y_i = \sum_{j=1}^{p_n} m_j(x_{i,j}) + \epsilon_i, \quad i = 1, \ldots, n. \qquad (2)$$
Without loss of generality, we assume that the mean response is zero. For identifiability of the model, we further assume that each additive function has mean zero, i.e., $E\,m_j(x_{i,j}) = 0$ for $j = 1, \ldots, p_n$. In a real application, all response variables are centralized to satisfy this assumption. Here, the variance of the additive function, $\mathrm{Var}(m_j(x_j))$, is used to distinguish the importance of the covariates. Thus, we call $x_j$ an important predictor if $\mathrm{Var}(m_j(x_j)) > 0$; otherwise, $x_j$ is a redundant predictor. We then define the index set of the important predictors as $S = \{j : \mathrm{Var}(m_j(x_j)) > 0,\ j = 1, \ldots, p_n\}$.
Next, we use B-spline basis functions to approximate $m_j(\cdot)$. Assume $x_j \in [0,1]$ for $j = 1, \ldots, p_n$, let $\bar\phi = \{\phi_k\}_{k=0}^{q}$ be a knot sequence such that $0 = \phi_0 < \phi_1 < \cdots < \phi_q = 1$, and let $\mathcal{S}(\ell, \bar\phi)$ be the space of polynomial splines of order $\ell$ with knot sequence $\bar\phi$. $\mathcal{S}(\ell, \bar\phi)$ is a $\kappa_n$-dimensional linear space with $\kappa_n = q + \ell$. For any $m_j(x_j)$, $j = 1, \ldots, p_n$, there exists a unique vector $\theta_j^*$ such that
$$m_j(x_j) \approx \sum_{t=1}^{\kappa_n} \theta_{jt}^* B_t(x_j) = B(x_j)^\top \theta_j^*, \qquad (3)$$
where $B(x_j) = (B_1(x_j), \ldots, B_{\kappa_n}(x_j))^\top$ and $\theta_j^* = (\theta_{j1}^*, \ldots, \theta_{j\kappa_n}^*)^\top$. Let $w_i = (w_{i,1}^\top, \ldots, w_{i,p_n}^\top)^\top$ with $w_{i,j} = B(x_{i,j})$, $W = (w_1, \ldots, w_n)^\top$, and $Y = (y_1, \ldots, y_n)^\top$. Based on the approximation (3), model (2) becomes
$$y_i = w_i^\top \theta^* + \epsilon_i^*, \quad i = 1, \ldots, n, \qquad (4)$$
where $\theta^* = (\theta_1^{*\top}, \ldots, \theta_{p_n}^{*\top})^\top$ and $\epsilon_i^* = \sum_{j=1}^{p_n} m_j(x_{i,j}) - w_i^\top \theta^* + \epsilon_i$. Under model (4), the ridge estimator minimizes the loss
$$\|Y - W\theta\|_2^2 + \lambda \|\theta\|_2^2,$$
where $\lambda$ is a positive constant. Then $\hat\theta$ admits the closed form
$$\hat\theta = W^\top (WW^\top + \lambda I_n)^{-1} Y, \qquad (5)$$
where $I_n$ is the $n \times n$ identity matrix. For linear regression, Wang and Leng [18] considered the effect of each entry of $\theta$ and showed that the ridge estimator achieves screening consistency. Notice that $\mathrm{Var}(m_j(x_j)) \approx \theta_j^{*\top} E(w_j w_j^\top)\theta_j^*$. Different from linear regression, we need to consider the group contribution of $\theta_j^*$. By the boundedness of $E(w_{i,j} w_{i,j}^\top)$ from Assumption A4(i), we use $\|\theta_j^*\|_2$ to evaluate the group contribution. Similar to the results in [18], the ridge estimator $\hat\theta$ preserves the ranking order of the group contributions in $\theta^*$, with $P(\|\hat\theta_j\|_2 > \|\hat\theta_k\|_2) \to 1$ if $j \in S$ and $k \in S^c$ (see Theorem 1).
One natural screening method is to sort $\{\|\hat\theta_j\|_2^2\}$ in decreasing order and select the top $m$ indexes, denoted as $F_m = \{i_1, i_2, i_3, \ldots, i_m\}$, $1 \le m \le p_n$. This screening process is referred to as the "GRIE" screening procedure. We define $\mathcal{G} = \{F_m : m = 1, \ldots, p_n\}$ and $\mathcal{A} = \{m : S \subseteq F_m,\ 1 \le m \le p_n\}$. To obtain a more accurate model selection result, we search for $d_n$, the minimum element of the set $\mathcal{A}$; then $F_{d_n}$ is the set with the shortest length in $\mathcal{G}$ that contains the important variable set $S$. By the definition of $\mathcal{G}$, we have $S \subseteq F_{p_n}$, so $F_{d_n}$ is not empty. In summary, we want to find $F_{d_n}$ from $\mathcal{G}$.
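To make the GRIE screening step concrete, the following R sketch builds the grouped B-spline design matrix, computes the ridge estimator, and ranks covariates by their group contributions. The function name grie_screen and the implementation details (e.g., using splines::bs for the basis) are our illustrative assumptions, not the authors' code.

```r
library(splines)

# GRIE screening sketch: rank covariates by the L2 norm of their group of
# ridge coefficients, theta_hat = W'(WW' + lambda I_n)^{-1} Y.
grie_screen <- function(X, Y, lambda = 1, kappa_n = floor(nrow(X)^(1/5)) + 2) {
  n <- nrow(X); p <- ncol(X)
  # Column-bind the kappa_n B-spline basis columns of every covariate
  W <- do.call(cbind, lapply(seq_len(p), function(j) bs(X[, j], df = kappa_n)))
  theta_hat <- drop(t(W) %*% solve(W %*% t(W) + lambda * diag(n), Y))
  # Group contribution: squared L2 norm of the kappa_n coefficients of covariate j
  group_id <- rep(seq_len(p), each = kappa_n)
  grp_norm <- tapply(theta_hat^2, group_id, sum)
  order(grp_norm, decreasing = TRUE)   # ranked indexes i_1, i_2, ... (the sets F_m)
}
```

Sorting the resulting group norms yields the nested sets $F_1 \subseteq F_2 \subseteq \cdots$ from which $F_{d_n}$ is to be extracted.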
It is well known that the extended Bayesian information criterion (EBIC) has appealing theoretical properties and outstanding numerical performance for model selection. Let $W_T = (W_j, j \in T)$ for any subset $T \subseteq \{1, \ldots, p_n\}$. The formula of the EBIC for the sub-model $(Y, W_T)$ is given by
$$EBIC(T) = \log(RSS(T)/n) + \big\{\kappa_n |T| \log(n) + 2\gamma \log f(|T|)\big\}/n,$$
where $\gamma$ is a preset positive constant, $RSS(T) = \|Y - W_T\hat\theta_T\|_2^2$ is the sum of squared residuals (RSS), and $f(|T|) = \binom{p_n\kappa_n}{|T|\kappa_n}$ is the combination number.
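A minimal R sketch of this criterion (the function name ebic and the use of lm.fit for the sub-model least squares are our own choices):

```r
# EBIC for the sub-model indexed by T, following the formula above.
# W_T: n x (kappa_n * |T|) B-spline design of the selected groups; Y: response.
ebic <- function(W_T, Y, p_n, kappa_n, gamma = 0.5) {
  n <- length(Y)
  rss <- sum(lm.fit(W_T, Y)$residuals^2)       # least squares on the sub-model
  size <- ncol(W_T) / kappa_n                  # |T|
  log(rss / n) +
    (kappa_n * size * log(n) + 2 * gamma * lchoose(p_n * kappa_n, size * kappa_n)) / n
}
```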
For the linear model, Wang [9] showed that $EBIC(F_m) < EBIC(F_{m-1})$ if $i_m \in S$. Based on this property of the EBIC and the rank-preserving property of the GRIE screening procedure (see Theorem 1), we propose the following Algorithm 1 for the model selection of (1).
Algorithm 1: GRIE-EBIC algorithm.
Initialization: Input $(W, Y)$, $RSS_0 = \|Y\|_2^2$, $n$, $p_n$, $\lambda$, $\kappa_n$, $\gamma$, $L$.
Step (i): Compute the GRIE screening procedure
   1: Calculate the ridge estimator $\hat\theta = W^\top(WW^\top + \lambda I_n)^{-1}Y$;
   2: Sort $\{\|\hat\theta_j\|_2,\ j = 1, \ldots, p_n\}$ in decreasing order and select the top $n$ index set, denoted by $F_n = \{i_1, i_2, i_3, \ldots, i_n\}$;
Step (ii): Direct decreasing solution path
   3: For $k = 1, \ldots, n$, do
      3.1: Let $\hat{S}_k = \{i_1, \ldots, i_k\}$ and compute the sum of squared residuals
           $RSS_k = \|Y - W_{\hat{S}_k}(W_{\hat{S}_k}^\top W_{\hat{S}_k})^{-1}W_{\hat{S}_k}^\top Y\|_2^2$;
      3.2: Compute the EBIC: $EBIC_k = \log(RSS_k/n) + \{\kappa_n k \log(n) + 2\gamma\log f(k)\}/n$;
      3.3: If $k \ge L + 1$ and $EBIC_k > \cdots > EBIC_{k-L}$, set $K = k - L$ and stop;
   4: Compute the differences of the EBIC to obtain the decreasing solution path
           $I = \{k : EBIC_k - EBIC_{k-1} < 0,\ k = 1, 2, \ldots, K\}$;
   5: Find the decreasing index set $\hat{S}^* = \{i_k : k \in I\}$;
Step (iii): Forward decreasing solution path
   6: Compute $RSS^* = \|Y - W_{\hat{S}^*}(W_{\hat{S}^*}^\top W_{\hat{S}^*})^{-1}W_{\hat{S}^*}^\top Y\|_2^2$ and
           $EBIC^* = \log(RSS^*/n) + \{\kappa_n|\hat{S}^*|\log(n) + 2\gamma\log f(|\hat{S}^*|)\}/n$;
   7: For $\ell \in F_n \setminus \hat{S}^*$, do
      Let $\hat{S}^*_\ell = \hat{S}^* \cup \{\ell\}$, compute $RSS^*_\ell = \|Y - W_{\hat{S}^*_\ell}(W_{\hat{S}^*_\ell}^\top W_{\hat{S}^*_\ell})^{-1}W_{\hat{S}^*_\ell}^\top Y\|_2^2$ and
           $EBIC^*_\ell = \log(RSS^*_\ell/n) + \{\kappa_n|\hat{S}^*_\ell|\log(n) + 2\gamma\log f(|\hat{S}^*_\ell|)\}/n$;
   8: Find the decreasing solution path $\hat{S} = \hat{S}^* \cup \{\ell : EBIC^*_\ell - EBIC^* < 0,\ \ell \in F_n \setminus \hat{S}^*\}$;
Output: the final index set $\hat{S}$.
In Step (ii) of the GRIE-EBIC algorithm, we search for the important covariates within the top-$n$ predictor space $F_n$. Based on Theorem 1, GRIE is consistent in preserving the sorting order: the higher a variable ranks in $F_{p_n}$, the more likely it is to be an important variable. To speed up the calculation, we impose a stopping rule: screening stops once the EBIC value has increased $L$ times consecutively. To improve the robustness of the GRIE-EBIC algorithm, Step (iii) adds a further forward screening pass.
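Putting the pieces together, a compact R sketch of Algorithm 1 might look as follows; it reuses the illustrative grie_screen() and ebic() helpers sketched above, and all names and implementation choices are ours rather than the authors'.

```r
# GRIE-EBIC sketch: Step (i) ranking, Step (ii) direct decreasing solution path
# with the L-consecutive-increase stopping rule, Step (iii) forward pass.
grie_ebic <- function(X, Y, lambda = 1, gamma = 0.5, L = 5,
                      kappa_n = floor(nrow(X)^(1/5)) + 2) {
  n <- nrow(X); p <- ncol(X)
  basis  <- lapply(seq_len(p), function(j) splines::bs(X[, j], df = kappa_n))
  ranked <- grie_screen(X, Y, lambda, kappa_n)                 # Step (i)

  ebic0 <- log(sum(Y^2) / n)                                   # EBIC_0 with RSS_0 = ||Y||^2
  ebic_path <- numeric(0); K <- min(n, p)
  for (k in seq_len(min(n, p))) {                              # Step (ii)
    W_k <- do.call(cbind, basis[ranked[1:k]])
    ebic_path[k] <- ebic(W_k, Y, p, kappa_n, gamma)
    if (k >= L + 1 && all(diff(ebic_path[(k - L):k]) > 0)) { K <- k - L; break }
  }
  S_star <- ranked[which(diff(c(ebic0, ebic_path[1:K])) < 0)]  # decreasing path

  ebic_star <- ebic(do.call(cbind, basis[S_star]), Y, p, kappa_n, gamma)
  for (l in setdiff(ranked[1:K], S_star)) {                    # Step (iii)
    if (ebic(do.call(cbind, basis[c(S_star, l)]), Y, p, kappa_n, gamma) < ebic_star)
      S_star <- c(S_star, l)
  }
  sort(S_star)
}
```

Note that, for brevity, the forward pass above scans only the top-$K$ ranked covariates rather than the full $F_n$, and the sketch assumes at least one decreasing EBIC step; a production implementation would guard these edge cases.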

3. Asymptotic Properties

3.1. Assumptions

To establish the asymptotic properties of our proposed method, we introduce the following notation and assumptions. Let $\Sigma = E(ww^\top)$, $Z = W\Sigma^{-1/2}$, $z = \Sigma^{-1/2}w$, and $t_n = p_n\kappa_n$, where $w = (w_1^\top, \ldots, w_{p_n}^\top)^\top$ with $w_j = B(x_j)$. We write $\Sigma_T = E(n^{-1}W_T^\top W_T)$. Let $\mathcal{H}_r$ denote the space of functions whose $d$-th order derivative is Hölder continuous of order $v$, i.e., $\mathcal{H}_r = \{h(z) : |h^{(d)}(a) - h^{(d)}(a')| \le C|a - a'|^v,\ \forall a, a' \in [0,1]\}$, where $h^{(d)}(\cdot)$ is the $d$-th derivative of $h(\cdot)$ and $r = d + v$. If $v = 1$, $h^{(d)}(\cdot)$ is Lipschitz continuous. Let $s_n$ be the cardinality of $S$. The following assumptions are required:
A1. Assume $z$ has a spherically symmetric distribution and there exist some positive constants $c_1$ and $C_1$ such that
$$P\Big(\lambda_{\min}(t_n^{-1}ZZ^\top) \le c_1^{-1} \ \text{ or } \ \lambda_{\max}(t_n^{-1}ZZ^\top) > c_1\Big) \le 2\exp(-C_1 n).$$
A2. Assume there exists some positive constant $C^*$ such that, for any $a \in \mathbb{R}$,
$$\max_{i=1,\ldots,n} E\big\{\exp(a\varepsilon_i) \mid x_i\big\} \le \exp(C^* a^2/2).$$
A3. Assume that (i) there exists some $r \ge 2$ such that $m_j \in \mathcal{H}_r$ and $\kappa_n = O(n^{1/(2r+1)})$ for any $j \in S$; (ii) $\sum_{j\in S} E|m_j(x_j)|^2 \le c_2 s_n$; (iii) $\lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma) \le c_3 n^{\tau}$, where $c_2, c_3$ are some positive constants and $\tau \ge 0$.
A4. (i) $c_4^{-1}\kappa_n^{-1} \le \lambda_{\min}\big(E(B(x_j)B(x_j)^\top)\big) \le \lambda_{\max}\big(E(B(x_j)B(x_j)^\top)\big) \le c_4\kappa_n^{-1}$ for some positive constant $c_4$; (ii) $\min_{j\in S}\{E|m_j(x_j)|^2\}^{1/2} \ge d_n$ for some positive sequence $d_n \to 0$; (iii) $\kappa_n^{r-1/2} \ge d_n^{-1} n^{2\tau} s_n\sqrt{\log n}$ and $\log(t_n) = o\big(d_n^2 n^{1-4\tau}\kappa_n^{-2}s_n^{-2}(\log n)^{-1}\big)$.
A5. (i) $\mathrm{Var}(y_1) = O(\kappa_n s_n^2 n^{3\tau}\log(n))$; (ii) for any integer $N$ with $s_n < N \le s_n\log n$, there exists a positive constant $c_6 > 0$ such that
$$c_6 n^{-\tau}\kappa_n^{-1} \le \lambda_{\min}(\Sigma_T)$$
holds uniformly in $T \subseteq F_n$ satisfying $|T| \le N$ and $S \subseteq T$.
Assumptions A1 and A3(iii) are similar to Assumptions 1 and 3 of [18]. Assumption A2 is the same as Assumption A3 of [11] and means that the random error follows a sub-Gaussian distribution. Assumption A3(i) is a common assumption in the literature for polynomial spline bases, A3(ii) gives an upper bound on the total signal, and A3(iii) gives an upper bound on the condition number. In addition, Assumptions A3(ii)–(iii) are implied by Assumption A2 in [11]. Assumption A4(i), together with the assumption $\mathrm{Var}(y_1) = O(1)$, which is stronger than A5(i), is also imposed in [11] to achieve the consistency of variable selection; they also assumed that A5(ii) holds. Assumptions A4(ii) and (iii) give the lower and upper bounds on the minimal signal and the dimensionality of the design matrix $W$.

3.2. Main Theorems

Theorem 1.
If Assumptions A1–A4 hold, then
$$P\Big(\min_{j\in S}\|\hat\theta_j\|_2 > \max_{j\in S^c}\|\hat\theta_j\|_2\Big) \to 1.$$
Alternatively, we can choose a sub-model $F_{d_n}$ with $d_n = O(n^{\iota})$ for some $0 < \iota < 1$ such that
$$P\big(S \subseteq F_{d_n}\big) \to 1.$$
Theorem 1 states the consistency of preserving order in sorting; that is, $\hat\theta$ can completely separate the unimportant and important variables with probability tending to one. For linear models, Theorem 1 is in line with Theorem 2 in [18], which is a special case of our theorem.
Theorem 2.
If Assumptions A1–A5 hold, then
$$P\big(\hat{S} = S\big) \to 1.$$
The screening methods in [7,11,12] adopt a forward selection algorithm, which means that later results are affected by the results of the previous steps. This not only brings a heavy computational burden but also yields overfitted screening results, with $P(S \subseteq \hat{S}) \to 1$. Compared with this result, Theorem 2 gives the strong consistency of screening, $P(\hat{S} = S) \to 1$.

4. Simulations

In this section, we investigate the finite-sample performance of our proposed method and compare it with the following two procedures: forward additive regression (FAR) in [11] and cumulative divergence-based forward regression (C-FS) in [12]. We choose $\lambda = 1$, $L = 5$ (suggested by [10]), $\gamma = 0.5$ (suggested by [20]), and $\kappa_n = \lfloor n^{1/5}\rfloor + 2$ (suggested by [11]) for the GRIE-EBIC algorithm, where $\kappa_n$ is the dimension of the B-spline basis space and $\lfloor n^{1/5}\rfloor$ is the greatest integer not exceeding $n^{1/5}$.
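In R, these defaults amount to the following (illustrative only):

```r
# Tuning parameters used for the GRIE-EBIC algorithm in the simulations
n       <- 300
lambda  <- 1                      # ridge parameter
L       <- 5                      # stop after L consecutive EBIC increases
gamma   <- 0.5                    # EBIC constant
kappa_n <- floor(n^(1/5)) + 2     # B-spline dimension: floor(300^{1/5}) + 2 = 5
```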
Three criteria are adopted to evaluate the performance of variable selection for the additive model (1). The true positive (TP) is the number of true variables that are identified as true variables in the selected model, and the false positive (FP) is the number of noise variables that are misclassified as true variables in the selected model. Together, TP and FP reflect the accuracy of the variable selection methods for the selected sub-models. In addition, we use computation time as a third criterion to reflect the efficiency of the different methods. Our proposed method, GRIE, is more computationally efficient than FAR and C-FS, since their computational complexities are $O(n^2 p_n\kappa_n)$, $O(n^3 p_n)$, and $O(Tn^3 p_n)$, respectively, where $T$ is the number of repetitions of the bootstrap procedure in the C-FS method. This comparison of computational complexities highlights the time efficiency of GRIE, which is further demonstrated by the simulation results in Table 1 and Table 2.
The following examples examine the effect of different dimensions and different correlations between covariates for the three procedures above. Given two different dimensions and three different correlation levels between any two predictors, the errors follow the standard normal $N(0,1)$ and the chi-square $0.5\chi_2^2$ distributions. In each example, we generate 100 random samples, each of size $n = 300$. The data generation is implemented in R (with the package "MASS" where needed) using: (1) "rnorm": draws from a normal distribution; (2) "mvrnorm": draws from a multivariate normal distribution; (3) "rchisq": draws from a chi-square distribution; (4) "runif": draws from a uniform distribution.
Example 1.
We generate $n$ samples from the following nonparametric additive model:
$$y = m_1(x_1) + m_2(x_2) + m_3(x_3) + m_4(x_4) + \epsilon,$$
where $m_1(x) = 0.75\exp(x)$, $m_2(x) = x^2$, $m_3(x) = 3\sin(x)$, $m_4(x) = 2x$, and $(x_1, x_2, \ldots, x_{p_n})^\top$ follows a multivariate normal distribution $N(0, \Sigma)$. In this example, $\Sigma = (\sigma_{ij})$ is specified under the following two cases: (1) autoregressive (AR) structure, $\sigma_{ij} = \rho^{|i-j|}$; (2) compound symmetry (CS) structure, namely, $\sigma_{ij} = \rho$ if $i \ne j$ and $\sigma_{ij} = 1$ otherwise. The parameter $\rho$, which controls the strength of the correlation between any two predictors, is set to 0.3, 0.6, and 0.9.
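For illustration, one replicate of this design under the AR structure could be generated in R as follows (seeds and exact settings here are ours):

```r
library(MASS)

# Example 1 data: AR(1) covariance with correlation rho, normal errors
n <- 300; p_n <- 500; rho <- 0.6
Sigma <- rho^abs(outer(1:p_n, 1:p_n, "-"))       # sigma_ij = rho^|i - j|
X <- mvrnorm(n, mu = rep(0, p_n), Sigma = Sigma)
y <- 0.75 * exp(X[, 1]) + X[, 2]^2 + 3 * sin(X[, 3]) + 2 * X[, 4] + rnorm(n)
```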
Table 1, Table 2, Table 3 and Table 4 summarize the results for the additive model in Example 1. Under the settings $\rho = 0.3$ and $0.6$, our proposed method and the FAR method identify all important features and keep the FP value close to zero under both the AR and CS structures, whereas C-FS does not. Even so, the FAR method has the longest computation time among the three methods. Furthermore, when there are strong correlations between covariates ($\rho = 0.9$), all three methods identify the important variables less well, especially the FAR and C-FS methods. In this situation, compared with the other two methods, our method has the highest TP and the shortest computation time. To assess the stability of our method, we report in Table 3 and Table 4 the empirical probabilities, over the 100 replications, that each important covariate and that all important covariates are retained, where $P_j$ and $P_{\text{all}}$ are the empirical probabilities that the $j$th important covariate and that all important covariates, respectively, are retained in the selected sub-model. According to Table 3 and Table 4, $P_{\text{all}}$ is below 0.3 for FAR and C-FS, while the $P_{\text{all}}$ of GRIE is at least 0.70. In addition, the $P_j$ values of our method are the best among the three methods in high-dimensional settings. Hence, we conclude that our proposed GRIE method performs robustly for model selection in nonparametric additive models under high-dimensional settings.
Example 2.
In this example, we consider a linear model with a group structure given by
$$y = \sum_{i=1}^{p_n} \beta_i x_i + \epsilon$$
with the predictors generated by the following process:
$$x_i = z_1 + z + w_i,\ i = 1, 3; \qquad x_i = z_2 + z + w_i,\ i = 2, 4; \qquad x_5, \ldots, x_{p_n} \overset{i.i.d.}{\sim} N(0, 1),$$
where $w_1, \ldots, w_4 \overset{i.i.d.}{\sim} U(0,1)$, $z_1, z_2 \overset{i.i.d.}{\sim} U(0,1)$, and the common component $z \sim N(0, \delta^2)$. The variance parameter $\delta$ is set to 0.4, 0.6, and 0.8 to control the strength of the group structure. The true values of the coefficients are $\beta_i = 3$ for $i = 1, \ldots, 4$ and $\beta_i = 0$ for $i = 5, \ldots, p_n$.
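One replicate of this grouped design could be generated in R as follows (an illustrative sketch):

```r
# Example 2 data: four grouped predictors sharing a common component z,
# the remaining predictors independent standard normal noise
n <- 300; p_n <- 500; delta <- 0.6
beta <- c(rep(3, 4), rep(0, p_n - 4))
z  <- rnorm(n, sd = delta); z1 <- runif(n); z2 <- runif(n)
w  <- matrix(runif(n * 4), n, 4)
X  <- cbind(z1 + z + w[, 1], z2 + z + w[, 2], z1 + z + w[, 3], z2 + z + w[, 4],
            matrix(rnorm(n * (p_n - 4)), n, p_n - 4))
y  <- drop(X %*% beta) + rnorm(n)
```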
We also conducted simulations with normal errors and chi-square errors for Example 2 and found that the performances under the two error distributions were very close. Therefore, we omit the results for chi-square errors to save space and report only the results for normal errors in Table 5 and Table 6. We find that FAR's performance in identifying important features deteriorates the most as the correlation between groups increases; the performances of C-FS and of our GRIE method also worsen when $\delta$ exceeds 0.6, while GRIE still performs better even when there is strong correlation among the covariates. These phenomena are further explained by Table 6: when $\delta$ exceeds 0.6, FAR and C-FS no longer screen the important covariates with overwhelming empirical probability, which results in a decrease in their TP and $P_{\text{all}}$ values. Our proposed method, however, remains relatively robust to different values of $\delta$ in terms of TP and $P_{\text{all}}$.

5. Real Data

5.1. Boston Housing Data

We use the Boston housing dataset to further illustrate the performance of our proposed method. The dataset contains MEDV (the median value of owner-occupied homes) in 506 U.S. census tracts of Boston from the 1970 census, together with 13 other variables that explain the variation in housing value. The 13 explanatory variables are RM (average number of rooms per dwelling), AGE (proportion of owner-occupied units built prior to 1940), RAD (index of accessibility to radial highways), TAX (full-value property-tax rate per USD 10,000), PTRATIO (pupil-teacher ratio by town), B ($1000(\mathrm{Bk} - 0.63)^2$, where Bk is the proportion of blacks by town), LSTAT (lower status of the population), CRIM (per capita crime rate by town), ZN (proportion of residential land zoned for lots over 25,000 square feet), INDUS (proportion of non-retail business acres per town), CHAS (Charles River dummy variable), NOX (nitric oxides concentration, parts per 10 million), and DIS (weighted distances to five Boston employment centers). To simplify notation, we denote the covariates RM, AGE, RAD, TAX, PTRATIO, B, LSTAT, CRIM, ZN, INDUS, CHAS, NOX, and DIS by $x_1, \ldots, x_{13}$. To study the relationship between MEDV and the above 13 variables, we consider the following nonparametric additive model:
$$y = \sum_{j=1}^{13} m_j(x_j) + \epsilon, \qquad (8)$$
where $y$ is $\log(\mathrm{MEDV})$. In order to extend the above model to a high-dimensional setting, following [21], we generate artificial noise variables $x_j$ defined as
$$x_j = \frac{Z_j + 2W}{3}$$
for $j = 14, \ldots, 1000$, which we add to (8), where $Z_{14}, \ldots, Z_{1000} \overset{i.i.d.}{\sim} N(0,1)$ and $W \sim U(0,1)$.
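In R, this setup can be sketched as follows (the Boston data ship with the MASS package; reading the noise construction above as $(Z_j + 2W)/3$ is our interpretation):

```r
library(MASS)                     # provides the Boston housing data
data(Boston)

y      <- log(Boston$medv)        # response: log(MEDV)
X_real <- as.matrix(Boston[, setdiff(names(Boston), "medv")])   # 13 real covariates
# Append artificial noise covariates x_14, ..., x_1000
n <- nrow(Boston); W <- runif(n)
X_noise <- sapply(14:1000, function(j) (rnorm(n) + 2 * W) / 3)
X <- cbind(X_real, X_noise)
```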
We use FAR, C-FS, and GRIE to identify important variables in the above additive model (8) with the full dataset. The results are as follows.
(i) Under FAR, 3 covariates, $\{x_1, x_7, x_8\}$, are selected, denoted by "model ($A_1$)".
(ii) Under GRIE, we obtain 6 covariates, $\{x_1, x_5, x_6, x_7, x_8, x_{12}\}$, denoted by "model ($B_1$)".
(iii) Under C-FS, 15 covariates are chosen, namely $\{x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{12}, x_{13}, x_{156}, x_{377}, x_{737}, x_{859}\}$, denoted by "model ($C_1$)".
These three sub-models are nested, with $A_1 \subset B_1 \subset C_1$, and we want to investigate which model fits this dataset best. The nondegenerate Vuong test of [22] is used here to compare two nested models; its null hypothesis is that the two models are equivalent. We first compare model ($A_1$) with model ($B_1$) by the Vuong test: the p-value is 0.001, so the null hypothesis is rejected, indicating that model ($B_1$) is better than model ($A_1$) since ($A_1$) is nested in ($B_1$). We then compare model ($B_1$) with model ($C_1$): the corresponding p-value of the Vuong test equals 0.981, so the null hypothesis is not rejected and models ($B_1$) and ($C_1$) are equivalent. However, model ($B_1$) has a smaller model size than model ($C_1$). Therefore, model ($B_1$) is the most suitable working model for the Boston housing dataset, which indicates that GRIE performs best in identifying the important variables among the three variable selection methods.
To further demonstrate our results, we compare FAR, C-FS, and GRIE through their prediction errors. To this end, we randomly generate 100 splits, in each of which the full sample is randomly partitioned into training and validation sets with a size ratio of 4:1. The training sets are used for variable selection, and the validation sets for estimating the prediction error. We centralize the response variable $y$ and choose cubic splines ($\kappa_n = 3$) to approximate the additive functions. The average model size, the number of selected noise variables (SNV), and the adjusted mean prediction error (A-PE) are used to evaluate the performance of the three methods. All results are reported in Table 7.
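One such split-and-evaluate step could be sketched in R as follows; pe_once is our own illustrative helper, which refits a cubic-spline additive model on a given selected index set S_hat and reports the validation prediction error.

```r
library(splines)

# One illustrative 4:1 train/validation split and prediction error for a given
# selected index set S_hat (e.g., the output of the GRIE-EBIC algorithm).
pe_once <- function(X, y, S_hat, kappa_n = 3) {
  n <- nrow(X)
  train <- sample(n, floor(0.8 * n)); valid <- setdiff(seq_len(n), train)
  dat <- data.frame(y = y - mean(y), X[, S_hat, drop = FALSE])
  colnames(dat) <- c("y", paste0("x", seq_along(S_hat)))
  rhs <- paste0("bs(x", seq_along(S_hat), ", df = ", kappa_n, ")", collapse = " + ")
  fit <- lm(as.formula(paste("y ~", rhs)), data = dat[train, ])
  mean((dat$y[valid] - predict(fit, newdata = dat[valid, ]))^2)
}
```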
From Table 7, we observe the following. (1) The model sizes of our method GRIE and of FAR are both smaller than that of C-FS, but the A-PE of FAR is the largest among the three methods, which suggests that FAR may fail to identify some important variables. To verify this, we report in Table 8 the frequency with which each of the 13 real covariates is selected over the 100 replications. Table 8 shows that RM and LSTAT are selected by all methods in every repetition. Except for the FAR method, PTRATIO, B, and CRIM are selected by GRIE and C-FS with high frequency. It is seen that the pupil-teacher ratio, the proportion of blacks, and the per capita crime rate are key factors affecting housing prices; however, FAR misses these important variables. (2) The SNV values of both our method GRIE and FAR are 0, which means that they successfully exclude all artificial variables.
In summary, compared with C-FS and FAR, our method has the smallest A-PE, the smallest SNV, and a simple model, which implies our method has better performance in feature screening under high-dimensional settings.

5.2. Arabidopsis thaliana Gene Data

We now turn to the Arabidopsis thaliana gene data to illustrate the screening performance of our method. This dataset was developed by Wille et al. [23], who detected modules of closely connected isoprenoid genes in Arabidopsis thaliana. It is available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC545783 (accessed on 16 November 2022) and comprises 834 genes from 58 different pathways measured in 118 samples. Chen et al. [24] found that GGPPS11 plays an essential role in the generation of GGPP, the common precursor of several biologically important compounds (such as carotenoids, chlorophylls, and gibberellins) in Arabidopsis. Our goal is to identify the effects of the remaining 833 genes on the expression value of the gene GGPPS11.
Following Wille et al. [23], the downloaded data $\mathcal{R} = \{y, x_1, \ldots, x_{833}\}$ had been converted to per mille values (i.e., scaled by 1000); to recover the original scale, we model $0.001\mathcal{R}$ here and consider the corresponding nonparametric additive model:
$$y = \sum_{j=1}^{833} m_j(x_j) + \epsilon,$$
where $y$ is the expression value of the gene GGPPS11 and $\{x_1, \ldots, x_{833}\}$ are the expression values of the remaining 833 genes. Next, we apply the above additive model to the full dataset to identify the important variables with the three methods mentioned above. The results are as follows:
(i) Under FAR, one gene, $\{x_{72}\}$, is selected, denoted by "model ($A_2$)";
(ii) Under GRIE, three genes, $\{x_{140}, x_{571}, x_{560}\}$, are chosen, denoted by "model ($B_2$)";
(iii) Under C-FS, nine genes are chosen, namely $\{x_{72}, x_{105}, x_{191}, x_{476}, x_{510}, x_{517}, x_{554}, x_{658}, x_{800}\}$, denoted by "model ($C_2$)".
Again, using the nondegenerate Vuong test of Liao and Shi [22], we compare models ($A_2$) and ($B_2$). The corresponding p-value of the test is 0.012, indicating that these two models are not equivalent at the 5% significance level. We then compare model ($B_2$) with ($C_2$); the p-value is 0, so models ($B_2$) and ($C_2$) are also not equivalent at the 5% significance level.
Lastly, as in the first real data example, we compare FAR, C-FS, and GRIE through their prediction errors. Again, we randomly divide the full dataset into training and validation sets with a ratio of 4:1 and repeat this process 100 times. Here, we also centralize the response variable $y$ and set $\kappa_n = 3$. For this dataset, we use the average model size and the A-PE to evaluate the performance of the three methods. The results are shown in Table 9. We conclude that our proposed method has the smallest model size, the strongest predictive ability, and outstanding performance in identifying important covariates compared with the other two methods.

6. Conclusions

In this paper, we propose a novel variable screener (GRIE) for high-dimensional nonparametric additive models, which combines nonparametric smoothing ridge estimation with group information. We note that our paper is among the first to dispense with the marginal correlation assumption. Without this assumption, the proposed screener can completely separate the unimportant and important variables with probability tending to one. Compared with iterative sure independence screening and forward screening, the proposed screener essentially eliminates the computational burden and achieves strong sure screening consistency. Furthermore, it allows the covariates to be strongly correlated and performs better than its competitors. For these reasons, combining the strong sure screening property of GRIE with the model selection property of the EBIC, we propose the GRIE-EBIC method to further eliminate the noise variables and improve the accuracy of model selection. Theoretically, we establish the strong consistency of model selection for the GRIE-EBIC method, which shows that our proposed method achieves ideal model selection results.
We conclude this paper with a discussion of directions for future research. One direction is nonparametric additive models with interaction effects between covariates, which are defined as
$$E(y \mid x) = \sum_{1\le j < k \le p_n} m_{j,k}(x_j, x_k),$$
where $x_j$ is the $j$th element of $x$. These models generalize linear models with two-way interaction effects [25] and are more flexible for capturing the interactions between covariates. One potential approach may be to use tensor-product spline bases to approximate each nonparametric function $m_{j,k}(\cdot, \cdot)$ (a short illustrative sketch is given at the end of this section). The other direction is to study how to apply our methodology to nonparametric generalized additive models [26,27], which admit
$$G\{E(y \mid x)\} = \sum_{j=1}^{p_n} m_j(x_j),$$
where $x_j$ is the $j$th element of $x$ and $G(\cdot)$ is the link function. Since nonparametric smoothing ridge estimation has outstanding performance in nonparametric additive models, its performance in generalized additive models may be worth investigating.
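As hinted above, a tensor-product B-spline basis for a single bivariate component could be assembled as follows (a minimal R sketch with our own naming; the paper does not implement this extension):

```r
library(splines)

# Tensor-product B-spline basis for a bivariate component m_{j,k}(x_j, x_k):
# row-wise products of the two univariate bases span the bivariate spline space.
tensor_basis <- function(xj, xk, df = 5) {
  Bj <- bs(xj, df = df); Bk <- bs(xk, df = df)
  do.call(cbind, lapply(seq_len(ncol(Bj)), function(a) Bj[, a] * Bk))
}
```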

Author Contributions

Conceptualization, X.J. and J.L.; methodology, H.W. and X.J.; software, H.W. and H.J.; resources, J.L.; data curation, H.J.; writing—original draft preparation, H.W.; supervision, X.J.; funding acquisition, X.J. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

The work of Jiang is partially supported by the National Natural Science Foundation of China (11871263) and the Shenzhen Sci-Tech Fund No. JCYJ20210324104803010. The work of Li is partially supported by the NSF of China No. 11971221, the Guangdong NSF Major Fund No. 2021ZDZX1001, and the Shenzhen Sci-Tech Fund No. RCJC20200714114556020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Boston housing dataset is available in the R package “MASS”. Arabidopsis thaliana gene data are available on the website https://www.ncbi.nlm.nih.gov/pmc/articles/PMC545783 (accessed on 16 November 2022).

Acknowledgments

We would like to thank the editor and four referees for their valuable comments and suggestions, which led to a substantial improvement of this article.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Now we give the technical proofs of our theorems. To streamline the arguments, we introduce some notation and technical lemmas. Define $v = (v_1, \ldots, v_n)^\top$ with $v_i = \sum_{j=1}^{p_n} m_j(x_{i,j}) - w_i^\top\theta^*$. Denote $\xi_i = e_i^\top W^\top(WW^\top)^{-1}W\theta^*$, $\eta_i = e_i^\top W^\top(WW^\top + \lambda I_n)^{-1}\varepsilon$ with $\varepsilon = (\epsilon_1, \ldots, \epsilon_n)^\top$, and $\zeta_i = e_i^\top W^\top(WW^\top + \lambda I_n)^{-1}v$.
Lemma A1.
Under Assumptions A1 and A3, the following conclusions hold:
(i) for any $C > 0$ and any fixed vector $b$ with $\|b\|_2 = 1$, there exist constants $c_1$ and $c_2$ with $0 < c_1 < 1 < c_2$ such that
$$P\left(b^\top P_{\lambda,W}\,b < \frac{c_1 n^{1-\tau}}{t_n} \ \text{ or } \ b^\top P_{\lambda,W}\,b > \frac{c_2 n^{1+\tau}}{t_n}\right) \le 4\exp(-Cn);$$
(ii) for any $C > 0$, there exists a positive constant $M > 0$ such that
$$P\left(|e_i^\top P_{\lambda,W}\,e_j| > \frac{M n^{1+\tau-\alpha}}{t_n\sqrt{\log n}}\right) = O\left(\exp\left(-\frac{C n^{1-2\alpha}}{2\log n}\right)\right)$$
holds for any $0 \le \alpha < 1/2$ and $1 \le i \ne j \le t_n$;
(iii) for any $1 \le i \le t_n$, the following inequality holds:
$$P\left(\|(WW^\top + \lambda I_n)^{-1}We_i\|_2^2 > \frac{c_2 c_1 c_3\,\kappa_n n^{1+2\tau}}{t_n^2}\right) \le 3\exp(-C_1 n).$$
Proof 
(Proof of Lemma A1). Similar to the proof of Theorem 3 in [18], we can show that Lemma A1 holds. □
Lemma A2.
Under Assumptions A1–A4, the following conclusions hold:
(i) $d_n \gg \kappa_n^{-r}$ and $d_n \gg n^{-1/2+2\tau}\sqrt{\log n}$;
(ii) $\|v\|_2 \le c_n s_n n^{1/2}\kappa_n^{-r}$ for some $c_n > 0$, $\|\theta_j^*\|_2 \ge 0.5\,c_4^{-1/2}\kappa_n^{1/2}\min_{j\in S}\{E|m_j(x_j)|^2\}^{1/2}$, and $\sum_{j\in S}\|\theta_j^*\|_2^2 \le 3c_2c_4 s_n\kappa_n$;
(iii) $P\big(|\eta_i| \ge \sqrt{c_2c_1c_3C^*}\,d_n(\log n)^{-1/2}n^{1-\tau}t_n^{-1}\big) \le 2\exp(-c_0\kappa_n^{-1}d_n^2 n^{1-4\tau}/\log n)$ for some constant $c_0 > 0$;
(iv) $P\big(|\zeta_i| \ge \sqrt{c_2c_1c_3}\,c_n s_n\kappa_n^{1/2-r}n^{1+\tau}t_n^{-1}\big) \le 3\exp(-C_1 n)$.
Proof 
(Proof of Lemma A2). (i) Lemma A2(i) follows from Assumptions A3 and A4.
(ii) By Assumption A3(i) and Corollary 6.21 of [28], we can obtain
$$\sup_{x,j}\big|m_j(x) - B(x)^\top\theta_j^*\big| \le c_n\kappa_n^{-r},$$
and
| { E | B ( x j ) θ j * | 2 } 1 / 2 { E | m j ( x j ) | 2 } 1 / 2 | = | E | B ( x j ) θ j * | 2 E | m j ( x j ) | 2 | { E | B ( x j ) θ j * | 2 } 1 / 2 + { E | m j ( x j ) | 2 } 1 / 2 sup x , j | m j ( x ) B ( x ) θ j * | { E | m j ( x j ) | + E | B ( x j ) θ j * | } { E | B ( x j ) θ j * | 2 } 1 / 2 + { E | m j ( x j ) | 2 } 1 / 2 = O ( κ n r ) .
This combined with min j S { E | m j ( x j ) | 2 } 1 / 2 d n , d n κ n r in Lemma A2(i), and
θ j * 2 2 λ max 1 ( E ( B ( x j ) B ( x j ) ) ) E | B ( x j ) θ j * | 2 c 4 1 κ n ( E | B ( x j ) θ j * | 2 )
by noticing λ max ( E ( B ( x j ) B ( x j ) ) ) c 4 κ n 1 , yields that
v 2 = O ( s n n 1 / 2 κ n r ) and θ j * 2 0.5 c 4 1 / 2 κ n 1 / 2 { E | m j ( x j ) | 2 } 1 / 2
for any j S . By (A1) and λ min ( E ( B ( x j ) B ( x j ) ) ) c 4 1 κ n 1 , we have
θ j * 2 2 λ min 1 ( E ( B ( x j ) B ( x j ) ) ) E | B ( x j ) θ j * | 2 2 c 4 κ n { E | B ( x j ) θ j * m j ( x j ) | 2 + E | m j ( x j ) | 2 } = O ( κ n 1 2 r ) + 2 c 4 κ n E | m j ( x j ) | 2 .
It follows from assumption A3(i)-(ii) that
j S θ j * 2 2 O ( s n κ n 1 2 r ) + 2 c 4 κ n j S E | m j ( x j ) | 2 3 c 2 c 4 s n κ n .
(iii) It is noticed that
η i = e i W ( W W + λ I n ) 1 ε = ( W W + λ I n ) 1 W e i 2 a ε ,
where
a = ( W W + λ I n ) 1 W e i / ( W W + λ I n ) 1 W e i 2 .
Using Lemma A1, for some C 1 > 0 , we have
P a P λ , W a > c 2 n 1 + τ t n 4 exp ( C 1 n )
and
P ( W W + λ I n ) 1 W e i 2 2 > c 2 c 1 c 3 κ n n 1 + 2 τ t n 2 3 exp ( C 1 n ) .
By Assumption A2 and Proposition 3 of [4], we obtain
P P a ε 2 2 > C * h ( t ) ( 1 + t ) 1 / 2 exp ( t / 2 )
for any t > 2 , where
h ( t ) = ( 1 + t ) { 1 2 / ( exp ( t / 2 ) 1 + t 1 ) } 2 .
Let χ n = 0.9 κ n 1 d n 2 n 1 4 τ / log n . We have h ( χ n ) κ n 1 d n 2 n 1 4 τ / log n for sufficient large n since d n κ n 1 / 2 n 1 / 2 2 τ / log n . Therefore, there exists some positive constant c 0 < 0.45 such that
P | a ε | > C * 1 / 2 d n κ n 1 / 2 n 1 / 2 2 τ / log n = P P a ε 2 2 > C * κ n 1 d n 2 n 1 4 τ / log n P P a ε 2 2 > C * h ( χ n ) ( 1 + χ n ) 1 / 2 exp ( χ n / 2 ) exp ( c 0 κ n 1 d n 2 n 1 4 τ / log n )
for sufficient large n. This, combined with (A2) and (A3), leads to
P | η i | c 2 c 1 C * c 3 d n ( log n ) 1 / 2 n 1 τ t n 1 2 exp ( c 0 κ n 1 d n 2 n 1 4 τ / log n ) .
(iv) From Lemmas A2(ii) and (A3), we have
P | ζ i | c 2 c 1 c 3 c n κ n 1 / 2 r n 1 + τ t n 1 s n 3 exp ( C 1 n ) .
This completes the proof of Lemma A2. □
Proof 
(Proof of Theorem 1). From the definition of $\hat\theta_j$ in (5), we have
$$\hat\theta_j = W_j^\top(WW^\top + \lambda I_n)^{-1}Y = W_j^\top(WW^\top + \lambda I_n)^{-1}W\theta^* + W_j^\top(WW^\top + \lambda I_n)^{-1}v + W_j^\top(WW^\top + \lambda I_n)^{-1}\varepsilon \equiv \tilde\theta_j + E_{1,j} + E_{2,j}.$$
Next, we divide the proof into four parts.
Part (I): In this part, we establish the upper bound of max j S c E 1 , j + E 2 , j 2 .
By noticing E 2 , j 2 κ n 1 / 2 max 1 i t n | η i | , we have
P max 1 j p n E 2 , j 2 c κ n 1 / 2 d n n 1 τ t n log n P max 1 i t n | η i | c d n n 1 τ t n log n i = 1 t n P | η i | c d n n 1 τ t n log n .
It follows from Lemma A2 that, for some constants c and c 0 ,
P max 1 j p n E 2 , j 2 c κ n 1 / 2 d n n 1 τ t n log n 2 t n exp ( c 0 κ n 1 d n 2 n 1 4 τ ( log n ) 1 ) exp ( 0.5 c 0 κ n 1 d n 2 n 1 4 τ ( log n ) 1 ) ,
where the last inequality holds due to log ( t n ) = o ( κ n 1 d n 2 n 1 4 τ ( log n ) 1 ) . Similarly, by Lemma A2, E 1 , j 2 κ n 1 / 2 max 1 i t n | ζ i | , and Bonferroni’s inequality, there exists some constant c * such that
P max 1 j p n E 1 , j 2 c * κ n 1 r n 1 + τ s n t n 3 t n exp ( C 1 n ) 3 exp ( 0.5 C 1 n ) .
By noticing κ n r 1 / 2 d n / ( n 2 τ s n log n ) , we obtain
P max 1 j p n E 1 , j + E 2 , j 2 ( c + c * ) κ n 1 / 2 d n n 1 τ t n log n 2 exp ( 0.5 c 0 κ n 1 d n 2 n 1 4 τ ( log n ) 1 ) .
Part (II): In this part, we establish the upper bound of max j S c θ ˜ j 2 . For 1 j t n , there exists index set M j { 1 , , t n } such that θ j = θ M j , where θ M j is the sub-vector of θ formed by all components with indexes in M j . Denoted by M = j S M j and θ = ( θ 1 , , θ t n ) with t n = p n κ n . By Cauchy–Schwarz’s inequality, Lemma A2(ii), and Assumption A4(ii), we obtain that
θ ˜ j 2 κ n max i M j | k M e i P λ , W e k θ k * | s n κ n 2 θ * 2 max 1 i k t n | e i P λ , W e k | 3 c 2 c 4 s n 2 κ n 3 max 1 i k t n | e i P λ , W e k |
for j S c , where c 2 and c 4 are defined in Assumptions A3 and A4. It follows from Lemma A1 and Bonferroni inequalities that, for some constants M , C 1 > 0 ,
P max 1 i k t n | e i P λ , W e k | > M n 1 + τ α t n log n 1 i k t n P | e i P λ , W e k | > M n 1 + τ α t n log n = O exp 2 log t n C 1 n 1 2 α 2 log n ,
holds for any 0 α < 1 / 2 . By taking n α = d n 1 κ n s n n 2 τ and assumption log ( t n ) = o d n 2 n 1 4 τ κ n 2 s n 2 log n in A4 (iii), we can obtain
P max j S c θ ˜ j 2 > 3 c 2 c 4 κ n M n 1 τ d n t n log n O exp 2 log t n C 1 d n 2 n 1 4 τ 2 κ n 2 s n 2 log n = O exp C 1 d n 2 n 1 4 τ 3 κ n 2 s n 2 log n .
Part (III): In this part, we establish the lower bound of min j S θ ˜ j 2 .
From the triangle inequality, we have
min j S θ ˜ j 2 = min j S W j ( W W + λ I n ) 1 W j θ j * + k j , k S W j ( W W + λ I n ) 1 W k θ k * 2 min j S W j ( W W + λ I n ) 1 W j θ j * 2 max j S k j , k S W j ( W W + λ I n ) 1 W k θ k * 2 I n , 1 I n , 2 .
With the same arguments as (A5), we can establish that
P I n , 2 > 3 c 2 c 4 κ n M n 1 τ d n t n log n = O exp C 1 d n 2 n 1 4 τ 3 κ n 2 s n 2 log n .
Applying equality ( a + b ) 2 a 2 / 2 b 2 and Jensen’s inequality, we can obtain
W j ( W W + λ I n ) 1 W j θ j * 2 2 = i M j ( k M j e i P λ , W e k θ k * ) 2 i M j ( e i P λ , W e i ) 2 | θ i * | 2 / 2 i M j ( k M j , k i e i P λ , W e k θ k * ) 2 min i M j ( e i P λ , W e i ) 2 θ j * 2 2 / 2 κ n i M j k M j , k i ( e i P λ , W e k ) 2 | θ k * | 2 min i M j ( e i P λ , W e i ) 2 θ j * 2 2 / 2 κ n 2 θ j * 2 2 max i , k M j , i k ( e i P λ , W e k ) 2 .
Thus,
I n , 1 2 min j S θ j 2 2 min i M ( e i P λ , W e i ) 2 / 2 κ n 2 max i , k M , i k ( e i P λ , W e k ) 2 .
Lemma A1, s n κ n = o ( n ) and Bonferroni inequalities give that, for some constants c 1 , M , α and C 1 > 0 ,
P min i M e i P λ , W e i c 1 n 1 τ t n i M P e i P λ , W e i c 1 n 1 τ t n 4 n exp ( C 1 n )
and
P max i , k M , i k | e i P λ , W e k | M n 1 + τ α t n log n i , k M , i k P e i P λ , W e k M n 1 + τ α t n log n O n exp C 1 n 1 2 α 2 log n
holds for any 0 α < 1 / 2 . Denoted by
A 1 = min i M e i P λ , W e i c 1 n 1 τ t n , A 2 = max i , k M , i k | e i P λ , W e k | M n 1 + τ α t n log n ,
and
A 3 = min i M ( e i P λ , W e i ) 2 / 2 κ n 2 max i , k M , i k ( e i P λ , W e k ) 2 > | c 1 | 2 n 2 2 τ 3 t n 2 .
By taking α = 2 τ + log n ( κ n ) , we have
P ( A 3 ) P ( A 1 c A 2 c ) 1 P ( A 1 ) P ( A 2 ) = 1 O n exp C 1 n 1 2 τ 2 κ n 2 log n .
It is obvious that min j S θ j 2 2 0.25 c 4 1 κ n d n 2 from Lemma A2(ii) and Assumption A4(ii). This, combined with (A7), yields that
P I n , 1 2 | c 1 | 2 c 4 1 κ n d n 2 n 2 2 τ 12 t n 2 1 O n exp C 1 n 1 2 τ 2 κ n 2 log n .
Similar to (A7), we can obtain
P min j S θ ˜ j 2 c 1 c 4 1 / 2 κ n 1 / 2 d n n 1 τ 12 t n 1 O exp C 1 d n 2 n 1 4 τ 3 κ n 2 s n 2 log n
by combing (A6) and (A8).
Part (IV): In this part, we show that
$$P\Big(\min_{j\in S}\|\hat\theta_j\|_2 > \max_{j\in S^c}\|\hat\theta_j\|_2\Big) \to 1.$$
Similar to (A7), by θ ^ j = θ ˜ j + E 1 , j + E 2 , j , (A4) and (A9), we can show that
P min j S θ ^ j 2 c 1 c 4 1 / 2 κ n 1 / 2 d n n 1 τ 13 t n P min j S θ ˜ j 2 max 1 j p n E 1 , j + E 2 , j 2 c 1 c 4 1 / 2 κ n 1 / 2 d n n 1 τ 14 t n 1 O exp C 1 d n 2 n 1 4 τ 3 κ n 2 s n 2 log n + exp c 0 d n 2 n 1 4 τ 2 κ n log n .
Denote by A 4 = max 1 j p n E 1 , j + E 2 , j 2 ( c + c * ) κ n 1 / 2 d n n 1 τ t n log n ,   A 5 = max j S c θ ˜ j 2 > 3 c 2 c 4 κ n M n 1 τ d n t n log n , and
A 6 = max 1 j p n E 1 , j + E 2 , j 2 + max j S c θ ˜ j 2 ( c + c * + 3 c 2 c 4 ) κ n 1 / 2 d n n 1 τ t n log n .
Since A 6 A 5 c A 4 , by (A4) and (A5), we have
P ( A 6 ) = P ( A 6 A 5 ) + P ( A 6 A 5 c ) P ( A 5 ) + P ( A 4 ) = O exp c 0 d n 2 n 1 4 τ 2 κ n log n + exp C 1 d n 2 n 1 4 τ 3 κ n 2 s n 2 log n .
Using max j S c θ ^ j 2 max 1 j p n E 1 , j + E 2 , j 2 + max j S c θ ˜ j 2 , we obtain that
P max j S c θ ^ j 2 < ( c + c * + 3 c 2 c 4 ) κ n 1 / 2 d n n 1 τ t n log n P max j S c θ ˜ j 2 + max 1 j p n E 1 , j + E 2 , j 2 < ( c + c * + 3 c 2 c 4 ) κ n 1 / 2 d n n 1 τ t n log n 1 O exp c 0 d n 2 n 1 4 τ 2 κ n log n + exp C 1 d n 2 n 1 4 τ 3 κ n 2 s n 2 log n .
Notice that d n 2 n 1 4 τ κ n 2 s n 2 log n and ( c + c * + 3 c 2 c 4 ) / log n c 1 c 4 1 / 13 for sufficient large n. This, combined with (A11) and (A12), establishes (A10). The proof is completed. □
Proof 
(Proof of Theorem 2). We divide the proof into two parts:
Part (I) shows that $P(\hat{S}^* = S) \to 1$; Part (II) shows that $P(\hat{S} = \hat{S}^*) \to 1$.
Part (I): Step (i). It is noticed that $Y = W_S\theta_S^* + v + \varepsilon$ and $P_{W_{\hat{S}_k}} - P_{W_{\hat{S}_{k-1}}} = P_{\widetilde{W}_{i_k}}$ with $\widetilde{W}_{i_k} = (I_n - P_{W_{\hat{S}_{k-1}}})W_{i_k}$. For $i_k \in S$, we obtain that
R S S k 1 R S S k = Y ( I n P W S ^ k 1 ) Y Y ( I n P W S ^ k ) Y = P W ˜ i k ( W S θ S * + v + ε ) 2 2 P W ˜ i k W S θ S * 2 2 / 2 P W ˜ i k ( v + ε ) 2 2 .
Next, let us deal with the above two terms separately. Denoted by T k = ( S S ^ k 1 ) { i k } . We have
P W ˜ i k W S θ S * 2 2 = ( P W S ^ k P W S ^ k 1 ) W S θ S * 2 2 inf t P W S ^ k W S θ S * W S ^ k 1 t 2 2 inf a P W S ^ k W i k θ i k * W T k a 2 2 .
From P W S ^ k W i k = W i k , Lemma A2, Assumption A4(ii), and i k S , we can obtain
min i k S P W ˜ i k W S θ S * 2 2 min i k S θ i k * 2 2 ( I n P W T k ) W i k 2 2 0.25 c 4 1 κ n d n 2 min i k S ( I n P W T k ) W i k 2 2 .
From Theorem 1, we have conclusion | T k { i k } | = O ( s n ) holding for ∀ i k S with probability tending to one. This, combined with Assumption A5, yields that
λ min ( n 1 W T W T ) 0.5 c 6 n τ κ n 1
with probability going to one, where W T = ( W T k , W i k ) . It follows from λ max { ( W i k W i k ) 1 }   λ max { W T W T } and (A15) that
min i k S P W ˜ i k W S θ S * 2 2 2 μ 0 d n 2 n 1 τ
with μ 0 = 0.0625 c 4 1 c 6 .
Following Lemma A2, we have that
P W ˜ i k ( v + ε ) 2 2 = 2 P W ˜ i k v 2 2 + 2 P W ˜ i k ε 2 2 2 v 2 2 + 2 P W ˜ i k ε 2 2 = O ( n κ n 2 r ) + 2 P W ˜ i k ε 2 2 .
From Assumption A2 and Proposition 3 of [4], we have
P P W ˜ i k ε 2 2 > κ n C * ( 1 + t ) { 1 2 / ( exp ( t / 2 ) 1 + t 1 ) } 2 . ( 1 + t ) 1 / 2 exp ( κ n t / 2 )
By taking t = log p n + log n 1 and applying Bonferroni inequalities, we can obtain
P max i k S P W ˜ i k ε 2 2 > β n i k S P P W ˜ i k ε 2 2 > β n i k S log p n + log n exp { κ n ( log p n + log n 1 ) / 2 } = O ( s n log p n ) exp { κ n ( log p n + log n 1 ) / 2 } 0 ,
where
β n = κ n C * ( log p n + log n 1 ) { 1 2 / ( exp ( ( log p n + log n 1 ) / 2 ) log p n + log n 1 ) } 2 .
Therefore, we establish that
max i k S P W ˜ i k ε 2 2 = o P { κ n ( log p n + log n ) } .
By κ n r 1 / 2 d n / n 2 τ and log ( t n ) = o d n 2 n 1 4 τ κ n 2 s n 2 log n in Assumption A4(ii), we obtain
P W ˜ i k ( v + ε ) 2 2 = o P ( d n 2 n 1 τ ) .
This, combined with (A13) and (A16), yields that
min i k S { R S S k 1 R S S k } μ 0 d n 2 n 1 τ
with probability going to one. Applying the inequality log ( 1 + x ) min { log 2 , 0.5 x } for x > 0 , we obtain that
log ( R S S k 1 ) log ( R S S k ) = log { 1 + ( R S S k 1 R S S k ) / R S S k } 0.5 ( R S S k 1 R S S k ) / R S S k 0.5 μ 0 d n 2 n 1 τ / R S S k ,
This combined with n 1 R S S k n 1 Y Y ¯ n 2 2 Var ( y 1 ) with Y ¯ n = n 1 i = 1 n y i , leads to
min i k S { log ( R S S k 1 ) log ( R S S k ) } 0.4 μ 0 d n 2 n τ / Var ( y 1 ) .
Noticing that log ( t n ) = o d n 2 n 1 4 τ κ n 2 s n 2 log n and Var ( y 1 ) = O ( κ n s n 2 n 3 τ log ( n ) ) and log ( f ( k + 1 ) ) log ( f ( k ) ) = O { κ n log ( p n ) } , we can obtain
E B I C k 1 E B I C k 0.4 μ 0 d n 2 n τ / Var ( y 1 ) n 1 log ( n ) + γ log ( f ( k + 1 ) log ( f ( k ) ) 0.4 μ 0 d n 2 n τ / Var ( y 1 ) n 1 O { log ( n ) + γ κ n log ( p n ) } > 0 .
Therefore, for $i_k \in S$, the conclusion
$$EBIC_k < EBIC_{k-1}$$
holds uniformly with probability going to one.
Step (ii): Let $k_0$ be an integer satisfying $S \not\subseteq \hat{S}_{k_0-1}$ and $S \subseteq \hat{S}_{k_0}$. We prove that
$$\min_{1\le j\le L}\big\{EBIC_{k_0+j} - EBIC_{k_0+j-1}\big\} > 0.$$
By log ( 1 + x ) x and log f ( k 0 + j ) f ( k 0 + j 1 ) = O { κ n log ( p n ) } , we have
E B I C k 0 + j 1 E B I C k 0 + j R S S k 0 + j 1 R S S k 0 + j R S S k 0 + j κ n log n + γ κ n log ( p n ) / n .
With the same arguments as (A17), we can show that
max 1 j L ( R S S k 0 + j 1 R S S k 0 + j ) = max 1 j L ( P W S ^ k 0 + j P W S ^ k 0 + j 1 ) ε 2 2 = o P { κ n ( log p n + log n ) } .
From (26) in [10], we have $n^{-1}RSS_{k_0+l} = E\epsilon_1^2 + o_P(1)$. Furthermore, $E\epsilon_1^2 = O(1)$ from Assumption A2. Thus,
$$P\Big(\max_{1\le j\le L}\big\{EBIC_{k_0+j-1} - EBIC_{k_0+j}\big\} < 0\Big) \to 1.$$
The combination of (A18) and (A19) leads to $P(\hat{S}^* = S) \to 1$.
Part (II): Similar to Step (ii) in Part (I), we can show that
$$\min_{\ell\in F_n\setminus\hat{S}^*}\big\{EBIC^*_\ell - EBIC^*\big\} > 0$$
with probability tending to one. This leads to $P(\hat{S} = \hat{S}^*) \to 1$. The proof is completed. □

References

  1. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  2. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  3. Candes, E.; Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat. 2007, 35, 2313–2351. [Google Scholar]
  4. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef] [Green Version]
  5. Fan, J.; Samworth, R.; Wu, Y. Ultrahigh dimensional feature selection: Beyond the linear model. J. Mach. Learn. Res. 2009, 10, 2013–2038. [Google Scholar]
  6. Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2008, 70, 849–911. [Google Scholar] [CrossRef] [Green Version]
7. Fan, J.; Feng, Y.; Song, R. Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Am. Stat. Assoc. 2011, 106, 544–557.
8. Zhu, L.; Li, L.; Li, R.; Zhu, L. Model-free feature screening for ultrahigh-dimensional data. J. Am. Stat. Assoc. 2011, 106, 1464–1475.
9. Wang, H. Forward regression for ultra-high dimensional variable screening. J. Am. Stat. Assoc. 2009, 104, 1512–1524.
10. Cheng, M.Y.; Honda, T.; Zhang, J.T. Forward variable selection for sparse ultra-high dimensional varying coefficient models. J. Am. Stat. Assoc. 2016, 111, 1209–1221.
11. Zhong, W.; Duan, S.; Zhu, L. Forward additive regression for ultrahigh dimensional nonparametric additive models. Stat. Sin. 2020, 30, 175–192.
12. Zhou, T.; Zhu, L.; Xu, C.; Li, R. Model-free forward screening via cumulative divergence. J. Am. Stat. Assoc. 2020, 115, 1393–1405.
13. Hastie, T.; Tibshirani, R. Generalized Additive Models; Chapman and Hall: New York, NY, USA, 1990.
14. Meier, K.; Van de Geer, S.; Bühlmann, P. Minimax optimal rates of estimation in high dimensional additive models. Ann. Stat. 2009, 47, 3779–3821.
15. Gregory, K.; Mammen, E.; Wahl, M. Statistical inference in sparse high-dimensional additive models. Ann. Stat. 2021, 49, 1514–1536.
16. Lu, J.; Kolar, M.; Liu, H. Kernel meets sieve: Post-regularization confidence bands for sparse additive model. J. Am. Stat. Assoc. 2020, 115, 2084–2099.
17. Bai, R.; Moran, G.; Antonelli, J.; Cheng, Y.; Boland, M. Spike-and-slab group lassos for grouped regression and sparse generalized additive models. J. Am. Stat. Assoc. 2022, 117, 184–197.
18. Wang, X.; Leng, C. High dimensional ordinary least squares projection for screening variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 2016, 78, 589–611.
19. Chen, J.; Chen, Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika 2008, 95, 759–771.
20. Chen, J.; Chen, Z. Extended BIC for small-n-large-P sparse GLM. Stat. Sin. 2012, 22, 555–574.
21. Fan, J.; Ma, Y.; Dai, W. Nonparametric independence screening in sparse ultra-high-dimensional varying coefficient models. J. Am. Stat. Assoc. 2014, 109, 1270–1284.
22. Liao, Z.; Shi, X. A nondegenerate Vuong test and post selection confidence intervals for semi/nonparametric model. Quant. Econ. 2020, 11, 983–1017.
23. Wille, A.; Zimmermann, P.; Vranová, E.; Fürholz, A.; Laule, O.; Bleuler, S.; Hennig, L.; Prelić, A.; Von Rohr, P.; Thiele, L.; et al. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol. 2004, 5, R92.
24. Chen, Q.; Fan, D.; Wang, G. Heteromeric geranyl (geranyl) diphosphate synthase is involved in monoterpene biosynthesis in Arabidopsis flowers. Mol. Plant 2015, 8, 1434–1437.
25. Hao, N.; Zhang, H. A note on high-dimensional linear regression with interactions. Am. Stat. 2017, 71, 291–297.
26. Hastie, T.; Tibshirani, R. Generalized additive models: Some applications. J. Am. Stat. Assoc. 1987, 82, 371–386.
27. Horowitz, J. Nonparametric estimation of a generalized additive model with an unknown link function. Econometrica 2001, 69, 499–513.
28. Schumaker, L.L. Spline Functions: Basic Theory; Cambridge University Press: Cambridge, UK, 2007.
Table 1. Average numbers of true positives (TP), false positives (FP), and calculation time over 100 repetitions, with their robust standard deviations (in parentheses), for Example 1 with ϵ ~ N(0, 1).

                  p_n = 500                                        p_n = 1000
ρ     Approach    TP           FP            Time (s)              TP           FP             Time (s)
AR Structure
0.3   FAR         4.00 (0.00)  0.00 (0.00)   83.19 (9.80)          4.00 (0.00)  0.00 (0.00)    166.26 (18.65)
      C-FS        3.20 (0.40)  5.21 (2.96)   16.18 (5.38)          3.34 (0.48)  11.44 (5.38)   39.64 (14.35)
      GRIE        4.00 (0.00)  0.00 (0.00)   2.37 (0.28)           3.99 (0.10)  0.01 (0.10)    3.56 (0.77)
0.6   FAR         4.00 (0.00)  0.00 (0.00)   82.06 (9.77)          4.00 (0.00)  0.00 (0.00)    168.16 (20.51)
      C-FS        3.71 (0.46)  4.81 (2.39)   16.57 (4.28)          3.70 (0.46)  9.33 (4.88)    34.61 (12.24)
      GRIE        3.99 (0.10)  0.00 (0.00)   2.40 (0.35)           3.98 (0.14)  0.03 (0.30)    3.43 (0.72)
0.9   FAR         3.17 (0.60)  0.00 (0.00)   81.34 (9.48)          3.09 (0.60)  0.00 (0.00)    168.70 (18.62)
      C-FS        3.14 (0.51)  2.44 (1.72)   10.63 (3.01)          3.14 (0.53)  4.43 (2.96)    19.14 (6.78)
      GRIE        3.71 (0.46)  0.20 (0.40)   2.22 (0.40)           3.70 (0.46)  0.21 (0.43)    3.45 (0.76)
CS Structure
0.3   FAR         4.00 (0.00)  0.00 (0.00)   83.60 (10.17)         4.00 (0.00)  0.00 (0.00)    165.38 (19.00)
      C-FS        3.45 (0.52)  4.96 (2.97)   16.11 (5.23)          3.33 (0.47)  11.69 (6.24)   39.98 (16.38)
      GRIE        4.00 (0.00)  0.09 (0.90)   2.30 (0.39)           4.00 (0.00)  0.02 (0.20)    3.57 (0.72)
0.6   FAR         4.00 (0.00)  0.00 (0.00)   84.24 (10.26)         4.00 (0.00)  0.01 (0.10)    166.92 (18.64)
      C-FS        3.74 (0.44)  5.05 (2.98)   16.82 (5.26)          3.61 (0.55)  10.26 (5.25)   36.73 (13.38)
      GRIE        4.00 (0.00)  0.23 (2.30)   2.35 (0.37)           4.00 (0.00)  0.11 (0.62)    3.41 (0.74)
0.9   FAR         3.03 (0.67)  0.00 (0.00)   85.57 (11.01)         2.79 (0.70)  0.00 (0.00)    166.02 (18.65)
      C-FS        2.63 (0.65)  4.48 (3.47)   13.35 (6.08)          2.56 (0.67)  9.70 (5.93)    32.23 (15.07)
      GRIE        3.89 (0.31)  1.47 (7.62)   2.21 (0.35)           3.79 (0.41)  3.16 (17.90)   3.44 (0.77)
Table 2. Average numbers of true positives (TP), false positives (FP), and calculation time over 100 repetitions, with their robust standard deviations (in parentheses), for Example 1 with ϵ ~ 0.5χ²(2).

                  p_n = 500                                        p_n = 1000
ρ     Approach    TP           FP            Time (s)              TP           FP             Time (s)
AR Structure
0.3   FAR         4.00 (0.00)  0.00 (0.00)   79.30 (11.14)         4.00 (0.00)  0.00 (0.00)    165.38 (22.09)
      C-FS        3.27 (0.45)  5.38 (3.00)   16.43 (5.40)          3.33 (0.47)  11.24 (4.85)   39.44 (12.59)
      GRIE        4.00 (0.00)  0.00 (0.00)   2.40 (0.36)           3.99 (0.10)  0.00 (0.00)    3.33 (0.68)
0.6   FAR         4.00 (0.00)  0.00 (0.00)   79.19 (12.07)         4.00 (0.00)  0.00 (0.00)    163.64 (23.37)
      C-FS        3.70 (0.46)  4.42 (2.53)   15.60 (4.49)          3.77 (0.42)  9.56 (4.29)    35.74 (11.14)
      GRIE        3.99 (0.10)  0.01 (0.10)   2.31 (0.33)           3.98 (0.14)  0.03 (0.30)    3.42 (0.67)
0.9   FAR         3.09 (0.68)  0.00 (0.00)   80.28 (10.89)         3.01 (0.72)  0.00 (0.00)    163.88 (22.78)
      C-FS        3.10 (0.48)  2.28 (1.56)   10.15 (2.86)          3.15 (0.50)  4.26 (2.20)    19.12 (5.44)
      GRIE        3.71 (0.46)  0.23 (0.51)   2.28 (0.32)           3.78 (0.42)  0.16 (0.39)    3.33 (0.67)
CS Structure
0.3   FAR         4.00 (0.00)  0.00 (0.00)   80.28 (9.59)          4.00 (0.00)  0.00 (0.00)    164.25 (19.31)
      C-FS        3.51 (0.52)  5.10 (2.99)   16.50 (5.68)          3.36 (0.48)  10.87 (4.98)   37.91 (13.02)
      GRIE        4.00 (0.00)  0.00 (0.00)   2.31 (0.34)           3.98 (0.14)  0.34 (3.30)    3.38 (0.66)
0.6   FAR         4.00 (0.00)  0.00 (0.00)   80.12 (11.56)         4.00 (0.00)  0.01 (0.10)    165.41 (20.04)
      C-FS        3.72 (0.49)  4.71 (2.57)   16.04 (4.68)          3.68 (0.49)  9.79 (5.39)    36.12 (14.41)
      GRIE        3.99 (0.10)  0.00 (0.00)   2.31 (0.29)           4.00 (0.00)  0.11 (0.65)    3.40 (0.70)
0.9   FAR         3.00 (0.79)  0.03 (0.17)   79.62 (11.60)         2.85 (0.78)  0.02 (0.14)    164.90 (19.25)
      C-FS        2.73 (0.66)  4.40 (2.81)   13.56 (4.91)          2.69 (0.63)  10.82 (5.99)   35.72 (15.68)
      GRIE        3.94 (0.24)  3.13 (16.94)  2.28 (0.35)           3.82 (0.39)  3.63 (18.13)   3.30 (0.72)
Table 3. The empirical probabilities of each important covariate and all important covariates being retained for 100 replications in Example 1 with ϵ ~ N(0, 1).

                  p_n = 500                              p_n = 1000
ρ     Approach    P_1    P_2    P_3    P_4    P_all      P_1    P_2    P_3    P_4    P_all
AR Structure
0.3   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        1.00   0.20   1.00   1.00   0.20       1.00   0.34   1.00   1.00   0.34
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   0.99   1.00   1.00   0.99
0.6   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        1.00   0.71   1.00   1.00   0.71       0.98   0.72   1.00   1.00   0.70
      GRIE        1.00   0.99   1.00   1.00   0.99       1.00   0.98   1.00   1.00   0.98
0.9   FAR         0.80   0.49   0.97   0.91   0.28       0.82   0.42   0.98   0.87   0.23
      C-FS        0.50   0.67   0.97   1.00   0.21       0.50   0.67   0.97   1.00   0.22
      GRIE        0.83   0.88   1.00   1.00   0.71       0.79   0.91   1.00   1.00   0.70
CS Structure
0.3   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        0.99   0.46   1.00   1.00   0.46       1.00   0.33   1.00   1.00   0.33
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
0.6   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        0.93   0.81   1.00   1.00   0.74       0.88   0.73   1.00   1.00   0.64
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
0.9   FAR         0.63   0.57   0.98   0.85   0.24       0.58   0.49   0.91   0.81   0.16
      C-FS        0.11   0.62   0.97   0.93   0.05       0.12   0.53   0.97   0.94   0.07
      GRIE        0.93   0.96   1.00   1.00   0.89       0.87   0.92   1.00   1.00   0.79
Table 4. The empirical probabilities of each important covariate and all important covariates being retained for 100 replications in Example 1 with ϵ ~ 0.5χ²(2).

                  p_n = 500                              p_n = 1000
ρ     Approach    P_1    P_2    P_3    P_4    P_all      P_1    P_2    P_3    P_4    P_all
AR Structure
0.3   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        1.00   0.27   1.00   1.00   0.27       1.00   0.33   1.00   1.00   0.33
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   0.99   1.00   1.00   0.99
0.6   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        1.00   0.70   1.00   1.00   0.70       0.99   0.78   1.00   1.00   0.77
      GRIE        1.00   0.99   1.00   1.00   0.99       1.00   0.99   1.00   0.99   0.98
0.9   FAR         0.81   0.47   0.95   0.86   0.28       0.82   0.45   0.94   0.80   0.26
      C-FS        0.42   0.70   0.98   1.00   0.17       0.53   0.65   0.97   1.00   0.21
      GRIE        0.82   0.89   1.00   1.00   0.71       0.84   0.94   1.00   1.00   0.78
CS Structure
0.3   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        0.99   0.52   1.00   1.00   0.52       0.99   0.37   1.00   1.00   0.36
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   0.98   1.00   1.00   0.98
0.6   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        0.93   0.79   1.00   1.00   0.74       0.92   0.76   1.00   1.00   0.69
      GRIE        1.00   0.99   1.00   1.00   0.99       1.00   1.00   1.00   1.00   1.00
0.9   FAR         0.54   0.65   0.97   0.84   0.28       0.55   0.55   0.94   0.81   0.24
      C-FS        0.12   0.66   0.97   0.98   0.09       0.14   0.61   0.97   0.97   0.08
      GRIE        0.97   0.97   1.00   1.00   0.94       0.88   0.94   1.00   1.00   0.82
Table 5. Average numbers of true positives (TP), false positives (FP), and calculation time over 100 repetitions, with their robust standard deviations (in parentheses), for Example 2 with ϵ ~ N(0, 1).

                  p_n = 500                                        p_n = 1000
δ     Approach    TP           FP            Time (s)              TP           FP             Time (s)
0.4   FAR         4.00 (0.00)  0.59 (0.51)   81.85 (10.28)         3.98 (0.20)  0.57 (0.50)    168.16 (20.16)
      C-FS        4.00 (0.00)  5.32 (2.97)   18.74 (5.35)          4.00 (0.00)  11.40 (5.37)   42.87 (15.18)
      GRIE        4.00 (0.00)  0.04 (0.24)   2.41 (0.36)           4.00 (0.00)  0.06 (0.34)    3.59 (0.64)
0.6   FAR         3.94 (0.28)  1.09 (0.49)   80.25 (9.06)          3.86 (0.49)  1.07 (0.48)    164.35 (18.90)
      C-FS        4.00 (0.00)  6.05 (2.88)   19.32 (5.19)          4.00 (0.00)  12.11 (5.37)   43.61 (14.17)
      GRIE        4.00 (0.00)  0.17 (0.49)   2.33 (0.34)           4.00 (0.00)  0.18 (0.54)    3.43 (0.63)
0.8   FAR         3.66 (0.73)  1.26 (0.50)   80.04 (9.10)          3.68 (0.72)  1.22 (0.54)    164.05 (18.76)
      C-FS        3.81 (0.42)  5.88 (2.82)   18.62 (5.41)          3.85 (0.36)  12.13 (5.18)   42.92 (13.12)
      GRIE        3.95 (0.22)  0.38 (0.72)   2.37 (0.34)           3.89 (0.31)  0.27 (0.63)    3.45 (0.75)
Table 6. The empirical probabilities of each important covariate and all important covariates being retained for 100 replications in Example 2 with ϵ ~ N(0, 1).

                  p_n = 500                              p_n = 1000
δ     Approach    P_1    P_2    P_3    P_4    P_all      P_1    P_2    P_3    P_4    P_all
0.4   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   0.99   0.99   0.99
      C-FS        1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
0.6   FAR         0.99   0.99   0.98   0.98   0.95       0.96   0.98   0.97   0.95   0.92
      C-FS        1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
0.8   FAR         0.92   0.91   0.91   0.92   0.81       0.90   0.92   0.93   0.93   0.83
      C-FS        0.93   0.97   0.96   0.95   0.82       0.94   0.97   0.97   0.97   0.85
      GRIE        1.00   1.00   0.97   0.98   0.95       0.97   1.00   0.95   0.97   0.89
Table 7. Average model size, number of SNV, and A-PE over 100 repetitions, with their robust standard deviations (in parentheses), for the Boston Housing Data.

Approach    Model Size      SNV            A-PE
FAR         2.10 (0.30)     0.00 (0.00)    0.052 (0.011)
C-FS        19.26 (5.39)    8.71 (5.10)    0.047 (0.012)
GRIE        5.07 (0.95)     0.00 (0.00)    0.043 (0.010)
Table 8. The frequency for 13 real covariates being selected over 100 replications for Boston Housing Data.

Variable    FAR    C-FS    GRIE
RM          100    100     100
AGE         0      99      0
RAD         0      60      6
TAX         0      59      7
PTRATIO     0      100     68
B           0      92      99
LSTAT       100    100     100
CRIM        10     100     80
ZN          0      97      0
INDUS       0      22      0
CHAS        0      26      0
NOX         0      100     47
DIS         0      100     0
Table 9. Average model size and A-PE over 100 repetitions, with their robust standard deviations (in parentheses), for the Arabidopsis thaliana gene data.

Approach    Model Size      A-PE
FAR         1.00 (0.00)     0.289 (0.099)
C-FS        10.15 (3.34)    0.282 (0.181)
GRIE        1.76 (1.18)     0.276 (0.093)