Article

Model Selection for High Dimensional Nonparametric Additive Models via Ridge Estimation

1 Department of Mathematics, Harbin Institute of Technology, Harbin 150001, China
2 Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen 518055, China
3 Department of Mathematics, Southern University of Science and Technology, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(23), 4551; https://doi.org/10.3390/math10234551
Submission received: 2 November 2022 / Revised: 22 November 2022 / Accepted: 28 November 2022 / Published: 1 December 2022
(This article belongs to the Special Issue Statistical Methods in Data Science and Applications)

Abstract

In ultrahigh dimensional data analysis, nonparametric additive models face increasing challenges in maintaining both good computational performance and sound statistical properties. To overcome them, we introduce a model selection methodology for high dimensional nonparametric additive models. Our approach is to propose a novel group screening procedure via nonparametric smoothing ridge estimation (GRIE) to measure the importance of each covariate. It is then combined with the sure screening property of GRIE and the model selection property of the extended Bayesian information criterion (EBIC) to select suitable sub-models in nonparametric additive models. Theoretically, we establish the strong consistency of model selection for the proposed method. Extensive simulations and two real datasets illustrate the outstanding performance of the GRIE-EBIC method.

1. Introduction

With the advances in information technology, high-dimensional data arise in various fields such as biology, chemistry, economics, finance, genetics, and neuroscience. A common sparsity assumption is that only a few features are truly related to the response. Under this assumption, many variable selection approaches based on regularized M-estimation have been developed, including, but not limited to, the Lasso [1], SCAD [2], the Dantzig selector [3], and MCP [4]. However, these penalized methods share two limitations: a heavy computational burden and unstable variable selection performance in high-dimensional settings [5].
To avoid these limitations, correlation ranking has become one of the most popular ways to rapidly reduce the dimensionality of the feature space. Fan and Lv [6] proposed sure independence screening (SIS), which utilizes the marginal Pearson correlation between each predictor and the response in Gaussian linear regression. Fan et al. [7] extended the idea of Pearson correlation ranking to ranking by marginal smooth estimation strength and proposed the nonparametric independence screening (NIS) method. Meanwhile, Zhu et al. [8] considered the marginal correlation between each predictor and the conditional cumulative distribution function of the response and developed a model-free screening method. In practice, however, strong correlations often exist among the predictors, so important predictors may be only jointly, rather than marginally, correlated with the response. Hence, marginal correlation ranking may miss some important variables. To reduce the effect of correlation among the predictors, several forward variable screening methods based on prediction rankings have been introduced. Wang [9] proposed the forward regression (FR) algorithm, which sequentially adds the predictor that most reduces the residual sum of squares. Cheng et al. [10] applied forward regression to high dimensional varying coefficient models and proposed the forward-BIC screening method. Zhong et al. [11] further extended forward regression to ultrahigh-dimensional nonparametric additive models. Based on the cumulative divergence (CD), Zhou et al. [12] proposed a forward screening procedure that accounts for the joint effects among covariates during feature screening.
Next, let us turn to the specific model. In this paper, we are interested in nonlinear regression. It is well known that when there is extensive nonlinear dependence between the response and the predictors, traditional (partial) linear models cannot detect it. Although fully nonparametric regression can capture such nonlinear dependence accurately, it suffers from the curse of dimensionality and a heavy computational burden in high dimensions. To simplify the modeling, we consider nonparametric additive models, which were introduced by Hastie and Tibshirani [13] and are defined as follows,
$$y = \sum_{j=1}^{p_n} m_j(x_j) + \epsilon, \qquad (1)$$
where $y$ is the response variable, $x_j$ is the $j$th covariate, $m_j$ is an unknown function for $j = 1, \ldots, p_n$, and $\epsilon$ is the random error. This additive combination of univariate functions can detect nonlinear dependence easily, but its good statistical properties and high computational performance hold only in low dimensions. For ultrahigh dimensions, one of the most popular ways to keep them working well is the two-stage approach: first perform model selection in a fast and efficient way while retaining all the important features in the reduced feature space, and then refit the reduced model. In the following, we focus on the methodology of model selection for ultrahigh dimensional nonparametric additive models. In this field, the last decade has seen a growing trend toward smooth-group penalized methods; see [14,15,16,17]. However, these methods may involve tuning parameters, which bring a heavy computational burden and unstable results in high dimensions. The forward feature selection procedure proposed by [11] for ultrahigh dimensional nonparametric additive models does not involve any initial parameters. In addition, model-free methods have been developed recently: based on the cumulative divergence (CD), Zhou et al. [12] proposed a forward screening procedure that accounts for the joint effects among covariates during feature screening. Both of these methods screen the remaining candidate indexes into the sub-models through forward procedures, and this kind of forward-searching algorithm also leads to a high computational burden. Furthermore, the correlation assumptions in previous studies ignore the fact that predictors are often correlated in high-dimensional feature spaces. In particular, an unimportant covariate $x_\ell$ with $m_\ell \equiv 0$ in the nonparametric additive model (1) may be strongly correlated with the residual $y - \sum_{j \in \mathcal{M}} m_j(x_j)$ for a given index set $\mathcal{M} \subset \{1, \ldots, p_n\}$, which implies that their methodologies may screen quite a few unimportant features into the sub-models.
To address these limitations, we first propose a group screening procedure via nonparametric smoothing ridge estimation (GRIE), motivated by the theoretical properties and outstanding simulation performance of the ridge estimator in [18]. The core idea of GRIE is to measure the importance of each covariate by combining the ridge estimator with group contributions. The details are as follows. We begin by fitting a ridge regression with B-spline smoothing and treating the spline basis corresponding to each covariate as a group. Next, we evaluate the group contribution of each covariate by the magnitude of its group estimator. Lastly, we sort the covariates by their group contributions in descending order. To further conduct model selection, we propose the refined GRIE-EBIC method, which combines GRIE with the extended Bayesian information criterion (EBIC) of [19]. The GRIE-EBIC method searches among the predictors with the largest group contributions using the EBIC.
Compared with other feature selection methods for nonparametric additive models, the GRIE-EBIC method has the following advantages: (1) the joint correlation among covariates is taken into account, and the strong marginal correlation assumption between the response and the important predictors is relaxed; (2) the calculation is simple, with low computational complexity; (3) it enjoys strong consistency of feature screening, which implies that the true features can be recovered exactly with probability tending to one, a property not shared by other stepwise feature screening methods such as the forward additive regression in [11] and the forward screening in [12].
The rest of the paper is organized as follows. In Section 2, we introduce the GRIE screening procedure, the GRIE-EBIC method, and its algorithm. In Section 3, we establish the sure screening property of the GRIE screening procedure and the strong consistency of screening by the GRIE-EBIC. In Section 4, we present the performance of our proposed algorithm through simulation studies. In Section 5, we apply our methodology to fit two real datasets to further illustrate the performance of our proposed method. The first is based on Boston housing, while the second is related to Arabidopsis thaliana gene data. A conclusion is given in Section 6. The proofs are in Appendix A.
Notation 
Let $\mathbf{A}$ be an $m \times l$ matrix and $\mathcal{M}$ be any subset of $\{1, 2, \ldots, l\}$, for any positive integers $m$ and $l$; then $\mathbf{A}_{\mathcal{M}}$ is the submatrix of $\mathbf{A}$ formed by the columns with indexes in $\mathcal{M}$. We write $\lambda_{\min}(\mathbf{A})$ and $\lambda_{\max}(\mathbf{A})$ for the minimum and maximum eigenvalues of a symmetric matrix $\mathbf{A}$, respectively, and $\mathbf{I}_m$ for the $m \times m$ identity matrix. We define $P_{\lambda,\mathbf{A}} = \mathbf{A}^\top(\mathbf{A}\mathbf{A}^\top + \lambda \mathbf{I}_m)^{-1}\mathbf{A}$, where $\lambda$ is some positive constant, $\mathbf{A}^\top$ denotes the transpose of $\mathbf{A}$, and $\mathbf{A}^\top$ is a column full rank $l \times m$ matrix with $m \le l$. When $\lambda = 0$, $P_{\mathbf{A}} = \mathbf{A}^\top(\mathbf{A}\mathbf{A}^\top)^{-1}\mathbf{A}$, which is the projection onto the column space of $\mathbf{A}^\top$. In addition, $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^\top$ denotes the unit vector with a one in the $i$th position and zeros elsewhere. For a vector $a = (a_1, a_2, \ldots, a_n)^\top \in \mathbb{R}^n$, the $L_2$ norm of $a$ is $\|a\|_2 = \sqrt{a^\top a}$.
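As a small illustration of this notation, the ridge-type operator $P_{\lambda,\mathbf{A}}$ could be computed in R as follows (a minimal sketch; the helper name proj_ridge is ours and not from the paper's code):

```r
# Ridge-type projection operator P_{lambda,A} = A' (A A' + lambda I_m)^{-1} A
# for an m x l matrix A; with lambda = 0 this is the ordinary projection P_A.
proj_ridge <- function(A, lambda = 0) {
  m <- nrow(A)
  t(A) %*% solve(A %*% t(A) + lambda * diag(m), A)
}
```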

2. Methodology

Suppose we have the random sample $\{(y_i, x_{i,1}, \ldots, x_{i,p_n}) : i = 1, \ldots, n\}$ generated from the population model (1). Then the nonparametric additive model can be rewritten as:
$$y_i = \sum_{j=1}^{p_n} m_j(x_{i,j}) + \epsilon_i, \quad i = 1, \ldots, n. \qquad (2)$$
Without loss of generality, we assume that the mean response is zero. For identifiability of the model, we further assume that each additive function has mean zero, i.e., $E\,m_j(x_{i,j}) = 0$ for $j = 1, \ldots, p_n$. In a real application, all response variables are centralized to satisfy this assumption. Here, the variance of the additive function, $\mathrm{Var}(m_j(x_j))$, is used to distinguish the importance of the covariates. Thus, we call $x_j$ an important predictor if $\mathrm{Var}(m_j(x_j)) > 0$; otherwise, $x_j$ is a redundant predictor. We then define the index set of the important predictors as $S = \{j : \mathrm{Var}(m_j(x_j)) > 0,\ j = 1, \ldots, p_n\}$.
Next, we use B-spline basis functions to approximate $m_j(\cdot)$. Assume $x_j \in [0,1]$ for $j = 1, \ldots, p_n$, let $\bar\phi = \{\phi_k\}_{k=0}^{q}$ be a knot sequence such that $0 = \phi_0 < \phi_1 < \cdots < \phi_q = 1$, and let $\mathcal{S}(\ell, \bar\phi)$ be the space of polynomial splines of order $\ell$ with knot sequence $\bar\phi$. $\mathcal{S}(\ell, \bar\phi)$ is a $\kappa_n$-dimensional linear space with $\kappa_n = q + \ell$. For any $m_j(x_j)$, $j = 1, \ldots, p_n$, there exists a unique vector $\theta_j^*$ such that
$$m_j(x_j) \approx \sum_{t=1}^{\kappa_n} \theta_{jt}^* B_t(x_j) = B(x_j)^\top \theta_j^*, \qquad (3)$$
where $B(x_j) = (B_1(x_j), \ldots, B_{\kappa_n}(x_j))^\top$ and $\theta_j^* = (\theta_{j1}^*, \ldots, \theta_{j\kappa_n}^*)^\top$. Let $w_i = (w_{i,1}^\top, \ldots, w_{i,p_n}^\top)^\top$ with $w_{i,j} = B(x_{i,j})$, $W = (w_1, \ldots, w_n)^\top$, and $Y = (y_1, \ldots, y_n)^\top$. Based on the approximation (3), model (2) becomes
$$y_i = w_i^\top \theta^* + \epsilon_i^*, \quad i = 1, \ldots, n, \qquad (4)$$
where $\theta^* = (\theta_1^{*\top}, \ldots, \theta_{p_n}^{*\top})^\top$ and $\epsilon_i^* = \sum_{j=1}^{p_n} m_j(x_{i,j}) - w_i^\top \theta^* + \epsilon_i$. Under model (4), the ridge estimator minimizes the loss
$$\|Y - W\theta\|_2^2 + \lambda \|\theta\|_2^2,$$
where $\lambda$ is a positive constant. Then $\hat\theta$ admits the closed form
$$\hat\theta = W^\top (WW^\top + \lambda I_n)^{-1} Y, \qquad (5)$$
where $I_n$ is the $n \times n$ identity matrix. For linear regression, Wang and Leng [18] considered the effect of each entry of $\theta$ and showed that the ridge estimator achieves screening consistency. Notice that $\mathrm{Var}(m_j(x_j)) \approx \theta_j^{*\top} E(w_j w_j^\top)\theta_j^*$. Different from linear regression, we need to consider the group contribution of $\theta_j^*$. By the boundedness of $E(w_{i,j} w_{i,j}^\top)$ from Assumption A4(i), we use $\|\theta_j^*\|_2$ to evaluate the group contribution. Similar to the results in [18], the ridge estimator $\hat\theta$ preserves the ranking order of the group contributions in $\theta^*$, with $P(\|\hat\theta_j\|_2 > \|\hat\theta_k\|_2) \to 1$ if $j \in S$ and $k \in S^c$ (see Theorem 1).
One natural screening method is to sort $\{\|\hat\theta_j\|_2^2\}$ in decreasing order and select the top $m$ indexes, denoted as $F_m = \{i_1, i_2, i_3, \ldots, i_m\}$, $1 \le m \le p_n$. This screening process is referred to as the "GRIE" screening procedure. We define $\mathcal{G} = \{F_m : m = 1, \ldots, p_n\}$ and $\mathcal{A} = \{m : S \subseteq F_m,\ 1 \le m \le p_n\}$. To obtain a more accurate model selection result, we search for $d_n$, the minimum element of the set $\mathcal{A}$; then $F_{d_n}$ is the set with the shortest length in $\mathcal{G}$ that contains the important variable set $S$. By the definition of $\mathcal{G}$, we have $S \subseteq F_{p_n}$, so $F_{d_n}$ is not empty. In summary, we want to find $F_{d_n}$ from $\mathcal{G}$.
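To make the GRIE screening step concrete, the following R sketch builds the grouped B-spline design matrix, computes the ridge estimator, and ranks covariates by their group contributions. The function name grie_screen and the implementation details (e.g., using splines::bs for the basis) are our illustrative assumptions, not the authors' code.

```r
library(splines)

# GRIE screening sketch: rank covariates by the L2 norm of their group of
# ridge coefficients, theta_hat = W'(WW' + lambda I_n)^{-1} Y.
grie_screen <- function(X, Y, lambda = 1, kappa_n = floor(nrow(X)^(1/5)) + 2) {
  n <- nrow(X); p <- ncol(X)
  # Column-bind the kappa_n B-spline basis columns of every covariate
  W <- do.call(cbind, lapply(seq_len(p), function(j) bs(X[, j], df = kappa_n)))
  theta_hat <- drop(t(W) %*% solve(W %*% t(W) + lambda * diag(n), Y))
  # Group contribution: squared L2 norm of the kappa_n coefficients of covariate j
  group_id <- rep(seq_len(p), each = kappa_n)
  grp_norm <- tapply(theta_hat^2, group_id, sum)
  order(grp_norm, decreasing = TRUE)   # ranked indexes i_1, i_2, ... (the sets F_m)
}
```

Sorting the resulting group norms yields the nested sets $F_1 \subseteq F_2 \subseteq \cdots$ from which $F_{d_n}$ is to be extracted.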
It is well known that the extended Bayesian information criterion (EBIC) has appealing theoretical properties and outstanding numerical performance for model selection. Let $W_T = (W_j, j \in T)$ for any subset $T \subseteq \{1, \ldots, p_n\}$. The formula of the EBIC for the sub-model $(Y, W_T)$ is given by
$$EBIC(T) = \log(RSS(T)/n) + \big\{\kappa_n |T| \log(n) + 2\gamma \log f(|T|)\big\}/n,$$
where $\gamma$ is a preset positive constant, $RSS(T) = \|Y - W_T\hat\theta_T\|_2^2$ is the sum of squared residuals (RSS), and $f(|T|) = \binom{p_n\kappa_n}{|T|\kappa_n}$ is the combination number.
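A minimal R sketch of this criterion (the function name ebic and the use of lm.fit for the sub-model least squares are our own choices):

```r
# EBIC for the sub-model indexed by T, following the formula above.
# W_T: n x (kappa_n * |T|) B-spline design of the selected groups; Y: response.
ebic <- function(W_T, Y, p_n, kappa_n, gamma = 0.5) {
  n <- length(Y)
  rss <- sum(lm.fit(W_T, Y)$residuals^2)       # least squares on the sub-model
  size <- ncol(W_T) / kappa_n                  # |T|
  log(rss / n) +
    (kappa_n * size * log(n) + 2 * gamma * lchoose(p_n * kappa_n, size * kappa_n)) / n
}
```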
For the linear model, Wang [9] showed that $EBIC(F_m) < EBIC(F_{m-1})$ if $i_m \in S$. Based on this property of the EBIC and the rank-preserving property of the GRIE screening procedure (see Theorem 1), we propose the following Algorithm 1 for the model selection of (1).
Algorithm 1: GRIE-EBIC algorithm.
Initialization: Input $(W, Y)$, $RSS_0 = \|Y\|_2^2$, $n$, $p_n$, $\lambda$, $\kappa_n$, $\gamma$, $L$.
Step (i): Compute the GRIE screening procedure
   1: Calculate the ridge estimator $\hat\theta = W^\top(WW^\top + \lambda I_n)^{-1}Y$;
   2: Sort $\{\|\hat\theta_j\|_2,\ j = 1, \ldots, p_n\}$ in decreasing order and select the top $n$ index set, denoted by $F_n = \{i_1, i_2, i_3, \ldots, i_n\}$;
Step (ii): Direct decreasing solution path
   3: For $k = 1, \ldots, n$, do
      3.1: Let $\hat{S}_k = \{i_1, \ldots, i_k\}$ and compute the sum of squared residuals
           $RSS_k = \|Y - W_{\hat{S}_k}(W_{\hat{S}_k}^\top W_{\hat{S}_k})^{-1}W_{\hat{S}_k}^\top Y\|_2^2$;
      3.2: Compute the EBIC: $EBIC_k = \log(RSS_k/n) + \{\kappa_n k \log(n) + 2\gamma\log f(k)\}/n$;
      3.3: If $k \ge L + 1$ and $EBIC_k > \cdots > EBIC_{k-L}$, set $K = k - L$ and stop;
   4: Compute the differences of the EBIC to obtain the decreasing solution path
           $I = \{k : EBIC_k - EBIC_{k-1} < 0,\ k = 1, 2, \ldots, K\}$;
   5: Find the decreasing index set $\hat{S}^* = \{i_k : k \in I\}$;
Step (iii): Forward decreasing solution path
   6: Compute $RSS^* = \|Y - W_{\hat{S}^*}(W_{\hat{S}^*}^\top W_{\hat{S}^*})^{-1}W_{\hat{S}^*}^\top Y\|_2^2$ and
           $EBIC^* = \log(RSS^*/n) + \{\kappa_n|\hat{S}^*|\log(n) + 2\gamma\log f(|\hat{S}^*|)\}/n$;
   7: For $\ell \in F_n \setminus \hat{S}^*$, do
      Let $\hat{S}^*_\ell = \hat{S}^* \cup \{\ell\}$, compute $RSS^*_\ell = \|Y - W_{\hat{S}^*_\ell}(W_{\hat{S}^*_\ell}^\top W_{\hat{S}^*_\ell})^{-1}W_{\hat{S}^*_\ell}^\top Y\|_2^2$ and
           $EBIC^*_\ell = \log(RSS^*_\ell/n) + \{\kappa_n|\hat{S}^*_\ell|\log(n) + 2\gamma\log f(|\hat{S}^*_\ell|)\}/n$;
   8: Find the decreasing solution path $\hat{S} = \hat{S}^* \cup \{\ell : EBIC^*_\ell - EBIC^* < 0,\ \ell \in F_n \setminus \hat{S}^*\}$;
Output: the final index set $\hat{S}$.
In Step (ii) of the GRIE-EBIC algorithm, we search for the important covariates within the top-$n$ predictor space $F_n$. Based on Theorem 1, GRIE is consistent in preserving the sorting order: the higher a variable ranks in $F_{p_n}$, the more likely it is to be an important variable. To speed up the calculation, we impose a stopping rule: screening stops once the EBIC value has increased $L$ times consecutively. To improve the robustness of the GRIE-EBIC algorithm, Step (iii) adds a further forward screening pass.
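Putting the pieces together, a compact R sketch of Algorithm 1 might look as follows; it reuses the illustrative grie_screen() and ebic() helpers sketched above, and all names and implementation choices are ours rather than the authors'.

```r
# GRIE-EBIC sketch: Step (i) ranking, Step (ii) direct decreasing solution path
# with the L-consecutive-increase stopping rule, Step (iii) forward pass.
grie_ebic <- function(X, Y, lambda = 1, gamma = 0.5, L = 5,
                      kappa_n = floor(nrow(X)^(1/5)) + 2) {
  n <- nrow(X); p <- ncol(X)
  basis  <- lapply(seq_len(p), function(j) splines::bs(X[, j], df = kappa_n))
  ranked <- grie_screen(X, Y, lambda, kappa_n)                 # Step (i)

  ebic0 <- log(sum(Y^2) / n)                                   # EBIC_0 with RSS_0 = ||Y||^2
  ebic_path <- numeric(0); K <- min(n, p)
  for (k in seq_len(min(n, p))) {                              # Step (ii)
    W_k <- do.call(cbind, basis[ranked[1:k]])
    ebic_path[k] <- ebic(W_k, Y, p, kappa_n, gamma)
    if (k >= L + 1 && all(diff(ebic_path[(k - L):k]) > 0)) { K <- k - L; break }
  }
  S_star <- ranked[which(diff(c(ebic0, ebic_path[1:K])) < 0)]  # decreasing path

  ebic_star <- ebic(do.call(cbind, basis[S_star]), Y, p, kappa_n, gamma)
  for (l in setdiff(ranked[1:K], S_star)) {                    # Step (iii)
    if (ebic(do.call(cbind, basis[c(S_star, l)]), Y, p, kappa_n, gamma) < ebic_star)
      S_star <- c(S_star, l)
  }
  sort(S_star)
}
```

Note that, for brevity, the forward pass above scans only the top-$K$ ranked covariates rather than the full $F_n$, and the sketch assumes at least one decreasing EBIC step; a production implementation would guard these edge cases.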

3. Asymptotic Properties

3.1. Assumptions

To establish the asymptotic properties of our proposed method, we introduce the following notation and assumptions. Let $\Sigma = E(ww^\top)$, $Z = W\Sigma^{-1/2}$, $z = \Sigma^{-1/2}w$, and $t_n = p_n\kappa_n$, where $w = (w_1^\top, \ldots, w_{p_n}^\top)^\top$ with $w_j = B(x_j)$. We write $\Sigma_T = E(n^{-1}W_T^\top W_T)$. Let $\mathcal{H}_r$ denote the space of functions whose $d$-th order derivative is Hölder continuous of order $v$, i.e., $\mathcal{H}_r = \{h(z) : |h^{(d)}(a) - h^{(d)}(a')| \le C|a - a'|^v,\ \forall a, a' \in [0,1]\}$, where $h^{(d)}(\cdot)$ is the $d$-th derivative of $h(\cdot)$ and $r = d + v$. If $v = 1$, $h^{(d)}(\cdot)$ is Lipschitz continuous. Let $s_n$ be the cardinality of $S$. The following assumptions are required:
A1. Assume $z$ has a spherically symmetric distribution and there exist some positive constants $c_1$ and $C_1$ such that
$$P\Big(\lambda_{\min}(t_n^{-1}ZZ^\top) \le c_1^{-1} \ \text{ or } \ \lambda_{\max}(t_n^{-1}ZZ^\top) > c_1\Big) \le 2\exp(-C_1 n).$$
A2. Assume there exists some positive constant $C^*$ such that, for any $a \in \mathbb{R}$,
$$\max_{i=1,\ldots,n} E\big\{\exp(a\varepsilon_i) \mid x_i\big\} \le \exp(C^* a^2/2).$$
A3. Assume that (i) there exists some $r \ge 2$ such that $m_j \in \mathcal{H}_r$ and $\kappa_n = O(n^{1/(2r+1)})$ for any $j \in S$; (ii) $\sum_{j\in S} E|m_j(x_j)|^2 \le c_2 s_n$; (iii) $\lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma) \le c_3 n^{\tau}$, where $c_2, c_3$ are some positive constants and $\tau \ge 0$.
A4. (i) $c_4^{-1}\kappa_n^{-1} \le \lambda_{\min}\big(E(B(x_j)B(x_j)^\top)\big) \le \lambda_{\max}\big(E(B(x_j)B(x_j)^\top)\big) \le c_4\kappa_n^{-1}$ for some positive constant $c_4$; (ii) $\min_{j\in S}\{E|m_j(x_j)|^2\}^{1/2} \ge d_n$ for some positive sequence $d_n \to 0$; (iii) $\kappa_n^{r-1/2} \ge d_n^{-1} n^{2\tau} s_n\sqrt{\log n}$ and $\log(t_n) = o\big(d_n^2 n^{1-4\tau}\kappa_n^{-2}s_n^{-2}(\log n)^{-1}\big)$.
A5. (i) $\mathrm{Var}(y_1) = O(\kappa_n s_n^2 n^{3\tau}\log(n))$; (ii) for any integer $N$ with $s_n < N \le s_n\log n$, there exists a positive constant $c_6 > 0$ such that
$$c_6 n^{-\tau}\kappa_n^{-1} \le \lambda_{\min}(\Sigma_T)$$
holds uniformly in $T \subseteq F_n$ satisfying $|T| \le N$ and $S \subseteq T$.
Assumptions A1 and A3(iii) are similar to Assumptions 1 and 3 of [18]. Assumption A2 is the same as Assumption A3 of [11] and means that the random error follows a sub-Gaussian distribution. Assumption A3(i) is a common assumption in the literature for polynomial spline bases, A3(ii) gives an upper bound on the total signal, and A3(iii) gives an upper bound on the condition number. In addition, Assumptions A3(ii)–(iii) are implied by Assumption A2 in [11]. Assumption A4(i), together with the assumption $\mathrm{Var}(y_1) = O(1)$, which is stronger than A5(i), is also imposed in [11] to achieve the consistency of variable selection; they also assumed that A5(ii) holds. Assumptions A4(ii) and (iii) give the lower and upper bounds on the minimal signal and the dimensionality of the design matrix $W$.

3.2. Main Theorems

Theorem 1.
If Assumptions A1–A4 hold, then
$$P\Big(\min_{j\in S}\|\hat\theta_j\|_2 > \max_{j\in S^c}\|\hat\theta_j\|_2\Big) \to 1.$$
Alternatively, we can choose a sub-model $F_{d_n}$ with $d_n = O(n^{\iota})$ for some $0 < \iota < 1$ such that
$$P\big(S \subseteq F_{d_n}\big) \to 1.$$
Theorem 1 states the consistency of preserving order in sorting; that is, $\hat\theta$ can completely separate the unimportant and important variables with probability tending to one. For linear models, Theorem 1 is in line with Theorem 2 in [18], which is a special case of our theorem.
Theorem 2.
If Assumptions A1–A5 hold, then
$$P\big(\hat{S} = S\big) \to 1.$$
The screening methods in [7,11,12] adopt a forward selection algorithm, which means that later results are affected by the results of the previous steps. This not only brings a heavy computational burden but also yields overfitted screening results, with $P(S \subseteq \hat{S}) \to 1$. Compared with this result, Theorem 2 gives the strong consistency of screening, $P(\hat{S} = S) \to 1$.

4. Simulations

In this section, we investigate the finite-sample performance of our proposed method and compare it with the following two procedures: forward additive regression (FAR) in [11] and cumulative divergence-based forward regression (C-FS) in [12]. We choose $\lambda = 1$, $L = 5$ (suggested by [10]), $\gamma = 0.5$ (suggested by [20]), and $\kappa_n = \lfloor n^{1/5}\rfloor + 2$ (suggested by [11]) for the GRIE-EBIC algorithm, where $\kappa_n$ is the dimension of the B-spline basis space and $\lfloor n^{1/5}\rfloor$ is the greatest integer not exceeding $n^{1/5}$.
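In R, these defaults amount to the following (illustrative only):

```r
# Tuning parameters used for the GRIE-EBIC algorithm in the simulations
n       <- 300
lambda  <- 1                      # ridge parameter
L       <- 5                      # stop after L consecutive EBIC increases
gamma   <- 0.5                    # EBIC constant
kappa_n <- floor(n^(1/5)) + 2     # B-spline dimension: floor(300^{1/5}) + 2 = 5
```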
Three criteria are adopted to evaluate the performance of variable selection for the additive model (1). The true positive (TP) is the number of true variables that are identified as true variables in the selected model, and the false positive (FP) is the number of noise variables that are misclassified as true variables in the selected model. Together, TP and FP reflect the accuracy of the variable selection methods for the selected sub-models. In addition, we use computation time as a third criterion to reflect the efficiency of the different methods. Our proposed method, GRIE, is more computationally efficient than FAR and C-FS, since their computational complexities are $O(n^2 p_n\kappa_n)$, $O(n^3 p_n)$, and $O(Tn^3 p_n)$, respectively, where $T$ is the number of repetitions of the bootstrap procedure in the C-FS method. This comparison of computational complexities highlights the time efficiency of GRIE, which is further demonstrated by the simulation results in Table 1 and Table 2.
The following examples examine the effect of different dimensions and different correlations between covariates for the three procedures above. Given two different dimensions and three different correlation levels between any two predictors, the errors follow the standard normal $N(0,1)$ and the chi-square $0.5\chi_2^2$ distributions. In each example, we generate 100 random samples, each of size $n = 300$. The data generation is implemented in R (with the package "MASS" where needed) using: (1) "rnorm": draws from a normal distribution; (2) "mvrnorm": draws from a multivariate normal distribution; (3) "rchisq": draws from a chi-square distribution; (4) "runif": draws from a uniform distribution.
Example 1.
We generate $n$ samples from the following nonparametric additive model:
$$y = m_1(x_1) + m_2(x_2) + m_3(x_3) + m_4(x_4) + \epsilon,$$
where $m_1(x) = 0.75\exp(x)$, $m_2(x) = x^2$, $m_3(x) = 3\sin(x)$, $m_4(x) = 2x$, and $(x_1, x_2, \ldots, x_{p_n})^\top$ follows a multivariate normal distribution $N(0, \Sigma)$. In this example, $\Sigma = (\sigma_{ij})$ is specified under the following two cases: (1) autoregressive (AR) structure, $\sigma_{ij} = \rho^{|i-j|}$; (2) compound symmetry (CS) structure, namely, $\sigma_{ij} = \rho$ if $i \ne j$ and $\sigma_{ij} = 1$ otherwise. The parameter $\rho$, which controls the strength of the correlation between any two predictors, is set to 0.3, 0.6, and 0.9.
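For illustration, one replicate of this design under the AR structure could be generated in R as follows (seeds and exact settings here are ours):

```r
library(MASS)

# Example 1 data: AR(1) covariance with correlation rho, normal errors
n <- 300; p_n <- 500; rho <- 0.6
Sigma <- rho^abs(outer(1:p_n, 1:p_n, "-"))       # sigma_ij = rho^|i - j|
X <- mvrnorm(n, mu = rep(0, p_n), Sigma = Sigma)
y <- 0.75 * exp(X[, 1]) + X[, 2]^2 + 3 * sin(X[, 3]) + 2 * X[, 4] + rnorm(n)
```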
Table 1, Table 2, Table 3 and Table 4 summarize the results for the additive model in Example 1. Under the settings $\rho = 0.3$ and $0.6$, our proposed method and the FAR method identify all important features and keep the FP value close to zero under both the AR and CS structures, whereas C-FS does not. Even so, the FAR method has the longest computation time among the three methods. Furthermore, when there are strong correlations between covariates ($\rho = 0.9$), all three methods identify the important variables less well, especially the FAR and C-FS methods. In this situation, compared with the other two methods, our method has the highest TP and the shortest computation time. To assess the stability of our method, we report in Table 3 and Table 4 the empirical probabilities, over the 100 replications, that each important covariate and that all important covariates are retained, where $P_j$ and $P_{\text{all}}$ are the empirical probabilities that the $j$th important covariate and that all important covariates, respectively, are retained in the selected sub-model. According to Table 3 and Table 4, $P_{\text{all}}$ is below 0.3 for FAR and C-FS, while the $P_{\text{all}}$ of GRIE is at least 0.70. In addition, the $P_j$ values of our method are the best among the three methods in high-dimensional settings. Hence, we conclude that our proposed GRIE method performs robustly for model selection in nonparametric additive models under high-dimensional settings.
Example 2.
In this example, we consider a linear model with a group structure given by
$$y = \sum_{i=1}^{p_n} \beta_i x_i + \epsilon$$
with the predictors generated by the following process:
$$x_i = z_1 + z + w_i,\ i = 1, 3; \qquad x_i = z_2 + z + w_i,\ i = 2, 4; \qquad x_5, \ldots, x_{p_n} \overset{i.i.d.}{\sim} N(0, 1),$$
where $w_1, \ldots, w_4 \overset{i.i.d.}{\sim} U(0,1)$, $z_1, z_2 \overset{i.i.d.}{\sim} U(0,1)$, and the common component $z \sim N(0, \delta^2)$. The variance parameter $\delta$ is set to 0.4, 0.6, and 0.8 to control the strength of the group structure. The true values of the coefficients are $\beta_i = 3$ for $i = 1, \ldots, 4$ and $\beta_i = 0$ for $i = 5, \ldots, p_n$.
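One replicate of this grouped design could be generated in R as follows (an illustrative sketch):

```r
# Example 2 data: four grouped predictors sharing a common component z,
# the remaining predictors independent standard normal noise
n <- 300; p_n <- 500; delta <- 0.6
beta <- c(rep(3, 4), rep(0, p_n - 4))
z  <- rnorm(n, sd = delta); z1 <- runif(n); z2 <- runif(n)
w  <- matrix(runif(n * 4), n, 4)
X  <- cbind(z1 + z + w[, 1], z2 + z + w[, 2], z1 + z + w[, 3], z2 + z + w[, 4],
            matrix(rnorm(n * (p_n - 4)), n, p_n - 4))
y  <- drop(X %*% beta) + rnorm(n)
```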
We also conducted simulations with normal errors and chi-square errors for Example 2 and found that the performances under the two error distributions were very close. Therefore, we omit the results for chi-square errors to save space and report only the results for normal errors in Table 5 and Table 6. We find that FAR's performance in identifying important features deteriorates the most as the correlation between groups increases; the performances of C-FS and of our GRIE method also worsen when $\delta$ exceeds 0.6, while GRIE still performs better even when there is strong correlation among the covariates. These phenomena are further explained by Table 6: when $\delta$ exceeds 0.6, FAR and C-FS no longer screen the important covariates with overwhelming empirical probability, which results in a decrease in their TP and $P_{\text{all}}$ values. Our proposed method, however, remains relatively robust to different values of $\delta$ in terms of TP and $P_{\text{all}}$.

5. Real Data

5.1. Boston Housing Data

We use the Boston housing dataset to further illustrate the performance of our proposed method. The dataset contains MEDV (the median value of owner-occupied homes) in 506 U.S. census tracts of Boston from the 1970 census, together with 13 other variables that explain the variation in housing value. The 13 explanatory variables are RM (average number of rooms per dwelling), AGE (proportion of owner-occupied units built prior to 1940), RAD (index of accessibility to radial highways), TAX (full-value property-tax rate per USD 10,000), PTRATIO (pupil-teacher ratio by town), B ($1000(\mathrm{Bk} - 0.63)^2$, where Bk is the proportion of blacks by town), LSTAT (lower status of the population), CRIM (per capita crime rate by town), ZN (proportion of residential land zoned for lots over 25,000 square feet), INDUS (proportion of non-retail business acres per town), CHAS (Charles River dummy variable), NOX (nitric oxides concentration, parts per 10 million), and DIS (weighted distances to five Boston employment centers). To simplify notation, we denote the covariates RM, AGE, RAD, TAX, PTRATIO, B, LSTAT, CRIM, ZN, INDUS, CHAS, NOX, and DIS by $x_1, \ldots, x_{13}$. To study the relationship between MEDV and the above 13 variables, we consider the following nonparametric additive model:
$$y = \sum_{j=1}^{13} m_j(x_j) + \epsilon, \qquad (8)$$
where $y$ is $\log(\mathrm{MEDV})$. In order to extend the above model to a high-dimensional setting, following [21], we generate artificial noise variables $x_j$ defined as
$$x_j = \frac{Z_j + 2W}{3}$$
for $j = 14, \ldots, 1000$, which we add to (8), where $Z_{14}, \ldots, Z_{1000} \overset{i.i.d.}{\sim} N(0,1)$ and $W \sim U(0,1)$.
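In R, this setup can be sketched as follows (the Boston data ship with the MASS package; reading the noise construction above as $(Z_j + 2W)/3$ is our interpretation):

```r
library(MASS)                     # provides the Boston housing data
data(Boston)

y      <- log(Boston$medv)        # response: log(MEDV)
X_real <- as.matrix(Boston[, setdiff(names(Boston), "medv")])   # 13 real covariates
# Append artificial noise covariates x_14, ..., x_1000
n <- nrow(Boston); W <- runif(n)
X_noise <- sapply(14:1000, function(j) (rnorm(n) + 2 * W) / 3)
X <- cbind(X_real, X_noise)
```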
We use FAR, C-FS, and GRIE to identify important variables in the above additive model (8) with the full dataset. The results are as follows.
(i) Under FAR, 3 covariates, $\{x_1, x_7, x_8\}$, are selected, denoted by "model ($A_1$)".
(ii) Under GRIE, we obtain 6 covariates, $\{x_1, x_5, x_6, x_7, x_8, x_{12}\}$, denoted by "model ($B_1$)".
(iii) Under C-FS, 15 covariates are chosen, namely $\{x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{12}, x_{13}, x_{156}, x_{377}, x_{737}, x_{859}\}$, denoted by "model ($C_1$)".
These three sub-models are nested, with $A_1 \subset B_1 \subset C_1$, and we want to investigate which model fits this dataset best. The nondegenerate Vuong test of [22] is used here to compare two nested models; its null hypothesis is that the two models are equivalent. We first compare model ($A_1$) with model ($B_1$) by the Vuong test: the p-value is 0.001, so the null hypothesis is rejected, indicating that model ($B_1$) is better than model ($A_1$) since ($A_1$) is nested in ($B_1$). We then compare model ($B_1$) with model ($C_1$): the corresponding p-value of the Vuong test equals 0.981, so the null hypothesis is not rejected and models ($B_1$) and ($C_1$) are equivalent. However, model ($B_1$) has a smaller model size than model ($C_1$). Therefore, model ($B_1$) is the most suitable working model for the Boston housing dataset, which indicates that GRIE performs best in identifying the important variables among the three variable selection methods.
To further demonstrate our results, we compare FAR, C-FS, and GRIE through their prediction errors. To this end, we randomly generate 100 splits, in each of which the full sample is randomly partitioned into training and validation sets with a size ratio of 4:1. The training sets are used for variable selection, and the validation sets for estimating the prediction error. We centralize the response variable $y$ and choose cubic splines ($\kappa_n = 3$) to approximate the additive functions. The average model size, the number of selected noise variables (SNV), and the adjusted mean prediction error (A-PE) are used to evaluate the performance of the three methods. All results are reported in Table 7.
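One such split-and-evaluate step could be sketched in R as follows; pe_once is our own illustrative helper, which refits a cubic-spline additive model on a given selected index set S_hat and reports the validation prediction error.

```r
library(splines)

# One illustrative 4:1 train/validation split and prediction error for a given
# selected index set S_hat (e.g., the output of the GRIE-EBIC algorithm).
pe_once <- function(X, y, S_hat, kappa_n = 3) {
  n <- nrow(X)
  train <- sample(n, floor(0.8 * n)); valid <- setdiff(seq_len(n), train)
  dat <- data.frame(y = y - mean(y), X[, S_hat, drop = FALSE])
  colnames(dat) <- c("y", paste0("x", seq_along(S_hat)))
  rhs <- paste0("bs(x", seq_along(S_hat), ", df = ", kappa_n, ")", collapse = " + ")
  fit <- lm(as.formula(paste("y ~", rhs)), data = dat[train, ])
  mean((dat$y[valid] - predict(fit, newdata = dat[valid, ]))^2)
}
```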
From Table 7, we observe the following. (1) The model sizes of our method GRIE and of FAR are both smaller than that of C-FS, but the A-PE of FAR is the largest among the three methods, which suggests that FAR may fail to identify some important variables. To verify this, we report in Table 8 the frequency with which each of the 13 real covariates is selected over the 100 replications. Table 8 shows that RM and LSTAT are selected by all methods in every repetition. Except for the FAR method, PTRATIO, B, and CRIM are selected by GRIE and C-FS with high frequency. It is seen that the pupil-teacher ratio, the proportion of blacks, and the per capita crime rate are key factors affecting housing prices; however, FAR misses these important variables. (2) The SNV values of both our method GRIE and FAR are 0, which means that they successfully exclude all artificial variables.
In summary, compared with C-FS and FAR, our method has the smallest A-PE, the smallest SNV, and a simple model, which implies our method has better performance in feature screening under high-dimensional settings.

5.2. Arabidopsis thaliana Gene Data

We now turn to the Arabidopsis thaliana gene data to illustrate the screening performance of our method. This dataset was developed by Wille et al. [23], who detected modules of closely connected isoprenoid genes in Arabidopsis thaliana. It is available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC545783 (accessed on 16 November 2022) and comprises 834 genes from 58 different pathways measured in 118 samples. Chen et al. [24] found that GGPPS11 plays an essential role in the generation of GGPP, the common precursor of several biologically important compounds (such as carotenoids, chlorophylls, and gibberellins) in Arabidopsis. Our goal is to identify the effects of the remaining 833 genes on the expression value of the gene GGPPS11.
Following Wille et al. [23], the downloaded data $\mathcal{R} = \{y, x_1, \ldots, x_{833}\}$ had been converted to per mille values (i.e., scaled by 1000); to recover the original scale, we model $0.001\mathcal{R}$ here and consider the corresponding nonparametric additive model:
$$y = \sum_{j=1}^{833} m_j(x_j) + \epsilon,$$
where $y$ is the expression value of the gene GGPPS11 and $\{x_1, \ldots, x_{833}\}$ are the expression values of the remaining 833 genes. Next, we apply the above additive model to the full dataset to identify the important variables with the three methods mentioned above. The results are as follows:
(i) Under FAR, one gene, $\{x_{72}\}$, is selected, denoted by "model ($A_2$)";
(ii) Under GRIE, three genes, $\{x_{140}, x_{571}, x_{560}\}$, are chosen, denoted by "model ($B_2$)";
(iii) Under C-FS, nine genes are chosen, namely $\{x_{72}, x_{105}, x_{191}, x_{476}, x_{510}, x_{517}, x_{554}, x_{658}, x_{800}\}$, denoted by "model ($C_2$)".
Again, using the nondegenerate Vuong test of Liao and Shi [22], we compare models ($A_2$) and ($B_2$). The corresponding p-value of the test is 0.012, indicating that these two models are not equivalent at the 5% significance level. We then compare model ($B_2$) with ($C_2$); the p-value is 0, so models ($B_2$) and ($C_2$) are also not equivalent at the 5% significance level.
Lastly, as in the first real data example, we compare FAR, C-FS, and GRIE through their prediction errors. Again, we randomly divide the full dataset into training and validation sets with a ratio of 4:1 and repeat this process 100 times. Here, we also centralize the response variable $y$ and set $\kappa_n = 3$. For this dataset, we use the average model size and the A-PE to evaluate the performance of the three methods. The results are shown in Table 9. We conclude that our proposed method has the smallest model size, the strongest predictive ability, and outstanding performance in identifying important covariates compared with the other two methods.

6. Conclusions

In this paper, we propose a novel variable screener (GRIE) for high-dimensional nonparametric additive models, which combines nonparametric smoothing ridge estimation with group information. We note that our paper is among the first to dispense with the marginal correlation assumption. Without this assumption, the proposed screener can completely separate the unimportant and important variables with probability tending to one. Compared with iterative sure independence screening and forward screening, the proposed screener essentially eliminates the computational burden and achieves strong sure screening consistency. Furthermore, it allows the covariates to be strongly correlated and performs better than its competitors. For these reasons, combining the strong sure screening property of GRIE with the model selection property of the EBIC, we propose the GRIE-EBIC method to further eliminate the noise variables and improve the accuracy of model selection. Theoretically, we establish the strong consistency of model selection for the GRIE-EBIC method, which shows that our proposed method achieves ideal model selection results.
We conclude this paper with a discussion of directions for future research. One direction is nonparametric additive models with interaction effects between covariates, which are defined as
$$E(y \mid x) = \sum_{1\le j < k \le p_n} m_{j,k}(x_j, x_k),$$
where $x_j$ is the $j$th element of $x$. These models generalize linear models with two-way interaction effects [25] and are more flexible for capturing the interactions between covariates. One potential approach may be to use tensor-product spline bases to approximate each nonparametric function $m_{j,k}(\cdot, \cdot)$ (a short illustrative sketch is given at the end of this section). The other direction is to study how to apply our methodology to nonparametric generalized additive models [26,27], which admit
$$G\{E(y \mid x)\} = \sum_{j=1}^{p_n} m_j(x_j),$$
where $x_j$ is the $j$th element of $x$ and $G(\cdot)$ is the link function. Since nonparametric smoothing ridge estimation has outstanding performance in nonparametric additive models, its performance in generalized additive models may be worth investigating.
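As hinted above, a tensor-product B-spline basis for a single bivariate component could be assembled as follows (a minimal R sketch with our own naming; the paper does not implement this extension):

```r
library(splines)

# Tensor-product B-spline basis for a bivariate component m_{j,k}(x_j, x_k):
# row-wise products of the two univariate bases span the bivariate spline space.
tensor_basis <- function(xj, xk, df = 5) {
  Bj <- bs(xj, df = df); Bk <- bs(xk, df = df)
  do.call(cbind, lapply(seq_len(ncol(Bj)), function(a) Bj[, a] * Bk))
}
```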

Author Contributions

Conceptualization, X.J. and J.L.; methodology, H.W. and X.J.; software, H.W. and H.J.; resources, J.L.; data curation, H.J.; writing—original draft preparation, H.W.; supervision, X.J.; funding acquisition, X.J. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

The work of Jiang is partially supported by the National Natural Science Foundation of China (11871263) and the Shenzhen Sci-Tech Fund No. JCYJ20210324104803010. The work of Li is partially supported by the NSF of China No. 11971221, the Guangdong NSF Major Fund No. 2021ZDZX1001, and the Shenzhen Sci-Tech Fund No. RCJC20200714114556020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Boston housing dataset is available in the R package “MASS”. Arabidopsis thaliana gene data are available on the website https://www.ncbi.nlm.nih.gov/pmc/articles/PMC545783 (accessed on 16 November 2022).

Acknowledgments

We would like to thank the editor and four referees for their valuable comments and suggestions, which led to a substantial improvement of this article.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Now we give the technical proofs of our theorems. To streamline the arguments, we introduce some notation and technical lemmas. Define $v = (v_1, \ldots, v_n)^\top$ with $v_i = \sum_{j=1}^{p_n} m_j(x_{i,j}) - w_i^\top\theta^*$. Denote $\xi_i = e_i^\top W^\top(WW^\top)^{-1}W\theta^*$, $\eta_i = e_i^\top W^\top(WW^\top + \lambda I_n)^{-1}\varepsilon$ with $\varepsilon = (\epsilon_1, \ldots, \epsilon_n)^\top$, and $\zeta_i = e_i^\top W^\top(WW^\top + \lambda I_n)^{-1}v$.
Lemma A1.
Under Assumptions A1 and A3, the following conclusions hold:
(i) for any $C > 0$ and any fixed vector $b$ with $\|b\|_2 = 1$, there exist constants $c_1$ and $c_2$ with $0 < c_1 < 1 < c_2$ such that
$$P\left(b^\top P_{\lambda,W}\,b < \frac{c_1 n^{1-\tau}}{t_n} \ \text{ or } \ b^\top P_{\lambda,W}\,b > \frac{c_2 n^{1+\tau}}{t_n}\right) \le 4\exp(-Cn);$$
(ii) for any $C > 0$, there exists a positive constant $M > 0$ such that
$$P\left(|e_i^\top P_{\lambda,W}\,e_j| > \frac{M n^{1+\tau-\alpha}}{t_n\sqrt{\log n}}\right) = O\left(\exp\left(-\frac{C n^{1-2\alpha}}{2\log n}\right)\right)$$
holds for any $0 \le \alpha < 1/2$ and $1 \le i \ne j \le t_n$;
(iii) for any $1 \le i \le t_n$, the following inequality holds:
$$P\left(\|(WW^\top + \lambda I_n)^{-1}We_i\|_2^2 > \frac{c_2 c_1 c_3\,\kappa_n n^{1+2\tau}}{t_n^2}\right) \le 3\exp(-C_1 n).$$
Proof 
(Proof of Lemma A1). Similar to the proof of Theorem 3 in [18], we can show that Lemma A1 holds. □
Lemma A2.
Under Assumptions A1–A4, the following conclusions hold:
(i) $d_n \gg \kappa_n^{-r}$ and $d_n \gg n^{-1/2+2\tau}\sqrt{\log n}$;
(ii) $\|v\|_2 \le c_n s_n n^{1/2}\kappa_n^{-r}$ for some $c_n > 0$, $\|\theta_j^*\|_2 \ge 0.5\,c_4^{-1/2}\kappa_n^{1/2}\min_{j\in S}\{E|m_j(x_j)|^2\}^{1/2}$, and $\sum_{j\in S}\|\theta_j^*\|_2^2 \le 3c_2c_4 s_n\kappa_n$;
(iii) $P\big(|\eta_i| \ge \sqrt{c_2c_1c_3C^*}\,d_n(\log n)^{-1/2}n^{1-\tau}t_n^{-1}\big) \le 2\exp(-c_0\kappa_n^{-1}d_n^2 n^{1-4\tau}/\log n)$ for some constant $c_0 > 0$;
(iv) $P\big(|\zeta_i| \ge \sqrt{c_2c_1c_3}\,c_n s_n\kappa_n^{1/2-r}n^{1+\tau}t_n^{-1}\big) \le 3\exp(-C_1 n)$.
Proof 
(Proof of Lemma A2). (i) Lemma A2(i) follows from Assumptions A3 and A4.
(ii) By Assumption A3(i) and Corollary 6.21 of [28], we can obtain
$$\sup_{x,j}\big|m_j(x) - B(x)^\top\theta_j^*\big| \le c_n\kappa_n^{-r},$$
and
| { E | B ( x j ) θ j * | 2 } 1 / 2 { E | m j ( x j ) | 2 } 1 / 2 | = | E | B ( x j ) θ j * | 2 E | m j ( x j ) | 2 | { E | B ( x j ) θ j * | 2 } 1 / 2 + { E | m j ( x j ) | 2 } 1 / 2 sup x , j | m j ( x ) B ( x ) θ j * | { E | m j ( x j ) | + E | B ( x j ) θ j * | } { E | B ( x j ) θ j * | 2 } 1 / 2 + { E | m j ( x j ) | 2 } 1 / 2 = O ( κ n r ) .
This combined with min j S { E | m j ( x j ) | 2 } 1 / 2 d n , d n κ n r in Lemma A2(i), and
θ j * 2 2 λ max 1 ( E ( B ( x j ) B ( x j ) ) ) E | B ( x j ) θ j * | 2 c 4 1 κ n ( E | B ( x j ) θ j * | 2 )
by noticing λ max ( E ( B ( x j ) B ( x j ) ) ) c 4 κ n 1 , yields that
v 2 = O ( s n n 1 / 2 κ n r ) and θ j * 2 0.5 c 4 1 / 2 κ n 1 / 2 { E | m j ( x j ) | 2 } 1 / 2
for any j S . By (A1) and λ min ( E ( B ( x j ) B ( x j ) ) ) c 4 1 κ n 1 , we have
θ j * 2 2 λ min 1 ( E ( B ( x j ) B ( x j ) ) ) E | B ( x j ) θ j * | 2 2 c 4 κ n { E | B ( x j ) θ j * m j ( x j ) | 2 + E | m j ( x j ) | 2 } = O ( κ n 1 2 r ) + 2 c 4 κ n E | m j ( x j ) | 2 .
It follows from assumption A3(i)-(ii) that
j S θ j * 2 2 O ( s n κ n 1 2 r ) + 2 c 4 κ n j S E | m j ( x j ) | 2 3 c 2 c 4 s n κ n .
(iii) It is noticed that
η i = e i W ( W W + λ I n ) 1 ε = ( W W + λ I n ) 1 W e i 2 a ε ,
where
a = ( W W + λ I n ) 1 W e i / ( W W + λ I n ) 1 W e i 2 .
Using Lemma A1, for some C 1 > 0 , we have
P a P λ , W a > c 2 n 1 + τ t n 4 exp ( C 1 n )
and
P ( W W + λ I n ) 1 W e i 2 2 > c 2 c 1 c 3 κ n n 1 + 2 τ t n 2 3 exp ( C 1 n ) .
By Assumption A2 and Proposition 3 of [4], we obtain
P P a ε 2 2 > C * h ( t ) ( 1 + t ) 1 / 2 exp ( t / 2 )
for any t > 2 , where
h ( t ) = ( 1 + t ) { 1 2 / ( exp ( t / 2 ) 1 + t 1 ) } 2 .
Let χ n = 0.9 κ n 1 d n 2 n 1 4 τ / log n . We have h ( χ n ) κ n 1 d n 2 n 1 4 τ / log n for sufficient large n since d n κ n 1 / 2 n 1 / 2 2 τ / log n . Therefore, there exists some positive constant c 0 < 0.45 such that
P | a ε | > C * 1 / 2 d n κ n 1 / 2 n 1 / 2 2 τ / log n = P P a ε 2 2 > C * κ n 1 d n 2 n 1 4 τ / log n P P a ε 2 2 > C * h ( χ n ) ( 1 + χ n ) 1 / 2 exp ( χ n / 2 ) exp ( c 0 κ n 1 d n 2 n 1 4 τ / log n )
for sufficient large n. This, combined with (A2) and (A3), leads to
P | η i | c 2 c 1 C * c 3 d n ( log n ) 1 / 2 n 1 τ t n 1 2 exp ( c 0 κ n 1 d n 2 n 1 4 τ / log n ) .
(iv) From Lemmas A2(ii) and (A3), we have
P | ζ i | c 2 c 1 c 3 c n κ n 1 / 2 r n 1 + τ t n 1 s n 3 exp ( C 1 n ) .
This completes the proof of Lemma A2. □
Proof 
(Proof of Theorem 1). From the definition of $\hat\theta_j$ in (5), we have
$$\hat\theta_j = W_j^\top(WW^\top + \lambda I_n)^{-1}Y = W_j^\top(WW^\top + \lambda I_n)^{-1}W\theta^* + W_j^\top(WW^\top + \lambda I_n)^{-1}v + W_j^\top(WW^\top + \lambda I_n)^{-1}\varepsilon \equiv \tilde\theta_j + E_{1,j} + E_{2,j}.$$
Next, we divide the proof into four parts.
Part (I): In this part, we establish the upper bound of max j S c E 1 , j + E 2 , j 2 .
By noticing E 2 , j 2 κ n 1 / 2 max 1 i t n | η i | , we have
P max 1 j p n E 2 , j 2 c κ n 1 / 2 d n n 1 τ t n log n P max 1 i t n | η i | c d n n 1 τ t n log n i = 1 t n P | η i | c d n n 1 τ t n log n .
It follows from Lemma A2 that, for some constants c and c 0 ,
P max 1 j p n E 2 , j 2 c κ n 1 / 2 d n n 1 τ t n log n 2 t n exp ( c 0 κ n 1 d n 2 n 1 4 τ ( log n ) 1 ) exp ( 0.5 c 0 κ n 1 d n 2 n 1 4 τ ( log n ) 1 ) ,
where the last inequality holds due to log ( t n ) = o ( κ n 1 d n 2 n 1 4 τ ( log n ) 1 ) . Similarly, by Lemma A2, E 1 , j 2 κ n 1 / 2 max 1 i t n | ζ i | , and Bonferroni’s inequality, there exists some constant c * such that
P max 1 j p n E 1 , j 2 c * κ n 1 r n 1 + τ s n t n 3 t n exp ( C 1 n ) 3 exp ( 0.5 C 1 n ) .
By noticing κ n r 1 / 2 d n / ( n 2 τ s n log n ) , we obtain
P max 1 j p n E 1 , j + E 2 , j 2 ( c + c * ) κ n 1 / 2 d n n 1 τ t n log n 2 exp ( 0.5 c 0 κ n 1 d n 2 n 1 4 τ ( log n ) 1 ) .
Part (II): In this part, we establish the upper bound of max j S c θ ˜ j 2 . For 1 j t n , there exists index set M j { 1 , , t n } such that θ j = θ M j , where θ M j is the sub-vector of θ formed by all components with indexes in M j . Denoted by M = j S M j and θ = ( θ 1 , , θ t n ) with t n = p n κ n . By Cauchy–Schwarz’s inequality, Lemma A2(ii), and Assumption A4(ii), we obtain that
θ ˜ j 2 κ n max i M j | k M e i P λ , W e k θ k * | s n κ n 2 θ * 2 max 1 i k t n | e i P λ , W e k | 3 c 2 c 4 s n 2 κ n 3 max 1 i k t n | e i P λ , W e k |
for j S c , where c 2 and c 4 are defined in Assumptions A3 and A4. It follows from Lemma A1 and Bonferroni inequalities that, for some constants M , C 1 > 0 ,
P max 1 i k t n | e i P λ , W e k | > M n 1 + τ α t n log n 1 i k t n P | e i P λ , W e k | > M n 1 + τ α t n log n = O exp 2 log t n C 1 n 1 2 α 2 log n ,
holds for any 0 α < 1 / 2 . By taking n α = d n 1 κ n s n n 2 τ and assumption log ( t n ) = o d n 2 n 1 4 τ κ n 2 s n 2 log n in A4 (iii), we can obtain
P max j S c θ ˜ j 2 > 3 c 2 c 4 κ n M n 1 τ d n t n log n O exp 2 log t n C 1 d n 2 n 1 4 τ 2 κ n 2 s n 2 log n = O exp C 1 d n 2 n 1 4 τ 3 κ n 2 s n 2 log n .
Part (III): In this part, we establish the lower bound of min j S θ ˜ j 2 .
From the triangle inequality, we have
min j S θ ˜ j 2 = min j S W j ( W W + λ I n ) 1 W j θ j * + k j , k S W j ( W W + λ I n ) 1 W k θ k * 2 min j S W j ( W W + λ I n ) 1 W j θ j * 2 max j S k j , k S W j ( W W + λ I n ) 1 W k θ k * 2 I n , 1 I n , 2 .
With the same arguments as (A5), we can establish that
P I n , 2 > 3 c 2 c 4 κ n M n 1 τ d n t n log n = O exp C 1 d n 2 n 1 4 τ 3 κ n 2 s n 2 log n .
Applying equality ( a + b ) 2 a 2 / 2 b 2 and Jensen’s inequality, we can obtain
W j ( W W + λ I n ) 1 W j θ j * 2 2 = i M j ( k M j e i P λ , W e k θ k * ) 2 i M j ( e i P λ , W e i ) 2 | θ i * | 2 / 2 i M j ( k M j , k i e i P λ , W e k θ k * ) 2 min i M j ( e i P λ , W e i ) 2 θ j * 2 2 / 2 κ n i M j k M j , k i ( e i P λ , W e k ) 2 | θ k * | 2 min i M j ( e i P λ , W e i ) 2 θ j * 2 2 / 2 κ n 2 θ j * 2 2 max i , k M j , i k ( e i P λ , W e k ) 2 .
Thus,
I n , 1 2 min j S θ j 2 2 min i M ( e i P λ , W e i ) 2 / 2 κ n 2 max i , k M , i k ( e i P λ , W e k ) 2 .
Lemma A1, s n κ n = o ( n ) and Bonferroni inequalities give that, for some constants c 1 , M , α and C 1 > 0 ,
P min i M e i P λ , W e i c 1 n 1 τ t n i M P e i P λ , W e i c 1 n 1 τ t n 4 n exp ( C 1 n )
and
P max i , k M , i k | e i P λ , W e k | M n 1 + τ α t n log n i , k M , i k P e i P λ , W e k M n 1 + τ α t n log n O n exp C 1 n 1 2 α 2 log n
holds for any 0 α < 1 / 2 . Denoted by
A 1 = min i M e i P λ , W e i c 1 n 1 τ t n , A 2 = max i , k M , i k | e i P λ , W e k | M n 1 + τ α t n log n ,
and
A 3 = min i M ( e i P λ , W e i ) 2 / 2 κ n 2 max i , k M , i k ( e i P λ , W e k ) 2 > | c 1 | 2 n 2 2 τ 3 t n 2 .
By taking α = 2 τ + log n ( κ n ) , we have
P ( A 3 ) P ( A 1 c A 2 c ) 1 P ( A 1 ) P ( A 2 ) = 1 O n exp C 1 n 1 2 τ 2 κ n 2 log n .
It is obvious that min j S θ j 2 2 0.25 c 4 1 κ n d n 2 from Lemma A2(ii) and Assumption A4(ii). This, combined with (A7), yields that
P I n , 1 2 | c 1 | 2 c 4 1 κ n d n 2 n 2 2 τ 12 t n 2 1 O n exp C 1 n 1 2 τ 2 κ n 2 log n .
Similar to (A7), we can obtain
P min j S θ ˜ j 2 c 1 c 4 1 / 2 κ n 1 / 2 d n n 1 τ 12 t n 1 O exp C 1 d n 2 n 1 4 τ 3 κ n 2 s n 2 log n
by combing (A6) and (A8).
Part (IV): In this part, we show that
$$P\Big(\min_{j\in S}\|\hat\theta_j\|_2 > \max_{j\in S^c}\|\hat\theta_j\|_2\Big) \to 1.$$
Similar to (A7), by θ ^ j = θ ˜ j + E 1 , j + E 2 , j , (A4) and (A9), we can show that
P min j S θ ^ j 2 c 1 c 4 1 / 2 κ n 1 / 2 d n n 1 τ 13 t n P min j S θ ˜ j 2 max 1 j p n E 1 , j + E 2 , j 2 c 1 c 4 1 / 2 κ n 1 / 2 d n n 1 τ 14 t n 1 O exp C 1 d n 2 n 1 4 τ 3 κ n 2 s n 2 log n + exp c 0 d n 2 n 1 4 τ 2 κ n log n .
Denote by A 4 = max 1 j p n E 1 , j + E 2 , j 2 ( c + c * ) κ n 1 / 2 d n n 1 τ t n log n ,   A 5 = max j S c θ ˜ j 2 > 3 c 2 c 4 κ n M n 1 τ d n t n log n , and
A 6 = max 1 j p n E 1 , j + E 2 , j 2 + max j S c θ ˜ j 2 ( c + c * + 3 c 2 c 4 ) κ n 1 / 2 d n n 1 τ t n log n .
Since A 6 A 5 c A 4 , by (A4) and (A5), we have
P ( A 6 ) = P ( A 6 A 5 ) + P ( A 6 A 5 c ) P ( A 5 ) + P ( A 4 ) = O exp c 0 d n 2 n 1 4 τ 2 κ n log n + exp C 1 d n 2 n 1 4 τ 3 κ n 2 s n 2 log n .
Using max j S c θ ^ j 2 max 1 j p n E 1 , j + E 2 , j 2 + max j S c θ ˜ j 2 , we obtain that
P max j S c θ ^ j 2 < ( c + c * + 3 c 2 c 4 ) κ n 1 / 2 d n n 1 τ t n log n P max j S c θ ˜ j 2 + max 1 j p n E 1 , j + E 2 , j 2 < ( c + c * + 3 c 2 c 4 ) κ n 1 / 2 d n n 1 τ t n log n 1 O exp c 0 d n 2 n 1 4 τ 2 κ n log n + exp C 1 d n 2 n 1 4 τ 3 κ n 2 s n 2 log n .
Notice that d n 2 n 1 4 τ κ n 2 s n 2 log n and ( c + c * + 3 c 2 c 4 ) / log n c 1 c 4 1 / 13 for sufficient large n. This, combined with (A11) and (A12), establishes (A10). The proof is completed. □
Proof 
(Proof of Theorem 2). We divide the proof into two parts:
Part (I) shows that $P(\hat{S}^* = S) \to 1$; Part (II) shows that $P(\hat{S} = \hat{S}^*) \to 1$.
Part (I): Step (i). It is noticed that $Y = W_S\theta_S^* + v + \varepsilon$ and $P_{W_{\hat{S}_k}} - P_{W_{\hat{S}_{k-1}}} = P_{\widetilde{W}_{i_k}}$ with $\widetilde{W}_{i_k} = (I_n - P_{W_{\hat{S}_{k-1}}})W_{i_k}$. For $i_k \in S$, we obtain that
R S S k 1 R S S k = Y ( I n P W S ^ k 1 ) Y Y ( I n P W S ^ k ) Y = P W ˜ i k ( W S θ S * + v + ε ) 2 2 P W ˜ i k W S θ S * 2 2 / 2 P W ˜ i k ( v + ε ) 2 2 .
Next, let us deal with the above two terms separately. Denoted by T k = ( S S ^ k 1 ) { i k } . We have
P W ˜ i k W S θ S * 2 2 = ( P W S ^ k P W S ^ k 1 ) W S θ S * 2 2 inf t P W S ^ k W S θ S * W S ^ k 1 t 2 2 inf a P W S ^ k W i k θ i k * W T k a 2 2 .
From P W S ^ k W i k = W i k , Lemma A2, Assumption A4(ii), and i k S , we can obtain
min i k S P W ˜ i k W S θ S * 2 2 min i k S θ i k * 2 2 ( I n P W T k ) W i k 2 2 0.25 c 4 1 κ n d n 2 min i k S ( I n P W T k ) W i k 2 2 .
From Theorem 1, we have conclusion | T k { i k } | = O ( s n ) holding for ∀ i k S with probability tending to one. This, combined with Assumption A5, yields that
λ min ( n 1 W T W T ) 0.5 c 6 n τ κ n 1
with probability going to one, where W T = ( W T k , W i k ) . It follows from λ max { ( W i k W i k ) 1 }   λ max { W T W T } and (A15) that
min i k S P W ˜ i k W S θ S * 2 2 2 μ 0 d n 2 n 1 τ
with μ 0 = 0.0625 c 4 1 c 6 .
Following Lemma A2, we have that
P W ˜ i k ( v + ε ) 2 2 = 2 P W ˜ i k v 2 2 + 2 P W ˜ i k ε 2 2 2 v 2 2 + 2 P W ˜ i k ε 2 2 = O ( n κ n 2 r ) + 2 P W ˜ i k ε 2 2 .
From Assumption A2 and Proposition 3 of [4], we have
P P W ˜ i k ε 2 2 > κ n C * ( 1 + t ) { 1 2 / ( exp ( t / 2 ) 1 + t 1 ) } 2 . ( 1 + t ) 1 / 2 exp ( κ n t / 2 )
By taking t = log p n + log n 1 and applying Bonferroni inequalities, we can obtain
P max i k S P W ˜ i k ε 2 2 > β n i k S P P W ˜ i k ε 2 2 > β n i k S log p n + log n exp { κ n ( log p n + log n 1 ) / 2 } = O ( s n log p n ) exp { κ n ( log p n + log n 1 ) / 2 } 0 ,
where
β n = κ n C * ( log p n + log n 1 ) { 1 2 / ( exp ( ( log p n + log n 1 ) / 2 ) log p n + log n 1 ) } 2 .
Therefore, we establish that
max i k S P W ˜ i k ε 2 2 = o P { κ n ( log p n + log n ) } .
By κ n r 1 / 2 d n / n 2 τ and log ( t n ) = o d n 2 n 1 4 τ κ n 2 s n 2 log n in Assumption A4(ii), we obtain
P W ˜ i k ( v + ε ) 2 2 = o P ( d n 2 n 1 τ ) .
This, combined with (A13) and (A16), yields that
min i k S { R S S k 1 R S S k } μ 0 d n 2 n 1 τ
with probability going to one. Applying the inequality log ( 1 + x ) min { log 2 , 0.5 x } for x > 0 , we obtain that
log ( R S S k 1 ) log ( R S S k ) = log { 1 + ( R S S k 1 R S S k ) / R S S k } 0.5 ( R S S k 1 R S S k ) / R S S k 0.5 μ 0 d n 2 n 1 τ / R S S k ,
This combined with n 1 R S S k n 1 Y Y ¯ n 2 2 Var ( y 1 ) with Y ¯ n = n 1 i = 1 n y i , leads to
min i k S { log ( R S S k 1 ) log ( R S S k ) } 0.4 μ 0 d n 2 n τ / Var ( y 1 ) .
Noticing that log ( t n ) = o d n 2 n 1 4 τ κ n 2 s n 2 log n and Var ( y 1 ) = O ( κ n s n 2 n 3 τ log ( n ) ) and log ( f ( k + 1 ) ) log ( f ( k ) ) = O { κ n log ( p n ) } , we can obtain
E B I C k 1 E B I C k 0.4 μ 0 d n 2 n τ / Var ( y 1 ) n 1 log ( n ) + γ log ( f ( k + 1 ) log ( f ( k ) ) 0.4 μ 0 d n 2 n τ / Var ( y 1 ) n 1 O { log ( n ) + γ κ n log ( p n ) } > 0 .
Therefore, for $i_k \in S$, the conclusion
$$EBIC_k < EBIC_{k-1}$$
holds uniformly with probability going to one.
Step (ii): Let $k_0$ be an integer satisfying $S \not\subseteq \hat{S}_{k_0-1}$ and $S \subseteq \hat{S}_{k_0}$. We prove that
$$\min_{1\le j\le L}\big\{EBIC_{k_0+j} - EBIC_{k_0+j-1}\big\} > 0.$$
By log ( 1 + x ) x and log f ( k 0 + j ) f ( k 0 + j 1 ) = O { κ n log ( p n ) } , we have
E B I C k 0 + j 1 E B I C k 0 + j R S S k 0 + j 1 R S S k 0 + j R S S k 0 + j κ n log n + γ κ n log ( p n ) / n .
With the same arguments as (A17), we can show that
max 1 j L ( R S S k 0 + j 1 R S S k 0 + j ) = max 1 j L ( P W S ^ k 0 + j P W S ^ k 0 + j 1 ) ε 2 2 = o P { κ n ( log p n + log n ) } .
From (26) in [10], we have $n^{-1}RSS_{k_0+l} = E\epsilon_1^2 + o_P(1)$. Furthermore, $E\epsilon_1^2 = O(1)$ from Assumption A2. Thus,
$$P\Big(\max_{1\le j\le L}\big\{EBIC_{k_0+j-1} - EBIC_{k_0+j}\big\} < 0\Big) \to 1.$$
The combination of (A18) and (A19) leads to $P(\hat{S}^* = S) \to 1$.
Part (II): Similar to Step (ii) in Part (I), we can show that
$$\min_{\ell\in F_n\setminus\hat{S}^*}\big\{EBIC^*_\ell - EBIC^*\big\} > 0$$
with probability tending to one. This leads to $P(\hat{S} = \hat{S}^*) \to 1$. The proof is completed. □

References

  1. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  2. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  3. Candes, E.; Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat. 2007, 35, 2313–2351. [Google Scholar]
  4. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef] [Green Version]
  5. Fan, J.; Samworth, R.; Wu, Y. Ultrahigh dimensional feature selection: Beyond the linear model. J. Mach. Learn. Res. 2009, 10, 2013–2038. [Google Scholar]
  6. Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2008, 70, 849–911. [Google Scholar] [CrossRef] [Green Version]
7. Fan, J.; Feng, Y.; Song, R. Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Am. Stat. Assoc. 2011, 106, 544–557.
8. Zhu, L.; Li, L.; Li, R.; Zhu, L. Model-free feature screening for ultrahigh-dimensional data. J. Am. Stat. Assoc. 2011, 106, 1464–1475.
9. Wang, H. Forward regression for ultra-high dimensional variable screening. J. Am. Stat. Assoc. 2009, 104, 1512–1524.
10. Cheng, M.Y.; Honda, T.; Zhang, J.T. Forward variable selection for sparse ultra-high dimensional varying coefficient models. J. Am. Stat. Assoc. 2016, 111, 1209–1221.
11. Zhong, W.; Duan, S.; Zhu, L. Forward additive regression for ultrahigh dimensional nonparametric additive models. Stat. Sin. 2020, 30, 175–192.
12. Zhou, T.; Zhu, L.; Xu, C.; Li, R. Model-free forward screening via cumulative divergence. J. Am. Stat. Assoc. 2020, 115, 1393–1405.
13. Hastie, T.; Tibshirani, R. Generalized Additive Models; Chapman and Hall: New York, NY, USA, 1990.
14. Meier, K.; Van de Geer, S.; Bühlmann, P. Minimax optimal rates of estimation in high dimensional additive models. Ann. Stat. 2009, 47, 3779–3821.
15. Gregory, K.; Mammen, E.; Wahl, M. Statistical inference in sparse high-dimensional additive models. Ann. Stat. 2021, 49, 1514–1536.
16. Lu, J.; Kolar, M.; Liu, H. Kernel meets sieve: Post-regularization confidence bands for sparse additive model. J. Am. Stat. Assoc. 2020, 115, 2084–2099.
17. Bai, R.; Moran, G.; Antonelli, J.; Cheng, Y.; Boland, M. Spike-and-slab group lassos for grouped regression and sparse generalized additive models. J. Am. Stat. Assoc. 2022, 117, 184–197.
18. Wang, X.; Leng, C. High dimensional ordinary least squares projection for screening variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 2016, 78, 589–611.
19. Chen, J.; Chen, Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika 2008, 95, 759–771.
20. Chen, J.; Chen, Z. Extended BIC for small-n-large-P sparse GLM. Stat. Sin. 2012, 22, 555–574.
21. Fan, J.; Ma, Y.; Dai, W. Nonparametric independence screening in sparse ultra-high-dimensional varying coefficient models. J. Am. Stat. Assoc. 2014, 109, 1270–1284.
22. Liao, Z.; Shi, X. A nondegenerate Vuong test and post selection confidence intervals for semi/nonparametric model. Quant. Econ. 2020, 11, 983–1017.
23. Wille, A.; Zimmermann, P.; Vranová, E.; Fürholz, A.; Laule, O.; Bleuler, S.; Hennig, L.; Prelić, A.; Von Rohr, P.; Thiele, L.; et al. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol. 2004, 5, R92.
24. Chen, Q.; Fan, D.; Wang, G. Heteromeric geranyl (geranyl) diphosphate synthase is involved in monoterpene biosynthesis in Arabidopsis flowers. Mol. Plant 2015, 8, 1434–1437.
25. Hao, N.; Zhang, H. A note on high-dimensional linear regression with interactions. Am. Stat. 2017, 71, 291–297.
26. Hastie, T.; Tibshirani, R. Generalized additive models: Some applications. J. Am. Stat. Assoc. 1987, 82, 371–386.
27. Horowitz, J. Nonparametric estimation of a generalized additive model with an unknown link function. Econometrica 2001, 69, 499–513.
28. Schumaker, L.L. Spline Functions: Basic Theory; Cambridge University Press: Cambridge, UK, 2007.
Table 1. Average numbers of true positives (TP), false positives (FP), and calculation time over 100 repetitions, with their robust standard deviations (in parentheses), for Example 1 with ϵ ~ N(0, 1).

                  p_n = 500                                        p_n = 1000
ρ     Approach    TP           FP            Time (s)              TP           FP             Time (s)
AR Structure
0.3   FAR         4.00 (0.00)  0.00 (0.00)   83.19 (9.80)          4.00 (0.00)  0.00 (0.00)    166.26 (18.65)
      C-FS        3.20 (0.40)  5.21 (2.96)   16.18 (5.38)          3.34 (0.48)  11.44 (5.38)   39.64 (14.35)
      GRIE        4.00 (0.00)  0.00 (0.00)   2.37 (0.28)           3.99 (0.10)  0.01 (0.10)    3.56 (0.77)
0.6   FAR         4.00 (0.00)  0.00 (0.00)   82.06 (9.77)          4.00 (0.00)  0.00 (0.00)    168.16 (20.51)
      C-FS        3.71 (0.46)  4.81 (2.39)   16.57 (4.28)          3.70 (0.46)  9.33 (4.88)    34.61 (12.24)
      GRIE        3.99 (0.10)  0.00 (0.00)   2.40 (0.35)           3.98 (0.14)  0.03 (0.30)    3.43 (0.72)
0.9   FAR         3.17 (0.60)  0.00 (0.00)   81.34 (9.48)          3.09 (0.60)  0.00 (0.00)    168.70 (18.62)
      C-FS        3.14 (0.51)  2.44 (1.72)   10.63 (3.01)          3.14 (0.53)  4.43 (2.96)    19.14 (6.78)
      GRIE        3.71 (0.46)  0.20 (0.40)   2.22 (0.40)           3.70 (0.46)  0.21 (0.43)    3.45 (0.76)
CS Structure
0.3   FAR         4.00 (0.00)  0.00 (0.00)   83.60 (10.17)         4.00 (0.00)  0.00 (0.00)    165.38 (19.00)
      C-FS        3.45 (0.52)  4.96 (2.97)   16.11 (5.23)          3.33 (0.47)  11.69 (6.24)   39.98 (16.38)
      GRIE        4.00 (0.00)  0.09 (0.90)   2.30 (0.39)           4.00 (0.00)  0.02 (0.20)    3.57 (0.72)
0.6   FAR         4.00 (0.00)  0.00 (0.00)   84.24 (10.26)         4.00 (0.00)  0.01 (0.10)    166.92 (18.64)
      C-FS        3.74 (0.44)  5.05 (2.98)   16.82 (5.26)          3.61 (0.55)  10.26 (5.25)   36.73 (13.38)
      GRIE        4.00 (0.00)  0.23 (2.30)   2.35 (0.37)           4.00 (0.00)  0.11 (0.62)    3.41 (0.74)
0.9   FAR         3.03 (0.67)  0.00 (0.00)   85.57 (11.01)         2.79 (0.70)  0.00 (0.00)    166.02 (18.65)
      C-FS        2.63 (0.65)  4.48 (3.47)   13.35 (6.08)          2.56 (0.67)  9.70 (5.93)    32.23 (15.07)
      GRIE        3.89 (0.31)  1.47 (7.62)   2.21 (0.35)           3.79 (0.41)  3.16 (17.90)   3.44 (0.77)
Table 2. Average numbers of true positives (TP), false positives (FP), and calculation time over 100 repetitions, with their robust standard deviations (in parentheses), for Example 1 with ϵ ~ 0.5χ²(2).

                  p_n = 500                                        p_n = 1000
ρ     Approach    TP           FP            Time (s)              TP           FP             Time (s)
AR Structure
0.3   FAR         4.00 (0.00)  0.00 (0.00)   79.30 (11.14)         4.00 (0.00)  0.00 (0.00)    165.38 (22.09)
      C-FS        3.27 (0.45)  5.38 (3.00)   16.43 (5.40)          3.33 (0.47)  11.24 (4.85)   39.44 (12.59)
      GRIE        4.00 (0.00)  0.00 (0.00)   2.40 (0.36)           3.99 (0.10)  0.00 (0.00)    3.33 (0.68)
0.6   FAR         4.00 (0.00)  0.00 (0.00)   79.19 (12.07)         4.00 (0.00)  0.00 (0.00)    163.64 (23.37)
      C-FS        3.70 (0.46)  4.42 (2.53)   15.60 (4.49)          3.77 (0.42)  9.56 (4.29)    35.74 (11.14)
      GRIE        3.99 (0.10)  0.01 (0.10)   2.31 (0.33)           3.98 (0.14)  0.03 (0.30)    3.42 (0.67)
0.9   FAR         3.09 (0.68)  0.00 (0.00)   80.28 (10.89)         3.01 (0.72)  0.00 (0.00)    163.88 (22.78)
      C-FS        3.10 (0.48)  2.28 (1.56)   10.15 (2.86)          3.15 (0.50)  4.26 (2.20)    19.12 (5.44)
      GRIE        3.71 (0.46)  0.23 (0.51)   2.28 (0.32)           3.78 (0.42)  0.16 (0.39)    3.33 (0.67)
CS Structure
0.3   FAR         4.00 (0.00)  0.00 (0.00)   80.28 (9.59)          4.00 (0.00)  0.00 (0.00)    164.25 (19.31)
      C-FS        3.51 (0.52)  5.10 (2.99)   16.50 (5.68)          3.36 (0.48)  10.87 (4.98)   37.91 (13.02)
      GRIE        4.00 (0.00)  0.00 (0.00)   2.31 (0.34)           3.98 (0.14)  0.34 (3.30)    3.38 (0.66)
0.6   FAR         4.00 (0.00)  0.00 (0.00)   80.12 (11.56)         4.00 (0.00)  0.01 (0.10)    165.41 (20.04)
      C-FS        3.72 (0.49)  4.71 (2.57)   16.04 (4.68)          3.68 (0.49)  9.79 (5.39)    36.12 (14.41)
      GRIE        3.99 (0.10)  0.00 (0.00)   2.31 (0.29)           4.00 (0.00)  0.11 (0.65)    3.40 (0.70)
0.9   FAR         3.00 (0.79)  0.03 (0.17)   79.62 (11.60)         2.85 (0.78)  0.02 (0.14)    164.90 (19.25)
      C-FS        2.73 (0.66)  4.40 (2.81)   13.56 (4.91)          2.69 (0.63)  10.82 (5.99)   35.72 (15.68)
      GRIE        3.94 (0.24)  3.13 (16.94)  2.28 (0.35)           3.82 (0.39)  3.63 (18.13)   3.30 (0.72)
Table 3. The empirical probabilities of each important covariate and all important covariates being retained for 100 replications in Example 1 with ϵ ~ N(0, 1).

                  p_n = 500                              p_n = 1000
ρ     Approach    P_1    P_2    P_3    P_4    P_all      P_1    P_2    P_3    P_4    P_all
AR Structure
0.3   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        1.00   0.20   1.00   1.00   0.20       1.00   0.34   1.00   1.00   0.34
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   0.99   1.00   1.00   0.99
0.6   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        1.00   0.71   1.00   1.00   0.71       0.98   0.72   1.00   1.00   0.70
      GRIE        1.00   0.99   1.00   1.00   0.99       1.00   0.98   1.00   1.00   0.98
0.9   FAR         0.80   0.49   0.97   0.91   0.28       0.82   0.42   0.98   0.87   0.23
      C-FS        0.50   0.67   0.97   1.00   0.21       0.50   0.67   0.97   1.00   0.22
      GRIE        0.83   0.88   1.00   1.00   0.71       0.79   0.91   1.00   1.00   0.70
CS Structure
0.3   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        0.99   0.46   1.00   1.00   0.46       1.00   0.33   1.00   1.00   0.33
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
0.6   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        0.93   0.81   1.00   1.00   0.74       0.88   0.73   1.00   1.00   0.64
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
0.9   FAR         0.63   0.57   0.98   0.85   0.24       0.58   0.49   0.91   0.81   0.16
      C-FS        0.11   0.62   0.97   0.93   0.05       0.12   0.53   0.97   0.94   0.07
      GRIE        0.93   0.96   1.00   1.00   0.89       0.87   0.92   1.00   1.00   0.79
Table 4. The empirical probabilities of each important covariate and all important covariates being retained for 100 replications in Example 1 with ϵ ~ 0.5χ²(2).

                  p_n = 500                              p_n = 1000
ρ     Approach    P_1    P_2    P_3    P_4    P_all      P_1    P_2    P_3    P_4    P_all
AR Structure
0.3   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        1.00   0.27   1.00   1.00   0.27       1.00   0.33   1.00   1.00   0.33
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   0.99   1.00   1.00   0.99
0.6   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        1.00   0.70   1.00   1.00   0.70       0.99   0.78   1.00   1.00   0.77
      GRIE        1.00   0.99   1.00   1.00   0.99       1.00   0.99   1.00   0.99   0.98
0.9   FAR         0.81   0.47   0.95   0.86   0.28       0.82   0.45   0.94   0.80   0.26
      C-FS        0.42   0.70   0.98   1.00   0.17       0.53   0.65   0.97   1.00   0.21
      GRIE        0.82   0.89   1.00   1.00   0.71       0.84   0.94   1.00   1.00   0.78
CS Structure
0.3   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        0.99   0.52   1.00   1.00   0.52       0.99   0.37   1.00   1.00   0.36
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   0.98   1.00   1.00   0.98
0.6   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      C-FS        0.93   0.79   1.00   1.00   0.74       0.92   0.76   1.00   1.00   0.69
      GRIE        1.00   0.99   1.00   1.00   0.99       1.00   1.00   1.00   1.00   1.00
0.9   FAR         0.54   0.65   0.97   0.84   0.28       0.55   0.55   0.94   0.81   0.24
      C-FS        0.12   0.66   0.97   0.98   0.09       0.14   0.61   0.97   0.97   0.08
      GRIE        0.97   0.97   1.00   1.00   0.94       0.88   0.94   1.00   1.00   0.82
Table 5. Average numbers of true positives (TP), false positives (FP), and calculation time over 100 repetitions, with their robust standard deviations (in parentheses), for Example 2 with ϵ ~ N(0, 1).

                  p_n = 500                                        p_n = 1000
δ     Approach    TP           FP            Time (s)              TP           FP             Time (s)
0.4   FAR         4.00 (0.00)  0.59 (0.51)   81.85 (10.28)         3.98 (0.20)  0.57 (0.50)    168.16 (20.16)
      C-FS        4.00 (0.00)  5.32 (2.97)   18.74 (5.35)          4.00 (0.00)  11.40 (5.37)   42.87 (15.18)
      GRIE        4.00 (0.00)  0.04 (0.24)   2.41 (0.36)           4.00 (0.00)  0.06 (0.34)    3.59 (0.64)
0.6   FAR         3.94 (0.28)  1.09 (0.49)   80.25 (9.06)          3.86 (0.49)  1.07 (0.48)    164.35 (18.90)
      C-FS        4.00 (0.00)  6.05 (2.88)   19.32 (5.19)          4.00 (0.00)  12.11 (5.37)   43.61 (14.17)
      GRIE        4.00 (0.00)  0.17 (0.49)   2.33 (0.34)           4.00 (0.00)  0.18 (0.54)    3.43 (0.63)
0.8   FAR         3.66 (0.73)  1.26 (0.50)   80.04 (9.10)          3.68 (0.72)  1.22 (0.54)    164.05 (18.76)
      C-FS        3.81 (0.42)  5.88 (2.82)   18.62 (5.41)          3.85 (0.36)  12.13 (5.18)   42.92 (13.12)
      GRIE        3.95 (0.22)  0.38 (0.72)   2.37 (0.34)           3.89 (0.31)  0.27 (0.63)    3.45 (0.75)
Table 6. The empirical probabilities of each important covariate and all important covariates being retained for 100 replications in Example 2 with ϵ ~ N(0, 1).

                  p_n = 500                              p_n = 1000
δ     Approach    P_1    P_2    P_3    P_4    P_all      P_1    P_2    P_3    P_4    P_all
0.4   FAR         1.00   1.00   1.00   1.00   1.00       1.00   1.00   0.99   0.99   0.99
      C-FS        1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
0.6   FAR         0.99   0.99   0.98   0.98   0.95       0.96   0.98   0.97   0.95   0.92
      C-FS        1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
      GRIE        1.00   1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00   1.00
0.8   FAR         0.92   0.91   0.91   0.92   0.81       0.90   0.92   0.93   0.93   0.83
      C-FS        0.93   0.97   0.96   0.95   0.82       0.94   0.97   0.97   0.97   0.85
      GRIE        1.00   1.00   0.97   0.98   0.95       0.97   1.00   0.95   0.97   0.89
Table 7. Average model size, number of SNV, and A-PE over 100 repetitions, with their robust standard deviations (in parentheses), for the Boston Housing Data.

Approach    Model Size      SNV            A-PE
FAR         2.10 (0.30)     0.00 (0.00)    0.052 (0.011)
C-FS        19.26 (5.39)    8.71 (5.10)    0.047 (0.012)
GRIE        5.07 (0.95)     0.00 (0.00)    0.043 (0.010)
Table 8. The frequency for 13 real covariates being selected over 100 replications for Boston Housing Data.

Variable    FAR    C-FS    GRIE
RM          100    100     100
AGE         0      99      0
RAD         0      60      6
TAX         0      59      7
PTRATIO     0      100     68
B           0      92      99
LSTAT       100    100     100
CRIM        10     100     80
ZN          0      97      0
INDUS       0      22      0
CHAS        0      26      0
NOX         0      100     47
DIS         0      100     0
Table 9. Average model size and A-PE over 100 repetitions, with their robust standard deviations (in parentheses), for the Arabidopsis thaliana gene data.

Approach    Model Size      A-PE
FAR         1.00 (0.00)     0.289 (0.099)
C-FS        10.15 (3.34)    0.282 (0.181)
GRIE        1.76 (1.18)     0.276 (0.093)