Article

Consistent Estimation of Generalized Linear Models with High Dimensional Predictors via Stepwise Regression

1 Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
2 Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202, USA
3 Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
* Author to whom correspondence should be addressed.
Entropy 2020, 22(9), 965; https://doi.org/10.3390/e22090965
Submission received: 1 August 2020 / Revised: 26 August 2020 / Accepted: 28 August 2020 / Published: 31 August 2020

Abstract:
Predictive models play a central role in decision making. Penalized regression approaches, such as least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and explain the impacts of the selected predictors, but the estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after applying variable screening methods to downsize variables. We propose a stepwise procedure for fitting generalized linear models with ultrahigh dimensional predictors. Our procedure can provide a final model; control both false negatives and false positives; and yield consistent estimates, which are useful to gauge the actual effect size of risk factors. Simulations and applications to two clinical studies verify the utility of the method.

1. Introduction

In the era of precision medicine, constructing interpretable and accurate predictive models, based on patients’ demographic characteristics, clinical conditions and molecular biomarkers, has been crucial for disease prevention, early diagnosis and targeted therapy [1]. When the number of predictors is moderate, penalized regression approaches such as the least absolute shrinkage and selection operator (LASSO) [2] have been used to construct predictive models and explain the impacts of the selected predictors. However, in ultrahigh dimensional settings where p is of an exponential order of n, penalized methods may incur computational challenges [3], may not reach globally optimal solutions and often generate biased estimates [4]. Sure independence screening (SIS), proposed by [5], has emerged as a powerful tool for modeling ultrahigh dimensional data. However, the method relies on a partial faithfulness assumption, which stipulates that jointly important variables must be marginally important, an assumption that may not always be realistic. To relax this condition, iterative procedures such as ISIS [5] repeatedly screen variables based on the residuals from previous iterations, but they are computationally heavy and their theoretical properties are unclear. Conditional screening approaches (see, e.g., [6]) have, to some extent, addressed this challenge. However, screening methods do not directly generate a final model, and post-screening regularization methods, such as LASSO, are recommended by [5] to produce a final model.
For generating a final predictive model in ultrahigh dimensional settings, recent years have seen a surging interest in forward regression, an old technique for model selection; see [7,8,9], among many others. Under some regularity conditions and with proper stopping criteria, forward regression can achieve screening consistency while sequentially selecting variables according to metrics such as AIC, BIC or $R^2$. Also closely related to forward selection is least angle regression (LARS) [10], a widely used model selection algorithm for high-dimensional models. In the generalized linear model setting, [11,12] proposed differential geometric LARS (dgLARS), a differential geometric extension of LARS.
However, these methods have drawbacks. First, once a variable is identified by the forward selection, it is not removable from the list of selected variables. Hence, false positives are unavoidable without a systematic elimination procedure. Second, most of the existing works focus on variable selection and are silent with respect to estimation accuracy.
To address the first issue, several works have proposed adding backward elimination steps once forward selection is completed, as backward elimination may further remove false positives from the variables selected by forward selection. For example, Refs. [13,14] proposed stepwise selection for linear regression models in high-dimensional settings and proved model selection consistency; however, it is unclear whether the results hold for high-dimensional generalized linear models (GLMs). Ref. [15] proposed a similar stepwise algorithm in high-dimensional GLM settings, but without theoretical guarantees on model selection. Moreover, none of these works have touched upon the accuracy of estimation.
We extend a stepwise regression method to accommodate GLMs with high-dimensional predictors. Our method embraces both model selection and estimation. It starts with an empty model or pre-specified predictors, scans all features and sequentially selects features, and conducts backward elimination once forward selection is completed. Our proposal controls both false negatives and false positives in high dimensional settings: the forward selection steps recruit variables in an inclusive way by allowing some false positives for the sake of avoiding false negatives, while the backward selection steps eliminate the potential false positives from the recruited variables. We use different stopping criteria in the forward and backward selection steps, to control the numbers of false positives and false negatives. Moreover, we prove that, under a sparsity assumption of the true model, the proposed approach can discover all of the relevant predictors within a finite number of steps, and the estimated coefficients are consistent, a property still unknown to the literature. Finally, our GLM framework enables our work to accommodate a wide range of data types, such as binary, categorical and count data.
To recap, our proposed method distinguishes itself from the existing stepwise approaches in high dimensional settings. For example, it improves upon [13,14] by extending their work to the broader GLM setting, and upon [15] by establishing theoretical properties.
Compared with the other variable selection and screening works, our method produces a final model in ultrahigh dimensional settings, without applying a pre-screening step which may produce unintended false negatives. Under some regularity conditions, the method identifies or includes the true model with probability going to 1. Moreover, unlike the penalized approaches such as LASSO, the coefficients estimated by our stepwise selection procedure in the final model will be consistent, which are useful for gauging the real effect sizes of risk factors.

2. Method

Let $(\mathbf{X}_i, Y_i)$, $i = 1, \ldots, n$, denote $n$ independent and identically distributed (i.i.d.) copies of $(\mathbf{X}, Y)$. Here, $\mathbf{X} = (1, X_1, \ldots, X_p)^T$ is a $(p+1)$-dimensional predictor vector with $X_0 = 1$ corresponding to the intercept term, and $Y$ is an outcome. Suppose that the conditional density of $Y$, given $\mathbf{X}$, belongs to a linear exponential family:
$$\pi(Y \mid \mathbf{X}) = \exp\{Y\mathbf{X}^T\boldsymbol{\beta} - b(\mathbf{X}^T\boldsymbol{\beta}) + A(Y)\}, \qquad (1)$$
where $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^T$ is the vector of coefficients; $\beta_0$ is the intercept; and $A(\cdot)$ and $b(\cdot)$ are known functions. Model (1), with a canonical link function and a unit dispersion parameter, belongs to a larger exponential family [16]. Further, $b(\cdot)$ is assumed twice continuously differentiable with a non-negative second derivative $b''(\cdot)$. We use $\mu(\cdot)$ and $\sigma(\cdot)$ to denote $b'(\cdot)$ and $b''(\cdot)$, i.e., the mean and variance functions, respectively. For example, $b(\theta) = \log(1 + \exp(\theta))$ in a logistic model and $b(\theta) = \exp(\theta)$ in a Poisson model.
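To make these components concrete, here is a small R sketch (ours, not part of the paper's software) that writes out $b(\cdot)$, $\mu(\cdot) = b'(\cdot)$ and $\sigma(\cdot) = b''(\cdot)$ for the logistic and Poisson cases, together with the per-observation log-likelihood term used below:

```r
# The linear exponential family is driven by the cumulant function b(): its first
# derivative mu() = b'() is the mean function and its second derivative sigma2() = b''()
# is the variance function; L(u, y) = u * y - b(u) is the per-observation term of the
# log-likelihood l(beta) defined in the text.
make_family <- function(name = c("logistic", "poisson")) {
  name <- match.arg(name)
  if (name == "logistic")
    list(b      = function(theta) log(1 + exp(theta)),
         mu     = function(theta) plogis(theta),
         sigma2 = function(theta) plogis(theta) * (1 - plogis(theta)))
  else
    list(b      = function(theta) exp(theta),
         mu     = function(theta) exp(theta),
         sigma2 = function(theta) exp(theta))
}
loglik_term <- function(u, y, fam) u * y - fam$b(u)

fam <- make_family("logistic")
c(b = fam$b(0.5), mean = fam$mu(0.5), variance = fam$sigma2(0.5),
  L = loglik_term(0.5, y = 1, fam = fam))
```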
Let $L(u, v) = uv - b(u)$, and let $\mathbb{E}_n\{f(\xi)\} = n^{-1}\sum_{i=1}^n f(\xi_i)$ denote the sample mean of $\{f(\xi_i)\}_{i=1}^n$ for a sequence of i.i.d. random variables $\xi_i$ ($i = 1, \ldots, n$) and a non-random function $f(\cdot)$. Based on the i.i.d. observations, the log-likelihood function is
$$\ell(\boldsymbol{\beta}) = n^{-1}\sum_{i=1}^n L(\mathbf{X}_i^T\boldsymbol{\beta}, Y_i) = \mathbb{E}_n\{L(\mathbf{X}^T\boldsymbol{\beta}, Y)\}.$$
We use $\boldsymbol{\beta}^* = (\beta_{*0}, \beta_{*1}, \ldots, \beta_{*p})^T$ to denote the true value of $\boldsymbol{\beta}$. Then the true model is $\mathcal{M} = \{j : \beta_{*j} \neq 0, j \ge 1\} \cup \{0\}$, which consists of the intercept and all variables with nonzero effects. Overarching goals of ultrahigh dimensional data analysis are to identify $\mathcal{M}$ and to estimate $\beta_{*j}$ for $j \in \mathcal{M}$. While most of the relevant literature [8,9] is on estimating $\mathcal{M}$, this work accomplishes both the identification of $\mathcal{M}$ and the estimation of $\beta_{*j}$.
When $p$ is of an exponential order of $n$, we aim to generate a predictive model that contains the true model with high probability and provides consistent estimates of the regression coefficients. We further introduce the following notation. For a generic index set $S \subseteq \{0, 1, \ldots, p\}$ and a $(p+1)$-dimensional vector $\mathbf{A}$, we use $S^c$ to denote the complement of $S$ and $\mathbf{A}_S = \{A_j : j \in S\}$ to denote the subvector of $\mathbf{A}$ corresponding to $S$. For instance, if $S = \{2, 3, 4\}$, then $\mathbf{X}_{iS} = (X_{i2}, X_{i3}, X_{i4})^T$. Moreover, denote by $\ell_S(\boldsymbol{\beta}_S) = \mathbb{E}_n\{L(\mathbf{X}_S^T\boldsymbol{\beta}_S, Y)\}$ the log-likelihood of the regression model of $Y$ on $\mathbf{X}_S$, and denote by $\hat{\boldsymbol{\beta}}_S$ the maximizer of $\ell_S(\boldsymbol{\beta}_S)$. Under model (1), we elaborate on the idea of stepwise selection (details in the Supplementary Materials), consisting of the forward and backward stages.
Forward stage: We start with $F_0$, a set of variables that need to be included according to a priori knowledge, such as clinically important factors and conditions. If no such information is available, $F_0$ is set to be $\{0\}$, corresponding to a null model. We sequentially add covariates as follows:
$$F_0 \subseteq F_1 \subseteq F_2 \subseteq \cdots \subseteq F_k \subseteq \cdots,$$
where $F_k \subseteq \{0, 1, \ldots, p\}$ is the index set of the selected covariates upon completion of the $k$th step, with $k \ge 0$. At the $(k+1)$th step, we append new variables to $F_k$ one at a time and refit the GLMs: for every $j \in F_k^c$, we let $F_{k,j} = F_k \cup \{j\}$, obtain $\hat{\boldsymbol{\beta}}_{F_{k,j}}$ by maximizing $\ell_{F_{k,j}}(\boldsymbol{\beta}_{F_{k,j}})$, and compute the increment of the log-likelihood,
$$\ell_{F_{k,j}}(\hat{\boldsymbol{\beta}}_{F_{k,j}}) - \ell_{F_k}(\hat{\boldsymbol{\beta}}_{F_k}).$$
Then the index of the new candidate variable is determined to be
$$j_{k+1} = \arg\max_{j \in F_k^c}\big\{\ell_{F_{k,j}}(\hat{\boldsymbol{\beta}}_{F_{k,j}}) - \ell_{F_k}(\hat{\boldsymbol{\beta}}_{F_k})\big\}.$$
Additionally, we update $F_{k+1} = F_k \cup \{j_{k+1}\}$. We then need to decide whether to stop at the $k$th step or move on to the $(k+1)$th step with $F_{k+1}$. To do so, we use the following EBIC criterion:
$$\mathrm{EBIC}(F_{k+1}) = -2\,\ell_{F_{k+1}}(\hat{\boldsymbol{\beta}}_{F_{k+1}}) + |F_{k+1}|\,n^{-1}(\log n + 2\eta_1\log p), \qquad (3)$$
where the second term is motivated by [17] and $|F|$ denotes the cardinality of a set $F$.
The forward selection stops if $\mathrm{EBIC}(F_{k+1}) > \mathrm{EBIC}(F_k)$. We denote the stopping step by $k^*$ and the set of variables selected so far by $F_{k^*}$.
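To make the forward stage concrete, here is a minimal R sketch of one possible implementation (our own illustration, not the authors' released STEPWISE package); glm() and logLik() perform the refitting, and the argument eta1 plays the role of $\eta_1$ in (3). Choosing the candidate with the smallest EBIC is equivalent to choosing the one with the largest log-likelihood increment, since the penalty is identical across candidates at a given step.

```r
# Minimal sketch of the forward stage.  At step k+1 every remaining predictor j is
# appended to the active set, the GLM is refitted, and the candidate with the smallest
# EBIC is kept; the stage stops once EBIC no longer decreases.
forward_stage <- function(y, X, family = gaussian(), eta1 = 0.5, F0 = integer(0)) {
  n <- length(y); p <- ncol(X)
  ebic <- function(set) {
    fit <- glm(y ~ ., data = data.frame(X[, set, drop = FALSE]), family = family)
    # EBIC(F) = -2 * l(beta_hat) + |F| * n^{-1} * (log n + 2 * eta1 * log p),
    # with l() the averaged log-likelihood used in the text.
    -2 * as.numeric(logLik(fit)) / n + length(set) * (log(n) + 2 * eta1 * log(p)) / n
  }
  active <- F0
  current <- if (length(active) > 0) ebic(active) else
    -2 * as.numeric(logLik(glm(y ~ 1, family = family))) / n
  repeat {
    candidates <- setdiff(seq_len(p), active)
    if (length(candidates) == 0) break
    scores <- vapply(candidates, function(j) ebic(c(active, j)), numeric(1))
    if (min(scores) > current) break          # EBIC(F_{k+1}) > EBIC(F_k): stop
    active <- c(active, candidates[which.min(scores)])
    current <- min(scores)
  }
  active                                      # indices of the selected columns of X
}
```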
Backward stage: Upon completion of the forward stage, backward elimination, starting with $B_0 = F_{k^*}$, sequentially drops covariates as follows:
$$B_0 \supseteq B_1 \supseteq B_2 \supseteq \cdots \supseteq B_k \supseteq \cdots,$$
where $B_k$ is the index set of the remaining covariates upon completion of the $k$th step of the backward stage, with $k \ge 0$. At the $(k+1)$th backward step and for every $j \in B_k$, we let $B_{k/j} = B_k \setminus \{j\}$, obtain $\hat{\boldsymbol{\beta}}_{B_{k/j}}$ by maximizing $\ell_{B_{k/j}}(\boldsymbol{\beta}_{B_{k/j}})$, and compute the difference of the log-likelihoods between these two nested models:
$$\ell_{B_k}(\hat{\boldsymbol{\beta}}_{B_k}) - \ell_{B_{k/j}}(\hat{\boldsymbol{\beta}}_{B_{k/j}}).$$
The variable that can be removed from the current set of variables is indexed by
$$j_{k+1} = \arg\min_{j \in B_k}\big\{\ell_{B_k}(\hat{\boldsymbol{\beta}}_{B_k}) - \ell_{B_{k/j}}(\hat{\boldsymbol{\beta}}_{B_{k/j}})\big\}.$$
Let $B_{k+1} = B_k \setminus \{j_{k+1}\}$. We determine whether to stop at the $k$th step or move on to the $(k+1)$th step of the backward stage according to the following BIC criterion:
$$\mathrm{BIC}(B_{k+1}) = -2\,\ell_{B_{k+1}}(\hat{\boldsymbol{\beta}}_{B_{k+1}}) + \eta_2\,n^{-1}|B_{k+1}|\log n. \qquad (4)$$
If $\mathrm{BIC}(B_{k+1}) > \mathrm{BIC}(B_k)$, we end the backward stage at the $k$th step. Let $k^{**}$ denote the stopping step, and we declare the selected model $B_{k^{**}}$ to be the final model. Thus, $\hat{\mathcal{M}} = B_{k^{**}}$ is the estimate of $\mathcal{M}$. As the backward stage starts with the $k^*$ variables selected by forward selection, $k^{**}$ cannot exceed $k^*$.
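The backward stage mirrors the forward one with the BIC criterion (4); a matching sketch under the same assumptions (hypothetical function and argument names):

```r
# Minimal sketch of the backward stage.  Starting from the forward-selected set, the
# variable whose removal costs the least log-likelihood (equivalently, yields the
# smallest BIC) is dropped; the stage stops once BIC no longer decreases.
# With eta2 = 0 nothing is ever dropped, recovering plain forward selection.
backward_stage <- function(y, X, active, family = gaussian(), eta2 = 2) {
  n <- length(y)
  bic <- function(set) {
    fit <- if (length(set) > 0)
      glm(y ~ ., data = data.frame(X[, set, drop = FALSE]), family = family)
    else glm(y ~ 1, family = family)
    -2 * as.numeric(logLik(fit)) / n + eta2 * length(set) * log(n) / n
  }
  current <- bic(active)
  while (length(active) > 0) {
    scores <- vapply(active, function(j) bic(setdiff(active, j)), numeric(1))
    if (min(scores) > current) break          # BIC(B_{k+1}) > BIC(B_k): stop
    current <- min(scores)
    active <- setdiff(active, active[which.min(scores)])
  }
  active                                      # the final model M_hat
}

# Usage: m_hat <- backward_stage(y, X, forward_stage(y, X, eta1 = 0.5), eta2 = 2)
```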
A strength of our algorithm, termed STEPWISE hereafter, is the added flexibility that $\eta_1$ and $\eta_2$ provide in the stopping criteria for controlling false negatives and false positives. For example, a value of $\eta_1$ close to zero in the forward stage will likely include more variables, and thus incur more false positives but fewer false negatives, whereas a large value of $\eta_1$ will recruit too few variables and cause too many false negatives. Similarly, in the backward stage, a large $\eta_2$ eliminates more variables and therefore further reduces false positives, and vice versa for a small $\eta_2$. While finding the optimal $\eta_1$ and $\eta_2$ is not trivial, our numerical experience suggests that a small $\eta_1$ and a large $\eta_2$ balance false negatives and false positives well. When $\eta_2 = 0$, no variables can be dropped after forward selection; hence, our proposal includes forward selection as a special case.
Moreover, [8] proposed a sequentially conditioning approach based on offset terms that absorb the prior information. However, our numerical experiments indicate that the offset approach may be suboptimal compared to our full stepwise optimization approach, which will be demonstrated in the simulation studies.

3. Theoretical Properties

For a column vector $\mathbf{v}$, let $\|\mathbf{v}\|_q$ denote its $L_q$-norm for any $q \ge 1$. For simplicity, we denote the $L_2$-norm of $\mathbf{v}$ by $\|\mathbf{v}\|$, and denote $\mathbf{v}\mathbf{v}^T$ by $\mathbf{v}^{\otimes 2}$. We use $C_1, C_2, \ldots$ to denote generic constants that do not depend on $n$ and may change from line to line. The following regularity conditions are assumed.
(1) There exist a positive integer $q$ satisfying $|\mathcal{M}| \le q$ and $q\log p = o(n^{1/3})$, and a constant $K > 0$ such that $\sup_{|S| \le q}\|\boldsymbol{\beta}_S^*\|_1 \le K$, where $\boldsymbol{\beta}_S^* = \arg\max_{\boldsymbol{\beta}_S}\mathbb{E}\,\ell_S(\boldsymbol{\beta}_S)$ is termed the least false value of model $S$.
(2) $\|\mathbf{X}\|_\infty \le K$. In addition, $\mathbb{E}(X_j) = 0$ and $\mathbb{E}(X_j^2) = 1$ for $j \ge 1$.
(3) Let $\epsilon_i = Y_i - \mu(\boldsymbol{\beta}^{*T}\mathbf{X}_i)$. There exists a positive constant $M$ such that the Cramér condition holds, i.e., $\mathbb{E}[|\epsilon_i|^m] \le m!\,M^m$ for all $m \ge 1$.
(4) $|\sigma(a) - \sigma(b)| \le K|a - b|$, and $\sigma_{\min} := \inf_{|t| \le K^3}|b''(t)|$ is bounded away from zero.
(5) There exist two positive constants $\kappa_{\min}$ and $\kappa_{\max}$ such that $0 < \kappa_{\min} < \Lambda\big(\mathbb{E}\,\mathbf{X}_S^{\otimes 2}\big) < \kappa_{\max} < \infty$, uniformly in $S \subseteq \{0, 1, \ldots, p\}$ satisfying $|S| \le q$, where $\Lambda(\mathbf{A})$ is the collection of all eigenvalues of a square matrix $\mathbf{A}$.
(6) $\min_{S:\,\mathcal{M}\not\subseteq S,\,|S| \le q} D_S > Cn^{-\alpha}$ for some constants $C > 0$ and $\alpha > 0$ satisfying $qn^{-1+4\alpha}\log p \to 0$, where $D_S = \max_{j \in S^c\cap\mathcal{M}}\big|\mathbb{E}\big[\{\mu(\boldsymbol{\beta}^{*T}\mathbf{X}) - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_j\big]\big|$.
Condition (1), as assumed in [8,18], is an alternative to the Lipschitz assumption [5,19]. The bound $q$ on the model size allowed in the selection procedure is often required in model-based screening methods; see, e.g., [8,20,21,22]. The bound should be large enough that the correct model can be included, but not too large; otherwise, excessive noise variables would be included, leading to unstable and inconsistent estimates. Indeed, Conditions (1) and (6) reveal that the range of $q$ depends on the true model size $|\mathcal{M}|$, the minimum signal strength $n^{-\alpha}$ and the total number of covariates $p$. The upper bound of $q$ is $o\big((n^{1-4\alpha}/\log p)\wedge(n^{1/3}/\log p)\big)$, ensuring the consistency of EBIC [17]. Condition (1) also implies that the parameter space under consideration can be restricted to $B := \{\boldsymbol{\beta}\in\mathbb{R}^{p+1}:\|\boldsymbol{\beta}\|_1\le K^2\}$ for any model $S$ with $|S|\le q$. Condition (2), as assumed in [23,24], reflects that data are often standardized at the pre-processing stage. Condition (3) ensures that $Y$ has a light tail, and is satisfied by Gaussian and discrete data, such as binary and count data [25]. Condition (4) is satisfied by common GLMs, such as the Gaussian, binomial, Poisson and gamma models. Condition (5) represents the sparse Riesz condition [26], and Condition (6) is a strong "irrepresentable" condition, stipulating that $\mathcal{M}$ cannot be represented by a set of variables that does not include the true model. It further implies that adding a signal variable to a mis-specified model will increase the log-likelihood by a certain lower bound [8]. The signal rate is comparable to the conditions required by other sequential methods; see, e.g., [7,22].
Theorem 1 develops a lower bound of the increment of the log-likelihood if the true model M is not yet included in a selected model S.
Theorem 1.
Suppose Conditions (1)–(6) hold. There exists some constant $C_1$ such that, with probability at least $1 - 6\exp(-6q\log p)$,
$$\min_{S:\,\mathcal{M}\not\subseteq S,\,|S|<q}\;\max_{j\in S^c}\big\{\ell_{S\cup\{j\}}(\hat{\boldsymbol{\beta}}_{S\cup\{j\}}) - \ell_S(\hat{\boldsymbol{\beta}}_S)\big\} \ge C_1 n^{-2\alpha}.$$
Theorem 1 shows that, before the true model is included in the selected model, we can append a variable that increases the log-likelihood by at least $C_1 n^{-2\alpha}$ with probability tending to 1. This ensures that, in the forward stage, our proposed STEPWISE approach keeps searching for signal variables until the true model is contained. To see this, suppose at the $k$th step of the forward stage that $F_k$ satisfies $\mathcal{M}\not\subseteq F_k$ and $|F_k|<q$, and let $r$ be the index selected by STEPWISE. By Theorem 1, we obtain that, for any $\eta_1>0$, when $n$ is sufficiently large,
$$\begin{aligned}
\mathrm{EBIC}(F_{k,r}) - \mathrm{EBIC}(F_k)
&= -2\ell_{F_{k,r}}(\hat{\boldsymbol{\beta}}_{F_{k,r}}) + (|F_k|+1)\,n^{-1}(\log n + 2\eta_1\log p) + 2\ell_{F_k}(\hat{\boldsymbol{\beta}}_{F_k}) - |F_k|\,n^{-1}(\log n + 2\eta_1\log p)\\
&\le -2C_1 n^{-2\alpha} + n^{-1}(\log n + 2\eta_1\log p) < 0,
\end{aligned}$$
with probability at least $1 - 6\exp(-6q\log p)$, where the last inequality is due to Condition (6). Therefore, with high probability, the forward stage of STEPWISE continues as long as $\mathcal{M}\not\subseteq F_k$ and $|F_k|<q$. We next establish an upper bound on the number of steps in the forward stage needed to include the true model.
Theorem 2.
Under the same conditions as in Theorem 1 and if
$$\max_{S:\,|S|\le q}\;\max_{j\in S^c\cap\mathcal{M}^c}\big|\mathbb{E}\big[\{Y - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_j\big]\big| = o(n^{-\alpha}),$$
then there exists some constant $C_2>2$ such that $\mathcal{M}\subseteq F_k$ for some $F_k$ in the forward stage of STEPWISE with $k\le C_2|\mathcal{M}|$, with probability at least $1 - 18\exp(-4q\log p)$.
The "max" condition, as assumed in Section 5.3 of [27], relaxes the partial orthogonality assumption that $\mathbf{X}_{\mathcal{M}^c}$ is independent of $\mathbf{X}_{\mathcal{M}}$, and ensures that, with probability tending to 1, appending a signal variable increases the log-likelihood more than adding a noise variable does, uniformly over all possible models $S$ satisfying $\mathcal{M}\not\subseteq S$ and $|S|<q$. This entails that the proposed procedure is much more likely to select a signal variable, in lieu of a noise variable, at each step. Since EBIC is a consistent model selection criterion [28,29], the following theorem guarantees termination of the proposed procedure with $\mathcal{M}\subseteq F_k$ for some $k$.
Theorem 3.
Under the same conditions as in Theorem 2, if $\mathcal{M}\not\subseteq F_{k-1}$ and $\mathcal{M}\subseteq F_k$, the forward stage stops at the $k$th step with probability at least $1 - \exp(-3q\log p)$.
Theorem 3 ensures that the forward stage of STEPWISE will stop within a finite number of steps and will cover the true model with probability at least $1 - q\exp(-3q\log p) \ge 1 - \exp(-2q\log p)$. We next consider the backward stage and provide a probability bound for removing a signal variable from a set that contains the set of true signals $\mathcal{M}$.
Theorem 4.
Under the same conditions as in Theorem 2, $\mathrm{BIC}(S\setminus\{r\}) - \mathrm{BIC}(S) > 0$ uniformly over $r\in\mathcal{M}$ and $S$ satisfying $\mathcal{M}\subseteq S$ and $|S|\le q$, with probability at least $1 - 6\exp(-6q\log p)$.
Theorem 4 indicates that, with probability at least $1 - 6\exp(-6q\log p)$, the BIC would increase when a signal variable is removed from a model that contains the true model. That is, with high probability, backward elimination only removes false positives.
Recall that $F_{k^*}$ denotes the model selected at the end of the forward stage. By Theorem 2, $\mathcal{M}\subseteq F_{k^*}$ with probability at least $1 - 18\exp(-4q\log p)$. Theorem 4 then implies that, at each step of the backward stage, a signal variable will not be removed from the model with probability at least $1 - 6\exp(-6q\log p)$. By Theorem 2, $|F_{k^*}|\le C_2|\mathcal{M}|$; thus, the backward elimination carries out at most $(C_2-1)|\mathcal{M}|$ steps. Combining the results of Theorems 2 and 4 yields that $\mathcal{M}\subseteq\hat{\mathcal{M}}$ with probability at least $1 - 18\exp(-4q\log p) - 6(C_2-1)|\mathcal{M}|\exp(-6q\log p)$. Let $\hat{\boldsymbol{\beta}}$ be the estimate of $\boldsymbol{\beta}^*$ in model (1) at the termination of STEPWISE. By convention, the estimates of the coefficients of the unselected covariates are 0.
Theorem 5.
Under the same conditions as in Theorem 2, we have that $\mathcal{M}\subseteq\hat{\mathcal{M}}$ and
$$\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^*\| \to 0$$
in probability.
The theorem warrants that the proposed STEPWISE yields consistent estimates, a property not shared by many regularized methods, including LASSO; our simulations in Section 4 verify this. Proofs of the main theorems and lemmas are provided in Appendix A.

4. Simulation Studies

We compared the proposal with other competing methods, including the penalized methods, such as the least absolute shrinkage and selection operator (LASSO); the differential geometric least angle regression (dgLARS) [11,12]; the forward regression (FR) approach [7]; the sequentially conditioning (SC) approach [8]; and the screening methods, such as sure independence screening (SIS) [5], which is popular in practice. As SIS does not directly generate a predictive model, we applied LASSO to the top $[n/\log(n)]$ variables chosen by SIS and denoted the procedure by SIS+LASSO. As the FR, SC and STEPWISE approaches all involve forward searching, to make them comparable we applied the same stopping rule, namely Equation (3) with the same $\gamma$, to their forward steps. In particular, the STEPWISE approach with $\eta_1 = \gamma$ and $\eta_2 = 0$ is equivalent to FR and asymptotically equivalent to SC. By varying $\gamma$ in FR and SC between $\gamma_L$ and $\gamma_H$, we explored the impact of $\gamma$ on inducing false positives and negatives. In our numerical studies, we fixed $\gamma_H = 10$ and set $\gamma_L = \eta_1$. To choose $\eta_1$ and $\eta_2$ in (3) and (4) for STEPWISE, we performed 5-fold cross-validation to minimize the mean squared prediction error (MSPE), and reported the results in Table 1. Since the proposed STEPWISE algorithm uses the (E)BIC criterion, for a fair comparison we chose the tuning parameter in dgLARS by using the BIC criterion as well, and denoted the corresponding approach by dgLARS(BIC). The regularization parameter in LASSO was chosen via the following two approaches: (1) the smallest BIC among the models on the LASSO path, denoted by LASSO(BIC); and (2) the one-standard-error rule, denoted by LASSO(1SE), which chooses the most parsimonious model whose error is no more than one standard error above the error of the best model in cross-validation [30].
Denote by $\mathbf{X}_i = (X_{i1}, \ldots, X_{ip})^T$ and $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$ the covariate vector for subject $i$ ($i = 1, \ldots, n$) and the true coefficient vector, respectively. The following five examples generated $\mathbf{X}_i^T\boldsymbol{\beta}$, the inner product of the coefficient and covariate vectors for each individual, which was then used to generate outcomes from the normal, binomial and Poisson models.
Example 1.
For each $i$,
$$c\,\mathbf{X}_i^T\boldsymbol{\beta} = c\times\Big(\sum_{j=1}^{p_0}\beta_jX_{ij} + \sum_{j=p_0+1}^{p}\beta_jX_{ij}\Big), \quad i = 1, \ldots, n,$$
where $\beta_j = (-1)^{B_j}(4\log n/\sqrt{n} + |Z_j|)$ for $j = 1, \ldots, p_0$ and $\beta_j = 0$ otherwise; $B_j$ was a binary random variable with $P(B_j = 1) = 0.4$ and $Z_j$ was generated from a standard normal distribution; $p_0 = 8$; the $X_{ij}$'s were independently generated from a standardized exponential distribution, that is, $\exp(1) - 1$. Here and in the other examples, $c$ (specified later) controls the signal strength.
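For concreteness, here is a minimal R sketch of the Example 1 design as we read the description above (in particular, we read the signal size as $4\log n/\sqrt{n} + |Z_j|$, as is common in the screening literature):

```r
# Sketch of the Example 1 design: p0 = 8 active coefficients and covariates drawn from
# a standardized exponential distribution, i.e., Exp(1) - 1.
set.seed(1)
n <- 400; p <- 1000; p0 <- 8
Bj   <- rbinom(p0, 1, 0.4)
Zj   <- rnorm(p0)
beta <- c((-1)^Bj * (4 * log(n) / sqrt(n) + abs(Zj)), rep(0, p - p0))
X    <- matrix(rexp(n * p) - 1, nrow = n, ncol = p)  # standardized exponential entries
eta  <- as.vector(X %*% beta)                        # X_i^T beta, one value per subject
```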
Example 2.
This scenario is the same as Example 1, except that the $X_{ij}$'s were independently generated from a standard normal distribution.
Example 3.
For each $i$,
$$c\,\mathbf{X}_i^T\boldsymbol{\beta} = c\times\Big(\sum_{j=1}^{p_0}\beta_jX_{ij} + \sum_{j=p_0+1}^{p}\beta_jX_{ij}^*\Big), \quad i = 1, \ldots, n,$$
where $\beta_j = 2j$ for $1\le j\le p_0$ and $p_0 = 5$. We simulated every component of $\mathbf{Z}_i = (Z_{ij})\in\mathbb{R}^p$ and $\mathbf{W}_i = (W_{ij})\in\mathbb{R}^p$ independently from a standard normal distribution. Next, we generated $\mathbf{X}_i$ according to $X_{ij} = (Z_{ij} + W_{ij})/2$ for $1\le j\le p_0$ and $X_{ij}^* = (Z_{ij} + \sum_{j'=1}^{p_0}Z_{ij'})/2$ for $p_0 < j\le p$.
Example 4.
For each $i$,
$$c\,\mathbf{X}_i^T\boldsymbol{\beta} = c\times\Big(\sum_{j=1}^{500}\beta_jX_{ij} + \sum_{j=501}^{p}\beta_jX_{ij}\Big), \quad i = 1, \ldots, n,$$
where the first 500 $X_{ij}$'s were generated from the multivariate normal distribution with mean 0 and a covariance matrix with all diagonal entries equal to 1 and $\mathrm{cov}(X_{ij}, X_{ij'}) = 0.5^{|j-j'|}$ for $1\le j, j'\le 500$. The remaining $p - 500$ $X_{ij}$'s were generated through autoregressive processes with $X_{i,501}\sim\mathrm{Unif}(-2, 2)$ and $X_{ij} = 0.5X_{i,j-1} + 0.5X_{ij}^*$ for $j = 502, \ldots, p$, where the $X_{ij}^*\sim\mathrm{Unif}(-2, 2)$ were generated independently. The coefficients $\beta_j$ for $j = 1, \ldots, 7, 501, \ldots, 507$ were generated from $(-1)^{B_j}(4\log n/\sqrt{n} + |Z_j|)$, where $B_j$ was a binary random variable with $P(B_j = 1) = 0.4$ and $Z_j$ was drawn from a standard normal distribution. The remaining $\beta_j$'s were zero.
Example 5.
For each $i$,
$$c\,\mathbf{X}_i^T\boldsymbol{\beta} = c\times\big(0.5X_{i1} + X_{i2} + 0.5X_{i,100}\big), \quad i = 1, \ldots, n,$$
where $\mathbf{X}_i$ was generated from a multivariate normal distribution with mean 0 and a covariance matrix with all diagonal entries equal to 1 and $\mathrm{cov}(X_{ij}, X_{ij'}) = 0.9^{|j-j'|}$ for $1\le j, j'\le p$. All of the coefficients were zero except for those of $X_{i1}$, $X_{i2}$ and $X_{i,100}$.
Examples 1 and 3 were adopted from [7], while Examples 2 and 4 were borrowed from [5,15], respectively. We then generated the responses from the following three models.
Normal model: $Y_i = c\,\mathbf{X}_i^T\boldsymbol{\beta} + \epsilon_i$ with $\epsilon_i\sim N(0, 1)$.
Binomial model: $Y_i\sim\mathrm{Bernoulli}\big(\exp(c\,\mathbf{X}_i^T\boldsymbol{\beta})/\{1 + \exp(c\,\mathbf{X}_i^T\boldsymbol{\beta})\}\big)$.
Poisson model: $Y_i\sim\mathrm{Poisson}\big(\exp(c\,\mathbf{X}_i^T\boldsymbol{\beta})\big)$.
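Given the linear predictor, the three response models can be simulated directly; a short R sketch using the Example 1 multipliers listed below and the eta generated in the earlier sketch:

```r
# Generate responses from the three outcome models given the linear predictor eta,
# with the signal multiplier 'mult' playing the role of c.
simulate_response <- function(eta, model = c("normal", "binomial", "poisson"), mult = 1) {
  model <- match.arg(model)
  lp <- mult * eta
  switch(model,
         normal   = lp + rnorm(length(lp)),
         binomial = rbinom(length(lp), 1, plogis(lp)),
         poisson  = rpois(length(lp), exp(lp)))
}

# Example 1 multipliers (normal, binomial, Poisson) = (1, 1, 0.3):
y_normal  <- simulate_response(eta, "normal",   mult = 1)
y_binary  <- simulate_response(eta, "binomial", mult = 1)
y_poisson <- simulate_response(eta, "poisson",  mult = 0.3)
```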
We considered n = 400 and p = 1000 throughout all of the examples. We specified the magnitude of the coefficients in the GLMs with a constant multiplier, c. For Examples 1–5, this constant was set, respectively for the normal, binomial and Poisson models, to be: (1, 1, 0.3), (1, 1.5, 0.3), (1, 1, 0.1), (1, 1.5, 0.3) and (1, 3, 2). For each parameter configuration, we simulated 500 independent data sets. We evaluated the performances of the methods by the criteria of true positives (TP), false positives (FP), the estimated probability of including the true models (PIT), the mean squared error (MSE) of β ^ and the mean squared prediction error (MSPE). To compute the MSPE, we randomly partitioned the samples into the training (75%) and testing (25%) sets. The models obtained from the training datasets were used to predict the responses in the testing datasets. Table 2, Table 3 and Table 4 report the average TP, FP, PIT, MSE and MSPE over 500 datasets along with the standard deviations. The findings are summarized below.
First, the proposed STEPWISE method was able to detect all of the true signals with nearly zero FPs. Specifically, in all of the examples, STEPWISE outperformed the other methods by detecting more TPs with fewer FPs, whereas LASSO, SIS+LASSO and dgLARS included many more FPs.
Second, though a smaller γ in FR and SC led to the inclusion of all TPs with a PIT close to 1, it incurred more FPs. On the other hand, a larger γ may eliminate some TPs, resulting in a smaller PIT and a larger MSPE.
Third, for the normal model, the STEPWISE method yielded an MSE close to 0, the smallest among all the competing methods. The binary and Poisson data challenged all of the methods, and the MSEs for all the methods were non-negligible. However, the STEPWISE method still produced the lowest MSE. The results seemed to verify the consistency of β ^ , which distinguished the proposed STEPWISE method from the other regularized methods and highlighted its ability to provide a more accurate means to characterize the effects of high dimensional predictors.

5. Real Data Analysis

5.1. A Study of Gene Regulation in the Mammalian Eye

To demonstrate the utility of our proposed method, we analyzed a microarray dataset from [35] with 120 twelve-week-old male rats selected for eye tissue harvesting. The dataset contained 31,042 different probe sets (Affymetrix GeneChip Rat Genome 230 2.0 Array); see [35] for a more detailed description of the data.
Although our method was applicable to the original 31,042 probe sets, many probes turned out to have very small variances and were unlikely to be informative for correlative analyses. Therefore, using the variance as the screening criterion, we selected the 5000 genes with the largest variances in expression and correlated them with the gene TRIM32, which has been found to cause Bardet–Biedl syndrome, a genetically heterogeneous disease of multiple organ systems including the retina [36].
We applied the proposed STEPWISE method to the dataset with $n = 120$ and $p = 5000$, treating the TRIM32 gene expression as the response variable and the expressions of the 5000 genes as the predictors. With no prior biological information available, we started with the empty set. To choose $\eta_1$ and $\eta_2$, we carried out 5-fold cross-validation to minimize the mean squared prediction error (MSPE) over the grid $\eta_1\in\{0, 0.25, 0.5, 1\}$ and $\eta_2\in\{1, 2, 3, 4, 5\}$, which yielded $\eta_1 = 1$ and $\eta_2 = 4$. We performed the same procedure to choose $\gamma$ for FR and SC. The regularization parameters in LASSO and dgLARS were selected to minimize their BIC values.
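As an illustration of this tuning step, the following R sketch performs the 5-fold cross-validated grid search over $(\eta_1, \eta_2)$; the helper stepwise_fit() simply chains the forward and backward sketches from Section 2 and, like every name in this block, is our own and not part of the released STEPWISE package.

```r
# Sketch of the 5-fold cross-validated grid search over (eta1, eta2) minimizing the MSPE.
stepwise_fit <- function(y, X, eta1, eta2, family = gaussian())
  backward_stage(y, X, forward_stage(y, X, family = family, eta1 = eta1),
                 family = family, eta2 = eta2)

cv_tune <- function(y, X, eta1_grid, eta2_grid, K = 5, family = gaussian()) {
  n <- length(y)
  folds <- sample(rep(seq_len(K), length.out = n))
  grid <- expand.grid(eta1 = eta1_grid, eta2 = eta2_grid)
  grid$mspe <- apply(grid, 1, function(par) {
    mean(sapply(seq_len(K), function(k) {
      train <- folds != k
      sel <- stepwise_fit(y[train], X[train, , drop = FALSE],
                          eta1 = par["eta1"], eta2 = par["eta2"], family = family)
      # refit on the selected set and predict the held-out fold
      # (assumes at least one variable is selected in each fold)
      df_tr <- data.frame(X[train, sel, drop = FALSE]); df_tr$y <- y[train]
      fit <- glm(y ~ ., data = df_tr, family = family)
      pred <- predict(fit, newdata = data.frame(X[!train, sel, drop = FALSE]),
                      type = "response")
      mean((y[!train] - pred)^2)
    }))
  })
  grid[which.min(grid$mspe), c("eta1", "eta2")]
}

# e.g., cv_tune(y, X, eta1_grid = c(0, 0.25, 0.5, 1), eta2_grid = 1:5)
```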
In the forward stage, STEPWISE selected probes 1376747_at, 1381902_at, 1382673_at and 1375577_at, and the backward stage eliminated probe 1375577_at. The STEPWISE procedure produced the following final predictive model:
TRIM32 = 4.6208 + 0.2310 × (1376747_at) + 0.1914 × (1381902_at) + 0.1263 × (1382673_at).
Table A1 in Appendix B presents the numbers of overlapping genes among the competing methods. It shows that two of the three probes selected by our method, 1381902_at and 1376747_at, were also discovered by the other methods, except for dgLARS.
Next, we performed Leave-One-Out Cross-Validation (LOOCV) to obtain the distribution of the model size (MS) and MSPE for the competing methods.
As reported in Table 5 and Figure 1, LASSO, SIS+LASSO and dgLARS tended to select more variables than the forward approaches and STEPWISE. Among all of the methods, STEPWISE selected the fewest variables but with almost the same MSPE as the other methods.

5.2. An Esophageal Squamous Cell Carcinoma Study

Esophageal squamous cell carcinoma (ESCC), the most common histological type of esophageal cancer, is known to be associated with poor overall survival, making early diagnosis crucial for treatment and disease management [37]. Several studies have investigated the roles of circulating microRNAs (miRNAs) in diagnosis of ESCC [38].
Using a clinical study that investigated the roles of miRNAs on the ESCC [39], we aimed to use miRNAs to predict ESCC risks and estimate their impacts on the development of ESCC. Specifically, with a dataset of serum profiling of 2565 miRNAs from 566 ESCC patients and 4965 controls without cancer, we demonstrated the utility of the proposed STEPWISE method in predicting ESCC with miRNAs.
To proceed, we used a balanced sampling scheme (283 cases and 283 controls) in the training dataset. The design of an equal number of cases and controls in the training set has proved useful [39] for handling imbalanced outcomes, as encountered here. To validate our findings, samples were randomly divided into a training set ($n_1 = 566$, $p = 2565$) and a testing set ($n_2 = 4965$, $p = 2565$).
The training set consisted of 283 patients with ESCC (median age of 65 years, 79% male) and 283 control patients (median age of 68 years, 46.3% male), and the testing set consisted of 283 patients with ESCC (median age of 67 years, 85.7% male) and 4682 control patients (median age of 67.5 years, 44.5% male). Control patients without ESCC came from three sources: 323 individuals from National Cancer Center Biobank (NCCB); 2670 individuals from the Biobank of the National Center for Geriatrics and Gerontology (NCGG); and 1972 individuals from Minoru Clinic (MC). More detailed characteristics of cases and controls in the training and testing sets are given in Table 6.
We defined the binary outcome variable to be 1 if the subject was a case and 0 otherwise. As age and gender (0 = female, 1 = male) are important risk factors for ESCC [40,41] and it is common to adjust for them in clinical models, we set the initial set in STEPWISE to be $F_0$ = {age, gender}. With $\eta_1 = 0$ and $\eta_2 = 3.5$, also chosen by 5-fold CV, our procedure recruited three miRNAs: miR-4783-3p, miR-320b, miR-1225-3p and miR-6789-5p were selected from among the 2565 miRNAs by the forward stage on the training set, and the backward stage then eliminated miR-6789-5p.
In comparison, with $\gamma = 0$, both FR and SC selected four miRNAs: miR-4783-3p, miR-320b, miR-1225-3p and miR-6789-5p. The miRNAs selected by the different methods are listed in Table A2 in Appendix B.
Our findings were biologically meaningful, as the selected miRNAs had been identified by other cancer studies as well. Specifically, miR-320b was found to promote colorectal cancer proliferation and invasion by competing with its homologous miR-320a [42]. In addition, serum levels of miR-320 family members were associated with clinical parameters and diagnosis in prostate cancer patients [43]. Reference [44] showed that miR-4783-3p was one of the miRNAs that could increase the risk of colorectal cancer death among rectal cancer cases. Finally, miR-1225-5p inhibited proliferation and metastasis of gastric carcinoma through repressing insulin receptor substrate-1 and activation of β -catenin signaling [45].
Aiming to identify a final model without resorting to a pre-screening procedure that may miss out on important biomarkers, we applied STEPWISE to reach the following predictive model for ESCC based on patients’ demographics and miRNAs:
$\mathrm{logit}^{-1}\big({-35.70} + 1.41\times\text{miR-4783-3p} + 0.98\times\text{miR-320b} + 1.91\times\text{miR-1225-3p} + 0.10\times\text{Age} - 2.02\times\text{Gender}\big)$, where $\mathrm{logit}^{-1}(x) = \exp(x)/(1 + \exp(x))$.
In the testing dataset, the model had an area under the receiver operating characteristic curve (AUC) of 0.99 and achieved a high accuracy of 0.96, with a sensitivity and specificity of 0.97 and 0.95, respectively. Additionally, using the testing cohort, we evaluated the performance of the models sequentially selected by STEPWISE. Starting with a model containing age and gender, STEPWISE selected miR-4783-3p, miR-320b and miR-1225-3p in turn. Figure 2, showing the corresponding receiver operating characteristic (ROC) curves for these sequential models, revealed the improvement gained by sequentially adding predictors to the model and justified the importance of these variables in the final model. In addition, Figure 2e illustrated that adding the extra miRNA selected by FR and SC made little improvement to the model's predictive power.
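For completeness, the testing-set metrics reported here can be computed from predicted probabilities as in the following sketch (our own helper; the AUC is the usual Wilcoxon/Mann–Whitney estimate):

```r
# Evaluating a fitted risk model on a testing set: 'prob' holds predicted probabilities
# (e.g., plogis() of the linear predictor) and 'y_test' the 0/1 case labels.
classification_metrics <- function(prob, y_test, cutoff = 0.5) {
  pred <- as.integer(prob >= cutoff)
  tp <- sum(pred == 1 & y_test == 1); tn <- sum(pred == 0 & y_test == 0)
  fp <- sum(pred == 1 & y_test == 0); fn <- sum(pred == 0 & y_test == 1)
  # AUC as the Wilcoxon/Mann-Whitney statistic: the probability that a randomly chosen
  # case receives a higher predicted risk than a randomly chosen control.
  cases <- prob[y_test == 1]; controls <- prob[y_test == 0]
  auc <- mean(outer(cases, controls, ">") + 0.5 * outer(cases, controls, "=="))
  c(accuracy    = (tp + tn) / length(y_test),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    auc         = auc)
}
```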
Furthermore, we conducted a subgroup analysis within the testing cohort to study how the sensitivity of the final model differed by cancer stage, one of the most important risk factors. The sensitivities for stages 0 (i.e., non-invasive cancer; n = 27), 1 (n = 128), 2 (n = 57), 3 (n = 61) and 4 (n = 10) were 1.00, 0.98, 0.97, 0.97 and 1.00, respectively. We next evaluated how the specificity varied across controls from different data sources. The specificities for the various control groups, namely NCCB (n = 306), NCGG (n = 2512) and MC (n = 1864), were 0.99, 0.99 and 0.98, respectively. The results indicated the robust performance of the miRNA-based model across cancer stages and data sources.
Finally, to compare STEPWISE with the other competing methods, we repeatedly applied the aforementioned balanced sampling procedure and split the ESCC data into training and testing sets 100 times. We obtained the MSPE and the averages of the accuracy, sensitivity, specificity and AUC. Figure 3 reports the model size of each method. Though STEPWISE selected fewer variables than the other variable selection methods (for example, LASSO selected 11–31 variables and dgLARS selected 12–51 variables), it achieved comparable prediction accuracy, specificity, sensitivity and AUC (see Table 7), evidencing the utility of STEPWISE for generating parsimonious models while maintaining competitive predictability.
We used the R software [46] to obtain the numerical results in Section 4 and Section 5, with the following packages: ggplot2 [47], ncvreg [32], glmnet [31], dglars [34] and screening [33].

6. Discussion

We have proposed to apply STEPWISE to produce final models in ultrahigh dimensional settings, without resorting to a pre-screening step. We have shown that the method identifies or includes the true model with probability going to 1, and produces consistent coefficient estimates, which are useful for properly interpreting the actual impacts of risk factors. The theoretical properties of STEPWISE were established under mild conditions, which are worth discussing. As covariates are often standardized in practice, Condition (2) is assumed without loss of generality. Conditions (3) and (4) are generally satisfied by common GLMs, including the Gaussian, binomial, Poisson and gamma models. Condition (5) is also often satisfied in practice; Proposition 2 in [26] may be used as a tool to verify it. Conditions (1) and (6) involve the unknown true model size $|\mathcal{M}|$ and the minimum signal strength $n^{-\alpha}$, and are imposed in good faith in practice. The "irrepresentable" Condition (6) is strong and may not hold in some real datasets; see, e.g., [48,49]. However, the condition holds under some commonly used covariance structures, including the AR(1) and compound symmetry structures [48].
As shown in simulation studies and real data analyses, STEPWISE tends to generate models as predictive as the other well-known methods, with fewer variables (Figure 3). Parsimonious models are useful for biomedical studies as they explain data with a small number of important predictors, and offer practitioners a realistic list of biomarkers to investigate. With categorical outcome data frequently observed in biomedical studies (e.g., histology types of cancer), STEPWISE can be extended to accommodate multinomial classification, with more involved notation and computation. We will pursue this elsewhere.
There are several open questions. First, our final model was determined by using (E)BIC, which involves two extra parameters η 1 and η 2 . In our numerical experiments, we used cross-validation to choose them, which seemed to work well. However, more in-depth research is needed to find their optimal values to strike a balance between false positives and false negatives. Second, despite our consistent estimates, drawing inferences based on them remains challenging. Statistical inference, which accounts for uncertainty in estimation, is key for properly interpreting analysis results and drawing appropriate conclusions. Our asymptotic results, nevertheless, are a stepping stone toward this important problem.

Supplementary Materials

An R package, STEPWISE, was developed and is available at https://github.com/AlexPijyan/STEPWISE, along with the examples shown in the paper.

Author Contributions

Conceptualization, Q.Z., H.G.H. and Y.L.; Formal analysis, A.P.; Methodology, A.P., Q.Z., H.G.H. and Y.L.; Project administration, H.G.H.; Software, A.P.; Supervision, H.G.H.; Writing – original draft, Q.Z., H.G.H. and Y.L.; Writing – review & editing, H.G.H. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded partially by grants from NSF (DMS1915099, DMS1952486) and NIH (R01AG056764, U01CA209414, R03AG067611).

Acknowledgments

We are thankful to the Editor, the AE and two referees for insightful suggestions that helped improve the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs of Main Theorems

Since $b(\cdot)$ is twice continuously differentiable with a nonnegative second derivative $b''(\cdot)$, the quantities $b_{\max} := \max_{|t|\le K^3}|b(t)|$, $\mu_{\max} := \max_{|t|\le K^3}|b'(t)|$ and $\sigma_{\max} := \sup_{|t|\le K^3}|b''(t)|$ are bounded above, where $K$ is the constant from Conditions (1) and (2). Let $\mathbb{G}_n\{f(\xi)\} = n^{-1/2}\sum_{i=1}^n\big(f(\xi_i) - \mathbb{E}[f(\xi_i)]\big)$ for a sequence of i.i.d. random variables $\xi_i$ ($i = 1, \ldots, n$) and a non-random function $f(\cdot)$.
Given any $\boldsymbol{\beta}_S$, when a variable $X_r$, $r\in S^c$, is added to the model $S$, we define the augmented log-likelihood as
$$\ell_{S\cup\{r\}}(\boldsymbol{\beta}_{S+r}) := \mathbb{E}_n\,L\big(\boldsymbol{\beta}_S^T\mathbf{X}_S + \beta_rX_r,\,Y\big). \qquad\text{(A1)}$$
We use $\hat{\boldsymbol{\beta}}_{S+r}$ to denote the maximizer of (A1); thus, $\hat{\boldsymbol{\beta}}_{S+r} = \hat{\boldsymbol{\beta}}_{S\cup\{r\}}$. In addition, denote the maximizer of $\mathbb{E}[\ell_{S\cup\{r\}}(\boldsymbol{\beta}_{S+r})]$ by $\boldsymbol{\beta}^*_{S+r}$. Due to the concavity of the log-likelihood in GLMs with the canonical link, $\boldsymbol{\beta}^*_{S+r}$ is unique.
Proof of Theorem 1.
Given an index set $S$ and $r\in S^c$, let $B_S^0(d) = \{\boldsymbol{\beta}_S : \|\boldsymbol{\beta}_S - \boldsymbol{\beta}_S^*\|\le d/(K\sqrt{|S|})\}$, where $d = A_2\sqrt{q^3\log p/n}$ with $A_2$ defined in Lemma A6.
Let $\Omega$ be the event that
$$\Big\{\sup_{|S|\le q,\,\boldsymbol{\beta}_S\in B_S^0(d)}\big|\mathbb{G}_n\big[L(\boldsymbol{\beta}_S^T\mathbf{X}_S,Y) - L(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S,Y)\big]\big|\le 20A_1d\sqrt{q\log p}\ \text{ and }\ \max_{|S|\le q}\big|\mathbb{G}_nL(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S,Y)\big|\le 10(A_1K^2 + b_{\max})\sqrt{q\log p}\Big\},$$
where $A_1$ is the constant defined in Lemma A4. By Lemma A4, $P(\Omega)\ge 1 - 6\exp(-6q\log p)$. Thus, in the rest of the proof, we only consider sample points in $\Omega$.
In the proof of Lemma A6, we show that $\max_{|S|\le q}\|\hat{\boldsymbol{\beta}}_S - \boldsymbol{\beta}_S^*\|\le A_2K^{-1}(q^2\log p/n)^{1/2}$ under $\Omega$. Then, given an index set $S$ with $|S|<q$ and $\boldsymbol{\beta}_S$ such that $\|\boldsymbol{\beta}_S - \boldsymbol{\beta}_S^*\|\le A_2K^{-1}(q^2\log p/n)^{1/2}$, for any $j\in S^c$,
$$\begin{aligned}
\ell_{S\cup\{j\}}(\boldsymbol{\beta}^*_{S+j}) - \ell_S(\hat{\boldsymbol{\beta}}_S)
&\ge \inf_{\|\boldsymbol{\beta}_S-\boldsymbol{\beta}_S^*\|\le A_2K^{-1}(q^2\log p/n)^{1/2}}\big\{\ell_{S\cup\{j\}}(\boldsymbol{\beta}^*_{S+j}) - \ell_S(\boldsymbol{\beta}_S)\big\}\\
&\ge n^{-1/2}\mathbb{G}_nL(\boldsymbol{\beta}^{*T}_{S+j}\mathbf{X}_{S\cup\{j\}},Y) - n^{-1/2}\mathbb{G}_nL(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S,Y)\\
&\quad - \sup_{\|\boldsymbol{\beta}_S-\boldsymbol{\beta}_S^*\|\le A_2K^{-1}(q^2\log p/n)^{1/2}}n^{-1/2}\big|\mathbb{G}_n\big[L(\boldsymbol{\beta}_S^T\mathbf{X}_S,Y) - L(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S,Y)\big]\big|\\
&\quad + \mathbb{E}\,L(\boldsymbol{\beta}^{*T}_{S+j}\mathbf{X}_{S\cup\{j\}},Y) - \mathbb{E}\,L(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S,Y)\\
&\ge -20(A_1K^2 + b_{\max})\sqrt{q\log p/n} - 20A_1A_2\,q^2\log p/n + \frac{\sigma_{\min}\kappa_{\min}}{2}\big\|\boldsymbol{\beta}^*_{S+j} - (\boldsymbol{\beta}_S^{*T},0)^T\big\|^2,
\end{aligned}$$
where the last inequality follows from the event $\Omega$ and Lemma A5.
By Lemma A1, if $\mathcal{M}\not\subseteq S$, there exists $r\in S^c\cap\mathcal{M}$ such that $\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}_S^{*T},0)^T\|\ge C\sigma_{\max}^{-1}\kappa_{\max}^{-1}n^{-\alpha}$. Thus, there exists some constant $C_1$ that does not depend on $n$ such that
$$\begin{aligned}
\max_{j\in S^c}\big\{\ell_{S\cup\{j\}}(\hat{\boldsymbol{\beta}}_{S+j}) - \ell_S(\hat{\boldsymbol{\beta}}_S)\big\}
&\ge \max_{j\in S^c}\big\{\ell_{S\cup\{j\}}(\boldsymbol{\beta}^*_{S+j}) - \ell_S(\hat{\boldsymbol{\beta}}_S)\big\}\ \ge\ \ell_{S\cup\{r\}}(\boldsymbol{\beta}^*_{S+r}) - \ell_S(\hat{\boldsymbol{\beta}}_S)\\
&\ge -20(A_1K^2 + b_{\max})\sqrt{q\log p/n} - 20A_1A_2\,q^2\log p/n + \frac{C^2\sigma_{\min}\kappa_{\min}n^{-2\alpha}}{2\sigma_{\max}^2\kappa_{\max}^2}\ \ge\ C_1n^{-2\alpha},
\end{aligned}$$
where the first inequality follows from $\hat{\boldsymbol{\beta}}_{S+j}$ being the maximizer of (A1) and the last inequality follows from Conditions (1) and (6).
Removing the restriction to $\Omega$, we obtain that
$$P\Big(\min_{|S|<q,\,\mathcal{M}\not\subseteq S}\ \max_{j\in S^c}\big\{\ell_{S\cup\{j\}}(\hat{\boldsymbol{\beta}}_{S\cup\{j\}}) - \ell_S(\hat{\boldsymbol{\beta}}_S)\big\}\ge C_1n^{-2\alpha}\Big)\ \ge\ 1 - 6\exp(-6q\log p).\qquad\square$$
Proof of Theorem 2.
We have shown that the forward stage will not stop when $\mathcal{M}\not\subseteq S$ and $|S|<q$, with probability converging to 1.
For any $r\in S^c\cap\mathcal{M}^c$, $\boldsymbol{\beta}^*_{S+r}$ is the unique solution to the equation $\mathbb{E}\big[\{Y - \mu(\boldsymbol{\beta}_{S+r}^T\mathbf{X}_{S\cup\{r\}})\}\mathbf{X}_{S\cup\{r\}}\big] = \mathbf{0}$. By the mean value theorem,
$$\begin{aligned}
\mathbb{E}\big[\{Y - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_r\big]
&= \mathbb{E}\big[\{\mu(\boldsymbol{\beta}^{*T}\mathbf{X}) - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_r\big]\\
&= \mathbb{E}\big[\{\mu(\boldsymbol{\beta}^{*T}\mathbf{X}) - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_r\big] - \mathbb{E}\big[\{\mu(\boldsymbol{\beta}^{*T}\mathbf{X}) - \mu(\boldsymbol{\beta}^{*T}_{S+r}\mathbf{X}_{S\cup\{r\}})\}X_r\big]\\
&= \big\{\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}_S^{*T},0)^T\big\}^T\,\mathbb{E}\big[\sigma(\tilde{\boldsymbol{\beta}}_{S+r}^T\mathbf{X}_{S\cup\{r\}})\,\mathbf{X}_{S\cup\{r\}}^{\otimes2}\big]\,\mathbf{e}_r,
\end{aligned}$$
where $\tilde{\boldsymbol{\beta}}_{S+r}$ is some point between $\boldsymbol{\beta}^*_{S+r}$ and $(\boldsymbol{\beta}_S^{*T},0)^T$, and $\mathbf{e}_r$ is a vector of length $|S|+1$ with the $r$th element being 1.
Since $|\tilde{\boldsymbol{\beta}}_{S+r}^T\mathbf{X}_{S\cup\{r\}}|\le|\boldsymbol{\beta}^{*T}_{S+r}\mathbf{X}_{S\cup\{r\}}| + |(\boldsymbol{\beta}_S^{*T},0)\,\mathbf{X}_{S\cup\{r\}}|\le 2K^2$ by Conditions (1) and (2), we have $\sigma(\tilde{\boldsymbol{\beta}}_{S+r}^T\mathbf{X}_{S\cup\{r\}})\ge\sigma_{\min}$ and
$$o(n^{-\alpha}) = \big|\mathbb{E}\big[\{Y - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_r\big]\big|\ \ge\ \sigma_{\min}\kappa_{\min}\big\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}_S^{*T},0)^T\big\|.$$
Therefore, $\max_{S:|S|\le q,\,r\in S^c\cap\mathcal{M}^c}\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}_S^{*T},0)^T\| = o(n^{-\alpha})$.
Under the event $\Omega$ defined in the proof of Theorem 1, $\max_{|S|\le q}\|\hat{\boldsymbol{\beta}}_S - \boldsymbol{\beta}_S^*\|\le A_2K^{-1}(q^2\log p/n)^{1/2}$. For any $j\in S^c$,
$$\begin{aligned}
\ell_{S\cup\{j\}}(\boldsymbol{\beta}^*_{S+j}) - \ell_S(\hat{\boldsymbol{\beta}}_S)
&\le \sup_{\|\boldsymbol{\beta}_S-\boldsymbol{\beta}_S^*\|\le A_2K^{-1}(q^2\log p/n)^{1/2}}\big\{\ell_{S\cup\{j\}}(\boldsymbol{\beta}^*_{S+j}) - \ell_S(\boldsymbol{\beta}_S)\big\}\\
&\le n^{-1/2}\big|\mathbb{G}_nL(\boldsymbol{\beta}^{*T}_{S+j}\mathbf{X}_{S\cup\{j\}},Y)\big| + n^{-1/2}\big|\mathbb{G}_nL(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S,Y)\big|\\
&\quad + \sup_{\|\boldsymbol{\beta}_S-\boldsymbol{\beta}_S^*\|\le A_2K^{-1}(q^2\log p/n)^{1/2}}n^{-1/2}\big|\mathbb{G}_n\big[L(\boldsymbol{\beta}_S^T\mathbf{X}_S,Y) - L(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S,Y)\big]\big|\\
&\quad + \mathbb{E}\,L(\boldsymbol{\beta}^{*T}_{S+j}\mathbf{X}_{S\cup\{j\}},Y) - \mathbb{E}\,L(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S,Y)\\
&\le 20(A_1K^2 + b_{\max})\sqrt{qn^{-1}\log p} + 20A_1A_2\,q^2n^{-1}\log p + \sigma_{\max}\kappa_{\max}\big\|\boldsymbol{\beta}^*_{S+j} - (\boldsymbol{\beta}_S^{*T},0)^T\big\|^2/2,
\end{aligned}$$
where the last inequality follows from the event $\Omega$ and Lemma A5. Since $\max_{S:|S|<q,\,r\in S^c\cap\mathcal{M}^c}\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}_S^{*T},0)^T\| = o(n^{-\alpha})$ and $qn^{-1+4\alpha}\log p\to0$,
$$\max_{S:|S|<q,\,r\in S^c\cap\mathcal{M}^c}\big\{\ell_{S\cup\{r\}}(\boldsymbol{\beta}^*_{S+r}) - \ell_S(\hat{\boldsymbol{\beta}}_S)\big\}\ \le\ 20(A_1K^2 + b_{\max})\sqrt{qn^{-1}\log p} + 20A_1A_2\,q^2n^{-1}\log p + \sigma_{\max}\kappa_{\max}\big\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}_S^{*T},0)^T\big\|^2/2\ =\ o(n^{-2\alpha}),$$
with probability at least $1 - 6\exp(-6q\log p)$. Then, by Lemma A6,
$$\begin{aligned}
\max_{S:|S|<q,\,r\in S^c\cap\mathcal{M}^c}\big\{\ell_{S\cup\{r\}}(\hat{\boldsymbol{\beta}}_{S+r}) - \ell_S(\hat{\boldsymbol{\beta}}_S)\big\}
&\le \max_{S:|S|<q,\,r\in S^c\cap\mathcal{M}^c}\big|\ell_{S\cup\{r\}}(\hat{\boldsymbol{\beta}}_{S+r}) - \ell_{S\cup\{r\}}(\boldsymbol{\beta}^*_{S+r})\big| + \max_{S:|S|<q,\,r\in S^c\cap\mathcal{M}^c}\big\{\ell_{S\cup\{r\}}(\boldsymbol{\beta}^*_{S+r}) - \ell_S(\hat{\boldsymbol{\beta}}_S)\big\}\\
&\le A_3\,q^2n^{-1}\log p + o(n^{-2\alpha}) = o(n^{-2\alpha}),
\end{aligned}$$
with probability at least $1 - 12\exp(-6q\log p)$.
By Theorem 1, if $\mathcal{M}\not\subseteq S$, the forward stage would select a noise variable with probability less than $18\exp(-6q\log p)$.
For $k>|\mathcal{M}|$, $\mathcal{M}\not\subseteq S_k$ implies that at least $k-|\mathcal{M}|$ noise variables are selected within the first $k$ steps. Then, for $k = C_2|\mathcal{M}|$ with $C_2>2$,
$$P\big(\mathcal{M}\not\subseteq S_k\big)\ \le\ \sum_{j=k-|\mathcal{M}|}^{k}\binom{k}{j}\big\{18\exp(-6q\log p)\big\}^{j}\ \le\ |\mathcal{M}|\binom{k}{|\mathcal{M}|}\,18\exp(-6q\log p)\ \le\ 18\exp\big(-6q\log p + \log|\mathcal{M}| + |\mathcal{M}|\log k\big)\ \le\ 18\exp(-4q\log p).$$
Therefore, $\mathcal{M}\subseteq S_{C_2|\mathcal{M}|}$ with probability at least $1 - 18\exp(-4q\log p)$. $\square$
Proof of Theorem 3.
By Theorem 2, $\mathcal{M}$ will be included in $F_k$ for some $k<q$ with probability going to 1. The forward stage stops at the $k$th step if $\mathrm{EBIC}(F_{k+1})>\mathrm{EBIC}(F_k)$.
On the other hand, $\mathrm{EBIC}(F_{k+1})<\mathrm{EBIC}(F_k)$ if and only if $2\ell_{F_{k+1}}(\hat{\boldsymbol{\beta}}_{F_{k+1}}) - 2\ell_{F_k}(\hat{\boldsymbol{\beta}}_{F_k}) > (\log n + 2\eta_1\log p)/n$. Thus, to show that the forward stage stops at the $k$th step, we only need to show that, with probability tending to 1,
$$2\ell_{F_{k+1}}(\hat{\boldsymbol{\beta}}_{F_{k+1}}) - 2\ell_{F_k}(\hat{\boldsymbol{\beta}}_{F_k}) < (\log n + 2\eta_1\log p)/n \qquad\text{(A4)}$$
for all $\eta_1>0$.
To prove (A4), we first verify conditions (A4) and (A5) of [17]. Given any index set $S$ such that $\mathcal{M}\subseteq S$ and $|S|\le q$, let $\boldsymbol{\beta}_{*S}$ be the subvector of $\boldsymbol{\beta}^*$ corresponding to $S$. We obtain that
$$\mathbb{E}\big[\{Y - \mu(\boldsymbol{\beta}_{*S}^T\mathbf{X}_S)\}\mathbf{X}_S\big] = \mathbb{E}\big[\mathbb{E}\{Y - \mu(\boldsymbol{\beta}_{*\mathcal{M}}^T\mathbf{X}_{\mathcal{M}})\mid\mathbf{X}_S\}\,\mathbf{X}_S\big] = \mathbf{0}.$$
This implies $\boldsymbol{\beta}_S^* = \boldsymbol{\beta}_{*S}$.
Given any $\boldsymbol{\pi}\in\mathbb{R}^{|S|}$, let $\mathcal{H}_S := \big\{h(\boldsymbol{\pi},\boldsymbol{\beta}_S) = (\sigma_{\max}K^2|S|)^{-1}\sigma(\boldsymbol{\beta}_S^T\mathbf{X}_S)(\boldsymbol{\pi}^T\mathbf{X}_S)^2,\ \|\boldsymbol{\pi}\| = 1,\ \boldsymbol{\beta}_S\in B_S^0(d)\big\}$. By Conditions (1) and (2), $h(\boldsymbol{\pi},\boldsymbol{\beta}_S)$ is bounded between $-1$ and $1$ uniformly over $\|\boldsymbol{\pi}\| = 1$ and $\boldsymbol{\beta}_S\in B_S^0(d)$.
By Lemma 2.6.15 in [50], the VC indices of $\mathcal{W} := \{(K\sqrt{|S|})^{-1}\boldsymbol{\pi}^T\mathbf{X}_S,\ \|\boldsymbol{\pi}\| = 1\}$ and $\mathcal{V} := \{\boldsymbol{\beta}_S^T\mathbf{X}_S,\ \boldsymbol{\beta}_S\in B_S^0(d)\}$ are bounded by $|S|+2$. For the definitions of the VC index and covering numbers, we refer to pages 83 and 85 of [50]. The VC index of the class $\mathcal{U} := \{(K^2|S|)^{-1}(\boldsymbol{\pi}^T\mathbf{X}_S)^2,\ \|\boldsymbol{\pi}\| = 1\}$ is the VC index of the class of sets $\{(\mathbf{X}_S,t) : (K^2|S|)^{-1}(\boldsymbol{\pi}^T\mathbf{X}_S)^2\le t,\ \|\boldsymbol{\pi}\| = 1,\ t\in\mathbb{R}\}$. Since $\{(\mathbf{X}_S,t) : (K^2|S|)^{-1}(\boldsymbol{\pi}^T\mathbf{X}_S)^2\le t\} = \{(\mathbf{X}_S,t) : 0<(K\sqrt{|S|})^{-1}\boldsymbol{\pi}^T\mathbf{X}_S\le\sqrt{t}\}\cup\{(\mathbf{X}_S,t) : -\sqrt{t}\le(K\sqrt{|S|})^{-1}\boldsymbol{\pi}^T\mathbf{X}_S\le 0\}$, each set of this class is created by taking finite unions, intersections and complements of the basic sets $\{(\mathbf{X}_S,t) : (K\sqrt{|S|})^{-1}\boldsymbol{\pi}^T\mathbf{X}_S<\sqrt{t}\}$. Therefore, the VC index of $\{(\mathbf{X}_S,t) : (K^2|S|)^{-1}(\boldsymbol{\pi}^T\mathbf{X}_S)^2\le t,\ \|\boldsymbol{\pi}\| = 1,\ t\in\mathbb{R}\}$ is of the same order as the VC index of $\{(\mathbf{X}_S,t) : (K\sqrt{|S|})^{-1}\boldsymbol{\pi}^T\mathbf{X}_S<\sqrt{t}\}$, by Lemma 2.6.17 in [50].
Then, by Theorem 2.6.7 in [50], for any probability measure $Q$, there exists some universal constant $C_3$ such that $N(\epsilon,\mathcal{U},L_2(Q))\le C_3\,\epsilon^{-2(|S|+1)}$. Likewise, $N(d\epsilon,\mathcal{V},L_2(Q))\le C_3\,\epsilon^{-2(|S|+1)}$. Given $\boldsymbol{\beta}_{S,0}\in B_S^0(d)$, for any $\boldsymbol{\beta}_S$ in the ball $\{\boldsymbol{\beta}_S : \sup_{\mathbf{x}}|\boldsymbol{\beta}_S^T\mathbf{x} - \boldsymbol{\beta}_{S,0}^T\mathbf{x}|<d\epsilon\}$, we have $\sup_{\mathbf{x}}|\sigma(\boldsymbol{\beta}_S^T\mathbf{x}) - \sigma(\boldsymbol{\beta}_{S,0}^T\mathbf{x})|<Kd\epsilon$ by Condition (4). Let $\mathcal{V}' := \{\sigma_{\max}^{-1}\sigma(\boldsymbol{\beta}_S^T\mathbf{X}_S),\ \boldsymbol{\beta}_S\in B_S^0(d)\}$. By the definition of the covering number, $N(Kd\epsilon,\mathcal{V}',L_2(Q))\le C_3\,\epsilon^{-2(|S|+1)}$. Given $\sigma(\boldsymbol{\beta}_{S,0}^T\mathbf{x})$ and $\boldsymbol{\pi}_0^T\mathbf{x}$, for any $\sigma(\boldsymbol{\beta}_S^T\mathbf{x})$ in the ball $\{\sigma(\boldsymbol{\beta}_S^T\mathbf{x}) : \sup_{\mathbf{x}}|\sigma(\boldsymbol{\beta}_S^T\mathbf{x}) - \sigma(\boldsymbol{\beta}_{S,0}^T\mathbf{x})|\le Kd\epsilon\}$ and $\boldsymbol{\pi}$ in the ball $\{\boldsymbol{\pi} : \sup_{\mathbf{x}}|(\boldsymbol{\pi}^T\mathbf{x})^2 - (\boldsymbol{\pi}_0^T\mathbf{x})^2|<\epsilon\}$, $(\sigma_{\max}K^2|S|)^{-1}\sup_{\mathbf{x}}|\sigma(\boldsymbol{\beta}_S^T\mathbf{x})(\boldsymbol{\pi}^T\mathbf{x})^2 - \sigma(\boldsymbol{\beta}_{S,0}^T\mathbf{x})(\boldsymbol{\pi}_0^T\mathbf{x})^2|\le(\sigma_{\max}^{-1}Kd + (K^2|S|)^{-1})\epsilon$. Thus, $N\big((\sigma_{\max}^{-1}Kd + (K^2|S|)^{-1})\epsilon,\mathcal{H}_S,L_2(Q)\big)\le C_3\,\epsilon^{-4(|S|+1)}$, and consequently $N(\epsilon,\mathcal{H}_S,L_2(Q))\le C_4\,\epsilon^{-4(|S|+1)}$ for some constant $C_4$.
By Theorem 1.1 in [51] and $|S|\le q$, we can find some constant $C_5$ such that
$$P\Big(\sup_{\|\boldsymbol{\pi}\| = 1,\,\boldsymbol{\beta}_S\in B_S^0(d)}\big|\mathbb{G}_nh(\boldsymbol{\pi},\boldsymbol{\beta}_S)\big|\ge C_5\sqrt{q\log p}\Big)\ \le\ \bigg(\frac{C_4C_5\sqrt{q\log p}}{\sqrt{4(|S|+1)}}\bigg)^{4(|S|+1)}\exp(-2C_5^2\,q\log p)\ \le\ \exp\big\{4(|S|+1)\log(C_4'C_5^2\,q\log p) - 2C_5^2\,q\log p\big\}\ \le\ \exp(-5q\log p),$$
where $C_4'$ is some constant that depends on $C_4$ only. Thus,
$$P\Big(\sup_{|S|\le q,\,\|\boldsymbol{\pi}\| = 1,\,\boldsymbol{\beta}_S\in B_S^0(d)}\big|\mathbb{E}_n\big[\sigma(\mathbf{X}_S^T\boldsymbol{\beta}_S)(\boldsymbol{\pi}^T\mathbf{X}_S)^2\big] - \mathbb{E}\big[\sigma(\mathbf{X}_S^T\boldsymbol{\beta}_S)(\boldsymbol{\pi}^T\mathbf{X}_S)^2\big]\big|\ge C_5K^2\sqrt{q^3\log p/n}\Big)\ \le\ \sum_{s=|\mathcal{M}|}^{q}\Big(\frac{ep}{s}\Big)^s\exp(-5q\log p)\ \le\ \exp(-3q\log p). \qquad\text{(A5)}$$
By Condition (5), $\sigma_{\min}\kappa_{\min}\le\Lambda\big(\mathbb{E}\big[\sigma(\mathbf{X}_S^T\boldsymbol{\beta}_S)\mathbf{X}_S^{\otimes2}\big]\big)\le\sigma_{\max}\kappa_{\max}$ for all $\boldsymbol{\beta}_S\in B_S^0(d)$ and $S : \mathcal{M}\subseteq S,\ |S|<q$. Then, by (A5),
$$\sigma_{\min}\kappa_{\min}/2\ \le\ \Lambda\big(\mathbb{E}_n\big[\sigma(\mathbf{X}_S^T\boldsymbol{\beta}_{*S})\mathbf{X}_S^{\otimes2}\big]\big)\ \le\ 2\sigma_{\max}\kappa_{\max}$$
uniformly over all $S$ satisfying $\mathcal{M}\subseteq S$ and $|S|\le q$, with probability at least $1 - \exp(-3q\log p)$. Hence, condition (A4) in [17] is satisfied with probability at least $1 - \exp(-3q\log p)$.
Additionally, for any $\boldsymbol{\beta}_S\in B_S^0(d)$,
$$\begin{aligned}
\big|\mathbb{E}_n\big[\sigma(\mathbf{X}_S^T\boldsymbol{\beta}_S)(\boldsymbol{\pi}^T\mathbf{X}_S)^2\big] - \mathbb{E}_n\big[\sigma(\mathbf{X}_S^T\boldsymbol{\beta}_{*S})(\boldsymbol{\pi}^T\mathbf{X}_S)^2\big]\big|
&\le \big|n^{-1/2}\mathbb{G}_n\big[\sigma(\mathbf{X}_S^T\boldsymbol{\beta}_S)(\boldsymbol{\pi}^T\mathbf{X}_S)^2\big]\big| + \big|n^{-1/2}\mathbb{G}_n\big[\sigma(\mathbf{X}_S^T\boldsymbol{\beta}_{*S})(\boldsymbol{\pi}^T\mathbf{X}_S)^2\big]\big|\\
&\quad + \big|\mathbb{E}\big[\sigma(\mathbf{X}_S^T\boldsymbol{\beta}_S)(\boldsymbol{\pi}^T\mathbf{X}_S)^2\big] - \mathbb{E}\big[\sigma(\mathbf{X}_S^T\boldsymbol{\beta}_{*S})(\boldsymbol{\pi}^T\mathbf{X}_S)^2\big]\big|\\
&\le 2C_5K^2\sqrt{q^3\log p/n} + \mu_{\max}\,\|\boldsymbol{\beta}_S - \boldsymbol{\beta}_{*S}\|\,\sqrt{|S|}\,K\,\lambda_{\max}.
\end{aligned}$$
Hence, condition (A5) in [17] is satisfied uniformly over all $S$ such that $\mathcal{M}\subseteq S$ and $|S|\le q$, with probability at least $1 - \exp(-3q\log p)$.
Then (A4) can be shown by following the proof of Equation (3.2) in [17]. Thus, the forward stage stops at the $k$th step with probability at least $1 - \exp(-3q\log p)$. $\square$
Proof of Theorem 4.
Suppose that a covariate $X_r$ is removed from $S$. For any $r\in\mathcal{M}$, since $\mathcal{M}\not\subseteq S\setminus\{r\}$ and $r$ is the only element in $(S\setminus\{r\})^c\cap\mathcal{M}$, by Lemma A1 and (A2),
$$\ell_S(\hat{\boldsymbol{\beta}}_S) - \ell_{S\setminus\{r\}}(\hat{\boldsymbol{\beta}}_{S\setminus\{r\}})\ \ge\ \ell_S(\boldsymbol{\beta}_S^*) - \ell_{S\setminus\{r\}}(\hat{\boldsymbol{\beta}}_{S\setminus\{r\}})\ =\ \ell_{(S\setminus\{r\})\cup\{r\}}(\boldsymbol{\beta}^*_{S\setminus\{r\}+r}) - \ell_{S\setminus\{r\}}(\hat{\boldsymbol{\beta}}_{S\setminus\{r\}})\ \ge\ C_1n^{-2\alpha},$$
with probability at least $1 - 6\exp(-6q\log p)$. From the proof of Theorem 1, we have, for any $\eta_2>0$,
$$\mathrm{BIC}(S) - \mathrm{BIC}(S\setminus\{r\})\ \le\ -2C_1n^{-2\alpha} + \eta_2n^{-1}\log n\ <\ 0,$$
uniformly over $r\in\mathcal{M}$ and $S$ satisfying $\mathcal{M}\subseteq S$ and $|S|\le q$, with probability at least $1 - 6\exp(-6q\log p)$. $\square$
Proof of Theorem 5.
By Theorems 1–4, the event $\Omega_1 := \{|\hat{\mathcal{M}}|\le q\ \text{and}\ \mathcal{M}\subseteq\hat{\mathcal{M}}\}$ holds with probability at least $1 - 25\exp(-2q\log p)$. Thus, in the rest of the proof, we restrict our attention to $\Omega_1$.
As shown in the proof of Theorem 3, we have $\boldsymbol{\beta}_{\hat{\mathcal{M}}}^* = \boldsymbol{\beta}_{*\hat{\mathcal{M}}}$. Then, by Lemma A6, $\|\hat{\boldsymbol{\beta}}_{\hat{\mathcal{M}}} - \boldsymbol{\beta}_{\hat{\mathcal{M}}}^*\|\le A_2K^{-1}\sqrt{q^2\log p/n}$ with probability at least $1 - 6\exp(-6q\log p)$. Removing the restriction to $\Omega_1$, we obtain that
$$\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^*\| = \|\hat{\boldsymbol{\beta}}_{\hat{\mathcal{M}}} - \boldsymbol{\beta}_{*\hat{\mathcal{M}}}\| = \|\hat{\boldsymbol{\beta}}_{\hat{\mathcal{M}}} - \boldsymbol{\beta}_{\hat{\mathcal{M}}}^*\|\ \le\ A_2K^{-1}\sqrt{q^2\log p/n},$$
with probability at least $1 - 31\exp(-2q\log p)$. $\square$

Appendix Additional Lemmas and Proofs

Lemma A1.
Given a model $S$ such that $|S|<q$ and $\mathcal{M}\not\subseteq S$, under Condition (6):
(i) There exists $r\in S^c\cap\mathcal{M}$ such that $\boldsymbol{\beta}^*_{S+r}\neq(\boldsymbol{\beta}_S^{*T},0)^T$.
(ii) Suppose Conditions (1), (2) and (6) hold. There exists $r\in S^c\cap\mathcal{M}$ such that $\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}_S^{*T},0)^T\|\ge C\sigma_{\max}^{-1}\kappa_{\max}^{-1}n^{-\alpha}$.
Proof. 
As $\boldsymbol{\beta}^*_{S+j}$ is the maximizer of $\mathbb{E}\,\ell_{S\cup\{j\}}(\boldsymbol{\beta}_{S+j})$, by the concavity of $\mathbb{E}\,\ell_{S\cup\{j\}}(\boldsymbol{\beta}_{S+j})$, $\boldsymbol{\beta}^*_{S+j}$ is the solution to the equation $\mathbb{E}\big[\{Y - \mu(\boldsymbol{\beta}_S^T\mathbf{X}_S + \beta_jX_j)\}\mathbf{X}_{S\cup\{j\}}\big] = \mathbf{0}$.
(i): Suppose that $\boldsymbol{\beta}^*_{S+j} = (\boldsymbol{\beta}_S^{*T},0)^T$ for all $j\in S^c\cap\mathcal{M}$. Then,
$$0 = \mathbb{E}\big[\{Y - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_j\big] = \mathbb{E}\big[\{\mu(\boldsymbol{\beta}^{*T}\mathbf{X}) - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_j\big]\quad\text{for all }j\in S^c\cap\mathcal{M},$$
so that $\max_{j\in S^c\cap\mathcal{M}}\big|\mathbb{E}\big[\{\mu(\boldsymbol{\beta}^{*T}\mathbf{X}) - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_j\big]\big| = 0$, which violates Condition (6). Therefore, we can find $r\in S^c\cap\mathcal{M}$ such that $\boldsymbol{\beta}^*_{S+r}\neq(\boldsymbol{\beta}_S^{*T},0)^T$.
(ii): Let $r\in S^c\cap\mathcal{M}$ satisfy $\big|\mathbb{E}\big[\{\mu(\boldsymbol{\beta}^{*T}\mathbf{X}) - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_r\big]\big|>Cn^{-\alpha}$. Without loss of generality, assume that $X_r$ is the last element of $\mathbf{X}_{S\cup\{r\}}$. By the mean value theorem,
$$\begin{aligned}
\mathbb{E}\big[\{\mu(\boldsymbol{\beta}^{*T}\mathbf{X}) - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_r\big]
&= \mathbb{E}\big[\{\mu(\boldsymbol{\beta}^{*T}\mathbf{X}) - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_r\big] - \mathbb{E}\big[\{\mu(\boldsymbol{\beta}^{*T}\mathbf{X}) - \mu(\boldsymbol{\beta}^{*T}_{S+r}\mathbf{X}_{S\cup\{r\}})\}X_r\big]\\
&= \mathbb{E}\big[\{\mu(\boldsymbol{\beta}^{*T}_{S+r}\mathbf{X}_{S\cup\{r\}}) - \mu\big((\boldsymbol{\beta}_S^{*T},0)\,\mathbf{X}_{S\cup\{r\}}\big)\}X_r\big]\\
&= \big\{\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}_S^{*T},0)^T\big\}^T\,\mathbb{E}\big[\sigma(\tilde{\boldsymbol{\beta}}_{S+r}^T\mathbf{X}_{S\cup\{r\}})\,\mathbf{X}_{S\cup\{r\}}^{\otimes2}\big]\,\mathbf{e}_r, \qquad\text{(A6)}
\end{aligned}$$
where $\tilde{\boldsymbol{\beta}}_{S+r}$ is some point between $\boldsymbol{\beta}^*_{S+r}$ and $(\boldsymbol{\beta}_S^{*T},0)^T$, and $\mathbf{e}_r$ is a vector of length $|S|+1$ with the $r$th element being 1.
As $\tilde{\boldsymbol{\beta}}_{S+r}$ lies between $\boldsymbol{\beta}^*_{S+r}$ and $(\boldsymbol{\beta}_S^{*T},0)^T$, $|\tilde{\boldsymbol{\beta}}_{S+r}^T\mathbf{X}_{S\cup\{r\}}|\le|\boldsymbol{\beta}^{*T}_{S+r}\mathbf{X}_{S\cup\{r\}}| + |(\boldsymbol{\beta}_S^{*T},0)\,\mathbf{X}_{S\cup\{r\}}|\le 2K^2$ by Conditions (1) and (2). Thus, $|\sigma(\tilde{\boldsymbol{\beta}}_{S+r}^T\mathbf{X}_{S\cup\{r\}})|\le\sigma_{\max}$. By (A6) and Condition (5),
$$Cn^{-\alpha}\ \le\ \big|\mathbb{E}\big[\{\mu(\boldsymbol{\beta}^{*T}\mathbf{X}) - \mu(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S)\}X_r\big]\big|\ \le\ \big\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}_S^{*T},0)^T\big\|\,\sigma_{\max}\,\big\|\mathbb{E}\big[\mathbf{X}_{S\cup\{r\}}^{\otimes2}\big]\mathbf{e}_r\big\|\ \le\ \sigma_{\max}\kappa_{\max}\big\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}_S^{*T},0)^T\big\|.$$
Therefore, $\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}_S^{*T},0)^T\|\ge C\sigma_{\max}^{-1}\kappa_{\max}^{-1}n^{-\alpha}$. $\square$
Lemma A2.
Let $\xi_i$, $i = 1, \ldots, n$, be $n$ i.i.d. random variables such that $|\xi_i|\le B$ for a constant $B>0$. Under Conditions (1)–(3), we have $\mathbb{E}\big|Y_i\xi_i - \mathbb{E}[Y_i\xi_i]\big|^m\le m!\,\{2B(2M + \mu_{\max})\}^m$ for every $m\ge 1$.
Proof. 
By Conditions (1) and (2), $|\boldsymbol{\beta}^{*T}\mathbf{X}_i|\le K^2$ for all $i\ge 1$, and consequently $\mu(\boldsymbol{\beta}^{*T}\mathbf{X}_i)\le\mu_{\max}$. Then, by Condition (3),
$$\mathbb{E}[|Y_i|^m] = \mathbb{E}\big[|\epsilon_i + \mu(\boldsymbol{\beta}^{*T}\mathbf{X}_i)|^m\big]\ \le\ \sum_{t=0}^{m}\binom{m}{t}\mathbb{E}\big[|\epsilon_i|^t\big]\mu_{\max}^{m-t}\ \le\ \sum_{t=0}^{m}\binom{m}{t}t!\,M^t\mu_{\max}^{m-t}\ \le\ m!\,(M + \mu_{\max})^m,$$
for every $m\ge 1$. By the same arguments, it can be shown that, for every $m\ge 1$, $\mathbb{E}\big|Y_i\xi_i - \mathbb{E}[Y_i\xi_i]\big|^m\le\mathbb{E}\big(|Y_i\xi_i| + |\mathbb{E}[Y_i\xi_i]|\big)^m\le m!\,\{2B(M + \mu_{\max})\}^m$. $\square$
Lemma A3.
Under Conditions (1)–(3), when $n$ is sufficiently large such that $28\,q\log p/n<1$, we have $\sup_{\boldsymbol{\beta}\in B}\big|\mathbb{E}_nL(\boldsymbol{\beta}^T\mathbf{X},Y)\big|\le 2(M + \mu_{\max})K^3 + b_{\max}$, with probability at least $1 - 2\exp(-10q\log p)$.
Proof. 
By Condition (2), $\sup_{\boldsymbol{\beta}\in B}|\boldsymbol{\beta}^T\mathbf{X}|\le K^3$. Thus,
$$\sup_{\boldsymbol{\beta}\in B}\big|\mathbb{E}_nL(\boldsymbol{\beta}^T\mathbf{X},Y)\big|\ \le\ \sup_{\boldsymbol{\beta}\in B}\mathbb{E}_n|Y\boldsymbol{\beta}^T\mathbf{X}| + b_{\max}\ \le\ \big(|\mathbb{E}_n|Y| - \mathbb{E}|Y|| + \mathbb{E}|Y|\big)K^3 + b_{\max}\ \le\ |\mathbb{E}_n|Y| - \mathbb{E}|Y||\,K^3 + (M + \mu_{\max})K^3 + b_{\max},$$
where the last inequality follows from $\mathbb{E}[|Y|]\le M + \mu_{\max}$, as shown in the proof of Lemma A2.
Let $\xi_i = \mathbf{1}\{Y_i>0\} - \mathbf{1}\{Y_i<0\}$, so that $|\xi_i|\le 1$. By Lemma A2, $\mathbb{E}\big||Y_i| - \mathbb{E}|Y_i|\big|^m\le m!\,\{2(M + \mu_{\max})\}^m$. Applying Bernstein's inequality (e.g., Lemma 2.2.11 in [50]) yields
$$P\Big(\big|\mathbb{E}_n|Y| - \mathbb{E}|Y|\big|>10(M + \mu_{\max})\sqrt{q\log p/n}\Big)\ \le\ 2\exp\bigg(-\frac{1}{2}\cdot\frac{196\,q\log p}{4 + 20\sqrt{q\log p/n}}\bigg)\ \le\ 2\exp(-10q\log p),$$
when $n$ is sufficiently large such that $20\,q\log p/n<1$. Since $10(M + \mu_{\max})\sqrt{q\log p/n} = o(1)$,
$$P\Big(\sup_{\boldsymbol{\beta}\in B}\big|\mathbb{E}_nL(\boldsymbol{\beta}^T\mathbf{X},Y)\big|\ge 2(M + \mu_{\max})K^3 + b_{\max}\Big)\ \le\ 2\exp(-10q\log p).\qquad\square$$
Lemma A4.
Given an index set $S$ and $r\in S^c$, let $B_S^0(d) = \{\boldsymbol{\beta}_S : \|\boldsymbol{\beta}_S - \boldsymbol{\beta}_S^*\|\le d/(K\sqrt{|S|})\}$ and $A_1 := M + 2\mu_{\max}$. Under Conditions (1)–(3), when $n$ is sufficiently large such that $10\,q\log p/n<1$, we have:
(i) $\big|\mathbb{G}_n\big[L(\boldsymbol{\beta}_S^T\mathbf{X}_S,Y) - L(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S,Y)\big]\big|\le 20A_1d\sqrt{q\log p}$, uniformly over $\boldsymbol{\beta}_S\in B_S^0(d)$ and $|S|\le q$, with probability at least $1 - 4\exp(-6q\log p)$;
(ii) $\big|\mathbb{G}_nL(\boldsymbol{\beta}_S^{*T}\mathbf{X}_S,Y)\big|\le 10(A_1K^2 + b_{\max})\sqrt{q\log p}$, uniformly over $|S|\le q$, with probability at least $1 - 2\exp(-8q\log p)$.
Proof. 
(1): Let $R_{|S|}(d)$ be an $|S|$-dimensional ball centered at $0$ with radius $d/(K\sqrt{|S|})$. Then $B_S^0(d) = R_{|S|}(d) + \beta_S^*$. Let $\mathcal{C}_{|S|} := \{C(\xi_k)\}$ be a collection of cubes that cover the ball $R_{|S|}(d)$, where $C(\xi_k)$ is a cube containing $\xi_k$ with sides of length $d/(K\sqrt{|S|}\,n^2)$ and $\xi_k$ is some point in $R_{|S|}(d)$. As the volume of $C(\xi_k)$ is $\left(d/(K\sqrt{|S|}\,n^2)\right)^{|S|}$ and the volume of $R_{|S|}(d)$ is less than $\left(2d/(K\sqrt{|S|})\right)^{|S|}$, we can select the $\xi_k$'s so that no more than $(4n^2)^{|S|}$ cubes are needed to cover $R_{|S|}(d)$; we thus assume $|\mathcal{C}_{|S|}| \le (4n^2)^{|S|}$. For any $\xi \in C(\xi_k)$, $\|\xi - \xi_k\| \le d/(Kn^2)$. In addition, let $T_{1S}(\xi) := \mathbb{E}_n\left[Y\xi^T X_S\right]$, $T_{2S}(\xi) := \mathbb{E}_n\left[b\left((\beta_S^* + \xi)^T X_S\right) - b\left(\beta_S^{*T} X_S\right)\right]$, and $T_S(\xi) := T_{1S}(\xi) - T_{2S}(\xi)$.
Given any $\xi \in R_{|S|}(d)$, there exists $C(\xi_k) \in \mathcal{C}_{|S|}$ such that $\xi \in C(\xi_k)$. Then
\[
\left|T_S(\xi) - E[T_S(\xi)]\right| \le \left|T_S(\xi) - T_S(\xi_k)\right| + \left|T_S(\xi_k) - E[T_S(\xi_k)]\right| + \left|E[T_S(\xi)] - E[T_S(\xi_k)]\right| =: I + II + III.
\]
We deal with $III$ first. By the mean value theorem, there exists a $\tilde{\xi}$ between $\xi$ and $\xi_k$ such that
\[
\left|E[T_S(\xi_k)] - E[T_S(\xi)]\right| \le \left|E\left[Y(\xi_k - \xi)^T X_S\right]\right| + \left|E\left[\mu\left((\beta_S^* + \tilde{\xi})^T X_S\right)(\xi_k - \xi)^T X_S\right]\right| \le E[|Y|]\,\|\xi_k - \xi\|\,\|X_S\| + \mu_{\max}\|\xi_k - \xi\|\,\|X_S\| \le (M + 2\mu_{\max})\,\frac{d\sqrt{|S|}}{n^2} = A_1\,\frac{d\sqrt{|S|}}{n^2},
\]
where the last inequality follows from Lemma A2 and $A_1 = M + 2\mu_{\max}$.
Next, we evaluate $II$. By Condition (2), $|X_{iS}^T\xi| \le \|X_{iS}\|\,\|\xi\| \le \sqrt{|S|}\,K \cdot d/(K\sqrt{|S|}) = d$ for all $\xi \in R_{|S|}(d)$. Then, by Lemma A2,
\[
E\left|Y\xi_k^T X_S - E\left[Y\xi_k^T X_S\right]\right|^m \le m!\left(2(M + \mu_{\max})d\right)^m.
\]
By Bernstein's inequality, when $n$ is sufficiently large such that $10\sqrt{q\log p/n} \le 1$,
\[
P\left(\left|T_{1S}(\xi_k) - E[T_{1S}(\xi_k)]\right| > 10(M + \mu_{\max})d\sqrt{qn^{-1}\log p}\right) \le 2\exp\left(-\frac{1}{2}\cdot\frac{100\,q\log p}{4 + 20\sqrt{q\log p/n}}\right) \le 2\exp(-10q\log p). \tag{A9}
\]
Since $\left|b\left((\beta_S^* + \xi_k)^T X_S\right) - b\left(\beta_S^{*T} X_S\right)\right| \le \mu_{\max} d$, by the same arguments used for (A9), we have
\[
P\left(\left|T_{2S}(\xi_k) - E[T_{2S}(\xi_k)]\right| > 10\mu_{\max} d\sqrt{qn^{-1}\log p}\right) \le 2\exp(-10q\log p). \tag{A10}
\]
Combining (A9) and (A10) yields that, uniformly over the $\xi_k$,
\[
\left|T_S(\xi_k) - E[T_S(\xi_k)]\right| \le 10A_1 d\sqrt{qn^{-1}\log p}, \tag{A11}
\]
with probability at least $1 - 2(4n^2)^{|S|}\exp(-10q\log p)$.
We now assess $I$. Following the same arguments as in Lemma A3,
\[
P\left(\sup_{\xi \in C(\xi_k)}\left|T_S(\xi) - T_S(\xi_k)\right| > (2M + 3\mu_{\max})\,\frac{d\sqrt{|S|}}{n^2}\right) \le 2\exp(-8q\log p). \tag{A12}
\]
Since $\sqrt{|S|}/n^2 = o\left(\sqrt{qn^{-1}\log p}\right)$, combining (A8), (A11) and (A12) yields
\[
P\left(\sup_{\xi \in R_{|S|}(d)}\left|T_S(\xi) - E[T_S(\xi)]\right| \ge 20A_1 d\sqrt{qn^{-1}\log p}\right) \le 2(4n^2)^{|S|}\exp(-10q\log p) + 2\exp(-8q\log p) \le 4\exp(-8q\log p).
\]
By the combinatorial inequality $\binom{p}{s} \le (ep/s)^s$, we obtain
\[
P\left(\sup_{|S| \le q,\ \beta_S \in B_S^0(d)}\left|\mathbb{G}_n\left[L(\beta_S^T X_S, Y) - L(\beta_S^{*T} X_S, Y)\right]\right| \ge 20A_1 d\sqrt{q\log p}\right) \le \sum_{s=1}^{q}(ep/s)^s \cdot 4\exp(-8q\log p) \le 4\exp(-6q\log p).
\]
(2): We evaluate the $m$th moment of $L(\beta_S^{*T} X_S, Y)$:
\[
E\left|Y\beta_S^{*T} X_S - b(\beta_S^{*T} X_S)\right|^m \le E\left[\sum_{t=0}^{m}\binom{m}{t}|Y|^t K_2^t\, b_{\max}^{m-t}\right] \le \sum_{t=0}^{m}\binom{m}{t}\, t!\left((M + \mu_{\max})K_2\right)^t b_{\max}^{m-t} \le m!\left((M + \mu_{\max})K_2 + b_{\max}\right)^m.
\]
Then, by Bernstein's inequality,
\[
P\left(\left|\mathbb{G}_n L(\beta_S^{*T} X_S, Y)\right| > 10(A_1K_2 + b_{\max})\sqrt{q\log p}\right) \le 2\exp(-10q\log p).
\]
By the same arguments used in part (1), we obtain
\[
P\left(\sup_{|S| \le q}\left|\mathbb{G}_n L(\beta_S^{*T} X_S, Y)\right| \ge 10(A_1K_2 + b_{\max})\sqrt{q\log p}\right) \le \sum_{s=1}^{q}(ep/s)^s \cdot 2\exp(-10q\log p) \le 2\exp(-8q\log p).
\]
□
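Both union bounds above rest on the same counting step. Assuming, as is implicit, that $q \le p$ and $p \ge e^2$, it can be verified as follows:
\[
\sum_{s=1}^{q}\binom{p}{s} \le \sum_{s=1}^{q}\left(\frac{ep}{s}\right)^{s} \le q\,(ep)^{q} = \exp\left(\log q + q + q\log p\right) \le \exp\left(2q\log p\right),
\]
since $\log q + q \le q\log p$ whenever $\log p \ge 2$; multiplying by $4\exp(-8q\log p)$ (respectively $2\exp(-10q\log p)$) then yields the stated $4\exp(-6q\log p)$ (respectively $2\exp(-8q\log p)$) bounds.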
Lemma A5.
Given a model $S$ and $r \in S^c$, under Conditions (1), (2) and (5), for any $\|\beta_S - \beta_S^*\| \le K/\sqrt{|S|}$,
\[
\sigma_{\min}\kappa_{\min}\|\beta_S - \beta_S^*\|^2/2 \le E\ell_S(\beta_S^*) - E\ell_S(\beta_S) \le \sigma_{\max}\kappa_{\max}\|\beta_S - \beta_S^*\|^2/2.
\]
Proof. 
As $\beta_S^*$ maximizes the expected log-likelihood $E\ell_S(\cdot)$, which is concave for GLMs with the canonical link, the first-order condition gives $E\left[\left\{Y - \mu(\beta_S^{*T} X_S)\right\}X_S\right] = 0$. Then, for any $\|\beta_S - \beta_S^*\| \le K/\sqrt{|S|}$, $|\beta_S^T X_S| \le |\beta_S^{*T} X_S| + |(\beta_S - \beta_S^*)^T X_S| \le K_2 + K/\sqrt{|S|} \times K\sqrt{|S|} \le 2K_2 \le L$. Thus, by Taylor's expansion,
\[
E\ell_S(\beta_S) - E\ell_S(\beta_S^*) = -\frac{1}{2}(\beta_S - \beta_S^*)^T E\left[\sigma\left(\tilde{\beta}_S^T X_S\right)X_S^{\otimes 2}\right](\beta_S - \beta_S^*),
\]
where $\tilde{\beta}_S$ is between $\beta_S$ and $\beta_S^*$. By Condition (5), $\sigma_{\min}\kappa_{\min}\|\beta_S - \beta_S^*\|^2/2 \le E\ell_S(\beta_S^*) - E\ell_S(\beta_S) \le \sigma_{\max}\kappa_{\max}\|\beta_S - \beta_S^*\|^2/2$. □
Lemma A6.
Under Conditions (1)–(6), there exist constants $A_2$ and $A_3$, not depending on $n$, such that $\|\hat{\beta}_S - \beta_S^*\| \le A_2K^{-1}\sqrt{q^2\log p/n}$ and $|\ell_S(\hat{\beta}_S) - \ell_S(\beta_S^*)| \le A_3\,q^2\log p/n$ hold uniformly over all $S$ with $|S| \le q$, with probability at least $1 - 6\exp(-6q\log p)$.
Proof. 
Define
\[
\Omega(d) := \left\{\sup_{|S| \le q,\ \beta_S \in B_S^0(d)}\left|\mathbb{G}_n\left[L(\beta_S^T X_S, Y) - L(\beta_S^{*T} X_S, Y)\right]\right| < 20A_1 d\sqrt{q\log p}\right\}.
\]
By Lemma A4, the event $\Omega(d)$ holds with probability at least $1 - 4\exp(-6q\log p)$. Thus, in the proof of Lemma A6, we shall assume that $\Omega(d)$ holds with $d = A_2\sqrt{q^3\log p/n}$ for some $A_2 > 20(\sigma_{\min}\kappa_{\min})^{-1}K^2A_1$.
For any $\|\beta_S - \beta_S^*\| = A_2K^{-1}\sqrt{q^2\log p/n}$, since $\sqrt{q^2\log p/n} \le \sqrt{q^3\log p/n}/\sqrt{|S|}$, we have $\beta_S \in B_S^0(d)$. By Lemma A5,
\[
\ell_S(\beta_S^*) - \ell_S(\beta_S) = \left\{\ell_S(\beta_S^*) - E\ell_S(\beta_S^*)\right\} - \left\{\ell_S(\beta_S) - E\ell_S(\beta_S)\right\} + \left\{E\ell_S(\beta_S^*) - E\ell_S(\beta_S)\right\} \ge \sigma_{\min}\kappa_{\min}\|\beta_S - \beta_S^*\|^2/2 - 20A_1 d\sqrt{q\log p/n} = \sigma_{\min}\kappa_{\min}A_2^2\,q^2\log p/(2K^2n) - 20A_1A_2\,q^2\log p/n > 0.
\]
Thus,
\[
\inf_{|S| \le q,\ \|\beta_S - \beta_S^*\| = A_2K^{-1}\sqrt{q^2\log p/n}}\left\{\ell_S(\beta_S^*) - \ell_S(\beta_S)\right\} > 0.
\]
Then, by the concavity of $\ell_S(\cdot)$, we obtain that $\max_{|S| \le q}\|\hat{\beta}_S - \beta_S^*\| \le A_2K^{-1}\sqrt{q^2n^{-1}\log p}$.
On the other hand, for any $\|\beta_S - \beta_S^*\| \le A_2K^{-1}\sqrt{q^2\log p/n}$,
\[
\left|\ell_S(\beta_S^*) - \ell_S(\beta_S)\right| \le \left|\left\{\ell_S(\beta_S^*) - E\ell_S(\beta_S^*)\right\} - \left\{\ell_S(\beta_S) - E\ell_S(\beta_S)\right\}\right| + \left|E\ell_S(\beta_S^*) - E\ell_S(\beta_S)\right| \le \sigma_{\max}\kappa_{\max}\|\beta_S - \beta_S^*\|^2/2 + 20A_1 d\sqrt{q\log p/n} \le A_3\,q^2n^{-1}\log p,
\]
where $A_3 := 4\sigma_{\max}\lambda_{\max}A_2^2K^{-2} + 20A_1A_2$. □

Appendix B. Additional Results in the Applications

Table A1. Comparison of genes selected by each competing method from the mammalian eye data set.
|           | STEPWISE | FR | LASSO | SIS+LASSO | SC | dgLARS |
| STEPWISE  | 3 | 3 | 2 | 2 | 2 | 0 |
| FR        |   | 4 | 2 | 2 | 2 | 0 |
| LASSO     |   |   | 16 | 5 | 2 | 0 |
| SIS+LASSO |   |   |   | 9 | 2 | 0 |
| SC        |   |   |   |   | 4 | 0 |
| dgLARS    |   |   |   |   |   | 7 |
Note: Diagonal and off-diagonal elements of the table represent the model sizes for each method and the number of overlapping genes selected by the two methods corresponding to the row and column, respectively.
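An overlap table of this kind can be assembled directly from the per-method gene lists. The following is a minimal sketch (not the authors' code); selected is a hypothetical named list with one character vector of selected gene IDs per method.
```r
# Sketch: build an overlap matrix like Table A1 from lists of selected genes.
selected <- list(STEPWISE = c("g1", "g2", "g3"),
                 FR       = c("g1", "g2", "g3", "g4"))   # ... one entry per method
methods <- names(selected)
overlap <- outer(methods, methods,
                 Vectorize(function(a, b) length(intersect(selected[[a]], selected[[b]]))))
dimnames(overlap) <- list(methods, methods)
overlap   # diagonal = model size, off-diagonal = number of shared genes
```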
Table A2. Selected miRNAs for the ESCC training dataset.
| Method | Selected miRNAs |
| STEPWISE | miR-4783-3p; miR-320b; miR-1225-3p |
| FR | miR-4783-3p; miR-320b; miR-1225-3p; miR-6789-5p |
| SC | miR-4783-3p; miR-320b; miR-1225-3p; miR-6789-5p |
| LASSO | miR-6789-5p; miR-6781-5p; miR-1225-3p; miR-1238-5p; miR-320b; miR-6794-5p; miR-6877-5p; miR-6785-5p; miR-718; miR-195-5p |
| SIS+LASSO | miR-6785-5p; miR-1238-5p; miR-1225-3p; miR-6789-5p; miR-320b; miR-6875-5p; miR-6127; miR-1268b; miR-6781-5p; miR-125a-3p |
| dgLARS | miR-891b; miR-6127; miR-151a-5p; miR-195-5p; miR-3688-5p; miR-125b-1-3p; miR-1273c; miR-6501-5p; miR-4666a-5p; miR-514a-3p |
Note: LASSO, SIS+LASSO and dgLARS selected 20, 17 and 33 miRNAs, respectively; only the top 10 miRNAs are reported for these methods.

References

  1. Prosperi, M.; Min, J.S.; Bian, J.; Modave, F. Big data hurdles in precision medicine and precision public health. BMC Med. Inform. Decis. Mak. 2018, 18, 139. [Google Scholar] [CrossRef] [PubMed]
  2. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B-Stat. Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  3. Flynn, C.J.; Hurvich, C.M.; Simonoff, J.S. On the sensitivity of the lasso to the number of predictor variables. Stat. Sci. 2017, 32, 88–105. [Google Scholar] [CrossRef] [Green Version]
  4. van de Geer, S.A. On the asymptotic variance of the debiased Lasso. Electron. J. Stat. 2019, 13, 2970–3008. [Google Scholar] [CrossRef]
  5. Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Stat. Soc. Ser. B-Stat. Methodol. 2008, 70, 849–911. [Google Scholar] [CrossRef] [Green Version]
  6. Barut, E.; Fan, J.; Verhasselt, A. Conditional sure independence screening. J. Am. Stat. Assoc. 2016, 111, 1266–1277. [Google Scholar] [CrossRef] [PubMed]
  7. Wang, H. Forward regression for ultra-high dimensional variable screening. J. Am. Stat. Assoc. 2009, 104, 1512–1524. [Google Scholar] [CrossRef]
  8. Zheng, Q.; Hong, H.G.; Li, Y. Building generalized linear models with ultrahigh dimensional features: A sequentially conditional approach. Biometrics 2019, 76, 1–14. [Google Scholar] [CrossRef]
  9. Hong, H.G.; Zheng, Q.; Li, Y. Forward regression for Cox models with high-dimensional covariates. J. Multivar. Anal. 2019, 173, 268–290. [Google Scholar] [CrossRef]
  10. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499. [Google Scholar]
  11. Augugliaro, L.; Mineo, A.M.; Wit, E.C. Differential geometric least angle regression: A differential geometric approach to sparse generalized linear models. J. R. Stat. Soc. Ser. B-Stat. Methodol. 2013, 75, 471–498. [Google Scholar] [CrossRef] [Green Version]
  12. Pazira, H.; Augugliaro, L.; Wit, E. Extended differential geometric LARS for high-dimensional GLMs with general dispersion parameter. Stat. Comput. 2018, 28, 753–774. [Google Scholar] [CrossRef]
  13. An, H.; Huang, D.; Yao, Q.; Zhang, C.H. Stepwise Searching for Feature Variables in High-Dimensional Linear Regression. 2008. Available online: http://eprints.lse.ac.uk/51349/ (accessed on 20 August 2020).
  14. Ing, C.K.; Lai, T.L. A stepwise regression method and consistent model selection for high-dimensional sparse linear models. Stat. Sin. 2011, 21, 1473–1513. [Google Scholar] [CrossRef] [Green Version]
  15. Hwang, J.S.; Hu, T.H. A stepwise regression algorithm for high-dimensional variable selection. J. Stat. Comput. Simul. 2015, 85, 1793–1806. [Google Scholar] [CrossRef]
  16. McCullagh, P. Generalized Linear Models; Routledge: Abingdon-on-Thames, UK, 1989. [Google Scholar]
  17. Chen, J.; Chen, Z. Extended BIC for small-n-large-P sparse GLM. Stat. Sin. 2012, 22, 555–574. [Google Scholar] [CrossRef] [Green Version]
  18. Bühlmann, P.; Yu, B. Sparse boosting. J. Mach. Learn. Res. 2006, 7, 1001–1024. [Google Scholar]
  19. van de Geer, S.A. High-dimensional generalized linear models and the lasso. Ann. Stat. 2008, 36, 614–645. [Google Scholar] [CrossRef]
  20. Chen, J.; Chen, Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika 2008, 95, 759–771. [Google Scholar] [CrossRef] [Green Version]
  21. Fan, Y.; Tang, C.Y. Tuning parameter selection in high dimensional penalized likelihood. J. R. Stat. Soc. Ser. B-Stat. Methodol. 2013, 75, 531–552. [Google Scholar] [CrossRef] [Green Version]
  22. Cheng, M.Y.; Honda, T.; Zhang, J.T. Forward variable selection for sparse ultra-high dimensional varying coefficient models. J. Am. Stat. Assoc. 2016, 111, 1209–1221. [Google Scholar] [CrossRef] [Green Version]
  23. Zhao, S.D.; Li, Y. Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J. Multivar. Anal. 2012, 105, 397–411. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  24. Kwemou, M. Non-asymptotic oracle inequalities for the Lasso and group Lasso in high dimensional logistic model. ESAIM-Prob. Stat. 2016, 20, 309–331. [Google Scholar] [CrossRef]
  25. Jiang, Y.; He, Y.; Zhang, H. Variable selection with prior information for generalized linear models via the prior LASSO method. J. Am. Stat. Assoc. 2016, 111, 355–376. [Google Scholar] [CrossRef]
  26. Zhang, C.H.; Huang, J. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann. Stat. 2008, 36, 1567–1594. [Google Scholar] [CrossRef]
  27. Fan, J.; Song, R. Sure independence screening in generalized linear models with NP-dimensionality. Ann. Stat. 2010, 38, 3567–3604. [Google Scholar] [CrossRef]
  28. Luo, S.; Chen, Z. Sequential Lasso cum EBIC for feature selection with ultra-high dimensional feature space. J. Am. Stat. Assoc. 2014, 109, 1229–1240. [Google Scholar] [CrossRef]
  29. Luo, S.; Xu, J.; Chen, Z. Extended Bayesian information criterion in the Cox model with a high-dimensional feature space. Ann. Inst. Stat. Math. 2015, 67, 287–311. [Google Scholar] [CrossRef]
  30. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  31. Simon, N.; Friedman, J.; Hastie, T.; Tibshirani, R. Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent. J. Stat. Softw. 2011, 39, 1–13. [Google Scholar] [CrossRef]
  32. Breheny, P.; Huang, J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 2011, 5, 232–253. [Google Scholar] [CrossRef] [Green Version]
  33. Wang, X.; Leng, C. R Package: Screening. 2016. Available online: https://github.com/wwrechard/screening (accessed on 20 August 2020).
  34. Augugliaro, L.; Mineo, A.M.; Wit, E.C. dglars: An R Package to Estimate Sparse Generalized Linear Models. J. Stat. Softw. 2014, 59, 1–40. [Google Scholar] [CrossRef] [Green Version]
  35. Scheetz, T.E.; Kim, K.Y.A.; Swiderski, R.E.; Philp, A.R.; Braun, T.A.; Knudtson, K.L.; Dorrance, A.M.; DiBona, G.F.; Huang, J.; Casavant, T.L.; et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc. Natl. Acad. Sci. USA 2006, 103, 14429–14434. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  36. Chiang, A.P.; Beck, J.S.; Yen, H.J.; Tayeh, M.K.; Scheetz, T.E.; Swiderski, R.E.; Nishimura, D.Y.; Braun, T.A.; Kim, K.Y.A.; Huang, J.; et al. Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11). Proc. Natl. Acad. Sci. USA 2006, 103, 6287–6292. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  37. He, S.; Peng, J.; Li, L.; Xu, Y.; Wu, X.; Yu, J.; Liu, J.; Zhang, J.; Zhang, R.; Wang, W. High expression of cytokeratin CAM5.2 in esophageal squamous cell carcinoma is associated with poor prognosis. Medicine 2019, 98, e17104. [Google Scholar] [CrossRef]
  38. Li, B.X.; Yu, Q.; Shi, Z.L.; Li, P.; Fu, S. Circulating microRNAs in esophageal squamous cell carcinoma: Association with locoregional staging and survival. Int. J. Clin. Exp. Med. 2015, 8, 7241–7250. [Google Scholar]
  39. Sudo, K.; Kato, K.; Matsuzaki, J.; Boku, N.; Abe, S.; Saito, Y.; Daiko, H.; Takizawa, S.; Aoki, Y.; Sakamoto, H.; et al. Development and validation of an esophageal squamous cell carcinoma detection model by large-scale microRNA profiling. JAMA Netw. Open 2019, 2, e194573. [Google Scholar] [CrossRef] [Green Version]
  40. Zhang, Y. Epidemiology of esophageal cancer. World J. Gastroenterol 2013, 19, 5598–5606. [Google Scholar] [CrossRef]
  41. Mathieu, L.N.; Kanarek, N.F.; Tsai, H.L.; Rudin, C.M.; Brock, M.V. Age and sex differences in the incidence of esophageal adenocarcinoma: Results from the Surveillance, Epidemiology, and End Results (SEER) Registry (1973–2008). Dis. Esophagus 2014, 27, 757–763. [Google Scholar] [CrossRef]
  42. Zhou, J.; Zhang, M.; Huang, Y.; Feng, L.; Chen, H.; Hu, Y.; Chen, H.; Zhang, K.; Zheng, L.; Zheng, S. MicroRNA-320b promotes colorectal cancer proliferation and invasion by competing with its homologous microRNA-320a. Cancer Lett. 2015, 356, 669–675. [Google Scholar] [CrossRef] [Green Version]
  43. Lieb, V.; Weigelt, K.; Scheinost, L.; Fischer, K.; Greither, T.; Marcou, M.; Theil, G.; Klocker, H.; Holzhausen, H.J.; Lai, X.; et al. Serum levels of miR-320 family members are associated with clinical parameters and diagnosis in prostate cancer patients. Oncotarget 2018, 9, 10402–10416. [Google Scholar] [CrossRef] [Green Version]
  44. Mullany, L.E.; Herrick, J.S.; Wolff, R.K.; Stevens, J.R.; Slattery, M.L. Association of cigarette smoking and microRNA expression in rectal cancer: Insight into tumor phenotype. Cancer Epidemiol. 2016, 45, 98–107. [Google Scholar] [CrossRef] [Green Version]
  45. Zheng, H.; Zhang, F.; Lin, X.; Huang, C.; Zhang, Y.; Li, Y.; Lin, J.; Chen, W.; Lin, X. MicroRNA-1225-5p inhibits proliferation and metastasis of gastric carcinoma through repressing insulin receptor substrate-1 and activation of β-catenin signaling. Oncotarget 2016, 7, 4647–4663. [Google Scholar] [CrossRef] [Green Version]
  46. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2018. [Google Scholar]
  47. Wickham, H. ggplot2: Elegant Graphics for Data Analysis; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  48. Zhao, P.; Yu, B. On model selection consistency of Lasso. J. Mach. Learn. Res. 2006, 7, 2541–2563. [Google Scholar]
  49. Bühlmann, P.; Van De Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  50. van der Vaart, A.W.; Wellner, J.A. Weak Convergence and Empirical Processes: With Applications to Statistics; Springer: Berlin/Heidelberg, Germany, 1996. [Google Scholar]
  51. Talagrand, M. Sharper bounds for Gaussian and empirical processes. Ann. Probab. 1994, 22, 28–76. [Google Scholar] [CrossRef]
Figure 1. Box plot of model sizes for each method over 120 different training samples from the mammalian eye data set. STEPWISE was performed with η_1 = 1 and η_2 = 4, and FR and SC were conducted with γ = 1.
Figure 2. Comparisons of ROC curves for the selected models in the ESCC data set, by the sequentially selected order. Model 1: 2.52 + 0.02 × Age − 1.86 × Gender; Model 2: 20.64 + 0.08 × Age − 2.12 × Gender + 2.02 × miR-4783-3p; Model 3: 24.21 + 0.09 × Age − 2.16 × Gender + 1.44 × miR-4783-3p − 1.31 × miR-320b; Model 4: 35.70 + 0.10 × Age − 2.02 × Gender + 1.40 × miR-4783-3p − 0.98 × miR-320b + 1.91 × miR-1225-3p; Model 5: 53.10 + 0.10 × Age − 1.85 × Gender + 1.43 × miR-4783-3p − 0.92 × miR-320b + 1.43 × miR-1225-3p + 2.10 × miR-6789-5p.
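The nested models compared in Figure 2 can be refit and their ROC curves recomputed directly. The sketch below is illustrative rather than the authors' code: it assumes a hypothetical data frame dat with a 0/1 ESCC indicator escc, Age, Gender and the selected miRNAs (renamed with underscores to give valid R names), and uses the pROC package.
```r
# Sketch: fit the sequentially nested logistic models and compare their AUCs.
library(pROC)
m1 <- glm(escc ~ Age + Gender, family = binomial, data = dat)
m2 <- update(m1, . ~ . + miR_4783_3p)
m3 <- update(m2, . ~ . + miR_320b)
m4 <- update(m3, . ~ . + miR_1225_3p)
roc_list <- lapply(list(m1, m2, m3, m4), function(m) roc(dat$escc, fitted(m)))
sapply(roc_list, auc)   # AUC for each nested model, as plotted in Figure 2
```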
Figure 3. Box plot of model sizes for each method based on 100 ESCC training datasets. Performance of STEPWISE is reported with η_1 = 0 and η_2 = 3.5. Performances of SC and FR are reported with γ = 0.
Table 1. The values of η_1 and η_2 used in the simulation studies.
| (η_1, η_2) | Normal Model | Binomial Model | Poisson Model |
| Example 1 | (0.5, 3) | (0.5, 3) | (1, 3) |
| Example 2 | (0.5, 3) | (1, 3) | (1, 3) |
| Example 3 | (1, 3) | (0.5, 3) | (0.5, 1) |
| Example 4 | (1, 3.5) | (0, 1) | (1, 3) |
| Example 5 | (0.5, 3) | (0.5, 2) | (0.5, 3) |
Note: values for η 1 and η 2 were searched on the grid { 0 , 0.25 , 0.5 , 1 } and { 1 , 2 , 3 , 3.5 , 4 , 4.5 , 5 } , respectively.
Table 2. Normal model.
| Example | Method | TP | FP | PIT | MSE (×10^4) | MSPE |
| 1 (p_0 = 8) | LASSO(1SE) | 8.00 (0.00) | 5.48 (6.61) | 1.00 (0.00) | 2.45 | 1.148 |
| | LASSO(BIC) | 8.00 (0.00) | 2.55 (2.48) | 1.00 (0.00) | 2.58 | 1.172 |
| | SIS+LASSO(1SE) | 8.00 (0.00) | 6.59 (4.22) | 1.00 (0.00) | 1.49 | 1.042 |
| | SIS+LASSO(BIC) | 8.00 (0.00) | 6.04 (3.33) | 1.00 (0.00) | 1.37 | 1.025 |
| | dgLARS(BIC) | 8.00 (0.00) | 3.52 (2.53) | 1.00 (0.00) | 2.25 | 1.130 |
| | SC (γ_L) | 8.00 (0.00) | 3.01 (1.85) | 1.00 (0.00) | 1.09 | 0.895 |
| | SC (γ_H) | 7.60 (1.59) | 0.00 (0.00) | 0.94 (0.24) | 14.56 | 5.081 |
| | FR (γ_L) | 8.00 (0.00) | 2.96 (2.04) | 1.00 (0.00) | 1.08 | 0.896 |
| | FR (γ_H) | 7.88 (0.84) | 0.00 (0.00) | 0.98 (0.14) | 3.74 | 2.040 |
| | STEPWISE | 8.00 (0.00) | 0.00 (0.00) | 1.00 (0.00) | 0.21 | 0.972 |
| 2 (p_0 = 8) | LASSO(1SE) | 8.00 (0.00) | 4.74 (4.24) | 1.00 (0.00) | 2.46 | 1.154 |
| | LASSO(BIC) | 8.00 (0.00) | 2.12 (2.02) | 1.00 (0.00) | 2.62 | 1.182 |
| | SIS+LASSO | 7.99 (0.10) | 6.84 (4.57) | 0.99 (0.10) | 1.65 | 1.058 |
| | SIS+LASSO(BIC) | 7.99 (0.10) | 6.11 (3.85) | 0.99 (0.10) | 1.56 | 1.041 |
| | dgLARS(BIC) | 8.00 (0.00) | 3.26 (2.62) | 1.00 (0.00) | 2.28 | 1.138 |
| | SC (γ_L) | 8.00 (0.00) | 2.73 (1.53) | 1.00 (0.00) | 0.98 | 0.901 |
| | SC (γ_H) | 7.30 (2.11) | 0.00 (0.00) | 0.90 (0.30) | 23.70 | 6.397 |
| | FR (γ_L) | 8.00 (0.00) | 2.45 (1.65) | 1.00 (0.00) | 0.92 | 0.907 |
| | FR (γ_H) | 7.94 (0.60) | 0.00 (0.00) | 0.99 (0.00) | 2.69 | 2.062 |
| | STEPWISE | 8.00 (0.00) | 0.01 (0.10) | 1.00 (0.00) | 0.21 | 0.972 |
| 3 (p_0 = 5) | LASSO(1SE) | 5.00 (0.00) | 8.24 (2.63) | 1.00 (0.00) | 3.07 | 1.084 |
| | LASSO(BIC) | 5.00 (0.00) | 12.33 (3.28) | 1.00 (0.00) | 27.97 | 2.398 |
| | SIS+LASSO(1SE) | 0.97 (0.26) | 15.94 (2.93) | 0.00 (0.00) | 1406.22 | 76.024 |
| | SIS+LASSO(BIC) | 0.97 (0.26) | 16.20 (2.81) | 0.00 (0.00) | 1354.54 | 71.017 |
| | dgLARS(BIC) | 5.00 (0.00) | 53.91 (14.44) | 1.00 (0.00) | 6.63 | 0.979 |
| | SC (γ_L) | 4.48 (0.50) | 0.25 (0.44) | 0.48 (0.50) | 21.74 | 3.086 |
| | SC (γ_H) | 4.48 (0.50) | 0.14 (0.35) | 0.48 (0.50) | 21.70 | 2.065 |
| | FR (γ_L) | 5.00 (0.00) | 0.23 (0.66) | 1.00 (0.00) | 0.27 | 0.973 |
| | FR (γ_H) | 5.00 (0.00) | 0.14 (0.35) | 1.00 (0.00) | 0.15 | 0.074 |
| | STEPWISE | 5.00 (0.00) | 0.03 (0.22) | 1.00 (0.00) | 0.18 | 0.976 |
| 4 (p_0 = 14) | LASSO(1SE) | 14.00 (0.00) | 29.84 (15.25) | 1.00 (0.00) | 13.97 | 1.148 |
| | LASSO(BIC) | 13.94 (0.24) | 4.92 (5.54) | 0.94 (0.24) | 38.69 | 1.995 |
| | SIS+LASSO(1SE) | 11.44 (1.45) | 15.19 (7.29) | 0.05 (0.21) | 133.38 | 4.714 |
| | SIS+LASSO(BIC) | 11.35 (1.51) | 10.98 (7.19) | 0.05 (0.21) | 137.06 | 4.940 |
| | dgLARS(BIC) | 14.00 (0.00) | 13.93 (6.68) | 1.00 (0.00) | 18.08 | 1.329 |
| | SC (γ_L) | 13.68 (0.60) | 0.86 (0.62) | 0.75 (0.44) | 11.80 | 1.148 |
| | SC (γ_H) | 4.20 (2.80) | 0.03 (0.17) | 0.03 (0.17) | 407.86 | 6.567 |
| | FR (γ_L) | 14.00 (0.00) | 0.50 (0.76) | 1.00 (0.00) | 1.23 | 0.940 |
| | FR (γ_H) | 4.99 (3.07) | 0.00 (0.00) | 0.03 (0.17) | 360.65 | 6.640 |
| | STEPWISE | 14.00 (0.00) | 0.00 (0.00) | 1.00 (0.00) | 0.91 | 0.958 |
| 5 (p_0 = 3) | LASSO(1SE) | 3.00 (0.00) | 22.76 (9.05) | 1.00 (0.00) | 1.01 | 0.044 |
| | LASSO(BIC) | 3.00 (0.00) | 8.29 (3.23) | 1.00 (0.00) | 1.75 | 0.054 |
| | SIS+LASSO(1SE) | 3.00 (0.00) | 8.40 (3.10) | 1.00 (0.00) | 0.44 | 0.041 |
| | SIS+LASSO(BIC) | 3.00 (0.00) | 9.58 (3.36) | 1.00 (0.00) | 0.29 | 0.040 |
| | dgLARS(BIC) | 3.00 (0.00) | 13.39 (4.94) | 1.00 (0.00) | 1.28 | 0.048 |
| | SC (γ_L) | 3.00 (0.00) | 1.47 (0.67) | 1.00 (0.00) | 0.03 | 0.038 |
| | SC (γ_H) | 2.01 (0.10) | 0.01 (0.10) | 0.01 (0.10) | 4.51 | 0.008 |
| | FR (γ_L) | 3.00 (0.00) | 1.21 (1.01) | 1.00 (0.00) | 0.03 | 0.038 |
| | FR (γ_H) | 3.00 (0.00) | 0.00 (0.00) | 1.00 (0.00) | 0.01 | 0.003 |
| | STEPWISE | 3.00 (0.00) | 0.00 (0.00) | 1.00 (0.00) | 0.01 | 0.039 |
Note: TP, true positives; FP, false positives; PIT, probability of including all true predictors in the selected predictors; MSE, mean squared error of β̂; MSPE, mean squared prediction error; numbers in the parentheses are standard deviations; LASSO(BIC), LASSO with the tuning parameter chosen to give the smallest BIC for the models on the LASSO path; LASSO(1SE), LASSO with the tuning parameter chosen by the one-standard-error rule; SIS+LASSO(BIC), sure independence screening by [5] followed by LASSO(BIC); SIS+LASSO(1SE), sure independence screening followed by LASSO(1SE); dgLARS(BIC), differential geometric least angle regression by [11,12] with the tuning parameter chosen to give the smallest BIC on the dgLARS path; SC(γ), sequentially conditioning approach by [8]; FR(γ), forward regression by [7]; STEPWISE, the proposed method; in FR and SC, the smaller and larger values of γ are presented as γ_L and γ_H, respectively; p_0 denotes the number of true signals; LASSO(1SE), LASSO(BIC), SIS and dgLARS were conducted via the R packages glmnet [31], ncvreg [32], screening [33] and dglars [34], respectively.
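For reference, the two LASSO baselines described in the note can be obtained with the glmnet package cited there. The following is a minimal sketch (not the authors' code; the toy data, dimensions and seed are arbitrary placeholders), shown for a logistic model.
```r
# Sketch: LASSO(1SE) and LASSO(BIC) baselines via glmnet.
library(glmnet)

set.seed(1)
n <- 200; p <- 1000
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.5 * x[, 2]))

# LASSO(1SE): penalty chosen by the one-standard-error rule in cross-validation.
cv <- cv.glmnet(x, y, family = "binomial")
beta_1se <- coef(cv, s = "lambda.1se")

# LASSO(BIC): penalty minimizing BIC = deviance + log(n) * df along the glmnet path.
fit <- glmnet(x, y, family = "binomial")
bic <- deviance(fit) + log(n) * fit$df
beta_bic <- coef(fit, s = fit$lambda[which.min(bic)])
```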
Table 3. Binomial model.
| Example | Method | TP | FP | PIT | MSE | MSPE |
| 1 (p_0 = 8) | LASSO(1SE) | 7.99 (0.10) | 4.77 (5.56) | 0.99 (0.10) | 0.021 | 0.104 |
| | LASSO(BIC) | 7.99 (0.10) | 3.19 (2.34) | 0.99 (0.10) | 0.021 | 0.104 |
| | SIS+LASSO(1SE) | 7.94 (0.24) | 35.42 (6.77) | 0.94 (0.24) | 0.119 | 0.048 |
| | SIS+LASSO(BIC) | 7.94 (0.24) | 16.83 (21.60) | 0.94 (0.24) | 0.119 | 0.073 |
| | dgLARS(BIC) | 8.00 (0.00) | 3.27 (2.29) | 1.00 (0.00) | 0.019 | 0.102 |
| | SC (γ_L) | 8.00 (0.00) | 2.81 (1.47) | 1.00 (0.00) | 0.009 | 0.073 |
| | SC (γ_H) | 1.02 (0.14) | 0.00 (0.00) | 0.00 (0.00) | 0.030 | 0.028 |
| | FR (γ_L) | 8.00 (0.00) | 3.90 (2.36) | 1.00 (0.00) | 0.032 | 0.066 |
| | FR (γ_H) | 2.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) | 0.025 | 0.027 |
| | STEPWISE | 7.98 (0.14) | 0.08 (0.53) | 0.98 (0.14) | 0.002 | 0.094 |
| 2 (p_0 = 8) | LASSO(1SE) | 7.98 (0.14) | 3.29 (2.76) | 0.98 (0.14) | 0.054 | 0.073 |
| | LASSO(BIC) | 7.99 (0.10) | 3.84 (2.72) | 0.99 (0.10) | 0.052 | 0.067 |
| | SIS+LASSO(1SE) | 7.92 (0.27) | 28.20 (7.31) | 0.92 (0.27) | 0.038 | 0.030 |
| | SIS+LASSO(BIC) | 7.92 (0.27) | 9.60 (12.92) | 0.92 (0.27) | 0.051 | 0.058 |
| | dgLARS(BIC) | 7.99 (0.10) | 3.94 (2.65) | 0.99 (0.10) | 0.050 | 0.067 |
| | SC (γ_L) | 7.72 (0.45) | 0.39 (0.49) | 0.72 (0.45) | 0.005 | 0.063 |
| | SC (γ_H) | 1.13 (0.37) | 0.00 (0.00) | 0.00 (0.00) | 0.069 | 0.044 |
| | FR (γ_L) | 7.99 (0.10) | 0.66 (0.76) | 0.99 (0.10) | 0.014 | 0.051 |
| | FR (γ_H) | 2.10 (0.30) | 0.00 (0.00) | 0.00 (0.00) | 0.061 | 0.033 |
| | STEPWISE | 7.99 (0.10) | 0.02 (0.14) | 0.99 (0.10) | 0.004 | 0.056 |
| 3 (p_0 = 5) | LASSO(1SE) | 4.51 (0.52) | 7.36 (2.57) | 0.52 (0.50) | 0.155 | 0.051 |
| | LASSO(BIC) | 4.98 (0.14) | 5.97 (2.25) | 0.98 (0.14) | 0.118 | 0.037 |
| | SIS+LASSO(1SE) | 0.85 (0.46) | 10.66 (3.01) | 0.00 (0.00) | 0.206 | 0.186 |
| | SIS+LASSO(BIC) | 0.85 (0.46) | 12.10 (3.13) | 0.00 (0.00) | 0.197 | 0.185 |
| | dgLARS(BIC) | 4.92 (0.27) | 16.21 (6.21) | 0.92 (0.27) | 0.112 | 0.035 |
| | SC (γ_L) | 4.32 (0.49) | 0.47 (0.50) | 0.33 (0.47) | 0.016 | 0.048 |
| | SC (γ_H) | 2.62 (1.34) | 0.42 (0.50) | 0.00 (0.00) | 0.104 | 0.066 |
| | FR (γ_L) | 4.98 (0.14) | 0.67 (0.79) | 0.98 (0.14) | 0.020 | 0.033 |
| | FR (γ_H) | 2.98 (0.95) | 0.40 (0.49) | 0.00 (0.00) | 0.087 | 0.043 |
| | STEPWISE | 4.97 (0.17) | 0.04 (0.28) | 0.97 (0.17) | 0.014 | 0.034 |
| 4 (p_0 = 14) | LASSO(1SE) | 9.96 (1.89) | 6.78 (7.92) | 0.01 (0.01) | 0.112 | 0.107 |
| | LASSO(BIC) | 9.33 (1.86) | 2.79 (2.87) | 0.00 (0.00) | 0.112 | 0.118 |
| | SIS+LASSO(1SE) | 10.03 (1.62) | 28.01 (9.54) | 0.03 (0.17) | 0.098 | 0.070 |
| | SIS+LASSO(BIC) | 8.90 (1.99) | 5.42 (10.64) | 0.01 (0.10) | 0.114 | 0.120 |
| | dgLARS(BIC) | 9.31 (1.85) | 2.84 (2.86) | 0.00 (0.00) | 0.110 | 0.117 |
| | SC (γ_L) | 9.48 (1.40) | 2.35 (2.14) | 0.00 (0.00) | 0.043 | 0.070 |
| | SC (γ_H) | 1.17 (0.40) | 0.00 (0.00) | 0.00 (0.00) | 0.125 | 0.049 |
| | FR (γ_L) | 11.83 (1.39) | 1.58 (1.60) | 0.09 (0.29) | 0.026 | 0.048 |
| | FR (γ_H) | 2.06 (0.24) | 0.00 (0.00) | 0.00 (0.00) | 0.119 | 0.032 |
| | STEPWISE | 11.81 (1.42) | 1.52 (1.58) | 0.09 (0.29) | 0.026 | 0.048 |
| 5 (p_0 = 3) | LASSO(1SE) | 2.00 (0.00) | 1.55 (1.76) | 0.00 (0.00) | 0.008 | 0.215 |
| | LASSO(BIC) | 2.00 (0.00) | 1.86 (1.57) | 0.00 (0.00) | 0.008 | 0.213 |
| | SIS+LASSO(1SE) | 2.23 (0.42) | 10.81 (6.45) | 0.23 (0.42) | 0.007 | 0.192 |
| | SIS+LASSO(BIC) | 2.10 (0.30) | 3.60 (4.65) | 0.10 (0.30) | 0.007 | 0.206 |
| | dgLARS(BIC) | 2.00 (0.00) | 1.64 (1.49) | 0.00 (0.00) | 0.008 | 0.213 |
| | SC (γ_L) | 2.27 (0.49) | 7.16 (3.20) | 0.29 (0.46) | 0.060 | 0.166 |
| | SC (γ_H) | 1.87 (0.34) | 0.03 (0.17) | 0.00 (0.00) | 0.005 | 0.030 |
| | FR (γ_L) | 2.96 (0.20) | 8.88 (5.39) | 0.96 (0.20) | 0.013 | 0.147 |
| | FR (γ_H) | 1.97 (0.17) | 0.03 (0.17) | 0.00 (0.00) | 0.005 | 0.019 |
| | STEPWISE | 2.89 (0.31) | 0.76 (1.70) | 0.89 (0.31) | 0.001 | 0.194 |
Note: abbreviations are explained in the footnote of Table 2.
Table 4. Poisson model.
| Example | Method | TP | FP | PIT | MSE | MSPE |
| 1 (p_0 = 8) | LASSO(1SE) | 7.93 (0.43) | 4.64 (4.82) | 0.96 (0.19) | 0.001 | 4.236 |
| | LASSO(BIC) | 7.99 (0.10) | 14.37 (14.54) | 0.99 (0.10) | 0.001 | 3.133 |
| | SIS+LASSO(1SE) | 7.89 (0.37) | 25.37 (8.39) | 0.91 (0.29) | 0.001 | 3.247 |
| | SIS+LASSO(BIC) | 7.89 (0.37) | 17.77 (11.70) | 0.91 (0.29) | 0.001 | 3.078 |
| | dgLARS(BIC) | 8.00 (0.00) | 13.28 (14.31) | 1.00 (0.00) | 0.001 | 3.183 |
| | SC (γ_L) | 7.96 (0.20) | 4.94 (3.46) | 0.96 (0.20) | 0.001 | 2.874 |
| | SC (γ_H) | 5.05 (1.70) | 0.04 (0.24) | 0.07 (0.26) | 0.001 | 3.902 |
| | FR (γ_L) | 7.93 (0.26) | 4.86 (3.73) | 0.93 (0.26) | 0.001 | 2.837 |
| | FR (γ_H) | 5.13 (1.61) | 0.06 (0.31) | 0.07 (0.26) | 0.001 | 3.833 |
| | STEPWISE | 7.91 (0.29) | 2.77 (2.91) | 0.91 (0.29) | 0.001 | 3.410 |
| 2 (p_0 = 8) | LASSO(1SE) | 8.00 (0.00) | 2.23 (3.52) | 1.00 (0.00) | 0.001 | 3.981 |
| | LASSO(BIC) | 8.00 (0.00) | 8.98 (8.92) | 1.00 (0.00) | 0.001 | 3.107 |
| | SIS+LASSO(1SE) | 7.98 (0.14) | 22.85 (7.08) | 0.98 (0.14) | 0.001 | 2.824 |
| | SIS+LASSO(BIC) | 7.98 (0.14) | 13.55 (8.24) | 0.98 (0.14) | 0.001 | 2.937 |
| | dgLARS(BIC) | 8.00 (0.00) | 8.91 (9.10) | 1.00 (0.00) | 0.001 | 3.099 |
| | SC (γ_L) | 8.00 (0.00) | 3.89 (2.89) | 1.00 (0.00) | 0.000 | 2.979 |
| | SC (γ_H) | 5.68 (1.45) | 0.00 (0.00) | 0.12 (0.33) | 0.001 | 3.971 |
| | FR (γ_L) | 8.00 (0.00) | 3.60 (2.80) | 1.00 (0.00) | 0.000 | 3.032 |
| | FR (γ_H) | 5.71 (1.42) | 0.00 (0.00) | 0.10 (0.30) | 0.001 | 3.911 |
| | STEPWISE | 7.98 (0.14) | 2.00 (2.23) | 0.98 (0.14) | 0.000 | 3.589 |
| 3 (p_0 = 5) | LASSO(1SE) | 4.37 (0.51) | 6.88 (2.61) | 0.38 (0.48) | 0.001 | 1.959 |
| | LASSO(BIC) | 4.79 (0.41) | 5.62 (2.17) | 0.79 (0.41) | 0.000 | 2.044 |
| | SIS+LASSO(1SE) | 0.86 (0.47) | 10.11 (2.55) | 0.00 (0.00) | 0.002 | 3.266 |
| | SIS+LASSO(BIC) | 0.86 (0.47) | 11.86 (2.99) | 0.00 (0.00) | 0.002 | 3.160 |
| | dgLARS(BIC) | 4.55 (0.51) | 18.29 (6.13) | 0.56 (0.49) | 0.001 | 1.877 |
| | SC (γ_L) | 4.73 (0.45) | 0.53 (0.66) | 0.73 (0.45) | 0.000 | 2.479 |
| | SC (γ_H) | 2.84 (0.63) | 0.40 (0.49) | 0.00 (0.00) | 0.001 | 0.664 |
| | FR (γ_L) | 4.54 (0.52) | 1.98 (2.19) | 0.55 (0.50) | 0.000 | 2.128 |
| | FR (γ_H) | 2.71 (0.70) | 0.43 (0.50) | 0.00 (0.00) | 0.001 | 0.605 |
| | STEPWISE | 4.54 (0.52) | 1.77 (2.01) | 0.55 (0.50) | 0.000 | 2.132 |
| 4 (p_0 = 14) | LASSO(1SE) | 10.01 (1.73) | 3.91 (6.03) | 0.01 (0.10) | 0.003 | 15.582 |
| | LASSO(BIC) | 12.11 (1.46) | 36.56 (22.43) | 0.19 (0.39) | 0.002 | 5.688 |
| | SIS+LASSO(1SE) | 10.42 (1.66) | 21.41 (8.87) | 0.03 (0.17) | 0.003 | 11.316 |
| | SIS+LASSO(BIC) | 10.73 (1.66) | 32.67 (8.92) | 0.03 (0.17) | 0.003 | 8.545 |
| | dgLARS(BIC) | 12.05 (1.52) | 38.70 (28.97) | 0.18 (0.38) | 0.002 | 5.111 |
| | SC (γ_L) | 10.33 (1.63) | 10.48 (6.66) | 0.02 (0.14) | 0.002 | 4.499 |
| | SC (γ_H) | 5.32 (1.92) | 0.52 (1.37) | 0.00 (0.00) | 0.003 | 14.005 |
| | FR (γ_L) | 12.00 (1.71) | 8.93 (6.36) | 0.23 (0.42) | 0.001 | 4.503 |
| | FR (γ_H) | 5.65 (2.13) | 0.38 (1.15) | 0.00 (0.00) | 0.003 | 13.802 |
| | STEPWISE | 11.80 (1.72) | 5.97 (5.37) | 0.19 (0.39) | 0.001 | 5.809 |
| 5 (p_0 = 3) | LASSO(1SE) | 2.00 (0.00) | 1.13 (2.85) | 0.00 (0.00) | 0.003 | 2.674 |
| | LASSO(BIC) | 2.01 (0.10) | 2.82 (2.52) | 0.01 (0.10) | 0.003 | 2.583 |
| | SIS+LASSO(1SE) | 2.87 (0.34) | 9.28 (3.85) | 0.87 (0.34) | 0.002 | 2.455 |
| | SIS+LASSO(BIC) | 2.87 (0.34) | 9.88 (4.29) | 0.87 (0.34) | 0.002 | 2.355 |
| | dgLARS(BIC) | 2.00 (0.00) | 2.88 (2.38) | 0.00 (0.00) | 0.003 | 2.562 |
| | SC (γ_L) | 2.75 (0.44) | 3.27 (1.75) | 0.75 (0.44) | 0.001 | 2.339 |
| | SC (γ_H) | 2.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) | 0.003 | 1.086 |
| | FR (γ_L) | 3.00 (0.00) | 2.80 (1.73) | 1.00 (0.00) | 0.001 | 2.326 |
| | FR (γ_H) | 2.40 (0.49) | 0.00 (0.00) | 0.40 (0.49) | 0.002 | 0.981 |
| | STEPWISE | 3.00 (0.00) | 0.35 (0.59) | 1.00 (0.00) | 0.001 | 2.977 |
Note: abbreviations are explained in the footnote of Table 2.
Table 5. Comparisons of MSPE among competing methods using the mammalian eye data set.
| | STEPWISE | FR | LASSO | SIS+LASSO | SC | dgLARS |
| Training set | 0.005 | 0.005 | 0.005 | 0.006 | 0.005 | 0.014 |
| Testing set | 0.011 | 0.012 | 0.010 | 0.009 | 0.014 | 0.020 |
Note: The mean squared prediction error (MSPE) was averaged over 120 splits. LASSO, least absolute shrinkage and selection operator with regularization parameter that gives the smallest BIC; SIS+LASSO, sure independence screening by [5] followed by LASSO; dgLARS, differential geometric least angle regression by [11,12] that gives the smallest BIC; SC( γ ), sequentially conditioning approach by [8]; FR( γ ), forward regression by [7]; STEPWISE, the proposed method. STEPWISE was performed with η 1 = 1 and η 2 = 4 ; FR and SC were performed with γ = 1 .
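The MSPE entries summarize repeated random splits of the data. A minimal sketch of that evaluation loop is below; x and y stand for the gene-expression matrix and the TRIM32 outcome, fit_method() is a placeholder for any of the compared selectors, and the 80/20 split proportion is an assumption rather than a detail taken from the paper.
```r
# Sketch: MSPE averaged over repeated random training/testing splits.
mspe <- replicate(120, {
  train <- sample(nrow(x), floor(0.8 * nrow(x)))   # split proportion assumed
  fit   <- fit_method(x[train, ], y[train])        # hypothetical fitting routine
  pred  <- predict(fit, x[-train, ])
  mean((y[-train] - pred)^2)
})
mean(mspe)   # reported MSPE
```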
Table 6. Clinicopathological characteristics of study participants of the ESCC data set.
| Covariates | Training Set, n_1 (%) | Testing Set, n_2 (%) |
| Esophageal squamous cell carcinoma (ESCC) patients | | |
| Total number of patients | 283 | 283 |
| Age, median (range) | 65 [40, 86] | 67 [37, 90] |
| Gender: Male | 224 (79.0%) | 247 (87.3%) |
| Gender: Female | 59 (21.0%) | 36 (12.7%) |
| Stage 0 | 24 (8.5%) | 27 (9.5%) |
| Stage 1 | 127 (44.9%) | 128 (45.2%) |
| Stage 2 | 58 (20.5%) | 57 (20.1%) |
| Stage 3 | 67 (23.7%) | 61 (21.6%) |
| Stage 4 | 7 (2.4%) | 10 (3.6%) |
| Non-ESCC Controls | | |
| Total number of patients | 283 | 4,682 |
| Age, median (range) | 68 [27, 92] | 67.5 [20, 100] |
| Gender: Male | 131 (46.3%) | 2,086 (44.5%) |
| Gender: Female | 152 (53.7%) | 2,596 (55.5%) |
| Data source: National Cancer Center Biobank (NCCB) | 17 (6.0%) | 306 (6.5%) |
| Data source: National Center for Geriatrics and Gerontology (NCGG) | 158 (55.8%) | 2,512 (53.7%) |
| Data source: Minoru clinic (MC) | 108 (38.2%) | 1,864 (39.8%) |
Table 7. Comparisons of competing methods over 100 independent splits of the ESCC data into training and testing sets.
| Training Set | MSPE | Accuracy | Sensitivity | Specificity | AUC |
| STEPWISE | 0.02 | 0.97 | 0.98 | 0.97 | 1.00 |
| SC | 0.01 | 0.99 | 0.98 | 0.98 | 1.00 |
| FR | 0.02 | 0.99 | 0.97 | 0.97 | 1.00 |
| LASSO | 0.01 | 0.98 | 1.00 | 0.97 | 1.00 |
| SIS+LASSO | 0.01 | 0.99 | 1.00 | 0.99 | 1.00 |
| dgLARS | 0.04 | 0.96 | 0.99 | 0.94 | 1.00 |

| Testing Set | MSPE | Accuracy | Sensitivity | Specificity | AUC |
| STEPWISE | 0.04 | 0.96 | 0.97 | 0.95 | 0.99 |
| SC | 0.03 | 0.96 | 0.97 | 0.96 | 0.99 |
| FR | 0.04 | 0.96 | 0.97 | 0.95 | 0.99 |
| LASSO | 0.03 | 0.96 | 0.99 | 0.95 | 1.00 |
| SIS+LASSO | 0.02 | 0.97 | 0.99 | 0.96 | 1.00 |
| dgLARS | 0.05 | 0.94 | 0.98 | 0.94 | 1.00 |
Note: Values were averaged over 100 splits. STEPWISE was performed with η 1 = 0 and η 2 = 1 . SC and FR were performed with γ = 1 . The regularization parameters in LASSO and dgLARS were selected to minimize the BIC.
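For completeness, a sketch of how the classification metrics in Table 7 could be computed from a fitted model's predicted probabilities on a held-out split; prob and y are hypothetical objects (predicted ESCC probabilities and the 0/1 ESCC indicator), and the 0.5 classification cutoff is an assumption.
```r
# Sketch: metrics of the kind reported in Table 7, using the pROC package for AUC.
library(pROC)
pred <- as.integer(prob > 0.5)          # 0.5 cutoff assumed
accuracy    <- mean(pred == y)
sensitivity <- sum(pred == 1 & y == 1) / sum(y == 1)
specificity <- sum(pred == 0 & y == 0) / sum(y == 0)
auc_value   <- as.numeric(auc(roc(y, prob)))
mspe        <- mean((y - prob)^2)       # squared-error loss on probabilities assumed
```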
