Article

Robust Variable Selection for Single-Index Varying-Coefficient Model with Missing Data in Covariates

College of Science, China University of Petroleum, Qingdao 266580, China
*
Author to whom correspondence should be addressed.
Mathematics 2022, 10(12), 2003; https://doi.org/10.3390/math10122003
Submission received: 4 April 2022 / Revised: 31 May 2022 / Accepted: 1 June 2022 / Published: 10 June 2022

Abstract

As applied sciences grow by leaps and bounds, semiparametric regression analyses have broad applications in various fields, such as engineering, finance, medicine, and public health. The single-index varying-coefficient model is a common class of semiparametric models due to its flexibility and ease of interpretation. Standard single-index varying-coefficient regression models, whether parametric or semiparametric, assume that all covariates can be observed. We relax this assumption by considering models with missing covariates. To eliminate the possibility of bias due to missing data, we propose a probability-weighted objective function. In this paper, we investigate robust variable selection for a single-index varying-coefficient model with missing covariates. We consider both parametric and nonparametric estimates of the probability that an observation has fully observed covariates and examine the resulting estimators. For variable selection, we use a weighted objective function penalized by the non-convex SCAD penalty. The theoretical challenges include the treatment of missing data in a single-index varying-coefficient model that combines a non-smooth loss function with a non-convex penalty function. We provide Monte Carlo simulations to evaluate the performance of our approach.

1. Introduction

Traditional statistical techniques are based on completely observed data. However, in many scientific settings, such as questionnaire surveys, medical research, and psychological studies, respondents are unwilling to provide some of the information that researchers need. In addition, many factors cannot be controlled during the research process, and it is often impossible to obtain all the desired data. When data are missing, traditional statistical techniques cannot be applied directly, and statisticians have considered how to draw valid conclusions from the observed data alone. To date, various methods have been employed to deal with missing data, such as complete-case analysis (CC) (Yates [1]; Healy and Westmacott [2]), imputation, inverse probability weighting (IPW), and likelihood-based methods. The IPW method proposed by Horvitz and Thompson [3] offers a way to handle missing data problems: it uses the inverse of the selection probability as a weight so that the estimate is not distorted by data missing at random. It has earned extensive attention in the field of missing data research; see, for example, Robins et al. [4], Wang et al. [5], Little and Rubin [6], Liang et al. [7], and Tsiatis [8]. However, when the error distribution is heavy-tailed or skewed, the results of the aforementioned methods are not stable because they are based on the least squares (LS) method.
In most regression models, it is critical to choose a proper loss function $\rho(\cdot)$ so that the resulting estimator is robust. Researchers have therefore paid increasing attention to loss functions with higher robustness. The exponential squared loss is defined as $\psi_\eta(t) = 1 - \exp(-t^2/\eta)$, where $\eta$ is a tuning parameter that determines the degree of robustness of the estimator. For large $\eta$, $\psi_\eta(t)$ is approximately equal to $t^2/\eta$, so in this extreme the proposed estimator behaves like the LS estimator. When $\eta$ is small, observations with large absolute residuals $t_i = Y_i - x_i^T\beta$ produce losses $\psi_\eta(t_i)$ close to the maximum value of 1, so their influence on the estimate of $\beta$ is insignificant. Thus, making $\eta$ smaller limits the impact of outliers on the estimator but also reduces the efficiency of the estimator. Like quantile regression (QR), which has become an increasingly popular robust method, regression based on the exponential squared loss is more resistant to the effects of outliers than LS. Such exponential loss functions have been used in classification problems in AdaBoost (Friedman et al. [9]) and in variable selection for regression models (Wang et al. [10]).
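To make the role of the tuning parameter concrete, the following minimal Python sketch (an illustration of ours, not code from the cited works) evaluates the exponential squared loss on a few residuals for a small and a large value of $\eta$.

```python
import numpy as np

def exp_squared_loss(t, eta):
    """Exponential squared loss: psi_eta(t) = 1 - exp(-t^2 / eta)."""
    return 1.0 - np.exp(-t**2 / eta)

residuals = np.array([0.1, 0.5, 1.0, 5.0, 50.0])   # the last two mimic outliers
for eta in (1.0, 100.0):
    # Small eta: large residuals all give losses near 1, so outliers barely
    # move the fit.  Large eta: the loss behaves like t^2 / eta (LS-like).
    print(eta, np.round(exp_squared_loss(residuals, eta), 4))
```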
As applied sciences grow, research on semiparametric models has developed extensively because of their high flexibility and ease of interpretation. The single-index varying-coefficient model (SIVCM) is a common semiparametric model. Its main advantage is that it avoids the curse of dimensionality while retaining the explanatory power of parametric models. Generally, it takes the following form
$$Y = g^T(\beta_0^T X) Z + \varepsilon, \qquad (1)$$
where $Y$ is the dependent variable, $(X, Z)$ are the covariates with $(X, Z) \in \mathbb{R}^p \times \mathbb{R}^q$, and $g(\cdot)$ and $\beta_0$ denote the vector of unknown coefficient functions and the vector of unknown parameters, of dimensions $q \times 1$ and $p \times 1$, respectively. The disturbance term $\varepsilon$ has zero mean and finite variance $\sigma^2$ and is independent of $(X, Z)$. Furthermore, we assume that the Euclidean norm of $\beta_0$ equals 1 and that its first component is positive. Moreover, to avoid non-identifiability of the index direction $\beta_0$, $g(x)$ cannot take the form $g(x) = \alpha^T x\,\beta_0^T x + \gamma^T x + c$, where $\alpha \in \mathbb{R}^p$, $\gamma \in \mathbb{R}^p$, and $c \in \mathbb{R}$ are constants and $\alpha$, $\gamma$, and $\beta_0$ are not parallel to each other (Feng and Xue [11]; Xue and Pang [12]).
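As a concrete illustration of Model (1), the following Python sketch generates data from a SIVCM with hypothetical coefficient functions; the chosen $g_k$, $\beta_0$, and sample size are placeholders of ours, not the settings used later in the simulations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
beta0 = np.array([1/3, 2/3, 2/3])                        # ||beta0|| = 1, first entry positive
X = rng.normal(size=(n, p))
Z = np.column_stack([np.ones(n), rng.normal(size=n)])    # q = 2 coefficient curves

def g(u):
    """Illustrative coefficient functions g_1(u), g_2(u) (placeholders)."""
    return np.column_stack([np.sin(np.pi * u), 1.0 + 0.5 * u**2])

index = X @ beta0                                        # single index beta0^T X_i
Y = np.sum(g(index) * Z, axis=1) + rng.normal(scale=0.5, size=n)
```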
Model (1) is flexible enough to cover a class of important statistical models. It reduces to the standard single-index model (SIM) when $Z = 1$ and $q = 1$; for related literature, see Hardle et al. [13] and Wu et al. [14]. When $\beta_0 = 1$ and $p = 1$, it simplifies to the varying-coefficient model (VCM) proposed by Hastie and Tibshirani [15] and Fan and Zhang [16]. Consequently, it is easy to interpret and has broad applications in practice. In particular, Xia and Li [17] first studied Model (1) using kernel smoothing together with the LS method. An empirical likelihood ratio method was proposed by Xue and Wang [18]. Based on estimating equations, an estimate of the parametric component was constructed by Xue and Pang [12]. Using basis function approximation, Feng and Xue [11] investigated Model (1).
Variable selection is of great importance in statistical modeling: ignoring significant variables can cause seriously biased results, whereas including spurious variables leads to a substantial loss of estimation efficiency. Many penalty functions are in common use, such as the least absolute shrinkage and selection operator (LASSO, Tibshirani [19]), the bridge penalty, the smoothly clipped absolute deviation (SCAD, Fan and Li [20]), and the adaptive lasso (Zou [21]). In particular, a nonconcave penalized least-squares method based on the SCAD penalty in SIM was proposed by Peng and Huang [22]; Yang and Yang [23] adopted the SCAD penalty to achieve efficient estimation and variable selection simultaneously in partially linear single-index models (PLSIM); and Wang and Kulasekera [24] studied the partially linear varying-coefficient model (PLVCM) based on the adaptive lasso.
SIVCM is a common semiparametric model, and variable selection in semiparametric models involves two parts: selecting the model for the nonparametric part and selecting the significant variables in the parametric part. Classical variable selection procedures involve stepwise regression and best-subset selection; however, the nonparametric parts of each submodel need to be fitted separately, leading to a high computational cost. Variable selection in SIVCM is particularly challenging because the model has a complex multivariate nonlinear structure that includes both a nonparametric function vector $g(\cdot)$ and an unknown parameter vector $\beta$. Based on basis function approximation and the SCAD penalty, Feng and Xue [11] developed a penalized method for SIVCM that selects significant variables in both the parametric and nonparametric components. It should be noted that existing research adopts the LS or likelihood method and assumes that the error follows a normal distribution; when the error is heavy-tailed, the method is sensitive to outliers and becomes inefficient, and it is not robust to outliers in the dependent variable because of the least squares criterion. Yang and Yang [25] proposed an efficient iterative procedure for SIVCM based on quantile regression, and their results indicate that the resulting estimator is robust to both outliers and heavy-tailed errors. However, all existing work on SIVCM assumes that all variables are fully observed; a robust variable selection approach for SIVCM with missing covariates has not yet been studied.
The following are the innovations of this paper:
  • For the case of missing covariates, we propose a robust variable selection approach based on exponential squared loss and adopt the IPW method to eliminate the latent bias due to the missing values in covariates.
  • We consider both parametric and nonparametric methods for estimating the selection probability model and propose a weighted penalized objective function for variable selection.
  • We also examine how to select the tuning parameter $\eta_n$ of the exponential squared loss function so that the corresponding estimator is robust.
The rest of this article is organized as follows. Section 2 proposes an efficient iterative procedure for SIVCM using the exponential squared loss, with the SCAD penalty applied to select both the important parametric variables and the nonparametric components; we also discuss implementation issues, including bandwidth selection and the choice of tuning parameters. Section 3 conducts several Monte Carlo experiments with different error distributions to show the finite-sample performance of the proposed method. Section 4 concludes the paper briefly.

2. Methodology

Using the exponential squared loss, the basis function approximation, and the SCAD penalty function, we propose a robust variable selection procedure for SIVCM with missing covariates. First, the unknown coefficient functions are approximated by B-spline functions. Next, under the constraint $\|\beta\| = 1$, we use the 'delete-one-component' approach of Yu and Ruppert [26] to establish the penalized exponential-squared-loss objective function.

2.1. Basis Function Expansion

Consider that $\{(X_i, Z_i, Y_i), 1 \le i \le n\}$ is a sample from Model (1), i.e.,
$$Y_i = g^T(\beta^T X_i) Z_i + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (2)$$
where $X_i = (X_{i1}, \ldots, X_{ip})^T$ and $Z_i = (Z_{i1}, \ldots, Z_{iq})^T$ are $p$-dimensional and $q$-dimensional covariate vectors, respectively. The disturbance term $\varepsilon_i$ is an unobserved random variable with zero mean and finite variance $\sigma^2$, and we assume that $\{\varepsilon_i, 1 \le i \le n\}$ are independent of $\{(X_i, Z_i), 1 \le i \le n\}$.
To handle the unknown $g(\cdot)$, following He et al. [27], we replace it by its basis function approximation. More specifically, we construct a B-spline basis of order $M+1$, $B(u) = (B_1(u), \ldots, B_L(u))^T$, where $L = K + M + 1$ and $K$ is the number of interior knots. We can then approximate $g_k(u)$ as
$$g_k(u) \approx B^T(u)\gamma_k, \qquad k = 1, \ldots, q, \qquad (3)$$
where $\gamma_k$ is the vector of spline coefficients. If all the data $\{(X_i, Z_i, Y_i), 1 \le i \le n\}$ could be collected, the robust estimation procedure would maximize
$$\ell(\beta, g(\cdot)) = \sum_{i=1}^{n} \exp\{-(Y_i - g^T(\beta^T X_i) Z_i)^2 / \eta_n\}, \qquad (4)$$
where $\eta_n > 0$ is a tuning parameter. We use the exponential squared loss in (4) to prevent outliers from affecting the estimate. However, (4) cannot be optimized directly when $g(\cdot)$ is unknown. Replacing the unknown $g(\cdot)$ in (4) by its basis function approximation, we obtain
$$\ell_n(\beta, \gamma) = \sum_{i=1}^{n} \exp\{-(Y_i - W_i^T(\beta)\gamma)^2 / \eta_n\}, \qquad (5)$$
where $\gamma = (\gamma_1^T, \ldots, \gamma_q^T)^T$ and $W_i(\beta) = \{I_q \otimes B(\beta^T X_i)\} Z_i$.
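The basis expansion (3) and the design vectors $W_i(\beta)$ can be formed as in the following sketch; the helper names are ours, and scipy's `BSpline` is used with unit coefficient vectors to evaluate each basis function.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(u, knots, degree):
    """Evaluate the B-spline basis B_1(u), ..., B_L(u) at the points u (an n x L matrix)."""
    L = len(knots) - degree - 1
    return np.column_stack(
        [BSpline(knots, np.eye(L)[:, j], degree)(u) for j in range(L)]
    )

def design_W(beta, X, Z, knots, degree):
    """Row i is W_i(beta)^T = (Z_i1 B(beta^T X_i)^T, ..., Z_iq B(beta^T X_i)^T)."""
    B = bspline_basis(X @ beta, knots, degree)               # n x L
    return np.hstack([Z[:, [k]] * B for k in range(Z.shape[1])])   # n x (q*L)
```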
We first handle the constraints $\|\beta\| = 1$ and $\beta_1 > 0$ on the $p$-dimensional single-index parameter vector $\beta$ by reparametrization. Denote $\phi = (\beta_2, \ldots, \beta_p)^T$ and define
$$\beta = \beta(\phi) = \big(\sqrt{1 - \|\phi\|^2},\ \phi^T\big)^T. \qquad (6)$$
The true parameter $\phi_0$ must satisfy $\|\phi_0\| < 1$, which is an inequality constraint, so $\beta(\phi)$ is infinitely differentiable in a neighborhood of $\phi_0$. The Jacobian matrix of $\beta$ with respect to $\phi$ is
$$J_\phi = \frac{\partial \beta}{\partial \phi} = \begin{pmatrix} -(1 - \|\phi\|^2)^{-1/2}\,\phi^T \\ I_{p-1} \end{pmatrix},$$
where $I_{p-1}$ is the identity matrix of order $p-1$. As we can see, $\phi$ has one dimension fewer than $\beta$, and the robust regression with the exponential squared loss is converted to
$$\ell_n(\phi, \gamma) = \sum_{i=1}^{n} \exp\{-(Y_i - W_i^T(\phi)\gamma)^2 / \eta_n\}, \qquad (7)$$
where $W_i(\phi) = W_i(\beta)$. By maximizing (7), we obtain $\hat\phi$ and $\hat\gamma = (\hat\gamma_1^T, \ldots, \hat\gamma_q^T)^T$. Then, through (3) and (6), the robust regression estimator of $\beta$ based on the exponential squared loss is
$$\hat\beta = \big(\sqrt{1 - \|\hat\phi\|^2},\ \hat\phi^T\big)^T,$$
and the estimator of $g_k(u)$ can be obtained by
$$\hat g_k(u) = B^T(u)\hat\gamma_k.$$
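A small sketch of the 'delete-one-component' reparametrization (6) and its Jacobian (as reconstructed above) is given below; the function names are ours.

```python
import numpy as np

def beta_from_phi(phi):
    """Map phi in R^{p-1} with ||phi|| < 1 to beta on the unit sphere with beta_1 > 0."""
    return np.concatenate(([np.sqrt(1.0 - phi @ phi)], phi))

def jacobian_beta(phi):
    """Jacobian d(beta)/d(phi), a p x (p-1) matrix."""
    top = -phi / np.sqrt(1.0 - phi @ phi)            # gradient of the first component
    return np.vstack([top, np.eye(len(phi))])

phi = np.array([0.6, 0.3])
print(np.linalg.norm(beta_from_phi(phi)))            # 1.0 up to rounding
```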

2.2. Robust Estimation Based on Inverse Probability Weighting

We now consider the case where a subset of the covariates has missing values when estimating (5). Let $l_i \in \mathbb{R}^{p+q-k}$ be the vector of covariates that are always observed and $m_i \in \mathbb{R}^{k}$ be the vector of covariates from $X_i$ or $Z_i$ that may be partly missing. Define the vector of variables that are always observed as $t_i = (Y_i, l_i^T)^T \in \mathbb{R}^{s}$, where $s = p + q - k + 1$. For each observation, an indicator variable $R_i$ records whether $m_i$ is completely observed:
$$R_i = \begin{cases} 1, & \text{if } m_i \text{ is observed}, \\ 0, & \text{otherwise}. \end{cases}$$
The missing mechanism we proposed satisfies:
$$P(R_i = 1 \mid Y_i, X_i, Z_i) = P(R_i = 1 \mid Y_i, m_i, t_i) = P(R_i = 1 \mid t_i) \equiv \pi(t_i) \equiv \pi_i. \qquad (10)$$
Under this missing mechanism, conditional on $t_i$, the event that $m_i$ is missing is unrelated to $(Y_i, m_i^T)$. Although the response is fully observed, the selection probability $\pi(\cdot)$ in (10) depends only on the always-observed vector $t_i$ rather than on the observed response. Therefore, this missing mechanism differs from the usual missing at random (MAR) mechanism. We require this missing mechanism in order to carry out the theoretical analysis.
When faced with missing covariates, a naive approach to estimating (5) is to use only the observations with complete data to fit the model. The naive estimator is
$$(\hat\phi^N, \hat\gamma^N) = \arg\max_{\phi, \gamma} \sum_{i=1}^{n} R_i \exp\{-(Y_i - W_i^T(\phi)\gamma)^2 / \eta_n\},$$
in which all observations with missing data are dropped when we estimate the model. When the data are not missing completely at random, this estimator will be asymptotically biased.
An objective function based on inverse probability weighting (IPW) is proposed to reduce the potential bias caused by missing data. In the IPW method, the $i$th data point is weighted by $R_i / \pi_{i0}$. The difference between IPW and the naive method is that IPW assigns different weights to records with fully observed data. The idea behind the weighting is that, for every fully observed data point with probability $\pi_{i0}$ of being fully observed, $1/\pi_{i0}$ data points with the same covariates would be expected if there were no missing data.
The weight $1/\pi_{i0}$ is usually unknown and needs to be estimated. We first consider estimating the weights using a parametric model, whose general form is assumed to be
$$\pi_{i0} = \pi_i(t_i, \eta_0).$$
Taking the logistic model as an example,
$$\pi_i(t_i, \eta_0) = \frac{\exp\{(1, t_i^T)\eta_0\}}{1 + \exp\{(1, t_i^T)\eta_0\}}.$$
In practice, $\pi_i(t_i, \eta_0)$ is replaced by $\pi_i(t_i, \hat\eta) \equiv \pi_i(\hat\eta)$, where $\hat\eta$ is estimated by fitting the parametric model for $P(R_i = 1 \mid t_i)$.
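Assuming the logistic specification above, $\hat\eta$ can be obtained by maximum likelihood; the following sketch fits it with a few Newton-Raphson (IRLS) steps in plain NumPy and is only meant to show the idea.

```python
import numpy as np

def fit_logistic(T, R, n_iter=25):
    """MLE of eta in P(R=1 | t) = expit((1, t^T) eta) via Newton-Raphson."""
    D = np.column_stack([np.ones(len(R)), T])        # prepend an intercept column
    eta = np.zeros(D.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-D @ eta))
        W = prob * (1.0 - prob)                      # IRLS weights
        grad = D.T @ (R - prob)
        hess = D.T @ (D * W[:, None])
        eta = eta + np.linalg.solve(hess, grad)
    return eta

# pi_i(eta_hat) = expit((1, t_i^T) eta_hat) then serves as the IPW weight denominator.
```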
Throughout the paper, $\pi_i(\hat\eta)$ denotes the parametric estimate, $\hat\pi_i$ denotes a general estimate that could be parametric or nonparametric, and $\pi_{i0}$ denotes the true probability that observation $i$ is fully observed. Our parametric robust regression estimator is defined as
$$(\hat\phi^L, \hat\gamma^L) = \arg\max_{\phi, \gamma} \sum_{i=1}^{n} \frac{R_i}{\pi_i(\hat\eta)} \exp\{-(Y_i - W_i^T(\phi)\gamma)^2 / \eta_n\}.$$
According to the above, through (3) and (6) and using the exponential squared loss, β can be robustly estimated by
$$\hat\beta^L = \big(\sqrt{1 - \|\hat\phi^L\|^2},\ (\hat\phi^L)^T\big)^T.$$
Then, the estimator of g k ( u ) can be written as
$$\hat g_k^L(u) = B^T(u)\hat\gamma_k^L.$$
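A minimal sketch of the weighted estimator $(\hat\phi^L, \hat\gamma^L)$ is shown below; it reuses the hypothetical helpers `beta_from_phi` and `design_W` sketched earlier and calls a generic numerical optimizer instead of the iterative algorithm described in Section 2.4.

```python
import numpy as np
from scipy.optimize import minimize

def ipw_esl_objective(theta, Y, X, Z, R, pi_hat, knots, degree, eta_n, p):
    """Negative IPW exponential-squared-loss objective in (phi, gamma)."""
    phi, gamma = theta[:p - 1], theta[p - 1:]
    W = design_W(beta_from_phi(phi), X, Z, knots, degree)
    resid = Y - W @ gamma
    return -np.sum((R / pi_hat) * np.exp(-resid**2 / eta_n))

# Example call (theta0 is an initial value stacking phi and gamma):
# fit = minimize(ipw_esl_objective, theta0,
#                args=(Y, X, Z, R, pi_hat, knots, degree, eta_n, p),
#                method="Nelder-Mead")
```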

2.3. The Penalized Robust Regression Estimator

Here we consider the variable selection problem when Model (2) has missing covariates. To improve the accuracy and interpretability of the fitted model and to ensure its identifiability, the true regression coefficient vector $\beta^*$ is generally assumed to be sparse, with only a small fraction of nonzero components (Fan and Li [20]; Tibshirani [19]).
To recover the true model and estimate $\beta^*$ and $g(\cdot)$, we use the following penalized robust regression with the exponential squared loss:
$$\ell(\beta, g(\cdot)) = \sum_{i=1}^{n} \frac{R_i}{\pi_i(\hat\eta)} \exp\{-(Y_i - g^T(\beta^T X_i) Z_i)^2 / \eta_n\} - n \sum_{k=1}^{q} p_{\lambda_{1k}}(\|g_k(\cdot)\|) - n \sum_{l=1}^{p} p_{\lambda_{2l}}(|\beta_l|), \qquad (15)$$
where
$$\|g_k(\cdot)\| = \Big(\int g_k^2(u)\, du\Big)^{1/2}.$$
The penalty function $p_\lambda(\cdot)$ is defined on $[0, \infty)$ and the regularization parameter $\lambda$ is non-negative. We emphasize that the tuning parameters $\lambda_{1k}$ and $\lambda_{2l}$ need not be the same for all $g_k(\cdot)$ and $\beta_l$. As in (5), the exponential squared loss in (15) is used to prevent outliers from affecting the estimation. Directly optimizing (15) is impractical when $g(\cdot)$ is unknown, so the unknown function $g(\cdot)$ in (15) is replaced by its basis function approximation, which yields
$$\ell_n(\beta, \gamma) = \sum_{i=1}^{n} \frac{R_i}{\pi_i(\eta)} \exp\{-(Y_i - W_i^T(\beta)\gamma)^2 / \eta_n\} - n \sum_{k=1}^{q} p_{\lambda_{1k}}(\|\gamma_k\|_H) - n \sum_{l=1}^{p} p_{\lambda_{2l}}(|\beta_l|), \qquad (16)$$
where $\|\gamma_k\|_H = (\gamma_k^T H \gamma_k)^{1/2}$ and $H = \int B(u) B^T(u)\, du$.
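The matrix $H$ and the norm $\|\gamma_k\|_H$ can be computed numerically as in the following sketch (simple trapezoidal quadrature over the index range; `bspline_basis` is the hypothetical helper from Section 2.1).

```python
import numpy as np

def spline_gram(knots, degree, n_grid=400):
    """Numerical approximation of H = int B(u) B(u)^T du over the base interval."""
    u = np.linspace(knots[degree], knots[-degree - 1], n_grid)
    B = bspline_basis(u, knots, degree)              # n_grid x L
    return np.trapz(B[:, :, None] * B[:, None, :], u, axis=0)

def h_norm(gamma_k, H):
    """||gamma_k||_H = (gamma_k^T H gamma_k)^{1/2}."""
    return float(np.sqrt(gamma_k @ H @ gamma_k))
```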
When the parametric estimate $\pi_i(\hat\eta)$ of $\pi_i$ is used, the parametric penalized robust regression with the exponential squared loss becomes
$$\ell_n(\phi, \gamma) = \sum_{i=1}^{n} \frac{R_i}{\pi_i(\hat\eta)} \exp\{-(Y_i - W_i^T(\phi)\gamma)^2 / \eta_n\} - n \sum_{k=1}^{q} p_{\lambda_{1k}}(\|\gamma_k\|_H) - n \sum_{l=1}^{p-1} p_{\lambda_{2l}}(|\phi_l|), \qquad (17)$$
where $W_i(\phi) = W_i(\beta)$. By maximizing (17), we obtain $\hat\phi^P$ and $\hat\gamma^P = ((\hat\gamma_1^P)^T, \ldots, (\hat\gamma_q^P)^T)^T$. Then, through (3) and (6), the penalized robust regression estimator of $\beta$ based on the exponential squared loss is
$$\hat\beta^P = \big(\sqrt{1 - \|\hat\phi^P\|^2},\ (\hat\phi^P)^T\big)^T,$$
and the estimator of g k ( u ) can be obtained by
$$\hat g_k^P(u) = B^T(u)\hat\gamma_k^P.$$

2.4. Algorithm

To facilitate the computation, we replace the loss function by a quadratic approximation. Let
$$\ell^*(\phi, \gamma) = \sum_{i=1}^{n} \frac{R_i}{\pi_i(\hat\eta)} \exp\{-(Y_i - W_i^T(\phi)\gamma)^2 / \eta_n\}.$$
Given an initial estimator $(\tilde\phi, \tilde\gamma)$, the loss function can be approximated as
$$\ell^*(\phi, \gamma) \approx \ell^*(\tilde\phi, \tilde\gamma) + \frac{1}{2}\{(\phi, \gamma) - (\tilde\phi, \tilde\gamma)\}^T \nabla^2 \ell^*(\tilde\phi, \tilde\gamma) \{(\phi, \gamma) - (\tilde\phi, \tilde\gamma)\}.$$
The SCAD penalty function is singular at the origin, which makes it difficult to implement the Newton-Raphson algorithm directly. We therefore develop an iterative algorithm based on the local quadratic approximation of the penalty function $p_\lambda(\cdot)$, as in Fan and Li [20]. More specifically, in a neighborhood of a given nonzero $\omega_0$, the penalty function can be approximated by
$$p_\lambda(|\omega|) \approx p_\lambda(|\omega_0|) + \frac{1}{2}\, \frac{\dot p_\lambda(|\omega_0|)}{|\omega_0|}\, (\omega^2 - \omega_0^2).$$
Hence, for given initial values $\phi_{l0}$ with $|\phi_{l0}| > 0$, $l = 1, \ldots, p-1$, and $\gamma_{k0}$ with $\|\gamma_{k0}\|_H > 0$, $k = 1, \ldots, q$, we have
$$p_{\lambda_{1k}}(\|\gamma_k\|_H) \approx p_{\lambda_{1k}}(\|\gamma_{k0}\|_H) + \frac{1}{2}\, \frac{\dot p_{\lambda_{1k}}(\|\gamma_{k0}\|_H)}{\|\gamma_{k0}\|_H}\, (\|\gamma_k\|_H^2 - \|\gamma_{k0}\|_H^2),$$
$$p_{\lambda_{2l}}(|\phi_l|) \approx p_{\lambda_{2l}}(|\phi_{l0}|) + \frac{1}{2}\, \frac{\dot p_{\lambda_{2l}}(|\phi_{l0}|)}{|\phi_{l0}|}\, (|\phi_l|^2 - |\phi_{l0}|^2).$$
Let
$$\Sigma(\phi, \gamma) = \mathrm{diag}\left\{\frac{\dot p_{\lambda_{21}}(|\phi_1|)}{|\phi_1|}, \ldots, \frac{\dot p_{\lambda_{2,p-1}}(|\phi_{p-1}|)}{|\phi_{p-1}|},\ \frac{\dot p_{\lambda_{11}}(\|\gamma_1\|_H)}{\|\gamma_1\|_H} H, \ldots, \frac{\dot p_{\lambda_{1q}}(\|\gamma_q\|_H)}{\|\gamma_q\|_H} H\right\}.$$
Then, up to a constant term, we maximize
$$\ell(\phi, \gamma) = \frac{1}{2}\{(\phi, \gamma) - (\tilde\phi, \tilde\gamma)\}^T \nabla^2 \ell^*(\tilde\phi, \tilde\gamma) \{(\phi, \gamma) - (\tilde\phi, \tilde\gamma)\} - \frac{n}{2}\, (\phi^T, \gamma^T)\, \Sigma(\phi, \gamma)\, (\phi^T, \gamma^T)^T$$
with respect to $\phi$ and $\gamma$, which gives an approximate solution of (17). Estimates $\hat\beta$ and $\hat g_k(u)$ of $\beta$ and $g_k(u)$ are then obtained from (3) and (6), respectively.
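To illustrate the local quadratic approximation, the following sketch computes the SCAD derivative $\dot p_\lambda(\cdot)$ and assembles the block-diagonal matrix $\Sigma$; using a single scalar $\lambda$ for each group of terms is our simplification, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import block_diag

def scad_deriv(w, lam, a=3.7):
    """First derivative of the SCAD penalty at w >= 0 (Fan and Li [20])."""
    w = np.asarray(w, dtype=float)
    return lam * ((w <= lam) + np.maximum(a * lam - w, 0.0) / ((a - 1) * lam) * (w > lam))

def lqa_sigma(phi, gamma_blocks, H, lam1, lam2, eps=1e-8):
    """Block-diagonal Sigma with entries p'(|phi_l|)/|phi_l| and p'(||gamma_k||_H)/||gamma_k||_H * H."""
    blocks = [np.atleast_2d(scad_deriv(abs(ph), lam2) / max(abs(ph), eps)) for ph in phi]
    for g_k in gamma_blocks:
        nrm = max(float(np.sqrt(g_k @ H @ g_k)), eps)
        blocks.append(scad_deriv(nrm, lam1) / nrm * H)
    return block_diag(*blocks)
```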
To implement the above method, we must choose the number of interior knots $K$ appropriately and tune the parameters $a$, $\lambda_1$, $\lambda_2$, and $\eta_n$ in the penalty function. Fan and Li [20] showed that the choice $a = 3.7$ performs well in a variety of situations, and we follow their setup in this article.

2.5. The Choice of the Regularization Parameters $\lambda_1$ and $\lambda_2$

The tuning parameters can be chosen by a method similar to cross-validation. However, our penalty function contains many tuning parameters, and minimizing the cross-validation score over such a high-dimensional space is difficult. To overcome this difficulty, similar to Zhao and Xue [28], we take the tuning parameters as
$$\lambda_{1k} = \frac{\lambda}{\|\hat\gamma_k^u\|_H}, \qquad \lambda_{2l} = \frac{\lambda}{\|\hat\phi_l^u\|}, \qquad (21)$$
where $\hat\gamma_k^u$ and $\hat\phi_l^u$ are the unpenalized estimators of $\gamma_k$ and $\phi_l$, respectively. Then, we can choose $\lambda$ and $K$ by minimizing the following cross-validation score:
$$CV(K, \lambda) = \sum_{i=1}^{n} \{Y_i - W_i^T(\hat\phi_{[i]})\hat\gamma_{[i]}\}^2, \qquad (22)$$
where $\hat\phi_{[i]}$ and $\hat\gamma_{[i]}$ are the solutions based on (17) after deleting the $i$th subject.
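The cross-validation score (22) can be computed as in the sketch below, where `fit_penalized` stands for a hypothetical routine that maximizes (17) on the retained observations.

```python
import numpy as np

def cv_score(Y, X, Z, R, pi_hat, knots, degree, eta_n, lam, fit_penalized):
    """Leave-one-out CV score for a given number of interior knots and lambda."""
    score = 0.0
    for i in range(len(Y)):
        keep = np.arange(len(Y)) != i
        phi_i, gamma_i = fit_penalized(Y[keep], X[keep], Z[keep], R[keep],
                                       pi_hat[keep], knots, degree, eta_n, lam)
        W_i = design_W(beta_from_phi(phi_i), X[[i]], Z[[i]], knots, degree)
        score += (Y[i] - (W_i @ gamma_i)[0]) ** 2
    return score
```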

2.6. The Choice of the Regularization Parameter η n

The tuning parameter $\eta_n$ plays a decisive role in the robustness and efficiency of the proposed robust regression estimators. We propose a data-driven procedure for choosing an appropriate $\eta_n$ that yields both high efficiency and high robustness. We first determine the set of tuning parameters for which the proposed penalized robust estimators have an asymptotic breakdown point of 1/2, and then select the tuning parameter within this set by maximizing efficiency.
The specific steps of the procedure are as follows (a code sketch of Steps 1 and 2 is given after Step 5):
Step 1. Find the pseudo outlier set of the sample as in Wang et al. [10]. Let $D_i = (X_i, Z_i, Y_i)$ and $D_n = (D_1, \ldots, D_n)$. Compute $r_i(\hat\phi_n, \hat\gamma_n) = Y_i - W_i^T(\hat\phi_n)\hat\gamma_n$, $i = 1, \ldots, n$, and $S_n = 1.486 \times \mathrm{median}_i\, |r_i(\hat\phi_n, \hat\gamma_n) - \mathrm{median}_j(r_j(\hat\phi_n, \hat\gamma_n))|$. Then take the pseudo outlier set $D_m = \{(X_i, Z_i, Y_i) : |r_i(\hat\phi_n, \hat\gamma_n)| > 2.5 S_n\}$, the index set $m = \{1 \le i \le n : |r_i(\hat\phi_n, \hat\gamma_n)| > 2.5 S_n\}$, and $D_{n-m} = D_n \setminus D_m$.
Step 2. Update the tuning parameter $\eta_n$. Suppose there are $m$ bad points and $n - m$ good points in $D_n$. Denote the bad points by $D_m = (D_1, \ldots, D_m)$ and the good points by $D_{n-m} = (D_{m+1}, \ldots, D_n)$.
The proportion of bad points in $D_n$ is $m/n$. First, compute the initial estimators $\tilde\phi_n$ and $\tilde\gamma_n$. For a contaminated sample $D_n$, let
$$\xi(\eta) = \frac{2m}{n} + \frac{2}{n} \sum_{i=m+1}^{n} \psi_\eta\{r_i(\tilde\phi_n, \tilde\gamma_n)\}, \qquad (23)$$
where $r_i(\phi, \gamma) = Y_i - W_i^T(\phi)\gamma$. Let $\eta_n$ be the minimizer of $\det(\hat V(\eta))$ over the set $G = \{\eta : \xi(\eta) \in (0, 1]\}$, where $\det(\cdot)$ denotes the determinant,
$$\hat V(\eta) = \{\hat I_1(\hat\phi_n, \hat\gamma_n)\}^{-1}\, \tilde\Sigma_2\, \{\hat I_1(\hat\phi_n, \hat\gamma_n)\}^{-1},$$
and
$$\hat I_1(\hat\phi_n, \hat\gamma_n) = \frac{2}{\eta}\left[\frac{1}{n}\sum_{i=1}^{n} \exp\big(-r_i^2(\hat\phi_n, \hat\gamma_n)/\eta\big)\left\{\frac{2 r_i^2(\hat\phi_n, \hat\gamma_n)}{\eta} - 1\right\}\right] \times \frac{1}{n}\sum_{i=1}^{n} W_i W_i^T,$$
$$\tilde\Sigma_2 = \mathrm{cov}\left\{\exp\big(-r_1^2(\hat\phi_n, \hat\gamma_n)/\eta\big)\, \frac{2 r_1(\hat\phi_n, \hat\gamma_n)}{\eta}\, W_1,\ \ldots,\ \exp\big(-r_n^2(\hat\phi_n, \hat\gamma_n)/\eta\big)\, \frac{2 r_n(\hat\phi_n, \hat\gamma_n)}{\eta}\, W_n\right\}.$$
Step 3. Compute $\lambda$ from (22) and then obtain $\lambda_1$ and $\lambda_2$ from (21). With $\lambda_1$ and $\lambda_2$ fixed and $\eta_n$ selected in Step 2, update $\hat\phi_n$ and $\hat\gamma_n$ by maximizing (17).
Step 4. Following Xue and Pang [12], we take $\tilde\phi$ and $\tilde\gamma$ as the initial estimates, i.e., $\hat\phi = \tilde\phi$ and $\hat\gamma = \tilde\gamma$. We then repeat Steps 1-3 until $\hat\phi$, $\hat\gamma$, and $\eta_n$ converge.
Step 5. Using (3) and (6), we obtain the penalized robust regression estimator $\hat\beta$ of $\beta$ and the estimator $\hat g_k(u)$ of $g_k(u)$.
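The following minimal sketch illustrates Steps 1 and 2, with $\xi(\eta)$ taken in the form reconstructed in (23); the sandwich matrix $\hat V(\eta)$ is left as a user-supplied callable because its ingredients are given explicitly in the displays above.

```python
import numpy as np

def pseudo_outliers(resid):
    """Step 1: flag pseudo outliers by the scaled-MAD rule."""
    s_n = 1.486 * np.median(np.abs(resid - np.median(resid)))
    return np.abs(resid) > 2.5 * s_n, s_n

def select_eta(resid, eta_grid, det_v_hat):
    """Step 2: pick eta minimizing det(V_hat(eta)) over G = {eta : xi(eta) in (0, 1]}."""
    bad, _ = pseudo_outliers(resid)
    m, n = int(bad.sum()), len(resid)
    psi = lambda t, eta: 1.0 - np.exp(-t**2 / eta)
    best_eta, best_val = None, np.inf
    for eta in eta_grid:
        xi = 2.0 * m / n + (2.0 / n) * psi(resid[~bad], eta).sum()
        if 0.0 < xi <= 1.0:
            val = det_v_hat(eta)                    # det of the sandwich matrix above
            if val < best_val:
                best_eta, best_val = eta, val
    return best_eta
```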

3. Simulation

Here we compare the finite-sample performance of the proposed estimation and variable selection method with that of Yang and Yang [25] (QR), Xue and Wang [18] (EL), and Xue and Pang [12] (EE) via Monte Carlo simulations. Since Xue and Wang [18] (EL) and Xue and Pang [12] (EE) do not address the selection of significant variables, we add an adaptive penalty term to their objective functions so that significant variables can be selected.
Following Yang and Yang [25], we choose the Gaussian kernel function in the simulations for the quantile regression method with $\tau = 0.5$. The performance of the estimators is evaluated by the following three criteria: (1) the average absolute deviations (AAD) of the estimated coefficients and the corresponding standard deviations (SD); (2) the mean absolute deviation (MAD) of $\hat\beta$, computed as $MAD(\hat\beta) = E(\|\hat\beta - \beta_0\|_1)$, where $\|\cdot\|_p$ denotes the $p$-norm; and (3) the square root of the average squared error (RASE) as a measure of the performance of the estimator $\hat g_k(\cdot)$, calculated as follows:
$$RASE_k = \left\{\frac{1}{n_{grid}} \sum_{i=1}^{n_{grid}} \big(\hat g_k(u_i) - g_k(u_i)\big)^2\right\}^{1/2}$$
for $k = 1, \ldots, q$, where $\{u_i, i = 1, 2, \ldots, n_{grid}\}$ are the grid points at which the function $g_k(\cdot)$ is assessed.
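For reference, RASE can be computed as in the following short sketch, where `g_hat` and `g_true` are callables for the estimated and true coefficient functions.

```python
import numpy as np

def rase(g_hat, g_true, u_grid):
    """Square root of the average squared error over the grid points."""
    return float(np.sqrt(np.mean((g_hat(u_grid) - g_true(u_grid)) ** 2)))
```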
Additionally, to demonstrate the effectiveness of the variable selection procedure, we report the average number of true zero coefficients correctly identified as zero (NC), the average number of true non-zero coefficients mistakenly identified as zero (NIC), and the probability of correctly selecting the true model (PC). The tuning parameter $\eta$ is chosen separately for each simulated sample.
Example 1. In this example, we focus on the performance of the proposed estimation procedure and consider the following SIVCM:
$$Y = g_1(X^T\beta_0) + g_2(X^T\beta_0) Z_1 + g_3(X^T\beta_0) Z_2 + \varepsilon, \qquad (24)$$
where $\beta_0 = (1/3, 2/3, 2/3)^T$, and $X = (X_1, X_2, X_3)^T$ and $Z = (Z_1, Z_2)^T$ are jointly normally distributed with mean 0, variance 1, and correlation $0.5^{|i-j|}$; $g_1(u) = 2\cos(\pi u)$, $g_2(u) = 1 + u^2/2$, and $g_3(u) = \exp(u)$. The error $\varepsilon$ is independent of $X_1, X_2, X_3, Z_1, Z_2$; $X_1$ may have missing values. The selection probability function is given by:
$$\pi_1(X_2, X_3, Z_1, Z_2) = \{1 + \exp(-(\gamma_0 + \gamma_1 X_2 + \gamma_2 X_3 + \gamma_3 Z_1 + \gamma_4 Z_2))\}^{-1}.$$
We consider $\pi_1$ with $(\gamma_0, \gamma_1, \gamma_2, \gamma_3, \gamma_4) = (1, 0.2, 0.2, 0.4, 0.5)$; the corresponding average missing rate is 25%. In our simulation, three different distributions of the model error $\varepsilon$ are considered (a sketch of the missingness generation is given after the list):
Case 1: the standard normal distribution N(0, 1).
Case 2: the centralized t-distribution with three degrees of freedom, t(3), which generates a heavy-tailed error distribution.
Case 3: the mixture of normals 0.9N(0, 1) + 0.1N(0, 100), denoted MN(1, 100), which is used to produce outliers.
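The missingness indicators in Example 1 can be generated as in the following sketch; the seed and sample size are arbitrary, and the empirical missing rate comes out close to the 25% quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
# (X1, X2, X3, Z1, Z2) jointly normal with mean 0, variance 1, correlation 0.5^{|i-j|}
idx = np.arange(5)
cov = 0.5 ** np.abs(np.subtract.outer(idx, idx))
XZ = rng.multivariate_normal(np.zeros(5), cov, size=n)
X, Z1, Z2 = XZ[:, :3], XZ[:, 3], XZ[:, 4]

gam = (1.0, 0.2, 0.2, 0.4, 0.5)                     # (gamma_0, ..., gamma_4)
lin = gam[0] + gam[1] * X[:, 1] + gam[2] * X[:, 2] + gam[3] * Z1 + gam[4] * Z2
pi1 = 1.0 / (1.0 + np.exp(-lin))                    # selection probability pi_1
R = rng.binomial(1, pi1)                            # R_i = 1 means X_i1 is observed
print("empirical missing rate:", 1.0 - R.mean())    # roughly 0.25-0.30
```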
Table 1 displays the average absolute deviations (AAD), the standard deviations (SD), and the mean absolute deviations (MAD) for each case with sample sizes n = 50, 200, 400. When the errors are normally distributed, our proposed estimator based on the exponential squared loss (ESL) has smaller AAD, SD, and MAD than the $QR_{0.5}$, estimating equations (EE), and empirical likelihood ratio (EL) methods for all sample sizes, which means that the proposed estimator performs better than the other three estimators. The proposed estimator also gives good results for the other two error distributions, t(3) and MN(1, 100). The significant improvement of our estimator over the EE, EL, and $QR_{0.5}$ estimators indicates that the proposed ESL method is robust to datasets with outliers or heavy-tailed error distributions of the response variable. Moreover, as the sample size n increases, the performance of the estimator $\hat\beta$ improves significantly.
The square root of the average squared error (RASE) of the estimator $\hat g_k(\cdot)$ of the nonparametric function $g_k(\cdot)$ for sample sizes n = 50, 200, and 400 is reported in Table 2, which gives results similar to those in Table 1. Whichever of the three error distributions is used, our proposed estimator has smaller RASE than the other three estimators and performs better; that is, for the non-normal distributions, the proposed ESL method is consistently superior to QR, EE, and EL. When the selection probability $\pi(\cdot)$ is correctly specified and estimated using the parametric model, a clear pattern emerges: as the sample size n increases, the performance of both estimators $\hat\beta$ and $\hat g(\cdot)$ steadily improves.
Example 2. This example studies the variable selection performance for the index parameters in Model (1). The model setup is similar to (24) except that $X = (X_1, X_2, \ldots, X_8)^T$ is independently generated from $[-1, 1]^8$ and $\beta_0 = (1/3, 2/3, 0, 0, 2/3, 0, 0, 0)^T$. As in Example 1, three error distributions, N(0, 1), t(3), and MN(1, 25), are considered to show the robustness of the proposed estimation method based on the exponential squared loss (ESL). The error $\varepsilon$ is independent of $X_1, \ldots, X_8, Z_1, Z_2$; $X_1$ may have missing values. The selection probability function is given by:
$$\pi_2(X_2, X_5, Z_1, Z_2) = \{1 + \exp(-(\gamma_0 + \gamma_1 X_2 + \gamma_2 X_5 + \gamma_3 Z_1 + \gamma_4 Z_2))\}^{-1}.$$
We consider $\pi_2$ with $(\gamma_0, \gamma_1, \gamma_2, \gamma_3, \gamma_4) = (1, 0.2, 0.2, 0.4, 0.5)$. The corresponding average missing rate is 25%.
For each setting above, we compare the performance of four methods: our proposed method (ESL-SCAD), LSE-SCAD proposed by Feng and Xue [11], LAD-SCAD proposed by Yang and Yang [25], and the EE-SCAD method based on Xue and Pang [12]. The results are reported in Table 3 and are similar to the conclusions of Example 1. Whether the error term follows the normal distribution, the centralized t-distribution, or the mixture of normals, our proposed method performs more efficiently in variable selection, with larger NC and smaller NIC. When there are outliers in the response or heavy-tailed error distributions, ESL-SCAD performs clearly better than the LAD-SCAD, EE-SCAD, and LSE-SCAD estimators. For normal errors, ESL-SCAD hardly loses any efficiency.
The proposed procedure is also competitive in terms of computational cost. The calculations were performed on a computer with an AMD Ryzen processor and 16 GB of RAM running Windows 10, and only one CPU core was used for a fair comparison. Results on the computational efficiency of our proposed method are presented in Table 4 and Table 5, which report CPU times (in seconds) for different combinations of the full data size n and the number of covariates p. The proposed algorithm is seen to be computationally fast.

4. Discussion

In this paper, we use penalized regression with the exponential squared loss to propose a robust variable selection procedure for the single-index varying-coefficient model with missing covariates. B-splines are used to approximate the unknown coefficient functions, IPW is a frequently used method for dealing with the bias resulting from missing covariates, and a non-convex penalty is used to perform estimation and variable selection simultaneously. We examine the sampling properties and robustness of our estimator. The theoretical and simulation studies in this paper show clear merits of our method, and the results are also good when the method is applied to real data. In particular, the estimator attains the highest sample breakdown point, and its influence function is bounded for outliers in either the response domain or the covariate domain. When outliers are present (regardless of the mechanism), EE-SCAD and LSE-SCAD are inferior in terms of variable selection accuracy.
Moreover, further studies can be built on the proposed method. First, it is worth considering a goodness-of-fit test; in this paper we only study sparse estimation and variable selection. Second, censored data could be examined within this model. An investigation of these issues is part of future work but is beyond the scope of this paper. Finally, in the proposed theory the interior knots are treated as fixed; how to select the interior knots optimally when data are missing is an interesting problem worthy of future research.

Author Contributions

Formal analysis, H.S.; Methodology, Y.S.; Software, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the NNSF project (61503412) of China and the NSF project (ZR2019MA016) of Shandong Province of China.

Conflicts of Interest

The authors declare that they have no competing interests.

References

  1. Yates, F. The analysis of replicated experiments when the field results are incomplete. Emp. J. Exp. Agric. 1933, 1, 129–142.
  2. Healy, M.; Westmacott, M. Missing values in experiments analysed on automatic computers. J. R. Stat. Soc. Ser. B Methodol. 1956, 5, 203–206.
  3. Horvitz, D.G.; Thompson, D.J. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 1952, 47, 663–685.
  4. Robins, J.M.; Rotnitzky, A.; Zhao, L.P. Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 1994, 89, 846–866.
  5. Wang, C.; Wang, S.; Zhao, L.P.; Ou, S.T. Weighted semiparametric estimation in regression analysis with missing covariate data. J. Am. Stat. Assoc. 1997, 92, 512–525.
  6. Little, R.J.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: Hoboken, NJ, USA, 2019; Volume 793.
  7. Liang, H.; Wang, S.; Robins, J.M.; Carroll, R.J. Estimation in partially linear models with missing covariates. J. Am. Stat. Assoc. 2004, 99, 357–367.
  8. Tsiatis, A.A. Semiparametric Theory and Missing Data; Springer: Berlin/Heidelberg, Germany, 2006.
  9. Friedman, J.; Hastie, T.; Tibshirani, R. Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). Ann. Stat. 2000, 28, 337–407.
  10. Wang, X.; Jiang, Y.; Huang, M.; Zhang, H. Robust variable selection with exponential squared loss. J. Am. Stat. Assoc. 2013, 108, 632–643.
  11. Feng, S.; Xue, L. Variable selection for single-index varying-coefficient model. Front. Math. China 2013, 8, 541–565.
  12. Xue, L.; Pang, Z. Statistical inference for a single-index varying-coefficient model. Stat. Comput. 2013, 23, 589–599.
  13. Hardle, W.; Hall, P.; Ichimura, H. Optimal smoothing in single-index models. Ann. Stat. 1993, 21, 157–178.
  14. Wu, T.Z.; Lin, H.; Yu, Y. Single-index coefficient models for nonlinear time series. J. Nonparametr. Stat. 2011, 23, 37–58.
  15. Hastie, T.; Tibshirani, R. Varying-coefficient models. J. R. Stat. Soc. Ser. B Methodol. 1993, 55, 757–779.
  16. Fan, J.; Zhang, W. Statistical estimation in varying coefficient models. Ann. Stat. 1999, 27, 1491–1518.
  17. Xia, Y.; Li, W.K. On single-index coefficient regression models. J. Am. Stat. Assoc. 1999, 94, 1275–1285.
  18. Xue, L.; Wang, Q. Empirical likelihood for single-index varying-coefficient models. Bernoulli 2012, 18, 836–856.
  19. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 1996, 58, 267–288.
  20. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
  21. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
  22. Peng, H.; Huang, T. Penalized least squares for single index models. J. Stat. Plan. Inference 2011, 141, 1362–1379.
  23. Yang, H.; Yang, J. A robust and efficient estimation and variable selection method for partially linear single-index models. J. Multivar. Anal. 2014, 129, 227–242.
  24. Wang, D.; Kulasekera, K. Parametric component detection and variable selection in varying-coefficient partially linear models. J. Multivar. Anal. 2012, 112, 117–129.
  25. Yang, J.; Yang, H. Quantile regression and variable selection for single-index varying-coefficient models. Commun. Stat.-Simul. C 2017, 46, 4637–4653.
  26. Yu, Y.; Ruppert, D. Penalized spline estimation for partially linear single-index models. J. Am. Stat. Assoc. 2002, 97, 1042–1054.
  27. He, X.; Zhu, Z.Y.; Fung, W.K. Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika 2002, 89, 579–590.
  28. Zhao, P.; Xue, L. Variable selection for semiparametric varying coefficient partially linear models. Stat. Probab. Lett. 2009, 79, 2148–2157.
Table 1. Simulation results of AAD ($\times 10^{2}$), SD ($\times 10^{2}$), and MAD ($\times 10^{2}$) for the estimators of $\beta_i$ ($i = 1, 2, 3$).

| Dist | n | Method | $\hat\beta_1$ AD | $\hat\beta_1$ SD | $\hat\beta_2$ AD | $\hat\beta_2$ SD | $\hat\beta_3$ AD | $\hat\beta_3$ SD | MAD |
|---|---|---|---|---|---|---|---|---|---|
| N(0,1) | 50 | ESL | 0.0862 | 0.0842 | 0.0934 | 0.0940 | 0.0916 | 0.0928 | 0.0923 |
| | | QR$_{0.5}$ | 0.0924 | 0.0908 | 0.0945 | 0.0966 | 0.0928 | 0.0936 | 0.09379 |
| | | EE | 0.6386 | 0.6812 | 0.6806 | 0.7013 | 0.7047 | 0.7014 | 0.6734 |
| | | EL | 0.6418 | 0.6826 | 0.6814 | 0.702 | 0.7069 | 0.7022 | 0.6742 |
| | 200 | ESL | 0.0521 | 0.0517 | 0.0568 | 0.0655 | 0.0656 | 0.0637 | 0.0623 |
| | | QR$_{0.5}$ | 0.0643 | 0.0635 | 0.0696 | 0.0743 | 0.0704 | 0.0725 | 0.0739 |
| | | EE | 0.4993 | 0.5014 | 0.4884 | 0.5113 | 0.5025 | 0.5204 | 0.4997 |
| | | EL | 0.4998 | 0.5022 | 0.4892 | 0.5121 | 0.5033 | 0.5212 | 0.4999 |
| | 400 | ESL | 0.0467 | 0.0475 | 0.0482 | 0.0499 | 0.0491 | 0.0493 | 0.0478 |
| | | QR$_{0.5}$ | 0.0473 | 0.0489 | 0.0495 | 0.0504 | 0.0493 | 0.0497 | 0.0484 |
| | | EE | 0.4306 | 0.4682 | 0.4673 | 0.4809 | 0.4835 | 0.4824 | 0.4531 |
| | | EL | 0.4312 | 0.4694 | 0.4682 | 0.4816 | 0.4847 | 0.4830 | 0.4538 |
| t(3) | 50 | ESL | 1.8642 | 1.9342 | 1.9575 | 2.0416 | 1.9464 | 1.9488 | 1.9240 |
| | | QR$_{0.5}$ | 2.0468 | 2.1726 | 2.1934 | 2.2682 | 2.2610 | 2.3208 | 2.2627 |
| | | EE | 4.2203 | 4.8418 | 5.7262 | 5.9258 | 5.4436 | 6.0240 | 5.1639 |
| | | EL | 4.2217 | 4.8446 | 5.7280 | 5.9264 | 5.4475 | 6.0264 | 5.1653 |
| | 200 | ESL | 0.4734 | 0.4892 | 0.4957 | 0.5044 | 0.4936 | 0.4978 | 0.4846 |
| | | QR$_{0.5}$ | 0.4902 | 0.5013 | 0.5075 | 0.5184 | 0.5118 | 0.5250 | 0.5129 |
| | | EE | 2.3115 | 2.7526 | 3.2385 | 3.5381 | 3.1327 | 3.5504 | 2.9215 |
| | | EL | 2.3121 | 2.7532 | 3.2390 | 3.5388 | 3.1331 | 3.5508 | 2.9217 |
| | 400 | ESL | 0.0643 | 0.0635 | 0.0696 | 0.0743 | 0.0704 | 0.0725 | 0.0739 |
| | | QR$_{0.5}$ | 0.0713 | 0.0762 | 0.0728 | 0.0801 | 0.0697 | 0.0755 | 0.0784 |
| | | EE | 1.4832 | 1.8364 | 2.3734 | 2.4119 | 2.3658 | 2.5706 | 2.1897 |
| | | EL | 1.4838 | 1.8370 | 2.3742 | 2.4126 | 2.3664 | 2.5712 | 2.1903 |
| MN(1,100) | 50 | ESL | 2.2328 | 2.3444 | 2.3727 | 2.4228 | 2.4053 | 2.4264 | 2.4306 |
| | | QR$_{0.5}$ | 2.7892 | 2.8526 | 2.8913 | 2.9726 | 2.9436 | 3.1175 | 2.8328 |
| | | EE | 4.6206 | 5.2304 | 5.8304 | 7.4631 | 5.6529 | 6.4336 | 5.9205 |
| | | EL | 4.6224 | 5.2120 | 5.8316 | 7.4655 | 5.6542 | 6.4368 | 5.9217 |
| | 200 | ESL | 0.5036 | 0.5158 | 0.5525 | 0.5844 | 0.5534 | 0.5812 | 0.5406 |
| | | QR$_{0.5}$ | 0.5142 | 0.5276 | 0.5697 | 0.5982 | 0.5680 | 0.5903 | 0.5534 |
| | | EE | 2.5612 | 3.0546 | 3.5274 | 4.9437 | 3.4935 | 4.1178 | 3.6893 |
| | | EL | 2.5616 | 3.0550 | 3.5278 | 4.9443 | 3.4940 | 4.1182 | 3.6899 |
| | 400 | ESL | 0.0565 | 0.0547 | 0.0585 | 0.0657 | 0.0646 | 0.0621 | 0.0633 |
| | | QR$_{0.5}$ | 0.0720 | 0.0755 | 0.0718 | 0.0762 | 0.0722 | 0.0719 | 0.0743 |
| | | EE | 1.5764 | 1.8035 | 2.4832 | 2.7761 | 2.5167 | 2.6839 | 2.3106 |
| | | EL | 1.5772 | 1.8043 | 2.4838 | 2.7767 | 2.5171 | 2.6842 | 2.3110 |
Table 2. Simulation results of RASE for the estimators of $g_i(\cdot)$ ($i = 1, 2, 3$).

| Dist | n | Method | RASE($\hat g_1$) | RASE($\hat g_2$) | RASE($\hat g_3$) |
|---|---|---|---|---|---|
| N(0,1) | 50 | ESL | 0.3647 | 0.2281 | 0.3872 |
| | | QR$_{0.5}$ | 0.3893 | 0.2463 | 0.3969 |
| | | EE | 0.3854 | 0.2358 | 0.3905 |
| | | EL | 0.3872 | 0.2364 | 0.3916 |
| | 200 | ESL | 0.0941 | 0.0915 | 0.1030 |
| | | QR$_{0.5}$ | 0.1083 | 0.1041 | 0.1152 |
| | | EE | 0.1034 | 0.0967 | 0.1096 |
| | | EL | 0.1038 | 0.0969 | 0.1098 |
| | 400 | ESL | 0.0323 | 0.0304 | 0.0357 |
| | | QR$_{0.5}$ | 0.0447 | 0.0428 | 0.0483 |
| | | EE | 0.0333 | 0.0314 | 0.0446 |
| | | EL | 0.0339 | 0.0320 | 0.0448 |
| t(3) | 50 | ESL | 0.4028 | 0.3884 | 0.3916 |
| | | QR$_{0.5}$ | 0.4264 | 0.4152 | 0.4237 |
| | | EE | 1.6802 | 1.4716 | 1.7231 |
| | | EL | 1.6826 | 1.4738 | 1.7242 |
| | 200 | ESL | 0.1052 | 0.1056 | 0.1104 |
| | | QR$_{0.5}$ | 0.1188 | 0.1170 | 0.1219 |
| | | EE | 0.7308 | 0.6131 | 0.7463 |
| | | EL | 0.7314 | 0.6138 | 0.7469 |
| | 400 | ESL | 0.0366 | 0.0340 | 0.0485 |
| | | QR$_{0.5}$ | 0.0492 | 0.0476 | 0.0513 |
| | | EE | 0.4120 | 0.3493 | 0.5042 |
| | | EL | 0.4124 | 0.3495 | 0.5048 |
| MN(1,100) | 50 | ESL | 0.4356 | 0.3751 | 0.3938 |
| | | QR$_{0.5}$ | 0.4682 | 0.4065 | 0.4175 |
| | | EE | 1.5145 | 1.4127 | 1.6127 |
| | | EL | 1.5163 | 1.4203 | 1.6343 |
| | 200 | ESL | 0.1102 | 0.1045 | 0.1146 |
| | | QR$_{0.5}$ | 0.1228 | 0.1137 | 0.1201 |
| | | EE | 0.7089 | 0.6715 | 0.7141 |
| | | EL | 0.7093 | 0.6719 | 0.7147 |
| | 400 | ESL | 0.0379 | 0.0384 | 0.0361 |
| | | QR$_{0.5}$ | 0.0483 | 0.0496 | 0.0489 |
| | | EE | 0.3747 | 0.3572 | 0.5387 |
| | | EL | 0.3751 | 0.3577 | 0.5393 |
Table 3. Variable selection results and RASE of $\hat g_k(\cdot)$, $k = 1, 2, 3$, in Example 2.

| Dist | n | Method | NC | NIC | PC | RASE$_1$ | RASE$_2$ | RASE$_3$ |
|---|---|---|---|---|---|---|---|---|
| N(0,1) | 50 | ESL-SCAD | 4.850 | 0 | 0.948 | 0.2351 | 0.2306 | 0.2412 |
| | | LAD-SCAD | 4.820 | 0 | 0.942 | 0.2364 | 0.2318 | 0.2430 |
| | | LSE-SCAD | 4.865 | 0 | 0.950 | 0.2336 | 0.2140 | 0.2384 |
| | | EE-SCAD | 4.855 | 0 | 0.944 | 0.2340 | 0.2153 | 0.2396 |
| | 200 | ESL-SCAD | 4.945 | 0 | 0.962 | 0.1143 | 0.1132 | 0.1228 |
| | | LAD-SCAD | 4.940 | 0 | 0.960 | 0.1187 | 0.1145 | 0.1256 |
| | | LSE-SCAD | 4.955 | 0 | 0.970 | 0.1132 | 0.1065 | 0.1182 |
| | | EE-SCAD | 4.950 | 0 | 0.965 | 0.1138 | 0.1071 | 0.1194 |
| | 400 | ESL-SCAD | 5.000 | 0 | 1.000 | 0.0467 | 0.0432 | 0.0556 |
| | | LAD-SCAD | 5.000 | 0 | 1.000 | 0.0545 | 0.0526 | 0.0581 |
| | | LSE-SCAD | 5.000 | 0 | 1.000 | 0.0423 | 0.0404 | 0.0536 |
| | | EE-SCAD | 5.000 | 0 | 1.000 | 0.0429 | 0.0408 | 0.0540 |
| t(3) | 50 | ESL-SCAD | 4.924 | 0.006 | 0.948 | 0.2262 | 0.2250 | 0.2333 |
| | | LAD-SCAD | 4.916 | 0.009 | 0.922 | 0.2350 | 0.2344 | 0.2475 |
| | | LSE-SCAD | 3.503 | 0.178 | 0.594 | 0.9616 | 0.9826 | 0.9688 |
| | | EE-SCAD | 3.524 | 0.190 | 0.589 | 0.9624 | 0.9856 | 0.9723 |
| | 200 | ESL-SCAD | 4.946 | 0.003 | 0.962 | 0.1128 | 0.1117 | 0.1253 |
| | | LAD-SCAD | 4.930 | 0.005 | 0.950 | 0.1290 | 0.1274 | 0.1321 |
| | | LSE-SCAD | 3.765 | 0.160 | 0.690 | 0.7398 | 0.7015 | 0.7547 |
| | | EE-SCAD | 3.775 | 0.175 | 0.675 | 0.7412 | 0.7035 | 0.7569 |
| | 400 | ESL-SCAD | 4.998 | 0 | 0.998 | 0.0466 | 0.0432 | 0.0585 |
| | | LAD-SCAD | 4.990 | 0 | 0.995 | 0.0598 | 0.0584 | 0.0619 |
| | | LSE-SCAD | 4.215 | 0.105 | 0.750 | 0.4206 | 0.3573 | 0.5134 |
| | | EE-SCAD | 4.190 | 0.110 | 0.735 | 0.4224 | 0.3597 | 0.5148 |
| MN(1,25) | 50 | ESL-SCAD | 4.895 | 0 | 0.930 | 0.2268 | 0.2150 | 0.2269 |
| | | LAD-SCAD | 4.880 | 0 | 0.922 | 0.2440 | 0.2312 | 0.2453 |
| | | LSE-SCAD | 3.425 | 0.175 | 0.546 | 0.9367 | 0.9536 | 0.9516 |
| | | EE-SCAD | 3.440 | 0.190 | 0.536 | 0.9435 | 0.9557 | 0.9535 |
| | 200 | ESL-SCAD | 4.940 | 0 | 0.955 | 0.1129 | 0.1063 | 0.1147 |
| | | LAD-SCAD | 4.935 | 0 | 0.950 | 0.1335 | 0.1241 | 0.1305 |
| | | LSE-SCAD | 3.805 | 0.155 | 0.685 | 0.7151 | 0.6803 | 0.7233 |
| | | EE-SCAD | 3.815 | 0.165 | 0.660 | 0.7193 | 0.6819 | 0.7245 |
| | 400 | ESL-SCAD | 4.997 | 0 | 1.000 | 0.0414 | 0.0563 | 0.0467 |
| | | LAD-SCAD | 4.995 | 0 | 1.000 | 0.0587 | 0.0601 | 0.0595 |
| | | LSE-SCAD | 4.355 | 0.090 | 0.840 | 0.3823 | 0.3634 | 0.5467 |
| | | EE-SCAD | 4.275 | 0.105 | 0.755 | 0.3851 | 0.3676 | 0.5493 |
Table 4. CPU times for different n in Example 1.

| n | N(0,1) | t(3) | MN(1,100) |
|---|---|---|---|
| 50 | 0.5609 | 0.6491 | 0.7832 |
| 200 | 0.7326 | 0.8169 | 0.8664 |
| 400 | 0.9528 | 0.9868 | 1.0610 |
Table 5. CPU times for different n in Example 2.

| n | N(0,1) | t(3) | MN(1,100) |
|---|---|---|---|
| 50 | 0.8702 | 0.9564 | 1.1062 |
| 200 | 1.1470 | 1.2235 | 1.3598 |
| 400 | 1.3682 | 1.4373 | 1.6476 |
