1. Introduction
Advancements in technology have led to sophisticated machines that can measure many details about every observational or experimental unit in a system. This results in data with more features (predictors, $p$) than observational or experimental units (sample size $n$), referred to as high-dimensional data. Such data arise in genetic research, e-commerce, biomedical imaging, and functional magnetic resonance imaging, among many other areas.
Since there are many more features ($p$) than the sample size ($n$), such data are analyzed using a sparse high-dimensional regression (SHR) model:
$$y = X\beta + \varepsilon,$$
where $y$ is the $n \times 1$ response vector, $X$ is the $n \times p$ design matrix, $\beta = (\beta_1, \ldots, \beta_p)^{\top}$ is the vector of regression coefficients, and $\varepsilon$ is the error vector. It is assumed that only a relatively small number of the $\beta_j$'s are nonzero. The main goal of the analysis is feature selection. As stated by [1], feature selection typically has two goals. The first is model building with desirable prediction properties. The second is identifying the features with nonzero coefficients. For convenience, such features are referred to as relevant features in this paper.
One approach to the SHR model is to estimate the $\beta_j$'s by a regularization method, which performs estimation and selection simultaneously by minimizing the penalized least squares criterion below:
$$\hat{\beta} = \arg\min_{\beta}\Big\{ \|y - X\beta\|_2^2 + \sum_{j=1}^{p} p_{\lambda}(|\beta_j|) \Big\},$$
where $\lambda$ is the regularization parameter and $p_{\lambda}(\cdot)$ is a penalty function. When $p_{\lambda}$ is based on the $\ell_1$ norm, i.e., $p_{\lambda}(|\beta_j|) = \lambda|\beta_j|$, the method is referred to as the LASSO [2]. The $\ell_1$ norm penalty is able to shrink the coefficients of redundant predictors to zero. Thus, the LASSO usually results in sparse models that are easier to interpret. Other penalty functions, such as the SCAD [3], the adaptive LASSO (ALasso) [4], and the minimax concave penalty [5], have also been reported in the literature. SCAD smoothly clips the $\ell_1$ penalty (for small coefficients) and assigns a constant penalty (for large coefficients). The adaptive LASSO, on the other hand, utilizes a weighted $\ell_1$ penalty, $\lambda \sum_{j=1}^{p} w_j |\beta_j|$, where the $w_j$ represent the weights. The regularization parameter, $\lambda$, is usually chosen using cross-validation (CV).
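For illustration, the following is a minimal sketch of a LASSO fit with $\lambda$ chosen by cross-validation, using scikit-learn's `LassoCV` on simulated sparse high-dimensional data; the data-generating setup and variable names are ours, not from the paper.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Simulate sparse high-dimensional data: p >> n, only 3 nonzero coefficients.
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(size=n)

# LASSO with the regularization parameter chosen by 5-fold cross-validation.
fit = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(fit.coef_)   # features with nonzero estimated coefficients
print("chosen lambda:", fit.alpha_, "selected features:", selected)
```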
Another approach to analyzing the SHR model (the large-$p$-small-$n$ problem) is sequential variable selection, which is designed to reduce the dimension of the data so that the number of retained features is smaller than $n$. There are two forms of sequential variable selection. The first uses the sure screening property, which selects from the many features a subset that contains the relevant features (predictors). This is usually followed by a regularization method such as SCAD or ALasso to identify and estimate the relevant predictors from the reduced feature space. The other form sequentially selects the relevant features through a repetitive process that terminates when a stopping criterion is met.
A recent addition to sequential feature selection is the sequential LASSO cum EBIC in ultra high-dimensional feature space (SLasso) by [1], which sequentially solves partially penalized least squares problems and uses EBIC as the stopping criterion. The EBIC proposed by [6] is suitable for model selection in large model spaces and has the ordinary BIC as a special case. For large model spaces, the ordinary BIC tends to select a model with many spurious variables. Let $k$ and $k + 1$ be the numbers of predictors in two models, respectively. Using EBIC as the selection or stopping criterion, the model with $k$ predictors is selected if $\mathrm{EBIC}(k + 1) > \mathrm{EBIC}(k)$.
In SLasso, sequentially solving the partially penalized least squares reduces to selecting the feature(s) which maximize the ordinary correlation coefficient between the features and the response variable at each step. It is well-known that the Pearson correlation coefficient is used for measuring the strength of linear associations. Thus, maximizing the Pearson correlation coefficient might not work well for data structures where the relationship between at least one feature and the response variable is nonlinear.
In this article, we propose the use of the energy distance correlation instead of the ordinary correlation coefficient to identify and maximize both the linear and nonlinear relationships that might exist between each feature and the response. Energy distance is a metric that measures the distance between the distributions of random vectors. The name ‘energy’ is motivated by analogy to the potential energy between objects in a gravitational space. The potential energy is zero if and only if the locations (the gravitational centers) of the two objects coincide, and increases as their distance in space increases [
7]. The energy distance correlation has an explicit relationship with the product-moment correlation, but unlike the classical definition of correlation, energy distance correlation is zero only if the random vectors are independent. The empirical energy distance correlation is based on Euclidean distances between sample elements rather than sample moments.
The remainder of the article is arranged as follows: in
Section 2, we discuss the derivation of the energy distance correlation, extended Bayesian information criteria and our proposed method (energy distance correlation with EBIC (Edc + EBIC)). In
Section 3, we report simulation studies comparing Edc + EBIC with various other methods and provide an analysis of real data. In
Section 4, we conclude the article with a discussion of the results.
2. EBIC with Energy Distance Correlation
2.1. Energy Distance Correlation
The authors in [8] proposed the energy distance correlation between two random variables. Suppose that $W$ and $Z$ are two random vectors with $E\|W\| < \infty$ and $E\|Z\| < \infty$, where $\|\cdot\|$ is the Euclidean norm and $E$ is the expected value. Let $F$ and $G$ be the cumulative distribution functions (CDFs) of $W$ and $Z$, respectively. Further, let $W'$ denote an independent and identically distributed (iid) copy of $W$; that is, $W$ and $W'$ are iid. Similarly, $Z$ and $Z'$ are iid.
The squared energy distance can be defined in terms of expected distances between the random vectors,
$$D^2(F, G) = 2E\|W - Z\| - E\|W - W'\| - E\|Z - Z'\| \ge 0,$$
and the energy distance between the distributions $F$ and $G$ is defined as the square root of $D^2(F, G)$.
The energy distance correlation between random vectors $W$ and $Z$ with finite first moments is the nonnegative number $\mathcal{R}(W, Z)$ defined by
$$\mathcal{R}^2(W, Z) = \begin{cases} \dfrac{\mathcal{V}^2(W, Z)}{\sqrt{\mathcal{V}^2(W)\,\mathcal{V}^2(Z)}}, & \mathcal{V}^2(W)\,\mathcal{V}^2(Z) > 0, \\ 0, & \mathcal{V}^2(W)\,\mathcal{V}^2(Z) = 0, \end{cases}$$
where $\mathcal{V}^2(W, Z)$ is the energy distance covariance between $W$ and $Z$, and $\mathcal{V}^2(W)$ and $\mathcal{V}^2(Z)$ are the energy distance variances of $W$ and $Z$, respectively.
For a statistical sample $(w_k, z_k)$, $k = 1, \ldots, n$, from a pair of real-valued or vector-valued random variables $(W, Z)$, the sample energy distance correlation, $\mathcal{R}_n(W, Z)$, is calculated by first computing the $n \times n$ distance matrices $(a_{kl})$ and $(b_{kl})$ containing all pairwise distances $a_{kl} = \|w_k - w_l\|$ and $b_{kl} = \|z_k - z_l\|$, where $\|\cdot\|$ denotes the Euclidean norm. Secondly, calculate all doubly centered distances
$$A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}, \qquad B_{kl} = b_{kl} - \bar{b}_{k\cdot} - \bar{b}_{\cdot l} + \bar{b}_{\cdot\cdot},$$
where $\bar{a}_{k\cdot}$ is the $k$th row mean, $\bar{a}_{\cdot l}$ is the $l$th column mean, and $\bar{a}_{\cdot\cdot}$ is the grand mean of the distance matrix of the $w$ sample. The notation is similar for the $b$ values.
The squared sample distance covariance (a scalar) is the arithmetic average of the products $A_{kl} B_{kl}$:
$$\mathcal{V}_n^2(W, Z) = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} A_{kl} B_{kl}.$$
The sample energy distance variance for sample $w$ is
$$\mathcal{V}_n^2(W) = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} A_{kl}^2,$$
and the sample energy distance variance for sample $z$ is
$$\mathcal{V}_n^2(Z) = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} B_{kl}^2.$$
The sample energy distance correlation is
$$\mathcal{R}_n^2(W, Z) = \begin{cases} \dfrac{\mathcal{V}_n^2(W, Z)}{\sqrt{\mathcal{V}_n^2(W)\,\mathcal{V}_n^2(Z)}}, & \mathcal{V}_n^2(W)\,\mathcal{V}_n^2(Z) > 0, \\ 0, & \mathcal{V}_n^2(W)\,\mathcal{V}_n^2(Z) = 0. \end{cases}$$
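The sample quantities above can be computed in a few lines. The following Python/NumPy sketch (the helper name `sample_dcor` and the simulated example are ours) mirrors the doubly centered distance matrices and the ratio defining $\mathcal{R}_n$; the small example at the end illustrates that a purely quadratic association, nearly invisible to the Pearson correlation, yields a sizeable distance correlation.

```python
import numpy as np

def sample_dcor(w, z):
    """Sample energy distance correlation R_n(W, Z) for two samples of equal length n."""
    w = np.asarray(w, dtype=float).reshape(len(w), -1)
    z = np.asarray(z, dtype=float).reshape(len(z), -1)
    a = np.linalg.norm(w[:, None, :] - w[None, :, :], axis=-1)   # pairwise distances a_kl
    b = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)   # pairwise distances b_kl
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()   # doubly centered A_kl
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()   # doubly centered B_kl
    dcov2 = (A * B).mean()                                  # squared sample distance covariance
    dvar_w, dvar_z = (A * A).mean(), (B * B).mean()         # squared sample distance variances
    denom = np.sqrt(dvar_w * dvar_z)
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

# Example: a nonlinear (quadratic) association is picked up by the distance correlation.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = x ** 2 + 0.1 * rng.normal(size=200)
print(round(sample_dcor(x, y), 3), round(abs(np.corrcoef(x, y)[0, 1]), 3))
```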
Some basic properties of the distance correlation are as follows:
- (i) $0 \le \mathcal{R}(W, Z) \le 1$;
- (ii) If $E(\|W\| + \|Z\|) < \infty$, then $\mathcal{R}(W, Z) = 0$ if and only if $W$ and $Z$ are independent;
- (iii) Suppose that $\mathcal{R}(W, Z) = 1$. Then, there exist a vector $a$, a nonzero real number $b$, and an orthogonal matrix $C$ such that $Z = a + b\,W C$.
For further details on the energy distance correlation, see [
7].
2.2. EBIC
Ref. [6] derived the EBIC, which has the ordinary BIC as a special case. Let $\{(y_i, x_i) : i = 1, \ldots, n\}$ be independent observations. Suppose that the conditional density function of $y_i$ given $x_i$ is $f(y_i \mid x_i, \theta)$, where $\theta = (\theta_1, \ldots, \theta_p)^{\top}$, with $p$ being a positive integer. The likelihood function of $\theta$ is given by
$$L_n(\theta) = \prod_{i=1}^{n} f(y_i \mid x_i, \theta).$$
Denote $s \subset \{1, 2, \ldots, p\}$. Let $\theta(s)$ be the parameter vector $\theta$ with those components outside $s$ set to 0. Let $S$ be the underlying model space, i.e., $S = \{s : s \subset \{1, 2, \ldots, p\}\}$; let $p(s)$ be a prior for model $s$. Assume that, given $s$, the prior density of $\theta(s)$ is $p\{\theta(s) \mid s\}$. The posterior is
$$p(s \mid \mathbf{y}) \propto p(s)\, m(\mathbf{y} \mid s),$$
where $m(\mathbf{y} \mid s)$ is the likelihood in model $s$, i.e.,
$$m(\mathbf{y} \mid s) = \int L_n\{\theta(s)\}\, p\{\theta(s) \mid s\}\, d\theta(s).$$
Suppose $S$ is partitioned into $S = \bigcup_{j=1}^{p} S_j$ such that models within each $S_j$ have an equal dimension $j$. Let $\tau(S_j)$ be the size of $S_j$, i.e., $\tau(S_j) = \binom{p}{j}$. Assign the prior distribution $P(S_j)$ proportional to $\tau(S_j)^{\xi}$ for some $\xi$ between 0 and 1. For each $s \in S_j$, assign equal probability, $P(s \mid S_j) = 1/\tau(S_j)$; this is equivalent to $p(s)$ proportional to $\tau(S_j)^{-\gamma}$ for $s \in S_j$, where $\gamma = 1 - \xi$. Then, the extended BIC family is given by
$$\mathrm{EBIC}_{\gamma}(s) = -2 \log L_n\{\hat{\theta}(s)\} + \nu(s) \log n + 2\gamma \log \tau(S_j), \qquad 0 \le \gamma \le 1, \quad s \in S_j,$$
where $\hat{\theta}(s)$ is the maximum likelihood estimator of $\theta(s)$ and $\nu(s)$ is the number of components in $s$.
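For a Gaussian linear model, the EBIC above is commonly evaluated with $-2\log L_n\{\hat{\theta}(s)\}$ replaced (up to an additive constant) by $n\log(\mathrm{RSS}_s/n)$ and with $\log\tau(S_j)$ approximated by $j\log p$. The sketch below uses these common simplifications; the function name `ebic_linear` and the example data are ours, not from the paper.

```python
import numpy as np

def ebic_linear(y, X_s, p_total, gamma=0.5):
    """EBIC for a Gaussian linear model on the selected columns X_s.

    EBIC = n * log(RSS / n) + k * log(n) + 2 * gamma * k * log(p_total),
    where k is the number of selected predictors and p_total is the total
    number of candidate predictors.
    """
    n, k = X_s.shape
    coef, *_ = np.linalg.lstsq(X_s, y, rcond=None)   # least-squares fit on the submodel
    rss = np.sum((y - X_s @ coef) ** 2)              # residual sum of squares
    return n * np.log(rss / n) + k * np.log(n) + 2 * gamma * k * np.log(p_total)

# Example: compare the EBIC of a one-predictor model with that model plus an irrelevant predictor.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 300))
y = 1.5 * X[:, 0] + rng.normal(size=80)
print(ebic_linear(y, X[:, [0]], 300), ebic_linear(y, X[:, [0, 1]], 300))
```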
2.3. Energy Distance Correlation with EBIC (Edc + EBIC) Algorithm
We propose a sequential model selection method which we call energy distance correlation with EBIC, and for convenience abbreviate it as Edc + EBIC. Let $y = (y_1, \ldots, y_n)^{\top}$ be a continuous response variable and $X$ be an $n \times p$ data matrix with columns $x_1, \ldots, x_p$. Let $S$ be the index set of all predictors, i.e., $S = \{1, 2, \ldots, p\}$. For $s \subset S$, let $X(s) = \{x_j : j \in s\}$. If $s \subset S$, then $s^c$ is the complement of $s$ in $S$. Let $|s|$ be the number of elements in the set $s$.
At the initial stage, we standardize all the variables. Next, we find the energy distance correlation between the response variable and each of the predictor variables, $\mathcal{R}_n(y, x_j)$, $j = 1, \ldots, p$. We then select the predictor (feature) which has the highest distance correlation with the response and store it in the active set $s_{*1}$.
Let $\mathcal{X}(s_{*1})$ be the linear space spanned by the columns of $X(s_{*1})$ and $H(s_{*1})$ its corresponding projection matrix, i.e., $H(s_{*1}) = X(s_{*1})\{X(s_{*1})^{\top}X(s_{*1})\}^{-1}X(s_{*1})^{\top}$. Next, we compute $\tilde{y} = \{I - H(s_{*1})\}y$ and $\tilde{x}_j = \{I - H(s_{*1})\}x_j$, $j \in s_{*1}^{c}$. The variable $\tilde{y}$ is the part of $y$ unexplained by $X(s_{*1})$. This gives the features already in $s_{*1}$ close to a zero chance of being selected in the subsequent steps.
For the general step $k \ge 2$, we calculate $\mathcal{R}_n(\tilde{y}, \tilde{x}_j)$ for $j \in s_{*(k-1)}^{c}$, select the feature with the largest value, and update the active set to $s_{*k}$, which is the union of all the previously selected variables and the current one. We then compute $\mathrm{EBIC}(s_{*k})$ and compare it with $\mathrm{EBIC}(s_{*(k-1)})$. The procedure stops if $\mathrm{EBIC}(s_{*k}) > \mathrm{EBIC}(s_{*(k-1)})$. The selected variables, which we call the relevant variables, will be $s_{*(k-1)}$. We can then fit a linear regression model between the response $y$ and the relevant variables.
We wish to note that care must be taken in fitting this model because some of the predictors might be non-linearly related to y, and thus some of the predictors may have to enter into the model in their quadratic or cubic form, etc. Alternatively, a Box–Cox transformation can be performed on the data before fitting the model.
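As one illustration of the Box–Cox option mentioned above, the response could be transformed before refitting, as in the following sketch (simulated data; `X_sel` stands for the matrix of selected relevant features and is our notation).

```python
import numpy as np
from scipy import stats

# Illustrative only: X_sel plays the role of the selected (relevant) features.
rng = np.random.default_rng(0)
X_sel = rng.normal(size=(100, 3))
y = np.exp(X_sel @ np.array([1.0, 0.5, -0.7]) + 0.1 * rng.normal(size=100))  # positive response

y_bc, lam = stats.boxcox(y)                                  # Box-Cox transform (requires y > 0)
X_design = np.column_stack([np.ones(len(y_bc)), X_sel])      # add an intercept
beta_hat, *_ = np.linalg.lstsq(X_design, y_bc, rcond=None)   # refit on the transformed response
print("estimated Box-Cox lambda:", round(lam, 2))
```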
The algorithm details are given in the following.
Initial Step: With $y$ and $x_j$, $j = 1, \ldots, p$, standardized to have mean zero and unit variance, compute $\mathcal{R}_n(y, x_j)$ for $j = 1, \ldots, p$. Let
$$s_{*1} = \Big\{ j : \mathcal{R}_n(y, x_j) = \max_{1 \le l \le p} \mathcal{R}_n(y, x_l) \Big\}.$$
Let $s_{*1}$ be the active set. Compute $\tilde{y} = \{I - H(s_{*1})\}y$ and $\tilde{x}_j = \{I - H(s_{*1})\}x_j$, $j \in s_{*1}^{c}$, where $H(s_{*1}) = X(s_{*1})\{X(s_{*1})^{\top}X(s_{*1})\}^{-1}X(s_{*1})^{\top}$.
General Step: In the selection step $k$, compute $\mathcal{R}_n(\tilde{y}, \tilde{x}_j)$ for $j \in s_{*(k-1)}^{c}$, where $\tilde{y} = \{I - H(s_{*(k-1)})\}y$ and $\tilde{x}_j = \{I - H(s_{*(k-1)})\}x_j$. Let
$$j_k = \arg\max_{j \in s_{*(k-1)}^{c}} \mathcal{R}_n(\tilde{y}, \tilde{x}_j).$$
Let $s_{*k} = s_{*(k-1)} \cup \{j_k\}$. Compute $\mathrm{EBIC}(s_{*k})$. If $\mathrm{EBIC}(s_{*k}) > \mathrm{EBIC}(s_{*(k-1)})$, stop; otherwise, continue to step $k + 1$.
When the process terminates, return the least-squares estimates for parameters in the selected model.
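To make the steps above concrete, the following is a minimal Python sketch of the whole procedure under our own simplifying assumptions: variables are standardized to mean zero and unit variance, the EBIC is computed in its Gaussian linear-regression form with the $k\log p$ approximation of $\log\tau(S_j)$, and the first selected feature is always accepted before the EBIC comparison starts. The helper names (`dcor`, `ebic`, `edc_ebic`) are ours; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def dcor(u, v):
    """Sample energy distance correlation between two 1-D samples (Section 2.1)."""
    u = np.asarray(u, dtype=float).reshape(-1, 1)
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    a = np.abs(u - u.T)                                        # pairwise distances for u
    b = np.abs(v - v.T)                                        # pairwise distances for v
    A = a - a.mean(0) - a.mean(1, keepdims=True) + a.mean()    # doubly centered distances
    B = b - b.mean(0) - b.mean(1, keepdims=True) + b.mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt((A * B).mean() / denom) if denom > 0 else 0.0

def ebic(y, X, s, gamma=0.5):
    """EBIC of the Gaussian linear submodel indexed by s (k*log(p) form, Section 2.2)."""
    n, p = X.shape
    Xs = np.column_stack([np.ones(n), X[:, list(s)]])
    rss = np.sum((y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]) ** 2)
    return n * np.log(rss / n) + len(s) * np.log(n) + 2 * gamma * len(s) * np.log(p)

def edc_ebic(y, X, gamma=0.5):
    """Sequential Edc + EBIC selection: pick the residualized feature with the largest
    distance correlation with the residualized response; stop when EBIC increases."""
    n, p = X.shape
    y = (y - y.mean()) / y.std()                   # initial standardization
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    active, y_res, X_res = [], y.copy(), X.copy()
    ebic_old = np.inf                              # the first selected feature is always accepted
    while len(active) < min(n - 2, p):
        cand = [j for j in range(p) if j not in active]
        j_new = max(cand, key=lambda j: dcor(y_res, X_res[:, j]))  # maximize distance correlation
        ebic_new = ebic(y, X, active + [j_new], gamma)
        if ebic_new > ebic_old:                    # stopping rule: EBIC(k+1) > EBIC(k)
            break
        active.append(j_new)
        ebic_old = ebic_new
        Xa = X[:, active]
        H = Xa @ np.linalg.solve(Xa.T @ Xa, Xa.T)  # projection onto the selected columns
        y_res = y - H @ y                          # residualize the response ...
        X_res = X - H @ X                          # ... and the remaining features
    coef = np.linalg.lstsq(np.column_stack([np.ones(n), X[:, active]]), y, rcond=None)[0]
    return active, coef                            # selected indices and least-squares estimates

# Small demonstration on simulated sparse data.
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.8 * X[:, 7] + 0.5 * rng.normal(size=n)
print(edc_ebic(y, X)[0])                           # indices of the selected features
```

On simulated data of this kind, the procedure would typically recover the relevant columns and then terminate once adding a further feature increases the EBIC.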
2.4. Selection Consistency of Edc + EBIC
We attempt to establish the large-sample property of the Edc + EBIC. We will show that, under regularity conditions, the Edc + EBIC is selection-consistent. The proof essentially follows the approach in [
9]. We proceed with the regularity conditions.
Assumption 1. Random vectors X and Y possess subexponential tail probabilities, uniformly in p, specified as follows. There is a constant $c_0 > 0$ such that, for any $0 < t \le 2c_0$, $\sup_{p} \max_{1 \le j \le p} E\{\exp(t X_j^2)\} < \infty$ and $E\{\exp(t Y^2)\} < \infty$.
Assumption 2. The minimum distance correlation of the predictors on which $y$ functionally depends satisfies $\min_{j \in s_0} \mathcal{R}(x_j, y) \ge 2c\,n^{-\kappa}$ for some constants $c > 0$ and $0 \le \kappa < 1/2$, where $s_0$ denotes the set of relevant predictors.
Assumption 3. For the index set $S$ of all predictors, let $s \subset S$ and let $|s|$ be the number of elements in the set $s$. For $s \subset S$, let $X(s) = \{x_j : j \in s\}$; $s^c$ is the complement of $s$ in $S$. At each sequential step, the maximum energy distance correlation between the features already in the active set $s$ and the residual response is smaller than the maximum energy distance correlation between the remaining features in $s^c$ and the residual response. At the initial step, the active set is defined as the empty set ∅.
Details for requiring Assumptions 1 and 2 are stated in [
9]. Intuitively, Assumption 1 is required to make it easy to establish a relationship between the energy distance correlation and the squared Pearson correlation to aid with the derivations in the proof. Assumption 2 requires that the energy distance correlation for the relevant predictors cannot be too small. Assumption 3 requires that the maximum energy distance correlation between the selected features and the residual response
is smaller than the maximum energy distance correlation between the remaining features and the residual response in the sequential step of the algorithm.
Theorem 1. Suppose that Assumptions 1–3 hold. The proposed Edc + EBIC with the energy distance correlation is selection-consistent, i.e.,
$$\lim_{n \to \infty} P(s_{*k^{*}} = s_0) = 1,$$
where $s_{*k}$ is the set of features selected at the $k$th step of Edc + EBIC, $s_0$ is the set of relevant features, and $k^{*}$ is the step at which the procedure terminates.
Proof. Suppose that $W$ and $Z$ are random vectors with cumulative distribution functions (CDFs) $F$ and $G$, respectively, where $E\|W\| < \infty$ and $E\|Z\| < \infty$. The energy distance correlation $\mathcal{R}(W, Z)$ is the square root of the standardized coefficient
$$\mathcal{R}^2(W, Z) = \frac{\mathcal{V}^2(W, Z)}{\sqrt{\mathcal{V}^2(W)\,\mathcal{V}^2(Z)}},$$
where $\mathcal{V}^2(W) = \mathcal{V}^2(W, W)$ and $\mathcal{V}^2(Z) = \mathcal{V}^2(Z, Z)$. The numerator is the distance covariance defined by [8] as
$$\mathcal{V}^2(W, Z) = S_1 + S_2 - 2 S_3,$$
where $S_1$, $S_2$ and $S_3$ are defined as
$$S_1 = E\{\|W - W'\|\,\|Z - Z'\|\}, \qquad S_2 = E\|W - W'\|\; E\|Z - Z'\|, \qquad S_3 = E\{\|W - W'\|\,\|Z - Z''\|\},$$
where $(W, Z)$, $(W', Z')$, and $(W'', Z'')$ are independently and identically distributed.
For a random sample $\{(w_k, z_k) : k = 1, \ldots, n\}$ from $(W, Z)$, [8] estimated $S_1$, $S_2$ and $S_3$ as
$$\hat{S}_1 = \frac{1}{n^2}\sum_{k=1}^{n}\sum_{l=1}^{n}\|w_k - w_l\|\,\|z_k - z_l\|, \qquad
\hat{S}_2 = \Big(\frac{1}{n^2}\sum_{k=1}^{n}\sum_{l=1}^{n}\|w_k - w_l\|\Big)\Big(\frac{1}{n^2}\sum_{k=1}^{n}\sum_{l=1}^{n}\|z_k - z_l\|\Big),$$
$$\hat{S}_3 = \frac{1}{n^3}\sum_{k=1}^{n}\sum_{l=1}^{n}\sum_{m=1}^{n}\|w_k - w_l\|\,\|z_k - z_m\|,$$
so the sample distance covariance is $\hat{\mathcal{V}}^2(W, Z) = \hat{S}_1 + \hat{S}_2 - 2\hat{S}_3$.
The remaining part of the proof is to show that the energy distance correlation is uniformly consistent and has the sure screening property. The numerator and denominator of the energy distance correlation have the same form (each is built from averages of products of pairwise distances), so to show the uniform consistency of the energy distance correlation it suffices to show that both the numerator and the denominator are uniformly consistent.
The uniform consistency of the numerator, $\mathcal{V}^2(W, Z)$, of the energy distance correlation between the random vectors $(x_j, y)$ is shown by [9]. However, in the general step of the sequential algorithm for Edc + EBIC, the energy distance correlation is calculated between the residuals $(\tilde{x}_j, \tilde{y})$ at each step of the algorithm. Thus, to show the uniform consistency of Edc + EBIC, it suffices to follow the proof of [9] with the original variables replaced by the residuals.
Additionally, [9] showed that the energy distance correlation has the sure screening property; that is, it selects a subset of the features which contains the relevant features with probability tending to one. Their argument applies here because we use the energy distance correlation as well; thus, the Edc + EBIC has the sure screening property.
Therefore, the Edc + EBIC is selection-consistent since it is uniformly consistent and has the sure screening property. The proof is complete. □