1. Introduction
In traditional statistical models, the response variable is usually studied in the one-dimensional case; see, for example, references [1,2,3,4]. Recently, there has been an increasing need to predict several response variables from a common set of covariates. For example, in bioinformatics, multiple phenotypes of a patient are predicted from measures of genetic variation. In economics, the returns of multiple stocks can be predicted from a common set of economic and market variables. In neuroscience, the responses of different brain regions can be predicted from the input stimulus. Such data analysis problems arise widely in multivariate regression.
For multiple-response regression, one can naively fit a separate linear regression to each response variable, ignoring the inherent correlation structure among the responses. In fact, estimating the coefficient matrix by a separate ordinary least squares regression for each response coincides with maximum likelihood estimation. Another practical problem is that the number of parameters in the regression coefficient matrix can be huge even for a modest number of response variables.
It is natural to think of ways to improve this simple approach. Two directions can be used to reduce the effective number of parameters in the model and to improve its predictive power and interpretability. One direction is to reduce the dimensionality of the coefficient matrix. To take advantage of correlations between the response variables, the rank of the coefficient matrix is restricted, which gives rise to the well-known reduced-rank regression (RRR), as in references [5,6]. RRR allows the response variables to borrow information from each other through a common set of latent factors to improve prediction accuracy. However, each latent factor is a linear combination of all predictors, and when there is a large number of predictor variables, some of them may not be useful for prediction. Therefore, an alternative approach is to reduce the number of free parameters by variable selection, excluding redundant predictors from the latent factors. Rank reduction and variable selection within the RRR framework have been extensively studied in the recent literature [7,8,9,10,11,12]. However, in all these works, the components of the error terms are assumed to be independent and identically distributed. Ref. [13] added variable selection and covariance estimation to the RRR model to improve the predictive performance and facilitate interpretation.
Although RRR can substantially reduce the number of free parameters in multivariate problems, Ref. [14] found that it is extremely sensitive to outliers. In real-world data analysis, outliers and leverage points are bound to occur, and thus the genuine reduced-rank structure can easily be masked or distorted. Ref. [15] proposed a method to detect outliers in a continuous distribution based on the cumulative distribution function. Ref. [16] proposed the use of order statistics to detect extreme values in continuously distributed samples; ten case studies of outlier detection are given in reference [16], and the order-statistics method performs excellently in the comparative study. However, these outlier detection methods do not involve covariates, whereas the effect of covariates needs to be taken into account in our setting.
The research most closely related to this paper includes sparse reduced-rank regression with covariance estimation (Cov-SRRR) in the literature [13] and robust reduced-rank regression (R4) in the literature [14]. However, the Cov-SRRR method does not consider robustness, and the R4 method considers neither the correlations among the error terms nor the sparsity of the coefficient matrix; each addresses only a subset of our objectives. The main contributions of this paper are highlighted as follows.
In this work, the mean-shift method is used to obtain a robust estimator that can explicitly detect outliers, account for general correlations among the error terms, and improve estimation accuracy by applying shrinkage to the precision matrix. In addition, we incorporate variable selection into the model estimation through a penalty function on the coefficient matrix.
A block coordinate descent algorithm is proposed to minimize the objective function and obtain the parameter estimates. The monotonicity of the algorithm is also established.
The rest of the paper is organized as follows. In Section 2, we formally introduce the framework of reduced-rank regression with a sparse mean-shift outlier component and an unknown error covariance matrix; a penalized estimation method is used to obtain sparse estimates. A block coordinate descent algorithm is developed in Section 3 to estimate the RRR model proposed in Section 2. Simulation studies comparing the proposed model to relevant competitors are given in Section 4. A real data application is presented in Section 5. Finally, a summary and brief discussion are given in Section 6.
2. Robust Reduced-Rank Regression Model with Unknown Covariance
Firstly, we write the multivariate regression model in matrix notation as
$$Y = XB + E, \qquad \operatorname{rank}(B) \le r,$$
resulting in the famous RRR model, where $Y$ is the $n \times q$ response matrix, $X$ is the $n \times p$ predictor matrix, $r \le \min(p, q)$ denotes the rank, $B$ is a $p \times q$ coefficient matrix, and $E$ is a random $n \times q$ matrix whose rows are drawn independently from the $N(0, \Sigma)$ distribution.
Ref. [17] investigates outlier detection from the perspective of adding a mean-shift parameter for each data point in a one-dimensional response variable. Ref. [14] introduces a multivariate mean-shift regression model for detecting outliers. We introduce a mean-shift regression model that also accounts for the dependence among the multivariate responses. The mean-shift multivariate response regression model can be written as
$$Y = XB + C + E,$$
where $Y$ is the $n \times q$ response matrix, $X$ is the $n \times p$ predictor matrix, $B$ is the $p \times q$ coefficient matrix, $C$ is the $n \times q$ matrix which describes the outlying effects on $Y$, and $E$ is the $n \times q$ error matrix whose rows are independent draws from $N(0, \Sigma)$, which is a symmetric distribution. The unknown parameters in the model can be collectively written as $(B, C, \Sigma)$ or $(B, C, \Omega)$, where $\Omega = \Sigma^{-1}$ denotes the precision matrix. The assumption that the errors are correlated is necessary for many applications, for example, when the predictors are not sufficient to eliminate all the correlations between the dependent variables. Obviously, this leads to an over-parameterized model, so we must regularize the unknown matrices appropriately. We assume that $B$ has a low-rank structure and that $C$ is a sparse matrix with only a few nonzero entries, corresponding to the observations that are inconsistent with the majority of the data. So the following sparse robust reduced-rank regression problem is proposed:
$$\min_{B,\,C,\,\Omega}\ \frac{1}{2}\operatorname{tr}\Big[\big(Y - XB - C\big)\,\Omega\,\big(Y - XB - C\big)^{\top}\Big] - \frac{n}{2}\log\det(\Omega) + \lambda_{\Omega}\sum_{i \neq j}|\omega_{ij}| + P_1\big(B;\lambda_1\big) + P_2\big(C;\lambda_2\big), \qquad \text{s.t. } \operatorname{rank}(B) \le r,$$
where $P_1$ and $P_2$ are the penalty functions for $B$ and $C$, with penalty parameters $\lambda_1$ and $\lambda_2$, respectively. The term $\omega_{ij}$ is the $(i, j)$ entry of $\Omega$, and the lasso penalty on the off-diagonal entries of $\Omega$ serves two purposes. First, it ensures that the optimal solution for $\Omega$ has a finite objective function value when there are more responses than samples; second, the penalty has the effect of producing a sparse estimate. The penalty $P_1(B;\lambda_1)$ encourages a sparse estimate of the coefficient matrix $B$, and the term $P_2(C;\lambda_2)$ is used to obtain a sparse estimate of $C$. Here $\lambda_{\Omega}$, $\lambda_1$ and $\lambda_2$ are the tuning parameters. Based on the symmetry of the multivariate normal distribution, the covariance matrix or precision matrix must be a symmetric matrix, which reduces the number of parameters and improves the efficiency of parameter estimation. Following reference [18], we estimate the precision matrix using the Gaussian likelihood together with a sparsity-inducing $\ell_1$-penalty on all of its entries,
$$\min_{\Omega \succ 0}\ \operatorname{tr}\big(\hat{\Sigma}\Omega\big) - \log\det(\Omega) + \lambda_{\Omega}\|\Omega\|_1,$$
where tr is the trace of the matrix and $\hat{\Sigma} = \frac{1}{n}\big(Y - XB - C\big)^{\top}\big(Y - XB - C\big)$.
As shown in the above formula, setting $C = 0$ and $\Sigma = \sigma^{2}I_q$ results in the reduced-rank regression (RRR) model given in the book [19]. The reduced-rank restriction amounts to a number of linear constraints on the regression coefficients, which can reduce the number of parameters and improve the estimation efficiency. According to the rank constraint, $B$ can be expressed as a product of two rank-$r$ matrices as follows:
$$B = SV^{\top}, \qquad S \in \mathbb{R}^{p \times r}, \quad V \in \mathbb{R}^{q \times r}.$$
Inspired by the penalized regression with a grouped lasso penalty in the literature [12,20], consider the following optimization problem:
$$\min_{S,\,V}\ \frac{1}{2}\big\|Y - XSV^{\top}\big\|_F^2 + \lambda\sum_{j=1}^{p}\|s_j\|_2, \qquad \text{s.t. } V^{\top}V = I_r,$$
where $s_j$ is the $j$th row of $S$, $\lambda$ is the corresponding tuning parameter and $I_r$ is the identity matrix of order $r$. The sparsity of the rows means that the unselected variables are not associated with any response variable. The element-wise sparsity penalty can also be applied to $S$. These two types of sparsity can be blended: firstly, row sparsity is employed in a screening step to reduce $p$ to some desired dimension $d$ ($d < p$); then, element-wise sparsity is pursued in each individual loading vector. In this paper, similar to the sparse reduced-rank regression (SRRR) in the literature [12], only the row-wise sparsity is considered for $S$.
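As a small illustration of how row-wise sparsity performs variable selection, the sketch below applies a group soft-thresholding (proximal) operator to the rows of $S$; a row shrunk to zero removes the corresponding predictor from every latent factor, and hence from every response, since the coefficient matrix is $SV^{\top}$. The function name and the use of soft-thresholding are illustrative choices, not taken from the references cited above.

```python
import numpy as np

def row_soft_threshold(S, tau):
    """Group soft-thresholding applied row by row: shrink each row's
    Euclidean norm by tau and set the row to zero when its norm is below tau."""
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return S * scale

# A zero row in S means the corresponding predictor enters none of the r
# latent factors, so it is excluded from B = S @ V.T for every response.
```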
The problem (3) can be expressed as
$$\min_{S,\,V,\,C,\,\Omega}\ \frac{1}{2}\operatorname{tr}\Big[\big(Y - XSV^{\top} - C\big)\,\Omega\,\big(Y - XSV^{\top} - C\big)^{\top}\Big] - \frac{n}{2}\log\det(\Omega) + \lambda_{\Omega}\sum_{i \neq j}|\omega_{ij}| + \sum_{j=1}^{p}P_1\big(\|s_j\|_2;\lambda_1\big) + P_2\big(C;\lambda_2\big), \qquad \text{s.t. } V^{\top}V = I_r.$$
As mentioned above, the assumptions of the model are very mild and, therefore, it can be applied to a wide range of applications. In addition, the model parameters are appropriately regularized, so the model fitting should be robust and the results have a clear interpretation. As given in Theorem 2 of the paper [14], when the covariance matrix is not considered, the mean-shift model is equivalent to a robust M-estimation problem for $B$. Both the simulation studies and the form of the model expressions demonstrate the robustness of the method.
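To make the structure of the objective concrete, the following numpy sketch evaluates a penalized objective of the form reconstructed above (Gaussian fit term, log-determinant, lasso on the off-diagonal entries of $\Omega$, a row-wise group penalty on the coefficient matrix and an entry-wise penalty on the outlier matrix). The function name, the argument names and the particular penalty choices are assumptions made for illustration only.

```python
import numpy as np

def cov_sr4_objective(Y, X, B, C, Omega, lam_omega, lam1, lam2):
    """Evaluate a penalized objective of the (assumed) form given above."""
    n = Y.shape[0]
    R = Y - X @ B - C                                  # residual matrix
    fit = 0.5 * np.trace(R @ Omega @ R.T)              # Gaussian fit term
    logdet = -0.5 * n * np.linalg.slogdet(Omega)[1]    # -(n/2) log det(Omega)
    pen_omega = lam_omega * (np.abs(Omega).sum() - np.abs(np.diag(Omega)).sum())
    pen_b = lam1 * np.linalg.norm(B, axis=1).sum()     # row-wise group penalty
    pen_c = lam2 * np.abs(C).sum()                     # element-wise sparsity
    return fit + logdet + pen_omega + pen_b + pen_c
```

Because $V$ has orthonormal columns, the row norms of $B = SV^{\top}$ equal the row norms of $S$, so the row-wise penalty above can equivalently be read as a penalty on the rows of $S$.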
3. Computational Algorithm
This section presents an algorithm for solving the optimization problem (7), iteratively optimizing with respect to $\Omega$, $S$, $V$ and $C$.
For given ($S$, $V$, $C$), the problem of solving for the precision matrix $\Omega$ can be expressed as
$$\min_{\Omega \succ 0}\ \frac{n}{2}\Big[\operatorname{tr}\big(\hat{\Sigma}\Omega\big) - \log\det(\Omega)\Big] + \lambda_{\Omega}\sum_{i \neq j}|\omega_{ij}|,$$
where $\hat{\Sigma} = \frac{1}{n}\big(Y - XSV^{\top} - C\big)^{\top}\big(Y - XSV^{\top} - C\big)$. Reference [21] describes how this problem can be solved by applying the DP-GLASSO algorithm in the R package dpglasso. The DP-GLASSO algorithm optimizes the primal objective directly in terms of the precision matrix, whereas the GLASSO algorithm given in reference [22] solves the dual of the graphical-lasso-penalized likelihood in terms of the covariance matrix. However, reference [21] shows that the DP-GLASSO algorithm performs better than the GLASSO algorithm.
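The DP-GLASSO routine itself is provided by the R package dpglasso; as a rough stand-in, the sketch below performs the $\Omega$-step with scikit-learn's graphical_lasso applied to the empirical covariance of the current residuals. It only illustrates the structure of the update and is not a substitute for DP-GLASSO.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def update_precision(Y, X, B, C, lam_omega):
    """Omega-step: fit an l1-penalized Gaussian likelihood to the current
    residuals (graphical lasso is used here in place of DP-GLASSO)."""
    R = Y - X @ B - C
    emp_cov = R.T @ R / R.shape[0]            # empirical residual covariance
    cov, precision = graphical_lasso(emp_cov, alpha=lam_omega)
    return precision
```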
For fixed $\Omega$ and $C$, the objective of minimization over $B$ is
$$\frac{1}{2}\operatorname{tr}\Big[\big(Y - XB - C\big)\,\Omega\,\big(Y - XB - C\big)^{\top}\Big] + P_1\big(B;\lambda_1\big),$$
subject to the constraint $\operatorname{rank}(B) \le r$. Applying the transformation $\tilde{Y} = (Y - C)\,\Omega^{1/2}$, we have $\operatorname{tr}\big[(Y - XB - C)\,\Omega\,(Y - XB - C)^{\top}\big] = \big\|\tilde{Y} - XB\Omega^{1/2}\big\|_F^2$. The first term in the objective function (9) can be written as
$$\frac{1}{2}\big\|\tilde{Y} - X\tilde{B}\big\|_F^2,$$
where $\tilde{B} = B\,\Omega^{1/2}$ and $\Omega^{1/2}$ denotes a symmetric square root of $\Omega$. When $P_1$ is taken to be the grouped lasso penalty above, the problem (9) becomes a SRRR problem given in the literature [12], which can be solved iteratively. For the sparsity-inducing penalty function $P_1$, there are many choices. The $\ell_1$ penalty in reference [23] is the most popular in the sparse penalty literature, but it suffers from inconsistency and biased estimates, especially when the predictors are correlated. To alleviate those problems, some non-convex penalties such as the $\ell_p$ ($0 < p < 1$) penalty, SCAD and the capped-$\ell_1$ penalty have been proposed to approximate the ideal non-convex $\ell_0$ penalty.
To solve the S-optimization problem for a general penalty function $P_1$, we propose to use a thresholding-rule-based $\Theta$-estimator, since most penalty functions correspond to a thresholding rule. The thresholding-based iterative selection procedure (TISP) in reference [24] can be used to solve the penalized problem for any $P_1$ associated with a thresholding rule (an odd, unbounded monotone shrinkage function) given in reference [17].
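A generic TISP-type iteration is easy to state: alternate a gradient step on the squared-error loss with an application of the thresholding rule. The sketch below uses soft-thresholding (the rule associated with the $\ell_1$ penalty) and a step size based on the spectral norm of $X$; these choices, and the fixed iteration count, are illustrative assumptions rather than the paper's exact scheme.

```python
import numpy as np

def soft_threshold(Z, lam):
    """Soft-thresholding, the rule associated with the l1 penalty."""
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)

def tisp(X, Y, lam, n_iter=200):
    """Generic TISP-style iteration: a gradient step on 0.5*||Y - XB||_F^2
    followed by a thresholding rule (other rules can be plugged in)."""
    L = np.linalg.norm(X, 2)                 # spectral norm of X
    B = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        B = soft_threshold(B + X.T @ (Y - X @ B) / L**2, lam / L**2)
    return B
```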
According to reference [25], a $\Theta$-estimator is linked to a general penalty function $P$ by
$$P(t;\lambda) = \int_{0}^{|t|}\Big[\sup\{s: \Theta(s;\lambda) \le u\} - u\Big]\,du + q(t;\lambda)$$
for some nonnegative $q(\cdot;\lambda)$ such that $q\big(\Theta(s;\lambda);\lambda\big) = 0$ for all $s$. Any penalty associated with such a thresholding rule can be handled in this way. We can apply this technique to solve all popular penalties, including but not limited to the aforementioned $\ell_1$, SCAD, $\ell_p$ and $\ell_0$ penalties. This means the method is universally applicable. The partial derivative of the loss function with respect to $S$ can be calculated as
$$\frac{\partial \ell}{\partial S} = -X^{\top}\big(Y - XSV^{\top} - C\big)\,\Omega\,V.$$
The closed-form solution for $S$ given $V$ is
$$S^{(t+1)} = \vec{\Theta}\Big(S^{(t)} + \tfrac{1}{k_0}\,X^{\top}\big(Y - XS^{(t)}V^{\top} - C\big)\,\Omega\,V;\ \lambda_1/k_0\Big),$$
where $k_0$ is a step-size constant and $\vec{\Theta}$ is the thresholding function corresponding to the penalty $P_1$, applied row-wise. The further details are given in the literature [25].
When $S$ is given, the optimization problem of $V$ becomes
$$\min_{V}\ \operatorname{tr}\Big[\big(Y - XSV^{\top} - C\big)\,\Omega\,\big(Y - XSV^{\top} - C\big)^{\top}\Big], \qquad \text{s.t. } V^{\top}V = I_r.$$
This can be identified as a Procrustes rotation problem, realizable through computing a singular value decomposition; for instance, when $\Omega = I_q$ the optimal $V = UW^{\top}$, where $UDW^{\top}$ is the singular value decomposition of $(Y - C)^{\top}XS$. The full algorithm is presented in Algorithm 1.
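The Procrustes step can be realized with a single SVD. The sketch below solves the generic problem $\max_{V} \operatorname{tr}(V^{\top}M)$ subject to $V^{\top}V = I$; the matrix $M$ fed into it is written generically, since its exact form depends on the reconstruction above (in the unweighted case it would be $(Y - C)^{\top}XS$).

```python
import numpy as np

def procrustes_orthogonal_factor(M):
    """Solve  max_V tr(V^T M)  subject to V^T V = I  via the SVD
    M = U D W^T, whose maximizer is V = U W^T."""
    U, _, Wt = np.linalg.svd(M, full_matrices=False)
    return U @ Wt

# Typical use in the V-step: M would be formed from the current residuals
# and X @ S, e.g. M = (Y - C).T @ (X @ S) in the unweighted case.
```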
For fixed $S$, $V$ and $\Omega$, the problem of solving $C$ reduces to the minimization of
$$F(C) = \frac{1}{2}\operatorname{tr}\Big[\big(Y - XSV^{\top} - C\big)\,\Omega\,\big(Y - XSV^{\top} - C\big)^{\top}\Big] + P_2\big(C;\lambda_2\big).$$
Suppose $\ell(C)$ denotes the first term of (15). To solve problem (15), construct a surrogate function
$$g\big(C; C^{(t)}\big) = \ell\big(C^{(t)}\big) + \Big\langle \nabla\ell\big(C^{(t)}\big),\, C - C^{(t)} \Big\rangle + \frac{k_0}{2}\big\|C - C^{(t)}\big\|_F^2 + P_2\big(C;\lambda_2\big),$$
where $\nabla\ell\big(C^{(t)}\big) = -\big(Y - XSV^{\top} - C^{(t)}\big)\,\Omega$ is the gradient of $\ell$ with respect to $C$, evaluated at $C^{(t)}$. Now define the $(t+1)$th iterate as
$$C^{(t+1)} = \arg\min_{C}\ g\big(C; C^{(t)}\big).$$
The problem of minimizing $F$ boils down to the $g$-optimization in (17), which is simpler than directly minimizing $F$. We rewrite the problem in the form of
$$\min_{C}\ \frac{1}{2}\Big\|C - \Big(C^{(t)} - \tfrac{1}{k_0}\nabla\ell\big(C^{(t)}\big)\Big)\Big\|_F^2 + \frac{1}{k_0}P_2\big(C;\lambda_2\big).$$
To solve this problem (18), for a general penalty function $P_2$, we still propose to use the thresholding-rule-based $\Theta$-estimators. The $\ell_0$ penalty has the appealing property of promoting sparsity and can directly restrict the cardinality of nonzero elements/rows in the estimated matrix. Moreover, it is easy to see that tuning for the constraint form $\|C\|_0 \le \alpha\,nq$ is intuitive and easy compared to the penalty form $\lambda\|C\|_0$. Here $\alpha$ is used as an upper bound for the percentage of nonzero terms. In this paper, to guarantee the sparsity of $C$, we advise using the $\ell_0$ constraint in place of the $\ell_1$ penalty
for tuning ease. By employing a quantile thresholding rule $\Theta^{\#}$, this problem can be solved through TISP easily. The quantile thresholding rule is a special case of hard thresholding, which corresponds to the $\ell_0$ penalty. Let $\Theta^{\#}(\cdot;\alpha)$ be the element-wise quantile thresholding function; then
$$\big[\Theta^{\#}(Z;\alpha)\big]_{ij} = \begin{cases} z_{ij}, & |z_{ij}| > \tau,\\ 0, & |z_{ij}| \le \tau,\end{cases}$$
where $\tau$ is the $(1-\alpha)$ quantile of $\{|z_{ij}|\}$. The tuning parameter $\alpha$ has a defined range $(0,1)$ and serves as an upper bound for the nonzero percentage, which is easier to interpret.
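Under this reading of the element-wise quantile thresholding rule, a direct numpy implementation takes only a few lines; the parameter name alpha for the nonzero-proportion bound is an assumption.

```python
import numpy as np

def quantile_threshold(Z, alpha):
    """Element-wise quantile thresholding: zero out every entry whose
    absolute value is at most the (1 - alpha)-quantile of |Z|, so that at
    most (roughly) a fraction alpha of the entries stays nonzero."""
    cutoff = np.quantile(np.abs(Z), 1.0 - alpha)
    return np.where(np.abs(Z) > cutoff, Z, 0.0)
```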
The row sparsity regularization can be used for variable selection in our multiple response problem and is used in our main algorithm. In particular, for the row sparsity regularization in constraint form, it calls for the multivariate version of quantile thresholding $\vec{\Theta}^{\#}$:
$$\big[\vec{\Theta}^{\#}(Z;\alpha)\big]_{j\cdot} = \begin{cases} z_{j\cdot}, & \|z_{j\cdot}\|_2 > \tau,\\ 0, & \|z_{j\cdot}\|_2 \le \tau,\end{cases}$$
where $\tau$ is the $(1-\alpha)$ quantile of the norms of the row vectors $z_{j\cdot}$. The row-wise constraint version can be realized by simply changing $\Theta^{\#}$ to $\vec{\Theta}^{\#}$.
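The row-wise (multivariate) version replaces entry magnitudes with row norms; again, the names are illustrative.

```python
import numpy as np

def row_quantile_threshold(Z, alpha):
    """Row-wise (multivariate) quantile thresholding: keep a row only if
    its Euclidean norm exceeds the (1 - alpha)-quantile of the row norms."""
    norms = np.linalg.norm(Z, axis=1)
    cutoff = np.quantile(norms, 1.0 - alpha)
    keep = norms > cutoff
    return Z * keep[:, None]
```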
Both the row-wise sparsity and the element-wise sparsity are considered for $C$, in order to identify the outlier rows and the outlier elements, respectively.
Assuming that $B = SV^{\top}$ is constructed by (11), our algorithm iteratively applies the DP-GLASSO, the SRRR, and simple multivariate thresholding operations until the objective function (7) has converged. The complete numerical algorithm is summarized in Algorithm 1.
Algorithm 1: Algorithm to solve Cov-SR4
Require: $Y$, $X$; $r$: the desired rank; $M$: the maximum iteration number; $\epsilon$: error tolerance; and the initial estimates $S^{(0)}$, $V^{(0)}$, $C^{(0)}$.
1: $t \leftarrow 0$;
2: repeat
3: for fixed ($S^{(t)}$, $V^{(t)}$, $C^{(t)}$), solve $\Omega^{(t+1)}$ with objective function (8) using the DP-GLASSO algorithm;
4: for fixed ($\Omega^{(t+1)}$, $C^{(t)}$), solve ($S^{(t+1)}$, $V^{(t+1)}$) defined by the SRRR problem (11);
5: for fixed ($\Omega^{(t+1)}$, $S^{(t+1)}$, $V^{(t+1)}$), the multivariate thresholding operations can be used to solve problem (18) to get $C^{(t+1)}$;
6: $t \leftarrow t + 1$;
7: calculate $F^{(t)}$ and $\Delta F^{(t)} = \big|F^{(t)} - F^{(t-1)}\big|$ according to Equation (7);
8: until $\Delta F^{(t)} < \epsilon$ or $t \ge M$
9: return $S^{(t)}$, $V^{(t)}$, $C^{(t)}$ and $\Omega^{(t)}$.
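For orientation only, the skeleton below mirrors the block structure of Algorithm 1 using the helper sketches given earlier in this section (cov_sr4_objective, update_precision and row_quantile_threshold); the SRRR sub-step is passed in as a hypothetical callable, and the $C$-step here simply thresholds the current residuals, which is a simplification of the surrogate-based update in (16)-(18).

```python
import numpy as np

def cov_sr4(Y, X, r, alpha, lam_omega, lam1, max_iter=100, tol=1e-4,
            srrr_step=None):
    """Illustrative skeleton of the block coordinate descent in Algorithm 1.

    `srrr_step(Y, X, C, Omega, r, lam1)` is a hypothetical callable that
    should return updated (S, V); it stands in for the SRRR sub-problem
    and is not implemented here.
    """
    n, q = Y.shape
    p = X.shape[1]
    S, V = np.zeros((p, r)), np.eye(q, r)              # initial estimates
    C, obj_old = np.zeros((n, q)), np.inf
    for _ in range(max_iter):
        B = S @ V.T
        Omega = update_precision(Y, X, B, C, lam_omega)   # Omega-step, cf. (8)
        if srrr_step is not None:                         # (S, V)-step, cf. (11)
            S, V = srrr_step(Y, X, C, Omega, r, lam1)
            B = S @ V.T
        # C-step: one thresholding pass on the residuals (a simplification
        # of the surrogate-based update (16)-(18)).
        C = row_quantile_threshold(Y - X @ B, alpha)
        obj = cov_sr4_objective(Y, X, B, C, Omega, lam_omega, lam1, 0.0)
        if abs(obj_old - obj) < tol:                      # convergence check
            break
        obj_old = obj
    return S, V, C, Omega
```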
Theorem 1. Assume the largest eigenvalue of $X^{\top}X$ equals $L$. Then, as long as $k_0 \ge L$, the sequence of iterates defined by Algorithm 1 satisfies
$$F\big(S^{(t+1)}, V^{(t+1)}, C^{(t+1)}, \Omega^{(t+1)}\big) \le F\big(S^{(t)}, V^{(t)}, C^{(t)}, \Omega^{(t)}\big),$$
where $F$ denotes the objective function in (7). That is, the objective function values are non-increasing during the iteration.
Proof. Firstly, for fixed $S$, $V$ and $C$, the optimization for $\Omega$ with objective function (8) using the DP-GLASSO algorithm is a block coordinate descent (BCD), which is given in reference [21].
Secondly, for fixed $V$, $C$ and $\Omega$, the optimization for $S$ with closed-form solution (13) is equivalent to the minimization of the following equation
$$g\big(S; S^{(t)}\big) = \ell\big(S^{(t)}\big) + \Big\langle \nabla\ell\big(S^{(t)}\big),\, S - S^{(t)} \Big\rangle + \frac{k_0}{2}\big\|S - S^{(t)}\big\|_F^2 + \sum_{j=1}^{p}P_1\big(\|s_j\|_2;\lambda_1\big).$$
Let $S^{(t+1)} = \arg\min_{S} g\big(S; S^{(t)}\big)$. Then
$$F\big(S^{(t+1)}\big) \le g\big(S^{(t+1)}; S^{(t)}\big) \le g\big(S^{(t)}; S^{(t)}\big) = F\big(S^{(t)}\big)$$
(with $V$, $C$ and $\Omega$ held fixed), where the first inequality comes from the Taylor expansion and $k_0 \ge L$, and the second inequality holds because $S^{(t+1)}$ is a minimum point of $g\big(\cdot\,; S^{(t)}\big)$. The monotonicity of the V-optimization is due to the fact that it is a closed-form minimization.
The monotonicity of the C-optimization can be deduced from the following inequality
$$F\big(C^{(t+1)}\big) \le g\big(C^{(t+1)}; C^{(t)}\big) \le g\big(C^{(t)}; C^{(t)}\big) = F\big(C^{(t)}\big),$$
where the first inequality comes from the Taylor expansion and $k_0 \ge L$, and the second inequality holds because $C^{(t+1)}$ is a minimum point of $g\big(\cdot\,; C^{(t)}\big)$. □