1. Introduction
Consider the following multiple linear model with n observations and k regressors:
$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{u}, \qquad (1)$$
where $\mathbf{y}$ is an $n \times 1$ vector with the observations of the dependent variable, $\mathbf{X}$ is an $n \times k$ matrix containing the observations of the regressors and $\mathbf{u}$ is an $n \times 1$ vector representing a random disturbance (that is assumed to be spherical). Generally, the first column of matrix $\mathbf{X}$ is composed of ones to denote that the model contains an intercept. Thus, $\mathbf{X} = [\mathbf{1}\ \mathbf{X}_{2}\ \cdots\ \mathbf{X}_{k}]$, where $\mathbf{1} = (1, \ldots, 1)^{t}$. This model is considered to be centered.
When this model presents worrying near-multicollinearity (hereinafter, multicollinearity), that is, when the linear relation between the regressors affects the numerical and/or statistical analysis of the model, the usual approach is to transform the regressors (see, for example, Belsley [1], Marquardt [2] or, more recently, Velilla [3]). Since these transformations (centering, typification or standardization) imply the elimination of the intercept, the transformed models are considered to be noncentered. Note that even after transforming the data, it is possible to recover the original (centered) model from the estimations of the transformed (noncentered) model. However, in this paper, we refer to the centered or noncentered model depending on whether the intercept is initially included or not. Thus, given that $\mathbf{X} = [\mathbf{X}_{1}\ \mathbf{X}_{2}\ \cdots\ \mathbf{X}_{k}]$ with $\mathbf{X}_{j} \in \mathbb{R}^{n}$ for $j = 1, \ldots, k$, the model is considered to be centered if $\mathbf{X}_{1} = \mathbf{1}$ and noncentered if $\mathbf{X}_{1} \neq \mathbf{1}$.
From the intercept, it is also possible to distinguish between essential and nonessential multicollinearity:
- Nonessential:
A near-linear relation between the intercept and at least one of the remaining independent variables.
- Essential:
A near-linear relation between at least two of the independent variables (excluding the intercept).
A first idea of these definitions was provided by Cohen et al. [4]: "Nonessential ill-conditioning results simply from the scaling of the variables, whereas essential ill-conditioning results from substantive relationships among the variables." While in some papers the idea of distinguishing between essential and nonessential collinearity is attributed to Marquardt [5], the concept can already be found in Marquardt and Snee [6]. These terms have been widely used not only for linear models but also, for example, for moderated models with interactions and/or a quadratic term. However, these concepts have been analyzed fundamentally from the point of view of the solution of collinearity. Thus, as Marquardt and Snee [6] stated: "In a linear model, centering removes the correlation between the constant term and all linear terms."
The variance inflation factor (VIF) is one of the most applied measures to detect multicollinearity. Following O'Brien [7], a VIF of 10, or even one as low as 4, has commonly been used as a rule of thumb to indicate excessive or serious collinearity. Salmerón et al. [8] show that the VIF does not detect nonessential multicollinearity, while this kind of multicollinearity is detected by the index of Stewart [9] (see Salmerón Gómez et al. [10]). This index has been misunderstood in the literature since its presentation by Stewart, who wrongly identified it with the VIF. Even Marquardt [11], in a published comment on the paper of Stewart [9], stated: "Stewart collinearity indices are simply the square roots of the corresponding variance inflation factor. It is not clear to me whether giving a new name to the square of a VIF is a help or a hindrance to understanding. There is a long and precisely analogous history of using the term 'standard error' for the square root of the corresponding 'variances'. Given the continuing necessity for dealing with statistical quantities on both the scale of the observable and the scale of the observable squared, there may be a place for a new term. Clearly, the essential intellectual content is identical for both terms."
However, Salmerón Gómez et al. [12] show that the VIF and the index of Stewart are not the same measure. This paper analyzes in which cases to use one measure or the other, focusing on the initial distinction between centered and noncentered models. Thus, the algebraic contextualization provided by Salmerón Gómez et al. [12] will be complemented from an econometric point of view. This question was also presented by Jensen and Ramirez [13], who strove to clarify the misuse of the VIF over the decades since its first use and noted: "To choose a model, with or without intercept, is substantive, is specific to each experimental paradigm and is beyond the scope of the present study." They also stated: "This differs between centered and uncentered diagnostics."
This paper, focused on the differences between essential and nonessential multicollinearity in relation to their diagnosis, analyzes the behaviour of the VIF depending on whether model (1) initially includes the intercept or not. For this analysis, it will be considered whether the auxiliary regression used for its calculation is centered or not since, as stated by Groß [14] (p. 304): "Instead of using the classical coefficient of determination in the definition of VIF, one may also apply the centered coefficient of determination. As a matter of fact, the latter definition is more common. We may call VIF uncentered or centered, depending on whether the classical or centered coefficient of determination is used." From the above considerations, a centered VIF only makes sense when the matrix $\mathbf{X}$ contains ones as a column. Additionally, although initially it is possible to find these two kinds of multicollinearity in the centered version of model (1) and only essential multicollinearity in the noncentered version, this paper shows that this statement is subject to some nuances.
On the other hand, throughout the paper the following statement of Cook [15] will be illustrated: "As a matter of fact, the centered VIF requires an intercept in the model but at the same time denies the status of the intercept as an independent 'variable' being possibly related to collinearity effects." Furthermore, another statement was provided by Belsley [16] (p. 29): "The centered VIF has no ability to discover collinearity involving the intercept." Thus, the second part of the paper analyzes why the centered VIF is unable to detect nonessential multicollinearity; for this purpose, the centered coefficient of determination of the centered auxiliary regression used to calculate the centered VIF is analyzed. This analysis will be applied to propose a methodology to detect nonessential multicollinearity from the centered auxiliary regression.
The structure of the paper is as follows:
Section 2 presents the detection of multicollinearity in noncentered models from the noncentered auxiliary regressions,
Section 3 analyzes the effects of high values of the noncentered VIF on the statistical analysis of the model and
Section 4 presents the detection of multicollinearity in centered models from the centered auxiliary regressions.
Section 5 illustrates the contribution of the paper with two empirical applications. Finally,
Section 6 summarizes the main conclusions.
3. Effects of the VIFnc on the Statistical Analysis of the Model
Given the model (1), the expression obtained for the variance of the estimator of the coefficient of the $j$-th independent variable is given by:
$$\operatorname{var}\left(\hat{\beta}_{j}\right) = \frac{\sigma^{2}}{\text{RSS}_{j}}, \qquad (10)$$
where $\text{RSS}_{j}$ is the residual sum of squares of the auxiliary regression of the $j$-th independent variable as a function of the rest of the independent variables (see expression (6)).
From expression (10), and considering that expression (7) can be rewritten as:
$$\text{VIF}_{nc}(j) = \frac{\mathbf{X}_{j}^{t} \mathbf{X}_{j}}{\text{RSS}_{j}},$$
it is possible to obtain:
$$\operatorname{var}\left(\hat{\beta}_{j}\right) = \frac{\sigma^{2}}{\mathbf{X}_{j}^{t} \mathbf{X}_{j}} \cdot \text{VIF}_{nc}(j). \qquad (11)$$
Establishing a model as a reference is required to conclude whether the variance has been inflated (see, for example, Cook [20]). Thus, if the variables in $\mathbf{X}$ are orthogonal, it is verified that $\text{RSS}_{j} = \mathbf{X}_{j}^{t} \mathbf{X}_{j}$ for $j = 1, \ldots, k$. In this case, $\text{VIF}_{nc}(j) = 1$, and consequently, the variance of the estimated coefficients in the hypothetical orthogonal case (denoted with superscript $o$) is given by the following expression:
$$\operatorname{var}\left(\hat{\beta}_{j}^{o}\right) = \frac{\sigma^{2}}{\mathbf{X}_{j}^{t} \mathbf{X}_{j}}.$$
In this case:
$$\operatorname{var}\left(\hat{\beta}_{j}\right) = \text{VIF}_{nc}(j) \cdot \operatorname{var}\left(\hat{\beta}_{j}^{o}\right),$$
and it is then possible to state that the VIFnc is a factor that inflates the variance.
As a consequence, high values of $\text{VIF}_{nc}(j)$ imply high values of $\operatorname{var}(\hat{\beta}_{j})$ and a tendency not to reject the null hypothesis in the individual significance tests of model (1). Thus, the statistical analysis of the model will be affected.
Note from expression (11) that this negative effect can be offset by low values of the estimation of $\sigma^{2}$, that is, low values of the residual sum of squares of model (1), or by high values of the number of observations, n. This is similar to what happens with the VIF (see O'Brien [7] for more details).
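To illustrate this relation numerically, the following is a minimal sketch in R of the computation of the VIFnc through the noncentered auxiliary regression, using the rewritten form of expression (7) above; the simulated variables, their parameters and the helper name vif_nc are illustrative assumptions and are not taken from the paper's examples.

```r
# Minimal sketch: VIFnc from the noncentered (no-intercept) auxiliary
# regression, using the rewritten expression (7): VIFnc(j) = X_j'X_j / RSS_j.
set.seed(1)
n  <- 100
X1 <- rep(1, n)                      # constant term treated as a regressor
X2 <- rnorm(n, mean = 5, sd = 0.01)  # light variability (illustrative values)
X3 <- rnorm(n, mean = 4, sd = 4)

vif_nc <- function(Xj, Xrest) {
  # Auxiliary regression of Xj on the remaining columns, without an
  # additional intercept (the constant, if any, is already in Xrest)
  rss_j <- sum(residuals(lm(Xj ~ Xrest - 1))^2)
  sum(Xj^2) / rss_j
}

vif_nc(X2, cbind(X1, X3))  # very large: near-linear relation with the constant
```

The large value reflects that a light-variability variable is nearly a multiple of the constant column, in line with the variance inflation described by expression (11).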
4. Auxiliary Centered Regressions
The use of the coefficient of determination of the auxiliary regression (6), where matrix $\mathbf{X}$ contains a column of ones representing the intercept, is a very common approach to detect the linear relations between the independent variables of model (1). This is motivated by the fact that the higher the relation between one independent variable and the rest of the independent variables, that is, the higher the multicollinearity, the higher the value of that coefficient of determination.
However, since the coefficient of determination ignores the role of the intercept, this measure is unable to detect the nonessential linear relations. The question is evident: Does another measure exist, related to the auxiliary regression, that allows detection of the nonessential multicollinearity?
4.1. Case When There Is Only Nonessential Multicollinearity
Example 3. Suppose that 100 observations are simulated for variables $\mathbf{X}_{2}$, $\mathbf{X}_{3}$ and $\mathbf{X}_{4}$ from normal distributions with means of 5, 4 and −4 and standard deviations of 0.01, 4 and 0.01, respectively. Note that $\mathbf{X}_{2}$ and $\mathbf{X}_{4}$ present light variability and, for this reason, it is expected that the model presents nonessential multicollinearity.
Then, $\mathbf{y}$ is generated from these variables, simulating the random disturbance as a normal distribution with a mean equal to 0 and a standard deviation equal to 2.
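A sketch of this simulation in R could read as follows; the coefficients used here to generate $\mathbf{y}$ are illustrative placeholders, since the true values are those discussed with Table 5.

```r
# Sketch of the Example 3 setting; the true coefficients generating y are
# illustrative here (the original values are only reported through Table 5).
set.seed(123)
n  <- 100
X2 <- rnorm(n, mean =  5, sd = 0.01)  # light variability
X3 <- rnorm(n, mean =  4, sd = 4)
X4 <- rnorm(n, mean = -4, sd = 0.01)  # light variability
y  <- 3 + 2 * X2 - 5 * X3 + 4 * X4 + rnorm(n, mean = 0, sd = 2)

# Centered VIFs from the centered auxiliary regressions: both stay close
# to 1, so the nonessential multicollinearity goes undetected.
1 / (1 - summary(lm(X2 ~ X3 + X4))$r.squared)
1 / (1 - summary(lm(X4 ~ X2 + X3))$r.squared)
```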
The second column of Table 5 presents the results obtained after the estimation by ordinary least squares (OLS) of the model. Note that the estimations of the coefficients of the model differ substantially from the real values used to generate $\mathbf{y}$, except for the coefficient of the variable $\mathbf{X}_{3}$, which is the variable free of multicollinearity; indeed, it is the only coefficient significantly different from zero at the 5% significance level, the value used by default in this paper. This situation illustrates the fact that if the interest lies in estimating the effect of variable $\mathbf{X}_{3}$ on $\mathbf{y}$, the analysis will not be influenced by the linear relations between the rest of the independent variables. This table also shows the results obtained from the estimations of the centered auxiliary regressions. Note that the coefficients of determination are very small, and consequently, the associated VIFs do not detect the degree of multicollinearity. However, note that in the auxiliary regressions corresponding to variables $\mathbf{X}_{2}$ and $\mathbf{X}_{4}$:
The estimation of the coefficient of the intercept almost coincides with the mean from which each variable was generated, 5 and −4, and, at the same time, the coefficients of the rest of the independent variables are almost zero.
The estimations of the coefficients of the intercept are the only ones significantly different from zero.
Thus, note that the auxiliary regressions are capturing the existence of nonessential multicollinearity. The problem is that it is not transferred to their coefficients of determination but to other characteristics of the regression.
From this finding, it is possible to propose a way to detect the nonessential multicollinearity from the centered auxiliary regression traditionally applied to calculate the VIF:
- Condition 1 (C1):
Quantify the contribution of the estimation of the intercept to the total sum (in absolute value) of the estimations of the coefficients of model (6), that is, calculate:
$$C1 = \frac{\left| \hat{\alpha}_{0} \right|}{\sum_{j} \left| \hat{\alpha}_{j} \right|} \cdot 100,$$
where $\hat{\alpha}_{0}$ is the estimated intercept and the sum runs over all the estimated coefficients (intercept included) of the auxiliary regression (6).
- Condition 2 (C2):
Calculate the number of independent variables with coefficients significantly different from zero and quantify the contribution of the intercept among them (a sketch of both conditions is given after this list).
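The following is a minimal R sketch of both conditions, assuming a centered auxiliary regression fitted with lm() and the 5% significance level used by default in this paper; the helper name c1_c2 is hypothetical.

```r
# Sketch of conditions C1 and C2 for a centered auxiliary regression
# fitted with lm(); the helper name c1_c2 is hypothetical.
c1_c2 <- function(aux, alpha = 0.05) {
  coefs <- coef(aux)
  # C1: contribution (in %) of the intercept to the total sum, in
  # absolute value, of all the estimated coefficients.
  C1 <- 100 * abs(coefs["(Intercept)"]) / sum(abs(coefs))
  pvals  <- summary(aux)$coefficients[, "Pr(>|t|)"]
  signif <- pvals < alpha
  # C2: contribution (in %) of the intercept among the coefficients
  # significantly different from zero; NA if none is significant.
  C2 <- if (any(signif)) 100 * signif["(Intercept)"] / sum(signif) else NA
  c(C1 = unname(C1), C2 = unname(C2))
}

# Using the variables simulated for Example 3:
c1_c2(lm(X2 ~ X3 + X4))  # C1 close to 100%, C2 = 100%
```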
A Monte Carlo simulation is presented considering the model (1), where the first variable has been generated as a normal distribution with fixed mean and variance and the second variable has been generated as a normal distribution whose mean and variance range over predefined sets of values. The results are presented in Table 6. Taking into account that the sample size has also been varied within a predefined set, 235,872 iterations have been performed.
Considering the thresholds established by Salmerón Gómez et al. [10], 90% of the simulations present values for condition C1 between 99.402% and 99.999% in the first scenario and between 95.485% and 99.999% in the second. Thus, we can consider that values of condition C1 higher than 95.485% indicate that the centered auxiliary regressions are detecting the presence of nonessential multicollinearity.
Table 7 shows that a high value can be obtained for condition C1 even when no estimated coefficient is significantly different from zero (C2 = NA).
Thus, the previous threshold, 95.485%, will be considered valid only if it is accompanied by a high value for the second condition.
Example 4. Applying these criteria to the data of Example 1 for Mod1, it is obtained that:
In the first auxiliary regression, the estimation of the intercept is equal to 99.988% of the total, and the individual significance of the intercept corresponds to 100% of the significant estimated coefficients.
In the second auxiliary regression, the estimation of the intercept is equal to 99.988% of the total, and the individual significance of the intercept corresponds to 100% of the significant estimated coefficients.
Thus, the symptoms shown in the previous simulation also appear, and consequently, in both situations, the nonessential multicollinearity will be detected.
Replicating both situations in which the VIFnc was not able to detect the nonessential multicollinearity, it is once again shown that, with this procedure, it is possible to detect the nonessential multicollinearity and the variables that are causing it.
4.2. Relevance of a Variable in a Regression Model
Note that the conditions C1 and C2 are focused on measuring the relevance of one of the variables, in this case the intercept, within the multiple linear regression model. It is interesting to analyze the behavior of other measures with this same goal, as, for example, the relevance index of Stewart [9]. Given model (1), Stewart defined the relevance of the $i$-th variable as a number expressed in terms of the usual Euclidean norm, and he considered that a variable with a relevance higher than 0.5 should not be ignored.
Example 5. Table 8 presents the calculation of Stewart's relevance for the situations shown in Example 1. Note that in all cases, the intercept will be considered relevant, even when one variable is analyzed as a function of another for which it was previously shown that the intercept was not relevant (at least in relation to nonessential multicollinearity). Thus, the application of Stewart's relevance does not seem to be appropriate, contrary to what happens with conditions C1 and C2.
4.3. Case When There Is Generalized Nonessential Multicollinearity
Example 6. Suppose that the previous simulation is repeated, except for the generation of the variable $\mathbf{X}_{3}$, which, in this case, is constructed from a variable generated from a normal distribution with a mean equal to 2 and a standard deviation equal to 0.01, so that all the regressors now present light variability.
Table 9 presents the results of the OLS estimation of the model and of its possible auxiliary regressions. In this case, none of the coefficients is significantly different from zero, and the estimated coefficients are very far from the real values used in the simulation.
In relation to the auxiliary regressions, it is possible to conclude that:
In the first auxiliary regression, the coefficients that are significantly different from zero are those of the intercept and one of the remaining variables. At the same time, the estimation of the coefficient of the intercept differs from the mean from which the dependent variable of this regression was generated. In this case, the contribution of the estimation of the intercept is equal to 83.837% of the total and represents 50% of the coefficients significantly different from zero.
In the second auxiliary regression, the coefficients significantly different from zero are again those of the intercept and one of the remaining variables. In this case, the contribution of the estimation of the intercept is equal to 53.196% of the total and represents 50% of the coefficients significantly different from zero.
In the third auxiliary regression, the symptoms shown in the previous section are maintained. In this case, the contribution of the intercept is equal to 95.829% of the total and represents 100% of the coefficients significantly different from zero.
Finally, although it will require a deeper analysis, these last results indicate that the estimated coefficients that are significantly different from zero in the auxiliary regression point to the variables responsible for the existing linear relation (intercept included).
Note that the existence of generalized nonessential multicollinearity distorts the symptoms previously detected. Thus, the facts that, in a centered auxiliary regression, the contribution (in absolute terms) of the estimation of the intercept to the total sum (in absolute value) of all the estimations is close to 100% and that the estimation of the intercept is the only one significantly different from zero are indications of nonessential multicollinearity. However, it is possible for these symptoms not to be manifested even though worrisome nonessential multicollinearity exists. Thus, these conditions are sufficient but not necessary.
However, in the situations shown in Example 6, where conditions C1 and C2 are not verified, the VIFnc is equal to 1,109,259.3, 758,927.7 and 100,912.7. Note that these results complement the results presented in the previous section in relation to the VIFnc: the VIFnc detects generalized nonessential multicollinearity, while conditions C1 and C2 detect the traditional nonessential multicollinearity given by Marquardt and Snee [6].
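This complementarity can be illustrated with the following R sketch, which reuses the vif_nc and c1_c2 helpers sketched above; the construction of $\mathbf{X}_{3}$ from $\mathbf{X}_{2}$ is an assumption made for illustration and does not reproduce the exact generation of Example 6.

```r
# Illustrative generalized nonessential setting: every regressor has light
# variability; the dependence of X3 on X2 is an assumption, not the exact
# construction used in Example 6.
set.seed(7)
n  <- 100
X2 <- rnorm(n, mean =  5, sd = 0.01)
X3 <- X2 + rnorm(n, mean = 2, sd = 0.01)  # light-variability relation with X2
X4 <- rnorm(n, mean = -4, sd = 0.01)

vif_nc(X2, cbind(rep(1, n), X3, X4))  # huge: the VIFnc detects the problem
c1_c2(lm(X2 ~ X3 + X4))               # C1 below the 95.485% threshold and
                                      # C2 no longer 100%: conditions not met
```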
6. Conclusions
The distinction between essential and nonessential multicollinearity and its diagnosis has not been adequately treated in either the scientific literature or statistical software, and this lack of information has led to mistakes in some relevant papers, for example, Velilla [3] or Jensen and Ramirez [13]. This paper analyzes the detection of essential and nonessential multicollinearity from centered and noncentered auxiliary regressions, obtaining two complementary measures that are able to detect both kinds of multicollinearity. The relevance of the results is that they are obtained within an econometric context, encompassing the distinction between centered and noncentered models, and not only from a numerical perspective, as was the case, for example, in Salmerón Gómez et al. [12] or Salmerón Gómez et al. [10]. An undoubtedly interesting point of view on this situation is the one presented by Spanos [38], who stated: "It is argued that many confusions in the collinearity literature arise from erroneously attributing symptoms of statistical misspecification to the presence of collinearity when the latter is misdiagnosed using unreliable statistical measures." That is, the distinction related to the econometric model provides confidence in the measures of detection and avoids the problems commented on by Spanos.
From a computational point of view, this debate clarifies what is calculated when the VIF is obtained for centered and noncentered models. It also clarifies (see Section 2.3) what type of multicollinearity is detected (and why) when the uncentered VIF is calculated in a centered model. At the same time, a definition of nonessential multicollinearity is presented that generalizes the definition given by Marquardt and Snee [6]. Note that this generalization could be understood as a particular kind of essential multicollinearity: a near-linear relation between two independent variables with light variability. However, it is shown that this kind of multicollinearity is not detected by the VIF, and for this reason, we consider it more appropriate to include it within the nonessential multicollinearity.
In relation to the application of the VIFnc, this paper shows that the VIFnc detects the essential and the generalized nonessential multicollinearity, and even the traditional nonessential multicollinearity if it is calculated in a regression without the intercept but including the constant as an independent variable. Note that the VIF, although widely applied in many different fields, only detects the essential multicollinearity. This paper has also analyzed why the VIF is unable to detect the nonessential multicollinearity, and two conditions are presented as sufficient (but not necessary) to establish the existence of nonessential multicollinearity. Since these conditions, C1 and C2, are based on the relevance of the intercept within the centered auxiliary regression used to calculate the VIF, they were compared to the measure proposed by Stewart [9] to quantify the relative importance of a variable within a multiple linear regression. It is shown that conditions C1 and C2 are preferable to the calculation of Stewart's relevance.
To summarize:
A centered model can present essential, generalized nonessential and traditional nonessential collinearity (given by Marquardt and Snee [6]), while in a noncentered model it is only possible to find the essential and the generalized nonessential collinearity.
The VIF only detects the essential collinearity, the VIFnc detects the generalized nonessential and the essential collinearity, and the conditions C1 and C2 detect the traditional nonessential collinearity.
When there is generalized nonessential collinearity, it is understood that there is also traditional nonessential collinearity, but this is not detected by the conditions C1 and C2. Thus, in this case, it is necessary to use other alternative measures, such as the coefficient of variation or the condition number.
To conclude, in order to detect the kind of multicollinearity and its degree, the greatest possible number of measures should be used (variance inflation factors, the condition number, the correlation matrix and its determinant, the coefficient of variation, conditions C1 and C2, etc.), as in Section 5; it is inefficient to limit oneself to only a few of them. Similarly, it is necessary to know what kind of multicollinearity each one of them is capable of detecting.
Finally, the following will be interesting as future lines of inquiry:
to establish a threshold for the VIFnc,
to extend the Monte Carlo simulation of Section 4.1 to models with more regressors,
to analyze more deeply whether the variables responsible for the existing linear relation can be identified as those whose estimated coefficients are significantly different from zero in the auxiliary regression (see Example 6) and
to develop a specific package in R (R Core Team [39]) to perform the calculation of the VIFnc and conditions C1 and C2.