Article

Cross-Validation for Lower Rank Matrices Containing Outliers

by Sergio Arciniegas-Alarcón 1, Marisol García-Peña 2,* and Wojtek J. Krzanowski 3
1 Facultad de Ingeniería, Universidad de La Sabana, Campus Puente del Común, Km. 7 Autopista Norte, Chía 140013, Colombia
2 Departamento de Matemáticas, Pontificia Universidad Javeriana, Carrera 7 40-62, Bogotá 110231, Colombia
3 College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter EX4 4QF, UK
* Author to whom correspondence should be addressed.
Appl. Syst. Innov. 2022, 5(4), 69; https://doi.org/10.3390/asi5040069
Submission received: 18 June 2022 / Revised: 11 July 2022 / Accepted: 15 July 2022 / Published: 19 July 2022
(This article belongs to the Section Applied Mathematics)

Abstract:
Several statistical techniques for analyzing data matrices use lower rank approximations to these matrices, for which, in general, the appropriate rank must first be estimated depending on the objective of the study. The estimation can be conducted by cross-validation (CV), but most methods are not designed to cope with the presence of outliers, a very common problem in data matrices. The literature suggests one option to circumvent the problem, namely, the elimination of the outliers, but such information removal should only be performed when it is possible to verify that an outlier effectively corresponds to a collection or typing error. This paper proposes a methodology that combines the robust singular value decomposition (rSVD) with a CV scheme, and this allows outliers to be taken into account without eliminating them. For this, three possible rSVDs are considered and six resistant criteria are proposed for the choice of the rank, based on three statistics classically used in multivariate analysis. To test the performance of the various methods, a simulation study and an analysis of real data are described, using an exclusively numerical evaluation through Procrustes statistics and critical angles between subspaces of principal components. We conclude that, when data matrices are contaminated with outliers, the best estimation of rank is the one that uses a CV scheme over a robust lower rank approximation (RLRA) containing as many components as possible. In our experiments, the best results were obtained when this RLRA was calculated using an rSVD that minimizes the L2 norm.

1. Introduction

The singular value decomposition (SVD) is a mathematical result that allows the calculation of a low-rank approximation of any matrix and serves as the basis of many statistical methods used in data analysis [1]. The application areas are diverse: for example, SVD can be used in principal component analysis [2], the imputation of missing data [3], the graphical representation of multivariate data [4], the formulation of models to explain the interaction in two-way tables [5,6] and the non-parametric analysis of time series [7], to mention just a few.
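As a minimal illustration of this building block (a base R sketch on simulated data, not code from the paper), the best rank-m approximation in the least squares sense is obtained by truncating the SVD:

```r
# Minimal sketch (base R): truncating the SVD of a simulated matrix to rank m.
set.seed(1)
Y <- matrix(rnorm(100 * 8), nrow = 100, ncol = 8)
m <- 4
s <- svd(Y)                                            # Y = U D V^T
Ym <- s$u[, 1:m] %*% diag(s$d[1:m]) %*% t(s$v[, 1:m])  # keep first m triplets
# By the Eckart-Young theorem, Ym minimizes ||Y - X|| over all rank-m matrices X.
```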
A problem inherent in the use of SVD is the determination of the appropriate rank of the approximation, i.e., the appropriate number of components to be retained, and a very convenient way to solve this problem uses the resampling technique known as cross-validation. Standard cross-validation consists of subdividing the study matrix into a certain number of groups, deleting each group in turn, evaluating the parameters of a chosen predictor from the remaining data and using the result to predict the deleted values [8].
A numerical measure of agreement between predicted and actual values can then be computed for each possible rank, and the best rank to choose is the one providing the best calculated measure. Readers interested in classic references, alternative procedures, method comparisons and new developments of the subject can refer to the works of Bro et al. [9], Owen and Perry [10], Josse and Husson [11], Camacho and Ferrer [12] and Saccenti and Camacho [13].
Among the wide variety of methods for conducting the cross-validation, one considered as classic and influential is the Eastment and Krzanowski [14] (EK) scheme proposed in 1982, and which currently (April 2022) has 452 citations in Google Scholar. The method depends on the SVD, so does not rest on any distributional or structural assumptions. Moreover, it provides exact calculations, so does not have the convergence problems that can occur with expectation maximization (EM) approximations [15].
A frequent problem in data analysis is the presence of outliers [16], and the performance of the EK method can be affected by them because of its use of the SVD, which is a least squares technique. To circumvent the problem, an option used by Krzanowski [17] was to compare the results obtained for the complete data with those for the reduced data, after eliminating the outliers. To the best of our knowledge, no specific study has been conducted to evaluate the effect of outliers on the EK method when applied to lower rank matrices; therefore, the aim of the current paper is to propose a methodology that combines a robust singular value decomposition (rSVD) with the EK method and does not require the elimination of outliers. The proposed methodology is therefore applicable to any dataset under some suspicion of contamination in which the outliers do not correspond to collection or typing errors, since in that case the elimination of the offending information is perfectly sensible.
This article is organized as follows: the EK method is presented first, followed by the various proposed methods along with associated statistics that robustify the criterion for choosing the rank of a matrix. Subsequently, three rSVDs are presented that can be used to reduce the effect of outliers without eliminating them. Then, a simulation study is described for evaluating possible cross-validation strategies at different levels of contamination using different resistant statistics. A study of a real dataset is also described to compare the proposed alternatives, and finally the results are presented together with a relevant discussion.

2. EK Method

Consider a standardized data matrix $Y$ $(n \times p)$ with elements $y_{ij}$ $(i = 1, \dots, n;\ j = 1, \dots, p)$ and $n \ge p$ (if $n < p$, the matrix should first be transposed), for which we wish to determine the best lower rank approximation. For this, Eastment and Krzanowski's [14] cross-validation scheme can be used, which quantifies the idea of "acceptable accuracy" in terms of predicting the elements of $Y$. The scheme is based on the fact that the element $y_{ij}$ in the i-th row and j-th column of $Y$ can be written as a multiplicative model using the SVD, that is:
$$y_{ij} = \sum_{t=1}^{p} u_{it}\, s_t\, v_{tj} \qquad (1)$$
Krzanowski [8] used this representation as a basis for determining the dimensionality of a multivariate dataset: if the data structure is essentially m-dimensional, then the variation in the remaining (p-m) dimensions can be treated as random noise. For this reason, it can be assumed that the main characteristics of the data are found in the space of the first m components. This means that the data can be written according to an m-component model, such as:
$$y_{ij} = \sum_{t=1}^{m} u_{it}\, s_t\, v_{tj} + \varepsilon_{ij} \qquad (2)$$
where $\varepsilon_{ij}$ is a residual term; setting $\varepsilon_{ij}$ equal to zero, the expression results in a predictor of $y_{ij}$:
$$\hat{y}_{ij}(m) = \sum_{t=1}^{m} u_{it}\, s_t\, v_{tj} \qquad (3)$$
Since the calculations of $u_{it}$, $s_t$ and $v_{tj}$ involve all values of $Y$, the predictor of $y_{ij}$ uses the value of $y_{ij}$ itself. This is an undesirable feature in cross-validation because it can be a source of bias. To avoid this bias, Eastment and Krzanowski [14] suggested the following scheme: suppose we are looking for the prediction of $y_{ij}$ in $Y$. First, the i-th row of $Y$ is deleted and the SVD of the resulting $((n-1) \times p)$ matrix $Y^{(i)}$ is calculated as $Y^{(i)} = \bar{U}\bar{D}\bar{V}^T$, with $\bar{U} = (\bar{u}_{sh})$, $\bar{V} = (\bar{v}_{sh})$ and $\bar{D} = \mathrm{diag}(\bar{d}_1, \dots, \bar{d}_p)$. The next step is to delete the j-th column of $Y$ and obtain the SVD of the $(n \times (p-1))$ matrix $Y^{(j)}$ as $Y^{(j)} = \tilde{U}\tilde{D}\tilde{V}^T$, with $\tilde{U} = (\tilde{u}_{sh})$, $\tilde{V} = (\tilde{v}_{sh})$ and $\tilde{D} = \mathrm{diag}(\tilde{d}_1, \dots, \tilde{d}_{p-1})$. The matrices $\bar{U}$, $\bar{V}$, $\tilde{U}$ and $\tilde{V}$ are orthonormal, while $\tilde{D}$ and $\bar{D}$ are diagonal. By combining the two SVDs of $Y^{(i)}$ and $Y^{(j)}$, a predictor of $y_{ij}$ based on the m-component model is given by
$$\hat{y}_{ij}(m) = \sum_{h=1}^{m} \tilde{u}_{ih} \left( \tilde{d}_h\, \frac{p}{p-1} \right)^{1/2} \bar{v}_{jh} \left( \bar{d}_h\, \frac{n}{n-1} \right)^{1/2} \qquad (4)$$
This predictor is slightly modified relative to the one proposed by Eastment and Krzanowski [14], following the work of Bro et al. [9] and Arciniegas-Alarcón et al. [3], because the inclusion of the constants $p/(p-1)$ and $n/(n-1)$ improves the quality of the predictions. On the other hand, in order to avoid computational problems, a parity check should be done in each prediction by matching the sign of $\tilde{u}_{ih} (\tilde{d}_h\, p/(p-1))^{1/2}\, \bar{v}_{jh} (\bar{d}_h\, n/(n-1))^{1/2}$ in (4) to the sign of $u_{ih}\, d_h\, v_{jh}$ obtained from the SVD of the $Y$ matrix for each m.
After establishing the predictor of the observations, an overall measure of predictive accuracy of the m-component model is given by
$$PRESS(m) = \frac{1}{np} \sum_{i=1}^{n} \sum_{j=1}^{p} \left( \hat{y}_{ij}(m) - y_{ij} \right)^2 \qquad (5)$$
This function can be computed for each value of m, where $m = 1, \dots, p-1$, and an optimal choice of m (the rank of $Y$) can then be based on some appropriate function of these values. The suggestion made by Krzanowski [17] and Krzanowski and Kline [18] was to calculate the statistic
$$W_m = \frac{PRESS(m-1) - PRESS(m)}{n + p - 2m} \div \frac{PRESS(m)}{D_R} \qquad (6)$$
for each m, where $D_R = (n - m - 1)(p - m)$. To calculate $W_1$, it is necessary to define PRESS(0); in this case, one can use $\hat{y}_{ij}(0) = 0$ (Forkman and Piepho [19]), or $\hat{y}_{ij}(0)$ can be the mean of the j-th column without the element (i,j) (Carlos Tadeu dos Santos Dias, personal communication, 3 August 2021). Finally, the m-th component of the SVD can be considered "important" if $W_m > 0.9$, and the total number of important components constitutes an estimate of the optimal rank of the matrix $Y$ according to predictive criteria. In those cases where $W_m < 0.9$ for all components, the rank estimate can be determined by the component with the highest value of $W_m$. The acronym EK82 can be used to identify the method described above.
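To make the structure of these calculations concrete, the following is a didactic base R sketch of the EK82 scheme under the simplifying assumptions stated in the comments; in particular, the parity check described above is omitted, so this illustrates the shape of the computations rather than reproducing the authors' published implementation:

```r
# Didactic sketch of the EK82 scheme (base R), for a standardized matrix Y with n >= p.
# Simplification: the sign parity check of the text is omitted.
ek82 <- function(Y, threshold = 0.9) {
  n <- nrow(Y); p <- ncol(Y)
  svd_row <- lapply(1:n, function(i) svd(Y[-i, ]))   # SVDs with the i-th row deleted
  svd_col <- lapply(1:p, function(j) svd(Y[, -j]))   # SVDs with the j-th column deleted
  press <- sapply(1:(p - 1), function(m) {
    err <- 0
    for (i in 1:n) {
      for (j in 1:p) {
        sc <- svd_col[[j]]                           # supplies u-tilde and d-tilde in (4)
        sr <- svd_row[[i]]                           # supplies v-bar and d-bar in (4)
        u <- sc$u[i, 1:m] * sqrt(sc$d[1:m] * p / (p - 1))
        v <- sr$v[j, 1:m] * sqrt(sr$d[1:m] * n / (n - 1))
        err <- err + (sum(u * v) - Y[i, j])^2        # squared prediction error
      }
    }
    err / (n * p)                                    # PRESS(m) as in (5)
  })
  press0 <- mean(Y^2)                                # PRESS(0) with yhat = 0
  W <- sapply(1:(p - 1), function(m) {
    prev <- if (m == 1) press0 else press[m - 1]
    ((prev - press[m]) / (n + p - 2 * m)) / (press[m] / ((n - m - 1) * (p - m)))
  })
  rank <- if (any(W > threshold)) sum(W > threshold) else which.max(W)
  list(PRESS = press, W = W, rank = rank)
}
```

For instance, ek82(scale(Y)) applied to the simulated matrix shown earlier returns the PRESS and $W_m$ sequences together with the selected rank.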

3. Proposed Methodology

The choice of the optimal rank for a lower rank approximation using the EK method basically depends on two aspects: the predictions of the elements of $Y$ and the criterion used to determine whether a component is important or not. It is known that the quality of predictions using the standard SVD decreases in the presence of outliers in the data matrix [20]. For that reason, our first attempt at circumventing this problem was to create a robust version of the EK scheme. For this, the usual standardization of $Y$ was replaced by a robust standardization and, in the calculation of the elements of the predictor described in (4), the SVDs of $Y^{(i)}$ and $Y^{(j)}$ were replaced by an rSVD (details of its calculation are given in the next section).
Although this approach produced good-quality predictions in small test matrices (for instance, 20 × 5), its main disadvantage was computational: the implementation for larger matrices became too heavy due to the inclusion of the rSVDs. In this way, the problem of outliers was circumvented, but the computational speed of the EK method was lost, which could become a problem for large matrices. Thus, we now have a more complex problem: how do we improve the predictions of the EK scheme in the presence of outliers, without eliminating them, while at the same time preserving the computational speed of the method as much as possible?
To resolve this issue, we decided to build on the work of Arciniegas-Alarcón et al. [21], and our proposal was to conduct the cross-validation in two stages: (i) calculate an rSVD of the original matrix $Y$ and obtain a robust lower rank approximation $Y_{RLRA}$ with the maximum possible number of components, that is, the smaller of n and p. With this $Y_{RLRA}$, we can obtain a good approximation of the original values and, in the case of outliers, a robust approximation without eliminating any information, following the maximum-data precept. (ii) The EK method is then applied to $Y_{RLRA}$, and the optimal rank is determined. Such a combination of methodologies (rSVD + EK) was not found in our literature review, so it becomes a possible way to perform a robust cross-validation on contaminated lower rank matrices that is computationally efficient as the size of the study matrix increases.
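In outline, and reusing the ek82() sketch above, the two-stage proposal could be written as follows; here rsvd_fit is a placeholder of ours for any robust SVD of the kind described in the next section:

```r
# Sketch of the proposed two-stage cross-validation (rSVD + EK).
# Assumption: rsvd_fit(Y) returns a list with components u, d, v,
# in the same layout as base R's svd().
robust_ek <- function(Y, rsvd_fit) {
  k <- min(nrow(Y), ncol(Y))
  r <- rsvd_fit(Y)
  # Stage (i): robust lower rank approximation with the maximum number of components
  Y_rlra <- r$u[, 1:k] %*% diag(r$d[1:k], k, k) %*% t(r$v[, 1:k])
  # Stage (ii): the classic EK method applied to the robust approximation
  ek82(Y_rlra)
}
```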
Determining the rank depends on the chosen criterion; therefore, in addition to $W_m$, we considered several criteria based on the trimmed PRESS [22]. The following is a list of the criteria we considered (a sketch of the trimmed statistics follows the list):
  • PRESS: According to Equation (5), the optimal rank is the one that minimizes the statistic. Bro et al. [9] found that, in some cases, this criterion may be more effective than $W_m$.
  • PRESS75: A resistant PRESS statistic constructed by averaging the 75% smallest squared errors $(\hat{y}_{ij}(m) - y_{ij})^2$. The optimal rank is the one that minimizes PRESS75.
  • PRESS50: A resistant PRESS statistic constructed by averaging the 50% smallest squared errors $(\hat{y}_{ij}(m) - y_{ij})^2$. The optimal rank is the one that minimizes PRESS50.
  • $W_m$: According to Equation (6), the optimal rank is the total number of important components.
  • $W_m(Max)$: The optimal rank is the number of the largest important component using $W_m$. Krzanowski [8] found that, in some datasets, the $W_m$ statistic does not always show a monotonically decreasing behaviour. For example, if $W_1 = 26.22$, $W_2 = 5.95$, $W_3 = 0.19$ and $W_4 = 1.03$, the optimal rank is 4, even though component 3 is not important.
  • $W_m75$: In Equation (6), PRESS is replaced by PRESS75 and the optimal rank is the total number of important components.
  • $W_m75(Max)$: The optimal rank corresponds to the number of the largest important component using $W_m75$.
  • $W_m50$: In Equation (6), PRESS is replaced by PRESS50 and the optimal rank is the total number of important components.
  • $W_m50(Max)$: The optimal rank corresponds to the number of the largest important component using $W_m50$.
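As announced above, a minimal sketch of the trimmed statistics follows; whether the cut keeps ceiling or floor of the retained fraction is a detail we fix here by assumption:

```r
# Sketch of the trimmed PRESS statistics, given the vector of all np squared
# cross-validation errors obtained for a fixed number of components m.
press_trimmed <- function(sq_err, keep = 0.75) {
  k <- ceiling(keep * length(sq_err))
  mean(sort(sq_err)[1:k])   # average of the 100*keep% smallest squared errors
}
# press_trimmed(e2, 0.75) gives PRESS75 and press_trimmed(e2, 0.50) gives PRESS50;
# the W-type variants follow by substituting these values into Equation (6).
```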

4. Robust Singular Value Decompositions (rSVDs)

The cross-validation proposed in this paper depends directly on the rSVDs used initially; therefore, to delimit the research, we considered the proposals of Gabriel and Odoroff [23], Hawkins et al. [24] and Zhang et al. [25] for the actual computational procedures. Below, we provide a brief outline of each proposal, but a complete algorithmic description is available from García-Peña et al. [20].
Gabriel and Odoroff's [23] proposal cyclically used fits of rank-one approximations and obtained the residuals after each of these fits. They inserted medians and trimmed means into the technique of reciprocal averaging used to minimize the L2 norm. A computational implementation in the R statistical environment [26] can be found in a study by Arciniegas-Alarcón et al. [21]. The acronym EK84 will be used to describe the methodology that combines this rSVD with the EK scheme.
On the other hand, Hawkins et al. [24] found the eigenvalues and eigenvectors of each component of the rSVD through an iterative procedure minimizing the L1 norm. The drawbacks are that this rSVD can be affected by the presence of leverage points, the eigenvectors can be non-orthogonal and the eigenvalues do not always appear in decreasing order. A computational implementation is available in the pcaMethods package of R [26]. The acronym EK01 is used to describe the methodology that combines this rSVD with the EK scheme.
The rSVD by Zhang et al. [25] computed a sequence of robust rank-one approximations, with robust estimates obtained by minimizing the Huber function in an iteratively re-weighted least squares algorithm. The procedure is implemented in the RobRSVD package of R [26]. The acronym EK13 is used to describe the methodology that combines this rSVD with the EK scheme.
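For orientation only, a hedged usage sketch of the two packaged rSVDs follows; the function names are those distributed with the packages as far as we are aware, but argument lists, defaults and return shapes should be checked against each package's documentation before use:

```r
# Hedged usage sketch of two packaged rSVDs on a data matrix Y (not the paper's code).
Y <- matrix(rnorm(100 * 8), 100, 8)

# EK01 ingredient: Hawkins et al. L1 rSVD via robustSvd() from Bioconductor's
# pcaMethods (returns u, d, v in the layout of base R's svd(), as far as we know).
# BiocManager::install("pcaMethods")
rs <- pcaMethods::robustSvd(Y)

# EK13 ingredient: Zhang et al. Huber-loss rSVD from the RobRSVD package; the
# function fits the leading robust component, and later components are obtained
# from residuals. See ?RobRSVD for the robustness and smoothing options.
# install.packages("RobRSVD")
r1 <- RobRSVD::RobRSVD(Y)

# The Gabriel-Odoroff rSVD used by EK84 has an R implementation described in
# Arciniegas-Alarcón et al. [21].
```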

5. Simulation Study

To evaluate the performance of the cross-validation schemes (i.e., EK01, EK13, EK82 and EK84) on contaminated low-rank matrices, one hundred matrices of dimension (100 × 8) were simulated using the following process. "Clean" observations were generated as $Y = T + E$, where $T = AB^T = [t_{ij}]$ is a rank-four matrix with $A$ (100 × 4) and $B$ (8 × 4), and $E$ (100 × 8) is "pure noise". The elements of $A$, $B$ and $E$ were iid N(0, 1). Outliers were then produced in each $Y$ matrix in different percentages (0, 5, 10 and 20%); their positions were chosen randomly and the values were generated from the normal distribution with mean $\mu_j + 100\sigma_j^2$ and variance $\sigma_j^2$, where $\mu_j$ and $\sigma_j^2$ represent the mean and variance of the j-th column. Contaminated matrices are denoted by $Y_C$.
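A base R sketch of this simulation design (our code, with 10% contamination as an example) might read:

```r
# Sketch of the simulation design above (base R).
set.seed(2022)
n <- 100; p <- 8; r <- 4
A <- matrix(rnorm(n * r), n, r)
B <- matrix(rnorm(p * r), p, r)
E <- matrix(rnorm(n * p), n, p)              # "pure noise"
Y <- A %*% t(B) + E                          # rank-four signal plus noise
# Contaminate, e.g., 10% of the cells at random positions
Yc  <- Y
idx <- sample(n * p, size = round(0.10 * n * p))
jj  <- (idx - 1) %/% n + 1                   # column of each cell (column-major order)
mu  <- colMeans(Y)[jj]
s2  <- apply(Y, 2, var)[jj]
Yc[idx] <- rnorm(length(idx), mean = mu + 100 * s2, sd = sqrt(s2))
```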
The four cross-validation schemes were applied to the $Y_C$ matrices to determine their rank m (each scheme with each statistic considered can provide a different rank). The quality of this choice was evaluated from the predictive point of view by calculating the corresponding (robust) lower rank approximation with m components ($\hat{Y}$) and comparing it with the original matrix $Y$ without contamination. For this, the Procrustes $M^2$ statistic was used [27]: $M^2 = \mathrm{trace}(YY^T + \hat{Y}\hat{Y}^T - 2YQ\hat{Y}^T)$, where $Q = VU^T$ is the rotation matrix calculated from the SVD of the matrix $Y^T\hat{Y} = U\Sigma V^T$. The $M^2$ statistic measures the difference between two configurations of points, so the (robust) lower rank approximation that minimizes this difference indicates the rank selection method that yields the closest match between the clean data and the calculated approximation in the presence of outliers.
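A compact sketch of this statistic, written directly from the definition above:

```r
# Sketch of the Procrustes M^2 statistic comparing the clean matrix Y with a
# (robust) lower rank approximation Yhat of the contaminated matrix.
procrustes_m2 <- function(Y, Yhat) {
  s <- svd(t(Y) %*% Yhat)                    # Y^T Yhat = U Sigma V^T
  Q <- s$v %*% t(s$u)                        # rotation Q = V U^T
  sum(Y^2) + sum(Yhat^2) - 2 * sum(diag(Y %*% Q %*% t(Yhat)))
}
```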
Another criterion for comparing the proposed methods, following the work of Krzanowski [8,28], was the critical angle ($\theta$) between two principal component subspaces. For this, both the SVD of $Y = UDV^T$ and the (r)SVD of $Y_C = WJK^T$ were calculated. If the cross-validation shows that the rank is m, the matrices $V$ and $K$ with m retained components are compared, that is, $V_{(m)}$ and $K_{(m)}$. The critical angle is defined by $\theta = \cos^{-1}(d)$, where d is the smallest element of $G$ in the SVD of the matrix $V_{(m)}^T K_{(m)} = MGP^T$. The greater the critical angle, the greater the influence of the outliers on the principal component coefficients; therefore, the best cross-validation scheme is the one that provides the lowest value of $\theta$.
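And a corresponding sketch of the critical angle; the (r)SVD applied to the contaminated matrix is passed in as a function, with base R's svd() as a placeholder:

```r
# Sketch of the critical angle between the m-dimensional principal component
# subspaces of the clean matrix Y and the contaminated matrix Yc.
critical_angle <- function(Y, Yc, m, rsvd_fit = svd) {
  V <- svd(Y)$v[, 1:m, drop = FALSE]         # loadings of the clean data
  K <- rsvd_fit(Yc)$v[, 1:m, drop = FALSE]   # (robust) loadings of the contaminated data
  d <- min(svd(t(V) %*% K)$d)                # smallest singular value of V^T K
  acos(min(max(d, 0), 1))                    # theta = arccos(d), clamped for safety
}
```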
For each percentage of outliers considered, one hundred values of the Procrustes statistic and one hundred values of the critical angle were obtained. From these values, the mean of the 90% lowest values was calculated. This trimmed average is a robust criterion that discounts occasional failures of a method while, in general, still reflecting its behaviour well [20,29].

6. Real Data

In addition to the simulation study, the cross-validation schemes were applied to the real dataset previously studied by Skov et al. [30] and Bro and Smilde [31], from which we take the main description of the data: "Red wines, 44 samples, produced from the same grape (Cabernet sauvignon) were collected. A Foss WineScan instrument was used to measure 14 characteristic parameters of the wines such as the ethanol content, pH, etc. Hence a dataset consists of 44 samples and 14 variables. The actual measurements can be arranged in a table or a matrix of size (44 × 14)". The data are available at http://www.models.life.ku.dk/Wine_GCMS_FTIR (accessed on 1 April 2022) and, as it is a multivariate matrix with different measurement scales, we used the standardized matrix in all our analyses.
A difficulty in evaluating schemes on real data is the lack of a priori knowledge of what "good behaviour" should be. Because of this, and following the recommendation of Maronna and Yohai [29], we added outliers to the dataset following the same procedure described in the simulation study and using the same comparison criteria; the difference is that, unlike the simulations, the contamination was applied only once.

7. Results

Table 1 presents the trimmed means used to evaluate the cross-validation schemes when the $M^2$ statistic was applied in the simulation study. It can be seen that, without contamination (i.e., outliers = 0%), the best method is EK82 because it minimizes the value of the criterion. This result is to be expected because the standard SVD provides very good results in the absence of outliers. It can also be seen that, of the resistant criteria considered for choosing the rank of the matrices, the one that provided the best results was PRESS50. This indicates that using only the 50% smallest residuals in the cross-validation may be sufficient to select the rank.
On the other hand, when there is some degree of contamination in the simulations, according to Procrustes statistics, methods EK01, EK13 and EK84 always present better performances (lower values) than EK82 using any of the criteria considered (resistant or not). This indicates that the estimation of the rank in the presence of outliers should be performed with one of the robust procedures rather than the standard EK method.
Taking into account predictive criteria and comparing the robust methodologies, it is very clear that the EK84 method always provides the best results, being the most consistent and stable in the simulations regardless of the percentage of outliers. EK84 shows the highest efficiency (lower $M^2$ values) using PRESS50 as the rank selection criterion when the contamination level is low (5%), but when the level is intermediate (10%) or high (20%), the criterion with the best performance is $W_m$.
Table 2 presents the performance of the schemes in the simulations using critical angles as the criterion. It is observed that the smallest angles (very close to zero) are obtained with the EK82 method in the matrices without outliers, while in the face of contamination at any of the three percentages considered, EK84 is the best method. These results confirm what was found with the Procrustes statistics. As for the criterion for choosing the rank in matrices without outliers, we found that any of them can be used, although according to the critical angles it is not recommended to use either $W_m$ or $W_m(Max)$; in the presence of outliers, the EK84 scheme in combination with $W_m$ or PRESS provides the best results.
The simulation study was complemented by the analysis based on the real matrix from the multivariate characterization of wines. Table 3 presents the $M^2$ values; it is observed that, without outliers, the EK82 method again works very well with PRESS50 and, unlike in the simulations, $W_m50(Max)$ becomes an alternative criterion. With these criteria, the rank chosen was nine and, according to the analysis by Bro and Smilde [31], with nine components the explained variation is above 90%, although some of these components may explain very little variation.
According to the cross-validation study by Bro and Smilde [31], a maximum of three or four components is sufficient for the original data matrix. This same result was obtained with the EK82 method using $W_m(Max)$ and $W_m75$; however, if $W_m75$ is used as the criterion, the EK13 method with five components would be preferable, as it presents better predictions in the absence of contamination.
For the wine matrix contaminated at the three levels, it can be verified that the EK84 methodology always presents the best results, minimizing the Procrustes statistic. In this case, with 5, 10 and 20% contamination, the ranks chosen were four, five and three, respectively, but the selection criteria differed depending on the number of outliers. With a low level of outliers (5%), EK84 produced the same rank with all criteria (resistant or not); with an intermediate level, the only criterion that did not work very well was PRESS75; and with a high level of contamination, the best-performing criteria were $W_m$ and $W_m(Max)$.
Finally, Table 4 presents the results for the wine matrix taking into account only the critical angles. As expected, without contamination the EK82 method was the best, with most criteria presenting angles close to zero, but there was a wide multiplicity of ranks with approximately equal results. For this reason, in this specific situation, it is recommended to take the lowest rank (three), which corresponds to the $W_m(Max)$ criterion.
With the change of criterion, at the maximum level of contamination (20%), EK84 (with $W_m$ and $W_m(Max)$) again showed the best performance, selecting rank three to obtain the lowest critical angle. Results that we did not expect occurred at the other contamination levels (5 and 10%): EK84 was surpassed by EK82 (with $W_m50(Max)$) and EK01 (with $W_m75$), which gave the smallest angles. Compared with those obtained from the simulations, these last results suggest that the methods can become unstable if the rank is selected on the basis of these angles. Because of this, it is suggested in practice to repeat the procedure several times when the contamination level is between 5 and 10%. We repeated the procedure ten times for these two percentages of outliers (not shown) and found that, for the wine matrix, the lowest average of the critical angles was obtained with EK01 (using $W_m75$).
A short comment now follows regarding the distributional and computational aspects. The methods considered in this research depend on (r)SVDs; therefore, they are distribution-free and only require that the dataset under study can be written in a matrix form. On the other hand, the computational characteristics of the algorithms depend on several factors, such as implementation used, computer characteristics, matrix dimension, correlation structure and outliers’ quantity and location. This last aspect is important because, in some cases, the location and quantity of outliers may or may not favor the calculations and the rapid convergence of the iterative procedures that involve the calculation of an (r)SVD.
To give the reader an idea of the times of each algorithm, we analyzed one of the previously simulated matrices and the real data matrix of red wines on a personal computer with an Intel(R) Core(TM) i7-8550U CPU (1.80 GHz) and 8 GB of RAM. Table 5 shows the comparison of the times of the different cross-validation methods. According to the table, the best average times, both for the simulated lower rank matrix and for the real multivariate data, were provided by EK82, which can be considered an expected result because, unlike the proposed methods, this method does not include any extra treatment in the presence of outliers. Clearly, the robust methods need a little more time, which by construction could also be expected. However, for the dimensions of the matrices studied, the robust algorithms present good average times of less than twenty seconds, which in practice makes them quite competitive, taking into account that they adequately treat all observations, including outliers.

8. Discussion and Conclusions

The simulation study and the analysis of the real dataset provide some very clear conclusions. In lower rank matrices without suspected contamination, the classic EK82 can be used without problems together with a simple statistic such as PRESS50, which uses only the 50% smallest residuals; although it is a resistant criterion, it presented good results even in scenarios without outliers. According to the results of our analysis, other competitive criteria were $W_m50(Max)$ (which also depends on PRESS50) and $W_m(Max)$.
However, when there is any suspicion of contamination, EK82 should be replaced by one of the proposed robust schemes. Thus, if the objective is to choose the rank that produces good predictions with robust lower rank approximations, the EK84 method presented the best performance together with the $W_m$ criterion in most of the situations studied.
On the other hand, if the objective is to minimize the critical angle between two principal component subspaces, EK84 (with PRESS or $W_m$) can be used if the contamination level is high (20%). From our analysis of real data, we can conclude that at lower contamination levels (5 or 10%), EK84 (with PRESS or $W_m$) should be compared with EK01 (with $W_m75$), repeating the analysis several times (say, 10 or more) to obtain consistent results. This can be achieved, for example, by adding a small number of outliers at random positions or by applying the methods to submatrices obtained from the original study matrix.
The methodologies proposed in this paper proved to be efficient in solving the contamination problem without having to use any method of detecting outliers in the original data and without eliminating information, but there are still some aspects that are worth discussing. One of these refers to the simulation study. Similar to any simulation study, more features can be added to make it more complex, but in our case it was enough to show the performance of robust procedures taking the EK82 as the standard method. Our idea of fixing the matrix rank and the level of “pure noise” was to detect the strengths and weaknesses of the schemes when the contamination levels were variable, following ideas similar to those used by Maronna and Yohai [29] and Rodrigues et al. [32]. Future research from the computational point of view can be conducted to determine the robustness of the procedures under other conditions.
A common problem in data analysis is performing cross-validation on incomplete matrices. In this case, two options are suggested. The first is to use EK84: this method is based on the rSVD of Gabriel and Odoroff [23], whose fits of rank-one approximations leave out the missing positions and do not require data imputation for this calculation; and if an imputation of the missing data is required, a robust lower rank approximation can be a very useful alternative when contamination is suspected. The second suggestion, to get around the problem of missing data in cross-validation without suspected contamination, is to apply an iterative EK82, for which there are already algorithms available [3,33] that can easily be adapted for the optimal selection of the rank of a matrix. Arciniegas-Alarcón et al. [3] showed that an iterative algorithm using the parity check provides better quality imputations than the expectation-maximization method proposed by Bro et al. [9].
For the matrix dimensions at which the cross-validation schemes were tested in this study, computational time was not a problem, but if the matrix is of a much larger dimension, there are ways to circumvent this potential problem. For example, Krzanowski and Kline [18] used several random samples of the rows of a multivariate matrix to achieve consistency in the rank selection and reduce computational time. Another alternative is to delete some random positions from the matrix and perform the imputation following one of the previously mentioned schemes. In this case, between 10 and 30% of the original data can be considered as missing, and a comparison between the real data and the imputations can be made to choose the optimal rank. This approach is used in the R imputation package (https://github.com/jeffwong/imputation) (accessed on 1 April 2022). In the literature, there is also the option to perform leave-group-out elimination in the EK82 method [34]; this alternative reduces calculation times, but the computational gain is accompanied by a decrease in accuracy in determining the optimal rank of a matrix.
Finally, future research could be conducted to continue testing the methods proposed here. For example, more types of contamination could be considered; the works of Maronna and Yohai [29] and González-Cebrián et al. [35] offer further alternatives for producing outliers. Moreover, there is still no robust cross-validation procedure in the literature for choosing the best additive main effects and multiplicative interaction model with robust interaction (robust AMMI model [32]), so the EK84 (or EK01) scheme could be tested for the choice of that model. The methodologies presented here can also be compared with those that detect outliers [36] and treat the outliers as missing values for later imputation by EM algorithms or multiple imputation [37]. Lastly, the EK84 and EK01 schemes are based on the EK method, which could be replaced by Gabriel's method [4] or by the eigenvector method [9], which can be computationally faster for large matrices.

Author Contributions

Conceptualization, S.A.-A., M.G.-P. and W.J.K.; methodology, S.A.-A., M.G.-P. and W.J.K.; software, S.A.-A., M.G.-P. and W.J.K.; validation, S.A.-A., M.G.-P. and W.J.K.; formal analysis, S.A.-A., M.G.-P. and W.J.K.; investigation, S.A.-A., M.G.-P. and W.J.K.; resources, S.A.-A., M.G.-P. and W.J.K.; data curation, S.A.-A., M.G.-P. and W.J.K.; writing—original draft preparation, S.A.-A., M.G.-P. and W.J.K.; writing—review and editing, S.A.-A., M.G.-P. and W.J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors of this paper acknowledge the High-Performance Computing Center–ZINE of Pontificia Universidad Javeriana for assistance during the simulation study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Good, I.J. Some applications of the singular value decomposition of a matrix. Technometrics 1969, 11, 823–831.
  2. Geladi, P.; Linderholm, J. Principal component analysis. In Comprehensive Chemometrics, 2nd ed.: Chemical and Biochemical Data Analysis; Brown, S., Tauler, R., Walczak, B., Eds.; Elsevier: Amsterdam, The Netherlands, 2020; pp. 17–37.
  3. Arciniegas-Alarcón, S.; García-Peña, M.; Krzanowski, W.J. Imputation using the singular-value decomposition: Variants of existing methods, proposed and assessed. Int. J. Innov. Comput. Inf. Control 2020, 16, 1681–1696.
  4. Gabriel, K.R. Le biplot–outil d'exploration de données multidimensionelles. J. Soc. Française Stat. 2002, 143, 5–55.
  5. Gauch, H. A simple protocol for AMMI analysis of yield trials. Crop Sci. 2013, 53, 1860–1869.
  6. Yan, W. Crop Variety Trials: Data Management and Analysis; Wiley Blackwell: Hoboken, NJ, USA, 2014.
  7. Rodrigues, P.C.; Lourenço, V.; Mahmoudvand, R. A robust approach to singular spectrum analysis. Qual. Reliab. Eng. Int. 2018, 34, 1437–1447.
  8. Krzanowski, W.J. Cross-validation in principal component analysis. Biometrics 1987, 43, 575–584.
  9. Bro, R.; Kjeldahl, K.; Smilde, A.K.; Kiers, H.A.L. Cross-validation of component models: A critical look at current methods. Anal. Bioanal. Chem. 2008, 390, 1241–1251.
  10. Owen, A.B.; Perry, P. Bi-cross-validation of the SVD and the nonnegative matrix factorization. Ann. Appl. Stat. 2009, 3, 564–594.
  11. Josse, J.; Husson, F. Selecting the number of components in principal component analysis using cross-validation approximations. Comput. Stat. Data Anal. 2012, 56, 1869–1879.
  12. Camacho, J.; Ferrer, A. Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: Theoretical aspects. J. Chemom. 2012, 26, 361–373.
  13. Saccenti, E.; Camacho, J. On the use of the observation-wise k-fold operation in PCA cross-validation. J. Chemom. 2015, 29, 467–478.
  14. Eastment, H.T.; Krzanowski, W.J. Cross-validatory choice of the number of components from a principal component analysis. Technometrics 1982, 24, 73–77.
  15. Dias, C.T.S.; Krzanowski, W.J. Model selection and cross-validation in additive main effect and multiplicative interaction (AMMI) models. Crop Sci. 2003, 43, 865–873.
  16. Liu, Y.J.; Tran, T.; Postma, G.; Buydens, L.M.C.; Jansen, J. Estimating the number of components and detecting outliers using Angle Distribution of Loading Subspaces (ADLS) in PCA analysis. Anal. Chim. Acta 2018, 1020, 17–29.
  17. Krzanowski, W.J. Cross-validatory choice in principal component analysis: Some sampling results. J. Stat. Comput. Simul. 1983, 18, 299–314.
  18. Krzanowski, W.J.; Kline, P. Cross-validation for choosing the number of important components in principal component analysis. Multivar. Behav. Res. 1995, 30, 149–165.
  19. Forkman, J.; Piepho, H.P. Parametric bootstrap methods for testing multiplicative terms in GGE and AMMI models. Biometrics 2014, 70, 639–647.
  20. García-Peña, M.; Arciniegas-Alarcón, S.; Krzanowski, W.J.; Duarte, D. Missing value imputation using the robust singular-value decomposition: Proposals and numerical evaluation. Crop Sci. 2021, 61, 3288–3300.
  21. Arciniegas-Alarcón, S.; García-Peña, M.; Rengifo, C.; Krzanowski, W.J. Techniques for robust imputation in incomplete two-way tables. Appl. Syst. Innov. 2021, 4, 62.
  22. Hubert, M.; Engelen, S. Fast cross-validation of high-breakdown resampling methods for PCA. Comput. Stat. Data Anal. 2007, 51, 5013–5024.
  23. Gabriel, K.R.; Odoroff, C.L. Resistant lower rank approximation of matrices. In Data Analysis and Statistics III; Diday, E., Jambu, M., Lebart, L., Thomassone, Eds.; North-Holland: Amsterdam, The Netherlands, 1984; pp. 23–30.
  24. Hawkins, D.M.; Liu, L.; Young, S.S. Robust Singular Value Decomposition; Technical Report 122; National Institute of Statistical Sciences: Washington, DC, USA, 2001.
  25. Zhang, L.; Shen, H.; Huang, J.Z. Robust regularized singular-value decomposition with application to mortality data. Ann. Appl. Stat. 2013, 7, 1540–1561.
  26. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020; ISBN 3-900051-07-0. Available online: https://www.r-project.org/ (accessed on 1 April 2022).
  27. Krzanowski, W.J. Principles of Multivariate Analysis: A User's Perspective; Oxford University Press: Oxford, UK, 2000.
  28. Krzanowski, W.J. Between-group comparison of principal components–some sampling results. J. Stat. Comput. Simul. 1982, 15, 141–154.
  29. Maronna, R.A.; Yohai, V.J. Robust low-rank approximation of data matrices with elementwise contamination. Technometrics 2008, 50, 295–304.
  30. Skov, T.; Ballabio, D.; Bro, R. Multiblock variance partitioning: A new approach for comparing variation in multiple data blocks. Anal. Chim. Acta 2008, 615, 18–29.
  31. Bro, R.; Smilde, A.K. Principal component analysis. Anal. Methods 2014, 6, 2812–2831.
  32. Rodrigues, P.C.; Monteiro, A.; Lourenço, V.M. A robust AMMI model for the analysis of genotype × environment data. Bioinformatics 2016, 32, 58–66.
  33. Krzanowski, W.J. Missing value imputation in multivariate data using the singular value decomposition of a matrix. Biom. Lett. 1988, 25, 31–39.
  34. Eshghi, P. Dimensionality choice in principal component analysis via cross-validatory methods. Chemom. Intell. Lab. Syst. 2014, 130, 6–13.
  35. González-Cebrián, A.; Arteaga, F.; Folch-Fortuny, A.; Ferrer, A. How to simulate outliers with desired properties. Chemom. Intell. Lab. Syst. 2021, 212, 104301.
  36. Grentzelos, C.; Caroni, C.; Barranco-Chamorro, I. A comparative study of methods to handle outliers in multivariate data analysis. Comput. Math. Methods 2020, 3, e1129.
  37. Alkan, B.B.; Atakan, C.; Alkan, N. A comparison of different procedures for principal component analysis in the presence of outliers. J. Appl. Stat. 2015, 42, 1716–1722.
Table 1. The 0.9-trimmed means of Procrustes values in the simulation study.

Outliers = 0%
Criterion | EK01 | EK13 | EK82 | EK84
PRESS | 928 | 411 | 405 | 2326
PRESS75 | 845 | 367 | 357 | 2284
PRESS50 | 802 | 357 | 315 | 2273
Wm | 1773 | 1547 | 1563 | 2364
Wm(Max) | 1431 | 1017 | 1000 | 2359
Wm75 | 1222 | 841 | 826 | 2295
Wm75(Max) | 1102 | 647 | 614 | 2292
Wm50 | 1177 | 822 | 806 | 2280
Wm50(Max) | 1041 | 640 | 619 | 2279

Outliers = 5%
Criterion | EK01 | EK13 | EK82 | EK84
PRESS | 1,349,656 | 5,347,643 | 5,517,974 | 2503
PRESS75 | 1,593,606 | 5,978,138 | 6,160,294 | 2500
PRESS50 | 2,085,285 | 7,208,180 | 7,361,273 | 2491
Wm | 6,121,399 | 12,296,349 | 12,339,272 | 2513
Wm(Max) | 6,145,795 | 12,296,349 | 12,339,272 | 2511
Wm75 | 1,802,253 | 6,876,657 | 7,043,231 | 2498
Wm75(Max) | 2,427,846 | 8,446,853 | 8,567,883 | 2500
Wm50 | 2,659,677 | 8,260,863 | 8,433,306 | 2494
Wm50(Max) | 4,369,880 | 10,172,078 | 10,266,390 | 2493

Outliers = 10%
Criterion | EK01 | EK13 | EK82 | EK84
PRESS | 5,929,179 | 9,858,840 | 10,533,642 | 2726
PRESS75 | 6,503,877 | 10,051,928 | 10,592,223 | 2764
PRESS50 | 9,915,026 | 13,550,429 | 13,794,540 | 2771
Wm | 16,969,901 | 23,472,838 | 23,671,632 | 2703
Wm(Max) | 16,969,901 | 23,472,838 | 23,671,632 | 2706
Wm75 | 6,890,112 | 10,644,089 | 11,219,853 | 2741
Wm75(Max) | 8,091,972 | 11,470,833 | 11,918,769 | 2748
Wm50 | 10,431,960 | 15,358,453 | 15,708,124 | 2753
Wm50(Max) | 13,724,787 | 18,830,770 | 19,093,611 | 2760

Outliers = 20%
Criterion | EK01 | EK13 | EK82 | EK84
PRESS | 16,219,919 | 20,858,768 | 23,719,760 | 3794
PRESS75 | 16,585,561 | 20,389,484 | 23,240,805 | 3875
PRESS50 | 24,876,820 | 26,596,384 | 26,604,794 | 3896
Wm | 32,095,002 | 43,746,109 | 47,429,575 | 3491
Wm(Max) | 32,300,042 | 43,746,109 | 47,429,575 | 3529
Wm75 | 17,200,893 | 20,911,456 | 23,301,683 | 3710
Wm75(Max) | 18,713,152 | 21,288,046 | 23,488,022 | 3736
Wm50 | 23,995,906 | 27,336,626 | 27,389,422 | 3769
Wm50(Max) | 28,428,412 | 31,084,469 | 29,862,533 | 3799

In bold, the method with the lowest statistic value in each percentage of outliers.
Table 2. The 0.9-trimmed means of critical angles in the simulation study.

Outliers = 0%
Criterion | EK01 | EK13 | EK82 | EK84
PRESS | 0.8547 | 0.1242 | 0.0000 | 0.9791
PRESS75 | 0.9473 | 0.1574 | 0.0000 | 1.1286
PRESS50 | 1.0053 | 0.1935 | 0.0000 | 1.1629
Wm | 0.8177 | 0.7221 | 0.7330 | 1.0202
Wm(Max) | 0.6050 | 0.2747 | 0.2094 | 1.0174
Wm75 | 0.6846 | 0.0950 | 0.0000 | 1.1048
Wm75(Max) | 0.7120 | 0.0695 | 0.0000 | 1.0948
Wm50 | 0.6963 | 0.0956 | 0.0000 | 1.1163
Wm50(Max) | 0.7274 | 0.0824 | 0.0000 | 1.1184

Outliers = 5%
Criterion | EK01 | EK13 | EK82 | EK84
PRESS | 1.5524 | 1.5680 | 1.5674 | 1.0965
PRESS75 | 1.5361 | 1.5136 | 1.5256 | 1.1683
PRESS50 | 1.4639 | 1.4059 | 1.4189 | 1.2177
Wm | 1.2733 | 1.2049 | 1.1807 | 1.1067
Wm(Max) | 1.2754 | 1.2049 | 1.1807 | 1.1059
Wm75 | 1.4223 | 1.2850 | 1.2965 | 1.1224
Wm75(Max) | 1.4773 | 1.3853 | 1.3826 | 1.1292
Wm50 | 1.1589 | 1.2008 | 1.2478 | 1.1582
Wm50(Max) | 1.3298 | 1.2757 | 1.2939 | 1.1666

Outliers = 10%
Criterion | EK01 | EK13 | EK82 | EK84
PRESS | 1.5224 | 1.5708 | 1.5612 | 1.1699
PRESS75 | 1.5100 | 1.5568 | 1.5652 | 1.2546
PRESS50 | 1.3777 | 1.3662 | 1.4003 | 1.2339
Wm | 1.2399 | 1.2184 | 1.2110 | 1.1507
Wm(Max) | 1.2399 | 1.2184 | 1.2110 | 1.1608
Wm75 | 1.4913 | 1.5103 | 1.5134 | 1.2072
Wm75(Max) | 1.4939 | 1.5216 | 1.5465 | 1.2083
Wm50 | 1.3135 | 1.1962 | 1.2189 | 1.2315
Wm50(Max) | 1.3118 | 1.2863 | 1.2883 | 1.2278

Outliers = 20%
Criterion | EK01 | EK13 | EK82 | EK84
PRESS | 1.5331 | 1.5397 | 1.5462 | 1.2667
PRESS75 | 1.5506 | 1.5619 | 1.5650 | 1.2965
PRESS50 | 1.3658 | 1.3845 | 1.4704 | 1.3262
Wm | 1.3682 | 1.1988 | 1.1768 | 1.1356
Wm(Max) | 1.3618 | 1.1988 | 1.1768 | 1.1540
Wm75 | 1.5051 | 1.5372 | 1.5628 | 1.2210
Wm75(Max) | 1.5164 | 1.5613 | 1.5680 | 1.2497
Wm50 | 1.3567 | 1.3086 | 1.3888 | 1.2847
Wm50(Max) | 1.3588 | 1.3698 | 1.4449 | 1.2950

In bold, the method with the lowest statistic value in each percentage of outliers.
Table 3. Procrustes statistics for the real dataset.

Outliers = 0%
Criterion | EK01 | R | EK13 | R | EK82 | R | EK84 | R
PRESS | 185 | 9 | 66 | 7 | 162 | 4 | 493 | 4
PRESS75 | 185 | 9 | 66 | 7 | 53 | 7 | 493 | 4
PRESS50 | 280 | 5 | 66 | 7 | 24 | 9 | 492 | 7
Wm | 300 | 4 | 469 | 1 | 455 | 1 | 491 | 3
Wm(Max) | 280 | 5 | 231 | 3 | 222 | 3 | 491 | 3
Wm75 | 280 | 5 | 129 | 5 | 162 | 4 | 493 | 4
Wm75(Max) | 280 | 5 | 66 | 7 | 53 | 7 | 493 | 4
Wm50 | 300 | 4 | 94 | 6 | 82 | 6 | 493 | 4
Wm50(Max) | 300 | 4 | 38 | 9 | 24 | 9 | 493 | 4

Outliers = 5%
Criterion | EK01 | R | EK13 | R | EK82 | R | EK84 | R
PRESS | 564 | 1 | 50,690 | 1 | 63,605 | 1 | 475 | 4
PRESS75 | 564 | 1 | 50,690 | 1 | 63,605 | 1 | 475 | 4
PRESS50 | 564 | 1 | 50,690 | 1 | 63,605 | 1 | 475 | 4
Wm | 130,404 | 7 | 291,105 | 11 | 247,466 | 6 | 475 | 4
Wm(Max) | 130,404 | 7 | 291,105 | 11 | 247,466 | 6 | 475 | 4
Wm75 | 564 | 1 | 99,950 | 2 | 63,605 | 1 | 475 | 4
Wm75(Max) | 564 | 1 | 130,657 | 3 | 63,605 | 1 | 475 | 4
Wm50 | 30,753 | 2 | 130,657 | 3 | 123,006 | 2 | 475 | 4
Wm50(Max) | 101,359 | 6 | 252,100 | 8 | 173,123 | 3 | 475 | 4

Outliers = 10%
Criterion | EK01 | R | EK13 | R | EK82 | R | EK84 | R
PRESS | 598 | 1 | 79,846 | 1 | 124,788 | 1 | 500 | 5
PRESS75 | 598 | 1 | 79,846 | 1 | 124,788 | 1 | 514 | 6
PRESS50 | 598 | 1 | 79,846 | 1 | 124,788 | 1 | 500 | 5
Wm | 188,330 | 5 | 250,402 | 4 | 343,767 | 4 | 500 | 5
Wm(Max) | 188,330 | 5 | 250,402 | 4 | 343,767 | 4 | 500 | 5
Wm75 | 611 | 2 | 79,846 | 1 | 124,788 | 1 | 499 | 5
Wm75(Max) | 317,830 | 8 | 79,846 | 1 | 124,788 | 1 | 499 | 5
Wm50 | 147,451 | 4 | 79,846 | 1 | 124,788 | 1 | 499 | 5
Wm50(Max) | 317,830 | 8 | 79,846 | 1 | 124,788 | 1 | 499 | 5

Outliers = 20%
Criterion | EK01 | R | EK13 | R | EK82 | R | EK84 | R
PRESS | 190,811 | 1 | 218,985 | 1 | 354,334 | 1 | 1236 | 4
PRESS75 | 190,811 | 1 | 218,985 | 1 | 354,334 | 1 | 1662 | 5
PRESS50 | 190,811 | 1 | 218,985 | 1 | 354,334 | 1 | 1662 | 5
Wm | 190,811 | 1 | 849,997 | 7 | 354,334 | 1 | 915 | 3
Wm(Max) | 498,098 | 3 | 849,997 | 7 | 626,980 | 3 | 915 | 3
Wm75 | 378,913 | 2 | 218,985 | 1 | 503,121 | 2 | 1662 | 5
Wm75(Max) | 498,098 | 3 | 218,985 | 1 | 626,980 | 3 | 1662 | 5
Wm50 | 587,876 | 4 | 218,985 | 1 | 503,121 | 2 | 1662 | 5
Wm50(Max) | 778,336 | 7 | 218,985 | 1 | 626,980 | 3 | 1662 | 5

In bold, the method with the lowest statistic values and the rank used. R: Rank.
Table 4. Critical angles for the real dataset.

Outliers = 0%
Criterion | EK01 | R | EK13 | R | EK82 | R | EK84 | R
PRESS | 0.9468 | 9 | 0.2145 | 7 | 0.0000 | 4 | 1.5706 | 4
PRESS75 | 0.9468 | 9 | 0.2145 | 7 | 0.0000 | 7 | 1.5706 | 4
PRESS50 | 1.1845 | 5 | 0.2145 | 7 | 0.0000 | 9 | 1.3454 | 7
Wm | 0.9548 | 4 | 1.5708 | 1 | 1.5708 | 1 | 1.5160 | 3
Wm(Max) | 1.1845 | 5 | 0.1481 | 3 | 0.0000 | 3 | 1.5160 | 3
Wm75 | 1.1845 | 5 | 0.2507 | 5 | 0.0000 | 4 | 1.5706 | 4
Wm75(Max) | 1.1845 | 5 | 0.2145 | 7 | 0.0000 | 7 | 1.5706 | 4
Wm50 | 0.9548 | 4 | 0.2297 | 6 | 0.0000 | 6 | 1.5706 | 4
Wm50(Max) | 0.9548 | 4 | 0.2196 | 9 | 0.0000 | 9 | 1.5706 | 4

Outliers = 5%
Criterion | EK01 | R | EK13 | R | EK82 | R | EK84 | R
PRESS | 1.5708 | 1 | 1.5708 | 1 | 1.5708 | 1 | 1.5674 | 4
PRESS75 | 1.5708 | 1 | 1.5708 | 1 | 1.5708 | 1 | 1.5674 | 4
PRESS50 | 1.5708 | 1 | 1.5708 | 1 | 1.5708 | 1 | 1.5674 | 4
Wm | 1.4977 | 7 | 1.4511 | 11 | 1.5510 | 6 | 1.5674 | 4
Wm(Max) | 1.4977 | 7 | 1.4511 | 11 | 1.5510 | 6 | 1.5674 | 4
Wm75 | 1.5708 | 1 | 1.5394 | 2 | 1.5708 | 1 | 1.5674 | 4
Wm75(Max) | 1.5708 | 1 | 1.3051 | 3 | 1.5708 | 1 | 1.5674 | 4
Wm50 | 1.2620 | 2 | 1.3051 | 3 | 1.3082 | 2 | 1.5674 | 4
Wm50(Max) | 1.5166 | 6 | 1.5696 | 8 | 1.2432 | 3 | 1.5674 | 4

Outliers = 10%
Criterion | EK01 | R | EK13 | R | EK82 | R | EK84 | R
PRESS | 1.5708 | 1 | 1.5708 | 1 | 1.5708 | 1 | 1.5419 | 5
PRESS75 | 1.5708 | 1 | 1.5708 | 1 | 1.5708 | 1 | 1.4672 | 6
PRESS50 | 1.5708 | 1 | 1.5708 | 1 | 1.5708 | 1 | 1.5419 | 5
Wm | 1.5340 | 5 | 1.5212 | 4 | 1.4336 | 4 | 1.5419 | 5
Wm(Max) | 1.5340 | 5 | 1.5212 | 4 | 1.4336 | 4 | 1.5419 | 5
Wm75 | 0.8734 | 2 | 1.5708 | 1 | 1.5708 | 1 | 1.5419 | 5
Wm75(Max) | 1.2729 | 8 | 1.5708 | 1 | 1.5708 | 1 | 1.5419 | 5
Wm50 | 1.4129 | 4 | 1.5708 | 1 | 1.5708 | 1 | 1.5419 | 5
Wm50(Max) | 1.2729 | 8 | 1.5708 | 1 | 1.5708 | 1 | 1.5419 | 5

Outliers = 20%
Criterion | EK01 | R | EK13 | R | EK82 | R | EK84 | R
PRESS | 1.5708 | 1 | 1.5708 | 1 | 1.5708 | 1 | 1.4772 | 4
PRESS75 | 1.5708 | 1 | 1.5708 | 1 | 1.5708 | 1 | 1.4279 | 5
PRESS50 | 1.5708 | 1 | 1.5708 | 1 | 1.5708 | 1 | 1.4279 | 5
Wm | 1.5708 | 1 | 1.3250 | 7 | 1.5708 | 1 | 1.2653 | 3
Wm(Max) | 1.4304 | 3 | 1.3250 | 7 | 1.4549 | 3 | 1.2653 | 3
Wm75 | 1.4678 | 2 | 1.5708 | 1 | 1.2746 | 2 | 1.4279 | 5
Wm75(Max) | 1.4304 | 3 | 1.5708 | 1 | 1.4549 | 3 | 1.4279 | 5
Wm50 | 1.4821 | 4 | 1.5708 | 1 | 1.2746 | 2 | 1.4279 | 5
Wm50(Max) | 1.3989 | 7 | 1.5708 | 1 | 1.4549 | 3 | 1.4279 | 5

In bold, the method with the lowest statistic values and the rank used. R: Rank.
Table 5. Time in seconds.

Simulated matrix with n = 100, p = 8, rank = 4
Outliers | EK01 | EK13 | EK82 | EK84
0% | 15.05 | 10.36 | 9.94 | 17.43
5% | 10.52 | 11.40 | 8.45 | 12.94
10% | 11.36 | 10.47 | 8.69 | 13.98
20% | 14.89 | 12.75 | 11.58 | 20.47
Mean | 12.96 | 11.25 | 9.67 | 16.21
SD | 2.35 | 1.11 | 1.43 | 3.43

Real matrix with n = 44 and p = 14
Outliers | EK01 | EK13 | EK82 | EK84
0% | 23.93 | 13.84 | 16.37 | 17.55
5% | 15.69 | 23.40 | 14.34 | 18.17
10% | 16.98 | 21.34 | 19.11 | 19.48
20% | 16.67 | 16.86 | 16.65 | 19.17
Mean | 18.32 | 18.86 | 16.62 | 18.59
SD | 3.78 | 4.32 | 1.95 | 0.89

SD: Standard deviation.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
