Article

Robust Bilinear Probabilistic Principal Component Analysis

College of Computer and Information Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China
* Author to whom correspondence should be addressed.
Algorithms 2021, 14(11), 322; https://doi.org/10.3390/a14110322
Submission received: 27 September 2021 / Revised: 25 October 2021 / Accepted: 30 October 2021 / Published: 1 November 2021

Abstract

Principal component analysis (PCA) is one of the most popular tools in multivariate exploratory data analysis. Its probabilistic version (PPCA), based on the maximum likelihood procedure, provides a probabilistic way to perform dimension reduction. Recently, the bilinear PPCA (BPPCA) model, which assumes that the noise terms follow matrix variate Gaussian distributions, has been introduced to deal directly with two-dimensional (2-D) data, preserving the matrix structure of 2-D data, such as images, and avoiding the curse of dimensionality. However, Gaussian distributions are not always adequate in real-life applications, where data sets may contain outliers. In order to make BPPCA robust to outliers, in this paper we propose a robust BPPCA model under the assumption of matrix variate t distributions for the noise terms. The alternating expectation conditional maximization (AECM) algorithm is used to estimate the model parameters. Numerical examples on several synthetic and publicly available data sets are presented to demonstrate the superiority of our proposed model in feature extraction, classification and outlier detection.

1. Introduction

High-dimensional data are increasingly collected for a variety of applications in the real world. However, such data are often not distributed uniformly in their ambient space; instead, the interesting structure within the data usually lies in a low-dimensional subspace [1]. One of the fundamental challenges in pattern recognition, machine learning and statistics is how to find a low-dimensional representation of high-dimensional observed data [2,3]. Principal component analysis (PCA) [4] is arguably the most well-known dimension reduction method for high-dimensional data analysis. It finds the principal eigenvectors corresponding to the few largest eigenvalues of the covariance matrix, and then projects the high-dimensional data onto the low-dimensional subspace spanned by these principal eigenvectors to achieve dimensionality reduction.
The traditional PCA is concerned with vectorial, i.e., 1-D, data. For 2-D image sample matrices, it is usual to first convert the 2-D image matrices into 1-D image vectors. This transformation leads to high-dimensional image sample vectors and a large covariance matrix, and thus suffers from the difficulty of accurately evaluating the principal eigenvectors of a large-scale covariance matrix. Furthermore, such vectorization of 2-D data destroys the natural matrix structure and ignores potentially valuable information about the spatial relationships within 2-D data. Therefore, two-dimensional PCA (2DPCA) type algorithms [5,6,7] have been proposed to compute principal component weight matrices directly from the 2-D image sample matrices instead of vectorizing them.
These conventional PCA and 2DPCA algorithms are both derived and interpreted in the standard algebraic framework, and thus they lack the capability to handle statistical inference or missing data. To remedy these drawbacks, a probabilistic PCA model (PPCA) was proposed by Tipping and Bishop in [8]; it places Gaussian distributions on the observations together with extra latent variables, and it has been successfully applied in many machine learning tasks [9]. Following PPCA, a probabilistic second-order PCA, called PSOPCA, was developed in [10] to directly model 2-D image matrices based on the so-called matrix variate Gaussian distributions.
Throughout this paper, $\mathbb{R}^{n \times m}$ is the set of all $n \times m$ real matrices, and $I_n$ and $0_{m \times n}$ are the $n \times n$ identity matrix and the $m \times n$ zero matrix, respectively. The superscript "$\cdot^T$" means transpose, and $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the $\ell_2$-norm and the Frobenius norm of a matrix, respectively. We denote by $\mathcal{N}_{d_c,d_r}(M, \Omega_c, \Omega_r)$ the matrix variate Gaussian distribution [11] with mean matrix $M \in \mathbb{R}^{d_c \times d_r}$, column covariance $\Omega_c \in \mathbb{R}^{d_c \times d_c}$ and row covariance $\Omega_r \in \mathbb{R}^{d_r \times d_r}$. A random matrix $X \in \mathbb{R}^{d_c \times d_r}$ is said to follow the matrix variate Gaussian distribution $\mathcal{N}_{d_c,d_r}(M, \Omega_c, \Omega_r)$, i.e.,
\[
X \sim \mathcal{N}_{d_c,d_r}(M, \Omega_c, \Omega_r), \quad \text{if} \quad \operatorname{vec}(X) \sim \mathcal{N}(\operatorname{vec}(M), \Omega_r \otimes \Omega_c),
\tag{1}
\]
where $\operatorname{vec}(\cdot)$ is the vectorization of a matrix obtained by stacking the columns of the matrix on top of one another. That means that the probability density function (pdf) of $X$ is
\[
\begin{aligned}
p(X) &= (2\pi)^{-\frac{1}{2} d_c d_r} |\Omega_r \otimes \Omega_c|^{-\frac{1}{2}} \exp\left\{ -\tfrac{1}{2}\operatorname{vec}(X-M)^T (\Omega_r^{-1} \otimes \Omega_c^{-1}) \operatorname{vec}(X-M) \right\} \\
&= (2\pi)^{-\frac{1}{2} d_c d_r} |\Omega_r|^{-\frac{1}{2} d_c} |\Omega_c|^{-\frac{1}{2} d_r} \operatorname{etr}\left\{ -\tfrac{1}{2}\Omega_c^{-1}(X-M)\Omega_r^{-1}(X-M)^T \right\},
\end{aligned}
\tag{2}
\]
where "$\otimes$" is the Kronecker product of two matrices, $\operatorname{etr}(\cdot) = \exp\{\operatorname{tr}(\cdot)\}$ and $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. The last equality of (2) holds because $|\Omega_r \otimes \Omega_c|^{\frac{1}{2}} = |\Omega_r|^{\frac{1}{2} d_c} |\Omega_c|^{\frac{1}{2} d_r}$ and
\[
\operatorname{vec}(X-M)^T (\Omega_r^{-1} \otimes \Omega_c^{-1}) \operatorname{vec}(X-M) = \operatorname{tr}\left\{ \Omega_c^{-1}(X-M)\Omega_r^{-1}(X-M)^T \right\}.
\]
See ([11], Theorem 1.2.21) and ([11], Theorem 1.2.22) for more details.
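As a small illustration of the equivalence (1) (our own sketch, not part of the original derivation; the helper name sample_matrix_gaussian and the use of a Cholesky factor are assumptions), a sample from $\mathcal{N}_{d_c,d_r}(M, \Omega_c, \Omega_r)$ can be drawn in MATLAB by sampling the vectorized matrix and reshaping:

function X = sample_matrix_gaussian(M, Omega_c, Omega_r)
% Draw one sample X ~ N_{dc,dr}(M, Omega_c, Omega_r) using (1):
% vec(X) ~ N(vec(M), kron(Omega_r, Omega_c)).
[dc, dr] = size(M);
L = chol(kron(Omega_r, Omega_c), 'lower');   % factor of the (dc*dr) x (dc*dr) covariance
x = M(:) + L * randn(dc*dr, 1);              % sample the vectorized matrix
X = reshape(x, dc, dr);                      % reshape back to a dc x dr matrix
end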
PSOPCA in [10] considers the following two-sided latent matrix variable model
\[
X = C Z R^T + W + \Upsilon, \qquad Z \sim \mathcal{N}_{q_c,q_r}(0_{q_c \times q_r}, I_{q_c}, I_{q_r}), \qquad \Upsilon \sim \mathcal{N}_{d_c,d_r}(0_{d_c \times d_r}, \sigma^2 I_{d_c}, \sigma^2 I_{d_r}),
\tag{3}
\]
where $C \in \mathbb{R}^{d_c \times q_c}$ and $R \in \mathbb{R}^{d_r \times q_r}$ are the column and row factor loading matrices, respectively, $W \in \mathbb{R}^{d_c \times d_r}$ and $\Upsilon \in \mathbb{R}^{d_c \times d_r}$ are the mean and error matrices, respectively, and $Z \in \mathbb{R}^{q_c \times q_r}$ is the latent core variable of $X$. The PSOPCA model is further extended to the bilinear probabilistic principal component analysis (BPPCA) model in [12] for better establishing the relationship with the 2DPCA algorithm [6]; it is defined as
\[
\begin{aligned}
X &= C Z R^T + W + C E_r + E_c R^T + E, \\
E_c &\sim \mathcal{N}_{d_c,q_r}(0_{d_c \times q_r}, \sigma_c^2 I_{d_c}, I_{q_r}), \qquad E_r \sim \mathcal{N}_{q_c,d_r}(0_{q_c \times d_r}, I_{q_c}, \sigma_r^2 I_{d_r}), \\
E &\sim \mathcal{N}_{d_c,d_r}(0_{d_c \times d_r}, \sigma_c^2 I_{d_c}, \sigma_r^2 I_{d_r}), \qquad Z \sim \mathcal{N}_{q_c,q_r}(0_{q_c \times q_r}, I_{q_c}, I_{q_r}).
\end{aligned}
\tag{4}
\]
In contrast to the PSOPCA model (3), the BPPCA model includes the column and row noise matrices $E_c \in \mathbb{R}^{d_c \times q_r}$ and $E_r \in \mathbb{R}^{q_c \times d_r}$ with different noise variances $\sigma_c^2$ and $\sigma_r^2$, respectively, and $E \in \mathbb{R}^{d_c \times d_r}$ represents the common noise matrix. The model (4) improves the flexibility in capturing data uncertainty and makes the marginal distribution $p(X)$ a matrix variate Gaussian. In particular, if $E_r$ and $E_c$ are removed and $\sigma_c = \sigma_r$, then (4) reduces to the PSOPCA model.
All of the probabilistic models mentioned above assume that the noise terms follow Gaussian distributions. It is well known that Gaussian noise leads to a serious drawback when dealing with anomalous observations; thus, probabilistic PCA models based on Gaussian distributions are not robust to outliers. To build probabilistic models that are insensitive to outliers, one prefers heavy-tailed distributions, such as the Student t distribution or the centered Laplacian distribution with the $\ell_1$-norm. Using the t distribution or the centered Laplacian distribution instead of the Gaussian distribution in the PPCA model [8] results in the tPPCA [13,14] and probabilistic L1-PPCA [15] algorithms, respectively. Similarly, a robust version of PSOPCA, called L1-2DPPCA, is introduced in [16] based on the Laplacian distribution combined with variational EM-type algorithms to learn the parameters. However, it is difficult to generalize a robust version of the BPPCA algorithm based on the Laplacian distribution: if the error term $\Upsilon$ in the PSOPCA model follows a Laplacian distribution, then the conditional distribution $X \,|\, Z$ is also Laplacian, but this no longer holds in the BPPCA model. Fortunately, the same goal can be achieved by using the t distribution. In fact, the Gaussian distribution is a special case of the t distribution; compared to the Gaussian distribution, the t distribution has significantly heavier tails and contains one more free parameter. Recently, several robust probabilistic models under the assumption of the t distribution have been developed successfully in [17,18,19,20,21,22]. Motivated by these facts, we continue this effort and develop a robust BPPCA model based on matrix variate t distributions to handle 2-D data sets in the presence of outliers.
The remainder of the paper is organized as follows. Section 2 introduces some notation and the matrix variate t distribution, which are essential to our later development. The robust BPPCA model and the associated parameter estimation based on the AECM algorithm are given and analyzed in detail in Section 3. Section 4 is dedicated to presenting some numerical examples showing the behavior of our proposed model and supporting our analysis. Finally, conclusions are drawn in Section 5.

2. Preliminaries

Let $X \in \mathbb{R}^{d_c \times d_r}$, $M \in \mathbb{R}^{d_c \times d_r}$, $\Omega_c \in \mathbb{R}^{d_c \times d_c}$ and $\Omega_r \in \mathbb{R}^{d_r \times d_r}$, and let the probability density function of a random variable $\mu$ having a Gamma distribution with parameters $\alpha$ and $\beta$, i.e., $\mu \sim \mathrm{Ga}(\alpha, \beta)$, be
\[
p(\mu) = \beta^{\alpha}\mu^{\alpha-1}\exp(-\beta\mu)/\Gamma(\alpha),
\tag{5}
\]
where $\Gamma(\cdot)$ is the Gamma function, i.e.,
\[
\Gamma(\alpha) = \int_0^{+\infty} \tau^{\alpha-1}\exp(-\tau)\, d\tau.
\tag{6}
\]
Analogously to the process of tPPCA in [14], we derive the matrix variate t distribution in this paper by considering
\[
\operatorname{vec}(X) \sim \int_0^{+\infty} \mathcal{N}\!\left( \operatorname{vec}(M),\, \frac{\Omega_r \otimes \Omega_c}{\mu} \right) p(\mu)\, d\mu,
\tag{7}
\]
where $\mu \sim \mathrm{Ga}(\nu/2, \nu/2)$. Let $\delta = \operatorname{tr}\{\Omega_c^{-1}(X-M)\Omega_r^{-1}(X-M)^T\}$. We have
\[
\begin{aligned}
\int_0^{+\infty} \mathcal{N}\!\left( \operatorname{vec}(M),\, \frac{\Omega_r \otimes \Omega_c}{\mu} \right) p(\mu)\, d\mu
&= \int_0^{+\infty} (2\pi)^{-\frac{d_c d_r}{2}} \left| \frac{\Omega_r \otimes \Omega_c}{\mu} \right|^{-\frac{1}{2}} \exp\!\left( -\frac{\delta\mu}{2} \right) \frac{ \left(\frac{\nu}{2}\right)^{\frac{\nu}{2}} \mu^{\frac{\nu}{2}-1} \exp\!\left( -\frac{\nu}{2}\mu \right) }{ \Gamma\!\left(\frac{\nu}{2}\right) }\, d\mu \\
&= (2\pi)^{-\frac{d_c d_r}{2}} |\Omega_c|^{-\frac{d_r}{2}} |\Omega_r|^{-\frac{d_c}{2}} \frac{ \left(\frac{\nu}{2}\right)^{\frac{\nu}{2}} }{ \Gamma\!\left(\frac{\nu}{2}\right) } \int_0^{+\infty} \mu^{\frac{\nu + d_c d_r}{2}-1} \exp\!\left( -\frac{(\delta+\nu)\mu}{2} \right) d\mu.
\end{aligned}
\tag{8}
\]
Let $\tau = \frac{(\delta+\nu)\mu}{2}$. Then $\mu = \frac{2\tau}{\nu+\delta}$ and $d\mu = \frac{2}{\nu+\delta}\, d\tau$. Therefore, (7) can be rewritten as
\[
\begin{aligned}
\int_0^{+\infty} \mathcal{N}\!\left( \operatorname{vec}(M),\, \frac{\Omega_r \otimes \Omega_c}{\mu} \right) p(\mu)\, d\mu
&= (2\pi)^{-\frac{d_c d_r}{2}} |\Omega_c|^{-\frac{d_r}{2}} |\Omega_r|^{-\frac{d_c}{2}} \frac{ \left(\frac{\nu}{2}\right)^{\frac{\nu}{2}} }{ \Gamma\!\left(\frac{\nu}{2}\right) } \left( \frac{2}{\nu+\delta} \right)^{\frac{\nu + d_c d_r}{2}} \int_0^{+\infty} \tau^{\frac{\nu + d_c d_r}{2}-1}\exp(-\tau)\, d\tau \\
&= (2\pi)^{-\frac{d_c d_r}{2}} |\Omega_c|^{-\frac{d_r}{2}} |\Omega_r|^{-\frac{d_c}{2}} \frac{ \left(\frac{\nu}{2}\right)^{\frac{\nu}{2}} }{ \Gamma\!\left(\frac{\nu}{2}\right) } \left( \frac{2}{\nu+\delta} \right)^{\frac{\nu + d_c d_r}{2}} \Gamma\!\left( \frac{\nu + d_c d_r}{2} \right) \\
&= \pi^{-\frac{d_c d_r}{2}} |\Omega_c|^{-\frac{d_r}{2}} |\Omega_r|^{-\frac{d_c}{2}}\, \nu^{\frac{\nu}{2}}\, \frac{ \Gamma\!\left( \frac{\nu + d_c d_r}{2} \right) }{ \Gamma\!\left(\frac{\nu}{2}\right) }\, \nu^{-\frac{\nu + d_c d_r}{2}} \left( 1 + \frac{\delta}{\nu} \right)^{-\frac{\nu + d_c d_r}{2}} \\
&= \frac{ |\Omega_c|^{-\frac{d_r}{2}} |\Omega_r|^{-\frac{d_c}{2}}\, \Gamma\!\left( \frac{\nu + d_c d_r}{2} \right) }{ (\nu\pi)^{\frac{d_c d_r}{2}}\, \Gamma\!\left(\frac{\nu}{2}\right) } \left( 1 + \frac{\delta}{\nu} \right)^{-\frac{\nu + d_c d_r}{2}}.
\end{aligned}
\]
In this paper, if the pdf of the random matrix X is
\[
p(X) = \frac{ |\Omega_c|^{-\frac{d_r}{2}} |\Omega_r|^{-\frac{d_c}{2}}\, \Gamma\!\left( \frac{\nu + d_c d_r}{2} \right) }{ (\nu\pi)^{\frac{d_c d_r}{2}}\, \Gamma\!\left(\frac{\nu}{2}\right) } \left( 1 + \frac{\delta}{\nu} \right)^{-\frac{\nu + d_c d_r}{2}},
\]
then the random matrix $X$ is said to follow the matrix variate t distribution with degrees of freedom $\nu$, and this is denoted by
\[
X \sim \mathcal{T}_{d_c,d_r}(\nu, M, \Omega_c, \Omega_r).
\tag{9}
\]
In particular, if $d_c = 1$ or $d_r = 1$, then the matrix variate t distribution degenerates to the multivariate t distribution. As with the classical multivariate t distribution, another useful perspective on the matrix variate t distribution, which is critical to our later developments, is to treat $\mu$ as a latent variable; then the conditional distribution of $X \,|\, \mu$ is a matrix variate Gaussian distribution by (8), i.e.,
\[
X \,|\, \mu \sim \mathcal{N}_{d_c,d_r}(M, \mu_c\Omega_c, \mu_r\Omega_r),
\tag{10}
\]
where $\mu_c\mu_r = \mu^{-1}$. Notice that, despite the non-uniqueness of the factorization $\mu^{-1} = \mu_c\mu_r$, $\mathcal{N}_{d_c,d_r}(M, \mu_c\Omega_c, \mu_r\Omega_r)$ in (10) always yields the same pdf of $X$.
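To make the latent-variable view (10) concrete, the following MATLAB sketch (our own illustration; the function name, the equal split $\mu_c = \mu_r = \mu^{-1/2}$ and the use of gamrnd from the Statistics Toolbox are assumptions, not prescribed by the paper) draws one sample from $\mathcal{T}_{d_c,d_r}(\nu, M, \Omega_c, \Omega_r)$ by first drawing $\mu \sim \mathrm{Ga}(\nu/2, \nu/2)$ and then sampling $X \,|\, \mu$ from the matrix variate Gaussian:

function X = sample_matrix_t(nu, M, Omega_c, Omega_r)
% mu ~ Ga(nu/2, nu/2); then X | mu ~ N_{dc,dr}(M, mu_c*Omega_c, mu_r*Omega_r)
% with the (non-unique) split mu_c = mu_r = 1/sqrt(mu), cf. (10).
mu = gamrnd(nu/2, 2/nu);                  % gamrnd uses the shape/scale parameterization
[dc, dr] = size(M);
Lc = chol(Omega_c/sqrt(mu), 'lower');     % column covariance factor
Lr = chol(Omega_r/sqrt(mu), 'lower');     % row covariance factor
X = M + Lc * randn(dc, dr) * Lr';         % vec(X - M) has covariance kron(Omega_r, Omega_c)/mu
end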

3. Robust BPPCA

3.1. The Model

In this section, we develop a robust model by replacing the matrix variate Gaussian distribution in BPPCA with the matrix variate t distribution with degrees of freedom $\nu$ defined in (9) to deal with 2-D data sets. Specifically, the proposed robust bilinear probabilistic principal component analysis model (RBPPCA for short) is defined as
\[
\begin{aligned}
X &= C Z R^T + W + C E_r + E_c R^T + E, \qquad \mu \sim \mathrm{Ga}(\nu/2, \nu/2), \quad \mu_c\mu_r = \mu^{-1}, \\
Z \,|\, \mu &\sim \mathcal{N}_{q_c,q_r}(0_{q_c \times q_r}, \mu_c I_{q_c}, \mu_r I_{q_r}), \qquad E_c \,|\, \mu \sim \mathcal{N}_{d_c,q_r}(0_{d_c \times q_r}, \mu_c\sigma_c^2 I_{d_c}, \mu_r I_{q_r}), \\
E_r \,|\, \mu &\sim \mathcal{N}_{q_c,d_r}(0_{q_c \times d_r}, \mu_c I_{q_c}, \mu_r\sigma_r^2 I_{d_r}), \qquad E \,|\, \mu \sim \mathcal{N}_{d_c,d_r}(0_{d_c \times d_r}, \mu_c\sigma_c^2 I_{d_c}, \mu_r\sigma_r^2 I_{d_r}).
\end{aligned}
\tag{11}
\]
As in BPPCA [12], in the RBPPCA model (11), $E_c \in \mathbb{R}^{d_c \times q_r}$, $E_r \in \mathbb{R}^{q_c \times d_r}$ and $E \in \mathbb{R}^{d_c \times d_r}$ are the column, row and common noise matrices, respectively, $Z \in \mathbb{R}^{q_c \times q_r}$ is the latent matrix, and these are assumed to be independent of each other; the mean matrix and the column and row factor loading matrices are $W \in \mathbb{R}^{d_c \times d_r}$, $C \in \mathbb{R}^{d_c \times q_c}$ and $R \in \mathbb{R}^{d_r \times q_r}$, respectively. Similarly to BPPCA [12], the parameters $C$, $R$, $\sigma_c$ and $\sigma_r$ cannot be uniquely identified, but the subspaces of interest spanned by the columns of $C$ and $R$ are unique; the reader is referred to ([12], Appendix B) for details. The difference from BPPCA is that the noise matrices $E_c$, $E_r$ and $E$ and the latent matrix variate $Z$ in the RBPPCA model (11) are assumed to follow matrix variate t distributions by (10), i.e.,
\[
\begin{aligned}
E_c &\sim \mathcal{T}_{d_c,q_r}(\nu, 0_{d_c \times q_r}, \sigma_c^2 I_{d_c}, I_{q_r}), \qquad E_r \sim \mathcal{T}_{q_c,d_r}(\nu, 0_{q_c \times d_r}, I_{q_c}, \sigma_r^2 I_{d_r}), \\
E &\sim \mathcal{T}_{d_c,d_r}(\nu, 0_{d_c \times d_r}, \sigma_c^2 I_{d_c}, \sigma_r^2 I_{d_r}), \qquad Z \sim \mathcal{T}_{q_c,q_r}(\nu, 0_{q_c \times q_r}, I_{q_c}, I_{q_r}).
\end{aligned}
\]
It follows by (11) that
\[
\begin{aligned}
C Z R^T \,|\, \mu &\sim \mathcal{N}_{d_c,d_r}(0_{d_c \times d_r}, \mu_c C C^T, \mu_r R R^T), \\
C E_r \,|\, \mu &\sim \mathcal{N}_{d_c,d_r}(0_{d_c \times d_r}, \mu_c C C^T, \mu_r\sigma_r^2 I_{d_r}), \\
E_c R^T \,|\, \mu &\sim \mathcal{N}_{d_c,d_r}(0_{d_c \times d_r}, \mu_c\sigma_c^2 I_{d_c}, \mu_r R R^T).
\end{aligned}
\]
Consequently, $X \,|\, \mu \sim \mathcal{N}_{d_c,d_r}(W, \mu_c\Sigma_c, \mu_r\Sigma_r)$, where
\[
\Sigma_c = C C^T + \sigma_c^2 I_{d_c} \quad \text{and} \quad \Sigma_r = R R^T + \sigma_r^2 I_{d_r}.
\tag{12}
\]
That means the random matrix $X$ follows the matrix variate t distribution, i.e., $X \sim \mathcal{T}_{d_c,d_r}(\nu, W, \Sigma_c, \Sigma_r)$. In addition, as shown in [23], the conditional distribution $\mu \,|\, X$, which is also required in our later estimation of the model parameters, is a Gamma distribution, i.e.,
\[
\mu \,|\, X \sim \mathrm{Ga}\!\left( \frac{\nu + d_c d_r}{2}, \frac{\nu + \rho}{2} \right),
\tag{13}
\]
where $\rho = \operatorname{tr}\{\Sigma_c^{-1}(X-W)\Sigma_r^{-1}(X-W)^T\}$.
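It is worth remarking (our own observation, which follows directly from (13) and the mean of a Gamma distribution) that
\[
\mathrm{E}(\mu \,|\, X) = \frac{\nu + d_c d_r}{\nu + \rho},
\]
so an observation with a large statistic $\rho$, i.e., a potential outlier, receives a small posterior weight in the estimation steps of Section 3.2, whereas letting $\nu \to \infty$ forces all weights toward one and recovers the Gaussian BPPCA behavior.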
By introducing two latent matrix variates $Y^r \in \mathbb{R}^{q_c \times d_r}$ and $Y^{\epsilon_r} \in \mathbb{R}^{d_c \times d_r}$, the RBPPCA model (11) can be rewritten as
\[
X = C Y^r + W + Y^{\epsilon_r}, \qquad Y^r = Z R^T + E_r, \qquad Y^{\epsilon_r} = E_c R^T + E,
\tag{14}
\]
where $Y^r$ and $Y^{\epsilon_r}$ are the row projected intermediate and residual matrices, respectively. By (11), we have the conditional distributions
\[
Y^r \,|\, \mu \sim \mathcal{N}_{q_c,d_r}(0_{q_c \times d_r}, \mu_c I_{q_c}, \mu_r\Sigma_r),
\tag{15a}
\]
\[
Y^{\epsilon_r} \,|\, \mu \sim \mathcal{N}_{d_c,d_r}(0_{d_c \times d_r}, \mu_c\sigma_c^2 I_{d_c}, \mu_r\Sigma_r),
\tag{15b}
\]
\[
Y^r \,|\, Z, \mu \sim \mathcal{N}_{q_c,d_r}(Z R^T, \mu_c I_{q_c}, \mu_r\sigma_r^2 I_{d_r}),
\tag{15c}
\]
\[
X \,|\, Y^r, \mu \sim \mathcal{N}_{d_c,d_r}(C Y^r + W, \mu_c\sigma_c^2 I_{d_c}, \mu_r\Sigma_r),
\tag{15d}
\]
where $\Sigma_r$ is given by (12). In addition, by using (15c) and Bayes' rule, the conditional distributions $Z \,|\, Y^r, \mu$ and $Y^r \,|\, X, \mu$ can be calculated as
\[
Z \,|\, Y^r, \mu \sim \mathcal{N}_{q_c,q_r}\!\left( Y^r R \Phi_r^{-1}, \mu_c I_{q_c}, \mu_r\sigma_r^2\Phi_r^{-1} \right),
\tag{16a}
\]
\[
Y^r \,|\, X, \mu \sim \mathcal{N}_{q_c,d_r}\!\left( \Phi_c^{-1} C^T (X-W), \mu_c\sigma_c^2\Phi_c^{-1}, \mu_r\Sigma_r \right),
\tag{16b}
\]
where
\[
\Phi_c = C^T C + \sigma_c^2 I_{q_c} \quad \text{and} \quad \Phi_r = R^T R + \sigma_r^2 I_{q_r}.
\tag{17}
\]
In (14), the bilinear projection in the RBPPCA model is split into two stages: the latent matrix $Z$ is first projected in the row direction to obtain $Y^r$, and then $Y^r$ is projected in the column direction to finally generate $X$. Similarly, we can also decompose the bilinear projection by projecting first in the column direction and then in the row direction, rewriting (11) as
\[
X = Y^c R^T + W + Y^{\epsilon_c}, \qquad Y^c = C Z + E_c, \qquad Y^{\epsilon_c} = C E_r + E,
\tag{18}
\]
where $Y^c \in \mathbb{R}^{d_c \times q_r}$ and $Y^{\epsilon_c} \in \mathbb{R}^{d_c \times d_r}$ with $Y^c \,|\, \mu \sim \mathcal{N}_{d_c,q_r}(0_{d_c \times q_r}, \mu_c\Sigma_c, \mu_r I_{q_r})$ and $Y^{\epsilon_c} \,|\, \mu \sim \mathcal{N}_{d_c,d_r}(0_{d_c \times d_r}, \mu_c\Sigma_c, \mu_r\sigma_r^2 I_{d_r})$. Furthermore,
\[
Y^c \,|\, Z, \mu \sim \mathcal{N}_{d_c,q_r}(C Z, \mu_c\sigma_c^2 I_{d_c}, \mu_r I_{q_r}),
\tag{19a}
\]
\[
X \,|\, Y^c, \mu \sim \mathcal{N}_{d_c,d_r}(Y^c R^T + W, \mu_c\Sigma_c, \mu_r\sigma_r^2 I_{d_r}),
\tag{19b}
\]
\[
Z \,|\, Y^c, \mu \sim \mathcal{N}_{q_c,q_r}\!\left( \Phi_c^{-1} C^T Y^c, \mu_c\sigma_c^2\Phi_c^{-1}, \mu_r I_{q_r} \right),
\tag{19c}
\]
\[
Y^c \,|\, X, \mu \sim \mathcal{N}_{d_c,q_r}\!\left( (X-W) R \Phi_r^{-1}, \mu_c\Sigma_c, \mu_r\sigma_r^2\Phi_r^{-1} \right).
\tag{19d}
\]

3.2. Estimation of the Parameters

In the model (11), the parameter set to be estimated is $\theta = \{\nu, \sigma_c, \sigma_r, W, C, R\}$. In this subsection, we describe how to compute these parameters with the alternating expectation conditional maximization (AECM) algorithm. The AECM algorithm [12,24,25] is a two-stage iterative optimization technique for finding maximum likelihood solutions. To apply the AECM algorithm, we divide the parameter set $\theta$ into two subsets $\theta_c = \{\nu, \sigma_c, W, C\}$ and $\theta_r = \{\nu, \sigma_r, W, R\}$.
In the first stage, we consider the AECM algorithm for the model (14) to compute $\theta_c$. Let $\{X_n\}_{n=1}^N$ with $X_n \in \mathbb{R}^{d_c \times d_r}$ be a set of 2-D sample observations. The latent variables' data $\{Y_n^r, \mu_n\}_{n=1}^N$ are treated as "missing data", and the "complete" data log-likelihood is
\[
L_{com,c}(\theta_c) = \sum_{n=1}^N \ln p(X_n, Y_n^r, \mu_n) = \sum_{n=1}^N \ln\big\{ p(X_n \,|\, Y_n^r, \mu_n)\, p(Y_n^r \,|\, \mu_n)\, p(\mu_n) \big\}.
\]
It follows by (5), (15a) and (15d) that
\[
\begin{aligned}
L_{com,c}(\theta_c) ={}& \sum_{n=1}^N \ln \Bigg\{ (2\pi)^{-\frac{d_c d_r}{2}} |\mu_{c,n}\sigma_c^2 I_{d_c}|^{-\frac{d_r}{2}} |\mu_{r,n}\Sigma_r|^{-\frac{d_c}{2}} \operatorname{etr}\Big( -\tfrac{1}{2}(\mu_{r,n}\Sigma_r)^{-1}[X_n - (C Y_n^r + W)]^T(\mu_{c,n}\sigma_c^2 I)^{-1}[X_n - (C Y_n^r + W)] \Big) \\
& \times (2\pi)^{-\frac{q_c d_r}{2}} |\mu_{c,n} I_{q_c}|^{-\frac{d_r}{2}} |\mu_{r,n}\Sigma_r|^{-\frac{q_c}{2}} \operatorname{etr}\Big( -\tfrac{1}{2}(\mu_{r,n}\Sigma_r)^{-1}(Y_n^r)^T\mu_{c,n}^{-1} Y_n^r \Big) \frac{ \left(\frac{\nu}{2}\right)^{\frac{\nu}{2}}\mu_n^{\frac{\nu}{2}-1}\exp\!\left(-\frac{\nu}{2}\mu_n\right) }{ \Gamma\!\left(\frac{\nu}{2}\right) } \Bigg\} \\
={}& -\sum_{n=1}^N \Bigg\{ \frac{(d_c+q_c)d_r}{2}\ln(2\pi) + \ln\Gamma\Big(\frac{\nu}{2}\Big) + \frac{d_c+q_c}{2}\ln|\Sigma_r| - \frac{\nu}{2}\ln\frac{\nu}{2} + \frac{\nu\mu_n}{2} - \frac{(d_c+q_c)d_r+\nu-2}{2}\ln\mu_n + \frac{d_c d_r}{2}\ln\sigma_c^2 \\
& + \operatorname{tr}\Big( \frac{\mu_n}{2\sigma_c^2}\Sigma_r^{-1}[X_n-(C Y_n^r+W)]^T[X_n-(C Y_n^r+W)] + \frac{\mu_n}{2}\Sigma_r^{-1}(Y_n^r)^T Y_n^r \Big) \Bigg\}.
\end{aligned}
\]
In the E-step, given the parameter set $\theta^{(i)} = \{\nu^{(i)}, \sigma_c^{(i)}, \sigma_r^{(i)}, W^{(i)}, C^{(i)}, R^{(i)}\}$ obtained from the $i$-th iteration, we compute the expectation of $L_{com,c}(\theta_c)$ with respect to the conditional distribution $Y_n^r, \mu_n \,|\, X_n$, i.e.,
\[
\begin{aligned}
Q_c(\theta_c) &= \int_{Y_n^r}\!\int_{\mu_n} L_{com,c}(\theta_c)\, p(Y_n^r, \mu_n \,|\, X_n)\, d\mu_n\, dY_n^r
= \int_{Y_n^r}\!\int_{\mu_n} L_{com,c}(\theta_c)\, p(Y_n^r \,|\, \mu_n, X_n)\, p(\mu_n \,|\, X_n)\, d\mu_n\, dY_n^r \\
&= -\sum_{n=1}^N \Bigg\{ \ln\Gamma\Big(\frac{\nu}{2}\Big) - \frac{\nu}{2}\ln\frac{\nu}{2} + \frac{\nu}{2}\,\mathrm{E}(\mu_n \,|\, X_n) - \frac{(d_c+q_c)d_r+\nu-2}{2}\,\mathrm{E}(\ln\mu_n \,|\, X_n) + \frac{d_c d_r}{2}\ln\sigma_c^2 \\
&\qquad\quad + \frac{1}{2\sigma_c^2}\,\mathrm{E}\Big( \operatorname{tr}\big\{ \mu_n\Sigma_r^{-1}[X_n-(C Y_n^r+W)]^T[X_n-(C Y_n^r+W)] \big\} \,\Big|\, X_n \Big) \Bigg\} + \text{constant},
\end{aligned}
\tag{20}
\]
where the constant contains those terms that do not involve the parameters in the set $\theta_c$. We denote $\mathrm{E}\mu_n^{(i)} = \mathrm{E}(\mu_n \,|\, X_n)$ and $\mathrm{E}Y_n^{r(i)} = \mathrm{E}(Y_n^r \,|\, X_n, \mu_n)$ for convenience. Note by (13) and (16b) that, given the parameter set $\theta^{(i)} = \{\nu^{(i)}, \sigma_c^{(i)}, \sigma_r^{(i)}, W^{(i)}, C^{(i)}, R^{(i)}\}$, the conditional distributions of $\mu_n \,|\, X_n$ and $Y_n^r \,|\, X_n, \mu_n$ are known. That is, for $1 \le n \le N$,
\[
\mu_n \,|\, X_n \sim \mathrm{Ga}\!\left( \frac{\nu^{(i)} + d_c d_r}{2}, \frac{\nu^{(i)} + \rho_n^{(i)}}{2} \right), \qquad
Y_n^r \,|\, X_n, \mu_n \sim \mathcal{N}_{q_c,d_r}\!\left( (\Phi_c^{(i)})^{-1}(C^{(i)})^T(X_n - W^{(i)}),\; \mu_{c,n}(\sigma_c^{(i)})^2(\Phi_c^{(i)})^{-1},\; \mu_{r,n}\Sigma_r^{(i)} \right),
\]
where $\mu_{r,n}\mu_{c,n} = \mu_n^{-1}$, $\Phi_c^{(i)} = (C^{(i)})^T C^{(i)} + (\sigma_c^{(i)})^2 I_{q_c}$, $\Sigma_r^{(i)} = R^{(i)}(R^{(i)})^T + (\sigma_r^{(i)})^2 I_{d_r}$, and $\rho_n^{(i)} = \operatorname{tr}\{(\Sigma_c^{(i)})^{-1}(X_n - W^{(i)})(\Sigma_r^{(i)})^{-1}(X_n - W^{(i)})^T\}$ with $\Sigma_c^{(i)} = C^{(i)}(C^{(i)})^T + (\sigma_c^{(i)})^2 I_{d_c}$. Then, it is easy to obtain that
\[
\mathrm{E}\mu_n^{(i)} = \frac{\nu^{(i)} + d_c d_r}{\nu^{(i)} + \rho_n^{(i)}} \quad \text{and} \quad \mathrm{E}Y_n^{r(i)} = (\Phi_c^{(i)})^{-1}(C^{(i)})^T(X_n - W^{(i)}).
\tag{21}
\]
In addition, based on the conditional distributions of $\mu_n \,|\, X_n$ and $Y_n^r \,|\, X_n, \mu_n$, the conditional expectations in (20) satisfy $\mathrm{E}(\ln\mu_n \,|\, X_n) = \psi\big( \frac{\nu^{(i)} + d_c d_r}{2} \big) - \ln\big( \frac{\nu^{(i)} + \rho_n^{(i)}}{2} \big)$ by [13], where $\psi(\cdot)$ is the digamma function, and
\[
\begin{aligned}
&\mathrm{E}\Big( \operatorname{tr}\big\{ \mu_n\Sigma_r^{-1}[X_n-(C Y_n^r+W)]^T[X_n-(C Y_n^r+W)] \big\} \,\Big|\, X_n \Big) \\
&\quad= \mathrm{E}\mu_n^{(i)}\operatorname{tr}\big\{ \Sigma_r^{-1}(X_n-W)^T(X_n-W) \big\}
- 2\,\mathrm{E}\mu_n^{(i)}\operatorname{tr}\big\{ \Sigma_r^{-1}(X_n-W)^T C\,\mathrm{E}Y_n^{r(i)} \big\} \\
&\qquad+ \operatorname{tr}\big( \Sigma_r^{-1}\Sigma_r^{(i)} \big)\operatorname{tr}\big\{ C^T C(\sigma_c^{(i)})^2(\Phi_c^{(i)})^{-1} \big\}
+ \mathrm{E}\mu_n^{(i)}\operatorname{tr}\big\{ \Sigma_r^{-1}(\mathrm{E}Y_n^{r(i)})^T C^T C\,\mathrm{E}Y_n^{r(i)} \big\},
\end{aligned}
\tag{22}
\]
which is detailed in Appendix A.
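A rough MATLAB sketch of this E-step is given below (our own illustration; the array layout X of size $d_c \times d_r \times N$, the variable names, and the per-sample loop are assumptions made for clarity rather than speed):

% E-step of the first stage: compute the quantities in (21) and E(ln mu_n | X_n)
% from the current parameters nu, sigma_c, sigma_r, W, C, R.
Phi_c   = C'*C + sigma_c^2*eye(qc);                  % (17)
Sigma_c = C*C' + sigma_c^2*eye(dc);                  % (12)
Sigma_r = R*R' + sigma_r^2*eye(dr);
E_mu = zeros(N,1);  E_lnmu = zeros(N,1);  E_Yr = zeros(qc, dr, N);
for n = 1:N
    D     = X(:,:,n) - W;                            % centered observation
    rho_n = trace(Sigma_c \ D / Sigma_r * D');       % Mahalanobis-type statistic rho_n^(i)
    E_mu(n)     = (nu + dc*dr) / (nu + rho_n);       % E(mu_n | X_n)
    E_lnmu(n)   = psi((nu + dc*dr)/2) - log((nu + rho_n)/2);   % E(ln mu_n | X_n)
    E_Yr(:,:,n) = Phi_c \ (C' * D);                  % E(Y_n^r | X_n, mu_n)
end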
In the subsequent conditional maximization (CM) step of the first stage, conditional on $\theta_r^{(i)} = \{\nu^{(i)}, \sigma_r^{(i)}, W^{(i)}, R^{(i)}\}$, we maximize $Q_c(\theta_c)$ with respect to $\theta_c = \{\nu, \sigma_c, W, C\}$. It follows by (20) that
\[
\begin{aligned}
\frac{\partial Q_c(\theta_c)}{\partial W} &= -\frac{1}{2\sigma_c^2}\sum_{n=1}^N \Big\{ \mathrm{E}\mu_n^{(i)}(-2X_n + 2W)(\Sigma_r^{(i)})^{-1} + 2\,\mathrm{E}\mu_n^{(i)}\, C\,\mathrm{E}Y_n^{r(i)}(\Sigma_r^{(i)})^{-1} \Big\}, \\
\frac{\partial Q_c(\theta_c)}{\partial C} &= -\frac{1}{2\sigma_c^2}\sum_{n=1}^N \Big\{ -2\,\mathrm{E}\mu_n^{(i)}(X_n - W)(\Sigma_r^{(i)})^{-1}(\mathrm{E}Y_n^{r(i)})^T + 2 d_r C(\sigma_c^{(i)})^2(\Phi_c^{(i)})^{-1} + 2\,\mathrm{E}\mu_n^{(i)}\, C\,\mathrm{E}Y_n^{r(i)}(\Sigma_r^{(i)})^{-1}(\mathrm{E}Y_n^{r(i)})^T \Big\}, \\
\frac{\partial Q_c(\theta_c)}{\partial \sigma_c^2} &= -\frac{N d_c d_r}{2\sigma_c^2} + \frac{1}{2\sigma_c^4}\sum_{n=1}^N \mathrm{E}\Big( \operatorname{tr}\big\{ \mu_n(\Sigma_r^{(i)})^{-1}[X_n-(C Y_n^r+W)]^T[X_n-(C Y_n^r+W)] \big\} \,\Big|\, X_n \Big).
\end{aligned}
\]
Therefore, by successively solving the equations $\partial Q_c(\theta_c)/\partial W = 0$, $\partial Q_c(\theta_c)/\partial C = 0$ and $\partial Q_c(\theta_c)/\partial\sigma_c^2 = 0$, we can iteratively update the parameters $W$, $C$ and $\sigma_c$. Specifically, from $\partial Q_c(\theta_c)/\partial W = 0$, we have
\[
\sum_{n=1}^N \mathrm{E}\mu_n^{(i)}\big( -2X_n + 2W + 2C\,\mathrm{E}Y_n^{r(i)} \big)(\Sigma_r^{(i)})^{-1} = 0.
\]
That means
\[
-\sum_{n=1}^N \mathrm{E}\mu_n^{(i)}\big( X_n - C\,\mathrm{E}Y_n^{r(i)} \big) + \sum_{n=1}^N \mathrm{E}\mu_n^{(i)}\, W = 0.
\]
Thus, an iterative updating of W can be obtained by
\[
\widetilde{W}^{(i)} = \frac{ \sum_{n=1}^N \mathrm{E}\mu_n^{(i)}\big( X_n - C^{(i)}\,\mathrm{E}Y_n^{r(i)} \big) }{ \sum_{n=1}^N \mathrm{E}\mu_n^{(i)} }.
\tag{23}
\]
Similarly, based on $\partial Q_c(\theta_c)/\partial C = 0$ and $\partial Q_c(\theta_c)/\partial\sigma_c^2 = 0$, we have
\[
C^{(i+1)} = \left[ \sum_{n=1}^N \mathrm{E}\mu_n^{(i)}\big( X_n - \widetilde{W}^{(i)} \big)(\Sigma_r^{(i)})^{-1}(\mathrm{E}Y_n^{r(i)})^T \right]
\left[ \sum_{n=1}^N \Big( d_r(\sigma_c^{(i)})^2(\Phi_c^{(i)})^{-1} + \mathrm{E}\mu_n^{(i)}\,\mathrm{E}Y_n^{r(i)}(\Sigma_r^{(i)})^{-1}(\mathrm{E}Y_n^{r(i)})^T \Big) \right]^{-1},
\tag{24a}
\]
\[
\begin{aligned}
(\sigma_c^{(i+1)})^2 &= \frac{1}{N d_c d_r}\sum_{n=1}^N \mathrm{E}\Big( \operatorname{tr}\big\{ \mu_n(\Sigma_r^{(i)})^{-1}\big( X_n - C^{(i+1)} Y_n^r - \widetilde{W}^{(i)} \big)^T\big( X_n - C^{(i+1)} Y_n^r - \widetilde{W}^{(i)} \big) \big\} \,\Big|\, X_n \Big) \\
&= \frac{1}{N d_c d_r}\sum_{n=1}^N \mathrm{E}\mu_n^{(i)}\operatorname{tr}\Big\{ (\Sigma_r^{(i)})^{-1}\big( X_n - \widetilde{W}^{(i)} \big)^T\big( X_n - \widetilde{W}^{(i)} - C^{(i+1)}\,\mathrm{E}Y_n^{r(i)} \big) \Big\}.
\end{aligned}
\tag{24b}
\]
The last equality in (24b) holds because of (22) and (24a). Finally, we update $\nu^{(i)}$ by maximizing the scalar nonlinear function $Q_c(\theta_c)$ defined in (20) over $\nu$, which can be done numerically by most scientific computing software packages [26,27], to obtain $\widetilde{\nu}^{(i)}$.
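Continuing the E-step sketch above (again our own illustration under the same assumed variable names; the search interval passed to fminbnd is also an assumption), the first-stage CM-step updates could be written as:

% CM-step of the first stage: update W, C, sigma_c^2 and nu by (23), (24) and (20).
S_mu  = sum(E_mu);
W_num = zeros(dc, dr);
for n = 1:N
    W_num = W_num + E_mu(n) * (X(:,:,n) - C*E_Yr(:,:,n));
end
W_tilde = W_num / S_mu;                                        % (23)

A = zeros(dc, qc);  B = zeros(qc, qc);
for n = 1:N
    D = X(:,:,n) - W_tilde;
    A = A + E_mu(n) * (D / Sigma_r) * E_Yr(:,:,n)';
    B = B + dr*sigma_c^2*inv(Phi_c) + E_mu(n) * (E_Yr(:,:,n) / Sigma_r) * E_Yr(:,:,n)';
end
C_new = A / B;                                                 % (24a)

s = 0;
for n = 1:N
    D = X(:,:,n) - W_tilde;
    s = s + E_mu(n) * trace((Sigma_r \ D') * (D - C_new*E_Yr(:,:,n)));
end
sigma_c2_new = s / (N*dc*dr);                                  % (24b)

% Update nu by numerically maximizing the nu-dependent part of (20).
negQ = @(v) -sum(-gammaln(v/2) + (v/2)*log(v/2) + (v/2)*(E_lnmu - E_mu));
nu_tilde = fminbnd(negQ, 1e-2, 1e3);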
In the second stage, the AECM algorithm is used for the model (18) to update $\theta_r$. In such a case, we consider the latent variables' data $\{Y_n^c, \mu_n\}_{n=1}^N$ as "missing data". Then, by (19b), the "complete" data log-likelihood is
\[
\begin{aligned}
L_{com,r}(\theta_r) ={}& \sum_{n=1}^N \ln p(X_n, Y_n^c, \mu_n) = \sum_{n=1}^N \ln\big\{ p(X_n \,|\, Y_n^c, \mu_n)\, p(Y_n^c \,|\, \mu_n)\, p(\mu_n) \big\} \\
={}& \sum_{n=1}^N \ln \Bigg\{ (2\pi)^{-\frac{d_c d_r}{2}} |\mu_{c,n}\Sigma_c|^{-\frac{d_r}{2}} |\mu_{r,n}\sigma_r^2 I_{d_r}|^{-\frac{d_c}{2}} \operatorname{etr}\Big( -\tfrac{1}{2}(\mu_{r,n}\sigma_r^2 I)^{-1}[X_n-(Y_n^c R^T+W)]^T(\mu_{c,n}\Sigma_c)^{-1}[X_n-(Y_n^c R^T+W)] \Big) \\
& \times (2\pi)^{-\frac{d_c q_r}{2}} |\mu_{c,n}\Sigma_c|^{-\frac{q_r}{2}} |\mu_{r,n} I_{q_r}|^{-\frac{d_c}{2}} \operatorname{etr}\Big( -\tfrac{1}{2}(\mu_{r,n} I_{q_r})^{-1}(Y_n^c)^T(\mu_{c,n}\Sigma_c)^{-1} Y_n^c \Big) \frac{ \left(\frac{\nu}{2}\right)^{\frac{\nu}{2}}\mu_n^{\frac{\nu}{2}-1}\exp\!\left(-\frac{\nu}{2}\mu_n\right) }{ \Gamma\!\left(\frac{\nu}{2}\right) } \Bigg\} \\
={}& -\sum_{n=1}^N \Bigg\{ \frac{(d_r+q_r)d_c}{2}\ln(2\pi) + \ln\Gamma\Big(\frac{\nu}{2}\Big) + \frac{d_r+q_r}{2}\ln|\Sigma_c| - \frac{\nu}{2}\ln\frac{\nu}{2} + \frac{\nu\mu_n}{2} - \frac{(d_r+q_r)d_c+\nu-2}{2}\ln\mu_n + \frac{d_c d_r}{2}\ln\sigma_r^2 \\
& + \operatorname{tr}\Big( \frac{\mu_n}{2\sigma_r^2}[X_n-(Y_n^c R^T+W)]^T\Sigma_c^{-1}[X_n-(Y_n^c R^T+W)] + \frac{\mu_n}{2}(Y_n^c)^T\Sigma_c^{-1} Y_n^c \Big) \Bigg\}.
\end{aligned}
\]
Similarly, in the E-step of the second stage, given the updated parameter set
\[
\widetilde{\theta}^{(i)} = \{ \widetilde{\nu}^{(i)}, \sigma_c^{(i+1)}, \sigma_r^{(i)}, \widetilde{W}^{(i)}, C^{(i+1)}, R^{(i)} \},
\]
where $\widetilde{\nu}^{(i)}$, $\sigma_c^{(i+1)}$, $\widetilde{W}^{(i)}$ and $C^{(i+1)}$ are calculated in the first stage, we compute the expectation of $L_{com,r}(\theta_r)$ with respect to the conditional distribution $Y_n^c, \mu_n \,|\, X_n$, denoted by $Q_r(\theta_r)$. Based on (19) and the current parameter set $\widetilde{\theta}^{(i)}$, we define
\[
\begin{aligned}
\Sigma_c^{(i+1)} &= C^{(i+1)}(C^{(i+1)})^T + (\sigma_c^{(i+1)})^2 I_{d_c}, \\
\widetilde{\rho}_n^{(i)} &= \operatorname{tr}\big\{ (\Sigma_c^{(i+1)})^{-1}(X_n - \widetilde{W}^{(i)})(\Sigma_r^{(i)})^{-1}(X_n - \widetilde{W}^{(i)})^T \big\}, \\
\widetilde{\mathrm{E}}\mu_n^{(i)} &= \mathrm{E}(\mu_n \,|\, X_n) = \frac{\widetilde{\nu}^{(i)} + d_c d_r}{\widetilde{\nu}^{(i)} + \widetilde{\rho}_n^{(i)}}, \\
\Phi_r^{(i)} &= (R^{(i)})^T R^{(i)} + (\sigma_r^{(i)})^2 I_{q_r}, \\
\mathrm{E}Y_n^{c(i)} &= \mathrm{E}(Y_n^c \,|\, X_n, \mu_n) = (X_n - \widetilde{W}^{(i)}) R^{(i)}(\Phi_r^{(i)})^{-1}.
\end{aligned}
\tag{25}
\]
We have, up to a constant,
\[
\begin{aligned}
Q_r(\theta_r) = -\sum_{n=1}^N \Bigg\{ & \ln\Gamma\Big(\frac{\nu}{2}\Big) - \frac{\nu}{2}\ln\frac{\nu}{2} + \frac{\nu}{2}\widetilde{\mathrm{E}}\mu_n^{(i)} - \frac{(d_r+q_r)d_c+\nu-2}{2}\,\mathrm{E}(\ln\mu_n \,|\, X_n) + \frac{d_c d_r}{2}\ln\sigma_r^2 \\
& + \frac{1}{2\sigma_r^2}\,\mathrm{E}\Big( \operatorname{tr}\big\{ \mu_n[X_n-(Y_n^c R^T+W)]^T\Sigma_c^{-1}[X_n-(Y_n^c R^T+W)] \big\} \,\Big|\, X_n \Big) \Bigg\},
\end{aligned}
\tag{26}
\]
where $\mathrm{E}(\ln\mu_n \,|\, X_n) = \psi\big( \frac{\widetilde{\nu}^{(i)} + d_c d_r}{2} \big) - \ln\big( \frac{\widetilde{\nu}^{(i)} + \widetilde{\rho}_n^{(i)}}{2} \big)$ and
\[
\begin{aligned}
&\mathrm{E}\Big( \operatorname{tr}\big\{ \mu_n[X_n-(Y_n^c R^T+W)]^T\Sigma_c^{-1}[X_n-(Y_n^c R^T+W)] \big\} \,\Big|\, X_n \Big) \\
&\quad= \widetilde{\mathrm{E}}\mu_n^{(i)}\operatorname{tr}\big\{ (X_n-W)^T\Sigma_c^{-1}(X_n-W) \big\}
- 2\,\widetilde{\mathrm{E}}\mu_n^{(i)}\operatorname{tr}\big\{ (X_n-W)^T\Sigma_c^{-1}\,\mathrm{E}Y_n^{c(i)} R^T \big\} \\
&\qquad+ \operatorname{tr}\big( \Sigma_c^{-1}\Sigma_c^{(i+1)} \big)\operatorname{tr}\big\{ R^T R(\sigma_r^{(i)})^2(\Phi_r^{(i)})^{-1} \big\}
+ \widetilde{\mathrm{E}}\mu_n^{(i)}\operatorname{tr}\big\{ R^T R\,(\mathrm{E}Y_n^{c(i)})^T\Sigma_c^{-1}\,\mathrm{E}Y_n^{c(i)} \big\},
\end{aligned}
\tag{27}
\]
See Appendix A for the derivation of (27). Finally, in the CM-step of the second stage, based on $\widetilde{\theta}_c^{(i)} = \{\widetilde{\nu}^{(i)}, \sigma_c^{(i+1)}, \widetilde{W}^{(i)}, C^{(i+1)}\}$ and similarly to (23) and (24), we maximize $Q_r(\theta_r)$ with respect to $\theta_r$ to update
\[
\begin{aligned}
W^{(i+1)} &= \frac{ \sum_{n=1}^N \widetilde{\mathrm{E}}\mu_n^{(i)}\big( X_n - \mathrm{E}Y_n^{c(i)}(R^{(i)})^T \big) }{ \sum_{n=1}^N \widetilde{\mathrm{E}}\mu_n^{(i)} }, \\
R^{(i+1)} &= \left[ \sum_{n=1}^N \widetilde{\mathrm{E}}\mu_n^{(i)}\big( X_n - W^{(i+1)} \big)^T(\Sigma_c^{(i+1)})^{-1}\,\mathrm{E}Y_n^{c(i)} \right]
\left[ \sum_{n=1}^N \Big( d_c(\sigma_r^{(i)})^2(\Phi_r^{(i)})^{-1} + \widetilde{\mathrm{E}}\mu_n^{(i)}(\mathrm{E}Y_n^{c(i)})^T(\Sigma_c^{(i+1)})^{-1}\,\mathrm{E}Y_n^{c(i)} \Big) \right]^{-1}, \\
(\sigma_r^{(i+1)})^2 &= \frac{1}{N d_c d_r}\sum_{n=1}^N \widetilde{\mathrm{E}}\mu_n^{(i)}\operatorname{tr}\Big\{ \big( X_n - W^{(i+1)} \big)^T(\Sigma_c^{(i+1)})^{-1}\big( X_n - W^{(i+1)} - \mathrm{E}Y_n^{c(i)}(R^{(i+1)})^T \big) \Big\},
\end{aligned}
\tag{28}
\]
and then solve the scalar nonlinear maximization problem (26) on ν to get ν ( i + 1 ) .
We summarize what we do in this subsection in Algorithm 1. A few remarks regarding Algorithm 1 are in order:
(1)
In Algorithm 1, it is not necessary to explicitly compute $\Sigma_c^{(i)}$ and $\Sigma_r^{(i)}$, because $(\Sigma_c^{(i)})^{-1}$ and $(\Sigma_r^{(i)})^{-1}$ can be computed more efficiently by using
\[
(\Sigma_c^{(i)})^{-1} = \frac{ I_{d_c} - C^{(i)}(\Phi_c^{(i)})^{-1}(C^{(i)})^T }{ (\sigma_c^{(i)})^2 }, \qquad
(\Sigma_r^{(i)})^{-1} = \frac{ I_{d_r} - R^{(i)}(\Phi_r^{(i)})^{-1}(R^{(i)})^T }{ (\sigma_r^{(i)})^2 }.
\]
In each iteration of Algorithm 1, the most expensive computation is the $O(N d_c d_r(d_c + d_r))$ cost of forming $\rho_n^{(i)}$ for $1 \le n \le N$. Owing to the introduction of the new latent variable $\mu_n$, RBPPCA is slightly more expensive than BPPCA, whose per-iteration cost is $O(N d_c d_r(q_c + q_r))$. However, our numerical examples will show that the RBPPCA algorithm is less sensitive to outliers.
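As a small MATLAB fragment illustrating this remark (our own sketch; variable names are assumptions), the two inverses can be formed from the low-rank factors without ever constructing $\Sigma_c$ or $\Sigma_r$ explicitly:

% Invert Sigma_c = C*C' + sigma_c^2*I_dc and Sigma_r = R*R' + sigma_r^2*I_dr
% through the small q_c x q_c and q_r x q_r matrices Phi_c and Phi_r.
Phi_c = C'*C + sigma_c^2*eye(qc);
Phi_r = R'*R + sigma_r^2*eye(qr);
inv_Sigma_c = (eye(dc) - C*(Phi_c\C')) / sigma_c^2;
inv_Sigma_r = (eye(dr) - R*(Phi_r\R')) / sigma_r^2;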
(2)
Compared with the AECM algorithm of BPPCA in [12], which uses centered data and estimates $\{\sigma_c, C\}$ and $\{\sigma_r, R\}$ based on the models $X = C Y^r + Y^{\epsilon_r}$ and $X = Y^c R^T + Y^{\epsilon_c}$, respectively, two more parameters, $\nu$ and $W$, need to be computed in the AECM iteration of RBPPCA. Notice that both models (14) and (18) contain the parameters $\nu$ and $W$. Thus, we split the parameter set $\theta$ into $\theta_c = \{\nu, \sigma_c, W, C\}$ and $\theta_r = \{\nu, \sigma_r, W, R\}$, which naturally leads to the parameters $\nu$ and $W$ being updated twice in each loop of Algorithm 1. Although other partitions of the set $\theta$, such as $\{\nu, \sigma_c, C\}$ and $\{\sigma_r, W, R\}$, are also available for the estimation of the parameters, we prefer $\theta_c$ and $\theta_r$, because updating $\nu$ and $W$ one more time in each iteration requires only a little extra computational cost.
(3)
As stated in Section 3.3 of [24], any AECM sequence increases $L(\theta) = \sum_{n=1}^N \ln p(X_n \,|\, \theta)$ at each iteration and converges to a stationary point of $L(\theta)$. Notice that the convergence results of the AECM algorithm proved in Section 3.3 of [24] do not depend on the distributions of the data sets. Therefore, in addition to setting a limit on the maximum number of iterations $iter_{max}$, we use the following relative change of the log-likelihood as the stopping criterion, i.e.,
\[
\zeta = \left| 1 - \frac{L(\theta^{(i+1)})}{L(\theta^{(i)})} \right| = \left| 1 - \frac{ \sum_{n=1}^N \ln p(X_n \,|\, \theta^{(i+1)}) }{ \sum_{n=1}^N \ln p(X_n \,|\, \theta^{(i)}) } \right| \le \varepsilon,
\tag{29}
\]
where $\varepsilon$ is a specified tolerance, set to $10^{-5}$ by default in our numerical examples.
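For completeness, a hedged MATLAB sketch of this stopping test (our own illustration; rbppca_loglik is an assumed helper name) evaluates $L(\theta) = \sum_{n=1}^N \ln p(X_n \,|\, \theta)$ with the matrix variate t density derived in Section 2 and then checks (29):

function L = rbppca_loglik(X, nu, W, Sigma_c, Sigma_r)
% Observed-data log-likelihood under X_n ~ T_{dc,dr}(nu, W, Sigma_c, Sigma_r).
[dc, dr, N] = size(X);
logdet_c = 2*sum(log(diag(chol(Sigma_c))));
logdet_r = 2*sum(log(diag(chol(Sigma_r))));
logconst = gammaln((nu + dc*dr)/2) - gammaln(nu/2) ...
         - (dc*dr/2)*log(nu*pi) - (dr/2)*logdet_c - (dc/2)*logdet_r;
L = 0;
for n = 1:N
    D   = X(:,:,n) - W;
    rho = trace(Sigma_c \ D / Sigma_r * D');
    L   = L + logconst - ((nu + dc*dr)/2)*log(1 + rho/nu);
end
end

% Stopping test (29): zeta = abs(1 - L_new/L_old) <= 1e-5.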
(4)
Based on the computed results of Algorithm 1, and similar to PPCA [8] and BPPCA [12], it is known that
\[
\mathrm{E}(Z \,|\, X) = \mathrm{E}\big( \mathrm{E}(Z \,|\, Y^r) \,\big|\, X \big) = \Phi_c^{-1} C^T (X - W) R \Phi_r^{-1}
\]
can be considered as the compressed representation of $X$. Hence, we can reconstruct $X$ as
\[
\widehat{X} = C\,\mathrm{E}(Z \,|\, X)\, R^T + W = C\Phi_c^{-1} C^T (X - W) R \Phi_r^{-1} R^T + W.
\tag{30}
\]
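A short MATLAB fragment (our own sketch; it assumes the converged $W$, $C$, $R$, $\sigma_c$ and $\sigma_r$ returned by Algorithm 1) that computes the compressed representation $\mathrm{E}(Z \,|\, X)$ and the reconstruction (30):

% Compress an observation X and reconstruct it via (30).
Phi_c = C'*C + sigma_c^2*eye(size(C,2));
Phi_r = R'*R + sigma_r^2*eye(size(R,2));
Z_hat = (Phi_c \ (C' * (X - W))) * (R / Phi_r);    % E(Z | X), a q_c x q_r matrix
X_hat = C * Z_hat * R' + W;                        % reconstructed observation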
Algorithm 1 Robust bilinear probabilistic PCA algorithm (RBPPCA).
Input: initial values $\nu^{(0)}, \sigma_c^{(0)}, \sigma_r^{(0)}, W^{(0)}, C^{(0)}, R^{(0)}$, and sample matrices $\{X_n\}_{n=1}^N$. Compute $\Phi_c^{(0)}$.
Output: the converged $\{\nu, \sigma_c, \sigma_r, W, C, R\}$.
  • for $i = 0, 1, 2, \ldots$, until $\zeta$ defined in (29) is smaller than a threshold do
  •     % Stage 1
  •     Compute $\Phi_r^{(i)}$, $\rho_n^{(i)}$, $\mathrm{E}\mu_n^{(i)}$ and $\mathrm{E}Y_n^{r(i)}$ via (21).
  •     Compute $\widetilde{W}^{(i)}$, $C^{(i+1)}$ and $\sigma_c^{(i+1)}$ by (23) and (24).
  •     Solve the scalar nonlinear maximization problem (20) to obtain $\widetilde{\nu}^{(i)}$.
  •     % Stage 2
  •     Compute $\Phi_c^{(i+1)}$, $\widetilde{\rho}_n^{(i)}$, $\widetilde{\mathrm{E}}\mu_n^{(i)}$ and $\mathrm{E}Y_n^{c(i)}$ via (25).
  •     Compute $W^{(i+1)}$, $R^{(i+1)}$ and $\sigma_r^{(i+1)}$ by (28).
  •     Solve the scalar nonlinear maximization problem (26) to obtain $\nu^{(i+1)}$.
  • end for

4. Numerical Examples

In this section, we conduct several numerical examples based on synthetic problems and three real-world data sets to demonstrate the effectiveness of our proposed RBPPCA algorithm. All experiments were run in MATLAB (2016a), with machine epsilon $2.22 \times 10^{-16}$, on a Windows 10 (64-bit) laptop with an Intel Core i7-8750H CPU (2.20 GHz) and 8 GB of memory. Each random experiment was repeated 20 times independently, and the average numerical results are reported.
Example 1
(Experiments on synthetic data). In this example, we compare our method only with the BPPCA algorithm [12] to illustrate the significant improvement achieved by the RBPPCA algorithm. We take $N$ data matrices $X_n$ for $n = 1, \ldots, N$ with $N = 200$ and $d_c = d_r = 64$, $N_g$ of which are generated by
\[
X_n = C Z_n R^T + W + C E_{r_n} + E_{c_n} R^T + E_n, \qquad n = 1, \ldots, N_g,
\]
where $C$, $R$ and $W$ are simply synthesized by MATLAB as
\[
C = \operatorname{eye}(d_c, q_c), \quad R = \operatorname{eye}(d_r, q_r), \quad W = \operatorname{rand}(d_c, d_r) \quad \text{with } q_c = q_r = 8,
\]
and $Z_n$, $E_{r_n}$, $E_{c_n}$ and $E_n$ are sampled from the matrix variate normal distributions $\mathcal{N}_{q_c,q_r}(0_{q_c \times q_r}, I_{q_c}, I_{q_r})$, $\mathcal{N}_{q_c,d_r}(0_{q_c \times d_r}, I_{q_c}, \sigma_r^2 I_{d_r})$, $\mathcal{N}_{d_c,q_r}(0_{d_c \times q_r}, \sigma_c^2 I_{d_c}, I_{q_r})$ and $\mathcal{N}_{d_c,d_r}(0_{d_c \times d_r}, \sigma_c^2 I_{d_c}, \sigma_r^2 I_{d_r})$ with $\sigma_c = \sigma_r = 1$, respectively. The other $N_o$ data matrices, i.e., $X_n$ for $n = N_g + 1, \ldots, N$, are regarded as outliers, each entry of which is sampled from the uniform distribution over the range 0 to 10. In order to assess the quality of the computed approximations $C^{(i)}$ and $R^{(i)}$, we calculate the arc length distance between the two subspaces $\operatorname{span}(R \otimes C)$ and $\operatorname{span}(R^{(i)} \otimes C^{(i)})$, which is used in [12] and defined as
\[
\arcsin\big( \| (I_{d_c d_r} - Q Q^T) Q^{(i)} \|_2 \big),
\tag{31}
\]
to monitor the numerical performance of the RBPPCA and BPPCA methods, where $\operatorname{span}(R \otimes C)$ and $\operatorname{span}(R^{(i)} \otimes C^{(i)})$ are the column spaces of $R \otimes C$ and $R^{(i)} \otimes C^{(i)}$, respectively, and $Q$ and $Q^{(i)}$ are orthonormal basis matrices of $\operatorname{span}(R \otimes C)$ and $\operatorname{span}(R^{(i)} \otimes C^{(i)})$, respectively. In fact, by ([28], Definition 4.2.1), the computed value of (31) is the largest canonical angle between the estimated subspace $\operatorname{span}(R^{(i)} \otimes C^{(i)})$ and the true subspace $\operatorname{span}(R \otimes C)$.
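A brief MATLAB fragment (our own sketch, not code from the paper; the names R_i and C_i for the estimates are assumptions) that evaluates (31) from the true and estimated factor pairs:

% Largest canonical angle between span(kron(R, C)) and span(kron(R_i, C_i)).
[Q,  ~] = qr(kron(R,   C  ), 0);                 % orthonormal basis of the true subspace
[Qi, ~] = qr(kron(R_i, C_i), 0);                 % orthonormal basis of the estimated subspace
E = Qi - Q*(Q'*Qi);                              % (I - Q*Q')*Qi without forming the identity
dist = asin(min(1, norm(E)));                    % arc length distance (31)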
In this test, we start with $C^{(0)} = \operatorname{rand}(d_c, q_c)$, $R^{(0)} = \operatorname{rand}(d_r, q_r)$, $\sigma_c^{(0)} = 1$, $\sigma_r^{(0)} = 1$ and $\nu^{(0)} = 1$, and then consider the effect of the ratio of outliers, i.e., $N_o/N$, which is varied from 0 to 30% in steps of 10% in this example. In these cases, the estimated values of $\nu$ are all $\nu = 1$. If other initial values $\nu^{(0)}$ are used, the computed $\nu$ also converges to one as the iterations proceed. The corresponding numerical results for the arc length distances are plotted in Figure 1. Figure 1 shows that the RBPPCA and BPPCA methods exhibit almost the same convergence behavior when the data matrices contain no outliers.
As the ratio of outliers grows to 30%, it is reasonable that more iterations are required for the RBPPCA method to achieve a satisfactory accuracy. Unlike the BPPCA method, the presented RBPPCA method is robust to outliers: the arc length distances of the BPPCA method remain at approximately 1.5 whenever the data include outliers.
In this example, we also test the impact of the initializations of $C^{(0)}$ and $R^{(0)}$ and of the sample size $N$, respectively, based on the synthetic data containing 10% outliers. Three different types of initialization of $C^{(0)}$ and $R^{(0)}$ are set as follows:
(1)
initialization 1: $C^{(0)} = \operatorname{rand}(d_c, q_c)$ and $R^{(0)} = \operatorname{rand}(d_r, q_r)$;
(2)
initialization 2: $C^{(0)} = \operatorname{randn}(d_c, q_c)$ and $R^{(0)} = \operatorname{randn}(d_r, q_r)$;
(3)
initialization 3:
\[
C^{(0)} = R^{(0)} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
\vdots & & & \ddots & & & & \vdots \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
\frac{9}{d_c} & \frac{9}{d_c + 9} & \sin(9) & \cos(9) & \tan(9) & \cot(9) & \sec(9) & \csc(9) \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\frac{d_c}{d_c} & \frac{d_c}{d_c + d_c} & \sin(d_c) & \cos(d_c) & \tan(d_c) & \cot(d_c) & \sec(d_c) & \csc(d_c)
\end{bmatrix},
\]
and the other parameters are fixed to be the same. The results associated with the different initializations are shown in Figure 2. Inspection of the plot shows that our RBPPCA method appears to be insensitive to the initialization, since the convergence behaviors of the RBPPCA method under the different initializations do not differ significantly. Figure 3 presents the required CPU time in seconds and the arc length distances of the BPPCA and RBPPCA methods after 25 iterations with respect to the number of samples, where the sample size $N$ varies from 200 to 5000 in steps of 200. These graphs convey the fact that, as the sample size $N$ increases, both the BPPCA and RBPPCA methods require more CPU time for 25 iterations, and the BPPCA method has a lower time complexity than RBPPCA, as stated in the remarks on Algorithm 1. However, a larger $N$ does not improve the arc length distances of the BPPCA method.
Example 2
(Experiments on face image databases). In this example, all of the experiments are based on two publicly available face image databases: the Yale face database (available from http://vision.ucsd.edu/content/yale-face-database, accessed on 22 September 2021) and the Yale B face database (available from http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html, accessed on 22 September 2021). The Yale face database includes variations in face orientation and facial expression and in whether glasses are worn, while the Yale B face database is collected under different illumination conditions. We select 165 face images of 15 individuals from each database. Each person has 11 images, and these images are cropped to $64 \times 64$ pixels. For each individual, eight images are randomly selected as the training set, and the rest form the test set. Then, we randomly select two and four images, respectively, from the eight training images of each individual to be corrupted as outliers. Half of the corrupted images are generated by replacing part of the original image with a $30 \times 30$ rectangle of noise, and the other half use a $50 \times 50$ rectangle of noise. Within each rectangle, the pixel values come from a uniform distribution on the interval $[0, 255]$. Then, we rescale the images from the range $[0, 255]$ to the range $[0, 1]$. Some original and corrupted images of the Yale and Yale B databases are shown in Figure 4, Figure 5, Figure 6 and Figure 7, respectively. The iterations of BPPCA and RBPPCA are stopped when the corresponding relative changes of the log-likelihood defined in (29) are smaller than $10^{-5}$.
We run the BPPCA and RBPPCA algorithms on the original data set and the corrupted data set, respectively, with the order of $Z$ in (11) being $q_c = q_r = 8$, and consider the reconstructed images $\widehat{X}_n$ defined in (30) based on the computed results. A comparison of the reconstructed images based on the original data set and on the corrupted data set with two corrupted images for each individual is shown in Figure 8. In Figure 8, the first, second and third columns are the original images, the reconstructed images of RBPPCA, and the reconstructed images of BPPCA for the Yale (left) and Yale B (right) databases, respectively, and the images of the first, second and third rows are based on the original data set, the corrupted data set with $30 \times 30$ rectangles of noise, and the corrupted data set with $50 \times 50$ rectangles of noise, respectively. BPPCA and RBPPCA perform almost identically in reconstructing images from the original data set. However, for the corrupted data sets, RBPPCA produces better reconstructed images than BPPCA, since BPPCA also tries to explain the noise information.
We also compare the average reconstruction errors $\eta = \sum_{n=1}^N \|X_n - \widehat{X}_n\|_F / N$ and the recognition accuracy rates of the RBPPCA and BPPCA algorithms in the cases where each person has two corrupted images and four corrupted images, respectively, where the recognition accuracy rates are calculated with the nearest neighbor classifier (1-NN) employed for classification. The average reconstruction errors and recognition accuracy rates versus the order of $Z$ are plotted in Figure 9 and Figure 10, respectively. As expected, the average reconstruction errors decrease, while the recognition accuracy rates rise, as the order of $Z$ increases. In these cases, our proposed RBPPCA algorithm outperforms the BPPCA algorithm in reducing the average reconstruction errors and enhancing the recognition accuracy. In addition, another advantage of robust probabilistic algorithms based on t distributions is outlier detection. Following [14], we can compute $\rho_n = \operatorname{vec}(X_n - W)^T (\Sigma_r \otimes \Sigma_c)^{-1} \operatorname{vec}(X_n - W)$ as a criterion for outlier detection. Figure 11 is the scatter chart of $\rho_n$ for all the images in the training sets of the Yale and Yale B databases with two corrupted images per individual, respectively. It shows that the values of $\rho_n$ fall into three groups, which correspond to the images with no noise, a $30 \times 30$ rectangle of noise, and a $50 \times 50$ rectangle of noise, respectively. Hence, comparing $\rho_n$ provides a means of identifying the outliers.
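As a hedged MATLAB sketch of this outlier-detection step (our own illustration; variable names are assumptions), the score $\rho_n$ can be computed for every training image from the fitted parameters and then inspected as in Figure 11:

% Outlier score rho_n for each training image; large values flag likely outliers.
Sigma_c = C*C' + sigma_c^2*eye(dc);
Sigma_r = R*R' + sigma_r^2*eye(dr);
rho = zeros(N,1);
for n = 1:N
    D = X(:,:,n) - W;
    rho(n) = trace(Sigma_c \ D / Sigma_r * D');  % equals vec(D)' * inv(kron(Sigma_r, Sigma_c)) * vec(D)
end
scatter(1:N, rho);  xlabel('image index');  ylabel('\rho_n');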
Example 3
(Experiments on the MNIST data set). In Example 2, it was shown that RBPPCA is superior to BPPCA when the data sets contain outliers. In this example, we compare the RBPPCA algorithm with the tPPCA [14] and L1-PPCA [15] algorithms based on handwritten digit images from the MNIST database (available from http://yann.lecun.com/exdb/mnist, accessed on 22 September 2021), in which each image has $28 \times 28$ pixels. We choose 59 images of the digit 4 as the training data set and randomly select nine of them to be corrupted as outliers. The images are corrupted by adding noise from a uniform distribution on the interval $[0, 510]$ and then normalizing all images to the range $[0, 1]$. The normalized corrupted images of the digit 4 are shown in Figure 12. The RBPPCA, tPPCA [14] and L1-PPCA [15] algorithms are run for 100 iterations on the original data set of the digit 4 and on the corrupted data set, respectively.
Figure 13 presents the average reconstruction errors of the RBPPCA, tPPCA and L1-PPCA algorithms, where the feature number is the order of $Z$ for the RBPPCA algorithm and the dimension of the low-dimensional representation for the tPPCA and L1-PPCA algorithms. It is observed that the performance of our RBPPCA algorithm is superior to that of the other algorithms. The reconstruction behaviors of the different algorithms based on the original data set of the digit 4 and on the corrupted data set with $q_c = q_r = 1$ are shown in Figure 14. In Figure 14, the first column shows the original images, and the second, third and fourth columns show the images reconstructed by the RBPPCA, tPPCA and L1-PPCA algorithms, respectively. As shown in Figure 14, compared to the tPPCA and L1-PPCA algorithms, the RBPPCA algorithm produces better reconstructions in this case.

5. Conclusions

To remedy the problem that data are often assumed to follow a matrix variate Gaussian distribution, which is sensitive to outliers, in this paper we proposed a robust BPPCA algorithm (RBPPCA), i.e., Algorithm 1, by replacing the matrix variate Gaussian distribution with the matrix variate t distribution for the noise terms. Compared to BPPCA, owing to the significantly heavier tails of the matrix variate t distribution, our proposed RBPPCA method, combined with the AECM algorithm for estimating the parameters, can deal with 2-D data sets in the presence of outliers. Numerical examples based on synthetic data and two publicly available real data sets, Yale and Yale B, are presented to show that Algorithm 1 is far superior to the BPPCA algorithm in computational accuracy, reconstruction performance, average reconstruction errors, recognition accuracy rates, and outlier detection. It is also shown by numerical examples based on the MNIST database that our RBPPCA method outperforms the tPPCA and L1-PPCA algorithms.

Author Contributions

Writing—original draft, Y.L.; Writing—review and editing, Z.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the research fund for distinguished young scholars of Fujian Agriculture and Forestry University No. xjq201727, and the science and technology innovation special fund project of Fujian Agriculture and Forestry University No. CXZX2020105A.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Derivations for (22) and (27) in Section 3.
We first consider the conditional expectation of $\mu_n Y_n^r$ with respect to the conditional distribution $Y_n^r, \mu_n \,|\, X_n$, which will be used in our derivation of (22). That is,
\[
\begin{aligned}
\mathrm{E}(\mu_n Y_n^r \,|\, X_n) &= \int_{Y_n^r}\!\int_{\mu_n} \mu_n Y_n^r\, p(Y_n^r, \mu_n \,|\, X_n)\, d\mu_n\, dY_n^r
= \int_{Y_n^r}\!\int_{\mu_n} \mu_n Y_n^r\, p(Y_n^r \,|\, X_n, \mu_n)\, p(\mu_n \,|\, X_n)\, d\mu_n\, dY_n^r \\
&= \int_{\mu_n} \mu_n\, p(\mu_n \,|\, X_n) \left[ \int_{Y_n^r} Y_n^r\, p(Y_n^r \,|\, X_n, \mu_n)\, dY_n^r \right] d\mu_n
= \int_{\mu_n} \mu_n\, p(\mu_n \,|\, X_n)\, \mathrm{E}(Y_n^r \,|\, X_n, \mu_n)\, d\mu_n \\
&= \mathrm{E}(\mu_n \,|\, X_n)\, \mathrm{E}(Y_n^r \,|\, X_n, \mu_n) = \mathrm{E}\mu_n^{(i)}\, \mathrm{E}Y_n^{r(i)},
\end{aligned}
\tag{A1}
\]
where $\mathrm{E}\mu_n^{(i)}$ and $\mathrm{E}Y_n^{r(i)}$ are given by (21). In addition, for symmetric matrices $A \in \mathbb{R}^{d_r \times d_r}$ and $B \in \mathbb{R}^{q_c \times q_c}$, we have
\[
\begin{aligned}
&\mathrm{E}\big( \operatorname{tr}\{ \mu_n A (Y_n^r)^T B Y_n^r \} \,\big|\, X_n \big)
= \int_{Y_n^r}\!\int_{\mu_n} \operatorname{tr}\{ \mu_n A (Y_n^r)^T B Y_n^r \}\, p(Y_n^r \,|\, X_n, \mu_n)\, p(\mu_n \,|\, X_n)\, d\mu_n\, dY_n^r \\
&\quad= \int_{\mu_n} \mu_n\, p(\mu_n \,|\, X_n) \int_{Y_n^r} \operatorname{tr}\{ A (Y_n^r)^T B Y_n^r \}\, p(Y_n^r \,|\, X_n, \mu_n)\, dY_n^r\, d\mu_n \\
&\quad= \int_{\mu_n} \mu_n\, p(\mu_n \,|\, X_n) \int_{Y_n^r} \operatorname{vec}(Y_n^r)^T (A \otimes B)\operatorname{vec}(Y_n^r)\, p(Y_n^r \,|\, X_n, \mu_n)\, dY_n^r\, d\mu_n \\
&\quad= \int_{\mu_n} \mu_n\, p(\mu_n \,|\, X_n)\, \operatorname{tr}\Big\{ (A \otimes B) \int_{\operatorname{vec}(Y_n^r)} \operatorname{vec}(Y_n^r)[\operatorname{vec}(Y_n^r)]^T\, p(\operatorname{vec}(Y_n^r) \,|\, X_n, \mu_n)\, d\operatorname{vec}(Y_n^r) \Big\}\, d\mu_n \\
&\quad= \int_{\mu_n} \mu_n\, p(\mu_n \,|\, X_n)\, \operatorname{tr}\Big\{ (A \otimes B)\Big( \mu_n^{-1}\Sigma_r^{(i)} \otimes (\sigma_c^{(i)})^2(\Phi_c^{(i)})^{-1} + \operatorname{vec}(\mathrm{E}Y_n^{r(i)})[\operatorname{vec}(\mathrm{E}Y_n^{r(i)})]^T \Big) \Big\}\, d\mu_n \\
&\quad= \int_{\mu_n} \mu_n\, p(\mu_n \,|\, X_n)\, \operatorname{tr}\big( \mu_n^{-1} A\Sigma_r^{(i)} \otimes B(\sigma_c^{(i)})^2(\Phi_c^{(i)})^{-1} \big)\, d\mu_n
+ \int_{\mu_n} \mu_n\, p(\mu_n \,|\, X_n)\, \operatorname{tr}\big\{ A(\mathrm{E}Y_n^{r(i)})^T B\,\mathrm{E}Y_n^{r(i)} \big\}\, d\mu_n \\
&\quad= \operatorname{tr}\big( A\Sigma_r^{(i)} \big)\operatorname{tr}\big\{ B(\sigma_c^{(i)})^2(\Phi_c^{(i)})^{-1} \big\} + \mathrm{E}\mu_n^{(i)}\operatorname{tr}\big\{ A(\mathrm{E}Y_n^{r(i)})^T B\,\mathrm{E}Y_n^{r(i)} \big\}.
\end{aligned}
\tag{A2}
\]
Notice that (22) can be rewritten as
\[
\begin{aligned}
&\mathrm{E}\Big( \operatorname{tr}\big\{ \mu_n\Sigma_r^{-1}[X_n-(C Y_n^r+W)]^T[X_n-(C Y_n^r+W)] \big\} \,\Big|\, X_n \Big) \\
&\quad= \mathrm{E}\big( \mu_n\operatorname{tr}\{ \Sigma_r^{-1}(X_n-W)^T(X_n-W) \} \,\big|\, X_n \big)
- 2\,\mathrm{E}\big( \mu_n\operatorname{tr}\{ \Sigma_r^{-1}(X_n-W)^T C Y_n^r \} \,\big|\, X_n \big)
+ \mathrm{E}\big( \operatorname{tr}\{ \mu_n\Sigma_r^{-1}(Y_n^r)^T(C^T C) Y_n^r \} \,\big|\, X_n \big).
\end{aligned}
\tag{A3}
\]
Then, we compute the terms on the right-hand side of (A3). It is clear that
\[
\mathrm{E}\big( \mu_n\operatorname{tr}\{ \Sigma_r^{-1}(X_n-W)^T(X_n-W) \} \,\big|\, X_n \big) = \mathrm{E}(\mu_n \,|\, X_n)\operatorname{tr}\big\{ \Sigma_r^{-1}(X_n-W)^T(X_n-W) \big\} = \mathrm{E}\mu_n^{(i)}\operatorname{tr}\big\{ \Sigma_r^{-1}(X_n-W)^T(X_n-W) \big\}.
\tag{A4}
\]
By (A1), and by taking $A = \Sigma_r^{-1}$ and $B = C^T C$ in (A2), we have
\[
\mathrm{E}\big( \mu_n\operatorname{tr}\{ \Sigma_r^{-1}(X_n-W)^T C Y_n^r \} \,\big|\, X_n \big) = \mathrm{E}\mu_n^{(i)}\operatorname{tr}\big\{ \Sigma_r^{-1}(X_n-W)^T C\,\mathrm{E}Y_n^{r(i)} \big\},
\tag{A5}
\]
\[
\mathrm{E}\big( \operatorname{tr}\{ \mu_n\Sigma_r^{-1}(Y_n^r)^T(C^T C) Y_n^r \} \,\big|\, X_n \big) = \operatorname{tr}\big( \Sigma_r^{-1}\Sigma_r^{(i)} \big)\operatorname{tr}\big\{ C^T C(\sigma_c^{(i)})^2(\Phi_c^{(i)})^{-1} \big\} + \mathrm{E}\mu_n^{(i)}\operatorname{tr}\big\{ \Sigma_r^{-1}(\mathrm{E}Y_n^{r(i)})^T C^T C\,\mathrm{E}Y_n^{r(i)} \big\}.
\tag{A6}
\]
Equality (22) is a consequence of (A3)–(A6).
Similarly, for (27), it follows that
\[
\begin{aligned}
&\mathrm{E}\Big( \operatorname{tr}\big\{ \mu_n[X_n-(Y_n^c R^T+W)]^T\Sigma_c^{-1}[X_n-(Y_n^c R^T+W)] \big\} \,\Big|\, X_n \Big) \\
&\quad= \mathrm{E}\big( \mu_n\operatorname{tr}\{ (X_n-W)^T\Sigma_c^{-1}(X_n-W) \} \,\big|\, X_n \big)
- 2\,\mathrm{E}\big( \mu_n\operatorname{tr}\{ (X_n-W)^T\Sigma_c^{-1} Y_n^c R^T \} \,\big|\, X_n \big)
+ \mathrm{E}\big( \mu_n\operatorname{tr}\{ R^T R(Y_n^c)^T\Sigma_c^{-1} Y_n^c \} \,\big|\, X_n \big) \\
&\quad= \widetilde{\mathrm{E}}\mu_n^{(i)}\operatorname{tr}\big\{ (X_n-W)^T\Sigma_c^{-1}(X_n-W) \big\}
- 2\,\widetilde{\mathrm{E}}\mu_n^{(i)}\operatorname{tr}\big\{ (X_n-W)^T\Sigma_c^{-1}\,\mathrm{E}Y_n^{c(i)} R^T \big\} \\
&\qquad+ \operatorname{tr}\big( \Sigma_c^{-1}\Sigma_c^{(i+1)} \big)\operatorname{tr}\big\{ R^T R(\sigma_r^{(i)})^2(\Phi_r^{(i)})^{-1} \big\}
+ \widetilde{\mathrm{E}}\mu_n^{(i)}\operatorname{tr}\big\{ R^T R\,(\mathrm{E}Y_n^{c(i)})^T\Sigma_c^{-1}\,\mathrm{E}Y_n^{c(i)} \big\},
\end{aligned}
\tag{A7}
\]
where $\widetilde{\mathrm{E}}\mu_n^{(i)}$ and $\mathrm{E}Y_n^{c(i)}$ are defined in (25). The last equality in (A7) holds for the same reason as (A3).

References

  1. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
  2. Burges, C.J.C. Dimension Reduction: A Guided Tour; Now Publishers Inc.: Noord-Brabant, The Netherlands, 2010.
  3. Ma, Y.; Zhu, L. A review on dimension reduction. Int. Stat. Rev. 2013, 81, 134–150.
  4. Jolliffe, I.T. Principal Component Analysis; Springer: Berlin/Heidelberg, Germany, 2002.
  5. Yang, J.; Zhang, D.; Frangi, A.F.; Yang, J. Two-dimensional PCA: A new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 131–137.
  6. Ye, J. Generalized low rank approximations of matrices. Mach. Learn. 2005, 61, 167–191.
  7. Zhang, D.; Zhou, Z.H. (2D)2PCA: Two-directional two-dimensional PCA for efficient face representation and recognition. Neurocomputing 2005, 69, 224–231.
  8. Tipping, M.E.; Bishop, C.M. Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 1999, 61, 611–622.
  9. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012.
  10. Yu, S.; Bi, J.; Ye, J. Matrix-variate and higher-order probabilistic projections. Data Min. Knowl. Discov. 2011, 22, 372–392.
  11. Gupta, A.K.; Nagar, D.K. Matrix Variate Distributions; CRC Press: Boca Raton, FL, USA, 2018; Volume 104.
  12. Zhao, J.; Philip, L.H.; Kwok, J.T. Bilinear probabilistic principal component analysis. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 492–503.
  13. Zhao, J.; Jiang, Q. Probabilistic PCA for t distributions. Neurocomputing 2006, 69, 2217–2226.
  14. Chen, T.; Martin, E.; Montague, G. Robust probabilistic PCA with missing data and contribution analysis for outlier detection. Comput. Stat. Data Anal. 2009, 53, 3706–3716.
  15. Gao, J. Robust L1 principal component analysis and its Bayesian variational inference. Neural Comput. 2008, 20, 555–572.
  16. Ju, F.; Sun, Y.; Gao, J.; Hu, Y.; Yin, B. Image outlier detection and feature extraction via L1-norm-based 2D probabilistic PCA. IEEE Trans. Image Process. 2015, 24, 4834–4846.
  17. Galimberti, G.; Soffritti, G. A multivariate linear regression analysis using finite mixtures of t distributions. Comput. Stat. Data Anal. 2014, 71, 138–150.
  18. Morris, K.; McNicholas, P.D.; Scrucca, L. Dimension reduction for model-based clustering via mixtures of multivariate t-distributions. Adv. Data Anal. Classif. 2013, 7, 321–338.
  19. Pesevski, A.; Franczak, B.C.; McNicholas, P.D. Subspace clustering with the multivariate-t distribution. Pattern Recognit. Lett. 2018, 112, 297–302.
  20. Teklehaymanot, F.K.; Muma, M.; Zoubir, A.M. Robust Bayesian cluster enumeration based on the t distribution. Signal Process. 2021, 182, 107870.
  21. Wei, X.; Yang, Z. The infinite Student's t-factor mixture analyzer for robust clustering and classification. Pattern Recognit. 2012, 45, 4346–4357.
  22. Zhou, X.; Tan, C. Maximum likelihood estimation of Tobit factor analysis for multivariate t-distribution. Commun. Stat. Simul. Comput. 2009, 39, 1–16.
  23. Lange, K.L.; Little, R.J.A.; Taylor, J.M.G. Robust statistical modeling using the t distribution. J. Am. Stat. Assoc. 1989, 84, 881–896.
  24. Meng, X.L.; Van Dyk, D. The EM algorithm—An old folk-song sung to a fast new tune. J. R. Stat. Soc. Ser. B Stat. Methodol. 1997, 59, 511–567.
  25. Zhou, Y.; Lu, H.; Cheung, Y. Bilinear probabilistic canonical correlation analysis via hybrid concatenations. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 2017; Volume 31.
  26. Coleman, T.; Branch, M.A.; Grace, A. Optimization Toolbox. In For Use with MATLAB. User's Guide for MATLAB 5, Version 2, Release II; 1999. Available online: https://www.dpipe.tsukuba.ac.jp/~naito/optim_tb.pdf (accessed on 20 September 2021).
  27. Schrage, L. LINGO User's Guide; LINDO System Inc.: Chicago, IL, USA, 2006.
  28. Stewart, G.W. Matrix Algorithms: Volume II: Eigensystems; SIAM: Philadelphia, PA, USA, 2001.
Figure 1. Convergence behaviors of RBPPCA and BPPCA with the ratio of outliers being 0%, 10%, 20% and 30%, respectively.
Figure 2. Convergence behaviors of RBPPCA for three different types of initialization.
Figure 3. CPU time in seconds (left) and arc length distances (right) of BPPCA and RBPPCA with the number of iterations being 25 and the sample size N from 200 to 5000.
Figure 4. One set of processed samples from the Yale face database.
Figure 5. Some generated outlier face images in the training set from the Yale face database.
Figure 6. Eleven images of an individual from the Yale B face database.
Figure 7. Some corrupted face images in the training set of the Yale B face database.
Figure 8. Original images (the first column), reconstructed images by RBPPCA (the second column), and reconstructed images by BPPCA (the third column) of the Yale (left) and Yale B (right) databases, respectively.
Figure 9. Average reconstruction errors of the BPPCA and RBPPCA algorithms vs. the order of Z for the Yale and Yale B databases with 2 and 4 corrupted images of each individual, respectively.
Figure 10. Recognition accuracy rates of the BPPCA and RBPPCA algorithms vs. the order of Z for the Yale and Yale B databases with 2 and 4 corrupted images of each individual, respectively.
Figure 11. $\rho_n$ of each image in the Yale (left) and Yale B (right) databases, respectively, with 2 corrupted images of each individual.
Figure 12. The normalized corrupted images of the digit 4.
Figure 13. Average reconstruction errors of the RBPPCA, tPPCA and L1-PPCA algorithms vs. feature number on the MNIST database.
Figure 14. Original images of the digit 4 (the first column), images reconstructed by RBPPCA (the second column), images reconstructed by tPPCA (the third column), and images reconstructed by L1-PPCA (the fourth column). The images shown in the first and second rows are based on the original and corrupted image data sets, respectively.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
