Article

Riemannian Laplace Distribution on the Space of Symmetric Positive Definite Matrices

1 Groupe Signal et Image, CNRS Laboratoire IMS, Institut Polytechnique de Bordeaux, Université de Bordeaux, UMR 5218, Talence 33405, France
2 Communications Department, Technical University of Cluj-Napoca, 71-73 Dorobantilor Street, Cluj-Napoca 3400, Romania
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Entropy 2016, 18(3), 98; https://doi.org/10.3390/e18030098
Submission received: 19 December 2015 / Revised: 7 March 2016 / Accepted: 8 March 2016 / Published: 16 March 2016
(This article belongs to the Special Issue Differential Geometrical Theory of Statistics)

Abstract:
The Riemannian geometry of the space P m , of m × m symmetric positive definite matrices, has provided effective tools to the fields of medical imaging, computer vision and radar signal processing. Still, an open challenge remains, which consists of extending these tools to correctly handle the presence of outliers (or abnormal data), arising from excessive noise or faulty measurements. The present paper tackles this challenge by introducing new probability distributions, called Riemannian Laplace distributions on the space P m . First, it shows that these distributions provide a statistical foundation for the concept of the Riemannian median, which offers improved robustness in dealing with outliers (in comparison to the more popular concept of the Riemannian center of mass). Second, it describes an original expectation-maximization algorithm, for estimating mixtures of Riemannian Laplace distributions. This algorithm is applied to the problem of texture classification, in computer vision, which is considered in the presence of outliers. It is shown to give significantly better performance with respect to other recently-proposed approaches.

1. Introduction

Data with values in the space P_m of m × m symmetric positive definite matrices play an essential role in many applications, including medical imaging [1,2], computer vision [3,4,5,6,7] and radar signal processing [8,9]. In these applications, the location around which a dataset is centered is of special interest. While several definitions of this location are possible, its meaning as a representative of the set should be clear. Perhaps the best-known and most widely used quantity for representing the center of a dataset is the Fréchet mean. Given a set of points Y_1, …, Y_n in P_m, their Fréchet mean is defined to be:
\mathrm{Mean}(Y_1, \ldots, Y_n) = \operatorname{argmin}_{Y \in P_m} \sum_{i=1}^{n} d^2(Y, Y_i) \quad (1)
where d is Rao’s Riemannian distance on P m [10,11].
Statistics on general Riemannian manifolds have been driven by the development of different tools for geometric measurements and new probability distributions on manifolds [12,13]. On the manifold (P_m, d), the major advances in this field have been achieved by the recent papers [14,15], which introduce the Riemannian Gaussian distribution on (P_m, d). This distribution depends on two parameters Ȳ ∈ P_m and σ > 0, and its density with respect to the Riemannian volume form dv(Y) of P_m (see Formula (13) in Section 2) is:
\frac{1}{Z_m(\sigma)} \exp\left( -\,\frac{d^2(Y, \bar{Y})}{2\sigma^2} \right) \quad (2)
where Z m ( σ ) is a normalizing factor depending only on σ (and not on Y ¯ ).
For the Gaussian distribution Equation (2), the maximum likelihood estimate (MLE) for the parameter Y ¯ based on observations Y 1 , , Y n corresponds to the mean Equation (1). In [15], a detailed study of statistical inference for this distribution was given and then applied to the classification of data in P m , showing that it yields better performance, in comparison to recent approaches [2].
When a dataset contains extreme values (or outliers), because of the impact of these values on d 2 , the mean becomes less useful. It is usually replaced with the Riemannian median:
\mathrm{Median}(Y_1, \ldots, Y_n) = \operatorname{argmin}_{Y \in P_m} \sum_{i=1}^{n} d(Y, Y_i) \quad (3)
Definition Equation (3) parallels that of the median in statistics, which is based on ordering the values of a sequence. This interpretation, however, does not carry over to P_m. In fact, the Riemannian distance on P_m is not associated with any norm, and it is therefore only possible to compare the distances of a set of matrices to a reference matrix.
Like the mean, the Gaussian distribution on P_m lacks robustness in the presence of outliers. The main contribution of the present paper is to remedy this problem by introducing the Riemannian Laplace distribution, for which maximum likelihood estimation of the location parameter is in one-to-one correspondence with the Riemannian median. This will be shown to offer a considerable improvement in dealing with outliers.
This paper is organized as follows.
Section 2 reviews the Riemannian geometry of P_m, when this manifold is equipped with the Riemannian metric known as the Rao–Fisher or affine-invariant metric [10,11]. In particular, it gives analytic expressions for geodesic curves and the Riemannian distance, and recalls the invariance of Rao's distance under affine transformations.
Section 3 introduces the Laplace distribution L ( Y ¯ , σ ) through its probability density function with respect to the volume form d v ( Y ) :
p(Y \,|\, \bar{Y}, \sigma) = \frac{1}{\zeta_m(\sigma)} \exp\left( -\,\frac{d(Y, \bar{Y})}{\sigma} \right)
Here, σ lies in an interval ]0, σ_max[ with σ_max < ∞, because the normalizing constant ζ_m(σ) becomes infinite for σ ≥ σ_max. It will be shown that ζ_m(σ) depends only on σ (and not on Ȳ) for all σ < σ_max. This important fact leads to simple expressions for the MLEs of Ȳ and σ. In particular, the MLE of Ȳ based on a family of observations Y_1, …, Y_N sampled from L(Ȳ, σ) is given by the median of Y_1, …, Y_N defined by Equation (3), where d is Rao's distance.
Section 4 focuses on mixtures of Riemannian Laplace distributions on P m . A distribution of this kind has a density:
p(Y \,|\, (\varpi_\mu, \bar{Y}_\mu, \sigma_\mu)_{1 \le \mu \le M}) = \sum_{\mu=1}^{M} \varpi_\mu \, p(Y \,|\, \bar{Y}_\mu, \sigma_\mu) \quad (4)
with respect to the volume form dv(Y). Here, M is the number of mixture components, ϖ_μ > 0, Ȳ_μ ∈ P_m and σ_μ > 0 for all 1 ≤ μ ≤ M, and ∑_{μ=1}^{M} ϖ_μ = 1. A new EM (expectation-maximization) algorithm that computes maximum likelihood estimates of the mixture parameters (ϖ_μ, Ȳ_μ, σ_μ)_{1≤μ≤M} is provided. The problem of selecting the order M in Equation (4) is also discussed and addressed using the Bayesian information criterion (BIC) [16].
Section 5 is an application of the previous material to the classification of data with values in P_m, which contain outliers (abnormal data points). Assume a training sequence Y_1, …, Y_n ∈ P_m is given. Using the EM algorithm developed in Section 4, it is possible to subdivide this sequence into disjoint classes. To classify new data points, a classification rule is proposed. The robustness of this rule lies in the fact that it is based on the distances between new observations and the respective medians of the classes, instead of the means [15]. This rule will be illustrated by an application to the problem of texture classification in computer vision. The obtained results show improved performance with respect to recent approaches which use the Riemannian Gaussian distribution [15] and the Wishart distribution [17].

2. Riemannian Geometry of P m

The geometry of Siegel homogeneous bounded domains, such as Kähler homogeneous manifolds, has been studied by Felix A. Berezin [18] and P. Malliavin [19]. The structure of Kähler homogeneous manifolds has been used in [20,21] to parameterize (Toeplitz–)Block–Toeplitz matrices. This led to a Hessian metric from information geometry, with a Kähler potential given by entropy, and to an algorithm to compute medians of (Toeplitz–)Block–Toeplitz matrices by Karcher flow on the Mostow/Berger fibration of a Siegel disk. Optimal numerical schemes of this algorithm in a Siegel disk have been studied, developed and validated in [22,23,24].
This section introduces the necessary background on the Riemannian geometry of P m , the space of symmetric positive definite matrices of size m × m . Precisely, P m is equipped with the Riemannian metric known as the affine-invariant metric. First, analytic expressions are recalled for geodesic curves and Riemannian distance. Then, two properties are stated, which are fundamental to the following. These are affine-invariance of the Riemannian distance and the existence and uniqueness of Riemannian medians.
The affine-invariant metric, called the Rao–Fisher metric in information geometry, has the following expression:
g_Y(A, B) = \operatorname{tr}\left( Y^{-1} A \, Y^{-1} B \right) \quad (5)
where Y ∈ P_m and A, B ∈ T_Y P_m, the tangent space to P_m at Y, which is identified with the vector space of m × m symmetric matrices. The Riemannian metric Equation (5) induces a Riemannian distance on P_m as follows. The length of a smooth curve c : [0,1] → P_m is given by:
L(c) = \int_0^1 \sqrt{ g_{c(t)}\big( \dot{c}(t), \dot{c}(t) \big) } \; dt
where ċ(t) = dc/dt. For Y, Z ∈ P_m, the Riemannian distance d(Y, Z), called Rao's distance in information geometry, is defined to be:
d(Y, Z) = \inf \left\{ L(c) \;:\; c : [0,1] \to P_m \text{ is a smooth curve with } c(0) = Y, \; c(1) = Z \right\}
This infimum is achieved by a unique curve c = γ , called the geodesic connecting Y and Z, which has the following equation [10,25]:
\gamma(t) = Y^{1/2} \left( Y^{-1/2} Z \, Y^{-1/2} \right)^{t} Y^{1/2}
Here, and throughout the following, all matrix functions (for example, square root, logarithm or power) are understood as symmetric matrix functions [26]. By definition, d ( Y , Z ) coincides with L ( γ ) , which turns out to be:
d^2(Y, Z) = \operatorname{tr}\left[ \log\left( Y^{-1/2} Z \, Y^{-1/2} \right) \right]^2 \quad (8)
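For concreteness, the geodesic and the distance above can be computed with a few lines of linear algebra. The following is a minimal NumPy/SciPy sketch; the function names are illustrative and not from the paper.

```python
# Minimal sketch: geodesic between two SPD matrices and Rao's distance (Equation (8)).
import numpy as np
from scipy.linalg import eigh

def spd_power(Y, t):
    """Symmetric matrix power Y^t via the eigendecomposition of the SPD matrix Y."""
    w, V = np.linalg.eigh(Y)
    return (V * w ** t) @ V.T

def geodesic(Y, Z, t):
    """Point gamma(t) on the geodesic joining Y (t = 0) to Z (t = 1) in P_m."""
    Yh, Yih = spd_power(Y, 0.5), spd_power(Y, -0.5)
    return Yh @ spd_power(Yih @ Z @ Yih, t) @ Yh

def rao_distance(Y, Z):
    """d(Y, Z): Frobenius norm of log(Y^{-1/2} Z Y^{-1/2})."""
    # The generalized eigenvalues of (Z, Y) coincide with those of Y^{-1/2} Z Y^{-1/2}.
    lam = eigh(Z, Y, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))
```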
Equipped with the affine-invariant metric Equation (5), the space P_m enjoys two useful properties, which are the following. The first property is invariance under affine transformations [10,25]. Recall that an affine transformation of P_m is a mapping Y ↦ Y·A, where A is an invertible real m × m matrix,
Y \cdot A = A \, Y A^{\top} \quad (9)
and ⊤ denotes the transpose. Denote by GL(m) the group of m × m invertible real matrices, which acts on P_m through Equation (9). Then, the action of GL(m) on P_m is transitive. This means that for any Y, Z ∈ P_m, there exists A ∈ GL(m) such that Y·A = Z. Moreover, the Riemannian distance Equation (8) is invariant under affine transformations, in the sense that for all Y, Z ∈ P_m:
d(Y, Z) = d(Y \cdot A, \, Z \cdot A) \quad (10)
where Y · A and Z · A are defined by Equation (9). The transitivity of the action Equation (9) and the isometry property Equation (10) make P m a Riemannian homogeneous space.
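The invariance property Equation (10) is easy to check numerically, reusing rao_distance from the sketch above; the test matrices below are arbitrary.

```python
# Numerical check of d(Y, Z) = d(Y.A, Z.A) for a random invertible A.
import numpy as np

rng = np.random.default_rng(1)

def random_spd(m):
    B = rng.normal(size=(m, m))
    return B @ B.T + m * np.eye(m)

Y, Z = random_spd(3), random_spd(3)
A = rng.normal(size=(3, 3))                  # invertible with probability one
act = lambda X, A: A @ X @ A.T               # affine action of Equation (9)
print(np.isclose(rao_distance(Y, Z), rao_distance(act(Y, A), act(Z, A))))
```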
The affine-invariant metric Equation (5) turns P m into a Riemannian manifold of negative sectional curvature [10,27]. As a result, P m enjoys the property of the existence and uniqueness of Riemannian medians. The Riemannian median of N points Y 1 , , Y N P m is defined to be:
\hat{Y}_N = \operatorname{argmin}_{Y \in P_m} \sum_{n=1}^{N} d(Y, Y_n) \quad (11)
where d ( Y , Y n ) is the Riemannian distance Equation (8). If Y 1 , , Y N do not belong to the same geodesic, then Y ^ N exists and is unique [28]. More generally, for any probability measure π on P m , the median of π is defined to be:
\hat{Y}_\pi = \operatorname{argmin}_{Y \in P_m} \int_{P_m} d(Y, Z) \, d\pi(Z) \quad (12)
Note that Equation (12) reduces to Equation (11) for π = (1/N) ∑_{n=1}^{N} δ_{Y_n}. If the support of π is not carried by a single geodesic, then again, Ŷ_π exists and is unique by the main result of [28].
To end this section, consider the Riemannian volume associated with the affine-invariant Riemannian metric [10]:
dv(Y) = \det(Y)^{-\frac{m+1}{2}} \prod_{i \le j} dY_{ij} \quad (13)
where the indices denote matrix elements. The Riemannian volume is used to define the integral of a function f : P m R as:
\int_{P_m} f(Y) \, dv(Y) = \int f(Y) \det(Y)^{-\frac{m+1}{2}} \prod_{i \le j} dY_{ij} \quad (14)
where the integral on the right-hand side is a multiple integral over the m(m+1)/2 variables Y_{ij} with i ≤ j. The integral Equation (14) is invariant under affine transformations. Precisely:
\int_{P_m} f(Y \cdot A) \, dv(Y) = \int_{P_m} f(Y) \, dv(Y) \quad (15)
where Y·A is the affine transformation given by Equation (9). It takes on a simplified form when f(Y) only depends on the eigenvalues of Y. Precisely, let the spectral decomposition of Y be Y = U diag(e^{r_1}, …, e^{r_m}) U^⊤, where U is an orthogonal matrix and e^{r_1}, …, e^{r_m} are the eigenvalues of Y. Assume that f(Y) = f(r_1, …, r_m); then the invariant integral Equation (14) reduces to:
\int_{P_m} f(Y) \, dv(Y) = c_m \int_{\mathbb{R}^m} f(r_1, \ldots, r_m) \prod_{i<j} \sinh\!\left( \frac{|r_i - r_j|}{2} \right) dr_1 \cdots dr_m \quad (16)
where the constant c_m is given by c_m = \frac{1}{m!} \times \omega_m \times 8^{\frac{m(m-1)}{4}}, with \omega_m = \frac{\pi^{m^2/2}}{\Gamma_m(m/2)} and Γ_m the multivariate gamma function given in [29]. See Appendix A for the derivation of Equation (16) from Equation (14).

3. Riemannian Laplace Distribution on P m

3.1. Definition of L ( Y ¯ , σ )

The Riemannian Laplace distribution on P m is defined by analogy with the well-known Laplace distribution on R . Recall the density of the Laplace distribution on R ,
p(x \,|\, \bar{x}, \sigma) = \frac{1}{2\sigma} \, e^{-|x - \bar{x}|/\sigma}
where x̄ ∈ ℝ and σ > 0. This is a density with respect to the length element dx on ℝ. The density of the Riemannian Laplace distribution on P_m will be given by:
p(Y \,|\, \bar{Y}, \sigma) = \frac{1}{\zeta_m(\sigma)} \exp\left( -\,\frac{d(Y, \bar{Y})}{\sigma} \right) \quad (17)
Here, Ȳ ∈ P_m, σ > 0, and the density is with respect to the Riemannian volume element Equation (13) on P_m. The normalizing factor ζ_m(σ) appearing in Equation (17) is given by the integral:
\int_{P_m} \exp\left( -\,\frac{d(Y, \bar{Y})}{\sigma} \right) dv(Y)
Assume for now that this integral is finite for some choice of Ȳ and σ. It is possible to show that its value does not depend on Ȳ. To do so, recall that the action of GL(m) on P_m is transitive. As a consequence, there exists A ∈ P_m such that Ȳ = I·A, where I·A is defined as in Equation (9). From Equation (10), it follows that d(Y, Ȳ) = d(Y, I·A) = d(Y·A^{-1}, I). From the invariance property Equation (15):
\int_{P_m} \exp\left( -\,\frac{d(Y, \bar{Y})}{\sigma} \right) dv(Y) = \int_{P_m} \exp\left( -\,\frac{d(Y, I)}{\sigma} \right) dv(Y)
The integral on the right does not depend on Y ¯ , which proves the above claim.
The last integral representation and formula Equation (16) lead to the following explicit expression:
\zeta_m(\sigma) = c_m \int_{\mathbb{R}^m} e^{-|r|/\sigma} \prod_{i<j} \sinh\!\left( \frac{|r_i - r_j|}{2} \right) dr_1 \cdots dr_m \quad (19)
where |r| = (r_1^2 + \cdots + r_m^2)^{1/2} and c_m is the same constant as in Equation (16) (see Appendix B for more details on the derivation of Equation (19)).
A distinctive feature of the Riemannian Laplace distribution on P m , in comparison to the Laplace distribution on R is that there exist certain values of σ for which it cannot be defined. This is because the integral Equation (19) diverges for certain values of this parameter. This leads to the following definition.
Definition 1. 
Set σ_m = sup{ σ > 0 : ζ_m(σ) < ∞ }. Then, for Ȳ ∈ P_m and σ ∈ (0, σ_m), the Riemannian Laplace distribution on P_m, denoted by L(Ȳ, σ), is defined as the probability distribution on P_m whose density with respect to dv(Y) is given by Equation (17), where ζ_m(σ) is defined by Equation (19).
The constant σ_m in this definition satisfies 0 < σ_m < ∞ for all m and takes the value √2 for m = 2 (see Appendix C for proofs).
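As an illustration, ζ_m(σ) can be evaluated numerically for m = 2 from Equation (19), using the constant c_2 implied by Equation (16). The sketch below is only indicative: the integral is approximated by a Riemann sum on a truncated domain, and the cutoff, grid size and values of σ are arbitrary choices.

```python
# Numerical sketch of zeta_2(sigma); it grows quickly as sigma approaches sigma_2 = sqrt(2).
import numpy as np
from scipy.special import gamma

def zeta_2(sigma, cutoff=30.0, n=1201):
    """Truncated Riemann-sum approximation of zeta_2(sigma) from Equation (19)."""
    # c_2 = (1/2!) * omega_2 * 8^{2(2-1)/4}, with omega_2 = pi^2 / Gamma_2(1) = pi.
    omega_2 = np.pi ** 2 / (np.sqrt(np.pi) * gamma(1.0) * gamma(0.5))
    c_2 = 0.5 * omega_2 * np.sqrt(8.0)
    r = np.linspace(-cutoff, cutoff, n)
    h = r[1] - r[0]
    r1, r2 = np.meshgrid(r, r, indexing="ij")
    integrand = np.exp(-np.hypot(r1, r2) / sigma) * np.sinh(np.abs(r1 - r2) / 2.0)
    return c_2 * integrand.sum() * h * h

for s in (0.5, 1.0, 1.3):
    print(s, zeta_2(s))
```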

3.2. Sampling from L ( Y ¯ , σ )

The current section presents a general method for sampling from the Laplace distribution L ( Y ¯ , σ ) . This method relies in part on the following transformation property.
Proposition 2. 
Let Y be a random variable in P_m. For all A ∈ GL(m),
Y \sim \mathcal{L}(\bar{Y}, \sigma) \;\Longrightarrow\; Y \cdot A \sim \mathcal{L}(\bar{Y} \cdot A, \sigma)
where Y · A is given by Equation (9).
Proof. 
Let φ : P_m → ℝ be a test function. If Y ∼ L(Ȳ, σ) and Z = Y·A, then the expectation of φ(Z) is given by:
E[\varphi(Z)] = \int_{P_m} \varphi(X \cdot A) \, p(X \,|\, \bar{Y}, \sigma) \, dv(X) = \int_{P_m} \varphi(X) \, p(X \cdot A^{-1} \,|\, \bar{Y}, \sigma) \, dv(X)
where the second equality results from Equation (15). However, p(X·A^{-1} | Ȳ, σ) = p(X | Ȳ·A, σ) by Equation (10), which proves the proposition.
 ☐
The following algorithm describes how to sample from L ( Y ¯ , σ ) where 0 < σ < σ m . For this, it is first required to sample from the density p on R m defined by:
p(r) = \frac{c_m}{\zeta_m(\sigma)} \, e^{-|r|/\sigma} \prod_{i<j} \sinh\!\left( \frac{|r_i - r_j|}{2} \right), \qquad r = (r_1, \ldots, r_m)
This can be done by a usual Metropolis algorithm [30].
It is also required to sample from the uniform distribution on O ( m ) , the group of real orthogonal m × m matrices. This can be done by generating A, an m × m matrix, whose entries are i.i.d. with normal distribution N ( 0 , 1 ) , then the orthogonal matrix U, in the decomposition A = U T with T upper triangular, is uniformly distributed on O ( m ) [29] (p. 70). Sampling from L ( Y ¯ , σ ) can now be described as follows.
Algorithm 1 Sampling from L(Ȳ, σ).
1: Generate a sample r = (r_1, …, r_m) ∈ ℝ^m with density p
2: Generate U from the uniform distribution on O(m)
3: X ← U diag(e^{r_1}, …, e^{r_m}) U^⊤
4: Y ← X · Ȳ^{1/2}
Note that the law of X in Step 3 is L(I, σ); the proof of this fact is given in Appendix D. Finally, the law of Y in Step 4 is L(I·Ȳ^{1/2}, σ) = L(Ȳ, σ) by Proposition 2.
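Below is a sketch of Algorithm 1 in NumPy. The random-walk Metropolis step used for the density p(r) is one simple choice among many, and the step size and chain length are arbitrary; the sign correction after the QR decomposition enforces the uniform (Haar) distribution on O(m).

```python
# Sketch of Algorithm 1: sampling from the Riemannian Laplace distribution L(Ybar, sigma).
import numpy as np

rng = np.random.default_rng(0)

def log_p_r(r, sigma):
    """Unnormalized log-density of the eigenvalue vector r (Section 3.2)."""
    diffs = np.abs(r[:, None] - r[None, :])[np.triu_indices(len(r), k=1)]
    return -np.linalg.norm(r) / sigma + np.sum(np.log(np.sinh(diffs / 2.0)))

def sample_r(m, sigma, n_steps=5000, step=0.5):
    """Random-walk Metropolis sampler targeting p(r) on R^m (Step 1)."""
    r = rng.normal(size=m)
    for _ in range(n_steps):
        prop = r + step * rng.normal(size=m)
        if np.log(rng.uniform()) < log_p_r(prop, sigma) - log_p_r(r, sigma):
            r = prop
    return r

def sample_laplace(Ybar, sigma):
    """One draw from L(Ybar, sigma), following Steps 1-4 of Algorithm 1."""
    m = Ybar.shape[0]
    r = sample_r(m, sigma)                      # Step 1
    A = rng.normal(size=(m, m))
    U, R = np.linalg.qr(A)
    U = U * np.sign(np.diag(R))                 # Step 2: Haar-uniform U on O(m)
    X = (U * np.exp(r)) @ U.T                   # Step 3: U diag(e^r) U^T
    w, V = np.linalg.eigh(Ybar)                 # Step 4: Y = Ybar^{1/2} X Ybar^{1/2}
    Yh = (V * np.sqrt(w)) @ V.T
    return Yh @ X @ Yh
```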

3.3. Estimation of Y ¯ and σ

The current section considers maximum likelihood estimation of the parameters Y ¯ and σ, based on independent observations Y 1 , , Y N from the Riemannian Laplace distribution L ( Y ¯ , σ ) . The main results are contained in Propositions 3 and 5 below.
Proposition 3 states the existence and uniqueness of the maximum likelihood estimates Y ^ N and σ ^ N of Y ¯ and σ. In particular, the maximum likelihood estimate Y ^ N of Y ¯ is the Riemannian median of Y 1 , , Y N , defined by Equation (11). Numerical computation of Y ^ N will be considered and carried out using a Riemannian sub-gradient descent algorithm [8].
Proposition 5 states the convergence of the maximum likelihood estimate Y ^ N to the true value of the parameter Y ¯ . It is based on Lemma 4, which states that the parameter Y ¯ is the Riemannian median of the distribution L ( Y ¯ , σ ) in the sense of definition Equation (12).
Proposition 3 
(MLE and median). The maximum likelihood estimate of the parameter Y ¯ is the Riemannian median Y ^ N of Y 1 , , Y N . Moreover, the maximum likelihood estimate of the parameter σ is the solution σ ^ N of:
\sigma^2 \times \frac{d}{d\sigma} \log \zeta_m(\sigma) = \frac{1}{N} \sum_{n=1}^{N} d(\hat{Y}_N, Y_n) \quad (20)
Both Y ^ N and σ ^ N exist and are unique for any realization of the samples Y 1 , , Y N .
 
Proof of Proposition 3. The log-likelihood function of the parameters Ȳ and σ can be written as:
\sum_{n=1}^{N} \log p(Y_n \,|\, \bar{Y}, \sigma) = \sum_{n=1}^{N} \log\!\left( \frac{1}{\zeta_m(\sigma)} \, e^{-d(\bar{Y}, Y_n)/\sigma} \right) = -N \log \zeta_m(\sigma) - \frac{1}{\sigma} \sum_{n=1}^{N} d(\bar{Y}, Y_n)
As the first term in the last expression does not contain Y ¯ ,
\operatorname{argmax}_{\bar{Y}} \sum_{n=1}^{N} \log p(Y_n \,|\, \bar{Y}, \sigma) = \operatorname{argmin}_{\bar{Y}} \sum_{n=1}^{N} d(\bar{Y}, Y_n)
The quantity on the right is exactly Y ^ N by Equation (11). This proves the first claim. Now, consider the function:
F(\eta) = -N \log \zeta_m(-1/\eta) + \eta \sum_{n=1}^{N} d(\hat{Y}_N, Y_n), \qquad \eta < -\frac{1}{\sigma_m}
This function is strictly concave, since it is the logarithm of the moment generating function of a positive measure. Note that lim_{η → −1/σ_m} F(η) = −∞, and admit for a moment that lim_{η → −∞} F(η) = −∞. By the strict concavity of F, there exists a unique η̂_N < −1/σ_m (which is the maximum of F), such that F′(η̂_N) = 0. It follows that σ̂_N = −1/η̂_N lies in (0, σ_m) and satisfies Equation (20). The uniqueness of σ̂_N is a consequence of the uniqueness of η̂_N. Thus, the proof is complete. Now, it remains to check that lim_{η → −∞} F(η) = −∞, or equivalently that lim_{σ → +∞} (1/σ) log ζ_m(1/σ) = 0. Clearly:
\prod_{i<j} \sinh\!\left( \frac{|r_i - r_j|}{2} \right) \le A_m \, e^{B_m |r|}
where A m and B m are two constants only depending on m. Using this, it follows that:
\frac{1}{\sigma} \log \zeta_m(1/\sigma) \le \frac{1}{\sigma} \log (c_m A_m) + \frac{1}{\sigma} \log \int_{\mathbb{R}^m} \exp\big( (-\sigma + B_m) \, |r| \big) \, dr_1 \cdots dr_m \quad (21)
However, for some constant C m only depending on m,
\int_{\mathbb{R}^m} \exp\big( (-\sigma + B_m) \, |r| \big) \, dr_1 \cdots dr_m = C_m \int_0^{\infty} \exp\big( (-\sigma + B_m) u \big) \, u^{m-1} \, du
\le (m-1)! \; C_m \int_0^{\infty} \exp\big( (-\sigma + B_m + 1) u \big) \, du = \frac{(m-1)! \; C_m}{\sigma - B_m - 1}
Combining this bound and Equation (21) yields lim_{σ → +∞} (1/σ) log ζ_m(1/σ) = 0. ☐
Remark 1. 
Replacing F in the previous proof with F(η) = −log ζ_m(−1/η) + ηc, where c > 0, shows that the equation:
\sigma^2 \times \frac{d}{d\sigma} \log \zeta_m(\sigma) = c
has a unique solution σ ∈ (0, σ_m). This shows in particular that σ ↦ σ² × (d/dσ) log ζ_m(σ) is a bijection from (0, σ_m) to (0, ∞).
Consider now the numerical computation of the maximum likelihood estimates Y ^ N and σ ^ N given by Proposition 3. Computation of Y ^ N consists in finding the Riemannian median of Y 1 , , Y N , defined by Equation (11). This can be done using the Riemannian sub-gradient descent algorithm of [8]. The k-th iteration of this algorithm produces an approximation Y ^ N k of Y ^ N in the following way.
For k = 1, 2, …, let Δ_k be the symmetric matrix:
\Delta_k = \frac{1}{N} \sum_{n=1}^{N} \frac{ \mathrm{Log}_{\hat{Y}_N^{k-1}}(Y_n) }{ \big\| \mathrm{Log}_{\hat{Y}_N^{k-1}}(Y_n) \big\| } \quad (22)
Here, Log is the Riemannian logarithm mapping, the inverse of the Riemannian exponential mapping:
\mathrm{Exp}_Y(\Delta) = Y^{1/2} \exp\left( Y^{-1/2} \Delta \, Y^{-1/2} \right) Y^{1/2}
and ‖Log_a(b)‖ = g_a(\mathrm{Log}_a(b), \mathrm{Log}_a(b))^{1/2}. Then, Ŷ_N^k is defined to be:
\hat{Y}_N^{k} = \mathrm{Exp}_{\hat{Y}_N^{k-1}}\!\left( \tau_k \Delta_k \right)
where τ k > 0 is a step size, which can be determined using a backtracking procedure.
Computation of σ ^ N requires solving a non-linear equation in one variable. This is readily done using Newton’s method.
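A minimal sketch of this computation is given below, reusing spd_power and rao_distance from the sketch in Section 2. The fixed step size used here replaces the backtracking procedure mentioned above, which is a simplification; σ̂_N can then be obtained by applying any one-dimensional root finder to Equation (20), with a numerical evaluation of ζ_m.

```python
# Sketch of the Riemannian (weighted) median by sub-gradient descent.
import numpy as np

def riemannian_log(Y, Z):
    """Log_Y(Z) = Y^{1/2} log(Y^{-1/2} Z Y^{-1/2}) Y^{1/2}, tangent vector at Y."""
    Yh, Yih = spd_power(Y, 0.5), spd_power(Y, -0.5)
    w, V = np.linalg.eigh(Yih @ Z @ Yih)
    return Yh @ ((V * np.log(w)) @ V.T) @ Yh

def riemannian_exp(Y, D):
    """Exp_Y(D) = Y^{1/2} exp(Y^{-1/2} D Y^{-1/2}) Y^{1/2}."""
    Yh, Yih = spd_power(Y, 0.5), spd_power(Y, -0.5)
    w, V = np.linalg.eigh(Yih @ D @ Yih)
    return Yh @ ((V * np.exp(w)) @ V.T) @ Yh

def riemannian_median(samples, n_iter=100, tau=0.1, weights=None):
    """Sub-gradient descent for the (weighted) Riemannian median, fixed step tau."""
    N = len(samples)
    w = np.full(N, 1.0 / N) if weights is None else np.asarray(weights)
    Y = samples[0]
    for _ in range(n_iter):
        Delta = np.zeros_like(Y)
        for wn, Yn in zip(w, samples):
            d = rao_distance(Y, Yn)
            if d > 1e-12:                      # skip terms where Y coincides with Yn
                Delta += wn * riemannian_log(Y, Yn) / d
        Y = riemannian_exp(Y, tau * Delta)
    return Y
```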
It is shown now that the empirical Riemannian median Ŷ_N converges almost surely to the true median Ȳ. This means that Ŷ_N is a consistent estimator of Ȳ. The proof of this fact requires a few notations and a preparatory lemma.
For Ȳ ∈ P_m and σ ∈ (0, σ_m), let:
\mathcal{E}(Y \,|\, \bar{Y}, \sigma) = \int_{P_m} d(Y, Z) \, p(Z \,|\, \bar{Y}, \sigma) \, dv(Z)
The following lemma shows how to find Ȳ and σ from the function Y ↦ E(Y | Ȳ, σ).
Lemma 4. 
For any Ȳ ∈ P_m and σ ∈ (0, σ_m), the following properties hold
(i) 
Y ¯ is given by:
\bar{Y} = \operatorname{argmin}_{Y} \; \mathcal{E}(Y \,|\, \bar{Y}, \sigma)
That is, Y ¯ is the Riemannian median of L ( Y ¯ , σ ) .
(ii) 
σ is given by:
\sigma = \Phi\big( \mathcal{E}(\bar{Y} \,|\, \bar{Y}, \sigma) \big)
where Φ is the inverse of the function σ ↦ σ² × d log ζ_m(σ)/dσ.
Proof of Lemma 4. 
(i) Let E(Y) = E(Y | Ȳ, σ). According to Theorem 2.1 in [28], this function has a unique global minimum, which is also its unique stationary point. Thus, to prove that Ȳ is the minimum point of E, it suffices to check that, for any geodesic γ starting from Ȳ, (d/dt)|_{t=0} E(γ(t)) = 0 [31] (p. 76). Note that:
\frac{d}{dt}\Big|_{t=0} \mathcal{E}(\gamma(t)) = \int_{P_m} \frac{d}{dt}\Big|_{t=0} d(\gamma(t), Z) \; p(Z \,|\, \bar{Y}, \sigma) \, dv(Z) \quad (26)
where, for all Z ≠ Ȳ [32]:
\frac{d}{dt}\Big|_{t=0} d(\gamma(t), Z) = -\, g_{\bar{Y}}\big( \mathrm{Log}_{\bar{Y}}(Z), \dot{\gamma}(0) \big) \; d(\bar{Y}, Z)^{-1}
The integral in Equation (26) is, up to a constant factor, equal to:
\frac{d}{dt}\Big|_{t=0} \int_{P_m} p(Z \,|\, \gamma(t), \sigma) \, dv(Z) = 0
since ∫_{P_m} p(Z | γ(t), σ) dv(Z) = 1 for all t.
(ii) Differentiating ∫_{P_m} exp(−d(Z, Ȳ)/σ) dv(Z) = ζ_m(σ) with respect to σ, it follows that:
\sigma^2 \times \frac{d \log \zeta_m(\sigma)}{d\sigma} = \sigma^2 \, \frac{\zeta_m'(\sigma)}{\zeta_m(\sigma)} = \int_{P_m} d(Z, \bar{Y}) \, p(Z \,|\, \bar{Y}, \sigma) \, dv(Z) = \mathcal{E}(\bar{Y} \,|\, \bar{Y}, \sigma)
which proves (ii). ☐
Proposition 5 
(Consistency of Ŷ_N). Let Y_1, Y_2, … be independent samples from the Laplace distribution L(Ȳ, σ). The empirical median Ŷ_N of Y_1, …, Y_N converges almost surely to Ȳ as N → ∞.
Proof of Proposition 5. 
Corollary 3.5 in [33] (p. 49) states that if (Y_n) is a sequence of i.i.d. random variables on P_m with law π, then the Riemannian median Ŷ_N of Y_1, …, Y_N converges almost surely, as N → ∞, to Ŷ_π, the Riemannian median of π defined by Equation (12). Applying this result to π = L(Ȳ, σ) and using Ŷ_π = Ȳ, which follows from item (i) of Lemma 4, shows that Ŷ_N converges almost surely to Ȳ. ☐

4. Mixtures of Laplace Distributions

There are several motivations for considering mixtures of distributions. The most natural one is to view a dataset as being made up of several subpopulations. Another is that mixtures of distributions provide good approximations to a very wide class of distributions, in a spirit similar to wavelets.
The present section introduces the class of probability distributions that are finite mixtures of Riemannian Laplace distributions on P m . These constitute the main theoretical tool, to be used for the target application of the present paper, namely the problem of texture classification in computer vision, which will be treated in Section 5.
A mixture of Riemannian Laplace distributions is a probability distribution on P m , whose density with respect to the Riemannian volume element Equation (13) has the following expression:
p(Y \,|\, (\varpi_\mu, \bar{Y}_\mu, \sigma_\mu)_{1 \le \mu \le M}) = \sum_{\mu=1}^{M} \varpi_\mu \times p(Y \,|\, \bar{Y}_\mu, \sigma_\mu) \quad (27)
where the ϖ_μ are nonzero weights whose sum is equal to one, Ȳ_μ ∈ P_m and σ_μ ∈ (0, σ_m) for all 1 ≤ μ ≤ M, and the parameter M is called the number of mixture components.
Section 4.1 describes a new EM algorithm, which computes the maximum likelihood estimates of the mixture parameters ( ϖ μ , Y ¯ μ , σ μ ) 1 μ M , based on independent observations Y 1 , , Y N from the mixture distribution Equation (27).
Section 4.2 considers the problem of order selection for mixtures of Riemannian Laplace distributions. Precisely, this consists of finding the number M of mixture components in Equation (27) that realizes the best representation of a given set of data Y 1 , , Y N . This problem is solved by computing the BIC criterion, which is here found in explicit form for the case of mixtures of Riemannian Laplace distributions on P m .

4.1. Estimation of the Mixture Parameters

In this section, Y 1 , , Y N are i.i.d. samples from Equation (27). Based on these observations, an EM algorithm is proposed to estimate ( ϖ μ , Y ¯ μ , σ μ ) 1 μ M . The derivation of this algorithm can be carried out similarly to [15].
To explain how this algorithm works, define for all ϑ = { ( ϖ μ , Y ¯ μ , σ μ ) } ,
\omega_\mu(Y_n, \vartheta) = \frac{ \varpi_\mu \times p(Y_n \,|\, \bar{Y}_\mu, \sigma_\mu) }{ \sum_{s=1}^{M} \varpi_s \times p(Y_n \,|\, \bar{Y}_s, \sigma_s) }, \qquad N_\mu(\vartheta) = \sum_{n=1}^{N} \omega_\mu(Y_n, \vartheta) \quad (28)
The algorithm iteratively updates ϑ ^ = { ( ϖ ^ μ , Y ^ μ , σ ^ μ ) } , which is an approximation of the maximum likelihood estimate of the mixture parameters ϑ = ( ϖ μ , Y ¯ μ , σ μ ) as follows.
  • Update for ϖ ^ μ : Based on the current value of ϑ ^ , assign to ϖ ^ μ the new value ϖ ^ μ = N μ ( ϑ ^ ) / N .
  • Update for Y ^ μ : Based on the current value of ϑ ^ , assign to Y ^ μ the value:
\hat{Y}_\mu = \operatorname{argmin}_{Y} \sum_{n=1}^{N} \omega_\mu(Y_n, \hat{\vartheta}) \, d(Y, Y_n) \quad (29)
  • Update for σ ^ μ : Based on the current value of ϑ ^ , assign to σ ^ μ the new value:
\hat{\sigma}_\mu = \Phi\left( N_\mu^{-1}(\hat{\vartheta}) \times \sum_{n=1}^{N} \omega_\mu(Y_n, \hat{\vartheta}) \, d(\hat{Y}_\mu, Y_n) \right) \quad (30)
where the function Φ is defined in Lemma 4.
These three update rules should be performed in the above order. Realization of the update rules for ϖ̂_μ and σ̂_μ is straightforward. The update rule for Ŷ_μ is realized using a slight modification of the sub-gradient descent algorithm described in Section 3.3: the factor 1/N appearing in Equation (22) is simply replaced by the weight ω_μ(Y_n, ϑ̂) of each term, at each iteration.
In practice, the initial conditions ( ϖ ^ μ 0 , Y ^ μ 0 , σ ^ μ 0 ) in this algorithm were chosen in the following way. The weights ( ϖ μ 0 ) are uniform and equal to 1 / M ; ( Y ^ μ 0 ) are M different observations from the set { Y 1 , . . , Y N } chosen randomly; and ( σ ^ μ 0 ) is computed from ( ϖ μ 0 ) and ( Y ^ μ 0 ) according to the rule Equation (30). Since the convergence of the algorithm depends on the initial conditions, the EM algorithm is run several times, and the best result is retained, i.e., the one maximizing the log-likelihood function.
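The sketch below illustrates one EM sweep of the three update rules, reusing rao_distance and the weighted riemannian_median from the earlier sketches. The helpers log_zeta (for log ζ_m(σ)) and Phi (the inverse function of Lemma 4) are assumed to be available and are not implemented here.

```python
# Sketch of one EM update (Equations (28)-(30)) for a mixture of Riemannian Laplace distributions.
import numpy as np

def em_step(samples, varpi, Ybar, sigma, log_zeta, Phi):
    """One update of (varpi_mu, Ybar_mu, sigma_mu); inputs are the current estimates."""
    N, M = len(samples), len(varpi)
    # E-step: responsibilities omega_mu(Y_n) of Equation (28), computed in log scale.
    log_w = np.empty((N, M))
    for n, Y in enumerate(samples):
        for mu in range(M):
            log_w[n, mu] = (np.log(varpi[mu]) - log_zeta(sigma[mu])
                            - rao_distance(Y, Ybar[mu]) / sigma[mu])
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: weights, weighted medians and dispersions.
    N_mu = w.sum(axis=0)
    varpi = N_mu / N
    Ybar = [riemannian_median(samples, weights=w[:, mu]) for mu in range(M)]
    sigma = [Phi(np.dot(w[:, mu], [rao_distance(Y, Ybar[mu]) for Y in samples]) / N_mu[mu])
             for mu in range(M)]                                   # Equation (30)
    return varpi, Ybar, sigma
```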

4.2. The Bayesian Information Criterion

The BIC was introduced by Schwarz to find the appropriate dimension of a model that will fit a given set of observations [16]. Since then, BIC has been used in many Bayesian modeling problems where priors are hard to set precisely. In large sample settings, the fitted model favored by BIC ideally corresponds to the candidate model that is a posteriori most probable; i.e., the model that is rendered most plausible by the data at hand. One of the main features of the BIC is its easy computation, since it is only based on the empirical log-likelihood function.
Given a set of observations { Y 1 , , Y N } arising from Equation (27) where M is unknown, the BIC consists of choosing the parameter:
\bar{M} = \operatorname{argmax}_M \; \mathrm{BIC}(M)
where:
\mathrm{BIC}(M) = LL - \frac{1}{2} \times DF \times \log(N)
Here, L L is the log-likelihood given by:
LL = \sum_{n=1}^{N} \log \sum_{k=1}^{M} \hat{\varpi}_k \, p(Y_n \,|\, \hat{Y}_k, \hat{\sigma}_k) \quad (32)
and D F is the number of degrees of freedom of the statistical model:
DF = M \times \frac{m(m+1)}{2} + M + (M - 1) \quad (33)
In Formula Equation (32), (ϖ̂_k, Ŷ_k, σ̂_k)_{1≤k≤M} are obtained from an EM algorithm as stated in Section 4.1, assuming the exact dimension is M. Finally, note that in Formula Equation (33), M × m(m+1)/2 (respectively, M and M − 1) corresponds to the number of degrees of freedom associated with (Ŷ_k)_{1≤k≤M} (respectively, (σ̂_k)_{1≤k≤M} and (ϖ̂_k)_{1≤k≤M}).
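A short sketch of the computation of the BIC quantities above (Equations (32) and (33)) is given below, under the same assumptions as the EM sketch: a log_zeta helper and the fitted parameters returned by EM.

```python
# Sketch of the BIC for a fitted mixture of Riemannian Laplace distributions on P_m.
import numpy as np

def bic(samples, varpi, Ybar, sigma, log_zeta, m):
    M, N = len(varpi), len(samples)
    LL = 0.0
    for Y in samples:
        terms = [np.log(varpi[k]) - log_zeta(sigma[k]) - rao_distance(Y, Ybar[k]) / sigma[k]
                 for k in range(M)]
        LL += np.log(np.sum(np.exp(terms)))   # log-likelihood term of Equation (32)
    DF = M * m * (m + 1) // 2 + M + (M - 1)   # degrees of freedom, Equation (33)
    return LL - 0.5 * DF * np.log(N)
```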

5. Application to Classification of Data on P m

Recently, several approaches have used the Riemannian distance as the main ingredient in image or signal classification problems [2,15,34]. It turns out that the use of this distance leads to more accurate results (in comparison, for example, with the Euclidean distance). This section proposes an application that follows a similar approach but, in addition to the Riemannian distance, also relies on a statistical model. It considers the application of the Riemannian Laplace distribution (RLD) to the classification of data in P_m and gives an original Laplace classification rule, which can be used to carry out the classification task even in the presence of outliers. It also applies this classification rule to the problem of texture classification in computer vision, showing that it leads to improved results in comparison with the recent literature.
Section 5.1 considers, from the point of view of statistical learning, the classification of data with values in P m . Given data points Y 1 , , Y N P m , this proceeds in two steps, called the learning phase and the classification phase, respectively. The learning phase uncovers the class structure of the data, by estimating a mixture model using the EM algorithm developed in Section 4.1. Once training is accomplished, data points are subdivided into disjoint classes. Classification consists of associating each new data point to the most suitable class. For this, a new classification rule will be established and shown to be optimal.
Section 5.2 is the implementation of the Laplace classification rule together with the BIC criterion to texture classification in computer vision. It highlights the advantage of the Laplace distribution in the presence of outliers and shows its better performance compared to recent approaches.

5.1. Classification Using Mixtures of Laplace Distributions

Assume a set of training data Y_1, …, Y_N is given. It is now modeled as a realization of a mixture of Laplace distributions:
p(Y) = \sum_{\mu=1}^{M} \varpi_\mu \times p(Y \,|\, \bar{Y}_\mu, \sigma_\mu) \quad (34)
In this section, the order M in Equation (34) is considered as known. The training phase consists of learning the structure of these data as a family of M disjoint classes C_μ, μ = 1, …, M. To be more precise, depending on the family (ϖ_μ), some of these classes may be empty. Training is done by applying the EM algorithm described in Section 4.1. As a result, each class C_μ is represented by a triple (ϖ̂_μ, Ŷ_μ, σ̂_μ) corresponding to maximum likelihood estimates of (ϖ_μ, Ȳ_μ, σ_μ). Each observation Y_n is then associated with the class C_{μ*} where μ* = argmax_μ ω_μ(Y_n, ϑ̂) (recall the definition from Equation (28)). In this way, {Y_1, …, Y_N} is subdivided into M disjoint classes.
The classification phase requires a classification rule. Following [15], the optimal rule (in the sense of a Bayesian risk criterion given in [35]) consists of associating any new data Y t to the class C μ * where:
\mu^* = \operatorname{argmax}_\mu \; \hat{N}_\mu \times p(Y_t \,|\, \hat{Y}_\mu, \hat{\sigma}_\mu) \quad (35)
Here, N ^ μ is the number of elements in C μ . Replacing N ^ μ with N × ϖ ^ μ , Equation (35) becomes argmax μ ϖ ^ μ × p ( Y t | Y ^ μ , σ ^ μ ) . Note that when the weights ϖ μ in Equation (34) are assumed to be equal, this rule reduces to a maximum likelihood classification rule max μ p ( Y t | Y ^ μ , σ ^ μ ) . A quick look at the expression Equation (17) shows that Equation (35) can also be expressed as:
\mu^* = \operatorname{argmin}_\mu \left\{ -\log \hat{\varpi}_\mu + \log \zeta_m(\hat{\sigma}_\mu) + \frac{ d(Y_t, \hat{Y}_\mu) }{ \hat{\sigma}_\mu } \right\} \quad (36)
The rule Equation (36) will be called the Laplace classification rule. It favors clusters C_μ having a larger number of data points (through the term −log ϖ̂_μ) or a smaller dispersion away from the median (through the term log ζ_m(σ̂_μ)). When choosing between two clusters with the same number of points and the same dispersion, this rule favors the one whose median is closer to Y_t. If the number of data points inside clusters and the respective dispersions are neglected, then Equation (36) reduces to the nearest-neighbor rule involving only the Riemannian distance, introduced in [2].
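A direct transcription of this rule reads as follows; as before, log_zeta stands for an assumed helper returning log ζ_m(σ), and the fitted parameters come from the EM sketch.

```python
# Sketch of the Laplace classification rule of Equation (36).
import numpy as np

def classify(Yt, varpi, Ybar, sigma, log_zeta):
    """Return the class index mu* minimizing -log(varpi) + log(zeta) + d(Yt, Ybar)/sigma."""
    scores = [-np.log(varpi[mu]) + log_zeta(sigma[mu]) + rao_distance(Yt, Ybar[mu]) / sigma[mu]
              for mu in range(len(varpi))]
    return int(np.argmin(scores))
```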
The analogous rules of Equation (36) for the Riemannian Gaussian distribution (RGD) [15] and the Wishart distribution (WD) [17] on P m can be established by replacing p ( Y t | Y ^ μ , σ ^ μ ) in Equation (35) with the RGD and the WD and then following the same reasoning as before. Recall that a WD depends on an expectation Σ P m and a number of degrees of freedom n [29]. For the WD, Equation (36) becomes:
\mu^* = \operatorname{argmin}_\mu \left\{ -2 \log \hat{\varpi}(\mu) - \hat{n}(\mu) \Big( \log \det\!\big( \hat{\Sigma}^{-1}(\mu) \, Y_t \big) - \operatorname{tr}\!\big( \hat{\Sigma}^{-1}(\mu) \, Y_t \big) \Big) \right\}
Here, ϖ ^ ( μ ) , Σ ^ ( μ ) and n ^ ( μ ) denote maximum likelihood estimates of the true parameters ϖ ( μ ) , Σ ( μ ) and n ( μ ) , which define the mixture model (these estimates can be computed as in [36,37]).

5.2. Application to Texture Classification

This section presents an application of the mixture of Laplace distributions to the context of texture classification on the MIT Vision Texture (VisTex) database [38]. The purpose of this experiment is to classify the textures, by taking into consideration the within-class diversity. In addition, the influence of outliers on the classification performances is analyzed. The obtained results for the RLD are compared to those given by the RGD [15] and the WD [17].
The VisTex database contains 40 images, considered as being 40 different texture classes. The database used for the experiment is obtained after several steps. First of all, each texture is decomposed into 169 patches of 128 × 128 pixels, with an overlap of 32 pixels, giving a total number of 6760 textured patches. Next, some patches are corrupted, in order to introduce abnormal data into the dataset. To this end, their intensity is modified by applying a luminosity gradient. For each class, between zero and 60 patches are modified in order to become outliers. An example of a VisTex texture with one of its patches and an outlier patch are shown in Figure 1.
Once the database is built, it is randomly and equally split 15 times into training and testing sets, which are then used in the supervised classification algorithm. Then, for each patch in both sets, a feature vector is computed. The luminance channel is first extracted and normalized in intensity. The grayscale patches are filtered using the stationary wavelet transform with the Daubechies db4 filter (see [39]), with two scales and three orientations. To model the wavelet sub-bands, various stochastic models have been proposed in the literature. Among them, the univariate generalized Gaussian distribution has been found to accurately model the empirical histogram of wavelet sub-bands [40]. More recently, it has been proposed to model the spatial dependency of wavelet coefficients: the wavelet coefficients located in a p × q spatial neighborhood of the current position are clustered into a random vector, whose realizations can be modeled by elliptical distributions [41,42], copula-based models [43,44], etc. In this paper, the wavelet coefficients are considered as realizations of zero-mean multivariate Gaussian distributions. For this experiment, the spatial information is captured by using a vertical (2 × 1) and a horizontal (1 × 2) neighborhood. Next, the 2 × 2 sample covariance matrices are estimated for each wavelet sub-band and each neighborhood. Finally, each patch is represented by a set of F = 12 covariance matrices (2 scales × 3 orientations × 2 neighborhoods), denoted Y = [Y^1, …, Y^F].
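The following is a minimal sketch of this feature-extraction pipeline, assuming the PyWavelets package (pywt) for the stationary wavelet transform. The library choice, the normalization to zero mean and unit variance, and the wrap-around pairing of neighboring coefficients are illustrative assumptions, not the authors' implementation.

```python
# Sketch: F = 12 covariance descriptors (2 scales x 3 orientations x 2 neighborhoods) per patch.
import numpy as np
import pywt

def patch_covariances(patch):
    """Return twelve 2x2 empirical covariance matrices for a 128x128 grayscale patch."""
    patch = (patch - patch.mean()) / (patch.std() + 1e-12)     # intensity normalization
    covs = []
    for _, details in pywt.swt2(patch, "db4", level=2):        # 2 scales
        for band in details:                                   # 3 orientations (H, V, D)
            for shift_axis in (0, 1):                          # 2x1 and 1x2 neighborhoods
                pairs = np.stack([band.ravel(),
                                  np.roll(band, -1, axis=shift_axis).ravel()])
                covs.append(np.cov(pairs))                     # 2x2 sample covariance in P_2
    return covs
```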
The estimated covariance matrices are elements of P m , with m = 2 , and therefore, they can be modeled by Riemannian Laplace distributions. More precisely, in order to take into consideration the within-class diversity, each class in the training set is viewed as a realization of a mixture of Riemannian Laplace distributions (Equation (27)) with M mixture components, characterized by ( ϖ μ , Y ¯ μ , f , σ μ , f ) , having Y ¯ μ , f P 2 , with μ = 1 , , M and f = 1 , , F . Since the sub-bands are assumed to be independent, the probability density function is given by:
p(Y \,|\, (\varpi_\mu, \bar{Y}_{\mu,f}, \sigma_{\mu,f})_{1 \le \mu \le M, \, 1 \le f \le F}) = \sum_{\mu=1}^{M} \varpi_\mu \prod_{f=1}^{F} p(Y^f \,|\, \bar{Y}_{\mu,f}, \sigma_{\mu,f}) \quad (37)
The learning step of the classification is performed using the EM algorithm presented in Section 4, and the number of mixture components is determined using the BIC criterion recalled in Equation (31). Note that for the considered model given in Equation (37), the number of degrees of freedom is expressed as:
DF = M - 1 + M \times F \times \left( \frac{m(m+1)}{2} + 1 \right)
since one centroid and one dispersion parameter should be estimated per feature and per component of the mixture model. In practice, the number of mixture components M varies between two and five, and the M yielding the highest BIC is retained. As mentioned earlier, the EM algorithm is sensitive to the initial conditions. In order to minimize this influence, the EM algorithm is repeated 10 times for this experiment, and the result maximizing the log-likelihood function is retained. Finally, the classification is performed by assigning each element Y_t = [Y_t^1, …, Y_t^F] of the testing set to the class of the closest cluster μ*, given by:
\mu^* = \operatorname{argmin}_\mu \left\{ -\log \hat{\varpi}_\mu + \sum_{f=1}^{F} \log \zeta_m(\hat{\sigma}_{\mu,f}) + \sum_{f=1}^{F} \frac{ d(Y_t^f, \hat{Y}_{\mu,f}) }{ \hat{\sigma}_{\mu,f} } \right\}
This expression is obtained starting from Equations (36) and (37), knowing that F features are extracted for each patch.
The classification results of the proposed model (solid red line), expressed in terms of overall accuracy, shown in Figure 2, are compared to those given by a fixed number of mixture components (that is, for M = 3 , dashed red line) and with those given when the within-class diversity is not considered (that is, for M = 1 , dotted red line). In addition, the classification performances given by the RGD model (displayed in black) proposed in [15] and the WD model (displayed in blue) proposed in [17] are also considered. For each of these models, the number of mixture components is first computed using the BIC, and next, it is fixed to M = 3 and M = 1 . For all of the considered models, the classification rate is given as a function of the number of outliers, which varies between zero and 60 for each class.
It is shown that, as the number of outliers increases, the RLD gives progressively better results than the RGD and the WD. The results are improved by using the BIC criterion for choosing the suitable number of clusters. In conclusion, the mixture of RLDs combined with the BIC criterion to estimate the best number of mixture components can minimize the influence of abnormal samples present in the dataset, illustrating the relevance of the proposed method.

6. Conclusions

Motivated by the problem of outliers in statistical data, this paper introduces a new distribution on the space P_m of m × m symmetric positive definite matrices, called the Riemannian Laplace distribution. Denoted throughout the paper by L(Ȳ, σ), where Ȳ ∈ P_m and σ > 0 are the indexing parameters, this distribution may be thought of as specifying the law of a family of observations on P_m concentrated around the location Ȳ and having dispersion σ. If d denotes Rao's distance on P_m and dv(Y) its associated volume form, the density of L(Ȳ, σ) with respect to dv(Y) is proportional to exp(−d(Y, Ȳ)/σ). Interestingly, the normalizing constant depends only on σ (and not on Ȳ). This allows exact expressions to be deduced for the maximum likelihood estimates of Ȳ and σ, relying on the Riemannian median on P_m. These estimates are also computed numerically by means of sub-gradient algorithms. The estimation of parameters in mixture models of Laplace distributions is also considered and performed using a new expectation-maximization algorithm. Finally, the main theoretical results are illustrated by an application to texture classification. The proposed experiment consists of introducing abnormal data (outliers) into a set of images from the VisTex database and analyzing their influence on the classification performance. Each image is characterized by a set of 2 × 2 covariance matrices modeled as mixtures of Riemannian Laplace distributions in the space P_2. The number of mixture components is estimated using the BIC criterion. The obtained results are compared to those given by the Riemannian Gaussian distribution, showing the better performance of the proposed method.

Acknowledgments

This study has been carried out with financial support from the French State, managed by the French National Research Agency (ANR) in the frame of the “Investments for the future” Programme initiative d’excellence (IdEX) Bordeaux-CPU (ANR-10-IDEX-03-02).

Author Contributions

Hatem Hajri and Salem Said carried out the mathematical development and specified the algorithms. Ioana Ilea and Lionel Bombrun conceived and designed the experiments. Yannick Berthoumieu gave the central idea of the paper and managed the main tasks and experiments. Hatem Hajri wrote the paper. All the authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix: Proofs of Some Technical Points

The subsections below provide proofs (using the same notations) of certain points in the paper.

A. Derivation of Equation (16) from Equation (14)

For U ∈ O(m) and r = (r_1, …, r_m) ∈ ℝ^m, let Y(r, U) = U diag(e^{r_1}, …, e^{r_m}) U^⊤. On O(m), consider the exterior product det(θ) = ⋀_{i<j} θ_{ij}, where θ_{ij} = ∑_k U_{jk} dU_{ik}.
Proposition 6. 
For all test functions f : P_m → ℝ,
\int_{P_m} f(Y) \, dv(Y) = (m! \, 2^m)^{-1} \times 8^{\frac{m(m-1)}{4}} \int_{O(m)} \int_{\mathbb{R}^m} f(Y(r, U)) \, \det(\theta) \prod_{i<j} \sinh\!\left( \frac{|r_i - r_j|}{2} \right) \prod_{i=1}^{m} dr_i
This proposition allows one to deduce Equation (16) from Equation (14), since ∫_{O(m)} det(θ) = 2^m π^{m²/2} / Γ_m(m/2) (see [29], p. 70).
Sketch of the proof of Proposition 6. 
In a differential form, the Rao–Fisher metric on P m is:
ds^2(Y) = \operatorname{tr}\left[ Y^{-1} dY \right]^2
For U ∈ O(m) and (a_1, …, a_m) ∈ (ℝ_+^*)^m, let Y = U diag(a_1, …, a_m) U^⊤. Then:
ds^2(Y) = \sum_{j=1}^{m} \frac{da_j^2}{a_j^2} + 2 \sum_{1 \le i < j \le m} \frac{(a_i - a_j)^2}{a_i a_j} \, \theta_{ij}^2
(see [10], p. 24). Let a i = e r i , then simple calculations show that:
ds^2(Y) = \sum_{j=1}^{m} dr_j^2 + 8 \sum_{i<j} \sinh^2\!\left( \frac{r_i - r_j}{2} \right) \theta_{ij}^2
As a consequence, the volume element d v ( Y ) is written as:
dv(Y) = 8^{\frac{m(m-1)}{4}} \det(\theta) \prod_{i<j} \sinh\!\left( \frac{|r_i - r_j|}{2} \right) \prod_{i=1}^{m} dr_i
This proves the proposition (the factor (m! 2^m)^{-1} comes from the fact that the correspondence between Y and (r, U) is not unique: m! corresponds to all possible reorderings of r_1, …, r_m, and 2^m corresponds to the orientation of the columns of U).

B. Derivation of Equation (19)

By Equations (16) and (18), to prove Equation (19) it is sufficient to prove that, for all Y ∈ P_m, d(Y, I) = (∑_{i=1}^{m} r_i^2)^{1/2} whenever the spectral decomposition of Y is Y = U diag(e^{r_1}, …, e^{r_m}) U^⊤, with U an orthogonal matrix. Note that d(Y, I) = d(diag(e^{r_1}, …, e^{r_m})·U, I) = d(diag(e^{r_1}, …, e^{r_m})·U, I·U), where · is the affine transformation given by Equation (9). By Equation (10), it follows that d(Y, I) = d(diag(e^{r_1}, …, e^{r_m}), I), and so d(Y, I) = (∑_{i=1}^{m} r_i^2)^{1/2} holds by the explicit expression Equation (8).
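This identity is easy to check numerically, reusing rao_distance from the sketch in Section 2; the values of r below are arbitrary.

```python
# Numerical illustration of d(Y, I) = ||r|| when Y has eigenvalues e^{r_i}.
import numpy as np

rng = np.random.default_rng(2)
r = rng.normal(size=4)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = (U * np.exp(r)) @ U.T
print(np.isclose(rao_distance(Y, np.eye(4)), np.linalg.norm(r)))
```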

C. The Normalizing Factor ζ m ( σ )

The subject of this section is to prove these two claims:
(i)
0 < σ_m < ∞ for all m ≥ 2;
(ii)
σ_2 = √2.
To check (i), note that ∏_{i<j} sinh(|r_i − r_j|/2) ≤ exp(C|r|) for some constant C. Thus, for σ small enough, the integral I_m(σ) = ∫_{ℝ^m} e^{−|r|/σ} ∏_{i<j} sinh(|r_i − r_j|/2) dr appearing in Equation (19) is finite, and consequently, σ_m > 0.
Fix A > 0 such that sinh(x/2) ≥ exp(x/4) for all x ≥ A. Then:
I_m(\sigma) \ge \int_{C} \exp\left( \frac{1}{4} \sum_{i<j} (r_j - r_i) - \frac{|r|}{\sigma} \right) dr
where C is the following set of infinite Lebesgue measure:
C = \left\{ r = (r_1, \ldots, r_m) \in \mathbb{R}^m : r_i \in [\,2(i-1)A, \, (2i-1)A\,], \; 1 \le i \le m-1, \; r_m \ge 2(m-1)A \right\}
Now:
\frac{1}{4} \sum_{i<j} (r_j - r_i) = \frac{1}{4} r_m + \frac{1}{4} \Big( -r_1 + \sum_{i<j, \, (i,j) \ne (1,m)} (r_j - r_i) \Big)
Assume m ≥ 3 (the case m = 2 is easily dealt with separately). Then, on C, (1/4) ∑_{i<j} (r_j − r_i) ≥ (1/4) r_m + C and |r|/σ ≤ (C′ + r_m²)^{1/2}/σ, where C and C′ are two positive constants (not depending on r). However, for σ large enough:
\frac{1}{4} \sum_{i<j} (r_j - r_i) - \frac{|r|}{\sigma} \ge \frac{1}{4} r_m + C - \frac{ (C' + r_m^2)^{1/2} }{\sigma} \ge 0
and so, the integral I m ( σ ) diverges. This shows that σ m is finite.
(ii) Note the easy inequalities |r_1 − r_2| ≤ |r_1| + |r_2| ≤ √2 |r|, which yield sinh(|r_1 − r_2|/2) ≤ (1/2) e^{|r|/√2}. This last inequality shows that ζ_2(σ) is finite for all σ < √2. In order to check that ζ_2(√2) = ∞, it is necessary to show:
\int_{\mathbb{R}^2} \exp\left( -\frac{|r|}{\sqrt{2}} + \frac{|r_1 - r_2|}{2} \right) dr_1 \, dr_2 = \infty \quad (40)
The last integral is, up to a constant, greater than ∫_C exp(−|r|/√2 + |r_1 − r_2|/2) dr_1 dr_2, where:
C = \{ (r_1, r_2) \in \mathbb{R}^2 : r_1 \ge -r_2, \; r_2 \le 0 \} = \{ (r_1, r_2) \in \mathbb{R}^2 : r_1 \ge |r_2|, \; r_2 \le 0 \}
On C ,
-\frac{|r|}{\sqrt{2}} + \frac{|r_1 - r_2|}{2} = -\frac{|r|}{\sqrt{2}} + \frac{r_1 - r_2}{2} \ge -r_1 + \frac{r_1 - r_2}{2} = -\frac{r_1 + r_2}{2}
However, ∫_C exp(−(r_1 + r_2)/2) dr_1 dr_2 = ∞, as seen by integrating with respect to r_1 and then r_2, which shows Equation (40).

D. The Law of X in Algorithm 1

As stated in Appendix A, the uniform distribution on O(m) is given by (1/ω_m) det(θ), where ω_m = 2^m π^{m²/2} / Γ_m(m/2). Let Y(s, V) = V diag(e^{s_1}, …, e^{s_m}) V^⊤, with s = (s_1, …, s_m). Since X = Y(r, U), for any test function φ : P_m → ℝ,
E[\varphi(X)] = \frac{1}{\omega_m} \int_{O(m) \times \mathbb{R}^m} \varphi(Y(s, V)) \, p(s) \, \det(\theta) \prod_{i=1}^{m} ds_i \quad (41)
Here, det(θ) = ⋀_{i<j} θ_{ij} and θ_{ij} = ∑_k V_{jk} dV_{ik}. On the other hand, by Proposition 6, ∫_{P_m} φ(Y) p(Y | I, σ) dv(Y) can be expressed as:
(m! \, 2^m)^{-1} \times 8^{\frac{m(m-1)}{4}} \, \frac{1}{\zeta_m(\sigma)} \int_{O(m)} \int_{\mathbb{R}^m} \varphi(Y(s, V)) \, e^{-|s|/\sigma} \det(\theta) \prod_{i<j} \sinh\!\left( \frac{|s_i - s_j|}{2} \right) \prod_{i=1}^{m} ds_i
which coincides with Equation (41).

References

1. Pennec, X.; Fillard, P.; Ayache, N. A Riemannian framework for tensor computing. Int. J. Comput. Vis. 2006, 66, 41–66.
2. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. Multiclass Brain–Computer Interface Classification by Riemannian Geometry. IEEE Trans. Biomed. Eng. 2012, 59, 920–928.
3. Jayasumana, S.; Hartley, R.; Salzmann, M.; Li, H.; Harandi, M. Kernel Methods on the Riemannian Manifold of Symmetric Positive Definite Matrices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 73–80.
4. Zheng, L.; Qiu, G.; Huang, J.; Duan, J. Fast and accurate Nearest Neighbor search in the manifolds of symmetric positive definite matrices. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 3804–3808.
5. Dong, G.; Kuang, G. Target recognition in SAR images via classification on Riemannian manifolds. IEEE Geosci. Remote Sens. Lett. 2015, 21, 199–203.
6. Tuzel, O.; Porikli, F.; Meer, P. Pedestrian detection via classification on Riemannian manifolds. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1713–1727.
7. Caseiro, R.; Henriques, J.F.; Martins, P.; Batista, J. A nonparametric Riemannian framework on tensor field with application to foreground segmentation. Pattern Recognit. 2012, 45, 3997–4017.
8. Arnaudon, M.; Barbaresco, F.; Yang, L. Riemannian Medians and Means With Applications to Radar Signal Processing. IEEE J. Sel. Top. Signal Process. 2013, 7, 595–604.
9. Arnaudon, M.; Yang, L.; Barbaresco, F. Stochastic algorithms for computing p-means of probability measures, Geometry of Radar Toeplitz covariance matrices and applications to HR Doppler processing. In Proceedings of the International Radar Symposium (IRS), Leipzig, Germany, 7–9 September 2011; pp. 651–656.
10. Terras, A. Harmonic Analysis on Symmetric Spaces and Applications; Springer-Verlag: New York, NY, USA, 1988; Volume II.
11. Atkinson, C.; Mitchell, A. Rao's distance measure. Sankhya Ser. A 1981, 43, 345–365.
12. Pennec, X. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proceedings of the IEEE Workshop on Nonlinear Signal and Image Processing, Antalya, Turkey, 20–23 June 1999; pp. 194–198.
13. Pennec, X. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. J. Math. Imaging Vis. 2006, 25, 127–154.
14. Guang, C.; Baba, C.V. A Novel Dynamic System in the Space of SPD Matrices with Applications to Appearance Tracking. SIAM J. Imaging Sci. 2013, 6, 592–615.
15. Said, S.; Bombrun, L.; Berthoumieu, Y.; Manton, J. Riemannian Gaussian distributions on the space of symmetric positive definite matrices. 2015; arXiv:1507.01760.
16. Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461–464.
17. Lee, J.S.; Grunes, M.R.; Ainsworth, T.L.; Du, L.J.; Schuler, D.L.; Cloude, S.R. Unsupervised classification using polarimetric decomposition and the complex Wishart classifier. IEEE Trans. Geosci. Remote Sens. 1999, 37, 2249–2258.
18. Berezin, F.A. Quantization in complex symmetric spaces. Izv. Akad. Nauk SSSR Ser. Mat. 1975, 39, 363–402.
19. Malliavin, P. Invariant or quasi-invariant probability measures for infinite dimensional groups, Part II: Unitarizing measures or Berezinian measures. Jpn. J. Math. 2008, 3, 19–47.
20. Barbaresco, F. Information Geometry of Covariance Matrix: Cartan-Siegel Homogeneous Bounded Domains, Mostow/Berger Fibration and Fréchet Median. In Matrix Information Geometry; Bhatia, R., Nielsen, F., Eds.; Springer: New York, NY, USA, 2012; pp. 199–256.
21. Barbaresco, F. Information geometry manifold of Toeplitz Hermitian positive definite covariance matrices: Mostow/Berger fibration and Berezin quantization of Cartan-Siegel domains. Int. J. Emerg. Trends Signal Process. 2013, 1, 1–11.
22. Jeuris, B.; Vandebril, R. Averaging block-Toeplitz matrices with preservation of Toeplitz block structure. In Proceedings of the SIAM Conference on Applied Linear Algebra (ALA), Atlanta, GA, USA, 20–26 October 2015.
23. Jeuris, B.; Vandebril, R. The Kähler Mean of Block-Toeplitz Matrices with Toeplitz Structured Block. Available online: http://www.cs.kuleuven.be/publicaties/rapporten/tw/TW660.pdf (accessed on 10 March 2016).
24. Jeuris, B. Riemannian Optimization for Averaging Positive Definite Matrices. Ph.D. Thesis, University of Leuven, Leuven, Belgium, 2015.
25. Maass, H. Siegel's modular forms and Dirichlet series. In Lecture Notes in Mathematics; Springer-Verlag: New York, NY, USA, 1971; Volume 216.
26. Higham, N.J. Functions of Matrices: Theory and Computation; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2008.
27. Helgason, S. Differential Geometry, Lie Groups, and Symmetric Spaces; American Mathematical Society: Providence, RI, USA, 2001.
28. Afsari, B. Riemannian Lp center of mass: Existence, uniqueness and convexity. Proc. Am. Math. Soc. 2011, 139, 655–673.
29. Muirhead, R.J. Aspects of Multivariate Statistical Theory; John Wiley & Sons: New York, NY, USA, 1982.
30. Robert, C.P.; Casella, G. Monte Carlo Statistical Methods; Springer-Verlag: Berlin, Germany, 2004.
31. Udriste, C. Convex Functions and Optimization Methods on Riemannian Manifolds; Mathematics and Its Applications; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1994.
32. Chavel, I. Riemannian Geometry: A Modern Introduction; Cambridge University Press: Cambridge, UK, 2006.
33. Yang, L. Médianes de Mesures de Probabilité dans les Variétés Riemanniennes et Applications à la Détection de Cibles Radar. Ph.D. Thesis, Université de Poitiers, Poitiers, France, 2011.
34. Li, Y.; Wong, K.M. Riemannian distances for signal classification by power spectral density. IEEE J. Sel. Top. Signal Process. 2013, 7, 655–669.
35. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin, Germany, 2009.
36. Saint-Jean, C.; Nielsen, F. A new implementation of k-MLE for mixture modeling of Wishart distributions. In Geometric Science of Information (GSI); Springer-Verlag: Berlin/Heidelberg, Germany, 2013; pp. 249–256.
37. Hidot, S.; Saint-Jean, C. An expectation-maximization algorithm for the Wishart mixture model: Application to movement clustering. Pattern Recognit. Lett. 2010, 31, 2318–2324.
38. VisTex: Vision Texture Database. MIT Media Lab Vision and Modeling Group. Available online: http://vismod.media.mit.edu/pub/ (accessed on 9 March 2016).
39. Daubechies, I. Ten Lectures on Wavelets; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1992.
40. Do, M.N.; Vetterli, M. Wavelet-Based Texture Retrieval Using Generalized Gaussian Density and Kullback-Leibler Distance. IEEE Trans. Image Process. 2002, 11, 146–158.
41. Bombrun, L.; Berthoumieu, Y.; Lasmar, N.-E.; Verdoolaege, G. Multivariate Texture Retrieval Using the Geodesic Distance between Elliptically Distributed Random Variables. In Proceedings of the 2011 18th IEEE International Conference on Image Processing (ICIP), Brussels, Belgium, 11–14 September 2011.
42. Verdoolaege, G.; Scheunders, P. On the Geometry of Multivariate Generalized Gaussian Models. J. Math. Imaging Vis. 2012, 43, 180–193.
43. Stitou, Y.; Lasmar, N.-E.; Berthoumieu, Y. Copulas based Multivariate Gamma Modeling for Texture Classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; pp. 1045–1048.
44. Kwitt, R.; Uhl, A. Lightweight Probabilistic Texture Retrieval. IEEE Trans. Image Process. 2010, 19, 241–253.
Figure 1. Example of a texture from the VisTex database (a), one of its patches (b) and the corresponding outlier (c).
Figure 2. Classification results.
