Computational Information Geometry for Binary Classification of High-Dimensional Random Tensors

Pham, Gia-Thuy; Boyer, Rémy; Nielsen, Frank

doi:10.3390/e20030203

Computational Information Geometry for Binary Classification of High-Dimensional Random Tensors^†

by

Gia-Thuy Pham

¹,

Rémy Boyer

¹ and

Frank Nielsen

^2,3,*

¹

Laboratory of Signals and Systems (L2S), Department of Signals and Statistics, University of Paris-Sud, 91400 Orsay, France

²

Computer Science Department LIX, École Polytechnique, 91120 Palaiseau, France

³

Sony Computer Science Laboratories, Tokyo 141-0022, Japan

^*

Author to whom correspondence should be addressed.

^†

The results presented in this work have been partially published in the 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017 and the 2017 25th European Association for Signal Processing (EUSIPCO), Kos, Greece, 28 August–2 September 2017.

Entropy 2018, 20(3), 203; https://doi.org/10.3390/e20030203

Submission received: 25 January 2018 / Revised: 13 March 2018 / Accepted: 14 March 2018 / Published: 17 March 2018

Abstract

:

Evaluating the performance of Bayesian classification in a high-dimensional random tensor is a fundamental problem, usually difficult and under-studied. In this work, we consider two Signal to Noise Ratio (SNR)-based binary classification problems of interest. Under the alternative hypothesis, i.e., for a non-zero SNR, the observed signals are either a noisy rank-R tensor admitting a Q-order Canonical Polyadic Decomposition (CPD) with large factors of size

N_{q} \times R

, i.e., for

1 \leq q \leq Q

, where

R, N_{q} \to \infty

with

R^{1 / q} / N_{q}

converging towards a finite constant, or a noisy tensor admitting a Tucker Decomposition (TKD) of multilinear

(M_{1}, \dots, M_{Q})

-rank with large factors of size

N_{q} \times M_{q}

, i.e., for

1 \leq q \leq Q

, where

N_{q}, M_{q} \to \infty

with

M_{q} / N_{q}

converging towards a finite constant. The classification of the random entries (coefficients) of the core tensor in the CPD/TKD is hard to study since the exact derivation of the minimal Bayes’ error probability is mathematically intractable. To circumvent this difficulty, the Chernoff Upper Bound (CUB) for larger SNR and the Fisher information at low SNR are derived and studied, based on information geometry theory. The tightest CUB is reached for the value minimizing the error exponent, denoted by

s^{⋆}

. In general, due to the asymmetry of the s-divergence, the Bhattacharyya Upper Bound (BUB) (that is, the Chernoff Information calculated at

s = 1 / 2

) cannot solve this problem effectively. As a consequence, we rely on a costly numerical optimization strategy to find

s^{⋆}

. However, thanks to powerful random matrix theory tools, a simple analytical expression of

s^{⋆}

is provided with respect to the Signal to Noise Ratio (SNR) in the two schemes considered. This work shows that the BUB is the tightest bound at low SNRs. However, for higher SNRs, this property no longer holds.

Keywords:

optimal Bayesian detection; information geometry; minimal error probability; Chernoff/Bhattacharyya upper bound; large random tensor; Fisher information; large random sensing matrix

1. Introduction

1.1. State-of-the-Art and Problem Statement

Evaluating the performance limit for the “Gaussian information plus noise” binary classification problem is a challenging research topic, see for instance [1,2,3,4,5,6,7]. Given a binary hypothesis problem, the Bayes’ decision rule is based on the principle of the largest posterior probability. Specifically, the Bayesian detector chooses the alternative hypothesis

H_{1}

if

\Pr (H_{1} | y) > \Pr (H_{0} | y)

for a given N-dimensional measurement vector

y

or the null hypothesis

H_{0}

, otherwise. Consequently, the optimal decision rule can often only be derived at the price of a costly numerical computation of the log posterior-odds ratio [3] since an exact calculation of the minimal Bayes’ error probability

P_{e}^{(N)}

is often intractable [3,8]. To circumvent this problem, it is standard to exploit well-known bounds on

P_{e}^{(N)}

based on information theory [9,10,11,12,13]. In particular, the Chernoff information [14,15] is asymptotically (in N) related to the exponential rate of

P_{e}^{(N)}

. It turns out that the Chernoff information is very useful in many practically important problems, for instance distributed sparse detection [16], sparse support recovery [17], energy detection [18], multi-input and multi-output (MIMO) radar processing [19,20], network secrecy [21], angular resolution limit in array processing [22], and detection performance for informed communication systems [23], to name a few. In addition, the Chernoff information bound can be tight for a minimal s-divergence over parameter

s \in (0, 1)

. Generally, this step requires solving numerically an optimization problem [24] and often leads to a complicated and uninformative expression of the optimal value of s. To circumvent this difficulty, a simplified case of

s = 1 / 2

is often used corresponding to the well-known Bhattacharyya divergence [13] at the price of a less accurate prediction of

P_{e}^{(N)}

. In information geometry, parameter s is often called

α

, and the s-divergence is the so-called Chernoff

α

-divergence [24].

Tensor decomposition theory is a timely and prominent research topic [25,26]. Tensors have been shown to be extremely relevant to the problem of extracting useful information from a massive and multidimensional volume of measurements. In the standard literature, two main families of tensor decomposition are prominent, namely the Canonical Polyadic Decomposition (CPD) [26] and the Tucker decomposition (TKD)/HOSVD (High-Order SVD) [27,28]. These approaches are two possible multilinear generalizations of the Singular Value Decomposition (SVD). The CPD provides a natural generalization to tensors of the usual concept of rank for matrices. The tensorial/canonical rank of a P-order tensor is equal to the minimal number, say R, of unit rank tensors that must be summed up for perfect recovery. A unit rank tensor is the outer product of P vectors. In addition, the CPD has remarkable uniqueness properties [26] and involves only a reduced number of free parameters due to the constraint of minimality on R. Unfortunately, unlike the matrix case, the set of tensors with fixed (tensorial) rank is not closed [29,30]. This singularity implies that the problem of computing the CPD is mathematically ill-posed. As a consequence, its numerical computation remains non-trivial and is usually done using suboptimal iterative algorithms [31]. Note that this problem can sometimes be avoided by exploiting some natural hidden structures in the physical model [32]. The TKD [28] and the HOSVD [27] are two popular alternatives to the CPD. In this case, an alternative definition of rank is required, since the tensorial rank based on the CPD is no longer appropriate. In particular, the standard definition is the multilinear rank, defined as the set of positive integers

{R_{1}, \dots, R_{P}}

where each integer,

R_{p}

, is the usual rank of the p-th mode. Following the Eckart-Young theorem at each mode level [33], this construction is non-iterative, optimal and practical. This approach has been shown to be suitable for real-time computation [34] and adaptive computation [35]. However, in general, the low (multilinear) rank tensor obtained with this procedure is suboptimal [27]. More precisely, for tensors of order strictly greater than two, a generalization of the Eckart-Young theorem does not exist.

The classification performance of a multilinear tensor following the CPD and TKD can be derived and studied. It is interesting to note that the classification theory for tensors is largely under-studied. To the best of our knowledge, only the publication [36] tackles this problem, in the context of radar multidimensional data detection. A major difference with the present work is that the analysis in [36] is based on the performance of a low rank detection after matched filtering.

More precisely, we consider two cases where the observations are either (1) a noisy rank-R tensor admitting a Q-order CPD with large factors of size

N_{q} \times R

, i.e., for

1 \leq q \leq Q

,

R, N_{q} \to \infty

with

R^{1 / q} / N_{q}

converging towards a finite constant, or (2) a noisy tensor admitting a TKD of multilinear

(M_{1}, \dots, M_{Q})

-rank with large factors of size

N_{q} \times M_{q}

, i.e., for

1 \leq q \leq Q

, where

N_{q}, M_{q} \to \infty

with

M_{q} / N_{q}

converging towards a finite constant. A standard approach for zero-mean independent Gaussian core and noise tensors, is to define the Signal to Noise Ratio by

SNR = σ_{s}^{2} / σ^{2}

where

σ_{s}^{2}

and

σ^{2}

are the variances of the vectorized core and noise tensors, respectively. So, the binary classification can be described in the following way:

Under the null hypothesis

H_{0}

,

SNR = 0

, meaning that the observed tensor contains only noise. Conversely, the alternative hypothesis

H_{1}

is based on

SNR \neq 0

, meaning that there exists a multilinear signal of interest. First note that few contributions deal with classification performance for tensors. Since the exact derivation of the error probability is intractable, the performance of the classification of the core tensor random entries is hard to evaluate. To circumvent this difficulty, based on computational information geometry theory, we consider the Chernoff Upper Bound (CUB), and the Fisher information in the context of massive measurement vectors. The error exponent can be minimized at

s^{⋆}

, which corresponds to the reachable tightest CUB. In general, due to the asymmetry of the s-divergence, the Bhattacharyya Upper Bound (BUB)—Chernoff Information calculated at

s = 1 / 2

—cannot solve this problem effectively. As a consequence, we rely on a costly numerical optimization strategy to find

s^{⋆}

. However, with respect to different Signal to Noise Ratios (SNR), we provide simple analytical expressions of

s^{⋆}

, thanks to the so-called Random Matrix Theory (RMT). For low SNR, analytical expressions of the Fisher information are given. Note that the analysis of the Fisher information in the context of the RMT has only been studied in recent contributions [37,38,39] for parameter estimation. For larger SNR, simple analytical expressions of the CUB for the CPD and the TKD are provided.

We note that random matrices have attracted both mathematicians and physicists since they were first introduced in mathematical statistics by Wishart in 1928 [40]. The subject started to gain prominence when Wigner [41] introduced the concept of statistical distribution of nuclear energy levels. However, it took until 1955 before Wigner [42] introduced ensembles of random matrices. Since then, many important results in Random Matrix Theory (RMT) have been developed and analyzed, see for instance [43,44,45,46] and the references therein. Over the last two decades, research on RMT has been published steadily.

Finally, let us underline that many arguments of this paper differ from the works presented in [47,48]. In [47], we tackled the problem of detection using the Chernoff Upper Bound for matrix-type data in the double asymptotic regime. In [48], we addressed the detection problem for tensor data by analyzing the Chernoff Upper Bound. In [48], we assumed that the tensor follows the Canonical Polyadic Decomposition (CPD) and analyzed the Chernoff Upper Bound when the rank of the tensor is much smaller than the dimensions of the tensor. Since [47,48] are conference papers, some proofs were omitted due to limited space. Therefore, this full paper shares some ideas with [47,48] on Information Geometry (s-divergence, Chernoff Upper Bound, Fisher Information, etc.), but completes [48] in a more general asymptotic regime. Moreover, in this work, we give a new analysis in both scenarios (small and large SNR), which [48] did not, and the important and difficult new tensor scenario of the Tucker decomposition is considered. This is in our view the main difference because the CPD is a particular case of the more general Tucker decomposition. Indeed, in the CPD, the core tensor is assumed to be diagonal.

1.2. Paper Organisation

The organization of the paper is as follows: In the second section, we introduce some definitions, tensor models, and the Marchenko-Pastur distribution from random matrix theory. The third section is devoted to presenting the Chernoff Information for the binary hypothesis test. The fourth section gives the main results on the Fisher Information and the Chernoff bound. The numerical simulation results are given in the fifth section. We conclude our work by giving some perspectives in Section 6. Finally, several proofs of the paper can be found in the appendix.

2. Algebra of Tensors and Random Matrix Theory (RMT)

In this section, we introduce some useful definitions from tensor algebra and from the spectral theory of large random matrices.

2.1. Multilinear Functions

2.1.1. Preliminary Definitions

Definition 1.

The Kronecker product of matrices

X

and

Y

of size

I \times J

and

K \times N

, respectively is given by

\begin{matrix} X \otimes Y = [\begin{matrix} {[X]}_{11} Y & \dots & {[X]}_{1 J} Y \\ ⋮ & & ⋮ \\ {[X]}_{I 1} Y & \dots & {[X]}_{I J} Y \end{matrix}] \in R^{(I K) \times (J N)} . \end{matrix}

We have

rank {X \otimes Y} = rank {X} \times rank {Y}

.
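As a quick numerical illustration of Definition 1, the following sketch builds a Kronecker product with NumPy and checks the size and the rank identity. The matrix sizes and the random seed are arbitrary choices made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes (assumed): X is 3 x 2 and Y is 4 x 3, both full rank a.s.
X = rng.standard_normal((3, 2))
Y = rng.standard_normal((4, 3))

K = np.kron(X, Y)  # block matrix whose (i, j) block is [X]_{ij} * Y

kron_shape = K.shape
rank_X = np.linalg.matrix_rank(X)
rank_Y = np.linalg.matrix_rank(Y)
rank_K = np.linalg.matrix_rank(K)
print(kron_shape, rank_X, rank_Y, rank_K)
```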

Definition 2.

The vectorization

vec (X)

of a tensor

X \in R^{M_{1} \times \dots \times M_{Q}}

is a vector

x \in R^{M_{1} M_{2} \dots M_{Q}}

defined as

\begin{matrix} x_{h} = {[X]}_{m_{1}, \dots, m_{Q}} \end{matrix}

where

h = m_{1} + \sum_{k = 2}^{Q} (m_{k} - 1) M_{1} M_{2} \dots M_{k - 1}

.
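The index formula of Definition 2 makes the first index vary fastest, which corresponds to column-major (Fortran) ordering. A minimal sketch, where the helper names `vec` and `linear_index` are introduced here for illustration only:

```python
import numpy as np

def vec(T):
    """Vectorize a tensor with the first index varying fastest (column-major)."""
    return np.reshape(T, -1, order="F")

def linear_index(m, dims):
    """1-based position h of the 1-based multi-index m, as in Definition 2."""
    h, stride = m[0], 1
    for k in range(1, len(dims)):
        stride *= dims[k - 1]          # M_1 * ... * M_k (in 1-based notation)
        h += (m[k] - 1) * stride
    return h

dims = (2, 3, 4)                       # illustrative sizes (assumed)
T = np.arange(np.prod(dims)).reshape(dims)
m = (2, 1, 3)                          # 1-based multi-index
h = linear_index(m, dims)
print(h, vec(T)[h - 1], T[1, 0, 2])
```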

Definition 3.

The q-mode product denoted by

\times_{q}

between a tensor

X \in R^{M_{1} \times \dots \times M_{Q}}

and a matrix

U \in R^{K \times M_{q}}

is denoted by

X \times_{q} U \in R^{M_{1} \times \dots \times M_{q - 1} \times K \times M_{q + 1} \times \dots \times M_{Q}}

with

\begin{matrix} {[X \times_{q} U]}_{m_{1}, \dots, m_{q - 1}, k, m_{q + 1}, \dots, m_{Q}} = \sum_{m_{q} = 1}^{M_{q}} {[X]}_{m_{1}, \dots, m_{Q}} {[U]}_{k, m_{q}} \end{matrix}

where

1 \leq k \leq K

.
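The q-mode product of Definition 3 can be sketched with a tensor contraction. The function name `mode_product` and the example sizes are assumptions made for this illustration; q is 1-based, as in the text.

```python
import numpy as np

def mode_product(T, U, q):
    """q-mode product T x_q U, with U of shape (K, M_q) and q counted from 1."""
    axis = q - 1
    # Contract the second index of U with mode q of T; the new axis comes first.
    out = np.tensordot(U, T, axes=(1, axis))
    # Move the new axis (size K) back to position q.
    return np.moveaxis(out, 0, axis)

T = np.arange(24.0).reshape(2, 3, 4)   # illustrative 2 x 3 x 4 tensor
U = np.arange(15.0).reshape(5, 3)      # K = 5, M_2 = 3
Z = mode_product(T, U, 2)
print(Z.shape)
```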

Definition 4.

The q-mode unfolding matrix of size

M_{q} \times (\prod_{k = 1, k \neq q}^{Q} M_{k})

denoted by

X_{(q)} = {unfold}_{q} (X)

of a tensor

X \in R^{M_{1} \times \dots \times M_{Q}}

is defined according to

\begin{matrix} {[X_{(q)}]}_{m_{q}, h} = {[X]}_{m_{1}, \dots, m_{Q}} \end{matrix}

where

h = 1 + \sum_{k = 1, k \neq q}^{Q} (m_{k} - 1) \prod_{v = 1, v \neq q}^{k - 1} M_{v}

.
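A minimal sketch of the q-mode unfolding of Definition 4, in which the remaining modes are ordered increasingly with the lowest one varying fastest. The function name `unfold` is an assumption of this example.

```python
import numpy as np

def unfold(T, q):
    """q-mode unfolding (q counted from 1): rows index mode q, columns index
    the remaining modes in increasing order, lowest mode varying fastest."""
    moved = np.moveaxis(T, q - 1, 0)
    return np.reshape(moved, (T.shape[q - 1], -1), order="F")

T = np.arange(24).reshape(2, 3, 4)     # illustrative 2 x 3 x 4 tensor
X2 = unfold(T, 2)
# For (m_1, m_2, m_3) = (1, 2, 4) and q = 2, Definition 4 gives
# h = 1 + (m_1 - 1) + (m_3 - 1) * M_1 = 7, i.e. zero-based column 6.
print(X2.shape, X2[1, 6], T[0, 1, 3])
```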

2.1.2. Canonical Polyadic Decomposition (CPD)

The rank-R CPD of order Q is defined according to

\begin{matrix} X = \sum_{r = 1}^{R} s_{r} \underset{X_{r}}{\underset{︸}{(ϕ_{r}^{(1)} \circ \dots \circ ϕ_{r}^{(Q)})}} \quad \text{with} \quad rank {X_{r}} = 1 \end{matrix}

where ○ is the outer product [25],

ϕ_{r}^{(q)} \in R^{N_{q} \times 1}

and

s_{r}

is a real scalar.

An equivalent formulation using the q-mode product defined in Definition 3 is

X = S \times_{1} Φ^{(1)} \times_{2} \dots \times_{Q} Φ^{(Q)}

where

S

is the

R \times \dots \times R

diagonal core tensor with

{[S]}_{r, \dots, r} = s_{r}

and

Φ^{(q)} = [ϕ_{1}^{(q)} \dots ϕ_{R}^{(q)}]

is the q-th factor matrix of size

N_{q} \times R

.

The q-mode unfolding matrix for tensor

X

is given by

X_{(q)} = Φ^{(q)} S {(Φ^{(Q)} ⊙ \dots ⊙ Φ^{(q + 1)} ⊙ Φ^{(q - 1)} ⊙ \dots ⊙ Φ^{(1)})}^{T}

where

S = diag (s)

with

s = {[s_{1}, \dots, s_{R}]}^{T}

and ⊙ stands for the Khatri-Rao product [25].
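The unfolding identity above can be checked numerically for a third-order CPD. The sketch below assumes small illustrative sizes and random factors; `khatri_rao` and `unfold` are helper names introduced for this example.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R), size (I*J) x R."""
    I, R = A.shape
    J = B.shape[0]
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

def unfold(T, q):
    """q-mode unfolding with q counted from 1 (lowest remaining mode fastest)."""
    return np.reshape(np.moveaxis(T, q - 1, 0), (T.shape[q - 1], -1), order="F")

rng = np.random.default_rng(0)
N1, N2, N3, R = 3, 4, 5, 2             # illustrative sizes (assumed)
P1, P2, P3 = (rng.standard_normal((n, R)) for n in (N1, N2, N3))
s = rng.standard_normal(R)

# Rank-R CPD: sum over r of s_r * (phi_r^(1) o phi_r^(2) o phi_r^(3)).
X = np.einsum("r,ir,jr,kr->ijk", s, P1, P2, P3)

X1_model = P1 @ np.diag(s) @ khatri_rao(P3, P2).T
X2_model = P2 @ np.diag(s) @ khatri_rao(P3, P1).T
print(np.allclose(unfold(X, 1), X1_model), np.allclose(unfold(X, 2), X2_model))
```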

2.1.3. Tucker Decomposition (TKD)

The Tucker tensor model of order Q is defined according to

\begin{matrix} X = \sum_{m_{1} = 1}^{M_{1}} \sum_{m_{2} = 1}^{M_{2}} \dots \sum_{m_{Q} = 1}^{M_{Q}} s_{m_{1} m_{2} \dots m_{Q}} (ϕ_{m_{1}}^{(1)} \circ ϕ_{m_{2}}^{(2)} \circ \dots \circ ϕ_{m_{Q}}^{(Q)}) \end{matrix}

where

ϕ_{m_{q}}^{(q)} \in R^{N_{q} \times 1}

,

q = 1, \dots, Q

and

s_{m_{1} m_{2} \dots m_{Q}}

is a real scalar.

The q-mode product of

X

is similar to CPD case, however the q-mode unfolding matrix for tensor

X

is slightly different

\begin{matrix} X_{(q)} = Φ^{(q)} S_{(q)} {(Φ^{(Q)} \otimes \dots \otimes Φ^{(q + 1)} \otimes Φ^{(q - 1)} \dots \otimes Φ^{(1)})}^{T} \end{matrix}

where

S_{(q)} \in R^{M_{q} \times M_{1} M_{2} \dots M_{q - 1} M_{q + 1} \dots M_{Q}}

is the q-mode unfolding matrix of tensor

S

,

Φ^{(q)} = [ϕ_{1}^{(q)} \dots ϕ_{M_{q}}^{(q)}] \in R^{N_{q} \times M_{q}}

and ⊗ stands for Kronecker product. See Figure 1.
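The Tucker unfolding identity can be verified in the same way for a third-order tensor. This is a sketch under assumed sizes; the helper `unfold` is introduced for the example and follows Definition 4.

```python
import numpy as np

def unfold(T, q):
    """q-mode unfolding with q counted from 1 (lowest remaining mode fastest)."""
    return np.reshape(np.moveaxis(T, q - 1, 0), (T.shape[q - 1], -1), order="F")

rng = np.random.default_rng(1)
N1, N2, N3 = 4, 5, 6                   # tensor sizes (assumed)
M1, M2, M3 = 2, 3, 2                   # multilinear rank (assumed)
F1 = rng.standard_normal((N1, M1))
F2 = rng.standard_normal((N2, M2))
F3 = rng.standard_normal((N3, M3))
S = rng.standard_normal((M1, M2, M3))  # dense core tensor

# Tucker model: X = S x_1 F1 x_2 F2 x_3 F3.
X = np.einsum("abc,ia,jb,kc->ijk", S, F1, F2, F3)

X2_model = F2 @ unfold(S, 2) @ np.kron(F3, F1).T
print(np.allclose(unfold(X, 2), X2_model))
```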

Following the definitions, we note that the CPD and TKD scenarios imply that vector

x

in Equation (11) is related either to the structured linear system

Φ^{⊙} = Φ^{(Q)} ⊙ \dots ⊙ Φ^{(q + 1)} ⊙ Φ^{(q - 1)} ⊙ \dots ⊙ Φ^{(1)}

or

Φ^{\otimes} = Φ^{(Q)} \otimes \dots \otimes Φ^{(q + 1)} \otimes Φ^{(q - 1)} \dots \otimes Φ^{(1)}

.

2.2. The Marchenko-Pastur Distribution

The Marchenko-Pastur distribution was introduced half a century ago [45] in 1967, and plays a key role in a number of high-dimensional signal processing problems. To help the reader, in this section, we introduce some fundamental results concerning large empirical covariance matrices. Let

{(v_{n})}_{n = 1, \dots, N}

be a sequence of i.i.d. zero-mean Gaussian random M-dimensional vectors for which

E (v_{n} v_{n}^{T}) = σ^{2} I_{M}

. We consider the empirical covariance matrix

\frac{1}{N} \sum_{n = 1}^{N} v_{n} v_{n}^{T}

which can be also written as

\frac{1}{N} \sum_{n = 1}^{N} v_{n} v_{n}^{T} = W_{N} W_{N}^{T}

where matrix

W_{N}

is defined by

W_{N} = \frac{1}{\sqrt{N}} [v_{1}, \dots, v_{N}]

.

W_{N}

is thus a Gaussian matrix with independent identically distributed

N (0, \frac{σ^{2}}{N})

entries. When

N \to + \infty

while M remains fixed, matrix

W_{N} W_{N}^{T}

converges towards

σ^{2} I_{M}

in the spectral norm sense. In the high dimensional asymptotic regime defined by

M \to + \infty, N \to + \infty, c_{N} = \frac{M}{N} \to c > 0

it is well understood that

∥W_{N} W_{N}^{T} - σ^{2} I_{M}∥

does not converge towards 0. In particular, the empirical distribution

{\hat{ν}}_{N} = \frac{1}{M} \sum_{m = 1}^{M} δ_{{\hat{λ}}_{m, N}}

of the eigenvalues

{\hat{λ}}_{1, N} \geq \dots \geq {\hat{λ}}_{M, N}

of

W_{N} W_{N}^{T}

does not converge towards the Dirac measure at point

λ = σ^{2}

. More precisely, we denote by

ν_{c, σ^{2}}

the Marchenko-Pastur distribution of parameters

(c, σ^{2})

defined as the probability measure

\begin{matrix} ν_{c, σ^{2}} (d λ) = δ_{0} {[1 - \frac{1}{c}]}_{+} + \frac{\sqrt{(λ - λ^{-}) (λ^{+} - λ)}}{2 σ^{2} c π λ} ⊮_{[λ^{-}, λ^{+}]} (λ) d λ \end{matrix}

(1)

with

λ^{-} = σ^{2} {(1 - \sqrt{c})}^{2}

and

λ^{+} = σ^{2} {(1 + \sqrt{c})}^{2}

. Then, the following result holds.

Theorem 1

([45]). The empirical eigenvalue distribution

{\hat{ν}}_{N}

converges weakly almost surely towards

ν_{c, σ^{2}}

when both M and N converge towards

+ \infty

in such a way that

c_{N} = \frac{M}{N}

converges towards

c > 0

. Moreover, it holds that

\begin{matrix} {\hat{λ}}_{1, N} & \to σ^{2} {(1 + \sqrt{c})}^{2} \quad \text{a.s.} \end{matrix}

(2)

\begin{matrix} {\hat{λ}}_{min (M, N), N} & \to σ^{2} {(1 - \sqrt{c})}^{2} \quad \text{a.s.} \end{matrix}

(3)

We also observe that Theorem 1 remains valid if

W_{N}

is not necessarily Gaussian, provided that its i.i.d. entries have a finite fourth-order moment (see e.g., [43]). Theorem 1 means that when ratio

\frac{M}{N}

is not small enough, the eigenvalues of the empirical spatial covariance matrix of a temporally and spatially white noise tend to spread out around the variance of the noise, and that almost surely, for N large enough, all the eigenvalues are located in a neighbourhood of interval

[λ^{-}, λ^{+}]

. See Figure 2 and Figure 3.
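Theorem 1 can be illustrated with a short simulation. The sketch below draws one Gaussian matrix with illustrative sizes (M = 400, N = 1600, unit variance, all assumed for this example) and compares the extreme eigenvalues and the first two spectral moments with the Marchenko-Pastur predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, sigma2 = 400, 1600, 1.0          # illustrative sizes, c = M / N = 0.25
c = M / N

V = rng.normal(scale=np.sqrt(sigma2), size=(M, N))
W = V / np.sqrt(N)                     # entries distributed as N(0, sigma2 / N)
eigs = np.linalg.eigvalsh(W @ W.T)     # spectrum of the empirical covariance

lam_minus = sigma2 * (1 - np.sqrt(c)) ** 2   # lower edge of the support
lam_plus = sigma2 * (1 + np.sqrt(c)) ** 2    # upper edge of the support

print("smallest eigenvalue:", eigs.min(), "predicted edge:", lam_minus)
print("largest eigenvalue:", eigs.max(), "predicted edge:", lam_plus)
print("second moment:", np.mean(eigs**2), "predicted:", sigma2**2 * (1 + c))
```

The eigenvalues spread over the whole interval between the two edges instead of concentrating around the noise variance.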

3. Classification in a Computational Information Geometry (CIG) Framework

3.1. Formulation Based on a $SNR$ -Type Criterion

We denote by

SNR = σ_{s}^{2} / σ^{2}

and

p_{i} (\cdot) = p (\cdot | H_{i})

with

i \in {0, 1}

. The equi-probable binary hypothesis test for classifying the random signal

s

, is

\begin{matrix} \{\begin{matrix} H_{0} : p_{0} (y_{N}; Φ, SNR = 0) = N (0, Σ_{0}), \\ H_{1} : p_{1} (y_{N}; Φ, SNR \neq 0) = N (0, Σ_{1}) \end{matrix} \end{matrix}

(4)

where

Σ_{0} = σ^{2} I_{N}

and

Σ_{1} = σ^{2} (SNR \times Φ Φ^{T} + I_{N})

. The null hypothesis data-space (

H_{0}

) is defined as

X_{0} = X \ X_{1}

where

\begin{matrix} X_{1} = \{y_{N} : Λ (y_{N}) = log \frac{p_{1} (y_{N})}{p_{0} (y_{N})} > τ^{'}\} \end{matrix}

is the alternative hypothesis (

H_{1}

) data-space. Following the above expression, the log-likelihood ratio test

Λ (y_{N})

and the binary classification threshold

τ^{'}

are given by

\begin{matrix} Λ (y_{N}) & = \frac{y_{N}^{T} Φ {(Φ^{T} Φ + SNR \times I)}^{- 1} Φ^{T} y_{N}}{σ^{2}}, \\ τ^{'} & = - log \det (SNR \times Φ Φ^{T} + I_{N}) \end{matrix}

where

\det (\cdot)

and

log (\cdot)

are respectively the determinant and the natural logarithm.

3.2. The Expected Log-likelihood Ratio in Geometry Perspective

We note that the estimated hypothesis

\hat{H}

is associated to

p (y_{N} | \hat{H}) = N (0, Σ)

. Therefore, the expected log-likelihood ratio is defined by

\begin{matrix} E_{y_{N} | \hat{H}} Λ (y_{N}) & = \int_{X} p (y_{N} | \hat{H}) log \frac{p_{1} (y_{N})}{p_{0} (y_{N})} d y_{N} \\ = KL (\hat{H} | | H_{0}) - KL (\hat{H} | | H_{1}) \\ = \frac{1}{σ^{2}} Tr \{{(Φ^{T} Φ + SNR \times I)}^{- 1} Φ^{T} Σ Φ\} \end{matrix}

where

\begin{matrix} KL (\hat{H} | | H_{i}) = \int_{X} p (y_{N} | \hat{H}) log \frac{p (y_{N} | \hat{H})}{p_{i} (y_{N})} d y_{N} \end{matrix}

is the Kullback-Leibler Divergence (KLD) [10]. The expected log-likelihood ratio test admits a simple geometric characterization based on the difference of two KLDs [8]. However, it is often difficult to evaluate the performance of the test via the minimal Bayes’ error probability

P_{e}^{(N)}

, since its expression cannot be determined analytically in closed-form [3,8].

The minimal Bayes’ error probability conditionally to vector

y_{N}

is defined as

\begin{matrix} \Pr (Error | y_{N}) = \frac{1}{2} min {P_{1, 0}, P_{0, 1}} \end{matrix}

where

P_{i, i^{'}} = \Pr (H_{i} | y_{N} \in X_{i^{'}})

.

3.3. CUB

According to [24], the relation between the Chernoff Upper Bound and the (average) minimal Bayes’ error probability

P_{e}^{(N)} = E \Pr (Error | y_{N})

is given by

\begin{matrix} P_{e}^{(N)} \leq \frac{1}{2} \times exp [- {\tilde{μ}}_{N} (s)] \end{matrix}

(5)

where the (Chernoff) s-divergence for

s \in (0, 1)

is given by

\begin{matrix} {\tilde{μ}}_{N} (s) = - log M_{Λ (y_{N} | H_{1})} (- s) \end{matrix}

(6)

in which

M_{X} (t) = E exp [t \times X]

is the moment generating function (mgf) of variable X. The error exponent, denoted by

\tilde{μ} (s)

, is given by the Chernoff information, which is an asymptotic characterization of the exponential decay of the minimal Bayes’ error probability. The error exponent is derived thanks to Stein’s lemma according to [13]

\begin{matrix} - lim_{N \to \infty} \frac{log P_{e}^{(N)}}{N} = lim_{N \to \infty} \frac{{\tilde{μ}}_{N} (s)}{N} \overset{def .}{=} \tilde{μ} (s) . \end{matrix}

As parameter

s \in (0, 1)

is free, the CUB can be tightened by optimizing over this parameter, i.e., by maximizing the error exponent:

\begin{matrix} s^{⋆} = arg max_{s \in (0, 1)} \tilde{μ} (s) . \end{matrix}

(7)

Finally, using Equations (5) and (7), the Chernoff Upper Bound (CUB) is obtained. Instead of solving Equation (7), the Bhattacharyya Upper Bound (BUB) is calculated by Equation (5) and by fixing

s = 1 / 2

. Therefore we have the following relation of order:

\begin{matrix} P_{e}^{(N)} \leq \frac{1}{2} \times exp [- {\tilde{μ}}_{N} (s^{⋆})] \leq \frac{1}{2} \times exp [- {\tilde{μ}}_{N} (1 / 2)] . \end{matrix}

Lemma 1.

The log-moment generating function given by Equation (6) for test of Equation (4) is given by

\begin{matrix} {\tilde{μ}}_{N} (s) & = - \frac{1 - s}{2} log det (SNR \times Φ Φ^{T} + I) \\ + \frac{1}{2} log det (SNR \times (1 - s) Φ Φ^{T} + I) . \end{matrix}

(8)

Proof.

See Appendix A. ◻

From now on, to simplify the presentation and the numerical results later on, we denote by

\begin{matrix} μ_{N} (s) = - {\tilde{μ}}_{N} (s) \\ μ (s) = - \tilde{μ} (s) \end{matrix}

for all

s \in [0, 1]

, the opposites of the log-moment generating function and its limit.

Remark 1.

The functions

μ_{N} (s), μ (s)

are negative, since the s-divergence

{\tilde{μ}}_{N} (s)

is positive for all

s \in [0, 1]

.
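To make Lemma 1 and the ordering between the CUB and the BUB concrete, the following sketch evaluates Equation (8) on a grid of s values. The Gaussian matrix `Phi`, its size, the SNR value and the grid search are assumptions of this example, not the procedure used in the paper.

```python
import numpy as np

def s_divergence(s, snr, Phi):
    """Chernoff s-divergence of Equation (8), computed from log-determinants."""
    N = Phi.shape[0]
    G = Phi @ Phi.T
    _, ld_full = np.linalg.slogdet(snr * G + np.eye(N))
    _, ld_part = np.linalg.slogdet(snr * (1 - s) * G + np.eye(N))
    return -(1 - s) / 2 * ld_full + 0.5 * ld_part

rng = np.random.default_rng(2)
N, K, snr = 60, 20, 10.0                     # illustrative values (assumed)
Phi = rng.standard_normal((N, K)) / np.sqrt(N)

grid = np.linspace(0.01, 0.99, 99)
values = np.array([s_divergence(s, snr, Phi) for s in grid])
s_star = grid[np.argmax(values)]             # maximizer of the s-divergence

cub = 0.5 * np.exp(-values.max())            # Chernoff Upper Bound
bub = 0.5 * np.exp(-s_divergence(0.5, snr, Phi))  # Bhattacharyya Upper Bound
print("s* =", s_star, "CUB =", cub, "BUB =", bub)
```

At this SNR the maximizer sits away from 1/2, so the CUB is strictly tighter than the BUB.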

3.4. Fisher Information

In the small deviation regime, we assume that

δ SNR

is a small deviation of the SNR. The new binary hypothesis test is

\begin{matrix} \{\begin{matrix} H_{0} & : & y | δ SNR = 0 \sim N (0, Σ (0)), \\ H_{1} & : & y | δ SNR \neq 0 \sim N (0, Σ (δ SNR)) \end{matrix} \end{matrix}

where

Σ (x) = x \times Φ Φ^{T} + I .

The s-divergence in the small

SNR

deviation scenario is written as

μ_{N} (s) = \frac{1 - s}{2} log det [Σ (δ SNR)] - \frac{1}{2} log det [Σ (δ SNR \times (1 - s))]

Lemma 2.

The s-divergence in the small deviation regime can be approximated according to

\begin{matrix} \frac{μ_{N} (s)}{N} \overset{δ SNR ≪ 1}{\approx} (s - 1) s \times \frac{{(δ SNR)}^{2}}{2} \times \frac{J_{F} (0)}{N} \end{matrix}

where the Fisher information [3] is given by

\begin{matrix} J_{F} (x) = \frac{1}{2} Tr ({(I + x \times Φ Φ^{T})}^{- 1} Φ Φ^{T} {(I + x \times Φ Φ^{T})}^{- 1} Φ Φ^{T}) . \end{matrix}

Proof.

See Appendix B. ◻
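The quadratic approximation of Lemma 2 can be checked numerically. In this sketch, the matrix `Phi`, the deviation `delta` and the value of s are illustrative assumptions; the exact s-divergence is computed from the eigenvalues of Φ Φ^T.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 50, 10                                # illustrative sizes (assumed)
Phi = rng.standard_normal((N, K)) / np.sqrt(N)
G = Phi @ Phi.T
lam = np.linalg.eigvalsh(G).clip(min=0.0)    # eigenvalues of Phi Phi^T

def mu_N(s, x):
    """mu_N(s) for Sigma(x) = x * Phi Phi^T + I, written with eigenvalues."""
    return ((1 - s) / 2 * np.sum(np.log1p(x * lam))
            - 0.5 * np.sum(np.log1p(x * (1 - s) * lam)))

J_F0 = 0.5 * np.trace(G @ G)                 # Fisher information at x = 0
delta, s = 1e-2, 0.3
exact = mu_N(s, delta)
approx = (s - 1) * s * delta**2 / 2 * J_F0   # Lemma 2 (before dividing by N)
print(exact, approx)
```

Since the approximation is proportional to (s - 1) s, it is minimized at s = 1/2, which is the low-SNR optimum stated after the lemma.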

According to Lemma 2, the optimal s-value at low

SNR

is

s^{⋆} \overset{δ SNR ≪ 1}{=} \frac{1}{2}

. In contrast, the optimal s-value for larger

SNR

is given by the following lemma.

Lemma 3.

In case of large

SNR

, we have

\begin{matrix} s^{⋆} \overset{SNR ≫ 1}{\approx} 1 - \frac{1}{log SNR + \frac{1}{K} \sum_{n = 1}^{K} log λ_{n}} . \end{matrix}

(9)

where

{(λ_{n})}_{n = 1, \dots, K}

are the K non-zero eigenvalues of

Φ Φ^{T}

.

Proof.

See Appendix C. ◻
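The large-SNR approximation of Equation (9) can be compared with the exact stationary point of μ_N(s). The sketch below assumes a Gaussian matrix `Phi` of size N × K and an SNR of 10^4, and solves the first-order condition by bisection; these choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, snr = 80, 20, 1e4                      # illustrative values (assumed)
Phi = rng.standard_normal((N, K)) / np.sqrt(N)
lam = np.linalg.eigvalsh(Phi.T @ Phi)        # the K non-zero eigenvalues of Phi Phi^T

def d_mu(s):
    """Derivative in s of mu_N(s) (the opposite of the s-divergence)."""
    return (-0.5 * np.sum(np.log1p(snr * lam))
            + 0.5 * np.sum(snr * lam / (1 + (1 - s) * snr * lam)))

# mu_N is convex in s, so d_mu is increasing: bisection finds its unique root.
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if d_mu(mid) < 0:
        lo = mid
    else:
        hi = mid
s_exact = 0.5 * (lo + hi)

s_approx = 1 - 1 / (np.log(snr) + np.mean(np.log(lam)))   # Equation (9)
print(s_exact, s_approx)
```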

4. Computational Information Geometry for Classification

4.1. Formulation of the Observation Vector as a Structured Linear Model

The measurement tensor is a noisy Q-order tensor of size

N_{1} \times \dots \times N_{Q}

that can be expressed as

\begin{matrix} Y = X + N \end{matrix}

(10)

where

N

is the noise tensor whose entries are assumed to be centered i.i.d. Gaussian, i.e.,

{[N]}_{n_{1}, \dots, n_{Q}} \sim N (0, σ^{2})

and the signal tensor

X

follows either CPD or TKD given by Section 2.1.2 and Section 2.1.3, respectively. The vectorization of Equation (10) is given by

\begin{matrix} y_{N} = vec (Y_{(1)}) = x + n \end{matrix}

(11)

where

n = vec (N_{(1)})

and

x = vec (X_{(1)})

. Note that

Y_{(1)}

,

N_{(1)}

and

X_{(1)}

are respectively the first unfolding matrices given by Definition 4 of tensors

Y, N

and

X

.

When tensor $X$ follows a Q-order CPD with a canonical rank of R, we have

$\begin{matrix} x = vec \{Φ^{(1)} S {(Φ^{(Q)} ⊙ \dots ⊙ Φ^{(2)})}^{T}\} = Φ^{⊙} s \end{matrix}$

where $Φ^{⊙} = Φ^{(Q)} ⊙ \dots ⊙ Φ^{(1)}$ is an $N \times R$ structured matrix and $s = {[\begin{matrix} s_{1} & \dots & s_{R} \end{matrix}]}^{T}$ where $s_{r} \sim N (0, σ_{s}^{2})$ , i.i.d. and $N = N_{1} \dots N_{Q}$ .

When tensor $X$ follows a Q-order TKD of multilinear rank of ${M_{1}, \dots, M_{Q}}$ , we have

$\begin{matrix} x = vec \{Φ^{(1)} S_{(1)} {(Φ^{(Q)} \otimes \dots \otimes Φ^{(2)})}^{T}\} = Φ^{\otimes} vec (S) \end{matrix}$

where $Φ^{\otimes} = Φ^{(Q)} \otimes \dots \otimes Φ^{(1)}$ is an $N \times M$ structured matrix with $M = M_{1} \dots M_{Q}$ and $vec (S)$ is the vectorization of tensor $S$ where $s_{m_{1}, \dots, m_{Q}} \sim N (0, σ_{s}^{2})$ , i.i.d.
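Both structured linear models can be verified numerically for Q = 3. The sketch below uses small random factors and a column-major vectorization; all sizes and the helper `khatri_rao` are assumptions of this example.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R)."""
    I, R = A.shape
    J = B.shape[0]
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

rng = np.random.default_rng(5)
N1, N2, N3 = 3, 4, 5                          # tensor sizes (assumed)

# CPD case: x = (Phi3 kr Phi2 kr Phi1) s.
R = 2
P = [rng.standard_normal((n, R)) for n in (N1, N2, N3)]
s = rng.standard_normal(R)
X_cpd = np.einsum("r,ir,jr,kr->ijk", s, *P)
x_cpd = X_cpd.reshape(-1, order="F")          # column-major vectorization
Phi_kr = khatri_rao(P[2], khatri_rao(P[1], P[0]))

# TKD case: x = (Phi3 kron Phi2 kron Phi1) vec(S).
M1, M2, M3 = 2, 3, 2
F = [rng.standard_normal((n, m)) for n, m in ((N1, M1), (N2, M2), (N3, M3))]
S = rng.standard_normal((M1, M2, M3))
X_tkd = np.einsum("abc,ia,jb,kc->ijk", S, *F)
x_tkd = X_tkd.reshape(-1, order="F")
Phi_kron = np.kron(F[2], np.kron(F[1], F[0]))

print(np.allclose(x_cpd, Phi_kr @ s),
      np.allclose(x_tkd, Phi_kron @ S.reshape(-1, order="F")))
```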

4.2. The CPD Case

We recall that in the CPD case, matrix

Φ^{⊙} = Φ^{(Q)} ⊙ \dots ⊙ Φ^{(1)}

and

{(Φ^{(q)})}_{q = 1, \dots, Q}

are matrices of size

N_{q} \times R

. In the following, we assume that matrices

{(Φ^{(q)})}_{q = 1, \dots, Q}

are random matrices with Gaussian

N (0, \frac{1}{N_{q}})

variate entries. We evaluate the behavior of

\frac{μ_{N} (s)}{N}

when

{(N_{q})}_{q = 1, \dots, Q}

converge towards

+ \infty

at the same rate and that

\frac{R}{N}

converges towards a non zero limit.

Result 1.

In the asymptotic regime where

N_{1}, \dots, N_{Q}

converge towards

+ \infty

at the same rate and where

R \to + \infty

in such a way that

c_{R} = \frac{R}{N}

converges towards a finite constant

c > 0

, it holds that

\begin{matrix} \frac{μ_{N} (s)}{N} & \overset{a . s}{⟶} μ (s) = \frac{1 - s}{2} Ψ_{c} (SNR) - \frac{1}{2} Ψ_{c} ((1 - s) \times SNR) \end{matrix}

(12)

with

a . s

standing for “almost sure convergence” and

\begin{matrix} Ψ_{c} (x) & = log (1 + \frac{2 c}{u (x) + (1 - c)}) \\ + c \times log (1 + \frac{2}{u (x) - (1 - c)}) \\ - \frac{4 c}{x (u {(x)}^{2} - {(1 - c)}^{2})} \end{matrix}

(13)

with

u (x) = \frac{1}{x} + \sqrt{(\frac{1}{x} + λ_{c}^{+}) (\frac{1}{x} + λ_{c}^{-})}

where

λ_{c}^{\pm} = {(1 \pm \sqrt{c})}^{2}

.

Proof.

See Appendix D. ◻
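Equations (12) and (13) are straightforward to evaluate numerically. The sketch below implements Ψ_c and μ(s), and compares Ψ_c with a normalized log-determinant. As a simplifying assumption of this example, the comparison uses a plain i.i.d. Gaussian N × R matrix with N(0, 1/N) entries rather than a Khatri-Rao structured matrix; the function names `psi` and `mu` are introduced here.

```python
import numpy as np

def psi(x, c):
    """Psi_c(x) of Equation (13)."""
    lam_p, lam_m = (1 + np.sqrt(c)) ** 2, (1 - np.sqrt(c)) ** 2
    u = 1 / x + np.sqrt((1 / x + lam_p) * (1 / x + lam_m))
    return (np.log1p(2 * c / (u + (1 - c)))
            + c * np.log1p(2 / (u - (1 - c)))
            - 4 * c / (x * (u**2 - (1 - c) ** 2)))

def mu(s, snr, c):
    """Limit error exponent mu(s) of Equation (12), for 0 <= s < 1."""
    return (1 - s) / 2 * psi(snr, c) - 0.5 * psi((1 - s) * snr, c)

rng = np.random.default_rng(6)
N, R, x = 600, 300, 2.0                       # illustrative values, c = 0.5
c = R / N
Phi = rng.standard_normal((N, R)) / np.sqrt(N)
# (1/N) log det(I_N + x Phi Phi^T) equals (1/N) log det(I_R + x Phi^T Phi).
empirical = np.linalg.slogdet(np.eye(R) + x * Phi.T @ Phi)[1] / N

print("empirical:", empirical, "Psi_c(x):", psi(x, c))
print("mu(1/2):", mu(0.5, x, c))
```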

Remark 2.

In [49], the Central Limit Theorem (CLT) for the linear eigenvalue statistics of the tensor version of the sample covariance matrix of type

Φ^{⊙} {(Φ^{⊙})}^{T}

is established, for

Φ^{⊙} = Φ^{(2)} ⊙ Φ^{(1)}

, i.e., the tensor order is

Q = 2

.

4.2.1. Small $SNR$ Deviation Scenario

In this section, we assume that

SNR

is small. Under this regime, we have the following result:

Result 2.

In the small

SNR

scenario, the error exponent at s = 1/2 for the CPD, which is governed by the Fisher information, is approximated by

\begin{matrix} μ (\frac{1}{2}) \overset{SNR ≪ 1}{\approx} - \frac{{(SNR)}^{2}}{16} \times c (1 + c) . \end{matrix}

Proof.

Using Lemma 2, we can notice that

\frac{J_{F} (0)}{N} = \frac{1}{2} \frac{R}{N} \frac{1}{R} Tr [{(Φ^{⊙} {(Φ^{⊙})}^{T})}^{2}]

and that

\frac{1}{R} Tr [{(Φ^{⊙} {(Φ^{⊙})}^{T})}^{2}]

converges a.s towards the second moment of the Marchenko-Pastur distribution which is

1 + c

(see for instance [43]). ◻

Note that

μ (\frac{1}{2})

is the error exponent related to the Bhattacharyya divergence.

4.2.2. Large $SNR$ Deviation Scenario

Result 3.

In case of large

SNR

, the minimizer of Chernoff Information is given by

\begin{matrix} s^{⋆} \overset{SNR ≫ 1}{\approx} 1 - \frac{1}{log SNR - 1 - \frac{1 - c}{c} log (1 - c)} . \end{matrix}

(14)

Proof.

It is straightforward to notice that

\frac{1}{K} \sum_{n = 1}^{K} log (λ_{n}) ⟶ \int_{0}^{+ \infty} log (λ) d ν_{c} (λ) = - 1 - \frac{1 - c}{c} log (1 - c) .

The last equality can be obtained as in [50]. Using Lemma 3, we get immediately Equation (14). ◻
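The closed-form value of the integral used in this proof can be cross-checked by integrating the Marchenko-Pastur density numerically. This is a sketch assuming σ² = 1 and c < 1; the trapezoidal rule and the grid size are choices made for this example.

```python
import numpy as np

def mp_integrals(c, num=200001):
    """Total mass and log-moment of the Marchenko-Pastur density (sigma^2 = 1, c < 1)."""
    lam_m, lam_p = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
    lam = np.linspace(lam_m, lam_p, num)
    dens = np.sqrt(np.clip((lam - lam_m) * (lam_p - lam), 0.0, None)) / (2 * np.pi * c * lam)
    step = np.diff(lam)
    mass = np.sum(0.5 * (dens[1:] + dens[:-1]) * step)      # trapezoidal rule
    f = np.log(lam) * dens
    log_moment = np.sum(0.5 * (f[1:] + f[:-1]) * step)
    return mass, log_moment

c = 0.5
mass, log_moment = mp_integrals(c)
closed_form = -1 - (1 - c) / c * np.log(1 - c)
print(mass, log_moment, closed_form)
```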

Remark 3.

It is interesting to note that for

c \to 0

or 1, the optimal s-value follows the same approximated relation given by

\begin{matrix} s^{⋆} \overset{SNR ≫ 1}{\approx} 1 - \frac{1}{log SNR} \end{matrix}

as long as

SNR ≫ exp [1]

or equivalently a

SNR

in dB much larger than 4 dB.

Proof.

It is straightforward to note that

\begin{matrix} \frac{1 - c}{c} log (1 - c) \overset{c \to 1}{⟶} 0, \quad \text{and} \quad \frac{1 - c}{c} log (1 - c) \overset{c \to 0}{⟶} - 1 . \end{matrix}

Using Equation (14) and condition

SNR ≫ exp [1]

, the desired result is proved. ◻

4.2.3. Approximated Analytical Expressions for $c ≪ 1$ and Any $SNR$

In the case of low rank CPD where its rank R is supposed to be small compared to N, it is realistic to assume

c ≪ 1

since

R ≪ N

.

Result 4.

Under this regime, the error exponent can be approximated as follows:

\begin{matrix} μ (s) \overset{c ≪ 1}{\approx} \frac{c}{2} ((1 - s) log (1 + SNR) - log (1 + (1 - s) SNR)) . \end{matrix}

Proof.

See Appendix E. ◻

It is easy to notice that the second-order derivative of

μ (s)

is strictly positive. Therefore,

μ (s)

is a strictly convex function over interval

(0, 1)

. As a consequence,

μ (s)

admits at most one global minimum. We denote by

s^{⋆}

, the global minimizer, obtained by zeroing the first-order derivative of the error exponent. This optimal value is expressed as

\begin{matrix} s^{⋆} \overset{c ≪ 1}{\approx} 1 + \frac{1}{SNR} - \frac{1}{log (1 + SNR)} . \end{matrix}

(15)

The two following scenarios can be considered:

At low $SNR$, the error exponent $μ (s^{⋆})$ associated with the tightest CUB coincides with the error exponent associated with the BUB. To see this, when $c ≪ 1$, we derive the second-order approximation of the optimal value $s^{⋆}$ in Equation (15):

$\begin{matrix} s^{⋆} & \overset{2}{\approx} 1 + \frac{1}{SNR} (1 - (1 + \frac{SNR}{2})) = \frac{1}{2} . \end{matrix}$

Result 1 and the above approximation allow us to get the best error exponent at low $SNR$ and $c ≪ 1$ ,

$\begin{matrix} μ (\frac{1}{2}) & \overset{SNR ≪ 1}{\approx} \frac{1}{4} Ψ_{c ≪ 1} (SNR) - \frac{1}{2} Ψ_{c ≪ 1} (\frac{SNR}{2}) \\ & = \frac{c}{2} log \frac{\sqrt{1 + SNR}}{1 + \frac{SNR}{2}} . \end{matrix}$

Contrarily, when $SNR \to \infty$, $s^{⋆} \to 1$. As a consequence, the optimal error exponent in this regime is no longer the BUB. Assuming that $\frac{log SNR}{SNR} \to 0$, substituting Equation (15) into Result 4 provides the following approximation of the optimal error exponent for large $SNR$:

$\begin{matrix} μ (s^{⋆}) \overset{SNR ≫ 1}{\approx} \frac{c}{2} (1 - log SNR + log log (1 + SNR)) . \end{matrix}$
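The expressions of this subsection can be checked numerically. The sketch below evaluates the approximation of Result 4, the closed-form minimizer of Equation (15), and the large-SNR value of the optimal exponent, and compares the closed form with a brute-force grid search. The function names and the grid size are illustrative assumptions; the SNR is on a linear scale.

```python
import math

def mu_small_c(s, snr, c):
    """Result 4: approximate error exponent for c << 1."""
    return 0.5 * c * ((1.0 - s) * math.log1p(snr) - math.log1p((1.0 - s) * snr))

def s_star_small_c(snr):
    """Equation (15): stationary point of mu_small_c with respect to s."""
    return 1.0 + 1.0 / snr - 1.0 / math.log1p(snr)

def mu_large_snr(snr, c):
    """Large-SNR approximation of the optimal error exponent."""
    return 0.5 * c * (1.0 - math.log(snr) + math.log(math.log1p(snr)))

def grid_minimizer(snr, c, n=10001):
    """Brute-force minimizer of mu_small_c over a uniform grid on [0, 1]."""
    grid = [k / (n - 1) for k in range(n)]
    return min(grid, key=lambda s: mu_small_c(s, snr, c))

for snr_db in (-20.0, 10.0, 40.0):
    snr = 10.0 ** (snr_db / 10.0)
    print(snr_db, s_star_small_c(snr), grid_minimizer(snr, 0.01))
```

At low SNR the minimizer stays close to 1/2, and it moves towards 1 as the SNR grows, in line with the two scenarios described above.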

4.3. The TKD Case

In the TKD case, we recall that matrix

Φ^{\otimes} = Φ^{(Q)} \otimes \dots \otimes Φ^{(1)}

, where

{(Φ^{(q)})}_{1 \leq q \leq Q}

are

N_{q} \times M_{q}

matrices. We still assume that the matrices

{(Φ^{(q)})}_{1 \leq q \leq Q}

are random matrices with Gaussian

N (0, \frac{1}{N_{q}})

entries.

Result 5.

In the asymptotic regime where

M_{q} < N_{q}, 1 \leq q \leq Q

and

M_{q}, N_{q}

converge towards

+ \infty

at the same rate such that

\frac{M_{q}}{N_{q}} \to c_{q}

, where

0 < c_{q} < 1

, it holds

\begin{matrix} \frac{μ_{N} (s)}{N} \overset{a . s}{⟶} μ (s) = c_{1} \dots c_{Q} [\frac{1 - s}{2} \int_{0}^{+ \infty} \dots \int_{0}^{+ \infty} log (1 + SNR \times λ_{1} \dots λ_{Q}) d ν_{c_{1}} (λ_{1}) \dots d ν_{c_{Q}} (λ_{Q}) \\ - \frac{1}{2} \int_{0}^{+ \infty} \dots \int_{0}^{+ \infty} log (1 + (1 - s) SNR \times λ_{1} \dots λ_{Q}) d ν_{c_{1}} (λ_{1}) \dots d ν_{c_{Q}} (λ_{Q})] \end{matrix}

(16)

where

ν_{c_{q}}

are Marchenko-Pastur distributions of parameters

(c_{q}, 1)

defined as in Equation (1).

Proof.

See Appendix F. ◻

Remark 4.

We can notice that for

Q = 1

, Result 5 is similar to Result 1. However, when

Q \geq 2

, the integrals in Equation (16) do not admit a closed-form expression. For instance, let

Q = 2

, we consider the integral

\begin{matrix} \int_{- \infty}^{+ \infty} \int_{- \infty}^{+ \infty} log (1 + SNR \times λ_{1} λ_{2}) ν_{c_{1}} (d λ_{1}) ν_{c_{2}} (d λ_{2}) \\ = \int_{λ_{c_{1}}^{-}}^{λ_{c_{1}}^{+}} \int_{λ_{c_{2}}^{-}}^{λ_{c_{2}}^{+}} log (1 + SNR \times λ_{1} λ_{2}) \frac{\sqrt{(λ_{1} - λ_{c_{1}}^{-}) (λ_{c_{1}}^{+} - λ_{1})}}{2 π c_{1} λ_{1}} \frac{\sqrt{(λ_{2} - λ_{c_{2}}^{-}) (λ_{c_{2}}^{+} - λ_{2})}}{2 π c_{2} λ_{2}} d λ_{1} d λ_{2} \end{matrix}

where

λ_{c_{i}}^{\pm} = {(1 \pm \sqrt{c_{i}})}^{2}, i = 1, 2

. This integral involves elliptic integrals (see e.g., [51]) and, as a consequence, cannot be expressed in closed form. However, numerical integration can be used to solve the minimization problem of Equation (7) efficiently.
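One possible way to carry out such a numerical computation for Q = 2 is sketched below. It is an illustrative assumption, not the authors' implementation: the change of variable λ = (1 + c) + 2√c cos θ absorbs the square-root factor of the Marchenko-Pastur density, and a midpoint rule in θ is then applied. The function names and the number of nodes are hypothetical.

```python
import math

def mp_nodes(c, n=64):
    """Quadrature nodes and weights for the Marchenko-Pastur law (c, 1), 0 < c < 1.

    With lambda = (1 + c) + 2 sqrt(c) cos(theta), the density times d(lambda)
    becomes 2 sin(theta)^2 / (pi * lambda) d(theta) on (0, pi).
    """
    nodes, weights = [], []
    for k in range(n):
        theta = (k + 0.5) * math.pi / n          # midpoint of the k-th cell
        lam = (1.0 + c) + 2.0 * math.sqrt(c) * math.cos(theta)
        nodes.append(lam)
        weights.append(2.0 * math.sin(theta) ** 2 / (n * lam))
    return nodes, weights

def log_integral(snr, c1, c2, n=64):
    """Double integral of log(1 + SNR * l1 * l2) against nu_c1 and nu_c2."""
    x1, w1 = mp_nodes(c1, n)
    x2, w2 = mp_nodes(c2, n)
    return sum(wa * wb * math.log1p(snr * a * b)
               for a, wa in zip(x1, w1) for b, wb in zip(x2, w2))

def mu_tkd(s, snr, c1, c2, n=64):
    """Error exponent of Result 5 for Q = 2, evaluated by quadrature."""
    return c1 * c2 * (0.5 * (1.0 - s) * log_integral(snr, c1, c2, n)
                      - 0.5 * log_integral((1.0 - s) * snr, c1, c2, n))

# Coarse grid search for the minimizer at SNR = 20 dB.
snr = 100.0
grid = [k / 50.0 for k in range(51)]
print(min(grid, key=lambda s: mu_tkd(s, snr, 0.5, 0.6)))
```

The resulting univariate function of s can then be passed to any standard one-dimensional minimizer; the coarse grid above is only meant to show the idea.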

4.3.1. Large $SNR$ Deviation Scenario

Result 6.

In case of large

SNR

, the minimizer of Chernoff Information for TKD is given by

\begin{matrix} s^{⋆} \overset{SNR ≫ 1}{\approx} 1 - \frac{1}{log SNR - Q - \sum_{i = 1}^{Q} \frac{1 - c_{i}}{c_{i}} log (1 - c_{i})} . \end{matrix}

(17)

Proof.

We have that

\begin{matrix} \frac{1}{M} \sum_{n = 1}^{M} log (λ_{n}) & ⟶ \sum_{q = 1}^{Q} \int_{0}^{+ \infty} log (λ_{q}) d ν_{c_{q}} (λ_{q}) \\ = \sum_{q = 1}^{Q} (- 1 - \frac{1 - c_{q}}{c_{q}} log (1 - c_{q})) \\ = - Q - \sum_{q = 1}^{Q} \frac{1 - c_{q}}{c_{q}} log (1 - c_{q}) . \end{matrix}

Using Lemma 3, we get immediately Equation (17). ◻

4.3.2. Small $SNR$ Deviation Scenario

Under this regime, we have the following results

Result 7.

For small

SNR

deviation, the Chernoff information for the TKD is given by

\begin{matrix} μ (\frac{1}{2}) \overset{δ SNR ≪ 1}{\approx} - \frac{{(δ SNR)}^{2}}{16} \prod_{q = 1}^{Q} c_{q} \times (1 + c_{q}) . \end{matrix}

Proof.

Using Lemma 2, we can notice that

\frac{J_{F} (0)}{N} = \frac{1}{2} \frac{M}{N} \frac{1}{M} Tr [{(Φ^{\otimes} {(Φ^{\otimes})}^{T})}^{2}] = \frac{1}{2} \frac{M}{N} \prod_{q = 1}^{Q} \frac{Tr [{(Φ^{(q)} {Φ^{(q)}}^{T})}^{2}]}{M_{q}} .

Each term in the product converges a.s. towards the second moment of the Marchenko-Pastur distribution

ν_{c_{q}}

, which is

1 + c_{q}

, and

\frac{M}{N}

converges to

\prod_{q = 1}^{Q} c_{q}

. This proves the desired result. ◻
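The factorization of the trace over the Kronecker factors, which is the key step of this proof, can be verified on small matrices. The sketch below uses plain Python lists and hypothetical helper names; the matrix sizes and the random seed are arbitrary.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def trace_gram_squared(phi):
    """Tr[(Phi Phi^T)^2]."""
    g = matmul(phi, transpose(phi))
    g2 = matmul(g, g)
    return sum(g2[i][i] for i in range(len(g2)))

rng = random.Random(1)
phi1 = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(3)]   # 3 x 2
phi2 = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]   # 4 x 3

lhs = trace_gram_squared(kron(phi2, phi1))
rhs = trace_gram_squared(phi1) * trace_gram_squared(phi2)
print(lhs, rhs)   # the two values agree up to rounding
```

The identity holds because the Gram matrix of a Kronecker product is the Kronecker product of the Gram matrices, and the trace of a Kronecker product is the product of the traces.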

Remark 5.

In contrast to Remark 3, it is interesting to note that for

c_{1} = c_{2} = \dots = c_{Q} = c

and

c \to 0

or 1, the optimal s-value follows two different approximate relations, given by

\begin{matrix} s^{⋆} \approx_{c \to 0}^{SNR ≫ 1} 1 - \frac{1}{log SNR} \end{matrix}

which does not depend on Q, and

\begin{matrix} s^{⋆} \approx_{c \to 1}^{SNR ≫ 1} 1 - \frac{1}{log SNR - Q} \end{matrix}

which depends on Q.

In practice, when c is close to 1, we have to carefully check if Q is in the neighbourhood of

log (SNR)

. Indeed, when

log SNR - Q < 0

or

0 < log SNR - Q < 1

, the above approximation yields

s^{⋆} \notin [0, 1]

.
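The practical check described above can be sketched as follows. The code evaluates the approximation of Result 6 and tests whether the result lies in [0, 1]. The function names are hypothetical, the SNR is assumed to be on a linear scale, and each c_q is assumed to satisfy 0 < c_q < 1.

```python
import math

def s_star_tkd_large_snr(snr, cs):
    """Result 6: large-SNR approximation of the optimal s for the TKD model.

    cs holds the ratios c_q = M_q / N_q; Q is taken as len(cs).
    A denominator equal to zero raises ZeroDivisionError.
    """
    correction = sum((1.0 - c) / c * math.log1p(-c) for c in cs)
    denom = math.log(snr) - len(cs) - correction
    return 1.0 - 1.0 / denom

def is_valid(s):
    """The approximation is only meaningful when it lands in [0, 1]."""
    return 0.0 <= s <= 1.0

# 50 dB with moderate ratios: the approximation is usable.
s_ok = s_star_tkd_large_snr(1e5, (0.5, 0.6, 0.7))
# log(SNR) - Q = 0.5 with ratios close to 1: the approximation breaks down.
s_bad = s_star_tkd_large_snr(math.exp(3.5), (0.999, 0.999, 0.999))
print(s_ok, is_valid(s_ok), s_bad, is_valid(s_bad))
```

When the check fails, the large-SNR approximation should not be used and the minimization of the exact error exponent is the safer route.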

5. Numerical Illustrations

In this section, we consider cubic tensors of order

Q = 3

with

N_{1} = 10, N_{2} = 20, N_{3} = 30, R = 3000

following a CPD and

M_{1} = 100, M_{2} = 120, M_{3} = 140, N_{1} = N_{2} = N_{3} = 200

for the TKD, respectively.

Firstly, for the CPD model, in Figure 4, parameter

s^{⋆}

is drawn with respect to the

SNR

in dB. The parameter

s^{⋆}

is obtained thanks to three different methods. The first one is based on the brute force/exhaustive computation of the CUB by minimizing the expression in Equation (8) with

Φ = Φ^{⊙}

. This approach has a very high computational cost, especially in our asymptotic regime (on a standard computer with an Intel Xeon E5-2630 2.3 GHz processor and 32 GB of RAM, 10,000 simulations require 183 h). The second approach is based on the numerical optimization of the closed-form expression of

μ (s)

given in Result 4. In this scenario, the drawback in terms of the computational cost is largely mitigated since it consists of a minimization of a univariate regular function. Finally, under the hypothesis that

SNR

is large, typically >30 dB, the optimal s-value,

s^{⋆}

, is given by the analytic expression of Equation (15). We can check that the proposed semi-analytic and analytic expressions are in good agreement with the brute-force method at a much lower computational cost. Moreover, we compute the mean square relative error

\frac{1}{L} \sum_{l = 1}^{L} {(\frac{{\hat{s}}_{l}^{⋆} - s^{⋆}}{s^{⋆}})}^{2}

where L = 10,000 is the number of Monte–Carlo samples and where

{\hat{s}}_{l}^{⋆} = arg {min}_{s \in [0, 1]} μ_{N, l} (s)

and

s^{⋆} = arg {min}_{s \in [0, 1]} μ (s)

. It turns out that the mean square relative errors are on average of order

- 40

dB. We can conclude that the estimator

{\hat{s}}^{⋆}

is a consistent estimator of

s^{⋆}

.
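The mean square relative error defined above can be computed as follows. This is a minimal sketch: the function name is hypothetical, and the conversion to dB as 10 log10 of the squared-error ratio is an assumption about the convention used.

```python
import math

def msre_db(estimates, s_ref):
    """Mean square relative error of the estimates of s*, expressed in dB.

    Assumption: dB means 10 * log10 of the (already squared) error ratio.
    """
    rel_sq = [((s_hat - s_ref) / s_ref) ** 2 for s_hat in estimates]
    return 10.0 * math.log10(sum(rel_sq) / len(rel_sq))

# Estimates that deviate by 1% from the reference give about -40 dB.
print(msre_db([0.792, 0.808], 0.8))
```

Under this convention, a level of -40 dB corresponds to a typical relative deviation of about 1% between the estimated and the asymptotic optimal s-value.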

In Figure 5, we draw various s-divergences:

μ (\frac{1}{2}), μ (s^{⋆}), \frac{1}{N} μ_{N} (\frac{1}{2}), \frac{1}{N} μ_{N} (\hat{s})

. We can observe the good agreement with the proposed theoretical results. The s-divergence obtained by fixing

s = \frac{1}{2}

is accurate only at small

SNR

but degrades when

SNR

grows large.

In Figure 6, we fix

SNR = 45

dB and draw

s^{⋆}

obtained by Equation (14) versus values of

c \in {10^{- 6}, 10^{- 5}, 10^{- 4}, 10^{- 3}, 10^{- 2}, 10^{- 1}, 0.25, 0.5, 0.75, 0.9, 0.99}

and the expression obtained by Equation (15). The two curves approach each other as c goes to zero as predicted by our theoretical analysis.

For the TKD scenario, we follow the same methodology as above for the CPD. Figures 7 and 8 agree with the analysis provided in Section 4.3.

For the TKD scenario, the mean square relative error is on average of order

- 40

dB, which numerically confirms the consistency of the estimator of the optimal s-value.

We can also notice that the convergence of

\frac{μ_{N} (s)}{N}

towards its deterministic equivalent

μ (s)

in the TKD case is faster than in the CPD case, since the dimension of matrix

Φ^{\otimes}

is

(200 \cdot 200 \cdot 200) \times (100 \cdot 120 \cdot 140)

(

N = 200^{3}

) which is much larger than the dimension

6000 \times 3000

of

Φ^{⊙}

(

N = 6000

).

6. Conclusions

In this work, we derived and studied the limit performance in terms of minimal Bayes’ error probability for the binary classification of high-dimensional random tensors using both the tools of Information Geometry (IG) and of Random Matrix Theory (RMT). The main results on Chernoff Bounds and Fisher Information are illustrated by Monte–Carlo simulations that corroborated our theoretical analysis.

For future work, we would like to study the rate of convergence and the fluctuation of the statistics

\frac{μ_{N} (s)}{N}

and

\hat{s}

.

Acknowledgments

The authors would like to thank Philippe Loubaton (UPEM, France) for the fruitful discussions. This research was partially supported by Labex DigiCosme (project ANR-11-LABEX-0045-DIGICOSME) operated by ANR (The French National Research Agency) as part of the program “Investissement d’Avenir” Idex Paris-Saclay (ANR-11-IDEX-0003-02).

Author Contributions

Gia-Thuy Pham, Rémy Boyer and Frank Nielsen contributed to the research results presented in this paper. Gia-Thuy Pham and Rémy Boyer performed the numerical experiments. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Lemma 1

The s-divergence in Equation (6) for the following binary hypothesis test

\begin{matrix} \{\begin{matrix} H_{0} & : & y \sim N (0, Σ_{0}), \\ H_{1} & : & y \sim N (0, Σ_{1}) \end{matrix} \end{matrix}

is given by [15]:

\begin{matrix} {\tilde{μ}}_{N} (s) = \frac{1}{2} log \frac{\det (s Σ_{0} + (1 - s) Σ_{1})}{{[\det Σ_{0}]}^{s} {[\det Σ_{1}]}^{1 - s}} . \end{matrix}

(A1)

Using the expressions of the covariance matrices

Σ_{0}

and

Σ_{1}

, the logarithm of the numerator in Equation (A1) is given by

N log σ^{2} + log det (SNR \times (1 - s) Φ Φ^{T} + I)

and the logarithms of the two terms in its denominator are

log {[det Σ_{0}]}^{s} = s N log σ^{2}

and

log {[det Σ_{1}]}^{1 - s} = (1 - s) (N log σ^{2} + log det (SNR \times Φ Φ^{T} + I)) .

Using the above expressions,

{\tilde{μ}}_{N} (s)

is given by Equation (8).

Appendix B. Proof of Lemma 2

If we note

d Σ (SNR) = {\frac{\partial Σ (x)}{\partial x}|}_{x = SNR}

then the following expression holds:

\begin{matrix} Σ (δ SNR) = Σ (0) + (δ SNR) \times d Σ (0) = I + (δ SNR) \times Φ Φ^{T} . \end{matrix}

Using the above expression, the s-divergence is given by

\begin{matrix} μ_{N} (s) = \frac{1 - s}{2} log det [I + (δ SNR) \times Φ Φ^{T}] - \frac{1}{2} log det [I + δ SNR \times (1 - s) \times Φ Φ^{T}] \end{matrix}

Now, using Equation (8), and the following approximation:

\begin{matrix} \frac{1}{N} log det (I + x A) = \frac{1}{N} Tr log (I + x A) \approx x \times \frac{1}{N} Tr A - \frac{x^{2}}{2} \times \frac{1}{N} Tr A^{2} \end{matrix}

we obtain

\begin{matrix} \frac{μ_{N} (s)}{N} \approx (s - 1) s \times \frac{{(δ SNR)}^{2}}{2} \times \frac{J_{F} (0)}{N} \end{matrix}

where the Fisher information for

y | δ SNR \sim N (0, Σ (δ SNR))

is given by [3]:

\begin{matrix} J_{F} (δ SNR) & = - E [\frac{\partial^{2} log p (y | δ SNR)}{\partial {(δ SNR)}^{2}}] \\ & = \frac{1}{2} Tr \{Σ {(δ SNR)}^{- 1} d Σ (δ SNR) Σ {(δ SNR)}^{- 1} d Σ (δ SNR)\} \\ & = \frac{1}{2} Tr \{{(I + (δ SNR) \times Φ Φ^{T})}^{- 1} Φ Φ^{T} {(I + (δ SNR) \times Φ Φ^{T})}^{- 1} Φ Φ^{T}\} . \end{matrix}
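The quality of the second-order approximation can be checked numerically. Since both the log-determinants and the Fisher information at zero depend on Φ only through the eigenvalues of Φ Φ^T, the sketch below works directly with a list of eigenvalues. The function names and the sample eigenvalues are illustrative assumptions.

```python
import math

def mu_exact(s, x, eigs):
    """mu_N(s) for Sigma = I + x * Phi Phi^T, from the eigenvalues of Phi Phi^T."""
    return (0.5 * (1.0 - s) * sum(math.log1p(x * lam) for lam in eigs)
            - 0.5 * sum(math.log1p((1.0 - s) * x * lam) for lam in eigs))

def fisher_at_zero(eigs):
    """J_F(0) = (1/2) Tr[(Phi Phi^T)^2]."""
    return 0.5 * sum(lam * lam for lam in eigs)

def mu_second_order(s, x, eigs):
    """Second-order approximation: (s - 1) s x^2 / 2 * J_F(0)."""
    return (s - 1.0) * s * 0.5 * x * x * fisher_at_zero(eigs)

eigs = [0.5, 1.0, 1.5]          # hypothetical eigenvalues of Phi Phi^T
for x in (1e-1, 1e-2, 1e-3):    # x plays the role of the SNR deviation
    print(x, mu_exact(0.3, x, eigs), mu_second_order(0.3, x, eigs))
```

The relative gap between the two columns shrinks linearly with the SNR deviation, as expected from a neglected third-order term.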

Appendix C. Proof of Theorem 3

The first step of the proof is based on the derivation of an alternative expression of

μ_{s} (SNR)

given by Equation (A1) involving the inverse of the covariance matrices

Σ_{0}

and

Σ_{1}

. Specifically, we have

\begin{matrix} μ_{s} (SNR) & = \frac{1}{2} log \frac{(\det Σ_{0}) (\det Σ_{1}) \det ((1 - s) Σ_{0}^{- 1} + s Σ_{1}^{- 1})}{{[\det Σ_{0}]}^{s} {[\det Σ_{1}]}^{1 - s}} \\ = - \frac{1}{2} log \frac{\det ({[(1 - s) Σ_{0}^{- 1} + s Σ_{1}^{- 1}]}^{- 1})}{{[\det Σ_{0}]}^{1 - s} {[\det Σ_{1}]}^{s}} . \end{matrix}

(A2)

The second step is to derive a closed-form expression in the high SNR regime using the following approximation (see [52] for instance):

{(x \times Φ Φ^{T} + I)}^{- 1} \overset{x ≫ 1}{\approx} Π_{Φ}^{⊥} = I_{N} - Φ Φ^{†}

where

Π_{Φ}^{⊥}

is an orthogonal projector such that

Π_{Φ}^{⊥} Φ = 0

and

Φ^{†} = {(Φ^{T} Φ)}^{- 1} Φ^{T}

. The numerator in Equation (A2) is given by

\begin{matrix} {[(1 - s) Σ_{0}^{- 1} + s Σ_{1}^{- 1}]}^{- 1} & \overset{SNR ≫ 1}{\approx} σ^{2} {(I_{N} - s I_{N} + s Π_{Φ}^{⊥})}^{- 1} \\ = σ^{2} {(I_{N} - s Φ Φ^{†})}^{- 1} . \end{matrix}

As

s Φ Φ^{†}

is a rank-K projector matrix scaled by factor

s > 0

, its eigen-spectrum is given by

\{\underbrace{s, \dots, s}_{K}, \underbrace{0, \dots, 0}_{N - K}\}

. In addition, as the rank-N identity matrix and the scaled projector

s Φ Φ^{†}

can be diagonalized in the same orthonormal basis matrix, the n-th eigenvalue of the inverse of matrix

I_{N} - s Φ Φ^{†}

is given by

\begin{matrix} λ_{n} \{{(I_{N} - s Φ Φ^{†})}^{- 1}\} & = \frac{1}{λ_{n} \{I_{N}\} - s λ_{n} \{Φ Φ^{†}\}} \\ = \{\begin{matrix} \frac{1}{1 - s}, & 1 \leq n \leq K, \\ 1, & K + 1 \leq n \leq N \end{matrix} \end{matrix}

with

s \in (0, 1)

. Using the above property, we obtain

\begin{matrix} log det ({[I_{N} - s Φ Φ^{†}]}^{- 1}) & = log \prod_{n = 1}^{N} λ_{n} \{{(I_{N} - s Φ Φ^{†})}^{- 1}\} \\ = - K log (1 - s) . \end{matrix}

In addition, we have

\begin{matrix} log det (SNR \times Φ Φ^{T} + I) \overset{SNR ≫ 1}{\approx} Tr log (SNR \times Φ^{T} Φ) = K \times log SNR + \sum_{n = 1}^{K} log λ_{n} \end{matrix}

Finally, thanks to Equation (A2), we have

\begin{matrix} \frac{μ_{s} (SNR)}{N} & \overset{SNR ≫ 1}{\approx} \frac{1}{2} \frac{K}{N} (log (1 - s) + s \times log SNR + \frac{s}{K} \sum_{n = 1}^{K} log λ_{n}) \end{matrix}

Finally, to obtain

s^{⋆}

in Equation (9), we solve

\frac{\partial μ_{s} (SNR)}{\partial s} = 0

.
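Solving this stationarity condition with the high-SNR approximation above gives s⋆ ≈ 1 - 1 / (log SNR + (1/K) Σ log λ_n). The sketch below compares this value with a brute-force minimization of the exact expression. It is written with the sign convention of the expression of μ_N(s) used in Appendix B, which is the opposite of μ_s(SNR) here and therefore has the same stationary point. The function names, eigenvalues and grid size are illustrative assumptions.

```python
import math

def s_star_high_snr(snr, eigs):
    """Stationary point of the high-SNR approximation."""
    mean_log = sum(math.log(lam) for lam in eigs) / len(eigs)
    return 1.0 - 1.0 / (math.log(snr) + mean_log)

def mu_exact(s, snr, eigs):
    """Exact exponent from the K non-zero eigenvalues of Phi^T Phi."""
    return (0.5 * (1.0 - s) * sum(math.log1p(snr * lam) for lam in eigs)
            - 0.5 * sum(math.log1p((1.0 - s) * snr * lam) for lam in eigs))

def grid_minimizer(snr, eigs, n=20001):
    """Brute-force minimizer of mu_exact over a uniform grid on [0, 1]."""
    grid = [k / (n - 1) for k in range(n)]
    return min(grid, key=lambda s: mu_exact(s, snr, eigs))

eigs = [0.5, 1.0, 2.0]   # hypothetical non-zero eigenvalues
snr = 1e6                # 60 dB on a linear scale
print(s_star_high_snr(snr, eigs), grid_minimizer(snr, eigs))
```

The agreement degrades as the SNR decreases, since the projector approximation of the inverse covariance is only accurate for large SNR.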

Appendix D. Proof of Result 1

The asymptotic behavior of

\frac{μ_{N} (s)}{N}

when

N_{q} \to + \infty

for each

q = 1, \dots, Q

,

R \to + \infty

in such a way that

\frac{R^{1 / q}}{N_{q}}

converge towards a non zero constant for each

q = 1, \dots, Q

can be obtained thanks to large random matrix theory. We suppose that

N_{1}, \dots, N_{Q}

converge towards

+ \infty

at the same rate (i.e.,

\frac{N_{q}}{N_{p}}

converge towards a non zero constant for each

(p, q)

), and

c_{R} = \frac{R}{N}

converges towards a constant

c > 0

. Under this regime, the empirical eigenvalue distribution of covariance matrix

Φ^{⊙} {(Φ^{⊙})}^{T}

is known to converge towards the so-called Marchenko-Pastur distribution. As recalled in Section 2.2, the Marchenko-Pastur distribution

ν_{c} (d λ)

is defined as

ν_{c} (d λ) = δ (λ) {[1 - c]}_{+} + \frac{\sqrt{(λ - λ_{c}^{-}) (λ_{c}^{+} - λ)}}{2 π λ} \mathbb{1}_{[λ_{c}^{-}, λ_{c}^{+}]} (λ) d λ

where

λ_{c}^{-} = {(1 - \sqrt{c})}^{2}

and

λ_{c}^{+} = {(1 + \sqrt{c})}^{2}

. We define

t_{c} (z) = \int_{R^{+}} \frac{ν_{c} (d λ)}{λ - z}

the Stieltjes transform of

ν_{c}

. We have that

t_{c} (z)

satisfies the equation

t_{c} (z) = {[- z + \frac{c}{1 + t_{c} (z)}]}^{- 1} .

When

z \in R^{- *}

, i.e.,

z = - ρ

, with

ρ > 0

, it is well known that

t_{c} (- ρ)

is given by

t_{c} (- ρ) = \frac{2}{ρ - (1 - c) + \sqrt{(ρ + λ_{c}^{-}) (ρ + λ_{c}^{+})}}

(A3)

It was established for the first time in [45] that if

X

represents a

K \times P

random matrix with zero mean and

\frac{1}{K}

variance i.i.d. entries, and if

{(λ_{k})}_{k = 1, \dots, K}

represent the eigenvalues of

X X^{T}

arranged in decreasing order, then

\frac{1}{K} \sum_{k = 1}^{K} δ (λ - λ_{k})

, the empirical eigenvalue distribution of

X X^{T}

converges weakly almost surely towards

ν_{c}

, under the regime

K \to + \infty

,

P \to + \infty

,

\frac{P}{K} \to c

. In addition, we have the following property, for each continuous function

f (λ)

\frac{1}{K} \sum_{k = 1}^{K} f (λ_{k}) \overset{a . s}{⟶} \int_{R^{+}} f (λ) ν_{c} (d λ) .

(A4)

Practically, when K and P are large enough, the histogram of the eigenvalues of each realization of

X X^{T}

accumulates around the graph of the probability density of

ν_{c}

.

The columns

{(ϕ_{r})}_{r = 1, \dots, R}

of

Φ^{⊙}

are vectors

{(ϕ_{r}^{(Q)} \otimes \dots \otimes ϕ_{r}^{(1)})}_{r = 1, \dots, R}

, which are mutually independent, identically distributed, and satisfy

E (ϕ_{r} ϕ_{r}^{T}) = \frac{I_{N}}{N}

. However, since the components of each column

ϕ_{r}

are not independent, it follows that the entries of

Φ^{⊙}

are not mutually independent. Applying the results of [53] (see also [54]), we can establish that the empirical eigenvalue distribution of

Φ^{⊙} {(Φ^{⊙})}^{T}

still converges almost surely towards

ν_{c}

, under the asymptotic regime

\frac{R}{N} \to c

. Applying Equation (A4) to the continuous function

f (λ) = log (1 + λ / ρ)

, the integral

\int_{R^{+}} log (1 + λ / ρ) ν_{c} (d λ)

can be expressed in terms of

t_{c} (- ρ)

given by Equation (A3) (see e.g., [50]), which completes the proof.
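The closed form of Equation (A3) can be cross-checked in two ways: against the fixed-point equation satisfied by the Stieltjes transform, and against a direct numerical integration of 1/(λ + ρ) with respect to ν_c. The sketch below assumes 0 < c < 1, and its function names and quadrature rule (midpoint rule after the substitution λ = (1 + c) + 2√c cos θ) are illustrative choices.

```python
import math

def stieltjes_closed_form(rho, c):
    """Equation (A3): t_c(-rho) for rho > 0."""
    lam_minus = (1.0 - math.sqrt(c)) ** 2
    lam_plus = (1.0 + math.sqrt(c)) ** 2
    return 2.0 / (rho - (1.0 - c) + math.sqrt((rho + lam_minus) * (rho + lam_plus)))

def fixed_point_residual(rho, c):
    """Residual of t = 1 / (-z + c / (1 + t)) evaluated at z = -rho."""
    t = stieltjes_closed_form(rho, c)
    return t - 1.0 / (rho + c / (1.0 + t))

def stieltjes_quadrature(rho, c, n=200):
    """Integrate 1/(lambda + rho) against nu_c: mass 1 - c at 0 plus the continuous part."""
    total = (1.0 - c) / rho
    for k in range(n):
        theta = (k + 0.5) * math.pi / n
        lam = (1.0 + c) + 2.0 * math.sqrt(c) * math.cos(theta)
        # Continuous part of nu_c after the change of variable (total mass c).
        weight = 2.0 * c * math.sin(theta) ** 2 / (n * lam)
        total += weight / (lam + rho)
    return total

print(stieltjes_closed_form(1.0, 0.5), stieltjes_quadrature(1.0, 0.5))
print(fixed_point_residual(1.0, 0.5))
```

Both checks rely only on the definitions given in this appendix, so they can also serve as a sanity test when implementing the closed-form expressions of the error exponent.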

Appendix E. Proof of Result 4

We have

u (x) \overset{c ≪ 1}{\approx} \frac{1}{x} + \sqrt{{(\frac{1}{x} + 1)}^{2}} = \frac{2}{x} + 1

and

u (x) + (1 - c) \overset{c ≪ 1}{\approx} 2 (\frac{1}{x} + 1)

,

u (x) - (1 - c) \overset{c ≪ 1}{\approx} \frac{2}{x}

,

u {(x)}^{2} - {(1 - c)}^{2} \overset{c ≪ 1}{\approx} \frac{4}{x} (\frac{1}{x} + 1) .

Using the above first-order approximations, Equation (13) becomes

\begin{matrix} Ψ_{c ≪ 1} (x) \overset{1}{\approx} c \times \frac{x}{1 + x} + c log (1 + x) - c \frac{x}{1 + x} = c log (1 + x) . \end{matrix}

Using the above approximation and Equation (12), we obtain Result 4.

Appendix F. Proof of Result 5

We first denote

λ_{1}^{(q)} \geq λ_{2}^{(q)} \geq \dots \geq λ_{n_{q}}^{(q)} \geq \dots \geq λ_{N_{q}}^{(q)}

the eigenvalues of

Φ^{(q)} {(Φ^{(q)})}^{T}

,

1 \leq n_{q} \leq N_{q}

, for

1 \leq q \leq Q

. We can notice that the eigenvalues of

Φ^{\otimes} {(Φ^{\otimes})}^{T}

are

λ_{n_{1}}^{(1)} \dots λ_{n_{Q}}^{(Q)}

. Moreover, in the asymptotic regime, where

M_{q} \to + \infty

,

N_{q} \to + \infty

such that

\frac{M_{q}}{N_{q}} \to c_{q}

,

0 < c_{q} < 1

, for all

1 \leq q \leq Q

, we have that

λ_{n_{q}}^{(q)} = 0

if

M_{q} + 1 \leq n_{q} \leq N_{q}

and the empirical distribution of the eigenvalues

{(λ_{n_{q}}^{(q)})}_{1 \leq n_{q} \leq M_{q}}

behaves as Marchenko-Pastur distributions

ν_{c_{q}}

of parameters

(c_{q}, 1)

. Recalling that

M = M_{1} \dots M_{Q}

,

N = N_{1} \dots N_{Q}

, we obtain immediately that

\begin{matrix} \frac{1}{N} log det (SNR \times Φ^{\otimes} {(Φ^{\otimes})}^{T} + I) & = \frac{1}{N} \sum_{n_{1} = 1}^{N_{1}} \dots \sum_{n_{Q} = 1}^{N_{Q}} log (SNR \times λ_{n_{1}}^{(1)} \dots λ_{n_{Q}}^{(Q)} + 1) \\ & = \frac{M}{N} \frac{1}{M} \sum_{n_{1} = 1}^{M_{1}} \dots \sum_{n_{Q} = 1}^{M_{Q}} log (SNR \times λ_{n_{1}}^{(1)} \dots λ_{n_{Q}}^{(Q)} + 1) \end{matrix}

and that

\frac{1}{M} \sum_{n_{1} = 1}^{M_{1}} \dots \sum_{n_{Q} = 1}^{M_{Q}} log (SNR \times λ_{n_{1}}^{(1)} \dots λ_{n_{Q}}^{(Q)} + 1) \overset{a . s}{⟶} \int_{0}^{+ \infty} \dots \int_{0}^{+ \infty} log (1 + SNR \times λ_{1} \dots λ_{Q}) d ν_{c_{1}} (λ_{1}) \dots d ν_{c_{Q}} (λ_{Q})

Similarly, we have that

\frac{1}{M} log det (SNR \times (1 - s) Φ^{\otimes} {(Φ^{\otimes})}^{T} + I) \overset{a . s}{⟶} \int_{0}^{+ \infty} \dots \int_{0}^{+ \infty} log (1 + SNR \times (1 - s) λ_{1} \dots λ_{Q}) d ν_{c_{1}} (λ_{1}) \dots d ν_{c_{Q}} (λ_{Q})

We obtain easily Result 5.
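The first step of this proof, namely that the log-determinant reduces to a sum over products of the per-mode eigenvalues, can be sketched as follows. The function name and the sample eigenvalues are hypothetical; the eigenvalue lists are assumed to include the zero eigenvalues of each mode.

```python
import itertools
import math

def normalized_logdet(snr, eig_lists):
    """(1/N) log det(SNR * Phi Phi^T + I) for a Kronecker-structured Phi.

    eig_lists[q] holds all N_q eigenvalues (zeros included) of
    Phi^(q) Phi^(q)^T, so that N is the product of the list lengths.
    """
    n_total = math.prod(len(e) for e in eig_lists)
    acc = 0.0
    # Eigenvalues of the Kronecker product are all products of one
    # eigenvalue per mode; zero products contribute log(1) = 0.
    for combo in itertools.product(*eig_lists):
        acc += math.log1p(snr * math.prod(combo))
    return acc / n_total

# Mode 1 has rank 2 out of 3, mode 2 has rank 1 out of 2.
print(normalized_logdet(3.0, [[2.0, 0.5, 0.0], [1.0, 0.0]]))
```

Only the M = M_1 ... M_Q combinations of non-zero eigenvalues contribute, which is why the normalization by N can be rewritten as the factor M/N times an average over M terms.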

References

1. Besson, O.; Scharf, L.L. CFAR matched direction detector. IEEE Trans. Signal Process. 2006, 54, 2840–2844.
2. Bianchi, P.; Debbah, M.; Maida, M.; Najim, J. Performance of Statistical Tests for Source Detection using Random Matrix Theory. IEEE Trans. Inf. Theory 2011, 57, 2400–2419.
3. Kay, S.M. Fundamentals of Statistical Signal Processing, Volume II: Detection Theory; PTR Prentice-Hall: Englewood Cliffs, NJ, USA, 1993.
4. Loubaton, P.; Vallet, P. Almost Sure Localization of the Eigenvalues in a Gaussian Information Plus Noise Model. Application to the Spiked Models. Electron. J. Probab. 2011, 16, 1934–1959.
5. Mestre, X. Improved Estimation of Eigenvalues and Eigenvectors of Covariance Matrices Using Their Sample Estimates. IEEE Trans. Inf. Theory 2008, 54, 5113–5129.
6. Baik, J.; Silverstein, J. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivar. Anal. 2006, 97, 1382–1408.
7. Silverstein, J.W.; Combettes, P.L. Signal detection via spectral theory of large dimensional random matrices. IEEE Trans. Signal Process. 1992, 40, 2100–2105.
8. Cheng, Y.; Hua, X.; Wang, H.; Qin, Y.; Li, X. The Geometry of Signal Detection with Applications to Radar Signal Processing. Entropy 2016, 18, 381.
9. Ali, S.M.; Silvey, S.D. A General Class of Coefficients of Divergence of One Distribution from Another. J. R. Stat. Soc. Ser. B (Methodol.) 1966, 28, 131–142.
10. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
11. Kailath, T. The Divergence and Bhattacharyya Distance Measures in Signal Selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60.
12. Nielsen, F. Hypothesis Testing, Information Divergence and Computational Geometry; Geometric Science of Information; Springer: Berlin, Germany, 2013; pp. 241–248.
13. Sinanovic, S.; Johnson, D.H. Toward a theory of information processing. Signal Process. 2007, 87, 1326–1344.
14. Chernoff, H. A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the sum of Observations. Ann. Math. Stat. 1952, 23, 493–507.
15. Nielsen, F. Chernoff information of exponential families. arXiv, 2011; arXiv:1102.2684.
16. Chepuri, S.P.; Leus, G. Sparse sensing for distributed Gaussian detection. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015.
17. Tang, G.; Nehorai, A. Performance Analysis for Sparse Support Recovery. IEEE Trans. Inf. Theory 2010, 56, 1383–1399.
18. Lee, Y.; Sung, Y. Generalized Chernoff Information for Mismatched Bayesian Detection and Its Application to Energy Detection. IEEE Signal Process. Lett. 2012, 19, 753–756.
19. Grossi, E.; Lops, M. Space-time code design for MIMO detection based on Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2012, 58, 3989–4004.
20. Sen, S.; Nehorai, A. Sparsity-Based Multi-Target Tracking Using OFDM Radar. IEEE Trans. Signal Process. 2011, 59, 1902–1906.
21. Boyer, R.; Delpha, C. Relative-entropy based beamforming for secret key transmission. In Proceedings of the 2012 IEEE 7th Sensor Array and Multichannel Signal Processing Workshop (SAM), Hoboken, NJ, USA, 17–20 June 2012.
22. Tran, N.D.; Boyer, R.; Marcos, S.; Larzabal, P. Angular resolution limit for array processing: Estimation and information theory approaches. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 27–31 August 2012.
23. Katz, G.; Piantanida, P.; Couillet, R.; Debbah, M. Joint estimation and detection against independence. In Proceedings of the Annual Conference on Communication Control and Computing (Allerton), Monticello, IL, USA, 30 September–3 October 2014; pp. 1220–1227.
24. Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett. 2013, 20, 269–272.
25. Cichocki, A.; Mandic, D.; De Lathauwer, L.; Zhou, G.; Zhao, Q.; Caiafa, C.; Phan, H.A. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Process. Mag. 2015, 32, 145–163.
26. Comon, P. Tensors: A brief introduction. IEEE Signal Process. Mag. 2014, 31, 44–53.
27. De Lathauwer, L.; Moor, B.D.; Vandewalle, J. A Multilinear Singular Value Decomposition. SIAM J. Matrix Anal. Appl. 2000, 21, 1253–1278.
28. Tucker, L.R. Some mathematical notes on three-mode factor analysis. Psychometrika 1966, 31, 279–311.
29. Comon, P.; Berge, J.T.; De Lathauwer, L.; Castaing, J. Generic and Typical Ranks of Multi-Way Arrays. Linear Algebra Appl. 2009, 430, 2997–3007.
30. De Lathauwer, L. A survey of tensor methods. In Proceedings of the IEEE International Symposium on Circuits and Systems, ISCAS 2009, Taipei, Taiwan, 24–27 May 2009.
31. Comon, P.; Luciani, X.; De Almeida, A.L.F. Tensor decompositions, alternating least squares and other tales. J. Chemom. 2009, 23, 393–405.
32. Goulart, J.H.D.M.; Boizard, M.; Boyer, R.; Favier, G.; Comon, P. Tensor CP Decomposition with Structured Factor Matrices: Algorithms and Performance. IEEE J. Sel. Top. Signal Process. 2016, 10, 757–769.
33. Eckart, C.; Young, G. The approximation of one matrix by another of lower rank. Psychometrika 1936, 1, 211–218.
34. Badeau, R.; Richard, G.; David, B. Fast and stable YAST algorithm for principal and minor subspace tracking. IEEE Trans. Signal Process. 2008, 56, 3437–3446.
35. Boyer, R.; Badeau, R. Adaptive multilinear SVD for structured tensors. In Proceedings of the 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’06), Toulouse, France, 14–19 May 2006.
36. Boizard, M.; Ginolhac, G.; Pascal, F.; Forster, P. Low-rank filter and detector for multidimensional data based on an alternative unfolding HOSVD: Application to polarimetric STAP. EURASIP J. Adv. Signal Process. 2014, 2014, 119.
37. Bouleux, G.; Boyer, R. Sparse-Based Estimation Performance for Partially Known Overcomplete Large-Systems. Signal Process. 2017, 139, 70–74.
38. Boyer, R.; Couillet, R.; Fleury, B.-H.; Larzabal, P. Large-System Estimation Performance in Noisy Compressed Sensing with Random Support—A Bayesian Analysis. IEEE Trans. Signal Process. 2016, 64, 5525–5535.
39. Ollier, V.; Boyer, R.; El Korso, M.N.; Larzabal, P. Bayesian Lower Bounds for Dense or Sparse (Outlier) Noise in the RMT Framework. In Proceedings of the 2016 IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM 16), Rio de Janeiro, Brazil, 10–13 July 2016.
40. Wishart, J. The generalized product moment distribution in samples. Biometrika 1928, 20A, 32–52.
41. Wigner, E.P. On the statistical distribution of the widths and spacings of nuclear resonance levels. Proc. Camb. Philos. Soc. 1951, 47, 790–798.
42. Wigner, E.P. Characteristic vectors of bordered matrices with infinite dimensions. Ann. Math. 1955, 62, 548–564.
43. Bai, Z.D.; Silverstein, J.W. Spectral Analysis of Large Dimensional Random Matrices, 2nd ed.; Springer Series in Statistics; Springer: Berlin, Germany, 2010.
44. Girko, V.L. Theory of Random Determinants; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1990.
45. Marchenko, V.A.; Pastur, L.A. Distribution of eigenvalues for some sets of random matrices. Math. Sb. (N.S.) 1967, 72, 507–536.
46. Voiculescu, D. Limit laws for random matrices and free products. Invent. Math. 1991, 104, 201–220.
47. Boyer, R.; Nielsen, F. Information Geometry Metric for Random Signal Detection in Large Random Sensing Systems. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017.
48. Boyer, R.; Loubaton, P. Large deviation analysis of the CPD detection problem based on random tensor theory. In Proceedings of the 2017 25th European Association for Signal Processing (EUSIPCO), Kos, Greece, 28 August–2 September 2017.
49. Lytova, A. Central Limit Theorem for Linear Eigenvalue Statistics for a Tensor Product Version of Sample Covariance Matrices. J. Theor. Prob. 2017, 1–34.
50. Tulino, A.M.; Verdu, S. Random Matrix Theory and Wireless Communications; Now Publishers Inc.: Hanover, MA, USA, 2004; Volume 1.
51. Milne-Thomson, L.M. “Elliptic Integrals” (Chapter 17). In Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing; Abramowitz, M., Stegun, I.A., Eds.; Dover Publications: New York, NY, USA, 1972; pp. 587–607.
52. Behrens, R.T.; Scharf, L.L. Signal processing applications of oblique projection operators. IEEE Trans. Signal Process. 1994, 42, 1413–1424.
53. Pajor, A.; Pastur, L.A. On the Limiting Empirical Measure of the sum of rank one matrices with log-concave distribution. Stud. Math. 2009, 195, 11–29.
54. Ambainis, A.; Harrow, A.W.; Hastings, M.B. Random matrix theory: Extending random matrix theory to mixtures of random product states. Commun. Math. Phys. 2012, 310, 25–74.

Figure 1. Canonical Polyadic Decomposition (CPD).

Figure 2. Histogram of the eigenvalues of

\frac{W_{N} W_{N}^{T}}{N}

(with

M = 256, c_{N} = \frac{M}{N} = \frac{1}{256}

,

σ^{2} = 1

).

Figure 3. Histogram of the eigenvalues of

\frac{W_{N} W_{N}^{T}}{N}

(with

M = 256, c_{N} = \frac{M}{N} = \frac{1}{4}

,

σ^{2} = 1

).

Figure 4. Canonical Polyadic Decomposition (CPD) scenario: Optimal s-parameter versus Signal to Noise Ratio (SNR) in dB.

Figure 5. CPD scenario: s-divergence vs.

SNR

in dB.

Figure 6. CPD scenario:

s^{⋆}

vs. c ,

SNR = 45

dB.

Figure 7. TucKer Decomposition (TKD) scenario: Optimal s-parameter vs.

SNR

in dB.

Figure 8. TKD scenario: s-divergence vs.

SNR

in dB.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Pham, G.-T.; Boyer, R.; Nielsen, F. Computational Information Geometry for Binary Classification of High-Dimensional Random Tensors. Entropy 2018, 20, 203. https://doi.org/10.3390/e20030203
