1. Introduction
Ill-posed inverse problems represent a major challenge to research in many disciplines. They arise, e.g., when one wishes to reconstruct the shape of an astronomical object after its light has passed through a turbulent atmosphere and an imperfect telescope, or when we image the interior of a patient's skull during a computed tomography scan. In a more theoretical setting, extracting spectral functions from numerical simulations of strongly correlated quantum fields constitutes another example. The common difficulty among these tasks lies in the fact that we do not have direct access to the quantity of interest (from here on referred to as ρ) but instead only to a distorted representation of it, measured in our experiment (from here on denoted by D). Extracting ρ from D, in general, requires us to solve an inherently nonlinear optimization problem, which we construct and discuss in detail below.
Let us consider the common class of inverse problems, where the quantity of interest ρ and the measured data D are related via an integral convolution

D(\tau) = \int d\omega\, K(\tau,\omega)\,\rho(\omega),   (1)

with a kernel function K(τ, ω). For the sake of simplicity, let us assume (as is often the case in practice) that the function K is exactly known. The task at hand is to estimate the function ρ that underlies the observed D. The ill-posedness (and ill-conditioning) of this task is readily spotted if we acknowledge that our data come in the form of N_τ discrete estimates D_i = D(τ_i) + η_i of the true function D(τ), where η denotes a source of noise. In addition, we need to approximate the integral in some form for numerical treatment. In its simplest form, writing it as a sum over N_ω bins of width Δω_l, we obtain

D_i^\rho = \sum_{l=1}^{N_\omega} \Delta\omega_l\, K(\tau_i,\omega_l)\,\rho(\omega_l) = \sum_{l=1}^{N_\omega} \Delta\omega_l\, K_{il}\,\rho_l.   (2)

At this point, we are asked to estimate N_ω optimal parameters ρ_l from N_τ ≪ N_ω data points, which themselves carry uncertainty. A naive fit of the ρ_l's is of no use, since it would produce an infinite number of degenerate solutions, which all reproduce the set of D_i's within their error bars. Only if we introduce an appropriate regularization can the problem be made well-posed, and it is this regularization which in general introduces nonlinearities.
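To make the discretization (2) concrete, the following minimal Python sketch builds a kernel matrix K_il and generates noisy mock data D_i from an assumed ρ. The exponential kernel, the grids, the noise level, and the shape of ρ are illustrative assumptions for this sketch only and are not tied to the examples discussed later in the text.

```python
import numpy as np

# Minimal illustration of the discretized relation (2): D_i = sum_l dω_l K_il ρ_l.
# Kernel, grids, noise level and the shape of ρ are assumptions made purely for
# demonstration; they are not the choices used later in this paper.
rng = np.random.default_rng(0)

N_tau, N_omega = 3, 500                      # few data points, many parameters
tau = np.linspace(0.1, 1.0, N_tau)           # points τ_i at which D is "measured"
omega = np.linspace(0.0, 20.0, N_omega)      # frequency grid ω_l
d_omega = omega[1] - omega[0]                # uniform bin width Δω_l = Δω

K = np.exp(-np.outer(tau, omega))            # K_il = K(τ_i, ω_l) = exp(-ω_l τ_i)

rho_true = 0.1 * np.ones(N_omega)                           # flat background ...
rho_true[np.argmin(np.abs(omega - 5.0))] += 1.0 / d_omega   # ... plus a sharp peak

D_ideal = K @ (d_omega * rho_true)           # noiseless synthetic data D_i^ρ
D = D_ideal * (1.0 + 1e-3 * rng.standard_normal(N_tau))     # noisy estimates D_i
```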
Bayesian inference represents one way to regularize the inversion task. It provides a systematic procedure for how additional (so-called prior) knowledge can be incorporated to that effect. Bayes' theorem

P[\rho|D,I] = \frac{P[D|\rho,I]\, P[\rho|I]}{P[D|I]}   (3)

states that the posterior probability P[ρ|D,I] for some set of parameters ρ_l to be the true solution of the inversion problem is given by the product of the likelihood probability P[D|ρ,I] and the prior probability P[ρ|I]. The ρ-independent normalization P[D|I] is often referred to as the evidence. Assuming that the noise η is Gaussian, we may write the likelihood as

P[D|\rho,I] = \exp(-L), \qquad L = \frac{1}{2}\sum_{i,j=1}^{N_\tau} \big(D_i - D_i^\rho\big)\, C^{-1}_{ij}\, \big(D_j - D_j^\rho\big),   (4)

where C_ij is the unbiased covariance matrix of the measured data with respect to the true mean and D_i^ρ refers to the synthetic data that one obtains by inserting the current set of parameters ρ_l into (2). A fit based on the likelihood alone would simply return one of the many degenerate extrema of L and is hence referred to as a maximum likelihood fit.
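Given such data, the likelihood (4) can be evaluated for any candidate set of parameters ρ_l, as in the short sketch below, which reuses the variables of the previous listing and assumes, purely for illustration, a diagonal covariance matrix built from a fixed relative error.

```python
# Gaussian likelihood of (4) for a candidate ρ, reusing K, d_omega and D from the
# sketch above. The diagonal covariance with 0.1% relative error is an assumption
# for illustration; in practice C is estimated from the measured samples.
C = np.diag((1e-3 * D) ** 2)
C_inv = np.linalg.inv(C)

def likelihood_L(rho):
    """L = 1/2 (D - D^ρ)^T C^{-1} (D - D^ρ), with D^ρ computed from (2)."""
    D_rho = K @ (d_omega * rho)
    res = D - D_rho
    return 0.5 * res @ C_inv @ res
```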
The important ingredient of Bayes' theorem is the presence of the prior probability, often expressed in terms of a regulator functional S,

P[\rho|I(m,\alpha)] \propto \exp\!\big( S[\rho,m,\alpha] \big).   (5)

It is here where pertinent domain knowledge can be encoded. For the study of intensity profiles of astronomical objects and hadronic spectral functions, for example, it is a priori known that the values of ρ must be positive. Depending on which type of information one wishes to incorporate, the explicit form of S will be different. It is customary to parameterize the shape of the prior distribution by two types of hyperparameters, the default model m(ω) and a confidence function α(ω). The discretized m_l represents the maximum of the prior for each parameter ρ_l and α_l its corresponding spread.

Once both the likelihood and prior probability are set, we may search for the maximum a posteriori (MAP) point estimate of the ρ_l's via

\rho^{\rm MAP} = \mathop{\rm arg\,max}_{\rho}\; P[\rho|D,I],   (6)

which constitutes the most probable parameters given our data and prior knowledge. (In the case that no data is provided, the Bayesian reconstruction will simply reproduce the default model m.) Note that since the exponential function is monotonic, instead of finding the extremum of P[D|ρ,I] P[ρ|I], in practice, one often considers the extremum of L − S directly.
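In practice, the extremum in (6) is found numerically. The following sketch (a brute-force illustration without analytic gradients; all function and variable names are ours) hands the objective L − S to a generic optimizer and accepts any regulator S of the form introduced in (7)–(9) below.

```python
from scipy.optimize import minimize

# Brute-force MAP estimate of (6): minimize L[ρ] - S[ρ, m, α] over the ρ_l.
# `regulator` is any of the functionals S of (7)-(9) implemented further below;
# the positivity bound is needed for the logarithmic regulators.
def map_estimate(regulator, m, alpha, rho_start):
    objective = lambda rho: likelihood_L(rho) - regulator(rho, m, alpha)
    result = minimize(objective, rho_start, method="L-BFGS-B",
                      bounds=[(1e-10, None)] * len(rho_start))
    return result.x
```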
The idea of introducing a regularization in order to meaningfully estimate the most probable set of parameters underlying observed data has a long history. As early as 1919 [1], it was proposed to combine what we here call the likelihood with a smoothing regulator. Let us have a look at three choices of regulators from the literature: the historic Tikhonov (TK) regulator [2] (1943–), the Shannon–Jaynes entropy deployed in the Maximum Entropy Method (MEM) [3,4] (1986–), and the more recent Bayesian Reconstruction (BR) method [5] regulator (2013–), respectively,

S_{\rm TK} = -\frac{1}{2}\int d\omega\, \alpha(\omega)\,\big(\rho(\omega)-m(\omega)\big)^2,   (7)

S_{\rm SJ} = \int d\omega\, \alpha(\omega)\,\Big(\rho - m - \rho\,\log\Big[\frac{\rho}{m}\Big]\Big),   (8)

S_{\rm BR} = \int d\omega\, \alpha(\omega)\,\Big(1 - \frac{\rho}{m} + \log\Big[\frac{\rho}{m}\Big]\Big).   (9)

Both S_SJ and S_BR are axiomatically constructed, incorporating the assumption of positivity of the function ρ. The assumption manifests itself via the presence of a logarithmic term that forces ρ to be positive-semidefinite in the former and positive-definite in the latter case. It is this logarithm that is responsible for the numerical optimization problem (6) becoming genuinely nonlinear. Note that all three regulators are concave, which (as proven for example in [6]) guarantees that if an extremum of the posterior exists, it is unique; i.e., within the N_ω-dimensional solution space spanned by the discretized parameters ρ_l, in the case that a Bayesian solution exists, we will be able to locate it with standard numerical methods in a straightforward fashion.
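For concreteness, discretized versions of the three regulators (7)–(9) are sketched below, in a form that can be passed to the map_estimate routine above; the flat default model and the confidence value in the usage line are again illustrative choices. Since L − S is convex for all three regulators, any extremum found by the optimizer is the unique Bayesian solution referred to above.

```python
# Discretized versions of the regulators (7)-(9); `alpha` may be a number or an
# array α_l over the ω-grid (the confidence function).
def S_tikhonov(rho, m, alpha):
    return -0.5 * np.sum(d_omega * alpha * (rho - m) ** 2)

def S_shannon_jaynes(rho, m, alpha):          # assumes ρ_l > 0
    return np.sum(d_omega * alpha * (rho - m - rho * np.log(rho / m)))

def S_bayesian_reconstruction(rho, m, alpha): # assumes ρ_l > 0
    return np.sum(d_omega * alpha * (1.0 - rho / m + np.log(rho / m)))

# Usage (illustrative flat default model and confidence):
m = 0.1 * np.ones(N_omega)
rho_map = map_estimate(S_shannon_jaynes, m, alpha=1.0, rho_start=m.copy())
```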
2. Diagnosis of the Problem in Bryan’s MEM
In this section, we investigate the consequences of the choice of regularization on the determination of the most probable spectrum. The starting point is the fully linear Tikhonov regularization, on which a lot of the intuition in the treatment of inverse problems is built. We then continue to the Maximum Entropy Method, which amounts to a genuinely nonlinear regularization, and point out how arguments that were valid in the linear case fail in the nonlinear context.
2.1. Tikhonov Regularization
The Tikhonov choice amounts to a Gaussian prior probability, which allows ρ to take on both positive and negative values. The default model m denotes the value for ρ which was most probable before the arrival of the measured data D (e.g., from a previous experiment), and α represents our confidence in the prior knowledge (e.g., the uncertainty of the previous experiment).
Since both (4) and (7) are at most quadratic in ρ_l, taking the derivative in (6) leads to a set of linear equations that need to be solved to compute the Bayesian optimal solution ρ^MAP. It is this fully linear scenario from which most intuition is derived when it comes to the solution space of the inversion task. Indeed, we are led to the following relations

\alpha_l\,(\rho_l - m_l) = -\sum_{i=1}^{N_\tau} K_{il}\,\frac{\partial L}{\partial D_i^\rho}, \qquad \frac{\partial L}{\partial D_i^\rho} = \sum_{j=1}^{N_\tau} C^{-1}_{ij}\,\big(D_j^\rho - D_j\big),   (10)

which can be written solely in terms of linear vector-matrix operations. Note that in this case ∂L/∂D_i^ρ contains the vector ρ in a linear fashion. (10) invites us to parameterize the function ρ via its deviation from the default model, a_l = ρ_l − m_l, and to look for the optimal set of parameters a_l. Here, we may safely follow Bryan [7] and investigate the singular values of the transpose kernel

K^t = U\,\Sigma\,V^t,   (11)

with U being an N_ω × N_ω special orthogonal matrix, Σ an N_ω × N_τ matrix with N_τ nonvanishing diagonal entries, corresponding to the singular values of K, and V being an N_τ × N_τ special orthogonal matrix. We are led to the expression

\alpha_l\, a_l = -\sum_{i=1}^{N_\tau} \big(U\Sigma V^t\big)_{li}\,\frac{\partial L}{\partial D_i^\rho},   (12)

which tells us that in this case, the solution of the Tikhonov inversion lies in a functional space spanned by the first N_τ columns of the matrix U (usually referred to as the SVD or singular subspace) around the default model m; i.e., we can parameterize ρ as

\rho_l = m_l + \sum_{i=1}^{N_\tau} c_i\, U_{li}.   (13)
The point here is that if we add to this SVD space parametrization any further column of the matrix U, it directly projects into the null space of K via \sum_l K_{jl}\,U_{li} = 0 for i > N_τ. In turn, such a column does not contribute to computing synthetic data via (2). As was pointed out in [8], in such a linear scenario, the SVD subspace is indeed all there is. If we add extra columns of U to the parametrization of ρ, these do not change the likelihood. Thus, the corresponding parameter c_i of that column is not constrained by data and will come out as zero in the optimization procedure of (6), as it encodes a deviation from the default model, which is minimized by the regulator S_TK.
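This null-space property is straightforward to verify numerically. The sketch below, reusing K and Δω from the earlier listings, performs the SVD of K^t and checks that the kernel annihilates the columns of U beyond the first N_τ.

```python
# Numerical check of the null-space property behind (11)-(13): the kernel
# annihilates every column of U beyond the first N_τ, so such columns cannot
# change the synthetic data (2) in the linear (Tikhonov) parametrization (13).
U, sigma, Vt = np.linalg.svd(K.T, full_matrices=True)   # K^t = U Σ V^t

for i in range(N_tau, N_tau + 3):            # columns outside the SVD subspace
    print(i, np.max(np.abs(K @ (d_omega * U[:, i]))))    # ≈ 0 up to round-off
```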
2.2. Maximum Entropy Method
Now that we have established that in a fully linear context the arguments based on the SVD subspace are indeed justified, let us continue to the Maximum Entropy Method, which deploys the Shannon–Jaynes entropy as regulator. S_SJ encodes as prior information, e.g., the positivity of the function ρ, which manifests itself in the presence of the logarithm in (8). This logarithm however also entails that we are now dealing with an inherently nonlinear optimization problem in (6). Carrying out the functional derivative with respect to ρ_l on L − S_SJ, we are led to the following relation

\alpha_l \log\!\Big[\frac{\rho_l}{m_l}\Big] = -\sum_{i=1}^{N_\tau} K_{il}\,\frac{\partial L}{\partial D_i^\rho}.   (14)

As suggested by Bryan [7], let us introduce the completely general (since ρ_l > 0) and nonlinear redefinition of the parameters

\rho_l = m_l\,\exp(a_l).   (15)

Inserting (15) into (14), we are led to an expression that is formally quite similar to the result obtained in the Tikhonov case

\alpha_l\, a_l = -\sum_{i=1}^{N_\tau} K_{il}\,\frac{\partial L}{\partial D_i^\rho}.   (16)

While at first sight this relation is also amenable to being written as a linear relation for the a_l's, it is actually fundamentally different from (10), since, due to its nonlinear nature, a enters ∂L/∂D_i^ρ via componentwise exponentiation. It is here, when attempting to make a statement about such a nonlinear relation with the tools of linear algebra, that we run into difficulties. What do I mean by that? Let us push ahead and introduce the SVD decomposition of the transpose kernel as before, K^t = UΣV^t, which leads to

\alpha_l\, a_l = -\sum_{i=1}^{N_\tau} \big(U\Sigma V^t\big)_{li}\,\frac{\partial L}{\partial D_i^\rho}.   (17)
At first sight, this relation seems to imply that the vector a, which encodes the deviation from the default model (this time multiplicatively), is restricted to the SVD subspace, spanned by the first N_τ columns of the matrix U. My claim (as put forward most recently in [9]) is that this conclusion is false, since this linear-algebra argument is not applicable when working with (16). Let us continue to set up the corresponding SVD parametrization advocated, e.g., in [8],

\rho_l = m_l\,\exp\!\Big(\sum_{i=1}^{N_\tau} c_i\, U_{li}\Big).   (18)
In contrast to (13), the SVD space is not all there is to (18) (see explicit computations in Appendix A). This we can see by taking a column of the matrix U with index i > N_τ, exponentiating it componentwise, and applying the matrix K to it. In general, we get

\sum_{l} K_{jl}\,\exp\!\big(c_i\, U_{li}\big) \neq 0 \quad \text{for } i > N_\tau,   (19)

in contrast to the linear statement \sum_l K_{jl}\,U_{li} = 0. This means that if we add additional columns of the matrix U to the parametrization in (18), they do not automatically project into the null-space of the Kernel (see the explicit example in Appendix A of this manuscript) and thus will contribute to the likelihood. In turn, the corresponding parameter c_i related to that column will not automatically come out to be zero in the minimization procedure (6). Hence, we cannot a priori disregard its contribution and thus the contribution of this direction of the search space, which is not part of the SVD subspace. We thus conclude that limiting the solution space in the MEM to the singular subspace amounts to an ad-hoc procedure, motivated by an incorrect application of linear-algebra arguments to a fully nonlinear optimization problem.
A representative example from the literature, where the nonlinear character of the parametrization of ρ is not taken into account, is the recent [8] (see Equations (7) and (12) in that manuscript). We emphasize that one does not apply a column of the matrix U itself to the matrix K but the componentwise exponentiation of this column. This operation does not project into the null-space.
In those cases where we have only few pieces of reliable prior information and our datasets are limited, restricting to the SVD subspace may lead to significantly distorted results (as shown explicitly in [10]). On the other hand, its effect may be mild if the default model already encodes most of the relevant features of the final result and the number of datapoints is large, so that the SVD subspace is large enough to (accidentally) encompass the true Bayesian solution sought after in (6). Independent of the severity of the problem, the artificial restriction to the SVD subspace in Bryan's MEM is a source of systematic error, which needs to be accounted for when the MEM is deployed as a precision tool for inverse problems.

Being liberated from the SVD subspace does not lead to any conceptual problems either. We have brought to the table N_τ points of data, as well as N_ω points of prior information in the form of the default model m (as well as its uncertainty α). This is enough information to determine the N_ω parameters ρ_l uniquely, as proven in [6].
Recognizing that linear-algebra arguments fail in the MEM setup also helps us to understand some of the otherwise perplexing results found in the literature. If the singular subspace were all there is to the parametrization of ρ in (18), then it would not matter whether we use the first N_τ columns of U or just use the N_τ columns of K^t directly. Both encode the same target space, the difference being only that the columns of U are orthonormal. However, as was clearly seen in Figure 28 of [11], using the SVD parametrization or the columns of K^t leads to significantly different results in the reconstructed features of ρ. If the MEM were a truly linear problem, these two parameterizations would give exactly the same result. The finding that the results do not agree emphasizes that the MEM inversion is genuinely nonlinear and the restriction to the SVD subspace is ad hoc.
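The statement about the target spaces is itself a purely linear one and can be checked directly, as in the sketch below (reusing U and K from the earlier listings): the orthogonal projectors onto the span of the first N_τ columns of U and onto the span of the columns of K^t coincide up to round-off. Any difference between the two parameterizations must therefore originate in the nonlinear, componentwise exponentiation in (18).

```python
# The first N_τ columns of U and the N_τ columns of K^t span the same linear
# subspace: the orthogonal projectors onto the two spans coincide.
Q_svd, _ = np.linalg.qr(U[:, :N_tau])        # orthonormal basis from U
Q_ker, _ = np.linalg.qr(K.T)                 # orthonormal basis from the columns of K^t

P_svd = Q_svd @ Q_svd.T
P_ker = Q_ker @ Q_ker.T
print(np.max(np.abs(P_svd - P_ker)))         # ≈ 0 (round-off; grows with the kernel's ill-conditioning)
```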
2.3. Numerical Evidence for the Inadequacy of the SVD Subspace
Let us construct an explicit example to illustrate the fact that the solution of the MEM reconstruction may lie outside of the SVD search space. Since Bryan’s derivation of the SVD subspace proceeds independently of the particular form of the kernel K, the provided data D, and the choice of the default model m, we are free to choose them at will. For our example, we consider a transform often encountered among inverse problems related to the physics of strongly correlated quantum systems.
One then has a kernel of the form K(τ, ω) = exp(−ωτ). With the confidence function α entering simply as a scaling factor in (16), we do not consider it further in the following. Let me emphasize again that the arguments leading to (16) did not make any reference to the data we wish to reconstruct. Here, we will consider N_τ = 3 datapoints that encode a single delta peak at a position ω_peak, embedded in a flat background.
Now, let us discretize the frequency domain between ω_min and ω_max with N_ω points. Together with the choice of the three time arguments τ_1, τ_2, and τ_3, this fully determines the kernel matrix K_il in (2). Three different mock functions ρ are considered, with the delta peak placed at three successively higher positions ω_peak; in each case the background is assigned a small constant magnitude (see Figure 1 left).
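A stand-in for this mock setup can be assembled from the earlier sketches as follows; the peak positions and the background magnitude in this listing are assumed placeholder values, not the ones underlying Figure 1.

```python
# Stand-in for the mock setup of Figure 1, reusing K, omega and d_omega from the
# earlier sketches. Peak positions and background magnitude are assumed
# placeholder values, not those underlying the figures of this paper.
background = 0.1
peak_positions = [5.0, 10.0, 15.0]           # three successively higher peak locations

mock_rhos, mock_data = [], []
for w_peak in peak_positions:
    rho = background * np.ones(N_omega)
    rho[np.argmin(np.abs(omega - w_peak))] += 1.0 / d_omega   # delta-like peak of unit area
    mock_rhos.append(rho)
    mock_data.append(K @ (d_omega * rho))    # the corresponding N_τ = 3 ideal datapoints
```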
Bryan's argument states that in the presence of three datapoints (see Figure 1 right), irrespective of the data encoded in those datapoints, the extremum of the posterior must lie in the space spanned by the exponentiation of the first three columns of the matrix U, obtained from the SVD of the transpose kernel K^t. In Figure 2, its first three columns are explicitly plotted. Note that while they do show some peaked behavior around the origin, they quickly flatten off at larger frequencies. From this inspection by eye, it already follows that it will be very difficult to linearly combine U_(1), U_(2), and U_(3) into a sharply peaked function, especially for a peak located at large ω.
Assuming that the data comes with a constant relative error, let us find out how well we can reproduce it within Bryan's search space. A minimization carried out by Mathematica (see the explicit code in Appendix B) tells us that the best achievable likelihood remains far from its ideal value for all three mock functions; i.e., we clearly see that we are not able to reproduce the provided datapoints well (the minimal L stays well above zero) and that the deviation becomes more and more pronounced the higher the delta peak is positioned along ω. In the full search space, on the other hand, we can always find a set of ρ_l's which brings the likelihood as close to zero as desired.
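The corresponding numerical experiment can be phrased in a few lines. The check in Appendix B uses Mathematica; the sketch below is merely an assumed Python analogue with an illustrative error model and default model. It minimizes L over the three coefficients c_i of the exponentiated SVD parametrization (18) for each mock dataset; as discussed above, this minimum stays far from zero, whereas in the full search space ρ = ρ_true itself already yields a vanishing likelihood.

```python
# Assumed Python analogue of the Mathematica check in Appendix B: minimize L over
# the three coefficients c_i of the exponentiated SVD parametrization (18), for
# each mock dataset. Error model and default model are illustrative assumptions.
def bryan_likelihood(c, D_target, C_inv, m):
    rho = m * np.exp(U[:, :N_tau] @ c)       # ρ_l = m_l exp(Σ_i c_i U_li), i <= N_τ
    res = D_target - K @ (d_omega * rho)
    return 0.5 * res @ C_inv @ res

m = background * np.ones(N_omega)
for D_target in mock_data:
    C_inv = np.diag(1.0 / (1e-3 * D_target) ** 2)        # constant relative error (assumed)
    best = minimize(bryan_likelihood, np.zeros(N_tau),
                    args=(D_target, C_inv, m), method="Nelder-Mead")
    # The text argues this minimum stays far from zero for sharply peaked mock
    # functions, while ρ = ρ_true gives L = 0 in the full search space.
    print("best L within Bryan's search space:", best.fun)
```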
Minimizing the likelihood of course is not the whole story in a Bayesian analysis, which is why we have to take a look at the contributions from the regulator term S as well. Remember that, by definition, it is negative. We find that for all three mock functions ρ, Bryan's best fit of the likelihood leads to finite negative values of S. On the other hand, a value of S of comparable size is obtained in the full search space for the parameter set which contains only one entry that is unity and all other entries set to the value of the flat background.

From the above inspection of the extremum of L and the associated value of S inside and outside of Bryan's search space, we arrive at the following: Due to the very limited structure present in the SVD basis functions U_(1), U_(2), and U_(3), it is in general not possible to obtain a good reconstruction of the input data. This in turn leads to a large minimal value of L accessible within the SVD subspace. Due to the fact that S cannot compensate for such a large value of L, we have constructed an explicit example where at least one set of ρ_l's (the one that brings L close to zero in the full search space) leads to a smaller value of L − S and thus to a larger posterior probability than any of the ρ's within the SVD subspace.
In other words, we have constructed a concrete example in which the MEM solution, given by the global extremum of the posterior, is not contained in the SVD subspace.
4. Summary and Conclusions
We have critically assessed in this paper the arguments underlying Bryan's MEM, which we show to be flawed as they resort to methods of linear algebra when treating an inherently nonlinear optimization problem. Therefore, we conclude that even though the individual steps in the derivation of the SVD subspace are all correct, they do not apply to the problem at hand, and their conclusions can be disproved with a direct counterexample. The counterexample we provided utilizes the fact that the componentwise exponentiated columns of the matrix U do not project into the null-space of the Kernel when computing synthetic data. After establishing that the restriction to the SVD subspace is an ad-hoc procedure, we discussed possible ways to overcome it, suggesting either to systematically extend the search space within the MEM or to abandon the MEM in favor of one of the many modern Bayesian approaches developed over the past two decades.
In our ongoing work to improve methods for the Bayesian treatment of inverse problems, we focus on the development of more specific regulators. So far, the methods described above only make reference to very generic properties of the quantity ρ, such as smoothness and positivity. As a specific example, in the study of spectral functions in strongly correlated quantum systems, the underlying first-principles theory provides extra domain knowledge about admissible structures in ρ that so far is not systematically exploited. A further focus of our work is the extension of the separable regulators discussed in this work to fully correlated prior distributions that exploit cross-correlations between the individual parameters ρ_l.