A Scalable Bayesian Sampling Method Based on Stochastic Gradient Descent Isotropization

Franzese, Giulio; Milios, Dimitrios; Filippone, Maurizio; Michiardi, Pietro

doi:10.3390/e23111426

Open AccessArticle

A Scalable Bayesian Sampling Method Based on Stochastic Gradient Descent Isotropization

Data Science Department, Eurecom, 06410 Biot, France

^*

Author to whom correspondence should be addressed.

Entropy 2021, 23(11), 1426; https://doi.org/10.3390/e23111426

Submission received: 21 September 2021 / Revised: 22 October 2021 / Accepted: 26 October 2021 / Published: 28 October 2021

(This article belongs to the Special Issue Approximate Bayesian Inference)

Download

Browse Figure

Versions Notes

Abstract

:

Stochastic gradient sg-based algorithms for Markov chain Monte Carlo sampling (sgmcmc) tackle large-scale Bayesian modeling problems by operating on mini-batches and injecting noise on sgsteps. The sampling properties of these algorithms are determined by user choices, such as the covariance of the injected noise and the learning rate, and by problem-specific factors, such as assumptions on the loss landscape and the covariance of sg noise. However, current sgmcmc algorithms applied to popular complex models such as Deep Nets cannot simultaneously satisfy the assumptions on loss landscapes and on the behavior of the covariance of the sg noise, while operating with the practical requirement of non-vanishing learning rates. In this work we propose a novel practical method, which makes the sg noise isotropic, using a fixed learning rate that we determine analytically. Extensive experimental validations indicate that our proposal is competitive with the state of the art on sgmcmc.

Keywords:

Bayesian sampling; stochastic gradients; Monte Carlo integration

1. Introduction

Stochastic gradient (sg) methods have been extensively studied as a means for mcmc-based Bayesian posterior sampling algorithms to scale to large data regimes. Variants of sg-mcmc algorithms have been studied through the lens of first [1,2,3] or second-order [4,5] Langevin Dynamics, which are mathematically convenient continuous-time processes that correspond to discrete-time gradient methods with and without momentum, respectively. The common traits underlying many methods from the literature can be summarized as follows: they address large data requirements using sg and mini-batching, they inject Gaussian noise throughout the algorithm execution, and they avoid the expensive Metropolis-Hasting accept/reject tests that use the whole data [1,2,4].

Despite mathematical elegance and some promising results restricted to simple models, current approaches fall short in dealing with the complexity of the loss landscape typical of popular modern machine learning models, e.g., neural networks [6,7], for which stochastic optimization poses some serious challenges [8,9].

In general, sg-mcmc algorithms inject random noise to sg descent algorithms: the covariance of such noise and the learning rate, or step-size in the stochastic differential equation simulation community, are tightly related to the assumptions on the loss landscape, which together with the sg noise, determine the sampling properties of these methods [5]. However, current sg-mcmc algorithms applied to popular complex models such as Deep Nets, cannot simultaneously satisfy the assumptions on posterior distribution geometry and on the behavior of the covariance of the sg noise, while operating with the practical requirement of non-vanishing learning rates. In this paper, in accordance with most of the Neural Network related literature, we refer to the posterior distribution geometry as loss landscape. Some recent work [10], instead, argue for fixed step sizes, but settle for variational approximations of quadratic losses. Although we are not the first to highlight these issues, including the lack of a unified notation [5], we believe that studying the role of noise in sg-mcmc algorithms has not received enough attention, and a deeper understanding is truly desirable, as it can clarify how various methods compare. Most importantly, this endeavor can suggest novel and more practical algorithms relying on fewer parameters and less restrictive assumptions.

In this work we chose a mathematical notation that emphasizes the role of noise covariances and learning rate on the behavior of sg-mcmc algorithms (Section 2). As a result, the equivalence between learning rate annealing and extremely large injected noise covariance becomes apparent, and this allows us to propose a novel practical sg-mcmc algorithm (Section 3). We derive our proposal, by first analyzing the case where we inject the smallest complementary noise such that its combined effects with the sg noise result in an isotropic noise. Thanks to this isotropic property of the noise, it is possible to deal with intricate loss surfaces typical of deep models, and produce samples from the true posterior without learning rate annealing. This, however, comes at the expense of cubic complexity matrix operations. We address such issues through a practical variant of our scheme, which employs well-known approximations to the sg noise covariance (see, e.g., [11]). The result is an algorithm that produces approximate posterior samples with a fixed, theoretically derived, learning rate. Please note that in generic Bayesian deep learning setting, none of the existing implementations of sg-mcmc methods converge to the true posterior without learning rate annealing. In contrast, our method automatically determines an appropriate learning rate through a simple estimation procedure. Furthermore, our approach can be readily applied to pre-trained models: after a “warmup” phase to compute sg noise estimates, it can efficiently perform Bayesian posterior sampling.

We evaluate sg-mcmc algorithms (Section 4) through an extensive experimental campaign, where we compare our approach to several alternatives, including Monte Carlo Dropout (mcd) [12] and Stochastic Weighted Averaging Gaussians (swag, [9]), which have been successfully applied to the Bayesian deep learning setting. Our results indicate that our approach offers performance that are competitive to the state of the art, according to metrics that aim at assessing the predictive accuracy and uncertainty.

2. Preliminaries and Related Work

Consider a dataset of m—dimensional observations

D = {U_{i}}_{i = 1}^{N}

. Given prior

p (θ)

for a d-dimensional set of parameters, and a likelihood model

p (D | θ)

, the posterior is obtained by means of Bayes theorem as follows:

p (θ | D) = \frac{p (D | θ) p (θ)}{p (D)}

(1)

where

p (D)

is also known as the model evidence, defined as the integral

p (D) = \int p (D | θ) p (θ) d θ

. Except when the prior and the likelihood function are conjugate, Equation (1) is analytically intractable [13]. However, the joint likelihood term in the numerator is typically not hard to compute; this is a key element of many mcmc algorithms, since the normalization constant

p (D)

does not affect the shape of the distribution in any way other than scaling. The posterior distribution is necessary to obtain predictive distributions for new test observations

U_{*}

, as:

p (U_{*} | D) = \int p (U_{*} | θ) p (θ | D) d θ

(2)

We focus in particular on Monte Carlo methods to obtain an estimate of this predictive distribution, by averaging over

N_{MC}

samples obtained from the posterior over

θ

, i.e.,

θ^{(i)} \sim p (θ | D)

p (U_{*} | D) \approx \frac{1}{N_{MC}} \sum_{i = 1}^{N_{MC}} p (U_{*} | θ^{(i)})

(3)

We develop our work by working with an unnormalized version of the logarithm of the posterior density, by expressing the negative logarithm of the joint distribution of the dataset

D

and parameters

θ

as:

- f (θ) = \sum_{i = 1}^{N} log p (U_{i} | θ) + log p (θ) .

(4)

For computational efficiency, we use a minibatch stochastic gradient

g (θ)

, which guarantees that the estimated gradient is an unbiased estimate of the true gradient

\nabla f (θ)

, and we assume that the randomness due to the minibatch introduces a Gaussian noise:

g (θ) \sim N (\nabla f (θ), 2 B (θ)),

(5)

where the matrix

B (θ)

denotes the sg noise covariance, which depends on the parametric model, the data distribution and the minibatch size.

A survey of algorithms to sample from the posterior using sg methods can be found in Ma et al. [5]. In Appendix A we report some well-known facts which are relevant for the derivations in our paper. As shown in the literature [10,14], there are structural similarities between sg-mcmc algorithms and stochastic optimization methods, and both can be used to draw samples from posterior distributions. Notice that the original goal of stochastic optimization is to find the minimum of a given cost function, and the stochasticity is introduced by sub-sampling the dataset to scale. sg-mcmc methods instead aim at sampling from a given distribution, i.e., collecting multiple values, and the stochasticity is necessary explore the whole landscape. In what follows, we use a unified notation to compare many existing algorithms in light of the role played by their noise components.

It is well-known [15,16,17] that stochastic gradient descent (sgd), with and without momentum, can be studied through the following stochastic differential equation (sde), when the learning rate

η

is small enough (In this work we do not consider discretization errors. The reader can refer to classical sde texts such as [18] to investigate the topic in greater depth.):

d z_{t} = s (z_{t}) d t + \sqrt{2 η D (z_{t})} d W_{t} .

(6)

where

s

is usually referred to as driving force and

D

as diffusion matrix We use a generic form of the sde, with variable

z

instead of

θ

, which accommodates sgd variants, with and without momentum. By doing this, we will be able to easily cast the expression for the two cases in what follows (The operator

\nabla^{⊤}

applied to matrix

D (z)

produces a row vector whose elements are the divergences of the

D (z)

columns. Our notation is aligned with Chen et al. [4]).

Definition 1.

A distribution

ρ (z) \propto exp (- ϕ (z))

is said to be astationarydistribution for thesdeof the form (6), if and only if it satisfies the following Fokker-Planck equation (fpe):

\begin{matrix} 0 = Tr \{\nabla [- s {(z)}^{⊤} ρ (z) + \nabla^{⊤} (D (z) ρ (z))]\} . \end{matrix}

(7)

Please note that in general, the stationary distribution does not converge to the desired posterior distribution, i.e.,

ϕ (z) \neq f (z)

, as shown by Chaudhari and Soatto [8]. Additionally, given an initial condition for

z_{t}

, its distribution is going to converge to

ρ (z)

only for

t \to \infty

. In practice, we observe the sde dynamics for a finite amount of time: then, we declare that the process is approximately in the stationary regime once the potential has reached low and stable values.

Next, we briefly overview known approaches to Bayesian posterior sampling, and interpret them as variants of an sgd process, using the fpe formalism.

2.1. Gradient Methods without Momentum

The generalized updated rule of sgd, described as a discrete-time stochastic process, writes as:

\begin{matrix} δ θ_{n} = - η P (θ_{n - 1}) (g (θ_{n - 1}) + w_{n}), \end{matrix}

(8)

where

P (θ_{n - 1})

is a user-defined preconditioning matrix, and

w_{n}

is a noise term, distributed as

w_{n} \sim N (0, 2 C (θ_{n}))

, with a user-defined covariance matrix

C (θ_{n})

. Then, the corresponding continuous-time sde is [15]:

d θ_{t} = - P (θ_{t}) \nabla f (θ_{t}) d t + \sqrt{2 η P {(θ_{t})}^{2} \sum (θ_{t})} d W_{t} .

(9)

In this paper we use the symbol n to indicate discrete time, while t for continuous time. We denote by

C (θ)

the covariance of the injected noise and

\sum (θ)

the composite noise covariance. Please note that

\sum (θ_{t}) = B (θ_{t}) + C (θ_{t})

combines the sg and the injected noise. Notice that our choice of notation is different from the standard one, in which the starting discrete-time process is in the form

δ θ_{n} = - η P (θ_{n - 1}) (g (θ_{n - 1})) + w_{n}

. By directly grouping the injected noise with the stochastic gradient we can better appreciate the relationship between annealing the learning rate and extremely large injected noise. Moreover, as will be explained in Section 3, this allows derivation of a new sampling algorithm.

We define the stationary distribution of the sde in Equation (9) as

ρ (θ) \propto exp (- ϕ (θ))

. Please note that when

C = 0

, the potential

ϕ (θ)

differs from the desired posterior

f (θ)

[8]. The following theorem, which is an adaptation of known results in light of our formalism, states the conditions for which the noisy sgd converges to the true posterior distribution (proof in Appendix A).

Theorem 1.

Consider dynamics of the form (9) and define the stationary distribution

ρ (θ) \propto exp (- ϕ (θ))

. If

\nabla^{⊤} (\sum {(θ)}^{- 1}) = 0^{⊤} and η P (θ) = \sum {(θ)}^{- 1},

(10)

then

ϕ (θ) = f (θ)

.

Stochastic Gradient Langevin Dynamics (sgld) [1] is a simple approach to satisfy Equation (10); it uses no preconditioning,

P (θ) = I

, and sets the injected noise covariance to

C (θ) = η^{- 1} I

. In the limit for

η \to 0

, it holds that

\sum (θ) = B (θ) + η^{- 1} I ≃ η^{- 1} I

. Then,

\nabla^{⊤} (\sum {(θ)}^{- 1}) = η \nabla^{⊤} I = 0^{⊤}

, and

η P (θ) = \sum {(θ)}^{- 1}

. Although sgld succeeds in (asymptotically) generating samples from the true posterior, its mixing rate is unnecessarily slow, due to the extremely small learning rate [2].

An extension to sgld is Stochastic Gradient Fisher Scoring (sgfs) [2], which can be tuned to switch between sampling from an approximate posterior, using a non-vanishing learning rate, and the true posterior, by annealing the learning rate to zero. sgfs uses preconditioning,

P (θ) \propto B {(θ)}^{- 1}

. In practice, however,

B (θ)

is ill conditioned for complex models such as deep neural networks. Then, many of its eigenvalues are almost zero [8], and computing

B {(θ)}^{- 1}

is problematic. An in-depth analysis of sgfs reveals that conditions (10) would be met with a non-vanishing learning rate only if, at convergence,

\nabla^{⊤} (B {(θ)}^{- 1}) = 0^{⊤}

, which would be trivially true if

B (θ)

was constant. However, recent work [6,7] suggest that this condition is difficult to justify for deep neural networks.

The Stochastic Gradient Riemannian Langevin Dynamics (sgrld) algorithm [3] extends sgfs to the setting in which

\nabla^{⊤} (B {(θ)}^{- 1}) \neq 0^{⊤}

. The process dynamic is adjusted by adding the term

\nabla^{⊤} (B {(θ)}^{- 1})

. However, the term

\nabla^{⊤} (B {(θ)}^{- 1})

has not a clear estimation procedure, restricting sgrld to cases where it can be computed analytically.

The work by [10] investigates constant-rate sgd (with no injected noise), and determines analytically the learning rate and preconditioning that minimize the Kullback–Leibler (kl) divergence between an approximation and the true posterior. Moreover, it shows that the preconditioning used in sgfs is optimal, in the sense that it converges to the true posterior, when

B (θ)

is constant and the true posterior has a quadratic form.

In summary, to claim convergence to the true posterior distribution, existing approaches require either vanishing learning rates or assumptions on the sg noise covariance that are difficult to verify in practice, especially when considering deep models. We instead propose a novel practical method that induces isotropic sg noise and thus satisfies Theorem 1. We determine analytically a fixed learning rate, and we require weaker assumptions on the loss shape.

2.2. Gradient Methods with Momentum

Momentum-corrected methods emerge as a natural extension to sgd approaches. The general set of update equations for (discrete-time) momentum-based algorithms is:

\begin{matrix} \{\begin{matrix} δ θ_{n} = η P (θ_{n - 1}) M^{- 1} r_{n - 1} \\ δ r_{n} = - η A (θ_{n - 1}) M^{- 1} r_{n - 1} - η P (θ_{n - 1}) (g (θ_{n - 1}) + w_{n}), \end{matrix} \end{matrix}

where

P (θ_{n - 1})

is a preconditioning matrix, M is the mass matrix and

A (θ_{n - 1})

is the friction matrix, as shown by [4,19]. As with the first order counterpart, the noise term is distributed as

w_{n} \sim N (0, 2 C (θ_{n}))

. Then, the sde to describe continuous-time system dynamics is:

\begin{matrix} \{\begin{matrix} d θ_{t} = P (θ_{t}) M^{- 1} r_{t} d t \\ d r_{t} = - (A (θ_{t}) M^{- 1} r_{t} + P (θ_{t}) \nabla f (θ_{t})) d t + \sqrt{2 η P {(θ_{t})}^{2} \sum (θ_{t})} d W_{t} . \end{matrix} \end{matrix}

(11)

where

P {(θ_{t})}^{2} = P (θ_{t}) P (θ_{t})

and we assume

P (θ_{t})

to be symmetric. The theorem hereafter describes the conditions for which noisy sgd with momentum converges to the true posterior distribution (Appendix A).

Theorem 2.

Consider dynamics of the form (11) and define the stationary distribution for

θ_{t}

as

ρ (θ) \propto exp (- ϕ (θ))

. If

\nabla^{⊤} P (θ) = 0^{⊤} and A (θ) = η P {(θ)}^{2} \sum (θ),

(12)

then

ϕ (θ) = f (θ)

.

In the naive case, where

P (θ) = I, A (θ) = 0, C (θ) = 0

, Equation (12) are not satisfied and the stationary distribution does not correspond to the true posterior [4]. To generate samples from the true posterior it is sufficient to set

P (θ) = I, A (θ) = η B (θ), C (θ) = 0

(as in Equation (9) in [4]).

Stochastic Gradient Hamiltonian Monte Carlo (sghmc) [4] suggests that estimating

B (θ)

can be costly. Hence, the injected noise

C (θ)

is chosen such that

C (θ) = η^{- 1} A (θ)

, where

A (θ)

is user-defined. When

η \to 0

, the following approximation holds:

\sum (θ) ≃ C (θ)

. It is then trivial to check that conditions (12) hold without the need for explicitly estimating

B (θ)

. A further practical reason to avoid setting

A (θ) = η B (θ)

is that the computational cost for the operation

A (θ_{n - 1}) M^{- 1} r_{n - 1}

has

O (D^{2})

complexity, whereas if

C (θ)

is diagonal, this is reduced to

O (D)

. This, however, severely slows down the sampling process.

Stochastic Gradient Riemannian Hamiltonian Monte Carlo (sgrhmc) is an extension to sghmc [5]), which considers a generic, space-varying preconditioning matrix

P (θ)

derived from information geometric arguments [20]. sgrhmc suggests setting

P (θ) = G {(θ)}^{- \frac{1}{2}}

, where

G (θ)

is the Fisher Information matrix. To meet the requirement

\nabla^{⊤} P (θ) = 0^{⊤}

, it includes a correction term,

- \nabla^{⊤} P (θ)

. The injected noise is set to

C (θ) = η^{- 1} I - B (θ)

, consequently

\sum = η^{- 1} I

, and the friction matrix is set to

A (θ) = P {(θ)}^{2}

. With all these choices, Theorem 2 is satisfied. Although appealing, the main drawbacks of this method are the need for an analytical expression of

\nabla^{⊤} P (θ)

, and the assumption for

B (θ)

to be known.

From a practical standpoint, momentum-based methods suffer from the requirement to tune many hyperparameters, including the learning rate, and the parameters that govern the simulation of a second-order Langevin dynamics.

The method we propose in this work can be applied to momentum-based algorithms; in this case, it could be viewed as an extension of the work in [11], albeit addressing the complex loss landscapes typical of deep neural networks. However, we leave this avenue of research for future work.

3. Sampling by Layer-Wise Isotropization

We present a simple and practical approach to inject noise to SGD iterates to perform Bayesian posterior sampling. Our goal is to sample from the true posterior distribution (or approximations thereof) using a constant learning rate, and to rely on more lenient assumptions about the shape of the loss landscape that characterize deep models, compared to previous works. In general, in modern machine learning applications, we deal with multi-layer neural networks [21]. We exploit the natural subdivision of the parameters of these architecture into different layers to propose a practical sampling scheme

Careful inspection of Theorem 1 reveals that the matrices

P (θ), \sum (θ)

are instrumental in determining the convergence properties of sg methods to the true posterior. Therefore, we consider the constructive approach of designing

η P (θ)

to obtain a sampling scheme that meets our goals; we set

η P (θ)

to be a constant, diagonal matrix which we constrain to be layer-wise uniform:

η P (θ) = Λ^{- 1} = diag {([\underset{layer 1}{\underset{⏟}{λ^{(1)}, \dots, λ^{(1)}}}, \dots, \underset{layer N_{l}}{\underset{⏟}{λ^{(N_{l})}, \dots λ^{(N_{l})}}}])}^{- 1} .

(13)

By properly selecting the set of parameters

{λ^{i}}

we can achieve the simultaneous result of non-vanishing learning rate and well-conditioned preconditioning matrix. This implies a layer-wise learning rate

η^{(p)} = \frac{1}{λ^{(p)}}

for the p-th layer, without further preconditioning.

We can now prove (see Appendix B), as a corollary to Theorem 1, that our design choices can guarantee convergence to the true posterior distribution.

Corollary 1.

(Theorem 1) Consider dynamics of the form (9) and define the stationary distribution

ρ (θ) \propto exp (- ϕ (θ))

. If

η P (θ) = Λ^{- 1}

as in (13),

C (θ) = Λ - B (θ)

and

C (θ) ≻ 0 \forall θ

, then

ϕ (θ) = f (θ)

.

If aforementioned conditions are satisfied, it is in fact simple to show that the relevant matrices satisfy the conditions in Equation (10). The covariance matrix of the composite noise is said to be isotropic within the layers of (deep) models. In fact,

\sum (θ) = C (θ) + B (θ) = diag ([\begin{matrix} λ^{(1)}, \dots, λ^{(1)}, \dots, λ^{(N_{l})}, \dots λ^{(N_{l})} \end{matrix}])

. From a practical point of view, we choose

Λ

to be, among all valid matrices satisfying

Λ - B (θ) ≻ 0

, the smallest (the one with the smallest

λ

’s). Indeed, larger

Λ

induce a smaller learning rate, thus unnecessarily reducing sampling speed.

Now, let us consider an ideal case, in which we assume the sg noise covariance

B (θ)

and

Λ

to be known in advance. The procedure described in Algorithm 1 illustrates a naive sg method that uses the injected noise covariance

C (θ)

to sample from the true posterior.

Algorithm 1 Idealized posterior sampling

{Initialization:

θ_{0}

}

SAMPLE (θ_{0}, B (θ), Λ) :

θ \leftarrow θ_{0}

loop

g = \nabla \tilde{f} (θ)

n \sim N (0, I)

C {(θ)}^{1 / 2} \leftarrow {(\sum - B (θ))}^{1 / 2}

g \leftarrow \sum^{- 1} (g + \sqrt{2} C {(θ)}^{1 / 2} n)

θ \leftarrow θ - g

end loop

This deceivingly simple procedure generate samples from the true posterior, with a non-vanishing learning rate, as shown earlier. However, it cannot be used in practice as

B (θ)

and

Λ

are unknown. Furthermore, the algorithm requires computationally expensive operations, i.e., to compute

{(\sum - B (θ))}^{\frac{1}{2}}

, which requires

O (d^{3})

operations, and

C {(θ)}^{\frac{1}{2}}

, which costs

O (d^{2})

multiplications.

Next, we describe a practical variant of our approach, where we use approximations at the expense of generating samples from the true posterior distribution. We note that [10] suggest exploring a related preconditioning, but do not develop this path in their work. Moreover, the proposed method shares similarities with a scheme proposed in [22] although the analysis we perform here is different.

3.1. A Practical Method: Isotropic SGD

To render the idealized sampling method practical, it is necessary to consider some additional assumptions. As we explain at the end of this section, the assumptions that follow are less strict than other approaches in the literature.

Assumption 1.

Thesgnoise covariance

B (θ)

can be approximated with a diagonal matrix, i.e.,

B (θ) = diag (b (θ))

.

Assumption 2.

The signal-to-noise ratio (SNR) of a gradient is small enough such that in the stationary regime, the second-order moment of the gradient is a good estimate of the true variance. Hence, combining with Assumption 1,

b (θ) ≃ \frac{E [g (θ) ⊙ g (θ)]}{2}

, where ⊙ indicates the elementwise product.

Assumption 3.

The sum of the variances of noise components, layer by layer, can be assumed to constant in the stationary regime. Then,

β^{(p)} = \sum_{j \in I_{p}} b_{j} (θ)

, where

I_{p}

is the set of indices of parameters belonging to

p_{t h}

layer.

The diagonal covariance assumption (i.e., Assumption 1) is common in other works, such as [2,11]. The small signal-to-noise ratio as stated in Assumption 2 is in line with recent studies, such as [11,23]. Assumption 3 is similar to those appeared in earlier work, such as [24]. Please note that Assumptions 2 and 3 must hold in the stationary regime when the process reaches the bottom valley of the loss landscape. The matrix

(b (θ))

has been associated in the literature with the empirical Fisher information matrix [2,25]. As we do not consider this matrix for preconditioning purposes, we do not further investigate this connection.

Given our assumptions, and our design choices, it is then possible to show (see Appendix B) that the optimal (i.e., the smallest possible)

Λ = [\begin{matrix} λ^{(1)}, \dots, λ^{(1)}, \dots, λ^{(N_{l})}, \dots λ^{(N_{l})} \end{matrix}]

satisfying Corollary 1 can be obtained as

λ^{(p)} = β^{(p)}

. Please note that we do not assume

B (θ)

to be known, but use a simple procedure to estimate its components by computing:

λ^{(p)} = \sum_{j \in I_{p}} b_{j} (θ) = \frac{| | g^{(p)} (θ) {| |}^{2}}{2},

where

g^{(p)} (θ)

is the portion of stochastic gradient corresponding to the p-th layer. Then, the composite noise matrix

\sum = Λ

is a layer-wise isotropic covariance matrix, which inspires the name of our proposed method as Isotropic SGD (i-sgd).

The practical implementation of i-sgd is shown in Algorithm 2. The advantage of i-sgd is that it can either be used to obtain posterior samples starting from a pre-trained model, or do so by training a model from scratch. In either case, the estimates of

B (θ)

are used to compute

Λ

, as discussed above. An important consideration is that once all

λ^{(i)}

have been estimated, the learning rate, layer by layer, is determined automatically. In fact, for the p-th layer, the learning rate is:

η^{(p)} = {λ^{(p)}}^{- 1}

. A simpler approach is to use a unique learning rate for all layers, where the equivalent

λ

is the sum of all

λ^{(p)}

.

Algorithm 2i-sgd: practical posterior sampling

SAMPLE (θ_{0}) :

θ \leftarrow θ_{0}

loop

g = \nabla \tilde{f} (θ)

for

p \leftarrow 1

to N_l do

n \sim N (0, I)

C {(θ)}^{1 / 2} \leftarrow (λ^{(p)} - (1 / 2) (g^{(p)} ⊙ g^{(p)}))

g^{(p)} \leftarrow 1 / λ^{(p)} (g^{(p)} + \sqrt{2} C {(θ)}^{1 / 2} n)

end for

θ \leftarrow θ - g

end loop

A Remark on Convergence

In summary, i-sgd is a practical method to perform approximate Bayesian posterior sampling, backed up by solid theoretical foundations. Our assumptions, which are at the origin of the approximate nature of i-sgd, are less strict than those used in the literature of sg-mcmc methods. More precisely, the theory behind i-sgd can explain convergence to the true posterior with a non-vanishing learning rate in the particular case when Assumption 1 holds and the estimation of

B (θ)

is perfect. Even with perfect estimates, this is not the case for sgfs, which requires the correction term

\nabla^{⊤} B {(θ)}^{- 1} = 0

. Additionally, both sgrld and sgrhmc are more demanding than i-sgd because they require computing

\nabla^{⊤} B {(θ)}^{- 1}

, for which an estimation procedure is elusive. Finally, the method by Springenberg et al. [11] needs a constant, diagonal

B (θ)

, a condition that does not necessarily hold for deep models.

3.2. Computational Cost

The computational cost of i-sgd is as follows. As with [4], we define the cost of computing a gradient minibatch as

C_{g} (N_{b}, d)

. Thanks to Assumptions 1 and 2, the computational cost for estimating the noise covariance scales as

O (d)

multiplications. The computational cost of generating random samples with the desired covariance scales as

O (d)

square roots and

O (d)

multiplications (without considering the cost of generating random numbers). The overall cost of our method is the sum of the above terms. Notice that the cost of estimating the noise covariance does not depend on the minibatch size

N_{b}

. We would like to stress that in many modern models, the real computational bottleneck is the backward propagation for the computation of the gradients. As all the sg-mcmc methods considered in this work require one gradient evaluation per step, the different methods have in practice the same complexity.

The space complexity of i-sgd is the same as sghmc,sgfs and variants: it scales as

O (N_{sam} d)

, where

N_{sam}

is the number of posterior samples.

4. Experiments

The empirical analysis of our method, and its comparison to alternative approaches from the literature, is organized as follows. First, we proceed with a validation of i-sgd using the standard uci datasets [26] and a shallow neural network. Then we move to the case of deeper models: we begin with a simple CNN used on the mnist [27] dataset, then move to the standard ResNet-18 [28] deep network using the Cifar-10 [29] dataset.

We compare i-sgd to other Bayesian sampling methods such as sghmc [4], sgld [2], and to alternative approaches to approximate Bayesian inference, including mcd [12], swag [9] and vsgd [10]. In general, our result indicates that: (1) i-sgd achieves similar or superior performance regarding competitors, when measuring uncertainty quantification, even with simple datasets and models; (2) i-sgd is simple to tune, when compared to alternatives; (3) i-sgd is competitive when used for deep Bayesian modeling, even when compared to standard methods used in the literature. In particular, the proposed method shares some of the strengths of vsgd, such as learning rates determined automatically and the simplicity of sgld. Appendix B includes additional implementation details on i-sgd. Appendix C presents detailed configurations of all methods we compare, and additional experimental results.

4.1. A Disclaimer on Performance Characterization

It is important to stress a detail on the analysis of the experimental campaign. The discussion is usually focused on the goodness of the various methods for representing the true posterior distribution. Different methods can or cannot claim convergence to the true posterior according to certain assumptions and the nature of the hyperparameters. In the experimental validation of the results, however, we do not have access to the form of the true posterior as it is exactly the problem we are trying to solve. The practical solution adopted is to compare the different methods in terms of proxy metrics evaluated on the test sets, such as the accuracy and uncertainty metrics. Being better in terms of these performance metrics does not imply that the sampling method is better at approximating the posterior distribution, and outperforming competitors in terms of these metric do not provide sufficient information about the intrinsic quality of the sampling scheme.

4.2. Regression Tasks, with Simple Models

We consider several regression tasks defined on the uci datasets. We use a simple neural network configuration with two fully connected layers and a relu activation function; the hidden layer includes 50 units. In this set of experiments, we use the following metrics: the root mean square error (rmse) to judge the model predictive performance and the mean negative log-likelihood (mnll) as a proxy for uncertainty quantification. We note that the task of tuning our competitors was far from trivial. We used our own version of sghmc, based on [11], to ensure a proper understanding of the implementation internals, and we proceeded with a tuning process to find appropriate values for the numerous hyperparameters. In this set of experiments, we omit results for swag, which we keep for more involved scenarios.

Table 1 and Table 2 report a complete overview of our results, for a selection of uci datasets. For each method and each dataset, we also included how many out of the 10 splits considered failed to converge, indicated as

F = \dots

. As explained in Appendix C we implemented a temperature scaled version of vsgd. A clear picture emerges from this first set of experiments: while for the rmse the performance is similar for different methods, for the mnll averaging over multiple samples clearly improves the uncertainty quantification capabilities. sghmc is in many cases better than alternatives, considering however the standard deviation of the results it is difficult to claim clear superiority of one method over the others.

4.3. Classification Tasks, with Deeper Models

Next, we compare i-sgd against competitors on image classification tasks. First, we use the mnist dataset, and a simple LeNet-5 CNN [30]. All methods are compared based on the test accuracy acc,mnll and the expected calibration error (ece, 31]). Additionally, at test time, we carry out predictions on both mnist and not-mnist; the latter is a dataset equivalent to mnist, but it represents letters rather than numbers. (http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html, accessed on 24 October 2021) This experimental setup is often used to check whether the entropy of the predictions on not-mnist is higher than the entropy of the predictions on mnist (the entropy of the output of an

N_{c l}

classes classifier, represented by the vector

p

, is defined as

- \sum_{i = 1}^{N_{c l}} p_{i} log p_{i}

).

Table 3 indicates that all methods are essentially equivalent in terms of accuracy and mnll. We consider, together with the classical in and out of distribution entropies the regions of convergence (rocs) diagrams comparing detection of out of distribution samples and false alarms when using as test statistic the entropy. Results, reported in Figure 1, clearly shows that: (1) collecting multiple samples improve the uncertainty quantification capabilities (2) i-sgd is competitive (but not the best scheme) and importantly outperform the closest approach to ours, i.e., vsgd. The experimental results show that i-sgd improves the quality of the baseline model with respect to all metrics. To test whether the improvements are due just to “additional training” or are intrinsically due to the Bayesian averaging properties, we do consider alternative deterministic baselines (details in Appendix C). For this set of experiments the best performing is baseline R. As can be appreciated by comparing Table 3 and Figure 1, while it is possible to increase the classical metrics, i-sgd (and other methods) still outperform by a large margin the baselines in terms of detection of out of distribution samples.

We now move on to a classical image classification problem with deep convolutional networks, whereby we use the cifar10 dataset, and the ResNet-18 network architecture. For this set of experiments, we compare i-sgd, sghmc, swag, and vsgd using again test accuracy and mnll, which we report in Table 4. As usual, we compare the results against the baseline of the individual network resulting from the pre-training phase. Results are obtained averaging over three independent seeds. Notice, as expanded in Appendix C that for swag we do consider two variants: the Bayesian correct one (swag) and a second variant that has better performance (swag wd). We stress again, as highlighted in Section 4.1 that not always goodness of approximation of the posterior and performance correlate positively. Additionally in this case, we found i-sgd to be competitive with other methods and superior to the baseline. Among the competitors, we found i-sgd to the easiest to tune, given the feature of a fixed learning rate informed by theoretical considerations; we believe that this is an important aspect to consider for a wide adoption of our proposal by practitioners.

5. Conclusions

sg methods allowed Bayesian posterior sampling algorithms, such as mcmc, to regain relevance in an age when datasets have reached extremely large sizes. However, despite mathematical elegance and promising results, current approaches from the literature are restricted to simple models. Indeed, the sampling properties of these algorithms are determined by simplifying assumptions on the loss landscape, which do not hold for the kind of complex models which are popular these days, such as deep models. Meanwhile, sg-mcmc algorithms require vanishing learning rates, which force practitioners to develop creative annealing schedules that are often model specific and difficult to justify.

We have attempted to target these weaknesses by suggesting a simpler algorithm that relies on fewer parameters and less strict assumptions compared to the literature on sg-mcmc. We used a unified mathematical notation to deepen our understanding of the role of the covariance of the noise of stochastic gradients and learning rate on the behavior of sg-mcmc algorithms. We then presented a practical variant of the sgd algorithm, which uses a constant learning rate, and an additional noise to perform Bayesian posterior sampling. Our proposal is derived from the ideal method, in which it is guaranteed that samples are generated from the true posterior. When the learning rate and noise terms are empirically estimated, with no user intervention, our method offers a very good approximation to the posterior, as demonstrated by the extensive experimental campaign.

We verified empirically the quality of our approach, and compared its performance to state-of-the-art sg-mcmc and alternative methods. Results, which span a variety of settings, indicated that our method is competitive to the alternatives from the state-of-the-art, while being much simpler to use.

Author Contributions

Formal analysis, G.F., D.M., M.F. and P.M.; Methodology, G.F., D.M., M.F. and P.M.; Software, G.F., D.M., M.F. and P.M.; Writing—original draft, G.F., D.M., M.F. and P.M.; Writing—review & editing, G.F., D.M., M.F. and P.M. All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

MF gratefully acknowledges support from the AXA Research Fund and the Agence Nationale de la Recherche (grant ANR-18-CE46-0002 and ANR-19-P3IA-0002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Background and Related Material

Appendix A.1. The Minibatch Gradient Approximation

Starting from the gradient of the logarithm of the posterior density:

- \nabla f (θ) = \sum_{i = 1}^{N} \nabla log p (U_{i} | θ) + \nabla log p (θ),

it is possible to define its minibatch version by computing the gradient on a random subset

I_{N_{b}}

with cardinality

N_{b}

of all the indices. The minibatch gradient

g (θ)

is computed as

- g (θ) = \frac{N}{N_{b}} \sum_{i = 1}^{N_{b}} \nabla log p (U_{i} | θ) + \nabla log p (θ),

By simple calculations it is possible to show that the estimation is unbiased (

E (g (θ)) = \nabla f (θ)

). The estimation error covariance is defined to be

E [(g (θ) - \nabla f (θ)) {(g (θ) - \nabla f (θ))}^{⊤}] = 2 B (θ)

.

If the minibatch size is large enough, invoking the central limit theorem, we can state that the minibatch gradient is normally distributed:

g (θ) \sim N (\nabla f (θ), 2 B (θ)) .

Appendix A.2. Gradient Methods without Momentum

Appendix A.2.1. The sde from Discrete Time

We start from the generalized updated rule of sgd:

\begin{matrix} δ θ_{n} = - η P (θ_{n - 1}) (g (θ_{n - 1}) + w_{n}) . \end{matrix}

Since

g (θ_{n - 1}) \sim N (\nabla f (θ_{n - 1}), 2 B (θ_{n - 1}))

we can rewrite the above equation as:

\begin{matrix} δ θ_{n} = - η P (θ_{n - 1}) (\nabla f (θ_{n - 1}) + w_{n}^{^{'}}), \end{matrix}

where

w_{n}^{^{'}} \sim N (0, 2 \sum (θ_{n - 1}))

. If we separate deterministic and random component we can equivalently write:

\begin{matrix} δ θ_{n} = - η P (θ_{n - 1}) \nabla f (θ_{n - 1}) + η P (θ_{n - 1}) w_{n}^{^{'}} = - η P (θ_{n - 1}) \nabla f (θ_{n - 1}) + \\ \sqrt{2 η P^{2} (θ_{n - 1}) \sum (θ_{n - 1})} v_{n} \end{matrix}

where

v_{n} \sim N (0, \sqrt{η} I)

. When

η

is small enough (

η \to d t

) we can interpret the above equation as the discrete-time simulation of the following sde [15]:

d θ_{t} = - P (θ_{t}) \nabla f (θ_{t}) d t + \sqrt{2 η P {(θ_{t})}^{2} \sum (θ_{t})} d W_{t},

where

d W_{t}

is a d—dimensional Brownian motion.

Appendix A.2.2. Proof of Theorem 1

The stationary distribution of the above sde,

ρ (θ) \propto exp (- ϕ (θ))

, satisfies the following fpe

\begin{matrix} 0 = Tr \{\nabla [\nabla^{⊤} (f (θ)) P (θ) ρ (θ) + η \nabla^{⊤} (P {(θ)}^{2} \sum (θ) ρ (θ))]\}, \end{matrix}

that we rewrite as

\begin{matrix} 0 = Tr {\nabla [\nabla^{⊤} (f (θ)) P (θ) ρ (θ) - η \nabla^{⊤} (ϕ (θ)) P {(θ)}^{2} \sum (θ) ρ (θ) + η \nabla^{⊤} (P {(θ)}^{2} \sum (θ)) ρ (θ)]} . \end{matrix}

The above equation is verified with

\nabla f (θ) = \nabla ϕ (θ)

if

\{\begin{matrix} \nabla^{⊤} (P {(θ)}^{2} \sum (θ)) = 0 \\ η P {(θ)}^{2} \sum (θ) = P (θ) \to η P (θ) = \sum {(θ)}^{- 1} \end{matrix}

that proves Theorem 1.

Appendix A.3. Gradient Methods with Momentum

Appendix A.3.1. The SDE from Discrete Time

The general set of update equations for (discrete-time) momentum-based algorithms is:

\begin{matrix} \{\begin{matrix} δ θ_{n} = η P (θ_{n - 1}) M^{- 1} r_{n - 1} \\ δ r_{n} = - η A (θ_{n - 1}) M^{- 1} r_{n - 1} - η P (θ_{n - 1}) (g (θ_{n - 1}) + w_{n}) . \end{matrix} \end{matrix}

Similarly to the case without momentum, we rewrite the second equation of the system as

\begin{matrix} δ r_{n} = - η A (θ_{n - 1}) M^{- 1} r_{n - 1} - η P (θ_{n - 1}) (g (θ_{n - 1}) + w_{n}) = \\ - η A (θ_{n - 1}) M^{- 1} r_{n - 1} - η P (θ_{n - 1}) \nabla f (θ_{n - 1}) + \sqrt{2 η P^{2} (θ_{n - 1}) \sum (θ_{n - 1})} v_{n} \end{matrix}

where again

v_{n} \sim N (0, \sqrt{η} I)

. If we define the supervariable

z = {[θ, r]}^{⊤}

we can rewrite the system as

\begin{matrix} δ z_{n} = - η [\begin{matrix} 0 & - P (θ_{n - 1}) \\ P (θ_{n - 1}) & A (θ_{n - 1}) \end{matrix}] s (z_{n - 1}) + \sqrt{2 η D (z_{n - 1})} ν_{n} \end{matrix}

where

s (z) = [\begin{matrix} \nabla f (θ) \\ M^{- 1} r \end{matrix}]

,

D (z) = [\begin{matrix} 0 & 0 \\ 0 & P {(θ)}^{2} \sum (θ) \end{matrix}]

and

ν_{n} \sim N (0, \sqrt{η} I)

.

As the learning rate goes to zero (

η \to d t

), similarly to the previous case, we can interpret the above difference equation as a discretization of the following fpe

\begin{matrix} d z_{t} = - [\begin{matrix} 0 & - P (θ_{t}) \\ P (θ_{t}) & A (θ_{t}) \end{matrix}] s (z_{t}) + \sqrt{2 η D (z_{t})} d W_{t} \end{matrix}

Appendix A.3.2. Proof of Theorem 2

As before we assume that the stationary distribution has form

ρ (z) \propto exp (- ϕ (z))

. The corresponding fpe is

\begin{matrix} 0 = Tr (\nabla (s {(z)}^{⊤} [\begin{matrix} 0 & - P (θ) \\ P (θ) & A (θ) \end{matrix}] ρ (z) + η (\nabla^{⊤} (D (z) ρ (z))))) . \end{matrix}

Notice that since

\nabla^{⊤} D (z) = 0

we can rewrite

\begin{matrix} 0 = Tr (\nabla (s {(z)}^{⊤} [\begin{matrix} 0 & - P (θ) \\ P (θ) & A (θ) \end{matrix}] ρ (z) + η \nabla^{⊤} (ρ (z)) D (z))) \\ = Tr (\nabla (s {(z)}^{⊤} [\begin{matrix} 0 & - P (θ) \\ P (θ) & A (θ) \end{matrix}] ρ (z) - η \nabla^{⊤} (ϕ (z)) D (z) ρ (z))) \\ = Tr (\nabla (s {(z)}^{⊤} [\begin{matrix} 0 & - P (θ) \\ P (θ) & A (θ) \end{matrix}] ρ (z) - η \nabla^{⊤} (ϕ (z)) [\begin{matrix} 0 & 0 \\ 0 & P {(θ)}^{2} \sum (θ) \end{matrix}] ρ (z))) \end{matrix}

that is verified with

\nabla ϕ (z) = s (z)

if

\{\begin{matrix} \nabla^{⊤} P (θ) = 0 \\ A (θ) = η P {(θ)}^{2} \sum (θ) . \end{matrix}

If

\nabla^{⊤} P (θ) = 0

in fact

\begin{matrix} Tr (\nabla (\nabla^{⊤} (ϕ (z)) ρ (z) [\begin{matrix} 0 & - P (θ) \\ P (θ) & 0 \end{matrix}])) = \nabla^{⊤} ([\begin{matrix} 0 & - P (θ) \\ P (θ) & 0 \end{matrix}] \nabla (ϕ (z)) ρ (z)) = \\ \nabla^{⊤} ([\begin{matrix} 0 & - P (θ) \\ P (θ) & 0 \end{matrix}]) \nabla (ϕ (z)) ρ (z) + Tr ([\begin{matrix} 0 & - P (θ) \\ P (θ) & 0 \end{matrix}] \nabla (\nabla^{⊤} (ϕ (z)) ρ (z))) = 0, \end{matrix}

since

\nabla^{⊤} [\begin{matrix} 0 & - P (θ) \\ P (θ) & 0 \end{matrix}] = 0

and the second term is zero due to the fact that

[\begin{matrix} 0 & - P (θ) \\ P (θ) & 0 \end{matrix}]

is anti-symmetric while

\nabla (\nabla^{⊤} (ϕ (z)) ρ (z))

is symmetric.

Thus, we can rewrite

\begin{matrix} Tr (\nabla (s {(z)}^{⊤} [\begin{matrix} 0 & - P (θ) \\ P (θ) & A (θ) \end{matrix}] ρ (z) - η \nabla^{⊤} (ϕ (z)) [\begin{matrix} 0 & 0 \\ 0 & P {(θ)}^{2} \sum (θ) \end{matrix}] ρ (z))) = \\ Tr (\nabla (s {(z)}^{⊤} [\begin{matrix} 0 & - P (θ) \\ P (θ) & A (θ) \end{matrix}] ρ (z) - \nabla^{⊤} (ϕ (z)) [\begin{matrix} 0 & 0 \\ 0 & η P {(θ)}^{2} \sum (θ) \end{matrix}] ρ (z))) = \\ Tr (\nabla (s {(z)}^{⊤} [\begin{matrix} 0 & - P (θ) \\ P (θ) & A (θ) \end{matrix}] ρ (z) - \nabla^{⊤} (ϕ (z)) [\begin{matrix} 0 & 0 \\ 0 & A (θ) \end{matrix}] ρ (z))) = \\ Tr (\nabla ((s {(z)}^{⊤} - \nabla^{⊤} (ϕ (z))) [\begin{matrix} 0 & - P (θ) \\ P (θ) & A (θ) \end{matrix}] ρ (z))) = 0 \end{matrix}

and then

\nabla ϕ (z) = s (z)

proving Theorem 2.

Appendix B. i-sgd Method Proofs and Details

Appendix B.1. Proof of Corollary 1

The requirement

C (θ) ⪰ 0 \forall θ

, ensures that the injected noise covariance is valid. The composite noise matrix is equal to

\sum (θ) = Λ

. Since

\nabla^{⊤} \sum (θ) = \nabla^{⊤} Λ = 0

and

η P (θ) = Λ^{- 1}

by construction, then Theorem 1 is satisfied.

Appendix B.2. Proof of Optimality of Λ

Our design choice is to select

λ^{(p)} = β^{(p)}

. By the assumptions, the matrix

B (θ)

is diagonal, and consequently

C (θ) = Λ - B (θ)

is diagonal as well. The preconditioner

Λ

must be chosen to satisfy the positive semidefinite constraint, i.e.,

C {(θ)}_{i i} \geq 0 \forall i, \forall θ

. Equivalently, we must satisfy

λ^{(p)} - b_{j} (θ) \geq 0 \forall j \in I_{p}, \forall p, \forall θ

, where

I_{p}

is the set of indices of parameters belonging to

p_{t h}

layer. By assumption 3, i.e.,

β^{(p)} = \sum_{k \in I_{p}} b_{k} (θ)

, it is easy to show that

b_{j} (θ), j \in I_{p}

, is upper bounded as

b_{j} (θ) \leq β^{(p)}

. To satisfy the positive semidefinite requirement in all cases the minimum valid set of

λ^{(p)}

is then determined as

λ^{(p)} = β^{(p)}

.

Appendix B.3. Algorithmic Details

In this section, we provide further details about the practical implementation of the proposed scheme. At any (discrete) time instant a minibatch version of the gradient is computed that is distributed, according to the hypotheses of the main paper, as

g (θ) \sim N (\nabla f (θ), 2 b (θ)

. Since we assumed that the second-order moment is a good approximation of the variance, we can estimate

b (θ)

as

\frac{1}{2} (g (θ) ⊙ g (θ))

. In practice, we found that the following running average estimation procedure to be the most robust

b (θ) \leftarrow μ b (θ) + (1 - μ) \frac{1}{2} (g (θ) ⊙ g (θ))

(A1)

where

μ \in (0, 1]

. In all experiments we considered

μ = 0.5

After a warmup period, the various

λ^{(p)}

, layer per layer, are estimated as

λ^{(p)} = \sum_{k \in I_{p}} b_{k} (θ)

and kept constant until the end. The estimation procedure continues during sampling phase, as the quantity

λ^{(p)} - b (θ)

is necessary at every step. As the learning rate is derived as

\frac{2}{λ^{(p)}}

, we found that the usage of second-order moments instead of variances, and in certain cases temperature scaling, kept the simulated trajectories more stable.

Appendix C. Methodology

We hereafter present additional implementation details.

Appendix C.1. Regression Tasks, with Simple Models

For this set of experiments we considered, the baseline is obtained by running the adam optimizer for 20,000 steps with learning rate

0.01

and default parameters. At test time we use 100 samples to estimate the predictive posterior distribution, using Equation (3), for the sampling methods (i-sgd,sgld,sghmc,vsgd), with a keep-every value equal to 1000. The i-sgd and vsgd sampling methods are started from the baseline. For i-sgd we selected temperature

0.01

, while for sghmc and sgld we do performed experiments for temperatures 1 and

0.01

. We modified the implementation of vsgd as the original implementation produced unstable learning rates (as noticed also in [9]). A simple and effective solution we implement that we kept throughout the experimental campaigns is to divide the learning rate by the number of parameters (thus performing variational inference on a tempered version of the posterior). For sgld the learning rate decay is the one suggested in [2], with initial and finial learning rate equal to

10^{- 6}

and

10^{- 8}

respectively. For mcd we collected 1000 samples with standard dropout rate of

0.5

. All our experiments use 10-splits. The considered batch size is 64 for all methods.

Appendix C.2. Classification Task, ConvNet

For the LeNet-5 on mnist experiment, we do consider also the swag algorithm. At test time we use 30 samples for all methods. Baselines are again trained using adam optimizer for 20,000 steps with learning rate

0.01

and default parameters. For i-sgd and sghmc we collected samples for the different temperatures of 1 and

0.01

. sgld has initial and final learning rates of

10^{- 3}

and

10^{- 5}

. For all the sampling methods we do collect 100 samples with a keep-every of 10,000 steps. swag results are obtained by collecting the statistics over 300 epochs using adam optimizer and decreasing the learning rate every epoch in accordance with the original paper schedule [9]. drop results are obtained by training the networks with sgd, with learning rate 0.005 and momentum 0.5. The number of collected samples for this method is 1000. The batch size for all the methods is 128.

As explained in the main text, we performed an ablation study on the considered baselines. In Table A1 we do report the results for the additional variants obtained by early stopping (10,000 iterations instead of 20,000) baseline S, to ablate overfitting, and baseline L, by training for 30,000 iterations. Finally, we include the best performing baseline R, obtained starting from baseline, reducing the learning rate by a factor of 10 and training for 10,000 more iterations.

Table A1. Baselines comparison for classification on mnist dataset.

Method	acc	mnll	Mean $H_{0}$	ece	Mean $H_{1}$
baseline	9886.6667 ± 11.0252	352.6640 ± 20.8622	0.0353 ± 0.0058	0.0468 ± 0.0001	0.0019 ± 0.0003
baseline l	9871.6667 ± 20.7579	389.7142 ± 79.0354	0.0378 ± 0.0051	0.0468 ± 0.0008	0.0025 ± 0.0006
baseline s	9893.0000 ± 4.8990	339.8170 ± 7.9855	0.0392 ± 0.0042	0.0477 ± 0.0008	0.0024 ± 0.0001
baseline r	9919.0000 ± 9.4163	242.7644 ± 17.0736	0.0303 ± 0.0001	0.0482 ± 0.0006	0.0021 ± 0.0002

Appendix C.3. Classification Task, Deeper Models

We here report details for the ResNet-18 on cifar10 experiments. The baseline is obtained with adam optimizer with learning rate 0.01 decreased by a factor of 10 every 50 epochs for a total of 200 epochs and weight decay of

0.05

. For this set of experiments no temperature scaling was required. We could not find good hyperparameters for the sgld scheme. Concerning i-sgd, sghmc and vsgd the keep-every value is chosen as 10,000 and the number of collected samples is 30. For swag we used the default parameters described in [9]. Notice that for swag we performed the following ablation study: we trained the networks considering as loss function the joint log-likelihood and included or not the suggested weight decay of the original work [9]. From a purely Bayesian perspective no weight decay should be considered to be the information is implicit in the prior; however, we found that without the extra decay swag was not able to obtain competitive results. As underlined in Section 4.1, not necessarily a better posterior approximation translates into better empirical results.

Appendix C.4. Definition of the Metrics

For regression datasets, we consider rmse and mnll. Consider a single datapoint

U_{i} = (x_{i}, y_{i})

, with

x_{i}

the input of the model and

y_{i}

the true corresponding output. The output of the model, for a single sample of parameters

θ_{j}

, is

{\hat{y}}_{θ_{j}} (x_{i})

. rmse is defined as

\frac{1}{N} \sum_{i = 1}^{N} | | y_{i} - μ (x_{i}) {| |}^{2}

, where

μ (x_{i})

is the empirical mean

\frac{1}{N_{M C}} \sum_{j = 1}^{N_{M C}} {\hat{y}}_{θ_{j}} (x_{i})

. mnll is defined instead as

(\frac{1}{N} \sum_{i = 1}^{N} (\frac{1}{2} log (2 π σ_{i}^{2}) + \frac{1}{2} \frac{| | y_{i} - μ (x_{i}) {| |}^{2}}{σ_{i}^{2}})

, where

σ_{i}^{2}

is the empirical variance.

For classification datasets, we consider acc,mnll and entropy. Consider a single datapoint

U_{i} = (x_{i}, y_{i})

, with

x_{i}

the input of the model and

y_{i}

the true corresponding label. The output of the model, for a single sample of parameters

θ_{j}

, is the

N_{c l}

vector

p_{θ_{j}} (x_{i})

. The averaged probability vector for a single sample is

p (x_{i}) = \frac{1}{N_{M C}} \sum_{i = 1}^{N_{M C}} p_{θ_{j}} (x_{i})

.acc is defined as

\frac{1}{N} \sum_{i = 1}^{N} 1 (arg max p (x_{i}) = y_{i})

. mnll is computed as

\frac{1}{N} \sum_{i = 1}^{N} log (p_{y_{i}} (x_{i}))

. Entropy, as stated in the main text, is instead computed according to

\frac{1}{N} \sum_{i = 1}^{N} (\sum_{k = 1}^{N_{c l}} p_{k} (x_{i}) log (p_{k} (x_{i})))

.

References

Welling, M.; Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, WA, USA, 28 June–2 July 2011; pp. 681–688. [Google Scholar]
Ahn, S.; Korattikara, A.; Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. arXiv 2012, arXiv:1206.6380. [Google Scholar]
Patterson, S.; Teh, Y.W. Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex. In Advances in Neural Information Processing Systems 26; Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2013; pp. 3102–3110. [Google Scholar]
Chen, T.; Fox, E.; Guestrin, C. Stochastic gradient hamiltonian monte carlo. In Proceedings of the International Conference on Machine Learning, Bejing, China, 22–24 June 2014; pp. 1683–1691. [Google Scholar]
Ma, Y.A.; Chen, T.; Fox, E. A complete recipe for stochastic gradient MCMC. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2917–2925. [Google Scholar]
Draxler, F.; Veschgini, K.; Salmhofer, M.; Hamprecht, F.A. Essentially no barriers in neural network energy landscape. arXiv 2018, arXiv:1803.00885. [Google Scholar]
Garipov, T.; Izmailov, P.; Podoprikhin, D.; Vetrov, D.; Wilson, A.G. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
Chaudhari, P.; Soatto, S. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In Proceedings of the 2018 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 11–16 February 2018; pp. 1–10. [Google Scholar]
Maddox, W.J.; Izmailov, P.; Garipov, T.; Vetrov, D.P.; Wilson, A.G. A simple baseline for bayesian uncertainty in deep learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13132–13143. [Google Scholar]
Mandt, S.; Hoffman, M.D.; Blei, D.M. Stochastic gradient descent as approximate bayesian inference. J. Mach. Learn. Res. 2017, 18, 4873–4907. [Google Scholar]
Springenberg, J.T.; Klein, A.; Falkner, S.; Hutter, F. Bayesian optimization with robust Bayesian neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 4134–4142. [Google Scholar]
Gal, Y.; Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning, ICML, New York, NY, USA, 19–24 June 2016; pp. 1050–1059. [Google Scholar]
Bishop, C.M. Pattern Recognition and Machine Learning, 1st ed.; 2006. corr. 2nd printing 2011 ed.; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
Chen, C.; Carlson, D.; Gan, Z.; Li, C.; Carin, L. Bridging the gap between stochastic gradient MCMC and stochastic optimization. In Proceedings of the Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016; pp. 1051–1060. [Google Scholar]
Gardiner, C.W. Handbook of Stochastic Methods for Physics, Chemistry and the Natural Sciences, 3rd ed.; Springer Series in Synergetics; Springer: Berlin/Heidelberg, Germany, 2004; Volume 13. [Google Scholar]
Kushner, H.; Yin, G. Stochastic Approximation and Recursive Algorithms and Applications; Stochastic Modelling and Applied Probability; Springer: New York, NY, USA, 2003. [Google Scholar]
Ljung, L.; Pflug, G.; Walk, H. Stochastic Approximation and Optimization of Random Systems; Birkhauser Verlag: Basel, Switzerland, 1992. [Google Scholar]
Kloeden, P.E.; Platen, E. Numerical Solution of Stochastic Differential Equations; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 23. [Google Scholar]
Neal, R.M. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo; CRC Press: Boca Raton, FL, USA, 2011; Volume 2, p. 2. [Google Scholar]
Girolami, M.; Calderhead, B. Riemann manifold langevin and hamiltonian monte carlo methods. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2011, 73, 123–214. [Google Scholar] [CrossRef]
Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
Vollmer, S.J.; Zygalakis, K.C.; Teh, Y.W. Exploration of the (non-) asymptotic bias and variance of stochastic gradient Langevin dynamics. J. Mach. Learn. Res. 2016, 17, 5504–5548. [Google Scholar]
Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the information bottleneck theory of deep learning. J. Stat. Mech. Theory Exp. 2019, 2019, 124020. [Google Scholar] [CrossRef]
Zhu, Z.; Wu, J.; Yu, B.; Wu, L.; Ma, J. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from minima and regularization effects. arXiv 2018, arXiv:1803.00195. [Google Scholar]
Scott, W.A. Maximum likelihood estimation using the empirical fisher information matrix. J. Stat. Comput. Simul. 2002, 72, 599–611. [Google Scholar] [CrossRef]
Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 25 October 2021).
LeCun, Y.; Cortes, C. MNIST Handwritten Digit Database. 2010. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 25 October 2021).
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Krizhevsky, A.; Nair, V.; Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). Available online: http://www.cs.toronto.edu/~kriz/cifar.html (accessed on 25 October 2021).
LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. arXiv 2017, arXiv:1706.04599. [Google Scholar]

Figure 1. Detection/False alarm diagrams for different methods.

Table 1. rmse results for regression on uci datasets.

Method	wine	protein	naval	kin8nm	power	boston
sgld	0.759 ± 0.07	5.687 ± 0.05	0.007 ± 0.00 (F = 6.000)	0.171 ± 0.07 (F = 3.000)	11.753 ± 3.25	9.602 ± 2.06
i-sgd	0.635 ± 0.05	4.699 ± 0.03	0.001 ± 0.00	0.079 ± 0.00	4.320 ± 0.13	3.703 ± 1.19
Baseline	0.641 ± 0.05	4.733 ± 0.05	0.001 ± 0.00	0.080 ± 0.00	4.354 ± 0.12	3.705 ± 1.19
vsgd	0.635 ± 0.05	4.699 ± 0.03	0.001 ± 0.00	0.079 ± 0.00	4.325 ± 0.13	3.588 ± 1.06 (F = 1.000)
sghmc	0.628 ± 0.04	4.712 ± 0.03	0.000 ± 0.00 (F = 2.000)	0.076 ± 0.00 (F = 1.000)	4.310 ± 0.14	3.659 ± 1.24
sgld T	0.752 ± 0.07	5.673 ± 0.04	0.007 ± 0.00 (F = 6.000)	0.169 ± 0.07 (F = 3.000)	11.351 ± 3.02	9.417 ± 2.07
drop	0.637 ± 0.04	4.968 ± 0.05	0.003 ± 0.00	0.139 ± 0.01	4.531 ± 0.16	3.803 ± 1.26
sghmc T	0.628 ± 0.04	4.684 ± 0.03	0.000 ± 0.00 (F = 6.000)	0.076 ± 0.00	4.326 ± 0.13	3.692 ± 1.19

Table 2. mnll results for regression on uci datasets.

Method	wine	protein	naval	kin8nm	power	boston
sgld	1.546 ± 0.25	5.604 ± 0.08	−1.751 ± 0.28 (F = 6.000)	5.140 ± 7.05 (F=3.000)	8.429 ± 3.14	30.386 ± 15.77
i-sgd	1.129 ± 0.15	4.371 ± 0.03	−2.466 ± 1.12	−0.460 ± 0.65	3.122 ± 0.07	9.799 ± 5.69
Baseline	1.182 ± 0.03	3.964 ± 0.04	0.920 ± 0.00	0.924 ± 0.00	3.071 ± 0.06	5.421 ± 2.73
vsgd	1.128 ± 0.15	4.371 ± 0.03	−2.466 ± 1.12	−0.480 ± 0.65	3.088 ± 0.06	8.413 ± 5.89 (F = 1.000)
sghmc	1.041 ± 0.12	4.142 ± 0.02	−2.763 ± 1.33 (F = 2.000)	−0.798 ± 0.39 (F = 1.000)	2.924 ± 0.04	3.097 ± 0.83
sgld T	1.526 ± 0.24	5.591 ± 0.07	−1.752 ± 0.28 (F = 6.000)	5.118 ± 7.06 (F = 3.000)	8.288 ± 3.04	33.212 ± 19.69
drop	1.065 ± 0.12	4.218 ± 0.06	−2.322 ± 0.75	−0.086 ± 0.41	2.941 ± 0.04	3.989 ± 1.23
sghmc T	1.104 ± 0.14	4.191 ± 0.02	−2.966 ± 1.89 (F = 6.000)	−0.756 ± 0.42	3.116 ± 0.07	9.826 ± 5.72

Table 3. Results for classification on mnist dataset.

Method	acc	mnll	Mean $H_{0}$	ece	Mean $H_{1}$	Failed
i-sgd	9916.3333 ± 2.8674	263.5311 ± 16.3600	0.0368 ± 0.0019	0.0491 ± 0.0003	0.4558 ± 0.0591	0.0000
sghmc	9930.6667 ± 2.4944	268.2559 ± 6.8172	0.0593 ± 0.0018	0.0531 ± 0.0003	1.0369 ± 0.0346	0.0000
drop	9912.6667 ± 6.0185	362.8973 ± 24.8881	0.0960 ± 0.0090	0.0541 ± 0.0011	0.5507 ± 0.0577	0.0000
baseline	9886.6667 ± 11.0252	352.6640 ± 20.8622	0.0353 ± 0.0058	0.0468 ± 0.0001	0.0019 ± 0.0003	0.0000
baseline r	9919.0000 ± 9.4163	242.7644 ± 17.0736	0.0303 ± 0.0001	0.0482 ± 0.0006	0.0021 ± 0.0002	0.0000
swag	9917.0000 ± 2.8284	308.8182 ± 20.0979	0.0675 ± 0.0108	0.0524 ± 0.0011	0.3953 ± 0.0442	0.0000
sgld	9927.0000 ± 1.0000	279.7685 ± 16.6563	0.0556 ± 0.0034	0.0531 ± 0.0004	1.3032 ± 0.1942	1.0000
vsgd	9927.3333 ± 6.7987	225.3725 ± 16.3739	0.0274 ± 0.0008	0.0481 ± 0.0005	0.0414 ± 0.0070	0.0000
i-sgd T	9915.6667 ± 0.9428	255.9641 ± 12.8051	0.0289 ± 0.0014	0.0478 ± 0.0002	0.0284 ± 0.0122	0.0000
sghmc T	9937.0000 ± 0.0000	231.5332 ± 0.0000	0.0434 ± 0.0000	0.0518 ± 0.0000	0.4623 ± 0.0000	2.0000

Table 4. Results for classification on cifar1010 dataset.

Method	acc	mnll	mean $H_{0}$	ece
i-sgd	8591.3333 ± 17.4611	4393.3557 ± 107.0878	0.6107 ± 0.0337	0.0731 ± 0.0075
sghmc	8634.6667 ± 5.1854	4357.8998 ± 11.2722	0.6300 ± 0.0023	0.0819 ± 0.0017
swag wd	8740.6667 ± 35.5653	3931.9900 ± 45.6605	0.4130 ± 0.0066	0.0275 ± 0.0015
swag	8061.0000 ± 11.4310	5903.2605 ± 62.8167	0.5308 ± 0.0135	0.0163 ± 0.0019
baseline	8273.3333 ± 26.7872	8050.4467 ± 109.9864	0.2250 ± 0.0005	0.0809 ± 0.0020
vsgd	8255.6667 ± 24.1155	8919.8062 ± 106.3571	0.1761 ± 0.0078	0.0905 ± 0.0020

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Franzese, G.; Milios, D.; Filippone, M.; Michiardi, P. A Scalable Bayesian Sampling Method Based on Stochastic Gradient Descent Isotropization. Entropy 2021, 23, 1426. https://doi.org/10.3390/e23111426

AMA Style

Franzese G, Milios D, Filippone M, Michiardi P. A Scalable Bayesian Sampling Method Based on Stochastic Gradient Descent Isotropization. Entropy. 2021; 23(11):1426. https://doi.org/10.3390/e23111426

Chicago/Turabian Style

Franzese, Giulio, Dimitrios Milios, Maurizio Filippone, and Pietro Michiardi. 2021. "A Scalable Bayesian Sampling Method Based on Stochastic Gradient Descent Isotropization" Entropy 23, no. 11: 1426. https://doi.org/10.3390/e23111426

APA Style

Franzese, G., Milios, D., Filippone, M., & Michiardi, P. (2021). A Scalable Bayesian Sampling Method Based on Stochastic Gradient Descent Isotropization. Entropy, 23(11), 1426. https://doi.org/10.3390/e23111426

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Scalable Bayesian Sampling Method Based on Stochastic Gradient Descent Isotropization

Abstract

1. Introduction

2. Preliminaries and Related Work

2.1. Gradient Methods without Momentum

2.2. Gradient Methods with Momentum

3. Sampling by Layer-Wise Isotropization

3.1. A Practical Method: Isotropic SGD

A Remark on Convergence

3.2. Computational Cost

4. Experiments

4.1. A Disclaimer on Performance Characterization

4.2. Regression Tasks, with Simple Models

4.3. Classification Tasks, with Deeper Models

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. Background and Related Material

Appendix A.1. The Minibatch Gradient Approximation

Appendix A.2. Gradient Methods without Momentum

Appendix A.2.1. The sde from Discrete Time

Appendix A.2.2. Proof of Theorem 1

Appendix A.3. Gradient Methods with Momentum

Appendix A.3.1. The SDE from Discrete Time

Appendix A.3.2. Proof of Theorem 2

Appendix B. i-sgd Method Proofs and Details

Appendix B.1. Proof of Corollary 1

Appendix B.2. Proof of Optimality of Λ

Appendix B.3. Algorithmic Details

Appendix C. Methodology

Appendix C.1. Regression Tasks, with Simple Models

Appendix C.2. Classification Task, ConvNet

Appendix C.3. Classification Task, Deeper Models

Appendix C.4. Definition of the Metrics

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI