Meta-Strategy for Learning Tuning Parameters with Guarantees
Abstract
1. Introduction
1.1. Related Works
1.2. Organization of the Paper
2. Notations and Preliminaries
3. Meta-Learning Algorithms
3.1. Special Case: The Gradient of the Meta-Loss Is Available in Closed Form
3.2. The General Case
3.3. Regret Analysis
4. Example: Learning the Tuning Parameters of Online Gradient Descent
4.1. Explicit Meta-Regret Bound
4.2. Special Case: Quadratic Loss
4.3. The General Case
4.4. Experimental Study
4.4.1. Synthetic Regression
4.4.2. Synthetic Classification
5. Second Example: Learning the Prior or the Learning Rate in Exponentially Weighted Aggregation
5.1. Reminder on EWA
5.2. Learning the Rate
5.3. Learning the Prior
5.4. Discussion on the Continuous Case
6. Proofs
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
| Method | r = 0 | r = 5 | r = 10 | r = 30 |
|---|---|---|---|---|
| I-OGA | 6.24 | 6.44 | 7.06 | 13.60 |
| mean OPMS | 0.05 | 0.27 | 0.93 | 7.93 |
| OPMS | 0.07 | 0.15 | 0.49 | 3.72 |
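The table above compares a task-independent baseline (I-OGA) with the paper's meta-strategy (the mean OPMS and OPMS rows) for increasing values of a parameter r. As a purely illustrative companion, the sketch below shows the general idea behind Section 4 under assumed settings: tasks are synthetic regression problems whose optima cluster around a common mean whose norm plays the role of r, each task is solved by online gradient descent (OGD), and a simple outer update learns the starting point across tasks. The data-generating setup, all constants, and the outer update rule are illustrative assumptions; this is not the OPMS algorithm from the paper.

```python
# Minimal sketch (NOT the paper's OPMS update): within-task OGD on synthetic
# regression tasks whose optima cluster around a common mean of norm r,
# comparing (a) OGD restarted at 0 for every task ("I-OGA"-like) and
# (b) OGD whose starting point is learned across tasks by a simple outer
# update toward each task's final iterate. All constants are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n, T, r = 5, 100, 200, 10.0             # dimension, rounds per task, tasks, mean norm
common_mean = r * np.ones(d) / np.sqrt(d)  # common centre of the task optima

def run_ogd_task(theta0, step):
    """One task: OGD on streaming squared-loss regression.
    Returns the task-averaged loss and the final iterate."""
    theta_task = common_mean + rng.normal(scale=0.5, size=d)  # task-specific optimum
    w, losses = theta0.copy(), []
    for _ in range(n):
        x = rng.normal(size=d)
        y = x @ theta_task + 0.1 * rng.normal()
        losses.append((w @ x - y) ** 2)
        w -= step * 2 * (w @ x - y) * x    # OGD step on the squared loss
    return float(np.mean(losses)), w

step_inner, step_meta = 0.01, 0.1
start_meta = np.zeros(d)                   # meta-learned starting point
indep_losses, meta_losses = [], []
for _ in range(T):
    loss_i, _ = run_ogd_task(np.zeros(d), step_inner)       # independent OGD
    loss_m, w_final = run_ogd_task(start_meta, step_inner)  # OGD from the learned start
    start_meta += step_meta * (w_final - start_meta)        # outer (meta) update
    indep_losses.append(loss_i)
    meta_losses.append(loss_m)

print(f"average per-task loss, independent OGD : {np.mean(indep_losses):.3f}")
print(f"average per-task loss, learned start   : {np.mean(meta_losses):.3f}")
```

With these illustrative settings the learned starting point should yield a visibly smaller average per-task loss than restarting from zero, qualitatively mirroring the gap between the I-OGA and OPMS rows above; the effect grows with r, since a large common mean is what the independent baseline cannot exploit.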
Meunier, D.; Alquier, P. Meta-Strategy for Learning Tuning Parameters with Guarantees. Entropy 2021, 23, 1257. https://doi.org/10.3390/e23101257