1. Introduction
Decision-making in the face of uncertainty is a widespread challenge found across various domains such as control and robotics [1], clinical trials [2], communications [3], and ecology [4]. To tackle this challenge, learning algorithms have been developed to uncover effective policies for optimal decision-making. One notable framework for addressing this challenge is that of contextual bandits (CBs), which capture the essence of sequential decision-making by incorporating side information, termed context [5].
In the standard CB model, an agent interacts with the environment over numerous rounds. In each round, the environment presents a context to the agent, based on which the agent chooses an action and receives a reward from the environment. The reward is stochastic, drawn from a probability distribution whose mean (a function of the context-action pair) is unknown to the agent. The goal of the agent is to design a policy for action selection that maximizes the cumulative mean reward accrued over a horizon of T rounds.
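To make the interaction protocol concrete, the following minimal Python sketch simulates the round-by-round loop described above for a linear mean-reward model. The environment, feature map and (deliberately naive) action choice are placeholders introduced here for illustration only and are not part of the formal model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 10, 1000                  # context dimension, number of actions, horizon
theta = rng.normal(size=d)             # reward parameter, hidden from the agent
action_templates = rng.normal(size=(K, d))

def feature(a, context):
    # Illustrative feature map: elementwise product of an action template and the context.
    return action_templates[a] * context

total_reward = 0.0
for t in range(T):
    context = rng.normal(size=d)                 # environment reveals the context
    a_t = rng.integers(K)                        # placeholder policy; a real agent would learn
    mean_reward = feature(a_t, context) @ theta  # unknown to the agent
    r_t = mean_reward + rng.normal(scale=0.1)    # stochastic reward around the mean
    total_reward += r_t
```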
In this paper, we focus on a CB model that assumes stochastic rewards with linear mean-reward functions, also called stochastic linear contextual bandits. Stochastic linear CB models find applications in various settings, including internet advertisement selection [6], where the advertisement (i.e., action) and webpage features (i.e., context) are used to construct a linear predictor of the probability that a user clicks on a given advertisement, and article recommendation on web portals [7].
While most prior research on CBs has primarily focused on models with known exact contexts [8,9,10], in many real-world applications the contexts are noisy, e.g., imprecise measurements of patient conditions in clinical trials, or weather and stock market predictions. In such scenarios, when the exact contexts are unknown, the agent must utilize the observed noisy contexts to estimate the mean reward associated with the true context. However, this results in a biased estimate that renders standard CB algorithms unsuitable. Consequently, recent efforts have been made to develop CB algorithms tailored to noisy context settings.
Related Works: Ref. [11] considers a setting where there is a bounded zero-mean noise in the m-dimensional feature vector (a function of the action a and the context c) rather than in the context vector itself, and the agent observes only noisy features. For this setting, they develop an upper confidence bound (UCB) algorithm. Ref. [12] models the uncertainty regarding the true contexts by a context distribution that is known to the agent, while the agent never observes the true context; a UCB algorithm is developed for this setting. A similar setting has also been considered in [13]. Differing from these works, ref. [14] considers the setting where the true feature vectors are sampled from an unknown feature distribution at each time, but the agent observes only a noisy feature vector. Assuming Gaussian feature noise with unknown mean and covariance, they develop an Optimism in the Face of Uncertainty (OFUL) algorithm. A variant of this setting has been studied in [15].
Motivation and Problem Setting: In this work, inspired by [14], we consider the following noisy CB setting. In each round, the environment samples a true context vector from a context distribution that is known to the agent. The agent, however, does not observe the true context; instead, it observes a noisy context obtained as the output of a parameterized noise channel. The agent is aware of the presence of noise but does not know the channel parameter. Following [14], we consider Gaussian noise channels for our regret analysis.
Based on the observed noisy context, the agent chooses an action and observes a reward corresponding to the true context. We consider a linear bandit whose mean reward is determined by an unknown reward parameter. The goal of the agent is to design an action policy that minimizes the Bayesian cumulative regret with respect to the action policy of a Bayesian oracle. The oracle has access to the reward model and the channel parameter, and uses the predictive distribution of the true context given the observed noisy context to select an action.
Our setting differs from [14] in that we assume noisy contexts rather than noisy feature vectors, and that the agent knows the context distribution. The noise model of [14], which places the noise in the feature vector, allows the original problem to be transformed into a different CB problem that estimates a modified reward parameter. Such a transformation, however, is not straightforward in our setting, where the noise is in the contexts rather than in the feature vectors and where we wish to analyze the Bayesian regret. Additionally, we propose a de-noising approach to estimate the predictive distribution of the true context from the observed noisy contexts, offering potential benefits for future analyses.
The assumption of a known context distribution follows [12]. It can be motivated by the example of an online recommendation engine that pre-processes user account registration information, or contexts (e.g., age, gender, device, location, item preferences), to group users into different clusters [16]. The engine can then infer the 'empirical' distribution of users within each cluster to define a context distribution over true contextual information. A noisy-context scenario occurs, for example, when a guest with different preferences logs into a user's account.
Challenges and Novelty: Different from existing works that developed UCB-based algorithms, we propose a fully Bayesian Thompson Sampling (TS) algorithm that approximates the Bayesian oracle policy. The proposed algorithm differs from standard contextual TS [10] in the following aspects. Firstly, since the true context vectors are not accessible at each round and the channel parameter is unknown, the agent uses its knowledge of the context distribution and the past observed noisy contexts to infer a predictive posterior distribution of the true context from the current observed noisy context. The inferred predictive distribution is then used to choose the action. This de-noising step enables our algorithm to 'approximate' the oracle action policy, which uses knowledge of the channel parameter to implement exact de-noising. Secondly, the reward received by the agent corresponds to the unobserved true context. Hence, the agent cannot exactly evaluate the posterior distribution of the reward parameter and sample from it as is done in standard contextual TS. Instead, our algorithm uses a sampling distribution that 'approximates' this posterior.
Different from existing works that focus on frequentist regret analysis, we derive novel information-theoretic bounds on the Bayesian cumulative regret of our algorithm. For Gaussian bandits, our information-theoretic regret bounds scale with the dimension of the feature vector and the time horizon (up to logarithmic factors), under certain conditions on the variance of the prior on the reward parameter. Furthermore, our Bayesian regret analysis shows that the posterior mismatch, which results from replacing the true posterior distribution with a sampling distribution, induces an approximation error that is captured via the Kullback–Leibler (KL) divergence between the two distributions. To the best of our knowledge, quantifying the posterior mismatch via the KL divergence has not been studied before and is of independent interest.
Finally, we also extend our algorithm to a setting where the agent observes the true context after the decision is made and the reward is observed [12]. We call this setting CBs with delayed true contexts. Such scenarios arise in many applications where only a prediction of the context is available at the time of decision-making, while the true context becomes available later. For instance, in a farming-recommender system, at the time of deciding which crop to cultivate in a year, the true contextual information about the weather pattern is unavailable and only 'noisy' weather predictions are at hand; the true weather pattern is observed only after the decision is made. We show that our TS algorithm for this setting with delayed true contexts achieves reduced Bayesian regret.
Table 1 compares our regret bound with that of the state-of-the-art algorithms in the noiseless and noisy CB settings.
2. Problem Setting
In this section, we present the stochastic linear CB problem studied in this paper. The agent is given an action set with K actions and a (possibly infinite) set of d-dimensional context vectors. At each iteration, the environment randomly draws a context vector according to a context distribution defined over the set of context vectors. The context distribution is known to the agent. The agent, however, does not observe the true context drawn by the environment. Instead, it observes a noisy version of the true context, obtained as the output of a noisy, stochastic channel with the true context as the input. The noise channel is parameterized by a noise channel parameter that is unknown to the agent.
Having observed the noisy context at iteration t, the agent chooses an action according to an action policy. The action policy may be stochastic, describing a probability distribution over the set of actions. Corresponding to the chosen action, the agent receives a reward from the environment as given in (1): the reward is the sum of a linear mean-reward term and a zero-mean reward noise variable. The mean-reward function is defined via the feature map, which maps the action and the true context to an m-dimensional feature vector, and via a reward parameter that is unknown to the agent.
We call the noisy CB problem described above CBs with unobserved true contexts (see Setting 1), since the agent does not observe the true context and the selection of the action is based solely on the observed noisy context. Accordingly, at the end of iteration t, the agent has accrued a history of observed reward-action-noisy context tuples. The action policy at iteration t may depend on this history.
Setting 1: CBs with unobserved true contexts
- 1: for t = 1, …, T do
- 2: Environment samples a true context.
- 3: Agent observes a noisy context.
- 4: Agent chooses an action.
- 5: Agent receives a reward according to (1).
- 6: end for
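A minimal simulation of Setting 1 is sketched below, assuming Gaussian contexts, an additive Gaussian noise channel whose unknown mean shift plays the role of the channel parameter, and a linear reward as in (1). The additive form of the channel, the block feature map and all numerical values are illustrative assumptions, not part of the formal setting.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T = 4, 6, 500
mu_c, Sigma_c = np.zeros(d), np.eye(d)       # known context distribution
Sigma_n = 0.5 * np.eye(d)                    # channel noise covariance (assumed known)
gamma = rng.normal(size=d)                   # channel parameter, unknown to the agent
theta = rng.normal(size=d * K)               # reward parameter, unknown to the agent

def feature(a, c):
    # Assumed block feature map: the context fills the block of the chosen action.
    phi = np.zeros(d * K)
    phi[a * d:(a + 1) * d] = c
    return phi

for t in range(T):
    c_t = rng.multivariate_normal(mu_c, Sigma_c)             # true context (never observed)
    c_hat_t = c_t + rng.multivariate_normal(gamma, Sigma_n)  # observed noisy context
    a_t = rng.integers(K)                                    # placeholder action policy
    r_t = feature(a_t, c_t) @ theta + rng.normal(scale=0.1)  # reward depends on the TRUE context
```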
We also consider a variant of the above problem where the agent has access to a delayed observation of the true context, as studied in [12]. We call this setting CBs with delayed true contexts. In this setting, at iteration t, the agent first observes a noisy context, chooses an action, and receives a reward; only later is the true context observed. It is important to note that the agent has no access to the true context at the time of decision-making. Thus, at the end of iteration t, the agent has collected a history of observed reward-action-context-noisy context tuples.
In both of the problem settings described above, the agent’s objective is to devise an action policy that minimizes the Bayesian cumulative regret with respect to a baseline action policy. We define Bayesian cumulative regret next.
Bayesian Cumulative Regret
The cumulative regret of an action policy quantifies how different the mean reward accumulated over T iterations is from that accrued by a baseline action policy. In this work, we consider as the baseline the action policy of an oracle that has access to the channel noise parameter, the reward parameter, the context distribution and the noise channel likelihood. Accordingly, at each iteration t, the oracle can infer the exact predictive distribution of the true context from the observed noisy context via Bayes' rule, as in (2). Here, the numerator is the joint distribution of the true and noisy contexts given the noise channel parameter, and the denominator is the distribution of the noisy context obtained by marginalizing this joint distribution over the true contexts, as in (3), where the expectation is taken with respect to the context distribution. The oracle action policy then adopts, at iteration t, the action given in (4), which maximizes the mean reward evaluated under the exact predictive distribution. Note that, as in [14,18], we do not choose the stronger oracle action policy that requires access to the true context, as it is generally not achievable by an agent that observes only noisy contexts.
For fixed reward and channel noise parameters, we define the cumulative regret of the action policy as in (5): the expected difference between the mean rewards of the oracle decision policy and the agent's decision policy over T iterations. In (5), the expectation is taken with respect to the joint distribution of all random quantities generated over the T iterations. Using this, the cumulative regret (5) can be written as in (6).
Our focus in this work is on a Bayesian framework, where we assume that the reward parameter and the channel noise parameter are independently sampled by the environment from prior distributions defined on the set of reward parameters and on the set of channel noise parameters, respectively. The agent has knowledge of the prior distributions, the reward likelihood in (1) and the noise channel likelihood, although it does not observe the sampled reward and channel noise parameters. Using the above prior distributions, we define the Bayesian cumulative regret of the action policy as in (7), where the expectation is taken with respect to the two priors.
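For concreteness, the quantities defined above can be written as follows, using generic notation chosen here (feature map $\phi(a,c)$, reward parameter $\theta$, channel parameter $\gamma$, noisy context $\hat{c}_t$, and expected feature map $\bar{\phi}_t(a) = \mathbb{E}_{c \sim p(c \mid \hat{c}_t, \gamma)}[\phi(a,c)]$ under the oracle's exact predictive distribution); this is a reconstruction consistent with the definitions in the text rather than a verbatim copy of the paper's equations.

```latex
\begin{aligned}
a_t^{*} &= \operatorname*{arg\,max}_{a \in \mathcal{A}} \; \langle \theta, \bar{\phi}_t(a) \rangle
  && \text{(oracle action policy, cf. (4))} \\
\mathcal{R}(T; \theta, \gamma) &= \sum_{t=1}^{T}
  \mathbb{E}\!\left[ \langle \theta, \bar{\phi}_t(a_t^{*}) \rangle
  - \langle \theta, \bar{\phi}_t(a_t) \rangle \right]
  && \text{(cumulative regret, cf. (5)--(6))} \\
\mathcal{BR}(T) &= \mathbb{E}_{\theta \sim p(\theta),\, \gamma \sim p(\gamma)}
  \big[ \mathcal{R}(T; \theta, \gamma) \big]
  && \text{(Bayesian cumulative regret, cf. (7))}
\end{aligned}
```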
In the next sections, we present our novel TS algorithms to minimize the Bayesian cumulative regret for the two problem settings considered in this paper.
3. Modified TS for CB with Unobserved True Contexts
In this section, we consider Setting 1, where the agent only observes the noisy context at each iteration t. Our proposed modified TS algorithm is given in Algorithm 1.
Algorithm 1: TS with unobserved true contexts
- 1: for t = 1, …, T do
- 2: The environment selects a true context.
- 3: Agent observes a noisy context.
- 4: Agent evaluates the predictive posterior distribution as in (8).
- 5: Agent samples a reward parameter from the sampling distribution.
- 6: Agent chooses an action as in (11).
- 7: Agent observes a reward as in (1).
- 8: end for
The proposed algorithm implements two steps in each iteration. The first step, called the de-noising step, uses the current observed noisy context and the history of past observed noisy contexts to obtain a predictive posterior distribution of the true context. This is a two-step process: first, the agent uses the history of past observed noisy contexts to compute the posterior distribution of the channel parameter, where the conditional distribution of the noisy contexts given the channel parameter is evaluated as in (3). Note that, to evaluate this posterior, the agent uses its knowledge of the context distribution, the prior on the channel parameter, and the noise channel likelihood. Using the derived posterior, the predictive posterior distribution of the true context is then obtained as in (8), where the conditional distribution of the true context given the noisy context and the channel parameter is defined as in (2).
The second step of the algorithm implements a modified Thompson sampling. Note that, since the agent does not have access to the true contexts, it cannot evaluate the posterior distribution of the reward parameter with known contexts, given in (9), as is done in standard contextual TS. Instead, the agent must evaluate the true posterior distribution under noisy contexts, given in (10). However, evaluating the required marginal reward distribution is challenging even for Gaussian bandits, as the mean of the reward distribution is, in general, a non-linear function of the true context. As a result, the posterior (10) is analytically intractable.
Consequently, at each iteration t, the agent samples the reward parameter from a distribution that 'approximates' the true posterior (10). The specific choice of this sampling distribution depends on the problem setting. Ideally, one must choose a distribution that is sufficiently 'close' to the true posterior. In the next subsection, we explain this choice for Gaussian bandits.
Using the sampled reward parameter and the predictive posterior distribution obtained from the de-noising step, the agent then chooses its action at iteration t as in (11), using the expected feature map defined in (12), i.e., the expectation of the feature map with respect to the predictive posterior distribution.
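The two steps of Algorithm 1 can be summarized by the following Python skeleton. The callables `denoise`, `sample_reward_parameter` and `expected_feature` are placeholder interfaces introduced here; their Gaussian instantiations correspond to the de-noising and modified-TS computations discussed in Section 3.1.

```python
import numpy as np

def run_modified_ts(env, num_actions, horizon, denoise, sample_reward_parameter, expected_feature):
    """Skeleton of Algorithm 1 (TS with unobserved true contexts); interfaces are illustrative.

    env                     -- exposes observe_noisy_context() and step(action) -> reward
    denoise                 -- (noisy context, history) -> predictive posterior over the true context
    sample_reward_parameter -- history -> sample from the approximate reward-parameter posterior
    expected_feature        -- (action, predictive posterior) -> expected feature map, cf. (12)
    """
    history = []
    for t in range(horizon):
        c_hat = env.observe_noisy_context()
        predictive = denoise(c_hat, history)          # de-noising step, cf. (8)
        theta_t = sample_reward_parameter(history)    # modified Thompson sampling step
        scores = [theta_t @ expected_feature(a, predictive) for a in range(num_actions)]
        a_t = int(np.argmax(scores))                  # action selection, cf. (11)
        r_t = env.step(a_t)
        history.append((r_t, a_t, c_hat))
    return history
```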
3.1. Linear-Gaussian Stochastic CBs
We now instantiate Algorithm 1 for Gaussian CBs. Specifically, we consider Gaussian bandits in which the reward noise in (1) is Gaussian with mean 0 and known variance. We also assume a Gaussian prior on the reward parameter with mean zero and a diagonal covariance matrix given by a scalar variance times the identity matrix. The assumption of a diagonal prior covariance is in line with Lemma 3 in [19].
We consider a multivariate Gaussian context distribution with known mean and covariance matrix. The context noise channel is similarly Gaussian, with its own mean vector and covariance matrix. We assume the prior on the noise channel parameter to be Gaussian with a d-dimensional zero mean vector and a given covariance matrix. We assume that all of these covariance matrices are positive definite and known to the agent. The assumption of positive definite covariance matrices is made to facilitate the Bayesian analysis adopted in this work; similar assumptions were also required in the related work [14].
For this setting, we can analytically evaluate the predictive posterior distribution as a multivariate Gaussian; closed-form expressions for its inverse covariance matrix and mean vector, together with their derivations, are presented in Appendix C.1.2.
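The de-noising step for this Gaussian setting can be sketched as follows. The snippet assumes an additive channel in which the noisy context equals the true context plus Gaussian noise with unknown mean (the channel parameter) and known covariance; it then computes the conjugate Gaussian posterior over the channel parameter from past noisy contexts, followed by a plug-in Gaussian predictive distribution for the current true context. The exact expressions, which also propagate the channel-parameter uncertainty, are the ones derived in Appendix C.1.2; this is only an illustrative simplification.

```python
import numpy as np

def channel_posterior(c_hats, mu_c, Sigma_c, Sigma_n, Sigma_gamma):
    """Gaussian posterior over the channel mean gamma given past noisy contexts.

    Assumes the additive channel c_hat = c + eps with eps ~ N(gamma, Sigma_n),
    so that marginally c_hat | gamma ~ N(mu_c + gamma, Sigma_c + Sigma_n).
    """
    S_inv = np.linalg.inv(Sigma_c + Sigma_n)
    n = len(c_hats)
    precision = np.linalg.inv(Sigma_gamma) + n * S_inv
    cov = np.linalg.inv(precision)
    resid = np.sum(np.asarray(c_hats) - mu_c, axis=0) if n > 0 else np.zeros(len(mu_c))
    mean = cov @ (S_inv @ resid)
    return mean, cov

def plugin_predictive(c_hat_t, gamma_hat, mu_c, Sigma_c, Sigma_n):
    """Plug-in Gaussian predictive p(c_t | c_hat_t, gamma_hat) under the same assumption."""
    precision = np.linalg.inv(Sigma_c) + np.linalg.inv(Sigma_n)
    cov = np.linalg.inv(precision)
    mean = cov @ (np.linalg.inv(Sigma_c) @ mu_c
                  + np.linalg.inv(Sigma_n) @ (c_hat_t - gamma_hat))
    return mean, cov
```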
For the modified-TS step, we sample the reward parameter from the approximate posterior distribution in (15), constructed from the Gaussian reward likelihood in (16), whose mean is evaluated at the expected feature map defined in (12). This yields an approximate posterior that is itself a Gaussian distribution, whose inverse covariance matrix and mean can be evaluated in closed form.
The sampling distribution considered above is different from the true posterior distribution (10), which is analytically intractable. However, it bears resemblance to the posterior (9) obtained when the true contexts are known, with the reward distribution evaluated at the exact feature map replaced by one evaluated at the expected feature map. In Section 3.2.2, we show that the above choice of sampling distribution is indeed 'close' to the true posterior.
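In code, the approximate posterior amounts to standard Bayesian linear-regression updates in which the unavailable exact feature vectors are replaced by the expected feature maps of the chosen actions. The sketch below, with an isotropic prior and variable names chosen here, is a stand-in for the closed-form expressions above rather than a verbatim reproduction.

```python
import numpy as np

def approximate_reward_posterior(expected_feats, rewards, sigma0_sq, sigma_sq):
    """Gaussian approximate posterior over the reward parameter.

    expected_feats -- list of expected feature maps phi_bar_s of the chosen actions
    rewards        -- list of observed rewards r_s
    sigma0_sq      -- variance of the zero-mean, isotropic Gaussian prior (assumed form)
    sigma_sq       -- reward noise variance
    Returns the mean and covariance of a Bayesian linear regression in which
    phi_bar_s plays the role of the regressor for reward r_s.
    """
    Phi = np.asarray(expected_feats)           # shape (t, m)
    r = np.asarray(rewards)                    # shape (t,)
    m = Phi.shape[1]
    precision = np.eye(m) / sigma0_sq + Phi.T @ Phi / sigma_sq
    cov = np.linalg.inv(precision)
    mean = cov @ (Phi.T @ r) / sigma_sq
    return mean, cov

# A reward-parameter sample for the modified-TS step would then be drawn as
# theta_t = np.random.default_rng().multivariate_normal(mean, cov).
```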
3.2. Bayesian Regret Analysis
In this section, we derive information-theoretic upper bounds on the Bayesian regret (7) of the modified TS algorithm for Gaussian CBs. To this end, we first outline the key information-theoretic tools required to derive our bounds.
3.2.1. Preliminaries
To start, consider two probability distributions defined over the space of a random variable x. The Kullback–Leibler (KL) divergence between the two distributions is the expected value, under the first distribution, of the logarithm of their density ratio; it is defined when the first distribution is absolutely continuous with respect to the second, and takes the value ∞ otherwise. If x and y denote two random variables described by a joint probability distribution, the mutual information between x and y is defined as the KL divergence between their joint distribution and the product of their marginal distributions. More generally, for three random variables x, y and z with a joint distribution, the conditional mutual information between x and y given z is defined analogously via the conditional distributions given z. We will also use the variational representation of the KL divergence, also termed the Donsker–Varadhan (DV) inequality, which holds for any measurable function with a finite exponential moment under the second distribution.
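For reference, these quantities and the DV representation take the following standard forms, written in generic notation (distributions P and Q, random variables x, y, z) chosen here:

```latex
\begin{aligned}
D_{\mathrm{KL}}(P \,\|\, Q) &= \mathbb{E}_{x \sim P}\!\left[\log \frac{\mathrm{d}P}{\mathrm{d}Q}(x)\right],
\qquad
I(x; y) = D_{\mathrm{KL}}\big(P_{x,y} \,\|\, P_{x} \otimes P_{y}\big), \\
I(x; y \mid z) &= \mathbb{E}_{z}\Big[D_{\mathrm{KL}}\big(P_{x,y \mid z} \,\|\, P_{x \mid z} \otimes P_{y \mid z}\big)\Big], \\
D_{\mathrm{KL}}(P \,\|\, Q) &= \sup_{f}\Big\{\mathbb{E}_{P}\big[f(x)\big] - \log \mathbb{E}_{Q}\big[e^{f(x)}\big]\Big\}
\quad \text{(Donsker--Varadhan)},
\end{aligned}
```

where the supremum is over measurable functions f with $\mathbb{E}_{Q}[e^{f(x)}] < \infty$.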
3.2.2. Information-Theoretic Bayesian Regret Bounds
In this section, we present information-theoretic upper bounds on the Bayesian regret of the modified TS algorithm. To this end, we first state our main assumption.
Assumption 1. The feature map has a bounded norm.
The following theorem gives our main result.
Theorem 1. Suppose that the covariance matrices satisfy suitable ordering conditions and that Assumption 1 holds. Then, provided the variance parameter of the prior on the reward parameter is sufficiently small, the Bayesian regret of the modified TS algorithm admits an explicit upper bound, whose leading quantity is defined in (21). The theorem shows that the proposed TS algorithm achieves the stated regret guarantee when the prior is highly informative, i.e., when its variance parameter satisfies the required constraint.
Remark 1. The assumption on the covariance matrices in Theorem 1 directly holds for diagonal covariance matrices with positive eigenvalues.
To prove the regret bound of Theorem 1, we start by defining, in (22), the action that maximizes the mean reward corresponding to the reward parameter. Using this, the Bayesian cumulative regret (7) of the proposed TS algorithm can be decomposed into three terms, as in (23). In (23), the first term quantifies the Bayesian regret of our action policy (11) with respect to the action policy (22) for a CB with the associated mean-reward function. The second term accounts for the average difference between the cumulative mean rewards of the oracle optimal action policy (4), which is evaluated using the exact predictive distribution, and our action policy (11), which uses the inferred predictive posterior distribution. In this sense, the second term captures the error in approximating the exact predictive distribution by the inferred predictive distribution. The third term similarly accounts for the average approximation error.
To derive an upper bound on the Bayesian regret, we separately upper bound each of the three terms in (23), as done in the following lemmas. The lemma below presents an upper bound on the first term.
Lemma 1. Under Assumption 1, and provided the prior variance of the reward parameter satisfies the required condition, the upper bound in (24) holds, where the leading quantity is as defined in (21).
To derive the upper bound in (24), we leverage results from [19] that study the information-theoretic Bayesian regret of standard contextual TS algorithms via the lifted information ratio. However, those results do not directly apply to our algorithm due to the posterior mismatch between the sampling distribution and the true posterior distribution. Consequently, our upper bound (24) consists of three terms. The first term, defined as in (21), corresponds to the upper bound on the Bayesian regret of a contextual TS algorithm for which the sampling distribution coincides with the true posterior; it can be obtained by applying the lifted information ratio-based analysis of Cor. 2 in [19]. The second and third terms account for the posterior mismatch via the expected KL divergence between the true posterior and the sampling distribution. In particular, this expected KL divergence can be upper bounded (see Appendix C.1.3 for the proof) under the assumed prior on the reward parameter. Importantly, our result holds when this prior distribution is sufficiently concentrated, with its variance parameter satisfying the required inequality. This ensures that the contribution of the posterior mismatch to the Bayesian regret remains suitably controlled.
The following lemma gives an upper bound on the sum of the second and third terms.
Lemma 2. Under Assumption 1, the upper bound in (26) holds. In addition, if the covariance matrices satisfy the conditions stated in Theorem 1, then (26) can be further upper bounded in explicit form. Lemma 2 shows that the error in approximating the exact predictive distribution with the inferred predictive posterior distribution can, on average, be quantified via the conditional mutual information between the channel parameter and the true context given the observed noisy contexts up to and including iteration t.
Finally, combining Lemmas 1 and 2 with suitable choices of the free parameters gives the regret bound in Theorem 1.
3.3. Beyond Gaussian Bandits
In the previous sections, we studied Gaussian bandits and analyzed their Bayesian regret. We now discuss potential extensions of these results beyond Gaussian bandits. As in [14], we focus on Gaussian context and context-noise distributions, which help in deriving the upper bounds on the estimation errors in Lemma 2.
To extend the Bayesian regret analysis to non-Gaussian bandits, Lemma 1 requires bandit-specific modifications. Specifically, the derivation of the first term of (24), which captures the standard Bayesian regret of contextual TS when the sampling distribution is treated as the true posterior, and the derivation of the posterior mismatch term via the expected KL divergence both depend critically on the type of bandit and on the choice of the sampling posterior. The first term is derived using the lifted information ratio-based approach of [19]. This approach can indeed be extended to non-Gaussian bandits, such as logistic bandits (see [19]), to obtain a modified leading term.
However, the analysis of the posterior mismatch term for non-Gaussian bandits is non-trivial and depends on the specific bandit assumed. Firstly, to characterize the posterior mismatch via the expected KL divergence, our analysis requires the chosen sampling distribution to be sub-Gaussian. To choose the sampling distribution, one can follow the framework adopted in (15) and (16) and use an 'appropriate' reward distribution such that (i) the KL divergence between the true reward distribution and the chosen reward distribution is small, so as to minimize the posterior mismatch, and (ii) the resulting sampling distribution is easy to sample from and has sub-Gaussian tails. Thus, analyzing the posterior mismatch for non-Gaussian bandits requires a case-by-case treatment. For Gaussian bandits, we control the above KL divergence by choosing a Gaussian reward distribution with mean as in (16). Finally, in Section 5, we extend Algorithm 1 to logistic bandits, with the choice of sampling distribution motivated by (15) and (16), and use Langevin Monte Carlo to sample from this distribution.
4. TS for CB with Delayed True Contexts
In this section, we consider the CBs with delayed true contexts setting, in which the agent observes the true context only after it observes the reward corresponding to the chosen action. Note that, at the time of choosing the action, the agent has access only to noisy contexts. We specialize our TS algorithm to this setting and call it Algorithm 2.
Algorithm 2: TS for delayed true contexts
- 1: for t = 1, …, T do
- 2: The environment selects a true context.
- 3: Agent observes a noisy context.
- 4: Agent evaluates the predictive posterior distribution as in (28).
- 5: Agent samples a reward parameter from the posterior with known contexts.
- 6: Agent chooses an action as in (33).
- 7: Agent observes a reward (as in (1)) and then the true context.
- 8: end for
Algorithm 2 follows similar steps to Algorithm 1. However, different from Algorithm 1, at the tth iteration the agent knows the history of true contexts in addition to that of the noisy contexts. Consequently, in the de-noising step, the agent evaluates the predictive posterior distribution as in (28), where the conditional distribution of the true context given the noisy context and the channel parameter is as defined in (2), and the posterior distribution of the channel parameter is obtained via Bayes' rule using the history of true and noisy contexts.
For the Gaussian context noise considered in Section 3.1, the predictive posterior distribution is a multivariate Gaussian whose inverse covariance matrix and mean vector can be computed in closed form; the derivation can be found in Appendix B.2.4.
Following the de-noising step, the next step in Algorithm 2 is a conventional Thompson sampling step, thanks to the access to delayed true contexts. Consequently, the agent can evaluate the posterior distribution of the reward parameter with known contexts, as in (9), and use it to draw a sample of the reward parameter. For a Gaussian bandit with a Gaussian prior on the reward parameter, this posterior distribution is a multivariate Gaussian whose inverse covariance matrix and mean can be evaluated in closed form.
Using the sampled reward parameter and the obtained predictive posterior distribution, the agent then chooses its action as in (33), where we use the expected feature map computed with respect to the predictive posterior distribution.
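With delayed true contexts, both steps of Algorithm 2 simplify: the channel-parameter posterior can be formed directly from the observed (true, noisy) context pairs, and the reward-parameter posterior is the standard conjugate update with exact features, as in (9). The sketch below uses the same additive-Gaussian-channel and isotropic-prior assumptions as the earlier snippets and is illustrative rather than a transcription of the closed-form expressions above.

```python
import numpy as np

def channel_posterior_with_true_contexts(cs, c_hats, Sigma_n, Sigma_gamma):
    """Posterior over the channel mean gamma from (true, noisy) context pairs.

    Under the assumed additive channel, c_hat_s - c_s ~ N(gamma, Sigma_n), so this
    is a standard Gaussian mean-estimation update.
    """
    diffs = np.asarray(c_hats) - np.asarray(cs)
    n = diffs.shape[0]
    Sn_inv = np.linalg.inv(Sigma_n)
    precision = np.linalg.inv(Sigma_gamma) + n * Sn_inv
    cov = np.linalg.inv(precision)
    mean = cov @ (Sn_inv @ diffs.sum(axis=0))
    return mean, cov

def exact_reward_posterior(feats, rewards, sigma0_sq, sigma_sq):
    """Conjugate Gaussian posterior over the reward parameter with known contexts (cf. (9))."""
    Phi, r = np.asarray(feats), np.asarray(rewards)
    precision = np.eye(Phi.shape[1]) / sigma0_sq + Phi.T @ Phi / sigma_sq
    cov = np.linalg.inv(precision)
    mean = cov @ (Phi.T @ r) / sigma_sq
    return mean, cov
```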
5. Experiments and Final Remarks
In this section, we experimentally validate the performance of the proposed algorithms on synthetic and real-world datasets. Implementation details can be found in Appendix D.
Synthetic Datasets: For synthetic datasets, we go beyond Gaussian bandits and also evaluate our algorithms on logistic contextual bandits (see Figure 1 (Left) and (Center)). In both settings, we consider Gaussian contexts and context noise as in Section 3.1, with the distribution parameters fixed as detailed in Appendix D. We further consider the action and the context to be finite-dimensional vectors and construct the m-dimensional feature vector from their components.
Gaussian Bandits: The mean reward function is linear in the feature map described above. The remaining parameters are fixed as detailed in Appendix D. Plots are averaged over 100 independent trials.
Logistic Bandits: The reward is Bernoulli, with a mean reward given by the sigmoid of the linear function of the feature map. We consider a Gaussian prior over the reward parameter. In Algorithm 1, we choose the sampling distribution analogously to (15) and (16). However, this posterior is analytically intractable, since the Bernoulli reward and the Gaussian prior form a non-conjugate pair. Consequently, we use Langevin Monte Carlo (LMC) [20] to sample from it. We run LMC with a fixed number of iterations, learning rate and inverse temperature (see Appendix D). Plots are averaged over 10 independent trials.
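A minimal version of the LMC sampler used for the logistic-bandit experiments is sketched below. It runs unadjusted Langevin dynamics on the log of a Bernoulli-likelihood/Gaussian-prior posterior built from expected feature maps; the hyperparameter values shown are placeholders, not the ones used to produce the reported plots.

```python
import numpy as np

def lmc_sample(expected_feats, rewards, prior_var, n_iters=100, lr=0.01, beta=1.0, seed=0):
    """Unadjusted Langevin dynamics targeting an (approximate) logistic-bandit posterior.

    expected_feats -- array (t, m) of expected feature maps of the chosen actions
    rewards        -- array (t,) of Bernoulli rewards in {0, 1}
    prior_var      -- variance of the zero-mean isotropic Gaussian prior on theta
    The step size, iteration count and inverse temperature beta are illustrative.
    """
    rng = np.random.default_rng(seed)
    Phi = np.asarray(expected_feats, dtype=float)
    r = np.asarray(rewards, dtype=float)
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        logits = Phi @ theta
        probs = 1.0 / (1.0 + np.exp(-logits))           # sigmoid of the linear score
        grad_loglik = Phi.T @ (r - probs)               # Bernoulli log-likelihood gradient
        grad_logprior = -theta / prior_var              # Gaussian prior gradient
        grad = grad_loglik + grad_logprior
        noise = rng.normal(size=theta.shape)
        theta = theta + lr * grad + np.sqrt(2.0 * lr / beta) * noise
    return theta
```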
MovieLens Dataset: We use the MovieLens-100K dataset [21] to evaluate the performance of the algorithms. To utilize this dataset, we first perform non-negative matrix factorization on the rating matrix with 3 latent factors to obtain a user matrix W and a movie matrix H. Each row vector of W corresponds to a user context, while each column vector of H corresponds to movie (action) features. The mean and variance of the Gaussian context distribution are estimated from the row vectors of W. We then add Gaussian noise to the contexts as in the synthetic settings.
We apply the K-means algorithm to the column vectors of H to group the actions into clusters. The centroid and the variance of the kth cluster are then used to fix the mean and variance of the Gaussian prior over the reward parameter, with the covariance taken proportional to the identity matrix. The feature vector is then fixed as a 60-dimensional vector with a nonzero block at the index of the cluster k to which action a belongs and zeros everywhere else. We further add zero-mean Gaussian noise to the mean reward. The Bayesian oracle in this experiment has access to the exact context noise parameter sampled from its Gaussian prior, as well as the true reward parameter sampled from its Gaussian prior.
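The MovieLens preprocessing described above can be reproduced along the following lines with scikit-learn. The rating matrix is replaced by a random non-negative stand-in so that the snippet is self-contained, the number of clusters is a placeholder, and the block construction of the feature vector is one plausible reading of the description above.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for the MovieLens-100K user-by-movie rating matrix (943 x 1682 in the real data).
R = rng.integers(0, 6, size=(200, 300)).astype(float)

# Non-negative matrix factorization with 3 latent factors: R ~ W @ H.
nmf = NMF(n_components=3, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(R)          # rows of W: user contexts
H = nmf.components_               # columns of H: movie (action) features

# Context distribution estimated from user vectors.
mu_c = W.mean(axis=0)
Sigma_c = np.cov(W, rowvar=False)

# Cluster the action features; the number of clusters here is illustrative.
n_clusters = 20
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(H.T)
centroids = kmeans.cluster_centers_       # can be used to set the prior mean over the reward parameter
cluster_of_action = kmeans.labels_        # cluster index of each movie (action)

def feature(a, user_context):
    """Block feature vector: the user context placed at the block of action a's cluster."""
    phi = np.zeros(n_clusters * len(user_context))
    k = cluster_of_action[a]
    phi[k * len(user_context):(k + 1) * len(user_context)] = user_context
    return phi
```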
Baselines: We compare our algorithms with two baselines. In the first baseline, the agent observes only noisy contexts but is unaware of the presence of noise; consequently, it naively implements conventional TS with the noisy contexts. This sets the benchmark for the worst-case achievable regret. The second baseline assumes that the agent knows the true channel parameter, a setting studied in [18], and can thus perform exact de-noising via the exact predictive posterior distribution. This algorithm sets the benchmark for the best achievable regret.
Figure 1 (Left) corroborates our theoretical findings for Gaussian bandits. In particular, our algorithms (Algorithms 1 and 2) exhibit sub-linear regret and achieve robust performance comparable to the best achievable performance of the exact de-noising baseline. We remark that, while our regret analysis of Gaussian bandits is motivated by the tractability of the posterior distributions and the concentration properties of Gaussians, our empirical results for logistic bandits in Figure 1 (Center) show a promising extension of our algorithms to non-conjugate distributions. Extending the Bayesian regret analysis to such general distributions is left for future work. Further, our experiments on the MovieLens data in Figure 1 (Right) validate the effectiveness of our algorithm in comparison to the benchmarks. The plot shows that our approach outperforms the noise-unaware baseline and achieves regret comparable to that of the exact de-noising baseline, which attains the best achievable regret.