1. Introduction
Decision-making in the face of uncertainty is a widespread challenge found across various domains such as control and robotics [1], clinical trials [2], communications [3], and ecology [4]. To tackle this challenge, learning algorithms have been developed to uncover effective policies for optimal decision-making. One notable framework for addressing this challenge is that of contextual bandits (CBs), which capture the essence of sequential decision-making by incorporating side information, termed context [5].
In the standard CB model, an agent interacts with the environment over numerous rounds. In each round, the environment presents a context to the agent, based on which the agent chooses an action and receives a reward from the environment. The reward is stochastic, drawn from a probability distribution whose mean (a function of the context-action pair) is unknown to the agent. The goal of the agent is to design a policy for action selection that maximizes the cumulative mean reward accrued over a horizon of T rounds.
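To make the interaction protocol concrete, the following minimal Python sketch simulates the round-by-round loop described above for a linear mean-reward model. The environment, feature map and (deliberately naive) action choice are placeholders introduced here for illustration only and are not part of the formal model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 10, 1000                  # context dimension, number of actions, horizon
theta = rng.normal(size=d)             # reward parameter, hidden from the agent
action_templates = rng.normal(size=(K, d))

def feature(a, context):
    # Illustrative feature map: elementwise product of an action template and the context.
    return action_templates[a] * context

total_reward = 0.0
for t in range(T):
    context = rng.normal(size=d)                 # environment reveals the context
    a_t = rng.integers(K)                        # placeholder policy; a real agent would learn
    mean_reward = feature(a_t, context) @ theta  # unknown to the agent
    r_t = mean_reward + rng.normal(scale=0.1)    # stochastic reward around the mean
    total_reward += r_t
```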
In this paper, we focus on a CB model that assumes stochastic rewards with linear mean-reward functions, also called stochastic linear contextual bandits. Stochastic linear CB models find applications in various settings, including internet advertisement selection [6], where the advertisement (i.e., action) and webpage features (i.e., context) are used to construct a linear predictor of the probability that a user clicks on a given advertisement, and article recommendation on web portals [7].
While most prior research on CBs has primarily focused on models with known exact contexts [8,9,10], in many real-world applications the contexts are noisy, e.g., imprecise measurements of patient conditions in clinical trials, or weather and stock market predictions. In such scenarios, when the exact contexts are unknown, the agent must utilize the observed noisy contexts to estimate the mean reward associated with the true context. However, this results in a biased estimate that renders standard CB algorithms unsuitable. Consequently, recent efforts have been made to develop CB algorithms tailored to noisy context settings.
Related Works: Ref. [11] considers a setting where there is a bounded zero-mean noise in the m-dimensional feature vector (a function of the action a and the context c) rather than in the context vector itself, and the agent observes only noisy features. For this setting, they develop an upper confidence bound (UCB) algorithm. Ref. [12] models the uncertainty regarding the true contexts by a context distribution that is known to the agent, while the agent never observes the true context; a UCB algorithm is developed for this setting. A similar setting has also been considered in [13]. Differing from these works, ref. [14] considers the setting where the true feature vectors are sampled from an unknown feature distribution at each time, but the agent observes only a noisy feature vector. Assuming Gaussian feature noise with unknown mean and covariance, they develop an Optimism in the Face of Uncertainty (OFUL) algorithm. A variant of this setting has been studied in [15].
Motivation and Problem Setting: In this work, inspired by [14], we consider the following noisy CB setting. In each round, the environment samples a true context vector from a context distribution that is known to the agent. The agent, however, does not observe the true context; instead, it observes a noisy context obtained as the output of a parameterized noise channel. The agent is aware of the presence of noise but does not know the channel parameter. Following [14], we consider Gaussian noise channels for our regret analysis.
Based on the observed noisy context, the agent chooses an action and observes a reward corresponding to the true context. We consider a linear bandit whose mean reward is determined by an unknown reward parameter. The goal of the agent is to design an action policy that minimizes the Bayesian cumulative regret with respect to the action policy of a Bayesian oracle. The oracle has access to the reward model and the channel parameter, and uses the predictive distribution of the true context given the observed noisy context to select an action.
Our setting differs from [14] in that we assume noisy contexts rather than noisy feature vectors, and that the agent knows the context distribution. The noise model of [14], which places the noise in the feature vector, allows the original problem to be transformed into a different CB problem that estimates a modified reward parameter. Such a transformation, however, is not straightforward in our setting, where the noise is in the contexts rather than in the feature vectors and where we wish to analyze the Bayesian regret. Additionally, we propose a de-noising approach to estimate the predictive distribution of the true context from the observed noisy contexts, offering potential benefits for future analyses.
The assumption of a known context distribution follows [12]. It can be motivated by the example of an online recommendation engine that pre-processes user account registration information, or contexts (e.g., age, gender, device, location, item preferences), to group users into different clusters [16]. The engine can then infer the 'empirical' distribution of users within each cluster to define a context distribution over true contextual information. A noisy-context scenario occurs, for example, when a guest with different preferences logs into a user's account.
Challenges and Novelty: Different from existing works that developed UCB-based algorithms, we propose a fully Bayesian Thompson Sampling (TS) algorithm that approximates the Bayesian oracle policy. The proposed algorithm differs from standard contextual TS [10] in the following aspects. Firstly, since the true context vectors are not accessible at each round and the channel parameter is unknown, the agent uses its knowledge of the context distribution and the past observed noisy contexts to infer a predictive posterior distribution of the true context from the current observed noisy context. The inferred predictive distribution is then used to choose the action. This de-noising step enables our algorithm to 'approximate' the oracle action policy, which uses knowledge of the channel parameter to implement exact de-noising. Secondly, the reward received by the agent corresponds to the unobserved true context. Hence, the agent cannot exactly evaluate the posterior distribution of the reward parameter and sample from it as is done in standard contextual TS. Instead, our algorithm uses a sampling distribution that 'approximates' this posterior.
Different from existing works that focus on frequentist regret analysis, we derive novel information-theoretic bounds on the Bayesian cumulative regret of our algorithm. For Gaussian bandits, our information-theoretic regret bounds scale with the dimension of the feature vector and the time horizon (up to logarithmic factors), under certain conditions on the variance of the prior on the reward parameter. Furthermore, our Bayesian regret analysis shows that the posterior mismatch, which results from replacing the true posterior distribution with a sampling distribution, induces an approximation error that is captured via the Kullback–Leibler (KL) divergence between the two distributions. To the best of our knowledge, quantifying the posterior mismatch via the KL divergence has not been studied before and is of independent interest.
Finally, we also extend our algorithm to a setting where the agent observes the true context after the decision is made and the reward is observed [12]. We call this setting CBs with delayed true contexts. Such scenarios arise in many applications where only a prediction of the context is available at the time of decision-making, while the true context becomes available later. For instance, in a farming-recommender system, at the time of deciding which crop to cultivate in a year, the true contextual information about the weather pattern is unavailable and only 'noisy' weather predictions are at hand; the true weather pattern is observed only after the decision is made. We show that our TS algorithm for this setting with delayed true contexts achieves reduced Bayesian regret.
Table 1 compares our regret bound with that of the state-of-the-art algorithms in the noiseless and noisy CB settings.
2. Problem Setting
In this section, we present the stochastic linear CB problem studied in this paper. The agent is given an action set with K actions and a (possibly infinite) set of d-dimensional context vectors. At each iteration, the environment randomly draws a context vector according to a context distribution defined over the set of context vectors. The context distribution is known to the agent. The agent, however, does not observe the true context drawn by the environment. Instead, it observes a noisy version of the true context, obtained as the output of a noisy, stochastic channel with the true context as the input. The noise channel is parameterized by a noise channel parameter that is unknown to the agent.
Having observed the noisy context at iteration t, the agent chooses an action according to an action policy. The action policy may be stochastic, describing a probability distribution over the set of actions. Corresponding to the chosen action, the agent receives a reward from the environment as given in (1): the reward is the sum of a linear mean-reward term and a zero-mean reward noise variable. The mean-reward function is defined via the feature map, which maps the action and the true context to an m-dimensional feature vector, and via a reward parameter that is unknown to the agent.
We call the noisy CB problem described above CBs with unobserved true contexts (see Setting 1), since the agent does not observe the true context and the selection of the action is based solely on the observed noisy context. Accordingly, at the end of iteration t, the agent has accrued a history of observed reward-action-noisy context tuples. The action policy at iteration t may depend on this history.
Setting 1: CBs with unobserved true contexts
- 1: for t = 1, …, T do
- 2: Environment samples a true context.
- 3: Agent observes a noisy context.
- 4: Agent chooses an action.
- 5: Agent receives a reward according to (1).
- 6: end for
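A minimal simulation of Setting 1 is sketched below, assuming Gaussian contexts, an additive Gaussian noise channel whose unknown mean shift plays the role of the channel parameter, and a linear reward as in (1). The additive form of the channel, the block feature map and all numerical values are illustrative assumptions, not part of the formal setting.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T = 4, 6, 500
mu_c, Sigma_c = np.zeros(d), np.eye(d)       # known context distribution
Sigma_n = 0.5 * np.eye(d)                    # channel noise covariance (assumed known)
gamma = rng.normal(size=d)                   # channel parameter, unknown to the agent
theta = rng.normal(size=d * K)               # reward parameter, unknown to the agent

def feature(a, c):
    # Assumed block feature map: the context fills the block of the chosen action.
    phi = np.zeros(d * K)
    phi[a * d:(a + 1) * d] = c
    return phi

for t in range(T):
    c_t = rng.multivariate_normal(mu_c, Sigma_c)             # true context (never observed)
    c_hat_t = c_t + rng.multivariate_normal(gamma, Sigma_n)  # observed noisy context
    a_t = rng.integers(K)                                    # placeholder action policy
    r_t = feature(a_t, c_t) @ theta + rng.normal(scale=0.1)  # reward depends on the TRUE context
```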
We also consider a variant of the above problem where the agent has access to a delayed observation of the true context, as studied in [12]. We call this setting CBs with delayed true contexts. In this setting, at iteration t, the agent first observes a noisy context, chooses an action, and receives a reward; only later is the true context observed. It is important to note that the agent has no access to the true context at the time of decision-making. Thus, at the end of iteration t, the agent has collected a history of observed reward-action-context-noisy context tuples.
In both of the problem settings described above, the agent’s objective is to devise an action policy that minimizes the Bayesian cumulative regret with respect to a baseline action policy. We define Bayesian cumulative regret next.
Bayesian Cumulative Regret
The cumulative regret of an action policy quantifies how different the mean reward accumulated over T iterations is from that accrued by a baseline action policy. In this work, we consider as the baseline the action policy of an oracle that has access to the channel noise parameter, the reward parameter, the context distribution and the noise channel likelihood. Accordingly, at each iteration t, the oracle can infer the exact predictive distribution of the true context from the observed noisy context via Bayes' rule, as in (2). Here, the numerator is the joint distribution of the true and noisy contexts given the noise channel parameter, and the denominator is the distribution of the noisy context obtained by marginalizing this joint distribution over the true contexts, as in (3), where the expectation is taken with respect to the context distribution. The oracle action policy then adopts, at iteration t, the action given in (4), which maximizes the mean reward evaluated under the exact predictive distribution. Note that, as in [14,18], we do not choose the stronger oracle action policy that requires access to the true context, as it is generally not achievable by an agent that observes only noisy contexts.
For fixed reward and channel noise parameters, we define the cumulative regret of the action policy as in (5): the expected difference between the mean rewards of the oracle decision policy and the agent's decision policy over T iterations. In (5), the expectation is taken with respect to the joint distribution of all random quantities generated over the T iterations. Using this, the cumulative regret (5) can be written as in (6).
Our focus in this work is on a Bayesian framework, where we assume that the reward parameter and the channel noise parameter are independently sampled by the environment from prior distributions defined on the set of reward parameters and on the set of channel noise parameters, respectively. The agent has knowledge of the prior distributions, the reward likelihood in (1) and the noise channel likelihood, although it does not observe the sampled reward and channel noise parameters. Using the above prior distributions, we define the Bayesian cumulative regret of the action policy as in (7), where the expectation is taken with respect to the two priors.
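For concreteness, the quantities defined above can be written as follows, using generic notation chosen here (feature map $\phi(a,c)$, reward parameter $\theta$, channel parameter $\gamma$, noisy context $\hat{c}_t$, and expected feature map $\bar{\phi}_t(a) = \mathbb{E}_{c \sim p(c \mid \hat{c}_t, \gamma)}[\phi(a,c)]$ under the oracle's exact predictive distribution); this is a reconstruction consistent with the definitions in the text rather than a verbatim copy of the paper's equations.

```latex
\begin{aligned}
a_t^{*} &= \operatorname*{arg\,max}_{a \in \mathcal{A}} \; \langle \theta, \bar{\phi}_t(a) \rangle
  && \text{(oracle action policy, cf. (4))} \\
\mathcal{R}(T; \theta, \gamma) &= \sum_{t=1}^{T}
  \mathbb{E}\!\left[ \langle \theta, \bar{\phi}_t(a_t^{*}) \rangle
  - \langle \theta, \bar{\phi}_t(a_t) \rangle \right]
  && \text{(cumulative regret, cf. (5)--(6))} \\
\mathcal{BR}(T) &= \mathbb{E}_{\theta \sim p(\theta),\, \gamma \sim p(\gamma)}
  \big[ \mathcal{R}(T; \theta, \gamma) \big]
  && \text{(Bayesian cumulative regret, cf. (7))}
\end{aligned}
```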
In the next sections, we present our novel TS algorithms to minimize the Bayesian cumulative regret for the two problem settings considered in this paper.
3. Modified TS for CB with Unobserved True Contexts
In this section, we consider Setting 1, where the agent only observes the noisy context at each iteration t. Our proposed modified TS algorithm is given in Algorithm 1.
Algorithm 1: TS with unobserved true contexts
- 1: for t = 1, …, T do
- 2: The environment selects a true context.
- 3: Agent observes a noisy context.
- 4: Agent evaluates the predictive posterior distribution as in (8).
- 5: Agent samples a reward parameter from the sampling distribution.
- 6: Agent chooses an action as in (11).
- 7: Agent observes a reward as in (1).
- 8: end for
The proposed algorithm implements two steps in each iteration. The first step, called the de-noising step, uses the current observed noisy context and the history of past observed noisy contexts to obtain a predictive posterior distribution of the true context. This is a two-step process: first, the agent uses the history of past observed noisy contexts to compute the posterior distribution of the channel parameter, where the conditional distribution of the noisy contexts given the channel parameter is evaluated as in (3). Note that, to evaluate this posterior, the agent uses its knowledge of the context distribution, the prior on the channel parameter, and the noise channel likelihood. Using the derived posterior, the predictive posterior distribution of the true context is then obtained as in (8), where the conditional distribution of the true context given the noisy context and the channel parameter is defined as in (2).
The second step of the algorithm implements a modified Thompson sampling. Note that, since the agent does not have access to the true contexts, it cannot evaluate the posterior distribution of the reward parameter with known contexts, given in (9), as is done in standard contextual TS. Instead, the agent must evaluate the true posterior distribution under noisy contexts, given in (10). However, evaluating the required marginal reward distribution is challenging even for Gaussian bandits, as the mean of the reward distribution is, in general, a non-linear function of the true context. As a result, the posterior (10) is analytically intractable.
Consequently, at each iteration t, the agent samples the reward parameter from a distribution that 'approximates' the true posterior (10). The specific choice of this sampling distribution depends on the problem setting. Ideally, one must choose a distribution that is sufficiently 'close' to the true posterior. In the next subsection, we explain this choice for Gaussian bandits.
Using the sampled reward parameter and the predictive posterior distribution obtained from the de-noising step, the agent then chooses its action at iteration t as in (11), using the expected feature map defined in (12), i.e., the expectation of the feature map with respect to the predictive posterior distribution.
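The two steps of Algorithm 1 can be summarized by the following Python skeleton. The callables `denoise`, `sample_reward_parameter` and `expected_feature` are placeholder interfaces introduced here; their Gaussian instantiations correspond to the de-noising and modified-TS computations discussed in Section 3.1.

```python
import numpy as np

def run_modified_ts(env, num_actions, horizon, denoise, sample_reward_parameter, expected_feature):
    """Skeleton of Algorithm 1 (TS with unobserved true contexts); interfaces are illustrative.

    env                     -- exposes observe_noisy_context() and step(action) -> reward
    denoise                 -- (noisy context, history) -> predictive posterior over the true context
    sample_reward_parameter -- history -> sample from the approximate reward-parameter posterior
    expected_feature        -- (action, predictive posterior) -> expected feature map, cf. (12)
    """
    history = []
    for t in range(horizon):
        c_hat = env.observe_noisy_context()
        predictive = denoise(c_hat, history)          # de-noising step, cf. (8)
        theta_t = sample_reward_parameter(history)    # modified Thompson sampling step
        scores = [theta_t @ expected_feature(a, predictive) for a in range(num_actions)]
        a_t = int(np.argmax(scores))                  # action selection, cf. (11)
        r_t = env.step(a_t)
        history.append((r_t, a_t, c_hat))
    return history
```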
3.1. Linear-Gaussian Stochastic CBs
We now instantiate Algorithm 1 for Gaussian CBs. Specifically, we consider Gaussian bandits in which the reward noise in (1) is Gaussian with mean 0 and known variance. We also assume a Gaussian prior on the reward parameter with mean zero and a diagonal covariance matrix given by a scalar variance times the identity matrix. The assumption of a diagonal prior covariance is in line with Lemma 3 in [19].
We consider a multivariate Gaussian context distribution with known mean and covariance matrix. The context noise channel is similarly Gaussian, with its own mean vector and covariance matrix. We assume the prior on the noise channel parameter to be Gaussian with a d-dimensional zero mean vector and a given covariance matrix. We assume that all of these covariance matrices are positive definite and known to the agent. The assumption of positive definite covariance matrices is made to facilitate the Bayesian analysis adopted in this work; similar assumptions were also required in the related work [14].
For this setting, we can analytically evaluate the predictive posterior distribution as a multivariate Gaussian; closed-form expressions for its inverse covariance matrix and mean vector, together with their derivations, are presented in Appendix C.1.2.
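The de-noising step for this Gaussian setting can be sketched as follows. The snippet assumes an additive channel in which the noisy context equals the true context plus Gaussian noise with unknown mean (the channel parameter) and known covariance; it then computes the conjugate Gaussian posterior over the channel parameter from past noisy contexts, followed by a plug-in Gaussian predictive distribution for the current true context. The exact expressions, which also propagate the channel-parameter uncertainty, are the ones derived in Appendix C.1.2; this is only an illustrative simplification.

```python
import numpy as np

def channel_posterior(c_hats, mu_c, Sigma_c, Sigma_n, Sigma_gamma):
    """Gaussian posterior over the channel mean gamma given past noisy contexts.

    Assumes the additive channel c_hat = c + eps with eps ~ N(gamma, Sigma_n),
    so that marginally c_hat | gamma ~ N(mu_c + gamma, Sigma_c + Sigma_n).
    """
    S_inv = np.linalg.inv(Sigma_c + Sigma_n)
    n = len(c_hats)
    precision = np.linalg.inv(Sigma_gamma) + n * S_inv
    cov = np.linalg.inv(precision)
    resid = np.sum(np.asarray(c_hats) - mu_c, axis=0) if n > 0 else np.zeros(len(mu_c))
    mean = cov @ (S_inv @ resid)
    return mean, cov

def plugin_predictive(c_hat_t, gamma_hat, mu_c, Sigma_c, Sigma_n):
    """Plug-in Gaussian predictive p(c_t | c_hat_t, gamma_hat) under the same assumption."""
    precision = np.linalg.inv(Sigma_c) + np.linalg.inv(Sigma_n)
    cov = np.linalg.inv(precision)
    mean = cov @ (np.linalg.inv(Sigma_c) @ mu_c
                  + np.linalg.inv(Sigma_n) @ (c_hat_t - gamma_hat))
    return mean, cov
```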
For the modified-TS step, we sample the reward parameter from the approximate posterior distribution in (15), constructed from the Gaussian reward likelihood in (16), whose mean is evaluated at the expected feature map defined in (12). This yields an approximate posterior that is itself a Gaussian distribution, whose inverse covariance matrix and mean can be evaluated in closed form.
The sampling distribution considered above is different from the true posterior distribution (10), which is analytically intractable. However, it bears resemblance to the posterior (9) obtained when the true contexts are known, with the reward distribution evaluated at the exact feature map replaced by one evaluated at the expected feature map. In Section 3.2.2, we show that the above choice of sampling distribution is indeed 'close' to the true posterior.
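In code, the approximate posterior amounts to standard Bayesian linear-regression updates in which the unavailable exact feature vectors are replaced by the expected feature maps of the chosen actions. The sketch below, with an isotropic prior and variable names chosen here, is a stand-in for the closed-form expressions above rather than a verbatim reproduction.

```python
import numpy as np

def approximate_reward_posterior(expected_feats, rewards, sigma0_sq, sigma_sq):
    """Gaussian approximate posterior over the reward parameter.

    expected_feats -- list of expected feature maps phi_bar_s of the chosen actions
    rewards        -- list of observed rewards r_s
    sigma0_sq      -- variance of the zero-mean, isotropic Gaussian prior (assumed form)
    sigma_sq       -- reward noise variance
    Returns the mean and covariance of a Bayesian linear regression in which
    phi_bar_s plays the role of the regressor for reward r_s.
    """
    Phi = np.asarray(expected_feats)           # shape (t, m)
    r = np.asarray(rewards)                    # shape (t,)
    m = Phi.shape[1]
    precision = np.eye(m) / sigma0_sq + Phi.T @ Phi / sigma_sq
    cov = np.linalg.inv(precision)
    mean = cov @ (Phi.T @ r) / sigma_sq
    return mean, cov

# A reward-parameter sample for the modified-TS step would then be drawn as
# theta_t = np.random.default_rng().multivariate_normal(mean, cov).
```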
3.2. Bayesian Regret Analysis
In this section, we derive information-theoretic upper bounds on the Bayesian regret (7) of the modified TS algorithm for Gaussian CBs. To this end, we first outline the key information-theoretic tools required to derive our bounds.
3.2.1. Preliminaries
To start, consider two probability distributions defined over the space of a random variable x. The Kullback–Leibler (KL) divergence between the two distributions is the expected value, under the first distribution, of the logarithm of their density ratio; it is defined when the first distribution is absolutely continuous with respect to the second, and takes the value ∞ otherwise. If x and y denote two random variables described by a joint probability distribution, the mutual information between x and y is defined as the KL divergence between their joint distribution and the product of their marginal distributions. More generally, for three random variables x, y and z with a joint distribution, the conditional mutual information between x and y given z is defined analogously via the conditional distributions given z. We will also use the variational representation of the KL divergence, also termed the Donsker–Varadhan (DV) inequality, which holds for any measurable function with a finite exponential moment under the second distribution.
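For reference, these quantities and the DV representation take the following standard forms, written in generic notation (distributions P and Q, random variables x, y, z) chosen here:

```latex
\begin{aligned}
D_{\mathrm{KL}}(P \,\|\, Q) &= \mathbb{E}_{x \sim P}\!\left[\log \frac{\mathrm{d}P}{\mathrm{d}Q}(x)\right],
\qquad
I(x; y) = D_{\mathrm{KL}}\big(P_{x,y} \,\|\, P_{x} \otimes P_{y}\big), \\
I(x; y \mid z) &= \mathbb{E}_{z}\Big[D_{\mathrm{KL}}\big(P_{x,y \mid z} \,\|\, P_{x \mid z} \otimes P_{y \mid z}\big)\Big], \\
D_{\mathrm{KL}}(P \,\|\, Q) &= \sup_{f}\Big\{\mathbb{E}_{P}\big[f(x)\big] - \log \mathbb{E}_{Q}\big[e^{f(x)}\big]\Big\}
\quad \text{(Donsker--Varadhan)},
\end{aligned}
```

where the supremum is over measurable functions f with $\mathbb{E}_{Q}[e^{f(x)}] < \infty$.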
3.2.2. Information-Theoretic Bayesian Regret Bounds
In this section, we present information-theoretic upper bounds on the Bayesian regret of the modified TS algorithm. To this end, we first state our main assumption.
Assumption 1. The feature map has a bounded norm.
The following theorem gives our main result.
Theorem 1. Suppose that the covariance matrices satisfy suitable ordering conditions and that Assumption 1 holds. Then, provided the variance parameter of the prior on the reward parameter is sufficiently small, the Bayesian regret of the modified TS algorithm admits an explicit upper bound, whose leading quantity is defined in (21). The theorem shows that the proposed TS algorithm achieves the stated regret guarantee when the prior is highly informative, i.e., when its variance parameter satisfies the required constraint.
Remark 1. The assumption on the covariance matrices in Theorem 1 directly holds for diagonal covariance matrices with positive eigenvalues.
To prove the regret bound of Theorem 1, we start by defining, in (22), the action that maximizes the mean reward corresponding to the reward parameter. Using this, the Bayesian cumulative regret (7) of the proposed TS algorithm can be decomposed into three terms, as in (23). In (23), the first term quantifies the Bayesian regret of our action policy (11) with respect to the action policy (22) for a CB with the associated mean-reward function. The second term accounts for the average difference between the cumulative mean rewards of the oracle optimal action policy (4), which is evaluated using the exact predictive distribution, and our action policy (11), which uses the inferred predictive posterior distribution. In this sense, the second term captures the error in approximating the exact predictive distribution by the inferred predictive distribution. The third term similarly accounts for the average approximation error.
To derive an upper bound on the Bayesian regret, we separately upper bound each of the three terms in (23), as done in the following lemmas. The lemma below presents an upper bound on the first term.
Lemma 1. Under Assumption 1, and provided the prior variance of the reward parameter satisfies the required condition, the upper bound in (24) holds, where the leading quantity is as defined in (21).
To derive the upper bound in (24), we leverage results from [19] that study the information-theoretic Bayesian regret of standard contextual TS algorithms via the lifted information ratio. However, those results do not directly apply to our algorithm due to the posterior mismatch between the sampling distribution and the true posterior distribution. Consequently, our upper bound (24) consists of three terms. The first term, defined as in (21), corresponds to the upper bound on the Bayesian regret of a contextual TS algorithm for which the sampling distribution coincides with the true posterior; it can be obtained by applying the lifted information ratio-based analysis of Cor. 2 in [19]. The second and third terms account for the posterior mismatch via the expected KL divergence between the true posterior and the sampling distribution. In particular, this expected KL divergence can be upper bounded (see Appendix C.1.3 for the proof) under the assumed prior on the reward parameter. Importantly, our result holds when this prior distribution is sufficiently concentrated, with its variance parameter satisfying the required inequality. This ensures that the contribution of the posterior mismatch to the Bayesian regret remains suitably controlled.
The following lemma gives an upper bound on the sum of the second and third terms.
Lemma 2. Under Assumption 1, the upper bound in (26) holds. In addition, if the covariance matrices satisfy the conditions stated in Theorem 1, then (26) can be further upper bounded in explicit form. Lemma 2 shows that the error in approximating the exact predictive distribution with the inferred predictive posterior distribution can, on average, be quantified via the conditional mutual information between the channel parameter and the true context given the observed noisy contexts up to and including iteration t.
Finally, combining Lemmas 1 and 2 with suitable choices of the free parameters gives the regret bound in Theorem 1.
3.3. Beyond Gaussian Bandits
In the previous sections, we studied Gaussian bandits and analyzed their Bayesian regret. We now discuss potential extensions of these results beyond Gaussian bandits. As in [14], we focus on Gaussian context and context-noise distributions, which help in deriving the upper bounds on the estimation errors in Lemma 2.
To extend the Bayesian regret analysis to non-Gaussian bandits, Lemma 1 requires bandit-specific modifications. Specifically, the derivation of the first term of (24), which captures the standard Bayesian regret of contextual TS when the sampling distribution is treated as the true posterior, and the derivation of the posterior mismatch term via the expected KL divergence both depend critically on the type of bandit and on the choice of the sampling posterior. The first term is derived using the lifted information ratio-based approach of [19]. This approach can indeed be extended to non-Gaussian bandits, such as logistic bandits (see [19]), to obtain a modified leading term.
However, the analysis of the posterior mismatch term for non-Gaussian bandits is non-trivial and depends on the specific bandit assumed. Firstly, to characterize the posterior mismatch via the expected KL divergence, our analysis requires the chosen sampling distribution to be sub-Gaussian. To choose the sampling distribution, one can follow the framework adopted in (15) and (16) and use an 'appropriate' reward distribution such that (i) the KL divergence between the true reward distribution and the chosen reward distribution is small, so as to minimize the posterior mismatch, and (ii) the resulting sampling distribution is easy to sample from and has sub-Gaussian tails. Thus, analyzing the posterior mismatch for non-Gaussian bandits requires a case-by-case treatment. For Gaussian bandits, we control the above KL divergence by choosing a Gaussian reward distribution with mean as in (16). Finally, in Section 5, we extend Algorithm 1 to logistic bandits, with the choice of sampling distribution motivated by (15) and (16), and use Langevin Monte Carlo to sample from this distribution.
4. TS for CB with Delayed True Contexts
In this section, we consider the CBs with delayed true contexts setting, in which the agent observes the true context only after it observes the reward corresponding to the chosen action. Note that, at the time of choosing the action, the agent has access only to noisy contexts. We specialize our TS algorithm to this setting and call it Algorithm 2.
Algorithm 2: TS for delayed true contexts
- 1: for t = 1, …, T do
- 2: The environment selects a true context.
- 3: Agent observes a noisy context.
- 4: Agent evaluates the predictive posterior distribution as in (28).
- 5: Agent samples a reward parameter from the posterior with known contexts.
- 6: Agent chooses an action as in (33).
- 7: Agent observes a reward (as in (1)) and then the true context.
- 8: end for
Algorithm 2 follows similar steps to Algorithm 1. However, different from Algorithm 1, at the tth iteration the agent knows the history of true contexts in addition to that of the noisy contexts. Consequently, in the de-noising step, the agent evaluates the predictive posterior distribution as in (28), where the conditional distribution of the true context given the noisy context and the channel parameter is as defined in (2), and the posterior distribution of the channel parameter is obtained via Bayes' rule using the history of true and noisy contexts.
For the Gaussian context noise considered in Section 3.1, the predictive posterior distribution is a multivariate Gaussian whose inverse covariance matrix and mean vector can be computed in closed form; the derivation can be found in Appendix B.2.4.
Following the de-noising step, the next step in Algorithm 2 is a conventional Thompson sampling step, thanks to the access to delayed true contexts. Consequently, the agent can evaluate the posterior distribution of the reward parameter with known contexts, as in (9), and use it to draw a sample of the reward parameter. For a Gaussian bandit with a Gaussian prior on the reward parameter, this posterior distribution is a multivariate Gaussian whose inverse covariance matrix and mean can be evaluated in closed form.
Using the sampled reward parameter and the obtained predictive posterior distribution, the agent then chooses its action as in (33), where we use the expected feature map computed with respect to the predictive posterior distribution.
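With delayed true contexts, both steps of Algorithm 2 simplify: the channel-parameter posterior can be formed directly from the observed (true, noisy) context pairs, and the reward-parameter posterior is the standard conjugate update with exact features, as in (9). The sketch below uses the same additive-Gaussian-channel and isotropic-prior assumptions as the earlier snippets and is illustrative rather than a transcription of the closed-form expressions above.

```python
import numpy as np

def channel_posterior_with_true_contexts(cs, c_hats, Sigma_n, Sigma_gamma):
    """Posterior over the channel mean gamma from (true, noisy) context pairs.

    Under the assumed additive channel, c_hat_s - c_s ~ N(gamma, Sigma_n), so this
    is a standard Gaussian mean-estimation update.
    """
    diffs = np.asarray(c_hats) - np.asarray(cs)
    n = diffs.shape[0]
    Sn_inv = np.linalg.inv(Sigma_n)
    precision = np.linalg.inv(Sigma_gamma) + n * Sn_inv
    cov = np.linalg.inv(precision)
    mean = cov @ (Sn_inv @ diffs.sum(axis=0))
    return mean, cov

def exact_reward_posterior(feats, rewards, sigma0_sq, sigma_sq):
    """Conjugate Gaussian posterior over the reward parameter with known contexts (cf. (9))."""
    Phi, r = np.asarray(feats), np.asarray(rewards)
    precision = np.eye(Phi.shape[1]) / sigma0_sq + Phi.T @ Phi / sigma_sq
    cov = np.linalg.inv(precision)
    mean = cov @ (Phi.T @ r) / sigma_sq
    return mean, cov
```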
5. Experiments and Final Remarks
In this section, we experimentally validate the performance of the proposed algorithms on synthetic and real-world datasets. Implementation details can be found in Appendix D.
Synthetic Datasets: For synthetic datasets, we go beyond Gaussian bandits and also evaluate our algorithms on logistic contextual bandits (see Figure 1 (Left) and (Center)). In both settings, we consider Gaussian contexts and context noise as in Section 3.1, with the distribution parameters fixed as detailed in Appendix D. We further consider the action and the context to be finite-dimensional vectors and construct the m-dimensional feature vector from their components.
Gaussian Bandits: The mean reward function is linear in the feature map described above. The remaining parameters are fixed as detailed in Appendix D. Plots are averaged over 100 independent trials.
Logistic Bandits: The reward is Bernoulli, with a mean reward given by the sigmoid of the linear function of the feature map. We consider a Gaussian prior over the reward parameter. In Algorithm 1, we choose the sampling distribution analogously to (15) and (16). However, this posterior is analytically intractable, since the Bernoulli reward and the Gaussian prior form a non-conjugate pair. Consequently, we use Langevin Monte Carlo (LMC) [20] to sample from it. We run LMC with a fixed number of iterations, learning rate and inverse temperature (see Appendix D). Plots are averaged over 10 independent trials.
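A minimal version of the LMC sampler used for the logistic-bandit experiments is sketched below. It runs unadjusted Langevin dynamics on the log of a Bernoulli-likelihood/Gaussian-prior posterior built from expected feature maps; the hyperparameter values shown are placeholders, not the ones used to produce the reported plots.

```python
import numpy as np

def lmc_sample(expected_feats, rewards, prior_var, n_iters=100, lr=0.01, beta=1.0, seed=0):
    """Unadjusted Langevin dynamics targeting an (approximate) logistic-bandit posterior.

    expected_feats -- array (t, m) of expected feature maps of the chosen actions
    rewards        -- array (t,) of Bernoulli rewards in {0, 1}
    prior_var      -- variance of the zero-mean isotropic Gaussian prior on theta
    The step size, iteration count and inverse temperature beta are illustrative.
    """
    rng = np.random.default_rng(seed)
    Phi = np.asarray(expected_feats, dtype=float)
    r = np.asarray(rewards, dtype=float)
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        logits = Phi @ theta
        probs = 1.0 / (1.0 + np.exp(-logits))           # sigmoid of the linear score
        grad_loglik = Phi.T @ (r - probs)               # Bernoulli log-likelihood gradient
        grad_logprior = -theta / prior_var              # Gaussian prior gradient
        grad = grad_loglik + grad_logprior
        noise = rng.normal(size=theta.shape)
        theta = theta + lr * grad + np.sqrt(2.0 * lr / beta) * noise
    return theta
```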
MovieLens Dataset: We use the MovieLens-100K dataset [21] to evaluate the performance of the algorithms. To utilize this dataset, we first perform non-negative matrix factorization on the rating matrix with 3 latent factors to obtain a user matrix W and a movie matrix H. Each row vector of W corresponds to a user context, while each column vector of H corresponds to movie (action) features. The mean and variance of the Gaussian context distribution are estimated from the row vectors of W. We then add Gaussian noise to the contexts as in the synthetic settings.
We apply the K-means algorithm to the column vectors of H to group the actions into clusters. The centroid and the variance of the kth cluster are then used to fix the mean and variance of the Gaussian prior over the reward parameter, with the covariance taken proportional to the identity matrix. The feature vector is then fixed as a 60-dimensional vector with a nonzero block at the index of the cluster k to which action a belongs and zeros everywhere else. We further add zero-mean Gaussian noise to the mean reward. The Bayesian oracle in this experiment has access to the exact context noise parameter sampled from its Gaussian prior, as well as the true reward parameter sampled from its Gaussian prior.
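The MovieLens preprocessing described above can be reproduced along the following lines with scikit-learn. The rating matrix is replaced by a random non-negative stand-in so that the snippet is self-contained, the number of clusters is a placeholder, and the block construction of the feature vector is one plausible reading of the description above.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for the MovieLens-100K user-by-movie rating matrix (943 x 1682 in the real data).
R = rng.integers(0, 6, size=(200, 300)).astype(float)

# Non-negative matrix factorization with 3 latent factors: R ~ W @ H.
nmf = NMF(n_components=3, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(R)          # rows of W: user contexts
H = nmf.components_               # columns of H: movie (action) features

# Context distribution estimated from user vectors.
mu_c = W.mean(axis=0)
Sigma_c = np.cov(W, rowvar=False)

# Cluster the action features; the number of clusters here is illustrative.
n_clusters = 20
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(H.T)
centroids = kmeans.cluster_centers_       # can be used to set the prior mean over the reward parameter
cluster_of_action = kmeans.labels_        # cluster index of each movie (action)

def feature(a, user_context):
    """Block feature vector: the user context placed at the block of action a's cluster."""
    phi = np.zeros(n_clusters * len(user_context))
    k = cluster_of_action[a]
    phi[k * len(user_context):(k + 1) * len(user_context)] = user_context
    return phi
```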
Baselines: We compare our algorithms with two baselines. In the first baseline, the agent observes only noisy contexts but is unaware of the presence of noise; consequently, it naively implements conventional TS with the noisy contexts. This sets the benchmark for the worst-case achievable regret. The second baseline assumes that the agent knows the true channel parameter, a setting studied in [18], and can thus perform exact de-noising via the exact predictive posterior distribution. This algorithm sets the benchmark for the best achievable regret.
Figure 1 (Left) corroborates our theoretical findings for Gaussian bandits. In particular, our algorithms (Algorithms 1 and 2) exhibit sub-linear regret and achieve robust performance comparable to the best achievable performance of the exact de-noising baseline. We remark that, while our regret analysis of Gaussian bandits is motivated by the tractability of the posterior distributions and the concentration properties of Gaussians, our empirical results for logistic bandits in Figure 1 (Center) show a promising extension of our algorithms to non-conjugate distributions. Extending the Bayesian regret analysis to such general distributions is left for future work. Further, our experiments on the MovieLens data in Figure 1 (Right) validate the effectiveness of our algorithm in comparison to the benchmarks. The plot shows that our approach outperforms the noise-unaware baseline and achieves regret comparable to that of the exact de-noising baseline, which attains the best achievable regret.