72  Contextual Bandits

Imagine you run a news website. Every time a reader arrives, you get to show them one headline out of several, and you find out only whether they clicked the headline you chose. You never learn what would have happened if you had shown a different one. Over many readers, you want to show headlines that get the most clicks, but you also have to keep trying less-tested headlines now and then, because the one you have ignored might turn out to be the best. This is the contextual bandit problem in a nutshell: repeated decisions, feedback only on the choice you made, and a context (who the reader is) that should shape the choice.

This chapter introduces that problem and the vocabulary around it. By the end you will understand what makes contextual bandits different from ordinary supervised learning, why the tension between exploration and exploitation is unavoidable, and how to read the formal setup and its regret objective. We treat this as a bridge topic: it sits between classical prediction (where you see the right answer for every example) and full reinforcement learning (where actions change future states; see Chapter 64). Contextual bandits are sometimes called “one-step reinforcement learning” for exactly that reason.

A useful external walkthrough, with runnable examples in Vowpal Wabbit, is the tutorial at https://vowpalwabbit.org/tutorials/contextual_bandits.html.

Intuition

A bandit problem is named after a slot machine (“one-armed bandit”). You face several arms (actions), each pays out with some unknown probability, and you want to find the good arms while spending as few pulls as possible on the bad ones. The “contextual” version adds a clue before each pull that tells you something about which arm is likely to pay off this time.

72.1 A topic with many names

The same idea shows up across machine learning, statistics, and operations research, often under different names. If you read widely, you will meet it called multi-world testing, associative bandits, learning with partial feedback (or, equivalently, learning with bandit feedback), bandits with side information, multi-class classification with bandit feedback, associative reinforcement learning, and one-step reinforcement learning.

Note

The phrase “bandit feedback” is the key technical term among these. It means you observe the reward only for the action you took, not for the alternatives. Contrast this with “full feedback,” where you would learn the reward of every possible action. Bandit feedback is what makes the problem hard.

72.2 Where it is used

Contextual bandits are a natural fit whenever you make many small decisions, get quick feedback, and want to improve as you go. Three common settings illustrate the range:

  • A/B testing. Classic A/B testing splits traffic evenly and waits for a verdict. A bandit can shift traffic toward the better variant while the test is still running, so fewer users see the worse option.
  • Recommendation feeds. Systems like the TikTok or YouTube recommendation engine repeatedly choose what to show next and learn from whether you watch, like, or skip (see the recommender systems chapter, Chapter 89).
  • Clinical trials. Adaptive trial designs can route more patients toward treatments that appear to be working, using patient characteristics as context, though this raises ethical and statistical subtleties that ordinary bandits do not address.1
When to use this

Reach for a contextual bandit when (1) you choose one action per round from a fixed menu, (2) you observe a reward only for the chosen action, and (3) you have features describing the situation that plausibly predict which action is best. If actions also change the future state of the world, you have crossed into full reinforcement learning, not a bandit.

72.3 The core idea

At heart, a contextual bandit is an algorithm that tries out different actions and figures out, on its own, which action is best for a given situation. It learns a mapping from context to action by acting, observing rewards, and adjusting.

You do not have to build this from scratch. Managed services exist: for example, Google Cloud AutoML Tables2 packages much of the modeling work behind a hosted interface (Dutta et al. 2019). Such platforms typically bundle several steps that you would otherwise hand-tune yourself:

  • Automated feature engineering
  • Architecture search
  • Hyper-parameter tuning (for example, grid search)
  • Model selection
  • Model tuning and ensembling
Tip

For getting a contextual-bandit workflow running quickly without managing infrastructure, a hosted offering such as Google Cloud AutoML Tables is often a smoother starting point than running Microsoft’s Vowpal Wabbit yourself. Vowpal Wabbit remains excellent for learning the mechanics and for fine control, which is why the tutorial linked above is worth working through.

72.4 Exploration versus exploitation

The defining tension of any bandit algorithm is the trade-off between exploration and exploitation, and contextual bandits are designed to balance the two deliberately.

  • Exploration means trying various possibilities to learn what reward they may bring, even actions you currently believe are mediocre.
  • Exploitation means using the knowledge you already have to pick the action that maximizes your immediate return.

Neither extreme works alone. Pure exploitation locks in whatever looked best early, before you had enough data to trust it. Pure exploration wastes feedback on actions you have already learned are poor. A good policy leans toward exploration early, when it knows little, and shifts toward exploitation as evidence accumulates.

Key idea

Every round of exploration has a cost (you may take a worse action now) but a benefit (better decisions later). The art of bandit algorithms is spending exploration where it buys the most future reward.

72.5 From multi-armed to contextual bandits

It helps to place contextual bandits next to their simpler cousin. Two algorithm families are worth distinguishing:

  • Multi-armed bandit (MAB). Every round is treated in isolation, with no side information (see Chapter 65). The learner just estimates which arm is best on average. This is independent learning: the same action is favored regardless of who is in front of you.
  • Contextual bandit. An extension of the MAB, and equally a simplified version of reinforcement learning, this approach takes the context into account when choosing an arm. Because the choice adapts to the situation, the learner accumulates reward in a more holistic way, fitting a policy rather than a single best arm.
Note

A plain multi-armed bandit ignores context entirely. The contextual bandit’s whole advantage is that it can give different answers for different inputs, which is exactly what you want when the best headline depends on the reader.

72.6 The formal setting

We now make the picture precise. The following formulation draws on the lecture materials by Alina Beygelzimer3.

The interaction unfolds over rounds. For \(t = 1, \dots, T\):

  1. The environment produces a context \(x_t \in X\).
  2. The learner chooses an action \(a_t \in \{1, \dots, K, \}\).
  3. The environment reveals a reward \(r_t(a_t) \in [0,1]\).

The goal is to learn a good policy for selecting actions given context, where a policy \(\pi\) is just a rule that maps a context to an action.4

We measure success by regret: how much total reward we gave up compared to the best fixed policy in our policy class \(\Pi\), judged in hindsight.

\[ Regret = \underset{\pi \in \Pi}{\operatorname{max}} \sum_{t=1}^T r_t(\pi(x_t)) - \sum_{t=1}^T r_t (a_t) \tag{72.1}\]

The first term is the reward the single best policy would have earned over all \(T\) rounds; the second term is the reward we actually earned. Smaller regret means we came closer to acting optimally, so the entire learning problem can be stated as: keep regret small.

Intuition

Regret is “what we missed.” We cannot compute it while the game is running, because we do not see the rewards of actions we skipped. It is an after-the-fact yardstick used to analyze and compare algorithms, not a quantity you optimize directly in real time.

72.7 The statistical model behind the rewards

To do anything rigorous we must say how rewards are generated. The standard contextual bandit model fixes a context distribution and a conditional reward distribution and assumes the two are stationary across rounds.

Let contexts arrive i.i.d. from a fixed distribution \(\mathcal{D}_X\) on \(X\), and let the reward for taking action \(a\) in context \(x\) be drawn from a fixed conditional distribution with mean

\[ \mu(x,a) \;=\; \mathbb{E}\!\left[\, r_t(a) \mid x_t = x \,\right], \tag{72.2}\]

independently across rounds given \((x_t, a_t)\). The function \(\mu\) is the unknown object we want to learn. The optimal policy is the one that, in each context, plays the arm with the largest mean reward,

\[ \pi^\star(x) \;=\; \operatorname*{arg\,max}_{a \in \{1,\dots,K\}} \, \mu(x,a), \tag{72.3}\]

and the per-round gap of an arm is \(\Delta(x,a) = \mu(x,\pi^\star(x)) - \mu(x,a) \ge 0\). With this notation the expected regret against the optimal policy is \(\mathbb{E}[\mathrm{Regret}] = \sum_{t=1}^T \mathbb{E}[\Delta(x_t, a_t)]\), the cumulative mean gap of the arms we actually pulled. The regret in the formal setting above compares to the best policy in a class \(\Pi\); when \(\pi^\star \in \Pi\) the two notions coincide, and otherwise they differ by the approximation error \(\inf_{\pi \in \Pi} \sum_t \Delta(x_t, \pi(x_t))\), the usual bias of a restricted hypothesis class.

Note

Two modeling regimes are studied. In the stochastic regime, rewards follow the fixed law Equation 72.2, and one aims for instance-dependent guarantees that exploit the gaps \(\Delta(x,a)\). In the adversarial regime, contexts and reward vectors are chosen by an arbitrary (possibly malicious) opponent, no distribution exists, and one settles for worst-case guarantees against the best policy in \(\Pi\). EXP4, discussed below, is the canonical adversarial algorithm; LinUCB and Thompson sampling are stochastic.

72.7.1 A linear realizability assumption

Most practical algorithms assume the mean reward is a parametric function of features. The linear (disjoint) model attaches to each context-action pair a feature vector \(\phi(x,a) \in \mathbb{R}^d\) and posits an unknown coefficient vector \(\theta^\star\) with

\[ \mu(x,a) \;=\; \phi(x,a)^\top \theta^\star , \qquad r_t(a_t) \;=\; \phi(x_t,a_t)^\top \theta^\star + \eta_t , \tag{72.4}\]

where \(\eta_t\) is conditionally zero-mean, \(\sigma\)-sub-Gaussian noise. A common special case gives every arm its own parameter \(\theta_a^\star\) and sets \(\phi(x,a)\) to the raw context \(x\) for arm \(a\) and zero elsewhere, so \(\mu(x,a) = x^\top \theta_a^\star\). Realizability (the truth lies in the assumed class) is what lets the confidence-bound and posterior machinery below produce honest uncertainty estimates; when it fails the algorithms remain runnable but their regret guarantees no longer hold.

72.8 Off-policy reward estimation: making bandit feedback usable

The central obstacle, bandit feedback, is also where the key estimator lives. Suppose actions were taken by a known logging policy that selected \(a_t\) with probability \(p_t = \pi_0(a_t \mid x_t) > 0\). We want to estimate the value \(V(\pi) = \mathbb{E}_{x \sim \mathcal{D}_X}[\, \mu(x, \pi(x)) \,]\) of some other target policy \(\pi\) from logged tuples \((x_t, a_t, r_t, p_t)\). The trick is importance weighting.

72.8.1 The inverse propensity score (IPS) estimator

Define the per-round IPS reward \[ \hat r_t(a) \;=\; \frac{r_t \,\mathbf{1}\{a_t = a\}}{p_t}. \tag{72.5}\] Conditioning on \(x_t\) and taking expectation over the logged action \(a_t \sim \pi_0(\cdot \mid x_t)\), \[ \mathbb{E}_{a_t \sim \pi_0}\!\left[\hat r_t(a) \mid x_t\right] = \sum_{a'} \pi_0(a' \mid x_t)\, \frac{\mathbb{E}[r_t \mid x_t, a']\,\mathbf{1}\{a' = a\}}{\pi_0(a' \mid x_t)} = \mu(x_t, a), \] because every factor of \(\pi_0(a \mid x_t)\) cancels and only the \(a'=a\) term survives. So \(\hat r_t(a)\) is an unbiased estimate of the full-feedback reward vector entry, recovered from a single observed action. The IPS estimate of a target policy’s value is then \[ \hat V_{\mathrm{IPS}}(\pi) \;=\; \frac{1}{T}\sum_{t=1}^{T} \frac{r_t \,\mathbf{1}\{\pi(x_t) = a_t\}}{p_t}, \qquad \mathbb{E}\big[\hat V_{\mathrm{IPS}}(\pi)\big] = V(\pi). \tag{72.6}\] This is exactly the reweighting the earlier warning callout referred to: it removes the bias that comes from a logging policy that explored some arms more than others.

The price of unbiasedness is variance. Since \(r_t \in [0,1]\), the second moment of a single term is \[ \mathbb{E}\!\left[\Big(\tfrac{r_t \mathbf{1}\{\pi(x_t)=a_t\}}{p_t}\Big)^{2}\right] \le \mathbb{E}\!\left[\frac{\mathbf{1}\{\pi(x_t)=a_t\}}{p_t^{2}}\right] = \mathbb{E}\!\left[\frac{1}{p_t^{2}}\,\Big|\,\pi(x_t)=a_t\right]\Pr(\pi(x_t)=a_t), \] which blows up whenever the target policy puts mass on actions the logger rarely took (\(p_t\) small). This is the variance-vs-bias tension of importance sampling: small logging probabilities give large weights and unstable estimates. Two standard fixes are weight clipping (cap \(1/p_t\) at a threshold \(M\), trading a controlled bias for bounded variance) and self-normalized IPS (divide by \(\sum_t \mathbf{1}\{\pi(x_t)=a_t\}/p_t\) instead of \(T\), which is consistent and far more stable).

72.8.2 Doubly robust estimation

We can cut variance further without sacrificing unbiasedness by combining IPS with a regression model \(\hat\mu(x,a)\) of the reward. The doubly robust (DR) estimator is \[ \hat V_{\mathrm{DR}}(\pi) = \frac{1}{T}\sum_{t=1}^{T}\left[\,\hat\mu(x_t, \pi(x_t)) + \frac{\big(r_t - \hat\mu(x_t, a_t)\big)\,\mathbf{1}\{\pi(x_t)=a_t\}}{p_t}\,\right]. \tag{72.7}\] The first term is the model’s plug-in guess; the second is an IPS correction applied to the residual rather than the raw reward. The name “doubly robust” comes from the following fact: the estimator is unbiased if either the propensities \(p_t\) are correct or the reward model \(\hat\mu\) is correct. To see the correction’s role, take the expectation of the correction term over \(a_t \sim \pi_0\), as in the IPS derivation, \[ \mathbb{E}_{a_t \sim \pi_0}\!\left[\frac{(r_t - \hat\mu(x_t,a_t))\mathbf{1}\{\pi(x_t)=a_t\}}{p_t}\,\Big|\,x_t\right] = \mu(x_t, \pi(x_t)) - \hat\mu(x_t, \pi(x_t)), \] which exactly cancels the bias \(\hat\mu(x_t,\pi(x_t)) - \mu(x_t,\pi(x_t))\) left by the first term. When \(\hat\mu\) is accurate the residual \(r_t - \hat\mu\) is small, so the high-variance importance weight multiplies a small number and the estimator’s variance shrinks toward that of the plug-in model. DR is the workhorse for offline policy evaluation and for the cost-sensitive reductions used in offline policy learning (see Chapter 69).

72.9 Algorithms I: \(\varepsilon\)-greedy and the explore-then-commit baseline

The simplest contextual strategy fits a reward model \(\hat\mu(x,a)\) (any regressor) to the IPS-corrected or directly observed data, then acts greedily with a probability of random exploration: \[ a_t = \begin{cases} \text{uniform over } \{1,\dots,K\} & \text{with probability } \varepsilon_t,\\[2pt] \operatorname*{arg\,max}_a \hat\mu(x_t,a) & \text{with probability } 1-\varepsilon_t. \end{cases} \] With a constant \(\varepsilon\) the policy keeps paying an exploration tax of order \(\varepsilon\) per round, giving linear regret \(\Theta(\varepsilon T)\) from exploration plus a term that shrinks as the model improves. The standard remedy is a decaying schedule \(\varepsilon_t \propto t^{-1/3}\), which balances the two sources of error and yields regret of order \(T^{2/3}\) (worse than the \(\sqrt{T}\) of the optimism and posterior methods below, but the scheme is trivial to implement and pairs with any supervised learner). The crucial detail is that the uniform exploration guarantees a known, bounded-away-from-zero propensity \(p_t \ge \varepsilon_t / K\) on every arm, which is precisely what keeps the IPS weights in Equation 72.5 finite and the off-policy estimates valid.

Warning

\(\varepsilon\)-greedy explores uniformly regardless of context, so it wastes pulls re-confirming arms it is already sure are bad. The optimism and posterior-sampling methods below explore in a targeted way (only where uncertainty is high), which is why they achieve the better \(\sqrt{T}\) rate.

72.10 Algorithms II: LinUCB and optimism in the face of uncertainty

LinUCB instantiates the linear model Equation 72.4 and the principle “act as if the world were as good as plausibly possible.” At round \(t\) it maintains a ridge-regression estimate of \(\theta^\star\) and an upper confidence bound on each arm’s mean reward.

72.10.1 Derivation of the estimator and the confidence width

Collect the chosen features and rewards through round \(t-1\) and form the regularized least-squares (ridge) objective \[ \hat\theta_t = \operatorname*{arg\,min}_{\theta}\ \sum_{s=1}^{t-1}\big(\phi_s^\top\theta - r_s\big)^2 + \lambda \|\theta\|_2^2 , \qquad \phi_s := \phi(x_s, a_s). \] Setting the gradient \(2\sum_s \phi_s(\phi_s^\top\theta - r_s) + 2\lambda\theta\) to zero gives the normal equations and the closed form \[ A_t = \lambda I + \sum_{s=1}^{t-1}\phi_s \phi_s^\top, \qquad b_t = \sum_{s=1}^{t-1}\phi_s r_s, \qquad \hat\theta_t = A_t^{-1} b_t . \tag{72.8}\] The matrix \(A_t\) is the regularized Gram matrix; its inverse is the (scaled) covariance of the estimate. The predicted mean for a candidate arm with features \(\phi(x_t,a)\) is \(\phi(x_t,a)^\top \hat\theta_t\), and the key quantity is the predictive standard deviation. Under the sub-Gaussian noise model, self-normalized martingale concentration (the Abbasi-Yadkori, Pál and Szepesvári bound) shows that with probability at least \(1-\delta\), simultaneously for all \(t\), \[ \big|\phi^\top(\hat\theta_t - \theta^\star)\big| \;\le\; \alpha_t \,\sqrt{\phi^\top A_t^{-1}\phi}, \qquad \alpha_t = \sigma\sqrt{d\log\!\frac{1+tL^2/\lambda}{\delta}} + \sqrt{\lambda}\,\|\theta^\star\|_2 , \tag{72.9}\] where \(L\) bounds \(\|\phi\|_2\). The term \(\sqrt{\phi^\top A_t^{-1}\phi}\) is the elliptical width: it is large in feature directions that have been explored little (small eigenvalues of \(A_t\)) and shrinks as those directions accumulate data. LinUCB then plays the optimistic arm \[ a_t = \operatorname*{arg\,max}_{a}\ \underbrace{\phi(x_t,a)^\top\hat\theta_t}_{\text{exploit}} \;+\; \underbrace{\alpha_t\sqrt{\phi(x_t,a)^\top A_t^{-1}\phi(x_t,a)}}_{\text{explore}} . \tag{72.10}\] The two terms make the exploration-exploitation trade-off explicit and automatic: an arm is chosen either because its estimated reward is high or because we are uncertain about it. Each update is a rank-one modification of \(A_t\), so \(A_{t}^{-1}\) can be maintained incrementally with the Sherman-Morrison formula at cost \(O(d^2)\) per round rather than \(O(d^3)\) for a fresh inverse.

LinUCB enjoys a regret bound of order \(\tilde O(d\sqrt{T})\), where \(d\) is the feature dimension. The \(\sqrt{T}\) dependence is the hallmark of efficient exploration (per-round regret vanishes as \(1/\sqrt{t}\)), and the linear-in-\(d\) factor is the cost of learning a \(d\)-dimensional parameter. The proof bounds the sum of squared elliptical widths \(\sum_t \phi_t^\top A_t^{-1}\phi_t\) by \(O(d\log T)\) via the elliptical potential (matrix determinant) lemma, then converts width into regret through Equation 72.9.

Failure modes of LinUCB

The guarantee rests on the linear realizability of Equation 72.4. If the true \(\mu\) is nonlinear in \(\phi\), the confidence sets in Equation 72.9 no longer cover \(\theta^\star\), optimism can systematically under-explore the truly best arm, and regret can become linear. Practical safeguards are richer features (kernels, neural embeddings, as in NeuralUCB), periodic model checking, and a small floor of uniform exploration. The confidence multiplier \(\alpha_t\) from theory is usually conservative; it is standard to treat \(\alpha\) as a tunable knob.

72.11 Algorithms III: Thompson sampling (posterior sampling)

Thompson sampling explores by randomizing over its own uncertainty: it samples a plausible world from the posterior and acts optimally in that world. With the linear model and a Gaussian prior, the posterior is conjugate and the updates reuse the ridge quantities.

72.11.1 Derivation of the posterior update

Place a prior \(\theta \sim \mathcal{N}(0, \lambda^{-1} I)\) and assume Gaussian noise \(\eta_t \sim \mathcal{N}(0, \sigma^2)\). The log posterior after \(t-1\) rounds is, up to a constant, \[ \log p(\theta \mid \mathcal{H}_{t-1}) = -\frac{1}{2\sigma^2}\sum_{s=1}^{t-1}\big(r_s - \phi_s^\top\theta\big)^2 - \frac{\lambda}{2}\|\theta\|_2^2 + \text{const}, \] a quadratic in \(\theta\), so the posterior is Gaussian \(\mathcal{N}(\hat\theta_t,\, \sigma^2 A_t^{-1})\) with the same \(A_t\) and \(\hat\theta_t = A_t^{-1}b_t\) as in Equation 72.8 (completing the square in the exponent recovers exactly the ridge mean and the Gram-matrix covariance). Each round LinTS draws \(\tilde\theta_t \sim \mathcal{N}(\hat\theta_t, v^2 A_t^{-1})\) and plays \[ a_t = \operatorname*{arg\,max}_a\ \phi(x_t,a)^\top \tilde\theta_t , \tag{72.11}\] where \(v\) is a posterior-inflation factor (the analogue of \(\alpha_t\)) controlling exploration. The marginal of \(\phi^\top\tilde\theta_t\) is \(\mathcal{N}\!\big(\phi^\top\hat\theta_t,\ v^2\,\phi^\top A_t^{-1}\phi\big)\), so the random bonus has standard deviation \(v\sqrt{\phi^\top A_t^{-1}\phi}\), the same elliptical width that drives LinUCB. The two algorithms thus explore along identical geometry; LinUCB uses a fixed optimistic offset while LinTS uses a random one, and LinTS achieves the same \(\tilde O(d\sqrt{T})\) regret order. For Bernoulli rewards with no context the conjugate analogue is a Beta-Bernoulli posterior per arm, the form most often seen in introductions.

Tip

Thompson sampling is often preferred in practice. It needs no explicit confidence-width formula, it handles delayed and batched feedback gracefully (sampling from a slightly stale posterior is still valid exploration), and its randomization naturally produces the stochastic propensities \(p_t = \Pr(a_t = a \mid x_t)\) that off-policy estimation and logging require. Those propensities can be computed in closed form for two arms and by Monte Carlo otherwise.

72.12 Algorithms IV: policy-class methods and EXP4

The methods above model rewards. An alternative is to compete directly with a fixed policy class \(\Pi\), exactly the object in the regret definition Equation 72.1. EXP4 (Exponential weights for Exploration and Exploitation with Experts) treats each policy \(\pi \in \Pi\) as an expert giving an action recommendation, maintains weights \(w_t(\pi)\), and forms an action distribution by mixing the experts’ advice. After observing \(r_t\) it builds the IPS reward estimate Equation 72.5 for every action and updates weights multiplicatively, \[ w_{t+1}(\pi) \;\propto\; w_t(\pi)\,\exp\!\big(\gamma\,\hat r_t(\pi(x_t))\big), \tag{72.12}\] the same exponential-weights rule used in online learning, now fed importance-weighted rewards because of bandit feedback. EXP4 attains regret \(O(\sqrt{TK\log|\Pi|})\) against the best policy in \(\Pi\) and, crucially, makes no stochastic assumption: it holds even against adversarial contexts and rewards. The \(\log|\Pi|\) dependence means it competes with exponentially large policy classes at only logarithmic cost, but maintaining a weight per policy is intractable for rich \(\Pi\). This motivates oracle-based reductions (for example the “Policy Elimination” and ILOVETOCONBANDITS algorithms) that replace explicit enumeration with calls to a cost-sensitive classification solver, achieving the optimal \(\tilde O(\sqrt{TK\log|\Pi|})\) regret while only ever solving supervised learning subproblems. This reduction is the theoretical backbone of systems like Vowpal Wabbit.

72.13 A numerical check: IPS is unbiased and DR has lower variance

The following self-contained simulation confirms the two central claims of the off-policy section: the IPS estimator Equation 72.6 is unbiased for a target policy’s value, and the doubly robust estimator Equation 72.7 reduces variance when a reasonable reward model is supplied. It uses only base R.

Show code
set.seed(1)
K <- 3                                   # arms
n <- 4000                                # logged rounds
reps <- 400                              # Monte Carlo repetitions
true_mu <- c(0.2, 0.5, 0.8)             # context-free mean rewards (truth)
logging <- c(0.6, 0.3, 0.1)            # logging policy (over-samples arm 1)
target <- 3                             # deterministic target policy: always arm 3
V_true <- true_mu[target]               # true value of the target policy

mu_hat <- c(0.25, 0.45, 0.85)          # an imperfect but decent reward model

est <- replicate(reps, {
  a <- sample.int(K, n, replace = TRUE, prob = logging)   # logged actions
  r <- rbinom(n, 1, true_mu[a])                            # observed rewards
  p <- logging[a]                                          # propensities
  match <- as.numeric(a == target)
  ips <- mean(r * match / p)
  dr  <- mean(mu_hat[target] + (r - mu_hat[a]) * match / p)
  c(ips = ips, dr = dr)
})

c(true_value = V_true,
  ips_mean = mean(est["ips", ]), dr_mean = mean(est["dr", ]),
  ips_sd = sd(est["ips", ]),    dr_sd = sd(est["dr", ]))
#> true_value   ips_mean    dr_mean     ips_sd      dr_sd 
#> 0.80000000 0.79669375 0.79871250 0.04218526 0.02013849

Both estimators center on the true value \(V(\pi) = 0.8\) (unbiasedness), while the doubly robust estimate has the smaller standard deviation, illustrating the variance reduction derived above.

72.14 How this differs from ordinary supervised learning

Contextual bandits resemble supervised learning, but with one decisive twist that shapes everything. In standard supervised learning, each training example comes with the correct label, so you can compute the loss for any prediction. Here you face the same kind of input-to-output mapping, yet two features set the problem apart:

  • The reward for actions you did not take is unknown. You see only \(r_t(a_t)\), never \(r_t(a)\) for the other actions \(a\). The loss for the alternatives is hidden.
  • Because of that hidden loss, you must explore to succeed, which means deliberately taking some suboptimal actions in order to gather the information you need.
Warning

Do not treat a contextual bandit as a plain classification task by simply training on the actions you happened to take. Doing so ignores the bias from your own past choices: if you rarely tried an action, you have almost no data about it, and a naive classifier will quietly assume it is bad. Proper bandit methods correct for this with exploration and reweighting.

A final point of clarification: although a contextual bandit builds on the multi-armed bandit and shares the bandit-feedback structure, it is not the same as the multi-armed bandit problem. The presence of context, and the goal of learning a policy that varies with that context, is what separates the two.

72.15 LinUCB and Regret in R: A Worked Example

The off-policy section above evaluated a fixed policy from logged data. The harder, more characteristic problem is learning a good policy online, where the only way to discover a promising arm is to try it. This is where the exploration-exploitation tension becomes concrete, and where regret — the running gap between the best arm in hindsight and the one you played — becomes the scoreboard. We implement LinUCB, which assumes each arm’s reward is linear in the context and acts on an “optimism in the face of uncertainty” rule: pick the arm with the highest upper confidence bound, \(\hat\theta_a^\top x + \alpha\sqrt{x^\top A_a^{-1} x}\), so an arm is tried either because it looks good or because we are still unsure about it.

Show code
set.seed(1)
d <- 5; K <- 4; T <- 3000
Theta <- matrix(rnorm(d * K), d, K)            # true per-arm reward weights (unknown to the agent)

run <- function(policy, alpha = 1, eps = 0.1) {
  A <- replicate(K, diag(d), simplify = FALSE) # ridge "design" matrix per arm
  b <- replicate(K, numeric(d), simplify = FALSE)
  reg <- numeric(T)
  for (t in 1:T) {
    x <- rnorm(d); x <- x / sqrt(sum(x^2))      # this round's context
    truth <- as.numeric(crossprod(Theta, x))    # true expected reward of each arm
    Ai  <- lapply(A, solve)
    est <- vapply(1:K, function(a) sum(Ai[[a]] %*% b[[a]] * x), 0)   # ridge estimate per arm
    a <- switch(policy,
      const_eps = if (runif(1) < eps) sample(K, 1) else which.max(est),
      decay_eps = if (runif(1) < min(1, 5 / t)) sample(K, 1) else which.max(est),
      linucb    = which.max(vapply(1:K, function(a)
                    est[a] + alpha * sqrt(sum(x * (Ai[[a]] %*% x))), 0)))  # estimate + uncertainty bonus
    r <- truth[a] + rnorm(1, 0, 0.1)            # observe reward for the CHOSEN arm only
    A[[a]] <- A[[a]] + tcrossprod(x); b[[a]] <- b[[a]] + r * x          # online update
    reg[t] <- max(truth) - truth[a]             # instantaneous regret
  }
  cumsum(reg)[c(500, 1500, 3000)]               # cumulative regret at three checkpoints
}
set.seed(1)
t(sapply(c(const_eps = "const_eps", decay_eps = "decay_eps", linucb = "linucb"),
         run)) |> `colnames<-`(c("t=500", "t=1500", "t=3000")) |> round(1)
#>           t=500 t=1500 t=3000
#> const_eps  69.3  182.4  340.3
#> decay_eps  23.2   29.8   36.1
#> linucb     10.0   10.7   11.2

Read the rows by their growth, not their level. The \(\varepsilon\)-greedy policy with a fixed exploration rate piles up regret at a constant rate (here roughly 0.11 per round) because it never stops paying the exploration tax — its cumulative regret is linear, the hallmark of a policy that cannot learn to quit exploring. Decaying \(\varepsilon\) fixes this: as \(\varepsilon_t \to 0\) the regret curve flattens (sublinear), the property any good bandit algorithm must have. LinUCB reaches that flat regime fastest and lowest, because its confidence bonus shrinks automatically as \(A_a^{-1}\) accumulates data — it explores exactly as much as its uncertainty warrants and no more, with no exploration schedule to hand-tune. That self-regulating exploration, not a cleverer point estimate, is the whole game in bandits.

72.16 Summary

Contextual bandits handle repeated decisions where you act, observe a reward only for the action you chose, and use side information to choose well. They sit one step beyond the multi-armed bandit (which ignores context) and one step short of full reinforcement learning (where actions change future states). The central challenge is balancing exploration against exploitation, and the standard measuring stick is regret, the gap between the best policy in hindsight and what you actually achieved. With that framing in hand, the Vowpal Wabbit tutorial linked at the top is a good next step for seeing these ideas run on real data.


  1. Adaptive clinical trials are a careful, regulated application. The exploration cost in a medical setting is not abstract: a suboptimal action can harm a patient. Real trials add safeguards well beyond the plain bandit formulation discussed here.↩︎

  2. https://cloud.google.com/automl-tables/↩︎

  3. http://videolectures.net/kdd2010_beygelzimer_langford_lte/↩︎

  4. The action set has \(K\) choices and rewards are scaled to \([0,1]\), a standard convention that keeps the regret bound interpretable. The notation \(\pi(x_t)\) means “the action the policy \(\pi\) would take in context \(x_t\).”↩︎