From A/B to RL (2/3): Multi-Armed Bandits
Learn and adapt online instead of choosing once at the end.
In the first post, we used a fixed randomized experiment to learn posterior uncertainty over two binary actions. After collecting data, we used $P(B > A \mid \mathrm{data})$ as a decision quantity: how likely is B to have the higher expected reward?
Now we keep the same Beta-Bernoulli model, but change the decision problem. Instead of waiting until the end, the agent chooses an action for each interaction, observes one binary reward, updates its belief, and uses that updated belief before the next interaction.
A multi-armed bandit turns that decision quantity into an online action rule. Instead of using the posterior once at the end, we use the current posterior before each interaction. We choose one action, observe only that action's reward, update that action's posterior, and continue.
The name multi-armed bandit comes from the picture of a row of slot machines, sometimes called one-armed bandits. In bandit language, each option is traditionally called an arm. Following reinforcement-learning terminology, we will mostly use action to refer to the same thing.
The statistical model stays familiar. The decision problem does not. We no longer wait until the end of the experiment to act. This is the point where fixed experimentation turns into online allocation.
Learning and acting now happen at the same time. In RL language, this gives us an adaptive policy, actions, and rewards, but still no meaningful evolving state. The external state is effectively the same at every interaction; what changes is the agent's internal belief about the actions.
This is the second post in a 3-part series:
- Part 1: Bayesian A/B Testing
- Part 2: Multi-Armed Bandits (this post)
- Part 3: From Continuous Learning to Delayed Rewards
# Imports and setup
import importlib.metadata
import platform
from dataclasses import dataclass
import matplotlib.pyplot as plt
import numpy as np
import numpy.typing as npt
import seaborn as sns
from matplotlib.axes import Axes
from matplotlib.lines import Line2D
from scipy import stats
sns.set_style("darkgrid")
plt.rcParams["figure.figsize"] = (7, 4)
FloatArray = npt.NDArray[np.float64]
#
The Toy Problem
Suppose the agent can choose one of three actions at each interaction:
- Action A
- Action B
- Action C
After choosing one action, the environment returns one immediate binary reward: $1$ for success and $0$ for no success. The success probability $\theta_a$ is therefore also the expected immediate reward of choosing action $a$, which we write as $\mathbb{E}[R \mid a]$:
$$ \mathbb{E}[R \mid a] = 1 \cdot P(R=1 \mid a) + 0 \cdot P(R=0 \mid a) = \theta_a. $$
Because this is a one-step problem, that expected immediate reward is also the action value. So in this binary bandit, the unknown success probability, expected reward, and action value are all the same quantity.
Compared with the fixed A/B test, the statistical quantity is still familiar. The timing of the decision changes. In a fixed experiment, allocation is planned up front and the final decision comes later. In a bandit, the policy chooses an action at each interaction while learning from the feedback so far. The goal is cumulative reward: the total reward collected over time.
For illustration we will simulate a Bernoulli bandit with three possible actions. Each action has a fixed but hidden success probability. The algorithm never sees those true probabilities directly; it only observes binary rewards from the actions it actually chooses. At each interaction, the policy has to choose an action that balances learning the action values with collecting reward along the way.
# Define the fixed Bernoulli success probabilities for each bandit action.
true_success_probabilities = np.array([0.09, 0.12, 0.15], dtype=float)
action_labels = ["A", "B", "C"]
n_actions = len(true_success_probabilities)
#
From Posterior Beliefs to Online Decisions
A bandit keeps the statistical model from the first post almost unchanged. The full Bayesian setup is in the previous post; here we mainly reuse it. Each action $a$ has an unknown success probability $\theta_a$. Since the reward $R_t$ is binary, either $0$ or $1$, $\theta_a$ is also the expected immediate reward of choosing action $a$.
At interaction $t$, two things happen in order. First, the policy chooses an action; we write that chosen action as $a_t$. We then take that action and observe the resulting reward. If $a_t = a$, then the reward random variable $R_t$ has success probability $\theta_a$:
$$ P(R_t = 1 \mid a_t = a) = \theta_a, \qquad P(R_t = 0 \mid a_t = a) = 1 - \theta_a. $$
In words, conditional on choosing action $a$, the reward $R_t$ is sampled from a Bernoulli distribution with parameter $\theta_a$:
$$ R_t \mid (a_t = a) \sim \mathrm{Bernoulli}(\theta_a). $$
As before, we place one Beta prior on each action's unknown success probability:
$$ \theta_a \sim \mathrm{Beta}(1, 1). $$
If we keep track of the current Beta parameters $(\alpha_a, \beta_a)$ for each action, then after choosing action $a_t$ and observing reward $r_t \in \{0, 1\}$ we update only that action:
$$ \alpha_{a_t} \leftarrow \alpha_{a_t} + r_t, \qquad \beta_{a_t} \leftarrow \beta_{a_t} + (1 - r_t). $$
This is the same Beta-Bernoulli update as the first post, just applied online: the posterior from the previous step becomes the prior for the next. For a standalone explanation of why this two-count update fits sequential binary feedback, see Beta Priors for Self-Reinforcing Binary Decisions.
Online Updates and Cumulative Reward
The model is familiar, but the decision problem is now online. The goal is no longer only to identify the best action at the end; it is to collect as much cumulative reward as possible while learning.
Because each interaction is one-step, the reward for one interaction is just the immediate binary outcome. If we write that reward as $r_t$, then $r_t = 1$ for success and $r_t = 0$ for no success. Across the whole run, the objective is to maximize cumulative reward:
$$ \sum_{t=1}^T r_t. $$
In this binary setup, the action value is the expected immediate reward. That expected reward is $\theta_a$, the action's hidden success probability. This is why sampling a plausible $\theta_a$ can also be read as sampling a plausible action value.
Policy and Exploration
To act online, we need a policy: a rule for choosing the next action. In standard RL notation, a policy is often written as $\pi(a \mid s)$, a distribution over actions $a$ that can be taken from a state $s$. In this bandit setting, the external state is just a trivial repeated state $s_0$. What changes over time is the agent's posterior belief about the actions.
A good bandit policy has to manage the exploration-exploitation dilemma. At one extreme, a greedy policy always chooses the action that currently looks best. That uses the evidence we have, but it can stop trying underexplored actions too early, missing an action that would have turned out better. At the other extreme, a uniformly random policy keeps exploring every action, but it never uses what it has learned to improve cumulative reward.
A common compromise is an $\varepsilon$-greedy policy: most of the time choose the currently best-looking action, and with probability $\varepsilon$ explore by choosing a random action.
Here we use probability matching instead. We already model posterior uncertainty over each action's expected reward, and in the first post we used that uncertainty to compute $P(B > A \mid \mathrm{data})$. That same kind of posterior comparison can now drive exploration and exploitation directly: actions are chosen in proportion to how plausible it is that they are best. With $K$ actions, we ask this question for every action: under the current posterior, how likely is this action to have the highest expected reward?
$$ \pi_t(a) = P\!\left(\theta_a > \theta_j \;\; \text{for every other action } j \mid \mathrm{data}_t\right). $$
In words, $\pi_t(a)$ is the posterior probability that action $a$ has the highest expected reward. Because this is a one-step binary setup, the action with the highest expected reward is also the action with the highest action value.
The probability-matching rule turns these posterior probabilities into a policy distribution: choose each action with probability $\pi_t(a)$ on the next interaction. This is the point Thompson emphasized in 1933: use the current probability that one option is better to guide allocation while more evidence arrives, instead of waiting for one final all-or-nothing decision. Exploration is driven by posterior uncertainty rather than by a fixed $\varepsilon$: an underexplored action can still receive interactions if the posterior leaves a meaningful chance that it is actually best.
Online Learning with Bandit Feedback
This partial-feedback setting is called bandit feedback: after we choose one action and observe its reward, we do not get to see what reward the other actions would have produced. Every interaction is both a chance to learn and a chance to earn reward, but the feedback is partial: we only observe the reward of the action we actually chose.
This is still only the simplest reinforcement-learning-like setting. The agent chooses actions and receives rewards, but there is no evolving external state. Each interaction ends after the immediate reward; what changes over time is the learner's information about the actions.
Thompson Sampling as Posterior Sampling
The policy distribution
$$ \pi_t(a) = P\!\left(\theta_a > \theta_j \;\; \text{for every other action } j \mid \mathrm{data}_t\right) $$
can be difficult to compute exactly. More importantly, for action selection we do not need to estimate the full probability distribution.
In the first post, we estimated $P(B > A \mid \mathrm{data})$ by sampling from the posterior. We drew many plausible expected rewards, $\tilde\theta_A$ and $\tilde\theta_B$, counted how often $\tilde\theta_B$ was higher, and used that empirical frequency as a Monte Carlo estimate.
For action selection, though, we do not need to build that empirical distribution first. Repeated posterior comparisons would estimate $\pi_t(a)$; one posterior comparison is already one draw from that same policy. The policy $\pi_t$ puts mass on action $a$ equal to the posterior probability that action $a$ is best. When we sample one plausible expected reward $\tilde\theta_i$ for every action $i$, the sample falls into exactly one of those "which action is best" regions. For a single interaction, we can sample once, compare the samples, and treat the winning action $A_t = \arg\max_i \tilde\theta_i$ as a draw from $\pi_t$.
Written in probability notation, the selected action has the policy distribution because selecting action $a$ is exactly the event that $a$ has the largest sampled expected reward:
$$ P(A_t = a \mid \mathrm{data}_t) = P\!\left(\tilde\theta_a > \tilde\theta_j \;\; \text{for every other action } j \mid \mathrm{data}_t\right) = \pi_t(a). $$
So repeated sample-and-compare draws would reveal the policy distribution, but a single draw is already one valid sample from it. This is the key idea behind Thompson sampling: use current posterior uncertainty to choose actions while evidence accumulates, instead of waiting for one final decision.
You can view Thompson sampling as a one-sample Monte Carlo implementation of probability matching: instead of estimating the whole policy distribution, it draws one plausible action value per action and acts on that draw.
The Thompson sampling algorithm works as follows:
- sample one plausible expected reward $\tilde{\theta}_i$ from each action's current posterior,
- read each sampled expected reward as a sampled action value,
- choose the action with the largest sampled value,
- observe one binary reward,
- update only the chosen action.
If we repeated steps 1-3 many times without updating, the action frequencies would approach $\pi_t(a)$. Thompson sampling uses only one such draw per interaction, then observes feedback and updates the chosen action before the next draw.
The sampled $\tilde{\theta}_i$ values are sampled action values in this one-step setting: plausible average rewards for each action, not possible next binary outcomes.
The rest of this post makes those ideas concrete: we will watch the posteriors evolve, see the induced policy concentrate on the best action, and track reward and regret along the way.
The figure below freezes one moment in the learning loop and shows a single Thompson sampling interaction. We start with current posterior beliefs, sample one plausible expected reward for each action, choose the action with the largest sampled value, observe one binary reward, and update only the chosen action.
# Visualize one Thompson sampling interaction step by step.
demo_alpha = np.array([8.0, 12.0, 7.0])
demo_beta = np.array([55.0, 68.0, 36.0])
demo_rng = np.random.default_rng(20260602)
demo_action_colors = ["tab:blue", "tab:orange", "tab:green"]
sampled_action_values = demo_rng.beta(demo_alpha, demo_beta)
chosen_action = int(np.argmax(sampled_action_values))
observed_reward = int(demo_rng.binomial(n=1, p=true_success_probabilities[chosen_action]))
updated_alpha = demo_alpha.copy()
updated_beta = demo_beta.copy()
updated_alpha[chosen_action] += observed_reward
updated_beta[chosen_action] += 1 - observed_reward
value_grid = np.linspace(0.001, 0.32, 800)
fig, axes = plt.subplots(1, 3, figsize=(17, 4.8))
# Step 1: sample one plausible expected reward from each posterior.
for action_index, (label, color) in enumerate(zip(action_labels, demo_action_colors, strict=False)):
density = stats.beta(a=demo_alpha[action_index], b=demo_beta[action_index]).pdf(value_grid)
sample = sampled_action_values[action_index]
sample_density = stats.beta(a=demo_alpha[action_index], b=demo_beta[action_index]).pdf(sample)
axes[0].plot(value_grid, density, color=color, linewidth=2, label=f"action {label} posterior")
axes[0].fill_between(value_grid, density, color=color, alpha=0.12)
axes[0].vlines(sample, 0, sample_density, color=color, linestyle="--", linewidth=1.5)
axes[0].scatter(sample, sample_density, color=color, s=55, zorder=3)
axes[0].set_title("1. Sample one action value per action")
axes[0].set_xlabel("expected reward / action value")
axes[0].set_ylabel("posterior density")
axes[0].legend(fontsize=8, loc="upper right")
# Step 2: compare sampled action values and act greedily under that sampled model.
bar_colors = ["0.82"] * n_actions
bar_colors[chosen_action] = demo_action_colors[chosen_action]
bars = axes[1].bar(
action_labels,
sampled_action_values,
color=bar_colors,
edgecolor="black",
linewidth=1.2,
)
bars[chosen_action].set_linewidth(2.6)
for action_index, value in enumerate(sampled_action_values):
axes[1].text(action_index, value + 0.006, f"{value:.3f}", ha="center", va="bottom")
axes[1].text(
chosen_action,
sampled_action_values[chosen_action] * 0.42,
f"choose\n{action_labels[chosen_action]}",
ha="center",
va="center",
color="white",
fontweight="bold",
)
axes[1].set_title("2. Choose the largest sampled value")
axes[1].set_xlabel("action")
axes[1].set_ylabel("sampled expected reward")
axes[1].set_ylim(0.0, max(sampled_action_values) * 1.35)
# Step 3: update only the chosen action's posterior after observing feedback.
chosen_label = action_labels[chosen_action]
chosen_color = demo_action_colors[chosen_action]
before_density = stats.beta(a=demo_alpha[chosen_action], b=demo_beta[chosen_action]).pdf(value_grid)
after_density = stats.beta(a=updated_alpha[chosen_action], b=updated_beta[chosen_action]).pdf(value_grid)
axes[2].plot(
value_grid,
before_density,
color=chosen_color,
linestyle="--",
linewidth=2,
label=f"before: Beta({demo_alpha[chosen_action]:.0f}, {demo_beta[chosen_action]:.0f})",
)
axes[2].plot(
value_grid,
after_density,
color=chosen_color,
linewidth=2.5,
label=f"after: Beta({updated_alpha[chosen_action]:.0f}, {updated_beta[chosen_action]:.0f})",
)
axes[2].fill_between(value_grid, after_density, color=chosen_color, alpha=0.14)
axes[2].set_title("3. Observe reward and update chosen action")
axes[2].set_xlabel("expected reward / action value")
axes[2].set_ylabel("posterior density")
axes[2].legend(fontsize=8, loc="upper right")
axes[2].text(
0.04,
0.93,
f"observed reward $r_t={observed_reward}$\n"
f"$\\alpha_{chosen_label}$: {demo_alpha[chosen_action]:.0f} $\\rightarrow$ {updated_alpha[chosen_action]:.0f}\n"
f"$\\beta_{chosen_label}$: {demo_beta[chosen_action]:.0f} $\\rightarrow$ {updated_beta[chosen_action]:.0f}",
transform=axes[2].transAxes,
ha="left",
va="top",
bbox={"boxstyle": "round,pad=0.35", "facecolor": "white", "edgecolor": "0.65"},
)
fig.suptitle("One Thompson sampling interaction", y=1.03)
plt.tight_layout()
plt.show()
#
# Define bandit records, posterior updates, and probability-matching helpers.
@dataclass(frozen=True)
class BanditInteraction:
"""Observed feedback and regret from one bandit interaction."""
action: int
reward: int
regret: float
def beta_posterior_means(alpha: FloatArray, beta: FloatArray) -> FloatArray:
"""Return the posterior mean expected reward of each action."""
return alpha / (alpha + beta)
def update_beta_bernoulli_posterior(
alpha: FloatArray,
beta: FloatArray,
action: int,
reward: int,
) -> tuple[FloatArray, FloatArray]:
"""Update only the chosen action's Beta posterior after binary reward feedback."""
updated_alpha = alpha.copy()
updated_beta = beta.copy()
updated_alpha[action] += reward
updated_beta[action] += 1 - reward
return updated_alpha, updated_beta
def estimate_probability_each_action_is_best(
alpha: FloatArray,
beta: FloatArray,
rng: np.random.Generator,
size: int = 50_000,
) -> FloatArray:
"""Monte Carlo estimate of the probability-matching policy.
The estimate samples plausible expected rewards from each action's Beta posterior
and returns the empirical frequency with which each action has the largest sample.
"""
sampled_action_values = rng.beta(alpha, beta, size=(size, len(alpha)))
winning_actions = np.argmax(sampled_action_values, axis=1)
return np.bincount(winning_actions, minlength=len(alpha)) / size
#
Let's run the interaction loop for 1,000 interactions. The code initializes one Beta posterior per action, repeats the Thompson-sampling interaction, and keeps the logs used by the plots.
# Run Thompson sampling for repeated bandit interactions.
n_steps = 1000
steps = np.arange(1, n_steps + 1)
bandit_rng = np.random.default_rng(202604243)
# Initialize one Beta posterior per action.
alpha = np.ones(n_actions, dtype=float)
beta = np.ones(n_actions, dtype=float)
interaction_log: list[BanditInteraction] = []
posterior_history: dict[int, tuple[FloatArray, FloatArray]] = {}
checkpoints = [1, 10, 50, 200, 500, 1000]
# Run Thompson sampling as repeated agent-environment interaction.
for step in steps:
# Sample one plausible expected reward for each action from its posterior.
sampled_action_values = bandit_rng.beta(alpha, beta)
# Choose the action with the highest sampled value.
action = int(np.argmax(sampled_action_values))
# Observe binary feedback from the selected action's hidden success probability.
reward = int(bandit_rng.binomial(n=1, p=true_success_probabilities[action]))
# Update only the selected action's Beta posterior.
alpha, beta = update_beta_bernoulli_posterior(
alpha=alpha,
beta=beta,
action=action,
reward=reward,
)
# Log feedback and instantaneous regret for later summaries.
instant_regret = float(np.max(true_success_probabilities) - true_success_probabilities[action])
interaction_log.append(BanditInteraction(action=action, reward=reward, regret=instant_regret))
# Keep posterior snapshots for the checkpoint visualizations.
if step in checkpoints:
posterior_history[step] = (alpha.copy(), beta.copy())
# Summarize actions, rewards, and regret from the interaction log.
chosen_action_history = np.array([record.action for record in interaction_log], dtype=int)
reward_history = np.array([record.reward for record in interaction_log], dtype=int)
instantaneous_regret = np.array([record.regret for record in interaction_log], dtype=float)
cumulative_reward = np.cumsum(reward_history)
cumulative_regret = np.cumsum(instantaneous_regret)
action_counts = np.bincount(chosen_action_history, minlength=n_actions)
best_action_value = float(np.max(true_success_probabilities))
uniform_expected_reward = steps * float(np.mean(true_success_probabilities))
uniform_expected_regret = steps * float(best_action_value - np.mean(true_success_probabilities))
optimal_expected_reward = steps * best_action_value
print(f"Best true action: {action_labels[int(np.argmax(true_success_probabilities))]}")
print(f"Action counts: {action_counts}")
print(f"Posterior means: {np.round(beta_posterior_means(alpha=alpha, beta=beta), 3)}")
print(f"Total reward: {int(cumulative_reward[-1])}")
print(f"Total regret: {cumulative_regret[-1]:.2f}")
#
Posterior Evolution
The Bayesian update is the same as in the first post, but it is worth restating here because it now drives the online decision rule. Each action has a Beta posterior over its unknown expected reward $\theta_a$. When a chosen action produces reward $1$, its posterior shifts to the right; when it produces reward $0$, its posterior shifts to the left.
Two things matter when reading these curves:
- the location of the curve tells us which expected rewards currently look plausible,
- the width of the curve tells us how uncertain we still are about that action value.
Actions that are chosen often accumulate evidence quickly, so their posteriors become narrower. Actions that are chosen less often remain more diffuse. That difference in uncertainty is exactly what Thompson sampling uses: uncertain actions can still be explored, while well-performing and well-measured actions get exploited more often.
Each panel below shows the posterior over the expected reward of every action at a different moment in the run. The dashed lines mark the hidden true expected rewards used to generate the simulation, so we can see whether the posteriors are moving in the right direction. One important detail is that only the chosen action is updated after each interaction; the other actions stay exactly where they were until they are chosen again.
# Plot posterior distributions at selected checkpoints.
action_colors = ["tab:blue", "tab:orange", "tab:green"]
def plot_beta_density(
ax: Axes,
alpha: float,
beta: float,
label: str,
color: str,
) -> None:
"""Plot the density of a Beta distribution on an existing Matplotlib axis."""
x = np.linspace(0.001, 0.999, 1000)
y = stats.beta(a=alpha, b=beta).pdf(x)
ax.plot(x, y, label=label, color=color)
ax.fill_between(x, y, alpha=0.18, color=color)
ax.set_xlabel("expected reward / action value")
ax.set_ylabel("density")
fig, axes = plt.subplots(len(checkpoints), 1, figsize=(9, 14), sharex=True)
for ax, step in zip(axes, checkpoints, strict=False):
alpha_step, beta_step = posterior_history[step]
for action_index, (label, color, true_action_value) in enumerate(
zip(action_labels, action_colors, true_success_probabilities, strict=False)
):
plot_beta_density(
ax=ax,
alpha=float(alpha_step[action_index]),
beta=float(beta_step[action_index]),
label=f"posterior of {label}",
color=color,
)
ax.axvline(true_action_value, color=color, linestyle="--", alpha=0.55)
ax.set_xlim(0.0, 0.30)
ax.set_title(f"Posterior distributions after {step} interactions")
ax.set_ylabel("density")
legend_handles = [
Line2D([0], [0], color=color, lw=2, label=f"posterior of {label}")
for label, color in zip(action_labels, action_colors, strict=False)
] + [
Line2D([0], [0], color=color, lw=2, linestyle="--", label=f"true value of {label}")
for label, color in zip(action_labels, action_colors, strict=False)
]
fig.legend(handles=legend_handles, loc="upper center", ncol=3, bbox_to_anchor=(0.5, 1.02))
axes[-1].set_xlabel("expected reward / action value")
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
#
We can estimate the induced policy distribution $\pi_t(a)$ at the same checkpoints as above. The following plot answers: under the current posterior, how likely is each action to be the best action?
# Plot the estimated probability-matching policy at selected checkpoints.
policy_distribution_rng = np.random.default_rng(202604244)
estimated_policy_by_checkpoint = np.vstack([
estimate_probability_each_action_is_best(
alpha=posterior_history[step][0],
beta=posterior_history[step][1],
rng=policy_distribution_rng,
)
for step in checkpoints
])
plt.figure(figsize=(9, 4))
for action_index, (label, color) in enumerate(zip(action_labels, action_colors, strict=False)):
plt.plot(
checkpoints,
estimated_policy_by_checkpoint[:, action_index],
marker="o",
linewidth=2,
color=color,
label=f"action {label}",
)
plt.xscale("log")
plt.ylim(0.0, 1.0)
plt.title("Estimated policy distribution: P(each action is best | data)")
plt.xlabel("interaction checkpoint")
plt.ylabel("policy probability")
plt.legend(loc="center left", bbox_to_anchor=(1.02, 0.5))
plt.tight_layout()
plt.show()
#
Reward, Regret, and Action Allocation
The whole point of a bandit is that we do not keep allocating equally forever. We start uncertain, explore several actions, and then allocate more interactions to actions that continue to look promising. This is the behavioral difference that matters most relative to a fixed experiment.
There are three useful views of the run:
- Cumulative reward: the reward actually collected; this is the run-level quantity the bandit tries to maximize.
- Cumulative regret: expected reward left on the table compared to an oracle that always picked the best action.
- Action allocation: how the algorithm shifted choices over time.
In our simulation the reward at interaction $t$ is simply $r_t$ (1 for success, 0 for no success). The cumulative reward $\sum_t r_t$ is therefore the total reward collected during learning. Because we know the hidden true action values, we can also compute regret. If $\theta^*$ is the best action's true expected reward and we chose action $a_t$, then the instantaneous regret is $\theta^* - \theta_{a_t}$. Summing this over time gives cumulative regret, which you can interpret as expected missed reward.
For context we also compare against a simple uniform baseline that keeps allocation evenly split across all three actions. This is a fixed-experiment reference point: keep testing evenly and do not adapt. The uniform and oracle reward lines are expected-reference curves, not realized alternative runs. The regret curve is only available because this is a simulation and we know the hidden true action values. In a real bandit deployment, the true regret is not directly observable.
One practical caveat is that adaptively collected bandit data should not be analyzed as if it came from a fixed randomized experiment. The policy that generated the data is part of the data-generating process. If we later want to make causal or off-policy claims from logged bandit data, we need to account for that policy, usually by logging action probabilities and keeping enough exploration so that alternative actions remain observable.
# Plot cumulative reward, regret, action counts, and allocation shares.
block_size = 50
block_starts = np.arange(0, n_steps, block_size)
block_centers = block_starts + block_size / 2
allocation_share_by_block = np.zeros((len(block_starts), n_actions), dtype=float)
for block_index, start in enumerate(block_starts):
end = min(start + block_size, n_steps)
block_choices = chosen_action_history[start:end]
allocation_share_by_block[block_index] = np.bincount(block_choices, minlength=n_actions) / (end - start)
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
axes[0].plot(steps, cumulative_reward, color="tab:green", label="Thompson sampling")
axes[0].plot(steps, uniform_expected_reward, color="tab:gray", linestyle="--", label="uniform allocation (expected)")
axes[0].plot(steps, optimal_expected_reward, color="black", linestyle=":", label="oracle best action (expected)")
axes[0].set_title("Cumulative reward with expected references")
axes[0].set_xlabel("interaction")
axes[0].set_ylabel("cumulative reward")
axes[0].legend(loc="upper left")
axes[1].plot(steps, cumulative_regret, color="tab:red", label="Thompson sampling")
axes[1].plot(steps, uniform_expected_regret, color="tab:gray", linestyle="--", label="uniform allocation (expected)")
axes[1].set_title("Cumulative regret")
axes[1].set_xlabel("interaction")
axes[1].set_ylabel("expected reward shortfall")
axes[1].legend(loc="upper left")
axes[2].bar(action_labels, action_counts, color=action_colors)
axes[2].set_title("How often each action was chosen")
axes[2].set_xlabel("action")
axes[2].set_ylabel("times chosen")
plt.tight_layout()
plt.show()
plt.figure(figsize=(9, 3.8))
for action_index, (label, color) in enumerate(zip(action_labels, action_colors, strict=False)):
plt.plot(
block_centers,
allocation_share_by_block[:, action_index],
marker="o",
linewidth=2,
color=color,
label=f"action {label}",
)
plt.ylim(0.0, 1.0)
plt.title(f"Share of interactions assigned to each action in {block_size}-step blocks")
plt.xlabel("interaction")
plt.ylabel("allocation share within block")
plt.legend(ncol=2, loc="upper left")
plt.tight_layout()
plt.show()
#
Final Learned Picture
By the end of the run, the posterior means should roughly recover the true ordering of the actions. But unlike the fixed A/B test, that is only part of the story. The bandit was not waiting to make one final choice. It was already using those evolving beliefs to steer interactions toward the better actions.
The plot below compares the hidden true expected rewards with the final posterior means and 90% credible intervals. The number of times each action was chosen is included in the labels, which makes the exploration-exploitation pattern visible at a glance.
# Compare final posterior estimates with the hidden true action values.
final_posterior_means = beta_posterior_means(alpha=alpha, beta=beta)
posterior_lower = stats.beta(a=alpha, b=beta).ppf(0.05)
posterior_upper = stats.beta(a=alpha, b=beta).ppf(0.95)
x_positions = np.arange(n_actions)
plt.figure(figsize=(8, 4.5))
plt.bar(
x_positions,
true_success_probabilities,
color=action_colors,
alpha=0.45,
label="true expected reward (hidden)",
)
plt.errorbar(
x_positions,
final_posterior_means,
yerr=[final_posterior_means - posterior_lower, posterior_upper - final_posterior_means],
fmt="o",
color="0.15",
capsize=5,
label="posterior mean with 90% interval",
)
plt.xticks(
x_positions,
[f"{label}\n({count} choices)" for label, count in zip(action_labels, action_counts, strict=False)],
)
plt.ylim(0.0, 0.22)
plt.ylabel("expected reward / action value")
plt.title(f"What the bandit has learned after {n_steps} interactions")
plt.legend(loc="upper left")
plt.tight_layout()
plt.show()
#
Summary
The first post used Bayesian posterior uncertainty as a final decision aid. A bandit uses the same kind of belief online.
At each interaction:
- the policy chooses one action using the current posterior belief,
- the environment returns an immediate binary reward for that chosen action only,
- the learner updates only the chosen action's Beta posterior,
- the updated posterior becomes the input to the next action choice.
That is the key shift from a fixed A/B test: the posterior is no longer used only after data collection. It becomes part of the action loop. Probability matching turns posterior probabilities into a policy distribution, so actions that are still plausibly best keep some chance of being tried. Thompson sampling uses one posterior draw to sample directly from that policy distribution: sample plausible action values, choose the largest sampled value, observe reward, and update the chosen action. It samples plausible average rewards, not possible next binary outcomes.
This post has moved us from fixed experiments to online learning: actions, rewards, an adaptive policy, bandit feedback, and cumulative reward. But it is still only a small step toward RL: there is no evolving external state yet, and no delayed credit assignment, which we'll explore in the next post.
References
- On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, William R. Thompson: original paper behind probability matching for allocating experimental effort in two-armed trial settings.
- A Tutorial on Thompson Sampling, Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen: reference for posterior-sampling action selection, probability matching, and Beta-Bernoulli bandit examples.
Further Reading
- Part 1: Bayesian A/B Testing, Peter Roelants: starts the series from fixed randomized experiments and posterior comparisons.
- Part 3: From Continuous Learning to Delayed Rewards, Peter Roelants: extends the online-learning view to stateful policies and delayed rewards.
- Beta Priors for Self-Reinforcing Binary Decisions, Peter Roelants: background on the Beta-Bernoulli update loop used by the bandit learner.
- Bandit Algorithms, Tor Lattimore and Csaba Szepesvari: broader reading on bandit feedback, regret, and common exploration algorithms.
# Print package versions used to execute this notebook.
print(f"Python: {platform.python_version()}")
for package in ["numpy", "scipy", "matplotlib", "seaborn"]:
print(f"{package}: {importlib.metadata.version(package)}")
#
This post is generated from an IPython notebook file. Link to the full IPython notebook file