From A/B to RL (1/3): Bayesian A/B Testing

Before policies learn, experiments assign.

This series bridges A/B testing (randomized comparisons between variants A and B) and reinforcement learning (RL) from a Bayesian decision-making perspective. We start with uncertainty, make decisions from feedback, and follow the shift from assignment to policy: from controlling how users are assigned to variants, to learning rules for choosing actions online from feedback.

The goal is not to teach all of RL, but to make ideas like action, reward, policy, state, episode, and delayed feedback feel like natural extensions of experimentation under uncertainty.

This is the first post in a 3-part series:

Part 1: Bayesian A/B Testing (this post)
Part 2: Multi-Armed Bandits
Part 3: From Continuous Learning to Delayed Rewards

Bayesian A/B Testing

We begin with a fixed randomized experiment and the smallest useful decision problem: two fixed options, immediate binary feedback, and one final choice at the end.

In RL language, this is still intentionally small. Showing A or B is an action, a click is an immediate reward, and the assignment rule is a fixed policy set before data arrives. There is no evolving state yet, and the policy does not learn while the experiment is running.

An A/B test is a randomized experiment for comparing two variants of a system or intervention: two webpages, two button labels, two recommendation policies, or two treatments in a randomized controlled trial. We assign users or experimental units to variant A or variant B, observe an outcome, and then decide which variant to keep, ship, or prefer.

This post uses the simplest version of that setup:

users or experimental units are randomly assigned to one of two fixed variants,
every assignment produces one binary outcome, which we can also read as a one-step reward,
we collect a fixed batch of data,
then we make one final decision.

This stripped-down setup lets us keep two questions separate:

how to model uncertainty about each variant's unknown click-through rate, or expected immediate reward,
how to turn posterior uncertainty into a concrete decision.

The random assignment matters. In causal inference, randomization is what lets us treat the assigned variant as an intervention rather than an attribute of the user. It protects against confounding: if A and B are assigned at random, the two groups differ only by chance before the intervention, so a difference in clicks cannot be systematically explained by pre-existing differences between them. We are not merely reading clicks from a database and comparing groups after the fact. We are running an intervention. The experiment chooses an action for each user according to a fixed randomized rule, the environment responds with an outcome, and those outcomes become evidence about the action's expected reward. That is already an agent-environment loop in miniature, but the policy is fixed in advance rather than learned.
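The loop described above can be sketched in a few lines. This is a minimal illustration, not the simulation used later in the post: the names `fixed_policy`, `environment`, and `TRUE_RATES` are hypothetical, the rates are illustrative, and assignment is a fair coin flip per user.

```python
import numpy as np

# Hypothetical click-through rates, hidden from the experimenter (illustrative only).
TRUE_RATES = {"A": 0.12, "B": 0.15}


def fixed_policy(rng: np.random.Generator) -> str:
    """Assign a variant with a fair coin flip, independent of the user and of past data."""
    return "A" if rng.random() < 0.5 else "B"


def environment(action: str, rng: np.random.Generator) -> int:
    """Return an immediate binary reward: 1 for a click, 0 for no click."""
    return int(rng.random() < TRUE_RATES[action])


rng = np.random.default_rng(0)
counts = {"A": [0, 0], "B": [0, 0]}  # [clicks, exposures] per variant

for _ in range(1_000):
    action = fixed_policy(rng)         # the policy never looks at `counts`
    reward = environment(action, rng)  # one-step interaction, then stop
    counts[action][0] += reward
    counts[action][1] += 1

for variant, (clicks, exposures) in counts.items():
    print(f"{variant}: {clicks}/{exposures} clicks")
```

A per-user coin flip usually produces slightly unequal group sizes. The simulation later in this post instead draws the same number of users for each variant, which keeps the bookkeeping simpler without changing the logic.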

From here on, variant and action refer to the same object. We will mostly use variant when talking about the A/B test, and action when connecting the same setup to RL.

In [1]:

# Imports and setup
import importlib.metadata
import platform
from dataclasses import dataclass
from functools import lru_cache

import matplotlib.pyplot as plt
import numpy as np
import numpy.typing as npt
import seaborn as sns
from matplotlib.axes import Axes
from matplotlib.patches import Patch
from scipy import stats
from scipy.special import betaln

sns.set_style("darkgrid")
plt.rcParams["figure.figsize"] = (7, 4)

# Some common types
FloatArray = npt.NDArray[np.float64]
IntArray = npt.NDArray[np.int_]

The Toy Problem

Suppose we want to compare two versions of a page in exactly that stripped-down setting:

Variant A: the current baseline
Variant B: a new design

The goal is to choose the version that produces more clicks on an element we care about, such as a buy or subscribe button. In this toy example, a click is the immediate reward: $1$ for click and $0$ for no click. The click-through rate (CTR) is therefore the expected immediate reward of showing a variant. Because this is a one-step problem, that expected immediate reward is also the action value. We will still refer to it as CTR when the A/B-testing meaning matters.

We run a randomized experiment: each user is assigned to A or B by the test before any click is observed. That random assignment is what lets us interpret differences in outcomes as being caused by the variant, rather than just associated with it.

Every exposure is one interaction: choose A or B, observe click or no click, then stop. The state is trivial because every exposure is treated as the same kind of decision. The assignment rule is therefore a fixed policy over two actions in one state. We can represent one such outcome with a binary variable $x$:

$x = 1$ if the user clicks
$x = 0$ if the user does not click

The causal question is whether one page causes more clicks than the other. The important constraint is that this post is still a choose-once setting: we do not adapt the assignment policy while the experiment is running.

At the end of the experiment we want to answer one question:

Which variant should we ship?

For illustration we will simulate one randomized experiment. The true CTRs, or expected rewards, are fixed but hidden from the inference procedure. We only keep them around so we can check later whether the posterior behaves sensibly.

In [2]:

# Define a small container for the simulated A/B test outcomes.
@dataclass(frozen=True)
class ABTestData:
    """Observed outcomes from one fixed randomized A/B test."""

    observations_a: IntArray
    observations_b: IntArray
    true_rate_a: float
    true_rate_b: float

    @property
    def trials_per_variant(self) -> int:
        """Return the number of observed users assigned to each variant."""
        return int(len(self.observations_a))

    @property
    def clicks(self) -> tuple[int, int]:
        """Return observed click counts for A and B."""
        return int(self.observations_a.sum()), int(self.observations_b.sum())

    @property
    def no_clicks(self) -> tuple[int, int]:
        """Return observed non-click counts for A and B."""
        clicks_a, clicks_b = self.clicks
        return self.trials_per_variant - clicks_a, self.trials_per_variant - clicks_b

The next cell simulates one randomized experiment.

In [3]:

# Simulate one fixed randomized A/B test.
TRUE_RATE_A = 0.12  # Hidden from the learner
TRUE_RATE_B = 0.15  # Hidden from the learner
NB_TRIALS = 200
experiment_rng = np.random.default_rng(20260425)

# Simulate user outcomes for each variant. Each user either clicks (1) or doesn't click (0).
observations_a = experiment_rng.binomial(n=1, p=TRUE_RATE_A, size=NB_TRIALS)
observations_b = experiment_rng.binomial(n=1, p=TRUE_RATE_B, size=NB_TRIALS)

experiment = ABTestData(
    observations_a=observations_a,
    observations_b=observations_b,
    true_rate_a=TRUE_RATE_A,
    true_rate_b=TRUE_RATE_B,
)

clicks_a, clicks_b = experiment.clicks
no_clicks_a, no_clicks_b = experiment.no_clicks

print(f"Variant A: {clicks_a} clicks, {no_clicks_a} non-clicks, observed CTR = {clicks_a / NB_TRIALS:.3f}")
print(f"Variant B: {clicks_b} clicks, {no_clicks_b} non-clicks, observed CTR = {clicks_b / NB_TRIALS:.3f}")

Variant A: 23 clicks, 177 non-clicks, observed CTR = 0.115
Variant B: 32 clicks, 168 non-clicks, observed CTR = 0.160

In [4]:

# Plot the simulated A/B test outcomes.
true_rate_a = experiment.true_rate_a
true_rate_b = experiment.true_rate_b
nb_trials = experiment.trials_per_variant
clicks_a, clicks_b = experiment.clicks
no_clicks_a, no_clicks_b = experiment.no_clicks

variants = ["A", "B"]
clicks = [clicks_a, clicks_b]
no_clicks = [no_clicks_a, no_clicks_b]

fig, ax = plt.subplots(figsize=(6.5, 4))

ax.barh(variants, clicks, label="Clicks", color="tab:green")
ax.barh(variants, no_clicks, left=clicks, label="No Clicks", color="tab:gray", alpha=0.7)
ax.set_title("Observed outcomes in the toy experiment")
ax.set_xlabel("user count")
ax.legend()

plt.tight_layout()
plt.show()


What Is Actually Observed?

The key point is that we do not observe the true click-through rate (CTR), or expected immediate reward, directly. Under randomized assignment we only observe finite, noisy counts:

how many users saw A and B,
how many clicks each variant produced.

Comparing two observed percentages is only the surface problem. The harder question is how to reason about an unknown one-step action value from incomplete evidence.

A Bayesian view is useful here because it turns observed reward counts into posterior uncertainty over each unknown CTR. Since CTR is the expected immediate reward in this setup, the same posterior is also uncertainty over each action's one-step value. From those posteriors we can compute decision quantities directly, such as the posterior probability that B has the higher expected reward.

In this post we keep the data-collection part simple: we run a fixed-size experiment, observe the final counts, and make one decision at the end. The traffic-allocation policy is fixed; it does not adapt while the experiment is running.

The Bayesian view is not the only valid way to analyze an A/B test, but it is a clean place to start.

To make that concrete, we now need the simplest probabilistic model that connects user-level rewards to an unknown click-through rate.

A Minimal Bayesian Model

For one assigned user, the outcome $x$ is binary, so we model it as a Bernoulli distribution:

$$x \sim \mathrm{Bernoulli}(\theta)$$

Here $x = 1$ means a click and $x = 0$ means no click. We use $\theta$ to represent the unknown click-through rate of one variant. In decision-making language, $\theta$ is the expected immediate reward of showing that variant. In this one-step setting, it is also the action value.

If we look at $n_{\mathrm{users}}$ users at once and count the total number of clicks, which we denote by $n_{\mathrm{clicks}}$, we can model this using the Binomial distribution:

$$n_{\mathrm{clicks}} \sim \mathrm{Binomial}(n_{\mathrm{users}}, \theta)$$

So far this only tells us how clicks are generated if we knew $\theta$. In the A/B test we face the inverse problem: we observe clicks and want to infer the unknown rate.
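Before bringing in a prior, it can help to look at the likelihood on its own. The sketch below evaluates the Binomial probability of variant A's observed count (23 clicks out of 200) for a grid of candidate rates. The grid resolution and variable names are choices made for this illustration.

```python
import numpy as np
from scipy import stats

n_users, n_clicks = 200, 23  # counts for variant A in the toy experiment

# Evaluate the Binomial likelihood of the observed count on a grid of candidate rates.
theta_grid = np.linspace(0.0, 1.0, 1001)
likelihood = stats.binom.pmf(k=n_clicks, n=n_users, p=theta_grid)

theta_hat = theta_grid[np.argmax(likelihood)]
print(f"rate with the highest likelihood: {theta_hat:.3f}")
print(f"observed CTR: {n_clicks / n_users:.3f}")
```

The likelihood peaks at the observed CTR, but many nearby rates explain the data almost as well. The posterior introduced next is a way to quantify that spread.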

In Bayesian inference, we use Bayes' theorem to move from observed clicks back to the unknown rate.

For a single observed outcome $x$, that relationship can be written as:

$$P(\theta \mid x) = \frac{P(x \mid \theta) P(\theta)}{P(x)}$$

The posterior is proportional to the likelihood times the prior. Concretely, we combine the likelihood of the observed clicks with a prior over $\theta$. The result is a full posterior distribution over the unknown click-through rate, rather than a single best guess.

For the prior, we use a Beta distribution:

$$\theta \sim \mathrm{Beta}(\alpha, \beta)$$

You can think of the Beta distribution as a distribution over plausible click-through rates between 0 and 1. For a more in-depth introduction, see The Beta Distribution: A Distribution Over Probabilities. In what follows we use the uniform prior $\mathrm{Beta}(1, 1)$, which says that before seeing data we do not favor any particular rate.

This choice is convenient because the Beta distribution is a conjugate prior for the Bernoulli/Binomial likelihood, so the posterior stays Beta after observing clicks and non-clicks.

At the single-observation level, the update is simple:

after a click, $\mathrm{Beta}(\alpha, \beta) \to \mathrm{Beta}(\alpha + 1, \beta)$,
after a non-click, $\mathrm{Beta}(\alpha, \beta) \to \mathrm{Beta}(\alpha, \beta + 1)$.

If we aggregate observations and count total clicks instead, we can write the same update in one line:

$$\theta \mid n_{\mathrm{clicks}}, n_{\mathrm{users}} \sim \mathrm{Beta}(\alpha + n_{\mathrm{clicks}}, \beta + n_{\mathrm{users}} - n_{\mathrm{clicks}})$$
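A short derivation shows where this update comes from. Multiply the Binomial likelihood by the Beta prior density and drop every factor that does not depend on $\theta$:

$$
\begin{aligned}
P(\theta \mid n_{\mathrm{clicks}}, n_{\mathrm{users}})
&\propto \theta^{n_{\mathrm{clicks}}} (1 - \theta)^{n_{\mathrm{users}} - n_{\mathrm{clicks}}} \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \\
&= \theta^{(\alpha + n_{\mathrm{clicks}}) - 1} (1 - \theta)^{(\beta + n_{\mathrm{users}} - n_{\mathrm{clicks}}) - 1}.
\end{aligned}
$$

The last line has the same functional form as a Beta density with shape parameters $\alpha + n_{\mathrm{clicks}}$ and $\beta + n_{\mathrm{users}} - n_{\mathrm{clicks}}$, so normalizing it gives exactly that distribution.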

The update has a simple interpretation: each observed click adds 1 to the first shape parameter, and each observed non-click adds 1 to the second. Starting from $\mathrm{Beta}(1, 1)$ and observing $n_{\mathrm{clicks}}$ clicks out of $n_{\mathrm{users}}$ users gives $\mathrm{Beta}(1 + n_{\mathrm{clicks}}, 1 + n_{\mathrm{users}} - n_{\mathrm{clicks}})$. For a one-observation-at-a-time view of this update, see Beta Priors for Self-Reinforcing Binary Decisions. That sequential view is worth keeping in mind. In this post we update after the full experiment, rather than online one interaction at a time.
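A quick check makes the equivalence of the two views concrete. The sketch below uses a hypothetical click stream (the rate and seed are arbitrary), updates the Beta parameters one observation at a time, and compares the result with the aggregated update.

```python
import numpy as np

rng = np.random.default_rng(1)
observations = rng.binomial(n=1, p=0.12, size=200)  # hypothetical click stream

# Sequential view: update the Beta parameters one observation at a time.
alpha, beta = 1.0, 1.0  # uniform Beta(1, 1) prior
for x in observations:
    if x == 1:
        alpha += 1.0  # a click increments the first shape parameter
    else:
        beta += 1.0   # a non-click increments the second shape parameter

# Batch view: apply the aggregated update once.
n_clicks = int(observations.sum())
n_users = len(observations)
alpha_batch = 1.0 + n_clicks
beta_batch = 1.0 + n_users - n_clicks

print((alpha, beta), (alpha_batch, beta_batch))
```

Both routes end at the same posterior, and the order of the observations does not matter. Only the counts do.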

In the A/B test we apply this one-variant model twice: once for A and once for B. You can think of those two unknown rates as $\theta_A$ and $\theta_B$, the expected immediate rewards of the two actions. We then compare the two posterior distributions to decide which variant to ship.
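As a small worked example, the sketch below summarizes each posterior with its mean and an equal-tailed 95% credible interval, using the counts observed in the toy experiment above (23 and 32 clicks out of 200) and the uniform prior. The choice of a 95% equal-tailed interval is an assumption of this illustration, not something the rest of the post depends on.

```python
from scipy import stats

n_users = 200
clicks = {"A": 23, "B": 32}  # observed counts from the toy experiment

summaries = {}
for variant, n_clicks in clicks.items():
    posterior = stats.beta(a=1 + n_clicks, b=1 + n_users - n_clicks)
    low, high = posterior.interval(0.95)  # equal-tailed 95% credible interval
    summaries[variant] = (posterior.mean(), low, high)
    print(f"{variant}: mean = {posterior.mean():.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

The two intervals overlap, which is a first hint that 200 users per variant do not settle the comparison. Overlap alone is not a decision rule, though, which is why the post goes on to compare the two posteriors directly.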

From Observed Counts to Posterior Distributions

Let us now apply the model to the toy experiment. The only inputs are the observed counts from A and B. The dashed lines show the hidden true CTRs, or expected rewards, used to generate the simulation; the posterior itself only uses the observed clicks and non-clicks.

In [5]:

# Compute Beta posterior parameters from observed click counts.
def beta_posterior_parameters(
    successes: int,
    trials: int,
    alpha_prior: float = 1.0,
    beta_prior: float = 1.0,
) -> tuple[float, float]:
    """Return the Beta posterior shape parameters after Bernoulli or Binomial observations."""
    return alpha_prior + successes, beta_prior + trials - successes

alpha_a, beta_a = beta_posterior_parameters(successes=clicks_a, trials=nb_trials)
alpha_b, beta_b = beta_posterior_parameters(successes=clicks_b, trials=nb_trials)

In [6]:

# Plot the Beta distribution's density
def plot_beta_density(
    ax: Axes,
    alpha: float,
    beta: float,
    label: str,
    color: str,
) -> None:
    """Plot the density curve of a Beta distribution on an existing Matplotlib axis."""
    x = np.linspace(0.001, 0.999, 400)
    y = stats.beta(a=alpha, b=beta).pdf(x)
    ax.plot(x, y, label=label, color=color)
    ax.fill_between(x, y, alpha=0.20, color=color)
    ax.set_xlabel("click-through rate")
    ax.set_ylabel("density")


fig, ax = plt.subplots(figsize=(8, 4.5))
plot_beta_density(ax=ax, alpha=alpha_a, beta=beta_a, label="posterior of A", color="tab:blue")
plot_beta_density(ax=ax, alpha=alpha_b, beta=beta_b, label="posterior of B", color="tab:orange")
ax.axvline(true_rate_a, color="tab:blue", linestyle="--", alpha=0.6, label="true rate of A (unknown)")
ax.axvline(true_rate_b, color="tab:orange", linestyle="--", alpha=0.6, label="true rate of B (unknown)")
ax.set_xlim(0.0, 0.50)
ax.set_title(f"Posterior distributions after {nb_trials} observations per variant")
ax.legend(loc="upper right")
plt.tight_layout()
plt.show()

Making the Decision

Once we have posterior distributions for A and B, we can ask: given the data we observed, what is the probability that B has a higher true CTR, or expected immediate reward, than A? Written in terms of the unknown CTRs:

$$ P(B > A \mid \mathrm{data}) = P(\theta_B > \theta_A \mid \mathrm{data}). $$

Here $\theta_A$ and $\theta_B$ are the unknown true CTRs of variants A and B. In the one-step reward view, they are also the unknown one-step action values.

In this post, $P(B > A \mid \mathrm{data})$ is a decision quantity: it helps us make one final ship decision after the experiment. It is not yet a traffic-allocation policy, because the experiment has already collected its data under a fixed randomized assignment rule. In the next post, the same kind of posterior probability will become an online policy distribution.

There are two useful ways to compute this probability:

approximate it with Monte Carlo sampling: draw posterior samples from A and B and count how often $\theta_B > \theta_A$,
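compute it exactly for Beta posteriors using a finite-sum identity, which applies here because the uniform prior and integer click counts give integer shape parameters.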

The Monte Carlo version is more general because it works whenever we can sample from the posterior. In the Beta-Bernoulli case, we can also compute the same probability directly.
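As an independent cross-check that is not part of the original analysis, the same probability can also be written as a one-dimensional integral, $P(\theta_B > \theta_A \mid \mathrm{data}) = \int_0^1 f_B(t)\, F_A(t)\, dt$, where $f_B$ is B's posterior density and $F_A$ is A's posterior CDF. This relies on the two posteriors being independent. The sketch below evaluates the integral numerically with `scipy.integrate.quad` for the toy counts.

```python
from scipy import integrate, stats

# Posteriors from the toy experiment under the uniform Beta(1, 1) prior.
posterior_a = stats.beta(a=1 + 23, b=1 + 200 - 23)
posterior_b = stats.beta(a=1 + 32, b=1 + 200 - 32)


def integrand(theta: float) -> float:
    """Density of theta_B at theta times the probability that theta_A falls below it."""
    return posterior_b.pdf(theta) * posterior_a.cdf(theta)


prob_b_better_quad, abs_error = integrate.quad(integrand, 0.0, 1.0)
print(f"P(B > A | data), numerical integration: {prob_b_better_quad:.3f}")
```

Unlike the finite-sum identity, this route does not need integer shape parameters, so it can serve as a fallback when the prior is not integer-valued.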

Estimating $P(B > A \mid \mathrm{data})$

We can estimate $P(B > A \mid \mathrm{data})$ with a simple Monte Carlo procedure. Repeatedly draw one plausible CTR, equivalently one plausible expected reward, for each of A and B from their posterior distributions, then check whether B's draw is higher. If that happens in 90% of the paired draws, our estimate of $P(B > A \mid \mathrm{data})$ is about 0.90.
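Because the estimate is an average of independent yes/no comparisons, its own sampling noise is easy to quantify. The sketch below reruns the paired-draw procedure for the toy counts and reports a Monte Carlo standard error. The seed and variable names are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
size = 200_000

# Posterior draws for the toy experiment under the uniform Beta(1, 1) prior.
samples_a = rng.beta(a=1 + 23, b=1 + 200 - 23, size=size)
samples_b = rng.beta(a=1 + 32, b=1 + 200 - 32, size=size)

b_wins = samples_b > samples_a
estimate = b_wins.mean()
# Each paired draw is a Bernoulli trial, so the estimate has a Binomial standard error.
standard_error = np.sqrt(estimate * (1.0 - estimate) / size)

print(f"estimate = {estimate:.4f}, Monte Carlo standard error = {standard_error:.4f}")
```

With 200,000 paired draws and an estimate near 0.90, the standard error is below 0.001, so a sampled value that differs from the exact one in the third decimal place is consistent with Monte Carlo noise.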

Each comparison is a paired draw from the joint posterior: one sampled $\theta_A$, one sampled $\theta_B$, and whether $\theta_B > \theta_A$.

$$ P(B > A \mid \mathrm{data}) = P(\theta_B > \theta_A \mid \mathrm{data}). $$

For each pair, we can also compute the posterior difference $\Delta$, B's sampled CTR minus A's sampled CTR. In this one-step setting, this is also the sampled difference in expected reward:

$$ \Delta = \theta_B - \theta_A. $$

Positive values of $\Delta$ mean B is higher in that paired draw. When A is a baseline or control, this same difference is often called the absolute lift of B over A.
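The sampled differences can be summarized directly. The sketch below draws paired samples for the toy counts and reports the posterior mean of $\Delta$, an equal-tailed 95% credible interval, and the share of draws with $\Delta > 0$. The interval level and seed are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
size = 200_000

samples_a = rng.beta(a=1 + 23, b=1 + 200 - 23, size=size)
samples_b = rng.beta(a=1 + 32, b=1 + 200 - 32, size=size)

delta = samples_b - samples_a  # sampled absolute lift of B over A
low, high = np.quantile(delta, [0.025, 0.975])

print(f"posterior mean of delta: {delta.mean():.3f}")
print(f"95% credible interval for delta: [{low:.3f}, {high:.3f}]")
print(f"share of draws with delta > 0: {(delta > 0).mean():.3f}")
```

The share of positive draws is the same Monte Carlo estimate of $P(B > A \mid \mathrm{data})$ as before, while the interval adds information about how large the lift plausibly is. For these counts the interval includes zero.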

The figure below shows three linked views. The scatter plot shows a readable subset of paired posterior draws. The histogram shows the posterior differences $\Delta$ using all paired draws from the same Monte Carlo run. The final bar chart collapses each paired comparison into one binary event: did B beat A in that draw? In all three panels, green marks the event $\theta_B > \theta_A$.

In [7]:

# Estimate P(B > A | data) by posterior sampling and by an exact Beta formula.
def sample_beta_posterior(
    alpha: float,
    beta: float,
    rng: np.random.Generator,  # Random number generator for posterior sampling
    size: int = 200_000,
) -> FloatArray:
    """Draw posterior samples from a Beta distribution with the given shape parameters."""
    return rng.beta(a=alpha, b=beta, size=size)


def estimate_probability_b_beats_a_by_sampling(
    alpha_a: float,
    beta_a: float,
    alpha_b: float,
    beta_b: float,
    rng: np.random.Generator,  # Random number generator for posterior sampling
    size: int = 200_000,
) -> tuple[float, FloatArray, FloatArray]:
    """Estimate P(theta_B > theta_A | data) from posterior samples for both variants."""
    samples_a = sample_beta_posterior(alpha=alpha_a, beta=beta_a, rng=rng, size=size)
    samples_b = sample_beta_posterior(alpha=alpha_b, beta=beta_b, rng=rng, size=size)
    return float(np.mean(samples_b > samples_a)), samples_a, samples_b


def exact_probability_beta_x_exceeds_beta_y(
    alpha_x: int,
    beta_x: int,
    alpha_y: int,
    beta_y: int,
) -> float:
    """Compute P(X > Y) for independent Beta variables when alpha_x is an integer.

    Uses the finite-sum Beta-function identity described in
    https://www.evanmiller.org/bayesian-ab-testing.html .
    """
    indices = np.arange(alpha_x)
    log_terms = (
        betaln(alpha_y + indices, beta_x + beta_y)
        - np.log(beta_x + indices)
        - betaln(1 + indices, beta_x)
        - betaln(alpha_y, beta_y)
    )
    max_log_term = np.max(log_terms)
    return float(np.exp(max_log_term) * np.sum(np.exp(log_terms - max_log_term)))


@lru_cache(maxsize=None)
def exact_posterior_probability_b_beats_a(
    n: int,
    clicks_a: int,
    clicks_b: int,
) -> float:
    """Compute P(theta_B > theta_A | data) exactly under the uniform Beta(1, 1) prior."""
    return exact_probability_beta_x_exceeds_beta_y(
        alpha_x=1 + clicks_b,
        beta_x=1 + n - clicks_b,
        alpha_y=1 + clicks_a,
        beta_y=1 + n - clicks_a,
    )


posterior_sampling_rng = np.random.default_rng(20260425)
prob_b_better_sampling, samples_a, samples_b = estimate_probability_b_beats_a_by_sampling(
    alpha_a=alpha_a,
    beta_a=beta_a,
    alpha_b=alpha_b,
    beta_b=beta_b,
    rng=posterior_sampling_rng,
)
prob_b_better_exact = exact_posterior_probability_b_beats_a(
    n=nb_trials,
    clicks_a=clicks_a,
    clicks_b=clicks_b,
)

print(f"Observed A: {clicks_a}/{nb_trials} = {clicks_a / nb_trials:.3f}")
print(f"Observed B: {clicks_b}/{nb_trials} = {clicks_b / nb_trials:.3f}")
print(f"P(B > A | data), posterior sampling: {prob_b_better_sampling:.3f}")
print(f"P(B > A | data), exact Beta formula: {prob_b_better_exact:.3f}")


# Visualize P(B > A | data) as joint samples and sampled CTR differences.
visual_sampling_rng = np.random.default_rng(20260426)
nb_visual_samples = 5_000
visual_indices = visual_sampling_rng.choice(
    len(samples_a),
    size=nb_visual_samples,
    replace=False,
)
visual_samples_a = samples_a[visual_indices]
visual_samples_b = samples_b[visual_indices]
visual_b_beats_a = visual_samples_b > visual_samples_a
# Use all paired posterior samples for a smoother posterior-difference histogram.
difference_samples = samples_b - samples_a

fig, axes = plt.subplots(
    1,
    3,
    figsize=(12, 4),
    gridspec_kw={"width_ratios": [1.0, 1.3, 0.6], "wspace": 0.28},
)

axis_min = 0.05
axis_max = 0.20
axis_min_b = 0.09
axis_max_b = 0.24
axes[0].scatter(
    visual_samples_b[~visual_b_beats_a],
    visual_samples_a[~visual_b_beats_a],
    s=9,
    alpha=0.22,
    color="tab:gray",
    label="A sample > B sample",
)
axes[0].scatter(
    visual_samples_b[visual_b_beats_a],
    visual_samples_a[visual_b_beats_a],
    s=9,
    alpha=0.22,
    color="tab:green",
    label="B sample > A sample",
)
axes[0].plot(
    [axis_min_b, axis_max],
    [axis_min_b, axis_max],
    color="black",
    linestyle="--",
    linewidth=1.5,
    label=r"$\theta_B = \theta_A$",
)
axes[0].set_xlim(axis_min_b, axis_max_b)
# Invert the A-axis so B-favoring samples appear in the upper-right region.
axes[0].set_ylim(axis_max, axis_min)
axes[0].set_aspect("equal", adjustable="box")
axes[0].set_anchor("E")
axes[0].set_title(r"Joint samples define $P(B > A \mid data)$")
axes[0].set_xlabel(r"sampled CTR $\theta_B$")
axes[0].set_ylabel(r"sampled CTR $\theta_A$")
axes[0].text(
    0.04,
    0.96,
    rf"$P(B > A \mid data) \approx {prob_b_better_sampling:.3f}$",
    transform=axes[0].transAxes,
    va="top",
    bbox={"boxstyle": "round,pad=0.35", "facecolor": "white", "alpha": 0.9},
)
axes[0].legend(loc="lower right")

difference_density, difference_edges = np.histogram(difference_samples, bins=70, density=True)
difference_centers = 0.5 * (difference_edges[:-1] + difference_edges[1:])
difference_widths = np.diff(difference_edges)
difference_colors = np.where(difference_centers > 0, "tab:green", "tab:gray")
axes[1].bar(
    difference_centers,
    difference_density,
    width=difference_widths,
    color=difference_colors,
    alpha=0.55,
    align="center",
)
axes[1].axvline(0, color="black", linestyle="--", linewidth=1.5)
axes[1].set_title(r"Sampled posterior difference distribution")
axes[1].set_xlabel(r"sampled difference $\Delta = \theta_B - \theta_A$")
axes[1].set_ylabel("posterior sample density")
axes[1].text(
    0.04,
    0.96,
    rf"mass right of 0 $\approx {prob_b_better_sampling:.3f}$",
    transform=axes[1].transAxes,
    va="top",
    bbox={"boxstyle": "round,pad=0.35", "facecolor": "white", "alpha": 0.9},
)
axes[1].legend(
    handles=[
        Patch(facecolor="tab:gray", alpha=0.55, label="A sample > B sample"),
        Patch(facecolor="tab:green", alpha=0.55, label="B sample > A sample"),
    ],
    loc="upper right",
)

event_probabilities = np.array([1.0 - prob_b_better_sampling, prob_b_better_sampling])
event_labels = [r"$B \leq A$", r"$B > A$"]
event_colors = ["tab:gray", "tab:green"]
axes[2].bar(event_labels, event_probabilities, color=event_colors, alpha=0.65)
axes[2].set_ylim(0.0, 1.05)
axes[2].set_title(r"Posterior probability of $B > A$")
axes[2].set_ylabel("posterior probability")
for event_index, probability in enumerate(event_probabilities):
    axes[2].text(
        event_index,
        probability + 0.03,
        f"{probability:.3f}",
        ha="center",
        va="bottom",
    )

fig.suptitle(r"Visualizing $P(B > A \mid data)$ from posterior samples", y=0.98)
fig.subplots_adjust(top=0.82)
plt.show()

Observed A: 23/200 = 0.115
Observed B: 32/200 = 0.160
P(B > A | data), posterior sampling: 0.902
P(B > A | data), exact Beta formula: 0.903

How the Posterior Tightens with More Data

A/B testing is still a choose-once, fixed-policy problem, so we usually care most about the final posterior. But it is useful to see how the posterior changes as more data accumulates: the distribution narrows, uncertainty shrinks, and the overlap between A and B becomes easier to interpret.
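The narrowing can be quantified without simulating anything. The sketch below holds a hypothetical observed CTR of 0.15 fixed and computes the posterior standard deviation at several sample sizes under the uniform prior.

```python
from scipy import stats

observed_rate = 0.15  # hypothetical observed CTR, held fixed across sample sizes

widths = []
for n in [20, 100, 500, 2_000]:
    clicks = round(observed_rate * n)
    posterior = stats.beta(a=1 + clicks, b=1 + n - clicks)
    widths.append(posterior.std())
    print(f"n = {n:>5}: posterior mean = {posterior.mean():.3f}, posterior sd = {posterior.std():.4f}")
```

The posterior standard deviation shrinks roughly in proportion to $1 / \sqrt{n}$: going from 500 to 2,000 observations, four times the data, approximately halves it.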

To make that pattern easier to see, the next cell runs a fresh, longer simulated experiment and then looks at prefixes of that run after 20, 100, 500, and 2,000 observations per variant. So this section is an illustration of how the posterior evolves in a new run, not a continuation of the toy experiment above.

In [8]:

# Explore how the posterior tightens with more data

checkpoints = [20, 100, 500, 2_000]

# Fresh simulated run used only to illustrate how the posterior tightens with more data.
# Use a seed different from the toy experiment so the two runs do not share draws.
posterior_tightening_rng = np.random.default_rng(20260427)
data_a = posterior_tightening_rng.binomial(n=1, p=true_rate_a, size=max(checkpoints))
data_b = posterior_tightening_rng.binomial(n=1, p=true_rate_b, size=max(checkpoints))

fig, axes = plt.subplots(2, 2, figsize=(11, 7), sharex=True)
for ax, n in zip(axes.ravel(), checkpoints, strict=False):
    a_alpha, a_beta = beta_posterior_parameters(successes=int(data_a[:n].sum()), trials=n)
    b_alpha, b_beta = beta_posterior_parameters(successes=int(data_b[:n].sum()), trials=n)
    plot_beta_density(ax=ax, alpha=a_alpha, beta=a_beta, label=f"A after {n} trials", color="tab:blue")
    plot_beta_density(ax=ax, alpha=b_alpha, beta=b_beta, label=f"B after {n} trials", color="tab:orange")
    ax.axvline(true_rate_a, color="tab:blue", linestyle="--", alpha=0.6, label="true rate of A")
    ax.axvline(true_rate_b, color="tab:orange", linestyle="--", alpha=0.6, label="true rate of B")
    ax.set_xlim(0.0, 0.50)
    ax.set_title(f"Posterior after {n} observations per variant")
    ax.legend(loc="upper right")

plt.tight_layout()
plt.show()

How Much Evidence Do We Need?

Because this is still an offline decision problem, the amount of data matters a lot. The previous section showed how the posterior over expected reward tightens within one experiment. The next question is what that means across many possible repetitions of the same experiment.

If we repeated the same randomized A/B test many times, how often would our decision rule produce strong evidence for B?

The heatmap below answers that repeated-experiment question exactly under the model, with A's true CTR fixed at a baseline rate of 0.10. For each sample size and true absolute lift in CTR (equivalently, in expected reward), we consider all possible observed click counts for A and B, compute $P(B > A \mid \mathrm{data})$, and sum the probability mass of the experiments where that posterior probability exceeds 0.95.

So each cell tells us the exact fraction of experiments, under the assumed Binomial model, where our Bayesian decision rule would conclude there is strong evidence for B.
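To make the enumeration concrete, here is a brute-force sketch for a single small case: $n = 20$ users per variant. It is not the implementation used for the heatmap. It replaces the exact Beta formula and the binary search with a plain numerical integral on a grid, so outcomes whose posterior probability lies very close to the threshold could in principle be classified differently. The helper `fraction_confident` and the grid size are hypothetical choices.

```python
import numpy as np
from scipy import stats

n = 20
baseline_rate = 0.10
threshold = 0.95

counts = np.arange(n + 1)
# Posterior shape parameters under the uniform Beta(1, 1) prior, one row per click count.
alphas = (1 + counts)[:, None]
betas = (1 + n - counts)[:, None]

# Midpoint rule: P(theta_B > theta_A) = integral over t of pdf_B(t) * cdf_A(t).
grid_size = 4000
grid = (np.arange(grid_size) + 0.5) / grid_size
pdf = stats.beta.pdf(grid[None, :], alphas, betas)  # shape (n + 1, grid_size)
cdf = stats.beta.cdf(grid[None, :], alphas, betas)
prob_b_better = cdf @ pdf.T / grid_size  # entry [clicks_a, clicks_b]
confident = (prob_b_better > threshold).astype(float)


def fraction_confident(lift: float) -> float:
    """Probability mass of outcomes (clicks_a, clicks_b) with P(B > A | data) above the threshold."""
    pmf_a = stats.binom.pmf(counts, n, baseline_rate)
    pmf_b = stats.binom.pmf(counts, n, baseline_rate + lift)
    return float(pmf_a @ confident @ pmf_b)


for lift in [0.00, 0.05]:
    print(f"lift = +{lift:.2f}: fraction = {fraction_confident(lift):.3f}")
```

The table `confident` depends only on the observed counts, not on the true rates, so it is computed once and reused for every lift. The notebook's version exploits the same structure, and additionally uses the fact that the posterior probability increases with B's click count to find a single critical count per value of `clicks_a`.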

In [9]:

# Reuse exact P(B > A | data) and plot how often evidence clears a threshold.
def min_clicks_b_for_confident_win(
    n: int,
    threshold: float,
) -> IntArray:
    """Find the minimum B-click count needed to make P(B > A | data) exceed the threshold.

    The threshold test uses the exact Beta-Beta posterior probability from
    Evan Miller's Formulas for Bayesian A/B Testing:
    https://www.evanmiller.org/bayesian-ab-testing.html .
    """
    critical = np.full(n + 1, n + 1, dtype=int)
    for clicks_a in range(n + 1):
        low, high = 0, n
        found = n + 1
        while low <= high:
            mid = (low + high) // 2
            prob_b_better = exact_posterior_probability_b_beats_a(
                n=n,
                clicks_a=clicks_a,
                clicks_b=mid,
            )
            if prob_b_better > threshold:
                found = mid
                high = mid - 1
            else:
                low = mid + 1
        critical[clicks_a] = found
    return critical

baseline_rate = 0.10
sample_sizes = [20, 50, 100, 250, 500, 1_000, 2_000]
lifts = [0.00, 0.01, 0.02, 0.03, 0.05]
posterior_threshold = 0.95

critical_clicks_by_sample_size = {
    n: min_clicks_b_for_confident_win(n=n, threshold=posterior_threshold) for n in sample_sizes
}

heatmap_exact_repeated_experiments = np.zeros((len(lifts), len(sample_sizes)))

for i, lift in enumerate(lifts):
    lifted_rate_b = baseline_rate + lift
    for j, n in enumerate(sample_sizes):
        clicks_a_values = np.arange(n + 1)
        prob_clicks_a = stats.binom.pmf(k=clicks_a_values, n=n, p=baseline_rate)
        critical_clicks_b = critical_clicks_by_sample_size[n]
        prob_confident_b_given_clicks_a = np.array([
            stats.binom.sf(k=critical - 1, n=n, p=lifted_rate_b) if critical <= n else 0.0
            for critical in critical_clicks_b
        ])
        heatmap_exact_repeated_experiments[i, j] = np.sum(
            prob_clicks_a * prob_confident_b_given_clicks_a
        )

plt.figure(figsize=(10, 4))
sns.heatmap(
    heatmap_exact_repeated_experiments,
    annot=True,
    fmt=".2f",
    cmap="cividis",
    xticklabels=sample_sizes,
    yticklabels=[f"+{lift:.2f}" for lift in lifts],
)
plt.xlabel("observations per variant")
plt.ylabel("true absolute lift of B over A")
plt.title("Exact fraction of experiments where posterior P(B > A | data) > 0.95")
plt.tight_layout()
plt.show()

Short Note on Real-World Use

Real A/B tests are messier than this toy example. In practice, we also need to think about data quality, sample-ratio mismatches, multiple metrics, stopping rules, and business constraints. We also need to ask whether the measured outcome is a good reward signal for the value we actually care about, and what the decision means for users and the product.

The core idea is still the same:

random assignment makes the comparison causal,
observed outcomes update our uncertainty about each variant's CTR, or expected immediate reward,
the posterior lets us ask decision questions such as $P(B > A \mid \mathrm{data})$.

The important limitation, for this series, is that the decision happens at the end. We collect the data, analyze it, and choose once. In the next post, we stop waiting until the end and start reallocating traffic while evidence arrives.

Summary

In this stripped-down A/B testing setting:

action: show variant A or variant B,
policy: a fixed randomized assignment rule; in one-state notation, a fixed $\pi(a \mid s_0)$,
feedback: an immediate click or no-click outcome,
state: a trivial repeated state $s_0$; each exposure is treated as independent,
what we learn directly: posterior uncertainty over each variant's CTR; because reward is binary and immediate, this is also uncertainty over expected immediate reward,
decision quantity: posterior probability that one action has higher expected immediate reward than the other, such as $P(B > A \mid \mathrm{data})$.

This is still a fixed experiment with immediate feedback, no evolving state, and no adaptive policy. The posterior belief helps with the final ship decision, but it does not change the traffic-allocation policy while the experiment is running. The setting is deliberately simple: uncertainty matters, but online decision-making does not enter yet. In the next post, the same posterior-probability idea becomes an adaptive action distribution.

References

Formulas for Bayesian A/B Testing, Evan Miller: compact reference for computing posterior probabilities in Bayesian A/B tests.