Beta Priors for Self-Reinforcing Binary Decisions

Why the Beta prior naturally matches sequential binary feedback.

Beta priors are natural for sequential binary decisions because they match the structure of the feedback: each observation is either a success or a failure, and the posterior update only needs to add one count to one side.

That is why Bayesian multi-armed bandit examples with binary feedback often use one Beta prior for each action's unknown success probability. The same count-based update also appears in A/B testing, online experiments, diagnostics, quality control, and other repeated yes/no feedback loops.

The previous post introduced the Beta distribution as a distribution over probabilities and showed how binary data updates its two parameters. This post uses that result online: one observation arrives, one of the two counts increments, the predictive probability changes, and the same posterior predictive loop can be represented by a Pólya urn.

The Beta-Bernoulli Update Loop

Start with one action and one unknown success probability $\theta$. Each time we choose the action, we observe one binary outcome:

$$ Y_t \in \{0, 1\}. $$

The outcome $Y_t$ is the feedback we observe. The parameter $\theta$ is not an outcome; it is the unknown probability that the outcome is $1$ when this action is chosen.

If the current belief is

$$ \theta \sim \mathrm{Beta}(\alpha, \beta), $$

then one new observation updates the parameters by counting:

$$ \alpha' = \alpha + y_t, \qquad \beta' = \beta + (1-y_t). $$

A success increments $\alpha$; a failure increments $\beta$. This is the same conjugacy property from the previous post, now used one observation at a time.
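As a minimal sketch of this one-step update, the function below applies the counting rule to a single observation. The name `update_beta` and the `Beta(1, 1)` starting point are illustrative choices, not something fixed by the model.

```python
def update_beta(alpha, beta, y):
    """Return the Beta parameters after one binary observation y (0 or 1)."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return alpha + y, beta + (1 - y)

# Start from a Beta(1, 1) belief (an illustrative choice) and observe 1, 0, 1.
alpha, beta = 1, 1
for y in [1, 0, 1]:
    alpha, beta = update_beta(alpha, beta, y)

print(alpha, beta)  # 3 2
```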

For a bandit with multiple actions, this loop runs separately for each action. Action $a$ has its own unknown success probability $\theta_a$ and its own counts $\alpha_a$ and $\beta_a$. Only the chosen action gets updated after each observed success or failure.
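The per-action bookkeeping can be sketched as a small table of counts. The action names `"A"` and `"B"`, the helper `observe`, and the logged outcomes are all hypothetical; how the action is chosen is deliberately left out, since only the update is shown here.

```python
# Hypothetical action names; each action keeps its own [alpha, beta] counts.
counts = {"A": [1, 1], "B": [1, 1]}

def observe(counts, action, y):
    """Update only the chosen action's counts with a binary outcome y."""
    counts[action][0] += y
    counts[action][1] += 1 - y

# Hypothetical log of (chosen action, observed outcome) pairs.
for action, y in [("A", 1), ("B", 0), ("A", 1), ("A", 0)]:
    observe(counts, action, y)

print(counts)  # {'A': [3, 2], 'B': [1, 2]}
```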

The success and failure counts are sufficient statistics for this model: once we know how many successes and failures we have seen, the order of those observations no longer matters for the posterior. That is useful online: each new observation changes one count and leaves the rest of the summary untouched.
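One way to check the order-invariance claim is to run the count update over every ordering of a small data set and confirm that the posterior parameters never change. The data and the `Beta(2, 2)` prior below are illustrative assumptions.

```python
from itertools import permutations

def posterior_counts(alpha0, beta0, ys):
    """Apply the count update to every observation in ys, in order."""
    alpha, beta = alpha0, beta0
    for y in ys:
        alpha, beta = alpha + y, beta + (1 - y)
    return alpha, beta

ys = [1, 1, 0, 1, 0]  # illustrative data: 3 successes, 2 failures
results = {posterior_counts(2, 2, order) for order in permutations(ys)}
print(results)  # {(5, 4)}
```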

The plot below shows the same update after a short sequence of binary observations. The posterior stays in the Beta family; only the two counts move.

[Figure: Beta posterior after a short sequence of binary observations]

The Predictive View

The current posterior also gives the predictive probability for the next observation. We do not know the true success probability $\theta$, so we average over the current posterior belief about it.

If the current belief is

$$ \theta \mid y_{1:t} \sim \mathrm{Beta}(\alpha_t, \beta_t), $$

then the posterior predictive probability of success is the posterior mean of $\theta$:

$$ P(Y_{t+1}=1 \mid y_{1:t}) = \mathbb{E}[\theta \mid y_{1:t}] = \frac{\alpha_t}{\alpha_t + \beta_t}. $$

In other words, the next feedback is predicted as a Bernoulli outcome whose success probability is the posterior mean:

$$ Y_{t+1} \mid y_{1:t} \sim \mathrm{Bernoulli}\!\left(\frac{\alpha_t}{\alpha_t + \beta_t}\right). $$

If we started from $\mathrm{Beta}(\alpha_0,\beta_0)$ and have seen $s_t$ successes and $f_t$ failures, then $\alpha_t = \alpha_0 + s_t$ and $\beta_t = \beta_0 + f_t$. The predictive probability can therefore also be written as

$$ P(Y_{t+1}=1 \mid y_{1:t}) = \frac{\alpha_0 + s_t}{\alpha_0 + \beta_0 + s_t + f_t}. $$
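A small sketch of this formula, using exact fractions so the arithmetic is easy to follow. The function name and the numbers are illustrative, and `Fraction` requires whole-number parameters, which is an assumption of this sketch rather than of the Beta model.

```python
from fractions import Fraction

def predictive_success(alpha0, beta0, successes, failures):
    """Posterior predictive P(Y_{t+1} = 1) from prior counts and observed counts."""
    return Fraction(alpha0 + successes, alpha0 + beta0 + successes + failures)

# Illustrative numbers: Beta(2, 2) prior, then 3 successes and 1 failure.
p = predictive_success(2, 2, successes=3, failures=1)
print(p, float(p))  # 5/8 0.625
```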

After the next observation arrives, the same loop continues:

$$ \alpha_{t+1} = \alpha_t + y_{t+1}, \qquad \beta_{t+1} = \beta_t + (1-y_{t+1}). $$

So the posterior does two jobs at once: it summarizes what we have learned so far, and through its posterior predictive distribution it gives the probability for the next outcome. When we sample an outcome from that predictive distribution and feed it back as evidence, we get exactly the Pólya urn update.

This is the predictive reason the parameters are often described as pseudo-counts: in the fraction above, $\alpha_0$ and $\beta_0$ sit in the same denominator as the observed counts. They are not fake data in a literal sense, but they behave like prior evidence in both the update and the next-outcome probability.
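One way to see the pseudo-count reading is to rewrite the predictive probability as a weighted average of the prior mean and the observed success frequency. With $n = s_t + f_t > 0$ observations:

$$ \frac{\alpha_0 + s_t}{\alpha_0 + \beta_0 + n} = \frac{\alpha_0 + \beta_0}{\alpha_0 + \beta_0 + n} \cdot \frac{\alpha_0}{\alpha_0 + \beta_0} + \frac{n}{\alpha_0 + \beta_0 + n} \cdot \frac{s_t}{n}. $$

As an illustrative case, a $\mathrm{Beta}(1,1)$ prior followed by $8$ successes in $10$ observations gives

$$ \frac{1 + 8}{2 + 10} = \frac{2}{12} \cdot \frac{1}{2} + \frac{10}{12} \cdot \frac{8}{10} = \frac{9}{12} = 0.75. $$

The prior carries the weight of $\alpha_0 + \beta_0 = 2$ observations, and that weight shrinks relative to the data as $n$ grows.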

The Pólya Urn Connection

A Pólya urn gives a concrete version of the same posterior predictive loop.

This is a generative picture, not a literal claim that the world samples from our belief. In Beta-Bernoulli inference, a fixed but unknown $\theta$ generates the data. In the urn, the current composition generates the next draw. Averaged over the Beta prior, the two views produce the same sequence distribution, which is why the urn is a faithful picture of the posterior predictive loop.

Imagine an urn with red and blue balls. In the urn itself, these are just colors. To keep the physical picture literal, take $\alpha$ and $\beta$ to be whole-number initial counts here:

  • it starts with $\alpha$ red balls,
  • and $\beta$ blue balls.

Then the Pólya urn process follows this four-step loop:

  1. draw one ball,
  2. observe its color,
  3. put it back,
  4. add one extra ball of the same color.

To connect this notation with the Beta-Bernoulli model, let red represent Bernoulli outcome $\textcolor{#c44e52}{y=1}$ and blue represent $\textcolor{#4c72b0}{y=0}$. After $t$ urn draws, suppose we have observed $r_t$ red draws and $b_t$ blue draws. The urn then contains $\alpha + r_t$ red balls and $\beta + b_t$ blue balls, so the probability of drawing red next is

$$ P(\text{red next}) = \frac{\textcolor{#c44e52}{\alpha} + \textcolor{#c44e52}{r_t}}{(\textcolor{#c44e52}{\alpha} + \textcolor{#c44e52}{r_t}) + (\textcolor{#4c72b0}{\beta} + \textcolor{#4c72b0}{b_t})} = \frac{\textcolor{#c44e52}{\alpha} + \textcolor{#c44e52}{r_t}}{\textcolor{#c44e52}{\alpha} + \textcolor{#4c72b0}{\beta} + t}. $$

Here $t = \textcolor{#c44e52}{r_t} + \textcolor{#4c72b0}{b_t}$.

This is reinforcement in the predictive process. A red draw is recorded and reinforced by adding another red ball to the urn. That changes the urn composition, so red is slightly more likely on the next draw. A blue draw does the same in the other direction. The process has memory, but the memory is compressed into two counts.
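The draw-replace-add loop can be sketched in a few lines. This is a minimal simulation under assumed settings (one red and one blue ball to start, 20 draws, a fixed seed); it is not the code behind the plots in this post.

```python
import random

def polya_urn(red, blue, draws, rng):
    """Simulate draw-replace-add; return the list of draws (1 = red, 0 = blue)."""
    history = []
    for _ in range(draws):
        p_red = red / (red + blue)      # current composition sets the next draw
        y = 1 if rng.random() < p_red else 0
        red += y                        # reinforce the drawn color
        blue += 1 - y
        history.append(y)
    return history

rng = random.Random(0)  # fixed seed so the sketch is repeatable
history = polya_urn(red=1, blue=1, draws=20, rng=rng)
reds = sum(history)
print(history)
print("next P(red) =", (1 + reds) / (1 + 1 + len(history)))
```

The only state carried between draws is the pair of ball counts, which is the compressed memory described above.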

With this mapping, $r_t$ and $b_t$ are the two counts that update a $\mathrm{Beta}(\textcolor{#c44e52}{\alpha}, \textcolor{#4c72b0}{\beta})$ prior. The urn's next-red probability then matches the Beta-Bernoulli posterior predictive probability for $\textcolor{#c44e52}{y=1}$: current evidence changes the next predictive probability, and the next sampled outcome becomes evidence for the following prediction.
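The match can also be checked for whole sequences. The sketch below multiplies the urn's step-by-step draw probabilities and compares the result with the Beta-Bernoulli marginal, which for a sequence with $r$ ones and $b$ zeros is $B(\alpha + r, \beta + b) / B(\alpha, \beta)$. The function names are illustrative, and the factorial form of the Beta function assumes whole-number parameters.

```python
from fractions import Fraction
from math import factorial

def urn_sequence_prob(alpha, beta, ys):
    """Chain-rule probability of a color sequence under the urn (1 = red)."""
    prob, red, blue = Fraction(1), alpha, beta
    for y in ys:
        p_red = Fraction(red, red + blue)
        prob *= p_red if y == 1 else 1 - p_red
        red, blue = red + y, blue + (1 - y)
    return prob

def beta_fn(a, b):
    """Beta function B(a, b) for positive integers, via factorials."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def beta_bernoulli_sequence_prob(alpha, beta, ys):
    """Marginal probability of ys with theta integrated out over Beta(alpha, beta)."""
    r = sum(ys)
    b = len(ys) - r
    return beta_fn(alpha + r, beta + b) / beta_fn(alpha, beta)

ys = [1, 0, 1]
print(urn_sequence_prob(2, 1, ys), beta_bernoulli_sequence_prob(2, 1, ys))  # 1/10 1/10
```

Reordering `ys` leaves both numbers unchanged, which is the exchangeability behind the two-count summary.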

The interactive plot below runs this draw-replace-add loop many times and aggregates the path segments. Within each draw count $t$, brighter segments have been walked by more simulated trajectories than the other segments at that same step. The color scale is log-compressed so rare branches stay visible. The bar chart underneath counts where completed trajectories end.

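[Interactive plot: aggregated Pólya urn trajectories, with a bar chart of where completed trajectories end]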

Before the Limit: The Beta-Binomial Distribution

The urn gives a one-step predictive rule, but the endpoint plots ask a longer-run question: after $T$ reinforced draws, where can an urn trajectory end?

For a fixed number of urn draws $T$, the endpoint distribution is still discrete. There are only $T+1$ possible numbers of red draws:

$$ R_T \in \{0, 1, \ldots, T\}. $$

Those finite endpoint counts follow a Beta-binomial distribution:

$$ R_T \sim \mathrm{BetaBinomial}(T, \alpha, \beta). $$

This is related to the Beta-Bernoulli update, but it is not the same object. Beta-Bernoulli describes one binary observation at a time and how the Beta belief updates after observing $y \in \{0,1\}$. Beta-binomial describes a finite-count prediction: how many red, or $y=1$, outcomes we might see after $T$ reinforced draws.
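To make the link between the two objects concrete, the sketch below computes the endpoint distribution in two ways: from the closed form $\binom{T}{k} B(\alpha + k, \beta + T - k) / B(\alpha, \beta)$, and by pushing probability mass through $T$ one-step urn updates. The function names and the test values are illustrative, and whole-number parameters are assumed so the arithmetic can stay exact.

```python
from fractions import Fraction
from math import comb, factorial

def beta_fn(a, b):
    """Beta function B(a, b) for positive integers, via factorials."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def beta_binomial_pmf(k, T, alpha, beta):
    """Closed-form P(R_T = k) for whole-number alpha and beta."""
    return comb(T, k) * beta_fn(alpha + k, beta + T - k) / beta_fn(alpha, beta)

def urn_endpoint_pmf(T, alpha, beta):
    """Same distribution, built by propagating mass through T urn steps."""
    pmf = {0: Fraction(1)}
    for t in range(T):
        nxt = {}
        for r, mass in pmf.items():
            p_red = Fraction(alpha + r, alpha + beta + t)  # one-step predictive
            nxt[r + 1] = nxt.get(r + 1, 0) + mass * p_red
            nxt[r] = nxt.get(r, 0) + mass * (1 - p_red)
        pmf = nxt
    return pmf

T, alpha, beta = 4, 2, 3
urn = urn_endpoint_pmf(T, alpha, beta)
print(all(urn[k] == beta_binomial_pmf(k, T, alpha, beta) for k in range(T + 1)))  # True
```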

The final urn ratio in the plots is

$$ Z_T = \frac{\alpha + R_T}{\alpha + \beta + T}. $$

For small $T$, this ratio lives on a coarse grid of possible values. As $T$ grows, the grid becomes finer. In the limit, the distribution of $Z_T$ approaches the $\mathrm{Beta}(\alpha,\beta)$ distribution, which is continuous over the interval $[0,1]$.
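A special case makes the refinement easy to see. With $\alpha = \beta = 1$, the Beta-binomial distribution is uniform over the possible counts:

$$ P(R_T = k) = \frac{1}{T+1}, \qquad k = 0, 1, \ldots, T. $$

The ratio $Z_T = (1 + k)/(2 + T)$ is then uniform on the grid

$$ \left\{ \frac{1}{T+2}, \frac{2}{T+2}, \ldots, \frac{T+1}{T+2} \right\}, $$

whose spacing $1/(T+2)$ shrinks as $T$ grows. The limit is the uniform distribution on $[0,1]$, which is exactly $\mathrm{Beta}(1,1)$.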

The finite-time Beta-binomial view explains what the plots below are showing. In each panel, we compute the exact distribution of final red ratios after a fixed number of draws and compare that finite endpoint distribution with the limiting Beta density.

The blue step function shows the exact finite distribution on its natural grid, scaled as a density so each step's area equals its probability mass. The black curve is the $\mathrm{Beta}(\alpha,\beta)$ density. As the trajectories get longer, the possible final ratios become more finely spaced, and the step function lines up with the Beta density. Only in the limit of increasingly long sequences of draws do the final red ratios approach the Beta distribution.
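The density scaling can be sketched as follows: each Beta-binomial mass is divided by the grid spacing $1/(\alpha + \beta + T)$, so that height times width recovers the mass. This is an illustrative reconstruction with assumed parameter values, not the notebook's plotting code, and it leaves out how the steps are positioned on the axis.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Log of the Beta function, valid for any positive a and b."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_comb(n, k):
    """Log of the binomial coefficient, computed in log space to avoid overflow."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def endpoint_density_steps(T, alpha, beta):
    """Return (ratio, height) pairs: Beta-binomial mass divided by grid spacing."""
    spacing = 1.0 / (alpha + beta + T)  # distance between neighboring Z_T values
    steps = []
    for k in range(T + 1):
        log_mass = (log_comb(T, k)
                    + log_beta(alpha + k, beta + T - k)
                    - log_beta(alpha, beta))
        ratio = (alpha + k) / (alpha + beta + T)
        steps.append((ratio, exp(log_mass) / spacing))
    return steps

steps = endpoint_density_steps(T=50, alpha=2.0, beta=3.0)
area = sum(height for _, height in steps) / (2.0 + 3.0 + 50)
print(round(area, 6))  # 1.0
```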

A single urn trajectory ends at one final ratio. The Beta distribution describes the distribution of those endpoints across possible trajectories, not a deterministic destination for one trajectory.

[Figure: exact finite endpoint distributions of the final red ratio compared with the limiting Beta density]

What the Beta-Bernoulli Loop Assumes

The Beta-Bernoulli update is online and interpretable, but it is still a model. Binary feedback is the setup; the stronger modeling assumptions are about stability, feedback, and what the counts summarize.

It assumes:

  • each action has a stable probability of outcome $y=1$,
  • feedback is observed for the action we actually took,
  • counts of $y=1$ and $y=0$ outcomes are enough to summarize what we have learned.

These assumptions fit simple A/B tests, binary bandits, and many repeated yes/no decision problems. They are too narrow when outcomes have different values, rewards are continuous, actions affect future states, or the environment changes over time.

The Pólya urn is the posterior predictive loop made physical: each draw is sampled from the current predictive probability, then added back as evidence that reinforces the next prediction. It shows what happens inside the Beta-Bernoulli model when we repeatedly sample from the posterior predictive distribution and feed each sampled outcome back in as evidence.

Summary

The Beta prior is natural for sequential binary decisions because it matches the structure of the feedback.

Each action has an unknown probability of outcome $y=1$. Each time we choose that action, we observe one binary outcome. If $y_t=1$, we increment $\alpha_a$; if $y_t=0$, we increment $\beta_a$. The posterior remains Beta, so the learner only needs to carry two counts per action.

Those same counts also define the next prediction. The posterior mean gives the posterior predictive probability for the next $y=1$ outcome:

$$ P(Y_{t+1}=1 \mid y_{1:t}) = \frac{\alpha_t}{\alpha_t + \beta_t}. $$

The Pólya urn gives a concrete picture of the same posterior predictive update. A red draw is sampled from the current urn composition, then added back as evidence for the next draw; a blue draw does the same in the other direction. That reinforcement makes future draws depend on previous draws, while the final counts still carry the information used by the Beta-Bernoulli posterior.

The urn also gives the finite-to-limit bridge. After $T$ draws, the number of red draws follows a Beta-binomial distribution. After converting that count into a final red ratio and letting $T$ grow, the distribution approaches $\mathrm{Beta}(\alpha,\beta)$.

The update itself stays small:

$$ \alpha_a \leftarrow \alpha_a + y_t, \qquad \beta_a \leftarrow \beta_a + (1-y_t) $$

for the action $a$ we actually chose. That is the compact Beta-Bernoulli update loop behind Bayesian sequential binary decisions.

References

  • Pólya's Urn Process - RandomServices: detailed technical reference for the urn process and its connection to Beta-Bernoulli and Beta-binomial behavior.
  • Back to basics - Pólya urns - Djalil Chafaï: mathematical walkthrough of the urn ratio as a martingale and the limiting Beta distribution.
  • Pólya urn model - Wikipedia: compact reference for the draw-replace-add urn process, self-reinforcement, exchangeability, the martingale property, and convergence to a Beta distribution.
  • Beta-binomial distribution - Wikipedia: compact reference for the finite-time bridge in this post. Before urn ratios approach a continuous Beta distribution, the number of red draws after $T$ draws follows a discrete Beta-binomial distribution.

Package versions used for this post:

python: 3.13.12
numpy: 2.4.4
scipy: 1.17.1
matplotlib: 3.10.9
seaborn: 0.13.2
bokeh: 3.9.0
jupyter-bokeh: 4.0.5

This post is generated from an IPython notebook file.