The Beta Distribution: A Distribution Over Probabilities

Many statistical problems start with an unknown probability. A visitor might convert or not. A coin might land heads or tails. A machine part might fail or keep working. A user might click a recommendation or ignore it.

In all of these examples, the quantity we want to infer is a probability, so it must lie between $0$ and $1$. The Beta distribution is a distribution over such probabilities. It lets us describe which values of the unknown probability seem plausible before or after seeing binary data.

We'll build that idea by moving through four steps:

  • model binary feedback with Bernoulli and Binomial distributions,
  • turn the Binomial view around: after seeing successes and failures, ask which success probabilities are plausible,
  • illustrate why the binary likelihood has the same shape as a Beta density,
  • use the same shape to update Beta priors by adding observed successes and failures.

The Setting

We'll start with a repeated binary experiment. Each observation $Y_t$ can take one of two values:

$$ Y_t \in \{0, 1\}. $$

Here, $1$ means success and $0$ means failure. What counts as success depends on the application: a click, a conversion, a purchase, heads in a coin flip, a positive diagnostic result, or a passed quality check.

The unknown quantity is the success probability $\theta$:

$$ \theta = P(Y_t = 1). $$

The observed outcome $Y_t$ is binary, but $\theta$ is not an outcome. It is the unknown probability of success. That means the parameter space for $\theta$ is the interval $[0,1]$, and a prior over $\theta$ should put its probability mass on that interval. A Beta distribution has exactly the support we need:

$$ 0 \leq \theta \leq 1. $$

Bernoulli and Binomial Feedback

We usually model a single binary observation as a Bernoulli random variable:

$$ Y_t \sim \mathrm{Bernoulli}(\theta), $$

where $Y_t = 1$ with probability $\theta$ and $Y_t = 0$ with probability $1-\theta$.
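A minimal sketch of this model in Python, using only the standard library. The helper name `sample_bernoulli`, the seed, and the value $\theta = 0.3$ are illustrative assumptions, not taken from the original notebook.

```python
import random


def sample_bernoulli(theta, rng):
    """Return 1 with probability theta and 0 with probability 1 - theta."""
    return 1 if rng.random() < theta else 0


rng = random.Random(0)  # fixed seed so the run is reproducible
theta = 0.3             # illustrative success probability
draws = [sample_bernoulli(theta, rng) for _ in range(10_000)]

# With many draws, the fraction of successes should be close to theta.
empirical_rate = sum(draws) / len(draws)
print(f"empirical success rate: {empirical_rate:.3f}")
```

In a real problem we only see the draws, not $\theta$; the rest of the post is about going from the draws back to $\theta$.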

If we repeat the same kind of interaction $n$ times, keep the success probability fixed, and let $K = \sum_{t=1}^n Y_t$ be the random number of successes, then $K$ follows a Binomial distribution:

$$ K \sim \mathrm{Binomial}(n, \theta). $$

These are two views of the same binary process. The Bernoulli view is the one-at-a-time view. The Binomial view is the summary view: after $n$ binary observations, the observed data can be summarized by the number of successes and failures. When $n=1$, the Binomial distribution reduces to the Bernoulli distribution.
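The Binomial probabilities can be computed directly from the formula $P(K = k) = \binom{n}{k}\theta^k(1-\theta)^{n-k}$. The sketch below assumes illustrative values $n = 10$ and $\theta = 0.3$; the function name `binomial_pmf` is hypothetical.

```python
from math import comb


def binomial_pmf(k, n, theta):
    """P(K = k) for K ~ Binomial(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


n, theta = 10, 0.3  # illustrative values
pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]
total = sum(pmf)  # probabilities over k = 0..n sum to 1 up to rounding

# With n = 1 the Binomial puts mass 1 - theta on k = 0 and theta on k = 1,
# which is the Bernoulli distribution.
bernoulli_case = [binomial_pmf(k, 1, theta) for k in (0, 1)]
print(total, bernoulli_case)
```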

The widget below illustrates the Binomial view. It plots the probability of observing each possible number of successes $k$ after $n$ repeated binary trials with success probability $\theta$.


A Distribution Over Probabilities

The Binomial distribution is about possible observed counts when the success probability $\theta$ is fixed. The Beta distribution turns the question around: after seeing binary data, which success probabilities $\theta$ are plausible?

This inverse question sits near the origin of Bayesian inference. Thomas Bayes's An Essay towards Solving a Problem in the Doctrine of Chances, published posthumously, asked: after an event has happened some number of times and failed some number of times, how plausible is each possible value of its unknown probability? The section below follows the same idea in modern notation, using Bayes' rule to reverse the conditioning.

The Beta distribution has two positive parameters, $\alpha$ and $\beta$:

$$ \theta \sim \mathrm{Beta}(\alpha, \beta) $$

with density

$$ p(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha - 1}(1-\theta)^{\beta - 1}}{B(\alpha,\beta)} $$

where $B(\alpha,\beta)$ is the Beta function and acts as the normalizing constant that makes the density integrate to $1$ over the interval $[0,1]$. Because this curve is nonnegative on $[0,1]$, normalizing it to integrate to $1$ lets us interpret it as a probability density over $\theta$.
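For reference, the Beta function can be written as an integral over the same interval, or equivalently in terms of the Gamma function:

$$ B(\alpha,\beta) = \int_0^1 u^{\alpha-1}(1-u)^{\beta-1}\,du = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}. $$

For positive integers, $\Gamma(n) = (n-1)!$, so for example $B(1,1) = 1$ and the $\mathrm{Beta}(1,1)$ density is constant at $1$.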

A practical way to read the parameters is to treat $\alpha$ as evidence for success and $\beta$ as evidence for failure. Their relative size controls where the distribution puts most of its mass. Their total size controls how concentrated the distribution is.
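This reading can be made precise with the standard formulas for the mean and variance of a $\mathrm{Beta}(\alpha,\beta)$ distribution:

$$ \mathbb{E}[\theta] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{Var}[\theta] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}. $$

The mean depends only on the relative size of $\alpha$ and $\beta$, while for a fixed mean the variance shrinks as the total $\alpha+\beta$ grows.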

For example:

  • $\mathrm{Beta}(1, 1)$ is uniform: every success probability starts equally plausible.
  • $\mathrm{Beta}(8, 2)$ puts more mass on high success probabilities.
  • $\mathrm{Beta}(2, 8)$ puts more mass on low success probabilities.
  • $\mathrm{Beta}(20, 20)$ is centered near $0.5$ but more concentrated than, say, $\mathrm{Beta}(5, 5)$.
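The examples above can be checked numerically. The following is a minimal standard-library sketch; the helper names `beta_pdf` and `beta_mean_sd` are hypothetical, and the mean and standard deviation use the standard closed-form expressions for a Beta distribution. The normalizing constant is computed on the log scale with `math.lgamma` to avoid overflow for large parameters.

```python
from math import exp, lgamma, log


def beta_pdf(theta, alpha, beta):
    """Density of Beta(alpha, beta) at theta, for 0 < theta < 1."""
    log_norm = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)  # log B(alpha, beta)
    log_kernel = (alpha - 1) * log(theta) + (beta - 1) * log(1 - theta)
    return exp(log_kernel - log_norm)


def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of Beta(alpha, beta)."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total**2 * (total + 1))
    return mean, variance**0.5


for alpha, beta in [(1, 1), (8, 2), (2, 8), (20, 20), (5, 5)]:
    mean, sd = beta_mean_sd(alpha, beta)
    print(f"Beta({alpha}, {beta}): mean={mean:.2f}, sd={sd:.3f}, "
          f"density at 0.5 = {beta_pdf(0.5, alpha, beta):.3f}")
```

$\mathrm{Beta}(20, 20)$ and $\mathrm{Beta}(5, 5)$ share the mean $0.5$, but the first has the smaller standard deviation, which is what "more concentrated" means here.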

The curves below show those densities:


The following interactive widget lets you explore how the Beta distribution changes as you vary $\alpha$ and $\beta$.


Why the Beta Shape Appears

The Beta distribution appears naturally when we look at binary data from the other direction.

In the Binomial distribution, the probability $\theta$ is treated as fixed and the number of successes is random. After seeing the data, the perspective changes: the observed successes and failures are fixed, and $\theta$ is the unknown quantity. We ask which values of $\theta$ make the observations more or less plausible.

To compare possible values of $\theta$ after the data are observed, the likelihood function gives us a data-based score. The observed data stay fixed, and the unknown success probability $\theta$ is what varies. For each candidate value of $\theta$, the likelihood asks: if $\theta$ had this value, how plausible would the observed successes and failures be?

The shape is easiest to see by starting from one binary observation. A success occurs with probability $\theta$, and a failure occurs with probability $1-\theta$. Assuming the observations are independent, the probability of a whole sequence is the product of these factors, one per observation. So any particular sequence with $s$ successes and $f$ failures has probability

$$ \theta^s(1-\theta)^f. $$

If we summarize the data by counts rather than by the exact order, the Binomial coefficient counts how many such sequences there are. But that coefficient does not depend on $\theta$. Since likelihood is used as a relative score over possible values of $\theta$, we can ignore constants that do not depend on $\theta$.

The likelihood of $\theta$, given $s$ successes and $f$ failures, is therefore proportional to

$$ \mathcal{L}(\theta \mid s, f) \propto \theta^s(1-\theta)^f. $$

This is not yet a probability density over $\theta$; it is a relative score for different possible values of $\theta$. The important part is its shape: a power of $\theta$ times a power of $1-\theta$.
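A small sketch of the likelihood as a score over candidate values of $\theta$, assuming illustrative counts $s = 7$ and $f = 3$ and a hypothetical helper `likelihood_kernel`. It also checks that multiplying by the Binomial coefficient leaves the comparison between candidate values unchanged.

```python
from math import comb


def likelihood_kernel(theta, s, f):
    """Unnormalized likelihood theta^s (1 - theta)^f for s successes, f failures."""
    return theta**s * (1 - theta) ** f


s, f = 7, 3  # illustrative counts
grid = [i / 1000 for i in range(1001)]  # candidate values of theta in [0, 1]
scores = [likelihood_kernel(theta, s, f) for theta in grid]
best_theta = grid[scores.index(max(scores))]
print(f"theta with the highest likelihood on the grid: {best_theta}")

# A constant factor such as the Binomial coefficient rescales every score
# equally, so ratios between candidate values are unchanged.
c = comb(s + f, s)
ratio_plain = likelihood_kernel(0.7, s, f) / likelihood_kernel(0.5, s, f)
ratio_scaled = (c * likelihood_kernel(0.7, s, f)) / (c * likelihood_kernel(0.5, s, f))
print(ratio_plain, ratio_scaled)
```

The highest score on the grid sits at the observed success fraction $s/(s+f)$.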

To normalize that shape over the valid parameter space $0 \leq \theta \leq 1$, we divide by its integral over that interval, which equals the Beta function $B(s+1, f+1)$:

$$ \frac{\theta^s(1-\theta)^f}{\int_0^1 u^s(1-u)^f\,du} = \frac{\theta^s(1-\theta)^f}{B(s+1, f+1)}. $$

Read as a posterior over $\theta$, this normalized likelihood corresponds to using a uniform $\mathrm{Beta}(1,1)$ prior. The next section makes the prior explicit.

Now the connection to the Beta distribution is visible. In a Beta density, the exponents are $\alpha-1$ and $\beta-1$. Here the exponents are $s$ and $f$, so the matching Beta parameters are $\alpha = s+1$ and $\beta = f+1$.
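The normalizing integral can be checked numerically. The sketch below compares a simple midpoint-rule approximation of $\int_0^1 u^s(1-u)^f\,du$ with $B(s+1, f+1)$, which for integer counts equals $s!\,f!/(s+f+1)!$. The counts, the number of steps, and the function names are illustrative assumptions.

```python
from math import factorial


def kernel_integral(s, f, steps=100_000):
    """Midpoint-rule approximation of the integral of u^s (1 - u)^f over [0, 1]."""
    h = 1.0 / steps
    midpoints = ((i + 0.5) * h for i in range(steps))
    return h * sum(u**s * (1 - u) ** f for u in midpoints)


def beta_function_int(a, b):
    """B(a, b) for positive integers a and b, written with factorials."""
    return factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)


s, f = 7, 3  # illustrative counts
numeric = kernel_integral(s, f)
exact = beta_function_int(s + 1, f + 1)
print(numeric, exact)
```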

That is the main point of this section: binary evidence creates a Beta-shaped curve over $\theta$. The next section uses Bayes' rule to combine that curve with a prior.

Beta-Bernoulli Conjugacy: Beta In, Beta Out

To turn the likelihood into a posterior belief, Bayes' rule combines it with a prior distribution over $\theta$ and then normalizes the result:

$$ p(\theta \mid \mathrm{data}) = \frac{p(\mathrm{data} \mid \theta)\,p(\theta)}{p(\mathrm{data})}. $$

The denominator $p(\mathrm{data})$ is the normalizing constant. It makes the posterior integrate to $1$, but it does not depend on $\theta$. When we only want the shape of the posterior as a function of $\theta$, we can write

$$ p(\theta \mid \mathrm{data}) \propto p(\mathrm{data} \mid \theta)\,p(\theta). $$

So, apart from normalization, the posterior is the likelihood multiplied by the prior.

Start with a Beta prior:

$$ \theta \sim \mathrm{Beta}(\alpha, \beta). $$

After observing $s$ successes and $f$ failures, the previous section gave the likelihood shape. Ignoring constants that do not depend on $\theta$, the two pieces are

$$ p(\mathrm{data} \mid \theta) \propto \theta^s(1-\theta)^f, \qquad p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}. $$

Bayes' rule multiplies these two shapes. The matching powers of $\theta$ and $1-\theta$ add together:

$$ p(\theta \mid \mathrm{data}) \propto \theta^s(1-\theta)^f\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{\alpha+s-1}(1-\theta)^{\beta+f-1}. $$

The right-hand side has the Beta density shape, so after normalization the posterior is

$$ \theta \mid \mathrm{data} \sim \mathrm{Beta}(\alpha + s,\; \beta + f). $$

The conjugacy property of the Beta-Bernoulli model is the useful part here: the prior and posterior have the same family form. Updating only means adding success evidence to $\alpha$ and failure evidence to $\beta$.
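One way to convince yourself of this result is to carry out Bayes' rule by brute force on a grid and compare it with the closed form. The sketch below assumes an illustrative $\mathrm{Beta}(2, 2)$ prior and counts $s = 7$, $f = 3$; `beta_pdf` is a hypothetical helper defined here, not a library function.

```python
from math import exp, lgamma, log


def beta_pdf(theta, alpha, beta):
    """Density of Beta(alpha, beta) at theta, for 0 < theta < 1."""
    log_norm = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return exp((alpha - 1) * log(theta) + (beta - 1) * log(1 - theta) - log_norm)


alpha, beta = 2.0, 2.0  # illustrative prior
s, f = 7, 3             # illustrative data

steps = 2000
h = 1.0 / steps
grid = [(i + 0.5) * h for i in range(steps)]  # midpoints, so theta is never 0 or 1

# Prior times likelihood, then normalize numerically so the area is 1.
unnormalized = [beta_pdf(t, alpha, beta) * t**s * (1 - t) ** f for t in grid]
area = h * sum(unnormalized)
grid_posterior = [u / area for u in unnormalized]

# Closed-form posterior from the conjugate update.
closed_form = [beta_pdf(t, alpha + s, beta + f) for t in grid]
max_abs_diff = max(abs(g - c) for g, c in zip(grid_posterior, closed_form))
print(f"largest difference on the grid: {max_abs_diff:.2e}")
```

The grid version needs a numerical integral for the normalizing constant; the conjugate version only needs two additions.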

The counts $s$ and $f$ are enough for this update. Once we know how many successes and failures we observed, the order of the observations no longer changes the posterior. In statistical terms, these counts are sufficient statistics for the Beta-Bernoulli model.

In other words: a $\mathrm{Beta}(\alpha,\beta)$ prior behaves like prior success and failure evidence. After observing $s$ successes and $f$ failures, the posterior is $\mathrm{Beta}(\alpha+s,\beta+f)$.
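The update rule and its independence from ordering can be sketched in a few lines. The function name `update` and the data sequence are illustrative assumptions.

```python
def update(alpha, beta, observations):
    """Update a Beta(alpha, beta) belief with a sequence of 0/1 observations."""
    for y in observations:
        if y == 1:
            alpha += 1  # one more success
        else:
            beta += 1   # one more failure
    return alpha, beta


prior = (1, 1)  # uniform prior
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # illustrative: 7 successes, 3 failures

one_at_a_time = update(*prior, data)
reordered = update(*prior, sorted(data))  # same counts, different order
batch = (prior[0] + sum(data), prior[1] + len(data) - sum(data))
print(one_at_a_time, reordered, batch)
```

All three routes give the same posterior parameters, because only the counts enter the update.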

The widget below starts from a uniform $\mathrm{Beta}(1,1)$ prior. Press Add success or Add failure to see the posterior update by adding one count to $\alpha$ or $\beta$. Reset returns to the prior.


Choosing Alpha and Beta

The parameters $\alpha$ and $\beta$ describe what we believe about the success probability before the current data arrive. Sometimes that belief is deliberately weak; sometimes it comes from earlier experiments or domain knowledge.

A few choices come up often:

  • $\mathrm{Beta}(1,1)$ is uniform over $[0,1]$. It gives the same density to every success probability before seeing data.
  • $\mathrm{Beta}(1/2,1/2)$ is a standard non-informative prior known as Jeffreys' prior for a Bernoulli probability.
  • Larger values such as $\mathrm{Beta}(20,5)$ express stronger prior evidence. They should be used when that evidence has a real source, such as previous experiments or domain knowledge.

The pseudo-count interpretation helps, but it should not be taken too literally. A $\mathrm{Beta}(20,5)$ prior behaves like prior evidence with a high success rate, but it is still a modelling choice, not observed data from the current experiment.
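To see how much the prior matters, the sketch below compares the posterior mean $(\alpha+s)/(\alpha+\beta+s+f)$ under the three priors above, first for a small data set and then for a larger one with the same success rate. The counts are illustrative assumptions, and `posterior_mean` is a hypothetical helper.

```python
def posterior_mean(alpha, beta, s, f):
    """Mean of the Beta(alpha + s, beta + f) posterior."""
    return (alpha + s) / (alpha + beta + s + f)


priors = {
    "uniform Beta(1, 1)": (1, 1),
    "Jeffreys Beta(1/2, 1/2)": (0.5, 0.5),
    "informative Beta(20, 5)": (20, 5),
}

for s, f in [(7, 3), (70, 30)]:  # same 70% success rate, 10x more data
    print(f"data: {s} successes, {f} failures")
    for name, (alpha, beta) in priors.items():
        print(f"  {name}: posterior mean = {posterior_mean(alpha, beta, s, f):.3f}")

# Gap between the informative and the uniform prior, for each data size.
gap_small = posterior_mean(20, 5, 7, 3) - posterior_mean(1, 1, 7, 3)
gap_large = posterior_mean(20, 5, 70, 30) - posterior_mean(1, 1, 70, 30)
```

With only ten observations the informative prior pulls the posterior mean noticeably toward its own mean of $0.8$; with a hundred observations the three posteriors are much closer together.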

What the Beta Distribution Is Not Saying

The Beta distribution models uncertainty about one stable binary probability. It is a good fit when the feedback can be reduced to success versus failure, but that reduction is also its main limitation.

It assumes:

  • the unknown quantity is a probability between $0$ and $1$,
  • each immediate outcome is binary, such as success or failure,
  • the success probability is stable for the situation being modeled,
  • counts of successes and failures are enough to summarize what we have learned.

Those assumptions fit many A/B testing, diagnostic, quality-control, and binary-feedback examples. They are too narrow when outcomes have several categories, outcomes have different values, rewards are continuous, actions affect future states, or the environment changes over time. In those cases, the Beta distribution can still be a useful reference point, but the model needs to become richer.

For problems with more than two mutually exclusive outcomes, the same idea becomes a Dirichlet distribution over category probabilities. The Beta distribution is the two-outcome special case.
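As a brief sketch of that generalization, the standard Dirichlet-Categorical conjugate update adds each category's observed count to the matching Dirichlet parameter. The function names and counts below are illustrative assumptions.

```python
def dirichlet_update(alphas, counts):
    """Posterior Dirichlet parameters: add each category's count to its parameter."""
    return [a + c for a, c in zip(alphas, counts)]


def dirichlet_mean(alphas):
    """Mean probability of each category under Dirichlet(alphas)."""
    total = sum(alphas)
    return [a / total for a in alphas]


# Three categories, uniform prior, illustrative counts.
posterior3 = dirichlet_update([1, 1, 1], [5, 3, 2])
print(posterior3, dirichlet_mean(posterior3))

# Two categories (success, failure): the same arithmetic as the Beta update
# from Beta(1, 1) to Beta(1 + 7, 1 + 3).
posterior2 = dirichlet_update([1, 1], [7, 3])
print(posterior2)
```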

Summary

The Beta distribution is a good fit when we are uncertain about one stable binary probability. It has the right support, $0 \leq \theta \leq 1$, and the normalized likelihood from binary data is itself a Beta density.

Its two parameters have a concrete interpretation: $\alpha$ behaves like success evidence and $\beta$ behaves like failure evidence. Because the Beta prior is conjugate to Bernoulli and Binomial feedback, those parameters update by addition:

$$ \alpha \leftarrow \alpha + s, \qquad \beta \leftarrow \beta + f. $$

That makes the Beta distribution both mathematically convenient and readable as a model of uncertainty about a binary probability. When the outcome is not binary, or when actions affect future states, we need a richer model.

Further Reading

python: 3.13.12
numpy: 2.4.4
scipy: 1.17.1
matplotlib: 3.10.9
seaborn: 0.13.2
bokeh: 3.9.0
jupyter-bokeh: 4.0.5

This post is generated from an IPython notebook file.