The Beta Distribution: A Distribution Over Probabilities
Many statistical problems start with an unknown probability. A visitor might convert or not. A coin might land heads or tails. A machine part might fail or keep working. A user might click a recommendation or ignore it.
In all of these examples the quantity we want to infer is a probability, so it must live between $0$ and $1$. The Beta distribution is a distribution over such probabilities. It lets us describe which values of the unknown probability seem plausible before or after seeing binary data.
We'll build that idea by moving through four steps:
- model binary feedback with Bernoulli and Binomial distributions,
- turn the Binomial view around: after seeing successes and failures, ask which success probabilities are plausible,
- illustrate why the binary likelihood has the same shape as a Beta density,
- use the same shape to update Beta priors by adding observed successes and failures.
# Imports
from __future__ import annotations
import bokeh.io
import bokeh.layouts
import bokeh.models
import bokeh.plotting
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
import seaborn as sns
from bokeh.models import CustomJS
from bokeh.resources import Resources
bokeh.io.output_notebook(
resources=Resources(mode="cdn", components=["bokeh", "bokeh-widgets"]),
hide_banner=True,
)
sns.set_theme(style="whitegrid", context="notebook")
rng = np.random.default_rng(17)
#
The Setting
We'll start with a repeated binary experiment. Each observation $Y_t$ can take one of two values:
$$ Y_t \in \{0, 1\}. $$
Here, $1$ means success and $0$ means failure. What counts as success depends on the application: a click, a conversion, a purchase, heads in a coin flip, a positive diagnostic result, or a passed quality check.
The unknown quantity is the success probability $\theta$:
$$ \theta = P(Y_t = 1). $$
The observed outcome $Y_t$ is binary, but $\theta$ is not an outcome. It is the unknown probability of success. That means the parameter space for $\theta$ is the interval $[0,1]$, and a prior over $\theta$ should put its probability mass on that interval. A Beta distribution has exactly the support we need:
$$ 0 \leq \theta \leq 1. $$
Bernoulli and Binomial Feedback
We usually model a single binary observation as a Bernoulli random variable :
$$ Y_t \sim \mathrm{Bernoulli}(\theta), $$
where $Y_t = 1$ with probability $\theta$ and $Y_t = 0$ with probability $1-\theta$.
If we repeat the same kind of interaction $n$ times, keep the success probability fixed, and let $K = \sum_{t=1}^n Y_t$ be the random number of successes, then $K$ follows a Binomial distribution :
$$ K \sim \mathrm{Binomial}(n, \theta). $$
These are two views of the same binary process. The Bernoulli view is the one-at-a-time view. The Binomial view is the summary view: after $n$ binary observations, the observed data can be summarized by the number of successes and failures. When $n=1$, the Binomial distribution reduces to the Bernoulli distribution.
The widget below illustrates the Binomial view. It plots the probability of observing each possible number of successes $k$ after $n$ repeated binary trials with success probability $\theta$.
# Interactive Binomial distribution
# Precompute an initial PMF so the static page has a visible starting state.
binomial_initial_n = 10
binomial_initial_theta = 0.35
binomial_k = np.arange(binomial_initial_n + 1)
binomial_probability = scipy.stats.binom.pmf(
binomial_k,
n=binomial_initial_n,
p=binomial_initial_theta,
)
# Bokeh stores plotted data in a ColumnDataSource that the browser can update.
source_binomial_pmf = bokeh.models.ColumnDataSource(
data=dict(k=binomial_k, probability=binomial_probability)
)
figure_binomial_pmf = bokeh.plotting.figure(
x_range=bokeh.models.Range1d(start=-0.5, end=binomial_initial_n + 0.5),
y_range=bokeh.models.Range1d(start=0.0, end=max(0.2, float(binomial_probability.max() * 1.15))),
width=620,
height=320,
title=f"Binomial PMF: n={binomial_initial_n}, theta={binomial_initial_theta:.2f}",
x_axis_label="number of successes k",
y_axis_label="probability",
tools="",
)
figure_binomial_pmf.vbar(
x="k",
top="probability",
source=source_binomial_pmf,
width=0.85,
color="navy",
alpha=0.75,
)
figure_binomial_pmf.add_tools(
bokeh.models.HoverTool(
tooltips=[("successes", "@k"), ("probability", "@probability{0.000}")],
mode="vline",
)
)
figure_binomial_pmf.xaxis.ticker = bokeh.models.SingleIntervalTicker(interval=1)
slider_binomial_theta = bokeh.models.Slider(
start=0.0,
end=1.0,
value=binomial_initial_theta,
step=0.01,
title="success probability theta",
width=280,
html_attributes={"title": "Probability of success on each binary trial."},
)
slider_binomial_n = bokeh.models.Slider(
start=1,
end=40,
value=binomial_initial_n,
step=1,
title="number of repeated trials n",
width=280,
html_attributes={"title": "Number of independent binary trials summarized by the count k."},
)
# CustomJS keeps the widget fully embedded: slider changes run in the browser,
# without needing a live Python or Bokeh server behind the blog post.
callback_binomial_pmf_code = """
function binomialCoefficient(n, k) {
if (k < 0 || k > n) {
return 0;
}
if (k === 0 || k === n) {
return 1;
}
k = Math.min(k, n - k);
let coefficient = 1;
for (let i = 1; i <= k; i++) {
coefficient = (coefficient * (n - k + i)) / i;
}
return coefficient;
}
function binomialProbability(n, k, theta) {
if (theta === 0) {
return k === 0 ? 1 : 0;
}
if (theta === 1) {
return k === n ? 1 : 0;
}
return binomialCoefficient(n, k) * Math.pow(theta, k) * Math.pow(1 - theta, n - k);
}
const theta = thetaSlider.value;
const n = Math.round(nSlider.value);
// Recompute the full PMF for the current slider values.
const ks = [];
const probabilities = [];
let maxProbability = 0;
for (let k = 0; k <= n; k++) {
const probability = binomialProbability(n, k, theta);
ks.push(k);
probabilities.push(probability);
if (probability > maxProbability) {
maxProbability = probability;
}
}
// Replace the data and ranges, then ask Bokeh to redraw the bars.
source.data = {k: ks, probability: probabilities};
figure.x_range.start = -0.5;
figure.x_range.end = n + 0.5;
figure.y_range.end = Math.max(0.2, maxProbability * 1.15);
figure.title.text = "Binomial PMF: n=" + n + ", theta=" + theta.toFixed(2);
source.change.emit();
"""
callback_binomial_pmf = CustomJS(
args=dict(
source=source_binomial_pmf,
figure=figure_binomial_pmf,
thetaSlider=slider_binomial_theta,
nSlider=slider_binomial_n,
),
code=callback_binomial_pmf_code,
module=False,
)
slider_binomial_theta.js_on_change("value", callback_binomial_pmf)
slider_binomial_n.js_on_change("value", callback_binomial_pmf)
layout_binomial_pmf = bokeh.layouts.column(
figure_binomial_pmf,
bokeh.layouts.row(slider_binomial_theta, slider_binomial_n),
)
bokeh.plotting.show(layout_binomial_pmf)
#
A Distribution Over Probabilities
The Binomial distribution is about possible observed counts when the success probability $\theta$ is fixed. The Beta distribution turns the question around: after seeing binary data, which success probabilities $\theta$ are plausible?
This inverse question sits near the origin of Bayesian inference. Thomas Bayes's An Essay towards Solving a Problem in the Doctrine of Chances, published posthumously, asked: after an event has happened some number of times and failed some number of times, how plausible is each possible value of its unknown probability? The section below follows the same idea in modern notation, using Bayes' rule to reverse the conditioning.
The Beta distribution has two positive parameters, $\alpha$ and $\beta$:
$$ \theta \sim \mathrm{Beta}(\alpha, \beta) $$
with density
$$ p(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha - 1}(1-\theta)^{\beta - 1}}{B(\alpha,\beta)} $$
where $B(\alpha,\beta)$ is the Beta function and acts as the normalizing constant that makes the density integrate to $1$ over the interval $[0,1]$. Because this curve is nonnegative on $[0,1]$, normalizing it to integrate to $1$ lets us interpret it as a probability density over $\theta$.
A practical way to read the parameters is to treat $\alpha$ as evidence for success and $\beta$ as evidence for failure. Their relative size controls where the distribution puts most of its mass. Their total size controls how concentrated the distribution is.
For example:
- $\mathrm{Beta}(1, 1)$ is uniform: every success probability starts equally plausible.
- $\mathrm{Beta}(8, 2)$ puts more mass on high success probabilities.
- $\mathrm{Beta}(2, 8)$ puts more mass on low success probabilities.
- $\mathrm{Beta}(20, 20)$ is centered near $0.5$ but more concentrated than, say, $\mathrm{Beta}(5, 5)$.
The curves below show those densities:
# Some examples of Beta distributions
def beta_density_grid(*, alpha: float, beta: float, nb_points: int = 500) -> tuple[np.ndarray, np.ndarray]:
"""Return a grid and Beta density values for plotting."""
theta = np.linspace(1e-4, 1.0 - 1e-4, nb_points)
density = scipy.stats.beta.pdf(theta, a=alpha, b=beta)
return theta, density
beta_examples = [
(1.0, 1.0, r"$Beta(1, 1)$"),
(5.0, 5.0, r"$Beta(5, 5)$"),
(8.0, 2.0, r"$Beta(8, 2)$"),
(2.0, 8.0, r"$Beta(2, 8)$"),
(20.0, 20.0, r"$Beta(20, 20)$"),
]
fig, ax = plt.subplots(figsize=(8, 4))
for alpha, beta, label in beta_examples:
theta, density = beta_density_grid(alpha=alpha, beta=beta)
ax.plot(theta, density, label=label, linewidth=2)
ax.set_title("Beta distributions describe uncertainty over a probability")
ax.set_xlabel(r"success probability $\theta$")
ax.set_ylabel("density")
ax.legend()
sns.despine()
plt.show()
#
The following interactive widget lets you explore how the Beta distribution changes as you vary $\alpha$ and $\beta$.
# Interactive Beta distribution
# Start with one concrete Beta density so the app has an initial state.
beta_explorer_alpha = 4.0
beta_explorer_beta = 2.0
beta_explorer_theta = np.linspace(1e-4, 1.0 - 1e-4, 300)
beta_explorer_density = scipy.stats.beta.pdf(beta_explorer_theta, a=beta_explorer_alpha, b=beta_explorer_beta)
beta_explorer_y_max = max(2.0, float(beta_explorer_density.max() * 1.12))
# ColumnDataSource is the shared data model Bokeh can update in the browser.
source_beta_explorer_params = bokeh.models.ColumnDataSource(
data=dict(alpha=[beta_explorer_alpha], beta=[beta_explorer_beta])
)
source_beta_explorer_density = bokeh.models.ColumnDataSource(
data=dict(theta=beta_explorer_theta, density=beta_explorer_density)
)
# Draw the density as a line with a light filled band to emphasize area.
figure_beta_explorer = bokeh.plotting.figure(
x_range=bokeh.models.Range1d(start=0.0, end=1.0),
y_range=bokeh.models.Range1d(start=0.0, end=beta_explorer_y_max),
width=520,
height=320,
title=f"Beta density: alpha={beta_explorer_alpha:.1f}, beta={beta_explorer_beta:.1f}",
x_axis_label="success probability theta",
y_axis_label="density",
tools="",
)
renderer_beta_explorer_density = figure_beta_explorer.line(
x="theta",
y="density",
source=source_beta_explorer_density,
color="navy",
line_width=3,
)
figure_beta_explorer.add_tools(
bokeh.models.HoverTool(
renderers=[renderer_beta_explorer_density],
tooltips=[
("theta", "@theta{0.000}"),
("density", "@density{0.000}"),
],
mode="vline",
)
)
figure_beta_explorer.add_layout(
bokeh.models.Band(
base="theta",
lower=0,
upper="density",
source=source_beta_explorer_density,
level="underlay",
fill_color="navy",
fill_alpha=0.18,
line_alpha=0.0,
)
)
# Sliders expose alpha and beta as evidence-like shape controls.
slider_beta_explorer_alpha = bokeh.models.Slider(
start=0.2,
end=25.0,
value=beta_explorer_alpha,
step=0.2,
title="alpha: success evidence",
width=250,
html_attributes={"title": "Larger alpha shifts mass toward higher success probabilities."},
)
slider_beta_explorer_beta = bokeh.models.Slider(
start=0.2,
end=25.0,
value=beta_explorer_beta,
step=0.2,
title="beta: failure evidence",
width=250,
html_attributes={"title": "Larger beta shifts mass toward lower success probabilities."},
)
# CustomJS recomputes the density in the browser, keeping the post fully static.
callback_beta_explorer_code = """
const LOG_SQRT_TWO_PI = 0.9189385332046727;
const LANCZOS_COEFFICIENTS = [
676.5203681218851,
-1259.1392167224028,
771.32342877765313,
-176.61502916214059,
12.507343278686905,
-0.13857109526572012,
9.9843695780195716e-6,
1.5056327351493116e-7,
];
function logGamma(z) {
if (z < 0.5) {
return Math.log(Math.PI) - Math.log(Math.sin(Math.PI * z)) - logGamma(1 - z);
}
z -= 1;
let x = 0.99999999999980993;
for (let i = 0; i < LANCZOS_COEFFICIENTS.length; i++) {
x += LANCZOS_COEFFICIENTS[i] / (z + i + 1);
}
const t = z + LANCZOS_COEFFICIENTS.length - 0.5;
return LOG_SQRT_TWO_PI + (z + 0.5) * Math.log(t) - t + Math.log(x);
}
function betaPdf(theta, alpha, beta) {
const logDensity =
(alpha - 1) * Math.log(theta)
+ (beta - 1) * Math.log(1 - theta)
+ logGamma(alpha + beta)
- logGamma(alpha)
- logGamma(beta);
return Math.exp(logDensity);
}
// Read the current slider values and store them for the displayed app state.
const alpha = alphaSlider.value;
const beta = betaSlider.value;
paramsSource.data["alpha"][0] = alpha;
paramsSource.data["beta"][0] = beta;
const theta = densitySource.data["theta"];
const density = densitySource.data["density"];
let maxDensity = 0;
// Recompute the whole curve so the plot responds immediately to sliders.
for (let i = 0; i < theta.length; i++) {
density[i] = betaPdf(theta[i], alpha, beta);
if (density[i] > maxDensity) {
maxDensity = density[i];
}
}
// Rescale the y-axis and notify Bokeh that the data changed.
figure.y_range.end = Math.max(2.0, maxDensity * 1.12);
figure.title.text = "Beta density: alpha=" + alpha.toFixed(1) + ", beta=" + beta.toFixed(1);
paramsSource.change.emit();
densitySource.change.emit();
"""
callback_beta_explorer = CustomJS(
args=dict(
paramsSource=source_beta_explorer_params,
densitySource=source_beta_explorer_density,
figure=figure_beta_explorer,
alphaSlider=slider_beta_explorer_alpha,
betaSlider=slider_beta_explorer_beta,
),
code=callback_beta_explorer_code,
module=False,
)
# Both sliders drive the same callback because either parameter changes the curve.
slider_beta_explorer_alpha.js_on_change("value", callback_beta_explorer)
slider_beta_explorer_beta.js_on_change("value", callback_beta_explorer)
layout_beta_explorer = bokeh.layouts.column(
figure_beta_explorer,
bokeh.layouts.row(slider_beta_explorer_alpha, slider_beta_explorer_beta),
)
bokeh.plotting.show(layout_beta_explorer)
#
Why the Beta Shape Appears
The Beta distribution appears naturally when we look at binary data from the other direction.
In the Binomial distribution, the probability $\theta$ is treated as fixed and the number of successes is random. After seeing the data, the perspective changes: the observed successes and failures are fixed, and $\theta$ is the unknown quantity. We ask which values of $\theta$ make the observations more or less plausible.
To compare possible values of $\theta$ after the data are observed, the likelihood function gives us a data-based score. The observed data stay fixed, and the unknown success probability $\theta$ is what varies. For each candidate value of $\theta$, the likelihood asks: if $\theta$ had this value, how plausible would the observed successes and failures be?
The shape is easiest to see from one binary observation. A success occurs with probability $\theta$, and a failure occurs with probability $1-\theta$. Assuming independence across observations, we multiply those factors together. So a sequence with $s$ successes and $f$ failures contributes
$$ \theta^s(1-\theta)^f. $$
If we summarize the data by counts rather than by the exact order, the Binomial coefficient counts how many such sequences there are. But that coefficient does not depend on $\theta$. Since likelihood is used as a relative score over possible values of $\theta$, we can ignore constants that do not depend on $\theta$.
The likelihood of $\theta$, given $s$ successes and $f$ failures, is therefore proportional to
$$ \mathcal{L}(\theta \mid s, f) \propto \theta^s(1-\theta)^f. $$
This is not yet a probability density over $\theta$; it is a relative score for different possible values of $\theta$. The important part is its shape: a power of $\theta$ times a power of $1-\theta$.
To normalize that shape over the valid parameter space $0 \leq \theta \leq 1$, we divide by its integral over that interval, which corresponds to the Beta function $B(s+1, f+1)$:
$$ \frac{\theta^s(1-\theta)^f}{\int_0^1 u^s(1-u)^f\,du} = \frac{\theta^s(1-\theta)^f}{B(s+1, f+1)}. $$
Read as a posterior over $\theta$, this normalized likelihood corresponds to using a uniform $\mathrm{Beta}(1,1)$ prior. The next section makes the prior explicit.
Now the connection to the Beta distribution is visible. In a Beta density, the exponents are $\alpha-1$ and $\beta-1$. Here the exponents are $s$ and $f$, so the matching Beta parameters are $\alpha = s+1$ and $\beta = f+1$.
That is the main point of this section: binary evidence creates a Beta-shaped curve over $\theta$. The next section uses Bayes' rule to combine that curve with a prior.
Beta-Bernoulli Conjugacy: Beta In, Beta Out
To turn the likelihood into a posterior belief, Bayes' rule combines it with a prior distribution over $\theta$ and then normalizes the result:
$$ p(\theta \mid \mathrm{data}) = \frac{p(\mathrm{data} \mid \theta)\,p(\theta)}{p(\mathrm{data})}. $$
The denominator $p(\mathrm{data})$ is the normalizing constant. It makes the posterior integrate to $1$, but it does not depend on $\theta$. When we only want the shape of the posterior as a function of $\theta$, we can write
$$ p(\theta \mid \mathrm{data}) \propto p(\mathrm{data} \mid \theta)\,p(\theta). $$
So, apart from normalization, the posterior is the likelihood multiplied by the prior.
Start with a Beta prior:
$$ \theta \sim \mathrm{Beta}(\alpha, \beta). $$
After observing $s$ successes and $f$ failures, the previous section gave the likelihood shape. Ignoring constants that do not depend on $\theta$, the two pieces are
$$ p(\mathrm{data} \mid \theta) \propto \theta^s(1-\theta)^f, \qquad p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}. $$
Bayes' rule multiplies these two shapes. The matching powers of $\theta$ and $1-\theta$ add together:
$$ p(\theta \mid \mathrm{data}) \propto \theta^s(1-\theta)^f\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{\alpha+s-1}(1-\theta)^{\beta+f-1}. $$
The right-hand side has the Beta density shape, so after normalization the posterior is
$$ \theta \mid \mathrm{data} \sim \mathrm{Beta}(\alpha + s,\; \beta + f). $$
The conjugacy property of the Beta-Bernoulli model is the useful part here: the prior and posterior have the same family form. Updating only means adding success evidence to $\alpha$ and failure evidence to $\beta$.
The counts $s$ and $f$ are enough for this update. Once we know how many successes and failures we observed, the order of the observations no longer changes the posterior. In statistical terms, these counts are sufficient statistics for the Beta-Bernoulli model.
In other words: a $\mathrm{Beta}(\alpha,\beta)$ prior behaves like prior success and failure evidence. After observing $s$ successes and $f$ failures, the posterior is $\mathrm{Beta}(\alpha+s,\beta+f)$.
The widget below starts from a uniform $\mathrm{Beta}(1,1)$ prior. Press Add success or Add failure to see the posterior update by adding one count to $\alpha$ or $\beta$. Reset returns to the prior.
# Interactive Beta-Bernoulli posterior update
posterior_initial_alpha = 1.0
posterior_initial_beta = 1.0
posterior_theta = np.linspace(1e-4, 1.0 - 1e-4, 400)
posterior_density = scipy.stats.beta.pdf(
posterior_theta,
a=posterior_initial_alpha,
b=posterior_initial_beta,
)
posterior_y_max = max(2.0, float(posterior_density.max() * 1.12))
source_posterior_state = bokeh.models.ColumnDataSource(
data=dict(
successes=[0],
failures=[0],
alpha=[posterior_initial_alpha],
beta=[posterior_initial_beta],
)
)
source_posterior_density = bokeh.models.ColumnDataSource(
data=dict(theta=posterior_theta, density=posterior_density)
)
figure_posterior_update = bokeh.plotting.figure(
x_range=bokeh.models.Range1d(start=0.0, end=1.0),
y_range=bokeh.models.Range1d(start=0.0, end=posterior_y_max),
width=620,
height=340,
title="Beta posterior: alpha=1, beta=1",
x_axis_label="success probability theta",
y_axis_label="density",
tools="",
)
renderer_posterior_density = figure_posterior_update.line(
x="theta",
y="density",
source=source_posterior_density,
color="navy",
line_width=3,
)
figure_posterior_update.add_tools(
bokeh.models.HoverTool(
renderers=[renderer_posterior_density],
tooltips=[
("theta", "@theta{0.000}"),
("density", "@density{0.000}"),
],
mode="vline",
)
)
figure_posterior_update.add_layout(
bokeh.models.Band(
base="theta",
lower=0,
upper="density",
source=source_posterior_density,
level="underlay",
fill_color="navy",
fill_alpha=0.18,
line_alpha=0.0,
)
)
button_posterior_success = bokeh.models.Button(
label="Add success",
button_type="success",
width=130,
)
button_posterior_failure = bokeh.models.Button(
label="Add failure",
button_type="danger",
width=130,
)
button_posterior_reset = bokeh.models.Button(
label="Reset",
button_type="default",
width=100,
)
div_posterior_status = bokeh.models.Div(
text=(
"Observed evidence: <b>0</b> successes, <b>0</b> failures. "
"Posterior: <b>Beta(1, 1)</b>."
),
width=620,
)
callback_posterior_update_template = """
const LOG_SQRT_TWO_PI = 0.9189385332046727;
const LANCZOS_COEFFICIENTS = [
676.5203681218851,
-1259.1392167224028,
771.32342877765313,
-176.61502916214059,
12.507343278686905,
-0.13857109526572012,
9.9843695780195716e-6,
1.5056327351493116e-7,
];
function logGamma(z) {
if (z < 0.5) {
return Math.log(Math.PI) - Math.log(Math.sin(Math.PI * z)) - logGamma(1 - z);
}
z -= 1;
let x = 0.99999999999980993;
for (let i = 0; i < LANCZOS_COEFFICIENTS.length; i++) {
x += LANCZOS_COEFFICIENTS[i] / (z + i + 1);
}
const t = z + LANCZOS_COEFFICIENTS.length - 0.5;
return LOG_SQRT_TWO_PI + (z + 0.5) * Math.log(t) - t + Math.log(x);
}
function betaPdf(theta, alpha, beta) {
const logDensity =
(alpha - 1) * Math.log(theta)
+ (beta - 1) * Math.log(1 - theta)
+ logGamma(alpha + beta)
- logGamma(alpha)
- logGamma(beta);
return Math.exp(logDensity);
}
function formatNumber(value) {
if (Math.abs(value - Math.round(value)) < 1e-9) {
return Math.round(value).toString();
}
return value.toFixed(1);
}
const action = "__ACTION__";
if (action === "success") {
stateSource.data["successes"][0] += 1;
} else if (action === "failure") {
stateSource.data["failures"][0] += 1;
} else if (action === "reset") {
stateSource.data["successes"][0] = 0;
stateSource.data["failures"][0] = 0;
}
const successes = stateSource.data["successes"][0];
const failures = stateSource.data["failures"][0];
const alpha = 1 + successes;
const beta = 1 + failures;
stateSource.data["alpha"][0] = alpha;
stateSource.data["beta"][0] = beta;
const theta = densitySource.data["theta"];
const density = densitySource.data["density"];
let maxDensity = 0;
for (let i = 0; i < theta.length; i++) {
density[i] = betaPdf(theta[i], alpha, beta);
if (density[i] > maxDensity) {
maxDensity = density[i];
}
}
figure.y_range.end = Math.max(2.0, maxDensity * 1.12);
figure.title.text = "Beta posterior: alpha=" + formatNumber(alpha) + ", beta=" + formatNumber(beta);
statusDiv.text =
"Observed evidence: <b>" + successes + "</b> successes, <b>" + failures + "</b> failures. "
+ "Posterior: <b>Beta(" + formatNumber(alpha) + ", " + formatNumber(beta) + ")</b>.";
stateSource.change.emit();
densitySource.change.emit();
"""
callback_posterior_args = dict(
stateSource=source_posterior_state,
densitySource=source_posterior_density,
figure=figure_posterior_update,
statusDiv=div_posterior_status,
)
button_posterior_success.js_on_click(
CustomJS(
args=callback_posterior_args,
code=callback_posterior_update_template.replace("__ACTION__", "success"),
module=False,
)
)
button_posterior_failure.js_on_click(
CustomJS(
args=callback_posterior_args,
code=callback_posterior_update_template.replace("__ACTION__", "failure"),
module=False,
)
)
button_posterior_reset.js_on_click(
CustomJS(
args=callback_posterior_args,
code=callback_posterior_update_template.replace("__ACTION__", "reset"),
module=False,
)
)
layout_posterior_update = bokeh.layouts.column(
figure_posterior_update,
bokeh.layouts.row(
button_posterior_success,
button_posterior_failure,
button_posterior_reset,
),
div_posterior_status,
)
bokeh.plotting.show(layout_posterior_update)
#
Choosing Alpha and Beta
The parameters $\alpha$ and $\beta$ describe what we believe about the success probability before the current data arrive. Sometimes that belief is deliberately weak; sometimes it comes from earlier experiments or domain knowledge.
A few choices come up often:
- $\mathrm{Beta}(1,1)$ is uniform over $[0,1]$. It gives the same density to every success probability before seeing data.
- $\mathrm{Beta}(1/2,1/2)$ is a standard non-informative prior known as Jeffreys' prior for a Bernoulli probability.
- Larger values such as $\mathrm{Beta}(20,5)$ express stronger prior evidence. They should be used when that evidence has a real source, such as previous experiments or domain knowledge.
The pseudo-count interpretation helps, but it should not be taken too literally. A $\mathrm{Beta}(20,5)$ prior behaves like prior evidence with a high success rate, but it is still a modelling choice, not observed data from the current experiment.
What the Beta Distribution Is Not Saying
The Beta distribution models uncertainty about one stable binary probability. It is a good fit when the feedback can be reduced to success versus failure, but that reduction is also its main limitation.
It assumes:
- the unknown quantity is a probability between $0$ and $1$,
- each immediate outcome is binary, such as success or failure,
- the success probability is stable for the situation being modeled,
- counts of successes and failures are enough to summarize what we have learned.
Those assumptions fit many A/B testing, diagnostic, quality-control, and binary-feedback examples. They are too narrow when outcomes have several categories, outcomes have different values, rewards are continuous, actions affect future states, or the environment changes over time. In those cases, the Beta distribution can still be a useful reference point, but the model needs to become richer.
For problems with more than two mutually exclusive outcomes, the same idea becomes a Dirichlet distribution over category probabilities. The Beta distribution is the two-outcome special case.
Summary
The Beta distribution is a good fit when we are uncertain about one stable binary probability. It has the right support, $0 \leq \theta \leq 1$, and the normalized likelihood from binary data is itself a Beta density.
Its two parameters have a concrete interpretation: $\alpha$ behaves like success evidence and $\beta$ behaves like failure evidence. Because the Beta prior is conjugate to Bernoulli and Binomial feedback, those parameters update by addition:
$$ \alpha \leftarrow \alpha + s, \qquad \beta \leftarrow \beta + f. $$
That makes the Beta distribution both mathematically convenient and readable as a model of uncertainty about a binary probability. When the outcome is not binary, or when actions affect future states, we need a richer model.
Further Reading
- Follow-up post: Beta priors for self-reinforcing binary decisions connects the Beta distribution to online binary feedback, self-reinforcement, and multi-armed bandits.
- Historical source: An Essay towards Solving a Problem in the Doctrine of Chances by Thomas Bayes and Richard Price studies inference about an unknown binomial success probability; in modern terms, it leads to the Beta-Bernoulli posterior update.
- Practical example: Understanding the beta distribution using baseball statistics by David Robinson gives a reader-friendly example of using Beta distributions to reason about uncertain rates.
# Python package versions used
import importlib.metadata
import platform
print(f"python: {platform.python_version()}")
for package_name in ["numpy", "scipy", "matplotlib", "seaborn", "bokeh", "jupyter-bokeh"]:
version = importlib.metadata.version(package_name)
print(f"{package_name}: {version}")
#
This post is generated from an IPython notebook file. Link to the full IPython notebook file