Multi-armed bandit implementation

In the multi-armed bandit (MAB) problem we try to maximise our gain over time by "gambling on slot-machines (or bandits)" that have different but unknown expected outcomes.

The concept is typically used as an alternative to A/B-testing used in marketing research or website optimization. For example, testing which marketing email leads to the most newsletter signups, or which webshop design leads to the most sales.

The benefit of viewing website optimization as a multi-armed bandit problem instead of an A/B-testing problem is that no pre-defined sample sizes are needed and the algorithm will start optimizing the outcome (e.g. click rate) from the beginning. While the A/B-test needs to run all predefined samples to make a conclusion.

What follows will illustrate how to implement and solve a very simple multi-armed bandit problem with a probabilistic algorithm.

In [1]:

Defining the bandit experiment

We will simulate 3 bandits with each an underlying probability of winning stored in p_bandits . The pull(i) method will "pull the arm" of bandit $i$ and randomly decides if the pull was a win (1) or not (0) based on the underlying probability.

In [2]:
# Define the multi-armed bandits
nb_bandits = 3  # Number of bandits
# True probability of winning for each bandit
p_bandits = [0.45, 0.55, 0.60]

def pull(i):
    """Pull arm of bandit with index `i` and return 1 if win, 
    else return 0."""
    if np.random.rand() < p_bandits[i]:
        return 1
        return 0

Bayesian interpretation

Each pull of a specifc bandit will result in a win with a certain probability. The higher this probability the more likely pulling the arm of the bandit is going to result in a win. However we don't know what this probability is so we will have to model it based on our observations of a certain bandit resulting in a win or not. We are going to model the unkown probability with a parameter $\theta$. Based on oberving a single outcome $x$ we can model this as the distribution $P(\theta \mid x)$. Using Bayes rule we can write this as:

$$ P(\theta \mid x) = \frac{P(x \mid \theta)P(\theta)}{P(x)} $$

$P(\theta \mid x)$ is called the posterior distribution of $\theta$ after observing $x$. We will be able to compute this via the likelihood $P(x \mid \theta)$ and prior $P(\theta)$. In this case the likelihood function $P(x \mid \theta)$ follows the Bernoulli distribution .

Beta prior

A good choice for the prior $P(\theta)$ would be the Beta distribution since it is a conjugate prior to the Bernoulli distribution. This means that if the likelihood function $P(x \mid \theta)$ is Bernoulli distributed and the prior distribution $P(\theta)$ is Beta distributed then the posterior $P(\theta \mid x)$ will also Beta distributed. More specifically if the prior is $Beta(\alpha, \beta)$ then after observing a win $x=1$ the posterior would become $Beta(\alpha+1, \beta)$, or after observing a loss $x=0$ the posterior would become $Beta(\alpha, \beta+1)$ [ 1 ].

This means that after every observation we can use the posterior as the prior for the next time we pull the bandit's arm.

Maximize reward / Minimize regret

The goal of the muli-armed bandit problem is to maximize reward (minimize regret). There is an exploitation-exploration tradeoff we have to make here. The more we pull the arm of our percieved best bandit the more certain we become of the probability of that bandit. But other bandits that we haven't pulled that often might have a lower expected probability but with higher uncertaintly. There is a chance that they are actually better than our percieved best bandit.

We will use Thompson sampling to overcome this problem. In Thompson sampling we will for each bandit sample the probability $\theta$ from the prior and pull the bandit with the highest sampled probability. And repeat this step until finished.

We will start with the prior $Beta(\alpha=1, \beta=1)$, which corresponds to a uniform prior between 0 and 1. The run is simulated for 1000 steps and the results at certain steps are plotted below.

In [3]:
In [4]:
# Simulate multi-armed bandit process and update posteriors

# Setup plot
fig, axs = plt.subplots(5, 2, figsize=(8, 10))
axs = axs.flat

# The number of trials and wins will represent the prior for each
#  bandit with the help of the Beta distribution.
trials = [0, 0, 0]  # Number of times we tried each bandit
wins = [0, 0, 0]  # Number of wins for each bandit

n = 1000
# Run the trail for `n` steps
for step in range(1, n+1):
    # Define the prior based on current observations
    bandit_priors = [
        stats.beta(a=1+w, b=1+t-w) for t, w in zip(trials, wins)]
    # plot prior 
    if step in plots:
        plot(bandit_priors, step, next(axs))
    # Sample a probability theta for each bandit
    theta_samples = [
        d.rvs(1) for d in bandit_priors
    # choose a bandit
    chosen_bandit = np.argmax(theta_samples)
    # Pull the bandit
    x = pull(chosen_bandit)
    # Update trials and wins (defines the posterior)
    trials[chosen_bandit] += 1
    wins[chosen_bandit] += x

In [5]:
True prob=0.45;  Emperical prob=0.43;  Trials=43
True prob=0.55;  Emperical prob=0.47;  Trials=82
True prob=0.60;  Emperical prob=0.60;  Trials=875

Note above that the algorithm quickly converges to select the bandit with the highest probability of winning. After 1000 iterations it has chosen the winning bandit 10 times more often than one of the other bandits. See the further readings below if you want to know how to compute the probability that the winning bandit is better than the others [ 2 ].

In [6]:
Python: 3.7.0
numpy: 1.15.2
scipy: 1.1.0
matplotlib: 3.0.0
seaborn: 0.9.0

This post at is generated from an Python notebook file. Link to the full IPython notebook file