From A/B to RL (3/3): Continuous Learning to Delayed Rewards

Learning state-dependent policies with MENACE, matchboxes, and beads.

The second post made the policy explicit in a one-step bandit. A Bayesian learner kept a posterior over each action's mean reward, and that uncertainty induced a stochastic policy over actions.

This post keeps the focus on policies: rules for choosing actions. In the bandit setting, a policy only had to choose among actions in one fixed situation. Here the agent moves through a small sequential game. It sees a board state, chooses a move, sees a new board state, and keeps playing until the game ends.

We'll use the game of tic-tac-toe as an example, and more specifically the historical MENACE setup: a matchbox-and-bead learner that illustrates how a policy can be learned directly. Here, feedback is delayed: after a sequence of moves, the learner receives only the final outcome: win, draw, or loss. In RL language this is often called the terminal outcome, because it is observed when the episode ends. That gives us state-dependent policies and delayed credit assignment without needing a large RL problem.

This is the third post in a 3-part series:

In [1]:

MENACE: Matchboxes and Policies

MENACE is short for Matchbox Educable Noughts and Crosses Engine. Donald Michie built it in the early 1960s as a physical game-learning machine for tic-tac-toe. Instead of code and arrays, the original system used matchboxes filled with colored beads.

MENACE is useful here because its learned policy is visible and physical: it is stored as bead counts in matchboxes. Drawing a bead from a matchbox means sampling an action from the current policy. This is similar to a statistical urn model: the counts in the box determine the probabilities of future draws.

We can describe the physical setup in reinforcement-learning terms:

  • matchboxes represent board states,
  • bead colors represent legal moves, or symmetry-distinct move classes, from those states,
  • relative bead counts define the action probabilities for each state,
  • the game rules and opponent produce the next board state,
  • one full game is an episode,
  • only the final win/draw/loss outcome is used for learning.

The original MENACE was physical, but here we will simulate the same setup in code.

The MENACE Loop

At a high level, MENACE alternates between two steps. During a game, it samples moves by drawing beads from matchboxes. After the game, it reinforces the bead colors it drew based on the final win, draw, or loss. We first look at how one move is sampled; the delayed update comes after that.

The key point is that MENACE learns the policy itself. The bead counts are not posterior beliefs over rewards, and they are not action-value estimates. They are policy weights: more beads means a move is sampled more often, but moves with fewer beads can still be selected.

The figure below walks through one MENACE move. It uses a small illustrative matchbox whose bead counts are already non-uniform, as if earlier reinforcement has changed the policy.

First, MENACE observes the current board state and looks up the matching matchbox. The beads in that matchbox represent the policy weights $N(s,a)$ for the legal moves from that state. Colors are local to the matchbox: within this board, each color marks one legal move class.

The action probabilities for the current policy are the bead-count proportions inside the matchbox:

$$ \pi(a \mid s) = \frac{N(s,a)}{\sum_b N(s,b)}. $$

Drawing one bead is the physical sampling step. The sampled bead color maps back to a board square, and we play the selected move.

This is the first important contrast with the Thompson-sampling-style policy from the previous post. There, action probabilities came from uncertainty over values. In MENACE, there is no sampled value model. The randomness belongs to the policy itself.

In [2]:
In [3]:
No description has been provided for this image

Initialization

Before training, every matchbox starts with an untrained policy. Following Michie's original setup, each move class in a matchbox starts with the same number of beads. MENACE uses 4 beads per move class for its first move, then 3, 2, and 1 beads for later moves.

Each matchbox initially samples uniformly over its own move classes. Later-game boxes start with fewer beads, which makes them more sensitive to reinforcement.

Symmetry is part of that initialization. If one tic-tac-toe board can be rotated or reflected into another, MENACE treats them as the same state. The same idea applies to moves: on the empty board, all four corners are equivalent, all four sides are equivalent, and the center is unique. The opening matchbox therefore needs only three bead colors: corner, side, and center.

The opening board is a useful check. Because this is MENACE's first move, the opening matchbox starts with 4 beads for each of those three move classes. This is uniform over move classes, not over the nine physical board squares. When a bead is drawn, that move class is mapped back to a corresponding square on the current board.

In [4]:
No description has been provided for this image

Delayed Reinforcement

MENACE does not update after each move. It first plays a full episode, meaning one complete game of tic-tac-toe. During that episode it records the beads it selected. Only after the final win, draw, or loss does it update those beads.

For every bead selected during the episode, the final outcome determines how the visited matchboxes are updated. MENACE adds or removes beads of the same color as the selected bead:

  • win: add 3 beads
  • draw: add 1 bead
  • loss: remove 1 bead

In compact notation, let $O$ be the final episode outcome: win, draw, or loss. For every visited state-action pair $(s_t,a_t)$, MENACE updates the policy weight $N(s_t,a_t)$ by an amount $\Delta(O)$ determined by that outcome:

$$ N(s_t,a_t) \leftarrow \max(0, N(s_t,a_t) + \Delta(O)), $$

Here $\Delta(O)$ is best read as a reinforcement strength, or policy-weight update amount. It is $+3$, $+1$, or $-1$ for a win, draw, or loss. A positive value makes the selected move more likely in that state next time. A negative value makes it less likely. The number is not learned from the environment; it is part of MENACE's algorithm design, combining outcome preference and update magnitude.

In words, the loop is:

  1. sample moves from the current matchbox policies,
  2. observe the final outcome $O$,
  3. for every selected move, update its policy weight by $\Delta(O)$.

This is delayed credit assignment in physical form. The final outcome changes the bead counts for decisions made earlier in the game.

In [5]:

The next figure shows one played episode in detail. Each column is one move MENACE selected during the game. The first row shows the matchbox bead counts before the final update, including the board state and chosen square. The middle row shows the bead that was sampled and kept aside until the final outcome is known. The bottom row shows the same matchbox after delayed credit assignment: MENACE changes the selected policy weights by adding or removing beads of the sampled colors.

In [6]:
Demo episode outcome for MENACE: win
Number of MENACE matchboxes updated: 3
No description has been provided for this image

The final outcome arrives only after the whole game has been played. MENACE does not know exactly how each move influenced the final result, so it applies the same win/draw/loss update to all selected beads from that episode.

This differs from the Bayesian updates in the earlier posts. There, counts represented evidence about uncertain outcome probabilities or values. Here, bead counts are policy weights. MENACE adds or removes beads to make selected moves more or less likely after delayed feedback.

Learning from Self-Play

Now we let MENACE learn through repeated games. The opening-player MENACE plays 5,000 games against a rotating pool of second-player MENACE agents, a small self-play setup.

After training, we evaluate the learned opening-player policy against several fixed opponents. This separates the training setup from the final evaluation.

We will inspect a few checkpoints:

  • 25 games: early bead movement; wins and losses already start changing the opening policy,
  • 100 games: early policy shaping; common mistakes are being punished, but the policy is not reliable yet,
  • 1,000 games: tactical matchboxes have usually been visited enough to show clearer preferences,
  • 5,000 games: a stable demonstration run where the learned policy mostly draws rather than loses against strong fixed opponents.

Michie's original report tracked cumulative bonus and forfeit scores. Here we plot cumulative outcome rates instead, because they connect directly to the win/draw/loss evaluations below.

In [7]:
Training setup: pooled self-play with 5% opponent random moves
Training outcome rates: {'win': 0.449, 'draw': 0.443, 'loss': 0.108}
Perfect-opponent draw rates by checkpoint:
  25 games: draw=0.165, loss=0.835
 100 games: draw=0.180, loss=0.820
1,000 games: draw=0.661, loss=0.339
5,000 games: draw=0.931, loss=0.069
    Random: win=0.886, draw=0.075, loss=0.039
Positional: win=0.923, draw=0.018, loss=0.059
 Defensive: win=0.004, draw=0.932, loss=0.064
   Perfect: win=0.000, draw=0.914, loss=0.086
In [8]:
No description has been provided for this image
No description has been provided for this image

These plots are diagnostics. The training curves summarize the games MENACE experienced during self-play; the evaluation plot checks the learned policy against a few fixed opponents. Against a perfect tic-tac-toe opponent, the best possible result is a draw, so the useful signal is whether losses disappear rather than whether wins appear.

Learned State-Dependent Policies

In the bandit examples, one action distribution was enough because there was only one state. In tic-tac-toe, the right move depends on the board. MENACE handles this by giving each board state its own policy, represented by that state's matchbox and its current bead counts.

The next figure shows three states after 1,000 training games. This is early enough that the policy is not completely saturated, but late enough that the tactical matchboxes have usually been visited enough to show meaningful preferences.

  • the empty opening board,
  • a state where $X$ can win immediately,
  • a state where $X$ must block an immediate threat from $O$.

The first row shows bead counts and the second row shows the action probabilities induced by those counts. For the opening board, we show only the symmetry-distinct move classes, matching the physical matchbox.

In [9]:
No description has been provided for this image

These plots are the MENACE counterpart to the posterior and policy plots from the previous post. There are no density curves here. The learned object is the policy itself, visible as bead counts and the action probabilities implied by those counts.

Play Against the Learned Policy

The plots above are static. The widget below freezes the trained opening-player MENACE after 5,000 games and lets you play against it.

You play O and MENACE plays X. When it is MENACE's turn, colored squares show the symmetry-reduced matchbox policy. If several legal squares are equivalent by rotation or reflection, the widget shows only one representative square for that bead class. The vertical fill inside each colored square shows that class probability: taller fill means the bead class is more likely to be sampled. Click Sample MENACE move to draw one bead from the current matchbox and play the representative square for the sampled class.

The widget is for inspection, not training. It does not add or remove beads while you play.

In [10]:

How This Relates to Thompson Sampling

The comparison with Thompson sampling matters because both methods can produce stochastic actions, but the randomness comes from different places.

A Thompson-style learner stores uncertainty about a model or value function. It samples one plausible version of that model, then acts greedily inside that sampled world:

posterior over values -> sample plausible values -> choose the action with highest sampled value

The action probabilities are therefore induced by uncertainty over values. If an action is often best across posterior samples, it is chosen often.

MENACE stores policy weights directly. It does not keep a posterior over action values, and it does not choose the best action under a sampled value table. It stores bead counts:

bead counts -> action probabilities -> sample action from beads

This can look similar at the final action-selection step, because both can sample an action. But the meaning of the probabilities is different. In Thompson-style value sampling, probabilities come from uncertain value estimates. In MENACE, the probabilities are the learned object itself: updating beads updates the policy directly.

One way to place these methods next to each other is:

Method What is stored? What is sampled? What the policy probabilities mean
Thompson-style value sampling posterior over action values plausible values, then choose max how often each action is best under sampled values
Induced Bayesian policy posterior over action values action from $P(a \text{ is best})$ posterior probability that each action is best
MENACE bead counts / policy weights action directly from beads learned tendency to play each move

MENACE is therefore closer to direct policy reinforcement than to Bayesian value learning. The learned object is the state-dependent stochastic policy itself.

Summary

MENACE turns tic-tac-toe into a visible policy-learning problem:

  • a board position is the state,
  • a legal move is an action,
  • a matchbox stores the policy for one state as colored beads, where each bead color represents an action,
  • bead-count proportions define the policy $\pi(a \mid s)$,
  • one full game is an episode,
  • the final win/draw/loss outcome updates the beads selected during that episode.

The first important point is that the reward arrives only at the end of the episode. MENACE then applies that final win/draw/loss update to all selected moves that led to the outcome. That is the delayed-credit part.

The second point is about what MENACE learns. It does not learn a posterior over values, and it does not learn an environment model of how moves lead to future boards. It learns the policy directly: more beads make an action more likely, fewer beads make it less likely.

That makes MENACE a useful contrast with the Bayesian bandit policy view from the previous post. Its randomness lives directly in the learned policy, not in sampled posterior values. Stepping back, tic-tac-toe has the structure of a Markov decision process: states, actions, transitions to new states, and feedback after an episode. MENACE acts inside that environment, but it does not learn an MDP representation of it. It learns a state-dependent stochastic policy.

References

Further Reading

  • MENACE blog post, Matthew Scroggs: reader-friendly historical and implementation companion to the MENACE page.
  • Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto: this series here only brings us to the edge of RL. This book is the deeper dive into reinforcement learning.
In [11]:
Python: 3.13.12
numpy: 2.4.6
matplotlib: 3.11.0
seaborn: 0.13.2

This post is generated from an IPython notebook file. Link to the full IPython notebook file