How to implement a recurrent neural network Part 1

This page is the first part of a two-part tutorial on how to implement a recurrent neural network model.

This tutorial on implementing recurrent neural networks builds on my previous tutorial on how to implement a feedforward neural network. Please check out that tutorial if you are unfamiliar with neural network basics such as backpropagation.

These tutorials are generated from Python 2 IPython Notebook files, which will be linked to at the end of each chapter so that you can adapt and run the examples yourself. The neural networks themselves are implemented using the Python NumPy library, which offers efficient implementations of linear algebra functions such as vector and matrix multiplications. Illustrative plots are generated using Matplotlib. If you want to run these examples yourself and don't have Python with the necessary libraries installed, I recommend downloading and installing Anaconda Python, a free Python distribution that contains all the libraries you need to run these tutorials, and which was used to create them.


A version of this tutorial is also available in Chinese, thanks to Mingming Chen.

Recurrent neural networks

This first part will cover:

  • A simple linear recurrent neural network that counts the number of 1's in its binary input stream.
  • Unfolding the recurrent network and training it with backpropagation through time.
  • Why the gradients of recurrent nets are unstable (exploding and vanishing gradients).
  • Optimising the parameters with Rprop (resilient backpropagation).

Recurrent neural networks (RNNs) are neural nets that can deal with sequences of variable length (unlike feedforward nets). They are able to do this by defining a recurrence relation over timesteps, which typically takes the following form:

$$ S_{k} = f(S_{k-1} * W_{rec} + X_k * W_x) $$

where $S_k$ is the state at time $k$, $X_k$ is an exogenous input at time $k$, and $W_{rec}$ and $W_x$ are parameters, like the weight parameters in feedforward nets. Note that the RNN can be viewed as a state model with a feedback loop. The state evolves over time due to the recurrence relation, and the state is fed back into itself with a delay of one timestep. This delayed feedback loop gives the model memory, because it can remember information between timesteps in its states.
The final output of the network $Y_k$ at a certain timestep $k$ is typically computed from one or more states $S_{k-i} \ldots S_{k+j}$.

Note that we can either compute the current state $S_{k}$ from the current input $X_k$ and the previous state $S_{k-1}$, or predict the next state $S_{k+1}$ from the current state $S_k$ and the current input $X_k$. The difference in notation has little effect on our model; which convention to use depends on the task at hand.
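
To make this recurrence concrete, here is a minimal sketch (illustrative, not part of the original notebook) that unrolls the relation with scalar weights over a short binary sequence:

# Illustrative sketch, not from the original notebook: unroll the linear
#  recurrence s_k = s_{k-1} * w_rec + x_k * w_x over a short input sequence.
x_seq = [1., 0., 1., 1.]  # Example binary input sequence
w_x, w_rec = 1.0, 1.0     # With these weights the state is a running count
s = 0.0                   # Initial state s_0
for k, x_k in enumerate(x_seq, 1):
    s = s * w_rec + x_k * w_x
    print('s_{} = {}'.format(k, s))  # Prints s_1 = 1.0, ..., s_4 = 3.0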

This tutorial will illustrate that recurrent nets are not much different from feedforward nets, but that there are some difficulties in training them, as will be explained later.

Linear recurrent neural network

The first part of this tutorial describes a simple RNN that is trained to count how many 1's there are in a binary input stream, and to output the total count at the end of the sequence.

The RNN model used here has one state, takes one input element from the binary stream at each timestep, and outputs its last state at the end of the sequence. This model is shown in the figure below. The left part is a graphical illustration of the recurrence relation it describes ($ s_{k} = s_{k-1} * w_{rec} + x_k * w_x $). The right part illustrates how the network is unfolded through time over a sequence of length $n$. Notice that the unfolded network can be viewed as an $(n+1)$-layer neural network with the same shared parameters $w_{rec}$ and $w_x$ in each layer.

Structure of the linear RNN

We can easily see that the optimal parameters for this task are $w_{rec} = w_x = 1$. Nevertheless, training this network from data remains an interesting exercise. First we import the libraries we need and define the dataset.
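
To see why, note that unrolling the linear recurrence with initial state $s_0 = 0$ gives a closed form for the final state (a small derivation not spelled out in the original text):

$$ s_n = \sum_{k=1}^{n} w_{rec}^{\,n-k} \, w_x \, x_k $$

Setting $w_x = w_{rec} = 1$ reduces this to $s_n = \sum_{k=1}^{n} x_k$, which for a binary sequence is exactly the number of 1's.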

In [1]:
# Python imports
import numpy as np  # NumPy is used for all numerical computations
# (The original collapsed cell presumably also imports Matplotlib for the plots.)

Define the dataset

The input data $X$ used in this example consists of 20 binary sequences of 10 timesteps each. Each input sequence is generated by sampling from a uniform random distribution and rounding each value to 0 or 1.

The output targets $t$ are the number of times '1' occurs in each sequence, which is equal to the sum of that sequence since the sequence is binary.

In [2]:
# Create dataset
nb_of_samples = 20
sequence_len = 10
# Create the sequences
X = np.zeros((nb_of_samples, sequence_len))
for row_idx in range(nb_of_samples):
    X[row_idx,:] = np.around(np.random.rand(sequence_len)).astype(int)
# Create the targets for each sequence
t = np.sum(X, axis=1)
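
For illustration (this quick check is not in the original notebook), we can inspect one sample; the exact values will vary with the random draw:

# Illustrative check: print the first sequence and its target.
print(X[0,:])  # e.g. [ 1.  0.  0.  1.  1.  0.  0.  1.  0.  1.]
print(t[0])    # e.g. 5.0, the number of 1's in that sequence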

Training with backpropagation through time

The typical algorithm to train recurrent nets is the backpropagation through time (BPTT) algorithm. As the name implies, it is based on the backpropagation algorithm we discussed in the previous tutorial on feedforward nets.

If you understand regular backpropagation, then backpropagation through time is not much more difficult to understand. The main difference is that the recurrent net needs to be unfolded through time for a certain number of timesteps. This unfolding was already illustrated in the figure at the beginning of this tutorial. After unfolding, we end up with a model quite similar to a regular multilayer feedforward network. The only differences are that each layer has multiple inputs (the previous state and the current input), and that the parameters (here: $w_{rec}$ and $w_x$) are shared between the layers.

Compute the output with the forward step

The forward step will unroll the network, and compute the forward activations just as in regular backpropagation. The final output will be used to compute the cost function from which the error signal used for training will be derived.

When unfolding a recurrent net over multiple timesteps, each layer computes the same recurrence relation, just on a different timestep. This recurrence relation for our model is defined in the update_state function.

The forward_states function computes the states over increasing timesteps by applying the update_state function in a for-loop. Note that the forward steps for multiple sequences can be computed in parallel by use of vectorisation, as was illustrated in the previous tutorial on feedforward nets. Since the network begins without having seen anything of the sequence, an initial state needs to be provided. In this example, this initial state is set to 0 (it is also possible to treat it as another trainable parameter).

Finally, the cost $\xi$ (cost) at the output is computed in this example via the mean squared error (MSE) function over all sequences in the input data.

In [3]:
# Define the forward step functions
def update_state(xk, sk, wx, wRec):
    """
    Compute state k from the previous state (sk) and current input (xk),
    by use of the input weights (wx) and recursive weights (wRec).
    """
    return xk * wx + sk * wRec

def forward_states(X, wx, wRec):
    """
    Unfold the network and compute all state activations given the input X,
    and input weights (wx) and recursive weights (wRec).
    Return the state activations in a matrix, the last column S[:,-1] contains the
    final activations.
    """
    # Initialise the matrix that holds all states for all input sequences.
    # The initial state s0 is set to 0.
    S = np.zeros((X.shape[0], X.shape[1]+1))
    # Use the recurrence relation defined by update_state to update the
    #  states through time.
    for k in range(0, X.shape[1]):
        # S[k] = S[k-1] * wRec + X[k] * wx
        S[:,k+1] = update_state(X[:,k], S[:,k], wx, wRec)
    return S

def cost(y, t): 
    """
    Return the MSE between the targets t and the outputs y.
    """
    return ((t - y)**2).sum() / nb_of_samples
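
As a quick sanity check (not part of the original notebook), we can run the forward pass with the optimal weights $w_x = w_{rec} = 1$; the final state then equals the count exactly, so the cost should be 0:

# Sanity-check sketch (illustrative, not from the original notebook):
#  with wx = wRec = 1 the final state equals the target count.
S = forward_states(X, 1.0, 1.0)
print(cost(S[:,-1], t))  # prints 0.0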

Compute the gradients with the backward step

The backward step begins with computing the gradient of the cost with respect to the output of the network, $\partial \xi / \partial y$, in the output_gradient function. This gradient is then propagated backwards through time (layer by layer) from output to input by the backward_gradient function, which also accumulates the parameter gradients used to update the weights. The recurrence relation used to propagate this gradient through the network can be written as:

$$\frac{\partial \xi}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_{k}} \frac{\partial S_{k}}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_{k}} w_{rec}$$

and starts at:

$$\frac{\partial \xi}{\partial y} = \frac{\partial \xi}{\partial S_{n}}$$

With $n$ the number of timesteps the network is unfolded over. Note that only the recursive parameter $w_{rec}$, the one that connects the states, plays a role in propagating the error down the network.

The gradients of the cost function with respect to the parameters can then be found by summing the parameter gradients of each layer (or by accumulating them while propagating the error). The resulting gradient formulas for the weights are:

$$\frac{\partial \xi}{\partial w_x} = \sum_{k=1}^{n} \frac{\partial \xi}{\partial S_{k}} x_k \\ \frac{\partial \xi}{\partial w_{rec}} = \sum_{k=1}^{n} \frac{\partial \xi}{\partial S_{k}} S_{k-1}$$
In [4]:
def output_gradient(y, t):
    """
    Compute the gradient of the MSE cost function with respect to the output y.
    """
    return 2.0 * (y - t) / nb_of_samples

def backward_gradient(X, S, grad_out, wRec):
    """
    Backpropagate the gradient computed at the output (grad_out) through the network.
    Accumulate the parameter gradients for wx and wRec over the layers by addition.
    Return the parameter gradients as a tuple, together with the gradients at the output of each layer.
    """
    # Initialise the array that stores the gradients of the cost with respect to the states.
    grad_over_time = np.zeros((X.shape[0], X.shape[1]+1))
    grad_over_time[:,-1] = grad_out
    # Set the gradient accumulations to 0
    wx_grad = 0
    wRec_grad = 0
    for k in range(X.shape[1], 0, -1):
        # Compute the parameter gradients and accumulate the results.
        wx_grad += np.sum(grad_over_time[:,k] * X[:,k-1])
        wRec_grad += np.sum(grad_over_time[:,k] * S[:,k-1])
        # Compute the gradient at the output of the previous layer
        grad_over_time[:,k-1] = grad_over_time[:,k] * wRec
    return (wx_grad, wRec_grad), grad_over_time

Gradient Checking

We can again perform gradient checking, like we did in part 4 of the tutorial on feedforward nets, to assert that we didn't make any mistakes while computing the gradients. Gradient checking asserts that the gradient computed by backpropagation is close to the numerical gradient.
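
The code of the checking cell below is collapsed in the source of this post; what follows is a minimal sketch of such a check (the point at which the gradients are checked and the perturbation size are illustrative choices):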

In [5]:
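# Minimal gradient-checking sketch (a reconstruction; the original cell is
#  collapsed in the source). The checkpoint [wx, wRec] and eps are
#  illustrative choices.
params = [1.2, 1.2]  # [wx, wRec], point at which to check the gradients
eps = 1e-7  # Small perturbation for the numerical gradient
# Compute the gradients by backpropagation
S = forward_states(X, params[0], params[1])
grad_out = output_gradient(S[:,-1], t)
backprop_grads, _ = backward_gradient(X, S, grad_out, params[1])
# Compare each backprop gradient with its numerical estimate
for p_idx, _ in enumerate(params):
    # Central-difference estimate of the gradient
    params[p_idx] += eps
    plus_cost = cost(forward_states(X, params[0], params[1])[:,-1], t)
    params[p_idx] -= 2 * eps
    min_cost = cost(forward_states(X, params[0], params[1])[:,-1], t)
    params[p_idx] += eps  # Restore the original parameter value
    grad_num = (plus_cost - min_cost) / (2 * eps)
    # Raise an error if the numerical and backprop gradients don't match
    if not np.isclose(grad_num, backprop_grads[p_idx]):
        raise ValueError('Numerical gradient does not match backprop gradient!')
print('No gradient errors found')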
No gradient errors found

Updating the parameters

Recurrent nets are notoriously difficult to train due to unstable gradients, which make it difficult for simple gradient-based optimization methods, such as gradient descent, to find a good local minimum.

This instability is illustrated in the following figures. The top two figures show parts of the cost surface for a range of values of $w_x$ and $w_{rec}$. The colored markers are the points where we sampled the evolution of the gradient during backpropagation through time, which is illustrated in the bottom figure. Note that the color scale of the cost surface is logarithmic. The cost surface is near 0 in a small valley around $w_x\!=\!w_{rec}\!=\!1$, and grows very fast near the edges of the plot (especially for $|w_{rec}|>1$).

The source of the gradient instabilities in our simple linear model is shown in the bottom figure. The instabilities arise from the fact that the recurrence relation used to propagate the gradient backwards through time forms a geometric sequence (as was noted above):

$$\frac{\partial \xi}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_{k}} \frac{\partial S_{k}}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_{k}} w_{rec}$$

The gradient of a state $S_k$ with respect to a state $m$ timesteps back ($S_{k-m}$) can then be written as:

$$\frac{\partial S_{k}}{\partial S_{k-m}} = \frac{\partial S_{k}}{\partial S_{k-1}} * \cdots * \frac{\partial S_{k-m+1}}{\partial S_{k-m}} = w_{rec}^m$$

Note that in our simple linear model the gradient grows exponentially over time if $|w_{rec}|>1$ (known as the exploding gradient), and shrinks exponentially if $|w_{rec}|<1$ (known as the vanishing gradient).

The exploding gradient is illustrated by the point sampled at $w_x\!=\!1, w_{rec}\!=\!2$, where the gradient grows exponentially during backpropagation (the y-axis is on a log scale). The point at $w_x\!=\!1, w_{rec}\!=\!-2$ illustrates a different kind of gradient explosion, alternating in sign due to $w_{rec}$ being negative. This exploding gradient means that training can be highly sensitive to changes in the $w_{rec}$ parameter.

The vanishing gradient is illustrated by the points sampled at $w_x\!=\!1, w_{rec}\!=\!0.5$ and $w_x\!=\!1, w_{rec}\!=\!-0.5$, where the gradient decreases exponentially in both cases. This vanishing gradient means that the network is unable to learn long-term dependencies, because the gradient vanishes over time.

Note that if $w_{rec}\!=\!0$ the gradient immediately goes to $0$, and that if $w_{rec}\!=\!1$ the gradient stays constant over time.
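
A quick numeric illustration (not in the original notebook) of how $w_{rec}^m$ behaves for increasing numbers of timesteps $m$:

# Illustrative sketch: the backpropagated gradient scales with w_rec**m.
for w_rec in [2.0, -2.0, 1.0, 0.5, -0.5, 0.0]:
    print('w_rec = {:4}: {}'.format(w_rec, [w_rec**m for m in (1, 5, 10)]))
# w_rec =  2.0 gives [2.0, 32.0, 1024.0]          -> explodes
# w_rec =  0.5 gives [0.5, 0.03125, 0.0009765625] -> vanishes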

The next section will illustrate how we can optimize this unstable error function.

In [6]:
In [7]:

Rprop optimisation

The previous section illustrated that the gradient can be highly unstable. Entering an area of unstable gradients can result in a huge jump on the cost surface, so the optimizer might end up far from the original point. This is illustrated in the following figure (from Pascanu et al.):

Illustration exploding gradient from "On the difficulty of training Recurrent Neural Networks", Razvan Pascanu

Remember from our previous tutorial on feedforward nets that our gradient descent update rule was:

$$W(i+1) = W(i) - \mu \frac{\partial \xi}{\partial W(i)}$$

With $W(i)$ the value of $W$ at iteration $i$ during the gradient descent, and $\mu$ the learning rate.

If during training we ended up at the blue point in the surface plot above ($w_x\!=\!1, w_{rec}\!=\!2$), the gradient would be of the order of $10^7$. Even with a tiny learning rate of 0.000001 ($10^{-6}$), the parameters $W$ would be updated by a distance of 10 units from their current position, which would be catastrophic in our example. One way to deal with this is to lower the learning rate even more, but then, if the optimisation enters a low-gradient area, the updates would hardly move at all.

Researchers have found many ways to do gradient-based training in this unstable environment. Some examples are: Gradient clipping, Hessian-Free Optimization, Momentum, etc.

We can handle the unstable gradients by making the optimisation updates less sensitive to the magnitude of the gradient. One way to do this is with a technique called resilient backpropagation (Rprop). Rprop is similar to the momentum method described in a previous tutorial, but uses only the sign of the gradient to update the parameters. The Rprop algorithm can be summarised as follows:

  • Set initial weight update value $\Delta$ to a nonzero value.
  • For each parameter $w$:
    • If the sign of the weight's gradient changes compared to the previous iteration:
      $sign(\partial \xi /\partial w(i)) \neq sign(\partial \xi /\partial w(i-1))$
      • Multiply the weight update value $\Delta$ by $\eta^-$, with $\eta^- < 1$.
        $\Delta(i) = \Delta(i) * \eta^-$

    • Else if the sign of the gradient stays the same:
      $sign(\partial \xi /\partial w(i)) = sign(\partial \xi /\partial w(i-1))$
      • Multiply the weight update value $\Delta$ by $\eta^+$, with $\eta^+ > 1$.
        $\Delta(i) = \Delta(i) * \eta^+$

    • Update the weight with its weight update value $\Delta$ in the opposite direction of its gradient: $w(i+1) = w(i) - sign(\partial \xi /\partial w(i)) * \Delta(i)$

The hyperparameters are usually set to $\eta^+ = 1.2$ and $\eta^- = 0.5$. Compared with the momentum method, this means that if the sign of the gradient doesn't change we increase the weight update value by 20%, and if the sign does change (the update passed over a minimum) we halve the weight update value. Note that the weight update value $\Delta$ is similar to the momentum method's velocity parameter; the difference is that $\Delta$ only reflects the magnitude of the velocity for each parameter, while the direction is determined by the sign of the current gradient.

We optimised our simple model with Rprop for 500 iterations. The updates are plotted on the cost surface below as blue dots. Note that even though the weight parameters start out at a high-cost, high-gradient position, Rprop still ends up approximating the $(1, 1)$ minimum at the end of the iterations.

In [8]:
# Define Rprop optimisation function
def update_rprop(X, t, W, W_prev_sign, W_delta, eta_p, eta_n):
    """
    Update Rprop values in one iteration.
    X: input data.
    t: targets.
    W: Current weight parameters.
    W_prev_sign: Previous sign of the W gradient.
    W_delta: Rprop update values (Delta).
    eta_p, eta_n: Rprop hyperparameters.
    """
    # Perform forward and backward pass to get the gradients
    S = forward_states(X, W[0], W[1])
    grad_out = output_gradient(S[:,-1], t)
    W_grads, _ = backward_gradient(X, S, grad_out, W[1])
    W_sign = np.sign(W_grads)  # Sign of new gradient
    # Update the Delta (update value) for each weight parameter separately
    for i, _ in enumerate(W):
        if W_sign[i] == W_prev_sign[i]:
            W_delta[i] *= eta_p
        else:
            W_delta[i] *= eta_n
    return W_delta, W_sign
In [9]:
# Perform Rprop optimisation

# Set hyperparameters
eta_p = 1.2
eta_n = 0.5

# Set initial parameters
W = [-1.5, 2]  # [wx, wRec]
W_delta = [0.001, 0.001]  # Update values (Delta) for W
W_sign = [0, 0]  # Previous sign of W

ls_of_ws = [(W[0], W[1])]  # List of weights to plot
# Perform 500 Rprop iterations
for i in range(500):
    # Get the update values and sign of the current gradient
    W_delta, W_sign = update_rprop(X, t, W, W_sign, W_delta, eta_p, eta_n)
    # Update each weight parameter separately
    #  (use a loop variable different from the outer loop's to avoid shadowing)
    for j, _ in enumerate(W):
        W[j] -= W_sign[j] * W_delta[j]
    ls_of_ws.append((W[0], W[1]))  # Add weights to the list for plotting

print('Final weights are: wx = {0},  wRec = {1}'.format(W[0], W[1]))
Final weights are: wx = 1.00135554721,  wRec = 0.999674473785
In [10]:

Final model

The final model is tested on a test sequence below. Note that the output is very close to the target output, and that if we round the output to the nearest integer it would be exactly correct.

In [11]:
test_inpt = np.asmatrix([[0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1]])
test_outpt = forward_states(test_inpt, W[0], W[1])[:,-1]
print('Target output: {:d} vs Model output: {:.2f}'.format(test_inpt.sum(), test_outpt[0]))
Target output: 5 vs Model output: 5.00
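
Since the targets are integer counts, rounding the model output (an illustrative extra step, not in the original notebook) recovers the exact count:

# Rounding the model output gives the exact integer count.
print(int(round(test_outpt[0])))  # prints 5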

This post at peterroelants.github.io is generated from an IPython notebook file. Link to the full IPython notebook file.