Neural network implemetation - backpropagation

Hidden layer trained by backpropagation

This part will illustrate with help of a simple toy example how hidden layers with a non-linear activation function can be trained by the backpropagation algorithm to learn how to seperate non-linearly seperated samples.

While the first and second part of this tutorial described simple regression and classification models, this tutorial will describe a 2-class classification neural network with 1 input dimension, and a non-linear hidden layer with 1 neuron. This network can be represented graphically as:

Image of the network with hidden layer

In [1]:

Dataset

In this example the target classes $t$ corresponding to the inputs $x$ will be generated from 2 class distributions: blue circles ($t=1$) and red stars ($t=0$). Where the red star class is a multimodal distribution that surrounds the distribution of the blue circle class. This results in a 1D dataset that is not linearly separable. These samples are plotted in the figure below.

The model from part 2 won't be able to classify both classes correctly since it can learn only linear separations. By adding a hidden layer with a non-linear activation function, the model will be able to train a non-linear classifier.

In [2]:
In [3]:

Non-linear activation function

The non-linear activation function used in the hidden layer of this example is the Gaussian radial basis function (RBF). The RBF is a activation function that is not usually used in neural networks, except for radial basis function networks . It's being used in this example because it's non-linearity allows us to seperate the blue circle class surrounded by the red star class i our toy example. The RBF will allow to separate the blue circle samples from the red star samples in this simple example by only activating for a certain region around the origin. The RBF is plotted in the figure below and is defined in this example as:

$$ \text{RBF} = \phi(z_h) = e^{-z_h^2} $$

The derivative of this RBF function is:

$$ \frac{d \phi(z_h)}{dz_h} = -2 z_h e^{-z_h^2} = -2 z_h \phi(z_h)$$

In [4]:
def rbf(zh):
    """RBF activation function"""
    return np.exp(-zh**2)
In [5]:

Optimization by backpropagation

We will train this model by using the backpropagation algorithm that is typically used to train neural networks. Each iteration of the backpropagation algorithm consists of two steps:

  1. A forward propagation step to compute the output of the network.
  2. A backward propagation step in which the error at the end of the network is propagated backward through all the neurons while updating their parameters.

1. Forward step

During the forward step, the input will be propagated layer by layer through the network to compute the final output of the network.

Compute activations of hidden layer

The activations $\mathbf{h}$ of the hidden layer will be computed by:

$$\mathbf{h} = \text{RBF}(\mathbf{x} \cdot w_h) = e^{-(\mathbf{x} \cdot w_h)^2} $$

With $w_h$ the weight parameter that transforms the input before applying the RBF activation function. This is implemented below by the hidden_activations(x, wh) method.

Compute activations of output

The final output of the network will be computed by passing the hidden activations $\mathbf{h}$ as input to the logistic classifier function $\sigma$:

$$ \mathbf{y} = \sigma(\mathbf{h} + b_o) = \frac{1}{1+e^{-(\mathbf{h}+ b_o)}} $$

With $b_o$ the bias parameter of the output layer. This is implemented below as the output_activations(h , bo) method. Note that in this toy example we don't multiply with a weight parameter and we add a only add a bias (intercept) term to the input of the logistic output neuron.

In [6]:
def logistic(zo):
    """Logistic classifiction function"""
    return 1. / (1. + np.exp(-zo))


def hidden_activations(x, wh):
    """Hidden layer activations from RBF."""
    return rbf(x * wh)


def output_activations(h , bo):
    """Logistic classification output."""
    return logistic(h + bo)


def nn(x, wh, bo):
    """Full neural network function."""
    return output_activations(hidden_activations(x, wh), bo)


def nn_predict(x, wh, bo):
    """Neural network prediction function that only returns
    1 or 0 depending on the predicted class."""
    return np.around(nn(x, wh, bo))

2. Backward step

The backward step will begin with computing the loss (error) at the output node. The gradient of this loss will then be propagated backwards layer by layer through the network to calculate the parameter updates. Once we have the gradients for each parameter we can use the gradient descent algorithm to update these parameters.

Compute the loss function

The loss function $\xi$ used in this model is the same cross-entropy loss function introduced in part 2 :

$$\xi(t_i,y_i) = - \left[ t_i log(y_i) + (1-t_i)log(1-y_i) \right]$$

This loss function is plotted for the $w_h$ and $b_o$ parameters in the next figure. Note that this error surface is not convex anymore because we introduced a hidden layer with a non-linearity.

In [7]:
def loss(y, t):
    """Cross entropy loss function."""
    return -np.mean(
        (t * np.log(y)) + ((1-t) * np.log(1-y)))

 
def loss_for_param(x, wh, bo, t):
    """Calculate the loss for a given set of parameters."""
    return loss(nn(x, wh, bo) , t)
In [8]:

Update the output layer

At the output the gradient for sample $i$, the update ${\partial \xi_i}/{\partial b_o}$ for parameter $b_o$ can be worked out using the chain rule the same way as we did in part 2 :

$$ \frac{\partial \xi_i}{\partial b_o} = \frac{\partial \xi_i}{\partial y_i} \frac{\partial y_i}{\partial z_{oi}} \frac{\partial z_{oi}}{\partial b_o} = (y_i-t_i) = \delta_{oi} $$

With $z_{oi} = h_i + b_o$, $h_i$ the hidden layer activation of sample $i$ and ${\partial \xi_i}/{\partial z_{oi}} = \delta_{oi}$ the gradient of the error at the output layer of the neural network with respect to the input to this layer. Note that ${\partial z_{oi}}/{\partial b_o} = 1$.

$\delta_{o}$ is defined below as the gradient_output(y, t) method and ${\partial \xi}/{\partial b_o}$ as the gradient_bias_out(h, grad_output) method. Note that these return the same in this case.

Update the hidden layer

Using the chain rule at the hidden layer parameter $w_h$ we can compute the gradient for sample $i$: ${\partial \xi_i}/{\partial w_{h}}$ the same way:

$$ \frac{\partial \xi_i}{\partial w_{h}} = \frac{\partial \xi_i}{\partial h_i} \frac{\partial h_i}{\partial z_{hi}} \frac{\partial z_{hi}}{\partial w_{h}} = x_i \cdot \delta_{hi} $$

With $z_{hi} = x_i \cdot w_{h} $. And with ${\partial \xi_i}/{\partial z_{hi}} = \delta_{hi}$ the gradient of the error at the input of the hidden layer with respect to the input to this layer. $\delta_{hi}$ can be interpreted as the gradient of the contribution of $z_{hi}$ to the final error.

How do we define this error gradient $\delta_{hi}$ at the input of the hidden neurons? It can be computed as the error gradient propagated back from the output layer through the hidden layer.

$$ \begin{split} \delta_{hi} = \frac{\partial \xi_i}{\partial z_{hi}} &= \frac{\partial \xi_i}{\partial z_{oi}} \frac{\partial z_{oi}}{\partial h_i} \frac{\partial h_i}{\partial z_{hi}} = (-2 \cdot z_{hi} \cdot h_i) \cdot (y_i - t_i) \\ &= -2 \cdot z_{hi} \cdot h_i \cdot \delta_{oi} \end{split} $$

Because of this, and because ${\partial z_{hi}}/{\partial w_{h}} = x_i$ we can compute ${\partial \xi_i}/{\partial w_{h}}$ as:

$$\frac{\partial \xi_i}{\partial w_{h}} = x_i \delta_{hi} $$

The gradients for each parameter can again be averaged of the inputs $i$ to compute the update for a batch of input examples.

$\delta_{h}$ is defined below as the gradient_hidden(bo, grad_output) method and ${\partial \xi}/{\partial w_h}$ as the gradient_weight_hidden(x, zh, h, grad_hidden) method.

In [9]:
def gradient_output(y, t):
    """Gradient of loss output."""
    return y - t


def gradient_bias_out(grad_output): 
    """Gradient off the bias parameter at the output layer."""
    return grad_output


def gradient_hidden(grad_output):
    """Gradient of hidden layer output."""
    return grad_output


def gradient_weight_hidden(x, zh, h, grad_hidden):
    """Gradient of hidden layer weight parameter wh."""
    return x * -2 * zh * h * grad_hidden

Backpropagation updates

Once we know how to compute gradient of the loss with respect to the parameters we can start using the gradient descent algorithm to train the network. Gradient descent will iteratively update the parameters in the direction of the negative gradient . The parameters $w_h$ and $b_o$ are updated scaled by the learning rate $\mu$:

$$ w_h(k+1) = w_h(k) - \Delta w_h(k+1) \quad \text{with} \; \Delta w_h = \mu \cdot {\partial \xi}/{\partial w_h}\\ b_o(k+1) = b_o(k) - \Delta b_o(k+1) \quad \text{with} \; \Delta b_o = \mu \cdot {\partial \xi}/{\partial b_o} $$

The gradient descent algorithm is typically initialised by starting with random initial parameters. After initiating these parameters we can start updating these parameters with $\Delta$ until convergence. The learning rate needs to be tuned separately as a hyperparameter for each neural network. One backpropagation iteration with gradient descent is implemented below by the backprop_update(x, t, wh, bo, learning_rate) method.

I hope that this rather long writeup of how to update the parameters illustrates that the backpropagation algorithm is just an application of the chain rule for computing derivatives and can be written out for more complex networks if needed.

In [10]:
def backprop_update(x, t, wh, bo, learning_rate):
    """Full backpropagation update function.
    Updates the parameters bo and wh in 1 iteration."""
    # Compute the output of the network
    # This can be done with y = nn(x, wh, bo), but we need
    #  the intermediate h and zh for the weight updates.
    zh = x * wh
    h = rbf(zh)  # h = hidden_activations(x, wh)
    y = output_activations(h, bo)
    # Compute the gradient at the output
    grad_output = gradient_output(y, t)
    # Get the delta for bo
    d_bo = learning_rate * gradient_bias_out(grad_output)
    # Compute the gradient at the hidden layer
    grad_hidden = gradient_hidden(grad_output)
    # Get the delta for wh
    d_wh = learning_rate * gradient_weight_hidden(
        x, zh, h, grad_hidden)
    # return the update parameters
    return float(np.mean(wh-d_wh)), float(np.mean(bo-d_bo))

An example run of backpropagation for 50 iterations on the example inputs $\mathbf{x}$ and targets $\mathbf{t}$ is shown in the figure below. The white dots represent the weight parameter values $w_h$ and $b_o$ at iteration $k$ and are plotted on the loss surface.

In [11]:
# Run backpropagation
# Set the initial weight parameter
wh = 2.3  # Randomly decided
bo = 1.4  # Randomly decided
# Set the learning rate
learning_rate = 2.0

# Start the gradient descent updates and plot the iterations
nb_of_iterations = 20  # number of gradient descent updates
# List to store the weight values for later vizualization
params_loss = [(wh, bo, loss_for_param(x, wh, bo, t))]
for i in range(nb_of_iterations):
    # Update the weights via backpropagation
    wh, bo = backprop_update(x, t, wh, bo, learning_rate)
    # Store the values for plotting
    params_loss.append((wh, bo, loss_for_param(x, wh, bo, t)))

# Print the final loss
final_loss = loss_for_param(x, wh, bo, t)
print(f'final loss is {final_loss:.2f} for weights',
      f'wh={wh:.2f} and bo={bo:.2f}')
final loss is 0.52 for weights wh=1.12 and bo=-0.44
In [12]: