How to implement a neural network (5/5) - Generalization to multiple layers

This last part of our tutorial on how to implement a neural network, generalizes the single layer feedforward neural network into any number of layers. The concepts of a linear projection via matrix multiplication and non-linear transformation will be generalized. The usage of the generalization will be illustrated by building a small feedforward network that consists of two hidden layers to classify handwritten digits. This network will be trained by stochastic gradient descent , a popular variant of gradient descent that updates the parameters each step on only a subset of the parameters.

This part will cover:

This is the last part of a 5-part tutorial on how to implement neural networks from scratch in Python:

The notebook starts out with importing the libraries we need:

In [1]:

Handwritten digits dataset

The dataset used in this tutorial is the digits dataset provided by scikit-learn . This dataset consists of 1797 8x8 images of handwritten digits between 0 and 9. Each 8x8 pixel image is provided as a flattened input vector of 64 variables. Example images for each digit are shown below. Note that this dataset is different from the larger and more famous MNIST dataset. The smaller dataset from scikit-learn was chosen to minimize training time for this tutorial. Feel free to experiment and adapt this tutorial to classify the MNIST digits.

The dataset will be split into:

  • A training set used to train the model. (inputs: X_train , targets: T_train )
  • A validation set used to validate the model performance and to stop training if the model starts overfitting on the training data. (inputs: X_validation , targets: T_validation )
  • A final test set to evaluate the trained model on data is independent of the training and validation data. (inputs: X_test , targets: T_test )
In [2]:
# Load the data from scikit-learn.
digits = datasets.load_digits()

# Load the targets.
# Note that the targets are stored as digits, these need to be 
#  converted to one-hot-encoding for the output sofmax layer.
T = np.zeros(([0], 10))
T[np.arange(len(T)),] += 1

# Divide the data into a train and test set.
(X_train, X_test, T_train, T_test
) = model_selection.train_test_split(, T, test_size=0.4)
# Divide the test set into a validation set and final test set.
(X_validation, X_test, T_validation, T_test
) = model_selection.train_test_split(X_test, T_test, test_size=0.5)
In [3]:
2021-05-13T19:56:55.424464 image/svg+xml Matplotlib v3.4.2,

Generalization of the layers

Part 4 of this tutorial series took on the classical view of neural networks where each layer consists of a linear transformation by a matrix multiplication and vector addition followed by a non-linear function.
Here the linear transformation is split from the non-linear function and each is abstracted into its own layer. This has the benefit that the forward and backward step of each layer can easily be calculated separately.

This tutorial defines three example layers as Python classes :

  • A layer to apply the linear transformation ( LinearLayer ).
  • A layer to apply the logistic function ( LogisticLayer ).
  • A layer to compute the softmax classification probabilities at the output ( SoftmaxOutputLayer ).

Each layer can compute its output in the forward step with get_output , which can then be used as the input for the next layer. The gradient at the input of each layer in the backpropagation step can be computed with get_input_grad . This function computes the gradient with the help of the targets if it's the last layer, or the gradients at its outputs (gradients at input of next layer) if it's an intermediate layer. Each layer has the option to iterate over the parameters (if any) with get_params_iter , and get the gradients of these parameters in the same order with get_params_grad .

Notice that the gradient and cost computed by the softmax layer are divided by the number of input samples. This is to make this gradient and cost independent of the number of input samples so that the size of the mini-batches can be changed without affecting other parameters. More on this later.

In [4]:
# Non-linear functions used
def logistic(z): 
    return 1 / (1 + np.exp(-z))

def logistic_deriv(y):  # Derivative of logistic function
    return np.multiply(y, (1 - y))
def softmax(z): 
    return np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)
In [5]:
# Layers used in this model
class Layer(object):
    """Base class for the different layers.
    Defines base methods and documentation of methods."""
    def get_params_iter(self):
        """Return an iterator over the parameters (if any).
        The iterator has the same order as get_params_grad.
        The elements returned by the iterator are editable in-place."""
        return []
    def get_params_grad(self, X, output_grad):
        """Return a list of gradients over the parameters.
        The list has the same order as the get_params_iter iterator.
        X is the input.
        output_grad is the gradient at the output of this layer.
        return []
    def get_output(self, X):
        """Perform the forward step linear transformation.
        X is the input."""
    def get_input_grad(self, Y, output_grad=None, T=None):
        """Return the gradient at the inputs of this layer.
        Y is the pre-computed output of this layer (not needed in 
        this case).
        output_grad is the gradient at the output of this layer 
         (gradient at input of next layer).
        Output layer uses targets T to compute the gradient based on the 
         output error instead of output_grad"""
In [6]:
class LinearLayer(Layer):
    """The linear layer performs a linear transformation to its input."""
    def __init__(self, n_in, n_out):
        """Initialize hidden layer parameters.
        n_in is the number of input variables.
        n_out is the number of output variables."""
        self.W = np.random.randn(n_in, n_out) * 0.1
        self.b = np.zeros(n_out)
    def get_params_iter(self):
        """Return an iterator over the parameters."""
        return itertools.chain(
            np.nditer(self.W, op_flags=['readwrite']),
            np.nditer(self.b, op_flags=['readwrite']))
    def get_output(self, X):
        """Perform the forward step linear transformation."""
        return (X @ self.W) + self.b
    def get_params_grad(self, X, output_grad):
        """Return a list of gradients over the parameters."""
        JW = X.T @ output_grad
        Jb = np.sum(output_grad, axis=0)
        return [g for g in itertools.chain(
            np.nditer(JW), np.nditer(Jb))]
    def get_input_grad(self, Y, output_grad):
        """Return the gradient at the inputs of this layer."""
        return output_grad @ self.W.T
In [7]:
class LogisticLayer(Layer):
    """The logistic layer applies the logistic function to its inputs."""
    def get_output(self, X):
        """Perform the forward step transformation."""
        return logistic(X)
    def get_input_grad(self, Y, output_grad):
        """Return the gradient at the inputs of this layer."""
        return np.multiply(logistic_deriv(Y), output_grad)
In [8]:
class SoftmaxOutputLayer(Layer):
    """The softmax output layer computes the classification 
    propabilities at the output."""
    def get_output(self, X):
        """Perform the forward step transformation."""
        return softmax(X)
    def get_input_grad(self, Y, T):
        """Return the gradient at the inputs of this layer."""
        return (Y - T) / Y.shape[0]
    def get_cost(self, Y, T):
        """Return the cost at the output of this output layer."""
        return - np.multiply(T, np.log(Y)).sum() / Y.shape[0]

Sample model

The following sections will refer to a layer as a layer defined above, and hidden-layer or output-layer as the classical neural network view of a linear transformation followed by a non-linear function.

The sample model used to classify the handwritten digits in this tutorial consists of two hidden-layers with logistic functions and a softmax output-layer. The fist hidden-layer takes a vector of 64 pixel values and transforms them to a vector of 20 values. The second hidden-layer projects the previous 20 values to 20 new values. The output-layer outputs probabilities for the 10 possible classes. This architecture is illustrated in the following figure (biases are not shown to keep figure clean).

Image of the sample neural network model

The full network is represented as a sequential list where each next layer is added on top of the previous layer by putting it in the next position in the list. The first layer is at position 0 in this list, the last layer is at the last index of this list.

In [9]:
# Sample model to be trained on the data
hidden_neurons_1 = 20  # Number of neurons in the first hidden-layer
hidden_neurons_2 = 20  # Number of neurons in the second hidden-layer
# Create the model
layers = [] # Define a list of layers
# Add first hidden layer
layers.append(LinearLayer(X_train.shape[1], hidden_neurons_1))
# Add second hidden layer
layers.append(LinearLayer(hidden_neurons_1, hidden_neurons_2))
# Add output layer
layers.append(LinearLayer(hidden_neurons_2, T_train.shape[1]))


The details of how backpropagation works in the forward and backward step of vectorized model are explained in part 4 of this tutorial. This section will only illustrate how to perform backpropagation over any number of layers with the generalized model described here.

Forward step

The forward steps are computed by the forward_step method defined below. This method iteratively computes the outputs of each layer and feeds it as input to the next layer until the last layer. Each layer's output is computed by calling the get_output method. These output activations are stored in the activations list.

In [10]:
# Forward propagation step as a method.
def forward_step(input_samples, layers):
    Compute and return the forward activation of each layer in layers.
        input_samples: A matrix of input samples (each row 
                       is an input vector)
        layers: A list of Layers
        A list of activations where the activation at each index 
        i+1 corresponds to the activation of layer i in layers. 
        activations[0] contains the input samples.  
    activations = [input_samples] # List of layer activations
    # Compute the forward activations for each layer starting 
    #  from the first
    X = input_samples
    for layer in layers:
        # Get the output of the current layer
        Y = layer.get_output(X)
        # Store the output for future processing
        # Set the current input as the activations of the previous layer
        X = activations[-1]
    return activations

Backward step

The gradients computed in the backward step are computed by the backward_step method defined below. The backward step goes over all the layers in the reversed order. It first gets the initial gradients from the output layer and uses these gradients to compute the gradients of the layers below by iteratively calling the get_input_grad method. During each step, it computes the gradients of the cost with respect to the parameters by calling the get_params_grad method and returns them in a list.

In [11]:
# Define the backward propagation step as a method
def backward_step(activations, targets, layers):
    Perform the backpropagation step over all the layers and return the parameter gradients.
        activations: A list of forward step activations where the activation at 
            each index i+1 corresponds to the activation of layer i in layers. 
            activations[0] contains the input samples. 
        targets: The output targets of the output layer.
        layers: A list of Layers corresponding that generated the outputs in activations.
        A list of parameter gradients where the gradients at each index corresponds to
        the parameters gradients of the layer at the same index in layers. 
    param_grads = collections.deque()  # List of parameter gradients for each layer
    output_grad = None  # The error gradient at the output of the current layer
    # Propagate the error backwards through all the layers.
    #  Use reversed to iterate backwards over the list of layers.
    for layer in reversed(layers):   
        Y = activations.pop()  # Get the activations of the last layer on the stack
        # Compute the error at the output layer.
        # The output layer error is calculated different then hidden layer error.
        if output_grad is None:
            input_grad = layer.get_input_grad(Y, targets)
        else:  # output_grad is not None (layer is not output layer)
            input_grad = layer.get_input_grad(Y, output_grad)
        # Get the input of this layer (activations of the previous layer)
        X = activations[-1]
        # Compute the layer parameter gradients used to update the parameters
        grads = layer.get_params_grad(X, output_grad)
        # Compute gradient at output of previous layer (input of current layer):
        output_grad = input_grad
    return list(param_grads)  # Return the parameter gradients

Gradient Checking

As in part 4 of this tutorial the gradient computed by backpropagation is compared with the numerical gradient to assert that there are no bugs in the code to compute the gradients.

The code below gets the parameters of each layer by the help of the get_params_iter method that returns an iterator over all the parameters in the layer. The order of parameters returned corresponds to the order of parameter gradients returned by get_params_grad during backpropagation.

In [12]:
No gradient errors found

Stochastic gradient descent backpropagation

This tutorial uses a variant of gradient descent called Stochastic gradient descent (SGD) to optimize the cost function. SGD follows the negative gradient of the cost function on subsets of the total training set. This has a few benefits: One of them is that the training time on large datasets can be reduced because the matrix of samples is much smaller during each sub-iteration and the gradients can be computed faster and with less memory. Another benefit is that computing the cost function on subsets results in noise, i.e. each subset will give a different cost depending on the samples. This will result in noisy (stochastic) gradient updates which might be able to push the gradient descent out of local minima. These, and other, benefits contribute to the popularity of SGD on training large scale machine learning methods such as neural networks.

The cost function needs to be independent of the number of input samples because the size of the subsets used during SGD can vary. This is why the mean squared error (MSE) cost function is used instead of just the squared error. Using the mean instead of the sum is reflected in the gradient and cost computed by the softmax layer being divided by the number of input samples.


The subsets of the training set are often called mini-batches. The following code will divide the training set into mini-batches of around 25 samples per batch. The inputs and targets are combined together in a list of (input, target) tuples.

In [13]:
# Create the minibatches
batch_size = 25  # Approximately 25 samples per batch
nb_of_batches = X_train.shape[0] // batch_size  # Number of batches
# Create batches (X,Y) from the training set
XT_batches = list(zip(
    np.array_split(X_train, nb_of_batches, axis=0),   # X samples
    np.array_split(T_train, nb_of_batches, axis=0)))  # Y targets

SGD updates

The parameters $\mathbf{\theta}$ of the network are updated by the update_params method that iterates over each parameter of each layer and applies the simple gradient descent rule on each mini-batch: $\mathbf{\theta}(k+1) = \mathbf{\theta}(k) - \Delta \mathbf{\theta}(k+1)$. $\Delta \mathbf{\theta}$ is defined as: $\Delta \mathbf{\theta} = \mu \frac{\partial \xi}{\partial \mathbf{\theta}}$ with $\mu$ the learning rate.

The update steps will be performed for a number of iterations ( nb_of_iterations ) over the full training set, where each full iterations consists of multiple updates over the mini-batches. After each full iteration, the resulting network will be tested on the validation set. The training will stop if the cost on the validation set doesn't increase after three full iterations to prevent overfitting or after maximum 300 iterations. All costs will be stored in between for future analysis.

In [14]:
# Define a method to update the parameters
def update_params(layers, param_grads, learning_rate):
    Function to update the parameters of the given layers with the given 
    gradients by gradient descent with the given learning rate.
    for layer, layer_backprop_grads in zip(layers, param_grads):
        for param, grad in zip(layer.get_params_iter(), 
            # The parameter returned by the iterator point to the 
            #  memory space of the original layer and can thus be 
            #  modified inplace.
            param -= learning_rate * grad  # Update each parameter
In [15]:
# Perform backpropagation
# initalize some lists to store the cost for future analysis        
batch_costs = []
train_costs = []
val_costs = []

max_nb_of_iterations = 300  # Train for a maximum of 300 iterations
learning_rate = 0.1  # Gradient descent learning rate

# Train for the maximum number of iterations
for iteration in range(max_nb_of_iterations):
    for X, T in XT_batches:  # For each minibatch sub-iteration
        # Get the activations
        activations = forward_step(X, layers)
        # Get cost
        batch_cost = layers[-1].get_cost(activations[-1], T)
        # Get the gradients
        param_grads = backward_step(activations, T, layers)
        # Update the parameters
        update_params(layers, param_grads, learning_rate)
    # Get full training cost for future analysis (plots)
    activations = forward_step(X_train, layers)
    train_cost = layers[-1].get_cost(activations[-1], T_train)
    # Get full validation cost
    activations = forward_step(X_validation, layers)
    validation_cost = layers[-1].get_cost(
        activations[-1], T_validation)
    if len(val_costs) > 3:
        # Stop training if the cost on the validation set doesn't 
        # decrease for 3 iterations
        if val_costs[-1] >= val_costs[-2] >= val_costs[-3]:
# The number of iterations that have been executed
nb_of_iterations = iteration + 1

The costs stored during training can be plotted to visualize the performance during training. The resulting plot is shown in the next figure. The cost on the training samples and validation samples goes down very quickly and flattens out after about 40 iterations on the full training set. Notice that the cost on the training set is lower than the cost on the validation set, this is because the network is optimized on the training set and is slightly overfitting. The training stops after around 90 iterations because the validation cost stops decreasing.
Also, notice that the cost of the mini-batches fluctuates around the cost of the full training set. This is the stochastic effect of mini-batches in SGD.

In [16]:
2021-05-13T19:57:18.867975 image/svg+xml Matplotlib v3.4.2,

Performance on the test set

Finally, the accuracy on the independent test set is computed to measure the performance of the model. The scikit-learn accuracy_score method is used to compute this accuracy from the predictions made by the model. The final accuracy on the test set is $96\%$ as shown below.

The results can be analyzed in more detail with the help of a confusion table . This table shows how many samples of which class are classified as one of the possible classes. The confusion table is shown in the figure below. It is computed by the scikit-learn confusion_matrix method.
Notice that the digit '8' is misclassified five times, two times as '2', two times as '5', and one time as '9'.

In [17]:
The accuracy on the test set is 96%
In [18]:
2021-05-13T19:57:19.258015 image/svg+xml Matplotlib v3.4.2,
In [19]:
Python implementation: CPython
Python version       : 3.9.4
IPython version      : 7.23.1

numpy     : 1.20.2
sklearn   : 0.24.2
seaborn   : 0.11.1
matplotlib: 3.4.2

Neural Networks Machine Learning NumPy Classification Backpropagation Gradient Descent Notebook