# How to implement a neural network Part 4

This page is part of a 5 (+2) parts tutorial on how to implement a simple neural network model. You can find the links to the rest of the tutorial here:

## Vectorization ¶

This part will cover:

The previous tutorial described a very simple neural network with only one input, one hidden neuron and one output. This tutorial will describe a neural network that takes 2-dimensional input samples, projects them onto a 3-dimensional hidden layer, and classifies them with a 2-dimensional softmax output classfier, this softmax function is explained in intermezzo 2 . While we didn't add the bias parameters to the previous 2 models, we will add them to this model. We will work out the vector computations needed to optimize and run this neural network in this tutorial. The network of this model is shown in the following figure:

The notebook starts out with importing the libraries we need:

In [1]:

## Define the dataset ¶

In this example the target classes $t$ corresponding to the inputs $x$ will be generated from 2 class distributions: red ($t=0$) and blue ($t=1$). Where the red class is a circular distribution that surrounds the distribution of the blue class. The dataset is generated by the scikit-learn  make_circles  method.

This results in a 2D dataset that is not linearly separable. The model from part 2 won't be able to classify both classes correctly since it can learn only linear separators. By adding a hidden layer, the model will be able to train a non-linear separator.

The $N$ input samples with $2$ variables each are given as a $N \times 2$ matrix $X$:

$$X = \begin{bmatrix} x_{11} & x_{12} \\ \vdots & \vdots \\ x_{N1} & x_{N2} \end{bmatrix}$$

Where $x_{ij}$ is the value of the $j$-th variable of the $i$-th input sample.

Since the softmax output function is used the targets for each input sample are given as a $N \times 2$ matrix $T$:

$$T = \begin{bmatrix} t_{11} & t_{12}\\ \vdots & \vdots \\ t_{N1} & t_{N2} \end{bmatrix}$$

Where $t_{ij} = 1$ if and only if the $i$-th input sample belongs to class $j$. So blue points are labelled T = [0 1] and red points are labelled T = [1 0].

In [2]:
# Generate the dataset
X, t = sklearn.datasets.make_circles(n_samples=100, shuffle=False, factor=0.3, noise=0.1)
T = np.zeros((100,2)) # Define target matrix
T[t==1,1] = 1
T[t==0,0] = 1
# Separate the red and blue points for plotting
x_red = X[t==0]
x_blue = X[t==1]

print('shape of X: {}'.format(X.shape))
print('shape of T: {}'.format(T.shape))

shape of X: (100, 2)
shape of T: (100, 2)

In [3]:

## Vectorization of backpropagation ¶

### 1. Vectorization of the forward step ¶

#### Compute activations of hidden layer ¶

The 2-dimensional inputs $X$ are projected onto the 3 dimensions of the hidden layer $H$ by the weight matrix $W_h$ ($w_{hij}$ is the weight of the connection between input variable $i$ and hidden neuron activation $j$) and bias vector $\mathbf{b}_h$ :

\begin{align} W_h = \begin{bmatrix} w_{h11} & w_{h12} & w_{h13} \\ w_{h21} & w_{h22} & w_{h23} \end{bmatrix} && \mathbf{b}_h = \begin{bmatrix} b_{h1} & b_{h2} & b_{h3} \end{bmatrix} \end{align}

following computation:

$$H = \sigma(X \cdot W_h + \mathbf{b}_h) = \frac{1}{1+e^{-(X \cdot W_h + \mathbf{b}_h)}} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ \vdots & \vdots & \vdots \\ h_{N1} & h_{N2} & h_{N3} \end{bmatrix}$$

With $\sigma$ the logistic function, and with $H$ resulting in a $N \times 3$ matrix.

This computation flow is illustrated by the figure below. Each input $x_{ij}$ is multiplied with the weight parameters from the corresponding column $w_{hj1}, w_{hj2}, w_{hj3}$. The multiplication results for each row $k$ ($x_{ij} * w_{hjk}$) are summed to form the $z_{ik}$, after which these are transformed by the logistic function $\sigma$ to the $h_{ik}$. Note that the bias $B_h$ can be incorporated in $W_h$ by adding a new variable to each input $\mathbf{x}_i$ that is always $+1$.

$\mathbf{b}_h$ and $W_h$ are represented in the code below by  bh  and  Wh  respectively. The hidden layer activations are computed by the  hidden_activations(X, Wh, bh)  method.

#### Compute activations of output ¶

To compute the output activations the hidden layer activations can be projected onto the 2-dimensional output layer. This is done by the $3 \times 2$ weight matrix $W_o$ ($w_{oij}$ is the weight of the connection between hidden layer neuron $i$ and output activation $j$) and $2 \times 1$ bias vector $\mathbf{b}_o$ :

\begin{align} W_o = \begin{bmatrix} w_{o11} & w_{o12} \\ w_{o21} & w_{o22} \\ w_{o31} & w_{o32} \end{bmatrix} && \mathbf{b}_o = \begin{bmatrix} b_{o1} & b_{o2} \end{bmatrix} \end{align}

following computation:

$$Y = \varsigma(H \cdot W_o + \mathbf{b}_o) = \frac{e^{Z_o}}{\sum_{d=1}^C e^{\mathbf{z}_{od}}} = \frac{e^{H \cdot W_o + \mathbf{b}_o}}{\sum_{d=1}^C e^{H \cdot \mathbf{w}_{od} + b_{od}}} = \begin{bmatrix} y_{11} & y_{12}\\ \vdots & \vdots \\ y_{n1} & y_{n2} \end{bmatrix}$$

With $\varsigma$ the softmax function, $Y$ resulting in a $n \times 2$ matrix, $\mathbf{z}_{od}$ the $d$-th column of the $Z_o$ matrix, $\mathbf{w}_{od}$ the $d$-th column of the $W_o$ matrix, and $b_{od}$ the $d$-th element of the $b_o$ vector.

$\mathbf{b}_o$ and $W_o$ are represented in the code below by  bo  and  Wo  respectively. The hidden layer activations are computed by the  output_activations(H, Wo, bo)  method.

In [4]:
# Define the logistic function
def logistic(z):
return 1 / (1 + np.exp(-z))

# Define the softmax function
def softmax(z):
return np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)

# Function to compute the hidden activations
def hidden_activations(X, Wh, bh):
return logistic(X.dot(Wh) + bh)

# Define output layer feedforward
def output_activations(H, Wo, bo):
return softmax(H.dot(Wo) + bo)

# Define the neural network function
def nn(X, Wh, bh, Wo, bo):
return output_activations(hidden_activations(X, Wh, bh), Wo, bo)

# Define the neural network prediction function that only returns
#  1 or 0 depending on the predicted class
def nn_predict(X, Wh, bh, Wo, bo):
return np.around(nn(X, Wh, bh, Wo, bo))


### 2. Vectorization of the backward step ¶

#### Vectorization of the output layer backward step ¶

##### Compute the error at the output ¶

The output layer is a softmax layer with corresponding cross-entropy cost function, which are described in detail in intermezzo 2 . The cost function $\xi$ for $N$ samples and $C$ classes used in this model is defined as:

$$\xi(T,Y) = \sum_{i=1}^n \xi(\mathbf{t}_i,\mathbf{y}_i) = - \sum_{i=1}^n \sum_{i=c}^{C} t_{ic} \cdot log( y_{ic})$$

The error gradient $\delta_{o}$ of this cost function at the softmax output layer is simply:

$$\delta_{o} = \frac{\partial \xi}{\partial Z_o} = Y - T$$

With $Z_o$ a $n \times 2$ matrix of inputs to the softmax layer ($Z_o = H \cdot W_o + \mathbf{b}_o$), and $T$ a $n \times 2$ target matrix that corresponds to $Y$. Note that $\delta_{o}$ also results in a $n \times 2$ matrix.

$\delta_{o}$ will be defined in the code as  Eo  and will be computed by the  error_output(Y, T)  method defined below.

##### Update the output layer weights ¶

At the output the gradient $\delta_{\mathbf{w}_{oj}}$ over all $N$ samples is computed by ${\partial \xi}/{\partial \mathbf{w}_{oj}}$:

$$\frac{\partial \xi}{\partial \mathbf{w}_{oj}} = \frac{\partial Z_{o}}{\partial \mathbf{w}_{oj}} \frac{\partial Y}{\partial Z_{o}} \frac{\partial \xi}{\partial Y} = \frac{\partial Z_{o}}{\partial w_{oji}} \frac{\partial \xi}{\partial Z_o} = \sum_{i=1}^N h_{ij} (\mathbf{y}_i - \mathbf{t}_i) = \sum_{i=1}^N h_{ij} \delta_{oi}$$

With $\mathbf{w}_{oj}$ the $j$-th row of $W_o$ and thus a $1 \times 2$ vector. This formula can be written out as matrix operations with the parameters of the output layer:

$$\frac{\partial \xi}{\partial W_o} = H^T \cdot (Y-T) = H^T \cdot \delta_{o}$$

The resulting gradient is a $3 \times 2$ Jacobian matrix :

$$J_{W_o} = \frac{\partial \xi}{\partial W_o} = \begin{bmatrix} \frac{\partial \xi}{\partial w_{o11}} & \frac{\partial \xi}{\partial w_{o12}} \\ \frac{\partial \xi}{\partial w_{o21}} & \frac{\partial \xi}{\partial w_{o22}} \\ \frac{\partial \xi}{\partial w_{o31}} & \frac{\partial \xi}{\partial w_{o32}} \end{bmatrix}$$

$J_{W_o}$ will be defined in the code as  JWo  and will be computed by the  gradient_weight_out(H, Eo)  method defined below.

##### Update the output layer bias ¶

The bias $\mathbf{b}_o$ can be updated in the same manner. The formula to compute the gradient ${\partial \xi}/{\partial \mathbf{b}_{o}}$ for batch processing over all $N$ samples is:

$$\frac{\partial \xi}{\partial \mathbf{b}_{o}} = \frac{\partial Z_{o}}{\partial \mathbf{b}_{o}} \frac{\partial Y}{\partial Z_{o}} \frac{\partial \xi}{\partial Y} = \sum_{i=1}^N 1 * (\mathbf{y}_i - \mathbf{t}_i) = \sum_{i=1}^N \delta_{oi}$$

The resulting gradient is a $2 \times 1$ Jacobian matrix:

$$J_{b_o} = \frac{\partial \xi}{\partial \mathbf{b}_o} = \begin{bmatrix} \frac{\partial \xi}{\partial b_{o1}} & \frac{\partial \xi}{\partial b_{o2}} \end{bmatrix}$$

$J_{b_o}$ will be defined in the code as  Jbo  and will be computed by the  gradient_bias_out(Eo)  method defined below.

In [5]:
# Define the cost function
def cost(Y, T):
return - np.multiply(T, np.log(Y)).sum()

# Define the error function at the output
def error_output(Y, T):
return Y - T

# Define the gradient function for the weight parameters at the output layer
return  H.T.dot(Eo)

# Define the gradient function for the bias parameters at the output layer
return  np.sum(Eo, axis=0, keepdims=True)


#### Vectorization of the hidden layer backward step ¶

##### Compute the error at the hidden layer ¶

The error gradient $\delta_{h}$ of the cost function at the hidden layer is defined as:

$$\delta_{h} = \frac{\partial \xi}{\partial Z_h} = \frac{\partial H}{\partial Z_h} \frac{\partial \xi}{\partial H} = \frac{\partial H}{\partial Z_h} \frac{\partial Z_o}{\partial H} \frac{\partial \xi}{\partial Z_o}$$

With $Z_h$ a $n \times 3$ matrix of inputs to the logistic functions in the hidden neurons ($Z_h = X \cdot W_h + \mathbf{b}_h$). Note that $\delta_{h}$ will also result in a $N \times 3$ matrix.

Lets first derive the error gradient $\delta_{hij}$ for one sample $i$ at hidden neuron $j$. The gradients that backpropagate from the previous layer via the weighted connections are summed for each origin $h_{ij}$.

$$\delta_{hij} = \frac{\partial \xi}{\partial z_{hij}} = \frac{\partial h_{ij}}{\partial z_{hij}} \frac{\partial \mathbf{z}_{oi}}{\partial h_{ij}} \frac{\partial \xi}{\partial \mathbf{z}_{oi}} = h_{ij} (1-h_{ij}) \sum_{k=1}^2 w_{ojk} (y_{ik}-t_{ik}) = h_{ij} (1-h_{ij}) [\delta_{oi} \cdot \mathbf{w}_{oj}^T]$$

Where $\mathbf{w}_{oj}$ is the $j$-th row of $W_o$, and thus a $1 \times 2$ vector and $\delta_{oi}$ is a $1 \times 2$ vector. The full $N \times 3$ error matrix $\delta_{h}$ can thus be calculated as:

$$\delta_{h} = \frac{\partial \xi}{\partial Z_h} = H \circ (1 - H) \circ [\delta_{o} \cdot W_o^T]$$

With $\circ$ the elementwise product .

$\delta_{h}$ will be defined in the code as  Eh  and will be computed by the  error_hidden(H, Wo, Eo)  method defined below.

##### Update the hidden layer weights ¶

At the hidden layer the gradient ${\partial \xi}/{\partial \mathbf{w}_{hj}}$ over all $N$ samples can be computed by:

$$\frac{\partial \xi}{\partial \mathbf{w}_{hj}} = \frac{\partial Z_{h}}{\partial \mathbf{w}_{hj}} \frac{\partial H}{\partial Z_{h}} \frac{\partial \xi}{\partial H} = \frac{\partial Z_{h}}{\partial \mathbf{w}_{hj}} \frac{\partial \xi}{\partial Z_h} = \sum_{i=1}^N x_{ij} \delta_{hi}$$

With $\mathbf{w}_{hj}$ the $j$-th row of $W_h$ and thus a $1 \times 3$ vector. We can write this formula out as matrix operations with the parameters of the hidden layer:

$$\frac{\partial \xi}{\partial W_h} = X^T \cdot \delta_{h}$$

The resulting gradient is a $2 \times 3$ Jacobian matrix :

$$J_{W_h} = \frac{\partial \xi}{\partial W_{h}} = \begin{bmatrix} \frac{\partial \xi}{\partial w_{h11}} & \frac{\partial \xi}{\partial w_{h12}} & \frac{\partial \xi}{\partial w_{h13}} \\ \frac{\partial \xi}{\partial w_{h21}} & \frac{\partial \xi}{\partial w_{h22}} & \frac{\partial \xi}{\partial w_{h23}} \\ \end{bmatrix}$$

$J_{W_h}$ will be defined in the code as  JWh  and will be computed by the  gradient_weight_hidden(X, Eh)  method defined below.

##### Update the hidden layer bias ¶

The bias $\mathbf{b}_h$ can be updated in the same manner. The formula to compute the gradient ${\partial \xi}/{\partial \mathbf{b}_{h}}$ for batch processing over all $N$ samples is:

$$\frac{\partial \xi}{\partial \mathbf{b}_{h}} = \frac{\partial Z_{h}}{\partial \mathbf{b}_{h}} \frac{\partial H}{\partial Z_{h}} \frac{\partial \xi}{\partial H} = \sum_{j=1}^N \delta_{hj}$$

The resulting gradient is a $1 \times 3$ Jacobian matrix:

$$J_{b_h} = \frac{\partial \xi}{\partial \mathbf{b}_{h}} = \begin{bmatrix} \frac{\partial \xi}{\partial b_{h1}} & \frac{\partial \xi}{\partial b_{h2}} & \frac{\partial \xi}{\partial b_{h3}} \end{bmatrix}$$

$J_{b_h}$ will be defined in the code as  Jbh  and will be computed by the  gradient_bias_hidden(Eh)  method defined below.

In [6]:
# Define the error function at the hidden layer
def error_hidden(H, Wo, Eo):
# H * (1-H) * (E . Wo^T)
return np.multiply(np.multiply(H,(1 - H)), Eo.dot(Wo.T))

# Define the gradient function for the weight parameters at the hidden layer
return X.T.dot(Eh)

# Define the gradient function for the bias parameters at the output layer
return  np.sum(Eh, axis=0, keepdims=True)


Programming the computation of the backpropagation gradient is prone to bugs. This is why it is recommended to check the gradients of your models. Gradient checking is done by computing the numerical gradient of each parameter, and compare this value to the gradient found by backpropagation.

The numerical gradient ${\partial \xi}/{\partial \theta_i}$ for a parameter $\theta_i$ can be computed by:

$$\frac{\partial \xi}{\partial \theta_i} = \frac{f(X, \theta_1, \cdots, \theta_i+\epsilon, \cdots, \theta_m) - f(X, \theta_1, \cdots, \theta_i-\epsilon, \cdots, \theta_m)}{2\epsilon}$$

Where $f$ if the neural network function that takes the input $X$ and all parameters $\theta$, and $\epsilon$ is the small change that is used to peturbate the parameter $\theta_i$.

The numerical gradient for each parameter should be close to the backpropagation gradient for that parameter.

In [7]:
# Initialize weights and biases
init_var = 1
# Initialize hidden layer parameters
bh = np.random.randn(1, 3) * init_var
Wh = np.random.randn(2, 3) * init_var
# Initialize output layer parameters
bo = np.random.randn(1, 2) * init_var
Wo = np.random.randn(3, 2) * init_var

# Compute the gradients by backpropagation
# Compute the activations of the layers
H = hidden_activations(X, Wh, bh)
Y = output_activations(H, Wo, bo)
# Compute the gradients of the output layer
Eo = error_output(Y, T)
# Compute the gradients of the hidden layer
Eh = error_hidden(H, Wo, Eo)

In [8]:
# Combine all parameter matrices in a list
params = [Wh, bh, Wo, bo]
# Combine all parameter gradients in a list
grad_params = [JWh, Jbh, JWo, Jbo]

# Set the small change to compute the numerical gradient
eps = 0.0001

# Check each parameter matrix
for p_idx in range(len(params)):
# Check each parameter in each parameter matrix
for row in range(params[p_idx].shape[0]):
for col in range(params[p_idx].shape[1]):
# Copy the parameter matrix and change the current parameter slightly
p_matrix_min = params[p_idx].copy()
p_matrix_min[row,col] -= eps
p_matrix_plus = params[p_idx].copy()
p_matrix_plus[row,col] += eps
# Copy the parameter list, and change the updated parameter matrix
params_min = params[:]
params_min[p_idx] = p_matrix_min
params_plus = params[:]
params_plus[p_idx] =  p_matrix_plus
grad_num = (cost(nn(X, *params_plus), T)-cost(nn(X, *params_min), T))/(2*eps)
# Raise error if the numerical grade is not close to the backprop gradient

No gradient errors found


### Backpropagation updates with momentum ¶

In the previous examples we used simple gradient decent to optimize the parameters with respect to the cost function (which was convex in the first and second example). Multilayer neural networks with non-linear activation functions and many parameters are highly unlikely to have convex cost functions. Simple gradient decent is not the best method to find a global minimum of a non-convex cost function since it is a local optimization method that tends to converge to a local minimum .

To solve this issue this example uses a variant of gradient decent called momentum . The method of momentum can be visualized as a ball rolling down the surface of the cost function, this ball will increase in velocity while rolling downhill, and will decrease in velocity when rolling uphill. This momentum update is formulated as:

$$\begin{split} V(i+1) & = \lambda V(i) - \mu \frac{\partial \xi}{\partial \theta(i)} \\ \theta(i+1) & = \theta(i) + V(i+1) \end{split}$$

Where $V(i)$ is the velocity of the parameters at iteration $i$, $\theta(i)$ is the location of the parameters at iteration $i$, ${\partial \xi}/{\partial \theta(i)}$ the gradient of the parameters with respect to the cost function at iteration $i$, $\lambda$ how much the velocity decreases due to 'resistance' and $\mu$ the learning rate (how much the gradient affects the velocity). This formula can be visualised as in the following illustration:

The velocities $V_{W_h}, V_{\mathbf{b}_h}, V_{W_o}, V_{\mathbf{b}_o},$ corresponding to the parameters $W_h, \mathbf{b}_h, W_o, \mathbf{b}_o$ are represented in the code by  VWh  ,  Vbh  ,  VWo  ,  Vbo  . And kept in a list  Vs  . They are updated by the  update_velocity(X, T, ls_of_params, Vs, momentum_term, learning_rate)  method. After updating the velocities the parameters are updated via the  update_params(ls_of_params, Vs)  method.

In [9]:
# Define the update function to update the network parameters over 1 iteration
def backprop_gradients(X, T, Wh, bh, Wo, bo):
# Compute the output of the network
# Compute the activations of the layers
H = hidden_activations(X, Wh, bh)
Y = output_activations(H, Wo, bo)
# Compute the gradients of the output layer
Eo = error_output(Y, T)
# Compute the gradients of the hidden layer
Eh = error_hidden(H, Wo, Eo)
return [JWh, Jbh, JWo, Jbo]

def update_velocity(X, T, ls_of_params, Vs, momentum_term, learning_rate):
# ls_of_params = [Wh, bh, Wo, bo]
# Js = [JWh, Jbh, JWo, Jbo]
return [momentum_term * V - learning_rate * J for V,J in zip(Vs, Js)]

def update_params(ls_of_params, Vs):
# ls_of_params = [Wh, bh, Wo, bo]
# Vs = [VWh, Vbh, VWo, Vbo]
return [P + V for P,V in zip(ls_of_params, Vs)]


Training the neural network model with this momentum method over $300$ iterations on the dataset $X$ and $T$ defined in the beginning results in the training convergence plotted in the figure below. This smooth convergence would almost never result from simple gradient descent. You can try it out yourself and implement gradient descent optimization for this model. You will notice that if you run your gradient descent a couple of times from different starting points that it will almost always converge to a sub-optimal solution. Note that momentum is less sensitive to the learning rate than gradient descent, but the learning rate hyperparameter still needs to be tuned separately. In this tutorial, this learning rate was tuned by experimentation.

In [10]:
# Run backpropagation
# Initialize weights and biases
init_var = 0.1
# Initialize hidden layer parameters
bh = np.random.randn(1, 3) * init_var
Wh = np.random.randn(2, 3) * init_var
# Initialize output layer parameters
bo = np.random.randn(1, 2) * init_var
Wo = np.random.randn(3, 2) * init_var
# Set the learning rate
learning_rate = 0.02
momentum_term = 0.9

# define the velocities Vs = [VWh, Vbh, VWo, Vbo]
Vs = [np.zeros_like(M) for M in [Wh, bh, Wo, bo]]

lr_update = learning_rate / nb_of_iterations # learning rate update rule
ls_costs = [cost(nn(X, Wh, bh, Wo, bo), T)]  # list of cost over the iterations
for i in range(nb_of_iterations):
# Update the velocities and the parameters
Vs = update_velocity(X, T, [Wh, bh, Wo, bo], Vs, momentum_term, learning_rate)
Wh, bh, Wo, bo = update_params([Wh, bh, Wo, bo], Vs)
ls_costs.append(cost(nn(X, Wh, bh, Wo, bo), T))

In [11]:

## Visualization of the trained classifier ¶

The resulting decision boundary of running backpropagation with momentum on the example inputs $X$ and targets $T$ is shown in the figure below. The background color (blue, red) refers to the classification decision of the trained classifier at that position in the input space. Note that the classifier is able to surround the blue samples and that all examples are classified correctly by the trained classifier.

In [12]:

## Transformation of the input domain ¶

There are two reasons why the neural network is able to separate the non-linearly separable classes. First of all the non-linear logistic functions in the hidden layer can help to make non-linear transformations of the data. And second the hidden layer contains three dimensions. By projecting the 2-dimensional input data to 3 dimensions, the classes can become linearly separable in this 3-dimensional space. This projection in illustrated in the figure below that plots the transformations of the input samples upon this 3-dimensional hidden layer.

In [13]:

This post at peterroelants.github.io is generated from an IPython notebook file. Link to the full IPython notebook file