Programming Exercise 4



Week 5 of Andrew Ng’s Machine Learning Course went deeper into neural networks, explaining how to apply forward and backward propagation to calculate the elements required to train a neural network. The coding assignment used the same dataset as the previous week, but this time both forward and backward propagation had to be implemented. The graded assignments in Python are outlined below. This repository was used as a guide in implementing gradient checking and calculating the numerical gradient. The Git repository of the complete script is here.

Required modules

import numpy as np
from scipy import optimize
from scipy.io import loadmat

Gradient of the sigmoid function

Uses the following formula:

\begin{align*} & g'(z) = {d \over dz}g(z) = g(z)(1-g(z)) \newline& \text{where} \newline& \text{sigmoid}(z) = g(z) = {1 \over 1+ e^{-z}} \end{align*}

def sigmoid_gradient(z):
    g = sigmoid(z) * (1 - sigmoid(z))
    return(g)
...
g = sigmoid_gradient(z)

Input values are:

| Name | Type | Description |
| --- | --- | --- |
| z | numpy.ndarray | vector or matrix used as input to the sigmoid function |

Return value:

| Name | Type | Description |
| --- | --- | --- |
| g | numpy.ndarray | gradient of the sigmoid function; has the same shape as the input |
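
Both snippets above call a sigmoid helper that is not shown here; it is defined in the full script (carried over from the previous exercise). A minimal sketch of that helper, assuming the standard logistic function:

def sigmoid(z):
    # element-wise logistic function: 1 / (1 + e^(-z))
    return(1 / (1 + np.exp(-z)))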

Randomly initialise weights

Uses the following formula:

\begin{align*} & \text{Initialize each }\theta^{(l)}_{ij} \text{ to a random value in } [-\epsilon,\epsilon] \newline & W = \text{rand}(m, 1 + n) \times (2 \times \epsilon) - \epsilon \end{align*}

def rand_init_weights(L_in, L_out, epsilon_init=0.12):
    W = np.random.rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init
    return(W)
...
theta_1_rand = rand_init_weights(input_layer_size, hidden_layer_size)
theta_2_rand = rand_init_weights(hidden_layer_size, num_labels)

Input values are:

| Name | Type | Description |
| --- | --- | --- |
| L_in | int | number of incoming connections |
| L_out | int | number of outgoing connections |
| epsilon_init | float | half-width of the interval [-ε, ε] from which weights are drawn uniformly |

Return value:

| Name | Type | Description |
| --- | --- | --- |
| W | numpy.ndarray | array of randomised weights |
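
The default epsilon_init of 0.12 works well for this network. One heuristic suggested in the course notes is to scale epsilon with the sizes of the adjacent layers, as in the sketch below (the exact call is illustrative, not part of the graded script):

# heuristic: epsilon_init = sqrt(6) / sqrt(L_in + L_out)
# e.g. L_in = 400 and L_out = 25 gives roughly 0.12
epsilon_init = np.sqrt(6) / np.sqrt(input_layer_size + hidden_layer_size)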

Neural network cost function and gradient

Uses the following formula:

Regularised cost function:

\begin{gather*} J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2\end{gather*}

Regularised gradient:

\begin{align*} & {\partial \over \partial \theta^{(l)}_{ij}} J(\theta) = D^{(l)}_{ij} = {1 \over m}\Delta^{(l)}_{ij} \hspace{10pt} \text{for } j = 0 \newline& {\partial \over \partial \theta^{(l)}_{ij}} J(\theta) = D^{(l)}_{ij} = {1 \over m}\Delta^{(l)}_{ij} + {\lambda \over m}\theta^{(l)}_{ij} \hspace{10pt} \text{for } j \geq 1 \end{align*}

def cost_function(nn_params, input_layer_size, hidden_layer_size, num_labels,
                  X, y, lambda_=0):

    theta_1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)],
                         (hidden_layer_size, (input_layer_size + 1)))
    theta_2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size + 1)):],
                         (num_labels, (hidden_layer_size + 1)))
    m = y.shape[0]

    # forward propagation
    a_1 = np.c_[np.ones(X.shape[0]), X]
    z_2 = a_1 @ np.transpose(theta_1)
    a_2 = sigmoid(z_2)
    a_2 = np.c_[np.ones(a_2.shape[0]), a_2]
    z_3 = a_2 @ np.transpose(theta_2)
    a_3 = sigmoid(z_3)

    # convert y (assumed to be a 1-D array of integer labels) to one-hot dummies
    y_v = np.zeros([m, num_labels])
    y_v[np.arange(m), y] = 1

    # cost function
    error = ((-1 * y_v) * np.log(a_3)) - (1 - y_v) * np.log(1 - a_3)
    regularise_cost = (lambda_/(2*m)) * (np.sum(np.sum(theta_1[:, 1:]**2))
                                         + np.sum(np.sum(theta_2[:, 1:]**2)))
    J = 1/m * np.sum(np.sum(error)) + regularise_cost

    # backward propagation and gradient regularisation
    d_3 = a_3 - y_v
    gz_2 = sigmoid_gradient(np.c_[np.ones(z_2.shape[0]), z_2])
    d_2 = (d_3 @ theta_2) * gz_2
    d_2 = d_2[:, 1:]

    theta_1_grad = np.zeros(theta_1.shape)
    theta_1_grad += (np.transpose(d_2) @ a_1)
    nn_theta_1_grad = theta_1_grad/m + (lambda_/m) \
        * np.column_stack((np.zeros(theta_1.shape[0]), theta_1[:, 1:]))

    theta_2_grad = np.zeros(theta_2.shape)
    theta_2_grad += (np.transpose(d_3) @ a_2)
    nn_theta_2_grad = theta_2_grad/m + (lambda_/m) \
        * np.column_stack((np.zeros(theta_2.shape[0]), theta_2[:, 1:]))

    grad = np.concatenate([nn_theta_1_grad.ravel(), nn_theta_2_grad.ravel()])

    return(J, grad)
...
J, grad = cost_function(nn_params, input_layer_size, hidden_layer_size,
                        num_labels, X, y, lambda_)

Input values:

| Name | Type | Description |
| --- | --- | --- |
| nn_params | numpy.ndarray | theta parameters for the neural network, ‘unrolled’ into a vector |
| input_layer_size | int | number of features in the input layer |
| hidden_layer_size | int | number of hidden units in the second layer |
| num_labels | int | number of units in the output layer |
| X | numpy.ndarray | input feature matrix |
| y | numpy.ndarray | vector of target labels |
| lambda_ | float | lambda value used for regularisation (if 0, no regularisation is applied) |

Return values are:

| Name | Type | Description |
| --- | --- | --- |
| J | numpy.float64 | value of the cost function |
| grad | numpy.ndarray | gradient as an ‘unrolled’ vector |
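
As noted in the introduction, the full script also implements gradient checking: the analytic gradient returned by cost_function is compared against a numerical gradient obtained with central differences on a small, randomly initialised network. The sketch below illustrates the idea; the layer sizes and the helper name check_gradients are illustrative, not taken from the graded script.

def check_gradients(lambda_=0):
    # tiny network and dataset keep the numerical gradient cheap to compute
    input_layer_size, hidden_layer_size, num_labels, m = 3, 5, 3, 5
    theta_1 = rand_init_weights(input_layer_size, hidden_layer_size)
    theta_2 = rand_init_weights(hidden_layer_size, num_labels)
    X = np.random.rand(m, input_layer_size)
    y = np.random.randint(0, num_labels, m)
    nn_params = np.concatenate([theta_1.ravel(), theta_2.ravel()])

    # analytic gradient from backpropagation
    _, grad = cost_function(nn_params, input_layer_size, hidden_layer_size,
                            num_labels, X, y, lambda_)

    # numerical gradient: perturb each parameter by +/- eps and difference the costs
    eps = 1e-4
    num_grad = np.zeros(nn_params.shape)
    for i in range(nn_params.size):
        perturb = np.zeros(nn_params.shape)
        perturb[i] = eps
        loss_minus, _ = cost_function(nn_params - perturb, input_layer_size,
                                      hidden_layer_size, num_labels, X, y, lambda_)
        loss_plus, _ = cost_function(nn_params + perturb, input_layer_size,
                                     hidden_layer_size, num_labels, X, y, lambda_)
        num_grad[i] = (loss_plus - loss_minus) / (2 * eps)

    # relative difference should be very small (around 1e-9) if backpropagation is correct
    diff = np.linalg.norm(num_grad - grad) / np.linalg.norm(num_grad + grad)
    return(diff)

Once the gradient has been verified, the cost function can be passed to the optimize module imported above to train the network. A sketch of one possible call (jac=True tells scipy that cost_function returns both the cost and the gradient):

nn_params_init = np.concatenate([theta_1_rand.ravel(), theta_2_rand.ravel()])
res = optimize.minimize(cost_function, nn_params_init,
                        args=(input_layer_size, hidden_layer_size,
                              num_labels, X, y, lambda_),
                        jac=True, method='L-BFGS-B', options={'maxiter': 100})
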
July 2020