Programming Exercise 4



Week 5 of Andrew Ng’s Machine Learning Course went deeper into neural networks, explaining how to apply forward and backward propagation to calculate the elements required to train a neural network. The coding assignment used the same dataset as the previous week, but this time both forward and backward propagation had to be implemented. The graded assignments in Python are outlined below. This repository was used as a guide in implementing gradient checking and calculating the numerical gradient. The Git repository of the complete script is here.

Required modules

import numpy as np
from scipy import optimize
from scipy.io import loadmat

Gradient of the sigmoid function

Uses the following formula:

\begin{align*} & g'(z) = {d \over dz}g(z) = g(z)(1-g(z)) \newline& \text{where} \newline& \text{sigmoid}(z) = g(z) = {1 \over 1+ e^{-z}} \end{align*}

def sigmoid_gradient(z):
    g = sigmoid(z) * (1 - sigmoid(z))
    return(g)
...
g = sigmoid_gradient(z)

Input values are:

| Name | Type | Description |
| --- | --- | --- |
| z | numpy.ndarray | vector or matrix used as input to the sigmoid function |

Return value:

| Name | Type | Description |
| --- | --- | --- |
| g | numpy.ndarray | gradient of the sigmoid function; has the same shape as the input |
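
Both snippets above call a sigmoid helper that is not shown here; it is defined in the full script (carried over from the previous exercise). A minimal sketch of that helper, assuming the standard logistic function:

def sigmoid(z):
    # element-wise logistic function: 1 / (1 + e^(-z))
    return(1 / (1 + np.exp(-z)))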

Randomly initialise weights

Uses the following formula:

\begin{align*} & \text{Initialize each }\theta^{(l)}_{ij} \text{ to a random value in } [-\epsilon,\epsilon] \newline & W = \text{rand}(m, 1 + n) \times (2 \times \epsilon) - \epsilon \end{align*}

def rand_init_weights(L_in, L_out, epsilon_init=0.12):
    W = np.random.rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init
    return(W)
...
theta_1_rand = rand_init_weights(input_layer_size, hidden_layer_size)
theta_2_rand = rand_init_weights(hidden_layer_size, num_labels)

Input values are:

| Name | Type | Description |
| --- | --- | --- |
| L_in | int | number of incoming connections |
| L_out | int | number of outgoing connections |
| epsilon_init | float | half-width of the interval [-ε, ε] from which weights are drawn uniformly |

Return value:

| Name | Type | Description |
| --- | --- | --- |
| W | numpy.ndarray | array of randomised weights |
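
The default epsilon_init of 0.12 works well for this network. One heuristic suggested in the course notes is to scale epsilon with the sizes of the adjacent layers, as in the sketch below (the exact call is illustrative, not part of the graded script):

# heuristic: epsilon_init = sqrt(6) / sqrt(L_in + L_out)
# e.g. L_in = 400 and L_out = 25 gives roughly 0.12
epsilon_init = np.sqrt(6) / np.sqrt(input_layer_size + hidden_layer_size)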

Neural network cost function and gradient

Uses the following formula:

Regularised cost function:

\begin{gather*} J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2\end{gather*}

Regularised gradient:

\begin{align*} & {\partial \over \partial \theta^{(l)}_{ij}} J(\theta) = D^{(l)}_{ij} = {1 \over m}\Delta^{(l)}_{ij} \hspace{10pt} \text{for } j = 0 \newline& {\partial \over \partial \theta^{(l)}_{ij}} J(\theta) = D^{(l)}_{ij} = {1 \over m}\Delta^{(l)}_{ij} + {\lambda \over m}\theta^{(l)}_{ij} \hspace{10pt} \text{for } j \geq 1 \end{align*}

def cost_function(nn_params, input_layer_size, hidden_layer_size, num_labels,
                  X, y, lambda_=0):

    theta_1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)],
                         (hidden_layer_size, (input_layer_size + 1)))
    theta_2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size + 1)):],
                         (num_labels, (hidden_layer_size + 1)))
    m = y.shape[0]

    # forward propagation
    a_1 = np.c_[np.ones(X.shape[0]), X]
    z_2 = a_1 @ np.transpose(theta_1)
    a_2 = sigmoid(z_2)
    a_2 = np.c_[np.ones(a_2.shape[0]), a_2]
    z_3 = a_2 @ np.transpose(theta_2)
    a_3 = sigmoid(z_3)

    # convert y (assumed to be a 1-D array of integer labels) to one-hot dummies
    y_v = np.zeros([m, num_labels])
    y_v[np.arange(m), y] = 1

    # cost function
    error = ((-1 * y_v) * np.log(a_3)) - (1 - y_v) * np.log(1 - a_3)
    regularise_cost = (lambda_/(2*m)) * (np.sum(np.sum(theta_1[:, 1:]**2))
                                         + np.sum(np.sum(theta_2[:, 1:]**2)))
    J = 1/m * np.sum(np.sum(error)) + regularise_cost

    # backward propagation and gradient regularisation
    d_3 = a_3 - y_v
    gz_2 = sigmoid_gradient(np.c_[np.ones(z_2.shape[0]), z_2])
    d_2 = (d_3 @ theta_2) * gz_2
    d_2 = d_2[:, 1:]

    theta_1_grad = np.zeros(theta_1.shape)
    theta_1_grad += (np.transpose(d_2) @ a_1)
    nn_theta_1_grad = theta_1_grad/m + (lambda_/m) \
        * np.column_stack((np.zeros(theta_1.shape[0]), theta_1[:, 1:]))

    theta_2_grad = np.zeros(theta_2.shape)
    theta_2_grad += (np.transpose(d_3) @ a_2)
    nn_theta_2_grad = theta_2_grad/m + (lambda_/m) \
        * np.column_stack((np.zeros(theta_2.shape[0]), theta_2[:, 1:]))

    grad = np.concatenate([nn_theta_1_grad.ravel(), nn_theta_2_grad.ravel()])

    return(J, grad)
...
J, grad = cost_function(nn_params, input_layer_size, hidden_layer_size,
                        num_labels, X, y, lambda_)

Input values:

| Name | Type | Description |
| --- | --- | --- |
| nn_params | numpy.ndarray | theta parameters for the neural network, ‘unrolled’ into a vector |
| input_layer_size | int | number of features in the input layer |
| hidden_layer_size | int | number of hidden units in the second layer |
| num_labels | int | number of units in the output layer |
| X | numpy.ndarray | input feature matrix |
| y | numpy.ndarray | vector of target labels |
| lambda_ | float | lambda value used for regularisation (if 0, no regularisation is applied) |

Return values are:

| Name | Type | Description |
| --- | --- | --- |
| J | numpy.float64 | value of the cost function |
| grad | numpy.ndarray | gradient as an ‘unrolled’ vector |
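
As noted in the introduction, the full script also implements gradient checking: the analytic gradient returned by cost_function is compared against a numerical gradient obtained with central differences on a small, randomly initialised network. The sketch below illustrates the idea; the layer sizes and the helper name check_gradients are illustrative, not taken from the graded script.

def check_gradients(lambda_=0):
    # tiny network and dataset keep the numerical gradient cheap to compute
    input_layer_size, hidden_layer_size, num_labels, m = 3, 5, 3, 5
    theta_1 = rand_init_weights(input_layer_size, hidden_layer_size)
    theta_2 = rand_init_weights(hidden_layer_size, num_labels)
    X = np.random.rand(m, input_layer_size)
    y = np.random.randint(0, num_labels, m)
    nn_params = np.concatenate([theta_1.ravel(), theta_2.ravel()])

    # analytic gradient from backpropagation
    _, grad = cost_function(nn_params, input_layer_size, hidden_layer_size,
                            num_labels, X, y, lambda_)

    # numerical gradient: perturb each parameter by +/- eps and difference the costs
    eps = 1e-4
    num_grad = np.zeros(nn_params.shape)
    for i in range(nn_params.size):
        perturb = np.zeros(nn_params.shape)
        perturb[i] = eps
        loss_minus, _ = cost_function(nn_params - perturb, input_layer_size,
                                      hidden_layer_size, num_labels, X, y, lambda_)
        loss_plus, _ = cost_function(nn_params + perturb, input_layer_size,
                                     hidden_layer_size, num_labels, X, y, lambda_)
        num_grad[i] = (loss_plus - loss_minus) / (2 * eps)

    # relative difference should be very small (around 1e-9) if backpropagation is correct
    diff = np.linalg.norm(num_grad - grad) / np.linalg.norm(num_grad + grad)
    return(diff)

Once the gradient has been verified, the cost function can be passed to the optimize module imported above to train the network. A sketch of one possible call (jac=True tells scipy that cost_function returns both the cost and the gradient):

nn_params_init = np.concatenate([theta_1_rand.ravel(), theta_2_rand.ravel()])
res = optimize.minimize(cost_function, nn_params_init,
                        args=(input_layer_size, hidden_layer_size,
                              num_labels, X, y, lambda_),
                        jac=True, method='L-BFGS-B', options={'maxiter': 100})
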
July 2020