Programming Exercise 7



Week 8 of Andrew Ng’s Machine Learning Course taught an introduction into unsupervised learning using K-means clusters. Principal component analysis was also covered as a way to reduce the dimensionality of a dataset and for increasing interpretability. The programming assignments used these techniques to demonstrate image compression and reducing/ restoring data. The programming assignments in python are below. This repository was used as a guide in implementing the graphical representation of the data. The Git repository of the complete script is here.

Required modules

import numpy as np
from scipy.io import loadmat
import matplotlib.pyplot as plt
import matplotlib as mpl
from tqdm import tqdm

Find closest centroids

Uses following formula:

\begin{align*} c^{(i)}: = j \text{ that minimizes } {\left\Vert x^{(i)} - \mu_j \right\Vert} ^2 \end{align*}

def find_closest_centroid(X, centroids):
    K = centroids.shape[0]
    idx = np.empty(X.shape[0])

    for i in range(X.shape[0]):
        min_length = np.inf

        for j in range(K):
            length = np.sum((X[i] - centroids[j]) ** 2)
            if length < min_length:
                min_length = length
                idx[i] = j

    return idx.astype(int)

Input values:

Name Type Description
X numpy.ndarray Array of dataset
centroids numpy.ndarray Array of K-means centroids

Return values:

Name Type Description
idx numpy.ndarray Array of centroid assignment of each example as integer

Compute centroid means

Uses the following formula:

\begin{align*} \mu_k : = \frac{1}{{\left\vert C_k \right\vert}} \sum_{i \epsilon C_k}x^{(i)} \end{align*}

def compute_centroid_mean(X, idx, K):
    m, n = X.shape
    centroids = np.empty((K, n))

    for k in range(K):
        X_k = np.delete(X, np.argwhere(idx != k), axis=0)
        centroids[k] = np.sum(X_k, axis=0) / X_k.shape[0]

    return centroids

Input values:

Name Type Description
X numpy.ndarray Array of dataset
idx numpy.ndarray Array of centroid assignment of each example as integer
K integer Number of clusters

Return values:

Name Type Description
centroids numpy.ndarray Array where each row is mean of the data points assigned to it

Initialisation for K-means centroids

def k_means_initial_centroid(X, K):
    randidx = np.random.permutation(X.shape[0])
    centroids = X[randidx[:K], :]
    return centroids

Input values:

Name Type Description
X numpy.ndarray Array of dataset
K integer Number of clusters

Return values:

Name Type Description
centroids numpy.ndarray Array of random initialised clusters

Perform principla component analysis

Use the following formula:

\begin{align*} \Sigma = \frac{1}{m} X^T X \end{align*}

def pca(X):
    m, n = X.shape
    sigma = (np.transpose(X) @ X) / m
    U, S, V = np.linalg.svd(sigma)
    return U, S

Input values:

Name Type Description
X numpy.ndarray Array of dataset

Return values:

Name Type Description
U numpy.ndarray Array of eigenvectors, represents computed principal components
S numpy.ndarray Array of singular values for each principal component

Project a data set into a lower dimensional space

def project_data(X, U, K):
    U_reduce = U[:, :K]
    Z = X @ U_reduce
    return Z

Input values:

Name Type Description
X numpy.ndarray Array of dataset
U numpy.ndarray Array of eigenvectors, computed from PCA
K integer Number of dimensions to project to

Return values:

Name Type Description
Z numpy.ndarray Projects of the dataset onto the top K eigenvectors

Recover the original data from the projection

def recover_data(Z, U, K):
    v = Z[:, :K]
    U_reduce = U[:, :K]
    X_rec = v @ np.transpose(U_reduce)
    return X_rec

Input values:

Name Type Description
Z numpy.ndarray Reduced data after applying PCA
U numpy.ndarray Array of eigenvectors, computed from PCA
K integer Number of principal components retained

Return values:

Name Type Description
X_rec numpy.ndarray Recovered dataset after transformation back to original
July 2020