Week 8 of Andrew Ng’s Machine Learning course introduced unsupervised learning using K-means clustering. Principal component analysis (PCA) was also covered as a way to reduce the dimensionality of a dataset and improve its interpretability. The programming assignments used these techniques to demonstrate image compression and to reduce and then restore data; my Python implementations are below. This repository was used as a guide in implementing the graphical representation of the data. The Git repository of the complete script is here.
import numpy as np
from scipy.io import loadmat
import matplotlib.pyplot as plt
import matplotlib as mpl
from tqdm import tqdm
Uses the following formula to assign each example to its closest centroid:
\begin{align*} c^{(i)} := j \text{ that minimizes } {\left\Vert x^{(i)} - \mu_j \right\Vert}^2 \end{align*}
def find_closest_centroid(X, centroids):
    K = centroids.shape[0]
    idx = np.empty(X.shape[0])
    # For every example, keep the index of the centroid with the smallest squared distance.
    for i in range(X.shape[0]):
        min_length = np.inf
        for j in range(K):
            length = np.sum((X[i] - centroids[j]) ** 2)
            if length < min_length:
                min_length = length
                idx[i] = j
    return idx.astype(int)
Input values:
Name | Type | Description |
---|---|---|
X | numpy.ndarray | Dataset array, one example per row |
centroids | numpy.ndarray | Array of K-means centroids |
Return values:
Name | Type | Description |
---|---|---|
idx | numpy.ndarray | Integer index of the centroid assigned to each example |
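As a quick sanity check, the assignment step is easy to verify on a couple of made-up points (the values below are illustrative, not from the course data):

X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]])
initial_centroids = np.array([[1.0, 1.0], [5.0, 5.0]])
print(find_closest_centroid(X_demo, initial_centroids))  # -> [0 0 1]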
Uses the following formula to move each centroid to the mean of the examples assigned to it:
\begin{align*} \mu_k := \frac{1}{\left\vert C_k \right\vert} \sum_{i \in C_k} x^{(i)} \end{align*}
def compute_centroid_mean(X, idx, K):
    m, n = X.shape
    centroids = np.empty((K, n))
    for k in range(K):
        # Average the examples currently assigned to centroid k.
        X_k = X[idx == k]
        centroids[k] = np.sum(X_k, axis=0) / X_k.shape[0]
    return centroids
Input values:
Name | Type | Description |
---|---|---|
X | numpy.ndarray | Dataset array, one example per row |
idx | numpy.ndarray | Integer index of the centroid assigned to each example |
K | integer | Number of clusters |
Return values:
Name | Type | Description |
---|---|---|
centroids | numpy.ndarray | Array where each row is the mean of the data points assigned to that centroid |
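Continuing the toy example above, the update step moves each centroid to the mean of its assigned points:

idx_demo = find_closest_centroid(X_demo, initial_centroids)
print(compute_centroid_mean(X_demo, idx_demo, 2))  # -> roughly [[1.1, 0.9], [5.0, 5.0]]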
def k_means_initial_centroid(X, K):
    # Pick K distinct examples at random to serve as the initial centroids.
    randidx = np.random.permutation(X.shape[0])
    centroids = X[randidx[:K], :]
    return centroids
Input values:
Name | Type | Description |
---|---|---|
X | numpy.ndarray | Dataset array, one example per row |
K | integer | Number of clusters |
Return values:
Name | Type | Description |
---|---|---|
centroids | numpy.ndarray | Randomly initialised centroids |
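The assignment ties these three helpers together in an iterative loop. The exact driver lives in the linked repository, but a minimal sketch (with an arbitrary iteration count and no handling of empty clusters) might look like this:

def run_k_means(X, K, max_iters=10):
    # Alternate the assignment step and the update step for a fixed number of iterations.
    centroids = k_means_initial_centroid(X, K)
    for _ in tqdm(range(max_iters)):
        idx = find_closest_centroid(X, centroids)
        centroids = compute_centroid_mean(X, idx, K)
    return centroids, idx

For the image-compression part of the assignment, X is the array of pixel colours and the compressed image can be rebuilt as centroids[idx].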
Uses the following formula to compute the covariance matrix:
\begin{align*} \Sigma = \frac{1}{m} X^T X \end{align*}
def pca(X):
    m, n = X.shape
    # Covariance matrix of the (feature-normalised) data.
    sigma = (np.transpose(X) @ X) / m
    # The columns of U are the principal components; S holds the singular values.
    U, S, V = np.linalg.svd(sigma)
    return U, S
Input values:
Name | Type | Description |
---|---|---|
X | numpy.ndarray | Dataset array, one example per row |
Return values:
Name | Type | Description |
---|---|---|
U | numpy.ndarray | Array of eigenvectors; each column is a computed principal component |
S | numpy.ndarray | Array of singular values for each principal component |
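PCA expects the data to be centred (and, in the exercise, scaled to unit variance) before the covariance matrix is computed. A minimal preprocessing sketch, assuming the exercise's 2-D dataset is stored in ex7data1.mat under the key 'X':

data = loadmat('ex7data1.mat')  # file and key names assumed from the course exercise
X = data['X']
mu = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mu) / std  # feature normalisation
U, S = pca(X_norm)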
def project_data(X, U, K):
U_reduce = U[:, :K]
Z = X @ U_reduce
return Z
Input values:
Name | Type | Description |
---|---|---|
X | numpy.ndarray | Dataset array, one example per row |
U | numpy.ndarray | Array of eigenvectors, computed from PCA |
K | integer | Number of dimensions to project to |
Return values:
Name | Type | Description |
---|---|---|
Z | numpy.ndarray | Projection of the dataset onto the top K eigenvectors |
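Using the normalised data and the U computed in the sketch above, projecting onto a single principal component reduces each 2-D example to one number:

Z = project_data(X_norm, U, 1)
print(Z.shape)  # (m, 1): one coordinate per example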
def recover_data(Z, U, K):
    # Approximate the original data from its projection onto the top K components.
    v = Z[:, :K]
    U_reduce = U[:, :K]
    X_rec = v @ np.transpose(U_reduce)
    return X_rec
Input values:
Name | Type | Description |
---|---|---|
Z | numpy.ndarray | Reduced data after applying PCA |
U | numpy.ndarray | Array of eigenvectors, computed from PCA |
K | integer | Number of principal components retained |
Return values:
Name | Type | Description |
---|---|---|
X_rec | numpy.ndarray | Recovered dataset after transforming back to the original space |
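A rough sketch of the round trip (project, then recover) on the normalised 2-D data from the earlier sketch shows how much information a single principal component retains; the assignment's actual plotting code differs:

X_rec = recover_data(Z, U, 1)
plt.scatter(X_norm[:, 0], X_norm[:, 1], label='original (normalised)')
plt.scatter(X_rec[:, 0], X_rec[:, 1], label='recovered from 1-D projection')
plt.legend()
plt.show()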