Week 8 of Andrew Ng’s Machine Learning course introduced unsupervised learning using K-means clustering. Principal component analysis (PCA) was also covered as a way to reduce the dimensionality of a dataset and improve its interpretability. The programming assignments used these techniques to demonstrate image compression and to reduce and then restore data; my Python implementations are below. This repository was used as a guide in implementing the graphical representation of the data. The Git repository of the complete script is here.
import numpy as np
from scipy.io import loadmat
import matplotlib.pyplot as plt
import matplotlib as mpl
from tqdm import tqdm
Uses the following formula to assign each example to its closest centroid:
\begin{align*} c^{(i)} := j \text{ that minimizes } {\left\Vert x^{(i)} - \mu_j \right\Vert}^2 \end{align*}
def find_closest_centroid(X, centroids):
    K = centroids.shape[0]
    idx = np.empty(X.shape[0])
    # For every example, keep the index of the centroid with the smallest squared distance.
    for i in range(X.shape[0]):
        min_length = np.inf
        for j in range(K):
            length = np.sum((X[i] - centroids[j]) ** 2)
            if length < min_length:
                min_length = length
                idx[i] = j
    return idx.astype(int)
Input values:
Name | Type | Description |
---|---|---|
X | numpy.ndarray | Dataset array, one example per row |
centroids | numpy.ndarray | Array of K-means centroids |
Return values:
Name | Type | Description |
---|---|---|
idx | numpy.ndarray | Integer index of the centroid assigned to each example |
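As a quick sanity check, the assignment step is easy to verify on a couple of made-up points (the values below are illustrative, not from the course data):

X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]])
initial_centroids = np.array([[1.0, 1.0], [5.0, 5.0]])
print(find_closest_centroid(X_demo, initial_centroids))  # -> [0 0 1]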
Uses the following formula to move each centroid to the mean of the examples assigned to it:
\begin{align*} \mu_k := \frac{1}{\left\vert C_k \right\vert} \sum_{i \in C_k} x^{(i)} \end{align*}
def compute_centroid_mean(X, idx, K):
    m, n = X.shape
    centroids = np.empty((K, n))
    for k in range(K):
        # Average the examples currently assigned to centroid k.
        X_k = X[idx == k]
        centroids[k] = np.sum(X_k, axis=0) / X_k.shape[0]
    return centroids
Input values:
Name | Type | Description |
---|---|---|
X | numpy.ndarray | Dataset array, one example per row |
idx | numpy.ndarray | Integer index of the centroid assigned to each example |
K | integer | Number of clusters |
Return values:
Name | Type | Description |
---|---|---|
centroids | numpy.ndarray | Array where each row is the mean of the data points assigned to that centroid |
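Continuing the toy example above, the update step moves each centroid to the mean of its assigned points:

idx_demo = find_closest_centroid(X_demo, initial_centroids)
print(compute_centroid_mean(X_demo, idx_demo, 2))  # -> roughly [[1.1, 0.9], [5.0, 5.0]]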
def k_means_initial_centroid(X, K):
    # Pick K distinct examples at random to serve as the initial centroids.
    randidx = np.random.permutation(X.shape[0])
    centroids = X[randidx[:K], :]
    return centroids
Input values:
Name | Type | Description |
---|---|---|
X | numpy.ndarray | Dataset array, one example per row |
K | integer | Number of clusters |
Return values:
Name | Type | Description |
---|---|---|
centroids | numpy.ndarray | Randomly initialised centroids |
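The assignment ties these three helpers together in an iterative loop. The exact driver lives in the linked repository, but a minimal sketch (with an arbitrary iteration count and no handling of empty clusters) might look like this:

def run_k_means(X, K, max_iters=10):
    # Alternate the assignment step and the update step for a fixed number of iterations.
    centroids = k_means_initial_centroid(X, K)
    for _ in tqdm(range(max_iters)):
        idx = find_closest_centroid(X, centroids)
        centroids = compute_centroid_mean(X, idx, K)
    return centroids, idx

For the image-compression part of the assignment, X is the array of pixel colours and the compressed image can be rebuilt as centroids[idx].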
Uses the following formula to compute the covariance matrix:
\begin{align*} \Sigma = \frac{1}{m} X^T X \end{align*}
def pca(X):
    m, n = X.shape
    # Covariance matrix of the (feature-normalised) data.
    sigma = (np.transpose(X) @ X) / m
    # The columns of U are the principal components; S holds the singular values.
    U, S, V = np.linalg.svd(sigma)
    return U, S
Input values:
Name | Type | Description |
---|---|---|
X | numpy.ndarray | Dataset array, one example per row |
Return values:
Name | Type | Description |
---|---|---|
U | numpy.ndarray | Array of eigenvectors; each column is a computed principal component |
S | numpy.ndarray | Array of singular values for each principal component |
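PCA expects the data to be centred (and, in the exercise, scaled to unit variance) before the covariance matrix is computed. A minimal preprocessing sketch, assuming the exercise's 2-D dataset is stored in ex7data1.mat under the key 'X':

data = loadmat('ex7data1.mat')  # file and key names assumed from the course exercise
X = data['X']
mu = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mu) / std  # feature normalisation
U, S = pca(X_norm)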
def project_data(X, U, K):
U_reduce = U[:, :K]
Z = X @ U_reduce
return Z
Input values:
Name | Type | Description |
---|---|---|
X | numpy.ndarray | Dataset array, one example per row |
U | numpy.ndarray | Array of eigenvectors, computed from PCA |
K | integer | Number of dimensions to project to |
Return values:
Name | Type | Description |
---|---|---|
Z | numpy.ndarray | Projection of the dataset onto the top K eigenvectors |
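Using the normalised data and the U computed in the sketch above, projecting onto a single principal component reduces each 2-D example to one number:

Z = project_data(X_norm, U, 1)
print(Z.shape)  # (m, 1): one coordinate per example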
def recover_data(Z, U, K):
    # Approximate the original data from its projection onto the top K components.
    v = Z[:, :K]
    U_reduce = U[:, :K]
    X_rec = v @ np.transpose(U_reduce)
    return X_rec
Input values:
Name | Type | Description |
---|---|---|
Z | numpy.ndarray | Reduced data after applying PCA |
U | numpy.ndarray | Array of eigenvectors, computed from PCA |
K | integer | Number of principal components retained |
Return values:
Name | Type | Description |
---|---|---|
X_rec | numpy.ndarray | Recovered dataset after transforming back to the original space |
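A rough sketch of the round trip (project, then recover) on the normalised 2-D data from the earlier sketch shows how much information a single principal component retains; the assignment's actual plotting code differs:

X_rec = recover_data(Z, U, 1)
plt.scatter(X_norm[:, 0], X_norm[:, 1], label='original (normalised)')
plt.scatter(X_rec[:, 0], X_rec[:, 1], label='recovered from 1-D projection')
plt.legend()
plt.show()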