Week 7 of Andrew Ng’s Machine Learning Course introduced large margin classification and kernels, pulling these concepts together to learn about support vector machines. The first programming assignment was to implement the Gaussian kernel used in support vector machines and explore how a support vector machine classifies data. The second half was to build a spam classifier. The graded assignments in Python are outlined below. The Git repository of the complete script is here.
import numpy as np
import re
from sklearn import svm
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
The Gaussian kernel is defined as:
\begin{align*} K_{\text{gaussian}}(x^{(i)}, x^{(j)}) & = \exp \left( - \frac{\left\Vert x^{(i)} - x^{(j)} \right\Vert ^2}{2 \sigma^2} \right) \newline & = \exp \left( - \frac{ \sum\limits_{k=1}^n \left( x^{(i)}_k - x^{(j)}_k \right)^2}{2 \sigma^2} \right) \end{align*}
def gaussian_kernel(x1, x2, sigma):
    # Squared Euclidean distance between the two points
    vec_sum = np.sum((x1 - x2)**2)
    # Gaussian (RBF) similarity decays with distance, controlled by sigma
    sim = np.exp(-vec_sum / (2 * sigma**2))
    return sim
Input values:
Name | Type | Description |
---|---|---|
x1 | numpy.ndarray | First data point (feature vector) |
x2 | numpy.ndarray | Second data point (feature vector) |
sigma | float | Bandwidth parameter for the Gaussian kernel |
Return values are:
Name | Type | Description |
---|---|---|
sim | float | Gaussian (RBF) kernel value for the two data points |
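As a quick sanity check, the kernel can be evaluated on a pair of small vectors; the values below are illustrative inputs only, not part of the graded script.

x1 = np.array([1, 2, 1])
x2 = np.array([0, 4, -1])

# Squared distance is 9, so with sigma = 2 the kernel value is exp(-9 / 8), roughly 0.3247
print(gaussian_kernel(x1, x2, sigma=2))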
def dataset3_params(X, y, Xval, yval):
    # Candidate values for the regularisation and kernel parameters
    C_vec = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
    gamma_vec = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
    best_error = np.inf
    best_C = 0
    best_gamma = 0
    # Grid search: train on the training set, score on the validation set
    for C in C_vec:
        for gamma in gamma_vec:
            clf = svm.SVC(C=C, kernel='rbf', gamma=gamma)
            # ravel() flattens the labels in case they are stored as column vectors
            clf.fit(X, y.ravel())
            predictions = clf.predict(Xval)
            # Misclassification rate on the validation set
            error = np.mean(predictions != yval.ravel())
            if error < best_error:
                best_error = error
                best_C = C
                best_gamma = gamma
    print(f"Optimal settings:\n C: {best_C}\n Gamma: {best_gamma}\n Error: {best_error}\n")
    return best_C, best_gamma
Input values:
Name | Type | Description |
---|---|---|
X | numpy.ndarray | X variables of training data |
y | numpy.ndarray | y variables of training data |
Xval | numpy.ndarray | X variables of validation data |
yval | numpy.ndarray | y variables of validation data |
Return values are:
Name | Type | Description |
---|---|---|
best_C | float | Best performing value for regularisation parameter C |
best_gamma | float | Best performing value for gamma parameter |
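A minimal sketch of how the search might be used, assuming the training and cross-validation arrays (X, y, Xval, yval) from the exercise data have already been loaded:

# Hypothetical usage: X, y, Xval and yval are assumed to be loaded already
best_C, best_gamma = dataset3_params(X, y, Xval, yval)

# Retrain on the full training set with the selected parameters
clf = svm.SVC(C=best_C, kernel='rbf', gamma=best_gamma)
clf.fit(X, y.ravel())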
Preprocess the email by lower-casing the text, stripping HTML tags, and normalising variable content into placeholder tokens:
- httpaddr replaces URLs
- emailaddr replaces email addresses
- number replaces sequences of digits
- dollar replaces dollar signs
def process_email(email_contents):
    word_indices = []
    vocab_dict = get_vocab_list("vocab.txt")
    # Normalise the email: lower-case, strip HTML and replace URLs,
    # email addresses, numbers and dollar signs with placeholder tokens
    email_contents = email_contents.lower()
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    email_contents = re.sub(r'[^\s]+@[^\s]*', 'emailaddr', email_contents)
    email_contents = re.sub(r'[0-9]+', 'number ', email_contents)
    email_contents = re.sub(r'[$]+', 'dollar ', email_contents)
    email_contents = re.sub(r'[^A-Za-z0-9]+', ' ', email_contents)
    # Tokenise, stem each word and look up its vocabulary index
    ps = PorterStemmer()
    word_tokens = word_tokenize(email_contents)
    for w in word_tokens:
        # Skip empty tokens before stemming
        if len(w) < 1:
            continue
        stemmed = ps.stem(w)
        for key, value in vocab_dict.items():
            if stemmed == value:
                word_indices.append(key)
    return word_indices
Input values:
Name | Type | Description |
---|---|---|
email_contents | string | String containing one email |
Return values are:
Name | Type | Description |
---|---|---|
word_indices | list | List of vocabulary indices for the words in the email |
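process_email calls get_vocab_list, which is not shown above. A minimal sketch, assuming vocab.txt stores one index and word per line separated by whitespace:

def get_vocab_list(filename):
    # Map each vocabulary index to its word, e.g. {1: 'aa', 2: 'ab', ...}
    vocab_dict = {}
    with open(filename) as f:
        for line in f:
            index, word = line.split()
            vocab_dict[int(index)] = word
    return vocab_dict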
def email_features(word_indices):
    vocab_dict = get_vocab_list("vocab.txt")
    # One feature per vocabulary word: 1 if the word appears in the email
    X = np.zeros(len(vocab_dict))
    for key in word_indices:
        X[key - 1] = 1
    # Count how many features were switched on
    ones = int(np.sum(X))
    return X, ones
Input values:
Name | Type | Description |
---|---|---|
word_indices | list | List of vocabulary indices for the words in the email |
Return values are:
Name | Type | Description |
---|---|---|
X | numpy.ndarray | Array of features from word indices |
ones | integer | Number of ‘ones’ in feature array |
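Putting the two functions together on a single email gives the feature vector that the spam classifier is trained on. A minimal sketch, assuming emailSample1.txt from the exercise is available:

# Hypothetical usage: read one email and convert it to a feature vector
with open("emailSample1.txt") as f:
    email_contents = f.read()

word_indices = process_email(email_contents)
features, ones = email_features(word_indices)
print(f"Feature vector length: {len(features)}, non-zero entries: {ones}")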