Week 7 of Andrew Ng’s Machine Learning course introduced large margin classification and kernels, then pulled these concepts together into support vector machines (SVMs). The first half of the programming assignment was to implement the Gaussian kernel used in SVMs and to explore how an SVM classifies data. The second half was to build a spam classifier. The graded assignments, written in Python, are outlined below. The Git repository of the complete script is here.

Required modules

import numpy as np
import re
from sklearn import svm
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

Gaussian Kernel

The Gaussian kernel between two data points is computed as:

\begin{align*} K_{gaussian}(x^{(i)}, x^{(j)}) & = \exp \left( - \frac{\left\Vert x^{(i)} - x^{(j)} \right\Vert ^2}{2 \sigma^2} \right) \newline & = \exp \left( - \frac{ \sum\limits_{k=1}^n \left( x^{(i)}_k - x^{(j)}_k \right)^2}{2 \sigma^2} \right) \end{align*}

def gaussian_kernel(x1, x2, sigma):
    vec_sum = np.sum((x1 - x2)**2)
    sim = np.exp(-(vec_sum) / (2 * sigma**2))
    return sim

Input values:

Name	Type	Description
x1	numpy.ndarray	Feature vector of the first data point
x2	numpy.ndarray	Feature vector of the second data point
sigma	float	Bandwidth parameter for the Gaussian kernel

Return values are:

Name	Type	Description
sim	float	Computed RBF between the two provided data points
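
As a quick sanity check, the function can be evaluated on a small pair of vectors. The vectors and sigma below are illustrative values chosen for this sketch. The squared distance between them is 9, so the result should equal exp(-9/8).

```python
import numpy as np


def gaussian_kernel(x1, x2, sigma):
    # Squared Euclidean distance between the two points
    vec_sum = np.sum((x1 - x2) ** 2)
    return np.exp(-vec_sum / (2 * sigma ** 2))


# Illustrative inputs: the differences are [1, -2, 2], so the squared distance is 9
x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sim = gaussian_kernel(x1, x2, sigma=2.0)
print(f"{sim:.6f}")  # 0.324652
```

Identical points give a similarity of 1, and the value decays towards 0 as the points move apart or as sigma shrinks.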

Finding optimal parameters

def dataset3_params(X, y, Xval, yval):
    C_vec = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
    gamma_vec = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
    best_error = np.inf
    best_C = 0
    best_gamma = 0

    for C in C_vec:
        for gamma in gamma_vec:
            clf = svm.SVC(C=C, kernel='rbf', gamma=gamma)
            clf.fit(X, y)
            predictions = clf.predict(Xval)
            error = np.mean(predictions != yval)
            if error < best_error:
                best_error = error
                best_C = C
                best_gamma = gamma

    print(f"Optimal settings:\n C: {best_C}\n Gamma: {best_gamma}\n Error: {best_error}\n")
    return best_C, best_gamma

Input values:

Name	Type	Description
X	numpy.ndarray	X variables of training data
y	numpy.ndarray	y variables of training data
Xval	numpy.ndarray	X variables of validation data
yval	numpy.ndarray	y variables of validation data

Return values are:

Name	Type	Description
best_C	float	Best performing value for regularisation parameter C
best_gamma	float	Best performing value for gamma parameter
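
Note that the search above is over scikit-learn's gamma rather than the sigma used in the Gaussian kernel equation. scikit-learn's RBF kernel is exp(-gamma * ||x - x'||^2), so the two are related by gamma = 1 / (2 * sigma^2). A minimal sketch of the conversion follows; the helper names sigma_to_gamma and gamma_to_sigma are made up for this example and are not part of the assignment or of scikit-learn.

```python
import math


def sigma_to_gamma(sigma):
    # Matches exp(-||x - x'||^2 / (2 * sigma^2)) to exp(-gamma * ||x - x'||^2)
    return 1.0 / (2.0 * sigma ** 2)


def gamma_to_sigma(gamma):
    # Inverse of sigma_to_gamma
    return math.sqrt(1.0 / (2.0 * gamma))


# A sigma of 0.1 corresponds to a gamma of roughly 50
gamma = sigma_to_gamma(0.1)
print(gamma)
```

If a result needs to be reported in terms of sigma, gamma_to_sigma(best_gamma) would give the equivalent bandwidth.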

Preprocessing email

Preprocess the email by:

Converting to lower case
Stripping HTML tags
Normalising URLs to httpaddr
Normalising email addresses to emailaddr
Normalising numbers to number
Normalising dollar signs to dollar
Stemming words
Removing non-words and punctuation

def process_email(email_contents):
    word_indices = []
    vocab_dict = get_vocab_list("vocab.txt")
    # Invert the vocabulary once so each stem is a single lookup
    # (assumes each word appears only once in the vocabulary)
    word_to_index = {word: index for index, word in vocab_dict.items()}

    # Normalise the text with regular expressions
    email_contents = email_contents.lower()
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    email_contents = re.sub(r'[^\s]+@[^\s]*', 'emailaddr', email_contents)
    email_contents = re.sub(r'[0-9]+', 'number ', email_contents)
    email_contents = re.sub(r'[$]+', 'dollar ', email_contents)
    email_contents = re.sub(r'[^A-Za-z0-9]+', ' ', email_contents)

    # Stem each token and keep the index of any stem found in the vocabulary
    ps = PorterStemmer()
    word_tokens = word_tokenize(email_contents)
    for w in word_tokens:
        # Skip empty tokens before doing any work on them
        if len(w) < 1:
            continue

        stem = ps.stem(w)
        if stem in word_to_index:
            word_indices.append(word_to_index[stem])

    return word_indices

Input values:

Name	Type	Description
email_contents	string	String containing one email

Return values are:

Name	Type	Description
word_indices	list	List containing index of each word
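
To see what the normalisation steps produce, the same regular expressions can be run on a short made-up email. This is only a sketch: the sample text is invented, str.split stands in for NLTK's word_tokenize, and stemming and the vocabulary lookup are left out so that the example needs nothing beyond the standard library.

```python
import re

# Invented sample text covering each normalisation rule
sample = "Visit <b>our site</b> at http://example.com or mail bob@example.com, only $10!"

text = sample.lower()
text = re.sub(r'<[^<>]+>', ' ', text)                      # strip HTML tags
text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)  # URLs
text = re.sub(r'[^\s]+@[^\s]*', 'emailaddr', text)         # email addresses
text = re.sub(r'[0-9]+', 'number ', text)                  # numbers
text = re.sub(r'[$]+', 'dollar ', text)                    # dollar signs
text = re.sub(r'[^A-Za-z0-9]+', ' ', text)                 # punctuation and other non-words

# Whitespace split used here in place of word_tokenize
tokens = text.split()
print(tokens)
# ['visit', 'our', 'site', 'at', 'httpaddr', 'or', 'mail', 'emailaddr', 'only', 'dollar', 'number']
```

The order matters: URLs and email addresses are replaced before numbers and punctuation, otherwise the digits and symbols inside them would be rewritten first and the address patterns would no longer match.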

Extracting features from email

def email_features(word_indices):
    vocab_dict = get_vocab_list("vocab.txt")
    X = np.zeros(len(vocab_dict))
    for key in word_indices:
        # Vocabulary indices are 1-based; array positions are 0-based
        X[key - 1] = 1

    # Count the non-zero features; this also works when no word matched
    ones = int(np.sum(X))

    return X, ones

Input values:

Name	Type	Description
word_indices	list	List containing index of each word

Return values are:

Name	Type	Description
X	numpy.ndarray	Array of features from word indices
ones	integer	Number of ‘ones’ in feature array
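
Both process_email and email_features call get_vocab_list, which is not shown in this post. The sketch below is one possible version, not the one from the repository. It assumes that each line of vocab.txt holds a 1-based index and a word separated by whitespace; parse_vocab is a hypothetical helper added so the parsing can be tried without a file.

```python
def parse_vocab(lines):
    # Each line is assumed to look like "<index>\t<word>" with a 1-based index
    vocab = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        index, word = line.split(maxsplit=1)
        vocab[int(index)] = word
    return vocab


def get_vocab_list(path):
    # Hypothetical stand-in for the helper used in the functions above
    with open(path, encoding="utf-8") as handle:
        return parse_vocab(handle)


# Tiny made-up vocabulary in place of the real vocab.txt
vocab = parse_vocab(["1\taa", "2\tab", "3\tabil"])

# 1-based vocabulary indices map to 0-based feature positions
features = [0] * len(vocab)
for key in [3, 1, 3]:
    features[key - 1] = 1
print(vocab, features)
```

Repeated indices only set the same position again, so the feature vector records whether a word is present, not how often it occurs.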
