Programming Exercise 6



Week 7 of Andrew Ng’s Machine Learning Course introduced large margin classification and kernels, and pulled these concepts together into support vector machines (SVMs). The first half of the programming assignment was to implement the Gaussian kernel used in SVMs and to explore how an SVM classifies data. The second half was to build a spam classifier. The graded assignments, implemented in Python, are outlined below. The Git repository of the complete script is here.

Required modules

import numpy as np
import re
from sklearn import svm
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

Gaussian Kernel

Uses equation:

\begin{align*} K_{\text{gaussian}}\left(x^{(i)}, x^{(j)}\right) & = \exp \left( - \frac{\left\Vert x^{(i)} - x^{(j)} \right\Vert ^2}{2 \sigma^2} \right) \\ & = \exp \left( - \frac{ \sum\limits_{k=1}^n \left( x^{(i)}_k - x^{(j)}_k \right)^2}{2 \sigma^2} \right) \end{align*}

def gaussian_kernel(x1, x2, sigma):
    vec_sum = np.sum((x1 - x2)**2)
    sim = np.exp(-(vec_sum) / (2 * sigma**2))
    return sim

Input values:

| Name | Type | Description |
| --- | --- | --- |
| x1 | numpy.ndarray | Feature vector of the first data point |
| x2 | numpy.ndarray | Feature vector of the second data point |
| sigma | float | Bandwidth parameter for the Gaussian kernel |

Return values are:

| Name | Type | Description |
| --- | --- | --- |
| sim | float | Gaussian (RBF) kernel value computed for the two data points |
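
As a quick sanity check, the kernel can be worked through by hand for two short vectors. The sketch below restates the same formula using only the standard library; the name `gaussian_kernel_plain` and the sample vectors are chosen for illustration and are not part of the script above. With x1 = (1, 2, 1), x2 = (0, 4, -1) and sigma = 2, the squared distance is 1 + 4 + 4 = 9, so the similarity is exp(-9/8).

```python
import math


def gaussian_kernel_plain(x1, x2, sigma):
    # Hypothetical pure-Python restatement of gaussian_kernel, for illustration only
    # Squared Euclidean distance between the two points
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2 * sigma ** 2))


x1 = [1, 2, 1]
x2 = [0, 4, -1]
sim = gaussian_kernel_plain(x1, x2, sigma=2)
print(round(sim, 4))  # 0.3247
```

Identical points give a similarity of exactly 1, and the value decays towards 0 as the points move apart or as sigma shrinks.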

Finding optimal parameters

def dataset3_params(X, y, Xval, yval):
    C_vec = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
    gamma_vec = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
    best_error = np.inf
    best_C = 0
    best_gamma = 0

    for C in C_vec:
        for gamma in gamma_vec:
            clf = svm.SVC(C=C, kernel='rbf', gamma=gamma)
            clf.fit(X, y)
            predictions = clf.predict(Xval)
            error = np.mean(predictions != yval)
            if error < best_error:
                best_error = error
                best_C = C
                best_gamma = gamma

    print(f"Optimal settings:\n C: {best_C}\n Gamma: {best_gamma}\n Error: {best_error}\n")
    return best_C, best_gamma

Input values:

| Name | Type | Description |
| --- | --- | --- |
| X | numpy.ndarray | X variables of training data |
| y | numpy.ndarray | y variables of training data |
| Xval | numpy.ndarray | X variables of validation data |
| yval | numpy.ndarray | y variables of validation data |

Return values are:

| Name | Type | Description |
| --- | --- | --- |
| best_C | float | Best performing value for regularisation parameter C |
| best_gamma | float | Best performing value for gamma parameter |
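
One point worth keeping in mind when reading the grid above: scikit-learn's RBF kernel is parameterised as exp(-gamma * ||x - x'||^2), whereas the Gaussian kernel earlier in this post is written in terms of sigma. The two agree when gamma = 1 / (2 * sigma^2). The sketch below checks this numerically; the helper names are made up for illustration, and it only shows how a sigma value could be converted before being passed to `svm.SVC`, not how the script above does it.

```python
import math


def sigma_to_gamma(sigma):
    # gamma = 1 / (2 * sigma^2) makes exp(-gamma * d2) equal exp(-d2 / (2 * sigma^2))
    return 1.0 / (2.0 * sigma ** 2)


def kernel_from_sigma(sq_dist, sigma):
    # Gaussian kernel written in terms of sigma, as in the equation earlier
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_from_gamma(sq_dist, gamma):
    # RBF kernel written in terms of gamma, as scikit-learn parameterises it
    return math.exp(-gamma * sq_dist)


sq_dist = 9.0  # squared distance between two example points
for sigma in [0.3, 1, 3]:
    gamma = sigma_to_gamma(sigma)
    same = math.isclose(kernel_from_sigma(sq_dist, sigma),
                        kernel_from_gamma(sq_dist, gamma))
    print(f"sigma={sigma}: gamma={gamma:.4g}, kernels agree: {same}")
```

Because gamma falls as sigma rises, a large gamma corresponds to a narrow kernel and a small gamma to a wide one.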

Preprocessing email

Preprocess the email by:

  • Converting to lower case
  • Stripping HTML tags
  • Normalising URLs to httpaddr
  • Normalising email addresses to emailaddr
  • Normalising numbers to number
  • Normalising dollar signs to dollar
  • Stemming words
  • Removing non-words and punctuation

def process_email(email_contents):
    word_indices = []
    vocab_dict = get_vocab_list("vocab.txt")
    # Invert the vocabulary (index -> word) so each stem is looked up in constant time
    word_to_index = {word: index for index, word in vocab_dict.items()}

    # Normalise the text with regular expressions (raw strings avoid invalid escape warnings)
    email_contents = email_contents.lower()
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    email_contents = re.sub(r'[^\s]+@[^\s]*', 'emailaddr', email_contents)
    email_contents = re.sub(r'[0-9]+', 'number ', email_contents)
    email_contents = re.sub(r'[$]+', 'dollar ', email_contents)
    email_contents = re.sub(r'[^A-Za-z0-9]+', ' ', email_contents)

    # Stem each token and keep the index of every stem found in the vocabulary
    ps = PorterStemmer()
    for w in word_tokenize(email_contents):
        stemmed = ps.stem(w)
        if stemmed in word_to_index:
            word_indices.append(word_to_index[stemmed])

    return word_indices

Input values:

| Name | Type | Description |
| --- | --- | --- |
| email_contents | string | String containing one email |

Return values are:

| Name | Type | Description |
| --- | --- | --- |
| word_indices | list | List containing index of each word |
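
To see what the regular-expression steps do to a concrete message, the sketch below applies the same substitutions, in the same order, to a made-up sample string. Stemming and the vocabulary lookup are left out so that it runs without NLTK or vocab.txt; the function name `normalise_email` and the sample text are illustrative assumptions.

```python
import re


def normalise_email(text):
    # Hypothetical helper: only the regex normalisation steps, no stemming
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                      # strip HTML tags
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)  # URLs
    text = re.sub(r'[^\s]+@[^\s]*', 'emailaddr', text)         # email addresses
    text = re.sub(r'[0-9]+', 'number ', text)                  # numbers
    text = re.sub(r'[$]+', 'dollar ', text)                    # dollar signs
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)                 # non-words, punctuation
    return text


sample = "Visit <b>http://example.com</b> or mail bob@example.com, only $10!"
tokens = normalise_email(sample).split()
print(tokens)
# ['visit', 'httpaddr', 'or', 'mail', 'emailaddr', 'only', 'dollar', 'number']
```

The order matters: HTML tags are stripped before URLs are matched, otherwise the URL pattern would swallow the closing tag that follows the address.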

Extracting features from email

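def email_features(word_indices):
    vocab_dict = get_vocab_list("vocab.txt")
    # Binary feature vector with one entry per vocabulary word
    X = np.zeros(len(vocab_dict))
    for key in word_indices:
        # Vocabulary indices are 1-based, NumPy arrays are 0-based
        X[key - 1] = 1

    # Count the non-zero entries; this also works when no vocabulary word was found
    ones = int(np.sum(X))

    return X, ones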

Input values:

| Name | Type | Description |
| --- | --- | --- |
| word_indices | list | List containing index of each word |

Return values are:

| Name | Type | Description |
| --- | --- | --- |
| X | numpy.ndarray | Array of features from word indices |
| ones | integer | Number of ones in the feature array |

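
The mapping from word indices to a binary feature vector can be illustrated on a toy vocabulary. The sketch below assumes a hypothetical vocabulary of five words with 1-based indices and uses plain lists rather than NumPy; `binary_features` is an illustrative name, not a function from the script.

```python
def binary_features(word_indices, vocab_size):
    # One slot per vocabulary word, all initially 0
    features = [0] * vocab_size
    for index in word_indices:
        # Indices are assumed to be 1-based; repeated words set the same slot
        features[index - 1] = 1
    return features, sum(features)


features, ones = binary_features([1, 3, 3], vocab_size=5)
print(features, ones)  # [1, 0, 1, 0, 0] 2
```

A word that appears several times still contributes a single 1, so the vector records presence rather than frequency.
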
July 2020