Week 7 of Andrew Ng’s Machine Learning course introduced large margin classification and kernels, pulling these concepts together into support vector machines. The first programming assignment was to implement the Gaussian kernel and explore how a support vector machine classifies data; the second was to build a spam classifier. The graded assignments, written in Python, are outlined below. The Git repository with the complete script is here.
```python
import numpy as np
import re
from sklearn import svm
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
```

Uses equation:

\begin{align*} K_{\text{gaussian}}\left(x^{(i)}, x^{(j)}\right) & = \exp \left( - \frac{\left\Vert x^{(i)} - x^{(j)} \right\Vert ^2}{2 \sigma^2} \right) \newline & = \exp \left( - \frac{ \sum\limits_{k=1}^n \left( x^{(i)}_k - x^{(j)}_k \right)^2}{2 \sigma^2} \right) \end{align*}
```python
def gaussian_kernel(x1, x2, sigma):
    # Squared Euclidean distance between the two points
    vec_sum = np.sum((x1 - x2)**2)
    sim = np.exp(-vec_sum / (2 * sigma**2))
    return sim
```

Input values:
| Name | Type | Description | 
|---|---|---|
| x1 | numpy.ndarray | Array of first datapoints | 
| x2 | numpy.ndarray | Array of second datapoints | 
| sigma | float | Bandwidth parameter for Gaussian kernel | 
Return values are:
| Name | Type | Description | 
|---|---|---|
| sim | float | Computed RBF between the two provided data points | 
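As a quick check, the kernel can be evaluated on a pair of small test vectors (the values below are chosen for illustration): identical points give a similarity of 1, and the similarity decays towards 0 as the points move apart.

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma):
    # Similarity decays exponentially with squared distance between the points
    return np.exp(-np.sum((x1 - x2)**2) / (2 * sigma**2))

x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sim = gaussian_kernel(x1, x2, sigma=2.0)
print(sim)  # ~0.3247: the points are moderately similar
```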
```python
def dataset3_params(X, y, Xval, yval):
    # Candidate values for the regularisation and kernel width parameters
    C_vec = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
    gamma_vec = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
    best_error = np.inf
    best_C = 0
    best_gamma = 0
    for C in C_vec:
        for gamma in gamma_vec:
            clf = svm.SVC(C=C, kernel='rbf', gamma=gamma)
            clf.fit(X, y)
            predictions = clf.predict(Xval)
            # Misclassification rate on the validation set
            error = np.mean(predictions != yval)
            if error < best_error:
                best_error = error
                best_C = C
                best_gamma = gamma
    print(f"Optimal settings:\n C: {best_C}\n Gamma: {best_gamma}\n Error: {best_error}\n")
    return best_C, best_gamma
```

Input values:
| Name | Type | Description | 
|---|---|---|
| X | numpy.ndarray | X variables of training data | 
| y | numpy.ndarray | y variables of training data | 
| Xval | numpy.ndarray | X variables of validation data | 
| yval | numpy.ndarray | y variables of validation data | 
Return values are:
| Name | Type | Description | 
|---|---|---|
| best_C | float | Best performing value for regularisation parameter C | 
| best_gamma | float | Best performing value for gamma parameter | 
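The same grid search can be sketched end to end on synthetic data. This stands in for the assignment's dataset, which is not reproduced here; `make_moons` and a reduced parameter grid are assumptions for illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Toy non-linear dataset standing in for the assignment's third dataset
X_all, y_all = make_moons(n_samples=400, noise=0.25, random_state=0)
X, Xval, y, yval = train_test_split(X_all, y_all, test_size=0.5, random_state=0)

best_error, best_C, best_gamma = np.inf, None, None
for C in [0.01, 0.1, 1, 10]:
    for gamma in [0.1, 1, 10]:
        # Train on the training split, score on the held-out validation split
        clf = svm.SVC(C=C, kernel='rbf', gamma=gamma).fit(X, y)
        error = np.mean(clf.predict(Xval) != yval)
        if error < best_error:
            best_error, best_C, best_gamma = error, C, gamma

print(best_C, best_gamma, best_error)
```

The key point is that the error used for selection is measured on the validation set, not the training set, so the chosen C and gamma generalise rather than overfit.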
Preprocess the email by:

- lower-casing all characters
- stripping HTML tags
- replacing URLs with `httpaddr`
- replacing email addresses with `emailaddr`
- replacing numbers with `number`
- replacing dollar signs with `dollar`
- removing any remaining non-alphanumeric characters

```python
def process_email(email_contents):
    word_indices = []
    vocab_dict = get_vocab_list("vocab.txt")
    # Normalise the email contents with regular expressions
    email_contents = email_contents.lower()
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    email_contents = re.sub(r'[^\s]+@[^\s]*', 'emailaddr', email_contents)
    email_contents = re.sub(r'[0-9]+', 'number ', email_contents)
    email_contents = re.sub(r'[$]+', 'dollar ', email_contents)
    email_contents = re.sub(r'[^A-Za-z0-9]+', ' ', email_contents)
    # Stem each word and look up its index in the vocabulary
    ps = PorterStemmer()
    word_tokens = word_tokenize(email_contents)
    for w in word_tokens:
        if len(w) < 1:
            continue
        for key, value in vocab_dict.items():
            if ps.stem(w) == value:
                word_indices.append(key)
    return word_indices
```

Input values:
| Name | Type | Description | 
|---|---|---|
| email_contents | string | String containing one email | 
Return values are:
| Name | Type | Description | 
|---|---|---|
| word_indices | list | List containing index of each word | 
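The normalisation step can be seen in isolation by running the same regex substitutions on a short sample message (the email text below is a hypothetical example):

```python
import re

# Hypothetical sample email for illustration
email = "Win $1000 now! Visit http://spam.example.com or reply to offers@example.com"
email = email.lower()
email = re.sub(r'<[^<>]+>', ' ', email)                       # strip HTML tags
email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)   # normalise URLs
email = re.sub(r'[^\s]+@[^\s]*', 'emailaddr', email)          # normalise email addresses
email = re.sub(r'[0-9]+', 'number ', email)                   # normalise numbers
email = re.sub(r'[$]+', 'dollar ', email)                     # normalise dollar signs
email = re.sub(r'[^A-Za-z0-9]+', ' ', email)                  # drop leftover punctuation
print(email)
```

Every URL, address, number, and dollar amount collapses to a shared token, so the classifier learns from the *presence* of such items rather than their specific values.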
```python
def email_features(word_indices):
    vocab_dict = get_vocab_list("vocab.txt")
    X = np.zeros(len(vocab_dict))
    # Set the feature for every vocabulary word that appears in the email
    for key in word_indices:
        X[key - 1] = 1
    ones = int(np.sum(X == 1))
    return X, ones
```

Input values:
| Name | Type | Description | 
|---|---|---|
| word_indices | list | List containing index of each word | 
Return values are:
| Name | Type | Description | 
|---|---|---|
| X | numpy.ndarray | Array of features from word indices | 
| ones | integer | Number of ‘ones’ in feature array |
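The feature construction can be sketched without the vocabulary file. The course's vocab.txt has 1899 entries; that size and the index values below are assumptions for illustration.

```python
import numpy as np

vocab_size = 1899  # size of the course's vocab.txt (an assumption here)
word_indices = [86, 916, 794, 1077, 883, 794]  # hypothetical indices; note 794 repeats

X = np.zeros(vocab_size)
for key in word_indices:
    X[key - 1] = 1  # indices in vocab.txt are 1-based, the array is 0-based

ones = int(np.sum(X == 1))
print(ones)  # 5: repeated words only set their feature once
```

The binary encoding means a word that appears ten times contributes exactly as much as a word that appears once.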