
CS189 HW1 Angel Avila

25488322
Problem 1: Data Partitioning

Data partitioning happens inside my load_data function. I used scikit-learn's train_test_split function and
plotted histograms to verify that data from all categories is evenly distributed between the training and
test datasets.

Histograms of the partitioned data; blue is the training data and orange is the test data.
This shows that the data is evenly distributed among all 9 categories.
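
In outline, the check reduces to the following sketch (standalone, with made-up stand-in data for illustration; the real version in the appendix works on the loaded dataset, which carries the label in the last column):

import numpy as np
from sklearn.model_selection import train_test_split

# made-up stand-in data: rows are samples, last column is the label
data = np.hstack([np.random.rand(1000, 4),
                  np.random.randint(0, 10, (1000, 1))])

train, test = train_test_split(data, test_size=0.2)

# count each label in both partitions; roughly proportional counts
# confirm the shuffle spread the categories evenly
for name, split in (('train', train), ('test', test)):
    labels, counts = np.unique(split[:, -1], return_counts=True)
    print(name, dict(zip(labels.astype(int).tolist(), counts.tolist())))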
Problem 2: Training and Evaluating Prediction Accuracy

This happens in the function train_plot. The training data was subsampled to each of the specified sample
sizes, a classifier was trained on each sample set, and error was measured on both the training sample itself
and the test dataset. Results are plotted below.

mnist data, test and training error vs. sample size

spam data, test and training error vs. sample size

cifar data, test and training error vs. sample size
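
In outline, the measurement loop reduces to the following sketch (condensed from train_plot in the appendix; train_data and test_data come from load_data, with the label in the last column):

import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

def error_vs_sample_size(train_data, test_data, sample_sizes):
    train_error, test_error = [], []
    for n in sample_sizes:
        # subsample the training data unless n already covers all of it
        sample = (train_test_split(train_data, train_size=n)[0]
                  if n < len(train_data) else train_data)
        svc = svm.SVC(kernel='linear').fit(sample[:, :-1], sample[:, -1])
        # score() returns accuracy, so error is its complement
        train_error.append(1 - svc.score(sample[:, :-1], sample[:, -1]))
        test_error.append(1 - svc.score(test_data[:, :-1], test_data[:, -1]))
    return train_error, test_error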


Problem 3: Hyperparameter Tuning

Tuning of the C parameter happens in the function sweep_C. This function first partitions the training data into
training and validation datasets, then fits the model several times to the training set with different C parameters
and checks the score against the validation set. In the end, the C that produced the lowest validation error is used
to make predictions on the original test dataset. To find the best C, I began by sweeping a very large range of C
values, then narrowed it down until the global optimum was found. The best C value for this data seemed to be
~2.7E-6. The C values I tried and the corresponding errors can be seen in the figure below, as well as in the
program output also included.

mnist data, validation set error with varying values of C

output:
===============
loading mnist data...
===============
training with C = 1.00E-07 sample size = 9000
validation error = 8.6%
---------------
training with C = 1.39E-07 sample size = 9000
validation error = 8.5%
---------------
training with C = 1.93E-07 sample size = 9000
validation error = 8.2%
---------------
training with C = 2.68E-07 sample size = 9000
validation error = 8.1%
---------------
training with C = 3.73E-07 sample size = 9000
validation error = 8.2%
---------------
training with C = 5.18E-07 sample size = 9000
validation error = 7.0%
---------------
training with C = 7.20E-07 sample size = 9000
validation error = 7.1%
---------------
training with C = 1.00E-06 sample size = 9000
validation error = 6.8%
---------------
training with C = 1.39E-06 sample size = 9000
validation error = 7.0%
---------------
training with C = 1.93E-06 sample size = 9000
validation error = 6.9%
---------------
training with C = 2.68E-06 sample size = 9000
validation error = 6.6%
---------------
training with C = 3.73E-06 sample size = 9000
validation error = 7.2%
---------------
training with C = 5.18E-06 sample size = 9000
validation error = 7.7%
---------------
training with C = 7.20E-06 sample size = 9000
validation error = 8.4%
---------------
training with C = 1.00E-05 sample size = 9000
validation error = 9.1%
---------------
===============
generating prediction csv file...
===============
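
In outline, the hold-out sweep reduces to the following sketch (condensed from sweep_C in the appendix; train_data comes from load_data, with the label in the last column):

import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

def pick_best_C(train_data, C_values, validation_size=0.1):
    # hold out a validation set once, then reuse it for every C
    tr, val = train_test_split(train_data, test_size=validation_size)
    errors = []
    for C in C_values:
        svc = svm.SVC(C=C, kernel='linear').fit(tr[:, :-1], tr[:, -1])
        errors.append(1 - svc.score(val[:, :-1], val[:, -1]))
    # the C with the lowest validation error wins
    return C_values[int(np.argmin(errors))]

# sweep a coarse logarithmic grid first, then rerun with a narrower grid
# best_C = pick_best_C(train_data, 10**np.linspace(-7, -5, 15))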
Problem 4: Cross Validation

Hyperparameter tuning with cross validation works in much the same way as problem 3, but instead of
partitioning the training data into a separate validation set, k-fold cross validation was used. This partitions the
training data into k folds and trains/evaluates error k times, using each fold as the validation set once. The result
is the mean error over the k evaluations. This is built into scikit-learn in the form of the cross_val_score function.
As in problem 3, I began by sweeping a very large range of C values, then narrowed it down until I found the
apparent minimum validation error. The best C value for this data seemed to be ~27. The values I tried and the
corresponding errors can be seen in the figure below.

spam data, validation error for various values of C
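
In outline, the per-C evaluation reduces to the following sketch (condensed from the cross-validated branch of sweep_C in the appendix; train_data again carries the label in the last column):

import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score, StratifiedKFold

def cv_error(train_data, C, k=5):
    svc = svm.SVC(C=C, kernel='linear')
    # stratified folds keep the class balance in every partition
    skf = StratifiedKFold(k, shuffle=True)
    scores = cross_val_score(svc, train_data[:, :-1], train_data[:, -1],
                             cv=skf, n_jobs=-1)
    return 1 - np.mean(scores)  # mean error over the k folds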



Problem 5: Kaggle

Kaggle username: AngelJ

mnist score: 0.94480

spam score: 0.84221

After the best C value is found, the sweep_C function calls generate_pred_csv, which uses the best C to fit to the
training dataset, then generates predictions on the test dataset and writes them to the csv file needed for Kaggle
submission.
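
In outline, that final step reduces to the following sketch (condensed from generate_pred_csv in the appendix; the output path here is illustrative):

import csv
from sklearn import svm

def write_predictions(train_data, test_data, best_C, path='predictions.csv'):
    # refit on the full training set with the chosen C
    svc = svm.SVC(C=best_C, kernel='linear').fit(train_data[:, :-1], train_data[:, -1])
    pred = svc.predict(test_data[:, :-1])
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Id', 'Category'])
        writer.writerows(enumerate(pred))  # one (Id, Category) row per sample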
Appendix (code)
from scipy.io import loadmat
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, cross_val_predict
from sklearn import svm, metrics
import sys
import os
import numpy as np
import matplotlib.pyplot as plt
import csv


# load data, segment into test and train datasets;
# rows are data points, columns are features, last column is label
def load_data(dataset):
    print('===============')
    print('loading {} data...'.format(dataset))
    print('===============')
    if dataset == 'cifar':
        data = loadmat(sys.path[0] + '\\..\\data\\hw01_data\\cifar\\train.mat')['trainX'].astype(int)
        train, test = train_test_split(data, test_size=5000)
        return train, test, [100, 200, 500, 1000, 2000, 5000]
    elif dataset == 'mnist':
        data = loadmat(sys.path[0] + '\\..\\data\\hw01_data\\mnist\\train.mat')['trainX'].astype(int)
        train, test = train_test_split(data, test_size=10000)
        return train, test, [100, 200, 500, 1000, 2000, 5000, 10000]
    elif dataset == 'spam':
        data = loadmat(sys.path[0] + '\\..\\data\\hw01_data\\spam\\spam_data.mat')
        # append the labels as the last column to match the other datasets
        data = np.concatenate((data['training_data'].astype(int),
                               data['training_labels'].astype(int).transpose()), 1)
        train, test = train_test_split(data, test_size=0.2)
        return train, test, [100, 200, 500, 1000, 2000, len(train)]


def train_plot(dataset):
    train_data, test_data, sample_sizes = load_data(dataset)

    # histogram the labels of both partitions to verify shuffling of data
    idx = np.unique(np.concatenate((train_data[:, -1], test_data[:, -1])))
    idx = np.concatenate((idx, [max(idx) + 1])) - 0.5
    plt.hist(train_data[:, -1], idx, rwidth=0.9)
    plt.hist(test_data[:, -1], idx, rwidth=0.9)
    plt.show()

    test_error = np.zeros(len(sample_sizes))
    train_error = np.zeros(len(sample_sizes))
    for i in range(len(sample_sizes)):
        print('training on {:,d} samples'.format(sample_sizes[i]))
        if sample_sizes[i] < len(train_data):
            train_sample = train_test_split(train_data, train_size=sample_sizes[i])[0]
        else:
            train_sample = train_data

        svc = svm.SVC(kernel='linear').fit(train_sample[:, 0:-1], train_sample[:, -1])

        test_error[i] = 1 - svc.score(test_data[:, 0:-1], test_data[:, -1])
        train_error[i] = 1 - svc.score(train_sample[:, 0:-1], train_sample[:, -1])
        print('test error: {:.1%}'.format(test_error[i]))
        print('training error: {:.1%}'.format(train_error[i]))
        print('---------------')

    plot_train_test_error(sample_sizes, 'Sample Size', train_error, test_error)

    # plot error vs label, using the model trained on the largest sample
    test_predictions = svc.predict(test_data[:, 0:-1])
    test_error_values = test_data[np.not_equal(test_predictions, test_data[:, -1]), -1]
    plt.hist(test_error_values, idx, rwidth=0.9, density=True)
    plt.show()


def sweep_C(dataset, C, validation_set_size=0, cross_validate=False):
    train_data, test_data = load_data(dataset)[0:2]

    if cross_validate:
        validation_data = []

        # k-fold cross validation: mean error over 5 stratified folds
        def evaluate_error(train_data, validation_data, C):
            svc = svm.SVC(C=C, kernel='linear')
            skf = StratifiedKFold(5, shuffle=True)
            scores = cross_val_score(svc, train_data[:, 0:-1], train_data[:, -1], cv=skf, n_jobs=-1)
            return 1 - np.mean(scores)
    else:
        # subsample the training data, then hold out a fixed validation set
        train_data = train_test_split(train_data, test_size=5000)[1]
        train_data, validation_data = train_test_split(train_data, test_size=validation_set_size)

        def evaluate_error(train_data, validation_data, C):
            svc = svm.SVC(C=C, kernel='linear').fit(train_data[:, 0:-1], train_data[:, -1])
            return 1 - svc.score(validation_data[:, 0:-1], validation_data[:, -1])

    validation_error = np.zeros(len(C))
    for i in range(len(C)):
        print('training with C = {:.2E} sample size = {:d}'.format(C[i], len(train_data)))
        validation_error[i] = evaluate_error(train_data, validation_data, C[i])
        print('validation error = {:.1%}'.format(validation_error[i]))
        print('---------------')
    plot_train_test_error(C, 'C', validation_error, [])

    # refit with the best C and generate the Kaggle submission file
    min_i = validation_error.argmin()
    generate_pred_csv(dataset, train_data, test_data, C[min_i])


def plot_train_test_error(x, xlabel, train_error, test_error):
    if len(test_error):
        plt.semilogx(x, test_error * 100, 'bx-', label='test')
    plt.semilogx(x, train_error * 100, 'rx-', label='train')
    plt.title('Prediction Error')
    plt.xlabel(xlabel)
    plt.ylabel('Error (%)')
    plt.legend()
    plt.show()


def generate_pred_csv(dataset, train_data, test_data, C):
    print('===============')
    print('generating prediction csv file...')
    print('===============')

    svc = svm.SVC(C=C, kernel='linear').fit(train_data[:, 0:-1], train_data[:, -1])
    pred = svc.predict(test_data[:, 0:-1])

    with open('{}.csv'.format(dataset), 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Id', 'Category'])
        for i in range(len(pred)):
            writer.writerow([i, pred[i]])


def main():
    # problems 1 & 2
    train_plot('cifar')
    train_plot('spam')
    train_plot('mnist')
    # problems 3 & 5
    sweep_C('mnist', 10**np.linspace(-7, -5, 15), validation_set_size=0.1)
    # problems 4 & 5
    sweep_C('spam', 10**np.linspace(-6, 1.3, 15), cross_validate=True)


if __name__ == '__main__':
    main()
