
CS189 HW1 Angel Avila

25488322
Problem 1: Data Partitioning

Data partitioning happens inside my load_data function. I used scikit-learn's train_test_split function and
plotted histograms to verify that data from all categories is evenly distributed between the training and
test datasets.

Histograms of the partitioned data; blue is the training data and orange is the test data.
This shows that the data is evenly distributed among all 9 categories.
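
In outline, the check reduces to the following sketch (standalone, with made-up stand-in data for illustration; the real version in the appendix works on the loaded dataset, which carries the label in the last column):

import numpy as np
from sklearn.model_selection import train_test_split

# made-up stand-in data: rows are samples, last column is the label
data = np.hstack([np.random.rand(1000, 4),
                  np.random.randint(0, 10, (1000, 1))])

train, test = train_test_split(data, test_size=0.2)

# count each label in both partitions; roughly proportional counts
# confirm the shuffle spread the categories evenly
for name, split in (('train', train), ('test', test)):
    labels, counts = np.unique(split[:, -1], return_counts=True)
    print(name, dict(zip(labels.astype(int).tolist(), counts.tolist())))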
Problem 2: Training and Evaluating Prediction Accuracy

This happens in the function train_plot. The training data was subsampled to each of the specified sample
sizes, a classifier was trained on each sample set, and error was measured on both the training sample itself
and the test dataset. Results are plotted below.

mnist data, test and training error vs. sample size

spam data, test and training error vs. sample size

cifar data, test and training error vs. sample size
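
In outline, the measurement loop reduces to the following sketch (condensed from train_plot in the appendix; train_data and test_data come from load_data, with the label in the last column):

import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

def error_vs_sample_size(train_data, test_data, sample_sizes):
    train_error, test_error = [], []
    for n in sample_sizes:
        # subsample the training data unless n already covers all of it
        sample = (train_test_split(train_data, train_size=n)[0]
                  if n < len(train_data) else train_data)
        svc = svm.SVC(kernel='linear').fit(sample[:, :-1], sample[:, -1])
        # score() returns accuracy, so error is its complement
        train_error.append(1 - svc.score(sample[:, :-1], sample[:, -1]))
        test_error.append(1 - svc.score(test_data[:, :-1], test_data[:, -1]))
    return train_error, test_error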


Problem 3: Hyperparameter Tuning

Tuning of the C parameter happens in the function sweep_C. This function first partitions the training data into
training and validation datasets, then fits the model several times to the training set with different C parameters
and checks the score against the validation set. In the end, the C that produced the lowest validation error is used
to make predictions on the original test dataset. To find the best C, I began by sweeping a very large range of C
values, then narrowed it down until the global optimum was found. The best C value for this data seemed to be
~2.7E-6. The C values I tried and the corresponding errors can be seen in the figure below, as well as in the
program output also included.

mnist data, validation set error with varying values of C

output:
===============
loading mnist data...
===============
training with C = 1.00E-07 sample size = 9000
validation error = 8.6%
---------------
training with C = 1.39E-07 sample size = 9000
validation error = 8.5%
---------------
training with C = 1.93E-07 sample size = 9000
validation error = 8.2%
---------------
training with C = 2.68E-07 sample size = 9000
validation error = 8.1%
---------------
training with C = 3.73E-07 sample size = 9000
validation error = 8.2%
---------------
training with C = 5.18E-07 sample size = 9000
validation error = 7.0%
---------------
training with C = 7.20E-07 sample size = 9000
validation error = 7.1%
---------------
training with C = 1.00E-06 sample size = 9000
validation error = 6.8%
---------------
training with C = 1.39E-06 sample size = 9000
validation error = 7.0%
---------------
training with C = 1.93E-06 sample size = 9000
validation error = 6.9%
---------------
training with C = 2.68E-06 sample size = 9000
validation error = 6.6%
---------------
training with C = 3.73E-06 sample size = 9000
validation error = 7.2%
---------------
training with C = 5.18E-06 sample size = 9000
validation error = 7.7%
---------------
training with C = 7.20E-06 sample size = 9000
validation error = 8.4%
---------------
training with C = 1.00E-05 sample size = 9000
validation error = 9.1%
---------------
===============
generating prediction csv file...
===============
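
In outline, the hold-out sweep reduces to the following sketch (condensed from sweep_C in the appendix; train_data comes from load_data, with the label in the last column):

import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

def pick_best_C(train_data, C_values, validation_size=0.1):
    # hold out a validation set once, then reuse it for every C
    tr, val = train_test_split(train_data, test_size=validation_size)
    errors = []
    for C in C_values:
        svc = svm.SVC(C=C, kernel='linear').fit(tr[:, :-1], tr[:, -1])
        errors.append(1 - svc.score(val[:, :-1], val[:, -1]))
    # the C with the lowest validation error wins
    return C_values[int(np.argmin(errors))]

# sweep a coarse logarithmic grid first, then rerun with a narrower grid
# best_C = pick_best_C(train_data, 10**np.linspace(-7, -5, 15))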
Problem 4: Cross Validation

Hyperparameter tuning with cross validation works in much the same way as problem 3, but instead of
partitioning the training data into a separate validation set, k-fold cross validation was used. This partitions the
training data into k folds and trains/evaluates error k times, using each fold as the validation set once. The result
is the mean error over the k evaluations. This is built into scikit-learn in the form of the cross_val_score function.
As in problem 3, I began by sweeping a very large range of C values, then narrowed it down until I found the
apparent minimum validation error. The best C value for this data seemed to be ~27. The values I tried and the
corresponding errors can be seen in the figure below.

spam data, validation error for various values of C
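
In outline, the per-C evaluation reduces to the following sketch (condensed from the cross-validated branch of sweep_C in the appendix; train_data again carries the label in the last column):

import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score, StratifiedKFold

def cv_error(train_data, C, k=5):
    svc = svm.SVC(C=C, kernel='linear')
    # stratified folds keep the class balance in every partition
    skf = StratifiedKFold(k, shuffle=True)
    scores = cross_val_score(svc, train_data[:, :-1], train_data[:, -1],
                             cv=skf, n_jobs=-1)
    return 1 - np.mean(scores)  # mean error over the k folds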



Problem 5: Kaggle

Kaggle username: AngelJ

mnist score: 0.94480

spam score: 0.84221

After the best C value is found, the sweep_C function calls generate_pred_csv, which uses the best C to fit to the
training dataset, then generates predictions on the test dataset and writes them to the csv file needed for Kaggle
submission.
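
In outline, that final step reduces to the following sketch (condensed from generate_pred_csv in the appendix; the output path here is illustrative):

import csv
from sklearn import svm

def write_predictions(train_data, test_data, best_C, path='predictions.csv'):
    # refit on the full training set with the chosen C
    svc = svm.SVC(C=best_C, kernel='linear').fit(train_data[:, :-1], train_data[:, -1])
    pred = svc.predict(test_data[:, :-1])
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Id', 'Category'])
        writer.writerows(enumerate(pred))  # one (Id, Category) row per sample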
Appendix (code)
from scipy.io import loadmat
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, cross_val_predict
from sklearn import svm, metrics
import sys
import os
import numpy as np
import matplotlib.pyplot as plt
import csv


# load data, segment into test and train datasets;
# rows are data points, columns are features, last column is label
def load_data(dataset):
    print('===============')
    print('loading {} data...'.format(dataset))
    print('===============')
    if dataset == 'cifar':
        data = loadmat(sys.path[0] + '\\..\\data\\hw01_data\\cifar\\train.mat')['trainX'].astype(int)
        train, test = train_test_split(data, test_size=5000)
        return train, test, [100, 200, 500, 1000, 2000, 5000]
    elif dataset == 'mnist':
        data = loadmat(sys.path[0] + '\\..\\data\\hw01_data\\mnist\\train.mat')['trainX'].astype(int)
        train, test = train_test_split(data, test_size=10000)
        return train, test, [100, 200, 500, 1000, 2000, 5000, 10000]
    elif dataset == 'spam':
        data = loadmat(sys.path[0] + '\\..\\data\\hw01_data\\spam\\spam_data.mat')
        # append the labels as the last column to match the other datasets
        data = np.concatenate((data['training_data'].astype(int),
                               data['training_labels'].astype(int).transpose()), 1)
        train, test = train_test_split(data, test_size=0.2)
        return train, test, [100, 200, 500, 1000, 2000, len(train)]


def train_plot(dataset):
    train_data, test_data, sample_sizes = load_data(dataset)

    # histogram the labels of both partitions to verify shuffling of data
    idx = np.unique(np.concatenate((train_data[:, -1], test_data[:, -1])))
    idx = np.concatenate((idx, [max(idx) + 1])) - 0.5
    plt.hist(train_data[:, -1], idx, rwidth=0.9)
    plt.hist(test_data[:, -1], idx, rwidth=0.9)
    plt.show()

    test_error = np.zeros(len(sample_sizes))
    train_error = np.zeros(len(sample_sizes))
    for i in range(len(sample_sizes)):
        print('training on {:,d} samples'.format(sample_sizes[i]))
        if sample_sizes[i] < len(train_data):
            train_sample = train_test_split(train_data, train_size=sample_sizes[i])[0]
        else:
            train_sample = train_data

        svc = svm.SVC(kernel='linear').fit(train_sample[:, 0:-1], train_sample[:, -1])

        test_error[i] = 1 - svc.score(test_data[:, 0:-1], test_data[:, -1])
        train_error[i] = 1 - svc.score(train_sample[:, 0:-1], train_sample[:, -1])
        print('test error: {:.1%}'.format(test_error[i]))
        print('training error: {:.1%}'.format(train_error[i]))
        print('---------------')

    plot_train_test_error(sample_sizes, 'Sample Size', train_error, test_error)

    # plot error vs label, using the model trained on the largest sample
    test_predictions = svc.predict(test_data[:, 0:-1])
    test_error_values = test_data[np.not_equal(test_predictions, test_data[:, -1]), -1]
    plt.hist(test_error_values, idx, rwidth=0.9, density=True)
    plt.show()


def sweep_C(dataset, C, validation_set_size=0, cross_validate=False):
    train_data, test_data = load_data(dataset)[0:2]

    if cross_validate:
        validation_data = []

        # k-fold cross validation: mean error over 5 stratified folds
        def evaluate_error(train_data, validation_data, C):
            svc = svm.SVC(C=C, kernel='linear')
            skf = StratifiedKFold(5, shuffle=True)
            scores = cross_val_score(svc, train_data[:, 0:-1], train_data[:, -1], cv=skf, n_jobs=-1)
            return 1 - np.mean(scores)
    else:
        # subsample the training data, then hold out a fixed validation set
        train_data = train_test_split(train_data, test_size=5000)[1]
        train_data, validation_data = train_test_split(train_data, test_size=validation_set_size)

        def evaluate_error(train_data, validation_data, C):
            svc = svm.SVC(C=C, kernel='linear').fit(train_data[:, 0:-1], train_data[:, -1])
            return 1 - svc.score(validation_data[:, 0:-1], validation_data[:, -1])

    validation_error = np.zeros(len(C))
    for i in range(len(C)):
        print('training with C = {:.2E} sample size = {:d}'.format(C[i], len(train_data)))
        validation_error[i] = evaluate_error(train_data, validation_data, C[i])
        print('validation error = {:.1%}'.format(validation_error[i]))
        print('---------------')
    plot_train_test_error(C, 'C', validation_error, [])

    # refit with the best C and generate the Kaggle submission file
    min_i = validation_error.argmin()
    generate_pred_csv(dataset, train_data, test_data, C[min_i])


def plot_train_test_error(x, xlabel, train_error, test_error):
    if len(test_error):
        plt.semilogx(x, test_error * 100, 'bx-', label='test')
    plt.semilogx(x, train_error * 100, 'rx-', label='train')
    plt.title('Prediction Error')
    plt.xlabel(xlabel)
    plt.ylabel('Error (%)')
    plt.legend()
    plt.show()


def generate_pred_csv(dataset, train_data, test_data, C):
    print('===============')
    print('generating prediction csv file...')
    print('===============')

    svc = svm.SVC(C=C, kernel='linear').fit(train_data[:, 0:-1], train_data[:, -1])
    pred = svc.predict(test_data[:, 0:-1])

    with open('{}.csv'.format(dataset), 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Id', 'Category'])
        for i in range(len(pred)):
            writer.writerow([i, pred[i]])


def main():
    # problems 1 & 2
    train_plot('cifar')
    train_plot('spam')
    train_plot('mnist')
    # problems 3 & 5
    sweep_C('mnist', 10**np.linspace(-7, -5, 15), validation_set_size=0.1)
    # problems 4 & 5
    sweep_C('spam', 10**np.linspace(-6, 1.3, 15), cross_validate=True)


if __name__ == '__main__':
    main()
