Support Vector Machines (SVMs)
Lirong Tan, Ph.D. student
Advisors: Dr. Jason Lu and Dr. Scott K. Holland
August 12th
Lirong Tan (CCHMC), PNRC Neuroimaging Training Course, 2013-8-12

Outline
1 What can SVM do? Introduce some concepts with 2-dimensional
artificial data.
2 How to use SVM? Model training/selection/evaluation
3 Use a UCI dataset for practice.
4 SVM regression (SVR)
5 Feature extraction from fMRI data

Applications for SVM
1 Classification:
   1 Predict whether a house is expensive or not
   2 Brain state decoding: lie detector
   3 Automatic disease diagnosis: early detection of Alzheimer's disease
2 Regression:
   1 Predict the house price
   2 Predict cochlear implantation (CI) outcomes
   3 Predict the reading gains in children with dyslexia

Introduction
1 SVM is a multivariate method.
2 Input for SVM: a training set
$$D = \{(x^{(i)}, y^{(i)}) \mid x^{(i)} \in \mathbb{R}^m\}_{i=1}^{N} \tag{1}$$
where $x^{(i)}$ is called the feature vector, the superscript $(i)$ indicates the
i-th training sample, and $x^{(i)}$ is an m-dimensional vector
$$x^{(i)} = [x_1^{(i)}, \ldots, x_j^{(i)}, \ldots, x_m^{(i)}] \tag{2}$$
$x_j^{(i)}$ is the j-th feature/independent variable for the i-th subject. $y^{(i)}$
is called the label/dependent variable.
3 Given the training set, SVM is able to learn the rules to predict $y$.

Intuition for SVM
SVM is also known as a large-margin classifier.

libsvm download and installation
1 Download: http://www.csie.ntu.edu.tw/~cjlin/libsvm/oldfiles/index-1.0.html
2 Documents: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
3 Unzip
4 Move to the folder where you want to install libsvm, e.g.
C:\users\MATLAB\tools
5 Set path
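
Step 5 can be done from the MATLAB prompt. The following is only a sketch: the folder name `libsvm` and the `matlab` subfolder are assumptions about how the archive was unzipped, so adjust both to your own layout.

```matlab
% Hypothetical install location built from the example folder above.
libsvmRoot = fullfile('C:\users\MATLAB\tools', 'libsvm');
libsvmMatlabDir = fullfile(libsvmRoot, 'matlab');   % assumed subfolder holding the MATLAB interface

if exist(libsvmMatlabDir, 'dir')
    addpath(libsvmMatlabDir);   % makes svmtrain/svmpredict visible to MATLAB
else
    fprintf('Folder not found: %s\n', libsvmMatlabDir);
end
```

If the folder exists, `which svmtrain` should then point into it.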

Demo Code
How to train an SVM model? How to use the trained model to predict new
samples whose true label is not known?
clc
clear
load data.mat;   % assumed to provide data, label, and test
% Scale each feature to [0,1] using the min and max of the training data
m=min(data);
n=max(data);
data=(data-repmat(m,size(data,1),1))./repmat(n-m,size(data,1),1);
test=(test-m)./(n-m);
% Plot the two classes and the new sample
plot(data(1:10,1),data(1:10,2),'o','MarkerSize',12);
hold on
plot(data(11:21,1),data(11:21,2),'r*','MarkerSize',12);
plot(test(1),test(2),'k*','MarkerSize',12);
% Train a linear C-SVC and predict the training samples
model = svmtrain(label, data, '-s 0 -t 0 -c 10');
[predicted_label, accuracy, decision_score] = svmpredict(label, data, model);
% Decision boundary: w(1)*x + w(2)*y = model.rho
w = model.SVs' * model.sv_coef;
x0=[0,1];
y0=zeros(1,2);
y0(1)=(model.rho-x0(1)*w(1))/w(2);
y0(2)=(model.rho-x0(2)*w(1))/w(2);
plot(x0,y0,'k');
axis([0,1,0,1.2])

Feature scaling
1 In SVM training, we need to scale the features. Otherwise, features
with a large range will dominate the training.
2 Two ways to scale features:
   1 Linearly scale the features to the range [0,1] or [-1,1]:
   $$x' = \frac{x - \min}{\max - \min}$$
   $$x' = \frac{2(x - \min)}{\max - \min} - 1$$
   2 Normalize the feature to zero mean and unit variance (zscore
   function in MATLAB):
   $$x' = \frac{x - \text{mean}}{\text{standard deviation}}$$
3 Scale the training and test samples in the same way: use the min,
max, mean, and std from the training samples to scale the test samples.
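
The scaling rules above can be sketched in a few lines. The numbers are made up for illustration; the point is that every statistic comes from the training samples and is then reused for the test sample.

```matlab
% Toy data (made up): 4 training samples and 1 test sample, 2 features.
train = [1 100; 2 300; 3 200; 5 500];
test  = [4 400];

% Statistics are computed from the training samples only.
mi = min(train, [], 1);
Mi = max(train, [], 1);
mu = mean(train, 1);
sd = std(train, 0, 1);

nTrain = size(train, 1);
nTest  = size(test, 1);

% Linear scaling to [0,1].
train01 = (train - repmat(mi, nTrain, 1)) ./ repmat(Mi - mi, nTrain, 1);
test01  = (test  - repmat(mi, nTest, 1))  ./ repmat(Mi - mi, nTest, 1);

% Linear scaling to [-1,1].
trainPM = 2 * train01 - 1;
testPM  = 2 * test01 - 1;

% Zero mean and unit variance, using the training mean and std.
trainZ = (train - repmat(mu, nTrain, 1)) ./ repmat(sd, nTrain, 1);
testZ  = (test  - repmat(mu, nTest, 1))  ./ repmat(sd, nTest, 1);
```

A test sample that falls outside the training range will land outside [0,1] after scaling, which is expected.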

parameter -s
-s svm_type : set type of SVM (default 0)
0 C-SVC
1 nu-SVC
2 one-class SVM
3 epsilon-SVR
4 nu-SVR
C-SVC and nu-SVC are essentially the same method with different
parameterizations. You can choose either one.
Black boundary: -s 0 -t 0 -c 10
Green boundary: -s 1 -t 0 -n 0.7

parameter -t
-t kernel_type : set type of kernel function (default 2)
0 linear: u'*v
1 polynomial: (gamma*u'*v + coef0)^degree
2 radial basis function: exp(-gamma*|u-v|^2)
3 sigmoid: tanh(gamma*u'*v + coef0)
Linear kernel: no kernel. The decision boundary is a line/hyperplane in the
original space. The decision rule is as follows:
$$y = \begin{cases} 1, & \text{if } z \ge 0 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
where $z = w^T x + \text{bias} = w_1 x_1 + \cdots + w_m x_m + \text{bias}$.
You may choose the linear kernel when the number of features is large and
the number of training samples is small. Another advantage of the linear
kernel is that the model is easy to interpret.
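
The linear decision rule in Equation 3 amounts to one matrix product and a threshold. A minimal sketch, with weights, bias, and samples that are hypothetical values chosen for illustration:

```matlab
% Hypothetical weights and bias for a model with 2 features.
w = [2; -1];
bias = -0.5;

% Three samples, one per row (values made up).
X = [1.0  0.5;
     0.2  0.9;
     0.25 0.0];

z = X * w + bias;      % z = w' * x + bias for every row of X
y = double(z >= 0);    % Equation 3: 1 if z >= 0, otherwise 0
```

The third sample lies exactly on the boundary (z = 0) and is assigned to class 1 by the rule.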

parameter -t
Radial basis function kernel (RBF):
1 Another parameter needs to be set: γ (option -g). If γ is too small, the model
may be underfitted. If γ is too large, it may be overfitted.
2 Choose the RBF kernel when the number of features is small and the
number of training samples is large.
3 Make sure you have performed feature scaling before you use the RBF
kernel. Otherwise, exp(-gamma*|u-v|^2) will be dominated by features
with a large range.
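
To see why scaling matters for the RBF kernel, the sketch below evaluates exp(-gamma*|u-v|^2) for two made-up samples before and after scaling. The value of gamma and the assumed feature ranges are illustrative only.

```matlab
g = 1;   % gamma, an illustrative value rather than a recommendation

% Two samples with one small-range and one large-range feature (made up).
u = [0.2 1000];
v = [0.9 1010];

% Unscaled: the squared distance comes almost entirely from feature 2.
d2raw = sum((u - v).^2);
kRaw  = exp(-g * d2raw);

% Scale each feature to [0,1], assuming ranges [0,1] and [0,2000].
lo = [0 0];
hi = [1 2000];
us = (u - lo) ./ (hi - lo);
vs = (v - lo) ./ (hi - lo);

d2scaled = sum((us - vs).^2);
kScaled  = exp(-g * d2scaled);
```

Before scaling the kernel value is essentially zero because of the second feature alone; after scaling the first feature accounts for nearly all of the distance.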

parameter -t
Other suggestions:
1 Recent research shows that if RBF is used with model selection, then
there is no need to consider the linear kernel.
2 The polynomial and sigmoid kernels are used much less often.
3 In libsvm, you may define your own kernel function and feed the
precomputed kernel matrix to the SVM.

parameter -c
1 Objective function:
$$C \sum_{i=1}^{N} \left[ y_i \, \text{cost}_1 + (1 - y_i) \, \text{cost}_0 \right] + \sum_{j=1}^{M} w_j^2 \tag{4}$$
where N is the total number of training samples and M is the number of
features.

parameter -c
2 The first term penalizes the samples that have been
classified wrongly.
3 We can prove that
$$\text{Margin} \propto \frac{1}{\sqrt{\sum_{j=1}^{M} w_j^2}}$$
SVM minimizes the objective function defined in Equation 4, so it
also maximizes the margin.
4 Therefore, when c is very large, the SVM tends to classify all training
samples correctly. When c is not too large, the SVM tends to ignore the
outliers and find the decision boundary with the maximal margin.
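
The effect of c can be illustrated by evaluating the objective for two hand-picked candidate boundaries on toy data. This is a sketch, not a solver: the slides do not define cost1 and cost0, so the hinge-type costs below are an assumption, and the data and both candidate models are made up.

```matlab
% Assumed hinge-type costs (not defined in the slides).
cost1 = @(z) max(0, 1 - z);   % penalty for a sample with y = 1
cost0 = @(z) max(0, 1 + z);   % penalty for a sample with y = 0

% Objective of Equation 4 for a linear model z = X*w + b.
objective = @(w, b, X, y, C) ...
    C * sum(y .* cost1(X*w + b) + (1 - y) .* cost0(X*w + b)) + sum(w.^2);

% Toy 1-D data (made up); the last sample is an outlier of class 1.
X = [-2; -1; 1; 2; -0.5];
y = [ 0;  0; 1; 1;  1];

% Candidate A: wide margin, tolerates the outlier.
wA = 1;  bA = 0;
% Candidate B: narrow margin, classifies every sample with no penalty.
wB = 4;  bB = 3;

smallC = [objective(wA, bA, X, y, 1),   objective(wB, bB, X, y, 1)];
largeC = [objective(wA, bA, X, y, 100), objective(wB, bB, X, y, 100)];
```

With c = 1 the wide-margin candidate A has the lower objective; with c = 100 the narrow-margin candidate B does, which matches the behavior described above.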

parameter -c
Black Boundary: -s 0 -t 0 -c 10
Green Boundary: -s 0 -t 0 -c 10000

Model Evaluation
Use samples with true labels to assess the performance of the model.

Group             Category             True Label  Predicted Label
Positive samples  True Positive (TP)   positive    positive
Positive samples  False Negative (FN)  positive    negative
Negative samples  True Negative (TN)   negative    negative
Negative samples  False Positive (FP)  negative    positive

$$\text{sensitivity} = \frac{TP}{TP + FN} \tag{5}$$
$$\text{specificity} = \frac{TN}{TN + FP} \tag{6}$$
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$
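
Equations 5 to 7 can be computed directly from the true and predicted labels. A small sketch with made-up labels (1 = positive, 0 = negative):

```matlab
% Made-up true and predicted labels for 10 samples.
trueLabel      = [1 1 1 1 0 0 0 0 0 0]';
predictedLabel = [1 1 1 0 0 0 0 0 1 1]';

% Count the four categories of the table.
TP = sum(trueLabel == 1 & predictedLabel == 1);
FN = sum(trueLabel == 1 & predictedLabel == 0);
TN = sum(trueLabel == 0 & predictedLabel == 0);
FP = sum(trueLabel == 0 & predictedLabel == 1);

sensitivity = TP / (TP + FN);                        % Equation 5
specificity = TN / (TN + FP);                        % Equation 6
accuracy    = (TP + TN) / (TP + TN + FP + FN);       % Equation 7
```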

Model Evaluation
1 Plot the receiver operating characteristic (ROC) curve, and calculate
the area under the curve (AUC).
2 AUC evaluates how the model ranks the samples. For example, if the
predicted scores for negative samples are always smaller than the
predicted scores for positive samples, the model has an AUC of 1.
[X,Y,THRE,AUC]=perfcurve(labels,scores,posclass);
plot(X,Y,'.')
xlabel('False Positive Rate (1-specificity)');
ylabel('True Positive Rate (sensitivity)');
title(['AUC=',num2str(AUC)]);

Model Evaluation
Each vertical line segment represents a positive sample, and each
horizontal line segment represents a negative sample.
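
The step structure of the ROC curve can be reproduced without perfcurve. The sketch below assumes made-up labels and scores with no tied scores: samples are visited from the highest score to the lowest, a positive moves the curve up and a negative moves it right.

```matlab
% Made-up labels (1 = positive, 0 = negative) and decision scores, no ties.
labels = [1 1 0 1 0 0]';
scores = [0.9 0.8 0.7 0.6 0.4 0.2]';

% Visit the samples from the highest score to the lowest.
[~, order] = sort(scores, 'descend');
sorted = labels(order);

nPos = sum(sorted == 1);
nNeg = sum(sorted == 0);

% Each positive adds a vertical step of 1/nPos,
% each negative adds a horizontal step of 1/nNeg.
tpr = [0; cumsum(double(sorted == 1)) / nPos];
fpr = [0; cumsum(double(sorted == 0)) / nNeg];

% Area under the step curve by the trapezoidal rule.
AUC = trapz(fpr, tpr);
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9.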

Model Selection & Evaluation
We usually partition the data into 3 parts:
1 Training set (60%): used for training
2 Cross-validation set (20%): used for model selection
3 Testing set (20%): used for model evaluation

Procedure:
1 Train models with different parameters (c = 2^-2, 2^-1, 2^0, 2^1, 2^2)
2 Test the models on the cross-validation set, and pick the model with
the highest AUC/accuracy
3 Test the picked model on the testing set, and report the performance
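
The 60/20/20 partition and the candidate values of c can be set up as follows. This is a sketch in which the sample count N is hypothetical; only the index bookkeeping is shown, not the training itself.

```matlab
N = 50;                 % hypothetical number of samples
idx = randperm(N);      % random order of the sample indices

nTrain = round(0.6 * N);
nCV    = round(0.2 * N);

trainIdx = idx(1:nTrain);                 % 60%: training
cvIdx    = idx(nTrain+1 : nTrain+nCV);    % 20%: model selection
testIdx  = idx(nTrain+nCV+1 : end);       % 20%: model evaluation

% Candidate values of c from step 1 of the procedure.
cs = 2 .^ (-2:2);
```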

Model Selection/Grid Search
With the RBF kernel, we have two parameters (c and γ), so we use grid search.
Each cross point on the grid corresponds to a combination of c and γ.
Try all the combinations defined on the grid. Choose the model that
gives the best performance on the cross-validation set.
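
A grid of (c, γ) combinations can be enumerated as below. The exponent ranges are illustrative assumptions, and the score vector is a placeholder standing in for the cross-validation performance of each trained model.

```matlab
% Illustrative exponent ranges (assumptions, not values from the slides).
cExp = -2:2;
gExp = -3:1;

[C, G] = meshgrid(2 .^ cExp, 2 .^ gExp);   % every cross point of the grid
combos = [C(:), G(:)];                     % one row per (c, gamma) combination

% Placeholder scores: in practice each entry would be the AUC/accuracy on
% the cross-validation set of the model trained with that row's c and gamma.
cvScore = zeros(size(combos, 1), 1);
cvScore(7) = 1;                            % pretend combination 7 was best

[~, best] = max(cvScore);
bestC     = combos(best, 1);
bestGamma = combos(best, 2);
```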

When the number of samples is limited, we use the technique called
cross-validation.
Take 3-fold cross-validation for example. We randomly partition the
data into 3 parts.
1 Fold 1: use part 1 for testing, and parts 2 and 3 for training
2 Fold 2: use part 2 for testing, and parts 1 and 3 for training
3 Fold 3: use part 3 for testing, and parts 1 and 2 for training
The model is assessed based on its performance on all the testing samples
across Folds 1, 2, and 3.
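
The fold bookkeeping for 3-fold cross-validation can be sketched with base functions only. The sample count is hypothetical and no model is trained here; the loop only shows which samples play which role in each fold.

```matlab
N = 9;   % hypothetical number of samples
K = 3;   % number of folds

% Randomly assign each sample to one of K parts of equal size.
part = mod(randperm(N), K) + 1;

timesTested = zeros(1, N);
for k = 1:K
    testIdx  = find(part == k);    % part k is held out for testing
    trainIdx = find(part ~= k);    % the other parts are used for training
    timesTested(testIdx) = timesTested(testIdx) + 1;
    fprintf('Fold %d: test on %s, train on %s\n', k, ...
        mat2str(testIdx), mat2str(trainIdx));
end
```

Every sample is used for testing exactly once across the folds.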

In the neuroimaging field, the number of samples is usually limited. In this
situation, we usually use the technique called Leave-One-Out
Cross-Validation (LOOCV).
Suppose we have 5 samples:
1 Fold 1: use the 1st sample for testing, and the rest for training
2 Fold 2: use the 2nd sample for testing, and the rest for training
3 Fold 3: use the 3rd sample for testing, and the rest for training
4 Fold 4: use the 4th sample for testing, and the rest for training
5 Fold 5: use the 5th sample for testing, and the rest for training
The model is evaluated based on its performance on the 5 testing samples.
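
The LOOCV loop for 5 samples can be sketched as follows. To keep the example runnable without libsvm, a nearest-class-mean rule is used as a stand-in classifier; that rule and the toy data are assumptions for illustration, not the SVM itself.

```matlab
% Toy 1-D data with 5 samples (made up); labels are 0 or 1.
data  = [0.1; 0.2; 0.3; 0.8; 0.9];
label = [0; 0; 0; 1; 1];

sampleN = length(label);
predicted = zeros(sampleN, 1);

for i = 1:sampleN
    keep = true(sampleN, 1);
    keep(i) = false;                 % leave the i-th sample out
    train = data(keep, :);
    trainLabel = label(keep);

    % Stand-in classifier (not an SVM): assign the nearer class mean.
    mu0 = mean(train(trainLabel == 0, :), 1);
    mu1 = mean(train(trainLabel == 1, :), 1);
    predicted(i) = double(norm(data(i, :) - mu1) < norm(data(i, :) - mu0));
end

% Performance over the 5 held-out samples.
accuracy = mean(predicted == label);
```

Replacing the stand-in rule with svmtrain/svmpredict calls gives the structure used in the demo code for the Iris data.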

If we need to select parameters, we may use two-fold LOOCV (an outer and
an inner loop) to make the evaluation more objective.
Suppose we have 5 samples:
1 Fold 1: use the 1st sample for testing, and perform a LOOCV on the rest.
Use the inner LOOCV for model selection, and then apply the selected
model to the 1st sample for testing.
2 Fold 2: use the 2nd sample for testing, and proceed as in Fold 1
3 Fold 3: use the 3rd sample for testing, and proceed as in Fold 1
4 Fold 4: use the 4th sample for testing, and proceed as in Fold 1
5 Fold 5: use the 5th sample for testing, and proceed as in Fold 1
Report the performance on the 5 testing samples.

UCI Dataset Information
1 Iris dataset from http://archive.ics.uci.edu/ml/
2 The data set contains 3 classes of 50 instances each, where each class
refers to a type of iris plant.
3 Attribute information:
1 sepal length in cm
2 sepal width in cm
3 petal length in cm
4 petal width in cm
4 Class information: Setosa, Versicolour, Virginica

Demo Code: Two-folds of LOOCV
clc
clear
load iris.mat;   % assumed to provide features and label (classes 1, 2, 3)
% Keep classes 2 and 3 only, and relabel them as 1 and 0
features=features(label~=1,:);
label=label(label~=1);
label(label==2)=1;
label(label==3)=0;
sampleN=length(label);
score=zeros(sampleN,1);
cs=-10:10;   % candidate values of c are 2^cs
for i=1:sampleN
    % Outer loop: hold out the i-th sample for testing
    train=features([1:i-1,i+1:end],:);
    test=features(i,:);
    [train,test]=func_scale(train,test);
    trainLabel=label([1:i-1,i+1:end]);
    testLabel=label(i);
    % Inner loop: LOOCV on the remaining samples for each candidate c
    AUCs=zeros(length(cs),1);
    for j=1:length(cs)
        AUCs(j)=func_LOOCV(train,trainLabel,2^(cs(j)));
    end
    [tmp,j]=max(AUCs);
    model = svmtrain(trainLabel, train, ['-s 0 -t 0 -c ',num2str(2^(cs(j)))]);
    [predicted_label, accuracy, score(i)] = svmpredict(testLabel, test, model);
end
[X,Y,THRE,AUC]=perfcurve(label,score,1);
plot(X,Y,'.');

Demo Code: function func_scale() & func_LOOCV()
function [train,test]=func_scale(train,test)
% Scale features to [0,1] using the min and max of the training samples
Mi=max(train);
mi=min(train);
m=size(train,1);
n=size(test,1);
train=(train-repmat(mi,m,1))./(repmat(Mi-mi,m,1));
test=(test-repmat(mi,n,1))./(repmat(Mi-mi,n,1));
end

function AUC=func_LOOCV(data,label,c)
% Leave-one-out cross-validation AUC of a linear C-SVC with cost c
sampleN=length(label);
score=zeros(sampleN,1);
for i=1:sampleN
    train=data([1:i-1,i+1:end],:);
    test=data(i,:);
    trainLabel=label([1:i-1,i+1:end]);
    testLabel=label(i);
    model = svmtrain(trainLabel, train, ['-s 0 -t 0 -c ',num2str(c)]);
    [predicted_label, accuracy, score(i)] = svmpredict(testLabel, test, model);
end
[X,Y,THRE,AUC]=perfcurve(label,score,1);
end

Demo Code: Multiple classes & 10-folds of cross-validation
clc
clear
load iris.mat;
% Assign fold indices within each class (50 samples per class)
cvidx = [crossvalind('Kfold', 50, 10);crossvalind('Kfold', 50, 10);crossvalind('Kfold', 50, 10)];
trueLabel=[];
predictedLabel=[];
for i=1:10
    % Fold i is used for testing, the other folds for training
    train=features(cvidx~=i,:);
    test=features(cvidx==i,:);
    trainLabel=label(cvidx~=i);
    testLabel=label(cvidx==i);
    trueLabel=[trueLabel;testLabel];
    [train,test]=func_scale(train,test);
    model = svmtrain(trainLabel, train, '-s 0 -t 0 -c 1');
    [predicted_label, accuracy, decision_score] = svmpredict(testLabel, test, model);
    predictedLabel=[predictedLabel;predicted_label];
end
accuracy=sum(predictedLabel==trueLabel)/length(trueLabel);

Model Interpretation
In a linear SVM, we can use the weights to measure the importance of the
features.
Let's start from the binary case and assume you have two labels, 0 and 1.
After obtaining the model from calling svmtrain, do the following to get
w and b:
w = model.SVs' * model.sv_coef;
b = -model.rho;
if model.Label(1) == 0
    w = -w;
    b = -b;
end
The larger the absolute value of w(i), the more important the i-th feature.
Model: $y = w^T x + b$
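
Once w is available, ranking the features by importance is a sort on the absolute weights. A minimal sketch, with a hypothetical weight vector in place of one computed from a trained model:

```matlab
% Hypothetical weight vector of a linear model with 4 features.
w = [0.3; -1.2; 0.05; 0.8];

% Sort features from the largest to the smallest absolute weight.
[absW, order] = sort(abs(w), 'descend');

for r = 1:length(order)
    fprintf('Rank %d: feature %d (|w| = %.2f)\n', r, order(r), absW(r));
end
```

The comparison is only meaningful when the features were scaled to a common range before training.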

SVR linear model
clc
clear
N=1000;
x=randn(N,1);
y=2*x+randn(N,1)+3;
% Scale x to [0,1]
Mi=max(x);
mi=min(x);
x=(x-mi)./(Mi-mi);
plot(x,y,'.')
model=svmtrain(y,x,'-s 3 -t 0 -c 4');   % epsilon-SVR, linear kernel
[predicted_label, accuracy, decision_score] = svmpredict(y, x, model);
w=model.SVs'*model.sv_coef;   % slope of the fitted line
b=-model.rho;                 % intercept of the fitted line
hold on
plot([min(x),max(x)],[min(x)*w+b,max(x)*w+b],'r','LineWidth',2);
xlabel('x');
ylabel('y');
title('SVR linear model');

SVR non-linear model
clc
clear
N=1000;
x=randn(N,1);
y=(x-0.5).^2+randn(N,1);
% Scale x to [0,1]
Mi=max(x);
mi=min(x);
x=(x-mi)./(Mi-mi);
plot(x,y,'.')
model=svmtrain(y,x,'-s 3 -t 2 -c 8');   % epsilon-SVR, RBF kernel
[predicted_label, accuracy, decision_score] = svmpredict(y, x, model);
hold on
plot(x,predicted_label,'r*')
xlabel('x');
ylabel('y');
title('SVR nonlinear model');

SVR multi-dimensional data & grid search
clc
clear
load moore.mat;
data=moore;
label=data(:,end);
data=data(:,1:end-1);
% Scale features to [0,1]
Mi=max(data);
mi=min(data);
m=size(data,1);
data=(data-repmat(mi,m,1))./(repmat(Mi-mi,m,1));

cs=-15:15;   % exponents: c, epsilon (-p), and gamma (-g) all take values 2^cs
rs=zeros(length(cs)^3,1);     % squared correlation coefficients
MSEs=zeros(length(cs)^3,1);   % mean squared errors
idx=0;
ijk=zeros(length(cs)^3,3);    % grid position of each combination
for i=1:length(cs)
    for j=1:length(cs)
        for k=1:length(cs)
            idx=idx+1;
            model = svmtrain(label,data, ['-s 3 -t 2 -c ',num2str(2^(cs(i))),' -p ', ...
                num2str(2^(cs(j))),' -g ',num2str(2^(cs(k)))]);
            [predicted_label, accuracy, decision_score] = svmpredict(label, data, model);
            MSEs(idx)=accuracy(2);
            rs(idx)=accuracy(3);
            ijk(idx,:)=[i,j,k];
        end
    end
end

Feature Extraction from fMRI data
How to extract features from fMRI data? This is the most important
question for multivariate analysis.
1 Use the fMRI time series directly. The number of features would be
the number of voxels * the number of time points.
2 Construct a contrast map; each voxel becomes a feature.
3 Define ROIs; each ROI is measured as the mean contrast value of the
voxels within this ROI.
4 Use ICA time series. The number of features would be the number of ICs *
the number of time points.
5 Construct a brain network first, then extract network features.
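
Option 3 reduces a contrast map to one feature per ROI, and option 1 shows how quickly the feature count grows. A sketch with made-up contrast values, a hypothetical ROI assignment, and a hypothetical number of time points:

```matlab
% Made-up contrast values for 6 voxels and the ROI index of each voxel.
contrast = [1.0; 3.0; 2.0; 4.0; 6.0; 5.0];
roiIdx   = [1;   1;   2;   2;   2;   3];

% Option 3: one feature per ROI, the mean contrast of its voxels.
roiFeatures = accumarray(roiIdx, contrast, [], @mean)';

% Option 1: feature count when the raw time series is used directly.
nVoxels = length(contrast);
nTimePoints = 100;                        % hypothetical number of time points
nRawFeatures = nVoxels * nTimePoints;
```

For one subject, `roiFeatures` is a single row of the feature matrix passed to the classifier.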

Feature Extraction from sMRI data

Feature Extraction from fMRI data

Important Features
1 Classification problem: normal hearing (NH) vs. hearing impaired (HI)
2 Classifier: linear SVM

Thank you!
