by
Nicholas Ruiz
Contents

1 Introduction
3 Multiclass classification
  3.1 One-versus-rest model
  3.2 Decision Directed Acyclic Graphs
5 Experimentation procedures
  5.1 Input representation
  5.2 Choice of the kernel
  5.3 SVM training methods used
  5.4 Leave-one-out cross validation
6 Experimentation results
7 Remarks
8 Conclusion
1 Introduction

Support Vector Machines (SVMs) have proven successful at solving real-world problems; one example of their power is the classification of images. For simple binary classification tasks, SVMs use a kernel function - a "computational shortcut which makes it possible to represent linear patterns efficiently in high-dimensional spaces" [1] - to find a separating hyperplane that maximizes the distance between the two classes of labeled points [2].
While Chapelle has shown that SVMs have a remarkable recognition rate for color- and luminescence-based image classification [3], the requirement of solving a complex quadratic programming problem may result in a slow training time for the classification model. The purpose of this paper is to compare the training time efficiency of the quadratic programming SVM to several alternative SVMs (Kernel Adatron, Kernel Minover, and Friess Adatron), using several experimental procedures outlined by Chapelle on a subset of the Corel image database for multiclass image classification. In addition, we shall compare the results of each of the above SVMs to their implementations using the Directed Acyclic Graph for multiclass classification [4].
This paper is organized as follows. In Section 2 we present an overview of Support
Vector Machines. Section 3 discusses techniques of multiclass classification. Section 4
introduces four alternative algorithms for training the SVM. Section 5 discusses the experimentation procedures for testing the SVM models on sample images from the Corel database; results of the experimentation are illustrated in Section 6.
of the hyperplane. Thus, the goal is to find $w$ and $b$ such that
$$\min_{1 \le i \le m} y_i(w \cdot x_i + b) \ge 1. \quad (1)$$
The purpose of this rescaling is so that the closest point to the hyperplane has a distance of $1/\|w\|$. In this case, (1) becomes
$$\min_{1 \le i \le m} \frac{y_i(w \cdot x_i + b)}{\|w\|} \ge \frac{1}{\|w\|}. \quad (2)$$
By maximizing the left hand side of (2), we obtain the optimal separating hyperplane. This can be understood by considering that the closest point to any hyperplane satisfying (2) has a distance of $1/\|w\|$. Thus, finding the optimal separating hyperplane amounts to minimizing $\gamma$, where
$$\gamma = \frac{1}{2}\langle w, w \rangle, \quad (3)$$
under constraints (2). It should be noted that, in this case, the margin is $2/\|w\|$. The optimal separating hyperplane is understood as the hyperplane which maximizes the margin.
Since $\|w\|^2$ is convex, minimizing (3) under constraints (2) may be achieved using Lagrange multipliers [1]. Thus, the maximal margin can be found by minimizing the Lagrangian:
$$L(w, b, \alpha) = \frac{1}{2}\langle w, w \rangle - \sum_{i=1}^{m} \alpha_i \big( y_i(\langle w, x_i \rangle + b) - 1 \big). \quad (4)$$
Setting the derivatives of $L$ with respect to $b$ and $w$ to zero gives
$$\frac{\partial L(w, b, \alpha)}{\partial b} = -\sum_{i=1}^{m} \alpha_i y_i = 0 \quad (5)$$
$$\frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0. \quad (6)$$
Substituting (6) into (4) gives the dual representation of the Lagrangian:
$$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle, \quad (7)$$
which must be maximized with respect to each $\alpha_i$, subject to the constraint from (5),
$$\sum_{i=1}^{m} \alpha_i y_i = 0,$$
and $\alpha_i \ge 0$.
When the optimal separating hyperplane and margin are found, only the training points that lie close to the hyperplane have $\alpha_i > 0$ and contribute to the classification model. These training points are called support vectors (SV). All other training points have associated $\alpha_i$ values of zero. The training points with nonzero $\alpha_i$ values provide the most informative patterns in the model's data. The resulting decision function can be written as:
$$f(x) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha_i^0 \langle x, x_i \rangle + b \right), \quad (8)$$
where $\alpha_i^0$ is the solution to the maximization problem under the constraints listed above and $SV$ represents the indices of the support vectors. The sign function maps the result to either $-1$ or $+1$.
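To make this concrete, the decision function can be evaluated in a few lines of MATLAB (the language used for our implementations in Section 5.3). This is an illustrative sketch, not the code used in our experiments; the function name and argument layout are our own:

```matlab
function y = svm_decide(x, X, Y, alpha, b)
% Evaluate the linear SVM decision function (8) at a row vector x.
% X     : m-by-d matrix of training points (one per row)
% Y     : m-by-1 vector of labels in {-1,+1}
% alpha : m-by-1 solved Lagrange multipliers
% b     : scalar bias
sv = find(alpha > 0);                         % indices of the support vectors
s = 0;
for i = sv'                                   % sum only over support vectors
    s = s + Y(i) * alpha(i) * (X(i,:) * x');  % inner product <x_i, x>
end
y = sign(s + b);                              % map the result to -1 or +1
end
```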
then
$$\int K(x, x')\, g(x)\, g(x') \, dx \, dx' \ge 0. \quad (9)$$
Common choices for kernels include polynomial kernels of the form
$$K(x, x') = (x \cdot x' + 1)^d, \quad (10)$$
where $d$ is the degree of the polynomial, and Gaussian Radial Basis Functions (RBF) of the form
$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right). \quad (11)$$
With a kernel, the decision function (8) becomes
$$f(x) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha_i^0 K(x, x_i) + b \right). \quad (12)$$
By satisfying Mercer's condition, a kernel may be used to replace the use of a $\phi$ mapping; thus, $\phi$ need not be explicitly defined.
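Both kernels are one-liners in MATLAB. The sketch below uses anonymous function handles; the kernel forms shown are the standard ones from (10) and (11), and the parameter names are ours:

```matlab
% Kernel functions as anonymous MATLAB handles (a generic sketch, not the
% experimental implementation). x1, x2 are row vectors of equal length.
poly_kernel = @(x1, x2, d) (x1 * x2' + 1)^d;                         % cf. (10)
rbf_kernel  = @(x1, x2, sigma) exp(-norm(x1 - x2)^2 / (2*sigma^2));  % cf. (11)

% Example: evaluate both kernels on two points.
x1 = [1 2 3]; x2 = [2 0 1];
k_poly = poly_kernel(x1, x2, 2);   % degree-2 polynomial kernel
k_rbf  = rbf_kernel(x1, x2, 2);    % RBF kernel with sigma = 2 (as in Table 3)
```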
3 Multiclass classification
3.1 One-versus-rest model
In the previous section, we described the use of the SVM for binary classification. The
standard method of multiclass classification is to construct $N$ binary classifiers. For example, let $i \in \{1, 2, \ldots, N\}$. The binary classification task for the $i$th class is to separate the training points of class $i$ from the remaining $N - 1$ classes. This is also known as one-versus-rest (or 1-v-r) classification. Empirically, SVM training is observed to scale super-linearly with the training set size $m$ [6], according to a power law:
$$T = cm^\gamma, \quad (13)$$
where $c$ is a proportionality constant. Training $N$ 1-v-r classifiers, each on all $m$ training examples, therefore requires
$$T_{\text{1-v-r}} = cNm^\gamma. \quad (14)$$
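A minimal sketch of the 1-v-r construction in MATLAB follows; train_binary_svm is a hypothetical placeholder for any of the binary trainers discussed in Section 4:

```matlab
function models = train_one_vs_rest(X, Y, N)
% Train N binary 1-v-r classifiers (sketch).
% X : m-by-d training points; Y : m-by-1 class labels in {1,...,N}
models = cell(N, 1);
for i = 1:N
    yi = ones(size(Y));
    yi(Y ~= i) = -1;                       % class i versus the rest
    models{i} = train_binary_svm(X, yi);   % hypothetical binary trainer
end
end
```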
3.2 Decision Directed Acyclic Graphs

[Figure 1 diagram omitted.]

Figure 1: (a) The decision DAG for finding the best class out of four classes. The equivalent list state for each node is shown next to that node. (b) A diagram of the input space of a four-class problem. A 1-v-1 SVM can only exclude one class from consideration [4].
If each class contains the same number of examples, each 1-v-1 classifier requires only $2m/N$ training examples. Recalling (13), training the DDAG would require
$$T_{\text{1-v-1}} = c\,\frac{N(N-1)}{2}\left(\frac{2m}{N}\right)^\gamma. \quad (15)$$
Comparing (14) to (15), the DDAG trains more quickly than the standard 1-v-r method; for $\gamma = 2$, the ratio of (14) to (15) is $N^2/(2(N-1))$, so the DDAG is faster by roughly a factor of $N/2$.
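The evaluation side of the DDAG is equally simple: each node eliminates one class from a candidate list until a single class remains. The sketch below assumes a hypothetical handle decide_1v1 that returns whichever of the two labels the trained 1-v-1 classifier prefers:

```matlab
function c = ddag_classify(x, classes, decide_1v1)
% Classify x by walking a decision DAG over the candidate class list
% (sketch). decide_1v1(x, a, b) returns a or b, the label preferred
% by the trained 1-v-1 SVM for the pair (a, b).
while numel(classes) > 1
    a = classes(1);                % first class on the list
    b = classes(end);              % last class on the list
    if decide_1v1(x, a, b) == a
        classes(end) = [];         % this node eliminates class b
    else
        classes(1) = [];           % this node eliminates class a
    end
end
c = classes(1);                    % the single surviving class
end
```

Only $N - 1$ classifier evaluations are needed per test point, even though $N(N-1)/2$ classifiers were trained.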
$$L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \lambda \sum_{i=1}^{m} \alpha_i y_i, \quad (16)$$
where the final term implements the constraint condition in (5). Using stochastic gradient ascent based on the derivative of the Lagrangian [1],
$$\delta\alpha_i = \eta\left(1 - z_i y_i - \lambda y_i\right), \quad (17)$$
where $z_i = \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j)$ and $\eta$ controls the growth rate (for the KA without bias, $\lambda$ is fixed at zero). In addition, the constraint $\alpha_i \ge 0$ will be enforced by updating $\alpha_i \to 0$ when $\alpha_i < 0$. Examining how the Lagrangian changes during an update procedure where $\alpha_k \to \alpha_k + \delta\alpha_k$ for a particular pattern $k$ [1] yields the stability condition
$$2 > \eta > 0.$$
For a polynomial kernel in the form of (10), the upper bound for $\eta$ is determined from the 2-norm of each pattern [1]:
$$\eta < \frac{2}{\max_{1 \le i \le m} K(x_i, x_i)}.$$
Table 1: The Kernel Adatron (without bias) algorithm.

1. Initialize $\alpha_i^0 = 0$.
2. For $i = 1, 2, \ldots, m$ execute steps 3 and 4.
3. For labeled points $(x_i, y_i)$ calculate:
   $z_i = \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j)$.
4. Calculate $\delta\alpha_i^t = \eta(1 - z_i y_i)$:
   4.1. If $(\alpha_i^t + \delta\alpha_i^t) \le 0$ then $\alpha_i^t \leftarrow 0$.
   4.2. If $(\alpha_i^t + \delta\alpha_i^t) > 0$ then $\alpha_i^t \leftarrow \alpha_i^t + \delta\alpha_i^t$.
5. If a maximum number of iterations is exceeded or the margin
   $\gamma = \frac{1}{2}\left( \min_{\{i \mid y_i = +1\}}(z_i) - \max_{\{i \mid y_i = -1\}}(z_i) \right) \approx 1$,
   then stop; otherwise, return to step 2 for the next epoch $t$.
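A direct MATLAB transcription of Table 1 might look as follows (a sketch under our reading of the table; the kernel matrix K is assumed precomputed, and the margin tolerance tol is our choice):

```matlab
function alpha = kernel_adatron(K, Y, eta, t_max)
% Kernel Adatron without bias, following Table 1 (sketch).
% K : m-by-m precomputed kernel matrix; Y : m-by-1 labels in {-1,+1}
m = length(Y);
alpha = zeros(m, 1);
tol = 0.01;                                % margin tolerance (our choice)
for t = 1:t_max
    for i = 1:m
        zi = K(i,:) * (alpha .* Y);        % z_i with the current alphas
        da = eta * (1 - zi * Y(i));        % proposed update (step 4)
        alpha(i) = max(alpha(i) + da, 0);  % clip at zero (steps 4.1, 4.2)
    end
    z = K * (alpha .* Y);                  % recompute z for the margin test
    gamma = 0.5 * (min(z(Y == 1)) - max(z(Y == -1)));
    if gamma >= 1 - tol                    % margin close to 1: stop (step 5)
        break;
    end
end
end
```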
recalling that for each $k$, $y_k \in \{-1, +1\}$. Thus, the only additional change from the KA without bias is to keep track of the $\lambda$-values at each epoch $t$. It is noted that the bias "can be found by a subprocess involving iterative adjustment of the $\lambda$ based on the gradient of $L$ with respect to $\lambda$" [1]. Thus, $\lambda$ is updated by:
$$\lambda^t = \lambda^{t-1} - \nu\, w^{t-1}, \quad (21)$$
where $\nu$ is a learning parameter, which may be derived from the secant method. Likewise, $\nu$ is defined as
$$\nu = \frac{\lambda^{t-1} - \lambda^{t-2}}{w^{t-1} - w^{t-2}},$$
where $w^t = \sum_i \alpha_i^t y_i$ [1]. The KA (with bias) algorithm is outlined in Table 2 (where $t_{\max}$ is the maximum number of epochs, and $\mu$ is an arbitrary value used to initialize $\lambda$).
Table 2: The Kernel Adatron (with bias) algorithm.

1. Initialize $\alpha_i^0 = 0$.
2. For $t = 0, 1, \ldots, t_{\max}$ execute steps 3 through 8.
3. If $t = 0$ then
      $\lambda^t = \mu$,
   elseif $t = 1$ then
      $\lambda^t = -\mu$,
   else
      $\lambda^t = \lambda^{t-1} - w^{t-1}\,\dfrac{\lambda^{t-1} - \lambda^{t-2}}{w^{t-1} - w^{t-2}}$.
4. For $i = 1, \ldots, m$, execute steps 5 and 6.
5. For labeled points $(x_i, y_i)$ calculate:
   $z_i = \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j)$.
6. Calculate $\delta\alpha_i^t = \eta(1 - z_i y_i - \lambda^t y_i)$:
   6.1. If $(\alpha_i^t + \delta\alpha_i^t) \le 0$ then $\alpha_i^t \leftarrow 0$.
   6.2. If $(\alpha_i^t + \delta\alpha_i^t) > 0$ then $\alpha_i^t \leftarrow \alpha_i^t + \delta\alpha_i^t$.
7. Calculate $w^t = \sum_j \alpha_j^t y_j$.
8. If a maximum number of iterations is exceeded or the margin
   $\gamma = \frac{1}{2}\left( \min_{\{i \mid y_i = +1\}}(z_i) - \max_{\{i \mid y_i = -1\}}(z_i) \right) \approx 1$,
   then stop; otherwise, return to step 2 for the next epoch $t$.
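The $\lambda$ bookkeeping in step 3 reduces to a small helper. The following sketch reflects our reading of Table 2, with the epoch-$(t-1)$ and epoch-$(t-2)$ values passed in explicitly:

```matlab
function lam_t = next_lambda(t, mu, lam1, lam2, w1, w2)
% Lambda update for epoch t (our reading of step 3 of Table 2).
% lam1, lam2 : lambda at epochs t-1 and t-2
% w1, w2     : w = sum_i alpha_i y_i at epochs t-1 and t-2
% mu         : arbitrary initializer for lambda
if t == 0
    lam_t = mu;
elseif t == 1
    lam_t = -mu;
else
    lam_t = lam1 - w1 * (lam1 - lam2) / (w1 - w2);  % secant step on w(lambda)
end
end
```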
$$\sum_{i=1}^{m} \alpha_i y_i = 0. \quad (23)$$
Second, the bias term is updated each time in the learning loop that the value of a Lagrange multiplier is increased, using the rule $b \leftarrow b + y_i \delta\alpha_i$.
The third change to the KA (without bias) algorithm is that the criterion that prevents $\alpha_i$ from going negative is discarded during the learning loop, in order to maintain (23).
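Under this reading, the second and third changes amount to a two-line modification of the Table 1 inner loop. In the sketch below, whether $z_i$ includes the bias $b$ is our assumption, and $b$ is updated on every change of a multiplier:

```matlab
function [alpha, b] = friess_epoch(K, Y, alpha, b, eta)
% One epoch of the Friess Adatron (sketch of our reading of the changes).
m = length(Y);
for i = 1:m
    zi = K(i,:) * (alpha .* Y) + b;   % z_i including the bias (assumption)
    da = eta * (1 - zi * Y(i));
    alpha(i) = alpha(i) + da;         % no clipping: alpha_i may go negative
    b = b + Y(i) * da;                % bias rule b <- b + y_i * delta_alpha_i
end
end
```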
$$y_i z_i = y_i \left( \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j) + b \right). \quad (24)$$
By minimizing $y_i z_i$ in (24), $x_{i^*}$ (the pattern associated with the minimum value) is selected to be updated; subsequently, only $\delta\alpha_{i^*}$ needs to be calculated, where
$$\delta\alpha_{i^*} = \eta(1 - y_{i^*} z_{i^*}).$$
$\delta\alpha_{i^*}$ is used to update $\alpha_{i^*}$ and $b$. Thus, only one Lagrange multiplier is updated at each epoch $t$.
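One Minover epoch is therefore very cheap. A sketch under the conventions of the earlier listings:

```matlab
function [alpha, b] = minover_epoch(K, Y, alpha, b, eta)
% One epoch of the Kernel Minover update (sketch).
z = K * (alpha .* Y) + b;              % z_i as in (24)
[worst, istar] = min(Y .* z);          % pattern with the smallest y_i z_i
da = eta * (1 - Y(istar) * z(istar));  % update for that pattern only
alpha(istar) = alpha(istar) + da;      % one Lagrange multiplier per epoch
b = b + Y(istar) * da;                 % bias update
end
```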
5 Experimentation procedures
Following Chapelle's methodology for experimentation, we used sample images from the Corel database. Images were collected from 10 categories: people, beaches, buildings, buses, reptiles, elephants, flowers, horses, mountains, and food. In this experimentation, we trained each classification model to classify the images into their respective classes.
5.2 Choice of the kernel

We used the Gaussian RBF kernel (11), where $\sigma$ determines the spread of the RBF. Smaller $\sigma$-values suggest a slower learning rate for the training procedure.
5.3 SVM training methods used

In this experiment, we used MATLAB R14 to implement each SVM. We compared the performance of a quadratic programming SVM toolbox written by Steve Gunn [8] to the following alternative SVMs (following the algorithms listed in Section 4; a sketch of the timing harness follows this list):

• Kernel Adatron (without bias)
• Kernel Adatron (with bias)
• Kernel Minover
• Friess Adatron
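Training times were collected with MATLAB's stopwatch functions. A minimal harness of the kind we describe might look like this (train_model is a hypothetical stand-in for any trainer above):

```matlab
% Timing harness for one SVM variant (sketch).
tic;                                   % start MATLAB's stopwatch
model = train_model(X, Y);             % hypothetical trainer call
T = toc;                               % elapsed training time in seconds
fprintf('Training took %.4f seconds\n', T);
```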
6 Experimentation results
To understand the relationships among the training time complexities, we performed the same experiment with 5, 6, 7, 8, and 9 images per class and graphed the training times for each experiment using Vandermonde interpolation. Figure 2 shows the interpolation of training times for Gunn's SVM, the Adatron (without bias), and the Friess Adatron. According to Figure 2, Gunn's SVM begins to outperform the Adatron (without bias) at 8 training examples per class.
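For reference, MATLAB's vander and polyval make the interpolation itself a few lines (a sketch; the timing values below are placeholders, not our measurements):

```matlab
% Vandermonde interpolation of training times (sketch).
n = [5 6 7 8 9]';                      % training points per class
T = [1.0 2.1 4.0 7.2 12.5]';           % placeholder training times (s)
V = vander(n);                         % 5-by-5 Vandermonde matrix
c = V \ T;                             % exact degree-4 polynomial fit
ns = linspace(5, 10, 100);             % dense grid for the curve
plot(ns, polyval(c, ns), n, T, 'o');   % interpolated curve and data points
xlabel('Number of Training Points per class');
ylabel('Training time (seconds)');
```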
Figure 3 shows the interpolation of training times for the Adatron (with bias) and the Kernel Minover. In comparison to Figure 2, the Adatron (with bias) and the Kernel Minover are computationally inefficient. The rationale behind this finding lies in the nature of each classification method.
SVM Method    T (s)      error    η
Gunn's SVM    40.0906    0.000    -

Table 3: Training time comparisons (in seconds) for 100 training examples (10 images per class), using an RBF kernel with $\sigma = 2$.
[Figure 2 plot: training time (seconds) versus number of training points per class, 5 to 10.]

Figure 2: Vandermonde interpolation of training times for Gunn's SVM, Adatron (without bias), and Friess Adatron.
In the case of the KA (with bias) algorithm, it is necessary to calculate the $w$- and $\lambda$-values in the learning loop of each epoch $t$. The greatest computational expense lies in the arithmetic required to calculate these values. In addition, the $\alpha_i$-values are only slightly updated at each $t$, requiring more iterations to optimize the classification model. While the KA (with bias) algorithm shows great accuracy, it requires a longer training time.

Similarly, the Kernel Minover algorithm's slowness stems from requiring more iterations to optimize the classification model, since the Kernel Minover algorithm updates only one $\alpha_i$-value per iteration.
[Figure 3 plot: training time (seconds) versus number of training points per class.]

Figure 3: Vandermonde interpolation of training times for Adatron (with bias) and Kernel Minover.
Figure 4 shows the interpolation of the training times of each DDAG method. Both the DAG-equivalent methods of the Adatron (without bias) and the Friess Adatron show a significant improvement over the overall training time of their 1-v-r counterparts.
It is interesting to note that though Gunn's SVM performs best in the 1-v-r problem, its DAG-equivalent method is suboptimal with respect to training time. Since the DDAG requires $N(N-1)/2$ 1-v-1 classification models, the large number of calculations in the quadratic programming procedure increases the training time required for the SVM when compared to the Adatron and Minover classification models.
[Figure 4 plot: Vandermonde interpolation of training times versus training points per class for each DDAG method (legend includes DAG SVC); training time (seconds) versus number of training points per class, 5 to 10.]
7 Remarks
Additional remarks should be made to address other potential discrepancies in the experimentation results. First, it must be noted that Gunn's SVM uses the C programming language (called from MATLAB) to perform the quadratic programming arithmetic, while the other algorithms presented use only the MATLAB language. While C programs are compiled into machine code, MATLAB scripts are interpreted at runtime. By placing the most computationally demanding arithmetic in machine code, Gunn's SVM may be saving computational time, whereas the other algorithms must first be interpreted from MATLAB code to compute the results. It would be interesting to compare an SVM written purely in MATLAB to the other algorithms. However, the mathematics involved in the computation of the quadratic programming procedure is outside the scope of this experiment.
In addition, the Adatron (with bias) deserves further attention. Programmatically storing the $w$- and $\lambda$-values of each epoch of the KA algorithm requires careful consideration. We decided to store only the previous three values of these parameters and used modular arithmetic to determine which stored values correspond to $w^{t-1}$, $w^{t-2}$, etc. at each iteration. This modular indexing itself adds computational overhead. Another possibility is to store the parameter values for every iteration, which would require a large amount of space.
8 Conclusion
In conclusion, while the quadratic programming SVM appears to require the least training time in the 1-v-r multiclass model, the KA (without bias) and the Friess Adatron algorithms require significantly less time to train in the DDAG 1-v-1 model as the number of data points increases. The use of the DDAG 1-v-1 model results in a significant decrease in training time for the overall multiclass classification model while maintaining classification accuracy.
References
[1] C. Campbell and N. Cristianini, Simple Learning Algorithms for Training Support Vector Machines, http://citeseer.ist.psu.edu/campbell98simple.html, 1998.

[2] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis (Cambridge University Press, New York, 2004).

[3] O. Chapelle, P. Haffner, et al., Support Vector Machines et Classification d'Images, http://citeseer.ist.psu.edu/392011.html, 1998.

[4] J. Platt, N. Cristianini, and J. Shawe-Taylor, Large Margin DAGs for Multiclass Classification, in Advances in Neural Information Processing Systems 12, edited by S. Solla, T. Leen, and K.-R. Mueller (MIT Press, Cambridge, MA, 2000), pp. 547-553.

[5] V. Vapnik, The Nature of Statistical Learning Theory (Springer-Verlag, Berlin, 1995).