
Support Vector Machines for Optimizing Multiclass Image

Classification Training Time

by

Nicholas Ruiz

Submitted in partial fulfillment of the requirements for Major Honors in Computer Science

Houghton College, Houghton, New York


May 3, 2006
Acknowledgments
I would like to thank Dr. Wei Hu for his assistance as an advisor during and before the
development of this thesis. Dr. Hu has motivated me to push myself beyond average
scholarship: he has inspired me to be diligent and faithful in all of my endeavors. I
would also like to thank the other members of the Honors Committee: Dr. Richard
Jacobson, Dr. Paul Young, and Dr. Mark Yuly, in addition to the other faculty and staff
at Houghton College for providing a challenging academic atmosphere while maintaining
genuine dedication to the development of each student.
I would also like to thank my parents, Brian and Sandra, and James, for their
encouragement and continued support as I have grown into the person I am today. Their
involvement in my life has motivated me to work hard and has held me
accountable to my work. Finally, I earnestly thank God for his involvement in my life:
the time and effort I have spent in every activity would be worthless if it were not for the
opportunity to honor Him. I am eternally thankful for his love, concern, and the sacrifice
of his Son to provide meaning to my life.

Contents

1 Introduction
2 Support Vector Machines
  2.1 Linearly separable Support Vector Machines
  2.2 Nonlinear Support Vector Machines
3 Multiclass classification
  3.1 One-versus-rest model
  3.2 Decision Directed Acyclic Graphs
4 Alternative SVM training methods
  4.1 Gradient ascent
  4.2 Kernel Adatron
    4.2.1 Without bias
    4.2.2 With bias
  4.3 Friess Adatron
  4.4 Kernel Minover
5 Experimentation procedures
  5.1 Input representation
  5.2 Choice of the kernel
  5.3 SVM training methods used
  5.4 Leave-one-out cross validation
6 Experimentation results
7 Remarks
8 Conclusion

1 Introduction
Support Vector Machines (SVMs) have proven to successfully solve real-world problems;
one example of their power is exhibited through the classification of images. For simple
binary classification tasks, SVMs use a kernel function - a "computational shortcut which
makes it possible to represent linear patterns efficiently in high-dimensional spaces" [1]
- where a separating hyperplane can be found that maximizes the distance between the
two classes of labeled points [2].
While Chapelle has shown that SVMs have a remarkable recognition rate for color-
and luminescence-based image classification [3], the requirement of solving a complex
quadratic programming problem may result in a slow training time for the classifica-
tion model. The purpose of this paper is to compare the training time efficiency of the
quadratic programming SVM to several alternative SVMs (Kernel Adatron, Kernel Mi-
nover, and Friess Adatron), using several experimental procedures outlined by Chapelle
on a subset of the Corel image database for multiclass image classification. In addition,
we shall compare the results of each of the above SVMs to their implementations using
the Directed Acyclic Graph for multiclass classification [4].
This paper is organized as follows. In Section 2 we present an overview of Support
Vector Machines. Section 3 discusses techniques of multiclass classification. Section 4
introduces four alternative algorithms for training the SVM. Section 5 discusses the ex-
perimentation procedures for testing the SVM models on sample images from the Corel
database; results of the experimentation are illustrated in Section 6.

2 Support Vector Machines


In the binary classification model, Support Vector Machines typically use a non-linear
function $\phi$ (if the data is not linearly separable) to map training points to a high-
dimensional and linearly-separable feature space. By mapping the training points to a
linearly-separable feature space, a separating hyperplane can be found that separates the
training points into distinct classes. A separating hyperplane may be found which mini-
mizes the distance between training points of the same class and maximizes the distance
between training points of differing classes (the distance between the nearest points of
differing classes is called the margin). The separating hyperplane is represented by a
linear combination of the training points [1].

2.1 Linearly separable Support Vector Machines


Let $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ be a sample of points $x_i \in X$ (the entire set of
training data; in image classification, for example, $x_i$ represents the training data for one
image) with targets $y_i \in \{-1, +1\}$. Consider a hyperplane defined by $(w, \theta)$, where $w$ is
a weight vector and $\theta$ is a bias. The goal is to find a hyperplane which divides the set of
examples such that all training points with the same target labels are on the same side
of the hyperplane. Thus, the goal is to find $w$ and $\theta$ such that

$$y_i (w \cdot x_i + \theta) > 0, \tag{1}$$

where $i = 1, \ldots, m$ and $m$ is the number of training examples.


If there exists a hyperplane that satisfies (1), then the sample set $S$ is said to be
linearly separable. In this case, $w$ and $\theta$ may be rescaled such that

$$\min_{1 \leq i \leq m} y_i (w \cdot x_i + \theta) \geq 1.$$

The purpose of this rescaling is so that the closest point to the hyperplane has a distance
of $1/\|w\|$. In this case, (1) becomes

$$y_i (w \cdot x_i + \theta) \geq 1. \tag{2}$$

By maximizing the left hand side of (2), we obtain the optimal separating hyperplane.
This can be understood by considering that the closest point to any hyperplane satisfying
(2) has a distance of $1/\|w\|$. Thus, finding the optimal separating hyperplane amounts to
minimizing $\gamma$, where

$$\gamma = \frac{1}{2} \langle w, w \rangle, \tag{3}$$

under constraints (2). It should be noted that, in this case, the margin is $2/\|w\|$. The
optimal separating hyperplane is understood as the hyperplane which maximizes the
margin.
Since $\|w\|^2$ is convex, minimizing (3) under constraints (2) may be achieved using
Lagrange multipliers [1]. Thus, the maximal margin can be found by minimizing the
Lagrangian:

$$L(w, \theta, \alpha) = \frac{1}{2} \langle w, w \rangle - \sum_{i=1}^{m} \alpha_i \left( y_i (\langle w, x_i \rangle + \theta) - 1 \right). \tag{4}$$

These $\alpha_i$ are Lagrange multipliers, one for each training point.


The partial derivatives of (4) with respect to $\theta$ and $w$ are:

$$\frac{\partial L(w, \theta, \alpha)}{\partial \theta} = \sum_{i=1}^{m} \alpha_i y_i = 0, \tag{5}$$

$$\frac{\partial L(w, \theta, \alpha)}{\partial w} = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0. \tag{6}$$

Substituting (6) into (4) gives the dual representation of the Lagrangian:

$$L(w, \theta, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j), \tag{7}$$
which must be maximized with respect to each $\alpha_i$, subject to the constraint from (5),

$$\sum_{i=1}^{m} \alpha_i y_i = 0,$$

and $\alpha_i \geq 0$.
When the optimal separating hyperplane and margin are found, only the training
points that lie close to the hyperplane have $\alpha_i > 0$ and contribute to the classification
model. These training points are called support vectors (SV). All other training points
have associated $\alpha_i$ values of zero. The training points with nonzero $\alpha_i$ values provide the
most informative patterns on the model's data. The resulting decision function can be
written as:

$$f(x) = \operatorname{sign}\left( \sum_{i \in SV} y_i \alpha_i^0 (x \cdot x_i) + \theta \right), \tag{8}$$

where $\alpha_i^0$ is the solution to the maximization problem under the constraints listed above
and $SV$ represents the indices of the support vectors. The sign function transforms the
result of the function in order to map approximate results to either $-1$ or $+1$.

2.2 Nonlinear Support Vector Machines


In the case that the training points are not linearly separable, SVMs support mapping
the input data to a high-dimensional feature space through a nonlinear mapping function
$\phi$. Campbell and Cristianini show that SVMs may use high dimensional spaces without
overfitting the data. Thus, we may substitute $\phi(x)$ for $x$ in (8) [1]:

$$f(x) = \operatorname{sign}\left( \sum_{i \in SV} y_i \alpha_i^0 \, (\phi(x) \cdot \phi(x_i)) + \theta \right).$$

Interestingly, the mapping $\phi$ does not need to be explicitly defined if we use a kernel
function, since the only case that requires $\phi$ is a dot product between two mappings. A
kernel is a symmetric function $K$ that can be described for all $x, x' \in X$ as

$$K(x, x') = \phi(x) \cdot \phi(x')$$

for a specific mapping $\phi$. Appropriate kernels that describe this mapping must satisfy
Mercer's condition [5], which states that for any $g(x)$ for which

$$\int g(x)^2 \, dx < \infty,$$

then

$$\int K(x, x') g(x) g(x') \, dx \, dx' \geq 0. \tag{9}$$

Common choices for kernels include polynomial kernels of the form

$$K(x, x') = (x \cdot x' + 1)^d, \tag{10}$$

where $d$ is the degree of the polynomial, and Gaussian Radial Basis Functions (RBF) of
the form

$$K(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}, \tag{11}$$

where $\sigma$ is the standard deviation of the Gaussian curve. Similar to the mapping function
$\phi(x)$, we may replace $x \cdot x_i$ in (8) with $K(x, x_i)$:

$$f(x) = \operatorname{sign}\left( \sum_{i \in SV} y_i \alpha_i^0 K(x, x_i) + \theta \right). \tag{12}$$

By satisfying Mercer's condition, a kernel may be used to replace the use of a $\phi$ mapping;
thus, $\phi$ need not be explicitly defined.
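
For illustration, the following Python/NumPy sketch (the thesis implementations were written in MATLAB, so this is not the original code) computes the polynomial kernel (10), the RBF kernel (11), and the kernel decision function (12) for a set of support vectors whose multipliers $\alpha_i^0$ and bias $\theta$ are assumed to have been found already; the support vectors at the bottom are made-up toy values.

```python
import numpy as np

def polynomial_kernel(x, x2, d=2):
    """Polynomial kernel K(x, x') = (x . x' + 1)^d, as in (10)."""
    return (np.dot(x, x2) + 1.0) ** d

def rbf_kernel(x, x2, sigma=2.0):
    """Gaussian RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), as in (11)."""
    diff = x - x2
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def decision_function(x, sv_x, sv_y, sv_alpha, theta, kernel):
    """Kernel SVM decision function (12): sign(sum_i y_i alpha_i K(x, x_i) + theta)."""
    total = sum(a * y * kernel(x, xi) for xi, y, a in zip(sv_x, sv_y, sv_alpha))
    return int(np.sign(total + theta))

# Toy usage with illustrative (made-up) support vectors and multipliers.
sv_x = [np.array([0.2, 0.1]), np.array([0.8, 0.9])]
sv_y = [-1, +1]
sv_alpha = [0.5, 0.5]
print(decision_function(np.array([0.7, 0.8]), sv_x, sv_y, sv_alpha, 0.0, rbf_kernel))
```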

3 Multiclass classification
3.1 One-versus-rest model
In the previous section, we described the use of the SVM for binary classification. The
standard method of multiclass classification is to construct $N$ binary classifiers. For
example, let $i \in \{1, 2, \ldots, N\}$. The binary classification task for the $i$th class is to separate
the training points of class $i$ from the remaining $N - 1$ classes. This is also known as
one-versus-rest (or 1-v-r) classification. Empirically, SVM training is observed to scale
super-linearly with the training set size $m$ [6], according to a power law:

$$T_{\text{single}} = c m^{\gamma}, \tag{13}$$

where $c$ is a proportionality constant, $m$ is the number of training examples, and $\gamma$ reflects
the time complexity of the classification model. $T_{\text{single}}$ represents the time associated with
training a single classifier. For the standard 1-v-r multiclass SVM training algorithm, the
entire training set is used to create all $N$ classifiers, thus the training time is

$$T_{1\text{-}v\text{-}r} = N c m^{\gamma}. \tag{14}$$
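
The 1-v-r construction itself is independent of how each binary classifier is trained. A minimal Python sketch is given below, assuming a generic binary trainer such as one of the kernel methods of Section 4 (`train_binary` and `score` are placeholder callables, not functions from the thesis or from Gunn's toolbox):

```python
import numpy as np

def train_one_vs_rest(X, labels, train_binary):
    """Train N binary classifiers; classifier i separates class i from the other N - 1 classes."""
    classes = sorted(set(labels))
    models = {}
    for c in classes:
        # Relabel the whole training set: +1 for class c, -1 for the rest.
        y = np.array([+1 if lab == c else -1 for lab in labels])
        models[c] = train_binary(X, y)      # every classifier sees all m points
    return models

def predict_one_vs_rest(x, models, score):
    """Assign x to the class whose binary classifier returns the largest score."""
    return max(models, key=lambda c: score(models[c], x))
```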

3.2 Decision Directed Acyclic Graphs


An alternative method of multiclass classification was proposed by Platt, Cristianini,
and Shawe-Taylor [4], which utilizes a Directed Acyclic Graph (DAG). A Decision DAG
(DDAG) is composed of $N(N-1)/2$ nodes, each containing a one-versus-one (1-v-1) classifier,
classifying two distinct classes from one another. Each node on the graph eliminates one
class from the list of candidate results. Once the 1-v-1 classification is performed on a
leaf node , the graph has eliminated N - 1 classes from the list. The remaining class is
the result of the multiclass classification for a data point. Figure 1 is an illustration of
the DAG structure.
Instead of requiring all $m$ data points for training, each 1-v-1 classification only requires
the data points that are expected to classify as one of the two candidate classes. Thus,

if each class contains the same number of examples, each 1-v-1 classifier requires only
$2m/N$ training examples. Recalling (13), training the DDAG would require

$$T_{1\text{-}v\text{-}1} = \frac{N(N-1)}{2} \, c \left( \frac{2m}{N} \right)^{\gamma}. \tag{15}$$

Comparing (14) to (15), the DDAG trains more quickly than the standard 1-v-r
method.

Figure 1: (a) The decision DAG for finding the best class out of four classes. The equivalent
list state for each node is shown next to that node. (b) A diagram of the input space of a
four-class problem. A 1-v-1 SVM can only exclude one class from consideration [4].
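
At test time, a trained DDAG is evaluated by walking from the root to a leaf, eliminating one candidate class per node, as described above. The sketch below (illustrative Python; `predict_pair` is a placeholder for any trained 1-v-1 decision function, not code from the thesis) assumes the pairwise models are stored under ordered keys $(a, b)$ with $a$ appearing before $b$ in the original class list:

```python
def ddag_classify(x, classes, pairwise, predict_pair):
    """Evaluate a Decision DAG: each 1-v-1 test removes one class from the candidate list.

    classes:      list of the N candidate class labels
    pairwise:     dict mapping (a, b) -> trained 1-v-1 model, a before b in `classes`
    predict_pair: callable(model, x) -> +1 meaning "keep a", -1 meaning "keep b"
    """
    candidates = list(classes)
    while len(candidates) > 1:
        a, b = candidates[0], candidates[-1]      # compare the two ends of the list
        if predict_pair(pairwise[(a, b)], x) >= 0:
            candidates.pop()                      # eliminate class b
        else:
            candidates.pop(0)                     # eliminate class a
    return candidates[0]                          # the one class left after N - 1 tests
```

Because the candidate list always remains a contiguous block of the original class ordering, exactly the $N(N-1)/2$ ordered pairs are ever needed.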

4 Alternative SVM training methods


Several alternative SVM training methods exist that are much simpler to implement than
the quadratic programming SVM. The Kernel Adatron (KA) has been suggested as a
simple implementation of the SVM that is comparable to the original SVM in both
accuracy and time complexity during classification [1]. In the following subsection, we
will discuss the explanation of gradient ascent in [1], which is necessary to understand the
Kernel Adatron. In the subsequent subsections, we will discuss the Kernel Adatron and
several variations on the Adatron theme.

4.1 Gradient ascent


A simple alternative method for maximizing a concave Lagrangian under linear constraints
is to use gradient ascent. The Lagrangian to be maximized is

$$L = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \lambda \sum_{i=1}^{m} \alpha_i y_i, \tag{16}$$

where the final term implements the constraint condition in (5). Using stochastic gradient
ascent based on the derivative of the Lagrangian [1], each $\alpha_i$ is updated by $\alpha_i \to \alpha_i + \delta\alpha_i$ with

$$\delta\alpha_i = \eta \left( 1 - y_i \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j) - \lambda y_i \right), \tag{17}$$

where $\eta$ controls the growth rate. In addition, the constraint $\alpha_i \geq 0$ will be enforced
by updating $\alpha_i \to 0$ when $\alpha_i < 0$. The Lagrangian changes as follows during an update
procedure where $\alpha_k \to \alpha_k + \delta\alpha_k$ for a particular pattern $k$ [1]:

$$\delta L = L(\alpha_k + \delta\alpha_k) - L(\alpha_k) \tag{18}$$

$$= \delta\alpha_k \left( 1 - y_k \sum_{j} \alpha_j y_j K(x_k, x_j) - \lambda y_k \right) - \frac{1}{2} (\delta\alpha_k)^2 K(x_k, x_k) \tag{19}$$

$$= \left[ \frac{1}{\eta} - \frac{K(x_k, x_k)}{2} \right] (\delta\alpha_k)^2. \tag{20}$$

Thus, $\delta L > 0$, given [1]:

$$2 > \eta K(x_k, x_k) > 0.$$

For a Gaussian RBF kernel in the form of (11), $K(x_k, x_k) = 1$, which suggests that
the gradient ascent algorithm will converge to the maximal Lagrangian, provided that

$$2 > \eta > 0.$$

For a polynomial kernel in the form of (10), the upper bound for $\eta$ is determined from
the 2-norm of each pattern [1]:

$$2 > \eta \, (\|x_k\|^2 + 1)^d > 0.$$
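
A minimal sketch of one epoch of this stochastic gradient ascent is shown below (illustrative Python/NumPy rather than the MATLAB used in the thesis); `eta` is assumed to satisfy the bound $2 > \eta K(x_k, x_k) > 0$ derived above, and `lam` is the multiplier $\lambda$ on the equality constraint (set to zero when the constraint is dropped, as in Section 4.2.1).

```python
import numpy as np

def gradient_ascent_epoch(alpha, y, K, eta, lam=0.0):
    """One pass of stochastic gradient ascent on the dual Lagrangian; updates follow (17).

    alpha: current Lagrange multipliers, shape (m,)
    y:     labels in {-1, +1}, shape (m,)
    K:     precomputed kernel matrix, K[i, j] = K(x_i, x_j)
    eta:   learning rate, assumed to satisfy 2 > eta * K[i, i] > 0
    lam:   multiplier on the constraint sum_i alpha_i y_i = 0 (0 drops the bias term)
    """
    for i in range(len(alpha)):
        z_i = np.dot(alpha * y, K[i])          # sum_j alpha_j y_j K(x_i, x_j)
        delta = eta * (1.0 - y[i] * z_i - lam * y[i])
        alpha[i] = max(0.0, alpha[i] + delta)  # enforce alpha_i >= 0
    return alpha
```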

4.2 Kernel Adatron


4.2.1 Without bias
According to Campbell [1], dropping the condition $\sum_{i=1}^{m} \alpha_i y_i = 0$ is equivalent to forcing
the hyperplane to pass through the origin of the feature space. Since the feature space is
high-dimensional, this restriction is not an active constraint for many problems and thus
will not affect the overall generalization of the model significantly. Essentially, this means
that we may drop the $\lambda$ term in the gradient ascent model without a significant decrease
in generalization efficiency.
We shall first outline the Kernel Adatron algorithm without bias in Table 1. The
algorithm was developed by Thilo Friess and colleagues [1].

1. Initialize $\alpha_i^0 = 0$.
2. For $i = 1, 2, \ldots, m$ execute steps 3 and 4.
3. For labeled points $(x_i, y_i)$ calculate:
   $z_i = \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j)$.
4. Calculate $\delta\alpha_i^t = \eta (1 - z_i y_i)$:
   4.1. If $(\alpha_i^t + \delta\alpha_i^t) \leq 0$ then $\alpha_i^t \leftarrow 0$.
   4.2. If $(\alpha_i^t + \delta\alpha_i^t) > 0$ then $\alpha_i^t \leftarrow \alpha_i^t + \delta\alpha_i^t$.
5. If a maximum number of iterations is exceeded or the margin
   $\gamma = \frac{1}{2} \left( \min_{\{i \mid y_i = +1\}}(z_i) - \max_{\{i \mid y_i = -1\}}(z_i) \right) \approx 1$
   then stop; otherwise, return to step 2 for the next epoch $t$.

Table 1: Kernel Adatron algorithm without bias.
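
A direct transcription of Table 1 into Python/NumPy is given below as an illustrative sketch (the thesis implementation was in MATLAB); the kernel matrix is precomputed, `eta` is the learning rate $\eta$, and `tol` is a small tolerance for deciding that the margin has reached approximately 1.

```python
import numpy as np

def kernel_adatron_no_bias(K, y, eta=1.0, max_epochs=100, tol=0.01):
    """Kernel Adatron without bias (Table 1).

    K: kernel matrix, K[i, j] = K(x_i, x_j)
    y: array of labels in {-1, +1} (both classes assumed present)
    """
    y = np.asarray(y, dtype=float)
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(max_epochs):
        z = np.zeros(m)
        for i in range(m):
            z[i] = np.dot(alpha * y, K[i])            # step 3
            delta = eta * (1.0 - z[i] * y[i])         # step 4
            alpha[i] = max(0.0, alpha[i] + delta)     # steps 4.1 and 4.2
        margin = 0.5 * (z[y == +1].min() - z[y == -1].max())   # step 5
        if margin >= 1.0 - tol:
            break
    return alpha
```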

4.2.2 With bias


In the KA model with a bias term, we shall once again consider the $\sum_{i=1}^{m} \alpha_i y_i = 0$
condition. In the process of gradient ascent, [1] notes that $L$ increases irrespective of $\lambda$,
except for iterations where $\delta\alpha_i = 0$. In addition, the final $\lambda$ of the optimization problem
is the bias, since from the stationary condition $\delta\alpha_i = 0$ (when the maximum has been
found),

$$1 - y_k \sum_{j} \alpha_j y_j K(x_k, x_j) - \lambda y_k = y_k \left( y_k - \sum_{j} \alpha_j y_j K(x_k, x_j) - \lambda \right) = 0,$$

recalling that for each $k$, $y_k \in \{-1, +1\}$. Thus, the only additional change from the KA
without bias is to keep track of the $\lambda$-values at each epoch $t$. It is noted that the bias "can
be found by a subprocess involving iterative adjustment of the $\lambda$ based on the gradient of
$L$ with respect to $\lambda$" [1]. Thus, $\lambda$ is updated by:

$$\lambda^t = \lambda^{t-1} - \nu \, w^{t-1}, \tag{21}$$

where $\nu$ is a learning parameter, which may be derived from the secant method. Likewise,
$\nu$ is defined as

$$\nu = \frac{\lambda^{t-1} - \lambda^{t-2}}{w^{t-1} - w^{t-2}},$$

where $w^t = \sum_i \alpha_i^t y_i$ [1]. The KA (with bias) algorithm is outlined in Table 2 (where
$t_{\max}$ is the maximum number of epochs, and $\mu$ is an arbitrary value to initialize $\lambda$).

4.3 Friess Adatron


The Friess KA algorithm is a modification of the KA (without bias) algorithm in three
ways. First, a bias term $b^2/2$ is added to the objective function. Thus, $z_i$ is calculated
as:

$$z_i = \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j) + b. \tag{22}$$

1. Initialize $\alpha_i^0 = 0$.
2. For $t = 1, \ldots, t_{\max}$ execute steps 3 through 8.
3. If $t = 0$ then
      $\lambda^t = \mu$
   Elseif $t = 1$ then
      $\lambda^t = -\mu$
   Else
      $\lambda^t = \lambda^{t-1} - w^{t-1} \dfrac{\lambda^{t-1} - \lambda^{t-2}}{w^{t-1} - w^{t-2}}$
   Endif
4. For $i = 1, \ldots, m$, execute steps 5 and 6.
5. For labeled points $(x_i, y_i)$ calculate:
   $z_i = \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j)$
6. Calculate $\delta\alpha_i^t = \eta (1 - z_i y_i - \lambda^t y_i)$:
   6.1. If $(\alpha_i^t + \delta\alpha_i^t) \leq 0$ then $\alpha_i^t \leftarrow 0$
   6.2. If $(\alpha_i^t + \delta\alpha_i^t) > 0$ then $\alpha_i^t \leftarrow \alpha_i^t + \delta\alpha_i^t$
7. Calculate $w^t = \sum_j \alpha_j^t y_j$
8. If a maximum number of iterations is exceeded or the margin
   $\gamma = \frac{1}{2} \left( \min_{\{i \mid y_i = +1\}}(z_i) - \max_{\{i \mid y_i = -1\}}(z_i) \right) \approx 1$
   then stop; otherwise, return to step 2 for the next epoch $t$.

Table 2: Kernel Adatron algorithm with bias.
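
The secant-style update of $\lambda$ in step 3 of Table 2 is the easiest part to get wrong in an implementation. The following illustrative sketch (not the thesis code) isolates it, with $\mu$ as the arbitrary initializer and a guard against a zero denominator, which the table itself does not spell out.

```python
def next_lambda(t, lam_hist, w_hist, mu=1.0, eps=1e-12):
    """Secant-style update of lambda (Table 2, step 3), following (21).

    lam_hist, w_hist: values of lambda^t and w^t = sum_i alpha_i^t y_i from previous epochs
    """
    if t == 0:
        return mu
    if t == 1:
        return -mu
    denom = w_hist[t - 1] - w_hist[t - 2]
    if abs(denom) < eps:            # assumption: keep lambda fixed if w has stalled
        return lam_hist[t - 1]
    nu = (lam_hist[t - 1] - lam_hist[t - 2]) / denom
    return lam_hist[t - 1] - nu * w_hist[t - 1]
```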

This implies an equality constraint of

$$b = \sum_{i=1}^{m} \alpha_i y_i. \tag{23}$$

Second, the bias term is updated each time in the learning loop, where the value of a
Lagrange multiplier is increased, using the rule $b \leftarrow b + y_i \delta\alpha_i$.
The third change to the KA (without bias) algorithm is that the criterion that prevents
$\alpha_i$ from going negative is discarded during the learning loop in order to maintain (23).

4.4 Kernel Minover


The Kernel Minover (KM) algorithm varies from the Friess KA algorithm only in the
employment of the minover algorithm instead of the Adatron algorithm. Like the Friess
KA algorithm, the KM algorithm introduces a bias term $b$. However, the loop in Table
1 is reduced to a single operation [7]:

$$y_i z_i = y_i \left( \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j) + b \right). \tag{24}$$

By minimizing $y_i z_i$ in (24), $x_{i'}$ (the pattern associated with the minimum value) is selected
to be updated; subsequently, only $\delta\alpha_{i'}$ needs to be calculated, where

$$\delta\alpha_{i'} = \eta (1 - y_{i'} z_{i'}).$$

$\delta\alpha_{i'}$ is used to update $\alpha_{i'}$ and $b$. Thus, only one Lagrange multiplier is updated at each
epoch $t$.
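
One Kernel Minover epoch can therefore be sketched as follows (illustrative Python/NumPy, not the thesis MATLAB code): compute $y_i z_i$ for every pattern, pick the minimizer, and update only that multiplier and the bias.

```python
import numpy as np

def kernel_minover_epoch(alpha, b, y, K, eta=0.5):
    """One Kernel Minover step: update only the worst-classified pattern, following (24).

    alpha: Lagrange multipliers, shape (m,)
    b:     current bias term
    y:     array of labels in {-1, +1}
    K:     kernel matrix, K[i, j] = K(x_i, x_j)
    """
    z = K @ (alpha * y) + b                 # z_i = sum_j alpha_j y_j K(x_i, x_j) + b
    i_star = int(np.argmin(y * z))          # pattern minimizing y_i z_i in (24)
    delta = eta * (1.0 - y[i_star] * z[i_star])
    alpha[i_star] += delta                  # only one multiplier changes per epoch
    b += y[i_star] * delta                  # bias update, as in the Friess Adatron
    return alpha, b
```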

5 Experimentation procedures
Following Chapelle's methodology for experimentation, we used sample images from the
Corel database. Images were collected from 10 categories: people, beaches, buildings,
buses, reptiles, elephants, flowers, horses, mountains, and food. In this experimentation,
we trained each classification model to classify the images into their respective classes.

5.1 Input representation


The simplest way to represent an image is by its bitmap representation. Each pixel of the
image contains numeric values which represent its red, green, and blue levels. Assuming
the size of the images is $h \times w$ (where $h$ and $w$ are the height and width of the image in
pixels, respectively), the input data for the SVM are vectors of size $h \times w$ for grayscale
images and $3 \times h \times w$ for color images. To simplify the learning procedure, we convert
each pixel into grayscale in our experiment, using the following formula:

$$L = 0.30R + 0.59G + 0.11B,$$

where $R$, $G$, and $B$ represent the red, green, and blue levels of a specific pixel (each an
integer between 0 and 255). This is a widely used formula for converting RGB values to
grayscale.
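
A sketch of this conversion for a color bitmap stored as an $h \times w \times 3$ array of 8-bit RGB values follows (illustrative Python/NumPy; the thesis experiments were carried out in MATLAB).

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an (h, w, 3) RGB image to an (h, w) luminance map: L = 0.30R + 0.59G + 0.11B."""
    rgb = rgb.astype(np.float64)
    return 0.30 * rgb[:, :, 0] + 0.59 * rgb[:, :, 1] + 0.11 * rgb[:, :, 2]
```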
While it is simple to construct the bitmap representation of each image, the represen-
tation lacks invariance with respect to translations. To alleviate this problem, we consider
the luminescence histogram of each image's bitmap representation. Chapelle states that
constructing a histogram with 16 bins per color component yields the best results in image
classification [3] .
Another advantage of representing each image as a luminescence histogram is the
reduction of the vector size for each data point, which yields fewer computational issues.
Rather than using input data with vector sizes of h x w, the vector size of each histogram
is simply the number of color bins (16). Moreover, the input vectors may be normalized,
which makes the vector invariant to image size.
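
A sketch of the 16-bin normalized luminance histogram used as the input vector is shown below (illustrative Python/NumPy; binning the 0-255 luminance range into 16 equal-width bins is an assumption consistent with 8-bit pixel values).

```python
import numpy as np

def luminance_histogram(gray, bins=16):
    """Build a normalized histogram of luminance values from a grayscale image.

    Dividing by the pixel count normalizes the vector, making it invariant to image size.
    """
    hist, _ = np.histogram(gray.ravel(), bins=bins, range=(0, 256))
    return hist / gray.size
```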

5.2 Choice of the kernel


While there are many kernel functions that may be used to map each data point to a
higher-dimensional feature space, Chapelle suggests the use of the Gaussian Radial Basis
Function (RBF) kernel

$$K(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}},$$

where $\sigma$ determines the spread of the RBF function. Smaller $\sigma$-values suggest a slower
learning rate for the training procedure.

5.3 SVM training methods used
In this experiment, we used MATLAB R14 to implement each SVM. We compared the
performance of a quadratic programming SVM toolbox written by Steve Gunn [8] to the
following alternative SVMs (following the algorithms listed in Section 4):

• Adatron (Without Bias)

• Adatron (With Bias)

• Kernel Minover

• Friess Adatron

In addition, we implemented and benchmarked a DDAG version of each SVM, including


Gunn's SVM. The alternative SVM learning methods (including the DDAGs) are writ-
ten purely in MATLAB, while the quadratic programming procedure of Gunn's SVM is
written in C and is called by MATLAB.

5.4 Leave-one-out cross validation


To determine the efficiency of both the training time and the classification rate for each
SVM model, we use a modified leave-one-out cross validation (LOO-XV), with respect
to the ten distinct classes in the model. The typical method of LOO-XV is k-fold cross
validation where the training data is split up into k = m subsets (m is the number of
training points in the set). The standard method of LOO-XV uses k - 1 subsets for
training and the remaining subset for testing. The cycle is repeated until each subset is
"left out" for testing. The results for each test are averaged together to find the overall
results of the classification model.
For our purposes, we divided the training data into $m/N$ subsets. Each subset contains
one training point per class. Thus, in each training cycle, we leave out $N$ data points for
testing.
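
A sketch of this class-balanced splitting is given below (illustrative Python; it assumes every class has the same number of examples, as in our experiments, and that `labels` holds the class name of each training point).

```python
from collections import defaultdict

def class_balanced_folds(labels):
    """Group example indices into m/N folds, each holding one example per class."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    n_folds = min(len(v) for v in by_class.values())   # equals m/N when classes are balanced
    folds = [[by_class[lab][k] for lab in by_class] for k in range(n_folds)]
    return folds

# Each cross-validation cycle trains on all indices outside one fold
# and tests on the N indices inside it.
```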

6 Experimentation results
To understand the relationships among the training time complexities, we performed
the same experiment on 5, 6, 7, 8, and 9 images per class to construct a visual graph.
The training times for each experiment are graphed using Vandermonde interpolation.
Figure 2 shows the interpolation of training times for Gunn's SVM, the Adatron (without
bias), and the Friess Adatron. According to Figure 2, Gunn's SVM begins to outperform
the Adatron (without bias) at 8 training examples per class.
Figure 3 shows the interpolation of training times for the Adatron (with bias) and
the Kernel Minover. In comparison to Figure 2, the Adatron (with bias) and the Kernel
Minover are computationally inefficient. The rationale behind this finding lies in the
nature of each classification method. In the case of the KA (with bias) algorithm, it

SVM Method               Training time T (s)    Error    η
Gunn's SVM               40.0906                0.000    -
Adatron (No Bias)        77.5535                0.160    1.0
Adatron (Bias)           2585.5                 0.000    1.0
Kernel Minover           604.8047               0.000    0.5
Friess Adatron           157.2681               0.000    0.5
DAG-SVM                  9.8762                 0.000    -
DAG-Adatron (No Bias)    3.5541                 0.000    1.0
DAG-Adatron (Bias)       38.2240                0.010    1.0
DAG-Minover              17.6314                0.000    0.5
DAG-Friess               4.7699                 0.080    0.5

Table 3: Training time comparisons (in seconds) for 100 training examples (10 images per
class), using the RBF kernel with σ = 2.

Figure 2: Vandermonde interpolation of training time (in seconds) versus the number of
training points per class for Gunn's SVM, the Adatron (without bias), and the Friess
Adatron.

is necessary to calculate the w- and A-values in the learning loop of each epoch t. The
greatest computational expense lies in the arithmetic required to calculate these values.
In addition, the $\alpha_i$-values are only slightly updated at each $t$, requiring more iterations to
optimize the classification model. While the KA (with bias) algorithm shows great accuracy,
it requires a longer training time.
Similarly, the Kernel Minover algorithm's slowness is related to requiring more itera-
tions to optimize the classification model. This is due to the Kernel Minover algorithm
only updating one ai-value per iteration.

Figure 3: Vandermonde interpolation of training time (in seconds) versus the number of
training points per class for the Adatron (with bias) and the Kernel Minover.

Figure 4 shows the interpolation of the training times of each DDAG method. Both
the DAG-equivalent methods of the Adatron (without bias) and the Friess Adatron show
significant improvement from the overall training time of each classifier.
It is interesting to note that though Gunn's SVM performs best in the 1-v-r problem,
its DAG-equivalent method is suboptimal with respect to training time. Since the DDAG
requires $N(N-1)/2$ 1-v-1 classification models to compute, the large number of calculations
of the quadratic programming procedure increases the training time required for the SVM
when compared to the Adatron and minover classification models.

Figure 4: Vandermonde interpolation of training time (in seconds) versus the number of
training points per class for the DAG implementations of each classifier.

7 Remarks
Additional remarks should be made to present other potential discrepancies in the exper-
imentation results. Firstly, it must be noted that Gunn's SVM uses the C programming
language (called by MATLAB) to perform the quadratic programming arithmetic, while
the other algorithms presented use only the MATLAB language. While C programs are
compiled into machine code, MATLAB scripts are interpreted at runtime. By plac-
ing the most computationally challenging arithmetic into machine code, Gunn's SVM
may be saving computational time, whereas the other algorithms must first translate the
MATLAB code to compute the results. It would be interesting to compare an SVM writ-
ten purely in MATLAB to the other algorithms. However, the mathematics involved in
the computation of the quadratic programming procedure is outside of the scope of this
experiment.
In addition, the Adatron (with bias) should receive additional attention. Programmatically
storing the $w$- and $\lambda$-values of each epoch of the KA algorithm requires much
consideration. We decided to store only the previous three values of these parameters and
used modular arithmetic to determine which $w$- and $\lambda$-values correspond to $w^{t-1}$, $w^{t-2}$,
etc. at each iteration. Modular arithmetic in itself is computationally expensive. Other

possibilities include storing the parameter values for every iteration (which would require
a large amount of space).

8 Conclusion
In conclusion, while the SVM appears to require the least amount of training time in
the 1-v-r multiclass classification model, the KA (without bias) and the Friess Adatron
algorithm require significantly less time to train in the DDAG 1-v-1 model as the number
of data points increases. The use of the DDAG 1-v-1 model results in a significant decrease
in training time for the overall multiclass classification model while maintaining classification
accuracy.

References
[1] C. Campbell and N. Cristianini, Simple learning algorithms for training support vector
machines, http://citeseer.ist.psu.edu/campbell98simple.html, 1998.

[2] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis (Cambridge
University Press, New York, 2004).

[3] O. Chapelle, P. Haffner, et al., Support Vector Machines et Classification d'Images,
http://citeseer.ist.psu.edu/392011.html, 1998.

[4] J. Platt, N. Cristianini, and J. Shawe-Taylor, in Large Margin DAGs for Multiclass
Classification, edited by S. Solla, T. Leen, and K.-R. Mueller (MIT Press, Cambridge,
MA, 2000), pp. 547-553.

[5] V. Vapnik, The Nature of Statistical Learning Theory (Springer-Verlag, Berlin, 1995).

[6] J. Platt, in Advances in Kernel Methods - Support Vector Learning, edited by B.
Scholkopf, C. J. C. Burges, and A. J. Smola (MIT Press, Cambridge, MA, 1999), pp.
185-208.

[7] H. D. Navone and T. Downs, Variations on a Kernel-Adatron Theme,
http://citeseer.ist.psu.edu/navone01variations.html, 2001.

[8] S. Gunn, MATLAB SVM Toolbox, http://www.isis.ecs.soton.ac.uk/resources/svminfo/

