Step 3: If γ_t = 1, the algorithm returns with f(x) = sign(h_t(x)), i.e. h_1 = h_t and T = 1.
Step 4: Otherwise set

    α_t = (1/2) ln((1 + γ_t)/(1 − γ_t));
Step 5: Update

    d_n^{t+1} = d_n^t e^{−α_t y_n h_t(x_n)} / Z_t   for n = 1, …, N,

with

    Z_t = Σ_{n=1}^N d_n^t e^{−α_t y_n h_t(x_n)}.
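The update in Step 5 can be sketched as follows (a minimal Python sketch; the function name and array layout are our own, not from the original):

```python
import numpy as np

def update_distribution(d, u, alpha):
    """One AdaBoost distribution update (Step 5).

    d     -- current distribution d^t over the N examples (sums to 1)
    u     -- margins u_n^t = y_n * h_t(x_n), one per example
    alpha -- the step alpha_t chosen at iteration t
    Returns the new distribution d^{t+1} and the normalizer Z_t.
    """
    w = d * np.exp(-alpha * u)   # unnormalized weights d_n^t e^{-alpha u_n^t}
    Z = w.sum()                  # Z_t
    return w / Z, Z
```

Misclassified examples (u_n < 0) get their weight increased, correctly classified ones decreased, as in the standard AdaBoost scheme.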
Let

    Z_t(α) = Σ_{n=1}^N d_n^t e^{−α u_n^t}   and   d_n^{t+1}(α) = d_n^t e^{−α u_n^t} / Z_t(α),

where u_n^t = y_n h_t(x_n). In the AdaBoost [1] algorithm, the step α_t for updating the distribution d^t is obtained by keeping d^{t+1}(α)·u^t = 0, i.e. by keeping the next edge at 0. Because Z_t(α) is convex with respect to α, this is equivalent to

    min_α Z_t(α) = min_α Σ_{n=1}^N d_n^t e^{−α u_n^t},   or   Z_t′(α) = 0.   (1)
Although (1) can be solved exactly by any line-search method because Z_t(α) is convex, as proved in [4], for convenience of analysis a closed-form approximate solution of (1) is obtained from min_α Ẑ_t(α), where

    Z_t(α) ≤ Ẑ_t(α) := (1/2)(1 + d^t·u^t) e^{−α} + (1/2)(1 − d^t·u^t) e^{α}.

The inequality holds because

    e^{−αx} ≤ (1/2)(1 + x) e^{−α} + (1/2)(1 − x) e^{α},   x ∈ [−1, 1],   (2)

which is also important for proving the bounds of the AdaBoost algorithm.
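Inequality (2) is easy to check numerically; the following sketch (the grid and the α values are arbitrary choices of ours) verifies it over [−1, 1]:

```python
import numpy as np

# Verify e^{-alpha x} <= (1/2)(1+x)e^{-alpha} + (1/2)(1-x)e^{alpha} on [-1, 1].
# The right-hand side is the chord of the convex function e^{-alpha x}
# between x = -1 and x = +1, so it lies above the function.
xs = np.linspace(-1.0, 1.0, 201)
for alpha in (0.1, 0.5, 1.0, 2.0):
    lhs = np.exp(-alpha * xs)
    rhs = 0.5 * (1 + xs) * np.exp(-alpha) + 0.5 * (1 - xs) * np.exp(alpha)
    assert np.all(lhs <= rhs + 1e-12)
```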
The solution of min_α Ẑ_t(α) can be obtained analytically as

    α_t = (1/2) ln((1 + γ_t)/(1 − γ_t)),   where γ_t = d^t·u^t.   (3)
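The closed-form step (3) can be sketched as follows (function name is our own):

```python
import numpy as np

def adaboost_step(d, u):
    """Closed-form AdaBoost step (3): alpha_t = (1/2) ln((1+gamma)/(1-gamma)),
    where gamma_t = d^t . u^t is the edge (assumed to satisfy |gamma| < 1)."""
    gamma = float(np.dot(d, u))
    return 0.5 * np.log((1 + gamma) / (1 - gamma))
```

For binary-valued hypotheses (u_n ∈ {−1, +1}) this step drives the next edge d^{t+1}·u^t exactly to 0, which is the condition behind (1).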
In [2], in order to obtain higher progress, a tighter condition d^{t+1}(α)·u^t = ρ is maintained per iteration, which is equivalent to

    min_α Z_t(α) e^{ρα} = min_α Σ_{n=1}^N d_n^t e^{−α(u_n^t − ρ)}.   (4)
Similarly, by inequality (2),

    min_α Ẑ_t(α) e^{ρα} = min_α [ (1/2)(1 + γ_t) e^{−α} + (1/2)(1 − γ_t) e^{α} ] e^{ρα}

is minimized instead to get a closed-form step α_t, and a new algorithm AdaBoost_ρ is proposed:

    α_t = (1/2) ln((1 + γ_t)/(1 − γ_t)) − (1/2) ln((1 + ρ)/(1 − ρ)).   (5)
Here ρ is the margin target. If ρ is chosen properly (slightly below the maximum margin ρ*), AdaBoost_ρ converges to the maximum margin ρ*. Since ρ* is not known in advance, AdaBoost* [3] estimates the target adaptively during the run and uses the step

    α_t = (1/2) ln((1 + γ_t)/(1 − γ_t)) − (1/2) ln((1 + ρ_t)/(1 − ρ_t)).   (6)

Our new steps are based on a tighter bound on e^{−αx}. Since e^{−αx} is convex in x, it lies below its chord on each of the intervals [0, 1] and [−1, 0]:

    e^{−αx} ≤ 1 − x(1 − e^{−α}) for x ∈ [0, 1],   and   e^{−αx} ≤ 1 + x(1 − e^{α}) for x ∈ [−1, 0].   (7)
The difference between inequalities (2) and (7) is shown in Fig. 1.
2.1 New step for AdaBoost method
Firstly, we consider problem (1). By inequality (7), Z_t(α) in (1) is bounded from above by the following Z̃_t(α):

    Z_t(α) ≤ Z̃_t(α) := Σ_{u_n^t>0} d_n^t (1 − u_n^t(1 − e^{−α})) + Σ_{u_n^t≤0} d_n^t (1 + u_n^t(1 − e^{α}))
                      = 1 − γ_t⁺(1 − e^{−α}) + γ_t⁻(1 − e^{α}),
where

    γ_t⁻ = Σ_{u_n^t<0} d_n^t u_n^t ≤ 0   and   γ_t⁺ = Σ_{u_n^t>0} d_n^t u_n^t ≥ 0,

so that Z_t(α) ≤ Z̃_t(α) ≤ Ẑ_t(α).
Typically γ_t⁻ < 0; then, minimizing Z̃_t(α) by solving Z̃_t′(α) = 0, we get the simple step

    α̃_t = (1/2) ln(γ_t⁺ / (−γ_t⁻)),   (8)
with

    Z̃_t(α̃_t) = 1 − (√(γ_t⁺) − √(−γ_t⁻))².
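The new step (8), together with its fallback to the classical step (3), can be sketched as follows (names are our own):

```python
import numpy as np

def new_adaboost_step(d, u):
    """New step (8): alpha = (1/2) ln(gamma_plus / (-gamma_minus)), where
    gamma_plus  = sum over u_n > 0 of d_n u_n,
    gamma_minus = sum over u_n < 0 of d_n u_n.
    Falls back to the classical step (3) when gamma_minus = 0."""
    gp = float(np.sum(d[u > 0] * u[u > 0]))   # gamma_t^+ >= 0
    gm = float(np.sum(d[u < 0] * u[u < 0]))   # gamma_t^- <= 0
    if gm == 0.0:                             # no negative margins: use step (3)
        gamma = gp
        return 0.5 * np.log((1 + gamma) / (1 - gamma))
    return 0.5 * np.log(gp / -gm)
```

For binary hypotheses u_n ∈ {−1, +1} one has γ⁺ = (1 + γ)/2 and −γ⁻ = (1 − γ)/2, so (8) coincides with (3), consistent with the observation in Section 3.1.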
If γ_t⁻ = 0 (this means u_n^t ≥ 0 for all n), then Z̃_t′(α) < 0 for all α and (8) is undefined; in this case the original step (3) is used instead.
2.2 New step for AdaBoost* method
As in [2, 3], in order to maintain d^{t+1}(α)·u^t ≥ ρ, or to solve the problem in Eq. (4) approximately, by inequality (7) we bound Z_t(α)e^{ρα} in Eq. (4) from above by Z̃_t(α) as follows:

    Z_t(α) e^{ρα} ≤ Z̃_t(α) := ( 1 − γ_t⁺(1 − e^{−α}) + γ_t⁻(1 − e^{α}) ) e^{ρα}.
To minimize Z̃_t(α), we set

    Z̃_t′(α) = ( b + a e^{α} + c e^{−α} ) e^{ρα} = 0,   (9)

where a = −(1 + ρ)γ_t⁻, b = ρ(1 − γ_t⁺ + γ_t⁻) and c = −(1 − ρ)γ_t⁺.
If γ_t⁻ < 0, we get a somewhat complicated step by solving equation (9), which is quadratic in e^{α}:

    α̃_t = ln( (−b + √(b² − 4ac)) / (2a) ).   (10)
And if γ_t⁻ = 0, i.e. a = 0 (then γ_t = γ_t⁺), we have

    α̃_t = ln(γ_t/(1 − γ_t)) − ln(ρ/(1 − ρ)).   (11)
After omitting b in Eq. (9), we get a simple step similar to that in [2, 3], a cruder approximation than (10) (again, if γ_t⁻ = 0, (11) is used):

    ᾱ_t = (1/2) ln(γ_t⁺/(−γ_t⁻)) − (1/2) ln((1 + ρ)/(1 − ρ)).   (12)
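Steps (10), (11) and (12) can be sketched as follows (a sketch using the coefficients a, b, c defined above; function names are our own):

```python
import numpy as np

def new_adastar_step(d, u, rho):
    """New AdaBoost*-type step: solve (9) exactly via the quadratic in e^alpha
    (Eq. (10)), with the degenerate case (11) when gamma_minus = 0.
    Assumes the edge gamma = d.u exceeds the margin target rho > 0."""
    gp = float(np.sum(d[u > 0] * u[u > 0]))   # gamma_t^+
    gm = float(np.sum(d[u < 0] * u[u < 0]))   # gamma_t^- <= 0
    a = -(1 + rho) * gm
    b = rho * (1 - gp + gm)
    c = -(1 - rho) * gp
    if a == 0.0:                              # gamma_minus = 0: Eq. (11)
        gamma = gp
        return np.log(gamma / (1 - gamma)) - np.log(rho / (1 - rho))
    return np.log((-b + np.sqrt(b * b - 4 * a * c)) / (2 * a))   # Eq. (10)

def approx_adastar_step(gp, gm, rho):
    """Approximate step (12), obtained by dropping b in Eq. (9)."""
    return 0.5 * np.log(gp / -gm) - 0.5 * np.log((1 + rho) / (1 - rho))
```

Since a > 0 and c < 0 here, the discriminant b² − 4ac is positive and the chosen root of the quadratic is the positive one.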
We can prove 0 < α̃_t ≤ ᾱ_t and Z̃_t(α̃_t) ≤ Z̃_t(ᾱ_t), because b = ρ(1 − Σ_n d_n^t|u_n^t|) in Eq. (9) is positive while Z̃_t′(0) ∝ b + a + c = ρ − d^t·u^t is negative, so that 0 and ᾱ_t lie on opposite sides of α̃_t.
3. Experimental results
In this section, we give some experiments to illustrate the different effects of the steps discussed in this project. The optimal solution α* of (1) or (4), called the Optimal Step in our experiments, is also computed as a baseline, using Newton's method to solve (1) or (4):

    update α_{k+1} := α_k − Z′(α_k)/Z″(α_k) until |Z′(α_{k+1})| ≤ ε is reached, for a tiny tolerance ε.   (13)

In our experiments, all the steps are capped to the interval [0, 1] to keep the iterations consistent.
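Procedure (13) can be sketched as follows (a minimal Newton iteration; the tolerance, the iteration cap and the [0, 1] capping are our own reading of the text):

```python
import numpy as np

def optimal_step(d, u, rho=0.0, eps=1e-10, max_iter=100):
    """Optimal Step by Newton's method (13): minimize
    Z(alpha) = sum_n d_n e^{-alpha (u_n - rho)}.
    rho = 0 gives problem (1); rho > 0 gives problem (4)."""
    v = u - rho
    alpha = 0.0
    for _ in range(max_iter):
        e = d * np.exp(-alpha * v)
        g = -np.dot(e, v)            # Z'(alpha)
        h = np.dot(e, v * v)         # Z''(alpha) > 0 since Z is convex
        alpha -= g / h               # Newton update
        if abs(-np.dot(d * np.exp(-alpha * v), v)) <= eps:
            break
    return min(max(alpha, 0.0), 1.0)  # cap the step into [0, 1] as in our experiments
```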
3.1 Experiment on email spam data with hypotheses of range {−1, 1}
The first experiment compares step (3) and step (8) with the AdaBoost algorithm on an email spam dataset; when γ_t⁻ = 0, step (3) is used in place of (8). The data is permuted 100 times and split into 3/4 for training and 1/4 for testing. The averaged results are plotted in Fig. 2. We use single features as the weak learners, where each component of every instance is either −1 or +1. In order to get a slightly better hypothesis in every trial, we randomly chose 30 features and selected the best one as the hypothesis for the algorithm at stage t.
The results in Fig. 2 show that the new step obtains essentially the same result as the original one. This is not surprising: in this situation h_t(x_n) ∈ {−1, 1}, so inequalities (2) and (7) are both tight and Ẑ_t(α) = Z̃_t(α); the two steps should therefore coincide all the time. The small differences arise because the hypothesis chosen per trial is random.
Fig. 2. Comparison between the new and original steps on the email spam dataset: mean training and testing errors versus the iteration number T (up to 300), averaged over the 100 permutations.
3.2 Experiment on artificial data
In order to see the difference between step α̃_t in (8) and α_t in (3), or the difference between the steps in (10) and (6), we need a problem with hypotheses h(x_n) ∈ [−1, 1].
As in [3], we generate artificial data with 100 dimensions, 98 of which are nuisance dimensions filled with uniform noise. The other two dimensions are plotted exemplarily in Figure 3.
Figure 3. The two discriminative dimensions of our separable 100-dimensional data.
In the experiments, 20000 data points are generated, and all the results shown are averaged over 200 splits into m training and 20000−m test examples, where m varies from 50 to 500.
This is a linear classification problem, so the hypotheses have the form h(x) = w·x. We clip this linear function as in (14) to keep its range within [−1, 1]:

    h_w(x) = −1 if w·x < −1;   w·x if w·x ∈ [−1, 1];   +1 if w·x > 1,   with ‖w‖ = 1.   (14)
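The clipped hypothesis (14) can be sketched as (function name is our own):

```python
import numpy as np

def h_w(w, X):
    """Clipped linear hypothesis of Eq. (14): h_w(x) = clip(w . x, -1, 1),
    with w assumed normalized so that ||w|| = 1."""
    return np.clip(X @ w, -1.0, 1.0)
```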
A very weak learner is used to obtain the weak hypothesis h_w according to the following three steps:
1) Randomly choose K pairs of points (x_{i1}, x_{i2}) from the training data according to the distribution d^t;
2) Generate K weight vectors as w_i = x_{i1} − x_{i2}, i = 1, 2, …, K;
3) Set h_w to the candidate h_{w_i} with the largest edge Σ_{n=1}^N d_n^t y_n h_{w_i}(x_n).
In our experiments, K = 10.
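The three steps above can be sketched as follows (a sketch; reading "best" in step 3 as the largest edge is our assumption, and the names are our own):

```python
import numpy as np

def weak_learner(X, y, d, K=10, rng=None):
    """Sample K candidate directions from pairs of training points drawn
    according to d^t, and return the normalized w whose clipped hypothesis
    (14) has the largest edge under d^t."""
    rng = np.random.default_rng(rng)
    n = len(y)
    best_w, best_edge = None, -np.inf
    for _ in range(K):
        i, j = rng.choice(n, size=2, replace=False, p=d)   # step 1: pair (x_i1, x_i2)
        w = X[i] - X[j]                                    # step 2: w_i = x_i1 - x_i2
        norm = np.linalg.norm(w)
        if norm == 0.0:
            continue
        w = w / norm                                       # keep ||w|| = 1 as in (14)
        h = np.clip(X @ w, -1.0, 1.0)                      # clipped hypothesis (14)
        edge = float(np.dot(d, y * h))                     # step 3: edge of candidate
        if edge > best_edge:
            best_w, best_edge = w, edge
    return best_w
```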
We run the AdaBoost algorithm on this dataset with the updating step α̃_t in (8), α_t in (3), and the optimal step obtained by procedure (13), and likewise run AdaBoost* with the updating steps in Eqs. (6), (10) and (12).
First, we run the AdaBoost methods with the three different steps above, setting T = 300 and T = 1000 separately. The test accuracies are listed in Table 1 (the training accuracies are always 100% and are not given here).
The differences between them are tiny; the original AdaBoost is always slightly better than the others, and the new step ranks second. Where the steps really differ is in the margin they achieve, which we compare below.
Table 1. The experimental testing accuracies corresponding different steps and training sizes.
m=50 m=100 m=200 m=500
Original step 98.490.27 98.970.14 99.240.10 99.460.07
New step 98.350.33 98.780.19 99.060.13 99.320.10 T =300
Optimal step 98.330.34 98.720.21 98.990.15 99.230.11
Original step 98.340.30 98.830.18 99.130.12 99.420.08
New step 98.350.35 98.760.21 99.050.15 99.270.09 T=1000
Optimal step 98.340.35 98.770.22 99.040.16 99.250.10
Table 1 shows that running more iterations does not improve the test accuracy, so we perform no further analysis for T = 1000; the later experiments use T = 300 and examine the margins the different methods achieve.
In Figure 4, the steps of the different schemes with the AdaBoost algorithm are plotted. It illustrates that the original AdaBoost step is often smaller than the new step and the optimal step, and that the new step is closer to the optimal one than the original.
Figure 4. The mean steps of the different schemes with the AdaBoost algorithm (Original, Eq. (3); New, Eq. (8); Optimal, Eq. (13)) for m = 50, 100, 200, 500, averaged over 200 split trials. The new step is closer to the optimal one than the original step.
In Figure 5, the steps of the different schemes with the AdaBoost* algorithms are plotted. It illustrates that the original AdaBoost* step is often small, the new step in Eq. (10) lies in the middle, and the approximate one in Eq. (12) always keeps a large step.
Figure 5. The mean steps of the different schemes with AdaBoost* (Eq. (6), Eq. (10), Eq. (12)) for m = 50, 100, 200, 500, averaged over 200 split trials.
In Figure 6, we plot the averaged margin obtained by all the schemes on the same dataset. The dotted lines are obtained by AdaBoost and the solid lines by AdaBoost*. Firstly, it shows that AdaBoost* can achieve a better margin than AdaBoost, because the solid lines are almost always higher than the dotted lines; the only exception is in the last plot, which should catch up after some more iterations (not shown here). Secondly, it also shows that the original AdaBoost and AdaBoost* work well for small training problems with a large margin (m = 50): each reaches the largest margin among the algorithms of its kind. But for the larger training problems (which should have a smaller margin), our new steps (Eqs. (8), (10) and (12)) work better than the others; at the least, our new schemes converge faster.
Figure 6. The averaged margins obtained by all the schemes, by AdaBoost (Eqs. (3), (8), Optimal) and AdaBoost* (Eqs. (6), (10), (12)), for m = 50, 100, 200, 500. AdaBoost* achieves a better margin than AdaBoost; the original AdaBoost and AdaBoost* work well for the small training problem with a large margin (m = 50), while our new schemes work better for the larger training problems (with a smaller margin).
In Figure 7, the resulting margins of the 200 split problems are plotted, with the problems sorted by the margin the original AdaBoost algorithm obtained. It shows that, for the same problem, the margins achieved by the different schemes are distinct. Clearly, our new step often achieves a larger margin than the optimal scheme, and again the original AdaBoost method has an advantage on small training problems over the other two methods; for larger training problems, our new step is the best of the three.
In Figure 8, the experimental results of the original AdaBoost*, our new step and its approximation are plotted. It again shows that for the smaller problems with a bigger margin the original one is best, while for the larger problems our new step and its approximation obtain better results.
Figure 7. The margin obtained after 300 iterations for the 200 problems split from the same data (Original AdaBoost, Eq. (3); New step, Eq. (8); Optimal step obtained by line search), for m = 50, 100, 200, 500. The problems are sorted according to the margin obtained by the original AdaBoost algorithm. The new step is consistently better than the optimal step, and for large training problems it is also better than the original AdaBoost step.
Figure 8. The margin obtained after 300 iterations for the 200 problems split from the same data (Original AdaBoost*, Eq. (6); New, Eq. (10); New approximate, Eq. (12)), for m = 50, 100, 200, 500. The problems are sorted according to the margin obtained by the original AdaBoost* algorithm. The new step and its approximation consistently achieve better results for large training problems.
The experiments in this section provide partial evidence that the proposed steps have some advantages over the original AdaBoost and AdaBoost*. We have not yet proven bounds for the algorithms with the new steps; we will work further to obtain them. We also plan to use the method given in [4] to compare all the steps under the completely corrective scheme.
4. Conclusion
Some new steps for updating the distribution in AdaBoost algorithms are discussed in this project. Based on a tighter inequality, we derive new steps that achieve a better margin, and experimental results show that they have some advantages over the original ones. Several aspects remain unfinished, such as upper bounds on the number of iterations and experimental results on practical datasets, especially on nonlinear classification problems and soft-margin problems. We will work on these gradually.
5. References
[1] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[2] G. Rätsch and M. K. Warmuth. Maximizing the margin with boosting. In Proc. COLT, Vol. 2375 of LNAI, pages 319-333, Sydney, 2002. Springer.
[3] G. Rätsch and M. K. Warmuth. Efficient margin maximizing with boosting. Journal of Machine Learning Research, 6:2131-2152, 2005.
[4] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.