
AdaBoost

Lecturer: Jan Šochman
Authors: Jan Šochman, Jiří Matas
Center for Machine Perception
Czech Technical University, Prague
http://cmp.felk.cvut.cz
Presentation

Motivation
"AdaBoost with trees is the best off-the-shelf classifier in the world." (Breiman 1998)

Outline:
AdaBoost algorithm
How does it work?
Why does it work?
Online AdaBoost and other variants
What is AdaBoost?

AdaBoost is an algorithm for constructing a strong classifier as a linear combination

$f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$

of simple weak classifiers $h_t(x): \mathcal{X} \to \{-1, +1\}$.

Terminology
$h_t(x)$ ... weak or basis classifier, hypothesis, feature
$H(x) = \mathrm{sign}(f(x))$ ... strong or final classifier/hypothesis

Interesting properties
AB is capable of reducing both the bias (e.g. stumps) and the variance (e.g. trees) of the weak classifiers
AB has good generalisation properties (it maximises the margin)
AB output converges to the logarithm of the likelihood ratio
AB can be seen as a feature selector with a principled strategy (minimisation of an upper bound on the empirical error)
AB is close to sequential decision making (it produces a sequence of gradually more complex classifiers)
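To make the linear-combination view concrete, here is a minimal evaluation sketch (illustrative, not from the slides), assuming the weak classifiers h_t and the weights alpha_t have already been trained and that each h_t returns -1 or +1:

```python
def strong_classify(x, weak_classifiers, alphas):
    """H(x) = sign(f(x)) with f(x) = sum_t alpha_t * h_t(x)."""
    f = sum(a * h(x) for a, h in zip(alphas, weak_classifiers))
    return +1 if f >= 0 else -1
```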
The AdaBoost Algorithm

Given: $(x_1, y_1), \ldots, (x_m, y_m)$; $x_i \in \mathcal{X}$, $y_i \in \{-1, +1\}$

Initialise weights $D_1(i) = 1/m$

For $t = 1, \ldots, T$:

  Find $h_t = \arg\min_{h_j \in \mathcal{H}} \epsilon_j = \sum_{i=1}^{m} D_t(i)\, [\![\, y_i \neq h_j(x_i) \,]\!]$

  If $\epsilon_t \geq 1/2$ then stop

  Set $\alpha_t = \frac{1}{2} \log\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$

  Update
  $D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
  where $Z_t$ is a normalisation factor

Output the final classifier:

$H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

[Figure: training error vs. boosting step (0 to 40) on a toy example, shown after rounds t = 1, 2, ..., 7 and t = 40.]
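The loop above translates almost line by line into code. Below is a minimal sketch (names such as train_adaboost and candidates are illustrative, not from the slides), assuming the weak classifiers are drawn from a finite pool of callables returning -1 or +1:

```python
import math

def train_adaboost(X, y, candidates, T):
    """Discrete AdaBoost as on the slide.

    X          -- list of training samples x_i
    y          -- list of labels y_i in {-1, +1}
    candidates -- finite pool H of weak classifiers h(x) -> -1 or +1
    T          -- maximum number of boosting rounds
    Returns the selected weak classifiers and their weights alpha_t."""
    m = len(X)
    D = [1.0 / m] * m                      # D_1(i) = 1/m
    chosen, alphas = [], []

    for t in range(T):
        # Find h_t minimising the weighted error eps_j = sum_i D_t(i) [y_i != h_j(x_i)]
        errors = [sum(D[i] for i in range(m) if h(X[i]) != y[i]) for h in candidates]
        j = min(range(len(candidates)), key=lambda k: errors[k])
        h_t, eps_t = candidates[j], errors[j]

        if eps_t >= 0.5:                   # no weak classifier better than chance: stop
            break
        eps_t = max(eps_t, 1e-12)          # guard against a perfect weak classifier

        alpha_t = 0.5 * math.log((1.0 - eps_t) / eps_t)

        # Update D_{t+1}(i) proportional to D_t(i) * exp(-alpha_t * y_i * h_t(x_i))
        D = [D[i] * math.exp(-alpha_t * y[i] * h_t(X[i])) for i in range(m)]
        Z_t = sum(D)                       # normalisation factor
        D = [d / Z_t for d in D]

        chosen.append(h_t)
        alphas.append(alpha_t)

    return chosen, alphas
```

The returned weak classifiers and weights define the strong classifier H(x) = sign(sum_t alpha_t h_t(x)).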
4/17
The AdaBoost Algorithm
Given: (x
1
, y
1
), . . . , (x
m
, y
m
); x
i
X, y
i
{1, +1}
Initialise weights D
1
(i) = 1/m
For t = 1, ..., T:
Find h
t
= arg min
h
j
H

j
=
m

i=1
D
t
(i)Jy
i
= h
j
(x
i
)K
If
t
1/2 then stop
Set
t
=
1
2
log(
1
t

t
)
Update
D
t+1
(i) =
D
t
(i)exp(
t
y
i
h
t
(x
i
))
Z
t
where Z
t
is normalisation factor
Output the nal classier:
H(x) = sign

t=1

t
h
t
(x)

step
t
r
a
i
n
i
n
g
e
r
r
o
r
t = 2
0 5 10 15 20 25 30 35 40
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
4/17
The AdaBoost Algorithm
Given: (x
1
, y
1
), . . . , (x
m
, y
m
); x
i
X, y
i
{1, +1}
Initialise weights D
1
(i) = 1/m
For t = 1, ..., T:
Find h
t
= arg min
h
j
H

j
=
m

i=1
D
t
(i)Jy
i
= h
j
(x
i
)K
If
t
1/2 then stop
Set
t
=
1
2
log(
1
t

t
)
Update
D
t+1
(i) =
D
t
(i)exp(
t
y
i
h
t
(x
i
))
Z
t
where Z
t
is normalisation factor
Output the nal classier:
H(x) = sign

t=1

t
h
t
(x)

step
t
r
a
i
n
i
n
g
e
r
r
o
r
t = 3
0 5 10 15 20 25 30 35 40
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
4/17
The AdaBoost Algorithm
Given: (x
1
, y
1
), . . . , (x
m
, y
m
); x
i
X, y
i
{1, +1}
Initialise weights D
1
(i) = 1/m
For t = 1, ..., T:
Find h
t
= arg min
h
j
H

j
=
m

i=1
D
t
(i)Jy
i
= h
j
(x
i
)K
If
t
1/2 then stop
Set
t
=
1
2
log(
1
t

t
)
Update
D
t+1
(i) =
D
t
(i)exp(
t
y
i
h
t
(x
i
))
Z
t
where Z
t
is normalisation factor
Output the nal classier:
H(x) = sign

t=1

t
h
t
(x)

step
t
r
a
i
n
i
n
g
e
r
r
o
r
t = 4
0 5 10 15 20 25 30 35 40
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
4/17
The AdaBoost Algorithm
Given: (x
1
, y
1
), . . . , (x
m
, y
m
); x
i
X, y
i
{1, +1}
Initialise weights D
1
(i) = 1/m
For t = 1, ..., T:
Find h
t
= arg min
h
j
H

j
=
m

i=1
D
t
(i)Jy
i
= h
j
(x
i
)K
If
t
1/2 then stop
Set
t
=
1
2
log(
1
t

t
)
Update
D
t+1
(i) =
D
t
(i)exp(
t
y
i
h
t
(x
i
))
Z
t
where Z
t
is normalisation factor
Output the nal classier:
H(x) = sign

t=1

t
h
t
(x)

step
t
r
a
i
n
i
n
g
e
r
r
o
r
t = 5
0 5 10 15 20 25 30 35 40
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
4/17
The AdaBoost Algorithm
Given: (x
1
, y
1
), . . . , (x
m
, y
m
); x
i
X, y
i
{1, +1}
Initialise weights D
1
(i) = 1/m
For t = 1, ..., T:
Find h
t
= arg min
h
j
H

j
=
m

i=1
D
t
(i)Jy
i
= h
j
(x
i
)K
If
t
1/2 then stop
Set
t
=
1
2
log(
1
t

t
)
Update
D
t+1
(i) =
D
t
(i)exp(
t
y
i
h
t
(x
i
))
Z
t
where Z
t
is normalisation factor
Output the nal classier:
H(x) = sign

t=1

t
h
t
(x)

step
t
r
a
i
n
i
n
g
e
r
r
o
r
t = 6
0 5 10 15 20 25 30 35 40
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
4/17
The AdaBoost Algorithm
Given: (x
1
, y
1
), . . . , (x
m
, y
m
); x
i
X, y
i
{1, +1}
Initialise weights D
1
(i) = 1/m
For t = 1, ..., T:
Find h
t
= arg min
h
j
H

j
=
m

i=1
D
t
(i)Jy
i
= h
j
(x
i
)K
If
t
1/2 then stop
Set
t
=
1
2
log(
1
t

t
)
Update
D
t+1
(i) =
D
t
(i)exp(
t
y
i
h
t
(x
i
))
Z
t
where Z
t
is normalisation factor
Output the nal classier:
H(x) = sign

t=1

t
h
t
(x)

step
t
r
a
i
n
i
n
g
e
r
r
o
r
t = 7
0 5 10 15 20 25 30 35 40
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
4/17
The AdaBoost Algorithm
Given: (x
1
, y
1
), . . . , (x
m
, y
m
); x
i
X, y
i
{1, +1}
Initialise weights D
1
(i) = 1/m
For t = 1, ..., T:
Find h
t
= arg min
h
j
H

j
=
m

i=1
D
t
(i)Jy
i
= h
j
(x
i
)K
If
t
1/2 then stop
Set
t
=
1
2
log(
1
t

t
)
Update
D
t+1
(i) =
D
t
(i)exp(
t
y
i
h
t
(x
i
))
Z
t
where Z
t
is normalisation factor
Output the nal classier:
H(x) = sign

t=1

t
h
t
(x)

step
t
r
a
i
n
i
n
g
e
r
r
o
r
t = 40
0 5 10 15 20 25 30 35 40
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
Reweighting

Effect on the training set:

$D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$

$\exp(-\alpha_t y_i h_t(x_i)) \begin{cases} < 1, & y_i = h_t(x_i) \\ > 1, & y_i \neq h_t(x_i) \end{cases}$

Increase (decrease) the weight of wrongly (correctly) classified examples
The weight is an upper bound on the error of a given example
All information about the previously selected features is captured in $D_t$
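As a small numeric illustration (not on the slide): with a weighted error of epsilon_t = 0.3, hence alpha_t ≈ 0.42, the update multiplies the weight of every correctly classified example by about 0.65 and of every misclassified example by about 1.53 before renormalising by Z_t:

```python
import math

eps_t = 0.3                                    # illustrative weighted error
alpha_t = 0.5 * math.log((1 - eps_t) / eps_t)  # ~0.4236

correct_factor = math.exp(-alpha_t)            # ~0.65: weight shrinks
wrong_factor = math.exp(alpha_t)               # ~1.53: weight grows
print(alpha_t, correct_factor, wrong_factor)
```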
Upper Bound Theorem

Theorem: The following upper bound holds on the training error of H:

$\frac{1}{m} \left|\{i : H(x_i) \neq y_i\}\right| \leq \prod_{t=1}^{T} Z_t$

Proof: By unravelling the update rule,

$D_{T+1}(i) = \frac{D_T(i) \exp(-\alpha_T y_i h_T(x_i))}{Z_T} = \frac{\exp(-\sum_t \alpha_t y_i h_t(x_i))}{m \prod_t Z_t} = \frac{\exp(-y_i f(x_i))}{m \prod_t Z_t}$

If $H(x_i) \neq y_i$ then $y_i f(x_i) \leq 0$, implying that $\exp(-y_i f(x_i)) \geq 1$, thus

$[\![\, H(x_i) \neq y_i \,]\!] \leq \exp(-y_i f(x_i))$

$\frac{1}{m} \sum_i [\![\, H(x_i) \neq y_i \,]\!] \leq \frac{1}{m} \sum_i \exp(-y_i f(x_i)) = \sum_i \Big(\prod_t Z_t\Big) D_{T+1}(i) = \prod_t Z_t$

[Figure: err vs. $yf(x)$ for $yf(x) \in [-2, 2]$: the exponential loss $e^{-yf(x)}$ upper-bounds the 0/1 step loss.]
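The key inequality in the proof, [[H(x_i) != y_i]] <= exp(-y_i f(x_i)), can be checked numerically; a tiny illustration with made-up margins y_i f(x_i):

```python
import math

margins = [1.3, 0.4, -0.2, 2.1, -0.7, 0.9]   # made-up values of y_i * f(x_i)

zero_one_error = sum(1 for s in margins if s <= 0) / len(margins)
exp_loss = sum(math.exp(-s) for s in margins) / len(margins)

print(zero_one_error, exp_loss)              # 0.33... <= 0.78..., as the bound promises
```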
Consequences of the Theorem

Instead of minimising the training error, its upper bound can be minimised
This can be done by minimising $Z_t$ in each training round by:
  choosing the optimal $h_t$, and
  finding the optimal $\alpha_t$
AdaBoost can be proved to maximise the margin
AdaBoost iteratively fits an additive logistic regression model
Choosing $\alpha_t$

We attempt to minimise $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$:

$\frac{dZ}{d\alpha} = -\sum_{i=1}^{m} D(i)\, y_i h(x_i)\, e^{-\alpha y_i h(x_i)} = 0$

$-\sum_{i: y_i = h(x_i)} D(i)\, e^{-\alpha} + \sum_{i: y_i \neq h(x_i)} D(i)\, e^{\alpha} = 0$

$-e^{-\alpha} (1 - \epsilon) + e^{\alpha} \epsilon = 0$

$\alpha = \frac{1}{2} \log \frac{1 - \epsilon}{\epsilon}$

The minimiser of the upper bound is

$\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$
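A quick numerical sanity check of this derivation (illustrative, not on the slide): for a fixed weighted error epsilon, the closed-form alpha should minimise Z(alpha) = (1 - epsilon) e^{-alpha} + epsilon e^{alpha}, and a simple grid search agrees:

```python
import math

eps = 0.3                                        # illustrative weighted error
Z = lambda a: (1 - eps) * math.exp(-a) + eps * math.exp(a)

alpha_star = 0.5 * math.log((1 - eps) / eps)     # closed-form minimiser
alpha_grid = min((a / 1000 for a in range(1, 3000)), key=Z)

print(alpha_star, Z(alpha_star))                 # ~0.4236, Z = 2*sqrt(0.21) ~ 0.9165
print(alpha_grid, Z(alpha_grid))                 # grid minimiser agrees to ~1e-3
```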
Choosing $h_t$

Weak classifier examples:
  Decision tree (or stump), Perceptron ($\mathcal{H}$ infinite)
  Selecting the best one from a given finite set $\mathcal{H}$

Justification of the weighted error minimisation:
Having $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$,

$Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = \sum_{i: y_i = h_t(x_i)} D_t(i)\, e^{-\alpha_t} + \sum_{i: y_i \neq h_t(x_i)} D_t(i)\, e^{\alpha_t} = (1 - \epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}$

$Z_t$ is minimised by selecting $h_t$ with minimal weighted error $\epsilon_t$
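As a concrete example of "selecting the best one from a given finite set", here is a sketch of a decision-stump weak learner (the name best_stump and its interface are illustrative) that minimises the weighted error over thresholds on each feature; each selected stump with epsilon_t < 1/2 multiplies the bound prod_t Z_t by the factor 2*sqrt(epsilon_t(1 - epsilon_t)) < 1.

```python
def best_stump(X, y, D):
    """Pick the stump h(x) = s * (+1 if x[feat] > thresh else -1), s in {+1, -1},
    minimising the weighted error sum_i D(i) * [y_i != h(x_i)].

    X -- list of feature vectors, y -- labels in {-1, +1}, D -- weights summing to 1.
    Returns (stump_callable, weighted_error)."""
    m, n_features = len(X), len(X[0])
    best_h, best_eps = None, float("inf")
    for feat in range(n_features):
        for thresh in sorted({x[feat] for x in X}):
            for sign in (+1, -1):
                # default arguments freeze the loop variables inside the lambda
                h = lambda x, f=feat, t=thresh, s=sign: s * (1 if x[f] > t else -1)
                eps = sum(D[i] for i in range(m) if h(X[i]) != y[i])
                if eps < best_eps:
                    best_h, best_eps = h, eps
    return best_h, best_eps
```

Such a routine could replace the fixed candidate pool in the earlier training sketch.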
Generalisation (Schapire & Singer 1999)

Maximising margins in AdaBoost:

$P_{(x,y) \sim S}\left[y f(x) \leq \theta\right] \leq 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\theta} (1 - \epsilon_t)^{1+\theta}}$

where $f(x) = \dfrac{\boldsymbol{\alpha} \cdot \mathbf{h}(x)}{\|\boldsymbol{\alpha}\|_1}$.

Choosing $h_t(x)$ with minimal $\epsilon_t$ in each step minimises this upper bound on the margin distribution
The margin in SVMs uses the $L_2$ norm instead: $(\boldsymbol{\alpha} \cdot \mathbf{h}(x)) / \|\boldsymbol{\alpha}\|_2$

Upper bounds based on the margin:
With probability $1 - \delta$ over the random choice of the training set $S$,

$P_{(x,y) \sim D}\left[y f(x) \leq 0\right] \leq P_{(x,y) \sim S}\left[y f(x) \leq \theta\right] + O\!\left( \frac{1}{\sqrt{m}} \left( \frac{d \log^2(m/d)}{\theta^2} + \log(1/\delta) \right)^{1/2} \right)$

where $D$ is a distribution over $\mathcal{X} \times \{+1, -1\}$, and $d$ is the pseudodimension of $\mathcal{H}$.

Problem: the upper bound is very loose. In practice AdaBoost works much better.
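To relate the bound to data, the empirical margin distribution P_S[y f(x) <= theta] with the L1-normalised score can be computed directly from a trained ensemble; a small sketch with an illustrative interface:

```python
def margin_distribution(X, y, weak_classifiers, alphas, theta):
    """Estimate P_S[y * f(x) <= theta] with the L1-normalised score
    f(x) = sum_t alpha_t * h_t(x) / sum_t alpha_t."""
    norm = sum(alphas)
    def margin(x, label):
        return label * sum(a * h(x) for a, h in zip(alphas, weak_classifiers)) / norm
    return sum(1 for x, label in zip(X, y) if margin(x, label) <= theta) / len(X)
```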
Convergence (Friedman et al. 1998)

Proposition 1: The discrete AdaBoost algorithm minimises $J(f(x)) = E(e^{-y f(x)})$ by adaptive Newton updates.

Lemma: $J(f(x))$ is minimised at

$f(x) = \sum_{t=1}^{T} \alpha_t h_t(x) = \frac{1}{2} \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)}$

Hence

$P(y = 1 \mid x) = \frac{e^{f(x)}}{e^{-f(x)} + e^{f(x)}}$ and $P(y = -1 \mid x) = \frac{e^{-f(x)}}{e^{-f(x)} + e^{f(x)}}$

Additive logistic regression model:

$\sum_{t=1}^{T} a_t(x) = \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)}$

Proposition 2: By minimising $J(f(x))$ the discrete AdaBoost fits (up to a factor of 2) an additive logistic regression model.
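Under this interpretation the boosting score can be mapped to a posterior estimate through a logistic link; a minimal sketch, assuming f_value is the sum sum_t alpha_t h_t(x) (so that f is half the log-odds, as in the lemma above):

```python
import math

def posterior_positive(f_value):
    """P(y = +1 | x) = e^f / (e^f + e^{-f}) = 1 / (1 + e^{-2f})."""
    return 1.0 / (1.0 + math.exp(-2.0 * f_value))
```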
The Algorithm Recapitulation

Given: $(x_1, y_1), \ldots, (x_m, y_m)$; $x_i \in \mathcal{X}$, $y_i \in \{-1, +1\}$

Initialise weights $D_1(i) = 1/m$

For $t = 1, \ldots, T$:
  Find $h_t = \arg\min_{h_j \in \mathcal{H}} \epsilon_j = \sum_{i=1}^{m} D_t(i)\, [\![\, y_i \neq h_j(x_i) \,]\!]$
  If $\epsilon_t \geq 1/2$ then stop
  Set $\alpha_t = \frac{1}{2} \log\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
  Update $D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$

Output the final classifier:

$H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

[Figure: training error vs. boosting step, shown after rounds t = 1, 2, ..., 7 and t = 40, as in the earlier algorithm slide.]
AdaBoost Variants

Freund & Schapire 1995:
  Discrete ($h : \mathcal{X} \to \{0, 1\}$)
  Multiclass AdaBoost.M1 ($h : \mathcal{X} \to \{0, 1, \ldots, k\}$)
  Multiclass AdaBoost.M2 ($h : \mathcal{X} \to [0, 1]^k$)
  Real-valued AdaBoost.R ($Y = [0, 1]$, $h : \mathcal{X} \to [0, 1]$)

Schapire & Singer 1999:
  Confidence-rated prediction ($h : \mathcal{X} \to \mathbb{R}$, two-class)
  Multilabel AdaBoost.MR, AdaBoost.MH (different formulation of the minimised loss)

Oza 2001:
  Online AdaBoost

Many other modifications since then: cascaded AB, WaldBoost, probabilistic boosting tree, ...
Online AdaBoost

Offline
Given:
  Set of labeled training samples $X = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, $y = \pm 1$
  Weight distribution over $X$: $D_0 = 1/m$
For $t = 1, \ldots, T$:
  Train a weak classifier using the samples and the weight distribution: $h_t(x) = \mathcal{L}(X, D_{t-1})$
  Calculate the error $\epsilon_t$
  Calculate the coefficient $\alpha_t$ from $\epsilon_t$
  Update the weight distribution $D_t$
Output:
  $F(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Online
Given:
  One labeled training sample $(x, y)$, $y = \pm 1$
  The strong classifier to update
  Initial importance $\lambda = 1$
For $t = 1, \ldots, T$:
  Update the weak classifier using the sample and the importance: $h_t(x) = \mathcal{L}(h_t, (x, y), \lambda)$
  Update the error estimate $\epsilon_t$
  Update the weight $\alpha_t$ based on $\epsilon_t$
  Update the importance weight $\lambda$
Output:
  $F(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
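A rough sketch of the online pass over a single incoming sample, following the online column above. The importance update shown (scale lambda up when the t-th weak classifier errs, down when it is correct) is the rule used in Oza's online boosting; update_weak stands for whatever online weak-learner update L is available, and both the names and the error bookkeeping are illustrative assumptions, not taken from the slide:

```python
import math

def online_boost_update(sample, label, weak_classifiers, correct_mass, wrong_mass, update_weak):
    """Process one labeled sample (x, y), y in {-1, +1}, updating every weak
    classifier in turn. correct_mass / wrong_mass accumulate the importance seen
    by each stage and give a running error estimate eps_t; returns the alpha_t."""
    lam = 1.0                                           # initial importance of the sample
    alphas = []
    for t, h in enumerate(weak_classifiers):
        update_weak(h, sample, label, lam)              # h_t = L(h_t, (x, y), lambda)
        if h(sample) == label:
            correct_mass[t] += lam
        else:
            wrong_mass[t] += lam
        eps = wrong_mass[t] / (correct_mass[t] + wrong_mass[t])
        eps = min(max(eps, 1e-10), 0.5 - 1e-10)         # clamped for stability in this sketch
        alphas.append(0.5 * math.log((1 - eps) / eps))  # alpha_t from the running eps_t
        # Oza-style importance update: misclassified samples gain importance
        lam *= 1.0 / (2 * eps) if h(sample) != label else 1.0 / (2 * (1 - eps))
    return alphas
```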
Online AdaBoost

Converges to the offline results given the same training set and number of iterations N.

N. Oza and S. Russell. Online Bagging and Boosting. Artificial Intelligence and Statistics, 2001.
Pros and Cons of AdaBoost

Advantages
  Very simple to implement
  General learning scheme: can be used for various learning tasks
  Feature selection on very large sets of features
  Good generalisation
  Seems not to overfit in practice (probably due to margin maximisation)

Disadvantages
  Suboptimal solution (greedy learning)
Selected references

Y. Freund, R.E. Schapire. A Decision-theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences, 1997.
R.E. Schapire, Y. Freund, P. Bartlett, W.S. Lee. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. The Annals of Statistics, 1998.
R.E. Schapire, Y. Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 1999.
J. Friedman, T. Hastie, R. Tibshirani. Additive Logistic Regression: a Statistical View of Boosting. Technical report, 1998.
N.C. Oza. Online Ensemble Learning. PhD thesis, 2001.
http://www.boosting.org
Thank you for your attention