DD3364
April 1, 2012
Introduction
with m = 1, . . . , M .
Then model the relationship between X and Y
f (X) =
M
X
m hm (X) =
m=1
M
X
m Z m
m=1
before.
Which transformations? Some examples:

Linear: $h_m(X) = X_m$, $m = 1, \dots, p$
Polynomial: $h_m(X) = X_j^2$ or $h_m(X) = X_j X_k$
Nonlinear: $h_m(X) = \log(X_j)$, $\sqrt{X_j}$, ..., or $h_m(X) = \|X\|$
Indicator functions: $h_m(X) = \mathrm{Ind}(L_m \le X_k < U_m)$
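As a concrete illustration, here is a minimal sketch (not from the slides) that stacks a few such transformations $Z_m = h_m(X)$ as columns of a design matrix and estimates the $\beta_m$ by ordinary least squares; the data-generating function and the particular $h_m$ are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: a linear basis expansion fit by ordinary least squares.
# The data and the chosen transformations h_m are illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 2.0, size=(100, 2))
y = np.log(X[:, 0]) + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=100)

Z = np.column_stack([
    np.ones(100),         # h_1(X) = 1 (intercept)
    X[:, 0], X[:, 1],     # linear terms h_m(X) = X_m
    X[:, 0] * X[:, 1],    # polynomial cross term X_j X_k
    np.log(X[:, 0]),      # nonlinear transformation log(X_j)
])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)  # fit f(X) = sum_m beta_m Z_m
```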
Cons: lack of locality in global basis functions.

Solution: use local polynomial representations such as

$$f(X) = \sum_{j=1}^{p} \sum_{m=1}^{M_j} \beta_{jm} h_{jm}(X_j)$$
Two ways to control the complexity of the fit:

Selection Methods: adaptively scan the dictionary and include only those basis functions $h_m$ that contribute significantly to the fit.
Regularization Methods: use the entire dictionary but restrict the coefficients. Let

$$f(X) = \sum_{j=1}^{M} \beta_j h_j(X)$$
Examples

Piecewise Constant and Piecewise Linear fits. [Scatter-plot panels; see the caption of Figure 5.1 below.]
Piecewise Constant

Divide $[a, b]$, the domain of $X$, into the three regions $[a, \xi_1)$, $[\xi_1, \xi_2)$, $[\xi_2, b]$ and use the indicator basis

$$h_1(X) = \mathrm{Ind}(X < \xi_1), \quad h_2(X) = \mathrm{Ind}(\xi_1 \le X < \xi_2), \quad h_3(X) = \mathrm{Ind}(\xi_2 \le X)$$

Then $f(X) = \sum_{m=1}^{3} \beta_m h_m(X)$ fits the mean of $Y$ in each region.
Piecewise Linear

Augment the indicator basis with

$$h_4(X) = X\, h_1(X), \quad h_5(X) = X\, h_2(X), \quad h_6(X) = X\, h_3(X)$$

Then $f(X) = \sum_{m=1}^{6} \beta_m h_m(X)$ fits a separate linear model to the data in each region (see the sketch below).
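A minimal sketch (illustrative data and knot locations assumed) of the unrestricted piecewise-linear fit with this six-function basis:

```python
import numpy as np

# Sketch: piecewise-linear basis with knots xi1, xi2, fit by least
# squares; this fits an independent line in each of the three regions.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 60))
y = np.sin(3 * x) + rng.normal(scale=0.2, size=60)
xi1, xi2 = 1/3, 2/3

h1 = (x < xi1).astype(float)                 # Ind(X < xi1)
h2 = ((xi1 <= x) & (x < xi2)).astype(float)  # Ind(xi1 <= X < xi2)
h3 = (xi2 <= x).astype(float)                # Ind(xi2 <= X)
H = np.column_stack([h1, h2, h3, x * h1, x * h2, x * h3])
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
fhat = H @ beta                              # unrestricted piecewise-linear fit
```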
FIGURE 5.1. The top left panel shows a piecewise constant function fit to some artificial data. The broken vertical lines indicate the positions of the two knots $\xi_1$ and $\xi_2$. The blue curve represents the true function, from which the data were generated with Gaussian noise. The remaining two panels show piecewise linear functions fit to the same data: the top right unrestricted, and the lower left restricted to be continuous at the knots. The lower right panel shows a piecewise linear basis function, $h_3(X) = (X - \xi_1)_+$, continuous at $\xi_1$. The black points indicate the sample evaluations $h_3(x_i)$, $i = 1, \dots, N$.
Writing the fit as a separate linear function in each region, continuity at the knots $\xi_1$ and $\xi_2$ means

$$\beta_1 + \beta_2 \xi_1 = \beta_3 + \beta_4 \xi_1, \quad \text{and} \quad \beta_3 + \beta_4 \xi_2 = \beta_5 + \beta_6 \xi_2$$

This reduces the number of degrees of freedom of $f(X)$ from 6 to 4.
A more direct way to proceed is to use a basis that incorporates the continuity constraints; use this basis instead:

$$h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = (X - \xi_1)_+, \quad h_4(X) = (X - \xi_2)_+$$

where $t_+$ denotes the positive part (see the sketch below).
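A short sketch (same illustrative data as above) showing that with this truncated basis the continuity constraints need not be imposed explicitly:

```python
import numpy as np

# Sketch: with the basis 1, X, (X - xi1)_+, (X - xi2)_+ the fitted
# function is piecewise linear and continuous at the knots by
# construction; only 4 coefficients (degrees of freedom) remain.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 60))
y = np.sin(3 * x) + rng.normal(scale=0.2, size=60)
xi1, xi2 = 1/3, 2/3

H = np.column_stack([
    np.ones_like(x), x,
    np.clip(x - xi1, 0, None),  # (X - xi1)_+
    np.clip(x - xi2, 0, None),  # (X - xi2)_+
])
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
fhat = H @ beta                 # continuous piecewise-linear fit
```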
Smoother f(X)

Can achieve a smoother $f(X)$ by increasing the order of the local polynomials and the order of the continuity at the knots.
Piecewise Cubic Polynomials

[Figure: panels showing piecewise cubic polynomials fit to the same data with increasing orders of continuity at the two knots $\xi_1$ and $\xi_2$: discontinuous, continuous, and continuous first and second derivatives. The fit with continuous 1st and 2nd derivatives at the two knots is a cubic spline.]
Cubic Spline

A cubic spline with knots $\xi_1$ and $\xi_2$ can be represented with the truncated power basis

$$h_1(X) = 1, \quad h_3(X) = X^2, \quad h_5(X) = (X - \xi_1)^3_+,$$
$$h_2(X) = X, \quad h_4(X) = X^3, \quad h_6(X) = (X - \xi_2)^3_+$$
Order-M Spline

An order-$M$ spline with knots $\xi_1, \dots, \xi_K$ is a piecewise polynomial of order $M$ and has continuous derivatives up to order $M - 2$. Its truncated power basis is

$$h_j(X) = X^{j-1}, \quad j = 1, \dots, M$$
$$h_{M+l}(X) = (X - \xi_l)^{M-1}_+, \quad l = 1, \dots, K$$

A cubic spline is an order-4 spline.
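A minimal sketch of this basis as a function (the knot values and evaluation points are illustrative):

```python
import numpy as np

# Sketch: the truncated power basis of an order-M spline.
def spline_basis(x, knots, M=4):
    """Columns: h_j(X) = X^(j-1) for j = 1..M, then
    h_{M+l}(X) = (X - xi_l)_+^(M-1) for each knot xi_l."""
    cols = [x ** (j - 1) for j in range(1, M + 1)]
    cols += [np.clip(x - xi, 0, None) ** (M - 1) for xi in knots]
    return np.column_stack(cols)

x = np.linspace(0, 1, 7)
print(spline_basis(x, knots=[0.3, 0.7]).shape)  # (7, M + K) = (7, 6)
```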
Regression Splines

Fixed-knot splines are known as regression splines. For a regression spline one needs to select
the order of the spline,
the number of knots, and
the placement of the knots.

Natural Cubic Splines

Polynomials fit beyond the boundary knots tend to behave erratically; a natural cubic spline adds the constraint that the function is linear beyond the boundary knots. Near the boundaries one then has reduced the variance of the fit.
Smoothing Splines

Avoid the knot selection problem by using a maximal set of knots. Complexity of the fit is controlled by regularization. Consider the following problem: find the function $f(x)$ with continuous second derivative which minimizes

$$\mathrm{RSS}(f, \lambda) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2\, dt$$

Here $\lambda$ is the smoothing parameter: the first term measures closeness to the data, while the second penalizes curvature in $f$. For $\lambda = 0$ any interpolating function minimizes the criterion; for $\lambda = \infty$ only the simple least-squares line survives; values in between trade off between the two.
The minimizer of $\mathrm{RSS}(f, \lambda)$ is a natural cubic spline with knots at the unique values of the $x_i$, $i = 1, \dots, n$. That is,

$$\hat f(x) = \sum_{j=1}^{n} N_j(x)\, \theta_j$$

where the $N_j(x)$ form an $n$-dimensional basis for this family of natural splines. Define

$$\{N\}_{ij} = N_j(x_i), \quad \{\Omega_N\}_{jk} = \int N_j''(t)\, N_k''(t)\, dt, \quad y = (y_1, y_2, \dots, y_n)^t$$
In this basis the criterion reduces to

$$\mathrm{RSS}(\theta, \lambda) = (y - N\theta)^t (y - N\theta) + \lambda\, \theta^t \Omega_N \theta$$

with solution

$$\hat\theta = (N^t N + \lambda \Omega_N)^{-1} N^t y$$

The fit is therefore a linear combination of the $y_i$'s. Let $\hat f$ be the $n$-vector of the fitted values $\hat f(x_i)$; then

$$\hat f = N \hat\theta = N (N^t N + \lambda \Omega_N)^{-1} N^t y = S_\lambda y$$

where $S_\lambda = N (N^t N + \lambda \Omega_N)^{-1} N^t$ is called the smoother matrix.
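The following sketch makes this concrete. It is not the book's code: it uses an ordinary (not natural) cubic truncated power basis with a handful of interior knots, and computes $\Omega_N$ by numerical quadrature of the analytic second derivatives; all data and parameter values are illustrative.

```python
import numpy as np

def basis(x, knots):
    """Cubic truncated power basis: 1, x, x^2, x^3, (x - xi)^3_+."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - xi, 0, None)**3 for xi in knots]
    return np.column_stack(cols)

def basis_dd(x, knots):
    """Second derivatives of the same basis functions."""
    cols = [np.zeros_like(x), np.zeros_like(x), 2*np.ones_like(x), 6*x]
    cols += [6*np.clip(x - xi, 0, None) for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 80))
y = np.sin(2*np.pi*x) + rng.normal(scale=0.3, size=x.size)

knots = np.quantile(x, [0.25, 0.5, 0.75])
N = basis(x, knots)

# Omega_{jk} = int N_j''(t) N_k''(t) dt, by quadrature on a fine grid.
t = np.linspace(0, 1, 2000)
B = basis_dd(t, knots)
Omega = (B.T @ B) * (t[1] - t[0])

lam = 1e-3
theta = np.linalg.solve(N.T @ N + lam*Omega, N.T @ y)  # generalized ridge
S = N @ np.linalg.solve(N.T @ N + lam*Omega, N.T)      # smoother matrix S_lambda
fhat = S @ y                                           # equals N @ theta
print("effective df:", np.trace(S))                    # df = trace(S_lambda)
```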
Properties of $S_\lambda$

$S_\lambda$ is symmetric and positive semi-definite.
$S_\lambda S_\lambda \preceq S_\lambda$, i.e. $S_\lambda$ is a shrinking smoother.
$S_\lambda$ has rank $n$.
The book defines the effective degrees of freedom of a smoothing spline to be

$$\mathrm{df}_\lambda = \mathrm{trace}(S_\lambda)$$
FIGURE 5.6. The response is the relative change in bone mineral density measured at the spine in adolescents, as a function of age. A separate smoothing spline was fit to the males and females, with $\lambda \approx 0.00022$. This choice corresponds to about 12 degrees of freedom.
Since $N$ is square and invertible, one can rewrite

$$S_\lambda = N (N^t N + \lambda \Omega_N)^{-1} N^t$$

as

$$S_\lambda = (I + \lambda K)^{-1}$$

where, writing $N = U S V^t$ for the SVD of $N$,

$$K = U S^{-1} V^t\, \Omega_N\, V S^{-1} U^t = N^{-t} \Omega_N N^{-1}$$

It is also easy to show that $\hat f = S_\lambda y$ is the solution to the optimization problem

$$\min_f\; (y - f)^t (y - f) + \lambda f^t K f$$
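A quick numerical check of the identity $S_\lambda = (I + \lambda K)^{-1}$ with $K = N^{-t} \Omega_N N^{-1}$, using a generic invertible matrix in place of the natural spline basis (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 6, 0.37
Nmat = rng.normal(size=(n, n)) + 5*np.eye(n)  # stand-in invertible basis matrix
A = rng.normal(size=(n, n))
Omega = A.T @ A                               # symmetric PSD penalty

S1 = Nmat @ np.linalg.solve(Nmat.T @ Nmat + lam*Omega, Nmat.T)
Ninv = np.linalg.inv(Nmat)
K = Ninv.T @ Omega @ Ninv                     # K = N^{-t} Omega N^{-1}
S2 = np.linalg.inv(np.eye(n) + lam*K)
print(np.allclose(S1, S2))                    # True
```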
The eigen-decomposition of $S_\lambda$

Let $K = P D P^{-1}$ be the real eigen-decomposition of $K$. Then

$$S_\lambda = (I + \lambda K)^{-1} = (I + \lambda P D P^{-1})^{-1}$$
$$= (P P^{-1} + \lambda P D P^{-1})^{-1}$$
$$= (P (I + \lambda D) P^{-1})^{-1}$$
$$= P (I + \lambda D)^{-1} P^{-1}$$

Since $K$ is symmetric, $P$ can be chosen orthogonal ($P^{-1} = P^t$) with orthonormal eigenvectors $p_k$ and eigenvalues $d_k$, so

$$S_\lambda = \sum_{k=1}^{n} \frac{1}{1 + \lambda d_k}\, p_k p_k^t$$
Example: Eigenvalues of $S_\lambda$

[FIGURE 5.7. (Top:) Smoothing spline fit of ozone concentration versus Daggot pressure gradient. The two fits correspond to different values of the smoothing parameter, chosen to achieve five (red curve) and eleven (green curve) effective degrees of freedom, defined by $\mathrm{df}_\lambda = \mathrm{trace}(S_\lambda)$. (Lower left:) The first 25 eigenvalues for the two smoothing-spline matrices; the first two are exactly 1, and all are $\ge 0$.]
Example: Eigenvectors of $S_\lambda$

[FIGURE 5.7, continued. Each blue curve is an eigenvector of $S_\lambda$ plotted against $x$; the top left panel has the highest eigenvalue, the bottom right the smallest. The red curve shows the eigenvector damped by $1/(1 + \lambda d_k)$.]
Hence the smoothing spline operates by decomposing $y$ with respect to the eigenvectors $p_k$ and differentially shrinking the contributions:

$$S_\lambda y = \sum_{k=1}^{n} \frac{1}{1 + \lambda d_k}\, p_k (p_k^t y)$$

and

$$\mathrm{df}_\lambda = \mathrm{trace}(S_\lambda) = \sum_{k=1}^{n} 1/(1 + \lambda d_k)$$
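A small sketch of this shrinkage view, using a discrete second-difference penalty as a stand-in for $K$ (all data and parameter choices illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 50, 0.5
D2 = np.diff(np.eye(n), 2, axis=0)   # discrete 2nd-difference operator
K = D2.T @ D2                        # symmetric PSD stand-in for N^{-t} Omega N^{-1}
y = np.sin(np.linspace(0, 3, n)) + rng.normal(scale=0.2, size=n)

d, P = np.linalg.eigh(K)             # K = P diag(d) P^t
shrink = 1.0 / (1.0 + lam * d)
fhat = P @ (shrink * (P.T @ y))      # sum_k p_k (p_k^t y) / (1 + lam d_k)
print("df =", shrink.sum())          # trace(S_lambda) = sum_k 1/(1 + lam d_k)
```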
Visualization of $S_\lambda$: Equivalent Kernels

[FIGURE 5.8. The smoother matrix for a smoothing spline is nearly banded, indicating an equivalent kernel with local support. The panels show rows 12, 25, 50, 75, 100 and 115 of $S_\lambda$.]
Choosing $\lambda$?

This is a crucial and tricky problem. We will deal with it in Chapter 7, when we consider model assessment and selection.
Nonparametric Logistic Regression

Ordinary logistic regression models the log-odds linearly:

$$\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = \beta_0 + \beta^t x$$

Replace the linear form by a smooth function $f$:

$$\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = f(x)$$

which implies

$$P(Y=1 \mid X=x) = \frac{e^{f(x)}}{1 + e^{f(x)}}$$
The penalized log-likelihood criterion is

$$\ell(f; \lambda) = \sum_{i=1}^{n} \left[ y_i \log P(Y=1 \mid x_i) + (1 - y_i) \log(1 - P(Y=1 \mid x_i)) \right] - \tfrac{\lambda}{2} \int (f''(t))^2\, dt$$
$$= \sum_{i=1}^{n} \left[ y_i f(x_i) - \log(1 + e^{f(x_i)}) \right] - \tfrac{\lambda}{2} \int (f''(t))^2\, dt$$
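A hedged sketch of maximizing this criterion by gradient ascent, with $f$ represented by its values at the sorted sample points and the curvature penalty replaced by a discrete second-difference penalty (the data, step size and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 100, 1.0
x = np.sort(rng.uniform(-3, 3, n))
y = rng.binomial(1, 1/(1 + np.exp(-2*np.sin(2*x))))

D2 = np.diff(np.eye(n), 2, axis=0)
K = D2.T @ D2                         # discrete analogue of int (f'')^2

f = np.zeros(n)
for _ in range(5000):                 # ascend l(f) - (lam/2) f^t K f
    p = 1/(1 + np.exp(-f))            # P(Y=1 | x_i)
    grad = (y - p) - lam * (K @ f)    # gradient of the penalized criterion
    f += 0.05 * grad
```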
More generally, a rich class of regularization problems has the form

$$\min_{f \in H} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda J(f) \right]$$

where
$L(y_i, f(x_i))$ is a loss function,
$J(f)$ is a penalty functional,
$H$ is a space of functions on which $J(f)$ is defined.
The smoothing spline is a special case, and many other estimators fit this template as well.
Types of Kernels

Definition. A kernel is a mapping $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.

These three types of kernels are equivalent:
dot-product kernel
positive semi-definite kernel
Mercer kernel
Dot-product kernel

Definition. A mapping $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a dot-product kernel if and only if

$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle$$

where $\Phi : \mathcal{X} \to H$, $H$ is a vector space and $\langle \cdot, \cdot \rangle$ is an inner product on $H$.

Positive semi-definite kernel

Definition. $k$ is a positive semi-definite kernel if for every $m$ and every $x_1, \dots, x_m \in \mathcal{X}$ the Gram matrix

$$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \cdots & k(x_m, x_m) \end{pmatrix}$$

is positive semi-definite.
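A tiny numerical sanity check of positive semi-definiteness for one common kernel (the RBF kernel and its parameter are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))

def rbf(x, y, gamma=0.5):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y)**2))

K = np.array([[rbf(xi, xj) for xj in X] for xi in X])  # Gram matrix
print(np.linalg.eigvalsh(K).min() >= -1e-10)           # True: K is PSD
```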
Mercer kernel

Definition. A symmetric mapping $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that

$$\int\!\!\int k(x, y)\, f(x)\, f(y)\, dx\, dy \ge 0$$

for all functions $f$ with $\int f(x)^2\, dx < \infty$ is a Mercer kernel.
Each type of kernel comes with a feature map.

Reproducing kernel map:

$$H_k = \left\{ f(\cdot) \;\middle|\; f(\cdot) = \sum_{j} \alpha_j k(\cdot, x_j) \right\}, \quad \langle f, g \rangle = \sum_{i} \sum_{j} \alpha_i \alpha_j' k(x_i, x_j'), \quad \Phi : x \mapsto k(\cdot, x)$$

Mercer kernel map:

$$\ell_2 = \left\{ x \;\middle|\; \sum_i x_i^2 < \infty \right\}, \quad \langle f, g \rangle = f^t g, \quad \Phi : x \mapsto (\sqrt{\lambda_1}\, \phi_1(x), \sqrt{\lambda_2}\, \phi_2(x), \dots)^t$$

Standard examples of Mercer kernels on $\mathcal{X}$ include Gaussians.
What does $f \in H_k$ mean?

Definition. A Hilbert space is a complete dot-product space (vector space + dot product + limit points of all Cauchy sequences).
The reproducing kernel Hilbert space generated by $k$ is

$$H_k = \left\{ f(\cdot) \;\middle|\; f(\cdot) = \sum_i \alpha_i k(\cdot, x_i) \text{ for } \alpha_i \in \mathbb{R} \text{ and } x_i \in \mathcal{X} \right\}$$

For the Mercer construction: if $\sum_i \lambda_i^2 = \int\!\!\int k^2(x, y)\, dx\, dy < \infty$, then $k$ has the eigen-expansion

$$k(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)$$

with eigenvalues $\lambda_i \ge 0$ and orthonormal eigenfunctions $\phi_i$.
Issues

Therefore there is a vector space, $\ell_2$, other than $H_k$ in which $k(x, y)$ acts as a dot product. We have two very different interpretations of what the kernel does:

1. $\Phi(x) = \sum_i \sqrt{\lambda_i}\, \phi_i(x)\, e_i$ maps $x$ to a point in $\ell_2$.
2. Define $\Psi : \ell_2 \to \mathrm{span}\{\phi_k\}$ by $e_k \mapsto \sqrt{\lambda_k}\, \phi_k(\cdot)$. Then we can write

$$(\Psi \circ \Phi)(x) = \sum_i \lambda_i \phi_i(x) \phi_i(\cdot) = k(\cdot, x)$$

so composing the two maps recovers the reproducing kernel map.
[Diagram: sample points $x_i \in \mathcal{X}$ are mapped by $\Phi$ to points $\Phi(x_i)$ in $\ell_2$ (axes $e_1, e_2, e_3, \dots, e_d$), and by $\Psi \circ \Phi$ to the functions $(\Psi \circ \Phi)(x_i) = k(\cdot, x_i)$.]
Mercer map

Define the inner product in $\mathcal{M}$ as

$$\langle f, g \rangle_m = \int f(x)\, g(x)\, dx$$

Note we will normalize the eigenfunctions $\phi_l$ such that

$$\int \phi_l(x)\, \phi_k(x)\, dx = \frac{\delta_{lk}}{\lambda_l}$$

Any function $f \in \mathcal{M}$ can be written as

$$f(x) = \sum_{k=1}^{\infty} c_k \phi_k(x)$$
Mercer map (continued)

Then

$$\int f(x)\, k(x, y)\, dx = \int \sum_{k=1}^{\infty} c_k \phi_k(x) \sum_{l=1}^{\infty} \lambda_l \phi_l(x) \phi_l(y)\, dx$$
$$= \sum_{k=1}^{\infty} \sum_{l=1}^{\infty} c_k \lambda_l \phi_l(y) \int \phi_k(x)\, \phi_l(x)\, dx$$
$$= \sum_{l=1}^{\infty} c_l \lambda_l \frac{1}{\lambda_l} \phi_l(y) = f(y)$$

so $k$ is a reproducing kernel on $\mathcal{M}$. Next, show $H_k \subseteq \mathcal{M}$.
$H_k \subseteq \mathcal{M}$

If $f \in H_k$ then there exist $m \in \mathbb{N}$, $\{\alpha_i\}$ and $\{x_i\}$ such that

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i)$$
$$= \sum_{i=1}^{m} \alpha_i \sum_{l=1}^{\infty} \lambda_l \phi_l(x_i)\, \phi_l(\cdot)$$
$$= \sum_{l=1}^{\infty} \left( \sum_{i=1}^{m} \alpha_i \lambda_l \phi_l(x_i) \right) \phi_l(\cdot)$$
$$= \sum_{l=1}^{\infty} c_l \phi_l(\cdot)$$

with $c_l = \lambda_l \sum_{i=1}^{m} \alpha_i \phi_l(x_i)$, so $f \in \mathcal{M}$.
The two inner products agree. Let

$$f(\cdot) = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i), \qquad g(\cdot) = \sum_{j=1}^{m} \beta_j k(\cdot, y_j)$$

Then by definition

$$\langle f, g \rangle = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j k(x_i, y_j)$$

While

$$\langle f, g \rangle_m = \int f(x)\, g(x)\, dx = \int \sum_{i=1}^{n} \alpha_i k(x, x_i) \sum_{j=1}^{m} \beta_j k(x, y_j)\, dx$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j \int k(x, x_i)\, k(x, y_j)\, dx$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j \int \sum_{l=1}^{\infty} \lambda_l \phi_l(x) \phi_l(x_i) \sum_{s=1}^{\infty} \lambda_s \phi_s(x) \phi_s(y_j)\, dx$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j \sum_{l=1}^{\infty} \lambda_l \phi_l(x_i)\, \phi_l(y_j)$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j k(x_i, y_j) = \langle f, g \rangle$$
$\mathcal{M} \subseteq H_k$

Can also show that if $f \in \mathcal{M}$ then also $f \in H_k$. We will not prove that here, but it implies $\mathcal{M} \subseteq H_k$, and hence $\mathcal{M} = H_k$.
Summary

The reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer gives us an orthonormal basis.

Interpretation I, the reproducing kernel map:

$$H_k = \left\{ f(\cdot) \;\middle|\; f(\cdot) = \sum_{j} \alpha_j k(\cdot, x_j) \right\}, \quad \langle f, g \rangle = \sum_{i,j} \alpha_i \alpha_j' k(x_i, x_j'), \quad \Phi_r : x \mapsto k(\cdot, x)$$

Interpretation II, the Mercer kernel map:

$$H_M = \ell_2 = \left\{ x \;\middle|\; \sum_i x_i^2 < \infty \right\}, \quad \langle f, g \rangle = f^t g, \quad \Phi_M : x \mapsto (\sqrt{\lambda_1}\, \phi_1(x), \sqrt{\lambda_2}\, \phi_2(x), \dots)^t$$

with $\Psi : \ell_2 \to \mathrm{span}\{\phi_k(\cdot)\}$ and $\Psi \circ \Phi_M = \Phi_r$.
Back to Regularization

We want to solve

$$\min_{f \in H_k} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \|f\|^2 \right]$$

over the RKHS $H_k$ of functions. For $f \in H_k$ we have

$$f(x) = \sum_i \alpha_i k(x, x_i) = \sum_i \alpha_i \sum_l \lambda_l \phi_l(x) \phi_l(x_i) = \sum_l c_l \phi_l(x)$$

with $c_l = \lambda_l \sum_i \alpha_i \phi_l(x_i)$. Hence

$$\|f\|^2 = \sum_{l,k} c_l c_k \langle \phi_l(x), \phi_k(x) \rangle_m = \sum_{l,k} c_l c_k \frac{\delta_{lk}}{\lambda_l} = \sum_l \frac{c_l^2}{\lambda_l}$$

so the penalty weights the coefficient $c_l$ by $1/\lambda_l$: components aligned with small-eigenvalue directions are penalized more heavily, which enforces smoothness.
Representer Theorem

Theorem. Let
$\Omega : [0, \infty) \to \mathbb{R}$ be a strictly monotonically increasing function,
$H$ be the RKHS associated with a kernel $k(x, y)$,
$L(y, f(x))$ be a loss function.
Then

$$\hat f = \arg\min_{f \in H_k} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \Omega(\|f\|^2) \right]$$

admits a representation of the form

$$\hat f(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i)$$
Relevance

The remarkable consequence of the theorem is that the infinite-dimensional optimization over $f$ reduces to a finite-dimensional optimization over the coefficients $\alpha \in \mathbb{R}^n$. This is because, with $\hat f = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i)$,

$$\|\hat f\|^2 = \langle \hat f, \hat f \rangle = \sum_{ij} \alpha_i \alpha_j \langle k(\cdot, x_i), k(\cdot, x_j) \rangle = \sum_{ij} \alpha_i \alpha_j k(x_i, x_j) = \alpha^t K \alpha$$

and

$$\hat f(x_i) = \sum_j \alpha_j k(x_i, x_j) = K_i \alpha$$

where $K$ is the Gram matrix and $K_i$ denotes its $i$-th row.
Representer Theorem, restated

Theorem. Let $\Omega : [0, \infty) \to \mathbb{R}$ be a strictly monotonically increasing function and $H$ the RKHS associated with a kernel $k(x, y)$. Then

$$\hat f = \arg\min_{f \in H_k} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \Omega(\|f\|^2) \right]$$

is given by $\hat f(x) = \sum_{i=1}^{n} \hat\alpha_i k(x, x_i)$, where

$$\hat\alpha = \arg\min_{\alpha} \left[ \sum_{i=1}^{n} L(y_i, K_i \alpha) + \Omega(\alpha^t K \alpha) \right]$$
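For squared-error loss and $\Omega(t) = t$ this finite-dimensional problem is kernel ridge regression, which has a closed form. A brief sketch (the kernel, bandwidth and data are illustrative):

```python
import numpy as np

# min_alpha ||y - K alpha||^2 + lam * alpha^t K alpha
# has solution alpha_hat = (K + lam I)^{-1} y (for invertible K).
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(4 * x) + rng.normal(scale=0.2, size=x.size)

K = np.exp(-((x[:, None] - x[None, :])**2) / 0.02)  # RBF Gram matrix
lam = 0.1
alpha = np.linalg.solve(K + lam * np.eye(x.size), y)
fhat = K @ alpha                                    # fhat(x_i) = K_i alpha
```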
Example: the Support Vector Machine

Recall that the SVM margin constraints have the form

$$y_i (\beta_0 + \beta^t x_i) \ge 1 - \xi_i$$

and that for points that satisfy the margin exactly,

$$\max(0,\, 1 - y_i(\beta_0 + \beta^t x_i)) = (1 - y_i(\beta_0 + \beta^t x_i))_+ = 0$$

Hence we can re-write the optimization problem as

$$\min_{\beta_0, \beta} \left[ \sum_{i=1}^{n} (1 - y_i(\beta_0 + \beta^t x_i))_+ + \lambda \|\beta\|^2 \right]$$

This has the loss-plus-penalty form of the representer theorem, with the hinge loss

$$L(y, f(x)) = (1 - y\, f(x))_+, \qquad \Omega(\|f\|^2) = \|f\|^2$$

so, for the linear kernel, the solution to this problem is

$$\hat f(x) = \sum_{i=1}^{n} \alpha_i x_i^t x$$

consistent with the expression $\beta = \sum_{i=1}^{n} \alpha_i y_i x_i$ obtained from the SVM optimality conditions.
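A compact sketch of solving the hinge-plus-ridge form directly by subgradient descent (synthetic data, fixed step size; all choices illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + 0.5*X[:, 1] > 0, 1.0, -1.0)  # labels in {-1, +1}

lam, step = 0.1, 0.001
b0, b = 0.0, np.zeros(2)
for _ in range(2000):
    margin = y * (b0 + X @ b)
    active = margin < 1                     # points with nonzero hinge loss
    g_b = -(y[active, None] * X[active]).sum(axis=0) + 2*lam*b
    g_b0 = -y[active].sum()
    b -= step * g_b                         # subgradient step on beta
    b0 -= step * g_b0                       # subgradient step on beta_0
```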