John Shawe-Taylor
Department of Computer Science
University College London
jst@cs.ucl.ac.uk
Machine Learning Tutorial
Imperial College
February 2014
Aim:
OVERALL STRUCTURE
Part 1
Kernel methods
Kernel methods were (re)introduced in the 1990s with Support Vector Machines

Linear functions, but in high-dimensional spaces: equivalent to non-linear functions in the input space

Statistical analysis showing that a large margin can overcome the curse of dimensionality

Extensions rapidly introduced for many tasks other than classification
A linear function:

$f(x) = \langle w, x \rangle = \sum_{i=1}^{n} w_i x_i$

Primal solution:

$w = (X^\top X + \lambda I_n)^{-1} X^\top y$
Dual solution

A dual solution should involve only computation of inner products; this is achieved by expressing the weight vector as a linear combination of the training examples:

$X^\top X w + \lambda w = X^\top y$ implies

$w = \frac{1}{\lambda}\left(X^\top y - X^\top X w\right) = X^\top \frac{1}{\lambda}(y - Xw) = X^\top \alpha,$

where

$\alpha = \frac{1}{\lambda}(y - Xw) \qquad (1)$

or equivalently

$w = \sum_{i=1}^{m} \alpha_i x_i$
Dual solution

Substituting $w = X^\top \alpha$ into equation (1) we obtain:

$\lambda \alpha = y - X X^\top \alpha$

implying

$(X X^\top + \lambda I_m)\, \alpha = y$

This gives the dual solution:

$\alpha = (X X^\top + \lambda I_m)^{-1} y$

and the regression function can be evaluated as

$g(x) = \sum_{i=1}^{m} \alpha_i \langle x, x_i \rangle$
Step 1: Compute

$\alpha = (K + \lambda I_m)^{-1} y$

Step 2: Evaluate on a new point $x$

$g(x) = \sum_{i=1}^{m} \alpha_i \langle x, x_i \rangle$
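As an illustration, here is a minimal numpy sketch of these two steps (kernel ridge regression). The Gaussian kernel used here appears later in the tutorial; all names (gaussian_kernel, fit, predict) are my own, not from the slides.

import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    # kappa(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def fit(X, y, lam=0.1, sigma=1.0):
    # Step 1: alpha = (K + lam I_m)^{-1} y
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(X_train, alpha, X_test, sigma=1.0):
    # Step 2: g(x) = sum_i alpha_i kappa(x_i, x)
    return gaussian_kernel(X_test, X_train, sigma) @ alpha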
$g(x) = \sum_{i=1}^{m} \alpha_i \langle x_i, x \rangle^2$
Part 2
The Parzen window classifier:

$h(x) = \mathrm{sgn}\left(\frac{1}{m_+}\sum_{i=1}^{m_+}\kappa(x, x_i) - \frac{1}{m_-}\sum_{i=m_+ + 1}^{m}\kappa(x, x_i) - b\right),$

where

$b = \frac{1}{2m_+^2}\sum_{i,j=1}^{m_+}\kappa(x_i, x_j) - \frac{1}{2m_-^2}\sum_{i,j=m_+ + 1}^{m}\kappa(x_i, x_j)$
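A hedged numpy sketch of this classifier, assuming a precomputed kernel matrix and labels in {+1, -1}; the function names are mine.

import numpy as np

def parzen_threshold(K, y):
    # b = (1/(2 m_+^2)) sum_{i,j in +} kappa - (1/(2 m_-^2)) sum_{i,j in -} kappa
    pos, neg = y == 1, y == -1
    return 0.5 * K[np.ix_(pos, pos)].mean() - 0.5 * K[np.ix_(neg, neg)].mean()

def parzen_predict(K_test, y, b):
    # K_test[t, i] = kappa(x_t, x_i); h(x) = sgn(mean_+ kappa - mean_- kappa - b)
    pos, neg = y == 1, y == -1
    return np.sign(K_test[:, pos].mean(axis=1) - K_test[:, neg].mean(axis=1) - b)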
The corresponding weight vector is the difference of the class means in feature space:

$w = \frac{1}{m_+}\sum_{i=1}^{m_+}\phi(x_i) - \frac{1}{m_-}\sum_{i=m_+ + 1}^{m}\phi(x_i)$
Variance of projections

Consider projections of the datapoints $\phi(x_i)$ onto a unit vector direction $v$ in the feature space:

the average is given by

$\mu_v = \hat{E}\left[P_v(\phi(x))\right] = \hat{E}\left[v^\top \phi(x)\right] = v^\top \phi_S$

(where $\phi_S$ is the sample mean of the $\phi(x_i)$); of course this is 0 if the data has been centred.

the average square is given by

$\hat{E}\left[P_v(\phi(x))^2\right] = \hat{E}\left[v^\top \phi(x)\phi(x)^\top v\right] = \frac{1}{m}\, v^\top X^\top X v$
Variance of projections

Now suppose $v$ has the dual representation $v = X^\top \alpha$. The average is given by

$\mu_v = \frac{1}{m}\, \alpha^\top X X^\top \mathbf{j} = \frac{1}{m}\, \alpha^\top K \mathbf{j},$

where $\mathbf{j}$ is the all-ones vector, while the average square is

$\frac{1}{m}\, v^\top X^\top X v = \frac{1}{m}\, \alpha^\top X X^\top X X^\top \alpha = \frac{1}{m}\, \alpha^\top K^2 \alpha$

Hence, the variance in direction $v$ is given by

$\sigma_v^2 = \frac{1}{m}\, \alpha^\top K^2 \alpha - \left(\frac{1}{m}\, \alpha^\top K \mathbf{j}\right)^2$
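These dual formulas can be checked numerically; the sketch below uses a linear kernel so the projections can also be computed explicitly (all names mine).

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))            # explicit features, so we can cross-check
K = X @ X.T                              # linear kernel matrix
alpha = rng.normal(size=100)
alpha /= np.linalg.norm(X.T @ alpha)     # scale alpha so v = X'alpha is a unit vector
v = X.T @ alpha

m, j = len(K), np.ones(len(K))
mu = alpha @ K @ j / m                   # (1/m) alpha' K j
var_dual = alpha @ K @ K @ alpha / m - mu ** 2

p = X @ v                                # explicit projections <v, x_i>
print(np.isclose(mu, p.mean()), np.isclose(var_dual, p.var()))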
Fisher discriminant

The Fisher discriminant is a thresholded linear classifier:

$f(x) = \mathrm{sgn}(\langle w, \phi(x)\rangle + b),$

where $w$ is chosen to maximise the quotient:

$J(w) = \frac{(\mu_w^+ - \mu_w^-)^2}{(\sigma_w^+)^2 + (\sigma_w^-)^2}$
Fisher discriminant

Using the results we now have, we can substitute dual expressions for all of these quantities and solve using Lagrange multipliers.

The resulting classifier has dual variables

$\alpha = (BK + \lambda I)^{-1} y$

where $B = D - C$ with

$C_{ij} = \begin{cases} 2m_-/(m\, m_+) & \text{if } y_i = 1 = y_j \\ 2m_+/(m\, m_-) & \text{if } y_i = -1 = y_j \\ 0 & \text{otherwise} \end{cases}$
and

$D_{ij} = \begin{cases} 2m_-/m & \text{if } i = j \text{ and } y_i = 1 \\ 2m_+/m & \text{if } i = j \text{ and } y_i = -1 \\ 0 & \text{otherwise} \end{cases}$

and $b = 0.5\, \alpha^\top K t$ with

$t_i = \begin{cases} 1/m_+ & \text{if } y_i = 1 \\ 1/m_- & \text{if } y_i = -1 \\ 0 & \text{otherwise} \end{cases}$

giving a decision function

$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{m} \alpha_i \kappa(x_i, x) - b\right)$
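The dual Fisher discriminant translates directly into numpy; a sketch, assuming labels in {+1, -1}, a precomputed kernel matrix K, and a small regulariser lam (names mine).

import numpy as np

def kernel_fisher(K, y, lam=1e-3):
    m = len(y)
    mp, mn = (y == 1).sum(), (y == -1).sum()
    C = np.zeros((m, m))
    C[np.ix_(y == 1, y == 1)] = 2 * mn / (m * mp)
    C[np.ix_(y == -1, y == -1)] = 2 * mp / (m * mn)
    D = np.diag(np.where(y == 1, 2 * mn / m, 2 * mp / m))
    B = D - C
    alpha = np.linalg.solve(B @ K + lam * np.eye(m), y.astype(float))
    t = np.where(y == 1, 1 / mp, 1 / mn)
    b = 0.5 * alpha @ K @ t
    return alpha, b   # decision: sign(sum_i alpha_i kappa(x_i, x) - b)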
Preprocessing

Corresponds to feature selection, or learning the feature space.

Note that in kernel methods the feature space is only determined up to orthogonal transformations (change of basis):

$\hat\phi(x) = U\phi(x)$

for some orthogonal transformation $U$ ($U^\top U = I = U U^\top$); the kernel is then unchanged, since

$\hat\kappa(x, z) = \langle U\phi(x), U\phi(z)\rangle = \langle \phi(x), \phi(z)\rangle = \kappa(x, z)$
Subspace methods

Principal components analysis: choose directions to maximise variance in the training data

Canonical correlation analysis: choose directions to maximise correlations between two different views of the same objects

Gram-Schmidt: greedily choose directions according to largest residual norms (not covered)

Partial least squares: greedily choose directions with maximal covariance with the target (not covered)

In all cases we need kernel versions in order to apply these methods in high-dimensional kernel-defined feature spaces
For a unit vector $w$, the sum of the squared projections of the training data onto $w$ is

$w^\top X^\top X w = \sum_{i=1}^{m} \langle w, x_i \rangle^2$
Summing the squared norms of the training points gives the trace of the kernel matrix, which equals the sum of its eigenvalues:

$\sum_{i=1}^{m} \|x_i\|^2 = \sum_{i=1}^{m} \lambda_i = \operatorname{tr}(K)$
Kernel PCA

We would like to find a dual representation of the principal eigenvectors and hence of the projection function.

Suppose that $(w, \lambda)$, $\lambda \neq 0$, is an eigenvector/eigenvalue pair for $X^\top X$; then $(Xw, \lambda)$ is such a pair for $XX^\top$:

$(XX^\top)Xw = X(X^\top X)w = \lambda Xw$

and vice versa, $(\alpha, \lambda)$ for $XX^\top$ gives $(X^\top\alpha, \lambda)$ for $X^\top X$:

$(X^\top X)X^\top\alpha = X^\top(XX^\top)\alpha = \lambda X^\top\alpha$

Note that we get back to where we started if we do it twice.
Kernel PCA

Hence, there is a 1-1 correspondence between eigenvectors corresponding to non-zero eigenvalues, but note that if $\|\alpha\| = 1$ then

$\|X^\top\alpha\|^2 = \alpha^\top X X^\top \alpha = \alpha^\top K \alpha = \lambda,$

so if $(\alpha_i, \lambda_i)$, $i = 1, \ldots, k$ are the first $k$ eigenvector/eigenvalue pairs of $K$, then

$\frac{1}{\sqrt{\lambda_i}}\, \alpha_i$

are dual representations of the first $k$ eigenvectors $w_1, \ldots, w_k$ of $X^\top X$ with the same eigenvalues.

Computing projections:

$\langle w_i, \phi(x)\rangle = \frac{1}{\sqrt{\lambda_i}} \langle X^\top \alpha_i, \phi(x)\rangle = \frac{1}{\sqrt{\lambda_i}} \sum_{j=1}^{m} (\alpha_i)_j\, \kappa(x_j, x)$
Part 3
Generalisation of a learner

Assume that we have a learning algorithm $A$ that chooses a function $A_F(S)$ from a function space $F$ in response to the training set $S$.

From a statistical point of view the quantity of interest is the random variable:

$\epsilon(S, A, F) = E_{(x,y)}\left[\ell(A_F(S), x, y)\right],$

where $\ell$ is a loss function that measures the discrepancy between $A_F(S)(x)$ and $y$.
Generalisation of a learner

For example, in the case of classification $\ell$ is 1 if the two disagree and 0 otherwise, while for regression it could be the square of the difference between $A_F(S)(x)$ and $y$.

We refer to the random variable $\epsilon(S, A, F)$ as the generalisation of the learner.
Example of Generalisation I

We consider the Breast Cancer dataset from the UCI repository.

We use the simple Parzen window classifier described in Part 2: the weight vector is

$w_+ - w_-$

where $w_+$ is the average of the positive training examples and $w_-$ is the average of the negative training examples.

The threshold is set so that the hyperplane bisects the line joining these two points.
Example of Generalisation II

Given a size $m$ of the training set, by repeatedly drawing random training sets $S$ we estimate the distribution of

$\epsilon(S, A, F) = E_{(x,y)}\left[\ell(A_F(S), x, y)\right]$

by using the test set error as a proxy for the true generalisation.

We plot the histogram and the average of the distribution for various sizes of training set; initially the whole dataset gives a single value if we use all the examples for both training and testing, but then we plot for training set sizes:

342, 273, 205, 137, 68, 34, 27, 20, 14, 7.
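The experiment can be sketched as follows, reusing the Parzen-window routines sketched in Part 2 (names and details are mine, not the original experimental code).

import numpy as np

def error_distribution(K, y, m, trials=1000, seed=0):
    # repeatedly draw random training sets of size m and record test error
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        idx = rng.permutation(len(y))
        train, test = idx[:m], idx[m:]
        b = parzen_threshold(K[np.ix_(train, train)], y[train])
        pred = parzen_predict(K[np.ix_(test, train)], y[train], b)
        errs.append((pred != y[test]).mean())
    return np.array(errs)   # histogram/mean of this estimates the distribution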
Since

$E_S[A_F(S)] = E_S\left[w_+^S - w_-^S\right] = E_S\left[w_+^S\right] - E_S\left[w_-^S\right] = E_{y=+1}[x] - E_{y=-1}[x],$

we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be exactly the same.
[Figures: histograms of the test-error distribution for each of the training-set sizes listed above.]
Observations
Controlling generalisation
[Figures: further histograms of the test-error distribution.]
Concentration inequalities

Statistical learning is concerned with the reliability or stability of inferences made from a random sample.

Random variables with this property have been a subject of ongoing interest to probabilists and statisticians.
McDiarmid's inequality

Theorem 1. Let $X_1, \ldots, X_n$ be independent random variables taking values in a set $A$, and assume that $f : A^n \to \mathbb{R}$ satisfies

$\sup_{x_1, \ldots, x_n, \hat{x}_i \in A} \left| f(x_1, \ldots, x_n) - f(x_1, \ldots, \hat{x}_i, \ldots, x_n) \right| \le c_i, \quad 1 \le i \le n.$

Then for all $\epsilon > 0$,

$P\left\{ f(X_1, \ldots, X_n) - E f(X_1, \ldots, X_n) \ge \epsilon \right\} \le \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^{n} c_i^2} \right)$
Using McDiarmid

Setting the right-hand side equal to $\delta$ and solving for $\epsilon$: with probability at least $1 - \delta$,

$f(X_1, \ldots, X_n) - E f(X_1, \ldots, X_n) < \sqrt{ \frac{\sum_{i=1}^{n} c_i^2}{2} \ln\frac{1}{\delta} },$

which for $c_i = c/n$ becomes

$\sqrt{ \frac{c^2}{2n} \ln\frac{1}{\delta} }$
Rademacher complexity

Rademacher complexity is a new way of measuring the complexity of a function class. It arises naturally if we rerun the proof using the double sample trick and symmetrisation, but look at what is actually needed to continue the proof:
$E_D[f(z)] \le \hat{E}[f(z)] + \sup_{h \in F}\left( E[h] - \hat{E}[h] \right) + \sqrt{ \frac{\ln(1/\delta)}{2m} }$
$E_S\left[ \sup_{h \in F} \left( E[h] - \hat{E}[h] \right) \right] = E_S\left[ \sup_{h \in F} E_{\tilde{S}}\left[ \frac{1}{m}\sum_{i=1}^{m} h(\tilde{z}_i) - \frac{1}{m}\sum_{i=1}^{m} h(z_i) \,\Big|\, S \right] \right]$
$\le E_S E_{\tilde{S}}\left[ \sup_{h \in F} \frac{1}{m}\sum_{i=1}^{m} \left( h(\tilde{z}_i) - h(z_i) \right) \right]$
Adding symmetrisation

Here symmetrisation is again just swapping corresponding elements, but we can write this as multiplication by a variable $\sigma_i$ which takes values $\pm 1$ with equal probability:

$E_S E_{\tilde{S}} E_{\sigma}\left[ \sup_{h \in F} \frac{1}{m}\sum_{i=1}^{m} \sigma_i\left( h(\tilde{z}_i) - h(z_i) \right) \right] \le 2\, E_S E_{\sigma}\left[ \sup_{h \in F} \frac{1}{m}\sum_{i=1}^{m} \sigma_i h(z_i) \right] = R_m(F),$

assuming $F$ is closed under negation $f \mapsto -f$.
Rademacher complexity

The Rademacher complexity of a function class $F$ is

$R_m(F) = E_{S\sigma}\left[ \sup_{f \in F} \frac{2}{m}\sum_{i=1}^{m} \sigma_i f(z_i) \right]$
With probability at least $1 - \delta$,

$E_D[f(z)] \le \hat{E}[f(z)] + R_m(F) + \sqrt{ \frac{\ln(1/\delta)}{2m} }$
The empirical Rademacher complexity, conditioned on the sample, is

$\hat{R}_m(F) = E_{\sigma}\left[ \sup_{f \in F} \frac{2}{m}\sum_{i=1}^{m} \sigma_i f(z_i) \,\Big|\, z_1, \ldots, z_m \right]$

and with probability at least $1 - \delta$,

$E_D[f(z)] \le \hat{E}[f(z)] + \hat{R}_m(F) + 3\sqrt{ \frac{\ln(2/\delta)}{2m} }$
Application to Boosting

We can view Boosting as seeking a function from the class ($H$ is the set of weak learners)

$\left\{ \sum_{h \in H} a_h h(x) : a_h \ge 0,\ \sum_{h \in H} a_h \le B \right\} = \mathrm{conv}_B(H)$
Similarly, for kernel methods we consider the class

$\left\{ x \mapsto \sum_{i=1}^{m} \alpha_i \kappa(x_i, x) : \alpha^\top K \alpha \le B^2 \right\} \subseteq \left\{ x \mapsto \langle w, \phi(x) \rangle : \|w\| \le B \right\} = F_B,$

where we assume a kernel-defined feature space with

$\langle \phi(x), \phi(z) \rangle = \kappa(x, z).$
Rademacher complexity of $F_B$

$\hat{R}_m(F_B) = E_{\sigma}\left[ \sup_{\|w\| \le B} \frac{2}{m}\sum_{i=1}^{m} \sigma_i \langle w, \phi(x_i) \rangle \right] = \frac{2B}{m}\, E_{\sigma}\left[ \left\| \sum_{i=1}^{m} \sigma_i \phi(x_i) \right\| \right]$

$\le \frac{2B}{m}\, E_{\sigma}\left[ \left\langle \sum_{i=1}^{m} \sigma_i \phi(x_i), \sum_{j=1}^{m} \sigma_j \phi(x_j) \right\rangle \right]^{1/2} = \frac{2B}{m}\, E_{\sigma}\left[ \sum_{i,j=1}^{m} \sigma_i \sigma_j \kappa(x_i, x_j) \right]^{1/2} = \frac{2B}{m} \sqrt{ \sum_{i=1}^{m} \kappa(x_i, x_i) }$
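The final equality can be checked by Monte-Carlo sampling of the sigma vectors; a sketch (names mine). By Jensen's inequality the sampled value never exceeds the closed-form bound.

import numpy as np

def rademacher_estimate(K, B, samples=2000, seed=0):
    # estimate E_sigma (2B/m) ||sum_i sigma_i phi(x_i)|| via sigma' K sigma
    rng = np.random.default_rng(seed)
    m = len(K)
    sigma = rng.choice([-1.0, 1.0], size=(samples, m))
    norms = np.sqrt(np.maximum(np.einsum('si,ij,sj->s', sigma, K, sigma), 0))
    return (2 * B / m) * norms.mean()

# closed-form bound: (2 * B / m) * np.sqrt(np.trace(K))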
Margins in SVMs

Critical to the bound will be the margin of the classifier

$\gamma(x, y) = y g(x) = y(\langle w, \phi(x) \rangle + b):$

positive if correctly classified, and measuring the distance from the separating hyperplane when the weight vector is normalised.

The margin of a linear function $g$ is

$\gamma(g) = \min_i \gamma(x_i, y_i)$
Margins in SVMs

The loss can be upper bounded by $A(-\gamma(x, y))$, where

$A(a) = \begin{cases} 1, & \text{if } a > 0, \\ 1 + a/\gamma, & \text{if } -\gamma \le a \le 0, \\ 0, & \text{otherwise}, \end{cases}$

so that with probability at least $1 - \delta$ the generalisation error is bounded by

$\hat{E}[A(-\gamma(x, y))] + \hat{R}_m(A \circ F) + 3\sqrt{ \frac{\ln(2/\delta)}{2m} }$
Since the function $A$ is Lipschitz with slope $1/\gamma$, the contraction property of Rademacher complexity gives

$\hat{R}_m(A \circ F) \le \frac{2}{\gamma}\, \hat{R}_m(F)$
Putting the pieces together: with probability at least $1 - \delta$, the generalisation error of the classifier is bounded by

$\frac{1}{m\gamma}\sum_{i=1}^{m} \xi_i + \frac{4}{m\gamma}\sqrt{ \sum_{i=1}^{m} \kappa(x_i, x_i) } + 3\sqrt{ \frac{\ln(2/\delta)}{2m} }$
Using a kernel

Can consider much higher dimensional spaces using the kernel trick.

Can even work in infinite dimensional spaces, e.g. using the Gaussian kernel:

$\kappa(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$
[Figures: test-error histograms using the Gaussian kernel.]
The SVM is obtained by solving the optimisation problem

$\min_{w, b, \gamma, \xi}\; -\gamma + C\sum_{i=1}^{m} \xi_i$

subject to

$y_i(\langle w, \phi(x_i) \rangle + b) \ge \gamma - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \ldots, m,\quad \text{and}\quad \|w\|^2 = 1. \qquad (2)$

Note that

$\xi_i = (\gamma - y_i g(x_i))_+,$

where $g(\cdot) = \langle w, \phi(\cdot) \rangle + b$.
The corresponding Lagrangian is

$L(w, b, \gamma, \xi, \alpha, \beta, \lambda) = -\gamma + C\sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i\left[ y_i(\langle \phi(x_i), w \rangle + b) - \gamma + \xi_i \right] - \sum_{i=1}^{m} \beta_i \xi_i + \lambda\left( \|w\|^2 - 1 \right)$

with $\alpha_i \ge 0$ and $\beta_i \ge 0$.
Taking derivatives and setting them to zero:

$\frac{\partial L}{\partial w} = 2\lambda w - \sum_{i=1}^{m} y_i \alpha_i \phi(x_i) = 0,$

$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0,$

$\frac{\partial L}{\partial b} = \sum_{i=1}^{m} y_i \alpha_i = 0,$

$\frac{\partial L}{\partial \gamma} = 1 - \sum_{i=1}^{m} \alpha_i = 0.$
Resubstituting for $w$ and using $\|w\| = 1$ gives

$\lambda = \frac{1}{2}\left( \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j \kappa(x_i, x_j) \right)^{1/2}$
Substituting into the Lagrangian leaves the dual problem: maximise

$W(\alpha) = -\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \kappa(x_i, x_j)$

subject to

$\sum_{i=1}^{m} \alpha_i = 1,\quad \sum_{i=1}^{m} y_i \alpha_i = 0 \quad (\text{and } 0 \le \alpha_i \le C),$

to give solution $\alpha_i^*$, $i = 1, \ldots, m$.
The dual objective is concave, since

$\sum_{i,j=1}^{m} u_i u_j y_i y_j \kappa(x_i, x_j) = \left\langle \sum_{i=1}^{m} u_i y_i \phi(x_i), \sum_{j=1}^{m} u_j y_j \phi(x_j) \right\rangle = \left\| \sum_{i=1}^{m} u_i y_i \phi(x_i) \right\|^2 \ge 0$
The offset can be computed from two support vectors $x_i$, $x_j$ on opposite sides:

$b^* = -0.5\left( \sum_{k=1}^{m} \alpha_k^* y_k \kappa(x_k, x_i) + \sum_{k=1}^{m} \alpha_k^* y_k \kappa(x_k, x_j) \right)$

giving the decision function

$f(\cdot) = \mathrm{sgn}\left( \sum_{j=1}^{m} \alpha_j^* y_j \kappa(x_j, \cdot) + b^* \right);$
The optimal $\lambda$ and the margin can then be recovered from the solution:

$\lambda^* = \frac{1}{2}\left( \sum_{i,j=1}^{m} y_i y_j \alpha_i^* \alpha_j^* \kappa(x_i, x_j) \right)^{1/2}$

$\gamma^* = (2\lambda^*)\left( \sum_{k=1}^{m} \alpha_k^* y_k \kappa(x_k, x_j) + b^* \right)$
This leads to the dual problem: maximise

$L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \kappa(x_i, x_j)$

with constraints

$0 \le \alpha_i \le C,\quad \sum_{i=1}^{m} y_i \alpha_i = 0$
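As a sketch, this dual can be handed to a general-purpose solver; in practice dedicated algorithms such as SMO are used instead. Names are mine, and scipy's SLSQP method is assumed to be available.

import numpy as np
from scipy.optimize import minimize

def svm_dual(K, y, C=1.0):
    # maximise L(alpha) = sum_i alpha_i - 0.5 alpha' Q alpha, Q_ij = y_i y_j K_ij
    m = len(y)
    Q = (y[:, None] * y[None, :]) * K
    obj = lambda a: 0.5 * a @ Q @ a - a.sum()          # minimise -L(alpha)
    grad = lambda a: Q @ a - np.ones(m)
    cons = {'type': 'eq', 'fun': lambda a: a @ y.astype(float)}
    res = minimize(obj, np.zeros(m), jac=grad, method='SLSQP',
                   bounds=[(0, C)] * m, constraints=[cons])
    return res.x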
Part 4
Kernel functions

Already seen some properties of kernels:

symmetric:

$\kappa(x, z) = \langle \phi(x), \phi(z) \rangle = \langle \phi(z), \phi(x) \rangle = \kappa(z, x)$

kernel matrices positive semi-definite (psd):

$u^\top K u = \sum_{i,j=1}^{m} u_i u_j \kappa(x_i, x_j) = \left\langle \sum_{i=1}^{m} u_i \phi(x_i), \sum_{j=1}^{m} u_j \phi(x_j) \right\rangle = \left\| \sum_{i=1}^{m} u_i \phi(x_i) \right\|^2 \ge 0$
Kernel functions

These two properties are all that is required for a kernel function to be valid: symmetric and every kernel matrix psd.

Note that this is equivalent to all eigenvalues being non-negative: recall that the eigenvalues of the kernel matrix measure the sum of the squares of the projections onto the eigenvector.

If we have uncountable domains we should also have continuity, though there are exceptions to this as well.
Kernel functions

Proof outline:

Define the feature space as a class of functions:

$F = \left\{ \sum_{i=1}^{m} \alpha_i \kappa(x_i, \cdot) : m \in \mathbb{N},\; x_i \in X,\; \alpha_i \in \mathbb{R},\; i = 1, \ldots, m \right\}$

This is a linear space, with the embedding given by

$x \mapsto \kappa(x, \cdot)$
Kernel functions

The inner product between

$f(x) = \sum_{i=1}^{m} \alpha_i \kappa(x_i, x) \quad \text{and} \quad g(x) = \sum_{j=1}^{n} \beta_j \kappa(z_j, x)$

is defined as

$\langle f, g \rangle = \sum_{i=1}^{m}\sum_{j=1}^{n} \alpha_i \beta_j \kappa(x_i, z_j) = \sum_{i=1}^{m} \alpha_i g(x_i) = \sum_{j=1}^{n} \beta_j f(z_j),$

so it is well-defined, with $\langle f, f \rangle \ge 0$ by the psd property.
Kernel functions

The so-called reproducing property:

$\langle f, \phi(x) \rangle = \langle f, \kappa(x, \cdot) \rangle = f(x)$

implies that the inner product corresponds to function evaluation; learning a function corresponds to learning a point, namely the weight vector corresponding to that function:

$\langle w_f, \phi(x) \rangle = f(x)$
Kernel constructions

For $\kappa_1$, $\kappa_2$ valid kernels, $\phi$ any feature map, $B$ a psd matrix, $a \ge 0$ and $f$ any real-valued function, the following are valid kernels (a numerical check follows below):

$\kappa(x, z) = \kappa_1(x, z) + \kappa_2(x, z)$
$\kappa(x, z) = a\kappa_1(x, z)$
$\kappa(x, z) = \kappa_1(x, z)\,\kappa_2(x, z)$
$\kappa(x, z) = f(x)f(z)$
$\kappa(x, z) = \kappa_1(\phi(x), \phi(z))$
$\kappa(x, z) = x^\top B z$
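These closure properties can be illustrated numerically: each constructed kernel matrix should remain psd. A sketch (the choices of kernels and data are mine).

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
K1 = X @ X.T                    # linear kernel
K2 = (1 + X @ X.T) ** 2         # polynomial kernel
f = np.exp(X[:, 0])             # any real-valued function f gives f(x)f(z)

for K in [K1 + K2, 3.0 * K1, K1 * K2, np.outer(f, f)]:
    # smallest eigenvalue should be non-negative (up to round-off)
    print(np.linalg.eigvalsh(K).min() >= -1e-8)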
Kernel constructions

The following are also valid kernels:

$\kappa(x, z) = p(\kappa_1(x, z))$, for $p$ any polynomial with positive coefficients

$\kappa(x, z) = \exp(\kappa_1(x, z))$

$\kappa(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$, since

$\exp\left( \frac{\langle x, z \rangle}{\sigma^2} \right) \exp\left( -\frac{\|x\|^2}{2\sigma^2} \right) \exp\left( -\frac{\|z\|^2}{2\sigma^2} \right) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$
Subcomponents kernel

For the kernel $\langle x, z \rangle^s$ the features can be indexed by sequences

$i = (i_1, \ldots, i_n), \quad \sum_{j=1}^{n} i_j = s,$

where

$\phi_i(x) = x_1^{i_1} x_2^{i_2} \cdots x_n^{i_n}$

A similar kernel can be defined in which all subsets of features occur:

$\phi: x \mapsto (\phi_A(x))_{A \subseteq \{1, \ldots, n\}}$

where

$\phi_A(x) = \prod_{i \in A} x_i$
Subcomponents kernel

So we have (see the check below):

$\kappa(x, y) = \langle \phi(x), \phi(y) \rangle = \sum_{A \subseteq \{1, \ldots, n\}} \phi_A(x)\phi_A(y) = \sum_{A \subseteq \{1, \ldots, n\}} \prod_{i \in A} x_i y_i = \prod_{i=1}^{n} (1 + x_i y_i)$
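The identity can be verified by brute force on a small example; a sketch (names mine).

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 6
x, z = rng.normal(size=n), rng.normal(size=n)

# sum over all 2^n subsets A of prod_{i in A} x_i z_i
brute = sum(np.prod([x[i] * z[i] for i in A])
            for r in range(n + 1) for A in combinations(range(n), r))
fast = np.prod(1 + x * z)       # the product formula
print(np.isclose(brute, fast))  # True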
Graph kernels

Can also represent the polynomial kernel as a graph computation.

[Figure: a layered graph with $d$ stages; each stage has edges labelled $x_1y_1, x_2y_2, \ldots, x_ny_n$ and $R$.]
Graph kernels

The ANOVA kernel is represented by the graph:

[Figure: a grid of nodes $(s, j)$ for $s = 0, \ldots, d$ and $j = 0, \ldots, n$, with edges labelled $x_1z_1, x_2z_2, \ldots, x_nz_n$.]
Graph kernels

Features are all the combinations of exactly $d$ distinct features, while the computation is given by the recursion:

$\kappa_0^m(x, z) = 1, \quad \text{if } m \ge 0,$

$\kappa_s^m(x, z) = 0, \quad \text{if } m < s,$

$\kappa_s^m(x, z) = (x_m z_m)\, \kappa_{s-1}^{m-1}(x, z) + \kappa_s^{m-1}(x, z)$
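The recursion transcribes directly into a memoised function; a sketch (names mine).

from functools import lru_cache

def anova_kernel(x, z, d):
    # kappa_s^m with kappa_0^m = 1, kappa_s^m = 0 for m < s, and
    # kappa_s^m = x_m z_m kappa_{s-1}^{m-1} + kappa_s^{m-1}
    @lru_cache(maxsize=None)
    def kappa(s, m):
        if s == 0:
            return 1.0
        if m < s:
            return 0.0
        return x[m - 1] * z[m - 1] * kappa(s - 1, m - 1) + kappa(s, m - 1)
    return kappa(d, len(x))

print(anova_kernel((1.0, 2.0, 3.0), (1.0, 1.0, 1.0), 2))  # 1*2 + 1*3 + 2*3 = 11.0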
Graph kernels

For a general graph kernel, initialise $DP(1) = 1$; for each node compute

$DP(i) = \sum_{j \to i} e_{j \to i}\, DP(j),$

where $e_{j \to i}$ is the label on the edge from $j$ to $i$.
Text kernels: with $S$ the term-document matrix,

$(S S^\top)_{ij} = \sum_{d} \mathrm{tf}(i, d)\, \mathrm{tf}(j, d)$

Latent semantic kernels project documents as

$d \mapsto U_k \phi(d)$
String kernels

Consider the feature map given by

$\phi_u^p(s) = |\{(v_1, v_2) : s = v_1 u v_2\}|$

for $u \in \Sigma^p$, with associated kernel

$\kappa_p(s, t) = \sum_{u \in \Sigma^p} \phi_u^p(s)\, \phi_u^p(t)$
String kernels

Consider the following two sequences:

s = "statistics"
t = "computation"

The two strings contain the following substrings of length 3:

"sta", "tat", "ati", "tis", "ist", "sti", "tic", "ics"

"com", "omp", "mpu", "put", "uta", "tat", "ati", "tio", "ion"

and they have in common the substrings "tat" and "ati", so their inner product would be

$\kappa_3(s, t) = 2.$
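A direct implementation of the p-spectrum kernel reproduces this count; a sketch (names mine).

from collections import Counter

def spectrum_kernel(s, t, p):
    # phi_u(s) counts occurrences of each length-p substring u
    phi_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    phi_t = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(c * phi_t[u] for u, c in phi_s.items())

print(spectrum_kernel("statistics", "computation", 3))  # 2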
For the gap-weighted subsequence kernel with decay factor λ, the feature vectors of the words "cat", "car", "bat", "bar" are:

        cat    car    bat    bar
ca      λ^2    λ^2    0      0
ct      λ^3    0      0      0
at      λ^2    0      λ^2    0
ba      0      0      λ^2    λ^2
bt      0      0      λ^3    0
cr      0      λ^3    0      0
ar      0      λ^2    0      λ^2
br      0      0      0      λ^3
Tree kernels

We can consider a feature mapping for trees defined by

$\phi: T \mapsto (\phi_S(T))_{S \in I}$

where $I$ is the set of all subtrees and $\phi_S(T)$ counts the number of co-rooted subtrees of $T$ isomorphic to the tree $S$.

The computation can again be performed efficiently by working up from the leaves of the tree, integrating the results from the children at each internal node.

Similarly, we can compute the inner product in the feature space given by all subtrees of the given tree, not necessarily co-rooted.
Fisher kernels

Fisher kernels are an alternative way of defining kernels, based on probabilistic models.

We assume the model is parametrised according to some parameters: consider the simple example of a 1-dimensional Gaussian distribution parametrised by $\mu$ and $\sigma$:

$M = \left\{ P(x \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) : \theta = (\mu, \sigma) \in \mathbb{R}^2 \right\}.$

The Fisher score vector is the derivative of the log likelihood of an input $x$ with respect to the parameters:

$\log L_{(\mu,\sigma)}(x) = -\frac{(x - \mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2).$
Fisher kernels

Hence, at parameter setting $\theta_0 = (\mu_0, \sigma_0)$ the score vector is given by:

$g(\theta_0, x) = \left( \frac{x - \mu_0}{\sigma_0^2},\; \frac{(x - \mu_0)^2}{\sigma_0^3} - \frac{1}{\sigma_0} \right).$
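A sketch of this score map and the resulting (naive, unnormalised) Fisher kernel for the Gaussian model; names are mine.

import numpy as np

def fisher_score(x, mu0, sigma0):
    # gradient of the Gaussian log likelihood at theta_0 = (mu0, sigma0)
    d_mu = (x - mu0) / sigma0 ** 2
    d_sigma = (x - mu0) ** 2 / sigma0 ** 3 - 1 / sigma0
    return np.array([d_mu, d_sigma])

def fisher_kernel(x, z, mu0, sigma0):
    # inner product of score vectors (omits the Fisher information metric)
    return fisher_score(x, mu0, sigma0) @ fisher_score(z, mu0, sigma0)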
Fisher kernels

[Figure: two-dimensional embedding of points under the Fisher score map.]
Conclusions

Kernel methods provide a general purpose toolkit for pattern analysis:

kernels define a flexible interface to the data, enabling the user to encode prior knowledge into a measure of similarity between two items, with the proviso that it must satisfy the psd property;

composition and subspace methods provide tools to enhance the representation: normalisation, centering, kernel PCA, kernel Gram-Schmidt, kernel CCA, etc.;

algorithms well-founded in statistical learning theory enable efficient and effective exploitation of the high-dimensional representations to give good off-training-set performance.