www.cs.wisc.edu/~dpage/cs760/
Neural networks
a.k.a. artificial neural networks, connectionist models
inspired by interconnected neurons in biological systems
simple processing units
each unit receives a number of real-valued inputs
each unit produces a single real-valued output
Perceptrons
[McCulloch & Pitts, 1943; Rosenblatt, 1959; Widrow & Hoff, 1960]
[figure: a perceptron unit with inputs x1, ..., xn plus a constant input 1, weights w0, w1, ..., wn, and a single output o]

$$o = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$$
input units:
represent given x
output unit:
represents binary classification
Learning a perceptron:
the perceptron training rule
1. randomly initialize weights
2. iterate through training instances until convergence
compute the output
$$o = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}$$
and update each weight
$$\Delta w_i = \eta\,(y - o)\,x_i$$
where $\eta$ is the learning rate, set to a value $\ll 1$
$$w_i \leftarrow w_i + \Delta w_i$$
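A minimal sketch of this training rule in Python with NumPy (the data set, learning rate, and epoch count below are illustrative assumptions, not part of the slides):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (y - o) * x_i."""
    w = np.zeros(X.shape[1] + 1)          # w[0] is the bias weight w_0
    for _ in range(epochs):
        for x_d, y_d in zip(X, y):
            o = 1 if w[0] + np.dot(w[1:], x_d) > 0 else 0   # threshold output
            w[0] += eta * (y_d - o)       # the bias sees a constant input of 1
            w[1:] += eta * (y_d - o) * x_d
    return w

# example: learn the OR function (linearly separable, so the rule converges)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])
print(train_perceptron(X, y))
```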
the output is 1 if $w_0 + w_1 x_1 + w_2 x_2 > 0$, which we can also write as $\mathbf{w} \cdot \mathbf{x} > 0$ (treating the constant 1 as input $x_0$)

the decision boundary is the line
$$w_1 x_1 + w_2 x_2 = -w_0 \qquad\text{i.e.}\qquad x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{w_0}{w_2}$$

[figure: a line in the (x1, x2) plane separating the positive (+) instances from the negative ones]
a linearly separable function, $y = x_1 \wedge x_2$ (AND):

x1  x2 | y
 0   0 | 0
 0   1 | 0
 1   0 | 0
 1   1 | 1

[figure: the four instances plotted in the (x1, x2) plane; a single line separates the positive instance from the negatives]
OR, $y = x_1 \vee x_2$, is also linearly separable:

x1  x2 | y
 0   0 | 0
 0   1 | 1
 1   0 | 1
 1   1 | 1

[figure: the four instances (labeled a, b, c, d) plotted in the (x1, x2) plane, with a line separating the positives from the negative]
XOR, $y = x_1 \oplus x_2$, is not linearly separable:

x1  x2 | y
 0   0 | 0
 0   1 | 1
 1   0 | 1
 1   1 | 0
a multilayer perceptron
can represent XOR
[figure: a multilayer perceptron that computes XOR, shown with its weights and the resulting decision regions in the (x1, x2) plane]
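One concrete way to see this (the specific weights below are an illustrative assumption, not the ones from the figure): a hidden unit that computes OR, a hidden unit that computes NAND, and an output unit that ANDs them together yield XOR.

```python
def step(z):
    """Threshold activation: 1 if the net input is positive, else 0."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)     # hidden unit 1: x1 OR x2
    h2 = step(-1.0 * x1 - 1.0 * x2 + 1.5)    # hidden unit 2: NOT (x1 AND x2)
    return step(1.0 * h1 + 1.0 * h2 - 1.5)   # output unit: h1 AND h2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))           # prints the XOR truth table
```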
how do we determine the error signal for the hidden units?
given a training set $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, define the error as a function of the weight vector:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} \left( y^{(d)} - o^{(d)} \right)^2$$
This error measure defines a surface over the hypothesis (i.e. weight) space
[figure: the error E plotted as a surface over the weight space (w1, w2)]
gradient descent changes the weights by
$$\Delta \mathbf{w} = -\eta\, \nabla E(\mathbf{w}) \qquad\qquad \Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}$$

[figure: gradient descent steps moving downhill on the error surface over (w1, w2)]
the sigmoid function
$$f(x) = \frac{1}{1 + e^{-x}}$$

a sigmoid unit computes
$$o = f\!\left( w_0 + \sum_{i=1}^{n} w_i x_i \right) = \frac{1}{1 + e^{-\left( w_0 + \sum_{i=1}^{n} w_i x_i \right)}}$$
gradient descent (batch) training:

given: network structure and a training set $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
initialize all weights in $\mathbf{w}$ to small random numbers
until stopping criteria met do
    initialize the error $E(\mathbf{w}) = 0$
    for each $(x^{(d)}, y^{(d)})$ in the training set
        input $x^{(d)}$ to the network and compute output $o^{(d)}$
        increment the error: $E(\mathbf{w}) \leftarrow E(\mathbf{w}) + \frac{1}{2}\left( y^{(d)} - o^{(d)} \right)^2$
    calculate the gradient
    $$\nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0},\; \frac{\partial E}{\partial w_1},\; \ldots,\; \frac{\partial E}{\partial w_n} \right]$$
    apply the update $\Delta \mathbf{w} = -\eta\, \nabla E(\mathbf{w})$
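A compact sketch of this batch procedure for a single sigmoid unit with the squared error defined above (the OR data set, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_descent(X, y, eta=0.5, epochs=2000):
    """Batch training of one sigmoid unit with E(w) = 1/2 * sum_d (y - o)^2."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # constant input for w_0
    w = np.random.uniform(-0.1, 0.1, X1.shape[1])   # small random initial weights
    for _ in range(epochs):
        o = sigmoid(X1 @ w)                          # outputs for the whole training set
        grad = -((y - o) * o * (1 - o)) @ X1         # dE/dw_i summed over the batch
        w += -eta * grad                             # delta w = -eta * grad E(w)
    return w

# example: fit the OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
w = batch_gradient_descent(X, y)
print(np.round(sigmoid(np.hstack([np.ones((4, 1)), X]) @ w), 2))
```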
stochastic (online) gradient descent training:

given: network structure and a training set $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
initialize all weights in $\mathbf{w}$ to small random numbers
until stopping criteria met do
    for each $(x^{(d)}, y^{(d)})$ in the training set
        input $x^{(d)}$ to the network and compute output $o^{(d)}$
        calculate the error: $E(\mathbf{w}) = \frac{1}{2}\left( y^{(d)} - o^{(d)} \right)^2$
        calculate the gradient
        $$\nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0},\; \frac{\partial E}{\partial w_1},\; \ldots,\; \frac{\partial E}{\partial w_n} \right]$$
        apply the update $\Delta \mathbf{w} = -\eta\, \nabla E(\mathbf{w})$
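The online version of the sketch above updates the weights after every training instance instead of once per pass over the data (same assumptions; it reuses numpy and the sigmoid helper from the batch sketch):

```python
def online_gradient_descent(X, y, eta=0.5, epochs=2000):
    """Stochastic (online) training: one weight update per training instance."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.random.uniform(-0.1, 0.1, X1.shape[1])
    for _ in range(epochs):
        for x_d, y_d in zip(X1, y):
            o = sigmoid(x_d @ w)
            grad = -(y_d - o) * o * (1 - o) * x_d    # gradient for this instance only
            w += -eta * grad
    return w
```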
the chain rule: if $y = f(u)$ and $u = g(x)$, then
$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u}\,\frac{\partial u}{\partial x}$$
we'll make use of this as follows:
$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial o}\,\frac{\partial o}{\partial net}\,\frac{\partial net}{\partial w_i} \qquad \text{where } net = w_0 + \sum_{i=1}^{n} w_i x_i$$
consider a linear unit (no threshold), whose output on instance $d$ is
$$o^{(d)} = w_0 + \sum_{i=1}^{n} w_i x_i^{(d)}$$
with error
$$E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} \left( y^{(d)} - o^{(d)} \right)^2$$

[figure: a single unit with inputs x1, ..., xn and weights w0, w1, ..., wn]

batch case:
$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\, \frac{1}{2} \sum_{d \in D} \left( y^{(d)} - o^{(d)} \right)^2$$

online case:
$$\frac{\partial E^{(d)}}{\partial w_i} = \frac{\partial}{\partial w_i}\, \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2$$
$$\begin{aligned}
\frac{\partial E^{(d)}}{\partial w_i} &= \frac{\partial}{\partial w_i}\, \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2 \\
&= \left( y^{(d)} - o^{(d)} \right) \frac{\partial}{\partial w_i} \left( y^{(d)} - o^{(d)} \right) \\
&= \left( y^{(d)} - o^{(d)} \right) \left( -\frac{\partial o^{(d)}}{\partial w_i} \right) \\
&= -\left( y^{(d)} - o^{(d)} \right) \frac{\partial o^{(d)}}{\partial net^{(d)}}\, \frac{\partial net^{(d)}}{\partial w_i} \\
&= -\left( y^{(d)} - o^{(d)} \right) \frac{\partial net^{(d)}}{\partial w_i} \qquad \text{(for a linear unit, } o^{(d)} = net^{(d)}\text{)} \\
&= -\left( y^{(d)} - o^{(d)} \right) x_i^{(d)}
\end{aligned}$$
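A quick numerical sanity check of this result (a sketch; the instance, target, and weights below are arbitrary assumptions), comparing the derived gradient against a finite-difference estimate:

```python
import numpy as np

x = np.array([1.0, 0.5, -2.0])        # x[0] = 1 is the constant input for w_0
w = np.array([0.3, -0.7, 0.1])
y = 1.0

def error(w):
    o = np.dot(w, x)                   # linear unit: o = net
    return 0.5 * (y - o) ** 2

analytic = -(y - np.dot(w, x)) * x     # derived gradient: dE/dw_i = -(y - o) * x_i
eps = 1e-6
numeric = np.array([(error(w + eps * np.eye(3)[i]) - error(w - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])
print(analytic)
print(numeric)                         # the two estimates agree closely
```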
now consider a sigmoid unit:
$$net^{(d)} = w_0 + \sum_{i=1}^{n} w_i x_i^{(d)} \qquad\qquad o^{(d)} = \frac{1}{1 + e^{-net^{(d)}}}$$
useful property:
$$\frac{\partial o^{(d)}}{\partial net^{(d)}} = o^{(d)} \left( 1 - o^{(d)} \right)$$

[figure: a sigmoid unit with inputs x1, ..., xn and weights w0, w1, ..., wn]
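A short numerical check of that property (a sketch; the sample points are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # finite-difference derivative
analytic = sigmoid(z) * (1 - sigmoid(z))                      # o * (1 - o)
print(np.allclose(numeric, analytic, atol=1e-8))              # prints True
```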
$$\begin{aligned}
\frac{\partial E^{(d)}}{\partial w_i} &= \left( y^{(d)} - o^{(d)} \right) \frac{\partial}{\partial w_i} \left( y^{(d)} - o^{(d)} \right) \\
&= \left( y^{(d)} - o^{(d)} \right) \left( -\frac{\partial o^{(d)}}{\partial w_i} \right) \\
&= -\left( y^{(d)} - o^{(d)} \right) \frac{\partial o^{(d)}}{\partial net^{(d)}}\, \frac{\partial net^{(d)}}{\partial w_i} \\
&= -\left( y^{(d)} - o^{(d)} \right) o^{(d)} \left( 1 - o^{(d)} \right) \frac{\partial net^{(d)}}{\partial w_i} \\
&= -\left( y^{(d)} - o^{(d)} \right) o^{(d)} \left( 1 - o^{(d)} \right) x_i^{(d)}
\end{aligned}$$
Backpropagation
now we've covered how to do gradient descent for single-layer networks with
    linear output units
    sigmoid output units
how can we calculate $\frac{\partial E}{\partial w_i}$ for every weight in a multilayer network?
backpropagate errors from the output units to the hidden units
Backpropagation notation
let's consider the online case, but drop the (d) superscripts for simplicity
we'll use
    subscripts on y, o, net to indicate which unit they refer to
    two subscripts on a weight to indicate the units it goes from and to: $w_{ji}$ is the weight from unit $i$ into unit $j$, and $o_j$ is the output of unit $j$
Backpropagation
each weight is changed by
$$\Delta w_{ji} = -\eta\, \frac{\partial E}{\partial w_{ji}} = -\eta\, \frac{\partial E}{\partial net_j}\, \frac{\partial net_j}{\partial w_{ji}} = \eta\, \delta_j\, o_i$$
where
$$\delta_j = -\frac{\partial E}{\partial net_j}$$
and $o_i = x_i$ if $i$ is an input unit
Backpropagation
each weight is changed by $\Delta w_{ji} = \eta\, \delta_j\, o_i$ where $\delta_j = -\frac{\partial E}{\partial net_j}$

$$\delta_j = o_j (1 - o_j)(y_j - o_j) \qquad \text{if } j \text{ is an output unit}$$
(same as a single-layer net with sigmoid output)

$$\delta_j = o_j (1 - o_j) \sum_k \delta_k\, w_{kj} \qquad \text{if } j \text{ is a hidden unit}$$
(the sum of the backpropagated contributions to the error)
Backpropagation illustrated
1. calculate the error of the output units: $\delta_j = o_j (1 - o_j)(y_j - o_j)$
2. determine the updates for weights going into the output units: $\Delta w_{ji} = \eta\, \delta_j\, o_i$
3. calculate the error for the hidden units: $\delta_j = o_j (1 - o_j) \sum_k \delta_k\, w_{kj}$
4. determine the updates for weights going into the hidden units: $\Delta w_{ji} = \eta\, \delta_j\, o_i$
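Putting the four steps together, here is a minimal NumPy sketch of one backpropagation update for a network with one sigmoid hidden layer and a single sigmoid output unit (the layer sizes, learning rate, and data are illustrative assumptions; bias weights are handled via a constant input of 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_update(x, y, W_hidden, w_out, eta=0.1):
    """One online backprop step for a network with one sigmoid hidden layer."""
    x1 = np.append(1.0, x)                      # constant input for the hidden-unit biases
    h = sigmoid(W_hidden @ x1)                  # hidden-unit outputs
    h1 = np.append(1.0, h)                      # constant input for the output-unit bias
    o = sigmoid(np.dot(w_out, h1))              # network output

    # errors (deltas) are computed before any weights are changed
    delta_out = o * (1 - o) * (y - o)                     # error of the output unit
    delta_hid = h * (1 - h) * delta_out * w_out[1:]       # errors of the hidden units
    w_out = w_out + eta * delta_out * h1                  # updates for weights into the output unit
    W_hidden = W_hidden + eta * np.outer(delta_hid, x1)   # updates for weights into hidden units
    return W_hidden, w_out, o

# example usage: 2 inputs, 2 hidden units, small random initial weights
rng = np.random.default_rng(0)
W_hidden = rng.uniform(-0.1, 0.1, size=(2, 3))
w_out = rng.uniform(-0.1, 0.1, size=3)
W_hidden, w_out, o = backprop_update(np.array([0.0, 1.0]), 1.0, W_hidden, w_out)
print(o)
```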
Initializing weights
Weights should be initialized to
small values so that the sigmoid activations are in the range
where the derivative is large (learning will be quicker)
[figure: the error E plotted against a weight w_ij, illustrating the step size: too small (the error goes down only a little) vs. too large (the error goes up)]
Stopping criteria
conventional gradient descent: train until local minimum reached
empirically better approach: early stopping
use a validation set to monitor accuracy during training iterations
return the weights that result in minimum validation-set error
[figure: error as a function of training iterations; early stopping returns the weights at the minimum of the validation-set curve]
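A sketch of the early-stopping loop; train_epoch and validation_error are placeholders for whatever training and evaluation routines are in use:

```python
import copy

def train_with_early_stopping(model, train_epoch, validation_error, max_epochs=500):
    """Keep the weights that give the lowest validation-set error."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    for _ in range(max_epochs):
        train_epoch(model)                 # one pass of gradient-descent training
        err = validation_error(model)      # monitor error on a held-out validation set
        if err < best_error:
            best_error, best_model = err, copy.deepcopy(model)
    return best_model                      # the weights at the minimum validation error
```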
a nominal feature can be encoded with one input unit per value, e.g. for DNA bases:
$$A = \begin{bmatrix} 1\\0\\0\\0 \end{bmatrix} \qquad C = \begin{bmatrix} 0\\1\\0\\0 \end{bmatrix} \qquad G = \begin{bmatrix} 0\\0\\1\\0 \end{bmatrix} \qquad T = \begin{bmatrix} 0\\0\\0\\1 \end{bmatrix}$$
an ordinal feature can be encoded so that larger values turn on more units, e.g.
$$\text{medium} = \begin{bmatrix} 1\\1\\0 \end{bmatrix} \qquad \text{large} = \begin{bmatrix} 1\\1\\1 \end{bmatrix}$$
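A small sketch of these two encodings (the value orderings and helper names are assumptions for illustration):

```python
def one_hot(value, values=("A", "C", "G", "T")):
    """One unit per possible value; exactly one unit is on."""
    return [1 if v == value else 0 for v in values]

def thermometer(value, ordered=("small", "medium", "large")):
    """Larger ordinal values turn on more units."""
    k = ordered.index(value) + 1
    return [1] * k + [0] * (len(ordered) - k)

print(one_hot("C"))            # [0, 1, 0, 0]
print(thermometer("medium"))   # [1, 1, 0]
```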
for multi-class outputs, a softmax output layer can be used:
$$o_i = \frac{e^{net_i}}{\sum_{j \in \text{outputs}} e^{net_j}}$$
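A numerically stable NumPy sketch of this output layer (subtracting the maximum before exponentiating is a standard safeguard, not something from the slides):

```python
import numpy as np

def softmax(net):
    """o_i = exp(net_i) / sum_j exp(net_j), computed stably."""
    e = np.exp(net - np.max(net))     # shifting by the max avoids overflow
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # the outputs sum to 1
```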
Alternative approach to training deep networks
use unsupervised learning to find useful hidden-unit representations
Learning representations
the feature representation provided is often the most
significant factor in how well a learning system works
an appealing aspect of multilayer neural networks is
that they are able to change the feature representation
can think of the nodes in the hidden layer as new
features constructed from the original features in the
input layer
consider having more levels of constructed features,
e.g., pixels -> edges -> shapes -> faces or other objects
Competing intuitions
Only need a 2-layer network (input, hidden layer, output)
Representation Theorem (1989): Using sigmoid activation
functions (more recently generalized to others as well), can
represent any continuous function with a single hidden layer
Empirically, adding more hidden layers does not improve
accuracy, and it often degrades accuracy, when training by
standard backpropagation
[figure: test-set error vs. training epochs for a network with 15 hidden units (HUs)]
Backpropagation with
multiple hidden layers
in principle, backpropagation can be used to train arbitrarily deep
networks (i.e. with multiple hidden layers)
in practice, this doesn't usually work well
there are likely to be lots of local minima
diffusion of gradients leads to slow training in lower layers
gradients are smaller, less pronounced at deeper levels
errors in credit assignment propagate as you go back
Autoencoders
one approach: use autoencoders to learn hidden-unit representations
in an autoencoder, the network is trained to reconstruct the inputs
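A minimal NumPy sketch of an autoencoder with one sigmoid hidden layer, trained by gradient descent to reconstruct its inputs (the layer sizes, learning rate, and random data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 3                          # bottleneck: fewer hidden units than inputs
W1 = rng.uniform(-0.1, 0.1, (n_hidden, n_in)); b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.1, 0.1, (n_in, n_hidden)); b2 = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.random((50, n_in))                     # unlabeled data: each input is its own target
eta = 0.05
for _ in range(1000):
    for x in X:
        h = sigmoid(W1 @ x + b1)               # encoder: hidden-unit representation
        x_hat = W2 @ h + b2                    # decoder: reconstruction of the input
        d_out = -(x - x_hat)                   # gradient of 1/2 * ||x - x_hat||^2
        d_hid = (W2.T @ d_out) * h * (1 - h)
        W2 -= eta * np.outer(d_out, h); b2 -= eta * d_out
        W1 -= eta * np.outer(d_hid, x); b1 -= eta * d_hid

recon = sigmoid(X @ W1.T + b1) @ W2.T + b2
print(np.mean((X - recon) ** 2))               # mean squared reconstruction error
```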
Autoencoder variants
how to encourage the autoencoder to generalize
bottleneck: use fewer hidden units than inputs
sparsity: use a penalty function that encourages most
hidden unit activations to be near 0
[Goodfellow et al. 2009]
denoising: train to predict true input from corrupted input
[Vincent et al. 2008]
contractive: force encoder to have small derivatives (of
hidden unit output as input varies) [Rifai et al. 2011]
Stacking Autoencoders
can be stacked to form highly nonlinear representations
[Bengio et al. NIPS 2006]
[figure: first train an autoencoder to represent x; its hidden-layer representation then serves as the input for the next autoencoder in the stack]
Fine-Tuning
After the unsupervised training of the stack completes, run backpropagation on the entire
network to fine-tune weights for the supervised task
Because this backpropagation starts with good
structure and weights, its credit assignment is
better and so its final results are better than if we
just ran backpropagation initially
Dropout
during training: on each presentation of a training example, randomly drop out (temporarily remove) each hidden unit with some probability, so each pass trains a different subnetwork
Dropout at test time
the final model uses all nodes: use all units and weights in the network
multiply each weight from a node by the fraction of times that node was used (not dropped out) during training, i.e. adjust each weight according to the probability that its source unit was present during training
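A sketch of both phases for one layer of units (the keep probability, activations, and weight shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.5                                     # probability that a unit is kept (not dropped)

def layer_train(h):
    """Training: zero out each unit's activation with probability 1 - p_keep."""
    mask = rng.random(h.shape) < p_keep
    return h * mask

def layer_test(h, W):
    """Test time: use every unit, but scale the outgoing weights by p_keep."""
    return (p_keep * W) @ h

h = np.array([0.2, 0.9, 0.4, 0.7])               # hidden-unit activations
W = rng.uniform(-1, 1, (3, 4))                   # weights from this layer to the next
print(layer_train(h))                            # a random subset of activations survives
print(layer_test(h, W))                          # deterministic, weight-scaled output
```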