
1430/10/28

Ch 10: Widrow-Hoff Learning (LMS Algorithm)
In this chapter we apply the principles of performance
learning to a single-layer linear neural network.
Widrow-Hoff learning is an approximate steepest
descent algorithm, in which the performance index is
mean square error.

Bernard Widrow began working in neural networks in the late 1950s, at about the same time that Frank Rosenblatt developed the perceptron learning rule.
In 1960 Widrow and Hoff introduced the ADALINE (ADAptive LInear NEuron) network.
Its learning rule is called the LMS (Least Mean Square) algorithm.
The ADALINE is similar to the perceptron, except that its transfer function is linear instead of hard limiting.

IUT-Ahmadzadeh


Widrow, B., and Hoff, M. E., Jr., 1960, Adaptive switching circuits, in 1960 IRE WESCON Convention Record, Part 4, New York: IRE, pp. 96-104.
Widrow, B., and Lehr, M. A., 1990, 30 years of adaptive neural networks: Perceptron, madaline, and backpropagation, Proc. IEEE, 78:1415-1441.
Widrow, B., and Stearns, S. D., 1985, Adaptive Signal Processing, Englewood Cliffs, NJ: Prentice-Hall.

The ADALINE and the perceptron share the same limitation: they can only solve linearly separable problems.
The LMS algorithm minimizes mean square error, and therefore tries to move the decision boundaries as far from the training patterns as possible.
The LMS algorithm has found many more practical uses than the perceptron learning rule (for example, most long-distance phone lines use ADALINE networks for echo cancellation).

ADALINE Network

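$$\mathbf{a} = \mathrm{purelin}(\mathbf{W}\mathbf{p} + \mathbf{b}) = \mathbf{W}\mathbf{p} + \mathbf{b}$$

The vector ${}_i\mathbf{w}$ is made up of the elements of the $i$th row of $\mathbf{W}$:

$${}_i\mathbf{w} = \begin{bmatrix} w_{i,1} \\ w_{i,2} \\ \vdots \\ w_{i,R} \end{bmatrix}$$

$$a_i = \mathrm{purelin}(n_i) = \mathrm{purelin}({}_i\mathbf{w}^T\mathbf{p} + b_i) = {}_i\mathbf{w}^T\mathbf{p} + b_i$$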

Two-Input ADALINE

$$a = \mathrm{purelin}(n) = \mathrm{purelin}({}_1\mathbf{w}^T\mathbf{p} + b) = {}_1\mathbf{w}^T\mathbf{p} + b$$

$$a = {}_1\mathbf{w}^T\mathbf{p} + b = w_{1,1}p_1 + w_{1,2}p_2 + b$$

The ADALINE, like the perceptron, has a decision boundary, which is determined by the input vectors for which the net input $n$ is zero.


Mean Square Error


The LMS algorithm is an example of supervised training.
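Training set:

$$\{\mathbf{p}_1, t_1\}, \{\mathbf{p}_2, t_2\}, \dots, \{\mathbf{p}_Q, t_Q\}$$

Input: $\mathbf{p}_q$    Target: $t_q$

Notation:

$$\mathbf{x} = \begin{bmatrix} {}_1\mathbf{w} \\ b \end{bmatrix}, \qquad \mathbf{z} = \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix}, \qquad a = {}_1\mathbf{w}^T\mathbf{p} + b \;\Rightarrow\; a = \mathbf{x}^T\mathbf{z}$$

Mean square error:

$$F(\mathbf{x}) = E[e^2] = E[(t - a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$$

The expectation is taken over all sets of input/target pairs.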

Error Analysis
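$$F(\mathbf{x}) = E[e^2] = E[(t - a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$$

$$F(\mathbf{x}) = E[t^2 - 2t\,\mathbf{x}^T\mathbf{z} + \mathbf{x}^T\mathbf{z}\mathbf{z}^T\mathbf{x}]$$

$$F(\mathbf{x}) = E[t^2] - 2\mathbf{x}^T E[t\mathbf{z}] + \mathbf{x}^T E[\mathbf{z}\mathbf{z}^T]\,\mathbf{x}$$

This can be written in the following convenient form:

$$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$$

where

$$c = E[t^2], \qquad \mathbf{h} = E[t\mathbf{z}], \qquad \mathbf{R} = E[\mathbf{z}\mathbf{z}^T]$$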
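The quadratic form can be checked numerically. The sketch below is only an illustration: the three training pairs are hypothetical, each pair is assumed equally likely so that expectations become plain averages, and the helper names (`dot`, `mean`, `F_direct`, `F_quadratic`) are not from the slides.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def mean(values):
    """Average of an iterable; stands in for the expectation E[.]."""
    values = list(values)
    return sum(values) / len(values)

# Hypothetical training set (p_q, t_q); every pair is assumed equally likely.
pairs = [([1.0, 2.0], 1.0), ([-1.0, 0.5], -1.0), ([0.5, -1.5], 0.5)]
zs = [p + [1.0] for p, _ in pairs]   # z = [p; 1] appends the bias input
ts = [t for _, t in pairs]
n = len(zs[0])

c = mean(t * t for t in ts)                                       # c = E[t^2]
h = [mean(t * z[i] for t, z in zip(ts, zs)) for i in range(n)]    # h = E[t z]
R = [[mean(z[i] * z[j] for z in zs) for j in range(n)]            # R = E[z z^T]
     for i in range(n)]

def F_direct(x):
    """Mean square error computed straight from the definition."""
    return mean((t - dot(x, z)) ** 2 for t, z in zip(ts, zs))

def F_quadratic(x):
    """Mean square error from the form c - 2 x^T h + x^T R x."""
    Rx = [dot(row, x) for row in R]
    return c - 2 * dot(x, h) + dot(x, Rx)

x = [0.3, -0.2, 0.1]   # arbitrary weights [w_11, w_12, b]
print(F_direct(x), F_quadratic(x))   # the two values should agree
```

Both functions evaluate the same performance index, so any difference between them is floating-point rounding only.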


The vector $\mathbf{h}$ gives the cross-correlation between the input vector and its associated target.
$\mathbf{R}$ is the input correlation matrix. The diagonal elements of this matrix are equal to the mean square values of the elements of the input vectors.
The mean square error for the ADALINE network is a quadratic function:

$$F(\mathbf{x}) = c + \mathbf{d}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}, \qquad \mathbf{d} = -2\mathbf{h}, \qquad \mathbf{A} = 2\mathbf{R}$$

Stationary Point
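Hessian matrix:

$$\mathbf{A} = 2\mathbf{R}$$

The correlation matrix $\mathbf{R}$ must be at least positive semidefinite. In fact, it can be shown that all correlation matrices are either positive definite or positive semidefinite. If there are any zero eigenvalues, the performance index will either have a weak minimum or else no stationary point (depending on $\mathbf{d} = -2\mathbf{h}$); otherwise there will be a unique global minimum $\mathbf{x}^*$ (see Ch 8).

$$\nabla F(\mathbf{x}) = \nabla\left(c + \mathbf{d}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}\right) = \mathbf{d} + \mathbf{A}\mathbf{x} = -2\mathbf{h} + 2\mathbf{R}\mathbf{x}$$

Stationary point:

$$-2\mathbf{h} + 2\mathbf{R}\mathbf{x} = \mathbf{0}$$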


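If $\mathbf{R}$ (the correlation matrix) is positive definite:

$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$$

If we could calculate the statistical quantities $\mathbf{h}$ and $\mathbf{R}$, we could find the minimum point directly from the above equation.
But it is usually not desirable or convenient to calculate $\mathbf{h}$ and $\mathbf{R}$, so we turn instead to an approximate steepest descent algorithm.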


Approximate Steepest Descent


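Approximate mean square error (one sample):

$$\hat{F}(\mathbf{x}) = (t(k) - a(k))^2 = e^2(k)$$

The expectation of the squared error has been replaced by the squared error at iteration $k$.
Approximate (stochastic) gradient:

$$\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k)$$

$$[\nabla e^2(k)]_j = \frac{\partial e^2(k)}{\partial w_{1,j}} = 2e(k)\frac{\partial e(k)}{\partial w_{1,j}}, \qquad j = 1, 2, \dots, R$$

$$[\nabla e^2(k)]_{R+1} = \frac{\partial e^2(k)}{\partial b} = 2e(k)\frac{\partial e(k)}{\partial b}$$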

Approximate Gradient Calculation


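$$\frac{\partial e(k)}{\partial w_{1,j}} = \frac{\partial [t(k) - a(k)]}{\partial w_{1,j}} = \frac{\partial}{\partial w_{1,j}}\left[t(k) - ({}_1\mathbf{w}^T\mathbf{p}(k) + b)\right]$$

$$\frac{\partial e(k)}{\partial w_{1,j}} = \frac{\partial}{\partial w_{1,j}}\left[t(k) - \left(\sum_{i=1}^{R} w_{1,i}\,p_i(k) + b\right)\right]$$

where $p_i(k)$ is the $i$th element of the input vector at the $k$th iteration.

$$\frac{\partial e(k)}{\partial w_{1,j}} = -p_j(k), \qquad \frac{\partial e(k)}{\partial b} = -1$$

$$\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k) = -2e(k)\,\mathbf{z}(k)$$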
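As a sanity check on the single-sample gradient, the sketch below compares the analytic expression $-2e(k)\mathbf{z}(k)$ with a central finite difference of $e^2(k)$. The weights, input and target are hypothetical values chosen only for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def squared_error(x, z, t):
    """Single-sample performance estimate e^2 = (t - x^T z)^2."""
    return (t - dot(x, z)) ** 2

# Hypothetical values: x = [w_11, w_12, b], z = [p_1, p_2, 1].
x = [0.5, -0.3, 0.1]
z = [1.0, 2.0, 1.0]
t = 0.7

e = t - dot(x, z)
analytic = [-2 * e * zi for zi in z]      # gradient estimate -2 e(k) z(k)

eps = 1e-6
numeric = []
for j in range(len(x)):
    x_plus = x[:]
    x_minus = x[:]
    x_plus[j] += eps
    x_minus[j] -= eps
    # Central difference approximation of d(e^2)/dx_j.
    numeric.append((squared_error(x_plus, z, t)
                    - squared_error(x_minus, z, t)) / (2 * eps))

print(analytic)
print(numeric)
```

The last component corresponds to the bias, whose input is the constant 1, which is why its derivative reduces to $-2e(k)$.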


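Now we can see the beauty of approximating the mean square error by the single error at iteration $k$, as in:

$$\hat{F}(\mathbf{x}) = (t(k) - a(k))^2 = e^2(k)$$

The resulting approximation to the gradient $\nabla F(\mathbf{x})$ can now be used in the steepest descent algorithm.

LMS Algorithm

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\,\nabla F(\mathbf{x})\big|_{\mathbf{x} = \mathbf{x}_k}$$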

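If we substitute $\hat{\nabla}F(\mathbf{x})$ for $\nabla F(\mathbf{x})$:

$$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha\,e(k)\,\mathbf{z}(k)$$

$${}_1\mathbf{w}(k+1) = {}_1\mathbf{w}(k) + 2\alpha\,e(k)\,\mathbf{p}(k)$$

$$b(k+1) = b(k) + 2\alpha\,e(k)$$

These last two equations make up the LMS algorithm, also called the delta rule or the Widrow-Hoff learning algorithm.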

Multiple-Neuron Case
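$${}_i\mathbf{w}(k+1) = {}_i\mathbf{w}(k) + 2\alpha\,e_i(k)\,\mathbf{p}(k)$$

$$b_i(k+1) = b_i(k) + 2\alpha\,e_i(k)$$

Matrix form:

$$\mathbf{W}(k+1) = \mathbf{W}(k) + 2\alpha\,\mathbf{e}(k)\,\mathbf{p}^T(k)$$

$$\mathbf{b}(k+1) = \mathbf{b}(k) + 2\alpha\,\mathbf{e}(k)$$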
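The matrix-form update can be sketched in a few lines of plain Python. The function name `lms_step`, the list-of-rows layout for $\mathbf{W}$, and the sample values are assumptions made for this illustration.

```python
def lms_step(W, b, p, t, alpha):
    """One LMS update for an S-neuron ADALINE layer.

    W: S x R weight matrix as a list of rows, b: list of S biases,
    p: input of length R, t: list of S targets, alpha: learning rate.
    Returns the updated (W, b) and the error vector e = t - a.
    """
    # Linear (purelin) output: a = W p + b
    a = [sum(w * x for w, x in zip(row, p)) + bi for row, bi in zip(W, b)]
    e = [ti - ai for ti, ai in zip(t, a)]
    # W(k+1) = W(k) + 2 alpha e(k) p^T(k), applied row by row
    W_new = [[w + 2 * alpha * ei * x for w, x in zip(row, p)]
             for row, ei in zip(W, e)]
    # b(k+1) = b(k) + 2 alpha e(k)
    b_new = [bi + 2 * alpha * ei for bi, ei in zip(b, e)]
    return W_new, b_new, e

# Hypothetical two-neuron, two-input layer starting from zero weights.
W, b = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
W, b, e = lms_step(W, b, p=[1.0, -1.0], t=[1.0, -1.0], alpha=0.1)
print(W, b, e)
```

Each row of `W` is updated with its own error `e[i]` but the same input `p`, which is exactly the outer product $\mathbf{e}(k)\mathbf{p}^T(k)$.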

Analysis of Convergence
Note that $\mathbf{x}_k$ is a function only of $\mathbf{z}(k-1), \mathbf{z}(k-2), \dots, \mathbf{z}(0)$. If we assume that successive input vectors are statistically independent, then $\mathbf{x}_k$ is independent of $\mathbf{z}(k)$.
We will show that, for stationary input processes meeting this condition, the expected value of the weight vector converges to:

$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$$

This is the minimum mean square error $\{E[e_k^2]\}$ solution, as we saw before.

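Recall the LMS algorithm:

$$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha\,e(k)\,\mathbf{z}(k)$$

$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\,E[e(k)\,\mathbf{z}(k)]$$

Substitute $t(k) - \mathbf{x}_k^T\mathbf{z}(k)$ for the error:

$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{E[t(k)\,\mathbf{z}(k)] - E[(\mathbf{x}_k^T\mathbf{z}(k))\,\mathbf{z}(k)]\right\}$$

Since $\mathbf{x}_k^T\mathbf{z}(k) = \mathbf{z}^T(k)\,\mathbf{x}_k$:

$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{E[t(k)\,\mathbf{z}(k)] - E[\mathbf{z}(k)\,\mathbf{z}^T(k)\,\mathbf{x}_k]\right\}$$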

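Since $\mathbf{x}_k$ is independent of $\mathbf{z}(k)$:

$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{\mathbf{h} - \mathbf{R}\,E[\mathbf{x}_k]\right\}$$

$$E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\mathbf{h}$$

For stability, the eigenvalues of the matrix $[\mathbf{I} - 2\alpha\mathbf{R}]$ must fall inside the unit circle.

Conditions for Stability

$$|\mathrm{eig}([\mathbf{I} - 2\alpha\mathbf{R}])| = |1 - 2\alpha\lambda_i| < 1$$

(where $\lambda_i$ is an eigenvalue of $\mathbf{R}$)

Since $\lambda_i > 0$, we always have $1 - 2\alpha\lambda_i < 1$.

Therefore the stability condition simplifies to

$$1 - 2\alpha\lambda_i > -1 \;\Rightarrow\; \alpha < \frac{1}{\lambda_i} \quad \text{for all } i$$

$$0 < \alpha < \frac{1}{\lambda_{max}}$$

Note: this is the same condition as for the steepest descent (SD) algorithm. In SD we use the Hessian matrix $\mathbf{A}$; here we use the input correlation matrix $\mathbf{R}$ (recall that $\mathbf{A} = 2\mathbf{R}$).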


Steady State Response


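$$E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\mathbf{h}$$

If the system is stable, then a steady state condition will be reached:

$$E[\mathbf{x}_{ss}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_{ss}] + 2\alpha\mathbf{h}$$

The solution to this equation is

$$E[\mathbf{x}_{ss}] = \mathbf{R}^{-1}\mathbf{h} = \mathbf{x}^*$$

This is also the strong minimum of the performance index.
Thus the LMS solution, obtained by applying one input at a time, is the same as the minimum mean square solution $\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$.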
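The recursion for the expected weights can be iterated directly to see both the stability bound and the steady state. This is a sketch under stated assumptions: the 2x2 matrix `R` and vector `h` are hypothetical, and the closed-form eigenvalue and inverse formulas used here apply only to symmetric 2x2 matrices.

```python
import math

def mean_recursion(R, h, alpha, steps):
    """Iterate E[x_{k+1}] = (I - 2 alpha R) E[x_k] + 2 alpha h from x = 0."""
    x = [0.0, 0.0]
    for _ in range(steps):
        Rx = [R[0][0] * x[0] + R[0][1] * x[1],
              R[1][0] * x[0] + R[1][1] * x[1]]
        x = [x[i] - 2 * alpha * Rx[i] + 2 * alpha * h[i] for i in range(2)]
    return x

# Hypothetical positive definite correlation matrix and cross-correlation.
R = [[2.0, 0.5], [0.5, 1.0]]
h = [1.0, -1.0]

# Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
center = (R[0][0] + R[1][1]) / 2
radius = math.hypot((R[0][0] - R[1][1]) / 2, R[0][1])
lam_max = center + radius
alpha_bound = 1 / lam_max            # stability requires 0 < alpha < 1/lam_max

# x* = R^{-1} h via the 2x2 inverse formula.
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
x_star = [(R[1][1] * h[0] - R[0][1] * h[1]) / det,
          (R[0][0] * h[1] - R[1][0] * h[0]) / det]

stable = mean_recursion(R, h, 0.5 * alpha_bound, 500)     # inside the bound
unstable = mean_recursion(R, h, 1.5 * alpha_bound, 500)   # outside the bound

print(alpha_bound, x_star)
print(stable)      # settles at x*
print(unstable)    # grows without bound
```

With a learning rate inside the bound the iterate settles at $\mathbf{R}^{-1}\mathbf{h}$; outside it, the component along the eigenvector of $\lambda_{max}$ is multiplied by a factor of magnitude greater than one at every step.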


Example
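Banana:

$$\mathbf{p}_1 = \begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix}, \qquad t_1 = -1$$

Apple:

$$\mathbf{p}_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}, \qquad t_2 = 1$$

If inputs are generated randomly with equal probability, the input correlation matrix is:

$$\mathbf{R} = E[\mathbf{p}\mathbf{p}^T] = \frac{1}{2}\mathbf{p}_1\mathbf{p}_1^T + \frac{1}{2}\mathbf{p}_2\mathbf{p}_2^T$$

$$\mathbf{R} = \frac{1}{2}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix}\begin{bmatrix} -1 & 1 & -1 \end{bmatrix} + \frac{1}{2}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}\begin{bmatrix} 1 & 1 & -1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & -1 \\ 0 & -1 & 1 \end{bmatrix}$$

$$\lambda_1 = 1.0, \qquad \lambda_2 = 0.0, \qquad \lambda_3 = 2.0, \qquad \alpha < \frac{1}{\lambda_{max}} = \frac{1}{2.0} = 0.5$$

We take $\alpha = 0.2$. (Note: in practice it is difficult to calculate $\mathbf{R}$ and the resulting bound on $\alpha$, so $\alpha$ is chosen by trial and error.)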
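The correlation matrix and the learning-rate bound for this example can be reproduced with a short script. The power-iteration helper `largest_eigenvalue` is an illustrative choice for estimating $\lambda_{max}$ without a linear algebra library; it is not part of the slides.

```python
import math

# Prototype inputs from the example: p1 (banana) and p2 (apple).
patterns = [[-1.0, 1.0, -1.0], [1.0, 1.0, -1.0]]
n = 3

# R = 1/2 p1 p1^T + 1/2 p2 p2^T (both inputs equally likely).
R = [[sum(0.5 * p[i] * p[j] for p in patterns) for j in range(n)]
     for i in range(n)]

def largest_eigenvalue(M, iterations=200):
    """Estimate the largest eigenvalue of a symmetric PSD matrix
    by power iteration followed by a Rayleigh quotient."""
    v = [float(i + 1) for i in range(len(M))]   # generic start vector
    for _ in range(iterations):
        w = [sum(mij * vj for mij, vj in zip(row, v)) for row in M]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    Mv = [sum(mij * vj for mij, vj in zip(row, v)) for row in M]
    return sum(vi * mvi for vi, mvi in zip(v, Mv))

lam_max = largest_eigenvalue(R)
alpha_bound = 1 / lam_max

print(R)
print(lam_max, alpha_bound)
```

The chosen $\alpha = 0.2$ sits comfortably below the bound of 0.5.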


Iteration One
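Banana:

$$a(0) = \mathbf{W}(0)\,\mathbf{p}(0) = \mathbf{W}(0)\,\mathbf{p}_1 = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = 0$$

($\mathbf{W}(0)$ is selected arbitrarily.)

$$e(0) = t(0) - a(0) = t_1 - a(0) = -1 - 0 = -1$$

$$\mathbf{W}(1) = \mathbf{W}(0) + 2\alpha\,e(0)\,\mathbf{p}^T(0)$$

$$\mathbf{W}(1) = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix} + 2(0.2)(-1)\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix}^T = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix}$$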

Iteration Two
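Apple:

$$a(1) = \mathbf{W}(1)\,\mathbf{p}(1) = \mathbf{W}(1)\,\mathbf{p}_2 = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} = -0.4$$

$$e(1) = t(1) - a(1) = t_2 - a(1) = 1 - (-0.4) = 1.4$$

$$\mathbf{W}(2) = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix} + 2(0.2)(1.4)\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}^T = \begin{bmatrix} 0.96 & 0.16 & -0.16 \end{bmatrix}$$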


Iteration Three
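$$a(2) = \mathbf{W}(2)\,\mathbf{p}(2) = \mathbf{W}(2)\,\mathbf{p}_1 = \begin{bmatrix} 0.96 & 0.16 & -0.16 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = -0.64$$

$$e(2) = t(2) - a(2) = t_1 - a(2) = -1 - (-0.64) = -0.36$$

$$\mathbf{W}(3) = \mathbf{W}(2) + 2\alpha\,e(2)\,\mathbf{p}^T(2) = \begin{bmatrix} 1.1040 & 0.0160 & -0.0160 \end{bmatrix}$$

$$\mathbf{W}(\infty) = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$$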
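The three hand iterations can be replayed in code. The sketch below assumes, as in the worked example, zero initial weights, no bias, $\alpha = 0.2$, and inputs presented alternately starting with the banana; the variable names are illustrative.

```python
alpha = 0.2
banana = ([-1.0, 1.0, -1.0], -1.0)   # (p1, t1)
apple = ([1.0, 1.0, -1.0], 1.0)      # (p2, t2)

W = [0.0, 0.0, 0.0]                  # W(0), a single row with no bias
trajectory = []
for k in range(200):
    p, t = banana if k % 2 == 0 else apple
    a = sum(w * x for w, x in zip(W, p))              # a(k) = W(k) p(k)
    e = t - a                                         # e(k) = t(k) - a(k)
    W = [w + 2 * alpha * e * x for w, x in zip(W, p)] # W(k+1)
    trajectory.append(W)

for k in range(3):
    print("W(%d) =" % (k + 1), [round(w, 4) for w in trajectory[k]])
print("W(200) =", [round(w, 4) for w in trajectory[-1]])
```

Continuing past the third iteration shows the weights approaching the limit stated for $\mathbf{W}(\infty)$.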

Some general comments on the learning process:
Computationally, the learning process goes through all training examples (an epoch) a number of times, until a stopping criterion is reached.
The convergence process can be monitored with a plot of the mean-squared error function F(W(k)).


The popular stopping criteria are:
The mean-squared error is sufficiently small: F(W(k)) < ε, for a small chosen threshold ε.
The rate of change of the mean-squared error is sufficiently small.
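Both criteria can be sketched as an epoch loop around the LMS update. Everything named here is an assumption for illustration: the function `train`, the thresholds `mse_tol` and `rate_tol`, the cap `max_epochs`, and the choice to measure the rate of change as the absolute difference between the mean-squared errors of consecutive epochs. The data are the banana/apple prototypes used in the worked example.

```python
def train(patterns, alpha, mse_tol=1e-8, rate_tol=1e-12, max_epochs=1000):
    """Run LMS epochs until one of two stopping criteria is met.

    patterns: list of (input list, scalar target) pairs, no bias term.
    Returns the final weights and the per-epoch mean-squared error.
    """
    w = [0.0] * len(patterns[0][0])
    history = []
    for _ in range(max_epochs):
        for p, t in patterns:                       # one pass = one epoch
            e = t - sum(wi * pi for wi, pi in zip(w, p))
            w = [wi + 2 * alpha * e * pi for wi, pi in zip(w, p)]
        mse = sum((t - sum(wi * pi for wi, pi in zip(w, p))) ** 2
                  for p, t in patterns) / len(patterns)
        history.append(mse)
        if mse < mse_tol:                           # criterion 1: small error
            break
        if len(history) > 1 and abs(history[-2] - mse) < rate_tol:
            break                                   # criterion 2: error has stalled
    return w, history

patterns = [([-1.0, 1.0, -1.0], -1.0), ([1.0, 1.0, -1.0], 1.0)]
w, history = train(patterns, alpha=0.2)
print(len(history), history[-1], w)
```

Plotting `history` against the epoch index gives the convergence curve mentioned above.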


Adaptive Filtering
ADALINE is one of the most widely used NNs in practical
applications. One of the major application areas has been
Adaptive Filtering.
[Figures: Adaptive Filter; Tapped Delay Line]


$$a(k) = \mathrm{purelin}(\mathbf{W}\mathbf{p} + b) = \sum_{i=1}^{R} w_{1,i}\,y(k - i + 1) + b$$

In digital signal processing (DSP) language, we recognize this network as a finite impulse response (FIR) filter.
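A tapped delay line feeding a single linear neuron can be sketched as follows. The function name `adaline_fir` and the zero initial contents of the delay line are assumptions made for this illustration.

```python
def adaline_fir(y, weights, bias=0.0):
    """Filter the sequence y with an ADALINE fed by a tapped delay line.

    weights[i-1] multiplies y(k - i + 1), so the output is
    a(k) = sum_i w_{1,i} y(k - i + 1) + b.
    The delay line is assumed to start out filled with zeros.
    """
    delay_line = [0.0] * len(weights)
    outputs = []
    for sample in y:
        # Shift the newest sample in; the oldest one drops off the end.
        delay_line = [sample] + delay_line[:-1]
        outputs.append(sum(w * d for w, d in zip(weights, delay_line)) + bias)
    return outputs

# Feeding a unit impulse returns the weights themselves,
# which is the defining property of an FIR filter.
impulse_response = adaline_fir([1.0, 0.0, 0.0, 0.0], [0.5, 0.3, 0.2])
print(impulse_response)
```

Because the response to an impulse dies out after as many samples as there are weights, the filter's impulse response is finite.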


Example: Noise Cancellation


Noise Cancellation Adaptive Filter


A two-input filter can attenuate and phase-shift the noise in the desired way.


Correlation Matrix
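To analyze this system we need to find the input correlation matrix $\mathbf{R}$ and the input/target cross-correlation vector $\mathbf{h}$:

$$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T], \qquad \mathbf{h} = E[t\mathbf{z}]$$

$$\mathbf{z}(k) = \begin{bmatrix} v(k) \\ v(k-1) \end{bmatrix}, \qquad t(k) = s(k) + m(k)$$

$$\mathbf{R} = \begin{bmatrix} E[v^2(k)] & E[v(k)\,v(k-1)] \\ E[v(k-1)\,v(k)] & E[v^2(k-1)] \end{bmatrix}$$

$$\mathbf{h} = \begin{bmatrix} E[(s(k) + m(k))\,v(k)] \\ E[(s(k) + m(k))\,v(k-1)] \end{bmatrix}$$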


We must define the noise signal $v$, the EEG signal $s$, and the filtered noise $m$ to be able to obtain specific values.
We assume that the EEG signal is a white (uncorrelated from one time step to the next) random signal uniformly distributed between the values -0.2 and +0.2, and that the noise source (a 60 Hz sine wave sampled at 180 Hz) is given by

$$v(k) = 1.2\sin\left(\frac{2\pi k}{3}\right)$$

The filtered noise that contaminates the EEG is the noise attenuated by a factor of 1.0 and shifted in phase by $-3\pi/4$:

$$m(k) = 1.2\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)$$

$$E[v^2(k)] = (1.2)^2\,\frac{1}{3}\sum_{k=1}^{3}\sin^2\left(\frac{2\pi k}{3}\right) = (1.2)^2(0.5) = 0.72$$

$$E[v^2(k-1)] = E[v^2(k)] = 0.72$$

$$E[v(k)\,v(k-1)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\frac{2\pi k}{3}\right)\left(1.2\sin\frac{2\pi (k-1)}{3}\right) = (1.2)^2(0.5)\cos\left(\frac{2\pi}{3}\right) = -0.36$$

$$\mathbf{R} = \begin{bmatrix} 0.72 & -0.36 \\ -0.36 & 0.72 \end{bmatrix}$$
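The entries of $\mathbf{R}$ can be reproduced by averaging over one period of the sampled sine wave, which spans three samples at 180 Hz. The helper names `v` and `avg` below are illustrative.

```python
import math

def v(k):
    """Noise source: 60 Hz sine sampled at 180 Hz, amplitude 1.2."""
    return 1.2 * math.sin(2 * math.pi * k / 3)

def avg(f):
    """Average f(k) over one period, k = 1, 2, 3."""
    return sum(f(k) for k in (1, 2, 3)) / 3

# R = E[z z^T] with z(k) = [v(k), v(k-1)]
R = [[avg(lambda k: v(k) * v(k)),     avg(lambda k: v(k) * v(k - 1))],
     [avg(lambda k: v(k - 1) * v(k)), avg(lambda k: v(k - 1) * v(k - 1))]]

for row in R:
    print([round(r, 4) for r in row])
```

Since the signal is periodic, averaging over a single period gives the same result as averaging over any whole number of periods.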


Stationary Point
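$$E[(s(k) + m(k))\,v(k)] = E[s(k)\,v(k)] + E[m(k)\,v(k)]$$

The first term is zero because $s(k)$ and $v(k)$ are independent and zero mean.

$$E[m(k)\,v(k)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)\left(1.2\sin\frac{2\pi k}{3}\right) = -0.51$$

Now we find the second element of $\mathbf{h}$:

$$E[(s(k) + m(k))\,v(k-1)] = E[s(k)\,v(k-1)] + E[m(k)\,v(k-1)]$$

Again the first term is zero.

$$E[m(k)\,v(k-1)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)\left(1.2\sin\frac{2\pi (k-1)}{3}\right) = 0.70$$

$$\mathbf{h} = \begin{bmatrix} E[(s(k) + m(k))\,v(k)] \\ E[(s(k) + m(k))\,v(k-1)] \end{bmatrix} = \begin{bmatrix} -0.51 \\ 0.70 \end{bmatrix}$$

$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h} = \begin{bmatrix} 0.72 & -0.36 \\ -0.36 & 0.72 \end{bmatrix}^{-1}\begin{bmatrix} -0.51 \\ 0.70 \end{bmatrix} = \begin{bmatrix} -0.30 \\ 0.82 \end{bmatrix}$$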
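The cross-correlation vector and the minimum point can be checked the same way. This sketch assumes, as the text argues, that the terms involving $s(k)$ average to zero, so only $m(k)$ contributes to $\mathbf{h}$; the helper names are illustrative, and the 2x2 inverse is written out by hand.

```python
import math

def v(k):
    """Noise source v(k) = 1.2 sin(2 pi k / 3)."""
    return 1.2 * math.sin(2 * math.pi * k / 3)

def m(k):
    """Filtered noise m(k) = 1.2 sin(2 pi k / 3 - 3 pi / 4)."""
    return 1.2 * math.sin(2 * math.pi * k / 3 - 3 * math.pi / 4)

def avg(f):
    """Average f(k) over one period, k = 1, 2, 3."""
    return sum(f(k) for k in (1, 2, 3)) / 3

R = [[0.72, -0.36], [-0.36, 0.72]]

# h = E[t z]; the s(k) terms are taken as zero (independent, zero mean).
h = [avg(lambda k: m(k) * v(k)), avg(lambda k: m(k) * v(k - 1))]

# x* = R^{-1} h using the explicit 2x2 inverse.
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
x_star = [(R[1][1] * h[0] - R[0][1] * h[1]) / det,
          (R[0][0] * h[1] - R[1][0] * h[0]) / det]

print([round(hi, 2) for hi in h])
print([round(xi, 2) for xi in x_star])
```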

Now, what kind of error will we have at the minimum solution?

Performance Index
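$$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$$

We have just found $\mathbf{x}^*$, $\mathbf{R}$ and $\mathbf{h}$. To find $c$ we have

$$c = E[t^2(k)] = E[(s(k) + m(k))^2]$$

$$c = E[s^2(k)] + 2E[s(k)\,m(k)] + E[m^2(k)]$$

The middle term is zero because $s(k)$ and $m(k)$ (a filtered version of $v(k)$) are independent and zero mean.

$$E[s^2(k)] = \frac{1}{0.4}\int_{-0.2}^{0.2} s^2\,ds = \frac{1}{3(0.4)}\,s^3\Big|_{-0.2}^{0.2} = 0.0133$$

$$E[m^2(k)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)^2 = 0.72$$

$$c = 0.0133 + 0.72 = 0.7333$$

$$F(\mathbf{x}^*) = 0.7333 - 2(0.72) + 0.72 = 0.0133$$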


The minimum mean square error is the same as the
mean square value of the EEG signal. This is what
we expected, since the error of this adaptive noise
canceller is in fact the reconstructed EEG Signal.
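The minimum value of the performance index can be verified numerically. The sketch below recomputes $\mathbf{R}$, $\mathbf{h}$ and $\mathbf{x}^*$ from the signal definitions and uses the variance of a uniform distribution on $[-0.2, 0.2]$, $(0.4)^2/12$, which equals the integral for $E[s^2(k)]$; all helper names are illustrative.

```python
import math

def v(k):
    return 1.2 * math.sin(2 * math.pi * k / 3)

def m(k):
    return 1.2 * math.sin(2 * math.pi * k / 3 - 3 * math.pi / 4)

def avg(f):
    """Average f(k) over one period, k = 1, 2, 3."""
    return sum(f(k) for k in (1, 2, 3)) / 3

R = [[avg(lambda k: v(k) ** 2),       avg(lambda k: v(k) * v(k - 1))],
     [avg(lambda k: v(k - 1) * v(k)), avg(lambda k: v(k - 1) ** 2)]]
h = [avg(lambda k: m(k) * v(k)), avg(lambda k: m(k) * v(k - 1))]

det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
x_star = [(R[1][1] * h[0] - R[0][1] * h[1]) / det,
          (R[0][0] * h[1] - R[1][0] * h[0]) / det]

E_s2 = 0.4 ** 2 / 12              # mean square of uniform noise on [-0.2, 0.2]
E_m2 = avg(lambda k: m(k) ** 2)   # mean square of the filtered noise
c = E_s2 + E_m2                   # cross term E[s m] taken as zero

Rx = [R[0][0] * x_star[0] + R[0][1] * x_star[1],
      R[1][0] * x_star[0] + R[1][1] * x_star[1]]
F_min = (c - 2 * (x_star[0] * h[0] + x_star[1] * h[1])
         + (x_star[0] * Rx[0] + x_star[1] * Rx[1]))

print(round(c, 4), round(F_min, 4))
```

The residual error equals the EEG power because the two-tap filter can reproduce $m(k)$ exactly, leaving only $s(k)$ in the error.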


[Figure: LMS response for α = 0.1, shown as a trajectory in the weight plane with axes w1,1 and w1,2]

The LMS trajectory looks like a noisy version of steepest descent.
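A trajectory of this kind can be simulated with the signal definitions of the noise cancellation example. This is a sketch under stated assumptions: the initial weights are taken as zero (the starting point of the plotted run is not given), the random seed and the run length are arbitrary, and there is no bias term.

```python
import math
import random

random.seed(0)        # fixed seed so the run is repeatable
alpha = 0.1

def v(k):
    return 1.2 * math.sin(2 * math.pi * k / 3)

def m(k):
    return 1.2 * math.sin(2 * math.pi * k / 3 - 3 * math.pi / 4)

w = [0.0, 0.0]        # assumed starting point [w_11, w_12]
weights, sq_errors = [], []
for k in range(1, 6001):
    z = [v(k), v(k - 1)]                    # tapped delay line contents
    s = random.uniform(-0.2, 0.2)           # white EEG signal
    t = s + m(k)                            # contaminated EEG (target)
    e = t - (w[0] * z[0] + w[1] * z[1])     # error = restored EEG estimate
    w = [w[i] + 2 * alpha * e * z[i] for i in range(2)]
    weights.append(w)
    sq_errors.append(e * e)

# Average over the second half, after the transient has died out.
tail = weights[-3000:]
w_avg = [sum(wk[i] for wk in tail) / len(tail) for i in range(2)]
mse_tail = sum(sq_errors[-3000:]) / 3000

print([round(x, 3) for x in w_avg], round(mse_tail, 4))
```

Individual weight vectors keep jittering around the minimum, while their time average settles close to $\mathbf{x}^*$; plotting `weights` in the plane gives the noisy path described above.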


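Note that the contours in this figure reflect the fact that the eigenvalues and eigenvectors of the Hessian matrix $\mathbf{A} = 2\mathbf{R}$ are

$$\lambda_1 = 2.16, \quad \mathbf{z}_1 = \begin{bmatrix} -0.7071 \\ 0.7071 \end{bmatrix}, \qquad \lambda_2 = 0.72, \quad \mathbf{z}_2 = \begin{bmatrix} -0.7071 \\ -0.7071 \end{bmatrix}$$

If the learning rate is decreased, the LMS trajectory is smoother, but learning proceeds more slowly.
Note that the maximum stable learning rate is $\alpha_{max} = 2/2.16 = 0.926$.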
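The eigenvalues of the Hessian and the resulting bound follow from $\mathbf{R}$ in a few lines. The closed-form expressions below hold for symmetric 2x2 matrices only, and the variable names are illustrative.

```python
import math

R = [[0.72, -0.36], [-0.36, 0.72]]
A = [[2 * r for r in row] for row in R]      # Hessian A = 2R

# Eigenvalues of a symmetric 2x2 matrix in closed form.
center = (A[0][0] + A[1][1]) / 2
radius = math.hypot((A[0][0] - A[1][1]) / 2, A[0][1])
lam1, lam2 = center + radius, center - radius

# Stability bound: alpha < 2 / lambda_max(A), i.e. 1 / lambda_max(R).
alpha_max = 2 / lam1

print(round(lam1, 2), round(lam2, 2), round(alpha_max, 3))
```

The ratio of the two eigenvalues sets how elongated the elliptical contours are, and the larger eigenvalue alone sets the stability limit.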

Note that the error does not go to zero, because the LMS algorithm is approximate steepest descent; it uses an estimate of the gradient, not the true gradient. (nnd10eeg)

Echo Cancellation


HW

Ch 4: E 2, 4, 6, 7
Ch 5: 5, 7, 9
Ch 6: 4, 5, 8, 10
Ch 7: 1, 5, 6, 7
Ch 8: 2, 4, 5
Ch 9: 2, 5, 6
Ch 10: 3, 6, 7