Widrow-Hoff Learning (LMS)

1430/10/28
Ch 10: Widrow-Hoff Learning (LMS Algorithm)
In this chapter we apply the principles of performance
learning to a single-layer linear neural network.
Widrow-Hoff learning is an approximate steepest
descent algorithm, in which the performance index is
mean square error.
Bernard Widrow began working in neural networks in the late 1950s, at about the same time that Frank Rosenblatt developed the perceptron learning rule.
In 1960 Widrow and Hoff introduced the ADALINE (ADAptive LInear NEuron) network.
Its learning rule is called the LMS (Least Mean Square) algorithm.
ADALINE is similar to the perceptron, except that its transfer function is linear instead of hard-limiting.
IUT-Ahmadzadeh
Widrow, B., and Hoff, M. E., Jr., 1960, Adaptive switching circuits, in 1960 IRE WESCON Convention Record, Part 4, New York: IRE, pp. 96–104.
Widrow, B., and Lehr, M. A., 1990, 30 years of adaptive neural networks: Perceptron, madaline, and backpropagation, Proc. IEEE, 78:1415–1441.
Widrow, B., and Stearns, S. D., 1985, Adaptive Signal Processing, Englewood Cliffs, NJ: Prentice-Hall.
Both the ADALINE and the perceptron have the same limitation: they can only solve linearly separable problems.
The LMS algorithm minimizes mean square error, and therefore tries to move the decision boundaries as far from the training patterns as possible.
The LMS algorithm has found many more practical uses than the perceptron learning rule (for example, most long-distance phone lines use ADALINE networks for echo cancellation).
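ADALINE Network
$$\mathbf{a} = \mathrm{purelin}(\mathbf{W}\mathbf{p} + \mathbf{b}) = \mathbf{W}\mathbf{p} + \mathbf{b}$$
$$a_i = \mathrm{purelin}(n_i) = \mathrm{purelin}({}_i\mathbf{w}^T\mathbf{p} + b_i) = {}_i\mathbf{w}^T\mathbf{p} + b_i$$
where ${}_i\mathbf{w} = [w_{i,1} \;\; w_{i,2} \;\; \cdots \;\; w_{i,R}]^T$ is made up of the elements of the ith row of $\mathbf{W}$.
Two-Input ADALINE
$$a = \mathrm{purelin}(n) = \mathrm{purelin}({}_1\mathbf{w}^T\mathbf{p} + b) = {}_1\mathbf{w}^T\mathbf{p} + b = w_{1,1}p_1 + w_{1,2}p_2 + b$$
The ADALINE, like the perceptron, has a decision boundary, which is determined by the input vectors for which the net input n is zero.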
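As a minimal sketch of the network equation above, the following NumPy snippet evaluates a = Wp + b for a two-input ADALINE. The weight, bias, and input values are hypothetical, chosen so that the input lies exactly on the decision boundary (n = 0).

```python
import numpy as np

def adaline_output(W, b, p):
    """Linear (purelin) layer: a = W p + b."""
    return W @ p + b

# Hypothetical two-input, single-neuron network (values chosen for illustration).
W = np.array([[1.0, 2.0]])
b = np.array([-1.0])
p = np.array([3.0, -1.0])

a = adaline_output(W, b, p)  # 1*3 + 2*(-1) - 1 = 0, so p is on the decision boundary
```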
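Mean Square Error
The LMS algorithm is an example of supervised training.
Training set: $\{\mathbf{p}_1, t_1\}, \{\mathbf{p}_2, t_2\}, \ldots, \{\mathbf{p}_Q, t_Q\}$, with input $\mathbf{p}_q$ and target $t_q$.
Notation:
$$\mathbf{x} = \begin{bmatrix} {}_1\mathbf{w} \\ b \end{bmatrix}, \quad \mathbf{z} = \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix}, \quad a = {}_1\mathbf{w}^T\mathbf{p} + b = \mathbf{x}^T\mathbf{z}$$
Mean Square Error:
$$F(\mathbf{x}) = E[e^2] = E[(t - a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$$
The expectation is taken over all sets of input/target pairs.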
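Error Analysis
$$F(\mathbf{x}) = E[e^2] = E[(t - a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$$
$$F(\mathbf{x}) = E[t^2 - 2t\,\mathbf{x}^T\mathbf{z} + \mathbf{x}^T\mathbf{z}\mathbf{z}^T\mathbf{x}]$$
$$F(\mathbf{x}) = E[t^2] - 2\mathbf{x}^T E[t\mathbf{z}] + \mathbf{x}^T E[\mathbf{z}\mathbf{z}^T]\,\mathbf{x}$$
This can be written in the following convenient form:
$$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\,\mathbf{x}$$
where
$$c = E[t^2], \quad \mathbf{h} = E[t\mathbf{z}], \quad \mathbf{R} = E[\mathbf{z}\mathbf{z}^T]$$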
The vector $\mathbf{h}$ gives the cross-correlation between the input vector and its associated target.
$\mathbf{R}$ is the input correlation matrix.
The diagonal elements of this matrix are equal to the mean square values of the elements of the input vectors.
The mean square error for the ADALINE network is a quadratic function:
$$F(\mathbf{x}) = c + \mathbf{d}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{A}\,\mathbf{x}, \quad \mathbf{d} = -2\mathbf{h}, \quad \mathbf{A} = 2\mathbf{R}$$
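The equivalence between the directly computed mean square error and its quadratic form can be checked numerically. The sketch below replaces the expectations with sample averages over a small synthetic data set; the random data, the seed, and the test point x are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: each row of Z is z = [p; 1] (input with a 1 appended for the bias).
Z = np.hstack([rng.normal(size=(50, 2)), np.ones((50, 1))])
t = rng.normal(size=50)
x = np.array([0.3, -0.7, 0.1])  # arbitrary parameter vector [w_11, w_12, b]

# Sample estimates of c = E[t^2], h = E[t z], R = E[z z^T].
c = np.mean(t ** 2)
h = (t[:, None] * Z).mean(axis=0)
R = (Z.T @ Z) / len(t)

mse_direct = np.mean((t - Z @ x) ** 2)       # E[(t - x^T z)^2]
mse_quadratic = c - 2 * x @ h + x @ R @ x    # c - 2 x^T h + x^T R x
```

With sample averages in place of expectations the two values agree up to floating-point rounding, since the expansion is an algebraic identity.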
Stationary Point
Hessian matrix: $\mathbf{A} = 2\mathbf{R}$
The correlation matrix $\mathbf{R}$ must be at least positive semidefinite. In fact, it can be shown that all correlation matrices are either positive definite or positive semidefinite. If there are any zero eigenvalues, the performance index will either have a weak minimum or else no stationary point (depending on $\mathbf{d} = -2\mathbf{h}$); otherwise there will be a unique global minimum $\mathbf{x}^*$ (see Ch 8).
$$\nabla F(\mathbf{x}) = \nabla\left(c + \mathbf{d}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{A}\,\mathbf{x}\right) = \mathbf{d} + \mathbf{A}\mathbf{x} = -2\mathbf{h} + 2\mathbf{R}\mathbf{x}$$
Stationary point:
$$-2\mathbf{h} + 2\mathbf{R}\mathbf{x} = 0$$
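If $\mathbf{R}$ (the correlation matrix) is positive definite:
$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$$
If we could calculate the statistical quantities $\mathbf{h}$ and $\mathbf{R}$, we could find the minimum point directly from the above equation.
However, it is usually not desirable or convenient to calculate $\mathbf{h}$ and $\mathbf{R}$, so an approximate steepest descent algorithm is used instead.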
Approximate Steepest Descent
Approximate mean square error (one sample):
$$\hat{F}(\mathbf{x}) = (t(k) - a(k))^2 = e^2(k)$$
The expectation of the squared error has been replaced by the squared error at iteration k.
Approximate (stochastic) gradient:
$$\hat{\nabla} F(\mathbf{x}) = \nabla e^2(k)$$
$$[\nabla e^2(k)]_j = \frac{\partial e^2(k)}{\partial w_{1,j}} = 2e(k)\frac{\partial e(k)}{\partial w_{1,j}}, \quad j = 1, 2, \ldots, R$$
$$[\nabla e^2(k)]_{R+1} = \frac{\partial e^2(k)}{\partial b} = 2e(k)\frac{\partial e(k)}{\partial b}$$
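Approximate Gradient Calculation
$$\frac{\partial e(k)}{\partial w_{1,j}} = \frac{\partial [t(k) - a(k)]}{\partial w_{1,j}} = \frac{\partial}{\partial w_{1,j}}\left[t(k) - \left({}_1\mathbf{w}^T\mathbf{p}(k) + b\right)\right]$$
$$\frac{\partial e(k)}{\partial w_{1,j}} = \frac{\partial}{\partial w_{1,j}}\left[t(k) - \left(\sum_{i=1}^{R} w_{1,i}\,p_i(k) + b\right)\right]$$
where $p_i(k)$ is the ith element of the input vector at the kth iteration.
$$\frac{\partial e(k)}{\partial w_{1,j}} = -p_j(k), \quad \frac{\partial e(k)}{\partial b} = -1$$
$$\hat{\nabla} F(\mathbf{x}) = \nabla e^2(k) = -2e(k)\,\mathbf{z}(k)$$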
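Now we can see the beauty of approximating the mean square error by the single error at iteration k, as in:
$$\hat{F}(\mathbf{x}) = (t(k) - a(k))^2 = e^2(k)$$
The approximate gradient $\hat{\nabla} F(\mathbf{x}) = -2e(k)\,\mathbf{z}(k)$ requires only the current error and the current input.
This approximation to $\nabla F(\mathbf{x})$ can now be used in the steepest descent algorithm.
LMS Algorithm
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\,\nabla F(\mathbf{x})\big|_{\mathbf{x} = \mathbf{x}_k}$$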
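If we substitute $\hat{\nabla} F(\mathbf{x})$ for $\nabla F(\mathbf{x})$:
$$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha\,e(k)\,\mathbf{z}(k)$$
$${}_1\mathbf{w}(k+1) = {}_1\mathbf{w}(k) + 2\alpha\,e(k)\,\mathbf{p}(k)$$
$$b(k+1) = b(k) + 2\alpha\,e(k)$$
These last two equations make up the LMS algorithm, also called the delta rule or the Widrow-Hoff learning algorithm.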
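The two update equations translate almost line by line into code. The following is a minimal sketch for a single neuron; the function name `lms_step` and the numerical values in the usage lines are hypothetical and serve only to illustrate one update.

```python
import numpy as np

def lms_step(w, b, p, t, alpha):
    """One LMS (Widrow-Hoff) update for a single linear neuron.

    w, p: 1-D arrays of length R; b, t, alpha: scalars.
    Returns the updated weights, the updated bias, and the error e = t - a.
    """
    a = w @ p + b                      # linear output a = w^T p + b
    e = t - a                          # error at this iteration
    w_new = w + 2 * alpha * e * p      # w(k+1) = w(k) + 2*alpha*e(k)*p(k)
    b_new = b + 2 * alpha * e          # b(k+1) = b(k) + 2*alpha*e(k)
    return w_new, b_new, e

# Illustrative call with made-up values.
w1, b1, e0 = lms_step(np.zeros(2), 0.0, np.array([1.0, 2.0]), t=1.0, alpha=0.1)
# e0 = 1.0, w1 = [0.2, 0.4], b1 = 0.2
```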
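Multiple-Neuron Case
$${}_i\mathbf{w}(k+1) = {}_i\mathbf{w}(k) + 2\alpha\,e_i(k)\,\mathbf{p}(k)$$
$$b_i(k+1) = b_i(k) + 2\alpha\,e_i(k)$$
Matrix Form:
$$\mathbf{W}(k+1) = \mathbf{W}(k) + 2\alpha\,\mathbf{e}(k)\,\mathbf{p}^T(k)$$
$$\mathbf{b}(k+1) = \mathbf{b}(k) + 2\alpha\,\mathbf{e}(k)$$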
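In matrix form the weight change is an outer product of the error vector and the input vector. A minimal sketch, assuming NumPy arrays and a hypothetical function name `lms_step_matrix`; the numbers in the usage lines are made up for illustration.

```python
import numpy as np

def lms_step_matrix(W, b, p, t, alpha):
    """One LMS update for a layer of S linear neurons.

    W: (S, R) weight matrix, b: (S,) biases, p: (R,) input, t: (S,) targets.
    """
    e = t - (W @ p + b)                      # error vector e(k)
    W_new = W + 2 * alpha * np.outer(e, p)   # W(k+1) = W(k) + 2*alpha*e(k)*p(k)^T
    b_new = b + 2 * alpha * e                # b(k+1) = b(k) + 2*alpha*e(k)
    return W_new, b_new, e

# Illustrative call: two neurons, three inputs, zero initial parameters.
W1, b1, e0 = lms_step_matrix(
    np.zeros((2, 3)), np.zeros(2),
    p=np.array([1.0, 0.0, -1.0]), t=np.array([1.0, -1.0]), alpha=0.5,
)
```

Each row of the result is exactly what the single-neuron rule would give for that neuron, so the rows are updated independently.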
Analysis of Convergence
Note that $\mathbf{x}_k$ is a function only of $\mathbf{z}(k-1), \mathbf{z}(k-2), \ldots, \mathbf{z}(0)$. If we assume that successive input vectors are statistically independent, then $\mathbf{x}_k$ is independent of $\mathbf{z}(k)$.
We will show that for stationary input processes meeting this condition, the expected value of the weight vector will converge to:
$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$$
This is the minimum mean square error $\{E[e_k^2]\}$ solution, as we saw before.
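Recall the LMS Algorithm:
$$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha\,e(k)\,\mathbf{z}(k)$$
$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\,E[e(k)\,\mathbf{z}(k)]$$
Substitute the error with $t(k) - \mathbf{x}_k^T\mathbf{z}(k)$:
$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{E[t(k)\,\mathbf{z}(k)] - E[(\mathbf{x}_k^T\mathbf{z}(k))\,\mathbf{z}(k)]\right\}$$
Since $\mathbf{x}_k^T\mathbf{z}(k) = \mathbf{z}^T(k)\,\mathbf{x}_k$:
$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{E[t(k)\,\mathbf{z}(k)] - E[\mathbf{z}(k)\,\mathbf{z}^T(k)\,\mathbf{x}_k]\right\}$$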
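Since $\mathbf{x}_k$ is independent of $\mathbf{z}(k)$:
$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{\mathbf{h} - \mathbf{R}\,E[\mathbf{x}_k]\right\}$$
$$E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\mathbf{h}$$
For stability, the eigenvalues of the matrix $[\mathbf{I} - 2\alpha\mathbf{R}]$ must fall inside the unit circle.
Conditions for Stability
$$\mathrm{eig}([\mathbf{I} - 2\alpha\mathbf{R}]) = 1 - 2\alpha\lambda_i > -1$$
(where $\lambda_i$ is an eigenvalue of $\mathbf{R}$)
Since $\lambda_i > 0$, $1 - 2\alpha\lambda_i < 1$.
Therefore the stability condition simplifies to
$$\alpha < \frac{1}{\lambda_i} \quad \text{for all } i$$
$$0 < \alpha < \frac{1}{\lambda_{max}}$$
Note: we have the same condition as the SD algorithm. In SD we use the Hessian matrix $\mathbf{A}$; here we use the input correlation matrix $\mathbf{R}$ (recall that $\mathbf{A} = 2\mathbf{R}$).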
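The stability bound can be evaluated directly from the eigenvalues of R. The sketch below is illustrative only: the helper names `max_stable_learning_rate` and `is_stable` and the diagonal matrix used in the usage lines are assumptions, not part of the original material.

```python
import numpy as np

def max_stable_learning_rate(R):
    """Upper bound 1/lambda_max on alpha for a symmetric correlation matrix R."""
    lam_max = np.linalg.eigvalsh(R).max()
    return 1.0 / lam_max

def is_stable(R, alpha):
    """True if all eigenvalues of I - 2*alpha*R lie strictly inside the unit circle."""
    eigs = np.linalg.eigvalsh(np.eye(R.shape[0]) - 2 * alpha * R)
    return bool(np.all(np.abs(eigs) < 1.0))

# Hypothetical positive definite correlation matrix with eigenvalues 0.5 and 2.
R_demo = np.diag([0.5, 2.0])
alpha_max = max_stable_learning_rate(R_demo)   # 1 / 2.0 = 0.5
```

Learning rates slightly below `alpha_max` keep the expected-weight recursion stable, while rates slightly above it do not.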
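Steady State Response
$$E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\mathbf{h}$$
If the system is stable, then a steady state condition will be reached:
$$E[\mathbf{x}_{ss}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_{ss}] + 2\alpha\mathbf{h}$$
The solution to this equation is
$$E[\mathbf{x}_{ss}] = \mathbf{R}^{-1}\mathbf{h} = \mathbf{x}^*$$
This is also the strong minimum of the performance index.
Thus the LMS solution, obtained by applying one input at a time, is the same as the minimum mean square solution $\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$.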
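The steady-state result can be illustrated by iterating the expected-value recursion and comparing with the direct solution. In this sketch the matrix R, the vector h, and the learning rate are hypothetical values chosen so that alpha is below 1/lambda_max.

```python
import numpy as np

R = np.array([[2.0, 1.0], [1.0, 2.0]])   # hypothetical correlation matrix (eigenvalues 1 and 3)
h = np.array([1.0, 0.0])                 # hypothetical cross-correlation vector
alpha = 0.1                              # below the bound 1/lambda_max = 1/3

# Iterate E[x_{k+1}] = (I - 2*alpha*R) E[x_k] + 2*alpha*h from a zero start.
x_mean = np.zeros(2)
for _ in range(200):
    x_mean = (np.eye(2) - 2 * alpha * R) @ x_mean + 2 * alpha * h

# Direct minimum mean square solution x* = R^{-1} h.
x_star = np.linalg.solve(R, h)
```

After enough iterations `x_mean` and `x_star` agree to numerical precision, which is the fixed-point argument above carried out numerically.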
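Example
Banana: $\mathbf{p}_1 = \begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix}, \quad t_1 = -1$
Apple: $\mathbf{p}_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}, \quad t_2 = 1$
If inputs are generated randomly with equal probability, the input correlation matrix is:
$$\mathbf{R} = E[\mathbf{p}\mathbf{p}^T] = \frac{1}{2}\mathbf{p}_1\mathbf{p}_1^T + \frac{1}{2}\mathbf{p}_2\mathbf{p}_2^T$$
$$\mathbf{R} = \frac{1}{2}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix}\begin{bmatrix} -1 & 1 & -1 \end{bmatrix} + \frac{1}{2}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}\begin{bmatrix} 1 & 1 & -1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & -1 \\ 0 & -1 & 1 \end{bmatrix}$$
$$\lambda_1 = 1.0, \quad \lambda_2 = 0.0, \quad \lambda_3 = 2.0$$
$$\alpha < \frac{1}{\lambda_{max}} = \frac{1}{2.0} = 0.5$$
We take $\alpha = 0.2$. (Note: in practice it is difficult to calculate $\mathbf{R}$ and the resulting bound on $\alpha$, so $\alpha$ is chosen by trial and error.)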
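The correlation matrix and its eigenvalues for this example can be reproduced in a few lines. The sketch assumes the prototype vectors p1 = [-1, 1, -1]^T (banana) and p2 = [1, 1, -1]^T (apple), each occurring with probability 1/2.

```python
import numpy as np

p1 = np.array([-1.0, 1.0, -1.0])   # banana prototype (assumed signs)
p2 = np.array([1.0, 1.0, -1.0])    # apple prototype (assumed signs)

# R = E[p p^T] with both inputs equally likely.
R = 0.5 * np.outer(p1, p1) + 0.5 * np.outer(p2, p2)

eigvals = np.linalg.eigvalsh(R)    # ascending order: 0, 1, 2
alpha_max = 1.0 / eigvals.max()    # stability bound 1/lambda_max = 0.5
```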
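Iteration One (Banana)
$$a(0) = \mathbf{W}(0)\,\mathbf{p}(0) = \mathbf{W}(0)\,\mathbf{p}_1 = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = 0$$
W(0) is selected arbitrarily.
$$e(0) = t(0) - a(0) = t_1 - a(0) = -1 - 0 = -1$$
$$\mathbf{W}(1) = \mathbf{W}(0) + 2\alpha\,e(0)\,\mathbf{p}^T(0)$$
$$\mathbf{W}(1) = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix} + 2(0.2)(-1)\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix}^T = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix}$$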
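Iteration Two (Apple)
$$a(1) = \mathbf{W}(1)\,\mathbf{p}(1) = \mathbf{W}(1)\,\mathbf{p}_2 = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} = -0.4$$
$$e(1) = t(1) - a(1) = t_2 - a(1) = 1 - (-0.4) = 1.4$$
$$\mathbf{W}(2) = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix} + 2(0.2)(1.4)\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}^T = \begin{bmatrix} 0.96 & 0.16 & -0.16 \end{bmatrix}$$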
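Iteration Three
$$a(2) = \mathbf{W}(2)\,\mathbf{p}(2) = \mathbf{W}(2)\,\mathbf{p}_1 = \begin{bmatrix} 0.96 & 0.16 & -0.16 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = -0.64$$
$$e(2) = t(2) - a(2) = t_1 - a(2) = -1 - (-0.64) = -0.36$$
$$\mathbf{W}(3) = \mathbf{W}(2) + 2\alpha\,e(2)\,\mathbf{p}^T(2) = \begin{bmatrix} 1.1040 & 0.0160 & -0.0160 \end{bmatrix}$$
$$\mathbf{W}(\infty) = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$$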
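The three iterations can be reproduced with a short loop. This sketch assumes p1 = [-1, 1, -1]^T with t1 = -1, p2 = [1, 1, -1]^T with t2 = 1, no bias, W(0) = 0, alpha = 0.2, and inputs presented alternately starting with p1; the choice of 100 further presentations is arbitrary.

```python
import numpy as np

p1, t1 = np.array([-1.0, 1.0, -1.0]), -1.0   # banana
p2, t2 = np.array([1.0, 1.0, -1.0]), 1.0     # apple
alpha = 0.2

W = np.zeros(3)                               # W(0), chosen arbitrarily
for p, t in [(p1, t1), (p2, t2), (p1, t1)]:
    e = t - W @ p                             # e(k) = t(k) - a(k)
    W = W + 2 * alpha * e * p                 # W(k+1) = W(k) + 2*alpha*e(k)*p(k)^T
W3 = W.copy()                                 # -> [1.104, 0.016, -0.016]

# Keep alternating the two inputs to approach the limiting weights.
for k in range(100):
    p, t = (p2, t2) if k % 2 == 0 else (p1, t1)
    W = W + 2 * alpha * (t - W @ p) * p
W_limit = W                                   # approaches [1, 0, 0]
```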
Some general comments on the learning process:
Computationally, the learning process goes through all training examples (an epoch) a number of times, until a stopping criterion is reached.
The convergence process can be monitored with a plot of the mean-squared error function F(W(k)).
The popular stopping criteria are:
The mean-squared error is sufficiently small, i.e. F(W(k)) falls below a chosen threshold.
The rate of change of the mean-squared error is sufficiently small.
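Epoch-based training with both stopping criteria could be sketched as follows. The function name `train_lms`, the tolerance names and their default values, and the omission of the bias are assumptions made to keep the example short; the usage lines assume training pairs p1 = [-1, 1, -1]^T with t1 = -1 and p2 = [1, 1, -1]^T with t2 = 1.

```python
import numpy as np

def train_lms(P, T, alpha, mse_tol=1e-6, delta_tol=1e-9, max_epochs=1000):
    """Train a single bias-free linear neuron with LMS.

    P: (Q, R) array of inputs, T: (Q,) array of targets.
    Stops when the MSE, or its change between epochs, is small enough.
    Returns the weights and the per-epoch MSE history.
    """
    w = np.zeros(P.shape[1])
    history = []
    for _ in range(max_epochs):
        for p, t in zip(P, T):                 # one pass over the data = one epoch
            w = w + 2 * alpha * (t - w @ p) * p
        mse = np.mean((T - P @ w) ** 2)        # F(W(k)) evaluated on the training set
        history.append(mse)
        if mse < mse_tol:                      # criterion 1: error small enough
            break
        if len(history) > 1 and abs(history[-2] - mse) < delta_tol:
            break                              # criterion 2: error no longer changing
    return w, history

P = np.array([[-1.0, 1.0, -1.0], [1.0, 1.0, -1.0]])
T = np.array([-1.0, 1.0])
w, history = train_lms(P, T, alpha=0.2)
```

Plotting `history` against the epoch index gives the convergence curve mentioned above.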
Adaptive Filtering
ADALINE is one of the most widely used NNs in practical
applications. One of the major application areas has been
Adaptive Filtering.
[Figures: Adaptive Filter; Tapped Delay Line]
$$a(k) = \mathrm{purelin}(\mathbf{W}\mathbf{p} + b) = \sum_{i=1}^{R} w_{1,i}\,y(k - i + 1) + b$$
In digital signal processing (DSP) language we recognize this network as a finite impulse response (FIR) filter.
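The tapped-delay-line output can be written as an explicit sum and compared with a standard convolution. In this sketch the function name, the zero-based signal indexing, the assumption that y(n) = 0 for n < 0, and the sample values are all illustrative choices.

```python
import numpy as np

def tapped_delay_output(w, b, y, k):
    """a(k) = sum_{i=1}^{R} w[i-1] * y(k - i + 1) + b, taking y(n) = 0 for n < 0."""
    total = b
    for i in range(1, len(w) + 1):
        n = k - i + 1                 # index of the delayed sample
        if n >= 0:
            total += w[i - 1] * y[n]
    return total

# Made-up signal and a two-tap filter.
y = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.25])
b = 0.0

a_loop = np.array([tapped_delay_output(w, b, y, k) for k in range(len(y))])
a_fir = np.convolve(y, w)[:len(y)] + b   # same result via FIR convolution
```

The agreement of `a_loop` and `a_fir` is the FIR interpretation stated above: the weights play the role of the filter's impulse response.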
Example: Noise Cancellation
Noise Cancellation Adaptive Filter

A two-input filter can attenuate and phase-shift the noise in the desired way.
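Correlation Matrix
To analyze this system we need to find the input correlation matrix $\mathbf{R}$ and the input/target cross-correlation vector $\mathbf{h}$:
$$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T], \quad \mathbf{h} = E[t\,\mathbf{z}]$$
$$\mathbf{z}(k) = \begin{bmatrix} v(k) \\ v(k-1) \end{bmatrix}, \quad t(k) = s(k) + m(k)$$
$$\mathbf{R} = \begin{bmatrix} E[v^2(k)] & E[v(k)\,v(k-1)] \\ E[v(k-1)\,v(k)] & E[v^2(k-1)] \end{bmatrix}$$
$$\mathbf{h} = \begin{bmatrix} E[(s(k) + m(k))\,v(k)] \\ E[(s(k) + m(k))\,v(k-1)] \end{bmatrix}$$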
We must define the noise signal $v$, the EEG signal $s$, and the filtered noise $m$, to be able to obtain specific values.
We assume: the EEG signal is a white (uncorrelated from one time step to the next) random signal uniformly distributed between the values -0.2 and +0.2; the noise source (60 Hz sine wave sampled at 180 Hz) is given by
$$v(k) = 1.2\,\sin\left(\frac{2\pi k}{3}\right)$$
and the filtered noise that contaminates the EEG is the noise attenuated by a factor of 1.0 and shifted in phase by $-3\pi/4$:
$$m(k) = 1.2\,\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)$$
$$E[v^2(k)] = (1.2)^2\,\frac{1}{3}\sum_{k=1}^{3}\sin^2\left(\frac{2\pi k}{3}\right) = (1.2)^2(0.5) = 0.72$$
$$E[v^2(k-1)] = E[v^2(k)] = 0.72$$
$$E[v(k)\,v(k-1)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\,\sin\frac{2\pi k}{3}\right)\left(1.2\,\sin\frac{2\pi (k-1)}{3}\right) = (1.2)^2(0.5)\cos\left(\frac{2\pi}{3}\right) = -0.36$$
$$\mathbf{R} = \begin{bmatrix} 0.72 & -0.36 \\ -0.36 & 0.72 \end{bmatrix}$$
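Stationary Point
$$E[(s(k) + m(k))\,v(k)] = \underbrace{E[s(k)\,v(k)]}_{0} + E[m(k)\,v(k)]$$
The 1st term is zero because s(k) and v(k) are independent and zero mean.
$$E[m(k)\,v(k)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\,\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)\left(1.2\,\sin\frac{2\pi k}{3}\right) = -0.51$$
Now we find the 2nd element of $\mathbf{h}$:
$$E[(s(k) + m(k))\,v(k-1)] = \underbrace{E[s(k)\,v(k-1)]}_{0} + E[m(k)\,v(k-1)]$$
$$E[m(k)\,v(k-1)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\,\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)\left(1.2\,\sin\frac{2\pi (k-1)}{3}\right) = 0.70$$
$$\mathbf{h} = \begin{bmatrix} E[(s(k) + m(k))\,v(k)] \\ E[(s(k) + m(k))\,v(k-1)] \end{bmatrix} = \begin{bmatrix} -0.51 \\ 0.70 \end{bmatrix}$$
$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h} = \begin{bmatrix} 0.72 & -0.36 \\ -0.36 & 0.72 \end{bmatrix}^{-1}\begin{bmatrix} -0.51 \\ 0.70 \end{bmatrix} = \begin{bmatrix} -0.30 \\ 0.82 \end{bmatrix}$$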
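These values can be checked numerically by averaging over one period of the sampled sine wave (three samples). The sketch assumes v(k) = 1.2 sin(2πk/3) and m(k) = 1.2 sin(2πk/3 - 3π/4), and drops the EEG signal s(k) from h because its cross terms with v are zero.

```python
import numpy as np

k = np.arange(1, 4)                                    # one period: k = 1, 2, 3
v = 1.2 * np.sin(2 * np.pi * k / 3)                    # v(k)
v_prev = 1.2 * np.sin(2 * np.pi * (k - 1) / 3)         # v(k-1)
m = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)    # filtered noise m(k)

Z = np.column_stack([v, v_prev])                       # rows are z(k)^T = [v(k), v(k-1)]
R = Z.T @ Z / len(k)                                   # E[z z^T]
h = Z.T @ m / len(k)                                   # E[m z]; the s(k) terms average to zero
x_star = np.linalg.solve(R, h)                         # x* = R^{-1} h
```

To two decimals this gives h ≈ [-0.51, 0.70] and x* ≈ [-0.30, 0.82].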
Now, what kind of error will we have at the minimum solution?
Performance Index
$$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\,\mathbf{x}$$
We have just found $\mathbf{x}^*$, $\mathbf{R}$ and $\mathbf{h}$. To find c we have
$$c = E[t^2(k)] = E[(s(k) + m(k))^2]$$
$$c = E[s^2(k)] + 2E[s(k)\,m(k)] + E[m^2(k)]$$
The middle term is zero because s(k) and m(k) are independent and zero mean (m(k) is a filtered version of v(k)).
$$E[s^2(k)] = \frac{1}{0.4}\int_{-0.2}^{0.2} s^2\,ds = \frac{1}{3(0.4)}\,s^3\Big|_{-0.2}^{0.2} = 0.0133$$
$$E[m^2(k)] = \frac{1}{3}\sum_{k=1}^{3}\left\{1.2\,\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right\}^2 = 0.72$$
$$c = 0.0133 + 0.72 = 0.7333$$
$$F(\mathbf{x}^*) = 0.7333 - 2(0.72) + 0.72 = 0.0133$$
The minimum mean square error is the same as the mean square value of the EEG signal. This is what we expected, since the error of this adaptive noise canceller is in fact the reconstructed EEG signal.
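The minimum of the performance index can be evaluated numerically from c - 2x^T h + x^T R x. This sketch assumes v(k) = 1.2 sin(2πk/3), m(k) = 1.2 sin(2πk/3 - 3π/4), and an EEG signal uniform on [-0.2, 0.2], whose mean square value is (0.4)^2/12 for a zero-mean uniform distribution of width 0.4.

```python
import numpy as np

k = np.arange(1, 4)                                    # one period of the noise
v = 1.2 * np.sin(2 * np.pi * k / 3)
v_prev = 1.2 * np.sin(2 * np.pi * (k - 1) / 3)
m = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)

Z = np.column_stack([v, v_prev])
R = Z.T @ Z / 3
h = Z.T @ m / 3
x_star = np.linalg.solve(R, h)

s_power = 0.4 ** 2 / 12                                # E[s^2] = width^2 / 12 = 0.0133...
c = s_power + np.mean(m ** 2)                          # E[t^2] = E[s^2] + E[m^2]
F_min = c - 2 * x_star @ h + x_star @ R @ x_star       # F(x*)
```

Here `F_min` equals `s_power` up to rounding: the filter can reproduce m(k) exactly from v(k) and v(k-1), so only the EEG signal remains in the error.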
[Figure: LMS Response for α = 0.1; axes W1,1 and W1,2]
The LMS trajectory looks like a noisy version of steepest descent.
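Note that the contours in this figure reflect the fact that the eigenvalues and the eigenvectors of the Hessian matrix $\mathbf{A} = 2\mathbf{R}$ are
$$\lambda_1 = 2.16, \quad \mathbf{z}_1 = \begin{bmatrix} -0.7071 \\ 0.7071 \end{bmatrix}, \qquad \lambda_2 = 0.72, \quad \mathbf{z}_2 = \begin{bmatrix} -0.7071 \\ -0.7071 \end{bmatrix}$$
If the learning rate is decreased, the LMS trajectory is smoother, but the learning proceeds more slowly.
Note that $\alpha_{max} = 2/2.16 = 0.926$ for stability.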
Note that the error does not go to zero, because the LMS algorithm is approximate steepest descent; it uses an estimate of the gradient, not the true gradient (demo: nnd10eeg).
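A small simulation illustrates this behaviour. The sketch below is not the nnd10eeg demonstration itself; it is an independent approximation that assumes v(k) = 1.2 sin(2πk/3), m(k) = 1.2 sin(2πk/3 - 3π/4), an EEG signal drawn uniformly from [-0.2, 0.2], zero initial weights, no bias, and a hypothetical function name, step count, and random seed.

```python
import numpy as np

def simulate_noise_canceller(alpha=0.1, steps=3000, seed=0):
    """Run LMS on the two-tap noise canceller; return final weights and the error sequence."""
    rng = np.random.default_rng(seed)
    x = np.zeros(2)                                    # weights [w_11, w_12]
    errors = np.empty(steps)
    for k in range(1, steps + 1):
        z = np.array([1.2 * np.sin(2 * np.pi * k / 3),           # v(k)
                      1.2 * np.sin(2 * np.pi * (k - 1) / 3)])    # v(k-1)
        s = rng.uniform(-0.2, 0.2)                               # EEG sample
        m = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)      # contaminating noise
        e = (s + m) - x @ z                                      # error = restored EEG estimate
        x = x + 2 * alpha * e * z                                # LMS update
        errors[k - 1] = e
    return x, errors

x_final, errors = simulate_noise_canceller()
tail_mse = np.mean(errors[-1000:] ** 2)   # stays near E[s^2] = 0.0133 rather than going to zero
```

The final weights hover around the minimum point near [-0.30, 0.82] instead of settling on it, and a smaller alpha reduces that jitter at the cost of slower convergence.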
Echo Cancellation
HW
Ch 4: E 2, 4, 6, 7
Ch 5: 5, 7, 9
Ch 6: 4, 5, 8, 10
Ch 7: 1, 5, 6, 7
Ch 8: 2, 4, 5
Ch 9: 2, 5, 6
Ch 10: 3, 6, 7