
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay

Pramod Kumar Meher, Senior Member, IEEE, and Sang Yoon Park, Member, IEEE

Abstract: In this paper, we present an efficient architecture for the implementation of the delayed least mean square (DLMS) adaptive filter. For achieving a lower adaptation-delay and an area-delay-power efficient implementation, we have used a novel partial product generator, and we have proposed a strategy for optimized balanced pipelining across the time-consuming combinational blocks of the structure. From synthesis results, we find that the proposed design offers nearly 17% less area-delay-product (ADP) and nearly 14% less energy-delay-product (EDP) than the best of the existing systolic structures, on average, for filter-lengths N = 8, 16, and 32. We have proposed an efficient fixed-point implementation scheme of the proposed architecture, and derived the expression for the steady-state error. We have shown that the steady-state mean squared error obtained from the analytical result matches the simulation result. Moreover, we have proposed a bit-level pruning of the proposed architecture, which provides nearly 20% saving in ADP and 9% saving in EDP over the proposed structure before pruning, without noticeable degradation of the steady-state error performance.

Index Terms: Adaptive filters, least mean square algorithms, fixed-point arithmetic, circuit optimization.

The authors are with the Institute for Infocomm Research, Singapore 138632 (e-mail: pkmeher@i2r.a-star.edu.sg; sypark@i2r.a-star.edu.sg). The corresponding author of this paper is Sang Yoon Park. A preliminary version of this paper was presented at the IEEE International Midwest Symposium on Circuits and Systems (MWSCAS 2011).

I. INTRODUCTION

Adaptive digital filters find wide application in several digital signal processing (DSP) areas, e.g., noise and echo cancellation, system identification, and channel equalization. The finite impulse response (FIR) filter whose weights are updated by the famous Widrow-Hoff least mean square (LMS) algorithm [1] may be considered the simplest known adaptive filter. The LMS adaptive filter is the most popular one, not only due to its simplicity but also due to its satisfactory convergence performance [2]. The direct-form LMS adaptive filter involves a long critical-path due to the inner-product computation used to obtain the filter output. The critical-path is required to be reduced by pipelined implementation when it exceeds the desired sample period. Since the conventional LMS algorithm does not support pipelined implementation due to its recursive behavior, it is modified to a form called the delayed LMS (DLMS) algorithm [3]-[5], which allows pipelined implementation of the filter.
A lot of work has been done to implement the DLMS algorithm in systolic architectures to increase the maximum usable frequency [3], [6], [7]; such structures can support very high sampling frequencies due to pipelining after every multiply-accumulate (MAC) operation. However, they involve an adaptation-delay of N cycles for filter-length N, which is quite high for large-order filters. Since the convergence performance degrades considerably for a large adaptation-delay, Ramanathan and Visvanathan [8] have proposed a modified systolic architecture to reduce the adaptation-delay.
A transpose-form LMS adaptive filter is suggested in [9], where the filter output at any instant depends on delayed versions of the weights, and the number of delays in the weights varies from 1 to N. Van and Feng [10] have proposed a systolic architecture in which relatively large processing elements (PEs) are used to achieve a lower adaptation-delay with a critical-path of one MAC operation. Ting et al. [11] have proposed a fine-grained pipelined design that limits the critical-path to a maximum of one addition-time; it supports a high sampling frequency, but involves a lot of area overhead for pipelining and a higher power consumption than [10] due to its large number of pipeline latches. A further effort was made by Meher and Maheshwari [12] to reduce the number of adaptation-delays. Meher and Park have proposed a 2-bit multiplication cell, and used it with an efficient adder-tree for pipelined inner-product computation, to minimize the critical-path and the silicon area without increasing the number of adaptation-delays [13], [14].
The existing work on the DLMS adaptive filter does not discuss fixed-point implementation issues, e.g., the location of the radix-point, the choice of word-length, and the quantization at various stages of the computation, although these directly affect the convergence performance, particularly due to the recursive behavior of the LMS algorithm. Therefore, fixed-point implementation issues are given adequate emphasis in this paper. Besides, we present here the optimization of our previously reported design [13], [14] to reduce the number of pipeline delays along with the area, the sampling period, and the energy consumption. The proposed design is found to be more efficient in terms of the power-delay product and the energy-delay product compared with the existing structures. The novel contributions of this paper are as follows:
i) a novel design strategy for reducing the adaptation-delay, which strongly affects the convergence speed;
ii) a fixed-point implementation scheme of the modified DLMS algorithm; and
iii) bit-level optimization of the computing blocks without adversely affecting the convergence performance.
In the next section, we review the DLMS algorithm, and in Section III we describe the proposed optimized architecture for its implementation. Section IV deals with fixed-point implementation considerations and simulation studies of the convergence of the algorithm. In Section V, we discuss the synthesis of the proposed architecture and its comparison with the existing architectures, followed by the conclusions in Section VI.

[Fig. 1. Structure of the conventional delayed LMS adaptive filter.]

[Fig. 2. Structure of the modified delayed LMS adaptive filter.]

II. REVIEW OF DELAYED LMS ALGORITHM

The weights of the LMS adaptive filter during the nth iteration are updated according to the following equations [2]:

$$w_{n+1} = w_n + \mu \, e_n \, x_n \qquad (1a)$$

where

$$e_n = d_n - y_n \quad \text{and} \quad y_n = w_n^T x_n \qquad (1b)$$

and the input vector $x_n$ and the weight vector $w_n$ at the nth iteration are, respectively, given by

$$x_n = [x_n, x_{n-1}, \ldots, x_{n-N+1}]^T, \qquad w_n = [w_n(0), w_n(1), \ldots, w_n(N-1)]^T.$$
Here $d_n$ is the desired response, $y_n$ is the filter output, and $e_n$ denotes the error computed during the nth iteration; $\mu$ is the step-size, and N is the number of weights used in the LMS adaptive filter.

In the case of pipelined designs with m pipeline stages, the error $e_n$ becomes available only after m cycles, where m is called the adaptation-delay. The DLMS algorithm therefore uses the delayed error $e_{n-m}$, i.e., the error corresponding to the (n - m)th iteration, for updating the current weights instead of the most recent error. The weight-update equation of the DLMS adaptive filter is given by

$$w_{n+1} = w_n + \mu \, e_{n-m} \, x_{n-m}. \qquad (2)$$

The block diagram of the DLMS adaptive filter is depicted in Fig. 1, where the adaptation-delay of m cycles amounts to the delay introduced by the whole adaptive filter structure, consisting of the FIR filtering and the weight-update process.

It is shown in [12] that the adaptation-delay of the conventional LMS can be decomposed into two parts: one part is the delay introduced by the pipeline stages in the FIR filtering, and the other part is due to the delay involved in the pipelining of the weight-update process. Based on such a decomposition of the delay, the DLMS adaptive filter can be implemented by the structure shown in Fig. 2. Assuming that the latency of the computation of the error is n1 cycles, the error computed by the structure at the nth cycle is $e_{n-n_1}$, which is used with the input samples delayed by n1 cycles to generate the weight-increment term. The weight-update equation of the modified DLMS algorithm is given by

$$w_{n+1} = w_n + \mu \, e_{n-n_1} \, x_{n-n_1} \qquad (3a)$$

where

$$e_{n-n_1} = d_{n-n_1} - y_{n-n_1} \qquad (3b)$$

and

$$y_n = w_{n-n_2}^T x_n. \qquad (3c)$$

We can notice that, during the weight-update, the error with n1 delays is used, while the filtering unit uses the weights delayed by n2 cycles. The modified DLMS algorithm decouples the computations of the error-computation block and the weight-update block, and allows optimal pipelining of both these sections separately, by feed-forward cut-set retiming, to minimize the number of pipeline stages and the adaptation delay.
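As a behavioral cross-check of (3a)-(3c), the modified DLMS recursion can be simulated directly in a few lines of NumPy. This is a sketch with our own variable names, not part of the paper's code; the two delays n1 and n2 enter exactly as in the equations:

```python
import numpy as np

def modified_dlms(x, d, N=16, mu=0.4, n1=5, n2=1):
    """Behavioral sketch of the modified DLMS recursion (3a)-(3c).

    Weights are updated with the error delayed by n1 cycles, while the
    filtering uses weights delayed by n2 cycles (not a hardware model).
    """
    n_iter = len(x)
    w = np.zeros((n_iter + 1, N))      # w[n] holds the weight vector w_n
    e = np.zeros(n_iter)

    def tap_vector(n):                 # x_n = [x_n, ..., x_{n-N+1}]^T
        v = np.zeros(N)
        k = min(N, n + 1)
        v[:k] = x[n::-1][:k]
        return v

    for n in range(n_iter):
        y = w[max(n - n2, 0)] @ tap_vector(n)     # (3c)
        e[n] = d[n] - y                           # (3b)
        if n >= n1:                               # (3a): delayed update
            w[n + 1] = w[n] + mu * e[n - n1] * tap_vector(n - n1)
        else:
            w[n + 1] = w[n]
    return w, e
```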
The adaptive filters with different n1 and n2 are simulated for a system identification problem. The 10-tap band-pass filter with impulse response

$$h_n = \frac{\sin(w_H (n - 4.5))}{\pi (n - 4.5)} - \frac{\sin(w_L (n - 4.5))}{\pi (n - 4.5)} \qquad (4)$$

for n = 0, 1, 2, ..., 9, and $h_n = 0$ otherwise, is used as the unknown system, as in [10]. $w_H$ and $w_L$ represent the high and low cutoff frequencies of the passband, and are set to $w_H = 0.7\pi$ and $w_L = 0.3\pi$, respectively. The step-size $\mu$ is set to 0.4. A 16-tap adaptive filter identifies the unknown system with a Gaussian random input $x_n$ of zero mean and unit variance. In all cases, the output of the known system is of unity power and is contaminated with white Gaussian noise of -70 dB strength. Fig. 3 shows the learning curves of the MSE of the error signal $e_n$, averaged over 20 runs, for the conventional LMS adaptive filter (n1 = 0, n2 = 0) and for the DLMS adaptive filters with (n1 = 5, n2 = 1) and (n1 = 7, n2 = 2). It can be seen that, as the total number of delays increases, the convergence slows down, while the steady-state MSE remains almost the same in all cases. In this example, the MSE difference between the cases (n1 = 5, n2 = 1) and (n1 = 7, n2 = 2) after 2000 iterations is less than 1 dB on average.

[Fig. 3. Convergence performance of system identification with the LMS and modified DLMS adaptive filters.]
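The simulation setup of this paragraph can be reproduced with the short sketch below, which drives the modified_dlms() function given above. It assumes our reconstruction of (4), i.e., cutoffs $0.7\pi$ and $0.3\pi$ and a -70 dB noise floor; the unit-power normalization of the system output mentioned in the text is omitted for brevity:

```python
import numpy as np

# Band-pass unknown system of (4): 10 taps, wH = 0.7*pi, wL = 0.3*pi.
k = np.arange(10)
wH, wL = 0.7 * np.pi, 0.3 * np.pi
h = (np.sin(wH * (k - 4.5)) - np.sin(wL * (k - 4.5))) / (np.pi * (k - 4.5))

n_iter = 700
x = np.random.randn(n_iter)                        # zero mean, unit variance
noise = np.sqrt(1e-7) * np.random.randn(n_iter)    # -70 dB white noise
d = np.convolve(x, h)[:n_iter] + noise             # desired signal

# 16-tap identification; averaging e**2 over ~20 runs gives Fig. 3.
w, e = modified_dlms(x, d, N=16, mu=0.4, n1=5, n2=1)
```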

[Fig. 4. Proposed structure of the error-computation block.]

III. THE PROPOSED ARCHITECTURE

As shown in Fig. 2, there are two main computing blocks in the adaptive filter architecture: (i) the error-computation block, and (ii) the weight-update block. In this section, we discuss the design strategy of the proposed structure to minimize the adaptation-delay in the error-computation block, followed by the weight-update block.

A. Pipelined Structure of Error-Computation Block

The proposed structure for the error-computation unit of an N-tap DLMS adaptive filter is shown in Fig. 4. It consists of N 2-bit partial product generators (PPGs) corresponding to N multipliers, a cluster of L/2 binary adder-trees, followed by a single shift-add-tree. Each sub-block is described in detail below.

1) Structure of the Partial Product Generator: The structure of each PPG is shown in Fig. 5. It involves L/2 2-to-3 decoders and the same number of AND/OR cells (AOCs).¹ Each of the 2-to-3 decoders takes a 2-bit digit $(u_1 u_0)$ as input and produces the three outputs $b_0 = u_0 \cdot \bar{u}_1$, $b_1 = \bar{u}_0 \cdot u_1$, and $b_2 = u_0 \cdot u_1$, such that $b_0 = 1$ for $(u_1 u_0) = 1$, $b_1 = 1$ for $(u_1 u_0) = 2$, and $b_2 = 1$ for $(u_1 u_0) = 3$. The decoder outputs $b_0$, $b_1$, and $b_2$, along with $w$, $2w$, and $3w$, are fed to an AOC, where $w$, $2w$, and $3w$ are in two's complement representation and sign-extended to $(W + 2)$ bits each. To take care of the sign of the input samples while computing the partial product corresponding to the most significant digit (MSD), i.e., $(u_{L-1} u_{L-2})$, of the input sample, AOC-$(L/2 - 1)$ is fed with $w$, $-2w$, and $-w$ as input, since $(u_{L-1} u_{L-2})$ can have the four possible values 0, 1, -2, and -1.

¹We have assumed the word-length of the input L to be even, which is valid in most practical cases.

2) Structure of the AND/OR Cells: The structure and function of an AOC are depicted in Fig. 6. Each AOC consists of three AND cells and two OR cells, whose structure and function are depicted in Fig. 6(b) and Fig. 6(c), respectively. Each AND cell takes an n-bit input D and a single-bit input b, and consists of n AND gates. It distributes all n bits of input D to its n AND gates as one of their inputs; the other input of each of the n AND gates is fed with the single-bit input b. As shown in Fig. 6(c), each OR cell similarly takes a pair of n-bit input words and has n OR gates, where the pair of bits in the same bit position of B and D is fed to the same OR gate.

The output of an AOC is $w$, $2w$, or $3w$, corresponding to the decimal values 1, 2, and 3 of the 2-bit input $(u_1 u_0)$, respectively. The decoder together with the AOC thus performs a multiplication of the input operand $w$ with a two-bit digit $(u_1 u_0)$, such that the PPG of Fig. 5 performs L/2 parallel multiplications of the input word $w$ with 2-bit digits to produce the L/2 partial products of the product word $wu$.
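To make the digit recoding concrete, the following is a small behavioral sketch (our own naming, not the gate-level circuit) of what one PPG computes; the final assertion checks that shift-adding the L/2 partial products reproduces an ordinary two's-complement multiplication:

```python
def ppg_partials(w, u, L=8):
    """The L/2 signed partial products of w * u formed by one PPG (Fig. 5).

    u is an L-bit two's-complement word scanned as L/2 2-bit digits.  For
    each digit the decoder/AOC pair selects 0, w, 2w, or 3w; for the signed
    most significant digit it selects 0, w, -2w, or -w instead.
    """
    u &= (1 << L) - 1                              # L-bit pattern of u
    parts = []
    for j in range(L // 2):
        digit = (u >> (2 * j)) & 0b11
        table = [0, w, -2 * w, -w] if j == L // 2 - 1 else [0, w, 2 * w, 3 * w]
        parts.append(table[digit])
    return parts

def ppg_multiply(w, u, L=8):
    """Shift-add the partial products: digit j carries place value 4**j."""
    return sum(p << (2 * j) for j, p in enumerate(ppg_partials(w, u, L)))

for u in range(-128, 128):                         # sanity check for L = 8
    assert ppg_multiply(23, u) == 23 * u
```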
3) Structure of the Adder-Tree: Conventionally, we would have performed the shift-add operation on the partial products of each PPG separately to obtain each product value, and would then have added all the N product values to compute the desired inner product. However, the shift-add operation to obtain the product value increases the word-length, and consequently increases the adder-size of the N - 1 additions of the product values. To avoid such an increase in the word-size of the adders, we add all the N partial products of the same place value from all the N PPGs by a single adder-tree.

All the L/2 partial products generated by each of the N PPGs are thus added by L/2 binary adder-trees. The outputs of the L/2 adder-trees are then added by a shift-add-tree according to their place values. Each of the binary adder-trees requires $\log_2 N$ stages of adders to add the N partial products, and the shift-add-tree requires $\log_2 L - 1$ stages of adders to add the L/2 outputs of the L/2 binary adder-trees.² The addition scheme of the error-computation block for a 4-tap filter and input word-size L = 8 is shown in Fig. 7. For N = 4 and L = 8, the adder-network requires four binary adder-trees of two stages each and a two-stage shift-add-tree. In Fig. 7, we have shown all possible locations of pipeline latches by dashed lines; these are used to reduce the critical-path to one addition-time.

²When L is not a power of 2, $\log_2 L$ should be replaced by $\lceil \log_2 L \rceil$.
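The reordering of the additions by place value can be checked with a few lines on top of the PPG sketch above (again purely behavioral; in hardware the column sums are produced by binary adder-trees and combined by one shift-add-tree):

```python
def inner_product(weights, xs, L=8):
    """Adder-tree scheme of Fig. 4: for each place value j, one adder-tree
    sums the j-th partial products of all N PPGs; a single shift-add-tree
    then combines the L/2 column sums q_j according to their place values."""
    q = [sum(ppg_partials(w, u, L)[j] for w, u in zip(weights, xs))
         for j in range(L // 2)]                   # L/2 binary adder-trees
    return sum(qj << (2 * j) for j, qj in enumerate(q))  # shift-add-tree

ws, us = [3, -5, 7, 11], [100, -2, 55, -128]       # N = 4, L = 8 example
assert inner_product(ws, us) == sum(w * u for w, u in zip(ws, us))
```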

[Fig. 5. Proposed structure of the partial product generator (PPG). AOC stands for AND/OR cell.]

[Fig. 6. Structure and function of the AND/OR cell. The binary operators · and + in (b) and (c) are implemented using AND and OR gates, respectively.]

If we introduce pipeline latches after every addition, it would require L(N - 1)/2 + L/2 - 1 latches in $\log_2 N + \log_2 L - 1$ stages, which leads to a high adaptation-delay and introduces a large overhead of area and power consumption for large values of N and L. On the other hand, some of those pipeline latches are redundant in the sense that they are not required to maintain a critical-path of one addition-time. The final adder in the shift-add-tree contributes the maximum delay to the critical-path. Based on that observation, we have identified the pipeline latches that do not contribute significantly to the critical-path and excluded them without any noticeable increase of the critical-path. The locations of pipeline latches for filter-lengths N = 8, 16, and 32 and for input size L = 8 are shown in Table I. The pipelining is performed by a feed-forward cut-set retiming of the error-computation block [15].

TABLE I
LOCATION OF PIPELINE LATCHES FOR L = 8 AND N = 8, 16, AND 32.

| N  | Error-Computation Block: Adder-Tree | Error-Computation Block: Shift-Add-Tree | Weight-Update Block: Shift-Add-Tree |
| 8  | stage-2 | stage-1 and 2 | stage-1 |
| 16 | stage-3 | stage-1 and 2 | stage-1 |
| 32 | stage-3 | stage-1 and 2 | stage-2 |

B. Pipelined Structure of Weight-Update Block

The proposed structure for the weight-update block is shown in Fig. 8. It performs N multiply-accumulate (MAC) operations of the form $(\mu \times e) \times x_i + w_i$ to update the N filter weights. The step-size $\mu$ is taken as a negative power of two, so that the multiplication with the recently available error is realized only by a shift operation. Each of the MAC units therefore performs the multiplication of the shifted value of the error with the delayed input samples $x_i$, followed by the addition with the corresponding old weight value $w_i$. All the N multiplications for the MAC operations are performed by N PPGs, followed by N shift-add-trees. Each of the PPGs generates L/2 partial products corresponding to the product of the recently shifted error value $\mu e$ with the L/2 2-bit digits of the input word $x_i$, where the subexpression $3\mu e$ is shared within the multiplier. Since the scaled error $(\mu e)$ is multiplied with all the N delayed input values in the weight-update block, this subexpression can be shared across all the multipliers as well, which leads to a substantial saving of adder complexity. The final outputs of the MAC units constitute the desired updated weights, which are used as inputs to the error-computation block as well as to the weight-update block for the next iteration.
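A behavioral sketch of this block (our naming), reusing ppg_partials() from Section III-A: the single scaled error acts as the common multiplicand of all N PPG-based multipliers, which is what allows its subexpressions (twice and three times the scaled error) to be formed once and shared. Here the scaling by mu = 2**-mu_shift is modeled as a plain arithmetic shift applied after the shift-add:

```python
def weight_update(w_old, e_del, x_del, mu_shift=2, L=8):
    """w_i <- (mu * e) * x_i + w_i for all i, as in Fig. 8 (integer model)."""
    new_w = []
    for wi, xi in zip(w_old, x_del):               # the N MAC units
        parts = ppg_partials(e_del, xi, L)         # shared multiplicand e
        prod = sum(p << (2 * j) for j, p in enumerate(parts))
        new_w.append(wi + (prod >> mu_shift))      # scale by mu, add w_i
    return new_w

print(weight_update([10, 20, 30, 40], 8, [3, -5, 7, 9]))
# -> [16, 10, 44, 58], i.e., w_i + 2 * x_i for mu = 1/4 and e = 8
```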
[Fig. 7. The adder-structure of the filtering unit for N = 4 and L = 8.]

[Fig. 8. Proposed structure of the weight-update block.]

C. Adaptation Delay

As shown in Fig. 2, the adaptation-delay is decomposed into n1 and n2. The error-computation block generates the error delayed by n1 - 1 cycles, as shown in Fig. 4, which is fed to the weight-update block of Fig. 8 after scaling by $\mu$; the input is then delayed by one cycle before the PPG, so that the total delay introduced by the FIR filtering is n1. In Fig. 8, the weight-update block generates $w_{n-1-n_2}$, i.e., the weights delayed by n2 + 1 cycles. However, it should be noted that the delay by one cycle is due to the latch before the PPG, which is already included in the delay of the error-computation block, i.e., n1; therefore, the delay generated in the weight-update block remains n2. If the locations of the pipeline latches are decided as in Table I, n1 becomes 5, where three latches are in the error-computation block, one latch is after the subtraction in Fig. 4, and the other latch is before the PPG in Fig. 8. Also, n2 is set to 1 by a latch in the shift-add-tree of the weight-update block.

[Fig. 9. Fixed-point representation of a binary number ($X_i$: integer word-length, $X_f$: fractional word-length).]
IV. FIXED-POINT IMPLEMENTATION, OPTIMIZATION, SIMULATION, AND ANALYSIS

In this section, we discuss the fixed-point implementation and optimization of the proposed DLMS adaptive filter. A bit-level pruning of the adder-tree is also proposed to reduce the hardware complexity without noticeable degradation of the steady-state MSE.


A. Fixed-Point Design Considerations


For the fixed-point implementation, the choice of word-lengths and radix-points for the input samples, the weights, and the internal signals needs to be decided. Fig. 9 shows the fixed-point representation of a binary number. Let $(X, X_i)$ be the fixed-point representation of a binary number, where X is the word-length and $X_i$ is the integer-length. The word-lengths and locations of the radix-points of $x_n$ and $w_n$ in Fig. 4 need to be predetermined by the hardware designer, taking design constraints such as the desired accuracy and the hardware complexity into consideration. Assuming $(L, L_i)$ and $(W, W_i)$, respectively, as the representations of the input signals and the filter weights, all the other signals in Fig. 4 and Fig. 8 can be decided as shown in Table II. The signal $p_{ij}$, which is the output of the PPG block (shown in Fig. 4), has at most three times the value of the input coefficients. Thus, we can add two more bits to the word-length and to the integer-length of the coefficients to avoid overflow. The output of each stage in the adder-tree in Fig. 7 is one bit more than the size of its input signals, so that the fixed-point representation of the output of the adder-tree with $\log_2 N$ stages becomes $(W + 2 + \log_2 N, W_i + 2 + \log_2 N)$. Accordingly, the output of the shift-add-tree would be of the form $(W + L + \log_2 N, W_i + L_i + \log_2 N)$, assuming that no truncation of any least significant bits (LSBs) is performed in the adder-tree or the shift-add-tree. However, the output of the shift-add-tree is designed to have only W bits, so the most significant W bits of the $(W + L + \log_2 N)$-bit result need to be retained, which gives the fixed-point representation $(W, W_i + L_i + \log_2 N)$ for y, as shown in Table II. Let the representation of the desired signal d be the same as that of y, even though its quantization is usually given as the input. For this purpose, specific scaling/sign-extension and truncation/zero-padding are required. Since the LMS algorithm performs the learning so that y has the same sign as d, the error signal e can also be set to have the same representation as y without overflow after the subtraction.
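The types of Table II can be exercised with a small helper. The following is a behavioral sketch (our naming) of truncation to a fixed-point type (X, Xi) in the sense of Fig. 9: Xi integer bits (sign included), X - Xi fractional bits, truncation of the LSBs, and two's-complement wrap-around on overflow:

```python
import math

def quantize(v, x, xi):
    """Truncate the real value v to the fixed-point type (x, xi) of Fig. 9."""
    frac = x - xi                          # number of fractional bits
    n = math.floor(v * (1 << frac))        # truncation drops the LSBs
    n &= (1 << x) - 1                      # keep x bits (wrap on overflow)
    if n >= 1 << (x - 1):                  # re-interpret the MSB as sign
        n -= 1 << x
    return n / (1 << frac)

print(quantize(0.87654321, 16, 2))         # input type (L, Li) = (16, 2)
# -> 0.87652587890625
```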
It is shown in [4] that the convergence of an N-tap DLMS adaptive filter with n1 adaptation-delay is assured if

$$0 < \mu < \frac{2}{\left(\sigma_x^2 (N - 2) + 2 n_1 - 2\right) \sigma_x^2} \qquad (5)$$

where $\sigma_x^2$ is the average power of the input samples. Furthermore, if the value of $\mu$ is defined as a power of two, $2^{-n}$, where $n \le W_i + L_i + \log_2 N$, the multiplication with $\mu$ is equivalent to a change of the location of the radix-point. Since the multiplication with $\mu$ does not need any arithmetic operation, it does not introduce any truncation error.

If we need to use a smaller step-size, i.e., $n > W_i + L_i + \log_2 N$, some of the LSBs of $e_n$ need to be truncated. If we assume that $n = L_i + \log_2 N$, i.e., $\mu = 2^{-(L_i + \log_2 N)}$ as in Table II, the representation of $\mu e_n$ should be $(W, W_i)$ without any truncation. The weight-increment term s (shown in Fig. 8), which is equivalent to $\mu e_n x_n$, would require the fixed-point representation $(W + L, W_i + L_i)$. However, only $W_i$ of its integer bits are retained in the computation of the shift-add-tree of the weight-update circuit, while the more significant bits are discarded. This is in accordance with the assumption that, as the weights converge toward the optimal value, the weight-increment terms become smaller, and the MSB end of the error term contains more zeros. Also, in our design, $L - L_i$ LSBs of the weight-increment terms are truncated, so that the weight-increment terms have the same fixed-point representation as the weight values. We also assume that no overflow occurs during the addition for the weight-update; otherwise, the word-length of the weights would need to be increased at every iteration, which is not desirable. The assumption is valid since the weight-increment terms are small when the weights have converged. Also, when overflow occurs during the training period, the weight updating is not appropriate and will lead to additional iterations to reach convergence. Accordingly, the updated weight can be computed in the truncated form $(W, W_i)$ and fed into the error-computation block.

TABLE II
FIXED-POINT REPRESENTATION OF THE SIGNALS OF THE PROPOSED DLMS ADAPTIVE FILTER ($\mu = 2^{-(L_i + \log_2 N)}$).

| Signal Name | Fixed-Point Representation |
| x | (L, Li) |
| w | (W, Wi) |
| p | (W + 2, Wi + 2) |
| q | (W + 2 + log2 N, Wi + 2 + log2 N) |
| y, d, e | (W, Wi + Li + log2 N) |
| μe | (W, Wi) |
| r | (W + 2, Wi + 2) |
| s | (W, Wi) |

x, w, p, q, y, d, and e can be found in the error-computation block of Fig. 4; μe, r, and s are defined in the weight-update block of Fig. 8. All subscripts and time indices of signals are omitted for simplicity of notation.
B. Computer Simulation of the Proposed DLMS Filter

The proposed fixed-point DLMS adaptive filter is used for the system identification problem of Section II. $\mu$ is set to 0.5, 0.25, and 0.125 for filter-lengths 8, 16, and 32, respectively, such that the multiplication with $\mu$ does not require any additional circuit. For the fixed-point simulation, the word-lengths and radix-points of the input and the coefficients are set to L = 16, Li = 2, W = 16, Wi = 0, and the Gaussian random input $x_n$ of zero mean and unit variance is scaled down to fit the representation (16, 2). The fixed-point data types of all the other signals are obtained from Table II. Each learning curve is averaged over 50 runs to obtain a clean curve. The proposed design was coded in C++ using the SystemC fixed-point library for different orders of the band-pass filter, i.e., N = 8, N = 16, and N = 32. The corresponding convergence behaviors are shown in Fig. 10. It is found that, as the filter order increases, not only does the convergence become slower, but the steady-state MSE also increases.

[Fig. 10. Mean squared error of the fixed-point DLMS filter output for the system identification for N = 8, N = 16, and N = 32.]
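A minimal way to reproduce the flavor of this experiment (a sketch only; the paper's SystemC model quantizes every internal signal of Table II) is to insert quantize() from above into the weight update of the modified_dlms() sketch of Section II:

```python
# Fixed-point flavor of the modified DLMS update for L = W = 16, Li = 2,
# Wi = 0, and N = 8, so that mu = 2**-(Li + log2 N) is a pure shift.
L_, Li_, W_, Wi_ = 16, 2, 16, 0
mu = 2.0 ** -(Li_ + 3)

def fx_update(w, e_del, x_del):
    """w_{n+1} = Q_(W,Wi)(w_n + mu * e_{n-n1} * x_{n-n1}), per Table II."""
    return [quantize(wi + mu * e_del * xi, W_, Wi_)
            for wi, xi in zip(w, x_del)]
```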


C. Steady-State Error Estimation

In this section, the MSE of the output of the proposed DLMS adaptive filter due to the fixed-point quantization is analyzed. Based on the models introduced in [16] and [17], the MSE of the output in the steady-state is derived in terms of the parameters listed in Table II. Let us denote by primed symbols the truncated quantities due to the fixed-point representation, so that the input and the desired signals can be written as

$$x'_n = x_n + \alpha_n \qquad (6)$$
$$d'_n = d_n + \beta_n \qquad (7)$$

where $\alpha_n$ and $\beta_n$ are the input quantization noise vector and the quantization noise of the desired signal, respectively. The weight vector can be written as

$$w'_n = w_n + \rho_n \qquad (8)$$

where $\rho_n$ is the error vector of the current weights due to the finite precision. The output signal $y'_n$ and the weight-update equation can accordingly be modified, respectively, to the forms

$$y'_n = w'^T_n x'_n + \eta_n \qquad (9)$$
$$w'_{n+1} = w'_n + \mu e'_n x'_n + \gamma_n \qquad (10)$$

where $\eta_n$ and $\gamma_n$ are the errors due to the truncation of the output from the shift-add-tree in the error-computation block and in the weight-update block, respectively. The steady-state MSE in the fixed-point representation can be expressed as

$$E|d_n - y'_n|^2 = E|e_n|^2 + E|\alpha_n^T w_n|^2 + E|\eta_n|^2 + E|\rho_n^T x_n|^2 \qquad (11)$$

where $E|\cdot|$ denotes the mathematical expectation, and the terms $e_n$, $\alpha_n^T w_n$, $\eta_n$, and $\rho_n^T x_n$ are assumed to be uncorrelated. The first term $E|e_n|^2$, where $e_n = d_n - y_n$, is the excess MSE of the infinite-precision computation, whereas the other three terms are due to finite-precision arithmetic.

The second term can be calculated as

$$E|\alpha_n^T w_n|^2 = |w^*_n|^2 (m^2_{\alpha_n} + \sigma^2_{\alpha_n}) \qquad (12)$$

where $w^*_n$ is the optimal Wiener vector, and $m_{\alpha_n}$ and $\sigma^2_{\alpha_n}$ are defined as the mean and variance of $\alpha_n$ when $x_n$ is truncated to the fixed-point type $(L, L_i)$ listed in Table II. $\alpha_n$ can be modeled as a uniform distribution with the following mean and variance:

$$m_{\alpha_n} = -2^{-(L - L_i)}/2 \qquad (13a)$$
$$\sigma^2_{\alpha_n} = 2^{-2(L - L_i)}/12. \qquad (13b)$$

For the calculation of the third term $E|\eta_n|^2$ in (11), we have used the fact that the output from the shift-add-tree in the error-computation block is of the type $(W, W_i + L_i + \log_2 N)$ after the final truncation. Therefore,

$$E|\eta_n|^2 = m^2_{\eta_n} + \sigma^2_{\eta_n} \qquad (14)$$

where

$$m_{\eta_n} = -2^{-(W - (W_i + L_i + \log_2 N))}/2 \qquad (15a)$$
$$\sigma^2_{\eta_n} = 2^{-2(W - (W_i + L_i + \log_2 N))}/12. \qquad (15b)$$

The last term $E|\rho_n^T x_n|^2$ in (11) can be obtained by using the derivation proposed in [17] as

$$E|\rho_n^T x_n|^2 = m^2_{\gamma_n} \frac{\sum_i \sum_k R^{-1}_{ki}}{\mu^2} + \frac{N(\sigma^2_{\gamma_n} + m^2_{\gamma_n})}{2\mu} \qquad (16)$$

where $R_{ki}$ represents the (k, i)th entry of the matrix $E(x_n x_n^T)$. For the weight-update in (10), the first operation is to multiply $e'_n$ with $\mu$, which is equivalent to moving only the location of the radix-point and therefore introduces no truncation error. Only the truncation after the multiplication of $\mu e'_n$ with $x_n$ needs to be considered in order to evaluate $\gamma_n$. Then, we have

$$m_{\gamma_n} = -2^{-(W - W_i)}/2 \qquad (17a)$$
$$\sigma^2_{\gamma_n} = 2^{-2(W - W_i)}/12. \qquad (17b)$$

For a large $\mu$, the truncation error $\eta_n$ from the error-computation block becomes the dominant error source, and (11) can be approximated as $E|\eta_n|^2$. The MSE values are estimated from the analytical expressions as well as from the simulation results by averaging over 50 experiments. Table III shows that the steady-state MSE computed from the analytical expressions matches that obtained from simulation of the proposed architecture for different values of N and $\mu$.
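As a numerical cross-check of our reconstruction of (14)-(15), the dominant-term approximation $E|\eta_n|^2$ reproduces the "Analysis" column of Table III to within a few hundredths of a dB:

```python
import math

# E|eta_n|^2 = m_eta^2 + sigma_eta^2 from (14)-(15), the dominant term of
# (11) for a large mu; settings of Table III (L = W = 16, Li = 2, Wi = 0).
W, Wi, Li = 16, 0, 2
for N in (8, 16, 32):
    lsb = 2.0 ** -(W - (Wi + Li + round(math.log2(N))))  # LSB weight of y
    m_eta, var_eta = -lsb / 2, lsb * lsb / 12            # truncation noise
    print(N, round(10 * math.log10(m_eta ** 2 + var_eta), 2))
# -> 8 -71.0, 16 -64.98, 32 -58.96  (dB)
```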
TABLE III
ESTIMATED AND SIMULATED STEADY-STATE MSES OF THE FIXED-POINT DLMS ADAPTIVE FILTER (L = W = 16).

| Filter-Length | Step-Size ($\mu$) | Simulation | Analysis |
| N = 8  | $2^{-1}$ | -71.01 dB | -70.97 dB |
| N = 16 | $2^{-2}$ | -64.84 dB | -64.97 dB |
| N = 32 | $2^{-3}$ | -58.72 dB | -58.95 dB |

D. Adder-Tree Optimization

The adder-tree and the shift-add-tree for the computation of $y_n$ can be pruned for further optimization of the area, delay, and power complexity. To illustrate the proposed pruning optimization of the adder-tree and the shift-add-tree for the computation of the filter output, we take the simple example of filter-length N = 4, with both word-lengths L and W taken to be 8. The dot-diagram of the adder-tree is shown in Fig. 11. Each row of the dot-diagram contains 10 dots, which represent the partial products generated by a PPG unit for W = 8. There are four sets of partial products, corresponding to the four partial products of each multiplier, since L = 8. Each set of partial products of the same place value contains four terms, since N = 4. The final sum without truncation would be 18 bits wide. However, we use only 8 bits in the final sum, and the remaining 10 bits are finally discarded. To reduce the computational complexity, some of the LSBs of the inputs of the adder-tree can be truncated, while some guard bits can be used to minimize the impact of the truncation on the error performance of the adaptive filter. In Fig. 11, four bits are taken as guard bits and the remaining six LSBs are truncated. To obtain further hardware savings, the bits to be truncated are not generated by the PPGs at all, so that the complexity of the PPGs is also reduced.

[Fig. 11. Dot-diagram for the optimization of the adder-tree in the case of N = 4, L = 8, and W = 8.]

[Fig. 12. Steady-state MSE vs. k1 for N = 8, 16, and 32 (L = W = 16).]
$\eta_n$ defined in (9) increases if we prune the adder-tree, and the worst-case error is caused when all the truncated bits are one. For the calculation of the sum of the truncated values in the worst case, let us denote by k1 the bit-location of the MSB of the truncated bits, and by N·k2 the number of rows affected by the truncation. In the example of Fig. 11, k1 and k2 are set to 5 and 3, respectively, since the bit-positions from 0 to 5 are truncated and a total of 12 rows are affected by the truncation for N = 4. Also, k2 can be derived from k1 as

$$k_2 = \lfloor k_1/2 \rfloor + 1 \ \text{for} \ k_1 < W, \ \text{otherwise} \ k_2 = W/2 \qquad (18)$$

since the number of truncated bits is reduced by two for every group of N rows, as shown in Fig. 11. Using k1, k2, and N, the sum of the truncated values in the worst case can be formulated as

$$b_{worst} = N \sum_{j=0}^{k_2 - 1} \sum_{i=2j}^{k_1} 2^i = N \left( k_2 \, 2^{k_1 + 1} - \frac{1}{3}(4^{k_2} - 1) \right). \qquad (19)$$

In the example of Fig. 11, $b_{worst}$ amounts to 684. Meanwhile, the LSB weight of the output of the adder-tree after the final truncation is $2^{10}$ in the example. Therefore, it can be shown that in more than 50% of cases there might be a one-bit difference due to the pruning. The truncation error from each row (a total of twelve rows, from row $p_{00}$ to row $p_{32}$ in Fig. 11) has a uniform distribution, and if the individual errors are assumed to be independent of each other, the mean and variance of the total error introduced can be calculated as the sum of the means and variances of the individual random variables. However, it is unlikely that outputs from the same PPG are uncorrelated, since they are generated from the same input sample, so it would not be straightforward to estimate the distribution of the error due to the pruning. Nevertheless, as the value of $b_{worst}$ becomes closer to or larger than the LSB weight of the output after the final truncation, the pruning affects the overall error more. Fig. 12 illustrates the steady-state MSE in terms of k1 for N = 8, 16, and 32 when L = W = 16, to show how much the pruning affects the output MSE. When k1 is less than 10 for N = 8, the MSE deterioration is less than 1 dB compared with the case where no pruning is applied.
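Equations (18) and (19) are easy to check numerically; the sketch below (under our reconstruction of the condition in (18)) compares the closed form of (19) against a direct bit-by-bit sum and reproduces the value 684 quoted for the example of Fig. 11:

```python
def k2_of(k1, W=8):
    """(18): the number of N-row groups affected by the truncation."""
    return k1 // 2 + 1 if k1 < W else W // 2

def b_worst(k1, N=4, W=8):
    """(19): worst-case sum of truncated values (all truncated bits one)."""
    k2 = k2_of(k1, W)
    direct = N * sum(2 ** i for j in range(k2) for i in range(2 * j, k1 + 1))
    closed = N * (k2 * 2 ** (k1 + 1) - (4 ** k2 - 1) // 3)
    assert direct == closed
    return closed

print(b_worst(5))    # -> 684, as in the example of Fig. 11
```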
V. COMPLEXITY CONSIDERATIONS

The hardware and time complexities of the proposed design, of the structure of [11], and of the best of the systolic structures [10] are listed in Table IV. The original DLMS structure proposed in [4] is also listed in this table. It is found that the proposed design has a shorter critical-path of one addition-time, as does [11], and a lower adaptation-delay than the others. If we consider each multiplier to consist of (L - 1) adders, then the existing designs involve 16N adders for L = 8, while the proposed one involves 10N + 2 adders. Similarly, the proposed design involves fewer delay registers than the others.

We have coded the proposed designs in VHDL and synthesized them with the Synopsys Design Compiler using a CMOS 65-nm library for different filter orders. The word-lengths of the input samples and the weights are chosen to be 8, i.e., L = W = 8. The step-size $\mu$ is chosen to be $2^{-(L_i + \log_2 N)}$ to realize its multiplication without any additional circuitry. The word-lengths of all the other signals are determined based on the types listed in Table II. For a fair comparison, we have also coded the structures proposed in [10] and [11] in VHDL, and synthesized them using the same library and synthesis options in the Design Compiler. In Table V, we show the synthesis results of the proposed and existing designs in terms of the data-arrival-time (DAT), area, energy-per-sample (EPS), area-delay-product (ADP), and energy-delay-product (EDP) obtained for filter-lengths N = 8, 16, and 32.

TABLE IV
COMPARISON OF HARDWARE AND TIME COMPLEXITIES OF DIFFERENT ARCHITECTURES FOR L = 8.

| Design | Critical-Path | n1 | n2 | # adders | # multipliers | # registers |
| Long et al. [4] | $T_M + T_A$ | $\log_2 N + 1$ | 0 | 2N | 2N | $3N + 2\log_2 N + 1$ |
| Ting et al. [11] | $T_A$ | $\log_2 N + 5$ | 3 | 2N | 2N | 10N + 8 |
| Van and Feng [10] | $T_M + T_A$ | N/4 + 3 | 0 | 2N | 2N | 5N + 3 |
| Proposed Design | $T_A$ | 5 | 1 | 10N + 2 | 0 | 2N + 14 + E |

E = 24, 40, and 48 for N = 8, 16, and 32, respectively. In addition, the proposed design needs 24N AND cells and 16N OR cells. The two's complement operator in Figs. 5 and 8 is counted as one adder, and it is assumed that the multiplication with the step-size does not need a multiplier in any of the structures.

TABLE V
PERFORMANCE COMPARISON OF DLMS ADAPTIVE FILTERS BASED ON SYNTHESIS RESULTS USING A CMOS 65-NM LIBRARY.

| Design | N | DAT (ns) | Latency (cycles) | Area (sq.um) | Leakage power (mW) | EPS (mW·ns) | ADP (sq.um·ns) | EDP (mW·ns²) | ADP reduction | EDP reduction |
| Ting et al. [11] | 8 | 1.14 | 8 | 24204 | 0.13 | 18.49 | 26867 | 21.08 | - | - |
| | 16 | 1.19 | 9 | 48049 | 0.27 | 36.43 | 55737 | 43.35 | - | - |
| | 32 | 1.25 | 10 | 95693 | 0.54 | 72.37 | 116745 | 90.47 | - | - |
| Van and Feng [10] | 8 | 1.35 | 5 | 13796 | 0.08 | 7.29 | 18349 | 9.84 | - | - |
| | 16 | 1.35 | 7 | 27739 | 0.16 | 14.29 | 36893 | 19.30 | - | - |
| | 32 | 1.35 | 11 | 55638 | 0.32 | 27.64 | 73998 | 37.31 | - | - |
| Proposed Design-I | 8 | 1.14 | 5 | 14029 | 0.07 | 8.53 | 15572 | 9.72 | 15.13% | 1.14% |
| | 16 | 1.19 | 5 | 26660 | 0.14 | 14.58 | 31192 | 17.34 | 15.45% | 10.10% |
| | 32 | 1.25 | 5 | 48217 | 0.27 | 21.00 | 58824 | 26.25 | 20.50% | 29.64% |
| Proposed Design-II | 8 | 0.99 | 5 | 12765 | 0.06 | 8.42 | 12382 | 8.33 | 32.51% | 15.22% |
| | 16 | 1.15 | 5 | 24360 | 0.13 | 14.75 | 27526 | 16.96 | 25.38% | 12.07% |
| | 32 | 1.15 | 5 | 43233 | 0.24 | 21.23 | 48853 | 24.41 | 33.98% | 34.55% |

DAT: data-arrival-time; EPS: energy per sample; ADP: area-delay-product; EDP: energy-delay-product. The ADP and EDP reductions in the last two columns are the improvements of the proposed designs over [10], in percent. Proposed Design-I: without optimization; Proposed Design-II: after optimization of the adder-tree with k1 = 5.

The proposed design-I, before the pruning of the adder-tree, has the same DAT as the design of [11], since the critical-paths of both designs are the same ($T_A$), as shown in Table IV, while the design of [10] has a longer DAT, equivalent to $T_M + T_A$. The proposed design-II, after the pruning of the adder-tree, has a slightly smaller DAT than the existing designs. Also, the proposed designs reduce the area by using a PPG based on common subexpression sharing, compared with the existing designs. As shown in Table V, the reduction in area is more significant in the case of N = 32, since more sharing can be obtained for large-order filters. The proposed designs also achieve a smaller area and a larger power reduction than [11] by removing the redundant pipeline latches that are not required to maintain a critical-path of one addition-time. It is found that the proposed design-I involves nearly 17% less ADP and nearly 14% less EDP than the best prior work of [10], on average, for filter-lengths N = 8, 16, and 32. The proposed design-II, similarly, involves nearly 31% less ADP and nearly 21% less EDP than the structure of [10] for the same filters. The optimization of the adder-tree of the proposed structure with k1 = 5 offers nearly 20% less ADP and 9% less EDP over the structure before the optimization of the adder-tree.
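The averages quoted here follow directly from the last two columns of Table V; a short check (values in percent):

```python
adp_I, edp_I = [15.13, 15.45, 20.50], [1.14, 10.10, 29.64]
adp_II, edp_II = [32.51, 25.38, 33.98], [15.22, 12.07, 34.55]
avg = lambda v: sum(v) / len(v)
print(avg(adp_I), avg(edp_I))     # -> ~17.0, ~13.6  ("nearly 17%", "14%")
print(avg(adp_II), avg(edp_II))   # -> ~30.6, ~20.6  ("nearly 31%", "21%")
```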
The proposed designs have also been implemented on the field-programmable gate-array (FPGA) platform of Xilinx devices. The number of slices (NOS) and the maximum usable frequency (MUF) using two different devices, the Spartan-3A DSP (XC3SD1800A-4FG676) and the Virtex-4 (XC4VSX35-10FF668), are listed in Table VI. The proposed design-II, after the pruning, offers nearly 11.86% less slice-delay-product, calculated as NOS/MUF, on average over N = 8, 16, and 32 for the two devices.
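The quoted slice-delay-product saving can likewise be recomputed from Table VI (NOS/MUF per design, averaged over both devices and all three filter lengths):

```python
nos_I  = [1024, 2036, 4036, 1025, 2049, 4060]   # design-I, Table VI order
muf_I  = [148.7, 121.2, 121.2, 87.2, 70.3, 70.3]
nos_II = [931, 1881, 3673, 966, 1915, 3750]     # design-II
muf_II = [151.6, 124.6, 124.6, 93.8, 74.8, 75.7]
red = [1 - (b / d) / (a / c)
       for a, b, c, d in zip(nos_I, nos_II, muf_I, muf_II)]
print(100 * sum(red) / len(red))                # -> ~11.9%, cf. 11.86%
```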
VI. CONCLUSIONS

We have proposed an area-delay-power efficient, low adaptation-delay architecture for the fixed-point implementation of the LMS adaptive filter. We have used a novel partial product generator for the efficient implementation of general multiplications and inner-product computation by common subexpression sharing. For an L-bit input width, the proposed partial product generator performs L/2 multiplications of one of the operands with 2-bit digits of the other in parallel, where the simple subexpressions w, 2w, and 3w are shared across all the 2-bit multiplications. We have proposed an efficient addition scheme for the inner-product computation to reduce the adaptation-delay significantly, in order to achieve a faster convergence performance, and to reduce the critical-path so as to support a high input sampling rate. Besides, we have proposed a strategy for optimized balanced pipelining across the time-consuming blocks of the structure, to reduce the adaptation-delay and the power consumption as well. The proposed structure involves significantly less adaptation-delay and provides significant savings in area-delay-product and energy-delay-product compared with the existing structures. We have also proposed a fixed-point implementation of the proposed architecture, and derived the expression for the steady-state error. We find that the steady-state MSE obtained from the analytical result matches the simulation result.
TABLE VI
FPGA IMPLEMENTATIONS OF THE PROPOSED DESIGNS FOR L = 8 AND N = 8, 16, AND 32.

Xilinx Virtex-4 (XC4VSX35-10FF668)
| N | Design-I NOS | Design-I MUF | Design-II NOS | Design-II MUF |
| 8  | 1024 | 148.7 | 931  | 151.6 |
| 16 | 2036 | 121.2 | 1881 | 124.6 |
| 32 | 4036 | 121.2 | 3673 | 124.6 |

Xilinx Spartan-3A DSP (XC3SD1800A-4FG676)
| N | Design-I NOS | Design-I MUF | Design-II NOS | Design-II MUF |
| 8  | 1025 | 87.2 | 966  | 93.8 |
| 16 | 2049 | 70.3 | 1915 | 74.8 |
| 32 | 4060 | 70.3 | 3750 | 75.7 |

NOS stands for the number of slices; MUF stands for the maximum usable frequency in MHz.

We have also discussed a pruning scheme that provides nearly 20% saving in the area-delay-product and 9% saving in the energy-delay-product over the proposed structure before pruning, without noticeable degradation of the steady-state error performance. The highest sampling rate that can be supported by the ASIC implementation of the proposed design ranges from about 870 to 1010 MHz for filter orders 8 to 32. When the adaptive filter is required to operate at a lower sampling rate, one can use the proposed design with a clock slower than the maximum usable frequency, and a lower operating voltage can then be used to reduce the power consumption further.
REFERENCES

[1] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[2] S. Haykin and B. Widrow, Least-Mean-Square Adaptive Filters. Hoboken, NJ: Wiley-Interscience, 2003.
[3] M. D. Meyer and D. P. Agrawal, "A modular pipelined implementation of a delayed LMS transversal adaptive filter," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), May 1990, pp. 1943-1946.
[4] G. Long, F. Ling, and J. G. Proakis, "The LMS algorithm with delayed coefficient adaptation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 9, pp. 1397-1405, Sep. 1989.
[5] G. Long, F. Ling, and J. G. Proakis, "Corrections to 'The LMS algorithm with delayed coefficient adaptation'," IEEE Transactions on Signal Processing, vol. 40, no. 1, pp. 230-232, Jan. 1992.
[6] H. Herzberg, R. Haimi-Cohen, and Y. Be'ery, "A systolic array realization of an LMS adaptive filter and the effects of delayed adaptation," IEEE Transactions on Signal Processing, vol. 40, no. 11, pp. 2799-2803, Nov. 1992.
[7] M. D. Meyer and D. P. Agrawal, "A high sampling rate delayed LMS filter architecture," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 40, no. 11, pp. 727-729, Nov. 1993.
[8] S. Ramanathan and V. Visvanathan, "A systolic architecture for LMS adaptive filtering with minimal adaptation delay," in Proc. International Conference on VLSI Design (VLSID), Jan. 1996, pp. 286-289.
[9] Y. Yi, R. Woods, L.-K. Ting, and C. F. N. Cowan, "High speed FPGA-based implementations of delayed-LMS filters," Journal of VLSI Signal Processing, vol. 39, no. 1-2, pp. 113-131, Jan. 2005.
[10] L. D. Van and W. S. Feng, "An efficient systolic architecture for the DLMS adaptive filter and its applications," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 48, no. 4, pp. 359-366, Apr. 2001.
[11] L.-K. Ting, R. Woods, and C. F. N. Cowan, "Virtex FPGA implementation of a pipelined adaptive LMS predictor for electronic support measures receivers," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 1, pp. 86-99, Jan. 2005.
[12] P. K. Meher and M. Maheshwari, "A high-speed FIR adaptive filter architecture using a modified delayed LMS algorithm," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Rio de Janeiro, Brazil, May 2011.
[13] P. K. Meher and S. Y. Park, "Low adaptation-delay LMS adaptive filter part-I: Introducing a novel multiplication cell," in Proc. IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Aug. 2011.
[14] P. K. Meher and S. Y. Park, "Low adaptation-delay LMS adaptive filter part-II: An optimized architecture," in Proc. IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Aug. 2011.
[15] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: John Wiley & Sons, 1999.
[16] C. Caraiscos and B. Liu, "A roundoff error analysis of the LMS adaptive algorithm," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 1, pp. 34-41, Feb. 1984.
[17] R. Rocher, D. Menard, O. Sentieys, and P. Scalart, "Accuracy evaluation of fixed-point LMS algorithm," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2004, pp. 237-240.

Pramod Kumar Meher (SM'03) received the B.Sc. (Honours) and M.Sc. degrees in physics and the Ph.D. degree in science from Sambalpur University, Sambalpur, India, in 1976, 1978, and 1996, respectively.

Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. Prior to this assignment, he was a visiting faculty member with the School of Computer Engineering, Nanyang Technological University, Singapore. He was a Professor of Computer Applications with Utkal University, Bhubaneswar, India, from 1997 to 2002, a Reader in Electronics with Berhampur University, Berhampur, India, from 1993 to 1997, and a Lecturer in Physics with various Government Colleges in India from 1981 to 1993. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image, and video processing, communication, bioinformatics, and intelligent computing. He has contributed more than 180 technical papers to various reputed journals and conference proceedings.

Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India, and a Senior Member of the IEEE. He served as an Associate Editor for the IEEE Transactions on Circuits and Systems II: Express Briefs during 2008-2011. Currently, he is serving as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society, and as an Associate Editor for the IEEE Transactions on Circuits and Systems I: Regular Papers, the IEEE Transactions on Very Large Scale Integration (VLSI) Systems, and the Journal of Circuits, Systems, and Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for the year 1999.

Sang Yoon Park (S'03-M'11) received the B.S., M.S., and Ph.D. degrees from the Department of Electrical Engineering and Computer Science, Seoul National University, Seoul, Korea, in 2000, 2002, and 2006, respectively. He joined the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, as a Research Fellow in 2007. Since 2008, he has been with the Institute for Infocomm Research, Singapore, where he is currently a Research Scientist. His research interests include dedicated/reconfigurable architectures and algorithms for low-power/low-area/high-performance digital signal processing and communication systems.
