I. INTRODUCTION
Adaptive digital filters find wide application in digital signal processing (DSP), e.g., in noise and echo cancellation, system identification, and channel equalization. The finite impulse response (FIR) filter whose weights are updated by the famous Widrow-Hoff least mean square (LMS) algorithm [1] may be considered the simplest known adaptive filter. The LMS adaptive filter is the most popular one not only due to its simplicity but also due to its satisfactory convergence performance [2]. The direct-form LMS adaptive filter, however, involves a long critical path due to the inner-product computation needed to obtain the filter output. The critical path must be reduced by pipelined implementation when it exceeds the desired sample period. Since the conventional LMS algorithm does not support pipelined implementation due to its recursive behavior, it is modified to a form called the delayed LMS (DLMS) algorithm [3]-[5], which allows pipelined implementation of the filter.
Much work has been done to implement the DLMS algorithm in systolic architectures to increase the maximum usable frequency [3], [6], [7]; such architectures can support very high sampling frequencies due to pipelining after every multiply-accumulate (MAC) operation. However, they involve an adaptation delay of about N cycles for filter length N, which is quite high.
The authors are with the Institute for Infocomm Research, Singapore,
138632 (e-mail: pkmeher@i2r.a-star.edu.sg; sypark@i2r.a-star.edu.sg).
Corresponding author of this paper is Sang Yoon Park.
A preliminary version of this paper was presented at the IEEE International Midwest Symposium on Circuits and Systems (MWSCAS 2011).
[Fig. 1. Structure of the conventional delayed LMS adaptive filter, with the FIR filter block, the weight-update block, and an adaptation delay of m cycles.]

[Fig. 2. Structure of the modified delayed LMS adaptive filter, with the adaptation delay decomposed into n1 cycles in the error-computation block and n2 cycles in the weight-update block.]

The weights of the LMS adaptive filter during the nth iteration are updated according to

    w_{n+1} = w_n + μ · e_n · x_n    (1a)

where

    e_n = d_n - y_n  and  y_n = w_n^T · x_n    (1b)

x_n being the input vector, w_n the weight vector of the N-tap FIR filter at the nth iteration, d_n the desired response, y_n the filter output, e_n the error, and μ the step size.

In the DLMS algorithm, the weights are instead updated using an error (and input) delayed by m cycles:

    w_{n+1} = w_n + μ · e_{n-m} · x_{n-m}    (2)

where m is the adaptation delay. In the modified DLMS algorithm of Fig. 2, the adaptation delay is decomposed into n1 cycles for the error computation and n2 cycles for the weight update, so that

    w_{n+1} = w_n + μ · e_{n-n1} · x_{n-n1}    (3a)

where

    e_{n-n1} = d_{n-n1} - y_{n-n1}    (3b)

and

    y_n = w_{n-n2}^T · x_n.    (3c)

[Fig. 3. Mean-squared error (dB) of the filter output versus iteration number.]
For the simulation, the impulse response of the unknown system is taken as the 10-tap FIR filter

    h_n = sin(ω_c · (n - 4.5)) / (π · (n - 4.5)),  for n = 0, 1, 2, ..., 9, otherwise h_n = 0    (4)

where ω_c is the cutoff frequency.
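The behavior described by (1)-(3) can be sketched in a short simulation: the DLMS update (2) uses a delayed error yet still converges to the unknown system for a sufficiently small step size. This is an illustrative NumPy sketch, not the authors' hardware implementation; the unknown system, seed, and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8              # filter length
mu = 2.0 ** -6     # step size (a power of two, as assumed for the hardware)
m = 5              # adaptation delay of the DLMS update (2)

h = rng.standard_normal(N)                 # hypothetical unknown system
x = rng.standard_normal(8000)              # input samples
d = np.convolve(x, h)[: len(x)]            # desired signal d_n

w = np.zeros(N)
e = np.zeros(len(x))
for n in range(N - 1, len(x)):
    xn = x[n - N + 1 : n + 1][::-1]        # input vector x_n
    e[n] = d[n] - w @ xn                   # e_n = d_n - w_n^T x_n, as in (1b)
    k = n - m                              # delayed index, as in (2)
    if k >= N - 1:
        xk = x[k - N + 1 : k + 1][::-1]
        w = w + mu * e[k] * xk             # w_{n+1} = w_n + mu * e_{n-m} * x_{n-m}

print(float(np.max(np.abs(w - h))))        # residual weight error after convergence
```

Despite the delay of m cycles in the update, the weights converge to h; the delay mainly slows convergence and tightens the stability bound on μ.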
MEHER AND PARK: AREA-DELAY-POWER EFFICIENT FIXED-POINT LMS ADAPTIVE FILTER WITH LOW ADAPTATION-DELAY
[Fig. 4. Proposed structure of the error-computation block: N 2-bit partial product generators PPG-0 to PPG-(N-1) operating on the delayed input samples x_n, ..., x_{n-(N-1)} and the delayed weights w_{n-n2}(0), ..., w_{n-n2}(N-1), followed by the adder trees producing q_0, ..., q_{L/2-1}.]
III. THE PROPOSED ARCHITECTURE

As shown in Fig. 2, there are two main computing blocks in the adaptive filter architecture, namely (i) the error-computation block and (ii) the weight-update block. In this Section, we discuss the design strategy of the proposed structure to minimize the adaptation delay in the error-computation block and in the weight-update block.

A. Pipelined Structure of the Error-Computation Block

The proposed structure for the error-computation unit of an N-tap DLMS adaptive filter is shown in Fig. 4. It consists of N number of 2-bit partial product generators (PPG) corresponding to the N multipliers, a cluster of L/2 binary adder trees, and a single shift-add tree.

1) Structure of PPG: Each PPG of Fig. 5 performs L/2 parallel multiplications of the input word w with a 2-bit digit to produce L/2 partial products of the product word. It consists of L/2 number of 2-to-3 decoders and the same number of AND/OR cells (AOC). Each of the 2-to-3 decoders takes a 2-bit digit (u1 u0) as input and produces three outputs b0 = u0 · ~u1, b1 = ~u0 · u1, and b2 = u0 · u1, such that b0 = 1 for (u1 u0) = 1, b1 = 1 for (u1 u0) = 2, and b2 = 1 for (u1 u0) = 3. The decoder outputs b0, b1 and b2, along with w, 2w and 3w, are fed to an AOC, where w, 2w and 3w are in 2's complement representation and sign-extended to (W + 2) bits each. To take care of the sign of the input samples while computing the partial product corresponding to the most-significant digit (MSD), i.e., (u_{L-1} u_{L-2}), of the input sample, the AOC-(L/2-1) is fed with w, -2w and -w as input, since (u_{L-1} u_{L-2}) can take the four possible values 0, 1, -2 and -1.

2) Structure of AND-OR Cells: The structure and function of the AOC is depicted in Fig. 6. Each AOC consists of 3 AND cells and 2 OR cells, whose structure and function are depicted in Fig. 6(b) and Fig. 6(c), respectively. Each AND cell takes an n-bit input D and a single-bit input b, and consists of n AND gates; it distributes all the n bits of input D to its n AND gates as one of their inputs, while the other input of all the n AND gates is fed with the single-bit input b. As shown in Fig. 6(c), each OR cell similarly takes a pair of n-bit input words and has n OR gates; a pair of bits in the same bit position in B and D is fed to the same OR gate.

A shift-add operation on the partial products of each PPG to obtain the product value would increase the word length, and consequently increase the adder size of the N - 1 additions of the product values. To avoid such an increase in the word size of the adders, we add all the N partial products of the same place value from all the N PPGs by a single adder tree. All the L/2 partial products generated by each of the N PPGs are thus added by L/2 binary adder trees, and the outputs of the L/2 adder trees are then added by a shift-add tree according to their place values. Each of the binary adder trees requires log2 N stages of adders to add the N partial products, and the shift-add tree requires log2 L - 1 stages of adders to add the L/2 outputs of the L/2 binary adder trees. The addition scheme for the error-computation block for a 4-tap filter and input word size L = 8 is shown in Fig. 7: for N = 4 and L = 8, the adder network requires four binary adder trees of two stages each and a two-stage shift-add tree. In this figure, we have shown all possible locations of pipeline latches by dashed lines, so as to reduce the critical path to one addition time.
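The digit decomposition performed by the PPG can be mimicked in software. The following is a minimal sketch (not the authors' RTL) that multiplies a coefficient w by an L-bit 2's-complement word u using 2-bit digits, the shared {w, 2w, 3w} terms, and the signed most-significant digit fed as {w, -2w, -w}:

```python
# Sketch of the PPG arithmetic of Fig. 5: the product w*u is formed from
# L/2 partial products, one per 2-bit digit of u, shifted by place value.
L = 8  # word length of u

def ppg_product(w: int, u: int) -> int:
    assert -(1 << (L - 1)) <= u < (1 << (L - 1))
    u_bits = (u + (1 << L)) % (1 << L)       # 2's-complement bit pattern of u
    table = [0, w, 2 * w, 3 * w]             # shared subexpressions
    total = 0
    for j in range(L // 2):
        digit = (u_bits >> (2 * j)) & 0b11   # digit (u_{2j+1} u_{2j})
        if j == L // 2 - 1:
            # most-significant digit is signed: value -2*u_{L-1} + u_{L-2}
            pp = [0, w, -2 * w, -w][digit]   # AOC-(L/2-1) inputs: w, -2w, -w
        else:
            pp = table[digit]                # AOC inputs: w, 2w, 3w
        total += pp << (2 * j)               # shift-add according to place value
    return total

print(ppg_product(13, -37))  # -481
```

In the hardware, the shifts are free (wiring), the table entries are computed once per coefficient, and the per-digit selection is done by the 2-to-3 decoder driving an AOC.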
[Fig. 5. Proposed structure of the partial product generator (PPG). AOC stands for AND/OR cell. The 2-to-3 decoders take the digits (u1 u0), (u3 u2), (u5 u4), ..., (u_{L-1} u_{L-2}) of the input sample and drive AOC-0 to AOC-(L/2-1), producing the (W+2)-bit partial products w(2u1+u0), w(2u3+u2), w(2u5+u4), ..., w(-2u_{L-1}+u_{L-2}) from the shared terms w, 2w, 3w (and -w, -2w for the MSD).]
The function of the AND cells and OR cells is as follows. An AND cell with single-bit input b and n-bit input D produces R = (r_{n-1} r_{n-2} ... r_0) with r_i = b · d_i for i = 0, 1, ..., n-1. An OR cell with n-bit inputs D1 and D2 produces R with R(i) = D1(i) + D2(i) for 0 ≤ i ≤ n-1.

[Fig. 6. Structure and function of the AND/OR cell. The binary operators · and + in (b) and (c) are implemented using AND and OR gates, respectively.]

B. Pipelined Structure of the Weight-Update Block

The weight-update block computes the N updated weights by N multiply-accumulate (MAC) operations on the delayed input samples and the scaled error. The step size μ is taken to be a negative power of two, which realizes the multiplication with the recently available error only by a shift operation. Since the scaled error (μ × e) is multiplied with all the N delayed input values in the weight-update block, this subexpression can be shared across all the multipliers as well, which leads to a substantial saving of adder complexity. The final outputs of the MAC units constitute the desired updated weights, which are used as inputs to the error-computation block as well as to the weight-update block for the next iteration.
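The shift-based scaling of the error and the sharing of μ·e across all N MAC units can be sketched as follows; this is a toy integer example, and the values and fixed-point scaling are hypothetical, not taken from the paper:

```python
# Sharing the scaled error mu*e across all N weight-update MACs.
N = 4
mu_shift = 3                      # mu = 2^-3, a negative power of two
w = [100, -50, 25, 0]             # weights (integer fixed-point, hypothetical scaling)
x = [7, -3, 2, 5]                 # delayed input samples x_{n-n1}
e = -96                           # delayed error e_{n-n1}

scaled_e = e >> mu_shift          # mu*e by one arithmetic shift, computed once...
w = [wi + scaled_e * xi for wi, xi in zip(w, x)]  # ...and shared by all N MACs
print(w)  # [16, -14, 1, -60]
```

Computing `scaled_e` once mirrors the common-subexpression sharing described above: the shift hardware is instantiated a single time instead of once per multiplier.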
If we introduce pipeline latches after every addition, it would require L(N-1)/2 + L/2 - 1 latches in log2 N + log2 L - 1 stages, which leads to a high adaptation delay and introduces a large overhead of area and power consumption for large values of N and L. On the other hand, some of those pipeline latches are redundant in the sense that they are not required to maintain a critical path of one addition time. The final adder in the shift-add tree contributes the maximum delay to the critical path. Based on that observation, we have identified the pipeline latches that do not contribute significantly to the critical path and could exclude those without any noticeable increase of the critical path. The locations of pipeline latches for filter lengths N = 8, 16 and 32 and for input size L = 8 are shown in Table I. The pipelining is performed by a feed-forward cut-set retiming of the error-computation block [15].

C. Adaptation Delay

As shown in Fig. 2, the adaptation delay is decomposed into n1 and n2. The error-computation block generates the error delayed by n1 - 1 cycles, as shown in Fig. 4, which is fed to the weight-update block shown in Fig. 8 after scaling by μ; the input is then delayed by 1 cycle before the PPG to make the total delay introduced by the FIR filtering equal to n1. In Fig. 8, the weight-update block generates w_{n-1-n2}, the weights delayed by n2 + 1 cycles. However, it should be noted that the delay of 1 cycle is due to the latch before the PPG, which is included in the delay of the error-computation block, i.e., n1. Therefore, the delay generated in the weight-update block is n2.

TABLE I
LOCATION OF PIPELINE LATCHES FOR L = 8 AND N = 8, 16 AND 32.

  N    Error-Computation Block              Weight-Update Block
       Adder-Tree     Shift-Add-Tree        Shift-Add-Tree
  8    stage-2        stage-1 and 2         stage-1
  16   stage-3        stage-1 and 2         stage-1
  32   stage-3        stage-1 and 2         stage-2
[Fig. 7. Addition scheme of the error-computation block for N = 4 and L = 8: the partial products p_{ij} are summed by four two-stage binary adder trees (stage-1 and stage-2) producing q_0, ..., q_3, followed by a two-stage shift-add tree with shifts <<2 and <<4.]
[Fig. 8. Proposed structure of the weight-update block: PPG-0 to PPG-(N-1) operate on the delayed input samples and the scaled error μ·e_{n-n1}, and their shift-add trees produce the updated weights w_{n-1-n2}(0), ..., w_{n-1-n2}(N-1).]

[Figure: system-identification configuration, with the unknown system h_n in parallel with the adaptive filtering block, producing the desired signal d_n, the output y_n, and the error e_n.]
[Fig. 9. Fixed-point representation of a binary number (Xi: integer word-length, Xf: fractional word-length), with the MSB as the sign bit and the radix point between the integer part a_{Xi-1} ... a_1 a_0 and the fractional part b_1 b_2 ... b_{Xf}.]
TABLE II
FIXED-POINT REPRESENTATION OF THE SIGNALS (WORD-LENGTH, INTEGER WORD-LENGTH).

  Signal     Fixed-Point Representation
  x          (L, Li)
  w          (W, Wi)
  p          (W + 2, Wi + 2)
  q          (W + 2 + log2 N, Wi + 2 + log2 N)
  y, d, e    (W, Wi + Li + log2 N)
  μe         (W, Wi)
  r          (W + 2, Wi + 2)
  s          (W, Wi)
The quantized desired signal is modeled as

    d'_n = d_n + η_n    (7)

and the fixed-point weight update becomes

    w'_{n+1} = w'_n + μ · e'_n · x'_n + η_n    (10)

where primed quantities denote fixed-point values and η_n denotes the quantization (truncation) error. The rounding noise of an (L, Li) fixed-point word has variance

    σ² = 2^{-2(L - Li)} / 12    (13)

and, correspondingly, for a (W, Wi) word,

    σ_{η_n}² = 2^{-2(W - Wi)} / 12.    (17)

[Fig. 10. Mean-squared error of the fixed-point DLMS filter output for the system identification, for N = 8, N = 16, and N = 32.]
For a large μ, the truncation error η_n from the error-computation block becomes the dominant error source, and (11) can be approximated as E|η_n|². The MSE values are estimated from the analytical expressions as well as from simulation results averaged over 50 experiments. Table III shows that the steady-state MSE computed from the analytical expression matches that obtained from simulation of the proposed architecture for different values of N and μ.
TABLE III
ESTIMATED AND SIMULATED STEADY-STATE MSEs OF THE FIXED-POINT DLMS ADAPTIVE FILTER (L = W = 16).

  Filter-Length   Step-Size (μ)   Simulation   Analysis
  N = 8           2^-1            -71.01 dB    -70.97 dB
  N = 16          2^-2            -64.84 dB    -64.97 dB
  N = 32          2^-3            -58.72 dB    -58.95 dB
D. Adder-Tree Optimization

The adder tree and shift-add tree for the computation of yn can be pruned for further optimization of area, delay, and power complexity. To illustrate the proposed pruning optimization of the adder tree and shift-add tree for the computation of the filter output, we take a simple example of filter length N = 4, with word lengths L and W both equal to 8. The dot diagram of the adder tree is shown in Fig. 11. Each row of the dot diagram contains 10 dots, which represent the partial products generated by a PPG unit for W = 8. We have 4 sets of partial products corresponding to the 4 partial products of each multiplier, since L = 8. Each set of partial products of the same weight values contains 4 terms, since N = 4. The final sum without truncation would be 18 bits wide. However, we use only 8 bits in the final sum, and the remaining 10 bits are finally discarded. To reduce the computational complexity, some of the LSBs of the inputs of the adder tree can be truncated, while some guard bits can be used to minimize the impact of truncation on the error performance of the adaptive filter. In the figure, 4 bits
[Fig. 11. Dot diagram of the adder tree for N = 4 and L = 8, showing the partial-product rows p_{00}, ..., p_{33}, the truncated bits, the guard bits, the 2-stage adder tree producing q_0, ..., q_3, the 2-stage shift-add tree, and the final truncation.]

[Figure: steady-state MSE (dB) of the filter output versus k1, for N = 8, 16, and 32.]
are taken as the guard bits and the remaining 6 LSBs are truncated. For further hardware saving, the bits to be truncated are not generated by the PPGs at all, so that the complexity of the PPGs is also reduced.

The truncation error η_n defined in (9) increases if we prune the adder tree, and the worst-case error occurs when all the truncated bits equal one. To calculate the sum of the truncated values in the worst case, let k1 denote the bit location of the MSB of the truncated bits, and N·k2 the number of rows affected by the truncation. In the example of Fig. 11, k1 and k2 are 5 and 3, respectively, since bit positions 0 to 5 are truncated and a total of 12 rows are affected by the truncation for N = 4. Moreover, k2 can be derived from k1 as

    k2 = floor(k1/2) + 1  for k1 < W,  otherwise k2 = W/2.    (18)

The worst-case sum of the truncated values is then

    N · Σ_{j=0}^{k2-1} Σ_{i=2j}^{k1} 2^i = N · [ k2 · 2^{k1+1} - (4^{k2} - 1)/3 ].    (19)
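Equations (18) and (19) can be checked numerically: a brute-force sum over the truncated bit positions should match the closed form. The following sketch does so for the N = 4, W = 8 example of Fig. 11 (the function names are illustrative):

```python
# Numerical check of the k2 rule (18) and the worst-case truncated sum (19).
W, N = 8, 4

def k2_of(k1: int) -> int:
    return k1 // 2 + 1 if k1 < W else W // 2                 # eq. (18)

def worst_case_closed_form(k1: int) -> int:
    k2 = k2_of(k1)
    return N * (k2 * 2 ** (k1 + 1) - (4 ** k2 - 1) // 3)     # eq. (19)

def worst_case_direct(k1: int) -> int:
    k2 = k2_of(k1)
    # set j holds rows whose LSB sits at bit 2j; bits 2j..k1 are all ones
    return N * sum(2 ** i for j in range(k2) for i in range(2 * j, k1 + 1))

for k1 in range(2 * W):
    assert worst_case_closed_form(k1) == worst_case_direct(k1)
print(k2_of(5), worst_case_closed_form(5))  # 3 684
```

Note that 4^{k2} - 1 is always divisible by 3, so the closed form stays exact in integer arithmetic.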
TABLE IV
COMPARISON OF HARDWARE AND TIME COMPLEXITIES OF DIFFERENT ARCHITECTURES FOR L = 8.

  Design              Critical-Path   n1           n2   # adders   # multipliers   # registers
  Long et al. [4]     TM + TA         log2 N + 1   0    2N         2N              3N + 2 log2 N + 1
  Ting et al. [11]    TA              log2 N + 5   3    2N         2N              10N + 8
  Van and Feng [10]   TM + TA         N/4 + 3      0    2N         2N              5N + 3
  Proposed Design     TA              5            1    10N + 2    0               2N + 14 + E

E = 24, 40 and 48 for N = 8, 16 and 32, respectively. Besides, the proposed design needs an additional 24N AND cells and 16N OR cells. The 2's complement operator in Figs. 5 and 8 is counted as one adder, and it is assumed that the multiplication with the step size does not need a multiplier in any of the structures.
TABLE V
PERFORMANCE COMPARISON OF DLMS ADAPTIVE FILTERS BASED ON SYNTHESIS RESULTS USING A CMOS 65-NM LIBRARY.

  Design               N    DAT    Latency    Area      Leakage      EPS       ADP          EDP        ADP         EDP
                            (ns)   (cycles)   (sq.um)   power (mW)   (mW.ns)   (sq.um.ns)   (mW.ns2)   reduction   reduction
  Ting et al. [11]     8    1.14   8          24204     0.13         18.49     26867        21.08
                       16   1.19   9          48049     0.27         36.43     55737        43.35
                       32   1.25   10         95693     0.54         72.37     116745       90.47
  Van and Feng [10]    8    1.35   5          13796     0.08         7.29      18349        9.84
                       16   1.35   7          27739     0.16         14.29     36893        19.30
                       32   1.35   11         55638     0.32         27.64     73998        37.31
  Proposed Design-I    8    1.14   5          14029     0.07         8.53      15572        9.72       15.13%      1.14%
                       16   1.19   5          26660     0.14         14.58     31192        17.34      15.45%      10.10%
                       32   1.25   5          48217     0.27         21.00     58824        26.25      20.50%      29.64%
  Proposed Design-II   8    0.99   5          12765     0.06         8.42      12382        8.33       32.51%      15.22%
                       16   1.15   5          24360     0.13         14.75     27526        16.96      25.38%      12.07%
                       32   1.15   5          43233     0.24         21.23     48853        24.41      33.98%      34.55%

DAT: data-arrival-time; ADP: area-delay product; EPS: energy per sample; EDP: energy-delay product. The ADP and EDP reductions in the last two columns are the improvements of the proposed designs over [10], in percentage. Proposed Design-I: without optimization; Proposed Design-II: after optimization of the adder tree with k1 = 5.
two devices.
VI. CONCLUSIONS

We have proposed an area-delay-power efficient, low-adaptation-delay architecture for fixed-point implementation of the LMS adaptive filter. We have used a novel partial product generator for efficient implementation of general multiplications and inner-product computation by common-subexpression sharing. For L-bit input width, the proposed partial product generator performs L/2 multiplications of one of the operands with 2-bit digits of the other in parallel, where simple subexpressions like w, 2w, and 3w are shared across all the 2-bit multiplications. We have proposed an efficient addition scheme for the inner-product computation to reduce the adaptation delay significantly, in order to achieve faster convergence performance and to reduce the critical path to support a high input sampling rate. Besides, we have proposed a strategy for optimized balanced pipelining across the time-consuming blocks of the structure, to reduce the adaptation delay and power consumption as well. The proposed structure involves significantly less adaptation delay and provides significant savings in area-delay product and energy-delay product compared with the existing structures. We have proposed a fixed-point implementation of the proposed architecture and derived the expression for the steady-state error. We find that the steady-state MSE obtained from the analytical result matches that of the simulation result. We have also discussed a pruning scheme which provides nearly 20%
TABLE VI
FPGA IMPLEMENTATIONS OF THE PROPOSED DESIGNS FOR L = 8 AND N = 8, 16 AND 32.

          Proposed Design-I    Proposed Design-II
          NOS       MUF        NOS       MUF
  N = 8   -         151.6      -         -
  N = 16  2036      121.2      1881      124.6
  N = 32  4036      121.2      3673      124.6
  N = 16  2049      70.3       1915      74.8
  N = 32  4060      70.3       3750      75.7

NOS stands for the number of slices. MUF stands for maximum usable frequency in MHz.
saving in area-delay product and 9% saving in energy-delay product over the proposed structure before pruning, without noticeable degradation of the steady-state error performance. The highest sampling rate that could be supported by the ASIC implementation of the proposed design ranges from about 870 to 1010 MHz for filter orders 8 to 32. When the adaptive filter is required to operate at a lower sampling rate, one can use the proposed design with a clock slower than the maximum usable frequency and with a lower operating voltage, to reduce the power consumption further.
REFERENCES

[1] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[2] S. Haykin and B. Widrow, Least-Mean-Square Adaptive Filters. Hoboken, NJ: Wiley-Interscience, 2003.
[3] M. D. Meyer and D. P. Agrawal, "A modular pipelined implementation of a delayed LMS transversal adaptive filter," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), May 1990, pp. 1943-1946.
[4] G. Long, F. Ling, and J. G. Proakis, "The LMS algorithm with delayed coefficient adaptation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 9, pp. 1397-1405, Sep. 1989.
[5] G. Long, F. Ling, and J. G. Proakis, "Corrections to 'The LMS algorithm with delayed coefficient adaptation'," IEEE Transactions on Signal Processing, vol. 40, no. 1, pp. 230-232, Jan. 1992.
[6] H. Herzberg, R. Haimi-Cohen, and Y. Be'ery, "A systolic array realization of an LMS adaptive filter and the effects of delayed adaptation," IEEE Transactions on Signal Processing, vol. 40, no. 11, pp. 2799-2803, Nov. 1992.
[7] M. D. Meyer and D. P. Agrawal, "A high sampling rate delayed LMS filter architecture," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 40, no. 11, pp. 727-729, Nov. 1993.
[8] S. Ramanathan and V. Visvanathan, "A systolic architecture for LMS adaptive filtering with minimal adaptation delay," in Proc. International Conference on VLSI Design (VLSID), Jan. 1996, pp. 286-289.
[9] Y. Yi, R. Woods, L.-K. Ting, and C. F. N. Cowan, "High speed FPGA-based implementations of delayed-LMS filters," Journal of VLSI Signal Processing, vol. 39, no. 1-2, pp. 113-131, Jan. 2005.
[10] L. D. Van and W. S. Feng, "An efficient systolic architecture for the DLMS adaptive filter and its applications," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 48, no. 4, pp. 359-366, Apr. 2001.
[11] L.-K. Ting, R. Woods, and C. F. N. Cowan, "Virtex FPGA implementation of a pipelined adaptive LMS predictor for electronic support measures receivers," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 1, pp. 86-99, Jan. 2005.
Pramod Kumar Meher (SM'03) received the B.Sc. (Honours) and M.Sc. degrees in Physics and the Ph.D. degree in Science from Sambalpur University, Sambalpur, India, in 1976, 1978, and 1996, respectively.
Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. Prior to
this assignment he was a visiting faculty with the
School of Computer Engineering, Nanyang Technological University, Singapore. He was a Professor
of Computer Applications with Utkal University,
Bhubaneswar, India from 1997 to 2002, a Reader
in Electronics with Berhampur University, Berhampur, India from 1993 to
1997, and a Lecturer in Physics with various Government Colleges in India
from 1981 to 1993. His research interest includes design of dedicated and
reconfigurable architectures for computation-intensive algorithms pertaining
to signal, image and video processing, communication, bio-informatics and
intelligent computing. He has contributed more than 180 technical papers to
various reputed journals and conference proceedings.
Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India and a Senior Member of IEEE. He has served
as Associate Editor for the IEEE Transactions on Circuits and Systems-II:
Express Briefs during 2008-2011. Currently, he is serving as a speaker for
the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society,
and Associate Editor for the IEEE Transactions on Circuits and Systems-I:
Regular Papers, IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, and Journal of Circuits, Systems, and Signal Processing. He was the
recipient of the Samanta Chandrasekhar Award for excellence in research in
engineering and technology for the year 1999.