

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 2, FEBRUARY 2012

ACKNOWLEDGMENT
The authors would like to thank National Chip Implementation
Center (CIC), Taiwan for technical support in simulations. The authors
would also like to thank Y.-R. Cho and S.-W. Chen for their assistance
in simulations and layouts.

REFERENCES
[1] H. Kawaguchi and T. Sakurai, "A reduced clock-swing flip-flop (RCSFF) for 63% power reduction," IEEE J. Solid-State Circuits, vol. 33, no. 5, pp. 807–811, May 1998.
[2] A. G. M. Strollo, D. De Caro, E. Napoli, and N. Petra, "A novel high-speed sense-amplifier-based flip-flop," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 11, pp. 1266–1274, Nov. 2005.
[3] H. Partovi, R. Burd, U. Salim, F. Weber, L. DiGregorio, and D. Draper, "Flow-through latch and edge-triggered flip-flop hybrid elements," in IEEE Tech. Dig. ISSCC, 1996, pp. 138–139.
[4] F. Klass, C. Amir, A. Das, K. Aingaran, C. Truong, R. Wang, A. Mehta, R. Heald, and G. Yee, "A new family of semi-dynamic and dynamic flip-flops with embedded logic for high-performance processors," IEEE J. Solid-State Circuits, vol. 34, no. 5, pp. 712–716, May 1999.
[5] S. D. Naffziger, G. Colon-Bonet, T. Fischer, R. Riedlinger, T. J. Sullivan, and T. Grutkowski, "The implementation of the Itanium 2 microprocessor," IEEE J. Solid-State Circuits, vol. 37, no. 11, pp. 1448–1460, Nov. 2002.
[6] J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, and V. De, "Comparative delay and energy of single edge-triggered and dual edge-triggered pulsed flip-flops for high-performance microprocessors," in Proc. ISLPED, 2001, pp. 207–212.
[7] B. Kong, S. Kim, and Y. Jun, "Conditional-capture flip-flop for statistical power reduction," IEEE J. Solid-State Circuits, vol. 36, no. 8, pp. 1263–1271, Aug. 2001.
[8] N. Nedovic, M. Aleksic, and V. G. Oklobdzija, "Conditional precharge techniques for power-efficient dual-edge clocking," in Proc. Int. Symp. Low-Power Electron. Design, Monterey, CA, Aug. 12–14, 2002, pp. 56–59.
[9] P. Zhao, T. Darwish, and M. Bayoumi, "High-performance and low-power conditional discharge flip-flop," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 477–484, May 2004.
[10] C. K. Teh, M. Hamada, T. Fujita, H. Hara, N. Ikumi, and Y. Oowaki, "Conditional data mapping flip-flops for low-power and high-performance systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, pp. 1379–1383, Dec. 2006.
[11] S. H. Rasouli, A. Khademzadeh, A. Afzali-Kusha, and M. Nourani, "Low power single- and double-edge-triggered flip-flops for high speed applications," Proc. Inst. Electr. Eng., Circuits Devices Syst., vol. 152, no. 2, pp. 118–122, Apr. 2005.
[12] H. Mahmoodi, V. Tirumalashetty, M. Cooke, and K. Roy, "Ultra low power clocking scheme using energy recovery and clock gating," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, pp. 33–44, Jan. 2009.
[13] P. Zhao, J. McNeely, W. Kaung, N. Wang, and Z. Wang, "Design of sequential elements for low power clocking system," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., to be published.
[14] Y.-H. Shu, S. Tenqchen, M.-C. Sun, and W.-S. Feng, "XNOR-based double-edge-triggered flip-flop for two-phase pipelines," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 2, pp. 138–142, Feb. 2006.
[15] V. G. Oklobdzija, "Clocking and clocked storage elements in a multi-gigahertz environment," IBM J. Res. Devel., vol. 47, pp. 567–584, Sep. 2003.

Area-Efficient Parallel FIR Digital Filter Structures for Symmetric Convolutions Based on Fast FIR Algorithm
Yu-Chi Tsao and Ken Choi

Abstract: Based on fast finite-impulse response (FIR) algorithms (FFAs), this paper proposes new parallel FIR filter structures that reduce the hardware cost for symmetric coefficients, under the condition that the number of taps is a multiple of 2 or 3. The proposed parallel FIR structures exploit the inherent nature of symmetric coefficients, halving the number of multipliers in the subfilter blocks that contain symmetric coefficients at the expense of additional adders in the preprocessing and postprocessing blocks. Exchanging multipliers for adders is advantageous because adders cost less silicon area than multipliers. In addition, the overhead from the additional adders in the preprocessing and postprocessing blocks stays fixed and does not increase with the length of the FIR filter, whereas the number of saved multipliers increases with the length of the FIR filter. For example, for a four-parallel 72-tap filter, the proposed structure saves 27 multipliers at the expense of 11 adders, whereas for a four-parallel 576-tap filter, the proposed structure saves 216 multipliers, still at the expense of only 11 adders. Overall, the proposed parallel FIR structures can lead to significant hardware savings for symmetric convolutions over the existing FFA parallel FIR filter, especially when the filter is long.
Index Terms: Digital signal processing (DSP), fast finite-impulse response (FIR) algorithms (FFAs), parallel FIR, symmetric convolution, very large scale integration (VLSI).

I. INTRODUCTION
Due to the explosive growth of multimedia applications, the demand for high-performance and low-power digital signal processing (DSP) keeps rising. Finite-impulse response (FIR) digital filters are among the most widely used fundamental building blocks in DSP systems, ranging from wireless communications to video and image processing. Some applications, such as video processing, need the FIR filter to operate at high frequencies, whereas others, such as multiple-input multiple-output (MIMO) systems used in cellular wireless communication, require high throughput with a low-power circuit. Furthermore, when narrow transition-band characteristics are required, a much higher filter order is unavoidable. For example, a 576-tap digital filter is used in a video ghost canceller for broadcast television, which reduces the effect of multipath signal echoes. Parallel processing and pipelining are two techniques used in DSP applications, and both can be exploited to reduce power consumption. Pipelining shortens the critical path by inserting pipelining latches along the datapath, at the price of more latches and a higher system latency, whereas parallel processing increases the sampling rate by replicating hardware so that multiple inputs are processed in parallel and multiple outputs are generated at the same time, at the expense of increased area. Both techniques can reduce power consumption by lowering the supply voltage when the sampling speed is not increased. In this paper, parallel processing in the digital FIR filter is discussed. Because its hardware implementation cost increases linearly with the block size $L$, the parallel processing technique loses its advantage in practical implementation.
There have been a few papers proposing ways to reduce the complexity of the parallel FIR filter in the past [1]–[9]. In [1]–[4], polyphase decomposition is mainly manipulated: small-sized parallel FIR filter structures are derived first, and larger block-sized ones are then constructed by cascading or iterating the small-sized parallel FIR filtering blocks. The fast FIR algorithms (FFAs) introduced in [1]–[3] show that an $L$-parallel filter can be implemented using approximately $(2L - 1)$ subfilter blocks, each of length $N/L$. FFA structures thus break the constraint that the hardware implementation cost of a parallel FIR filter increases linearly with the block size $L$, reducing the required number of multipliers from $L \times N$ to $(2N - N/L)$. In [5]–[9], fast linear convolution is used to develop the small-sized filtering structures, and a long convolution is then decomposed into several short convolutions, i.e., larger block-sized filtering structures are constructed through iterations of the small-sized filtering structures.
However, in both categories of methods, the symmetry of the coefficients has not yet been taken into consideration in the design of the structures for symmetric convolutions, although doing so can lead to a significant saving in hardware cost. In this paper, we provide new parallel FIR filter structures based on FFA, consisting of advantageous polyphase decompositions, which reduce the number of multiplications in the subfilter section by exploiting the inherent nature of the symmetric coefficients, compared to the existing FFA fast parallel FIR filter structure.
This paper is organized as follows. A brief introduction to FFAs is given in Section II. In Section III, the proposed parallel FIR filter structures are presented. Section IV investigates the complexity and comparisons. In Section V, the hardware implementation and the experimental results are described. Section VI gives the conclusion.

Manuscript received July 30, 2010; revised September 20, 2010 and October 22, 2010; accepted November 20, 2010. Date of publication December 30, 2010; date of current version January 18, 2012.
The authors are with the Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616 USA (e-mail: ytsao@iit.edu; kchoi@ece.iit.edu).
Digital Object Identifier 10.1109/TVLSI.2010.2095892
1063-8210/$26.00 © 2010 IEEE

Fig. 1. Two-parallel FIR filter implementation using FFA.

II. FAST FIR ALGORITHM (FFA)

Consider an $N$-tap FIR filter, which can be expressed in the general form
$$y(n) = \sum_{i=0}^{N-1} h(i)\,x(n-i), \quad n = 0, 1, 2, \ldots, \infty \tag{1}$$
where $\{x(n)\}$ is an infinite-length input sequence and $\{h(i)\}$ are the length-$N$ FIR filter coefficients. Then, the traditional $L$-parallel FIR filter can be derived using polyphase decomposition as [3]
$$\sum_{p=0}^{L-1} Y_p(z^L)\,z^{-p} = \sum_{q=0}^{L-1} X_q(z^L)\,z^{-q} \sum_{r=0}^{L-1} H_r(z^L)\,z^{-r} \tag{2}$$
where
$$X_q = \sum_{k=0}^{\infty} z^{-k} x(Lk+q), \quad H_r = \sum_{k=0}^{(N/L)-1} z^{-k} h(Lk+r), \quad Y_p = \sum_{k=0}^{\infty} z^{-k} y(Lk+p)$$
for $p, q, r = 0, 1, 2, \ldots, L-1$. This FIR filtering equation shows that the traditional FIR filter requires $L^2$ FIR subfilter blocks of length $N/L$ for implementation.

A. 2 × 2 FFA (L = 2)

According to (2), a two-parallel FIR filter can be expressed as
$$Y_0 + z^{-1} Y_1 = (H_0 + z^{-1} H_1)(X_0 + z^{-1} X_1) = H_0 X_0 + z^{-1}(H_0 X_1 + H_1 X_0) + z^{-2} H_1 X_1 \tag{3}$$
implying that
$$Y_0 = H_0 X_0 + z^{-2} H_1 X_1, \qquad Y_1 = H_0 X_1 + H_1 X_0. \tag{4}$$
Equation (4) shows the traditional two-parallel filter structure, which requires four length-$N/2$ FIR subfilter blocks and two postprocessing adders, for a total of $2N$ multipliers and $2N - 2$ adders. However, (4) can be written as
$$Y_0 = H_0 X_0 + z^{-2} H_1 X_1$$
$$Y_1 = (H_0 + H_1)(X_0 + X_1) - H_0 X_0 - H_1 X_1. \tag{5}$$
The implementation of (5) requires three FIR subfilter blocks of length $N/2$, one preprocessing adder and three postprocessing adders, for a total of $3N/2$ multipliers and $3(N/2 - 1) + 4$ adders, which is approximately one fourth less than the hardware cost of the traditional two-parallel filter from (4). The two-parallel ($L = 2$) FIR filter implementation using FFA obtained from (5) is shown in Fig. 1.

B. 3 × 3 FFA (L = 3)

By a similar approach, a three-parallel FIR filter using FFA can be expressed as
$$Y_0 = H_0 X_0 - z^{-3} H_2 X_2 + z^{-3}\left[(H_1 + H_2)(X_1 + X_2) - H_1 X_1\right]$$
$$Y_1 = \left[(H_0 + H_1)(X_0 + X_1) - H_1 X_1\right] - (H_0 X_0 - z^{-3} H_2 X_2)$$
$$Y_2 = \left[(H_0 + H_1 + H_2)(X_0 + X_1 + X_2)\right] - \left[(H_0 + H_1)(X_0 + X_1) - H_1 X_1\right] - \left[(H_1 + H_2)(X_1 + X_2) - H_1 X_1\right]. \tag{6}$$
The hardware implementation of (6) requires six length-$N/3$ FIR subfilter blocks, three preprocessing adders and seven postprocessing adders, for a total of $2N$ multipliers and $2N + 4$ adders, which is approximately one third less than the hardware cost of the traditional three-parallel filter. The implementation obtained from (6) is shown in Fig. 2.

Fig. 2. Three-parallel FIR filter implementation using FFA.
III. PROPOSED FFA STRUCTURES FOR SYMMETRIC CONVOLUTIONS

To utilize the symmetry of the coefficients, the main idea behind the proposed structures is intuitive: manipulate the polyphase decomposition to obtain as many subfilter blocks with symmetric coefficients as possible, so that half the multiplications in each such subfilter block can be reused across all of its taps, just as a single FIR filter with symmetric coefficients requires only half as many multiplications as its length. Therefore, for an $N$-tap $L$-parallel FIR filter, the total number of saved multipliers is the number of subfilter blocks that contain symmetric coefficients times half the number of multiplications in a single subfilter block ($N/2L$).

A. 2 × 2 Proposed FFA (L = 2)

From (4), a two-parallel FIR filter can also be written as
$$Y_0 = \tfrac{1}{2}\left[(H_0 + H_1)(X_0 + X_1) + (H_0 - H_1)(X_0 - X_1)\right] - H_1 X_1 + z^{-2} H_1 X_1$$
$$Y_1 = \tfrac{1}{2}\left[(H_0 + H_1)(X_0 + X_1) - (H_0 - H_1)(X_0 - X_1)\right]. \tag{7}$$
For a set of even symmetric coefficients, (7) yields one more subfilter block containing symmetric coefficients than (5), the existing FFA parallel FIR filter. Fig. 3 shows the implementation of the proposed two-parallel FIR filter based on (7).

Fig. 3. Proposed two-parallel FIR filter implementation.

Fig. 4. Subfilter block implementation with symmetric coefficients.

An example is given here for a clearer perspective.
Example 1: Consider a 24-tap FIR filter with a set of symmetric coefficients
$$\{h(0), h(1), h(2), h(3), h(4), h(5), h(6), h(7), h(8), h(9), \ldots, h(23)\}$$
where $h(0) = h(23)$, $h(1) = h(22)$, $h(2) = h(21)$, $h(3) = h(20)$, $h(4) = h(19)$, $h(5) = h(18)$, ..., $h(11) = h(12)$. Applying it to the proposed two-parallel FIR filter structure, the top two subfilter blocks are
$$H_0 \pm H_1 = \{h(0) \pm h(1),\ h(2) \pm h(3),\ h(4) \pm h(5),\ h(6) \pm h(7),\ \ldots,\ h(18) \pm h(19),\ h(20) \pm h(21),\ h(22) \pm h(23)\}$$
where
$$h(0) \pm h(1) = \pm\left(h(22) \pm h(23)\right)$$
$$h(2) \pm h(3) = \pm\left(h(20) \pm h(21)\right)$$
$$h(4) \pm h(5) = \pm\left(h(18) \pm h(19)\right)$$
$$h(6) \pm h(7) = \pm\left(h(16) \pm h(17)\right) \ \ldots \tag{8}$$
As can be seen from the example above, two of the three subfilter blocks in the proposed two-parallel FIR filter structure, $H_0 - H_1$ and $H_0 + H_1$, now have symmetric coefficients, as shown in (8), which means that each of these subfilter blocks can be realized as in Fig. 4 with only half the number of multipliers. Each multiplier output serves two taps. Note that the transposed direct-form FIR filter is employed. Compared to the existing FFA two-parallel FIR filter structure, the proposed FFA structure yields one more subfilter block that contains symmetric coefficients. However, this comes at the price of more adders in the preprocessing and postprocessing blocks. In this case, two additional adders are required for $L = 2$.

Fig. 5. Proposed three-parallel FIR filter implementation.

Fig. 6. Comparison of subfilter blocks between existing FFA and the proposed FFA three-parallel FIR structures.

Fig. 7. Comparison of subfilter blocks between existing FFA and the proposed FFA four-parallel FIR structures.

Fig. 8. Proposed four-parallel FIR filter implementation.
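
The symmetry claimed in Example 1 and the decomposition in (7) can be illustrated with a short Python sketch. It is only a software check under simplifying assumptions (finite even-length sequences, subfilters modeled as plain convolutions, $z^{-2}$ modeled as a one-block delay); the helper names `sum_and_difference` and `proposed_two_parallel` are hypothetical.

```python
def convolve(a, b):
    """Direct linear convolution, used as subfilter model and as reference."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def sum_and_difference(h):
    """Coefficients of the H0 + H1 and H0 - H1 subfilter blocks."""
    h0, h1 = h[0::2], h[1::2]
    s = [p + q for p, q in zip(h0, h1)]
    d = [p - q for p, q in zip(h0, h1)]
    return s, d


def proposed_two_parallel(h, x):
    """Two-parallel structure following (7)."""
    assert len(h) % 2 == 0 and len(x) % 2 == 0
    s, d = sum_and_difference(h)
    h1 = h[1::2]
    x0, x1 = x[0::2], x[1::2]

    u = convolve(s, [p + q for p, q in zip(x0, x1)])   # (H0+H1)(X0+X1)
    v = convolve(d, [p - q for p, q in zip(x0, x1)])   # (H0-H1)(X0-X1)
    b = convolve(h1, x1)                               # H1 X1

    n = len(u)
    # Y0 = 1/2 [u + v] - H1 X1 + z^-2 H1 X1 (one-block delay on the last term)
    y0 = [((0.5 * (u[k] + v[k]) - b[k]) if k < n else 0.0)
          + (b[k - 1] if k >= 1 else 0.0) for k in range(n + 1)]
    # Y1 = 1/2 [u - v]
    y1 = [0.5 * (u[k] - v[k]) for k in range(n)] + [0.0]

    y = [0.0] * (2 * (n + 1))
    y[0::2], y[1::2] = y0, y1
    return y[:len(h) + len(x) - 1]
```

For a symmetric 24-tap `h`, `sum_and_difference(h)` returns one list that reads the same in both directions and one that is its own negated reverse, which is the pattern stated in (8).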

B. 3 × 3 Proposed FFA (L = 3)

With a similar approach, from (6), a three-parallel FIR filter can also be written as
$$Y_0 = \tfrac{1}{2}\left[(H_0 + H_1)(X_0 + X_1) + (H_0 - H_1)(X_0 - X_1)\right] - H_1 X_1 + z^{-3}\Big\{(H_0 + H_1 + H_2)(X_0 + X_1 + X_2) - (H_0 + H_2)(X_0 + X_2) - \tfrac{1}{2}\left[(H_0 + H_1)(X_0 + X_1) - (H_0 - H_1)(X_0 - X_1)\right] - H_1 X_1\Big\}$$
$$Y_1 = \tfrac{1}{2}\left[(H_0 + H_1)(X_0 + X_1) - (H_0 - H_1)(X_0 - X_1)\right] + z^{-3}\Big\{\tfrac{1}{2}\left[(H_0 + H_2)(X_0 + X_2) + (H_0 - H_2)(X_0 - X_2)\right] - \tfrac{1}{2}\left[(H_0 + H_1)(X_0 + X_1) + (H_0 - H_1)(X_0 - X_1)\right] + H_1 X_1\Big\}$$
$$Y_2 = \tfrac{1}{2}\left[(H_0 + H_2)(X_0 + X_2) - (H_0 - H_2)(X_0 - X_2)\right] + H_1 X_1. \tag{9}$$
Fig. 5 shows the implementation of the proposed three-parallel FIR filter. When the number of symmetric coefficients $N$ is a multiple of 3, the proposed three-parallel FIR filter structure presented in (9) yields four subfilter blocks with symmetric coefficients in total, whereas the existing FFA parallel FIR filter structure has only two out of six subfilter blocks. A comparison is shown in Fig. 6, where the shaded blocks stand for the subfilter blocks that contain symmetric coefficients. Therefore, for an $N$-tap three-parallel FIR filter, the proposed structure saves $N/3$ multipliers over the existing FFA structure. However, again, the proposed three-parallel FIR structure also brings an overhead of seven additional adders in the preprocessing and postprocessing blocks.

C. Proposed Cascading FFA

The proposed cascading process for the larger block-sized proposed parallel FIR filter is similar to that introduced in [1]. However, a small modification is adopted here for lower hardware consumption. As we can see, the proposed parallel FIR structure enables the reuse of multipliers in some of the subfilter blocks, but it also brings more adder cost in the preprocessing and postprocessing blocks. When cascading the proposed FFA parallel FIR structures for a larger parallel block factor $L$, the increase in adders can become larger. Therefore, rather than applying the proposed FFA FIR filter structure to all the decomposed subfilter blocks, the existing FFA structures, which have more compact operations in the preprocessing and postprocessing blocks, are employed for those subfilter blocks that contain no symmetric coefficients, whereas the proposed FIR filter structures are still applied to the remaining subfilter blocks with symmetric coefficients. An illustration of the proposed cascading process for a four-parallel FIR filter ($L = 4$) is shown in Fig. 7, and the realization is shown in Fig. 8. From Fig. 7, it is clear that the proposed four-parallel FIR structure yields three more subfilter blocks containing symmetric coefficients than the existing FFA one, which means that $3N/8$ multipliers can be saved for an $N$-tap FIR filter, at the price of 11 additional adders in the preprocessing and postprocessing blocks. By this cascading approach, parallel FIR filter structures with a larger block factor $L$ can be realized. The proposed six-parallel FIR filter results in six more symmetric subfilter blocks than the existing FFA, equivalently $N/2$ multipliers saved for an $N$-tap FIR filter, at the expense of an additional 32 adders. Also, the proposed eight-parallel FIR filter leads to seven more symmetric subfilter blocks than the existing FFA, equivalently $7N/16$ multipliers saved for an $N$-tap filter, with an overhead of an additional 54 adders.
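
Equation (9) can be checked in software in the same spirit. The sketch below is a hedged illustration rather than the authors' implementation: it assumes sequence lengths that are multiples of 3, models the six subfilter blocks as plain convolutions, and treats $z^{-3}$ as a delay of one three-sample block. All function names are hypothetical.

```python
def convolve(a, b):
    """Direct linear convolution, used as subfilter model and as reference."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def sub(a, b):
    return [p - q for p, q in zip(a, b)]


def subfilter_coefficients(h):
    """Coefficient sets of the six subfilter blocks that appear in (9)."""
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]
    return [add(h0, h1), sub(h0, h1), add(h0, h2), sub(h0, h2),
            add(add(h0, h1), h2), h1]


def count_symmetric_blocks(h):
    """Count blocks whose coefficients mirror themselves up to a sign."""
    def mirrored(c):
        r = c[::-1]
        return c == r or c == [-v for v in r]
    return sum(1 for c in subfilter_coefficients(h) if mirrored(c))


def proposed_three_parallel(h, x):
    """Three-parallel structure following (9)."""
    assert len(h) % 3 == 0 and len(x) % 3 == 0
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]
    c01p, c01m, c02p, c02m, c012, c1 = subfilter_coefficients(h)
    p01 = convolve(c01p, add(x0, x1))            # (H0+H1)(X0+X1)
    m01 = convolve(c01m, sub(x0, x1))            # (H0-H1)(X0-X1)
    p02 = convolve(c02p, add(x0, x2))            # (H0+H2)(X0+X2)
    m02 = convolve(c02m, sub(x0, x2))            # (H0-H2)(X0-X2)
    p012 = convolve(c012, add(add(x0, x1), x2))  # (H0+H1+H2)(X0+X1+X2)
    b = convolve(c1, x1)                         # H1 X1
    n = len(b)
    k_all = range(n)
    y0_now = [0.5 * (p01[k] + m01[k]) - b[k] for k in k_all]
    y0_del = [p012[k] - p02[k] - 0.5 * (p01[k] - m01[k]) - b[k] for k in k_all]
    y1_now = [0.5 * (p01[k] - m01[k]) for k in k_all]
    y1_del = [0.5 * (p02[k] + m02[k]) - 0.5 * (p01[k] + m01[k]) + b[k]
              for k in k_all]
    y2 = [0.5 * (p02[k] - m02[k]) + b[k] for k in k_all] + [0.0]

    def combine(now, delayed):
        # now + z^-3 * delayed, with z^-3 as a one-block delay
        return [(now[k] if k < n else 0.0) + (delayed[k - 1] if k >= 1 else 0.0)
                for k in range(n + 1)]

    y = [0.0] * (3 * (n + 1))
    y[0::3] = combine(y0_now, y0_del)
    y[1::3] = combine(y1_now, y1_del)
    y[2::3] = y2
    return y[:len(h) + len(x) - 1]
```

For a symmetric `h` whose length is a multiple of 3, `count_symmetric_blocks(h)` is expected to report four of the six blocks, in line with the count stated for (9), provided the chosen coefficients do not create accidental extra symmetries.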

IV. COMPLEXITY ANALYSIS AND COMPARISON

When an $L$-parallel FIR filter comes with a set of symmetric coefficients of length $N$, the number of required multipliers $M$ for the proposed parallel FIR filter structures is given by (10) and (11).

Case 1: When $N / \prod_{i=1}^{r} L_i$ is even,
$$M = \frac{N}{\prod_{i=1}^{r} L_i}\left(\prod_{i=1}^{r} M_i - \frac{S}{2}\right). \tag{10}$$
Case 2: When $N / \prod_{i=1}^{r} L_i$ is odd,
$$M = \frac{N}{\prod_{i=1}^{r} L_i}\prod_{i=1}^{r} M_i - \frac{S}{2}\left(\frac{N}{\prod_{i=1}^{r} L_i} - 1\right). \tag{11}$$
Here, $L_i$ is the small parallel block size, such as a (2 × 2) or (3 × 3) FFA; $r$ is the number of FFAs used; $M_i$ is the number of subfilter blocks resulting from the $i$-th FFA; and $S$ is the number of subfilter blocks containing symmetric coefficients. The number of required adders in the subfilter section is given by
$$A_{\mathrm{sub}} = \prod_{i=1}^{r} M_i\left(\frac{N}{\prod_{i=1}^{r} L_i} - 1\right). \tag{12}$$
A comparison between the proposed and the existing FFA structures for even symmetric coefficients of different lengths under different levels of parallelism is summarized in Table I. Also, a comparison between the proposed structures and other structures for a 144-tap FIR filter with parallel block sizes 4 and 8 is shown in Table II.

TABLE I
COMPARISON OF PROPOSED AND THE EXISTING FFA STRUCTURES: NUMBER OF REQUIRED MULTIPLIERS (M.), REDUCED MULTIPLIERS (R.M.), NUMBER OF REQUIRED ADDERS IN SUBFILTER SECTION (SUB.), NUMBER OF REQUIRED ADDERS IN PRE/POSTPROCESSING BLOCKS (PRE/POST.), AND NUMBER OF THE INCREASED ADDERS (I.A.)

TABLE II
COMPARISON OF STRUCTURES FOR A 144-TAP FIR FILTER: NUMBER OF REQUIRED MULTIPLIERS (M.), NUMBER OF REQUIRED ADDERS (A.), NUMBER OF REQUIRED DELAY ELEMENTS (D.)

TABLE III
COMPARISON OF AREA

TABLE IV
COMPARISON OF POWER

TABLE V
COMPARISON OF CRITICAL PATH DELAY

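
As a worked illustration of (10) to (12), the following Python sketch evaluates the multiplier and subfilter adder counts. The function names are hypothetical, and the values of `S` passed in are taken from the symmetric block counts stated in Section III (two versus one block for $L = 2$, four versus two for $L = 3$); for the four-parallel case only the difference of three blocks is given in the text, so the sketch compares two `S` values that differ by three rather than asserting absolute counts.

```python
from math import prod


def required_multipliers(N, Ls, Ms, S):
    """Multipliers per (10) when the subfilter length is even, (11) when odd.

    N  : filter length
    Ls : block sizes of the cascaded FFAs, e.g. [2] or [2, 2]
    Ms : number of subfilter blocks produced by each FFA, e.g. [3] or [3, 3]
    S  : number of subfilter blocks containing symmetric coefficients
    """
    n = N // prod(Ls)                      # length of each subfilter block
    total = n * prod(Ms)                   # multipliers without any reuse
    if n % 2 == 0:
        return total - (S * n) // 2        # (10)
    return total - (S * (n - 1)) // 2      # (11)


def subfilter_adders(N, Ls, Ms):
    """Adders in the subfilter section per (12)."""
    n = N // prod(Ls)
    return prod(Ms) * (n - 1)


# Two-parallel, 24 taps: proposed (S = 2) versus existing FFA (S = 1).
saved_l2 = required_multipliers(24, [2], [3], 1) - required_multipliers(24, [2], [3], 2)

# Four-parallel: three more symmetric blocks than the existing FFA.
saved_72 = required_multipliers(72, [2, 2], [3, 3], 0) - required_multipliers(72, [2, 2], [3, 3], 3)
saved_576 = required_multipliers(576, [2, 2], [3, 3], 0) - required_multipliers(576, [2, 2], [3, 3], 3)
```

Under these assumptions the three savings evaluate to $N/(2L) = 6$ multipliers for the 24-tap two-parallel case and to the 27 and 216 multipliers quoted in the abstract for the 72-tap and 576-tap four-parallel cases.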
V. IMPLEMENTATION AND EXPERIMENTAL RESULT
The proposed FFA structures and the existing FFA structures are implemented in Verilog HDL with filter lengths of 24 and 72 and word lengths of 16 and 32 bits, respectively. Two sets of ideal low-pass FIR filter symmetric coefficients of length 24 and 72 are generated by MATLAB using the Remez exchange algorithm. The maximum absolute difference (MAD) algorithm introduced in [1], [2] is used for coefficient quantization. The subfilter is based on a canonical signed digit (CSD) structure, and carry-save adders are used. Tables III, IV, and V show the results for area, power, and critical path delay, synthesized by Design Compiler [10] with 45-nm technology.
VI. CONCLUSION
In this paper, we have presented new parallel FIR filter structures, which are beneficial to symmetric convolutions when the number of taps is a multiple of 2 or 3. Multipliers account for the major portion of the hardware consumption in a parallel FIR filter implementation. The proposed new structure exploits the nature of even symmetric coefficients and saves a significant number of multipliers at the expense of additional adders. Since multipliers outweigh adders in hardware cost, it is profitable to exchange multipliers for adders. Moreover, the number of additional adders stays fixed when the length of the FIR filter becomes large, whereas the number of saved multipliers increases with the length of the FIR filter. Consequently, the longer the FIR filter is, the more hardware cost the proposed structures save over the existing FFA structures. Overall, in this paper, we have provided new parallel FIR structures consisting of advantageous polyphase decompositions that handle symmetric convolutions better than the existing FFA structures in terms of hardware consumption.

REFERENCES
[1] D. A. Parker and K. K. Parhi, "Low-area/power parallel FIR digital filter implementations," J. VLSI Signal Process. Syst., vol. 17, no. 1, pp. 75–92, 1997.
[2] J. G. Chung and K. K. Parhi, "Frequency-spectrum-based low-area low-power parallel FIR filter design," EURASIP J. Appl. Signal Process., vol. 2002, no. 9, pp. 444–453, 2002.
[3] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[4] Z.-J. Mou and P. Duhamel, "Short-length FIR filters and their use in fast nonrecursive filtering," IEEE Trans. Signal Process., vol. 39, no. 6, pp. 1322–1332, Jun. 1991.
[5] J. I. Acha, "Computational structures for fast implementation of L-path and L-block digital filters," IEEE Trans. Circuits Syst., vol. 36, no. 6, pp. 805–812, Jun. 1989.
[6] C. Cheng and K. K. Parhi, "Hardware efficient fast parallel FIR filter structures based on iterated short convolution," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 8, pp. 1492–1500, Aug. 2004.
[7] C. Cheng and K. K. Parhi, "Furthur complexity reduction of parallel FIR filters," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS 2005), Kobe, Japan, May 2005.
[8] C. Cheng and K. K. Parhi, "Low-cost parallel FIR structures with 2-stage parallelism," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 2, pp. 280–290, Feb. 2007.
[9] I.-S. Lin and S. K. Mitra, "Overlapped block digital filtering," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 43, no. 8, pp. 586–596, Aug. 1996.
[10] Design Compiler User Guide, ver. B-2008.09, Synopsys Inc., Sep. 2008.


Low-Power and Area-Efficient Carry Select Adder


B. Ramkumar and Harish M Kittur

Abstract: Carry Select Adder (CSLA) is one of the fastest adders used in many data-processing processors to perform fast arithmetic functions. From the structure of the CSLA, it is clear that there is scope for reducing the area and power consumption in the CSLA. This work uses a simple and efficient gate-level modification to significantly reduce the area and power of the CSLA. Based on this modification, 8-, 16-, 32-, and 64-b square-root CSLA (SQRT CSLA) architectures have been developed and compared with the regular SQRT CSLA architecture. The proposed design has reduced area and power as compared with the regular SQRT CSLA, with only a slight increase in the delay. This work evaluates the performance of the proposed designs in terms of delay, area, power, and their products by hand with logical effort and through custom design and layout in 0.18-μm CMOS process technology. The results analysis shows that the proposed CSLA structure is better than the regular SQRT CSLA.
Index Terms: Application-specific integrated circuit (ASIC), area-efficient, CSLA, low power.

I. INTRODUCTION
The design of area- and power-efficient high-speed data path logic systems is one of the most substantial areas of research in VLSI system design. In digital adders, the speed of addition is limited by the time required to propagate a carry through the adder. The sum for each bit position in an elementary adder is generated sequentially only after the previous bit position has been summed and a carry propagated into the next position.
The CSLA is used in many computational systems to alleviate the problem of carry propagation delay by independently generating multiple carries and then selecting a carry to generate the sum [1]. However, the CSLA is not area efficient because it uses multiple pairs of Ripple Carry Adders (RCA) to generate the partial sum and carry by considering carry input Cin = 0 and Cin = 1; the final sum and carry are then selected by the multiplexers (mux).
The basic idea of this work is to use a Binary to Excess-1 Converter (BEC) instead of the RCA with Cin = 1 in the regular CSLA to achieve lower area and power consumption [2]–[4]. The main advantage of this BEC logic is that it needs fewer logic gates than the n-bit Full Adder (FA) structure. The details of the BEC logic are discussed in Section III.
This brief is structured as follows. Section II deals with the delay and area evaluation methodology of the basic adder blocks. Section III presents the detailed structure and the function of the BEC logic. The SQRT CSLA has been chosen for comparison with the proposed design as it has a more balanced delay and requires lower power and area [5], [6]. The delay and area evaluation methodologies of the regular and modified SQRT CSLA are presented in Sections IV and V, respectively. The ASIC implementation details and results are analyzed in Section VI. Finally, the work is concluded in Section VII.
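
The idea of replacing the Cin = 1 ripple carry adder with a BEC can be sketched at the bit level. The Python model below is an assumption-laden illustration, not the gate-level design of this brief (whose BEC details belong to its Section III): it assumes an (n+1)-bit excess-1 stage applied to the Cin = 0 sum and carry, and the function names are hypothetical. Bit lists are LSB first.

```python
def ripple_carry_add(a_bits, b_bits, cin):
    """Ripple carry adder on LSB-first bit lists; returns (sum_bits, carry_out)."""
    out, c = [], cin
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ c)
        c = (a & b) | (c & (a ^ b))
    return out, c


def binary_to_excess_1(bits):
    """Add one to an LSB-first bit list: bit i flips when all lower bits are 1."""
    out, lower_all_ones = [], 1
    for b in bits:
        out.append(b ^ lower_all_ones)
        lower_all_ones &= b
    return out


def csla_group_with_bec(a_bits, b_bits, cin):
    """One carry-select group: RCA for Cin = 0, BEC for Cin = 1, then a mux."""
    s0, c0 = ripple_carry_add(a_bits, b_bits, 0)
    # Assumed (n+1)-bit BEC: a + b + 1 equals (a + b) + 1 and cannot overflow
    # n+1 bits, because a + b is at most 2^(n+1) - 2.
    bec = binary_to_excess_1(s0 + [c0])
    s1, c1 = bec[:-1], bec[-1]
    return (s1, c1) if cin else (s0, c0)   # multiplexer controlled by cin


def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))
```

The design point this illustrates is that only one ripple carry adder is evaluated per group; the second candidate result is derived from it by the cheaper increment logic instead of a second full adder chain.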
Manuscript received May 12, 2010; revised October 28, 2010; accepted December 15, 2010. Date of publication January 24, 2011; date of current version
January 18, 2012.
The authors are with the School of Electronics Engineering, VIT University,
Vellore 632 014, India (e-mail: ramkumar.b@vit.ac.in; kittur@vit.ac.in).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2010.2101621

1063-8210/$26.00 © 2011 IEEE
