Signal Processing

Parhi, K.K., Chassaing, R., Bitler, B.
VLSI for Signal Processing

The Electrical Engineering Handbook
Ed. Richard C. Dorf
Boca Raton: CRC Press LLC, 2000
© 2000 by CRC Press LLC
18
VLSI for Signal Processing
18.1 Special Architectures
Pipelining • Parallel Processing • Retiming • Unfolding • Folding Transformation • Look-Ahead Technique •
Associativity Transformation • Distributivity • Arithmetic Processor Architectures • Computer-Aided Design •
Future VLSI DSP Systems
18.2 Signal Processing Chips and Applications
DSP Processors • Fixed-Point TMS320C25-Based Development System • Implementation of a Finite Impulse
Response Filter with the TMS320C25 • Floating-Point TMS320C30-Based Development System • EVM Tools •
Implementation of a Finite Impulse Response Filter with the TMS320C30 • FIR and IIR Implementation
Using C and Assembly Code • Real-Time Applications • Conclusions and Future Directions
18.1 Special Architectures
Keshab K. Parhi
Digital signal processing (DSP) is used in numerous applications. These applications include telephony, mobile
radio, satellite communications, speech processing, video and image processing, biomedical applications, radar,
and sonar. Real-time implementations of DSP systems require design of hardware that can match the application
sample rate to the hardware processing rate (which is related to the clock rate and the implementation style).
Thus, real-time does not always mean high speed. Real-time architectures are capable of processing samples as
they are received from the signal source, as opposed to storing them in buffers for later processing as done in
batch processing. Furthermore, real-time architectures operate on an infinite time series (since the number of
the samples of the signal source is so large that it can be considered infinite). While speech and sonar applications
require lower sample rates, radar and video image processing applications require much higher sample rates.
The sample rate information alone cannot be used to choose the architecture. The algorithm complexity is also
an important consideration. For example, a very complex and computationally intensive algorithm for a
low-sample-rate application and a computationally simple algorithm for a high-sample-rate application may require
similar hardware speed and complexity. These ranges of algorithms and applications motivate us to study a
wide variety of architecture styles.
Keshab K. Parhi
University of Minnesota
Rulph Chassaing
Roger Williams University
Bill Bitler
InfiMed
Using very large scale integration (VLSI) technology, DSP algorithms can be prototyped in many ways. These
options include (1) single or multiprocessor programmable digital signal processors, (2) the use of a core
programmable digital signal processor with customized interface logic, (3) semicustom gate-array implementations,
and (4) full-custom dedicated hardware implementation. The DSP algorithms are implemented in the
programmable processors by translating the algorithm to the processor assembly code. This can require an
extensive amount of time. On the other hand, high-level compilers for DSP can be used to generate the assembly
code. Although this is currently feasible, the code generated by the compiler is not as efficient as hand-optimized
code. Design of DSP compilers for generation of efficient code is still an active research topic. In the case of
dedicated designs, the challenge lies in a thorough understanding of the DSP algorithms and theory of
architectures. For example, just minimizing the number of multipliers in an algorithm may not lead to a better
dedicated design. The area saved by the number of multipliers may be offset by the increase in control, routing,
and placement costs.
Off-the-shelf programmable digital signal processors can lead to faster prototyping. These prototyped systems
can prove very effective in fast simulation of computation-intensive algorithms (such as those encountered in
speech recognition, video compression, and seismic signal processing) or in benchmarking and standardization.
After standards are determined, it is more useful to implement the algorithms using dedicated circuits.
Design of dedicated circuits is not a simple task. Dedicated circuits provide limited or no programming
flexibility. They require less silicon area and consume less power. However, the low production volume, high
design cost, and long turnaround time are some of the difficulties associated with the design of dedicated
systems. Another difficulty is the availability of appropriate computer-aided design (CAD) tools for DSP systems.
As time progresses, however, the architectural design techniques will be better understood and can be
incorporated into CAD tools, thus making the design of dedicated circuits easier. Hierarchical CAD tools can integrate
the design at various levels in an automatic and efficient manner. Implementation of standards for signal and
image processing using dedicated circuits will lead to higher volume production. As time progresses, dedicated
designs will be more acceptable to customers of DSP.
Successful design of dedicated circuits requires careful algorithm and architecture considerations. For
example, for a filtering application, different equivalent realizations may possess different levels of concurrency. Thus,
some of these realizations may be suitable for a particular application while other realizations may not be able
to meet the sample rate requirements of the application. The lower-level architecture may be implemented in
a word-serial or word-parallel manner. The arithmetic functional units may be implemented in a bit-serial,
digit-serial, or bit-parallel manner. The synthesized architecture may be implemented with a dedicated data
path or shared data path. The architecture may be systolic or nonsystolic.
Algorithm transformations play an important role in the design of dedicated architectures [Parhi, 1989].
This is because the transformed algorithms can be made to operate with better performance (where the
performance may be measured in terms of speed, area, or power). Examples of these transformations include
pipelining, parallel processing, retiming, unfolding, folding, look-ahead, associativity, and distributivity. These
transformations and other architectural concepts are described in detail in subsequent sections.
Pipelining
Pipelining can increase the amount of concurrency (or the number of activities performed simultaneously) in
an algorithm. Pipelining is accomplished by placing latches at appropriate intermediate points in a data flow
graph that describes the algorithm. Each latch also refers to a storage unit or buffer or register. The latches can
be placed at feed-forward cutsets of the data flow graph. In synchronous hardware implementations, pipelining
can increase the clock rate of the system (and therefore the sample rate). The drawbacks associated with
pipelining are the increase in system latency and the increase in the number of registers. To illustrate the speed
increase using pipelining, consider the second-order three-tap finite impulse response (FIR) filter shown in
Fig. 18.1(a). The signal x(n) in this system can be sampled at a rate limited by the throughput of one
multiplication and two additions. For simplicity, if we assume the multiplication time to be two times the
addition time ($T_{\mathrm{add}}$), the effective sample or clock rate of this system is $1/(4T_{\mathrm{add}})$. By placing latches as shown in
Fig. 18.1(b) at the cutset shown in the dashed line, the sample rate can be improved to the rate of one
multiplication or two additions. While pipelining can be easily applied to all algorithms with no feedback loops
by the appropriate placement of latches, it cannot easily be applied to algorithms with feedback loops. This is
because the cutsets in feedback algorithms contain feed-forward and feedback data flow and cannot be
considered as feed-forward cutsets.
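As a rough illustration of pipelining at a feed-forward cutset, the following Python sketch models the three-tap filter sample by sample, once directly and once with a latch after each multiplier. The latch placement, the function names, and the constants `T_ADD` and `T_MULT` are assumptions made for illustration; the cutset actually drawn in Fig. 18.1(b) may differ.

```python
T_ADD = 1           # assumed addition time (one unit)
T_MULT = 2 * T_ADD  # multiplication assumed to take two addition times

def fir_direct(x, h):
    """Three-tap FIR y(n) = h0*x(n) + h1*x(n-1) + h2*x(n-2), zero initial state."""
    h0, h1, h2 = h
    x1 = x2 = 0  # delay elements holding x(n-1) and x(n-2)
    y = []
    for xn in x:
        y.append(h0 * xn + h1 * x1 + h2 * x2)
        x1, x2 = xn, x1
    return y

def fir_pipelined(x, h):
    """Same filter with one latch after every multiplier (a feed-forward cutset)."""
    h0, h1, h2 = h
    x1 = x2 = 0
    p0 = p1 = p2 = 0  # pipeline latches holding last cycle's products
    y = []
    for xn in x:
        y.append(p0 + p1 + p2)                   # stage 2: two additions
        p0, p1, p2 = h0 * xn, h1 * x1, h2 * x2   # stage 1: multiplications
        x1, x2 = xn, x1
    return y

# Clock period is set by the longest path between storage elements.
period_direct = T_MULT + 2 * T_ADD         # one multiply followed by two adds
period_pipelined = max(T_MULT, 2 * T_ADD)  # one multiply or two adds
```

Under these assumptions the pipelined output equals the direct output delayed by one sample, which is the latency cost mentioned in the text, while the clock period drops from four addition times to two.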
Pipelining can also be used to improve the performance in software programmable multiprocessor systems.
Most software programmable DSP processors are programmed using assembly code. The assembly code is
generated by high-level compilers that perform scheduling. Schedulers typically use the acyclic precedence
graph to construct schedules. The removal of all edges in the signal (or data) flow graph containing delay
elements converts the signal flow graph to an acyclic precedence graph. By placing latches to pipeline a data
flow graph, we can alter the acyclic precedence graph. In particular, the critical path of the acyclic precedence
graph can be reduced. The new precedence graph can be used to construct schedules with lower iteration
periods (although this may often require an increase in the number of processors).
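The construction of the acyclic precedence graph can be sketched in a few lines of Python: drop every edge that carries a delay, then take the longest remaining path. The graph encoding, the node names, and the example topology below are hypothetical, chosen to resemble a three-tap FIR filter with multiply time 2 and add time 1; the sketch assumes the graph has no zero-delay cycles.

```python
from functools import lru_cache

def critical_path(nodes, edges):
    """nodes: {name: computation time}; edges: list of (src, dst, delays).

    Edges with one or more delays are removed, leaving the acyclic
    precedence graph; the longest path through it is returned.
    """
    succ = {u: [] for u in nodes}
    for u, v, delays in edges:
        if delays == 0:
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        return nodes[u] + max((longest_from(v) for v in succ[u]), default=0)

    return max(longest_from(u) for u in nodes)

# Hypothetical three-tap FIR: three multipliers feeding a chain of two adders.
nodes = {"M0": 2, "M1": 2, "M2": 2, "A1": 1, "A2": 1}
original = [("M1", "A1", 0), ("M2", "A1", 0), ("A1", "A2", 0), ("M0", "A2", 0)]
# One latch on every multiplier output (a feed-forward cutset).
pipelined = [("M1", "A1", 1), ("M2", "A1", 1), ("A1", "A2", 0), ("M0", "A2", 1)]
```

For this assumed graph the critical path falls from 4 time units to 2 once the latches are inserted, which is the reduction the paragraph describes.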
Pipelining of algorithms can increase the sample rate of the system. Sometimes, for a constant sample rate,
pipelining can also reduce the power consumed by the system. This is because the data paths in the pipelined
system can be charged or discharged with lower supply voltage. Since the capacitance remains almost constant,
the power can be reduced. Achieving low power can be important in many battery-powered applications
[Chandrakasan et al., 1992].
Parallel Processing
Parallel processing is related to pipelining but requires replication of hardware units. Pipelining exploits
concurrency by breaking a large task into multiple smaller tasks and by separating these smaller tasks by storage
units. On the other hand, parallelism exploits concurrency by performing multiple larger tasks simultaneously
in separate hardware units.
To illustrate the speed increase due to parallelism, consider the parallel implementation of the second-order
three-tap FIR filter of Fig. 18.1(a) shown in Fig. 18.2. In the architecture of Fig. 18.2, two input samples are
processed and two output samples are generated in each clock cycle period of four addition times. Because
each clock cycle processes two samples, however, the effective sample rate is $1/(2T_{\mathrm{add}})$, which is the same as that
of Fig. 18.1(b). The parallel architecture leads to the speed increase with significant hardware overhead. The
entire data flow graph needs to be replicated with an increase in the amount of parallelism. Thus, it is more
desirable to use pipelining as opposed to parallelism. However, parallelism may be useful if pipelining alone
cannot meet the speed demand of the application or if the technology constraints (such as limitations on the
clock rate by the I/O technology) limit the use of pipelining. In obvious ways, pipelining and parallelism can
also be combined. Parallelism, like pipelining, can also lead to power reduction but with significant overhead
in hardware requirements. Achieving pipelining and parallelism can be difficult for systems with feedback loops.
Concurrency may be created in these systems by using the look-ahead transformation.
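The twofold parallel filter can be sketched as block processing in Python, with each loop iteration standing for one clock cycle that consumes two inputs and produces two outputs. The function names and the exact arrangement of operations are assumptions for illustration and need not match the wiring of Fig. 18.2.

```python
def fir_reference(x, h):
    """Sequential three-tap FIR, used only to check the parallel version."""
    h0, h1, h2 = h
    padded = [0, 0] + list(x)
    return [h0 * padded[n + 2] + h1 * padded[n + 1] + h2 * padded[n]
            for n in range(len(x))]

def fir_two_parallel(x, h):
    """Process x(2k) and x(2k+1) together; len(x) is assumed to be even."""
    h0, h1, h2 = h
    xm1 = xm2 = 0  # x(2k-1) and x(2k-2), carried over from the previous block
    y = []
    for k in range(0, len(x) - 1, 2):
        xe, xo = x[k], x[k + 1]
        y_even = h0 * xe + h1 * xm1 + h2 * xm2   # y(2k)
        y_odd = h0 * xo + h1 * xe + h2 * xm1     # y(2k+1)
        y.extend([y_even, y_odd])
        xm1, xm2 = xo, xe
    return y
```

Each iteration uses six multiplications and four additions, twice the hardware of the sequential filter, which illustrates the replication overhead noted above.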
FIGURE 18.1 (a) A three-tap second-order nonrecursive digital filter; (b) the equivalent pipelined digital filter obtained
by placing storage units at the intersection of the signal wires and the feed-forward cutset. If the multiplication and addition
operations require 2 and 1 unit of time, respectively, then the maximum achievable sampling rates for the original and the
pipelined architectures are 1/4 and 1/2 units, respectively.
Retiming
Retiming is similar to pipelining, yet different in some ways [Leiserson et al., 1983]. Retiming is the process
of moving the delays around in the data flow graph. Removal of one delay from all input edges of a node and
insertion of one delay to each outgoing edge of the same node is the simplest example of retiming. Unlike
pipelining, retiming does not increase the latency of the system. However, retiming alters the number of delay
elements in the system. Retiming can reduce the critical path of the data flow graph. As a result, it can reduce
the clock period in hardware implementations, or the critical path of the acyclic precedence graph or the
iteration period in programmable software system implementations.
The single host formulation of the retiming transformation preserves the latency of the algorithm. The
retiming formulation with no constraints on latency (i.e., with separate input and output hosts) can also achieve
pipelining with no retiming or pipelining with retiming. Pipelining with retiming is the most desirable
transformation in DSP architecture design. Pipelining with retiming can be interpreted to be identical to retiming of
the original algorithm with a large number of delays at the input edges. Thus, we can increase the system latency
arbitrarily and remove the appropriate number of delays from the inputs after the transformation.
The retiming formulation assigns retiming variables r(.) to each node in the data flow graph. If i(U → V)
is the number of delays associated with the edge U → V in the original data flow graph and r(V) and r(U),
respectively, represent the retiming variable value of the nodes V and U, then the number of delays associated
with the edge U → V in the retimed data flow graph is given by
$$i_r(U \to V) = i(U \to V) + r(V) - r(U)$$
For the data flow graph to be realizable, $i_r(U \to V) \ge 0$ must be satisfied. The retiming transformation formulates
the problem by calculating path lengths and by imposing constraints on certain path lengths. These constraints
are solved as a shortest-path problem.
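The retiming equation is easy to apply once the retiming values are known. The Python sketch below only evaluates the equation and the realizability condition for a given assignment; it does not solve the shortest-path problem that finds the assignment. The dictionary encoding and the two-node example are hypothetical.

```python
def retime(edges, r):
    """edges: {(U, V): delays}; r: {node: retiming value}.

    Returns the retimed delay count of every edge,
    i_r(U -> V) = i(U -> V) + r(V) - r(U).
    """
    return {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}

def is_realizable(retimed):
    """A retimed graph is realizable only if no edge has a negative delay count."""
    return all(w >= 0 for w in retimed.values())

# Hypothetical loop: A feeds B with no delay, B feeds A through two delays.
loop = {("A", "B"): 0, ("B", "A"): 2}
moved = retime(loop, {"A": 0, "B": 1})   # one delay moves onto the edge A -> B
broken = retime(loop, {"A": 1, "B": 0})  # would require a negative delay
```

In this example the total number of delays around the loop stays at two after retiming, as expected, since the retiming values cancel around any cycle.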
To illustrate the usefulness of retiming, consider the data flow graph of a two-stage pipelined lattice digital
filter graph shown in Fig. 18.3(a) and its equivalent pipelined-retimed data flow graph shown in Fig. 18.3(b).
If the multiply time is two units and the add time is one unit, the architecture in Fig. 18.3(a) can be clocked
with period 10 units whereas the architecture in Fig. 18.3(b) can be clocked with period 2 units.
FIGURE 18.2 Twofold parallel realization of the three-tap filter of Fig. 18.1(a).
Unfolding
The unfolding transformation is similar to loop unrolling. In J-unfolding, each node is replaced by J nodes
and each edge is replaced by J edges. The J-unfolded data flow graph executes J iterations of the original
algorithm [Parhi, 1991].
The unfolding transformation can unravel the hidden concurrency in a data flow program. The achievable
iteration period for a J-unfolded data flow graph is 1/J times the critical path length of the unfolded data flow
graph. By exploiting interiteration concurrency, unfolding can lead to a lower iteration period in the context
of a software programmable multiprocessor implementation.
The unfolding transformation can also be applied in the context of hardware design. If we apply an unfolding
transformation on a (word-serial) nonrecursive algorithm, the resulting data flow graph represents a
word-parallel (or simply parallel) algorithm that processes multiple samples or words in parallel every clock cycle.
If we apply 2-unfolding to the 3-tap FIR filter in Fig. 18.1(a), we can obtain the data flow graph of Fig. 18.2.
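The text does not spell out how the J copies of each edge are connected. A commonly used construction, assumed here, connects copy k of the source node to copy (k + w) mod J of the destination node with ⌊(k + w)/J⌋ delays, where w is the delay count of the original edge. The Python sketch below applies that rule; the graph encoding and node names are hypothetical.

```python
def unfold(edges, J):
    """edges: list of (U, V, w), where w is the number of delays on U -> V.

    Returns the edges of the J-unfolded graph. Each node U becomes
    (U, 0), ..., (U, J - 1), and each edge becomes J edges.
    """
    unfolded = []
    for u, v, w in edges:
        for k in range(J):
            unfolded.append(((u, k), (v, (k + w) % J), (k + w) // J))
    return unfolded

# Hypothetical two-node loop with three delays on the feedback edge.
graph = [("A", "B", 0), ("B", "A", 3)]
unfolded = unfold(graph, 2)
```

With this rule the J copies of an edge together carry exactly as many delays as the original edge, so the unfolded graph describes J consecutive iterations of the same algorithm.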
FIGURE 18.3 (a) A two-stage pipelinable time-invariant lattice digital filter. If multiplication and addition operations
require 2 and 1 time units, respectively, then this data flow graph can achieve a sampling period of 10 time units (which
corresponds to the critical path $M_1 \to A_2 \to M_2 \to A_1 \to M_3 \to A_3 \to A_4$). (b) The pipelined/retimed lattice digital filter
can achieve a sampling period of 2 time units.
Because the unfolding algorithm is based on a graph theoretic approach, it can also be applied at the bit level.
Thus, unfolding of a bit-serial data flow program by a factor of J leads to a digit-serial program with digit
size J. The digit size represents the number of bits processed per clock cycle. The digit-serial architecture is
clocked at the same rate as the bit-serial (assuming that the clock rate is limited by the communication I/O
bound much before reaching the computation bound of the bit-serial program). Because the digit-serial
program processes J bits per clock cycle, the effective bit rate of the digit-serial architecture is J times higher. A
simple example of this unfolding is illustrated in Fig. 18.4, where the bit-serial adder in Fig. 18.4(a) is unfolded
by a factor of 2 to obtain the digit-serial adder in Fig. 18.4(b) for digit size 2 for a word length of 4. In obvious
ways, the unfolding transformation can be applied to both word level and bit level simultaneously to generate
word-parallel digit-serial architectures. Such architectures process multiple words per clock cycle and process
a digit of each word (not the entire word).
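A behavioral Python sketch of the bit-serial and digit-serial adders follows. It models one clock cycle per outer loop iteration, with J full-adder steps inside each cycle and the carry stored between cycles. The helper names are hypothetical, and the model says nothing about the gate-level structure of Fig. 18.4.

```python
def to_bits(value, W):
    """Least-significant-bit-first list of W bits."""
    return [(value >> k) & 1 for k in range(W)]

def from_bits(bits):
    return sum(b << k for k, b in enumerate(bits))

def serial_add(a_bits, b_bits, J=1):
    """Add two LSB-first words, J bits per clock (J = 1 is bit-serial).

    Returns (sum bits, number of clock cycles). The final carry is
    dropped, so the result is taken modulo 2**W.
    """
    carry = 0
    out = []
    clocks = 0
    for k in range(0, len(a_bits), J):
        # Within one clock the J full adders ripple combinationally.
        for a, b in zip(a_bits[k:k + J], b_bits[k:k + J]):
            s = a + b + carry
            out.append(s & 1)
            carry = s >> 1
        clocks += 1
    return out, clocks
```

For a word length of 4, `serial_add(to_bits(5, 4), to_bits(6, 4), J=1)` takes four clocks while `J=2` takes two and yields the same sum bits, matching the statement that the effective bit rate is J times higher at the same clock rate.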
Folding Transformation
The folding transformation is the reverse of the unfolding transformation. While the unfolding transformation
is simpler, the folding transformation is more difficult [Parhi et al., 1992].
The folding transformation can be applied to fold a bit-parallel architecture to a digit-serial or bit-serial one
or to fold a digit-serial architecture to a bit-serial one. It can also be applied to fold an algorithm data flow
graph to a hardware data flow for a specified folding set. The folding set indicates the processor in which and
the time partition at which a task is executed. A specified folding set may be infeasible, and this needs to be
detected first. The folding transformation performs a preprocessing step to detect feasibility and in the feasible
case transforms the algorithm data flow graph to an equivalent pipelined/retimed data flow graph that can be
folded. For the special case of regular data flow graphs and for linear space-time mappings, the folding
transformation reduces to systolic array design.
FIGURE 18.4 (a) A least-significant-bit first bit-serial adder for word length of 4; (b) a digit-serial adder with digit size 2
obtained by two-unfolding of the bit-serial adder. The bit position 0 stands for least significant bit.
In the folded architecture, each edge in the algorithm data flow graph is mapped to a communicating edge
in the hardware architecture data flow graph. Consider an edge U → V in the algorithm data flow graph with
associated number of delays i(U → V). Let the tasks U and V be mapped to the hardware units $H_U$ and $H_V$,
respectively. Assume that N time partitions are available, i.e., the iteration period is N. A modulo operation
determines the time partition. For example, the time unit 18 for N = 4 corresponds to time partition 18 modulo
4 or 2. Let the tasks U and V be executed in time partitions u and v, i.e., the l th iterations of tasks U and V
are executed in time units Nl + u and Nl + v, respectively. The i(U → V) delays in the edge U → V imply
that the result of the l th iteration of U is used for the (l + i)th iteration of V. The (l + i)th iteration of V is
executed in time unit N(l + i) + v. Thus the number of storage units needed in the folded edge corresponding
to the edge U → V is
$$D_F(U \to V) = N(l + i) + v - Nl - u - P_u = Ni + v - u - P_u$$
where $P_u$ is the level of pipelining of the hardware operator $H_U$. The $D_F(U \to V)$ delays should be connected
to the edge between $H_U$ and $H_V$, and this signal should be switched to the input of $H_V$ at time partition v. If
the $D_F(U \to V)$'s as calculated here were always nonnegative for all edges U → V, then the problem would be
solved. However, some $D_F()$'s would be negative. The algorithm data flow graph needs to be pipelined and
retimed such that all the $D_F()$'s are nonnegative. This can be formulated by simple inequalities using the retiming
variables. The retiming formulation can be solved as a path problem, and the retiming variables can be
determined if a solution exists. The algorithm data flow graph can be retimed for folding and the calculation
of the $D_F()$'s can be repeated. The folded hardware architecture data flow graph can now be completed. The
folding technique is illustrated in Fig. 18.5. The algorithm data flow graph of a two-stage pipelined lattice
recursive digital filter of Fig. 18.3(a) is folded for the folding set shown in Fig. 18.5. Fig. 18.5(a) shows the
pipelined/retimed data flow graph (preprocessed for folding) and Fig. 18.5(b) shows the hardware architecture
data flow graph obtained after folding.
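The folded-edge delay count can be evaluated directly from the quantities defined above. The following Python sketch does so for a single edge. The function and parameter names, and the numeric values in the two examples, are hypothetical and are not taken from Fig. 18.5.

```python
def time_partition(time_unit, N):
    """Time partition of a time unit when N partitions are available."""
    return time_unit % N

def folded_delays(N, i_uv, u, v, P_u):
    """Storage units on the folded edge: D_F(U -> V) = N*i + v - u - P_u.

    N    : number of time partitions (iteration period)
    i_uv : delays on the edge U -> V in the algorithm data flow graph
    u, v : time partitions in which U and V are executed
    P_u  : pipelining level of the hardware operator executing U
    """
    return N * i_uv + v - u - P_u

# Hypothetical edge with one delay: a nonnegative result, so it can be folded.
ok = folded_delays(N=4, i_uv=1, u=1, v=3, P_u=2)
# Hypothetical edge with no delay: negative, so retiming is needed first.
bad = folded_delays(N=4, i_uv=0, u=3, v=0, P_u=1)
```

A negative value such as the second one signals that the algorithm data flow graph must be pipelined and retimed before the folding set becomes feasible.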
As indicated before, a special case of folding can address systolic array design for regular data flow graphs
and for linear mappings. The systolic architectures make use of extensive pipelining and local communication
and operate in a synchronous manner [Kung, 1988]. The systolic processors can also be made to operate in an
asynchronous manner, and such systems are often referred to as wavefront processors. Systolic architectures
have been designed for a variety of applications including convolution, matrix solvers, matrix decomposition,
and filtering.
Look-Ahead Technique
The look-ahead technique is a very powerful technique for pipelining of recursive signal processing algorithms
[Parhi and Messerschmitt, 1989]. This technique can transform a sequential recursive algorithm to an equivalent
concurrent one, which can then be realized using pipelining or parallel processing or both. This technique has
been successfully applied to pipeline many signal processing algorithms, including recursive digital filters (in
direct form and lattice form), adaptive lattice digital filters, two-dimensional recursive digital filters, Viterbi
decoders, Huffman decoders, and finite state machines. This research demonstrated that the recursive signal
processing algorithms can be operated at high speed. This is an important result since modern signal processing
applications in radar and image processing and particularly in high-definition and super-high-definition
television video signal processing require very high throughput. Traditional algorithms and topologies cannot be
used for such high-speed applications because of the inherent speed bound of the algorithm created by the
feedback loops. The look-ahead transformation creates additional concurrency in the signal processing
algorithms and the speed bound of the transformed algorithms is increased substantially. The look-ahead
transformation is not free from its drawbacks. It is accompanied by an increase in the hardware overhead. This
difficulty has encouraged us to develop inherently pipelinable topologies for recursive signal processing
algorithms. Fortunately, this is possible to achieve in adaptive digital filters using relaxations on the look-ahead or
by the use of relaxed look-ahead [Shanbhag and Parhi, 1992].
To begin, consider a time-invariant one-pole recursive digital filter transfer function
$$H(z) = \frac{X(z)}{U(z)} = \frac{1}{1 - az^{-1}}$$
described by the difference equation
$$x(n) = ax(n-1) + u(n)$$
and shown in Fig. 18.6(a). The maximum achievable speed in this system is limited by the operating speed of
one multiply-add operation. To increase the speed of this system by a factor of 2, we can express x(n) in terms
of x(n − 2) by substitution of one recursion within the other:
$$x(n) = a[ax(n-2) + u(n-1)] + u(n) = a^2 x(n-2) + au(n-1) + u(n)$$
FIGURE 18.5 (a) A pipelined/retimed data flow graph obtained from Fig. 18.3(a) by preprocessing for folding; (b) the
folded hardware architecture data flow graph. In our folding notation, the tasks are ordered within a set and the ordering
represents the time partition in which the task is executed. For example, $S_1 = (A_2, A_1)$ implies that $A_2$ and $A_1$ are, respectively,
executed in even and odd time partitions in the same processor. The notation d represents a null operation.
The transfer function of the emulated second-order system is given by
$$H(z) = \frac{1 + az^{-1}}{1 - a^2 z^{-2}}$$
and is obtained by using a pole-zero cancellation at −a. In the modified system, x(n) is computed using x(n − 2)
as opposed to x(n − 1); thus we look ahead. The modified system has two delays in the multiply-add feedback
loop, and these two delays can be distributed to pipeline the multiply-add operation by two stages. Of course,
the additional multiply-add operation that represents one zero would also need to be pipelined by two stages
to keep up with the sample rate of the system. To increase the speed by four times, we can rewrite the transfer
function as:
$$H(z) = \frac{(1 + az^{-1})(1 + a^2 z^{-2})}{1 - a^4 z^{-4}}$$
This system is shown in Fig. 18.6(b). Arbitrary speed increase is possible. However, for power-of-two speed
increase the hardware complexity grows logarithmically with speed-up factor. The same technique can be
applied to any higher-order system. For example, a second-order recursive filter with transfer function
$$H(z) = \frac{1}{1 - 2r\cos\theta\, z^{-1} + r^2 z^{-2}}$$
can be modified to
$$H(z) = \frac{1 + 2r\cos\theta\, z^{-1} + r^2 z^{-2}}{1 - 2r^2\cos 2\theta\, z^{-2} + r^4 z^{-4}}$$
for a twofold increase in speed. In this example, the output y(n) is computed using y(n − 2) and y(n − 4); thus,
it is referred to as scattered look-ahead.
FIGURE 18.6 (a) A first-order recursive digital filter; (b) a four-stage pipelinable equivalent filter obtained by look-ahead
computation.
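The look-ahead identities in this section can be checked numerically. The Python sketch below uses exact rational arithmetic to confirm that the two-step look-ahead recursion reproduces the one-pole filter output, and that the modified denominators have the claimed form. The function names, the test signal, and the chosen values of a, r, and cos θ are assumptions made purely for illustration.

```python
from fractions import Fraction

def one_pole(u, a):
    """x(n) = a*x(n-1) + u(n), zero initial state."""
    x_prev, out = 0, []
    for un in u:
        x_prev = a * x_prev + un
        out.append(x_prev)
    return out

def one_pole_look_ahead(u, a):
    """x(n) = a^2*x(n-2) + a*u(n-1) + u(n): two delays in the feedback loop."""
    x1 = x2 = 0  # x(n-1), x(n-2)
    u1 = 0       # u(n-1)
    out = []
    for un in u:
        x = a * a * x2 + a * u1 + un
        out.append(x)
        x1, x2, u1 = x, x1, un
    return out

def poly_mul(p, q):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

a = Fraction(1, 2)
u = [3, -1, 4, 1, -5, 9]

# (1 - a z^-1)(1 + a z^-1)(1 + a^2 z^-2) should collapse to 1 - a^4 z^-4.
den4 = poly_mul(poly_mul([1, -a], [1, a]), [1, 0, a * a])

# Second-order case with r = 1/2 and cos(theta) = 3/5.
r, c = Fraction(1, 2), Fraction(3, 5)
cos2 = 2 * c * c - 1  # cos(2*theta)
den2 = poly_mul([1, -2 * r * c, r * r], [1, 2 * r * c, r * r])
```

The odd powers of z⁻¹ vanish in both products, which is what leaves extra delays in the feedback loop available for pipelining.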

While look-ahead can transform any recursive digital filter transfer function to pipelined form, it leads to a
hardware overhead proportional to $N \log_2 M$, where N is the filter order and M is the speed-up factor. Instead
of starting with a sequential digital filter transfer function obtained by traditional design approaches and
transforming it for pipelining, it is more desirable to use a constrained filter design program that can satisfy
the filter spectrum and the pipelining constraint. The pipelining constraint is satisfied by expressing the
denominator of the transfer function in scattered look-ahead form. Such filter design programs have now been
developed in both time domain and frequency domain. The advantage of the constrained filter design approach
is that we can obtain pipelined digital filters with marginal or zero hardware overhead compared with sequential
digital filters. The pipelined transfer functions can also be mapped to pipelined lattice digital filters. The reader
might note that the data flow graph of Fig. 18.3(a) was obtained by this approach.
The look-ahead pipelining can also be applied for the design of transversal and adaptive lattice digital filters.
Although look-ahead transformation can be used to modify the adaptive filter recursions to create concurrency,
this requires large hardware overhead. The adaptive filters are based on weight update operations, and the
weights are adapted based on the current error. Finally, the error becomes close to zero and the filter coefficients
have been adapted. Thus, making relaxations on the error can reduce the hardware overhead substantially
without degradation of the convergence behavior of the adaptive filter. Three types of relaxations of look-ahead
are possible. These are referred to as sum relaxation, product relaxation, and delay relaxation. To illustrate these
three relaxations, consider the weight update recursion
$$w(n+1) = a(n)w(n) + f(n)$$
where the term a(n) is typically 1 for transversal least mean square (LMS) adaptive filters and of the form
(1 − r(n)) for lattice LMS adaptive digital filters, and f(n) = μe(n)u(n), where μ is a constant, e(n) is the error,
and u(n) is the input. The use of look-ahead transforms the above recursion to
$$w(n+M) = \left[\prod_{i=0}^{M-1} a(n+M-i-1)\right] w(n) + \begin{bmatrix} 1 & a(n+M-1) & \cdots & \prod_{i=0}^{M-2} a(n+M-i-1) \end{bmatrix} \begin{bmatrix} f(n+M-1) \\ f(n+M-2) \\ \vdots \\ f(n) \end{bmatrix}$$
In sum relaxation, we only retain the single term dependent on the current input for the last term of the
look-ahead recursion. The relaxed recursion after sum relaxation is given by
$$w(n+M) = \left[\prod_{i=0}^{M-1} a(n+M-i-1)\right] w(n) + f(n+M-1)$$
In lattice digital filters, the coefficient a(n) is close to 1 for all n, since it can be expressed as (1 − r(n)) and r(n)
is close to zero for all n and is positive. The product relaxation on the above equation leads to
$$w(n+M) = (1 - Mr(n+M-1))\,w(n) + f(n+M-1)$$
The delay relaxation assumes the signal to be slowly varying or to be constant over D samples and replaces the
look-ahead by
$$w(n+M) = (1 - Mr(n+M-1))\,w(n) + f(n+M-D-1)$$
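The exact look-ahead expansion of the weight update recursion can be checked by comparing it with M direct iterations. The Python sketch below does this with exact rational arithmetic and also evaluates the sum-relaxed form, which keeps only the most recent update term. The function names and the sample sequences are hypothetical, and the sketch does not model the convergence behavior of an actual adaptive filter.

```python
from fractions import Fraction
from math import prod

def iterate(w, a, f, n, M):
    """Apply w(k+1) = a(k)*w(k) + f(k) for k = n, ..., n+M-1."""
    for k in range(n, n + M):
        w = a[k] * w + f[k]
    return w

def look_ahead(w, a, f, n, M):
    """Exact M-step look-ahead: w(n+M) computed directly from w(n)."""
    lead = prod(a[n + M - 1 - i] for i in range(M))
    tail = sum(prod(a[n + M - 1 - i] for i in range(j)) * f[n + M - 1 - j]
               for j in range(M))
    return lead * w + tail

def sum_relaxed(w, a, f, n, M):
    """Sum relaxation: keep only the term f(n+M-1) of the look-ahead sum."""
    lead = prod(a[n + M - 1 - i] for i in range(M))
    return lead * w + f[n + M - 1]

# Hypothetical coefficient and update sequences.
a_seq = [Fraction(9, 10), Fraction(4, 5), Fraction(7, 8), Fraction(1, 2)]
f_seq = [1, 2, 3, 4]
w0 = Fraction(5)
```

The exact form needs all M past update terms, which is the hardware overhead the relaxations are meant to avoid; the relaxed form trades that exactness for a single term.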
These three types of relaxations make it possible to implement pipelined transversal and lattice adaptive digital
filters with marginal increase in hardware overhead. Relaxations on the weight update operations change the
convergence behavior of the adaptive filter, and we are forced to examine carefully the convergence behavior
of the relaxed look-ahead adaptive digital filters. It has been shown that the relaxed look-ahead adaptive digital
filters do not suffer from degradation in adaptation behavior. Furthermore, when coding, the use of pipelined
adaptive filters could lead to a dramatic increase in pixel rate with no degradation in signal-to-noise ratio of
the coded image and no increase in hardware overhead [Shanbhag and Parhi, 1992].
The concurrency created by look-ahead and relaxed look-ahead transformations can also be exploited in the
form of parallel processing. Furthermore, for a constant speed, concurrent architectures (especially the pipelined
architectures) can also lead to low power consumption.
Associativity Transformation
The addition operations in many signal processing algorithms can be interchanged since the add operations
satisfy associativity. Thus, it is possible to move the add operations outside the critical loops to increase the
maximum achievable speed of the system. As an example of the associative transformation, consider the
realization of a second-order recursion x(n) = 5/8x(n − 1) + 3/4x(n − 2) + u(n). Two possible realizations are
shown in Fig. 18.7(a). The realization on the left contains one multiplication and two add operations in the
critical inner loop, whereas the realization on the right contains one multiplication and one add operation in
the critical inner loop. The realization on the left can be transformed to the realization on the right using the
associativity transformation. Figure 18.7(b) shows a bit-serial implementation of this second-order recursion
for the realization on the right for a word length of 8. This bit-serial system can be operated in a functionally
correct manner for any word length greater than or equal to 5 since the inner loop computation latency is 5
cycles. On the other hand, if associativity were not exploited, then the minimum realizable word length would
be 6. Thus, associativity can improve the achievable speed of the system.
Distributivity
Another local transformation that is often useful is distributivity. In this transformation, a computation
(A × B) + (A × C) may be reorganized as A × (B + C). Thus, the number of hardware units can be reduced from
two multipliers and one adder to one multiplier and one adder.
Arithmetic Processor Architectures
In addition to algorithms and architecture designs, it is also important to address implementation styles and
arithmetic processor architectures.
Most DSP systems use fixed-point hardware arithmetic operators. While many number system
representations are possible, the two's complement number system is the most popular number system. The other number
systems include the residue number system, the redundant or signed-digit number system, and the logarithmic
number system. The residue and logarithmic number systems are rarely used or are used in very special cases
such as nonrecursive digital filters. Shifting or scaling and division are difficult in the residue number system.
Difficulty with addition and the overhead associated with logarithm and antilogarithm converters reduce the
attractiveness of the logarithm number system. The use of the redundant number system leads to carry-free
operation but is accompanied by the overhead associated with redundant-to-two's complement conversion.
Another approach often used is distributed arithmetic. This approach has recently been used in a few video
transformation chips.
The simplest arithmetic operation is addition. Multiplication can be realized as a series of add-shift
operations, and division and square root can be realized as a series of controlled add-subtract operations. The
conventional two's complement adder involves a carry ripple operation, which limits the throughput of the adder.
In DSP, however, the combined multiply-add operation is most common. Carry-save operations
have been used to realize pipelined multiply-adders using fewer pipelining latches. In a conventional pipelined
two's complement multiplier, the multiplication time is approximately two times the bit-level addition time.
Recently, a technique has been proposed to reduce the multiplication time from 2W bit-level binary adder
times to 1.25W bit-level binary adder times, where W is the word length. This technique is based on the use of
a hybrid number system representation, where one input operand is in two's complement number representation
and the other in redundant number representation [Srinivas and Parhi, 1992]. Using an efficient sign-select
redundant-to-two's complement conversion technique, this multiplier can be made to operate faster and, in
the pipelined mode, would require fewer pipelining latches and less silicon area.
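The statement that multiplication can be realized as a series of add-shift operations can be sketched in C as follows. This is only an illustration for unsigned 16-bit operands; the function name is hypothetical, and it does not model the carry-save or hybrid number system multipliers mentioned above.

```c
#include <stdint.h>

/* Unsigned shift-and-add multiplication: one conditional add and one
   shift per multiplier bit, least significant bit first. */
uint32_t shift_add_multiply(uint16_t multiplicand, uint16_t multiplier)
{
    uint32_t product = 0;
    uint32_t shifted = multiplicand;      /* multiplicand * 2^i at step i */

    for (int i = 0; i < 16; i++) {
        if (multiplier & 1u) {
            product += shifted;           /* add when the current bit is 1 */
        }
        shifted <<= 1;                    /* prepare multiplicand * 2^(i+1) */
        multiplier >>= 1;                 /* move to the next multiplier bit */
    }
    return product;
}
```

For example, shift_add_multiply(13, 11) adds 13, 26, and 104 (the shifted multiplicand for the set bits of 11) to give 143.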
FIGURE 18.7 (a) Two associative realizations of a second-order recursion; (b) an efficient bit-serial realization of the
recursion for a word length of 8.
Computer-Aided Design
With progress in the theory of architectures, computer-aided design (CAD) systems for DSP applications
have also become more powerful. In the early 1980s, the first silicon compiler system for signal processing was
developed at the University of Edinburgh and was referred to as the FIRST design system. This system only
addressed the computer-aided design of bit-serial signal processing systems. Since then more powerful systems
have been developed. The Cathedral I system from Katholieke Universiteit Leuven and the BSSC (bit-serial
silicon compiler) from GE Research Center in Schenectady, New York, also addressed synthesis of bit-serial
circuits. The Cathedral system has now gone through many revisions, and the new versions can synthesize
parallel multiprocessor data paths and can perform more powerful scheduling and allocation. The Lager design
tool at the University of California at Berkeley was developed to synthesize DSP algorithms using
parametrizable macro building blocks (such as ALU, RAM, ROM). This system has also gone through many
revisions. The Hyper system, also developed at the University of California at Berkeley, and the MARS design
system, developed at the University of Minnesota at Minneapolis, perform higher level transformations as well
as scheduling and allocation. These CAD tools are crucial to rapid prototyping of high-performance DSP
integrated circuits.
Future VLSI DSP Systems
Future VLSI systems will make use of a combination of many types of architectures, such as dedicated and
programmable. These systems can be designed successfully with a proper understanding of the algorithms,
applications, and theory of architectures, and with the use of advanced CAD systems.
Defining Terms
Bit serial: Processing of one bit per clock cycle. If the word length is W, then one sample or word is processed
in W clock cycles. In contrast, all W bits of a word are processed in the same clock cycle in a bit-parallel
system.
Digit serial: Processing of more than one but not all bits in one clock cycle. If the digit size is W_1 and the
word length is W, then the word is processed in W/W_1 clock cycles. If W_1 = 1, then the system is referred
to as a bit-serial system, and if W_1 = W, then the system is referred to as a bit-parallel system. In general,
the digit size W_1 need not be a divisor of the word length W, since the least and most significant bits of
consecutive words can be overlapped and processed in the same clock cycle.
Folding: The technique of mapping many tasks to a single processor.
Look-ahead: The technique of computing a state x(n) using the previous state x(n - M) without requiring the
intermediate states x(n - 1) through x(n - M + 1). This is referred to as an M-step look-ahead. In the
case of higher-order computations, there are two forms of look-ahead: clustered look-ahead and scattered
look-ahead. In clustered look-ahead, x(n) is computed using the clustered states x(n - M - N + 1)
through x(n - M) for an Nth-order computation. In scattered look-ahead, x(n) is computed using the
scattered states x(n - iM), where i varies from 1 to N.
Parallel processing: Processing of multiple tasks independently by different processors. This also increases
the throughput.
Pipelining: A technique to increase throughput. A long task is divided into components, and each component
is distributed to one processor. A new task can begin even though the former tasks have not been
completed. In pipelined operation, different components of different tasks are executed at the same
time by different processors. Pipelining leads to an increase in the system latency, i.e., the time elapsed
between the starting of a task and the completion of the task.
Retiming: The technique of moving the delays around the system. Retiming does not alter the latency of the
system.
Systolic: Flow of data in a rhythmic fashion from a memory through many processors, returning to the
memory just as blood flows.
Unfolding: The technique of transforming a program that describes one iteration of an algorithm to another
equivalent program that describes multiple iterations of the same algorithm.
Word parallel: Processing of multiple words in the same clock cycle.
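The look-ahead definition above can be made concrete with a first-order recursion. The sketch below assumes the recursion x(n) = a x(n - 1) + u(n) with zero initial state (this particular recursion and the function names are illustrative assumptions, not taken from the text). Substituting the recursion into itself once gives the 2-step look-ahead form x(n) = a^2 x(n - 2) + a u(n - 1) + u(n), which no longer needs the intermediate state x(n - 1).

```c
/* Direct first-order recursion x(k) = a*x(k-1) + u(k), with x(-1) = 0.
   Returns x(n) for n >= 0. */
long recursion_direct(long a, const long *u, int n)
{
    long x = 0;
    for (int k = 0; k <= n; k++) {
        x = a * x + u[k];
    }
    return x;
}

/* 2-step look-ahead: x(k) = a*a*x(k-2) + a*u(k-1) + u(k).
   Only every second state is computed; x(-1), x(-2), and u(-1) are
   taken as zero. Returns x(n) for n >= 0. */
long recursion_lookahead2(long a, const long *u, int n)
{
    long x = 0;
    for (int k = (n % 2 == 0) ? 0 : 1; k <= n; k += 2) {
        long u_prev = (k >= 1) ? u[k - 1] : 0;
        x = a * a * x + a * u_prev + u[k];
    }
    return x;
}
```

With a = 2 and inputs 1, 2, 3, 4, 5, both functions return x(4) = 57. The look-ahead version reaches it in three loop iterations instead of five, at the cost of extra arithmetic per iteration.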
Related Topic
95.1 Introduction
References
A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, "Low-power CMOS digital design," IEEE J. Solid State
Circuits, vol. 27(4), pp. 473-484, April 1992.
S.Y. Kung, VLSI Array Processors, Englewood Cliffs, N.J.: Prentice-Hall, 1988.
E.A. Lee and D.G. Messerschmitt, "Pipeline interleaved programmable DSP's," IEEE Trans. Acoustics, Speech,
Signal Processing, vol. 35(9), pp. 1320-1345, September 1987.
C.E. Leiserson, F. Rose, and J. Saxe, "Optimizing synchronous circuitry by retiming," Proc. 3rd Caltech Conf.
VLSI, Pasadena, Calif., pp. 87-116, March 1983.
K.K. Parhi, "Algorithm transformation techniques for concurrent processors," Proc. IEEE, vol. 77(12), pp.
1879-1895, December 1989.
K.K. Parhi, "Systematic approach for design of digit-serial processing architectures," IEEE Trans. Circuits Systems,
vol. 38(4), pp. 358-375, April 1991.
K.K. Parhi and D.G. Messerschmitt, "Pipeline interleaving and parallelism in recursive digital filters," IEEE
Trans. Acoustics, Speech, Signal Processing, vol. 37(7), pp. 1099-1135, July 1989.
K.K. Parhi, C.Y. Wang, and A.P. Brown, "Synthesis of control circuits in folded pipelined DSP architectures,"
IEEE J. Solid State Circuits, vol. 27(1), pp. 29-43, January 1992.
N.R. Shanbhag and K.K. Parhi, "A pipelined adaptive lattice filter architecture," Proc. 1992 IEEE Int. Symp.
Circuits and Systems, San Diego, May 1992.
H.R. Srinivas and K.K. Parhi, "High-speed VLSI arithmetic processor architectures using hybrid number
representation," J. VLSI Signal Processing, vol. 4(2/3), pp. 177-198, 1992.
Further Information
A detailed video tutorial on "Implementation and Synthesis of VLSI Signal Processing Systems" presented by
K.K. Parhi and J.M. Rabaey in March 1992 can be purchased from the customer service department of IEEE,
445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
Special architectures for video communications can be found in the book VLSI Implementations for Image
Communications, published as the fourth volume of the series Advances in Image Communications (edited by
Peter Pirsch) by the Elsevier Science Publishing Co. in 1993. The informative article "Research on VLSI for
Digital Video Systems in Japan," published by K.K. Parhi in the fourth volume of the 1991 Office of Naval
Research Asian Office Scientific Information Bulletin (pages 93-98), provides examples of video codec designs
using special architectures. For video programmable digital signal processor approaches, see I. Tamitani, H.
Harasaki, and T. Nishitani, "A Real-Time HDTV Signal Processor: HD-VSP," published in IEEE Transactions
on Circuits and Systems for Video Technology, March 1991, and T. Fujii, T. Sawabe, N. Ohta, and S. Ono,
"Implementation of Super High-Definition Image Processing on HiPIPE," published in 1991 IEEE International
Symposium on Circuits and Systems, held in June 1991 in Singapore (pages 348-351).
The IEEE Design and Test of Computers published three special issues related to computer-aided design of
special architectures; these issues were published in October 1990 (addressing high-level synthesis), December
1990 (addressing silicon compilations), and June 1991 (addressing rapid prototyping).
Descriptions of various CAD systems can be found in the following references. The description of the FIRST
system can be found in the article "A Silicon Compiler for VLSI Signal Processing," by P. Denyer et al. in the
Proceedings of the ESSCIRC conference held in Brussels in September 1982 (pages 215-218). The Cathedral
system has been described in R. Jain et al., "Custom Design of a VLSI PCM-FDM Transmultiplexor from System
Specifications to Circuit Layout Using a Computer Aided Design System," published in IEEE Journal of Solid
State Circuits in February 1986 (pages 73-85). The Lager system has been described in "An Integrated Automatic
Layout Generation System for DSP Circuits," by J. Rabaey, S. Pope, and R. Brodersen, published in the July
1985 issue of the IEEE Transactions on Computer Aided Design (pages 285-296). The description of the MARS
Design System can be found in C.-Y. Wang and K.K. Parhi, "High-Level DSP Synthesis Using MARS System,"
published in Proceedings of the 1992 IEEE International Symposium on Circuits and Systems in San Diego, May
1992. A tutorial article on high-level synthesis can be found in "The High-Level Synthesis of Digital Systems,"
by M.C. McFarland, A. Parker, and R. Composano, published in the February 1990 issue of the Proceedings of
the IEEE (pages 310-318).
Articles on pipelined multipliers can be found in T.G. Noll et al., "A Pipelined 330-MHz Multiplier," IEEE
Journal of Solid State Circuits, June 1986 (pages 411-416) and in M. Hatamian and G. Cash, "A 70-MHz 8-Bit ×
8-Bit Parallel Pipelined Multiplier in 2.5-µm CMOS," IEEE Journal of Solid State Circuits, 1986.
Technical articles on special architectures and chips for signal and image processing appear at different places,
including proceedings of conferences such as IEEE Workshop on VLSI Signal Processing, IEEE International
Conference on Acoustics, Speech, and Signal Processing, IEEE International Symposium on Circuits and
Systems, IEEE International Solid State Circuits Conference, IEEE Custom Integrated Circuits Conference,
IEEE International Conference on Computer Design, ACM/IEEE Design Automation Conference, ACM/IEEE
International Conference on Computer Aided Design, International Conference on Application Specific Array
Processors, and journals such as IEEE Transactions on Signal Processing, IEEE Transactions on Image Processing,
IEEE Transactions on Circuits and Systems, Part II: Analog and Digital Signal Processing, IEEE Transactions on
Computers, IEEE Journal of Solid State Circuits, IEEE Signal Processing Magazine, IEEE Design and Test Magazine,
and Journal of VLSI Signal Processing.
18.2 Signal Processing Chips and Applications
Rulph Chassaing and Bill Bitler
Recent advances in very large scale integration (VLSI) have contributed to the current digital signal processors.
These processors are special-purpose fast microprocessors characterized by architectures and instructions
suitable for real-time digital signal processing (DSP) applications. The commercial DSP processor, a little more
than a decade old, has emerged because of the ever-increasing number of signal processing applications. DSP
processors are now being utilized in a number of applications, from communications and controls to speech
and image processing. They have found their way into talking toys and music synthesizers. A number of texts
such as [Chassaing, 1992] and articles such as [Ahmed and Kline, 1991] have been written, discussing the
applications that use DSP processors and the recent advances in DSP systems.
DSP Processors
Digital signal processors are currently available from a number of companies, including Texas Instruments,
Inc. (Texas), Motorola, Inc. (Arizona), Analog Devices, Inc. (Massachusetts), AT&T (New Jersey), and NEC
(California). These processors are categorized as either fixed-point or floating-point processors. Several
companies are now supporting both types of processors. Special-purpose digital signal processors, designed for a
specific signal processing application such as the fast Fourier transform (FFT), have also emerged. Currently
available digital signal processors range from simple, low-cost processing units through high-performance units
such as Texas Instruments' (TI) TMS320C40 (Chassaing and Martin, 1995) and TMS320C80, and Analog
Devices' ADSP-21060 SHARC (Chassaing and Ayers, 1996).
One of the first-generation digital signal processors is the (N-MOS technology) TMS32010, introduced by
Texas Instruments in 1982. This first-generation fixed-point processor is based on the Harvard architecture,
with a fast on-chip hardware multiplier/accumulator, and with data and instructions in separate memory spaces,
allowing for concurrent accesses. This type of pipelining feature enables the processor to execute one instruction
while fetching the next instruction at the same time. Other features include 144 (16-bit) words of on-chip data
RAM and a 16-bit by 16-bit multiply operation in one instruction cycle time of 200 ns. Since many instructions
can be executed in a single cycle, the TMS32010 is capable of executing 5 million instructions per second
(MIPS). Major drawbacks of this first-generation processor are its limited on-chip memory size and much
slower execution time for accessing external memory. Improved versions of this first-generation processor are
now available in C-MOS technology, with a faster instruction cycle time of 160 ns.
The second-generation fixed-point processor TMS32020, introduced in 1985 by TI, was quickly followed by
an improved C-MOS version, the TMS320C25 [Chassaing and Horning, 1990], in 1986. Features of the TMS320C25
include 544 (16-bit) words of on-chip data RAM, separate program and data memory spaces (each 64 K words),
and an instruction cycle time of 100 ns, enabling the TMS320C25 to execute 10 MIPS. A faster version, TI's
fixed-point TMS320C50 processor, is available with an instruction cycle time of 35 ns.
The third-generation TMS320C30 (by TI) supports fixed- as well as floating-point operations [Chassaing,
1992]. Features of this processor include 32-bit by 32-bit floating-point multiply operations in one instruction
cycle time of 60 ns. Since a number of instructions, such as load and store, multiply and add, can be performed
in parallel (in one cycle time), the TMS320C30 can execute a pair of parallel instructions in one 60-ns cycle
(an effective 30 ns per instruction), allowing for 33.3 MIPS. The Harvard-based architecture of the fixed-point
processors was abandoned for one allowing four levels of pipelining, with three subsequent instructions being
consequently fetched, decoded, and read while the current instruction is being executed. The TMS320C30 has
2 K words of on-chip memory and a total of 16 million words of addressable memory spaces for program, data,
and input/output. Specialized instructions are available to make common DSP algorithms such as filtering and
spectral analysis execute fast and efficiently. The architecture of the TMS320C30 was designed to take advantage
of higher-level languages such as C and ADA. The TMS320C31 and TMS320C32, recent members of the
third-generation floating-point processors, are available with a 40 ns instruction cycle time.
DSP starter kits (DSK) are inexpensive development systems available from TI and based on both the
fixed-point TMS320C50 and the floating-point TMS320C31 processors. We will discuss both the fixed-point
TMS320C25 and the floating-point TMS320C30 digital signal processors, including the development tools
available for each of these processors and DSP applications.
Fixed-Point TMS320C25-Based Development System
TMS320C25-based development systems are now available from a number of companies such as Hyperception
Inc., Texas, and Atlanta Signal Processors, Inc., Georgia. The Software Development System (SWDS), available
from TI, includes a board containing the TMS320C25, which plugs into a slot on an IBM compatible PC. Within
the SWDS environment, a program can be developed, assembled, and run. Debugging aids supported by the
SWDS include single-stepping, setting of breakpoints, and display/modification of registers.
A typical workstation consists of:
1. An IBM compatible PC. Commercially available DSP packages (such as from Hyperception or Atlanta
Signal Processors) include a number of utilities and filter design techniques.
2. The SWDS package, which includes an assembler, a linker, a debug monitor, and a C compiler.
3. Input/output alternatives such as TI's analog interface board (AIB) or analog interface chip (AIC).
The AIB includes a 12-bit analog-to-digital converter (ADC) and a 12-bit digital-to-analog converter (DAC).
A maximum sampling rate of 40 kHz can be obtained. With (input) antialiasing and (output) reconstruction
filters mounted on a header on the AIB, different input/output (I/O) filter bandwidths can be achieved.
Instructions such as IN and OUT can be used for input/output accesses. The AIC, which provides an inexpensive
I/O alternative, includes a 14-bit ADC and DAC and antialiasing/reconstruction filters, all on a single C-MOS chip.
Two inputs and one output are available on the AIC. (A TMS320C25/AIC interface diagram and communication
routines can be found in Chassaing and Horning, 1990.) The TLC32047 AIC is a recent member of the TLC32040
family of voiceband analog interface circuits, with a maximum sampling rate of 25 kHz.
Implementation of a Finite Impulse Response Filter with the TMS320C25
The convolution equation

$$y(n) = \sum_{k=0}^{N-1} h(k)\,x(n-k) = h(0)x(n) + h(1)x(n-1) + \cdots + h(N-2)x(n-(N-2)) + h(N-1)x(n-(N-1)) \tag{18.1}$$

represents a finite impulse response (FIR) filter with length N. The memory organization for the coefficients
h(k) and the input samples x(n - k) is shown in Table 18.1. The coefficients are placed within a specified internal
program memory space and the input samples within a specified data memory space. The program counter
(PC) initially points at the memory location that contains the last coefficient h(N - 1), for example at memory
address FF00h (in hex). One of the (8) auxiliary registers points at the memory address of the last or least
recent input sample. The most recent sample is represented by x(n). The following program segment implements
(18.1):
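    LARP  AR1         ; select AR1 for indirect addressing
    RPTK  N-1         ; execute the next instruction N times in total
    MACD  FF00h,*-    ; multiply, accumulate previous product, move data
    APAC              ; accumulate the last product h(0)x(n)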
The first instruction selects auxiliary register AR1, which will be used for indirect addressing. The second
instruction, RPTK, causes the subsequent MACD instruction to execute N times (repeated N - 1 times). The
MACD instruction has the following functions:
1. Multiplies the coefficient value h(N - 1) by the input sample value x(n - (N - 1)).
2. Accumulates any previous product stored in a special register (TR).
3. Copies the data memory sample value into the location of the next-higher memory. This "data move"
is to model the input sample delays associated with the next unit of time n + 1.
The last instruction, APAC, accumulates the last multiply operation h(0)x(n).
At time n + 1, the convolution Eq. (18.1) becomes

$$y(n+1) = h(0)x(n+1) + h(1)x(n) + \cdots + h(N-2)x(n-(N-3)) + h(N-1)x(n-(N-2)) \tag{18.2}$$

The previous program segment can be placed within a loop, with the PC and the auxiliary register AR1
reinitialized (see the memory organization of the samples x(k) associated with time n + 1 in Table 18.1). Note
that the last multiply operation is h(0)x(·), where x(·) represents the newest sample. This process can be
continuously repeated for time n + 2, n + 3, and so on.
The characteristics of a frequency selective FIR filter are specified by a set of coefficients that can be readily
obtained using commercially available filter design packages. These coefficients can be placed within a generic
FIR program. Within 5-10 minutes, an FIR filter can be implemented in real time. This includes finding the
coefficients; assembling, linking, and downloading the FIR program into the SWDS; and observing the desired
frequency response displayed on a spectrum analyzer. A different FIR filter can be quickly obtained since the
only necessary change in the generic program is to substitute a new set of coefficients.
The approach for modeling the sample delays involves moving the data. A different scheme is used with the
floating-point TMS320C30 processor with a circular mode of addressing.
TABLE 18.1 TMS320C25 Memory Organization for Convolution
                                           Input Samples
Coefficients        Time n                   Time n + 1          Time n + 2
PC -> h(N - 1)      x(n)                     x(n + 1)            x(n + 2)
      h(N - 2)      x(n - 1)                 x(n)                x(n + 1)
      .             .                        .                   .
      .             .                        .                   .
      .             .                        .                   .
      h(2)          x(n - (N - 3))           x(n - (N - 4))      x(n - (N - 5))
      h(1)          x(n - (N - 2))           x(n - (N - 3))      x(n - (N - 4))
      h(0)          AR1 -> x(n - (N - 1))    x(n - (N - 2))      x(n - (N - 3))
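The data-move scheme used with the TMS320C25 can be modeled in C as a rough sketch. This is not TMS320C25 code: the function name, the array layout (delay[0] holds the newest sample, higher indices hold older samples, plus one spare word for the last data move), and the 16-bit sample and 32-bit accumulator widths are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* FIR filter with explicit data moves, loosely modeled on the MACD loop.
   h[k] multiplies delay[k] = x(n - k). The delay array must have
   n_taps + 1 entries: the extra entry receives the copy of the oldest
   sample, which is then no longer needed. */
int32_t fir_data_move(const int16_t *h, int16_t *delay, size_t n_taps,
                      int16_t new_sample)
{
    int32_t acc = 0;

    delay[0] = new_sample;                 /* newest sample x(n) */
    for (size_t k = n_taps; k-- > 0; ) {   /* oldest sample first */
        acc += (int32_t)h[k] * delay[k];   /* multiply and accumulate */
        delay[k + 1] = delay[k];           /* data move to next-higher slot */
    }
    return acc;                            /* y(n) */
}
```

Feeding a unit impulse followed by zeros through a filter with coefficients 1, 2, 3 returns 1, 2, 3, and then 0, which is the impulse response expected from Eq. (18.1).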
Floating-Point TMS320C30-Based Development System
TMS320C30-based DSP development systems are also currently available from a number of companies. The
following are available from Texas Instruments:
1. An evaluation module (EVM). The EVM is a powerful, yet relatively inexpensive 8-bit card that plugs
into a slot on an IBM AT compatible. It includes the third-generation TMS320C30, 16 K of user RAM,
and an AIC for I/O. A serial port connector available on the EVM can be used to interface the TMS320C30
to other input/output devices (the TMS320C30 has two serial ports). An additional AIC can be interfaced
to the TMS320C30 through this serial port connector. A very powerful, yet inexpensive, analog evaluation
fixture, available from Burr-Brown (Arizona), can also be readily interfaced to the serial port on the
EVM. This complete two-channel analog evaluation fixture includes an 18-bit DSP102 ADC, an 18-bit
DSP202 DAC, and antialiasing and reconstruction filters. The ADC has a maximum sampling rate of 200 kHz.
2. An XDS1000 emulator, which is powerful but quite expensive. A module can be readily built as a target
system to interface to the XDS1000 [Chassaing, 1992]. This module contains the TMS320C30 and 16 K of
static RAM. Two connectors are included on this module, for interfacing to either an AIC module or to a
second-generation analog interface board (AIB). The AIC was discussed in conjunction with the
TMS320C25. The AIB includes Burr-Brown's 16-bit ADC and DAC with a maximum sampling rate of
58 kHz. An AIC is also included on this newer AIB version.
EVM Tools
The EVM package includes an assembler, a linker, a simulator, a C compiler, and a C source debugger. The
second-generation TMS320C25 fixed-point processor is supported by C with some degree of success. The
architecture and instruction set of the third-generation TMS320C30 processor facilitate the development of
high-level language compilers. An optimizer option is available with the C compiler for the TMS320C30. A
C-code program can be readily compiled, assembled, linked, and downloaded into either a simulator or the EVM
for real-time processing. A run-time support library of C functions, included with the EVM package, can be
used during linking. During simulation, the input data can be retrieved from a file and the output data written
into a file. Input and output port addresses can be appropriately specified. Within a real-time processing
environment with the EVM, the C source debugger can be used. One can single-step through a C-code program
while observing the equivalent step(s) through the assembly code. Both the C code and the corresponding
assembly code can be viewed through the EVM windows. One can also monitor at the same time the contents
of registers, memory locations, and so on.
Implementation of a Finite Impulse Response Filter with the TMS320C30
Consider again the convolution equation, Eq. (18.1), which represents an FIR filter. Table 18.2 shows the
TMS320C30 memory organization used for the coefficients and the input samples.
TABLE 18.2 TMS320C30 Memory Organization for Convolution
Coefficients         Time n                   Time n + 1               Time n + 2
AR0 -> h(N - 1)      AR1 -> x(n - (N - 1))    x(n + 1)                 x(n + 1)
       h(N - 2)      x(n - (N - 2))           AR1 -> x(n - (N - 2))    x(n + 2)
       h(N - 3)      x(n - (N - 3))           x(n - (N - 3))           AR1 -> x(n - (N - 3))
       .             .                        .                        .
       .             .                        .                        .
       .             .                        .                        .
       h(1)          x(n - 1)                 x(n - 1)                 x(n - 1)
       h(0)          x(n)                     x(n)                     x(n)
Initially, all the input samples can be set to zero. The newest sample x(n), at time n, can be retrieved from
an ADC using the following instructions:
    FLOAT *AR3,R3       ; read the input port addressed by AR3 into R3 as floating point
    STF   R3,*AR1++%    ; store R3 at the address in AR1, then postincrement AR1 circularly
These two instructions cause an input value x(n), retrieved from an input port address specified by auxiliary
register AR3, to be loaded into a register R3 (one of eight 40-bit-wide extended precision registers), then stored
in a memory location pointed to by AR1 (AR1 would be first initialized to point at the "bottom" or higher-memory
address of the table for the input samples). AR1 is then postincremented in a circular fashion, designated with
the modulo operator %, to point at the oldest sample x(n - (N - 1)), as shown in Table 18.2. The size of the
circular buffer must first be specified. The following program segment implements (18.1):
    RPTS  LENGTH-1
    MPYF  *AR0++%,*AR1++%,R0
 || ADDF  R0,R2,R2
    ADDF  R0,R2
The "repeat single" instruction RPTS causes the next (multiply) floating-point instruction MPYF to be
executed LENGTH times (repeated LENGTH - 1 times), where LENGTH is the length of the FIR filter. Furthermore,
since the first ADDF addition instruction is in parallel (designated by ||) with the MPYF instruction, it is also
executed LENGTH times. From Table 18.2, AR0, one of the eight available auxiliary registers, initially points
at the memory address (a table address) which contains the coefficient h(N - 1), and a second auxiliary register
AR1 now points to the address of the oldest input sample x(n - (N - 1)). The second indirect addressing mode
instruction multiplies the content in memory (address pointed to by AR0) h(N - 1) by the content in memory
(address pointed to by AR1) x(n - (N - 1)), with the result stored in R0. Concurrently (in parallel), the content
of R0 is added to the content of R2, with the result stored in R2. Initially R0 and R2 are set to zero; hence, the
resulting value in R2 is not the product of the first multiply operation. After the first multiply operation, both
AR0 and AR1 are incremented, and h(N - 2) is multiplied by x(n - (N - 2)). Concurrently, the result of the
first multiply operation (stored in R0) is accumulated into R2. The second addition instruction, executed only
once, accumulates the last product h(0)x(n) (similar to the APAC instruction associated with the fixed-point
TMS320C25). The overall result yields an output value y(n) at time n. After the last multiply operation, both
AR0 and AR1 are postincremented to point at the "top" or lower-memory address of each circular buffer. The
process can then be repeated for time n + 1 in order to obtain a second output value y(n + 1). Note that the
newest sample x(n + 1) would be retrieved from an ADC using the FLOAT and STF instructions, then placed
at the top memory location of the buffer (table) containing the samples, overwriting the initial value
x(n - (N - 1)). AR1 is then incremented to point at the address containing x(n - (N - 2)), and the previous four
instructions can be repeated. The last multiply operation involves h(0) and x(·), where x(·) is the newest sample
x(n + 1), at time n + 1. The foregoing procedure would be repeated to produce an output y(n + 2), y(n + 3),
and so on. Each output value would be converted to a fixed-point equivalent value before being sent to a DAC.
The frequency response of an FIR filter with 41 coefficients and a center frequency of 2.5 kHz, obtained from
a signal analyzer, is displayed in Fig. 18.8.
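The circular addressing scheme described above can be sketched in C, with an index taken modulo the buffer length standing in for the % addressing mode. This is an illustrative model rather than TMS320C30 code; the function name and the use of an index variable (here called oldest) in place of auxiliary register AR1 are assumptions.

```c
#include <stddef.h>

/* FIR filter using a circular sample buffer, so no data is moved.
   *oldest is the index of the oldest sample and is updated on each call.
   h[k] multiplies x(n - k); the loop starts with h[n_taps - 1] and the
   oldest sample, as in the MPYF/ADDF loop described in the text. */
float fir_circular(const float *h, float *buf, size_t n_taps,
                   size_t *oldest, float new_sample)
{
    float acc = 0.0f;
    size_t idx;

    buf[*oldest] = new_sample;             /* newest overwrites oldest */
    *oldest = (*oldest + 1) % n_taps;      /* now points at the new oldest */

    idx = *oldest;
    for (size_t k = n_taps; k-- > 0; ) {
        acc += h[k] * buf[idx];            /* multiply and accumulate */
        idx = (idx + 1) % n_taps;          /* circular postincrement */
    }
    return acc;                            /* y(n) */
}
```

As with the data-move version, a unit impulse applied to coefficients 1, 2, 3 produces the outputs 1, 2, 3, and 0. The difference is that only one buffer entry is written per input sample, which is the benefit of circular addressing.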
FIR and IIR Implementation Using C and Assembly Code
A real-time implementation of a 45-coefficient bandpass FIR filter and a sixth-order IIR filter with 345 samples,
using C code and TMS320C30 code, is discussed in [Chassaing and Bitler, 1991]. Tables 18.3 and 18.4 show a
comparison of execution times of those two filters. The C language FIR filter, implemented without the modulo
operator % and compiled with C compiler V4.1, executed two times slower than an equivalent assembly
language filter (1.5 times slower using a newer C compiler, V4.4). The assembly language filter has an execution
time similar to that of one implemented with a filter routine in assembly, called by a C program. The C language
IIR filter ran 1.3 times slower than the corresponding assembly language IIR filter. These slower execution
times may be acceptable for many applications. Where execution speed is crucial,
a time-critical function may be written in assembly and called from a C program. In applications where speed
is not absolutely crucial, C provides a better environment because of its portability and maintainability.
Rea!-Time App!icatiuns
A numbei of applications aie discussed in Chassaing and Hoining (1990) using TMS320C25 code and in
Chassaing (1992) using TMS320C30 and C code. These applications include multiiate and adaptive flteiing,
modulation techniques, and giaphic and paiametiic equalizeis. Two applications aie biiey discussed heie: a
ten-band multiiate fltei and a video line iate analysis.
1. The functional block diagiam of the multiiate fltei is shown in Fig. 18.9. The multiiate design piovides
a signifcant ieduction in piocessing time and data stoiage, compaied to an equivalent single-iate design.
With multiiate flteiing, we can use a decimation opeiation in oidei to obtain a sample iate ieduction
oi an inteipolation opeiation (as shown in Fig. 18.9) in oidei to obtain a sample iate inciease Ciochieie
and Rabinei, 1983]. A pseudoiandom noise geneiatoi implemented in softwaie piovides the input noise
to the ten octave band flteis. Each octave band fltei consists of thiee 1/3-octave flteis (each with 41
coeffcients), which can be individually contiolled. A contiolled noise souice can be obtained with this
design. Since each 1/3-octave band fltei can be tuined on oi o[[, the noise spectium can be shaped
accoidingly. The inteipolation fltei is a low-pass FIR fltei with a 2:1 data-iate inciease, yielding two
sample outputs foi each input sample. The sample iate of the highest octave-band fltei is set at 32,768
samples pei second, with each successively lowei band piocessing at half the iate of the next-highei
band. The multiiate fltei (a nine-band veision) was implemented with the TMS320C25 Chassaing
et al., 1990]. Figuie 18.10 shows the thiee 1/3-octave band flteis of band 10 implemented with the EVM
FIGURE 18.8 Fiequency iesponse of 41-coeffcient FIR fltei.
TABLE 18.3 Execution Time and Piogiam Size
of FIR Filtei
FIR Execution Time Size
(45 samples) (msec) (woids)
C with modulo 4.16 122
C without modulo 0.338 116
C-called assembly 0.1666 74
Assembly 0.1652 27
TABLE 18.4 Execution Time and Piogiam Size
of 6th-Oidei IIR Filtei
IIR Execution Time Size
(345 samples) (msec) (woids)
C 1.575 109
Assembly 1.18 29
in conjunction with the two-channel analog fxtuie (made by Buii-Biown). The centei fiequency of the
middle 1/3-octave band 10 fltei is at appioximately 8 kHz since the coeffcients weie designed foi a
centei fiequency of 1/4 the sampling iate (the middle 1/3-octave band 9 fltei would be centeied at 4
kHz, the middle 1/3-octave band 8 fltei at 2 kHz, and so on). Note that the centei fiequency of the
middle 1/3-octave band 1 fltei would be at 2 Hz if the highest sampling iate is set at 4 kHz. Obseive
fiom Fig. 18.10 that the ciossovei fiequencies occui at the 3-dB points. Since the main piocessing time
of the multiiate fltei (implemented in assembly code) was measuied to be 8.8 ms, the maximum
sampling iate was limited to 58 ksps.
2. A video line iate analysis implemented entiiely in C code is discussed in Chassaing and Bitlei 1992].
A module was built to sample a video line of infoimation. This module included a 9.8-MHz clock, a
high sampling iate 8-bit ADC and appiopiiate suppoit ciicuitiy (compaiatoi, FIFO buffei, etc.). Intei-
active featuies allowed foi the selection of one (out of 256) hoiizontal lines of infoimation and the
execution of algoiithms foi digital flteiing, aveiaging, and edge enhancement, with the iesulting effects
displayed on the PC scieen. Figuie 18.11 shows the display of a hoiizontal line (line #125) of infoimation
FIGURE 18.9 Multirate filter functional block diagram.
FIGURE 18.10 Frequency responses of the 1/3-octave band ten filters.
obtained from a test chart with a charge coupled device (CCD) camera. The function key F3 selects the
1-MHz low-pass filter resulting in the display shown in Fig. 18.12. The 3-MHz filter (with F4) would
pass more of the higher-frequency components of the signal but with less noise reduction. F5 implements
the noise averaging algorithm. The effect of the edge enhancement algorithm (with F7) is displayed in
Fig. 18.13.
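
The band arithmetic described for the multirate filter in item 1 (each lower band running at half the
rate of the band above it, with the middle 1/3-octave filter centered at 1/4 of its band's sampling rate)
can be checked with a short C sketch. The function names, the NUM_BANDS constant, and the 32-kHz top
rate (inferred from the 8-kHz center frequency quoted for band 10) are assumptions made for illustration;
they do not come from the original implementation.

```c
#include <assert.h>

#define NUM_BANDS 10 /* assumed: band 10 is the highest band, as in the text */

/* Sampling rate of a band, assuming band NUM_BANDS runs at top_rate_hz and
   each successively lower band runs at half the rate of the next-higher band. */
double band_sample_rate(int band, double top_rate_hz)
{
    assert(band >= 1 && band <= NUM_BANDS);
    double rate = top_rate_hz;
    for (int b = NUM_BANDS; b > band; b--) {
        rate /= 2.0; /* one halving per step down from the top band */
    }
    return rate;
}

/* Center frequency of the middle 1/3-octave filter of a band:
   the coefficients are designed for 1/4 of that band's sampling rate. */
double band_center_hz(int band, double top_rate_hz)
{
    return band_sample_rate(band, top_rate_hz) / 4.0;
}
```

With a top rate of 32 kHz, band_center_hz gives 8 kHz, 4 kHz, and 2 kHz for bands 10, 9, and 8. With
a top rate of 4 kHz, band 1 comes out at 1.953125 Hz, which matches the "2 Hz" figure quoted in the text
after rounding.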
Conclusions and Future Directions
DSP processors have been used extensively in a number of applications, even in non-DSP applications such as
graphics. The fourth-generation floating-point TMS320C40, code compatible with the TMS320C30, features
an instruction cycle time of 40 ns and six serial ports. The fifth-generation fixed-point TMS320C50, code
compatible with the first two generations of fixed-point processors, features an instruction cycle time of 35 ns
and 10 K words (16-bit) of on-chip data and program memory. Currently, both the fixed-point and
floating-point processors are being supported by TI.
FIGURE 18.11 Display of a horizontal line of video signal.
FIGURE 18.12 Video line signal with 1-MHz filtering.
Defining Terms
C compiler: Program that translates C code into assembly code.
Digital signal processor: Special-purpose microprocessor with an architecture suitable for fast execution of
signal processing algorithms.
Fixed-point processor: A processor capable of operating on scaled integer and fractional data values.
Floating-point processor: Processor capable of operating on integers as well as on fractional data values
without scaling.
On-chip memory: Internal memory available on the digital signal processor.
Pipelining feature: Feature that permits parallel operations of fetching, decoding, reading, and executing.
Special-purpose digital signal processor: Digital signal processor with a special feature for handling a specific
signal processing application, such as FFT.
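
The "scaled integer and fractional data values" in the fixed-point definition above can be illustrated with
a small C sketch. It assumes the Q15 convention (a 16-bit signed integer interpreted as a fraction in
[-1, 1) with a scale factor of 32768), which is a common choice on 16-bit fixed-point DSPs; the text
itself does not name a format, and the helper names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a fraction in [-1, 1) to Q15 (assumed format: value * 32768). */
int16_t q15_from_double(double x)
{
    assert(x >= -1.0 && x < 1.0);
    double scaled = x * 32768.0;
    /* Round to nearest; the cast itself truncates toward zero. */
    int32_t r = (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
    if (r > 32767) {
        r = 32767; /* values just below 1.0 can round up past the maximum */
    }
    return (int16_t)r;
}

/* Convert a Q15 value back to a floating-point fraction. */
double q15_to_double(int16_t q)
{
    return (double)q / 32768.0;
}

/* Multiply two Q15 fractions. The 16x16 product is a 32-bit Q30 value,
   which is rounded and shifted right by 15 bits to return to Q15.
   Assumes an arithmetic right shift for negative values, which is
   implementation-defined in C but standard on mainstream compilers. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b; /* Q30 */
    product += 1 << 14;                        /* round to nearest */
    int32_t result = product >> 15;            /* back to Q15 */
    if (result > 32767) {
        result = 32767; /* only (-1) * (-1) overflows; saturate it */
    }
    return (int16_t)result;
}
```

A floating-point processor performs this scaling in hardware, which is why its definition says "without
scaling"; on a fixed-point device the programmer (or compiler) tracks the scale factor explicitly, as
q15_mul does here.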
Related Topics
14.3 Design and Implementation of Digital Filters • 79.1 IC Logic Family Operation and Characteristics
References
H. M. Ahmed and R. B. Kline, "Recent advances in DSP systems," IEEE Communications Magazine, 1991.
R. Chassaing, Digital Signal Processing with C and the TMS320C30, New York: Wiley, 1992.
R. Chassaing and R. Ayers, "Digital signal processing with the SHARC," in Proceedings of the 1996 ASEE Annual
Conference, 1996.
R. Chassaing and B. Bitler, "Real-time digital filters in C," in Proceedings of the 1991 ASEE Annual Conference,
1991.
R. Chassaing and B. Bitler, "A video line rate analysis using the TMS320C30 floating-point digital signal
processor," in Proceedings of the 1992 ASEE Annual Conference, 1992.
R. Chassaing and D. W. Horning, Digital Signal Processing with the TMS320C25, New York: Wiley, 1990.
R. Chassaing and P. Martin, "Parallel processing with the TMS320C40," in Proceedings of the 1995 ASEE Annual
Conference, 1995.
R. Chassaing, W.A. Peterson, and D. W. Horning, "A TMS320C25-based multirate filter," IEEE Micro, 1990.
FIGURE 18.13 Video line signal with edge enhancement.
R.E. Crochiere and L.R. Rabiner, Multirate Digital Signal Processing, Englewood Cliffs, N.J.: Prentice-Hall, 1983.
K. S. Lin (ed.), Digital Signal Processing Applications with the TMS320 Family: Theory, Algorithms, and
Implementations, vol. 1, Texas Instruments Inc., Texas, 1989.
A. V. Oppenheim and R. Schafer, Discrete-Time Signal Processing, Englewood Cliffs, N.J.: Prentice-Hall, 1989.
P. Papamichalis (ed.), Digital Signal Processing Applications with the TMS320 Family: Theory, Algorithms, and
Implementations, vol. 3, Texas Instruments, Inc., Texas, 1990.
Further Information
Rulph Chassaing teaches hands-on workshops on digital signal processing using C and the TMS320C30,
offered at Roger Williams University in Bristol, RI, 02809. He offered a one-week workshop in August 1996,
supported by the National Science Foundation (NSF). He will offer two workshops in August 1997, supported
by NSF, using the TMS320C30 and the TMS320C31. Workshops on the TMS320 family of digital signal
processors are offered by Texas Instruments, Inc. at various locations.
A tutorial "Digital Signal Processing Comes of Age" can be found in the IEEE Spectrum, May 1996.
