Lect 2

Processor Architectures and
Program Mapping
Programmable Digital Signal Processors
5kk10
TU/e
Henk Corporaal
Jef van Meerbergen
Bart Mesman
Topic 2: Programmable Digital Signal Processors

real-time worst-case processing = need for more compute power
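$\frac{\text{sec}}{\text{prog}} = \frac{\text{instr}}{\text{prog}} \times \frac{\text{cycles}}{\text{instr}} \times \frac{\text{sec}}{\text{cycle}}$
CPI = 1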
instruction level parallelism (ILP)
hardware support for loop control
attention for high level data types e.g. arrays, delaylines
(vs. scalars for CPUs)
difficult to compare architectures
e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling,
shuffling, initialisation can be included or forgotten
benchmarking (Berkeley Design Technology Inc (BDTi))
(compare to SpecInt benchmarks for CPUs)
Outline
architectures for programmable DSPs
multiplier-accumulator
modified Harvard architecture
extension with an ALU (decision making)
controller architectures
examples: TI, Motorola, Philips
code generation
recent developments: VLIW (Very Long Instruction Word)
examples: C6 and TM
Sum of products = basic operation
for correlation, filtering, spectral analysis ...
linear expression: sum over i of c(i) * x(i)
Goal = 1 cycle per iteration
[Figure: multiplier-accumulator datapath. Labels: inputs c(i) and x(i), control, MPY (Booth, Wallace ...), P_reg, clock, PR, ADDER, ACR]
Modifications
extra inputs/outputs
position ACR (1 or 2)
adder/subtractor
extra pipelines
asymmetric inputs
multi-precision
DSP data types

not every signal requires 32 bits
2 types of DSP: floating point and integer
advantages FP: most specs are in FP
(conversion to int is time consuming since the behaviour
may change)
disadvantage FP: cost (area, speed, power)
wanted : type of output of an operation = type of input
(because both stored in RAM)
no problem for FP, but a problem for integer:
integer multiplication doubles the number of bits: n * n => 2n
What about fractional numbers?
0.9 x 0.9 = 0.81
DSP data types

integer and fractional numbers are a special case of fixed point
fix <p,q> (ART designer & SystemC): p = total number of bits, q = number of fractional bits
example fix <8,3>:
  bits      1     1    1    0    1  .  1     0     1
  weights  -2^4  2^3  2^2  2^1  2^0 . 2^-1  2^-2  2^-3
  value = -19/8 = -2.375 (scale factor 1/8)
msb: negative weight (2s complement)
lsb: quantization error
if q=0 then integer e.g. int <8,0>
if q=p-1 then fractional e.g. int <8,7>
Same ALU handles fix <8,1>, fix <8,2>, fix <8,3>, ...
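The fix <p,q> interpretation above can be sketched in C. This is a minimal illustration, not part of the original slides; the function name `fix_to_double` is hypothetical, and the sketch assumes 1 <= p <= 31 and 0 <= q < p.

```c
#include <stdint.h>

/* Interpret the low p bits of raw as a 2s complement fix<p,q> number.
   Hypothetical helper; assumes 1 <= p <= 31 and 0 <= q < p. */
double fix_to_double(uint32_t raw, int p, int q)
{
    uint32_t sign_bit = 1u << (p - 1);
    uint32_t mask = (sign_bit << 1) - 1u;      /* keep only the low p bits */
    int64_t value = (int64_t)(raw & mask);

    if (raw & sign_bit)
        value -= (int64_t)1 << p;              /* msb carries weight -2^(p-1) */

    return (double)value / (double)((int64_t)1 << q);   /* scale factor 2^-q */
}
```

With the bit pattern 11101101 from the slide, the same bits read as fix <8,3> give -2.375 and read as fix <8,0> give -19, which is why one ALU can serve all fix <8,q> types.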
DSP data types

Int <8,3>                         1 1 1 0 1 . 1 0 1              = -19/8
Int <8,4>                           0 1 1 0 . 0 0 0 1            = 97/16
product (16 bits, 7 fractional)   1 1 1 1 1 0 0 0 1 . 1 0 0 1 1 0 1  = -1843/128
Some processors (C54) have special instructions for fractional
numbers (and symmetric number domain -2^(n-1) ... 2^(n-1))
  sxxx * syyy = sszzzzzz (two sign bits)
  => s z z z z z z 0 if FRCT = 1 (product shifted left by one bit)
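A fractional multiply of this kind can be sketched in C for 16-bit fractional numbers (fix <16,15>, often called Q15). The function name `q15_mul` is hypothetical, and the sketch assumes that `>>` on a negative value is an arithmetic shift, which holds for mainstream compilers but is implementation-defined in C.

```c
#include <stdint.h>

/* Fractional multiply of two fix<16,15> (Q15) numbers.
   Hypothetical helper; assumes arithmetic right shift for negative values. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* 32-bit product with two sign bits */

    if (p == 0x40000000)                   /* only (-1) * (-1) produces this value */
        return INT16_MAX;                  /* +1 is not representable: saturate */

    /* Dropping one sign bit (left shift by 1) and keeping the msb half
       (right shift by 16) is the same as a right shift by 15. */
    return (int16_t)(p >> 15);
}
```

In Q15, 0.9 is stored as 29491, and `q15_mul(29491, 29491)` yields 26541, about 0.81. The result has the same type as the inputs, which is the property the slide asks for.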
DSP data types

continue (after multiplication) with msb only
represents the limit of the accuracy of the result
(can not be larger than the accuracy of the inputs)
more efficient solution
continue with msb + lsb
sum-of-product operations generate accumulative noise at 32nd
vs. 16th bit
Still overflow for addition = overflow bits
double precision accumulator
+ extra overflow bits
+ shift, round, truncate unit
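The accumulator arrangement listed above can be sketched in C as follows. This is an illustration under assumptions: `mac_q15` is a hypothetical name, a 64-bit variable stands in for a double precision accumulator with overflow bits, and `>>` on a negative value is assumed to be an arithmetic shift.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products of fix<16,15> data with a wide accumulator.
   Hypothetical helper; the int64_t models double precision plus overflow bits. */
int16_t mac_q15(const int16_t *c, const int16_t *x, size_t n)
{
    int64_t acc = 0;

    for (size_t i = 0; i < n; i++)
        acc += (int32_t)c[i] * (int32_t)x[i];   /* keep msb + lsb of each product */

    acc += (int64_t)1 << 14;                    /* round once, after the last addition */
    acc >>= 15;                                 /* back to 15 fractional bits */

    if (acc > INT16_MAX) return INT16_MAX;      /* saturate instead of wrapping */
    if (acc < INT16_MIN) return INT16_MIN;
    return (int16_t)acc;
}
```

Because quantization happens once at the end, the noise enters at the lsb of the wide accumulator rather than at the 16th bit of every product.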
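[Figure: multiplier-accumulator datapath. Labels: inputs c(i) and x(i), control, MPY (Booth, Wallace ...), P_reg, clock, PR, ADDER, P_reg, ACR, followed by a SHIFT / ROUND / TRUNCATE unit]
Quantization (xQ) of negative numbers:
  rounding               111.11 (-0.25) + 000.1 = 000 (0)       111.01 (-0.75) + 000.1 = 111 (-1)
  value truncation       111.11 (-0.25)         = 111 (-1)      111.01 (-0.75)         = 111 (-1)
  magnitude truncation   111.11 (-0.25) + 001.  = 000 (0)       111.01 (-0.75) + 001.  = 000 (0)
Overflow handling: zeroing, saturation, sawtooth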
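The three quantization modes and saturation can be sketched in C. The function names are hypothetical; `v` is the raw 2s complement value and `q` (at least 1) the number of fractional bits to drop. The sketch avoids shifting negative values so that it does not depend on compiler-specific shift behaviour.

```c
#include <stdint.h>

/* floor(v / 2^q), written without shifting a negative value */
static int32_t floor_shift(int32_t v, int q)
{
    if (v >= 0)
        return v >> q;
    return -(((-v) + (((int32_t)1 << q) - 1)) >> q);
}

/* value truncation: drop the fractional bits (rounds towards minus infinity) */
int32_t value_truncate(int32_t v, int q)
{
    return floor_shift(v, q);
}

/* rounding: add half an lsb of the result, then drop the fractional bits */
int32_t round_half_up(int32_t v, int q)
{
    return floor_shift(v + ((int32_t)1 << (q - 1)), q);
}

/* magnitude truncation: drop the fractional bits of |v| (rounds towards zero) */
int32_t magnitude_truncate(int32_t v, int q)
{
    return v >= 0 ? v >> q : -((-v) >> q);
}

/* saturation: clip to the representable range instead of wrapping (sawtooth) */
int32_t saturate(int32_t v, int32_t lo, int32_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}
```

With two fractional bits, -0.25 is the raw value -1 and -0.75 is the raw value -3, so the calls reproduce the slide: rounding gives 0 and -1, value truncation gives -1 and -1, magnitude truncation gives 0 and 0.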
Von Neumann (sequential): EXU with one shared prog/data memory
Harvard: EXU with separate prog mem. and data mem.
Modified Harvard: EXU with prog mem., data mem. 1 and data mem. 2
sum of c(i) * x(i)
Goal = 1 cycle per iteration
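[Figure: modified Harvard DSP. Controller: PC (+1, Reset, Interrupt address), Program Memory, Stack, IR, Control Bus. Data side: ACU_A and ACU_B with address registers AR_A and AR_B, RAM_A and RAM_B with data registers DR_A and DR_B, MAC, Rfile]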
time loop
filter loop i: sum of c(i) * x(i)
1 cycle/tap?
How to update the delay line?
[Figure: 5-tap FIR filter with delay line x1 ... x5 built from Z^-1 elements, coefficients c1 ... c5, multipliers and adders producing y]
Solution 1: block move in memory
Memory location   1   2   3   4   5
output sample 1   x1  x2  x3  x4  x5
output sample 2   x2  x3  x4  x5  x6
output sample 3   x3  x4  x5  x6  x7
2 possibilities
  complete move after every output sample is calculated
    read and write the data twice
  move after read of every datum separately
    write the data twice
    need for a special instruction (TMS320)
Solution 2: indirect addressing
Memory location   1   2   3   4   5   6   7   8
output sample 1   x1  x2  x3  x4  x5
output sample 2       x2  x3  x4  x5  x6
output sample 3           x3  x4  x5  x6  x7
output sample 4               x4  x5  x6  x7  x8
output sample 5   x9              x5  x6  x7  x8
use of a pointer to mark the beginning of the delay line
update the pointer instead of moving the data
problem: the delay line creeps through (trashes) the whole memory
solution: modulo addressing
need for a register to store the pointer
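A pointer-based delay line with modulo addressing can be sketched in C. This is a software illustration, not the DSP's hardware mechanism; `delay_line`, `fir_step` and `TAPS` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define TAPS 5   /* length of the delay line, as in the 5-tap example */

typedef struct {
    int32_t line[TAPS];   /* delay line kept in a fixed memory area */
    size_t  head;         /* pointer (index) to the newest sample */
} delay_line;

/* Insert one input sample and return the new output sample. */
int64_t fir_step(delay_line *d, const int32_t c[TAPS], int32_t x)
{
    d->head = (d->head + 1) % TAPS;    /* update the pointer, modulo the range */
    d->line[d->head] = x;              /* the newest sample overwrites the oldest */

    int64_t acc = 0;
    size_t p = d->head;
    for (size_t i = 0; i < TAPS; i++) {
        acc += (int64_t)c[i] * d->line[p];   /* c[0] pairs with the newest sample */
        p = (p + TAPS - 1) % TAPS;           /* step back one sample, modulo */
    }
    return acc;
}
```

No sample is ever moved: each input costs one write and one pointer update, and the modulo keeps the delay line inside its TAPS memory locations.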
IIR filter
[Figure: IIR filter with input x, delay line y1 ... y5 built from Z^-1 elements and coefficients c1 ... c4; a pointer marks the start of the delay line y1 ... y5 in the memory map (modulo range)]
2 filters in one time loop
  pntr 1: for i = 1..itaps, sum of c(i) * x(i)
  pntr 2: for j = 1..jtaps, sum of d(j) * y(j)
  each delay line (x1 ... x5, y1 ... y5) has its own modulo range (modulo range 1, modulo range 2)
2 memory segments => 1 segment
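Mapping strategy
[Figure: signal flow graph with delay lines x1 ... x3 and y1 ... y3 (z^-1 elements) and coefficients c1 ... c5, mapped onto one modulo range in memory: y1, y2, x1/y3, x2, x3, with pntr 1 marking the start]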
Mapping strategy
define positions in Ram
constraint: vars that form a delay line in consecutive places
find a schedule
example : c1 => c2 => c3 => c4 => c5
define ACU instructions
[Figure: 8-tap filter with delay line x1 ... x8 (Z^-1 elements), coefficients c1 ... c8 and two outputs yo and ye]
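ACU architecture and Instruction set
[Figure: ACU with registers A and S, an adder and a modulo unit; the output goes to RAM]
Instruction   Output   next reg A   next reg S
Read_A        A        A            S
Read_S        S        A            S
incA          A+1      A+1          S
decA          A-1      A-1          S
Step          A+S      A+S          S
Inc_step      S+1      A            S+1
Modulo can be implemented as a mask operation if the size is 2^k
  e.g. range 16 ... 23: 16 = 10 000, 23 = 10 111; the upper bits (10) hold, the lower 3 bits pass the mask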
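Mapping example
Assume initialisation: A = pointer = 17, S = -2
[Figure: signal flow graph with x1 ... x3, y1 ... y3 (z^-1 elements) and c1 ... c5, as in the mapping strategy]
ACU instruction   read_A  incA  incA  incA  incA  step  dec
address output    17      18    19    20    21    19    18
memory locations 16 ... 23 form the modulo range holding y1, y2, x1/y3, x2, x3; pntr marks the start of the delay line
prepare new pointer for next iteration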
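The ACU instructions and the mask-based modulo can be modelled in C. This is a behavioural sketch with hypothetical names (`acu`, `acu_inc_a`, ...); it assumes the modulo range has size 2^k and starts at a multiple of that size, so that the upper address bits hold and only the lower k bits wrap.

```c
#include <stdint.h>

typedef struct {
    uint16_t a;      /* register A: current address */
    int16_t  s;      /* register S: step */
    uint16_t base;   /* start of the modulo range (multiple of the size) */
    uint16_t mask;   /* size - 1, with size = 2^k */
} acu;

/* Modulo as a mask: hold the upper bits, let the lower k bits wrap. */
static uint16_t acu_wrap(const acu *u, uint16_t addr)
{
    return (uint16_t)(u->base | (addr & u->mask));
}

uint16_t acu_read_a(acu *u) { return u->a; }

uint16_t acu_inc_a(acu *u)
{
    u->a = acu_wrap(u, (uint16_t)(u->a + 1));
    return u->a;
}

uint16_t acu_dec_a(acu *u)
{
    u->a = acu_wrap(u, (uint16_t)(u->a - 1));
    return u->a;
}

uint16_t acu_step(acu *u)
{
    u->a = acu_wrap(u, (uint16_t)(u->a + u->s));
    return u->a;
}
```

Starting from A = 17 and S = -2 in the range 16 ... 23, the sequence read_A, incA, incA, incA, incA, step, decA produces the addresses 17, 18, 19, 20, 21, 19, 18 of the mapping example; an incA at address 23 wraps to 16.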
Addressing modes
mode         example            meaning
register     ADD R4, R3         R[R4] = R[R4] + R[R3]
immediate    ADD R4, #3         R[R4] = R[R4] + #3
direct       ADD R4, (100)      R[R4] = R[R4] + Mem[100]
indirect     ADD R4, (R3)       R[R4] = R[R4] + Mem[R[R3]]
w. inc/dec   ADD R4, (R3)±      R[R4] = R[R4] + Mem[R[R3]]
                                R[R3] = R[R3] ± 1
indexed      ADD R4, (R3±R2)    R[R4] = R[R4] + Mem[R[R3]]
                                R[R3] = R[R3] ± R[R2]
Remarks
direct = for static data
indirect = for arrays
inc/dec = for stepping through arrays e.g. x(n)
index = for stepping through arrays e.g. x(2n)
Addressing modes: extra for DSP

8 ARs (address or auxiliary register) available
extra indirect modes
  circular      *ARn±%       post inc/dec by 1 - circular
                *ARn±AR0%    post inc/dec by AR0 - circular
  bit reverse   *ARn±AR0B    post inc/dec by AR0 - bit rev.
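Bit-reversed addressing is typically used to reorder FFT data. The index sequence it produces can be sketched in C; `bit_reverse` is a hypothetical helper that computes in software what the address unit generates in hardware.

```c
/* Reverse the lowest `bits` bits of index (hypothetical helper). */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned r = 0;

    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (index & 1u);   /* move the lsb of index into r */
        index >>= 1;
    }
    return r;
}
```

For an 8-point FFT (3 address bits), the indices 0, 1, 2, 3, 4, 5, 6, 7 map to 0, 4, 2, 6, 1, 5, 3, 7.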
Incorporation of an ALU
regular data-flow algorithms ==> MAC
filtering, correlation, windowing etc
decision making ==> ALU
sorting filters (e.g. median filters)
interpolation (e.g. sqrt)
absolute value calculation
logarithmic conversion
finite field arithmetic (e.g. Galois field)
Viterbi
VLC, VLD
division
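[Figure: the same modified Harvard DSP, now with an ALU next to the MAC. Controller: PC (+1, Reset, Interrupt address), Program Memory, Stack, IR, Control Bus. Data side: ACU_A, ACU_B, AR_A, AR_B, RAM_A, RAM_B, DR_A, DR_B, MAC, ALU, Rfile]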
Bus-oriented instruction encoding
[Figure: four instruction formats selected by a 2-bit code: 00 ALU, 01 MULT, 10 Imm. data, 11 BR Cond with Next address; the remaining fields are SX, SY, DX, DY, RF and ACU A, B]
first solution: sum of c(i) * x(i)
resources (not shown: coefficient RAM+ACU)
time (cc)  LABEL  ALU                             MPY-ACC                      RAM         ACU
                  init counter                    Acc = 0                                  init (i=0)
1          loop                                                                            incr (=i+1)
2                                                                              read x(i)
3                                                 acc(i)=acc(i-1)+x(i)*c(i)
4                 dec counter
5                 branch to loop if counter > 0
6                 nop
6 clockcycles/sample
limit pipelines in the controller
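Loopfolding (software pipelining)
original loop:
  for i = 0 to n
    b(i) = f(a(i))
    c(i) = g(b(i))
    d(i) = h(c(i))
schedule with overlapping iterations (one line per cycle):
  f: a0 -> b0
  g: b0 -> c0    f: a1 -> b1
  h: c0 -> d0    g: b1 -> c1    f: a2 -> b2
  h: c1 -> d1    g: b2 -> c2
  h: c2 -> d2
folded loop body:
  for i = 2 to n
    b(i) = f(a(i))
    c(i-1) = g(b(i-1))
    d(i-2) = h(c(i-2))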
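Loop folding can be checked in C by comparing the plain loop with a folded version. The stage functions `f`, `g`, `h` are arbitrary stand-ins and all names are hypothetical. In sequential C the three statements of the folded body run one after another; on a machine with three units they could issue in the same cycle, because each reads only values produced in earlier iterations.

```c
#include <stddef.h>

/* arbitrary stand-ins for the three stages */
static int f(int a) { return a + 1; }
static int g(int b) { return b * 2; }
static int h(int c) { return c - 3; }

/* plain loop: the three stages of one iteration depend on each other */
void chain_plain(const int *a, int *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = h(g(f(a[i])));
}

/* folded loop: stage h of iteration i-2, g of i-1 and f of i share one body */
void chain_folded(const int *a, int *d, size_t n)
{
    if (n < 2) {               /* too short to fold */
        chain_plain(a, d, n);
        return;
    }

    /* preamble: fill the pipeline */
    int b = f(a[0]);
    int c = g(b);
    b = f(a[1]);

    /* folded body: the order h, g, f makes each stage read the old values */
    for (size_t i = 2; i < n; i++) {
        d[i - 2] = h(c);
        c = g(b);
        b = f(a[i]);
    }

    /* postamble: drain the pipeline */
    d[n - 2] = h(c);
    c = g(b);
    d[n - 1] = h(c);
}
```

Both functions write identical results; the folded version trades a preamble and a postamble for a loop body without dependencies between its operations.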
Loopfolding (software pipelining): sum of c(i) * x(i)
LABEL  ALU                             MPY-ACC                               RAM           ACU
       init counter                    acc(i-1)=0                                          init (i=1)
                                                                             read x(i)     inc (=i+1)
loop                                   acc(i) = acc(i-1)+x(i)*c(i)           read x(i+1)   incr (=i+2)
       dec counter
       branch to loop if counter > 0
       nop
                                       acc(n-1) = acc(n-2)+x(n-1)*c(n-1)     read x(n)
                                       acc(n) = acc(n-1)+x(n)*c(n)
Pre- and postamble
4 clockcycles/sample
hardware support for loop control: sum of c(i) * x(i)
Label        ALU            MPY-ACC                                RAM           ACU
             init counter   acc(i-1)=0                                           init (i=1)
                                                                   read x(i)     inc (=i+1)
repeat n-2                  acc(i)=acc(i-1)+x(i)*c(i)              read x(i+1)   incr (=i+2)
                            acc(n-1) = acc(n-2) + x(n-1)*c(n-1)    read x(n)
                            acc(n) = acc(n-1) + x(n)*c(n)
1 clockcycle/sample
repeat instruction and repeat block
Outline
code generation
examples: C6 and TM
TMS320C5000
[Figure: datapath. T register; multiplier (17*17) with fractional mode; ALU (40); adder (40); accumulators A(40) and B(40); barrel shifter; sign control blocks; MUXes selecting among A, B, C, D, P and T; COMP, TRN and TC; ZERO, SAT and ROUND; MSW/LSW select]
Motorola 56K family
[Figure: block diagram. Program controller (clock, interrupt); address ALU with X, Y and P address buses (16 bits) and external address switch; 2,048-by-24-bit program memory ROM; X memory and Y memory, each with 256-by-24-bit RAM and 256-by-24-bit ROM; 24-bit X, Y, P and global data buses with internal and external data-bus switches; data ALU with 24-by-24 bit multiplier-accumulator producing a 56 bit result; on-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O, bus control; I/O ports]
R.E.A.L.
[Figure: block diagram. Two 16-by-16 bit multipliers (inputs X, Y0, Y1; products P0, P1); two 40 bit arithmetic logic units; four 40 bit accumulators; scale, shift and saturation units; two address computation units; X data memory and Y data memory; program memory (Z data); 16 bit buses for X data, Y data and Z data; program control unit; instruction decoder with 96-bit instructions]
RD16021 DSP (memories not included)
Process             0.35, 5M
voltage             2.7-3.6 V
frequency           39 MHz (Tj = 85 C, 2.7V, wcp)
area                3.9 mm2
Power dissipation   2.1 mW/MHz
Instruction cycle counts for BDTi benchmarks
Function              DSPgroup   Motorola   ADI         Lucent    TI TMS320   TI TMS320
                      OAK        DSP561xx   ADSP-218x   DSP16xx   C54x        C62xx
Real block FIR        835        925        841         1240      684         334
Single sample FIR     21         23         22          26        18          17
Complex block FIR     3018       3043       3122        3123      2922        1294
LMS adaptive          90         64         59          101       58          33
IIR (8 sections)      51         45         43          65        44          30
Vector dot product    43         43         43          47        41          29
Vector add            122        85         83          123       61          36
Vector maximum        41         86         128         120       111         39
Convolution encoder   506        772        818         888       528         188
FSM                   284        375        198         415       455         147
256 pnt FFT           16514      12148      10633       21035     13234       4225
Lucent DSP16210: FSM 301, 256 pnt FFT 9016
Philips RD16020: FSM 167, 256 pnt FFT 5797
Lucent DSP16210 and Philips RD16020, remaining rows (column assignment not recoverable): 780, 16, 1681, 464, 448, 20, 1470, 55, 37, 43, 63, 40, 176, 38, 23, 43
Benchmark parameters noted on the slide: 16 taps, 40 samples, 8 biquads
Outline

code generation
code generation flow:
  source
  Front end: lexical analysis, syntax analysis, semantic analysis
  Intermediate machine independent representation
  Code generation: Code selection, Register allocation, scheduling (1 instr = // ops, order of instr)
  code
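Intermediate machine independent representation
basic blocks BBi, BBj, BBk; contents of one basic block, e.g.
  t1 := a * b
  t2 := c + d
  t3 := t1 + c
  out := t2 * t3
[Figure: data flow graph of this basic block with inputs a, b, c, d and intermediate values t1, t2, t3]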
Code selection
Intermediate representation + RTPs => match & cover
Register transfer pattern (RTP) for a given datapath is any RT operation (read - combinatorial logic - write) which can be executed on the datapath. [Leupers]
Notation
ar := ar | ax + ay | af means
  ar := ar + ay or
  ar := ar + af or
  ar := ax + ay or
  ar := ax + af
Code selection example
[Figure: ADSP datapath [Analog Devices] with d memory and p memory; ALU (+-) with registers ax, ay, af and ar; MAC (*, +-) with registers mx, my, mf and mr]
Examples of RTPs on the ADSP-210 datapath
[Figure: RTP trees. MAC: x input ar | mr | mx, y input my | mf, result mr | mf. ALU: x input mr | ar | ax, y input ay | af, result ar | af]
Example of code selection
= covering of intermediate representation with RTPs
  t3 (* and + covered by one MAC pattern):
    mr := dmem (c)
    mx := dmem
    my := pmem
    mr := mr + (mx * my)
  t2:
    ax := dmem
    ay := pmem
    ar := ax + ay
  out:
    my := ar
    mr := mr * my
Problems
local decisions which have a global impact
phase coupling: example
asap schedule
maximal freedom for scheduling
code selection during scheduling
register allocation comes afterwards
can lead to infeasible solutions
phase coupling: example 1
[Figure: three panels (a), (b), (c) with registers R1, R2, R3, units alu1 and alu2 and a Move operation]
phase coupling: example 2 [Mesman]
[Figure: values u and v with producers Pu, Pv and consumers Cu, Cv, drawn without and with the case that u and v share the same register]
Example of coupling between scheduling and register allocation
phase coupling: discussion [Mesman]
[Figure: application and constraints feed traditional (heuristic) code generation, followed by an "OK?" check (yes/no); the design space seen by the code generator is compared with the feasible space]
Phase coupling is difficult because of many constraints originating from irregular interconnect, special purpose registers and non-orthogonal microcode.
phase coupling: discussion

It is very difficult, almost impossible, to develop robust and
efficient DSP compilers.
Current DSP practice = programming in assembler
Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture
develop an architecture which is still efficient but also
a good model for building a compiler
Efficiency = exploit instruction level parallelism (ILP)
compilation = systematic positioning of registers and regular
interconnect
= VLIW = Very Long Instruction Word
Outline
code generation
principles
central register file + example TM
clustered VLIW + example C6
subword parallelism or SIMD
VLIW principles
multiple parallel FUs, possibly different and pipelined
pipelining is exposed to the compiler
= no interlock mechanism
load-store architecture
all operands fetched from/stored in register files,
possibly multi-ported
each FU can receive an instruction every clock cycle
one instruction = many RISC instructions
each RISC instruction = one issue slot
no dependencies between different RISC instructions
= orthogonal microcode
= compiler friendly
VLIW architecture
[Figure: register file with R&W addresses feeding execution units 1 ... 25; the instruction has one issue slot (1 ... 25) per execution unit]
long instruction words e.g. (3*7+4)*25 = 625 bits
many ports on the register file e.g. 75
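The two numbers can be reproduced with a small C calculation. The reading of the factors is an assumption, not stated on the slide: 3 register operands per operation, 7 address bits per operand (128 registers), 4 opcode bits and 25 issue slots. The struct and function names are hypothetical.

```c
/* Parameters of a VLIW with a central register file (assumed meaning). */
typedef struct {
    unsigned slots;         /* issue slots per instruction */
    unsigned operands;      /* register operands per operation (2 reads + 1 write) */
    unsigned addr_bits;     /* bits per register address */
    unsigned opcode_bits;   /* bits per opcode */
} vliw_params;

/* Width of one long instruction word in bits. */
unsigned vliw_instruction_bits(vliw_params p)
{
    return (p.operands * p.addr_bits + p.opcode_bits) * p.slots;
}

/* Register file ports needed if every slot accesses all its operands each cycle. */
unsigned vliw_rf_ports(vliw_params p)
{
    return p.operands * p.slots;
}
```

With {25, 3, 7, 4} this gives 625 bits and 75 ports. Both grow linearly with the number of issue slots, which motivates the clustered register files discussed later.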
VLIW architecture: central Register File
[Figure: one central register file shared by all execution units; each issue slot (1 ... 3) controls a group of three execution units]
TM1000 DSPCPU
functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt
Register file (128 regs, 32 bit, 15 ports)
5 execution units
Instruction register (5 issue slots)
PC, Instruction cache (32kB), Data cache (16 kB)
TriMedia TM32A processor
0.18 micron
area : 16.9mm2
200 MHz (typ)
1.4 W
7 mW/MHz
(MIPS = 0.9 mW/MHz)
[Figure: chip floorplan with I/O interface, sequencer / decode, I-cache and D-cache with TAGs, and the functional units DSPMUL1, DSPMUL2, IFMUL1 (float), IFMUL2 (float), FCOMP2, FTOUGH1, FALU0, FALU3, ALU0 ... ALU4, SHIFTER0, SHIFTER1, DSPALU0, DSPALU2]
Synthesised RF area (CMOS18, 64 bit)
[Figure: area in mm-sq (0 ... 9) versus nr of ports (0 ... 20) for 32, 64 and 128 registers after P&R, with polynomial fits]
Area, speed and power dissipation scale worse than linearly with the
number of ports
VLIW architecture: clustered Register Files
[Figure: three register files, each serving two execution units (units 1-2, 3-4, 5-6), connected by copy units]
[Figure: register files 1 ... 3 with units FMUL and FADD, IMUL and IADD, IMUL and IADD; example operations FMUL r1,r2,r3, IADD r1,r2,r3, IMUL r1,r2,r3]
[Figure: register files I0 and I1, each with functional units FU00/FU10 (IADD, IMOV), FU01 (IADD, LAND) and FU02 (IMUL, SHFT); operation names carry a cluster suffix such as IADD_00 or IMUL_10]
Discussion
performance loss (more instructions) compared to a central
Register File (due to extra cycle for copy)
15-20 % for 2 clusters
20-30 % for 4 clusters
limited scalability
not too many clusters
not too many registers within each cluster (too many RF ports)
add of copy ops in the compiler
= graph changes during scheduling
TMS320C62x VelociTI (fixed point)
[Figure: two identical datapaths, each with a register file 0-15 (32 bits) and four units with dst, src1 and src2 ports:
  L1 / L2: int add, logical, bit count
  S1 / S2: int add, logical, bit manip, shift, constant, branch
  M1 / M2: int mult (16=>32)
  D1 / D2: int add, load/store (store/load address)
  plus load data and store/load data paths, and Src_up / Dst_up connections]
VelociTI principles
parallelism (fetch-decode-execute) (max 8 issue slots)
pipeline critical sections (alu 1cc, mult 2 cc, 200 MHz)
Risc (simple, atomic, independent instructions)
performance comes from compiler (pipelining, unroll)
load-store
orthogonal (2 identical DP, add on 6 units)
deterministic (no interlock)
conditional instructions (=guarding)
instruction packing
[Figure: eight instructions A ... H shown with three encodings]
Classical encoding: fetching many nops
VelociTI encoding: one bit per instruction (ABCDE FGH)
  Fully serial            0 0 0 0 0 0 0 0
  Mixed serial/parallel   1 1 0 1 0 0 1 0
  Fully parallel          1 1 1 1 1 1 1 0
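How such bit patterns group eight fetched instructions can be sketched in C. The sketch assumes the convention used on the C6x family, where a set bit (the p-bit) means that the next instruction executes in parallel with the current one. `split_execute_packets` and `FETCH_PACKET` are hypothetical names.

```c
#define FETCH_PACKET 8   /* instructions fetched together (A ... H) */

/* Split one fetch packet into groups of instructions that issue in the same cycle.
   pbits[i] == 1: instruction i+1 runs in parallel with instruction i.
   Writes the group sizes to sizes[] and returns the number of groups. */
int split_execute_packets(const int pbits[FETCH_PACKET], int sizes[FETCH_PACKET])
{
    int count = 0;
    int current = 0;

    for (int i = 0; i < FETCH_PACKET; i++) {
        current++;
        if (pbits[i] == 0 || i == FETCH_PACKET - 1) {   /* group ends here */
            sizes[count++] = current;
            current = 0;
        }
    }
    return count;
}
```

The pattern 1 1 0 1 0 0 1 0 yields four groups of sizes 3, 2, 1 and 2 (ABC, DE, F, GH), so no nops have to be stored to pad the partially filled cycles.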
Subword parallelism
(custom operators in TM)
[Figure: 1st and 2nd input operand, each 32 bits = 4 bytes (byte3 byte2 byte1 byte0); the four bytes are processed independently by four op units and form the output operand]
Ex. +, -, min, max => quadumin, quadumax, ...
Subword parallelism
(custom operators in TM)
int size = 1000;
byte out[size], in1[size], in2[size];
for (i = 0; i < size; i++)
    out[i] = in1[i] + in2[i];

rewritten with 4-byte packets:
int size = 1000;
byte out[size], in1[size], in2[size];
for (i = 0; i < size; i += 4) {
    packet4 t1 = packet4_load(&in1[i]);
    packet4 t2 = packet4_load(&in2[i]);
    packet4 t3 = packet4_add(t1, t2);
    packet4_store(&out[i], t3);
}
+ faster execution
- rewrite effort (e.g. different types for in- and outputs)
Typical example: graphics (4 * 32 bit floating point)
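On a processor without such custom operators, a four-byte addition can be imitated in portable C with masking. This is a sketch of the general technique, not the TM operator itself; `quad_add_u8` is a hypothetical name, and each byte wraps around modulo 256.

```c
#include <stdint.h>

/* Add four unsigned bytes packed in a and b; no carry crosses a byte boundary. */
uint32_t quad_add_u8(uint32_t a, uint32_t b)
{
    /* add the low 7 bits of every byte: the carry stays inside the byte */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* msb of every byte: a xor b, combined below with the carry from the low bits */
    uint32_t msb = (a ^ b) & 0x80808080u;

    return low7 ^ msb;
}
```

For example, `quad_add_u8(0x01FF0210, 0x01010101)` gives 0x02000311: the byte 0xFF + 0x01 wraps to 0x00 without disturbing its neighbours. A hardware operator does the same in one operation by cutting the carry chain at the byte boundaries.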
Subword parallelism
MPEG example
for (i=0; i<64; i++)
{
    temp = ((back(i) + forward(i) +1) >> 1) +idct(i);
    if (temp > 255)
        temp = 255;
    else if (temp < 0)
        temp = 0;
    destination[i] = temp;
}
Remark: simple example without interloop dependencies

step 1: unroll by 4
for (i=0; i<64; i+=4)
{
    temp = ((back(i+0) + forward(i+0) +1) >> 1) +idct(i+0);
    if (temp > 255) temp = 255;
    else if (temp < 0) temp = 0;
    destination[i+0] = temp;
    /* same statements for i+1, i+2 and i+3 */
}

step 2: map the four copies onto custom operators
    the four averages ((back(i+k) + forward(i+k) +1) >> 1), k = 0..3   => quadavg
    the four additions of idct(i+k) with clipping to 0 ... 255         => dspuquadaddui
    destination[i+0 ... i+3] = temp0 ... temp3
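For reference, the scalar MPEG loop can be written as a compilable C function. The element types are assumptions (unsigned bytes for the two predictions and the destination, 16-bit signed values for the IDCT output), and `reconstruct` and `clip_u8` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* clip to the range 0 ... 255 of an unsigned byte */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v > 255 ? 255 : (v < 0 ? 0 : v));
}

/* Average the backward and forward prediction, add the IDCT output, clip. */
void reconstruct(const uint8_t *back, const uint8_t *forward,
                 const int16_t *idct, uint8_t *destination, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int avg = (back[i] + forward[i] + 1) >> 1;   /* rounded average */
        destination[i] = clip_u8(avg + idct[i]);     /* add with clipping */
    }
}
```

The two statements in the loop body correspond to the two custom operators named above: the rounded average and the clipped addition, each applied to four bytes at a time in the subword version.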
Will embedded CPUs and DSPs converge ?

Converging forces
both include a hardware multiplier
trend in DSPs towards caches and RTK
trend in DSPs towards C/C++
common trend towards VLIW
Diverging forces
deeply embedded code (DSP) vs. end-user SW (CPU)
different RTKs
SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)
Conclusions VLIW
good balance between hw and sw
between efficiency (ILP) and cost
fundamental problems: code size, interruptability
