Program Mapping
Programmable Digital Signal Processors
5kk10
TU/e
Henk Corporaal
Jef van Meerbergen
Bart Mesman
Outline
architectures for programmable DSPs
multiplier-accumulator
modified Harvard architecture
extension with an ALU (decision making)
controller architectures
examples: TI, Motorola, Philips
code generation
recent developments: VLIW (Very Long Instruction Word)
examples: C6 and TM
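[Figure: multiplier-accumulator (MAC). Inputs c(i) and x(i) feed the multiplier MPY (Booth, Wallace, ...), followed by the clocked pipeline register PR (P_reg), the ADDER and the accumulator register ACR, all under control. The unit evaluates the expression sum of c(i) * x(i).]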
Modifications
extra inputs/outputs
position ACR (1 or 2)
adder/subtractor
extra pipelines
asymmetric inputs
multi-precision
[Figure: fixed-point multiplication.]
fix <8,3>: 11101.101 = -19/8 = -2.375 (scale factor 1/8)
<8,4>:     0110.0001 = 97/16
product:   111110001.1001101 = -1843/128
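The fixed-point example above can be checked with a few lines of C. This is a minimal sketch, not taken from the slides: the type `fixed_t` and the helper names are illustrative assumptions. It shows that the raw integers multiply while the numbers of fraction bits add.

```c
#include <assert.h>
#include <stdint.h>

/* A fixed-point value: raw two's-complement integer plus the number of
   fraction bits. Type and function names are illustrative assumptions. */
typedef struct {
    int32_t raw;       /* stored integer */
    int     frac_bits; /* value = raw / 2^frac_bits */
} fixed_t;

/* Interpret an 8-bit pattern as a two's-complement integer. */
static int32_t from_bits8(uint8_t bits)
{
    return bits >= 128 ? (int32_t)bits - 256 : (int32_t)bits;
}

/* Multiply: the raw values multiply, the fraction-bit counts add. */
static fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    fixed_t p;
    assert(a.frac_bits >= 0 && b.frac_bits >= 0);
    p.raw = a.raw * b.raw;
    p.frac_bits = a.frac_bits + b.frac_bits;
    return p;
}

/* Convert to double for inspection. */
static double fixed_to_double(fixed_t v)
{
    assert(v.frac_bits >= 0 && v.frac_bits < 31);
    return (double)v.raw / (double)((int32_t)1 << v.frac_bits);
}
```

With the operands of the figure, 11101101 with 3 fraction bits and 01100001 with 4 fraction bits, `fixed_mul` yields the raw value -1843 with 7 fraction bits, i.e. -1843/128.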
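[Figure: MAC with quantization. Inputs c(i) and x(i), multiplier MPY (Booth, Wallace, ...), clocked pipeline registers (P_reg), ADDER and accumulator register ACR, followed by SHIFT, ROUND and TRUNCATE stages; the quantization points are marked xQ.]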
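rounding (add 000.1, then drop the fraction bits)
  111.11 + 000.1 = 000    (-0.25 becomes 0)
  111.01 + 000.1 = 111    (-0.75 becomes -1)
value truncation (drop the fraction bits)
  111.11 = 111    (-0.25 becomes -1)
  111.01 = 111    (-0.75 becomes -1)
magnitude truncation (negative values: add 001., then drop the fraction bits)
  111.11 + 001. = 000    (-0.25 becomes 0)
  111.01 + 001. = 000    (-0.75 becomes 0)
Overflow handling: zeroing, saturation, sawtooth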
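The three quantization modes and saturation can be sketched in C as follows. This is an illustration under assumptions, not the hardware of the slides: values are raw integers with `frac_bits` fraction bits, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Value truncation: drop the fraction bits (rounds towards minus infinity).
   Written with division so that no negative number is shifted. */
static int32_t value_truncate(int32_t raw, int frac_bits)
{
    assert(frac_bits > 0 && frac_bits < 31);
    int32_t scale = (int32_t)1 << frac_bits;
    int32_t q = raw / scale;          /* C division truncates towards zero */
    if (raw % scale < 0)
        q -= 1;                       /* correct negative non-integers downwards */
    return q;
}

/* Rounding: add half an LSB of the result, then value-truncate. */
static int32_t round_half_up(int32_t raw, int frac_bits)
{
    assert(frac_bits > 0 && frac_bits < 31);
    return value_truncate(raw + ((int32_t)1 << (frac_bits - 1)), frac_bits);
}

/* Magnitude truncation: round towards zero. */
static int32_t magnitude_truncate(int32_t raw, int frac_bits)
{
    assert(frac_bits > 0 && frac_bits < 31);
    return raw / ((int32_t)1 << frac_bits);
}

/* Saturation: clip v to the range of a signed number of `bits` bits. */
static int32_t saturate(int32_t v, int bits)
{
    assert(bits > 1 && bits < 32);
    int32_t max = (int32_t)(((uint32_t)1 << (bits - 1)) - 1u);
    int32_t min = -max - 1;
    if (v > max) return max;
    if (v < min) return min;
    return v;
}
```

With 2 fraction bits, 111.11 is the raw value -1 and 111.01 is the raw value -3, so the calls reproduce the six cases listed above.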
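Memory architectures
Von Neumann (sequential): one combined program/data memory feeding the EXU
Harvard: separate program memory and data memory feeding the EXU
Modified Harvard: program memory plus two data memories (data mem. 1 and data mem. 2) feeding the EXU
Expression: sum of c(i) * x(i)
Goal = 1 cycle per iteration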
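[Figure: DSP architecture. Controller: PC (sources: reset, +1, interrupt address, stack), program memory, instruction register IR and control bus. Data path: address computation units ACU_A and ACU_B with address registers AR_A and AR_B, memories RAM_A and RAM_B with data registers DR_A and DR_B, MAC and register file (Rfile).]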
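[Figure: FIR filter as a tapped delay line. The samples x1..x5 are separated by Z^-1 delays, multiplied by the coefficients c1..c5 and added to form the output y. Inside the time loop, the filter loop over i computes the sum of ci * xi. Question: 1 cycle/tap?]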
Delay line in memory
Memory location:   1    2    3    4    5
Output sample 1:   x1   x2   x3   x4   x5
Output sample 2:   x2   x3   x4   x5   x6
2 possibilities
complete move after every output sample is calculated
read and write the data twice
move after read of every datum separately
write the data twice
need for a special instruction (TMS320)
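The second possibility (move each datum right after it has been read) can be sketched in C as below. This is a minimal illustration under assumptions: the function name `fir_shift` and the array layout (`delay[0]` is the newest sample) are not from the slides, and the combined multiply-accumulate-and-move of a real DSP instruction is only imitated by ordinary statements.

```c
#include <assert.h>
#include <stddef.h>

/* One output sample of an n-tap FIR filter with a shifting delay line.
   On entry, delay[0..n-1] holds the previous n samples, newest first.
   Each datum is read once, moved one place up and used in the MAC. */
static long fir_shift(const int *c, int *delay, size_t n, int new_sample)
{
    long acc = 0;
    size_t i;

    assert(n > 0);
    /* Walk from the oldest tap to the newest, so that nothing is
       overwritten before it has been read. */
    for (i = n - 1; i > 0; i--) {
        int x = delay[i - 1];      /* read the datum ...              */
        delay[i] = x;              /* ... move it to its new position */
        acc += (long)c[i] * x;     /* ... and accumulate c(i) * x(i)  */
    }
    delay[0] = new_sample;         /* the newest sample enters the line */
    acc += (long)c[0] * new_sample;
    return acc;
}
```

Each sample is written twice during its lifetime in the loop (once when it enters, once per move), which is the cost the slide points at.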
Memory location:   1    2    3    4    5
Output sample 3:   x3   x4   x5   x6   x7
Output sample 4:   x4   x5   x6   x7   x8
Output sample 5:   x9   x5   x6   x7   x8
IIR filter
[Figure: IIR filter. The input x and the delayed outputs y1..y5 (Z^-1 delay line) are weighted by the coefficients c1..c4; a pointer walks through the delay line.]
Memory map: y1..y5 are stored in one modulo range.
2 filters
[Figure: two delay lines in memory. x1..x5 lie in modulo range 1 (pntr 1), y1..y5 lie in modulo range 2 (pntr 2).]
time loop
  pntr 1: for i = 1..itaps: sum of c(i) * x(i)
  pntr 2: for j = 1..jtaps: sum of d(j) * y(j)
2 memory segments => 1 segment
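A delay line in a modulo range avoids moving data: only the pointer moves and the newest sample overwrites the oldest one. The sketch below illustrates this in C; the type `ring_t` and the function names are assumptions made for the example, and the generic `%` operator stands in for the modulo hardware of the ACU.

```c
#include <assert.h>
#include <stddef.h>

/* Delay line stored in a modulo range (circular buffer). */
typedef struct {
    int   *buf;   /* storage of the modulo range */
    size_t size;  /* length of the modulo range */
    size_t head;  /* index of the newest sample */
} ring_t;

/* Store a new sample: advance the pointer modulo size and overwrite
   the oldest sample. No other datum is moved. */
static void ring_push(ring_t *r, int sample)
{
    assert(r->size > 0);
    r->head = (r->head + 1) % r->size;
    r->buf[r->head] = sample;
}

/* FIR output: c[0] weights the newest sample, c[1] the one before, ... */
static long fir_ring(const ring_t *r, const int *c, size_t taps)
{
    long acc = 0;
    size_t idx = r->head;

    assert(taps <= r->size);
    for (size_t i = 0; i < taps; i++) {
        acc += (long)c[i] * r->buf[idx];
        idx = (idx + r->size - 1) % r->size;  /* step back, modulo size */
    }
    return acc;
}
```

A second `ring_t` with its own pointer would hold the y delay line of the second filter; placing both in one memory segment is the step from 2 segments to 1 named above.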
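Mapping strategy
[Figure: filter graph with the delay lines x1, x2, x3 and y1, y2, y3 (z^-1 delays) and the coefficients c1..c5. Memory map inside the modulo range, addressed by pntr 1: y1, y2, x1/y3, x2, x3.]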
Mapping strategy
define positions in RAM
constraint: variables that form a delay line are placed in consecutive locations
find a schedule
example : c1 => c2 => c3 => c4 => c5
define ACU instructions
[Figure: filter with the delay line x1..x8 (Z^-1 delays) and the coefficients c1..c8, producing the two outputs yo and ye.]
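Modulo addressing in the ACU
[Figure: ACU with address register A and step register S; the modulo unit sits in front of the output to RAM.]
Modulo can be implemented as a mask operation if the size is 2^k.
Example: range 16..23, i.e. 10 000 .. 10 111; the mask holds the upper bits.
Possible updates:
reg A:   A    A    A+1   A-1   A+S   A      (A = hold)
reg S:   S    S    S     S     S     S+1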
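The mask operation can be sketched in C as follows. Assumptions not stated on the slide: the range starts at a multiple of its size, and the function name `acu_step` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Update address a by step s inside a modulo range of `size` words.
   Assumes size is a power of two and the range is aligned to size,
   so that the upper address bits can simply be held. */
static uint32_t acu_step(uint32_t a, int32_t s, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    uint32_t mask   = size - 1;                   /* size 8 -> mask 00 111 */
    uint32_t base   = a & ~mask;                  /* held bits, e.g. 10 000 */
    uint32_t offset = (a + (uint32_t)s) & mask;   /* only the low bits change */
    return base | offset;
}
```

For the range 16..23, `acu_step(23, 1, 8)` wraps to 16 and `acu_step(16, -1, 8)` wraps to 23, without any comparison against the range limits.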
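Mapping example
[Figure: filter graph with the delay lines x1, x2, x3 and y1, y2, y3 (z^-1 delays) and the coefficients c1..c5.]
Assume initialisation: A = pointer = 17, S = -2
ACU instruction:   read_A   incA   incA   incA   incA   step   dec
Address in A:      17       18     19     20     21     19     18
Memory map: addresses 16..23 form the modulo range; it holds y1, y2, x1/y3, x2, x3, addressed through pntr.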
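The address sequence of the mapping example can be reproduced with a small model of the ACU. This is a sketch under assumptions: the enum and function names are invented for the illustration, and the modulo wrap is left out because all addresses of this example stay inside 16..23.

```c
#include <assert.h>

/* ACU operations used in the example (names are illustrative). */
typedef enum { ACU_READ, ACU_INC, ACU_DEC, ACU_STEP } acu_op;

/* ACU state: address register A and step register S. */
typedef struct {
    int a;
    int s;
} acu_t;

/* Apply one ACU operation and return the resulting address. */
static int acu_apply(acu_t *acu, acu_op op)
{
    switch (op) {
    case ACU_READ: break;                  /* A is used unchanged */
    case ACU_INC:  acu->a += 1;      break;
    case ACU_DEC:  acu->a -= 1;      break;
    case ACU_STEP: acu->a += acu->s; break;
    default:       assert(0 && "unknown ACU operation");
    }
    return acu->a;
}

/* Run a sequence of n operations and record the address after each one. */
static void acu_trace(acu_t *acu, const acu_op *ops, int n, int *out)
{
    for (int i = 0; i < n; i++)
        out[i] = acu_apply(acu, ops[i]);
}
```

Starting from A = 17 and S = -2, the sequence read_A, incA, incA, incA, incA, step, dec gives the addresses 17, 18, 19, 20, 21, 19, 18 listed above.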
Addressing modes
register               ADD R4, R3
immediate              ADD R4, #3
direct                 ADD R4, (100)
indirect               ADD R4, (R3)
indirect w. inc/dec    ADD R4, (R3)
indexed
Remarks
direct = for static data
indirect = for arrays
inc/dec = for stepping through arrays, e.g. x(n)
index = for stepping through arrays, e.g. x(2n)
Incorporation of an ALU
regular data-flow algorithms ==> MAC
filtering, correlation, windowing etc
decision making ==> ALU
sorting filters (e.g. median filters)
interpolation (e.g. sqrt)
absolute value calculation
logarithmic conversion
finite field arithmetic (e.g. Galois field)
Viterbi
VLC, VLD
division
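A median filter shows why decision making needs an ALU rather than a MAC: the result depends on comparisons, not on sums of products. The following C sketch is only an illustration; the function names and the border handling (border samples are copied) are assumptions.

```c
#include <assert.h>

/* Median of three values using compare-and-swap steps. */
static int median3(int a, int b, int c)
{
    int t;
    if (a > b) { t = a; a = b; b = t; }
    if (b > c) { t = b; b = c; c = t; }
    if (a > b) { t = a; a = b; b = t; }
    return b;
}

/* 3-point median filter over x[0..n-1]; the two border samples are copied. */
static void median_filter3(const int *x, int *y, int n)
{
    assert(n >= 1);
    y[0] = x[0];
    y[n - 1] = x[n - 1];
    for (int i = 1; i + 1 < n; i++)
        y[i] = median3(x[i - 1], x[i], x[i + 1]);
}
```

Every output sample costs three data-dependent comparisons, which map onto ALU compare and conditional operations.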
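[Figure: DSP architecture extended with an ALU. Controller: PC (sources: reset, +1, interrupt address, stack), program memory, instruction register IR and control bus. Data path: ACU_A and ACU_B with AR_A and AR_B, RAM_A and RAM_B with DR_A and DR_B, MAC, ALU and register file (Rfile).]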
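[Figure: instruction formats, selected by a 2-bit field: 00 = ALU operation, 01 = MULT (source fields SX, SY), 10 = immediate data, 11 = next address (BR, Cond). The formats also carry the destination fields DX, DY, RF and the ACU fields for A and B.]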
first solution
sum of c(i) * x(i)
resources: LABEL, ALU, MPY-ACC, RAM, ACU (not shown: coefficient RAM + ACU)
before the loop: Acc = 0 (MPY-ACC), init counter (ALU), init (i=0) (ACU)
loop body along the time axis (cc):
  incr (=i+1) (ACU)
  read x(i) (RAM)
  acc(i) = acc(i-1) + x(i)*c(i) (MPY-ACC)
  dec counter (ALU)
  branch to loop if counter > 0
  nop
6 clock cycles/sample
limit pipelines in the controller
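[Figure: software pipelining of the chain a -> f -> b -> g -> c -> h -> d. Iteration 0 computes b0 = f(a0), c0 = g(b0), d0 = h(c0); iterations 1 and 2 start one step later each, so that f, g and h of different iterations execute in the same step.]
for i = 2 to n
  b(i) = f(a(i))
  c(i-1) = g(b(i-1))
  d(i-2) = h(c(i-2))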
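The loop above can be written out in C with an explicit part before and after the loop. This is a minimal sketch: the bodies of `f`, `g` and `h` are arbitrary placeholders, the arrays are indexed 0..n-1, and on a sequential machine the three statements of the loop body still run one after another; the point is only that they belong to three different iterations and are independent of each other.

```c
#include <assert.h>

/* Placeholder stage functions (assumptions for the illustration). */
static int f(int a) { return a + 1; }
static int g(int b) { return b * 2; }
static int h(int c) { return c - 3; }

/* Straightforward loop: d[i] = h(g(f(a[i]))). */
static void chain_plain(const int *a, int *d, int n)
{
    for (int i = 0; i < n; i++)
        d[i] = h(g(f(a[i])));
}

/* Software-pipelined loop: the body holds f of iteration i,
   g of iteration i-1 and h of iteration i-2. */
static void chain_pipelined(const int *a, int *d, int n)
{
    int b_prev, c_prev;

    assert(n >= 2);
    /* fill the pipeline */
    b_prev = f(a[0]);               /* b(0) */
    c_prev = g(b_prev);             /* c(0) */
    b_prev = f(a[1]);               /* b(1) */

    for (int i = 2; i < n; i++) {
        int b = f(a[i]);            /* b(i)   */
        int c = g(b_prev);          /* c(i-1) */
        d[i - 2] = h(c_prev);       /* d(i-2) */
        b_prev = b;
        c_prev = c;
    }

    /* drain the pipeline */
    d[n - 2] = h(c_prev);           /* d(n-2) */
    c_prev = g(b_prev);             /* c(n-1) */
    d[n - 1] = h(c_prev);           /* d(n-1) */
}
```

Both functions produce the same output array; the pipelined version exposes three independent operations per loop iteration to the scheduler.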
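sum of c(i) * x(i), read and accumulate overlapped
resources: LABEL, ALU, MPY-ACC, RAM, ACU
before the loop: acc(i-1) = 0 (MPY-ACC), init counter (ALU), init (i=1) (ACU)
                 read x(i) (RAM), inc (=i+1) (ACU)
loop: acc(i) = acc(i-1) + x(i)*c(i) (MPY-ACC), read x(i+1) (RAM), incr (=i+2) (ACU), dec counter (ALU)
      branch to loop if counter > 0
      nop
after the loop: acc(n-1) = acc(n-2) + x(n-1)*c(n-1) (MPY-ACC), read x(n) (RAM)
                acc(n) = acc(n-1) + x(n)*c(n) (MPY-ACC)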
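sum of c(i) * x(i), with a repeat instruction
resources: Label, ALU, MPY-ACC, RAM, ACU
before the loop: acc(i-1) = 0 (MPY-ACC), init counter (ALU), init (i=1) (ACU)
                 read x(i) (RAM), inc (=i+1) (ACU)
repeat n-2: acc(i) = acc(i-1) + x(i)*c(i) (MPY-ACC), read x(i+1) (RAM), incr (=i+2) (ACU)
after the loop: acc(n-1) = acc(n-2) + x(n-1)*c(n-1) (MPY-ACC), read x(n) (RAM)
                acc(n) = acc(n-1) + x(n)*c(n) (MPY-ACC)
1 clock cycle/sample
repeat instruction and repeat block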
TMS320C5000
[Figure: TMS320C5000 data path. T register, multiplier (17*17) with fractional mode, adder (40), ALU (40), accumulators A(40) and B(40), barrel shifter, comparator (COMP) with TRN and TC, sign control on the operand inputs, multiplexers (MUX) selecting among the buses and accumulators, ZERO / SAT / ROUND logic and MSW/LSW select.]
[Figure: block diagram of a 24-bit DSP (the Motorola example of the outline).
Address side: 16-bit address bus, external address switch, P address, Y address and X address buses, address ALU.
Memories: 2,048-by-24-bit program memory ROM; X memory with 256-by-24-bit RAM and 256-by-24-bit ROM; Y memory with 256-by-24-bit RAM and 256-by-24-bit ROM.
Data side: X data, Y data, P data and global data buses (24 bits), internal data-bus switch, external data-bus switch, 24-bit external data bus.
Data ALU: 24-by-24-bit multiplier-accumulator producing a 56-bit result.
Other blocks: program controller, clock (2 bits), interrupt (3 bits), I/O ports (7 bits), on-chip peripherals (host, synchronous serial interface, serial communications interface, programmed I/O, bus control).]
R.E.A.L.
[Figure: R.E.A.L. DSP core. X data memory and Y data memory, program memory (Z data), 16-bit buses for X data, Y data and Z data, two address computation units, input registers X, Y0 and Y1, product registers P0 and P1, scale, shift and saturation units, two 40-bit arithmetic-logic units, four 40-bit accumulators with saturation/scale, program control unit with instruction decoder, 96-bit instructions.]
RD16021 DSP (memories not included)
Process:     0.35 um, 5M
Voltage:     2.7-3.6 V
Frequency:   39 MHz (Tj = 85 C, 2.7 V, wcp)
Area:        3.9 mm2
Benchmark table (values per processor in source order; the row labels are not preserved)
DSPgroup OAK         835   21   3018   90    51   43   122   41    506
Motorola DSP561xx    925   23   3043   64    45   43   85    86    772
ADI ADSP-218x        841   22   3122   59    43   43   83    128   818
Lucent DSP16xx       1240  26   3123   101   65   47   123   120   888
TI TMS320 C54x       684   18   2922   58    44   41   61    111   528
TI320 C62xx          334   17   1294   33    30   29   36    39    188
Lucent DSP16210 and Philips RD16020 (incomplete): 780 16 1681 464 448 20 1470 55 37 43 63 40 176
FSM: 284 16514 375 12148 198 10633 415 455 21035 13234 147 4225 301 9016 167 5797
Benchmark parameters: 16 taps, 40 samples, 8 biquads
Remaining values: 38 23 43
Outline
[Figure: compiler flow.]
source
Front end: lexical analysis, syntax analysis, semantic analysis
Intermediate machine independent representation
Code generation:
  code selection
  register allocation
  scheduling (1 instr = // ops, order of instr)
code
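Code generation
Intermediate machine independent representation: basic blocks BBi, BBj, BBk, each with a data-flow graph, for example
t1 := a * b
t2 := c + d
t3 := t1 + c
out := t2 * t3
[Figure: data-flow graph of these four operations with the inputs a, b, c, d and the intermediate values t1, t2, t3.]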
Code selection
[Figure: the intermediate representation is matched against the register transfer patterns (RTPs) and covered with them (match & cover).]
A register transfer pattern (RTP) for a given datapath is any RT operation (read - combinatorial logic - write) that can be executed on the datapath. [Leupers]
Notation
ar := ar | ax + ay | af means
ar := ar + ay   or
ar := ar + af   or
ar := ax + ay   or
ar := ax + af
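The notation is a compact way to write the cross product of the operand alternatives. The C sketch below expands one such pattern into its individual RTPs; the function `expand_rtp` and the constant `RTP_LEN` are assumptions made for this illustration and are not part of any code generator mentioned in the slides.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RTP_LEN 32  /* buffer size per expanded pattern (assumption) */

/* Expand "dst := L1 | L2 ... op R1 | R2 ..." into all individual RTPs.
   Writes at most max_out strings and returns the number of patterns. */
static int expand_rtp(const char *dst,
                      const char *const *lhs, int n_lhs, char op,
                      const char *const *rhs, int n_rhs,
                      char out[][RTP_LEN], int max_out)
{
    int count = 0;

    for (int i = 0; i < n_lhs; i++) {
        for (int j = 0; j < n_rhs; j++) {
            assert(count < max_out);
            /* dst + " := " + lhs + " op " + rhs must fit in the buffer */
            assert(strlen(dst) + strlen(lhs[i]) + strlen(rhs[j]) + 8 < RTP_LEN);
            snprintf(out[count], RTP_LEN, "%s := %s %c %s",
                     dst, lhs[i], op, rhs[j]);
            count++;
        }
    }
    return count;
}
```

Called with the left alternatives ar, ax and the right alternatives ay, af, it produces the four patterns listed above, in the same order.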
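ADSP [Analog Devices]
[Figure: ADSP data path. ALU (+-) with the input registers ax, ay, af and the result register ar; MAC (*, +-) with the input registers mx, my, mf and the result register mr; connection to p memory.]
Operand alternatives shown in the figure: my | mf, ar | mr | mx, mr | mf, mr | ar | ax, ay | af, ar | af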
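[Figure: the data-flow graph (inputs c, intermediate values t1, t2, t3) covered with RTPs in three steps 1:, 2:, 3:.]
RTPs used in the cover:
my := pmem
mr := mr + (mx * my)
ax := dmem, ay := pmem
mr := dmem
ar := ax + ay
my := ar
mr := mr * my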
Problems
local decisions which have a global impact
phase coupling: example
asap schedule
maximal freedom for scheduling
code selection during scheduling
register allocation comes afterwards
can lead to infeasible solutions
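[Figure: phase coupling example with the registers R1, R2, R3, the units alu1 and alu2 and a move, in variants (a), (b) and (c).]
[Figure: producers Pu, Pv and consumers Cu, Cv of the values u and v; an extra precedence arises if u and v share the same register. [Mesman]]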
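[Figure: traditional (heuristic) code generation. The application and the constraints enter the code generator; the result is checked (OK? yes / no). The design space seen by the code generator is larger than the feasible space.]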
Outline: recent developments, VLIW (Very Long Instruction Word)
principles
central register file + example TM
clustered VLIW + example C6
subword parallelism or SIMD
VLIW principles
multiple parallel FUs, possibly different and pipelined
pipelining is exposed to the compiler
= no interlock mechanism
load-store architecture
all operands fetched from/stored in register files,
possibly multi-ported
each FU can receive an instruction every clock cycle
one instruction = many RISC instructions
each RISC instruction = one issue slot
no dependencies between different RISC instructions
= orthogonal microcode
= compiler friendly
VLIW architecture
[Figure: one register file, addressed by the R&W addresses in the instruction, serves the execution units 1..25; the instruction has one issue slot per execution unit (issue slots 1..25). A second drawing shows the register file with issue slots 1, 2 and 3.]
TM1000 DSPCPU
Functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt
[Figure: five execution units, instruction cache (32 kB) and data cache (16 kB).]
[Figure: chip layout. I/O interface, I-cache and D-cache with their tags, sequencer / decode, and the functional units DSPMUL1, DSPMUL2, IFMUL1 (float), IFMUL2 (float), FCOMP2, FTOUGH1, FALU0, FALU3, ALU0..ALU4, SHIFTER0, SHIFTER1, DSPALU0, DSPALU2. Annotation: (MIPS = 0.9 mW/MHz).]
[Figure: register file area in mm^2 (32 registers, after P&R) versus the number of ports (0 to 20).]
Area, delay and power dissipation grow more than linearly with the number of ports.
[Figure: clustered VLIW. Register files 1, 2 and 3 each serve two execution units (units 1-2, 3-4 and 5-6); copy units move values between the register files.]
[Figure: three register files with the units FMUL/FADD, IMUL/IADD and IMUL/IADD; example operations FMUL r1,r2,r3, IADD r1,r2,r3 and IMUL r1,r2,r3.]
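[Figure: two clusters with the register files I0 and I1. Functional units and their operations as labelled: FU00 (IADD_01, IMOV_01, ...), FU10 (IADD_10, IMOV_10, ...), FU01 (IADD_00, LAND_00, ...), FU01 (IADD_11, LAND_10, ...), FU02 (IMUL_00, SHFT_00, ...), FU02 (IMUL_10, SHFT_10, ...).]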
[Figure: C6 data path. Two sides, each with a register file (registers 0-15) and four units: L1, S1, M1, D1 and D2, M2, S2, L2. Every unit has the ports Dst, src1 and src2; some also have Src_up and Dst_up.
L units: integer add, logical, bit count.
S units: integer add, logical, bit manipulation, shift, constant, branch.
M units: integer multiply (16 => 32).
D units: integer add, load/store address.
Separate paths carry the store/load data and the load data.]
VelociTI principles
parallelism (fetch-decode-execute) (max 8 issue slots)
pipeline critical sections (ALU 1 cc, mult 2 cc, 200 MHz)
RISC (simple, atomic, independent instructions)
performance comes from compiler (pipelining, unroll)
load-store
orthogonal (2 identical DP, add on 6 units)
deterministic (no interlock)
conditional instructions (=guarding)
instruction packing
Instruction packing
Classical encoding: every instruction word contains all issue slots, so many nops (n) are fetched.
VelociTI encoding: the eight instructions A..H are stored without nops; one bit per instruction tells whether the next instruction executes in parallel with it.
                          A B C D E F G H
Fully serial:             0 0 0 0 0 0 0 0
Mixed serial/parallel:    1 1 0 1 0 0 1 0
Fully parallel:           1 1 1 1 1 1 1 0
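How the parallel bits group the eight instructions can be sketched in C. This is an illustration under the assumption that a bit value of 1 means "the next instruction executes in parallel with this one"; the function name and the term execute packet for a group of parallel instructions are used here for the example only.

```c
#include <assert.h>
#include <stdint.h>

#define FETCH_PACKET_SIZE 8

/* Split a fetch packet of eight instructions into groups that execute
   in parallel. pbits[i] == 1: instruction i+1 runs in parallel with
   instruction i. sizes[k] receives the number of instructions in group k.
   Returns the number of groups, i.e. the cycles needed to issue them. */
static int split_fetch_packet(const uint8_t pbits[FETCH_PACKET_SIZE],
                              int sizes[FETCH_PACKET_SIZE])
{
    int packets = 0;
    int current = 0;

    for (int i = 0; i < FETCH_PACKET_SIZE; i++) {
        assert(pbits[i] <= 1);
        current++;
        /* a 0 bit (or the end of the fetch packet) closes the group */
        if (pbits[i] == 0 || i == FETCH_PACKET_SIZE - 1) {
            sizes[packets++] = current;
            current = 0;
        }
    }
    return packets;
}
```

For the mixed pattern 1 1 0 1 0 0 1 0 this gives four groups: ABC, DE, F and GH. The fully serial pattern gives eight groups of one instruction and the fully parallel pattern a single group of eight.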
Subword parallelism
32 bits = 4 bytes are processed independently
Ex. +, -, min, max
=> quadumin
=> quadumax
...
[Figure: a 32-bit word split into four bytes, with the same op applied to each byte.]
Subword parallelism
(custom operators in TM)
enum { size = 1000 };
unsigned char out[size], in1[size], in2[size];
for (int i = 0; i < size; i++)
    out[i] = in1[i] + in2[i];
+ faster execution
- rewrite effort (e.g. different types for inputs and outputs)
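The rewrite effort can be made concrete with a portable C version that adds four bytes per 32-bit operation. This sketch does not use the TM custom operators; `quad_add` is a hypothetical helper that imitates a byte-wise add (each byte wraps modulo 256) with ordinary integer operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add four packed bytes independently; each byte wraps modulo 256. */
static uint32_t quad_add(uint32_t a, uint32_t b)
{
    /* add the low 7 bits of every byte: no carry can cross a byte border */
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* fold in the top bit of every byte without generating a carry */
    return low ^ ((a ^ b) & 0x80808080u);
}

/* out[i] = in1[i] + in2[i] for size bytes, four bytes at a time. */
static void add_bytes_quad(uint8_t *out, const uint8_t *in1,
                           const uint8_t *in2, size_t size)
{
    assert(size % 4 == 0);
    for (size_t i = 0; i < size; i += 4) {
        uint32_t a, b, r;
        memcpy(&a, in1 + i, 4);   /* load four bytes as one word */
        memcpy(&b, in2 + i, 4);
        r = quad_add(a, b);
        memcpy(out + i, &r, 4);   /* store four result bytes */
    }
}
```

The loop now runs size/4 times, but the data is handled as 32-bit words instead of bytes, which is the change of types mentioned above.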
Subword parallelism
MPEG example
for (i=0; i<64; i++)
{
temp = ((back(i) + forward(i) +1) >> 1) +idct(i);
if (temp > 255)
temp = 255;
else if (temp < 0)
temp = 0;
destination[i] = temp;
}
Remark: simple example without interloop dependencies
quadavg
temp0 = idct(i+0);
if (temp0 > 255) temp0 = 255;
else if (temp0 < 0) temp0 = 0;
temp1 = idct(i+1);
if (temp1 > 255) temp1 = 255;
else if (temp1 < 0) temp1 = 0;
temp2 = idct(i+2);
if (temp2 > 255) temp2 = 255;
else if (temp2 < 0) temp2 = 0;
temp3 = idct(i+3);
if (temp3 > 255) temp3 = 255;
else if (temp3 < 0) temp3 = 0;
dspuquadaddui
destination[i+0] = temp0;
destination[i+1] = temp1;
destination[i+2] = temp2;
destination[i+3] = temp3;
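The MPEG loop can be rewritten to process four pixels per step. The C sketch below emulates the two quad operations with portable code; `quadavg_emul` and `quadaddclip_emul` are hypothetical stand-ins for the custom operators named on the slide, and their exact semantics here (rounded byte-wise average; unsigned byte plus signed byte, clipped to 0..255) are assumptions. It also assumes that the idct values fit in a signed byte.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 64

/* Pack four bytes into one 32-bit word (byte k in bits 8k..8k+7). */
static uint32_t pack4(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Emulated quad average: per byte (a + b + 1) >> 1. */
static uint32_t quadavg_emul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int k = 0; k < 32; k += 8) {
        uint32_t x = (a >> k) & 0xFFu;
        uint32_t y = (b >> k) & 0xFFu;
        r |= ((x + y + 1u) >> 1) << k;
    }
    return r;
}

/* Emulated quad add with clipping: unsigned byte of a plus signed byte
   of b, clipped to 0..255. */
static uint32_t quadaddclip_emul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int k = 0; k < 32; k += 8) {
        int x = (int)((a >> k) & 0xFFu);
        int y = (int)((b >> k) & 0xFFu);
        if (y >= 128) y -= 256;          /* reinterpret as signed byte */
        int s = x + y;
        if (s > 255) s = 255; else if (s < 0) s = 0;
        r |= (uint32_t)s << k;
    }
    return r;
}

/* Scalar reference: the loop of the slide, one pixel per iteration. */
static void predict_scalar(const uint8_t *back, const uint8_t *forward,
                           const int8_t *idct, uint8_t *destination)
{
    for (int i = 0; i < BLOCK_SIZE; i++) {
        int temp = ((back[i] + forward[i] + 1) >> 1) + idct[i];
        if (temp > 255) temp = 255; else if (temp < 0) temp = 0;
        destination[i] = (uint8_t)temp;
    }
}

/* Quad version: four pixels per iteration, two quad operations each. */
static void predict_quad(const uint8_t *back, const uint8_t *forward,
                         const int8_t *idct, uint8_t *destination)
{
    assert(back && forward && idct && destination);
    for (int i = 0; i < BLOCK_SIZE; i += 4) {
        uint8_t res[4] = { (uint8_t)idct[i], (uint8_t)idct[i + 1],
                           (uint8_t)idct[i + 2], (uint8_t)idct[i + 3] };
        uint32_t t = quadavg_emul(pack4(back + i), pack4(forward + i));
        t = quadaddclip_emul(t, pack4(res));
        for (int k = 0; k < 4; k++)
            destination[i + k] = (uint8_t)(t >> (8 * k));
    }
}
```

On hardware with real quad operators, each emulation loop collapses into one operation, so the 64 iterations of the scalar loop become 16 iterations with two operations each.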