
Processor Architectures and Program Mapping
Programmable Digital Signal Processors

5kk10
TU/e
Henk Corporaal
Jef van Meerbergen
Bart Mesman

Topic 2: Programmable Digital Signal Processors


real-time worst-case processing = need for more compute power
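$$\frac{\text{sec}}{\text{prog}} = \frac{\text{instr}}{\text{prog}} \times \frac{\text{cycles}}{\text{instr}} \times \frac{\text{sec}}{\text{cycle}}$$
CPI = 1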
instruction level parallelism (ILP)
hardware support for loop control
attention for high level data types e.g. arrays, delaylines
(vs. scalars for CPUs)
difficult to compare architectures
e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling,
shuffling, initialisation can be included or forgotten
benchmarking (Berkeley Design Technology Inc (BDTi))
(compare to SpecInt benchmarks for CPUs)

Outline
architectures for programmable DSPs
multiplier-accumulator
modified Harvard architecture
extension with an ALU (decision making)
controller architectures
examples: TI, Motorola, Philips
code generation
recent developments: VLIW (Very Long Instruction Word)
examples: C6 and TM

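[Figure: multiplier-accumulator. Inputs c(i) and x(i) feed the multiplier MPY (Booth, Wallace, ...); the product is latched in P_reg (PR) and added to the accumulator register ACR by the ADDER; control and clock drive the registers.]

Sum of products = basic operation
$\sum_i c(i) \cdot x(i)$ (linear expr.)
for correlation, filtering, spectral analysis ...
Goal = 1 cycle per iteration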

Modifications
extra inputs/outputs
position ACR (1 or 2)
adder/subtractor
extra pipelines
asymmetric inputs
multi-precision

DSP data types


not every signal requires 32 bits
2 types of DSP: floating point and integer
advantages FP: most specs are in FP
(conversion to int is time consuming since the behaviour
may change)
disadvantage FP: cost (area, speed, power)
wanted : type of output of an operation = type of input
(because both stored in RAM)
no problem for FP, but a problem for integer
integer multiplication doubles the number of bits: n * n => 2n
What about fractional numbers ?
0.9 x 0.9 = 0.81

DSP data types


integer and fractional numbers are a special case of fixed point
fix <p,q> (ART designer & SystemC): p bits in total, q of them fractional

Example fix <8,3>:  1 1 1 0 1 . 1 0 1
bit weights: $-2^4,\ 2^3,\ 2^2,\ 2^1,\ 2^0,\ 2^{-1},\ 2^{-2},\ 2^{-3}$
value = -19/8 = -2.375 (scale factor 1/8)

2s complement: the msb has a negative weight
the lsb weight determines the quantization error
if q=0 then integer e.g. fix <8,0>
if q=p-1 then fractional e.g. fix <8,7>

Same alu handles fix <8,1>, fix <8,2>, fix <8,3>, ...
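
The fix <p,q> interpretation above can be checked with a few lines of Python. This is only a sketch: the helper name `fix_value` is made up for illustration, and it assumes the bit string is given msb first in 2s complement.

```python
def fix_value(bits: str, q: int) -> float:
    """Value of a 2s-complement bit string (msb first) with q fractional bits."""
    p = len(bits)
    raw = int(bits, 2)
    if bits[0] == "1":
        raw -= 1 << p          # the msb carries the negative weight -2**(p-1)
    return raw / (1 << q)      # scale factor 2**-q


print(fix_value("11101101", 3))   # the fix <8,3> example from the slide
print(fix_value("11101101", 0))   # same bits read as an integer, fix <8,0>
```

The same bit pattern gives -2.375 for q=3 and -19 for q=0, which is why one ALU can serve every fix <8,q> type: only the implied scale factor changes.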

DSP data types


Multiplication example: fix <8,3> x fix <8,4>

  fix <8,3>      1 1 1 0 1 . 1 0 1                  = -19/8
  fix <8,4>        0 1 1 0 . 0 0 0 1                = 97/16
  product        1 1 1 1 1 0 0 0 1 . 1 0 0 1 1 0 1  = -1843/128

Some processors (C54) have special instructions for fractional
numbers (and symmetric number domain $-2^{n-1} \ldots 2^{n-1}$)

  sxxx x syyy = sszzzzzz
  => szzzzzz0 if FRCT = 1
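
A minimal sketch of what the FRCT shift does, assuming 16-bit fractional operands (fix <16,15>) held as plain Python integers. The function name `frac_mul` and its parameters are illustrative, not C54 mnemonics.

```python
def frac_mul(a: int, b: int, frct: bool = True) -> int:
    """Multiply two 16-bit fractions given as raw 2s-complement integers.

    The raw product has two sign bits and 30 fractional bits.
    With frct set, one left shift removes the duplicate sign bit,
    so the upper 16 bits are again a fix <16,15> number.
    """
    product = a * b
    if frct:
        product <<= 1
    return product


half = 1 << 14                              # 0.5 in fix <16,15>
msb_with = frac_mul(half, half) >> 16       # keep the msb word only
msb_without = frac_mul(half, half, frct=False) >> 16
print(msb_with / (1 << 15), msb_without / (1 << 15))
```

With the shift, the msb word reads 0.25 as expected; without it, the same word read as fix <16,15> is off by a factor of two.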

DSP data types


continue (after multiplication) with msb only
represents the limit of the accuracy of the result
(can not be larger than the accuracy of the inputs)
more efficient solution
continue with msb + lsb
sum-of-product operations generate accumulative noise at 32nd
vs. 16th bit
Still overflow for addition = overflow bits
double precision accumulator
+ extra overflow bits
+ shift, round, truncate unit

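[Figure: MAC with double precision accumulator. c(i) and x(i) feed MPY (Booth, Wallace, ...); the product passes through P_reg (PR) to the ADDER and the accumulator ACR, followed by a SHIFT / ROUND / TRUNCATE unit; control and clock drive the registers.]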

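[Figure: quantizer characteristics xQ for rounding, value truncation and magnitude truncation]

rounding
  111.11 + 000.1 => 000    (-0.25 => 0)
  111.01 + 000.1 => 111    (-0.75 => -1)
value truncation
  111.11 => 111            (-0.25 => -1)
  111.01 => 111            (-0.75 => -1)
magnitude truncation
  111.11 + 001. => 000     (-0.25 => 0)
  111.01 + 001. => 000     (-0.75 => 0)

overflow characteristics: zeroing, saturation, sawtooth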
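
The quantization and overflow modes named above can be written down compactly. The following is a sketch on ordinary Python numbers, not a model of a specific shift/round/truncate unit; all function names are chosen here for illustration, and the word length `n` is an assumed parameter.

```python
import math


def rounding(x: float) -> int:
    return math.floor(x + 0.5)        # add half an lsb, then drop the fraction


def value_truncation(x: float) -> int:
    return math.floor(x)              # dropping 2s-complement fraction bits


def magnitude_truncation(x: float) -> int:
    return math.trunc(x)              # always towards zero


def _limits(n: int):
    return -(1 << (n - 1)), (1 << (n - 1)) - 1


def zeroing(x: int, n: int = 8) -> int:
    lo, hi = _limits(n)
    return x if lo <= x <= hi else 0


def saturation(x: int, n: int = 8) -> int:
    lo, hi = _limits(n)
    return max(lo, min(hi, x))


def sawtooth(x: int, n: int = 8) -> int:
    m = 1 << n                        # plain 2s-complement wrap-around
    return (x + m // 2) % m - m // 2


for x in (-0.25, -0.75):
    print(x, rounding(x), value_truncation(x), magnitude_truncation(x))
```

The loop reproduces the six results listed on the slide for -0.25 and -0.75.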

[Figure: three memory architectures, each connected to an execution unit (EXU)]
Von Neumann (sequential): one prog/data memory
Harvard: prog mem. and data mem.
Modified Harvard: prog mem., data mem. 1 and data mem. 2

$\sum_i c(i) \cdot x(i)$
Goal = 1 cycle per iteration

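[Figure: DSP block diagram. Controller: PC (with +1, reset and interrupt address inputs), stack, program memory, IR and control bus. Data side: ACU_A and ACU_B with address registers AR_A and AR_B, RAM_A and RAM_B with data registers DR_A and DR_B, MAC and Rfile.]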

time loop
  filter loop i: $\sum_i c_i \cdot x_i$
1 cycle/tap ?
How to update the delay line ?

[Figure: 5-tap FIR filter with delay line x1 ... x5 (z^-1 elements), coefficients c1 ... c5, multipliers and an adder producing y]

Solution 1: block move in memory

Memory      output      output      output
location    sample 1    sample 2    sample 3
1           x1          x2          x3
2           x2          x3          x4
3           x3          x4          x5
4           x4          x5          x6
5           x5          x6          x7

2 possibilities
  complete move after every output sample is calculated
    read and write the data twice
  move after read of every datum separately
    write the data twice
    need for a special instruction (TMS320)
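
As a sketch of the first possibility (complete move after every output sample), the following Python function shifts the whole delay line before each sum of products. The name `fir_block_move` and the list-based "memory" are assumptions made for illustration.

```python
def fir_block_move(coeffs, samples):
    """FIR filter whose delay line is shifted in memory for every output sample."""
    n = len(coeffs)
    delay = [0] * n                      # delay[0] holds the newest sample
    out = []
    for x in samples:
        # block move: every stored sample is read and written once more
        for k in range(n - 1, 0, -1):
            delay[k] = delay[k - 1]
        delay[0] = x
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out


print(fir_block_move([1, 2, 3], [1, 0, 0, 0]))   # impulse response = coefficients
```

The inner move loop costs as many memory accesses as the filter itself, which is the overhead the slide points at.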

Solution 2: indirect addressing

Memory      output      output      output      output      output
location    sample 1    sample 2    sample 3    sample 4    sample 5
1           x1                                              x9
2           x2          x2
3           x3          x3          x3
4           x4          x4          x4          x4
5           x5          x5          x5          x5          x5
6                       x6          x6          x6          x6
7                                   x7          x7          x7
8                                               x8          x8

use of a pointer to mark the begin of the delay line
update the pointer instead of moving the data
problem: the delay line walks through (trashes) the whole memory
solution: modulo addressing
need for a register to store the pointer
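
The pointer-plus-modulo idea can be sketched as a circular buffer. This is an illustration only: `fir_circular`, the buffer size of 8 and the mask-based wrap are assumptions, chosen to match the 8 memory locations in the table above.

```python
def fir_circular(coeffs, samples, size=8):
    """FIR filter with the delay line kept in a circular buffer of 2**k words."""
    assert size & (size - 1) == 0 and size >= len(coeffs)
    mask = size - 1                      # modulo 2**k implemented as a mask
    ram = [0] * size
    ptr = 0                              # marks the newest sample
    out = []
    for x in samples:
        ptr = (ptr + 1) & mask           # update the pointer, do not move data
        ram[ptr] = x
        acc = 0
        for k, c in enumerate(coeffs):
            acc += c * ram[(ptr - k) & mask]
        out.append(acc)
    return out


print(fir_circular([1, 2, 3, 4, 5], list(range(1, 13))))
```

Each new sample overwrites the oldest location once the pointer has wrapped, so no data is moved and the buffer never grows beyond its modulo range.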

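IIR filter

[Figure: IIR filter with input x, coefficients c1 ... c4 and a delay line y1 ... y5 built from z^-1 elements; a pointer marks the start of the delay line]

memory map: y1 ... y5 in one modulo range, x1 ... x5 in a second modulo range (modulo range 2)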

2 filters

time loop
  pntr 1: for i = 1..itaps: $\sum_i c(i) \cdot x(i)$, delay line x1 ... x5 in modulo range 1
  pntr 2: for j = 1..jtaps: $\sum_j d(j) \cdot y(j)$, delay line y1 ... y5 in modulo range 2

2 memory segments => 1 segment

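Mapping strategy

[Figure: filter with z^-1 delay elements, signals x1 ... x3 and y1 ... y3 and coefficients c1 ... c5, mapped onto one modulo range in RAM holding y1, y2, x1/y3, x2, x3; pntr 1 marks the start]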

Mapping strategy
define positions in Ram
constraint: vars that form a delay line in consecutive places
find a schedule
example : c1 => c2 => c3 => c4 => c5
define ACU instructions

[Figure: second example with delay line x1 ... x8 (z^-1 elements), coefficients c1 ... c8 and outputs yo and ye; modulo addressing, output to RAM]

ACU architecture and instruction set

Instruction   Output   reg A   reg S
Read_A        A        A       S
Read_S        S        A       S
incA          A+1      A+1     S
decA          A-1      A-1     S
Step          A+S      A+S     S
Inc_step      S+1      A       S+1

Modulo can be implemented as a mask operation if the size is 2^k
  16 = 10 000
  23 = 10 111
  masked upper bits = hold

Mapping example

[Figure: the filter from the mapping strategy (x1 ... x3, y1 ... y3, c1 ... c5, z^-1 elements); y1, y2, x1/y3, x2, x3 are stored in a modulo range within RAM locations 16 ... 23, pntr marks the start]

Assume initialisation: A = pointer = 17, S = -2

Instruction   Address
read_A        17
incA          18
incA          19
incA          20
incA          21
step          19
dec           18    prepare new pointer for next iteration
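
The address sequence of the mapping example can be replayed with a small ACU model. This is a sketch under assumptions: the class name `ACU`, a modulo range of 8 words (locations 16 ... 23) and the mask-based wrap are illustrative choices, following the remark that modulo is a mask operation when the size is 2^k.

```python
class ACU:
    """Address computation unit with registers A (address) and S (step)."""

    def __init__(self, a: int, s: int, size: int = 8):
        assert size & (size - 1) == 0          # modulo range must be 2**k words
        self.a, self.s = a, s
        self.mask = size - 1

    def _wrap(self, value: int) -> int:
        # upper address bits hold their value, the lower k bits wrap around
        return (self.a & ~self.mask) | (value & self.mask)

    def read_a(self) -> int:
        return self.a

    def inc_a(self) -> int:
        self.a = self._wrap(self.a + 1)
        return self.a

    def dec_a(self) -> int:
        self.a = self._wrap(self.a - 1)
        return self.a

    def step(self) -> int:
        self.a = self._wrap(self.a + self.s)
        return self.a


acu = ACU(a=17, s=-2)
trace = [acu.read_a(), acu.inc_a(), acu.inc_a(), acu.inc_a(), acu.inc_a(),
         acu.step(), acu.dec_a()]
print(trace)
```

The trace matches the addresses 17, 18, 19, 20, 21, 19, 18 of the slide; an `inc_a` at address 23 would wrap to 16 instead of leaving the range.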

Addressing modes

Mode         Example            Meaning
register     ADD R4, R3         R[R4] = R[R4] + R[R3]
immediate    ADD R4, #3         R[R4] = R[R4] + 3
direct       ADD R4, (100)      R[R4] = R[R4] + Mem[100]
indirect     ADD R4, (R3)       R[R4] = R[R4] + Mem[R[R3]]
w. inc/dec   ADD R4, (R3±)      R[R4] = R[R4] + Mem[R[R3]]
                                R[R3] = R[R3] ± 1
indexed      ADD R4, (R3±R2)    R[R4] = R[R4] + Mem[R[R3]]
                                R[R3] = R[R3] ± R[R2]

Remarks
direct = for static data
indirect = for arrays
inc/dec = for stepping through arrays e.g. x[n]
index = for stepping through arrays e.g. x[2n]

Addressing modes: extra for DSP

8 ARs (address or auxiliary registers) available
extra indirect modes
  circular      *ARn±%        post inc/dec by 1 - circular
                *ARn±AR0%     post inc/dec by AR0 - circular
  bit reverse   *ARn±AR0B     post inc/dec by AR0 - bit rev.
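
Bit-reversed post-increment is commonly described as an addition whose carry propagates from msb to lsb. The sketch below models that with explicit bit reversal; `reverse_carry_add` is a made-up helper, and using a step of half the FFT size is an assumption about how the index register is typically loaded.

```python
def reverse_carry_add(addr: int, step: int, bits: int) -> int:
    """Add step to addr with the carry running from msb to lsb (bits wide)."""
    def rev(v: int) -> int:
        return int(format(v, f"0{bits}b")[::-1], 2)

    return rev((rev(addr) + rev(step)) % (1 << bits))


# 8-point FFT: 3 address bits, step = N/2 = 4
addr, order = 0, []
for _ in range(8):
    order.append(addr)
    addr = reverse_carry_add(addr, 4, 3)
print(order)
```

Stepping this way visits the eight locations in bit-reversed order, so FFT results can be read or written in natural order without a separate shuffling pass.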

Incorporation of an ALU
regular data-flow algorithms ==> MAC
filtering, correlation, windowing etc
decision making ==> ALU
sorting filters (e.g. median filters)
interpolation (e.g. sqrt)
absolute value calculation
logarithmic conversion
finite field arithmetic (e.g. Galois field)
Viterbi
VLC, VLD
division

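[Figure: DSP block diagram extended with an ALU. Controller: PC (with +1, reset and interrupt address inputs), stack, program memory, IR and control bus. Data side: ACU_A and ACU_B with AR_A and AR_B, RAM_A and RAM_B with DR_A and DR_B, MAC, ALU and Rfile.]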

Bus-oriented instruction encoding

Opcode   Operation       Remaining fields
00       ALU             SX SY DX DY RF, ACU A B
01       MULT            SX SY DX DY RF, ACU A B
10       Imm. data       DX DY RF, ACU A B
11       Next address    BR Cond, ACU A B

first solution: $\sum_i c(i) \cdot x(i)$

resources: ALU, MPY-ACC, RAM, ACU (not shown: coefficient RAM+ACU)

LABEL   ALU                             MPY-ACC                     RAM         ACU
        init counter                    Acc = 0                                 init (i=0)
loop                                                                            incr (=i+1)
                                                                    read x(i)
                                        acc(i)=acc(i-1)+x(i)*c(i)
        dec counter
        branch to loop if counter > 0
        nop

time (cc) runs from top to bottom
6 clockcycles/sample
limit pipelines in the controller

Loopfolding (software pipelining)

original loop (a_i -> f -> b_i -> g -> c_i -> h -> d_i)
for i = 0 to n
  b_i = f(a_i)
  c_i = g(b_i)
  d_i = h(c_i)

[Figure: successive iterations overlap in time: while h produces d_0, g produces c_1 and f produces b_2]

folded loop
for i = 2 to n
  b_i = f(a_i)
  c_{i-1} = g(b_{i-1})
  d_{i-2} = h(c_{i-2})
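
The folded loop only gives the same result as the original loop if it is surrounded by a preamble and a postamble. The sketch below makes both explicit; the function names `plain` and `folded` and the list-based storage are illustrative assumptions.

```python
def plain(a, f, g, h):
    """Reference: every iteration runs f, g and h back to back."""
    return [h(g(f(x))) for x in a]


def folded(a, f, g, h):
    """Same computation with the three stages of different iterations overlapped."""
    n = len(a)
    assert n >= 2
    b, c, d = [None] * n, [None] * n, [None] * n
    # preamble: fill the pipeline
    b[0] = f(a[0])
    b[1] = f(a[1])
    c[0] = g(b[0])
    # steady state: one f, one g and one h per iteration, on different indices
    for i in range(2, n):
        b[i] = f(a[i])
        c[i - 1] = g(b[i - 1])
        d[i - 2] = h(c[i - 2])
    # postamble: drain the pipeline
    c[n - 1] = g(b[n - 1])
    d[n - 2] = h(c[n - 2])
    d[n - 1] = h(c[n - 1])
    return d


inc = lambda v: v + 1
dbl = lambda v: 2 * v
sqr = lambda v: v * v
print(folded([0, 1, 2, 3], inc, dbl, sqr) == plain([0, 1, 2, 3], inc, dbl, sqr))
```

In the steady state the three statements have no dependencies on each other within one iteration, so a machine with three units could issue them in the same cycle.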

Loopfolding (software pipelining): $\sum_i c(i) \cdot x(i)$

LABEL   ALU                             MPY-ACC                             RAM           ACU
        init counter                    acc(i-1)=0                                        init (i=1)
                                                                            read x(i)     inc (=i+1)
loop                                    acc(i)=acc(i-1)+x(i)*c(i)           read x(i+1)   incr (=i+2)
        dec counter
        branch to loop if counter > 0
        nop
                                        acc(n-1)=acc(n-2)+x(n-1)*c(n-1)     read x(n)
                                        acc(n)=acc(n-1)+x(n)*c(n)

Pre- and postamble
4 clockcycles/sample

hardware support for loop control: $\sum_i c(i) \cdot x(i)$

LABEL   ALU            MPY-ACC                             RAM           ACU
        init counter   acc(i-1)=0                                        init (i=1)
                                                           read x(i)     inc (=i+1)
        repeat n-2     acc(i)=acc(i-1)+x(i)*c(i)           read x(i+1)   incr (=i+2)
                       acc(n-1)=acc(n-2)+x(n-1)*c(n-1)     read x(n)
                       acc(n)=acc(n-1)+x(n)*c(n)

1 clockcycle/sample
repeat instruction and repeat block

Outline
architectures for programmable DSPs
multiplier-accumulator
modified Harvard architecture
extension with an ALU (decision making)
controller architectures
examples: TI, Motorola, Philips
code generation
recent developments: VLIW (Very Long Instruction Word)
examples: C6 and TM

TMS320C5000

[Figure: TMS320C5000 data path. T register, multiplier (17*17) with fractional mode, adder (40), ALU (40), barrel shifter, accumulators A(40) and B(40), connected through multiplexers with sign control; further blocks: COMP, TRN, TC, ZERO, SAT, ROUND, MSW/LSW select.]

Address bus
16 bits

Motorola 56K family

[Figure: Motorola 56K block diagram. Address ALU driving the P, X and Y address buses through an external address switch; 2,048-by-24-bit program memory ROM; X memory and Y memory, each 256-by-24-bit RAM and 256-by-24-bit ROM; X data, Y data, P data and global data buses (24 bits) with internal and external data-bus switches; data ALU with a 24-by-24 bit multiplier-accumulator producing a 56 bit result; program controller (clock: 2 bits, interrupt: 3 bits); I/O ports (7 bits); on-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O, bus control.]

R.E.A.L.

[Figure: R.E.A.L. DSP data path. X data memory and Y data memory with two address computation units; 16 bit buses for X data, Y data and Z data (program memory); two 16-by-16 bit multipliers with inputs X, Y0, Y1 and products P0, P1; two 40 bit arithmetic-logic units; four 40 bit accumulators; scale, shift and saturation units; program control unit and instruction decoder with 96-bit instructions.]

RD16021 DSP
memories            not included
process             0.35, 5M
voltage             2.7-3.6 V
frequency           39 MHz (Tj = 85 C, 2.7V, wcp)
area                3.9 mm2
power dissipation   2.1 mW/MHz

Instruction cycle counts for BDTi benchmarks

Function              DSPgroup   Motorola   ADI         Lucent    TI TMS320   TI TMS320   Lucent     Philips
                      OAK        DSP561xx   ADSP-218x   DSP16xx   C54x        C62xx       DSP16210   RD16020
Real block FIR        835        925        841         1240      684         334         780        448
Single sample FIR     21         23         22          26        18          17          16         20
Complex block FIR     3018       3043       3122        3123      2922        1294        1681       1470
LMS adaptive          90         64         59          101       58          33          -          55
IIR (8 sections)      51         45         43          65        44          30          -          37
Vector dot product    43         43         43          47        41          29          -          43
Vector add            122        85         83          123       61          36          -          63
Vector maximum        41         86         128         120       111         39          -          40
Convolution encoder   506        772        818         888       528         188         -          176
FSM                   284        375        198         415       455         147         301        167
256 pnt FFT           16514      12148      10633       21035     13234       4225        9016       5797

Benchmark parameters: 16 taps, 40 samples, 8 biquads
Further DSP16210 values without a recoverable row: 464; 38, 23, 43

Outline

architectures for programmable DSPs


multiplier-accumulator
modified Harvard architecture
extension with an ALU (decision making)
controller architectures
examples: TI, Motorola, Philips
code generation
recent developments: VLIW (Very Long Instruction Word)

[Figure: compiler flow]
Front end: source -> lexical analysis -> syntax analysis -> semantic analysis -> intermediate machine independent representation
Code generation: code selection -> register allocation -> scheduling (1 instr = // ops, order of instr) -> code

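Intermediate machine independent representation

control flow between basic blocks BBi, BBj, BBk
within a basic block:
  t1 := a * b
  t2 := c + d
  t3 := t1 + c
  out := t2 * t3

[Figure: corresponding data flow graph with inputs a, b, c, d, operations *, +, +, * and intermediate values t1, t2, t3]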

Code selection

[Figure: the intermediate representation is matched and covered with RTPs]

Register transfer pattern (RTP) for a given datapath
is any RT operation (read - combinatorial logic - write)
which can be executed on the datapath. [Leupers]

Notation
ar := ar | ax + ay | af means
  ar := ar + ay   or
  ar := ar + af   or
  ar := ax + ay   or
  ar := ax + af
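
The `|` notation is a compact way of writing a cross product of operand alternatives. A minimal sketch of that expansion, with `expand_rtp` as a hypothetical helper name:

```python
from itertools import product


def expand_rtp(dest, left, op, right):
    """Expand 'dest := l1 | l2 op r1 | r2' into the individual register transfers."""
    return [f"{dest} := {l} {op} {r}" for l, r in product(left, right)]


for rt in expand_rtp("ar", ["ar", "ax"], "+", ["ay", "af"]):
    print(rt)
```

Two alternatives on each side give four register transfers; a code selector has to consider each of them when it matches a pattern against the intermediate representation.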

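Code selection example

[Figure: ADSP datapath [Analog Devices]. d memory and p memory feed the ALU input registers ax, ay and the MAC input registers mx, my; the ALU (x, y, +-) writes ar and af; the MAC (x, y, *, +-) writes mr and mf.]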

Examples of RTPs on the ADSP-210 datapath

[Figure: RTP trees. MAC patterns take one operand from ar | mr | mx and one from my | mf and write mr or mf; ALU patterns take one operand from mr | ar | ax and one from ay | af and write ar or af.]

Example of code selection

= covering of intermediate representation with RTPs

[Figure: the data flow graph with t1, t2, t3 covered by RTPs in three steps (1:, 2:, 3:)]

RTPs in the cover:
  mx := dmem      my := pmem
  mr := mr + (mx * my)
  ax := dmem      ay := pmem
  mr := dmem
  ar := ax + ay
  my := ar
  mr := mr * my

Problems
local decisions which have a global impact
phase coupling: example
asap schedule
maximal freedom for scheduling
code selection during scheduling
register allocation comes afterwards
can lead to infeasible solutions

phase coupling: example 1

[Figure: datapath with registers R1, R2, R3, units alu1 and alu2 and a move operation; three alternatives (a), (b), (c)]

phase coupling: example 2 [Mesman]

[Figure: values u and v with producers Pu, Pv and consumers Cu, Cv, drawn twice: unconstrained, and for the case "if u and v share the same register"]

Example of coupling between scheduling and register allocation

phase coupling: discussion [Mesman]

[Figure: traditional (heuristic) code generation takes the application and the constraints and checks the result (OK ? yes / no); the design space seen by the code generator is compared with the feasible space]

Phase coupling is difficult because of many constraints originating
from irregular interconnect, special purpose registers and
non-orthogonal microcode.

phase coupling: discussion


It is very difficult and almost impossible to develop robust and
efficient DSP compilers.
Current DSP practice = programming in assembler
Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture
develop an architecture which is still efficient but also
a good model for building a compiler
Efficiency = exploit instruction level parallelism (ILP)
compilation = systematic positioning of registers and regular
interconnect
= VLIW = Very Long Instruction Word

Outline
architectures for programmable DSPs
multiplier-accumulator
modified Harvard architecture
extension with an ALU (decision making)
controller architectures
examples: TI, Motorola, Philips
code generation
recent developments: VLIW (Very Long Instruction Word)
principles
central register file + example TM
clustered VLIW + example C6
subword parallelism or SIMD

VLIW principles
multiple parallel FUs, possibly different and pipelined
pipelining is exposed to the compiler
= no interlock mechanism
load-store architecture
all operands fetched from/stored in register files,
possibly multi-ported
each FU can receive an instruction every clock cycle
one instruction = many RISC instructions
each RISC instruction = one issue slot
no dependencies between different RISC instructions
= orthogonal microcode
= compiler friendly

VLIW architecture

[Figure: one register file (R&W addr. taken from the instruction) feeding exec units 1 ... 25; each exec unit is controlled by its own issue slot 1 ... 25]

long instruction words e.g. (3*7+4)*25=625 bits
many ports on the register file e.g. 75
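
The two numbers can be reproduced under one plausible reading of the formula, which is an assumption and not stated on the slide: each issue slot names 3 registers out of 128 (7 bits each) and carries 4 further bits, and each register operand needs its own port. The function name `vliw_cost` is illustrative.

```python
def vliw_cost(slots=25, regs=128, operands=3, other_bits=4):
    """Instruction width in bits and register-file ports for a central-RF VLIW."""
    addr_bits = (regs - 1).bit_length()          # 7 bits address 128 registers
    width = (operands * addr_bits + other_bits) * slots
    ports = operands * slots
    return width, ports


print(vliw_cost())            # 25 issue slots
print(vliw_cost(slots=5))     # 5 issue slots
```

Both figures grow linearly with the number of issue slots, which is the scaling problem of a central register file.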

VLIW architecture: central Register File

[Figure: one register file shared by issue slots 1 ... 3; each issue slot serves three exec units (units 1-3, 4-6 and 7-9)]

TM1000 DSPCPU
5 constant
5 ALU
2 memory
2 shift
2 DSP-ALU
2 DSP-mul
3 branch
2 FP ALU
2 Int/FP ALU
1 FP compare
1 FP div/sqrt

[Figure: register file (128 regs, 32 bit, 15 ports) feeding five exec units, controlled by the instruction register (5 issue slots); PC, instruction cache (32 kB) and data cache (16 kB)]

TriMedia TM32A processor

0.18 micron
area : 16.9 mm2
200 MHz (typ)
1.4 W
7 mW/MHz
(MIPS = 0.9 mW/MHz)

[Figure: floorplan with I/O interface, sequencer / decode, I-cache and D-cache with their tags, and the functional units DSPMUL1, DSPMUL2, IFMUL1 (float), IFMUL2 (float), FCOMP2, FTOUGH1, FALU0, FALU3, ALU0 ... ALU4, SHIFTER0, SHIFTER1, DSPALU0, DSPALU2]

Synthesised RF area (CMOS18, 64 bit)

[Plot: area in mm-sq versus nr of ports, after P&R, for 32, 64 and 128 registers, each with a polynomial fit]

Area, speed and power dissipation scale more than linearly with the
number of ports

VLIW architecture: clustered Register Files

[Figure: register files 1 ... 3, each with two exec units (units 1-2, 3-4 and 5-6); copy units move values between the register files]

VLIW architecture: clustered Register Files

[Figure: register files 1 ... 3 with the units FMUL/FADD, IMUL/IADD and IMUL/IADD; example operations FMUL r1,r2,r3, IADD r1,r2,r3, IMUL r1,r2,r3]

VLIW architecture: clustered Register Files

[Figure: register files I0 and I1 with their function units; operations carry a suffix per unit, e.g. IADD_00, IADD_01, IMOV_01, LAND_00, IMUL_00, SHFT_00 and IADD_10, IADD_11, IMOV_10, LAND_10, IMUL_10, SHFT_10]

VLIW architecture: clustered Register Files


Discussion
performance loss (more instructions) compared to a central
Register File (due to extra cycle for copy)
15-20 % for 2 clusters
20-30 % for 4 clusters
limited scalability
not too many clusters
not too many registers within each cluster (too many RF ports)
the compiler has to add copy ops
= graph changes during scheduling

TMS320C62x VelociTI (fixed point)

[Figure: two data paths, each with a register file 0-15 (32 bits) and four units with dst, src1 and src2 ports]
L1, L2: int add, logical, bit count
S1, S2: int add, logical, bit manip, shift, constant, branch
M1, M2: int mult (16=>32)
D1, D2: int add, load/store (store/load address)
store/load data paths between the register files and memory

VelociTI principles
parallelism (fetch-decode-execute) (max 8 issue slots)
pipeline critical sections (alu 1cc, mult 2 cc, 200 MHz)
RISC (simple, atomic, independent instructions)
performance comes from compiler (pipelining, unroll)
load-store
orthogonal (2 identical DP, add on 6 units)
deterministic (no interlock)
conditional instructions (=guarding)
instruction packing

Instruction packing example: instructions A ... H, n = nop

Classical encoding: fetching many nops
  fully serial: 8 instruction words, each with one instruction and 7 nops
  mixed serial/parallel: 4 instruction words (A B C / D E / F / G H), the rest nops

VelociTI encoding: one fetch packet A B C D E F G H with one bit per instruction
  fully serial            0 0 0 0 0 0 0 0
  mixed serial/parallel   1 1 0 1 0 0 1 0
  fully parallel          1 1 1 1 1 1 1 0
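
The bit patterns above can be read as "1 = the next instruction executes in the same cycle". Under that assumption, a fetch packet is split into execute packets as in the sketch below; `execute_packets` is a hypothetical helper, not a TI tool.

```python
def execute_packets(instrs, p_bits):
    """Group the instructions of one fetch packet into parallel execute packets."""
    packets, current = [], []
    for ins, p in zip(instrs, p_bits):
        current.append(ins)
        if p == 0:                 # 0 closes the current execute packet
            packets.append(current)
            current = []
    if current:
        packets.append(current)
    return packets


print(execute_packets("ABCDEFGH", [1, 1, 0, 1, 0, 0, 1, 0]))
```

For the mixed pattern this yields A B C, D E, F and G H, so the eight instructions take four cycles without a single nop being stored in program memory.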

Instruction cycle counts for BDTi benchmarks: same table as in the examples section above.

Subword parallelism
(custom operators in TM)

[Figure: 1st and 2nd input operand, each byte3 byte2 byte1 byte0; one op per byte position; output operand byte3 byte2 byte1 byte0]

32 bits = 4 bytes are processed independently
Ex. op = +, -, min, max
=> quadumin
=> quadumax
...

Subword parallelism
(custom operators in TM)

int size = 1000;
byte out[size], in1[size], in2[size];
for (i = 0; i < size; i++)
    out[i] = in1[i] + in2[i];

+ faster execution
- rewrite effort (e.g. different types for in- and outputs)

int size = 1000;
byte out[size], in1[size], in2[size];
for (i = 0; i < size; i += 4) {
    packet4 t1 = packet4_load(&in1[i]);
    packet4 t2 = packet4_load(&in2[i]);
    packet4 t3 = packet4_add(t1, t2);
    packet4_store(&out[i], t3);
}

Typical example : graphics ( 4 * 32 bit floating point)
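
What a packed add has to guarantee is that no carry crosses a byte boundary. The sketch below shows one well-known way to get that with ordinary 32-bit arithmetic; `quad_add` is a made-up name and wrap-around per byte is an assumption (the TM custom operators may saturate instead).

```python
def quad_add(a: int, b: int) -> int:
    """Add four unsigned bytes packed in 32-bit words, byte by byte, modulo 256."""
    low7 = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)    # carries stop at bit 7 of each byte
    top = (a ^ b) & 0x80808080                    # bit 7 of each byte, added without carry out
    return (low7 ^ top) & 0xFFFFFFFF


print(hex(quad_add(0x01020304, 0x10203040)))
print(hex(quad_add(0xFF0000FF, 0x01000001)))
```

In the second call the bytes 0xFF + 0x01 wrap to 0x00 without disturbing their neighbours, which a plain 32-bit addition would not guarantee.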

Subword parallelism
MPEG example
for (i=0; i<64; i++)
{
temp = ((back(i) + forward(i) +1) >> 1) +idct(i);
if (temp > 255)
temp = 255;
else if (temp < 0)
temp = 0;
destination[i] = temp;
}
Remark: simple example without interloop dependencies

for (i=0; i<64; i+=4)


{
temp = ((back(i+0) + forward(i+0) +1) >> 1) +idct(i+0);
if (temp > 255) temp = 255;
else if (temp < 0) temp = 0;
destination[i+0] = temp;
temp = ((back(i+1) + forward(i+1) +1) >> 1) +idct(i+1);
if (temp > 255) temp = 255;
else if (temp < 0) temp = 0;
destination[i+1] = temp;
temp = ((back(i+2) + forward(i+2) +1) >> 1) +idct(i+2);
if (temp > 255) temp = 255;
else if (temp < 0) temp = 0;
destination[i+2] = temp;
temp = ((back(i+3) + forward(i+3) +1) >> 1) +idct(i+3);
if (temp > 255) temp = 255;
else if (temp < 0) temp = 0;
destination[i+3] = temp;
}

step 1, maps onto quadavg:
temp0 = ((back(i+0) + forward(i+0) +1) >> 1) ;
temp1 = ((back(i+1) + forward(i+1) +1) >> 1) ;
temp2 = ((back(i+2) + forward(i+2) +1) >> 1) ;
temp3 = ((back(i+3) + forward(i+3) +1) >> 1) ;

step 2, maps onto dspuquadaddui:
temp0 += idct(i+0);
if (temp0 > 255) temp0 = 255;
else if (temp0 < 0) temp0 = 0;
temp1 += idct(i+1);
if (temp1 > 255) temp1 = 255;
else if (temp1 < 0) temp1 = 0;
temp2 += idct(i+2);
if (temp2 > 255) temp2 = 255;
else if (temp2 < 0) temp2 = 0;
temp3 += idct(i+3);
if (temp3 > 255) temp3 = 255;
else if (temp3 < 0) temp3 = 0;

step 3, one packed store:

destination[i+0] = temp0;
destination[i+1] = temp1;
destination[i+2] = temp2;
destination[i+3] = temp3;
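
The rewrite above is only valid if the four-at-a-time version produces exactly the scalar result. The sketch below checks that in Python; `quadavg` and `quad_add_clip` are stand-ins written for this illustration and only mimic the behaviour described on the slides, not the exact semantics of the TriMedia operations.

```python
def clip255(v: int) -> int:
    return max(0, min(255, v))


def scalar(back, forward, idct):
    """The original loop: rounded average, add the IDCT term, clip to 0..255."""
    return [clip255(((b + f + 1) >> 1) + d) for b, f, d in zip(back, forward, idct)]


def quadavg(a4, b4):
    """Rounded average of four byte pairs (stand-in for quadavg)."""
    return [(a + b + 1) >> 1 for a, b in zip(a4, b4)]


def quad_add_clip(u4, s4):
    """Four unsigned + signed additions, each clipped to 0..255 (stand-in for dspuquadaddui)."""
    return [clip255(u + s) for u, s in zip(u4, s4)]


def packed(back, forward, idct):
    out = []
    for i in range(0, len(back), 4):
        avg = quadavg(back[i:i + 4], forward[i:i + 4])
        out.extend(quad_add_clip(avg, idct[i:i + 4]))
    return out


back = [(7 * i) % 256 for i in range(64)]
forward = [(13 * i + 5) % 256 for i in range(64)]
idct = [(53 * i) % 601 - 300 for i in range(64)]
print(packed(back, forward, idct) == scalar(back, forward, idct))
```

Because the loop has no dependencies between iterations, grouping by four changes only how many operations are issued, not the values that are stored.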

Will embedded CPUs and DSPs converge ?


Converging forces
both include a hardware multiplier
trend in DSPs towards caches and RTK (real-time kernel)
trend in DSPs towards C/C++
common trend towards VLIW
Diverging forces
deeply embedded code (DSP) vs. end-user SW (CPU)
different RTKs
SPOX, Virtuoso (DSP) vs. pSOS, WinCE (top down)
Conclusions VLIW
good balance between hw and sw
between efficiency (ILP) and cost
fundamental problems: code size, interruptability
