
VLSI Signal Processing

Lecture 2: Unfolding Transformation

ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw) 2-1


Multiple-Data Processing
• Create a program that processes more than one iteration at a time,
e.g. by unrolling the loop J times
• Example: Loop unrolling + software pipelining
[Figure: clock cycle versus operation schedules for operations 1, 2 and 3, shown over clock cycles 1 to 8]



Basic Ideas
• Parallel processing (horizontal axis: time)
    P1   a1 a2 a3 a4
    P2   b1 b2 b3 b4
    P3   c1 c2 c3 c4
    P4   d1 d2 d3 d4

• Pipelined processing (horizontal axis: time)
    P1   a1 b1 c1 d1
    P2   a2 b2 c2 d2
    P3   a3 b3 c3 d3
    P4   a4 b4 c4 d4



Data Dependence
• Parallel processing requires NO data dependence between processors
• Pipelined processing will involve inter-processor communication
[Figure: processors P1 to P4 versus time, for the parallel case and for the pipelined case]



Parallel Processing

• In a J-unfolded system, each delay is J-slow. That is, if the input to a
delay element is x(kJ+m), then the output is x((k-1)J+m) = x(kJ+m-J)



Parallel Processing
• Block processing
– the number of inputs processed in a clock cycle is
referred to as the block size
[Figure: a SISO system x(n) → y(n) and its 3-parallel equivalent. A serial-to-parallel converter turns x(n) into x(3k), x(3k+1), x(3k+2); a MIMO system produces y(3k), y(3k+1), y(3k+2); a parallel-to-serial converter turns these back into y(n)]
– at the k-th clock cycle, three inputs x(3k), x(3k+1), and
x(3k+2) are processed simultaneously to generate y(3k),
y(3k+1), and y(3k+2)



I/O Conversion
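• Serial to parallel converter
[Figure: x(n) enters a chain of two delays D clocked at sampling period T/3; the taps provide x(3k+2), x(3k+1), x(3k)]
• Parallel to serial converter
[Figure: y(3k+2), y(3k+1), y(3k) are loaded at time 3k into a chain of two delays D clocked at sampling period T/3, which shifts out y(n)]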
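The two converters can be modelled in software as plain regrouping of samples. The sketch below is illustrative only: the function names and the list-of-tuples block representation are assumptions, not part of the slides, and it ignores the clocking details of the delay chains.

```python
def serial_to_parallel(x, J=3):
    """Group a serial stream into blocks (x(Jk), x(Jk+1), ..., x(Jk+J-1))."""
    if len(x) % J != 0:
        raise ValueError("stream length must be a multiple of the block size J")
    return [tuple(x[J * k:J * k + J]) for k in range(len(x) // J)]


def parallel_to_serial(blocks):
    """Flatten blocks (y(Jk), ..., y(Jk+J-1)) back into the serial stream y(n)."""
    return [sample for block in blocks for sample in block]


blocks = serial_to_parallel([10, 11, 12, 13, 14, 15], J=3)
# blocks[k] holds (x(3k), x(3k+1), x(3k+2))
stream = parallel_to_serial(blocks)
```

Block k is available once per clock period T, while the serial side runs at the sampling period T/3.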

General approach for block processing



Mathematical Formulation
• e.g. y(n) = ay(n-9) + x(n)
• 2-parallel
y(2k) = ay(2k-9) + x(2k)
y(2k+1) = ay(2k-8) + x(2k+1)
• In the 2-parallel SDFG, one active clock edge processes
two samples
y(2k) = ay(2(k-5)+1) + x(2k)
y(2k+1) = ay(2(k-4)+0) + x(2k+1)
• A dependency that spans fewer sample delays than the level of
parallelism can be implemented with internal routing
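The rewritten equations can be checked numerically. The following is a minimal sketch, assuming zero initial conditions (y(n) = 0 for n < 0) and hypothetical function names; it compares the serial recursion with the 2-parallel form indexed by the block index k.

```python
def serial_iir(x, a, d=9):
    """Reference recursion y(n) = a*y(n-d) + x(n), zero initial state."""
    y = []
    for n, xn in enumerate(x):
        prev = y[n - d] if n >= d else 0
        y.append(a * prev + xn)
    return y


def two_parallel_iir(x, a):
    """2-parallel form: iteration k produces y(2k) and y(2k+1).

    y(2k)   = a*y(2(k-5)+1) + x(2k)
    y(2k+1) = a*y(2(k-4)+0) + x(2k+1)
    """
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    even, odd = [], []  # even[k] = y(2k), odd[k] = y(2k+1)
    for k in range(len(x) // 2):
        y_odd_prev = odd[k - 5] if k >= 5 else 0    # y(2(k-5)+1) = y(2k-9)
        y_even_prev = even[k - 4] if k >= 4 else 0  # y(2(k-4))   = y(2k-8)
        even.append(a * y_odd_prev + x[2 * k])
        odd.append(a * y_even_prev + x[2 * k + 1])
    # Interleave the two output streams back into y(n).
    return [v for pair in zip(even, odd) for v in pair]


x = list(range(1, 41))
same = serial_iir(x, a=2) == two_parallel_iir(x, a=2)
```

The even output depends on an odd output from 5 iterations earlier, and the odd output on an even output from 4 iterations earlier, which is why the two feedback paths carry different numbers of delays.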



Unfolding the DFG
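[Figure: a DFG and its J-unfolded version, labelled with clock periods T = Ts (original) and T = J·Ts (unfolded)]
Not trivial, even for a simple graph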



Block Processing for FIR Filter
• One form of vectorized parallel processing of DSP
algorithms. (Not the parallel processing in most general
sense)
• Block vector: [x(3k) x(3k+1) x(3k+2)]
• Clock cycle: can be 3 times longer
• Original (FIR filter):
\[ y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2) \]
• Rewrite 3 equations at a time:
\[
\begin{bmatrix} y(3k) \\ y(3k+1) \\ y(3k+2) \end{bmatrix}
= a \begin{bmatrix} x(3k) \\ x(3k+1) \\ x(3k+2) \end{bmatrix}
+ b \begin{bmatrix} x(3k-1) \\ x(3k) \\ x(3k+1) \end{bmatrix}
+ c \begin{bmatrix} x(3k-2) \\ x(3k-1) \\ x(3k) \end{bmatrix}
\]

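A minimal sketch of the 3-sample block formulation of the FIR filter y(n) = a·x(n) + b·x(n-1) + c·x(n-2), assuming x(n) = 0 for n < 0. The function names are hypothetical; the block version computes the three rows of the vector equation in one iteration k and is compared against the sample-by-sample filter.

```python
def serial_fir(x, a, b, c):
    """Sample-by-sample FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    def xs(n):
        return x[n] if n >= 0 else 0  # zero input before n = 0
    return [a * xs(n) + b * xs(n - 1) + c * xs(n - 2) for n in range(len(x))]


def block_fir3(x, a, b, c):
    """Block FIR with block size 3: one iteration k yields y(3k), y(3k+1), y(3k+2)."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of 3")

    def xs(n):
        return x[n] if n >= 0 else 0

    y = []
    for k in range(len(x) // 3):
        n = 3 * k
        y0 = a * xs(n) + b * xs(n - 1) + c * xs(n - 2)      # y(3k)
        y1 = a * xs(n + 1) + b * xs(n) + c * xs(n - 1)      # y(3k+1)
        y2 = a * xs(n + 2) + b * xs(n + 1) + c * xs(n)      # y(3k+2)
        y.extend([y0, y1, y2])
    return y


impulse_response = block_fir3([1, 0, 0], a=1, b=2, c=3)  # [1, 2, 3]
```

The three outputs of a block have no dependence on each other, so in hardware they can be computed concurrently within one (3 times longer) clock cycle.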


Block Processing



Block Processing for IIR Digital Filter
• Original formulation:
\[ y(n) = a\,y(n-2) + x(n) \]
• Rewrite (n counts sample periods):
\[ y(2k) = a\,y(2k-2) + x(2k) \]
\[ y(2k+1) = a\,y(2k-1) + x(2k+1) \]
• Vector formulation (k counts processor clock periods):
\[
\mathbf{x}(k) = \begin{bmatrix} x(2k) \\ x(2k+1) \end{bmatrix}, \quad
\mathbf{y}(k) = \begin{bmatrix} y(2k) \\ y(2k+1) \end{bmatrix}
\]
\[ \mathbf{y}(k) = a\,\mathbf{y}(k-1) + \mathbf{x}(k) \]
Tsample ≠ Tclk
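As a sanity check on the vector formulation, the sketch below runs y(k) = a·y(k-1) + x(k) on 2-element blocks and compares it with the scalar recursion y(n) = a·y(n-2) + x(n). Zero initial state and the function names are assumptions made for illustration.

```python
def serial_iir2(x, a):
    """Scalar recursion y(n) = a*y(n-2) + x(n), zero initial state."""
    y = []
    for n, xn in enumerate(x):
        y.append(a * (y[n - 2] if n >= 2 else 0) + xn)
    return y


def block_iir2(x, a):
    """Vector recursion y(k) = a*y(k-1) + x(k) with y(k) = (y(2k), y(2k+1))."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    y_prev = (0, 0)  # y(k-1), zero before the first block
    out = []
    for k in range(len(x) // 2):
        xk = (x[2 * k], x[2 * k + 1])
        yk = (a * y_prev[0] + xk[0], a * y_prev[1] + xk[1])
        out.extend(yk)
        y_prev = yk
    return out


response = block_iir2([1, 0, 0, 0], a=2)  # [1, 0, 2, 0]
```

Each block step advances two samples, so one step of the block recursion corresponds to one clock period of length 2·Tsample.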

Block IIR Filter
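[Figure: block IIR filter. An S/P converter splits x(n) into x(2k) and x(2k+1). Each branch has an adder producing y(2k) or y(2k+1), with feedback through a delay D supplying y(2(k-1)) or y(2(k-1)+1). A P/S converter merges the two outputs into y(n). The clock period is not equal to the sampling period.]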



Timing Comparison
[Figure: timing comparison of three schedules. Sequential: one MAC unit consumes x(1) to x(4) and produces y(1) to y(4) in steps 1 to 4. Pipelining: separate Add and Mul units, each busy in steps 1 to 8. Block processing: two units, one handling the even samples x(2), x(4), x(6), x(8) and one handling the odd samples x(1), x(3), x(5), x(7).]

Definitions
• Unfolding is the process of transforming a loop so that
several iterations are unrolled into a single
iteration.
• Also known as (a.k.a.)
– Loop unrolling (in compilers for parallel programs)
– Block processing
• Applications
– Reducing the sampling period to achieve the iteration bound
T∞ (desired throughput rate).
– Parallel (block processing) to execute several iterations
concurrently.
– Digit-serial or bit-serial processing



Unfolding the DFG
• y(n)=ay(n-9)+x(n)

• Rewrite the algorithm formulation:

y(2k)=ay(2k-9)+x(2k)
y(2k+1)=ay(2k-8)+x(2k+1)

y(2k)=ay(2(k-5)+1)+x(2k)
y(2k+1)=ay(2(k-4))+x(2k+1)

• After J-fold unfolding, the clock period is T = J Ts,
where Ts is the data sampling period.



Timing Diagram
[Figure: timing diagram. With T = Ts, the outputs y(0), y(1), ..., y(13) appear one per clock period and each dependence spans 9T. With T = 2Ts, the even outputs y(0), y(2), ..., y(12) and the odd outputs y(1), y(3), ..., y(13) appear on two parallel rows, and the dependences span 4T and 5T.]

• The timing diagram above is obtained assuming that the sampling
period Ts remains unchanged. Thus, the clock period T is
increased J-fold.
• Since 9/2 is not an integer, the outputs (y(0), y(1)) will be needed
by two different future iterations, 4T and 5T later.
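The 4T and 5T figures follow from the unfolding rule for a single edge: copy i of the source feeds copy (i+w)%J of the destination through floor((i+w)/J) delays. A small sketch with a hypothetical helper name:

```python
def unfolded_edge_delays(w, J):
    """Map source copy i -> (destination copy, delays) for an edge with w delays."""
    return {i: ((i + w) % J, (i + w) // J) for i in range(J)}


# Feedback edge of y(n) = a*y(n-9) + x(n), unfolded with J = 2.
print(unfolded_edge_delays(9, 2))  # {0: (1, 4), 1: (0, 5)}
```

Copy 0 (the even outputs) is consumed by copy 1 four clock periods later, and copy 1 (the odd outputs) is consumed by copy 0 five clock periods later.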



Another DFG Unfolding Example
J=2

i   w   (i+w)%J   floor((i+w)/J)
0   0   0         0
0   2   0         1
0   3   1         1
1   0   1         0
1   2   1         1
1   3   0         2

[Figure: original DFG with nodes Q, R, S, T and edges carrying 2D and 3D, labelled T=3, next to the node copies Q0, R0, S0, T0 and Q1, R1, S1, T1]
Step 1. Make J copies of each node



Another DFG Unfolding Example
[Figure: the same DFG and table for J=2, with the zero-delay edges drawn between the node copies]
Step 2. Add all edges with 0 delay on them.



Another DFG Unfolding Example
[Figure: the same DFG and table for J=2, with the delayed edges (D and 2D) added; the original DFG is labelled T=3 and the completed 2-unfolded DFG is labelled T=6]
Step 3. Use the table of (i+w)%J and floor((i+w)/J) values to
figure out the edges with delays.

Unfolding Transformation
• For each node U in the original DFG, draw J nodes U_0, U_1, …, U_{J-1}
• For each edge U → V with w delays in the original DFG, draw the J
edges U_i → V_{(i+w)%J} with floor((i+w)/J) delays for i = 0, 1, …, J-1

Example

• For w < J, unfolding an edge with w delays in the original DFG
produces J-w edges with no delays and w edges with 1 delay in the
J-unfolded DFG
• Unfolding preserves precedence constraints of a DSP algorithm
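The two rules above translate almost directly into code. This is a minimal sketch, not a full DFG tool: the `Edge` tuple, the `unfold` function and the node-naming scheme (name followed by copy index) are assumptions made for illustration.

```python
from collections import namedtuple

# An edge src -> dst carrying `delays` registers.
Edge = namedtuple("Edge", ["src", "dst", "delays"])


def unfold(edges, J):
    """Return the edges of the J-unfolded DFG.

    Each edge U -> V with w delays becomes J edges
    U_i -> V_((i+w) % J) with (i+w) // J delays, i = 0..J-1.
    """
    unfolded = []
    for e in edges:
        for i in range(J):
            unfolded.append(
                Edge(f"{e.src}{i}", f"{e.dst}{(i + e.delays) % J}", (i + e.delays) // J)
            )
    return unfolded


# Hypothetical single edge U -> V with 37 delays, unfolded by J = 4.
result = unfold([Edge("U", "V", 37)], J=4)
for e in result:
    print(e.src, "->", e.dst, f"{e.delays}D")
```

For this edge the sketch yields U0 → V1, U1 → V2 and U2 → V3 with 9 delays each and U3 → V0 with 10 delays, so the total of 37 delays is unchanged.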



Precedence Preservation



Delay Preservation
• Unfolding preserves the number of delays in a DFG
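• Let w = mJ + n, where m, n ∈ ℕ ∪ {0} and 0 ≤ n ≤ J-1. Then
\[
\left\lfloor \frac{w}{J} \right\rfloor = m, \qquad
\left\lfloor \frac{w+J-n-1}{J} \right\rfloor
= \left\lfloor \frac{mJ+n+J-n-1}{J} \right\rfloor
= \left\lfloor \frac{mJ+(J-1)}{J} \right\rfloor = m
\]
\[
\left\lfloor \frac{w+J-n}{J} \right\rfloor
= \left\lfloor \frac{mJ+n+J-n}{J} \right\rfloor
= \left\lfloor \frac{(m+1)J}{J} \right\rfloor = m+1, \qquad
\left\lfloor \frac{w+J-1}{J} \right\rfloor = m+1
\]
\[
\left\lfloor \frac{w}{J} \right\rfloor + \cdots
+ \left\lfloor \frac{w+J-n-1}{J} \right\rfloor
+ \left\lfloor \frac{w+J-n}{J} \right\rfloor + \cdots
+ \left\lfloor \frac{w+J-1}{J} \right\rfloor
= m(J-n) + (m+1)n = mJ + n = w
\]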
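The identity can also be checked by brute force. The sketch below (hypothetical function names) verifies, for a range of w and J, that J-n of the unfolded edges carry m delays, n carry m+1 delays, and the total is w.

```python
def unfolded_delays(w, J):
    """Delays on the J unfolded copies of an edge with w delays."""
    return [(i + w) // J for i in range(J)]


def check_delay_preservation(max_w=40, max_J=8):
    """Check the counting argument w = m*(J-n) + (m+1)*n for all small cases."""
    for J in range(1, max_J + 1):
        for w in range(max_w + 1):
            m, n = divmod(w, J)          # w = m*J + n with 0 <= n <= J-1
            d = unfolded_delays(w, J)
            if d.count(m) != J - n or d.count(m + 1) != n or sum(d) != w:
                return False
    return True


ok = check_delay_preservation()
```

Here `divmod(w, J)` supplies exactly the m and n of the decomposition w = mJ + n used in the derivation.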



Example
• Unfold the following DFG using unfolding factors 2 and 5

[Figure: original DFG with nodes A, B, C, D, E and edge delays including D, 2D, 3D and 7D, together with its 2-unfolded DFG (node copies A0 to E0 and A1 to E1) and its 5-unfolded DFG (node copies A0 to E0 through A4 to E4)]



Properties of Unfolding
• Unfolding preserves the number of registers (delays) in a DFG
• A loop with w delays in a DFG that has been unfolded J times leads to
– gcd(w, J) loops in the unfolded DFG, with each of these loops
containing w/gcd(w, J) delays and J/gcd(w, J) copies of each node
that appears in the original loop.
• Unfolding a DFG with iteration bound T results in a J-unfolded DFG
with iteration bound JT.
• A path with w (< J) delays in a DFG leads to J-w paths with no
delays and w paths with 1 delay each in the J-unfolded DFG.
• Any clock period that can be achieved by retiming a J-unfolded DFG
can be achieved by retiming the original DFG and then J-unfolding it.

When a Loop is Unfolded
• Consider a loop ℓ with w delays in a DFG, passing through a node A
• Traveling the loop A ~> A p times is also a loop, with pw delays
• In the J-unfolded DFG, consider the path A_i ~> A_{(i+pw)%J}. It is a
loop if i = (i+pw)%J. This implies that J | pw
• The smallest such p is J/gcd(J, w). That is, in the J-unfolded DFG, one
loop travels the original loop A ~> A J/gcd(J, w) times.
• Recall that there are J copies of node A in total. Hence,
there are J/(J/gcd(J,w)) = gcd(J, w) loops, and each loop
contains w/gcd(J, w) delays.
• The iteration bound of the J-unfolded DFG is then
\[
T' = \max_{\ell} \left\{
\frac{\dfrac{J}{\gcd(J, w_\ell)}\, t_\ell}{\dfrac{w_\ell}{\gcd(J, w_\ell)}}
\right\}
= \max_{\ell} \left\{ \frac{J\, t_\ell}{w_\ell} \right\} = JT
\]
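The counting argument for unfolded loops can be traced in code. This is an illustrative sketch with hypothetical function names: starting from each copy A_i, it follows one trip around the original loop to A_((i+w)%J) until it returns, and then checks the loop count and the delays per loop against gcd(J, w).

```python
from math import gcd


def unfolded_loops(w, J):
    """Partition the copies A_0..A_{J-1} into the loops of the J-unfolded DFG.

    One trip around the original loop (w delays) leads from A_i to A_((i+w) % J).
    """
    seen, loops = set(), []
    for start in range(J):
        if start in seen:
            continue
        cycle, i = [], start
        while i not in seen:
            seen.add(i)
            cycle.append(i)
            i = (i + w) % J
        loops.append(cycle)
    return loops


def loop_delays(cycle, w, J):
    """Total delays on one unfolded loop: floor((i+w)/J) per trip starting at A_i."""
    return sum((i + w) // J for i in cycle)


loops = unfolded_loops(6, 4)
print(loops)                                   # [[0, 2], [1, 3]]
print([loop_delays(c, 6, 4) for c in loops])   # [3, 3]
```

With w = 6 and J = 4, gcd(J, w) = 2, so there are 2 loops, each visiting J/gcd = 2 copies of the node and containing w/gcd = 3 delays.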
When a Path is Unfolded
• If w<J, then a path containing w delays within a DFG will
lead to (J-w) paths with no delays and w paths with 1 delay
in the J-unfolded DFG.
• If w≥J, then the path leads to J paths with one or more
delays in the J-unfolded DFG. This implies that these paths
are not critical.
• Assume that the critical path of the J-unfolded DFG is c.
If D(U,V) ≥ c, then W_r(U,V) = W(U,V) + r(V) - r(U) ≥ J
• Any feasible clock cycle period that can be obtained by
retiming the J-unfolded DFG can be achieved by retiming
the original DFG directly, followed by J-unfolding.



When a Path is Unfolded
• Suppose r' is a legal retiming for the J-unfolded DFG, G_J,
which leads to critical path c.
• Let \( r(U) = \sum_{i=0}^{J-1} r'(U_i) \).
– r is a feasible retiming for the original DFG, G.
– The retiming leads to a critical path c
Consider an edge U → V with w delays in G.
Since r' is a legal retiming for G_J and leads to a critical path c, then for 0 ≤ i ≤ J-1:
(1) feasibility constraint
\[ r'(U_i) - r'(V_{(i+w)\%J}) \le \left\lfloor \frac{i+w}{J} \right\rfloor \]
(2) critical path constraint
\[ r'(U_i) - r'(V_{(i+w)\%J}) \le \left\lfloor \frac{i+w}{J} \right\rfloor - 1, \quad \text{if } D(U_i \to V_{(i+w)\%J}) \ge c \]
Summing over i = 0, …, J-1 gives
(1) \( r(U) - r(V) \le w \)
(2) \( r(U) - r(V) \le W(U,V) - J \)

Sample Period Reduction
• Case 1: A node in the DFG has a
computation time greater than T∞

• Case 2: The iteration bound is not an
integer

• Case 3: The longest node computation time is
larger than the iteration bound T∞, and T∞
is not an integer

Case 1
• Critical path dominates, since a node computation
time is more than iteration bound

Retiming cannot be used to reduce sample period



Sample Period Reduction
• Rule of thumb: ⌈t_U / T∞⌉-unfolding should be used

[Figure labels: T∞=6, Tcritical=6]



Case 2
• The iteration period cannot achieve the iteration
bound



Sample Period Reduction



Case 3



Parallel Processing
• Parallel processing can be performed
by unfolding



Bit-Level Parallel Processing




Bit-Serial Adder



Unfolding of Switches



Example



Example



Example



Example



Switches with Delays



Switch with Delays



If Wordlength is not a Multiple of J

