Work and critical path

- Work = Σi wi: the sum of the weights of all nodes, i.e. the time
  required to execute the program on one processor; call this T1
- Path weight: sum of the weights of the nodes on a path
- Critical path: the path from START to END that has maximal weight;
  this work must be done sequentially, so you need this much time
  regardless of how many processors you have; call this T∞

[figure: computation DAG with weighted nodes between START and END]

Terminology

- Instantaneous parallelism: IP(t) = maximum number of processors
  that can be kept busy at each point in the execution
- Maximal parallelism: MP = highest instantaneous parallelism
- Average parallelism: AP = T1/T∞
- These are properties of the computation DAG, not of the machine or
  of the work assignment

[figure: instantaneous and average parallelism plotted against time,
with the processor count on the vertical axis]
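For a concrete feel for these definitions, T1, T∞ and AP can be
computed by a single longest-path pass over the DAG in topological
order. A minimal sketch in C; the 6-node DAG and its weights here are
made up for illustration, not taken from the figure:

    #include <stdio.h>

    #define N 6   /* hypothetical DAG: 0=START, 5=END */

    int weight[N] = {0, 1, 1, 1, 1, 0};
    /* edges listed in an order consistent with a topological sort */
    int edge[][2] = {{0,1},{0,2},{0,3},{1,4},{2,4},{3,4},{4,5}};
    int n_edges   = 7;

    int main(void) {
        int t1 = 0, finish[N];
        for (int v = 0; v < N; v++) {
            t1 += weight[v];         /* work T1 = sum of node weights    */
            finish[v] = weight[v];   /* longest path ending at v, so far */
        }
        for (int e = 0; e < n_edges; e++) {
            int u = edge[e][0], v = edge[e][1];
            if (finish[u] + weight[v] > finish[v])
                finish[v] = finish[u] + weight[v];
        }
        printf("T1 = %d, Tinf = %d, AP = %.2f\n",
               t1, finish[N-1], (double)t1 / finish[N-1]);
        return 0;
    }

For this DAG it prints T1 = 4, Tinf = 2, AP = 2.00.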
Amdahl's law

- Suppose a fraction p of a program can be done in parallel
- Suppose you have an unbounded number of parallel processors, and
  they operate infinitely fast
- Amdahl: speed-up will be at most 1/(1-p)
- Follows trivially from the previous result: the serial fraction
  takes at least (1-p)*T1 no matter how many processors you use, so
  speed-up ≤ T1 / ((1-p)*T1) = 1/(1-p)
- Plug in some numbers:
  - p = 90%: speed-up ≤ 10
  - p = 99%: speed-up ≤ 100
- To obtain significant speed-up, most of the program must be
  performed in parallel; serial bottlenecks can really hurt you

Scheduling

- Suppose P < MP. There will be times during the execution when only
  a subset of the ready nodes can be executed.
- The time to execute the DAG can depend on which subset of P nodes
  is chosen for execution.
- To understand this better, it is useful to have a more detailed
  machine model
What if we only had 2 processors?

[figure: space-time diagrams for processors P0 and P1; different
choices of ready nodes give different finish times]

Intuition: nodes along the critical path should be given preference
in scheduling
Optimal schedules

Example: [figure: example DAG with node weights, and a space-time
diagram of its schedule on P0 and P1 over cycles 0-4]

Heuristic: list scheduling

List scheduling algorithm:

    cycle c = 0;
    ready-list = {START};
    inflight-list = {};
    while (|ready-list| + |inflight-list| > 0) {
        // issue ready ops to free processors, moving them to inflight-list;
        // advance c; retire finished ops and add newly ready successors
    }
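Below is a minimal C sketch of list scheduling for unit-latency ops
on P identical processors, using each node's distance to END as its
priority; the DAG encoding and the priority choice are assumptions
for illustration, not code from the lecture. With unit latencies, ops
retire in the cycle they issue, so no separate inflight list is
needed:

    #include <stdio.h>

    #define N 6   /* hypothetical unit-latency DAG: 0=START, 5=END */
    #define P 2   /* number of processors */

    int n_succ[N]  = {3, 1, 1, 1, 1, 0};
    int succ[N][3] = {{1,2,3},{4},{4},{4},{5},{0}};
    int n_pred[N]  = {0, 1, 1, 1, 3, 1};
    int prio[N];   /* longest path from node to END = its urgency */

    void compute_prio(int v) {            /* depth-first longest path */
        prio[v] = 0;
        for (int i = 0; i < n_succ[v]; i++) {
            compute_prio(succ[v][i]);     /* recomputes; fine for a demo */
            if (prio[succ[v][i]] + 1 > prio[v])
                prio[v] = prio[succ[v][i]] + 1;
        }
    }

    int main(void) {
        compute_prio(0);
        int pending[N], ready[N], n_ready = 0;
        for (int v = 0; v < N; v++) pending[v] = n_pred[v];
        ready[n_ready++] = 0;             /* START is ready */
        for (int cycle = 0; n_ready > 0; cycle++) {
            int issued[P], n_issued = 0;
            for (int k = 0; k < P && n_ready > 0; k++) {
                int best = 0;             /* highest-priority ready op */
                for (int i = 1; i < n_ready; i++)
                    if (prio[ready[i]] > prio[ready[best]]) best = i;
                int v = ready[best];
                ready[best] = ready[--n_ready];
                issued[n_issued++] = v;
                printf("cycle %d: P%d runs node %d\n", cycle, k, v);
            }
            for (int k = 0; k < n_issued; k++)    /* ops complete:    */
                for (int i = 0; i < n_succ[issued[k]]; i++)
                    if (--pending[succ[issued[k]][i]] == 0)
                        ready[n_ready++] = succ[issued[k]][i];
        }
        return 0;
    }

For the fork-join DAG encoded here, it prints a 5-cycle schedule on
2 processors.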
Generating dependence graphs

- How do we produce dependence graphs in the first place?
- Two approaches:
  - specify the DAG explicitly (parallel programming)
    - easy to make mistakes
    - data races: two tasks that write to the same location but are
      not ordered by a dependence
  - by compiler analysis of sequential programs
- Let us study the second approach, called dependence analysis

Data dependence

- Basic blocks: straight-line code
- Nodes represent statements
- Edge S1 → S2:
  - flow dependence (read-after-write, RAW): S1 is executed before S2
    in the basic block, and S1 writes to a variable that is read by S2
  - anti-dependence (write-after-read, WAR): S1 is executed before S2
    in the basic block, and S1 reads from a variable that is written by S2
  - output-dependence (write-after-write, WAW): S1 is executed before
    S2 in the basic block, and S1 and S2 write to the same variable
  - input-dependence (read-after-read, RAR; usually not important):
    S1 is executed before S2 in the basic block, and S1 and S2 read
    from the same variable
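A small made-up basic block illustrating all four kinds (the variable
names are arbitrary):

    x = a + b;    /* S1 */
    y = x * 2;    /* S2: flow (RAW) with S1   -- S1 writes x, S2 reads x */
    a = 7;        /* S3: anti (WAR) with S1   -- S1 reads a, S3 writes a */
    x = 0;        /* S4: output (WAW) with S1 -- both write x            */
    z = b + 1;    /* S5: input (RAR) with S1  -- both read b             */

Only the flow dependence communicates a real value; anti- and
output-dependences can be eliminated by renaming, which is exactly
what the smarter loop unrolling below does.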
One transformation: loop unrolling

Original program:

    for i = 1,100
        X(i) = i

Unroll the loop 4 times: not very useful! Each "i = i+1" writes the
loop variable that the neighboring statements read and write, so the
unrolled statements are chained by dependences:

    for i = 1,100,4
        X(i) = i
        i = i+1
        X(i) = i
        i = i+1
        X(i) = i
        i = i+1
        X(i) = i

Smarter loop unrolling

Use a new name for the loop iteration variable in each unrolled
instance:

    for i = 1,100,4
        X(i) = i
        i1 = i+1
        X(i1) = i1
        i2 = i+2
        X(i2) = i2
        i3 = i+3
        X(i3) = i3
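In C, the renamed unrolled body might look like this (a sketch; it
assumes the trip count 100 is divisible by the unroll factor 4):

    int X[101];                       /* 1-based to match the slide   */
    for (int i = 1; i <= 100; i += 4) {
        X[i]   = i;                   /* four independent writes per  */
        X[i+1] = i + 1;               /* iteration: no statement in   */
        X[i+2] = i + 2;               /* the body reads a value that  */
        X[i+3] = i + 3;               /* another body statement wrote */
    }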
Two applications

- Static scheduling: create the space-time diagram at compile-time
  - VLIW code generation
  - multicore scheduling for dense linear algebra
- Dynamic scheduling: create the space-time diagram at runtime
  - instruction-level parallelism

Scheduling instructions for VLIW machines

- Processors ↔ functional units
- Local memories ↔ registers
- Global memory ↔ memory
- Nodes in the DAG are operations (load/store/add/mul/branch/..)

[figure: space-time diagram in which each VLIW instruction is a row
of ops issued in the same cycle]

- List scheduling is useful for scheduling code for pipelined,
  superscalar and VLIW machines
  - used widely in commercial compilers
  - loop unrolling and array dependence analysis are also used widely
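As a concrete illustration (a made-up example, not from the slides):
for unit-latency ops a, b, c, d, where d depends on a and b and c is
independent, list scheduling on a 2-wide VLIW machine could pack the
DAG into two instructions:

    instruction 0:  [ a | b ]     <- a and b are ready at cycle 0
    instruction 1:  [ d | c ]     <- d becomes ready once a and b finish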
Increasing granularity: Block Matrix Algorithms

Original matrix multiplication:

    for I = 1,N
        for J = 1,N
            for K = 1,N
                C(I,J) = C(I,J) + A(I,K)*B(K,J)

Block (tiled) matrix multiplication (IB and JB are parallel loops):

    for IB = 1,N step B
        for JB = 1,N step B
            for KB = 1,N step B
                for I = IB, IB+B-1
                    for J = JB, JB+B-1
                        for K = KB, KB+B-1
                            C(I,J) = C(I,J) + A(I,K)*B(K,J)

For a 2x2 blocking:

    C00 = A00*B00 + A01*B10        C01 = A00*B01 + A01*B11
    C10 = A10*B00 + A11*B10        C11 = A10*B01 + A11*B11

[figure: 2x2 block decomposition of A, B and C]

New problem

- Difficult to get accurate execution times of coarse-grain nodes:
  - conditional inside loop iteration
  - cache misses
  - exceptions
  - O/S processes
- Solution: runtime scheduling
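The blocked loop nest above, written as a C sketch (0-based indexing;
assumes N is a multiple of the block size BS):

    #define N  1024
    #define BS 32                    /* block size */

    /* C += A*B one BSxBS tile at a time; distinct (IB,JB) tiles of C
       touch disjoint data, so those two loops are the parallel loops */
    void block_matmul(double A[N][N], double B[N][N], double C[N][N]) {
        for (int IB = 0; IB < N; IB += BS)
            for (int JB = 0; JB < N; JB += BS)
                for (int KB = 0; KB < N; KB += BS)
                    for (int i = IB; i < IB + BS; i++)
                        for (int j = JB; j < JB + BS; j++)
                            for (int k = KB; k < KB + BS; k++)
                                C[i][j] += A[i][k] * B[k][j];
    }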
DAGuE: Tiled QR (2)

[figure: dataflow graph of tiled QR for a 2x2 processor grid;
machine: 81 nodes, 648 cores]

Summary of multicore scheduling

- Assumptions:
  - DAG of tasks is known
  - each task is heavy-weight, and executing a task on one worker
    exploits adequate locality
  - no assumptions about the runtime of tasks
  - no lock-step execution of processors or synchronous global memory
- Scheduling:
  - keep a work-list of tasks that are ready to execute
  - use heuristic priorities to choose from the ready tasks
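A minimal sketch in C with pthreads of such a runtime scheduler: a
mutex-protected ready list plus per-task counters of unmet
predecessors. The 4-task diamond DAG, the LIFO pick (no priorities),
and the busy-wait are simplifications for illustration; real runtimes
such as DAGuE add heuristic priorities, locality-aware queues, and
proper blocking:

    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS   4
    #define NWORKERS 2

    /* hypothetical diamond DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3 */
    int n_succ[NTASKS]  = {2, 1, 1, 0};
    int succ[NTASKS][2] = {{1,2},{3},{3},{0}};
    int pending[NTASKS] = {0, 1, 1, 2};    /* unmet predecessor counts */

    int ready[NTASKS], n_ready = 0, n_done = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        long id = (long)arg;
        for (;;) {
            int t = -1;
            pthread_mutex_lock(&lock);
            if (n_done == NTASKS) { pthread_mutex_unlock(&lock); return NULL; }
            if (n_ready > 0) t = ready[--n_ready];   /* pop a ready task */
            pthread_mutex_unlock(&lock);
            if (t < 0) continue;                     /* nothing ready: spin */
            printf("worker %ld runs task %d\n", id, t);
            pthread_mutex_lock(&lock);               /* retire t and release */
            n_done++;                                /* its successors       */
            for (int i = 0; i < n_succ[t]; i++)
                if (--pending[succ[t][i]] == 0)
                    ready[n_ready++] = succ[t][i];
            pthread_mutex_unlock(&lock);
        }
    }

    int main(void) {
        pthread_t th[NWORKERS];
        ready[n_ready++] = 0;                 /* task 0 has no predecessors */
        for (long i = 0; i < NWORKERS; i++)
            pthread_create(&th[i], NULL, worker, (void *)i);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(th[i], NULL);
        return 0;
    }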
Summary

- Dependence graphs
  - nodes are computations
  - edges are dependences
- Static dependence graphs: obtained by
  - studying the algorithm
  - analyzing the program
- Limits on speed-ups
  - critical path
  - Amdahl's law
- DAG scheduling
  - heuristic: list scheduling (many variations)
  - static and dynamic scheduling
  - applications: VLIW code generation, multicore scheduling for
    dense linear algebra
- Major limitations:
  - works for topology-driven algorithms with fixed neighborhoods,
    since we know the tasks and dependences before executing the program
  - not very useful for data-driven algorithms, since tasks are
    created dynamically
  - one solution: work-stealing and work-sharing (studied later)