
Spring 2010

Introduction to
Code Optimization

Instruction Scheduling

Outline

Modern architectures
Introduction to instruction scheduling
List scheduling
Resource constraints
Scheduling across basic blocks
Trace scheduling

Saman Amarasinghe 2 6.035 MIT Fall 1998


Simple Machine Model

Instructions are executed in sequence
  Fetch, decode, execute, store results
  One instruction at a time
For branch instructions, start fetching from a different location if needed
  Check branch condition
  Next instruction may come from a new location given by the branch instruction

Saman Amarasinghe 3 6.035 MIT Fall 1998


Simple Execution Model

5-stage pipeline:
  fetch → decode → execute → memory → writeback

Fetch: get the next instruction
Decode: figure out what that instruction is
Execute: perform the ALU operation (address calculation in a memory op)
Memory: do the memory access in a memory op
Write Back: write the results back
Saman Amarasinghe 4 6.035 MIT Fall 1998
Simple Execution Model

time →
  Inst 1:  IF  DE  EXE MEM WB
  Inst 2:      IF  DE  EXE MEM WB
  Inst 3:          IF  DE  EXE MEM WB
  Inst 4:              IF  DE  EXE MEM WB
  Inst 5:                  IF  DE  EXE MEM WB
Saman Amarasinghe 5 6.035 MIT Fall 1998

Outline

Modern architectures
Introduction to instruction scheduling
List scheduling
Resource constraints
Scheduling across basic blocks
Trace scheduling

Saman Amarasinghe 6 6.035 MIT Fall 1998


From a Simple Machine Model to a Real Machine Model

Many pipeline stages
  Pentium              5
  Pentium Pro         10
  Pentium IV (130nm)  20
  Pentium IV (90nm)   31
  Core 2 Duo          14
Different instructions take different amounts of time to execute
Hardware stalls the pipeline if an instruction uses a result that is not ready
Saman Amarasinghe 7 6.035 MIT Fall 1998

Real Machine Model cont.


Most modern processors have multiple cores
  We will deal with multicores next week
Each core has multiple execution units (superscalar)
  If the instruction sequence is correct, multiple operations will happen in the same cycle
  Even more important to have the right instruction sequence

Saman Amarasinghe 8 6.035 MIT Fall 1998


Instruction Scheduling

Reorder instructions so that pipeline stalls are minimized

Saman Amarasinghe 9 6.035 MIT Fall 1998


Constraints On Scheduling

Data dependencies
Control dependencies
Resource constraints

Saman Amarasinghe 10 6.035 MIT Fall 1998


Data Dependency between Instructions

If two instructions access the same variable, they can be dependent
Kinds of dependencies
  True:   write → read
  Anti:   read → write
  Output: write → write
What to do if two instructions are dependent?
  The order of execution cannot be reversed
  Reduces the possibilities for scheduling
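As a concrete illustration (an editor's sketch, not from the original slides), the three kinds can be detected from the registers each instruction reads and writes; the set-based encoding below is hypothetical.

    # Sketch: classify dependences between two instructions i -> j (i comes first),
    # given the register sets each one reads and writes.
    def dependences(i_writes, i_reads, j_writes, j_reads):
        kinds = []
        if i_writes & j_reads:
            kinds.append("true")    # write -> read  (read-after-write)
        if i_reads & j_writes:
            kinds.append("anti")    # read -> write  (write-after-read)
        if i_writes & j_writes:
            kinds.append("output")  # write -> write (write-after-write)
        return kinds

    # 1: r2 = *(r1 + 4)   writes {r2}, reads {r1}
    # 2: r3 = r2 + 1      writes {r3}, reads {r2}
    print(dependences({"r2"}, {"r1"}, {"r3"}, {"r2"}))   # ['true']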

Saman Amarasinghe 11 6.035 MIT Fall 1998


Computing Dependencies

For basic blocks, compute dependencies by walking through the instructions
Identifying register dependencies is simple
  is it the same register?
For memory accesses
  simple: base + offset1 ?= base + offset2
  data dependence analysis: a[2i] ?= a[2i+1]
  interprocedural analysis: global ?= parameter
  pointer alias analysis: p1->foo ?= p2->foo
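A minimal sketch of the register-dependence part of that walk (an editor's Python example, not the course's code): track the last writer of each register and the reads since that write.

    # Sketch: build register dependence edges for a basic block.
    # Each instruction is a (writes, reads) pair of register sets; indices are 0-based.
    def build_dependence_edges(block):
        last_writer = {}            # reg -> index of the last instruction writing it
        readers_since_write = {}    # reg -> indices that read it since that write
        edges = set()               # (src, dst, kind)
        for j, (writes, reads) in enumerate(block):
            for r in reads:
                if r in last_writer:
                    edges.add((last_writer[r], j, "true"))
                readers_since_write.setdefault(r, []).append(j)
            for r in writes:
                for i in readers_since_write.get(r, []):
                    if i != j:
                        edges.add((i, j, "anti"))
                if r in last_writer:
                    edges.add((last_writer[r], j, "output"))
                last_writer[r] = j
                readers_since_write[r] = []
        return edges

    # Block from the next slide: 1: r2=*(r1+4)  2: r3=*(r1+8)  3: r4=r2+r3  4: r5=r2-1
    block = [({"r2"}, {"r1"}), ({"r3"}, {"r1"}), ({"r4"}, {"r2", "r3"}), ({"r5"}, {"r2"})]
    print(sorted(build_dependence_edges(block)))
    # -> [(0, 2, 'true'), (0, 3, 'true'), (1, 2, 'true')]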

Saman Amarasinghe 12 6.035 MIT Fall 1998


Representing Dependencies

Using a dependence DAG, one per basic block
Nodes are instructions, edges represent dependencies
Saman Amarasinghe 13 6.035 MIT Fall 1998


Representing Dependencies

Using a dependence DAG, one per basic block
Nodes are instructions, edges represent dependencies

  1: r2 = *(r1 + 4)
  2: r3 = *(r1 + 8)
  3: r4 = r2 + r3
  4: r5 = r2 - 1

  Edges (latency): 1→3 (2), 1→4 (2), 2→3 (2)

Each edge is labeled with a latency:
  latency(i, j) = the delay required between the initiation times of i and j,
                  minus the execution time required by i
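One way to hold this DAG in a scheduler (an editor's sketch; the edges and the 2-cycle load latency are as read off the figure above):

    # Dependence DAG for the block above, as a successor map:
    # node -> list of (successor, latency).
    dag = {
        1: [(3, 2), (4, 2)],   # r2 defined by 1 is used by 3 and 4
        2: [(3, 2)],           # r3 defined by 2 is used by 3
        3: [],
        4: [],
    }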
Saman Amarasinghe 14 6.035 MIT Fall 1998

Example

  1: r2 = *(r1 + 4)
  2: r3 = *(r2 + 4)
  3: r4 = r2 + r3
  4: r5 = r2 - 1

  Edges (latency): 1→2 (2), 1→3 (2), 1→4 (2), 2→3 (2)
Saman Amarasinghe 15 6.035 MIT Fall 1998


Another Example

  1: r2 = *(r1 + 4)
  2: *(r1 + 4) = r3
  3: r3 = r2 + r3
  4: r5 = r2 - 1

  Edges (latency): 1→2 anti (1), 1→3 true (2), 1→4 true (2), 2→3 anti (1)
Saman Amarasinghe 16 6.035 MIT Fall 1998


Control Dependencies and Resource Constraints

For now, let's only worry about basic blocks
For now, let's look at simple pipelines

Saman Amarasinghe 17 6.035 MIT Fall 1998


Example
1: lea var_a, %rax
2: add $4, %rax
3: inc %r11
4: mov 4(%rsp), %r10
5: add %r10, 8(%rsp)
6: and 16(%rsp), %rbx
7: imul %rax, %rbx

Saman Amarasinghe 18 6.035 MIT Fall 1998


Example

Results in:
1: lea var_a, %rax 1 cycle
2: add $4, %rax 1 cycle
3: inc %r11 1 cycle
4: mov 4(%rsp), %r10 3 cycles
5: add %r10, 8(%rsp)
6: and 16(%rsp), %rbx 4 cycles
7: imul %rax, %rbx 3 cycles

Saman Amarasinghe 19 6.035 MIT Fall 1998


Example

Results in:
1: lea var_a, %rax 1 cycle
2: add $4, %rax 1 cycle
3: inc %r11 1 cycle
4: mov 4(%rsp), %r10 3 cycles
5: add %r10, 8(%rsp)
6: and 16(%rsp), %rbx 4 cycles
7: imul %rax, %rbx 3 cycles

Issue slots: 1 2 3 4 st st 5 6 st st st 7   (st = stall)
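A rough way to reproduce that stall pattern (an editor's sketch, not the course's code): model an in-order, single-issue pipeline where an instruction issues one cycle after its predecessor, or later if an operand is not ready. The dependence latencies below are assumptions read off the dependence DAG shown a few slides later (1→2: 1, 2→7: 1, 4→5: 3, 6→7: 4).

    # Sketch: estimate issue cycles for an in-order pipeline that stalls on operands.
    deps = {2: [(1, 1)], 5: [(4, 3)], 7: [(2, 1), (6, 4)]}   # inst -> [(pred, latency)]

    def issue_cycles(order, deps):
        issue, cycle = {}, 0
        for n in order:
            ready = max([issue[p] + lat for p, lat in deps.get(n, [])], default=0)
            cycle = max(cycle + 1, ready)
            issue[n] = cycle
        return issue

    print(issue_cycles([1, 2, 3, 4, 5, 6, 7], deps))
    # -> {1: 1, 2: 2, 3: 3, 4: 4, 5: 7, 6: 8, 7: 12}
    # i.e. the 12 issue slots of "1 2 3 4 st st 5 6 st st st 7" above.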
Saman Amarasinghe 20 6.035 MIT Fall 1998

Outline

Modern architectures
Introduction to instruction scheduling
List scheduling
Resource constraints
Scheduling across basic blocks
Trace scheduling

Saman Amarasinghe 21 6.035 MIT Fall 1998


List Scheduling Algorithm

Idea
  Do a topological sort of the dependence DAG
  Consider when an instruction can be scheduled without causing a stall
  Schedule the instruction if it causes no stall and all its predecessors are already scheduled
Optimal list scheduling is NP-complete
  Use heuristics when necessary

Saman Amarasinghe 22 6.035 MIT Fall 1998


List Scheduling Algorithm

Create a dependence DAG of a basic block
Topological sort:
  READY = nodes with no predecessors
  Loop until READY is empty
    Schedule each node in READY when it causes no stall
    Update READY
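A compact sketch of this loop in Python (an editor's example, not the course's code): keep the READY set, pick the highest-priority ready node, and track the earliest stall-free issue cycle of each node.

    # Sketch: list scheduling over a dependence DAG.
    # dag: node -> [(successor, latency)]; priority: node -> number (higher is better).
    def list_schedule(dag, priority):
        preds = {n: 0 for n in dag}
        for n, succs in dag.items():
            for s, _ in succs:
                preds[s] += 1
        earliest = {n: 1 for n in dag}          # earliest stall-free issue cycle
        ready = {n for n, c in preds.items() if c == 0}
        schedule, cycle = [], 0
        while ready:
            n = max(ready, key=priority)        # heuristic choice (next slides)
            ready.remove(n)
            cycle = max(cycle + 1, earliest[n]) # may still stall if nothing better is ready
            schedule.append((cycle, n))
            for s, lat in dag[n]:
                earliest[s] = max(earliest[s], cycle + lat)
                preds[s] -= 1
                if preds[s] == 0:
                    ready.add(s)
        return schedule

    # DAG from the earlier example (edge label = latency):
    dag = {1: [(3, 2), (4, 2)], 2: [(3, 2)], 3: [], 4: []}
    print(list_schedule(dag, priority=lambda n: -n))   # naive tie-break by node number

With the naive priority the schedule still stalls before node 3; the heuristics on the following slides are what make the choice from READY a good one.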

Saman Amarasinghe 23 6.035 MIT Fall 1998


Heuristics for selection

Heuristics for selecting from the READY list:
  pick the node with the longest path to a leaf in the dependence graph
  pick a node with the most immediate successors
  pick a node that can go to a less busy pipeline (in a superscalar)

Saman Amarasinghe 24 6.035 MIT Fall 1998


Heuristics for selection

Pick the node with the longest path to a leaf in the dependence graph

Algorithm (for node x):
  if x has no successors: dx = 0
  otherwise: dx = MAX(dy + cxy) over all successors y of x
  compute in reverse breadth-first visitation order
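In code (an editor's sketch under the same definition), d can be computed bottom-up with memoization:

    # Sketch: longest latency-weighted path to a leaf, d[x] = max(d[y] + c[x][y]).
    # dag: node -> [(successor, latency)].
    from functools import lru_cache

    def longest_path_to_leaf(dag):
        @lru_cache(maxsize=None)
        def d(x):
            succs = dag[x]
            return 0 if not succs else max(d(y) + c for y, c in succs)
        return {x: d(x) for x in dag}

    # DAG of the 9-instruction example on the upcoming slides (edges as read off the figure):
    dag = {1: [(2, 1)], 2: [(7, 1)], 3: [], 4: [(5, 3)], 5: [],
           6: [(7, 4)], 7: [(8, 3), (9, 1)], 8: [], 9: []}
    print(longest_path_to_leaf(dag))
    # -> {1: 5, 2: 4, 3: 0, 4: 3, 5: 0, 6: 7, 7: 3, 8: 0, 9: 0}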

Saman Amarasinghe 25 6.035 MIT Fall 1998


Heuristics for selection

Pick a node with the most immediate successors

Algorithm (for node x):
  fx = number of successors of x

Saman Amarasinghe 26 6.035 MIT Fall 1998


Example

Results in:
1: lea var_a, %rax 1 cycle
2: add $4, %rax 1 cycle
3: inc %r11 1 cycle
4: mov 4(%rsp), %r10 3 cycles
5: add %r10, 8(%rsp)
6: and 16(%rsp), %rbx 4 cycles
7: imul %rax, %rbx 3 cycles
8: mov %rbx, 16(%rsp)
9: lea var_b, %rax

Saman Amarasinghe 27 6.035 MIT Fall 1998


Example

  1: lea var_a, %rax
  2: add $4, %rax
  3: inc %r11
  4: mov 4(%rsp), %r10
  5: add %r10, 8(%rsp)
  6: and 16(%rsp), %rbx
  7: imul %rax, %rbx
  8: mov %rbx, 16(%rsp)
  9: lea var_b, %rax

  Dependence DAG (edge = latency):
    1→2 (1), 2→7 (1), 4→5 (3), 6→7 (4), 7→8 (3), 7→9 (1); 3 has no dependences

Saman Amarasinghe 28 6.035 MIT Fall 1998


Example

Same DAG, annotated with the two heuristic values for each node:

  node:  1   2   3   4   5   6   7   8   9
  d:     5   4   0   3   0   7   3   0   0
  f:     1   1   0   1   0   1   2   0   0
Saman Amarasinghe 29 6.035 MIT Fall 1998


Example  (d and f values as above)

READY = { }
Scheduled so far: -
Saman Amarasinghe 30 6.035 MIT Fall 1998


Example  (d and f values as above)

READY = { 1, 3, 4, 6 }
Scheduled so far: -
Saman Amarasinghe 31 6.035 MIT Fall 1998


Example  (d and f values as above)

READY = { 1, 4, 3 }
Scheduled so far: 6
Saman Amarasinghe 32 6.035 MIT Fall 1998
Example  (d and f values as above)

READY = { 2, 4, 3 }
Scheduled so far: 6 1
Saman Amarasinghe 33 6.035 MIT Fall 1998
Example  (d and f values as above)

READY = { 7, 4, 3 }
Scheduled so far: 6 1 2
Saman Amarasinghe 34 6.035 MIT Fall 1998
Example  (d and f values as above)

READY = { 7, 3, 5 }
Scheduled so far: 6 1 2 4
Saman Amarasinghe 35 6.035 MIT Fall 1998
Example  (d and f values as above)

READY = { 3, 5, 8, 9 }
Scheduled so far: 6 1 2 4 7
Saman Amarasinghe 36 6.035 MIT Fall 1998
Example  (d and f values as above)

READY = { 5, 8, 9 }
Scheduled so far: 6 1 2 4 7 3
Saman Amarasinghe 37 6.035 MIT Fall 1998
Example  (d and f values as above)

READY = { 9 }
Scheduled so far: 6 1 2 4 7 3 5 8
Saman Amarasinghe 38 6.035 MIT Fall 1998
Example  (d and f values as above)

READY = { 9 }
Scheduled so far: 6 1 2 4 7 3 5 8 9
Saman Amarasinghe 39 6.035 MIT Fall 1998
Example  (d and f values as above)

READY = { }
Scheduled so far: 6 1 2 4 7 3 5 8 9
Saman Amarasinghe 40 6.035 MIT Fall 1998
Example

Results in:
1: lea var_a, %rax 1 cycle
2: add $4, %rax 1 cycle
3: inc %r11 1 cycle
4: mov 4(%rsp), %r10 3 cycles
5: add %r10, 8(%rsp)
6: and 16(%rsp), %rbx 4 cycles
7: imul %rax, %rbx 3 cycles
8: mov %rbx, 16(%rsp)
9: lea var_b, %rax

Original order:   1 2 3 4 st st 5 6 st st st 7 8 9   = 14 cycles
Scheduled order:  6 1 2 4 7 3 5 8 9                  =  9 cycles
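As a quick check, the issue-time sketch from the earlier example (an editor's model, using the DAG latencies as read off the figure: 1→2: 1, 2→7: 1, 4→5: 3, 6→7: 4, 7→8: 3, 7→9: 1) also fits the scheduled order into 9 issue slots with no stalls:

    # Same issue-time model as in the earlier sketch.
    def issue_cycles(order, deps):
        issue, cycle = {}, 0
        for n in order:
            ready = max([issue[p] + lat for p, lat in deps.get(n, [])], default=0)
            cycle = max(cycle + 1, ready)
            issue[n] = cycle
        return issue

    deps = {2: [(1, 1)], 5: [(4, 3)], 7: [(2, 1), (6, 4)], 8: [(7, 3)], 9: [(7, 1)]}
    print(issue_cycles([6, 1, 2, 4, 7, 3, 5, 8, 9], deps))
    # -> {6: 1, 1: 2, 2: 3, 4: 4, 7: 5, 3: 6, 5: 7, 8: 8, 9: 9}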
Saman Amarasinghe 41 6.035 MIT Fall 1998

Outline

Modern architectures
Introduction to instruction scheduling
List scheduling
Resource constraints
Scheduling across basic blocks
Trace scheduling

Saman Amarasinghe 42 6.035 MIT Fall 1998


Resource Constraints
Modern machines have many resource constraints
Superscalar architectures:
  can run a few operations in parallel
  but have constraints

Saman Amarasinghe 43 6.035 MIT Fall 1998


Resource Constraints of a Superscalar Processor

Example:
  One fully pipelined reg-to-reg unit
    all integer operations take one cycle
  in parallel with
  One fully pipelined memory-to/from-reg unit
    data loads take two cycles
    data stores take one cycle
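One way to write this example machine down as data (an editor's encoding; the unit and operation names are made up):

    # Example machine from this slide, as plain data.
    MACHINE = {
        "ALU": {"issue_per_cycle": 1, "latency": {"reg_op": 1}},
        "MEM": {"issue_per_cycle": 1, "latency": {"load": 2, "store": 1}},
    }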

Saman Amarasinghe 44 6.035 MIT Fall 1998


List Scheduling Algorithm with Resource Constraints

Represent the superscalar architecture as multiple pipelines
  Each pipeline represents some resource

Saman Amarasinghe 45 6.035 MIT Fall 1998


List Scheduling Algorithm with Resource Constraints

Represent the superscalar architecture as multiple pipelines
  Each pipeline represents some resource
Example
  One single-cycle reg-to-reg ALU unit
  One two-cycle pipelined reg-to/from-memory unit

  Reservation table rows:  ALU | MEM 1 | MEM 2
Saman Amarasinghe 46 6.035 MIT Fall 1998
List Scheduling Algorithm with Resource Constraints

Create a dependence DAG of a basic block
Topological sort:
  READY = nodes with no predecessors
  Loop until READY is empty
    Let n ∈ READY be the node with the highest priority
    Schedule n in the earliest slot that satisfies precedence + resource constraints
    Update READY
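A sketch of that loop with a simple reservation table (an editor's Python example; it assumes one issue slot per unit per cycle, as in the two-unit machine above):

    # Sketch: list scheduling with resource constraints.
    # instr: node -> (unit, latency); dag: node -> [(succ, latency)]; priority: higher first.
    def list_schedule_resources(dag, instr, priority):
        preds = {n: 0 for n in dag}
        for n, succs in dag.items():
            for s, _ in succs:
                preds[s] += 1
        earliest = {n: 1 for n in dag}     # precedence constraint
        busy = {}                          # (unit, cycle) -> node issued in that slot
        ready = {n for n, c in preds.items() if c == 0}
        schedule = {}
        while ready:
            n = max(ready, key=priority)
            ready.remove(n)
            unit, _ = instr[n]
            cycle = earliest[n]
            while (unit, cycle) in busy:   # earliest slot free on that unit
                cycle += 1
            busy[(unit, cycle)] = n
            schedule[n] = cycle
            for s, lat in dag[n]:
                earliest[s] = max(earliest[s], cycle + lat)
                preds[s] -= 1
                if preds[s] == 0:
                    ready.add(s)
        return schedule

    # Tiny hypothetical block: two independent loads feeding an add.
    instr = {1: ("MEM", 2), 2: ("MEM", 2), 3: ("ALU", 1)}
    dag = {1: [(3, 2)], 2: [(3, 2)], 3: []}
    print(list_schedule_resources(dag, instr, priority=lambda n: -n))
    # -> {1: 1, 2: 2, 3: 4}   (the loads share the MEM unit, so they issue in cycles 1 and 2)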

Saman Amarasinghe 47 6.035 MIT Fall 1998


Example

1: lea var_a, %rax
2: add 4(%rsp), %rax
3: inc %r11
4: mov 4(%rsp), %r10
5: mov %r10, 8(%rsp)
6: and $0x00ff, %rbx
7: imul %rax, %rbx
8: lea var_b, %rax
9: mov %rbx, 16(%rsp)

Saman Amarasinghe 48 6.035 MIT Fall 1998


Example

1: lea var_a, %rax
2: add 4(%rsp), %rax
3: inc %r11
4: mov 4(%rsp), %r10
5: mov %r10, 8(%rsp)
6: and $0x00ff, %rbx
7: imul %rax, %rbx
8: lea var_b, %rax
9: mov %rbx, 16(%rsp)

Dependence DAG (edge = latency): 1→2 (1), 4→5 (2), 2→7 (2), 6→7 (1), 7→8 (1), 7→9 (1)

READY = { }

Resulting schedule (one row per pipeline stage, one column per cycle):
  ALU op:  1  6  3  7  8
  MEM 1:   4  2  5  9
  MEM 2:      4  2
Saman Amarasinghe 49 6.035 MIT Fall 1998

Outline

Modern architectures
Introduction to instruction scheduling
List scheduling
Resource constraints
Scheduling across basic blocks
Trace scheduling

Saman Amarasinghe 50 6.035 MIT Fall 1998


Scheduling across basic blocks

The number of instructions in a basic block is small
  Cannot keep multiple units with long pipelines busy by scheduling within a basic block alone
Need to handle control dependence
  Scheduling constraints across basic blocks
  Scheduling policy

Saman Amarasinghe 51 6.035 MIT Fall 1998


Moving across basic blocks

Downward to adjacent basic block

  [CFG diagram: blocks B and C]

Saman Amarasinghe 52 6.035 MIT Fall 1998


Moving across basic blocks

Downward to adjacent basic block

  [CFG diagram: blocks B and C]

Saman Amarasinghe 53 6.035 MIT Fall 1998


Moving across basic blocks

Downward to adjacent basic block

  [CFG diagram: blocks B and C]

A path to B that does not execute A?

Saman Amarasinghe 54 6.035 MIT Fall 1998


Moving across basic blocks

Upward to adjacent basic block

  [CFG diagram: blocks B and C]

Saman Amarasinghe 55 6.035 MIT Fall 1998


Moving across basic blocks

Upward to adjacent basic block

  [CFG diagram: blocks B and C]

Saman Amarasinghe 56 6.035 MIT Fall 1998


Moving across basic blocks

Upward to adjacent basic block

  [CFG diagram: blocks B and C]

A path from C that does not reach A?

Saman Amarasinghe 57 6.035 MIT Fall 1998


Control Dependencies

Constraints in moving instructions across basic blocks

Saman Amarasinghe 58 6.035 MIT Fall 1998


Control Dependencies

Constraints in moving instructions across basic blocks

  if ( . . . )
      a = b op c

Saman Amarasinghe 59 6.035 MIT Fall 1998


Control Dependencies

Constraints in moving instructions across basic blocks

  if ( . . . )
      a = b op c

Saman Amarasinghe 60 6.035 MIT Fall 1998


Control Dependencies

Constraints in moving instructions across basic blocks

  if ( c != 0 )
      a = b / c

  NO!!!  (hoisting the divide above the test could divide by zero)

Saman Amarasinghe 61 6.035 MIT Fall 1998


Control Dependencies

Constraints in moving instructions across basic blocks

  if ( . . . )
      d = *(a1)

Saman Amarasinghe 62 6.035 MIT Fall 1998


Control Dependencies

Constraints in moving instructions across basic blocks

  if ( valid address? )
      d = *(a1)

  (hoisting the load above the test could fault on an invalid address)

Saman Amarasinghe 63 6.035 MIT Fall 1998



Outline

Modern architectures
Introduction to instruction scheduling
List scheduling
Resource constraints
Scheduling across basic blocks
Trace scheduling

Saman Amarasinghe 64 6.035 MIT Fall 1998


Trace Scheduling

Find the most common trace of basic blocks
  Use profile information
Combine the basic blocks in the trace and schedule them as one block
Create clean-up code if the execution goes off-trace
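A sketch of the trace-selection step (an editor's Python example; the CFG and profile encodings are hypothetical): start from the entry block and greedily follow the most frequently taken successor edge.

    # Sketch: pick the most common trace by following the hottest CFG edges.
    # cfg: block -> list of successors; edge_count: (block, succ) -> profile count.
    def hottest_trace(cfg, edge_count, entry):
        trace, seen = [entry], {entry}
        block = entry
        while cfg.get(block):
            succ = max(cfg[block], key=lambda s: edge_count.get((block, s), 0))
            if succ in seen:             # stop rather than wrap around a back edge
                break
            trace.append(succ)
            seen.add(succ)
            block = succ
        return trace

    cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    edge_count = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
    print(hottest_trace(cfg, edge_count, "A"))   # -> ['A', 'B', 'D']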

Saman Amarasinghe 65 6.035 MIT Fall 1998


Trace Scheduling

  [CFG diagram: blocks branching to B/C and F/G, joining at H]
Saman Amarasinghe 66 6.035 MIT Fall 1998
Trace Scheduling

  [Diagram: the selected trace, ending at H, scheduled as a single block]
Saman Amarasinghe 67 6.035 MIT Fall 1998
Large Basic Blocks via Code Duplication

Creating large extended basic blocks by duplication
Schedule the larger blocks

  [CFG diagram: A branches to B and C, which rejoin and continue to E]
Saman Amarasinghe 68 6.035 MIT Fall 1998
Large Basic Blocks via Code Duplication

Creating large extended basic blocks by duplication
Schedule the larger blocks

  [Diagram: the join blocks D and E are duplicated so the path through B and the path through C each get their own copy, removing the merge point]
Saman Amarasinghe 69 6.035 MIT Fall 1998
Trace Scheduling

  [Diagram: the blocks on the trace are scheduled together; duplicated copies of D, E, F/G, and H handle the off-trace paths]
Saman Amarasinghe 70 6.035 MIT Fall 1998
5

Next
Scheduling for loops
Loop unrolling
Software pipelining
Interaction with register allocation
Hardware vs. Compiler

Saman Amarasinghe 71 6.035 MIT Fall 1998


MIT OpenCourseWare
http://ocw.mit.edu

6.035 Computer Language Engineering


Spring 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
