You are on page 1of 48

EECS 252 Graduate Computer

Architecture
Lec 3 Performance
+ Pipeline Review
David Patterson
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~pattrsn
http://www-inst.eecs.berkeley.edu/~cs252
http://inst.eecs.berkeley.edu/~cs252/archives.html

CS252: Administrivia
Instructor: Prof David Patterson
Office: 635 Soda Hall, pattrsn@cs
Office Hours: Tue 11 - noon or by appt.
(Contact Cecilia Pracher; cpracher@eecs)
T. A:

Archana Ganapathi, archanag@eecs

Class:

M/W, 11:00 - 12:30pm

203 McLaughlin (and online)

Text:
Computer Architecture: A Quantitative Approach, 4th
Edition (Oct, 2006), Beta, distributed for free provided report errors
Web page: http://www.cs/~pattrsn/courses/cs252-S06/
Lectures available online <9:00 AM day of lecture
Wiki page: ??
Reading assignment: Memory Hierarchy Basics Appendix C
(handout) for Mon 1/30
Wed 2/1: Great ISA debate (3 papers) + Prerequisite Quiz
08/29/16

CS252-s06, Lec 02-intro

Outline

Review
Quantify and summarize performance
Ratios, Geometric Mean, Multiplicative Standard Deviation

F&P: Benchmarks age, disks fail,1 point fail


danger
252 Administrivia
MIPS An ISA for Pipelining
5 stage pipelining
Structural and Data Hazards
Forwarding
Branch Schemes
Exceptions and Interrupts
Conclusion

08/29/16

CS252-s06, Lec 02-intro

A "Typical" RISC ISA

32-bit fixed format instruction (3 formats)


32 32-bit GPR (R0 contains zero, DP take pair)
3-address, reg-reg arithmetic instruction
Single address mode for load/store:
base + displacement
Simple branch conditions
Delayed branch

see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC,


CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

08/29/16

CS252-s06, Lec 02-intro

Example: MIPS ( MIPS)


Register-Register (R-type)
31

26 25

21 20

Rd<-Rs1 op Rs2
16 15

Op
Rs
Rt
Register-Immediate
(I-type)
Load/Store
31

26 25

Op

21 20

Rs

Rd

11 10

6 5

Opx
Rd<-Rs1 op Imm

16 15

26 25

Op

; Add R1,R2,#3
L R1,60(R2)

immediate

Rt

Branch (B-type)
31

; Add R1,R2,R3

BNE R1,R2,name
21 20

Rs

16 15

Rt/Opx

immediate
J name

Jump / Call (J-type)


31

26 25

Op

08/29/16

target

CS252-s06, Lec 02-intro

Approaching an ISA
Instruction Set Architecture
Defines set of operations, instruction format, hardware supported data
types, named storage, addressing modes, sequencing

Meaning of each instruction is described by RTL on


architected registers and memory
Given technology constraints assemble adequate datapath

Architected storage mapped to actual storage


Function units to do all the required operations
Possible additional storage (eg. MAR, MBR, )
Interconnect to move information among regs and FUs

Map each instruction to sequence of RTLs


Collate sequences into symbolic controller state transition
diagram (STD)
Lower symbolic STD to control points
Implement controller
08/29/16

CS252-s06, Lec 02-intro

5 Steps of MIPS Datapath


Figure A.2, Page A-8
Instruction
Fetch

Instr. Decode
Reg. Fetch

Execute
Addr. Calc

Next SEQ PC

Adder

Zero?

RS1

L
M
D

MUX

Data
Memory

ALU

Imm

MUX MUX

RD

Reg File

Inst

Memory

Address

IR <= mem[PC];

RS2

Write
Back

MUX

Next PC

Memory
Access

Sign
Extend

PC <= PC + 4
Reg[IRrd] <= Reg[IRrs] opIRop Reg[IRrt]
08/29/16

WB Data

CS252-s06, Lec 02-intro

5 Steps of MIPS Datapath

Data stationary control

08/29/16

CS252-s06, Lec 02-intro

local decode for each instruction phase / pipeline stage

Pileline stages main activities


IR <= mem[PC];
IF/ID.NPC,PC <=
if(branch cond) EX/MEM.r
else PC + 4

JSR

jmp

br

opFetch-DCD

A <= Reg[IRrs];

JR

ST

B <= Reg[IRrt]

RI

RR
r <= ID/EX.PC
+Irim, cond <=
ID/EX.A==0

PC <= IRjaddr

r <= A opIRop B

WB <= r

Reg[IRrd] <= WB

08/29/16

Ifetch

LD

r <= A opIRop IRim

r <= A + IRim

WB <= r

WB <= Mem[r]

Reg[IRrt] <= WB

CS252-s06, Lec 02-intro

Reg[IRrt] <= WB

5 Steps of MIPS Datapath

08/29/16

CS252-s06, Lec 02-intro

10

5 Steps of MIPS Datapath

08/29/16

CS252-s06, Lec 02-intro

11

5 Steps of MIPS Datapath

08/29/16

CS252-s06, Lec 02-intro

12

5 Steps of MIPS Datapath

08/29/16

CS252-s06, Lec 02-intro

13

5 Steps of MIPS Datapath

08/29/16

CS252-s06, Lec 02-intro

14

5 Steps of MIPS Datapath

08/29/16

CS252-s06, Lec 02-intro

15

5 Steps of MIPS Datapath

08/29/16

CS252-s06, Lec 02-intro

16

5 Steps of MIPS Datapath

08/29/16

CS252-s06, Lec 02-intro

17

5 Steps of MIPS Datapath

08/29/16

CS252-s06, Lec 02-intro

18

Visualizing Pipelining
Figure A.2, Page A-8

Time (clock cycles)

08/29/16

Ifetch

DMem

Reg

Ifetch

Reg

Ifetch

Reg

DMem

Reg

CS252-s06, Lec 02-intro

Reg

DMem

ALU

Reg

ALU

O
r
d
e
r

Ifetch

ALU

I
n
s
t
r.

ALU

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

Reg

DMem

Reg

19

Pipelining is not quite that easy!

Limits to pipelining: Hazards prevent next instruction


from executing during its designated clock cycle
Structural hazards: HW cannot support this combination of
instructions (single person to fold and put clothes away)
Data hazards: Instruction depends on result of prior instruction still
in the pipeline (missing sock)
Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow (branches
and jumps).

08/29/16

CS252-s06, Lec 02-intro

20

One Memory Port/Structural Hazards


Figure A.4, Page A-14

Time (clock cycles)

Instr 4
08/29/16

Ifetch

Reg

Ifetch

DMem

ALU

Reg

Reg

Reg

Ifetch

CS252-s06, Lec 02-intro

Reg

DMem

Reg

DMem

Reg

ALU

Instr 3

Ifetch

DMem

ALU

O
r
d
e
r

Instr 2

Reg

ALU

I Load Ifetch
n
s
t Instr 1
r.

ALU

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

Reg

Reg

DMem

21

One Memory Port/Structural Hazards


(Similar to Figure A.5, Page A-15)

Time (clock cycles)

Stall
Instr 3
How
08/29/16

Reg

Ifetch

Reg

Bubble

Reg

DMem

ALU

Ifetch

DMem

Reg

DMem

Bubble Bubble

Ifetch

do you bubbleCS252-s06,
the pipe?
Lec 02-intro

Reg

Reg

Bubble
ALU

O
r
d
e
r

Instr 2

Reg

ALU

I Load Ifetch
n
s
t Instr 1
r.

ALU

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

Bubble
Reg

DMem

22

Speed Up Equation for Pipelining


CPIunpipelined Cycle Time unpipelined
Speedup

CPIpipelined
Cycle Time pipelined

CPIpipelined Ideal CPI Average Stall cycles per Inst

Cycle Timeunpipelined
Ideal CPI Pipeline depth
Speedup

Ideal CPI Pipeline stall CPI


Cycle Timepipelined

For simple RISC pipeline, Ideal CPI = 1:


Cycle Timeunpipelined
Pipeline depth
Speedup

1 Pipeline stall CPI


Cycle Timepipelined
08/29/16

CS252-s06, Lec 02-intro

23

Example: Dual-port vs. Single-port


Machine A: Dual ported memory (Harvard Architecture)
Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe)
= Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05)
= (Pipeline Depth/1.4) x 1.05
= 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33

Machine A is 1.33 times faster


08/29/16

CS252-s06, Lec 02-intro

24

Data Hazard on R1
Figure A.6, Page A-17

Time (clock cycles)

and r6,r1,r7
or

r8,r1,r9

Ifetch

DMem

Reg

DMem

Ifetch

Reg

DMem

Reg

DMem

Reg

ALU

sub r4,r1,r3

Reg

ALU

Ifetch

ALU

O
r
d
e
r

add r1,r2,r3

Ifetch

Reg

Ifetch

xor r10,r1,r11
08/29/16

WB

ALU

I
n
s
t
r.

MEM

ALU

IF ID/RF EX

CS252-s06, Lec 02-intro

Reg

Reg

Reg

DMem

25

Reg

Three Generic Data Hazards


Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
Caused by a Dependence (in compiler
nomenclature). This hazard results from an actual
need for communication.

08/29/16

CS252-s06, Lec 02-intro

26

Three Generic Data Hazards


Write After Read (WAR)
InstrJ writes operand before InstrI reads it

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Called an anti-dependence by compiler writers.


This results from reuse of the name r1.
Possible in MIPS ?
Cant happen in MIPS 5 stage pipeline because:
All instructions take 5 stages, and
Reads are always in stage 2, and
Writes are always in stage 5 and instr I preceeds
instr J
08/29/16

CS252-s06, Lec 02-intro

27

Three Generic Data Hazards


Write After Write (WAW)
InstrJ writes operand before InstrI writes it.

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Called an output dependence by compiler writers


This also results from the reuse of name r1.
Possible in MIPS ?
Cant happen in MIPS 5 stage pipeline because:
All instructions take 5 stages, and
Writes are always in stage 5 and instr I preceeds
instr J
Will see WAR and WAW in more complicated pipes
08/29/16

CS252-s06, Lec 02-intro

28

Is Data Hazards sometime possible to avoid ?


Time (clock cycles)

r8,r1,r9

DMem

Ifetch

Reg

DMem

Reg

DMem

Reg

ALU

or

Reg

ALU

and r6,r1,r7

Ifetch

DMem

ALU

sub r4,r1,r3

Reg

ALU

O
r
d
e
r

add r1,r2,r3 Ifetch

Ifetch

Reg

Ifetch

xor r10,r1,r11
08/29/16

Yes
by Forwarding
(bypassing)

ALU

I
n
s
t
r.

CS252-s06, Lec 02-intro

Reg

Reg

Reg

DMem

29

Reg

HW Change for Forwarding


Figure A.23, Page A-37

NextPC

mux
MEM/WR

EX/MEM

ALU

mux

ID/EX

Registers

Data
Memory

mux

Immediate

How to detect and resolve this hazard?


08/29/16

CS252-s06, Lec 02-intro

30

Resolving hazard with forwarding

08/29/16

CS252-s06, Lec 02-intro

31

Forwarding to Avoid LW-SW Data Hazard


Figure A.8, Page A-20

or

r8,r6,r9

Reg

DMem

Ifetch

Reg

DMem

Reg

DMem

Reg

ALU

sw r4,12(r1)

Ifetch

DMem

ALU

lw r4, 0(r1)

Reg

ALU

O
r
d
e
r

add r1,r2,r3 Ifetch

ALU

I
n
s
t
r.

ALU

Time (clock cycles)

Ifetch

Ifetch

xor r10,r9,r11
ZASTOJ ?
08/29/16

Reg

Reg

Reg

Reg

DMem

Reg

NE !! Proslje. podataka iz prot. stepeni


CS252-s06, Lec 02-intro

32

Data Hazard Even with Forwarding


Figure A.9, Page A-21

Dobro objanjenje takodje na:


http://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_B._Pipeline_interlock

and r6,r1,r7
or
08/29/16

r8,r1,r9

DMem

Ifetch

Reg

DMem

Reg

Ifetch

Ifetch

CS252-s06, Lec 02-intro

Reg

Reg

Reg

DMem

ALU

sub r4,r1,r6

Reg

ALU

O
r
d
e
r

lw r1, 0(r2) Ifetch

ALU

I
n
s
t
r.

ALU

Time (clock cycles)

Reg

DMem

33

Reg

Data Hazard Even with Forwarding


(Similar to Figure A.10, Page A-21)

and r6,r1,r7
or r8,r1,r9
08/29/16

Reg

DMem

Ifetch

Reg

Bubble

Ifetch

Bubble

Reg

Bubble

Ifetch

How is this detected?


CS252-s06, Lec 02-intro

Reg

DMem

Reg

Reg

Reg

DMem

ALU

sub r4,r1,r6

Ifetch

ALU

O
r
d
e
r

lw r1, 0(r2)

ALU

I
n
s
t
r.

ALU

Time (clock cycles)

DMem

34

Load interlock detection

08/29/16

CS252-s06, Lec 02-intro

35

Software Scheduling to Avoid Load


Hazards
Try producing fast code for
a = b + c;
d = e f;
assuming a, b, c, d ,e, and f in memory.
Slow code:
LW
LW
ADD
SW
LW
LW
SUB
SW

Rb,b
Rc,c
Ra,Rb,Rc
a,Ra
Re,e
Rf,f
Rd,Re,Rf
d,Rd

Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd

Compiler optimizes for performance. Hardware checks for safety.


08/29/16

CS252-s06, Lec 02-intro

36

Outline

Review
Quantify and summarize performance
Ratios, Geometric Mean, Multiplicative Standard Deviation

F&P: Benchmarks age, disks fail,1 point fail


danger
252 Administrivia
MIPS An ISA for Pipelining
5 stage pipelining
Structural and Data Hazards
Forwarding
Branch Schemes
Exceptions and Interrupts
Conclusion

08/29/16

CS252-s06, Lec 02-intro

37

Reg

DMem

Ifetch

Reg

Ifetch

22: add r8,r1,r9


36: xor r10,r1,r11

Reg

Reg

DMem

Reg

DMem

Ifetch

Reg

ALU

r6,r1,r7

DMem

ALU

Ifetch

14: and r2,r3,r5


18: or

Reg

ALU

Ifetch

ALU

10: beq r1,r3,36

ALU

Control Hazard on Branches


Three Stage Stall

Reg

What do you do with the 3 instructions in between?


How do you do it?
Where is the commit?
08/29/16

CS252-s06, Lec 02-intro

38

Reg

DMem

Branch Stall Impact


If CPI = 1, 30% branch,
Stall 3 cycles => new CPI = 1.9!
Two part solution:
Determine branch taken or not sooner, AND
Compute taken branch address earlier

MIPS branch tests if register = 0 or 0


MIPS Solution:
Move Zero test to ID/RF stage
Adder to calculate new PC in ID/RF stage
1 clock cycle penalty for branch versus 3

08/29/16

CS252-s06, Lec 02-intro

39

Pipelined MIPS Datapath


Figure A.24, page A-38
Instruction
Fetch

Memory
Access

Write
Back

Adder

Adder

MUX

Next
SEQ PC

Next PC

Zero?

RS1

MUX

MEM/WB

Data
Memory

EX/MEM

ALU

MUX

ID/EX

Imm

Reg File

IF/ID

Memory

Address

RS2

WB Data

Execute
Addr. Calc

Instr. Decode
Reg. Fetch

Sign
Extend

RD

RD

RD

Interplay of instruction set design and cycle time.


08/29/16

CS252-s06, Lec 02-intro

40

Four Branch Hazard Alternatives


#1: Stall until branch direction is clear
#2: Predict Branch Not Taken

Execute successor instructions in sequence


Squash instructions in pipeline if branch actually taken
Advantage of late pipeline state update
47% MIPS branches not taken on average
PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken


53% MIPS branches taken on average
But havent calculated branch target address in MIPS
MIPS still incurs 1 cycle branch penalty
Other machines: branch target known before outcome

08/29/16

CS252-s06, Lec 02-intro

41

Four Branch Hazard Alternatives

#4: Delayed Branch


Branch could take place AFTER a (number of) following instructions
branch instruction
sequential successor1
sequential successor2
........
sequential successorn
Branch delay of length n

branch target if taken

1 slot delay allows proper decision and branch target address in 5


stage pipeline
MIPS uses this

Where to get instructions to fill branch delay slot?


08/29/16

CS252-s06, Lec 02-intro

42

Scheduling Branch Delay Slots (Fig A.14)


A. From before branch
add R1,R2,R3
if R2=0 then
delay slot

becomes

B. From branch target


sub R4,R5,R6
add R1,R2,R3
if R1=0 then
delay slot
becomes

if R2=0 then
add R1,R2,R3

add R1,R2,R3
if R1=0 then
sub R4,R5,R6

C. From fall through


add R1,R2,R3
if R1=0 then
delay slot
or R7,R8,R9
sub R4,R5,R6
becomes
add R1,R2,R3
if R1=0 then

or R7,R8,R9
sub R4,R5,R6

A is the best choice, fills delay slot & reduces instruction count (IC)
In B, the sub instruction may need to be copied, increasing IC
In B (C), must be okay to execute sub (or) when branch fails (occurs)
08/29/16

CS252-s06, Lec 02-intro

43

Delayed Branch
Compiler effectiveness for single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots useful in
computation
About 50% (60% x 80%) of slots usefully filled

Delayed Branch downside: As processor go to deeper


pipelines and multiple issue, the branch delay grows
and need more than one delay slot
Delayed branching has lost popularity compared to more expensive
but more flexible dynamic approaches
Growth in available transistors has made dynamic approaches
relatively cheaper

08/29/16

CS252-s06, Lec 02-intro

44

Evaluating Branch Alternatives


Pipelinespeedup=

Pipelinedepth
1+BranchfrequencyBranchpenalty

Assume 4% unconditional branch, 6% conditional branchuntaken, 10% conditional branch-taken


Scheduling
Branch CPI speedup v. speedup v.
scheme
penalty
unpipelined
stall
Stall pipeline
3 1.60
3.1
1.0
Predict taken
1 1.20
4.2
1.33
Predict not taken
1 1.14
4.4
1.40
Delayed branch
0.5 1.10
4.5
1.45

08/29/16

CS252-s06, Lec 02-intro

45

Problems with Pipelining


Exception: An unusual event happens to an
instruction during its execution
Examples: divide by zero, undefined opcode

Interrupt: Hardware signal to switch the


processor to a new instruction stream
Example: a sound card interrupts when it needs more audio
output samples (an audio click happens if it is left waiting)

Problem: It must appear that the exception or


interrupt must appear between 2 instructions (I i
and Ii+1)
The effect of all instructions up to and including I i is totalling
complete
No effect of any instruction after Ii can take place

The interrupt (exception) handler either aborts


program or restarts at instruction Ii+1
08/29/16

CS252-s06, Lec 02-intro

46

Precise Exceptions in Static Pipelines

Key observation: architected state only


change in memory and register write stages.

And In Conclusion: Control and Pipelining


Speed Up Pipeline Depth; if ideal CPI is 1, then:

Cycle Timeunpipelined
Pipeline depth
Speedup

1 Pipeline stall CPI


Cycle Timepipelined

Hazards limit performance on computers:


Structural: need more HW resources
Data (RAW,WAR,WAW): need forwarding, compiler scheduling
Control: delayed branch, prediction

Exceptions, Interrupts add complexity


Next time: Read Appendix C, record bugs online!

08/29/16

CS252-s06, Lec 02-intro

48

You might also like