RHK.SP96 1
Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)
Scheduling scheme    Branch penalty
Stall pipeline       3
Predict taken        1
Predict not taken    1
Delayed branch       0.5
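As a quick check on the table, the speedup formula can be evaluated for each scheme. The 14% branch frequency and 5-stage pipeline depth below are illustrative assumptions, not numbers from the slides:

```python
# Pipeline speedup = depth / (1 + branch_freq * branch_penalty)
# Branch frequency (14%) and pipeline depth (5) are assumed for illustration.
def pipeline_speedup(depth, branch_freq, branch_penalty):
    return depth / (1 + branch_freq * branch_penalty)

penalties = {
    "Stall pipeline":    3,
    "Predict taken":     1,
    "Predict not taken": 1,
    "Delayed branch":    0.5,
}

for scheme, penalty in penalties.items():
    s = pipeline_speedup(depth=5, branch_freq=0.14, branch_penalty=penalty)
    print(f"{scheme:18s} speedup = {s:.2f}")
```

The delayed-branch penalty of 0.5 reflects that only some delay slots go unfilled.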
Two strategies:
- Direction-based: backward branch predict taken, forward branch predict not taken
- Profile-based prediction: record branch behavior, predict branch based on prior run
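The direction-based heuristic fits in a few lines; the trace of (pc, target, actually_taken) tuples below is hypothetical, just to show the rule:

```python
# Static direction-based prediction: backward branches (loop back-edges)
# are predicted taken, forward branches predicted not taken.
def predict_taken(pc, target):
    return target < pc   # backward branch => predict taken

# Hypothetical branch trace: (pc, target, actually_taken).
trace = [
    (100, 40, True),     # loop back-edge: correctly predicted taken
    (200, 300, False),   # forward skip: correctly predicted not taken
    (400, 500, True),    # taken forward branch: mispredicted
]

correct = sum(predict_taken(pc, t) == taken for pc, t, taken in trace)
print(f"{correct}/{len(trace)} correct")  # 2/3 correct
```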
[Chart: misprediction rates, profile-based vs. direction-based prediction]
- Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency
- Interrupts, the instruction set, and FP make pipelining harder
- Compilers reduce the cost of data and control hazards:
  - Load delay slots
  - Branch delay slots
  - Branch prediction
Today: Longer pipelines (R4000) => Better branch prediction, more instruction parallelism?
R4000 pipeline: one instruction issues per clock (IF/IS: two-cycle instruction fetch; RF: register fetch; EX: execute; DF/DS: two-cycle data fetch; TC: tag check; WB: write back):

i:    IF IS RF EX DF DS TC WB
i+1:     IF IS RF EX DF DS TC WB
i+2:        IF IS RF EX DF DS TC WB
i+3:           IF IS RF EX DF DS TC WB
i+4:              IF IS RF EX DF DS TC WB

Branch delay: delay slot plus two stalls. Branch-likely cancels the delay-slot instruction if the branch is not taken.
FP functional-unit stages include:
- First stage of multiplier
- Second stage of multiplier
- Rounding stage
- Operand shift stage
- Unpack FP numbers
R4000 Performance
Not an ideal CPI of 1:
- Load stalls (1 or 2 clock cycles)
- Branch stalls (2 cycles + unfilled delay slots)
- FP result stalls: RAW data hazards (latency)
- FP structural stalls: not enough FP hardware (parallelism)

[Chart: pipeline CPI, 0 to 4.5, for espresso, doduc, gcc, eqntott, nasa7, spice2g6, tomcatv, su2cor, ora, li; bars split into Base, Load stalls, Branch stalls, FP result stalls, FP structural stalls]
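The chart's message is that effective CPI is the base CPI plus the stall components stacked on top. A minimal sketch, with made-up stall contributions rather than the measured SPEC92 numbers:

```python
# Effective CPI = base CPI + stall cycles per instruction from each source.
# The stall contributions below are invented for illustration.
def effective_cpi(base, stalls):
    return base + sum(stalls.values())

stalls = {
    "load":          0.4,
    "branch":        0.3,
    "fp_result":     0.6,
    "fp_structural": 0.1,
}
print(round(effective_cpi(1.0, stalls), 2))  # 2.4
```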
- Loop-level parallelism: one opportunity, exploitable in SW and HW
- Do examples and then explain nomenclature
- DLX floating point as example
Measurements of R4000 performance suggest that FP execution has room for improvement
Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0
Integer op                      Integer op                  0
FP Loop Hazards
Loop: LD    F0,0(R1)    ;F0=vector element
      ADDD  F4,F0,F2    ;add scalar in F2
      SD    0(R1),F4    ;store result
      SUBI  R1,R1,8     ;decrement pointer 8B (DW)
      BNEZ  R1,Loop     ;branch R1!=zero
      NOP               ;delayed branch slot
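The stall count for this loop can be checked mechanically against the latency table. A simplified model follows: it only checks adjacent instruction pairs, which is enough for this particular loop:

```python
# Stall cycles owed between a producer and an immediately following
# dependent consumer, per the FP latency table above.
latency = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"):  2,
    ("load",   "fp_alu"): 1,
    ("load",   "store"):  0,
}

# Each instruction as (class, destination, source registers).
loop = [
    ("load",   "F0", []),        # LD   F0,0(R1)
    ("fp_alu", "F4", ["F0"]),    # ADDD F4,F0,F2
    ("store",  None, ["F4"]),    # SD   0(R1),F4
    ("int",    "R1", ["R1"]),    # SUBI R1,R1,8
    ("int",    None, ["R1"]),    # BNEZ R1,Loop
    ("int",    None, []),        # NOP (delayed branch slot)
]

def total_clocks(loop, latency):
    clocks = len(loop)               # one issue slot per instruction
    for prod, cons in zip(loop, loop[1:]):
        if prod[1] is not None and prod[1] in cons[2]:
            clocks += latency.get((prod[0], cons[0]), 0)
    return clocks

print(total_clocks(loop, latency))   # 9 clocks per iteration
```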
FP loop showing stalls:

Loop: LD    F0,0(R1)    ;F0=vector element
      stall
      ADDD  F4,F0,F2    ;add scalar in F2
      stall
      stall
      SD    0(R1),F4    ;store result
      SUBI  R1,R1,8     ;decrement pointer 8B (DW)
      BNEZ  R1,Loop     ;branch R1!=zero
      stall             ;delayed branch slot

9 clock cycles per iteration.
Unroll loop four times and reschedule:

Loop: LD    F0,0(R1)
      LD    F6,-8(R1)
      LD    F10,-16(R1)
      LD    F14,-24(R1)
      ADDD  F4,F0,F2
      ADDD  F8,F6,F2
      ADDD  F12,F10,F2
      ADDD  F16,F14,F2
      SD    0(R1),F4
      SD    -8(R1),F8
      SD    -16(R1),F12
      SUBI  R1,R1,32    ;alter to 4*8
      BNEZ  R1,Loop
      SD    8(R1),F16   ;8-32 = -24
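At source level the transformation just replicates the body and cuts loop overhead. A sketch in Python, assuming (as the DLX example does) a trip count divisible by 4:

```python
# Original loop: one add, one decrement, one branch per element.
def scaled_add(x, s):
    for i in range(len(x)):
        x[i] += s
    return x

# Unrolled by 4: one decrement/branch per four elements.
def scaled_add_unrolled(x, s):
    assert len(x) % 4 == 0           # assumes trip count divisible by 4
    for i in range(0, len(x), 4):
        x[i]     += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
    return x

print(scaled_add_unrolled([1.0, 2.0, 3.0, 4.0], 10.0))  # [11.0, 12.0, 13.0, 14.0]
```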
Our example required the compiler to know that if R1 doesn't change, then 0(R1), -8(R1), -16(R1), and -24(R1) are all different addresses. There were no dependences between some loads and stores, so they could be moved past each other
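The compiler's check can be sketched as: with the same base register and R1 unchanged, two references conflict only if their offsets are equal. This is a simplified model; a real compiler must handle unknown bases conservatively:

```python
# A memory reference is (base_register, offset). With the base unchanged,
# same-base references alias only when offsets match; references through
# different bases must conservatively be assumed to alias.
def may_alias(ref_a, ref_b):
    base_a, off_a = ref_a
    base_b, off_b = ref_b
    if base_a != base_b:
        return True
    return off_a == off_b

refs = [("R1", 0), ("R1", -8), ("R1", -16), ("R1", -24)]
# No pair conflicts, so the loads and stores can be moved past each other.
print(any(may_alias(a, b) for a in refs for b in refs if a != b))  # False
```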
Control dependencies are relaxed to get parallelism; we get the same effect if we preserve the order of exceptions and the data flow
Implies that iterations are dependent and can't be executed in parallel. Not the case for our example; each iteration was distinct
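The distinction shows up in two tiny loops: a running-sum loop carries a dependence from iteration i-1 to i, while our scalar-add loop does not (illustrative Python, not from the slides):

```python
# Loop-carried dependence: iteration i reads iteration i-1's result,
# so iterations cannot execute in parallel.
def running_sum(a):
    for i in range(1, len(a)):
        a[i] = a[i - 1] + a[i]
    return a

# No loop-carried dependence: each iteration touches only its own element,
# so iterations could run in parallel (or be unrolled freely).
def scalar_add(a, s):
    for i in range(len(a)):
        a[i] = a[i] + s
    return a

print(running_sum([1, 2, 3, 4]))  # [1, 3, 6, 10]
```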
Summary
- Instruction-level parallelism can be exploited in SW or HW
- Loop-level parallelism is easiest to see
- SW view of parallelism: dependencies are defined for a program; hazards arise if the HW cannot resolve them
- SW dependencies and compiler sophistication determine whether the compiler can unroll loops