Instruction Level Parallelism

Instruction Level Parallelism(ILP)


Overlap the execution of instructions to improve performance.
Two approaches to exploit ILP:
1. Dynamic and hardware intensive (desktop and server markets)
2. Static and software intensive (embedded market)

Instruction Level Parallelism(ILP)


Pipeline CPI = Ideal pipeline CPI + structural stalls + data hazard stalls + control stalls
Reducing any term on the right-hand side reduces the overall CPI and thus increases the instructions per cycle (IPC).
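The CPI equation can be tried out numerically. The sketch below is only an illustration: the function name `pipeline_cpi` and the stall figures (average stall cycles per instruction) are assumptions, not values from these slides.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI plus the average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls

# Hypothetical stall contributions, chosen only for illustration.
cpi = pipeline_cpi(ideal_cpi=1.0, structural_stalls=0.05,
                   data_hazard_stalls=0.30, control_stalls=0.15)
ipc = 1.0 / cpi  # IPC is the reciprocal of CPI

print(f"CPI = {cpi:.2f}, IPC = {ipc:.3f}")
```

With these assumed numbers, removing the data hazard stalls alone would bring the CPI from about 1.5 down to about 1.2.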

Instruction Level Parallelism(ILP)


The amount of parallelism available within a basic block is very small. Therefore, we exploit ILP across multiple basic blocks.

Instruction Level Parallelism(ILP)


Loop-level parallelism: exploit parallelism across the iterations of a loop.
Ex:
for (i=1; i<=1000; i++)
    x[i] = x[i] + y[i];
Every iteration of the loop can overlap with any other iteration.
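One way to see that the iterations are independent is to run them in an arbitrary order and compare the result with the in-order run. This is a minimal sketch; the helper `run_loop` is hypothetical, and the arrays are 0-based here rather than 1-based as in the loop above.

```python
import random

def run_loop(x, y, order):
    """Apply x[i] = x[i] + y[i] for each index i, visiting indices in `order`."""
    result = list(x)
    for i in order:
        result[i] = result[i] + y[i]
    return result

n = 1000
x = [float(i) for i in range(n)]
y = [2.0 * i for i in range(n)]

in_order = run_loop(x, y, range(n))

shuffled = list(range(n))
random.Random(0).shuffle(shuffled)      # fixed seed so the run is repeatable
out_of_order = run_loop(x, y, shuffled)

# No iteration reads a value written by another iteration,
# so the iteration order cannot change the result.
print(in_order == out_of_order)
```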

Instruction Level Parallelism(ILP)


Loop-level parallelism is converted into instruction-level parallelism either statically by the compiler or dynamically by the hardware.
Alternatively, use vector instructions that operate on a sequence of data items.
Ex: we need 4 vector instructions to execute this code: 2 for loading x and y from memory, 1 for adding the two vectors, and 1 for storing back the result vector.
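The vector formulation can be mimicked with a toy model in which each function call stands for one vector instruction. The names `vload`, `vadd`, and `vstore` are hypothetical and do not correspond to any real instruction set.

```python
def vload(memory, name):
    """Vector load: copy a whole array from memory into a vector register."""
    return list(memory[name])

def vadd(va, vb):
    """Vector add: element-wise sum of two vector registers."""
    return [a + b for a, b in zip(va, vb)]

def vstore(memory, name, vreg):
    """Vector store: write a vector register back to memory."""
    memory[name] = list(vreg)

memory = {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}

v1 = vload(memory, "x")    # vector instruction 1: load x
v2 = vload(memory, "y")    # vector instruction 2: load y
v3 = vadd(v1, v2)          # vector instruction 3: add
vstore(memory, "x", v3)    # vector instruction 4: store the result

print(memory["x"])
```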

Data dependences and Hazards


Finding dependences is critical in determining how much parallelism exists in a program.
Which instructions can be executed in parallel?
Is an instruction dependent on another instruction?

Data dependences and Hazards


Three different types of dependences:
1. Data dependences (also called true data dependences)
2. Name dependences
3. Control dependences

Data dependences
An instruction j is data dependent on instruction i if either of the following holds:
1. Instruction i produces a result that may be used by instruction j, or
2. Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (this dependence chain can be as long as the entire program)

Data dependences
Ex:
LOOP: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, #-8
      BNE    R1, R2, LOOP
If two instructions are data dependent, they cannot be executed simultaneously or be completely overlapped.

The dependence implies that there would be a chain of one or more data hazards between the two instructions
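Both parts of the definition (direct use of a result, and chains of dependences) can be computed mechanically for register operands. The sketch below is an assumption-laden illustration: each instruction is reduced to a hypothetical `(destination, sources)` pair, and memory dependences are ignored.

```python
def data_dependences(instrs):
    """Return (direct, closure): sets of (j, i) pairs meaning j is data dependent on i.

    instrs is a list of (dest, sources); dest is None for stores and branches.
    """
    direct = set()
    last_writer = {}                      # register -> index of its most recent writer
    for j, (dest, srcs) in enumerate(instrs):
        for s in srcs:
            if s in last_writer:
                direct.add((j, last_writer[s]))
        if dest is not None:
            last_writer[dest] = j

    closure = set(direct)                 # add chains: j -> k and k -> i imply j -> i
    changed = True
    while changed:
        changed = False
        for (j, k) in list(closure):
            for (k2, i) in list(closure):
                if k == k2 and (j, i) not in closure:
                    closure.add((j, i))
                    changed = True
    return direct, closure

# The loop body from the example, one iteration only.
loop = [
    ("F0", ["R1"]),          # 0: L.D    F0, 0(R1)
    ("F4", ["F0", "F2"]),    # 1: ADD.D  F4, F0, F2
    (None, ["F4", "R1"]),    # 2: S.D    F4, 0(R1)
    ("R1", ["R1"]),          # 3: DADDIU R1, R1, #-8
    (None, ["R1", "R2"]),    # 4: BNE    R1, R2, LOOP
]
direct, closure = data_dependences(loop)
print(sorted(direct))
print(sorted(closure - direct))
```

The pair that appears only in the closure is the store depending on the load through ADD.D, which is the chained case in part 2 of the definition.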

Data dependences
The effect of the original data dependence must be preserved.
The presence of the dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline.
Ex: there is a data dependence between DADDIU and BNE; this dependence causes a stall because we moved the branch test for the MIPS pipeline to the ID stage. Had the branch test stayed in EX, this dependence would not cause a stall.

Data dependences
The importance of data dependences is that a dependence (1) indicates the possibility of a hazard, (2) determines the order in which results must be calculated, and (3) sets an upper bound on how much parallelism can possibly be exploited.

Data dependences
A dependence can be overcome in two different ways:
1. Maintaining the dependence but avoiding a hazard
2. Eliminating a dependence by transforming the code

Data dependences
The primary method used to avoid a hazard is scheduling the code without altering the dependences (we will see a hardware scheme for scheduling code dynamically as it is executed).
Dependences that flow through registers are easier to detect than dependences that flow through memory locations: register names are fixed in the instruction, whereas 100(R4) and 20(R6) may refer to the same memory address, and 20(R4) and 20(R4) may refer to different addresses if R4 changes between the two instructions.

Name dependences
A name dependence occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name.

Name dependences
There are two types of name dependences between an instruction i that precedes instruction j in program order:
1. An antidependence between instruction i and instruction j occurs when instruction j writes a register or memory location that instruction i reads. The original ordering must be preserved to ensure that i reads the correct value.

Name dependences
2. An output dependence occurs when instruction i and instruction j write the same register or memory location. The ordering between the instructions must be preserved to ensure that the value finally written corresponds to instruction j.

Name dependences
Instructions involved in a name dependence can execute simultaneously or be reordered if the name (register number or memory location) used in the instructions is changed so that the instructions do not conflict (register renaming).

Dependences and data hazards


Data hazards are of three types, depending on the order of read and write accesses in the instructions. Consider two instructions i and j, with i occurring before j in program order:
1. RAW (read after write): corresponds to a true data dependence. Ex: a LOAD followed by an ALU instruction that directly uses the LOAD result

Dependences and data hazards


2. WAW (write after write): corresponds to an output dependence
3. WAR (write after read): corresponds to an antidependence
RAR (read after read) is not a hazard.
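The three cases can be told apart by comparing the registers that two instructions read and write. The following sketch assumes a hypothetical `(destination, sources)` encoding and reports potential hazards only; whether a stall actually occurs depends on the pipeline.

```python
def classify_hazards(i, j):
    """Classify potential hazards between i and a later instruction j.

    Each instruction is (dest, sources); dest is None if nothing is written.
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    hazards = set()
    if i_dest is not None and i_dest in j_srcs:
        hazards.add("RAW")      # j reads what i writes
    if j_dest is not None and j_dest in i_srcs:
        hazards.add("WAR")      # j writes what i reads
    if i_dest is not None and i_dest == j_dest:
        hazards.add("WAW")      # both write the same name
    return hazards

load = ("R1", ["R2"])             # a load into R1
alu = ("R4", ["R1", "R5"])        # an ALU operation that uses the loaded R1
overwrite = ("R2", ["R6", "R7"])  # a later write to R2, which the load reads
reader = (None, ["R2"])           # an instruction that only reads R2

print(classify_hazards(load, alu))
print(classify_hazards(load, overwrite))
print(classify_hazards(load, reader))   # read after read: empty set
```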

Control dependences
Determines the ordering of an instruction i with respect to a branch instruction.
Ex:
if p1 {
    s1;
}
if p2 {
    s2;
}
s1 is control dependent on p1, and s2 is control dependent on p2, but not on p1.

Control dependences
Two constraints imposed by control dependences:
1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.

Control dependences
Control dependences are preserved by two properties in a simple pipeline:
1. Instructions execute in program order: this ensures that an instruction that occurs before a branch is executed before the branch.
2. The detection of a control or branch hazard ensures that an instruction that is control dependent on a branch is not executed until the branch direction is known.

Control dependences
However, we may be willing to violate control dependences when doing so does not affect the correctness of the program.
Two properties are critical to program correctness, and are normally preserved by maintaining both data and control dependences: the exception behavior and the data flow.

Control dependences
Preserving the exception behavior: reordering of instruction execution must not change how exceptions are raised in the program (or must not cause any new exceptions in the program).
Ex:
DADDU R2, R3, R4
BEQZ  R2, L1
LW    R1, 0(R2)
This example shows how maintaining the data and control dependences prevents instruction reordering from causing new exceptions: if LW were moved before BEQZ, ignoring the control dependence, the load could raise a memory exception that the original program would not.
Speculation is a hardware technique that allows us to overcome this exception problem.

Control dependences
Preserving the data flow: data flow is the actual flow of data values among instructions that produce results and those that consume them. Branches make the data flow dynamic.
Ex:
    DADDU R1, R2, R3
    BEQZ  R4, L
    DSUBU R1, R5, R6
L:  OR    R7, R1, R8
Data dependence alone is insufficient for correct execution. Instead, when the instructions execute, the data flow must be preserved. This data flow is preserved by preserving control flow.
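The example can be traced in software to see that the value of R1 consumed by OR depends on the branch outcome. This is a rough model with hypothetical register values; BEQZ is taken when R4 is zero, which skips DSUBU.

```python
def run(regs):
    """Trace the four-instruction example and report which instruction fed OR."""
    regs = dict(regs)
    regs["R1"] = regs["R2"] + regs["R3"]        # DADDU R1, R2, R3
    producer = "DADDU"
    if regs["R4"] != 0:                          # BEQZ R4, ... falls through when R4 != 0
        regs["R1"] = regs["R5"] - regs["R6"]    # DSUBU R1, R5, R6
        producer = "DSUBU"
    regs["R7"] = regs["R1"] | regs["R8"]        # OR R7, R1, R8 (the branch target)
    return regs["R7"], producer

# Hypothetical register contents, chosen only for illustration.
base = {"R2": 1, "R3": 2, "R5": 12, "R6": 4, "R8": 0}

taken = run({**base, "R4": 0})       # branch taken: OR uses the DADDU result
not_taken = run({**base, "R4": 1})   # branch not taken: OR uses the DSUBU result
print(taken, not_taken)
```

OR is data dependent on both DADDU and DSUBU, but only the control flow decides which of the two values it actually receives.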

Overcoming Data Hazards using dynamic scheduling


Hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior.
The advantage of dynamic scheduling is gained at the cost of a significant increase in hardware complexity.

Dynamic scheduling
Limitation of a simple pipeline: instructions are issued in program order, and if an instruction is stalled in the pipeline, no later instruction can proceed.
Ex:
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F12, F8, F14
(SUB.D cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall, even though SUB.D is not data dependent on anything in the pipeline.)
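The cost of this limitation can be estimated with a very rough timing model. Everything below is an assumption made for illustration: the latencies, the rule that instruction k enters the pipeline in cycle k+1, and the helper `start_times`; it is not a model of any specific pipeline.

```python
LATENCY = {"DIV.D": 40, "ADD.D": 2, "SUB.D": 2}   # hypothetical latencies in cycles

def start_times(program, in_order):
    """Return the cycle in which each instruction begins execution."""
    available = {}       # register -> last cycle of the execution that produces it
    starts = []
    prev_start = 0
    for k, (op, dest, srcs) in enumerate(program):
        earliest = k + 1                                     # one instruction enters per cycle
        operands = max([available.get(s, 0) for s in srcs], default=0) + 1
        start = max(earliest, operands)
        if in_order:
            start = max(start, prev_start + 1)               # cannot pass a stalled instruction
        starts.append(start)
        prev_start = start
        available[dest] = start + LATENCY[op] - 1
    return starts

program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F10", ["F0", "F8"]),
    ("SUB.D", "F12", ["F8", "F14"]),
]
print(start_times(program, in_order=True))
print(start_times(program, in_order=False))
```

In the in-order case SUB.D starts only after ADD.D, which itself waits for the divide; when SUB.D is allowed to bypass the stalled ADD.D it starts almost immediately.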

Dynamic scheduling
In the 5-stage pipeline, both structural and data hazards are checked in the ID stage.
To begin executing SUB.D, we must separate the ID stage into two parts:
1. Checking for structural hazards
2. Waiting for the absence of a data hazard
We still use in-order instruction issue, but we want an instruction to begin execution as soon as its data operands are available (out-of-order execution, and hence out-of-order completion).

Dynamic scheduling
Out-of-order execution introduces the possibility of WAR and WAW hazards.
Ex:
DIV.D F0, F2, F4
ADD.D F6, F0, F8
SUB.D F8, F10, F14
MUL.D F6, F10, F8
(There is a WAR hazard between ADD.D and SUB.D on F8 if SUB.D executes before ADD.D, and a WAW hazard between ADD.D and MUL.D on F6.)
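The effect of violating these orderings can be seen by interpreting the four instructions in different orders. This sketch uses hypothetical register values and treats each instruction as reading and writing atomically, which is a simplification of real pipeline timing.

```python
import operator

OPS = {"DIV.D": operator.truediv, "ADD.D": operator.add,
       "SUB.D": operator.sub, "MUL.D": operator.mul}

PROGRAM = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F6", "F0", "F8"),
    ("SUB.D", "F8", "F10", "F14"),
    ("MUL.D", "F6", "F10", "F8"),
]

def run(order, initial):
    """Execute PROGRAM entries in `order`; return (final registers, result per opcode)."""
    regs = dict(initial)
    results = {}
    for idx in order:
        op, dest, s1, s2 = PROGRAM[idx]
        results[op] = OPS[op](regs[s1], regs[s2])
        regs[dest] = results[op]
    return regs, results

# Hypothetical initial values, chosen only for illustration.
initial = {"F2": 8.0, "F4": 2.0, "F8": 1.0, "F10": 5.0, "F14": 3.0}

ref_regs, ref_results = run([0, 1, 2, 3], initial)   # program order
war_regs, war_results = run([0, 2, 1, 3], initial)   # SUB.D before ADD.D
waw_regs, waw_results = run([0, 2, 3, 1], initial)   # ADD.D written last

print(ref_results["ADD.D"], war_results["ADD.D"])    # ADD.D reads the wrong F8
print(ref_regs["F6"], waw_regs["F6"])                # F6 ends with ADD.D's value
```

The last ordering also moves SUB.D ahead of ADD.D, so it exhibits the WAR problem as well as leaving the wrong final value in F6.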

Dynamic scheduling: Tomasulo's approach

The scheme
- tracks when operands for instructions are available, to minimize RAW hazards, and
- uses register renaming, to minimize WAR and WAW hazards.
In Tomasulo's approach, register renaming is provided by the reservation stations.

Dynamic scheduling: Tomasulo's approach

How register renaming eliminates WAR and WAW hazards:
Ex:
Code before renaming:      Code after renaming:
DIV.D F0, F2, F4           DIV.D F0, F2, F4
ADD.D F6, F0, F8           ADD.D S, F0, F8
S.D   F6, 0(R1)            S.D   S, 0(R1)
SUB.D F8, F10, F14         SUB.D T, F10, F14
MUL.D F6, F10, F8          MUL.D F6, F10, T
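Renaming can be automated by giving every result a fresh name and redirecting later readers to it. The sketch below is a simplified variant, not the exact transformation shown on the slide: it renames every destination (to hypothetical temporaries T0, T1, ...) rather than only the two that conflict.

```python
def rename(program):
    """Give each destination a fresh temporary and rewrite later sources to match.

    program is a list of (op, dest, sources); dest is None for stores.
    """
    current = {}          # architectural register -> its latest temporary name
    renamed = []
    counter = 0
    for op, dest, srcs in program:
        new_srcs = [current.get(s, s) for s in srcs]   # read the latest version
        new_dest = None
        if dest is not None:
            new_dest = f"T{counter}"
            counter += 1
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D", None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]
renamed = rename(program)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

After renaming, SUB.D no longer writes a name that ADD.D reads, and ADD.D and MUL.D write different names, while the true dependences (DIV.D to ADD.D, ADD.D to S.D, SUB.D to MUL.D) are still expressed through the new names.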

Tomasulo-based MIPS processor


Each reservation station has 7 fields:
Op: operation to perform on the source operands
Vj, Vk: values of the source operands (loads hold the offset value in the Vk field)
Qj, Qk: reservation stations producing the corresponding source operands (a value of 0 indicates that the source operand is already available in Vj or Vk, or is unnecessary)
A: holds information for the memory address calculation of a load or store
Busy: indicates that the reservation station and its functional unit are busy

Tomasulo-based MIPS processor


The register file has one field:
Qi: the number/name of the reservation station that contains the operation whose result should be stored into this register
The load and store buffers each have a field:
A: holds the result of the effective address calculation
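The fields listed above map naturally onto small record types. The following is a sketch only: the class names, the use of integers for station numbers, and the example values are assumptions; the convention that 0 means "no producer pending" follows the field descriptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of the first source operand
    vk: Optional[float] = None   # value of the second source operand
    qj: int = 0                  # station producing Vj; 0 means Vj is available
    qk: int = 0                  # station producing Vk; 0 means Vk is available
    a: Optional[int] = None      # address information for loads and stores

    def operands_ready(self) -> bool:
        """True when the station holds an operation whose operands are both present."""
        return self.busy and self.qj == 0 and self.qk == 0

@dataclass
class RegisterStatus:
    qi: int = 0                  # station whose result will be written here; 0 means none

# Hypothetical numbering: station 2 is a load buffer, station 4 is a multiplier.
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", vk=2.5, qj=2)
f0 = RegisterStatus(qi=4)        # F0 will be written by station 4

before = mult1.operands_ready()  # still waiting on station 2
mult1.vj, mult1.qj = 4.0, 0      # the awaited value arrives
after = mult1.operands_ready()
print(before, after)
```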

Three Stages of Tomasulo Algorithm


1. Issue: get instruction from FP Op Queue (maintains the correct data flow)
If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renames registers).

2. Execute: operate on operands (EX)
When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.

3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting units; mark the reservation station available.
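The bookkeeping done in the three stages can be sketched with plain dictionaries. This is a simplified illustration under several assumptions: the function names are hypothetical, stations are identified by name with None meaning "no producer pending", and timing, load buffers, and address calculation are left out.

```python
def new_station():
    return {"busy": False, "op": None, "vj": None, "vk": None, "qj": None, "qk": None}

def issue(station, op, dest, src_j, src_k, stations, reg_status, regs):
    """Issue: claim a free station, read or tag each operand, rename the destination."""
    rs = stations[station]
    if rs["busy"]:
        return False                       # structural hazard: the instruction waits
    rs.update(new_station(), busy=True, op=op)
    for v, q, src in (("vj", "qj", src_j), ("vk", "qk", src_k)):
        if reg_status.get(src) is None:
            rs[v] = regs[src]              # value is already in the register file
        else:
            rs[q] = reg_status[src]        # value pending: remember its producer
    reg_status[dest] = station             # later readers of dest wait on this station
    return True

def can_execute(rs):
    """Execute may begin once neither operand is waiting on a producer."""
    return rs["busy"] and rs["qj"] is None and rs["qk"] is None

def write_result(tag, value, stations, reg_status, regs):
    """Write result: broadcast (tag, value) to every waiting station and register."""
    for rs in stations.values():
        if rs["qj"] == tag:
            rs["vj"], rs["qj"] = value, None
        if rs["qk"] == tag:
            rs["vk"], rs["qk"] = value, None
    for reg, producer in reg_status.items():
        if producer == tag:
            regs[reg] = value
            reg_status[reg] = None
    if tag in stations:
        stations[tag]["busy"] = False      # the producing station becomes available

stations = {name: new_station() for name in ("Add1", "Mult1")}
regs = {"F2": 0.0, "F4": 2.5, "F6": 1.5}   # hypothetical register contents
reg_status = {"F2": "Load2"}                # F2 is still being produced by a load

issue("Mult1", "MULTD", "F0", "F2", "F4", stations, reg_status, regs)
issue("Add1", "SUBD", "F8", "F6", "F2", stations, reg_status, regs)
ready_before = can_execute(stations["Mult1"])
write_result("Load2", 4.0, stations, reg_status, regs)
ready_after = can_execute(stations["Mult1"])
print(ready_before, ready_after, stations["Add1"]["vk"])
```

A single broadcast wakes up both waiting stations at once, which is why the scheme does not need to route results through the register file first.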

Dynamic Scheduling : Example


Show what the information tables look like for the following code sequence when only the first load has completed and written its result:
1. L.D F6, 34(R2)
2. L.D F2, 45(R3)
3. MUL.D F0, F2, F4
4. SUB.D F8, F6, F2
5. DIV.D F10, F0, F6
6. ADD.D F6, F8, F2

Tomasulo Example
Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
L.D F6       34+  R2   -      -          -
L.D F2       45+  R3   -      -          -
MULT.D F0    F2   F4   -      -          -
SUB.D F8     F6   F2   -      -          -
DIV.D F10    F0   F6   -      -          -
ADD.D F6     F8   F2   -      -          -

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  No    -      -      -      -      -
-     Mult2  No    -      -      -      -      -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
0      FU  -      -      -   -        -      -      -    ...  -

Tomasulo Example Cycle 1


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      -          -
LD F2        45+  R3   -      -          -
MULTD F0     F2   F4   -      -          -
SUBD F8      F6   F2   -      -          -
DIVD F10     F0   F6   -      -          -
ADDD F6      F8   F2   -      -          -

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  No    -      -      -      -      -
-     Mult2  No    -      -      -      -      -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
1      FU  -      -      -   Load1    -      -      -    ...  -

Tomasulo Example Cycle 2


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      -          -
LD F2        45+  R3   2      -          -
MULTD F0     F2   F4   -      -          -
SUBD F8      F6   F2   -      -          -
DIVD F10     F0   F6   -      -          -
ADDD F6      F8   F2   -      -          -

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  No    -      -      -      -      -
-     Mult2  No    -      -      -      -      -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
2      FU  -      Load2  -   Load1    -      -      -    ...  -

Tomasulo Example Cycle 3


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          -
LD F2        45+  R3   2      -          -
MULTD F0     F2   F4   3      -          -
SUBD F8      F6   F2   -      -          -
DIVD F10     F0   F6   -      -          -
ADDD F6      F8   F2   -      -          -

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  Yes   MULTD  -      R(F4)  Load2  -
-     Mult2  No    -      -      -      -      -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
3      FU  Mult1  Load2  -   Load1    -      -      -    ...  -

Tomasulo Example Cycle 4


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          -
MULTD F0     F2   F4   3      -          -
SUBD F8      F6   F2   4      -          -
DIVD F10     F0   F6   -      -          -
ADDD F6      F8   F2   -      -          -

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  Yes   45+R3
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   Yes   SUBD   M(A1)  -      -      Load2
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  Yes   MULTD  -      R(F4)  Load2  -
-     Mult2  No    -      -      -      -      -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
4      FU  Mult1  Load2  -   M(A1)    Add1   -      -    ...  -

Load2 completing; what is waiting for Load2?

Tomasulo Example Cycle 5


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      -          -
SUBD F8      F6   F2   4      -          -
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   -      -          -

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
2     Add1   Yes   SUBD   M(A1)  M(A2)  -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
10    Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
-     Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
5      FU  Mult1  M(A2)  -   M(A1)    Add1   Mult2  -    ...  -

Tomasulo Example Cycle 6


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      -          -
SUBD F8      F6   F2   4      -          -
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   6      -          -

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
1     Add1   Yes   SUBD   M(A1)  M(A2)  -      -
-     Add2   Yes   ADDD   -      M(A2)  Add1   -
-     Add3   No    -      -      -      -      -
9     Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
-     Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
6      FU  Mult1  M(A2)  -   Add2     Add1   Mult2  -    ...  -

Tomasulo Example Cycle 7


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      -          -
SUBD F8      F6   F2   4      7          -
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   6      -          -

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   Yes   SUBD   M(A1)  M(A2)  -      -
-     Add2   Yes   ADDD   -      M(A2)  Add1   -
-     Add3   No    -      -      -      -      -
8     Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
-     Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
7      FU  Mult1  M(A2)  -   Add2     Add1   Mult2  -    ...  -

Add1 completing; what is waiting for it?

Tomasulo Example Cycle 8


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      -          -
SUBD F8      F6   F2   4      7          8
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   6      -          -

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
2     Add2   Yes   ADDD   (M-M)  M(A2)  -      -
-     Add3   No    -      -      -      -      -
7     Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
-     Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
8      FU  Mult1  M(A2)  -   Add2     (M-M)  Mult2  -    ...  -

Tomasulo Example Cycle 9


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      -          -
SUBD F8      F6   F2   4      7          8
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   6      -          -

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
1     Add2   Yes   ADDD   (M-M)  M(A2)  -      -
-     Add3   No    -      -      -      -      -
6     Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
-     Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
9      FU  Mult1  M(A2)  -   Add2     (M-M)  Mult2  -    ...  -

Tomasulo Example Cycle 10


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      -          -
SUBD F8      F6   F2   4      7          8
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   6      10         -

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
0     Add2   Yes   ADDD   (M-M)  M(A2)  -      -
-     Add3   No    -      -      -      -      -
5     Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
-     Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
10     FU  Mult1  M(A2)  -   Add2     (M-M)  Mult2  -    ...  -

Add2 completing; what is waiting for it?

Tomasulo Example Cycle 11


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      -          -
SUBD F8      F6   F2   4      7          8
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
4     Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
-     Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
11     FU  Mult1  M(A2)  -   (M-M+M)  (M-M)  Mult2  -    ...  -

Tomasulo Example Cycle 12


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      -          -
SUBD F8      F6   F2   4      7          8
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
3     Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
-     Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
12     FU  Mult1  M(A2)  -   (M-M+M)  (M-M)  Mult2  -    ...  -

Tomasulo Example Cycle 13


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      -          -
SUBD F8      F6   F2   4      7          8
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
2     Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
-     Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
13     FU  Mult1  M(A2)  -   (M-M+M)  (M-M)  Mult2  -    ...  -

Tomasulo Example Cycle 14


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      -          -
SUBD F8      F6   F2   4      7          8
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
1     Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
-     Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
14     FU  Mult1  M(A2)  -   (M-M+M)  (M-M)  Mult2  -    ...  -

Tomasulo Example Cycle 15


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      15         -
SUBD F8      F6   F2   4      7          8
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
0     Mult1  Yes   MULTD  M(A2)  R(F4)  -      -
-     Mult2  Yes   DIVD   -      M(A1)  Mult1  -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
15     FU  Mult1  M(A2)  -   (M-M+M)  (M-M)  Mult2  -    ...  -

Tomasulo Example Cycle 16


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      15         16
SUBD F8      F6   F2   4      7          8
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  No    -      -      -      -      -
40    Mult2  Yes   DIVD   M*F4   M(A1)  -      -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
16     FU  M*F4   M(A2)  -   (M-M+M)  (M-M)  Mult2  -    ...  -

Faster than light computation (skip a couple of cycles)

Tomasulo Example Cycle 55


Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD F6        34+  R2   1      3          4
LD F2        45+  R3   2      4          5
MULTD F0     F2   F4   3      15         16
SUBD F8      F6   F2   4      7          8
DIVD F10     F0   F6   5      -          -
ADDD F6      F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No    -
Load2  No    -
Load3  No    -

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
-     Add1   No    -      -      -      -      -
-     Add2   No    -      -      -      -      -
-     Add3   No    -      -      -      -      -
-     Mult1  No    -      -      -      -      -
1     Mult2  Yes   DIVD   M*F4   M(A1)  -      -

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
55     FU  M*F4   M(A2)  -   (M-M+M)  (M-M)  Mult2  -    ...  -
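The issue, execution-complete, and write-result cycles in the walkthrough can be cross-checked with a small timing model. The sketch below is not a full Tomasulo simulator. It assumes one in-order issue per cycle, no shortage of reservation stations or bus bandwidth, execution starting the cycle after the last operand is broadcast, and latencies (load 2, add/subtract 2, multiply 10, divide 40) inferred from the Time column of the tables. The function name `schedule` is hypothetical.

```python
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}  # inferred, see above

def schedule(program):
    """Return (issue, exec_complete, write_result) cycles for each instruction.

    program is a list of (op, dest, sources) in program order.
    """
    producer = {}     # register -> index of the instruction that will write it
    times = []
    for idx, (op, dest, srcs) in enumerate(program):
        issue = idx + 1                       # one instruction issues per cycle
        ready = issue
        for s in srcs:
            if s in producer:                 # operand tagged with a pending producer
                ready = max(ready, times[producer[s]][2])
        start = ready + 1                     # begin the cycle after operands arrive
        complete = start + LATENCY[op] - 1
        write = complete + 1
        times.append((issue, complete, write))
        producer[dest] = idx                  # renaming: later readers wait on this one
    return times

program = [
    ("LD", "F6", ["R2"]),
    ("LD", "F2", ["R3"]),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]
times = schedule(program)
for (op, dest, _), t in zip(program, times):
    print(op, dest, t)
```

Under these assumptions DIVD finishes executing in cycle 56, which is consistent with the one cycle remaining shown for Mult2 at cycle 55. Note that DIVD reads F6 from the first load, not from the later ADDD, because the producer table is consulted at issue time.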
