Chapter 4, Part II: The Processor
[Title-slide figure: the components of a computer (processor, memory, devices)]
Patterson and Hennessy's Computer Organization and Design, 4th Ed., Chapter 4, Part II
Agenda and Reading List
Chapter goals (4.5, p. 343)
Introduction (4.5, pp. 330-332 and 4.14, pp. 408-409)
MIPS pipelined datapath, registers, and stages (4.6, pp. 344-355)
Graphically representing pipelines (4.6, pp. 356-358)
Pipelining speedup formula (4.5, pp. 332-334)
Pipelined control (4.6, pp. 359-363)
Pipeline hazards (4.5, pp. 335-343)
Structural hazards solutions (4.5, pp. 335-336)
Data hazards solutions (4.5, pp. 336-339 and 4.7, pp. 363-375)
Control hazards solutions (4.5, pp. 339-343 and 4.8, pp. 375-384)
Designing instruction sets for pipelining (4.5, p. 335)
Exceptions (4.9, pp. 384-391)
Advanced instruction-level parallelism (4.10, pp. 391-403)
Fallacies and pitfalls (4.13, pp. 407-408)
Chapter Goals
to give an overview of pipelining and describe its benefits and complexities
to cover the concept of pipelining using the MIPS instruction subset
from the single-cycle implementation in Chapter 4, Part I and show a
simplified version of its pipeline
to provide datapath and control details for a pipelined
implementation created by modifying the single-cycle implementation
to understand the challenges of dealing with hazards
to look at the problems that pipelining introduces and the
performance attainable under typical situations
to explore the implementation of forwarding and stalls
to learn about solutions to branch hazards
to focus on the software and performance implications of pipelining
to learn how to design instruction sets for easy pipelining
to describe the implementation of exceptions in pipelined processors
to introduce advanced pipelining concepts, such as superscalar
execution and dynamic scheduling
Introduction
Pipelining is an implementation technique in which multiple
instructions are overlapped in execution
MIPS Pipelined Datapath
Separate the
datapath into
five stages
MIPS Pipelined Datapath (cont.)
MIPS instructions classically take five steps:
1. Fetch instruction from memory
2. Read registers while decoding the instruction
3. Execute the operation or calculate an address
4. Access an operand in data memory
5. Write the result into a register
The single-cycle design must allow for the slowest instruction, lw, so
the time required for every instruction is 800 ps, even though some
instructions can be as fast as 500 ps
All the pipeline stages take a single clock cycle, so the clock cycle
must be long enough to accommodate the slowest operation
The pipelined execution clock cycle must have the worst-case cycle of
200 ps, even though some stages take only 100 ps
So at steady state, an instruction completes every 200 ps
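The 800 ps versus 200 ps argument can be checked with a short sketch; the per-stage latencies below follow the book's example breakdown (200 ps for memory and the ALU, 100 ps for register read/write) and should be treated as assumptions if your edition differs:

```python
# Hedged sketch: single-cycle vs. pipelined instruction timing.
stage_ps = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

single_cycle_ps = sum(stage_ps.values())     # every instruction takes the full 800 ps
pipelined_cycle_ps = max(stage_ps.values())  # the clock must fit the slowest stage

print(single_cycle_ps)     # 800 ps per instruction in the single-cycle design
print(pipelined_cycle_ps)  # 200 ps between completed instructions at steady state
```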
Group Exercise (2)
Locate the single-cycle, multiple-cycle, and pipelined datapaths on
diagrams that show:
Group Exercise (2): Answer
Locate the single-cycle, multiple-cycle, and pipelined datapaths on
diagrams that show:
1. the clock rate versus instruction throughput
2. hardware trends (shared or specialized) versus instruction
latency
[Figure: two diagrams locating the single-cycle datapath (section 5.3), the multicycle datapath (section 5.4), and the pipelined datapath along a clock-rate axis (slower to faster) and a hardware axis (shared to specialized)]
Group Exercise (3)
Consider the following five-instruction sequence
lw $10, 20($1)
sub $11, $2, $3
add $12, $3, $4
lw $13, 24($1)
add $14, $5, $6
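The multiple-clock-cycle diagram for this hazard-free sequence can be sketched programmatically; the helper below is illustrative (its name and layout are our own), assuming each instruction simply enters IF one cycle after its predecessor:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Rows of (instruction, per-cycle stage labels), one column per clock cycle."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, instr in enumerate(instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i occupies stage s during cycle i + s
        rows.append((instr, row))
    return rows

program = ["lw $10, 20($1)", "sub $11, $2, $3", "add $12, $3, $4",
           "lw $13, 24($1)", "add $14, $5, $6"]

for instr, row in pipeline_diagram(program):
    print(f"{instr:16}" + "".join(f"{c:>4}" for c in row))
```

Five instructions on five stages fill 5 + 5 - 1 = 9 clock cycles, matching the diagonal shape of the book's diagrams.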
Group Exercise (3): Answer
1. Show the multiple-clock-cycle pipeline diagram for these instructions
Stylized version
Group Exercise (3): Answer (cont.)
Traditional version
Group Exercise (3): Answer (cont.)
2. Show the single-clock-cycle pipeline diagram corresponding to
clock cycle 5 of executing the above instruction sequence
Pipelining Speedup Formula
Pipeline latency: number of stages in a pipeline
If the stages are perfectly balanced, then the time between
instructions on the pipelined processor, assuming ideal conditions, is:
Time between instructions (pipelined) =
Time between instructions (nonpipelined) / Number of pipe stages
If all stages take about the same amount of time and there is enough
work to do, then the speedup due to pipelining is equal to the
number of stages in the pipeline
That is, under ideal conditions and with a large number of
instructions:
Speedup from pipelining ≈ Number of pipe stages
Practically, the time per instruction in the pipelined processor will
exceed the minimum possible, and speedup will be less than the
number of pipeline stages because
the stages may be imperfectly balanced,
pipelining involves overhead (e.g., pipeline registers), and
pipeline hazards exist
Pipelining Speedup Formula (cont.)
At the beginning and end of the workload, the pipe is not totally full
This start-up and wind-down affects performance when the number
of tasks is not large compared to the number of stages in the
pipeline
If the number of tasks is much larger than the number of pipeline
stages, then the stages will be full most of the time and the
increase in throughput will be very close to the number of stages
If m instructions are executed on an n-stage pipeline with clock cycle time T, then:
Number of clock cycles to execute the instructions = m + (n-1)
Time to execute the instructions = [m + (n-1)]*T
If m >> n, then:
Number of clock cycles to execute the instructions ≈ m
Time to execute the instructions ≈ m*T
Pipelining improves performance by increasing instruction throughput,
as opposed to decreasing the inherent execution time (latency) of an
instruction, but instruction throughput is the more important metric
because real programs execute billions of instructions
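The cycle-count arithmetic above can be captured in a minimal sketch, assuming perfectly balanced stages (each stage takes the same time T, so a nonpipelined instruction takes n*T):

```python
# Hedged sketch: m instructions on an n-stage pipeline take m + (n - 1)
# cycles, versus m * n cycles (of the same stage time) nonpipelined.
def pipeline_speedup(m, n):
    nonpipelined_cycles = m * n
    pipelined_cycles = m + (n - 1)
    return nonpipelined_cycles / pipelined_cycles

print(round(pipeline_speedup(3, 5), 2))          # 2.14: start-up dominates
print(round(pipeline_speedup(1_000_000, 5), 2))  # 5.0: m >> n approaches n
```

With only 3 instructions the fill/drain cycles dominate; with a million instructions the speedup is essentially the number of stages, as the text states.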
Group Exercise (4)
Consider the following code executed on the pipeline in Exercise (1):
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
Group Exercise (4): Answer
2. Graphically represent the pipelined execution of the above
instructions using the multiple-clock-cycle pipeline diagram
Group Exercise (4): Answer (cont.)
3. Calculate the time between the first and fourth instructions in
the nonpipelined design
3 x 800 ps = 2400 ps
4. Calculate the time between the first and fourth instructions in
the pipelined design
3 x 200 ps = 600 ps
This pipelining still offers a fourfold performance improvement
5. Calculate the time needed to execute three lw instructions in the
pipelined design
From the diagram, it is 1400 ps
6. Calculate the pipelining speedup in case of executing three lw
instructions
Speedup (pipelined) = (3 × 800) / 1400 ≈ 1.71 (< 4)
The number of instructions is not large enough for the speedup to reach 4
Group Exercise (4): Answer (cont.)
7. Calculate the pipelining speedup in case of executing 1,000,003
lw instructions
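A hedged check of this exercise, reusing the exercise's 800 ps nonpipelined instruction time, the 200 ps pipelined cycle, and the m + (n - 1) cycle count from the speedup discussion:

```python
# Speedup for a very large number of lw instructions: the 4-cycle
# start-up cost is amortized and the speedup approaches 800/200 = 4.
m = 1_000_003
nonpipelined_ps = m * 800
pipelined_ps = (m + 4) * 200   # m + (n - 1) cycles, n = 5 stages
speedup = nonpipelined_ps / pipelined_ps
print(round(speedup, 2))       # effectively 4
```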
Pipeline Registers
Pipeline Registers (cont.)
We must add registers between pipeline stages to allow datapaths
and functional units to be shared by different instructions during
different stages while retaining the value of an individual instruction
for its usage during the following stages
All instructions advance during each clock cycle from one pipeline
register to the next
Pipeline Registers (cont.)
There is no pipeline register at the end of the write-back stage
All instructions must update some state in the processor (the
register file, memory, or the PC)
A separate pipeline register would be redundant with the state that is
updated
Registers are named for the two stages separated by that register
Individual Exercise (5): Answer
Why is the PC not considered among the pipeline registers
although it feeds the IF stage of the pipeline?
Pipeline Stages
All instructions pass through 5 stages even though not all of them need
5 cycles
As every instruction behind the one being executed would be in
progress, there is no way to accelerate them
An instruction passes through a stage even if there is nothing to
do as later instructions are already progressing at the maximum
rate
1. Instruction fetch (IF):
The instruction is read from memory using the address in the
PC and then placed in the IF/ID pipeline register
The PC address is incremented by 4 and then written back into
the PC to be ready for the next clock cycle
The incremented PC is also saved in the IF/ID in case it is
needed later for an instruction, such as beq
A portion of IF/ID plays the role of the IR (instruction register)!
The computer cannot know which type of instruction is being
fetched, so it must prepare for any instruction, passing
potentially needed information down the pipeline
Pipeline Stages (cont.)
2. Instruction decode and register file read (ID):
The instruction portion of the IF/ID pipeline register supplies
the 16-bit immediate field, which is sign-extended to 32 bits,
and the register numbers to read the two registers
All three values are stored in ID/EX, along with the
incremented PC
We again transfer everything that might be needed by any
instruction during a later clock cycle
Register rt number is saved in ID/EX in case it is needed
later by lw
Pipeline Stages (cont.)
3. Execute or address calculation (EX):
lw: reads the contents of register rs and the sign-extended
immediate from the ID/EX and adds them using the ALU
that sum is placed in the EX/MEM
register rt number is passed from ID/EX to EX/MEM
sw: the effective address is placed in the EX/MEM
register rt value is passed from ID/EX to EX/MEM to be used
in the next stage
R-type: reads the contents of registers rs and rt from the
ID/EX and performs the desired function using the ALU
the result is stored in the EX/MEM
register rd number is passed from the ID/EX to EX/MEM
beq: reads the contents of registers rs and rt from the ID/EX
and performs the equal compare function using the ALU
the zero signal is stored in the EX/MEM
Pipeline Stages (cont.)
4. Memory access (MEM):
lw: reads the data memory using the address from EX/MEM
and loads the data into the MEM/WB
register rt number is passed from EX/MEM to MEM/WB
sw: data is written to memory
R-type: the ALU output is passed from EX/MEM to MEM/WB
beq: the next PC is set according to the zero signal and the branch
target address read from EX/MEM
5. Write back (WB):
lw: reads the data from the MEM/WB and writes it into the
register file using register rt number stored in MEM/WB
sw: nothing to be done
R-type: writes the ALU output from the MEM/WB into the
register number rd (read from MEM/WB also)
beq: nothing to be done
Home Exercise (6)
Work out the lw example in Figures 4.36 through 4.38
step by step!
Pipelined Control
Pipelined Control (cont.)
To specify control for the pipeline, we need only to set the control
values during each pipeline stage
Because each control line is associated with a component active in
only a single pipeline stage, we can divide the control lines into five
groups according to the pipeline stage (refer to Chapter 4, Part I)
IF: there is nothing special to control in the pipeline stage
The control signals to read instruction memory and to write
the PC are always asserted
ID: there are no optional lines to set
The same thing happens at every clock cycle
EX: set signals RegDst, ALUOp, and ALUSrc to select the result
register, the ALU operation, and either read data 2 (register rt)
or a sign-extended immediate for the ALU
MEM: set signals Branch, MemRead, and MemWrite by beq, lw,
and sw instructions, respectively
WB: control signal MemtoReg decides between sending the ALU
result or the memory value to the register file, and control signal
RegWrite writes the chosen value
Pipelined Control (cont.)
There are no separate writing signals for the pipeline registers IF/ID,
ID/EX, EX/MEM, and MEM/WB as they are written during each cycle
Implementing control means setting the control lines in each stage
for each instruction
The simplest way to do this is to extend the pipeline registers to
include control information
As the control lines start with the EX stage, we can create the control
information during instruction decode and then place it in ID/EX
The control lines for each pipeline stage are used, and the remaining
control lines are then passed to the next pipeline stage
These control signals are then used in the appropriate pipeline stage
as the instruction moves down the pipeline
Sequencing of control in pipeline processors is embedded in the
pipeline structure itself:
all instructions take the same number of clock cycles, so there is
no special control of instruction duration
all control information is computed during instruction decode, and
then passed along by the pipeline registers
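The stage-by-stage grouping above can be mimicked in a minimal software model. The dictionary layout below is purely illustrative (not the book's hardware), while the lw and sw signal values follow the single-cycle control of Chapter 4, Part I:

```python
# Control signal values grouped by the pipeline stage that consumes them.
CONTROL = {
    "lw": {"EX":  {"RegDst": 0, "ALUOp": "00", "ALUSrc": 1},
           "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
           "WB":  {"RegWrite": 1, "MemtoReg": 1}},
    "sw": {"EX":  {"RegDst": None, "ALUOp": "00", "ALUSrc": 1},  # RegDst: don't care
           "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
           "WB":  {"RegWrite": 0, "MemtoReg": None}},            # MemtoReg: don't care
}

def decode(opcode):
    # ID stage: create all control information and place it in ID/EX
    return dict(CONTROL[opcode])

id_ex = decode("lw")                            # ID/EX carries EX + MEM + WB groups
ex_mem = {k: id_ex[k] for k in ("MEM", "WB")}   # EX group consumed, the rest move on
mem_wb = {k: ex_mem[k] for k in ("WB",)}        # MEM group consumed
print(mem_wb["WB"])                             # {'RegWrite': 1, 'MemtoReg': 1}
```

Each "pipeline register" dictionary simply drops the group that its stage consumes and forwards the rest, which is exactly the sequencing the slide describes.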
Pipeline Hazards
Pipeline hazards: situations in pipelining when the next instruction
cannot execute in the following clock cycle
Pipeline Hazards (cont.)
2. Data hazards: occur when the pipeline must be stalled because
one step must wait for another to complete
Arise from the dependence of one instruction on an earlier one
that is still in the pipeline
When an instruction depends on the results of a previous one
still in the pipeline, the pipeline should be stalled, i.e., bubbles
should be added to the pipeline
Performance bottlenecks in both integer and floating-point
programs
Often it is easier to deal with in floating-point programs
because the lower branch frequency and more regular
memory access patterns allow the compiler to try to schedule
instructions to avoid hazards
It is more difficult to perform such optimizations in integer
programs that have less regular memory access, involving
more use of pointers
Pipeline Hazards (cont.)
3. Control hazards: arise from the need to make a decision based on
the results of one instruction while others are executing, e.g.,
branches
Also called branch hazards
Happen when the proper instruction cannot execute in the
proper pipeline clock cycle because the instruction that was
fetched is not the one that is needed; that is, the flow of
instruction addresses is not what the pipeline expected
Notice that we must begin fetching the instruction following the
branch on the very next clock cycle
Nevertheless, the pipeline cannot possibly know what the next
instruction should be, since it only just received the branch
instruction from memory
Usually more of a problem in integer programs, which tend to
have higher branch frequencies as well as less predictable
branches
Structural Hazards Solutions
Each logical component of the datapath, such as instruction
memory, register read ports, ALU, data memory, and register write
ports, can be used only within a single pipeline stage
Otherwise, we would have a structural hazard
Hence, these components, and their control, can be associated with
a single pipeline stage
Proper ISA design
Designing instruction sets for pipelining (as will be explained
later) makes it fairly easy to avoid structural hazards when
designing a pipeline
Example:
without two memories, our pipeline could have a structural
hazard
suppose we had a single memory instead of two memories
we could see that in one clock cycle, the first instruction is
accessing data from memory while the fourth instruction is
fetching an instruction from that same memory
Data Hazards Solutions
Compiler support:
Code reordering: compilers can follow an instruction with
others that are independent of it to prevent sequences that
result in data hazards
When the instruction generating the data is a load, this
technique is called delayed loads
Data Hazards Solutions (cont.)
Compiler support (cont.):
However, data hazards happen just too often and the delay is just
too long to expect the compiler to rescue us from this dilemma
Although the compiler generally relies upon the hardware to
resolve hazards and thereby ensure correct execution, the
compiler must understand the pipeline to achieve the best
performance
Otherwise, unexpected stalls will reduce the performance of
the compiled code
Group Exercise (7)
Consider the following code sequence:
Individual Exercise (8): Answer
add $s0, $t0, $t1
sub $t2, $s0, $t3
The add instruction does not write its result until the 5th stage
As soon as the ALU creates the sum for the add, it is supplied
as an input for the sub, replacing the $s0 value read in the 2nd
stage of the sub
Group Exercise (9)
Consider the following code segment in C:
a = b + e
c = b + f
Here is the generated MIPS code for this segment, assuming all variables
are in memory and are addressable as offsets from $t0:
Group Exercise (9) (cont.)
1. Find the hazards in the above code segment
2. Reorder the instructions to avoid any pipeline stalls
3. Calculate the number of clock cycles needed to complete the
reordered sequence on a pipelined processor with forwarding
relative to the original version
Group Exercise (9): Answer
lw $t1, 0($t0) # b is saved at 0($t0)
lw $t2, 4($t0) # e is saved at 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0) # a is to be saved at 12($t0)
lw $t4, 8($t0) # f is saved at 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0) # c is to be saved at 16($t0)
Group Exercise (9): Answer
2. Reorder the instructions to avoid any pipeline stalls
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
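A hedged check of part 3, assuming forwarding resolves everything except a load immediately followed by a dependent instruction (one stall cycle each); the tuple encoding of instructions is our own:

```python
def cycles(seq, n_stages=5):
    # One stall whenever a lw is immediately followed by an instruction
    # that reads the loaded register (load-use hazard); forwarding
    # covers every other dependence in this code.
    stalls = 0
    for (op, dst, _), (_, _, srcs) in zip(seq, seq[1:]):
        if op == "lw" and dst in srcs:
            stalls += 1
    return len(seq) + (n_stages - 1) + stalls

original = [("lw", "$t1", ("$t0",)),
            ("lw", "$t2", ("$t0",)),
            ("add", "$t3", ("$t1", "$t2")),
            ("sw", "$t3", ("$t3", "$t0")),
            ("lw", "$t4", ("$t0",)),
            ("add", "$t5", ("$t1", "$t4")),
            ("sw", "$t5", ("$t5", "$t0"))]
reordered = [original[i] for i in (0, 1, 4, 2, 3, 5, 6)]

print(cycles(original), cycles(reordered))  # 13 11
```

The original sequence has two load-use stalls (each lw feeds the very next add), so reordering saves two cycles: 11 instead of 13.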
Individual Exercise (10): Answer
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Data Hazards Solutions (cont.)
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
In this case, the result is forwarded from the MEM stage because
the result in the MEM stage is the more recent result
Data Hazards Solutions (cont.)
Data forwarding (cont.):
MEM hazard:
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
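The EX- and MEM-hazard conditions can be sketched as a small function. Field names mirror the slides' pipeline-register notation, and testing whether ForwardA/ForwardB is still "00" stands in for the "not (EX hazard)" clause of the MEM-hazard test:

```python
def forward(ex_mem, mem_wb, id_ex):
    fwd_a = fwd_b = "00"                 # default: operand comes from ID/EX
    # EX hazard: the most recent result (still in EX/MEM) wins
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex["Rs"]:
        fwd_a = "10"
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex["Rt"]:
        fwd_b = "10"
    # MEM hazard: only if the EX hazard did not already fire
    if (mem_wb["RegWrite"] and mem_wb["Rd"] != 0
            and fwd_a == "00" and mem_wb["Rd"] == id_ex["Rs"]):
        fwd_a = "01"
    if (mem_wb["RegWrite"] and mem_wb["Rd"] != 0
            and fwd_b == "00" and mem_wb["Rd"] == id_ex["Rt"]):
        fwd_b = "01"
    return fwd_a, fwd_b

# sub $2, $1, $3 followed by and $12, $2, $5: $2 is forwarded from EX/MEM
print(forward({"RegWrite": 1, "Rd": 2}, {"RegWrite": 0, "Rd": 0},
              {"Rs": 2, "Rt": 5}))       # ('10', '00')
```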
Individual Exercise (11)
For the following instruction sequence, identify dependences and
show how forwarding copes with them:
Individual Exercise (11): Answer
For the following instruction sequence, identify dependences and
show how forwarding copes with them:
Individual Exercise (12)
When summing a vector of numbers in a single register, a
sequence of instructions will all read and write to the same register
Individual Exercise (12): Answer
For the second add, $1 is available in the MEM stage
EX/MEM.RegisterRd = ID/EX.RegisterRs
EX/MEM.RegisterRd = ID/EX.RegisterRs
MEM/WB.RegisterRd = ID/EX.RegisterRs
Individual Exercise (13)
Consider the following code sequence:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
Individual Exercise (13): Answer
1. Find the hazard in this code:
The hazard occurs on $t2 between the second lw and the first
sw
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
Individual Exercise (14)
Suppose that we have a load instruction followed immediately by a
subtract that uses the loaded word:
lw $s0, 20($t1)
sub $t2, $s0, $t3
Individual Exercise (14): Answer
Suppose that we have a load instruction followed immediately by a
subtract that uses the loaded word:
lw $s0, 20($t1)
sub $t2, $s0, $t3
Show what pipeline stages would be connected by forwarding
$s0 would be available only after the fourth stage of the first
instruction, which is too late for the input of the third stage of
the sub!
We would have to stall one stage for the load-use data hazard
Data Hazards Solutions (cont.)
Stalls (bubbles) (cont.):
We need a hazard detection unit that operates during the ID
stage to insert the stall between the load and its use
To check for loads, the control for the hazard detection unit is:
if (ID/EX.MemRead
and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
or (ID/EX.RegisterRt = IF/ID.RegisterRt))) stall the pipeline
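The load-use condition above as a one-function sketch (field names again mirror the slides' notation):

```python
def must_stall(id_ex, if_id):
    # Stall when a load in EX will write the register that the
    # instruction currently in ID wants to read.
    return bool(id_ex["MemRead"]
                and id_ex["Rt"] in (if_id["Rs"], if_id["Rt"]))

# lw $2, 20($1) in EX while and $4, $2, $5 is in ID: one-cycle stall
print(must_stall({"MemRead": 1, "Rt": 2}, {"Rs": 2, "Rt": 5}))  # True
```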
Individual Exercise (15)
Consider the following code sequence:
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
Individual Exercise (15): Answer
1. Show that a hazard in the above code sequence cannot be solved
by forwarding
The data is being read from memory in clock cycle 4 while the ALU
is performing the operation for the following instruction
Since the dependence between the lw and the following AND goes
backward in time, this hazard cannot be solved by forwarding
Individual Exercise (15): Answer (cont.)
2. Show how inserting bubbles could handle data hazards in this
code sequence
Control Hazards Solutions
Stalls on branches (bubbles):
One possible solution for control hazards is to stall immediately
after we fetch a branch, waiting until the pipeline determines the
outcome of the branch and knows what instruction address to
fetch from
Individual Exercise (16): Answer
Estimate the cost of the control hazard in the following code sequence:
Assume that:
All other instructions have a CPI of 1 and
We put in enough extra hardware so that we can test registers,
calculate the branch address, and update the PC during the
second stage (i.e., the pipeline is stalled for only 1 cycle)
Individual Exercise (17): Answer
Given that branches are 17% of the instructions executed in
SPECint2006, estimate the impact on the clock cycles per
instruction (CPI) of stalling on branches
Assume that:
All other instructions have a CPI of 1
We put in enough extra hardware so that we can test registers,
calculate the branch address, and update the PC during the
second stage (i.e., the pipeline is stalled for only 1 cycle)
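The arithmetic for this estimate, assuming every branch costs exactly one stall cycle and every other instruction completes with CPI 1:

```python
# Hedged worked arithmetic for the exercise above.
branch_fraction = 0.17          # branches in SPECint2006, per the exercise
base_cpi = 1.0                  # CPI of all other instructions
stall_cycles_per_branch = 1     # one-cycle stall with the early branch decision

cpi = base_cpi + branch_fraction * stall_cycles_per_branch
print(round(cpi, 2))            # 1.17
```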
Control Hazards Solutions (cont.)
Early decision to reduce the delay of branches:
One way to improve branch performance is to reduce the cost
(delay) of the taken branch
Control Hazards Solutions (cont.)
Early decision (cont.):
Moving the branch decision up requires 2 actions to occur earlier:
1. Move up the branch address calculation:
Easy part
We already have the PC value and the immediate field in
the IF/ID register, so we just move the branch adder from
the EX stage to the ID stage
Of course, the branch target address calculation will be
performed for all instructions, but only used when needed
2. Move up the branch decision:
The harder part
For beq, we could compare the two registers read during
the ID stage to see if they are equal
Equality can be tested by first XORing their respective bits
and then ORing all the results (faster than using the ALU)
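The XOR-then-OR equality test can be sketched as follows; the 32-bit mask models the fixed register width:

```python
def registers_equal(a, b, width=32):
    mask = (1 << width) - 1
    xor_bits = (a ^ b) & mask   # bit i is 1 exactly where the operands differ
    # Hardware would OR all 32 XOR outputs; a zero OR result means equal
    return xor_bits == 0

print(registers_equal(7, 7))   # True
print(registers_equal(7, 9))   # False
```

This avoids the ALU's carry chain entirely, which is why it is faster than a subtract-and-test-zero comparison.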
Control Hazards Solutions (cont.)
Early decision (cont.):
Moving the branch test to the ID stage implies additional hazard
detection and forwarding hardware, as a branch dependent on a
result still in the pipeline must work properly with this optimization
To implement beq or bne, we will need to forward results to
the equality test logic that operates during the ID stage
There are two complication factors:
1. During the ID stage, we must decode the instruction, decide
whether a bypass to the equality unit is needed, and complete
the equality comparison so that if the instruction is a branch,
we can set the PC to the branch target address
Forwarding for the operands of branches was formerly
handled by the ALU forwarding logic, but the introduction
of the equality test unit in the ID stage will require new
forwarding logic
Note that the bypassed source operands of a branch can
come from either the EX/MEM or MEM/WB registers
Control Hazards Solutions (cont.)
Early decision (cont.):
2. Because the values in a branch comparison are needed during
the ID stage but may be produced later in time, it is possible
that a data hazard can occur and a stall will be needed
For example, if an ALU instruction immediately preceding
a branch produces one of the operands for the
comparison in the branch, a stall will be required, since
the EX stage for the ALU instruction will occur after the
ID cycle of the branch
By extension, if a load is immediately followed by a
conditional branch that depends on the load result, two stall
cycles will be needed, as the result from the load appears
at the end of the MEM cycle but is needed at the
beginning of the ID cycle for the branch
Control Hazards Solutions (cont.)
Early decision (cont.):
Despite these difficulties, moving the branch execution to the ID
stage is an improvement, because it reduces the penalty of a
branch to only one instruction if the branch is taken, namely, the
one currently being fetched
Control Hazards Solutions (cont.)
Early decision
Home Exercise (18)
Work out the example page 378 step by step!
Control Hazards Solutions (cont.)
Branch prediction:
Branch prediction: predict the outcome of the branch
instruction and proceed from that assumption rather than waiting
to ascertain the actual outcome
Control Hazards Solutions (cont.)
Branch prediction (cont.):
There are three ways to do branch prediction:
1. Assume branch not taken: always predict branches to fail
and continue execution down the sequential instruction flow
Pipeline is not slowed down when the branch is not taken
If the branch is taken, the instructions that are being
fetched and decoded must be discarded and execution
continues at the branch target
This is equivalent to a stall; that is, only when branches
are taken does the pipeline stall
To discard instructions:
Change original control values to 0s in the IF, ID, and
EX stages when the branch reaches the MEM stage
Discarding instructions here means we must be able
to flush instructions in the IF, ID, and EX stages
For load-use stalls, we just change control to 0 in the
ID stage and let them percolate through the pipeline
Control Hazards Solutions (cont.)
Branch prediction (cont.):
2. A more sophisticated branch prediction would have some
branches predicted as taken and some as untaken
As an example, at the bottom of loops are branches
that jump back to the top of the loop
Since they are likely to be taken and they branch
backwards, we could always predict taken for branches
that jump to an earlier address
3. Dynamic hardware predictors: make guesses depending
on the behavior of each branch and may change predictions
for a branch over the life of a program
In an aggressive pipeline, a simple static prediction
scheme will probably waste too much performance
With more hardware, it is possible to try to predict
branch behavior during program execution
Dynamic predictors increase in popularity as the
transistors per chip increase in count
Group Exercise (19)
Assume that branches are predicted to be not taken
Show the pipeline when the branch in the following code sequence
is not taken and when it is taken:
Group Exercise (19): Answer
Assume that branches are predicted to be not taken
Show the pipeline when the branch in the following code sequence
is not taken and when it is taken:
[Figure: pipeline diagrams for the not-taken and taken cases]
Control Hazards Solutions (cont.)
Dynamic branch prediction:
One popular approach for dynamic prediction of branches is
keeping a history for each branch as taken or untaken, and then
using the recent past behavior to predict the future
The amount and type of history kept have become extensive
The result has been that dynamic branch predictors can
correctly predict branches with more than 90% accuracy
One approach to do this is to look up the address of the
instruction to see if a branch was taken the last time this
instruction was executed, and, if so, to begin fetching new
instructions from the same place as the last time
One implementation of that approach is a branch prediction
buffer or branch history table
A branch prediction buffer is a small, special memory indexed by
the lower portion of the address of the branch instruction during
the IF stage
Control Hazards Solutions (cont.)
Dynamic branch prediction (cont.):
The branch prediction buffer contains a bit that says whether the
branch was recently taken or not
This is the simplest sort of buffer
We do not know, in fact, if the prediction is the right one as it
may have been put there by another branch that has the same
low-order address bits
However, this does not affect correctness
Prediction is just a hint that we hope is correct, so fetching
begins in the predicted direction
If the hint turns out to be wrong, the incorrectly predicted
instructions are deleted, the prediction bit is inverted and
stored back, and the proper sequence is fetched and executed
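The indexing and aliasing behavior described above can be sketched in Python. This is a minimal sketch, assuming a hypothetical 16-entry table indexed by low-order bits of the word-aligned PC; the two branch addresses are chosen so they collide:

```python
TABLE_BITS = 4

class OneBitBHT:
    def __init__(self):
        self.table = [0] * (1 << TABLE_BITS)   # one bit per entry: 1 = taken last time

    def index(self, pc):
        # low-order bits of the (word-aligned) branch address select the entry
        return (pc >> 2) & ((1 << TABLE_BITS) - 1)

    def predict(self, pc):
        return self.table[self.index(pc)] == 1

    def update(self, pc, taken):
        self.table[self.index(pc)] = 1 if taken else 0

bht = OneBitBHT()
pc_a, pc_b = 0x0040, 0x0080      # hypothetical branch addresses that alias
bht.update(pc_a, True)           # branch A was taken ...
hint_for_b = bht.predict(pc_b)   # ... so branch B now reads A's bit
```

An aliased hint can be wrong, but as the slides note it only costs a flush, never incorrect results.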
This simple 1-bit prediction scheme has a performance
shortcoming: even if a branch is almost always taken, we can
predict incorrectly twice, rather than once, when it is not taken
Class Exercise (20)
Consider a loop branch that branches nine times in a row, then it is
not taken once
What is the prediction accuracy for this branch, if it is using a
single bit for prediction?
Assume the prediction bit for this branch remains in the prediction
buffer
Class Exercise (20): Answer
Consider a loop branch that branches nine times in a row, then it is
not taken once
What is the prediction accuracy for this branch, if it is using a
single bit for prediction?
Assume the prediction bit for this branch remains in the prediction
buffer
In steady state, the predictor mispredicts twice per ten executions:
once on the not-taken branch, and once on the first taken branch
after it, so the prediction accuracy is 80%
Control Hazards Solutions (cont.)
Dynamic branch prediction (cont.):
By using 2 bits rather than 1, a branch that strongly favors
taken or not taken (as many branches do) will be mispredicted
only once
The 2 bits are used to encode the four states in the system
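The 1-bit shortcoming and the 2-bit fix can be checked with a small simulation; this is a sketch, with one common counter convention (state 3 = strongly taken, predict taken when the counter is 2 or more) and the loop pattern from the class exercise:

```python
def accuracy_1bit(outcomes):
    pred, correct = True, 0          # single prediction bit, starts as "taken"
    for actual in outcomes:
        correct += (pred == actual)
        pred = actual                # remember only the last outcome
    return correct / len(outcomes)

def accuracy_2bit(outcomes):
    state, correct = 3, 0            # 2-bit saturating counter, 3 = strongly taken
    for actual in outcomes:
        correct += ((state >= 2) == actual)
        state = min(state + 1, 3) if actual else max(state - 1, 0)
    return correct / len(outcomes)

# loop branch: taken nine times in a row, then not taken once
pattern = ([True] * 9 + [False]) * 100
acc1 = accuracy_1bit(pattern)   # ~0.80: two mispredictions per loop execution
acc2 = accuracy_2bit(pattern)   # 0.90: only the final not-taken mispredicts
```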
Control Hazards Solutions (cont.)
Advanced dynamic branch prediction techniques
Branch target buffer: a cache to hold the destination PC or
destination instruction
A branch predictor tells us whether or not a branch is taken
We still require the calculation of the branch target
In our pipeline, this calculation takes one cycle, meaning that
taken branches will have a 1-cycle penalty
A branch target buffer is one approach to eliminating that penalty
Control Hazards Solutions (cont.)
Advanced dynamic branch prediction techniques (cont.)
Correlating predictor: a branch predictor that combines local
behavior of a particular branch and global information about the
behavior of some recent number of executed branches
Yields greater prediction accuracy for the same number of
prediction bits
A 2-bit dynamic branch predictor scheme uses only
information about a particular branch
A typical correlating predictor might have 2-bit predictors for
each branch, with the choice between predictors made based
on whether the last executed branch was taken or not taken
Thus, the global branch behavior can be thought of as adding
additional index bits for the prediction lookup
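A minimal sketch of a correlating predictor, assuming a single global history bit and 2-bit counters (the initial counter value of 1, weakly not taken, is a convention for this sketch, not from the slides). The test branch alternates taken/not taken, a pattern that is perfectly correlated with the previous outcome:

```python
class CorrelatingPredictor:
    """One 2-bit counter per (branch PC, last global outcome) pair:
    the global history contributes an extra index bit, as described above."""
    def __init__(self):
        self.counters = {}   # (pc, global_bit) -> saturating counter 0..3
        self.ghist = 0       # outcome of the most recently executed branch

    def predict(self, pc):
        return self.counters.get((pc, self.ghist), 1) >= 2

    def update(self, pc, taken):
        key = (pc, self.ghist)
        c = self.counters.get(key, 1)
        self.counters[key] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.ghist = 1 if taken else 0

cp, misses = CorrelatingPredictor(), 0
for i in range(100):
    actual = (i % 2 == 0)            # T, N, T, N, ...
    misses += (cp.predict(0x40) != actual)
    cp.update(0x40, actual)
# after one warm-up miss, every prediction is correct;
# a plain per-branch 2-bit counter would miss on roughly every outcome here
```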
Control Hazards Solutions (cont.)
Advanced dynamic branch prediction techniques (cont.)
Tournament predictor: uses multiple predictions, tracking, for
each branch, which predictor yields the best result
A typical tournament predictor might contain two predictors
for each branch index: one based on local information and one
based on global branch behavior
A selector would choose which predictor to use for any
prediction
The selector can operate similarly to a 1- or 2-bit predictor,
favoring whichever of the two predictors has been more
accurate
Some recent microprocessors use such elaborate predictors
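The selector mechanism can be sketched as follows; this is a toy model under stated assumptions (a per-branch 2-bit local predictor, a global predictor indexed by the last outcome, and a 2-bit selector that moves toward whichever side was right when they disagree):

```python
class TwoBit:                         # local: one 2-bit counter per branch
    def __init__(self): self.c = {}
    def predict(self, pc): return self.c.get(pc, 1) >= 2
    def update(self, pc, taken):
        v = self.c.get(pc, 1)
        self.c[pc] = min(v + 1, 3) if taken else max(v - 1, 0)

class GlobalHist:                     # global: one 2-bit counter per last outcome
    def __init__(self): self.c, self.h = {}, 0
    def predict(self, pc): return self.c.get(self.h, 1) >= 2
    def update(self, pc, taken):
        v = self.c.get(self.h, 1)
        self.c[self.h] = min(v + 1, 3) if taken else max(v - 1, 0)
        self.h = int(taken)

class Tournament:
    def __init__(self):
        self.local, self.glob = TwoBit(), GlobalHist()
        self.sel = 1                  # 2-bit selector; >= 2 favors the global side

    def predict(self, pc):
        return self.glob.predict(pc) if self.sel >= 2 else self.local.predict(pc)

    def update(self, pc, taken):
        lp, gp = self.local.predict(pc), self.glob.predict(pc)
        if lp != gp:                  # move selector toward the predictor that was right
            self.sel = min(self.sel + 1, 3) if gp == taken else max(self.sel - 1, 0)
        self.local.update(pc, taken)
        self.glob.update(pc, taken)

t, misses = Tournament(), 0
for i in range(200):
    actual = (i % 2 == 0)            # alternating branch: the global side wins
    misses += (t.predict(0x40) != actual)
    t.update(0x40, actual)
```

After two warm-up misses the selector saturates toward the global predictor, which then predicts the alternating pattern perfectly.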
Control Hazards Solutions (cont.)
Delayed branches (decisions):
Delayed branch: always executes the following instruction, but
the second one following the branch will be affected by the branch
The branch takes place after that one instruction delay
Compilers and assemblers try to place an instruction that always
executes after the branch in the branch delay slot
Branch delay slot: the slot directly after a delayed branch
instruction, which in the MIPS architecture is filled by an
instruction that does not affect the branch
The job of the software is to make the successor instructions valid
and useful
MIPS software places an instruction immediately after the delayed
branch instruction that is not affected by the branch, and a taken
branch changes the address of the instruction that follows this
safe instruction
Individual Exercise (21): Answer
Assume that the branch decision can be made in the second stage
Show how to use a delayed branch to avoid the control hazard in the
following code sequence
The add instruction before the branch does not affect the branch,
so we move it to the delayed branch slot following the branch
Thus, the single pipe bubble has been replaced by add
[Figure: pipelined execution in program order, 2 ns per stage; each
instruction passes through Instruction fetch, Reg, ALU, Data access,
Reg, starting 2 ns after its predecessor:
beq $1, $2, 40
add $4, $5, $6 (delayed branch slot)
lw $3, 300($0)]
Control Hazards Solutions (cont.)
Conditional move instructions:
One way to reduce the number of conditional branches
Instead of changing the PC with a conditional branch, the
instruction conditionally changes the destination register of the
move
If the condition fails, the move acts as a nop
The MIPS instruction set architecture has two such instructions:
movn: move if not zero
movz: move if zero
The ARM instruction set has a condition field in most instructions
ARM programs can therefore have fewer conditional branches than
MIPS programs
Individual Exercise (22)
Explain how the following conditional move instruction works
Individual Exercise (22): Answer
Explain how the following conditional move instruction works
The instruction copies the contents of register $11 into register $8,
provided that the value in register $4 is nonzero
Otherwise, it does nothing
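The movn behavior from the exercise can be mimicked in Python; register values here are illustrative, not from the slides:

```python
def movn(rd_val, rs_val, rt_val):
    # movn rd, rs, rt: rd receives rs when rt != 0; otherwise rd is
    # unchanged, i.e. the instruction acts as a nop
    return rs_val if rt_val != 0 else rd_val

r8, r11, r4 = 0, 42, 5
r8 = movn(r8, r11, r4)    # $4 is nonzero: $8 receives $11's value
```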
Group Exercise (23)
Calculate the average CPI for a pipelined implementation
Group Exercise (23): Answer
Loads take 1 clock cycle when there is no load-use dependence
(50% of the time) and 2 clock cycles when there is (50% of the
time):
CPI_loads = 1 × 0.50 + 2 × 0.50 = 1.50
Store and R-type instructions take 1 clock cycle
Branches take 1 clock cycle when predicted correctly (75% of the
time) and 2 when not (25% of the time):
CPI_branches = 1 × 0.75 + 2 × 0.25 = 1.25
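The weighted averages can be checked directly; this minimal sketch uses only the frequencies stated in the exercise:

```python
cpi_loads    = 1 * 0.50 + 2 * 0.50   # no load-use stall half the time
cpi_branches = 1 * 0.75 + 2 * 0.25   # predicted correctly 75% of the time
cpi_other    = 1.0                   # stores and R-type instructions
```

The overall average CPI would then weight these by the instruction mix, which is not given here.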
Exceptions (cont.)
Let's assume that we are implementing the exception system used in
the MIPS architecture, with the single entry point being the address
8000 0180 (hex)
Exceptions (cont.)
A pipelined implementation treats exceptions as another form of
control hazard
Just as we did for the taken branch, we must flush the instructions
that follow the one that throws the exception from the pipeline and
begin fetching instructions from the new address
We will use the same mechanism we used for taken branches, but
this time the exception causes the deasserting of control lines
The instruction should stop immediately to give the programmer the
chance to track the register values causing the exception
Otherwise, the instruction causing the exception or any following
one could change the register values
Many exceptions require that we eventually complete the instruction
that caused the exception as if it executed normally
The easiest way to do this is to flush the instruction and restart
it from the beginning after the exception is handled
Exceptions (cont.)
We saw how to flush the instruction in the IF stage by turning it
into a nop
To flush instructions in the ID stage, we use the multiplexor already
in the ID stage that zeros control signals for stalls
A new control signal, ID.Flush, is ORed with the stall signal from the
hazard detection unit to flush during ID
To flush instructions in the EX stage, we use a new signal called
EX.Flush to cause new multiplexors to zero the control lines
The EX.Flush signal is also used to prevent the instruction in the EX
stage from writing its result in the WB stage
We need to save the address of the offending instruction in EPC
We must subtract 4 from the updated PC before saving it in EPC
To start fetching instructions from location 8000 0180 (hex), which is
the MIPS exception address, we simply add an additional input to
the PC multiplexor that sends 8000 0180 (hex) to the PC
Exceptions (cont.)
[Figure: pipelined datapath with exception support; the exception
routine address is 8000 0180 (hex)]
Exceptions (cont.)
Handling multiple exceptions is also important
With pipelined execution, it is important to
associate the exception with its cause instruction
prioritize the exceptions to determine which is serviced first
In most MIPS implementations, the hardware sorts exceptions
so that the earliest instruction is interrupted
I/O device requests and hardware malfunctions are not
associated with a specific instruction, so the implementation
has some flexibility as to when to interrupt the pipeline
The exception software must match the exception to the instruction
Know in which pipeline stage a type of exception can occur
For example, an undefined instruction is discovered in the ID
stage, and invoking the OS occurs in the EX stage
Exceptions are collected in the Cause register in a pending
exception field so that the hardware can interrupt based on later
exceptions, once the earliest one has been serviced
Precise exceptions: an exception is always associated with the correct
instruction in pipelined computers; otherwise, exceptions are imprecise!
Home Exercise (24)
Work out the example on pages 388-389 step by step!
Home Exercise (25)
Find the width of each of the pipeline registers in each of the following
four pipelined MIPS architectures:
Included feature A1 A2 A3 A4
No hazard solution
All data hazards solutions
All control hazards solutions
Exception support
Parallelism and Advanced ILP
Pipelining exploits the potential parallelism among instructions
This parallelism is called instruction-level parallelism (ILP)
There are two primary methods for increasing the potential
amount of ILP:
1. Superpipelining: Increasing the depth of the pipeline to
overlap more instructions
Divide the longer steps into smaller ones
To get the full speedup, we need to rebalance the remaining
steps so they are the same length
The amount of parallelism being exploited is higher, since
there are more operations being overlapped
Performance is potentially greater since the clock cycle can
be shorter
2. Multiple issue: Replicating the internal components of the
computer so that it can launch multiple instructions in every
pipeline stage
Superpipelining (Deep Pipelining)
[Figure: relative performance (roughly 1.5x to 3.0x) versus pipeline
depth, since the ideal maximum speedup from pipelining increases
with pipeline depth]
Individual Exercise (26): Answer
Consider a 4 GHz four-way multiple-issue, five-stage pipelined
microprocessor:
1. Calculate the best CPI and IPC
Best CPI = 1 / m = 1 / 4 = 0.25
Best IPC = m = 4
2. Calculate the peak MIPS (or GIPS)
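The peak-throughput part can be worked out from the standard relation, peak instructions per second = clock rate × best IPC; a sketch, assuming only the 4 GHz and four-way figures stated above:

```python
clock_rate_hz = 4e9          # 4 GHz
issue_width   = 4            # four-way multiple issue
best_cpi = 1 / issue_width   # 0.25
best_ipc = issue_width       # 4
peak_ips = clock_rate_hz * best_ipc   # instructions per second at peak
# 16e9 instructions/second, i.e. 16 GIPS (16,000 MIPS)
```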
The Concept of Speculation (cont.)
Speculation may be done in the compiler or by the hardware
The compiler can use speculation to reorder instructions, moving
an instruction across a branch or a load across a store
The hardware can perform the same transformation at runtime
The recovery mechanisms used for incorrect speculations are:
The compiler usually inserts additional instructions that check the
accuracy of the speculation and provide a fix-up routine to use
when the speculation is incorrect
The hardware usually buffers the speculative results until it
knows they are no longer speculative
If the speculation is correct, the instructions are completed
by allowing the contents of the buffers to be written to the
registers or memory
If the speculation is incorrect, the hardware flushes the
buffers and re-executes the correct instruction sequence
The Concept of Speculation (cont.)
Speculation has another problem: speculating on certain instructions
may introduce exceptions that were formerly not present
Suppose a load instruction is moved in a speculative manner, but
the address it uses is not legal when the speculation is incorrect
The result would be that an exception that should not have
occurred will occur
If the load instruction was not speculative, then the exception
must occur!
The compiler avoids such problems by adding special speculation
support that allows such exceptions to be ignored until it is clear
that they really should occur
The hardware simply buffers exceptions until it is clear that the
instruction causing them is no longer speculative and is ready to
complete; at that point the exception is raised, and normal
exception handling proceeds
Since speculation can improve performance when done properly and
decrease performance when done carelessly, significant effort goes
into deciding when it is appropriate to speculate
Static Multiple Issue
Static multiple-issue processors all use the compiler to assist with
packaging instructions and handling hazards
The set of instructions issued in a given clock cycle, called an issue
packet, can be viewed as one single large instruction with multiple
operations in certain predefined fields
Static Multiple Issue with the MIPS ISA (cont.)
Group Exercise (27)
Consider the following loop:
Group Exercise (27): Answer
The first 3 instructions have data dependence, and so do the last 2
1. Show how well loop unrolling and scheduling work in the code
above
2. Compare the performance of the code execution with loop
unrolling and without it (Exercise 26)
Group Exercise (28): Answer
To schedule the loop without any delays, it turns out that we need
to make four copies of the loop body
After unrolling and eliminating the unnecessary loop overhead
instructions, the loop will contain four copies each of lw, addu, and
sw, plus one addi and one bne
Since the first pair decrements $s1 by 16, the addresses loaded
are the original value of $s1, then that address minus 4, etc.
Storing is done first to the original value of $s1, then that address
minus 4, etc.
Group Exercise (28): Answer (cont.)
During the unrolling, the compiler introduced additional registers
($t1, $t2, $t3)
Consider how the unrolled code would look using only $t0
There would be repeated instances of lw $t0, 0($s1) and
addu $t0, $t0, $s2, each followed by sw $t0, 0($s1)
But, these sequences, despite using $t0, are actually completely
independent as no data values flow between one pair of these
instructions and the next pair
Renaming the registers during the unrolling process allows the
compiler to move these instructions subsequently so as to better
schedule the code
Group Exercise (28): Answer (cont.)
2. Compare the performance of the code execution with loop unrolling
and without it (Exercise 26)
Ideal CPI is 0.5 (an IPC of 2)
In the original loop, just one pair of instructions executes in
multiple issue
It takes 4 clock cycles per loop iteration to execute 5
instructions, which yields a CPI of 4/5 = 0.8 (an IPC of 1.25)
In the unrolled loop, 12 of the 14 instructions in the loop execute
as pairs
It takes 8 clocks for four loop iterations, or 2 clocks per
iteration, which yields a CPI of 8/14 = 0.57 (an IPC of 1.75)
The improvement in the CPI is due to:
reducing the loop control instructions
dual-issue execution
The cost of this improvement is using four temporary registers
rather than one, as well as a significant increase in code size
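The CPI and IPC arithmetic above, as a quick check using only the cycle and instruction counts stated in the answer:

```python
cpi_original = 4 / 5     # 4 cycles per iteration, 5 instructions
cpi_unrolled = 8 / 14    # 8 cycles per four iterations, 14 instructions (~0.57)
ipc_original = 1 / cpi_original   # 1.25
ipc_unrolled = 14 / 8             # 1.75
```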
Static Multiple Issue (cont.)
Loop unrolling: a technique to get more performance from loops
that access arrays, in which multiple copies of loop body are made
and instructions from different iterations are scheduled together
After unrolling, there is more ILP available by overlapping
instructions from different iterations
Antidependence (also called name dependence): an ordering
forced purely by the reuse of a name, typically a register, rather than
by a true dependence that carries a value between two instructions
Dependences that are not true dependences, but could either lead
to potential hazards or prevent the compiler from flexibly
scheduling the code
Register renaming: the renaming of registers by the compiler or
hardware to remove antidependences
eliminates name dependences, while preserving true dependences
allows the compiler to move independent code subsequently for
better scheduling
Architectural registers: the set of processor-visible registers
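The unrolling and renaming ideas above can be illustrated in a high-level setting; this Python sketch stands in for the MIPS loop from the exercise (adding a scalar to each array element), with distinct temporaries t0..t3 playing the role of the renamed registers $t0..$t3:

```python
def add_scalar(a, s):
    # original loop: one element per iteration
    for i in range(len(a)):
        a[i] = a[i] + s

def add_scalar_unrolled(a, s):
    # four copies of the loop body; using four temporaries removes the
    # antidependence that reusing a single temporary would create, so a
    # scheduler is free to interleave the four independent updates
    assert len(a) % 4 == 0
    for i in range(0, len(a), 4):
        t0 = a[i] + s
        t1 = a[i + 1] + s
        t2 = a[i + 2] + s
        t3 = a[i + 3] + s
        a[i], a[i + 1], a[i + 2], a[i + 3] = t0, t1, t2, t3

x = list(range(8))
y = list(range(8))
add_scalar(x, 10)
add_scalar_unrolled(y, 10)   # same result, more ILP exposed per iteration
```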
Dynamic Multiple-Issue Processors
Dynamic multiple-issue processors are also known as superscalar
processors, or simply superscalars
Superscalar: An advanced pipelining technique that enables the
processor to execute more than one instruction per clock cycle by
selecting them during execution
In the simplest superscalar processors:
Instructions issue in order
The processor decides whether zero, one, or more instructions
can issue in a given clock cycle
Superscalars also extend dynamic issue decisions to include dynamic
pipeline scheduling, in which the hardware reorders instruction
execution
Dynamic pipeline scheduling: chooses which instructions to
execute next, possibly reordering them to avoid hazards and stalls
Static pipeline scheduling: stalls when waiting for a hazard to be
resolved, even if later instructions are ready to go
Dynamic Multiple-Issue Processors (cont.)
Achieving good performance on superscalars still requires the
compiler to try to schedule instructions to move dependences apart
and thereby improve the instruction issue rate
Dynamic Pipeline Scheduling
Dynamic Pipeline Scheduling (cont.)
A dynamically scheduled pipeline analyzes the data flow structure of
a program
The processor then executes the instructions in some order that
preserves the data flow order of the program
Out-of-order (OOO) execution: instructions can be executed in a
different order than they were fetched
A situation in pipelined execution in which an instruction blocked
from executing does not cause the following instructions to wait
In-order commit:
The instruction fetch and decode unit is required to issue
instructions in order, which allows dependences to be tracked
The commit unit is required to write results to registers and
memory in program fetch order
The functional units are free to initiate execution whenever the
data they need is available
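The in-order issue / dataflow execute / in-order commit split above can be sketched with a toy model. This is a deliberately simplified sketch: it ignores issue width and functional-unit limits, and the registers and latencies are hypothetical:

```python
# each instruction: (text, destination, source registers, latency in cycles)
prog = [
    ("lw  $t1, 0($s0)",   "$t1", [],       3),   # long-latency load
    ("add $t2, $t1, $t1", "$t2", ["$t1"],  1),   # must wait for the load
    ("sub $t3, $s1, $s2", "$t3", [],       1),   # independent: may execute early
]

def simulate(prog):
    ready_at = {}              # register -> cycle its value becomes available
    finish = []
    for _, dest, srcs, lat in prog:
        # dataflow order: start as soon as all source values are ready
        start = max((ready_at.get(s, 0) for s in srcs), default=0)
        finish.append(start + lat)
        ready_at[dest] = start + lat
    commit, t = [], 0
    for f in finish:           # in-order commit: never before an earlier instr.
        t = max(t, f)
        commit.append(t)
    return finish, commit

finish, commit = simulate(prog)
# the sub finishes first (cycle 1) but commits last-in-order (cycle 4),
# which is what keeps exceptions precise
```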
Dynamic Pipeline Scheduling (cont.)
In-order commit (cont.):
If an exception occurs, the computer can point to the last instruction
executed, and the only registers updated will be those written by
instructions before the one causing the exception
Programs behave as if they were running on a simple in-order
pipeline (the model presented by today's dynamically scheduled
pipelines)
Dynamic scheduling is often extended by including hardware-based
speculation, especially for branch outcomes
By predicting the direction of a branch, a processor can continue
to fetch and execute instructions along the predicted path
Because instructions are committed in order, we know whether or
not the branch was correctly predicted before any instructions
from the predicted path are committed
A speculative, dynamically scheduled pipeline can also support
speculation on load-store reordering, and using the commit unit
to avoid incorrect speculation
Pipelining and Multiple-Issue Limitations
Despite the existence of processors with four to six issues per clock
cycle, very few applications can sustain more than two instructions
per clock for the following two primary reasons:
1. Dependences that cannot be alleviated and branches that cannot
be accurately predicted
2. Losses in the memory system
Four factors combine to limit the performance improvement gained
by pipelining and multiple-issue execution:
1. Data hazards in the code mean that increasing the pipeline depth
increases the time per instruction, because a larger percentage of the
cycles become stalls
2. Control hazards mean more clock cycles for the program
3. Pipeline register overhead can limit the decrease in clock period
obtained by further pipelining
4. Instruction latencies (inherent execution times) introduce difficulties in
programs, as a dependence means the processor must wait the full
instruction latency for the hazard to be resolved
Power Efficiency and Advanced Pipelining
The downside to the increasing exploitation of ILP via dynamic
multiple issue and speculation is power efficiency
Now that we have hit the power wall, we are seeing designs with
multiple processors per chip, where the processors are not as deeply
pipelined or as aggressively speculative as their predecessors
While these processors are not as fast as the sophisticated ones, they
deliver better performance per watt, so they can deliver more
performance per chip when designs are constrained more by power
than by the number of transistors
Tutorial Exercises (29)
Textbook problems: