Chapter 4, Part II
Enhancing Performance with Pipelining

(Cover figure: the classic components of a computer: processor, memory, and devices)
Patterson and Hennessy's Computer Organization and Design, 4th Ed. Chapter 4.Part II 1 of 168
Agenda and Reading List
Chapter goals (4.5, p. 343)
Introduction (4.5, pp. 330-332 and 4.14, pp. 408-409)
MIPS pipelined datapath, registers, and stages (4.6, pp. 344-355)
Graphically representing pipelines (4.6, pp. 356-358)
Pipelining speedup formula (4.5, pp. 332-334)
Pipelined control (4.6, pp. 359-363)
Pipeline hazards (4.5, pp. 335-343)
Structural hazards solutions (4.5, pp. 335-336)
Data hazards solutions (4.5, pp. 336-339 and 4.7, pp. 363-375)
Control hazards solutions (4.5, pp. 339-343 and 4.8, pp. 375-384)
Designing instruction set for pipelining (4.5, p. 335)
Exceptions (4.9, pp. 384-391)
Advanced instruction-level parallelism (4.10, pp. 391-403)
Fallacies and pitfalls (4.13, pp. 407-408)

Chapter Goals
to overview pipelining and describe its benefits and complexities
to cover the concept of pipelining using the MIPS instruction subset
from the single-cycle implementation in Chapter 4, Part I and show a
simplified version of its pipeline
to provide datapath and control details for a pipelined
implementation created by modifying the single-cycle implementation
to understand the challenges of dealing with hazards
to look at the problems that pipelining introduces and the
performance attainable under typical situations
to explore the implementation of forwarding and stalls
to learn about solutions to branch hazards
to focus on the software and performance implications of pipelining
to learn how to design instruction sets for easy pipelining
to describe the implementation of exceptions in pipelined processors
to introduce advanced pipelining concepts, such as superscalar and
dynamic scheduling
Introduction
Pipelining is an implementation technique in which multiple
instructions are overlapped in execution
All steps in pipelining (called stages) operate concurrently
As long as we have separate resources for each stage, we can
pipeline the tasks
Pipelining exploits the potential parallelism among instructions in a
sequential instruction stream (instruction-level parallelism (ILP))
Pipelining has the substantial advantage that, unlike some speedup
techniques, it is fundamentally invisible to the programmer
Pipelining is key to making processors fast and is nearly universal
Introduction (cont.)
Paradox:
The time for executing a single instruction is not shorter with
pipelining!
The reason pipelining is faster for executing many instructions is
that everything is working in parallel, so more instructions are
finished per unit time
Pipelining improves the throughput of instruction execution
Pipelining does not decrease the time to complete executing
one instruction (called the latency), but when we have many
instructions to execute, the improvement in throughput
decreases the total time to complete the work
MIPS Pipelined Datapath

(Figure: the datapath separated into five stages)
MIPS Pipelined Datapath (cont.)
MIPS instructions classically take five steps:
1. Fetch instruction from memory
2. Read registers while decoding the instruction
3. Execute the operation or calculate an address
4. Access an operand in data memory
5. Write the result into a register

We limit our attention to eight instructions:
1. load word (lw)
2. store word (sw)
3. add (add)
4. subtract (sub)
5. AND (and)
6. OR (or)
7. set less than (slt)
8. branch on equal (beq)
MIPS Pipelined Datapath (cont.)
All pipeline stages take a single clock cycle, so the clock cycle must
be long enough to accommodate the slowest operation
The write to the register file occurs in the first half of the clock
cycle and the read from the register file occurs in the second half
Instructions and data move generally from left to right through the
5 stages as they complete execution
There are two exceptions to this left-to-right flow of instructions:
The write-back stage, which places the result back into the
register file in the middle of the datapath
This could lead to data hazards (see later!)
The selection of the next value of the PC, choosing between the
incremented PC and the branch address from the MEM stage
This could lead to control hazards (see later!)
Data flowing from right to left does not affect the current
instruction; only later instructions in the pipeline are influenced by
these reverse data movements
Group Exercise (1)
Assume that the operation times for the major functional units in an
implementation are the following:
200 ps for memory access,
200 ps for ALU and adder operation,
100 ps for register file read or write;
multiplexors, the control unit, PC accesses, the sign extension unit,
and wires have no delay
Compare the average time between instructions of a single-cycle
implementation, in which all instructions take one clock cycle, to a
pipelined implementation
Note: In the single-cycle model, every instruction takes exactly
one clock cycle, so the clock cycle must be stretched to
accommodate the slowest instruction
Group Exercise (1): Answer
Instruction class            Instruction  Register  ALU        Data    Register  Total
                             fetch        read      operation  access  write     time
Load word (lw)               200 ps       100 ps    200 ps     200 ps  100 ps    800 ps
Store word (sw)              200 ps       100 ps    200 ps     200 ps            700 ps
R-format (add, sub,          200 ps       100 ps    200 ps             100 ps    600 ps
  AND, OR, slt)
Branch (beq)                 200 ps       100 ps    200 ps                       500 ps
Jump (j)                     200 ps                                              200 ps

The single-cycle design must allow for the slowest instruction, lw, so
the time required for every instruction is 800 ps, even though some
instructions can be as fast as 500 ps or 200 ps
All the pipeline stages take a single clock cycle, so the clock cycle
must be long enough to accommodate the slowest operation
The pipelined execution clock cycle must have the worst-case cycle of
200 ps, even though some stages take only 100 ps
So at steady state, an instruction completes every 200 ps
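The per-class totals in the table follow directly from the component delays; the short Python sketch below (ours, not from the book) checks the arithmetic:

```python
# Component delays in ps from the exercise; multiplexors, control,
# PC accesses, sign extension, and wires are assumed to have no delay.
MEM, ALU, REG = 200, 200, 100

# Single-cycle datapath delay per instruction class: the sum of the
# components each class actually uses.
times = {
    "lw":     MEM + REG + ALU + MEM + REG,  # fetch, reg read, ALU, data access, reg write
    "sw":     MEM + REG + ALU + MEM,        # no register write
    "R-type": MEM + REG + ALU + REG,        # no data memory access
    "beq":    MEM + REG + ALU,              # no memory access or register write
    "j":      MEM,                          # instruction fetch only
}

single_cycle_clock = max(times.values())    # clock stretched to the slowest instruction
pipelined_clock = max(MEM, ALU, REG)        # clock set by the slowest stage

print(times["lw"], single_cycle_clock, pipelined_clock)   # 800 800 200
```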
Group Exercise (2)
Locate the single-cycle, multicycle, and pipelined datapaths on
diagrams that show:
1. the clock rate versus instruction throughput
2. hardware trends (shared or specialized) versus instruction
latency
Group Exercise (2): Answer
Locate the single-cycle, multicycle, and pipelined datapaths on
diagrams that show:
1. the clock rate versus instruction throughput
2. hardware trends (shared or specialized) versus instruction
latency

(Diagram 1: clock rate, slower to faster, versus instruction throughput
in instructions per clock cycle (1/CPI). The single-cycle datapath has a
slow clock; the multicycle datapath has a fast clock but low throughput;
the pipelined datapath has a fast clock and high throughput)
(Diagram 2: shared versus specialized hardware against clock cycles of
latency for an instruction, 1 to several. The single-cycle datapath uses
specialized hardware with 1 cycle of latency; the multicycle datapath
uses shared hardware with several cycles; the pipelined datapath uses
specialized hardware with several cycles)
Graphically Representing Pipelines
Pipelining can be difficult to understand, since many instructions are
simultaneously executing in a single datapath in every clock cycle
One way to show what happens in pipelined execution is to pretend
that each instruction has its own private datapath, and then to
place these datapaths on a common timeline to show their
relationship
To aid understanding, there are two basic styles of pipeline figures:
1. Multiple-clock-cycle pipeline diagrams
Time advances from left to right across the page in these
diagrams
Instructions advance from the top to the bottom of the page
Used to give overviews of pipelining situations
Simpler, but do not contain all the details
Graphically Representing Pipelines (cont.)
2. Single-clock-cycle diagrams
Show the state of the entire datapath during a single cycle
Usually all five instructions in the pipeline are identified by
labels above their respective pipeline stages
Used to show the details of what is happening within the
pipeline during each clock cycle
Typically, the drawings appear in groups to show pipeline
operation over a sequence of clock cycles
Represents a vertical slice through a set of multiple-clock-
cycle diagrams, showing the usage of the datapath by each
of the instructions in the pipeline at the designated cycle
Obviously, they have more details and take significantly more
space to show the same number of clock cycles
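The multiple-clock-cycle convention described above can be mimicked in a few lines of Python; this sketch (the `pipeline_diagram` helper is our own, not from the book) prints stage labels left to right per cycle, one instruction per row:

```python
# Print a multiple-clock-cycle pipeline diagram: time advances left to
# right, instructions advance top to bottom, one stage label per cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    n_cycles = len(instructions) + len(STAGES) - 1
    header = "cycle:      " + " ".join(f"{c:>4}" for c in range(1, n_cycles + 1))
    rows = [header]
    for i, instr in enumerate(instructions):
        cells = ["    "] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = f"{stage:>4}"   # instruction i is in stage s at cycle i+s+1
        rows.append(f"{instr:<12}" + " ".join(cells))
    return "\n".join(rows)

print(pipeline_diagram(["lw  $10", "sub $11", "add $12", "lw  $13", "add $14"]))
```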

Group Exercise (3)
Consider the following five-instruction sequence

lw $10, 20($1)
sub $11, $2, $3
add $12, $3, $4
lw $13, 24($1)
add $14, $5, $6

1. Show the multiple-clock-cycle pipeline diagram for these
instructions
2. Show the single-clock-cycle pipeline diagram corresponding to
clock cycle 5 of executing the above instruction sequence

Group Exercise (3): Answer
1. Show the multiple-clock-cycle pipeline diagram for these instructions

Stylized version

Group Exercise (3): Answer (cont.)

Traditional version

Group Exercise (3): Answer (cont.)
2. Show the single-clock-cycle pipeline diagram corresponding to
clock cycle 5 of executing the above instruction sequence

Pipelining Speedup Formula
Pipeline latency: the number of stages in a pipeline
If the stages are perfectly balanced, then the time between
instructions on the pipelined processor, assuming ideal conditions, is:

Time between instructions (pipelined) =
    Time between instructions (nonpipelined) / Number of pipe stages

If all stages take about the same amount of time and there is enough
work to do, then the speedup due to pipelining is equal to the
number of stages in the pipeline
That is, under ideal conditions and with a large number of
instructions:

Speedup from pipelining = Number of pipe stages
Practically, the time per instruction in the pipelined processor will
exceed the minimum possible, and speedup will be less than the
number of pipeline stages because
the stages may be imperfectly balanced,
pipelining involves overhead (e.g., pipeline registers), and
pipeline hazards exist
Pipelining Speedup Formula (cont.)
At the beginning and end of the workload, the pipe is not totally full
This start-up and wind-down affects performance when the number
of tasks is not large compared to the number of stages in the
pipeline
If the number of tasks is much larger than the number of pipeline
stages, then the stages will be full most of the time and the
increase in throughput will be very close to the number of stages
If m instructions are executed on an n-stage pipeline with cycle time T, then:
Number of clock cycles to execute the instructions = m + (n - 1)
Time to execute the instructions = [m + (n - 1)] * T
If m >> n, then:
Number of clock cycles to execute the instructions ≈ m
Time to execute the instructions ≈ m * T
Pipelining improves performance by increasing instruction throughput,
as opposed to decreasing the inherent execution time (latency) of an
instruction, but instruction throughput is the more important metric
because real programs execute billions of instructions
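The cycle counts above can be written as a small timing model; a Python sketch (assuming perfectly balanced stages and no hazards):

```python
# Ideal pipeline timing model: m instructions on an n-stage pipeline
# with cycle time T need (n - 1) cycles to fill the pipe, then finish
# one instruction per cycle.
def pipelined_time(m, n, T):
    return (m + n - 1) * T

def nonpipelined_time(m, T_instr):
    return m * T_instr

# Numbers from Exercise (1): 200 ps pipelined cycle, 800 ps per
# nonpipelined instruction.
m, n = 1_000_000, 5
speedup = nonpipelined_time(m, 800) / pipelined_time(m, n, 200)
print(round(speedup, 2))   # 4.0: approaches 800/200, not the stage count 5
```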
Group Exercise (4)
Consider the following code executed on the pipeline in Exercise (1):
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)

1. Compare non-pipelined and pipelined execution of the three lw
instructions
2. Graphically represent the pipelined execution of the above
instructions using the multiple-clock-cycle pipeline diagram
3. Calculate the time between the first and fourth instructions in
the nonpipelined design
4. Calculate the time between the first and fourth instructions in
the pipelined design. Comment!
5. Calculate the time needed to execute three lw instructions in the
pipelined design
6. Calculate the pipelining speedup in case of executing the three
lw instructions
7. Calculate the pipelining speedup in case of executing 1,000,003
lw instructions
8. What is the ideal speedup? Comment!
Group Exercise (4): Answer
1. Compare non-pipelined and pipelined execution of the three lw
instructions

Group Exercise (4): Answer
2. Graphically represent the pipelined execution of the above
instructions using the multiple-clock-cycle pipeline diagram

Group Exercise (4): Answer (cont.)
3. Calculate the time between the first and fourth instructions in
the nonpipelined design
3 x 800 ps = 2400 ps
4. Calculate the time between the first and fourth instructions in
the pipelined design
3 x 200 ps = 600 ps
Pipelining thus still offers a fourfold improvement in the time
between instructions
5. Calculate the time needed to execute three lw instructions in the
pipelined design
From the diagram, it is 1400 ps
6. Calculate the pipelining speedup in case of executing three lw
instructions

Speedup (pipelined) = (3 * 800 ps) / 1400 ps = 1.71 (not 4)

The number of instructions is not large enough for the speedup to be 4
Group Exercise (4): Answer (cont.)
7. Calculate the pipelining speedup in case of executing 1,000,003
lw instructions

In the non-pipelined case, we would add 1,000,000
instructions, each taking 800 ps, so the total execution time
would be 1,000,000 * 800 ps + 2400 ps = 800,002,400 ps
In the pipelined case, we would add 1,000,000 instructions,
each adding 200 ps to the total execution time. The total
execution time would be 1,000,000 * 200 ps + 1400 ps =
200,001,400 ps

Speedup (pipelined) = 800,002,400 / 200,001,400 = 4.00

When more instructions are executed, the ratio of total
execution times on nonpipelined versus pipelined
processors approaches the ratio of times between
instructions
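The totals above can be checked mechanically; a quick Python verification of the slide's arithmetic:

```python
# 1,000,003 lw instructions, pipeline parameters from Exercise (1).
m, n = 1_000_003, 5
cycle, instr_time = 200, 800   # ps

nonpipelined = m * instr_time        # every instruction takes the full 800 ps
pipelined = (m + n - 1) * cycle      # fill the pipe, then one result per cycle

print(nonpipelined, pipelined)              # 800002400 200001400
print(round(nonpipelined / pipelined, 2))   # 4.0
```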
Group Exercise (4): Answer (cont.)
8. What is the ideal speedup? Comment!

The speedup formula suggests that a five-stage pipeline
should offer nearly a fivefold improvement over the 800 ps
nonpipelined time, or a 160 ps clock cycle
However, the stages are imperfectly balanced, resulting in a
200 ps clock cycle

Pipeline Registers

Pipeline Registers (cont.)
We must add registers between pipeline stages to allow datapaths
and functional units to be shared by different instructions during
different stages while retaining the value of an individual instruction
for its usage during the following stages
We place registers wherever there are dividing lines between stages
All instructions advance during each clock cycle from one pipeline
register to the next
To pass something from an early pipeline stage to a later one, the
information must be placed in a pipeline register; otherwise the
information is lost when the next instruction enters that stage
Pipeline Registers (cont.)
There is no pipeline register at the end of the write-back stage
All instructions must update some state in the processor (the
register file, memory, or the PC)
A separate pipeline register would be redundant with the state that
is updated
Registers are named for the two stages separated by that register
Each pipeline register is divided into different sections to hold
different pieces of information
We will use a notation that names the fields of the pipeline
registers, e.g., ID/EX.RegisterRs is the number of one register
whose value is found in ID/EX, namely the register supplying the
first read port of the register file
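One way to picture the section-per-field idea is a plain record type; a Python sketch (the field names here are our own, modeled on the ID/EX contents described in the text):

```python
# A pipeline register as a plain record: one field per section of
# information carried from the ID stage to the EX stage.
from dataclasses import dataclass

@dataclass
class ID_EX:
    pc_plus_4: int      # incremented PC, carried in case beq needs it
    read_data_1: int    # value read for register rs
    read_data_2: int    # value read for register rt
    sign_ext_imm: int   # 32-bit sign-extended immediate
    register_rs: int    # register numbers, kept for later stages
    register_rt: int
    register_rd: int

# A hypothetical snapshot for lw $10, 20($1) with $1 = 7:
id_ex = ID_EX(pc_plus_4=1004, read_data_1=7, read_data_2=0,
              sign_ext_imm=20, register_rs=1, register_rt=10, register_rd=0)
print(id_ex.register_rs, id_ex.sign_ext_imm)   # 1 20
```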
Individual Exercise (5)
Why is the PC not considered among the pipeline registers,
although it feeds the IF stage of the pipeline?

Individual Exercise (5): Answer
Why is the PC not considered among the pipeline registers,
although it feeds the IF stage of the pipeline?

Every instruction updates the PC, whether by incrementing it or by
setting it to a branch destination address
Unlike pipeline registers, however, the PC is part of the visible
architectural state
The PC's content must be saved when an exception occurs, while
the contents of the pipeline registers can be discarded

Pipeline Stages
All instructions pass through the 5 stages, though not all of them
need 5 cycles
Since every instruction behind the one being executed is already in
progress, there is no way to accelerate the shorter ones
An instruction passes through a stage even if there is nothing to
do, as later instructions are already progressing at the maximum
rate
1. Instruction fetch (IF):
The instruction is read from memory using the address in the
PC and then placed in the IF/ID pipeline register
The PC is incremented by 4 and then written back into
the PC to be ready for the next clock cycle
The incremented PC is also saved in IF/ID in case it is
needed later by an instruction, such as beq
A portion of IF/ID plays the role of the IR!
The computer cannot know which type of instruction is being
fetched, so it must prepare for any instruction, passing
potentially needed information down the pipeline
Pipeline Stages (cont.)
2. Instruction decode and register file read (ID):
The instruction portion of the IF/ID pipeline register supplies
the 16-bit immediate field, which is sign-extended to 32 bits,
and the register numbers to read the two registers
All three values are stored in ID/EX, along with the
incremented PC
We again transfer everything that might be needed by any
instruction during a later clock cycle
Register rt number is saved in ID/EX in case it is needed
later by lw

Pipeline Stages (cont.)
3. Execute or address calculation (EX):
lw: reads the contents of register rs and the sign-extended
immediate from the ID/EX and adds them using the ALU
that sum is placed in the EX/MEM
register rt number is passed from ID/EX to EX/MEM
sw: the effective address is placed in the EX/MEM
register rt value is passed from ID/EX to EX/MEM to be used
in the next stage
R-type: reads the contents of registers rs and rt from the
ID/EX and performs the desired function using the ALU
the result is stored in the EX/MEM
register rd number is passed from the ID/EX to EX/MEM
beq: reads the contents of registers rs and rt from the ID/EX
and performs the equal compare function using the ALU
the zero signal is stored in the EX/MEM
Pipeline Stages (cont.)
4. Memory access (MEM):
lw: reads the data memory using the address from EX/MEM
and loads the data into the MEM/WB
register rt number is passed from EX/MEM to MEM/WB
sw: data is written to memory
R-type: the ALU output is passed from EX/MEM to MEM/WB
beq: sets the next PC according to the zero signal read from
EX/MEM, choosing between the incremented PC and the branch
target address
5. Write back (WB):
lw: reads the data from the MEM/WB and writes it into the
register file using register rt number stored in MEM/WB
sw: nothing to be done
R-type: writes the ALU output from the MEM/WB into the
register number rd (read from MEM/WB also)
beq: nothing to be done
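The lw path through the five stages can be traced with plain dictionaries standing in for the pipeline registers (a sketch; the addresses and values are made up for illustration):

```python
# Trace lw $10, 20($1) through IF, ID, EX, MEM, WB, passing values
# between stages via dicts that stand in for the pipeline registers.
regs = [0] * 32
regs[1] = 100                      # $1 holds the base address
mem = {120: 42}                    # data memory word at address 120

if_id  = {"instr": ("lw", 10, 1, 20), "pc4": 4}            # IF: fetch
op, rt, rs, imm = if_id["instr"]                            # ID: decode, read regs
id_ex  = {"rs_val": regs[rs], "imm": imm, "rt_num": rt}
ex_mem = {"addr": id_ex["rs_val"] + id_ex["imm"],           # EX: effective address
          "rt_num": id_ex["rt_num"]}
mem_wb = {"data": mem[ex_mem["addr"]],                      # MEM: read data memory
          "rt_num": ex_mem["rt_num"]}
regs[mem_wb["rt_num"]] = mem_wb["data"]                     # WB: write register rt
print(regs[10])   # 42
```

Note how the rt register number rides along in every stage's record, exactly as the slides describe.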
Home Exercise (6)
Work out the lw example in Figures 4.36 through 4.38
step by step!

Work out the sw example in Figures 4.39 through 4.40
step by step!

Pipelined Control

Pipelined Control (cont.)
To specify control for the pipeline, we need only to set the control
values during each pipeline stage
Because each control line is associated with a component active in
only a single pipeline stage, we can divide the control lines into five
groups according to the pipeline stage (refer to Chapter 4, Part I)
IF: there is nothing special to control in this pipeline stage
The control signals to read instruction memory and to write
the PC are always asserted
ID: there are no optional lines to set
The same thing happens at every clock cycle
EX: set signals RegDst, ALUOp, and ALUSrc to select the result
register, the ALU operation, and either read data 2 (register rt)
or a sign-extended immediate for the ALU
MEM: set signals Branch, MemRead, and MemWrite by beq, lw,
and sw instructions, respectively
WB: control signal MemtoReg decides between sending the ALU
result or the memory value to the register file, and control signal
RegWrite writes the chosen value
Pipelined Control (cont.)
There are no separate writing signals for the pipeline registers IF/ID,
ID/EX, EX/MEM, and MEM/WB as they are written during each cycle
Implementing control means setting the control lines in each stage
for each instruction
The simplest way to do this is to extend the pipeline registers to
include control information
As the control lines start with the EX stage, we can create the control
information during instruction decode and then place them in ID/EX
The control lines for each pipeline stage are used, and remaining
control lines are then passed to the next pipeline stage
These control signals are then used in the appropriate pipeline stage
as the instruction moves down the pipeline
Sequencing of control in pipeline processors is embedded in the
pipeline structure itself:
all instructions take the same number of clock cycles, so there is
no special control of instruction duration
all control information is computed during instruction decode, and
then passed along by the pipeline registers
Pipeline Hazards
Pipeline hazards: situations in pipelining when the next instruction
cannot execute in the following clock cycle
1. Structural hazards: the hardware cannot support the combination
of instructions that we want to execute in the same clock cycle
Usually revolve around the floating-point unit, which may not be
fully pipelined
Pipeline Hazards (cont.)
2. Data hazards: occur when the pipeline must be stalled because
one step must wait for another to complete
Arise from the dependence of one instruction on an earlier one
that is still in the pipeline
When an instruction depends on the results of a previous one
still in the pipeline, the pipeline should be stalled, i.e., bubbles
should be added to the pipeline
Performance bottlenecks in both integer and floating-point
programs
Often it is easier to deal with in floating-point programs
because the lower branch frequency and more regular
memory access patterns allow the compiler to try to schedule
instructions to avoid hazards
It is more difficult to perform such optimizations in integer
programs that have less regular memory access, involving
more use of pointers
Pipeline Hazards (cont.)
3. Control hazards: arise from the need to make a decision based on
the results of one instruction while others are executing, e.g.,
branches
Also called branch hazards
Happen when the proper instruction cannot execute in the
proper pipeline clock cycle because the instruction that was
fetched is not the one that is needed; that is, the flow of
instruction addresses is not what the pipeline expected
Notice that we must begin fetching the instruction following the
branch on the very next clock cycle
Nevertheless, the pipeline cannot possibly know what the next
instruction should be, since it only just received the branch
instruction from memory
Usually more of a problem in integer programs, which tend to
have higher branch frequencies as well as less predictable
branches
Structural Hazards Solutions
Each logical component of the datapath, such as instruction
memory, register read ports, ALU, data memory, and register write
ports, can be used only within a single pipeline stage
Otherwise, we would have a structural hazard
Hence, these components, and their control, can be associated with
a single pipeline stage
Proper ISA design
Designing instruction sets for pipelining (as will be explained
later) makes it fairly easy to avoid structural hazards when
designing a pipeline
Example:
without two memories, our pipeline could have a structural
hazard
suppose we had a single memory instead of two memories
we could see that in one clock cycle, the first instruction is
accessing data from memory while the fourth instruction is
fetching an instruction from that same memory
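The single-memory example can be expressed as a cycle count: instruction i (counting from 0) occupies MEM in cycle i + 4, while instruction i + 3 occupies IF in that same cycle. A hypothetical Python check under that simplified model:

```python
# With one shared memory, a load/store in its MEM stage (cycle i + 4
# for 0-indexed instruction i) collides with the fetch of instruction
# i + 3, which also happens in cycle i + 4.
def structural_conflicts(instrs):
    """instrs: list of mnemonics; only lw/sw touch data memory in MEM."""
    return [(i, i + 3) for i, op in enumerate(instrs)
            if op in ("lw", "sw") and i + 3 < len(instrs)]

# The first instruction's data access collides with the fourth fetch.
print(structural_conflicts(["lw", "add", "sub", "or", "and"]))   # [(0, 3)]
```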
Data Hazards Solutions
Compiler support:
Code reordering: the compiler can follow an instruction with
others that are independent of it to prevent sequences that
result in data hazards
When the instruction generating the data is a load, this
technique is called delayed loads
nop insertion: when no independent instructions can be
found, the compiler inserts nop (no operation) instructions,
which are guaranteed to be independent
This results in cycles that do no useful work
nop is represented by all 0s, which is equivalent to sll
$0, $0, 0, i.e., shift register $0 left 0 places
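The nop-insertion idea can be sketched as a tiny scheduling pass. This is our own simplified model (assumption: with register-file forwarding, a consumer must sit at least three slots after its producer, hence up to two nops):

```python
def insert_nops(code, distance=3):
    """code: list of (dest_reg, source_regs). Pad with nops so every
    consumer is at least `distance` slots after its producer."""
    out = []
    for dest, srcs in code:
        # Distance back to the nearest producer of any source register.
        for back, (d, _) in enumerate(reversed(out), start=1):
            if d is not None and d in srcs and back < distance:
                out.extend([(None, set())] * (distance - back))  # nop: no dest
                break
        out.append((dest, srcs))
    return out

# sub $2,$1,$3 followed by and $12,$2,$5: two nops get inserted,
# matching the compiler fix shown later in Exercise (7).
padded = insert_nops([("$2", {"$1", "$3"}), ("$12", {"$2", "$5"})])
print(sum(1 for d, _ in padded if d is None))   # 2
```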

Data Hazards Solutions (cont.)
Compiler support (cont.):
However, data hazards happen just too often and the delay is just
too long to expect the compiler to rescue us from this dilemma
Although the compiler generally relies upon the hardware to
resolve hazards and thereby ensure correct execution, the
compiler must understand the pipeline to achieve the best
performance
Otherwise, unexpected stalls will reduce the performance of
the compiled code
Register file forwarding:
In our design, writes are done in the first half of the clock cycle
and reads are done in the second half, so a read delivers what is
written in the same cycle
This resolves a data hazard when the register write and read
fall in the same clock cycle
Group Exercise (7)
Consider the following code sequence:

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

1. Show the dependences in this code sequence
2. If register $2 had the value 10 before the sub instruction and -20
afterwards, how would this sequence perform with our pipeline?
Show the value of $2 at the beginning of each clock cycle
3. Show how the compiler could help avoid the data hazard in
the above code sequence
4. What other techniques are used by the hardware in this example to
avoid data hazards?
Group Exercise (7): Answer
Consider the following code sequence:

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

1. Show the dependences in this code sequence
The last 4 instructions are dependent on the result in $2 of
the first instruction
The sub instruction does not write its result until the fifth
stage, meaning that we would have to waste three clock
cycles in the pipeline
Without intervention, data hazards could severely stall the
pipeline
Group Exercise (7): Answer (cont.)
2. How would this sequence perform with our pipeline? If register $2
had the value 10 before the sub instruction and -20 afterwards,
show the value of $2 at the beginning of each clock cycle

The proper $2 value is written to the register file in clock cycle 5
Dependence arrows going backward in time indicate pipeline hazards
add and sw get the correct $2 value; and and or would not
Group Exercise (7): Answer
3. Show how the compiler could help avoid the data hazard in the
above code sequence

The compiler inserts two nops before the and instruction:

sub $2, $1, $3
nop
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

4. What other techniques are used by the hardware in this example to
avoid data hazards?

Register file forwarding
Data Hazards Solutions (cont.)
Data forwarding:
Forwarding (or bypassing): is a simple solution based on the
observation that we do not need to wait for the instruction to
complete before trying to resolve the data hazard
We can avoid stalls if we simply forward the data as soon as it is
available to any units that need it before it is available to read
from the register file
Adds extra hardware to retrieve the missing item early from the
internal resources (buffers)
Rather than waiting for the missing item to arrive from
programmer-visible registers or memory
The name forwarding comes from the idea that the result is
passed forward from an earlier instruction to a later one
Bypassing comes from passing the result around the register file to
the desired unit
Forwarding paths are valid only if the destination stage is later
in time than the source stage
Otherwise, we would be going backward in time!
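This "later in time" rule is what the hardware's forwarding conditions encode. A Python sketch of the standard EX-hazard test from the book's forwarding scheme (a simplification; the dict field names here are our own):

```python
# Forward to the first ALU input when a prior instruction's destination
# register matches the current instruction's rs: prefer the newest
# result (EX/MEM), then the older one (MEM/WB), else the register file.
def forward_a(ex_mem, mem_wb, id_ex_rs):
    """Return the mux select: 0b10 from EX/MEM, 0b01 from MEM/WB, 0b00 none."""
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
        return 0b10                    # forward the just-computed ALU result
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
        return 0b01                    # forward the value about to be written back
    return 0b00                        # no hazard: read the register file

# add $s0,... followed by sub ...,$s0,...: EX/MEM holds $s0 (register 16).
ex_mem = {"RegWrite": True, "Rd": 16}
mem_wb = {"RegWrite": False, "Rd": 0}
print(forward_a(ex_mem, mem_wb, id_ex_rs=16))   # 2: take the ALU result from EX/MEM
```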
Individual Exercise (8)
Suppose that we have an add instruction followed immediately by
a subtract that uses the sum in $s0:

add $s0, $t0, $t1
sub $t2, $s0, $t3

Show what pipeline stages would be connected by forwarding
Individual Exercise (8): Answer
add $s0, $t0, $t1
sub $t2, $s0, $t3

Show what pipeline stages would be connected by forwarding

The add instruction does not write its result until the 5th stage
As soon as the ALU creates the sum for the add, it is supplied
as an input for the sub, replacing the $s0 value read in the 2nd
stage of sub
Group Exercise (9)
Consider the following code segment in C:

a = b + e;
c = b + f;

Here is the generated MIPS code for this segment, assuming all variables
are in memory and are addressable as offsets from $t0:

lw $t1, 0($t0) # b is saved at 0($t0)
lw $t2, 4($t0) # e is saved at 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0) # a is to be saved at 12($t0)
lw $t4, 8($t0) # f is saved at 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0) # c is to be saved at 16($t0)

Group Exercise (9) (cont.)
1. Find the hazards in the above code segment
2. Reorder the instructions to avoid any pipeline stalls
3. Calculate the number of clock cycles needed to complete the
reordered sequence on a pipelined processor with forwarding
relative to the original version

Group Exercise (9): Answer
lw $t1, 0($t0) # b is saved at 0($t0)
lw $t2, 4($t0) # e is saved at 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0) # a is to be saved at 12($t0)
lw $t4, 8($t0) # f is saved at 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0) # c is to be saved at 16($t0)

1. Find the hazards in the above code segment

Both add instructions have a hazard because of their respective
dependence on the immediately preceding lw instruction
Bypassing eliminates several other potential hazards, including the
dependence of the first add on the first lw and any hazards for the
store instructions
Group Exercise (9): Answer
2. Reorder the instructions to avoid any pipeline stalls

Moving the third lw up to become the third instruction
eliminates both hazards:

lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)

3. Calculate the number of clock cycles needed to complete the
reordered sequence on a pipelined processor with forwarding
relative to the original version

On a pipelined processor with forwarding, the reordered sequence
will complete in two fewer cycles than the original version
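The two-fewer-cycles claim can be checked mechanically. Below is a small sketch (hypothetical helper, not from the text) that counts load-use stalls, assuming forwarding resolves every other dependence in the 5-stage pipeline:

```python
# Count load-use stalls: with forwarding, the only remaining stall is a
# load immediately followed by an instruction that reads the loaded register.
def stall_cycles(program):
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest, _srcs = prev
        if op == "lw" and dest in curr[2]:
            stalls += 1
    return stalls

# Each entry is (opcode, destination register, source registers);
# sw has no register destination.
original = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),   # needs $t2 right after its lw: stall
    ("sw",  None,  ["$t3", "$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),   # needs $t4 right after its lw: stall
    ("sw",  None,  ["$t5", "$t0"]),
]
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]

print(stall_cycles(original), stall_cycles(reordered))  # prints: 2 0
```

Each stall inserts one bubble, so the reordered sequence finishes two cycles earlier.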
Data Hazards Solutions (cont.)
Data forwarding (cont.):
We must first detect a data hazard and then forward the proper
value to resolve the hazard
For now, we consider only the challenge of forwarding to an
operation in the EX stage, which may be either
An R-type ALU operation (add, sub, AND, OR, and slt) or
An effective address calculation
When an instruction tries to read a register in its EX stage that
an earlier instruction intends to write in its WB stage, we
actually need the values as inputs to the ALU
There is no hazard in the WB stage itself, because we assume
that the register file supplies the correct result if the instruction
in the ID stage reads the same register written by the
instruction in the WB stage
This is another form of forwarding but it occurs within the
register file
Data Hazards Solutions (cont.)
Data forwarding (cont.):
The two pairs of hazard conditions are:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
EX/MEM.RegisterRd field is the register destination for either:
An ALU instruction, which comes from the instruction Rd field
A load, which comes from the instruction Rt field
Because some instructions do not write registers, this policy is
inaccurate; sometimes it would forward when it should not
Simply check whether the RegWrite signal will be active by
examining the WB control field of the pipeline register during the
EX and MEM stages
We also need to make sure that the RegisterRd field is not zero:
an instruction whose destination is $0 must not forward its
possibly nonzero result value
Individual Exercise (10)
Consider the following code sequence:

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Classify the dependences in the above code sequence according to
the hazard conditions

Show the dependences between the pipeline registers and the
inputs to the ALU for this code sequence
Individual Exercise (10): Answer
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Classify the dependences in the above code sequence according to
the hazard conditions

The sub-and hazard is type 1a:
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2
The sub-or hazard is type 2b:
MEM/WB.RegisterRd = ID/EX.RegisterRt = $2
The dependences in sub-add are not hazards because the
register file supplies the proper data during the ID stage of add
There is no data hazard between sub and sw because sw reads
$2 the clock cycle after sub writes $2
Individual Exercise (10): Answer
Show the dependences between the pipeline registers and the
inputs to the ALU for this code sequence

The dependence begins from a pipeline register, rather than
waiting for the WB stage to write the register file
Thus, the required data exists in time for later instructions, with
the pipeline registers holding the data to be forwarded
Data Hazards Solutions (cont.)
Data forwarding (cont.):
If we can take the inputs to the ALU from any pipeline register
rather than just ID/EX, then we can forward the proper data
By adding multiplexors to the input of the ALU, and with the
proper controls, we can run the pipeline at full speed in the
presence of data dependences
The forwarding control will be in the EX stage, because the ALU
forwarding multiplexors are found in that stage.
Thus, we must pass the operand register numbers from the ID
stage via the ID/EX pipeline register to the forwarding control
to determine whether to forward values
Alternatively, the control of the multiplexors on the ALU inputs could
be determined during the ID stage and set in new control fields of
the ID/EX register
This could make the hardware faster, because the time to select the
ALU inputs is likely to be on the critical path
Data Hazards Solutions (cont.)
[Figure: forwarding hardware with ALU-input multiplexors and the forwarding unit (not reproduced)]
Data Hazards Solutions (cont.)

Data forwarding (cont.):
Conditions for detecting hazards and control signals to resolve them
EX hazard:
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10

if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10

This case forwards the result of the previous instruction to either
input of the ALU
Data Hazards Solutions (cont.)
Data forwarding (cont.):
If the instruction in the WB stage is going to write to the register
file, and the write register number matches the read register
number of ALU inputs A or B, provided it is not in register 0, then
steer the multiplexor to pick the value instead from the pipeline
register MEM/WB

One complication is the potential data hazard between the result
of the instruction in the WB stage, the result of the instruction in
the MEM stage, and the source operand of the instruction in the
ALU stage
In this case, the result is forwarded from the MEM stage because
the result in the MEM stage is the more recent result
Data Hazards Solutions (cont.)
Data forwarding (cont.):
MEM hazard:
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01

if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
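The EX and MEM hazard conditions can be folded into one forwarding-unit model. The Python sketch below is an illustration (assumed function and argument names, mirroring the pipeline-register fields); applying the EX-hazard check last gives it priority, which plays the role of the "and not (EX/MEM.RegWrite ...)" clause:

```python
# ForwardA/ForwardB encodings: "00" = ID/EX (register file),
# "10" = EX/MEM (previous ALU result), "01" = MEM/WB (two instructions back).
def forward(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd,
            id_ex_rs, id_ex_rt):
    fa = fb = "00"
    # MEM hazard: checked first, so the EX hazard below can override it
    if mem_wb_regwrite and mem_wb_rd != 0:
        if mem_wb_rd == id_ex_rs:
            fa = "01"
        if mem_wb_rd == id_ex_rt:
            fb = "01"
    # EX hazard: the previous instruction's result is the most recent, so it wins
    if ex_mem_regwrite and ex_mem_rd != 0:
        if ex_mem_rd == id_ex_rs:
            fa = "10"
        if ex_mem_rd == id_ex_rt:
            fb = "10"
    return fa, fb

# sub $2,$1,$3 followed by and $12,$2,$5: EX hazard on Rs only
print(forward(True, 2, False, 0, 2, 5))  # prints: ('10', '00')
```

Writing the override in code order is equivalent to the explicit "and not" exclusion in the MEM-hazard condition above.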
Individual Exercise (11)
For the following instruction sequence, identify the dependences and
show how forwarding copes with them:

sub $s2, $s1, $s3
and $s4, $s2, $s5
or $s4, $s4, $s2
add $s9, $s4, $s2
Individual Exercise (11): Answer
For the following instruction sequence, identify the dependences and
show how forwarding copes with them:

sub $s2, $s1, $s3
and $s4, $s2, $s5
or $s4, $s4, $s2
add $s9, $s4, $s2

Clock  Instr. in  Registers to be written   Upper ALU input    Lower ALU input
cycle  EX stage   EX/MEM      MEM/WB        Source  From       Source  From
4      and        $s2         --            $s2     EX/MEM     $s5     ID/EX
5      or         $s4         $s2           $s4     EX/MEM     $s2     MEM/WB
6      add        $s4         $s4           $s4     EX/MEM     $s2     ID/EX
Individual Exercise (12)
When summing a vector of numbers in a single register, a
sequence of instructions will all read and write to the same register

add $1, $1, $2
add $1, $1, $3
add $1, $1, $4

Show how forwarding works in this case

Individual Exercise (12): Answer
For the second add, $1 is available in the MEM stage:

EX/MEM.RegisterRd = ID/EX.RegisterRs

For the third add, $1 is available in both the MEM and WB stages:

EX/MEM.RegisterRd = ID/EX.RegisterRs
MEM/WB.RegisterRd = ID/EX.RegisterRs

However, in this case, the result is forwarded from the MEM stage
because the result in the MEM stage is the more recent result

Clock  Instr. in   Registers to be written   Upper ALU input    Lower ALU input
cycle  EX stage    EX/MEM      MEM/WB        Source  From       Source  From
4      second add  $1          --            $1      EX/MEM     $3      ID/EX
5      third add   $1          $1            $1      EX/MEM     $4      ID/EX
Data Hazards Solutions (cont.)
Data forwarding (cont.):
To add the sign-extended immediate input, needed by loads and
stores, to the ALU, a 2:1 multiplexor is added to choose between
the ForwardB multiplexor output and the sign-extended immediate
Store forwarding is done by connecting the forwarding
multiplexor output, containing store data, to the EX/MEM pipeline
register

Individual Exercise (13)
Consider the following code sequence:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)

1. Find the hazard in this code
2. Reorder the instructions to avoid pipeline stalls

Individual Exercise (13): Answer
1. Find the hazard in this code

The hazard occurs on $t2 between the second lw and the first sw

2. Reorder the instructions to avoid pipeline stalls

Swapping the two store instructions removes this hazard:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)

We do not create a new hazard, as there is still one instruction
between the write of $t0 by the load and the read of it
With store forwarding, the reordered code takes 4 cycles
Data Hazards Solutions (cont.)
Pipeline stalls (bubbles):
Forwarding cannot help when an instruction tries to read a
register following a load instruction that writes the same register
This is called a load-use data hazard
Use latency: number of clock cycles between a load instruction
and an instruction that can use the result of the load without
stalling the pipeline
We can handle these cases using either hardware detection and
stalls or software that reorders code to try to avoid load-use
pipeline stalls
The pipeline must stall (bubbles are inserted) for the combination
of a load followed by an instruction that reads its result
Stalls (or inserting bubbles) are equivalent to inserting nops
However, bubbles are inserted at runtime, whereas nops are
inserted at compile time!

Individual Exercise (14)
Suppose that we have a load instruction followed immediately by a
subtract that uses the loaded word:

lw $s0, 20($t1)
sub $t2, $s0, $t3

Show what pipeline stages would be connected by forwarding

Individual Exercise (14): Answer
Suppose that we have a load instruction followed immediately by a
subtract that uses the loaded word:
lw $s0, 20($t1)
sub $t2, $s0, $t3
Show what pipeline stages would be connected by forwarding

$s0 would be available only after the fourth stage of the first
instruction, which is too late for the input of the third stage of
the sub!
We would have to stall one stage for the load-use data hazard

Data Hazards Solutions (cont.)
Stalls (bubbles) (cont.):
We need a hazard detection unit that operates during the ID
stage to insert the stall between the load and its use
To check for loads, the control for the hazard detection unit is:
if (ID/EX.MemRead
and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
or (ID/EX.RegisterRt = IF/ID.RegisterRt))) stall the pipeline

If the condition holds, the instruction stalls 1 clock cycle
After this 1-cycle stall, the forwarding logic continues as usual

In the case of a load immediately followed by a store, it is
possible to avoid a stall, since the data exists in the MEM/WB register
of the load in time for its use in the MEM stage of the store
We would need to add forwarding into the MEM stage for this option
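The load-use check above can be modeled directly. This is a sketch with assumed names taken from the pipeline-register notation, not a hardware description:

```python
# Load-use hazard detection (operates during ID): stall when the
# instruction in EX is a load whose destination (Rt) matches either
# source register of the instruction in ID.
def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# lw $2, 20($1) in EX; and $4, $2, $5 in ID: stall one cycle
print(must_stall(True, 2, 2, 5))  # prints: True
```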
Data Hazards Solutions (cont.)
Stalls (bubbles) (cont.):
If the instruction in the ID stage is stalled, the one in the IF stage
must be stalled; otherwise, the fetched instruction could be lost
Preventing these two instructions from making progress is
accomplished by preventing the PC and IF/ID from changing
Provided these registers are preserved, the instruction in the IF
stage will continue to be read using the same PC, and the
registers in the ID stage will continue to be read using the same
instruction fields in the IF/ID pipeline register
The hazard detection unit controls the value written in PC and
IF/ID plus multiplexors that choose between the real control
values and all 0s
To stall the pipeline, the back half of the pipeline starting with the
EX stage must be executing nop instructions
A bubble is inserted into the pipeline by deasserting all nine
control signals in the EX, MEM, and WB fields of the ID/EX
Actually, to avoid writing registers or memory, only RegWrite and
MemWrite need to be 0, while the other signals can be don't cares
Data Hazards Solutions (cont.)
[Figure: stalls (bubbles); this diagram is missing the
sign-extended immediate and branch logic]
Individual Exercise (15)
Consider the following code sequence:

lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

1. Show that a hazard in the above code sequence cannot be solved
by forwarding
2. Show how inserting bubbles could handle data hazards in this code
sequence

Individual Exercise (15): Answer
1. Show that a hazard in the above code sequence cannot be solved
by forwarding

The data is being read from memory in clock cycle 4 while the ALU
is performing the operation for the following instruction
Since the dependence between the lw and the following and goes
backward in time, this hazard cannot be solved by forwarding
Individual Exercise (15): Answer (cont.)
2. Show how inserting bubbles could handle data hazards in this
code sequence

[Pipeline diagram with the inserted bubble (not reproduced)]
Control Hazards Solutions
Stalls on branches (bubbles):
One possible solution for control hazards is to stall immediately
after we fetch a branch, waiting until the pipeline determines the
outcome of the branch and knows what instruction address to
fetch from
Let's assume that we put in enough extra hardware so that we
can test registers, calculate the branch address, and update the
PC during the second stage of the pipeline (see the early
decision approach in the next slides); then the pipeline involving
conditional branches is stalled for only one clock cycle
If we cannot resolve the branch in the second stage, as is often
the case for longer pipelines, then we would see an even larger
slowdown if we stall on branches
This option works, but it is too slow
The cost of this option is too high for most computers to use
Individual Exercise (16)
Estimate the cost of the control hazard in the following code
sequence:

40 beq $1, $3, 28
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2

72 lw $4, 50($7)

Individual Exercise (16): Answer
Estimate the cost of the control hazard in the following code sequence:

As the branch instruction decides whether to branch in the MEM stage,
the three following sequential instructions will be fetched and their
execution will begin
Thus, the cost of a taken branch is 3 extra cycles
Individual Exercise (17)
Given that branches are 17% of the instructions executed in
SPECint2006, estimate the impact on the clock cycles per
instruction (CPI) of stalling on branches

Assume that:
All other instructions have a CPI of 1 and
We put in enough extra hardware so that we can test registers,
calculate the branch address, and update the PC during the
second stage (i.e., the pipeline is stalled for only 1 cycle)

Individual Exercise (17): Answer
Given that branches are 17% of the instructions executed in
SPECint2006, estimate the impact on the clock cycles per
instruction (CPI) of stalling on branches

Assume that:
All other instructions have a CPI of 1
We put in enough extra hardware so that we can test registers,
calculate the branch address, and update the PC during the
second stage (i.e., the pipeline is stalled for only 1 cycle)

17% of the instructions have a CPI of 2; all other instructions have a CPI of 1

CPI = 1 × (1 − 0.17) + 2 × 0.17 = 1.17
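The same arithmetic as a tiny sketch:

```python
# CPI with a 1-cycle branch stall: branches take 2 cycles, all others 1.
branch_frac = 0.17
cpi = 1 * (1 - branch_frac) + 2 * branch_frac
print(round(cpi, 2))  # prints: 1.17
```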
Control Hazards Solutions (cont.)
Early decision to reduce the delay of branches:
One way to improve branch performance is to reduce the cost
(delay) of the taken branch
Thus far, we have assumed the next PC for a branch is selected in
the MEM stage
If we move the branch execution earlier in the pipeline, then
fewer instructions need to be flushed
If we move the branch decision (execution) to the ID stage, only
one instruction needs to be flushed (the one being fetched)
Control Hazards Solutions (cont.)
Early decision (cont.):
Moving the branch decision up requires 2 actions to occur earlier:
1. Move up the branch address calculation:
Easy part
We already have the PC value and the immediate field in
the IF/ID register, so we just move the branch adder from
the EX stage to the ID stage
Of course, the branch target address calculation will be
performed for all instructions, but only used when needed
2. Move up the branch decision:
The harder part
For beq, we could compare the two registers read during
the ID stage to see if they are equal
Equality can be tested by first XORing their respective bits
and then ORing all the results (faster than using the ALU)
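The XOR-then-OR equality test can be sketched at the bit level; this is a Python illustration of the hardware idea, not a hardware description:

```python
# Register equality the way the ID-stage hardware tests it: XOR the two
# values bit by bit, then OR-reduce the result; the registers are equal
# exactly when the OR of all XOR bits is 0.
def regs_equal(a, b, width=32):
    xor = a ^ b
    any_diff = 0
    for i in range(width):        # OR-reduce the XOR result
        any_diff |= (xor >> i) & 1
    return any_diff == 0

print(regs_equal(0x1234, 0x1234), regs_equal(0x1234, 0x1235))  # prints: True False
```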
Control Hazards Solutions (cont.)
Early decision (cont.):
Moving the branch test to the ID stage implies additional hazard
detection and forwarding hardware, as a branch dependent on a
result still in the pipeline must work properly with this optimization
To implement beq or bne, we will need to forward results to
the equality test logic that operates during the ID stage
There are two complication factors:
1. During the ID stage, we must decode the instruction, decide
whether a bypass to the equality unit is needed, and complete
the equality comparison so that if the instruction is a branch,
we can set the PC to the branch target address
Forwarding for the operands of branches was formerly
handled by the ALU forwarding logic, but the introduction
of the equality test unit in the ID stage will require new
forwarding logic
Note that the bypassed source operands of a branch can
come from either the EX/MEM or MEM/WB registers
Control Hazards Solutions (cont.)
Early decision (cont.):
2. Because the values in a branch comparison are needed during
the ID stage but may be produced later in time, it is possible
that a data hazard can occur and a stall will be needed
For example, if an ALU instruction immediately preceding
a branch produces one of the operands for the
comparison in the branch, a stall will be required, since
the EX stage for the ALU instruction will occur after the
ID cycle of the branch
By extension, if a load is immediately followed by a
conditional branch that depends on the load result, two stall
cycles will be needed, as the result from the load appears
at the end of the MEM cycle but is needed at the
beginning of the ID cycle for the branch

Control Hazards Solutions (cont.)
Early decision (cont.):
Despite these difficulties, moving the branch execution to the ID
stage is an improvement, because it reduces the penalty of a
branch to only one instruction if the branch is taken, namely, the
one currently being fetched
To flush instructions in the IF stage, we add a control line, called
IF.Flush, that zeros the instruction field of the IF/ID register
Clearing the register transforms the fetched instruction into a
nop, an instruction that has no action and changes no state
In reality, the flush line comes from the hardware that determines
whether the branch is taken, labeled with an equal sign
Even with this extra hardware, the pipeline involving conditional
branches would have to stall for one clock cycle
Control Hazards Solutions (cont.)

Early decision
Home Exercise (18)
Work out the example on page 378 step by step!

Control Hazards Solutions (cont.)
Branch prediction:
Branch prediction: predict the outcome of the branch
instruction and proceed from that assumption rather than waiting
to ascertain the actual outcome
When a prediction is wrong, the pipeline control must ensure that
the instructions following the wrongly guessed branch have no
effect and must restart the pipeline from the proper branch
address
Longer pipelines exacerbate the problem by raising the cost of
misprediction

Control Hazards Solutions (cont.)
Branch prediction (cont.):
There are three ways to do branch prediction:
1. Assume branch not taken: always predict branches to fail
and continue execution down the sequential instruction flow
Pipeline is not slowed down when the branch is not taken
If the branch is taken, the instructions that are being
fetched and decoded must be discarded and execution
continues at the branch target
This is equivalent to a stall; that is, only when branches
are taken does the pipeline stall
To discard instructions:
Change original control values to 0s in the IF, ID, and
EX stages when the branch reaches the MEM stage
Discarding instructions here means we must be able
to flush instructions in the IF, ID, and EX stages
For load-use stalls, we just change control to 0 in the
ID stage and let them percolate through the pipeline
Control Hazards Solutions (cont.)
Branch prediction (cont.):
2. A more sophisticated branch prediction would have some
branches predicted as taken and some as untaken
As an example, at the bottom of loops are branches
that jump back to the top of the loop
Since they are likely to be taken and they branch
backwards, we could always predict taken for branches
that jump to an earlier address
3. Dynamic hardware predictors: make guesses depending
on the behavior of each branch and may change predictions
for a branch over the life of a program
In an aggressive pipeline, a simple static prediction
scheme will probably waste too much performance
With more hardware, it is possible to try to predict
branch behavior during program execution
Dynamic predictors have increased in popularity as the
number of transistors per chip has increased
Group Exercise (19)
Assume that branches are predicted to be not taken

Show the pipeline when the branch in the following code sequence
is not taken and when it is taken:

add $4, $5, $6
beq $1, $2, 40
lw $3, 300($0)

or $7, $8, $9

Group Exercise (19): Answer
Assume that branches are predicted to be not taken
Show the pipeline when the branch in the following code sequence
is not taken and when it is taken:

[Pipeline diagrams for the not-taken and taken cases (not reproduced)]
Control Hazards Solutions (cont.)
Dynamic branch prediction:
One popular approach for dynamic prediction of branches is
keeping a history for each branch as taken or untaken, and then
using the recent past behavior to predict the future
The amount and type of history kept have become extensive
The result has been that dynamic branch predictors can
correctly predict branches with more than 90% accuracy
One approach to do this is to look up the address of the
instruction to see if a branch was taken the last time this
instruction was executed, and, if so, to begin fetching new
instructions from the same place as the last time
One implementation of that approach is a branch prediction
buffer or branch history table
A branch prediction buffer is a small, special memory indexed by
the lower portion of the address of the branch instruction during
the IF stage
Control Hazards Solutions (cont.)
Dynamic branch prediction (cont.):
The branch prediction buffer contains a bit that says whether the
branch was recently taken or not
This is the simplest sort of buffer
We do not know, in fact, if the prediction is the right one as it
may have been put there by another branch that has the same
low-order address bits
However, this does not affect correctness
Prediction is just a hint that we hope is correct, so fetching
begins in the predicted direction
If the hint turns out to be wrong, the incorrectly predicted
instructions are deleted, the prediction bit is inverted and
stored back, and the proper sequence is fetched and executed
This simple 1-bit prediction scheme has a performance
shortcoming: even if a branch is almost always taken, we can
predict incorrectly twice, rather than once, when it is not taken
Class Exercise (20)
Consider a loop branch that branches nine times in a row, then it is
not taken once

What is the prediction accuracy for this branch, if it is using a
single bit for prediction?

Assume the prediction bit for this branch remains in the prediction
buffer

Class Exercise (20): Answer
Consider a loop branch that branches nine times in a row, then it is
not taken once
What is the prediction accuracy for this branch, if it is using a
single bit for prediction?
Assume the prediction bit for this branch remains in the prediction
buffer

The steady-state prediction behavior will mispredict on the first
and last loop iterations:
Mispredicting the last iteration is inevitable since the
prediction bit will indicate taken, as the branch has been
taken nine times in a row at that point
The misprediction on the first iteration happens because
the bit is flipped on prior execution of the last iteration of
the loop, since the branch was not taken on the exiting
iteration
Thus, the prediction accuracy for this branch that is taken 90%
of the time is only 80% (two incorrect predictions and eight
correct ones)!
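The 80% figure can be reproduced by simulating the 1-bit scheme; the sketch below models the buffer entry as a single stored bit that is overwritten on every branch outcome (helper name is hypothetical):

```python
# 1-bit predictor on a loop branch: taken 9 times in a row, then not taken.
# In steady state the stored bit mispredicts both the last iteration (exit)
# and the first iteration of the next loop execution.
def one_bit_accuracy(outcomes, rounds=3):
    bit, correct, total = outcomes[-1], 0, 0
    for _ in range(rounds):           # a few rounds to reach steady state
        for taken in outcomes:
            correct += (bit == taken)
            total += 1
            bit = taken               # the buffer stores the latest outcome
    return correct / total

print(one_bit_accuracy([True] * 9 + [False]))  # prints: 0.8
```

Two mispredictions out of ten branches gives 80%, even though the branch is taken 90% of the time.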
Control Hazards Solutions (cont.)
Dynamic branch prediction (cont.):
Ideally, the accuracy of the predictor would match the taken
branch frequency for these highly regular branches
The 1-bit prediction scheme will likely predict incorrectly twice!
To remedy this, 2-bit prediction schemes are often used
In a 2-bit scheme, a prediction must be wrong twice before it is
changed

Control Hazards Solutions (cont.)
Dynamic branch prediction (cont.):
By using the 2 bits rather than 1, a branch that strongly favors
taken or not taken (as many branches do) will be mispredicted
only once

The 2 bits are used to encode the four states in the system

The 2-bit scheme is a general instance of a counter-based


predictor, which is incremented when the prediction is accurate
and decremented otherwise, and uses the midpoint of its range
as the division between taken and not taken

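A minimal model of the 2-bit saturating counter (states 0 to 3, predicting taken in the upper half of the range) shows the improvement on the loop branch of Exercise 20: only the loop exit mispredicts.

```python
# 2-bit saturating counter: states 0-3; predict taken when state >= 2.
# A prediction must be wrong twice in a row before the direction flips.
class TwoBitPredictor:
    def __init__(self, state=3):              # start at strongly taken
        self.state = state
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 3         # the loop branch, run 3 times
correct = sum(1 for taken in outcomes
              if p.predict() == taken or p.update(taken))
# (the or-trick above is just compact; update() returns None, so only
#  correct predictions are counted, and update() runs on every outcome)
```

Written more plainly, the counting loop is: predict, compare, then update. The accuracy comes out to 0.9 for this branch, matching the taken frequency.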
Control Hazards Solutions (cont.)
Advanced dynamic branch prediction techniques
Branch target buffer: a cache to hold the destination PC or
destination instruction
A branch predictor tells us whether or not a branch is taken
We still require the calculation of the branch target
In our pipeline, this calculation takes one cycle, meaning that
taken branches will have a 1-cycle penalty
Branch target buffer is one approach to eliminate that penalty

Control Hazards Solutions (cont.)
Advanced dynamic branch prediction techniques (cont.)
Correlating predictor: a branch predictor that combines local
behavior of a particular branch and global information about the
behavior of some recent number of executed branches
Yields greater prediction accuracy for the same number of
prediction bits
A 2-bit dynamic branch predictor scheme uses only
information about a particular branch
A typical correlating predictor might have 2-bit predictors for
each branch, with the choice between predictors made based
on whether the last executed branch was taken or not taken
Thus, the global branch behavior can be thought of as adding
additional index bits for the prediction lookup

Control Hazards Solutions (cont.)
Advanced dynamic branch prediction techniques (cont.)
Tournament predictor: uses multiple predictions, tracking, for
each branch, which predictor yields the best result
A typical tournament predictor might contain two predictors
for each branch index: one based on local information and one
based on global branch behavior
A selector would choose which predictor to use for any
prediction
The selector can operate similarly to a 1- or 2-bit predictor,
favoring whichever of the two predictors has been more
accurate
Some recent microprocessors use such elaborate predictors

Control Hazards Solutions (cont.)
Delayed branches (decisions):
Delayed branch: always executes the instruction immediately following the branch; only the second instruction after the branch is affected by the branch outcome
The branch takes place after that one-instruction delay
Compilers and assemblers try to place an instruction that always
executes after the branch in the branch delay slot
Branch delay slot: the slot directly after a delayed branch
instruction, which in the MIPS architecture is filled by an
instruction that does not affect the branch
The job of the software is to make the successor instructions valid
and useful
MIPS software places an instruction immediately after the delayed
branch instruction that is not affected by the branch, and a taken
branch changes the address of the instruction that follows this
safe instruction

Control Hazards Solutions (cont.)

[Figure: Scheduling the branch delay slot — (a) from before the branch, (b) from the target, (c) from the not-taken fall-through]
Control Hazards Solutions (cont.)
Delayed branches (cont.)
Scheduling the branch delay slot:
1. with an independent instruction from before the branch
this is the best choice
if there is a dependence between the branch condition and
instruction before (like $s1 in the previous figure), we
cannot use the instruction before
2. from the target of the branch
usually the target instruction will need to be copied
because it can be reached by another path
this strategy is preferred when the branch is taken with
high probability, such as a loop branch
also executing the instruction in the delay slot should not
affect execution in case of a not taken branch
3. from the not-taken fall-through
Executing the instruction in the delay slot should not affect
execution in case of a taken branch
For example, changing an unused temporary register
Control Hazards Solutions (cont.)
Delayed branches (cont.)
Delayed branches are hidden from the MIPS assembly language
programmer because the assembler can automatically arrange the
instructions to get the branch behavior desired by the programmer
In all delayed branch cases, the program should execute correctly
when the branch goes in the unexpected direction
Since delayed branches are useful only when the branch delay is short, no processor uses a delayed branch of more than one cycle
For longer branches, hardware-based branch prediction is used
The limitations on delayed-branch scheduling arise from:
the restrictions on the instructions that are scheduled into the
delay slots
our ability to predict at compile time whether a branch is likely
to be taken or not
Delayed branching is losing popularity as processors go both:
to longer pipelines (see later!) and
toward issuing multiple instructions per clock cycle (later!)
Individual Exercise (21)
Assume that the branch decision can be made in the second stage
Show how to use a delayed branch to avoid the control hazard in the following code sequence:

add $4, $5, $6
beq $1, $2, 40
lw $3, 300($0)
Individual Exercise (21): Answer
Assume that branch decision can be done in the second stage
Show how to use a delayed branch to avoid control hazard in the
following code sequence

The add instruction before the branch does not affect the branch,
so we move it to the delayed branch slot following the branch
Thus, the single pipe bubble has been replaced by add
[Pipeline diagram, program execution order vs. time: beq $1, $2, 40 issues first; add $4, $5, $6 occupies the delayed branch slot; lw $3, 300($0) follows; each instruction passes through instruction fetch, register read, ALU, data access, and register write, with successive instructions starting 2 ns apart]
Control Hazards Solutions (cont.)
Conditional move instructions:
One way to reduce the number of conditional branches
Instead of changing the PC with a conditional branch, the
instruction conditionally changes the destination register of the
move
If the condition fails, the move acts as a nop
The MIPS instruction set architecture has two such instructions, called:
movn: move if not zero
movz: move if zero
The ARM instruction set has a condition field in most instructions
ARM programs could have fewer conditional branches than in
MIPS programs

Individual Exercise (22)
Explain how the following conditional move instruction works

movn $8, $11, $4

Individual Exercise (22): Answer
Explain how the following conditional move instruction works

movn $8, $11, $4

The instruction copies the contents of register $11 into register $8,
provided that the value in register $4 is nonzero
Otherwise, it does nothing
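The effect can be modeled with a small Python sketch (representing the register file as a dictionary is purely an illustration, not how the hardware works):

```python
def movn(regs, rd, rs, rt):
    """Model of MIPS movn semantics: regs[rd] = regs[rs] if regs[rt] != 0."""
    if regs[rt] != 0:
        regs[rd] = regs[rs]  # the move happens only on a nonzero condition
    return regs              # otherwise the instruction acts as a nop

regs = {4: 1, 8: 0, 11: 99}  # $4 holds a nonzero value
movn(regs, rd=8, rs=11, rt=4)
print(regs[8])  # 99: $4 is nonzero, so $11 was copied into $8
```

With $4 equal to zero, `regs[8]` would be left unchanged.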

Group Exercise (23)
Calculate the average CPI for a pipelined implementation
Instruction mix: 24% loads, 12% stores, 44% R-type, 18% branches, and 2% jumps
Half of the load instructions are immediately followed by an instruction that uses the result
The branch delay on misprediction is 1 clock cycle
One quarter of branches are mispredicted
Assume that jumps always pay 1 full clock cycle of delay
Group Exercise (23): Answer
Loads take 1 clock cycle when there is no load-use dependence (50% of the time) and 2 clock cycles when there is (50% of the time):
CPI_loads = 1 × 0.50 + 2 × 0.50 = 1.50
Store and R-type instructions take 1 clock cycle
Branches take 1 clock cycle when predicted correctly (75% of the time) and 2 when not (25% of the time):
CPI_branches = 1 × 0.75 + 2 × 0.25 = 1.25
Jump instructions take 2 clock cycles
Thus:
CPI_overall = 1.50 × 0.24 + 1 × 0.12 + 1 × 0.44 + 1.25 × 0.18 + 2 × 0.02 ≈ 1.19
From the results obtained in Chapter 4, Part I it is clear that the pipelined implementation is much faster than the single-cycle (shorter clock period) and the multicycle ones
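The weighted sum can be checked with a few lines of Python (the class labels in the dictionaries are just names for this sketch):

```python
# Instruction mix and per-class CPI from the exercise
mix = {"load": 0.24, "store": 0.12, "rtype": 0.44, "branch": 0.18, "jump": 0.02}
cpi = {
    "load":   1 * 0.50 + 2 * 0.50,  # half the loads stall one extra cycle -> 1.50
    "store":  1.0,
    "rtype":  1.0,
    "branch": 1 * 0.75 + 2 * 0.25,  # a quarter of branches mispredicted -> 1.25
    "jump":   2.0,                  # jumps always pay one delay cycle
}
overall = sum(mix[k] * cpi[k] for k in mix)
print(overall)  # approximately 1.185, i.e. an average CPI of about 1.19
```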
Designing Instruction Set for Pipelining
The MIPS instruction set was designed to be pipelined, which makes pipelining easy:
1. All MIPS instructions are the same length
This restriction makes it much easier to fetch instructions in
the first pipeline stage and to decode them in the second one
Widely variable instruction lengths and running times can lead to imbalance among pipeline stages, causing other stages to back up, and can severely complicate hazard detection and the maintenance of precise exceptions (later!) in a design pipelined at the instruction set level
In an instruction set like x86, where instructions vary from 1
byte to 17 bytes, pipelining is considerably more challenging
Recent implementations of the architecture actually
translate x86 instructions into simple operations that look
like MIPS instructions and then pipeline the simple
operations rather than the native x86 instructions!
Designing Instruction Set for Pipelining (cont.)
2. MIPS has only a few instruction formats, with the source
register fields being located in the same place in each
instruction
This symmetry means that the second stage can begin
reading the register file at the same time that the hardware
is determining what type of instruction was fetched
If MIPS instruction formats were not symmetric, we would
need to split stage 2, resulting in six pipeline stages
The regular format of the MIPS instructions allows reading
and decoding to occur simultaneously
3. Memory operands only appear in loads or stores in MIPS
This restriction means we can use the execute stage to
calculate the memory address and then access memory in
the following stage
If we could operate on the operands in memory, as in the
x86, stages 3 and 4 would expand to an address stage,
memory stage, and then execute stage
Designing Instruction Set for Pipelining (cont.)
4. Operands must be aligned in memory in MIPS
Hence, we need not worry about a single data transfer
instruction requiring two data memory accesses
The requested data can be transferred between processor
and memory in a single pipeline stage
5. Each MIPS instruction writes a single result and does so
at the end of its execution
Simplifies the handling of exceptions and the maintenance
of a precise exception model (see later)
Data forwarding is harder if there are multiple results to
forward per instruction or they need to write before the
instruction end
PowerPCs load instructions may use update addressing, so
the processor must be able to forward two results per load
6. The MIPS architecture makes it easy for designers to
avoid structural hazards when designing a pipeline
by, for example, having two memories
Designing Instruction Set for Pipelining (cont.)
7. MIPS does not have sophisticated addressing modes
Sophisticated addressing modes can lead to different sorts of problems
Addressing modes that update registers, such as update addressing, complicate hazard detection
Other addressing modes that require multiple memory
accesses substantially complicate pipeline control and make
it difficult to keep the pipeline flowing smoothly
8. The MIPS architecture was designed to support fast
single-cycle branches that could be pipelined with a
small penalty
The designers observed that many branches rely only on
simple tests (e.g., equality or sign) that do not require a full
ALU operation but can be done with at most a few gates
When a more complex branch decision is required, a
separate instruction that uses an ALU to perform a
comparison is required
Like the use of condition codes for branches
Exceptions
An exception is an unexpected event from within the processor,
e.g., arithmetic overflow, invoking the OS from user program, using
an undefined instruction, internal hardware malfunction, etc.
An interrupt is an event that also causes an unexpected change in
control flow but comes from outside of the processor
Interrupts are used by I/O devices to communicate with the
processor
External hardware malfunctions could cause an interrupt
Exceptions and interrupts are events other than branches and
jumps that change the normal flow of instruction execution
Exceptions and interrupts are unscheduled events that disturb program execution
Detecting exceptional conditions and taking the appropriate action is often on the critical timing path of a processor, which determines the clock cycle time and thus performance
Exceptions (cont.)
Many architectures do not distinguish between interrupts and
exceptions, often using the older name interrupt to refer to both
types of events
We follow the MIPS convention:
using the term exception to refer to any unexpected change in
control flow without distinguishing whether the cause is internal
or external (both interrupts and exceptions)
using the term interrupt only when the event is externally caused

Type of event From where? MIPS terminology


I/O device request External Interrupt
Invoke the OS from user program Internal Exception
Arithmetic overflow Internal Exception
Using an undefined instruction Internal Exception
Hardware malfunctions Either Exception or interrupt
Exceptions (cont.)
Our implementation can generate two types of exceptions:
1. execution of undefined instruction
2. an arithmetic overflow
The basic actions that the processor must perform when an
exception occurs are to:
Save the address of the offending instruction in the exception
program counter (EPC) and then
Transfer control to the OS at some specified address
The OS can take the appropriate action, which may involve:
providing some service to the user program,
taking some predefined action in response to an overflow, or
stopping the execution of the program and reporting an error
After performing whatever action is required, the OS can:
terminate the program, or
may continue its execution, using the EPC to determine
where to restart the execution of the program
Exceptions (cont.)
For the OS to handle the exception, it must know the reason for the
exception and the instruction that caused it
Processors include a status register to hold a field that indicates the
reason for the exception

Some processors use vectored interrupts, where the address to which control is transferred is determined by the exception cause
The OS knows the reason for the exception by the address at
which it is initiated
The addresses are separated by 32 bytes or eight instructions,
and the OS must record the reason for the exception and may
perform some limited processing in this sequence
When the exception is not vectored, a single entry point for
all exceptions can be used, and the OS decodes the status
register to find the cause

Exceptions (cont.)
Let's assume that we are implementing the exception system used in
the MIPS architecture, with the single entry point being the address
8000 0180hex

Two additional 32-bit registers are added to the MIPS implementation: the exception program counter (EPC) and the Cause register:
1. EPC: used to hold the address of the affected instruction
2. Cause: a status register used to record the cause of the exception
A five-bit field encodes the two possible exception sources
10 represents an undefined instruction and
12 represents arithmetic overflow
The ALU overflow signal is an input to the control unit
Other bits are currently unused

Exceptions (cont.)
A pipelined implementation treats exceptions as another form of
control hazard
Just as we did for the taken branch, we must flush the instructions
that follow the one that throws the exception from the pipeline and
begin fetching instructions from the new address
We will use the same mechanism we used for taken branches, but
this time the exception causes the deasserting of control lines
The instruction should stop immediately to give the programmer the
chance to track the register values causing the exception
Otherwise, the instruction causing the exception or any following
one could change the register values
Many exceptions require that we eventually complete the instruction
that caused the exception as if it executed normally
The easiest way to do this is to flush the instruction and restart
it from the beginning after the exception is handled

Exceptions (cont.)
We saw how to flush the instruction in the IF stage by turning it
into a nop
To flush instructions in the ID stage, we use the multiplexor already
in the ID stage that zeros control signals for stalls
A new control signal, ID.Flush, is ORed with the stall signal from the
hazard detection unit to flush during ID
To flush instructions in the EX phase, we use a new signal called
EX.Flush to cause new multiplexors to zero the control lines
The EX.Flush signal is also used to prevent the instruction in the EX
stage from writing its result in the WB stage
We need to save the address of the offending instruction in EPC
We must subtract 4 from the updated PC before saving it in EPC
To start fetching instructions from location 8000 0180hex, which is
the MIPS exception address, we simply add an additional input to
the PC multiplexor that sends 8000 0180hex to the PC

Exceptions (cont.)

[Figure: pipelined datapath with exception support; the exception routine address is 8000 0180hex]
Exceptions (cont.)
Handling multiple exceptions is also important
With pipelined execution, it is important to
associate the exception with its cause instruction
prioritize the exceptions to determine which is serviced first
In most MIPS implementations, the hardware sorts exceptions
so that the earliest instruction is interrupted
I/O device requests and hardware malfunctions are not
associated with a specific instruction, so the implementation
has some flexibility as to when to interrupt the pipeline
The exception software must match the exception to the instruction
Know in which pipeline stage a type of exception can occur
For example, an undefined instruction is discovered in the ID
stage, and invoking the OS occurs in the EX stage
Exceptions are collected in the Cause register in a pending
exception field so that the hardware can interrupt based on later
exceptions, once the earliest one has been serviced
A precise exception is one that is always associated with the correct instruction in pipelined computers; otherwise, exceptions are imprecise!
Home Exercise (24)
Work out the example on pages 388-389 step by step!

Home Exercise (25)
Find the width of each of the pipeline registers in each of the following
four pipelined MIPS architectures:

Included feature A1 A2 A3 A4
No hazard solution
All data hazards solutions
All control hazards solutions
Exception support

In a tabular form, explain in detail your solution

Parallelism and Advanced ILP
Pipelining exploits the potential parallelism among instructions
This parallelism is called instruction-level parallelism (ILP)
There are two primary methods for increasing the potential
amount of ILP:
1. Superpipelining: Increasing the depth of the pipeline to
overlap more instructions
Divide the longer steps into smaller ones
To get the full speed-up, we need to rebalance the remaining
steps so they are the same length
The amount of parallelism being exploited is higher, since
there are more operations being overlapped
Performance is potentially greater since the clock cycle can
be shorter
2. Multiple issue: Replicating the internal components of the
computer so that it can launch multiple instructions in every
pipeline stage
Superpipelining (Deep Pipelining)
Superpipelining (deep pipelining): simply means longer pipelines
Since the ideal maximum speedup from pipelining increases with the number of pipeline stages, superpipelining is supposed to increase performance
[Chart: relative performance (0.0 to 3.0) versus pipeline depth (1, 2, 4, 8, 16)]

Longer pipelines increase the problem of control hazards:


1. They raise the cost of branch misprediction
2. If we cannot resolve the branch in the second stage, as is often
the case for longer pipelines, then we would see an even larger
slowdown if we stall on branches
3. If the pipeline is longer than five stages, then we may get more
branch delay slots, which are harder to fill
Multiple Issue
Launching multiple instructions per stage allows the instruction execution rate to exceed the clock rate or, equivalently, the CPI to be less than 1
If m is the number of instructions issued per clock cycle, then the best-case CPI is 1/m
It is sometimes useful to flip the CPI and use IPC, or instructions per clock cycle: IPC = 1/CPI
Multiple-issue processors have IPC > 1
Today's high-end microprocessors attempt to issue from three to six instructions in every clock cycle
The downside of multiple issue is the extra work needed to keep all the hardware busy and to transfer instructions to the next pipeline stage
There are typically many constraints on which types of instructions may be executed simultaneously and on what happens when dependences arise
Individual Exercise (26)
Consider a 4 GHz four-way multiple-issue, five-stage pipelined microprocessor:
1. Calculate the best CPI and IPC
2. Calculate the peak MIPS (or GIPS)
3. What is the number of instructions that the processor would have in execution at any given time?
Individual Exercise (26): Answer
Consider a 4 GHz four-way multiple-issue, five-stage pipelined microprocessor:
1. Calculate the best CPI and IPC
   Best CPI = 1/m = 1/4 = 0.25
   Best IPC = m = 4
2. Calculate the peak MIPS (or GIPS)
   Peak MIPS = R / (CPI × 10^6) = IPC × R / 10^6 = 4 × 4000 × 10^6 / 10^6 = 16,000
   Peak GIPS = R / (CPI × 10^9) = IPC × R / 10^9 = 4 × 4000 × 10^6 / 10^9 = 16
3. What is the number of instructions that the processor would have in execution at any given time?
   No. of instructions in execution at any time = n × m = 5 × 4 = 20
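The same arithmetic in a small Python sketch (the function name and argument order are made up for this illustration):

```python
def multiple_issue_peaks(clock_hz, issue_width, pipeline_stages):
    """Best-case figures for an ideal multiple-issue pipelined processor."""
    best_cpi = 1 / issue_width
    best_ipc = issue_width
    peak_gips = best_ipc * clock_hz / 1e9      # billions of instructions/second
    in_flight = pipeline_stages * issue_width  # instructions in execution at once
    return best_cpi, best_ipc, peak_gips, in_flight

# 4 GHz, four-way issue, five-stage pipeline, as in the exercise
print(multiple_issue_peaks(4e9, 4, 5))  # (0.25, 4, 16.0, 20)
```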
Multiple Issue (cont.)
Structural hazards are one difficulty that limits the effectiveness of multiple-issue processors
If instructions in the instruction stream are dependent or do not meet certain criteria, only the first few instructions in the sequence are issued, or perhaps even just the first instruction

There are two major ways to implement a multiple-issue processor:
1. Static multiple issue: many decisions are made statically by the compiler before execution
2. Dynamic multiple issue: many decisions are made dynamically during execution by the processor hardware
To effectively exploit the parallelism available in a multiple-issue processor, more ambitious compiler or hardware scheduling techniques are needed
Multiple Issue (cont.)
Packaging instructions into issue slots (how the processor determines how many instructions, and which instructions, can be issued in a given clock cycle):
Static issue processors: handled at least partially by the compiler
Dynamic issue processors: normally dealt with at runtime by the processor hardware; the compiler will have tried to improve the issue rate by placing the instructions in a beneficial order
Dealing with data and control hazards:
Static issue processors: some or all of the consequences of data and control hazards are handled statically by the compiler
Dynamic issue processors: hardware techniques operating at execution time attempt to alleviate at least some classes of hazards
Issue slots: from where instructions could issue in a clock cycle
Issue packet: the set of instructions that issues together in one
clock cycle; the packet may be determined statically or dynamically
The Concept of Speculation
Speculation is an approach that allows the compiler or the
processor to guess about the properties of an instruction, so as to
enable execution to begin for other instructions that may depend
on the speculated instruction
Speculate on the outcome of a branch, so that instructions
after the branch could be executed earlier
Speculate that a store that precedes a load does not refer to
the same address, which would allow the load to be executed
before the store
The difficulty with speculation is that it may be wrong
Any speculation mechanism must include methods to:
1. Check if the guess was right
2. Unroll or back out the effects of the instructions that were
executed speculatively
Implementation of this back-out capability adds complexity

The Concept of Speculation (cont.)
Speculation may be done in the compiler or by the hardware
The compiler can use speculation to reorder instructions, moving
an instruction across a branch or a load across a store
The hardware can perform the same transformation at runtime
The recovery mechanisms used for incorrect speculations are:
The compiler usually inserts additional instructions that check the
accuracy of the speculation and provide a fix-up routine to use
when the speculation is incorrect
The hardware usually buffers the speculative results until it
knows they are no longer speculative
If the speculation is correct, the instructions are completed by allowing the contents of the buffers to be written to the registers or memory
If the speculation is incorrect, the hardware flushes the
buffers and re-executes the correct instruction sequence

The Concept of Speculation (cont.)
Speculation has another problem: speculating on certain instructions
may introduce exceptions that were formerly not present
Suppose a load instruction is moved in a speculative manner, but the address it uses is not legal when the speculation is incorrect
The result would be that an exception that should not have occurred will occur
If the load instruction was not speculative, then the exception
must occur!
The compiler avoids such problems by adding special speculation
support that allows such exceptions to be ignored until it is clear
that they really should occur
The hardware simply buffers exceptions until it is clear that the
instruction causing them is no longer speculative and is ready to
complete; at that point the exception is raised, and normal
exception handling proceeds
Since speculation can improve performance when done properly and
decrease performance when done carelessly, significant effort goes
into deciding when it is appropriate to speculate
Static Multiple Issue
Static multiple-issue processors all use the compiler to assist with
packaging instructions and handling hazards
The set of instructions issued in a given clock cycle, an issue packet, can be thought of as one single large instruction with multiple operations in certain predefined fields

Very Long Instruction Word (VLIW): a style of instruction set architecture that launches many operations that are defined to be independent in a single wide instruction, typically with many separate opcode fields
In VLIW, the compiler guarantees that there are
1. No dependences between instructions that issue at the same time
2. Sufficient hardware resources to execute them
VLIW works well when the source code for the programs is available
so that the programs can be recompiled
VLIW philosophy simplifies instruction decoding and issuing logic
Static Multiple Issue with the MIPS ISA
We consider a simple two-issue MIPS processor
One of the instructions can be an integer ALU operation or branch,
and the other could be load or store
Issuing 2 instructions per cycle requires fetching and decoding 64 bits per cycle
To simplify the decoding and instruction issue, we will require that:
The instructions be paired and aligned on a 64-bit boundary,
with the ALU or branch portion appearing first
If an instruction of the pair cannot be used, it be replaced with
a nop
The compiler avoid all dependences within an instruction pair
Two methods to deal with potential data and control hazards:
The compiler takes full responsibilities for removing all hazards,
scheduling the code and inserting nops for the code to execute
without any need for hazard detection or hardware stalls
[Assumed] The hardware detects data hazards and generates
stalls between two issue packets with a hazard forcing the
entire issue packet containing the dependent instruction to stall
Static Multiple Issue with the MIPS ISA (cont.)

The hardware needs extra resources to overcome structural hazards:
1. Another 32 bits from instruction memory are needed
2. To issue an ALU and a data transfer operation in parallel, the first
need for additional hardware is extra ports in the register file
In one clock cycle we may need to read two registers for the
ALU operation and two more for a store, and also one write
port for the ALU operation and one write port for a load
3. Since one ALU is tied up for the ALU operation, we also need a
separate adder to calculate the effective address for data transfers
Static Multiple Issue with the MIPS ISA (cont.)

This two-issue processor can improve performance by up to a factor of 2
In the simple five-stage MIPS pipeline, loads have a use latency of one clock cycle, which prevents one instruction from using the result without stalling
In the two-issue, five-stage pipeline the result of a load instruction cannot be used on the next clock cycle
This means that the next two instructions cannot use the load result without stalling
Furthermore, ALU instructions that had no use latency in the simple five-stage pipeline now have a one-instruction use latency, since the results cannot be used in the paired load or store
Group Exercise (27)
Consider the following loop:
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop
Assume branches are predicted, so that control hazards are
handled by the hardware
How would this loop be scheduled on a static two-issue pipeline for
MIPS? Reorder the instructions to avoid as many pipeline stalls as
possible
Group Exercise (27): Answer
The first 3 instructions have data dependences, and so do the last 2
One schedule that works (ALU/branch slot | load/store slot | cycle):
Loop: nop                   | lw   $t0, 0($s1)    | 1
      addi $s1, $s1, -4     | nop                 | 2
      addu $t0, $t0, $s2    | nop                 | 3
      bne  $s1, $zero, Loop | sw   $t0, 4($s1)    | 4
Notice that the index in the sw instruction is changed to 4 to
compensate for the subtraction in the addi instruction that is
executed out of order
addi could be scheduled in cycle 1, but then cycle 2 would have to be empty
Notice that just one pair of instructions has both issue slots used
It takes four clocks per loop iteration
At four clocks to execute five instructions, we get a disappointing
CPI of 0.8 versus the best case of 0.5, or an IPC of 1.25 versus 2.0
In computing CPI, we do not count any nops executed as useful
instructions; doing so would improve CPI, but not performance
Group Exercise (28)
Consider the following loop:
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop
For simplicity, assume that the loop index is a multiple of four
1. Show how well loop unrolling and scheduling work in the code
above
2. Compare the performance of the code execution with loop
unrolling and without it (Exercise 27)
Group Exercise (28): Answer
To schedule the loop without any delays, it turns out that we need
to make four copies of the loop body
After unrolling and eliminating the unnecessary loop overhead
instructions, the loop contains four copies each of lw, addu, and
sw, plus one addi and one bne
Since the addi in the first issue packet decrements $s1 by 16, the
addresses loaded are the original value of $s1, then that address
minus 4, etc.
Storing is likewise done first to the original value of $s1, then
that address minus 4, etc.
Group Exercise (28): Answer (cont.)
During the unrolling the compiler introduced additional registers
($t1, $t2, $t3)
This register renaming is done to eliminate false data dependences
Consider how the unrolled code would look using only $t0
There would be repeated instances of {lw $t0, 0($s1)},
{addu $t0, $t0, $s2}, followed by {sw $t0, 0($s1)}
But these sequences, despite using $t0, are actually completely
independent: no data values flow between one set of these
instructions and the next set
Renaming the registers during the unrolling process allows the
compiler to subsequently move these independent instructions so as
to better schedule the code
Group Exercise (28): Answer (cont.)
2. Compare the performance of the code execution with loop unrolling
and without it (Exercise 27)
Ideal CPI is 0.5 (an IPC of 2)
In the original loop, just one pair of instructions executes in
multiple issue
It takes 4 clock cycles per loop iteration to execute 5
instructions, which yields a CPI of 4/5 = 0.8 (an IPC of 1.25)
In the unrolled loop, 12 of the 14 instructions in the loop execute
as pairs
It takes 8 clocks for four loop iterations, or 2 clocks per
iteration, which yields a CPI of 8/14 = 0.57 (an IPC of 1.75)
The improvement in the CPI is due to:
reducing the number of loop control instructions
dual-issue execution
The cost of this improvement is using four temporary registers
rather than one, as well as a significant increase in code size
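The comparison above is a few lines of arithmetic; the following is a minimal check of the quoted numbers (variable names are ours, not the textbook's):

```python
# CPI / IPC arithmetic for the scheduled loop with and without unrolling.
orig_cycles, orig_instrs = 4, 5            # one iteration, no unrolling
unrolled_cycles, unrolled_instrs = 8, 14   # four iterations, unrolled

orig_cpi = orig_cycles / orig_instrs           # 0.8
unrolled_cpi = unrolled_cycles / unrolled_instrs   # ~0.57
# Clocks per array element drop from 4 to 8/4 = 2: a 2x speedup
speedup = orig_cycles / (unrolled_cycles / 4)

print(round(orig_cpi, 2), round(unrolled_cpi, 2), speedup)  # 0.8 0.57 2.0
```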
Static Multiple Issue (cont.)
Loop unrolling: a technique to get more performance from loops
that access arrays, in which multiple copies of the loop body are
made and instructions from different iterations are scheduled together
After unrolling, there is more ILP available by overlapping
instructions from different iterations
Antidependence (also called name dependence): an ordering
forced purely by the reuse of a name, typically a register, rather than
by a true dependence that carries a value between two instructions
Dependences that are not true dependences, but could either lead
to potential hazards or prevent the compiler from flexibly
scheduling the code
Register renaming: the renaming of registers by the compiler or
hardware to remove antidependences
eliminates name dependences, while preserving true dependences
allows the compiler to subsequently move independent code for
better scheduling
Architectural registers: the set of processor-visible registers
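The renaming idea can be sketched directly: give every register write a fresh name and map later reads through the most recent one. This is an illustrative toy (the tuple encoding, the rename() helper, and the p1, p2, … physical names are ours), not the textbook's algorithm:

```python
from itertools import count

def rename(instrs):
    """Assign each write a fresh physical name (p1, p2, ...) and map
    reads through the latest name: antidependences disappear while
    true (value-carrying) dependences are preserved."""
    fresh = count(1)
    latest = {}                                  # architectural -> physical
    out = []
    for op, dest, srcs in instrs:
        srcs = tuple(latest.get(r, r) for r in srcs)   # reads use old names
        if dest is not None:
            latest[dest] = f"p{next(fresh)}"           # fresh name per write
            dest = latest[dest]
        out.append((op, dest, srcs))
    return out

# Two unrolled copies of the loop body, both written with $t0:
body = [
    ("lw",   "$t0", ("$s1",)),
    ("addu", "$t0", ("$t0", "$s2")),
    ("sw",   None,  ("$t0", "$s1")),
] * 2
for ins in rename(body):
    print(ins)
```

After renaming, the first copy uses p1/p2 and the second uses p3/p4, so the two copies share no names and the scheduler is free to interleave them.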
Dynamic Multiple-Issue Processors
Dynamic multiple-issue processors are also known as superscalar
processors, or simply superscalars
Superscalar: An advanced pipelining technique that enables the
processor to execute more than one instruction per clock cycle by
selecting them during execution
In the simplest superscalar processors:
Instructions issue in order
The processor decides whether zero, one, or more instructions
can issue in a given clock cycle
Superscalars also extend dynamic issue decisions to include dynamic
pipeline scheduling
reordering instruction execution in hardware
Dynamic pipeline scheduling: chooses which instructions to
execute next, possibly reordering them to avoid hazards and stalls
Static pipeline scheduling: stalls when waiting for a hazard to be
resolved, even if later instructions are ready to go
Dynamic Multiple-Issue Processors (cont.)
Achieving good performance on superscalars still requires the
compiler to try to schedule instructions to move dependences apart
and thereby improve the instruction issue rate

Motivations for dynamic scheduling vs. compiler scheduling:
1. Not all stalls are predictable
For example, cache misses cause unpredictable stalls
2. If the processor speculates on branch outcomes using dynamic
branch prediction, it cannot know the exact order of instructions
at compile time, since it depends on the predicted and actual
behavior of branches
3. As the pipeline latency and issue width change from one
implementation to another, the best way to compile a code
sequence also changes
Old legacy code will get much of the benefit of a new
implementation without the need for recompilation
Dynamic Multiple-Issue Processors (cont.)
Differences between VLIW and superscalar processors:
1. VLIW processors predated superscalars
2. VLIW processors rely totally on compiler scheduling, whereas
superscalars additionally use dynamic pipeline scheduling
3. The code, whether scheduled or not, is guaranteed by the
superscalar hardware to execute correctly without changing the
binary machine code (no recompilation)
4. Compiled code will always run correctly on superscalars
independent of the processor issue rate or pipeline structure:
In some VLIW designs, this was not the case, and recompilation
was required when moving across different processors
In other static issue processors, code would run correctly
across different implementations, but often so poorly as to
make recompilation effectively required

Dynamic Pipeline Scheduling
Microarchitecture: the organization of the processor, including the
major functional units, their interconnection, and control
The pipeline is divided into three major units:
1. An instruction fetch and issue unit
Fetches instructions, decodes them, and sends each instruction
to a corresponding functional unit for execution
Dynamic Pipeline Scheduling (cont.)
2. Multiple functional units (12 in high-end designs in 2008)
Each unit has buffers, called reservation stations, which hold
the operands and the operation
As soon as the buffer contains all its operands and the
functional unit is ready to execute, the result is calculated
When the result is completed, it is sent to any reservation
stations waiting for it as well as to the commit unit
3. A commit unit
Buffers the result until it is safe to put the result in the register
file or, for a store, into memory
The buffer in the commit unit, called the reorder buffer, is also
used to supply operands, in much the same way as forwarding
logic does in a statically scheduled pipeline
Once a result is committed to the register file, it can be fetched
directly from there, just as in a normal pipeline
The final step of updating the state is called retirement or graduation
Dynamic Pipeline Scheduling (cont.)
The combination of buffering operands in the reservation stations and
results in the reorder buffer provides a form of register renaming:
1. When an instruction issues, it is copied to a reservation station for
the appropriate functional unit
Operands that are available in the register file or reorder
buffer are immediately copied into the reservation station
The instruction is buffered in the reservation station until all
the operands and functional unit are available
2. If an operand is not in the register file or reorder buffer, it must
be waiting to be produced by a functional unit
The name of the functional unit that will produce the result is
tracked
When that unit eventually produces the result, it is copied
directly into the waiting reservation station from the functional
unit, bypassing the registers
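The reservation-station mechanics described above can be sketched as a small model. This is a deliberately simplified illustration, not the textbook's design: every operation takes one "step", the RS class and run() helper are our inventions, and commit is a single in-order pass at the end:

```python
# Toy reservation-station model: issue in order, capture available
# operands at issue, wait on tags otherwise, broadcast results,
# commit to the register file in program order.
class RS:
    def __init__(self, op, srcs, regfile, producers):
        self.op = op
        self.vals, self.waits = {}, {}
        for i, r in enumerate(srcs):
            if r in producers:            # value still being computed:
                self.waits[i] = producers[r]   # remember the producing RS
            else:                         # available: copy it immediately
                self.vals[i] = regfile[r]
        self.result = None

    def ready(self):
        return not self.waits

    def capture(self, rs, value):
        """Pick up a broadcast result this station was waiting for."""
        for i, p in list(self.waits.items()):
            if p is rs:
                self.vals[i] = value
                del self.waits[i]

def run(program, regfile):
    producers, stations, order = {}, [], []
    for op, dest, srcs in program:        # issue in program order
        rs = RS(op, srcs, regfile, producers)
        stations.append(rs)
        order.append((rs, dest))
        producers[dest] = rs              # later reads of dest wait on rs
    while stations:                       # execute in data-flow order
        rs = next(s for s in stations if s.ready())
        a, b = rs.vals[0], rs.vals[1]
        rs.result = a + b if rs.op == "add" else a - b
        for s in stations:                # broadcast to waiting stations
            s.capture(rs, rs.result)
        stations.remove(rs)
    for rs, dest in order:                # commit in program order
        regfile[dest] = rs.result
    return regfile

regs = {"r1": 8, "r2": 5, "r3": 0, "r4": 0}
prog = [("add", "r3", ("r1", "r2")),      # r3 = 13
        ("sub", "r4", ("r3", "r2"))]      # waits for r3, then r4 = 8
print(run(prog, regs))
```

Note how the second instruction issues before the first finishes: it simply records which station will produce r3 and captures the value when it is broadcast, which is the register-renaming effect discussed next.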

Dynamic Pipeline Scheduling (cont.)
A dynamically scheduled pipeline analyzes the data flow structure of
a program
The processor then executes the instructions in some order that
preserves the data flow order of the program
Out-of-order (OOO) execution: instructions can be executed in a
different order than they were fetched
A situation in pipelined execution when an instruction blocked from
executing does not cause the following instructions to wait
In-order commit:
The instruction fetch and decode unit is required to issue
instructions in order, which allows dependences to be tracked
The commit unit is required to write results to registers and
memory in program fetch order
The functional units are free to initiate execution whenever the
data they need is available

Dynamic Pipeline Scheduling (cont.)
In-order commit (cont.):
If an exception occurs, the computer can point to the last instruction
executed, and the only registers updated will be those written
by instructions before the instruction causing the exception
This makes programs behave as if they were running on a simple in-
order pipeline (the model of today's dynamically scheduled pipelines)
Dynamic scheduling is often extended by including hardware-based
speculation, especially for branch outcomes
By predicting the direction of a branch, a processor can continue
to fetch and execute instructions along the predicted path
Because instructions are committed in order, we know whether or
not the branch was correctly predicted before any instructions
from the predicted path are committed
A speculative, dynamically scheduled pipeline can also support
speculation on load-store reordering, using the commit unit
to avoid incorrect speculation
Pipelining and Multiple-Issue Limitations
Despite the existence of processors with four to six issues per clock
cycle, very few applications can sustain more than two instructions
per clock for the following two primary reasons:
1. Dependences that cannot be alleviated and branches that cannot
be accurately predicted
2. Losses in the memory system
Four factors combine to limit the performance improvement gained
by pipelining and multiple-issue execution:
1. Data hazards in the code mean that increasing the pipeline depth
increases the time per instruction because a larger percentage of the
cycles become stalls
2. Control hazards mean more clock cycles for the program
3. Pipeline register overhead can limit the decrease in clock period
obtained by further pipelining
4. Instruction latencies (inherent execution times) introduce difficulties,
since a dependence means the processor must wait the full
instruction latency for the hazard to be resolved
Power Efficiency and Advanced Pipelining
The downside to the increasing exploitation of ILP via dynamic
multiple issue and speculation is power efficiency
Now that we have hit the power wall, we are seeing designs with
multiple processors per chip, where the processors are not as deeply
pipelined or as aggressively speculative as their predecessors
While these processors are not as fast as the sophisticated earlier
ones, they deliver better performance per watt, so they can deliver
more performance per chip when designs are constrained more by
power than by the number of transistors

Note the drop in pipeline stages as we switch to multicore designs


Fallacies and Pitfalls
Fallacy: pipelining is easy
Fallacy: pipelining ideas can be implemented independent of
technology
Pitfall: failure to consider instruction set design can adversely
impact pipelining
Fallacy: increasing the depth of pipelining always increases
performance
Tutorial Exercises (29)
Textbook problems:
