Chapter 4, Part II
Enhancing Performance with Pipelining

(Cover figure: the classic components of a computer: processor, memory, and devices)
Patterson and Hennessy's Computer Organization and Design, 4th Ed. Chapter 4.Part II 1 of 168
Agenda and Reading List
Chapter goals (4.5, p. 343)
Introduction (4.5, pp. 330-332 and 4.14, pp. 408-409)
MIPS pipelined datapath, registers, and stages (4.6, pp. 344-355)
Graphically representing pipelines (4.6, pp. 356-358)
Pipelining speedup formula (4.5, pp. 332-334)
Pipelined control (4.6, pp. 359-363)
Pipeline hazards (4.5, pp. 335-343)
Structural hazards solutions (4.5, pp. 335-336)
Data hazards solutions (4.5, pp. 336-339 and 4.7, pp. 363-375)
Control hazards solutions (4.5, pp. 339-343 and 4.8, pp. 375-384)
Designing instruction set for pipelining (4.5, p. 335)
Exceptions (4.9, pp. 384-391)
Advanced instruction-level parallelism (4.10, pp. 391-403)
Fallacies and pitfalls (4.13, pp. 407-408)

Chapter Goals
to overview pipelining and describe its benefits and complexities
to cover the concept of pipelining using the MIPS instruction subset
from the single-cycle implementation in Chapter 4, Part I and show a
simplified version of its pipeline
to provide datapath and control details for a pipelined
implementation created by modifying the single-cycle implementation
to understand the challenges of dealing with hazards
to look at the problems that pipelining introduces and the
performance attainable under typical situations
to explore the implementation of forwarding and stalls
to learn about solutions to branch hazards
to focus on the software and performance implications of pipelining
to learn how to design instruction sets for easy pipelining
to describe the implementation of exceptions in pipelined processors
to introduce advanced pipelining concepts, such as superscalar and
dynamic scheduling
Introduction
Pipelining is an implementation technique in which multiple
instructions are overlapped in execution
All steps in pipelining (called stages) operate concurrently
As long as we have separate resources for each stage, we can
pipeline the tasks
Pipelining exploits the potential parallelism among instructions in a
sequential instruction stream (instruction-level parallelism (ILP))
Pipelining has the substantial advantage that, unlike some speedup
techniques, it is fundamentally invisible to the programmer
Pipelining is key to making processors fast and is nearly universal
Introduction (cont.)
Paradox:
The time for executing a single instruction is not shorter with
pipelining!
The reason pipelining is faster for executing many instructions is
that everything is working in parallel, so more instructions are
finished per unit time
Pipelining improves the throughput of instruction execution
Pipelining does not decrease the time to complete executing
one instruction (called the latency), but when we have many
instructions to execute, the improvement in throughput
decreases the total time to complete the work
MIPS Pipelined Datapath

(Figure: the datapath separated into five stages)
MIPS Pipelined Datapath (cont.)
MIPS instructions classically take five steps:
1. Fetch instruction from memory
2. Read registers while decoding the instruction
3. Execute the operation or calculate an address
4. Access an operand in data memory
5. Write the result into a register

We limit our attention to eight instructions:
1. load word (lw)
2. store word (sw)
3. add (add)
4. subtract (sub)
5. AND (and)
6. OR (or)
7. set less than (slt)
8. branch on equal (beq)
MIPS Pipelined Datapath (cont.)
All pipeline stages take a single clock cycle, so the clock cycle must
be long enough to accommodate the slowest operation
The write to the register file occurs in the first half of the clock
cycle and the read from the register file occurs in the second half
Instructions and data move generally from left to right through the
5 stages as they complete execution
There are two exceptions to this left-to-right flow of instructions:
The write-back stage, which places the result back into the
register file in the middle of the datapath
This could lead to data hazards (see later!)
The selection of the next value of the PC, choosing between the
incremented PC and the branch address from the MEM stage
This could lead to control hazards (see later!)
Data flowing from right to left does not affect the current
instruction; only later instructions in the pipeline are influenced by
these reverse data movements
Group Exercise (1)
Assume that the operation times for the major functional units in an
implementation are the following:
200 ps for memory access,
200 ps for ALU and adder operation,
100 ps for register file read or write;
multiplexors, the control unit, PC accesses, the sign extension unit,
and wires have no delay
Compare the average time between instructions of a single-cycle
implementation, in which all instructions take one clock cycle, to a
pipelined implementation
Note: In the single-cycle model, every instruction takes exactly
one clock cycle, so the clock cycle must be stretched to
accommodate the slowest instruction
Group Exercise (1): Answer
Instruction class            Instruction  Register  ALU        Data    Register  Total
                             fetch        read      operation  access  write     time
Load word (lw)               200 ps       100 ps    200 ps     200 ps  100 ps    800 ps
Store word (sw)              200 ps       100 ps    200 ps     200 ps            700 ps
R-format (add, sub,          200 ps       100 ps    200 ps             100 ps    600 ps
  AND, OR, slt)
Branch (beq)                 200 ps       100 ps    200 ps                       500 ps
Jump (j)                     200 ps                                              200 ps

The single-cycle design must allow for the slowest instruction, lw, so
the time required for every instruction is 800 ps, even though some
instructions can be as fast as 500 ps or 200 ps
All the pipeline stages take a single clock cycle, so the clock cycle
must be long enough to accommodate the slowest operation
The pipelined execution clock cycle must have the worst-case cycle of
200 ps, even though some stages take only 100 ps
So at steady state, an instruction completes every 200 ps
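The per-class totals in the table follow directly from the component delays; the short Python sketch below (ours, not from the book) checks the arithmetic:

```python
# Component delays in ps from the exercise; multiplexors, control,
# PC accesses, sign extension, and wires are assumed to have no delay.
MEM, ALU, REG = 200, 200, 100

# Single-cycle datapath delay per instruction class: the sum of the
# components each class actually uses.
times = {
    "lw":     MEM + REG + ALU + MEM + REG,  # fetch, reg read, ALU, data access, reg write
    "sw":     MEM + REG + ALU + MEM,        # no register write
    "R-type": MEM + REG + ALU + REG,        # no data memory access
    "beq":    MEM + REG + ALU,              # no memory access or register write
    "j":      MEM,                          # instruction fetch only
}

single_cycle_clock = max(times.values())    # clock stretched to the slowest instruction
pipelined_clock = max(MEM, ALU, REG)        # clock set by the slowest stage

print(times["lw"], single_cycle_clock, pipelined_clock)   # 800 800 200
```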
Group Exercise (2)
Locate the single-cycle, multicycle, and pipelined datapaths on
diagrams that show:
1. the clock rate versus instruction throughput
2. hardware trends (shared or specialized) versus instruction
latency
Group Exercise (2): Answer
Locate the single-cycle, multicycle, and pipelined datapaths on
diagrams that show:
1. the clock rate versus instruction throughput
2. hardware trends (shared or specialized) versus instruction
latency

(Diagram 1: clock rate, slower to faster, versus instruction throughput
in instructions per clock cycle (1/CPI). The single-cycle datapath has a
slow clock; the multicycle datapath has a fast clock but low throughput;
the pipelined datapath has a fast clock and high throughput)
(Diagram 2: shared versus specialized hardware against clock cycles of
latency for an instruction, 1 to several. The single-cycle datapath uses
specialized hardware with 1 cycle of latency; the multicycle datapath
uses shared hardware with several cycles; the pipelined datapath uses
specialized hardware with several cycles)
Graphically Representing Pipelines
Pipelining can be difficult to understand, since many instructions are
simultaneously executing in a single datapath in every clock cycle
One way to show what happens in pipelined execution is to pretend
that each instruction has its own private datapath, and then to
place these datapaths on a common timeline to show their
relationship
To aid understanding, there are two basic styles of pipeline figures:
1. Multiple-clock-cycle pipeline diagrams
Time advances from left to right across the page in these
diagrams
Instructions advance from the top to the bottom of the page
Used to give overviews of pipelining situations
Simpler, but do not contain all the details
Graphically Representing Pipelines (cont.)
2. Single-clock-cycle diagrams
Show the state of the entire datapath during a single cycle
Usually all five instructions in the pipeline are identified by
labels above their respective pipeline stages
Used to show the details of what is happening within the
pipeline during each clock cycle
Typically, the drawings appear in groups to show pipeline
operation over a sequence of clock cycles
Represents a vertical slice through a set of multiple-clock-
cycle diagrams, showing the usage of the datapath by each
of the instructions in the pipeline at the designated cycle
Obviously, they have more details and take significantly more
space to show the same number of clock cycles
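The multiple-clock-cycle convention described above can be mimicked in a few lines of Python; this sketch (the `pipeline_diagram` helper is our own, not from the book) prints stage labels left to right per cycle, one instruction per row:

```python
# Print a multiple-clock-cycle pipeline diagram: time advances left to
# right, instructions advance top to bottom, one stage label per cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    n_cycles = len(instructions) + len(STAGES) - 1
    header = "cycle:      " + " ".join(f"{c:>4}" for c in range(1, n_cycles + 1))
    rows = [header]
    for i, instr in enumerate(instructions):
        cells = ["    "] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = f"{stage:>4}"   # instruction i is in stage s at cycle i+s+1
        rows.append(f"{instr:<12}" + " ".join(cells))
    return "\n".join(rows)

print(pipeline_diagram(["lw  $10", "sub $11", "add $12", "lw  $13", "add $14"]))
```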

Group Exercise (3)
Consider the following five-instruction sequence

lw $10, 20($1)
sub $11, $2, $3
add $12, $3, $4
lw $13, 24($1)
add $14, $5, $6

1. Show the multiple-clock-cycle pipeline diagram for these
instructions
2. Show the single-clock-cycle pipeline diagram corresponding to
clock cycle 5 of executing the above instruction sequence

Group Exercise (3): Answer
1. Show the multiple-clock-cycle pipeline diagram for these instructions

Stylized version

Group Exercise (3): Answer (cont.)

Traditional version

Group Exercise (3): Answer (cont.)
2. Show the single-clock-cycle pipeline diagram corresponding to
clock cycle 5 of executing the above instruction sequence

Pipelining Speedup Formula
Pipeline latency: the number of stages in a pipeline
If the stages are perfectly balanced, then the time between
instructions on the pipelined processor, assuming ideal conditions, is:

Time between instructions (pipelined) =
    Time between instructions (nonpipelined) / Number of pipe stages

If all stages take about the same amount of time and there is enough
work to do, then the speedup due to pipelining is equal to the
number of stages in the pipeline
That is, under ideal conditions and with a large number of
instructions:

Speedup from pipelining = Number of pipe stages
Practically, the time per instruction in the pipelined processor will
exceed the minimum possible, and speedup will be less than the
number of pipeline stages because
the stages may be imperfectly balanced,
pipelining involves overhead (e.g., pipeline registers), and
pipeline hazards exist
Pipelining Speedup Formula (cont.)
At the beginning and end of the workload, the pipe is not totally full
This start-up and wind-down affects performance when the number
of tasks is not large compared to the number of stages in the
pipeline
If the number of tasks is much larger than the number of pipeline
stages, then the stages will be full most of the time and the
increase in throughput will be very close to the number of stages
If m instructions are executed on an n-stage pipeline with cycle time T, then:
Number of clock cycles to execute the instructions = m + (n - 1)
Time to execute the instructions = [m + (n - 1)] * T
If m >> n, then:
Number of clock cycles to execute the instructions ≈ m
Time to execute the instructions ≈ m * T
Pipelining improves performance by increasing instruction throughput,
as opposed to decreasing the inherent execution time (latency) of an
instruction, but instruction throughput is the more important metric
because real programs execute billions of instructions
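The cycle counts above can be written as a small timing model; a Python sketch (assuming perfectly balanced stages and no hazards):

```python
# Ideal pipeline timing model: m instructions on an n-stage pipeline
# with cycle time T need (n - 1) cycles to fill the pipe, then finish
# one instruction per cycle.
def pipelined_time(m, n, T):
    return (m + n - 1) * T

def nonpipelined_time(m, T_instr):
    return m * T_instr

# Numbers from Exercise (1): 200 ps pipelined cycle, 800 ps per
# nonpipelined instruction.
m, n = 1_000_000, 5
speedup = nonpipelined_time(m, 800) / pipelined_time(m, n, 200)
print(round(speedup, 2))   # 4.0: approaches 800/200, not the stage count 5
```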
Group Exercise (4)
Consider the following code executed on the pipeline in Exercise (1):
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)

1. Compare non-pipelined and pipelined execution of the three lw
instructions
2. Graphically represent the pipelined execution of the above
instructions using the multiple-clock-cycle pipeline diagram
3. Calculate the time between the first and fourth instructions in
the nonpipelined design
4. Calculate the time between the first and fourth instructions in
the pipelined design. Comment!
5. Calculate the time needed to execute three lw instructions in the
pipelined design
6. Calculate the pipelining speedup in case of executing the three
lw instructions
7. Calculate the pipelining speedup in case of executing 1,000,003
lw instructions
8. What is the ideal speedup? Comment!
Group Exercise (4): Answer
1. Compare non-pipelined and pipelined execution of the three lw
instructions

Group Exercise (4): Answer
2. Graphically represent the pipelined execution of the above
instructions using the multiple-clock-cycle pipeline diagram

Group Exercise (4): Answer (cont.)
3. Calculate the time between the first and fourth instructions in
the nonpipelined design
3 x 800 ps = 2400 ps
4. Calculate the time between the first and fourth instructions in
the pipelined design
3 x 200 ps = 600 ps
Pipelining thus still offers a fourfold improvement in the time
between instructions
5. Calculate the time needed to execute three lw instructions in the
pipelined design
From the diagram, it is 1400 ps
6. Calculate the pipelining speedup in case of executing three lw
instructions

Speedup (pipelined) = (3 * 800 ps) / 1400 ps = 1.71 (not 4)

The number of instructions is not large enough for the speedup to be 4
Group Exercise (4): Answer (cont.)
7. Calculate the pipelining speedup in case of executing 1,000,003
lw instructions

In the non-pipelined case, we would add 1,000,000
instructions, each taking 800 ps, so the total execution time
would be 1,000,000 * 800 ps + 2400 ps = 800,002,400 ps
In the pipelined case, we would add 1,000,000 instructions,
each adding 200 ps to the total execution time. The total
execution time would be 1,000,000 * 200 ps + 1400 ps =
200,001,400 ps

Speedup (pipelined) = 800,002,400 / 200,001,400 = 4.00

When more instructions are executed, the ratio of total
execution times on nonpipelined versus pipelined
processors approaches the ratio of times between
instructions
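The totals above can be checked mechanically; a quick Python verification of the slide's arithmetic:

```python
# 1,000,003 lw instructions, pipeline parameters from Exercise (1).
m, n = 1_000_003, 5
cycle, instr_time = 200, 800   # ps

nonpipelined = m * instr_time        # every instruction takes the full 800 ps
pipelined = (m + n - 1) * cycle      # fill the pipe, then one result per cycle

print(nonpipelined, pipelined)              # 800002400 200001400
print(round(nonpipelined / pipelined, 2))   # 4.0
```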
Group Exercise (4): Answer (cont.)
8. What is the ideal speedup? Comment!

The speedup formula suggests that a five-stage pipeline
should offer nearly a fivefold improvement over the 800 ps
nonpipelined time, or a 160 ps clock cycle
However, the stages are imperfectly balanced, resulting in a
200 ps clock cycle

Pipeline Registers

Pipeline Registers (cont.)
We must add registers between pipeline stages to allow datapaths
and functional units to be shared by different instructions during
different stages while retaining the value of an individual instruction
for its usage during the following stages
We place registers wherever there are dividing lines between stages
All instructions advance during each clock cycle from one pipeline
register to the next
To pass something from an early pipeline stage to a later one, the
information must be placed in a pipeline register; otherwise the
information is lost when the next instruction enters that stage
Pipeline Registers (cont.)
There is no pipeline register at the end of the write-back stage
All instructions must update some state in the processor (the
register file, memory, or the PC)
A separate pipeline register would be redundant with the state that
is updated
Registers are named for the two stages separated by that register
Each pipeline register is divided into different sections to hold
different pieces of information
We will use a notation that names the fields of the pipeline
registers, e.g., ID/EX.RegisterRs is the number of one register
whose value is found in ID/EX, namely the register supplying the
first read port of the register file
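One way to picture the section-per-field idea is a plain record type; a Python sketch (the field names here are our own, modeled on the ID/EX contents described in the text):

```python
# A pipeline register as a plain record: one field per section of
# information carried from the ID stage to the EX stage.
from dataclasses import dataclass

@dataclass
class ID_EX:
    pc_plus_4: int      # incremented PC, carried in case beq needs it
    read_data_1: int    # value read for register rs
    read_data_2: int    # value read for register rt
    sign_ext_imm: int   # 32-bit sign-extended immediate
    register_rs: int    # register numbers, kept for later stages
    register_rt: int
    register_rd: int

# A hypothetical snapshot for lw $10, 20($1) with $1 = 7:
id_ex = ID_EX(pc_plus_4=1004, read_data_1=7, read_data_2=0,
              sign_ext_imm=20, register_rs=1, register_rt=10, register_rd=0)
print(id_ex.register_rs, id_ex.sign_ext_imm)   # 1 20
```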
Individual Exercise (5)
Why is the PC not considered among the pipeline registers,
although it feeds the IF stage of the pipeline?

Individual Exercise (5): Answer
Why is the PC not considered among the pipeline registers,
although it feeds the IF stage of the pipeline?

Every instruction updates the PC, whether by incrementing it or by
setting it to a branch destination address
Unlike pipeline registers, however, the PC is part of the visible
architectural state
The PC's content must be saved when an exception occurs, while
the contents of the pipeline registers can be discarded

Pipeline Stages
All instructions pass through the 5 stages, though not all of them
need 5 cycles
Since every instruction behind the one being executed is already in
progress, there is no way to accelerate the shorter ones
An instruction passes through a stage even if there is nothing to
do, as later instructions are already progressing at the maximum
rate
1. Instruction fetch (IF):
The instruction is read from memory using the address in the
PC and then placed in the IF/ID pipeline register
The PC is incremented by 4 and then written back into
the PC to be ready for the next clock cycle
The incremented PC is also saved in IF/ID in case it is
needed later by an instruction, such as beq
A portion of IF/ID plays the role of the IR!
The computer cannot know which type of instruction is being
fetched, so it must prepare for any instruction, passing
potentially needed information down the pipeline
Pipeline Stages (cont.)
2. Instruction decode and register file read (ID):
The instruction portion of the IF/ID pipeline register supplies
the 16-bit immediate field, which is sign-extended to 32 bits,
and the register numbers to read the two registers
All three values are stored in ID/EX, along with the
incremented PC
We again transfer everything that might be needed by any
instruction during a later clock cycle
Register rt number is saved in ID/EX in case it is needed
later by lw

Pipeline Stages (cont.)
3. Execute or address calculation (EX):
lw: reads the contents of register rs and the sign-extended
immediate from the ID/EX and adds them using the ALU
that sum is placed in the EX/MEM
register rt number is passed from ID/EX to EX/MEM
sw: the effective address is placed in the EX/MEM
register rt value is passed from ID/EX to EX/MEM to be used
in the next stage
R-type: reads the contents of registers rs and rt from the
ID/EX and performs the desired function using the ALU
the result is stored in the EX/MEM
register rd number is passed from the ID/EX to EX/MEM
beq: reads the contents of registers rs and rt from the ID/EX
and performs the equal compare function using the ALU
the zero signal is stored in the EX/MEM
Pipeline Stages (cont.)
4. Memory access (MEM):
lw: reads the data memory using the address from EX/MEM
and loads the data into the MEM/WB
register rt number is passed from EX/MEM to MEM/WB
sw: data is written to memory
R-type: the ALU output is passed from EX/MEM to MEM/WB
beq: sets the next PC according to the zero signal read from
EX/MEM, choosing between the incremented PC and the branch
target address
5. Write back (WB):
lw: reads the data from the MEM/WB and writes it into the
register file using register rt number stored in MEM/WB
sw: nothing to be done
R-type: writes the ALU output from the MEM/WB into the
register number rd (read from MEM/WB also)
beq: nothing to be done
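The lw path through the five stages can be traced with plain dictionaries standing in for the pipeline registers (a sketch; the addresses and values are made up for illustration):

```python
# Trace lw $10, 20($1) through IF, ID, EX, MEM, WB, passing values
# between stages via dicts that stand in for the pipeline registers.
regs = [0] * 32
regs[1] = 100                      # $1 holds the base address
mem = {120: 42}                    # data memory word at address 120

if_id  = {"instr": ("lw", 10, 1, 20), "pc4": 4}            # IF: fetch
op, rt, rs, imm = if_id["instr"]                            # ID: decode, read regs
id_ex  = {"rs_val": regs[rs], "imm": imm, "rt_num": rt}
ex_mem = {"addr": id_ex["rs_val"] + id_ex["imm"],           # EX: effective address
          "rt_num": id_ex["rt_num"]}
mem_wb = {"data": mem[ex_mem["addr"]],                      # MEM: read data memory
          "rt_num": ex_mem["rt_num"]}
regs[mem_wb["rt_num"]] = mem_wb["data"]                     # WB: write register rt
print(regs[10])   # 42
```

Note how the rt register number rides along in every stage's record, exactly as the slides describe.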
Home Exercise (6)
Work out the lw example in Figures 4.36 through 4.38
step by step!

Work out the sw example in Figures 4.39 through 4.40
step by step!

Pipelined Control

Pipelined Control (cont.)
To specify control for the pipeline, we need only to set the control
values during each pipeline stage
Because each control line is associated with a component active in
only a single pipeline stage, we can divide the control lines into five
groups according to the pipeline stage (refer to Chapter 4, Part I)
IF: there is nothing special to control in this pipeline stage
The control signals to read instruction memory and to write
the PC are always asserted
ID: there are no optional lines to set
The same thing happens at every clock cycle
EX: set signals RegDst, ALUOp, and ALUSrc to select the result
register, the ALU operation, and either read data 2 (register rt)
or a sign-extended immediate for the ALU
MEM: set signals Branch, MemRead, and MemWrite by beq, lw,
and sw instructions, respectively
WB: control signal MemtoReg decides between sending the ALU
result or the memory value to the register file, and control signal
RegWrite writes the chosen value
Pipelined Control (cont.)
There are no separate writing signals for the pipeline registers IF/ID,
ID/EX, EX/MEM, and MEM/WB as they are written during each cycle
Implementing control means setting the control lines in each stage
for each instruction
The simplest way to do this is to extend the pipeline registers to
include control information
As the control lines start with the EX stage, we can create the control
information during instruction decode and then place them in ID/EX
The control lines for each pipeline stage are used, and remaining
control lines are then passed to the next pipeline stage
These control signals are then used in the appropriate pipeline stage
as the instruction moves down the pipeline
Sequencing of control in pipeline processors is embedded in the
pipeline structure itself:
all instructions take the same number of clock cycles, so there is
no special control of instruction duration
all control information is computed during instruction decode, and
then passed along by the pipeline registers
Pipeline Hazards
Pipeline hazards: situations in pipelining when the next instruction
cannot execute in the following clock cycle
1. Structural hazards: the hardware cannot support the combination
of instructions that we want to execute in the same clock cycle
Usually revolve around the floating-point unit, which may not be
fully pipelined
Pipeline Hazards (cont.)
2. Data hazards: occur when the pipeline must be stalled because
one step must wait for another to complete
Arise from the dependence of one instruction on an earlier one
that is still in the pipeline
When an instruction depends on the results of a previous one
still in the pipeline, the pipeline should be stalled, i.e., bubbles
should be added to the pipeline
Performance bottlenecks in both integer and floating-point
programs
Often it is easier to deal with in floating-point programs
because the lower branch frequency and more regular
memory access patterns allow the compiler to try to schedule
instructions to avoid hazards
It is more difficult to perform such optimizations in integer
programs that have less regular memory access, involving
more use of pointers
Pipeline Hazards (cont.)
3. Control hazards: arise from the need to make a decision based on
the results of one instruction while others are executing, e.g.,
branches
Also called branch hazards
Happen when the proper instruction cannot execute in the
proper pipeline clock cycle because the instruction that was
fetched is not the one that is needed; that is, the flow of
instruction addresses is not what the pipeline expected
Notice that we must begin fetching the instruction following the
branch on the very next clock cycle
Nevertheless, the pipeline cannot possibly know what the next
instruction should be, since it only just received the branch
instruction from memory
Usually more of a problem in integer programs, which tend to
have higher branch frequencies as well as less predictable
branches
Structural Hazards Solutions
Each logical component of the datapath, such as instruction
memory, register read ports, ALU, data memory, and register write
ports, can be used only within a single pipeline stage
Otherwise, we would have a structural hazard
Hence, these components, and their control, can be associated with
a single pipeline stage
Proper ISA design
Designing instruction sets for pipelining (as will be explained
later) makes it fairly easy to avoid structural hazards when
designing a pipeline
Example:
without two memories, our pipeline could have a structural
hazard
suppose we had a single memory instead of two memories
we could see that in one clock cycle, the first instruction is
accessing data from memory while the fourth instruction is
fetching an instruction from that same memory
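The single-memory example can be expressed as a cycle count: instruction i (counting from 0) occupies MEM in cycle i + 4, while instruction i + 3 occupies IF in that same cycle. A hypothetical Python check under that simplified model:

```python
# With one shared memory, a load/store in its MEM stage (cycle i + 4
# for 0-indexed instruction i) collides with the fetch of instruction
# i + 3, which also happens in cycle i + 4.
def structural_conflicts(instrs):
    """instrs: list of mnemonics; only lw/sw touch data memory in MEM."""
    return [(i, i + 3) for i, op in enumerate(instrs)
            if op in ("lw", "sw") and i + 3 < len(instrs)]

# The first instruction's data access collides with the fourth fetch.
print(structural_conflicts(["lw", "add", "sub", "or", "and"]))   # [(0, 3)]
```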
Data Hazards Solutions
Compiler support:
Code reordering: the compiler can follow an instruction with
others that are independent of it to prevent sequences that
result in data hazards
When the instruction generating the data is a load, this
technique is called delayed loads
nop insertion: when no independent instructions can be
found, the compiler inserts nop (no operation) instructions,
which are guaranteed to be independent
This results in cycles that do no useful work
nop is represented by all 0s, which is equivalent to sll
$0, $0, 0, i.e., shift register $0 left 0 places
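The nop-insertion idea can be sketched as a tiny scheduling pass. This is our own simplified model (assumption: with register-file forwarding, a consumer must sit at least three slots after its producer, hence up to two nops):

```python
def insert_nops(code, distance=3):
    """code: list of (dest_reg, source_regs). Pad with nops so every
    consumer is at least `distance` slots after its producer."""
    out = []
    for dest, srcs in code:
        # Distance back to the nearest producer of any source register.
        for back, (d, _) in enumerate(reversed(out), start=1):
            if d is not None and d in srcs and back < distance:
                out.extend([(None, set())] * (distance - back))  # nop: no dest
                break
        out.append((dest, srcs))
    return out

# sub $2,$1,$3 followed by and $12,$2,$5: two nops get inserted,
# matching the compiler fix shown later in Exercise (7).
padded = insert_nops([("$2", {"$1", "$3"}), ("$12", {"$2", "$5"})])
print(sum(1 for d, _ in padded if d is None))   # 2
```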

Data Hazards Solutions (cont.)
Compiler support (cont.):
However, data hazards happen just too often and the delay is just
too long to expect the compiler to rescue us from this dilemma
Although the compiler generally relies upon the hardware to
resolve hazards and thereby ensure correct execution, the
compiler must understand the pipeline to achieve the best
performance
Otherwise, unexpected stalls will reduce the performance of
the compiled code
Register file forwarding:
In our design, writes are done in the first half of the clock cycle
and reads are done in the second half, so a read delivers what is
written in the same cycle
This resolves a data hazard when the register write and read
fall in the same clock cycle
Group Exercise (7)
Consider the following code sequence:

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

1. Show the dependences in this code sequence
2. If register $2 had the value 10 before the sub instruction and -20
afterwards, how would this sequence perform with our pipeline?
Show the value of $2 at the beginning of each clock cycle
3. Show how the compiler could help avoid the data hazard in
the above code sequence
4. What other techniques are used by the hardware in this example to
avoid data hazards?
Group Exercise (7): Answer
Consider the following code sequence:

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

1. Show the dependences in this code sequence
The last 4 instructions are dependent on the result in $2 of
the first instruction
The sub instruction does not write its result until the fifth
stage, meaning that we would have to waste three clock
cycles in the pipeline
Without intervention, data hazards could severely stall the
pipeline
Group Exercise (7): Answer (cont.)
2. How would this sequence perform with our pipeline? If register $2
had the value 10 before the sub instruction and -20 afterwards,
show the value of $2 at the beginning of each clock cycle

The proper $2 value is written to the register file in clock cycle 5
Dependence arrows going backward in time indicate pipeline hazards
add and sw get the correct $2 value; and and or would not
Group Exercise (7): Answer
3. Show how the compiler could help avoid the data hazard in the
above code sequence

The compiler inserts two nops before the and instruction:

sub $2, $1, $3
nop
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

4. What other techniques are used by the hardware in this example to
avoid data hazards?

Register file forwarding
Data Hazards Solutions (cont.)
Data forwarding:
Forwarding (or bypassing): is a simple solution based on the
observation that we do not need to wait for the instruction to
complete before trying to resolve the data hazard
We can avoid stalls if we simply forward the data as soon as it is
available to any units that need it before it is available to read
from the register file
Adds extra hardware to retrieve the missing item early from the
internal resources (buffers)
Rather than waiting for the missing item to arrive from
programmer-visible registers or memory
The name forwarding comes from the idea that the result is
passed forward from an earlier instruction to a later one
Bypassing comes from passing the result around the register file to
the desired unit
Forwarding paths are valid only if the destination stage is later
in time than the source stage
Otherwise, we would be going backward in time!
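This "later in time" rule is what the hardware's forwarding conditions encode. A Python sketch of the standard EX-hazard test from the book's forwarding scheme (a simplification; the dict field names here are our own):

```python
# Forward to the first ALU input when a prior instruction's destination
# register matches the current instruction's rs: prefer the newest
# result (EX/MEM), then the older one (MEM/WB), else the register file.
def forward_a(ex_mem, mem_wb, id_ex_rs):
    """Return the mux select: 0b10 from EX/MEM, 0b01 from MEM/WB, 0b00 none."""
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
        return 0b10                    # forward the just-computed ALU result
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
        return 0b01                    # forward the value about to be written back
    return 0b00                        # no hazard: read the register file

# add $s0,... followed by sub ...,$s0,...: EX/MEM holds $s0 (register 16).
ex_mem = {"RegWrite": True, "Rd": 16}
mem_wb = {"RegWrite": False, "Rd": 0}
print(forward_a(ex_mem, mem_wb, id_ex_rs=16))   # 2: take the ALU result from EX/MEM
```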
Individual Exercise (8)
Suppose that we have an add instruction followed immediately by
a subtract that uses the sum in $s0:

add $s0, $t0, $t1
sub $t2, $s0, $t3

Show what pipeline stages would be connected by forwarding
Individual Exercise (8): Answer
add $s0, $t0, $t1
sub $t2, $s0, $t3

Show what pipeline stages would be connected by forwarding

The add instruction does not write its result until the 5th stage
As soon as the ALU creates the sum for the add, it is supplied
as an input for the sub, replacing the $s0 value read in the 2nd
stage of sub
Group Exercise (9)
Consider the following code segment in C:

a = b + e;
c = b + f;

Here is the generated MIPS code for this segment, assuming all variables
are in memory and are addressable as offsets from $t0:

lw $t1, 0($t0) # b is saved at 0($t0)
lw $t2, 4($t0) # e is saved at 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0) # a is to be saved at 12($t0)
lw $t4, 8($t0) # f is saved at 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0) # c is to be saved at 16($t0)

Group Exercise (9) (cont.)
1. Find the hazards in the above code segment
2. Reorder the instructions to avoid any pipeline stalls
3. Calculate the number of clock cycles needed to complete the
reordered sequence on a pipelined processor with forwarding
relative to the original version

Group Exercise (9): Answer
lw $t1, 0($t0) # b is saved at 0($t0)
lw $t2, 4($t0) # e is saved at 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0) # a is to be saved at 12($t0)
lw $t4, 8($t0) # f is saved at 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0) # c is to be saved at 16($t0)

1. Find the hazards in the above code segment

Both add instructions have a hazard because of their respective
dependence on the immediately preceding lw instruction
Bypassing eliminates several other potential hazards, including the
dependence of the first add on the first lw and any hazards for the
store instructions
Group Exercise (9): Answer
2. Reorder the instructions to avoid any pipeline stalls

Moving the third lw up to become the third instruction
eliminates both hazards:

lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)

3. Calculate the number of clock cycles needed to complete the
reordered sequence on a pipelined processor with forwarding
relative to the original version

On a pipelined processor with forwarding, the reordered sequence
will complete in two fewer cycles than the original version
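The two-fewer-cycles claim can be checked mechanically. Below is a small sketch (hypothetical helper, not from the text) that counts load-use stalls, assuming forwarding resolves every other dependence in the 5-stage pipeline:

```python
# Count load-use stalls: with forwarding, the only remaining stall is a
# load immediately followed by an instruction that reads the loaded register.
def stall_cycles(program):
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest, _srcs = prev
        if op == "lw" and dest in curr[2]:
            stalls += 1
    return stalls

# Each entry is (opcode, destination register, source registers);
# sw has no register destination.
original = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),   # needs $t2 right after its lw: stall
    ("sw",  None,  ["$t3", "$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),   # needs $t4 right after its lw: stall
    ("sw",  None,  ["$t5", "$t0"]),
]
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]

print(stall_cycles(original), stall_cycles(reordered))  # prints: 2 0
```

Each stall inserts one bubble, so the reordered sequence finishes two cycles earlier.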
Data Hazards Solutions (cont.)
Data forwarding (cont.):
We must first detect a data hazard and then forward the proper
value to resolve the hazard
For now, we consider only the challenge of forwarding to an
operation in the EX stage, which may be either
An R-type ALU operation (add, sub, AND, OR, and slt) or
An effective address calculation
When an instruction tries to read a register in its EX stage that
an earlier instruction intends to write in its WB stage, we
actually need the values as inputs to the ALU
There is no hazard in the WB stage itself, because we assume
that the register file supplies the correct result if the instruction
in the ID stage reads the same register written by the
instruction in the WB stage
This is another form of forwarding but it occurs within the
register file
Data Hazards Solutions (cont.)
Data forwarding (cont.):
The two pairs of hazard conditions are:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
EX/MEM.RegisterRd field is the register destination for either:
An ALU instruction, which comes from the instruction Rd field
A load, which comes from the instruction Rt field
Because some instructions do not write registers, this policy is
inaccurate; sometimes it would forward when it should not
Simply check whether the RegWrite signal will be active by
examining the WB control field of the pipeline register during the
EX and MEM stages
We also need to make sure that the RegisterRd field is not zero:
an instruction whose destination is $0 must not forward its
possibly nonzero result value
Individual Exercise (10)
Consider the following code sequence:

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Classify the dependences in the above code sequence according to
the hazard conditions

Show the dependences between the pipeline registers and the
inputs to the ALU for this code sequence
Individual Exercise (10): Answer
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Classify the dependences in the above code sequence according to
the hazard conditions

The sub-and hazard is type 1a:
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2
The sub-or hazard is type 2b:
MEM/WB.RegisterRd = ID/EX.RegisterRt = $2
The dependences in sub-add are not hazards because the
register file supplies the proper data during the ID stage of add
There is no data hazard between sub and sw because sw reads
$2 the clock cycle after sub writes $2
Individual Exercise (10): Answer
Show the dependences between the pipeline registers and the
inputs to the ALU for this code sequence

The dependence begins from a pipeline register, rather than
waiting for the WB stage to write the register file
Thus, the required data exists in time for later instructions, with
the pipeline registers holding the data to be forwarded
Data Hazards Solutions (cont.)
Data forwarding (cont.):
If we can take the inputs to the ALU from any pipeline register
rather than just ID/EX, then we can forward the proper data
By adding multiplexors to the input of the ALU, and with the
proper controls, we can run the pipeline at full speed in the
presence of data dependences
The forwarding control will be in the EX stage, because the ALU
forwarding multiplexors are found in that stage.
Thus, we must pass the operand register numbers from the ID
stage via the ID/EX pipeline register to the forwarding control
to determine whether to forward values
Alternatively, the control of the multiplexors on the ALU inputs could
be determined during the ID stage and set in new control fields of
the ID/EX register
This could make the hardware faster, because the time to select the
ALU inputs is likely to be on the critical path
Data Hazards Solutions (cont.)
[Figure: forwarding hardware with ALU-input multiplexors and the forwarding unit (not reproduced)]
Data Hazards Solutions (cont.)

Data forwarding (cont.):
Conditions for detecting hazards and control signals to resolve them
EX hazard:
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10

if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10

This case forwards the result of the previous instruction to either
input of the ALU
Data Hazards Solutions (cont.)
Data forwarding (cont.):
If the instruction in the WB stage is going to write to the register
file, and the write register number matches the read register
number of ALU inputs A or B, provided it is not in register 0, then
steer the multiplexor to pick the value instead from the pipeline
register MEM/WB

One complication is the potential data hazard between the result
of the instruction in the WB stage, the result of the instruction in
the MEM stage, and the source operand of the instruction in the
ALU stage
In this case, the result is forwarded from the MEM stage because
the result in the MEM stage is the more recent result
Data Hazards Solutions (cont.)
Data forwarding (cont.):
MEM hazard:
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01

if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
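The EX and MEM hazard conditions can be folded into one forwarding-unit model. The Python sketch below is an illustration (assumed function and argument names, mirroring the pipeline-register fields); applying the EX-hazard check last gives it priority, which plays the role of the "and not (EX/MEM.RegWrite ...)" clause:

```python
# ForwardA/ForwardB encodings: "00" = ID/EX (register file),
# "10" = EX/MEM (previous ALU result), "01" = MEM/WB (two instructions back).
def forward(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd,
            id_ex_rs, id_ex_rt):
    fa = fb = "00"
    # MEM hazard: checked first, so the EX hazard below can override it
    if mem_wb_regwrite and mem_wb_rd != 0:
        if mem_wb_rd == id_ex_rs:
            fa = "01"
        if mem_wb_rd == id_ex_rt:
            fb = "01"
    # EX hazard: the previous instruction's result is the most recent, so it wins
    if ex_mem_regwrite and ex_mem_rd != 0:
        if ex_mem_rd == id_ex_rs:
            fa = "10"
        if ex_mem_rd == id_ex_rt:
            fb = "10"
    return fa, fb

# sub $2,$1,$3 followed by and $12,$2,$5: EX hazard on Rs only
print(forward(True, 2, False, 0, 2, 5))  # prints: ('10', '00')
```

Writing the override in code order is equivalent to the explicit "and not" exclusion in the MEM-hazard condition above.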
Individual Exercise (11)
For the following instruction sequence, identify the dependences and
show how forwarding copes with them:

sub $s2, $s1, $s3
and $s4, $s2, $s5
or $s4, $s4, $s2
add $s9, $s4, $s2
Individual Exercise (11): Answer
For the following instruction sequence, identify the dependences and
show how forwarding copes with them:

sub $s2, $s1, $s3
and $s4, $s2, $s5
or $s4, $s4, $s2
add $s9, $s4, $s2

Clock  Instr. in  Registers to be written   Upper ALU input    Lower ALU input
cycle  EX stage   EX/MEM      MEM/WB        Source  From       Source  From
4      and        $s2         --            $s2     EX/MEM     $s5     ID/EX
5      or         $s4         $s2           $s4     EX/MEM     $s2     MEM/WB
6      add        $s4         $s4           $s4     EX/MEM     $s2     ID/EX
Individual Exercise (12)
When summing a vector of numbers in a single register, a
sequence of instructions will all read and write to the same register

add $1, $1, $2
add $1, $1, $3
add $1, $1, $4

Show how forwarding works in this case

Individual Exercise (12): Answer
For the second add, $1 is available in the MEM stage:

EX/MEM.RegisterRd = ID/EX.RegisterRs

For the third add, $1 is available in both the MEM and WB stages:

EX/MEM.RegisterRd = ID/EX.RegisterRs
MEM/WB.RegisterRd = ID/EX.RegisterRs

However, in this case, the result is forwarded from the MEM stage
because the result in the MEM stage is the more recent result

Clock  Instr. in   Registers to be written   Upper ALU input    Lower ALU input
cycle  EX stage    EX/MEM      MEM/WB        Source  From       Source  From
4      second add  $1          --            $1      EX/MEM     $3      ID/EX
5      third add   $1          $1            $1      EX/MEM     $4      ID/EX
Data Hazards Solutions (cont.)
Data forwarding (cont.):
To add the sign-extended immediate input, needed by loads and
stores, to the ALU, a 2:1 multiplexor is added to choose between
the ForwardB multiplexor output and the sign-extended immediate
Store forwarding is done by connecting the forwarding
multiplexor output, containing store data, to the EX/MEM pipeline
register

Individual Exercise (13)
Consider the following code sequence:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)

1. Find the hazard in this code
2. Reorder the instructions to avoid pipeline stalls

Individual Exercise (13): Answer
1. Find the hazard in this code

The hazard occurs on $t2 between the second lw and the first sw

2. Reorder the instructions to avoid pipeline stalls

Swapping the two store instructions removes this hazard:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)

We do not create a new hazard, as there is still one instruction
between the write of $t0 by the load and the read of it
With store forwarding, the reordered code takes 4 cycles
Data Hazards Solutions (cont.)
Pipeline stalls (bubbles):
Forwarding cannot help when an instruction tries to read a
register following a load instruction that writes the same register
This is called a load-use data hazard
Use latency: number of clock cycles between a load instruction
and an instruction that can use the result of the load without
stalling the pipeline
We can handle these cases using either hardware detection and
stalls or software that reorders code to try to avoid load-use
pipeline stalls
The pipeline must stall (bubbles are inserted) for the combination
of a load followed by an instruction that reads its result
Stalls (or inserting bubbles) are equivalent to inserting nops
However, bubbles are inserted at runtime, whereas nops are
inserted at compile time!

Individual Exercise (14)
Suppose that we have a load instruction followed immediately by a
subtract that uses the loaded word:

lw $s0, 20($t1)
sub $t2, $s0, $t3

Show what pipeline stages would be connected by forwarding

Individual Exercise (14): Answer
Suppose that we have a load instruction followed immediately by a
subtract that uses the loaded word:
lw $s0, 20($t1)
sub $t2, $s0, $t3
Show what pipeline stages would be connected by forwarding

$s0 would be available only after the fourth stage of the first
instruction, which is too late for the input of the third stage of
the sub!
We would have to stall one stage for the load-use data hazard

Data Hazards Solutions (cont.)
Stalls (bubbles) (cont.):
We need a hazard detection unit that operates during the ID
stage to insert the stall between the load and its use
To check for loads, the control for the hazard detection unit is:
if (ID/EX.MemRead
and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
or (ID/EX.RegisterRt = IF/ID.RegisterRt))) stall the pipeline

If the condition holds, the instruction stalls 1 clock cycle
After this 1-cycle stall, the forwarding logic continues as usual

In the case of a load immediately followed by a store, it is
possible to avoid a stall, since the data exists in the MEM/WB register
of the load in time for its use in the MEM stage of the store
We would need to add forwarding into the MEM stage for this option
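The load-use check above can be modeled directly. This is a sketch with assumed names taken from the pipeline-register notation, not a hardware description:

```python
# Load-use hazard detection (operates during ID): stall when the
# instruction in EX is a load whose destination (Rt) matches either
# source register of the instruction in ID.
def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# lw $2, 20($1) in EX; and $4, $2, $5 in ID: stall one cycle
print(must_stall(True, 2, 2, 5))  # prints: True
```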
Data Hazards Solutions (cont.)
Stalls (bubbles) (cont.):
If the instruction in the ID stage is stalled, the one in the IF stage
must be stalled; otherwise, the fetched instruction could be lost
Preventing these two instructions from making progress is
accomplished by preventing the PC and IF/ID from changing
Provided these registers are preserved, the instruction in the IF
stage will continue to be read using the same PC, and the
registers in the ID stage will continue to be read using the same
instruction fields in the IF/ID pipeline register
The hazard detection unit controls the value written in PC and
IF/ID plus multiplexors that choose between the real control
values and all 0s
To stall the pipeline, the back half of the pipeline starting with the
EX stage must be executing nop instructions
A bubble is inserted into the pipeline by deasserting all nine
control signals in the EX, MEM, and WB fields of the ID/EX
Actually, to avoid writing registers or memory, only RegWrite and
MemWrite need to be 0, while the other signals can be don't cares
Data Hazards Solutions (cont.)
[Figure: stalls (bubbles); this diagram is missing the
sign-extended immediate and branch logic]
Individual Exercise (15)
Consider the following code sequence:

lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

1. Show that a hazard in the above code sequence cannot be solved
by forwarding
2. Show how inserting bubbles could handle data hazards in this code
sequence

Individual Exercise (15): Answer
1. Show that a hazard in the above code sequence cannot be solved
by forwarding

The data is being read from memory in clock cycle 4 while the ALU
is performing the operation for the following instruction
Since the dependence between the lw and the following and goes
backward in time, this hazard cannot be solved by forwarding
Individual Exercise (15): Answer (cont.)
2. Show how inserting bubbles could handle data hazards in this
code sequence

[Pipeline diagram with the inserted bubble (not reproduced)]
Control Hazards Solutions
Stalls on branches (bubbles):
One possible solution for control hazards is to stall immediately
after we fetch a branch, waiting until the pipeline determines the
outcome of the branch and knows what instruction address to
fetch from
Let's assume that we put in enough extra hardware so that we
can test registers, calculate the branch address, and update the
PC during the second stage of the pipeline (see the early
decision approach in the next slides); then the pipeline involving
conditional branches is stalled for only one clock cycle
If we cannot resolve the branch in the second stage, as is often
the case for longer pipelines, then we would see an even larger
slowdown if we stall on branches
This option works, but it is too slow
The cost of this option is too high for most computers to use
Individual Exercise (16)
Estimate the cost of the control hazard in the following code
sequence:

40 beq $1, $3, 28
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2

72 lw $4, 50($7)

Individual Exercise (16): Answer
Estimate the cost of the control hazard in the following code sequence:

As the branch instruction decides whether to branch in the MEM stage,
the three following sequential instructions will be fetched and their
execution will begin
Thus, the cost of a taken branch is 3 extra cycles
Individual Exercise (17)
Given that branches are 17% of the instructions executed in
SPECint2006, estimate the impact on the clock cycles per
instruction (CPI) of stalling on branches

Assume that:
All other instructions have a CPI of 1 and
We put in enough extra hardware so that we can test registers,
calculate the branch address, and update the PC during the
second stage (i.e., the pipeline is stalled for only 1 cycle)

Individual Exercise (17): Answer
Given that branches are 17% of the instructions executed in
SPECint2006, estimate the impact on the clock cycles per
instruction (CPI) of stalling on branches

Assume that:
All other instructions have a CPI of 1
We put in enough extra hardware so that we can test registers,
calculate the branch address, and update the PC during the
second stage (i.e., the pipeline is stalled for only 1 cycle)

17% of the instructions have a CPI of 2; all other instructions have a CPI of 1

CPI = 1 × (1 − 0.17) + 2 × 0.17 = 1.17
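The same arithmetic as a tiny sketch:

```python
# CPI with a 1-cycle branch stall: branches take 2 cycles, all others 1.
branch_frac = 0.17
cpi = 1 * (1 - branch_frac) + 2 * branch_frac
print(round(cpi, 2))  # prints: 1.17
```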
Control Hazards Solutions (cont.)
Early decision to reduce the delay of branches:
One way to improve branch performance is to reduce the cost
(delay) of the taken branch
Thus far, we have assumed the next PC for a branch is selected in
the MEM stage
If we move the branch execution earlier in the pipeline, then
fewer instructions need to be flushed
If we move the branch decision (execution) to the ID stage, only
one instruction needs to be flushed (the one being fetched)
Control Hazards Solutions (cont.)
Early decision (cont.):
Moving the branch decision up requires 2 actions to occur earlier:
1. Move up the branch address calculation:
Easy part
We already have the PC value and the immediate field in
the IF/ID register, so we just move the branch adder from
the EX stage to the ID stage
Of course, the branch target address calculation will be
performed for all instructions, but only used when needed
2. Move up the branch decision:
The harder part
For beq, we could compare the two registers read during
the ID stage to see if they are equal
Equality can be tested by first XORing their respective bits
and then ORing all the results (faster than using the ALU)
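The XOR-then-OR equality test can be sketched at the bit level; this is a Python illustration of the hardware idea, not a hardware description:

```python
# Register equality the way the ID-stage hardware tests it: XOR the two
# values bit by bit, then OR-reduce the result; the registers are equal
# exactly when the OR of all XOR bits is 0.
def regs_equal(a, b, width=32):
    xor = a ^ b
    any_diff = 0
    for i in range(width):        # OR-reduce the XOR result
        any_diff |= (xor >> i) & 1
    return any_diff == 0

print(regs_equal(0x1234, 0x1234), regs_equal(0x1234, 0x1235))  # prints: True False
```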
Control Hazards Solutions (cont.)
Early decision (cont.):
Moving the branch test to the ID stage implies additional hazard
detection and forwarding hardware, as a branch dependent on a
result still in the pipeline must work properly with this optimization
To implement beq or bne, we will need to forward results to
the equality test logic that operates during the ID stage
There are two complication factors:
1. During the ID stage, we must decode the instruction, decide
whether a bypass to the equality unit is needed, and complete
the equality comparison so that if the instruction is a branch,
we can set the PC to the branch target address
Forwarding for the operands of branches was formerly
handled by the ALU forwarding logic, but the introduction
of the equality test unit in the ID stage will require new
forwarding logic
Note that the bypassed source operands of a branch can
come from either the EX/MEM or MEM/WB registers
Control Hazards Solutions (cont.)
Early decision (cont.):
2. Because the values in a branch comparison are needed during
the ID stage but may be produced later in time, it is possible
that a data hazard can occur and a stall will be needed
For example, if an ALU instruction immediately preceding
a branch produces one of the operands for the
comparison in the branch, a stall will be required, since
the EX stage for the ALU instruction will occur after the
ID cycle of the branch
By extension, if a load is immediately followed by a
conditional branch that depends on the load result, two stall
cycles will be needed, as the result from the load appears
at the end of the MEM cycle but is needed at the
beginning of the ID cycle for the branch

Control Hazards Solutions (cont.)
Early decision (cont.):
Despite these difficulties, moving the branch execution to the ID
stage is an improvement, because it reduces the penalty of a
branch to only one instruction if the branch is taken, namely, the
one currently being fetched
To flush instructions in the IF stage, we add a control line, called
IF.Flush, that zeros the instruction field of the IF/ID register
Clearing the register transforms the fetched instruction into a
nop, an instruction that has no action and changes no state
In reality, the flush line comes from the hardware that determines
whether the branch is taken, labeled with an equal sign
Even with this extra hardware, the pipeline involving conditional
branches would have to stall for one clock cycle
Control Hazards Solutions (cont.)

Early decision
Home Exercise (18)
Work out the example on page 378 step by step!

Control Hazards Solutions (cont.)
Branch prediction:
Branch prediction: predict the outcome of the branch
instruction and proceed from that assumption rather than waiting
to ascertain the actual outcome
When a prediction is wrong, the pipeline control must ensure that
the instructions following the wrongly guessed branch have no
effect and must restart the pipeline from the proper branch
address
Longer pipelines exacerbate the problem by raising the cost of
misprediction

Control Hazards Solutions (cont.)
Branch prediction (cont.):
There are three ways to do branch prediction:
1. Assume branch not taken: always predict branches to fail
and continue execution down the sequential instruction flow
Pipeline is not slowed down when the branch is not taken
If the branch is taken, the instructions that are being
fetched and decoded must be discarded and execution
continues at the branch target
This is equivalent to a stall; that is, only when branches
are taken does the pipeline stall
To discard instructions:
Change original control values to 0s in the IF, ID, and
EX stages when the branch reaches the MEM stage
Discarding instructions here means we must be able
to flush instructions in the IF, ID, and EX stages
For load-use stalls, we just change control to 0 in the
ID stage and let them percolate through the pipeline
Control Hazards Solutions (cont.)
Branch prediction (cont.):
2. A more sophisticated branch prediction would have some
branches predicted as taken and some as untaken
As an example, at the bottom of loops are branches
that jump back to the top of the loop
Since they are likely to be taken and they branch
backwards, we could always predict taken for branches
that jump to an earlier address
3. Dynamic hardware predictors: make guesses depending
on the behavior of each branch and may change predictions
for a branch over the life of a program
In an aggressive pipeline, a simple static prediction
scheme will probably waste too much performance
With more hardware, it is possible to try to predict
branch behavior during program execution
Dynamic predictors have increased in popularity as the
number of transistors per chip has increased
Group Exercise (19)
Assume that branches are predicted to be not taken

Show the pipeline when the branch in the following code sequence
is not taken and when it is taken:

add $4, $5, $6
beq $1, $2, 40
lw $3, 300($0)

or $7, $8, $9

Group Exercise (19): Answer
Assume that branches are predicted to be not taken
Show the pipeline when the branch in the following code sequence
is not taken and when it is taken:

[Pipeline diagrams for the not-taken and taken cases (not reproduced)]
Control Hazards Solutions (cont.)
Dynamic branch prediction:
One popular approach for dynamic prediction of branches is
keeping a history for each branch as taken or untaken, and then
using the recent past behavior to predict the future
The amount and type of history kept have become extensive
The result has been that dynamic branch predictors can
correctly predict branches with more than 90% accuracy
One approach to do this is to look up the address of the
instruction to see if a branch was taken the last time this
instruction was executed, and, if so, to begin fetching new
instructions from the same place as the last time
One implementation of that approach is a branch prediction
buffer or branch history table
A branch prediction buffer is a small, special memory indexed by
the lower portion of the address of the branch instruction during
the IF stage
Control Hazards Solutions (cont.)
Dynamic branch prediction (cont.):
The branch prediction buffer contains a bit that says whether the
branch was recently taken or not
This is the simplest sort of buffer
We do not know, in fact, if the prediction is the right one as it
may have been put there by another branch that has the same
low-order address bits
However, this does not affect correctness
Prediction is just a hint that we hope is correct, so fetching
begins in the predicted direction
If the hint turns out to be wrong, the incorrectly predicted
instructions are deleted, the prediction bit is inverted and
stored back, and the proper sequence is fetched and executed
This simple 1-bit prediction scheme has a performance
shortcoming: even if a branch is almost always taken, we can
predict incorrectly twice, rather than once, when it is not taken
Class Exercise (20)
Consider a loop branch that branches nine times in a row, then it is
not taken once

What is the prediction accuracy for this branch, if it is using a
single bit for prediction?

Assume the prediction bit for this branch remains in the prediction
buffer

Class Exercise (20): Answer
Consider a loop branch that branches nine times in a row, then it is
not taken once
What is the prediction accuracy for this branch, if it is using a
single bit for prediction?
Assume the prediction bit for this branch remains in the prediction
buffer

The steady-state prediction behavior will mispredict on the first
and last loop iterations:
Mispredicting the last iteration is inevitable since the
prediction bit will indicate taken, as the branch has been
taken nine times in a row at that point
The misprediction on the first iteration happens because
the bit is flipped on prior execution of the last iteration of
the loop, since the branch was not taken on the exiting
iteration
Thus, the prediction accuracy for this branch that is taken 90%
of the time is only 80% (two incorrect predictions and eight
correct ones)!
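The 80% figure can be reproduced by simulating the 1-bit scheme; the sketch below models the buffer entry as a single stored bit that is overwritten on every branch outcome (helper name is hypothetical):

```python
# 1-bit predictor on a loop branch: taken 9 times in a row, then not taken.
# In steady state the stored bit mispredicts both the last iteration (exit)
# and the first iteration of the next loop execution.
def one_bit_accuracy(outcomes, rounds=3):
    bit, correct, total = outcomes[-1], 0, 0
    for _ in range(rounds):           # a few rounds to reach steady state
        for taken in outcomes:
            correct += (bit == taken)
            total += 1
            bit = taken               # the buffer stores the latest outcome
    return correct / total

print(one_bit_accuracy([True] * 9 + [False]))  # prints: 0.8
```

Two mispredictions out of ten branches gives 80%, even though the branch is taken 90% of the time.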
Control Hazards Solutions (cont.)
Dynamic branch prediction (cont.):
Ideally, the accuracy of the predictor would match the taken
branch frequency for these highly regular branches
The 1-bit prediction scheme will likely predict incorrectly twice!
To remedy this, 2-bit prediction schemes are often used
In a 2-bit scheme, a prediction must be wrong twice before it is
changed

Control Hazards Solutions (cont.)
Dynamic branch prediction (cont.):
By using the 2 bits rather than 1, a branch that strongly favors
taken or not taken (as many branches do) will be mispredicted
only once

The 2 bits are used to encode the four states in the system

The 2-bit scheme is a general instance of a counter-based


predictor, which is incremented when the prediction is accurate
and decremented otherwise, and uses the midpoint of its range
as the division between taken and not taken

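A minimal model of the 2-bit saturating counter (states 0 to 3, predicting taken in the upper half of the range) shows the improvement on the loop branch of Exercise 20: only the loop exit mispredicts.

```python
# 2-bit saturating counter: states 0-3; predict taken when state >= 2.
# A prediction must be wrong twice in a row before the direction flips.
class TwoBitPredictor:
    def __init__(self, state=3):              # start at strongly taken
        self.state = state
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 3         # the loop branch, run 3 times
correct = sum(1 for taken in outcomes
              if p.predict() == taken or p.update(taken))
# (the or-trick above is just compact; update() returns None, so only
#  correct predictions are counted, and update() runs on every outcome)
```

Written more plainly, the counting loop is: predict, compare, then update. The accuracy comes out to 0.9 for this branch, matching the taken frequency.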
Control Hazards Solutions (cont.)
Advanced dynamic branch prediction techniques
Branch target buffer: a cache to hold the destination PC or
destination instruction
A branch predictor tells us whether or not a branch is taken
We still require the calculation of the branch target
In our pipeline, this calculation takes one cycle, meaning that
taken branches will have a 1-cycle penalty
Branch target buffer is one approach to eliminate that penalty

Control Hazards Solutions (cont.)
Advanced dynamic branch prediction techniques (cont.)
Correlating predictor: a branch predictor that combines local
behavior of a particular branch and global information about the
behavior of some recent number of executed branches
Yields greater prediction accuracy for the same number of
prediction bits
A 2-bit dynamic branch predictor scheme uses only
information about a particular branch
A typical correlating predictor might have 2-bit predictors for
each branch, with the choice between predictors made based
on whether the last executed branch was taken or not taken
Thus, the global branch behavior can be thought of as adding
additional index bits for the prediction lookup

Control Hazards Solutions (cont.)
Advanced dynamic branch prediction techniques (cont.)
Tournament predictor: uses multiple predictions, tracking, for
each branch, which predictor yields the best result
A typical tournament predictor might contain two predictors
for each branch index: one based on local information and one
based on global branch behavior
A selector would choose which predictor to use for any
prediction
The selector can operate similarly to a 1- or 2-bit predictor,
favoring whichever of the two predictors has been more
accurate
Some recent microprocessors use such elaborate predictors

Control Hazards Solutions (cont.)
Delayed branches (decisions):
Delayed branch: always executes the instruction immediately following the branch; only the second instruction after the branch is affected by the branch outcome
The branch takes place after that one-instruction delay
Compilers and assemblers try to place an instruction that always
executes after the branch in the branch delay slot
Branch delay slot: the slot directly after a delayed branch
instruction, which in the MIPS architecture is filled by an
instruction that does not affect the branch
The job of the software is to make the successor instructions valid
and useful
MIPS software places an instruction immediately after the delayed
branch instruction that is not affected by the branch, and a taken
branch changes the address of the instruction that follows this
safe instruction

Control Hazards Solutions (cont.)

[Figure: Scheduling the branch delay slot — (a) from before the branch, (b) from the target, (c) from the not-taken fall-through]
Control Hazards Solutions (cont.)
Delayed branches (cont.)
Scheduling the branch delay slot:
1. with an independent instruction from before the branch
this is the best choice
if there is a dependence between the branch condition and
instruction before (like $s1 in the previous figure), we
cannot use the instruction before
2. from the target of the branch
usually the target instruction will need to be copied
because it can be reached by another path
this strategy is preferred when the branch is taken with
high probability, such as a loop branch
also executing the instruction in the delay slot should not
affect execution in case of a not taken branch
3. from the not-taken fall-through
Executing the instruction in the delay slot should not affect
execution in case of a taken branch
For example, changing an unused temporary register
Control Hazards Solutions (cont.)
Delayed branches (cont.)
Delayed branches are hidden from the MIPS assembly language
programmer because the assembler can automatically arrange the
instructions to get the branch behavior desired by the programmer
In all delayed branch cases, the program should execute correctly
when the branch goes in the unexpected direction
Since delayed branches are useful only when the branch delay is short, no processor uses a delayed branch of more than one cycle
For longer branches, hardware-based branch prediction is used
The limitations on delayed-branch scheduling arise from:
the restrictions on the instructions that are scheduled into the
delay slots
our ability to predict at compile time whether a branch is likely
to be taken or not
Delayed branching is losing popularity as processors go both:
to longer pipelines (see later!) and
toward issuing multiple instructions per clock cycle (later!)
Individual Exercise (21)
Assume that the branch decision can be made in the second stage
Show how to use a delayed branch to avoid the control hazard in the following code sequence:

add $4, $5, $6
beq $1, $2, 40
lw $3, 300($0)
Individual Exercise (21): Answer
Assume that branch decision can be done in the second stage
Show how to use a delayed branch to avoid control hazard in the
following code sequence

The add instruction before the branch does not affect the branch,
so we move it to the delayed branch slot following the branch
Thus, the single pipe bubble has been replaced by add
[Pipeline diagram, program execution order vs. time: beq $1, $2, 40 issues first; add $4, $5, $6 occupies the delayed branch slot; lw $3, 300($0) follows; each instruction passes through instruction fetch, register read, ALU, data access, and register write, with successive instructions starting 2 ns apart]
Control Hazards Solutions (cont.)
Conditional move instructions:
One way to reduce the number of conditional branches
Instead of changing the PC with a conditional branch, the
instruction conditionally changes the destination register of the
move
If the condition fails, the move acts as a nop
The MIPS instruction set architecture has two such instructions, called:
movn: move if not zero
movz: move if zero
The ARM instruction set has a condition field in most instructions
ARM programs could have fewer conditional branches than in
MIPS programs

Individual Exercise (22)
Explain how the following conditional move instruction works

movn $8, $11, $4

Individual Exercise (22): Answer
Explain how the following conditional move instruction works

movn $8, $11, $4

The instruction copies the contents of register $11 into register $8,
provided that the value in register $4 is nonzero
Otherwise, it does nothing
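The effect can be modeled with a small Python sketch (representing the register file as a dictionary is purely an illustration, not how the hardware works):

```python
def movn(regs, rd, rs, rt):
    """Model of MIPS movn semantics: regs[rd] = regs[rs] if regs[rt] != 0."""
    if regs[rt] != 0:
        regs[rd] = regs[rs]  # the move happens only on a nonzero condition
    return regs              # otherwise the instruction acts as a nop

regs = {4: 1, 8: 0, 11: 99}  # $4 holds a nonzero value
movn(regs, rd=8, rs=11, rt=4)
print(regs[8])  # 99: $4 is nonzero, so $11 was copied into $8
```

With $4 equal to zero, `regs[8]` would be left unchanged.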

Group Exercise (23)
Calculate the average CPI for a pipelined implementation
Instruction mix: 24% loads, 12% stores, 44% R-type, 18% branches, and 2% jumps
Half of the load instructions are immediately followed by an instruction that uses the result
The branch delay on misprediction is 1 clock cycle
One quarter of branches are mispredicted
Assume that jumps always pay 1 full clock cycle of delay
Group Exercise (23): Answer
Loads take 1 clock cycle when there is no load-use dependence (50% of the time) and 2 clock cycles when there is (50% of the time):
CPI_loads = 1 × 0.50 + 2 × 0.50 = 1.50
Store and R-type instructions take 1 clock cycle
Branches take 1 clock cycle when predicted correctly (75% of the time) and 2 when not (25% of the time):
CPI_branches = 1 × 0.75 + 2 × 0.25 = 1.25
Jump instructions take 2 clock cycles
Thus:
CPI_overall = 1.50 × 0.24 + 1 × 0.12 + 1 × 0.44 + 1.25 × 0.18 + 2 × 0.02 ≈ 1.19
From the results obtained in Chapter 4, Part I it is clear that the pipelined implementation is much faster than the single-cycle (shorter clock period) and the multicycle ones
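The weighted sum can be checked with a few lines of Python (the class labels in the dictionaries are just names for this sketch):

```python
# Instruction mix and per-class CPI from the exercise
mix = {"load": 0.24, "store": 0.12, "rtype": 0.44, "branch": 0.18, "jump": 0.02}
cpi = {
    "load":   1 * 0.50 + 2 * 0.50,  # half the loads stall one extra cycle -> 1.50
    "store":  1.0,
    "rtype":  1.0,
    "branch": 1 * 0.75 + 2 * 0.25,  # a quarter of branches mispredicted -> 1.25
    "jump":   2.0,                  # jumps always pay one delay cycle
}
overall = sum(mix[k] * cpi[k] for k in mix)
print(overall)  # approximately 1.185, i.e. an average CPI of about 1.19
```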
Designing Instruction Set for Pipelining
The MIPS instruction set was designed to be pipelined, which makes pipelining easy:
1. All MIPS instructions are the same length
This restriction makes it much easier to fetch instructions in
the first pipeline stage and to decode them in the second one
Widely variable instruction lengths and running times can lead to imbalance among pipeline stages, causing other stages to back up, and can severely complicate hazard detection and the maintenance of precise exceptions (later!) in a design pipelined at the instruction set level
In an instruction set like x86, where instructions vary from 1
byte to 17 bytes, pipelining is considerably more challenging
Recent implementations of the architecture actually
translate x86 instructions into simple operations that look
like MIPS instructions and then pipeline the simple
operations rather than the native x86 instructions!
Designing Instruction Set for Pipelining (cont.)
2. MIPS has only a few instruction formats, with the source
register fields being located in the same place in each
instruction
This symmetry means that the second stage can begin
reading the register file at the same time that the hardware
is determining what type of instruction was fetched
If MIPS instruction formats were not symmetric, we would
need to split stage 2, resulting in six pipeline stages
The regular format of the MIPS instructions allows reading
and decoding to occur simultaneously
3. Memory operands only appear in loads or stores in MIPS
This restriction means we can use the execute stage to
calculate the memory address and then access memory in
the following stage
If we could operate on the operands in memory, as in the
x86, stages 3 and 4 would expand to an address stage,
memory stage, and then execute stage
Designing Instruction Set for Pipelining (cont.)
4. Operands must be aligned in memory in MIPS
Hence, we need not worry about a single data transfer
instruction requiring two data memory accesses
The requested data can be transferred between processor
and memory in a single pipeline stage
5. Each MIPS instruction writes a single result and does so
at the end of its execution
Simplifies the handling of exceptions and the maintenance
of a precise exception model (see later)
Data forwarding is harder if there are multiple results to
forward per instruction or they need to write before the
instruction end
PowerPCs load instructions may use update addressing, so
the processor must be able to forward two results per load
6. The MIPS architecture makes it easy for designers to
avoid structural hazards when designing a pipeline
by, for example, having two memories
Designing Instruction Set for Pipelining (cont.)
7. MIPS does not have sophisticated addressing modes
Sophisticated addressing modes can lead to different sorts of problems
Addressing modes that update registers, such as update addressing, complicate hazard detection
Other addressing modes that require multiple memory
accesses substantially complicate pipeline control and make
it difficult to keep the pipeline flowing smoothly
8. The MIPS architecture was designed to support fast
single-cycle branches that could be pipelined with a
small penalty
The designers observed that many branches rely only on
simple tests (e.g., equality or sign) that do not require a full
ALU operation but can be done with at most a few gates
When a more complex branch decision is required, a
separate instruction that uses an ALU to perform a
comparison is required
Like the use of condition codes for branches
Exceptions
An exception is an unexpected event from within the processor,
e.g., arithmetic overflow, invoking the OS from user program, using
an undefined instruction, internal hardware malfunction, etc.
An interrupt is an event that also causes an unexpected change in
control flow but comes from outside of the processor
Interrupts are used by I/O devices to communicate with the
processor
External hardware malfunctions could cause an interrupt
Exceptions and interrupts are events other than branches and
jumps that change the normal flow of instruction execution
Exceptions and interrupts are unscheduled events that disturb program execution
Detecting exceptional conditions and taking the appropriate action is often on the critical timing path of a processor, which determines the clock cycle time and thus performance
Exceptions (cont.)
Many architectures do not distinguish between interrupts and
exceptions, often using the older name interrupt to refer to both
types of events
We follow the MIPS convention:
using the term exception to refer to any unexpected change in
control flow without distinguishing whether the cause is internal
or external (both interrupts and exceptions)
using the term interrupt only when the event is externally caused

Type of event From where? MIPS terminology


I/O device request External Interrupt
Invoke the OS from user program Internal Exception
Arithmetic overflow Internal Exception
Using an undefined instruction Internal Exception
Hardware malfunctions Either Exception or interrupt
Exceptions (cont.)
Our implementation can generate two types of exceptions:
1. execution of undefined instruction
2. an arithmetic overflow
The basic actions that the processor must perform when an
exception occurs are to:
Save the address of the offending instruction in the exception
program counter (EPC) and then
Transfer control to the OS at some specified address
The OS can take the appropriate action, which may involve:
providing some service to the user program,
taking some predefined action in response to an overflow, or
stopping the execution of the program and reporting an error
After performing whatever action is required, the OS can:
terminate the program, or
may continue its execution, using the EPC to determine
where to restart the execution of the program
Exceptions (cont.)
For the OS to handle the exception, it must know the reason for the
exception and the instruction that caused it
Processors include a status register to hold a field that indicates the
reason for the exception

Some processors use vectored interrupts, where the address to which control is transferred is determined by the exception cause
The OS knows the reason for the exception by the address at
which it is initiated
The addresses are separated by 32 bytes or eight instructions,
and the OS must record the reason for the exception and may
perform some limited processing in this sequence
When the exception is not vectored, a single entry point for
all exceptions can be used, and the OS decodes the status
register to find the cause

Exceptions (cont.)
Let's assume that we are implementing the exception system used in
the MIPS architecture, with the single entry point being the address
8000 0180hex

Two additional 32-bit registers are added to the MIPS implementation: the exception program counter (EPC) and the Cause register:
1. EPC: used to hold the address of the affected instruction
2. Cause: a status register used to record the cause of the exception
A five-bit field encodes the two possible exception sources
10 represents an undefined instruction and
12 represents arithmetic overflow
The ALU overflow signal is an input to the control unit
Other bits are currently unused

Exceptions (cont.)
A pipelined implementation treats exceptions as another form of
control hazard
Just as we did for the taken branch, we must flush the instructions
that follow the one that throws the exception from the pipeline and
begin fetching instructions from the new address
We will use the same mechanism we used for taken branches, but
this time the exception causes the deasserting of control lines
The instruction should stop immediately to give the programmer the
chance to track the register values causing the exception
Otherwise, the instruction causing the exception or any following
one could change the register values
Many exceptions require that we eventually complete the instruction
that caused the exception as if it executed normally
The easiest way to do this is to flush the instruction and restart
it from the beginning after the exception is handled

Exceptions (cont.)
We saw how to flush the instruction in the IF stage by turning it
into a nop
To flush instructions in the ID stage, we use the multiplexor already
in the ID stage that zeros control signals for stalls
A new control signal, ID.Flush, is ORed with the stall signal from the
hazard detection unit to flush during ID
To flush instructions in the EX phase, we use a new signal called
EX.Flush to cause new multiplexors to zero the control lines
The EX.Flush signal is also used to prevent the instruction in the EX
stage from writing its result in the WB stage
We need to save the address of the offending instruction in EPC
We must subtract 4 from the updated PC before saving it in EPC
To start fetching instructions from location 8000 0180hex, which is
the MIPS exception address, we simply add an additional input to
the PC multiplexor that sends 8000 0180hex to the PC

Exceptions (cont.)

[Figure: pipelined datapath with exception support; the exception routine address is 8000 0180hex]
Exceptions (cont.)
Handling multiple exceptions is also important
With pipelined execution, it is important to
associate the exception with its cause instruction
prioritize the exceptions to determine which is serviced first
In most MIPS implementations, the hardware sorts exceptions
so that the earliest instruction is interrupted
I/O device requests and hardware malfunctions are not
associated with a specific instruction, so the implementation
has some flexibility as to when to interrupt the pipeline
The exception software must match the exception to the instruction
Know in which pipeline stage a type of exception can occur
For example, an undefined instruction is discovered in the ID
stage, and invoking the OS occurs in the EX stage
Exceptions are collected in the Cause register in a pending
exception field so that the hardware can interrupt based on later
exceptions, once the earliest one has been serviced
A precise exception is one that is always associated with the correct instruction in pipelined computers; otherwise, exceptions are imprecise!
Home Exercise (24)
Work out the example on pages 388-389 step by step!

Home Exercise (25)
Find the width of each of the pipeline registers in each of the following
four pipelined MIPS architectures:

Included feature A1 A2 A3 A4
No hazard solution
All data hazards solutions
All control hazards solutions
Exception support

In a tabular form, explain in detail your solution

Parallelism and Advanced ILP
Pipelining exploits the potential parallelism among instructions
This parallelism is called instruction-level parallelism (ILP)
There are two primary methods for increasing the potential
amount of ILP:
1. Superpipelining: Increasing the depth of the pipeline to
overlap more instructions
Divide the longer steps into smaller ones
To get the full speed-up, we need to rebalance the remaining
steps so they are the same length
The amount of parallelism being exploited is higher, since
there are more operations being overlapped
Performance is potentially greater since the clock cycle can
be shorter
2. Multiple issue: Replicating the internal components of the
computer so that it can launch multiple instructions in every
pipeline stage
Superpipelining (Deep Pipelining)
Superpipelining (deep pipelining): simply means longer pipelines
Since the ideal maximum speedup from pipelining increases with the number of pipeline stages, superpipelining is supposed to increase performance
[Chart: relative performance (0.0 to 3.0) versus pipeline depth (1, 2, 4, 8, 16)]

Longer pipelines increase the problem of control hazards:


1. They raise the cost of branch misprediction
2. If we cannot resolve the branch in the second stage, as is often
the case for longer pipelines, then we would see an even larger
slowdown if we stall on branches
3. If the pipeline is longer than five stages, then we may get more
branch delay slots, which are harder to fill
Multiple Issue
Launching multiple instructions per stage allows the instruction execution rate to exceed the clock rate or, equivalently, the CPI to be less than 1
If m is the number of instructions issued per clock cycle, then the best-case CPI is 1/m
It is sometimes useful to flip the CPI and use IPC, or instructions per clock cycle: IPC = 1/CPI
Multiple-issue processors have IPC > 1
Today's high-end microprocessors attempt to issue from three to six instructions in every clock cycle
The downside of multiple issue is the extra work needed to keep all the hardware busy and to transfer instructions to the next pipeline stage
There are typically many constraints on which types of instructions may be executed simultaneously and on what happens when dependences arise
Individual Exercise (26)
Consider a 4 GHz four-way multiple-issue, five-stage pipelined microprocessor:
1. Calculate the best CPI and IPC
2. Calculate the peak MIPS (or GIPS)
3. What is the number of instructions that the processor would have in execution at any given time?
Individual Exercise (26): Answer
Consider a 4 GHz four-way multiple-issue, five-stage pipelined microprocessor:
1. Calculate the best CPI and IPC
   Best CPI = 1/m = 1/4 = 0.25
   Best IPC = m = 4
2. Calculate the peak MIPS (or GIPS)
   Peak MIPS = R / (CPI × 10^6) = IPC × R / 10^6 = 4 × 4000 × 10^6 / 10^6 = 16,000
   Peak GIPS = R / (CPI × 10^9) = IPC × R / 10^9 = 4 × 4000 × 10^6 / 10^9 = 16
3. What is the number of instructions that the processor would have in execution at any given time?
   No. of instructions in execution at any time = n × m = 5 × 4 = 20
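The same arithmetic in a small Python sketch (the function name and argument order are made up for this illustration):

```python
def multiple_issue_peaks(clock_hz, issue_width, pipeline_stages):
    """Best-case figures for an ideal multiple-issue pipelined processor."""
    best_cpi = 1 / issue_width
    best_ipc = issue_width
    peak_gips = best_ipc * clock_hz / 1e9      # billions of instructions/second
    in_flight = pipeline_stages * issue_width  # instructions in execution at once
    return best_cpi, best_ipc, peak_gips, in_flight

# 4 GHz, four-way issue, five-stage pipeline, as in the exercise
print(multiple_issue_peaks(4e9, 4, 5))  # (0.25, 4, 16.0, 20)
```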
Multiple Issue (cont.)
Structural hazards are one difficulty that limits the effectiveness of multiple-issue processors
If instructions in the instruction stream are dependent or do not meet certain criteria, only the first few instructions in the sequence are issued, or perhaps even just the first instruction

There are two major ways to implement a multiple-issue processor:
1. Static multiple issue: many decisions are made statically by the compiler before execution
2. Dynamic multiple issue: many decisions are made dynamically during execution by the processor hardware
To effectively exploit the parallelism available in a multiple-issue processor, more ambitious compiler or hardware scheduling techniques are needed
Multiple Issue (cont.)
Packaging instructions into issue slots (how the processor determines how many instructions, and which instructions, can be issued in a given clock cycle):
Static issue processors: handled at least partially by the compiler
Dynamic issue processors: normally dealt with at runtime by the processor hardware; the compiler will have tried to improve the issue rate by placing the instructions in a beneficial order
Dealing with data and control hazards:
Static issue processors: some or all of the consequences of data and control hazards are handled statically by the compiler
Dynamic issue processors: hardware techniques operating at execution time attempt to alleviate at least some classes of hazards
Issue slots: from where instructions could issue in a clock cycle
Issue packet: the set of instructions that issues together in one
clock cycle; the packet may be determined statically or dynamically
The Concept of Speculation
Speculation is an approach that allows the compiler or the
processor to guess about the properties of an instruction, so as to
enable execution to begin for other instructions that may depend
on the speculated instruction
Speculate on the outcome of a branch, so that instructions
after the branch could be executed earlier
Speculate that a store that precedes a load does not refer to
the same address, which would allow the load to be executed
before the store
The difficulty with speculation is that it may be wrong
Any speculation mechanism must include methods to:
1. Check if the guess was right
2. Unroll or back out the effects of the instructions that were
executed speculatively
Implementation of this back-out capability adds complexity

The Concept of Speculation (cont.)
Speculation may be done in the compiler or by the hardware
The compiler can use speculation to reorder instructions, moving
an instruction across a branch or a load across a store
The hardware can perform the same transformation at runtime
The recovery mechanisms used for incorrect speculations are:
The compiler usually inserts additional instructions that check the
accuracy of the speculation and provide a fix-up routine to use
when the speculation is incorrect
The hardware usually buffers the speculative results until it
knows they are no longer speculative
If the speculation is correct, the instructions are completed by allowing the contents of the buffers to be written to the registers or memory
If the speculation is incorrect, the hardware flushes the
buffers and re-executes the correct instruction sequence

The Concept of Speculation (cont.)
Speculation has another problem: speculating on certain instructions
may introduce exceptions that were formerly not present
Suppose a load instruction is moved in a speculative manner, but the address it uses is not legal when the speculation is incorrect
The result would be that an exception that should not have occurred will occur
If the load instruction was not speculative, then the exception
must occur!
The compiler avoids such problems by adding special speculation
support that allows such exceptions to be ignored until it is clear
that they really should occur
The hardware simply buffers exceptions until it is clear that the
instruction causing them is no longer speculative and is ready to
complete; at that point the exception is raised, and normal
exception handling proceeds
Since speculation can improve performance when done properly and
decrease performance when done carelessly, significant effort goes
into deciding when it is appropriate to speculate
Static Multiple Issue
Static multiple-issue processors all use the compiler to assist with
packaging instructions and handling hazards
The set of instructions issued in a given clock cycle, an issue packet, can be thought of as one single large instruction with multiple operations in certain predefined fields

Very Long Instruction Word (VLIW): a style of instruction set architecture that launches many operations that are defined to be independent in a single wide instruction, typically with many separate opcode fields
In VLIW, the compiler guarantees that there are
1. No dependences between instructions that issue at the same time
2. Sufficient hardware resources to execute them
VLIW works well when the source code for the programs is available
so that the programs can be recompiled
VLIW philosophy simplifies instruction decoding and issuing logic
Static Multiple Issue with the MIPS ISA
We consider a simple two-issue MIPS processor
One of the instructions can be an integer ALU operation or branch,
and the other could be load or store
Issuing 2 instructions per cycle requires fetching and decoding 64 bits per cycle
To simplify the decoding and instruction issue, we will require that:
The instructions be paired and aligned on a 64-bit boundary,
with the ALU or branch portion appearing first
If an instruction of the pair cannot be used, it be replaced with
a nop
The compiler avoid all dependences within an instruction pair
Two methods to deal with potential data and control hazards:
The compiler takes full responsibilities for removing all hazards,
scheduling the code and inserting nops for the code to execute
without any need for hazard detection or hardware stalls
[Assumed] The hardware detects data hazards and generates
stalls between two issue packets with a hazard forcing the
entire issue packet containing the dependent instruction to stall
Static Multiple Issue with the MIPS ISA (cont.)

The hardware needs extra resources to overcome structural hazards:
1. Another 32 bits from instruction memory are needed
2. To issue an ALU and a data transfer operation in parallel, the first
need for additional hardware is extra ports in the register file
In one clock cycle we may need to read two registers for the
ALU operation and two more for a store, and also one write
port for the ALU operation and one write port for a load
3. Since one ALU is tied up for the ALU operation, we also need a
separate adder to calculate the effective address for data transfers
Static Multiple Issue with the MIPS ISA (cont.)

This two-issue processor can improve performance by up to a factor of 2
In the simple five-stage MIPS pipeline, loads have a use latency of one clock cycle, which prevents one instruction from using the result without stalling
In the two-issue, five-stage pipeline the result of a load instruction cannot be used on the next clock cycle
This means that the next two instructions cannot use the load result without stalling
Furthermore, ALU instructions that had no use latency in the simple five-stage pipeline now have a one-instruction use latency, since the results cannot be used in the paired load or store
Group Exercise (27)
Consider the following loop:
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop
Assume branches are predicted, so that control hazards are
handled by the hardware
How would this loop be scheduled on a static two-issue pipeline for
MIPS? Reorder the instructions to avoid as many pipeline stalls as
possible
Group Exercise (27): Answer
The first 3 instructions have data dependences, and so do the last 2
One schedule that works (ALU/branch slot | load/store slot | cycle):
Loop: nop                   | lw   $t0, 0($s1)    | 1
      addi $s1, $s1, -4     | nop                 | 2
      addu $t0, $t0, $s2    | nop                 | 3
      bne  $s1, $zero, Loop | sw   $t0, 4($s1)    | 4
Notice that the index in the sw instruction is changed to 4 to
compensate for the subtraction in the addi instruction that is
executed out of order
addi could be scheduled in cycle 1, but then cycle 2 would have to be empty
Notice that just one pair of instructions has both issue slots used
It takes four clocks per loop iteration
At four clocks to execute five instructions, we get a disappointing
CPI of 0.8 versus the best case of 0.5, or an IPC of 1.25 versus 2.0
In computing CPI, we do not count any nops executed as useful
instructions; doing so would improve CPI, but not performance
Group Exercise (28)
Consider the following loop:
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop
For simplicity, assume that the loop index is a multiple of four
1. Show how well loop unrolling and scheduling work in the code
above
2. Compare the performance of the code execution with loop
unrolling and without it (Exercise 27)
Group Exercise (28): Answer
To schedule the loop without any delays, it turns out that we need
to make four copies of the loop body
After unrolling and eliminating the unnecessary loop overhead
instructions, the loop contains four copies each of lw, addu, and
sw, plus one addi and one bne
Since the addi in the first issue packet decrements $s1 by 16, the
addresses loaded are the original value of $s1, then that address
minus 4, etc.
Storing is likewise done first to the original value of $s1, then
that address minus 4, etc.
Group Exercise (28): Answer (cont.)
During the unrolling the compiler introduced additional registers
($t1, $t2, $t3)
This register renaming is done to eliminate false data dependences
Consider how the unrolled code would look using only $t0
There would be repeated instances of {lw $t0, 0($s1)},
{addu $t0, $t0, $s2}, followed by {sw $t0, 0($s1)}
But these sequences, despite using $t0, are actually completely
independent: no data values flow between one set of these
instructions and the next set
Renaming the registers during the unrolling process allows the
compiler to subsequently move these independent instructions so as
to better schedule the code
Group Exercise (28): Answer (cont.)
2. Compare the performance of the code execution with loop unrolling
and without it (Exercise 27)
Ideal CPI is 0.5 (an IPC of 2)
In the original loop, just one pair of instructions executes in
multiple issue
It takes 4 clock cycles per loop iteration to execute 5
instructions, which yields a CPI of 4/5 = 0.8 (an IPC of 1.25)
In the unrolled loop, 12 of the 14 instructions in the loop execute
as pairs
It takes 8 clocks for four loop iterations, or 2 clocks per
iteration, which yields a CPI of 8/14 = 0.57 (an IPC of 1.75)
The improvement in the CPI is due to:
reducing the number of loop control instructions
dual-issue execution
The cost of this improvement is using four temporary registers
rather than one, as well as a significant increase in code size
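The comparison above is a few lines of arithmetic; the following is a minimal check of the quoted numbers (variable names are ours, not the textbook's):

```python
# CPI / IPC arithmetic for the scheduled loop with and without unrolling.
orig_cycles, orig_instrs = 4, 5            # one iteration, no unrolling
unrolled_cycles, unrolled_instrs = 8, 14   # four iterations, unrolled

orig_cpi = orig_cycles / orig_instrs           # 0.8
unrolled_cpi = unrolled_cycles / unrolled_instrs   # ~0.57
# Clocks per array element drop from 4 to 8/4 = 2: a 2x speedup
speedup = orig_cycles / (unrolled_cycles / 4)

print(round(orig_cpi, 2), round(unrolled_cpi, 2), speedup)  # 0.8 0.57 2.0
```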
Static Multiple Issue (cont.)
Loop unrolling: a technique to get more performance from loops
that access arrays, in which multiple copies of the loop body are
made and instructions from different iterations are scheduled together
After unrolling, there is more ILP available by overlapping
instructions from different iterations
Antidependence (also called name dependence): an ordering
forced purely by the reuse of a name, typically a register, rather than
by a true dependence that carries a value between two instructions
Dependences that are not true dependences, but could either lead
to potential hazards or prevent the compiler from flexibly
scheduling the code
Register renaming: the renaming of registers by the compiler or
hardware to remove antidependences
eliminates name dependences, while preserving true dependences
allows the compiler to subsequently move independent code for
better scheduling
Architectural registers: the set of processor-visible registers
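The renaming idea can be sketched directly: give every register write a fresh name and map later reads through the most recent one. This is an illustrative toy (the tuple encoding, the rename() helper, and the p1, p2, … physical names are ours), not the textbook's algorithm:

```python
from itertools import count

def rename(instrs):
    """Assign each write a fresh physical name (p1, p2, ...) and map
    reads through the latest name: antidependences disappear while
    true (value-carrying) dependences are preserved."""
    fresh = count(1)
    latest = {}                                  # architectural -> physical
    out = []
    for op, dest, srcs in instrs:
        srcs = tuple(latest.get(r, r) for r in srcs)   # reads use old names
        if dest is not None:
            latest[dest] = f"p{next(fresh)}"           # fresh name per write
            dest = latest[dest]
        out.append((op, dest, srcs))
    return out

# Two unrolled copies of the loop body, both written with $t0:
body = [
    ("lw",   "$t0", ("$s1",)),
    ("addu", "$t0", ("$t0", "$s2")),
    ("sw",   None,  ("$t0", "$s1")),
] * 2
for ins in rename(body):
    print(ins)
```

After renaming, the first copy uses p1/p2 and the second uses p3/p4, so the two copies share no names and the scheduler is free to interleave them.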
Dynamic Multiple-Issue Processors
Dynamic multiple-issue processors are also known as superscalar
processors, or simply superscalars
Superscalar: An advanced pipelining technique that enables the
processor to execute more than one instruction per clock cycle by
selecting them during execution
In the simplest superscalar processors:
Instructions issue in order
The processor decides whether zero, one, or more instructions
can issue in a given clock cycle
Superscalars also extend dynamic issue decisions to include dynamic
pipeline scheduling
reordering instruction execution in hardware
Dynamic pipeline scheduling: chooses which instructions to
execute next, possibly reordering them to avoid hazards and stalls
Static pipeline scheduling: stalls when waiting for a hazard to be
resolved, even if later instructions are ready to go
Dynamic Multiple-Issue Processors (cont.)
Achieving good performance on superscalars still requires the
compiler to try to schedule instructions to move dependences apart
and thereby improve the instruction issue rate

Motivations for dynamic scheduling vs. compiler scheduling:
1. Not all stalls are predictable
For example, cache misses cause unpredictable stalls
2. If the processor speculates on branch outcomes using dynamic
branch prediction, it cannot know the exact order of instructions
at compile time, since it depends on the predicted and actual
behavior of branches
3. As the pipeline latency and issue width change from one
implementation to another, the best way to compile a code
sequence also changes
Old legacy code will get much of the benefit of a new
implementation without the need for recompilation
Dynamic Multiple-Issue Processors (cont.)
Differences between VLIW and superscalar processors:
1. VLIW processors predated superscalars
2. VLIW processors rely totally on compiler scheduling, whereas
superscalars additionally use dynamic pipeline scheduling
3. The code, whether scheduled or not, is guaranteed by the
superscalar hardware to execute correctly without changing the
binary machine code (no recompilation)
4. Compiled code will always run correctly on superscalars
independent of the processor issue rate or pipeline structure:
In some VLIW designs, this was not the case, and recompilation
was required when moving across different processors
In other static issue processors, code would run correctly
across different implementations, but often so poorly as to
make recompilation effectively required

Dynamic Pipeline Scheduling
Microarchitecture: the organization of the processor, including the
major functional units, their interconnection, and control
The pipeline is divided into three major units:
1. An instruction fetch and issue unit
Fetches instructions, decodes them, and sends each instruction
to a corresponding functional unit for execution
Dynamic Pipeline Scheduling (cont.)
2. Multiple functional units (12 in high-end designs in 2008)
Each unit has buffers, called reservation stations, which hold
the operands and the operation
As soon as the buffer contains all its operands and the
functional unit is ready to execute, the result is calculated
When the result is completed, it is sent to any reservation
stations waiting for it as well as to the commit unit
3. A commit unit
Buffers the result until it is safe to put the result in the register
file or, for a store, into memory
The buffer in the commit unit, called the reorder buffer, is also
used to supply operands, in much the same way as forwarding
logic does in a statically scheduled pipeline
Once a result is committed to the register file, it can be fetched
directly from there, just as in a normal pipeline
The final step of updating the state is called retirement or graduation
Dynamic Pipeline Scheduling (cont.)
The combination of buffering operands in the reservation stations and
results in the reorder buffer provides a form of register renaming:
1. When an instruction issues, it is copied to a reservation station for
the appropriate functional unit
Operands that are available in the register file or reorder
buffer are immediately copied into the reservation station
The instruction is buffered in the reservation station until all
the operands and functional unit are available
2. If an operand is not in the register file or reorder buffer, it must
be waiting to be produced by a functional unit
The name of the functional unit that will produce the result is
tracked
When that unit eventually produces the result, it is copied
directly into the waiting reservation station from the functional
unit, bypassing the registers
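The reservation-station mechanics described above can be sketched as a small model. This is a deliberately simplified illustration, not the textbook's design: every operation takes one "step", the RS class and run() helper are our inventions, and commit is a single in-order pass at the end:

```python
# Toy reservation-station model: issue in order, capture available
# operands at issue, wait on tags otherwise, broadcast results,
# commit to the register file in program order.
class RS:
    def __init__(self, op, srcs, regfile, producers):
        self.op = op
        self.vals, self.waits = {}, {}
        for i, r in enumerate(srcs):
            if r in producers:            # value still being computed:
                self.waits[i] = producers[r]   # remember the producing RS
            else:                         # available: copy it immediately
                self.vals[i] = regfile[r]
        self.result = None

    def ready(self):
        return not self.waits

    def capture(self, rs, value):
        """Pick up a broadcast result this station was waiting for."""
        for i, p in list(self.waits.items()):
            if p is rs:
                self.vals[i] = value
                del self.waits[i]

def run(program, regfile):
    producers, stations, order = {}, [], []
    for op, dest, srcs in program:        # issue in program order
        rs = RS(op, srcs, regfile, producers)
        stations.append(rs)
        order.append((rs, dest))
        producers[dest] = rs              # later reads of dest wait on rs
    while stations:                       # execute in data-flow order
        rs = next(s for s in stations if s.ready())
        a, b = rs.vals[0], rs.vals[1]
        rs.result = a + b if rs.op == "add" else a - b
        for s in stations:                # broadcast to waiting stations
            s.capture(rs, rs.result)
        stations.remove(rs)
    for rs, dest in order:                # commit in program order
        regfile[dest] = rs.result
    return regfile

regs = {"r1": 8, "r2": 5, "r3": 0, "r4": 0}
prog = [("add", "r3", ("r1", "r2")),      # r3 = 13
        ("sub", "r4", ("r3", "r2"))]      # waits for r3, then r4 = 8
print(run(prog, regs))
```

Note how the second instruction issues before the first finishes: it simply records which station will produce r3 and captures the value when it is broadcast, which is the register-renaming effect discussed next.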

Dynamic Pipeline Scheduling (cont.)
A dynamically scheduled pipeline analyzes the data flow structure of
a program
The processor then executes the instructions in some order that
preserves the data flow order of the program
Out-of-order (OOO) execution: instructions can be executed in a
different order than they were fetched
A situation in pipelined execution when an instruction blocked from
executing does not cause the following instructions to wait
In-order commit:
The instruction fetch and decode unit is required to issue
instructions in order, which allows dependences to be tracked
The commit unit is required to write results to registers and
memory in program fetch order
The functional units are free to initiate execution whenever the
data they need is available

Dynamic Pipeline Scheduling (cont.)
In-order commit (cont.):
If an exception occurs, the computer can point to the last instruction
executed, and the only registers updated will be those written
by instructions before the instruction causing the exception
This makes programs behave as if they were running on a simple in-
order pipeline (the model of today's dynamically scheduled pipelines)
Dynamic scheduling is often extended by including hardware-based
speculation, especially for branch outcomes
By predicting the direction of a branch, a processor can continue
to fetch and execute instructions along the predicted path
Because instructions are committed in order, we know whether or
not the branch was correctly predicted before any instructions
from the predicted path are committed
A speculative, dynamically scheduled pipeline can also support
speculation on load-store reordering, using the commit unit
to avoid incorrect speculation
Pipelining and Multiple-Issue Limitations
Despite the existence of processors with four to six issues per clock
cycle, very few applications can sustain more than two instructions
per clock for the following two primary reasons:
1. Dependences that cannot be alleviated and branches that cannot
be accurately predicted
2. Losses in the memory system
Four factors combine to limit the performance improvement gained
by pipelining and multiple-issue execution:
1. Data hazards in the code mean that increasing the pipeline depth
increases the time per instruction because a larger percentage of the
cycles become stalls
2. Control hazards mean more clock cycles for the program
3. Pipeline register overhead can limit the decrease in clock period
obtained by further pipelining
4. Instruction latencies (inherent execution times) introduce difficulties,
since a dependence means the processor must wait the full
instruction latency for the hazard to be resolved
Power Efficiency and Advanced Pipelining
The downside to the increasing exploitation of ILP via dynamic
multiple issue and speculation is power efficiency
Now that we have hit the power wall, we are seeing designs with
multiple processors per chip, where the processors are not as deeply
pipelined or as aggressively speculative as their predecessors
While these processors are not as fast as the sophisticated earlier
ones, they deliver better performance per watt, so they can deliver
more performance per chip when designs are constrained more by
power than by the number of transistors

Note the drop in pipeline stages as we switch to multicore designs


Fallacies and Pitfalls
Fallacy: pipelining is easy
Fallacy: pipelining ideas can be implemented independent of
technology
Pitfall: failure to consider instruction set design can adversely
impact pipelining
Fallacy: increasing the depth of pipelining always increases
performance
Tutorial Exercises (29)
Textbook problems:
