2 Pipelining
Pipelining is a technique of decomposing a sequential process into
sub-processes, with each sub-process being executed in a special
dedicated segment that operates concurrently with all other
segments. Any operation that can be decomposed into a sequence
of sub-operations of about the same complexity can be
implemented by a pipeline processor.
Asynchronous Model
As shown in Fig.2-2a, data flow between adjacent stages in an
asynchronous pipeline is controlled by a handshaking protocol.
When stage Si is ready to transmit, it sends a ready signal to stage
Si+1. After stage Si+1 receives the incoming data, it returns an
acknowledge signal to Si.
Synchronous Model
Synchronous pipelines are illustrated in Fig.2-2b. The operands
pass through all segments in a fixed sequence. Each segment
consists of a combinational circuit Si that performs a sub-operation
on the data stream flowing through the pipe. Isolating registers R
(latches) interface between stages and hold the intermediate
results. Upon the arrival of a clock pulse, all registers transfer
data to the next stage simultaneously.
(Diagram: (a) an asynchronous pipeline with handshaking between stages; (b) a synchronous pipeline with isolating registers R between segments S1, S2, ..., Sk, all driven by a common clock.)
Fig.2-2
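The register-transfer behavior of the synchronous model can be sketched in a few lines of Python. The three stage functions below are made-up placeholders, not anything from the text; the point is only that all registers load simultaneously on each clock pulse.

```python
# Minimal sketch of the synchronous pipeline of Fig.2-2b: each stage S_i is
# a pure function, and the isolating registers R shift on every clock pulse.

def clock_pulse(registers, stages, new_input):
    """Advance the pipeline by one clock: every register simultaneously
    loads the output of the stage that feeds it."""
    # Compute all stage outputs from the *current* register contents first,
    # so the transfer is simultaneous, as in hardware.
    outputs = [stage(reg) for stage, reg in zip(stages, registers)]
    return [new_input] + outputs

# Three illustrative stages operating on an integer data stream.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
registers = [0, 0, 0, 0]      # R before S1, S2, S3, plus the output register

regs = registers
for item in [10, 0, 0, 0]:
    regs = clock_pulse(regs, stages, item)
# After k = 3 clocks beyond its load, the first item emerges transformed:
# ((10 + 1) * 2) - 3 = 19 sits in the output register.
```

Each call to `clock_pulse` is one rising clock edge: the whole register file advances by exactly one segment.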
Example 2.1
In certain scientific computations it is necessary to perform the
arithmetic operation (Ai + Bi)*(Ci + Di) with a stream of numbers.
Specify a pipeline configuration to carry out this task. List the
contents of all registers in the pipeline for i = 1 through 6.
Solution
Each sub-operation is implemented in a segment within a pipeline.
Each segment has one or more registers and a combinational
circuit, as shown in Fig.2-3. The sub-operations performed in each
segment of the pipeline are as follows:
Segment 1: R1 ← Ai, R2 ← Bi, R3 ← Ci, R4 ← Di (load the input operands)
Segment 2: R5 ← R1 + R2, R6 ← R3 + R4 (perform the additions Ai + Bi and Ci + Di)
Segment 3: R7 ← R5 * R6 (multiply (Ai + Bi)*(Ci + Di))
(Diagram: inputs Ai, Bi, Ci, Di feed registers R1–R4 in segment 1; two adders feed R5 and R6 in segment 2; a multiplier feeds R7 in segment 3.)
Fig.2-3
The seven registers are loaded with new data every clock pulse.
The effect of each clock will be as shown:
Clock cycles:  1   2   3   4   5   6   7   8   9
Segment 1:     T1  T2  T3  T4  T5  T6  -   -   -
Segment 2:     -   T1  T2  T3  T4  T5  T6  -   -
Segment 3:     -   -   T1  T2  T3  T4  T5  T6  -
Segment 4:     -   -   -   T1  T2  T3  T4  T5  T6
Fig.2-4
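As a check on the register transfers of Fig.2-3, the contents of R1–R7 can be simulated clock by clock. The operand values below are invented for illustration; each result (Ai + Bi)*(Ci + Di) appears in R7 two clocks after its operands are loaded.

```python
# Sketch of Example 2.1's pipeline (register names R1..R7 follow Fig.2-3;
# the sample operand values are made up for illustration).
A = [1, 2, 3, 4, 5, 6]; B = [10, 20, 30, 40, 50, 60]
C = [2, 2, 2, 2, 2, 2]; D = [3, 3, 3, 3, 3, 3]

R1 = R2 = R3 = R4 = R5 = R6 = R7 = None
results = []
for clock in range(len(A) + 2):           # 2 extra clocks drain the pipe
    # All transfers happen simultaneously on the clock pulse, so the
    # old register values are used on every right-hand side.
    new_R7 = R5 * R6 if R5 is not None else None            # segment 3
    new_R5 = R1 + R2 if R1 is not None else None            # segment 2
    new_R6 = R3 + R4 if R3 is not None else None
    if clock < len(A):                                      # segment 1
        new_R1, new_R2, new_R3, new_R4 = A[clock], B[clock], C[clock], D[clock]
    else:
        new_R1 = new_R2 = new_R3 = new_R4 = None
    R1, R2, R3, R4, R5, R6, R7 = (new_R1, new_R2, new_R3, new_R4,
                                  new_R5, new_R6, new_R7)
    if R7 is not None:
        results.append(R7)

# results[i] equals (A[i] + B[i]) * (C[i] + D[i]); e.g. results[0] is
# (1 + 10) * (2 + 3) = 55.
```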
Example 2.2
Draw the space-time diagram for a six-segment pipeline showing
the time it takes to process eight tasks.
Solution
Clock cycles:  1   2   3   4   5   6   7   8   9   10  11  12  13
Segment 1:     T1  T2  T3  T4  T5  T6  T7  T8  -   -   -   -   -
Segment 2:     -   T1  T2  T3  T4  T5  T6  T7  T8  -   -   -   -
Segment 3:     -   -   T1  T2  T3  T4  T5  T6  T7  T8  -   -   -
Segment 4:     -   -   -   T1  T2  T3  T4  T5  T6  T7  T8  -   -
Segment 5:     -   -   -   -   T1  T2  T3  T4  T5  T6  T7  T8  -
Segment 6:     -   -   -   -   -   T1  T2  T3  T4  T5  T6  T7  T8
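The diagram above follows a simple rule: task Ti occupies segment s during clock cycle i + s − 1, and the whole batch finishes after k + n − 1 cycles. A short sketch (the function name is invented) generates such a diagram for any k and n:

```python
# Generate a space-time diagram for a k-segment pipeline processing n tasks.
def space_time(k, n):
    """Return a k x (k + n - 1) grid; entry [s][c] is the task occupying
    segment s+1 during clock cycle c+1, or None if that segment is idle."""
    cycles = k + n - 1
    grid = [[None] * cycles for _ in range(k)]
    for task in range(1, n + 1):
        for seg in range(k):
            # Task `task` reaches segment seg+1 at clock cycle task + seg.
            grid[seg][task - 1 + seg] = f"T{task}"
    return grid

grid = space_time(k=6, n=8)
# The batch completes after k + n - 1 = 6 + 8 - 1 = 13 clock cycles,
# matching the diagram of Example 2.2.
```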
As the number of tasks n grows large, the speedup approaches its maximum value:

S = t_n / t_p = k    (when t_n = k · t_p)

(Plot: speedup factor versus number of tasks; the curves for k = 6 and k = 10 rise toward their asymptotes S = 6 and S = 10 as the number of tasks grows from 1 to 1024.)
Fig.2-5
2.2.4 Pipeline Efficiency
The efficiency E_k of a linear k-segment pipeline is defined as

E_k = Speedup / Number of segments = S_k / k

If we assume that the time it takes to process a task is the same in
the pipeline and non-pipeline circuits (t_n = k · t_p), the speedup will be

S_k = k · n · t_p / ((k + n − 1) · t_p) = k · n / (k + n − 1)
Example 2.3
A non-pipeline system takes 50 ns to process a task. The same
task can be processed in a six-segment pipeline with a clock
cycle of 10 ns. Determine the speedup and the efficiency of the
pipeline for 100 tasks. What is the maximum speedup and
efficiency that can be achieved?
Solution
Given: t_n = 50 ns, k = 6, t_p = 10 ns, n = 100.
Required: the speedup, the efficiency, and their maximum values.

Speedup:

S = n · t_n / ((k + n − 1) · t_p) = (100 × 50) / ((6 + 100 − 1) × 10) = 4.76

The maximum speedup is achieved when the number of tasks grows much
larger than k − 1, so k − 1 can be neglected and the ratio reduces to

S_max = n · t_n / (n · t_p) = t_n / t_p = 50 / 10 = 5

Efficiency:

E = Speedup / Number of segments = S / k = 4.76 / 6 = 79.33%

Maximum efficiency:

E_max = Maximum speedup / Number of segments = S_max / k = 5 / 6 = 83.33%
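The arithmetic of this example is easy to verify numerically. The sketch below just encodes the two formulas above; the function names are invented:

```python
# Checking Example 2.3: t_n = 50 ns (non-pipelined task time), k = 6
# segments, t_p = 10 ns (clock cycle), n = 100 tasks.
def speedup(n, k, t_n, t_p):
    # S = n * t_n / ((k + n - 1) * t_p)
    return (n * t_n) / ((k + n - 1) * t_p)

def efficiency(n, k, t_n, t_p):
    # E = S / k
    return speedup(n, k, t_n, t_p) / k

S = speedup(100, 6, 50, 10)        # about 4.76
E = efficiency(100, 6, 50, 10)     # about 0.794, i.e. roughly 79%
S_max = 50 / 10                    # t_n / t_p = 5 as n grows large
E_max = S_max / 6                  # 5/6, about 0.833
```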
Fixed-Point Operations
Fixed-point numbers are represented internally in machines in
sign-magnitude, one's complement, or two's complement
notation. Add, subtract, multiply, and divide are the four primitive
arithmetic operations.
Floating-Point Numbers
A floating-point number X is represented by a pair (m, e), where
m is the mantissa (or fraction) and e is the exponent with an
implied base (or radix). The algebraic value is represented as
X = m × r^e
The sign of X can be embedded in the mantissa.
Floating-Point Operations
The four primitive arithmetic operations are defined below for a
pair of floating-point numbers represented by X = (mx, ex) and Y
= (my, ey), assuming ex <= ey and base r = 2.
X + Y = (mx × 2^(ex − ey) + my) × 2^(ey)
X − Y = (mx × 2^(ex − ey) − my) × 2^(ey)
X × Y = (mx × my) × 2^(ex + ey)
X ÷ Y = (mx ÷ my) × 2^(ex − ey)
These operations can be divided into two halves:
One half is for exponent operations such as comparing their
relative magnitudes or adding/subtracting them; the other half is
for mantissa operations including four types of fixed-point
operations.
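A minimal sketch of this split for r = 2, assuming numbers are stored as (mantissa, exponent) pairs; the function names fp_add and fp_mul are invented for the example:

```python
# Exponent-half / mantissa-half split for base r = 2. Numbers are
# (mantissa, exponent) pairs, as in the formulas above.
def fp_add(x, y):
    (mx, ex), (my, ey) = x, y
    if ex > ey:                      # ensure ex <= ey, as the formulas assume
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    # Exponent half: compare the exponents and keep the larger one.
    # Mantissa half: align mx by the exponent difference, then add.
    return (mx * 2.0 ** (ex - ey) + my, ey)

def fp_mul(x, y):
    (mx, ex), (my, ey) = x, y
    # Exponent half adds the exponents; mantissa half multiplies.
    return (mx * my, ex + ey)

# (0.5, 3) represents 0.5 * 2^3 = 4 and (0.5, 2) represents 2, so the
# sum should represent 6.
m, e = fp_add((0.5, 2), (0.5, 3))
assert m * 2 ** e == 6.0
```

Normalization of the result (shifting the mantissa and adjusting the exponent) is deliberately omitted here; it is the subject of the four-segment pipeline below.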
The floating-point addition and subtraction can be
performed in four segments as shown in Fig.2-6. The registers
labeled R are latches used to interface between segments and to
store intermediate results. The sub-operations that are performed
in the four segments are:
Segment 1: Compare the exponents by subtraction.
Segment 2: Align the mantissas.
Segment 3: Add or subtract the mantissas.
Segment 4: Adjust the exponent and normalize the result.
(Diagram: the four segments in sequence, with interface registers R between them.)
Fig.2-6
The pipeline in Fig.2-6 operates on binary numbers, but in the
following numerical example we use decimal numbers for
simplicity (r = 10). Consider the two normalized floating-point
numbers:
X = 0.99 × 10^4
Y = 0.5 × 10^3
Segment 1:
The two exponents are subtracted to obtain 4 – 3 = 1. The larger
exponent 4 is chosen to be the exponent of the result.
Segment 2:
The mantissa of Y is shifted to the right once to obtain
X = 0.99 × 10^4
Y = 0.05 × 10^4
This aligns the two mantissas under the same exponent.
Segment 3:
The two mantissas are added to produce the sum
Z = 1.04 × 10^4
Segment 4:
The sum is adjusted by normalizing the result. This is done by
shifting the mantissa once to the right and incrementing the
exponent to obtain
Z = 0.104 × 10^5
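The four steps above translate directly into code. This mirrors the decimal example (r = 10) rather than being a general implementation:

```python
# The four segments of Fig.2-6 applied to the decimal example (r = 10).
r = 10
mx, ex = 0.99, 4          # X = 0.99 x 10^4
my, ey = 0.5, 3           # Y = 0.5  x 10^3

# Segment 1: compare the exponents by subtraction; keep the larger one.
diff = ex - ey            # 4 - 3 = 1
e = max(ex, ey)

# Segment 2: align the mantissa of the smaller-exponent operand.
my_aligned = my / r ** diff       # 0.5 becomes 0.05

# Segment 3: add the mantissas.
m = mx + my_aligned               # 0.99 + 0.05 = 1.04

# Segment 4: normalize (shift mantissa right, increment exponent) if needed.
if abs(m) >= 1:
    m, e = m / r, e + 1           # Z = 0.104 x 10^5
```

In the pipeline, each of these four steps runs in its own segment, so a new pair of operands can enter segment 1 on every clock cycle.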
Example 2.4
The time delays in the four segments in the pipeline of Fig.2-6
are as follows:
t1 = 50 ns, t2 = 30 ns, t3 = 95 ns, t4 = 45 ns. The interface register delay
time is tr = 5 ns.
a. How long would it take to add 100 pairs of numbers in the
pipeline?
b. How can we reduce the total time to about one-half of the
time calculated in part (a)?
Solution
(a)
The clock cycle must accommodate the slowest segment plus the register
delay: tp = 95 + 5 = 100 ns. The time taken to add n pairs of numbers
in a k-segment pipeline is (k + n − 1)·tp, so for k = 4 and n = 100:
(4 + 100 − 1) × 100 = 10,300 ns = 10.3 μs
(b)
Since segment 3 (95 ns) dominates the clock cycle, the total time can be
roughly halved by subdividing it into two segments of about equal delay.
The clock cycle then drops to about 50 + 5 = 55 ns, and the total time
becomes (5 + 100 − 1) × 55 = 5,720 ns, close to half the time of part (a).
Instruction cycle
Computers with complex instructions (CISC) require other
phases in addition to the fetch and execute to process an
instruction completely. In the most general case, the computer
needs to process each instruction with the following sequence of
steps:
(Flowchart: Segment 1 fetches the instruction; Segment 2 decodes it and fetches the source operand; Segment 3 performs the operation specified by the instruction. A taken branch empties the pipe, and a pending interrupt is handled before the PC is updated.)
Fig.2-7
Clock cycles:  1   2   3   4   5   6   7
I1:            F1  D1  E1  W1
I2:                F2  D2  E2  W2
I3:                    F3  D3  E3  W3
I4:                        F4  D4  E4  W4
Fig.2-8
Pipeline conflicts (hazards) are of three types:
Data hazards.
Instruction (control) hazards.
Structural hazards.
Hardware Interlock
The most straightforward method is to insert a hardware interlock.
An interlock is a circuit that detects instructions whose source
operands are destinations of instructions farther up in the
pipeline. Detection of this situation causes the instruction whose
source is not available to be delayed by enough clock cycles to
resolve the conflict. Hardware is used to insert the required
delays.
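The detection step can be sketched as follows, assuming each instruction is summarized as a destination register and a set of source registers. The program encoding below (and the example program, taken from Example 2.5 further on) is an illustration, not a hardware description:

```python
# Sketch of an interlock check: find instructions whose source registers
# are written by an earlier instruction still in the pipeline.
# Each instruction is (destination register or None, set of source registers).
program = [
    ("R1", set()),            # LOAD  R1 <- M[312]
    ("R2", {"R2"}),           # ADD   R2 <- R2 + M[313]
    ("R3", {"R4"}),           # INC   R3 <- R4 + 1
    (None, {"R3"}),           # STORE M[314] <- R3
]

def find_conflicts(prog, window=2):
    """Report (producer, consumer) index pairs where the consumer reads a
    register written by a producer within `window` preceding instructions,
    i.e. one whose result is not yet available."""
    conflicts = []
    for j, (_, sources) in enumerate(prog):
        for i in range(max(0, j - window), j):
            dest = prog[i][0]
            if dest is not None and dest in sources:
                conflicts.append((i, j))
    return conflicts
```

Here `find_conflicts(program)` flags the single INC/STORE pair; the interlock would then stall the consumer until the producer's result is written.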
Operand forwarding
Instead of transferring an ALU result only into the destination register,
hardware checks the destination operand, and if it is needed as a
source in the next instruction, passes the result directly to the
ALU input. Fig.2-9 shows the part of a pipelined processor that
carries out this task.
(Diagram: a forwarding path routes the result of the E (execute/ALU) segment directly back to its source input, bypassing the W (write/register file) segment.)
Fig.2-9
Example 2.5
Consider the four instructions in the following program. Suppose
that the first instruction starts from step 1 in the pipeline of
Fig.2-8. Specify what operations are performed in the four
segments during step 5 and step 6, assuming (a) hardware interlock
and (b) operand forwarding:
LOAD:  R1 ← M[312]
ADD:   R2 ← R2 + M[313]
INC:   R3 ← R4 + 1
STORE: M[314] ← R3
Solution
The STORE instruction will cause a data conflict as its source
operand (R3) is the result of the previous instruction (INC), this
will result in a data hazard. The timing for the pipeline of the
above program will be as follows:
Clock cycles:  1   2   3   4   5   6   7
LOAD:          F1  D1  E1  W1
ADD:               F2  D2  E2  W2
INC:                   F3  D3  E3  W3
STORE:                     F4  D4  E4  W4
(a)
In a pipelined system using a hardware interlock, the interlock
circuit detects that the STORE instruction's source operand is
the destination of the INC instruction. We assume that the
interlock circuit detects such an instruction after it passes the F
segment. After detecting this situation, the interlock causes
the STORE instruction to be delayed for 2 clock cycles, the
minimum needed for the INC instruction to write its result to the
destination (R3). The timing of the pipeline becomes:
Clock cycles:  1   2   3   4   5   6   7   8   9
LOAD:          F1  D1  E1  W1
ADD:               F2  D2  E2  W2
INC:                   F3  D3  E3  W3
STORE:                     F4  -   -   D4  E4  W4
(b)
In a pipelined system using operand forwarding, as shown in Fig.2-9,
the result of an instruction can be fed back directly into the E
segment when there is a data dependency. In the above program,
the hardware detects the data dependency between the INC and
STORE instructions and passes the result of the INC instruction
directly to the input of the E segment. The timing of the pipeline
becomes:
Clock cycles:  1   2   3   4   5   6   7
LOAD:          F1  D1  E1  W1
ADD:               F2  D2  E2  W2
INC:                   F3  D3  E3  W3
STORE:                     F4  D4  E4  W4
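The two timings above can be reproduced by a small model. The stall rule below (without forwarding, the consumer's D stage must follow the producer's W stage; with forwarding no stall is needed) matches the tables; the function name and encoding are invented for the example:

```python
# Compare the two timings of Example 2.5 on a 4-stage F/D/E/W pipeline,
# one cycle per stage, instructions issued in order.
def total_cycles(deps, n, forwarding):
    """deps: (producer, consumer) index pairs with a data hazard.
    Without forwarding, the consumer's D stage must start after the
    producer's W stage completes; with forwarding the result is passed
    straight to E, so no stall is needed."""
    issue = list(range(1, n + 1))          # nominal F cycle of each instr.
    for p, c in deps:
        if not forwarding:
            w_done = issue[p] + 3          # producer writes in its W stage
            d_nominal = issue[c] + 1       # consumer's unstalled D cycle
            stall = max(0, w_done + 1 - d_nominal)
            for later in range(c, n):      # later instructions slip too
                issue[later] += stall
    return issue[-1] + 3                   # W cycle of the last instruction

# INC (index 2) produces R3 for STORE (index 3):
assert total_cycles([(2, 3)], 4, forwarding=False) == 9
assert total_cycles([(2, 3)], 4, forwarding=True) == 7
```

The interlocked pipeline needs 9 cycles, the forwarding pipeline 7, exactly as in the two tables.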
Fig.2-10 illustrates the branch penalty. Instruction I2 is a branch; the
instructions <I3> and <I4> fetched behind it must be discarded once the
branch is resolved. In (a) the branch is taken and the target instructions
Ik, Ik+1 are fetched; in (b) the instructions I3, I4 are fetched again
after the branch resolves:

(a)
Clock cycles:  1   2   3   4   5   6   7   8   9
I1:            F1  D1  E1  W1
I2 (Branch):       F2  D2  E2
<I3>:                  F3  (discarded)
<I4>:                      F4  (discarded)
Ik:                            Fk   Dk   Ek   Wk
Ik+1:                               Fk+1 Dk+1 Ek+1 Wk+1

(b)
Clock cycles:  1   2   3   4   5   6   7   8   9
I1:            F1  D1  E1  W1
I2 (Branch):       F2  D2  E2
<I3>:                  F3  (discarded)
<I4>:                      F4  (discarded)
I3:                            F3  D3  E3  W3
I4:                                F4  D4  E4  W4
Fig.2-10
Loop buffer
A loop buffer is a small, very-high-speed register file included in the
fetch segment of the pipeline. When a program loop is detected,
the loop is stored in the loop buffer, including all branches. The
program loop can then be executed directly, without accessing
memory, until the loop mode is removed by the final branch.
Branch prediction
A pipeline with branch prediction uses some additional logic to
guess the outcome of a conditional branch instruction before it is
executed. A correct prediction eliminates the wasted time caused
by the branch penalties.
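The text does not fix a particular prediction scheme; a common one, shown here purely as an illustration, is a 2-bit saturating counter per branch. Once the counter is saturated, two consecutive wrong guesses are needed to flip the prediction, which tolerates the single not-taken exit of a loop:

```python
# Illustrative 2-bit saturating-counter branch predictor (not from the text).
class TwoBitPredictor:
    def __init__(self):
        self.counter = 0          # 0,1 -> predict not taken; 2,3 -> taken

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        # Move one step toward the actual outcome, saturating at 0 and 3.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
for _ in range(3):                # train on three taken branches
    p.update(True)
p.update(False)                   # one loop exit...
# ...and the predictor still guesses "taken" for the next iteration.
```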
Compiler support
Instead of designing hardware to handle difficulties (hazards)
associated with data conflicts and branch penalties, RISC
processors relay on the efficiency of the compiler to detect and
minimize the delays.
Clock cycles:  1   2   3   4   5
I1:            I   A   E
I2:                I   A   E
I3:                    I   A   E
(Three-segment RISC pipeline: I = instruction fetch, A = ALU operation, E = execute/store result.)
Fig.2-11
The following sections illustrate how RISC computers use the
compiler to handle data and instruction hazards.
Delayed load
The compiler of a RISC computer is designed to detect data
conflicts (data hazards) and reorder the instructions as necessary,
delaying the loading of the conflicting data by inserting no-operation
instructions. For example, two instructions, an Add followed by a
Subtract (written in Berkeley RISC I format), give rise to a data
dependency (data hazard): the result of the Add instruction is placed
into register R3, which in turn is one of the two source operands of
the Subtract instruction, so there is a conflict in the Subtract
instruction. This can be seen from the pipeline timing shown in
Fig.2-12a: the A segment in clock cycle 3 uses data from R3, which
will not yet be the correct value, since the Add operation has not
completed. When the compiler detects such a situation, it searches
for a useful instruction to put after the Add; if it cannot find one,
it inserts a no-operation instruction, as illustrated in Fig.2-12b.
This type of instruction is fetched from memory but performs no
operation, thus wasting a clock cycle.
Clock cycles:  1   2   3   4   5
ADD:           I   A   E
SUB:               I   A   E
(a) Pipeline timing with data conflict

Clock cycles:  1   2   3   4   5
ADD:           I   A   E
NOP:               I   A   E
SUB:                   I   A   E
(b) Pipeline timing with inserted NOP
Fig.2-12
Delayed Branch
In Fig.2-10a, the processor fetches instruction I3 before it
determines whether the current instruction I2 is a branch
instruction. When execution of I2 is completed and a branch is to
be made, the processor must discard I3 and fetch the instruction
at the branch target Ik. The location following the branch
instruction is called a branch delay slot. The instructions in the
delay slots are always fetched and at least partially executed
before the branch decision is made and the branch target address
is computed. The compiler for a processor that uses delayed
branches is designed to analyze the instructions before and after
the branch and rearrange the program sequence by inserting
useful (or no-operation) instructions in the delay slots.
RISC computers use the delayed branch to handle branch-related
instruction hazards. An example of a delayed branch is shown in
Fig.2-13. The program sequence for this example consists of the
instructions shown in the timing tables below (written in Berkeley
RISC I format).
Clock cycles:  1   2   3   4   5   6   7   8   9
AND:           I   A   E
SLL:               I   A   E
ADD:                   I   A   E
JMP:                       I   A   E
NOP:                           I   A   E
NOP:                               I   A   E
SUB:                                   I   A   E
(a) Using no-operation instructions
Clock cycles:  1   2   3   4   5   6   7
ADD:           I   A   E
JMP:               I   A   E
AND:                   I   A   E
SLL:                       I   A   E
SUB:                           I   A   E
(b) Rearranging the instructions
Fig.2-13