
Architecture of the Pentium Microprocessor

Article in IEEE Micro July 1993


DOI: 10.1109/40.216745 Source: IEEE Xplore


2 authors, including:

Donald Alpert
Camelback Computer Architecture, LLC



The Pentium CPU is the latest in Intel's family of compatible microprocessors. It integrates 3.1 million transistors in 0.8-μm BiCMOS technology. We describe the techniques of pipelining, superscalar execution, and branch prediction used in the microprocessor's design.
Donald Alpert
Dror Avnon
Intel Corporation

The Pentium processor is Intel's next generation of compatible microprocessors following the popular i486 CPU family. The design started in early 1989 with the primary goal of maximizing performance while preserving software compatibility within the practical constraints of available technology. The Pentium processor integrates 3.1 million transistors in 0.8-μm BiCMOS technology and carries the Intel trademark. We describe the architecture and development process employed to achieve this goal.

Technology
The continual advancement of semiconductor technology promotes innovation in microprocessor design. Higher levels of integration, made possible by reduced feature sizes and increased interconnection layers, enable designers to deploy additional hardware resources for more parallel computation and deeper pipelining. Faster device speeds lead to higher clock rates and consequently to requirements for larger and more specialized on-chip memory buffers.

Table 1 (next page) summarizes the technology improvements associated with our three most recent microprocessor generations. The 0.8-μm BiCMOS technology of the Pentium microprocessor enables 2.5 times the number of transistors and twice the clock frequency of the original i486 CPU, which was implemented in 1.0-μm CMOS.

Compatibility
Since introduction of the 8086 microprocessor in 1978, the X86 architecture has evolved through several generations of substantial functional enhancements and technology improvements, including the 80286 and i386 CPUs. Each of these CPUs was supported by a corresponding floating-point unit. The i486 CPU,1 introduced in 1989, integrates the complete functionality of an integer processor, floating-point unit, and cache memory into a single circuit.

The X86 architecture greatly appealed to software developers because of its widespread application as the central processor of IBM-compatible personal computers. The success of the architecture in PCs has in turn made the X86 popular for commercial server applications as well. Figure 1 shows some of the well-known software environments that are hosted on the architecture.

The common software environments allow the X86 architecture to exercise several operating modes. Applications developed for DOS use 16-bit real mode (or virtual 8086 mode) and MS Windows. Early versions of OS/2 use 16-bit protected mode, and applications for other popular environments use 32-bit flat (unsegmented) mode. The Pentium microprocessor employs general techniques for improving performance in all operating modes, as well as certain techniques for improving performance in specific operating
0272-1732/93/0600-0011 $03 .00 1993 IEEE June 1993 11


Pentium microprocessor

modes. We focus on the 32-bit flat mode here, since this is the most appropriate mode for comparison with the other high-performance microprocessors described at the Hot Chips IV Conference.

The x86 architecture supports the IEEE-754 standard for floating-point arithmetic.2 In addition to required operations on single-precision and double-precision formats, the x86 floating-point architecture includes operations on 80-bit, extended-precision format and a set of basic transcendental functions.

Pentium CPU designers found numerous exciting technical challenges in developing a microarchitecture that maintained compatibility with such a diverse software base. Later in this article we present examples of techniques for supporting self-modifying code and the stack-oriented, floating-point register file.

Table 1. Technology for microprocessor development.

Microprocessor   Year   Technology                          No. of transistors   Frequency (MHz)
i386 CPU         1986   1.5-μm CMOS, two-layer metal        275K                 16
i486 CPU         1989   1.0-μm CMOS, two-layer metal        1.2M                 33
Pentium CPU      1993   0.8-μm BiCMOS, three-layer metal    3.1M                 66

Figure 1. Software environments. (All figures, tables, and photographs published in this article are the property of Intel Corporation.)
16-bit generation: DOS, MS-Windows, OS/2 1.x
32-bit generation: Unix SVR4, SCO, OSF/1, Netware 3.11, Next Step, 32-bit OS/2, Solaris, Windows NT, Univel, Taligent
Time axis: 1980s, 1991, 199x

Performance
A microprocessor's performance is a complex function of many parameters that vary between applications, compilers, and hardware systems. In developing the Pentium microprocessor, the design team addressed these aspects for each of the popular software environments. As a result, the Pentium CPU features tuned compilers and cache memory.

We focus on the performance of SPEC benchmarks for both the Pentium microprocessor and i486 CPU in systems with well-tuned compilers and cache memory. More specifically, the Pentium CPU achieves roughly two times the speedup on integer code and up to five times the speedup on floating-point vector code when compared with an i486 CPU of identical clock frequency.

Organization
Figure 2 shows the overall organization of the Pentium microprocessor. The core execution units are two integer pipelines and a floating-point pipeline with dedicated adder, multiplier, and divider. Separate on-chip instruction code and data caches supply the memory demands of the execution units, with a branch target buffer augmenting the instruction cache for dynamic branch prediction. The external interface includes separate address and 64-bit data buses.

Figure 2. Pentium processor block diagram.

Integer pipeline
The Pentium processor's integer pipeline is similar to that of the i486 CPU.3 The pipeline has five stages (see Figure 3) with the following functions:

Prefetch. During the PF stage the CPU prefetches code from the instruction cache and aligns the code to the initial byte of the next instruction to be decoded. Because instructions are of variable length, this stage includes buffers to hold both the line containing the instruction being decoded and the next consecutive line.
First decode. In the D1 stage the CPU decodes the instruction to generate a control word. A single control word executes instructions directly; more complex instructions require microcoded control sequencing in D1.
Second decode. In the D2 stage the CPU decodes the control word from D1 for use in the E stage. In addition, the CPU generates addresses for data memory references.
Execute. In the E stage the CPU either accesses the data cache or calculates results in the ALU (arithmetic logic unit), barrel shifter, or other functional units in the data path.
Write back. In the WB stage the CPU updates the registers and flags with the instruction's results. All exceptional conditions must be resolved before an instruction can advance to WB.

Figure 3. Integer pipeline. (PF: fetch and align instruction. D1: decode instruction, generate control word. D2: decode control word, generate memory address. E: access data cache or calculate ALU result. WB: write result.)

Compared to the integer pipeline of the i486 CPU, the Pentium microprocessor integrates additional hardware in several stages to speed instruction execution. For example, the i486 CPU requires two clocks to decode several instruction formats, but the Pentium CPU takes one clock and executes shift and multiply instructions faster. More significantly, the Pentium processor substantially enhances superscalar execution, branch prediction, and cache organization.

Superscalar execution. The Pentium CPU has a superscalar organization that enables two instructions to execute in parallel. Figure 4 shows that the resources for address generation and ALU functions have been replicated in independent integer pipelines, called U and V. (The pipeline names were selected because U and V were the first two consecutive letters of the alphabet neither of which was the initial of a functional unit in the design partitioning.) In the PF and D1 stages the CPU can fetch and decode two simple instructions in parallel and issue them to the U and V pipelines. Additionally, for complex instructions the CPU in D1 can generate microcode sequences that control both U and V pipelines.

Figure 4. Superscalar execution. (PF and D1 are shared; the D2, E, and WB stages are replicated in the U pipe and V pipe.)

Several techniques are used to resolve dependencies between instructions that might be executed in parallel. Most of the logic is contained in the instruction issue algorithm (see Figure 5) of D1.

Decode two consecutive instructions: I1 and I2
If the following are all true
    I1 is a "simple" instruction
    I2 is a "simple" instruction
    I1 is not a jump instruction
    Destination of I1 ≠ source of I2
    Destination of I1 ≠ destination of I2
Then issue I1 to U pipe and I2 to V pipe
Else issue I1 to U pipe

Figure 5. Instruction issue algorithm.


Resource dependencies. A resource dependency occurs when two instructions require a single functional unit or data path. During the D1 stage, the CPU only issues two instructions for parallel execution if both are from a class of "simple" instructions, thereby eliminating most resource dependencies. The instructions must be directly executed, that is, not require microcode sequencing. The instruction being issued to the V pipe can be an ALU operation, memory reference, or jump. The instruction being issued to the U pipe can be from the same categories or from an additional set that uses a functional unit available only in the U pipe, such as the barrel shifter. Although the set of instructions identified as "simple" might seem restrictive, more than 90 percent of instructions executed in the Integer SPEC benchmark suite are simple.

Data dependencies. A data dependency occurs when one instruction writes a result that is read or written by another instruction. Logic in D1 ensures that the source and destination registers of the instruction issued to the V pipe differ from the destination register of the instruction issued to the U pipe. This arrangement eliminates read-after-write (RAW) and write-after-write (WAW) dependencies. Write-after-read (WAR) dependencies need not be checked because reads occur in an earlier stage of the pipelines than writes.

The design includes logic that enables instructions with certain special types of data dependency to be executed in parallel. For example, a conditional branch instruction that tests the flag results can be executed in parallel with a compare instruction that sets the flags.

Control dependencies. A control dependency occurs when the result of one instruction determines whether another instruction will be executed. When a jump instruction is issued to the U pipe, the CPU in D1 never issues an instruction to the V pipe, thereby eliminating control dependencies.

Note that resource dependencies and data dependencies between memory references are not resolved in D1. Dependent memory references can be issued to the two pipelines; we explain their resolution in the description of the data cache.

Branch prediction. The i486 CPU has a simple technique for handling branches. When a branch instruction is executed, the pipeline continues to fetch and decode instructions along the sequential path until the branch reaches the E stage. In E, the CPU fetches the branch destination, and the pipeline resolves whether or not a conditional branch is taken. If the branch is not taken, the CPU discards the fetched destination, and execution proceeds along the sequential path with no delay. If the branch is taken, the fetched destination is used to begin decoding along the target path with two clocks of delay. Taken branches are found to be 15 percent to 20 percent of instructions executed, representing an obvious area for improvement by the Pentium processor.

The Pentium CPU employs a branch target buffer (BTB), which is an associative memory used to improve performance of taken branch instructions (see Figure 6). When a branch instruction is first taken, the CPU allocates an entry in the branch target buffer to associate the branch instruction's address with its destination address and to initialize the history used in the prediction algorithm. As instructions are decoded, the CPU searches the branch target buffer to determine whether it holds an entry for a corresponding branch instruction. When there is a hit, the CPU uses the history to determine whether the branch should be taken. If it should, the microprocessor uses the target address to begin fetching and decoding instructions from the target path. The branch is resolved early in the WB stage, and if the prediction was incorrect, the CPU flushes the pipeline and resumes fetching along the correct path. The CPU updates the dual-ported history in the WB stage. The branch target buffer holds entries for predicting 256 branches in a four-way associative organization.

Figure 6. Branch target buffer. (Each entry holds a branch instruction address, a branch destination address, and history.)

Using these techniques, the Pentium CPU executes correctly predicted branches with no delay. In addition, conditional branches can be executed in the V pipe paired with a compare or other instruction that sets the flags in the U pipe. Branching executes with full compatibility and no modification to existing software. (We explain aspects of interactions between branch prediction and self-modifying code later.)

Cache organization. The i486 CPU employs a single on-chip cache that is unified for code and data. The single-ported cache is multiplexed on a demand basis between sequential code prefetches of complete lines and data references to individual locations. As just explained, branch targets are prefetched in the E stage, effectively using the same hardware as data memory references. There are potential advantages for such an organization over one that separates code and data.

1) For a given size of cache memory, a unified cache has a higher hit rate than separate caches because it balances the total allocation of code and data lines automatically.
2) Only one cache needs to be designed.
3) Handling self-modifying code can be simpler.
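Returning briefly to the branch target buffer described earlier in this section, its behavior can be sketched as a small Python model. The 256-entry, four-way organization and the allocate-on-first-taken rule come from the text; everything else is an assumption made for illustration: the history is modeled as a two-bit saturating counter, replacement within a set is oldest-first, and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Illustrative BTB model: 256 entries, four-way set associative."""

    def __init__(self, entries=256, ways=4):
        self.ways = ways
        self.num_sets = entries // ways
        # Each set is a list of [tag, target, counter], oldest entry first.
        self.sets = [[] for _ in range(self.num_sets)]

    def _locate(self, addr):
        entries = self.sets[addr % self.num_sets]
        tag = addr // self.num_sets
        for entry in entries:
            if entry[0] == tag:
                return entries, tag, entry
        return entries, tag, None

    def predict(self, addr):
        """Return the predicted target, or None to keep fetching sequentially."""
        _, _, entry = self._locate(addr)
        if entry is not None and entry[2] >= 2:   # assumed 2-bit history
            return entry[1]
        return None

    def update(self, addr, taken, target):
        """Record the resolved outcome of the branch at addr."""
        entries, tag, entry = self._locate(addr)
        if entry is None:
            if taken:                              # allocate when first taken
                if len(entries) == self.ways:
                    entries.pop(0)                 # assumed oldest-first eviction
                entries.append([tag, target, 3])
            return
        if taken:
            entry[1] = target
            entry[2] = min(3, entry[2] + 1)
        else:
            entry[2] = max(0, entry[2] - 1)
```

In this sketch a branch that has been taken once keeps being predicted taken until it falls through twice in a row; the real history encoding is not given in the text and may differ.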

lique I?espite these potential advantages of a unified cache, all U-pipe V-pipe U-pipe V-pipe
:uted, of which apply to the i486 CPU , the Pentium microprocessor address address data data
!long uses separate code and data caches. The reason is that the I
+
I
InE,
le re
If the
stina
superscalar design and branch prediction demand more band
width than a unified cache similar to that of the i486 CPU can
provide. First, efficient branch prediction reqUires that the
destination of a branch be accessed Simultaneously with data

Dual-ported
TLB
Bank
conflict
detection ,
.with referen.ces of previous instructions executing in the pipeline .
I Singe-ported and
,on is Second , the parallel execution of data memory references Dual-ported interleaved
:locks requires simultaneous accesses for loads and stores. Third, in cache tags cache data
to 20 the context of the overall Pentium microprocessor design,
5area handling self-modifying code for separate code and data
caches is only marginally more complex than for a unified Figure 7. Dual-access data cache.
BTB), cache.
nance The instruction cache and data cache are each 8-Kbyte,
ranch two-way associative designs with 32-byte lines. frequency.) For the Pentium microprocessor, with its higher
,ranch Programs executing on the i486 CPU typically generate performance core pipelines and 64-bit data bus, using a write
,widl more data memory references than when executing on RISC back protocol for cache consistency was an obvious enhance
in dle microprocessors. Measurements on Integer SPEC benchmarks ment. The write-back protocol uses four states: modified,
: CPU show 0.5 to 0.6 data references per instruction for the i486 exclusive, shared, and invalid (MESI).
holds CPU' and only 0.17 to 0.33 for the Mips processors This SeJf-modifying code. One challenging aspect of the
lere is difference results directly from the limited number (eight) of Pentium microprocessor's design was supporting self-modi
>ranch registers for the x86 architecture, as well as procedure-calling fying code compatibly. Compatibility requires that when an
le ta1' conventions that require passing all parameters in memory. instruction is modified followed by execution of a taken branch
; from A small data cache is adequate to capture the locality of the instruction, subsequent executions of the modified instruc
stage, additional references. (After all, the additional references have tion must use the updated value. This is a special form of
pipe sufficient locality to fit in the register file of the RISC micro dependency between data stores and instruction fetches.
: CPU processors.) The Pentium microprocessor implements a data The interaction between branch predictions and self-modi
>ranch cache that supports dual accesses by the U pipe and V pipe fying code requires the most attention. The Pentium CPU
s in a to provide additional bandwidth and simplify compiler in fetches the target of a taken branch before previous instruc
struction schedu ling algorithms. tions have completed stores, so dedicated logic checks for
s cor Figure 7 shows that the address path to the translation such conditions in the pipeline and flushes incorrectly fetched
:ondi look-aside buffer and data cache tags is a fully dual-ported instructions when necessaly. The CPU thoroughly verifies
with a stmcture. The data path, however, is single ported with eight predicted branches to handle cases in which an instruction
. pipe. way interleaving of 32-bit-wide banks. When a bank conflict entered in the branch target buffer might be modified. The
clifica occurs, the U pipe assumes priority, and the V pipe stalls for same mechanisms used for consistency with external memory
ctions a clock cycle. The bank conflict logic also serves to eliminate maintain consistency between the code cache and data cache.
Irer.) data dependencies between parallel memory references to a
~e on single location. For memory references to double-precision Floating-point pipeline
)olted floating-point data , the CPU accesses consecutive banks in The i486 CPU integrated the floating-point unit (FPU) on
lential parallel, forming a single 64-bit path. chip, thus e liminating overhead of the communication proto
to in The design team considered a fully dual-ported structure col that resulted from using a coprocessor. Bringing the FPU
ts are for the data cache, but feaSibility studies and performance on ch ip substantially boosted performance in the i486 CPU.
hard simulations showed the interleaved structure to be more ef Nevertheless, due to limited devices available for the FPU, its
Iclva n fective. The dual-ported structure eliminated bank conflicts, microarchitecture was based on a partial multiplier array and
; code but the SRAM cell would have been larger than the cell used a shift-and-add data path controlled by microcode. Floating
in the interleaved scheme, resulting in a smaller cache and point operations could not be pipelined with any other
lower hit ratio for the allocated area. Additionally, the han floating-point operations; that is, once a floating-point in
: has a dling of data dependencies would have been more complex. struction is invoked, all other floating-point instructions stall
lances With a write-through cache-consistency protocol and 32 until its completion.
tically. bit data bus, the i486DX2 CPU uses buses 80 percent of the The larger transistor budget available for the Pentium mi
time; 85 percent of all bus cycles are writes. (The i486DX2 croprocessor permits a completely new approach in the de
CPU has a core pipeline that operates at twice the bus clock's sign of the floating-point microarchitecture. The aggressive


performance goals for the FPU presented an exciting challenge for the designers, even with more silicon resources available. Furthermore, maintaining full compatibility with previous products and with the IEEE standard for floating-point arithmetic was an uncompromising requirement.

Floating-point pipeline stages. Pentium's floating-point pipeline consists of eight stages. The first two stages are processed by the common (integer pipeline) resources for prefetch and decode. In the third stage the floating-point hardware begins activating logic for instruction execution. All of the first five stages are matched with their counterpart integer pipeline stages for pipeline sequencing and synchronization (see Figure 8).

Figure 8. Floating-point pipeline. (The integer pipe and the floating-point pipe shown side by side.)

Prefetch. The PF stage is the same as in the integer pipeline.
First decode. The D1 stage is the same as in the integer pipeline.
Second decode. The D2 stage is the same as in the integer pipeline.
Operand fetch. In this E stage the FPU accesses both the data cache and the floating-point register file to fetch the operands necessary for the operation. When floating-point data is to be written to the data cache, the FPU converts internal data format into the appropriate memory representation. This stage matches the E stage of the integer pipeline.
First execute. In the X1 stage the FPU executes the first steps of the floating-point computation. When floating-point data is read from the data cache, the FPU writes the incoming data into the floating-point register file.
Second execute. In the X2 stage the FPU continues to execute the floating-point computation.
Write float. In the WF stage the FPU completes the execution of the floating-point computation and writes the result into the floating-point register file.
Error reporting. In the ER stage the FPU reports internal special situations that might require additional processing to complete execution and updates the floating-point status word.

The eight-stage pipeline in the FPU allows a single cycle throughput for most of the "basic" floating-point instructions such as floating-point add, subtract, multiply, and compare. This means that a sequence of basic floating-point instructions free from data dependencies would execute at a rate of one instruction per cycle, assuming instruction cache and data cache hits.

Data dependencies exist between floating-point instructions when a subsequent instruction uses the result of a preceding instruction. Since the actual computation of floating-point results takes place during X1, X2, and WF stages, special paths in the hardware allow other stages to be bypassed and present the result to the subsequent instruction upon generation. Consequently, the latency of the basic floating-point instructions is three cycles.

The X86 floating-point architecture supports single-precision (32-bit), double-precision (64-bit), and extended-precision (80-bit) floating-point operations. We chose to support all computation for the three precisions directly, by extending the data path width to support extended precision. Although this entailed using more devices for the implementation, it greatly simplified the microarchitecture while improving the performance. If smaller data paths were designed, special rerouting of the data within the FPU and several state machines or microcode sequencing would have been required for calculating the higher precision data.

Floating-point instructions execute in the U pipe and generally cannot be paired with any other integer or floating-point instructions (the one exception will be explained later). The design was tuned for instructions that use one 64-bit operand in memory with the other operand residing in the floating-point register file. Thus, these operations may execute at the maximum throughput rate, since a full stage (E stage) in the pipeline is dedicated to operand fetching. Although floating-point instructions use the U pipe during the E stage, the two ports to the data cache (which are used by the U pipe and the V pipe for integer operations) are used to bring 64-bit data to the FPU. Consequently, during intensive floating-point computation programs, the data cache access ports of the U pipe and V pipe operate concurrently with the floating-point computation. This behavior is similar to superscalar load-store RISC designs where load instructions execute in parallel with floating-point operations, and therefore deliver equivalent throughput of floating-point operations per cycle.

Microarchitecture overview. The floating-point unit of the Pentium microprocessor consists of six functional sections (see Figure 9).

The floating-point interface, register file, and control (FIRC) section is the only interface between the FPU and the rest of the CPU. Since the function of floating-point operations is usually self-contained within the floating-point computation core, concentrating all the interface logic in one section helped to create a modular design of the other sections. The FIRC section also contains most of the common floating-point resources: register file, centralized control logic, and safe instruction recognition logic (described later). FIRC can complete execution of instructions that do not need arithmetic computation. It dispatches the instructions requiring arithmetic computation to the arithmetic sections.
The floating-point exponent section (FEXP) calculates the exponent and the sign results for all the floating-point arithmetic operations. It interfaces with all the other arithmetic sections for all the necessary adjustments between the mantissa and the sign-and-exponent fields in the computation of floating-point results.

The floating-point multiplier section (FMUL) includes a full multiplier array to support single-precision (24-bit mantissa), double-precision (53-bit mantissa), and extended-precision (64-bit mantissa) multiplication and rounding within three cycles. FMUL executes all the floating-point multiplication operations. It is also used for integer multiplication, which is implemented through microcode control.

The floating-point adder section (FADD) executes all the "add" floating-point instructions, such as floating-point add, subtract, and compare. FADD also executes a large set of micro-operations that are used by microcode sequences in the calculation of complex instructions, such as binary coded decimal (BCD) operations, format conversions, and transcendental functions. The FADD section operates during the X1 and X2 stages of the floating-point pipeline and employs several wide adders and shifters to support high-speed arithmetic algorithms while maintaining maximum performance for all data precisions. The CPU achieves a latency of three cycles with a throughput of one cycle for all the operations directly executed by the FADD section for single-precision, double-precision, and extended-precision data.

The floating-point divider (FDIV) section executes the floating-point divide, remainder, and square-root instructions. It operates during the X1 and X2 pipeline stages and calculates two bits of the divide quotient every cycle. The overall instruction latency depends on the precision of the operation. FDIV uses its own sequencer for iterative computation during the X1 stage. The results are fully accurate in accordance with IEEE standard 754 and ready for rounding at the end of the X2 stage.

The floating-point rounder (FRND) section rounds the results delivered from the FADD and FDIV sections. It operates during the WF stage of the floating-point pipeline and delivers a rounded result according to the precision control and the rounding control, which are specified in the floating-point control word.

Figure 9. Floating-point unit block diagram. (Sections FIRC, FEXP, FMUL, FADD, FDIV, and FRND; FIRC connects to and from the integer pipelines and cache; mantissa and exponent results return to FIRC.)

Safe instruction recognition. Floating-point computation requires longer execution times than integer computation. Pentium's floating-point pipeline uses eight stages, while the integer pipeline uses only five stages. Compatibility requires in-order instruction execution as well as precise exception reporting. To meet these requirements in the Pentium processor, floating-point instructions should not proceed beyond the X1 stage, that is, allow subsequent instructions to proceed beyond the E stage, unless the floating-point instruction is guaranteed to complete without causing an exception. Otherwise, an instruction may change the state of the CPU while an earlier floating-point instruction (which has not yet completed) might cause an exception that requires a trap to a software exception handler.

To avoid a substantial performance loss due to stalling instructions until the exception status of a previous floating-point instruction is known, Pentium's floating-point unit employs a mechanism called safe instruction recognition (SIR). This logic determines whether a floating-point instruction is guaranteed to complete without creating an exception and therefore is considered "safe." If an instruction is safe, there is no need to stall the pipeline, and the maximum throughput can be obtained. If, however, the instruction is not safe, the pipeline stalls for three cycles until the unsafe instruction reaches the ER stage and a final determination of the exception status is made.

Six possible exceptions can occur on the Pentium microprocessor's floating-point operations: invalid operation, divide by zero, denormal operand, overflow, underflow, and inexact. The SIR logic needs to determine early in the floating-point pipeline (in the X1 stage, before any computation takes place) whether the instruction is guaranteed to be exception free (safe) or not (unsafe). The first three of the six exceptions can be detected without any floating-point calculation. From the latter three exceptions, the inexact exception is usually "masked" by the operating system or the software application (using the precision mask, or PM, bit in the floating-point control word). Otherwise, a trap will occur whenever rounding of the result is necessary. When the precision (inexact) exception is masked, the pipeline delivers the correctly rounded result directly. For overflow and underflow exceptions SIR logic uses an algorithm that monitors the exponent fields of the input operands to conclude the exception status (safe or unsafe).

In the X86 architecture the CPU stores floating-point operands in the floating-point register file with an extended-precision exponent, regardless of the precision control in the floating-point control word. The extended-precision expo

Figure 10. FXCH code example.

Cycle 1: FADD QWORD PTR [EAX]    FXCH ST(2)
Cycle 2: FMUL QWORD PTR [EBX]    FXCH ST(3)

Register   (a)   (b)        (c)
ST0        A     C          D
ST1        B     B          B
ST2        C     A+[EAX]    A+[EAX]
ST3        D     D          C*[EBX]
ST4        E     E          E
ST5        F     F          F
ST6        G     G          G
ST7        H     H          H

The example shown in Figure 10 illustrates the use of parallel FXCH. The code in the example generates the results of two independent floating-point calculations. The floating-point register file contains initial values prior to code execution: register ST0 (TOS) contains the value A, register ST1 contains value B, register ST2 contains value C, and so on. The two operations are

1) floating-point addition of value A with the 64-bit floating-point operand addressed by the general register EAX, and
2) floating-point multiplication of value C by the 64-bit floating-point operand addressed by the general register EBX.

When the floating-point pipeline is fully loaded and these two operations are part of the code sequence, the parallel FXCH allows the calculation to maintain the maximum throughput of one cycle per operation. Within one cycle the Pentium CPU writes the result of the addition to ST2, while the operand for the next operation moves to the top of the stack. On the next cycle, the processor writes the result of the multiplication to ST3, while the top of the stack contains
nent supports much greater range than the double-precision value D, which may be used for a subsequent operation,
fo rmat. Overflow and underflow exceptions caused by con Transcendental instructions. The CPU SUppOlts all eight
verting the data into double-precision or single-precision for transcendental instructions that are defined in the instruction
mats occur only when storing the data into external memory, set through direct execution of microcode sequences, The
These characteristics of the x86 floating-point a rchitecture transcendental instructions are
give a unique advantage to the effectiveness of the SIR mecha

nism in the Pentium CPU, since the SIR algorithm can use the 1) FSIN sine,
internal (extended-precision) exponent range. TIlLis, the oc 2) FCOS cosine,
currence of unsafe operations is extremely rare. Our evalua 3) FSINCOS sine and cosine,
tion of the SIR algorithm for the FPU design found no unsafe 4) FPTAl'\T tangent,
instructio ns in simulated execution of the SPEC89 floating 5) FPATAN arctangent,
point benchmarks. 6) F2XM1 2**X - 1,
Register stack manipulation. The x 86 floating-point in 7) FYl2X Y * Log2(X), and
struction set uses the register file as a stack of eight registers 8)FYl2XP 1 Y * Log2(X+1)
in which the top of stack (TOS) acts as an accumulator of the
results. Therefore, the top of the stack is used for the majority We developed new, ta ble-driven algorithms for the tran
... of the instructions as one of the source operands and, usu
ally, as the destination register.
scendental functions using polynomial approximation tech
niques, These algorithms substantially improved performance
To improve the floating-point pipeline performance by op and accuracy over the i486 CPU implementation, which used

I
timizing the use of the floating-point register file, Pentium's the more traditional Cordic algorithms. The approximation
FPU can execute the FXCH instruction in parallel with any tables reside in an on-chip ROM along with the other special
basic floating-point operation. The FXCH instructio n "swaps" constants that are used for floating-point computation,
the contents of the TOS register with another register in the The performance improvement of the transcendental in
:::. floating-point register file. All the basic floating-po int instruc structions o n the Pentium processor ranges from two to three
tions may be pai red with FXCH in the V pipe , The pair ex times over the same instructions on the i486 CPU at the same
~, ecute in parallel, even when data dependency between the frequency. The worst-case error for all the transce nde ntal in
I two instructions in the pair exists. The use of parallel FXCH structions is less than 1 ulp (unit in the last place) when
redirects the result of a floating-point operation to any se rounding to nearest even and less than 1.5 ulps when round
lected register in the register file, while bringing a new oper ing in other modes. The functions are guaranteed to be mono
and to the top of the stack for immediate use by the next tonic, with respect to the input operands, throughout the
floating-point operation. domain supported by the instruction.
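The safe instruction recognition idea described in this section, deciding from the operand exponents alone whether overflow or underflow is impossible, can be illustrated with a small sketch. The article does not publish the actual SIR algorithm, so the bounds below are an assumption: ordinary conservative exponent arithmetic with one guard binade on the high side, applied to the extended-precision exponent range. The names `unbiased_exponent` and `is_safe` are hypothetical.

```python
import math

# Unbiased exponent limits of normalized extended-precision numbers.
EXT_EMIN, EXT_EMAX = -16382, 16383
SIGNIFICAND_BITS = 64  # extended-precision significand width


def unbiased_exponent(x: float) -> int:
    """Return e with 2**e <= |x| < 2**(e + 1); x must be nonzero and finite."""
    _, e = math.frexp(x)  # frexp gives x = m * 2**e with 0.5 <= |m| < 1
    return e - 1


def is_safe(op: str, e1: int, e2: int) -> bool:
    """Conservatively decide, from exponents only, that op cannot overflow or underflow.

    Assumed bounds (not taken from the article): lo and hi bracket the
    exponent of any nonzero result, with one guard binade added to hi.
    """
    if op == "mul":
        lo, hi = e1 + e2, e1 + e2 + 2
    elif op == "div":
        lo, hi = e1 - e2 - 1, e1 - e2 + 1
    elif op in ("add", "sub"):
        # A nonzero sum or difference is a multiple of the smaller operand's ulp.
        lo = min(e1, e2) - (SIGNIFICAND_BITS - 1)
        hi = max(e1, e2) + 2
    else:
        raise ValueError(f"unknown operation: {op}")
    return EXT_EMIN <= lo and hi <= EXT_EMAX


big = unbiased_exponent(1e300)
print(is_safe("mul", big, big))      # would overflow in double, fits the extended range
print(is_safe("mul", 16000, 16000))  # too close to the extended-precision limit
```

Under these assumed bounds, any pair of operands whose exponents fit the double-precision range passes the check, which is consistent with the article's observation that unsafe operations are extremely rare.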

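The accuracy and monotonicity claims for the transcendental instructions suggest how such properties might be checked against a higher-precision reference. The following is a minimal sketch under stated assumptions: it measures error in double-precision ulps (not the extended-precision ulps of the FPU), uses Python's `decimal` module as the reference, and uses `math.expm1` as a stand-in for an implementation of 2**X - 1. The function names and the sampled interval [-1, 1] are hypothetical choices for illustration.

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 60  # working precision of the reference model, in decimal digits


def reference_2xm1(x: float) -> Decimal:
    """High-precision 2**x - 1, computed as exp(x * ln 2) - 1."""
    return (Decimal(x) * Decimal(2).ln()).exp() - 1


def candidate_2xm1(x: float) -> float:
    """Stand-in for the implementation under test (ordinary double precision)."""
    return math.expm1(x * math.log(2.0))


def ulp_error(got: float, ref: Decimal) -> Decimal:
    """Distance between got and ref, in ulps of got (math.ulp needs Python 3.9+)."""
    return abs(Decimal(got) - ref) / Decimal(math.ulp(got))


def is_monotonic(values) -> bool:
    """True if the sequence never decreases."""
    return all(a <= b for a, b in zip(values, values[1:]))


xs = [i / 1000 for i in range(-1000, 1001)]
results = [candidate_2xm1(x) for x in xs]
worst = max(ulp_error(y, reference_2xm1(x)) for x, y in zip(xs, results))
print(f"worst error: {float(worst):.3f} ulp, monotonic: {is_monotonic(results)}")
```

A sampled check like this can only find violations, not prove their absence; a guarantee over the whole domain needs analysis of the algorithm itself.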
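The register movements of the Figure 10 example can be traced with a small symbolic model. This is a sketch for illustration only: the class `FpRegisterStack` and its method names are hypothetical, register contents are kept as strings, and each operation paired with FXCH is simply counted as one cycle, as the text describes for a fully loaded pipeline.

```python
class FpRegisterStack:
    """Symbolic model of the eight-entry floating-point register stack; st[0] is the TOS."""

    def __init__(self, values):
        self.st = list(values)
        self.cycles = 0

    def fxch(self, i):
        """Swap the top of stack with ST(i)."""
        self.st[0], self.st[i] = self.st[i], self.st[0]

    def issue_with_fxch(self, operator, memory_operand, i):
        """Apply ST0 <- ST0 operator memory_operand, then FXCH ST(i), counted as one cycle."""
        self.st[0] = f"{self.st[0]}{operator}{memory_operand}"
        self.fxch(i)
        self.cycles += 1


stack = FpRegisterStack("ABCDEFGH")       # state (a) of Figure 10
stack.issue_with_fxch("+", "[EAX]", 2)    # FADD QWORD PTR [EAX] | FXCH ST(2)
state_b = list(stack.st)
stack.issue_with_fxch("*", "[EBX]", 3)    # FMUL QWORD PTR [EBX] | FXCH ST(3)
state_c = list(stack.st)

print(state_b[:4])
print(state_c[:4])
```

After the first pair the sum sits in ST2 with C on top of the stack; after the second pair the product sits in ST3 with D on top, matching states (b) and (c) of the figure.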
Naming the Pentium processor

In naming the fifth generation of its compatible microprocessor line the Pentium processor, Intel departed from tradition. Pentium breaks a string of CPU products dating back to the late 1970s that used numerics (8086, 286, 386, 486).

"The natural course would be to call this chip the 586," said Andrew S. Grove, president and chief executive officer. "Unfortunately, we cannot trademark those numbers, which means that any company might call any chip a 586, even if it doesn't measure up to the real thing."

Pentium uses the Greek word for five, "pente," as its root to associate with the fifth-generation product and adds "-ium," a common ending from the periodic table of elements. Thus, the Pentium microprocessor is the fifth generation, a key element for future computing.

Development process

Developing a highly integrated microprocessor involves collaboration between numerous teams having diverse technical specialties and working under the discipline of well-defined methodologies. A small team of architects and VLSI designers developed the initial concepts of the design. This group conducted feasibility studies of parallel instruction decoding and options for branch prediction techniques. Simultaneously, it evaluated performance by hand for short benchmarks and compiler optimizations. As initial directions were established, additional engineers participated, and subteams focused on the following areas:

1) behavioral modeling of the microarchitecture;
2) circuit feasibility design for caches, decoding PLAs (programmable logic arrays), floating-point data path, and other critical functions;
3) a flexible, trace-driven simulator of instruction timing for performance evaluation;
4) a prototype compiler; and
5) enhancements to existing instruction-tracing tools.
Throughout the design we refined the Pentium microprocessor using both top-down and bottom-up methods. Top-down refinement was accomplished through comprehensive characterization of executing benchmark workloads on the i486 CPU [4] and trace-driven experiments concerning alternative machine organizations conducted by architects using the performance simulator.
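A trace-driven experiment of the kind mentioned above can be pictured as replaying one recorded instruction trace through timing models of different machine organizations. The sketch below is only an illustration under strong assumptions: the trace is made up, every instruction takes one cycle, and the pairing rule (the second instruction may not read or write the first one's destination) is a simplification, not the processor's actual pairing rules. All names are hypothetical.

```python
from typing import NamedTuple, Sequence, Tuple


class Instr(NamedTuple):
    dest: str                # register written by the instruction
    srcs: Tuple[str, ...]    # registers read by the instruction


def can_pair(first: Instr, second: Instr) -> bool:
    """Simplified rule: second must not read or write the destination of first."""
    return first.dest not in second.srcs and first.dest != second.dest


def cycles_single_issue(trace: Sequence[Instr]) -> int:
    """One instruction per cycle."""
    return len(trace)


def cycles_dual_issue(trace: Sequence[Instr]) -> int:
    """Issue two adjacent instructions per cycle whenever they can pair."""
    cycles, i = 0, 0
    while i < len(trace):
        paired = i + 1 < len(trace) and can_pair(trace[i], trace[i + 1])
        i += 2 if paired else 1
        cycles += 1
    return cycles


# Hypothetical five-instruction trace.
trace = [
    Instr("eax", ()),
    Instr("eax", ("eax", "ebx")),   # depends on the previous instruction
    Instr("ecx", ()),
    Instr("edx", ("ecx",)),         # depends on the previous instruction
    Instr("esi", ()),
]

print(cycles_single_issue(trace), cycles_dual_issue(trace))
```

Because both models consume the same trace, the difference in cycle counts is attributable to the machine organization alone, which is what makes this style of experiment useful for comparing design alternatives.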
VLSI design engineers evaluating features critical to the targeted area and frequency refined the design from the bottom up. On two occasions in the design the accumulation of changes from bottom-up refinement caused the need for substantial restructuring of the microprocessor's global chip plan, or "die diets." On those occasions, interdisciplinary teams of specialists collaborated to brainstorm and evaluate ideas that could satisfy the global or local design constraints. In one instance, we found it necessary to refine the set of instructions that could be executed in parallel. Constraints had been assigned to the area and speed of the decoder PLAs. The VLSI designers identified combinations of instruction formats that would feasibly decode in parallel, and the compiler writers determined the optimal selection.
In the end, the measured performance of the Pentium microprocessor in production systems is within 2 percent of that predicted before the design was completed.

The logic validation of the Pentium processor design presented a major challenge to the design team. A comprehensive test base from the validation of previous X86 microprocessors was available. However, the Pentium processor microarchitecture introduced several new fundamental techniques, such as superscalar, write-back cache, and floating-point algorithms, that required a more rigorous verification methodology.

We used different validation approaches in pre-silicon testing of the Pentium microprocessor:

1) Architecture verification looked at the "black box" functionality from the programmer's point of view. We designed comprehensive tests to cover all possible aspects of the programming model and all the Pentium processor user-visible features.
2) Design verification checked the internal functionality from the point of view of a logic designer who would understand the behavior of every internal signal. This testing approach is considered a "white box" technique, in which tests are written to exercise all the internal logic and verify its correct behavior.
3) Random instruction testing was a valuable tool to cover all those situations that are rarely covered by the more traditional, handwritten tests. Running finely tuned random tests let us verify correct functionality by comparing the results generated by a logic design description of the Pentium processor to the results generated by a software-emulated model.
4) A logic-design hardware model (QuickTurn) enabled increased testing coverage capacity by allowing a much larger software base to run on the processor model before the first silicon was available. We ported the logic model of the Pentium processor onto a QuickTurn setup, which was capable of handling the complete design, and tested major operating systems and application programs before finalizing the design.

In addition to the general validation approach, we dedicated a special effort to verify the new algorithms employed by the FPU. We developed a high-level software simulator to evaluate the intricacies of the specific add, multiply, and divide algorithms used in the design. This simulator then evolved into a testing environment, allowing the verification of the FPU logic design model independently from the rest of the Pentium processor. Also, the new algorithms used for the floating-point transcendental functions required an extensive test strategy that verified the accuracy and monotonicity of the results throughout the development process, comparing the results to a "super accurate" software model. Eventually, when the first silicon of the Pentium processor was available for testing, we used automatic testing techniques to assure the correctness of the transcendental instructions.

Compiler optimizations

The compiler technology developed with the Pentium microprocessor includes machine-independent optimizations common to current high-performance compilers, such as inlining, unrolling, and other loop transformations. In addition, we used techniques specifically developed for the x86 architecture and tuned them for the Pentium processor's microarchitecture.

The x86 architecture has certain characteristics that require specialized optimization techniques different from those for RISC architectures. The architecture supports a variety of instruction formats for equivalent operations. Consequently, it is critical to select instruction formats that are decoded most efficiently by the processor. The X86 register set includes only eight integer and eight floating-point registers. We have found that common global register allocation techniques that assign variables to registers for the entire scope of a procedure are ineffective with such a limited number of registers. Registers must be allocated within a narrower scope and together with instruction scheduling.

The compiler schedules instructions to minimize interlocks and to maximize parallel execution for the Pentium processor's superscalar pipelines. These techniques also benefit performance on the i486 CPU (though to a lesser extent) because the processors' pipeline organizations are similar. The instruction-scheduling techniques have minimal impact on performance for the i386 CPU since that processor uses little pipelining. As explained in the description of the floating-point pipeline, the compiler schedules FXCH instructions to avoid floating-point register-stack dependencies.

Figure 11. Pentium processor and i486 CPU performance for SPEC benchmarks.

THE PENTIUM MICROPROCESSOR employs superscalar integer pipelines, branch prediction, and a highly pipelined FPU to achieve the highest x86 performance levels available elsewhere while preserving binary compatibility with the x86 architecture. Figure 11 summarizes the performance of the Pentium microprocessor and the highest performance i486
CPU for the SPEC benchmarks in well-tuned systems. Figure 12 reproduces a photograph of the packaged circuit that integrates 3.1 million transistors.

Figure 12. Die photograph.

Acknowledgments

The individuals who made substantial contributions to the Pentium processor's design are too numerous to list here, so instead we acknowledge groups of contributors. The VLSI design team applied their creativity and determined effort throughout the project. The compiler team developed and implemented novel optimization techniques. Software engineers in several groups developed instruction-tracing and performance simulation tools. Hardware engineers and technicians instrumented measurement and tracing systems. Architects facilitated and integrated efforts of these other teams. The efforts in architecture, optimizing compiler, and performance simulation involved collaboration between teams in Santa Clara and Israel.

References

1. i486 Processor Programmer's Reference Manual, Intel Corporation, Santa Clara, Calif., 1990.
2. ANSI/IEEE Standard 754-1985 for Binary Floating-Point Arithmetic, IEEE Computer Society Press, Los Alamitos, Calif., 1985.
3. John H. Crawford, "The i486 CPU: Executing Instructions in One Clock Cycle," IEEE Micro, Vol. 10, No. 1, Feb. 1990, pp. 27-36.
4. Tejpal Chadha and Partha Srinivasan, "The Intel386 CPU Family Architecture & Performance Analysis," Digest of Papers Compcon Spring 1992, CS Press, Feb. 1992, pp. 332-337.
5. Robert F. Cmelik et al., "An Analysis of Mips and Sparc Instruction Set Utilization on the SPEC Benchmarks," Proc. ASPLOS-IV Conf., Computer Architecture News, Vol. 19, No. 2, Apr. 1991, pp. 290-302.

Donald Alpert is an architecture manager in Intel Corporation's Microprocessor Division. He holds responsibility for managing the architecture team that developed specifications and modeling and evaluating performance of the Pentium processor. Previously, he held various microprocessor development positions at National Semiconductor Corporation and Zilog.

Alpert received a BS degree from MIT and MS and PhD degrees from Stanford University, all in electrical engineering. He is a member of the IEEE Computer Society and the Association of Computing Machinery.

Dror Avnon is design manager of the floating-point unit of the Pentium processor. He holds responsibility for the microarchitecture, design, performance analysis, and verification for the FPU logic and microcode. He previously held design engineering positions at National Semiconductor Corporation, Computer Consoles, and Elscint.

Avnon received a BSc degree in electronic engineering from Technion-Israel Institute of Technology in Haifa. He is a member of the IEEE Computer Society.

Direct questions to Dror Avnon, Intel Corporation, M/S RN2-27, 2200 Mission College Blvd., Santa Clara, CA 95052; davnon@mipos2.intel.com.