
Advanced Microprocessors

UNIT -II

Hardware details of the Pentium


CPU pin descriptions :
The original Pentium (60 MHz and 66 MHz) is packaged in a 273-pin PGA (Pin Grid Array) and uses a 5 V power supply.
Newer Pentiums have faster clock speeds, a 296-pin PGA package, and a 3.3 V power supply.

Pentium Processor Pin details

1. A20M# (Address 20 mask) - input pin
Forces the Pentium to limit addressable memory to 1 MB.
Only active in real mode; undefined in protected mode.
2. A3-A31 (Address lines) - bidirectional pins
These 29 address lines, together with the byte enable outputs, form the Pentium's 32-bit address bus (4 GB memory space).
3. BE0# - BE7# (Byte enable) - output pins
The byte enable pins are used to determine which bytes must be
written to external memory, or which bytes were requested by the
CPU for the current cycle.
These signals are generated internally by the processor from
address lines A0, A1 and A2.
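The byte-lane selection described above can be sketched in a few lines of Python. This is an illustrative model only (the helper name and the no-boundary-crossing assumption are mine, not from the datasheet):

```python
def byte_enables(addr, size):
    """Active-low BE7#..BE0# pattern (bit i = BEi#) for a `size`-byte transfer.

    A2-A0 of the address select the starting byte lane of the 64-bit data
    bus; BEi# is driven to 0 (asserted) for every lane the transfer covers.
    Illustrative sketch only -- transfers that would cross the 8-byte
    boundary are actually split into multiple bus cycles.
    """
    start = addr & 0b111                    # byte lane from A2-A0
    assert start + size <= 8, "must not cross the 8-byte boundary"
    enables = 0xFF                          # all BEs deasserted (high)
    for lane in range(start, start + size):
        enables &= ~(1 << lane)             # assert BE# for this lane
    return enables

# Word write at address ...010b: lanes 2 and 3 are enabled (driven low).
print(f"{byte_enables(0x1002, 2):08b}")     # 11110011
```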

4. ADS# (Address status) - output pin
The address status output indicates that a new valid bus cycle is currently being driven by the Pentium processor.

5. AHOLD (Address hold) - input pin
Used to place the Pentium's address bus into a high-impedance state.

AP (Address parity) - bidirectional pin
Indicates the even parity of address lines A5-A31.

APCHK# (Address parity check) - output pin
Asserted when a parity error is detected on the address bus during inquire cycles.
External circuitry is responsible for taking the appropriate action if a parity error is encountered.
APICEN (Advanced Programmable Interrupt Controller Enable) - input pin
Enables or disables the on-chip APIC interrupt controller.
BF[1:0] (Bus Frequency) - input pins
Determine the bus-to-core frequency ratio; BF[1:0] are sampled at RESET.

BOFF# (Back off) - input pin
This input causes the processor to terminate any bus cycle currently in progress and tri-state its buses.
It has the highest priority of the bus-hold inputs.

D63-D0 (Data lines) - bidirectional pins
Lines D7-D0 form the least significant byte of the data bus; lines D63-D56 form the most significant byte.

DP7-DP0 (Data parity) - bidirectional pins
Indicate the even parity of each data byte on the data bus; DP7 applies to D63-D56, DP0 to D7-D0.
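Even parity, as used on the DP pins, means the parity bit is chosen so that the data byte plus its parity bit contain an even number of 1s. A one-line sketch (the function name is illustrative):

```python
def even_parity_bit(byte):
    """Parity bit that makes the total number of 1s (data + parity) even."""
    return bin(byte & 0xFF).count("1") & 1

print(even_parity_bit(0b10110000))  # three 1s in the data -> parity bit 1
print(even_parity_bit(0xFF))        # eight 1s -> parity bit 0
```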

HOLD (Hold bus) - input pin
The processor completes the current bus cycle, tri-states its bus signals, and activates HLDA.
HLDA (Hold acknowledge) - output pin
Indicates that the Pentium has been placed in a hold state.

Bus Operations
Types of bus cycles:
Single transfer cycle
Burst transfer cycle
Interrupt ack cycle
Inquire cycle etc.
Some of the signals are used to indicate the type of bus cycle.
M/IO# (Memory / input-output) - output pin
High indicates a memory cycle; low indicates an I/O operation.

D/C# (Data / code) - output pin
Indicates whether the current bus cycle is accessing data (high) or code (low).

W/R# (Write / read) - output pin
Indicates whether the current bus cycle is a write operation (high) or a read operation (low).
CACHE# (Cacheability) - output pin
Indicates whether the data associated with the current bus cycle is being read from or written to the internal cache.
All burst reads are cacheable, and all cacheable read cycles are burst cycles.
KEN# - Cache enable - input pin
The cache enable input is used to determine if the current cycle
is cacheable.

Bus State Definition


Ti: This is the bus idle state.
In this state, no bus cycles are being run.
The processor may or may not be driving the address and status
pins, depending on the state of the HLDA, AHOLD, and BOFF#
inputs.
An asserted BOFF# or RESET always forces the state machine
back to this state.
HLDA is only driven in this state.
T1: This is the first clock of a bus cycle.
Valid address and status are driven out and ADS# is asserted.
There is one outstanding bus cycle.
T2: This is the second and subsequent clock of the first outstanding bus
cycle.

In state T2, data is driven out (if the cycle is a write) or data is expected (if the cycle is a read).
The BRDY# pin is sampled.
There is one outstanding bus cycle.
BRDY# (Burst ready) - input pin
In a read cycle, it indicates that data is available on the data bus; in a write cycle, it informs the processor that the output data has been accepted.

Single-Transfer Cycle

Burst Cycles
Cache uses burst cycles.
A new 8 byte chunk can be transferred every clock cycle.
The processor supplies the starting address of the first group of 8 bytes at
the beginning of the cycle.
The next groups of 8 bytes are transferred according to the burst order.
Burst transfer order:
The external memory system must generate the remaining three addresses itself and supply the data in the correct order.
The address and byte enables are asserted only for the first transfer and are not driven for each subsequent transfer.
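The burst order can be computed rather than tabulated: each later chunk address is the first chunk's address XORed with 8, 16, and 24. A sketch of this formulation (assuming a 32-byte line moved as four 8-byte chunks; the function name is mine):

```python
def burst_order(start):
    """Return the four 8-byte-aligned addresses of a Pentium burst.

    The processor supplies only the first address; the memory system must
    produce the rest. Successive chunks are the first chunk XOR 8, 16, 24.
    """
    base = start & ~0x1F            # base of the 32-byte line
    first = start & 0x18            # offset of the first 8-byte chunk
    return [base | (first ^ (i * 8)) for i in range(4)]

print([hex(a) for a in burst_order(0x08)])   # ['0x8', '0x0', '0x18', '0x10']
print([hex(a) for a in burst_order(0x10)])   # ['0x10', '0x18', '0x0', '0x8']
```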

T12: This state indicates there are two outstanding bus cycles.
The processor is starting the second bus cycle at the same time
that data is being transferred for the first.
In T12, the processor samples BRDY# for the first cycle and drives ADS# for the second cycle.
T2P: This state indicates there are two outstanding bus cycles, both in their second and subsequent clocks.
It performs the same data-transfer work as T12.
TD : Dead state
This state is used to insert a dead state between two consecutive
cycles (read followed by write or vice versa) in order to give
the system bus time to change states.

BREQ (Bus request) - output pin
Tells the external system that the Pentium has internally generated a bus request.
This happens even if the Pentium is not driving its bus at the moment.

NA# (Next address) - input pin
Indicates that the external memory system is ready to accept a new bus cycle even though all data transfers for the current cycle have not yet completed.
The processor issues ADS# for a pending cycle two clocks after NA# is asserted.
The Pentium supports up to two outstanding bus cycles.

Flow - functional description

No request pending: the processor remains in the idle state.
When a request becomes pending, the processor starts a new bus cycle; ADS# is asserted in the T1 state.
T1 is followed by T2, the second clock of the current bus cycle.
The processor stays in T2 until the transfer is over (BRDY#) if no new request becomes pending or if NA# is not asserted.
If a new request is pending when the current cycle is complete, and NA# was sampled asserted, the processor begins again from T1.
If no cycle is pending when the processor finishes the current cycle, or NA# is not asserted, the processor goes back to the idle state.
While processing the current cycle (one outstanding cycle), if NA# is asserted the processor moves to T12, indicating that it now has two outstanding cycles; ADS# is asserted for the second cycle.
When the processor finishes the current cycle and no dead clock is needed, it goes to the T2 state.
When the processor finishes the current cycle and a dead clock is needed, it goes to the TD state.
If the current cycle is not completed, the processor always moves to T2P to process the data transfer.
The processor stays in T2P until the first cycle's transfer is over.
When the processor finishes the first cycle and no dead clock is needed, it goes to the T2 state.
When the first cycle is complete and a dead clock is needed, it goes to the TD state.
If NA# was sampled and a new request is pending, the processor goes to the T12 state.
If NA# was not asserted and no new request is pending, it goes to the T2 state.

Processor control Instructions


LOCK - s/w instruction - locks the bus during the next instruction.
Executing LOCK causes the LOCK# output to go low.
LOCK is used as a prefix to another instruction.
LOCK# - h/w pin (Bus lock) - output pin
Indicates that the current bus cycle is locked and may not be interrupted by another bus master.
Locked operation:
An atomic operation cannot be broken down into smaller sub-operations.
Semaphore - a special type of counter variable that must be read, updated, and stored in one single uninterruptible operation.
This requires a read cycle followed by a write cycle.
The XCHG instruction automatically locks the bus when one of its operands is a memory operand.
If AHOLD or HOLD is activated in the middle of a locked operation, the locked operation is not affected.
It is affected, however, when the BOFF# signal is asserted.
Interrupt acknowledge cycle
INTR (Interrupt request) - input pin
When high, it causes the Pentium to initiate interrupt processing: it reads an 8-bit vector number and selects the ISR.
The processor runs two interrupt acknowledge cycles in response to an INTR request.
Both cycles are locked.
First cycle - D0-D7 is ignored by the processor.
Second cycle - the vector number on D0-D7 is accepted by the processor.
The byte enable outputs are used to indicate the cycles:
BE4# low and all other BEs high - first cycle.
BE0# low and all other BEs high - second cycle.

Shutdown:
If the Pentium detects an internal parity error, it runs the shutdown cycle.
Execution is suspended in shutdown until the processor receives an NMI, INIT, or RESET request.
The cache is unchanged.

RESET (Processor reset) - input pin
Forces the Pentium processor to begin execution at a known state.
The internal caches are invalidated upon RESET.
The processor fetches its first instruction from address FFFFFFF0H.
INIT - initialization - input pin
Forces the Pentium processor to begin execution in a known state.
The processor state after INIT is the same as the state after RESET
except that the internal caches, write buffers, and floating point
registers retain the values they had prior to INIT.

NMI (Non-maskable interrupt) - input pin
This request signal indicates that an external non-maskable interrupt has been generated.
No external interrupt acknowledge cycles are generated.
HALT cycles
A HALT cycle is run when the HALT instruction is executed.
The INTR signal may also be used to resume execution.
WB/WT# - (writeback/writethrough) - input pin
allows a data cache line to be defined as writeback (1) or writethrough
(0) on a line-by-line basis.
Writeback: results are written only to the cache.
Writethrough: results are written to both the cache and main memory.

Instruction and Data caches


Cache is a small high-speed memory that stores data from some frequently used addresses (of main memory).

Cache hit - the data is found in the cache; this results in a data transfer at maximum speed.
Cache miss - the data is not found in the cache; the processor loads the data from memory and copies it into the cache. The extra delay this causes is called the miss penalty.

Hit ratio = percentage of memory accesses satisfied by the cache.
Miss ratio = 1 - hit ratio.

Average memory access time =
Hit ratio * Tcache + (1 - Hit ratio) * (Tcache + TRAM)

Example: RAM access time = 70 ns, cache access time = 10 ns, hit ratio = 0.85. Assume there is no external cache.
Tavg = 0.85 * 10 + (1 - 0.85) * (10 + 70) = 20.5 ns
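The calculation above can be checked with a couple of lines of Python (the function name is mine):

```python
def avg_access_time(hit_ratio, t_cache, t_ram):
    """Hits cost t_cache; misses cost t_cache plus the main-memory access."""
    return hit_ratio * t_cache + (1 - hit_ratio) * (t_cache + t_ram)

print(avg_access_time(0.85, 10, 70))   # ~20.5 ns
```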

Cache line: the cache is partitioned into lines (also called blocks). During a data transfer, a whole line is read or written.
Each line has a tag that indicates the address in memory from which the line has been copied.

Types of Cache
1. Fully Associative
2. Direct Mapped
3. Set Associative
Sequential Access :
Start at the beginning and read through in order
Access time depends on location of data and previous location
Example: tape
Direct Access :
Individual blocks have unique address
Access is by jumping to vicinity then performing a sequential search
Access time depends on location of data within "block" and previous
location
Example: hard disk

Random access:
Each location has a unique address
Access time is independent of location or previous access
e.g. RAM
Associative access :
Data is retrieved based on a portion of its contents rather than its
address
Access time is independent of location or previous access
e.g. cache

Performance
Transfer Rate : Rate at which data can be moved

For random-access memory, equal to 1/(cycle time)


For non-random-access memory, the following relationship holds:

TN = TA + N/R
where
TN = Average time to read or write N bits
TA = Average access time
N = Number of bits
R = Transfer rate, in bits per second(bps)

Fully Associative Cache

Allows any line in main memory to be stored at any location in the cache.
Main memory and the cache are both divided into lines of equal size.
There is no restriction on mapping from memory to cache.
It requires a large number of comparators to check all the addresses; the associative search of tags is expensive.
This is feasible only for very small caches (less than 4 K).
Some special-purpose caches, such as the virtual-memory Translation Lookaside Buffer (TLB), are associative caches.
Associative mapping works best, but is complex to implement.

Direct-Mapped Cache
A one-way set associative cache.
Memory is divided into cache pages; the page size and the cache size are equal.
Line 0 of any page maps to line 0 of the cache: each memory line maps directly into an equivalent cache line.
Direct mapping has the lowest performance but is the easiest to implement; it is less flexible, and is often used for the instruction cache.

Set-Associative Cache
A compromise between the other two schemes: the more ways, the better the performance, but the more complex and expensive the cache.
It combines the fully associative and direct-mapped caching schemes.
The cache is divided into equal sections called cache ways.
The page size is equal to the size of a cache way, and each cache way is treated like a small direct-mapped cache.

Design of cache organization

Cache size: 4 KB; line size: 32 bytes; physical address: 32 bits.

Fully associative cache: the 32-bit physical address is divided into two fields.
n = cache size / line size = number of lines
b = log2(line size) = bits for the offset
remaining upper bits = tag address bits

Consider a fully associative mapping scheme with a 27-bit tag and a 5-bit offset.
Address: 01111101011101110001101100111000
Compare all tag fields with the value 011111010111011100011011001.
If a match is found, return byte 11000 (24 decimal) of the line.
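The tag/offset split above can be reproduced in a few lines (the helper name is mine):

```python
def split_fully_associative(addr, offset_bits=5):
    """Split a 32-bit address into tag and byte offset (no index field)."""
    tag = addr >> offset_bits
    offset = addr & ((1 << offset_bits) - 1)
    return tag, offset

addr = 0b01111101011101110001101100111000
tag, offset = split_fully_associative(addr)
print(f"tag={tag:027b} offset={offset:05b} (byte {offset})")
```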

Direct Cache Addressing

n = cache size / line size = number of lines
b = log2(line size) = bits for the offset
log2(number of lines) = bits for the cache index
remaining upper bits = tag address bits

Direct mapping scheme with a 20-bit tag, a 7-bit index, and a 5-bit offset.
Address: 01111101011101110001101100111000
Compare the tag field of line 1011001 (89 decimal) with the value 01111101011101110001.
If it matches, return byte 11000 (24 decimal) of the line.
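The same worked example in Python, confirming that the address selects line 89 and byte 24 (the helper name is mine):

```python
def split_direct_mapped(addr, offset_bits=5, index_bits=7):
    """Split a 32-bit address into tag, cache line index, and byte offset."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

tag, index, offset = split_direct_mapped(0b01111101011101110001101100111000)
print(index, offset)       # 89 24
print(f"tag={tag:020b}")
```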

Set Associative Mapping

n = cache size / line size = number of lines
b = log2(line size) = bits for the offset
w = number of lines per set (ways)
s = n / w = number of sets
log2(number of sets) = bits for the cache index
remaining upper bits = tag address bits

Two-way set-associative mapping with a 21-bit tag, a 6-bit index, and a 5-bit offset.
Address: 01111101011101110001101100111000
Compare the tag fields of lines 0110010 and 0110011 (set 011001, i.e. set 25) with the value 011111010111011100011.
If a match is found, return byte 11000 (24 decimal) of that line.
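For the two-way case, the index selects a set rather than a single line; the set's candidate lines (numbered as in the direct-mapped case) can be computed as follows (a sketch; the helper name is mine):

```python
def split_set_associative(addr, offset_bits=5, set_bits=6, ways=2):
    """Split an address into tag, set index, offset; list the set's lines."""
    offset = addr & ((1 << offset_bits) - 1)
    set_index = (addr >> offset_bits) & ((1 << set_bits) - 1)
    tag = addr >> (offset_bits + set_bits)
    lines = [set_index * ways + w for w in range(ways)]  # candidate lines
    return tag, set_index, offset, lines

tag, s, off, lines = split_set_associative(0b01111101011101110001101100111000)
print(s, lines, off)       # 25 [50, 51] 24
```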

Instruction & Data Cache of Pentium

Both caches are organized as 2-way set associative caches.
Cache size: 8 KB; line size: 32 bytes; physical address: 32 bits.
128 sets, 256 entries in total.
Each entry in a set has its own tag.

Data Cache of Pentium

Tags in the data cache are triple ported, so they can be accessed from three different places at the same time:
the U pipeline, the V pipeline, and bus snooping.
Each entry in the data cache can be configured as write-through or write-back.
Parity bits are used to maintain data integrity: each tag and every byte in the data cache has its own parity bit.

Instruction Cache of Pentium

The instruction cache is write protected to prevent self-modifying code.
Tags in the instruction cache are also triple ported: two ports for split-line accesses and a third port for bus snooping.
In the Pentium (since it is a CISC processor), instructions are of variable length (1-15 bytes).
Multibyte instructions may straddle two sequential lines stored in the code cache.
Fetching such an instruction would then take two sequential accesses, which degrades performance.
Solution: split-line access.

Split-line Access
It permits the upper half of one line and the lower half of the next to be fetched from the code cache in one clock cycle.
When a split line is read, the information is not correctly aligned.
The bytes must be rotated so that the prefetch queue receives the instructions in the proper order.
Instruction boundaries within the cache line need to be defined.
There is one parity bit for every 8 bytes of data in the instruction cache.


Multiprocessor System
When multiple processors are used in a single system, there needs to be a
mechanism whereby all processors agree on the contents of shared cache
information.
For example, two or more processors may utilize data from the same memory location, X.
Each processor may change the value of X; which value of X should then be considered?
If each processor changes the value of the data item, we have different (incoherent) values of X in each cache.

Solution : Cache Coherency Mechanism

A multiprocessor system with incoherent cache data

Clean data: the data in the cache and the data in main memory are the same.
Dirty data: the data has been modified within the cache but not in main memory.
Stale data: the data has been modified within main memory but not in the cache.
Out-of-date main memory data: the data has been modified within the cache but not in main memory, so the copy in main memory is out of date.

Cache Coherency
The Pentium's mechanism is called the MESI (Modified/Exclusive/Shared/Invalid) protocol.
This protocol uses two bits stored with each line of data to keep track of the
state of cache line.
The four states are defined as follows:
Modified:
The current line has been modified (does not match with main memory)
and is only available in a single cache.
Exclusive:
The current line has not been modified (matches with main memory)
and is only available in a single cache.
Writing to this line changes its state to modified

Shared:
Copies of the current line may exist in more than one cache.
A write to this line causes a write-through to main memory and may
invalidate the copies in the other caches.
Invalid:
The current line is empty.
A read from this line will generate a miss.
Only the shared and invalid states are used in code cache.
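The four states and the transitions described above can be sketched as a lookup table. The event names are mine, and this is an illustrative subset only, not the full Pentium transition table:

```python
def next_state(state, event):
    """Simplified MESI transitions for the cases described in the text."""
    table = {
        ("I", "read_fill_exclusive"): "E",  # miss filled, no other copies
        ("I", "read_fill_shared"):    "S",  # miss filled, copies elsewhere
        ("E", "write_hit"):           "M",  # modify without a bus cycle
        ("S", "write_hit"):           "S",  # write-through; others may invalidate
        ("M", "write_hit"):           "M",
        ("M", "snoop_invalidate"):    "I",  # line written back, then invalidated
        ("E", "snoop_invalidate"):    "I",
        ("S", "snoop_invalidate"):    "I",
    }
    return table[(state, event)]

print(next_state("E", "write_hit"))          # M
print(next_state("I", "read_fill_shared"))   # S
```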

The MESI protocol requires the Pentium to monitor all accesses to main memory in a multiprocessor system; this is called bus snooping.
Bus snooping is used to maintain consistent data in a multiprocessor system where each processor has a separate cache.

Consider the above example.
If processor 3 writes its local copy of X (30) back to memory, the memory write cycle will be detected by the other three processors.
Each processor will then run an internal inquire cycle to determine whether its data cache contains the address of X.
Processors 1 and 2 then update their caches based on their individual MESI states.
The Pentium's address lines are used as inputs during an inquire cycle to accomplish bus snooping.

Coherence vs. consistency

Cache coherence protocols guarantee that eventually all copies are updated.
Depending on how and when these updates are performed, a read operation may sometimes return unexpected values.
Consistency deals with what values can be returned to the user by a read operation (it may return unexpected values if an update is not complete).

Cache Coherency Protocol Implementations


Snooping
used with low-end, bus-based MPs
few processors
centralized memory
Directory-based
used with higher-end MPs
more processors
distributed memory

When we write, should we write to the cache or to memory?
Write-through cache: write to both the cache and main memory; the cache and memory are always consistent.
Write-back cache: write only to the cache and set a dirty bit; when the block gets replaced from the cache, write it out to memory.

Snoop: when a cache watches the address lines for transactions, this is called a snoop. This function allows the cache to see whether any transactions access memory it contains within itself.
Snarf: when a cache takes information from the data lines, the cache is said to have snarfed the data. This function allows the cache to be updated and to maintain consistency.

Cache consistency cycles


Inquire cycle
EADS# - (External address strobe) - input pin
This signal indicates that a valid external address has been driven
onto the Pentium processor address pins to be used for an inquire
cycle.
HIT# (Inquire cycle hit/miss) - output pin
The hit indication is driven to reflect the outcome of an inquire cycle.
If an inquire cycle hits a valid line in either the data or the instruction cache, this pin is asserted two clocks after EADS#.
If the inquire cycle misses the cache, this pin is negated two clocks after EADS#.
This pin changes its value only as a result of an inquire cycle and retains its value between cycles.

HITM# (Hit to a modified cache line) - output pin
The hit-to-a-modified-line output is driven to reflect the outcome of an inquire cycle.
It is asserted after inquire cycles which resulted in a hit to a modified line in the data cache.
INV (Invalidation) - input pin
Determines the final cache line state (S or I) in case of an inquire cycle hit.
It is sampled together with the address for the inquire cycle in the clock in which EADS# is sampled active.
High - the cache line is invalidated.
Low - the cache line is marked shared.
On a miss, INV has no effect.
On a hit to a modified line, the line will be written back regardless of the state of INV.

LRU Algorithm
One or more bits are added to each cache entry to support the LRU algorithm: one LRU bit and two valid bits for the two lines.
If an invalid line (of the two) is found, it is replaced with the newly referenced data.
If all the lines are valid, the LRU line is replaced by the new one.
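The replacement rule above, sketched for a 2-way set (the LRU-bit encoding, where the bit value is the index of the least recently used way, is an assumption):

```python
def choose_victim(valid, lru_way):
    """Pick the way to replace: any invalid way first, else the LRU way."""
    for way, v in enumerate(valid):
        if not v:
            return way          # an invalid line is replaced first
    return lru_way              # all valid: replace the least recently used

print(choose_victim([True, False], 0))  # 1 (way 1 is invalid)
print(choose_victim([True, True], 1))   # 1 (both valid: replace the LRU way)
```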

Four way set associative - LRU algorithm

FLUSH# (Flush cycle) - input pin
The cache flush input forces the Pentium processor to write back all modified lines in the data cache and invalidate its internal caches.
A flush acknowledge special cycle is generated by the Pentium processor, indicating completion of the write-back and invalidation.
The byte enables indicate the type of special cycle; for flush acknowledge, BE4# is low and all other BEs are high:
BE7 BE6 BE5 BE4 BE3 BE2 BE1 BE0
 1   1   1   0   1   1   1   1
Cache instructions:
INVD - invalidate cache
Effectively erases all the information in the data cache (by marking it all invalid).
WBINVD - write back and invalidate cache
A write-back special cycle is driven after the WBINVD instruction is executed; BE3# is low:
BE7 BE6 BE5 BE4 BE3 BE2 BE1 BE0
 1   1   1   1   0   1   1   1

The INVD instruction should be used with care: it does not write back modified cache lines.
A flush special cycle is driven after the INVD and WBINVD instructions are executed; BE1# is low:
BE7 BE6 BE5 BE4 BE3 BE2 BE1 BE0
 1   1   1   1   1   1   0   1
For WBINVD, the write-back special cycle is generated, followed by the flush cycle.

Superscalar Architecture

Processors capable of executing multiple instructions in parallel are known as superscalar machines.
Parallel execution is possible through the U and V pipelines of the Pentium.
Four restrictions are placed on a pair of integer instructions attempting parallel execution:
1. Both must be simple instructions (MOV, INC, DEC).
2. No data dependencies may exist between them: no read-after-write dependency, and the two instructions must not write to the same operand.
3. Neither instruction may contain both immediate data and a displacement value, e.g. MOV table[SI], 7.
4. A prefixed instruction may only execute in the U pipeline, e.g. MOV ES:[DI], AL.
For a floating-point pair, the first instruction must be one of FADD, FSUB, FMUL, FDIV, or FCOM, and the second instruction must be FXCH.
The compiler plays an important role in the ordering of instructions during code generation.

Pipeline and Instruction Flow

The Pentium has a 5-stage integer pipeline:
PF : Prefetch
D1 : Instruction decode
D2 : Address generation
EX : Execute - ALU and cache access
WB : Write back

Instruction flow through the pipeline (each instruction advances one stage per clock):

clock:  1    2    3    4    5    6    7    8
PF      I1   I2   I3   I4
D1           I1   I2   I3   I4
D2                I1   I2   I3   I4
EX                     I1   I2   I3   I4
WB                          I1   I2   I3   I4

The U pipeline can execute any processor instruction (including the initial stages of the floating-point instructions); the V pipeline executes only simple instructions.

Instructions are fed into the PF stage from the cache or memory.
D1 stage - determines whether the current pair of instructions can execute together.
D2 stage - addresses of operands that reside in memory are calculated.
EX stage - operands are read from the data cache or memory; ALU operations are performed; branch predictions are verified (except for conditional branches).
WB stage - the results of completed instructions are written back, and conditional branch predictions are verified.
When paired instructions reach the EX stage, it is possible that one or the other will stall and require additional cycles to execute.

Stall - no work is done; pipeline stalls lower performance.
If U stalls, V continues executing; if V stalls, U continues executing.
Both instructions must progress to the WB stage before another pair may enter the EX stage.

Branch Prediction
Branch prediction strategies:
Static
The action taken for a branch is fixed for that branch during the entire execution; it is decided before runtime (at compile time), based on the object code.
Dynamic
The decision causing the branch prediction can change dynamically during program execution; it is based on the execution history, and prediction decisions may change while the program runs.

BHT: Branch History Table

Each BHT entry holds a 2-bit prediction counter, initialised when the branch is first taken. Counter values 00 and 01 predict sequential execution (not taken); values 10 and 11 predict branch (taken).

State transition diagram of the most frequently used 2-bit dynamic prediction (Smith algorithm), where AT = the branch was actually taken and ANT = actually not taken:

11 (strongly taken)     - prediction "taken";     AT -> 11, ANT -> 10
10 (weakly taken)       - prediction "taken";     AT -> 11, ANT -> 01
01 (weakly not taken)   - prediction "not taken"; AT -> 10, ANT -> 00
00 (strongly not taken) - prediction "not taken"; AT -> 01, ANT -> 00

The prediction will be either taken or not taken.
If the prediction turns out to be true, the pipeline is not flushed, and no clock cycles are lost.
If the prediction turns out to be false, the pipeline is flushed and restarted with the correct instruction.
It is best if the predictions are true most of the time.
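The 2-bit scheme can be simulated directly; here counter states 3 down to 0 stand for strongly taken down to strongly not taken (the function name is mine):

```python
def predict_and_update(counter, taken):
    """Smith 2-bit saturating counter.

    Counters 3 and 2 predict taken; 1 and 0 predict not taken.
    An actually-taken branch increments, actually-not-taken decrements.
    """
    prediction = counter >= 2
    counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return prediction, counter

# A strongly-taken branch must mispredict twice before the prediction flips.
c = 3
for outcome in (False, False):
    p, c = predict_and_update(c, outcome)
    print(p, c)   # True 2  then  True 1
```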

Branch target buffer: a four-way set associative cache with 256 entries in 64 sets.
Whenever a branch is taken, the CPU enters the destination (target) address in the BTB.
The BTB stores two history bits that record the execution history of each branch instruction.
Two 32-byte prefetch buffers work with the BTB and the D1 stage of the U and V pipelines to keep a steady stream of instructions flowing into the pipelines.
One buffer prefetches instructions from the current program address; the other buffer, activated when the BTB predicts "taken", prefetches instructions from the target address.

Functional Block Diagram of Pentium

Floating point unit

Coprocessor family:
8086    - 8087
80286   - 80287
80386   - 80387
80486   - internal FPU (not pipelined)
Pentium - internal FPU (pipelined)

Floating point format (IEEE 754): a sign bit, an exponent field, and a mantissa field.

The floating-point instructions are those executed by the processor's floating-point unit (FPU).
These instructions operate on floating-point (real), extended integer, and binary-coded decimal (BCD) operands.

The term floating point is derived from the fact that there is no fixed
number of digits before and after the decimal point; that is, the decimal
point can float.
There are also representations in which the number of digits before and
after the decimal point is set, called fixed-point representations.
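The sign, exponent, and mantissa fields can be pulled out of a single-precision value with the standard library (the helper name is mine):

```python
import struct

def unpack_float32(x):
    """Split an IEEE 754 single-precision value into its three fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF        # 23 fraction bits
    return sign, exponent, mantissa

print(unpack_float32(-1.5))   # (1, 127, 4194304): -1.1 binary x 2^0
```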

The FPU pipeline has eight stages:
PF - prefetch
D1 - instruction decode
D2 - address generation
EX - memory and register read; FP data converted into memory format; memory write
X1 - FP execute stage one; memory data converted into FP format; operand written to the FP register file; Bypass 1 sends data back to the EX stage
X2 - FP execute stage two
WF - round the FP result and write it to the FP register file; Bypass 2 sends data back to the EX stage
ER - error reporting; update of the status word

Bypass 1 applies to instructions such as FLD ST and FMUL ST.
Bypass 2: the result of an arithmetic instruction in the WF stage is made available to the next instruction fetching operands in the EX stage.
The first instruction of an FP pair must be one of FADD, FSUB, FMUL, FDIV, or FCOM; the second instruction must be FXCH.
The first instruction uses the U pipeline, which makes up the first five stages of the FPU pipeline; the second uses the V pipeline.

There are eight 80-bit floating point registers, ST(0) through ST(7).
