
Design principles:

Simplicity favors regularity
Smaller is faster
Good design demands good compromises
Make the common case fast

Single Cycle implementation:

The clock cycle time is set by the slowest instruction (load)
Control can't change within the same clock cycle
Two memories are used, one for instructions and one for data, because of the previous point: a single memory can't serve both the fetch and the data access in the same cycle
Clock signal: indicates when an action is to be performed
Control signal: indicates whether an action is to be performed

Five phases

IF (instruction fetch)
ID(instruction decode)
EXE
MEM
WB(write back)

1st phase:

instruction fetch: the input of this phase is either PC+4 or another
address coming from a previous jump/branch, therefore there is a MUX
which takes the decision
to calculate PC+4 we need an adder whose inputs are the previous
address and a wire holding the constant 4
therefore this phase only fetches the instruction at that address
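The PC selection in this phase can be sketched in Python (a hypothetical `next_pc` helper for illustration only; the real datapath does this with a MUX and an adder):

```python
def next_pc(pc, take_branch=False, branch_target=None):
    """The IF-phase MUX: choose between PC+4 and a branch/jump target.

    branch_target is the address computed earlier by a branch or jump.
    """
    if take_branch:
        return branch_target
    return pc + 4  # the adder: current PC plus the constant 4

assert next_pc(0x1000) == 0x1004                 # ordinary sequential fetch
assert next_pc(0x1000, True, 0x2000) == 0x2000   # a taken branch wins the MUX
```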

2nd phase:

it only decodes the instruction that was previously fetched (in the IF
phase) and determines its format (R, I, or J)
two control lines: MemRead and MemWrite
if the instruction is I-type, the output of the ID phase also goes to a
sign extend unit (16 bit ---> 32 bit), explained later

3rd phase:

this phase executes (in the ALU) arithmetic and logical operations

ALU control line (signal): selects which operation should be executed
The first adder (the ALU) executes the arithmetic/logical operations;
the memory instructions also use the ALU, with inputs coming from two
registers, to do address calculations
The second adder's inputs are the sign-extended immediate (shifted
left by 2) and PC+4; its output is the branch target (for branching and
jump instructions)
For branches, the first adder does the comparison (equal or not equal)
while the second adder computes where execution will go
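The second adder's calculation can be sketched as follows (a Python illustration; `branch_target` is a made-up helper name, and the 16-bit immediate is treated as a signed word offset, as in MIPS branch encoding):

```python
def branch_target(pc, imm16):
    """Second adder: target = (PC + 4) + (sign-extended offset << 2)."""
    # sign-extend the 16-bit immediate into a plain Python integer
    if imm16 & 0x8000:
        imm16 -= 0x10000
    return (pc + 4) + (imm16 << 2)  # shift left by 2 = multiply by word size

# a branch at 0x1000 with offset 3 lands 3 words past PC+4
assert branch_target(0x1000, 3) == 0x1010
# a negative offset (0xFFFF = -1) branches backwards
assert branch_target(0x1000, 0xFFFF) == 0x1000
```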

4th phase:

MEM: this phase is used to access a target location in memory

5th phase:

WB: to write a result back into the register file

Sign extend:

It is used for immediate numbers because the MIPS R2000 datapath reads
32 bits while the immediate number is only 16 bits, so we need to
extend it to 32 bits
1st case: if unsigned, we fill the leftmost 16 bits with zeros
2nd case: if signed, we fill the leftmost 16 bits with copies of the
leftmost (sign) bit of the immediate number
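The two cases above can be sketched in Python (a hypothetical `extend_16_to_32` helper, shown only to make the bit manipulation concrete):

```python
def extend_16_to_32(imm16, signed=True):
    """Extend a 16-bit immediate to 32 bits.

    Zero-extension fills the upper 16 bits with zeros; sign-extension
    replicates the immediate's top bit (bit 15) into the upper 16 bits.
    """
    imm16 &= 0xFFFF                      # keep only the low 16 bits
    if signed and (imm16 & 0x8000):      # top bit set -> negative value
        return 0xFFFF0000 | imm16
    return imm16

# -1 as a 16-bit pattern is 0xFFFF; sign-extended it stays all ones
assert extend_16_to_32(0xFFFF, signed=True) == 0xFFFFFFFF
# zero-extended, the same pattern becomes 0x0000FFFF (65535)
assert extend_16_to_32(0xFFFF, signed=False) == 0x0000FFFF
```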

Disadvantage of single cycle implementation:

It slows down the processor, as the cycle time is equal to the slowest
instruction
Memory redundancy, as we use 2 memories

Multi cycle implementation:

Cycle time = slowest stage (MEM)

We have latches between stages so the calculated data can be passed
from one stage to another
Control can change between stages; in single cycle we can only move to
the next instruction after the current one is done (after a full cycle
time)

The advantages of multi cycle over single cycle:

Faster processor, as not all instructions need 5 stages
Clock cycle is shorter
One memory, therefore no HW redundancy (can require less HW)
We can use pipelining
Allows different instructions to be executed in different numbers of
cycles

Pipelining: (instruction-level parallelism) means all 5 stages can be in
use at once, except at the beginning and end of the program; therefore
in the best case we only lose the pipeline fill and drain cycles
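The best-case gain can be checked with a quick calculation (a Python sketch; the 200 ps stage time and the function names are assumptions for illustration):

```python
def pipelined_time_ps(n_instructions, stage_time_ps=200, n_stages=5):
    # the first instruction takes n_stages cycles; each later one adds a cycle
    return (n_stages + n_instructions - 1) * stage_time_ps

def multicycle_time_ps(n_instructions, stage_time_ps=200, n_stages=5):
    # without pipelining, every instruction occupies all stages in turn
    return n_instructions * n_stages * stage_time_ps

# for 1000 instructions the pipeline approaches the ideal 5x speedup
print(multicycle_time_ps(1000) / pipelined_time_ps(1000))  # ~4.98
```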

Tricky example the doctor said it in lecture


If we are using multi cycle with a stage cycle time of 200 ps and we are
executing a lw instruction, how can it be possible for this instruction
to be done in only 800 ps even though lw needs 5 stages?
Answer: this can only be done in a single cycle implementation

Pipelining should be taken into consideration while designing the
processor
Design decisions:

3 formats only
The source register field is in the same position in all 3 formats
Memory is used in only 2 instructions (lw and sw)

Pipelining hazards:
1. Data hazard: e.g. x = y + z, k = x + y; if x wasn't calculated yet,
the result in k could be wrong
Sol: stalling or forwarding
Forwarding: don't wait until the answer is written to the register; we
can read it from the latch
2. Structural hazard: the HW can't support a particular combination of
instructions executing in the same cycle, for example: reading from an
address while that address is being written to at the exact same time
Sol: stall or use 2 memories

Why not design the HW to always avoid structural hazard?

Some hazards don't occur that often, so the cost may outweigh the
benefit
It may complicate HW for cases that aren't hit that often, and
therefore may hurt overall performance

3. Control hazard:
A control hazard occurs because the CPU doesn't know soon enough
whether or not the conditional branch will be taken, or
the target of the transfer of control
Sol:

Determine or predict this info earlier (predict not taken, or a 2-bit
predictor)
Delay the execution of the branch until the calculation is done (stall)
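A 2-bit predictor can be sketched as a saturating counter (a Python illustration under the usual textbook scheme: states 0-1 predict not taken, 2-3 predict taken, so two consecutive mispredictions are needed to flip the prediction):

```python
class TwoBitPredictor:
    """2-bit saturating counter branch predictor."""
    def __init__(self):
        self.state = 0  # start at strongly not-taken

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        # move one step toward the actual outcome, saturating at 0 and 3
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

# a loop branch that is always taken: after two warm-up mispredictions
# the predictor is right every time
p = TwoBitPredictor()
correct = 0
for taken in [True] * 6:
    correct += (p.predict() == taken)
    p.update(taken)
```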

Chapter 5 memory:
Concepts:
1. Memory hierarchies & principle of locality
2. Basic cache organization
3. Virtual memory

Memory hierarchies:
From fastest to slowest: registers, cache, RAM, hard disk

Goal: keep up with the demand for data from the pipeline

Bandwidth:

The pipeline needs on the order of one word to a few words per cycle,
depending on the CPU organization. The rate at which data is
provided is called (bandwidth)

Memory hierarchy:

Exploits the principle of locality

Use a small amount of fast, expensive memory
Use larger, slower, cheaper memory
All data found in one level is also found in the level below

Goal:

catch most of the references in the fast memory


have most of the cost per byte be at the low memory level

Concept of locality:

spatial locality: this is exploited in HW (bandwidth); we always fill
the cache with neighboring data, for example: on an access to x[0] we
don't move only x[0] into the cache but as much as the bandwidth can
carry
temporal locality: maximum reuse of the data already in the cache. A
good programmer can apply this concept by knowing the storage order:
row-major or column-major addressing
Examples:
1) if we want to calculate the area and perimeter of an array of
1000000 shapes, is it better to calculate the area of them all and then
the perimeter of them all, or to go index by index calculating both
area and perimeter? The 2nd solution is much better; it applies the
concept of maximum reuse (temporal locality)
2) when multiplying [5000][5000] matrices, should I access the data
[row][col] or [col][row]? Sol: it depends on the storage order: C, C++
and Java use row-major order, so the 1st solution is better; in
column-major languages (e.g. Fortran, MATLAB) the 2nd solution is
better
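The traversal-order point can be made concrete with a small sketch (plain Python lists standing in for a row-major array, as in C/C++/Java; the variable names are mine):

```python
# a 4x4 matrix stored row by row (row-major), like a C array
N = 4
a = [[r * N + c for c in range(N)] for r in range(N)]

# row-order traversal touches consecutive elements: good spatial locality
row_order = [a[r][c] for r in range(N) for c in range(N)]

# column-order traversal jumps a whole row between accesses: poor locality
col_order = [a[r][c] for c in range(N) for r in range(N)]

print(row_order[:4])  # [0, 1, 2, 3]  - consecutive addresses
print(col_order[:4])  # [0, 4, 8, 12] - a stride of one row each step
```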

Question: Where can a block be placed in the current level? (in the
cache)
Block placement: we have 3 types
1. direct mapped cache: a memory block can be placed in only one cache
line (example on page 385); this means the cache may be almost empty
while a queue still forms at one particular cache line. This method
gives the least search time
almost all direct mapped caches use this mapping to find a block:
(block address) % (number of blocks in the cache)
2. fully associative: a memory block can be placed in any cache line,
so no queue forms at any one line, but there is a disadvantage: search
time > direct mapped method

3. Set associative: a memory block can be placed in any cache line
within a single set
This is the most commonly used: the cache is divided into sets, each
set containing several cache lines, so queues are shorter than in
direct mapped and search time is better than fully associative

The mapping rule for set associative (it selects the set a block goes
into):
(Block address) % (number of sets in the cache)
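The two mapping formulas above can be sketched directly (a Python illustration; the function names are mine):

```python
def direct_mapped_index(block_address, num_blocks):
    # a memory block maps to exactly one cache line
    return block_address % num_blocks

def set_index(block_address, num_sets):
    # the block may go into any line of this one set
    return block_address % num_sets

# a cache of 8 lines, or the same 8 lines grouped as 4 two-way sets
print(direct_mapped_index(12, 8))  # 4
print(set_index(12, 4))            # 0
```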
An example on memory problems from page 390:
Note: 4 KiB = 1024 words (from page 389).
Rules:
Block size = 2^m words
Cache size = 2^n blocks
Number of blocks = cache data capacity / words per block
Tag size = address bits - (n + m + 2)
Total number of bits in a direct mapped cache = 2^n * (block size
+ tag size + valid field size)
How many total bits are required for a direct mapped cache with 16 KiB
of data and 4 words blocks, assuming a 32- bit address?
Sol:
Number of blocks = 16 KiB / 4 words = 4096 words / 4 = 2^10 blocks
Cache size = 2^n blocks, so n = 10
Since block size = 2^m and block size = 4 words = 2^2 words, m = 2
Block size in bits = 4 * 32 bits = 128 bits
Tag size = 32 - (10 + 2 + 2) = 18 bits
Total number of bits = 2^10 * ([4 * 32] + 18 + 1) = 2^10 x 147 bits =
147 Kibibits
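The whole calculation can be replayed in a few lines (a Python sketch of the rules above; the function name is mine, and a 32-bit word with a 2-bit byte offset is assumed as in the book's example):

```python
import math

def direct_mapped_total_bits(data_kib, words_per_block, address_bits=32):
    """Total storage (data + tag + valid) for a direct mapped cache."""
    words = data_kib * 1024 // 4          # 4-byte words, so 4 KiB = 1024 words
    num_blocks = words // words_per_block
    n = int(math.log2(num_blocks))        # index bits
    m = int(math.log2(words_per_block))   # block-offset bits (in words)
    tag = address_bits - (n + m + 2)      # the extra 2 is the byte offset
    block_bits = words_per_block * 32     # data bits per block
    return num_blocks * (block_bits + tag + 1)  # +1 for the valid bit

print(direct_mapped_total_bits(16, 4) // 1024)  # 147 (Kibibits)
```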
There are more examples of different types in the book, please check
them
