
EC6009 ADVANCED COMPUTER ARCHITECTURE
UNIT I FUNDAMENTALS OF COMPUTER DESIGN

Review of fundamentals of CPU, Memory and IO - Trends in Technology, Power, Energy and Cost - Dependability - Performance Evaluation.
INTRODUCTION
The first general-purpose electronic computer was created 65 to 70 years ago.
Today, for less than $500, one can buy a mobile computer that has more performance, more main memory and more disk storage than a computer purchased in 1985 for $1 million.
This rapid improvement has come both from advances in the
technology used to build computers and from innovations in
computer design.
RISC-based machines focused on two critical performance techniques:
1. Exploitation of Instruction Level Parallelism (initially through pipelining and later through multiple instruction issue)
2. Use of caches.
For many applications, the highest performance microprocessors of today outperform the supercomputers of less than 10 years ago.
Dramatic improvement in cost-performance leads to new classes
of computers.
INTRODUCTION
The last decade saw the rise of smart cell phones and tablet computers, which many people are using as their primary computing platform instead of PCs.
These mobile client devices are increasingly using the internet to access
warehouses containing tens of thousands of servers.
Mainframe computers and high-performance supercomputers are all collections of microprocessors.
Today the nature of applications is also changing. Speech, sound, images and video are becoming increasingly important, along with the predictable response time that is so critical to the user experience.
An inspiring example is Google Goggles. This application lets you hold up your cell phone to point its camera at an object; the image is sent wirelessly over the Internet to a WSC that recognizes the object and tells you interesting information about it.
It can also read the bar code on a book cover to tell you whether the book is available online and at what price.
Since 2003, single-processor performance improvement has dropped to less than
22% per year due to the twin hurdles of maximum power dissipation and the lack
of more ILP.
In 2004, Intel canceled its high-performance uniprocessor projects and joined
others in declaring that the road to higher performance would be via multiple
processors per chip rather than via faster uniprocessors.
REVIEW OF FUNDAMENTALS OF CPU
The functional blocks in a computer are
1. ALU
2. Control Unit
3. Memory
4. Input Unit
5. Output Unit

The ALU contains the necessary electronic circuits to perform arithmetic and logical operations.
The Control Unit analyses each instruction in the program and sends the relevant control signals to all other units: ALU, Memory, Input and Output units.
The program is fed into the computer through the input unit and stored in the
memory. In order to execute the program, the instructions have to be fetched from
memory one by one. This fetching of instruction is done by the control unit.
After an instruction is fetched, the control unit decodes the instruction. According
to the instruction, the control unit issues control signals to other units.
After an instruction is executed, the result of the instruction is stored in memory or stored temporarily in the control unit or ALU, so that it can be used by the next instruction.
The results of a program are taken out of the computer through the output unit.
The control unit and ALU are collectively known as Central Processing Unit (CPU).
REVIEW OF FUNDAMENTALS OF CPU
The physical units in a computer such as the CPU, Memory, Input and
Output units form the Hardware.
The compilers as well as user programs (high level language or machine language) form the software.
Hardware works as dictated by the software. The operating system is special software that manages the hardware and software.
Arithmetic and Logic Unit:
The ALU has hardware circuits which perform primitive arithmetic and logical
operations. The H/W sections in ALU are
1. Adder
2. Accumulator
3. General Purpose Register
4. Counters
5. Shifters
6. Complementer.
Adder: adds two numbers and gives the result.
Accumulator: A register which temporarily holds the result of a previous operation in the ALU.
REVIEW OF FUNDAMENTALS OF CPU
General Purpose Register (GPR): When an operand is stored in main memory, it takes time to retrieve it. If it is stored within the CPU, it is immediately available to the CPU.
The GPRs store different types of information
1. Operand
2. Operand address
3. Constant
Since they are used for multiple purposes, these registers are known as GPRs.
Scratch Pad Memory or Registers: During complex operations like multiplication, division, etc., it is necessary to store intermediate results temporarily. For this purpose there are usually one or more scratch pad registers. These are purely internal H/W resources and are not addressable by programs.
Shifter and Complementer: The shifter provides the left and right shifts required for various operations. The complementer provides the 2's complement of binary numbers.
REVIEW OF FUNDAMENTALS OF CPU
CONTROL UNIT:
The control unit is the most complex unit in a computer. Its main functions are
1. Fetching instructions
2. Analyzing the OPCODE
3. Generating control signals for performing various operations.
H/W resources of a control unit:
Program Counter or Instruction Address Counter (IAC):
IAC contains the memory address of the next instruction to be fetched. When an instruction
is fetched, the IAC is incremented so that it points to the address of the next instruction.
Every instruction contains an opcode. In addition it may contain one or more of the
following.
1. Operand
2. Operand address
3. Register address
PSW Register: It contains various status bits describing the current condition of the CPU. These are known as flags. Two such flags are
1. Interrupt Enable: When this bit is 1, the CPU will recognize interrupt requests. When this bit is 0, interrupt requests will be ignored by the CPU and remain pending. The NMI is an exception to this.
2. Overflow: When this bit is 1, it indicates an overflow condition in the ALU in the previous arithmetic operation.
MEMORY AND IO
The memory is organized into locations. Each memory location is known as one memory word.
Memory Types:
Older computers used magnetic core memory, while present-day computers use semiconductor memory.
Core memory is non-volatile, whereas semiconductor memory is volatile.
Semiconductor memory is of two types: SRAM and DRAM.
SRAM preserves the contents of all locations as long as the power supply is present.
DRAM can retain the contents of a location only for a few milliseconds, so it must be refreshed periodically.
Random Access and Sequential Access Memories:
In a RAM, the access time is the same for all locations. (Core and semiconductor memories are RAM.)
In a sequential access memory, read or write access is sequential. The time taken to access the first location is the shortest and the time taken for the last location is the longest. (Example: magnetic tape.)
MEMORY AND IO
Memory Organization:
The Memory unit consists of the following sections:
1. Memory Address Register (MAR)
2. Memory Data Register (MDR)
3. Memory Control Logic
4. Memory cells
For the read operation, the CPU does the following sequence:
(i) Sends the address to MAR.
(ii) Sends the READ signal to the memory control unit. The memory control unit decodes the address bits and identifies the location to be accessed. Then it initiates a read operation of the memory. The memory takes some amount of time to present the contents of the location in MDR.
(iii) After a sufficient time interval, the CPU transfers the information from MDR.
MEMORY AND IO
For the write operation, the CPU does the following sequence:
(i) Sends the address to MAR.
(ii) Sends the data to MDR.
(iii) Sends the WRITE signal to the memory control unit. The memory control unit decodes the address bits and identifies the location where the write operation has to be performed. It then routes the MDR contents to memory and initiates the write operation.
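The two sequences can be summarized in a short sketch. The following minimal Python model is illustrative only: the names MAR, MDR and memory cells follow the text, while the class name, sizes and values are assumptions, and memory timing is simplified away.

```python
# Minimal sketch of the MAR/MDR read/write sequences described above.
class Memory:
    def __init__(self, size):
        self.cells = [0] * size   # memory cells
        self.mar = 0              # Memory Address Register
        self.mdr = 0              # Memory Data Register

    def read(self, address):
        self.mar = address                # (i)  CPU sends address to MAR
        self.mdr = self.cells[self.mar]   # (ii) READ: control logic decodes the
                                          #      address, memory fills MDR
        return self.mdr                   # (iii) CPU transfers data from MDR

    def write(self, address, data):
        self.mar = address                # (i)  CPU sends address to MAR
        self.mdr = data                   # (ii) CPU sends data to MDR
        self.cells[self.mar] = self.mdr   # (iii) WRITE: control logic routes
                                          #       MDR contents to the cells

mem = Memory(1024)
mem.write(42, 0xCAFE)
assert mem.read(42) == 0xCAFE
```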
Memory Access Time: The time taken by the memory to supply the contents of a location, measured from the time it receives READ, is called the memory access time. (Core memory: about 800 ns; semiconductor memory: about 100 ns.)
Memory Cycle Time: The memory access time plus the additional recovery
time (memory is busy due to internal operation) is known as Memory Cycle time.
Auxiliary Memory:
1. Floppy Disk drive
2. Hard Disk drive
3. Magnetic tape drive
4. CD-ROM.
Input / Output Units:
Common input units are keyboard, floppy disk, hard disk, magnetic tape, mouse, light pen, scanner, optical disk, etc.
Common output units are display terminal, printer, plotter, floppy disk drive, hard disk drive, magnetic tape drive, optical disk drive, etc.
CLASSES OF COMPUTERS
PERSONAL MOBILE DEVICES (PMD):
Collection of wireless devices with multimedia user
interfaces - cell phone and tablet computers.
The price of a system is $100-$1000 and the price of the microprocessor is $10-$100.
Energy and size requirements lead to use of flash memory
for storage instead of magnetic disks.
Responsiveness and Predictability are key characteristics for
media applications.
For example, when playing a video on a PMD, the time to process each video frame is limited, since the processor must soon access and process the next frame.
The memory can be a substantial portion of the system cost, and it is important to optimize memory size.
CLASSES OF COMPUTERS
DESKTOP COMPUTING:
Spans from low-end netbooks that sell for about $300 to high-end, heavily configured workstations that may sell for $2500.
Since 2008, more than half of the desktop computers made each year have been battery-operated laptop computers.
Driven by the combination of performance (measured in terms of compute performance and graphics performance) and price of a system, the newest, highest-performance microprocessors and cost-reduced microprocessors often appear first in desktop systems.
Desktop computing tends to be reasonably well
characterized in terms of applications and
benchmarking.
CLASSES OF COMPUTERS
SERVERS:
Role of servers grew to provide large scale and more reliable file and
computing services.
For servers, different characteristics are important. First, availability is critical. Consider the servers running ATM machines for banks or airline reservation systems.
Failure of such a server system is far more catastrophic than failure of a single desktop, since these servers must operate seven days a week, 24 hours a day.
A second key feature of server systems is scalability. The ability to scale up the computing capacity, the memory, the storage and the I/O bandwidth of a server is crucial.
Servers are designed for efficient throughput. The overall performance of a server is measured in terms of transactions per minute or web pages served per second.
The overall efficiency and cost-effectiveness of a server are determined by how many requests it can handle in unit time.
CLASSES OF COMPUTERS
CLUSTERS/WAREHOUSE SCALE COMPUTERS:
Growth of Software as a Service (SaaS) for applications like search, social networking, video sharing, multiplayer games and online shopping has led to the growth of a class of computers called clusters.
Clusters are collections of desktop computers or servers
connected by LAN to act as a single large computer.
Each node runs its own operating system and nodes
communicate using a networking protocol.
The largest of the clusters are called Warehouse Scale Computers (WSCs), designed so that tens of thousands of servers can act as one.
Price-performance and power are critical to WSCs since they are so large.
CLASSES OF COMPUTERS
80% of the cost of a $90M warehouse is associated with power and cooling of the computers inside.
Networking gear costs another $70M and must be replaced every few years.
WSCs are related to servers in that availability is critical.
For example, Amazon.com had $13 billion in sales in the fourth quarter of 2010. As there are about 2200 hours in a quarter, the average revenue per hour was almost $6M. During a peak hour for Christmas shopping, the potential loss would be many times higher.
Supercomputers are related to WSCs in that they are equally expensive, costing hundreds of millions of dollars, but they differ by emphasizing floating-point performance.
TRENDS IN TECHNOLOGY
INTEGRATED CIRCUIT LOGIC TECHNOLOGY:
Transistor density increases by about 35% per year.
Increases in die size range from 10% to 20% per year.
The combined effect is a growth in transistor count on a chip of about 40% to 55% per year, or doubling every 18 to 24 months.
This trend is popularly known as Moore's law.
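As a quick sanity check of the quoted rates, the doubling time implied by an annual growth rate g follows from solving 2 = (1 + g)^t. A small Python calculation (illustrative arithmetic only):

```python
import math

# Doubling time in years implied by an annual growth rate g: solve 2 = (1+g)**t.
for g in (0.40, 0.55):
    t = math.log(2) / math.log(1 + g)
    print(f"{g:.0%} per year -> doubles every {t:.1f} years")
# 40% per year -> ~2.1 years; 55% per year -> ~1.6 years,
# consistent with "doubling every 18 to 24 months" above.
```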
TRENDS IN TECHNOLOGY
[Figure: Transistor counts of Intel microprocessors from 1970 to 2000, on a log scale from 1,000 to 1,000,000,000 transistors (4004, 8008, 8080, 8086, 80286, Intel386, Intel486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4). X-axis: year; Y-axis: transistors.]
TRENDS IN TECHNOLOGY
SEMICONDUCTOR DRAM:
Capacity per DRAM chip has increased by about 25% to
40% per year recently, doubling roughly every two to three
years.
Year | DRAM growth rate | Characterization of impact on DRAM capacity
-----|------------------|--------------------------------------------
1990 | 60% / year       | Quadrupling every 3 years
1996 | 60% / year       | Quadrupling every 3 years
2003 | 40% - 60% / year | Quadrupling every 3 to 4 years
2007 | 40% / year       | Doubling every 2 years
2011 | 25% - 40% / year | Doubling every 2 to 3 years
TRENDS IN TECHNOLOGY
SEMICONDUCTOR FLASH (Electrically Erasable Programmable Read-Only Memory):
This non-volatile semiconductor memory is the standard storage device in PMDs.
Capacity per Flash chip has increased by about 50% to 60% per year recently, doubling roughly every two years.
Flash memory is 15 to 20 times cheaper per bit than DRAM.
MAGNETIC DISK TECHNOLOGY:
Prior to 1990, density increased by about 30% per year, doubling in 3 years. It increased 100% per year around 1996. Since 2004, it has dropped back to about 40% per year.
Disks are 15 to 25 times cheaper per bit than Flash.
Disks are 300 to 500 times cheaper per bit than DRAM.
This technology is central to server and warehouse scale storage.
NETWORK TECHNOLOGY:
Network performance depends both on the performance of the switches and on the performance of the transmission system.
PERFORMANCE TRENDS
BANDWIDTH OR THROUGHPUT:
It is the total amount of work done in a given time,
such as megabytes per second for a disk transfer.
LATENCY OR RESPONSE TIME:
It is the time between the start and completion of an
event, such as milliseconds for a disk access.
TRENDS IN POWER AND ENERGY IN IC:
For CMOS chips, the primary energy consumption has been in switching transistors, also called dynamic energy.
The energy required per transistor is proportional to the product of the capacitive load driven by the transistor and the square of the voltage:
Energy dynamic ∝ Capacitive load × Voltage²
TRENDS IN POWER AND ENERGY IN IC:
This equation is the energy of a pulse of the logic transition 0→1→0 or 1→0→1.
The energy of a single transition (0→1 or 1→0) is then:
Energy dynamic ∝ 1/2 × Capacitive load × Voltage²
The power required per transistor is the product of the energy of a transition and the frequency of transitions:
Power dynamic ∝ 1/2 × Capacitive load × Voltage² × Frequency switched

Dynamic power and energy are greatly reduced by lowering the voltage. Voltages have dropped from 5 V to just under 1 V in 20 years.
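A small numeric sketch of these formulas: the capacitance and frequency below are assumed values for illustration, and only the 5 V to 1 V drop comes from the text.

```python
# Dynamic energy per transition and dynamic power, per the formulas above.
def dynamic_energy(c_load, v):
    return 0.5 * c_load * v ** 2        # 1/2 x Capacitive load x Voltage^2

def dynamic_power(c_load, v, f_switched):
    return dynamic_energy(c_load, v) * f_switched

c = 1e-15   # 1 fF capacitive load (assumed value)
f = 2e9     # 2 GHz switching frequency (assumed value)

# Dropping the supply voltage from 5 V to 1 V cuts energy per transition 25x:
print(dynamic_energy(c, 5.0) / dynamic_energy(c, 1.0))   # 25.0
print(dynamic_power(c, 1.0, f))                          # 1e-06 W per transistor
```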
TRENDS IN POWER AND ENERGY IN IC:
Do nothing well:
Most microprocessors today turn off the clock of inactive modules to save energy and dynamic power. For example, if no floating-point instructions are executing, the clock of the floating-point unit is disabled. If some cores are idle, their clocks are stopped.
Dynamic Voltage-Frequency Scaling (DVFS):
PMDs, laptops and servers have periods of low activity where there is no need to operate at the highest clock frequency and voltage.
Modern microprocessors offer a few clock frequencies and voltages at which they can operate with lower power and energy.
As an example of power savings via DVFS, a server may be operated at 3 different clock rates: 2.4 GHz, 1.8 GHz and 1 GHz.
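A hedged sketch of the savings at those three clock rates, assuming (a common rule of thumb, not stated in the text) that supply voltage scales roughly linearly with frequency, so dynamic power scales roughly as frequency cubed:

```python
# Relative dynamic power at the three DVFS operating points above,
# under the assumption power ~ V^2 * f ~ f^3 (voltage tracks frequency).
base_f = 2.4e9
for f in (2.4e9, 1.8e9, 1.0e9):
    rel = (f / base_f) ** 3
    print(f"{f / 1e9:.1f} GHz -> ~{rel:.0%} of full-speed dynamic power")
# 2.4 GHz -> 100%, 1.8 GHz -> ~42%, 1.0 GHz -> ~7%
```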
Design for the typical case:
Since PMDs and laptops are often idle, memory and storage offer low-power modes to save energy and extend battery life.
On-chip temperature sensors detect when activity should be reduced automatically to avoid overheating.
TRENDS IN POWER AND ENERGY IN IC:
Overclocking:
Intel offered Turbo mode in 2008: the chip decides it is safe to run at a higher clock rate, possibly on just a few cores, for a short time until the temperature starts to rise.
For single-threaded code, these microprocessors can turn off all cores but one and run it at an even higher clock rate.
The operating system can turn off Turbo mode, and there is no notification once it is enabled, so programs may vary in performance due to room temperature.
Power static ∝ Current static × Voltage
Static power is proportional to the number of devices; increasing the number of transistors increases power even if they are idle. SRAM caches need power to maintain their stored values.
Since the processor is only a portion of the whole energy cost of a system, it can make sense to use a faster, less energy-efficient processor so that the rest of the system can go into a sleep mode; this approach is called race-to-halt.
TRENDS IN COST:
Cost of an IC:
Cost of IC = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield
Cost of die = Cost of wafer / (Dies per wafer × Die yield)
Dies per wafer = [π × (Wafer diameter / 2)²] / Die area − [π × Wafer diameter] / √(2 × Die area)
Problem 1:
Find the number of dies per 300 mm (30 cm) wafer for a die that is 1.5 cm on a side and for a die that is 1.0 cm on a side.
Dies per wafer for the 1.5 cm die (die area = 1.5 × 1.5 = 2.25 cm²) = 270
Dies per wafer for the 1.0 cm die (die area = 1.0 × 1.0 = 1 cm²) = 640
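Problem 1 can be checked directly with the dies-per-wafer formula above; a short Python verification:

```python
import math

# Dies per wafer = pi*(d/2)^2 / die_area  -  pi*d / sqrt(2 * die_area)
def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    usable = math.pi * (wafer_diameter_cm / 2) ** 2 / die_area_cm2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return usable - edge_loss

print(dies_per_wafer(30, 2.25))  # ~269.7, i.e. 270 dies for the 1.5 cm die
print(dies_per_wafer(30, 1.0))   # ~640.2, i.e. 640 dies for the 1.0 cm die
```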
DEPENDABILITY
Dependability is a measure of a system's availability, reliability and maintainability.
Infrastructure providers started offering Service Level Agreements (SLAs) to guarantee that their networking or power service would be dependable.
For example, they would pay the customer a penalty if they didn't meet the agreement for more than some hours per month.
Two main measures of dependability:
Module Reliability:
Mean Time To Failure (MTTF) is a reliability measure; the reciprocal of MTTF is a rate of failures.
Service interruption is measured as Mean Time To Repair (MTTR).
Mean Time Between Failures (MTBF) = MTTF + MTTR
Module Availability = MTTF / (MTTF + MTTR)
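A small example of these measures; the MTTF and MTTR values are assumed for illustration:

```python
mttf = 1_000_000   # mean time to failure in hours (assumed value)
mttr = 24          # mean time to repair in hours (assumed value)

failure_rate = 1 / mttf             # reciprocal of MTTF
mtbf = mttf + mttr                  # MTBF = MTTF + MTTR
availability = mttf / (mttf + mttr)
print(f"MTBF = {mtbf} h, availability = {availability:.6f}")   # ~0.999976
```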
MEASURING, REPORTING AND SUMMARIZING PERFORMANCE
An Amazon.com administrator may say a computer is faster when it completes more transactions per hour.
The computer user is interested in reducing response time, the time between the start and the completion of an event, also referred to as execution time.
The operator of a warehouse scale computer may be interested in increasing throughput, the total amount of work done in a given time.
We often want to relate the performance of two different computers, say X and Y. The phrase "X is faster than Y" means that the response or execution time is lower on X than on Y for the given task. In particular, "X is n times faster than Y" means:
n = Execution time Y / Execution time X
Since execution time is the reciprocal of performance:
n = Execution time Y / Execution time X = Performance X / Performance Y
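For example, with assumed execution times of 15 s on Y and 10 s on X:

```python
exec_time_y = 15.0   # seconds on Y (assumed value)
exec_time_x = 10.0   # seconds on X (assumed value)
n = exec_time_y / exec_time_x
print(f"X is {n:.1f} times faster than Y")   # X is 1.5 times faster than Y
```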
QUANTITATIVE PRINCIPLES OF COMPUTER DESIGN
PRINCIPLE OF LOCALITY:
Programs tend to reuse data and instructions they have used recently.
The principle of locality applies to data accesses, though not as strongly
as to code accesses.
Two different types of locality have been observed.
Temporal Locality: Recently accessed items are likely to be accessed in
the near future.
Spatial Locality: Items whose addresses are near one another tend to be
referenced close together in time.
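A tiny loop makes both kinds of locality concrete (illustrative sketch):

```python
# total and i are reused on every iteration (temporal locality);
# data[i] walks through adjacent memory addresses (spatial locality).
data = list(range(1000))
total = 0
for i in range(len(data)):
    total += data[i]
print(total)   # 499500
```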
AMDAHL'S LAW:
It states that the performance improvement that can be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
Speedup = Performance for entire task using the enhancement when possible / Performance for entire task without using the enhancement
Alternatively,
Speedup = Execution time for entire task without using the enhancement / Execution time for entire task using the enhancement when possible
QUANTITATIVE PRINCIPLES OF COMPUTER DESIGN
Execution time new = Execution time old × ((1 − Fraction enhanced) + Fraction enhanced / Speedup enhanced)
Speedup overall = Execution time old / Execution time new
                = 1 / ((1 − Fraction enhanced) + Fraction enhanced / Speedup enhanced)

PROBLEM 2:
Suppose that we want to enhance the processor used for web serving. The new processor is 10 times faster on computation in the web serving application than the original processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?
Fraction enhanced = 0.4
Speedup enhanced = 10
Speedup overall = 1 / (0.6 + 0.4 / 10) = 1 / 0.64 ≈ 1.56
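Problem 2 can be verified with a one-line function implementing Amdahl's law as stated above:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(round(amdahl_speedup(0.4, 10), 2))   # 1.56
```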
THE PROCESSOR PERFORMANCE EQUATION
Essentially all computers are constructed using a clock running at a constant rate. These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles or clock cycles.
CPU time = CPU clock cycles for a program × Clock cycle time
(or)
CPU time = CPU clock cycles for a program / Clock rate
Clock cycles Per Instruction (CPI):
CPI = CPU clock cycles for a program / Instruction count
Combining these:
CPU time = Instruction count × CPI × Clock cycle time
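A short example of the equation in use; the instruction count, CPI and clock rate below are assumed values:

```python
instruction_count = 2_000_000   # instructions executed (assumed value)
cpi = 1.5                       # average clock cycles per instruction (assumed value)
clock_rate = 2e9                # 2 GHz

cpu_clock_cycles = instruction_count * cpi
cpu_time = cpu_clock_cycles / clock_rate
print(f"CPU time = {cpu_time * 1e3:.2f} ms")   # CPU time = 1.50 ms
```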
CLASSES OF PARALLELISM
Parallelism is the driving force of computer design, with energy and cost being the primary design constraints.
There are basically two kinds of parallelism in applications:
1. Data-Level Parallelism (DLP): There are many data items that can be operated on at the same time.
2. Task-Level Parallelism (TLP): Arises because tasks of work are created that can operate independently and largely in parallel.
Computer hardware can exploit these two kinds of application parallelism in four major ways:
1. Instruction Level Parallelism (ILP):
. Exploits DLP with compiler help.
. All processors since about 1985 use pipelining to overlap the execution of instructions and improve performance.
. This potential overlap among instructions is called Instruction Level Parallelism.
. The instructions can be evaluated in parallel.
2. Vector Architectures and Graphic Processor Units (GPU): Exploit DLP by applying a single instruction to a collection of data in parallel.
3. Thread Level Parallelism: Exploits either DLP or TLP in a tightly coupled hardware model that allows for interaction among parallel threads.
4. Request Level Parallelism: Exploits parallelism among largely decoupled tasks specified by the programmer or the operating system.
Michael Flynn placed all computers into one of four categories:
1. Single Instruction Single Data (SISD) stream:
. Uniprocessor category.
. Standard sequential computer, but it can exploit ILP.
. SISD architectures can use ILP techniques such as superscalar execution.
2. Single Instruction Multiple Data (SIMD) stream:
. In a SIMD machine, the same instruction is executed by multiple processors using different data streams.
. Each processor has its own data memory, but there is only one instruction memory and control processor, which fetches and dispatches instructions.
. It exploits DLP by applying the same operation to multiple items of data in parallel (see the sketch after this discussion).
3. Multiple Instruction Single Data (MISD) stream:
No commercial multiprocessor of this type has been built to date.
4. Multiple Instruction Multiple Data (MIMD) stream:
Each processor fetches its own instructions and operates on its own data.
These processors either utilize a centralized shared-memory architecture, or each has its own memory and they communicate with each other through crossbar networks.
SIMD processors can exploit data parallelism but are not as flexible as MIMD processors. They are suitable for algorithms with high data parallelism and little data-dependent control flow.
MIMD processors are more flexible: they can function either as single-user machines focusing on high performance for one particular application, or as multiprogrammed machines running many tasks simultaneously.
However, they are much more expensive and complicated due to the replication of control hardware, high instruction bandwidth requirements and synchronization of data paths.
Besides pure SIMD and MIMD approaches, a combination of both is also possible, exploiting the advantages of both SIMD and MIMD architectures.
Tightly coupled MIMD architectures exploit TLP, since multiple cooperating threads operate in parallel.
Loosely coupled MIMD architectures (clusters and WSCs) exploit RLP, where many independent tasks can proceed in parallel with little need for communication and synchronization.
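The sketch promised above contrasts the SISD and SIMD styles in plain Python. It is illustrative only: real SIMD hardware applies the one instruction to all elements in parallel, whereas zip here merely stands in for the vector unit.

```python
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# SISD style: one instruction stream operating on one data item at a time.
c_sisd = []
for i in range(len(a)):
    c_sisd.append(a[i] + b[i])

# SIMD style: conceptually a single "add" applied to all elements at once,
# as a vector unit or GPU would do in hardware.
c_simd = [x + y for x, y in zip(a, b)]

assert c_sisd == c_simd == [11, 22, 33, 44]
```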
MULTITHREADING
Multithreading: Execution of two or more threads, either interleaved on a single processor or simultaneously on multiple processors.
On a single processor, multithreading generally occurs by Time Division Multiplexing (TDM): the processor switches between different threads.
On a multiprocessor, the threads or tasks actually run at the same time, with each processor or core running a particular thread or task.
Types:
1. Coarse-grained Multithreading.
2. Fine-grained Multithreading.
3. Simultaneous Multithreading.
Advantages of Multithreading:
1. If a thread gets a lot of cache misses, the other thread can continue, taking advantage of unused computing resources, which can thus lead to faster overall execution, as these resources would have been idle if only a single thread was executed.
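A minimal software sketch of two threads sharing a processor. These are OS-scheduled software threads standing in for the hardware thread contexts described here; the worker function and its names are illustrative assumptions.

```python
import threading

def worker(name, count):
    total = 0
    for i in range(count):
        total += i            # stand-in for useful work
    print(f"{name} done, total = {total}")

t1 = threading.Thread(target=worker, args=("thread-1", 100_000))
t2 = threading.Thread(target=worker, args=("thread-2", 100_000))
t1.start(); t2.start()        # the OS time-multiplexes the two threads
t1.join(); t2.join()
```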
MULTITHREADING
Disadvantages:
1. Multiple threads can interfere with each other when sharing hardware resources such as caches or TLBs.
2. The execution time of a single thread is not improved, due to the slower frequency or the added pipeline stages that are necessary to accommodate the thread-switching H/W.
3. Requires more changes to both application programs and the OS than multiprocessing.
Coarse-grained Multithreading:
Also known as block or cooperative multithreading.
This is the simplest type of multithreading; it occurs when one thread runs until it is blocked by an event that normally would create a long-latency stall. Such a stall might be a cache miss that has to access off-chip memory, which might take a huge number of CPU cycles for the data to return. Instead of waiting for the stall to resolve, a threaded processor would switch execution to another thread.
MULTITHREADING
Fine-grained Multithreading:
Its aim is to remove all data-dependency stalls from the executing pipeline. Since one thread is relatively independent of other threads, there is less chance of one instruction in one pipeline stage needing an output from an older instruction in the pipeline.
Hardware Cost:
It has the additional cost of each pipeline stage tracking the thread ID of the instruction it is processing. Since more threads are being executed concurrently in the pipeline, pressure on shared resources increases: caches need to be larger to avoid thrashing between the different threads.
Simultaneous Multithreading:
The most advanced type of multithreading, applied to superscalar processors.
LIMITATIONS OF SINGLE CORE PROCESSORS
Increasing the clock speed in a single core processor has reached saturation and cannot be increased beyond a certain limit.
Power and heat dissipation issues also exist.
As the physical size of the chip decreased while the number of transistors on the chip increased, clock speed increased, which boosted heat dissipation across the chip to a dangerous level.
There were limitations in the use of silicon surface area.
Throughput is limited. Memory bandwidth is limited. Heat sink issues were also present.
Current started to leak out as the size of individual gates was reduced.
The power limit has forced a dramatic change in the design of microprocessors.
In multicore processors the benefit is more on throughput than on response time.
In the past, programmers could rely on innovations in H/W to make their programs run faster; with multicore processors, they must turn to parallelism in software.