
0907532 Special Topics in Computer Engineering

Multicore Architecture Basics

Basic concept of parallelism


The idea is simple: improve performance by performing two or more operations at the same time.
Parallelism has been an important computer design strategy since the beginning.

Parallelism in This Course (multicore machines)


Attain parallelism by using several processing elements (cores) on the same chip, or on different chips sharing main memory.
Parallel computing is necessary for continued performance gains, given that clock speeds are not going to increase dramatically.

Figure: Clock rate (GHz) over time — the 2005 ITRS (International Technology Roadmap for Semiconductors) projection versus actual Intel single-core clock rates.

Change in the ITRS Roadmap in 2 years

Figure: the 2005 roadmap versus the 2007 roadmap, together with Intel single-core and Intel multicore clock rates.

Shared Address Space Architectures


Any core can directly reference any memory location.
Communication between cores occurs implicitly as a result of loads and stores.

Memory hierarchy and cache memories:

1. Review concepts assuming a single core
2. Introduce problems and solutions when used in multicore machines

Single core memory hierarchy and cache memories

Programs tend to exhibit temporal and spatial locality:
Temporal locality: once programs access data items or instructions, they tend to access them again in the near future.
Spatial locality: once programs access data items or instructions, they tend to access nearby data items or instructions in the near future.
Because of the locality property of programs, memory is organized in a hierarchy (a code sketch illustrating locality follows below).
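As an illustrative sketch (not from the slides), the two loops below touch exactly the same data but differ in spatial locality: the row-major loop visits consecutive addresses, while the column-major loop jumps a full row between accesses.

#include <stdio.h>

#define N 1024

static double a[N][N];

/* Row-major traversal: consecutive iterations touch adjacent
   addresses, so most accesses hit in the cache (good spatial locality). */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: consecutive iterations are N * sizeof(double)
   bytes apart, so each access tends to touch a different cache line
   (poor spatial locality), and the loop runs noticeably slower. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}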

Memory hierarchy

Key observations:
Access to the L1 cache is on the order of 1 cycle.
Access to the L2 cache is on the order of 1-10 cycles.
Access to main memory is on the order of 100s of cycles.
Access to disk is on the order of 1000s of cycles.

Figure: Core -> L1 Cache -> L2 Cache -> Main Memory -> Magnetic Disk; the thickness of the connecting lines depicts bandwidth (bytes/second).

Processor and Memory are Far Apart

Figure: processor and memory connected by an interconnect (from The Art of Multiprocessor Programming).

Reading from Memory

The processor sends an address over the interconnect, waits many cycles, and memory responds with the value stored at that address. (Figure sequence from The Art of Multiprocessor Programming.)

Writing to Memory

The processor sends an address and a value over the interconnect, waits, and memory responds with an acknowledgment once the write completes. (Figure sequence from The Art of Multiprocessor Programming.)

Cache: Reading from Memory

With a cache placed next to the processor, the address is first presented to the cache before any request goes out to memory. (Figure sequence from The Art of Multiprocessor Programming.)

Cache Hit

If the cache holds the requested address, it returns the value directly and no memory access is needed.

Cache Miss

If the cache does not hold the address, the request is forwarded to main memory; the returned data is placed in the cache and delivered to the processor.

Memory and cache performance metrics

Cache hit and miss: when the data is found in the cache we have a cache hit; otherwise it is a miss.
Hit Ratio, HR = fraction of memory references that hit.
Depends on the locality of the application.
Measure of the effectiveness of the caching mechanism.
Miss Ratio, MR = fraction of memory references that miss.
HR = 1 - MR

Average memory system access time

If all the data fits in main memory (i.e. ignoring disk access):

Average access time = HR * cache access time + MR * main memory access time
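As an illustrative example (numbers assumed, not from the slides): with HR = 0.95, a cache access time of 1 cycle, and a main memory access time of 100 cycles, the average access time is 0.95 * 1 + 0.05 * 100 = 5.95 cycles — far closer to the cache latency than to the memory latency, which is why a high hit ratio matters so much.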

Cache line
When there is a cache miss, a fixed size block of
consecutive data elements, or line, is copied from
main memory to the cache.
Typical cache line size is 4-128 bytes.
Main memory can be seen as a sequence of lines,
some of which can have a copy in the cache.

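A minimal sketch of how consecutive elements share a line (assuming a 64-byte line, a common size within the range above): with 4-byte ints, one miss brings in 16 neighboring elements, so the next 15 accesses are likely hits.

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64   /* assumed cache line size in bytes */

int main(void) {
    int x[32];
    for (int i = 0; i < 32; i++) {
        /* Line index = byte address divided by the line size;
           elements that map to the same index share one cache line. */
        uintptr_t line = (uintptr_t)&x[i] / LINE_SIZE;
        printf("x[%2d] -> line %lu\n", i, (unsigned long)line);
    }
    return 0;
}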

MEMORY HIERARCHY AND BANDWIDTH ON MULTICORE

Each core has its own private L1 cache to provide fast access, e.g. 1-2 cycles.
L2 caches may be shared across multiple cores.
In the event of a cache miss at both L1 and L2, the memory controller must forward the load/store request to the off-chip main memory.

Intel Core Microarchitecture Memory Sub-system

High level multicore architectural view

Figure: Intel Core 2 Duo processor versus Intel Core 2 Quad processor, both using 64 B cache lines. The dual core has a shared L2 cache; the quad core has both shared and separated L2 caches.
Legend: A = Architectural State, E = Execution Engine & Interrupt, C = 2nd Level Cache, B = Bus Interface (connects to main memory & I/O).

Cache line ping-ponging or tennis effect

One processor writes to a cache line, and then another processor writes to the same cache line but to a different data element.
When the cache line is in a separate-socket / separate-L2-cache environment, each core takes a HITM (HIT Modified) on the cache line, causing it to be shipped across the FSB (Front Side Bus) to memory.
This increases FSB traffic, and even under good conditions costs roughly as much as a memory access.

Intel Core Microarchitecture Memory Sub-system

With a separated cache

Figure: CPU1 and CPU2, each with its own L2 cache, connected to memory over the Front Side Bus (FSB). Shipping an L2 cache line between the caches costs roughly half a memory access.

Intel Core Microarchitecture Memory Sub-system

Advantages of Shared Cache using Advanced Smart Cache Technology

Figure: CPU1 and CPU2 share one L2 cache, connected to memory over the Front Side Bus (FSB). Because L2 is shared, there is no need to ship the cache line between caches.

False Sharing

Performance issue in programs where cores write to different memory addresses BUT in the same cache line.
Known as ping-ponging: the cache line is shipped back and forth between cores.

Figure: timeline of Core 0 writing X[0] (0, 1, 2) while Core 1 writes X[1] (0, 1) in the same cache line. False sharing is not an issue with a shared cache; it is an issue with separated caches. (A code sketch of this pattern follows below.)
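A minimal sketch of the pattern in the figure (hypothetical code, assuming POSIX threads, compiled with -pthread): each thread updates only its own element of x, yet both elements live in the same cache line, so the line ping-pongs between the cores running the two threads.

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* x[0] and x[1] are adjacent, so they almost certainly share one
   cache line: writes from the two threads cause the line to bounce
   between cores even though no datum is actually shared. */
long x[2];

void *worker(void *arg) {
    long idx = (long)arg;
    for (long i = 0; i < ITERS; i++)
        x[idx]++;            /* different addresses, same cache line */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0);
    pthread_create(&t1, NULL, worker, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", x[0], x[1]);
    return 0;
}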

Avoiding False Sharing

Change either:
Algorithm:
adjust the implementation of the algorithm (e.g. the loop stride) so that each thread accesses data in a different cache line.
Or
Data structure:
add some padding to a data structure or array (just enough padding, generally less than the cache line size) so that threads access data from different cache lines (a sketch of the padding approach follows below).
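A minimal sketch of the padding approach (assuming a 64-byte line and POSIX threads, as in the earlier sketch): each thread's counter sits in its own padded slot, so the two counters fall in different cache lines and the line no longer ping-pongs.

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L
#define LINE_SIZE 64          /* assumed cache line size in bytes */

/* Pad each counter out to a full cache line so the two threads
   write to different lines. */
struct padded_counter {
    long value;
    char pad[LINE_SIZE - sizeof(long)];
};

struct padded_counter x[2];

void *worker(void *arg) {
    long idx = (long)arg;
    for (long i = 0; i < ITERS; i++)
        x[idx].value++;       /* now in separate cache lines */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0);
    pthread_create(&t1, NULL, worker, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", x[0].value, x[1].value);
    return 0;
}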
