
0907532 Special Topics in Computer Engineering

Multicore Architecture Basics

Basic concept of parallelism


The idea is simple: improve performance by performing two or more operations at the same time.
Parallelism has been an important computer design strategy since the beginning.

Parallelism in This Course (multicore machines)


Attain parallelism by using several processing elements (cores) on the same chip, or on different chips sharing main memory.
Parallel computing is necessary for continued performance gains, given that clock speeds are not going to increase dramatically.

Figure: Clock rate (GHz) over time — the 2005 ITRS (International Technology Roadmap for Semiconductors) projection versus actual Intel single-core clock rates.

Change in the ITRS Roadmap in 2 years

Figure: the 2005 roadmap versus the 2007 roadmap, together with Intel single-core and Intel multicore clock rates.

Shared Address Space Architectures


Any core can directly reference any memory location.
Communication between cores occurs implicitly as a result of loads and stores.

Memory hierarchy and cache memories:

1. Review concepts assuming a single core
2. Introduce problems and solutions when used in multicore machines

Single core memory hierarchy and cache memories

Programs tend to exhibit temporal and spatial locality:
Temporal locality: once programs access data items or instructions, they tend to access them again in the near future.
Spatial locality: once programs access data items or instructions, they tend to access nearby data items or instructions in the near future.
Because of the locality property of programs, memory is organized in a hierarchy (a code sketch illustrating locality follows below).
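As an illustrative sketch (not from the slides), the two loops below touch exactly the same data but differ in spatial locality: the row-major loop visits consecutive addresses, while the column-major loop jumps a full row between accesses.

#include <stdio.h>

#define N 1024

static double a[N][N];

/* Row-major traversal: consecutive iterations touch adjacent
   addresses, so most accesses hit in the cache (good spatial locality). */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: consecutive iterations are N * sizeof(double)
   bytes apart, so each access tends to touch a different cache line
   (poor spatial locality), and the loop runs noticeably slower. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}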

Memory hierarchy

Key observations:
Access to the L1 cache is on the order of 1 cycle.
Access to the L2 cache is on the order of 1-10 cycles.
Access to main memory is on the order of 100s of cycles.
Access to disk is on the order of 1000s of cycles.

Figure: Core -> L1 Cache -> L2 Cache -> Main Memory -> Magnetic Disk; the thickness of the connecting lines depicts bandwidth (bytes/second).

Processor and Memory are Far Apart

Figure: processor and memory connected by an interconnect (from The Art of Multiprocessor Programming).

Reading from Memory

The processor sends an address over the interconnect, waits many cycles, and memory responds with the value stored at that address. (Figure sequence from The Art of Multiprocessor Programming.)

Writing to Memory

The processor sends an address and a value over the interconnect, waits, and memory responds with an acknowledgment once the write completes. (Figure sequence from The Art of Multiprocessor Programming.)

Cache: Reading from Memory

With a cache placed next to the processor, the address is first presented to the cache before any request goes out to memory. (Figure sequence from The Art of Multiprocessor Programming.)

Cache Hit

If the cache holds the requested address, it returns the value directly and no memory access is needed.

Cache Miss

If the cache does not hold the address, the request is forwarded to main memory; the returned data is placed in the cache and delivered to the processor.

Memory and cache performance metrics

Cache hit and miss: when the data is found in the cache we have a cache hit; otherwise it is a miss.
Hit Ratio, HR = fraction of memory references that hit.
Depends on the locality of the application.
Measure of the effectiveness of the caching mechanism.
Miss Ratio, MR = fraction of memory references that miss.
HR = 1 - MR

Average memory system access time

If all the data fits in main memory (i.e. ignoring disk access):

Average access time = HR * cache access time + MR * main memory access time
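As an illustrative example (numbers assumed, not from the slides): with HR = 0.95, a cache access time of 1 cycle, and a main memory access time of 100 cycles, the average access time is 0.95 * 1 + 0.05 * 100 = 5.95 cycles — far closer to the cache latency than to the memory latency, which is why a high hit ratio matters so much.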

Cache line
When there is a cache miss, a fixed size block of
consecutive data elements, or line, is copied from
main memory to the cache.
Typical cache line size is 4-128 bytes.
Main memory can be seen as a sequence of lines,
some of which can have a copy in the cache.

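A minimal sketch of how consecutive elements share a line (assuming a 64-byte line, a common size within the range above): with 4-byte ints, one miss brings in 16 neighboring elements, so the next 15 accesses are likely hits.

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64   /* assumed cache line size in bytes */

int main(void) {
    int x[32];
    for (int i = 0; i < 32; i++) {
        /* Line index = byte address divided by the line size;
           elements that map to the same index share one cache line. */
        uintptr_t line = (uintptr_t)&x[i] / LINE_SIZE;
        printf("x[%2d] -> line %lu\n", i, (unsigned long)line);
    }
    return 0;
}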

MEMORY HIERARCHY AND BANDWIDTH ON MULTICORE

Each core has its own private L1 cache to provide fast access, e.g. 1-2 cycles.
L2 caches may be shared across multiple cores.
In the event of a cache miss at both L1 and L2, the memory controller must forward the load/store request to the off-chip main memory.

Intel Core Microarchitecture Memory Sub-system

High level multicore architectural view

Figure: Intel Core 2 Duo processor versus Intel Core 2 Quad processor, both using 64 B cache lines. The dual core has a shared L2 cache; the quad core has both shared and separated L2 caches.
Legend: A = Architectural State, E = Execution Engine & Interrupt, C = 2nd Level Cache, B = Bus Interface (connects to main memory & I/O).

Cache line ping-ponging or tennis effect

One processor writes to a cache line, and then another processor writes to the same cache line but to a different data element.
When the cache line is in a separate-socket / separate-L2-cache environment, each core takes a HITM (HIT Modified) on the cache line, causing it to be shipped across the FSB (Front Side Bus) to memory.
This increases FSB traffic, and even under good conditions costs roughly as much as a memory access.

Intel Core Microarchitecture Memory Sub-system

With a separated cache

Figure: CPU1 and CPU2, each with its own L2 cache, connected to memory over the Front Side Bus (FSB). Shipping an L2 cache line between the caches costs roughly half a memory access.

Intel Core Microarchitecture Memory Sub-system

Advantages of Shared Cache using Advanced Smart Cache Technology

Figure: CPU1 and CPU2 share one L2 cache, connected to memory over the Front Side Bus (FSB). Because L2 is shared, there is no need to ship the cache line between caches.

False Sharing

Performance issue in programs where cores write to different memory addresses BUT in the same cache line.
Known as ping-ponging: the cache line is shipped back and forth between cores.

Figure: timeline of Core 0 writing X[0] (0, 1, 2) while Core 1 writes X[1] (0, 1) in the same cache line. False sharing is not an issue with a shared cache; it is an issue with separated caches. (A code sketch of this pattern follows below.)
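A minimal sketch of the pattern in the figure (hypothetical code, assuming POSIX threads, compiled with -pthread): each thread updates only its own element of x, yet both elements live in the same cache line, so the line ping-pongs between the cores running the two threads.

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* x[0] and x[1] are adjacent, so they almost certainly share one
   cache line: writes from the two threads cause the line to bounce
   between cores even though no datum is actually shared. */
long x[2];

void *worker(void *arg) {
    long idx = (long)arg;
    for (long i = 0; i < ITERS; i++)
        x[idx]++;            /* different addresses, same cache line */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0);
    pthread_create(&t1, NULL, worker, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", x[0], x[1]);
    return 0;
}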

Avoiding False Sharing

Change either:
Algorithm:
adjust the implementation of the algorithm (e.g. the loop stride) so that each thread accesses data in a different cache line.
Or
Data structure:
add some padding to a data structure or array (just enough padding, generally less than the cache line size) so that threads access data from different cache lines (a sketch of the padding approach follows below).
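A minimal sketch of the padding approach (assuming a 64-byte line and POSIX threads, as in the earlier sketch): each thread's counter sits in its own padded slot, so the two counters fall in different cache lines and the line no longer ping-pongs.

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L
#define LINE_SIZE 64          /* assumed cache line size in bytes */

/* Pad each counter out to a full cache line so the two threads
   write to different lines. */
struct padded_counter {
    long value;
    char pad[LINE_SIZE - sizeof(long)];
};

struct padded_counter x[2];

void *worker(void *arg) {
    long idx = (long)arg;
    for (long i = 0; i < ITERS; i++)
        x[idx].value++;       /* now in separate cache lines */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0);
    pthread_create(&t1, NULL, worker, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", x[0].value, x[1].value);
    return 0;
}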
