Dissertation Seminar, 18/11 – 2005

Auditorium Minus, Museum Gustavianum

Software Techniques for Distributed Shared Memory

Zoran Radovic
zoran.radovic@it.uu.se

Zoran.Radovic@it.uu.se Dissertation Seminar Nov 18, 2005


Outline

• NUCA Locks
• DSZOOM – Software-based Shared Memory
• TMA – Trap-based Memory Architecture


Vasaloppet
“Contention Problem in Sweden”

Traditional cross-country ski race, 90 km …

[Photo of the race, annotated “85.6533 km to go… CS”]


Spin Locks under Contention

IF (more contention) →
THEN less efficient CS …

“The more important the slower it runs…”

[Plot: Critical Section (CS) Cost vs. Amount of Contention; curves: Spin locks, Spin locks with backoff]
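For readers without the plot, here is a minimal sketch of what "spin lock with backoff" usually means: a test-and-set lock that waits exponentially longer after each failed attempt. The names (`tas_lock`, `spin_delay`) and the backoff bounds are illustrative assumptions, not taken from the dissertation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical backoff bounds (loop iterations); the slides give no numbers. */
enum { BACKOFF_MIN = 4, BACKOFF_MAX = 1024 };

struct tas_lock { atomic_bool held; };

void tas_init(struct tas_lock *l) { atomic_init(&l->held, false); }

/* Burn roughly n loop iterations without touching the lock word. */
void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

/* One attempt: atomically set the flag and report whether it was free. */
bool tas_trylock(struct tas_lock *l) {
    return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

void tas_acquire(struct tas_lock *l) {
    unsigned backoff = BACKOFF_MIN;
    while (!tas_trylock(l)) {
        spin_delay(backoff);                  /* stay off the interconnect */
        if (backoff < BACKOFF_MAX) backoff *= 2;
    }
}

void tas_release(struct tas_lock *l) {
    assert(atomic_load(&l->held));            /* releasing a free lock is a bug */
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

Every failed attempt is still an atomic write to one shared location, which is one common explanation for why the cost of a critical section grows with the number of contenders.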


Queue-based Locks

IF (more contention)
THEN constant CS cost …

[Plot: CS Cost vs. Amount of Contention; curves: Spin locks, Spin locks with backoff, Queue-based locks]
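The result slides name MCS as the queue-based lock in the comparison. Below is a textbook-style sketch of an MCS lock in C11, given only as an illustration of the idea; it is not the implementation measured in the dissertation, and all identifiers are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the waiting queue */
    atomic_bool locked;                /* each waiter spins on its own flag */
};

struct mcs_lock { _Atomic(struct mcs_node *) tail; };

void mcs_init(struct mcs_lock *l) { atomic_init(&l->tail, NULL); }

void mcs_acquire(struct mcs_lock *l, struct mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Append ourselves to the queue with a single atomic swap. */
    struct mcs_node *prev =
        atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (prev == NULL) return;          /* queue was empty: lock is ours */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    while (atomic_load_explicit(&me->locked, memory_order_acquire)) { }
}

void mcs_release(struct mcs_lock *l, struct mcs_node *me) {
    assert(atomic_load(&l->tail) != NULL);   /* lock must be held */
    struct mcs_node *succ =
        atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        struct mcs_node *expected = me;
        /* No visible successor: try to mark the queue empty. */
        if (atomic_compare_exchange_strong_explicit(
                &l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor is enqueueing; wait until it links itself in. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL) { }
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Because each waiter spins on a flag in its own node, a release touches only one successor, which is consistent with the "constant CS cost" claim on the slide.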


This Dissertation

IF (more contention) →
THEN more efficient CS …

“The more important the faster it runs…”

[Plot: CS Cost vs. Amount of Contention; curves: Spin locks, Spin locks with backoff, Queue-based locks, NUCA locks]


NUCA Locks (Basic Idea)

1) Reduce traffic
   - one CPU per node is testing…
2) Improve lock handover
3) More efficient CS
   - local traffic is cheaper

[Diagram: three nodes, each with Memory, caches ($) and processors (P), connected through a Switch; processors are labelled Lock/Unlock or Test]


The HBO Lock (the simplest HBO)

• What do we need?
  - node_id
  - Compare&swap (CAS) atomic operation
      CAS(Lock_address, FREE, node_id)
    [Callout: Creates Communication Affinity]
• lock-acquire:
  - If the lock-value is in the state FREE:
    - The node_id is CAS-ed into the lock location
  - Else: 2 cases
    - The lock is “local” → Spin with small backoff
    - The lock is “remote” → Spin with large backoff
• Simple but fairly effective…
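The lock-acquire steps above can be sketched in C11 as follows. This is a minimal illustration under stated assumptions: node ids start at 1 so that 0 can encode FREE, the backoff lengths are invented, and the helper names (`cas`, `hbo_backoff_for`, `spin_delay`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

enum { HBO_FREE = 0 };   /* assumed encoding of the FREE state */
/* Hypothetical backoff lengths (loop iterations); not from the slides. */
enum { BACKOFF_LOCAL = 16, BACKOFF_REMOTE = 1024 };

typedef atomic_int hbo_lock;

void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

/* CAS that returns the value found in the lock word, as in
 * CAS(Lock_address, FREE, node_id) on the slide. */
int cas(hbo_lock *lock, int expected, int desired) {
    atomic_compare_exchange_strong(lock, &expected, desired);
    return expected;   /* unchanged on success, current value on failure */
}

/* Small backoff if the holder is in our node, large otherwise. */
unsigned hbo_backoff_for(int seen, int my_node) {
    return seen == my_node ? BACKOFF_LOCAL : BACKOFF_REMOTE;
}

void hbo_acquire(hbo_lock *lock, int my_node) {
    assert(my_node != HBO_FREE);       /* node ids must differ from FREE */
    for (;;) {
        int seen = cas(lock, HBO_FREE, my_node);
        if (seen == HBO_FREE) return;  /* our node_id is now in the lock */
        spin_delay(hbo_backoff_for(seen, my_node));
    }
}

void hbo_release(hbo_lock *lock) {
    atomic_store_explicit(lock, HBO_FREE, memory_order_release);
}
```

Since a failed CAS reveals which node holds the lock, waiters in the holder's node retry sooner than remote waiters, so the lock tends to be handed over within a node.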

Performance Results
Realistic microbenchmark, 2-node WildFire, 28 CPUs

[Two plots over critical_work (0–2000): Iteration Time [seconds] and Node Handoffs [%]; labels in the plots: WF, Spin, MCS, HBO; annotation: “Fairness?”]


Fairness Study
Realistic microbenchmark, 2-node WildFire, 28 CPUs

[Plot: Number of Finished Processors (0–28) vs. Time [seconds] (0–15); curves: Spin, MCS, HBO]


Application Performance
28-processor runs

[Bar chart: Normalized Speedup per application (Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, Water-Nsq, Average); bars: Spin, Spin EXP, MCS, HBO; annotation: “≈ 4x”]


Total Traffic: Raytrace

[Bar chart: Local Transactions and Global Transactions (y-axis 0–1.4) for Spin, Spin EXP, MCS, HBO]


HBO Locks inside Linux Kernel

• Patch provided by Silicon Graphics, Inc.
  - Linux-IA64 kernel implementation, May 2005
  - Page-fault handler runs 3x faster
  - 60 processors


Outline

• NUCA Locks
• DSZOOM – Software-based Shared Memory
• TMA – Trap-based Memory Architecture


The DSZOOM Proposal

• Run entire protocol in requesting-processor
  - No protocol agent communication!
• Assumes user-level remote memory access
  - put, get, and atomics [≈ InfiniBand]
• Fine-grain memory protocols (e.g., 64 bytes)
• Hardware-like memory models
  [Shasta, Blizzard, Sirocco]


“Squeezing” Protocols into Binaries…

Original Program        DSZOOM Program
...                     ...
cmp %g0, %l5            cmp %g0, %l5
bne 0x24431             bne 0x24431
nop                     nop
ld [%o1 + 64], %o0      ld [%o1 + 64], %o0
                        mov 255, %g6          <- Fast-path Protocol Code
                        and %g6, %o0, %g6
                        cmp %g6, 170
                        bne 0x24450
                        nop
ldd [%o0 + 16], %f4     ldd [%o0 + 16], %f4
clr %l5                 clr %l5
...                     ...

Slow-path Protocol Code (C-code)
Binary/Assembler level instrumentation


Write Permission Caching

• Problem: store instrumentation relies on locking
  - More complex instrumentation
• Solution: write permission cache (WPC)
  - Small and fast software-managed cache
  - Keeps write permissions
• The WPC idea:
  - Exploit store locality
  - Dynamically reduce the number of memory references in store checking code


Other “Features”

• Two kinds of protocols
  - Invalidate
  - Update
• Many optimizations
  - Instrumentation scheduling (update and invalidate)
  - Instrumentation batching (invalidate)
  - WPC-based write batching (update)
  - WPC-based dirty-data filtering (update)
  - Private-data filtering (update)
  - # of WPC entries (update and invalidate)
  - Coherence unit size (update and invalidate)


Coherence Flags and Profiling

• Coherence flags
  - Similar to optimization flags of compilers
  - Possible scenario:
      gcc -dszoom-cl 128 -dszoom-inv -O3 my_app.c
• Execution profiling
  - Similar to profile feedback of compilers
  - Helps find appropriate coherence flag settings
  - Low-overhead implementation in DSZOOM
    - Less than 30 percent overhead
  - Works for both small and large input sets


DSZOOM Results
2-node WildFire, 16 CPUs

[Bar chart: Normalized Execution Time (0.0–3.0) per application (barnes, fft, fmm, lu-c, lu-nc, ocean-c, ocean-nc, radiosity, radix, raytrace, water-nsq, water-sp, average); legend as extracted: HW-DSM inv-64 inv-dwpc-64 PROFILED BEST; annotations: “1.45x”, “1.11x”]


Outline

• NUCA Locks
• DSZOOM – Software-based Shared Memory
• TMA – Trap-based Memory Architecture


Instrumentation Drawbacks

Original Program        DSZOOM Program
...                     ...
cmp %g0, %l5            cmp %g0, %l5
bne 0x24431             bne 0x24431
nop                     nop
ld [%o1 + 64], %o0      ld [%o1 + 64], %o0
                        mov 255, %g6          <- Fast-path Protocol Code
                        and %g6, %o0, %g6
                        cmp %g6, 170
                        bne 0x24450
                        nop
ldd [%o0 + 16], %f4     ldd [%o0 + 16], %f4
clr %l5                 clr %l5
...                     ...

Slow-path Protocol Code (C-code)

• Binary transparency?
• Run-time execution overhead


Trap-Based Memory Architectures

• Basic idea
  - Detect fine-grained coherence violations in hardware
  - Trigger a coherence trap when one occurs
  - Maintain coherence by software protocols
• No memory system modifications
  - Minimal processor modifications
• Binary Transparency
  - No need to instrument binaries/applications

TMA Lite
Proof-of-concept Implementation

• Load permission check
  - Hardware implementation of software check
    - Predefined “magic-value” convention
• Store permission check
  - Hardware WPC
• Can be seen as a very small cache
  - Operates on virtual addresses
  - Accessed in parallel with the data TLB


TMA Lite Performance
[TMA: simulation study, 4 nodes | DSZOOM: 2-node WildFire]

[Bar chart: Normalized Execution Time (0–2.5) per application (fft, lu-c, lu-nc, radix, water-nsq, water-sp, average); legend as extracted: HW-DSM DSZOOM DWPC PROFILED BEST TMA; annotations: “1.75x”, “1.01x”]


Topics not Presented

• RH lock algorithm
  - Controlled (un)fairness
• HBO_GT and HBO_GT_SD algorithms
  - Global throttling and starvation detection
• DSZOOM implementation details
  - Instrumentation challenges; scheduling, batching, etc.
  - Bandwidth filtering techniques; dirty- & private-data
• Innovative TMA simulation tricks
  - Low-level “good days” hacks
  - Reusing Simics checkpoints


