Zoran Radovic
zoran.radovic@it.uu.se
NUCA Locks
[Figure: Critical Section (CS) Cost versus Amount of Contention, with curves for spin locks, spin locks with backoff, and queue-based locks]
Annotation on the first builds: IF (more contention) THEN less efficient CS …
Annotation on the final build: IF (more contention) THEN more efficient CS …
1) Reduce traffic
- one CPU per node is testing…
2) Improve lock handover
3) More efficient CS
- local traffic is cheaper
[Figure: nodes of processors (P) with caches ($) and per-node Memory, connected by a Switch]
What do we need?
Compare&swap (CAS) atomic operation
node_id
CAS(Lock_address, FREE, node_id)
Creates communication affinity
lock-acquire:
If the lock-value is in the state FREE:
• The node_id is CAS-ed into the lock location
Else: 2 cases
• The lock is “local”: spin with small backoff
• The lock is “remote”: spin with large backoff
Simple but fairly effective…
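The lock-acquire steps above can be sketched in C11 as follows. This is a minimal illustration, not the implementation from the thesis. It assumes that FREE is encoded as 0 and node ids are nonzero, that the backoff grows exponentially up to a cap, and that the backoff bounds shown are placeholders. The names hbo_lock_t, hbo_try_acquire, hbo_acquire, hbo_release and spin_delay are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define HBO_FREE 0u  /* assumption: node ids are nonzero, so 0 can encode FREE */

/* Hypothetical backoff bounds in delay-loop iterations; real values need tuning. */
#define LOCAL_BACKOFF_MIN   4u
#define LOCAL_BACKOFF_MAX   64u
#define REMOTE_BACKOFF_MIN  64u
#define REMOTE_BACKOFF_MAX  4096u

typedef struct {
    _Atomic uint32_t value;  /* HBO_FREE, or the node_id of the current holder */
} hbo_lock_t;

/* Busy-wait for roughly n loop iterations. */
static void spin_delay(unsigned n)
{
    for (volatile unsigned i = 0; i < n; i++) {
    }
}

/* One CAS(Lock_address, FREE, node_id). Returns the value seen in the lock:
 * HBO_FREE means the caller now holds the lock, anything else is the
 * node_id of the current holder. */
uint32_t hbo_try_acquire(hbo_lock_t *lock, uint32_t node_id)
{
    uint32_t seen = HBO_FREE;
    atomic_compare_exchange_strong(&lock->value, &seen, node_id);
    return seen;
}

void hbo_acquire(hbo_lock_t *lock, uint32_t node_id)
{
    unsigned local_delay = LOCAL_BACKOFF_MIN;
    unsigned remote_delay = REMOTE_BACKOFF_MIN;

    assert(node_id != HBO_FREE);
    for (;;) {
        uint32_t holder = hbo_try_acquire(lock, node_id);
        if (holder == HBO_FREE)
            return;                      /* lock was FREE: node_id is now stored */
        if (holder == node_id) {         /* "local": held within our own node */
            spin_delay(local_delay);
            if (local_delay < LOCAL_BACKOFF_MAX)
                local_delay *= 2;
        } else {                         /* "remote": held by another node */
            spin_delay(remote_delay);
            if (remote_delay < REMOTE_BACKOFF_MAX)
                remote_delay *= 2;
        }
    }
}

void hbo_release(hbo_lock_t *lock)
{
    atomic_store(&lock->value, HBO_FREE);
}
```

A thread running on node n would call hbo_acquire(&lock, n) before its critical section and hbo_release(&lock) after it. Because the failed CAS returns the holder's node_id, a waiter can tell local from remote without any extra state.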
Zoran.Radovic@it.uu.se Dissertation Seminar Nov 18, 2005
Performance Results
Realistic microbenchmark, 2-node WildFire, 28 CPUs
[Figure: Iteration Time [seconds]; labels: Spin, MCS, WF; annotation: Fairness?]
[Figure: Number of Finished Processors (0 to 28) versus Time [seconds] (0 to 15), for Spin, MCS, and HBO]
[Figure: per-application bar charts comparing Spin, Spin EXP, MCS, and HBO; application labels not recoverable]
Original Program:
    ...
    cmp %g0, %l5
    bne 0x24431
    nop
    ld [%o1 + 64], %o0
    ldd [%o0 + 16], %f4
    clr %l5
    ...

DSZOOM Program:
    ...
    cmp %g0, %l5
    bne 0x24431
    nop
    ld [%o1 + 64], %o0
    ! Fast-path Protocol Code
    mov 255, %g6
    and %g6, %o0, %g6
    cmp %g6, 170
    bne 0x24450
    nop
    ldd [%o0 + 16], %f4
    clr %l5
    ...

Slow-path Protocol Code (C-code)
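As a reading aid, the inline check in the DSZOOM listing can be transliterated into C roughly as below. This is a sketch under stated assumptions, not DSZOOM's actual protocol code. The mask 255 and the constant 170 are taken from the listing. Everything else is an assumption: that the branch target 0x24450 leads to the slow-path protocol code, and the names instrumented_load, slow_path_fn, demo_slow_path and slow_path_calls, which are hypothetical.

```c
#include <stdint.h>

#define CHECK_MASK   255u  /* from "mov 255, %g6" */
#define CHECK_VALUE  170u  /* from "cmp %g6, 170" */

/* Hypothetical slow-path entry point: receives the address of the load
 * and returns the value the program should continue with. */
typedef uint32_t (*slow_path_fn)(const uint32_t *addr);

/* Stand-in slow path for illustration only: counts calls and rereads. */
unsigned slow_path_calls = 0;

uint32_t demo_slow_path(const uint32_t *addr)
{
    slow_path_calls++;
    return *addr;
}

/* Load a word, then run the inline check that follows the load in the
 * DSZOOM listing. */
uint32_t instrumented_load(const uint32_t *addr, slow_path_fn slow_path)
{
    uint32_t loaded = *addr;                    /* ld  [%o1 + 64], %o0 */
    uint32_t low = loaded & CHECK_MASK;         /* and %g6, %o0, %g6   */

    if (low != CHECK_VALUE)                     /* cmp %g6, 170 ; bne  */
        return slow_path(addr);                 /* assumed slow-path target */
    return loaded;                              /* fall through to original code */
}
```

The point of the structure is that the common case costs only a few extra instructions next to the original load, while the rarer case leaves the inline code and calls into protocol code written in C.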
Binary/Assembler level instrumentation
Many optimizations
Instrumentation scheduling (update and invalidate)
Instrumentation batching (invalidate)
WPC-based write batching (update)
WPC-based dirty-data filtering (update)
Private-data filtering (update)
# of WPC entries (update and invalidate)
Coherence unit size (update and invalidate)
Coherence flags
Similar to optimization flags of compilers
Possible scenario:
gcc -dszoom-cl 128 -dszoom-inv -O3 my_app.c
Execution profiling
Similar to profile feedback of compilers
Helps finding appropriate coherence flag settings
Low overhead implementation in DSZOOM
• Less than 30 percent overhead
Works for both small and large input sets
[Figure: per-application bar chart; annotations: 1.45x, 1.11x; application labels not recoverable]
Original Program vs. DSZOOM Program (instrumentation listing, repeated):
• Binary transparency?
• Run-time execution overhead
Basic idea
Detect fine-grained coherence violations in hardware
Trigger a coherence trap when one occurs
Maintain coherence by software protocols
Binary Transparency
No need to instrument binaries/applications
TMA Lite
Proof-of-concept Implementation
[Figure: Normalized Execution Time per application; annotations: 1.75x, 1.01x; application labels not recoverable]
RH lock algorithm
Controlled (un)fairness
HBO_GT and HBO_GT_SD algorithms
Global throttling and starvation detection
DSZOOM implementation details
Instrumentation challenges; scheduling, batching, etc.
Bandwidth filtering techniques; dirty- & private-data
Innovative TMA simulation tricks
Low-level “good days” hacks
Reusing Simics checkpoints