Multicore Optimization
Philip Mucci
NOTUR 2009
Schedule
10:15 Begin
(1:00 session)
11:15-11:30 Coffee
(1:30 session)
13:00-14:00 Lunch
(2:00 session)
16:00 Finish
18:00 Beer
Outline
- Commentary
- HW/SW Overview
- Working With the Compiler
- Single Core Optimization
Initial Commentary
This tutorial is partially subjective. Expect some contradictions to your experience(s).
Don't panic.
Shared address space means direct access to data. Most computers are doing many different activities at once. (What about us?)
We really can't get much faster without liquid cooling. Multicore appears to lower power and cooling requirements per FLOP.
- Compilers are still limited by static analysis.
- Most developers in technical computing are not (supposed to be) computer architects.
- Languages have further abstracted performance.
- DRAM ignores Moore's Law.
- Memory controllers are getting neither bigger nor smarter.
Evaluation of Multicore
Lots of high GF cores with many shared resources means more work for you. Resource constraints must be examined system wide, with attention to per-core performance.
- Size/speed of dedicated cache/TLB
- Memory bandwidth and latency per core
- On/off-chip communications per core
Multicore Performance
Cores communicate only through cache and memory. There are no dedicated instructions for synchronization, communication or dispatch; it must all be done by (slow) software.
External bandwidth is limited by pins and power. Starvation is a very real issue at every shared resource.
Architecture Overview
Hardware
- Caches, Coherency and Prefetching
- Translation Lookaside Buffers (TLB)
- Hardware Multithreading (SMT/HT)
- Threads vs. Processes
Software
Multicore has made this (much) worse in order to claim increased peak performance.
- Make cores as fast/slow as main memory (SiCortex, Tensilica)
- Add faster/closer memory pipes (Opteron, Nehalem)
- Streaming compute engines (NVIDIA, AMD), vectorized memory pipelines (Convey)
Uniform memory access (UMA):
- The cost of memory access is uniform for every core.
- Less work for the programmer, OS, etc.
- Not possible to scale (well); a crossbar works, but ultimately you have to slow nearest neighbors down.
Non-uniform memory access (NUMA):
- All memory is not the same: memory can be far from the CPU.
Caches
Memory is moved in and out of caches in blocks called lines, usually 32 to 256 bytes.
Multiple levels of cache, with at least one level dedicated to a single core, e.g.:
- 32K Level 1 -> 1 core
- 1MB Level 2 -> 2 cores, 1 die
- 8MB Level 3 -> 4 cores, 1 package
- Spatial: if I reference address M(n), it is likely that M(n ± z) will be used, where z is small.
- Temporal: if I reference address M(n) at time t, it is likely that M(n) will be used again at time t + t', where t' is small.
If this holds for one core, then for us (technical computing) it holds for multicore.
Cache Architecture
- Memory cannot live just anywhere in a cache.
- Cache associativity: the number of unique places in a cache where any given memory item can reside.
- The location is determined by some bits in the physical or virtual address. Direct mapped means only one location.
- Higher associativity is better, but costly in terms of gates and complexity (power and performance).
Why do we care?
Tuning for cache yields most of your performance gain. That remains true on multicore, but there is also the opportunity to create contention.
It can happen: cache-unfriendly code may run faster than the same code highly tuned without thought to contention.
Cache Consistency
For correctness, cores must see a consistent view of memory through the caches. Thus the caches communicate with a protocol that indicates the state of each cache line. Most common is MESI.
With multiple cores, more than one level of the hierarchy can keep MESI states. In Nehalem, L3 keeps state per socket; L1 and L2 per core. The resulting coherence traffic can be identified with the right tools and insight.
Hardware Prefetching
Automatic:
- Streams of reference are automatically predicted by the hardware.
- N consecutive misses trigger a fetch of lines ahead.
- Usually unit line stride and one or two streams.
Prefetch instructions:
- Many variants, separate from the above, but they may trigger it.
- The addresses can even be invalid (no fault is taken).
Software Prefetching
- Compiler-inserted: to do a good job, the compiler needs to predict misses or find slots in which to hide them.
- Explicit: high-level manual placement, or the programmer puts in assembly primitives.
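As a sketch of explicit placement, GCC's __builtin_prefetch intrinsic can be inserted by hand; the distance of 8 iterations below is an illustrative guess, not a recommendation:

void scale(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++) {
        /* Read prefetch, low temporal locality; prefetching past
           the end of b is harmless (prefetches never fault). */
        __builtin_prefetch(&b[i + 8], 0, 1);
        a[i] = 2.0 * b[i];
    }
}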
TLB
The TLB caches virtual-to-physical address translations, because computing the translation is expensive! It is usually fully associative and multi-level. A TLB miss can be very expensive.
Big pages are good for lowering TLB misses. But the OS moves data in pages!
If the data is in physical memory, then this is a minor page fault. If the data is on disk, this is a major page fault.
Simultaneous Multithreading
- Hyperthreading is Intel's name for it.
- Physical resources on the chip are shared among multiple hardware threads, except context (PC and registers).
- The goal is to hide instruction latency: when one instruction stalls, try another thread; the OS does not need to context switch.
- To the OS, it looks like an SMP.
- This really only works when you've got a rich and diverse instruction mix among threads.
Vector Instructions
- More accurately called micro-vector instructions; real vector machines do this on 1000's of items.
- A register contains a number of byte, int, float, etc. elements.
A thread is an execution context: machine state scheduled by the operating system. A process is a thread plus virtual memory, files, etc.
A process can contain multiple threads. Each thread in a process shares everything except its private state (stack, registers and program counter).
Other threads can steal the CPU, as can OS work like hard and soft interrupts. OSes do this to provide fairness. Linux does understand that there is a cost penalty when moving a thread from one core to another. Rescheduling a new thread often involves cache and TLB flushing.
Function calls are cheap: just a few register bits. Many system calls, however, contain locks, are serialized, or are not scalable.
Architecture Summary
Single-core performance comes first!
- Use all of the cache
- Avoid cache conflicts
- Minimize shared resource contention

Working With the Compiler
The compiler can't read your mind, only your code. Correctness is always emphasized over performance. For popular and simple constructs, the compiler will usually do a better job than you. But as code gets more abstract, it can't guess the things that matter!
Understanding Compilers
The best things you can do to work with the compiler are:
- Learn one compiler well and stick with it.
- Clearly express your intentions to the compiler through flags, directives and good code structure.
IEEE has a standard for floating-point correctness (IEEE 754). Applications relax that standard for performance, and because 'correct' is somewhat arbitrary.
Inlining
Replacing a subroutine call with the code of the called function. Good because:
- Function calls inside loops (often) inhibit vectorization.
- Function calls are not free; they take cycles to set up and tear down.
Vectorization
- Helped by inlining, unrolling, fusion, SWP, IPA, etc.
- GPGPUs and FPGAs
IPO/IPA
Interprocedural Optimization/Analysis
- The compiler can move, optimize, restructure and delete code between procedures and files.
- Generates intermediate code at compile time; generates object code during the final link.
- Stronger typing of pointers, arguments and data structures can vastly increase its effectiveness.
Software Pipelining
Keep more intermediate results in registers and cache. Inhibited by:
- Loop count
- Inter-iteration dependencies
- Aliasing
Also note that loop set up and tear down can be costly.
Pointer Aliasing
Compilers must assume that memory referenced through pointers has changed or overlaps. This is called the pointer aliasing problem, and it is really bad in C and C++.
Types of Aliasing
- Strict: pointers don't alias if they are different types.
- Typed: pointers of the same type can alias and overlap.
- Restricted: pointers of the same type are assumed to not overlap.
- Disjointed: all pointer expressions result in no overlap.
Profile-Guided Optimization
- a.k.a. Feedback Directed Optimization: collect data about what the code really does, then adapt.
- An old idea, but (still) not very well developed.
- Important for:
  - Branches (I-cache/ITLB misses, branch misprediction)
  - Loop bounds (unroll, SWP, jam, etc.)
Compiler Flags
It is better to start at -O(big) and disable optimizations than to work the other way around. Develop your knowledge of what to turn off. Compiler documentation is usually clear about which optimizations can result in wrong answers.
GCC:
- -Q --help=optimizers shows which optimizations are enabled.
- -O2 and higher assume strict aliasing.
- Profile feedback: the first time compile with -fprofile-generate, subsequent times with -fprofile-use.
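A minimal sketch of that profile-feedback cycle with GCC (the file name and workload are hypothetical):

gcc -O3 -fprofile-generate -o app app.c
./app typical_input.dat        # run a representative workload
gcc -O3 -fprofile-use -o app app.c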
Open64-family compilers (e.g. PathScale) take most of the same flags as GCC. To find out what the compiler is doing:
- -LNO:vintr_verbose=1, -LNO:simd_verbose=1
- Profile feedback: the first time use -fb_create fbdata, subsequent times use -fb_opt fbdata.
Intel:
- -ip is a subset of -ipo for single files.
- -shared-intel allows tools to work.
- -opt-report [0123], -opt-report-file f.
- Profile feedback: the first time use -prof-gen, subsequent times use -prof-use.
- Loop count
- Loop unroll
- Loop split
Limiting Aliasing
restrict keyword:
- Part of the C99 standard (-std=c99 with GCC).
- Promises the compiler that a pointer refers to unique memory.
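A minimal sketch of restrict in use (the function and array names are illustrative):

void axpy(int n, double a, const double * restrict x, double * restrict y)
{
    /* restrict promises x and y do not overlap, so the compiler
       may vectorize and software-pipeline this loop freely. */
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}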
Aligning Data
/* Intel, align to 16 bytes */
__declspec(align(16)) unsigned long lock;

/* GCC */
unsigned long lock __attribute__ ((aligned(16)));
- static: in global scope, used only in this file.
- const: the data or location never changes.
- volatile: the data may change from an alias outside of scope.
- inline: inline all the time.

Single Core Optimization
"The single most important impediment to good parallel performance is still single-node performance."
(William Gropp, Argonne National Lab)
s/parallel/multicore/; s/node/core/;
- Cache gets you all of your performance.
- Compilers like to optimize loops without function calls in them.
- Function calls are not free, system calls are slower, and I/O is even worse.
- Allocation
- Unit stride reference
- Initialization
- Padding
- Packing
- Stride minimization
- Blocking
- Unrolling
Code Examples
Array Allocation
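The original example is not preserved here; a minimal sketch of allocating a 2D array as one contiguous block in C (the names nx, ny and the A macro are illustrative):

#include <stdlib.h>

/* One allocation: rows are contiguous, so row-wise walks are unit stride. */
double *a = malloc((size_t)nx * ny * sizeof *a);
#define A(i, j) a[(size_t)(i) * ny + (j)]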
When data comes in, think about using as much of it as possible, as soon as possible. When touching large amounts of memory, TLB misses and page faults can be a concern.
Array Referencing
In C, the rightmost index should move the fastest: [x,Y]. In Fortran, the leftmost index should change the fastest: (X,y).
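A minimal C sketch (row-major layout, so the inner loop should walk j):

/* Good: unit stride in the inner loop. */
for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
        sum += a[i][j];

/* Bad: the inner loop strides by M elements per iteration. */
for (j = 0; j < M; j++)
    for (i = 0; i < N; i++)
        sum += a[i][j];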
Array Initialization
- No one really uses formal static initialization anymore: it wastes space, restricts the program, etc.
- But static bounds were great for optimizers.
- Today we have dynamic array allocation on the stack and run-time specification of array bounds.
- The simpler and more expressive the code, the better: array addressing can waste a lot of cycles.
Array Padding
- A Fortran common block is a contiguous region of memory.
- Lots of codes just use power-of-2 dimensions. Yikes!
- The same can easily be true of dynamically allocated memory.
- Some systems love aligned data.
Intra-Array Padding
Consider striding across each dimension, as in a transpose.
- C: avoid a leading dimension that is a power of 2.
- Fortran: avoid a trailing dimension that is a power of 2.
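A minimal sketch in C; the pad of one element is illustrative:

/* 1024 doubles = 8KB per row: walking a column maps every access to
   the same cache sets. Padding each row by one element staggers them. */
double a[1024][1024 + 1];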
Structure Packing
- Unaligned access to data is usually slower, so align items on word, double-word or bigger boundaries.
- Pack from smallest to largest, maybe add padding? But this is also a multicore problem! (more later)

struct { short s; int i; char c; void *p; };
struct { void *p; int i; short s; char c; };
Stride Minimization
- Always think about spatial and temporal locality.
- Often, poor stride is just an oversight of the original implementor.
- With simple loops, the compiler will interchange them for you.
Cache Blocking
A standard transformation:
- Most compilers are decent at it, if the loop is simple and has no subroutine calls or side effects.
- Helps when the potential for re-use is high.
- Naturally blends with sum reduction and unrolling.
- Good for multicore too, but some caches are shared! And which loop should we parallelize?
Cache Blocking
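The original code example is not preserved; a minimal sketch of blocking a transpose (the block size B is a tunable assumption):

#define B 64  /* tune so a BxB tile of each array fits in the per-core cache */
for (ii = 0; ii < n; ii += B)
    for (jj = 0; jj < n; jj += B)
        for (i = ii; i < ii + B && i < n; i++)
            for (j = jj; j < jj + B && j < n; j++)
                c[j][i] = a[i][j];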
Loop Unrolling
A standard transformation to improve processor pipeline utilization and reduce loop overhead: more work is done per iteration. Inhibited by:
- Function calls inside the loop
- Inter-iteration dependencies
- Global variables
- Pointer aliasing
Loop Unrolling
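The original example is not preserved; a minimal sketch of 4-way unrolling with a remainder loop:

for (i = 0; i + 3 < n; i += 4) {
    y[i]     += a * x[i];
    y[i + 1] += a * x[i + 1];
    y[i + 2] += a * x[i + 2];
    y[i + 3] += a * x[i + 3];
}
for (; i < n; i++)  /* remainder iterations */
    y[i] += a * x[i];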
When a loop has a data dependency that introduces serialization, the solution is to unroll and introduce intermediate registers.
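A minimal sketch with two partial sums; note this reassociates the floating-point additions, so verify the results (assumes n is even for brevity):

/* Serialized: each add depends on the previous sum. */
for (i = 0; i < n; i++) sum += x[i];

/* Two independent accumulators keep the pipeline busy. */
for (i = 0; i < n; i += 2) { s0 += x[i]; s1 += x[i + 1]; }
sum = s0 + s1;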
The goal is to reduce the number of loads and stores in inner loops that reference invariants.
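A minimal sketch of unrolling the outer loop of a matrix-vector product so each loaded x[j] is reused (assumes n is even):

/* Baseline: x[j] is loaded once per (i,j) pair. */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        y[i] += a[i][j] * x[j];

/* Outer loop unrolled by 2: each x[j] load now feeds two rows. */
for (i = 0; i < n; i += 2)
    for (j = 0; j < n; j++) {
        y[i]     += a[i][j]     * x[j];
        y[i + 1] += a[i + 1][j] * x[j];
    }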
Loop Jam/Fusion
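The original example is not preserved; a minimal sketch of fusing two loops over the same array:

/* Two sweeps: a[] is streamed through the cache twice. */
for (i = 0; i < n; i++) b[i] = a[i] * 2.0;
for (i = 0; i < n; i++) c[i] = a[i] + 1.0;

/* Fused: one sweep, each a[i] is loaded once. */
for (i = 0; i < n; i++) {
    b[i] = a[i] * 2.0;
    c[i] = a[i] + 1.0;
}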
Loop Defactorization
- Reduce the number of array elements referenced, to reduce cache traffic.
- But floating point operations are not always associative: (A + B) + C != A + (B + C).
- Verify that your results are still correct.
Loop Defactorization
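The original example is not preserved; a minimal sketch of the idea (factoring out a common operand, with the associativity caveat above):

/* Two multiplies per element. */
for (i = 0; i < n; i++) y[i] = x[i] * c1 + x[i] * c2;

/* Factored: one multiply per element. */
for (i = 0; i < n; i++) y[i] = x[i] * (c1 + c2);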
Loop Peeling
For loops which access previous elements of an array: the compiler cannot determine that an item does not need to be reloaded on every iteration.
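A minimal sketch carrying the previous value in a scalar (the names are illustrative):

/* a[i-1] was just computed, yet is reloaded from memory. */
for (i = 1; i < n; i++) a[i] = a[i - 1] + b[i];

/* Carried in a register instead. */
double t = a[0];
for (i = 1; i < n; i++) {
    t += b[i];
    a[i] = t;
}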
Loop Collapse
Reduce address computation and loop nesting: this reduces loop overhead and increases the chance of vectorization.
Loop Collapse
This can be especially effective in C and C++, where often macros are used to compute multidimensional array offsets.
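A minimal sketch of collapsing a 2D initialization into one loop over contiguous storage:

/* Nested: loop overhead and addressing per (i,j). */
for (i = 0; i < nx; i++)
    for (j = 0; j < ny; j++)
        a[i][j] = 0.0;

/* Collapsed: one loop, one stream. */
double *p = &a[0][0];
for (k = 0; k < nx * ny; k++)
    p[k] = 0.0;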
If statements in Loops
We already know many optimizations that conditionals inhibit. One approach: unroll the loop, move the conditional elements into scalars early, and test the scalars at the end of the loop.
Floating IF's
IF statements that do not change from iteration to iteration can be hoisted. Compilers are usually good at this, except when:
- Loops contain calls to procedures
- Loops have variable bounds
- Loops reference global variables that may be aliased to data in the IF statement
Floating IF's
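The original example is not preserved; a minimal sketch of hoisting a loop-invariant test:

/* Invariant condition tested n times. */
for (i = 0; i < n; i++) {
    if (beta == 0.0) y[i] = 0.0;
    else             y[i] = beta * y[i];
}

/* Hoisted: one test, two specialized loops. */
if (beta == 0.0)
    for (i = 0; i < n; i++) y[i] = 0.0;
else
    for (i = 0; i < n; i++) y[i] = beta * y[i];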
Some Results
Taken years ago on 3 different architectures with the best compilation technology of the time. Percentages are relative to the untuned -O3 version.
Technique              Arch A   Arch B   Arch C
Stride Minimization       35%       9%     100%
Fusion                    69%      80%      81%
Interchange               75%     100%     100%
Floating IF's             46%     100%     101%
Loop Defactor             66%      76%      94%
Loop Peeling              97%      64%      81%
Loop Unrolling            97%      89%      67%
Loop Unroll + SumR        77%     100%      39%
Outer Loop Unrolling      83%      26%      46%
Indirect Addressing
X(I) = X(I) * Y(A(I))
- Very hard for a compiler to optimize.
- Very difficult for normal memory subsystems, which are just bad at pseudorandom accesses.
- Hardware prefetch can mitigate, but can also hurt.
- Consider using a sparse solver package.
- Or block your data into small, cache-line-sized chunks and do some redundant computation.
Gather-Scatter Optimization
For loops with conditional work: split the loop, gathering an index array of the iterations where work needs to be done. This can increase pipelining and the effectiveness of prefetching, and enables other loop optimizations.
Gather-Scatter Optimization
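The original example is not preserved; a minimal sketch (the names idx and expensive() are illustrative):

/* Conditional work inside the loop. */
for (i = 0; i < n; i++)
    if (t[i] > 0.0) x[i] = expensive(x[i]);

/* Gather the indices needing work... */
for (i = 0, m = 0; i < n; i++)
    if (t[i] > 0.0) idx[m++] = i;
/* ...then run a dense, branch-free work loop. */
for (k = 0; k < m; k++)
    x[idx[k]] = expensive(x[idx[k]]);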
C++ Considerations
- Extensive use creates much greater memory pressure: lots and lots of pointers.
- Dynamic typing and polymorphism are not free.
- Make use of the inline, const and restrict keywords.
- Use STL, Boost and other support libraries: they express more of the author's intent to the compiler, which can increase performance.
- But be careful with all of the above on multicore.
Fortran Considerations
Fortran 90 Arrays
The (:) syntax is very useful, but it can hide a significant amount of data movement, often repeatedly. It:
- Pollutes the caches
- Creates temporaries that may have pathological alignment, especially with 'assumed shape' arrays
WHERE constructs
- A construct for masking array operations.
- The generated code is often required to be a loop containing an if statement: highly inefficient.
- Consider instead multiplying by a 0-or-1 mask array of the same shape into a temporary.
Advantages:
- Many good programmers are focused on multicore development: performance, portability, prototyping.
- Let someone else solve the hard problems.
Disadvantages:
- Extensive use can lead to vertical code structure.
- It may make performance debugging difficult.
Sample DGEMM
Naive -> Advanced -> Optimal
Multicore Programming
Multithreaded Programming
Expressing Parallelism
Data parallelism:
- Same operation on every thread, using different data
- OpenMP, UPC, Co-Array Fortran, etc.
Message Passing
- The program explicitly exchanges data. Semantics are send/receive (identified by tag) or get/put (direct to an address).
- Ordering and consistency are somewhat implicit.
- Synchronization is usually not needed.
- Nothing is really shared other than task IDs.
Shared Memory
- Communication is by load/store or a language feature; in higher level models, data is declared shared.
- Synchronization is needed.
- The receiver is not expected to be in the same address space as the sender.
- Data was expected to be copied, packed, sent, received, unpacked, copied, etc. Much work has been done to eliminate the copies.
- You can get a 4-byte message across a wire in 1 us these days (if you do it 1000's of times and average).
- MPI-2 introduced some get/put primitives to allow more direct access to remote memory.
- They are not nearly as lightweight or flexible as they should have been, thus limited acceptance; they also require synchronization.
- You had to run multiple processes on a multicore machine. Things are different now.
Many MPIs are now both thread-safe and tuned for on-node, shared memory operation. This means you can easily use MPI for multicore programming: data exchange and synchronization are coded explicitly, and the code may be easier to read and tune. Disadvantages are:
- You can lose a substantial amount of performance.
- The granularity of parallelism must be coarse.
- The programming model is limited.
Pthreads
The basic primitives are create, join and mutex, used in combination with messages to build different models.
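A minimal create/join/mutex sketch (compile with -pthread; the shared counter workload is illustrative):

#include <pthread.h>
#include <stdio.h>

static long counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* protect the shared counter */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}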
Pthreads
- Extreme care must be used to coordinate access to global data structures: reads and writes need to be consistent.
- Different cores should be working with different cache lines.
- 3 types of synchronization: mutexes, condition variables and rwlocks.
Logical Parallelism
Keep chunks of work and threads separate: send work to threads in chunks, and if a thread finishes, give it more work. Ideally, the programmer should not have to think about processors, just think in parallel!
OpenMP
Designed for quick and easy parallel programming of shared memory machines. Works by inserting compiler directives in code, usually around loops. Threads are started implicitly and fed work.
OpenMP Directives
Parallelization:
- parallel, for, do, workshare, section, sections, task, single, master
Data scoping:
- shared, private, threadprivate, copyprivate, firstprivate, lastprivate, reduction
Synchronization:
- barrier, ordered, critical, atomic, flush, nowait
!$OMP PARALLEL SHARED(A,B,C,D), PRIVATE(I)
!$OMP SECTIONS
!$OMP SECTION
      DO I = 1, N
         C(I) = A(I) + B(I)
      ENDDO
!$OMP SECTION
      DO I = 1, N
         D(I) = A(I) * B(I)
      ENDDO
!$OMP END SECTIONS
!$OMP END PARALLEL
      END
CILK
- A handful of new keywords: cilk, spawn and sync.
- Work-stealing scheduler: the programmer just thinks in parallel, the scheduler does the work. http://supertech.csail.mit.edu/cilk
- Cilk++: C++ plus parallel loops. http://www.cilk.com
Cilk Example
/* Serial version */
int fib(int n) {
    if (n < 2) return (n);
    else {
        int x, y;
        x = fib(n - 1);
        y = fib(n - 2);
        return (x + y);
    }
}

/* Cilk version */
cilk int fib(int n) {
    if (n < 2) return (n);
    else {
        int x, y;
        x = spawn fib(n - 1);
        y = spawn fib(n - 2);
        sync;
        return (x + y);
    }
}
Charles E. Leiserson http://supertech.csail.mit.edu/cilk/lecture-1.pdf
Unified Parallel C
- shared, relaxed and strict keywords.
- Intrinsic functions for sync, get/put, collectives, worksharing, I/O.
- Remote references happen on single data items when they are used (not before); the compiler must hoist and aggregate communication.
UPC Example
shared int our_hits[THREADS]; /* single writer */
main(int argc, char **argv) {
    int i, hits, trials = 1000000;
    double pi;
    seed48((MYTHREAD + 1) * THREADS);
    for (i = 0; i < trials; i++)
        hits += hit();
    our_hits[MYTHREAD] = hits;
    upc_barrier;
    for (i = 0, hits = 0; i < THREADS; i++)
        hits += our_hits[i];
    pi = 4.0 * hits / trials;
    printf("Thread %d estimates pi = %g\n", MYTHREAD, pi);
}
Co-Array Fortran
- [n|:] notation after arrays denotes the processor.
- Performance issues are more serious.
- Part of Fortran 2008. G95 has some support; Cray has a commercial product. http://www.co-array.org/
- Titanium (Java-like)
- Fortress (ML/Haskell-like)
- Chapel (C/Java-like)
- X10 (Java-like)
Ok, but let's remember what happened to HPF, ZPL, Split-C, etc...
Question
When should I be trying to use CUDA over (my) multicore CPU? My experience with CUDA is that I am very good at writing very slow CUDA code.
Answer
When you've:
- Had entirely too many great days in a row and need a change.
- Don't want to go to church but still wish to be punished for all your sins.
- Voted for or (ever) supported George Bush.
Otherwise, save it for the platform it was made for.
Multicore Optimization
But how high is up?
- Strong scaling: the total problem size is fixed.
- Weak scaling: the problem size per processor is fixed.
Ideally, interprocessor synchronization, communication and I/O are not the bottlenecks. It all starts with a good parallel algorithm.
When you see linear scaling graphs, be (somewhat) suspicious. Linear scalability is easy(ier) when per-core performance is low! The faster a single core computes, the more vulnerable it is to other bottlenecks.
Multicore Optimization
- Make it cache friendly.
- Help the compilers tune the loops.
- Give bigger chunks of work to each thread.
- Reduce synchronization.
- Mind alignment, blocking, locks, etc.
Expressing Parallelism
Granularity must always be as large as possible. Synchronization and communication are expensive.
Libraries
- Single core is hard enough; you have to be willing to change your storage format.
- Vendor math libraries are probably best at this.
- ScaLAPACK, PETSc, SuperLU, FFTW, EISPACK, VSIPL, SPRNG, HYPRE, etc.
- But if it's not your bottleneck, then it doesn't matter.
- Ideally, you should compile the libraries with the same compiler and flags you use on your binary, unless you know the binary needs to run in many places.
- Vendor compilers are best, but may be fussy.
- Many optimization flags (especially IPO and the like) need to be set for every stage. How can you check?
Threaded libraries come in two forms, yet both can be called multithreaded.
Monolithic:
- Your (serial) program calls a (library) routine which uses all the cores to solve the problem.
- Good: it can't get much easier to use.
- Bad: memory contention and parallel overhead.
Really parallel:
- Your (parallel) program creates its own threads and each calls the library on the relevant portion of data.
- Good: the library can dictate alignment, placement, etc. through allocators, and overhead can be amortized.
- Bad: an increase in complexity, and plenty of room for error.
Multicore Libraries
- Example: an MPI program running on every node, linked against an Intel MKL that spawns its own threads.
- HPL (Linpack): MPI between nodes plus a multithreaded BLAS on each node is usually slower than MPI everywhere.
- But HPL runs at > 75% of peak, DGEMM sometimes 90%! So go with what's simple.
VSIPL
Vector Signal Image Processing Library
- Filters, stencils, convolutions, wavelets
- Serial and parallel versions
LAPACK/ScaLAPACK
- Eigenvalue problems
- Factorizations
- Reordering/conditioning
- Parallel and serial versions
- Some out-of-core solvers and packed storage routines
- Built on ATLAS/BLAS/etc.
PETSc
- Generalized sparse solver package for the solution of PDEs
- Contains different preconditioners, explicit and implicit methods
- Storage format is highly optimized for performance
- Serial, parallel and threaded
FFTW
- Multidimensional FFTs
- Serial, threaded and parallel
- Variety of radix sizes and data types
SuperLU
- Highly optimized
- Compressed block storage formats
- Serial, parallel and threaded
Starting a thread costs TLB/cache flushing, so do not use threads as function calls! Keep threads around and send them work.
Load Balance
- OpenMP and Cilk can provide dynamic scheduling of iterations.
- With Pthreads, you are on your own.
- Ok, but we still must consider cache line contention when choosing a data layout.
Multithreading Gotchas
- False sharing
- Invalidations: two threads writing to the same location, causing the value to be flushed.
- Synchronization: locking, barriers, etc.
False Sharing
http://isdlibrary.intel-dispatch.com/isd/1588/MC_Excerpt.pdf
- Read-write contention: one core writes a cache line, another one reads it.
- Write-write contention: many cores write to the same cache line.
Loop Structure
do k = 1, nz    ! The magnetic field update; the electric field update is very similar
  do j = 1, ny
    do i = 1, nx
      Hx(i,j,k) = Hx(i,j,k) + ((Ey(i,j,k+1) - Ey(i,j,k)) * Cbdz + &
                               (Ez(i,j,k)   - Ez(i,j+1,k)) * Cbdy)
      Hy(i,j,k) = Hy(i,j,k) + ((Ez(i+1,j,k) - Ez(i,j,k)) * Cbdx + &
                               (Ex(i,j,k)   - Ex(i,j,k+1)) * Cbdz)
      Hz(i,j,k) = Hz(i,j,k) + ((Ex(i,j+1,k) - Ex(i,j,k)) * Cbdy + &
                               (Ey(i,j,k)   - Ey(i+1,j,k)) * Cbdx)
    end do
  end do
end do
Improved Padding
Hx(1:nx+padHx(1),   1:ny+padHx(2),   1:nz+padHx(3))
Hy(1:nx+padHy(1),   1:ny+padHy(2),   1:nz+padHy(3))
Hz(1:nx+padHz(1),   1:ny+padHz(2),   1:nz+padHz(3))
Ex(1:nx+1+padEx(1), 1:ny+1+padEx(2), 1:nz+1+padEx(3))
Ey(1:nx+1+padEy(1), 1:ny+1+padEy(2), 1:nz+1+padEy(3))
Ez(1:nx+1+padEz(1), 1:ny+1+padEz(2), 1:nz+1+padEz(3))
- Make sure shared (even read-only) data is cache-line aligned.
- Use thread-private variables to compute results, then merge them into shared arrays.
- With OpenMP: use default(none) in your parallel clauses.
Block for the largest cache private to a core: L2 on Nehalem.
- Depending on the speed difference and the amount of work per iteration, L1 may be better.
- Never block for the shared cache size.
Access to shared data structures and critical sections must be protected (and ordered), sometimes even if the access is atomic.
- Unix has (horribly slow) semaphores.
- Pthreads has rwlocks, mutexes and condition variables.
- OpenMP has explicit locks and directives.
Locking
- Locks should not be taken around any primitive that does not execute deterministically.
- Instructions: do minimal work while holding the lock.
- Data: lock items, not entire structures.
- Space
Locking Example
Consider a multithreaded server where all threads read from the same socket into a shared FIFO for processing.
- Simplest: lock the FIFO, read into it, increment and unlock.
- We could lock each buffer in the FIFO, but that would cause gaps.
- Instead, fill temporary message buffers and copy them into the FIFO when full: we only lock the FIFO when the data is ready!
Because we have increased our memory bandwidth! Some compilers have #pragma pack.

struct { short s; int i; char c; void *p; };
struct { void *p; int i; short s; char c; };
What if we are locking each structure? What happens after the lock is acquired? What if structures are allocated together? Usage dictates the method: what will we be accessing, and when?

struct {
    unsigned long lock;
    void *next;
    void *p;
    int i;
    short s;
    char c;
    unsigned long pad[5];
};
We know there is nothing wrong with shared read-only data. Unless it happens to be in the same cache line as something that gets written.
- Thread-local storage is a great way to reduce contention; on most systems, it is very fast.
- Variants exist in C and C++: the __thread keyword.

int i_first_val = 101;
__thread int i = i_first_val;
extern __thread struct state s;
static __thread char *p;
Memory allocators can be very bad for multicore due to false sharing, especially with caches of low associativity. Be wary of your F90 allocate or your malloc/new. Third-party replacements are available.
There is generally a sweet spot for M threads on N cores of a single socket (M < N), depending on:
- How tightly synchronized and balanced the computation is
- Memory bandwidth requirements
- I/O and communication traffic
Oversubscription (more threads than cores) is usually never a good thing, unless...
Threads can bounce from core to core. You do have some control over this, but you cannot prevent Linux from scheduling something else onto your CPU, unless you boot the kernel with special options (isolcpus) or massage the system a bit.
Print the affinity mask of process PID 24732:
> taskset -p 24732
pid 24732's current affinity mask: f

Print the CPU list of process PID 4695:
> taskset -c -p 4695
pid 4695's current affinity list: 0-3

Set a running process to use only CPUs 1 and 2:
> taskset -c -p 1,2 4695
pid 4695's current affinity list: 0-3
pid 4695's new affinity list: 1,2

Launch a bash shell with all CPUs to choose from:
> taskset 0xffffffff /bin/bash

Launch a bash shell with CPUs 0-3 to choose from:
> taskset -c 0-3 /bin/bash
Tell OpenMPI to bind each process:
> mpirun --mca mpi_paffinity_alone 1 -np ...

Tell SLURM to bind each task to a core/socket:
> srun --ntasks-per-core=N --ntasks-per-socket=M ...

More advanced SLURM binding: 8 ranks, 4 nodes, 2 per socket, 1 per core (-B S[:C[:T]]):
> srun -n 8 -N 4 -B 2:1 ...

Even more advanced SLURM binding:
> srun --cpu_bind=cores --cpu_bind=verbose ...
> srun --cpu_bind=map_cpu:0,2,3 --cpu_bind=verbose ...
> srun --cpu_bind=help --cpu_bind=verbose ...
- Static: data/tasks are split amongst processors for the duration of execution. Problem: how do we choose an efficient mapping?
- Dynamic: work is handed out as resources become available.
If OMP_NUM_THREADS is not set, it defaults to the number of cores in the system (as reported by /proc/cpuinfo on Linux; Hyper-Threading users beware...). Set it to 1, 2, etc. and time regions of your code. Time without OpenMP as well.
Make parallel regions as big as possible:
- Make sure the compiler can unroll the loop.
- Merge parallel regions.
- Avoid barriers with NOWAIT.
#pragma omp parallel
{
    #pragma omp for
    for (...) { ... }

    #pragma omp for
    for (...) { ... }
}
The if clause: execute in parallel only if the expression evaluates to true. A very powerful technique for mitigating parallel overhead.
- firstprivate(list): all copies get the master's value at the beginning.
- lastprivate(list): the original variable gets the value from the last iteration/section.
- threadprivate(list): data is global, but private inside parallel regions.
Implicit barriers follow do, for, sections, workshare and single constructs; we must work to limit when this happens. The NOWAIT clause eliminates the barrier; you can then insert a barrier and/or flush yourself. Also, you can use master instead of single.
If you can't use a reduction to update a shared variable and you would otherwise use critical (only one thread at a time executing the code), consider atomic: it takes advantage of special instructions instead of using locking.
Replace

#pragma omp for
for (i = 0; i < size; i++) a[i] = 1.0/a[i];
#pragma omp for
for (i = 0; i < size; i++) b[i] = b[i]*2.0;

with

#pragma omp for
for (i = 0; i < size; i++) {
    a[i] = 1.0/a[i];
    b[i] = b[i]*2.0;
}

Good: we've reduced overhead and increased work per iteration.
Bad: we're increasing memory bandwidth and cache pollution (no data is reused).

This is better for multicore:

#pragma omp for nowait
for (i = 0; i < size; i++) a[i] = 1.0/a[i];
#pragma omp for
for (i = 0; i < size; i++) b[i] = b[i]*2.0;
OpenMP Reduction
OpenMP has a shortcut: reduction(op:var) You tell OpenMP how to combine the data.
Reduction By Hand
#pragma omp parallel for private( privIndx, privDbl )
for ( i = 0; i < arraySize; i++ ) {
    for ( privIndx = 0; privIndx < 16; privIndx++ ) {
        privDbl = ( (double)privIndx ) / 16;
        y[i] = sin( exp( cos( -exp( sin( x[i] ) ) ) ) ) + cos( privDbl );
        /* Each thread reads globalCount, adds 1 to the value,
           and writes the new value back to globalCount. */
        #pragma omp critical
        { globalCount = globalCount + 1; }
    }
}
Reduction
#pragma omp parallel for private( privIndx, privDbl ) \
        reduction( + : globalCount )
for ( i = 0; i < arraySize; i++ ) {
    for ( privIndx = 0; privIndx < 16; privIndx++ ) {
        privDbl = ( (double)privIndx ) / 16;
        y[i] = sin( exp( cos( -exp( sin( x[i] ) ) ) ) ) + cos( privDbl );
        /* Each thread now updates its own private copy of
           globalCount; the copies are combined at the end. */
        globalCount = globalCount + 1;
    }
}
Rewrite a few versions, unrolling each level individually. Look for opportunities to:
- Re-use data (cache)
- Reduce memory bandwidth
- Move temporaries into variables (registers)
OpenMP Scheduling
By default, OpenMP will allocate a small fixed number of iterations to each thread. This can be changed at compile time or at run time.
OpenMP Scheduling
- STATIC[,size]: the default
- DYNAMIC[,size]: allocate iterations at runtime
- GUIDED[,size]: start with big chunks, end with small chunks
- RUNTIME: take the schedule from the OMP_SCHEDULE environment variable
MPI Tips
- Use non-blocking operations; ideally a background thread can send the data while your thread continues: MPI_ISEND, MPI_IRECV, MPI_ISENDRECV, MPI_IRSEND, MPI_WAITxxx, MPI_TESTxxx.
- Use native data types.
- Send big messages, not small ones.
- Make sure the receiver arrives early.
- Minimize collectives.
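A minimal sketch of overlapping communication and computation with non-blocking calls (the buffers, neighbor ranks and do_local_work() are illustrative):

MPI_Request reqs[2];
MPI_Irecv(rbuf, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(sbuf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
do_local_work();   /* overlap: compute on data not involved in the exchange */
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);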
MPI Tips 2
- Avoid wildcard receives.
- Attempt to align application buffers to (at least) 8 bytes.
- Avoid data translation and derived data types.
- Always think about overlapping communication and computation.
Performance Analysis
Wall clock time: what are we comparing it to? Peak? Ask your vendor to send you a code that performs at peak. For the purposes of optimization, we need data over a range of data sets, problem sizes and node counts.
Comparisons
For the purposes of comparing performance data, time is the best place to start. Unless you are completely aware of the architecture, compiler, run-time systems, etc., do not assume that:
- Synthetic benchmarks predict the performance of real programs
- Peak performance tracks observed performance
Instrumentation
Sampling
Simulation
Tracing generates a timestamped record for every event, say every function invocation. This presents numerous problems; the cure can be worse than the disease, so tracing is often reserved for the worst and most intermittent problems.
Aggregated Profiling
By using simple start, stop and accumulate points in the code, a relatively complete picture of the overall execution can be obtained. This loses temporal performance information; however, significant problems still 'bubble' to the top of the overall profile.
Statistical Profiling
Upon defined periodic events, record where in the program the CPU is. Gather the data into a histogram, whose shape approaches the actual profile over time. The periodic events can be clock ticks, or other events based on hardware performance counters, like cache misses.
Understanding Timers
- Real time, wall clock time: a measure of time that doesn't stop, as when using a stopwatch.
- User time: time when the CPU is executing your process and is executing your code (not OS code).
- System time: time when the CPU is executing your process and is executing OS code on your behalf.
- CPU utilization is usually (U + S) / R.
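A minimal wall-clock timing sketch on Linux (POSIX clock_gettime; older systems may need -lrt when linking; work() is a placeholder):

#include <time.h>

struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
work();                                   /* the region being timed */
clock_gettime(CLOCK_MONOTONIC, &t1);
double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;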
Timing Utilities
Linux /usr/bin/time
Wallclock Time
CPU Time
- Usually summed over all threads; not nearly as accurate as wallclock time.
- C: clock(), getrusage(), clock_gettime(), times()
- Fortran: dtime()
A measure of wallclock is not enough to point to the culprit. We need to know what's happening under the hood.
Hardware performance counters:
- Often 100's of different events, specialized to the processor, and usually just a few registers to count with.
- OS support accumulates the counts into 64-bit quantities that run only when the process is running.
- User, kernel and interrupt modes can be measured separately.
- You can count in aggregate, or use the counters as sampling triggers.
Example events:
- Cycles
- Instructions
- Floating point ops
- Branches mispredicted
- Cycles stalled on memory
- Cache lines invalidated
Statistical Profiling 2
Interprocessor traffic: you can actually count transitions to the E and I states on some platforms, and isolate the offending processor/thread.
PAPI
- Performance Application Programming Interface.
- A standardized, portable and efficient API for accessing the hardware performance counters.
- The goal is to facilitate the development of cross-platform optimization tools.
- Patching the kernel is required; stable and supported patches exist (perfctr & perfmon), and many HPC systems have already been patched.
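A minimal sketch using PAPI's high-level counter calls of that era (error handling omitted; work() is a placeholder):

#include <papi.h>

int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };  /* cycles, instructions */
long long counts[2];

PAPI_start_counters(events, 2);
work();                          /* the region being measured */
PAPI_stop_counters(counts, 2);
/* counts[1] / counts[0] approximates instructions per cycle. */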
PAPI Events
- Symbolic names for things to count.
- Events have different names/meanings for different vendors, processors, revisions, etc.
- Some native events are mapped to generic preset names in PAPI.
But some presets are better than others; all can be useful in the proper situations.
Issues to Consider
Usage
Performance Data
Collection
Tools
- Vampir
- SlowSpotter
ompP:
- Based on source code instrumentation
- Flat profiles
- Callgraph profiles
- Hardware counter values
ompP: Usage
Components:
- Region number
- Source code location and region type
- Timing data and execution counts, depending on the particular construct
- One line per thread, last line sums over all threads
- Hardware counter data (if PAPI is available and HW counters are selected)
- Data is exact (measured, not based on sampling)
[APP 4 threads]
+-R00004 main.c (42-46)
  +-R00001 main.c (19-21) ('foo1')
  | +-R00003 main.c (33-36) (unnamed)
  +-R00002 main.c (26-28) ('foo2')
    +-R00003 main.c (33-36) (unnamed)

[*00] critical.ia64.ompp
[+01] R00004 main.c (42-46) PARALLEL
[+02] R00001 main.c (19-21) ('foo1') USER REGION
 TID   execT/I   execT/E   execC
   0      1.00      0.00       1
   1      3.00      0.00       1
   2      2.00      0.00       1
   3      4.00      0.00       1
 SUM     10.01      0.00       4

[*00] critical.ia64.ompp
[+01] R00004 main.c (42-46) PARALLEL
[+02] R00001 main.c (19-21) ('foo1') USER REGION
[=03] R00003 main.c (33-36) (unnamed) CRITICAL
 TID   execT   execC   bodyT/I   bodyT/E
   0    1.00       1      1.00      1.00
   1    3.00       1      1.00      1.00
   2    2.00       1      1.00      1.00
   3    4.00       1      1.00      1.00
 SUM   10.01       4      4.00      4.00
(ompP overhead analysis report: per-region Synch, Imbal, Limpar and Mgmt overheads in seconds and percent; total overhead 872.92 (15.80%).)
Deductions by ompP about what the problem is:
- WaitAtBarrier
- ImbalanceInParallel[Region/Loop/Workshare/Sections]
- ImbalanceDueToNotEnoughSections
- ImbalanceDueToUnevenSectionDistribution
- CriticalSectionContention
- LockContention
- FrequentAtomic
- InsufficientWorkInParallelLoop
- UnparallelizedIn[Master/Single]Region
Valgrind
- A tool infrastructure for debugging and performance evaluation.
- Works by instruction emulation and tracing: code can run up to 100x slower, but it can catch errors that other tools can't.
- Many tools: memcheck, cachegrind, callgrind, massif, helgrind, drd.
- cachegrind is based on a simulated machine model (not real hardware).
Valgrind: Helgrind
Detects:
- Pthreads API errors
- Deadlocks and data races
(Broken for GNU OpenMP.)

Thread #1 unlocked a not-locked lock at 0x7FEFFFA90
   at 0x4C2408D: pthread_mutex_unlock (hg_intercepts.c:492)
   by 0x40073A: nearly_main (tc09_bad_unlock.c:27)
   by 0x40079B: main (tc09_bad_unlock.c:50)
Lock at 0x7FEFFFA90 was first observed
   at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326)
   by 0x40071F: nearly_main (tc09_bad_unlock.c:23)
   by 0x40079B: main (tc09_bad_unlock.c:50)
Valgrind: DRD
Detects:
- Pthreads API errors
- Deadlocks, data races and lock contention
(Broken for GNU OpenMP.)

==10668== Acquired at:
==10668==    at 0x4C267C8: pthread_mutex_lock (drd_pthread_intercepts.c:395)
==10668==    by 0x400D92: main (hold_lock.c:51)
==10668== Lock on mutex 0x7fefffd50 was held during 503 ms (threshold: 10 ms).
==10668==    at 0x4C26ADA: pthread_mutex_unlock (drd_pthread_intercepts.c:441)
==10668==    by 0x400DB5: main (hold_lock.c:55)
...
mpiP
- Easy to use, with easy-to-interpret performance reports.
- mpiP performs only trace reduction and summarization.
- Compatible with all MPIs; no recompilation required.
PerfSuite
- Provides summaries of MPI and performance counters.
- Provides statistical profiles as well.
- Output is XML or ASCII.
- Works on uninstrumented code.
- Well supported and documented.
- Lots of derived events for Intel processors.
psrun Output
HPCToolkit
- Uses hardware counters and timers.
- Produces profiles per thread.
- Works on unmodified and fully optimized code.
- The visualizer can compare multiple profiles with derived metrics.
- Concise ASCII output, or a Java GUI.
- A set of tools, as easy to use as time.
- They provide a uniform interface to a number of underlying tools.
- Largely work on uninstrumented code, and mostly take the same arguments.
- papiex, mpipex, ioex, hpcex, gptlex, tauex
papiex
A simple-to-use tool that generates performance measurements for the entire run of a code, including summaries for the job, each task and each thread.
Instruction Mix
GPTL
- Used to easily instrument applications for the generation of performance data.
- Optimized for usability.
- Provides access to timers as well as PAPI events.
- Thread-safe, with per-thread statistics.
- Provides estimates of overhead.
- Call-tree generation.
TAU
- An entire toolkit for parallel and serial performance instrumentation, measurement, analysis and visualization.
- Steep(ish) learning curve, but the payoff can be worth it.
- Works via source instrumentation and limited dynamic instrumentation.
- Very good at OpenMP instrumentation.
- A parallel performance evaluation tool for Fortran, C, C++ and Python.
- Used for in-depth performance studies of an application throughout its lifecycle.
- Supports all forms of parallel profiling: flat, callpath and phase-based.
- PAPI counters, wallclock time, CPU time, memory.
TAU
(Figure: AORSA2D on 4k cores, PAPI resource stalls, Jaguar Cray XT (ORNL); blue is single node, red is dual core.)
(Figure: AORSA2D on 4k cores, Jaguar Cray XT3 (ORNL), floating point instructions/second; blue is dual core, red is single node.)
Other tools:
- Oprofile
- Scalasca
- Know what libraries are available and plan your data structures.
- Spend your time on innovation, not implementation.
- Libraries are well documented and well publicized.
- Remember the 80/20 rule.
- Do you need double, or can you live with float?
- Do you need > 2GB of address space?
- After communication, memory bandwidth is always the biggest bottleneck.
- Running in 32-bit mode does not mean you can't use double-precision floating point.
- Flags
- Directives
- Good code structure
Compilers are better at optimizing simple code than you are. Reading the manual is worth it.
If you have to write your own code, tune it for cache on a single processor, but make sure the algorithm scales first. If you get good cache utilization, multicore performance will come that much easier.
The larger, longer and more independent the computations, the greater the speedup.
If something seems 'hard' to get right, you may be doing something wrong. Have reasonable expectations, and recall the CUDA comment.
References
Multicore optimization:
- http://www.cilk.com/resources/multithreaded-algorithms-textbook-chapter/
- http://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems/
- http://drops.dagstuhl.de/opus/volltexte/2008/1374/pdf/07361.GroeszlingerArmin.Paper.1374.pdf
References
Pthreads
OpenMP:
- https://computing.llnl.gov/tutorials/openMP/
- http://infostream.rus.uni-stuttgart.de/lec/288/291/real/
- https://fs.hlrs.de/projects/par/par_prog_ws/2004C/22_openmp_performance_2.p
References
FFTW
PETSc
SuperLU
ScaLAPACK etc.
VSIPL
References
GNU compilers
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
http://www.ncsa.edu/UserInfo/Training/Workshops/Multicore/presentations/Optimization%20for%20Performance.ppt
Intel Compiler
CILK tutorials
UPC
Co-Array Fortran
References
PAPI
ompP
Valgrind
References
mpiP
PerfSuite
HPCToolkit
papiex
TAU
GPTL
References
Prefetching:
- http://gcc.gnu.org/projects/prefetch.html
Aliasing:
- http://developers.sun.com/solaris/articles/cc_restrict.html
- http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/gcc/restricted-pointers.html
Alignment:
- http://www.jauu.net/data/pdf/beware-of-your-cacheline.pdf