Optimisation Techniques
High Performance Computing Workshop, Day 1: October 5, 2004
Uni-Processor Optimization: Code Restructuring and Loop Optimization Techniques

Lecture Outline
- Why you need to optimize a sequential code
- The memory hierarchy and how a code's performance depends on it
- Optimization techniques
- Loop optimization techniques: collapsing, fission, fusion, unrolling, interchange, invariant code extraction, de-factorization, overheads of if/while/goto, neighbor data dependency
- Arithmetic optimization
- Compiler optimizations
- Use of tuned math libraries
- Performance of selected applications and benchmarks
- Conclusions

Improving Single Processor Performance
How much sustained performance can one achieve for a given program on a machine? It is the programmer's job to take as much advantage as possible of the CPU's hardware and software characteristics to boost the performance of the program. Quite often, just a few simple changes to one's code improve performance by a factor of 2, 3, or better. Also, simply compiling with some of the optimization flags (-O3, -fast, etc.) can improve performance dramatically.

The Memory Sub-system: Access Time Is Important
A lot of time is spent accessing and storing data from and to memory, so it is important to keep in mind the approximate access times of the different memory types:
- CPU registers: 0 cycles (that's where the work is done!)
- L1 cache (separate data and instruction caches): 1 cycle; repeated access to a cached item takes only 1 cycle
- L2 cache (static RAM): 3-5 cycles
- Memory (DRAM): about 10 cycles on a cache miss; 30-60 cycles for a Translation Lookaside Buffer (TLB) update
- Disk: about 100,000 cycles!
- Other nodes: depends on network latency

Hierarchical Memory
[Figure: a four-level memory hierarchy for a large computer system. Level 0: registers and internal caches in the CPU; Level 1 (M1): external cache (SRAM); Level 2 (M2): main memory (DRAM); Level 3 (M3): disk storage (magnetic); Level 4 (M4): tape units (magnetic). Capacity and access time increase going down the hierarchy; cost per bit increases going up.]
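To see these numbers at work, here is a small self-contained C program (a sketch, not from the original slides; the array size is an illustrative choice) that sums the same matrix with stride-1 and then with stride-N accesses. On most machines the second loop nest runs several times slower, purely because of the memory hierarchy.

    #include <stdio.h>
    #include <time.h>

    #define N 2000

    static double a[N][N];              /* about 32 MB, zero-initialized */

    int main(void)
    {
        double sum = 0.0;
        clock_t t0, t1;
        int i, j;

        /* Stride-1: consecutive accesses fall in the same cache line. */
        t0 = clock();
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        t1 = clock();
        printf("stride-1: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        /* Stride-N: nearly every access misses the cache and may miss
           the TLB as well. */
        t0 = clock();
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                sum += a[i][j];
        t1 = clock();
        printf("stride-N: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        printf("sum = %f\n", sum);      /* keep the loops live */
        return 0;
    }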
Loop Optimization Techniques
- Classical optimization techniques: the compiler does these.
- Memory reference optimization: the compiler does this to some extent.
- Loop optimizations: the compiler does these to some extent. They include:
  - Loop fission and loop fusion
  - Loop interchange
  - Loop alignment
  - Loop collapsing
  - Loop unrolling

Loop Collapsing
Loop collapsing attempts to create one (larger) loop out of two or more small ones. This may be profitable if each of the original loops is too small for efficient vectorization, but the resulting single loop can be profitably vectorized. Loop collapsing is done with multi-dimensional arrays to avoid loop overheads.

Before:

    REAL A(5,5), B(5,5)
    DO 10 J = 1, 5
    DO 10 I = 1, 5
       A(I,J) = B(I,J) + 2.0
 10 CONTINUE

After:

    REAL A(25), B(25)
    DO 10 JI = 1, 25
       A(JI) = B(JI) + 2.0
 10 CONTINUE

Loop Collapsing (Contd)
Using this technique, the code may be transformed into a single loop regardless of the sizes of M and N; this may require some additional statements to restart the code properly. General versions of this technique are useful for computing systems that support only a single (not nested) DOALL statement.

Before:

    DO 10 J = 1, N
    DO 10 I = 1, M
       A(I,J) = B(I,J) + 2.0
 10 CONTINUE

After:

    DO 10 L = 1, N*M
       J = (L-1)/M + 1
       I = MOD(L-1, M) + 1
       A(I,J) = B(I,J) + 2.0
 10 CONTINUE

Loop Collapsing (Contd)
Assume the declaration a[50][80][4] (and the same for b and c).

Uncollapsed loop:

    for (i = 0; i < 50; i++)
        for (j = 0; j < 80; j++)
            for (k = 0; k < 4; k++)
                a[i][j][k] = a[i][j][k] * b[i][j][k] + c[i][j][k];

Collapsed loop:

    for (i = 0; i < 50*80*4; i++)
        a[0][0][i] = a[0][0][i] * b[0][0][i] + c[0][0][i];

Warning: this works only if the entire array space is accessed!

Loop Fission and Loop Fusion
Loop fission attempts to break a single loop into several loops in order to optimize data transfer (the behavior of main memory, cache, and registers); the primary objective of the optimization is data transfer.
Loop fusion transforms two adjacent loops into one, on the basis of information obtained from data-dependence analysis. Two statements will be placed into the same loop if there is at least one variable or array that is referenced by both.
Remark: loop fission and loop fusion are techniques related to strip mining and loop collapsing.

Loop Fusion (Contd)
Loop fusion is the merging of several loops into a single loop.

Untuned:

    for (i = 0; i < 100000; i++)
        x = x * a[i] + b[i];
    for (i = 0; i < 100000; i++)
        y = y * a[i] + c[i];

Tuned:

    for (i = 0; i < 100000; i++) {
        x = x * a[i] + b[i];
        y = y * a[i] + c[i];
    }

The tuned code runs at least 10 times faster on an UltraSPARC (both versions compiled with -O3).

Advantages:
- The loop overhead is reduced by a factor of two in the above case.
- It allows better instruction overlap in loops with dependencies.
- Cache misses can be decreased if both loops reference the same array.
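As a self-contained version of the fused loop above (a sketch: the arrays are zero-initialized globals only to keep it short, and x and y start at 1.0), the following compiles and runs as-is:

    #include <stdio.h>

    #define N 100000

    double a[N], b[N], c[N];            /* zero-initialized globals */

    int main(void)
    {
        double x = 1.0, y = 1.0;
        int i;

        /* Fused form: one pass over a[] feeds both recurrences, halving
           the loop overhead and reusing a[i] while it is in a register. */
        for (i = 0; i < N; i++) {
            x = x * a[i] + b[i];
            y = y * a[i] + c[i];
        }

        printf("x = %f, y = %f\n", x, y);
        return 0;
    }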
Loop Fusion (Contd)
Disadvantages:
- Fusion has the potential to increase cache misses if the fused loop contains references to more than four arrays and the starting elements of those arrays map to the same cache line, e.g.

    x = x * a[i] + b[i] * c[i] + d[i] / e[i];

Loop Optimizations: Basic Loop Unrolling
Loop unrolling is performing multiple loop iterations per pass. It is one of the most important optimizations that can be done on a pipelined machine: it helps performance because it fattens up a loop with calculations that can be done in parallel.
Remark: never unroll an inner loop.

Outer and Inner Loop Unrolling
In a loop nest (loops nested within other loops), the loop or loops in the center are called the inner loops and the surrounding loops are called outer loops.

Original loop nest:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                a[i][j][k] = a[i][j][k] + b[i][k][j]*c;

Unrolling the middle (j) loop twice:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j += 2)
            for (k = 0; k < n; k++) {
                a[i][j][k]   = a[i][j][k]   + b[i][k][j]*c;
                a[i][j+1][k] = a[i][j+1][k] + b[i][k][j+1]*c;
            }

Outer and Inner Loop Unrolling (Contd)
The reasons for applying outer loop unrolling are:
- to expose more computations, and
- to improve memory reference patterns (above, the unrolled body touches the adjacent elements b[i][k][j] and b[i][k][j+1] together).

Unrolling the k loop twice as well:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j += 2)
            for (k = 0; k < n; k += 2) {
                a[i][j][k]     = a[i][j][k]     + b[i][k][j]*c;
                a[i][j+1][k]   = a[i][j+1][k]   + b[i][k][j+1]*c;
                a[i][j][k+1]   = a[i][j][k+1]   + b[i][k+1][j]*c;
                a[i][j+1][k+1] = a[i][j+1][k+1] + b[i][k+1][j+1]*c;
            }

Loop Unrolling and Sum Reduction
Loop unrolling can be used to reduce data dependency: several accumulator variables eliminate the dependency between consecutive additions.

Untuned loop:

    a = 0.0;
    for (i = 0; i < ARRAY_SIZE; i++)
        for (j = 0; j < ARRAY_SIZE; j++)
            a = a + b[j]*c[i];

Tuned loop (unrolled to depth 4):

    a1 = a2 = a3 = a4 = 0.0;
    for (i = 0; i < ARRAY_SIZE; i++)
        for (j = 0; j < ARRAY_SIZE; j += 4) {
            a1 = a1 + b[j]  *c[i];
            a2 = a2 + b[j+1]*c[i];
            a3 = a3 + b[j+2]*c[i];
            a4 = a4 + b[j+3]*c[i];
        }
    a = a1 + a2 + a3 + a4;

Speed increases by a factor of 4 (with appropriate compiler switches).

Qualifying Candidates for Loop Unrolling
The previous example is an ideal candidate for loop unrolling. The following categories of loops are generally NOT prime candidates for unrolling:
- Loops with low trip counts
- Fat loops
- Loops containing branches
- Loops containing procedure calls
- Recursive loops
- Vector reductions

To be effective, loop unrolling requires a fairly large number of iterations in the original loop. When the trip count is low, the preconditioning loop does a proportionally large amount of work.
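The preconditioning loop mentioned above looks like this in C (a sketch with illustrative names, assuming a simple multiply-add loop): the short first loop handles the n mod 4 leftover iterations so that the main body can safely step by 4.

    void madd(double *a, const double *b, const double *c, int n)
    {
        int i;
        int pre = n % 4;                /* leftover (preconditioning) trips */

        for (i = 0; i < pre; i++)       /* runs 0 to 3 times */
            a[i] = a[i] + b[i] * c[i];

        for (; i < n; i += 4) {         /* main body, unrolled by 4 */
            a[i]   = a[i]   + b[i]   * c[i];
            a[i+1] = a[i+1] + b[i+1] * c[i+1];
            a[i+2] = a[i+2] + b[i+2] * c[i+2];
            a[i+3] = a[i+3] + b[i+3] * c[i+3];
        }
    }

When n is small, the preconditioning loop handles up to 3 of the n iterations, a large fraction of the total work, which is exactly why low-trip-count loops gain little from unrolling.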
Loops Containing Procedure Calls
Loops containing subroutine or function calls are generally not good candidates for unrolling:
- First: they often contain a fair number of instructions already, and the function call adds many more.
- Second: when the calling routine and the subroutine are compiled separately, it is impossible for the compiler to intermix their instructions.
- Last: function call overhead is expensive. Registers have to be saved and argument lists prepared, so the time spent calling and returning from a subroutine can be much greater than the loop overhead itself.

For example, this loop is not suitable for unrolling:

    DO 10 I = 1, N
       CALL SHORT(A(I), B(I), C)
 10 CONTINUE

    SUBROUTINE SHORT (A, B, C)
    A = A + B + C
    RETURN
    END

Unrolled by hand, with a preconditioning loop for the leftover iterations:

    II = MOD(N, 4)
    DO 9 I = 1, II
       CALL SHORT(A(I), B(I), C)
  9 CONTINUE
    DO 10 I = 1+II, N, 4
       CALL SHORT(A(I),   B(I),   C)
       CALL SHORT(A(I+1), B(I+1), C)
       CALL SHORT(A(I+2), B(I+2), C)
       CALL SHORT(A(I+3), B(I+3), C)
 10 CONTINUE

Fat Loops
If a particular loop is already fat, unrolling is not going to help much: the loop overhead is already spread over a fair number of instructions. A good rule of thumb is to look elsewhere for performance when the loop body exceeds three or four statements.

Recursive Loops
The original loop carries a dependency:

    DO 10 I = 2, N
       A(I) = A(I) + A(I-1)*B
 10 CONTINUE

Unrolled naively it is still recursive on A; every iteration needs the result of the previous one:

    A(I)   = A(I)   + A(I-1)*B
    A(I+1) = A(I+1) + A(I)*B
    A(I+2) = A(I+2) + A(I+1)*B
    A(I+3) = A(I+3) + A(I+2)*B

The dependency can be reduced by deriving a new set of recurrence equations, decreasing the dependencies at the expense of creating more work:

    DO 10 I = 2, N-1, 2
       A(I+1) = A(I+1) + A(I)*B + A(I-1)*B*B
       A(I)   = A(I)   + A(I-1)*B
 10 CONTINUE

(the first statement must use the old value of A(I), hence the ordering). This is an example of vector recursion. A good compiler can make the rolled-up version go faster by recognizing the dependency as an opportunity to save memory traffic.

Negatives of Loop Unrolling
Loop unrolling always adds some run time to the program. If you unroll a loop and see performance dip a little, you can assume that either:
- the loop was not a good candidate for unrolling in the first place, or
- a secondary effect absorbed your performance increase.
Other possible reasons:
- Unrolling by the wrong factor
- Register spilling
- Instruction cache misses
- Other hardware delays
- Outer loop unrolling
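In C, the reduced-dependency form of the recursive loop above can be written as follows (a sketch: the ordering keeps a[i+1] working on the old value of a[i], and a cleanup step handles an odd number of elements):

    void recurrence(double *a, double b, int n)
    {
        int i;

        for (i = 1; i + 1 < n; i += 2) {
            double old_ai = a[i];
            /* a[i+1] is computed directly from a[i-1] and the old a[i],
               so the two updates below are independent of each other. */
            a[i+1] = a[i+1] + old_ai * b + a[i-1] * b * b;
            a[i]   = old_ai + a[i-1] * b;
        }
        if (i < n)                      /* cleanup iteration */
            a[i] = a[i] + a[i-1] * b;
    }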
Loop Interchange
Loop interchange is a technique for rearranging a loop nest so that the right stuff is at the center; what the "right stuff" is depends on what you are trying to accomplish. Loop interchange can move computations to the center of the loop nest, and it is also good for improving memory access patterns: iterating on the wrong subscript causes a large stride and hurts performance. By inverting the loops so that the iteration variables causing the smaller strides are in the center, you can get a performance win.

Loop Interchange (Contd)
Loop interchange to move computations to the center. Frequently, the interchange of nested loops permits a significant increase in the amount of parallelism. The example is straightforward: it is easy to see that there are no inter-iteration dependencies.

Original code:

    PARAMETER (IDIM=1000, JDIM=1000, KDIM=4)
    DO 10 K = 1, KDIM
    DO 20 J = 1, JDIM
    DO 30 I = 1, IDIM
       D(I,J,K) = D(I,J,K) + V(I,J,K)*DT
 30 CONTINUE
 20 CONTINUE
 10 CONTINUE

Modified code:

    PARAMETER (IDIM=1000, JDIM=1000, KDIM=4)
    DO 10 I = 1, IDIM
    DO 20 J = 1, JDIM
    DO 30 K = 1, KDIM
       D(I,J,K) = D(I,J,K) + V(I,J,K)*DT
 30 CONTINUE
 20 CONTINUE
 10 CONTINUE

Loop Interchange (Contd)
Loop interchange is done to minimize the stride of the array accesses in the innermost loops. Interchanging loops can also reduce the loop overhead when the inner loops iterate far fewer times than the outer loops.

Original code:

    float a[2000][40][2];
    for (i = 0; i < 2000; i++)
        for (j = 0; j < 40; j++)
            for (k = 0; k < 2; k++)
                a[i][j][k] = a[i][j][k] * 2.50 + 0.056;

Modified code:

    float a[2][40][2000];
    for (i = 0; i < 2; i++)
        for (j = 0; j < 40; j++)
            for (k = 0; k < 2000; k++)
                a[i][j][k] = a[i][j][k] * 2.50 + 0.056;

A reduction of about 15% in execution time was obtained in C/Fortran.

Loop Optimization: Invariant Code Extraction
Statements that do not change within an inner loop can be moved outside of the loop (compiler optimizations can usually detect these).

Untuned:

    for (i = 0; i < 1000; i++) {
        for (j = 0; j < 1000; j++) {
            if (a[i] > 100) b[i] = a[i] - 5.0;
            x = x + a[j] + b[i];
        }
    }
    /* BEWARE: if the statement were a[i] = a[i] - 5.0,
       the result could be very different! */

Tuned:

    for (i = 0; i < 1000; i++) {
        if (a[i] > 100) b[i] = a[i] - 5.0;
        for (j = 0; j < 1000; j++)
            x = x + a[j] + b[i];
    }

Remark: the tuned code can be about 75 times faster than the untuned code!
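Here is the invariant-extraction example above in self-contained form (a sketch: the operator lost from the slide is assumed to be a subtraction, and the array contents are illustrative):

    #include <stdio.h>

    #define N 1000

    double a[N], b[N];                  /* illustrative, zero-initialized */

    int main(void)
    {
        double x = 0.0;
        int i, j;

        for (i = 0; i < N; i++) {
            if (a[i] > 100.0)           /* invariant in j: hoisted out */
                b[i] = a[i] - 5.0;
            for (j = 0; j < N; j++)
                x = x + a[j] + b[i];
        }

        printf("x = %f\n", x);
        return 0;
    }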
Loop Optimization: Loop De-factorization
Loop de-factorization consists of moving common multiplicative factors outside of inner loops.

Factorized:

    for (i = 0; i < 1000; i++) {
        a[i] = 0.0;
        for (j = 0; j < 1000; j++)
            a[i] = a[i] + b[j]*d[j]*c[i];
    }

De-factorized:

    for (i = 0; i < 1000; i++) {
        a[i] = 0.0;
        for (j = 0; j < 1000; j++)
            a[i] = a[i] + b[j]*d[j];
        a[i] = a[i]*c[i];
    }

Remark: on some platforms there is a benefit in doing this, since one (or two) multiplications and one (or two) additions can be done simultaneously in one clock cycle. Compiler optimizations will not make this transformation on their own, and the results may vary slightly due to the finite precision of the computer.

Loop Optimization: IF, WHILE, and DO Loops
Avoid IF/GOTO loops and WHILE loops: they inhibit compiler optimizations and introduce unnecessary overhead.

Untuned loop (IFs and GOTOs):

          I = 0
    10    I = I + 1
          IF (I .GT. 100000) GOTO 30
          A(I) = A(I) + B(I)*C(I)
          GOTO 10
    30    CONTINUE

Tuned:

          I = 0
    10    I = I + 1
          A(I) = A(I) + B(I)*C(I)
          IF (I .LT. 100000) GOTO 10

Another untuned loop (WHILE loop):

    I = 0
    DO WHILE (I .LT. 100000)
       I = I + 1
       A(I) = A(I) + B(I)*C(I)
    END DO

Tuned (a plain DO loop):

    DO I = 1, 100000
       A(I) = A(I) + B(I)*C(I)
    END DO

Loop Optimization: Neighbor Data Dependency
Untuned version (data wrap-around):

    jwrap = ARRAY_SIZE - 1;
    for (i = 0; i < ARRAY_SIZE; i++) {
        b[i] = (a[i] + a[jwrap]) * 0.5;
        jwrap = i;
    }

Compiler optimizations will not be able to determine that a[jwrap] is a neighbor value.

Tuned version:

    b[0] = (a[0] + a[ARRAY_SIZE - 1]) * 0.5;
    for (i = 1; i < ARRAY_SIZE; i++)
        b[i] = (a[i] + a[i-1]) * 0.5;

Remark: once the program is debugged, declare arrays to exact sizes whenever possible. This reduces memory use and also improves pipelining and cache utilization.

Programming Techniques: Managing the Cache
Original code:

    DO 10, J = 1, N
    DO 10, I = 1, N
    DO 10, K = 1, N
       C(I,J) = C(I,J) + A(I,K) * B(K,J)
 10 CONTINUE

We can modify this code to better use the cache by blocking the loops:

    DO 10, JB = 1, N, NB
    DO 10, IB = 1, N, NB
    DO 10, KB = 1, N, NB
    DO 10, J = JB, JB + NB - 1
    DO 10, I = IB, IB + NB - 1
    DO 10, K = KB, KB + NB - 1
       C(I,J) = C(I,J) + A(I,K) * B(K,J)
 10 CONTINUE

This is most useful as a simple example of cache blocking; most compilers will automatically cache-block the original code as part of ordinary optimization.

Loop Optimizations: Advantages
Loop optimizations accomplish three things:
- Reduce loop overhead
- Increase parallelism
- Improve memory access patterns
Understanding your tools and how they work is critical for using them with peak effectiveness. For performance, a compiler is your best friend.

Recap of Arithmetic Optimization
- Replace frequent divisions by multiplications with the inverse.
- Multiplications and divisions by integer powers of 2 can be replaced by bit shifts to the left or right (compilers can usually do this).
- Small integer exponentials such as a^n should be replaced by repeated multiplications a*a*a*... (compilers will usually do this).
- Reorganize (or eliminate) repeated (or useless) operations.
- Use Horner's rule to evaluate polynomials:

      Ax^5 + Bx^4 + Cx^3 + Dx^2 + Ex + F

  can be written as

      ((((A*x + B)*x + C)*x + D)*x + E)*x + F

  This saves more time in C (speed increases by a factor greater than 10) than in Fortran (an improvement of only about 30%), due to the way the C function pow(x,5) handles small integer powers (poorly).
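A compilable sketch of Horner's rule as described above (the coefficient values are illustrative; link with -lm for the pow() comparison):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double A = 1.0, B = 2.0, C = 3.0, D = 4.0, E = 5.0, F = 6.0;
        double x = 1.5;

        /* Naive evaluation: repeated pow() calls are costly in C. */
        double naive = A*pow(x,5) + B*pow(x,4) + C*pow(x,3)
                     + D*x*x + E*x + F;

        /* Horner's rule: five multiplies, five adds, no library calls. */
        double horner = ((((A*x + B)*x + C)*x + D)*x + E)*x + F;

        printf("naive = %f, horner = %f\n", naive, horner);
        return 0;
    }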
Compiler Optimizations
From Wikipedia: "Compiler optimization is used to improve the efficiency (in terms of running time or resource usage) of the executables output by a compiler. It allows programmers to write source code in a straightforward manner, expressing their intentions clearly, while allowing the computer to make choices about implementation details that lead to efficient execution. It may or may not result in executables that are perfectly 'optimal' by any measure."
Ref: http://en.wikipedia.org/wiki/Compiler_optimization

Sun Workshop Compiler 6.2
- -O : set the optimization level.
- -fast : select a set of flags likely to improve speed.
- -stackvar : put local variables on the stack.
- -xlibmopt : link optimized math libraries.
- -xarch : specify the instruction set architecture.
- -xchip : specify the target processor for use by the optimizer.
- -native : compile for best performance on the local host.
- -xprofile : collect data for a profile, or use a profile to optimize.
- -fns : turn on the SPARC nonstandard floating-point mode.
- -xunroll n : unroll loops n times.

Basic Compiler Techniques: Optimization Levels
- -O : optimize at the level most likely to give close to the maximum performance for many realistic applications (currently -O3).
- -O1 : do only the basic local optimizations (peephole).
- -O2 : do basic local and global optimization. This level usually gives the minimum code size.
- -O3 : adds global optimizations at the function level. In general this level, and -O4, result in the minimum code size when used with the -xspace option.
- -O4 : adds automatic inlining of functions in the same file; -g suppresses automatic inlining.
- -O5 : does the highest level of optimization, suitable only for the small fraction of a program that uses the largest fraction of computer time. It uses optimization algorithms that take more compilation time or that do not have as high a certainty of improving execution time. Optimization at this level is more likely to improve performance when it is done with profile feedback; see -xprofile=collect|use.

Basic Compiler Techniques: Local Variables on the Stack
-stackvar tells the compiler to put most variables on the stack rather than statically allocating them. It is almost always a good idea, and it is crucial when parallelizing. You can control stack versus static allocation for each variable: variables that appear in DATA, COMMON, SAVE, or EQUIVALENCE statements will be static regardless of whether you specify -stackvar.

Basic Compiler Techniques: Target Architecture
- -xchip specifies the target chip. Specifying the chip lets the compiler know implementation details such as specific instruction timings and the number of functional units.
- -xarch specifies the target architecture. A target architecture includes the instruction set but may not include implementation details such as instruction timing. On Sun, -xarch=v8plus produces an executable file that takes full advantage of some UltraSPARC features.
- -native directs the compiler to produce the best executable (performance-wise) that it can for the system on which the program is being compiled.
Basic Compiler Techniques: -fast
-fast runs the program with a reasonable level of optimization, striking a balance between speed, portability, and safety. It is often a good way to get a first-cut approximation of how fast your program can run. However, -fast should not be used to build production code: the meaning of -fast often changes from one compiler release to another and, as with -native, it may mean different things on different machines.

Basic Compiler Techniques: Floating Point and Math Libraries
- -fsimple (simple floating-point model) tells the compiler to use a floating-point model that includes only ordinary numbers, allowing it to make simplifying assumptions.
- -xvector : enables the compiler to transform vectorizable loops from scalar to vector form. It is generally faster for long vectors and slower for short ones.
- -xlibmil : tells the compiler to inline certain mathematical operations such as floor, ceiling, and complex absolute value.
- -xlibmopt : tells the linker to use an optimized math library. This may produce slightly different answers than the regular math library; these libraries may get their speed by sacrificing accuracy.

Advanced Compiler Techniques
- -xcrossfile enables the compiler to optimize and inline source code across different files. It may compile code to be optimal for the set of files that are compiled together, producing a very fast executable.
- -xpad directs the compiler to insert padding (unused space) between adjacent variables in common blocks and between local variables, to try to improve cache performance.

Using Your Compiler Effectively: Classical Optimizations
The compiler performs the classical optimizations, plus a number of architecture-specific optimizations:
- Copy propagation
- Constant folding
- Dead code removal
- Strength reduction
- Induction variable elimination and simplification
- Common sub-expression elimination
- Loop-invariant code motion
- Register variable detection
- Inlining
- Loop fusion
- Loop unrolling

Classical Optimizations: Copy Propagation and Constant Folding
Copy propagation occurs both locally and globally; the compiler may be able to perform it across the flow graph.

Before:

    x = y
    z = 1.0 + x

After:

    x = y
    z = 1.0 + y

Constant folding: a clever compiler can find constants throughout your program.

    PROGRAM MAIN
    INTEGER I, K
    PARAMETER (I=200)
    K = 200
    J = I + K
    END

Classical Optimizations: Dead Code Removal and Strength Reduction
Dead code comes in two types: instructions that are unreachable, and instructions that produce results that are never used.

    PROGRAM MAIN
    I = 2
    WRITE (*,*) I
    STOP
    I = 4
    WRITE (*,*) I
    END

Strength reduction: operations and expressions have various time costs associated with them, and there are many opportunities for compiler-generated strength reductions.

    Y = X**2    becomes    Y = X*X
    J = K*2     becomes    J = K+K
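By hand, the two strength reductions above look like this in C (a sketch; real compilers apply these automatically at ordinary optimization levels):

    /* Y = X**2 becomes a multiply rather than a call to pow(). */
    double square(double x)
    {
        return x * x;
    }

    /* J = K*2 becomes an add (a compiler may also emit a shift, k << 1). */
    int twice(int k)
    {
        return k + k;
    }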
Classical Optimizations: Variable Renaming
Observe the variable x in the following fragment of code:

    x = y * z
    q = r + x + x
    x = a + b

Renamed:

    xx = y * z
    q = r + xx + xx
    x = a + b

Variable renaming is an important technique because it clarifies that the calculations are independent of each other, which increases the number of things that can be done in parallel.

Classical Optimizations: Common Sub-expression Elimination

Before:

    D = C * (A+B)
    E = (A+B) / 2

After:

    temp = A + B
    D = C * temp
    E = temp / 2

Different compilers go to different lengths to find common sub-expressions.

Classical Optimizations: Loop-Invariant Code Motion
The compiler will look for every opportunity to move calculations out of a loop and into the surrounding code; loop-invariant code motion is simply the act of moving the repeated, unchanging calculations outside.

Before:

    DO 10 I = 1, N
       A(I) = B(I) + C*D
       E = G(K)
 10 CONTINUE

After:

    temp = C*D
    DO 10 I = 1, N
       A(I) = B(I) + temp
 10 CONTINUE
    E = G(K)

Induction variable simplification: loops can contain what are called induction variables, which can be computed incrementally.

Before:

    DO 10 I = 1, N
       K = I*4 + M
 10 CONTINUE

After:

    K = M
    DO 10 I = 1, N
       K = K + 4
 10 CONTINUE

Classical Optimizations: Associative Transformations and Reductions
Example: the dot product of two vectors.

    SUM = 0.0
    DO 10 I = 1, N
       SUM = SUM + A(I)*B(I)
 10 CONTINUE

The loop is recursive on that single variable: every iteration needs the result of the previous iteration. Since the assignment is being made to a scalar, unrolling is not as straightforward as before; the obvious way is to calculate several iterations at a time.

    SUM = 0.0
    DO 10 I = 1, N, 4
       SUM = SUM + A(I)*B(I) + A(I+1)*B(I+1)
     &           + A(I+2)*B(I+2) + A(I+3)*B(I+3)
 10 CONTINUE

Classical Optimizations: Dependency Analysis
Dependency analysis is a technique whereby the syntactic constructs of a program are analyzed with the aim of determining whether certain values may depend on other, previously computed values. The real objective of dependence analysis is to determine whether two statements are independent of each other. Example:

    S1: A = C - A
    S2: A = B + C
    S3: B = A + C

DOALL transformations: this transformation converts every iteration of a loop into a process that is independent of all others; it assumes there are no loop-carried dependencies. The DOALL transformation is very efficient if it can be applied; however, many loops carry dependencies.

Classical Optimizations: Register Variable Detection
On many CISC processors there were few general-purpose registers. On RISC designs there are many more registers to choose from, and everything has to be brought into a register anyway, so most variables will be register-resident at some point. The new challenge is to determine which variables should live the greater portion of their lives in registers.

Classical Optimizations: Inlining
Inlining is the substitution of the body of a subprogram for the call of that subprogram; this eliminates function call overhead. To enable inlining with the Sun compilers, use -fast or -xO4:

    f77 -fast a.f
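In C the effect of inlining can be shown by hand (a sketch with illustrative names; the inline keyword is C99):

    /* A one-line helper the compiler can substitute at the call site. */
    static inline double axpy1(double a, double x, double y)
    {
        return a * x + y;
    }

    void saxpy(double a, const double *x, double *y, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = axpy1(a, x[i], y[i]);    /* becomes y[i] = a*x[i] + y[i],
                                               with no call/return overhead */
    }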
Classical Optimizations: Loop Fusion and Induction Values
Loop fusion is the process of fusing two adjacent loops with the same loop bounds, which is usually a good thing. Induction values are values that can be computed as a function of the loop count variable and possibly other values.

Parallel Programming: Compilation Switches
The switches -xautopar, -xexplicitpar, and -xparallel allow the compiler to do automatic and directive-based parallelization of your program:
- -xautopar : perform only those parallelizations the compiler can do automatically.
- -xexplicitpar : perform only those parallelizations you have directed it to do with pragmas in the source.
- -xparallel : parallelize both automatically and under pragma control.
- -xreduction : allow the compiler to parallelize reduction loops. A reduction loop is a loop that produces output with smaller dimension than the input (see the sketch after the remarks below).

Parallel Programming: Compiler Switches, Remarks
- In some cases, parallelizing a reduction loop can give different answers depending on the number of processors on which the loop is run.
- Compiler directives can usually overcome artificial barriers to parallelization.
- Compiler directives can also override legitimate barriers to parallelization, which introduces errors.
- The efficiency and effectiveness of automatic compiler parallelization can be significantly improved by supplying these switches.
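As an illustration of the reduction loops that -xreduction targets, here is a hand-transformed sketch (names are illustrative): n inputs reduce to one scalar, and the four independent partial sums mimic what a parallelizing compiler does. Because floating-point addition is not associative, regrouping the sum this way can change the last bits of the result, which is why the answer may depend on the number of processors.

    double reduce_sum(const double *v, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;

        for (i = 0; i + 3 < n; i += 4) {    /* independent partial sums */
            s0 += v[i];
            s1 += v[i+1];
            s2 += v[i+2];
            s3 += v[i+3];
        }
        for (; i < n; i++)                  /* leftover iterations */
            s0 += v[i];

        return (s0 + s1) + (s2 + s3);       /* final combining step */
    }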
Use of Math Libraries
BLAS, IMSL, NAG, LINPACK, LAPACK, ScaLAPACK, etc. Calls to these math libraries can often simplify coding:
- They are portable across different platforms.
- They are usually fine-tuned to the specific hardware, as well as to the sizes of the array arguments that are passed to them.
Examples: the Sun performance libraries (-xlic_lib=sunperf), IBM ESSL and ESSLSMP.
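For example, a dot product becomes one call to a tuned BLAS routine. The sketch below assumes the standard C BLAS interface (cblas.h); the header name and link line vary by vendor (e.g. -xlic_lib=sunperf on Sun, ESSL on IBM).

    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        double x[4] = {1.0, 2.0, 3.0, 4.0};
        double y[4] = {5.0, 6.0, 7.0, 8.0};

        /* ddot(N, X, incX, Y, incY): the library chooses the unrolling,
           blocking, and prefetching for the host machine. */
        double d = cblas_ddot(4, x, 1, y, 1);

        printf("dot = %f\n", d);    /* 70.0 */
        return 0;
    }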
Performance of a Selected Application: CFD
Optimization of the unsteady-state 3-D compressible Navier-Stokes equations, solved by a finite difference method. Computing system used: Sun UltraSPARC workstation (each node a quad-CPU Ultra Enterprise 450 server operating at 300 MHz).

    Grid size   Iterations   Build                                         Time (s)
    192x16x16   1000         No compiler options                           4930
    192x16x16   1000         Compiler optimization                         2620
    192x16x16   1000         Code restructuring + compiler optimization     680

Conclusion: restructuring the code and using the proper compiler optimizations reduced the execution time by a factor of about 8.

IBM p630 Configuration
- 4-way SMP POWER4, 1.0 GHz
- 8 GB main memory (16 GB max)
- AIX 5.1 and PPC Linux
- XL F77, F90, C, C++
- Performance libraries: BLAS 1, 2, 3; BLACS; ESSL

IBM p690 Configuration
- 32-way SMP POWER4, 1.1 GHz
- 64 GB main memory (256 GB max)
- AIX 5.1 and PPC Linux
- XL F77, F90, C, C++
- Performance libraries: BLAS 1, 2, 3; BLACS; ESSL

[Figure: LLCBench performance on the IBM p630.]
[Figure: LLCBench performance on the IBM p690.]
[Figure: CacheBench results on the IBM p690, showing L1 and L2 cache behavior.]

Conclusions
- Reducing memory overheads is important for the performance of both sequential and parallel programs; minimization of memory traffic is the single most important goal.
- For multi-dimensional arrays, access will be fastest if you iterate on the array subscript offering the smallest stride (step size).
- Data reuse in the memory sub-system increases performance.
- Basic and advanced compiler optimization flags can be used to improve performance.
- Write code so that the compiler finds it easy to locate optimizations.
- The compiler performs the classical optimization techniques and some loop optimization techniques.