Optimisation Techniques
High Performance Computing Workshop, Day 1: October 5, 2004
Uni-Processor Optimization: Code Restructuring and Loop Optimization Techniques

Lecture Outline
- Why you need to optimize a sequential code
- The memory hierarchy and how a code's performance depends on it
- Optimization techniques
- Loop optimization techniques: collapsing, fission, fusion, unrolling, interchange, invariant code extraction, de-factorization, overheads of if/while/goto, neighbor data dependency
- Arithmetic optimization
- Compiler optimizations
- Use of tuned math libraries
- Performance of selected applications and benchmarks
- Conclusions

Improving Single Processor Performance
How much sustained performance can one achieve for a given program on a machine? It is the programmer's job to take as much advantage as possible of the CPU's hardware and software characteristics to boost the performance of the program. Quite often, just a few simple changes to one's code improve performance by a factor of 2, 3, or better. Also, simply compiling with some of the optimization flags (-O3, -fast, etc.) can improve performance dramatically.

The Memory Sub-system: Access Time Is Important
A lot of time is spent accessing and storing data from and to memory, so it is important to keep in mind the approximate access times of the different memory types:
- CPU registers: 0 cycles (that's where the work is done!)
- L1 cache (separate data and instruction caches): 1 cycle; repeated access to a cached item takes only 1 cycle
- L2 cache (static RAM): 3-5 cycles
- Memory (DRAM): about 10 cycles on a cache miss; 30-60 cycles for a Translation Lookaside Buffer (TLB) update
- Disk: about 100,000 cycles!
- Other nodes: depends on network latency

Hierarchical Memory
[Figure: a four-level memory hierarchy for a large computer system. Level 0: registers and internal caches in the CPU; Level 1 (M1): external cache (SRAM); Level 2 (M2): main memory (DRAM); Level 3 (M3): disk storage (magnetic); Level 4 (M4): tape units (magnetic). Capacity and access time increase going down the hierarchy; cost per bit increases going up.]
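To see these numbers at work, here is a small self-contained C program (a sketch, not from the original slides; the array size is an illustrative choice) that sums the same matrix with stride-1 and then with stride-N accesses. On most machines the second loop nest runs several times slower, purely because of the memory hierarchy.

    #include <stdio.h>
    #include <time.h>

    #define N 2000

    static double a[N][N];              /* about 32 MB, zero-initialized */

    int main(void)
    {
        double sum = 0.0;
        clock_t t0, t1;
        int i, j;

        /* Stride-1: consecutive accesses fall in the same cache line. */
        t0 = clock();
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        t1 = clock();
        printf("stride-1: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        /* Stride-N: nearly every access misses the cache and may miss
           the TLB as well. */
        t0 = clock();
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                sum += a[i][j];
        t1 = clock();
        printf("stride-N: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        printf("sum = %f\n", sum);      /* keep the loops live */
        return 0;
    }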
Loop Optimization Techniques
- Classical optimization techniques: the compiler does these.
- Memory reference optimization: the compiler does this to some extent.
- Loop optimizations: the compiler does these to some extent. They include:
  - Loop fission and loop fusion
  - Loop interchange
  - Loop alignment
  - Loop collapsing
  - Loop unrolling

Loop Collapsing
Loop collapsing attempts to create one (larger) loop out of two or more small ones. This may be profitable if each of the original loops is too small for efficient vectorization, but the resulting single loop can be profitably vectorized. Loop collapsing is done with multi-dimensional arrays to avoid loop overheads.

Before:

    REAL A(5,5), B(5,5)
    DO 10 J = 1, 5
    DO 10 I = 1, 5
       A(I,J) = B(I,J) + 2.0
 10 CONTINUE

After:

    REAL A(25), B(25)
    DO 10 JI = 1, 25
       A(JI) = B(JI) + 2.0
 10 CONTINUE

Loop Collapsing (Contd)
Using this technique, the code may be transformed into a single loop regardless of the sizes of M and N; this may require some additional statements to restart the code properly. General versions of this technique are useful for computing systems that support only a single (not nested) DOALL statement.

Before:

    DO 10 J = 1, N
    DO 10 I = 1, M
       A(I,J) = B(I,J) + 2.0
 10 CONTINUE

After:

    DO 10 L = 1, N*M
       J = (L-1)/M + 1
       I = MOD(L-1, M) + 1
       A(I,J) = B(I,J) + 2.0
 10 CONTINUE

Loop Collapsing (Contd)
Assume the declaration a[50][80][4] (and the same for b and c).

Uncollapsed loop:

    for (i = 0; i < 50; i++)
        for (j = 0; j < 80; j++)
            for (k = 0; k < 4; k++)
                a[i][j][k] = a[i][j][k] * b[i][j][k] + c[i][j][k];

Collapsed loop:

    for (i = 0; i < 50*80*4; i++)
        a[0][0][i] = a[0][0][i] * b[0][0][i] + c[0][0][i];

Warning: this works only if the entire array space is accessed!

Loop Fission and Loop Fusion
Loop fission attempts to break a single loop into several loops in order to optimize data transfer (the behavior of main memory, cache, and registers); the primary objective of the optimization is data transfer.
Loop fusion transforms two adjacent loops into one, on the basis of information obtained from data-dependence analysis. Two statements will be placed into the same loop if there is at least one variable or array that is referenced by both.
Remark: loop fission and loop fusion are techniques related to strip mining and loop collapsing.

Loop Fusion (Contd)
Loop fusion is the merging of several loops into a single loop.

Untuned:

    for (i = 0; i < 100000; i++)
        x = x * a[i] + b[i];
    for (i = 0; i < 100000; i++)
        y = y * a[i] + c[i];

Tuned:

    for (i = 0; i < 100000; i++) {
        x = x * a[i] + b[i];
        y = y * a[i] + c[i];
    }

The tuned code runs at least 10 times faster on an UltraSPARC (both versions compiled with -O3).

Advantages:
- The loop overhead is reduced by a factor of two in the above case.
- It allows better instruction overlap in loops with dependencies.
- Cache misses can be decreased if both loops reference the same array.
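As a self-contained version of the fused loop above (a sketch: the arrays are zero-initialized globals only to keep it short, and x and y start at 1.0), the following compiles and runs as-is:

    #include <stdio.h>

    #define N 100000

    double a[N], b[N], c[N];            /* zero-initialized globals */

    int main(void)
    {
        double x = 1.0, y = 1.0;
        int i;

        /* Fused form: one pass over a[] feeds both recurrences, halving
           the loop overhead and reusing a[i] while it is in a register. */
        for (i = 0; i < N; i++) {
            x = x * a[i] + b[i];
            y = y * a[i] + c[i];
        }

        printf("x = %f, y = %f\n", x, y);
        return 0;
    }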
Loop Fusion (Contd)
Disadvantages:
- Fusion has the potential to increase cache misses if the fused loop contains references to more than four arrays and the starting elements of those arrays map to the same cache line, e.g.

    x = x * a[i] + b[i] * c[i] + d[i] / e[i];

Loop Optimizations: Basic Loop Unrolling
Loop unrolling is performing multiple loop iterations per pass. It is one of the most important optimizations that can be done on a pipelined machine: it helps performance because it fattens up a loop with calculations that can be done in parallel.
Remark: never unroll an inner loop.

Outer and Inner Loop Unrolling
In a loop nest (loops nested within other loops), the loop or loops in the center are called the inner loops and the surrounding loops are called outer loops.

Original loop nest:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                a[i][j][k] = a[i][j][k] + b[i][k][j]*c;

Unrolling the middle (j) loop twice:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j += 2)
            for (k = 0; k < n; k++) {
                a[i][j][k]   = a[i][j][k]   + b[i][k][j]*c;
                a[i][j+1][k] = a[i][j+1][k] + b[i][k][j+1]*c;
            }

Outer and Inner Loop Unrolling (Contd)
The reasons for applying outer loop unrolling are:
- to expose more computations, and
- to improve memory reference patterns (above, the unrolled body touches the adjacent elements b[i][k][j] and b[i][k][j+1] together).

Unrolling the k loop twice as well:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j += 2)
            for (k = 0; k < n; k += 2) {
                a[i][j][k]     = a[i][j][k]     + b[i][k][j]*c;
                a[i][j+1][k]   = a[i][j+1][k]   + b[i][k][j+1]*c;
                a[i][j][k+1]   = a[i][j][k+1]   + b[i][k+1][j]*c;
                a[i][j+1][k+1] = a[i][j+1][k+1] + b[i][k+1][j+1]*c;
            }

Loop Unrolling and Sum Reduction
Loop unrolling can be used to reduce data dependency: several accumulator variables eliminate the dependency between consecutive additions.

Untuned loop:

    a = 0.0;
    for (i = 0; i < ARRAY_SIZE; i++)
        for (j = 0; j < ARRAY_SIZE; j++)
            a = a + b[j]*c[i];

Tuned loop (unrolled to depth 4):

    a1 = a2 = a3 = a4 = 0.0;
    for (i = 0; i < ARRAY_SIZE; i++)
        for (j = 0; j < ARRAY_SIZE; j += 4) {
            a1 = a1 + b[j]  *c[i];
            a2 = a2 + b[j+1]*c[i];
            a3 = a3 + b[j+2]*c[i];
            a4 = a4 + b[j+3]*c[i];
        }
    a = a1 + a2 + a3 + a4;

Speed increases by a factor of 4 (with appropriate compiler switches).

Qualifying Candidates for Loop Unrolling
The previous example is an ideal candidate for loop unrolling. The following categories of loops are generally NOT prime candidates for unrolling:
- Loops with low trip counts
- Fat loops
- Loops containing branches
- Loops containing procedure calls
- Recursive loops
- Vector reductions

To be effective, loop unrolling requires a fairly large number of iterations in the original loop. When the trip count is low, the preconditioning loop does a proportionally large amount of work.
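The preconditioning loop mentioned above looks like this in C (a sketch with illustrative names, assuming a simple multiply-add loop): the short first loop handles the n mod 4 leftover iterations so that the main body can safely step by 4.

    void madd(double *a, const double *b, const double *c, int n)
    {
        int i;
        int pre = n % 4;                /* leftover (preconditioning) trips */

        for (i = 0; i < pre; i++)       /* runs 0 to 3 times */
            a[i] = a[i] + b[i] * c[i];

        for (; i < n; i += 4) {         /* main body, unrolled by 4 */
            a[i]   = a[i]   + b[i]   * c[i];
            a[i+1] = a[i+1] + b[i+1] * c[i+1];
            a[i+2] = a[i+2] + b[i+2] * c[i+2];
            a[i+3] = a[i+3] + b[i+3] * c[i+3];
        }
    }

When n is small, the preconditioning loop handles up to 3 of the n iterations, a large fraction of the total work, which is exactly why low-trip-count loops gain little from unrolling.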
Loops Containing Procedure Calls
Loops containing subroutine or function calls are generally not good candidates for unrolling:
- First: they often contain a fair number of instructions already, and the function call adds many more.
- Second: when the calling routine and the subroutine are compiled separately, it is impossible for the compiler to intermix their instructions.
- Last: function call overhead is expensive. Registers have to be saved and argument lists prepared, so the time spent calling and returning from a subroutine can be much greater than the loop overhead itself.

For example, this loop is not suitable for unrolling:

    DO 10 I = 1, N
       CALL SHORT(A(I), B(I), C)
 10 CONTINUE

    SUBROUTINE SHORT (A, B, C)
    A = A + B + C
    RETURN
    END

Unrolled by hand, with a preconditioning loop for the leftover iterations:

    II = MOD(N, 4)
    DO 9 I = 1, II
       CALL SHORT(A(I), B(I), C)
  9 CONTINUE
    DO 10 I = 1+II, N, 4
       CALL SHORT(A(I),   B(I),   C)
       CALL SHORT(A(I+1), B(I+1), C)
       CALL SHORT(A(I+2), B(I+2), C)
       CALL SHORT(A(I+3), B(I+3), C)
 10 CONTINUE

Fat Loops
If a particular loop is already fat, unrolling is not going to help much: the loop overhead is already spread over a fair number of instructions. A good rule of thumb is to look elsewhere for performance when the loop body exceeds three or four statements.

Recursive Loops
The original loop carries a dependency:

    DO 10 I = 2, N
       A(I) = A(I) + A(I-1)*B
 10 CONTINUE

Unrolled naively it is still recursive on A; every iteration needs the result of the previous one:

    A(I)   = A(I)   + A(I-1)*B
    A(I+1) = A(I+1) + A(I)*B
    A(I+2) = A(I+2) + A(I+1)*B
    A(I+3) = A(I+3) + A(I+2)*B

The dependency can be reduced by deriving a new set of recurrence equations, decreasing the dependencies at the expense of creating more work:

    DO 10 I = 2, N-1, 2
       A(I+1) = A(I+1) + A(I)*B + A(I-1)*B*B
       A(I)   = A(I)   + A(I-1)*B
 10 CONTINUE

(the first statement must use the old value of A(I), hence the ordering). This is an example of vector recursion. A good compiler can make the rolled-up version go faster by recognizing the dependency as an opportunity to save memory traffic.

Negatives of Loop Unrolling
Loop unrolling always adds some run time to the program. If you unroll a loop and see performance dip a little, you can assume that either:
- the loop was not a good candidate for unrolling in the first place, or
- a secondary effect absorbed your performance increase.
Other possible reasons:
- Unrolling by the wrong factor
- Register spilling
- Instruction cache misses
- Other hardware delays
- Outer loop unrolling
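In C, the reduced-dependency form of the recursive loop above can be written as follows (a sketch: the ordering keeps a[i+1] working on the old value of a[i], and a cleanup step handles an odd number of elements):

    void recurrence(double *a, double b, int n)
    {
        int i;

        for (i = 1; i + 1 < n; i += 2) {
            double old_ai = a[i];
            /* a[i+1] is computed directly from a[i-1] and the old a[i],
               so the two updates below are independent of each other. */
            a[i+1] = a[i+1] + old_ai * b + a[i-1] * b * b;
            a[i]   = old_ai + a[i-1] * b;
        }
        if (i < n)                      /* cleanup iteration */
            a[i] = a[i] + a[i-1] * b;
    }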
Loop Interchange
Loop interchange is a technique for rearranging a loop nest so that the right stuff is at the center; what the "right stuff" is depends on what you are trying to accomplish. Loop interchange can move computations to the center of the loop nest, and it is also good for improving memory access patterns: iterating on the wrong subscript causes a large stride and hurts performance. By inverting the loops so that the iteration variables causing the smaller strides are in the center, you can get a performance win.

Loop Interchange (Contd)
Loop interchange to move computations to the center. Frequently, the interchange of nested loops permits a significant increase in the amount of parallelism. The example is straightforward: it is easy to see that there are no inter-iteration dependencies.

Original code:

    PARAMETER (IDIM=1000, JDIM=1000, KDIM=4)
    DO 10 K = 1, KDIM
    DO 20 J = 1, JDIM
    DO 30 I = 1, IDIM
       D(I,J,K) = D(I,J,K) + V(I,J,K)*DT
 30 CONTINUE
 20 CONTINUE
 10 CONTINUE

Modified code:

    PARAMETER (IDIM=1000, JDIM=1000, KDIM=4)
    DO 10 I = 1, IDIM
    DO 20 J = 1, JDIM
    DO 30 K = 1, KDIM
       D(I,J,K) = D(I,J,K) + V(I,J,K)*DT
 30 CONTINUE
 20 CONTINUE
 10 CONTINUE

Loop Interchange (Contd)
Loop interchange is done to minimize the stride of the array accesses in the innermost loops. Interchanging loops can also reduce the loop overhead when the inner loops iterate far fewer times than the outer loops.

Original code:

    float a[2000][40][2];
    for (i = 0; i < 2000; i++)
        for (j = 0; j < 40; j++)
            for (k = 0; k < 2; k++)
                a[i][j][k] = a[i][j][k] * 2.50 + 0.056;

Modified code:

    float a[2][40][2000];
    for (i = 0; i < 2; i++)
        for (j = 0; j < 40; j++)
            for (k = 0; k < 2000; k++)
                a[i][j][k] = a[i][j][k] * 2.50 + 0.056;

A reduction of about 15% in execution time was obtained in C/Fortran.

Loop Optimization: Invariant Code Extraction
Statements that do not change within an inner loop can be moved outside of the loop (compiler optimizations can usually detect these).

Untuned:

    for (i = 0; i < 1000; i++) {
        for (j = 0; j < 1000; j++) {
            if (a[i] > 100) b[i] = a[i] - 5.0;
            x = x + a[j] + b[i];
        }
    }
    /* BEWARE: if the statement were a[i] = a[i] - 5.0,
       the result could be very different! */

Tuned:

    for (i = 0; i < 1000; i++) {
        if (a[i] > 100) b[i] = a[i] - 5.0;
        for (j = 0; j < 1000; j++)
            x = x + a[j] + b[i];
    }

Remark: the tuned code can be about 75 times faster than the untuned code!
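Here is the invariant-extraction example above in self-contained form (a sketch: the operator lost from the slide is assumed to be a subtraction, and the array contents are illustrative):

    #include <stdio.h>

    #define N 1000

    double a[N], b[N];                  /* illustrative, zero-initialized */

    int main(void)
    {
        double x = 0.0;
        int i, j;

        for (i = 0; i < N; i++) {
            if (a[i] > 100.0)           /* invariant in j: hoisted out */
                b[i] = a[i] - 5.0;
            for (j = 0; j < N; j++)
                x = x + a[j] + b[i];
        }

        printf("x = %f\n", x);
        return 0;
    }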
Loop Optimization: Loop De-factorization
Loop de-factorization consists of moving common multiplicative factors outside of inner loops.

Factorized:

    for (i = 0; i < 1000; i++) {
        a[i] = 0.0;
        for (j = 0; j < 1000; j++)
            a[i] = a[i] + b[j]*d[j]*c[i];
    }

De-factorized:

    for (i = 0; i < 1000; i++) {
        a[i] = 0.0;
        for (j = 0; j < 1000; j++)
            a[i] = a[i] + b[j]*d[j];
        a[i] = a[i]*c[i];
    }

Remark: on some platforms there is a benefit in doing this, since one (or two) multiplications and one (or two) additions can be done simultaneously in one clock cycle. Compiler optimizations will not make this transformation on their own, and the results may vary slightly due to the finite precision of the computer.

Loop Optimization: IF, WHILE, and DO Loops
Avoid IF/GOTO loops and WHILE loops: they inhibit compiler optimizations and introduce unnecessary overhead.

Untuned loop (IFs and GOTOs):

          I = 0
    10    I = I + 1
          IF (I .GT. 100000) GOTO 30
          A(I) = A(I) + B(I)*C(I)
          GOTO 10
    30    CONTINUE

Tuned:

          I = 0
    10    I = I + 1
          A(I) = A(I) + B(I)*C(I)
          IF (I .LT. 100000) GOTO 10

Another untuned loop (WHILE loop):

    I = 0
    DO WHILE (I .LT. 100000)
       I = I + 1
       A(I) = A(I) + B(I)*C(I)
    END DO

Tuned (a plain DO loop):

    DO I = 1, 100000
       A(I) = A(I) + B(I)*C(I)
    END DO

Loop Optimization: Neighbor Data Dependency
Untuned version (data wrap-around):

    jwrap = ARRAY_SIZE - 1;
    for (i = 0; i < ARRAY_SIZE; i++) {
        b[i] = (a[i] + a[jwrap]) * 0.5;
        jwrap = i;
    }

Compiler optimizations will not be able to determine that a[jwrap] is a neighbor value.

Tuned version:

    b[0] = (a[0] + a[ARRAY_SIZE - 1]) * 0.5;
    for (i = 1; i < ARRAY_SIZE; i++)
        b[i] = (a[i] + a[i-1]) * 0.5;

Remark: once the program is debugged, declare arrays to exact sizes whenever possible. This reduces memory use and also improves pipelining and cache utilization.

Programming Techniques: Managing the Cache
Original code:

    DO 10, J = 1, N
    DO 10, I = 1, N
    DO 10, K = 1, N
       C(I,J) = C(I,J) + A(I,K) * B(K,J)
 10 CONTINUE

We can modify this code to better use the cache by blocking the loops:

    DO 10, JB = 1, N, NB
    DO 10, IB = 1, N, NB
    DO 10, KB = 1, N, NB
    DO 10, J = JB, JB + NB - 1
    DO 10, I = IB, IB + NB - 1
    DO 10, K = KB, KB + NB - 1
       C(I,J) = C(I,J) + A(I,K) * B(K,J)
 10 CONTINUE

This is most useful as a simple example of cache blocking; most compilers will automatically cache-block the original code as part of ordinary optimization.

Loop Optimizations: Advantages
Loop optimizations accomplish three things:
- Reduce loop overhead
- Increase parallelism
- Improve memory access patterns
Understanding your tools and how they work is critical for using them with peak effectiveness. For performance, a compiler is your best friend.

Recap of Arithmetic Optimization
- Replace frequent divisions by multiplications with the inverse.
- Multiplications and divisions by integer powers of 2 can be replaced by bit shifts to the left or right (compilers can usually do this).
- Small integer exponentials such as a^n should be replaced by repeated multiplications a*a*a*... (compilers will usually do this).
- Reorganize (or eliminate) repeated (or useless) operations.
- Use Horner's rule to evaluate polynomials:

      Ax^5 + Bx^4 + Cx^3 + Dx^2 + Ex + F

  can be written as

      ((((A*x + B)*x + C)*x + D)*x + E)*x + F

  This saves more time in C (speed increases by a factor greater than 10) than in Fortran (an improvement of only about 30%), due to the way the C function pow(x,5) handles small integer powers (poorly).
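A compilable sketch of Horner's rule as described above (the coefficient values are illustrative; link with -lm for the pow() comparison):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double A = 1.0, B = 2.0, C = 3.0, D = 4.0, E = 5.0, F = 6.0;
        double x = 1.5;

        /* Naive evaluation: repeated pow() calls are costly in C. */
        double naive = A*pow(x,5) + B*pow(x,4) + C*pow(x,3)
                     + D*x*x + E*x + F;

        /* Horner's rule: five multiplies, five adds, no library calls. */
        double horner = ((((A*x + B)*x + C)*x + D)*x + E)*x + F;

        printf("naive = %f, horner = %f\n", naive, horner);
        return 0;
    }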
Compiler Optimizations
From Wikipedia: "Compiler optimization is used to improve the efficiency (in terms of running time or resource usage) of the executables output by a compiler. It allows programmers to write source code in a straightforward manner, expressing their intentions clearly, while allowing the computer to make choices about implementation details that lead to efficient execution. It may or may not result in executables that are perfectly 'optimal' by any measure."
Ref: http://en.wikipedia.org/wiki/Compiler_optimization

Sun Workshop Compiler 6.2
- -O : set the optimization level.
- -fast : select a set of flags likely to improve speed.
- -stackvar : put local variables on the stack.
- -xlibmopt : link optimized math libraries.
- -xarch : specify the instruction set architecture.
- -xchip : specify the target processor for use by the optimizer.
- -native : compile for best performance on the local host.
- -xprofile : collect data for a profile, or use a profile to optimize.
- -fns : turn on the SPARC nonstandard floating-point mode.
- -xunroll n : unroll loops n times.

Basic Compiler Techniques: Optimization Levels
- -O : optimize at the level most likely to give close to the maximum performance for many realistic applications (currently -O3).
- -O1 : do only the basic local optimizations (peephole).
- -O2 : do basic local and global optimization. This level usually gives the minimum code size.
- -O3 : adds global optimizations at the function level. In general this level, and -O4, result in the minimum code size when used with the -xspace option.
- -O4 : adds automatic inlining of functions in the same file; -g suppresses automatic inlining.
- -O5 : does the highest level of optimization, suitable only for the small fraction of a program that uses the largest fraction of computer time. It uses optimization algorithms that take more compilation time or that do not have as high a certainty of improving execution time. Optimization at this level is more likely to improve performance when it is done with profile feedback; see -xprofile=collect|use.

Basic Compiler Techniques: Local Variables on the Stack
-stackvar tells the compiler to put most variables on the stack rather than statically allocating them. It is almost always a good idea, and it is crucial when parallelizing. You can control stack versus static allocation for each variable: variables that appear in DATA, COMMON, SAVE, or EQUIVALENCE statements will be static regardless of whether you specify -stackvar.

Basic Compiler Techniques: Target Architecture
- -xchip specifies the target chip. Specifying the chip lets the compiler know implementation details such as specific instruction timings and the number of functional units.
- -xarch specifies the target architecture. A target architecture includes the instruction set but may not include implementation details such as instruction timing. On Sun, -xarch=v8plus produces an executable file that takes full advantage of some UltraSPARC features.
- -native directs the compiler to produce the best executable (performance-wise) that it can for the system on which the program is being compiled.
Basic Compiler Techniques: -fast
-fast runs the program with a reasonable level of optimization, striking a balance between speed, portability, and safety. It is often a good way to get a first-cut approximation of how fast your program can run. However, -fast should not be used to build production code: the meaning of -fast often changes from one compiler release to another and, as with -native, it may mean different things on different machines.

Basic Compiler Techniques: Floating Point and Math Libraries
- -fsimple (simple floating-point model) tells the compiler to use a floating-point model that includes only ordinary numbers, allowing it to make simplifying assumptions.
- -xvector : enables the compiler to transform vectorizable loops from scalar to vector form. It is generally faster for long vectors and slower for short ones.
- -xlibmil : tells the compiler to inline certain mathematical operations such as floor, ceiling, and complex absolute value.
- -xlibmopt : tells the linker to use an optimized math library. This may produce slightly different answers than the regular math library; these libraries may get their speed by sacrificing accuracy.

Advanced Compiler Techniques
- -xcrossfile enables the compiler to optimize and inline source code across different files. It may compile code to be optimal for the set of files that are compiled together, producing a very fast executable.
- -xpad directs the compiler to insert padding (unused space) between adjacent variables in common blocks and between local variables, to try to improve cache performance.

Using Your Compiler Effectively: Classical Optimizations
The compiler performs the classical optimizations, plus a number of architecture-specific optimizations:
- Copy propagation
- Constant folding
- Dead code removal
- Strength reduction
- Induction variable elimination and simplification
- Common sub-expression elimination
- Loop-invariant code motion
- Register variable detection
- Inlining
- Loop fusion
- Loop unrolling

Classical Optimizations: Copy Propagation and Constant Folding
Copy propagation occurs both locally and globally; the compiler may be able to perform it across the flow graph.

Before:

    x = y
    z = 1.0 + x

After:

    x = y
    z = 1.0 + y

Constant folding: a clever compiler can find constants throughout your program.

    PROGRAM MAIN
    INTEGER I, K
    PARAMETER (I=200)
    K = 200
    J = I + K
    END

Classical Optimizations: Dead Code Removal and Strength Reduction
Dead code comes in two types: instructions that are unreachable, and instructions that produce results that are never used.

    PROGRAM MAIN
    I = 2
    WRITE (*,*) I
    STOP
    I = 4
    WRITE (*,*) I
    END

Strength reduction: operations and expressions have various time costs associated with them, and there are many opportunities for compiler-generated strength reductions.

    Y = X**2    becomes    Y = X*X
    J = K*2     becomes    J = K+K
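By hand, the two strength reductions above look like this in C (a sketch; real compilers apply these automatically at ordinary optimization levels):

    /* Y = X**2 becomes a multiply rather than a call to pow(). */
    double square(double x)
    {
        return x * x;
    }

    /* J = K*2 becomes an add (a compiler may also emit a shift, k << 1). */
    int twice(int k)
    {
        return k + k;
    }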
Classical Optimizations: Variable Renaming
Observe the variable x in the following fragment of code:

    x = y * z
    q = r + x + x
    x = a + b

Renamed:

    xx = y * z
    q = r + xx + xx
    x = a + b

Variable renaming is an important technique because it clarifies that the calculations are independent of each other, which increases the number of things that can be done in parallel.

Classical Optimizations: Common Sub-expression Elimination

Before:

    D = C * (A+B)
    E = (A+B) / 2

After:

    temp = A + B
    D = C * temp
    E = temp / 2

Different compilers go to different lengths to find common sub-expressions.

Classical Optimizations: Loop-Invariant Code Motion
The compiler will look for every opportunity to move calculations out of a loop and into the surrounding code; loop-invariant code motion is simply the act of moving the repeated, unchanging calculations outside.

Before:

    DO 10 I = 1, N
       A(I) = B(I) + C*D
       E = G(K)
 10 CONTINUE

After:

    temp = C*D
    DO 10 I = 1, N
       A(I) = B(I) + temp
 10 CONTINUE
    E = G(K)

Induction variable simplification: loops can contain what are called induction variables, which can be computed incrementally.

Before:

    DO 10 I = 1, N
       K = I*4 + M
 10 CONTINUE

After:

    K = M
    DO 10 I = 1, N
       K = K + 4
 10 CONTINUE

Classical Optimizations: Associative Transformations and Reductions
Example: the dot product of two vectors.

    SUM = 0.0
    DO 10 I = 1, N
       SUM = SUM + A(I)*B(I)
 10 CONTINUE

The loop is recursive on that single variable: every iteration needs the result of the previous iteration. Since the assignment is being made to a scalar, unrolling is not as straightforward as before; the obvious way is to calculate several iterations at a time.

    SUM = 0.0
    DO 10 I = 1, N, 4
       SUM = SUM + A(I)*B(I) + A(I+1)*B(I+1)
     &           + A(I+2)*B(I+2) + A(I+3)*B(I+3)
 10 CONTINUE

Classical Optimizations: Dependency Analysis
Dependency analysis is a technique whereby the syntactic constructs of a program are analyzed with the aim of determining whether certain values may depend on other, previously computed values. The real objective of dependence analysis is to determine whether two statements are independent of each other. Example:

    S1: A = C - A
    S2: A = B + C
    S3: B = A + C

DOALL transformations: this transformation converts every iteration of a loop into a process that is independent of all others; it assumes there are no loop-carried dependencies. The DOALL transformation is very efficient if it can be applied; however, many loops carry dependencies.

Classical Optimizations: Register Variable Detection
On many CISC processors there were few general-purpose registers. On RISC designs there are many more registers to choose from, and everything has to be brought into a register anyway, so most variables will be register-resident at some point. The new challenge is to determine which variables should live the greater portion of their lives in registers.

Classical Optimizations: Inlining
Inlining is the substitution of the body of a subprogram for the call of that subprogram; this eliminates function call overhead. To enable inlining with the Sun compilers, use -fast or -xO4:

    f77 -fast a.f
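In C the effect of inlining can be shown by hand (a sketch with illustrative names; the inline keyword is C99):

    /* A one-line helper the compiler can substitute at the call site. */
    static inline double axpy1(double a, double x, double y)
    {
        return a * x + y;
    }

    void saxpy(double a, const double *x, double *y, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = axpy1(a, x[i], y[i]);    /* becomes y[i] = a*x[i] + y[i],
                                               with no call/return overhead */
    }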
Classical Optimizations: Loop Fusion and Induction Values
Loop fusion is the process of fusing two adjacent loops with the same loop bounds, which is usually a good thing. Induction values are values that can be computed as a function of the loop count variable and possibly other values.

Parallel Programming: Compilation Switches
The switches -xautopar, -xexplicitpar, and -xparallel allow the compiler to do automatic and directive-based parallelization of your program:
- -xautopar : perform only those parallelizations the compiler can do automatically.
- -xexplicitpar : perform only those parallelizations you have directed it to do with pragmas in the source.
- -xparallel : parallelize both automatically and under pragma control.
- -xreduction : allow the compiler to parallelize reduction loops. A reduction loop is a loop that produces output with smaller dimension than the input (see the sketch after the remarks below).

Parallel Programming: Compiler Switches, Remarks
- In some cases, parallelizing a reduction loop can give different answers depending on the number of processors on which the loop is run.
- Compiler directives can usually overcome artificial barriers to parallelization.
- Compiler directives can also override legitimate barriers to parallelization, which introduces errors.
- The efficiency and effectiveness of automatic compiler parallelization can be significantly improved by supplying these switches.
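As an illustration of the reduction loops that -xreduction targets, here is a hand-transformed sketch (names are illustrative): n inputs reduce to one scalar, and the four independent partial sums mimic what a parallelizing compiler does. Because floating-point addition is not associative, regrouping the sum this way can change the last bits of the result, which is why the answer may depend on the number of processors.

    double reduce_sum(const double *v, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;

        for (i = 0; i + 3 < n; i += 4) {    /* independent partial sums */
            s0 += v[i];
            s1 += v[i+1];
            s2 += v[i+2];
            s3 += v[i+3];
        }
        for (; i < n; i++)                  /* leftover iterations */
            s0 += v[i];

        return (s0 + s1) + (s2 + s3);       /* final combining step */
    }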
Use of Math Libraries
BLAS, IMSL, NAG, LINPACK, LAPACK, ScaLAPACK, etc. Calls to these math libraries can often simplify coding:
- They are portable across different platforms.
- They are usually fine-tuned to the specific hardware, as well as to the sizes of the array arguments that are passed to them.
Examples: the Sun performance libraries (-xlic_lib=sunperf), IBM ESSL and ESSLSMP.
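For example, a dot product becomes one call to a tuned BLAS routine. The sketch below assumes the standard C BLAS interface (cblas.h); the header name and link line vary by vendor (e.g. -xlic_lib=sunperf on Sun, ESSL on IBM).

    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        double x[4] = {1.0, 2.0, 3.0, 4.0};
        double y[4] = {5.0, 6.0, 7.0, 8.0};

        /* ddot(N, X, incX, Y, incY): the library chooses the unrolling,
           blocking, and prefetching for the host machine. */
        double d = cblas_ddot(4, x, 1, y, 1);

        printf("dot = %f\n", d);    /* 70.0 */
        return 0;
    }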
Performance of a Selected Application: CFD
Optimization of the unsteady-state 3-D compressible Navier-Stokes equations, solved by a finite difference method. Computing system used: Sun UltraSPARC workstation (each node a quad-CPU Ultra Enterprise 450 server operating at 300 MHz).

    Grid size   Iterations   Build                                         Time (s)
    192x16x16   1000         No compiler options                           4930
    192x16x16   1000         Compiler optimization                         2620
    192x16x16   1000         Code restructuring + compiler optimization     680

Conclusion: restructuring the code and using the proper compiler optimizations reduced the execution time by a factor of about 8.

IBM p630 Configuration
- 4-way SMP POWER4, 1.0 GHz
- 8 GB main memory (16 GB max)
- AIX 5.1 and PPC Linux
- XL F77, F90, C, C++
- Performance libraries: BLAS 1, 2, 3; BLACS; ESSL

IBM p690 Configuration
- 32-way SMP POWER4, 1.1 GHz
- 64 GB main memory (256 GB max)
- AIX 5.1 and PPC Linux
- XL F77, F90, C, C++
- Performance libraries: BLAS 1, 2, 3; BLACS; ESSL

[Figure: LLCBench performance on the IBM p630.]
[Figure: LLCBench performance on the IBM p690.]
[Figure: CacheBench results on the IBM p690, showing L1 and L2 cache behavior.]

Conclusions
- Reducing memory overheads is important for the performance of both sequential and parallel programs; minimization of memory traffic is the single most important goal.
- For multi-dimensional arrays, access will be fastest if you iterate on the array subscript offering the smallest stride (step size).
- Data reuse in the memory sub-system increases performance.
- Basic and advanced compiler optimization flags can be used to improve performance.
- Write code so that the compiler finds it easy to locate optimizations.
- The compiler performs the classical optimization techniques and some loop optimization techniques.