Outline
Language popularity (% skilled developers):
#1 C (19.8%)
#2 Java (17.2%)
#3 Objective-C (9.48%)
#4 C++ (9.3%)
#19 Matlab (0.59%)
#27 Fortran (0.35%)
class Test {
    int counter = 0;

    void increaseCount() {
        counter++;
    }
}

// close the pool and wait for all tasks to finish
threadPool.join();
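Note that counter++ on a field shared across threads is not atomic (it is a read-modify-write sequence). A minimal thread-safe variant using java.util.concurrent (class name illustrative):

import java.util.concurrent.atomic.AtomicInteger;

class SafeTest {
    private final AtomicInteger counter = new AtomicInteger(0);

    void increaseCount() {
        counter.incrementAndGet(); // atomic increment, safe under concurrent calls
    }
}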
//omp parallel for
for (i = 1; i < n; i++) {
    b[i] = (a[i] + a[i-1]) * 0.5;
}
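For comparison, the same independent-iteration loop without the JOMP runtime, using the JDK's parallel streams (Java 8+; array contents and sizes illustrative):

import java.util.stream.IntStream;

public class ParallelAverage {
    public static void main(String[] args) {
        int n = 1_000_000;
        double[] a = new double[n], b = new double[n];
        for (int i = 0; i < n; i++) a[i] = i;
        // Each iteration only reads a[] and writes a distinct b[i],
        // so the index range can be split safely across worker threads.
        IntStream.range(1, n).parallel()
                 .forEach(i -> b[i] = (a[i] + a[i - 1]) * 0.5);
        System.out.println(b[1]); // 0.5
    }
}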
import mpi.*;

public class Hello {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int msg_tag = 0; // any tag; must match between Send and Recv
        if (rank == 0) {
            int peer_process = 1;
            String[] msg = new String[1];
            msg[0] = new String("Hello");
            MPI.COMM_WORLD.Send(msg, 0, 1, MPI.OBJECT, peer_process, msg_tag);
        } else if (rank == 1) {
            int peer_process = 0;
            String[] msg = new String[1];
            MPI.COMM_WORLD.Recv(msg, 0, 1, MPI.OBJECT, peer_process, msg_tag);
            System.out.println(msg[0]);
        }
        MPI.Finalize();
    }
}
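The exact launcher depends on the MPJ implementation; with MPJ Express, for example, the program can be compiled and run on two processes roughly as follows:

javac -cp .:$MPJ_HOME/lib/mpj.jar Hello.java
mpjrun.sh -np 2 Hello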
Java message-passing libraries: socket implementations, high-speed network support and APIs (X = supported):

Project         Pure Java   Java IO   Java NIO   Myrinet   InfiniBand   SCI   mpiJava 1.2   JGF MPJ   Other APIs
MPJava              X                     X                                                                X
Jcluster            X           X                                                                          X
Parallel Java       X           X                                                                          X
mpiJava                                               X         X         X        X
P2P-MPI             X           X         X                                                      X
MPJ Express         X                     X           X                            X
MPJ/Ibis            X           X                     X                            X
JMPI                X           X                                                                X
F-MPJ               X           X                     X         X         X        X
FastMPJ
FastMPJ layered design (top to bottom; a collective-call sketch follows below):
MPJ Applications
MPJ API (mpiJava 1.2)
FastMPJ Library
MPJ Collective Primitives
MPJ Point-to-Point Primitives
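As an illustration of the collective primitives layer, a minimal Allreduce sketch against the mpiJava 1.2 API (class name and values illustrative; the setup mirrors the Hello example above):

import mpi.*;

public class SumRanks {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int[] sendbuf = { rank };
        int[] recvbuf = new int[1];
        // Every process contributes its rank; all processes receive the sum.
        MPI.COMM_WORLD.Allreduce(sendbuf, 0, recvbuf, 0, 1, MPI.INT, MPI.SUM);
        System.out.println("Rank " + rank + ": sum of ranks = " + recvbuf[0]);
        MPI.Finalize();
    }
}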
Java GPGPU
[Figure: Matrix multiplication performance in GFLOPS for CUDA, jCuda, Aparapi and Java; single precision (left) and double precision (right); problem sizes 2048x2048, 4096x4096 and 8192x8192.]
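Since the charts compare CUDA, jCuda, Aparapi and plain Java, a minimal Aparapi sketch of a matrix-multiply kernel may help (package is com.aparapi in current releases, com.amd.aparapi in older ones; sizes illustrative):

import com.aparapi.Kernel;
import com.aparapi.Range;

public class MatMulAparapi {
    public static void main(String[] args) {
        final int n = 512;
        final float[] a = new float[n * n];
        final float[] b = new float[n * n];
        final float[] c = new float[n * n];
        java.util.Arrays.fill(a, 1f);
        java.util.Arrays.fill(b, 2f);
        Kernel kernel = new Kernel() {
            @Override public void run() {
                // One work-item per output element
                int row = getGlobalId(1);
                int col = getGlobalId(0);
                float sum = 0f;
                for (int k = 0; k < n; k++) {
                    sum += a[row * n + k] * b[k * n + col];
                }
                c[row * n + col] = sum;
            }
        };
        // Aparapi translates run() bytecode to OpenCL at runtime and
        // falls back to a Java thread pool if no GPU is available.
        kernel.execute(Range.create2D(n, n));
        System.out.println(c[0]); // 1024.0 (row of 1s dot column of 2s)
        kernel.dispose();
    }
}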
Java GPGPU
[Figure: Matrix multiplication runtime in seconds (log scale) for problem sizes 2048x2048 to 8192x8192; single precision (left) and double precision (right).]
Java GPGPU
[Figure: Matrix multiplication performance in GFLOPS for problem sizes 2048x2048 to 8192x8192; single precision (left) and double precision (right).]
Java GPGPU
[Figure: Matrix multiplication performance in GFLOPS for problem sizes 2048x2048 to 8192x8192, two panels (single and double precision).]
[Table: NPB-MPJ codes: name, operation, SLOC, communication intensiveness, and whether each code is a kernel or an application.]
NPB-MPJ Optimization:
JVM JIT compilation of heavy and frequently invoked methods using runtime information
Structured programming is the best option
Small, frequently called methods are better
Mapping elements from multidimensional to one-dimensional arrays (array flattening technique, see the sketch below):
arr3D[x][y][z] → arr3D[pos3D(lengthX, lengthY, x, y, z)]
NPB-MPJ code refactored, obtaining significant improvements (up to 2800% performance increase)
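A minimal sketch of the flattening technique (pos3D follows the parameter order shown above; names and dimensions are illustrative):

public class Flatten3D {
    // Map (x, y, z) onto a 1D index: index = x + lengthX * (y + lengthY * z),
    // so consecutive x elements are adjacent in memory.
    static int pos3D(int lengthX, int lengthY, int x, int y, int z) {
        return x + lengthX * (y + lengthY * z);
    }

    public static void main(String[] args) {
        int lengthX = 4, lengthY = 3, lengthZ = 2;
        // One contiguous allocation instead of lengthY*lengthZ row objects,
        // avoiding pointer chasing and helping the JIT and the prefetcher.
        double[] arr3D = new double[lengthX * lengthY * lengthZ];
        for (int z = 0; z < lengthZ; z++)
            for (int y = 0; y < lengthY; y++)
                for (int x = 0; x < lengthX; x++)
                    arr3D[pos3D(lengthX, lengthY, x, y, z)] = x + 10 * y + 100 * z;
        System.out.println(arr3D[pos3D(lengthX, lengthY, 3, 2, 1)]); // 123.0
    }
}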
[Figure: NPB kernel performance for CG, FT, IS and MG on cc1.4xlarge and cc2.8xlarge instances.]
Java for High Performance Computing
High Performance Cloud Computing
AWS IaaS for HPC and Big Data
Performance Evaluation
Conclusions
IaaS: Infrastructure-as-a-Service
On-demand access to infrastructure supports the execution of computationally intensive tasks in the cloud. There are many IaaS providers:
Amazon Web Services (AWS)
Google Compute Engine (beta)
IBM cloud
HP cloud
Rackspace
ProfitBricks
Penguin-On-Demand (POD)
AWS
Provides a set of instance types for HPC:
CC1: dual-socket, quad-core Nehalem Xeon (93 GFlops)
CG1: dual-socket, quad-core Nehalem Xeon with 2 GPUs (1123 GFlops)
CC2: dual-socket, octo-core Sandy Bridge Xeon (333 GFlops)
Up to 63 GB of RAM
10 Gigabit Ethernet high-performance network
[Figure: performance in GFLOPS and speedup vs number of instances (1 to 32).]
Finis Terrae (CESGA): 14.01 TFlops
Our AWS cluster: 14.23 TFlops
AWS (06/11): 41.82 TFlops (#451 in TOP500 06/11)
MareNostrum (BSC): 63.83 TFlops (#299 in TOP500)
MinoTauro (BSC): 103.2 TFlops (#114 in TOP500)
AWS (11/11): 240.1 TFlops (#42 in TOP500)
[Figure: runtime in seconds and speedup vs number of instances (1 to 32).]
[Figure: point-to-point latency (µs) and bandwidth (Gbps) vs message size, 1 byte to 16 MB.]
[Figures: NPB kernel performance in MOPS vs number of cores (1 to 512), four panels.]
[Figures: point-to-point latency (µs) and bandwidth (Gbps) vs message size, 1 byte to 64 MB, four panels.]
[Figures: NPB kernel performance in MOPS vs number of cores (1 to 512), eight panels.]
HPC in AWS
Efficient Cloud Computing support
[Figure: CG kernel OpenMPI performance in MOPS on Amazon EC2 vs number of processes. Left: default vs fill-up vs tuned process mappings (default and 16-instance configurations). Right: default vs 16, 32 and 64 instances.]
Case study: ProtTest-HPC
ProtTest-HPC design
ProtTest-HPC Performance
[Figure: ProtTest-HPC shared memory performance on an 8-core system: speedup vs number of threads (1 to 8) for RIB, COX, HIV, RIBML, COXML and HIVML.]
ProtTest-HPC Performance
[Figure: ProtTest-HPC shared memory performance on a 24-core system: speedup vs number of threads (1 to 24) for RIB, COX, HIV, RIBML, COXML and HIVML.]
ProtTest-HPC Performance
[Figure: ProtTest-HPC distributed memory performance (Harpertown): speedup vs number of cores (1 to 112) for RIBML, COXML and HIVML (10K, 20K, 100K).]
ProtTest-HPC Performance
[Figure: ProtTest-HPC hybrid implementation performance (Harpertown): speedup vs number of cores (1 to 224) for RIBML, COXML and HIVML (10K, 20K, 100K).]
ProtTest-HPC Performance
[Figure: ProtTest-HPC hybrid implementation performance (Nehalem): speedup vs number of cores (1 to 64) for RIBML, COXML and HIVML (10K, 20K, 100K).]
Summary
Questions?