HPC-Cloud Interoperability
Twister4Azure
Thilina Gunarathne
Funded by Microsoft Azure Grant
High-Performance
Visualization Algorithms
For Data-Intensive Analysis
Seung-Hee Bae and Jong Youl Choi
Funded by NIH Grant 1RC2HG005806-01
Cyberinfrastructure for
Remote Sensing of Ice Sheets
Jerome Mitchell
Funded by NSF Grant OCI-0636361
Layered architecture (diagram, top to bottom): Applications, Programming Model, Runtime, Storage, Infrastructure, Hardware.
Research issues:
- Portability between HPC and Cloud systems
- Scaling performance
- Fault tolerance

Data locality:
- Factors beyond data locality can be exploited to improve performance.
- Achieving the best data locality is not always the optimal scheduling decision. For instance, if the node where a task's input data are stored is overloaded, running the task on it will result in performance degradation.
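As an illustration only (not Twister's or Hadoop's actual scheduler), a minimal Java sketch of this trade-off: prefer the node that holds the task's input block unless its load exceeds a threshold. The Node class, the 0.9 threshold, and the method names are assumptions made up for this example.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Set;

    // Hypothetical node descriptor: which input blocks it stores locally and how busy it is.
    class Node {
        final String name;
        final Set<String> localBlocks;
        final int runningTasks;
        final int slots;
        Node(String name, Set<String> localBlocks, int runningTasks, int slots) {
            this.name = name; this.localBlocks = localBlocks;
            this.runningTasks = runningTasks; this.slots = slots;
        }
        double load() { return (double) runningTasks / slots; }
    }

    class LocalityAwareScheduler {
        private static final double OVERLOAD = 0.9;   // assumed load threshold

        // Prefer a node that holds the task's input block, unless it is overloaded;
        // otherwise fall back to the least-loaded node in the cluster.
        static Node pickNode(String inputBlock, List<Node> nodes) {
            Node local = nodes.stream()
                    .filter(n -> n.localBlocks.contains(inputBlock))
                    .min(Comparator.comparingDouble(Node::load))
                    .orElse(null);
            if (local != null && local.load() < OVERLOAD) {
                return local;                          // data locality wins
            }
            return nodes.stream()                      // locality traded for load balance
                    .min(Comparator.comparingDouble(Node::load))
                    .orElse(local);
        }
    }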
Is Research as a Service?
Latency of common operations:
L1 cache reference: 0.5 ns
Branch mispredict: 5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 25 ns
Main memory reference: 100 ns
Compress 1K bytes with a cheap compression algorithm: 3,000 ns
Send 2K bytes over a 1 Gbps network: 20,000 ns
Read 1 MB sequentially from memory: 250,000 ns
Round trip within the same datacenter: 500,000 ns
Disk seek: 10,000,000 ns
Read 1 MB sequentially from disk: 20,000,000 ns
Send a packet CA -> Netherlands -> CA: 150,000,000 ns
Parallel Thinking
MPMD Software
Multiple Program Multiple Data (MPMD): a coarse-grained MIMD approach to programming.
Data parallel software: do the same thing to all elements of a structure (e.g., many matrix algorithms). Easiest to write and understand.
It applies to complex problems (e.g., MPI, distributed systems).
What applications are suitable for MPMD? (e.g., Wikipedia)
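For concreteness, a minimal Java sketch of the SPMD style that data-parallel programming generalizes: every worker thread runs the same code on its own slice of an array (in an MPMD setting the workers would run different programs instead). All names here are illustrative.

    public class SpmdExample {
        public static void main(String[] args) throws InterruptedException {
            double[] data = new double[1_000_000];
            for (int i = 0; i < data.length; i++) data[i] = i;

            int workers = Runtime.getRuntime().availableProcessors();
            Thread[] pool = new Thread[workers];
            for (int w = 0; w < workers; w++) {
                final int rank = w;                        // each worker gets a rank, SPMD style
                pool[w] = new Thread(() -> {
                    int chunk = (data.length + workers - 1) / workers;
                    int lo = rank * chunk;
                    int hi = Math.min(data.length, lo + chunk);
                    for (int i = lo; i < hi; i++) {
                        data[i] = data[i] * data[i];       // the same operation on every element of the slice
                    }
                });
                pool[w].start();
            }
            for (Thread t : pool) t.join();
            System.out.println("data[10] = " + data[10]);  // prints 100.0
        }
    }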
Bioinformatics pipeline (diagram): a FASTA file of N sequences goes through BLAST and block pairings into a pairwise distance calculation, producing a dissimilarity matrix of N(N-1)/2 values; the matrix feeds pairwise clustering (MPI) and MDS, and the results are visualized with PlotViz.
Users submit their jobs to the pipeline and the results are shown in a visualization tool. The chart illustrates a hybrid model with MapReduce and MPI; Twister will be a unified solution for the pipeline. All components are services, and so is the whole pipeline. We will research which stages of the pipeline services are suitable for private or commercial clouds.
Motivation
Data deluge: experienced in many domains.
MapReduce: data centered, QoS.
Classic parallel runtimes (MPI): efficient and proven techniques.
These motivate iterative MapReduce: MapReduce with more extensions.
Computation patterns (diagram): Map Only (input -> map -> output); MapReduce (input -> map -> reduce); Iterative MapReduce (input -> map -> reduce, repeated over iterations).
Twister v0.9
New infrastructure for iterative MapReduce programming:
- Distinction between static and variable data
- Configurable long-running (cacheable) map/reduce tasks
- Pub/sub messaging based communication/data transfers
- Broker network for facilitating communication

Driver-side structure (from the diagram; map/reduce tasks are cacheable on the worker nodes and may send <Key,Value> pairs directly):
configureMaps(..)
configureReduce(..)
while(condition){
    runMapReduce(..)
    Combine() operation
    updateCondition()
} //end while
close()
Communications and data transfers go through the pub/sub broker network and direct TCP.
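As a sketch of how a driver built around these calls might look (the method names follow the slide's pseudocode, not the exact Twister API, and MapReduceJob is a hypothetical wrapper):

    // Hypothetical wrapper mirroring the driver-side calls shown in the diagram.
    interface MapReduceJob<T> {
        void configureMaps(String partitionFile);     // distribute/cache the static data once
        void configureReduce(int numReducers);
        T runMapReduce(T variableData);               // one iteration: map -> reduce -> combine()
        void close();
    }

    class IterativeDriver {
        static <T> T iterate(MapReduceJob<T> job, T initial,
                             java.util.function.Predicate<T> converged) {
            job.configureMaps("partition.pf");        // static data stays cached on the workers
            job.configureReduce(8);
            T current = initial;                      // variable data is broadcast each iteration
            while (!converged.test(current)) {        // updateCondition() in the slide
                current = job.runMapReduce(current);  // combine() brings the result back to the driver
            }
            job.close();
            return current;
        }
    }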
Runtime architecture (diagram): the master node runs the main program and the Twister driver, which communicates with Twister daemons on the worker nodes through the pub/sub broker network (one broker serves several Twister daemons). Each worker node runs a Twister daemon with a worker pool of cacheable map/reduce tasks and a local disk. Scripts perform data distribution, data collection, and partition file creation.
MRRoles4Azure
Iterative MapReduce for Azure
- Distributed, highly scalable and highly available cloud services as the building blocks.
- Utilizes eventually-consistent, high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes.
- Decentralized architecture with global-queue-based dynamic task scheduling (see the sketch below).
- Minimal management and maintenance overhead.
- Supports dynamically scaling the compute resources up and down.
- MapReduce fault tolerance.
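A minimal sketch of the global-queue idea using an in-memory queue as a stand-in for the cloud queue service (the real implementation uses Azure Queues and tables, which are not shown here):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class GlobalQueueScheduling {
        public static void main(String[] args) throws InterruptedException {
            // Stand-in for the cloud queue holding map task descriptions.
            BlockingQueue<String> taskQueue = new LinkedBlockingQueue<>();
            for (int i = 0; i < 20; i++) taskQueue.add("map-task-" + i);

            Runnable worker = () -> {
                try {
                    String task;
                    // Each worker pulls tasks as fast as it finishes them, so faster
                    // (or newly added) workers automatically pick up more work.
                    while ((task = taskQueue.poll(1, TimeUnit.SECONDS)) != null) {
                        System.out.println(Thread.currentThread().getName() + " runs " + task);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            Thread w1 = new Thread(worker, "worker-1");
            Thread w2 = new Thread(worker, "worker-2");
            w1.start(); w2.start();
            w1.join(); w2.join();
        }
    }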
Performance comparisons (charts): BLAST sequence search; K-means clustering, weak scaling; multi-dimensional scaling, weak scaling.
Parallel Visualization Algorithms
- Parallel visualization algorithms (GTM, MDS, ...)
- Improved quality by using DA (deterministic annealing) optimization
- Interpolation
PlotViz
- Provides a virtual 3D space
- Cross-platform: Visualization Toolkit (VTK), Qt framework
GTM vs. MDS (SMACOF) comparison (table, partially recoverable):
- Input: GTM takes vector-based data.
- Objective function: GTM maximizes the log-likelihood.
- Complexity: GTM is O(KN) with K << N; MDS is O(N²).
- Optimization method: GTM uses EM.
Parallel GTM software stack (diagram): GTM and GTM-Interpolation (K latent points, N data points) built on Parallel HDF5 and ScaLAPACK, over MPI / MPI-IO and a parallel file system, running on Cray / Linux / Windows clusters.
Scalable MDS
Parallel MDS:
- O(N²) memory and computation required; 100k data points need 480 GB of memory.
- (Diagram: the matrix is decomposed into row blocks r1, r2 and column blocks c1, c2, c3 for parallel computation.)
MDS Interpolation:
- Finds an approximate mapping position with respect to the prior mapping of its k nearest neighbors.
- Per point it requires O(M) memory and O(k) computation.
- Pleasingly parallel.
- Mapped 2M points in 1,450 sec., vs. 27,000 sec. for 100k points with full MDS; roughly 7,500 times faster than estimating the full MDS.
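To make the per-point O(k) cost concrete, a minimal sketch (assuming a simple majorization update for a single out-of-sample point against its k fixed neighbors; the exact MI-MDS formulation in the papers may differ):

    public class MdsInterpolation {
        // Place one out-of-sample point given its original-space distances d[j] to k
        // already-mapped nearest neighbors whose target-space coordinates are p[j].
        static double[] interpolate(double[][] p, double[] d, int iterations) {
            int k = p.length, dim = p[0].length;
            double[] x = new double[dim];
            for (double[] pj : p)                        // start at the centroid of the k neighbors
                for (int c = 0; c < dim; c++) x[c] += pj[c] / k;

            for (int it = 0; it < iterations; it++) {    // simple majorization-style updates
                double[] next = new double[dim];
                for (int j = 0; j < k; j++) {
                    double dist = 0;
                    for (int c = 0; c < dim; c++) dist += (x[c] - p[j][c]) * (x[c] - p[j][c]);
                    dist = Math.sqrt(Math.max(dist, 1e-12));
                    for (int c = 0; c < dim; c++)
                        next[c] += (p[j][c] + d[j] * (x[c] - p[j][c]) / dist) / k;
                }
                x = next;
            }
            return x;   // O(k) work per point, independent of all the other points
        }

        public static void main(String[] args) {
            double[][] neighbors = { {0, 0, 0}, {1, 0, 0}, {0, 1, 0} };
            double[] targetDistances = { 1.0, 1.0, 1.0 };
            double[] mapped = interpolate(neighbors, targetDistances, 50);
            System.out.printf("mapped point: (%.3f, %.3f, %.3f)%n", mapped[0], mapped[1], mapped[2]);
        }
    }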
Interpolation extension to GTM/MDS (diagram): of the total N data points, n in-sample points are used for training (MPI, Twister), producing the trained data; the remaining N-n out-of-sample points are split into blocks 1, 2, ..., P-1 and mapped by interpolation (Twister), producing the interpolated map.
GTM/MDS Applications
PubChem data with CTD, visualized using MDS (left) and GTM (right): about 930,000 chemical compounds are visualized as points in 3D space, annotated with the related genes in the Comparative Toxicogenomics Database (CTD).
Chemical compounds reported in the literature, visualized by MDS (left) and GTM (right): 234,000 chemical compounds that may be related to a set of 5 genes of interest (ABCB1, CHRNB2, DRD2, ESR1, and F2), based on the dataset collected from ...
Twister-MDS Demo
This demo shows real-time visualization of the multidimensional scaling (MDS) calculation. We use Twister for the parallel calculation inside the cluster and PlotViz to show the intermediate results on the user's client computer. The computation and monitoring process is automated by the program.
Twister-MDS Output (diagram): the Twister-MDS job, the MDS monitor, and PlotViz communicate through an ActiveMQ broker; the first step (I) sends a message to start the job.
Twister-MDS Structure (diagram): the master node runs the Twister driver and the Twister-MDS program; it communicates with the Twister daemons on the worker nodes through the pub/sub broker network. Each daemon's worker pool runs the map and reduce tasks of the two computation steps, calculateBC and calculateStress.
Bioinformatics Pipeline
(Diagram) From N = 1 million gene sequences, select a reference sequence set (M = 100K), leaving an N-M sequence set (900K). The reference set goes through pairwise alignment and distance calculation to produce a distance matrix, then multidimensional scaling (MDS, O(N²)) to produce reference coordinates (x, y, z). The N-M set is mapped by interpolative MDS with pairwise distance calculation to produce N-M coordinates (x, y, z). Both coordinate sets feed the visualization, a 3D plot.
Broker topologies (diagram): A. Full mesh network: 5 brokers and 4 computing nodes in total, with broker-broker, broker-driver, and broker-daemon connections. B. Hierarchical sending: 7 brokers and 32 computing nodes in total. C. Performance improvement.
Twister-MDS execution time (chart): 100 iterations on 40 nodes, under different input data sizes (38400, 51200, 76800, 102400 points). Total execution times for the two plotted configurations were roughly 148.8/189.3 s, 303.4/359.6 s, 737.1/816.4 s, and 1404.4/1508.5 s respectively.
Broadcasting on 40 nodes (chart; in Method C, the centroids are split into 160 blocks and sent through 40 brokers in 4 rounds). For 400 MB, 600 MB, and 800 MB of data, the broadcasting times were roughly 13.1/46.2 s, 18.8/70.6 s, and 24.5/93.1 s (Method C vs. Method B).
(Diagram) The Twister driver configures the mappers and broadcasts data along a map broadcasting chain through the brokers to the Twister daemons on the worker nodes, which add it to MemCache; cacheable map, merge, and reduce tasks run in the daemons, and results are returned along a reduce collection chain.
Chain/Ring Broadcasting (Twister driver node and Twister daemon nodes)
Driver sender:
- send a broadcast data block
- get the acknowledgement
- send the next broadcast data block
Daemon sender:
- receive data from the previous daemon (or the driver)
- cache the data on the daemon
- send the data to the next daemon (waiting for its ACK)
- send an acknowledgement to the previous daemon
Chain Broadcasting Protocol
(Sequence diagram, driver and three daemons) The driver sends a data block to the first daemon; each daemon receives and handles the block, forwards it to the next daemon, waits for that daemon's ack, and then acks its own predecessor. Data blocks therefore flow down the chain while acknowledgements flow back toward the driver, and the driver sends the next block as soon as it gets the ack for the previous one.
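A minimal Java sketch of one daemon's role in this protocol (sockets only; error handling and the real Twister wire format are omitted, and all names are illustrative):

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.net.Socket;
    import java.util.ArrayList;
    import java.util.List;

    class ChainDaemon {
        private final List<byte[]> cache = new ArrayList<>();

        // prev: connection to the previous daemon (or the driver);
        // next: connection to the next daemon, or null for the last daemon in the chain.
        void relay(Socket prev, Socket next, int numBlocks) throws IOException {
            DataInputStream in = new DataInputStream(prev.getInputStream());
            DataOutputStream ackUp = new DataOutputStream(prev.getOutputStream());
            DataOutputStream out = next == null ? null : new DataOutputStream(next.getOutputStream());
            DataInputStream ackDown = next == null ? null : new DataInputStream(next.getInputStream());

            for (int b = 0; b < numBlocks; b++) {
                int len = in.readInt();                 // receive one block from the predecessor
                byte[] block = new byte[len];
                in.readFully(block);
                cache.add(block);                       // cache the data locally
                if (out != null) {
                    out.writeInt(len);                  // forward the block down the chain
                    out.write(block);
                    out.flush();
                    ackDown.readInt();                  // wait for the successor's ack
                }
                ackUp.writeInt(b);                      // ack the predecessor (and, transitively, the driver)
                ackUp.flush();
            }
        }
    }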
Broadcasting Time Comparison (chart): chain broadcasting on 80 nodes, 600 MB of data in 160 pieces; the y-axis is the broadcasting time in seconds (0-30) and the x-axis is the execution number.
Application classes and their interconnection patterns:

Map Only (input -> map -> output):
- CAP3 analysis
- Document conversion (PDF -> HTML)
- Brute force searches in cryptography
- Parametric sweeps
- CAP3 gene assembly
- PolarGrid Matlab data analysis

Classic MapReduce (input -> map -> reduce):
- High Energy Physics (HEP) histograms
- SWG gene alignment
- Distributed search
- Distributed sorting
- Information retrieval
- HEP data analysis
- Calculation of pairwise distances for ALU sequences

Iterative Reductions, Twister (input -> map -> reduce, with iterations):
- Expectation maximization algorithms
- Clustering (K-means, deterministic annealing clustering)
- Linear algebra
- Multidimensional scaling (MDS)

Loosely Synchronous (MPI):
- Many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions
- Solving differential equations
- Particle dynamics with short-range forces
Twister Futures
- Development of a library of collectives to use in the Reduce phase
- Broadcast and Gather are needed by current applications
- Discover other important collectives
- Implement them efficiently on each platform, especially Azure
Convergence is Happening
(Diagram) Clouds (private and public, connected through the FG network) and multicore systems (parallel threading and processes) are converging.
SALSAHPC Dynamic Virtual Clusters on FutureGrid (demo at SC09)
Demonstrates the concept of Science Clouds on FutureGrid.
Dynamic cluster architecture (diagram): SW-G using Hadoop on a Linux bare system, SW-G using Hadoop on Linux on Xen, and SW-G using DryadLINQ on a Windows Server 2008 bare system, all provisioned through the XCAT infrastructure on iDataplex bare-metal nodes (32 nodes).
Monitoring infrastructure (diagram): a pub/sub broker network connects the virtual/physical clusters (XCAT infrastructure, iDataplex bare-metal nodes) to the monitoring interface, summarizer, and switcher.
- Switchable clusters on the same hardware (~5 minutes between different OSes, e.g., Linux+Xen to Windows+HPCS).
- Support for virtual clusters.
- SW-G: Smith-Waterman-Gotoh dissimilarity computation, a pleasingly parallel problem suitable for MapReduce-style applications.
SALSAHPC dynamic virtual clusters on a FutureGrid cluster: demonstrating the concept of Science Clouds using FutureGrid (demo at SC09).
Apache Lucene:
- A library written in Java for building inverted indices and supporting full-text search
- Incremental indexing, document scoring, and multi-index search with merged results, etc.
- Existing solutions using Lucene store index data in files, with no natural integration with HBase
System design
Data from a real digital library application: bibliography data, page image data, and text data.
System design: (diagram)
System design
Table schemas:
- title index table: <term value> --> {frequencies: [<doc id>, <doc id>, ...]}
- texts index table: <term value> --> {frequencies: [<doc id>, <doc id>, ...]}
- texts term position vector table: <term value> --> {positions: [<doc id>, <doc id>, ...]}
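For illustration, a hedged sketch of writing and reading one posting in the title index table with the standard HBase client API (HBase 1.x-style calls); the encoding of the frequency value and the exact table and family names used in the real system are assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TitleIndexExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table titleIndex = conn.getTable(TableName.valueOf("titleIndex"))) {

                // Row key = term value; one column per document id under the
                // "frequencies" family, holding the term frequency in that document.
                Put put = new Put(Bytes.toBytes("mapreduce"));
                put.addColumn(Bytes.toBytes("frequencies"), Bytes.toBytes("doc-0042"), Bytes.toBytes(3L));
                titleIndex.put(put);

                // Query: fetch all postings for a term with a single row read.
                Result postings = titleIndex.get(new Get(Bytes.toBytes("mapreduce")));
                long freq = Bytes.toLong(
                        postings.getValue(Bytes.toBytes("frequencies"), Bytes.toBytes("doc-0042")));
                System.out.println("frequency of 'mapreduce' in doc-0042: " + freq);
            }
        }
    }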
System implementation
Experiments were completed on the Alamo HPC cluster of FutureGrid.
MyHadoop -> MyHBase
Workflow: (diagram)
Solandra:
- Inverted index implemented as tables in Cassandra
- Different index table designs; no MapReduce support
Future work
Distributed performance evaluation
More data analysis or text mining based on the
index data
Distributed search engine integrated with HBase
region servers
Education
We offer classes on emerging new topics, together with tutorials on the most popular cloud computing tools.
Broader Impact
Hosting workshops and spreading our technology across the nation.
Giving students an unforgettable research experience.
Acknowledgement
SALSA HPC Group
Indiana University
http://salsahpc.indiana.edu
An application analyzing data from the Large Hadron Collider (1 TB now, but 100 petabytes eventually).
Combine the histograms produced by separate Root maps (of event data to partial histograms) into a single histogram delivered to the client.
This is an example of using MapReduce to do distributed histogramming.
Worked example (figure, modeled on the MapReduce word-count overview at http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview.png): the input events Pi (e.g., 0.23, 0.89, 0.27, 0.29, 0.23, 0.89, 0.27, 0.23, 0.11) are split across map tasks; each map emits a <Bin, 1> pair for the bin its event falls into (Bin11, Bin23, Bin27, Bin29, Bin89); the pairs are sorted and shuffled by bin; and the reduce step sums the counts for each bin to produce the final histogram (totals per Bin11, Bin23, Bin27, Bin29, and Bin89).
Mapper code (cleaned up from the slide; the "is in bin" tests are pseudocode):

Word-count mapper core, from the referenced figure:
    word.set(tokenizer.nextToken());   // parsing
    output.collect(word, one);

Histogramming mapper that emits one <bin, 1> pair per event:
    for (int i = 0; i < VEC_LENGTH; i++) {
        if (event is in bin[i]) {       // pseudo
            event.set(i);
            output.collect(event, one);
        }
    }

Histogramming mapper that accumulates a local bin array and emits it once per property:
    double[] bins = new double[BIN_SIZE];
    for (int j = 0; j < PROPERTIES; j++) {
        if (eventVector is in bins[j]) {   // pseudo
            ++bins[j];
        }
    }
    output.collect(Property, bins);        // pseudo
How it works (illustrations from Wikipedia)
K-means with iterative MapReduce (pseudocode):

Do
    Broadcast Cn
    [Perform in parallel] the map() operation (E-step):
        for each Vi
            for each Cn,j
                Dij <= Euclidean(Vi, Cn,j)
            Assign point Vi to the Cn,j with minimum Dij
        for each Cn,j
            Cn,j <= Cn,j / K
    [Perform sequentially] the reduce() operation (global reduction, M-step):
        Collect all Cn
        Calculate the new cluster centers Cn+1
        Diff <= Euclidean(Cn, Cn+1)
while (Diff > THRESHOLD)   // iterate until the centers stop moving

Vi refers to the ith vector.
Cn,j refers to the jth cluster center in the nth iteration.
Dij refers to the Euclidean distance between the ith vector and the jth cluster center.
K is the number of cluster centers.
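A plain-Java sketch of the map and reduce steps in this pseudocode (partial sums and counts per center in map, new centers in reduce); it ignores Twister's actual interfaces and key/value plumbing:

    public class KMeansStep {
        // map(): assign each vector in this partition to its nearest center and
        // accumulate per-center partial sums and counts (the values a map task emits).
        static double[][] map(double[][] points, double[][] centers, long[] counts) {
            int k = centers.length, dim = centers[0].length;
            double[][] sums = new double[k][dim];
            for (double[] v : points) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double dist = 0;
                    for (int c = 0; c < dim; c++) dist += (v[c] - centers[j][c]) * (v[c] - centers[j][c]);
                    if (dist < bestDist) { bestDist = dist; best = j; }
                }
                counts[best]++;
                for (int c = 0; c < dim; c++) sums[best][c] += v[c];
            }
            return sums;
        }

        // reduce(): combine the partial sums and counts from all map tasks into the new centers.
        static double[][] reduce(double[][][] partialSums, long[][] partialCounts, double[][] oldCenters) {
            int k = oldCenters.length, dim = oldCenters[0].length;
            double[][] next = new double[k][dim];
            long[] total = new long[k];
            for (int m = 0; m < partialSums.length; m++)
                for (int j = 0; j < k; j++) {
                    total[j] += partialCounts[m][j];
                    for (int c = 0; c < dim; c++) next[j][c] += partialSums[m][j][c];
                }
            for (int j = 0; j < k; j++)
                for (int c = 0; c < dim; c++)
                    next[j][c] = total[j] > 0 ? next[j][c] / total[j] : oldCenters[j][c];  // keep empty centers
            return next;
        }
    }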
(Diagram) Each data partition computes, for every cluster center C1 ... Ck, the partial coordinate sums of the points assigned to it (x1, y1, ..., xk, yk) and the per-center counts (count1, ..., countk); these partial results are combined to form the new centers.
Twister K-means Execution
(Diagram) The driver configures the map tasks with <c, File1>, <c, File2>, ..., <c, Filek> pairs naming the data partitions; each map task emits <K, partial centers C1 ... Ck> pairs, which are combined into the new cluster centers C1 ... Ck broadcast for the next iteration.