
Welcome to the M.S.

Seminar of
Shashidhar G

Title: Automatic Code Generation for Graph Algorithms on


GPUs
While we start the seminar, start a conversation with your neighbour.
What was your best vacation till now?
Mine is in the Himalayas.
What personal projects are you working on right now?
Mine is photography. Pick one now, if you don't have one.
What do you like or not like about your hometown?
I am from Mysuru; many things to like.

I would love to hear your stories after the seminar.

Shashidhar G LightHouse March 22, 2017 0 / 22


Automatic Code Generation for
Graph Algorithms on GPUs

Shashidhar G, Rupesh Nasre

PACE Lab, IIT Madras

M.S. Seminar

March 22, 2017



Graph Algorithms

Graph Algorithms

Graph algorithms: shortest path, triangle counting, community detection, chemical-reaction simulation, PageRank, etc.

Shortest Path Algorithm

  dist[1...m] = ∞
  dist[0] = 0

  For all nodes Vi in worklist {
    For all neighbors n of Vi {
      // Critical section for n
      if (dist[n] > dist[Vi] + len(Vi, n)) {
        dist[n] = dist[Vi] + len(Vi, n)
        add n to worklist
      }
      // Critical section ends for n
    }
  }
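The worklist-driven relaxation above can be sketched sequentially in C++; the GPU version runs the relaxation for many nodes at once, which is why the marked critical section is needed. A minimal sketch for illustration, not LightHouse's generated code:

```cpp
#include <climits>
#include <queue>
#include <utility>
#include <vector>

// Worklist-driven single-source shortest paths, mirroring the pseudocode
// above. The graph is an adjacency list of (neighbor, edge length) pairs.
std::vector<int> sssp(const std::vector<std::vector<std::pair<int, int>>>& adj,
                      int source) {
    std::vector<int> dist(adj.size(), INT_MAX);  // dist[1...m] = infinity
    dist[source] = 0;                            // dist[0] = 0
    std::queue<int> worklist;
    worklist.push(source);
    while (!worklist.empty()) {
        int v = worklist.front();
        worklist.pop();
        for (auto [n, len] : adj[v]) {           // for all neighbors n of v
            if (dist[n] > dist[v] + len) {       // critical section on n when parallel
                dist[n] = dist[v] + len;
                worklist.push(n);                // add n to worklist
            }
        }
    }
    return dist;
}
```

In the parallel version, two threads relaxing edges into the same node n race on dist[n], which is exactly the critical section highlighted above.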


Parallelization of Graph Algorithms

Parallelization Challenges in Graph Algorithms

Irregularity: no regular pattern in:
  Work distribution
  Memory accesses
  Control flow
  Communication
These cannot be predicted at compile time.

Scalability: millions of nodes and edges.

[1] http://www.fmsasg.com/socialnetworkanalysis/
Parallelization of Graph Algorithms

Parallelization Challenges in Graph Algorithms

Many optimizations have been developed for particular algorithms/patterns (BFS, SSSP, strongly connected components, etc.).

These optimizations are hard to implement for non-HPC experts.


Parallelization of Graph Algorithms

Related Work

1 Library-based frameworks
  Galois: a concurrent data-structure library that identifies when a node becomes active for processing, based on the operator applied. PLDI '11
  Medusa: a set of APIs for processing vertices, edges, or messages. IEEE TPDS 2014
  Ligra: routines for processing mapped edges and vertices. PPoPP '13
2 Domain-Specific Languages
  Elixir: specify computations as a set of operators along with a scheduling constraint; the system produces efficient parallel code. OOPSLA '12
  Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image-processing pipelines. PLDI '13
  Green-Marl: a language for parallel graph algorithms. ASPLOS '12


Motivation

Motivation

Why: We believe we can enable novice programmers to do parallel programming and harness the power of parallel computing.

How: A language that abstracts parallel computing, letting the compiler translate programs to parallel code.

What: 30 lines of Green-Marl shortest-path code get translated into 350 lines of parallel CUDA code.


Outline

Outline
Green-Marl: A Domain-Specific Language
Front-end compilation of Green-Marl
Contribution of LightHouse:
  Back-end code generation for GPUs (CUDA).
  GPU code optimizations:
    1 Eliminate atomics in Boolean reduction.
    2 Loop collapsing.
Experimental results and conclusion.


DSL: Green-Marl

Green-Marl: Graph DSL

Fork-join style of parallel execution.
Uses iterators and collections (set, sequence).
The graph does not change.

  Procedure Test(G: Graph,
      A: N_P<Int>, root: Node) {

    N_P<Int> B;
    Int rootValue;
    Foreach (n: G.Nodes)
      Foreach (s: n.Nbrs)
        n.B = n.A + s.A;
    rootValue = root.B;
  }


DSL: Green-Marl

Green-Marl: Graph DSL

Reduction statements (provide determinism):
  +, *, min, max, bitwise AND and OR.
  argmin/argmax reductions: save the context of the reduced value.

  Foreach (n: G.Nodes) {
    sum += n.B;
    val<maxNode> max= n.B<n>;
  }


Front-end Compilation of Green-Marl

Front-end Compilation of Green-Marl

1 Parse the language; generate the Abstract Syntax Tree.

2 Check for data-race conflicts, e.g. two threads may write t.A for the same node t:

  Node_Prop<Int> A;
  Foreach (n: G.Nodes)
    Foreach (t: n.Nbrs)
      t.A = n.A;


3 Front-end optimizations:
  Loop fusion.
  Hoist temporaries out of sequential loops to avoid repeated allocations and deallocations.
  Convert reductions inside a sequential loop into normal assignments.


Challenges in Code generation for GPUs

GPU Code generation

1 Identify parallel regions to run on the GPU.

  Outer-most Foreach => GPU kernel call.
  Nested Foreach loops => sequential for loop inside the kernel.

  // each GPU thread
  Foreach(n: G.Nodes)    =>   n = threadID;
    Foreach(s: n.Nbrs)   =>   for (all neighbors s of n) {
      ...                       ...


Challenges in Code generation for GPUs

2 Scope of variables.

  Procedure Test(G: Graph,
      A: N_P<Int>, root: Node) {
    N_P<Int> B;
    Int ret;
    Foreach (n: G.Nodes)
      Foreach (s: n.Nbrs)
        n.B = n.A + s.A;
    ret = root.B;
  }

  Symbol Table
  Sym    Type       Allocate in
  G      Graph      GPU - Device Mem.
  A      N_P<Int>   GPU - Device Mem.
  B      N_P<Int>   GPU - Device Mem.
  n      Node::I    GPU - Thread Local
  s      Node::I    GPU - Thread Local
  root   Node       CPU
  ret    Int        CPU


GPU Code generation

GPU Code generation


3 Generate indices for memory accesses.

  Graph in CSR format:
    C = [0, 2, 4, 6, 6, 8]            // row offsets
    R = [b, c, a, d, b, e, d, f]      // neighbor array

  Symbol Table
  Sym    Type       Parent
  G      Graph
  A      N_P<Int>
  B      N_P<Int>
  n      Node::I    G
  s      Node::I    n
  root   Node
  ret    Int

  Foreach (n: G.Nodes)          n = threadID;
    Foreach (s: n.Nbrs)    =>   if (n >= numNodes) return;
      ...                       for (i = C[n]; i < C[n+1]; i++) {
                                  s = R[i];
                                  ...
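The generated CSR indexing can be checked on the CPU by replacing the thread id with a loop variable. A minimal sketch with illustrative names (the function and its string output are not part of the generated kernel):

```cpp
#include <string>
#include <vector>

// CSR traversal as generated for the GPU: each thread handles one node n
// and scans its neighbors R[C[n] .. C[n+1]). Here the "threads" are plain
// loop iterations so the indexing can be verified sequentially. Nodes are
// labeled 'a' + n; each emitted string is one (node, neighbor) pair.
std::vector<std::string> neighborPairs(const std::vector<int>& C,
                                       const std::vector<char>& R) {
    int numNodes = (int)C.size() - 1;          // C has one extra offset
    std::vector<std::string> pairs;
    for (int n = 0; n < numNodes; ++n) {       // n = threadID on the GPU
        for (int i = C[n]; i < C[n + 1]; ++i) {
            char s = R[i];                     // neighbor s of n
            pairs.push_back(std::string(1, (char)('a' + n)) + s);
        }
    }
    return pairs;
}
```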


GPU Code generation

GPU Code generation

4 Generate code for Reductions.
  Normal reductions are converted to atomic instructions.
  argmin/argmax reductions are handled differently.

Green-Marl:
  Int T = 0;
  Node src, dst;
  Foreach (s: G.Nodes)
    Foreach (t: s.Nbrs)
      T<from,to> max= s.A + t.A<s,t>;

Generated CUDA (host side):
  GPU_T = 0;
  KernelCall<<<LaunchPara>>>(C, R, A);
  GPUMemCpy(from, GPU_from, DeviceToHost);
  GPUMemCpy(to, GPU_to, DeviceToHost);

Generated CUDA (kernel):
  KernelCall(C, R, A) {
    ...
    localMax = 0;
    expr = s.A + t.A;
    atomicMax(&GPU_T, expr);
    if (localMax < expr) {
      localMax = expr;
      localFrom = s;
      localTo = t;
    }
    SoftwareBarrier();
    if (localMax == GPU_T)
      chooseThread = threadID;
    SoftwareBarrier();
    if (chooseThread == threadID) {
      GPU_from = localFrom;
      GPU_to = localTo;
    }
  }
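The two-phase argmax scheme above can be mimicked sequentially, with loop iterations standing in for CUDA threads and the gap between the loops playing the role of the software barrier. A minimal sketch under those assumptions (names are illustrative, not the generated code):

```cpp
#include <algorithm>
#include <vector>

// Two-phase argmax mirroring the generated kernel: in phase 1 every
// "thread" i publishes its value via a max into the global GPU_T
// (atomicMax on the GPU); after the barrier, in phase 2 a thread whose
// local value equals GPU_T writes the context, here its index.
int argmaxTwoPhase(const std::vector<int>& vals, int& bestOut) {
    int GPU_T = 0;                        // global max, initialized on host
    for (int v : vals)                    // phase 1: atomicMax(&GPU_T, expr)
        GPU_T = std::max(GPU_T, v);
    int chooseThread = -1;                // phase 2, after SoftwareBarrier()
    for (int i = 0; i < (int)vals.size(); ++i)
        if (vals[i] == GPU_T) { chooseThread = i; break; }
    bestOut = GPU_T;
    return chooseThread;                  // context of the reduced value
}
```

The point of the second phase is that atomicMax alone cannot record *which* thread produced the maximum; the barrier plus re-check recovers that context deterministically enough for the reduction semantics.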


Optimizations

GPU Optimizations
1 Boolean Value Reduction: eliminates atomics.

  // A initialized outside the kernel
  A = false;
  ...
  atomicOr(&A, val)   =>   if (val)
                             A = val;
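This transformation is safe because every participating thread either skips the write or stores the same value (true), so even racy plain stores agree with the atomic OR. A minimal CPU sketch of the reduction, with loop iterations standing in for the GPU threads:

```cpp
#include <vector>

// Boolean OR-reduction without atomics: A starts false outside the
// "kernel"; each thread writes only when its contribution is true, and
// only ever writes true, so the final value equals the OR of all inputs.
bool orReduce(const std::vector<bool>& vals) {
    bool A = false;             // initialized outside the kernel
    for (bool val : vals)       // each iteration = one GPU thread
        if (val)                // replaces atomicOr(&A, val)
            A = val;            // racy in parallel, but all writers store true
    return A;
}
```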

2 Loop Collapsing: expand the traversal over the neighbors of all nodes into a traversal over the edges.

  Foreach(s: G.Nodes)          Foreach(e: G.Edges) {
    Foreach(t: s.Nbrs)    =>     s = e.FromNode();
      ...                        t = e.ToNode();
                                 ...
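One way to see this transformation: every (s, t) neighbor pair in CSR corresponds to exactly one edge e with FromNode s and ToNode t, so the nested loops flatten into a single edge loop that a kernel can launch with one thread per edge. A CPU sketch assuming CSR inputs C and R (helper name is hypothetical):

```cpp
#include <utility>
#include <vector>

// Loop collapsing: enumerate the edge list that the collapsed kernel
// iterates over. Each entry is (FromNode, ToNode); a collapsed kernel
// would assign edges[e] to thread e, balancing work per edge rather
// than per node (which helps on skewed degree distributions).
std::vector<std::pair<int, int>> collapseToEdges(const std::vector<int>& C,
                                                 const std::vector<int>& R) {
    std::vector<std::pair<int, int>> edges;
    int numNodes = (int)C.size() - 1;
    for (int s = 0; s < numNodes; ++s)          // Foreach(s: G.Nodes)
        for (int i = C[s]; i < C[s + 1]; ++i)   //   Foreach(t: s.Nbrs)
            edges.push_back({s, R[i]});         // edge e: s = FromNode, R[i] = ToNode
    return edges;
}
```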


Experiments

Experiments
Bipartite Matching: few conflicts across threads, load-balanced tasks, data parallel.
Conductance: many atomics, thread-divergent code.
PageRank: floating-point operations.
SSSP: atomic instructions.

Datasets:
  Graph         #Nodes (millions)   #Edges (millions)
  Epinions        0.1                  0.5
  LiveJournal     4.8                 69.0
  Pokec           1.6                 30.6
  Orkut           3.1                117.2
  USA            23.9                 57.7

TestBed:
  Intel Xeon E5-2650 v2 with 32 cores.
  OpenMP version 4.0.
  Tesla K40C device with 2880 cores.


Experiments

Experiments

[Speedup charts over Epinions, LiveJournal, Pokec, Orkut, and USA: (a) Matching, (b) Conductance; bars for OMP-1T, OMP-Fast, CUDA, and CUDA-OPT]
Experiments

Experiments

[Speedup charts over the same datasets: (a) PageRank-Gather and (b) PageRank-Propagate; bars for OMP-1T, OMP-Fast, CUDA, CUDA-OPT, and Ligra]
Experiments

Experiments

[Speedup chart over the same datasets for (a) SSSP; bars for OMP-1T, OMP-Fast, CUDA, CUDA-OPT, LoneStarGPU, Totem-OPT, and Ligra]
Experiments

Lines of Code
  Framework                 PageRank   Shortest Path
  LightHouse (Green-Marl)     25            30
  Ligra                      110            65
  Totem                        -           400
  LonestarGPU                170             -


Experiments

Future work

Implement an analysis framework to enable optimizations.
Identify computation patterns in the algorithm and adapt the data structures to exploit them, reducing synchronization and improving load balancing.
Worklist-based generation of CUDA code to reduce the number of idle threads.
  Fewer idle threads also reduce power consumption.
Implement work-stealing across CUDA threads to reduce load imbalance.
Multi-GPU support.


Conclusion

Conclusion

Efficient GPU code generation for graph algorithms from a high-level description like Green-Marl.
The performance benefits reveal that DSLs provide an effective way of developing parallel algorithms.

This work has been accepted at Languages and Compilers for Parallel Computing (LCPC) 2016.
Try LightHouse at http://pace.cse.iitm.ac.in/tools.php

Thank You


Acknowledgements

Acknowledgements

Krishna Nandivada Sir
GTC Committee Members: Madhu Mutyam Sir, Chandra Shekar Sir, Ramakrishna Sir
Rupesh Nasre (no "Sir" because he will scold me if I address him as Sir)
PACErs
All my friends


Acknowledgements

Tesla GPU Computing Architecture

Scalable processing and memory, massively multithreaded.
GeForce 8800: 128 processor cores at 1.5 GHz, 12K threads.

[Block diagram: host CPU and system memory; compute work distribution; SMs (I-cache, MT issue, C-cache, SPs, SFUs, shared memory) with geometry controllers and texture units; interconnection network; ROP/L2; DRAM. From "Scalable Parallel Programming with CUDA", NVIDIA Corporation, 2008]


Acknowledgements

SM Multithreaded Multiprocessor

SM has 8 SP thread processors:
  32 GFLOPS peak at 1.35 GHz
  IEEE 754 32-bit floating point
  32-bit and 64-bit integer
  8K 32-bit registers
SM has 2 SFU special function units.
Scalar ISA:
  Memory load/store, texture fetch
  Branch, call, return
  Barrier synchronization instruction
Multithreaded instruction unit:
  768 threads, hardware multithreaded
  24 SIMT warps of 32 threads
  Independent thread execution
  Hardware thread scheduling
16KB shared memory:
  Concurrent threads share data
  Low-latency load/store

(From "Scalable Parallel Programming with CUDA", NVIDIA Corporation, 2008)


Acknowledgements

GPU Architecture

Each Streaming Multiprocessor (SM) manages on the order of a thousand hardware-scheduled threads.
A warp is a set of 32 threads executed on a particular SM.
Threads in a warp execute in Single Instruction Multiple Thread (SIMT) fashion.
Thread scheduling in an SM happens at the granularity of a warp (warp scheduling).
On the whole, a GPU provides tens of thousands of data-parallel threads.


Acknowledgements

References

Elixir: a system for synthesizing concurrent graph programs, by Prountzos Dimitrios et al. OOPSLA '12
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, by Ragan-Kelley et al. PLDI '13
Galois: The tao of parallelism in algorithms, by Pingali Keshav et al. PLDI '11
Ligra: A Lightweight Graph Processing Framework for Shared Memory, by Shun Julian and Blelloch Guy E. PPoPP '13
Medusa: Simplified Graph Processing on GPUs, by Jianlong Zhong and Bingsheng He. IEEE Transactions on Parallel and Distributed Systems, 2014
