
Welcome to the M.S.

Seminar of
Shashidhar G

Title: Automatic Code Generation for Graph Algorithms on


GPUs
While we start the seminar, start a conversation with your neighbour.
What was your best vacation till now?
Mine is in the Himalayas.
What personal projects are you working on right now?
Mine is photography. Pick one now, if you don't have one.
What do you like or not like about your hometown?
I am from Mysuru; many things to like.

I would love to hear your stories after the seminar.

Shashidhar G LightHouse March 22, 2017 0 / 22


Automatic Code Generation for
Graph Algorithms on GPUs

Shashidhar G, Rupesh Nasre

PACE Lab, IIT Madras

M.S. Seminar

March 22, 2017



Graph Algorithms

Graph Algorithms

Graph algorithms: shortest path, triangle counting, community detection, chemical-reaction simulation, PageRank, etc.

Shortest Path Algorithm

  dist[1...m] = ∞
  dist[0] = 0

  For all nodes Vi in worklist {
    For all neighbors n of Vi {
      // Critical section for n
      if (dist[n] > dist[Vi] + len(Vi, n)) {
        dist[n] = dist[Vi] + len(Vi, n)
        add n to worklist
      }
      // Critical section ends for n
    }
  }
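The worklist-driven relaxation above can be sketched sequentially in C++; the GPU version runs the relaxation for many nodes at once, which is why the marked critical section is needed. A minimal sketch for illustration, not LightHouse's generated code:

```cpp
#include <climits>
#include <queue>
#include <utility>
#include <vector>

// Worklist-driven single-source shortest paths, mirroring the pseudocode
// above. The graph is an adjacency list of (neighbor, edge length) pairs.
std::vector<int> sssp(const std::vector<std::vector<std::pair<int, int>>>& adj,
                      int source) {
    std::vector<int> dist(adj.size(), INT_MAX);  // dist[1...m] = infinity
    dist[source] = 0;                            // dist[0] = 0
    std::queue<int> worklist;
    worklist.push(source);
    while (!worklist.empty()) {
        int v = worklist.front();
        worklist.pop();
        for (auto [n, len] : adj[v]) {           // for all neighbors n of v
            if (dist[n] > dist[v] + len) {       // critical section on n when parallel
                dist[n] = dist[v] + len;
                worklist.push(n);                // add n to worklist
            }
        }
    }
    return dist;
}
```

In the parallel version, two threads relaxing edges into the same node n race on dist[n], which is exactly the critical section highlighted above.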


Parallelization of Graph Algorithms

Parallelization Challenges in Graph Algorithms

Irregularity: no regular pattern in:
  Work distribution
  Memory accesses
  Control flow
  Communication
These cannot be predicted at compile time.

Scalability: millions of nodes and edges.

[1] http://www.fmsasg.com/socialnetworkanalysis/
Parallelization of Graph Algorithms

Parallelization Challenges in Graph Algorithms

Many optimizations have been developed for particular algorithms/patterns (BFS, SSSP, strongly connected components, etc.).

These optimizations are hard to implement for non-HPC experts.


Parallelization of Graph Algorithms

Related Work

1 Library-based frameworks
  Galois: a concurrent data-structure library that identifies when a node becomes active for processing, based on the operator applied. PLDI '11
  Medusa: a set of APIs for processing vertices, edges, or messages. IEEE TPDS 2014
  Ligra: routines for processing mapped edges and vertices. PPoPP '13
2 Domain-Specific Languages
  Elixir: specify computations as a set of operators along with a scheduling constraint; the system produces efficient parallel code. OOPSLA '12
  Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image-processing pipelines. PLDI '13
  Green-Marl: a language for parallel graph algorithms. ASPLOS '12


Motivation

Motivation

Why: We believe we can enable novice programmers to do parallel programming and harness the power of parallel computing.

How: A language that abstracts parallel computing, letting the compiler translate programs to parallel code.

What: 30 lines of Green-Marl shortest-path code get translated into 350 lines of parallel CUDA code.


Outline

Outline
Green-Marl: A Domain-Specific Language
Front-end compilation of Green-Marl
Contribution of LightHouse:
  Back-end code generation for GPUs (CUDA).
  GPU code optimizations:
    1 Eliminate atomics in Boolean reduction.
    2 Loop collapsing.
Experimental results and conclusion.


DSL: Green-Marl

Green-Marl: Graph DSL

Fork-join style of parallel execution.
Uses iterators and collections (set, sequence).
The graph does not change.

  Procedure Test(G: Graph,
      A: N_P<Int>, root: Node) {

    N_P<Int> B;
    Int rootValue;
    Foreach (n: G.Nodes)
      Foreach (s: n.Nbrs)
        n.B = n.A + s.A;
    rootValue = root.B;
  }


DSL: Green-Marl

Green-Marl: Graph DSL

Reduction statements (provide determinism):
  +, *, min, max, bitwise AND and OR.
  argmin/argmax reductions: save the context of the reduced value.

  Foreach (n: G.Nodes) {
    sum += n.B;
    val<maxNode> max= n.B<n>;
  }


Front-end Compilation of Green-Marl

Front-end Compilation of Green-Marl

1 Parse the language; generate the Abstract Syntax Tree.

2 Check for data-race conflicts, e.g. two threads may write t.A for the same node t:

  Node_Prop<Int> A;
  Foreach (n: G.Nodes)
    Foreach (t: n.Nbrs)
      t.A = n.A;


3 Front-end optimizations:
  Loop fusion.
  Hoist temporaries out of sequential loops to avoid repeated allocations and deallocations.
  Convert reductions inside a sequential loop into normal assignments.


Challenges in Code generation for GPUs

GPU Code generation

1 Identify parallel regions to run on the GPU.

  Outer-most Foreach => GPU kernel call.
  Nested Foreach loops => sequential for loop inside the kernel.

  // each GPU thread
  Foreach(n: G.Nodes)    =>   n = threadID;
    Foreach(s: n.Nbrs)   =>   for (all neighbors s of n) {
      ...                       ...


Challenges in Code generation for GPUs

2 Scope of variables.

  Procedure Test(G: Graph,
      A: N_P<Int>, root: Node) {
    N_P<Int> B;
    Int ret;
    Foreach (n: G.Nodes)
      Foreach (s: n.Nbrs)
        n.B = n.A + s.A;
    ret = root.B;
  }

  Symbol Table
  Sym    Type       Allocate in
  G      Graph      GPU - Device Mem.
  A      N_P<Int>   GPU - Device Mem.
  B      N_P<Int>   GPU - Device Mem.
  n      Node::I    GPU - Thread Local
  s      Node::I    GPU - Thread Local
  root   Node       CPU
  ret    Int        CPU


GPU Code generation

GPU Code generation


3 Generate indices for memory accesses.

  Graph in CSR format:
    C = [0, 2, 4, 6, 6, 8]            // row offsets
    R = [b, c, a, d, b, e, d, f]      // neighbor array

  Symbol Table
  Sym    Type       Parent
  G      Graph
  A      N_P<Int>
  B      N_P<Int>
  n      Node::I    G
  s      Node::I    n
  root   Node
  ret    Int

  Foreach (n: G.Nodes)          n = threadID;
    Foreach (s: n.Nbrs)    =>   if (n >= numNodes) return;
      ...                       for (i = C[n]; i < C[n+1]; i++) {
                                  s = R[i];
                                  ...
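The generated CSR indexing can be checked on the CPU by replacing the thread id with a loop variable. A minimal sketch with illustrative names (the function and its string output are not part of the generated kernel):

```cpp
#include <string>
#include <vector>

// CSR traversal as generated for the GPU: each thread handles one node n
// and scans its neighbors R[C[n] .. C[n+1]). Here the "threads" are plain
// loop iterations so the indexing can be verified sequentially. Nodes are
// labeled 'a' + n; each emitted string is one (node, neighbor) pair.
std::vector<std::string> neighborPairs(const std::vector<int>& C,
                                       const std::vector<char>& R) {
    int numNodes = (int)C.size() - 1;          // C has one extra offset
    std::vector<std::string> pairs;
    for (int n = 0; n < numNodes; ++n) {       // n = threadID on the GPU
        for (int i = C[n]; i < C[n + 1]; ++i) {
            char s = R[i];                     // neighbor s of n
            pairs.push_back(std::string(1, (char)('a' + n)) + s);
        }
    }
    return pairs;
}
```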


GPU Code generation

GPU Code generation

4 Generate code for Reductions.
  Normal reductions are converted to atomic instructions.
  argmin/argmax reductions are handled differently.

Green-Marl:
  Int T = 0;
  Node src, dst;
  Foreach (s: G.Nodes)
    Foreach (t: s.Nbrs)
      T<from,to> max= s.A + t.A<s,t>;

Generated CUDA (host side):
  GPU_T = 0;
  KernelCall<<<LaunchPara>>>(C, R, A);
  GPUMemCpy(from, GPU_from, DeviceToHost);
  GPUMemCpy(to, GPU_to, DeviceToHost);

Generated CUDA (kernel):
  KernelCall(C, R, A) {
    ...
    localMax = 0;
    expr = s.A + t.A;
    atomicMax(&GPU_T, expr);
    if (localMax < expr) {
      localMax = expr;
      localFrom = s;
      localTo = t;
    }
    SoftwareBarrier();
    if (localMax == GPU_T)
      chooseThread = threadID;
    SoftwareBarrier();
    if (chooseThread == threadID) {
      GPU_from = localFrom;
      GPU_to = localTo;
    }
  }
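The two-phase argmax scheme above can be mimicked sequentially, with loop iterations standing in for CUDA threads and the gap between the loops playing the role of the software barrier. A minimal sketch under those assumptions (names are illustrative, not the generated code):

```cpp
#include <algorithm>
#include <vector>

// Two-phase argmax mirroring the generated kernel: in phase 1 every
// "thread" i publishes its value via a max into the global GPU_T
// (atomicMax on the GPU); after the barrier, in phase 2 a thread whose
// local value equals GPU_T writes the context, here its index.
int argmaxTwoPhase(const std::vector<int>& vals, int& bestOut) {
    int GPU_T = 0;                        // global max, initialized on host
    for (int v : vals)                    // phase 1: atomicMax(&GPU_T, expr)
        GPU_T = std::max(GPU_T, v);
    int chooseThread = -1;                // phase 2, after SoftwareBarrier()
    for (int i = 0; i < (int)vals.size(); ++i)
        if (vals[i] == GPU_T) { chooseThread = i; break; }
    bestOut = GPU_T;
    return chooseThread;                  // context of the reduced value
}
```

The point of the second phase is that atomicMax alone cannot record *which* thread produced the maximum; the barrier plus re-check recovers that context deterministically enough for the reduction semantics.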


Optimizations

GPU Optimizations
1 Boolean Value Reduction: eliminates atomics.

  // A initialized outside the kernel
  A = false;
  ...
  atomicOr(&A, val)   =>   if (val)
                             A = val;
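This transformation is safe because every participating thread either skips the write or stores the same value (true), so even racy plain stores agree with the atomic OR. A minimal CPU sketch of the reduction, with loop iterations standing in for the GPU threads:

```cpp
#include <vector>

// Boolean OR-reduction without atomics: A starts false outside the
// "kernel"; each thread writes only when its contribution is true, and
// only ever writes true, so the final value equals the OR of all inputs.
bool orReduce(const std::vector<bool>& vals) {
    bool A = false;             // initialized outside the kernel
    for (bool val : vals)       // each iteration = one GPU thread
        if (val)                // replaces atomicOr(&A, val)
            A = val;            // racy in parallel, but all writers store true
    return A;
}
```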

2 Loop Collapsing: expand the traversal over the neighbors of all nodes into a traversal over the edges.

  Foreach(s: G.Nodes)          Foreach(e: G.Edges) {
    Foreach(t: s.Nbrs)    =>     s = e.FromNode();
      ...                        t = e.ToNode();
                                 ...
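One way to see this transformation: every (s, t) neighbor pair in CSR corresponds to exactly one edge e with FromNode s and ToNode t, so the nested loops flatten into a single edge loop that a kernel can launch with one thread per edge. A CPU sketch assuming CSR inputs C and R (helper name is hypothetical):

```cpp
#include <utility>
#include <vector>

// Loop collapsing: enumerate the edge list that the collapsed kernel
// iterates over. Each entry is (FromNode, ToNode); a collapsed kernel
// would assign edges[e] to thread e, balancing work per edge rather
// than per node (which helps on skewed degree distributions).
std::vector<std::pair<int, int>> collapseToEdges(const std::vector<int>& C,
                                                 const std::vector<int>& R) {
    std::vector<std::pair<int, int>> edges;
    int numNodes = (int)C.size() - 1;
    for (int s = 0; s < numNodes; ++s)          // Foreach(s: G.Nodes)
        for (int i = C[s]; i < C[s + 1]; ++i)   //   Foreach(t: s.Nbrs)
            edges.push_back({s, R[i]});         // edge e: s = FromNode, R[i] = ToNode
    return edges;
}
```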


Experiments

Experiments
Bipartite Matching: few conflicts across threads, load-balanced tasks, data parallel.
Conductance: many atomics, thread-divergent code.
PageRank: floating-point operations.
SSSP: atomic instructions.

Datasets:
  Graph         #Nodes (millions)   #Edges (millions)
  Epinions        0.1                  0.5
  LiveJournal     4.8                 69.0
  Pokec           1.6                 30.6
  Orkut           3.1                117.2
  USA            23.9                 57.7

TestBed:
  Intel Xeon E5-2650 v2 with 32 cores.
  OpenMP version 4.0.
  Tesla K40C device with 2880 cores.


Experiments

Experiments

[Speedup charts over Epinions, LiveJournal, Pokec, Orkut, and USA: (a) Matching, (b) Conductance; bars for OMP-1T, OMP-Fast, CUDA, and CUDA-OPT]
Experiments

Experiments

[Speedup charts over the same datasets: (a) PageRank-Gather and (b) PageRank-Propagate; bars for OMP-1T, OMP-Fast, CUDA, CUDA-OPT, and Ligra]
Experiments

Experiments

[Speedup chart over the same datasets for (a) SSSP; bars for OMP-1T, OMP-Fast, CUDA, CUDA-OPT, LoneStarGPU, Totem-OPT, and Ligra]
Experiments

Lines of Code
  Framework                 PageRank   Shortest Path
  LightHouse (Green-Marl)     25            30
  Ligra                      110            65
  Totem                        -           400
  LonestarGPU                170             -


Experiments

Future work

Implement an analysis framework to enable optimizations.
Identify computation patterns in the algorithm and adapt the data structures to exploit them, reducing synchronization and improving load balancing.
Worklist-based generation of CUDA code to reduce the number of idle threads.
  Fewer idle threads also reduce power consumption.
Implement work-stealing across CUDA threads to reduce load imbalance.
Multi-GPU support.


Conclusion

Conclusion

Efficient GPU code generation for graph algorithms from a high-level description like Green-Marl.
The performance benefits reveal that DSLs provide an effective way of developing parallel algorithms.

This work has been accepted at Languages and Compilers for Parallel Computing (LCPC) 2016.
Try LightHouse at http://pace.cse.iitm.ac.in/tools.php

Thank You


Acknowledgements

Acknowledgements

Krishna Nandivada Sir
GTC Committee Members: Madhu Mutyam Sir, Chandra Shekar Sir, Ramakrishna Sir
Rupesh Nasre (no "Sir" because he will scold me if I address him as Sir)
PACErs
All my friends


Acknowledgements

Tesla GPU Computing Architecture

Scalable processing and memory, massively multithreaded.
GeForce 8800: 128 processor cores at 1.5 GHz, 12K threads.

[Block diagram: host CPU and system memory; compute work distribution; SMs (I-cache, MT issue, C-cache, SPs, SFUs, shared memory) with geometry controllers and texture units; interconnection network; ROP/L2; DRAM. From "Scalable Parallel Programming with CUDA", NVIDIA Corporation, 2008]


Acknowledgements

SM Multithreaded Multiprocessor

SM has 8 SP thread processors:
  32 GFLOPS peak at 1.35 GHz
  IEEE 754 32-bit floating point
  32-bit and 64-bit integer
  8K 32-bit registers
SM has 2 SFU special function units.
Scalar ISA:
  Memory load/store, texture fetch
  Branch, call, return
  Barrier synchronization instruction
Multithreaded instruction unit:
  768 threads, hardware multithreaded
  24 SIMT warps of 32 threads
  Independent thread execution
  Hardware thread scheduling
16KB shared memory:
  Concurrent threads share data
  Low-latency load/store

(From "Scalable Parallel Programming with CUDA", NVIDIA Corporation, 2008)


Acknowledgements

GPU Architecture

Each Streaming Multiprocessor (SM) manages on the order of a thousand hardware-scheduled threads.
A warp is a set of 32 threads executed on a particular SM.
Threads in a warp execute in Single Instruction Multiple Thread (SIMT) fashion.
Thread scheduling in an SM happens at the granularity of a warp (warp scheduling).
On the whole, a GPU provides tens of thousands of data-parallel threads.


Acknowledgements

References

Elixir: a system for synthesizing concurrent graph programs, by Prountzos Dimitrios et al. OOPSLA '12
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, by Ragan-Kelley et al. PLDI '13
Galois: The tao of parallelism in algorithms, by Pingali Keshav et al. PLDI '11
Ligra: A Lightweight Graph Processing Framework for Shared Memory, by Shun Julian and Blelloch Guy E. PPoPP '13
Medusa: Simplified Graph Processing on GPUs, by Jianlong Zhong and Bingsheng He. IEEE Transactions on Parallel and Distributed Systems, 2014
