
USENIX ATC’17, Santa Clara, CA, USA

Garaph: Efficient GPU-accelerated Graph Processing on a Single Machine with Balanced Replication

Lingxiao Ma†, Zhi Yang†, Han Chen†, Jilong Xue‡, Yafei Dai*

† Computer Science Department, Peking University
‡ Microsoft Research
* SPCCTA, Peking University
Large-Scale Graph Processing

10^10 pages, 10^12 tokens; 10^9 nodes, 10^12 edges

Peking University, Microsoft Research 2


Powerful Storage & Computation Technologies

[Figure: storage and computation hardware; figure from the Internet]



Our Goal
- Large memory + fast secondary storage
- CPU + GPUs

[Figure: input graph on secondary storage is loaded into host (CPU) main memory and streamed over the PCIe bus into device (GPU) global memory]



Architecture

[Figure: edges stream from secondary storage into memory; a dispatcher feeds edge pages to the CPU kernel and the GPU kernel, which share the vertex data]


Graph Representation for Hybrid CPU and GPU
- CSC & CSR representation
- Shard: vertex interval
- Page: batched shards

[Figure: example 6-vertex weighted graph, with the vertex set split into Shard 0 and Shard 1]

CSC (incoming edges):
  Idx    0 2 4 5 6 9 9
  Nbr    3 4 0 2 5 4 1 2 5
  Edge   1 2 1 3 5 4 2 5 1
  IdxOff 0 0 1 1 2 3

CSR (outgoing edges):
  Idx 0 1 2 4 5 7 9
  Nbr 1 4 1 4 0 0 3 2 4
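The CSC/CSR arrays above can be rebuilt from a plain edge list. A minimal sketch (the edge list below is reconstructed from the slide's CSR arrays; the edge-weight and IdxOff columns are omitted, and the helper name is illustrative):

```python
# Hypothetical sketch: building CSR (outgoing) and CSC (incoming)
# index arrays for the slide's 6-vertex example graph.

def build_csr(num_vertices, edges):
    """Return (idx, nbr): idx[v]..idx[v+1] delimits v's neighbors in nbr."""
    # Count out-degree per vertex, then prefix-sum into offsets.
    idx = [0] * (num_vertices + 1)
    for src, _ in edges:
        idx[src + 1] += 1
    for v in range(num_vertices):
        idx[v + 1] += idx[v]
    # Fill the neighbor array; sorting keeps neighbors ordered per vertex.
    nbr = [0] * len(edges)
    cursor = list(idx[:-1])  # next write position for each vertex
    for src, dst in sorted(edges):
        nbr[cursor[src]] = dst
        cursor[src] += 1
    return idx, nbr

edges = [(0, 1), (1, 4), (2, 1), (2, 4), (3, 0),
         (4, 0), (4, 3), (5, 2), (5, 4)]

csr_idx, csr_nbr = build_csr(6, edges)                        # outgoing edges
csc_idx, csc_nbr = build_csr(6, [(d, s) for s, d in edges])   # incoming edges
```

Running this reproduces the slide's Idx/Nbr rows for both representations.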
Programming APIs
- GAS (Gather-Apply-Scatter) decomposition
- One program for both CPU and GPU

[Figure: GAS vertex-program loop with an Activate step; figure from the PowerGraph slides]
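To make the GAS decomposition concrete, here is a minimal sketch of one PageRank step written in that style; the function names and structure are illustrative, not Garaph's actual API:

```python
# Illustrative GAS-style PageRank step (not Garaph's real API):
# gather accumulates contributions over in-edges, apply updates the
# vertex value; a scatter/activate step would then mark out-neighbors
# for the next round.

DAMPING = 0.85

def gather(u, rank, out_degree):
    # Contribution flowing along an edge u -> v.
    return rank[u] / out_degree[u]

def apply(v, acc, n):
    # Standard PageRank update from the gathered sum.
    return (1 - DAMPING) / n + DAMPING * acc

def pagerank_step(in_nbrs, out_degree, rank):
    n = len(rank)
    return [apply(v, sum(gather(u, rank, out_degree) for u in in_nbrs[v]), n)
            for v in range(n)]

# 3-cycle 0 -> 1 -> 2 -> 0: ranks stay uniform at 1/3.
ranks = pagerank_step([[2], [0], [1]], [1, 1, 1], [1 / 3] * 3)
```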



GPU Computation Kernel
[Figure: edges stream from host memory over PCIe into GPU device memory; each streaming multiprocessor gathers edges through its L1 cache / shared memory, and Init / Apply / Sync phases update the vertex arrays]


Gather in GPU Computation Kernel

[Figure: gather accumulates per-vertex edge values in shared memory before writing results to global memory]


Problems in Gather
- Conflicts: concurrent atomic adds to the same shared-memory location
- Linear penalty: cost grows linearly with the conflict degree
- Intra-warp conflicts cost much more than inter-warp conflicts

*Gómez-Luna, Juan, et al. "Performance modeling of atomic additions on GPU scratchpad memory." TPDS 24.11 (2013): 2273-2282.


Replication-Based Gather
- Customized replication: map each vertex to replicated copies in shared memory
- Conflict cost: O(N) -> O(log N), N ≤ 32 (warp size)
- Modeling: balance the profit of fewer conflicts against the cost of extra shared memory and the final aggregation

[Figure: global-memory vertices mapped to replicated shared-memory copies, which are aggregated back]
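A CPU-side simulation of the idea: if all 32 lanes of a warp atomically add into one shared-memory copy, they serialize with conflict degree 32; with R replicated copies indexed by lane, each copy sees at most 32/R conflicting lanes before a final reduction. The numbers and the lane-to-copy mapping here are illustrative, not Garaph's actual model:

```python
# Simulate replication-based gather for one warp accumulating into
# one vertex. copies[lane % r] plays the role of a replicated
# shared-memory slot; on a GPU each "+=" would be an atomicAdd.

WARP = 32

def gather_with_replication(contribs, r):
    """contribs: one value per lane; r: replication factor.
    Returns (total after reduction, worst-case conflict degree)."""
    copies = [0.0] * r
    conflict = [0] * r
    for lane, c in enumerate(contribs):
        slot = lane % r        # lane-private mapping to a copy
        copies[slot] += c
        conflict[slot] += 1
    return sum(copies), max(conflict)

contribs = [1.0] * WARP
total1, deg1 = gather_with_replication(contribs, 1)  # single copy
total8, deg8 = gather_with_replication(contribs, 8)  # 8 replicated copies
```

Both variants produce the same total; replication only reduces how many lanes contend on each copy.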



CPU Computation Kernel
- Sequential memory access, lock-free, load-balanced
- 1. Gather: each of the p threads scans its edge range sequentially (boundaries r_0 … r_{p-1}) and accumulates into its private replica of the local vertices
- 2. Apply: aggregate the per-thread replicas into GlobalVertices

[Figure: threads 0 … p-1 gather edges into per-thread vertex replicas (Rep); an aggregation step applies them to GlobalVertices]
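The two-phase structure above can be sketched as follows; the partitioning and thread count are illustrative, but the key property matches the slide: each thread writes only its own replica, so the gather phase needs no locks:

```python
# Lock-free two-phase CPU kernel sketch: per-thread vertex replicas
# in the gather phase, merged per vertex in the apply phase.

from concurrent.futures import ThreadPoolExecutor

def cpu_gather_apply(num_vertices, edges, value, num_threads=4):
    # 1. Gather: each thread scans a contiguous edge range sequentially
    #    and accumulates into its private replica (no shared writes).
    replicas = [[0.0] * num_vertices for _ in range(num_threads)]
    chunk = (len(edges) + num_threads - 1) // num_threads

    def gather(t):
        for src, dst in edges[t * chunk:(t + 1) * chunk]:
            replicas[t][dst] += value[src]

    with ThreadPoolExecutor(num_threads) as pool:
        list(pool.map(gather, range(num_threads)))

    # 2. Apply: merge the replicas vertex by vertex.
    return [sum(rep[v] for rep in replicas) for v in range(num_vertices)]

result = cpu_gather_apply(3, [(0, 1), (0, 2), (1, 2), (2, 0)],
                          [1.0, 2.0, 3.0])
```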



Dual-Mode Processing Engine
- Pull & notify-pull
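A rough sketch of the two modes: pull recomputes every vertex from its in-neighbors, while notify-pull recomputes only vertices that have at least one active (changed) in-neighbor, which pays off when few vertices are active. Function names and the active-set mechanism are illustrative:

```python
# Pull vs. notify-pull over a CSC-style in-neighbor list.

def pull(in_nbrs, value):
    # Recompute every vertex from all of its in-neighbors.
    return [sum(value[u] for u in in_nbrs[v]) for v in range(len(in_nbrs))]

def notify_pull(in_nbrs, value, active):
    # Recompute only vertices notified by an active in-neighbor;
    # all others keep their previous value.
    result = list(value)
    for v in range(len(in_nbrs)):
        if any(u in active for u in in_nbrs[v]):
            result[v] = sum(value[u] for u in in_nbrs[v])
    return result

in_nbrs = [[1], [2], [0]]           # 3-cycle
full = pull(in_nbrs, [1.0, 2.0, 3.0])
partial = notify_pull(in_nbrs, [1.0, 2.0, 3.0], active={2})
```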



Hybrid CPU-GPU Scheduling

[Figure: the scheduler dispatches edge pages to the CPU and to each GPU, tracking each processor's measured time per page]

- CPU
  - Pros: sequential per-thread processing
  - Suited to: pull / notify-pull dual-mode processing
- GPU
  - Pros: SIMD parallel processing
  - Suited to: replication-based gather processing (pull only)
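A toy version of such a scheduler: pages are dispatched greedily to whichever processor becomes free soonest, using each processor's per-page time. The timing numbers and the greedy policy are illustrative assumptions, not Garaph's exact algorithm:

```python
# Greedy page dispatcher over a min-heap of processor free times.

import heapq

def schedule(num_pages, time_per_page):
    """time_per_page: {processor name: seconds per page}.
    Returns (makespan, pages processed per processor)."""
    # Heap entries: (time when the processor becomes free, name).
    free_at = [(0.0, proc) for proc in sorted(time_per_page)]
    heapq.heapify(free_at)
    assigned = {proc: 0 for proc in time_per_page}
    finish = 0.0
    for _ in range(num_pages):
        t, proc = heapq.heappop(free_at)   # soonest-free processor
        t += time_per_page[proc]
        assigned[proc] += 1
        finish = max(finish, t)
        heapq.heappush(free_at, (t, proc))
    return finish, assigned

makespan, assigned = schedule(6, {"CPU": 4.0, "GPU0": 1.0})
```

With these made-up timings the faster GPU naturally absorbs most pages; a production scheduler would also refresh the per-page times as it measures them.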



Experiment Setup
- Machine information
  - CPU: Intel Xeon E5-2650 v3 (10 cores, 20 threads, 2.3-3.0GHz)
  - Memory: 64GB dual-channel DDR4 2133MHz
  - GPU: NVIDIA GeForce GTX 1070 (1920 cores, 15 SMs, 8GB memory)
- Typical graph applications: PR, CC, SSSP, NN, HS, CS
- Datasets: 7 real-world graphs
- Compared against: CuSha (HPDC'14), Ligra (PPoPP'13), Gemini (OSDI'16)

Graph         |V|   |E|   Max in-deg  Avg deg  Size
uk-2007@1M    1M    41M   0.4M        41       0.6GB
uk-2014-host  4.8M  51M   0.7M        11       0.8GB
enwiki-2013   4.2M  0.1B  0.4M        24       1.7GB
gsh-2015-tpd  31M   0.6B  2.2M        20       10GB
twitter-2010  42M   1.5B  0.8M        35       27GB
sk-2005       51M   1.9B  8.6M        39       35GB
renren-2010   58M   2.8B  0.3M        48       44GB


Evaluation: Overall Performance
- Run 10 iterations of PR
- Up to 4.05x faster than the fastest competing system

[Bar charts: PR runtime (s) of CuSha, Ligra, Gemini, Garaph-C, Garaph-G, and Garaph-H on uk-2007-05@1M, uk-2014-host, enwiki-2013 (0-2s scale) and on gsh-2015-tpd, twitter-2010, sk-2005, renren-2010 (0-90s scale)]


Evaluation: Overall Performance
- Run CC to convergence
- GPU is much slower than CPU without the activation scheme
- Up to 5.36x faster than the fastest competing system

[Bar charts: CC runtime (s) of CuSha, Ligra, Gemini, Garaph-C, Garaph-G, and Garaph-H on uk-2007-05@1M, uk-2014-host, enwiki-2013 (0-1.5s scale) and on gsh-2015-tpd, twitter-2010, sk-2005, renren-2010 (0-250s scale)]


Evaluation: Customized Replication
- SK-2005 dataset
- With a fixed replication factor, the slowest setting is 45.17x slower than the fastest
- Correlation with vertex degree: 0.9853 => replicate vertices of high degree
- Customized replication: up to 32.15x speedup



Evaluation: Hybrid CPU-GPU Scheduling



Conclusions

Garaph: efficient GPU-accelerated graph processing on a single machine


- Replication-based GPU computation kernel.
- Dual-mode, replication-based CPU computation kernel.
- Scheduler for hybrid CPU and GPU.



Q&A

