Parallel
Begin reading Kirk and Hwu.
The book begins by focusing on CUDA
and extends to OpenCL later.
Check your GPU to figure out which
language you can use:
CUDA is mostly restricted to NVIDIA hardware.
Intro to Parallel
Latency devices (CPUs) versus throughput
devices (GPUs).
Latency devices add functionality to
reduce latency:
branch prediction, multithreading,
superscalar execution, cache optimization.
Complex control logic;
a few powerful ALUs.
CPU vs GPU
[Figure: a CPU die with a few powerful ALUs, large control units, and large caches, backed by DRAM, next to a GPU die with many simple ALUs and small per-group control and cache, also backed by DRAM]
Amdahl's law
Originally formulated for parallel
computing.
The limit on performance is the serial
part.
Don't neglect your CPU code.
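The limit above can be made concrete. Amdahl's law says that if a fraction p of the runtime is parallelizable, the speedup on n processors is 1 / ((1 - p) + p/n). A minimal host-side sketch (the function name `amdahl_speedup` is ours, not from the slides):

```c
#include <assert.h>

/* Amdahl's law: predicted speedup on n processors when a fraction p
 * of the single-processor runtime is parallelizable (0 <= p <= 1). */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For example, with p = 0.95 even n = 1024 processors give only about a 19.6x speedup, and as n grows the speedup never exceeds 1 / (1 - p) = 20x: the serial part sets the limit.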
Data parallelism
[Figure: element-wise vector operation; each thread k independently computes C[k] from A[k] and B[k]]
CUDA organization
CUDA code executes general compute
requests on parallel hardware.
Two-part model:
Host: executes serial code, makes
decisions, issues commands to the device.
Device: executes parallel code, responds
to requests from the host.
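The two-part model can be sketched as a minimal CUDA program (kernel name `addOne` and the sizes are our illustration, not from the slides): the host allocates device memory, copies data over, issues a kernel launch, and copies the result back; the device runs the parallel code.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Device: parallel code, one thread per element.
__global__ void addOne(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Host: serial code and decision making; issues commands to the device.
int main(void) {
    const int n = 1024;
    float h[1024];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                            // allocate device memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device

    addOne<<<(n + 255) / 256, 256>>>(d, n);                       // issue parallel work

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d);
    printf("h[10] = %f\n", h[10]);
    return 0;
}
```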
Thread Blocks
Parallel threads are issued in blocks:
collections of threads running the same
code.
Threads in a block can communicate via:
shared memory,
atomic operations,
barrier synchronization.
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
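The three communication mechanisms can be sketched in one kernel (a hypothetical example of ours, `blockSum`, which assumes a launch with exactly 256 threads per block): shared memory holds each block's slice, barriers order the reduction steps, and an atomic combines the per-block results.

```cuda
// Each block sums its slice of A, then atomically adds it into *total.
__global__ void blockSum(const float *A, float *total, int n) {
    __shared__ float buf[256];               // shared memory, visible block-wide
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? A[i] : 0.0f;
    __syncthreads();                         // barrier: all loads done before reducing

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();                     // barrier between reduction steps
    }
    if (threadIdx.x == 0)
        atomicAdd(total, buf[0]);            // atomic operation across blocks
}
```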
[Figure: Thread Block 1 and Thread Block 2 each execute the same code,
i = blockIdx.x * blockDim.x + threadIdx.x; C[i] = A[i] + B[i];
on disjoint slices of A, B, and C]
Parallel Device
[Figure: the host CPU issues commands to the parallel device; threads 0 through k each have their own private registers, and all threads share global device memory]
[Figure: a 1-D array partitioned into consecutive slices handled by block 1, block 2, block 3, ...]
[Figure: a 2-D image, addressed by image index x and y, partitioned into a 4x4 grid of thread blocks, Block 00 through Block 33]
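The 2-D block grid above maps naturally onto an image with 2-D block and thread indices. A sketch under our own naming (`invert`, 16x16 blocks are an illustrative choice):

```cuda
// One thread per pixel, launched as a 2-D grid of 2-D blocks.
__global__ void invert(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // image index x
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // image index y
    if (x < width && y < height)                    // guard the ragged edges
        img[y * width + x] = 255 - img[y * width + x];
}
```

Host-side launch, with enough blocks to cover the whole image:

```cuda
dim3 block(16, 16);
dim3 grid((width + 15) / 16, (height + 15) / 16);
invert<<<grid, block>>>(d_img, width, height);
```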