
CS401/807

Parallel

Parallel
Begin reading Kirk and Hwu
Book begins by focusing on CUDA
extends to OpenCL later
Check your GPU to figure out which language you can use
CUDA is mostly restricted to NVIDIA GPUs
2

Intro to Parallel
Latency devices (CPU) versus throughput devices (GPU)
Latency devices add functionality to reduce latency
Branch prediction, multithreading, superscalar execution, cache optimization
Complex control
Powerful ALU
3

GPUs: throughput devices


no branch prediction
no data forwarding
simpler ALU, focusing on throughput
e.g. single-cycle SIMD multiply
small cache
heavy pipelining
Many simpler cores versus few complex cores
4

CPU vs GPU
[Figure: CPU with a few powerful ALUs, a large control unit, a large cache, and DRAM, versus GPU with many small ALUs grouped under small control units and caches, backed by DRAM]
5

Writing CPU versus GPU code


Use CPU when appropriate
Serial operations, decision making
faster than GPUs for these tasks
Use GPU when appropriate
Parallel operations, SIMD (more accurately SPMD)
Single instruction/program, multiple data
faster than CPUs for these tasks
6

Amdahl's law
Originally formulated for parallel computing
Limit on performance is the serial part
Don't neglect your CPU code
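Concretely, if a fraction p of a program can be parallelized across N processors, Amdahl's law bounds the speedup:
S(N) = \frac{1}{(1 - p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}
So with p = 0.9, no number of processors can give more than a 10x speedup.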
7

Scalability, Portability, and Cores


Scalability: the code can run faster, given access to more of the same cores
parallel
Portability: the code can run efficiently, given access to different cores
heterogeneous
Scalability plus portability means taking advantage of all computing resources
8

GPGPU frameworks and portability


CUDA, OpenCL, etc. aim for portability
write once, deploy on different arrangements of CPU and GPU
General structure for GPGPU: host plus device model
CPU is the host: makes decisions, runs serial code, issues tasks to the device
GPU is the device: executes tasks as efficiently as possible
9

Data parallelism

Given a vector of data, do the same operation to each element in the vector
Simplest parallel computation
E.g. vector addition:
c[i] = a[i] + b[i]

10

Vector Addition: Serial


One ALU
One addition at a time
Each element is presented to the ALU in turn
O(n)
[Figure: A[k] and B[k] presented one pair at a time to a single ALU, producing C[k]]
11

Vector Addition: Parallel


Many ALUs
Each ALU performs one slice of the operation
All additions happen at once
O(1)
[Figure: each pair A[k], B[k] goes to its own ALU, and all of C[k] is produced at once]
12

CUDA organization
CUDA code executes general compute requests on parallel hardware
Two-part model
Host: executes serial code, decision making, issues commands to the device
Device: executes parallel code, responds to requests from the host
13

Parallel execution on CUDA


A CUDA (parallel) device executes a block of related threads
also called a grid or array of threads
All threads in the block execute the same code (thus: SIMD)
Each thread has its own index, used to identify which data it is operating on
compute memory addresses, make control decisions
14

Thread Blocks
Parallel threads can be issued in blocks
collections of threads from the same code
Threads in a block can communicate:
shared memory
atomic operations
barrier synchronization
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
15

Thread Blocks and Communication


[Figure: Thread Blocks 0, 1, and 2 each execute the same code
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
on their own portions of A[i], B[i], and C[i]]
16

Thread Blocks and Communication


Threads in the same block can communicate with each other
Threads in different blocks cannot, even if they are executing the same code
Block ID and thread ID are used together to determine the portion of the work being done
Can be arranged as 1D, 2D, or 3D data structures
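A minimal sketch of these in-block mechanisms, using a hypothetical blockSumKernel (not from the lecture) in which each block sums its slice of an array:
__global__ void blockSumKernel(const float* in, float* total, int n)
{
    __shared__ float partial[256];                 // shared memory: visible to every thread in this block
    int i = blockIdx.x * blockDim.x + threadIdx.x; // assumes a launch with 256 threads per block
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               // barrier: wait until all threads have stored their element
    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);              // atomic operation: safe concurrent update from many blocks
}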
17

Vector addition example: assembly


To complete a vector addition on a traditional machine:
move data from memory to registers
add registers
move result back to memory
repeat for every element
18

Vector addition example: Serial code


void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
int i;
for (i = 0; i<n; i++)
h_C[i] = h_A[i]+h_B[i];
}
int main()
{
// Memory allocation for h_A, h_B, and h_C
// I/O to read h_A and h_B, N elements
...
vecAdd(h_A, h_B, h_C, N);
}
19

Vector Addition: Parallel


To complete a vector addition on a parallel (host plus device) machine:
move data from the serial processor to the parallel processor cache
issue threads to complete the addition for all elements in the array
multiple blocks may be required
move data from the parallel processor cache back to the serial processor cache
20

Parallel device memory access


Device code can access per-thread registers and shared global memory
Host code can move data to and from shared global memory
[Figure: the host (CPU plus host memory) is connected over the system bus (DMA) to the device's global memory; inside the device, Thread Blocks (0,0) and (0,1) each contain Threads 0 through k, each with its own registers]
21

CUDA memory management: Device


cudaMalloc()
Allocates memory in global device memory
Two parameters
Pointer to the allocated object
Size of allocated object in bytes
cudaFree()
Frees object from global device memory
Requires pointer to object to be freed
22

CUDA memory management: Host


cudaMemcpy()
memory data transfer
Requires four parameters
Pointer to destination
Pointer to source
Number of bytes copied
Type/Direction of transfer
Transfer to device is asynchronous
23

CUDA host code for vector addition


void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
int size = n * sizeof(float);
float *d_A, *d_B, *d_C;
cudaMalloc((void **) &d_A, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &d_B, size);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &d_C, size);

// Kernel invocation code



cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B); cudaFree (d_C);
}
24

CUDA host code for vector addition


You should check for errors at each stage
cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess)
{
printf("%s in %s at line %d\n",
cudaGetErrorString(err), __FILE__, __LINE__);
exit(EXIT_FAILURE);
}
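To avoid repeating this block after every call, a common pattern (sketched here as a hypothetical CHECK macro, not part of the CUDA API) wraps each call:
#define CHECK(call)                                              \
do {                                                             \
    cudaError_t err = (call);                                    \
    if (err != cudaSuccess) {                                    \
        printf("%s in %s at line %d\n",                          \
               cudaGetErrorString(err), __FILE__, __LINE__);     \
        exit(EXIT_FAILURE);                                      \
    }                                                            \
} while (0)
// Usage:
CHECK(cudaMalloc((void **) &d_A, size));
CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));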
25

Vector addition device code

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
int i = threadIdx.x+blockDim.x*blockIdx.x;
if(i<n) C[i] = A[i] + B[i];
}

26

aside: CUDA function declarations


(each "__" is two underscore characters)
__device__ float DeviceFunc()
execute on device, call from device
__global__ void KernelFunc()
execute on device, call from host
__host__ float HostFunc()
execute on host, call from host
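A small illustrative sketch (the function names here are made up for this example) showing the three qualifiers together:
// Hypothetical example: a __device__ helper called from a __global__ kernel,
// launched by a __host__ function.
__device__ float deviceScale(float x) { return 2.0f * x; }  // runs on device, callable from device code

__global__ void scaleKernel(float* d_out, const float* d_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = deviceScale(d_in[i]);              // device code calling a __device__ function
}

__host__ void hostScale(float* d_out, const float* d_in, int n)
{
    scaleKernel<<<(n + 255) / 256, 256>>>(d_out, d_in, n);   // host code launching a __global__ kernel
}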
27

Vector addition device code

Each thread calculates its local index, relating to the block it's in and the size of the blocks
Each thread validates that the calculated index is within the original index range

28

Vector addition kernel host code

int vecAdd(float* h_A, float* h_B, float* h_C, int n)


{
// d_A, d_B, d_C allocations and copies omitted
// Run ceil(n/256.0) blocks of 256 threads each
vecAddKernel<<<ceil(n/256.0),256>>>(d_A, d_B, d_C, n);
}

29

Vector addition kernel host code


The host sets blocks of 256 threads
The host activates enough thread blocks to cover the original range
More general:
int vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
dim3 DimGrid((n-1)/256 + 1, 1, 1);
dim3 DimBlock(256, 1, 1);
vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
}
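Kernel launches do not return an error code directly. A sketch of a common follow-up check (not from the lecture; cudaGetLastError and cudaDeviceSynchronize are standard CUDA runtime calls):
vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
cudaError_t err = cudaGetLastError();      // catches launch errors, e.g. an invalid configuration
if (err != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();             // waits for the kernel and reports errors raised while it ran
if (err != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(err));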
30

Thread Blocks and Grids


In our example, thread blocks are 256 threads
If we want to add 1000 elements, we need 4 thread blocks
there will be 24 threads left over
which is why threads need to validate the array index
[Figure: the 1000-element array divided across blocks 0 through 3]
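As a quick check of the arithmetic (a sketch using the example's block size):
int threadsPerBlock = 256;
int n = 1000;
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // = 4 blocks
int extraThreads = numBlocks * threadsPerBlock - n;           // = 24 idle threads, filtered out by the i < n test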

31

Thread Blocks and Multidimensional Grids

[Figure: a 2D image indexed by x and y, covered by a 4x4 grid of thread blocks, Block 0,0 through Block 3,3]
32

Picture kernel block indexing


__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
// Calculate the row # of the d_Pin and d_Pout element
int Row = blockIdx.y*blockDim.y + threadIdx.y;
// Calculate the column # of the d_Pin and d_Pout element
int Col = blockIdx.x*blockDim.x + threadIdx.x;
// each thread computes one element of d_Pout if in range
if ((Row < m) && (Col < n)) {
d_Pout[Row*n+Col] = 2.0*d_Pin[Row*n+Col];
}
}
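The host-side launch for this kernel (a sketch assuming a 16x16 block size; allocation and copying of d_Pin and d_Pout are omitted) pairs a 2D block with a 2D grid:
// Cover an n-column by m-row image with 16x16 thread blocks
dim3 DimBlock(16, 16, 1);
dim3 DimGrid((n - 1) / 16 + 1, (m - 1) / 16 + 1, 1);  // x covers columns, y covers rows
PictureKernel<<<DimGrid, DimBlock>>>(d_Pin, d_Pout, n, m);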

33
