
CUDA Programming Model

Gernot Ziegler, NVIDIA UK (material by Gregory Ruetsch)

Programming in C for CUDA


C for CUDA = C + a few simple extensions
As a C developer, it is easy to start writing parallel programs

Three key abstractions:


1. Parallel threads on the device (GPU)
2. Management of the corresponding memory spaces
3. Corresponding synchronization

Host: device management API
Additionally, Runtime API & nvcc: use language extensions even in host code!
NVIDIA Confidential

Basics
Set up GPU for computation
GPU device and memory management
GPU kernel launches (execution configuration)
Some specifics of GPU/device code

Some additional features:

Vector types
Asynchronous execution
CUDA error handling
CUDA Events

Note: only the basic features are covered


Programming Guide and Reference Manual contain more information

Device Management
First task: the CPU queries and selects GPU devices
cudaGetDeviceCount( int *count )
cudaSetDevice( int device )
cudaGetDevice( int *current_device )
cudaGetDeviceProperties( cudaDeviceProp *prop, int device )
cudaChooseDevice( int *device, const cudaDeviceProp *prop )

Multi-GPU setup:
Device 0 is used by default; be careful when a graphics (GFX) card is combined with a Tesla!
(Usually one CPU thread controls one GPU, but the driver API allows more.)
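
The query-and-select step could look roughly like the sketch below. It uses only the runtime calls listed above; the helper name select_first_device and the choice to always pick device 0 are assumptions made for illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: lists all CUDA devices and selects device 0.
// Returns the number of devices found, or -1 if a query fails.
int select_first_device(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return -1;

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            return -1;
        printf("device %d: %s, compute capability %d.%d, %zu MB global memory\n",
               dev, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024));
    }

    // Device 0 is the default anyway; the explicit call documents the choice.
    if (count > 0 && cudaSetDevice(0) != cudaSuccess)
        return -1;
    return count;
}
```

On a machine with a display card and a Tesla, the printed list is a quick way to see which index belongs to which board before calling cudaSetDevice().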

Managing Memory
The host (CPU) also manages device (GPU) memory:
Allocate and free memory
Copy data to and from the device's global memory (GPU DRAM, e.g. 4 GB on Tesla)

cudaMalloc(void **pointer, size_t nbytes)
cudaMemset(void *pointer, int value, size_t count)
cudaFree(void *pointer)

Host and device have separate memory spaces!

Example: Managing memory (no data transfer)


int n = 1024;
int nbytes = n * sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );
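
The same sequence with return codes checked, plus a copy back to the host to confirm the memset, might be sketched as follows. The helper name device_memset_roundtrip is hypothetical. Note that cudaMemset() sets bytes, so only byte-uniform patterns such as 0 give predictable int values.

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: allocates n ints on the device, zeroes them, copies
// them back, and returns 1 if every element is 0, 0 if not, -1 on any error.
int device_memset_roundtrip(int n)
{
    size_t nbytes = (size_t)n * sizeof(int);
    int *d_a = 0;
    int ok = 1;
    int *h_a = (int *)malloc(nbytes);
    if (h_a == NULL)
        return -1;

    if (cudaMalloc((void **)&d_a, nbytes) != cudaSuccess) {
        free(h_a);
        return -1;
    }

    if (cudaMemset(d_a, 0, nbytes) != cudaSuccess ||
        cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost) != cudaSuccess) {
        ok = -1;
    } else {
        for (int i = 0; i < n; ++i)
            if (h_a[i] != 0)
                ok = 0;
    }

    cudaFree(d_a);   // device memory is released with cudaFree, not free
    free(h_a);
    return ok;
}
```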


CUDA: Runtime support


Explicit memory allocation returns pointers to GPU memory
cudaMalloc(), cudaFree()

Explicit memory copy for host <-> device, device <-> device


cudaMemcpy(), cudaMemcpy2D(), ...

Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...

OpenGL & DirectX interoperability


cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), ...

Example: Memory management in host code

// allocate host memory
int numBytes = N * sizeof(float);
float* h_A = (float*) malloc(numBytes);

// allocate device memory
float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);

// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

// execute the kernel on the GPU (covered on the following slides):
//     gpu_func<<<exec-dims>>>(params);

// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

// free device and host memory
cudaFree(d_A);
free(h_A);

Kernel creation
How to write a kernel for the launch gpu_func<<<exec-dims>>>(params)?
First, a recap of the CUDA architecture...

Device code: Thread bundles


Kernel = device code call
A kernel is executed by a grid of thread blocks
A thread block is a batch of threads that can cooperate through shared memory
Threads from different blocks cannot cooperate

[Figure: the host launches Kernel 1 on Grid 1 (blocks (0,0) to (2,1)) and Kernel 2 on Grid 2; Block (1,1) is expanded into threads (0,0) to (4,2)]

Blocks must be independent

"Threads from different blocks cannot cooperate". Why?
Any possible interleaving of blocks should be valid
Blocks are presumed to run to completion without pre-emption
Blocks can run in any order
Blocks can run concurrently OR sequentially (GPU scaling)

Blocks may coordinate but not synchronize
Shared queue pointer: OK
Shared lock: BAD, can easily deadlock

So: the independence requirement gives scalability across different GPU sizes.

Device code: Thread IDs


Threads and blocks have IDs
So each thread can decide what data to work on

Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
2D/3D IDs simplify addressing when processing multidimensional data
Image processing
Solving PDEs on volumes

[Figure: Grid 1 on the device with blocks (0,0) to (2,1); Block (1,1) is expanded into threads (0,0) to (4,2)]

Programming Model: Memory Spaces


Each thread can:
Read/write per-thread registers
(Read/write per-thread local memory)
Read/write per-block shared memory
Read/write per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory

Host can read/write global, constant, and texture memory (all stored in GPU DRAM)

[Figure: a grid with Block (0,0) and Block (1,0); each block has its own shared memory, each thread has its own registers and local memory; global, constant, and texture memory are shared by the whole grid and accessible from the host]

Qualifiers for variable storage (device code)


__device__
Stored in device memory, aka global memory (e.g. 4 GB on Tesla)
Large capacity, BUT: high latency, uncached
Global memory can also be allocated at run time with cudaMalloc
Accessible by all threads

__shared__
On-chip memory (SRAM, low latency), 16 kB per multiprocessor
Size is set by the execution configuration or at compile time
Shared by all threads in the same thread block
Short-lived (exists only while the block runs)

All unqualified variables:
Scalars and built-in vector types are stored in registers
Arrays may be in registers or in local memory (a special form of global memory / DRAM)

Launching kernels
Modified C function call syntax:
kernel<<<dim3 grid, dim3 block>>>(...)

Execution configuration (<<< >>>):
Grid dimensions: x and y
Thread-block dimensions: x, y, and z

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block>>>(...);
kernel<<<32, 512>>>(...);

Example: Host Code


// allocate host memory
int numBytes = N * sizeof(float);
float* h_A = (float*) malloc(numBytes);

// allocate device memory
float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);

// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

// execute the kernel (this grid size assumes N is a multiple of blockSize)
increment_gpu<<< N/blockSize, blockSize >>>(d_A, b, N);

// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

// free device memory
cudaFree(d_A);

CUDA Built-in Device Variables


All __global__ and __device__ functions have access to these automatically defined variables:

dim3 gridDim;
Dimensions of the grid in blocks (at most 2D)

dim3 blockDim;
Dimensions of the block in threads

dim3 blockIdx;
Block index within the grid

dim3 threadIdx;
Thread index within the block


Example: Increment Array Elements


CPU program:

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

int main()
{
    .....
    increment_cpu(a, b, N);
}

CUDA program:

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

int main()
{
    ..
    dim3 dimBlock( blocksize );
    dim3 dimGrid( (unsigned int)ceil( N / (float)blocksize ) );
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
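
Putting the kernel together with the host-side memory management from the earlier slides gives something like the following sketch. The wrapper name increment_on_gpu and the block size of 256 are assumptions; the grid size uses integer rounding up instead of ceil(), and N is assumed to be positive.

```cuda
#include <cuda_runtime.h>

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                  // guard: the last block may be only partly used
        a[idx] = a[idx] + b;
}

// Hypothetical host wrapper: adds b to each of the N elements of h_a on the GPU.
// Returns 0 on success, -1 on any CUDA error. Assumes N > 0.
int increment_on_gpu(float *h_a, float b, int N)
{
    const int blockSize = 256;                              // assumed block size
    const int numBlocks = (N + blockSize - 1) / blockSize;  // integer ceil(N / blockSize)
    size_t numBytes = (size_t)N * sizeof(float);
    float *d_a = 0;
    int status = 0;

    if (cudaMalloc((void **)&d_a, numBytes) != cudaSuccess)
        return -1;
    if (cudaMemcpy(d_a, h_a, numBytes, cudaMemcpyHostToDevice) != cudaSuccess)
        status = -1;

    if (status == 0) {
        increment_gpu<<<numBlocks, blockSize>>>(d_a, b, N);
        if (cudaGetLastError() != cudaSuccess)   // catches invalid launch configurations
            status = -1;
    }
    if (status == 0 &&
        cudaMemcpy(h_a, d_a, numBytes, cudaMemcpyDeviceToHost) != cudaSuccess)
        status = -1;

    cudaFree(d_a);
    return status;
}
```

Because of the idx < N guard, N does not have to be a multiple of the block size.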


Other extras (device code)


Other language extras....


Built-in Vector Types

[u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]


Structures accessed with x, y, z, w fields:
uint4 param;
int y = param.y;

dim3
Based on uint3
Used to specify dimensions
Default value (1,1,1)

Can be used in GPU and CPU code (if compiled with nvcc)

Thread Synchronization

void __syncthreads();
Synchronizes all threads in a block

Generates a barrier synchronization instruction
No thread can pass this barrier until all threads in the block have reached it

Often needed to order shared memory writes and reads between threads
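
A small sketch of the write/barrier/read pattern: each block reverses its own slice of an array through shared memory. The kernel name, the wrapper name reverse_blocks and the block size of 64 are assumptions for illustration. Without the barrier, a thread could read a tile element before the thread responsible for it has written it.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64   // assumed block size for this sketch

// Each block reverses its own BLOCK-element slice of d in place.
__global__ void reverse_in_block(int *d)
{
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK;

    tile[t] = d[base + t];               // every thread writes one element
    __syncthreads();                     // wait until the whole tile is written
    d[base + t] = tile[BLOCK - 1 - t];   // read an element written by another thread
}

// Hypothetical host wrapper; n must be a positive multiple of BLOCK.
// Returns 0 on success, -1 on error.
int reverse_blocks(int *h, int n)
{
    int *d = 0;
    int status = -1;
    size_t bytes = (size_t)n * sizeof(int);
    if (n <= 0 || n % BLOCK != 0)
        return -1;

    if (cudaMalloc((void **)&d, bytes) == cudaSuccess &&
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        reverse_in_block<<<n / BLOCK, BLOCK>>>(d);
        if (cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess)
            status = 0;
    }
    cudaFree(d);
    return status;
}
```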


GPU Atomic Integer Operations


Atomic operations on integers in global memory:
Associative operations on signed/unsigned ints
add, sub, min, max, ...
and, or, xor
Increment, decrement
Exchange, compare and swap

32-bit: hardware with compute capability >= 1.1
64-bit: hardware with compute capability >= 1.2
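
As an illustration, a global counter shared by all threads of all blocks can be updated with atomicAdd(). The sketch below counts the even values in an array; the kernel name and the wrapper count_even_on_gpu are hypothetical. A plain (*counter)++ would lose updates, because many threads would read and write the same word at the same time.

```cuda
#include <cuda_runtime.h>

__global__ void count_even(const int *data, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] % 2 == 0)
        atomicAdd(counter, 1);   // atomic read-modify-write on global memory
}

// Hypothetical host wrapper: returns the number of even values among the
// n ints in h_data, or -1 on error.
int count_even_on_gpu(const int *h_data, int n)
{
    int *d_data = 0, *d_counter = 0;
    int result = -1;
    size_t bytes = (size_t)n * sizeof(int);
    if (n <= 0)
        return -1;

    if (cudaMalloc((void **)&d_data, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_counter, sizeof(int)) == cudaSuccess &&
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(d_counter, 0, sizeof(int)) == cudaSuccess) {
        count_even<<<(n + 255) / 256, 256>>>(d_data, n, d_counter);
        if (cudaMemcpy(&result, d_counter, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            result = -1;
    }
    cudaFree(d_data);      // cudaFree(0) is a no-op, so this is safe after a failure
    cudaFree(d_counter);
    return result;
}
```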


C for CUDA : Summary


Function qualifiers:
__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }

Variable qualifiers:
__constant__ float MyConstantArray[32];
__shared__ float MySharedArray[32];

Execution configuration:
dim3 dimGrid(100, 50);                      // 5000 thread blocks
dim3 dimBlock(4, 8, 8);                     // 256 threads per block
MyKernel <<< dimGrid, dimBlock >>> (...);   // Launch kernel

Built-in variables and functions valid in device code:

dim3 gridDim;           // Grid dimension
dim3 blockDim;          // Block dimension
dim3 blockIdx;          // Block index
dim3 threadIdx;         // Thread index
void __syncthreads();   // Thread synchronization
(see the Programming Guide)

Runtime API: More features


Other runtime specialties for host code...


Asynchronous operation
CUDA calls are enqueued in streams and executed one after another; usually there is one default stream (0)

Kernel launches are asynchronous
Control returns to the CPU immediately
The kernel executes after all previous CUDA calls have completed

cudaMemcpy() is synchronous
The copy starts after all previous CUDA calls have completed
Control returns to the CPU after the copy completes
(Asynchronous memory copies are possible, too)

Thus: GPU output that is required on the host forces a synchronization

Example: Async operation


// allocate host memory
int numBytes = N * sizeof(float);
float* h_A = (float*) malloc(numBytes);

// allocate device memory
float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);

// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

// "execute the kernel"
// truly: CPU enqueues kernel calls, GPU executes asynchronously
kernel_A<<< .., .. >>>(...);
kernel_B<<< .., .. >>>(...);
kernel_C<<< .., .. >>>(...);

// copy data from device back to host - CPU/GPU SYNC
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

// free device memory
cudaFree(d_A);

CUDA Error Reporting


All CUDA calls return an error code of type cudaError_t
Except for kernel launches

cudaGetLastError()
Returns the code for the last error (the "no error" case also has a code: cudaSuccess)
Also reports errors from kernel execution

const char *cudaGetErrorString(code)
Returns a string describing the error

printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
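
Since a kernel launch has no return value, the usual pattern is to call cudaGetLastError() right after it. The sketch below provokes a launch failure on purpose by requesting far more threads per block than any device supports; the names empty_kernel and launch_with_bad_config are hypothetical.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void empty_kernel(void) { }

// Hypothetical helper: launches with an invalid block size and returns the
// error that cudaGetLastError() reports for the failed launch.
cudaError_t launch_with_bad_config(void)
{
    empty_kernel<<<1, 1 << 20>>>();          // 1048576 threads per block: too many
    cudaError_t err = cudaGetLastError();    // also resets the stored error
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

A second cudaGetLastError() call directly afterwards returns cudaSuccess again, because reading the error clears it.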


Textures in CUDA
Textures are known from graphics ...
In CUDA, textures are used for reading data
Benefits:
Addressable in 1D, 2D, or 3D
Data is cached (optimized for 2D locality)
Helpful for irregular data access

Filtering
Linear / bilinear / trilinear, in dedicated hardware

Wrap modes (for out-of-bounds addresses)

Usage:
Host code binds data to a texture reference
Kernel reads data by calling a fetch function, e.g. tex1Dfetch()

CUDA Event API


CUDA call streams can be interspersed with events
Usage scenarios:
Measure elapsed time for CUDA calls (clock cycle precision!)
Query the status of an asynchronous CUDA call
Block the CPU until the CUDA calls prior to the event are completed
See the asyncAPI sample in the CUDA SDK

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;   // in milliseconds
cudaEventElapsedTime(&elapsedTime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

Driver API
Up to this point, the host code we've seen has used the runtime API: cuda*() functions...

Driver API: cu*() functions

Advantages:
Plain C interface, so you can use any CPU compiler for host code (e.g. icc)
More control over devices
One CPU thread can control multiple GPUs
PTX Just-In-Time (JIT) compilation
(Parallel Thread eXecution (PTX) is our "GPU assembly language")
No dependency on the runtime library

Disadvantages:
No device emulation
More verbose code

Note: Device code is identical, regardless of whether the runtime or the driver API is used

Once more: Runtime and Driver API


Best place to start for virtually all developers: Runtime API
Easy to migrate to the driver API if/when it is needed
Anything that can be done in the runtime API can also be done in the driver API, but not vice versa
Much, much more information on both APIs in the CUDA Reference Manual

New Features in CUDA 2.2

Zero copy
CUDA threads can directly read/write host (CPU) memory
Requires pinned (non-pageable) memory
Main benefits:
More efficient than small PCIe data transfers
May give better performance when there is no opportunity for data reuse from device DRAM

2D texturing from linear memory
Allows simpler write-to-texture in CUDA
Useful for image processing
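
For the zero-copy feature, a minimal sketch could look like this: the host buffer is allocated as mapped pinned memory, and the kernel works on it through a device pointer without any cudaMemcpy(). The names add_one and zero_copy_increment are hypothetical, device 0 is assumed, and cudaDeviceSynchronize() is the current name of the call that CUDA 2.2 exposed as cudaThreadSynchronize().

```cuda
#include <cuda_runtime.h>

__global__ void add_one(int *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i] += 1;    // reads and writes host memory directly
}

// Hypothetical helper: increments n ints that live in mapped host memory.
// Returns 0 on success, -1 if mapping is unsupported or a call fails.
int zero_copy_increment(int n)
{
    cudaDeviceProp prop;
    int *h = 0, *d = 0;
    int status = 0;

    if (n <= 0)
        return -1;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory)
        return -1;

    // Request host-memory mapping. The result is ignored here as a
    // simplification; any error it leaves behind is cleared.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    if (cudaHostAlloc((void **)&h, (size_t)n * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess)
        return -1;
    for (int i = 0; i < n; ++i)
        h[i] = i;

    if (cudaHostGetDevicePointer((void **)&d, h, 0) != cudaSuccess)
        status = -1;
    if (status == 0) {
        add_one<<<(n + 255) / 256, 256>>>(d, n);
        if (cudaDeviceSynchronize() != cudaSuccess)   // wait before reading h
            status = -1;
    }
    for (int i = 0; status == 0 && i < n; ++i)
        if (h[i] != i + 1)
            status = -1;

    cudaFreeHost(h);
    return status;
}
```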


nvcc is a C compiler
Advanced C++ constructs (classes with inheritance and virtual functions) make it stumble in device code!
If problems occur and CUDART is still desirable: let nvcc compile only the .cu files that contain the kernels, let the customer's compiler handle the C++ code in its own files, and link the two parts.
Last resort: CUDA driver API (nvcc compiles kernels into PTX or binaries, which the application loads via C calls)

C for CUDA Optimization



Optimize Algorithms for GPU


Maximize data parallelism in the algorithm (SIMD): think threads for data elements, not specific tasks
Reduce thread divergence (branches are serialized when threads within a 32-thread warp take different paths)
Do more computation on the GPU rather than costly device-host data transfers
Even low-parallelism computations can sometimes be faster than transferring back and forth to the host

Optimize Algorithms for GPU: Maths


Maximize arithmetic intensity (math per memory transfer)
Sometimes it's better to recompute results than to cause serial dependencies
GPU spends its transistors on ALUs, not memory

Double precision algorithms: consider moving parts or all of them to single precision
Hardware has built-in math functions (at reduced precision): __sinf(), __expf(), etc.
Try -use_fast_math (implicitly converts e.g. sinf() to __sinf()) or carefully replace individual function calls, considering the reduced accuracy
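
Before replacing a function call, it can help to measure how far the intrinsic deviates on the input range that actually occurs. The following sketch compares sinf() and __sinf() on samples in [-pi, pi]; the names sin_diff and max_fast_sin_diff are hypothetical. If the file is compiled with -use_fast_math, both calls map to the intrinsic and the reported difference is zero.

```cuda
#include <math.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void sin_diff(const float *x, float *diff, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        diff[i] = fabsf(sinf(x[i]) - __sinf(x[i]));
}

// Hypothetical helper: returns max |sinf(x) - __sinf(x)| over n evenly spaced
// samples in [-pi, pi], or -1.0f on error. n must be at least 2.
float max_fast_sin_diff(int n)
{
    const float pi = 3.14159265f;
    size_t bytes = (size_t)n * sizeof(float);
    float *h = 0, *d_x = 0, *d_diff = 0;
    float worst = -1.0f;

    if (n < 2)
        return -1.0f;
    h = (float *)malloc(bytes);
    if (h == NULL)
        return -1.0f;
    for (int i = 0; i < n; ++i)
        h[i] = -pi + 2.0f * pi * (float)i / (float)(n - 1);

    if (cudaMalloc((void **)&d_x, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_diff, bytes) == cudaSuccess &&
        cudaMemcpy(d_x, h, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        sin_diff<<<(n + 255) / 256, 256>>>(d_x, d_diff, n);
        if (cudaMemcpy(h, d_diff, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            worst = 0.0f;
            for (int i = 0; i < n; ++i)
                if (h[i] > worst)
                    worst = h[i];
        }
    }
    cudaFree(d_x);       // cudaFree(0) is a no-op
    cudaFree(d_diff);
    free(h);
    return worst;
}
```

Outside this range the error of the intrinsic grows, so the test range should match the application's data.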

Optimize Memory Access

Coalescing: the "optimal" memory access pattern
Coalesced vs. non-coalesced = an order of magnitude in performance!

Shared memory: a user-managed cache
Advanced concepts:
Shared memory bank conflicts
Make use of spatial locality for the texture and constant caches

Coalescing
Compute capability 1.0 and 1.1
The k-th thread must access the k-th word in the segment (or the k-th word in 2 contiguous 128B segments for 128-bit words); not all threads need to participate

Coalesced: 1 transaction
Out of sequence: 16 transactions
Misaligned: 16 transactions

Coalescing
Compute capability 1.2 and higher
MMU is more advanced and relaxes the coalescing requirements
Coalescing is achieved for any pattern of addresses that fits into a segment of size:
32B for 8-bit words
64B for 16-bit words
128B for 32- and 64-bit words
Smaller transactions may be issued to avoid wasting bandwidth on unused words
Exact rules are in the Programming Guide

Example: 1 transaction for a 64B segment

Take Advantage of Shared Memory


Hundreds of times faster than global memory
Threads can cooperate via shared memory
Use one or a few threads to load or compute data shared by all threads
Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-coalesceable addressing
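
Matrix transpose is a common illustration of this staging idea: a naive version has to use strided addresses either for its reads or for its writes, while a tile in shared memory lets both go row by row. The sketch below assumes a 16x16 tile and row-major storage; the names transpose_tiled and transpose_on_gpu are hypothetical.

```cuda
#include <cuda_runtime.h>

#define TILE 16   // assumed tile edge; one thread per tile element

// in has `height` rows of `width` floats; out has `width` rows of `height` floats.
__global__ void transpose_tiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // extra column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise read

    __syncthreads();   // the tile must be complete before it is read transposed

    x = blockIdx.y * TILE + threadIdx.x;     // block indices swap roles for output
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // row-wise write
}

// Hypothetical host wrapper. Returns 0 on success, -1 on error.
int transpose_on_gpu(const float *h_in, float *h_out, int width, int height)
{
    float *d_in = 0, *d_out = 0;
    int status = -1;
    if (width <= 0 || height <= 0)
        return -1;

    size_t bytes = (size_t)width * height * sizeof(float);
    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);

    if (cudaMalloc((void **)&d_in, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_out, bytes) == cudaSuccess &&
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        transpose_tiled<<<grid, block>>>(d_in, d_out, width, height);
        if (cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess)
            status = 0;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return status;
}
```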


Use Parallelism Efficiently


Partition your computation to keep the GPU multiprocessors equally busy
Many threads, many thread blocks

Keep threads' resource usage low enough to support multiple blocks per multiprocessor
Resources: Registers, shared memory


Host-Device Data Transfers


Device-Host memory bandwidth much lower than device-device bandwidth
8 GB/s peak (PCI-e x16 Gen 2) vs. 102 GB/s peak (Tesla C1060)

Minimize transfers
Don't transfer intermediate data: it can be allocated, operated on, and deallocated without ever being copied to host memory

Group transfers
One large transfer much better than many small ones


Overlapping Data Transfers and Computation


Stream and async API allow host-device data transfers to overlap with computation
CPU computation can overlap data transfers on all CUDA-capable devices
Devices with concurrent copy and execution (compute capability >= 1.1): kernel computation can overlap data transfers, controlled via streams and events

Stream = sequence of CUDA calls that execute in order
Calls in different streams can be interleaved
The stream ID is an argument to async calls and kernel launches
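
A rough sketch of the pattern: the data is split into two halves, and each half gets its own stream for upload, kernel and download, so that the copy of one half may overlap the kernel of the other on capable hardware. Pinned host memory is required for cudaMemcpyAsync() to be asynchronous. The names scale and scale_with_streams are hypothetical.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= s;
}

// Hypothetical helper: fills n floats with 1.0f, doubles them on the GPU using
// two streams, and checks the result. n must be positive and even.
// Returns 0 on success, -1 on error.
int scale_with_streams(int n)
{
    const int half = n / 2;
    const int threads = 256;
    const int blocks = (half + threads - 1) / threads;
    size_t halfBytes = (size_t)half * sizeof(float);
    float *h = 0, *d = 0;
    cudaStream_t s[2];
    int status = 0;

    if (n <= 0 || n % 2 != 0)
        return -1;
    if (cudaMallocHost((void **)&h, 2 * halfBytes) != cudaSuccess)   // pinned memory
        return -1;
    if (cudaMalloc((void **)&d, 2 * halfBytes) != cudaSuccess) {
        cudaFreeHost(h);
        return -1;
    }
    for (int i = 0; i < n; ++i)
        h[i] = 1.0f;
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int k = 0; k < 2; ++k) {
        float *hp = h + k * half;
        float *dp = d + k * half;
        // Within one stream these three calls run in order;
        // the two streams may be interleaved by the hardware.
        cudaMemcpyAsync(dp, hp, halfBytes, cudaMemcpyHostToDevice, s[k]);
        scale<<<blocks, threads, 0, s[k]>>>(dp, 2.0f, half);
        cudaMemcpyAsync(hp, dp, halfBytes, cudaMemcpyDeviceToHost, s[k]);
    }

    if (cudaStreamSynchronize(s[0]) != cudaSuccess) status = -1;
    if (cudaStreamSynchronize(s[1]) != cudaSuccess) status = -1;
    for (int i = 0; status == 0 && i < n; ++i)
        if (h[i] != 2.0f)
            status = -1;

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d);
    cudaFreeHost(h);
    return status;
}
```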


Shared Memory
~Hundred times faster than global memory
Use it to cache data from global memory accesses
Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-coalesceable addressing

Threads can cooperate via shared memory
Share results with each other
Contribute to a common result, e.g. block min/max/avg

Grid/Block Size Heuristics


# of blocks > # of multiprocessors
So all multiprocessors have at least one block to execute

# of blocks / # of multiprocessors > 2
Multiple blocks can run concurrently on a multiprocessor
Blocks that aren't waiting at a __syncthreads() keep the hardware busy
Subject to resource availability: registers, shared memory

# of blocks > 100 to scale to future devices
Blocks are executed in pipeline fashion
1000 blocks per grid will scale across multiple generations

Accuracy
GPU and CPU results may differ, but are equally accurate (to the specified ulp accuracy)
CPU operations aren't strictly limited to 0.5 ulp
Sequences of operations can be even more accurate due to 80-bit extended precision ALUs

Compare the GPU calculation to CPU SSE
And: floating-point arithmetic is not associative!
Complex area (ask if unsure)
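
The non-associativity point matters because a parallel reduction on the GPU adds values in a different order than a sequential CPU loop. A tiny sketch with two hypothetical helpers, sum_left and sum_right, shows how the grouping alone can change a single-precision result:

```cuda
#include <cuda_runtime.h>

// Sums three floats left to right: (a + b) + c
__host__ __device__ inline float sum_left(float a, float b, float c)
{
    float t = a + b;
    return t + c;
}

// Sums the same three floats right to left: a + (b + c)
__host__ __device__ inline float sum_right(float a, float b, float c)
{
    float t = b + c;
    return a + t;
}
```

With a = 1e20f, b = -1e20f and c = 1.0f, sum_left() yields 1.0f, while sum_right() yields 0.0f because 1.0f is absorbed when it is added to -1e20f. Neither result indicates a bug; they are two valid roundings of different operation orders.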

Summary
GPU hardware can achieve great performance on data-parallel computations if you follow a few simple guidelines:
Use parallelism efficiently
Coalesce memory accesses if possible
Take advantage of shared memory
Explore other memory spaces
Texture
Constant
(Reduce shared memory bank conflicts)

See the Programming Guide, Best Practices Guide and Reference Manual
If that doesn't help: ask your local DevTech-Compute engineer :)
