Host: device management API. Also: the runtime API & nvcc use language extensions even for host code!
NVIDIA Confidential
Basics
Set up the GPU for computation:
GPU device and memory management
GPU kernel launches (execution configuration)
Some specifics of GPU/device code
Device Management
First task: CPU will query and select GPU devices
cudaGetDeviceCount(int* count)
cudaSetDevice(int device)
cudaGetDevice(int* current_device)
cudaGetDeviceProperties(cudaDeviceProp* prop, int device)
cudaChooseDevice(int* device, cudaDeviceProp* prop)
Multi-GPU setup:
Device 0 is used by default; be careful with a combination of a graphics card and a Tesla! (Usually, one CPU thread controls one GPU each, but the driver API allows more.)
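As a sketch, the calls above combine into a minimal host program like this (device selection is illustrative; a real application would pick a device by its properties):

```cuda
// Minimal sketch: query and select a GPU from the host (runtime API).
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d CUDA device(s) found\n", count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }

    cudaSetDevice(0);          // explicit, even though device 0 is the default
    int current = -1;
    cudaGetDevice(&current);   // verify the selection
    return 0;
}
```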
Managing Memory
Host/CPU also manages device/GPU memory:
Allocate & free memory
Copy data to and from the device's global memory (GPU DRAM, e.g. 4 GB on a Tesla)
cudaMalloc(void **pointer, size_t nbytes)
cudaMemset(void *pointer, int value, size_t count)
cudaFree(void *pointer)
Host and device have separate memory spaces!
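A minimal sketch of the allocation lifecycle (the pointer name d_a is illustrative; the "d_" prefix is a common convention for device pointers):

```cuda
// Sketch: allocate, clear, and free device memory from the host.
#include <cuda_runtime.h>

int main(void)
{
    const size_t nbytes = 1024 * sizeof(float);
    float *d_a = 0;   // device pointer: NOT dereferenceable on the host!

    cudaMalloc((void**)&d_a, nbytes);   // allocate in GPU global memory
    cudaMemset(d_a, 0, nbytes);         // set every byte to 0
    cudaFree(d_a);                      // release it again
    return 0;
}
```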
Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...
Kernel creation
How to write a kernel: gpu_func <<<exec-dims>>> (params). First, a recap of the CUDA architecture...
Kernel = device code call
A kernel is executed by a grid of thread blocks
A thread block is a batch of threads that can cooperate through shared memory
Threads from different blocks cannot cooperate
[Figure: Kernel 2 launches Grid 2; Block (1, 1) is shown as a 5×3 array of threads, Thread (0, 0) through Thread (4, 2)]
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
2D/3D IDs simplify addressing when processing multidimensional data:
Image processing
Solving PDEs on volumes
[Figure: memory spaces — each thread in Block (0, 0) and Block (1, 0) has its own local memory]
Host can read/write global, constant, and texture memory (all stored in GPU DRAM)
__shared__
On-chip memory (SRAM, low latency), 16 kB per multiprocessor
Allocated by execution configuration or at compile time
Shared access by all threads in the same thread block
Short-lived (only while the block runs)
Launching kernels
Modified C function call syntax:
kernel<<<dim3 grid, dim3 block>>>()
dim3 blockDim;
Dimensions of the block in threads
uint3 blockIdx;
Block index within the grid
uint3 threadIdx;
Thread index within the block
void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}
int main() {
    ...
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
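A hedged, self-contained version of the example, adding the host-side allocation and copies the slide omits (the h_a/d_a names and N = 1000 are illustrative):

```cuda
// Sketch: complete host side for increment_gpu — allocate, copy in, launch, copy out.
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

int main(void)
{
    const int N = 1000, blocksize = 256;
    const size_t nbytes = N * sizeof(float);

    float *h_a = (float*)malloc(nbytes);           // host copy of the data
    for (int i = 0; i < N; ++i) h_a[i] = (float)i;

    float *d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize);
    dim3 dimGrid((N + blocksize - 1) / blocksize); // same as ceil(N / (float)blocksize)
    increment_gpu<<<dimGrid, dimBlock>>>(d_a, 1.0f, N);

    // cudaMemcpy is synchronous: it waits for the kernel, then copies back.
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    free(h_a);
    return 0;
}
```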
dim3
Based on uint3
Used to specify dimensions
Default value: (1, 1, 1)
Thread Synchronization
Atomic operations on 32-bit words: hardware with compute capability >= 1.1
On 64-bit words: compute capability >= 1.2
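A small sketch of a 32-bit atomic (the kernel name and the counting task are illustrative, not from the slides):

```cuda
// Sketch: each thread atomically increments a global counter.
// atomicAdd on 32-bit words needs compute capability >= 1.1.
__global__ void count_positive(const float *a, int N, int *counter)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && a[idx] > 0.0f)
        atomicAdd(counter, 1);   // hardware serializes conflicting updates
}
```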
Variable qualifiers:
__constant__ float MyConstantArray[32];
__shared__ float MySharedArray[32];
Execution configuration:
dim3 dimGrid(100, 50);    // 5000 thread blocks
dim3 dimBlock(4, 8, 8);   // 256 threads per block
MyKernel<<<dimGrid, dimBlock>>>(...);  // launch kernel
Asynchronous operation
CUDA calls are enqueued in streams and executed one after another; usually there is one default stream (0).
Kernel launches are asynchronous:
control returns to the CPU immediately
the kernel executes after all previous CUDA calls
cudaMemcpy() is synchronous:
the copy starts after all previous CUDA calls have completed
control returns to the CPU after the copy completes
(async memcopies are possible, too)
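A sketch of how this asynchrony is typically exploited (my_kernel and do_independent_cpu_work are hypothetical placeholders; cudaThreadSynchronize() is the era-appropriate name, later renamed cudaDeviceSynchronize()):

```cuda
// Sketch: overlap CPU work with an asynchronous kernel launch.
#include <cuda_runtime.h>

__global__ void my_kernel(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] *= 2.0f;
}

void do_independent_cpu_work(void) { /* runs while the GPU computes */ }

void run(float *h_data, float *d_data, int N)
{
    my_kernel<<<(N + 255) / 256, 256>>>(d_data, N);  // returns immediately
    do_independent_cpu_work();                       // overlaps with the kernel
    cudaThreadSynchronize();                         // wait for the GPU
    cudaMemcpy(h_data, d_data, N * sizeof(float),
               cudaMemcpyDeviceToHost);              // synchronous anyway
}
```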
cudaGetLastError()
Returns the code for the last error ("no error" also has a code)
Even retrieves errors from kernel execution
const char* cudaGetErrorString(cudaError_t code)
Returns a string describing the error
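These two calls are often wrapped in a checking macro; a common sketch (the macro name CUDA_CHECK is a convention, not part of the API):

```cuda
// Sketch: wrap runtime calls in an error check; also usable after kernel launches.
#include <cuda_runtime.h>
#include <stdio.h>

#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess)                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                    __FILE__, __LINE__, cudaGetErrorString(err));  \
    } while (0)

// A kernel launch itself returns no error code, so check afterwards:
//   my_kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());
```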
Textures in CUDA
Textures are known from graphics... In CUDA, textures are used for reading data.
Benefits:
Addressable in 1D, 2D, or 3D
Data is cached (optimized for 2D locality)
Helpful for irregular data access
Filtering
Linear / bilinear / trilinear, in dedicated hardware
Usage:
Host code binds data to a texture reference Kernel reads data by calling a fetch function, e.g. tex1Dfetch()
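A sketch of that usage pattern with the texture reference API of this CUDA generation (the reference and kernel names are illustrative):

```cuda
// Sketch: bind a linear device buffer to a texture reference and fetch through it.
#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texRef;   // file-scope texture reference

__global__ void read_through_texture(float *out, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        out[idx] = tex1Dfetch(texRef, idx);  // cached read via the texture unit
}

// Host side, before the launch:
//   cudaBindTexture(0, texRef, d_in, N * sizeof(float));
```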
Driver API
Up to this point, the host code we've seen has been from the runtime API: the cuda*() functions...
Disadvantages:
No device emulation More verbose code
Note: Device code is identical, regardless of using the runtime or driver API
nvcc is a C compiler
Advanced C++ constructs (classes with inheritance and virtual functions) make it stumble in device code!
If problems occur and CUDART is still desirable: let nvcc compile only the .cu files that contain the kernels, let the customer's compiler handle the C++ code in their own files, and link the two parts.
Last resort: the CUDA driver API (nvcc compiles kernels into PTX or binaries, which the application loads via C calls).
Double precision algorithms: consider moving parts or all of the computation to single precision.
The hardware has built-in math functions (at reduced precision): __sinf(), __expf(), etc.
Try -use_fast_math (implicitly converts e.g. sinf() to __sinf()), or carefully replace individual function calls, considering the reduced accuracy.
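A sketch of the manual replacement (the kernel is illustrative; __sinf() trades accuracy for the hardware's special function units):

```cuda
// Sketch: replacing a precise library call with the fast hardware intrinsic.
__global__ void phase(float *out, const float *x, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        out[idx] = __sinf(x[idx]);  // fast but less accurate than sinf()
}
```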
Shared memory: a user-managed cache
Advanced concepts:
Shared memory bank conflicts
Make use of spatial locality for texture and constant caches
Coalescing
Compute capability 1.0 and 1.1
The k-th thread must access the k-th word in the segment (or the k-th word in two contiguous 128 B segments for 128-bit words); not all threads need to participate.
Coalesced: 1 transaction
Misaligned: 16 separate transactions
Coalescing
Compute capability 1.2 and higher
The MMU is more advanced and relaxes the coalescing requirements.
Coalescing is achieved for any pattern of addresses that fits into a segment of size:
32 B for 8-bit words
64 B for 16-bit words
128 B for 32- and 64-bit words
Smaller transactions may be issued to avoid wasting bandwidth on unused words.
Exact rules in the Programming Guide.
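A contrasting sketch of the two access patterns (kernel names are illustrative):

```cuda
// Sketch: coalesced vs strided global memory access.
__global__ void copy_coalesced(float *dst, const float *src, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // thread k reads word k
    if (idx < N) dst[idx] = src[idx];                 // requests fall into one segment
}

__global__ void copy_strided(float *dst, const float *src, int N, int stride)
{
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  // scattered
    if (idx < N) dst[idx] = src[idx];   // many transactions, wasted bandwidth
}
```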
Keep threads' resource usage low enough to support multiple blocks per multiprocessor
Resources: Registers, shared memory
Minimize transfers
Don't transfer intermediate data: it can be allocated, operated on, and deallocated without ever being copied to host memory.
Group transfers
One large transfer is much better than many small ones
Shared Memory
~A hundred times faster than global memory
Use it to cache data from global memory accesses
Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing
Threads can cooperate via shared memory:
share results with each other
contribute to a common result, e.g. a block min/max/avg
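A sketch of such cooperation, a block-wide maximum in shared memory (assumes blockDim.x is a power of two and at most 256; the sentinel -1e30f is illustrative):

```cuda
// Sketch: threads cooperate via __shared__ memory to compute a block maximum.
__global__ void block_max(const float *in, float *out, int N)
{
    __shared__ float s[256];                 // one slot per thread
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    s[tid] = (idx < N) ? in[idx] : -1e30f;   // coalesced load into shared memory
    __syncthreads();                         // wait until the whole tile is in

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] = fmaxf(s[tid], s[tid + stride]);
        __syncthreads();                     // every step needs all threads
    }
    if (tid == 0) out[blockIdx.x] = s[0];    // one result per block
}
```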
Accuracy
GPU and CPU results may differ, but are equally accurate (to the specified ulp accuracy); CPU operations aren't strictly limited to 0.5 ulp.
Sequences of operations on the CPU can be even more accurate due to 80-bit extended-precision ALUs.
Compare the GPU calculation to CPU SSE.
And: floating-point arithmetic is not associative!
A complex area (ask if unsure).
Summary
GPU hardware can achieve great performance on data-parallel computations if you follow a few simple guidelines:
Use parallelism efficiently
Coalesce memory accesses where possible
Take advantage of shared memory
Explore the other memory spaces: texture, constant
(Reduce shared memory bank conflicts.)
See the Programming Guide, Best Practices Guide, and Reference Manual.
If that doesn't help: ask your local DevTech-Compute engineer :)