
General-Purpose Computation on

Graphics Hardware

David Luebke

University of Virginia

Course Introduction
The GPU on commodity video cards has evolved
into an extremely flexible and powerful processor
Programmability
Precision
Power

We are interested in how to harness that power for general-purpose computation

Motivation:
Computational Power
GPUs are fast

3 GHz Pentium 4 theoretical: 6 GFLOPS, 5.96 GB/sec peak

GeForce FX 5900 observed: 20 GFLOPS, 25.3 GB/sec peak
GeForce 6800 Ultra observed: 53 GFLOPS, 35.2 GB/sec peak
GeForce 8800 GTX: estimated at 520 GFLOPS, 86.4 GB/sec peak
That's almost 100 times faster than a 3 GHz Pentium 4!

GPUs are getting faster, faster


CPUs: annual growth ~1.5x, decade growth ~60x
GPUs: annual growth > 2.0x, decade growth > 1000x
Courtesy Kurt Akeley, Ian Buck & Tim Purcell, GPU Gems (see course notes)

Motivation:
Computational Power

[Chart: GPU vs. CPU computational power over time (courtesy Naga Govindaraju)]

Motivation:
Computational Power

[Chart: programmable GFLOPS (multiplies per second), July 2001 - January 2004: NVIDIA NV30/35/40 and ATI R300/360/420 vs. Pentium 4 (courtesy Ian Buck)]

An Aside:
Computational Power
Why are GPUs getting faster so fast?
Arithmetic intensity: the specialized nature of GPUs makes it easier to spend additional transistors on computation rather than cache
Economics: the multi-billion-dollar video game market is a pressure cooker that drives innovation

Motivation:
Flexible and precise
Modern GPUs are deeply programmable
Programmable pixel, vertex, video engines
Solidifying high-level language support

Modern GPUs support high precision


32-bit floating point throughout the pipeline
High enough for many (not all) applications

Motivation:
The Potential of GPGPU
The power and flexibility of GPUs makes them an
attractive platform for general-purpose computation
Example applications range from in-game physics
simulation to conventional computational science
Goal: make the inexpensive power of the GPU
available to developers as a sort of computational
coprocessor

The Problem:
Difficult To Use
GPUs designed for and driven by video games
Programming model is unusual & tied to computer graphics
Programming environment is tightly constrained

Underlying architectures are:


Inherently parallel
Rapidly evolving (even in basic feature set!)
Largely secret

Can't simply port code written for the CPU!

GPU Fundamentals:
The Graphics Pipeline
[Diagram: a simplified graphics pipeline]

Application (CPU, holds graphics state) -> Transform -> Rasterizer -> Shade -> Video Memory (Textures)
Data flowing between stages: vertices (3D) -> transformed, lit vertices (2D) -> fragments (pre-pixels) -> final pixels (color, depth)
Render-to-texture feeds final pixels back into video memory.

Note that pipe widths vary; many caches, FIFOs, and so on are not shown.
GPU Fundamentals:
The Modern Graphics Pipeline
[Diagram: the modern graphics pipeline]

Application (CPU, holds graphics state) -> Vertex Processor -> Rasterizer -> Fragment Processor -> Video Memory (Textures)
Data flowing between stages: vertices (3D) -> transformed, lit vertices (2D) -> fragments (pre-pixels) -> final pixels (color, depth)
Render-to-texture feeds final pixels back into video memory.

Programmable vertex processor!
Programmable fragment (pixel) processor!
GPU Pipeline:
Transform
Vertex Processor (multiple operate in parallel)
Transform from world space to image space
Compute per-vertex lighting

GPU Pipeline:
Rasterizer
Rasterizer
Convert geometric rep. (vertex) to image rep. (fragment)
Fragment = image fragment
Pixel + associated data: color, depth, stencil, etc.
Interpolate per-vertex quantities across pixels

GPU Pipeline: Shade


Fragment Processors (multiple in parallel)
Compute a color for each pixel
Optionally read colors from textures (images)

Importance of Data Parallelism


GPU: Each vertex / fragment is independent
Temporary registers are zeroed
No static data
No read-modify-write buffers

Data parallel processing


Best for ALU-heavy architectures: GPUs
Multiple vertex & pixel pipelines
Hide memory latency (with more computation)

Courtesy of Ian Buck

Arithmetic Intensity
Lots of ops per word transferred
GPGPU demands high arithmetic intensity for peak
performance
Ex: solving systems of linear equations
Physically-based simulation on lattices
All-pairs shortest paths

Courtesy of Pat Hanrahan
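For intuition, here is a minimal C sketch (not from the course notes; the arrays, n, k, and coeff are hypothetical) contrasting low and high arithmetic intensity:

/* Low intensity: one add per three words of memory traffic (bandwidth-bound). */
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];

/* High intensity: roughly 2*k flops per element read (compute-bound, GPU-friendly). */
for (int i = 0; i < n; i++) {
    float x = a[i], y = 0.0f;
    for (int j = 0; j <= k; j++)
        y = y * x + coeff[j];   /* Horner evaluation of a degree-k polynomial */
    c[i] = y;
}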

Data Streams & Kernels


Streams
Collection of records requiring similar computation
Vertex positions, Voxels, FEM cells, etc.
Provide data parallelism

Kernels
Functions applied to each element in stream
Transforms, PDEs, ...
No dependencies between stream elements
Encourage high arithmetic intensity

Courtesy of Ian Buck
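As a rough CPU-side analogue (a sketch, not Brook or course code; the Vertex type and transform kernel are made up), a kernel is simply a function mapped independently over every record of a stream:

typedef struct { float x, y, z; } Vertex;      /* one stream record */

/* Kernel: applied to each element, with no dependencies between elements. */
static Vertex transform(Vertex v) { v.x += 1.0f; return v; }

/* "Stream" execution: conceptually, all iterations run in parallel. */
void apply_kernel(Vertex *out, const Vertex *in, int n) {
    for (int i = 0; i < n; i++)
        out[i] = transform(in[i]);
}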

Example: Simulation Grid


Common GPGPU computation style
Textures represent computational grids = streams

Many computations map to grids

Matrix algebra
Image & Volume processing
Physical simulation
Global Illumination
ray tracing, photon mapping,
radiosity

Non-grid streams can be mapped to grids

Stream Computation
Grid Simulation algorithm
Made up of steps
Each step updates entire grid
Must complete before next step can begin

Grid is a stream, steps are kernels


Kernel applied to each stream element
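A minimal C sketch of this structure (not from the course notes; the step kernel is hypothetical): the grid is double-buffered, and each step reads the old grid and finishes writing the new one before the next step begins.

void step(const float *in, float *out, int w, int h);   /* per-step kernel (hypothetical) */

float *simulate(float *grid, float *scratch, int w, int h, int nsteps) {
    for (int s = 0; s < nsteps; s++) {
        step(grid, scratch, w, h);             /* kernel applied to every grid cell    */
        float *tmp = grid;                     /* ping-pong: output becomes next input */
        grid = scratch;
        scratch = tmp;
    }
    return grid;                               /* most recent result */
}

On the GPU, the two buffers become two textures and the swap becomes render-to-texture plus rebinding, as shown later.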

Scatter vs. Gather


Grid communication
Grid cells share information
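In CPU terms (a sketch with hypothetical arrays), gather means each output element reads from a computed location, while scatter means each input element writes to a computed location:

/* Gather: out[i] reads wherever it likes -- fragment programs can do this (texture reads). */
for (int i = 0; i < n; i++)
    out[i] = in[idx[i]];

/* Scatter: in[i] writes wherever it likes -- fragment programs cannot, since their
   output address is fixed to the pixel being shaded. */
for (int i = 0; i < n; i++)
    out[idx[i]] = in[i];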

Computational Resources Inventory


Programmable parallel processors
Vertex & Fragment pipelines

Rasterizer
Mostly useful for interpolating addresses (texture coordinates)
and per-vertex constants

Texture unit
Read-only memory interface

Render to texture
Write-only memory interface

Vertex Processor
Fully programmable (SIMD / MIMD)
Processes 4-vectors (RGBA / XYZW)
Capable of scatter but not gather
Can change the location of current vertex
Cannot read info from other vertices
Can only read a small constant memory

Future hardware enables gather: vertex textures!

Fragment Processor
Fully programmable (SIMD)
Processes 4-vectors (RGBA / XYZW)
Random access memory read (textures)
Capable of gather but not scatter
No random access memory writes
Output address fixed to a specific pixel

Typically more useful than vertex processor


More fragment pipelines than vertex pipelines
RAM read
Direct output

CPU-GPU Analogies
CPU programming is familiar
GPU programming is graphics-centric

Analogies can aid understanding

CPU-GPU Analogies
CPU                       GPU

Stream / Data Array    =  Texture
Memory Read            =  Texture Sample

CPU-GPU Analogies
CPU                                     GPU

Loop body / kernel / algorithm step  =  Fragment Program

Feedback:
Each algorithm step depends on the results of previous steps.
Each time step depends on the results of the previous time step.

CPU-GPU Analogies
CPU                GPU

...
Grid[i][j] = x;
...

Array Write     =  Render to Texture

GPU Simulation Overview


Analogies lead to implementation
Algorithm steps are fragment programs
Computational kernels
Current state variables stored in textures
Feedback via render to texture

One question: how do we invoke computation?

Invoking Computation
Must invoke computation at each pixel
Just draw geometry!
Most common GPGPU invocation is a full-screen quad

Typical Grid Computation


Initialize view (so that pixels:texels::1:1)
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(0, 1, 0, 1, 0, 1);
glViewport(0, 0, outTexResX, outTexResY);

For each algorithm step:
Activate render-to-texture
Set up input textures and the fragment program
Draw a full-screen quad (1x1), as sketched below
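A minimal OpenGL sketch of one step (assuming the EXT_framebuffer_object extension and an already-bound fragment program; fbo, inTex, and outTex are hypothetical handles):

glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);          /* activate render-to-texture  */
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                          GL_TEXTURE_2D, outTex, 0);    /* results land in outTex      */
glBindTexture(GL_TEXTURE_2D, inTex);                    /* current state is the input  */

glBegin(GL_QUADS);                                      /* full-screen 1x1 quad        */
  glTexCoord2f(0, 0); glVertex2f(0, 0);
  glTexCoord2f(1, 0); glVertex2f(1, 0);
  glTexCoord2f(1, 1); glVertex2f(1, 1);
  glTexCoord2f(0, 1); glVertex2f(0, 1);
glEnd();

With the glOrtho(0, 1, 0, 1, ...) projection set up above, the unit quad covers every pixel of the output texture, so the fragment program runs once per grid cell.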

Branching Techniques
Fragment program branches can be expensive
No true fragment branching on NV3X & R3X0

Better to move decisions up the pipeline

Replace with math


Occlusion Query
Domain decomposition
Z-cull
Pre-computation

Branching with OQ
Use it for iteration termination
do {  // outer loop on CPU
  BeginOcclusionQuery {
    // Render with a fragment program that discards
    // fragments that satisfy the termination criteria
  } EndQuery
} while (query result > 0)
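A rough OpenGL sketch of the same loop using ARB_occlusion_query (the query object q and the drawFullScreenQuad helper are hypothetical):

GLuint q, count;
glGenQueriesARB(1, &q);
do {
    glBeginQueryARB(GL_SAMPLES_PASSED_ARB, q);
    drawFullScreenQuad();                  /* fragment program discards "finished" pixels */
    glEndQueryARB(GL_SAMPLES_PASSED_ARB);
    glGetQueryObjectuivARB(q, GL_QUERY_RESULT_ARB, &count);
} while (count > 0);                       /* stop when no fragments survive              */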

Can be used for subdivision techniques


Demo later

Domain Decomposition
Avoid branches where outcome is fixed
One region is always true, another false
Separate FPs for each region, no branches

Example:
boundaries

Z-Cull
In early pass, modify depth buffer
Write depth=0 for pixels that should not be modified by later
passes
Write depth=1 for rest

Subsequent passes
Enable depth test (GL_LESS)
Draw full-screen quad at z=0.5
Only pixels with previous depth=1 will be processed

Can also use early stencil test


Not available on NV3X
Depth replace disables z-cull
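A minimal sketch of the z-cull pattern above (drawFullScreenQuadAtZ is a hypothetical helper; the first pass assumes a bound fragment program that replaces depth with 0 or 1 per pixel):

/* Early pass: mark pixels -- depth 0 means "skip", depth 1 means "process". */
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_ALWAYS);
glDepthMask(GL_TRUE);
drawFullScreenQuadAtZ(0.0f);          /* marking fragment program writes per-pixel depth */

/* Later passes: only pixels previously marked with depth 1 pass the GL_LESS test. */
glDepthFunc(GL_LESS);
glDepthMask(GL_FALSE);                /* keep the depth mask intact across passes        */
drawFullScreenQuadAtZ(0.5f);          /* simulation fragment program bound here          */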

Pre-computation
Pre-compute anything that will not change every
iteration!
Example: arbitrary boundaries
When user draws boundaries, compute texture containing
boundary info for cells
Reuse that texture until boundaries modified
Combine with Z-cull for higher performance!

High-level Shading Languages


Cg, HLSL, & GLSlang
Cg:
http://www.nvidia.com/cg

HLSL:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/reference/highlevellanguageshaders.asp

GLSlang:
http://www.3dlabs.com/support/developer/ogl2/whitepapers/index.html

Example: a high-level expression such as

float3 cPlastic = Cd * (cAmbi + cDiff) + Cs * cSpec;

compiles down to GPU assembly:

...
MULR R0.xyz, R0.xxx, R4.xyxx;
MOVR R5.xyz, -R0.xyzz;
DP3R R3.x, R0.xyzz, R3.xyzz;
...

GPGPU Languages
Why do we want them?
Make programming GPUs easier!
Don't need to know OpenGL, DirectX, or ATI/NV extensions
Simplify common operations
Focus on the algorithm, not on the implementation

Sh
University of Waterloo
http://libsh.sourceforge.net
http://www.cgl.uwaterloo.ca

Brook
Stanford University
http://brook.sourceforge.net
http://graphics.stanford.edu

Brook: general-purpose streaming language

Stream programming model:
enforce data-parallel computing: streams
encourage arithmetic intensity: kernels

C with stream extensions
GPU = streaming coprocessor

System outline:
.br: Brook source files
brcc: source-to-source compiler
brt: Brook run-time library

Brook language: streams

streams: collections of records requiring similar computation
particle positions, voxels, FEM cells, ...
float3 positions<200>;
float3 velocityfield<100,100,100>;

similar to arrays, but


index operations disallowed: position[i]
read/write stream operators
streamRead (positions, p_ptr);
streamWrite (velocityfield, v_ptr);

encourage data parallelism

Brook language: kernels

kernels: functions applied to streams
similar to a for_all construct
kernel void foo (float a<>, float b<>,
out float result<>) {
result = a + b;
}
float a<100>;
float b<100>;
float c<100>;
foo(a,b,c);

// equivalent C loop:
for (i=0; i<100; i++)
  c[i] = a[i] + b[i];

no dependencies between stream elements
encourage high arithmetic intensity

Brook language: kernels

kernel arguments:

input/output streams
constant parameters
gather streams
iterator streams
kernel void foo (float a<>, float b<>,
float t, float array[],
iter float n<>,
out float result<>) {
result = array[a] + t*b + n;
}
float a<100>;
float b<100>;
float c<100>;
float array<25>;
iter float n<100> = iter(0, 10);
foo(a,b,3.2f,array,n,c);

Brook language: kernels

Example: ray-triangle intersection kernel
kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
RayState oldraystate<>,
GridTrilist trilist[],
out Hit candidatehit<>) {
float idx, det, inv_det;
float3 edge1, edge2, pvec, tvec, qvec;
if(oldraystate.state.y > 0) {
idx = trilist[oldraystate.state.w].trinum;
edge1 = tris[idx].v1 - tris[idx].v0;
edge2 = tris[idx].v2 - tris[idx].v0;
pvec = cross(ray.d, edge2);
det = dot(edge1, pvec);
inv_det = 1.0f/det;
tvec = ray.o - tris[idx].v0;
candidatehit.data.y = dot( tvec, pvec ) * inv_det;
qvec = cross( tvec, edge1 );
candidatehit.data.z = dot( ray.d, qvec ) * inv_det;
candidatehit.data.x = dot( edge2, qvec ) * inv_det;
candidatehit.data.w = idx;
} else {
candidatehit.data = float4(0,0,0,-1);
}
}

Brook language: reductions

reductions: compute a single value from a stream
reduce void sum (float a<>,
                 reduce float r<>) {
  r += a;
}
float a<100>;
float r;
sum(a,r);

// equivalent C loop:
r = a[0];
for (int i=1; i<100; i++)
  r += a[i];

Brook language: reductions

reductions: associative operations only
(a+b)+c = a+(b+c)
sum, multiply, max, min, OR, AND, XOR
matrix multiply

Brook language: reductions

multi-dimensional reductions:
stream shape differences are resolved by the reduce function
reduce void sum (float a<>,
                 reduce float r<>) {
  r += a;
}
float a<20>;
float r<5>;
sum(a,r);

// equivalent C loops:
for (int i=0; i<5; i++) {
  r[i] = a[i*4];
  for (int j=1; j<4; j++)
    r[i] += a[i*4 + j];
}

Brook language: stream repeat & stride

kernel arguments of different shape are resolved by repeat and stride
kernel void foo (float a<>, float b<>,
out float result<>);
float a<20>;
float b<5>;
float c<10>;
foo(a,b,c);

foo(a[0],  b[0], c[0])
foo(a[2],  b[0], c[1])
foo(a[4],  b[1], c[2])
foo(a[6],  b[1], c[3])
foo(a[8],  b[2], c[4])
foo(a[10], b[2], c[5])
foo(a[12], b[3], c[6])
foo(a[14], b[3], c[7])
foo(a[16], b[4], c[8])
foo(a[18], b[4], c[9])

Brook language: matrix-vector multiply
kernel void mul (float a<>, float b<>,
out float result<>)
{ result = a*b; }
reduce void sum (float a<>,
reduce float result<>)
{ result += a; }
float matrix<20,10>;
float vector<1,10>;
float tempmv<20,10>;
float result<20,1>;

mul(matrix,vector,tempmv);
sum(tempmv,result);


Running Brook
Compiling .br files
Brook CG Compiler
Version: 0.2 Built: Apr 24 2004, 18:11:59
brcc [-hvndktyAN] [-o prefix] [-w workspace] [-p shader] foo.br
  -h            help (print this message)
  -v            verbose (print intermediate generated code)
  -n            no codegen (just parse and re-emit the input)
  -d            debug (print cTool internal state)
  -k            keep generated fragment program (in foo.cg)
  -t            disable kernel call type checking
  -y            emit code for ATI 4-output hardware
  -A            enable address virtualization (experimental)
  -N            deny support for kernels calling other kernels
  -o prefix     prefix prepended to all output files
  -w workspace  workspace size (16 - 2048, default 1024)
  -p shader     cpu / ps20 / fp30 / cpumt (can specify multiple)

Running Brook
BRT_RUNTIME selects the platform:
  CPU backend:                BRT_RUNTIME = cpu
  CPU multithreaded backend:  BRT_RUNTIME = cpumt
  NVIDIA NV30 backend:        BRT_RUNTIME = nv30
  OpenGL ARB backend:         BRT_RUNTIME = arb
  DirectX9 backend:           BRT_RUNTIME = dx9

Runtime
Accessing stream data from graphics apps
Brook runtime API is available in C++ code
Auto-generated .hpp files for Brook code
brook::initialize( "dx9", (void*)device );
// Create streams
fluidStream0 = stream::create<float4>( kFluidSize, kFluidSize );
normalStream = stream::create<float3>( kFluidSize, kFluidSize );
// Get a handle to the texture being used by
// the normal stream as a backing store
normalTexture = (IDirect3DTexture9*)
normalStream->getIndexedFieldRenderData(0);
// Call the simulation kernel
simulationKernel( fluidStream0, fluidStream0, controlConstant,
fluidStream1 );

Applications
Includes lots of sample applications:

Ray-tracer
FFT
Image segmentation
Linear algebra

Brook performance
2-3x faster than CPU implementation
ATI Radeon 9800 XT
NVIDIA GeForce 6800

Compared against a 3 GHz Pentium 4, GPUs still lose against heavily optimized CPU code:
SSE, cache-friendly code
Intel Math Library
super-optimizations: FFTW, ATLAS
custom cache-blocked FFTW, segmentation C code

Brook for GPUs


Release v0.3 available on Sourceforge
Project Page
http://graphics.stanford.edu/projects/brook

Source
http://www.sourceforge.net/projects/brook

Over 4K downloads!
Brook for GPUs: Stream Computing on Graphics Hardware
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon
Fatahalian, Mike Houston, Pat Hanrahan


GPGPU Examples
Fluids
Reaction/Diffusion
Multigrid
Tone mapping
