
General-Purpose Computation on

Graphics Hardware

David Luebke

University of Virginia

Course Introduction
The GPU on commodity video cards has evolved
into an extremely flexible and powerful processor
Programmability
Precision
Power

We are interested in how to harness that power for general-purpose computation

Motivation:
Computational Power
GPUs are fast

3 GHz Pentium 4 theoretical: 6 GFLOPS, 5.96 GB/sec peak

GeForce FX 5900 observed: 20 GFLOPS, 25.3 GB/sec peak
GeForce 6800 Ultra observed: 53 GFLOPS, 35.2 GB/sec peak
GeForce 8800 GTX: estimated at 520 GFLOPS, 86.4 GB/sec peak
That's almost 100 times faster than a 3 GHz Pentium 4!

GPUs are getting faster, faster


CPUs: annual growth ~1.5x, decade growth ~60x
GPUs: annual growth > 2.0x, decade growth > 1000x
Courtesy Kurt Akeley, Ian Buck & Tim Purcell, GPU Gems (see course notes)

Motivation:
Computational Power

[Chart: GPU vs. CPU computational power over time (courtesy Naga Govindaraju)]

Motivation:
Computational Power

[Chart: programmable GFLOPS (multiplies per second), July 2001 - January 2004: NVIDIA NV30/35/40 and ATI R300/360/420 vs. Pentium 4 (courtesy Ian Buck)]

An Aside:
Computational Power
Why are GPUs getting faster so fast?
Arithmetic intensity: the specialized nature of GPUs makes it easier to spend additional transistors on computation rather than cache
Economics: the multi-billion-dollar video game market is a pressure cooker that drives innovation

Motivation:
Flexible and precise
Modern GPUs are deeply programmable
Programmable pixel, vertex, video engines
Solidifying high-level language support

Modern GPUs support high precision


32-bit floating point throughout the pipeline
High enough for many (not all) applications

Motivation:
The Potential of GPGPU
The power and flexibility of GPUs makes them an
attractive platform for general-purpose computation
Example applications range from in-game physics
simulation to conventional computational science
Goal: make the inexpensive power of the GPU
available to developers as a sort of computational
coprocessor

The Problem:
Difficult To Use
GPUs designed for and driven by video games
Programming model is unusual & tied to computer graphics
Programming environment is tightly constrained

Underlying architectures are:


Inherently parallel
Rapidly evolving (even in basic feature set!)
Largely secret

Can't simply port code written for the CPU!

GPU Fundamentals:
The Graphics Pipeline
[Diagram: a simplified graphics pipeline]

Application (CPU, holds graphics state) -> Transform -> Rasterizer -> Shade -> Video Memory (Textures)
Data flowing between stages: vertices (3D) -> transformed, lit vertices (2D) -> fragments (pre-pixels) -> final pixels (color, depth)
Render-to-texture feeds final pixels back into video memory.

Note that pipe widths vary; many caches, FIFOs, and so on are not shown.
GPU Fundamentals:
The Modern Graphics Pipeline
[Diagram: the modern graphics pipeline]

Application (CPU, holds graphics state) -> Vertex Processor -> Rasterizer -> Fragment Processor -> Video Memory (Textures)
Data flowing between stages: vertices (3D) -> transformed, lit vertices (2D) -> fragments (pre-pixels) -> final pixels (color, depth)
Render-to-texture feeds final pixels back into video memory.

Programmable vertex processor!
Programmable fragment (pixel) processor!
GPU Pipeline:
Transform
Vertex Processor (multiple operate in parallel)
Transform from world space to image space
Compute per-vertex lighting

GPU Pipeline:
Rasterizer
Rasterizer
Convert geometric rep. (vertex) to image rep. (fragment)
Fragment = image fragment
Pixel + associated data: color, depth, stencil, etc.
Interpolate per-vertex quantities across pixels

GPU Pipeline: Shade


Fragment Processors (multiple in parallel)
Compute a color for each pixel
Optionally read colors from textures (images)

Importance of Data Parallelism


GPU: Each vertex / fragment is independent
Temporary registers are zeroed
No static data
No read-modify-write buffers

Data parallel processing


Best for ALU-heavy architectures: GPUs
Multiple vertex & pixel pipelines
Hide memory latency (with more computation)

Courtesy of Ian Buck

Arithmetic Intensity
Lots of ops per word transferred
GPGPU demands high arithmetic intensity for peak
performance
Ex: solving systems of linear equations
Physically-based simulation on lattices
All-pairs shortest paths

Courtesy of Pat Hanrahan
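For intuition, here is a minimal C sketch (not from the course notes; the arrays, n, k, and coeff are hypothetical) contrasting low and high arithmetic intensity:

/* Low intensity: one add per three words of memory traffic (bandwidth-bound). */
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];

/* High intensity: roughly 2*k flops per element read (compute-bound, GPU-friendly). */
for (int i = 0; i < n; i++) {
    float x = a[i], y = 0.0f;
    for (int j = 0; j <= k; j++)
        y = y * x + coeff[j];   /* Horner evaluation of a degree-k polynomial */
    c[i] = y;
}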

Data Streams & Kernels


Streams
Collection of records requiring similar computation
Vertex positions, Voxels, FEM cells, etc.
Provide data parallelism

Kernels
Functions applied to each element in stream
Transforms, PDEs, ...
No dependencies between stream elements
Encourage high arithmetic intensity

Courtesy of Ian Buck
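As a rough CPU-side analogue (a sketch, not Brook or course code; the Vertex type and transform kernel are made up), a kernel is simply a function mapped independently over every record of a stream:

typedef struct { float x, y, z; } Vertex;      /* one stream record */

/* Kernel: applied to each element, with no dependencies between elements. */
static Vertex transform(Vertex v) { v.x += 1.0f; return v; }

/* "Stream" execution: conceptually, all iterations run in parallel. */
void apply_kernel(Vertex *out, const Vertex *in, int n) {
    for (int i = 0; i < n; i++)
        out[i] = transform(in[i]);
}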

Example: Simulation Grid


Common GPGPU computation style
Textures represent computational grids = streams

Many computations map to grids

Matrix algebra
Image & Volume processing
Physical simulation
Global Illumination
ray tracing, photon mapping,
radiosity

Non-grid streams can be mapped to grids

Stream Computation
Grid Simulation algorithm
Made up of steps
Each step updates entire grid
Must complete before next step can begin

Grid is a stream, steps are kernels


Kernel applied to each stream element
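A minimal C sketch of this structure (not from the course notes; the step kernel is hypothetical): the grid is double-buffered, and each step reads the old grid and finishes writing the new one before the next step begins.

void step(const float *in, float *out, int w, int h);   /* per-step kernel (hypothetical) */

float *simulate(float *grid, float *scratch, int w, int h, int nsteps) {
    for (int s = 0; s < nsteps; s++) {
        step(grid, scratch, w, h);             /* kernel applied to every grid cell    */
        float *tmp = grid;                     /* ping-pong: output becomes next input */
        grid = scratch;
        scratch = tmp;
    }
    return grid;                               /* most recent result */
}

On the GPU, the two buffers become two textures and the swap becomes render-to-texture plus rebinding, as shown later.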

Scatter vs. Gather


Grid communication
Grid cells share information
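In CPU terms (a sketch with hypothetical arrays), gather means each output element reads from a computed location, while scatter means each input element writes to a computed location:

/* Gather: out[i] reads wherever it likes -- fragment programs can do this (texture reads). */
for (int i = 0; i < n; i++)
    out[i] = in[idx[i]];

/* Scatter: in[i] writes wherever it likes -- fragment programs cannot, since their
   output address is fixed to the pixel being shaded. */
for (int i = 0; i < n; i++)
    out[idx[i]] = in[i];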

Computational Resources Inventory


Programmable parallel processors
Vertex & Fragment pipelines

Rasterizer
Mostly useful for interpolating addresses (texture coordinates)
and per-vertex constants

Texture unit
Read-only memory interface

Render to texture
Write-only memory interface

Vertex Processor
Fully programmable (SIMD / MIMD)
Processes 4-vectors (RGBA / XYZW)
Capable of scatter but not gather
Can change the location of current vertex
Cannot read info from other vertices
Can only read a small constant memory

Future hardware enables gather: vertex textures!

Fragment Processor
Fully programmable (SIMD)
Processes 4-vectors (RGBA / XYZW)
Random access memory read (textures)
Capable of gather but not scatter
No random access memory writes
Output address fixed to a specific pixel

Typically more useful than vertex processor


More fragment pipelines than vertex pipelines
RAM read
Direct output

CPU-GPU Analogies
CPU programming is familiar
GPU programming is graphics-centric

Analogies can aid understanding

CPU-GPU Analogies
CPU                       GPU

Stream / Data Array    =  Texture
Memory Read            =  Texture Sample

CPU-GPU Analogies
CPU                                     GPU

Loop body / kernel / algorithm step  =  Fragment Program

Feedback:
Each algorithm step depends on the results of previous steps.
Each time step depends on the results of the previous time step.

CPU-GPU Analogies
CPU                GPU

...
Grid[i][j] = x;
...

Array Write     =  Render to Texture

GPU Simulation Overview


Analogies lead to implementation
Algorithm steps are fragment programs
Computational kernels
Current state variables stored in textures
Feedback via render to texture

One question: how do we invoke computation?

Invoking Computation
Must invoke computation at each pixel
Just draw geometry!
Most common GPGPU invocation is a full-screen quad

Typical Grid Computation


Initialize view (so that pixels:texels::1:1)
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(0, 1, 0, 1, 0, 1);
glViewport(0, 0, outTexResX, outTexResY);

For each algorithm step:
Activate render-to-texture
Set up input textures and the fragment program
Draw a full-screen quad (1x1), as sketched below
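A minimal OpenGL sketch of one step (assuming the EXT_framebuffer_object extension and an already-bound fragment program; fbo, inTex, and outTex are hypothetical handles):

glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);          /* activate render-to-texture  */
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                          GL_TEXTURE_2D, outTex, 0);    /* results land in outTex      */
glBindTexture(GL_TEXTURE_2D, inTex);                    /* current state is the input  */

glBegin(GL_QUADS);                                      /* full-screen 1x1 quad        */
  glTexCoord2f(0, 0); glVertex2f(0, 0);
  glTexCoord2f(1, 0); glVertex2f(1, 0);
  glTexCoord2f(1, 1); glVertex2f(1, 1);
  glTexCoord2f(0, 1); glVertex2f(0, 1);
glEnd();

With the glOrtho(0, 1, 0, 1, ...) projection set up above, the unit quad covers every pixel of the output texture, so the fragment program runs once per grid cell.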

Branching Techniques
Fragment program branches can be expensive
No true fragment branching on NV3X & R3X0

Better to move decisions up the pipeline

Replace with math


Occlusion Query
Domain decomposition
Z-cull
Pre-computation

Branching with OQ
Use it for iteration termination
do {  // outer loop on CPU
  BeginOcclusionQuery {
    // Render with a fragment program that discards
    // fragments that satisfy the termination criteria
  } EndQuery
} while (query result > 0)
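A rough OpenGL sketch of the same loop using ARB_occlusion_query (the query object q and the drawFullScreenQuad helper are hypothetical):

GLuint q, count;
glGenQueriesARB(1, &q);
do {
    glBeginQueryARB(GL_SAMPLES_PASSED_ARB, q);
    drawFullScreenQuad();                  /* fragment program discards "finished" pixels */
    glEndQueryARB(GL_SAMPLES_PASSED_ARB);
    glGetQueryObjectuivARB(q, GL_QUERY_RESULT_ARB, &count);
} while (count > 0);                       /* stop when no fragments survive              */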

Can be used for subdivision techniques


Demo later

Domain Decomposition
Avoid branches where outcome is fixed
One region is always true, another false
Separate FPs for each region, no branches

Example:
boundaries

Z-Cull
In early pass, modify depth buffer
Write depth=0 for pixels that should not be modified by later
passes
Write depth=1 for rest

Subsequent passes
Enable depth test (GL_LESS)
Draw full-screen quad at z=0.5
Only pixels with previous depth=1 will be processed

Can also use early stencil test


Not available on NV3X
Depth replace disables z-cull
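A minimal sketch of the z-cull pattern above (drawFullScreenQuadAtZ is a hypothetical helper; the first pass assumes a bound fragment program that replaces depth with 0 or 1 per pixel):

/* Early pass: mark pixels -- depth 0 means "skip", depth 1 means "process". */
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_ALWAYS);
glDepthMask(GL_TRUE);
drawFullScreenQuadAtZ(0.0f);          /* marking fragment program writes per-pixel depth */

/* Later passes: only pixels previously marked with depth 1 pass the GL_LESS test. */
glDepthFunc(GL_LESS);
glDepthMask(GL_FALSE);                /* keep the depth mask intact across passes        */
drawFullScreenQuadAtZ(0.5f);          /* simulation fragment program bound here          */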

Pre-computation
Pre-compute anything that will not change every
iteration!
Example: arbitrary boundaries
When user draws boundaries, compute texture containing
boundary info for cells
Reuse that texture until boundaries modified
Combine with Z-cull for higher performance!

High-level Shading Languages


Cg, HLSL, & GLSlang
Cg:
http://www.nvidia.com/cg

HLSL:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/reference/highlevellanguageshaders.asp

GLSlang:
http://www.3dlabs.com/support/developer/ogl2/whitepapers/index.html

Example: a high-level expression such as

float3 cPlastic = Cd * (cAmbi + cDiff) + Cs * cSpec;

compiles down to GPU assembly:

...
MULR R0.xyz, R0.xxx, R4.xyxx;
MOVR R5.xyz, -R0.xyzz;
DP3R R3.x, R0.xyzz, R3.xyzz;
...

GPGPU Languages
Why do we want them?
Make programming GPUs easier!
Don't need to know OpenGL, DirectX, or ATI/NV extensions
Simplify common operations
Focus on the algorithm, not on the implementation

Sh
University of Waterloo
http://libsh.sourceforge.net
http://www.cgl.uwaterloo.ca

Brook
Stanford University
http://brook.sourceforge.net
http://graphics.stanford.edu

Brook: general-purpose streaming language

Stream programming model:
enforce data-parallel computing: streams
encourage arithmetic intensity: kernels

C with stream extensions
GPU = streaming coprocessor

System outline:
.br: Brook source files
brcc: source-to-source compiler
brt: Brook run-time library

Brook language: streams

streams: collections of records requiring similar computation
particle positions, voxels, FEM cells, ...
float3 positions<200>;
float3 velocityfield<100,100,100>;

similar to arrays, but


index operations disallowed: position[i]
read/write stream operators
streamRead (positions, p_ptr);
streamWrite (velocityfield, v_ptr);

encourage data parallelism

Brook language: kernels

kernels: functions applied to streams
similar to a for_all construct
kernel void foo (float a<>, float b<>,
out float result<>) {
result = a + b;
}
float a<100>;
float b<100>;
float c<100>;
foo(a,b,c);

// equivalent C loop:
for (i=0; i<100; i++)
  c[i] = a[i] + b[i];

no dependencies between stream elements
encourage high arithmetic intensity

Brook language: kernels

kernel arguments:

input/output streams
constant parameters
gather streams
iterator streams
kernel void foo (float a<>, float b<>,
float t, float array[],
iter float n<>,
out float result<>) {
result = array[a] + t*b + n;
}
float a<100>;
float b<100>;
float c<100>;
float array<25>;
iter float n<100> = iter(0, 10);
foo(a,b,3.2f,array,n,c);

Brook language: kernels

Example: ray-triangle intersection kernel
kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
RayState oldraystate<>,
GridTrilist trilist[],
out Hit candidatehit<>) {
float idx, det, inv_det;
float3 edge1, edge2, pvec, tvec, qvec;
if(oldraystate.state.y > 0) {
idx = trilist[oldraystate.state.w].trinum;
edge1 = tris[idx].v1 - tris[idx].v0;
edge2 = tris[idx].v2 - tris[idx].v0;
pvec = cross(ray.d, edge2);
det = dot(edge1, pvec);
inv_det = 1.0f/det;
tvec = ray.o - tris[idx].v0;
candidatehit.data.y = dot( tvec, pvec ) * inv_det;
qvec = cross( tvec, edge1 );
candidatehit.data.z = dot( ray.d, qvec ) * inv_det;
candidatehit.data.x = dot( edge2, qvec ) * inv_det;
candidatehit.data.w = idx;
} else {
candidatehit.data = float4(0,0,0,-1);
}
}

Brook language: reductions

reductions: compute a single value from a stream
reduce void sum (float a<>,
                 reduce float r<>) {
  r += a;
}
float a<100>;
float r;
sum(a,r);

// equivalent C loop:
r = a[0];
for (int i=1; i<100; i++)
  r += a[i];

Brook language: reductions

reductions: associative operations only
(a+b)+c = a+(b+c)
sum, multiply, max, min, OR, AND, XOR
matrix multiply

Brook language: reductions

multi-dimensional reductions:
stream shape differences are resolved by the reduce function
reduce void sum (float a<>,
                 reduce float r<>) {
  r += a;
}
float a<20>;
float r<5>;
sum(a,r);

// equivalent C loops:
for (int i=0; i<5; i++) {
  r[i] = a[i*4];
  for (int j=1; j<4; j++)
    r[i] += a[i*4 + j];
}

Brook language: stream repeat & stride

kernel arguments of different shape are resolved by repeat and stride
kernel void foo (float a<>, float b<>,
out float result<>);
float a<20>;
float b<5>;
float c<10>;
foo(a,b,c);

foo(a[0],  b[0], c[0])
foo(a[2],  b[0], c[1])
foo(a[4],  b[1], c[2])
foo(a[6],  b[1], c[3])
foo(a[8],  b[2], c[4])
foo(a[10], b[2], c[5])
foo(a[12], b[3], c[6])
foo(a[14], b[3], c[7])
foo(a[16], b[4], c[8])
foo(a[18], b[4], c[9])

Brook language: matrix-vector multiply
kernel void mul (float a<>, float b<>,
out float result<>)
{ result = a*b; }
reduce void sum (float a<>,
reduce float result<>)
{ result += a; }
float matrix<20,10>;
float vector<1,10>;
float tempmv<20,10>;
float result<20,1>;

mul(matrix,vector,tempmv);
sum(tempmv,result);


Running Brook
Compiling .br files
Brook CG Compiler
Version: 0.2 Built: Apr 24 2004, 18:11:59
brcc [-hvndktyAN] [-o prefix] [-w workspace] [-p shader] foo.br
  -h            help (print this message)
  -v            verbose (print intermediate generated code)
  -n            no codegen (just parse and re-emit the input)
  -d            debug (print cTool internal state)
  -k            keep generated fragment program (in foo.cg)
  -t            disable kernel call type checking
  -y            emit code for ATI 4-output hardware
  -A            enable address virtualization (experimental)
  -N            deny support for kernels calling other kernels
  -o prefix     prefix prepended to all output files
  -w workspace  workspace size (16 - 2048, default 1024)
  -p shader     cpu / ps20 / fp30 / cpumt (can specify multiple)

Running Brook
BRT_RUNTIME selects the platform:
  CPU backend:                BRT_RUNTIME = cpu
  CPU multithreaded backend:  BRT_RUNTIME = cpumt
  NVIDIA NV30 backend:        BRT_RUNTIME = nv30
  OpenGL ARB backend:         BRT_RUNTIME = arb
  DirectX9 backend:           BRT_RUNTIME = dx9

Runtime
Accessing stream data from graphics apps
Brook runtime API is available in C++ code
Auto-generated .hpp files for Brook code
brook::initialize( "dx9", (void*)device );
// Create streams
fluidStream0 = stream::create<float4>( kFluidSize, kFluidSize );
normalStream = stream::create<float3>( kFluidSize, kFluidSize );
// Get a handle to the texture being used by
// the normal stream as a backing store
normalTexture = (IDirect3DTexture9*)
normalStream->getIndexedFieldRenderData(0);
// Call the simulation kernel
simulationKernel( fluidStream0, fluidStream0, controlConstant,
fluidStream1 );

Applications
Includes lots of sample applications:

Ray-tracer
FFT
Image segmentation
Linear algebra

Brook performance
2-3x faster than CPU implementation
ATI Radeon 9800 XT
NVIDIA GeForce 6800

Compared against a 3 GHz Pentium 4, GPUs still lose against heavily optimized CPU code:
SSE, cache-friendly code
Intel Math Library
super-optimizations: FFTW, ATLAS
custom cache-blocked FFTW, segmentation C code

Brook for GPUs


Release v0.3 available on Sourceforge
Project Page
http://graphics.stanford.edu/projects/brook

Source
http://www.sourceforge.net/projects/brook

Over 4K downloads!
Brook for GPUs: Stream Computing on Graphics Hardware
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon
Fatahalian, Mike Houston, Pat Hanrahan


GPGPU Examples
Fluids
Reaction/Diffusion
Multigrid
Tone mapping
