Cedric Lee
[Diagram: 3D Scene -> Stage -> Stage -> Stage -> Raster Image]
Overview
Basic Graphics Pipeline
Modern Graphics Pipeline
Beyond Pipelining
The New Wave
Use case:
Render a textured mesh with per-pixel lighting
Ambient light, 1 directional light, 1 point light, no shadows
Assume a z-buffer-based architecture
3D Scene
Surface
Triangle mesh
Position + orientation (world matrix)
Per-vertex uv, tangent, binormal
Material
Diffuse + normal maps
Vertex Fetching
[Diagram: Vertex Stream + Index Stream -> Input Assembler -> Per-Vertex data]
Vertex Processing
Per-Vertex: Position-OS, Normal-OS, Tangent-OS, Binormal-OS, Texture UV (OS = object space)
Per-Vertex
Vertex Shader
Uniform Constants: World Matrix, View Matrix, Projection Matrix
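The vertex-processing stage above can be sketched in a few lines. This is a minimal illustration, not the deck's own code: it assumes 4x4 row-major matrices applied to column vectors, and the helper names (mat_vec, translation, vertex_shader) are hypothetical. Transforming the normal with the world matrix is only valid for rigid transforms or uniform scale.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def translation(tx, ty, tz):
    """Build a translation matrix; translation(0, 0, 0) is the identity."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def vertex_shader(position_os, normal_os, uv, world, view, projection):
    """Take object-space inputs and return the values handed to later stages."""
    position_ws = mat_vec(world, position_os + [1.0])  # w = 1: a point
    normal_ws = mat_vec(world, normal_os + [0.0])[:3]  # w = 0: a direction
    # Clip-space position: projection * view * world * position.
    position_cs = mat_vec(projection, mat_vec(view, position_ws))
    return {
        "position_ws": position_ws[:3],
        "position_cs": position_cs,
        "normal_ws": normal_ws,
        "uv": uv,
    }
```

Tangent-OS and Binormal-OS would be transformed the same way as the normal; they are left out to keep the sketch short.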
Scan Conversion
Per-Pixel
Trivial Reject
Rasterizer
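A rough sketch of scan conversion with a trivial reject, assuming screen-space 2D vertices and sampling at pixel centres. The edge-function approach used here is one common way to rasterize a triangle and is not necessarily what any particular GPU does; the function names are hypothetical.

```python
import math


def edge(a, b, p):
    """Signed area term: its sign tells which side of the edge a -> b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(v0, v1, v2, width, height):
    """Return the (x, y) pixels whose centres lie inside or on the triangle."""
    # Clamp the triangle's bounding box to the render target.
    min_x = max(math.floor(min(v0[0], v1[0], v2[0])), 0)
    max_x = min(math.floor(max(v0[0], v1[0], v2[0])), width - 1)
    min_y = max(math.floor(min(v0[1], v1[1], v2[1])), 0)
    max_y = min(math.floor(max(v0[1], v1[1], v2[1])), height - 1)
    if min_x > max_x or min_y > max_y:
        return []  # trivial reject: the triangle is entirely off screen

    area = edge(v0, v1, v2)
    if area == 0:
        return []  # degenerate (zero-area) triangle

    covered = []
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            p = (x + 0.5, y + 0.5)
            weights = (edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p))
            # Inside when every edge term has the same sign as the area,
            # which accepts both clockwise and counter-clockwise winding.
            if all(w * area >= 0 for w in weights):
                covered.append((x, y))
    return covered
```

Dividing the three edge terms by the area gives barycentric weights, which is how the per-vertex outputs would be interpolated to per-pixel values.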
Pixel Processing
Per-Pixel: Position-WS, Position-SS, Normal-WS, Tangent-WS, Binormal-WS, Texture UV (WS = world space, SS = screen space)
Textures: Diffuse, Normal
Per-Pixel
Uniform Constants: Ambient light colour, Directional light colour, Directional light direction, Point light colour, Point light position
Pixel Shader
Texturing Lighting
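The pixel-shader inputs listed above are enough for a simple diffuse lighting sketch. Everything here is an illustrative assumption rather than the deck's shader: Lambert (N dot L) shading, no attenuation on the point light, a directional-light vector that points from the light towards the scene, and texels passed in as already-sampled RGB triples. The names pixel_shader and the keys of the lights dictionary are hypothetical.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pixel_shader(position_ws, normal_ws, tangent_ws, binormal_ws,
                 diffuse_texel, normal_texel, lights):
    """Return the lit RGB colour for one pixel."""
    # Unpack the normal-map texel from [0, 1] to a tangent-space vector in [-1, 1].
    n_ts = [2.0 * c - 1.0 for c in normal_texel]
    # Rotate it into world space using the interpolated tangent frame.
    n = normalize([n_ts[0] * tangent_ws[i]
                   + n_ts[1] * binormal_ws[i]
                   + n_ts[2] * normal_ws[i] for i in range(3)])

    # Ambient term.
    total = list(lights["ambient_colour"])

    # Directional light: negate the direction to get the vector towards the light.
    to_dir_light = normalize([-c for c in lights["dir_direction"]])
    n_dot_l = max(dot(n, to_dir_light), 0.0)
    total = [t + n_dot_l * c for t, c in zip(total, lights["dir_colour"])]

    # Point light: the vector to the light depends on the pixel's world position.
    to_point_light = normalize(
        [l - p for l, p in zip(lights["point_position"], position_ws)])
    n_dot_l = max(dot(n, to_point_light), 0.0)
    total = [t + n_dot_l * c for t, c in zip(total, lights["point_colour"])]

    # Modulate by the diffuse texture and clamp to the displayable range.
    return [min(d * t, 1.0) for d, t in zip(diffuse_texel, total)]
```

A flat normal-map texel of (0.5, 0.5, 1.0) unpacks to (0, 0, 1) in tangent space, so the shading normal equals the interpolated Normal-WS.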
Depth Buffer
Colour Buffer
Frame buffer / render targets
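The depth and colour buffers work together as in the sketch below. It assumes a "less than" depth comparison with depth cleared to 1.0 (the far plane); real hardware lets the application choose the comparison. The function names are hypothetical.

```python
def make_framebuffer(width, height, clear_colour=(0.0, 0.0, 0.0), far=1.0):
    """Create a depth buffer cleared to the far plane and a cleared colour buffer."""
    depth = [[far] * width for _ in range(height)]
    colour = [[clear_colour] * width for _ in range(height)]
    return depth, colour


def write_fragment(depth, colour, x, y, z, rgb):
    """Keep the fragment only if it is nearer than what is already stored."""
    if z < depth[y][x]:
        depth[y][x] = z
        colour[y][x] = rgb
        return True
    return False  # hidden behind an earlier fragment
```

Because the test is per pixel, triangles can be submitted in any order and the nearest surface still ends up in the colour buffer.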
Kill/emit vertices, primitives
E.g. displacement mapping, fur, single-pass render to cube map
Common shading cores shared between Vertex, Geometry and Pixel shading units
Scheduler distributes work
Load balancing
http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
Bandwidth:
Hierarchical Z
PS3: compressed Z and colour to reduce bandwidth for MSAA reads
X360: in-GPU EDRAM, lots of bandwidth
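Hierarchical Z saves bandwidth by rejecting whole tiles before any per-pixel depth reads. The sketch below is one simple way to express the idea, assuming a "less than" depth test and a single coarse level that stores the farthest depth per tile; actual hardware layouts differ and are not described in the slides. The function names are hypothetical.

```python
def build_hiz(depth, tile):
    """Reduce a full-resolution depth buffer (list of rows) to per-tile maximum depth."""
    height, width = len(depth), len(depth[0])
    return [[max(depth[y][x]
                 for y in range(ty, min(ty + tile, height))
                 for x in range(tx, min(tx + tile, width)))
             for tx in range(0, width, tile)]
            for ty in range(0, height, tile)]


def tile_may_be_visible(hiz, tile_x, tile_y, nearest_incoming_z):
    """Conservative test: if even the nearest incoming depth is not closer than
    the farthest stored depth, every pixel in the tile would fail the depth test."""
    return nearest_incoming_z < hiz[tile_y][tile_x]
```

When the function returns False, the per-pixel depth values for that tile never need to be fetched.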
Modern GPU
More memory, processing units
More floating point formats, fewer usage restrictions
More render targets (8)
Longer shaders
New data structures (e.g. texture arrays)
Better MSAA and anisotropic filtering support
Beyond Pipelining
Multi-processor
Solution to memory and power walls
Pipelining: multiple stages happening at once
Parallelism: many things happening in the same stage
Limit of pipelining:
Small number of pipeline steps
Some steps are much more compute intensive
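The limit of pipelining can be shown with a little arithmetic: once the pipeline is full, throughput is set by the slowest stage, so adding parallel units to that stage is what helps. The stage times used in the usage note are made-up numbers for illustration, and pipeline_throughput is a hypothetical helper.

```python
def pipeline_throughput(stage_times, units_per_stage=None):
    """Items completed per unit time in steady state.

    stage_times: time one unit needs to process one item in each stage.
    units_per_stage: number of parallel units in each stage (default 1 each).
    """
    if units_per_stage is None:
        units_per_stage = [1] * len(stage_times)
    # With n units working in parallel, a stage finishes one item every t / n.
    effective = [t / n for t, n in zip(stage_times, units_per_stage)]
    # The slowest stage is the bottleneck for the whole pipeline.
    return 1.0 / max(effective)
```

For stage times of 1, 1, 8 and 1, the pipeline delivers one item every 8 time units no matter how fast the other stages are; giving the third stage 8 parallel units brings it back to one item per time unit.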
Parallelism
Parallelism examples:
All components of float4 at the same time
Multiple vertices at the same time
Multiple triangles at the same time
SIMD
e.g. GPU ALU
Shared instruction store and control
Compact and less expensive
Efficient with no loops or branches
Problem with unused processing cycles
Unfilled quads are inefficient
Solution: avoid small or skinny triangles (PS3)
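The cost of unfilled quads can be estimated from pixel coverage alone. This sketch assumes pixels are shaded in aligned 2x2 quads and that a quad costs four lanes whenever at least one of its pixels is covered; quad_utilisation is a hypothetical helper.

```python
def quad_utilisation(covered_pixels):
    """Fraction of shaded lanes that hold covered pixels (1.0 = no waste)."""
    covered = set(covered_pixels)
    if not covered:
        return 0.0
    # Each pixel belongs to the aligned 2x2 quad at (x // 2, y // 2).
    quads = {(x // 2, y // 2) for x, y in covered}
    return len(covered) / (4 * len(quads))
```

A filled 4x4 block uses every lane, a one-pixel-wide sliver uses at most half of them, and a single-pixel triangle uses a quarter, which is why small or skinny triangles are best avoided.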
SIMT
Still SIMD. Shared code between threads.
Process groups of primitives (e.g. 48 quads) in each thread
Latency hiding:
One thread stalls on a texture fetch
Other threads continue execution
Especially important due to the memory wall
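Latency hiding can be illustrated with a toy cycle-level simulation. The model is an assumption made for this sketch: each thread does a fixed number of ALU cycles, then waits a fixed number of cycles for a texture fetch, and a fixed-priority scheduler runs the first ready thread each cycle. The cycle counts in the usage note are illustrative, not measurements.

```python
def alu_utilisation(num_threads, alu_cycles, fetch_latency, total_cycles):
    """Fraction of cycles in which the shared ALU does useful work."""
    ready_at = [0] * num_threads            # cycle at which each thread may run again
    work_left = [alu_cycles] * num_threads  # ALU cycles before the next fetch
    busy = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                work_left[t] -= 1
                busy += 1
                if work_left[t] == 0:
                    # Issue the texture fetch; the thread stalls until it returns.
                    ready_at[t] = cycle + 1 + fetch_latency
                    work_left[t] = alu_cycles
                break  # only one thread can use the ALU per cycle
    return busy / total_cycles
```

With 4 ALU cycles per 12-cycle fetch, one thread keeps the ALU busy a quarter of the time, two threads half of the time, and four threads cover the latency completely.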
SIMT
When branching:
Only evaluate one branch if all primitives take that branch
Must evaluate both branches and mask the results if not all primitives take the same branch
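The branching rule above can be mimicked in a few lines. This is a simplified model: lanes are list elements, each executed branch counts as one pass over the whole group, and masked-off lanes simply keep their previous value. The name simt_branch is hypothetical.

```python
def simt_branch(values, condition, then_fn, else_fn):
    """Run an if/else across a group of lanes that share one instruction stream.

    Returns (results, passes), where passes is the number of branch bodies
    the whole group had to step through.
    """
    mask = [condition(v) for v in values]
    results = list(values)
    passes = 0
    if any(mask):
        # At least one lane takes the "then" side: the group executes it,
        # and lanes with a false mask discard the result.
        passes += 1
        results = [then_fn(v) if m else r
                   for v, m, r in zip(values, mask, results)]
    if not all(mask):
        # At least one lane takes the "else" side: the group executes it too.
        passes += 1
        results = [else_fn(v) if not m else r
                   for v, m, r in zip(values, mask, results)]
    return results, passes
```

A coherent group pays for one branch body; a divergent group pays for both, even though each lane keeps only one of the two results.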
MIMD
e.g. Multi-core CPUs, Cell SPEs, Larrabee
Different code stores and controls for different processors
More complex hardware
More expensive
Synchronization issues
Can handle more complex data structures and algorithms
MIMD
Cell SPEs
SPEs
Local memory store
Shared memory accessed via DMA
Ring bus
PS3
RSX
Traditional GPU (z-buffer, ROP)
SIMD data structures and processing (arrays)
Micro triangle removal
Skinning
Post-FX
Lighting
Mostly rely on SIMD-friendly data structures
Larrabee
Many general purpose CPU cores
Coherent memory access from cores
Very few fixed-function units (e.g. texture)
Most graphics pipeline components are programmable
Programming
More MIMD
More synchronization and data buffering issues
More attention to latency hiding
Lighting
Non-uniform representations
Rasterization
Object-parallel rasterization
Ray-casting
Implicit surfaces (e.g. metaballs, level sets, CSG)
Direct volume rendering
Questions?