
Introduction

CUDA stands for Compute Unified Device Architecture. It is a new hardware and software architecture for issuing and managing computations on the GPU (graphics processing unit) as a data-parallel computing device, without the need to map them to a graphics API.

The GPU is specialized for compute-intensive, highly parallel computation and, contrary to CPUs, more transistors are devoted to data processing rather than to data caching and flow control.

Background
In just a few years, GPUs have undergone a remarkable evolution.

Problems:
- the GPU could only be programmed through a graphics API;
- the GPU DRAM could be read in a general way, but could not be written in a general way;
- some applications were bottlenecked by the DRAM memory bandwidth.

Hardware and software

The NVIDIA CUDA SDK has been designed for running parallel computations on the device hardware: it consists of a compiler, host and device runtime libraries, and a driver API.

The CUDA software stack is composed of several layers:
- a hardware driver;
- an API and its runtime;
- two higher-level mathematical libraries of common usage.

Access memory

CUDA provides general DRAM memory addressing both for scatter and gather
memory operations, just like on a CPU.

[Figure: gather vs. scatter memory operations]
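
A minimal sketch of the two access patterns (kernel names and the index array are illustrative assumptions, not from the slides):

  // Gather: each thread reads from an arbitrary, data-dependent address.
  __global__ void gatherKernel(float *out, const float *in, const int *idx, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = in[idx[i]];
  }

  // Scatter: each thread writes to an arbitrary, data-dependent address.
  __global__ void scatterKernel(float *out, const float *in, const int *idx, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[idx[i]] = in[i];
  }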

Shared Memory

In addition, the device features a parallel data cache (on-chip shared memory) with very fast general read and write access, used by threads to share data with each other.

[Figure: memory traffic without shared memory vs. with shared memory]
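
A minimal sketch of how a block might use this on-chip memory (kernel name and block size are assumptions):

  __global__ void blockSum(const float *in, float *blockResults)
  {
      __shared__ float tile[256];                  // assumes blockDim.x == 256

      // Each thread stages one element from DRAM into fast shared memory.
      tile[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
      __syncthreads();                             // all writes visible to the whole block

      // The data can now be re-read many times without touching DRAM again.
      if (threadIdx.x == 0) {
          float sum = 0.0f;
          for (int j = 0; j < 256; ++j)
              sum += tile[j];
          blockResults[blockIdx.x] = sum;
      }
  }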

Programming Model & Hardware Implementation

The GPU can be viewed as a coprocessor to the main CPU, called the host.

A kernel is a function that is executed on the graphics device as many different threads.

A thread block is a batch of threads that cooperate by sharing data through some fast shared memory and by synchronizing their execution to coordinate memory accesses.

Each thread is identified by a thread ID that can be:
- a number in ascending order,
- a 2- (or 3-)component index composed from the (x, y) position of the thread inside the block.
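
As a sketch (kernel name, sizes and launch configuration are illustrative assumptions), a kernel is declared on the device and each thread uses its ID to pick its own data element:

  __global__ void scaleKernel(float *data, float factor)
  {
      // 1D thread ID inside the block, combined with the block index;
      // a 2D ID would be composed as threadIdx.x + threadIdx.y * blockDim.x.
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      data[i] *= factor;
  }

  // Host side: the kernel is launched as a grid of thread blocks, e.g.
  // scaleKernel<<<numBlocks, threadsPerBlock>>>(devPtr, 2.0f);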

Programming Model & Hardware Implementation

Threads in different blocks of the same grid cannot communicate or synchronize with each other.

A thread that executes on the device has access only to the device's DRAM and on-chip memory, through different memory spaces.

The global, constant and texture memory spaces can be read from or written to by the host and are persistent across kernel launches by the same application.
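
A minimal sketch of host access to the constant and global memory spaces (variable and kernel names are assumptions):

  __constant__ float coeff[16];             // constant memory space, written by the host

  __global__ void applyKernel(float *buf)   // buf points into global memory (device DRAM)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      buf[i] *= coeff[i % 16];
  }

  // Host side:
  //   cudaMemcpyToSymbol(coeff, hostCoeff, sizeof(hostCoeff));      // host writes constant memory
  //   cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);   // host writes global memory
  //   applyKernel<<<blocks, threads>>>(devBuf);
  //   applyKernel<<<blocks, threads>>>(devBuf);   // devBuf persists between the two launches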

Programming Model & Hardware Implementation

The GPU device is implemented as a set of multiprocessors, each one having a SIMD architecture and on-chip memory of the four following types:
- one set of local 32-bit registers per processor;
- a parallel data cache, shared by all the processors;
- a read-only constant cache, shared by all the processors;
- a read-only texture cache, shared by all the processors.

Application Programming Interface

The CUDA programming interface consists of:
- a set of extensions to the C language;
- a runtime library with:
  - a host component that runs on the host and provides functions to control and access one or more compute devices from the host;
  - a device component that runs on the device and provides device-specific functions;
  - a common component that provides built-in vector types and a subset of the C standard library supported in both host and device code.

The fundamental extensions of the C language made by CUDA are:
- function type qualifiers (__device__, __global__, __host__);
- variable type qualifiers (__device__, __constant__, __shared__);
- synchronization (__syncthreads()).

When some threads within a block access the same addresses in shared or global memory, there are potential RAW, WAR or WAW hazards for some of these memory accesses, but they can be avoided by synchronizing threads in between these accesses.
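
A minimal sketch that combines these extensions (function names and sizes are assumptions): a __device__ helper, a __global__ kernel, a __shared__ array, and __syncthreads() separating the shared-memory writes from the reads to avoid the hazard described above.

  __device__ float squared(float x) { return x * x; }

  __global__ void reverseSquares(float *data)
  {
      __shared__ float tile[128];                  // assumes 128 threads per block

      tile[threadIdx.x] = squared(data[threadIdx.x]);
      __syncthreads();                             // all writes complete before any thread reads

      data[threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
  }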

Optimization guidelines

- Minimizing the use of instructions with low throughput;
- Maximizing the use of the available memory bandwidth for each category of memory;
- Allowing the thread scheduler to overlap memory transactions with mathematical computations as much as possible.

To achieve high memory bandwidth, shared memory is divided into equally-sized memory modules called banks. Bank conflicts may arise!

Example:
  __shared__ float shared[32];
  float data = shared[BaseIndex + s * tid];
where:
  s → stride (number of memory locations between successive elements)
  m → total number of banks
  d → highest common factor of s and m
Threads tid and tid+n are in conflict if n is a multiple of m/d.

[Figure: access without bank conflicts vs. an 8-way bank conflict]
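
A sketch of the stride rule above (kernel name and sizes are assumptions, and m = 16 banks is assumed, as on early CUDA devices): a stride s = 1 gives d = 1 and no conflict, while s = 8 gives d = 8, so thread IDs that differ by a multiple of m/d = 2 hit the same bank, producing an 8-way conflict.

  __global__ void stridedAccess(float *out, int s)
  {
      __shared__ float shared[512];                // assumes blockDim.x * s <= 512
      int tid = threadIdx.x;

      shared[tid * s] = (float)tid;                // conflict-free only if gcd(s, 16) == 1
      __syncthreads();

      out[tid] = shared[tid * s];
  }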

Bibliography

- NVIDIA CUDA Programming Guide
  (http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf)
- Towards three-dimensional teraflop CFD computing on a desktop PC using graphics hardware
  (www.irmb.tu-bs.de/UPLOADS/toelke/Publication/toelkeD3Q13.pdf)
- The CUDA compiler driver NVCC
  (www.soa-world.de/echelon/wp-content/uploads/2007/11/nvcc_10.pdf)
- en.wikipedia.org/wiki/CUDA
- en.wikipedia.org/wiki/GPGPU
- www.gpgpu.org/developer/

