
Introduction

CUDA stands for Compute Unified Device Architecture. It is a new hardware and software architecture for issuing and managing computations on the GPU (graphics processing unit) as a data-parallel computing device, without the need to map them to a graphics API.

The GPU is specialized for compute-intensive, highly parallel computation and, contrary to CPUs, more transistors are devoted to data processing rather than to data caching and flow control.

Background
In just a few years, GPUs have undergone a remarkable evolution.

Problems:
- the GPU could only be programmed through a graphics API;
- the GPU DRAM could be read in a general way, but could not be written in a general way;
- some applications were bottlenecked by the DRAM memory bandwidth.

Hardware and software

The NVIDIA CUDA SDK has been designed for running parallel computations on the device hardware: it consists of a compiler, host and device runtime libraries, and a driver API.

The CUDA software stack is composed of several layers:
- a hardware driver;
- an API and its runtime;
- two higher-level mathematical libraries of common usage.

Access memory

CUDA provides general DRAM memory addressing both for scatter and gather
memory operations, just like on a CPU.

[Figure: gather vs. scatter memory operations]
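
A minimal sketch of the two access patterns (kernel names and the index array are illustrative assumptions, not from the slides):

  // Gather: each thread reads from an arbitrary, data-dependent address.
  __global__ void gatherKernel(float *out, const float *in, const int *idx, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = in[idx[i]];
  }

  // Scatter: each thread writes to an arbitrary, data-dependent address.
  __global__ void scatterKernel(float *out, const float *in, const int *idx, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[idx[i]] = in[i];
  }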

Shared Memory

In addition, the device features a parallel data cache (on-chip shared memory) with very fast general read and write access, used by threads to share data with each other.

[Figure: memory traffic without shared memory vs. with shared memory]
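
A minimal sketch of how a block might use this on-chip memory (kernel name and block size are assumptions):

  __global__ void blockSum(const float *in, float *blockResults)
  {
      __shared__ float tile[256];                  // assumes blockDim.x == 256

      // Each thread stages one element from DRAM into fast shared memory.
      tile[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
      __syncthreads();                             // all writes visible to the whole block

      // The data can now be re-read many times without touching DRAM again.
      if (threadIdx.x == 0) {
          float sum = 0.0f;
          for (int j = 0; j < 256; ++j)
              sum += tile[j];
          blockResults[blockIdx.x] = sum;
      }
  }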

Programming Model & Hardware Implementation

The GPU can be viewed as a coprocessor to the main CPU, called the host.

A kernel is a function that is executed on the graphics device as many different threads.

A thread block is a batch of threads that cooperate by sharing data through some fast shared memory and by synchronizing their execution to coordinate memory accesses.

Each thread is identified by a thread ID that can be:
- a number in ascending order,
- a 2- (or 3-)component index composed from the (x, y) position of the thread inside the block.
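
As a sketch (kernel name, sizes and launch configuration are illustrative assumptions), a kernel is declared on the device and each thread uses its ID to pick its own data element:

  __global__ void scaleKernel(float *data, float factor)
  {
      // 1D thread ID inside the block, combined with the block index;
      // a 2D ID would be composed as threadIdx.x + threadIdx.y * blockDim.x.
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      data[i] *= factor;
  }

  // Host side: the kernel is launched as a grid of thread blocks, e.g.
  // scaleKernel<<<numBlocks, threadsPerBlock>>>(devPtr, 2.0f);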

Programming Model & Hardware Implementation

Threads in different blocks of the same grid cannot communicate or synchronize with each other.

A thread that executes on the device has access only to the device's DRAM and on-chip memory, through different memory spaces.

The global, constant and texture memory spaces can be read from or written to by the host and are persistent across kernel launches by the same application.
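
A minimal sketch of host access to the constant and global memory spaces (variable and kernel names are assumptions):

  __constant__ float coeff[16];             // constant memory space, written by the host

  __global__ void applyKernel(float *buf)   // buf points into global memory (device DRAM)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      buf[i] *= coeff[i % 16];
  }

  // Host side:
  //   cudaMemcpyToSymbol(coeff, hostCoeff, sizeof(hostCoeff));      // host writes constant memory
  //   cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);   // host writes global memory
  //   applyKernel<<<blocks, threads>>>(devBuf);
  //   applyKernel<<<blocks, threads>>>(devBuf);   // devBuf persists between the two launches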

Programming Model & Hardware Implementation

The GPU device is implemented as a set of multiprocessors, each one having a SIMD architecture and on-chip memory of the four following types:
- one set of local 32-bit registers per processor;
- a parallel data cache, shared by all the processors;
- a read-only constant cache, shared by all the processors;
- a read-only texture cache, shared by all the processors.

Application Programming Interface

The CUDA programming interface consists of:
- a set of extensions to the C language;
- a runtime library with:
  - a host component that runs on the host and provides functions to control and access one or more compute devices from the host;
  - a device component that runs on the device and provides device-specific functions;
  - a common component that provides built-in vector types and a subset of the C standard library supported in both host and device code.

The fundamental extensions of the C language made by CUDA are:
- function type qualifiers (__device__, __global__, __host__);
- variable type qualifiers (__device__, __constant__, __shared__);
- synchronization (__syncthreads()).

When some threads within a block access the same addresses in shared or global memory, there are potential RAW, WAR or WAW hazards for some of these memory accesses, but they can be avoided by synchronizing threads in between these accesses.
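
A minimal sketch that combines these extensions (function names and sizes are assumptions): a __device__ helper, a __global__ kernel, a __shared__ array, and __syncthreads() separating the shared-memory writes from the reads to avoid the hazard described above.

  __device__ float squared(float x) { return x * x; }

  __global__ void reverseSquares(float *data)
  {
      __shared__ float tile[128];                  // assumes 128 threads per block

      tile[threadIdx.x] = squared(data[threadIdx.x]);
      __syncthreads();                             // all writes complete before any thread reads

      data[threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
  }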

Optimization guidelines

- Minimizing the use of instructions with low throughput;
- Maximizing the use of the available memory bandwidth for each category of memory;
- Allowing the thread scheduler to overlap memory transactions with mathematical computations as much as possible.

To achieve high memory bandwidth, shared memory is divided into equally-sized memory modules called banks. Bank conflicts may arise!

Example:
  __shared__ float shared[32];
  float data = shared[BaseIndex + s * tid];
where:
  s → stride (number of memory locations between successive elements)
  m → total number of banks
  d → highest common factor of s and m
Threads tid and tid+n are in conflict if n is a multiple of m/d.

[Figure: access without bank conflicts vs. an 8-way bank conflict]
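
A sketch of the stride rule above (kernel name and sizes are assumptions, and m = 16 banks is assumed, as on early CUDA devices): a stride s = 1 gives d = 1 and no conflict, while s = 8 gives d = 8, so thread IDs that differ by a multiple of m/d = 2 hit the same bank, producing an 8-way conflict.

  __global__ void stridedAccess(float *out, int s)
  {
      __shared__ float shared[512];                // assumes blockDim.x * s <= 512
      int tid = threadIdx.x;

      shared[tid * s] = (float)tid;                // conflict-free only if gcd(s, 16) == 1
      __syncthreads();

      out[tid] = shared[tid * s];
  }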

Bibliography

- NVIDIA CUDA Programming Guide
  (http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf)
- Towards three-dimensional teraflop CFD computing on a desktop PC using graphics hardware
  (www.irmb.tu-bs.de/UPLOADS/toelke/Publication/toelkeD3Q13.pdf)
- The CUDA compiler driver NVCC
  (www.soa-world.de/echelon/wp-content/uploads/2007/11/nvcc_10.pdf)
- en.wikipedia.org/wiki/CUDA
- en.wikipedia.org/wiki/GPGPU
- www.gpgpu.org/developer/

