
INB375/INN600

Parallel Computing
Teaching Staff
Unit Coordinator: Dr Wayne Kelly
w.kelly@qut.edu.au
3138 9336
S Block, Level 10
Room 1011 (enter via 1013)

Lecturer: Andrew Sorensen
a.sorensen@qut.edu.au
3138 0452
P Block, Level 8
Room 803

INB375/INN600 Parallel Computing 2


The Free Lunch is Over
The Free Lunch Is Over: A Fundamental Turn Toward Concurrency
in Software, Herb Sutter, Dr. Dobb's Journal, 30(3), March 2005
http://www.gotw.ca/publications/concurrency-ddj.htm
Processor manufacturers have run out of room with most of their
traditional approaches to boosting CPU performance
Instead of driving clock speeds and straight-line instruction throughput
ever higher, they are instead turning en masse to hyperthreading and
multicore architectures
That puts us at a fundamental turning point in software development.
The article covers the changing face of hardware and why it suddenly
matters to software ("Andy giveth, and Bill taketh away").
Specifically, the concurrency revolution matters to you and is going to
change the way you will likely be writing software in the future.

INB375/INN600 Parallel Computing 3


Prerequisites
Strong programming skills in C++ or Java or C#

INB375/INN600 Parallel Computing 4


Assessment
1. Parallelization Assignment
Worth: 50%
Due: End of week 12
Teams: Individually or with a partner
Select an application and parallelize it to run on a parallel computer
2. Oral Presentation and Participation
Worth: 20%
Due: Rostered throughout semester (weeks 4-13)
Teams: Individually (or in a team for some topics)
Research a parallel programming language and present to class
3. Final Exam
Worth: 30%
Due: During exam period

INB375/INN600 Parallel Computing 5


Student Presentation Topics
Native Threads
  POSIX Threads
  Win32 Native Threads
  Intel Thread Building Blocks
Managed Threads
  Java Threads
  .NET Threads
  Microsoft Task Parallel Library
  PLINQ
Shared Memory
  OpenMP
Distributed Memory
  MPI
DARPA
  Fortress
  Chapel
  X10
Tools
  Intel VTune
  Intel Parallel Studio
  Visual Studio 2010 Profiler
Map Reduce
  Hadoop
FORTRAN
  Co-array Fortran
  High Performance Fortran
  FX
Functional
  Erlang
  Sisal
  NESL
Communicating Processes
  Occam
  Linda
  Scala
Parallel C languages
  ZPL
  Cilk
  Unified Parallel C
GPU
  CUDA
  OpenCL
Concurrent
  Go
or suggest your own favourite parallel language!

INB375/INN600 Parallel Computing 6
Von Neumann Architecture
[Diagram: Memory (data and instructions) connected by the system bus to input and output devices and to the Central Processing Unit (CPU). The CPU comprises the Control Unit (CU), which holds the Program Counter (PC) and Instruction Register (IR), and the Arithmetic Logic Unit (ALU).]
INB375/INN600 Parallel Computing 7


The Fetch, Decode, Execute Cycle
1. Fetch
Next instruction (at PC, then PC++) from memory into the Instruction Register (IR)
2. Decode
Work out what to do based on the opcode
3 types of operations:
Load (memory -> register), Store (register -> memory)
Arithmetic (register + register -> register)
(conditional) Branch (address -> PC)
3. Execute
ALU performs arithmetic operations
Bus/memory module performs Load/Store operations
(A toy sketch of this cycle follows below.)
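
As an illustration not from the original slides, here is a minimal C++ sketch of the fetch-decode-execute loop for a hypothetical toy instruction set; the opcodes, encoding and program are invented for this example:

    #include <cstdio>

    enum Op { LOAD, STORE, ADD, BRANCH_IF_ZERO, HALT };
    struct Instr { Op op; int a, b, c; };   // operands: register numbers or addresses

    int main() {
        int mem[8] = {5, 0};                // mem[0] = input, mem[1] = result
        int reg[4] = {0};
        // Program: r0 = mem[0]; r1 = r0 + r0; mem[1] = r1; halt.
        Instr prog[] = {
            {LOAD,  0, 0, 0},               // r0 <- mem[0]
            {ADD,   1, 0, 0},               // r1 <- r0 + r0
            {STORE, 1, 1, 0},               // mem[1] <- r1
            {HALT,  0, 0, 0},
        };
        for (int pc = 0; ; ) {
            Instr ir = prog[pc++];          // 1. fetch (and increment the PC)
            switch (ir.op) {                // 2. decode the opcode
                case LOAD:  reg[ir.a] = mem[ir.b]; break;             // 3. execute
                case STORE: mem[ir.b] = reg[ir.a]; break;
                case ADD:   reg[ir.a] = reg[ir.b] + reg[ir.c]; break;
                case BRANCH_IF_ZERO: if (reg[ir.a] == 0) pc = ir.b; break;
                case HALT:  std::printf("mem[1] = %d\n", mem[1]); return 0;
            }
        }
    }

A real control unit does the same thing in hardware, with the decode step selecting which functional unit acts.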

INB375/INN600 Parallel Computing 8


CPU Clocking
The CPU is controlled by a periodic clock tick.
The clock period needs to be long enough for:
all electrical signals to propagate across the circuit
at the speed of light (about 30 cm in 1 nanosecond, i.e. one clock period at 1 GHz)
transistors to settle to a value of 0 or 1
time proportional to the size of the transistor.
If the clock rate is too high (over-clocking):
incorrect results will be computed
the system may overheat,
since heat dissipated is proportional to clock speed (see the note below).
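
As an aside not on the original slide, the standard first-order model for dynamic CMOS power makes this proportionality explicit:

    \( P_{\mathrm{dynamic}} \approx \alpha \, C \, V^2 \, f \)

where \(\alpha\) is the switching activity factor, \(C\) the switched capacitance, \(V\) the supply voltage and \(f\) the clock frequency. Since higher frequencies typically also require higher supply voltages, heat in practice grows faster than linearly with clock speed.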

INB375/INN600 Parallel Computing 9


The Von Neumann Bottleneck
The system bus (and memory controllers) are similarly controlled by clocks, but typically at a much slower rate.
Memory Load and Store operations take more than 1 CPU cycle to complete and so can cause the CPU to stall.
For example, a 3 GHz core waiting roughly 60 ns for main memory stalls for around 180 cycles.
For memory access, need to consider both:
Maximum throughput (data per second)
Minimum latency (time from request to response).

INB375/INN600 Parallel Computing 10


Caching
Store frequently used data in a smaller, specialized block of memory that is faster and closer to the CPU:
better throughput and latency
more expensive to build, so much smaller than main memory
Caching algorithm:
When data is first loaded from memory, store it in the cache.
Load in an entire line of data items (say 64 bytes), rather than just the item needed. Locality of reference means the others will hopefully be needed soon (see the sketch below).
When a memory item is needed, check to see if it is already in the cache.
If the cache is full, then need to decide which items to swap out, e.g. Least Recently Used (LRU).
Placement policy can be either:
Fully associative: a line can be stored anywhere in the cache.
Direct mapped: a line can be stored in only one place in the cache.
N-way set associative: a line can be stored in one of N places.
Can have multiple levels of cache (e.g. L1, L2, L3)
L1 smaller, faster and closer than L2, etc.
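
As an illustration of locality of reference (not from the original slides), this C++ sketch sums the same matrix twice: row by row, so every loaded 64-byte cache line is fully used, and column by column, which touches a new line on almost every access. On typical hardware the second traversal is several times slower; the matrix size is arbitrary.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 4096;
        std::vector<int> m(N * N, 1);       // 64 MB: far bigger than any cache
        long long sum = 0;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i)         // row-major: consecutive addresses
            for (int j = 0; j < N; ++j) sum += m[i * N + j];
        auto t1 = std::chrono::steady_clock::now();
        for (int j = 0; j < N; ++j)         // column-major: a new line per access
            for (int i = 0; i < N; ++i) sum += m[i * N + j];
        auto t2 = std::chrono::steady_clock::now();

        std::printf("sum=%lld row: %.3fs col: %.3fs\n", sum,
                    std::chrono::duration<double>(t1 - t0).count(),
                    std::chrono::duration<double>(t2 - t1).count());
    }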
INB375/INN600 Parallel Computing 11
SuperScalar Processors
Execute more than one instruction at the same time (Instruction Level Parallelism, ILP).
Have multiple functional units (ALUs, FPUs, SSE, etc.).
Typically, instruction execution is pipelined:
each instruction takes multiple cycles to complete
but multiple instructions can be in the pipeline at the same time.
To facilitate superscalar execution, the processor hardware needs to determine dependencies between instructions, to work out which instructions can be overlapped:
this is relatively easy, as all instructions operate on explicit registers
it results in so-called out-of-order execution (see the sketch below).
To keep the pipeline full, need to fetch future instructions:
if future instructions include branches, need branch prediction and/or speculative execution.
All of this improves processor performance, but results in a more complex processor requiring more chip area, more power and longer cycle times.
There is also a trade-off between Complex Instruction Set (CISC) and Reduced Instruction Set (RISC) designs.
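
A minimal sketch, not from the slides, of why independent instructions matter: floating-point adds have multi-cycle latency, so a single accumulator forms a dependent chain the pipeline cannot overlap, while four independent accumulators let a superscalar core keep several adds in flight. The sizes are arbitrary, and the compiler is assumed not to reassociate the single FP chain (that would change rounding):

    #include <chrono>
    #include <cstdio>

    int main() {
        const int N = 1024, REPS = 1 << 18;
        static double v[N];
        for (int i = 0; i < N; ++i) v[i] = 1.0;   // small array: stays in L1 cache

        auto t0 = std::chrono::steady_clock::now();
        double s = 0.0;
        for (int r = 0; r < REPS; ++r)
            for (int i = 0; i < N; ++i) s += v[i];        // one dependent chain

        auto t1 = std::chrono::steady_clock::now();
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int r = 0; r < REPS; ++r)
            for (int i = 0; i < N; i += 4) {              // four independent chains
                s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
            }
        double s4 = s0 + s1 + s2 + s3;

        auto t2 = std::chrono::steady_clock::now();
        std::printf("%.0f %.0f  1-acc: %.2fs 4-acc: %.2fs\n", s, s4,
                    std::chrono::duration<double>(t1 - t0).count(),
                    std::chrono::duration<double>(t2 - t1).count());
    }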
INB375/INN600 Parallel Computing 12
Intel Core™ microarchitecture

INB375/INN600 Parallel Computing 13


Moore's Law
"The number of transistors on a chip will double
approximately every two years."
- Gordon Moore, co-founder of Intel, 1965.

INB375/INN600 Parallel Computing 14


Consequences of Moore's Law
As transistors and circuits get smaller, the clock rate can be increased, resulting in faster/more powerful processors.
Processor (clock) speeds have doubled approximately every 2 years, until recently.
Note, this was not what Moore originally claimed.
Greater chip density also allows more complex hardware optimizations, such as out-of-order execution and branch prediction, which have led to better overall performance.
Not just CPUs have benefited. Also:
camera megapixels
memory capacity
While capacity has increased exponentially, cost has stayed the same or decreased.
Linked to the pace of recent human development (also exponential).
Will we reach a technological singularity?

INB375/INN600 Parallel Computing 15


The End of Moore's Law?
So, at the current point in time:
transistor density continues to rise exponentially, and will do so for at least a few more years. Can this last forever?
Processor speeds are plateauing. Why?
Firstly, what are the limits to miniaturization?
Signals can't travel faster than the speed of light.
Can wires be thinner than one atom?
Problems with quantum tunnelling.
Possible solutions: 3D chips, new materials, quantum computing.
Why can't current processors get faster?
Heat generated is proportional to clock speed.
The current world's fastest processor (8.5 GHz) is cooled by liquid nitrogen.
If we extrapolate exponentially, we'll soon be hotter than the Sun.

INB375/INN600 Parallel Computing 16


What to do with that extra Silicon?
So if we can't increase the clock speed, but we have more room to play with on the chip due to miniaturization, how should we exploit it?
Allocate more of the chip to caching.
Create more complex instruction execution infrastructure.
But there are limits to the amount of useful Instruction Level Parallelism (ILP).
Place more than one CPU (core) on each chip!

INB375/INN600 Parallel Computing 17


Multicore Chips
More than one CPU (core) per chip.
How should the cores cooperate?
Each core could execute the same instruction stream, but with each operating on different data (as in a GPU).
But more commonly, each core executes an independent instruction stream.
How should the cores communicate with one another?
Could be via message passing (Network on Chip).
But more commonly via shared memory.

INB375/INN600 Parallel Computing 18


Shared Memory
[Diagram: Cores 1 to N, each with its own functional units, registers (including a PC) and control unit, all connected to a single shared memory.]

INB375/INN600 Parallel Computing 19


Multicore Caching
[Diagram: Cores 1 to N as before, but each core now has its own private cache; the private caches sit above a shared cache, which in turn sits above the shared memory.]
INB375/INN600 Parallel Computing 20
Cache Coherence
What happens when one core writes to its private cache?
The copies of that cache line in all other private caches need to be invalidated.
A technique called snooping is used to determine when cache lines need to be invalidated.
Caches can be inclusive or exclusive.
(The sketch below shows the cost of this coherence traffic.)
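
A minimal sketch, not from the slides, of coherence traffic in action: two threads update two different counters, but when the counters share a cache line each write invalidates that line in the other core's private cache ("false sharing"). Padding the counters onto separate lines removes the traffic; a 64-byte cache line is assumed:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct Plain  { std::atomic<long> a{0}, b{0}; };        // a and b share a cache line
    struct Padded { alignas(64) std::atomic<long> a{0};     // 64-byte alignment keeps
                    alignas(64) std::atomic<long> b{0}; };  // each counter on its own line

    template <typename Counters>
    double run() {
        Counters c;
        auto work = [](std::atomic<long>& x) {
            for (long i = 0; i < 10000000; ++i)
                x.fetch_add(1, std::memory_order_relaxed);
        };
        auto t0 = std::chrono::steady_clock::now();
        std::thread t1(work, std::ref(c.a)), t2(work, std::ref(c.b));
        t1.join(); t2.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        // Each thread updates a *different* counter, yet the shared-line version
        // is slower: every write invalidates the other core's cached copy.
        std::printf("same cache line:      %.3f s\n", run<Plain>());
        std::printf("separate cache lines: %.3f s\n", run<Padded>());
    }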

INB375/INN600 Parallel Computing 21


Software Abstractions
[Diagram: two processes; each process contains several threads, each thread with its own program counter (PC) and stack, and all threads within a process sharing that process's heap (shared address space).]


INB375/INN600 Parallel Computing 22
Mapping Software to Hardware
A Process consists of one or more Threads.
Each Thread executes an independent stream of instructions.
The Operating System schedules threads to CPU(s):
when a CPU becomes available, it decides which thread to execute next
threads run until they need to wait, or until their time slice expires
the OS performs context switches and manages virtual memory.
So, even a single core can handle multiple processes and threads.
Threads map well to multi-core shared memory hardware, as:
the shared heap can be stored in the shared memory
each core can execute a different thread (at the same time), as in the sketch below.
If there are more threads than cores, then time slicing is used.
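
A minimal C++ sketch (not from the slides) of the abstraction above: one process, several threads, each with its own stack and instruction stream, all reading and writing the same heap-allocated data:

    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        // Four threads in one process: each has its own stack and instruction
        // stream, but all see the same heap-allocated vector.
        std::vector<int> shared(4, 0);           // lives on the shared heap
        std::vector<std::thread> threads;
        for (int id = 0; id < 4; ++id)
            threads.emplace_back([&shared, id] { shared[id] = id * id; });
        for (auto& t : threads) t.join();        // wait for all threads to finish
        for (int v : shared) std::printf("%d ", v);   // prints: 0 1 4 9
    }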

INB375/INN600 Parallel Computing 23


Simultaneous MultiThreading
A CPU can run more than one thread at the same time (without a context switch).
Intel calls this HyperThreading.
Each CPU has a separate set of registers (including a Program Counter) for each thread that it can handle at the same time (typically just two).
The functional units are then shared between the currently running threads.

INB375/INN600 Parallel Computing 24


Multicore and Multithreading
Can have multiple cores, each of which can perform simultaneous multithreading.
E.g. Intel Core i7:
Quad Core with HyperThreading
4 cores x 2 threads per core = 8 virtual cores (see the sketch below)
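
For instance (an illustration, not from the slides), standard C++ can report the number of hardware threads the OS sees; on a quad-core HyperThreaded Core i7 this would typically print 8:

    #include <cstdio>
    #include <thread>

    int main() {
        // Number of hardware threads ("virtual cores"); may return 0 if unknown.
        unsigned n = std::thread::hardware_concurrency();
        std::printf("hardware threads: %u\n", n);
    }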

INB375/INN600 Parallel Computing 25


Intel Core i7

INB375/INN600 Parallel Computing 26


Reading Task
"Parallelism via Multithreaded and Multicore CPUs",
Sodan et al., IEEE Computer, March 2010.
(Available online at the QUT library.)

INB375/INN600 Parallel Computing 27


References, and Further Reading
The following Wikipedia pages are relevant:
http://en.wikipedia.org/wiki/Moores_law
http://en.wikipedia.org/wiki/CPU_cache
http://en.wikipedia.org/wiki/Multi-core
http://en.wikipedia.org/wiki/Hyper-threading
http://en.wikipedia.org/wiki/Instruction_level_parallelism

INB375/INN600 Parallel Computing 28


Homework
Find out about the processor in your home PC or laptop:
Model, family and manufacturer of processor.
CPU: How many cores? Hyper-Threaded? 64-bit or 32-bit?
Cache: How many levels? Shared? Size? Associativity?
Clock speed: CPU, FSB, memory controller?
Over-clocking (turbo boost)?
Memory: number of channels, bandwidth, latency?
Any other cool facts?
What tools did you use to find out?

Be prepared to report back to class next week.

INB375/INN600 Parallel Computing 29
