
Parallel Architecture and Systems

January 2012
Kenichi Miura, Ph.D.
Professor, National Institute of Informatics
Fellow, Fujitsu Laboratories Limited

Course Outline
1. Introduction (What is parallel processing? Why is it needed?)
2. HPC Architecture
   2.1 History of Supercomputers and Trends
   2.2 Classification of Parallel Architecture (from CPU to System)
   2.3 Memory Architecture (Shared, Distributed)
3. Computational Models
4. Parallel Algorithms
   4.1 Serial vs. Parallel Algorithms
   4.2 Hardware Realization and Examples of Special-Purpose Processors
5. Parallel Programming Languages
   5.1 Relations between Parallel Languages and Architecture
   5.2 Parallel Language for Shared-Memory Architecture: OpenMP
   5.3 Parallel Languages for Distributed-Memory Architecture: Message Passing Interface (MPI)
6. Application Areas for Large-Scale Scientific Computations/Simulations
7. Grid Computing and Cloud

There are THREE things which are inevitable in this world: Death, Taxes, and Parallelism!
John Riganati, Sarnoff Lab

Origin of Digital Computers


Digital computers were developed during and soon after World War II for military purposes.

USA: ENIAC (ballistic tables); EDVAC, JOHNNIAC, MANIAC (nuclear weapon development)
U.K.: Colossus (code breaking)

ENIAC (c. 1946)

Colossus (UK, 1944)

Source: K. Miura, at Bletchley Park

Alan M. Turing (1912-1954)

John von Neumann (1903-1957)

D. H. Lehmer's Criticism

"This (ENIAC) was a highly parallel machine, until von Neumann spoiled it."
D. H. Lehmer, "A History of the Sieve Process," in N. Metropolis et al. (eds.), A History of Computing in the Twentieth Century (1980)

Von Neumann Model

[Diagram: Input Device and Output Device attached to the Central Processing Unit (CPU); the Memory (MSU) holds both Program and Data.]

Von Neumann Bottleneck

[Diagram: same structure as the von Neumann model, with a Program Counter inside the CPU. The single path between the CPU and the Memory (MSU), over which both instructions and data must travel, is the bottleneck.]

Performance Trends of Supercomputers in USA (plus Fujitsu)


[Chart: peak performance (Flop/s), on a logarithmic scale from kilo through mega and giga to tera, versus year (1950-2000). Machines shown include ENIAC, UNIVAC, Mark I, IBM 704, CDC 1604, CDC 6600, CDC 7600, STRETCH, LARC, ILLIAC IV, CRAY-1, CRAY X-MP, CRAY Y-MP, CRAY-2, CRAY C90, CRAY T90, 230-75/APU, VP200, VP400, VP2000, NWT, VPP800, VPP5000, ASCI Red, ASCI Blue, and ASCI White. Performance grew by about 10^10 over 50 years, i.e. roughly 1.58x per year.]
Source: L. Smarr
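As a quick consistency check on the growth rate quoted in the chart (my arithmetic, not from the slides): an improvement of 10^10 over 50 years corresponds to an annual factor r with

    r^{50} = 10^{10} \;\Rightarrow\; r = 10^{10/50} = 10^{0.2} \approx 1.58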

What is Parallel Processing? Why necessary?


To solve problems in a shorter time (requirement on CPU performance)
To solve larger problems (requirement on memory capacity)
Issues:
- Is it easy to use?
- How reliable is the system?
- Can software keep up with parallelism?
  - Correctness of programs
  - Scalability of programs
  - Ease of programming
  - Portability of programs across multiple parallel systems

Parallel Architecture - Discussion Items
- Classification of Hardware Parallelism: Processor Design (Vector, Scalar); Memory Design (Shared, Distributed)
- Computational Models and Parallelism: Correspondence with Systems (Vector vs. Scalar Parallel)
- Applications and Parallelism (SIMD, MIMD, SPMD): Data Parallel vs. Control Parallel
- Numerical Algorithms and Parallelism

How to make computers run faster (1)


Higher Clock Frequency
- Miniaturization of circuits and higher levels of integration (Moore's Law)
- Shorter wire lengths within and across chips
- Efficient cooling technology to remove heat

Generation of Computer Hardware


1st Gen.: Vacuum Tubes
2nd Gen.: Discrete Transistors
3rd Gen.: Integrated Circuits (IC)
4th Gen.: Large-Scale Integration (LSI, VLSI, ULSI, etc.)

Moore's Law
The number of transistors that can be mounted on a chip doubles every 18 to 24 months.
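Written as a formula (a sketch of the usual statement, with a doubling period T of roughly 1.5 to 2 years):

    N(t) = N(0) \cdot 2^{\,t/T}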

Linpack Temperature Measurements (1)


Configuration:
- x86 Linux cluster with Myrinet interconnect
Measurements (in degrees Celsius):
- Microprocessor and motherboard temperatures, measured at six locations (see next slide)
[Chart: temperature (Celsius) over time during a Linpack run, with the point at which the computation terminated marked.]
Source: D. Reed, Univ. of North Carolina

Linpack Temperature Measurements (2)


Thermal dynamics matter:
- Reliability and fault management
- Power and economics
Why? The Arrhenius equation describes how temperature affects failure rates: the mean time to catastrophic failure of commercial silicon roughly halves for every 10 C above 70 C.
[Chart: temperature (Celsius) over time during a Linpack run, with the point at which the computation terminated marked.]
Source: D. Reed, Univ. of North Carolina
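A minimal sketch of where this rule of thumb comes from (my derivation, not from the slides; E_a is an activation energy, k_B the Boltzmann constant, T the absolute temperature): the Arrhenius form gives

    \mathrm{MTTF}(T) \propto \exp\!\left(\frac{E_a}{k_B T}\right),
    \qquad
    \frac{\mathrm{MTTF}(T_1)}{\mathrm{MTTF}(T_2)} = \exp\!\left[\frac{E_a}{k_B}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)\right]

For an assumed activation energy of about 0.7 eV, going from 70 C (343 K) to 80 C (353 K) gives a ratio of roughly exp(0.67) ≈ 2, i.e. the factor of 2 per 10 C quoted above.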

CPU Power Trend

Moore's Second Law

Times are changing

Source: Burton Smith, Microsoft

How to make computers run faster (2)


Pipelining
- Multiple operations are executed concurrently, offset in time (example: an automobile assembly line); see the timing sketch after this list.
- Instruction stream (fetch, decode, issue, ...)
- Data stream (segmented arithmetic units, chaining, ...)
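A rough timing sketch (my notation, not from the slides): with an s-stage pipeline, stage time \tau, and n independent operations,

    T_{\mathrm{serial}} = n \, s \, \tau,
    \qquad
    T_{\mathrm{pipelined}} = (s + n - 1)\,\tau,
    \qquad
    \mathrm{speedup} = \frac{n s}{s + n - 1} \;\to\; s \quad (n \to \infty)

so the asymptotic speedup equals the pipeline depth, provided the pipeline can be kept full.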

Cray-1 (1976): 160 Mflop/s

Seymour Cray

Cray-1
Sourced from http://www.thocp.net/hardware/cray_1.htm

CRAY 1 Architecture

VP100/200/400 Architecture

How to make computers run faster (3)


Parallelism
- Multiple operations are executed simultaneously on physically replicated hardware resources (example: ticket gates at train stations)
- Instruction stream (superscalar, VLIW, MTA, ...)
- Data stream (multiple arithmetic units, striped arithmetic units, ...)
- System architecture (SIMD, MIMD, ccNUMA, clusters, MPP, ...)

Analogy of Parallel Processing

ILLIAC IV:
The first SIMD Supercomputer System (1975-1982 @ NASA Ames Research Center)

Computer History Museum, Mountain View, CA

D. L. Slotnick

ILLIAC IV Architecture
- SIMD: 64-way parallelism
- ECL logic (7 gates/chip, by TI), 16 MHz clock
- First semiconductor memory (by Fairchild)
- Designed at the University of Illinois; built by Burroughs Corporation

Number of Processors vs System Performance


[Chart: system performance (Flop/s, from 100 M up to 1 T) versus number of processors (10^1 to 10^5), with single-processor performance as a parameter. Architecture regions shown: Vector-SMP, V-SMP Cluster, Scalar SMP, SPP / SMP-Cluster, MPP/Cluster.]

How to make computers run faster (4)


Data caching
[Diagram: Input and Output Devices connected to the Central Processing Unit (CPU, with registers) through a cache hierarchy (L1, L2, L3), backed by the Memory Unit holding Program and Data.]

Importance of Memory Performance which Matches CPU Performance


The performance of a processor is determined by the speed at which data can be moved to/from the memory unit:
- Registers (architectural, rename, register windows, reservation stations)
- Cache memory (I-cache, D-cache; L1, L2, L3)
- Interleaved memory banks
- Memory technology choices (SRAM vs. SDRAM)
More complexity is introduced in order to cope with the mismatch between CPU and memory speeds!

Interleaved Memory
- Cache-avoidance technique
- Cyclic use of multiple memory banks to hide the cycle time of memory
- Used in vector processing systems
[Diagram: CPU registers connected to the MSU, which is divided into Bank 0, Bank 1, Bank 2, ..., Bank n-2, Bank n-1. A sketch of the bank mapping follows.]
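A small C sketch of the usual word-interleaved mapping (illustrative, not from the slides; the bank count NBANKS and word size are assumed values): consecutive words go to consecutive banks, so stride-1 access cycles through all banks, while a stride equal to the bank count hits the same bank on every access.

    #include <stdio.h>
    #include <stdint.h>

    #define NBANKS     8      /* assumed number of memory banks */
    #define WORD_BYTES 8      /* assumed word size in bytes     */

    /* Word-interleaved mapping: consecutive words land in consecutive banks. */
    static unsigned bank_of(uintptr_t byte_addr) {
        return (unsigned)((byte_addr / WORD_BYTES) % NBANKS);
    }

    int main(void) {
        for (int i = 0; i < 8; i++) {
            uintptr_t unit    = (uintptr_t)i * WORD_BYTES;           /* stride of 1 word      */
            uintptr_t strided = (uintptr_t)i * WORD_BYTES * NBANKS;  /* stride of NBANKS words */
            printf("access %d: stride-1 -> bank %u, stride-%d -> bank %u\n",
                   i, bank_of(unit), NBANKS, bank_of(strided));
        }
        return 0;
    }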

Memory Bank Conflict and How to Avoid it (1)


Vector: array padding
- With DIMENSION A(256,256) and 256 memory banks, accesses with stride 256 (e.g. a0, a256, a512, ...) all fall into the same bank, causing bank conflicts.
- Padding the leading dimension to DIMENSION A(257,256) makes the stride 257, which is coprime with the number of banks, so the same access pattern rotates through banks B0, B1, ..., B255 (a C analogue follows after the diagram).
[Diagram: layout of elements a0 ... a1023 across memory banks B0-B255 for the unpadded A(256,256) versus the padded A(257,256).]
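A C analogue of the padding trick (a sketch under assumed parameters, not from the slides; C arrays are row-major, so the padded dimension is the trailing one rather than the Fortran leading dimension):

    #include <stdio.h>

    #define N      256
    #define NBANKS 256        /* assumed number of memory banks */

    /* Unpadded: walking down a column (fixed j, varying i) has stride N,
       so with NBANKS = N banks every access hits the same bank.
       Padded: stride N+1 is coprime with NBANKS, so the banks rotate. */
    static double a_plain[N][N];
    static double a_padded[N][N + 1];   /* one extra element per row */

    int main(void) {
        for (int i = 0; i < 4; i++) {
            long idx_plain  = (long)i * N;        /* linear index of a_plain[i][0]  */
            long idx_padded = (long)i * (N + 1);  /* linear index of a_padded[i][0] */
            printf("row %d: unpadded bank %ld, padded bank %ld\n",
                   i, idx_plain % NBANKS, idx_padded % NBANKS);
        }
        (void)a_plain; (void)a_padded;
        return 0;
    }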

Memory Bank Conflict and How to Avoid it (2)


Parallel: skewed storage by routing
- Instead of padding, the routing network stores each successive group of 256 elements rotated by one bank (Route 0, Route 1, Route 2, Route 3, ...), so that both stride-1 and stride-256 accesses are spread over all banks.
- DIMENSION A(256,256) stays unchanged; only the element-to-bank mapping is skewed (the mapping is written out below the diagram).
[Diagram: layout of elements a0 ... a1023 across memory banks B0-B255, unskewed versus skewed, with each group of 256 elements shifted by one bank via routes 0-3.]
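In formula form (a sketch, with B = 256 banks; element a_k belongs to group i = \lfloor k/B \rfloor at offset j = k \bmod B):

    \mathrm{bank}(k) = (i + j) \bmod B = \left(\lfloor k/B \rfloor + k \bmod B\right) \bmod B

so both stride-1 accesses (varying j, fixed i) and stride-256 accesses (varying i, fixed j) visit distinct banks.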

Memory Bank Conflict and How to Avoid it (3)


Parallel: prime-numbered memory banks
- With 16 banks and DIMENSION A(16,16), accesses with stride 16 (a0, a16, a32, a48, ...) all hit the same bank.
- With 17 banks (a prime number), the same stride-16 accesses are spread over all banks, since 16 and 17 are coprime; any stride that is not a multiple of 17 is conflict-free (see the sketch after the diagram).
[Diagram: layout of a0 ... a63 over 16 memory banks B0-B15 versus 17 memory banks B0-B16.]
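A small C sketch of the same point (assumed parameters, not from the slides): with a prime number of banks, any stride that is not a multiple of the prime visits all banks.

    #include <stdio.h>

    int main(void) {
        const int stride  = 16;   /* access stride in words                  */
        const int banks_a = 16;   /* power-of-two bank count: all conflicts  */
        const int banks_b = 17;   /* prime bank count: accesses spread out   */

        for (int i = 0; i < 8; i++) {
            int addr = i * stride;                 /* word address of the i-th access */
            printf("access %d -> bank %2d (of %d) vs bank %2d (of %d)\n",
                   i, addr % banks_a, banks_a, addr % banks_b, banks_b);
        }
        return 0;
    }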

Intel Microprocessor

IBM Power7

Power Wall
25 MW to the building, 12.5 MW to the computers

Lawrence Livermore National Lab.

Memory Architecture of Parallel Systems and Data Organization


(1) Shared Memory Model: Uniform (SMP); Non-Uniform (NUMA, cc-NUMA)
(2) Distributed Memory Model (Cluster, MPP)
(3) Hierarchical Model (SMP Clusters)

Memory Bus

System Interconnect

Variations of Shared-Memory Architecture

False Sharing of Data on SMP


Shared data may be logically independent but happen to lie on the same cache line. When the data are updated, the cache line thrashes back and forth between the caches, resulting in serious performance degradation (see the sketch below the diagram). Note that read-only data are not affected.
[Diagram: CPU 0 and CPU 1, each with its own cache, both caching the same line from memory.]
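A minimal C/OpenMP sketch of the effect (illustrative, not from the slides; the cache-line size and thread count are assumptions): per-thread counters packed next to each other share a cache line and thrash, while padding each counter onto its own line removes the problem.

    #include <stdio.h>
    #include <omp.h>

    #define MAX_THREADS 16
    #define CACHE_LINE  64                 /* assumed cache-line size in bytes */

    /* Packed: counters of different threads share cache lines -> false sharing. */
    static volatile long packed[MAX_THREADS];

    /* Padded: each counter sits on its own cache line -> no false sharing. */
    static struct { volatile long v; char pad[CACHE_LINE - sizeof(long)]; } padded[MAX_THREADS];

    int main(void) {
        const long iters = 10 * 1000 * 1000L;

        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(4)      /* stays within MAX_THREADS */
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < iters; i++) packed[id]++;    /* cache line ping-pongs */
        }
        double t1 = omp_get_wtime();
        #pragma omp parallel num_threads(4)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < iters; i++) padded[id].v++;  /* line stays private    */
        }
        double t2 = omp_get_wtime();

        printf("packed: %.2f s, padded: %.2f s\n", t1 - t0, t2 - t1);
        return 0;
    }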

Distributed Memory Architecture - Interconnect Topology

3D Torus Interconnect of IBM BlueGene (64 node case)

Tofu: 6D Torus Interconnect of Fujitsu Next Generation System

ToFu 1

ToFu 2

ToFu 3

ToFu 4

ToFu 5

4 x 4 Crossbar Network
[Diagram: a 4 x 4 crossbar network, with nodes connected to the crossbar by cables.]

Scalar vs Vector vs Multithread


- Pipelining of hardware
- Physical replication of units (intra-CPU, inter-CPU)
- Concurrency = depth of pipeline x number of pipelines

The key issue is how to keep the pipelines filled for as long as possible!
- Degree of parallelism in the application software
- Locality of datasets
- Memory latency hiding and wider memory bandwidth

Multi-threading
- Denelcor HEP, Cray MTA/XMT
- Multiple instruction counters (one per thread)
- Pipelines everywhere
- Full/empty bits on every memory word

Concept of Multi-thread Architecture (B. Smith)

HEP Architecture

Cray XMT (Multitasking)
