1. INTRODUCTION
The next generation of supercomputers will likely consist of a
hierarchy of parallel computers. If each supercomputer node is defined as a
parameterized abstract machine, then it's possible to design algorithms
independently of the hardware. Such an abstract machine can be defined to
consist of a collection of vector (SIMD) processors, each with a small fast
memory communicating via a larger global memory. This abstraction fits a
variety of hardware, such as GPUs and multicore processors with vector
extensions. To program such an abstract machine, many ideas familiar from
the past as well as some new concepts can be used. Examples from plasma
particle-in-cell (PIC) codes help illustrate this approach.
High-performance computing (HPC) most generally refers to the practice of
aggregating computing power in a way that delivers much higher performance
than one could get out of a typical desktop computer or workstation, in order
to solve large problems in science, engineering, or business. The most common
users of HPC systems are scientific researchers, engineers, and academic
institutions. Some government agencies, particularly the military, also rely
on HPC for complex applications. High-performance systems often use
custom-made components in addition to so-called commodity components. As
demand for processing power and speed grows, HPC will likely interest
businesses of all sizes.
2. LITERATURE SURVEY
2.1. EXISTING SYSTEM
Scientific programmers have been using HPC computers for many
decades, and they have encountered and successfully overcome two
revolutionary developments in the past 40 years. The first was in the 1970s,
with the widespread use of vector processing introduced in the Cray series of
computers, and later by Japanese supercomputers such as Fujitsu and NEC. A
new style of programming to vectorize codes had to be learned. The second
came about 15 to 20 years later, with the introduction of distributed-memory
computers, where it was necessary to learn how to partition problems into
separate domains and use message-passing techniques to tie them together.
2.2. PROPOSED SYSTEM
The world is now in a third revolution in HPC, a hierarchy of parallel
processors. This latest revolution introduces some new programming concepts,
but it also reuses old ones. This article will illustrate these concepts via plasma
particle-in-cell (PIC) codes.
Skeleton codes are bare-bones but fully functional PIC codes containing all
the crucial elements but not the diagnostics and initial conditions typical of
production codes. These are sometimes also called mini-apps. The codes are
designed for high performance, but not at the expense of code clarity. They
illustrate the use of a variety of parallel programming techniques, such as MPI,
OpenMP, and CUDA.
3. PIC CODES
Particle simulations of plasmas were first implemented in the late 1950s and
early 1960s to study their kinetic processes: the interactions of particles and
electromagnetic fields described by six-dimensional distribution functions with
three space and three velocity coordinates. This is a many-body type of
numerical model, but in contrast with molecular dynamics codes that model
the direct interactions of particles, PIC codes model the interactions of
particles with electromagnetic fields. Because the fields are solved on a grid,
the calculation is generally of order N, where N is the number of particles.
Such codes are widely used in plasma physics to study a diverse set of
problems such as turbulence in tokamaks, plasma-based accelerators,
relativistic shocks, ion propulsion, and many others. Because they model
plasmas with particles, they're among the most computationally demanding
types of numerical models in plasma physics, and thus are often on the frontier
of HPC.
PIC codes have three required elements. The first is a deposit step, in which
various quantities such as charge or current density are accumulated on a grid,
interpolating from particle locations. The second step calculates the
electromagnetic fields on the grid, using Maxwell's equations or a subset such
as Poisson's equation. The final step is a push step, where the electromagnetic
fields are interpolated from the grid to particle locations and the particle
velocities and positions are updated using a Lorentz force equation. In more
recent years, a fourth step has also become common: reordering particles to
improve data locality on cache-based systems. Typically, most of the
computation time is spent in the deposit and push steps, since there are
usually many more particles than grid points.
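To make these steps concrete, the following sketch walks through one time step of a minimal one-dimensional electrostatic PIC model in C, with linear interpolation, periodic boundaries, and normalized units. The function and variable names, and the crude integration used in place of a real Poisson solver, are illustrative assumptions rather than the structure of any particular production or skeleton code.

/* One time step of a minimal 1D electrostatic PIC model (normalized units,
   periodic boundaries, linear interpolation). Illustrative sketch only. */
void pic_step(float *x, float *v, int np,     /* particle positions and velocities */
              float *rho, float *ex, int nx,  /* charge density and electric field */
              float q, float qm, float dt) {  /* charge, charge/mass ratio, time step */
    /* 1. Deposit: accumulate charge density on the grid from particle positions. */
    for (int i = 0; i < nx; i++) rho[i] = 0.0f;
    for (int j = 0; j < np; j++) {
        int   n  = (int)x[j];
        float dx = x[j] - (float)n;
        rho[n]            += q * (1.0f - dx);
        rho[(n + 1) % nx] += q * dx;
    }
    /* 2. Field solve: integrate E' = rho as a crude stand-in for a Poisson solver,
       then subtract the mean so the periodic field is well defined. */
    float emean = 0.0f;
    ex[0] = 0.0f;
    for (int i = 1; i < nx; i++) {
        ex[i] = ex[i - 1] + rho[i - 1];
        emean += ex[i];
    }
    emean /= (float)nx;
    for (int i = 0; i < nx; i++) ex[i] -= emean;
    /* 3. Push: interpolate the field to each particle, then advance velocity and
       position, wrapping positions for the periodic system. */
    for (int j = 0; j < np; j++) {
        int   n  = (int)x[j];
        float dx = x[j] - (float)n;
        float e  = (1.0f - dx) * ex[n] + dx * ex[(n + 1) % nx];
        v[j] += qm * e * dt;
        x[j] += v[j] * dt;
        if (x[j] >= (float)nx) x[j] -= (float)nx;
        if (x[j] < 0.0f)       x[j] += (float)nx;
    }
}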
On the vector machines of the first revolution, programmers had to learn to
write loops on which the vectorizing compiler could succeed. Loops that had
data dependencies or features that the
compiler couldn't handle were broken up into multiple loops, some vector
loops, and some scalar loops, to vectorize them as much as possible. Later,
vectorizing compilers could separate vectorizable elements from scalar
elements automatically. Japanese supercomputer manufacturers such as Fujitsu
and NEC developed even more impressive vectorizing capabilities. Memory
access patterns were another
important concept on the Cray. Memory was divided into banks, with adjacent
words stored in adjacent banks; there was no cache. Reading memory with
stride 1 was optimal, whereas strides that read the same memory bank
repeatedly (large
powers of 2) were the worst. For PIC codes, the field solver and particle push
had no data dependencies and were easily vectorized. The deposit was
vectorized by splitting the particle loop into two.
The first loop calculated the addresses and quantities to be deposited for a
group of particles (typically 128) and stored them in a small temporary 2D
array. The number of quantities to be deposited depended on the interpolation
order, and for a single particle, each deposit was to a distinct location and thus
had no data dependencies. Thus the second vector loop contained a short inner
loop to perform the deposits for a single particle. The short vector loop wasn't
optimal on the Cray, but it was better than a scalar loop.
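To make the two-loop restructuring concrete, the sketch below shows a one-dimensional, linear-interpolation charge deposit split in this way, written in C. The block size, array names, and omission of boundary handling are illustrative assumptions, not the layout of any particular skeleton code.

/* Two-loop charge deposit (1D, linear interpolation), restructured for vectorization. */
#define NBLK 128                 /* particles processed per block (illustrative) */

void deposit_blocked(const float *x, int np, float *rho, float q) {
    int   node[NBLK];            /* grid index for each particle in the block    */
    float wt[NBLK][2];           /* weights for the two neighboring grid points  */

    for (int js = 0; js < np; js += NBLK) {
        int nb = (np - js) < NBLK ? (np - js) : NBLK;
        /* First loop: gather addresses and weights into a temporary array;
           no data dependencies, so a vectorizing compiler can handle it. */
        for (int j = 0; j < nb; j++) {
            float xp = x[js + j];
            int   n  = (int)xp;
            float dx = xp - (float)n;
            node[j]  = n;
            wt[j][0] = q * (1.0f - dx);
            wt[j][1] = q * dx;
        }
        /* Second loop: scatter into the grid; for a single particle the short
           inner loop writes to distinct locations (boundary handling omitted). */
        for (int j = 0; j < nb; j++) {
            for (int m = 0; m < 2; m++)
                rho[node[j] + m] += wt[j][m];
        }
    }
}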
Distributed-memory computers from the second revolution were first developed
in the mid-1980s, but they didn't become mainstream in the scientific
community until the mid-1990s, when major vendors such as Cray and IBM
offered products and the message passing interface (MPI) standard was
established. The major challenge was how to effectively divide the computation
without using shared memory.
4.1. GPU Programming
In a research program to develop PIC algorithms for new emerging
architectures, it was decided to start with Nvidia GPUs because they seemed
the most advanced at the time; the focus was entirely on Nvidia's CUDA
environment [9,10]. To enhance experimentation, skeleton PIC codes were
developed: bare-bones yet fully functional PIC codes containing all the crucial
elements, but not the diagnostics and initial conditions typical of production
codes. It was the same approach used during the last HPC revolution [11].
Today, these are sometimes also called mini-apps. Our goal was to develop
general algorithms that would be useful on architectures other than just the
GPU, because we don't know how hardware and software will evolve.
In the GPU algorithm, particles were kept ordered by tiles, so that the field
elements needed within a tile (which are repeatedly reused) were loaded from
slow to fast memory just once, and global memory reads had optimal stride.
4.2. Multicore Architectures
In the course of developing the GPU algorithm, it was useful to implement
an OpenMP version as a debugging tool. This GPU emulation implemented the
parallelism over tiles, but not the vectorization within a tile. Eventually it was
realized that the OpenMP version was useful in its own right, and the GPU
algorithm achieved excellent speedup on multicore architectures.
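As an illustration of parallelism over tiles with OpenMP, the sketch below assigns whole tiles of particles to threads and lets each tile deposit into its own private copy of the local field patch, so no two threads write to the same memory. The tile size, structure layout, and names are illustrative assumptions rather than the organization of any particular skeleton code.

/* OpenMP parallelism over tiles (1D sketch): one tile per loop iteration. */
#include <omp.h>

#define MX 16                        /* grid cells per tile (illustrative)         */

typedef struct {
    int    np;                       /* number of particles stored in this tile    */
    float *x;                        /* particle positions relative to tile origin */
    float  rho[MX + 1];              /* private copy of the tile's charge density  */
} tile_t;

void deposit_tiles(tile_t *tiles, int ntiles, float q) {
    /* Threads work on whole tiles, so deposits never collide between threads. */
    #pragma omp parallel for
    for (int t = 0; t < ntiles; t++) {
        tile_t *tp = &tiles[t];
        for (int m = 0; m <= MX; m++)
            tp->rho[m] = 0.0f;
        for (int j = 0; j < tp->np; j++) {
            int   n  = (int)tp->x[j];
            float dx = tp->x[j] - (float)n;
            tp->rho[n]     += q * (1.0f - dx);
            tp->rho[n + 1] += q * dx;
        }
    }
    /* The private tile densities would then be summed into the global grid,
       including the guard cell that overlaps the neighboring tile. */
}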
The GPU code has been extended to make use of multiple GPUs. Each GPU
contains only part of the computational domain, with MPI tying the domains
together. Each MPI node controls one GPU, resulting in three levels of
parallelism.
The global FFT-based field solver used in this code was not discussed earlier.
However, the performance of the multi-GPU code was dominated by the FFT: the
all-to-all transpose copied data between two levels of slow memory, GPU/host
and host/MPI. Most electromagnetic PIC codes don't use global field solvers or
FFTs. Nevertheless, even the FFT had
reasonable scalability. On 96 GPUs, a two-and-a-half-dimensional
electromagnetic simulation with 10 billion particles and a 16,384 × 16,384 mesh
takes about 0.5 sec/time step to execute. A strong scaling benchmark for the
problem size in Table 1 is shown in Figure 2. Again, these results are intended
to illustrate what's possible with emerging architectures. The same algorithm
without the vectorization was also implemented on multicore architectures with
MPI and OpenMP, and it gave good scaling up to 768 cores.
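To sketch how MPI ties the domains together, the fragment below exchanges the guard (ghost) cells of a one-dimensional field partition between neighboring ranks. The array layout, guard-cell width of one, and periodic arrangement are illustrative assumptions rather than the communication scheme of any particular code.

/* Guard-cell exchange for a 1D domain decomposition (illustrative sketch).
   fld has nxl owned cells in fld[1..nxl] plus guard cells fld[0] and fld[nxl+1]. */
#include <mpi.h>

void exchange_guards(float *fld, int nxl, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank - 1 + size) % size;     /* periodic neighbors */
    int right = (rank + 1) % size;

    /* Send the rightmost owned cell right; receive the left guard from the left. */
    MPI_Sendrecv(&fld[nxl], 1, MPI_FLOAT, right, 0,
                 &fld[0],   1, MPI_FLOAT, left,  0, comm, MPI_STATUS_IGNORE);
    /* Send the leftmost owned cell left; receive the right guard from the right. */
    MPI_Sendrecv(&fld[1],       1, MPI_FLOAT, left,  1,
                 &fld[nxl + 1], 1, MPI_FLOAT, right, 1, comm, MPI_STATUS_IGNORE);
}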
5. PRESENTATION MATERIAL
6. QUESTIONNAIRE
1. What is a PIC code?
Particle simulations of plasmas were first implemented in the late 1950s and
early 1960s to study their kinetic processes: the interactions of particles and
electromagnetic fields described by six-dimensional distribution functions with
three space and three velocity coordinates. This is a many-body type of
numerical model, but in contrast with molecular dynamics codes that model the
direct interactions of particles, PIC codes model the interactions of particles
with electromagnetic fields. Because the fields are solved on a grid, the
calculation is generally of order N, where N is the number of particles. PIC
codes model plasmas by self-consistently integrating the trajectories of many
charged particles responding to the electromagnetic fields that they themselves
generate. They are widely used in plasma physics, and with domain-decomposition
techniques they have run in parallel on very large distributed-memory
supercomputers. PIC codes have three major steps: depositing some quantity such
as charge or current from particles onto a grid, solving a field equation to
obtain the electromagnetic fields, and interpolating the electric and magnetic
fields from the grid to the particles. They have two data structures, particles
and fields, that need to communicate efficiently with one another.
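As a concrete picture of these two data structures, the declarations below show one way they might look in C for a two-dimensional electrostatic code; the field names and layout are illustrative assumptions only.

/* Illustrative declarations of the two PIC data structures (2D electrostatic). */
typedef struct {
    float x, y;          /* position */
    float vx, vy;        /* velocity */
} particle_t;

typedef struct {
    int    nx, ny;       /* grid dimensions                                   */
    float *rho;          /* charge density deposited from particles (nx*ny)   */
    float *ex, *ey;      /* electric field interpolated back to the particles */
} field_t;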
3. What is SIMD?
A common example is adjusting the brightness of an image: each pixel consists
of values for the red (R), green (G), and blue (B) portions of the color. To
change the brightness, the R, G and B values are read
from memory, a value is added to (or subtracted from) them, and the resulting
values are written back out to memory. With a SIMD processor there are two
improvements to this process. For one, the data is understood to be in blocks,
and a number of values can be loaded all at once. Instead of a series of
instructions saying "retrieve this pixel, now retrieve the next pixel", a SIMD
processor will have a single instruction that effectively says "retrieve n pixels"
(where n is a number that varies from design to design). For a variety of
reasons, this can take much less time than retrieving each pixel individually, as
with traditional CPU design.
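As a simple illustration, the loop below adjusts the brightness of an array of packed pixel channel values. Because the iterations are independent and access memory with stride 1, a vectorizing compiler can process several values per SIMD instruction; the function name and clamping behavior are illustrative assumptions.

/* Brightness adjustment over packed R,G,B bytes; independent, stride-1 iterations
   allow the compiler to emit SIMD instructions that handle many values at once. */
#include <stddef.h>

void adjust_brightness(unsigned char *rgb, size_t nbytes, int delta) {
    for (size_t i = 0; i < nbytes; i++) {
        int v = rgb[i] + delta;
        if (v < 0)   v = 0;      /* clamp to the valid 0..255 range */
        if (v > 255) v = 255;
        rgb[i] = (unsigned char)v;
    }
}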
4. What is a GPU?
A graphics processing unit (GPU), occasionally also called a visual processing
unit (VPU), is a specialized electronic circuit designed to rapidly manipulate
and alter memory to accelerate the creation of images in a frame buffer
intended for output to a display. GPUs are used in embedded systems, mobile
phones, personal computers, workstations, and game consoles. Modern GPUs are
very efficient at manipulating computer graphics and image processing, and
their highly parallel structure makes them more effective than general-purpose
CPUs for algorithms that process large blocks of data in parallel.
Early GPUs were marketed as single-chip processors with integrated transform,
lighting, and triangle setup, with later designs adding programmable units.
With many of the same operations supported on GPUs as on conventional
processors, engineers and scientists have increasingly used them for
general-purpose computation.
5. What is MPI?
Message Passing Interface (MPI) is a standardized and portable message-passing
system designed by a group of researchers from academia and industry to
function on a wide variety of parallel computers. The standard defines the
syntax and semantics of library routines useful for writing portable
message-passing programs.
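A minimal MPI program, sketched below, shows the basic pattern the standard supports: initialize, query rank and size, communicate, and finalize. The message contents here are purely illustrative.

/* Minimal MPI example: rank 0 sends an integer to rank 1 (illustrative). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size >= 2) {
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
    }
    MPI_Finalize();
    return 0;
}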
7. FUTURE SCOPE
Skeleton particle-in-cell codes have several advantages, and we have looked at
how they can be used on emerging computer architectures and in high-performance
computing. Considering the future of skeleton particle-in-cell codes on
emerging computer architectures, there is considerable scope: performance can
be greatly improved by using them. Some of the important future directions for
particle-in-cell codes on emerging computing architectures are considered
below.
8. CONCLUSION
Developing algorithms for emerging HPC architectures is largely a
combination of previous techniques: vector algorithms for vector processors,
tiling techniques from cache-based systems, and domain decomposition
schemes from distributed-memory computers. But new techniques are also
required, such as making effective use of lightweight threads and fast atomic
operations. The scientific community will emerge from this revolution as
successfully as it has from the past ones. The next generation of
supercomputers will likely consist of a hierarchy of parallel computers. If we
define each supercomputer node as a parameterized abstract machine, then it's
possible to design algorithms independently of the hardware. Such an abstract
machine can be defined to consist of a collection of vector (SIMD) processors,
each with a small fast memory communicating via a larger global memory.
This abstraction fits a variety of hardware, such as GPUs and multicore
processors with vector extensions.
9. REFERENCES
1. C.K. Birdsall and A.B. Langdon, Plasma Physics via Computer Simulation,
McGraw-Hill, 1985.
2. R.W. Hockney and J.W. Eastwood, Computer Simulation Using Particles,
McGraw-Hill, 1981.
3. K.J. Bowers et al., "0.374 Pflop/s Trillion-Particle Kinetic Modeling of
Laser Plasma Interaction on Roadrunner," Proc. ACM/IEEE Conf. Supercomputing,
2008, article no. 63; http://dl.acm.org/citation.cfm?id=1413435.
4. M. Bussmann et al., "Radiative Signatures of the Relativistic
Kelvin-Helmholtz Instability," Proc. ACM/IEEE Conf. High Performance Computing,
Networking, Storage and Analysis, 2013; http://picongpu.hzdr.de.