1. INTRODUCTION
The next generation of supercomputers will likely consist of a
hierarchy of parallel computers. If each supercomputer node is defined as a
parameterized abstract machine, then it's possible to design algorithms
independently of the hardware. Such an abstract machine can be defined to
consist of a collection of vector (SIMD) processors, each with a small fast
memory communicating via a larger global memory. This abstraction fits a
variety of hardware, such as GPUs and multicore processors with vector
extensions. To program such an abstract machine, many ideas familiar from
the past as well as some new concepts can be used. Examples from plasma
particle-in-cell (PIC) codes help illustrate this approach.
High-performance computing (HPC) most generally refers to the practice of
aggregating computing power in a way that delivers much higher performance
than one could get out of a typical desktop computer or workstation, in order
to solve large problems in science, engineering, or business. The most common
users of HPC systems are scientific researchers, engineers, and academic
institutions. Some government agencies, particularly the military, also rely
on HPC for complex applications. High-performance systems often use
custom-made components in addition to so-called commodity components. As
demand for processing power and speed grows, HPC will likely interest
businesses of all sizes.
2. LITERATURE SURVEY
2.1. EXISTING SYSTEM
Scientific programmers have been using HPC computers for many
decades, and they have encountered and successfully overcome two
revolutionary developments in the past 40 years. The first was in the 1970s,
with the widespread use of vector processing introduced in the Cray series of
computers, and later by Japanese supercomputers such as Fujitsu and NEC. A
new style of programming to vectorize codes had to be learned. The second
came about 15 to 20 years later, with the introduction of distributed-memory
computers, where it was necessary to learn how to partition problems into
separate domains and use message-passing techniques to tie them together.
2.2. PROPOSED SYSTEM
The world is now in a third revolution in HPC, a hierarchy of parallel
processors. This latest revolution introduces some new programming concepts,
but it also reuses old ones. This article will illustrate these concepts via plasma
particle-in-cell (PIC) codes.
Skeleton codes are bare-bones but fully functional PIC codes containing all
the crucial elements but not the diagnostics and initial conditions typical of
production codes. These are sometimes also called mini-apps. The codes are
designed for high performance, but not at the expense of code clarity. They
illustrate the use of a variety of parallel programming techniques, such as MPI,
OpenMP, and CUDA.
3. PIC CODES
Particle simulations of plasmas were first implemented in the late 1950s and
early 1960s to study their kinetic processes: the interactions of particles and
electromagnetic fields described by six-dimensional distribution functions with
three space and three velocity coordinates. This is a many-body type of
numerical model, but in contrast with molecular dynamics codes that model
the direct interactions of particles, PIC codes model the interactions of
particles with electromagnetic fields. Because the fields are solved on a grid,
the calculation is generally of order N, where N is the number of particles.
Such codes are widely used in plasma physics to study a diverse set of
problems such as turbulence in tokamaks, plasma-based accelerators,
relativistic shocks, ion propulsion, and many others. Because they model
plasmas with particles, they're among the most computationally demanding
types of numerical models in plasma physics, and thus are often on the frontier
of HPC.
PIC codes have three required elements. The first is a deposit step, in which
various quantities such as charge or current density are accumulated on a grid,
interpolating from particle locations. The second step calculates the
electromagnetic fields on the grid, using Maxwell's equations or a subset such
as Poisson's equation. The final step is a push step, where the electromagnetic
fields are interpolated from the grid to particle locations and the particle
velocities and positions are updated using a Lorentz force equation. In more
recent years, a fourth step has also become common: reordering particles to
improve data locality on cache-based systems. Typically, most of the
computation time is spent in the deposit and push steps, since there are
usually many more particles than grid points.
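To make these steps concrete, the following sketch walks through one time step of a minimal one-dimensional electrostatic PIC model in C, with linear interpolation, periodic boundaries, and normalized units. The function and variable names, and the crude integration used in place of a real Poisson solver, are illustrative assumptions rather than the structure of any particular production or skeleton code.

/* One time step of a minimal 1D electrostatic PIC model (normalized units,
   periodic boundaries, linear interpolation). Illustrative sketch only. */
void pic_step(float *x, float *v, int np,     /* particle positions and velocities */
              float *rho, float *ex, int nx,  /* charge density and electric field */
              float q, float qm, float dt) {  /* charge, charge/mass ratio, time step */
    /* 1. Deposit: accumulate charge density on the grid from particle positions. */
    for (int i = 0; i < nx; i++) rho[i] = 0.0f;
    for (int j = 0; j < np; j++) {
        int   n  = (int)x[j];
        float dx = x[j] - (float)n;
        rho[n]            += q * (1.0f - dx);
        rho[(n + 1) % nx] += q * dx;
    }
    /* 2. Field solve: integrate E' = rho as a crude stand-in for a Poisson solver,
       then subtract the mean so the periodic field is well defined. */
    float emean = 0.0f;
    ex[0] = 0.0f;
    for (int i = 1; i < nx; i++) {
        ex[i] = ex[i - 1] + rho[i - 1];
        emean += ex[i];
    }
    emean /= (float)nx;
    for (int i = 0; i < nx; i++) ex[i] -= emean;
    /* 3. Push: interpolate the field to each particle, then advance velocity and
       position, wrapping positions for the periodic system. */
    for (int j = 0; j < np; j++) {
        int   n  = (int)x[j];
        float dx = x[j] - (float)n;
        float e  = (1.0f - dx) * ex[n] + dx * ex[(n + 1) % nx];
        v[j] += qm * e * dt;
        x[j] += v[j] * dt;
        if (x[j] >= (float)nx) x[j] -= (float)nx;
        if (x[j] < 0.0f)       x[j] += (float)nx;
    }
}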
On the vector machines of the first revolution, programmers had to learn to
write loops on which the vectorizing compiler could succeed. Loops that had
data dependencies or features that the
compiler couldn't handle were broken up into multiple loops, some vector
loops, and some scalar loops, to vectorize them as much as possible. Later,
vectorizing compilers could separate vectorizable elements from scalar
elements automatically. Japanese supercomputer manufacturers such as Fujitsu
and NEC developed even more impressive vectorizing capabilities. Memory
access patterns were another
important concept on the Cray. Memory was divided into banks, with adjacent
words stored in adjacent banks; there was no cache. Reading memory with
stride 1 was optimal, whereas strides that read the same memory bank
repeatedly (large
powers of 2) were the worst. For PIC codes, the field solver and particle push
had no data dependencies and were easily vectorized. The deposit was
vectorized by splitting the particle loop into two.
The first loop calculated the addresses and quantities to be deposited for a
group of particles (typically 128) and stored them in a small temporary 2D
array. The number of quantities to be deposited depended on the interpolation
order, and for a single particle, each deposit was to a distinct location and thus
had no data dependencies. Thus the second vector loop contained a short inner
loop to perform the deposits for a single particle. The short vector loop wasn't
optimal on the Cray, but it was better than a scalar loop.
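To make the two-loop restructuring concrete, the sketch below shows a one-dimensional, linear-interpolation charge deposit split in this way, written in C. The block size, array names, and omission of boundary handling are illustrative assumptions, not the layout of any particular skeleton code.

/* Two-loop charge deposit (1D, linear interpolation), restructured for vectorization. */
#define NBLK 128                 /* particles processed per block (illustrative) */

void deposit_blocked(const float *x, int np, float *rho, float q) {
    int   node[NBLK];            /* grid index for each particle in the block    */
    float wt[NBLK][2];           /* weights for the two neighboring grid points  */

    for (int js = 0; js < np; js += NBLK) {
        int nb = (np - js) < NBLK ? (np - js) : NBLK;
        /* First loop: gather addresses and weights into a temporary array;
           no data dependencies, so a vectorizing compiler can handle it. */
        for (int j = 0; j < nb; j++) {
            float xp = x[js + j];
            int   n  = (int)xp;
            float dx = xp - (float)n;
            node[j]  = n;
            wt[j][0] = q * (1.0f - dx);
            wt[j][1] = q * dx;
        }
        /* Second loop: scatter into the grid; for a single particle the short
           inner loop writes to distinct locations (boundary handling omitted). */
        for (int j = 0; j < nb; j++) {
            for (int m = 0; m < 2; m++)
                rho[node[j] + m] += wt[j][m];
        }
    }
}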
Distributed-memory computers from the second revolution were first developed
in the mid-1980s, but they didn't become mainstream in the scientific
community until the mid-1990s, when major vendors such as Cray and IBM
offered products and the message passing interface (MPI) standard was
established. The major challenge was how to effectively divide the computation
without using shared memory.
4.1. GPU Programming
In a research program to develop PIC algorithms for new emerging
architectures, it was decided to start with Nvidia GPUs because they seemed
the most advanced at the time; the focus was entirely on Nvidia's CUDA
environment [9,10]. To enhance experimentation, skeleton PIC codes were
developed: bare-bones yet fully functional PIC codes containing all the crucial
elements, but not the diagnostics and initial conditions typical of production
codes. It was the same approach used during the last HPC revolution [11].
Today, these are sometimes also called mini-apps. Our goal was to develop
general algorithms that would be useful on architectures other than just the
GPU, because we don't know how hardware and software will evolve.
In the GPU algorithm, particles were kept ordered by tiles, so that the field
elements needed within a tile (which are repeatedly reused) were loaded from
slow to fast memory just once, and global memory reads had optimal stride.
4.2. Multicore Architectures
In the course of developing the GPU algorithm, it was useful to implement
an OpenMP version as a debugging tool. This GPU emulation implemented the
parallelism over tiles, but not the vectorization within a tile. Eventually it was
realized that the OpenMP version was useful in its own right, and the GPU
algorithm achieved excellent speedup on multicore architectures.
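As an illustration of parallelism over tiles with OpenMP, the sketch below assigns whole tiles of particles to threads and lets each tile deposit into its own private copy of the local field patch, so no two threads write to the same memory. The tile size, structure layout, and names are illustrative assumptions rather than the organization of any particular skeleton code.

/* OpenMP parallelism over tiles (1D sketch): one tile per loop iteration. */
#include <omp.h>

#define MX 16                        /* grid cells per tile (illustrative)         */

typedef struct {
    int    np;                       /* number of particles stored in this tile    */
    float *x;                        /* particle positions relative to tile origin */
    float  rho[MX + 1];              /* private copy of the tile's charge density  */
} tile_t;

void deposit_tiles(tile_t *tiles, int ntiles, float q) {
    /* Threads work on whole tiles, so deposits never collide between threads. */
    #pragma omp parallel for
    for (int t = 0; t < ntiles; t++) {
        tile_t *tp = &tiles[t];
        for (int m = 0; m <= MX; m++)
            tp->rho[m] = 0.0f;
        for (int j = 0; j < tp->np; j++) {
            int   n  = (int)tp->x[j];
            float dx = tp->x[j] - (float)n;
            tp->rho[n]     += q * (1.0f - dx);
            tp->rho[n + 1] += q * dx;
        }
    }
    /* The private tile densities would then be summed into the global grid,
       including the guard cell that overlaps the neighboring tile. */
}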
The GPU code has been extended to make use of multiple GPUs. Each GPU
contains only part of the computational domain, with MPI tying the domains
together. Each MPI node controls one GPU, resulting in three levels of
parallelism.
The global FFT-based field solver used in this code was not discussed earlier.
However, the performance of the multi-GPU code was dominated by the FFT: the
all-to-all transpose copied data between two levels of slow memory, GPU/host
and host/MPI. Most electromagnetic PIC codes don't use global field solvers or
FFTs. Nevertheless, even the FFT had
reasonable scalability. On 96 GPUs, a two-and-a-half-dimensional
electromagnetic simulation with 10 billion particles and a 16,384 × 16,384 mesh
takes about 0.5 sec/time step to execute. A strong scaling benchmark for the
problem size in Table 1 is shown in Figure 2. Again, these results are intended
to illustrate what's possible with emerging architectures. The same algorithm
without the vectorization was also implemented on multicore architectures with
MPI and OpenMP, and it gave good scaling up to 768 cores.
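To sketch how MPI ties the domains together, the fragment below exchanges the guard (ghost) cells of a one-dimensional field partition between neighboring ranks. The array layout, guard-cell width of one, and periodic arrangement are illustrative assumptions rather than the communication scheme of any particular code.

/* Guard-cell exchange for a 1D domain decomposition (illustrative sketch).
   fld has nxl owned cells in fld[1..nxl] plus guard cells fld[0] and fld[nxl+1]. */
#include <mpi.h>

void exchange_guards(float *fld, int nxl, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank - 1 + size) % size;     /* periodic neighbors */
    int right = (rank + 1) % size;

    /* Send the rightmost owned cell right; receive the left guard from the left. */
    MPI_Sendrecv(&fld[nxl], 1, MPI_FLOAT, right, 0,
                 &fld[0],   1, MPI_FLOAT, left,  0, comm, MPI_STATUS_IGNORE);
    /* Send the leftmost owned cell left; receive the right guard from the right. */
    MPI_Sendrecv(&fld[1],       1, MPI_FLOAT, left,  1,
                 &fld[nxl + 1], 1, MPI_FLOAT, right, 1, comm, MPI_STATUS_IGNORE);
}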
5. PRESENTATION MATERIAL
6. QUESTIONNAIRE
1. What is a PIC code?
Particle simulations of plasmas were first implemented in the late 1950s and
early 1960s to study their kinetic processes: the interactions of particles and
electromagnetic fields described by six-dimensional distribution functions with
three space and three velocity coordinates. This is a many-body type of
numerical model, but in contrast with molecular dynamics codes that model the
direct interactions of particles, PIC codes model the interactions of particles
with electromagnetic fields. Because the fields are solved on a grid, the
calculation is generally of order N, where N is the number of particles. PIC
codes model plasmas by self-consistently integrating the trajectories of many
charged particles responding to the electromagnetic fields that they themselves
generate. They are widely used in plasma physics, and with domain-decomposition
techniques they have run in parallel on very large distributed-memory
supercomputers. PIC codes have three major steps: depositing some quantity such
as charge or current from particles onto a grid, solving a field equation to
obtain the electromagnetic fields, and interpolating the electric and magnetic
fields from the grid to the particles. They have two data structures, particles
and fields, that need to communicate efficiently with one another.
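As a concrete picture of these two data structures, the declarations below show one way they might look in C for a two-dimensional electrostatic code; the field names and layout are illustrative assumptions only.

/* Illustrative declarations of the two PIC data structures (2D electrostatic). */
typedef struct {
    float x, y;          /* position */
    float vx, vy;        /* velocity */
} particle_t;

typedef struct {
    int    nx, ny;       /* grid dimensions                                   */
    float *rho;          /* charge density deposited from particles (nx*ny)   */
    float *ex, *ey;      /* electric field interpolated back to the particles */
} field_t;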
3. What is SIMD?
A common example is adjusting the brightness of an image: each pixel consists
of values for the red (R), green (G), and blue (B) portions of the color. To
change the brightness, the R, G and B values are read
from memory, a value is added to (or subtracted from) them, and the resulting
values are written back out to memory. With a SIMD processor there are two
improvements to this process. For one, the data is understood to be in blocks,
and a number of values can be loaded all at once. Instead of a series of
instructions saying "retrieve this pixel, now retrieve the next pixel", a SIMD
processor will have a single instruction that effectively says "retrieve n pixels"
(where n is a number that varies from design to design). For a variety of
reasons, this can take much less time than retrieving each pixel individually, as
with traditional CPU design.
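As a simple illustration, the loop below adjusts the brightness of an array of packed pixel channel values. Because the iterations are independent and access memory with stride 1, a vectorizing compiler can process several values per SIMD instruction; the function name and clamping behavior are illustrative assumptions.

/* Brightness adjustment over packed R,G,B bytes; independent, stride-1 iterations
   allow the compiler to emit SIMD instructions that handle many values at once. */
#include <stddef.h>

void adjust_brightness(unsigned char *rgb, size_t nbytes, int delta) {
    for (size_t i = 0; i < nbytes; i++) {
        int v = rgb[i] + delta;
        if (v < 0)   v = 0;      /* clamp to the valid 0..255 range */
        if (v > 255) v = 255;
        rgb[i] = (unsigned char)v;
    }
}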
4. What is a GPU?
A graphics processing unit (GPU), occasionally also called a visual processing
unit (VPU), is a specialized electronic circuit designed to rapidly manipulate
and alter memory to accelerate the creation of images in a frame buffer
intended for output to a display. GPUs are used in embedded systems, mobile
phones, personal computers, workstations, and game consoles. Modern GPUs are
very efficient at manipulating computer graphics and image processing, and
their highly parallel structure makes them more effective than general-purpose
CPUs for algorithms that process large blocks of data in parallel.
Early GPUs were marketed as single-chip processors with integrated transform,
lighting, and triangle setup, with later designs adding programmable units.
With many of the same operations supported on GPUs as on conventional
processors, engineers and scientists have increasingly used them for
general-purpose computation.
5. What is MPI?
Message Passing Interface (MPI) is a standardized and portable message-passing
system designed by a group of researchers from academia and industry to
function on a wide variety of parallel computers. The standard defines the
syntax and semantics of library routines useful for writing portable
message-passing programs.
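A minimal MPI program, sketched below, shows the basic pattern the standard supports: initialize, query rank and size, communicate, and finalize. The message contents here are purely illustrative.

/* Minimal MPI example: rank 0 sends an integer to rank 1 (illustrative). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size >= 2) {
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
    }
    MPI_Finalize();
    return 0;
}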
7. FUTURE SCOPE
Skeleton particle-in-cell codes have several advantages, and we have looked at
how they can be used on emerging computer architectures and in high-performance
computing. Considering the future of skeleton particle-in-cell codes on
emerging computer architectures, there is considerable scope: performance can
be greatly improved by using them. Some of the important future directions for
particle-in-cell codes on emerging computing architectures are considered
below.
8. CONCLUSION
Developing algorithms for emerging HPC architectures is largely a
combination of previous techniques: vector algorithms for vector processors,
tiling techniques from cache-based systems, and domain decomposition
schemes from distributed-memory computers. But new techniques are also
required, such as making effective use of lightweight threads and fast atomic
operations. The scientific community will emerge from this revolution as
successfully as it has from the past ones. The next generation of
supercomputers will likely consist of a hierarchy of parallel computers. If we
define each supercomputer node as a parameterized abstract machine, then it's
possible to design algorithms independently of the hardware. Such an abstract
machine can be defined to consist of a collection of vector (SIMD) processors,
each with a small fast memory communicating via a larger global memory.
This abstraction fits a variety of hardware, such as GPUs and multicore
processors with vector extensions.
9. REFERENCES
1. C.K. Birdsall and A.B. Langdon, Plasma Physics via Computer Simulation,
McGraw-Hill, 1985.
2. R.W. Hockney and J.W. Eastwood, Computer Simulation Using Particles,
McGraw-Hill, 1981.
3. K.J. Bowers et al., "0.374 Pflop/s Trillion-Particle Kinetic Modeling of
Laser Plasma Interaction on Roadrunner," Proc. ACM/IEEE Conf. Supercomputing,
2008, article no. 63; http://dl.acm.org/citation.cfm?id=1413435.
4. M. Bussmann et al., "Radiative Signatures of the Relativistic
Kelvin-Helmholtz Instability," Proc. ACM/IEEE Conf. High Performance Computing,
Networking, Storage and Analysis, 2013; http://picongpu.hzdr.de.