
V. KRISHNA REDDY* et al.

[IJESAT] INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE & ADVANCED TECHNOLOGY

ISSN: 2250-3676
Volume - 2, Issue - 1, 92-100

PERFORMANCE EVALUATION OF PARTICLE SWARM OPTIMIZATION ALGORITHMS ON GPU USING CUDA


V. Krishna Reddy (1), L.S.S. Reddy (2)

(1) Department of CSE, Lakireddy Bali Reddy College of Engineering, Mylavaram, India, krishna4474@gmail.com
(2) Director, Lakireddy Bali Reddy College of Engineering, Mylavaram, India

Abstract
Particle Swarm Optimization (PSO) is a simple but powerful optimization algorithm based on the social behavior of particles. PSO has become popular due to its simplicity and its effectiveness in a wide range of applications at low computational cost. The main objective of this paper is to implement parallel asynchronous and synchronous versions of PSO on the Graphics Processing Unit (GPU) and to compare their performance, in terms of execution time and speedup, with their sequential versions. We also present the implementation details and performance observations of the parallel PSO algorithms on the GPU using Compute Unified Device Architecture (CUDA), a software platform from nVIDIA. We observed that the asynchronous version of the algorithm outperforms the other versions.

Index Terms: Component; formatting; style; styling; insert.

1. INTRODUCTION


Particle Swarm Optimization (PSO) is a simple but powerful optimization algorithm, introduced by Kennedy and Eberhart in 1995 [2]. PSO searches for the optimum of a function, termed the fitness function, following rules inspired by the behavior of flocks of birds searching for food. As a population-based meta-heuristic, PSO has recently gained more and more popularity owing to its robustness, effectiveness, and simplicity. Regardless of the choices of algorithm structure, parameters, etc., and despite good convergence properties, PSO is still an iterative stochastic search process which, depending on problem hardness, may require a large number of particle updates and fitness evaluations. Therefore, designing efficient PSO implementations is a problem of great practical relevance. It is even more important if one considers real-time applications to dynamic environments in which, for example, the fast-convergence properties of PSO may be used to track moving points of interest (maxima or minima of a specific dynamically changing fitness function). There are a number of computer vision applications in which PSO has been used to track moving objects or to determine the location and orientation of objects or the posture of people. Some of these applications rely on the use of GPU multi-core architectures for general-purpose high-performance parallel computing, which have recently attracted more and more interest from researchers, especially since handy programming environments such as nVIDIA CUDA [4] were introduced. Such environments or APIs take advantage of the computing capabilities of GPUs using parallel versions of high-level languages which require that only the highest-level details of parallel process management be explicitly encoded in the programs. The evolution both of GPUs and of the corresponding programming environments has been very fast and, up to now, far from any standardization.
The paper is organized as follows: Section 2 describes basic PSO, standard PSO, and the synchronous and asynchronous versions of the algorithm, along with their advantages and disadvantages. Design and implementation details are provided in Section 3. Section 4 summarizes and compares the results obtained on a classical benchmark, and concluding remarks are presented in Section 5.

2. PSO OVERVIEW
This section presents four versions of the PSO algorithm with their advantages and disadvantages. The basics of PSO are presented first, followed by standard PSO and then the synchronous and asynchronous PSO algorithms.

IJESAT | Jan-Feb 2012


Available online @ http://www.ijesat.org 92


2.1 PSO basics:


The core of PSO [2] is represented by the two functions that update a particle's velocity and position within the domain of the fitness function at time t+1, which can be computed using the following equations:

$$V_{id}(t+1) = w\,V_{id}(t) + c_1 r_1 \big(P_{id}(t) - X_{id}(t)\big) + c_2 r_2 \big(P_{gd}(t) - X_{id}(t)\big) \qquad (1)$$

$$X_{id}(t+1) = X_{id}(t) + V_{id}(t+1) \qquad (2)$$

where i = 1, 2, ..., N, and N is the number of particles in the swarm, namely the population; d = 1, 2, ..., D, and D is the dimension of the solution space. In Equations (1) and (2), the learning factors $c_1$ and $c_2$ are nonnegative constants, $r_1$ and $r_2$ are random numbers uniformly distributed in the interval [0, 1], and $V_{id} \in [-V_{max}, V_{max}]$, where $V_{max}$ is a chosen maximum velocity, a constant preset according to the objective function. If the velocity on one dimension exceeds the maximum, it is set to $V_{max}$. This parameter controls the convergence rate of PSO and can prevent the velocity from growing too fast. The parameter w is the inertia weight, a constant in the interval [0, 1] used to balance the global and local search abilities. $X_i(t+1)$ is the position of the particle at time t+1, $P_i(t)$ is the best-fitness position reached by the particle up to time t (also termed the personal attractor), and $P_g(t)$ is the best-fitness point ever found by the whole swarm (the social attractor). Despite its simplicity, PSO is known to be quite sensitive to the choice of its parameters. Under certain conditions, though, it can be proved that the swarm reaches a state of equilibrium, where particles converge onto a weighted average of their personal best and global best positions.

2.2 Standard Particle Swarm Optimization (SPSO)

In 2007, Daniel Bratton and James Kennedy defined a Standard Particle Swarm Optimization (SPSO), a straightforward extension of the original algorithm that takes into account more recent developments which can be expected to improve performance on standard measures. SPSO differs from the original PSO mainly in the following aspects [7]:

1) Swarm Communication Topology: The original PSO uses a global topology, shown in Figure 2.1(a). In this topology, the best particle, which is responsible for the velocity updating of all the particles, is chosen from the whole swarm population. In SPSO there is no global best; each particle only uses a local best particle for velocity updating, which is chosen from its left neighbor, its right neighbor, and itself. We call this a local or ring topology, as shown in Figure 2.1(b) (assuming that the swarm has a population of 12).

2) Inertia Weight and Constriction: In PSO, an inertia weight parameter w was designed to adjust the influence of the previous particle velocities on the optimization process. By adjusting the value of w, the swarm has a greater tendency to eventually constrict itself down to the area containing the best fitness and explore that area in detail. Similar to the parameter w, SPSO introduced a new parameter $\chi$, called the constriction factor, which is derived from the existing constants in the velocity update equation:

$$\chi = \frac{2}{\left|2 - \varphi - \sqrt{\varphi^2 - 4\varphi}\right|}, \qquad \varphi = c_1 + c_2$$

and the velocity updating formula in SPSO is:

$$V_{id}(t+1) = \chi \Big( V_{id}(t) + c_1 r_1 \big(P_{id}(t) - X_{id}(t)\big) + c_2 r_2 \big(P_{ld}(t) - X_{id}(t)\big) \Big) \qquad (3)$$

where $P_l$ is no longer the global best but the local best.

Statistical tests have shown that, compared to PSO, SPSO returns better results while retaining the simplicity of PSO. The introduction of SPSO gives researchers a common grounding to work from, and SPSO can be used as a means of comparison for future developments and improvements of PSO.

Figure 2.1. PSO and SPSO Topologies
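To make Equations (1) to (3) and the ring topology concrete, the following is a minimal sequential sketch in Python. It is not the authors' implementation: the function names (`constriction`, `ring_best`, `pso_update`, `spso_update`) are hypothetical, minimization is assumed, and the sample values c1 = c2 = 2.05 in the usage note are an assumption taken from common SPSO practice rather than from this paper.

```python
import math
import random

def constriction(c1, c2):
    """Constriction factor chi computed from phi = c1 + c2 (needs phi > 4)."""
    phi = c1 + c2
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))

def ring_best(pbest_fit, i):
    """Index of the lowest-fitness particle among i and its two ring neighbours."""
    n = len(pbest_fit)
    return min(((i - 1) % n, i, (i + 1) % n), key=lambda k: pbest_fit[k])

def pso_update(x, v, p_i, p_g, w, c1, c2, v_max, rng=random):
    """Equations (1) and (2) for one particle, clamping velocity to [-v_max, v_max]."""
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        v[d] = w * v[d] + c1 * r1 * (p_i[d] - x[d]) + c2 * r2 * (p_g[d] - x[d])
        v[d] = max(-v_max, min(v_max, v[d]))
        x[d] += v[d]

def spso_update(x, v, p_i, p_l, chi, c1, c2, rng=random):
    """Equation (3) followed by the position update of Equation (2)."""
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        v[d] = chi * (v[d] + c1 * r1 * (p_i[d] - x[d]) + c2 * r2 * (p_l[d] - x[d]))
        x[d] += v[d]
```

For example, `constriction(2.05, 2.05)` evaluates to roughly 0.7298, and `ring_best(fit, i)` returns the index whose personal best would play the role of $P_l$ for particle i. Both update functions modify `x` and `v` in place.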

2.3 Synchronous PSO:


A main feature that affects the search performance of PSO is the strategy according to which the social attractor is updated. In synchronous PSO [3], the positions and velocities of all particles are updated one after another within a generation; this is actually a full algorithm iteration, which corresponds to one discrete time unit. Within the same generation, after velocity and position have been updated, each particle's fitness, corresponding to its new position, is evaluated. The value of the social attractor is only updated at the end of each generation, when the fitness values of all particles in the swarm are known. The sequence of steps for synchronous PSO is shown in Figure 2.2.
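The synchronous update rule can be sketched as one sequential generation in Python. This is an illustrative sketch under stated assumptions, not the authors' code: `sync_generation` and `sphere` are hypothetical names, minimization is assumed, a global social attractor is used for brevity, and the defaults w = 0.729 and c = 2.0 simply reuse the parameter values reported in the Results section.

```python
import random

def sphere(x):
    """Sample fitness function: sum of squares."""
    return sum(xi * xi for xi in x)

def sync_generation(X, V, P, P_fit, fitness, w=0.729, c=2.0, rng=random):
    """Run one synchronous generation; returns the best personal-best fitness."""
    # The social attractor is fixed for the whole generation.
    g = min(range(len(X)), key=lambda i: P_fit[i])
    g_pos = list(P[g])
    # Step 1: update velocity and position of every particle.
    for x, v, p in zip(X, V, P):
        for d in range(len(x)):
            v[d] = (w * v[d] + c * rng.random() * (p[d] - x[d])
                    + c * rng.random() * (g_pos[d] - x[d]))
            x[d] += v[d]
    # Step 2: evaluate the fitness of every particle at its new position.
    fits = [fitness(x) for x in X]
    # Step 3: only now, with all fitness values known, update the attractors.
    for i, f in enumerate(fits):
        if f < P_fit[i]:
            P_fit[i] = f
            P[i] = list(X[i])
    return min(P_fit)
```

Because the attractors are only touched in step 3, no particle can see a neighbour's improvement until the next generation, which is exactly the property the asynchronous variant gives up.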


3. SYSTEM DESIGN AND IMPLEMENTATION DETAILS

3.1 Parallel PSO for GPUs

Almost all recent GPU implementations assign one thread to each particle [3, 5, 6], which in turn means that fitness evaluations have to be computed sequentially in a loop within each particle's thread. Since fitness calculation is often the most computation-intensive part of the algorithm, the execution time of such implementations is affected by the complexity of the fitness function and the dimensionality of the search domain. These GPU implementations of PSO do not take full advantage of the GPU power in evaluating the fitness function in parallel: the parallelization only occurs over the number of particles in a swarm and ignores the dimensions of the function. In the parallel implementations presented here, the thread parallelization is as fine-grained as possible [1]; in other words, all independent sequential parts of the code are allowed to run simultaneously in separate threads. However, the performance of an implementation does not depend only on the design choices, but also on the GPU architecture, the data access scheme and layout, and the programming model, which in this case is CUDA. Therefore, it seems appropriate to outline the CUDA architecture and introduce some of its terminology.

3.2 CUDA Background

CUDA is a programming model and instruction set architecture that leverages the parallel computing capabilities of nVIDIA GPUs to solve complex problems more efficiently than a CPU. At the abstract level, the programming model requires that the developer divide the problem into coarse sub-problems, namely thread blocks, that can be solved independently in parallel, and each sub-problem into finer pieces that can be solved cooperatively by all threads within the block [4]. From the software point of view, a kernel is equivalent to a high-level programming language function or method containing all the instructions to be executed by all threads of each thread block. Finally, at the hardware level, nVIDIA GPUs consist of a number of identical multithreaded Streaming Multiprocessors (SMs), each of which is made up of several cores that are able to run one thread block at a time. As the program invokes a kernel, a scheduler assigns thread blocks to SMs according to the number of available cores on each SM; the scheduler also ensures that delayed blocks are executed in an orderly fashion when more resources or cores are free. This makes a CUDA program automatically scalable on any number of SMs and cores.

The last thing to highlight is the memory hierarchy available to threads, and the performance associated with read/write operations from/to each of the memory levels. Each thread has its own local registers, and all threads belonging to the same thread block can cooperate through shared memory. Registers and shared memory are physically embedded inside SMs and provide threads with the fastest possible memory access. Their lifetime is the same as that of the thread block. All the threads of a kernel can also access global memory, whose content persists over all kernel launches [4]; however, read and write operations to global memory are orders of magnitude slower than those to shared memory and registers, therefore access to global memory should be minimized within a kernel. The design and implementation issues of our algorithms are presented in the following sections.
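The idea of one thread per problem dimension cooperating through shared memory can be emulated sequentially. The Python sketch below is only an illustration: the paper does not show its kernel code, so the pairwise tree reduction used here is an assumption (it is a common way for the threads of a block to sum per-dimension terms), and `block_fitness` is a hypothetical name. The list `shared` stands in for a block's shared memory and the loop index `t` for a thread index.

```python
def block_fitness(x, term=lambda xi: xi * xi):
    """Emulate one thread block: thread d computes the term for dimension d,
    then the block sums all terms with a pairwise tree reduction.
    Returns (sum, number_of_reduction_steps)."""
    # Each "thread" writes its own term into "shared memory".
    shared = [term(xi) for xi in x]
    # Pad to a power of two so every reduction step halves the active range.
    size = 1
    while size < len(shared):
        size *= 2
    shared += [0.0] * (size - len(shared))
    # At each step the first `stride` threads each add one partner element.
    stride = size // 2
    steps = 0
    while stride > 0:
        for t in range(stride):  # these additions do not depend on each other
            shared[t] += shared[t + stride]
        stride //= 2
        steps += 1
    return shared[0], steps
```

With D terms the reduction needs about log2(D) dependent steps instead of the D sequential additions of a one-thread-per-particle loop, which is the motivation for the finer-grained design described above.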


3.3 Synchronous GPU-SPSO


The synchronous implementation [3] comprises three stages (kernels), namely: positions update, fitness evaluation, and bests update. Each kernel is parallelized to run a thread for each problem dimension. The function under consideration is optimized by iterating the three kernels needed to perform one PSO generation. The three kernels must be executed sequentially, and synchronization must occur at the end of each kernel run; Figure 3.1 illustrates this structure. Since the algorithm is divided into three independent sequential kernels, each kernel must load all the data it needs at the start and store the data back into global memory at the end of its execution, because in CUDA information can be shared between different kernels only through global memory. To better understand the difference between synchronous and asynchronous PSO, the pseudo-code of the sequential versions of the algorithms is presented in Figure 2.2 and Figure 2.3. The synchronous 3-kernel implementation of GPU-SPSO, while allowing for virtually any swarm size, requires synchronization points where all the particles' data have to be saved to global memory to be read by the next kernel. This frequent access to global memory limits the performance of synchronous GPU-SPSO and is the main justification behind the asynchronous implementation.

3.4 Asynchronous GPU-SPSO:

The design of the parallelization process for the asynchronous version [1] is the same as for the synchronous one, that is: we allocate a thread block per particle, each of which executes a thread per problem dimension. This way every particle evaluates its fitness function and updates position, velocity, and personal best for each dimension in parallel. The main effect of removing the synchronization constraint is to let each particle evolve independently of the others, which allows it to keep all its data in fast-access local and shared memory, effectively removing the need to store and maintain the global best in global memory. In practice, every particle checks its neighbors' personal best fitnesses, and then updates its own personal best in global memory only if it is better than the previously found personal best fitness. This can speed up execution dramatically, particularly when the fitness function itself is highly parallelizable.

In contrast to the synchronous version, all particle thread blocks must be executing simultaneously, i.e., no sequential scheduling of thread blocks to processing cores is employed, as there is no explicit point of synchronization of all particles. Two diagrams representing the parallel execution of both versions are shown in Figure 3.1. Having the swarm particles evolve independently not only makes the algorithm more biologically plausible, but also makes the swarm more reactive to newly discovered minima/maxima [1]. The price to be paid is a limitation on the number of particles in a swarm, which must match the maximum number of thread blocks that a given GPU can keep executing in parallel. This is not such a relevant shortcoming, as one of PSO's nicest features is its good search effectiveness; because of this, only a small number of particles (a few dozen) is usually enough for a swarm search to work, which compares very favorably to the number of individuals usually required by evolutionary algorithms to achieve good performance when high-dimensional problems are tackled. Also, parallel processing chips are currently scaling according to Moore's law, and GPUs are being equipped with more processing cores with the introduction of every new model.

Figure 3.1: Asynchronous CUDA-PSO: particles run in parallel independently (top). Synchronous CUDA-PSO: particles synchronize at the end of each kernel (bottom).
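The asynchronous behavior described in this section can be emulated on a CPU by letting particles advance one at a time in arbitrary order, each reading its ring neighbours' personal bests as they currently stand. The Python sketch below is a sequential stand-in under stated assumptions, not the CUDA kernel: `async_particle_step` and `run_async` are hypothetical names, minimization of a sum-of-squares fitness is assumed, and chi = 0.729 with c = 2.0 reuses the parameter values reported in the Results section inside the constricted form of Equation (3).

```python
import random

def async_particle_step(i, X, V, P, P_fit, fitness, chi=0.729, c=2.0, rng=random):
    """Advance particle i alone: move, evaluate, publish its personal best at once."""
    n = len(X)
    # Read the neighbours' personal bests as they are right now (ring topology).
    best = min(((i - 1) % n, i, (i + 1) % n), key=lambda k: P_fit[k])
    x, v = X[i], V[i]
    for d in range(len(x)):
        v[d] = chi * (v[d] + c * rng.random() * (P[i][d] - x[d])
                      + c * rng.random() * (P[best][d] - x[d]))
        x[d] += v[d]
    f = fitness(x)
    if f < P_fit[i]:
        # Visible to the neighbours immediately; there is no end-of-generation barrier.
        P_fit[i] = f
        P[i] = list(x)
    return f

def run_async(n=12, dim=5, steps=600, seed=0):
    """Drive the swarm with particles stepping in arbitrary order.
    Returns (initial best fitness, final best fitness)."""
    rng = random.Random(seed)
    fitness = lambda x: sum(xi * xi for xi in x)
    X = [[rng.uniform(50.0, 100.0) for _ in range(dim)] for _ in range(n)]
    V = [[0.0] * dim for _ in range(n)]
    P = [list(x) for x in X]
    P_fit = [fitness(x) for x in X]
    start = min(P_fit)
    for _ in range(steps):
        async_particle_step(rng.randrange(n), X, V, P, P_fit, fitness, rng=rng)
    return start, min(P_fit)
```

The random scheduling in `run_async` only mimics the lack of a common synchronization point; on the GPU each particle would be a thread block running concurrently with the others.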


4. RESULTS

This section compares the performance of different parallel PSO implementations and of the sequential PSO implementations on a classical benchmark comprising a set of functions that are often used to evaluate stochastic optimization algorithms. The goal was to compare the parallel PSO implementations with one another and with the sequential implementations in terms of execution time and speedup, while checking that the quality of results was not badly affected by the parallelization. All parameters of the algorithm were therefore kept equal in all tests, setting them to the standard values suggested in [7]: w = 0.729 and C1 = C2 = 2.000. Also, for the comparison to be as fair as possible, the SPSO was adapted by substituting its original stochastic-star topology with the same ring topology adopted in the parallel GPU-based versions.

The following implementations of PSO have been compared:
1. The sequential synchronous SPSO version, modified to implement a two-nearest-neighbors ring topology (cpu-syn).
2. The sequential asynchronous PSO version, which uses a stochastic star topology (cpu-asyn).
3. The synchronous three-kernel version of GPU-SPSO (gpu-syn).
4. The asynchronous one-kernel version of GPU-SPSO (gpu-asyn).

For the experiments, the parallel algorithms were developed using CUDA version 4.0. Tests were performed on the graphics card described in Table 4.1. The sequential implementations were run on a PC powered by a 64-bit Intel(R) Core(TM) i3 CPU running at 2.27 GHz. For random number generation, two random integers P1, P2 in [0, M - D·N] are passed from the CPU to the GPU; 2·D·N numbers can then be drawn from array R starting at R(P1) and R(P2), respectively, instead of transporting 2·D·N numbers from the CPU to the GPU.

TABLE 4.1: MAJOR TECHNICAL FEATURES OF THE GPU USED FOR THE EXPERIMENTS

Feature                      Value
Model name                   GeForce GTS250
GPU clock (GHz)              1.62
Streaming Multiprocessors    16
CUDA cores                   128
Bus width (bit)              256
Memory (MB)                  1024
Memory clock (MHz)           1000.0
CUDA compute capability      1.1

Performance Metric:
Computational cost (C), also known as execution time, is defined as the processing time (in seconds) that the PSO algorithm consumes. Computational throughput (V) is defined as the inverse of the computational cost:

$$V = \frac{1}{C}$$

Speedup (S) measures the achieved execution time improvement. It is a ratio that evaluates how fast the variant of interest is in comparison with the reference variant:

$$S = \frac{V_{par}}{V_{ref}}$$

where $V_{par}$ is the throughput of the parallel implementation under study, and $V_{ref}$ is the throughput of the reference implementation, i.e. the sequential implementation. Equivalently, $S = C_{ref} / C_{par}$.

The code was tested on the standard benchmark functions shown in Table 4.2, and the results are plotted in Figure 4.1 (Sphere function), Figure 4.2 (Rastrigin function), Figure 4.3 (Rosenbrock function), and Figure 4.4 (Griewank function). For each function, two groups of comparisons were made, which are listed with Figures 4.1 to 4.4.

TABLE 4.2: BENCHMARK TEST FUNCTIONS

Name              Equation                                                                              Bounds            Initial bounds    Optimum
f1 (Sphere)       $\sum_{i=1}^{D} x_i^2$                                                                (-100, 100)       (50, 100)         0.0^D
f2 (Rastrigin)    $\sum_{i=1}^{D} [x_i^2 - 10\cos(2\pi x_i) + 10]$                                      (-5.12, 5.12)     (2.56, 5.12)      0.0^D
f3 (Rosenbrock)   $\sum_{i=1}^{D-1} (100(x_{i+1} - x_i^2)^2 + (x_i - 1)^2)$                             (-30, 30)         (15, 30)          0.0^D
f4 (Griewank)     $\frac{1}{4000}\sum_{i=1}^{D} x_i^2 - \prod_{i=1}^{D} \cos(x_i/\sqrt{i}) + 1$         (-600, 600)       (-600, 600)       0.0^D
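For reference, the four benchmark functions can be written directly from their standard definitions. The Python sketch below is a plain sequential version for checking values, not the GPU code used in the experiments; the function names are chosen here for illustration. Each function has a minimum value of 0, attained at the origin for Sphere, Rastrigin, and Griewank and at x_i = 1 for Rosenbrock.

```python
import math

def sphere(x):
    """f1: sum of squares."""
    return sum(xi * xi for xi in x)

def rastrigin(x):
    """f2: sum of x_i^2 - 10 cos(2 pi x_i) + 10."""
    return sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) + 10.0 for xi in x)

def rosenbrock(x):
    """f3: sum over i = 1..D-1 of 100 (x_{i+1} - x_i^2)^2 + (x_i - 1)^2."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1.0) ** 2
               for i in range(len(x) - 1))

def griewank(x):
    """f4: (1/4000) sum x_i^2 - prod cos(x_i / sqrt(i)) + 1, with i starting at 1."""
    s = sum(xi * xi for xi in x) / 4000.0
    p = math.prod(math.cos(xi / math.sqrt(i)) for i, xi in enumerate(x, start=1))
    return s - p + 1.0
```

Rastrigin and Griewank both call a cosine for every dimension, which is the kind of per-dimension work that the thread-per-dimension design spreads across a thread block.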


a) Execution time vs. swarm population: w = 0.729, D = 50, Iter = 2000

b) Achieved speedup vs. swarm population: w = 0.729, D = 50, Iter = 2000

c) Execution time vs. problem dimension: w = 0.729, N = 100, Iter = 2000

d) Achieved speedup vs. problem dimension: w = 0.729, N = 100, Iter = 2000

Figure 4.1: execution time and speedup vs. swarm population size and problem dimensions for Sphere function

a) Execution time vs. swarm population: w = 0.729, D = 50, Iter = 5000

b) Achieved speedup vs. swarm population: w = 0.729, D = 50, Iter = 5000


c) Execution time vs. problem dimension: w = 0.729, N = 100, Iter = 5000

d) Achieved speedup vs. problem dimension: w = 0.729, N = 100, Iter = 5000

Figure 4.2: execution time and speedup vs. swarm population size and problem dimensions for Rastrigin function

a) Execution time vs. swarm population: w = 0.729, D = 50, Iter = 22000

b) Achieved speedup vs. swarm population: w = 0.729, D = 50, Iter = 22000

c) Execution time vs. problem dimension: w = 0.729, N = 100, Iter = 22000

d) Achieved speedup vs. problem dimension: w = 0.729, N = 100, Iter = 22000

Figure 4.3: execution time and speedup vs. swarm population size and problem dimensions for Rosenbrock function

The comparisons made for each function are:
1. Keeping problem dimension constant:
   a. Execution time vs. swarm population size
   b. Achieved speedup vs. swarm population size
2. Keeping swarm population size constant:
   a. Execution time vs. problem dimension
   b. Achieved speedup vs. problem dimension


a) Execution time vs. swarm population: w = 0.729, D = 50, Iter = 2000

b) Achieved speedup vs. swarm population: w = 0.729, D = 50, Iter = 2000

c) Execution time vs. problem dimension: w = 0.729, N = 100, Iter = 2000

d) Achieved speedup vs. problem dimension: w = 0.729, N = 100, Iter = 2000

Figure 4.4: execution time and speedup vs. swarm population size and problem dimensions for Griewank Function

In general, the asynchronous version was much faster than the synchronous version. The asynchronous version allows the social attractors to be updated immediately after evaluating each particle's fitness, which causes the swarm to move more promptly towards newly found optima. From Figures 4.1, 4.2, 4.3 and 4.4, it is clear that the GPU-asynchronous version takes less execution time than the others. The execution time of the sequential algorithm behaves unexpectedly: it appears to be non-monotonic with problem dimension, showing a surprising decrease as problem dimension becomes larger. A possible reason is that code optimization (or the hardware itself) might cause several multiplications to be directly set to zero without even being performed, as soon as the sum of the exponents of the two factors is below the precision threshold; a similar though opposite consideration can be made for additions and the sum of the exponents. One observation is that adding up terms all of comparable magnitude is much slower than adding the same number of terms on very different scales. It is also worth noticing that the execution time graphs are virtually identical for the functions taken into consideration, which shows that GPUs are extremely effective at computing arithmetic-intensive functions, mostly independently of the set of operators used, and that memory allocation issues are prevalent in determining performance.

The asynchronous version of the GPU-SPSO algorithm was able to significantly reduce execution time with respect not only to the sequential versions but also to the synchronous version of GPU-SPSO. Depending on the degree of parallelization allowed by the fitness functions we considered, the asynchronous version of GPU-SPSO could reach speedups ranging from 25 (Rosenbrock, Griewank) to over 75 (Rastrigin) with respect to the sequential implementation, and often of more than one order of magnitude with respect to the corresponding GPU-based 3-kernel synchronous version. From the results, one can also notice that the best performances were obtained on the Rastrigin function. This is presumably a result of the presence of advanced math functions in its definition. In fact, GPUs have internal fast math functions which can provide good computation speed at the cost of slightly lower accuracy, which causes no problems in this case.


5. CONCLUSION & FUTURE SCOPE


It was observed from the results that the asynchronous version of the GPU-SPSO algorithm was able to significantly reduce execution time with respect not only to the sequential versions but also to the synchronous version of GPU-SPSO. Depending on the degree of parallelization allowed by the fitness functions we considered, the asynchronous version of CUDA-PSO could reach speedups of up to about 75 (in the tests with the highest-dimensional Rastrigin functions) with respect to the sequential implementation, and often more than one order of magnitude with respect to the corresponding GPU-based 3-kernel synchronous version, sometimes showing a limited, possibly only apparent, decrease in search performance. Hence asynchronous GPU-SPSO is preferable for optimizing functions with more complex arithmetic, given the same swarm population, dimension, and number of iterations; furthermore, functions with more complex arithmetic achieve a higher speedup. For problems with a large swarm population or higher dimensions, the asynchronous GPU-SPSO will provide better performance. Since most display cards in current common PCs have GPU chips, more researchers can make use of this parallel asynchronous GPU-SPSO to solve their practical problems.

Future scope includes updating and extending this asynchronous GPU-SPSO to applications of PSO and improving its performance. Other interesting developments may be offered by the availability of OpenCL, which will allow owners of GPUs other than nVIDIA's (as well as multi-core CPUs, which are also supported) to implement parallel algorithms on their own computing architectures. The availability of shared code which allows for optimized code parallelization even on more traditional multi-core CPUs will make the comparison between GPU-based and multi-core CPU implementations easier (and, possibly, fairer), besides allowing for a possible optimized hybrid use of computing resources in modern computers.

REFERENCES
[1] Mussi, L.; Nashed, Y.S.G.; Cagnoni, S., GPU-based Asynchronous Particle Swarm Optimization, Proceedings of GECCO '11, 2011, pp. 1555-1562.
[2] Kennedy, J.; Eberhart, R., Particle Swarm Optimization, IEEE International Conference on Neural Networks, Perth, WA, Australia, Nov. 1995, pp. 1942-1948.
[3] Zhou, Y.; Tan, Y., GPU based Parallel Particle Swarm Optimization, IEEE Congress on Evolutionary Computation, 2009, pp. 1493-1500.
[4] nVIDIA Corporation, nVIDIA CUDA programming guide 3.2, October 2010.
[5] Zhou, Y.; Tan, Y., Particle swarm optimization with triggered mutation and its implementation based on GPU, Proceedings of the 12th Annual Genetic and Evolutionary Computation Conference, GECCO '10, 2010, pp. 1007-1014.
[6] Venter, G.; Sobieski, J., A parallel particle swarm optimization algorithm accelerated by asynchronous evaluations, 6th World Congresses of Structural and Multidisciplinary Optimization, 2005.
[7] Bratton, D.; Kennedy, J., Defining a Standard for Particle Swarm Optimization, IEEE Swarm Intelligence Symposium, April 2007, pp. 120-127.
[8] Wilkinson, B., General Purpose Computing using GPUs: Developing a hands-on undergraduate course on CUDA programming, 42nd ACM Technical Symposium on Computer Science Education, SIGCSE 2010, Dallas, Texas, USA, March 9-12, 2010.
