
Proceedings of the

17th International Conference on


Advanced Computing and
Communications

December 14 – 17, 2009, Bangalore, India

ADCOM 2009
Copyright © 2009 by the Advanced Computing and Communications Society
All rights reserved.

Advanced Computing & Communications Society


Gate #2, CV Raman Avenue,
(Adj to Tata Book House)
Indian Institute of Science,
Bangalore, India
PIN: 560012
CONTENTS

Message from the Organizing Committee i

Steering Committee ii

Reviewers iii

GRID ARCHITECTURE
Parallel Implementation of Video Surveillance Algorithms on GPU 2
Architecture using NVIDIA CUDA
Sanyam Mehta, Ankush Mittal, Arindam Misra, Ayush Singhal, Praveen Kumar, and Kannappan
Palaniappan
Adapting Traditional Compilers onto Higher Architectures incorporating Energy 10
Optimization Methods for Sustained Performance
Prahlada Rao BB, Mangala N and Amit Chauhan

SERVER VIRTUALIZATION
Is I/O Virtualization ready for End-to-End Application Performance? 19
J. Lakshmi, S.K.Nandy
Eco-friendly Features of a Data Centre OS 26
S. Prakki

HUMAN COMPUTER INTERFACE -1


Low Power Biometric Capacitive CMOS Fingerprint Sensor System 34
Shankkar B, Roy Paily and Tarun Kumar
Particle Swarm Optimization for Feature Selection: An Application to Fusion of 38
Palmprint and Face
Raghavendra R, Bernadette Dorizzi, Ashok Rao and Hemantha Kumar

GRID SERVICES
OpenPEX: An Open Provisioning and EXecution System for Virtual Machines 45
Srikumar Venugopal, James Broberg and Rajkumar Buyya
Exploiting Grid Heterogeneity for Energy Gain 53
Saurabh Kumar Garg
Intelligent Data Analytics Console 60
Snehal Gaikwad, Aashish Jog and Mihir Kedia

COMPUTATIONAL BIOLOGY
Digital Processing of Biomedical Signals with Applications to Medicine 69
D. Narayan Dutt
Supervised Gene Clustering for Extraction of Discriminative Features from 75
Microarray Data
C. Das, P.Maji, S. Chattopadhyay
Modified Greedy Search Algorithm for Biclustering Gene Expression Data 83
S.Das, S.M. Idicula
AD-HOC NETWORKS
Solving Bounded Diameter Minimum Spanning Tree Problem Using Improved 90
Heuristics
Rajiv Saxena and Alok Singh
Ad-hoc Cooperative Computation in Wireless Networks using Ant like Agents 96
Santosh Kulkarni and Prathima Agrawal
A Scenario-based Performance Comparison Study of the Fish-eye State Routing and 83
Dynamic Source Routing Protocols for Mobile Ad hoc Networks
Natarajan Meghanathan and Ayomide Odunsi

NETWORK OPTIMIZATION
Optimal Network Partitioning for Distributed Computing Using Discrete 113
Optimization
Angeline Ezhilarasi G and Shanti Swarup K
An Efficient Algorithm to Reconstruct a Minimum Spanning Tree in an 118
Asynchronous Distributed Systems
Suman Kundu and Uttam Kumar Roy
A SAL Based Algorithm for Convex Optimization Problems 125
Amit Kumar Mishra

WIRELESS SENSOR NETWORKS


Energy Efficient Cluster Formation using Minimum Separation Distance and Relay 130
CH’s in Wireless Sensor Networks
V. V. S. Suresh Kalepu and Raja Datta
An Energy Efficient Base Station to Node Communication Protocol for Wireless 136
Sensor Networks
Pankaj Gupta, Tarun Bansal and Manoj Misra
A Broadcast Authentication Protocol for Multi-Hop Wireless Sensor Networks 144
R. C. Hansdah, Neeraj Kumar and Amulya Ratna Swain

GRID SCHEDULING
Energy-efficient Scheduling of Grid Computing Clusters 153
Tapio Niemi, Jukka Kommeri and Ari-Pekka Hameri
Energy Efficient High Available System: An Intelligent Agent Based Approach 160
Ankit Kumar, Senthil Kumar R. K. and Bindhumadhava B. S
A Two-phase Bi-criteria Workflow Scheduling Algorithm in Grid Environments 168
Amit Agarwal and Padam Kumar
HUMAN COMPUTER INTERFACE -2
Towards Geometrical Password for Mobile Phones 175
Mozaffar Afaq, Mohammed Qadeer, Najaf Zaidi and Sarosh Umar
Improving Performance of Speaker Identification System Using Complementary 182
Information Fusion
Md Sahidullah, Sandipan Chakroborty and Goutam Saha
Right Brain Testing-Applying Gestalt psychology in Software Testing 188
Narayanan Palani

MOBILE AD-HOC NETWORKS


Intelligent Agent based QoS Enabled Node Disjoint Multipath Routing 193
Vijayashree Budyal, Sunilkumar Manvi and Sangamesh Hiremath
Close to Regular Covering by Mobile Sensors with Adjustable Ranges 200
Adil Erzin, Soumyendu Raha and.V.N. Muralidhara
Virtual Backbone Based Reliable Multicasting for MANET 204
Dipankaj Medhi

DISTRIBUTED SYSTEMS
Exploiting Multi-context in a Security Pattern Lattice for Facilitating User Navigation 215
Achyanta Kumar Sarmah, Smriti Kumar Sinha and Shyamanta Moni Hazarika
Trust in Mobile Ad Hoc Service GRID 223
Sundar Raman S and Varalakshmi P
Scheduling Light-trails on WDM Rings 227
Soumitra Pal and Abhiram Ranade

FOCUSSED SESSION ON RECONFIGURABLE COMPUTING


AES and ECC Cryptography Processor with Runtime Reconfiguration 236
Samuel Anato, Ricardo Chaves, Leonel Sousa
The Delft Reconfigurable VLIW Processor 244
Stephen Wong, Fakhar Anjam
Runtime Reconfiguration of Polyhedral Process Network Implementation 252
Hristo Nikolov, Todur Stefanov, Ed Depprettere
REDEFINE: Optimizations for Achieving High Throughput 259
Keshavan Varadarajan, Ganesh Garga,Mythri Alle, S K Nandy, Ranjani Narayan
Poster Papers
A Comparative Study of Different Packet Scheduling Algorithms with Varied 267
Network Service Load In IEEE 802.16 Broadband Wireless Access Systems
Prasun Chowdhury and Iti Saha Misra
A Simulation Based Comparison of Gateway Load Balancing Strategies in 270
Integrated Internet-MANET
Rafi-U-Zaman, Khaleel-Ur-Rahman Khan, M.A. Razzaq and A. Venugopal Reddy
ECAR: An Efficient Channel Assignment and Routing in Wireless Mesh Network 273
S. V. Rao and Chaitanya P. Umbare
Rotational Invariant Texture Classification of Color Images using Local Texture 276
Patterns
A. Suruliandi, E.M. Srinivasan and K. Ramar
Time Synchronization for an Efficient Sensor Network System 280
Anita Kanavalli, Vijay Krishan, Ridhi Hirani, Santosh Prasad, Saranya K.,P Deepa Shenoy, and
Venugopal K R
Parallel Hybrid Germ Swarm Computing for Video Compression 283
K. M. Bakwad, S.S. Pattnaik, B. S. Sohi, S. Devi, B.K. Panigrahi and M. R. Lohokare
Texture Classification using Local Texture Patterns: A Fuzzy Logic Approach 286
E.M. Srinivasan, A. Suruliandi and K. Ramar
Integer Sequence based Discrete Gaussian and Reconfigurable Random Number 290
Generator
Arulalan Rajan, H S Jamadagni, Ashok Rao
Parallelization of PageRank and HITS Algorithm on CUDA Architecture 294
Kumar Ishan, Mohit Gupta, Naresh Kumar, Ankush Mittal
Designing Application Specific Irregular Topology for Network-on-Chip 297
Virendra Singh, Naveen Choudhary, M.S. Gaur and V. Laxmi
QoS Aware Minimally Adaptive XY routing for NoC 300
Navaneeth Rameshan, Mushtaq Ahmed, M.S. Gaur, Vijay Laxmi and Anurag Biyani
Message from the Organizers

Welcome to the 17th International Conference on Advanced Computing and


Communications (ADCOM 2009) being held at the Indian Institute of Science,
Bangalore, India during December 14-18, 2009.
ADCOM, the flagship event of the Advanced Computing and Communication
Society (ACCS), is a major international conference that attracts professionals
from industry and academia across the world to share and disseminate their
innovative and pioneering views on recent trends and developments in
computational sciences. ACCS is a registered scientific society founded to
provide a forum to individuals, institutions and industry to promote advanced
Computing and Communication technologies.
Building upon the success of last year’s conference, the 2009 Conference will
focus on "Green Computing" to promote higher standards for energy-efficient
data centers, central processing units, servers and peripherals as well as reduced
resource consumption towards a sustainable 'green' ecosystem. ADCOM will also
explore computing for the rural masses in improving delivery of public services
like education and primary health care.
Prof. Patrick Dewilde from Technical University Munich, and Prof. N.
Balakrishnan from the Indian Institute of Science are the General Chairs for
ADCOM 2009. The organizers thank Padma Bhushan Professor Thomas
Kailath, Professor Emeritus at Stanford University, for being the Chief Guest
at the inaugural event of the conference, and honor the “DAC-ACCS
Foundation Awards 2009” awardees Prof. Raj Jain of Washington University,
USA, and Prof. Anurag Kumar of the Indian Institute of Science, India, for their
exceptional contributions to the advancement of networking technologies.
The conference features 8 plenary and 8 invited talks from internationally
acclaimed leaders in industry and academia. The Programme Committee
had the arduous task of selecting 30 papers to be presented in 12 sessions and 11
poster presentations from a total of 326 submissions. ADCOM 2009 will have a
Special Focused session on “Emerging Reconfigurable Systems” with 4 invited
presentations to be followed by an open forum for discussions. In tune with the
theme of the conference, an Industry Session on Green Datacenters is organized
to disseminate awareness of energy-efficient solutions. A total of 8 tutorials on
current topics in various aspects of computing are arranged following the main
conference.
The organizers sincerely thank all authors, reviewers, programme committee
members, volunteers and participants for their continued support for the success
of ADCOM 2009. We welcome you all to enjoy the green and serene ambience
of the Indian Institute of Science, the venue of the conference, in the IT capital of
India.

i
ADCOM 2009 STEERING COMMITTEE

Patron
P. Balaram, IISc, India

General Chairs
N. Balakrishnan, IISc, India
Patrick Dewilde, TU Munich

Technical Programme Chairs


S. K. Nandy, IISc, India
S Uma Mahesh, Indrion, India

Organising Chairs
B.S. Bindhumadhava, CDAC, India
S. K. Sinha, CEDT, India

Industry Chairs
H. S. Jamadagni, IISc, India
Krithiwas Neelakantan, SUN, India
Lavanya Rastogi, Value-One, India
Saragur M. Srinidhi, Prometheus Consulting, India

Publicity & Media Chairs


G. L. Ganga Prasad, CDAC, India
P. V. G. Menon, VANN Consulting, India

Publications Chair
K. Rajan, IISc, India

Finance Chairs
G. N. Rathna, IISc, India

Advisory Committee
Harish Mysore, India
K. Subramanian, IGNOU, India
Ramanath Padmanabhan, INTEL, India
Sunil Sherlekar, CRL, India
Vittal Kini, Intel, India
N. Rama Murthy, CAIR, India
Ashok Das, Sun Moksha, India
Sridhar Mitta, India

ii
Reviewers for ADCOM 2009

The following reviewers participated in the review process of ADCOM 2009. We gratefully
acknowledge them for their contributions.

Benjamin Premkumar P S Sastry


Dhamodaran Sampath Parag C Prasad
Ilia Polian R Govindarajan
Jordi Torres Rahul Banerjee
Lipika Ray Santanu Mahapatra
Hari Gupta Sathish S Vadhiyar
Sudha Balodia Shipra Agarwal
Madhu Gopinathan Soumyendu Raha
Kapil Vaswani Srinivasan Murali
K V Raghavan Sundararajan V
Chakraborty Joy T V Prabhakar
Aditya Kanade Thara Angksun
Arnab De V Kamakoti
Aditya Kanade Vadlamani Lalitha
Aditya Nori Veni Madhavan C E
Karthik Raghavan Venkataraghavan k
Rajugopal Gubbi Vinayak Naik
A Sriniwas Virendra Singh
P V Ananda Mohan Vishwanath G
Anirban Ghosh V C V Rao
Asim Yarkhan Gopinath K
Bhakthavathsala Vivekananda Vedula
Chandra Sekhar Seelamantula Y N Srikant
Chiranjib Bhattacharyya Zhizhong Chen
Debnath Pal
Debojyoti Dutta
Deepak D' Souza
Haresh Dagale
Joy Kuri
K R Ramakrishnan
R. Krishna Kumar
K S Venkataraghavan
Krishna Kumar R K
M J Shankar Raman
Manikandan Karuppasamy
Mrs J Lakshmi
Nagasuma Chandra
Nagi Naganathan
Narahari Yadati
Natarajan Kannan
Natarajan Meghanathan
Neelesh B Mehta
Nirmal Kumar Sancheti

iii
ADCOM 2009
GRID ARCHITECTURE

Session Papers:

1. Sanyam Mehta, Ankush Mittal, Arindam Misra, Ayush Singhal, Praveen Kumar, and
Kannappan Palaniappan, “Parallel Implementation of Video Surveillance Algorithms on GPU
Architecture using NVIDIA CUDA”.

2. Prahlada Rao BB, Mangala N and Amit Chauhan, “Adapting Traditional Compilers onto
Higher Architectures incorporating Energy Optimization Methods for Sustained Performance”.

Parallel Implementation of Video Surveillance Algorithms on GPU
Architecture using NVIDIA CUDA

Sanyam Mehta‡, Arindam Misra‡, Ayush Singhal‡, Praveen Kumar†, Ankush Mittal‡,
Kannappan Palaniappan†
‡ Department of Electronics and Computer Engineering, Indian Institute of Technology, Roorkee, India
† Department of Computer Science, University of Missouri-Columbia, USA
E-mail: san01uec@iitr.ernet.in, ari07uce@iitr.ernet.in, ayush488@gmail.com,
praveen.kverma@gmail.com, ankumfec@iitr.ernet.in, palaniappank@missouri.edu

Abstract

At present, high-end workstations and clusters are the hardware commonly used for real-time video surveillance. In this paper we propose a real-time framework for a 640×480 frame size at 30 frames per second (fps) on an NVIDIA graphics processing unit (GPU), the GeForce 8400 GS, which costs only Rs. 4000 and comes with many laptops and PCs. The processing of surveillance video is computationally intensive and involves algorithms such as the Gaussian Mixture Model (GMM), morphological image operations and Connected Component Labeling (CCL). The challenges faced in parallelizing Automated Video Surveillance (AVS) were: (i) previous work had shown difficulty in parallelizing CCL on CUDA due to the dependencies between sub-blocks while merging; (ii) the overhead due to a large number of memory transfers reduces the speedup obtained by parallelization. We present an innovative parallel implementation of the CCL algorithm that overcomes the merging problem. The algorithms scale well for small as well as large image sizes. We have optimized the implementations of the above-mentioned algorithms and achieved speedups of 10X, 260X and 11X for GMM, morphological image operations and CCL respectively, as compared to the serial implementation, on the GeForce GTX 280.

Keywords: GPU, thread hierarchy, erosion, dilation, real-time object detection, video surveillance.

1. Introduction

Automated Video Surveillance is a sector witnessing a surge in demand owing to its wide range of applications: traffic monitoring, security of public places and critical infrastructure like dams and bridges, preventing cross-border infiltration, identification of military targets, and providing crucial evidence in trials of unlawful activities [11][13]. Obtaining the desired frame processing rate of 24-30 fps in real time for such algorithms is the major challenge faced by developers. Furthermore, with recent advancements in video and network technology, there is a proliferation of inexpensive network-based cameras and sensors for widespread deployment at any location. With the deployment of progressively larger systems, often consisting of hundreds or even thousands of cameras distributed over a wide area, video data from several cameras need to be captured, processed at a local processing server and transmitted to the control station for storage. Since an enormous amount of media stream data has to be processed in real time, a High Performance Computing (HPC) solution is required to obtain an acceptable frame processing throughput.

The recent introduction of many parallel architectures has ushered in a new era of parallel computing for real-time implementations of video surveillance algorithms. Various strategies for parallel implementation of video surveillance on multi-cores have been adopted in earlier works [1][2], including our work on the Cell Broadband Engine [15]. Grid-based solutions have a high communication overhead, and cluster implementations are very costly.

Recent developments in GPU architecture have provided an effective tool to handle this workload. The GeForce GTX 280 GPU is a massively parallel, unified shader design consisting of 240 individual

stream processors, having a single precision floating point capability of 933 GFlops. CUDA enables new applications with a standard platform for extracting valuable information from vast quantities of raw data. It enables HPC on normal enterprise workstations and server environments for data-intensive applications, e.g. [12]. CUDA combines well with multi-core CPU systems to provide a flexible computing platform.

In this paper, the parallel implementation of various video surveillance algorithms on the GPU architecture is presented. This work focuses on (i) the Gaussian mixture model for background modeling, (ii) morphological image operations for image noise removal and (iii) connected component labeling for identifying the foreground objects. In each of these algorithms, the different memory types and thread configurations provided by the CUDA architecture have been exploited. One of the key contributions of this work is a novel algorithmic modification that parallelizes the divide and conquer strategy for CCL. The speed-ups obtained with the GTX 280 (30 multiprocessors, or 240 cores) were very significant; the corresponding speed-ups on the 8400 GS (2 multiprocessors, or 16 cores) were sufficient to process 640×480 surveillance video in real time. Scalability was tested by executing different frame sizes on both GPUs.

2. GPU Architecture and CUDA

NVIDIA’s CUDA [14] is a general purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems. The programmable GPU is a highly parallel, multi-threaded, many-core co-processor specialized for compute-intensive, highly parallel computation.

Fig. 1 Thread hierarchy in CUDA

The three key abstractions of CUDA are the thread hierarchy, shared memories and barrier synchronization, which render it only a small extension of C. All GPU threads run the same code; they are very lightweight and have a low creation overhead. A kernel can be executed by a one- or two-dimensional grid of multiple equally-shaped thread blocks. A thread block is a 1-, 2- or 3-dimensional group of threads, as shown in Fig. 1. Threads within a block can cooperate among themselves by sharing data through shared memory and by synchronizing their execution to coordinate memory accesses. Threads in different blocks cannot cooperate, and each block can execute in any order relative to other blocks. The number of threads per block is therefore restricted by the limited memory resources of a processor core. On current GPUs, a thread block may contain up to 512 threads. The multiprocessor SIMT (Single Instruction Multiple Threads) unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.

Constant memory is useful only when the entire warp needs to read a single memory location. Shared memory is on chip, and its accesses are 100x-150x faster than accesses to local and global memory. For high bandwidth, shared memory is divided into equal-sized memory modules called banks, which can be accessed simultaneously. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The banks are organized such that successive 32-bit words are assigned to successive banks, and each bank has a bandwidth of 32 bits per two clock cycles. For devices of compute capability 1.x, the warp size is 32 and the number of banks is 16. The texture memory space is cached, so a texture fetch costs one memory read from device memory only on a cache miss; otherwise it costs just one read from the texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture addresses that are close together will achieve the best performance. The local and global memories are not cached, and their access latencies are high. However, coalescing global memory accesses significantly reduces the access time and is an important consideration (for compute capability 1.3, global memory accesses are more easily coalesced than in earlier versions). The CUDA 2.2 release also provides page-locked host memory, which helps increase the overall bandwidth when the memory is to be read or written exactly once; it can also be mapped into the device address space, so that no explicit memory transfer is required.

Fig. 2 The device memory space in CUDA

3. Our Approach for the Video Surveillance Workload

A typical Automated Video Surveillance (AVS) workload consists of various stages like background modelling, foreground/background detection, noise removal by morphological image operations, and object identification. Once the objects have been identified, other applications can be developed as per the security requirements. Fig. 3 shows the multistage algorithm for a typical AVS system. The different stages, and our approach to each of them, are described as follows.

Fig. 3 A typical video surveillance system

3.1 Gaussian Mixture Model

Many approaches for background modelling, like [4][5], have been proposed. Here, the Gaussian Mixture Model proposed by Stauffer and Grimson [3] is taken up, which assumes that the time series of observations at a given image pixel is independent of the observations at other image pixels. It is also assumed that these observations of the pixel can be modelled by a mixture of K Gaussians (currently, from 3 to 5 are used). Let x_t be the pixel value at time t. The probability that the pixel value x_t is observed at time t is given by:

    P(x_t) = Σ_{k=1..K} w_{k,t} · η(x_t | μ_{k,t}, σ_{k,t})    (1)

where w_{k,t}, μ_{k,t} and σ_{k,t} are the weight, the mean and the standard deviation, respectively, of the k-th Gaussian of the mixture associated with the signal at time t. At each time instant t the K Gaussians are ranked in descending order by their w/σ value (the highest-ranked components represent the “expected” signal, or the background) and only the first B distributions are used to model the background, where

    B = arg min_b ( Σ_{k=1..b} w_k > T )    (2)

T is a threshold representing the minimum fraction of the data used to model the background. As the parameters of each pixel change, the Gaussian most likely to have been produced by the background process is determined by the most supporting evidence and the least variance, since the variance of a new moving object that occludes the image is high, which can easily be checked from the value of σ.

GMM offers pixel-level data parallelism which can be easily exploited on the CUDA architecture. Since the GPU consists of many cores which allow independent thread scheduling and execution, it is perfectly suited for independent per-pixel computation. So, an image of size m × n requires m × n threads, implemented using appropriately sized blocks running on multiple cores. Besides this, the GPU architecture also provides shared memory, which is much faster than the local and global memory spaces. In fact, for all threads of a warp, accessing shared memory is as fast as accessing a register as long as there are no bank conflicts [14] between the threads. In order to avoid too many global memory accesses, shared memory was utilised to store the arrays of the various Gaussian parameters. Each block has its own shared memory (up to 16 KB) which is accessible (read/write) to all its threads simultaneously; this greatly speeds up the computation on each thread, since memory access time is significantly reduced. The value of K (the number of Gaussians) is set to 4, which not only results in effective coalescing [14] but also reduces bank conflicts. As shown in Table 1, the efficacy of coalescing is quite prominent.

The approach for GMM involves streaming (Fig. 4), i.e. processing the input frame using two streams, which

allows the memory copies of one stream to overlap with the kernel execution of the other stream.

    for i <= 2
    do
        create stream i                                  // cudaStreamCreate
    for each stream i
    do
        copy half the image from host to device in
        each stream i                                    // cudaMemcpyAsync
    for each stream i
    do
        kernel execution for each stream i
        (half image processed)                           // gmm<<<...>>>
    cudaThreadSynchronize();

Fig. 4 Algorithm depicting streaming in CUDA

Streaming resulted in a significant speed-up in the case of the 8400 GS, where the time for memory copies closely matched the time for kernel execution, while in the case of the GTX 280 the speed-up was not so significant, as the kernel execution took little time, being spread over 30 multiprocessors.

3.2 Morphological Image Operations

After the identification of the foreground pixels in the image, some noise elements (like salt and pepper noise) creep into the foreground image. They need to be removed in order to find the relevant objects by the connected component labelling method. This is achieved by the morphological image operations of erosion followed by dilation [6]. Each pixel in the output image is based on a comparison of the corresponding pixel in the input image with its neighbours, depending on the structuring element (Fig. 5) used. In the case of dilation (denoted by ʘ), the value of the output pixel is the maximum value of all the pixels in the input pixel's neighbourhood. In a binary image, if any of the pixels in the neighbourhood corresponding to the structuring element is set to the value 1, the output pixel is set to 1. With binary images, dilation connects areas that are separated by spaces smaller than the structuring element and adds pixels to the perimeter of each image object.

In erosion (denoted by Ɵ), the value of the output pixel is the minimum value of all the pixels in the input pixel's neighbourhood. In a binary image, if any of the pixels in the neighbourhood corresponding to the structuring element is set to the value 0, the output pixel is set to 0. With binary images, erosion completely removes objects smaller than the structuring element and removes perimeter pixels from larger image objects. This is described mathematically as:

    A ʘ B = { z | (B̂)_z ∩ A ≠ ∅ }    (3)
    A Ɵ B = { z | (B)_z ⊆ A }        (4)

where B̂ is the reflection of set B and (B)_z is the translation of set B by point z, as per the set-theoretic definition.

    0 0 1 0 0
    0 1 0 1 0
    1 0 0 0 1
    0 1 0 1 0
    0 0 1 0 0

Fig. 5 A 5×5 structuring element

As the texture cache is optimized for 2-dimensional spatial locality, 2-dimensional texture memory is used to hold the input image; this has an advantage over reading pixels from global memory when coalescing is not possible. Also, out-of-bound memory references at the edge pixels are avoided by the cudaAddressModeClamp addressing mode of the texture memory, in which out-of-range texture coordinates are clamped to a valid range. Thus the need to check out-of-bound memory references with conditional statements never arose, preventing the warps from becoming divergent and adding significant overhead.

Fig. 6 Approach for erosion and dilation

As shown in Fig. 6, a single thread is used to process two pixels. A half warp (16 threads) has a bandwidth of 32 bytes/cycle, and hence 16 threads, each processing 2 pixels (2 bytes), use the full bandwidth

while writing back the noise-free image. This halves the total number of threads, thus reducing the execution time significantly. A structuring element of size 7×7 was used in both dilation and erosion. A straightforward convolution was done with one thread running on two neighbouring pixels. The execution times for the morphological image operations for the GTX 280 and the 8400 GS are shown in Table 2.

3.3 Connected Component Labelling

The connected component labelling algorithm works on a black and white (binary) input image to identify the various objects in the frame by checking pixel connectivity [8]. The image is scanned pixel by pixel (from top to bottom and left to right) in order to identify connected pixel regions, i.e. regions of adjacent pixels which share the same set of intensity values, and temporary labels are assigned. The connectivity can be either 4- or 8-neighbour connectivity (8-connectivity in our case). Then the labels are put under equivalence classes, pertaining to their belonging to the same object. After constructing the equivalence classes, the labels for the connected pixels are resolved by assigning the label of the equivalence class to all the pixels of that object.

Here, the approach for parallelizing CCL on the GPU belongs to the class of divide and conquer algorithms [7]. The proposed implementation divides the image into small parts and labels the objects in those small parts. Then, in the conquer phase, the image parts are stitched back to see whether two adjoining parts contain the same object or not.

For the initial labelling, the image was divided into N×N small regions and the sub-images were scanned pixel by pixel from left to right and top to bottom. These small regions were executed in parallel on different blocks (32×32 in the case of 1024×1024 images). Each pixel was labelled according to its connectivity with its neighbours. In the case of more than one neighbour, one of the neighbours' labels was used and the rest were marked under one equivalence class. This was done similarly for all blocks running in parallel. The equivalence class array was stored in shared memory for each block, which saved a lot of memory access time. The whole image frame was stored in texture memory to reduce memory access time, as global memory coalescing was not possible due to random but spatially local accesses.

    Region 1 – labels limited from 0 to 7      Region 2 – labels limited from 8 to 15
    Region 3 – labels limited from 16 to 23    Region 4 – labels limited from 24 to 31

Fig. 7 The connected components are assigned the maximum label after resolution

In earlier works on CCL like [9][10], the major limitation was that the sub-blocks into which the problem was broken had to be merged serially, the reason being that each sub-block had blobs with serial labels, and while merging any two connected sub-blocks, the labels in all the other sub-blocks had to be modified – clearly no parallelization was possible.

A new approach to enable parallelization of CCL is presented in this paper. The code (as indicated in Fig. 7) labels the blobs (objects) independently of the other sub-blocks, but according to the CUDA thread ids (i.e. the 1st sub-block can label the blobs from 0 to 7, the 2nd sub-block can label the blobs from 8 to 15, and so on). So in this case no sub-block can detect more than 8 blobs (which is generally the case, but one may easily choose a higher limit). In order to avoid conflicts between sub-blocks, connected parts of the image in different regions were given the highest label from amongst the different labels in the different regions, as shown in Fig. 7.

So, as a result of making the entire code portable to the GPU, the speed-up obtained was enormous – the entire processing being split and parallelized for execution on the GTX 280, resulting in the entire CCL code (i.e. including merge) executing in just 2.4 milliseconds (Table 3) for a 1024×768 image, a speed-up of 11x as compared to the sequential code.

Table 1 CUDA Profiler output for 1024×768 image size

Function | Occupancy | Coalesced global memory loads | Coalesced global memory stores | Total branches | Divergent branches | Static memory per block | Total global memory loads | Total global memory stores
GMM      | 0.75      | 5682 | 826  | 829  | 27 | 3112 | 5682 | 347
GMM      | 0.75      | 5094 | 186  | 500  | 5  | 3112 | 5094 | 93
Erode    | 1         | 0    | 102  | 408  | 2  | 20   | 0    | 37
Dilate   | 1         | 0    | 154  | 4975 | 31 | 20   | 0    | 77
CCL      | 0.25      | 2657 | 1902 | 4128 | 0  | 536  | 2657 | 823
Merge    | 0.25      | 9898 | 36   | 1936 | 2  | 1064 | 9898 | 18

4. EXPERIMENTAL RESULTS

The parallel implementation of the above-mentioned AVS workload was executed on two NVIDIA GPUs. The first GPU used was the GeForce GTX 280 on board a 3.2 GHz Intel Xeon machine with 1 GB of RAM; the second was the GeForce 8400 GS on board a 2 GHz Intel Centrino Duo machine. The GTX 280 has a single-precision floating point capability of 933 GFlops and a memory bandwidth of 141.7 GB/sec; it also has 1 GB of dedicated DDR3 video memory and consists of 30 multiprocessors with 8 cores each, hence a total of 240 stream processors. It belongs to compute capability 1.3, which supports many advanced features, like page-locked host memory and features which take care of alignment and synchronization issues. The 8400 GS has a memory bandwidth of 6.4 GB/sec and has two multiprocessors with 8 cores each, i.e. 16 stream processors, a single-precision floating point capability of 28.8 GFlops and 128 MB of dedicated memory. It belongs to compute capability 1.2. The development environment used was Visual Studio 2005, and the CUDA profiler version 2.2 was used for profiling the CUDA implementation. The image sizes that have been used are 1600×1200, 1024×768, 640×480, 320×240 and 160×120. In the subsequent discussion we mention the results obtained for the image size of 1024×768 on the GTX 280 (30 multiprocessors).
The Gaussian Mixture Model, used for background modelling, has the kind of parallelism that is required for implementation on a GPU. As evident from Fig. 8, the time of execution increases with the image size and the amount of speedup achieved also increases almost proportionately; this is due to the execution of a large number of threads that keeps the GPU busy. Hence, a significant speedup of 10x has been achieved for the 320×240 image.
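The per-pixel independence that makes GMM a good fit for the GPU can be seen in a simplified, sequential sketch of a Stauffer–Grimson style update step (abbreviated from [3]; grayscale only, and the thresholds and learning-rate handling here are illustrative, not the authors' kernel code). On CUDA, one thread would run this for one pixel:

```c
#define K 4          /* the paper uses K = 4 Gaussians per pixel */

typedef struct {
    float weight, mean, var;
} Gaussian;

/* One per-pixel update of a simplified Gaussian mixture background
 * model; returns 1 if the pixel is classified as foreground. */
int gmm_update(Gaussian g[K], float x, float alpha) {
    int matched = -1;
    for (int k = 0; k < K && matched < 0; k++) {
        float d = x - g[k].mean;
        if (d * d < 6.25f * g[k].var)   /* within 2.5 standard deviations */
            matched = k;
    }
    for (int k = 0; k < K; k++)
        g[k].weight *= (1.0f - alpha);  /* decay all weights */
    if (matched >= 0) {
        Gaussian *m = &g[matched];
        float d = x - m->mean;
        m->weight += alpha;
        m->mean  += alpha * d;          /* simplified learning rate */
        m->var   += alpha * (d * d - m->var);
        return m->weight < 0.5f;        /* foreground if mode not dominant */
    }
    /* no match: replace the weakest mode with a new, wide Gaussian */
    int w = 0;
    for (int k = 1; k < K; k++)
        if (g[k].weight < g[w].weight) w = k;
    g[w].weight = alpha;
    g[w].mean = x;
    g[w].var = 900.0f;
    return 1;                           /* unmatched pixels are foreground */
}
```

Since every pixel carries its own K Gaussians and touches no neighbour's state, the loop over pixels parallelizes trivially, which is the property exploited above.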

Table 2. Execution times for erosion and dilation for a 3×3 structuring element

IMAGE SIZE   GTX 280 (ms)   8400 GS (ms)
160×120      0.0445         0.120
320×240      0.0586         0.465
640×480      0.1254         1.75
1024×768     0.2429         3.61
1600×1200    0.5625         11.7
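For reference, the operation being timed in Table 2 can be stated in a few lines of C. This is a minimal sequential sketch of binary erosion with a 3×3 structuring element (the GPU version runs one thread per output pixel; border handling here simply clears the edge pixels, which is one of several possible conventions):

```c
/* Binary erosion with a 3x3 structuring element: a pixel stays set only
 * if all 9 pixels under the element are set. Input and output are
 * w*h arrays of 0/1 values; border pixels are cleared for simplicity. */
void erode3x3(const unsigned char *in, unsigned char *out, int w, int h) {
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            unsigned char v = 0;
            if (x > 0 && y > 0 && x < w - 1 && y < h - 1) {
                v = 1;
                for (int dy = -1; dy <= 1 && v; dy++)
                    for (int dx = -1; dx <= 1 && v; dx++)
                        if (!in[(y + dy) * w + (x + dx)])
                            v = 0;
            }
            out[y * w + x] = v;
        }
    }
}
```

Dilation is the dual operation (a pixel becomes set if any pixel under the element is set); both are embarrassingly parallel, which is why Table 2 shows them scaling so well on the GPU.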

Fig. 8 Execution times for GMM (speedup = 10x for image size 320×240, as compared to sequential code)

Shared memory was used to reduce global memory accesses, keeping in view the shared memory size (16 KB). As can be seen from Table 1, a total of 4 blocks (192×4 threads out of a maximum of 1024 threads) could be executed in parallel on a

multiprocessor, giving an occupancy of 0.75. As a result of using K=4, all the global memory loads were coalesced, as can be seen from Table 1; there were also fewer bank conflicts. The use of streaming reduced the memory copy overhead, but not to the extent anticipated, owing to the already efficient memory copying in the GTX 280 (compute capability 1.3). This approach, however, was of great help on the 8400 GS (compute capability 1.2).
The morphological image operations contribute a major portion of the computational expense of the AVS workload. In our approach we are able to drastically reduce their execution time. The speedup scales with the image size both on the GTX 280 and the 8400 GS; the comparison of sequential code with the parallel implementation for the 1024×768 image size shows a significant speedup of 260x with a structuring element of 7×7. The time taken by the sequential implementation was 89.806 ms, as compared to the 0.352 ms taken by the parallel implementation. For this image size we were able to unleash the full computational power of the GPU with an occupancy of 1 (i.e. neither shared memory nor the registers per multiprocessor were the limiting factors) on the GTX 280, as indicated in Table 1. Moreover, the use of texture memory and address clamp modes has reduced the percentage of divergent threads to less than 1%. On the 8400 GS also a significant speedup has been achieved.

(a) Input Image (b) Foreground Image (c) Image after noise removal (d) Output
Fig. 9 Image output of various stages of AVS

In CCL (i.e. CCL & Merge), 32×32-sized independent sub-blocks were assigned to each thread and 32 threads were run on one block (which was experimentally observed to be optimal). Since the maximum number of active blocks on a multiprocessor can be 8, the total number of active threads per multiprocessor was 256, and hence an occupancy of 0.25. The optimal parallelization of the CCL algorithm was significant in itself, as the parallelization of CCL on CUDA has not been reported and was deemed very difficult. Apart from the code being parallelized, the use of shared memory and then texture memory to store appropriate data led to significant increases in speedup. The use of texture memory not only prevented any warps from diverging, by avoiding the conditional statements (due to clamped accesses in texture memory), but also led to speedup due to the spatial locality of references in CCL. However, the implementation of CCL is block-size dependent, which still remains a bottleneck.

Table 3 Execution times for CCL

IMAGE SIZE   GTX 280 (ms)   8400 GS (ms)
160×120      0.106          0.522
320×240      0.220          1.34
640×480      1.256          4.5
1024×768     2.494          14.1
1600×1200    2.649          46.2

In each of the above kernels, page-locked host memory (a feature of CUDA 2.2) has been used whenever only one memory read and write were involved, which increased the memory throughput.
Architectures dedicated to video surveillance cost as much as lakhs of rupees, while the GeForce GTX 280 costs Rs. 17000 and the 8400 GS merely Rs. 4000. Even for an image size of 640×480, 30 frames per second could be processed on the 8400 GS; for an image size of 1024×768, close to 15 frames per second could be processed; and for images of smaller size, 30 frames could easily be processed, as shown in Fig. 10.

Fig. 10 Comparison of total time for images of different sizes
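The occupancy figures quoted in this section are simple ratios of active threads to the hardware maximum per multiprocessor (1024 on these devices); a one-line sketch reproduces both the GMM figure (4 blocks of 192 threads) and the CCL figure (8 blocks of 32 threads):

```c
/* Occupancy as reported by the CUDA profiler: active threads per
 * multiprocessor divided by the hardware maximum per multiprocessor. */
float occupancy(int threads_per_block, int active_blocks, int max_threads) {
    return (float)(threads_per_block * active_blocks) / (float)max_threads;
}
```

occupancy(192, 4, 1024) gives 0.75 and occupancy(32, 8, 1024) gives 0.25, matching the values in Table 1.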

5. CONCLUSION AND FUTURE WORK

Through this paper, we describe the implementation of a typical AVS workload on the parallel architecture of NVIDIA GPUs to perform real-time AVS. The various algorithms, as described in the previous sections, are GMM for background modelling, morphological image operations, and CCL for object identification. In our previous work [15] a detailed comparison has been done between the Cell BE and CUDA for these algorithms. During the implementation on the GPU architecture, major emphasis was given to selecting the thread configurations and the memory types for each kind of data, out of the numerous options available on the GPU architecture, so that the memory latency can be reduced and hidden. Much emphasis was given to memory coalescing and avoiding bank conflicts. Efficient usage of the different kinds of memories offered by the CUDA architecture, and subsequent experimental verification, resulted in the most optimal implementations. As a result, a significant overall speed-up was achieved.
Further testing and validation are going on. We have examined the performance on only the 8400 GS (2 multiprocessors) and the GTX 280 (30 multiprocessors) in this paper, hence a range of intermediate devices is yet to be explored. Our future work will include the implementation of the AVS workload on other GPU devices to examine the scalability, as well as comparison with other parallel architectures to get an idea of their viability as compared to the GPU implementation.

6. REFERENCES

[1] S. Momcilovic and L. Sousa. A parallel algorithm for advanced video motion estimation on multicore architectures. Int. Conf. Complex, Intelligent and Software Intensive Systems, pp. 831-836, 2008.
[2] M. D. McCool. Data-Parallel Programming on the Cell BE and the GPU Using the RapidMind Development Platform. GSPx Multicore Applications Conference, 9 pages, 2006.
[3] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In Proceedings CVPR, pp. 246–252, 1999.
[4] Z. Zivkovic. Improved Adaptive Gaussian Mixture Model for Background Subtraction. In Proc. ICPR, vol. 2, pp. 28-31, 2004.
[5] K. Toyama, J. Krumm, B. Brumitt and B. Meyers. Wallflower: principles and practice of background maintenance. Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 1, pp. 255-261, 20-25 September 1999, Kerkyra, Corfu, Greece.
[6] H. Sugano and R. Miyamoto. Parallel implementation of morphological processing on Cell/BE with OpenCV interface. Communications, Control and Signal Processing (ISCCSP 2008), pp. 578–583, 2008.
[7] J. M. Park, G. C. Looney and H. C. Chen. "A Fast Connected Component Labeling Algorithm Using Divide and Conquer". CATA 2000 Conference on Computers and Their Applications, pp. 373-376, Dec. 2000.
[8] R. Fisher, S. Perkins, A. Walker and E. Wolfart. Connected Component Labeling, 2003.
[9] K. P. Belkhale and P. Banerjee. "Parallel Algorithms for Geometric Connected Component Labeling on a Hypercube Multiprocessor". IEEE Transactions on Computers, vol. 41, no. 6, pp. 799-709, 1992.
[10] M. Manohar and H. K. Ramapriyan. Connected component labeling of binary images on a mesh connected massively parallel processor. Computer Vision, Graphics, and Image Processing, 45(2):133-149, 1989.
[11] K. Dawson-Howe. "Active surveillance using dynamic background subtraction". Tech. Rep. TCD-CS-96-06, Trinity College, 1996.
[12] M. Boyer, D. Tarjan, S. T. Acton and K. Skadron. Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors, 2009.
[13] A. C. Sankaranarayanan, A. Veeraraghavan and R. Chellappa. Object detection, tracking and recognition for multiple smart cameras. Proceedings of the IEEE, 96(10):1606–1624, 2008.
[14] NVIDIA CUDA Programming Guide, Version 2.2, pp. 10, 27-35, 75-97, 2009.
[15] P. Kumar, K. Palaniappan, A. Mittal and G. Seetharaman. Parallel Blob Extraction using Multicore Cell Processor. Advanced Concepts for Intelligent Vision Systems (ACIVS) 2009, LNCS 5807, pp. 320–332, 2009.

Adapting Traditional Compilers onto Higher Architectures
incorporating Energy Optimization Methods
for Sustained Performance
Prahlada Rao B B, Mangala N, Amit K S Chauhan
Centre for Development of Advanced Computing (CDAC),
#1, Old Madras Road, Byappanahalli, Bangalore-560038, India
email: {prahladab, mangala}@cdacb.ernet.in

ABSTRACT - Improvements in processor technology are offering benefits such as a large virtual address space, faster computations, non-segmented memory, higher precision etc., but they require upgradation of system software to be able to exploit the benefits offered. The authors in this paper present the various tasks and constraints experienced in enhancing a compiler from a 32-bit to a 64-bit platform. The paper describes various aspects, ranging from design changes to porting and testing issues, that have been dealt with while enhancing C-DAC's Fortran90 (CDF90) compiler to work with 64-bit architecture and I:8 support. Features supported by CDF90 for energy efficiency of code are presented. The regression testing carried out to test the I:8 support added in the compiler is discussed.

KEYWORDS: Compilers, Testing, Porting, LP64, CDF90, Optimizations

I. INTRODUCTION

Based on Moore's law, processors will continue to show an increase in speed and processing capability with time, and chip-making companies like Intel, AMD, IBM, Sun etc. put in efforts to stay on the Moore curve of progress. Processor architectures are evolving with different techniques to gain performance – dividing each instruction into a large number of pipeline stages, scheduling instructions in an order different from the order in which the processor received them, providing 64-bit architecture, providing multiple cores, etc. However, the benefits of all these components can be derived only if the system software is tailored to match the new processor. To keep pace with the rapid changes in processor technology, usually the existing system software codes are tweaked to exploit the offerings of the new architectures.
Domain experts in areas such as weather forecasting, climate modeling and atomic physics still continue to maintain large programs written in the Fortran language. Fortran being popular in the scientific community, it requires to be supported on new architectures to take advantage of advancements in the hardware. However, redesigning a compiler from scratch is a major task. Instead, modifying an existing compiler so that it generates code for newer targets is a common way to make compilers compliant with enhanced processors. This paper describes the important aspects of adapting the Fortran90 compiler (CDF90) developed by C-DAC to higher architectures.

A. Overview of CDAC's Fortran90 Compiler

The highlights of the CDF90 compiler are that it supports both the F77 and F90 standards, the Message Passing Interface (MPI), and mixed-language programming with C [6]. It also has an in-built Fortran77 to Fortran90 converter. It is available on AIX, Linux and Solaris. The CDF90 source code is written in the C language and comprises about 557 source code files with 190 kilo lines of code (KLOC). As with other traditional compilers, CDF90 includes the key phases of lexical analysis, syntax and semantic analysis, optimization and code generation. Yacc [9] is used for developing the syntax analysis modules in CDF90. Internally the context-free grammar is represented by a tree using AST (Abstract Syntax Tree) [7] notation. The tree can be traversed to generate intermediate code and also to carry out optimization transformations.

B. Traditional and Retargetable Methodologies

Advanced versions of the GNU Compiler Collection (GCC) offer many advantages over traditional compilers. Even so, traditional compilers are important, for they provide the basic building block for adding sophisticated features according to arising requirements, or for pursuing research work on top of existing compiler projects; they provide the flexibility of not writing everything from scratch.
In an attempt in the early 90's to make the compiler highly portable, the code generator module was replaced with a 'translator to C' in order to generate intermediate C code, since stable C compilers were available on a majority of the platforms. Hence CDF90 acts as a converter, translating an input Fortran77/90 program into an equivalent C program, which is further passed to some standard C compiler like gcc. CDF90 incorporates the traditional compiler development approach.
CDF90 offers various optimization techniques such as Loop Unrolling, Loop Interchanging, Loop Merging, Loop Distribution and Function In-lining etc., for efficient storage

and execution of the code, while gcc comes with more sophisticated and complex optimization procedures, such as inter-procedural analysis, for better performance. CDF90 takes the benefits of both approaches by converting the code into intermediate C code through the traditional approach and later passing the intermediate C code to the gcc compiler to exploit the benefits offered by advanced techniques. The various compilation approaches are depicted in Figure 1.

[Figure 1 shows three pipelines side by side. Traditional: lexical analyzer, syntax + semantic analyzer, optimizer, then a machine-dependent code generator and assembler for the target architecture, producing an executable for that target. Portable: the same machine-independent phases followed by a translator to a commonly used language (C/C++) for portability, compiled using a popular compiler, producing different executables for different targets. Retargetable: the machine-independent phases followed by an intermediate code generator and code generation using .md and .rtl descriptions of the target architecture, producing different executables for different targets.]
Figure 1. Different Compiler Approaches

II. CDF90 ARCHITECTURE

CDF90 is conventionally composed of the following main modules.

B. Lexer
This module converts the sequence of characters into a sequence of tokens, which are given as input to the Parser for construction of the Abstract Syntax Tree. This Abstract Syntax Tree is used in later modules of the compiler for further processing.

C. Parser
This module receives input in the form of sequential source program instructions, interactive commands, markup tags, or some other defined interface, and breaks them up into parts that can then be managed by other compiler components. The Parser gets the tokens from the Lexer and constructs a data structure, usually an Abstract Syntax Tree.

D. Optimizer
This module provides a suite of traditional optimization methods to minimize the energy cost of a program by minimizing memory access instructions and execution time. Optimization techniques like Loop Unrolling, Loop Interchanging, Loop Merging, Loop Distribution and Function In-lining have been applied on the parse tree structure.

E. Translator
This module translates FORTRAN source code to correct, compilable and clean C source code. The Translator makes an in-order traversal of the full parse tree and replaces each FORTRAN construct by its corresponding C construct. The output of the Translator module is a '.c' file. This .c file is passed to some standard C compiler to produce the final executable. The I/O libraries of CDF90 are linked using the '-l' option and passed to the linker 'ld'.

F. I/O Library
This module contains methods that are invoked by the intermediate C code, generated as the output of the Translator module, with the help of the linker at link time.

[Figure 2 shows the CDF90 pipeline: a Fortran77/90 application (.f or .f90) passes through the Lexer (tokens), Parser (AST), Optimizer (optimized AST) and Translator (intermediate C code); the C compiler (gcc), together with the I/O library, produces the executable file (XCOFF/XCOFF-64bit).]
Figure 2. Control Flow Graph for CDF90 Compiler

III. CONSIDERATIONS FOR MIGRATING CDF90 TO 64-BIT

A. Need for Migration of CDF90
64-bit architectures have been becoming popular for a decade, with a promise of higher accuracy and speed through the use of 64-bit registers and 64-bit addressing. The advantages offered by 64-bit processors are:
- Large virtual address space
- Non-segmented memories
- 64-bit arithmetic
- Faster computations
- Removal of certain system limitations

A study was taken up to understand the feasibility, impact, and effort for migrating the existing Fortran90 compiler. Considering the extensive features offered by CDF90 and the advantages of enhancing it for higher-bit processors, it was decided to enhance the existing compiler to support the LP64 model; this would require reasonable changes mainly in the parser, translator and library modules.
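The in-order parse-tree traversal that the Translator module performs can be illustrated with a toy sketch. The real CDF90 tree and node types are far richer; the names and the tiny expression grammar below are invented purely for illustration:

```c
#include <string.h>

/* Toy illustration of a Translator-style in-order AST walk that emits
 * equivalent C text. Leaves hold variable names; interior nodes hold a
 * binary operator. Parenthesising every subexpression keeps the emitted
 * C trivially correct with respect to precedence. */
typedef struct Node {
    char op;                 /* 0 for a leaf holding a variable name */
    const char *name;
    struct Node *left, *right;
} Node;

void emit_c(const Node *n, char *out) {
    if (!n->op) { strcat(out, n->name); return; }
    strcat(out, "(");
    emit_c(n->left, out);
    char opstr[2] = { n->op, 0 };
    strcat(out, opstr);
    emit_c(n->right, out);
    strcat(out, ")");
}
```

For a tree representing a + b*c, the walk emits "(a+(b*c))"; the actual Translator does the same kind of traversal but substitutes whole FORTRAN constructs (loops, I/O statements, intrinsics) with their C counterparts.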

B. Data Model Standards Followed
For higher precision and better performance, newer architectures have been designed to support various data models. The three basic models that are supported by most of the major vendors on 64-bit platforms are LP64, ILP64 and LLP64 [5].
- LP64 (also known as 4/8/8) denotes int as 4 bytes, and long and pointer as 8 bytes each.
- ILP64 (also known as 8/8/8) means int, long and pointers are 8 bytes each.
- LLP64 (also known as 4/4/8) adds a new type (long long); long long and pointers are 64-bit types.
Many 64-bit compilers support the LP64 [5] data model, including gcc and xlc on the AIX 5.3 platform. CDF90 acts as a front-end and depends upon gcc/xlc for backend compilation. Since gcc/xlc follow the LP64 data model for 64-bit compilation on the AIX 5.3 platform, 64BitCDF90 also needs to follow the same data model. Hence the LP64 data model has been adopted for 64-bit compilation on the 64-bit AIX5.3/POWER5 platform.

C. Approach Followed for Migration
64BitCDF90 is able to perform 64-bit compilation correctly on a 64-bit platform with I:8 support after carrying out the following tasks, described in two phases below:
- Adding 8-byte integer (I:8) support to the CDF90 compiler was required to enjoy the benefit of faster computing with larger integer values. In order to implement this in CDF90, various implicit FORTRAN library functions needed to be modified and new functions written that support 8-byte integer computations. Various changes were also required in different compiler modules, which are described in a later part of the paper.
- FORTRAN applications passed to CDF90 are translated to C code, hence it was required to consider the data model of the underlying C compiler also. gcc 4.2 follows the LP64 data model for 64-bit compilation, and for 32-bit compilation it uses the ILP32 data model even on the 64-bit AIX platform. Hence the LP64 data model would be suitable in this situation.
- Most 64-bit processors support both 32-bit and 64-bit execution modes [10]. Hence 64BitCDF90 also needs to provide both 32-bit and 64-bit compilation support through the help of 32-bit and 64-bit compilation libraries, though the executable may be 32-bit only. Hence it was identified that two different libraries need to be prepared for the 32-bit and 64-bit compilation environments.
- The Fortran77/90 executable file format requires to be changed to XCOFF64 when compiled by 64BitCDF90 on a 64-bit platform using 64-bit libraries. 64-bit CDF90 APIs need to be generated for 64-bit compilation of any application, though the CDF90 executable may be 32-bit only. Please note that the same approach is followed by gcc, which uses a 32-bit executable for 64-bit compilation of any application file passed to it, through use of 64-bit library files.
- 64BitCDF90 needs to be validated for 64-bit architecture compliance against the existing test cases, along with the newly added test cases specific to 8-byte integer support.

IV. EFFECT OF LP64 DATA MODEL ON CDF90

Some fundamental changes occur when moving from the ILP32 data model to the LP64 data model, which are listed as following:
- long and pointers are no longer 32 bits in size. Direct or indirect assignments of an int to a long or pointer value are no longer valid.
- For 64-bit compilation, CDF90 needs to use 64-bit library archives. It also needs to supply 64-bit specific flags to the backend compiler so that it can operate in 64-bit mode.
- System-derived types such as size_t, time_t and ptrdiff_t are 64-bit aligned in 64-bit compilation environments. Hence these values must not be contained in or assigned to 32-bit variables.

V. DESIGN/IMPLEMENTATION CHANGES FOR 64-BIT ENHANCEMENTS

A. Migration from 32-bit to 64-bit
Major porting concerns can be summarized as below:
- Pointer size changes to 64-bit. All direct or implied assignments or comparisons between "integer" and "pointer" values have been examined and removed.
- Long size changes to 64-bit. All casts that allow the compiler to accept assignment and comparison between "long" and "integer" have been examined to ensure validity.
- Code has been updated to use the new 64-bit APIs, and hence the executable generated after 64-bit compilation is 64-bit compliant.
- Macros depending on the 32-bit layout have been adjusted for the 64-bit environment.
- A variety of other issues, like data truncation, that can arise from sign extension, memory allocation sizes, shift counts, array offsets, and other factors have to be dealt with extreme care.
- The user has the option to select between the 32-bit and 64-bit APIs. If a 64-bit compiler flag is used (e.g. the '-maix64' flag for backend compilation with gcc 4.2), then the 64-bit API is linked and the 64-bit object file format is generated. If the user does not use any 64-bit flag for compilation, then by default the 32-bit API is linked to the application and the 32-bit object file format is generated. Code has been added in the compiler to select either the 32-bit or 64-bit API depending upon whether the user has supplied 64-bit specific flags or not.
- 64-bit library files are generated by compiling the source code using gcc 4.2 in 64-bit mode using the '-

maix64’ option on AIX5.3/Power5 platform to produce E.g. The Translator is now able to internally translate
64-bit XCOFF file formats. These 64-bit object files are the implicit matmul (a, b) function, where a, b are matrix
passed to ‘ar’ tool using ‘–X64’ flag to produce 64-bit with 8-byte integer elements, in to _matmul_i8i8_22
library archives. (long long **a, long long **b) which is an intermediate
 Compiler code, which is written in C, is compiled using C library function to carry out the actual multiplication.
gcc 4.2 in 64-bit architecture (AIX5.3/Power5) without This translation is carried out on the basis of integer size
using any 64-bit specific flags and by default it generates conditional checking. Most of such changes have been
32-bit executables for CDF90 compiler. The same carefully debugged with the help of gdb6.5.
CDF90 executable can be linked to 64-bit library API to  I/O Library: There are two libraries supported by
compile applications, passed to it, in 64-bit mode and to 64BitCDF90.One library is used for 32-bit mode and the
generate object file in 64-bit XCOFF file format. The other is linked for 64-bit mode compilation. Most of the
same 32-bit CDF90 executable can be linked to 32-bit library functions in 32-bit CDF90 library have been
API also to compile applications in 32-bit mode and to modified /added to handle 64-bit integer data types and
generate 32-bit XCOFF object file format on the same hence to create 64-bit CDF90 library. Hence 64-bit
64-bit platform (AIX5.3/POWER5). CDF90 library can handle 8-byte integer for most of the
 32-bit library archives are generated using compilation functions it contains. The translator module will generate
by ‘gcc 4.2’ followed by archive file generation using corresponding C code that invokes these library
‘ar’ tool without any 64-bit specific flags on AIX5.3 functions, which handle 8-byte integer data size.
platform with PowerPC_Power5 architecture. E.g. _matmul_i8i8_(long long **a, long long **b)
library function has been added in the CDF90 library to
B. Adding I:8 Support in 64BitCDF90 compute matrix-matrix multiplications whose elements
32-bit CDF90 compiler supports following KIND values, in are 8-byte integers.
a Fortran77/90 program, for Integer data types  Building Libraries: The 32-bit library archives have
been created by compiling the source code in 32-bit
TABLE 1 mode while 64-bit library archives have been prepared
KIND VALUES FOR I NTEGERS by compiling the source code in 64-bit mode on 64-bit
platform. More specifically 64-bit libraries have been
KIND Value Size in bytes
prepared by compiling source code using gcc4.2 with ‘-
1 1
2 2
maix64’ flag and further passing it to archive tool ‘ar’
3 4 with the ‘-x64’ flag to create 64-bit library archives for
64BitCDF90.
The compiler has been enhanced to allow KIND value 4 for
which the size of the Integer is 8 bytes. Modifications are VI. CODE OPTIMIZATIONS: ENERGY EFFICIENT METHODS
performed in the following modules to achieve the desired OFFERED BY CDF90
results.
 Lexer: Code has been added to identify INTEGER*8 as Energy cost of a program depends upon the number of
a valid token. memory access[15]. Large number of memory access
results in high energy dissipation. Energy efficient
 Parser: The parser code has been modified so that it
compilers reduce memory access and thereby reduce energy
may correctly add the symbol (I: 8) correctly in the AST
generated after the parser phase. All other programming consumed by these accesses. One approach is to reduce the
LOAD and STORE instructions and storing data in registers
constructs, like functions, macros, data types etc, dealing
or cache memory to provide faster and efficient
with 8-byte integer size are also updated, in the parser
computations reducing the CPU cycles significantly.
module, to transform in to the correct AST structure.
According to Kremer [22], “Traditional compiler
 Translator: If the integer variable KIND value is 4 in an
optimizations such as common subexpression elimination,
input Fortran77/90 program, then the translator should
partial redundancy elimination, strength reduction, or dead
be able to declare and translate the symbol in to the
code elimination increase the performance of a program by
corresponding C code symbol. Corresponding code has
reducing the work to be done during program execution”.
been added in the translator module. Functionalities have
There are also several other compiler strategies for energy
been added, in the translator module, to correctly convert
reduction such as - compiler assisted Dynamic Voltage
implicit FORTRAN library function dealing with KIND
Scaling [20], instruction scheduling to reduce function unit
value 4 integer data types to their corresponding C
utilization, memory bank allocation, inter-program
library function names based on conditional type
optimizations [19] etc. which are being tried by the
checking for 8-byte integer data types. After successful
researchers.
conversion, the function call is dumped into an
Reducing power dissipation of processors is also being
intermediate .c file. The C functions, dealing with I:8
attempted at the hardware level and in the operating system.
data types, are called from CDF90 libraries.
Examples of these include power aware techniques for on-

chip buses, transactional memory, memory banks with low power modes [16,17] and workload adaptation [18,23]. However, energy reduction through the compiler promises some advantages, such as: no overhead at execution time; assessing 'future' program behavior through aggressive whole-program analysis; and identifying optimizations and making code transformations for reduced energy usage [24].

A. Optimizations in CDF90 compiler
Programs compiled without any optimization generally run very slowly and hence take more CPU cycles, resulting in higher energy consumption. Usually a medium level of optimization (-O2 on many machines) typically leads to a speedup by a factor of 2-3 without any significant increase in compilation time. The different types of optimizations performed by CDF90 to improve program performance are listed as following.

i. Common Sub-expression Elimination
The compiler takes the common sub-expression out of several expressions and calculates it only once instead of several times:

t1=a+b-c
t2=a+b+c

The above two statements will be reduced by an optimizing compiler to the following statements:

t=a+b
t1=t-c
t2=t+c

Though this approach may not exhibit any significant performance enhancement for smaller expressions, for bigger expressions it shows a certain performance enhancement.

ii. Strength Reduction
Strength reduction is used to replace an existing arithmetic expression by an equivalent expression that can be evaluated faster. A simple example is replacing 2*i by i+i, since integer addition is faster than integer multiplication.

iii. Loop Invariant Code
Code that is not dependent on the loop iterations is removed from the loop and calculated one time only, instead of being calculated for each loop iteration. The following FORTRAN code

do i=1,n
  a(i)=m*n*b(i)
end do

will be replaced by the following code by an optimizing compiler:

t=m*n
do i=1,n
  a(i)=t*b(i)
end do

iv. Constant Value Propagation
An expression involving several constant values will be calculated and replaced by a new constant value. For example:

x=2*y
z=3*x*s

will be transformed into the following code by an optimizing compiler:

z=6*y*s

v. Register Allocation and Instruction Scheduling
This particular optimization is the most difficult and also the most important. Since CDF90 depends upon gcc for backend compilation, it is left to the backend compiler.

B. Code Transformation Criteria and Applied Techniques
CDF90 offers the following loop optimization techniques.

i. Loop Interchange
This is the process of exchanging the order of two iteration variables. Loop interchange can often be used to enhance the performance of code on parallel or vector machines. Determining when loops may be safely and profitably interchanged requires a study of the data dependences in the program.
The loop interchange mechanism has been implemented in CDF90. It contains a function which checks whether interchanging of loops can be done. The loops can be interchanged if they are perfectly nested. If one of the loop bounds is dependent on the index variable of some other loop, the loops cannot be interchanged; the loops should also be totally independent, since after loop interchanging there should be no invalid computation. Based on the condition output, loop interchanging is performed by CDF90-supported functions.
For example, consider the FORTRAN matrix addition algorithm below:

DO I = 1, N
  DO J = 1, M
    A(I, J) = B(I, J) + C(I, J)
  ENDDO
ENDDO

The loop accesses the arrays A, B and C row by row, which, in FORTRAN, is very inefficient. Interchanging the I and J loops, as shown in the following example, facilitates column-by-column access:

DO J = 1, M
  DO I = 1, N
    A(I, J) = B(I, J) + C(I, J)
  ENDDO
ENDDO
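Since CDF90 ultimately emits C, the elimination and hoisting transforms above can equally be shown on the C side. The following is a hand-applied sketch (function and variable names invented) where the straightforward and the optimized versions provably compute identical results:

```c
/* Hand-applied common-subexpression elimination and loop-invariant code
 * motion, written as the kind of C a Fortran-to-C translation produces.
 * Both functions compute the same values. */
void naive(const double *b, double *out, int n, double m, double nn,
           double p, double q, double *t1, double *t2) {
    for (int i = 0; i < n; i++)
        out[i] = m * nn * b[i];      /* m*nn recomputed every iteration */
    *t1 = p + q - 1.0;
    *t2 = p + q + 1.0;               /* p+q computed twice */
}

void optimized(const double *b, double *out, int n, double m, double nn,
               double p, double q, double *t1, double *t2) {
    double t = m * nn;               /* loop-invariant code motion */
    for (int i = 0; i < n; i++)
        out[i] = t * b[i];
    double s = p + q;                /* common subexpression eliminated */
    *t1 = s - 1.0;
    *t2 = s + 1.0;
}
```

Because C evaluates m*nn*b[i] left to right, hoisting m*nn changes the operation count but not the result, so the two versions agree exactly, not just approximately.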

ii. Loop Vectorization
This is the conversion of loops from a non-vectored form to a vectored form. A vectored form is one in which the same operation happens on all members of a range (SIMD) without dependencies. Theoretically the ranges should be in contiguous memory areas, because this makes it easy for the processor to know where to apply the computations next; conceptually, however, it could be any set of data. The only built-in data type in C and FORTRAN that has contiguous memory is the array.

iii. Loop Merging
This is another technique implemented by CDF90 to reduce loop overhead. When two adjacent loops would iterate the same number of times (whether or not that number is known at compile time), their bodies can be combined as long as they make no reference to each other's data.

iv. Loop Unrolling
Loop unrolling duplicates the body of the loop multiple times in order to decrease the number of times the loop condition is tested and the number of jumps, which degrade performance by impairing the instruction pipeline. Completely unrolling a loop eliminates all overhead (except multiple instruction fetches and increased program load time), but requires that the number of iterations be known at compile time. Loop unrolling is designed to unroll loops for parallelizing and optimizing compilers. To illustrate, consider the following loop:

for (i = 1; i <= 60; i++) a[i] = a[i] * b + c;

This for loop can be transformed into the following equivalent loop consisting of multiple copies of the original loop body:

for (i = 1; i <= 60; i+=3)
{
a[i] = a[i] * b + c;
a[i+1] = a[i+1] * b + c;
a[i+2] = a[i+2] * b + c;
}

The loop is said to have been unrolled twice, and the unrolled loop runs faster because of the reduction in loop overhead. Loop unrolling was initially developed for reducing loop overhead and for exposing instruction-level parallelism on machines with multiple functional units.
Loop unrolling is limited by the number of registers available. If fewer registers are available, there is an increased number of LOAD/STORE operations and hence increased memory access. In case loop unrolling leads to performance loss because of frequent LOAD/STORE operations, the loop is split into two, a procedure called loop fission. This can be done in a straightforward manner if two independent code segments are present in the loop.

v. Loop Distribution
Loop distribution is used for transforming a sequential program into a parallel one. It attempts to break a loop into multiple loops over the same index range, each taking only a part of the original loop's body. This can improve locality of reference, both of the data being accessed in the loop and of the code in the loop's body.

vi. Function Inlining
Function inlining is a powerful high-level optimization which eliminates call cost and increases the chances of other optimizations taking effect, due to the breaking down of call boundaries. By declaring a function inline, you can direct CDF90 to integrate that function's code into the code of its callers. This makes execution faster by eliminating the function-call overhead. The effect on code size is less predictable; object code may be larger or smaller with function inlining, depending on the particular case.

VII. TESTING 64BitCDF90 ON 64-BIT ARCHITECTURE
Compilers are used to generate software for systems where correctness is important [4], and testing [2] is needed for quality control and error detection in compilers.

A. Challenges
One of the challenges encountered in the CDF90 project was the non-availability of a free (open source), standard test suite to test the 64-bit compiler (especially the I:8 check). Hence a test suite to test all the functionalities of Fortran 77/90 was developed (DPTESTCASES). This was not an easy task, as the test suite developer needed to be aware of all the features of the language.

B. Test Suites Used
Two suites of test programs, FCVS [8] and DPTESTCASES, are used, and whenever the compiler is modified, the test programs are compiled using both the new and old versions of the compiler. Any differences in the target programs' output are reported back to the development team.

i. FORTRAN Compiler Validation System (FCVS)
FCVS, developed by NIST, is a standard test suite for FORTRAN compiler validation and is used to validate 64BitCDF90 against FORTRAN77/90 applications. Script files are written to compile and execute each and every test case present in the test suite. After running the script file one can see the result/status for FCVS.

ii. DPTESTCASES
Unfortunately FCVS does not contain test cases to test 8-byte integer support in various FORTRAN language constructs. Hence a large set of small and specific test cases
that covered the complete language constructs with 8-byte integer support was developed, called DPTESTCASES. These tests verify each compiler construct with 8-byte integer support. DPTESTCASES enables testing of 8-byte integer support for various FORTRAN language features such as data types and declarations, specification statements, control statements, IO statements, intrinsic functions, subroutine functions, and boundary checks for integer constants.
Shell scripts are written to compile all test cases, execute them, and check the compiler warnings and error messages against the expected results of DPTESTCASES.
The test package has been modified in several ways, such as supplying large values in possible test cases for boundary value analysis and checking for correct results. This approach to testing proved helpful in verifying compiler capabilities.

C. Testing Approach
The approach followed for testing 64BitCDF90 can be described by the following main points:

i. Testing of 64BitCDF90 in 32-bit Mode on a 64-bit Machine
Execute each test case with the 64BitCDF90 compiler in 32-bit mode and check for correct results. This check is required to see whether the modifications performed in the compiler code for 64-bit compatibility have resulted in any adverse side effects. In 32-bit mode the compiler internally uses 32-bit library archives and generates 32-bit XCOFF on the AIX5.3 platform. FCVS is used for testing in 32-bit mode.

ii. Testing of 64BitCDF90 in 64-bit Mode on a 64-bit Machine
Execute each test case with the enhanced CDF90 compiler in 64-bit mode and check for correct results. This check is required to see whether the enhanced compiler successfully compiles each test case in the 64-bit environment. Both FCVS and DPTESTCASES are used to test 64-bit compatibility and 8-byte integer support in various FORTRAN language constructs. In 64-bit mode the compiler internally uses 64-bit library archives and generates 64-bit XCOFF on the AIX5.3/Power5 platform.

iii. White Box Testing
White-box testing is the execution of the maximum number of accessible code branches with the help of a debugger or other means. The more code coverage is achieved, the fuller is the testing provided [13]. For the large code base of 64BitCDF90, it is not easy to traverse the full code. Hence white-box testing methods are used when an error is found with test cases and we need to find out the reason that caused it.

iv. Black Box Testing
Unit testing may be treated as black-box testing. The main idea of the method consists in writing a set of tests for separate modules and functions of Fortran 90, which test all the main constructs of Fortran.

v. Functional Testing
64BitCDF90 was tested along with the original CDF90 on the same test cases with the intent of finding defects, demonstrating that defects are not present, verifying that each module performs its intended functions with integers of up to 8 bytes (since Fortran supports integers of 1, 2, 4 and 8 bytes), and establishing confidence that the program does what it is supposed to do.

vi. Regression Testing
Similar in scope to a functional test, a regression test allows a consistent, repeatable validation of each new release of the compiler against new requirements. Such testing ensures that reported compiler defects have been corrected in each new release and that no new quality problems were introduced in the maintenance process. Regression testing can be performed manually.

vii. Static Analysis of the CDF90 Code
The code size being very large, we needed a suitable static analyser which supports the LP64 data model for 64-bit compilation on a 64-bit platform. We found lint [12] most suitable; it offered the following advantages:
• Warns about incorrect, error-prone or nonstandard code that the compiler does not necessarily flag.
• Points out potential bugs and portability problems.
• Assists in improving the source code's effectiveness, including reducing its size and required memory.
Of the various options provided by lint, -errchk=longptr64 in particular proved very helpful for migrating CDF90 to the 64-bit platform.

D. Testing Statistics

i. 32-bit Compilation
FCVS is compiled by 64BitCDF90 in 32-bit mode on a 64-bit machine (AIX5.3/Power5). 198 out of a total of 200 test cases are successful; the two buggy test cases are being worked on. DPTESTCASES, consisting of 210 test cases, compile successfully with 64BitCDF90 and produce correct results.

ii. 64-bit Compilation
FCVS is compiled in 64-bit mode by setting the 64-bit flag (the '-maix64' flag when using gcc 4.2 for backend compilation) to obtain 64-bit executable files. The same two test cases mentioned above are unsuccessful in 64-bit mode as well. All the test cases present in DPTESTCASES compile successfully and produce correct results.

VIII. CONCLUSIONS
Major hardware vendors have recently shifted to 64-bit processors because of the performance, precision, and scalability that 64-bit platforms can provide. The constraints
of 32-bit systems, especially the 4GB virtual memory ceiling, have spurred companies to consider migrating to 64-bit platforms.
This paper presents some important issues faced while porting a compiler from 32-bit to 64-bit, such as adding a new language feature (the I:8 data type) to the existing compiler and ensuring that this feature is supported by the range of already existing library functions. The testing methodology is elaborated. Performance optimizations contributing to the energy efficiency of the code are explained, and the authors present the traditional optimizations implemented in CDF90. Other energy-saving optimizations are being explored and shall be presented in future reports.

REFERENCES
[1] Alfred V. Aho and Jeffrey D. Ullman, "Principles of Compiler Design", Tenth Indian Reprint, Pearson Education, 2003.
[2] Jiantao Pan, "Software Testing", 18-849b Dependable Embedded Systems, Carnegie Mellon University, Spring 1999. http://www.ece.cmu.edu/~koopman/des_s99/sw_testing/
[3] R. A. DeMillo, E. W. Krauser and A. P. Mathur, "An Overview of Compiler-integrated Testing", Australian Software Engineering Conference 1991: Engineering Safe Software; Proceedings, 1991.
[4] A. S. Kossatchev and M. A. Posypkin, "Survey of Compiler Testing Methods", Programming and Computer Software, Vol. 31, No. 1, 2005, pp. 10–19.
[5] Harsha S. Adiga, "Porting Linux applications to 64-bit systems", 12 Apr 2006. http://www.ibm.com/developerworks/library/l-port64.html
[6] http://cdac.in/html/ssdgblr/f90ide.asp
[7] http://www.cocolab.com/en/cocktail.html
[8] http://www.fortran2000.com/ArnaudRecipes/cvs21_f95.html
[9] Stephen C. Johnson, "Yacc: Yet Another Compiler-Compiler", July 31, 1978. http://www.cs.man.ac.uk/~pjj/cs211/yacc/yacc.html
[10] Cathleen Shamieh, "Understanding 64-bit PowerPC architecture", 19 Oct 2004. http://www.ibm.com/developerworks/library/pamicrodesign/
[11] Steven Nakamoto and Michael Wolfe, "Porting Compilers & Tools to 64 Bits", Dr. Dobb's Portal, August 01, 2005.
[12] "lint Source Code Checker", C User's Guide, Sun Studio 11, 819-3688-10.
[13] Andrey Karpov and Evgeniy Ryzhkov, "Traps detection during migration of C and C++ code to 64-bit Windows". http://www.viva64.com/content/articles/64-bit-development/?f=TrapsDetection.html=en&content=64-bit-development
[14] http://www.itl.nist.gov/div897/ctg/fortran_form.htm
[15] Stefan Goedecker and Adolfy Hoisie, "Performance Optimization of Numerically Intensive Codes", Society for Industrial and Applied Mathematics, 2001.
[16] Tali Moreshet, R. Iris Bahar and Maurice Herlihy, "Energy Reduction in Multiprocessor Systems Using Transactional Memory", ISLPED'05, USA.
[17] Y. Cao, T. Okuma and H. Yasuura, "Low-Energy Memory Allocation and Assignment Based on Variable Analysis for Application-Specific Systems", IEICE Technical Report, pp. 31–38, Japan, 2002.
[18] Changjiu Xian and Yung-Hsiang Lu, "Energy reduction by workload adaptation in a multi-process environment", Proceedings of the Conference on Design, Automation and Test in Europe, 2006.
[19] J. Hom and U. Kremer, "Inter-Program Optimizations for Conserving Disk Energy", International Symposium on Low Power Electronics and Design (ISLPED'05), San Diego, California, August 2005.
[20] C.-H. Hsu and U. Kremer, "Compiler-Directed Dynamic Voltage Scaling Based on Program Regions", Rutgers University Technical Report DCS-TR-461, November 2001.
[21] Wei Zhang, "Compiler-Directed Data Cache Leakage Reduction", IEEE Computer Society Annual Symposium on VLSI, ISVLSI'04.
[22] U. Kremer, "Low Power/Energy Compiler Optimizations", in Low-Power Electronics Design (ed. Christian Piguet), CRC Press, 2005.
[23] Majid Sarrafzadeh, Prithviraj Banerjee, Alok Choudhary and Andreas Moshovos, "PACT: Power Aware Compilation and Architectural Techniques", University of California, Los Angeles, Dept. of Computer Science.
[24] U. Kremer, "Compilers for Power and Energy Management", Tutorial, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'03), San Diego, CA, June 2003.
ADCOM 2009
SERVER VIRTUALIZATION

Session Papers:

1. J. Lakshmi, S. K. Nandy, "Is I/O Virtualization ready for End-to-End Application Performance?" – INVITED PAPER

2. S. Prakki, "Eco-friendly Features of a Data Centre OS" – INVITED PAPER
Is I/O virtualization ready for End-to-End Application Performance?
J. Lakshmi, S. K. Nandy
Indian Institute of Science, Bangalore, India
{jlakshmi,nandy}@serc.iisc.ernet.in

Abstract: Workload consolidation using the system virtualization feature is the key to many successful green initiatives in data centres. In order to exploit the available compute power, such systems warrant sharing of other hardware resources, like memory, caches, I/O devices and their associated access paths, among multiple threads of independent workloads. This mandates the need for ensuring end-to-end application performance. In this paper we explore the current practices for I/O virtualization, using sharing of the network interface card as the example, with the aim of studying the support for end-to-end application performance guarantees. To ensure end-to-end application performance and limit the interference caused by the sharing of devices, we present an evaluation of a previously proposed end-to-end I/O virtualization architecture. The architecture is an extension of the PCI-SIG IOV specification of I/O hardware to support reconfigurable device partitions, and uses the VMM-bypass technique for device access by the virtual machines. Simulation results of the architecture for application quality-of-service guarantees demonstrate the flexibility and scalability of the architecture.

Keywords – Multicore server virtualization, I/O virtualization architectures, QoS, Performance.

I. Introduction
Multi-core servers have brought tremendous computing capacity to commodity systems. These multi-core servers have not only prompted applications to use fine-grained parallelism to take advantage of the abundance of CPU cycles, they have also initiated the coalescing of multiple independent workloads onto a single server. Multicore servers combined with system virtualization have led to many successful green initiatives of data centre workload consolidation. This consolidation, however, needs to satisfy end-to-end application performance guarantees. Current virtualization technologies have evolved from the prevalent single-hardware single-OS model, which presumes the availability of all other hardware resources to the currently scheduled process. This causes performance interference among multiple independent workloads sharing an I/O device, based on the individual workloads. Major efforts towards consolidation have focussed on aggregating the CPU cycle requirement of the target workloads. But I/O handling of these workloads on the consolidated servers results in sharing of the physical resources and their associated access paths. This sharing causes interference that is dependent on the consolidated workloads and makes the application performance non-deterministic [1][2][3]. In such scenarios, it is essential to have appropriate mechanisms to define, monitor and ensure resource sharing policies across the various contending workloads. Many applications, like real-time hybrid voice and data communication systems onboard aircraft and naval vessels, streaming and on-demand video delivery, and database and web-based services, when consolidated onto virtualized servers, need to support soft real-time application deadlines to ensure performance.

Standard I/O devices are not virtualization aware, and hence their virtualization is achieved using a software layer for multiplexing device access to independent VMs. In such cases I/O device virtualization is commonly achieved following two basic modes of virtualization, namely para-virtualization and emulation [4]. In para-virtualization the physical device is accessed and controlled using a protected domain, which could be the virtual machine monitor (VMM) itself or an independent virtual machine (VM), also called the independent driver domain (IDD) as in Xen. The VMM or IDD actually does the data transfer to and from the device into its I/O address space using the device's native driver. From there, the copy or transfer of the data to the VM's address space is done using what is commonly called the para-virtualized device driver. The para-virtualized driver is specifically written to support a specific mechanism of data transfer between the VMM/IDD and the VM, and needs a change in the OS of the VM (also called the GuestOS).

In emulation, the GuestOS of the VM installs a device driver for the emulated virtual device. All the calls of this emulated device driver are trapped by the VMM and translated to the native device driver's calls. The advantage of emulation is that it allows the GuestOS to be unmodified and hence easier to adopt. However, para-virtualization has been found to be much better in performance when compared to emulation. This is because emulation results in per-instruction translation, whereas para-virtualization involves only page-address translation. But both these modes of device virtualization impose resource overheads when compared to non-virtualized servers. These overheads translate into application performance loss.

The second drawback of the existing architectures is their lack of sufficient quality of service (QoS) controls to manage device usage on a per-VM basis. A desirable feature of these controls is that they should guarantee application performance with specified QoS on the shared device, and this performance should be unaffected by the workloads sharing the device. The other desirable feature is that the unused device capacity should be available for use by the other VMs. Prevalent virtualization technologies like Xen and Vmware, and even standard Linux distributions, use a software layer within the network stack to implement NIC usage policies. Since these systems were built with the assumption of the single-hardware single-OS model, these features provide the required control on the outgoing traffic from the NIC of the server. The issue is with the incoming traffic. Since the policy management is done above the physical layer, ingress traffic accepted by the device is later dropped based on input stream policies. This results in the respective application not receiving the data, which perhaps satisfies the application QoS, but causes wasted use of the device bandwidth that affects the delivered performance of all the applications sharing the device. Also, it leads to non-deterministic performance that varies with the type of applications using the device. This model is insufficient for the virtualized server supporting sharing of the NIC across multiple VMs. In this paper we describe and evaluate an end-to-end I/O virtualization architecture that addresses these drawbacks.

The rest of the paper is organized as follows. Section II presents experimental results on existing virtualization technologies, namely Xen and Vmware, that motivate this work; section III then describes an end-to-end I/O virtualization architecture to overcome the issues raised in section II; section IV details the evaluation of the architecture and presents the results; section V highlights the contributions of this work with respect to existing literature; and section VI details the conclusions.

II. Motivation
Existing I/O virtualization architectures use extra CPU cycles to fulfill an equivalent I/O workload. These overheads reflect in the achievable application performance, as depicted by the graph of Figure 1. The data in this graph represents the achievable throughput by the httperf [5] benchmark hosted on non-virtualized and virtualized Xen [6] and Vmware-ESXi [7] servers. In each case the http server was hosted on a Linux (FC6) OS, and for the virtualized server, the hypervisor, IDD (Xen) [8] and the virtual machine were pinned to use the same physical CPU. The server used was a dual-core Intel Core2Duo system with 2GB RAM and a 10/100/1000Mbps NIC. In the Xen hypervisor the virtual NIC used by the VM was configured to use a para-virtualized device driver implemented using the event channel mechanism and a software bridge for creating virtual NICs. In the case of the Vmware hypervisor the virtual NIC used inside the VM was configured using a software switch, with access to the device through emulation.

Figure 1: httperf benchmark throughput graph for non-virtualized, Xen and Vmware-ESXi virtual machine hosting http server on Linux(FC6).1

1 Some data of the graph has been reused from [9][10].

As can be observed from the graphs of Figure 1, the sustainable throughput of the benchmark drops considerably when the http server is moved to a virtualized server when compared to the non-virtualized server. The reason for this drop is answered by the CPU utilization graph depicted in Figure 2. From the graphs we notice that on moving the http server from the non-virtualized to the virtualized server, the %CPU utilization to support the same httperf workload increases significantly, and this increase is substantial for the emulated mode of device virtualization. The reason for this increased CPU utilization is the I/O device virtualization overheads.

Further, when the same server is consolidated with two VMs sharing the same NIC, each supporting one stream of an independent httperf benchmark, there is a further drop in achievable throughput per VM. This is explicable, since each VM now contends for the same NIC. The virtualization mechanisms not only share the device but also the
device access paths. This sharing causes serialization, which leads to latencies and application performance loss that is dependent on the nature of the consolidated workloads. Also, the increased latencies in supporting the same I/O workload on the virtualized platform cause loss of usable device bandwidth, which further reduces the scalability of device sharing by multiple VMs.

Figure 2: CPU resource utilized by the http server to support httperf benchmark throughput.1

This scalability can be improved to some extent by pinning different VMs to independent cores and using a high-speed, high-bandwidth NIC. Still, the high virtualization overheads coupled with serialization due to shared access paths restrict device-sharing scalability.

The next study evaluates the NIC-specific QoS controls existing in Xen and Vmware. Since current NICs do not support QoS controls, these are provided by the OS managing the device. In either case, these controls are implemented in the network stack above the physical device. Because of this they are insufficient, as is displayed by the graphs in Figure 3 and Figure 4.

Figure 3: httperf achievable throughput on a Vmware-ESXi consolidated server with NIC sharing and QoS guarantees.

Figure 4: httperf achievable throughput on a Xen consolidated server with NIC sharing and QoS guarantees.

Each of the graphs shows the maximum achievable throughput by VM1 when VM2 is constrained by a specified QoS guarantee. This guarantee is the maximum throughput that VM2 should deliver, and is implemented using the network QoS controls available in Xen and Vmware servers. In Xen these QoS controls are implemented using the tc utilities of the netfilter module of the Linux OS of the Xen IDD. In Vmware, Veam-enabled network controls are used. VM1 and VM2 are two different VMs hosted on the same server sharing the same NIC. We observe that for the unconstrained VM, in this case VM1, the maximum achievable throughput does not exceed 200 for Vmware and 475 for Xen. This is considerably low when compared to the maximum achievable throughput for a single VM using the NIC. The reason is that the constrained VM, namely VM2, is receiving all requests. VM2 is also processing these requests and generating appropriate replies, which results in CPU resource consumption. Only some replies to the received requests are dropped, based on the currently applicable QoS on the usable bandwidth. This is because both Vmware and Xen support QoS controls only on the egress traffic at the NIC. This approach to QoS control on resource usage is wasteful and coarse-grained. As can be observed, as the constraint on VM2 is relaxed, the behaviour of NIC sharing approaches best effort, and the resulting throughput achievable by either VM is obviously less than what can be achieved when a single VM is hosted on the server. These graphs clearly demonstrate the insufficiency of the existing QoS controls.

From the above experiments we conclude the following drawbacks in the existing I/O virtualization architectures:
• Building hypervisors or VMMs using the single-hardware single-OS model leads to cohesive architectures with high virtualization overheads. Virtualization overheads being high causes loss of usable device bandwidth. This
often results in under-utilized resources and limited consolidation ratios, particularly for I/O workloads. The remedy is to build I/O devices that are virtualization aware, and to decouple device management from device access, i.e., provide native access to the I/O device from within the VM and let the VMM manage concurrency issues rather than ownership issues.
• Lack of fine-grained QoS controls on device sharing causes performance loss that is dependent on the workloads of the VMs sharing the device. This leads to scalability issues in sharing the I/O device. To address this, the I/O device should support QoS controls for both the incoming and outgoing traffic.

To overcome the above drawbacks, we propose an I/O virtualization architecture. This architecture proposes an extension to the PCI-SIG IOV specification [11] for virtualization-enabled hardware I/O devices, with a VMM-bypass [12] mechanism for virtual device access.

III. End-to-End I/O Virtualization Architecture

We propose an end-to-end I/O virtualization architecture that enables direct, or native, access to the I/O device from within the VM rather than accessing it through the layer of the VMM or IDD. The PCI-SIG IOV specification proposes virtualized I/O devices that can support native device access by the VM, provided the hypervisor is built to support such architectures. IOV-specified hardware can support multiple virtual devices at the hardware level. The VMM needs to be built such that it can recognize and export each virtual device to an independent VM, as if the virtual device were an independent physical device. This allows native device access to the VM. When a packet hits the hardware-virtualized NIC, the VMM should recognize the destination VM of the incoming packet by the interrupt raised by the device and forward it to the appropriate VM. The VM processes the packet as it would in a non-virtualized environment. Here, device access and scheduling of device communication are managed by the VM that is using the device. This eliminates the intermediary VMM/IDD on the device access path and reduces I/O service time, which improves the usable device bandwidth and application throughput.

To support the idea of QoS based on device usage, we extend the IOV architecture specification by enabling reconfigurable memory on the I/O device. For each virtual device defined on the physical device, the device memory associated with the virtual device is derived from the QoS requirement of the VM to which the virtual device is allocated. This, along with features like TCP offload, virtual device priority and bandwidth specification support at the device level, provides fine-grained QoS controls at the device while sharing it with other VMs, as is elaborated upon in the evaluation section.

Figure 5 gives a block schematic of the proposed I/O virtualization architecture.2 The picture depicts a NIC card that can be housed within a virtualized server. The card has a controller that manages the DMA transfer to and from the device memory. The standard device memory is replaced by a re-partitionable memory supported with n sets of device registers. A set of m memory partitions, where m ≤ n, with device registers forms the virtual NICs (vNICs). Ideally the device memory should be reconfigurable, i.e. dynamically partitionable, and the VM's QoS requirements would drive the sizing of the memory partition. The advantage of having a dynamically partitionable device memory is that any unused memory can easily be extended into, or reclaimed from, a vNIC in order to match adaptive QoS specifications. The NIC identifies a vNIC request by generating message signaled interrupts (MSI). The number of interrupts supported by the controller restricts the number of vNICs that can be exported. Based on the QoS guarantees a VM needs to honour, judicious use of native and para-virtualized access to the vNICs can overcome this limitation. A VM that has to support stringent QoS guarantees can choose native access to the vNIC, whereas those VMs that are looking for best-effort NIC access can be allowed para-virtualized access to the vNIC.

Figure 5: NIC architecture supporting MSI interrupts with partitionable device memory, multiple device register sets and DMA channels enabling independent virtual-NICs.

2 This section is being reproduced from [10] to maintain continuity in the text. The complete architecture description, with performance statistics on achievable application throughput, can be found in [10].
The VMM can aid in setting up the appropriate hosting connections based on the requested QoS requirements.

The proposed architecture can be realized by the following modifications:

• Virtual-NIC: In order to define a vNIC, the physical device should support time-sharing in hardware. For a NIC this can be achieved by using MSI and dynamically partitionable device memory. These form the basic constructs to define a virtual device on a physical device, as depicted in Figure 5. Each virtual device has a specific logical device address, like the MAC address in the case of NICs, based on which the MSI is routed. Dedicated DMA channels, a specific set of device registers and a partition of the device memory are part of the virtual device interface which is exported to a VM when it is started. We call this virtual interface the virtual-NIC, or vNIC; it forms a restricted address space on the device for the VM to use and remains in the possession of the VM till it is active or relinquishes the device.

• Accessing virtual-NIC: For accessing the virtual-NIC, a native device driver is hosted inside the VM and is initialized with the help of the VMM when the VM is initialized. This device driver can only manipulate the restricted device address space which was exported through the vNIC interface by the VMM. With the vNIC, the VMM only identifies and forwards the device interrupts to the destination VM. The OS of the VM now handles the I/O access and thus can be accounted for the resource usage it incurs. This eliminates the performance interference due to the VMM/IDD handling multiple VMs' requests to/from a shared device. Also, because the I/O access is now directly done by the VM, the service time on the I/O access reduces, thereby resulting in better bandwidth utilization. With the vNIC interface, data transfer is handled by the VM. While initializing the device driver for the virtual NIC, the VM sets up the Rx/Tx descriptor rings within its address space and makes a request to the VMM for initializing the I/O page translation table. The device driver uses this table and performs DMA transfers directly into the VM's address space.

• QoS and virtual-NIC: The device memory partition acts as a dedicated device buffer for each of the VMs, and with appropriate logic on the NIC card one can easily implement QoS-based SLAs on the device that translate to bandwidth restrictions and VM-specific priority. The key is being able to identify the incoming packet to the corresponding VM, which the NIC is now expected to do. While communicating, the NIC controller decides on whether to accept or reject the incoming packet based on the bandwidth specification or the virtual device's available memory. This gives a fine-grained control on the incoming traffic and helps reduce the interference effects. The outbound traffic can be controlled by the VM itself using any of the mechanisms as is done in the existing architectures.

IV. Architecture Evaluation for QoS controls

The proposed architecture was generated using a Layered Queuing Network (LQN) model, and service times for the various entries of the model were obtained by using runtime profilers on the actual Xen based virtualized server. Complete model building and validation details are available in [9][10]. Here we present the results of the QoS evaluation carried out using the LQN model [12] of the proposed architecture. The QoS experiments were conducted along the same lines as described in the introduction section. The difference now is that the QoS control is applied on the ingress traffic of the constrained VM, namely VM2. The results obtained are depicted in Figure 6. The proposed architecture allows for achieving higher application throughput on the shared NIC, firstly because of the VMM-bypass [12]. Also, as can be observed from the graphs, the control of ingress traffic in the case of the httperf benchmark shows a highly improved performance benefit to the unconstrained VM, namely VM1.

Figure 6: httperf throughput sharing on a QoS controlled, shared NIC between two VMs using the proposed architecture with throughput constraints applied on the ingress traffic of VM2 at the NIC.

For request-response kind of benchmarks like httperf, controlling the ingress bandwidth is beneficial because once a request is dropped due to saturation of allocated bandwidth, there is no
downstream activity associated with it, and wasteful resource utilization of the NIC and CPU is avoided. The QoS control at the device on the input stream of VM2, and the native access to the vNICs by the VMs, give the desired flexibility of making the unused bandwidth available to the unconstrained VM.

V. Related work

In early implementations, I/O virtualization adopted dedicated I/O device assignment to a VM. This later evolved to device sharing across multiple VMs through virtualized software interfaces [14][4]. A dedicated software entity, called the I/O domain, is used to perform physical device management. The I/O domain is either part of the VMM or is by itself an independent domain, like the IDD of Xen [8][15]. With this intermediary software layer between the device and the VM, any application in a VM seeking access to the device has to route the request through it. This architecture still builds over the single-hardware single-OS model [16]-[21]. The consequence of such virtualization techniques is visible in the loss of application throughput and usable device bandwidth on virtualized servers, as discussed earlier. Because of the poor performance of these I/O virtualization architectures, a need to build concurrent access to shared I/O devices was felt, and recent publications on concurrent direct network access (CDNA) [22][19] and the scalable self-virtualizing network interface describe such efforts. However, the scalable self-virtualizing interface [23] describes assigning a specific core for network I/O processing on the virtual interface and exploits multiple cores on embedded network processors for this. The paper does not detail how the address translation issues are handled, particularly in the case of virtualized environments. The CDNA architecture is similar to the proposal in this paper in terms of allowing multiple VM-specific Rx and Tx device queues. But CDNA still builds over the VMM/IDD handling the data transfer to and from the device. Although the results of this work are exciting, the architecture still lacks the flexibility required to support fine-grained QoS. And the paper neither discusses the performance interference due to uncontrolled data reception by the device nor highlights the need for addressing the QoS constraints at the device level. The proposed architecture in this paper addresses these, and also the issue of pushing the basic constructs to assign QoS attributes, like required bandwidth and priority, into the device to get finer control on resource usage and on restricting performance interference.

The proposed architecture has its basis in exokernel's [24] philosophy of separating device management from protection. In exokernel, the idea was to extend native device access to applications, with exokernel providing the protection. In our approach, the extension of native device access is to the VM, the protection being managed by the VMM. A VM is assumed to be running a traditional OS. Further, the PCI-SIG community has realized the need for I/O device virtualization and has come out with the IOV specification to deal with it. The IOV specification, however, details device features to allow native access to virtual device interfaces, through the use of I/O page tables, virtual device identifiers and virtual device specific interrupts. The specification presumes that QoS is a software feature and does not address it. Many implementations adhering to the IOV specification are now being introduced in the market by Intel [25], Neterion [26], NetXen [27], Solarflare [28], etc. The CrossBow [29] suite from Sun Microsystems talks about this kind of resource provisioning, but it is a software stack over standard IOV-compliant hardware. The results published using any of these products are exciting in terms of the performance achieved, but almost all of them have ignored the control of reception at the device level. We believe that the lack of such a control on highly utilized devices will cause non-deterministic application performance loss and under-utilization of the device bandwidth.

VI. Conclusion

In this paper we described how the lack of virtualization awareness in I/O devices leads to latency overheads on the I/O path. Added to this, the intermixing of device management and data protection issues further increases the latency, thereby reducing the effective usable bandwidth of the device. Also, the lack of appropriate device sharing control mechanisms at the device level leads to loss of bandwidth and performance interference on the device-sharing VMs. To address these issues we proposed an I/O device virtualization architecture, as an extension to the PCI-SIG IOV specification, and demonstrated its benefit through simulation techniques. Results demonstrate that by moving the QoS controls to the shared device, the unused bandwidth is made available to the unconstrained VM, unlike the case in prevalent technologies. The proposed architecture also improves the scalability of VMs sharing the NIC because it eliminates the common software entity which regulates I/O device sharing. The other advantage is that with this architecture, the maximum resource utilization is now accounted for by the VM. Also, this architecture reduces the workload interference on sharing a device and simplifies the consolidation process.
References

[1] M. Welsh and D. Culler, "Virtualization considered harmful: OS design directions for well-conditioned services", Hot Topics in OS, 8th Workshop, 2001.
[2] Kyle J. Nesbit, James E. Smith, Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, Mateo Valero, "Multicore Resource Management", IEEE Micro, Vol. 28, Issue 3, pp. 6-16, 2008.
[3] Kyle J. Nesbit, Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, Mateo Valero, and James E. Smith, "Virtual Private Machines: Hardware/Software Interactions in the Multicore Era", IEEE Micro special issue on Interaction of Computer Architecture and Operating System in the Manycore Era, May/June 2008.
[4] Scott Rixner, "Breaking the Performance Barrier: Shared I/O in virtualization platforms has come a long way, but performance concerns remain", ACM Queue – Virtualization, Jan/Feb 2008.
[5] D. Mosberger and T. Jin, "httperf: A Tool for Measuring Web Server Performance", ACM Workshop on Internet Server Performance, pp. 59-67, June 1998.
[6] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield, "Xen and the art of virtualization", 19th ACM SOSP, Oct. 2003.
[7] "VMware ESX Server 2 - Architecture and Performance Implications", 2005, available at http://www.vmware.com/pdf/esx2_performance_implications.pdf
[8] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williamson, "Safe hardware access with the Xen virtual machine monitor", 1st Workshop on OASIS, Oct. 2004.
[9] J. Lakshmi, S. K. Nandy, "Modeling Architecture-OS interactions using Layered Queuing Network Models", International Conference Proceedings of HPC Asia, March 2009, Taiwan.
[10] J. Lakshmi, S. K. Nandy, "I/O Device virtualization in Multi-core era, a QoS Perspective", Workshop on Grids, Clouds and Virtualization, International Conference on Grids and Pervasive Computing, Geneva, May 2009.
[11] PCI-SIG IOV Specification, available online at http://www.pcisig.com/specifications/iov
[12] J. Liu, W. Huang, B. Abali, and D. K. Panda, "High performance VMM-bypass I/O in virtual machines", Proceedings of the USENIX Annual Technical Conference, June 2006.
[13] Layered Queueing Network Solver software package, http://www.sce.carleton.ca/rads/lqns/
[14] T. von Eicken and W. Vogels, "Evolution of the virtual interface architecture", Computer, 31(11), 1998.
[15] J. Sugerman, G. Venkatachalam, and B. Lim, "Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor", Proceedings of the USENIX Annual Technical Conference, June 2001.
[16] D. Gupta, L. Cherkasova, R. Gardner, and A. Vahdat, "Enforcing performance isolation across virtual machines in Xen", in M. van Steen and M. Henning, editors, Middleware, volume 4290 of Lecture Notes in Computer Science, pp. 342-362, Springer, 2006.
[17] C. Weng, Z. Wang, M. Li, and X. Lu, "The hybrid scheduling framework for virtual machine systems", Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Washington, DC, USA, March 11-13, 2009.
[18] H. Kim, H. Lim, J. Jeong, H. Jo, and J. Lee, "Task-aware virtual machine scheduling for I/O performance", Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Washington, DC, USA, March 11-13, 2009.
[19] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel, "Diagnosing performance overheads in the Xen virtual machine environment", Proceedings of the ACM/USENIX Conference on Virtual Execution Environments, June 2005.
[20] A. Menon, A. L. Cox, and W. Zwaenepoel, "Optimizing network virtualization in Xen", Proceedings of the USENIX Annual Technical Conference, June 2006.
[21] J. R. Santos, G. Janakiraman, Y. Turner, and I. Pratt, "Netchannel 2: Optimizing network performance", Xen Summit Talk, November 2007.
[22] P. Willmann, J. Shafer, D. Carr, A. Menon, S. Rixner, A. L. Cox, and W. Zwaenepoel, "Concurrent direct network access for virtual machine monitors", Proceedings of the International Symposium on High-Performance Computer Architecture, February 2007.
[23] H. Raj and K. Schwan, "Implementing a scalable self-virtualizing network interface on a multicore platform", Workshop on the Interaction between Operating Systems and Computer Architecture, Oct. 2005.
[24] M. Frans Kaashoek, et al., "Application Performance and Flexibility on Exokernel Systems", 16th ACM SOSP, Oct. 1997.
[25] Intel Virtualization Technology for Directed I/O, www.intel.com/technology/itj/2006/v10i3/2-io/7-conclusion.htm
[26] Neterion, http://www.neterion.com/
[27] NetXen, http://www.netxen.com/
[28] Solarflare Communications, http://www.solarflare.com/
[29] CrossBow: Network Virtualization and Resource Control, http://www.opensolaris.org/os/community/networking/crossbow_sunlabs_ext.pdf
Eco-Friendly Features of a Data Center OS
Surya Prakki
Sun Microsystems
Bangalore, India
surya.prakki@sun.com

Abstract—This paper presents the different technologies a modern operating system like OpenSolaris offers which will help data centers become more eco-friendly. It starts off with the various virtualization technologies OpenSolaris has which help in driving consolidation of systems, thus reducing the system footprint in a data center. It introduces the power aware dispatcher (PAD) and how it plays well with the various processor supported power states, and moves on to observability tools which prove how well the system is using the power management features.

Keywords-virtualization; consolidation; green computing;

I. INTRODUCTION

Data centers are becoming the backbone of the success of any enterprise. Enterprises are offering a lot of services over the web – whether it be bank transactions, booking of tickets or maintaining social relationships, etc. This is leading to automation of more and more services, thus resulting in even more computers being deployed in a data center - server sprawl. And as these computers are getting more and more powerful, their energy requirements have gone up, and with increasing energy costs, data centers are impacting both the ecology and the economics of running them.

This brings along a very interesting challenge to modern OS developers:

• How to move away from the traditional approach of hosting one service on one computer to running multiple services on a single computer, and address the following challenges in doing so:
  o Isolation
  o Minimizing the overheads
  o Meeting peak load requirements (QoS)
  o Secure execution
  o Meeting different patch requirements of different applications
  o Heterogeneous work loads
  o Testing and development
  o Enforcing resource controls
  o Observability
  o Fail over through replication
  o Supporting legacy applications
  o Simple administration

• How to reduce the energy requirements of an idling computer?

The first problem, of workload consolidation, can be addressed using virtualization. There is no silver bullet virtualization technology that can address all of the above challenges.

II. OPERATING SYSTEM LEVEL VIRTUALIZATION

If the requirement is to consolidate homogeneous workloads [applications compiled for the same platform, i.e. operating system], it is preferable to opt for OS level virtualization. For the rest of the discussion let us look at the Zones technology in OpenSolaris, which provides this feature, and see how it solves the above challenges.

A. Zones

Zones provide a very low overhead mechanism to virtualize operating system services, allowing one or more processes to run in isolation from other activity on the system. The kernel exports a number of distinct objects that can be associated with a particular zone, like processes, file system mounts, network interfaces (I/F), etc. No zone can access objects belonging to another zone. This isolation prevents processes running within a given zone from monitoring or affecting processes running in other zones. Thus a zone is a 'sandbox' within which one or more applications can run without affecting or interacting with the rest of the system.

As the underlying kernel is the same, physical resources are multiplexed across zones, improving the effective utilization of the system.

The first zone that comes up on installing an OpenSolaris system is referred to as the 'global zone', and all non-global zones need to be configured and installed from this global zone.

The Zones framework provides an abstraction layer that separates applications from physical attributes of the machine on which they are deployed, such as physical device paths and network I/F names. This enables things like multiple web servers running in different zones connecting to the same port using the distinct IP addresses associated with each zone.

OpenSolaris has broken down the privileges associated with a root owned process into finer grained ones, and zones inherit a subset of these privileges(5), thus making sure that even if a zone is compromised, the intruder can't do any
damage to the rest of the system. Another outcome of this is that even privileged processes in non-global zones are prevented from performing operations that can have system-wide impact.

Administration of individual zones can be delegated to others, knowing that any actions taken by them would not affect the rest of the system.

Zones do not present a new API or ABI to which applications need to be 'ported' – i.e. existing OpenSolaris applications run inside zones without any changes or recompilation. So a process running inside a zone runs natively on the CPU and hence doesn't incur any performance penalty.

A zone or multiple zones can be tied to a resource pool – which pulls together CPUs and a scheduler. The resources associated with a pool can be dynamically changed. This enables the global administrator to give more resources to a zone when it is nearing its peak demand. The Fair Share Scheduler (FSS) can be associated with a pool. Using FSS, a physical CPU assigned to multiple zones, via a resource pool, can be shared as per their entitlement – even in the face of the most demanding work loads, thus guaranteeing quality of service (QoS). The physical memory and swap memory are also configured per zone and can be changed dynamically to meet the varying demands of a zone.

Zones can be booted and halted independent of the underlying kernel. As zone boot involves only setting up the virtual platform and mounting the configured file systems, it is a much faster operation compared to booting of even the smallest physical system.

The software packages that need to be available in a zone are a function of the service it is hosting, and a zone administrator is free to pick and choose. This makes a zone look like a 'Just Enough OS' kind of slick environment which is tailor made to the services it hosts. Likewise, a zone administrator is also free to maintain the patch levels of the packages she installed.

To save on file system space, two types of zones are supported – sparse and whole root. In the case of a sparse zone, some file systems like /usr are loopback mounted from the global zone so that multiple copies of the binaries are not present. In a whole root zone, all components that make up the platform are installed, and it thus takes more disk space.

The zones technology is extended to be able to run applications compiled for earlier versions of Solaris. This is referred to as branding a zone and is used to run Solaris 8 and Solaris 9 applications. In such a branded zone, the run time environment for a Solaris 8 application is no different from what it was on a box running the Solaris 8 kernel. This branding feature helps customers replace old power-hungry systems with newer eco-friendly computers. The operating environment should also provide tools which will help in making this transition smoother.

The OpenSolaris system has a rich set of observability tools like DTrace(1M), kstat(1M), truss(1), proc(4) and debugging tools like mdb(1) which can be used to study the behavior of, and debug, applications in a production environment. These tools can be run either from the global zone or from inside a non-global zone itself.

The extensibility of the zones framework can be gauged by the fact that there is an 'lx' brand using which Linux 2.6, 32-bit applications can be run unmodified on an OpenSolaris x86 system inside an lx branded zone. Thus even Linux applications can be observed from the global zone using the earlier mentioned tools.

A zone can be configured with an exclusive IP stack, such that it can have its own routing table, ARP table, IPSec policies and associations, IP filter rules and TCP/IP ndd variables. Each zone can continue to run services such as NFS on top of TCP/UDP/IP. This way a physical NIC can be set aside for the exclusive use of the zone. Zones can also be configured to have shared IP stacks on top of a single NIC.

Disks can be provided to a zone for its exclusive use, or a portion of a file system's space can be set aside for the zone, or, using the ZFS file system, space can be grown dynamically as the demand grows.

The following block diagram captures the above discussion:

Figure 1.

Thus to summarize, zones virtualize the following facilities:
• Processes
• File Systems
• Networking
• Identity
• Devices
• Packaging
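As an illustration of the facilities summarized above, the following sketch shows how a zone with FSS CPU shares and a memory cap might be configured and booted from the global zone. The zone name, path and limits are hypothetical, and the exact zonecfg(1M) subcommands vary across OpenSolaris releases:

```shell
# Illustrative only: "websvc", /zones/websvc, the 20 shares and the
# 2 GB cap are made-up values.
# Configure the zone from the global zone (zonecfg reads commands
# from standard input):
zonecfg -z websvc <<'EOF'
create
set zonepath=/zones/websvc
set autoboot=true
set scheduling-class=FSS
add rctl
set name=zone.cpu-shares
add value (priv=privileged,limit=20,action=none)
end
add capped-memory
set physical=2g
end
commit
EOF

# Install, boot and log in to the zone:
zoneadm -z websvc install
zoneadm -z websvc boot
zlogin websvc
```

In line with the resource pool discussion above, the share value and the memory cap can later be changed without rebooting the zone, letting the global administrator shift resources toward a zone approaching its peak demand.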
1) Field Use Cases:

• www.blastwave.com offers open source packages for different versions of Solaris. As a result they needed to continue to have physical systems running Solaris 8 and Solaris 9. When they made use of the branded zones feature, they were able to consolidate these legacy systems onto systems running Solaris 10, and they quantified the gain as follows:
  o 65 percent reduction in rack space, saving tens of thousands of dollars in power, cooling, and hardware-maintenance costs
  o Reduced setup time from hours to minutes.

• Sun IT heavily uses zones to host quite a few services, to quote a few:
  o Requests to make source changes are made via a portal which runs in a zone.
  o A service to manage lab infrastructure.
  o A Namefinder service which reports basic information about employees.
  o A service to host software patches.

One web server is run in each of the above zones and they all listen on port 80 without stamping on each other. All the above services are very critical to the running of the business. The first three cases do not see much change in workload and hence are easily virtualized using zones. In the case of the fourth, patches are released periodically and there will be a lot of hits in the first 48 hours of patches being made available; during this time, additional CPUs and physical memory can be set aside for this zone using dynamic resource pools via a cron(1M) job. In consolidating these services, we replaced four physical systems with a single one.

• Price et al. [11] report less than 4% performance degradation for time sharing workloads, attributed to the loopback mounts in the case of sparse zones.

B. Crossbow

Crossbow provides the building blocks for network virtualization and resource control by virtualizing the network stack and NIC around any service (HTTP, HTTPS, FTP, NFS, etc.), protocol or virtual machine. Crossbow does to the networking stack what Zones did to OS services.

One of the main components of Crossbow is the ability to virtualize a physical NIC into multiple virtual NICs (VNICs). These VNICs can be assigned to either zones or any virtual machines sharing the physical NIC. Virtualization is implemented by the MAC layer and the VNIC pseudo driver of the OpenSolaris network stack. It allows physical NIC resources such as hardware rings and interrupts to be allocated to specific VNICs. This allows each VNIC to be scheduled independently as per the load on the VNIC, and also allows classification of packets between VNICs to be off-loaded in hardware.

Each VNIC is assigned its own MAC address and an optional VLAN id (VID). The resulting MAC+VID tuple is used to identify a VNIC on the network, physical or virtual.

Crossbow allows a bandwidth limit to be set on a VNIC. The bandwidth limit is enforced by the MAC layer transparently to the user of the VNIC. This mechanism allows the administrator to configure the link speed of VNICs that are assigned to zones or VMs - this way they can't use more bandwidth than their assigned share.

Crossbow provides virtual switching semantics between VNICs created on top of the same physical NIC. Virtual switching done by Crossbow is consistent with the behavior of a typical physical switch found on a physical network.

Crossbow VNICs and virtual switches can be combined to build a Crossbow Virtual Wire (vWire). A vWire can be a fully virtual network in a box. A vWire can be used to instantiate a layer-2 network which can be used to run a distributed application spanning multiple virtual hosts.

Virtual Network Machines (VNMs) are pre-canned OpenSolaris zones which encapsulate a network function such as routing, load balancing, etc. VNMs are assigned their own VNIC(s), and can be deployed on a vWire to provide network functions needed by the network applications. VNMs can come in a pre-configured fashion, which helps in deploying new instances quickly.

III. PLATFORM LEVEL VIRTUALIZATION

There are some instances where OS level virtualization falls short:

• Any decent sized data center will have heterogeneous workloads - applications compiled for different platforms - and any effort to consolidate such workloads can't be fully achieved by OS level virtualization.
• A service or application needs a specific kernel module or driver to operate.
• Different applications need different kernel patch levels.
• Consolidating legacy applications which expect end of life (EOL) operating systems.

In such scenarios, a platform virtualization solution can be used to consolidate the workloads. These solutions carve multiple Virtual Machines (VMs) out of a physical machine and are referred to as either hypervisors or Virtual Machine Monitors (VMMs).

For the rest of the discussion let us look at the hypervisors LDOMs and Xen, which OpenSolaris supports on Sparc and x86 architectures respectively, and see how they address some of the challenges mentioned in the Introduction.
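The Crossbow VNIC creation and bandwidth-limit facilities described in Section II.B can be sketched with dladm(1M) as follows. The link names and the 100 Mbps cap are hypothetical:

```shell
# Illustrative only: e1000g0 is an assumed physical NIC name.
# Create two VNICs on the same physical NIC; tag the second with a
# VLAN id so its MAC+VID tuple identifies it on the network:
dladm create-vnic -l e1000g0 vnic1
dladm create-vnic -l e1000g0 -v 20 vnic2

# Cap vnic1 at 100 Mbps; the MAC layer enforces this transparently
# to whichever zone or VM the VNIC is assigned to:
dladm set-linkprop -p maxbw=100M vnic1

# Inspect the resulting virtual links and their properties:
dladm show-vnic
dladm show-linkprop -p maxbw vnic1
```

A VNIC configured this way can then be handed to a zone with an exclusive IP stack, giving that zone its own link with an enforced share of the physical bandwidth.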
A. Logical DOMains

LDOMs can be viewed as a hypervisor implemented in firmware. The LDOMs hypervisor is shipped along with sun4v based Sparc systems.

The hypervisor divides the physical Sparc machine into multiple virtual machines, called domains. Each domain can be dynamically configured with a subset of machine resources and its own independent OS. Isolation and protection are provided by the LDOMs hypervisor by restricting access to registers and address space identifiers (ASI).

The hypervisor takes care of partitioning of resources. Hardware resources are assigned to logical domains using 'Machine Descriptions (MD)'. An MD is a graph describing the devices available, which includes CPU threads/strands, memory blocks and the PCI bus. Each LDOM has its own MD. This is used to build the Open Boot Prom (OBP) and consequently the OS device tree upon booting the guest. A fallout of this model is that a guest OS doesn't even see any HW resources not present in its MD.

The first domain that comes up on a sun4v based system acts as a control domain. The LDOMs software manager runs in the control domain. This helps us to interact with the hypervisor and allocate and deallocate physical resources to guest domains. The management software consists of a daemon (ldmd(1M)) which interacts with the hypervisor, and the ldm(1M) CLI to interact with the daemon. ldmd(1M) can be run only in the control domain.

The hypervisor provides a fast point-to-point communication channel called the 'logical domain channel' (LDC) to enable communication between LDOMs, and between an LDOM and the hypervisor. LDC end points are defined in the MD. ldmd(1M) also uses LDC to interact with the HV in managing domains.

Initially all the resources are given to the control domain. Using ldm(1M), the administrator needs to detach resources from the primary domain and pass them onto the newly created guest domains.

I/O for guest domains can be configured in two ways:

Direct I/O: A guest domain could be given direct access to a PCI device, and the guest domain could manage all the I/O devices connected to it using its own disk drivers.

Virtualized I/O: One of the domains, called a 'service domain', controls the HW present on the system and presents 'virtual devices' to other guest domains. The guest could then use a virtual block device driver which forwards the I/O to the service domain through LDC. The virtual device being presented to the guest could be backed by a physical disk, or a slice of it, or even a regular file.

In the case of network I/O, the hypervisor presents a layer-2 'virtual network switch' which enables domain to domain traffic. Any number of v-LANs can be created by creating additional v-switches in the service domain. The switch talks to the device driver in the service domain to connect to the physical NIC for external network connection, and can tag along the layer-3 features of the service domain kernel to do routing, iptable filtering, NAT and firewalling.

The hypervisor automatically powers off CPU cores that are not in use – i.e. not assigned to any domains.

The following block diagram captures the above discussion:

Figure 2.

B. X86 Platform Virtualization

In the x86 space, over the years, many virtualization technologies have come in, and they can be classified under two types:

• Type 1 – Where the virtualization solution does not need a host OS to operate.
• Type 2 – Where the virtualization solution needs a host OS to operate.

For the rest of the discussion, let us look at Xen (a Type 1 hypervisor) and VirtualBox (a Type 2 hypervisor).

1) Xen

Xen is an open source hypervisor technology developed at the University of Cambridge. This solution is specific to Solaris running on x86 based systems. Each VM created can run a complete OS. Xen sits directly on top of the HW and below the guest operating systems. It takes care of partitioning the available CPU(s) and physical memory resources across the VMs.

Unlike other contemporary virtualization solutions that exist for x86, Xen started off with a different design approach to minimise the overheads a guest OS incurs – the guest needs to be ported to the Xen architecture, which very closely resembles the underlying x86 arch – this approach is referred to as 'Para Virtualization (PV)'.

In this approach, a PV guest kernel is made to run in ring-1 of the x86 architecture, while the hypervisor runs in ring-0. This way the hypervisor protects itself against any malicious guest
kernel. The way an application makes a system call to get into the kernel, a PV kernel needs to make a hypercall to get into the hypervisor – hypercalls are needed to request any services from the hypervisor.

The first VM that comes up on boot is referred to as Dom0 and the rest of the VMs are referred to as DomUs. To keep the hypervisor thin, Xen runs the device drivers that control the peripherals on the system in Dom0. This approach makes a lot of code execute in ring-1 rather than ring-0 and thus improves the security of the system.

In PV mode, I/O between guests and their virtual devices is directed through a combination of Front End (FE) drivers that run in a guest and Back End (BE) drivers that run in a domain that is hosting the device. The communication between the front end and back end happens via the hypervisor and hence is subject to privilege checks. The FE and BE drivers are class specific, i.e. one set for all block devices, and one set for network devices – thus the finer details associated with each specific device are completely avoided. This model of implementing the I/O performs far better than the emulated devices approach.

Depending on the guest OS needs, more physical memory can be passed on dynamically; likewise, if there is memory shortage in the system, the hypervisor can also take back memory from a guest.

Likewise, the number of virtual CPUs (vCPUs) associated with a guest can be dynamically increased if the guest needs more compute resources. Xen schedules these vCPUs onto the physical CPUs.

The management tools needed to configure and install guests also run in Dom0, thus making the hypervisor even thinner. This also improves the debuggability of the tools as

• Extended Page Tables (EPT): With EPT, the VMM doesn't have to maintain shadow page tables. The way page tables convert VA to PA, EPT converts guest physical to host physical addresses. This virtualizes CR3, which the guest continues to manipulate.

• VT-d: This enables guests to access devices directly so that the performance impact of emulated devices is cut out.

To get the best of both worlds [i.e. the PV approach and wanting to use HW advances], hybrid virtualization is picking up momentum – where we start off with a guest in HVM mode [so that there is no porting exercise] and then incrementally add PV drivers which will bypass device emulation and thus reduce virtualization overheads – in the case of OpenSolaris these PV drivers are already implemented for block and network devices.

The Xen tracing facility provides a way to record hypervisor events like a VCPU getting on and off a CPU, a VM blocking, etc., and this data can help nail performance issues with virtual machines.

Xen allows live migration of guests to similar physical machines, which effectively brings in load balancing features.

Xen also allows suspend and resume of guests, which can be used to start services on demand – this feature, along with ZFS snapshots, can be used to configure a guest, take a snapshot of it, and move it to a different physical machine; such nodes could give a simple fail over capability.

The following block diagram captures the above discussion:
they run in user space.
Given the recent advances in the CPUs, like VT-x of Intel
and SVM of AMD, xen allows unmodified guests to be run in a
VM – these are referred to as HVM guests. There are a couple
of major differences between how a PV guest is handled vis-a-
vis how a HVM guest is handled :
 Unlike PV guest, HVM guest, expects to handle the
devices directly by installing its own drivers – for this,
the hypervisor has to emulate different physical
devices and this emulation is done in the user land of
dom0. As can be inferred IO this way will be slower
than in PV approach.
Figure 3.
 HVM guest tries to install its own page tables. Xen
uses shadow page tables which track guests 'page table To summarize, xen helps in consolidating heterogeneous
modifications'. This can slow down handling page work loads that are commonly seen in a data center.
faults in the guest.
2)Virtual Box (VB)
In both PV as well as HVM guests, the dom0 needs to be
ported to xen platform. The applications run without any
modifications inside the guest operating systems. VB is an open source virtualization solution for x86 based
systems. VB runs as a process in the host operating system and
x86 CPUs, off late, are seeing a steady stream of new supports various latest as well as legacy operating systems to
features, to help implement VMMs easier: run as guests. The power of VB lies in the fact that the guest is
completely virtualized and is being run as just a process on the

30
host OS. VB follows the usual x86 virtualization technique of running guest ring-0 code in ring-1 of the host, and running guest ring-3 code natively.

Given its ease of use, VB can be used to consolidate desktop workloads – a new desktop can be configured and deployed in a few seconds' time.

It supports cold migration of guests, where one could copy over the Virtual Disk Image (VDI) to a different system along with the XML files (config details) and start the VM on the other system.

Creating duplicate VMs is as easy as copying over the VDI, removing the uuid from the image, and registering it with VB. VB emulates PCnet and Intel PRO network adapters, and supports different networking modes: NAT, HIF, and internal networking.

C. Field Use Case

The following are real world cases where we used platform virtualization inside Sun:

• Once a supported release of Solaris is made, a separate build machine is spawned off which is used to host sources and allows gate keepers to create patches. Earlier, one physical system used to be set aside for each release. Now, with Xen and LDOMs, a new VM is created for each release and multiple such guests can be hosted on a single system. Patch windows for each of the releases are staggered such that multiple guests won't hit their peak load together. A complete build of Solaris takes at least 10% longer inside a guest, but this is still acceptable as it is not a time critical job. The performance impact on the interactive workload, which an engineer might see while pushing her code change, is too small to be noticed.

• Engineers heavily use VB to test their changes in older releases of Solaris. So even though the performance degradation could be in the range of 5-20% depending on workload, it is still acceptable as it is only functionality and sanity testing [after my kernel change, will the system boot?].

So though there is an increase in the number of supported Solaris releases, there is not a corresponding increase in the number of physical systems in the data center, thus significantly saving on capital expenditure and carbon footprint.

IV. CPU MANAGEMENT

To address the second problem mentioned in the introduction, the operating system should support the various features provided by the hardware to reduce the idling power consumption of the system. For the rest of the discussion let us look at how the OpenSolaris kernel supports various power saving features of both the x86 and Sparc platforms.

A. Advanced Configuration and Power Interface (ACPI) Processor C-states

The ACPI C0 state refers to the normal functioning of the CPU, when it draws the rated voltage. The CPU enters the C1 state while running the idle thread, by issuing the halt instruction; in this state, other than the APIC and bus, the rest of the units do not consume power. The CPU runs the idle thread when there are no threads in the runnable state.

The ACPI processor C3 state and beyond are referred to as Deep C-state support, as the wake-up latency of these states is higher than that of the earlier states. In the C3 state, even the APIC and bus units are stopped, and the caches lose state. OS support is needed for C3 because of this state loss, and OpenSolaris incorporates this support.

B. Power Aware Dispatcher (PAD)

This feature extends the existing topology aware scheduling facility to bring 'power domain' awareness to the dispatcher. With this awareness in place, the dispatcher can implement a coalescence dispatching policy to consolidate utilization onto a smaller subset of CPU domains, freeing up other domains to be power managed. In addition to being domain aware, the dispatcher will also tend to prefer domains already running at lower C-states – this will increase the duration and extent to which domains can remain quiescent, improving the kernel's power saving ability.

Because the dispatcher tracks power domain utilization along the way, it can drive active domain state changes in an event driven fashion, eliminating the need for the power management subsystem to poll.

These current, conservative policies yield a 3.5% improvement in SPECpower on Nehalem but, more importantly, a 22.2% idle power saving.

C. CPU Frequency Scaling

It is possible to reduce the power consumption of a system by running at a reduced clock frequency when it is observed that CPU utilization is low.

D. Memory Power Management

Like CPUs, even memory can be put in a power saving state when the system is idle. OpenSolaris enables this feature on chipsets which support it.

E. Suspend to RAM

It is common for desktops and laptops to have extended periods of no activity, as the user could be away at lunch. To save power in such cases, OpenSolaris supports what is referred to as ACPI S3 support – whereby the whole system is suspended to RAM and power is cut off to the CPU.

F. CPU Hotplug

This is supported on quite a few Sparc platforms and is achieved by 'Dynamic Reconfiguration (DR)' support in the kernel. A DRed-out board containing CPUs is effectively powered off and can be pulled out; but it can also be left in there, so
that, depending on workload changes, the board can be DRed back in.

V. OBSERVABILITY TOOLS

There should be a mechanism by which an administrator can see how well a system is taking advantage of the power management features discussed above. For this, OpenSolaris provides a couple of tools:

A. Power TOP

powertop(1M) reports the activity that is making the CPU move to lower C-states and thus increase the power consumption. Addressing the reasons for that activity will make a CPU stay longer in the higher C-states.

B. Kstat

kstat(1M) reports at what clock frequency the CPU is currently operating and what the supported frequencies are. The lower the frequency, the better from a power consumption point of view.
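The same counters that kstat(1M) reports can also be read by a script. Below is a minimal sketch that parses `kstat -p`-style output ("module:instance:name:statistic<TAB>value" lines) for the cpu_info module; the sample text and its values are made up for illustration:

```python
# Hypothetical sample of `kstat -p` output for the cpu_info module;
# the statistic names follow the cpu_info kstat, the values are made up.
SAMPLE = """\
cpu_info:0:cpu_info0:current_clock_Hz\t1600000000
cpu_info:0:cpu_info0:supported_frequencies_Hz\t1600000000:2934000000
"""

def parse_kstat(text):
    """Map statistic name -> raw value for each kstat -p line."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split("\t", 1)
        statistic = key.split(":")[3]  # module:instance:name:statistic
        stats[statistic] = value
    return stats

stats = parse_kstat(SAMPLE)
current = int(stats["current_clock_Hz"])
supported = [int(v) for v in stats["supported_frequencies_Hz"].split(":")]
# Running at the lowest supported frequency is best for power consumption.
print(current == min(supported))
```

On a live OpenSolaris system the same parsing would be fed from the output of the real kstat(1M) command instead of the embedded sample.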
ADCOM 2009
HUMAN COMPUTER INTERFACE -1

Session Papers:

1. Shankkar B, Roy Paily and Tarun Kumar, “Low Power Biometric Capacitive CMOS Fingerprint Sensor System”

2. Raghavendra R, Bernadette Dorizzi, Ashok Rao and Hemantha Kumar, “Particle Swarm Optimization for Feature Selection: An Application to Fusion of Palmprint and Face”
Low Power Biometric Capacitive CMOS Fingerprint Sensor System

B. Shankkar, Tarun Kumar and Roy Paily

Department of Electronics and Communication Engineering, IIT Guwahati, Assam, India
Abstract—A charge sharing based sensor for obtaining fingerprints has been designed. The design used sub-threshold operation of MOSFETs to achieve a low power sensor device working at 0.5 V. The interfacing circuitry and a fingerprint matching algorithm were also designed to support the sensor and complete a fingerprint verification system.

Index Terms—Fingerprint, sensor, charge sharing, sub-threshold, low power

I. INTRODUCTION

Biometrics is the automated method of recognizing a person based on a physiological or behavioral characteristic. Biometric recognition can be used in identification mode, where the biometric system identifies a person from the entire enrolled population by searching a database for a match based solely on the biometric. A biometric system can also be used in verification mode, where the biometric system authenticates a person's claimed identity from their previously enrolled pattern. Various biometrics used for such purposes are signature verification, face recognition, fingerprints and iris recognition. Fingerprinting is one of the oldest ways of technically establishing a person's identity. Fingerprints have a distinct ridge and valley pattern on the tip of a finger for every individual. Every fingerprint can be divided into two structures: ridges and valleys.

As the world becomes more accustomed to devices shrinking in size with each passing day, and dependence on mobile devices increases at an enormous rate, a fingerprint authentication and identification system that can be mounted on such a device is imperative. Any fingerprint authentication system requires a sensor and corresponding circuitry, an interface, and a fingerprint matching algorithm. This paper presents the results achieved in simulation for all these modules.

II. SENSOR AND CORRESPONDING CIRCUITRY

Figure 1 shows the basic principle of the charge-sharing scheme [1]. The finger is modeled as the upper metal plate of a capacitor, with its lower metal plate in the cell. These electrodes are separated by the passivation layer of the silicon chip, formed by the metal oxide layer on the chip. The series combination of the two capacitances is called Cs. The basic principle behind the working of capacitive fingerprint sensors is that Cs changes according to whether the part of the finger on that pixel is a ridge or a valley. If the part of the finger is a ridge then Cs is higher than when it encounters a valley, in which case the series combination of the two capacitances falls low due to the modeling of the capacitor between the metal plate and finger as a capacitor with an air medium in between. Cp1 and Cp2 are the internal parasitic capacitances of the nodes N1 and N2. In the pre-charge phase, the switches S1 and S3 are on and S2 is off. The capacitors Cp1 and Cp2 get charged up. During the evaluation phase, S2 is turned on. The voltage stored during the pre-charge phase is now divided between Cs, Cp1 and Cp2. The output voltage at N1 is easily seen to be the following expression:

V0 = VN1 = VN2 = (Cp1·V1 + Cp2·V2 + Cs·V1) / (Cp1 + Cp2 + Cs)    (1)

Fig. 1. Basic Charge Sharing Scheme [1]

As given in Figure 2, Cs differs for ridge and valley, and thus the output voltage also differs according to the above expression. This difference in voltage, when passed to a comparator with an appropriate reference voltage, gives a binary output. The binary values from all the pixels in the chip then constitute the required fingerprint image. In the pre-charge phase (pch = 0), it can be seen that N1 and N3 are kept at Vdd by the PMOS transistors. During this phase, the capacitors Cp2 and Cp3 are shorted, with the voltages at their two ends being ground and Vdd respectively. The capacitors Cp1 and Cs begin to charge up. They store charges of Cp1·Vdd and Cs·(Vdd − Vfin) respectively. This is the charge accumulation phase.

At the beginning of the evaluation phase (pch = 1), both the input and output voltages are equal to Vdd. Even when the voltage at N1 starts decreasing due to charge-sharing between the capacitors, the unity-gain buffer ensures that the voltage at N3 is equal to the voltage at N1, thus effectively shorting the capacitor Cp3 and removing its effect. Meanwhile,
the comparator is also enabled, which is able to produce the required binary output. Thus the fingerprint pattern is captured. The circuits were simulated, and the implementation results are shown in Figure 3. The presence of a ridge and a valley are characterized by the voltage levels 1.6 V and 1.2 V respectively.

Fig. 2. The Sensor Circuitry

Fig. 3. Resolution obtained for basic Sense Amplifier

III. INTERFACE

A circuit was designed for interfacing the sensor with the digital processing system. Circuitry was required to select the particular pixels to be activated and then to selectively transfer the pixel values to the FPGA board for fingerprint matching. The module involved the use of a decoder and an and-or gate array. Also, a basic circuit to detect long sequences of a single value was designed to implement auto correction. This ensured that a long stream of a single bit value is not consecutively sensed, as such a stream is highly unlikely to occur in practical conditions. The complete sensor module along with the circuitry and the auto error detection module is shown in Figure 4. The in0 and in1 signals come from a counter which increments on the positive edge of the clock. The ctrl signal goes to the reset of the counter. The out signal is the sensed value of the fingerprint and goes to the fingerprint matching algorithm. The four 'fps' blocks represent the sensor array. The decode unit helps activate a single pixel sensor at a time, the values of which are passed to the output through the and-or gate array.

Fig. 4. Building blocks for integrating sensor module with FPGA kit

IV. RESOLUTION IMPROVEMENT

We proposed a modification to the basic circuit to improve the resolution of the output signal. We introduced an inverter at the output stage, which magnifies the difference between the ridge and valley outputs obtained at the output port. The inverter was designed to have a characteristic such that the point where Vin = Vout occurs at a value equal to the middle of the voltage swing for ridge and valley. This difference is reasonably easily discernible, and it could help ease in on the limited range of Vref. It could give a wider range for Vref, thus making the circuit more reliable. A voltage swing of around 2 V was achieved using the designed inverter. The result of the proposed improvement is shown in Figure 5.

Fig. 5. Results obtained with Resolution Improvement Circuit

V. POWER IMPROVEMENT

Sub-threshold [3] operation of MOSFETs was used to achieve the required reduction in supply voltage and power. The various steps of implementing the circuitry using the sub-threshold approach are given as follows:
• The supply voltage was first fixed at 0.5 V.

• Since the current when the transistor is in weak inversion is very small, the node capacitances take a long time to charge and discharge. Thus the clock frequency was reduced to 1 kHz.

• The sense amplifier has to work as a unity gain amplifier when the output and V− terminals are shorted. To design the sense amplifier to work at 0.5 V, 0.5 V was given at V+ and the widths of the transistors were changed to get the output voltage as 0.5 V. It is well known that the NMOS transistor acts as a pull-down device and the PMOS transistor acts as a pull-up device. After many iterations the widths were decided, with the corresponding characteristics. The circuit developed is shown in Figure 6 and the output characteristics are shown in Figure 7.

Fig. 6. Low Power Sense Amplifier

Fig. 7. Low Power Sense Amplifier input output response

• The inverter was designed to have a change from logic 0 to logic 1 at 0.25 V. The widths of the transistors were changed to get the desired characteristics. The circuit is shown in Figure 8 and the output characteristics are given in Figure 9.

Fig. 8. Low Power Inverter

Fig. 9. DC transfer characteristics of low Power Inverter

• After the modules were designed and characterized individually, they were combined to obtain the overall circuit. Since the voltages on the node capacitances are very small, care has to be taken to allow more current flow for faster charging. This decided the widths of the transistors in the main circuit.

• Thus the overall circuit was designed at a 0.5 V supply voltage with the transistors in weak inversion to reduce power.

The improvement resulted in a power requirement of 736 nW per pixel position. The improvements introduced in the MOSFET designs also improved the resolution further, and we obtained a resolution of around 350 mV for a power supply of 0.5 V. The results are presented in Figure 10.
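As a rough illustration of why lowering the supply voltage and clock frequency cuts power so sharply, dynamic switching power scales as P ≈ C·V²·f. The sketch below compares design points; only the 0.5 V supply and 1 kHz clock come from the text above, while the node capacitance and the nominal reference point are hypothetical:

```python
# Illustrative only: the 1 pF capacitance and the 1.8 V / 1 MHz reference
# point are hypothetical; the paper's simulated figure is 736 nW/pixel.
def dynamic_power(c_farads, v_volts, f_hertz):
    """Dynamic switching power P = C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hertz

C = 1e-12                                   # assumed 1 pF node capacitance
nominal = dynamic_power(C, 1.8, 1e6)        # hypothetical full-swing design
subthreshold = dynamic_power(C, 0.5, 1e3)   # 0.5 V supply, 1 kHz clock (Sec. V)

print(f"nominal: {nominal:.3e} W, sub-threshold: {subthreshold:.3e} W")
print(f"reduction factor: {nominal / subthreshold:.0f}x")
```

The quadratic dependence on V is why the sub-threshold design accepts the slower 1 kHz clock: the voltage and frequency reductions multiply.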
Fig. 10. Results obtained with low power Resolution Improvement Circuit

VI. FINGERPRINT MATCHING ALGORITHMS

Minutiae points [4] are local ridge characteristics that occur either at a ridge ending or at a ridge bifurcation. A ridge ending is defined as the point where the ridge ends abruptly, and a ridge bifurcation is the point where the ridge splits into two or more branches. Once the fingerprint image has been enhanced and a thinning algorithm applied, the minutiae are extracted. The algorithms were implemented and simulated on the Xilinx 10.1 ISE simulator. The simulations were performed based on the Virtex II Pro FPGA board. The following points can be considered the basic steps used for the implementation of the fingerprint matching algorithm.

• Removing local variations: Local variations in the fingerprint were removed by scanning the whole image pixel by pixel and at each location determining the values of all neighboring pixels. If a particular pixel value differs from all of the neighboring pixel values, then it is changed to the value of the neighboring pixels. The code was implemented using VHDL.

• Thinning algorithm: Thinning algorithms are used to narrow down the ridge structures within a fingerprint image. Thinning [5] is done by removing pixels from the borders of the ridges without causing any disturbance to the continuity of the structures. For this purpose the image was scanned from left to right. Whenever, for a pixel, the value on the left did not match the values on the right, the particular pixel values were changed to '1' (white). This way only the single pixel values at the end were retained as '0', to represent a single-pixel-wide ridge pattern.

• Minutiae extraction: The minutiae that were considered are ridge endings and ridge bifurcations. At every pixel location, sharp changes were detected by checking opposite points. If any two opposite pixel positions with respect to the current pixel position had different values, then it was considered to be a special point. Then certain subsequent pixels were analyzed to determine whether a particular pixel was a minutia or not. For every minutia, its location with respect to the first minutia was saved for later comparison.

• Minutiae matching: In this step, the minutiae of the fingerprints in the records were loaded and compared to the minutiae of the recently processed fingerprint. The minutiae of the fingerprint records were saved as separate templates. For every clock cycle, one of the templates was accessed. The minutiae set of the current candidate fingerprint was compared element by element to that of the accessed template. In case of 15 or more matches in minutiae locations, a match is announced and the loop exited.

One sample of the actual 100 pixel × 50 pixel image that was input to the matching algorithm, and the corresponding thinned image with minutiae positions, is shown in Figure 11.

Fig. 11. Fingerprint image and extracted minutiae

Another approach, based on correlation of the test fingerprint image with the fingerprint images in the database, was also implemented on the FPGA board. In this approach, every single pixel value that was fed to the FPGA board was compared to the corresponding pixel value in all the other fingerprint images stored in the database. Parameters were updated at the occurrence of each pixel value to calculate the correlation of the complete images. A correlation value higher than 0.6 was used to announce a match.

VII. CONCLUSION

The paper presents a complete basic model for obtaining a fingerprint and identifying or authenticating it. The whole project was implemented on suitable simulators for each individual module. The sensor and related circuitry were simulated using Mentor tools and the fingerprint algorithm was implemented on Xilinx 10.1. The important achievements were in terms of resolution improvement and reduction in power requirement.

REFERENCES

[1] J.W. Lee, D.J. Min, J.Y. Kim and W.C. Kim, “A 600-dpi capacitive fingerprint sensor chip and image-synthesis technique,” IEEE J. Solid-State Circuits, 1999.
[2] Jin-Moon Nam, Seung-Min Jung and Moon-Key Lee, “Design and implementation of a capacitive fingerprint sensor circuit in CMOS technology,” Sensors and Actuators, 2007.
[3] Hendrawan Soeleman and Kaushik Roy, “Ultra-low power digital sub-threshold logic circuits,” Proceedings of the International Symposium on Low Power Electronics and Design, 1999.
[4] Nimitha Chama, “Fingerprint image enhancement and minutiae extraction,” http://www.ces.clemson.edu/ stb/ece847/fall2004/projects/proj10.doc.
[5] Pu Hongbin, Chen Junali and Zhang Yashe, “Fingerprint Thinning Algorithm Based on Mathematical Morphology,” The Eighth International Conference on Electronic Measurement and Instruments, 2007.
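The two matching strategies of Section VI can be sketched in software as follows. This is a simplified model, not the authors' VHDL: minutiae are treated as exact (x, y) coordinate pairs, images as flat 0/1 pixel lists; only the 15-match and 0.6-correlation thresholds come from the text, and a real matcher would tolerate small positional offsets:

```python
# Simplified software model of the two matchers described in Section VI.
# Assumptions: minutiae are exact (x, y) pairs; images are 0/1 pixel lists.

def minutiae_match(candidate, template, threshold=15):
    """Announce a match when >= threshold minutiae locations coincide."""
    matches = sum(1 for point in candidate if point in template)
    return matches >= threshold

def correlation_match(image_a, image_b, threshold=0.6):
    """Announce a match when the normalized correlation exceeds threshold."""
    n = len(image_a)
    mean_a = sum(image_a) / n
    mean_b = sum(image_b) / n
    num = sum((a - mean_a) * (b - mean_b) for a, b in zip(image_a, image_b))
    den = (sum((a - mean_a) ** 2 for a in image_a)
           * sum((b - mean_b) ** 2 for b in image_b)) ** 0.5
    return (num / den if den else 0.0) > threshold

# Toy data: 16 of 20 candidate minutiae coincide with the template,
# which clears the 15-match rule.
template = [(i, 2 * i) for i in range(20)]
candidate = template[:16] + [(100 + i, i) for i in range(4)]
print(minutiae_match(candidate, template))
```

In the paper's hardware loop the same comparison is repeated against one stored template per clock cycle until a template clears the threshold or the records are exhausted.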
PARTICLE SWARM OPTIMIZATION FOR FEATURE SELECTION: AN APPLICATION TO FUSION OF PALMPRINT AND FACE

Raghavendra R.(1), Bernadette Dorizzi(2), Ashok Rao(3), Hemantha Kumar G.(1)

(1) Dept of Studies in Computer Science, University of Mysore, Mysore-570 006, India.
(2) Institut TELECOM, TELECOM and Management SudParis, France.
(3) Professor, Dept of E & C, CIT, Gubbi, India.

ABSTRACT

This paper relates to multimodal biometric analysis. Here we present an efficient feature level fusion and selection scheme that we apply to face and palmprint images. The features for each modality are obtained using the Log Gabor transform and concatenated to form a fused feature vector. We then use a Particle Swarm Optimization (PSO) scheme, a random optimization method, to select the dominant features from this fused feature vector. Final classification is performed in the projection space of the selected features using Kernel Direct Discriminant Analysis. Extensive experiments are carried out on a virtual multimodal biometric database of 250 users built from the face FRGC and the palmprint PolyU databases. We compare the proposed selection method with well known feature selection methods such as Sequential Floating Forward Selection (SFFS), Genetic Algorithm (GA) and Adaptive Boosting (AdaBoost) in terms of both the number of features selected and performance. Experimental results show that the proposed method of feature fusion and selection using PSO outperforms all other schemes in terms of reduction of the number of features, and it corresponds to a system that is easier to implement, while showing the same or even better performance.

Keywords: Multimodal Biometrics, Feature level fusion, Feature selection, Particle Swarm Optimization

1. INTRODUCTION

Recently, multimodal biometric fusion techniques have attracted much attention, as the complementary information between different modalities could improve the recognition performance. In practice, fusion of several biometric systems can be performed at 4 different levels: sensor level, feature level, match score level and decision level. As reported in [1], a biometric system that integrates information at an earlier stage of processing is expected to provide better performance than systems that integrate information at a later stage, because of the availability of more and richer information. In this paper, we are going to explore the interest of feature fusion, and we will experiment with it on two widely studied modalities, namely face and palmprint. Very few papers address the feature fusion of palmprint and face [2][3][4][5]. From these articles it is clear that performing feature level fusion leads to the curse of dimensionality, due to the large size of the fused feature vector, and hence linear or non linear projection schemes are used by the above mentioned authors to overcome the dimensionality problem. In this work, we address this problem by reducing the dimension of the feature space through an appropriate feature selection procedure. To this aim, we experimented with the binary Particle Swarm Optimization (PSO) algorithm proposed in [6] to perform feature selection. Indeed, PSO based feature selection has been shown to be very efficient on some large scale application problems, with performance better than Genetic Algorithms [7][8]. We therefore implemented it for this biometric feature fusion problem of high dimension, and this is the main novelty of this paper. Extensive experiments conducted on a virtual multimodal biometric database of 250 users show the efficacy of the proposed scheme.

The rest of this paper is organized as follows: Section 2 describes the proposed method of feature fusion using PSO and also discusses the selection of parameters for PSO. Section 3 presents the experimental setup, Section 4 describes the results and discussion, and finally Section 5 draws the conclusion.

2. PROPOSED METHOD

Fig. 1. Proposed scheme of feature fusion and selection in Log Gabor Space including PSO feature selection
Figure 1 shows the proposed block diagram of feature 2.2. Principle of PSO
level fusion of palmprint and face in Log Gabor space. As
The main objective of PSO is to optimize a given function
observed from Figure 1, we first extract the texture features
called fitness function. PSO is initialized with a population of
of face and palmprint separately using Log Gabor transform.
particles distributed randomly over the search space and eval-
We use Log Gabor transform as it is suitable for analyzing
uated to compute the fitness function together. Each particle is
gradually changing data such as face, iris and palmprint [3]
treated as a point in the N-Dimension space. The ith particle
and also it is mentioned in [9] that the Log Gabor transform,
is represented as Xi = {x1 , x2 , . . . , xN }. At every iteration,
can reflect the frequency response of images more realisti-
each particle is updated by two best values called pbest and
cally than usual Gabor transform. On the linear frequency
gbest. pbest is the best position associated with the best fit-
scale, the transfer function of the Log Gabor transform has
ness value of particle ’i’ obtained so far and is represented as
the form [9]:
pbesti = {pbesti1 , pbesti2 , . . . , pbestiN } with fitness func-
tion f (pbesti ). gbest is the best position among all the par-
− log(ω/ωo )2
 
G(ω) = exp (1) ticles in the swarm. The rate of the position change (veloc-
2 × log(k/ωo )2 ity) for particle ’i’ is represented as Vi = {vi1 , vi2 , . . . , viN }.
The particle velocities are updated according to the following
Where, ωo is the filter center frequency. To obtain a constant equations [8]:
shape filter, the ratio k/ωo must be held constant for varying ωo.

The Log Gabor transform used in our experiments has 4 different scales and 8 orientations. We fixed these values based on the results of different trials and also in conformity with the literature [4][3]. Thus, each image (of palmprint and face) is analyzed using 8×4 different Log Gabor filters, resulting in 32 different filtered images of resolution 60×60. To reduce the computation cost, we down-sample the image by a ratio equal to 6. Thus, the final size is reduced to 40×80. A similar type of analysis is also carried out for the palmprint modality. By concatenating the column vectors associated with each image we obtain the fused feature vector of size 6400×1. As the imaging conditions of face and palmprint are different, a feature vector normalization is carried out as mentioned in [3]. In order to reduce the size of each vector, we propose to perform feature selection through PSO as explained in Section 2.1 and illustrated in Figure 2, where 'K' indicates the dimension of the feature space after concatenation and 'S' indicates the reduced dimension obtained by PSO. Then, we use KDDA to project the selected features on the Kernel discriminant space. Here we employ KDDA because of its good performance as well as its high dimension reduction ability. Finally, the accept/reject decision is carried out using NNC.

2.1. Particle Swarm Optimization (PSO)

PSO is a stochastic, population-based optimization technique aiming at finding a solution to an optimization problem in a search space. The PSO algorithm was first described by J. Kennedy and R.C. Eberhart in 1995 [8]. The main idea of PSO is to simulate the social behavior of birds flocking to describe an evolving system. Each candidate solution is therefore modeled by an individual bird, that is, a particle in a search space. Each particle adjusts its flight by making use of its individual memory and of the knowledge gained from its neighbors to find the best solution. The velocity and position of each particle are updated as follows:

    Vid_new = w * Vid_old + C1 * rand1() * (Pbest_id - x_id)
                          + C2 * rand2() * (gbest_d - x_id)      (2)

    x_id = x_id + Vid_new                                        (3)

where d = 1, 2, ..., N and w is the inertia weight. A suitable selection of inertia weights provides a balance between global and local explorations, and results in fewer iterations on average to find near-optimal results. C1 and C2 are the acceleration constants used to pull each particle towards pbest and gbest. Low values of C1 and C2 allow the particle to roam far from target regions, while high values result in abrupt movements towards or past the target regions. rand1() and rand2() are random numbers in (0,1).

2.3. Binary PSO

The original PSO was introduced for continuous populations but was later extended by J. Kennedy and R.C. Eberhart [6] to discrete-valued populations. In the binary PSO, the particles are represented by binary values (0 or 1). Each particle velocity is updated according to the following equations:

    S(Vid_new) = 1 / (1 + e^(-Vid_new))                          (4)

    if (rand < S(Vid_new)) then x_id = 1; else x_id = 0;         (5)

where Vid_new denotes the particle velocity obtained from Equation 2, the function S(Vid_new) is a sigmoid transformation and rand is a random number drawn from the uniform distribution on (0,1). If S(Vid_new) is larger than the random number then the position value is set to 1, else it is set to 0. Binary PSO is well adapted to the feature selection context [10][7]. In order to apply the idea of binary PSO for feature selection on face and palmprint features, we need to adapt the general binary PSO concept to this precise application. This is the objective of the following subsections.
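Taken together, Equations (2), (4) and (5), the velocity clamp of Section 2.3.3 and the parameter values reported later in the paper give the following one-iteration sketch. This is an illustrative NumPy reconstruction, not the authors' code; the function and array names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_pso_step(x, v, pbest, gbest, w=1.2, c1=0.7, c2=1.2, vmax=6.0):
    """One binary-PSO iteration.
    x, v, pbest: (P, N) arrays; gbest: (N,) global-best bit string."""
    r1 = rng.random(x.shape)                      # rand1() in Equation (2)
    r2 = rng.random(x.shape)                      # rand2() in Equation (2)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # Equation (2)
    v = np.clip(v, -vmax, vmax)                   # velocity limit (Sec. 2.3.3)
    s = 1.0 / (1.0 + np.exp(-v))                  # sigmoid, Equation (4)
    x = (rng.random(x.shape) < s).astype(int)     # probabilistic bit, Eq. (5)
    return x, v

# 20 particles over a 6400-bit feature mask, matching the paper's setup
x = rng.integers(0, 2, size=(20, 6400))
v = np.zeros((20, 6400))
x, v = binary_pso_step(x, v, pbest=x.copy(), gbest=x[0])
```

In the full algorithm, w would decay from 1.2 towards 0 over the iterations, and pbest/gbest would be refreshed each round from the fitness function of Section 2.3.2.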
[Fig. 2. PSO features selection scheme in Log Gabor Space]
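The selection step depicted in Fig. 2 amounts to masking the fused Log Gabor vector with a particle's bit string. A small sketch (the array names are illustrative, and K = 6400 follows the fused dimension given in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 6400                                  # fused face + palmprint dimension
fused = rng.normal(size=K)                # one concatenated feature vector
position = rng.integers(0, 2, size=K)     # particle bit string: 1 = selected

selected = fused[position == 1]           # reduced vector of dimension S
S = int(position.sum())
```

The S-dimensional result is what would then be projected into the kernel discriminant space by KDDA.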
2.3.1. Representation of Position

The initial swarm is created such that the population of particles is distributed randomly over the search space. Since we are using binary PSO, the particle position is represented as a binary bit string of length N, where N is the total dimension of the feature set. Every bit in the string represents a feature: the value '1' means that the corresponding feature is selected, while '0' means that it is not. Each particle velocity is updated according to Equations 2, 4 and 5.

2.3.2. Fitness Function

The feature selection relies on an appropriate formulation of the fitness function. In biometric verification, it is difficult to identify a single function that would characterize the matching performance across a range of False Acceptance Rate (FAR) and False Reject Rate (FRR) values, i.e. across all matching thresholds [1]. Thus, in our experiments, we first compute the distance between reference and testing samples to get the match scores and then compute FAR and genuine acceptance rate (GAR) by setting thresholds at different points. In order to optimize the performance gain across a wide range of thresholds, we define the objective function to be the average of 12 GAR values corresponding to 12 different FAR values (90%, 70%, 50%, 30%, 10%, 0.8%, 0.6%, 0.4%, 0.2%, 0.09%, 0.05%, 0.01%). Thus, the main objective of the verification fitness function is to maximize this average GAR value.

2.3.3. Velocity Limitation Vmax

In the binary version of PSO, the value of Vmax limits the probability that bit x_id takes value 0 or 1, and therefore the use of a high Vmax value in binary PSO will decrease the range explored by the particle [6]. In our experiments, we tried different values of Vmax and finally selected Vmax = 6, as it allows the particle to reach near-optimal solutions.

2.3.4. Inertia Weight and Acceleration Constants

The inertia weight is an important parameter as it provides the particles with a degree of memory capability. It is experimentally found that an inertia weight w in the range [0.8, 1.2] yields better performance [6]. Hence, in our present work, we initially set w to 1.2 and then decrease it to zero during subsequent iterations (in our work, we use 50 iterations). This scheme of decreasing inertia weight is found to be better than a fixed one [6] as it allows reaching an optimal solution. Even though the acceleration constants C1 and C2 are not so critical for the convergence of PSO, a suitably chosen value may lead to faster convergence of the algorithm. In our experiments, we varied the values of C1 and C2 from 0 to 2 and finally chose C1 = 0.7 and C2 = 1.2 as these yield better convergence.
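The fitness computation of Section 2.3.2 can be sketched as follows, assuming higher match scores mean greater similarity; the thresholding convention and all names are our assumptions, not the authors' implementation:

```python
import numpy as np

# The 12 FAR operating points listed in Section 2.3.2, in percent
FAR_POINTS = [90, 70, 50, 30, 10, 0.8, 0.6, 0.4, 0.2, 0.09, 0.05, 0.01]

def fitness(genuine_scores, impostor_scores):
    """Mean GAR over the 12 FAR operating points (to be maximised)."""
    impostors = np.sort(np.asarray(impostor_scores))[::-1]
    gars = []
    for far in FAR_POINTS:
        # threshold admitting at most far% of impostor scores
        k = max(int(len(impostors) * far / 100.0), 1)
        thr = impostors[k - 1]
        gars.append(np.mean(np.asarray(genuine_scores) >= thr) * 100.0)
    return float(np.mean(gars))

# Toy scores: genuine similarities well above impostor similarities
rng = np.random.default_rng(1)
genuine = rng.normal(0.8, 0.05, 1500)
impostor = rng.normal(0.2, 0.05, 238800)
score = fitness(genuine, impostor)
```

Well-separated genuine and impostor score distributions, as in the toy data above, yield an average GAR close to 100%.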
2.3.5. Population Size

In our present work, we experimentally varied the size of the population from 10 to 30 in steps of 5 and finally fixed the population size at 20, which was found to be an optimal value.

3. EXPERIMENTAL SETUP

This section describes the experimental setup that we have built in order to evaluate the proposed feature level fusion schemes. Because of the lack of a real multimodal database of face and palmprint data, experiments are carried out on a database of virtual persons using face and palmprint data coming from two different databases. This procedure is valid as, for one person, face and palmprint can be considered as two independent modalities [1].

For the face modality we choose the FRGC face database [11] as it is a big database, widely used for benchmarking. From this database, we choose 250 different users from 2 different sessions. The first session consists of 6 samples for each user taken from data collected during Fall 2003 and the second session consists of 6 samples for each user taken from data collected during Spring 2004. Out of these 6 samples, the first 4 samples are taken in controlled conditions and the next 2 samples are taken in uncontrolled conditions. For the palmprint modality, we select a subset of 250 different palmprints from the PolyU database [12]; each of these users possesses 12 samples such that 6 samples are taken from the first session and the next 6 samples are taken from the second session. The average time interval between the first and second sessions is two months. In building our multimodal biometric database of face and palmprint, each virtual person is associated with 12 samples of face and palmprint produced randomly from the face and palmprint samples of 2 persons in the respective databases. Thus, the built virtual multimodal biometric database consists of 250 users such that each user has 12 samples.

3.1. Experimental Protocol

This section describes in detail the experimental protocol employed in our work. For learning the projection spaces, we use a subset of 100 users called LDB, such that each user has 6 samples (selected randomly out of 12 samples). To validate the performance of all the algorithms, we divide the whole database of 250 users into two independent sets called Set-I and Set-II. Set-I consists of 200 users and Set-II consists of 50 users. Set-II is used as the validation set to fix the parameters of PSO (like Vmax, C1, C2 and population size), of match score fusion, and also those of AdaBoost. Set-I is divided into two equal partitions providing 6 reference samples and 6 testing samples for each of the 200 persons. The reference and testing partitioning was repeated m times (where m = 10) using holdout cross-validation, and there is no overlap between these two subsets. Thus, in each of the 10 trials we have 1200 (= 200 × 6) reference samples and 1200 (= 200 × 6) testing samples, and hence we have 1500 genuine matching scores and 238800 (= 200 × 199 × 6) impostor matching scores, as for each user, all other users are considered as impostors. In closed identification, we calculate the recognition rate using the 1200 reference samples and the 1200 testing samples, which gives 1200 × 1200 matching scores. Note that the persons are exactly the same in the reference and test sets; this is why we speak of closed identification. Finally, results are presented by taking the mean of all trials (10) and we also present the statistical variation of the results with a 90% parametric confidence interval [13], which gives a better estimation of the deviation than the one obtainable through cross-validation alone.

4. RESULTS AND DISCUSSION

[Fig. 3. ROC curves of the different verification systems]

This section discusses the results of the proposed feature fusion and selection scheme in terms of performance and number of features selected. The proposed method is compared with three different feature selection schemes, namely AdaBoost (feature fusion-AdaBoost) [14], Genetic Algorithm (feature fusion-GA) [15] and Sequential Floating Forward Selection (feature fusion-SFFS) [16], in terms of number of features selected. Further, we present a comparative analysis of the feature level fusion and selection schemes against a feature fusion scheme using the complete set of features (feature fusion-LG). We also present a comparative analysis of feature level fusion against match score level fusion in terms of performance.

Figure 3 shows the ROC curves of the individual biometrics, feature fusion using the complete set of features (feature fusion-LG), the feature fusion and selection schemes, and match score level fusion. To perform the match score level fusion, we first obtain the match scores of face and palmprint independently using the combination of the Log Gabor transform and KDDA. Note that this architecture corresponds to state-of-the-art systems in both face and palmprint. We therefore perform a
weighted SUM rule by computing the weights using an empirical method. Here, we vary the weights W1 and W2 from 0 to 1 such that W1 + W2 = 1 and finally fix the weights for which we get the best performance on Set-II. Indeed, it has been shown in [1] that this leads to a good and simple scheme for match score fusion. In order to have a fair comparison, we employed the same fitness function as that of the proposed fusion scheme using PSO (see Section 2.3.2) with the other feature selection schemes used for comparative analysis in our present work.

Table 1 shows the relative performance of these algorithms in terms of the mean value of GAR at FAR = 0.01%.

Table 1. Comparative performance of the different feature selection schemes (Mean GAR at FAR = 0.01%) (Verification)

    Methods                   GAR at 0.01% of FAR (%) with 90% confidence interval
    Face Alone                65.32 [63.06; 67.37]
    Palmprint Alone           74.62 [72.55; 76.08]
    Match Score Fusion        86.50 [84.89; 88.11]
    Feature Fusion-LG         92.51 [91.65; 93.75]
    Feature Fusion-SFFS       92.55 [91.32; 93.77]
    Feature Fusion-AdaBoost   92.88 [91.76; 94.00]
    Feature Fusion-GA         92.75 [91.52; 93.97]
    Feature Fusion-PSO        94.72 [93.85; 95.59]

From Figure 3 and also from Table 1 it can be observed that palmprint outperforms face, with GAR = 74.62% at FAR = 0.01% for palmprint against GAR = 65.32% for face. Further, the feature level fusion (feature fusion-LG) of these two modalities shows a big improvement in performance compared with match score level fusion and the individual biometrics, with GAR = 92.51% at FAR = 0.01%. It is also observed from Figure 3 (and Table 1) that the use of a selection scheme keeps the same level of performance as Feature Fusion-LG but with fewer features.

Table 2 indicates the number of features selected by the proposed feature fusion and selection scheme using PSO and by the three other feature selection schemes for the same level of performance as the complete feature set (Feature Fusion-LG). Here, D_Ori indicates the initial feature dimension, D_FS indicates the dimension after the feature selection scheme and D_KDDA indicates the final dimension of the KDDA projection space.

Table 2. Comparison of feature selection schemes

    Methods                   D_Ori   D_FS   D_KDDA
    Feature fusion-LG         6400    6400   224
    Feature fusion-SFFS       6400    5286   207
    Feature fusion-AdaBoost   6400    4090   184
    Feature fusion-GA         6400    3855   170
    Feature fusion-PSO        6400    3520   139

From Table 2 we can observe that the PSO based feature selection scheme uses fewer features than the three other feature selection schemes. Indeed, the proposed Feature Fusion-PSO scheme reduces the fused feature space by roughly 45%, while SFFS, AdaBoost and GA reduce the fused feature space dimension by around 17%, 36% and 39% respectively. These figures clearly indicate the efficacy of the proposed Feature Fusion-PSO. Further, it is also observed from our experiments that fusion at feature level allows an improvement of 5% over match score level fusion.

5. CONCLUSION

In this paper, we investigated the dimensionality reduction of the high dimensional feature fusion space and proposed a novel feature fusion scheme based on PSO. The proposed method is compared with three different state-of-the-art feature selection methods, namely SFFS, AdaBoost and GA. Extensive experiments carried out on a virtual multimodal biometric database of 250 users composed of palmprint and face indicate that the proposed Feature Fusion-PSO approach reduces the fused feature space dimension by roughly 45% while keeping the same level of performance as the global system, Feature Fusion-LG. Moreover, the PSO implementation makes the recognition process faster and less complex by reducing the number of features while preserving their discriminative ability. Thus, from the above analysis, we can conclude that the proposed Feature Fusion-PSO method can contribute to developing faster and more accurate multimodal biometric systems based on feature level fusion.

Our future work will focus on further investigating the selected features obtained with the PSO based scheme and, in particular, on searching for a correlation between certain selected features and the quality of the data.

6. REFERENCES

[1] A. Ross, K. Nandakumar, and A.K. Jain, Handbook of Multibiometrics, Springer-Verlag, 2006.

[2] G. Feng, K. Dong, D. Hu, and D. Zhang, “When faces are combined with palmprints: a novel biometric fusion strategy,” in First International Conference on Biometric Authentication (ICBA), 2004, pp. 701–707.

[3] Y. Yao, X. Jing, and H. Wong, “Face and palmprint feature level fusion for single sample biometric recognition,” Neurocomputing, vol. 70, no. 7-9, pp. 1582–1586, 2007.

[4] X.Y. Jing, Y.F. Yao, J.Y. Yang, M. Li, and D. Zhang, “Face and palmprint pixel level fusion and kernel DCV-RBF classifier for small sample biometric recognition,”
Pattern Recognition, vol. 40, no. 3, pp. 3209–3224,
2007.
[5] Y. Yan and Y.J. Zhang, “Multimodal biometrics fusion using Correlation Filter Bank,” in Proceedings of the International Conference on Pattern Recognition (ICPR 2008), 2008, pp. 1–4.

[6] J. Kennedy and R.C. Eberhart, “A discrete binary version of the particle swarm algorithm,” in IEEE International Conference on Systems, Man and Cybernetics, 1997, pp. 4104–4108.

[7] X. Wang, J. Yang, X. Teng, W. Xia, and R. Jensen, “Feature selection based on rough sets and particle swarm optimization,” Pattern Recognition Letters, vol. 28, pp. 459–471, 2007.
[8] J. Kennedy and R.C. Eberhart, “Particle swarm opti-
mization,” in IEEE International Conference on Neural
Networks, 1995, pp. 1942–1948.
[9] X. Zhitao, G. Chengming, Y. Ming, and L. Qiang, “Research on Log Gabor wavelet and its application in image edge detection,” in Proceedings of the 6th International Conference on Signal Processing, 2002, pp. 592–595.

[10] M. Najjarzadeh and A. Ayatollahi, “A comparison between Genetic Algorithm and PSO for linear phase FIR digital filter design,” in IEEE International Conference on Signal Processing (ICSP), 2008, pp. 2134–2137.

[11] P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek, “Overview of the face recognition grand challenge,” in Proceedings of CVPR05, 2005, pp. 947–954.

[12] “PolyU Palmprint Database,” www.comp.polyu.edu.hk/~biometrics/.
[13] R. M. Bolle, N. K. Ratha, and S. Pankanti, “An Evalua-
tion of Error Confidence Interval Estimation Methods,”
in Proceedings of ICPR-04, 2004, pp. 103–106.
[14] S. Shan, P. Yang, X. Chen, and W. Gao, “AdaBoost Gabor Fisher classifier for face recognition,” in Proceedings of AFGR, 2005, pp. 278–291.

[15] G. Bebis, S. Uthiram, and M. Georgiopoulos, “Face detection and verification using genetic search,” International Journal on Artificial Intelligence Tools, vol. 6, no. 2, pp. 225–246, 2000.

[16] F. Ferri, P. Pudil, M. Hatef, and J. Kittler, “Comparative study of techniques for large scale feature selection,” Pattern Recognition in Practice IV, Elsevier Science, pp. 403–413, 1994.
ADCOM 2009
GRID SERVICES

Session Papers:

1. Srikumar Venugopal, James Broberg and Rajkumar Buyya, “OpenPEX: An Open Provisioning and EXecution System for Virtual Machines”

2. Saurabh Kumar Garg, “Exploiting Grid Heterogeneity for Energy Gain”

3. Snehal Gaikwad, Aashish Jog and Mihir Kedia, “Intelligent Data Analytics Console”
OpenPEX: An Open Provisioning and EXecution System for Virtual Machines

Srikumar Venugopal
School of Computer Science and Engineering,
University of New South Wales, Australia
Email: srikumarv@cse.unsw.edu.au
James Broberg and Rajkumar Buyya
Department of Computer Science and Software Engineering,
The University of Melbourne, Australia
Email: {brobergj, raj}@csse.unimelb.edu.au
Abstract

Virtual machines (VMs) have become capable enough to emulate full-featured physical machines in all aspects. Therefore, they have become the foundation not only for flexible data center infrastructure but also for commercial Infrastructure-as-a-Service (IaaS) solutions. However, current providers of virtual infrastructure offer simple mechanisms through which users can ask for immediate allocation of VMs. More sophisticated economic and allocation mechanisms are required so that users can plan ahead and IaaS providers can improve their revenue. This paper introduces OpenPEX, a system that allows users to provision resources ahead of time through advance reservations. OpenPEX also incorporates a bilateral negotiation protocol that allows users and providers to come to an agreement by exchanging offers and counter-offers. These functions are made available to users through a web portal and a REST-based Web service interface.

1 Introduction

In the past, many networked services (such as web sites, databases and computational services) were hosted on dedicated physical hardware, which was configured exclusively to suit application-dependent requirements. However, recent hardware and software advances have made it possible to host these services within Virtual Machines (VMs) on commodity x86 hardware with minimal overhead. Virtualisation solutions such as Xen [2] and VMware [1] create a virtual representation of a complete physical machine, enabling operating systems (and any running applications) to be de-coupled from the physical hardware. This allows improved utilisation and consolidation of computing infrastructure by multiplexing many VMs onto one physical host.

These developments provide an interesting framework where a lightweight VM image can be a unit of execution (i.e. instead of a task or process on a shared system) and migration. Migration of VMs on the same subnet can be achieved in a totally transparent, work-conserving fashion without the running applications, or any dependent clients or external resources, being aware that it has occurred [7, 5]. Also, VMs enable workload isolation and can be shut down without adverse effects on other concurrent VMs.

The ability to fashion VMs into expendable resource units has led to their use for ad-hoc deployment of computational infrastructure in order to meet sudden spikes in resource demand. VMs also enable users to create computing environments customised to suit the requirements of their specific e-Science or e-Business applications. These capabilities have led to the advent of infrastructure provisioning services, both private (within an enterprise or organisation) and commercial. Providers and consumers of such services negotiate Service Level Agreements (SLAs) that encapsulate the user requirements in terms of Service Level Objectives (SLOs), and the rewards and penalties for meeting and violating them respectively. Therefore, the provider’s aim is to maximise its own return on investment by maximising resource utilisation while avoiding or minimising penalties caused by SLA violations.

VMs have also become the ‘enabling technology’ behind the recent emergence of Cloud Computing [4]. Infrastructure as a Service (IaaS) providers such as Amazon EC2, GoGrid or Mosso Cloud Servers have emerged that offer virtualised machines (or resource slices) that can be obtained under a pay-per-use arrangement (i.e. no commitment required, utility-style pricing). Users can take advantage of elastic capacity, where they can scale up and scale down on demand using a self-service interface, such as a

1
45
Web Service or Web Portal. The resources themselves (such Command Line Tools. The Amazon service provides hun-
as compute and storage) are highly abstracted or virtualised. dreds of pre-made AMIs (Amazon Machine Image), giving
However, IaaS service models are evolving and cur- users a wide choice of operating systems (i.e. Windows or
rently, most providers operate on a lease model – the users Linux) and pre-loaded software. Instances come in different
pay for the time the VM was active. Also, the choices avail- sizes, from Standard Instances (S, L, XL), which have pro-
able to the user in terms of specifying requirements are lim- portionally more RAM than CPU, to High CPU Instances
ited. Such models do not allow for more flexible strategies (M, XL), which have proportionally more CPU than RAM.
where the user can reduce both his costs and risk by booking A user can deploy these instances in two different regions,
his resources in advance. An advance reservation provides a US-East and EU-West Regions, with EU instances costing
guaranteed allocation of the resources at the needed time to more per hour than their US counterparts.
the consumer and helps the provider plan capacity require- Amazon EC2 provides an alternative to it’s on-demand
ments better. However, advance reservations induce new instances, known as a reserved instance. This facility of-
challenges in resource management and require new archi- fers a number of benefits over simply requesting instances
tectures for realisation. In this paper, we introduce Open- on demand, as it provides a lower per-hour rate, and pro-
PEX, a utility-based virtual infrastructure manager that en- vides assurances that any reserved instance you launch is
ables users to reserve VM instances in advance. OpenPEX guaranteed to succeed (provided you have booked them in
also offers a bilateral negotiation protocol that allows users advance). That is, users of such instances should not be
and providers to exchange offers and counter-offers, and affected by any transient limitations in EC2 capacity.
come to an agreement that is mutually beneficial. In the
next section, we distinguish the contributions of OpenPEX
from the state-of-the-art. Section 3 discusses the design and 2.2 Private IaaS Cloud Platforms
implementation of OpenPEX at length. Section 4 discusses
the Web Service interface to OpenPEX and finally, we con-
clude the paper with details on our future plans for the sys- Many different platforms exist to assist with the deploy-
tem. ment and management of virtual machines on a virtualised
cluster (i.e. a cluster running Virtual Machine software).
2 Related Work Such platforms are often referred to as ‘Private Clouds’,
as they can bring the benefits of Cloud Computing (such
Virtual Machine technology has become an essential as elasticity, dynamic provisioning and multiplexing work-
enabling technology of Cloud Computing environments. loads onto fewer machines) into local clusters.
Cloud Computing is a style of computing where resources Eucalyptus [9, 8] is an open-source (BSD-licensed) soft-
can be obtained in a pay-per-use manner (no commitment, ware infrastructure for implementing Infrastructure as a
utility pricing). Such resources have elastic capacity, where Service (IaaS) Compute Cloud on commodity hardware.
they can be scaled up and down on demand. Resources are Eucalyptus is notable by offering a Web Service interface
highly abstracted and virtualised, and can be obtained via a that is fully Amazon Web Services (AWS) API compliant.
self-service interface. Specifically, it emulates Amazon’s Elastic Compute Cloud
VMs are highly attractive to manage resources in such (EC2), Simple Storage Service (S3) and Elastic Block Store
environments as they improve utilisation by multiplexing (EBS) services at the API level. However, as the imple-
many VMs on one physical host (consolidation), allow ag- mentation details of Amazon’s services are not published,
ile deployment and management of services, provide on Eucalyptus’ internal implementation would differ.
demand cloning, (live) migration and checkpoint which OpenNebula [11] is an open-source Virtual Infrastruc-
improves reliability. Furthermore, a VM can be a self- ture management software that supports dynamic resizing,
contained unit of execution and migration. As such, effec- partitioning and scaling of computing resources. OpenNeb-
tive management of VMs and Virtual Machine infrastruc- ula can be deployed in a private, public or hybrid Cloud
ture is critical for any Cloud Computing Infrastructure as a models. The OpenNebula software turns an existing clus-
Service (IaaS) provider. ter into private cloud, which can be used privately or can
expose service to public via XML-RPC Web Services. The
2.1 Public IaaS Cloud Services integration of Cloud plugins (EC2, GoGrid) enable hybrid
model, where you can mix and match private and public
Amazon Elastic Compute Cloud (EC2) is an IaaS ser- resources. Haizea [11] has extended OpenNebula further,
vice that provides resizable compute capacity in the cloud. allowing resource providers to lease their resources using
These services can be leveraged via Web Services (SOAP or sophisticated leasing arrangements, instead of only provid-
REST),a web-based AWS Management Console or the EC2 ing on-demand VMs like most other IaaS services.
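The trade-off behind reserved versus on-demand instances can be made concrete with a small break-even sketch. The prices below are purely illustrative placeholders, not Amazon's actual rates:

```python
# Hypothetical rates: a reserved instance trades an up-front fee for a
# lower hourly price, so it pays off only beyond a break-even usage level.
ONDEMAND_RATE = 0.10      # $/hour, illustrative only
RESERVED_RATE = 0.03      # $/hour, illustrative only
UPFRONT_FEE = 350.0       # $ per reserved instance, illustrative only

def cheaper_to_reserve(hours_used):
    """True when the reserved instance beats on-demand for this usage."""
    ondemand_cost = ONDEMAND_RATE * hours_used
    reserved_cost = UPFRONT_FEE + RESERVED_RATE * hours_used
    return reserved_cost < ondemand_cost

# Break-even point: fee amortised by the hourly saving
break_even = UPFRONT_FEE / (ONDEMAND_RATE - RESERVED_RATE)  # 5000 hours
```

With these numbers, an instance busy fewer than 5000 hours over the reservation term is cheaper on demand; beyond that, reserving wins.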
2.3 Issues with existing solutions

With the exception of OpenNebula (when used in conjunction with the Haizea extension) and Amazon EC2 (when used with Reserved Instances), none of the above public or private platforms offer the ability to perform an Advanced Reservation of computing resources; rather, they only supply on-demand capacity on a best-effort basis (i.e. if adequate resources are available). Furthermore, none of these platforms provide an alternate offer (that is, a modified offering that can satisfy a user’s request but may differ from the initial request) in the event that the system cannot satisfy the user’s specific request for resources.

Whilst Haizea supports a form of Advanced Reservation (which it denotes as advanced reservation leases), if the request cannot be satisfied there is no recourse: the request will be rejected. Under the same circumstances, the OpenPEX system enacts a bilateral negotiation protocol that allows users and providers to come to an agreement by exchanging offers and counter-offers, so that a user’s advanced reservation request can be satisfied.

Amazon EC2 offers its own variation on the notion of Advanced Reservation with its Reserved Instances product. However, you need to purchase a Reserved Instance for every instance you wish to guarantee to be available at some point in the future. This essentially requires the end user to forecast exactly how many instances they will require in advance. Acquisition of a Reserved Instance is not instantaneous either; in the authors’ experience, a request for a Reserved Instance has taken more than an hour on previous occasions.

2.4 Utility Computing Platforms

With increasing popularity and usage, large Grid installations are facing new problems, such as excessive spikes in demand for resources coupled with strategic and adversarial behaviour by users. Traditional Grid resource management techniques did not ensure fair and equitable access to resources in many systems. Traditional metrics (throughput, waiting time, slowdown) failed to capture the more subtle requirements of users. There were no incentives for users to be flexible about resource requirements or job deadlines, nor provisions to accommodate users with urgent work.

In such systems, users assign a “utility” value to their jobs, where utility is a fixed or time-varying valuation that captures various QoS constraints (deadline, importance, satisfaction). The valuation is the amount they are willing to pay a service provider to satisfy their demands. Service providers attempt to maximise their own utility, where utility may directly correlate with their profit. Providers can prioritise high-yield (i.e. profit per unit of resource) user jobs, and shared Grid systems are then viewed as a marketplace, where users compete for resources based on the perceived utility / value of their jobs. Further information and a comparison of these utility computing environments are available in an extensive survey of these platforms [3].

[Figure 1. The OpenPEX Resource Manager.]

3 OpenPEX

OpenPEX was constructed around the notion of using advance reservations as the primary method for allocating VM instances. The use case followed here is of a user who, either through a web portal or through the web service, makes a reservation for any number of instances of a Virtual Machine that have to be started at a specific time and have to last for a specific duration. The VM is described by a template that is already registered in the system. If the request can be satisfied for the price asked for by the user, then OpenPEX creates the reservation; otherwise it creates a counter-offer with an alternate time interval in which the request can be accommodated. The counter-offer may instead specify a different price for the original time interval. Once the reservations have been finalised, the user can choose to activate the reservation or have OpenPEX automatically start the instances when required.
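The accept-or-counter-offer behaviour described above can be sketched with a slot-based capacity model. This is an illustration under our own assumptions (discrete time slots, a single host pool, no price negotiation), not the OpenPEX implementation:

```python
from dataclasses import dataclass

@dataclass
class Reservation:
    start: int       # starting slot (e.g. hour) index
    duration: int    # number of slots
    instances: int   # VM instances requested

def handle_request(req, booked, capacity, horizon=1000):
    """Accept the reservation if it fits within capacity; otherwise
    counter-offer the next start slot at which it would fit, else reject."""
    def fits(start):
        return all(
            sum(r.instances for r in booked if r.start <= t < r.start + r.duration)
            + req.instances <= capacity
            for t in range(start, start + req.duration))
    if fits(req.start):
        booked.append(req)
        return ("ACCEPT", req.start)
    for start in range(req.start + 1, horizon):
        if fits(start):
            return ("COUNTER", start)   # alternate time interval offered
    return ("REJECT", None)

# 3 of 4 instance slots are booked for slots 0-3; a request for 2 more
# instances over slots 2-3 cannot fit, so slot 4 is counter-offered.
booked = [Reservation(start=0, duration=4, instances=3)]
status, when = handle_request(Reservation(start=2, duration=2, instances=2),
                              booked, capacity=4)
assert (status, when) == ("COUNTER", 4)
```

A real implementation would also weigh the asking price and could counter on price instead of time, as the text notes.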
These requirements motivate a resource management system that is able to manage physical nodes in such a manner that it maintains the capacity to satisfy as many advance reservation requests as possible. Such adaptive management may be enabled by a variety of techniques including load forecasting, migrating existing VMs to other resources, and/or suspending some of the VMs in order to increase the available resource share.

Figure 1 shows the architecture of the OpenPEX Resource Manager. The Resource Manager has the following components:

Reservation Manager - This component interacts with the users through the portal or the web service, and receives incoming requests. It examines them to check whether they are feasible according to the reservation policy employed, and creates counter-offers when required.

Allocator - Manages the allocation of VMs to physical nodes. The allocator's roles are to: create capacity for new VMs by triggering migration of existing VMs to other nodes, or by suspending long-running VMs; to identify physical nodes on which reservations can be activated; and to react to events such as the loss of a physical node.

Node Monitor - Monitors the health and load of the physical nodes.

VM Monitor - Monitors the health of the VMs that have been started in OpenPEX. It detects events such as a VM shut down by a user from the inside, a VM suddenly crashing, or a VM being unresponsive.

Dispatcher - Interacts with the virtual machine manager or the hypervisor on the physical nodes. It relays commands such as create, start or shutdown a VM to the virtual machine manager. The dispatcher is the only component that is specific to the underlying virtual machine manager; the rest of OpenPEX is designed to be independent of the underlying infrastructure.

All these components are connected to an Event Queue that acts as a simple message bus for the system. The Event Queue also enables scheduling of future tasks by allowing delayed events. The entire system is backed by a Persistence layer that saves the current state to a database.

3.1 Negotiating Advance Reservations

As described previously, OpenPEX allows advance reservations to be negotiated bilaterally between the producers and consumers. For this, we have employed a protocol based on the Alternate Offers mechanism [10], which was previously used for negotiation of SLAs in an enterprise Grid framework [12]. The implementation of this protocol in OpenPEX is shown in Figure 2.

Figure 2. Alternate Offers Negotiation for Advance Reservations. (Sequence diagram: the User/Portal/WS, the PEX Resource Manager and a Cluster Node exchange initiateReservation/reservationID, requestReservation/isFeasible(), ACCEPT/REJECT/COUNTER, CONFIRM, and VM create/start messages.)

The user opens the interaction by sending an initiateReservation request, in reply to which OpenPEX returns a unique reservationID identifier. This identifier acts as a handle for the session and, if the reservation goes through, until its life-cycle is complete. The user then submits a proposal through a requestReservation call. The proposal contains a description of the VM being requested (e.g. instance size), the number of instances required, the start time for activating the reservation and the duration for which the reservation is required. The instance sizes are detailed in the next section and examples of these descriptions are given in Section 4.

In return, OpenPEX can respond with: ACCEPT, if the proposal is acceptable; REJECT, if the proposal cannot be satisfied in any manner; and COUNTER, if the reservation required cannot be fulfilled with the parameters given in the

Figure 3. OpenPEX Entity Relationship Diagram. (Entities: an OpenPEX User has 0:M Reservations; a Reservation activates 0:M Instances; Reservations and Instances map to OpenPEX Nodes.)

Figure 4. OpenPEX Welcome Screen.

proposal but an alternative can be generated instead. With the last option, OpenPEX returns an alternative (or counter) proposal generated by replacing terms of the user's original proposal with those acceptable to it. For example, a user could ask for 5 instances of a Virtual Machine of small instance size (refer Section 3.2) with the Red Hat Enterprise Linux operating system. These instances have to be started at 10:00 a.m. on August 21, 2009 for six days, after which they can be shut down. However, OpenPEX may not be able to provision these instances on August 21, but may have free nodes for six days starting August 23. In this case, it will generate a counter proposal that replaces only the start time in the user's original proposal with the new start time. If OpenPEX is not able to provision the VMs in any case (e.g. the number of instances requested exceeds its capacity), then the proposal is rejected. When the user receives an ACCEPT, he can then reply with CONFIRM to confirm the reservation. In reply to a COUNTER, the user has the same three reply options. In case the user accepts the counter-proposal (through the reply ACCEPT), OpenPEX sends back a CONFIRM-REQUEST so that the user can reply back with a CONFIRM to confirm the reservation. This extra step is necessary as, even though the protocol is bilateral, only OpenPEX can confirm a reservation.

Once the reservation is confirmed, the user can activate it by instantiating the VMs in the reservation after the agreed-upon start time. The user can start or shutdown VMs at any time during the course of the reservation. Once the duration is over, the reservation expires, and all active VMs on that reservation are shut down.

Table 1. List of available VM configurations in OpenPEX.

Size     Configuration
SMALL    1 CPU, 768 MB RAM, 10 GB HDD
MEDIUM   2 CPU, 1.5 GB RAM, 20 GB HDD
LARGE    3 CPU, 2.5 GB RAM, 40 GB HDD
XLARGE   4 CPU, 3.5 GB RAM, 60 GB HDD

3.2 Resource Management in OpenPEX

Users are only able to ask for standardised configurations of virtual machines from the OpenPEX Resource Manager. These configurations depict the "sizes" of the virtual machines as given in Table 1. The virtual machines can be paired with an operating environment chosen from the templates available in the OpenPEX database.

When a user request arrives at the Reservation Manager, the individual nodes are polled to determine which of them are free at the requested time. If more than one node is free, then the Manager chooses the most loaded of them to host the request. In case there are no free nodes available, an alternate time slot is requested from each of the nodes. The node that provides a starting time closest to the original request is temporarily locked, and the new starting time is sent as an alternate offer to the user via a COUNTER reply in the negotiation protocol.
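The node-selection policy just described (choose the most loaded free node, otherwise counter with the closest alternate start time) can be sketched as follows. This is an illustrative sketch, not OpenPEX code: the Node type, its fields and the load model are assumptions.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a cluster node: it reports whether it is free
// for a requested start time, its current load, and the closest
// alternative start time it could offer.
class Node {
    final String name;
    final double load;     // current utilisation, 0.0 - 1.0 (assumed metric)
    final long freeFrom;   // earliest time the node is free (epoch millis)

    Node(String name, double load, long freeFrom) {
        this.name = name;
        this.load = load;
        this.freeFrom = freeFrom;
    }

    boolean isFreeAt(long start) { return freeFrom <= start; }

    // Closest start time this node could offer for the request.
    long alternateStart(long requestedStart) {
        return Math.max(freeFrom, requestedStart);
    }
}

class NodeSelector {
    // Returns the chosen node if one is free at the requested start time
    // (the most loaded free node wins, packing work densely); otherwise
    // empty, and the caller builds a COUNTER offer from counterStart().
    static Optional<Node> selectFreeNode(List<Node> nodes, long start) {
        return nodes.stream()
                .filter(n -> n.isFreeAt(start))
                .max(Comparator.comparingDouble(n -> n.load));
    }

    // Alternate start time closest to the original request, taken from
    // whichever node can begin soonest after it.
    static long counterStart(List<Node> nodes, long requestedStart) {
        return nodes.stream()
                .mapToLong(n -> n.alternateStart(requestedStart))
                .min()
                .orElseThrow();
    }
}
```

When selectFreeNode returns empty, counterStart supplies the start time that would be sent back in the COUNTER reply of the negotiation protocol.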

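The Alternate Offers exchange of Section 3.1 (reply to a proposal with ACCEPT, COUNTER or REJECT, and finalise an accepted counter-offer through the CONFIRM-REQUEST/CONFIRM handshake) can be sketched as a small state machine. All types below are illustrative assumptions, not the actual OpenPEX classes.

```java
import java.util.Optional;

// Minimal sketch of the provider side of the Alternate Offers protocol:
// given a proposal, answer ACCEPT, COUNTER (with an amended proposal) or
// REJECT; a reservation only becomes binding on a final CONFIRM.
class Negotiation {
    enum Reply { ACCEPT, REJECT, COUNTER, CONFIRM_REQUEST, CONFIRM }

    record Proposal(long startTime, long duration, int instances) {}

    // Hypothetical capacity oracle: can the proposal be satisfied, and if
    // not, can an alternative be generated?
    interface Capacity {
        boolean feasible(Proposal p);
        Optional<Proposal> alternative(Proposal p);
    }

    // Provider's answer to a requestReservation call.
    static Reply respond(Capacity c, Proposal p) {
        if (c.feasible(p)) return Reply.ACCEPT;
        return c.alternative(p).isPresent() ? Reply.COUNTER : Reply.REJECT;
    }

    // Even though the protocol is bilateral, only the provider confirms:
    // a user's ACCEPT of a counter-offer is answered with CONFIRM_REQUEST,
    // and the user must reply CONFIRM in turn.
    static Reply onUserAcceptsCounter() { return Reply.CONFIRM_REQUEST; }
}
```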
Figure 5. OpenPEX Reservations Screen.

Figure 6. OpenPEX Instances Screen.

3.3 Web Portal Interface

OpenPEX is developed completely in Java and is deployed as a service in an application container on the cluster head node (or the storage node). It communicates with the pool manager using the Xen API and uses the Java Persistence API (JPA) for the persistence back-end. The database structure for PEX is depicted in Figure 3 and follows Object-Relational Mapping (ORM) for easy development and extensibility.

The OpenPEX system provides an easy-to-use Web Portal interface, enabling the user to access all the functionality of OpenPEX. Users can access the system via a web browser, register for an account and log in to the system. Upon logging in they are greeted by a simple Welcome Screen, depicted in Figure 4, which shows what functions are available to the end user.

The user can choose to make a new reservation, where they choose the size of the reservation they wish to make (from the choices listed in Table 1), the start and end time, the template (i.e. Operating System) they wish to use, and the number of instances they require. Their request can be accepted, or they can enter into a negotiation until they come to an agreement with the OpenPEX cluster.

Once this process has occurred, they can view their existing reservations and activate any unclaimed reservations via the Reservations screen. Figure 5 shows the Reservations screen with three reservations. If a reservation has not yet been activated, a user can choose to delete it (if they no longer require it) or activate it, so the associated instances can start at the appropriate start time.

Virtual Machine instances can be viewed and manipulated via the Instances screen depicted in Figure 6. Here the user can view salient information regarding their VM instance, such as its machine name, status (e.g. HALTED, RUNNING, PAUSED, SUSPENDED), start time, end time, and IP address. An instance can also be stopped early (i.e. before its designated end time) if desired.

4 RESTful Web Service Interface

It is essential to provide programmatic access to the functions and capabilities of an OpenPEX cluster, in order for users to be able to dynamically request reservations from the system (i.e. scaling out during periods of peak load), or even to integrate an OpenPEX system into a wider pool of computing resources. As such, the full functionality of the OpenPEX system is exposed via Web Services, which are implemented in a RESTful (REpresentational State Transfer) style [6].

The REST-style architecture provides a clear and clean delineation between the functions of the client and the

Table 2. OpenPEX RESTful Endpoints.

OpenPEX Operation            HTTP Endpoint                                  Parameters     Return type
Create reservation           POST /OpenPEX/reservations                     JSON (Fig. 7)  JSON (Fig. 8,9)
Update reservation           PUT /OpenPEX/reservations/requestId            JSON           JSON
Delete reservation           DELETE /OpenPEX/reservations/requestId         None           HTTP 200 (OK)
Activate reservation         PUT /OpenPEX/reservations/requestId/activate   None           HTTP 200 (OK)
Get reservation information  GET /OpenPEX/reservations/requestId            None           JSON
List reservations            GET /OpenPEX/reservations                      None           JSON
Get instance information     GET /OpenPEX/instances/vm_id                   None           JSON
List instances               GET /OpenPEX/instances                         None           JSON
Stop instance                PUT /OpenPEX/instances/vm_id/stop              None           HTTP 200 (OK)
Reboot instance              PUT /OpenPEX/instances/vm_id/reboot            None           HTTP 200 (OK)
Delete instance              DELETE /OpenPEX/instances/vm_id                None           HTTP 200 (OK)
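As an illustration of the interface in Table 2, a client might build and issue the Create reservation call as follows, using the Java 11 java.net.http API. The host, user name and password are placeholder assumptions, and the JSON fields follow Figure 7.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Base64;

class ReservationClient {
    // Builds the JSON body of Figure 7 for a new reservation request.
    static String reservationJson(long durationMs, int instances,
                                  String startTime, String template, String type) {
        return "{\n"
             + "  \"duration\": " + durationMs + ",\n"
             + "  \"numInstancesFixed\": " + instances + ",\n"
             + "  \"numInstancesOption\": 0,\n"
             + "  \"startTime\": \"" + startTime + "\",\n"
             + "  \"template\": \"" + template + "\",\n"
             + "  \"type\": \"" + type + "\"\n"
             + "}";
    }

    // POSTs the proposal to /OpenPEX/reservations with HTTP basic
    // authentication, as all the methods listed in Table 2 require.
    // Host and credentials are illustrative placeholders.
    static HttpRequest createReservationRequest(String host, String user,
                                                String password, String json) {
        String auth = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes());
        return HttpRequest.newBuilder(URI.create(host + "/OpenPEX/reservations"))
                .header("Authorization", "Basic " + auth)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }
}
```

Sending the request with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) would return the ACCEPT, COUNTER or REJECT reply of Figures 8 and 9.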

server. A client performs operations on resources (such as reservations and instances) which are identified through standard URIs. The server returns a JSON¹ representation of the resource back to the client to indicate the current state of that resource. Clients can modify and delete these resources by altering and returning their representations as required. A client could be a Java or Python program, or a Web Portal management interface for the OpenPEX system.

{
  "duration": 3600000,
  "numInstancesFixed": 1,
  "numInstancesOption": 0,
  "startTime": "Mon, 17 Aug 2009 04:49:03 GMT",
  "template": "PEX Debian Etch 4.0 Template",
  "type": "XLARGE"
}

Figure 7. Create reservation JSON body.

Table 2 lists the functions exposed by the Web Service interface, along with their corresponding HTTP methods and endpoints. Some calls require a JSON body whilst others trigger their functionality by simply being accessed. The calls typically return a JSON object or array, or simply an HTTP code denoting whether an operation was a success or failure. All the methods listed require HTTP basic authentication. From this table we can see that the full OpenPEX life-cycle is exposed via the Web Services interface. Customers can create a new reservation, engage in bilateral negotiation (via the Alternate Offers protocol described earlier in this paper) via the /OpenPEX/reservations endpoint, and finally activate their reservation. Once a reservation has been activated, the corresponding Virtual Machine instances are started at their designated start time. A user has control over these instances via the /OpenPEX/instances endpoint, where they can stop, reboot or delete these instances.

¹ The application/json Media Type for JavaScript Object Notation (JSON) - http://tools.ietf.org/html/rfc4627

{
  "proposal": {
    "duration": 3600000,
    "id": "5D0FA0EB-90A8-F4E6-1DFF-61B2CEC6AD91",
    "numInstancesFixed": 1,
    "numInstancesOption": 0,
    "startTime": "Mon, 17 Aug 2009 04:49:03 GMT",
    "template": "PEX Debian Etch 4.0 Template",
    "type": "XLARGE",
    "userid": 1
  },
  "reply": "ACCEPT"
}

Figure 8. Reply to reservation request.

Figure 7 depicts the JSON body for a new reservation call. A user specifies the duration of the reservation, the number of instances required, the start time, the desired template (e.g. Operating System) and the type (Table 1) of instance required. These preferences are expressed in the JSON body of the call. OpenPEX will then respond with a JSON reply that indicates the outcome of the request, which could be an acceptance of the proposed reservation (shown in Figure 8), a counter offer indicating an alternate reservation that could satisfy the user (shown in Figure 9), or an outright rejection of the proposed reservation.

{
  "proposal": {
    "duration": 3600000,
    "id": "F07640D4-32BC-DDB6-457E-32B5595BA066",
    "numInstancesFixed": 1,
    "numInstancesOption": 0,
    "startTime": "Mon, 17 Aug 2009 05:52:31 GMT",
    "template": "PEX Debian Etch 4.0 Template",
    "type": "XLARGE",
    "userid": 1
  },
  "reply": "COUNTER"
}

Figure 9. Counter Reply to reservation request.

Upon successfully obtaining a reservation in the system, a user can get the reservation record and activate the reservation. Once the reservation has been activated the user can then operate on the instances themselves, obtaining the instance record, and control the state of the Virtual Machine instance itself by stopping, rebooting or deleting it.

5 Conclusion and Future Work

In this paper we introduced OpenPEX, a system that allows users to provision resources ahead of time through advance reservations, instead of being limited to on-demand, best-effort resource acquisition. OpenPEX also incorporates a novel bilateral negotiation protocol that allows users and providers to come to an agreement by exchanging offers and counter-offers, in the event that a user's original request cannot be precisely satisfied.

The fundamental aim of OpenPEX was to harness virtual machines for adaptive provisioning of services on shared computing resources. Adaptive provisioning may involve a combination of: 1) creating new VMs to meet an increase in demand; 2) migrating existing VMs to other available resources; and/or 3) suspending execution of some VMs in order to increase the resource share available to others. These techniques are, however, governed by negotiated agreements between the users and the resource providers, and between providers, that encapsulate costs and guarantees for deployment and maintenance of VMs for services. Demand and supply for services in such an environment is, therefore, mediated by market-driven resource management mechanisms, thereby leading to a so-called utility computing environment.

As such, we are endeavouring to implement provisional market-based resource management techniques in the OpenPEX system to collect pricing and utilisation data, and to introduce policies and strategies for managing virtual machines in a market-driven utility computing environment. We intend to achieve this by:

1. Soliciting a wide range of users (from other faculties and other collaborating Universities) to run VM-encapsulated workloads on the test-bed.

2. Measuring crucial pricing (using a simulated currency mechanism) and usage data from users of the system, which is difficult to obtain from commercial computing centres and largely absent from the literature.

3. Formulating market-driven policies for scheduling and migration in VM platforms based on the data collected.

4. Integrating market-driven scheduling and migration policies into OpenPEX.

5. Evaluating strategies for negotiating among multiple VM providers and users based on market conditions in conjunction with the proposed market-driven policies.

References

[1] K. Adams and O. Agesen. A comparison of software and hardware techniques for x86 virtualization. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 2–13, San Jose, California, USA, 2006. ACM.
[2] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. ACM SIGOPS Operating Systems Review, 37(5):164–177, 2003.
[3] J. Broberg, S. Venugopal, and R. Buyya. Market-oriented Grids and Utility Computing: The state-of-the-art and future directions. Journal of Grid Computing, 6(3):255–276, 2008.
[4] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic. Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6):599–616, June 2009.
[5] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2, pages 273–286. USENIX Association, 2005.
[6] R. T. Fielding. Architectural Styles and the Design of Network-based Software Architectures. PhD thesis, University of California, 2000.
[7] M. Nelson, B. Lim, and G. Hutchins. Fast transparent migration for virtual machines. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, pages 25–25, Anaheim, CA, 2005. USENIX Association.
[8] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. Eucalyptus: A Technical Report on an Elastic Utility Computing Architecture Linking Your Programs to Useful Systems. UCSB Computer Science Technical Report 2008-10.
[9] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The Eucalyptus Open-source Cloud-computing System. Proceedings of Cloud Computing and Its Applications, 2008.
[10] A. Rubinstein. Perfect equilibrium in a bargaining model. Econometrica, 50(1):97–109, January 1982.
[11] B. Sotomayor, R. Montero, I. Llorente, I. Foster, and F. de Informatica. Capacity Leasing in Cloud Systems using the OpenNebula Engine. Cloud Computing and Applications 2008, 2008.
[12] S. Venugopal, X. Chu, and R. Buyya. A negotiation mechanism for advance resource reservations using the alternate offers protocol. In Proceedings of the 16th International Workshop on Quality of Service (IWQoS 2008), pages 40–49. IEEE Computer Society Press, Los Alamitos, CA, USA, June 2008.

Exploiting Heterogeneity in Grid Computing for
Energy-Efficient Resource Allocation
Saurabh Kumar Garg and Rajkumar Buyya
The Cloud Computing and Distributed Systems (CLOUDS) Laboratory
Department of Computer Science and Software Engineering
The University of Melbourne, Australia
Email: {sgarg, raj}@csse.unimelb.edu.au

Abstract—The growing computing demand from industry and academia has led to excessive power consumption, which impacts the long-term sustainability of Grid-like infrastructures not only in terms of energy cost but also from an environmental perspective. The problem can be addressed by replacing existing infrastructure with more energy-efficient equipment, but the process of switching to new infrastructure is not only costly but also time consuming. A Grid, consisting of several HPC centers under different administrative domains, makes the problem more difficult. Thus, to reduce energy consumption, we address the challenge by effectively distributing compute-intensive parallel applications on the grid. We present a meta-scheduling algorithm which exploits the heterogeneous nature of the Grid to achieve a reduction in energy consumption. Simulation results show that our algorithm, HAMA, can significantly improve the energy efficiency of global grids, typically by 23% and by as much as 50% in some cases, while meeting users' QoS requirements.

I. INTRODUCTION

For many years, the global grid has served as a mainstream High Performance Computing (HPC) platform providing massive computational power to execute large-scale and compute-intensive scientific and technological applications. Enlarging the existing global grid infrastructure to meet the increasing demand from grid users can progressively speed up the advancement of science and technology. But the growing environmental and economic impact of the high energy consumption of HPC platforms has become a major bottleneck in the expansion of grid-like platforms.

In April 2007, Gartner estimated that the ICT industry is liable for 2% of global CO2 emissions annually, which is equal to the aviation industry [1][2]. In addition, high power consumption has not only led to a rapid increase in utility bills but also affects the reliability of servers due to highly concentrated heat loads. The power efficiency of an HPC center depends on a number of factors such as the processors' power efficiency, the cooling and air conditioning system, the infrastructure design, and the lighting/physical systems. A recent study [3] by Lawrence Berkeley National Laboratory shows that the cooling efficiency (the ratio of computer power to cooling power) of data centers varies drastically from a low of 0.6 to a high of 3.5. Thus, sustainable and environmentally friendly solutions must be employed by the current HPC community to increase the energy efficiency of HPC systems so that they make more effective use of electricity.

While a lot of research has been performed to increase the efficiency of individual clusters at various levels, such as at the processor (CPU) level [4][5], in virtualization-based resource managers [6], and in cluster resource managers [7][8], research on improving the energy efficiency of global systems such as grids is still in its infancy. Most existing grid meta-schedulers, such as the Maui/Moab scheduling suite [9], Condor-G [10], and GridWay [11], focus on improving system-centric performance metrics such as utilization, average load and application turnaround time. Others, such as the Gridbus Broker [12], focus on deadline- and budget-constrained scheduling. Thus, this paper examines how a grid meta-scheduler can exploit the heterogeneity of the global grid infrastructure to achieve a reduction in the energy consumption of the overall grid. In particular, we focus on designing a meta-scheduling policy that can be easily adopted by existing grid meta-schedulers without many changes to current grid infrastructure. This work is also relevant to the emerging cloud computing paradigm when scaling of applications across multiple clouds is considered [13]. The key contributions of this paper are:

1) It defines a novel Heterogeneity Aware Meta-scheduling Algorithm (HAMA) that considers various factors contributing to the high energy consumption of grids, including cooling system efficiency and CPU power efficiency.

2) It demonstrates, through extensive simulations using real workload traces, that the energy efficiency of global grids can be improved by as much as 23% with HAMA.

The rest of this paper is organized as follows: Section 2 discusses related work. Section 3 defines the grid meta-scheduling model. Section 4 describes HAMA. Section 5 explains the evaluation methodology and simulation setup for comparing HAMA with existing meta-scheduling policies. In Section 6, the performance results of HAMA are analyzed. Section 7 concludes the paper and presents future work.

II. RELATED WORK

This section presents related work on energy-efficient/power-aware scheduling on grids. To the best of our knowledge, no previous work has proposed a meta-scheduler that explicitly addresses the energy efficiency of grids from a global perspective.

Currently, meta-schedulers in operation for global grids, such as GridWay [11], use heuristics such as First Come First

Serve (FCFS). Moab also has an FCFS batch scheduler with an easy backfilling policy [9], [14]. Condor-G [10] uses either FCFS or matchmaking with priority sort [15] as its scheduling policies. These schedulers mostly schedule jobs with goals such as minimizing job completion time and achieving load balancing. The issue of the energy consumption of grids still needs to be addressed.

There are several research efforts on power-aware resource allocation to optimize energy consumption at a single resource site, typically within a single cluster or data center. Power usage reduction within the resource site is achieved through two methods: by switching off parts of the cluster that are not utilized [16], [17], [18], [7]; or by Dynamic Voltage Scaling (DVS) to slow down the speed of CPU processing [19], [20], [21], [22], [8], [23], [24], [7]. Hence, these efforts help reduce the energy consumption of one resource site, such as a cluster or server farm, but not across multiple resource sites distributed geographically.

Orgerie et al. [16] propose a prediction algorithm to reduce the power consumption in a large-scale computational grid such as Grid'5000 by aggregating the workload and switching off unused CPUs. They focus on reducing CPU power consumption to minimize the total energy consumption. As the power efficiency of grid sites can vary across the grid, reducing CPU power consumption by itself may not necessarily lead to a global reduction in the energy consumption of the entire grid. We focus on conserving the energy of grids from a global perspective.

Meisner et al. [19] show that in the case of a high and unpredictable workload, it is difficult to exploit the power on/off facility even though it is ideal to simply switch off idle systems. Thus, DVS-enabled CPUs will be much better at saving energy in this case. Therefore, in this work we use DVS to reduce the energy consumption of CPUs, since our main focus is on large-scale computational grid resource sites, which generally have unpredictable workloads.

III. GRID META-SCHEDULING MODEL

A. System Model

A grid meta-scheduler acts as an interface to grid resource sites and schedules jobs on behalf of users, as shown in Figure 1. It interprets and analyzes the service requirements of a submitted job and decides whether to accept or reject the job based on the availability of CPUs. Its objective is to schedule jobs so that the energy consumption of the grid can be reduced while the Quality of Service (QoS) requirements of the jobs are met. As grid resource sites are located in different geographical regions, they have different power efficiencies of CPUs and cooling systems. Each resource site is responsible for updating this information at the meta-scheduler for energy-efficient scheduling.

Fig. 1. Meta-scheduling protocol. (Sequence: 1. Job request from users with deadline; 2. Resource site supplies energy-efficiency related parameters; 3. Meta-scheduler finds the most energy-efficient resource site; 4. Meta-scheduler sends the job to the local scheduler for execution; 5. Local scheduler schedules the job in the matched time slot; 6. User is acknowledged about the resource match.)

The two participating parties, grid users and grid resource sites, are discussed below along with their objectives and constraints:

1) Grid Users: Grid users submit parallel jobs with QoS requirements to the grid meta-scheduler. Each job must be executed on an individual grid resource site and does not have preemptive priority. The reason for this requirement is that the synchronization among various tasks of parallel jobs can be affected by communication delays when jobs are executed across multiple resource sites. The user's objective is to have his job completed by a deadline. Deadlines are hard, i.e., the user will benefit only if the job completes before its deadline [25]. To facilitate the comparison between the algorithms described in this work, the estimated execution time of a job provided by the user is considered to be accurate [26]. Several models, such as those proposed by Sanjay and Vadhiyar [27], can be applied to estimate the runtime of parallel jobs. In this work, a job's execution time is inversely proportional to the CPU operating frequency.

2) Grid Resource Sites: Grid resource sites consist of clusters at different locations, such as the sites of the Distributed European Infrastructure for Supercomputing Applications (DEISA) [28], with resource sites located in various European countries, and the LHC Grid across the world [29]. Each resource site has a local scheduler that manages the execution of incoming jobs. Each local scheduler periodically supplies information about available time slots (ts, te, N) to the meta-scheduler, where ts and te are the start time and end time of the slot respectively and N is the number of CPUs available for the slot. To facilitate energy-efficient computing, each local scheduler also supplies information about the cooling system efficiency, the CPU power-frequency relationship, and the CPU operating frequencies of the grid resource site. All CPUs within a single resource site are homogeneous, but CPUs can be heterogeneous across resource sites.

B. Grid Resource Site Energy Model

The major contributors to total energy usage in a grid resource site are the computing devices (CPUs) and the cooling system,

which constitute about 80% of total energy consumption. job j, nj is the number of CPUs required for job execution,
Other systems such as lighting are not considered due to their and ej is the job execution time when operating at the CPU
negligible contribution to the total energy cost. frequency fm,j . In addition, let fij be the initial frequency at
The power consumption P of a CPU at a grid resource which CPUs of a grid resource site i operate while executing
site is composed of dynamic and static power [21][7]. The job j. HAMA, then, sorts the incoming jobs based on Earliest
static power includes the base power consumption of the Deadline First (EDF) (Algorithm 1: Line 4). The grid resource
CPU and the power consumption of all other components. sites are sorted in order of their power efficiency (Algorithm 1:
Thus, the CPU power P is approximated by the following Line 5) which is calculated by Cooling system efficiency ×
1 βi
function (similar to previous work [21][7]): P = β + αf 3 , CPU Efficiency, i.e., (1+ COP i
)×( f max +αi (fimax )2 ). Then,
i
where β is the static power consumed by the CPU, α is the meta-scheduler assigns jobs to resource sites according to this
proportionality constant, and f is the frequency at which the ordering (Algorithm 1: Line 7–11).
CPU is operating. We consider that CPUs support DVS facility
and thus their frequency can be varied from a minimum of Algorithm 1: HAMA
f min to a maximum of f max discretely. Let Ni be number 1 while current time < next schedule time do
of CPUs at a resource site i. Thus, if the CPU j running at 2 RecvResourcePublish(P)
frequency fj for tj time, then the total energy consumption //P contains information about grid resource sites
3 RecvJobQoS(Q)
due to computation is given by: //Q contains information about grid users
j
X 4 Sort jobs in ascending order of deadline
5 Sort resource sites in ascending order of
Ec,i = (βi + αi fj3 )tj . (1) 1 βi
(1 + COP ) × ( f max + αi (fimax )2 )
Ni i i
6 foreach job j ∈ RecvJobQoS do
The energy cost of an cooling system depends on the Coeffi- 7 foreach resource site i ∈ RecvResourcePublish do
//find time slot for scheduling job j at resource site i
cient Of Performance (COP) factor of the cooling system [30]. 8 if FindTimeSlot(i,j) then
COP is indication of efficiency of cooling system which is 9 Schedule job j on resource site i using DVS;
defined as the ratio of the amount of energy consumed by Update available time slots at resource site i
10 break
CPUs to the energy consumed by the cooling system. The 11
COP is however not constant and varies with cooling air
temperature. We assume that COP will remain constant during
scheduling cycle and resource sites will update meta-scheduler
whenever COP changes. Thus, the total energy consumed by TABLE I
cooling system is given by: PARAMETERS OF A G RID R ESOURCE S ITE i

Ec,i Parameter Notation


Eh,i = (2)
COPi Average Cooling system effi- COPi
ciency
Thus, the resultant total energy consumption by a grid CPU power Pi = βi + αi f 3
resource site is given by: CPU frequency range [fimin , fimax ]
1 Time slots (start time, end (ts , te , n)
Ei = (1 + )Ec,i (3) time, number of CPUs)
COPi
IV. HETEROGENEITY AWARE META-SCHEDULING ALGORITHM (HAMA)

This section gives the details of our Heterogeneity Aware Meta-scheduling Algorithm (HAMA), which enables the grid meta-scheduler to select the most energy-efficient grid resource site. The grid meta-scheduler runs HAMA periodically to assign jobs to grid resource sites. HAMA achieves this by first selecting the most energy-efficient grid resource site and then using DVS for a further reduction in energy consumption. Algorithm 1 shows the pseudo-code for HAMA. At each scheduling interval, the meta-scheduler collects information from both grid resource sites and users (Algorithm 1: Lines 2-3). Considering that a grid consists of n resource sites (supercomputer centers), all parameters associated with each resource site i are given in Table I. A user submits his QoS requirements for a job j in the form of a tuple (d_j, n_j, e_j, f_m,j), where d_j is the deadline to complete

The energy consumption is further reduced by scheduling jobs using DVS at the CPU level, which can save energy by scaling down the CPU frequency. Thus, when the grid meta-scheduler assigns a job to a grid resource site, it also decides the time slot in which the job should be executed at the minimum frequency level to decrease the energy consumed by the CPU (Algorithm 1: Line 8). If the job deadline is violated, the meta-scheduler scales up the CPU frequency to the next level and then again tries to find a free slot to execute the job. The execution time of an application is considered to increase linearly with the decrease in CPU frequency. Thus, at the next CPU frequency level, since the CPU will be executing the application at a higher frequency, the time slot required will be shorter. As the CPUs at a resource site may or may not have the DVS facility, the scheduling at the local scheduler level can be of two types: CPUs run at the maximum frequency (i.e. without DVS), or CPUs run at various frequencies using DVS
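The selection loop of Algorithm 1 can be sketched as follows. This is an illustration only: FindTimeSlot is reduced to a per-site callable, the DVS frequency-scaling step is omitted, and the dictionary field names are our assumptions:

```python
# Sketch of Algorithm 1's ordering and greedy assignment (illustrative).

def hama_schedule(jobs, sites):
    """jobs: dicts with 'id' and 'deadline'; sites: dicts with 'name', 'cop',
    'beta', 'alpha', 'fmax', 'ncpu' and a 'find_time_slot' callable."""
    jobs = sorted(jobs, key=lambda j: j["deadline"])       # line 4: EDF order

    def efficiency(s):                                     # line 5's sort key
        return (1 + 1.0 / s["cop"]) * (
            s["beta"] / s["fmax"] + s["alpha"] * s["fmax"] ** 2) / s["ncpu"]

    sites = sorted(sites, key=efficiency)                  # most efficient first
    schedule = {}
    for job in jobs:                                       # lines 6-10
        for site in sites:
            slot = site["find_time_slot"](job)
            if slot is not None:                           # schedule and stop
                schedule[job["id"]] = (site["name"], slot)
                break
    return schedule

site_a = {"name": "A", "cop": 2.0, "beta": 60, "alpha": 5, "fmax": 2.0,
          "ncpu": 100, "find_time_slot": lambda job: 0}
site_b = {"name": "B", "cop": 1.0, "beta": 90, "alpha": 7, "fmax": 3.0,
          "ncpu": 50, "find_time_slot": lambda job: None}
plan = hama_schedule([{"id": 1, "deadline": 50}, {"id": 2, "deadline": 10}],
                     [site_b, site_a])
```

Both jobs land on site A, whose efficiency metric (0.75) beats site B's (3.72); a failed FindTimeSlot falls through to the next most energy-efficient site, which the inner loop models.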
(i.e. with DVS). If the meta-scheduler fails to schedule the job on a resource site because no free slot is available, then the job is forwarded to the next most energy-efficient resource site for scheduling.

V. PERFORMANCE EVALUATION

We use workload traces from Feitelson's Parallel Workload Archive (PWA) [31] to model the global grid workload. Since this paper focuses on studying the application requirements of grid users, the PWA meets our objective by providing job traces that reflect the characteristics of real parallel applications. The experiments utilize the jobs in the first week of the LLNL Thunder trace (January 2007 to June 2007). The LLNL Thunder trace from the Lawrence Livermore National Laboratory (LLNL) in the USA is chosen due to its highest resource utilization of 87.6% among available traces, to ideally model a heavy workload scenario. From this trace, we obtain the submit time, requested number of CPUs, and actual runtime of jobs. However, the trace does not contain the service requirements of jobs (i.e. deadlines). Hence, we use a methodology proposed by Irwin et al. [32] to synthetically assign deadlines through two classes, namely Low Urgency (LU) and High Urgency (HU).

A job i in the LU class has a high ratio of deadline_i/runtime_i, so that its deadline is definitely longer than its required runtime. Conversely, a job i in the HU class has a low deadline ratio. Values are normally distributed within each of the high and low deadline parameters. The ratio of the deadline parameter's high-value mean and low-value mean is known as the high:low ratio. In our experiments, the deadline high:low ratio is 3, while the low-value deadline mean and variance are 4 and 2 respectively. In other words, LU jobs have a high-value deadline mean of 12, which is 3 times longer than HU jobs with a low-value deadline mean of 4. The arrival sequence of jobs from the HU and LU classes is randomly distributed.

Provider Configuration: The grid modelled in our simulation contains 8 resource sites spread across five countries, derived from the European Data Grid (EGEE) testbed [29]. The configurations assigned to the resources in the testbed for the simulation are listed in Table II. The configuration of each resource site is decided so that the modelled testbed reflects the heterogeneity of platforms and capabilities that is normally characteristic of such installations. Power parameters (i.e. CPU power factors and frequency levels) of the CPUs at different sites are derived from Wang and Lu's work [7]. Current commercial CPUs only support discrete frequency levels, such as the Intel Pentium M 1.6 GHz CPU, which supports 6 voltage levels. We consider discrete CPU frequencies with 5 levels in the range [f_i^min, f_i^max]. For the lowest frequency f_i^min, we use the same value used by Wang and Lu [7], i.e. f_i^min is 37.5% of f_i^max. Each local scheduler at a grid site uses Conservative Backfilling with advance reservation support, as used by Mu'alem and Feitelson [33]. The grid meta-scheduler schedules the jobs periodically at a scheduling interval of 50 seconds, which ensures that the meta-scheduler can receive at least one job in every scheduling interval. The cooling system efficiency (COP) value of resource sites is randomly generated using a uniform distribution over [0.5, 3.6], as indicated in the study conducted by Greenberg et al. [3].

Grid Meta-scheduling Algorithms: We examine the performance of HAMA in terms of job selection and resource allocation by the grid meta-scheduler. We compare our job selection algorithm with EDF-FQ, which prioritizes jobs based on deadline and submits jobs to the resource site with the earliest start time (FQ) and thus the least waiting time. We also compare HAMA with another version of HAMA, i.e. HAMA-withoutDVS, to analyze the effect of the DVS facility on energy consumption.

Performance Metrics: We consider two metrics: average energy consumption and workload (i.e. amount of workload executed). Average energy consumption shows the amount of energy saved by using HAMA in comparison to other grid meta-scheduling algorithms, whereas workload shows HAMA's effect on the workload executed successfully by the grid.

Experimental Scenarios: We run the experiments in two scenarios: 1) urgency class and 2) arrival rate of jobs. For the urgency class, we use various percentages (0%, 20%, 40%, 60%, 80%, and 100%) of HU jobs. For instance, if the percentage of HU jobs is 20%, then the percentage of LU jobs is the remaining 80%. For the arrival rate, we apply various factors (10, 100, and 1000) to the submit times from the trace. For example, a factor of 10 means a job with a submit time of 10s from the trace now has a simulated submit time of 1s. Hence, a higher factor represents a higher workload by shortening the submit times of jobs.

From Equation 3, we know that the performance of HAMA is highly dependent on the CPU efficiency and cooling system efficiency of grid resource sites. We compare the performance of our algorithm in a worst case scenario (HL), i.e., when the resource site with the highest CPU power efficiency has the lowest COP, and a best case scenario (HH), i.e., when the resource site with the highest CPU power efficiency has the highest COP.

VI. PERFORMANCE RESULTS

A. Effect on Energy Consumption

This section compares the energy consumption of HAMA with other meta-scheduling algorithms for grid resource sites with HH and HL configurations. Figure 2 shows how energy consumption varies with deadline urgency and arrival rate of jobs. HAMA clearly outperforms its competitor EDF-FQ, saving about 17%-23% energy in the worst case and about 52% in the best case.

The effect of job urgency on energy consumption can be clearly seen from Figures 2(a) and 2(b). As the percentage of HU jobs with more urgent (shorter) deadlines increases, the energy consumption (Figure 2(a) and 2(b)) also increases, due to more urgent jobs running on resource sites with lower power efficiency and at the highest CPU frequency to avoid deadline violations. On the other hand, the effect of job arrival rate on
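The LU/HU deadline synthesis described above can be sketched as follows. The low-value mean 4, variance 2, and high:low ratio 3 come from the text; clamping the ratio at 1 (so the deadline never falls below the runtime) and reusing variance 2 for the high-value class are our assumptions:

```python
# Sketch of Irwin et al. [32] style synthetic deadline assignment.
import random

def assign_deadline(runtime, high_urgency, rng=random):
    """Draw a deadline/runtime ratio from a normal distribution: HU jobs use
    the low-value mean 4, LU jobs the high-value mean 12 (high:low ratio 3)."""
    mean = 4.0 if high_urgency else 12.0
    ratio = rng.gauss(mean, 2.0 ** 0.5)   # variance 2 -> std dev sqrt(2)
    return runtime * max(ratio, 1.0)      # deadline never below the runtime
```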
TABLE II
CHARACTERISTICS OF GRID RESOURCE SITES

  Location of Grid Site | β   | α   | f_i^max | No. of CPUs | MIPS Rating
  RAL (UK)              | 65  | 7.5 | 1.8     | 2050        | 1140
  Imperial College (UK) | 75  | 5   | 1.8     | 2600        | 1200
  NorduGrid (Norway)    | 60  | 60  | 2.4     | 650         | 1330
  NIKHEF (Netherlands)  | 75  | 5.2 | 2.4     | 540         | 1176
  LYON (France)         | 90  | 4.5 | 3.0     | 600         | 1166
  Milano (Italy)        | 105 | 6.5 | 3.0     | 350         | 1320
  Torino (Italy)        | 90  | 4.0 | 3.2     | 200         | 1000
  Padova (Italy)        | 105 | 4.4 | 3.2     | 250         | 1330

(β, α and f_i^max are the CPU power factors and maximum frequency of each site.)
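Each site's five discrete frequency levels can be derived from its f_i^max in Table II, with f_min = 37.5% of f_max as stated above; even spacing between the two is our assumption, since the paper does not give the intermediate levels:

```python
# Derive a site's discrete DVS frequency levels (illustrative sketch).

def dvs_levels(fmax, nlevels=5):
    fmin = 0.375 * fmax                    # f_min is 37.5% of f_max [7]
    step = (fmax - fmin) / (nlevels - 1)   # assumed: evenly spaced levels
    return [round(fmin + k * step, 4) for k in range(nlevels)]

levels = dvs_levels(3.2)   # e.g. the Torino or Padova sites in Table II
```

For f_max = 3.2 GHz this yields [1.2, 1.7, 2.2, 2.7, 3.2]; HAMA starts a job at the lowest level that meets its deadline and climbs one level only when the deadline would otherwise be violated.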

energy consumption (Figure 2(c) and 2(d)) is minimal, with a slight increase when more jobs arrive.

For grid resource sites without DVS, HAMA-withoutDVS can reduce energy consumption by up to 15-21% (Figure 2(a)) in the HL configuration and by 28-50% (Figure 2(b)) in the HH configuration compared to EDF-FQ, which likewise does not consider the DVS facility while scheduling across the entire grid. This highlights the importance of the power efficiency factor in achieving energy-efficient meta-scheduling. In particular, HAMA can reduce energy consumption (Figure 2(a) and 2(b)) even more when there are more LU jobs with less urgent (longer) deadlines and the arrival rate is low.

When we compare HAMA and HAMA-withoutDVS, we observe that using DVS increases the energy saving by about 11% when the percentage of jobs with urgent deadlines and the job arrival rate are high. This is because, when the DVS facility is available, jobs can run at a lower CPU frequency to save energy.

B. Effect on Workload Executed

Figure 3 shows the total amount of workload successfully executed according to the users' QoS. The workload of a job refers to the product of its execution time and the number of CPUs required. The effect of job urgency and arrival rate on the workload executed can be clearly seen from Figures 3(a) and 3(d). All meta-scheduling algorithms show a consistent decrease in workload execution, particularly in the job urgency scenario. The reason is the rejection of more jobs due to deadline misses when all jobs are of high urgency. The amount of workload executed by EDF-FQ is less than that of HAMA because, under EDF-FQ, the local scheduler executes jobs using conservative backfilling without any consideration of job deadlines, whereas in the case of HAMA the meta-scheduler sends a job to a resource site only if a time slot is available to execute the job before its deadline.

VII. CONCLUSION

With the increasing demand for global grids, the energy consumption of grid infrastructure has escalated to the degree that grids are becoming a threat to society rather than an asset. The carbon footprint of grids may continue to increase unless the problem is addressed at every level, i.e., from local (within a single grid site) to global (across multiple grid sites). Moreover, an immediate and significant reduction in CO2 emissions is required for the future sustainability of global grids.

In this paper, we have addressed the energy efficiency of grids at the meta-scheduling level. We proposed the Heterogeneity Aware Meta-scheduling Algorithm (HAMA) to address the problem by scheduling more workload with urgent deadlines on resource sites which are more power-efficient. Thus, HAMA considers crucial information about global grid resource sites, such as cooling system efficiency (COP) and CPU power efficiency. HAMA addresses the problem in two steps: 1) allocating jobs to more energy-efficient resource sites and 2) scheduling using a DVS policy at the local resource site to further reduce energy consumption.

Results show that HAMA can reduce energy consumption by up to 23% in the worst case and up to 50% in the best case compared to other algorithms (EDF-FQ). Moreover, even if the DVS facility is not available, HAMA-withoutDVS can still yield considerable power savings of up to 21%. In particular, our HAMA algorithm works very well when the deadlines of jobs are less urgent and the arrival rate of jobs is not high. Thus, HAMA can also complement existing power-aware scheduling policies for clusters.

In future, we will investigate how HAMA can address the energy consumption problem in virtualized environments such as clouds, which are the emerging platform for hosting business applications. We will also integrate HAMA with existing grid meta-schedulers and conduct experiments on real grid and cloud resources. We will further extend our current meta-scheduling model to resources such as storage disks and switching devices.

ACKNOWLEDGEMENTS

We would like to thank Chee Shin Yeo for his constructive comments on this paper. This work is partially supported by research grants from the Australian Research Council (ARC) and the Australian Department of Innovation, Industry, Science and Research (DIISR).

REFERENCES

[1] Gartner, "Gartner Estimates ICT Industry Accounts for 2 Percent of Global CO2 Emissions," http://www.gartner.com/it/page.jsp?id=503867.
[2] J. G. Koomey, "Estimating total power consumption by servers in US and world," http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf.
[Figure 2 omitted: four panels plotting Average Energy Consumption for HAMA, HAMA-withoutDVS and EDF-FQ. (a) HL: Energy Consumption vs Job Urgency; (b) HH: Energy Consumption vs Job Urgency; (c) HL: Energy Consumption vs Job Arrival Rate; (d) HH: Energy Consumption vs Job Arrival Rate.]

Fig. 2. Comparison of HAMA with other meta-scheduling algorithms

[3] S. Greenberg, E. Mills, B. Tschudi, P. Rumsey, and B. Myatt, "Best practices for data centers: Results from benchmarking 22 data centers," in Proc. of the 2006 ACEEE Summer Study on Energy Efficiency in Buildings, Pacific Grove, USA, 2006, http://eetd.lbl.gov/emills/PUBS/PDF/ACEEE-datacenters.pdf.
[4] V. Salapura et al., "Power and performance optimization at the system level," in Proc. of the 2nd Conference on Computing Frontiers, Ischia, Italy, 2005.
[5] A. Elyada, R. Ginosar, and U. Weiser, "Low-complexity policies for energy-performance tradeoff in chip-multi-processors," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 9, pp. 1243-1248, 2008.
[6] A. Verma, P. Ahuja, and A. Neogi, "pMapper: Power and migration cost aware application placement in virtualized systems," in Proc. of the 9th ACM/IFIP/USENIX International Conference on Middleware, Leuven, Belgium, 2008.
[7] L. Wang and Y. Lu, "Efficient Power Management of Heterogeneous Soft Real-Time Clusters," in Proc. of the 2008 Real-Time Systems Symposium, Barcelona, Spain, 2008.
[8] K. Kim, R. Buyya, and J. Kim, "Power aware scheduling of bag-of-tasks applications with deadline constraints on DVS-enabled clusters," in Proc. of the Seventh IEEE International Symposium on Cluster Computing and the Grid, Rio de Janeiro, Brazil, 2007.
[9] B. Bode et al., "The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters," in Proc. of the 4th Annual Linux Showcase and Conference, Atlanta, USA, 2000.
[10] J. Frey, T. Tannenbaum, M. Livny, I. Foster, and S. Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids," Cluster Computing, vol. 5, no. 3, pp. 237-246, 2002.
[11] E. Huedo, R. Montero, and I. Llorente, "A framework for adaptive execution in grids," Software: Practice and Experience, vol. 34, no. 7, pp. 631-651, 2004.
[12] S. Venugopal, K. Nadiminti, H. Gibbins, and R. Buyya, "Designing a resource broker for heterogeneous grids," Software: Practice and Experience, vol. 38, no. 8, pp. 793-825, 2008.
[13] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility," Future Generation Computer Systems, vol. 25, no. 6, pp. 599-616, 2009.
[14] Y. Etsion and D. Tsafrir, "A Short Survey of Commercial Cluster Batch Schedulers," Technical Report 2005-13, Hebrew University, May 2005.
[15] R. Raman, M. Livny, and M. Solomon, "Resource Management through Multilateral Matchmaking," in Proc. of the 9th IEEE Symposium on High Performance Distributed Computing, Pittsburgh, USA, 2000.
[16] A. Orgerie, L. Lefèvre, and J. Gelas, "Save Watts in Your Grid: Green Strategies for Energy-Aware Framework in Large Scale Distributed Systems," in Proc. of the 14th IEEE International Conference on Parallel and Distributed Systems, Melbourne, Australia, 2008.
[17] D. Bradley, R. Harper, and S. Hunter, "Workload-based power management for parallel computer systems," IBM Journal of Research and Development, vol. 47, no. 5, pp. 703-718, 2003.
[Figure 3 omitted: four panels plotting Workload Executed (millions) for HAMA, HAMA-withoutDVS and EDF-FQ. (a) HL: Workload Execution vs Job Urgency; (b) HH: Workload Execution vs Job Urgency; (c) HL: Workload Execution vs Job Arrival Rate; (d) HH: Workload Execution vs Job Arrival Rate.]

Fig. 3. Comparison of HAMA with other meta-scheduling algorithms

[18] B. Lawson and E. Smirni, "Power-aware resource allocation in high-end systems via online simulation," in Proc. of the 19th Annual International Conference on Supercomputing, Cambridge, USA, 2005, pp. 229-238.
[19] D. Meisner, B. Gold, and T. Wenisch, "PowerNap: eliminating server idle power," in Proc. of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, Washington, USA, 2009.
[20] G. Tesauro et al., "Managing power consumption and performance of computing systems using reinforcement learning," in Proc. of the 21st Annual Conference on Neural Information Processing Systems, Vancouver, Canada, 2007.
[21] Y. Chen, A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, and N. Gautam, "Managing server energy and operational costs in hosting centers," ACM SIGMETRICS Performance Evaluation Review, vol. 33, no. 1, pp. 303-314, 2005.
[22] A. Verma, P. Ahuja, and A. Neogi, "Power-aware dynamic placement of HPC applications," in Proc. of the 22nd Annual International Conference on Supercomputing, Athens, Greece, 2008, pp. 175-184.
[23] N. Kappiah, V. Freeh, and D. Lowenthal, "Just in time dynamic voltage scaling: Exploiting inter-node slack to save energy in MPI programs," in Proc. of the 2005 ACM/IEEE Conference on Supercomputing, Seattle, USA, 2005.
[24] C. Hsu and W. Feng, "A power-aware run-time system for high-performance computing," in Proc. of the 2005 ACM/IEEE Conference on Supercomputing, Seattle, USA, 2005.
[25] R. Porter, "Mechanism design for online real-time scheduling," in Proc. of the 5th ACM Conference on Electronic Commerce, New York, USA, 2004, pp. 61-70.
[26] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong, "Theory and practice in parallel job scheduling," in Job Scheduling Strategies for Parallel Processing, London, UK, 1997, pp. 1-34.
[27] H. A. Sanjay and S. Vadhiyar, "Performance modeling of parallel applications for grid scheduling," Journal of Parallel and Distributed Computing, vol. 68, no. 8, pp. 1135-1145, 2008.
[28] "Distributed European Infrastructure for Supercomputing Applications (DEISA)," http://www.deisa.eu.
[29] Enabling Grids for E-sciencE, "EGEE project," http://www.eu-egee.org/, 2005.
[30] J. Moore, J. Chase, P. Ranganathan, and R. Sharma, "Making scheduling "cool": temperature-aware workload placement in data centers," in Proc. of the 2005 USENIX Annual Technical Conference, Anaheim, CA, 2005.
[31] D. Feitelson, "Parallel workloads archive," http://www.cs.huji.ac.il/labs/parallel/workload.
[32] D. Irwin, L. Grit, and J. Chase, "Balancing risk and reward in a market-based task service," in Proc. of the 13th IEEE International Symposium on High Performance Distributed Computing, Honolulu, USA, 2004.
[33] A. W. Mu'alem and D. G. Feitelson, "Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 6, pp. 529-543, Jun. 2001.
Intelligent Data Analytics Console
Snehal Gaikwad Mihir Kedia Aashish Jog Bhavana Tiple
School of Computer Science IBM Global Services School of Management Dept of Computer Science
Carnegie Mellon University IBM India Pvt Ltd IIT Bombay MIT Pune
Pittsburgh, 15217 USA Delhi, 110020 INDIA Powai, 400076 INDIA Pune, 411038 INDIA
snehalgaikwad@cmu.edu mihir.kedia@in.ibm.com aashishjog@som.iitb.ac.in bstiple@mitpune.com

Abstract—The problem of integrating data analysis, visualization and learning is arguably at the very core of the problem of data intelligence. In this paper, we review our research work in the area of data analytics and visualization, focusing on four interlinked directions of research: (1) data collection, (2) data analytics and visual evidence, (3) data visualization, and (4) intelligent user interfaces; these contribute to and complement each other. Our research resulted in the Intelligent Data Analytics Console (IDAC), an integration of the above four disciplines. The main objectives of IDAC are to: (1) provide a rapid development platform for analyzing data and generating components that can be used readily in software applications, (2) provide visual evidence after each data analytical operation that helps the user learn the behavior of the data, and (3) provide a user-centric platform for skilled data analytics experts as well as naïve users. The paper presents the development process of user-centric intelligent software equipped with effective visualization. This approach should help business organizations to develop better data analytics software using open source technologies.

Keywords- Human Factors, Intelligent User Interfaces, Machine Learning

I. INTRODUCTION

A huge amount of data is generated by various familiar processes and systems and the computer networks built around them. Voluminous data from electronic business, computer networks, financial trading systems, share markets, and weather forecasting arrives in fast streams. In our day-to-day lives, manually analyzing and classifying such large amounts of data is not feasible.

Data analytics is the art and science of getting to know more about the behavior of complex real time data by the application of mathematical and statistical principles [30, 43]. Rapid data analytics operations performed on given data result in large changes to the original data values. If we consider a dataset having about a thousand data points, tracking the changes by hand after each data analytics operation is impossible. Therefore, to learn the patterns and trends of a given data set, it is necessary to represent data values in an understandable format. Many data analytics tools have been proposed to overcome these problems; Weka [5, 36, 43], Yet Another Learning Environment (Yale, now known as RapidMiner) [15, 29], and Sumatra TT [4] are major examples. However, current software systems fail to provide a rapid analytics platform for naïve users. Sumatra TT, Weka, and Yale cannot support different types of data formats. Sumatra TT fails to provide a wide variety of analytics algorithms and operators. Inefficient drag-and-drop interfaces and a lack of graphical representations eventually result in major usability challenges for naïve users. As a result, the prediction and decision-making process becomes time consuming and frustrating for users.

Today, data analytics and visualization research has undergone fundamental changes in several approaches [6, 9, 25, 26]. A new research and development focus has emerged within visualization to address some of the fundamental problems associated with new classes of data and their related analysis tasks [28]. This research and development focus is known as information visualization. Information visualization combines aspects of scientific visualization, human-computer interfaces, data mining, imaging, and graphics [28]. Visualization is perceived as a gateway to understanding voluminous datasets. It provides a productive environment for data analytics experts and business executives to better understand and forecast huge amounts of data in a short span of time.

In this paper, we present the development and implementation of the Intelligent Data Analytics Console (IDAC), focusing on the integration of data collection, data analytics and visualization. We demonstrate an approach to develop a rapid platform for analyzing datasets and generating agents which can be used readily in software applications. This process involves a library of analytics blocks and allows the user to drag and drop all the functional blocks required to form an 'analytics execution chain'. Our development is based on the open source libraries from Weka [5, 36, 43], Yale [15, 29], and JFreeChart [11]. The IDAC architecture demonstrates how intelligent user interfaces can increase the usability of complex systems. Our research provides detailed guidelines for the research and business community to effectively set up a platform for data analysis problems. The rapid learning platform of IDAC captures the knowledge generated by data analytics experts; by finding typical analysis patterns, it provides standard recipes as well as online recommendations on what to do next. An advisory system is an application of data analytics which analyzes the nature of the data as well as the results obtained at each step to provide recommendations. The paper demonstrates how this feature is immensely useful to users who do not have a deep understanding of analytics algorithms. Capturing the knowledge involves recording earlier successful executions of analytics chains used by users for particular types of analysis and 'training' the advisory system. Further, we illustrate how the visual evidence technique can act as an effective debugging aid for naïve users.
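The 'analytics execution chain' idea can be sketched as a pipeline of operator functions; the two operators below are hypothetical stand-ins for IDAC's drag-and-drop blocks:

```python
# Illustrative sketch of an analytics execution chain (operator pipeline).

def run_chain(data, operators):
    for op in operators:
        data = op(data)          # each block consumes and produces a dataset
    return data

def drop_missing(xs):
    return [x for x in xs if x is not None]

def normalize(xs):
    peak = max(xs)
    return [x / peak for x in xs]

result = run_chain([2.0, None, 4.0, 8.0], [drop_missing, normalize])
```

Because every block shares the dataset-in/dataset-out contract, chains can be assembled and reordered freely, and visual-evidence operators can be instrumented between any two blocks.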
The rest of the paper is organized as follows. In Section 2, we introduce the detailed system architecture of IDAC; Section 3 presents the intelligent user interface; usability evaluation and results are described in Section 4. Section 5 discusses future research; Section 6 provides concluding remarks.

II. SYSTEM ARCHITECTURE

In this section, we provide a detailed implementation of IDAC. The IDAC architecture is based on the iterative process model [34, 35]. Traditional software development cycles include the widely used Waterfall Model, Prototyping Model or Spiral Model [10]. Most of them have barriers of specification, communication and optimization [10, 34]. Considering our research requirements, we decided to follow Incremental Process Model development. The main objective of applying the incremental development approach was to enhance system usability and constantly tune the computational performance of the software [22, 33, 34, 35].

The data collection module, the data analytics and visual evidence module, and the data visualization module are the three major components of the system. Fig. 1 shows the detailed architecture of IDAC. The data collection module is responsible for data preprocessing and cleaning [38]. The data analytics and visual evidence module focuses on data mining [43] and machine learning [30] operations and generates textual results. The visual evidence operators are created after each data analysis operator. The data visualization module is responsible for converting textual results to suitable graphical formats. In the following sections of the paper, we demonstrate the detailed architecture of IDAC.

A. Data Collection Module

The data collection module is responsible for rapid data preprocessing. Data preprocessing is also called data cleaning or data cleansing. The data cleaning process deals with detecting and removing errors and inconsistencies in order to improve the quality of the available data [2, 20, 38]. We have used three categories of filters for data cleaning: Supervised Filters, Unsupervised Filters, and Streamable Filters. The IDAC Java packages for analytics operators are based on the Yale [15] and Weka [5] open source coding hierarchy. The IDAC filter package is responsible for filtering operations.

Real time data sets contain huge numbers of missing values, which affect the accuracy of prediction. For example, if an electronic business executive wants to launch a new product but the available datasets do not have enough labels for an attribute, then it is difficult for him to make accurate predictions. To overcome this problem and estimate unknown parameters, we have implemented the Expectation-Maximization (EM) algorithm. The EM algorithm is used for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables [12]. The E-step of the algorithm finds the conditional expectation of the missing data given the observed data and the current estimated parameters, and substitutes the expectations for the missing data. The M-step updates the parameter estimates by maximizing the expected complete-data log likelihood [12]. We used Frank Dellaert's formulation to maximize the posterior probability of the unknown parameters [12]. If U is the given measurement data and J is a hidden variable, then the unknown parameter is estimated by the following formula [12]:

    Θ* = argmax_Θ Σ_{J ∈ J^n} P(Θ, J | U)    (1)

EM computes a distribution over the space J^n rather than finding the best J ∈ J^n [12]. We found that existing data analytics tools fail to support more than one data format [4]; to overcome this drawback, the system supports two major data file formats, ARFF and CSV. A data conversion operator helps the user to convert files into other formats.

Figure 1. IDAC System Architecture
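The E-/M-steps described above can be illustrated on a toy missing-value problem: fitting a univariate Gaussian to data with missing entries. This is a simplified sketch, not IDAC's EM implementation:

```python
# Toy EM sketch: estimate (mu, var) of a Gaussian when some entries are None.

def em_impute(data, iters=20):
    observed = [x for x in data if x is not None]
    mu, var = sum(observed) / len(observed), 1.0
    for _ in range(iters):
        # E-step: expected value of each missing entry under current estimate
        filled = [x if x is not None else mu for x in data]
        # M-step: maximize the expected complete-data log likelihood
        mu = sum(filled) / len(filled)
        var = sum((x - mu) ** 2 for x in filled) / len(filled)
    return mu, var, filled

mu, var, filled = em_impute([1.0, None, 3.0])
```

The missing entry converges to the mean of the observed values, which is exactly the "substitute the expectation for the missing data" behavior described in the text.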

[Figure 2 diagram omitted: a WEKA Knowledge Flow chain IRIS -> Cross Validation Fold Maker -> Attribute Selection -> ClassifierPerformanceEvaluator -> Text Viewer, with Data Visualizer, AttributeSummarizer and ScatterPlotMatrix operators attached.]

Figure 2. WEKA Data Analytics chain without Visual Operator

1) ARFF: ARFF is widely known as the Attribute Relation File Format. The structure of an ARFF file is divided into two major parts: the header section and the data section. The header section contains the name of the relation and the list of attributes and their types. The data section consists of data declaration lines and values. Table I shows the ARFF file for a customer banking record.

TABLE I. ARFF FOR CUSTOMER DATA

@relation bank
@attribute Service_type {Fund, Loan, CD, Bank_Account, Mortgage}
@attribute Customer {Student, Business, Other, Doctor, Professional}
@attribute Size {Small, Large, Medium}
@attribute Interest_rate real
@DATA
Fund,Business,Small,1
Loan,Other,Small,1
Mortgage,Business,Small,4
………………………………….

2) Comma Separated Value (CSV): CSV is also known as the comma-separated list. CSV files consist of data values separated by commas. Table II represents the CSV file for the weather forecasting data used by IDAC.

TABLE II. CSV FOR WEATHER FORECASTING DATA

Sunny, 89, false, no
Overcast, 78, true, yes
rainy, 75, true, yes

The data collection module sends cleaned data to the data analytics and visual evidence module to perform advanced data analysis operations. In this module, IDAC allows the data analytics expert to perform several analysis operations in a certain order. Data probes send cleaned data to the analytics and visual evidence module. This module involves a library of analytics blocks that allows the user to drag and drop the required blocks and connect them to form an “analytics execution chain” in the visual application. The chain is based on the end goal of the analysis as well as the nature of the dataset. Fig. 2 shows the chain of the Iris data set, cross validation, and attribute selection operators. Existing data analytics software provides visualization only at the end of the chain; as a result, it becomes difficult for the user to understand the changes after each data analytics operator [5, 4, 15]. From an analytics perspective, tracking these changes is very important for learning the behavior of the data set. We propose the new concept of the Visualization Debugger (see Fig. 3).
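The ARFF structure described in this subsection (an @relation/@attribute header followed by an @data block) can be read with a few lines of code. The sketch below is purely illustrative and is not part of IDAC or Weka; it handles only the nominal and numeric declarations of the kind shown in Table I.

```python
# Minimal ARFF reader for the header/data split described above.
# Illustrative sketch only -- not the Weka or IDAC implementation.

def parse_arff(text):
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):   # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@relation'):
            relation = line.split(None, 1)[1]
        elif lower.startswith('@attribute'):
            _, name, atype = line.split(None, 2)
            attributes.append((name, atype))
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            rows.append([v.strip() for v in line.split(',')])
    return relation, attributes, rows

sample = """@relation bank
@attribute Service_type {Fund, Loan, CD, Bank_Account, Mortgage}
@attribute Size {Small,Large,Medium}
@attribute Interest_rate real
@data
Fund,Small,1
Loan,Small,1"""

rel, attrs, data = parse_arff(sample)
print(rel)          # bank
print(len(attrs))   # 3
print(data[0])      # ['Fund', 'Small', '1']
```

A reader like this gives the same header/data separation that the IDAC data probes rely on when handing cleaned records to the analytics chain.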

Figure 3. Visual Debugger (before-processing and after-processing views)

Figure 4. WEKA Data Analytics chain with Visual Operator (operators shown: Cross Validation Fold Maker, Attribute Selection, Data Visualizer, ScatterPlotMatrix, Text Viewer, ClassifierPerformanceEvaluator)
The Visualization Debugger helps the user to understand the changing behavior of the dataset after each analytics operation. Fig. 4 represents the basic visual debugger used to generate visual evidence from data analytics results. The visualization operator displays the dataset values twice: a) before entering the data analytics operator and b) after resulting from the data analytics operator. By comparing the results from both visual operators, the user can easily keep track of the changes that take place in the operational chain. This application presents an interface amenable to the user who would like to rapidly fit an analytics solution into a suitable data processing architecture. The IDAC advisory system analyzes the results obtained at each step and provides recommendations to assist users in making decisions based on visual facts. Data analytics operations help the user to find patterns in huge datasets. We have used the open source Weka platform to develop a library of different data analysis operators. The following sections of the paper focus on the major data analytics operators implemented in IDAC.

B. IDAC Classification

We have analyzed the performance of existing classifiers [13, 21, 41] as well as the classification techniques used in Weka [43] and Sumatra TT [4]. We found that decision trees are the most popular and stable method among available data mining classifiers. The main reason is that decision trees provide excellent scalability and highly understandable results. Decision trees also support sparse data. We also used the KNN model because it is applicable to most cases independent of training data requirements, sparse to dense. KNN is an instance-based classifier that provides excellent quality of prediction and explanatory results. However, we found that KNN is not optimal for problems with a non-zero Bayes error rate, that is, for problems where even the best possible classifier has a non-zero classification error. We overcome this problem by incorporating discriminative and generative classifiers.

C. IDAC Clustering

Clustering deals with discovering classes of instances that belong together [23, 43]. Clusters are also known as self-organized maps. The IDAC clustering operators are developed using high performance clustering algorithms [32]. The IDAC operator converts an input file into several clusters; it also provides the probability distribution for all attributes. A data set may sometimes show a trend different from the regular one; such data is called an anomaly. Detecting these anomalies is essential to study the abnormal behavior of the dataset. Visual representation of clustering enables the user to find abnormal behaviors in the datasets and helps in decision-making. Table III represents the algorithms implemented in IDAC.

TABLE III. IDAC JAVA CLASSES AND ALGORITHMS FOR CLUSTERING

Java Class       Algorithm Description
Simple X Means   Cluster data using the X-Means algorithm.
Simple K Means   Cluster data using the K-Means algorithm.
FarthestFast     Cluster data using the FarthestFirst algorithm.
DBScan           Density-based algorithm for discovering clusters in large spatial databases with noise.

An Example: We illustrate IDAC clustering with a simple example of a real-world company. The main objective of the company is to develop a new financial product for its customers. Table IV represents the data values of vacation, eCredit, salary and property for each customer. The goal of the operation is to find patterns in the customer behavior and develop a new financial product. The main step of the process is finding out which data belong to which cluster. Using the data analytics library, we apply the Simple K Means data analytics operator on the available data. Table V shows which customer fits into which cluster. For example, customer number nine, with salary 14.4, belongs to the first cluster. Table VI shows the mean value for each cluster. The visualization in Fig. 5 helps us to understand the behavior of the data. The user can easily observe that groups 3 and 5 differentiate themselves from the rest. Customers who belong to clusters 3 and 5 have maximum vacations, high salary, and low property.

TABLE IV. CUSTOMER DATA

Customer  Vacation  eCredit  Salary  Property
1         6         40       13.62   3.2804
2         11        21       15.32   2.0232
3         7         64       16.55   3.1202
4         3         47       15.71   3.4022
5         15        10       16.96   2.2825
6         6         80       14.66   3.7338
7         10        49       13.86   5.8639
8         10        84       15.64   3.187
9         9         74       14.4    2.3823
…
186       51        13       20.71   1.4585
187       49        17       19.18   2.4251

TABLE V. RESULTS AFTER CLUSTERING

Customer  Vacation  eCredit  Salary  Property  Fit Cluster
1         6         40       13.62   3.2804    2
2         11        21       15.32   2.0232    2
3         7         64       16.55   3.1202    1
4         3         47       15.71   3.4022    2
5         15        10       16.96   2.2825    2
6         6         80       14.66   3.7338    1
7         10        49       13.86   5.8639    3
8         10        84       15.64   3.187     1
9         9         74       14.4    2.3823    1
…
186       51        13       20.71   1.4585    5
187       49        17       19.18   2.4251    5

TABLE VI. MEAN VALUE FOR EACH CLUSTER

Group  Vacation   eCredit   Salary    Property  Total Number
1      14.51515   80.45455  21.90152  5.465142  33
2      8.75       14.80556  16.20771  1.423411  36
3      39.02128   55.08511  19.92681  3.72284   47
4      10.21277   224.0435  24.9087   11.61411  23
5      48.21277   15.42553  21.87149  2.057874  47
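The Simple K Means step used in the example above can be sketched with a plain Lloyd's iteration. The snippet below clusters a toy (vacation, salary) subset loosely echoing Table IV; it is an illustration only, not the Weka SimpleKMeans operator that IDAC wraps, so its labels will not reproduce Table V exactly.

```python
# Toy Lloyd's k-means on (vacation, salary) pairs, illustrating the
# clustering step of the example above. Not the Weka SimpleKMeans code.
import math

def kmeans(points, k, iters=20):
    centroids = points[:k]                      # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[idx].append(p)
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# (vacation, salary) rows loosely echoing Table IV
points = [(6, 13.62), (11, 15.32), (7, 16.55), (51, 20.71), (49, 19.18)]
centroids, clusters = kmeans(points, k=2)
# The two high-vacation customers separate from the rest, mirroring how
# clusters 3 and 5 stand apart in Fig. 5.
```

Running this splits the five toy customers into a three-member low-vacation group and a two-member high-vacation group, which is the kind of separation the visualization in Fig. 5 makes visible.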
Figure 5. Clustering Visualization (clusters 3 and 5 stand apart, containing customers with many vacations, high salary, and low property)

Based on the above results, the company can decide to develop a travel-related financial service with travel card offers, insurance coverage, travel accident insurance, baggage insurance, theft insurance, and full primary collision insurance on car rentals. Rapid clustering computations and detailed visual results make the real-time decision-making process easier.

D. IDAC Association & Regression

We have used association operators to learn relationships between data attributes. Unlike Yale, IDAC supports the Apriori algorithm for handling nominal attributes from given datasets. The algorithm helps to determine the number of rules, the minimal support, and the minimum confidence value [1]. The regression model is applicable to numeric classification and prediction, provided the relationship between the input attributes is known [17, 26]. If a data analyst knows the value of a particular quantity, regression helps him to estimate the value of the other one. Other analytics operators in IDAC are responsible for performing prediction, anomaly detection, filtering, and sampling operations.

E. Data Visualization

Visualization is perceived as a gateway to understanding voluminous datasets. It provides a healthy environment for data analytics experts and business executives to understand and forecast huge amounts of data in a short span of time. We have implemented the IDAC visual operators using the open source JFreeChart library [11, 39]. The purpose of a visual operator is to discover hidden patterns and provide assistance in decision-making. The visualization library not only helps users to debug analytics results but also provides a data forecasting platform.

TABLE VII. USER CENTRIC SCIENTIFIC AND INFORMATION VISUALIZATION

Visualization              User            Objective
Scientific Visualization   Data Analyst    Deep understanding of scientific phenomena
Information Visualization  Less Technical  Searching, discovering relationships

Table VII shows the user centric visualization method. While representing voluminous data, we have considered the following principles [9, 14, 25, 27]:

• Clarity: Data visualization is different from verbal information; visual information is analyzed in parallel by the human brain. Therefore, data values and changes are represented in interpretable formats.

• Simplicity: The graphical representation should be as simple as possible; it should not confuse the user.

• Brevity: While representing datasets, economy of expression is very important; the representation should be self-explanatory because we are much better at remembering visual information.

According to the nature of a dataset, IDAC finds the visual operator for a specific data analysis task. Another key theme for data visualization involves ease of use. The demand for good, effective visualization of data is very high, especially among those who do not have any data analytics background. The naïve user community is highly diverse, with different levels of education, backgrounds, capabilities, and needs. The visualization module enables this diverse group to solve the analytics problem at hand. IDAC provides a visual library ranging from Scatter Plot Matrix to Andrews Curves.

III. INTELLIGENT USER INTERFACE

In this section, we explore features of the IDAC Intelligence Wizard. We have implemented intelligence and learning algorithms for automatic generation of data analytics chains [37, 42]. The Intelligence Wizard system provides a separate user interface for the naïve user and guides him to perform further analysis. During initial development, use case scenarios helped us to organize the information and determine the interaction sequence. We have conducted contextual inquiries to understand users' thinking and develop a user centric architecture. Our user centric design is based on a three-tier model [45]. Fig. 6 shows the Channel Layer, Interaction Layer and Semantic Layer of the three-tier architecture.

Figure 6. Three Tier Design Model

Figure 7. Interface to select Data Analysis Operation (options shown include Association, Cross Validation, Missing values, Learner Create Model, and Learner Apply Model)
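Section D above notes that the Apriori operator is tuned by a minimal support and a minimum confidence value. These two quantities can be illustrated directly. The sketch below hand-rolls the support/confidence computation for a candidate rule over a few toy weather transactions; it is not the Weka Apriori implementation that IDAC wraps, and the transactions are invented for illustration.

```python
# Support and confidence for a candidate association rule A -> B,
# the quantities used when pruning rules by minimal support and
# minimum confidence. Illustrative sketch; not the Weka Apriori operator.

def support(transactions, itemset):
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Toy weather transactions (attribute values flattened into items)
weather = [
    {"sunny", "hot", "no"},
    {"overcast", "mild", "yes"},
    {"rainy", "mild", "yes"},
    {"rainy", "cool", "yes"},
]

print(support(weather, {"mild"}))              # 0.5
print(confidence(weather, {"mild"}, {"yes"}))  # 1.0
```

A rule such as mild -> yes would survive pruning here because its support (0.5) and confidence (1.0) clear typical thresholds; Apriori enumerates candidate itemsets level by level and discards those below the minimal support before rules are formed.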
Figure 8. Automatic Chain Generation: Data Analytics and Visualization Operators (example source: Sonar data file; the automatically generated operators include a random number generator that changes values of the given data set)

We have used the Semantic Layer to define the contents of the operator library. The Interaction Layer is responsible for defining the interaction sequences and determining the user experience. The Channel Layer is responsible for the actual presentation, which provides the user with a consistent experience, independent of the access method. For example, all interactions offered through a screen follow the same logical steps and offer the same sequence of options. The structure of the data analytics library supports this consistent experience. When the user selects operations, the system compares them with standard rules; the algorithm returns Boolean results for chain validation. If the chain is valid, IDAC continues the data analysis operation. For invalid chains, IDAC finds the wrong operator and presents multiple suggestions to the user. Initially, the IDAC Intelligence Wizard screen allows the user to select the file format. Depending on the selected dataset, the system presents the next interface to the user. Fig. 7 shows the screen for several data analysis operations. The Intelligence Wizard asks the user what kind of data analysis he wants to perform. After considering the user's choices, the system automatically generates the chain of data analytics operators (see Fig. 8). From the user's inputs, IDAC generates and validates the operating chain. This approach understands human responses and significantly reduces the usability barriers faced by naïve users.

IV. USABILITY EVALUATION AND RESULTS

In this section, we describe the software testing approaches used to measure the performance and usability of IDAC. During the usability evaluation phase, we focused on both automated and manual software testing approaches. IDAC automated testing is performed using the WinRunner 7.0i software. WinRunner automates the testing process by storing a script of the actions that take place. We used Test Director to perform manual testing. The incremental model helped us to achieve the desired results and performance. Initially we conducted usability research to understand the proportion of usability problems in Weka, Sumatra TT and Yale. Fig. 9 shows the results of Nielsen's heuristic analysis. We have analyzed usability heuristics including match between system and real world standards, user control and freedom, help and documentation, consistency and standards, flexibility and efficiency of use, and aesthetic and minimalistic design. Results show that most of the systems fail to achieve consistency and standards, and user control and freedom. Sumatra TT fails to provide help and documentation.
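The Boolean chain validation described above (compare the selected operators against standard ordering rules and flag the offending operator) can be sketched as follows. The operator names and the allowed-successor rule set here are invented for illustration; the paper does not publish IDAC's actual rule base.

```python
# Sketch of rule-based validation of an analytics execution chain.
# The allowed-successor rules below are hypothetical examples, not
# IDAC's actual rule base.

ALLOWED_NEXT = {
    "loader":              {"cross_validation", "attribute_selection"},
    "cross_validation":    {"attribute_selection", "classifier"},
    "attribute_selection": {"classifier"},
    "classifier":          {"evaluator", "text_viewer"},
    "evaluator":           {"text_viewer"},
}

def validate_chain(chain):
    """Return (True, None) if every step is allowed to follow the
    previous one, else (False, offending_operator)."""
    for current, nxt in zip(chain, chain[1:]):
        if nxt not in ALLOWED_NEXT.get(current, set()):
            return False, nxt
    return True, None

ok, bad = validate_chain(["loader", "cross_validation",
                          "classifier", "evaluator"])
print(ok)                                        # True
print(validate_chain(["loader", "evaluator"]))   # (False, 'evaluator')
```

Returning the first offending operator is what lets a wizard of this kind present targeted suggestions instead of a bare failure, which matches the behavior described for invalid chains above.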

Figure 9. Usability evaluation results of Weka, Sumatra TT and Yale using heuristic analysis (errors plotted against Nielsen's usability heuristics: match between system and real world standards; user control and freedom; visibility of system status; help and documentation; consistency and standards; flexibility and efficiency of use; aesthetic and minimalistic design)
Figure 10. Usability evaluation results of IDAC, Weka, Sumatra TT and Yale using heuristic analysis (errors plotted against the same Nielsen usability heuristics as Fig. 9)

We overcame the drawbacks of the traditional software by focusing on the following criteria:

• Effectiveness: Helps to achieve the desired accuracy and completeness with which users can achieve their goals.

• Efficiency: Reduces the computation time and resources spent in achieving the desired goals.

• Satisfaction: Reduces user discomfort and increases pleasant interaction.

We have adopted standard approaches in the testing phase of iterative design. Unit testing, regression testing, integration testing, alpha testing, and beta testing methods were used to test the IDAC components. As each new operator was added to the library, we performed regression testing to see whether it had an adverse effect on operators created previously. New visual operators were integrated without any side effects [18]. The alpha testing was conducted in the presence of end users. An initial contextual enquiry helped to analyze the software usability. After getting feedback from the beta testing, further enhancements to the help and documentation standards were made [18]. The performance of IDAC was compared to other existing software. Results show that the intelligence wizard creates accurate analytics chains in about 93% of the cases. Fig. 10 shows the improved results and efficiency of IDAC as compared to the current software systems.

V. FUTURE RESEARCH

In this paper, we have demonstrated a solution to integrate data analytics, visualization and intelligence. We emphasized an open, cooperative and multidisciplinary approach to increase the usability of the system. Developing a personalized system for a wide variety of users and incorporating effective use of information visualization is one of the major challenges for future research. Solutions to these challenges are also rooted in an understanding of visual perception. Understanding the user's needs and selecting a suitable visual display is the biggest challenge. The immediate enhancement to data visualization would be the separation and representation of string attributes by taking frequency counts of their occurrences or by using other effective measures. Our current research focus is to solve multidisciplinary business problems using a similar approach. A major work item is the development of a Privacy Protection Architecture. Today, organizations are trying to incorporate customer-driven innovations; the aim is to provide better products and personalized services. This involves a learning process in which organizations continuously capture data and learn from user behavior. Unfortunately, neither organizations nor users are equipped to decide, in an efficient and optimal way, what information they should protect and what they should reveal. This situation is creating challenging problems regarding customer data security, privacy and trust. We are researching how independent IDAC agents can be effectively integrated with machine learning methods to classify and protect sensitive data. Use of the visualization module to define uncertainty, ambiguity, and behavioral biases in the privacy protection mechanism will be the major milestone of our research.

VI. CONCLUSION

In this paper, we presented the Intelligent Data Analytics Console, a user centric platform that effectively integrates data analytics, visualization and intelligence. The proposed architecture demonstrates an open, cooperative and multidisciplinary approach to developing data analytics software that acts as an assistant to users rather than a tool. Using the Visual Evidence technique, we illuminate the demanding approach of providing recommendations based on the data and results obtained at a particular instance. As shown through a series of best practices and usability evaluations, IDAC substantially reduces usability barriers. With our approach, researchers and enterprises can generate data analytics components that can be used readily in software applications.
ACKNOWLEDGMENT

We would like to thank TRDDC scientists Prof. Harrick Vin, Mr. Subrojyoti Roy Chaudhury, Dr. Savita Angadi and Mr. Niranjan Pedanekar for their constant support and guidance. We appreciate the contribution of the WEKA developers to the open source community; their pioneering research motivated and guided us. This research is a part of the Systems Research Lab, TATA Research Development and Design Center (TRDDC), TATA Consultancy Services.

REFERENCES

[1] G. Eason, B. Noble, and I. N. Sneddon, “On certain integrals of Lipschitz-Hankel type involving products of Bessel functions,” Phil. Trans. Roy. Soc. London, vol. A247, pp. 529–551, April 1955.
[2] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington, D.C., 1993.
[3] David W. Aha. Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. Int. J. Man-Mach. Stud., 36(2):267-287, 1992.
[4] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. (1-5):11-73, 1997.
[5] Petr Aubrecht, Filip Zelezny, Petr Miksovsky, and Olga Stepankova. Sumatra TT: Towards a universal data preprocessor.
[6] Remco R. Bouckaert, Eibe Frank, Mark Hall, Richard Kirkby, Peter Reutemann, Alex Seewald, and David Scuse. Weka manual for version 3-4, 2007.
[7] Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167, 1998.
[8] Tom Chau and Andrew K. C. Wong. Pattern discovery by residual analysis and recursive partitioning. IEEE Transactions on Knowledge and Data Engineering, 11(6):833-852, 1999.
[9] Kumar Chellapilla and Patrice Y. Simard. Using machine learning to break visual human interaction proofs (HIPs). In NIPS, 2004.
[10] W. S. Cleveland. Visualizing Data. Hobart Press, Summit, New Jersey, U.S.A., 1993.
[11] Bill Curtis, Herb Krasner, and Neil Iscoe. A field study of the software design process for large systems. Commun. ACM, 31(11):1268-1287, 1988.
[12] Gilbert. JFreeChart. http://www.jfree.org/jfreechart/, 2006.
[13] Frank Dellaert. The expectation maximization algorithm. Technical Report GIT-GVU-02-20, February 2002.
[14] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. of the 13th IJCAI, pages 1022-1027, Chambery, France, 1993.
[15] Stephen Few. Show Me the Numbers: Designing Tables and Graphs to Enlighten. Analytics Press, September 2004.
[16] Simon Fischer, Ingo Mierswa, Ralf Klinkenberg, and Oliver Ritthoff. Developer tutorial: Yale - Yet Another Learning Environment, 2006.
[17] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148-156. Morgan Kaufmann, 1996.
[18] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28, 2000.
[19] Snehal Gaikwad, Mihir Kedia, and Aashish Jog. Data analytics and visualization. Technical report, TATA Research Development and Design Center - TCS, 2007.
[20] Mark A. Hall and Lloyd A. Smith. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper, 1999.
[21] Mauricio A. Hernández and Salvatore Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2:9-37, 1998.
[22] Robert C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, pages 63-91, 1993.
[23] Clare-Marie Karat. Cost-benefit analysis of iterative usability testing. In INTERACT '90: Proceedings of the IFIP TC13 Third International Conference on Human-Computer Interaction, pages 351-356, Amsterdam, The Netherlands, 1990. North-Holland Publishing Co.
[24] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.
[25] Kenji Kira and Larry A. Rendell. A practical approach to feature selection. In ML92: Proceedings of the Ninth International Workshop on Machine Learning, pages 249-256, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.
[26] Michel Liquiere and Jean Sallantin. Structural machine learning with Galois lattice and graphs. In Proc. of the 1998 Int. Conf. on Machine Learning (ICML'98), pages 305-313. Morgan Kaufmann, 1998.
[27] Luís Torgo and João Gama. Regression using classification algorithms. Intelligent Data Analysis, 1:275-292, 1997.
[28] J. I. Maletic, A. Marcus, and M. L. Collard. A task oriented view of software visualization. Pages 32-40, 2002.
[29] A. I. McLeod and S. B. Provost. Multivariate data visualization, January 2001.
[30] Ingo Mierswa, Ralf Klinkenberg, Simon Fischer, and Oliver Ritthoff. A flexible platform for knowledge discovery experiments: Yale - Yet Another Learning Environment. Univ. of Dortmund, 2003.
[31] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[32] D. J. Murdoch and E. D. Chow. A graphical display of large correlation matrices. The American Statistician, 50(2):173-178, 1996.
[33] Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining, 1994.
[34] D. L. Parnas and P. C. Clements. A rational design process: How and why to fake it. IEEE Trans. Softw. Eng., 12(2):251-257, February 1986.
[35] Roger S. Pressman. Software Engineering: A Practitioner's Approach. McGraw-Hill Higher Education, 2000.
[36] Matthias Rauterberg. An iterative-cyclic software process model, 1992.
[37] R. Roberts. AI32 - Guide to Weka, March 2005. http://www.comp.leeds.ac.uk/andyr
[38] Ross J. Quinlan. C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). Morgan Kaufmann, January 1993.
[39] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23, 2000.
[40] Kathy Walrath, Mary Campione, Alison Huml, and Sharon Zakhour. The JFC Swing Tutorial: A Guide to Constructing GUIs, Second Edition. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 2004.
[41] Edward J. Wegman. Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association, 85(411):664-675, 1990.
[42] S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
[43] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, October 1999.
[44] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann, June 2005.
[45] Norman Sadeh. Class notes, Mobile and Pervasive Computing.

ADCOM 2009
COMPUTATIONAL BIOLOGY

Session Papers:

1. D. Narayan Dutt, “Digital Processing of Biomedical Signals with Applications to Medicine” – INVITED PAPER
2. C. Das, P. Maji, S. Chattopadhyay, “Supervised Gene Clustering for Extraction of Discriminative Features from Microarray Data”
3. S. Das, S. M. Idicula, “Modified Greedy Search Algorithm for Biclustering Gene Expression Data”
Digital Processing of Biomedical Signals with
Applications to Medicine
D. Narayana Dutt
Department of Electrical Communication Engineering
Indian Institute of Science, Bangalore-560012, India
dndutt@ece.iisc.ernet.in

Abstract— We consider here digital processing of Electroencephalogram (EEG) and Heart Rate Variability (HRV) signals with applications to psychiatry and cardiac care. First we consider the application to EEG. The conventional analysis of EEG signals cannot distinguish between schizophrenia and a normal healthy individual. In this work, a graph theoretic approach is used to derive various parameters from the EEG. It is found that schizophrenia patients can not only be differentiated from healthy subjects but can also be grouped into various subgroups of schizophrenia. The use of SVM based automation gives us a clear way to classify an individual into his particular group or subgroup for further classification with an accuracy of more than 90%. Next we present a neural network based approach to the classification of supine vs. standing postures and normal vs. abnormal cases using heart rate variability (HRV) data. We have chosen ten features for the network inputs. Four classification algorithms have been compared.

Index terms— EEG, Graph theory, Schizophrenia, Connectivity, Support vector machine (SVM), HRV, Neural network

I. INTRODUCTION

The brain consists of billions of neurons, the basic data processing units, which are closely interconnected via axons and dendrites, forming a large network. A complex system like the brain can be described as a complex network of decision making nodes. In the brain, the network is composed of these neuronal units linked by synaptic connections. The activity of these neuronal units gives rise to dynamic states that are characterized by specific patterns of neuronal activation and co-activation. The characteristic patterns of temporal correlations emerge as a result of functional interactions within a structured neuronal network.

The brain is inherently a dynamic system in which the correlations between regions during behavior, or even at rest, are created and reshaped continuously. Many experimental studies suggest that perceptual and cognitive states are associated with specific patterns of functional connectivity. These connections are generated within and between large populations of neurons in the cerebral cortex. An important goal in computational neuroscience is to understand these spatiotemporal patterns of complex functional networks of correlated dynamics of the brain activity measured in terms of EEG recordings.

Similarities between EEG time series are commonly quantified using linear techniques, in particular estimates of temporal correlation in the time domain or coherence in the frequency domain. Temporal correlations are most often used to represent and quantify patterns in neuronal networks and are represented as a correlation matrix. Functional connectivity refers to the patterns of temporal correlations that exist between distinct neuronal units; such temporal correlations are often the result of neuronal interactions along anatomical or structural connections. Thus, the correlation matrix may be viewed as a representation of the functional connectivity of the brain system.

Schizophrenia is a worldwide prevalent disorder with a multi-factorial but highly genetic aetiology; it is a multiple gene disorder. Schizophrenia is a complex and widespread disorder, giving rise to a great burden of suffering and impairment to both patients and their families. By looking at the EEG it is very difficult to ascertain and diagnose whether an individual is suffering from the disease. It is all the more difficult to find to which type of group a subject belongs without proper clinical analysis by a physician [1]. In this work the focus is on the use of graph theoretical methods and statistical techniques to identify individuals suffering from schizophrenia and also classify them into the various groups of schizophrenia.

The Electrocardiogram describes the electrical activity of the heart. The rhythmic behavior of the heart is studied by analyzing heart rate time series. Heart Rate Variability (HRV) has been accepted as a quantitative marker to study the regulatory mechanisms of the autonomic nervous system. Understanding the physiology of heart rate dynamics is very important in treating cardiac diseases that are chronic as well as life threatening, and allows clinicians to devise improved treatment methodologies. The importance of HRV was first discovered when unborn infants were attached to cardiac sensors in utero. The use of HRV in the study of cardiac disorders is very popular. HRV analysis has gained importance since many studies [2] have shown low HRV to be a strong indicator of cardiac mortality and sudden cardiac death. Several studies suggest that patients with anxiety and depression are at a higher risk of significant cardiovascular mortality and sudden death [3]. In view of these works, we have considered the application of neural networks to the classification of supine vs. standing postures and normal vs. abnormal cases using heart rate variability data. Four classification algorithms have been compared, viz., k-nearest neighbors, Radial Basis Function (RBF) networks, Support Vector Machines (SVM) and back propagation networks, for different sets of features.

II. MATERIALS AND METHODS

A. Data acquisition

The EEG signals were recorded from 53 subjects (25 control and 28 schizophrenic subjects; 21 were females and 32 were males) using 32 scalp electrodes. The Neuroscan machine was used for the recording. The signal data were analog band pass filtered with cutoff frequencies 0.5 to 70 Hz and then sampled at the rate of 256 Hz with a resolution of 12 bits. All electrodes were referenced to A2. The subjects were instructed to close their eyes and be restful for some time before the collection of data was carried out. Subjects chosen for this analysis were old established cases of schizophrenia. It was ensured that the subjects were not under the influence of any medication during the process of data collection. The 28 channels were placed according to the 10-20 international standard of electrode placement. The data were band
The data were band pass filtered between 0.5 and 38 Hz, which included the lower gamma frequency range and eliminated 50 Hz line noise.

B. Formation of network graphs

The network connectivity can be studied using graph-theoretic methods. A graph is represented by nodes (vertices) and connections (edges) between the nodes. Vertices are drawn as dots, and a line is drawn between two dots if the corresponding vertices are connected. The complexity of a graph can be characterized by measures such as the cluster coefficient and the characteristic path length. The cluster coefficient is a measure of the local interconnectedness of the graph, and the characteristic path length is an indicator of its overall connectedness. More literature on graph theory in relation to brain connectivity may be found elsewhere [1], [2].

The temporal correlation between any two EEG time series x1(t) and x2(t) at two scalp electrodes ranges between 0 and 1. To allow mathematical analysis, we represent the neuronal network activity patterns as graphs as follows. In this study, the vertices of the graph represent the positions of the electrodes on the scalp. Let the number of electrodes used for EEG recording be N; hence there are N vertices.

The correlations between all pair-wise combinations of EEG channels are computed to form a square matrix R of size N, where N is the total number of EEG channels. Each entry rij in the matrix R, 1 <= i, j <= N, is the correlation value for channels i and j, with 0 <= |rij| <= 1 and |rij| = 1 for i = j, where |.| denotes absolute value. The correlation matrix R is converted into a binary matrix by applying a threshold; the resulting binary matrix is called the adjacency matrix. A network graph is constructed from this matrix with N vertices and an edge between two nodes if the correlation between them exceeds the threshold. Thus, if the correlation rij between a pair of channels i and j exceeds the threshold value, an undirected edge is said to exist between the vertices i and j. All edges are given the same cost, equal to unity.

C. Measures of a graph

Graph theory is one of the latest tools being used for the analysis of brain connectivity. The following are the graph measures used in this analysis, for which a brief description is essential.

1) Average connection density: The average connection density kden of an adjacency matrix A is the number of all its nonzero entries divided by the maximal possible number of connections. Thus, 0 <= kden <= 1. The sparser the graph, the lower is its connection density [4], [5].

2) Complexity: Complexity captures the extent to which a system is both functionally segregated and functionally integrated (large subsets tend to behave coherently). The statistical measure of neural complexity CN(X) takes into account the full spectrum of subsets. CN(X) can be derived either from the ensemble average of the mutual information between subset sizes or, equivalently, from the ensemble average of the mutual information between subsets of a given size (ranging from 1 to n/2) and their complement [6], [7], [8].

3) Characteristic path length: Within a digraph, a path is defined as any ordered sequence of distinct vertices and edges that links a source vertex j to a target vertex i. The distance matrix Dij describes the distance from vertex j to vertex i, that is, the length of the shortest valid path linking them. The average of all entries of Dij has been called the "characteristic path length" [8], [9], [10], denoted lpath.

4) Cluster index: The clustering coefficient is a measure of the local interconnectedness of the graph, whereas the path length is an indicator of its overall connectedness. The clustering coefficient Ci for a vertex vi is the proportion of links between the vertices within its neighborhood divided by the number of links that could possibly exist between them. For an undirected graph, the edge eij between two nodes i and j is considered identical to eji. Therefore, if a vertex vi has ki neighbors, ki(ki-1)/2 edges could exist among the vertices within the neighborhood. Using this information, the clustering coefficient for undirected graphs can be calculated. The clustering coefficient for the whole graph is given by Watts and Strogatz as the average of the clustering coefficients of all vertices.

D. Neural network methods

The neural network methods implemented in our work are explained below.

1) Multilayer Perceptrons: A back propagation network or multilayer perceptron (MLP) consists of at least three layers of units: an input layer, at least one intermediate hidden layer, and an output layer. The units are connected in a feed-forward fashion. With back propagation networks, learning occurs during a training phase. After a back propagation network has learned the correct classification for a set of inputs, it can be tested on a second set of inputs to see how well it classifies untrained patterns [11].

2) Radial-Basis Function Networks: Radial basis function networks are also feed-forward, but have only one hidden layer. RBF hidden layer units have a receptive field, which has a centre, that is, a particular input value at which they have a maximal output. Their output tails off as the input moves away from this point. Generally, the hidden unit function is a Gaussian [11].

3) Support Vector Machines: The support vector machine is a popular technique for classification. Given a training set of instance-label pairs (xi, yi), i = 1, ..., l, where xi ∈ Rn and yi ∈ {1, -1}, the SVM requires the solution of an optimization problem. The training vectors xi are mapped into a higher dimensional space by a function φ, and the SVM then finds a linear separating hyperplane with the maximal margin in this higher dimensional space. K(xi, xj) = φ(xi)Tφ(xj) is called the kernel function [11].

4) K-Nearest Neighbour Classifier: Among statistical approaches, the k-nearest neighbour classifier (KNNC) was selected because it does not assume any underlying distribution of the data. In the k-nearest neighbour rule, a test sample is assigned the class most frequently represented among the k nearest training samples [12].

E. Features Considered

This section describes the features considered for classifying HRV data.
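The graph construction and measures described above can be made concrete with a short NumPy sketch. This is a minimal illustration under our own assumptions (the threshold value and the toy 4-channel correlation matrix are arbitrary, not the ones used in the study); the clustering coefficient and path length follow the Watts-Strogatz definitions given in the text.

```python
import numpy as np
from collections import deque

def adjacency(R, threshold):
    """Edge i-j iff |r_ij| exceeds the threshold (no self-loops)."""
    A = (np.abs(R) > threshold).astype(int)
    np.fill_diagonal(A, 0)
    return A

def connection_density(A):
    """k_den: nonzero entries over the N(N-1) possible off-diagonal entries."""
    n = A.shape[0]
    return A.sum() / (n * (n - 1))

def clustering_coefficient(A):
    """Watts-Strogatz clustering coefficient, averaged over all vertices."""
    cs = []
    for i in range(A.shape[0]):
        nbrs = np.flatnonzero(A[i])
        k = len(nbrs)
        if k < 2:
            cs.append(0.0)
            continue
        links = A[np.ix_(nbrs, nbrs)].sum() / 2  # edges among the neighbours
        cs.append(links / (k * (k - 1) / 2))
    return float(np.mean(cs))

def characteristic_path_length(A):
    """Mean shortest-path length (unit edge cost) over connected ordered pairs."""
    n = A.shape[0]
    total = pairs = 0
    for s in range(n):
        dist = [-1] * n
        dist[s] = 0
        q = deque([s])
        while q:  # breadth-first search from s
            u = q.popleft()
            for v in np.flatnonzero(A[u]):
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(d for d in dist if d > 0)
        pairs += sum(1 for d in dist if d > 0)
    return total / pairs if pairs else float("inf")

# Toy correlation matrix for 4 "channels"; thresholding at 0.5
# leaves a chain 0-1-2-3.
R = np.array([[1.0, 0.8, 0.2, 0.1],
              [0.8, 1.0, 0.7, 0.2],
              [0.2, 0.7, 1.0, 0.9],
              [0.1, 0.2, 0.9, 1.0]])
A = adjacency(R, threshold=0.5)
```

For the chain graph this yields a connection density of 0.5, a clustering coefficient of 0 (no triangles), and a characteristic path length of 5/3.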

1) Fractal Dimension (FD): Katz's approach is used to calculate the FD of a waveform [13].

2) Complexity measure: The Lempel-Ziv complexity measure C(n) [14] is used in our work, since it is extremely well suited for characterizing the development of spatio-temporal activity patterns in high-dimensional nonlinear systems.

3) Time-domain features of HRV: We assume that only a finite number of intervals are available. For 4-10 min segments, wide-sense stationarity may be assumed. The standard deviation of the RR intervals (SDRR) is defined as the square root of the variance of the RR intervals:

    SDRR = sqrt( E[RRn^2] - RRmean^2 )                (1)

The standard deviation of the successive differences of the RR intervals (SDSD) is defined as the square root of the variance of the sequence ∆RRn = RRn - RRn+1 (the ∆RR intervals):

    SDSD = sqrt( E[∆RRn^2] - ∆RRmean^2 )              (2)

4) Non-linear features: The two non-linear features obtained from the Poincare plot [15] are given below.

SD1: The SD1 measure of Poincare width is equivalent to the standard deviation of the successive differences of the intervals, except that it is scaled by 1/√2.

SD2: The SD2 measure of the length of the Poincare cloud is related to the autocovariance function.

    SD1^2 + SD2^2 = 2 SDRR^2                          (3)
    SD2^2 = 2 SDRR^2 - (1/2) SDSD^2                   (4)

It can be argued that SD2 reflects the long-term HRV. The width of the Poincare plot correlates extremely highly with other measures of short-term HRV.

5) Frequency-domain features: The frequency-domain features considered are the powers in the ultra-low frequency, very low frequency, low frequency and high frequency ranges.

III. RESULTS AND DISCUSSION

A. Results for EEG Data

The collection of data was done for a period of three minutes on average. For every subject, connectivity matrices were found for every 2-second epoch over the entire three-minute duration, and the parameters were calculated for each such epoch for all the graphs generated. Once the calculation of the parameters for all the subjects was completed, the mean of each parameter was taken for each subject. In this study, the statistical means of the parameters were used to come to the conclusions. For finding the parameters in the different bands of the EEG signals, the EEG was filtered as per the frequency band being investigated. An initial glance at the parameters does not reveal much information for classification, but on detailed analysis and study of the various plots we can see that the subjects occupy different Euclidean spaces. This property can thus be used for the classification of subjects.

The work involved a detailed study of the entire band of the EEG signals (i.e., 0.5 to 38 Hz) and also of the different bands of the EEG, comprising Delta, Alpha-1, Alpha-2, Theta and Beta. In all six cases the parameters were extracted and tabulated, their statistical means generated, and the values listed for each subject. The results were plotted in 2D and 3D Euclidean space, and these plots were examined for use in classification.

The study was successful not only in the identification of subjects suffering from schizophrenia but also in classifying them into various groups within schizophrenia. To further develop the work into a working model, a machine learning approach was used to see whether the parameters could be used to classify automatically. In the present work a support vector machine was used successfully to identify schizophrenia patients and also to classify them with an accuracy of more than 90%.

The algorithm proposed for the successful identification of schizophrenia first isolates the normal healthy subjects, so that the subjects are tested for positivity of schizophrenia. Once a person tests positive, it is required to find to which subgroup the person belongs. The algorithm proposed analyses these steps and proposes a way to detect the subgroups as well.

1) Detection of non-schizophrenic subjects: During the analysis of the full band and the various sub-bands of the EEG, it was found that the 3D plots of complexity, cluster index and characteristic path length could easily be used to separate the normal healthy subjects from the schizophrenic patients, as shown in figure 1(a). The alpha 1 and alpha 2 band plots are shown in figures 1(b) and 1(c) respectively. It can be seen that normal healthy subjects can easily be distinguished from schizophrenic subjects.

2) Identification of subgroups in schizophrenic subjects: After identification of a schizophrenic patient, classifying the patient into a particular subgroup is a very important and crucial task. For the identification of mixed schizophrenic subjects, it was observed that by simply plotting connection density against complexity and characteristic path length, and cluster index against characteristic path length, we could identify the mixed schizophrenic subjects without much difficulty. This can be seen from the plots shown in figure 2.

The plots show complexity Vs connection density, characteristic path length Vs connection density, and characteristic path length Vs cluster index for the full EEG band and for the alpha1, alpha2 and theta bands respectively. Similar observations can also be seen in the Delta band.
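Returning to the HRV features defined in equations (1)-(4) above, they can be computed as in the sketch below. This is a minimal NumPy illustration on a synthetic RR series (the function name and test data are ours, not from the study); the identity SD1^2 + SD2^2 = 2·SDRR^2 of equation (3) follows directly from the definitions.

```python
import numpy as np

def hrv_features(rr):
    """SDRR, SDSD and the Poincare measures SD1/SD2 from an RR series (ms)."""
    rr = np.asarray(rr, dtype=float)
    drr = rr[:-1] - rr[1:]                         # successive differences
    sdrr = rr.std()                                # eq. (1)
    sdsd = drr.std()                               # eq. (2)
    sd1 = sdsd / np.sqrt(2.0)                      # Poincare width, scaled by 1/sqrt(2)
    sd2 = np.sqrt(2.0 * sdrr**2 - 0.5 * sdsd**2)   # eq. (4)
    return sdrr, sdsd, sd1, sd2

# Synthetic RR intervals around 800 ms as a stand-in for a short record.
rng = np.random.default_rng(1)
rr = 800.0 + 50.0 * rng.standard_normal(300)
sdrr, sdsd, sd1, sd2 = hrv_features(rr)
```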

The plots clearly show that the mixed schizophrenic subjects can be clearly isolated from the other two subgroups. Though the figures are shown for four bands, even in the delta band the mixed schizophrenia subjects show distinct characteristics which can easily be identified from the rest of the two subgroups.

Identification of the positive and negative schizophrenic subjects: After the segregation of the mixed schizophrenic subjects, two important subgroups remain to be detected. For this we plot the 2D and 3D plots shown in figure 3. In these plots we can see that the positive and negative schizophrenic subjects form distinct clusters which can easily be identified. Thus, finally, the classification of schizophrenic patients is complete.

3) Proposed SVM based classification using multiband results: From the above plots it becomes amply clear that schizophrenia subjects can be classified using some type of classifier in higher dimensions. Thus, an SVM based classifier can be chosen that gives a classification accuracy of more than 90%. By using multiband classifiers with multiple kernels in tandem we can further reduce the errors in detection and classification. By testing various kernels it can be seen that, when multiple kernels are used, the accuracy approaches almost 100%. Thus the algorithm shown in Figure 4 is proposed for the identification and classification of schizophrenia subjects. For the algorithm to function effectively, a database of established schizophrenic patients is required. The EEG of subjects from the three subgroups of schizophrenia can be taken and analyzed as explained in the previous sections. The SVM modules then run based on the parameters. We have calculated both the errors committed while predicting and the location of those errors in the data for the various bands. We have observed that the errors can be further reduced using multiple kernels, thus increasing the efficiency. Once a patient is classified, the result can be added to the initial database used in the SVM blocks. Prediction of more than 95% could be achieved using different kernels.

The work presented here applies the graph theory approach to identify schizophrenia patients from normal healthy individuals and also to classify them into their subgroups. It is well known that schizophrenia patients are difficult to identify from the EEG itself. At present there is no way to identify the various subgroups of schizophrenia except through counseling, which consumes physicians' valuable time. Schizophrenia is not a problem of any single portion of the human brain but of the entire brain. Hence a technique that uses the entire multichannel EEG data at a time for analysis, like the graph theory approach, should be used. Thus, the approach followed in this work is much better suited for analyzing the EEG and possibly identifying the patients suffering from schizophrenia.

During the study it became clear that there is a distinct difference between the connectivity of patients suffering from schizophrenia and that of normal healthy control individuals. The formation of clusters by the various groups and subgroups in 2D and 3D Euclidean space has been exploited by using SVM based classifiers. The study encourages us to investigate whether the graph theory parameters could further be used to identify other brain disorders which are otherwise difficult to diagnose using existing forms of diagnosis. This study gives us a relatively cheap, noninvasive tool to classify individuals early, who can later be clinically analyzed by a physician in detail. This is one of the first approaches for the successful identification and also classification of patients suffering from schizophrenia.

Fig. 1: Three Dimensional Plots for Complexity, Cluster Index and Characteristic path length for (a) Complete band, (b) Alpha1 band and (c) Alpha2 band.
Figure 2: Plots of the bands (a) Complete band, (b) Alpha1 band, (c) Alpha2 band and (d) Theta band. The plots are of (i) Complexity Vs Connection density, (ii) Connection density Vs Characteristic path length and (iii) Characteristic path length Vs Cluster Index.

Figure 3: The plots are for (a) Complete Band and (b) Alpha1 Band. The parameters plotted are (i) Cluster Index Vs Connection density, (ii) Characteristic Path length Vs Cluster index, (iii) Characteristic path length, Cluster index and Complexity, and (iv) Characteristic path length, Cluster index and Connection density.

Figure 4: Proposed algorithm.

B. Results for ECG Data

The ECG was recorded in lead II configuration from an HP 78173A ECG monitor. The signal was recorded onto a PC using a 12-bit ADC at a sampling frequency of 500 Hz. The HRV signal was extracted from these ECG recordings using Berger's algorithm (which comprises a peak detection algorithm to detect the R peaks, followed by interpolation of the interbeat intervals). The supine data were obtained after the subjects rested for 10 min, and the standing data were obtained 2 min after the subjects stood up. Controlled breathing corresponds to breathing at a specified rate (typically 12 breaths per minute), whereas in spontaneous breathing the subject is asked to breathe normally. The short HRV records are 256 sec long and are sampled at a frequency of 4 Hz. Four classification algorithms, viz. KNNC, MLP, RBF networks and SVMs, were tested for the classification. Tables 1 and 2 show the classification accuracy of the algorithms on the test sets for the classification of supine vs standing posture and normal vs abnormal cases. The classification accuracies are presented for different sets of features given to the algorithms.

1) Classification of supine and standing postures: Classification of the data into supine and standing postures is considered. The obtained accuracies for the back propagation network with 9 hidden units, RBF networks with 110 hidden units, and the support vector machine with RBF kernel are shown in Table 1. The highest classification accuracy (90.57%) is obtained for the SVM with RBF kernel.

2) Classification of normal and abnormal cases: Classification of the data into normal and abnormal cases is considered. The obtained accuracies for the back propagation network with 11 hidden units, RBF networks with 110 hidden units, and the support vector machine with RBF kernel are shown in Table 2. The highest classification accuracy (91.2%) is obtained for the SVM with RBF kernel.

Features Considered                                  KNNC    MLP     RBF     SVM
Mean, Variance                                       78.63   90.43   89.74   90.57
Fractal Dimension (FD), Complexity Measure (CM)      75.00   78.95   79.31   89.57
Mean, Variance, FD, CM                               78.45   90.43   89.44   90.57
Frequency Domain Features                            78.95   89.57   89.57   91.20
Mean, Variance, FD, CM & Frequency domain features   74.06   90.43   89.44   90.57

Table 1. Testing Accuracy Table for HRV Data (Supine and Standing postures)
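Of the four classifiers compared in the tables, the k-nearest-neighbour rule is simple enough to sketch directly. The snippet below is an illustrative NumPy implementation on synthetic two-dimensional points standing in for HRV feature vectors; the cluster locations and k = 3 are our own toy choices, not the settings of the actual experiments.

```python
import numpy as np

def knnc(train_x, train_y, test_x, k=3):
    """k-nearest-neighbour rule: each test sample gets the class most
    frequently represented among its k closest training samples
    (Euclidean distance)."""
    preds = []
    for x in np.atleast_2d(test_x):
        d = np.linalg.norm(train_x - x, axis=1)
        nearest = train_y[np.argsort(d)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

# Two well-separated synthetic "feature" clouds standing in for the
# feature vectors of two classes (e.g. supine vs standing).
rng = np.random.default_rng(2)
a = rng.normal(0.0, 0.3, size=(20, 2))
b = rng.normal(3.0, 0.3, size=(20, 2))
X = np.vstack([a, b])
y = np.array([0] * 20 + [1] * 20)
pred = knnc(X, y, np.array([[0.1, 0.0], [2.9, 3.1]]), k=3)
```

As noted in the observations below Table 2, this distance-based rule weights all inputs equally, which is one reason its accuracy trails the neural network methods on these data.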

Features Considered                                  KNNC    MLP     RBF     SVM
Mean, Variance                                       79.82   87.83   89.44   90.57
Fractal Dimension (FD), Complexity Measure (CM)      70.69   83.66   84.27   88.19
Mean, Variance, FD, CM                               79.82   87.83   89.44   90.57
Frequency Domain Features                            77.59   89.74   89.44   91.20
Mean, Variance, FD, CM & Frequency domain features   67.83   89.74   89.74   91.20

Table 2. Testing Accuracy Table for HRV Data (Normal and Abnormal cases)

From Tables 1 and 2, we can observe the following:

(i) The obtained accuracies for the back propagation network, the RBF networks and the support vector machine with RBF kernel are higher than for the k-nearest neighbour classifier. This is because, for the data used in this work, different inputs bring different amounts of information, whereas the k-nearest neighbour algorithm computes Euclidean distances between the vector to be classified and the other vectors. Therefore, this algorithm cannot take into consideration the fact that different inputs bring different amounts of information, which is a natural feature of neural networks.

(ii) The features SDRR, SDSD, SD1 and SD2 play an important role in the classification of normal vs abnormal cases.

(iii) The features fractal dimension and complexity measure do not seem to play a significant role in the classification problems.

(iv) Heart rate variability is reduced in individuals with anxiety and depressive disorders, and hence we are able to get an accuracy of around 90% by using either the features SDSD, SDRR, SD1 and SD2 or the frequency features alone.

(v) In the case of classification of supine and standing postures, the frequency features are significant. Spectral analysis of HR data reports a relative increase in low frequency power and a decrease in high frequency power from the supine to the standing posture. These changes are attributed to a predominance of sympathetic activity and vagal withdrawal in the standing posture.

The second part of this paper, on HRV, has presented a neural-network-based approach to classifying HRV data into normal and abnormal cases and into supine and standing postures. We are able to correctly classify 106 out of 116 cases corresponding to normal and abnormal subjects, and in the case of supine and standing subjects we are able to classify 105 out of 116 correctly. We compared the conventional KNNC method with three kinds of neural networks (MLP, RBF and SVM). The obtained accuracies for the back propagation network, the RBF network and the support vector machine with RBF kernel are higher than for the k-nearest neighbour classifier. Among the neural network methods, SVM gives better performance than the MLP and RBF networks, and the RBF network, in general, gives better performance than the MLP network. In the case of classification of supine and standing postures, the frequency features are significant, and heart rate variability is reduced in individuals with anxiety and depressive disorders. Improved classification can be achieved by taking a larger training set.

ACKNOWLEDGEMENTS

The author would like to thank Dr. John P. John, National Institute of Mental Health and Neuro Sciences (NIMHANS), Bangalore, for providing the necessary EEG data for this study. The author would also like to thank Maj. Kiran Kumar and Mr. Mutyalaraju for their help in the development of the programs.

REFERENCES
[1] K. Sim, T.H. Chua, Y.H. Chan, R. Mahendran and S.A. Chong, "Psychiatric comorbidity in first episode schizophrenia: a 2 year, longitudinal outcome study", J. Psychiatr. Res., vol. 40(7), pp. 656-663, 2006.
[2] M. Galinier, S. Boveda, A. Pathak, J. Fourcade and B. Dongay, "Intra-individual analysis of instantaneous heart rate variability", Crit. Care Med., vol. 28(12), pp. 3939-3940, 2000.
[3] D.L. Musselman, D.L. Evans and C.B. Nemeroff, "The relationship of depression to cardiovascular disease", Arch. Gen. Psychiatry, vol. 55, pp. 580-592, 1998.
[4] A.R. McIntosh, M.N. Rajah and N.J. Lobaugh, "Interactions of prefrontal cortex in relation to awareness in sensory learning", Science, vol. 284, pp. 1531-1533, 1999.
[5] G. Tononi, O. Sporns and G.M. Edelman, "A measure for brain complexity: Relating functional segregation and integration in the nervous system", Proc. Natl. Acad. Sci. USA, vol. 91, pp. 5033-5037, 1994.
[6] O. Sporns, G. Tononi and G.M. Edelman, "Theoretical neuroanatomy: Relating anatomical and functional connectivity in graphs and cortical connection matrices", Cerebral Cortex, vol. 10, pp. 127-141, 2000.
[7] O. Sporns, G. Tononi and G.M. Edelman, "Connectivity and complexity: The relationship between neuroanatomy and brain dynamics", Neural Networks, vol. 13, pp. 909-922, 2000.
[8] O. Sporns and G. Tononi, "Classes of network connectivity and dynamics", Complexity, vol. 7, pp. 28-38, 2002.
[9] D.J. Watts and S.H. Strogatz, "Collective dynamics of 'small-world' networks", Nature, vol. 393, pp. 440-442, 1998.
[10] D.J. Watts, Small Worlds, Princeton, NJ: Princeton University Press, 1999.
[11] S. Haykin, Artificial Neural Networks, 2nd ed., New York: Pearson Education, 2002.
[12] R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, 2nd ed., New York: John Wiley and Sons, 2001.
[13] M. Katz, "Fractals and the analysis of waveforms", Comput. Biol. Med., vol. 18, pp. 145-156, 1988.
[14] X.-S. Zhang, R.J. Roy and E.W. Jensen, "EEG complexity as a measure of depth of anaesthesia", IEEE Trans. Biomed. Eng., vol. 48, pp. 1424-1433, 2001.
[15] M. Brennan, M. Palaniswami and P. Kamen, "Do existing measures of Poincare plot geometry reflect nonlinear features of heart rate variability?", IEEE Trans. Biomed. Eng., vol. 48, no. 11, pp. 1342-1347, 2001.
Supervised Gene Clustering for Extraction of
Discriminative Features from Microarray Data

Chandra Das #1, Pradipta Maji ∗2, Samiran Chattopadhyay $3

# Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, 700 152, India
1 chandradas@hotmail.com
∗ Machine Intelligence Unit, Indian Statistical Institute, Kolkata, 700 108, India
2 pmaji@isical.ac.in
$ Department of Information Technology, Jadavpur University, Kolkata, 700 092, India
3 samiranc@it.jusl.ac.in
Abstract— Among the large number of genes present in microarray data, only a small fraction of them is effective for performing a certain diagnostic test. However, it is very difficult to identify these genes for disease diagnosis. A clustering method is able to group a set of genes based on their interdependence and allows a small number of genes within or across the groups to be selected for analysis that may contain useful information for sample classification. In this regard, a new supervised gene clustering algorithm is proposed to cluster genes from microarray data. The proposed method directly incorporates the information of the response variables in the grouping process for finding such groups of genes, yielding a supervised clustering algorithm for genes. Significant genes are then selected from each cluster depending on the response variables. The average expression of the selected genes from each cluster acts as a representative of that particular cluster. Some significant representatives are then taken to form the reduced feature set that can be used to build classifiers with very high classification accuracy. To compute the interdependence among the genes as well as the gene-class relevance, mutual information is used, which is shown to be successful. The performance of the proposed method is described based on the predictive accuracy of the naive Bayes classifier, the K-nearest neighbor rule, and the support vector machine. The proposed method attains 100% predictive accuracy for all data sets. The effectiveness of the proposed method, along with a comparison with existing methods, is demonstrated on three microarray data sets.

I. INTRODUCTION

Recent advancement of microarray technologies has made the experimental study of gene expression data faster and more efficient. Microarray techniques, such as the DNA chip and the high density oligonucleotide chip, are powerful biotechnologies because they are able to record the expression levels of thousands of genes simultaneously. The vast amount of gene expression data leads to statistical and analytical challenges, including the classification of the dataset into correct classes. So, an important application of gene expression data in functional genomics is to classify samples according to their gene expression profiles, for example to classify cancer versus normal samples or to classify different types of cancer [1], [2].

Both supervised and unsupervised classifiers have been used to build classification models from microarray data. This study addresses the supervised classification task, where data samples belong to a known class. So, the outcomes are the sample classes and the input features are measurements of genes for gene array-based sample classification. However, the major problem of microarray gene expression data-based sample classification is the huge number of genes compared to the limited number of samples. Most classification algorithms suffer from such a high dimensional input space. Furthermore, most of the genes in arrays are irrelevant to sample distinction. These genes may also introduce noise and decrease prediction accuracy. In addition, a biomedical concern for researchers is to identify the key "marker genes" which discriminate samples for class diagnoses. Therefore, gene selection is crucial for sample classification in medical diagnostics, as well as for understanding how the genome as a whole works.

One way to identify these marker genes is to use clustering methods [3], which partition the given genes into distinct subgroups, so that the genes (features) within a cluster are similar while those in different clusters are dissimilar. A small number of top-ranked genes from each cluster are then selected or extracted based on some evaluation criterion to constitute the resulting reduced subset. As a first approach, unsupervised clustering techniques have been widely applied to find groups of co-regulated genes on microarray data. The most prevalent approaches of unsupervised clustering include (i) hierarchical clustering [4], (ii) K-means clustering [5], and (iii) clustering through Self-Organizing Maps [6], etc. However, these algorithms usually fail to reveal functional groups of genes that are of special interest in sample classification. This is because genes are clustered by similarity only, without using any information about the experiment's response variable.

This problem is solved by supervised clustering. Supervised clustering is defined as the grouping of variables (genes), controlled by information about the class variables, for example the tumor types of the tissues. Previous work in this field encompasses tree harvesting [7], a two step method which consists first of generating numerous candidate groups by unsupervised hierarchical clustering. Then, the average expression profile of each cluster is considered as a potential
input variable for a response model, and the few gene groups that contain the most powerful information for sample discrimination are identified. In this work, only the second step makes the clustering supervised, as the selection process relies on external information about the class variable types. But it also fails to reveal functional groups of genes of special interest in class prediction, because the genes are clustered by unsupervised information only.

A supervised clustering approach that directly incorporates the response variables or class variables in the grouping process is the partial least squares (PLS) procedure [8], [9]. PLS has the drawback that the fitted components involve all (usually thousands of) genes, which makes them very difficult to interpret. Another new promising supervised method [10], like PLS, is a one step approach that directly incorporates the response variables into the grouping process. This supervised clustering algorithm is a combination of gene selection for cluster membership and the formation of a new predictor by possible sign flipping and averaging of the gene expressions within a cluster. The cluster membership is determined with a forward and backward searching technique that optimizes the Wilcoxon test based predictive score and margin criteria defined in [10], which both involve the supervised response variables from the data. However, as both the predictive score and the margin criteria depend on the actual gene expression values, they are very sensitive to noise or outliers in the data set.

In this paper, a new supervised gene clustering algorithm is proposed. It finds co-regulated clusters of genes whose collective expression is strongly associated with the sample categories or class labels. Mutual information is used here to measure the similarity between genes and the gene-class relevance. The proposed method uses this measure to reduce the redundancy among genes. It involves partitioning the original gene set into some distinct subsets or clusters so that the genes within a cluster are highly co-regulated with strong association to the response variables or sample categories, while those in different clusters are as dissimilar as possible. A single gene from each cluster having the highest gene-class relevance value is first selected as the initial representative of that cluster. The representative of each cluster is then modified by averaging the initial representative with other genes of that cluster whose collective expression is strongly associated with the sample categories. In effect, the proposed algorithm yields clusters typically made up of a few genes, whose coherent average expression levels allow

that the proposed method is more effective.

The rest of this paper is organized as follows. In Section II, a novel feature extraction algorithm is presented. Section III briefly describes the concept of mutual information, which is used to measure the similarity between two genes and the relevance of a gene with respect to the class variables. In Section IV, extensive experimental results are discussed, along with a comparison with other related methods. The paper is concluded with a summary in Section V.

II. PROPOSED FEATURE EXTRACTION METHOD

This section presents an algorithm for supervised learning of similarities and interactions among predictor variables for classification in very high dimensional spaces, and hence is well suited for searching functional groups of genes in microarray expression data.

A. Proposed Supervised Clustering Algorithm

The proposed basic stochastic model for microarray data equipped with a categorical response is given by a random pair (ξ, Y) with values in R^m × Y, where ξ ∈ R^m denotes a log-transformed gene-expression profile of a tissue sample, standardized to mean zero and unit variance, and Y is the associated response variable, taking numeric values in Y = {0, 1, ..., K-1}. Here K represents the number of classes. Suppose X represents the gene set, where X = {X1, X2, ..., Xm}.

To account for the fact that not all m genes on the chip, but rather a few functional gene subsets, determine nearly all of the outcome variation and thus the type of a tissue, the whole gene set is partitioned into z functional groups or clusters {C1, ..., Cz}, with z ≪ m. They form a disjoint and usually incomplete partition of the gene set: {∪_{i=1}^{z} Ci} ⊂ {1, ..., m} and Ci ∩ Cj = ∅ for i ≠ j. Finally, a representative of every cluster is generated, and a few of them form the reduced feature set. Let ξ_Ci ∈ R denote a representative expression value of gene cluster Ci. There are many possibilities to determine such group values ξ_Ci, but as we would like to shape clusters that contain similar genes, a simple linear combination is an accurate choice:

    ξ_Ci = (1/|Ci|) Σ_{g ∈ Ci} δ_g ξ̃_g,   δ_g ∈ {-1, 1}        (1)

Here ξ̃_g represents the expression value of gene Xg. Because of the use of log-transformed, mean-centered and
perfect discrimination of sample categories. After generating standardized expression data,we as a novel extension, allow
all clusters and their representatives, a few representatives are the contribution of a particular gene g to the group value
selected according to their class discrimination power and are XCi also to be given by its ’sign-flipped’ expression values
passed through classification algorithms to classify samples. −Xg . This means that we treat under and over expression
To evaluate the performance of the proposed method, dif- symmetrically, and it prevents the differential expression of
ferent cancer gene expression datasets are used. The perfor- genes with different polarity from cancelling out when they
mance of the proposed method is studied using the predictive are tagged. Now, we describe the partioning process of the
accuracy of support vector machine, K-nearest neighbor rule, gene set X into subsets or clusters C = {C1 , · · · , Cz }.
and the naive-bayes method. The classification accuracy of The proposed clustering method has two phases: (1) cluster
the proposed method is compared with those yielded by other generation phase (2) cluster refinement phase. In the cluster
gene-selection methods. The experimental results demonstrate generation phase, first the class relevance value of every gene

76
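The signed average of Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation, and the toy expression values are invented for the example:

```python
# Sketch of the cluster representative of Eq. (1): a signed average of the
# (standardized) expression values of the genes in one cluster.

def cluster_representative(expressions, signs):
    """expressions: per-gene expression values (xi~_g) at one sample;
    signs: matching list of delta_g in {-1, +1} for sign flipping."""
    assert len(expressions) == len(signs)
    return sum(d * x for x, d in zip(expressions, signs)) / len(expressions)

# Toy example: two anti-correlated genes; flipping the second gene's sign
# keeps their differential expression from cancelling out in the average.
print(cluster_representative([1.2, -1.0], [+1, -1]))
```

Without the sign flip the two values would nearly cancel; with it, the representative preserves the shared differential signal, which is exactly the point of the δ_g ∈ {−1, 1} extension.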
The proposed clustering method has two phases: (1) a cluster generation phase and (2) a cluster refinement phase. In the cluster generation phase, first the class relevance value of every gene is calculated. The gene with the highest class relevance value, say X_v, is selected as the first member of cluster C_1. Among the remaining genes, those whose similarity with X_v is greater than or equal to a user-defined threshold α are selected as the members of cluster C_1. In this way cluster C_1 is formed.

The next phase is the cluster refinement phase. In this phase, the gene X_v with the highest class relevance value is first chosen from cluster C_1. Then a gene present in cluster C_1, say X_t, is taken, and the average expression profile of X_v and X_t is calculated. If the class relevance value of the average expression profile is greater than the class relevance value of X_v, then X_t is selected. The whole process is then repeated: a gene of cluster C_1 other than X_v and X_t, say X_r, is taken, and the average expression profile of X_v, X_t, and X_r is computed. If the class relevance value of this average expression profile (of genes X_v, X_t, X_r) is greater than the class relevance value of the previous average expression profile (of genes X_v and X_t), then X_r is selected. This process is repeated in cluster C_1 until no gene remains unchecked. In this way some genes of cluster C_1 are selected that increase the class relevance value, and the average expression profile of these selected genes acts as the representative of cluster C_1. After that, the gene with the highest class relevance value, i.e., X_v, and all other genes of this cluster whose similarity with X_v is greater than or equal to β are discarded. After the creation of cluster C_1 is completed, the gene with the next highest class relevance value among the genes not selected in any previously created cluster is taken, and the whole clustering process is repeated, up to z times. Here, z is a user-defined parameter.

After generating all z clusters and their representatives, the best h cluster representatives are selected according to their class relevance values and are passed to classifier algorithms to measure classification accuracy.

• Proposed Feature (Gene) Extraction Algorithm:
  Input: An m × n gene expression matrix T = {w_ij | i = 1, ..., m, j = 1, ..., n}, where w_ij is the measured expression level of gene X_i in the jth sample, and m and n represent the total number of genes and samples, respectively. Let X represent the set of genes; then |X| = m and X = {X_1, ..., X_i, ..., X_m}. Let CR(X_i) represent the class relevance value of gene X_i computed using mutual information. α and β are user-defined thresholds. p is a variable that holds the index of the current cluster. z is a user-defined input variable that holds the maximum number of clusters generated by the proposed algorithm. Let C represent the set of clusters, C = {C_1, ..., C_z}. S is a set that holds the genes used to generate the average gene expression profile of the current cluster.
  Output: A set containing h cluster representatives.
  1) Initialize: S ← ∅ and p ← 1.
  2) Using mutual information, calculate the class relevance value CR(X_i) of every gene X_i, for i = 1, ..., m.
  3) Select the gene, say X_v, with the highest class relevance value. This is the first member of the cluster C_p.
  4) Repeat steps 5 to 15 while p ≤ z and not all m genes have been checked:
  5) Among the remaining genes in X, select those whose similarity with X_v is ≥ α as members of cluster C_p; the cluster C_p is thus formed.
  6) In cluster C_p, take the gene X_v, set the initial cluster mean ξ_{C_p} to the expression vector ξ̃_{X_v} of the chosen gene, and put X_v ∈ S.
  7) Repeat the following two steps until all genes of cluster C_p are checked:
  8) Average the current average cluster expression profile ξ_{C_p} with each individual gene X_i present in cluster C_p:

         ξ_{C_p + X_i} = (1 / (|S| + 1)) (ξ̃_{X_i} + Σ_{X_t ∈ S} ξ̃_{X_t})

  9) If the class relevance value of ξ_{C_p + X_i} is greater than or equal to that of ξ_{C_p}, then X_i ∈ S and ξ_{C_p} = ξ_{C_p + X_i}.
  10) The final average expression profile ξ_{C_p} of cluster C_p acts as the representative of C_p.
  11) After generating the cluster representative, discard from the gene set X the gene X_v and all genes present in cluster C_p whose similarity with X_v is greater than or equal to β.
  12) Set p = p + 1 and S ← ∅.
  13) Take the gene, say X_u, with the next highest class relevance value among the genes not selected in any previously created cluster. This is the first member of the next cluster C_p; go to step 4.
  14) If no such gene is found, set p = p − 1 and go to step 15.
  15) After generating all cluster representatives, select the best h cluster representatives according to their class relevance values.
  16) End.

B. Time Complexity

In this algorithm the original number of features (genes) is m, from which the proposed method selects h cluster representatives. These cluster representatives form the reduced feature set. In the proposed method, the class relevance value of every gene is calculated first. The time needed to calculate the class relevance value of one gene is O(n), as there are n samples, so the time complexity of this phase is O(mn). In the next phase, the gene with the highest class relevance value is chosen as the member of the first cluster. Based on a user-defined threshold, some of the remaining genes are selected for that cluster, and the cluster is formed. Then the gene with the highest class relevance value among all remaining genes not already selected in any previously created cluster is chosen, and the whole process is repeated until m genes have been checked or z clusters have been obtained. In every cluster, the similarities of all genes are calculated with respect to the gene that has the maximum class relevance value in that cluster. The similarity calculation time between two genes is O(n), as there are n samples, so the time complexity of selecting the genes of one cluster is O(mn). As z clusters are created, the time complexity of this phase is O(zmn). Finally, h cluster representatives are selected according to their class relevance values; the time complexity of this phase is O(h). As h ≪ m, the overall time complexity of the proposed method is O(zmn).
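The control flow of the two-phase procedure above can be sketched compactly. This is a schematic reading of steps 1-16, not the authors' C implementation: `relevance(profile)` and `similarity(a, b)` are hypothetical callables standing in for the paper's mutual-information-based CR(·) and gene-gene similarity, and the sign-flipping of Eq. (1) is omitted for brevity:

```python
# Sketch of the supervised clustering algorithm (generation + refinement).
def supervised_clusters(genes, relevance, similarity, alpha, beta, z):
    """genes: dict gene-name -> expression profile (list of floats).
    Returns at most z cluster-representative profiles."""
    remaining = dict(genes)
    representatives = []
    while remaining and len(representatives) < z:
        # Generation: seed with the most class-relevant remaining gene, then
        # pull in remaining genes at least alpha-similar to it (steps 3, 5).
        seed = max(remaining, key=lambda g: relevance(remaining[g]))
        seed_profile = remaining[seed]
        members = [g for g in remaining if g != seed
                   and similarity(seed_profile, remaining[g]) >= alpha]
        # Refinement: average a member in only when the averaged profile
        # does not lose class relevance (steps 8-9).
        rep, count = list(seed_profile), 1
        for g in members:
            cand = [(r * count + x) / (count + 1)
                    for r, x in zip(rep, remaining[g])]
            if relevance(cand) >= relevance(rep):
                rep, count = cand, count + 1
        representatives.append(rep)
        # Discard the seed and every member at least beta-similar to it
        # (step 11); less similar members stay available for later clusters.
        for g in members:
            if similarity(seed_profile, remaining[g]) >= beta:
                del remaining[g]
        del remaining[seed]
    return representatives
```

Note how β < α would let a gene that joined one cluster still seed or join a later one, which is why the partition of the gene set can be incomplete and overlapping genes are pruned only above the β threshold.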
C. Choice of α and β

In this paper, we measure the similarity between any two genes using mutual information. If two genes are highly similar, the mutual information between them is very high; otherwise it is low. When two genes are statistically independent, the mutual information between them is zero. Mutual information is maximum when we measure the similarity of a gene with itself. The parameter α is a threshold on the degree of similarity of two genes. For every data set, we first measure the similarity of each gene with itself. Among these similarity values, the maximum is the maximum mutual information value for that particular data set. So, we take the values of α between 0 and the maximum mutual information value for every data set. On the other hand, the threshold β is used to decide whether a gene of the current cluster will be considered in the next cluster generation step or not. For every data set and for every cluster, β is set to 90% of the maximum similarity of the initial cluster representative.

III. EVALUATION CRITERIA FOR GENE SELECTION

The F-test value [11], [12], information gain, mutual information [11], [13], normalized mutual information [14], etc., are typically used to measure the gene-class relevance, and the same or a different metric such as mutual information, the L1 distance, Euclidean distance, Pearson's correlation coefficient, etc., [11], [13], [15] is employed to calculate the gene-gene similarity or redundancy. However, as the F-test value, Euclidean distance, Pearson's correlation, etc., depend on the actual gene expression values of the microarray data, they are very sensitive to noise and outliers in the data set. On the other hand, as mutual information depends only on the probability distribution of a random variable rather than on its actual values, it is more effective for evaluating the gene-class relevance as well as the gene-gene similarity [11], [13]. So, in this paper, mutual information is used to measure both the gene-class relevance and the gene-gene similarity.

A. Mutual Information

In principle, mutual information is used to quantify the information shared by two objects. If two independent objects do not share much information, the mutual information value between them is small, while two highly correlated objects will demonstrate a high mutual information value [16]. The objects can be the class label and the genes. Whether a gene is independent and informative can, therefore, be determined by the shared information between the gene and the rest, as well as the shared information between the gene and the class label [11], [13]. If a gene has expression values randomly or uniformly distributed across the different classes, its mutual information with these classes is zero. If a gene is strongly differentially expressed for different classes, it should have large mutual information. Thus, mutual information can be used as a measure of the relevance of genes. Similarly, mutual information may be used to measure the level of similarity between genes.

The entropy is a measure of the uncertainty of random variables. If a discrete random variable X takes values in an alphabet 𝒳 with probability mass function p(x) = Pr{X = x}, x ∈ 𝒳, the entropy of X is defined as

    H(X) = − Σ_{x ∈ 𝒳} p(x) log p(x).        (2)

Similarly, the joint entropy of two random variables X with alphabet 𝒳 and Y with alphabet 𝒴 is given by

    H(X, Y) = − Σ_{x ∈ 𝒳} Σ_{y ∈ 𝒴} p(x, y) log p(x, y)        (3)

where p(x, y) is the joint probability mass function. The mutual information between X and Y is, therefore, given by

    I(X, Y) = H(X) + H(Y) − H(X, Y).        (4)

B. Discretization

In microarray gene expression data sets, the class labels of samples are represented by discrete symbols, while the expression values of genes are continuous. Hence, to measure both the gene-class relevance of a gene with respect to the class labels and the gene-gene redundancy between two genes using mutual information [11], [13], [17], the continuous expression values of a gene are usually divided into several discrete partitions. The a priori (marginal) probabilities and the joint probabilities are then calculated to compute both gene-class relevance and gene-gene redundancy using the definitions for the discrete case. In this paper, the discretization method reported in [11], [13], [17] is employed to discretize the continuous gene expression values. The expression values of a gene are discretized using the mean µ and standard deviation σ computed over the n expression values of that gene: any value larger than (µ + σ/2) is transformed to state 1; any value between (µ − σ/2) and (µ + σ/2) is transformed to state 0; any value smaller than (µ − σ/2) is transformed to state −1. These three states correspond to the over-expression, baseline, and under-expression of genes.
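Equations (2)-(4) and the three-state discretization can be sketched with plain counting estimates of the probabilities. A minimal illustration (not the authors' code; probabilities are empirical frequencies):

```python
# Sketch of Eqs. (2)-(4) and the mu +/- sigma/2 discretization of Sec. III-B.
import math
from collections import Counter

def entropy(xs):
    """H(X) = -sum p(x) log p(x), Eq. (2), with empirical probabilities."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X,Y) = H(X) + H(Y) - H(X,Y), Eq. (4); the joint entropy H(X,Y)
    is Eq. (3) applied to the paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def discretize(values):
    """Map each expression value to -1 / 0 / +1 around mu -/+ sigma/2."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [1 if v > mu + sigma / 2 else -1 if v < mu - sigma / 2 else 0
            for v in values]
```

With this, `mutual_information(discretize(gene), labels)` gives a gene-class relevance and `mutual_information(discretize(g1), discretize(g2))` a gene-gene similarity, both insensitive to the raw expression magnitudes, which is the property the section argues for.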
IV. EXPERIMENTAL RESULTS AND DISCUSSION

The experimental results are organized as follows: first, the characteristics of the microarray data sets are discussed briefly; then the different classifiers (naive Bayes classifier, K-nearest neighbor rule, and support vector machine) used to measure the performance of the proposed algorithm are described; and finally, the performance of the proposed method is extensively compared with other existing methods given in [3].

In this paper, mutual information is applied to calculate both gene-class relevance and gene-gene redundancy. All methods are implemented in the C language and run in a Linux environment on a machine with a Pentium IV 3.2 GHz processor, 1 MB cache, and 1 GB RAM. To analyze the performance of the proposed method, experiments are carried out on different microarray gene expression data sets.

A. Gene Expression Data Sets

In this paper, different public cancer microarray data sets are used. Since binary classification is a typical and fundamental issue in the diagnostic and prognostic prediction of cancer, the proposed method is compared with other existing methods using the following binary-class data sets.

1) Breast Cancer Data Set [18]: The breast cancer data set contains expression levels of 7129 genes in 49 breast tumor samples from [18]. The samples are classified according to their estrogen receptor (ER) status: 25 samples are ER positive, while the other 24 samples are ER negative.

2) Leukemia Data Set [1]: This is an Affymetrix high-density oligonucleotide array that contains 7070 genes and 72 samples from two classes of leukemia: 47 acute lymphoblastic leukemia and 25 acute myeloid leukemia.

3) Colon Cancer Data Set [19]: The colon cancer data set contains expression levels of 40 tumor and 22 normal colon tissues. Only the 2000 genes with the highest minimal intensity were selected by [19].

B. Class Prediction Methods

The following three classifiers are used to evaluate the performance of the proposed clustering algorithm.

1) Naive Bayes Classifier [20]: The naive Bayes (NB) classifier is one of the oldest classifiers. It is obtained by using the Bayes rule and assuming that the features (variables) are independent of each other given the class. For the jth sample s_j with m gene expression levels {w_1j, ..., w_ij, ..., w_mj} for the m genes, the posterior probability that s_j belongs to class c is

    p(c | s_j) ∝ ∏_{i=1}^{m} p(w_ij | c)        (5)

where the p(w_ij | c) are conditional tables (or conditional densities) estimated from training examples. Despite the independence assumption, the NB has been shown to have good classification performance for many real data sets [20].

2) Support Vector Machine [21]: The support vector machine (SVM) is a relatively new and promising classification method. It is a margin classifier that draws an optimal hyperplane in the feature vector space; this defines a boundary that maximizes the margin between data samples of different classes, therefore leading to good generalization properties. A key factor in the SVM is the use of kernels to construct a nonlinear decision boundary. In the present work, linear kernels are used. The source code of the SVM is downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm.

3) K-Nearest Neighbor Rule [22]: The K-nearest neighbor (K-NN) rule is used for evaluating the effectiveness of the reduced gene set for classification. It is one of the simplest machine learning algorithms, classifying a sample based on the closest training samples in the feature space. A sample is classified by a majority vote of its K neighbors, the sample being assigned to the class most common amongst its K nearest neighbors. The value of K chosen for the K-NN is the square root of the number of samples in the training set.

C. Performance Analysis

In this section, the results obtained by applying the supervised clustering algorithm to the above-mentioned data sets are briefly described. The experimental results on the different microarray data sets are presented in Tables I-IX. The subsequent discussions analyze the results with respect to the prediction accuracy of the NB, SVM, and K-NN classifiers. Tables I, IV, VII and Tables II, V, VIII provide the performance of the proposed method using the NB and SVM respectively, while Tables III, VI, IX show the results using the K-NN. Extensive experiments have been carried out for the following values of α: 0.0001, 0.001, 0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.19, 0.20, 0.30, 0.40, and 0.50.

To compute the prediction accuracy of the NB, SVM, and K-NN, leave-one-out cross-validation is performed on each gene expression data set. In all experiments, the value of z is taken as 50, so at most 50 clusters are obtained, each with its own representative. For all data sets and all three classifiers, the results are reported with respect to the best 5 representatives. Each data set is pre-processed by standardizing each sample to zero mean and unit variance.

TABLE I
PERFORMANCE ON BREAST CANCER DATA SET USING NB CLASSIFIER

Value    Number of Selected Genes
of α       1       2       3       4       5
0.0001   100     *       *       *       *
0.001    100     100     *       *       *
0.005    100     100     *       *       *
0.01     100     100     100     *       *
0.02     100     100     100     97.96   *
0.04     100     100     100     100     100
0.06     100     100     100     100     100
0.08     100     100     100     100     100
0.10     100     100     100     100     100
0.11     100     100     100     100     100
0.12     100     100     100     100     100
0.13     100     100     100     100     100
0.14     100     100     100     100     100
0.15     100     100     100     100     100
0.20     100     100     100     97.96   97.96
0.30     97.96   97.96   97.96   97.96   97.96
0.40     89.80   97.96   97.96   100     97.96
0.50     91.84   93.88   93.88   93.88   91.84

Tables I, IV, and VII show the classification accuracy on the different cancer microarray data sets using the NB classifier.
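The NB accuracies in these tables come from the rule of Eq. (5) applied to discretized gene states. A toy sketch of that rule (not the authors' C code; the Laplace smoothing term is an added assumption, since the paper does not say how zero counts are handled):

```python
# Sketch of the naive Bayes rule of Eq. (5) on discretized gene states:
# the class score is the prior count times the product of per-gene
# conditional probabilities p(w_ij | c) estimated by counting.
from collections import defaultdict

def train_nb(samples, labels, states=(-1, 0, 1)):
    """samples: list of per-sample gene-state vectors; labels: one class
    label per sample. Returns (class counts, conditional count tables,
    number of states)."""
    cond = defaultdict(lambda: defaultdict(float))
    for vec, c in zip(samples, labels):
        for i, w in enumerate(vec):
            cond[(c, i)][w] += 1.0
    prior = {c: labels.count(c) for c in set(labels)}
    return prior, cond, len(states)

def classify_nb(x, prior, cond, n_states):
    """Return the class maximizing prior(c) * prod_i p(x_i | c)."""
    best, best_score = None, -1.0
    for c, nc in prior.items():
        score = float(nc)
        for i, w in enumerate(x):
            score *= (cond[(c, i)][w] + 1.0) / (nc + n_states)  # smoothed
        if score > best_score:
            best, best_score = c, score
    return best
```

Because the product in Eq. (5) has one factor per selected gene, reducing thousands of genes to a handful of cluster representatives also makes this classifier far less prone to the compounding of noisy per-gene estimates.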
TABLE II
PERFORMANCE ON BREAST CANCER DATA SET USING SVM

Value    Number of Selected Genes
of α       1       2       3       4       5
0.0001   93.87   *       *       *       *
0.001    85.72   93.88   *       *       *
0.005    89.79   91.84   *       *       *
0.01     87.76   89.80   91.84   *       *
0.02     89.80   93.88   93.88   100     *
0.04     87.76   100     100     100     100
0.06     91.84   100     100     100     100
0.08     91.84   95.92   100     100     100
0.10     100     97.96   100     100     100
0.11     93.88   97.96   100     100     100
0.12     95.92   100     100     100     100
0.13     95.92   100     100     100     100
0.14     95.92   97.96   100     100     100
0.15     100     100     100     100     100
0.20     97.96   100     100     100     100
0.30     89.80   91.84   89.80   93.88   95.92
0.40     89.80   95.92   95.92   93.88   95.92
0.50     87.76   91.84   91.84   91.84   91.84

TABLE III
PERFORMANCE ON BREAST CANCER DATA SET USING K-NN RULE

Value    Number of Selected Genes
of α       1       2       3       4       5
0.0001   93.88   *       *       *       *
0.001    95.92   95.92   *       *       *
0.005    100     95.92   *       *       *
0.01     97.96   100     100     *       *
0.02     93.88   93.88   93.88   100     *
0.04     97.96   97.96   100     100     100
0.06     100     100     100     100     100
0.08     95.92   95.92   100     100     100
0.10     100     100     97.96   97.96   100
0.11     95.92   97.96   100     100     100
0.12     97.96   100     100     100     100
0.13     95.92   100     100     100     100
0.14     97.96   97.96   97.96   100     100
0.15     100     100     100     100     100
0.20     100     100     100     100     97.96
0.30     95.92   91.84   89.80   97.96   95.92
0.40     89.80   95.92   95.92   95.92   95.92
0.50     91.84   93.88   93.88   91.84   91.84

TABLE IV
PERFORMANCE ON LEUKEMIA DATA SET USING NB CLASSIFIER

Value    Number of Selected Genes
of α       1       2       3       4       5
0.0001   100     *       *       *       *
0.001    100     100     *       *       *
0.005    100     100     100     *       *
0.01     100     100     100     100     *
0.02     100     100     100     100     100
0.04     100     100     100     100     100
0.06     100     100     100     100     100
0.08     100     100     100     100     100
0.10     100     100     100     100     100
0.11     100     100     100     100     100
0.12     100     100     100     100     100
0.13     100     100     100     100     100
0.14     100     100     100     100     100
0.15     100     100     100     100     100
0.20     100     100     100     100     100
0.30     100     100     100     100     100
0.40     100     100     100     100     98.61
0.50     93.06   97.22   98.61   100     100

TABLE V
PERFORMANCE ON LEUKEMIA DATA SET USING SVM

Value    Number of Selected Genes
of α       1       2       3       4       5
0.0001   90.27   *       *       *       *
0.001    91.66   100     *       *       *
0.005    84.72   100     100     *       *
0.01     83.33   100     100     100     *
0.02     90.27   94.44   100     100     100
0.04     90.27   98.61   98.61   100     100
0.06     90.27   100     100     100     100
0.08     94.44   100     100     100     100
0.10     98.61   100     98.61   100     100
0.11     95.83   100     100     100     100
0.12     98.61   98.61   98.61   98.61   100
0.13     100     100     100     100     100
0.14     98.61   100     100     100     100
0.15     98.61   98.61   100     100     100
0.20     94.44   100     98.61   100     100
0.30     100     100     100     100     98.61
0.40     97.22   100     98.61   95.83   98.61
0.50     93.05   93.05   95.83   94.44   95.83

Using the NB, the proposed method gives 100% accuracy for all the above-mentioned data sets considering 1 or more gene cluster representatives. With the NB, a classification accuracy of 100% is obtained for the breast cancer data for α values ranging from 0.0001 to 0.20, and for the leukemia cancer data 100% accuracy is obtained for α values ranging from 0.0001 to 0.50. For the colon cancer data, 100% accuracy is obtained for a set of α values: 0.01, 0.02, 0.04, and 0.13.

The results reported in Tables II, V, and VIII are based on the predictive accuracy of the SVM. They show that for the breast cancer data set, 100% accuracy is obtained for α values ranging from 0.02 to 0.20 using 1 or more gene cluster representatives. Using the SVM, 100% accuracy is obtained for the leukemia data for α values ranging from 0.001 to 0.40 using 1 or more cluster representatives. For the colon cancer data, 100% accuracy is obtained only for α value 0.02, considering 1 gene cluster representative.

For the breast cancer data set using the K-NN, 100% accuracy is obtained for α values ranging from 0.005 to 0.20 using 1 or more gene cluster representatives. The K-NN also gives 100% accuracy for α values from 0.001 to 0.30 using 1 or more cluster representatives on the leukemia data set. For the colon cancer data set, it gives 100% accuracy for α values 0.02 and 0.13 using 1 or more gene cluster representatives.

Analyzing all these results, we can say that for the NB classifier, 100% accuracy is obtained on the breast cancer data set for α values from 0.0001 to 0.20, while for the SVM and K-NN the best results are obtained for α values 0.10 and 0.15. So, the proposed method gives its best results for the breast cancer data for α values 0.10 and 0.15. For the colon cancer data using the NB classifier, 100% accuracy is obtained for a set of α values: 0.01, 0.02, 0.04, and 0.13. For the SVM, the best result is obtained for α value 0.02, and for the K-NN the best results are obtained for α values 0.02 and 0.13. So, when α is set to 0.02, the proposed method gives 100% accuracy for the three specified classifiers on the colon cancer data using 1 or more cluster representatives.
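All accuracies above are leave-one-out estimates. The loop below is a generic sketch of that protocol (not the authors' C implementation), paired with a majority-vote K-NN using K = √n as described in Section IV-B; in the paper the held-out sample would be classified on the selected cluster representatives rather than on raw features:

```python
# Sketch of leave-one-out cross-validation with a K = sqrt(n) K-NN.
import math
from collections import Counter

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training samples (Euclidean)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(train_x[i], x)))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loocv_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, average the hits."""
    hits = 0
    for j in range(len(xs)):
        train_x = xs[:j] + xs[j + 1:]
        train_y = ys[:j] + ys[j + 1:]
        k = max(1, round(math.sqrt(len(train_x))))  # K = sqrt(#train samples)
        hits += knn_predict(train_x, train_y, xs[j], k) == ys[j]
    return hits / len(xs)
```

With only 49-72 samples per data set, leave-one-out is the natural choice here: it uses nearly all the data for training in each fold while still giving an unbiased per-sample error count.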
TABLE VI
PERFORMANCE ON LEUKEMIA DATA SET USING K-NN RULE

Value    Number of Selected Genes
of α       1       2       3       4       5
0.0001   98.61   *       *       *       *
0.001    97.22   100     *       *       *
0.005    95.83   97.22   97.22   *       *
0.01     98.61   100     100     100     *
0.02     98.61   98.61   100     100     100
0.04     97.22   98.61   98.61   100     100
0.06     100     100     100     100     100
0.08     98.61   100     100     100     100
0.10     100     100     98.61   100     100
0.11     95.83   100     100     100     100
0.12     98.61   98.61   98.61   98.61   100
0.13     100     100     100     100     100
0.14     98.61   100     100     100     100
0.15     100     100     100     100     100
0.20     98.61   98.61   98.61   100     100
0.30     100     100     100     100     100
0.40     97.22   100     95.83   94.44   94.44
0.50     94.44   91.67   91.67   93.06   94.44

TABLE VII
PERFORMANCE ON COLON CANCER DATA SET USING NB CLASSIFIER

Value    Number of Selected Genes
of α       1       2       3       4       5
0.0001   98.39   *       *       *       *
0.001    98.39   96.77   *       *       *
0.005    98.39   96.77   *       *       *
0.01     100     100     *       *       *
0.02     100     98.39   *       *       *
0.04     100     100     100     *       *
0.06     98.39   98.39   96.77   95.16   *
0.08     98.39   98.39   96.77   98.39   98.39
0.10     96.77   98.39   98.39   98.39   98.39
0.11     98.39   98.39   98.39   98.39   98.39
0.12     96.77   96.77   96.77   95.16   96.77
0.13     100     98.39   98.39   96.77   96.77
0.14     96.77   98.39   95.16   95.16   93.55
0.15     98.39   95.16   95.16   96.77   95.16
0.20     98.39   98.39   96.77   95.16   95.16
0.30     95.16   95.16   93.55   93.55   93.55
0.40     91.94   96.77   96.77   95.16   95.16
0.50     83.87   82.26   85.48   93.55   90.32

TABLE VIII
PERFORMANCE ON COLON CANCER DATA SET USING SVM

Value    Number of Selected Genes
of α       1       2       3       4       5
0.0001   95.16   *       *       *       *
0.001    95.16   95.16   *       *       *
0.005    95.16   98.39   *       *       *
0.01     98.39   96.77   *       *       *
0.02     100     93.55   *       *       *
0.04     90.32   90.32   95.16   *       *
0.06     95.16   96.77   95.16   93.55   *
0.08     88.71   98.39   98.39   98.39   98.39
0.10     93.55   95.16   93.55   93.55   93.55
0.11     95.16   96.77   95.16   91.94   91.94
0.12     95.16   91.94   93.55   91.94   93.55
0.13     96.77   98.39   96.77   96.77   95.16
0.14     95.16   93.55   93.55   95.16   95.16
0.15     98.39   95.16   95.16   95.16   95.16
0.20     93.55   91.94   91.94   93.55   93.55
0.30     93.55   91.94   91.94   87.09   91.94
0.40     91.94   93.55   95.16   95.16   95.16
0.50     80.65   83.88   80.65   85.49   82.26

TABLE IX
PERFORMANCE ON COLON CANCER DATA SET USING K-NN RULE

Value    Number of Selected Genes
of α       1       2       3       4       5
0.0001   95.16   *       *       *       *
0.001    95.16   95.16   *       *       *
0.005    98.39   96.77   *       *       *
0.01     98.39   96.77   *       *       *
0.02     100     93.55   *       *       *
0.04     98.39   93.55   93.55   *       *
0.06     95.16   98.39   95.16   95.16   *
0.08     95.16   96.77   96.77   98.39   98.39
0.10     93.55   95.16   93.55   93.55   93.55
0.11     96.77   96.77   95.16   95.16   95.16
0.12     95.16   91.94   93.55   91.94   93.55
0.13     100     98.39   96.77   96.77   95.16
0.14     93.55   93.55   96.77   95.16   93.55
0.15     98.39   93.55   96.77   95.16   95.16
0.20     98.39   93.55   93.55   95.16   95.16
0.30     93.55   91.94   91.94   91.94   91.94
0.40     91.94   95.16   95.16   95.16   95.16
0.50     83.87   80.65   80.65   82.26   82.26

So, the proposed method gives its best result for the colon cancer data set for α value 0.02. For the leukemia cancer data, we get 100% accuracy for α values from 0.0001 onwards using the NB classifier. Using the SVM, the proposed method gives its best results for α values ranging from 0.001 to 0.40, and for the K-NN it gives its best results for α values from 0.001 to 0.30. So, we can say that for the leukemia cancer data the proposed method gives its best result when α is set to 0.13 or 0.30.

D. Comparative Performance Analysis

For comparison, we compare the proposed method with the results of the attribute clustering algorithm (ACA), t-value, k-means algorithm, minimum redundancy-maximum relevance (mRMR) algorithm, self-organizing map (SOM), biclustering algorithm, and radial basis function (RBF) network on the colon cancer and leukemia cancer data sets, as given in [3], using the NB and K-NN methods.

The experimental results in Table X show that the proposed method is superior to the other gene selection methods, selecting a smaller set of discriminative genes on the colon cancer and leukemia cancer data sets than the others, as reflected by the classification results. The proposed method outperforms the others in all cases. Although the ACA and t-value algorithms can find good discriminative genes for the K-NN method, the t-value is unable to do so for the naive Bayes method. Using the naive Bayes classifier, ACA gives good results for leukemia cancer but is unable to do so for colon cancer. The k-means algorithm, SOM, biclustering algorithm, mRMR, and RBF fail to find good discriminative genes for these two data sets, as shown in the results.

V. CONCLUSION

This paper presents a new algorithm for supervised clustering of genes from microarray experiments. The proposed algorithm is potentially useful in the context of medical diagnostics, as it identifies groups of interacting genes that have high explanatory power for given tissue types, and which
in turn can accurately predict the class labels of new samples. At the same time, such gene clusters may reveal insights into biological processes and may be valuable for functional genomics.

In summary, the proposed algorithm tries to cluster genes such that the discrimination of different tissue types is as simple as possible. The performance of the proposed method is evaluated by the predictive accuracy of the naive Bayes classifier, K-nearest neighbor rule, and support vector machine. For all data sets, 100% classification accuracy is found by the proposed method. The results obtained on real data sets demonstrate that the proposed method can bring a remarkable

TABLE X
COMPARATIVE PERFORMANCE ANALYSIS OF DIFFERENT METHODS

Classifier  Data Sets        Methods/      Accuracy  Number
                             Algorithms    (%)       of Genes
K-NN        Colon Cancer     Proposed      100       1
                             ACA           83.9      7
                             t-value       80.6      7
                             k-means       69.4      14
                             SOM           59.7      14
                             Biclustering  69.4      7
                             mRMR          64.5      7
                             RBF           67.7      3
NB          Colon Cancer     Proposed      100       1
                             ACA           67.7      14
                             t-value       56.5      7
                             k-means       62.9      7
                             SOM           64.5      7
                             Biclustering  67.7      7
                             mRMR          64.5      7
                             RBF           64.5      3
K-NN        Leukemia Cancer  Proposed      100       1
                             ACA           91.2      7
                             t-value       88.2      14
                             k-means       50.0      7
                             SOM           50.0      7
                             Biclustering  58.8      21
                             mRMR          70.6      14
                             RBF           47.1      3
NB          Leukemia Cancer  Proposed      100       1
                             ACA           82.4      7
                             t-value       55.9      7
                             k-means       58.8      7
                             SOM           58.8      7
                             Biclustering  58.8      7
                             mRMR          67.6      7
                             RBF           58.8      3

REFERENCES

[4] M. B. Eisen, P. T. Spellman, and D. Botstein, "Cluster Analysis and Display of Genome-Wide Expression Patterns," Proc. Natl. Acad. Sci. USA, vol. 95, pp. 14863-14868, 1998.
[5] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien, "Large-scale Clustering of cDNA-fingerprinting Data," Genome Res., vol. 9, pp. 1093-1105, 1999.
[6] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub, "Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation," Proc. Natl. Acad. Sci. USA, vol. 96, pp. 2907-2912, 1999.
[7] T. Hastie, R. Tibshirani, D. Botstein, and P. Brown, "Supervised Harvesting of Expression Trees," Genome Biology, 2001.
[8] D. Nguyen and D. Rocke, "Tumor Classification by Partial Least Squares using Microarray Gene Expression Data," Bioinformatics, pp. 39-50, 2002.
[9] P. Geladi and B. Kowalski, "Partial Least Squares Regression: A Tutorial," Analytica Chimica Acta, 1986.
[10] M. Dettling and P. Bühlmann, "Supervised Clustering of Genes," Genome Biology, pp. 0069.1-0069.15, 2002.
[11] C. Ding and H. Peng, "Minimum Redundancy Feature Selection from Microarray Gene Expression Data," in Proceedings of the Computational Systems Bioinformatics, 2003, pp. 523-528.
[12] J. Li, H. Su, H. Chen, and B. W. Futscher, "Optimal Search-Based Gene Subset Selection for Gene Array Cancer Classification," IEEE Transactions on Information Technology in Biomedicine, vol. 11, no. 4, pp. 398-405, 2007.
[13] H. Peng, F. Long, and C. Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[14] X. Liu, A. Krishnan, and A. Mondry, "An Entropy Based Gene Selection Method for Cancer Classification Using Microarray Data," BMC Bioinformatics, vol. 6, no. 76, pp. 1-14, 2005.
[15] D. Jiang, C. Tang, and A. Zhang, "Cluster Analysis for Gene Expression Data: A Survey," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1370-1386, 2004.
[16] C. Shannon and W. Weaver, The Mathematical Theory of Communication. Champaign, IL: Univ. Illinois Press, 1964.
[17] P. Maji, "f-Information Measures for Efficient Selection of Discriminative Genes from Microarray Data," IEEE Transactions on Biomedical Engineering, vol. 56, no. 4, pp. 1063-1069, 2009.
[18] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, J. R. Marks, and J. R. Nevins, "Predicting the Clinical Status of Human Breast Cancer by Using Gene Expression Profiles," Proc. Natl. Acad. Sci. USA, vol. 98, no. 20, pp. 11462-11467, 2001.
[19] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, "Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays," Proc. Natl. Acad. Sci. USA, vol. 96, no. 12, pp. 6745-6750, 1999.
[20] T. Mitchell, Machine Learning. McGraw-Hill, 1997.
[21] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[22] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification and
improvement on gene selection problem. So, the proposed Scene Analysis. John Wiley & Sons, New York, 1999.
method is capable of identifying discriminative genes that may
contribute to revealing underlying class structures, providing
a useful tool for the exploratory analysis of biological data.
R EFERENCES
[1] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek,
J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri,
C. D. Bloomfield, and E. S. Lander, “Molecular Classification of Cancer:
Class Discovery and Class Prediction by Gene Expression Monitoring,”
Science, vol. 286, pp. 531–537, 1999.
[2] E. Domany, “Cluster Analysis of Gene Expression Data,” Journal of
Statistical Physics, vol. 110, pp. 1117–1139, 2003.
[3] W. Au, K. C. C. Chan, A. K. C. Wong, and Y. Wang, “Attribute
Clustering for Grouping, Selection, and Classification of Gene Expres-
sion Data,” IEEE/ACM Transactions On Computational Biology and
Bioinformatics, vol. 2, no. 2, pp. 83–101, 2005.

Modified Greedy Search Algorithm for
Biclustering Gene Expression Data

Shyama Das
Department of Computer Science,
Cochin University of Science and Technology,
Cochin, Kerala, India
shyamadas777@gmail.com

Sumam Mary Idicula
Department of Computer Science,
Cochin University of Science and Technology,
Cochin, Kerala, India
sumam@cusat.ac.in

Abstract— Biclustering refers to the simultaneous clustering of both rows and columns of a data matrix. Biclustering is a highly useful data mining technique in the analysis of gene expression data. The problem of identifying the most significant biclusters in gene expression data has been shown to be NP-complete. In this paper a greedy search algorithm is developed for biclustering gene expression data. This algorithm has two steps. In the first step high quality bicluster seeds are generated using the K-Means clustering algorithm. Then these seeds are enlarged using the greedy search method. Here the node that results in the minimum Hscore value when combined with the bicluster is selected and added to the bicluster. This selection and addition is continued till the Hscore value of the bicluster reaches the given threshold. Even though it is a greedy method, the results obtained are far better than those of many of the metaheuristic methods, which are generally considered superior to the greedy approach.

Keywords - Biclustering; Gene expression data; greedy search; data mining; K-Means clustering

I. INTRODUCTION

DNA Microarray technology is capable of measuring the expression levels of thousands of genes in a single experiment. Measuring the gene expression levels across different stages in different tissues or cells or under different conditions is useful for understanding and interpreting biological processes. The relative abundance of the mRNA of a gene under a specific experimental condition or sample is called the expression level of a gene. Gene expression patterns can offer massive information about cell functions. It has revolutionized gene expression analysis. Microarray data are widely used in genomic research because of their enormous potential in gene expression profiling, facilitating the prognosis and the discovery of subtypes of diseases. Microarrays are widely used in the medical domain to construct molecular profiles of diseased and normal tissues of patients. Such profiles are extremely useful for understanding various diseases and facilitate a more accurate diagnosis, prognosis, treatment planning and drug discovery.

Microarray gene expression data is organized in the form of a matrix where rows represent genes and columns represent experimental conditions or samples. The experimental conditions can be patients, tissue types etc. The samples can correspond to different time points or different environmental conditions. The samples can be from different organs, from cancerous or healthy tissues, or even from different individuals. The gene expression data contains thousands of genes and hundreds of conditions. An element in the matrix refers to the expression level of a particular gene under a specific condition. The genes are co-regulated if the genes in a set display similar fluctuation under all conditions. By discovering the co-regulation, it is possible to infer the gene regulative network, which will lead to better understanding as to how organisms develop and evolve. One of the objectives of gene expression data analysis is to group genes according to their expression under multiple conditions. Clustering is the most widely used data mining technique for analyzing gene expression data to group similar genes or conditions. Clustering of co-expressed genes into biologically meaningful groups assists in inferring the biological role of an unknown gene that is co-expressed with a known gene.

However, clustering has its own limitations. Clustering is based on the assumption that all the related genes behave similarly across all the measured conditions. It may reveal the genes which are very closely co-regulated across all the columns. However, genes are not relevant for all the experimental conditions; rather, groups of genes are co-expressed and co-regulated only under specific conditions. They behave almost independently under other conditions. Moreover clustering partitions the genes into disjoint sets, i.e. each gene is associated with a single biological function, which is in contradiction to the biological system [1].

This observation resulted in the development of clustering methods that try to simultaneously group genes and conditions. This approach is called biclustering or co-clustering. Biclustering is clustering applied in two dimensions, i.e. along the row and column, simultaneously. This approach identifies the genes which show similar expression levels under a specific subset of experimental conditions. The objective is to discover maximal subgroups of genes and subgroups of conditions. Such genes express highly correlated activities over a range of conditions. Biclustering was first defined by Hartigan, who called it direct clustering [2]. Cheng and Church were the first to apply biclustering to

gene expression data [3]. Biclustering is a powerful analytical tool when some genes have multiple functions and experimental conditions are diverse.

In this work a novel algorithm is developed for biclustering gene expression data using a greedy strategy. In the first step high quality bicluster seeds are generated using the K-Means clustering algorithm. Then the seeds are enlarged by adding the node that results in the minimum incremental increase in Hscore. The node addition continues till the Hscore value of the bicluster reaches the given threshold.

II. METHODS AND MATERIALS

A. Model of bicluster
A gene expression dataset is a matrix in which rows represent genes and columns represent experimental conditions. An element aij of the expression matrix A represents the logarithm of the relative abundance of the mRNA of the ith gene under the jth condition. Let X = {G1, G2, ..., GN} be the set of genes and Y = {C1, ..., CM} be the set of conditions in the gene expression dataset. The dataset can be viewed as an NxM matrix A of real numbers. A bicluster is a submatrix B of A; if the size of B is IxJ, then I is a subset of the rows X of A, and J is a subset of the columns Y of A. The rows and columns of the bicluster B need not be contiguous as in the expression matrix A.

Biclusters are generally classified into four major types: biclusters with constant values, biclusters with constant values on rows or columns, biclusters with coherent values, and biclusters with coherent evolutions. In the gene expression data matrix constant biclusters disclose subsets of genes with similar expression values within a subset of conditions. On the other hand a bicluster with constant values in the rows will identify a subset of genes with similar expression values across a subset of conditions, permitting the expression levels to vary from gene to gene. Similarly a bicluster with constant columns identifies a subset of conditions within which a subset of genes manifests similar expression values, assuming that the expression values might vary from condition to condition. A bicluster with coherent values identifies a subset of genes and a subset of conditions with coherent values on both the rows and columns. In this case the similarity among the genes is measured by the mean squared residue score. If the similarity measure (mean squared residue score) of a matrix is within a certain threshold, it is a bicluster. In the case of a bicluster with coherent evolutions, a subset of genes is up-regulated or down-regulated across a subset of conditions without considering their actual expression values [1].

The biclusters with coherent values are biologically more relevant than biclusters with constant values. Hence in this work biclusters with coherent values are identified. Thus the problem of biclustering can be formulated in the following manner: given a data matrix A, find a set of submatrices B1, B2, ..., Bn which satisfy some homogeneous characteristics or coherence. For measuring the degree of coherence a measure called the mean squared residue score or Hscore was introduced by Cheng and Church. It is the mean of the squared residue scores of all the elements in the bicluster. The residue score of an element bij in a submatrix B is defined as

    RS(bij) = bij - biJ - bIj + bIJ

where

    biJ = (1/|J|) Σ_{j∈J} bij,
    bIj = (1/|I|) Σ_{i∈I} bij,
    bIJ = (1/(|I||J|)) Σ_{i∈I, j∈J} bij.

Here I denotes the row set and J denotes the column set of matrix B, bij denotes an element of the submatrix, biJ denotes the ith row mean, bIj denotes the jth column mean, and bIJ denotes the mean of the whole bicluster. The residue score of an element bij provides the difference between the actual value and its expected value predicted from its row mean, column mean and bicluster mean. The residue of an element is a measure of how well the entry fits into that bicluster. Hence from the value of the residue, the quality of the bicluster can be evaluated by computing the mean squared residue. That is, the Hscore or mean squared residue score of a bicluster B is

    MSR(B) = (1/(|I||J|)) Σ_{i∈I, j∈J} RS(bij)^2

Cheng and Church defined a bicluster to be a matrix with a low mean squared residue score. The maximum value of MSR that a matrix can have to be called a bicluster is called the MSR threshold and is denoted as δ. A submatrix B is called a δ bicluster if MSR(B) < δ for some δ > 0. A high MSR value signifies that the data is uncorrelated. A low MSR value means that there is correlation in the matrix. The value of δ depends on the dataset. For the Yeast dataset the value of δ is 300 and for the Lymphoma dataset the value of δ is 1200. The volume of a bicluster, or bicluster size, is the product of the number of rows and the number of columns in the bicluster.

This bicluster model is much more flexible than row clusters. The identified submatrices need neither be disjoint nor cover the entire matrix. But the computation of biclusters is costly because one will have to consider all the combinations of columns and rows in order to find all the biclusters. The search space for the biclustering problem is 2^(m+n), where m and n are the numbers of genes and conditions respectively. Usually m+n is more than 2000. The biclustering problem is NP-hard.

B. Encoding of bicluster
Each bicluster is encoded as a binary string of fixed length [4]. The length of the string is the sum of the number of rows and the number of columns of the gene expression data matrix. The first N bits represent genes and the next M bits represent conditions. A bit is set to one when the corresponding gene or condition is included in the bicluster; otherwise it is set to zero. This representation is advantageous for node addition and deletion.
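The binary-string encoding of Section II-B can be illustrated with a small, invented example (a Python/NumPy sketch, not the paper's Matlab code; the toy matrix and bit pattern are hypothetical):

```python
import numpy as np

# Toy expression matrix with N = 5 genes and M = 4 conditions.
N, M = 5, 4
data = np.arange(N * M, dtype=float).reshape(N, M)

# A bicluster as a binary string of length N + M: the first N bits
# select genes, the next M bits select conditions (1 = included).
code = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0], dtype=bool)
genes, conds = code[:N], code[N:]

# The submatrix selected by the encoding.
B = data[np.ix_(np.flatnonzero(genes), np.flatnonzero(conds))]
print(B.shape)  # (2, 2)

# Node addition or deletion is a single bit flip, e.g. adding condition 2:
conds[2] = True
```

With this representation, growing or shrinking a bicluster never copies the data matrix; only one bit of the string changes.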
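Both greedy phases of the algorithm described below repeatedly evaluate the mean squared residue of Section II-A. As a minimal sketch (in Python/NumPy rather than the authors' Matlab implementation), the computation is:

```python
import numpy as np

def msr(data, rows, cols):
    """Mean squared residue (Hscore) of the bicluster given by the
    gene index set `rows` and condition index set `cols`."""
    B = data[np.ix_(rows, cols)]                 # submatrix B
    residue = (B
               - B.mean(axis=1, keepdims=True)  # biJ: row means
               - B.mean(axis=0, keepdims=True)  # bIj: column means
               + B.mean())                      # bIJ: bicluster mean
    return float((residue ** 2).mean())

# A perfectly additive (coherent) bicluster has Hscore 0:
coherent = np.add.outer([1.0, 2.0, 3.0], [10.0, 20.0])
print(msr(coherent, [0, 1, 2], [0, 1]))  # 0.0
```

A submatrix passes the δ test of Section II-A exactly when this value is below the chosen threshold (300 for the Yeast dataset).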
C. Algorithm Description
Different types of algorithm design techniques are used to address the biclustering problem, including iterative row and column clustering combination, divide and conquer, greedy iterative search, and evolutionary or metaheuristic algorithms. Greedy iterative search methods are based on the idea of creating biclusters by adding or removing rows/columns from them, using a criterion that maximizes a local gain. In this work a greedy search method is used for finding δ biclusters and it is very fast compared to metaheuristic methods. The algorithm has two major phases. In the first phase, an initial set of seed biclusters is generated using the K-Means one way clustering algorithm. The second phase is used to enlarge the seeds by adding more rows and columns using a greedy search algorithm.

D. Seed Finding
A good seed of a bicluster is a small bicluster with a very low Hscore value. Hence in the seed there exists a possibility of accommodating more genes and conditions within the given Hscore threshold. In this algorithm a simple seed finding technique is used [5]. For finding seeds the K-Means clustering algorithm is used. K-Means is a partitional clustering algorithm. The generated clusters are disjoint, flat or non-hierarchical. The number of clusters generated should be specified as input. In the K-Means clustering algorithm the distance measure is a parameter that specifies how the distance between data points is measured. Here the cosine angle distance is selected as the distance measure. First of all, gene and condition clusters are obtained from the K-Means one way clustering algorithm.

That is, the genes in the dataset are partitioned into n gene clusters. Those clusters having more than 10 genes are further divided into groups based on cosine angle distance from the cluster centre so that each group contains a maximum of 10 genes. Similarly the conditions in the dataset are partitioned into m clusters and each cluster containing more than 5 conditions is further divided based on cosine angle distance from the cluster centre so that each group contains a maximum of 5 conditions. This yields p gene clusters and q condition clusters. All combinations of these p gene clusters and q condition clusters are found. The Hscore value for all these combinations is calculated and those with Hscore value below a certain threshold are selected as seeds. Thus the gene expression data matrix is partitioned into fixed size tightly co-regulated submatrices. The Yeast dataset is partitioned into 140 gene clusters and 3 condition clusters [4].

E. Seed growing phase
In the seed growing phase a separate list is maintained for conditions and genes not included in the bicluster. Each seed is enlarged separately by adding more genes and conditions. Initially conditions are added, followed by genes. In the modified greedy search algorithm the best element is selected from the gene list or condition list and added to the bicluster. The quality of the element is determined by the Hscore or MSR value of the bicluster after including the element in the bicluster. The element which results in the minimum Hscore value when added to the bicluster is considered as the best element. It cannot be specified as the element with the smallest incremental cost of Hscore because adding some elements reduces the Hscore value. Seed growing starts from the condition list followed by the gene list until the Hscore value reaches the given threshold. This is a greedy method since the aim is to select the next element which produces the bicluster with minimum Hscore value. This algorithm is deterministic. A pseudo-code description of the modified greedy search algorithm is given below.

F. Modified Greedy Search Algorithm

Algorithm modifiedgreedy(seed, δ)
  bicluster := seed
  Calculate Column_List, the list of conditions not included in the bicluster
  While (MSR(bicluster) <= δ)
    No_elem_Col := size(Column_List)
    for i := 1 : No_elem_Col
      bicluster := bicluster + Column_List[i]
      Column_List_msr[i] := MSR(bicluster)
      Remove Column_List[i] from bicluster
    end(for)
    Find the minimum value in Column_List_msr and the corresponding index K
    bicluster := bicluster + Column_List[K]
    Delete Column_List[K] from Column_List
  end(while)
  Calculate Row_List, the list of genes not included in the bicluster
  While (MSR(bicluster) <= δ)
    No_elem_Row := size(Row_List)
    for i := 1 : No_elem_Row
      bicluster := bicluster + Row_List[i]
      Row_List_msr[i] := MSR(bicluster)
      Remove Row_List[i] from bicluster
    end(for)
    Find the minimum value in Row_List_msr and the corresponding index J
    bicluster := bicluster + Row_List[J]
    Delete Row_List[J] from Row_List
  end(while)
end(modifiedgreedy)

G. Difference between Novel Greedy Search and Modified Greedy Search algorithms
In the novel Greedy Search algorithm [6] node (condition or gene) addition is followed by node deletion if necessary. The added node is deleted if the Hscore value of the bicluster exceeds a certain threshold. The nodes are searched sequentially. The node thus added may not be optimal in terms of Hscore value. But in the case of the Modified Greedy Search algorithm the node which results in the minimum Hscore value when joined with the bicluster is selected and added to the bicluster. Hence superior results can be obtained through the Modified Greedy Search algorithm. In Modified Greedy, before adding a node, the Hscore value of the bicluster combined with a single gene or condition has to be calculated for all the genes or conditions not included in the bicluster. Even though the Modified Greedy Search algorithm is computationally more expensive in terms of time, it is capable of obtaining larger biclusters with low Hscore value from gene expression data compared to the Novel Greedy Search algorithm.

III. EXPERIMENTAL RESULTS

A. Dataset used
The proposed algorithm is implemented in Matlab and experiments are conducted on the Yeast Saccharomyces cerevisiae cell cycle expression dataset to assess the quality of the proposed method. The dataset is based on Tavazoie et al. [7]. The dataset consists of 2884 genes and 17 conditions. The values in the expression dataset are integers in the range 0 to 600. There are 34 missing values represented by -1. The dataset is obtained from http://arep.med.harvard.edu/biclustering.

B. Bicluster Plots
In Figure 1 eight biclusters identified by the modified greedy search algorithm on the Yeast dataset are shown. From the bicluster plots it can be noticed that genes present a similar behavior under a set of conditions. Many of the biclusters found on the Yeast dataset contain all 17 conditions. Out of the eight biclusters shown in Figure 1, seven contain all 17 conditions and they differ in appearance. In short, the modified greedy search algorithm is ideal for identifying various biclusters with coherent values. Information about these biclusters is given in Table 1. All the biclusters have mean squared residue less than 300. Details about 6 more biclusters obtained using the modified greedy algorithm, whose bicluster plots are not included in Figure 1, are also given in the last six rows of Table 1. These biclusters are also taken into account while calculating the averages of mean squared residue, gene number, condition number, volume etc. for the performance comparison of modified greedy with other biclustering algorithms.

Figure 1. Eight biclusters obtained from the Yeast expression data. Bicluster labels are (a), (b), (c), (d), (e), (f), (g) and (h) respectively. In the bicluster plots the X axis contains conditions and the Y axis contains expression values. The details about the biclusters can be obtained from Table 1 using the bicluster label. Here only biclusters with different shapes are selected.
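The seed-growing loop of Section II-F can be sketched compactly as follows (a Python/NumPy illustration of the pseudocode, not the authors' Matlab implementation; `msr`, `grow_seed` and the toy matrix are names introduced here for illustration):

```python
import numpy as np

def msr(data, rows, cols):
    """Mean squared residue (Hscore) of data restricted to rows x cols."""
    B = data[np.ix_(rows, cols)]
    r = B - B.mean(axis=1, keepdims=True) - B.mean(axis=0, keepdims=True) + B.mean()
    return float((r ** 2).mean())

def grow_seed(data, seed_rows, seed_cols, delta):
    """Enlarge a seed: while the Hscore is within delta, add the condition
    (and afterwards the gene) whose inclusion yields the minimum Hscore."""
    rows, cols = list(seed_rows), list(seed_cols)

    def grow(members, candidates, score):
        while candidates and msr(data, rows, cols) <= delta:
            best = min(candidates, key=score)   # element giving minimum Hscore
            members.append(best)
            candidates.remove(best)

    # Conditions first, then genes, as in the pseudocode of Section II-F.
    grow(cols, [j for j in range(data.shape[1]) if j not in cols],
         lambda j: msr(data, rows, cols + [j]))
    grow(rows, [i for i in range(data.shape[0]) if i not in rows],
         lambda i: msr(data, rows + [i], cols))
    return rows, cols

# On a perfectly coherent (additive) matrix the Hscore stays 0, so the
# seed absorbs every gene and condition:
data = np.add.outer(np.arange(6.0), 10.0 * np.arange(6.0))
rows, cols = grow_seed(data, [0, 1], [0, 1], delta=300.0)
print(len(rows), len(cols))  # 6 6
```

Note that, as in the pseudocode, the loop condition is checked before each addition, so the last element added may push the Hscore slightly above δ.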

TABLE 1
INFORMATION ABOUT BICLUSTERS OF YEAST DATASET

Label  Rows  Columns  Bicl. Vol.  MSR
(a)    10    17       170         66.4403
(b)    17    17       289         99.3497
(c)    108   17       1836        194.5204
(d)    14    17       238         97.8389
(e)    107   17       1819        199.1857
(f)    33    17       561         99.9639
(g)    31    17       527         97.9121
(h)    1405  9        12645       299.8968
(p)    147   17       2499        200.2474
(q)    710   8        5680        199.9880
(r)    913   9        8217        256.1985
(s)    1163  8        9304        246.0037
(t)    1200  8        9600        249.9022
(u)    1355  9        12195       294.9206

In the above table the first column contains the label of each bicluster. The second and third columns report the number of rows (genes) and of columns (conditions) of the bicluster respectively. The fourth column reports the volume of the bicluster and the last column contains the mean squared residue of the bicluster. The table contains details of some more biclusters not included in Figure 1, with labels (p), (q), (r), (s), (t) and (u).

IV. COMPARISON

A comparative summarization of results on the Yeast data involving the performance of related algorithms is given in Table 2. The performance of the modified greedy algorithm is compared with that of Novel Greedy [6], SEBI [8], Cheng and Church's algorithm (CC) [3], the algorithm FLOC by Yang et al. [9], and DBF [10] for the Yeast dataset. SEBI (Sequential Evolutionary Biclustering) is based on evolutionary algorithms. In the Cheng and Church algorithm, rows/columns were deleted from the gene expression data matrix to find a bicluster. Their algorithm is based on a greedy strategy which removes rows and columns starting from the entire gene expression matrix. The model of bicluster proposed by Cheng and Church was generalized by Yang et al. (2003) for incorporating null values and for removing random interference. They developed a probabilistic algorithm FLOC that can discover a set of possibly overlapping biclusters simultaneously. Zhang et al. presented DBF (Deterministic Biclustering with Frequent pattern mining). In DBF a set of good quality bicluster seeds is generated in the first phase based on frequent pattern mining. In the second phase these biclusters are enlarged by adding more genes or conditions.

For the modified greedy search algorithm presented here the average number of conditions is better than that of CC, FLOC and DBF. The average gene number, average volume and largest bicluster size are greater than those of all other algorithms. The average mean squared residue score is better than that of all other algorithms listed in the table except DBF. As is clear from Table 2, the performance of modified greedy is better than novel greedy in terms of average mean squared residue, average gene number, average volume and largest bicluster size.

In multi-objective evolutionary computation [11] the maximum number of conditions obtained is only 11 for the Yeast dataset. But in this method there are biclusters with all 17 conditions. For the Yeast dataset the maximum number of genes obtained by this algorithm over all 17 conditions is 147, with Hscore value 200.2474. The maximum available in all the literature published so far is in the case of multi-objective PSO [12]. They obtained 141 genes for 17 conditions with Hscore value 203.25.

TABLE 2
PERFORMANCE COMPARISON BETWEEN MODIFIED GREEDY AND OTHER ALGORITHMS FOR YEAST DATASET

Algorithm        Avg. Residue  Avg. Gene Num  Avg. Cond. Num  Avg. Vol.  Largest Bicluster
Modified Greedy  185.88        515.21         13.36           4684.29    12645
Novel Greedy     199.78        94.75          14.75           1422.87    2112
CC               204.29        166.71         12.09           1576.98    4485
SEBI             205.18        13.61          15.25           209.92     1394
FLOC             187.54        195.00         12.80           1825.78    2000
DBF              114.70        188.00         11.00           1627.20    4000

In the above table the average mean squared residue, the average number of genes and conditions, the average volume and the largest bicluster size are compared for the various algorithms. For the average mean squared residue field lower values are better, whereas higher values are better for all other fields.

V. CONCLUSION

As a powerful analytical tool, biclustering finds application in the gene expression of cancerous data for the identification of coregulated genes, gene functional annotation and sample classification. In this paper a new algorithm is introduced based on the greedy search method for finding biclusters in gene expression data. In the first step the K-Means algorithm is used to group rows and columns of the data matrix separately. Then they are combined to produce submatrices. From these submatrices those with Hscore value below a certain threshold are selected as seeds, which are small tightly coregulated submatrices. Then more genes and conditions are added to these seeds using a greedy search method in which the gene or condition with minimum increase in Hscore value is added in each iteration until the Hscore value of the bicluster reaches the given threshold. Based on the algorithm implementation on the Yeast dataset a comparative assessment of the results is provided in order to demonstrate the effectiveness of the proposed method. In terms of the average mean residue score, average gene number, average volume and largest bicluster size the biclusters obtained by this method are far better than

many of the biclustering algorithms and especially the Novel
Greedy Search algorithm. Moreover this method finds high
quality biclusters that show strikingly similar up-regulations
and down-regulations under a set of experimental conditions
that can be inspected visually by using plots.

REFERENCES
[1] S. C. Madeira and A. L. Oliveira, "Biclustering Algorithms for Biological Data Analysis: A Survey", IEEE Transactions on Computational Biology and Bioinformatics, 2004, pp. 24-45.
[2] J. A. Hartigan, "Direct Clustering of a Data Matrix", Journal of the American Statistical Association, Vol. 67, no. 337, 1972, pp. 123-129.
[3] Yizong Cheng and George M. Church, "Biclustering of expression data", Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology, 2000, pp. 93-103.
[4] Anupam Chakraborty and Hitashyam Maka, "Biclustering of Gene Expression Data Using Genetic Algorithm", Proceedings of Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2005, pp. 1-8.
[5] A. Chakraborty and H. Maka, "Biclustering of gene expression data by simulated annealing", HPCASIA '05, 2005, pp. 627-632.
[6] Shyama Das and Sumam Mary Idicula, "A Novel Approach in Greedy Search Algorithm for Biclustering Gene Expression Data", accepted for presentation at the International Conference on Bioinformatics, Computational and Systems Biology (ICBCSB), Singapore, Aug 27-29, 2009.
[7] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho and G. M. Church, "Systematic determination of genetic network architecture", Nat. Genet., vol. 22, no. 3, 1999, pp. 281-285.
[8] Federico Divina and Jesus S. Aguilar-Ruiz, "Biclustering of Expression Data with Evolutionary Computation", IEEE Transactions on Knowledge and Data Engineering, Vol. 18, 2006, pp. 590-602.
[9] J. Yang, H. Wang, W. Wang and P. Yu, "Enhanced Biclustering on Expression Data", Proc. Third IEEE Symp. BioInformatics and BioEng. (BIBE'03), 2003, pp. 321-327.
[10] Z. Zhang, A. Teo, B. C. Ooi and K. L. Tan, "Mining deterministic biclusters in gene expression data", Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'04), 2004, pp. 283-292.
[11] H. Banka and S. Mitra, "Multi-objective evolutionary biclustering of gene expression data", Journal of Pattern Recognition, Vol. 39, 2006, pp. 2464-2477.
[12] Junwan Liu, Zhoujun Li and Feifei Liu, "Multi-objective Particle Swarm Optimization Biclustering of Microarray Data", IEEE International Conference on Bioinformatics and Biomedicine, 2008, pp. 363-366.

ADCOM 2009
AD-HOC NETWORKS

Session Papers:

1. Rajiv Saxena and Alok Singh, "Solving Bounded Diameter Minimum Spanning Tree Problem Using Improved Heuristics"

2. Santosh Kulkarni and Prathima Agrawal, "Ad-hoc Cooperative Computation in Wireless Networks using Ant like Agents"

3. Natarajan Meghanathan and Ayomide Odunsi, "A Scenario-based Performance Comparison Study of the Fish-eye State Routing and Dynamic Source Routing Protocols for Mobile Ad hoc Networks"

Solving Bounded-Diameter Minimum Spanning
Tree Problem Using Improved Heuristics
Rajiv Saxena and Alok Singh
Department of Computer and Information Sciences
University of Hyderabad
Hyderabad 500046, Andhra Pradesh, India
rajiiv123@gmail.com, alokcs@uohyd.ernet.in

Abstract—The bounded-diameter minimum spanning tree (BDMST) problem is to find a minimum spanning tree of a given connected, undirected, edge-weighted graph G in which no path between any two vertices contains more than k edges. The problem is known to be NP-Hard for 4 ≤ k < n − 1, where n is the number of vertices in G. Therefore, we look for heuristics that find good approximate solutions. This work is an improvement over two existing greedy heuristics - Improved Randomized Greedy Heuristics (RGH-I) and Improved Centre Based Tree Construction (CBTC-I), which are themselves improved versions of the heuristics RGH and CBTC. The improvement is such that, given a bounded-diameter minimum spanning tree T as constructed by RGH or CBTC, the heuristic tries to reduce the cost of T further by disconnecting a subtree of height h rooted at a vertex v in T and attaching it to the vertex where the cost of attaching it is minimum, without violating the diameter constraint. On 25 euclidean instances and 20 non-euclidean instances of up to 1000 vertices, our approach shows substantial improvement over the solutions found by RGH-I and CBTC-I.

I. INTRODUCTION

The bounded-diameter minimum spanning tree problem is useful in many practical applications where a minimum spanning tree (MST) with a small diameter (length of the longest path in the tree) is required - such as in distributed mutual exclusion algorithms [8], in data compression for information retrieval [3] and in linear lightwave networks (LLNs) [2].

Let G = (V, E) be a connected undirected graph where V denotes the set of vertices and E denotes the set of edges. Each edge e ∈ E has a non-negative weight w(e) associated with it. The BDMST problem seeks a minimum spanning tree T on G whose diameter does not exceed a given positive integer k ≥ 2. That is,

    Minimize W(T) = Σ_{e∈T} w(e)

such that

    diameter(T) ≤ k

It is to be noted that for diameter bound k = n − 1 the problem is nothing but finding an MST of G, for which we already have polynomial-time exact algorithms (Prim's or Kruskal's). When k = 2, the BDMST takes the form of a star; computing the cost of each of the n possible stars and selecting the smallest-weight star as the solution takes O(n^2) time. When k = 3, the BDMST takes the form of a dipolar star, where every node must be of degree 1 except at most two nodes. To compute the BDMST in this case we consider each edge e of the graph one by one and make its endpoints the two vertices whose degree can exceed 1. For each of the remaining n − 2 nodes, whose degree is 1, a comparison determines to which of the two nodes of degree ≥ 2 it should be connected. This has to be repeated for every edge in G, and then a spanning tree with the smallest cost is selected. In a complete graph with m edges, the total number of comparisons thus required is (n − 2)m, which is O(n^3). Finally, when all the edge weights are the same, a minimum-diameter spanning tree can be constructed using breadth-first search in O(mn) time. In the remaining general cases the BDMST problem is NP-Hard [4].

The diameter of a tree is the maximum eccentricity of its vertices. The eccentricity of a vertex v is the length of the longest path from v to any other vertex. The vertex with minimum eccentricity defines the centre of the tree. Every tree has either one or two centres: if the diameter is even, a single vertex is the centre; if the diameter is odd, two adjacent vertices form the centre of the tree.

Abdalla et al. [1] presented a greedy heuristic called One-Time Tree Construction (OTTC) for solving the BDMST problem. OTTC is a modification of Prim's algorithm that starts with a vertex and grows the spanning tree by connecting the nearest unconnected vertex to the partially built spanning tree without violating the diameter constraint. It keeps track of the eccentricity of each vertex so that no vertex's eccentricity exceeds the diameter bound k.

Raidl and Julstrom [7] proposed a randomized greedy heuristic (RGH). Their algorithm starts by fixing the tree's centre: a vertex v0 is chosen at random from the set of vertices V as the centre vertex, and if the diameter bound k is odd then another vertex v1 is chosen at random and {v0, v1} forms the centre of the tree. RGH maintains the diameter constraint by maintaining the depth of each vertex in the tree, i.e., the number of edges on the path from the tree's centre to the vertex. No vertex in T can have depth greater than ⌊k/2⌋. This is based on an important observation by Handler [5] that in a tree of diameter k, no vertex is more than ⌊k/2⌋ edges from the tree's centre. Thus, by fixing the tree's centre and using Handler's observation, RGH grows the spanning tree such that no vertex has depth greater than ⌊k/2⌋. On the test instances considered, RGH outperforms OTTC substantially.
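The definitions used here — eccentricity, diameter and centre — are easy to check mechanically. The following Python sketch is purely illustrative and ours, not from the paper (whose implementations are in C); it computes a tree's diameter as the maximum eccentricity of its vertices and tests the constraint diameter(T) ≤ k:

```python
from collections import deque

def eccentricity(adj, src):
    """Length (in edges) of the longest shortest path from src, via BFS."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

def tree_diameter(adj):
    """Diameter of a tree = the maximum eccentricity over all vertices."""
    return max(eccentricity(adj, v) for v in adj)

def is_feasible(adj, k):
    """A spanning tree is feasible for the BDMST iff its diameter is <= k."""
    return tree_diameter(adj) <= k
```

On a path 0-1-2-3 the diameter is 3, so the tree is feasible for k = 3 but not for k = 2.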

Julstrom [6] later proposed a fully greedy heuristic, a modified version of RGH called Centre Based Tree Construction (CBTC). In CBTC, instead of selecting each next vertex at random, the heuristic selects an unconnected vertex v ∉ T and connects it to a vertex already in T via an edge with the smallest weight. On 20 non-euclidean instances whose edge weights were chosen at random, CBTC outperforms RGH, but on euclidean instances RGH outperforms CBTC.

Singh and Gupta [9] presented improved versions of RGH and CBTC that further reduce the cost of the BDMST obtained by these heuristics. The improved version of RGH is RGH-I and, correspondingly, that of CBTC is CBTC-I. Two improvements were proposed for each of these heuristics - one concerned with the efficiency of the algorithm and the other with the solution quality of RGH/CBTC. On euclidean as well as non-euclidean instances, RGH-I (CBTC-I) outperforms RGH (CBTC) substantially.

The rest of this paper is organised as follows. The next section (Section 2) describes our heuristic that improves on RGH-I and CBTC-I for solving the BDMST problem. Section 3 presents details of the experiments and the comparative computational results. The last section (Section 4) outlines some conclusions.

II. IMPROVED GREEDY HEURISTIC (RGH+HT)

Our improved greedy heuristic is based on RGH [7] and its improved version RGH-I [9]. We call our heuristic RGH+HT. To better understand the improvement made by RGH+HT, let us first consider how RGH-I improves the cost of the BDMST constructed by RGH. RGH-I includes the following two improvements to RGH:

1) It checks, for each vertex v other than the centre vertex/vertices and the vertices connected immediately to the centre, whether v can be connected to a better vertex whose depth is less than the depth of v. In that case the subtree rooted at vertex v is disconnected from its current parent and connected to the newly selected vertex.
2) It uses a sorted cost matrix for two purposes: first, to select a better vertex for v as in the previous case; and second, where the greedy step is applied in RGH, i.e., instead of searching the candidate set of vertices C for the lowest-weight edge when |C| > n/10, only the first n/10 elements of the row of the sorted cost matrix corresponding to vertex v are searched.

RGH+HT retains the second improvement of RGH-I, as it improves the speed with which the search process and improvements are carried out. We have modified the first improvement of RGH-I. With RGH-I, a subtree rooted at vertex v can be connected only to a vertex whose depth is less than the depth of v, and only if that vertex offers a lower cost than v's current parent. However, such an operation may not always lead to the best possible improvement. A better improvement may be possible if the subtree can be connected to any vertex, as long as the cost is reduced and the feasibility of the resulting solution is maintained. This requires computing the height of the subtree rooted at vertex v before making this decision. The improvement is shown in Figure 1. As shown in the figure, RGH+HT allows a subtree rooted at vertex v to be disconnected from its current parent and attached to any vertex marked with a rectangle or a circle (assuming all these vertices offer reduced cost). The vertices shown with rectangles are also candidate vertices in RGH-I, whereas RGH+HT allows the additional candidate vertices shown with circles. Thus RGH+HT presents a larger set of candidate vertices for improvement than RGH-I. Note that, because RGH+HT allows a vertex to be connected to a vertex whose depth is greater than its current depth, we must not connect it to any of its descendant vertices (shown in Figure 1 as upside triangles); otherwise the tree would become disconnected. Pseudocode for RGH+HT based on these observations is given in Pseudocode 1.

Fig. 1. Possible candidate set of vertices to which the subtree rooted at vertex v can be attached. (Legend: centre vertex; parent(v); v; candidate vertices in RGH-I and RGH+HT; candidate vertices only in RGH+HT; violates diameter constraint in RGH+HT; descendant vertex, cannot be a candidate in RGH+HT.)

RGH+HT makes multiple passes over the set of vertices until no further improvement is possible, as shown in line 2 of the pseudocode. The height of a subtree at step 11 can be computed by performing breadth-first search starting at the root vertex of that subtree. While performing this breadth-first search we also keep track of all the descendant vertices of the root vertex, so that we do not attach v0 to any of its descendants, as specified in step 15. The vertex that offers the maximum reduction in cost for v0 is selected from the row of the sorted cost matrix corresponding to v0 in step 13. Step 16 is the condition that maintains the diameter of the tree T while connecting the subtree to its newly found better vertex. Once a better vertex has been found, we connect the subtree to it (step 18). After connecting the subtree to the new vertex, the depth of each vertex belonging to the subtree, including the root vertex, is updated (step 20).

The basic idea behind RGH+HT was also mentioned in [9]. However, it was not implemented and tested there, as the main purpose of [9] was to use improved RGH repeatedly within a genetic algorithm, where it could have slowed down the overall algorithm significantly.
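The reattachment test at the heart of RGH+HT can be stated compactly. The Python sketch below is illustrative only (the authors' implementation is in C), and the names `best_reattachment`, `desc` and `w` are ours; it searches for the cheapest new parent for a subtree, excluding the subtree's own descendants and enforcing the ⌊k/2⌋ depth bound:

```python
def best_reattachment(v, parent, depth, height, desc, w, k, vertices):
    """Find the cheapest feasible new parent for the subtree rooted at v.

    A candidate u must not be v itself, v's current parent, or one of v's
    descendants (attaching the subtree below its own descendant would
    disconnect the tree), and must satisfy
        height[v] + 1 + depth[u] <= floor(k / 2),
    so every vertex of the moved subtree stays within floor(k/2) edges of
    the tree's centre. Returns the best candidate, or None if no candidate
    beats the weight of the current parent edge.
    """
    best, best_cost = None, w[(parent[v], v)]
    for u in vertices:
        if u == v or u == parent[v] or u in desc[v]:
            continue
        if height[v] + 1 + depth[u] <= k // 2 and w[(u, v)] < best_cost:
            best, best_cost = u, w[(u, v)]
    return best
```

For example, with k = 4 a leaf v at depth 2 may be moved under another depth-1 vertex if that edge is cheaper; with k = 2 only the centre itself remains a feasible parent.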

Pseudocode 1 RGH+HT
Require: BDMST T as computed by RGH
Ensure: diam(T) ≤ k
 1: moreimp ← true // while further improvements are possible
 2: while (moreimp) do
 3:   moreimp ← false
 4:   U ← V − {c0} // vertices in U, without centres c0 or c1
 5:   if odd(k) then
 6:     U ← U − {c1}
 7:   end if
 8:   while (U ≠ ∅) do
 9:     v0 ← random(U)
10:     U ← U − {v0}
11:     ht ← height of subtree(v0)
12:     pv0 ← parent(v0)
13:     minvtx ← next min cost(v0)
14:     while (minvtx ≠ pv0) do
15:       if (minvtx ∉ desc(v0)) then
16:         if ((ht + 1 + depth[minvtx]) ≤ ⌊k/2⌋) then
17:           moreimp ← true
18:           T ← T − {(pv0, v0)} + {(minvtx, v0)}
19:           parent[v0] ← minvtx
20:           for each vertex w in the subtree rooted at v0 do
21:             depth[w] ← depth[parent(w)] + 1
22:           end for
23:           break
24:         end if
25:       end if
26:       minvtx ← next min cost(v0)
27:     end while
28:   end while
29: end while

The improvement for CBTC [6] is the same as specified in the pseudocode for RGH+HT. Once the tree constructed by CBTC is known, we perform the same steps as in RGH+HT. CBTC+HT is CBTC with these improvements.

III. COMPUTATIONAL RESULTS

A. Experimental Setup

All our heuristics are coded in C and executed on an Intel Core 2 Duo 3.00 GHz CPU with 2 GB of RAM in a Linux environment (openSUSE 10.3). CBTC, CBTC-I and CBTC+HT were executed n times on each instance of size n, starting from each vertex in turn. RGH, RGH-I and RGH+HT were executed n times, starting from a randomly chosen vertex each time.

B. Test Instances Description

We have compared the performance of the various heuristics on euclidean as well as non-euclidean instances. The problem instances used in our experiments are the same standard BDMST benchmark instances as used in [6] and [9]. There are 45 instances in total. Twenty-five of them are euclidean, with five instances for each value of n ∈ {50, 100, 250, 500, 1000}. These instances can be downloaded from Beasley's OR-library (www.people.brunel.ac.uk/∼mastjjb/jeb/info.html), where they are listed as instances of the euclidean Steiner tree problem. Euclidean instances consist of n points randomly chosen in the unit square. These points are treated as the vertices of a complete graph whose edge weights are the euclidean distances between the points. The library contains 15 instances for each n, and the first 5 of them are used for the BDMST problem. The diameter bound k is taken to be 5, 10, 15, 20 and 25 for n = 50, 100, 250, 500 and 1000 respectively.

Twenty more instances, five each for n = 100, 250, 500 and 1000 vertices, were created by Julstrom [6]. These are non-euclidean or random instances, which are complete graphs with edge weights chosen at random from the interval [0.01, 0.99]. The diameter bound is taken to be 10, 15, 20 and 25 for n = 100, 250, 500 and 1000 vertices respectively.

C. Results of Experiments

Tables I and II report the results of the various heuristics on euclidean instances, whereas Tables III and IV do the same on non-euclidean instances. For each instance these tables list n, the diameter bound k, and the best and average solutions and standard deviation (SD) of solutions after running the heuristics n times on each instance. The best results over the three heuristics are printed in bold. It is clear from these tables that:

1) On euclidean instances, RGH+HT obtained the best solutions (Table I). RGH+HT outperforms RGH-I, which so far gave the most promising results on euclidean instances, both in terms of best cost and average cost on all the instances (except instances of size 50, where the best costs of RGH-I and RGH+HT are the same; RGH+HT nevertheless gives better average values).
2) On non-euclidean instances, it is CBTC+HT (Table IV) that outperforms all the other heuristics. On all the instances it gives better results not only in terms of best cost but also in terms of average cost.
3) On euclidean instances, CBTC+HT (Table II) shows substantial improvements in comparison to CBTC-I in terms of best values for all the instances with n = 1000.
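Steps 11 and 15 of Pseudocode 1 both need information about the subtree rooted at v0, and a single breadth-first search supplies both: the subtree's height and its set of descendants. A minimal Python sketch (the function and argument names are ours, not the authors' C code):

```python
from collections import deque

def subtree_info(children, root):
    """One BFS over the subtree rooted at `root`: returns (height, descendants).

    `children` maps each vertex to the list of its children in the tree.
    The height feeds the feasibility test ht + 1 + depth[minvtx] <= k // 2
    (step 16 of Pseudocode 1), and the descendant set prevents attaching the
    subtree below one of its own vertices (step 15), which would disconnect
    the tree.
    """
    height, desc = 0, set()
    q = deque([(root, 0)])
    while q:
        u, d = q.popleft()
        height = max(height, d)
        for c in children.get(u, []):
            desc.add(c)
            q.append((c, d + 1))
    return height, desc
```

Since the BFS visits each subtree vertex once, collecting both quantities adds no asymptotic cost over computing the height alone.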

TABLE I
RESULTS OF RGH, RGH-I AND RGH+HT ON 25 EUCLIDEAN INSTANCES HAVING 50, 100, 250, 500 AND 1000 VERTICES

Instance RGH RGH-I RGH+HT


n k Number Best Avg. SD Best Avg. SD Best Avg. SD
50 5 1 9.34 12.82 2.48 8.53 12.57 2.13 8.53 12.56 2.14
2 8.98 11.56 1.56 8.74 11.39 1.48 8.74 11.39 1.48
3 8.76 11.54 1.90 8.28 10.66 1.21 8.28 10.66 1.21
4 7.47 10.57 1.66 7.54 9.83 1.51 7.54 9.80 1.52
5 8.79 10.91 1.61 8.59 10.52 1.48 8.59 10.49 1.48
100 10 1 9.35 10.77 0.81 9.16 10.21 0.75 8.88 9.96 0.72
2 9.41 10.80 0.81 9.09 10.45 1.00 8.68 10.16 1.00
3 9.75 11.25 0.90 9.39 10.73 0.70 9.25 10.46 0.71
4 9.55 11.03 0.89 9.14 10.57 0.88 8.95 10.35 0.85
5 9.78 11.36 1.06 9.61 10.95 0.87 9.09 10.65 0.93
250 15 1 15.14 16.51 0.69 14.61 15.89 0.45 14.04 15.08 0.49
2 15.20 16.33 0.67 14.82 15.73 0.42 14.11 14.99 0.48
3 15.08 16.19 0.56 14.75 15.68 0.44 13.80 14.86 0.47
4 15.49 16.77 0.62 15.14 16.15 0.43 14.24 15.38 0.48
5 15.42 16.53 0.58 14.99 15.91 0.45 14.11 15.10 0.48
500 20 1 21.72 22.86 0.51 21.10 22.07 0.39 19.39 20.40 0.43
2 21.46 22.52 0.46 20.81 21.78 0.38 19.09 20.17 0.42
3 21.51 22.78 0.50 20.89 22.03 0.37 19.42 20.41 0.41
4 21.82 22.85 0.47 21.15 22.10 0.39 19.41 20.46 0.46
5 21.37 22.52 0.51 20.84 21.75 0.39 18.86 20.05 0.44
1000 25 1 30.97 32.19 0.41 29.93 31.17 0.40 27.22 28.26 0.43
2 30.90 32.05 0.42 29.85 31.04 0.39 27.08 28.12 0.41
3 30.69 31.77 0.42 29.36 30.77 0.38 26.80 27.83 0.40
4 30.93 32.18 0.43 29.99 31.13 0.38 27.05 28.21 0.40
5 30.85 31.93 0.42 29.81 30.89 0.39 26.50 27.91 0.42

TABLE II
RESULTS OF CBTC, CBTC-I AND CBTC+HT ON 25 EUCLIDEAN INSTANCES HAVING 50, 100, 250, 500 AND 1000 VERTICES

Instance CBTC CBTC-I CBTC+HT


n k Number Best Avg. SD Best Avg. SD Best Avg. SD
50 5 1 13.84 21.86 5.27 13.28 21.80 5.33 13.28 21.80 5.33
2 13.32 19.29 3.68 13.19 19.23 3.73 13.19 19.23 3.73
3 11.62 19.10 3.79 11.59 19.06 3.82 11.59 19.06 3.82
4 11.04 16.86 3.64 10.78 16.79 3.65 10.78 16.79 3.65
5 12.31 18.36 3.25 12.31 18.30 3.28 12.31 18.30 3.28
100 10 1 17.50 28.80 7.02 17.35 28.66 7.06 17.34 28.60 7.09
2 15.02 26.95 6.16 14.17 26.77 6.24 14.17 26.56 6.33
3 18.37 29.66 7.62 17.70 29.48 7.71 15.75 29.28 7.86
4 15.11 28.77 7.81 14.92 28.65 7.87 14.90 28.48 7.93
5 15.73 29.46 7.72 14.78 29.30 7.83 12.82 29.18 7.88
250 15 1 41.61 72.35 19.86 39.70 72.07 19.92 37.64 71.63 20.16
2 32.43 75.52 19.44 31.59 75.35 19.49 28.90 74.73 19.46
3 32.65 70.60 18.09 32.01 70.32 18.22 27.31 69.67 18.66
4 32.29 76.23 20.07 31.78 76.09 20.15 29.42 75.44 19.86
5 35.90 71.56 17.90 35.79 71.40 17.97 35.66 70.66 17.85
500 20 1 80.76 150.68 39.02 72.07 150.46 41.40 48.18 148.07 40.65
2 70.44 148.75 39.89 70.17 148.54 39.96 60.15 146.37 40.38
3 69.37 153.17 39.02 68.83 152.96 39.11 45.49 149.61 40.86
4 63.88 150.98 39.18 63.17 150.79 39.24 63.00 148.34 40.22
5 72.36 150.68 41.33 72.07 150.46 41.40 41.77 146.80 42.73
1000 25 1 173.23 327.50 82.96 172.62 327.30 83.02 90.01 321.07 84.90
2 173.85 323.72 81.34 173.06 323.50 81.41 95.83 318.59 83.48
3 175.80 321.25 83.04 175.47 321.04 83.10 94.02 312.70 85.72
4 163.89 323.45 80.13 163.43 323.23 80.22 81.39 317.02 83.17
5 149.36 325.96 78.34 148.37 325.76 78.41 70.55 318.52 81.37

TABLE III
RESULTS OF RGH, RGH-I AND RGH+HT ON 20 NON-EUCLIDEAN INSTANCES HAVING 100, 250, 500 AND 1000 VERTICES

Instance RGH RGH-I RGH+HT


n k Number Best Avg. SD Best Avg. SD Best Avg. SD
100 10 1 3.96 5.47 0.60 3.30 4.39 0.49 3.02 3.89 0.47
2 4.01 5.41 0.59 3.40 4.30 0.51 2.72 3.85 0.54
3 4.50 5.68 0.57 3.43 4.59 0.57 2.78 4.01 0.51
4 4.16 5.20 0.53 2.95 4.24 0.47 2.57 3.80 0.48
5 4.21 5.50 0.58 3.56 4.53 0.44 3.01 4.01 0.48
250 15 1 6.17 7.73 0.59 5.12 6.43 0.52 4.26 5.44 0.48
2 6.27 7.64 0.56 4.73 6.31 0.49 4.31 5.45 0.48
3 6.35 7.62 0.55 5.07 6.37 0.49 4.30 5.45 0.47
4 6.21 7.63 0.60 5.15 6.43 0.55 4.51 5.49 0.45
5 6.51 7.81 0.52 5.28 6.60 0.53 4.66 5.69 0.45
500 20 1 9.36 10.72 0.55 7.50 8.86 0.55 6.54 7.44 0.42
2 9.27 10.79 0.56 7.71 8.94 0.53 6.70 7.53 0.42
3 9.16 10.70 0.59 7.36 8.89 0.54 6.55 7.46 0.42
4 9.13 10.69 0.60 7.66 8.91 0.52 6.68 7.50 0.42
5 9.18 10.66 0.54 7.46 8.89 0.57 6.67 7.49 0.42
1000 25 1 14.83 16.36 0.58 12.80 14.33 0.57 11.69 12.57 0.40
2 14.93 16.36 0.57 12.81 14.34 0.57 11.64 12.61 0.41
3 14.90 16.40 0.58 12.89 14.37 0.58 11.70 12.61 0.39
4 14.52 16.29 0.57 12.83 14.26 0.55 11.58 12.53 0.41
5 14.80 16.43 0.59 12.93 14.43 0.53 11.78 12.68 0.41

TABLE IV
RESULTS OF CBTC, CBTC-I AND CBTC+HT ON 20 NON-EUCLIDEAN INSTANCES HAVING 100, 250, 500 AND 1000 VERTICES

Instance CBTC CBTC-I CBTC+HT


n k Number Best Avg. SD Best Avg. SD Best Avg. SD
100 10 1 2.58 3.23 0.35 2.53 3.06 0.29 2.53 2.91 0.25
2 2.55 3.09 0.31 2.43 2.93 0.28 2.36 2.78 0.26
3 2.66 3.48 0.44 2.61 3.32 0.39 2.49 3.16 0.36
4 2.45 3.03 0.29 2.38 2.87 0.25 2.37 2.74 0.22
5 2.71 3.34 0.39 2.63 3.17 0.33 2.58 3.00 0.27
250 15 1 3.96 4.40 0.22 3.93 4.30 0.19 3.88 4.15 0.14
2 4.09 4.45 0.24 4.02 4.31 0.20 3.97 4.16 0.13
3 3.87 4.33 0.21 3.83 4.21 0.17 3.82 4.08 0.13
4 3.92 4.40 0.24 3.88 4.29 0.20 3.85 4.15 0.15
5 4.16 4.63 0.26 4.11 4.48 0.21 4.05 4.31 0.15
500 20 1 6.34 6.70 0.16 6.31 6.61 0.14 6.29 6.48 0.96
2 6.47 6.82 0.17 6.43 6.72 0.14 6.38 6.58 0.97
3 6.34 6.66 0.16 6.30 6.57 0.13 6.24 6.44 0.10
4 6.39 6.77 0.15 6.36 6.68 0.13 6.31 6.54 0.09
5 6.41 6.75 0.17 6.37 6.65 0.14 6.30 6.52 0.10
1000 25 1 11.37 11.66 0.15 11.33 11.57 0.12 11.29 11.45 0.08
2 11.40 11.68 0.14 11.38 11.60 0.12 11.32 11.48 0.08
3 11.42 11.69 0.15 11.38 11.61 0.12 11.35 11.49 0.08
4 11.30 11.58 0.14 11.26 11.50 0.11 11.22 11.39 0.08
5 11.47 11.76 0.13 11.43 11.68 0.11 11.39 11.56 0.08

IV. CONCLUSIONS

We have improved the results of the RGH-I and CBTC-I heuristics on both euclidean and non-euclidean instances of the BDMST problem. The improved heuristics, RGH+HT and CBTC+HT, take into consideration the height of a subtree before connecting it to some other vertex of the tree, thus allowing more candidate vertices for improvement than RGH-I and CBTC-I. After attaching the subtree to a new, better vertex, they update the depth of all the vertices in the subtree. This, along with multiple passes over the list of vertices, results in better solution values for the BDMST problem.

As future work, we plan to develop hybrid approaches for the BDMST problem, combining RGH+HT with metaheuristics such as the genetic algorithm of [9].

REFERENCES

[1] A. Abdalla, N. Deo, and P. Gupta, "Random-tree diameter and the diameter constrained MST," Congressus Numerantium, vol. 144, 2000, pp. 161-182.
[2] K. Bala, K. Petropoulos, and T.E. Stern, "Multicasting in a linear lightwave network," Proceedings of IEEE INFOCOM '93, pp. 1350-1358.
[3] A. Bookstein and S.T. Klein, "Compression of correlated bit-vectors," Information Systems, vol. 16, 1990, pp. 387-400.
[4] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York, 1979.
[5] G.Y. Handler, "Minimax location of a facility in an undirected graph," Transportation Science, vol. 7, 1978, pp. 287-293.
[6] B.A. Julstrom, "Greedy heuristics for the bounded-diameter minimum spanning tree problem," ACM Journal of Experimental Algorithmics, vol. 14, 2009, pp. 1-14.
[7] G.R. Raidl and B.A. Julstrom, "Greedy heuristics and an evolutionary algorithm for the bounded-diameter minimum spanning tree problem," Proceedings of the ACM Symposium on Applied Computing, 2003, pp. 747-752.
[8] K. Raymond, "A tree-based algorithm for distributed mutual exclusion," ACM Transactions on Computer Systems, vol. 7, 1989, pp. 61-77.
[9] A. Singh and A.K. Gupta, "Improved heuristics for the bounded-diameter minimum spanning tree problem," Soft Computing, vol. 11, 2007, pp. 911-921.

Ad-hoc Cooperative Computation in Wireless Networks using Ant like Agents

Santosh Kulkarni
Auburn University
Computer Science & Software Engineering
Auburn, AL 36849-5347, USA
santosh@auburn.edu

Prathima Agrawal
Auburn University
Electrical & Computer Engineering
Auburn, AL 36849-5347, USA
pagrawal@eng.auburn.edu

Abstract

Mobile applications continue to soar in popularity as they provide their users the convenience of accessing services from anywhere, at any time. The underlying computing devices for such applications, however, are often limited in their battery and processing power, primarily due to size and weight restrictions. Running complex applications on such resource-limited devices has always been a challenge. In the work presented here, we address this challenge by proposing a cooperative paradigm for ad-hoc computation in wireless networks. In this model, a set of heterogeneous computing devices cooperate to dynamically form a distributed computation system. Whenever a resource-limited computing device in such a system has a resource-consuming application to run, it uses the resources of other devices to overcome its own limitations. The proposed paradigm is based on the concept of execution migration and relies on migratory execution units called Ant Agents to seek out spare resources available in the network. Simulation results for the proposed model demonstrate that cooperative ad-hoc computation is indeed beneficial for resource-constrained wireless devices.

1. Introduction

The shrinking size and increasing density of wireless devices have profound implications for the future of wireless communications. Today's laptops and wireless phones may soon be outnumbered by ubiquitous computing devices such as smart dust [24], micro sensors and micro robots [17]. In fact, there are already organizations that propose embedding communication systems into cars, allowing cars to interact with other cars or with infrastructure over a Wireless Local Area Network [1]. Therefore, future generations of wireless networks are expected to have a huge number of heterogeneous mobile computing devices that are dynamically interconnected over wireless links.

Because of their size limitations, however, a large number of these devices are likely to have severe restrictions on their processing power, storage space, available memory as well as battery capacity. Unfortunately, such rigorous resource limitations in mobile computing devices preclude the full utilization of mobile applications in real-life scenarios [2]. To overcome this problem, we propose a new distributed computing model for wireless ad-hoc networks called the Ad-hoc Cooperative Computation (ACC) model. ACC is a computing model in which a set of heterogeneous computing systems dynamically forms a cooperative system. Whenever a resource-limited computing device in such a system has a resource-consuming application to run, it uses the resources of other devices to surmount its own limitations. The following scenario, presented in [2], makes a strong case for our proposed computing model.

Triage is a process executed in hospital emergency rooms to sort injured people into groups based on their need for immediate treatment. The same process is actually needed in disaster areas, which usually have a large number of casualties. In such scenarios, quickly identifying the severely injured has proved to be an effective technique for saving lives and controlling acute injuries. Unfortunately, both the technical and human resources that are readily available in emergency rooms are scarce in disaster areas. Mobile computing is now being proposed as a solution to complement, automate and expedite the triage process in disaster fields. First, low-power vital-sign sensors are attached to each patient in the field. These sensors send medical data about the patient to nearby first responders, who are provided with mobile computing devices and medical applications. These medical applications then process and analyze the received data in order to decide who needs the most immediate treatment [14]. Running such applications on resource-limited mobile computing devices is a real challenge. Such devices may not have enough energy to run the application and/or may not have enough processing power to make a timely decision. Alleviating the effect of these limitations, which is the main objective

of our proposed computing model, would undoubtedly save more lives in future emergencies.

1.1. Challenges for Cooperation

Wireless ad-hoc networks pose a unique set of challenges which make traditional distributed computing models difficult, if not impossible, to employ in alleviating the effects of resource limitations. Some of the identified challenges are:

• Network size - The number of devices working together to achieve a common goal will be orders of magnitude greater than in traditional distributed systems.

• Heterogeneous architecture - The devices are all likely to have different hardware architectures, as they are typically tailored to perform a specific task within the network.

• Unreliable links - The links in the network are inherently fragile, with device and connection failures being the norm rather than the exception.

• Limited processing power - As mobile devices are likely to have size and weight restrictions, they are limited in their processing power and battery capacity.

• Dynamic topology - The availability of devices may vary greatly with time, with devices becoming unreachable due to mobility or depletion of energy.

• Limited reach - Because of the nature of wireless communication, devices can communicate directly only with those devices that are within their transmission range.

1.2. Cooperative Setup

Applications designed to suit the ACC model will typically target specific properties within the network rather than individual devices. Such targeted properties could include specific data and/or resources that the application is interested in. From the application's point of view, devices with the same properties are interchangeable. Thus fixed naming schemes such as IP addressing are inappropriate in most situations. As discussed in [11], a naming scheme based on the content or properties of a device is more appropriate for wireless ad-hoc networks.

Due to network volatility and the dynamic binding of names to devices, distributed computing based on execution migration is more suitable for wireless ad-hoc networks than distributed computing based on data migration (message passing) [4]. Hence, the system architecture for ACC is based on the concept of execution migration. Applications that comply with the ACC model consist of migratory execution units called Ant Agents which work together to accomplish a common goal. Ant Agents (AA), similar to Mobile Agents [13], are collections of code and data blocks. They migrate through the network, executing on each device in their path, foraging for devices of interest or devices with desired properties.

The agents are also self-routing; that is, they are responsible for determining their own paths through the network. In the proposed ACC model, AAs forage the network for devices of interest using ant-like routing algorithms [3], [5], [8], [19]. Such routing algorithms, based on the behavior of social insects in nature, are known to result in optimal routes between source and destination [20]. For their part, devices in the network support AAs by providing:

• A name-based memory system, and

• An architecturally independent environment for the receipt and execution of Ant Agents.

To validate the proposed computing model, we have developed a simulator that executes AAs, allowing us to evaluate both the execution and communication time of a distributed application. In this simulator we execute applications that are modeled as per the Bag-of-Tasks paradigm of distributed computing. Simulation results show that our proposed computation model is able to significantly improve the execution times of mobile applications on resource-constrained devices.

The rest of this paper is organized as follows. The next section describes our proposed Ad-hoc Cooperative Computation model. Section 3 presents the system architecture that supports the proposed model. In Section 4 we discuss the details of Ant Agents, while in Section 5 we discuss the application paradigms implemented using our proposed model. Section 6 discusses related work and Section 7 concludes the paper.

2. Ad-hoc Cooperative Computation Model

To exploit the raw computing power of large-scale, heterogeneous, wireless ad-hoc networks we propose a distributed computing model called the Ad-hoc Cooperative Computation (ACC) model. This model is based on the social behavior of ants, which work together in groups to execute tasks that are beyond the abilities of a single member. The ACC model consists of distributed applications that are defined as a dynamic collection of Ant Agents (AA) which cooperate amongst themselves to collectively achieve a common objective. The execution of an AA can be described in two phases: a forage-and-migrate phase followed by a computation phase. The AA execution performed at

each step may differ based on the properties of its hosting device. On devices that meet the application-targeted properties (devices of interest), an AA may advance its execution state, while on other devices it only executes its routing algorithm. Like any mobile agent, an AA too carries along with it its mobile data and mobile code as well as a lightweight execution state.

Devices in the network support the reception and execution of AAs by providing an architecturally independent programming environment (e.g., a Virtual Machine [9]) as well as a name-based memory system (e.g., Tag Space [22]). The AAs, along with the system support provided by the devices in the network, form the ACC infrastructure, which allows the execution of distributed applications over ad-hoc wireless networks.

Our proposed computational model allows the user to execute distributed tasks in ad-hoc networks by simply injecting the corresponding AAs into the network. To do this, the user need not have any prior knowledge about either the scale or the topography of the network, nor about the specific functionality of the devices involved. Additionally, making the AAs intelligent eliminates the issue of implementing new protocols on all the devices in the network, a task which is difficult or even impossible using current approaches [11].

Because of their intelligence, AAs are reasonably resilient to network volatility. When certain devices become unavailable due to their mobility or energy depletion, AAs are able to adapt by either finding a new path to their destination or foraging for other devices in the network that meet the properties targeted by the application.

Figure 1. Generating prime numbers

Let us consider two example applications that demonstrate the computation and communication aspects of the proposed ACC model. Figure 1 depicts an application where the joint task is to generate all prime numbers less than some limit MAX, say 40,000. Since prime number generation is a processor-intensive task, the resource-limited source device, depicted as a black circle, injects AAs into the network seeking cooperation from other devices in the network. Because the originating device is processor-limited, the injected AAs are initialized to forage for computing cycles within the network. Once initialized, each AA forages the network for available computing resources. When an AA finds a device with enough spare CPU cycles, depicted as a gray circle, it proceeds to calculate the set of primes within its given range. Upon completion, each AA reports its results back to the originating device by tracing back its migratory route.

Figure 2. 3-D modeling using Computer-aided design

Next, let us consider a Computer-aided design application which is required to generate a 3-D model, given its top, front and profile views. Since the originating device, depicted as a black circle in Figure 2, is missing the required views, it injects an AA into the network seeking cooperation from devices which have the required data. Because the originating device is data-limited, the injected AA is initialized to forage for specific data within the network. Once initialized, the AA forages the network for the three views of the object in question. When the AA finds a device with the relevant data, depicted as a gray circle, it proceeds to process the available view. Finally, having processed all three views, the agent reports the results back to the originating device by tracing back its migratory path.

For applications that deal with large amounts of data, moving the execution to the source of the data whenever possible will improve the overall performance of the distributed system. For example, when using an AA for object recognition, performing the image analysis on the device that acquired the image, rather than transferring the image (or sequence of images) over the network, would result in improved response time and bandwidth usage while reducing the overall energy consumed. Similarly, caching frequently used code blocks on devices

98
that regularly host AAs can also limit the impact of the code transfer occurring with every injected AA.

Security is an important issue in any cooperative computing model. Addressing it in our proposed model would mean protecting AAs against malicious devices as well as protecting devices against malicious AAs. Although realizing this requires a comprehensive security framework to be in place, we limit the current architecture to a simple admission control using authentication mechanisms based on digital signatures.

3. System Architecture

Considering the heterogeneous nature of the network, the system architecture aims to place as much intelligence as possible in Ant Agents and keep the support required from devices in the network to a minimum.

Figure 3. ACC System Architecture

Figure 3 shows the system architecture support needed for the proposed ACC model. As depicted, the Security Manager first verifies the credentials of all incoming Ant Agents. Next, the Resource Manager inspects the AA's Resource Table header field to check if the listed resource estimates can be satisfied. AAs whose resource estimates can be satisfied are then queued up for execution at the Virtual Machine. The Tag Space represents the name-based memory region that stores data objects persistent across AA executions. The Virtual Machine acts as a hardware abstraction layer for loading, scheduling and executing tasks generated by incoming AAs. Post execution, the AAs are injected back into the network to allow them to migrate to their next destination.

3.1. Security Manager

To prevent excessive use of its resources, a device needs to perform some form of admission control. In the proposed architecture, the Security Manager component performs this role. It is primarily responsible for receiving incoming AAs and passing them on to the Resource Manager, subject to their approval by various admission restrictions.

3.2. Resource Manager

Each AA lists its estimated resource requirements in a Resource Table located in the AA header. The Resource Manager is responsible for receiving authenticated AAs from the Security Manager and storing them in the AA Queue, subject to the requested resource constraints being satisfied. The Resource Manager also checks whether the code section of the incoming AA is already cached locally.

3.3. Virtual Machine

The Virtual Machine is a hardware abstraction layer for the execution of AAs across all the heterogeneous hardware platforms present in the ad-hoc wireless network. Examples include the Java Virtual Machine, the K Virtual Machine, etc.

3.4. Tag Space

A Tag Space consists of a limited number of tags that are persistent across AA executions. Figure 4 from [4] illustrates the structure of a tag. It consists of an identifier, a digital signature, lifetime information and data. The identifier represents the name of the tag. The access of AAs to tags is restricted based on the digital signature. The tag lifetime specifies the time at which the tag will be reclaimed by the device from the Tag Space.

Figure 4. Tag Structure

Tags can be used for:

• Naming: AAs name the devices of interest using tag identifiers.

• Data Storage: An AA can store data in the network by creating its own tags.

• Routing: AAs use tags to create a pheromone trail of visited devices in the network, by caching the relevant IDs in the data portion of such tags.

• Synchronization: An AA can block on a specific tag pending a write of the tag by another AA. Once this tag is written, all AAs blocked on it will be woken up and made ready for execution. This way AAs can synchronize among themselves.

• Interaction with the host device: An AA can interact with the host OS and I/O system using I/O tags.
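The tag semantics described above (named, signature-guarded, lifetime-bounded entries) can be sketched in a few lines. This is a minimal, single-device illustration under our own class and method names, not the paper's implementation; blocking-based synchronization is omitted and hinted at in a comment.

```python
import time

class Tag:
    """A named, signature-guarded, lifetime-bounded entry (cf. Figure 4)."""
    def __init__(self, identifier, signature, lifetime, data):
        self.identifier = identifier        # name of the tag
        self.signature = signature          # only AAs holding it may access the tag
        self.expires_at = time.time() + lifetime
        self.data = data

class TagSpace:
    """Per-device store of tags persistent across AA executions."""
    def __init__(self, capacity=64):
        self.capacity = capacity            # "limited number of tags"
        self.tags = {}

    def write(self, identifier, signature, lifetime, data):
        self.reclaim()
        if len(self.tags) >= self.capacity and identifier not in self.tags:
            raise MemoryError("Tag Space full")
        self.tags[identifier] = Tag(identifier, signature, lifetime, data)

    def read(self, identifier, signature):
        self.reclaim()
        tag = self.tags.get(identifier)
        if tag is None:
            return None                     # caller may block/retry (synchronization use)
        if tag.signature != signature:
            raise PermissionError("signature mismatch")
        return tag.data

    def reclaim(self):
        """Reclaim expired tags, as the device would at their lifetime."""
        now = time.time()
        self.tags = {k: t for k, t in self.tags.items() if t.expires_at > now}
```

For example, an AA depositing a routing entry would call something like `ts.write("pheromone:app1", sig, lifetime=30, data=[prev_device_id])`, and a later AA of the same application reads it back with the same signature.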
4. Ant Agents

AAs, like mobile agents, are migratory execution units consisting of code, data and an execution state. In ACC, the behavior of AAs is modeled on the behavior of ants in nature. Just as ants in nature cooperate with each other to forage for food, AAs in ad-hoc networks cooperate with each other to forage for devices that satisfy the application-targeted properties (devices of interest). In the context of the proposed computing model, user applications can be viewed as a collection of AAs cooperating with each other to achieve a common goal. Such AAs are intelligent and are capable of routing themselves without needing any external support. When admitted for execution on the hosting device, the computation code within the AA is embodied into a task. During its execution this task may modify the data sections of the AA, modify the local tags to which it has access, migrate to another device or block on other tags of interest.

4.1. Format

In addition to its identity and authentication information, an AA is comprised of code and data sections, a lightweight execution state and a resource estimate table. A digital signature together with the AA and task IDs identifies an AA. The digital signature is used by the host devices to protect the access to an AA's tags. The code and data sections contain the mobile code and data that an AA carries from one device to another. The state field contains the execution context necessary for task resumption after a successful migration. The resource table consists of resource estimates such as execution time, memory requirements, etc. The resource estimates set a bound on the expected needs of the AA at the host device. Figure 5 depicts the skeletal structure of the Ant Agent.

Figure 5. Format of an Ant Agent

4.2. Life Cycle

Once initialized at the originating device, each AA follows the life cycle defined below:

1. It is subject to admission control at the next-hop destination.

2. Upon admission, a task is generated out of the AA's code and data sections and scheduled for execution.

3. After completion of execution, the AA may migrate to other devices of interest or may return to the originating device with the results of execution.

Ant Agent Admission

To avoid unnecessary resource consumption, the Security Manager executes a three-way handshake protocol for transferring AAs between neighboring devices. First, only the identification information, digital signature and resource table information is sent to the destination for admission control. If the AA admission fails, either due to security or resource constraints, the transferring task is notified so that it can decide upon subsequent action.

If the AA is accepted, the Resource Manager at the destination checks whether the code section is already cached locally. It then informs the source device to transfer only the missing sections. Thus, if code caching is enabled, the subsequent transfer cost of the code is amortized over time.

Ant Agent Execution

Upon admission, an AA becomes a task which is scheduled for execution by the Virtual Machine (VM). The execution of an AA is non-preemptive, but new AAs can be admitted during execution. An executing AA can yield the VM by blocking on a tag. The VM makes sure that a task conforms to its declared resource estimates; otherwise, the task can be forcefully removed from the system.

Ant Agent Migration

If the current computation of the AA does not complete on the hosting device, the task may continue its execution on another device. The current execution state is captured and migrated along with the code and data sections. In case the current computation of the AA does complete successfully, the execution state as well as the results are captured and migrated back to the originating device.

4.3. Routing

AAs are self-routing, i.e., they are responsible for determining their own paths through the network. Except for providing the Tag Space, there is no other system support required by the AAs for routing. An AA identifies its destination based on its application's targeted properties. However, the AA executes its routing algorithm on each device in its path. Because the AA is inspired by the behavior of ants found in nature, it deposits an artificial pheromone in the Tag Space of every device that is on the way to its destination. Like its natural counterpart, the artificial pheromone too has a lifetime and is used by other AAs to find their way through the network. Such stigmergic communication between ants in nature is known to yield near-optimal paths. The following subsection explains how.
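A minimal sketch of pheromone-guided next-hop selection under the rules just described: each deposit has a lifetime, expired deposits evaporate, and an agent follows the neighbor with the strongest surviving concentration, falling back to a random walk when no trail exists. The function names, the plain-dict trail store, and the numeric defaults are illustrative assumptions, not the paper's API.

```python
import random
import time

def deposit_pheromone(trail_store, neighbor_id, amount=1.0, lifetime=30.0):
    """Record (or reinforce) a trail entry for neighbor_id.

    trail_store maps neighbor_id -> (level, expiry_time); an expired
    entry is treated as fully evaporated before reinforcement.
    """
    now = time.time()
    level, expires = trail_store.get(neighbor_id, (0.0, now))
    if expires <= now:                       # old deposit has evaporated
        level = 0.0
    trail_store[neighbor_id] = (level + amount, now + lifetime)

def next_hop(trail_store, neighbors):
    """Follow the strongest non-expired trail; otherwise forage randomly."""
    now = time.time()
    trails = {n: lvl for n, (lvl, exp) in trail_store.items()
              if n in neighbors and exp > now}
    if trails:
        return max(trails, key=trails.get)   # strongest concentration wins
    return random.choice(neighbors)          # no trail: random-direction search
```

Because reinforcement accumulates fastest along short, frequently successful routes while unused entries expire, repeated foraging under these two rules tends to concentrate agents on near-optimal paths, mirroring the ant behavior explained next.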
Ant Cooperation in Nature

In nature, ants have the ability to find the shortest path from their colony to a food source [6]. As an ant moves, it deposits a substance called pheromone on the ground. This deposited pheromone is unique to each colony and is used by its members to establish a route to the food source. Initially, when ants start out with no prior information, they search for food by walking in random directions. When an ant finds food, it follows its pheromone trail back to the colony. In doing so the ant lays down more pheromone along its successful path. When other ants run into a trail of pheromone, they give up their own search and start following the existing trail. Since ants follow the trail with the strongest pheromone concentration, the pheromone on the branches of the shortest path to the food will grow faster compared to its concentration on other branches. As pheromone evaporates over time, the colony forgets older, sub-optimal paths.

Since AAs are modeled on the behavior described above, over time they too are expected to avoid sub-optimal paths between the originating device and the device of interest. In case the pheromone tags are missing, an AA can forage for the device of interest by spawning another AA for route discovery and blocking on its pheromone tag. A write on this tag unblocks the waiting AA, which will then resume its migration. Since the tags are persistent for their lifetime, pheromone information once acquired can be used by subsequent AAs that belong to the same application.

5. Simulations

To prove that many distributed applications can be written using Ant Agents, we have implemented a simple application belonging to the Bag-of-Tasks paradigm of distributed applications. There are three reasons for choosing an application conforming to this paradigm. First, many applications that fit this paradigm are highly computationally intensive and thus can benefit from cooperation from other devices in wireless ad-hoc networks. Second, an application following this paradigm can easily be divided into a large number of coarse-grain tasks. Third, each of these tasks is highly asynchronous and self-contained, and there is limited communication amongst the tasks. These three properties make the chosen paradigm suitable for execution in a networked environment.

5.1. Bag-of-tasks Paradigm

The bag-of-tasks paradigm applies to the situation where the same function is to be executed a large number of times for a range of different parameters. If applying the function to a set of parameters constitutes a task, then the collection of all tasks that need to be solved is called the bag of tasks. Such a collection of tasks need not be solved in any particular order. Workers are entities capable of executing and solving tasks from the bag. At each iteration a worker grabs one task from the bag and computes the result.

Bag-of-tasks applications share a general structure. The first step is to initialize the problem data. Then the bag of tasks is created, where the termination condition either represents a fixed number of iterations or is given implicitly by reading input values from a file until the end of the file. The actual computation is represented by a loop, which is repeated until the bag is empty. Multiple workers may execute the loop independently. All workers have shared access to the task bag and the output data. Each worker repeatedly removes a task, solves it by applying the main compute function to it, and writes the results into a file.

5.2. ACC Implementation

Because the number of tasks in the bag-of-tasks paradigm may be large, it is useful to allow the tasks to be generated on the fly. An AA is created specifically for this purpose. This task-generating AA, called the Generator AA, typically stays at the originating device. When the number of tasks in the task pool falls below a certain threshold, the Generator AA generates additional new tasks. The Generator AA terminates when the bag of tasks becomes empty.

After generating an initial number of tasks, the Generator AA injects them as AAs into the network. These AAs independently forage the network for adequate computing resources. When an AA discovers a device of interest, it migrates there and starts executing its task. Post completion, the results are returned to the Generator AA on the originating device. The Generator AA injects a new AA into the network for every execution result received. This new AA can swiftly migrate to a device of interest by following the pheromone trail of the previous successful AAs. The number of AAs in the network is not fixed but can be dynamically changed to adapt to changes in the network.

Figure 6 depicts a snapshot of the ACC implementation with the Generator AA located at the originating device (black circle) while four application AAs execute at four different devices of interest (gray circles). The arrows indicate the back-and-forth migration of the AAs.

5.3. Performance

The bag-of-tasks paradigm is widely used in many scientific computations. Our experiments with this paradigm were based on a Monte Carlo simulation of a model of light transport in organic tissue [21]. The simulation runs as follows. Once launched, a photon is moved a distance where
it may be scattered, absorbed, propagated undisturbed, internally reflected or transmitted out of the tissue. The photon is repeatedly moved until it either escapes from or is absorbed by the tissue. This process is repeated until the desired number of photons has been propagated.

Because the model assumes that the movement of each photon in the tissue is independent of all other photons, this simulation fits well in the bag-of-tasks paradigm. The experiment results are shown in Figure 7. The graph presents a near-linear speedup for each additional AA injected into the network.

Figure 6. Illustration of B-o-T implementation

Figure 7. Speedup for B-o-T experiments

6. Related Work

Ant Agents bear some similarity to Active Messages [7], Active Networks [18], [23], [25], Mobile Agents [10], [16] and Smart Messages [12]. Although Ant Agents borrow implementation solutions from all of them, the concept is markedly different.

Similar to Active Messages [7], the receipt of an Ant Agent at any device in the network leads to the execution of some code block on the receiving device. However, while Active Messages point to the handler at the receiving device, Ant Agents carry their own code with them. Moreover, Ant Agents and Active Messages address completely different problems. While Active Messages target fast communication in system-area networks, Ant Agents are meant to address large, heterogeneous, ad-hoc wireless networks.

The Smart Packets [23] architecture provides a flexible means of network management through the use of mobile code. Smart Packets are implemented over IP, using the IP option header. They are routed just like other data traffic in the network and only execute on arrival at a specific location [4]. Unlike Smart Packets, Ant Agents are executed at each hop in the network, not only to deposit their artificial pheromone but also to determine their next hop towards the destination. Additionally, Ant Agents carry their execution context along with them.

The ANTS [25] capsule model of programmability allows forwarding code to be carried and safely executed inside the network by a Java VM [4]. When compared to Ant Agents, we find that ANTS does not migrate the execution state from device to device. Also, ANTS targets IP networks while Ant Agents target large, heterogeneous, wireless ad-hoc networks.

A Mobile Agent [16] may be viewed as a task that explicitly migrates from node to node, assuming that the underlying network assures its transport between them [4]. Unlike mobile agents, however, Ant Agents are responsible for their own routing in the network. The ACC architecture further defines the infrastructure that devices in the network must implement in order to support Ant Agents.

Ant Agents are similar to Smart Messages [12], which also use migration of code in wireless networks. Apart from being responsible for their own routing, Smart Messages also carry with them their execution state as well as their code and data blocks during every migration. However, unlike Smart Messages, Ant Agents are modeled on ants in nature. Hence, Ant Agents achieve stigmergic communication by recording their pheromone in the Tag Space of every device visited, which can be sensed by other Ant Agents belonging to the same distributed application. Such ant-like behavior, when employed in large numbers, is known to yield the emergence of optimal paths [15].

7. Conclusions

This paper has described a computing paradigm for large-scale, heterogeneous, wireless ad-hoc networks. In the proposed model, distributed applications are implemented as a collection of Ant Agents. The model overcomes the scale, heterogeneity and connectivity issues by placing the intelligence in migratory execution units. The devices in the
network cooperate by providing a common minimal system support for the receipt and execution of Ant Agents. Simulations for the Bag-of-Tasks family of distributed applications demonstrated that Ad-hoc Cooperative Computation represents a flexible and simple solution for surmounting resource constraints on mobile computing devices.

References

[1] Car 2 Car Communication Consortium. http://www.car-to-car.org/.
[2] W. Alsalih, S. Akl, and H. Hassanein. Cooperative ad hoc computing: towards enabling cooperative processing in wireless environments. International Journal of Parallel, Emergent and Distributed Systems, 23(1):58–79, February 2008.
[3] J. S. Baras and H. Mehta. A probabilistic emergent routing algorithm for mobile ad hoc networks, 2003.
[4] C. Borcea, D. Iyer, P. Kang, A. Saxena, and L. Iftode. Cooperative computing for distributed embedded systems. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS 2002), pages 227–236, 2002.
[5] G. D. Caro, F. Ducatelle, and L. M. Gambardella. AntHocNet: An adaptive nature-inspired algorithm for routing in mobile ad hoc networks. European Transactions on Telecommunications, 16:443–455, 2005.
[6] G. D. Caro and M. Dorigo. AntNet: A mobile agents approach to adaptive routing. Technical report, 1997.
[7] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: a mechanism for integrated communication and computation, 1992.
[8] M. Günes and O. Spaniol. Routing algorithms for mobile multi-hop ad-hoc networks. In Proceedings of the International Workshop on Next Generation Network Technologies, European Commission Central Laboratory for Parallel Processing, Bulgarian Academy of Sciences, 2002.
[9] R. P. Goldberg. Survey of virtual machine research. IEEE Computer, pages 34–45, June 1974.
[10] R. Gray, D. Kotz, G. Cybenko, and D. Rus. Mobile agents: Motivations and state-of-the-art systems. Technical report, 2000.
[11] J. Heidemann, F. Silva, C. Intanagonwiwat, R. Govindan, D. Estrin, and D. Ganesan. Building efficient wireless sensor networks with low-level naming. In SOSP '01: Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, pages 146–159, 2001.
[12] P. Kang, C. Borcea, G. Xu, A. Saxena, U. Kremer, and L. Iftode. Smart Messages: A distributed computing platform for networks of embedded systems. The Computer Journal, Special Focus on Mobile and Pervasive Computing, 47:475–494, 2004.
[13] D. B. Lange and M. Oshima. Seven good reasons for mobile agents. Communications of the ACM, 42(3):88–89, 1999.
[14] K. Lorincz, D. J. Malan, T. R. F. Fulford-Jones, A. Nawoj, A. Clavel, V. Shnayder, G. Mainland, M. Welsh, and S. Moulton. Sensor networks for emergency response: Challenges and opportunities. IEEE Pervasive Computing, 3(4):16–23, 2004.
[15] V. Maniezzo and A. Carbonaro. Ant colony optimization: An overview. In Essays and Surveys in Metaheuristics, pages 21–44. Kluwer Academic Publishers, 1999.
[16] D. S. Milojicic, W. LaForge, and D. Chauhan. Mobile Objects and Agents (MOA), 1998.
[17] R. Min and A. Chandrakasan. A framework for energy-scalable communication in high-density wireless networks. In International Symposium on Low Power Electronics and Design, pages 36–41, 2002.
[18] J. T. Moore, M. Hicks, and S. Nettles. Practical programmable packets. In Proceedings of the 20th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM 2001), pages 41–50, 2001.
[19] Y. Ohtaki, N. Wakamiya, M. Murata, and M. Imase. Scalable ant-based routing algorithm for ad-hoc networks. In 3rd IASTED International Conference on Communications, Internet, and Information Technology, 2004.
[20] Y. Ohtaki, N. Wakamiya, M. Murata, and M. Imase. Scalable and efficient ant-based routing algorithm for ad-hoc networks. IEICE Transactions on Communications, E89-B(4):1231–1238, 2006.
[21] S. A. Prahl, M. Keijzer, S. L. Jacques, and A. J. Welch. A Monte Carlo model of light propagation in tissue. SPIE Proceedings of Dosimetry of Laser Radiation in Medicine and Biology, IS(5):102–111, 1989.
[22] O. Riva, T. Nadeem, C. Borcea, and L. Iftode. Context-aware migratory services in ad hoc networks. IEEE Transactions on Mobile Computing, 6(12):1313–1328, 2007.
[23] B. Schwartz, A. W. Jackson, W. T. Strayer, W. Zhou, R. D. Rockwell, and C. Partridge. Smart Packets for active networks, 1998.
[24] B. Warneke, M. Last, B. Liebowitz, and K. S. J. Pister. Smart Dust: Communicating with a cubic-millimeter computer. IEEE Computer, 34(1):44–51, January 2001.
[25] D. Wetherall. Active network vision and reality: Lessons from a capsule-based system, 1999.
A Scenario-based Performance Comparison Study of the Fish-eye State Routing and Dynamic Source Routing Protocols for Mobile Ad hoc Networks

Natarajan Meghanathan, Jackson State University, USA. E-mail: nmeghanathan@jsums.edu
Ayomide Odunsi, Goldman Sachs, USA. E-mail: aodunsi86@yahoo.com

Abstract

The overall goal of this paper is to investigate the scalability of the Fish-eye State Routing (FSR) protocol and the Dynamic Source Routing (DSR) protocol under different network scenarios in mobile ad hoc networks (MANETs). This performance-based study simulates FSR and DSR under practical network scenarios typical of MANETs, and measures selected metrics that give an introspective look into the performance of FSR and DSR. The implementations of both protocols are simulated for varying conditions of network density, node mobility and traffic load. The following performance metrics are evaluated: packet delivery ratio, average hop count per path, control message overhead and energy consumed per node. Simulation results indicate that FSR scales relatively better compared to DSR and consumes less energy when operated with moderate to long link-state broadcast update time intervals in high-density networks with moderate to high node mobility and offered traffic load. FSR successfully delivers packets for a majority of the time with relatively lower energy cost in comparison to DSR.

Keywords: Routing protocols, Mobile ad hoc networks, Energy consumption, Simulations, Performance Studies

1. Introduction

A mobile ad hoc network (MANET) is a dynamic distributed system of wireless nodes wherein the nodes move independently of each other. MANETs have several operating constraints, such as limited battery charge per node, limited transmission range per node and limited bandwidth. Routes in MANETs are often multi-hop in nature. Packet transmission or reception consumes battery charge at a node. Nodes forward packets for their peers in addition to their own. In other words, nodes are forced to expend their battery charge on receiving and transmitting packets that are not intended for them. Given the limited energy budget of MANETs, inadvertent overuse of the energy resources of a small set of nodes at the cost of others can have an adverse impact on node lifetime.

There exist two classes of MANET routing protocols [1]: proactive and reactive. The proactive routing protocols fall into two sub-categories: distance-vector and link-state based routing. In the distance-vector based routing approach, each node periodically exchanges its routing table for the whole network with all of its neighbors. For each destination, the neighbor node that informs of the best path to that destination is chosen as the next hop. In the link-state based routing approach, each node periodically floods link-state updates, containing the list of its neighbors, to the whole network. Using these link-state updates, the global topology is locally constructed at each node and the Dijkstra algorithm [2] is run on this topology to find the best path to any other node. The Destination-Sequenced Distance Vector (DSDV) routing [3] and Optimized Link State Routing (OLSR) [4] protocols are classical examples of the distance-vector and link-state based strategies respectively. Proactive routing protocols are characterized by low route discovery latency, as routes between any two nodes are known at any time instant. But there is a high control overhead involved in periodically propagating the routing tables or the link-state updates to determine and maintain routes.

The reactive or on-demand routing protocols discover routes only when required. When a source node has data to send to a destination node and does not have a route to use, the source node broadcasts a Route-Request (RREQ) message in its neighborhood and, through further broadcasts by the intermediate nodes, the RREQ message is propagated towards the destination. The destination node receives the RREQ message along several paths and chooses the path that best satisfies the route selection principles of the routing protocol. The destination sends a Route-Reply (RREP) message to the source on the best path selected. The Dynamic Source Routing (DSR) [5] protocol and the Ad hoc On-demand Distance Vector (AODV) [6] routing protocol are classical examples of the reactive routing protocols. The reactive routing protocols are often characterized by low route discovery overhead, as routes are discovered only when needed; the tradeoff is higher route discovery latency.

The Fish-eye State Routing (FSR) protocol [7] is a type of link-state based proactive routing protocol
proposed to lower the traditionally observed higher control overhead of this class of protocols. In FSR, a node exchanges its link-state updates more frequently with nearby nodes, and less frequently with nodes that are farther away. The number of nodes with which the link-state information is exchanged more frequently is controlled by the "Scope" parameter (basically the number of hops), while the frequency of updating the neighbors outside the scope is controlled by the "Time Period of Update" (TPU) parameter. The operation of FSR is basically controlled by these two parameters. As a result, a node maintains accurate distance and path information for its nearby nodes, with progressively less accurate detail about the paths to nodes that are farther away. A scope value of 1 and a larger TPU value typically result in a lower control overhead at the cost of a higher hop count path (a sub-optimal path) between any two nodes. On the other hand, a scope value equal to the diameter of the network and a smaller TPU value basically transform FSR into OLSR, resulting in higher control overhead with the advantage of being able to use the minimum hop path between any two nodes.

Given that the scope parameter is normally set to 1 hop, the critical performance metrics for FSR, such as the control overhead (number of link-state messages exchanged), path hop count and energy consumption, are heavily dependent on the TPU parameter. To date, only a handful of performance studies ([8][9][10]) are available for FSR in the literature. To the best of our knowledge, we could not find a simulation study on the performance of FSR as a function of this TPU parameter. In addition, we conjecture that as the node mobility and network density increase, the proactive routing strategy based FSR may be preferable over the reactive DSR. DSR and FSR have not been categorically studied for different levels of node mobility, network density and offered traffic load. The above observations are the motivation for this paper.

In this paper, we present a simulation-based performance analysis of FSR with respect to the TPU parameter under scenarios generated by different combinations of node mobility, network density and offered traffic load. For each of these scenarios, the performance of FSR is also compared with that obtained for DSR. We categorically state which of these two protocols is to be preferred for each of the different scenarios. The rest of the paper is organized as follows: Section 2 describes the simulation environment and the scenarios considered. Section 3 defines the performance metrics evaluated. Section 4 illustrates the simulation results obtained for different scenarios and interprets the performance of FSR with respect to the TPU parameter

2. Simulation Environment

The simulations of FSR and DSR were conducted in ns-2 [10]. The network dimensions are 1000m x 1000m. The transmission range of each node is 250m. We vary the network density by conducting simulations with 50 nodes (low-density network with an average of 10 neighbors per node) and 75 nodes (high-density network with an average of 15 neighbors per node). The simulation time is 1000 seconds. The scope value is 1 hop. If all the nodes flooded their link-state updates at the same time instant, there would be collisions in the network. Hence, the TPU value for each node in the network is uniformly and randomly chosen from the interval [0…TPUmax]. The different values of TPUmax studied in the simulations are 5, 20, 50, 100, 200 and 300 seconds. For simplicity, we refer to TPUmax as TPU for the rest of this paper; they mean the same.

The node mobility model used in all of our simulations is the commonly used Random Waypoint model [11]. Each node starts moving from an arbitrary location to a randomly selected destination location at a speed uniformly distributed in the range [0,…,vmax]. Once the destination is reached, the node may stop there for a certain time called the pause time (0 seconds in our simulations) and then continue to move by choosing a different target location and a different velocity. The vmax values used are 5 m/s, 50 m/s and 100 m/s; the corresponding average node velocity values are 2.5 m/s, 25 m/s and 50 m/s, representing mobility levels of low (school environment), moderate (downtown) and high (interstate highway) respectively.

Traffic sources are constant bit rate (CBR). The number of source-destination (s-d) sessions used is 15 (low traffic load) and 40 (high traffic load). The starting times of the s-d sessions are uniformly distributed between 1 and 20 seconds. Data packets are 512 bytes in size; the packet sending rate is 4 data packets per second. While distributing the source-destination roles among the nodes, we ensured that a node does not end up as the source of more than two sessions, nor as the destination of more than two sessions.

Each node is initially provided an energy of 1000 Joules to make sure that no node failures happen due to inadequate energy supply. The transmission power loss per hop is fixed at 1.4 W and the reception power loss is 1 W [12]. The Medium Access Control (MAC) layer model used is the standard IEEE 802.11 model [13], wherein access to the channel per hop is accomplished using a Request-to-Send (RTS) and Clear-to-Send (CTS) control message exchange between the sender and the receiver constituting the hop in a path.
and compares the performance of FSR vis-à-vis DSR. The different combinations of simulation scenarios used
Section 5 concludes the paper. in this paper are summarized in Table 1.

Table 1: Scenarios Studied in the Simulation

Scenario # Network Density Offered Traffic Load Node Mobility


1 Low (50 nodes) Low (15 s-d Pairs) Low (vmax = 5 m/s)
2 Low (50 nodes) Low (15 s-d Pairs) Moderate (vmax = 50 m/s)
3 Low (50 nodes) Low (15 s-d Pairs) High (vmax = 100 m/s)
4 Low (50 nodes) High (40 s-d Pairs) Low (vmax = 5 m/s)
5 Low (50 nodes) High (40 s-d Pairs) Moderate (vmax = 50 m/s)
6 Low (50 nodes) High (40 s-d Pairs) High (vmax = 100 m/s)
7 High (75 nodes) Low (15 s-d Pairs) Low (vmax = 5 m/s)
8 High (75 nodes) Low (15 s-d Pairs) Moderate (vmax = 50 m/s)
9 High (75 nodes) Low (15 s-d Pairs) High (vmax = 100 m/s)
10 High (75 nodes) High (40 s-d Pairs) Low (vmax = 5 m/s)
11 High (75 nodes) High (40 s-d Pairs) Moderate (vmax = 50 m/s)
12 High (75 nodes) High (40 s-d Pairs) High (vmax = 100 m/s)
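The 12 scenarios of Table 1 are simply the cross product of two network densities, two traffic loads and three mobility levels; a minimal sketch that regenerates the table rows:

```python
from itertools import product

densities = [("Low", 50), ("High", 75)]                    # number of nodes
loads     = [("Low", 15), ("High", 40)]                    # s-d pairs
mobility  = [("Low", 5), ("Moderate", 50), ("High", 100)]  # vmax in m/s

# Scenario numbering follows Table 1: density varies slowest, mobility fastest.
scenarios = [(i + 1, d, l, m)
             for i, (d, l, m) in enumerate(product(densities, loads, mobility))]

for num, (d, n), (l, s), (m, v) in scenarios:
    print(f"{num:2d}  {d} ({n} nodes)  {l} ({s} s-d pairs)  {m} (vmax = {v} m/s)")
```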

3. Performance Metrics

The following performance metrics are evaluated for each of the 12 scenarios (listed in Table 1) and each of the six TPU values considered.
(i) Packet Delivery Ratio – the ratio of the number of data packets successfully delivered from the source to the destination to the number of data packets originating at the source.
(ii) Average Hop Count per Path – the average number of hops in the route of an s-d session, time-averaged over the duration of the s-d paths for all the sessions across the entire simulation time.
(iii) Control Message Overhead – the ratio of the total number of control messages (route discovery broadcast messages for DSR, or link-state update broadcast messages for FSR) received at the nodes to the number of data packets delivered to the destinations across all s-d sessions.
(iv) Energy Consumption per Node – the average energy consumed across all the nodes in the network. The energy consumed in transmitting and receiving data packets, in periodic broadcasts and receptions (in the case of FSR), and in route discoveries (in the case of DSR) all contribute to the energy consumed at a node.
Note that we count the number of control messages received rather than transmitted, because a typical broadcast involves one node transmitting the control message and all of its neighbors receiving it. The energy expended to receive the control message, summed over all the nodes, is far less than the energy expended to transmit the message.

4. Simulation Results

Each data point in Figures 1 through 4 and Tables 2 and 3 is an average of data collected using 5 mobility trace files for each value of vmax and network density, and 5 sets of randomly selected 15 and 40 s-d sessions. To present the results of FSR (with larger TPU values) and DSR on a comparable scale in the figures, the control message overhead and the energy consumption per node incurred by FSR at the maximum TPU value of 5 seconds are given in Tables 2 and 3 respectively.

4.1 Low Network Density and Low Traffic Load (Scenarios 1 through 3)

The packet delivery ratio of FSR (refer Figure 1.1) decreases as the TPU value increases. This can be attributed to the inaccuracy of the routing information stored at the intermediate nodes for certain destination nodes. However, FSR still consistently maintains a packet delivery ratio above 90%, even for TPU values exceeding 200 seconds. For both FSR and DSR, as the node mobility is increased from 5 m/s to 50 m/s, there is an increase in the packet delivery ratio. In low density networks, the spatial distribution of nodes plays a critical role in the effectiveness of a routing protocol. Nodes are sparsely distributed in a low density network, and if they also have low mobility, they tend to experience higher rates of network disconnection. Since the nodes do not change their positions frequently, the disconnected state persists and packet delivery is adversely impacted. In contrast, as node mobility increases, nodes are redistributed to new locations, increasing the probability that they move within the transmission range of each other. As a result, the probability of network connectivity increases, and with it the likelihood of a node successfully routing a packet to its destination.
In the low node mobility scenario, FSR was observed to yield a more optimal minimum hop path than DSR for a time period of update (TPU) value of 5
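The four metrics defined in Section 3 reduce to simple ratios over counters collected during a run; a sketch with hypothetical counter names (none of these identifiers or values come from the paper):

```python
def performance_metrics(delivered, originated, control_received,
                        hop_time_product, total_path_time,
                        energy_used, num_nodes):
    """Compute the four metrics of Section 3 from per-run counters.
    All argument names are hypothetical placeholders."""
    return {
        # (i) delivered data packets / originated data packets
        "packet_delivery_ratio": delivered / originated,
        # (ii) time-averaged number of hops per s-d path
        "avg_hop_count": hop_time_product / total_path_time,
        # (iii) control messages *received* per data packet delivered
        "control_overhead": control_received / delivered,
        # (iv) average energy consumed per node
        "energy_per_node": energy_used / num_nodes,
    }

# Example with made-up counters for a 50-node run
m = performance_metrics(delivered=9000, originated=10000,
                        control_received=1_620_000,
                        hop_time_product=30000.0, total_path_time=10000.0,
                        energy_used=5200.0, num_nodes=50)
```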

Table 2: Control Message Overhead (Control messages received per data packet delivered) for
Maximum TPU Value of 5 Seconds

Maximum Node Low Density, Low Low Density, High High Density, Low High Density, High
Velocity (vmax) Traffic Load Traffic Load Traffic Load Traffic Load
5 m/s 178 64 585 220
50 m/s 182 69 640 235
100 m/s 180 67 660 250

Table 3: Energy Consumption per Node at Maximum TPU Value of 5 Seconds

Maximum Node Low Density, Low Low Density, High High Density, Low High Density, High
Velocity (vmax) Traffic Load Traffic Load Traffic Load Traffic Load
5 m/s 104 Joules 126 Joules 212 Joules 230 Joules
50 m/s 110 Joules 129 Joules 235 Joules 250 Joules
100 m/s 109 Joules 129 Joules 240 Joules 255 Joules

Figure 1: Performance of FSR and DSR in Low Density Network and Low Traffic Load Scenarios
(Figure 1.1: Packet Delivery Ratio; Figure 1.2: Average Hop Count per Path; Figure 1.3: Control Message Overhead; Figure 1.4: Average Energy Consumption per Node)

seconds (see Figure 1.2). In low node density networks, nodes are sparsely distributed, and the availability of routes between s-d pairs is not always guaranteed. At low mobility, nodes are less likely to change their location, which hinders them from discovering more optimal routes to destinations. In addition, DSR tends to maintain its current minimum hop route until a link failure is detected, predisposing it to retain sub-optimal routing information in low node density scenarios. Consequently, since FSR proactively maintains more accurate topology information at lower TPU values, it outperforms DSR by determining more optimal minimum hop paths. In contrast, at higher TPU values FSR propagates routing information infrequently, and DSR outperforms FSR. The degradation in the performance of FSR can be attributed to routing inaccuracy resulting from the longer link-state update intervals used to exchange broadcast messages about the network topology.
FSR incurs a significantly higher control overhead than DSR at the lower TPU value of 5 seconds (see Table 2 and Figure 1.3). FSR periodically generates network-wide broadcasts once every TPU interval in order to establish routes for every node in the network.

Figure 2: Performance of FSR and DSR in Low Density Network and High Traffic Load Scenarios
(Figure 2.1: Packet Delivery Ratio; Figure 2.2: Average Hop Count per Path; Figure 2.3: Control Message Overhead; Figure 2.4: Average Energy Consumption per Node)

This process of periodic broadcasts generates a high control overhead, especially when done frequently, as with a TPU value of 5 seconds. In contrast, DSR incurs less overhead than FSR because it generates fewer control packets in a low network density scenario. DSR performs network-wide flooding only when a route is needed for a data transmission session, and thus its control overhead depends on the offered traffic load (the number of s-d sessions).
In comparison to DSR, FSR generates less control message overhead (refer Figure 1.3) for TPU values ranging from 50 to 300 seconds. With respect to node mobility, the overhead generated by DSR appears to grow with increasing mobility, whereas FSR remains unaffected by variations in node mobility.
At lower TPU values, DSR consumes less energy per node than FSR (refer Table 3 and Figure 1.4). This is expected, because DSR, being a reactive protocol, should incur less energy consumption as a result of generating less control overhead than a proactive routing protocol like FSR. However, we notice that operating FSR at higher TPU values helps to minimize the energy consumption per node. FSR loses less energy per node than DSR in the high mobility case of 100 m/s for TPU values of 100 seconds and 200 seconds. Figure 1.1 shows that the packet delivery ratios of FSR for TPU values of 100 and 200 seconds in this characteristic high mobility scenario of 100 m/s are at least 94%. Thus, FSR can be used for applications that require optimized energy consumption in high node velocity scenarios and can tolerate a packet delivery ratio of approximately 94%.

4.2 Low Network Density and High Traffic Load (Scenarios 4 through 6)

Both DSR and FSR exhibit an appreciable increase in their respective packet delivery ratios (refer Figure 2.1) at the low node mobility of 5 m/s. However, as node mobility is increased to 50 m/s and 100 m/s, both protocols experience a slight decrease in their packet delivery ratios. The reason is as follows: in networks of low density and high traffic load, the number of neighbors per node is significantly smaller than the number of active s-d pairs. As a result, more demand is placed on a few nodes to route packets to their destinations. This results in more packets being dropped at each node and hinders the ability of both protocols to route packets to their destinations at a higher rate. As node velocity is increased from low to high, FSR incurs a higher hop count than DSR, except for the TPU value of 5 seconds (see Figure 2.2). The hop count of DSR is not much affected by node velocity.
As illustrated in Table 2, FSR incurs a higher control overhead than DSR at the lower TPU value of 5 seconds, due to its frequent network-wide broadcasts.

Figure 3: Performance of FSR and DSR in High Density Network and Low Traffic Load Scenarios
(Figure 3.1: Packet Delivery Ratio; Figure 3.2: Average Hop Count per Path; Figure 3.3: Control Message Overhead; Figure 3.4: Average Energy Consumption per Node)

However, DSR incurs significantly more overhead than FSR as the traffic load is increased to 40 s-d pairs (refer Figure 2.3), for TPU values of 20 seconds and beyond. This can be attributed to the reactive nature of DSR and the low node density of the network. DSR determines routes as needed; with an increasing need to determine routes for a growing number of s-d pairs, DSR invokes its route discovery mechanism frequently, flooding the network with broadcast messages. The number of route discoveries needed to determine routes for all the s-d pairs increases with increasing mobility, and thus DSR incurs a higher control overhead than FSR. FSR remains largely unaffected by increasing rates of node mobility.
The energy consumed by both protocols (refer Figure 2.4) is appreciably larger than that observed in low-density networks with low traffic load (refer Figure 1.4). This spike in energy consumption can be attributed to the number of data and control packets flowing through the network. An increase in the offered traffic load at low network density corresponds to an increase in the number of active s-d pairs wishing to establish sessions, and hence in the number of data packets flowing through each node in the network, which contributes to the observed increase in energy consumption at each node. In addition, in a low density network the probability of route failures is rather high: nodes may be sparsely distributed and, as a result, unable to find paths to route data packets successfully to their designated destinations. Thus, more control overhead is generated to establish and maintain routes for the voluminous data traffic. The energy consumption of FSR is significantly less than that of DSR in moderate to high mobility scenarios for TPU values of 100 seconds and above.

4.3 High Network Density and Low Traffic Load (Scenarios 7 through 9)

In high-density networks, the packet delivery ratios incurred by both FSR and DSR are relatively larger than those incurred in low-density networks (compare Figures 1.1 and 3.1). For the low mobility scenario of 5 m/s, both FSR and DSR deliver approximately 100% of packets. FSR maintains this near-perfect delivery rate at low node mobility as TPU values are increased from 5 seconds up to 200 seconds. The better performance of both protocols can be attributed to the fact that each node has more neighbors within its transmission range to route messages along a given s-d route; this distribution almost always guarantees that a packet will be successfully routed to its destination.
In high-density networks, the average hop count per path for both FSR and DSR (shown in Figure 3.2) is appreciably lower than in the low network density scenarios of Figures 1.2 and 2.2. Nodes in a high density network tend to have more neighbors, and as a

Figure 4: Performance of FSR and DSR in High Density Network and High Traffic Load Scenarios
(Figure 4.1: Packet Delivery Ratio; Figure 4.2: Average Hop Count per Path; Figure 4.3: Control Message Overhead; Figure 4.4: Average Energy Consumption per Node)

result have better path alternatives (shorter paths) to choose from among the optimal routes to any given destination. On the other hand, FSR and DSR incur significantly higher control overhead in high-density networks, because more broadcast messages are received at each node due to the increased number of neighbors. As illustrated in Table 2 and Figure 3.3, for TPU values of 100 seconds or above, FSR incurs less control overhead than DSR.
The energy consumed per node by both protocols is lower in high density networks than in lower density networks (see Figures 1.4, 2.4, 3.4 and 4.4). As each node has more neighbors, data gets routed efficiently along optimal paths in high density networks. In low node mobility scenarios, the energy consumption of FSR is significantly higher than that of DSR, because FSR incurs a fixed energy cost due to its periodic network broadcasts. However, at higher node mobility, the energy consumption of FSR converges to that of DSR, and FSR actually outperforms DSR at higher TPU values of 100 seconds and above. Thus, FSR can be employed as a suitable routing alternative in networks characterized by high node density and moderate to high node mobility.

4.4 High Network Density and High Traffic Load (Scenarios 10 through 12)

FSR and DSR maintained a near perfect packet delivery ratio of 100%, as illustrated in Figure 4.1. For moderate to high node mobility, DSR yielded a higher packet delivery ratio, and the discrepancy between FSR and DSR increased as the TPU value was increased from 50 seconds to 300 seconds. It should be noted that FSR is still able to maintain a packet delivery ratio above 97%, even at the high TPU value of 300 seconds. With respect to hop count, FSR outperforms DSR in the low node mobility scenario at a TPU of 5 seconds. Beyond 5 seconds, DSR discovers more optimal minimum hop paths than FSR, owing to the inaccurate routes FSR discovers at longer update intervals. One notable difference observed is a slight increase in the magnitude of the hop counts discovered by both protocols as compared to the high network density, low traffic load scenario.
As illustrated in Table 2 and Figure 4.3, with respect to control message overhead FSR scales considerably better than DSR at TPU values greater than 50 seconds. FSR proactively maintains routing information and is not affected by increasing network density. DSR, on the other hand, incurs more overhead with the increasing demand for route discoveries for the s-d sessions. Thus, variations in mobility have a significant effect on the amount of control messages generated by DSR in high node density and high traffic scenarios, whereas FSR is not much affected by changes in node mobility.
It is observed from Table 3 and Figure 4.4 that the energy consumption of both protocols exceeds that of the high network density, low traffic load scenario (refer Figure 3.4). This is justified by the increase

observed in the number of communicating s-d pairs. More packets, both data and control, are routed in the network, and as a result nodes expend more energy routing the larger volume of packets. Energy consumption per node also increases with the mobility level of the nodes. Compared to DSR, FSR consumes less energy in moderate to high mobility scenarios at TPU values ranging from 20 seconds to 300 seconds. Thus, for high mobility, high-density scenarios, FSR can be configured with a lower TPU value of 20 seconds to minimize energy consumption. For moderate node mobility, high density and high traffic load networks, FSR can be selected over DSR by configuring it with a TPU value of 50 seconds.

5. Conclusions

This paper explores, through a comprehensive simulation-based analysis, the performance and associated tradeoffs of the FSR protocol relative to the DSR protocol for MANETs under varying scenarios of network density, node mobility and traffic load. Conclusions and suggestions are made regarding the configuration of the FSR protocol so that it yields better performance than DSR under specific scenarios, based on the results observed in the simulations.
A significant tradeoff has been observed in the performance of FSR with respect to the hop count per path. For lower TPU values, FSR obtains shorter paths owing to the increased frequency of route update messages. As the TPU value is increased, FSR incurs higher hop counts due to the lower update frequency; stale routes persist, which produces longer hop paths. We have identified the TPU values that generate paths with hop counts comparable to DSR, and found that at low mobility levels FSR yields more optimal paths.
In high density networks characterized by high traffic load, even at higher TPU values, FSR has a significantly lower control message overhead than DSR and yet achieves a packet delivery ratio of at least 90%. The same trend has been observed for energy consumption at high node density and moderate to high mobility, with FSR losing less energy for routing and topology maintenance than DSR.

6. References

[1] C. Siva Ram Murthy and B. S. Manoj, "Routing Protocols for Ad Hoc Wireless Networks," Ad Hoc Wireless Networks: Architectures and Protocols, Chapter 7, pp. 299-364, Prentice Hall, June 2004.
[2] C. E. Perkins and P. Bhagwat, "Highly Dynamic Destination-Sequenced Distance-Vector Routing for Mobile Computers," Proceedings of ACM SIGCOMM, pp. 234-244, October 1994.
[3] P. Jacquet, P. Muhlethaler, T. Clausen, A. Laouiti, A. Qayyum and L. Viennot, "Optimized Link State Routing Protocol for Ad Hoc Networks," Proceedings of the IEEE International Multi Topic Conference, pp. 62-68, Pakistan, December 2001.
[4] D. B. Johnson, D. A. Maltz and J. Broch, "DSR: The Dynamic Source Routing Protocol for Multi-hop Wireless Ad hoc Networks," Ad hoc Networking, edited by Charles E. Perkins, Chapter 5, pp. 139-172, Addison-Wesley, 2001.
[5] C. E. Perkins and E. M. Royer, "Ad hoc On-Demand Distance Vector Routing," Proceedings of the 2nd IEEE Workshop on Mobile Computing Systems and Applications, pp. 90-100, February 1999.
[6] G. P. Mario, M. Gerla and T-W. Chen, "Fisheye State Routing: A Routing Scheme for Ad Hoc Wireless Networks," Proceedings of the International Conference on Communications, pp. 70-74, New Orleans, USA, June 2000.
[7] S. Jaap, M. Bechler and L. Wolf, "Evaluation of Routing Protocols for Vehicular Ad Hoc Networks in City Traffic Scenarios," Proceedings of the 5th International Conference on Intelligent Transportation Systems and Telecommunications, Brest, France, June 2005.
[8] E. Johansson, K. Persson, M. Skold and U. Sterner, "An Analysis of the Fisheye Routing Technique in Highly Mobile Ad Hoc Networks," Proceedings of the IEEE 59th Vehicular Technology Conference, Vol. 4, pp. 2166-2170, May 2004.
[9] T-H. Chu and S-I. Hwang, "Efficient Fisheye State Routing Protocol using Virtual Grid in High-density Ad Hoc Networks," Proceedings of the 8th International Conference on Advanced Communication Technology, Vol. 3, pp. 1475-1478, February 2006.
[10] Ns-2 Simulator: http://www.isi.edu/nsnam/ns/
[11] C. Bettstetter, H. Hartenstein and X. Perez-Costa, "Stochastic Properties of the Random Waypoint Mobility Model," Wireless Networks, Vol. 10, No. 5, pp. 555-567, September 2004.
[12] L. M. Feeney, "An Energy Consumption Model for Performance Analysis of Routing Protocols for Mobile Ad hoc Networks," Journal of Mobile Networks and Applications, Vol. 3, No. 6, pp. 239-249, June 2001.
[13] G. Bianchi, "Performance Analysis of the IEEE 802.11 Distributed Coordination Function," IEEE Journal on Selected Areas in Communications, Vol. 18, No. 3, pp. 535-547, March 2000.

ADCOM 2009
NETWORK OPTIMIZATION

Session Papers:

1. Angeline Ezhilarasi G and Shanti Swarup K , “Optimal Network Partitioning for Distributed
Computing Using Discrete Optimization”

2. Suman Kundu and Uttam Kumar Roy, “An Efficient Algorithm to Reconstruct a Minimum
Spanning Tree in an Asynchronous Distributed Systems”

3. Amit Kumar Mishra, “A SAL Based Algorithm for Convex Optimization Problems”

Optimal Network Partitioning for Distributed Computing Using Discrete Optimization
G. Angeline Ezhilarasi and Dr. K. S. Swarup
Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai, INDIA
angel.ezhil@gmail.com, swarup@ee.iitm.ac.in

Abstract— This paper presents an evolutionary based discrete optimization (DO) technique for optimal network partitioning (NP) of a power system network. The algorithm divides the network model into a number of sub-networks optimally, in order to balance distributed computing and parallel processing of power system computations and to reduce the communication overhead. The partitioning method is illustrated on the IEEE Standard 14 Bus, 30 Bus and 118 Bus Test Systems and compared with other existing methods. The performance of the algorithm is studied using the test systems with different configurations.

Keywords— Network Partitioning, Discrete Particle Swarm Optimization.

I. INTRODUCTION

The power system is a large, complex interconnected network involving computation-intensive applications and highly nonlinear dynamic entities spread across a vast area. Under normal as well as congested conditions, centralized control requires powerful computing facilities and multiple high speed communication links at the control centers. Under certain circumstances, a failure in a remote part of the system may spread almost instantaneously if the control action is delayed. This lack of response may cripple the entire power system, including the centralized control center itself.
An effective way to monitor and control a complex power system is to intervene locally at the places where a disturbance occurs and prevent the problem from propagating through the network. Hence distributed computing can greatly enhance the reliability and improve the efficiency of power system monitoring and control.
To simulate and implement distributed computing in a power system, the large interconnected network must be torn into sub-networks in an optimal way. The partitioning should balance the size of the sub-networks against the interconnecting tie lines, in order to reduce the overall parallel execution time. Over the past decades a number of algorithms have been proposed in the literature for optimal network tearing. The techniques include dynamic programming and heuristic clustering approaches. Optimization techniques such as simulated annealing, the genetic algorithm [1] and tabu search [2] have also been used for network tearing. For these optimization problems the cost function is formed such that it reflects the features of parallel and distributed processing. However, these methods are computation intensive, involve procedures based on natural selection, crossover and mutation, require a large population size, and occupy more memory.
This paper presents the application of evolutionary based discrete particle swarm optimization to the network partition problem. The main advantages of the PSO algorithm are a simple concept, easy implementation, robustness to control parameters, and computational efficiency compared with mathematical algorithms and other heuristic optimization techniques. Recently, PSO has been successfully applied to various fields of power system optimization, such as power system stabilizer design, reactive power and voltage control, dynamic security border identification, economic dispatch and optimal power flow.
The following sections are organized as follows. Section 2 deals with the formulation of the objective function for the network partition problem; the aim of this optimization is to minimize the cost function, which is a measure of the execution time of the applications on the torn network. An overview of DPSO and its implementation for the NP problem is given in Section 3. The algorithm is tested on IEEE standard test systems, and the simulation results are discussed in Section 4. The case studies demonstrate the validity of the algorithm, which attains a near optimal solution with a small population size and little computational effort.

II. PROBLEM FORMULATION

The objective of the problem is to optimally assign each node of a large interconnected network to a sub-network, subject to constraints. The resulting sub-networks are used for efficient distributed computing and parallel processing of power system analysis. The allocation of the nodes to the sub-networks should be such that the number of nodes in each sub-network and the number of tie lines connecting the sub-networks are well balanced. Hence the conventional cost function [3], which models the computational performance of the partitioned network, is taken as the fitness function for solving this optimization problem.

The objective is to minimize

F = α·M² + β·L³   (1)
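Given a concrete partition, the fitness of equation (1) is straightforward to evaluate; a minimal sketch, in which the edge-list representation and the α, β values are illustrative assumptions rather than values from the paper:

```python
from collections import Counter

def partition_fitness(edges, assignment, alpha=1.0, beta=1.0):
    """F = alpha * M^2 + beta * L^3, where M is the size of the largest
    sub-network and L the number of tie lines crossing sub-networks.
    alpha/beta defaults are illustrative, not values from the paper."""
    sizes = Counter(assignment.values())
    M = max(sizes.values())                 # largest sub-network
    L = sum(1 for u, v in edges             # branches between sub-networks
            if assignment[u] != assignment[v])
    return alpha * M**2 + beta * L**3

# 4-node ring split into two 2-node clusters: M = 2, L = 2, F = 4 + 8
edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
assignment = {1: 0, 2: 0, 3: 1, 4: 1}
print(partition_fitness(edges, assignment))  # prints 12.0
```

The cubic penalty on L makes tie lines far more expensive than imbalance, which biases the search toward cuts that minimize inter-partition communication.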

where,
F      Partition index
M      Maximum number of nodes in a sub-network
L      Total number of branches between all the sub-networks
α, β   Weighting factors

The first term in the fitness function involves the maximum number of nodes in a sub-network, thereby influencing the load balance in the distributed processing of any application. The second term relates to the communication of data between the processes and hence focuses on the number of branches linking the sub-networks. The total fitness function value reflects the overall computation time of power system analysis problems under distributed or parallel processing.

III. IMPLEMENTATION OF NETWORK PARTITIONING

A. Overview of Discrete Optimization

Particle Swarm Optimization (PSO) was developed by Kennedy and Eberhart through simulation of bird flocking in a two-dimensional space. In this search space, every feasible solution is called a particle, and several such particles form a group. The particles optimize an objective function using the knowledge of their own best positions attained so far and the best position of the entire group. Hence the particles in a group share information among themselves, increasing the efficiency of the group. The original PSO treats nonlinear optimization problems with continuous variables. However, practical engineering problems are often combinatorial optimization problems, for which Discrete Particle Swarm Optimization (DPSO) can be used.
In an n-dimensional search space, the position of a particle p is represented as a vector X_p = (X_1, X_2, X_3, ..., X_p) and its velocity as V_p = (V_1, V_2, V_3, ..., V_p). Let Pbest_p be the best position found by particle p so far and Gbest the best position found by the group. In DPSO [4][5], the particles are initially set to random binary values. The probability of a particle making a decision is a function of the current position, velocity, Pbest and Gbest. The velocity of the particle, given by equation (2), determines a probability threshold. The sigmoid function shown in equation (3) imposes limits on the velocity updates; the threshold is constrained within the range [0, 1] such that a higher velocity likely chooses 1 and a lower velocity chooses 0. The position update is then done using the velocity, as shown in equation (4).

V_p^(k+1) = ω·V_p^k + C1·rand1·(Pbest_p^k − X_p^k) + C2·rand2·(Gbest − X_p^k)   (2)

S(V_p^new) = 1 / (1 + e^(−V_p^new))   (3)

If (rand < S(V_p^new)) then X_p^new = 1, else X_p^new = 0   (4)

where,
ω                  Weight parameter
C1, C2             Weight factors
rand, rand1, rand2 Random numbers between 0 and 1
X_p^(k+1), X_p^k   Position of particle p at the (k+1)th and kth iterations
V_p^(k+1), V_p^k   Velocity of particle p at the (k+1)th and kth iterations
Pbest_p^k          Best position of particle p until the kth iteration
Gbest              Best position of the group until the kth iteration

B. Evolutionary Based Discrete Optimization

The PSO gains self-adapting properties by incorporating any one of the evolutionary techniques such as replication, mutation and reproduction. To improve convergence in PSO, mutation is generally applied to the weight factors; mutation can also be applied if there is no significant change in Gbest for a considerable amount of time. In this work, mutation is used to update the particles, as they consist of binary values only. The entire position update process of the particles is based on a mutation probability, normally above 0.85 [6]. This ensures that the particles are neither trapped in their local optimum nor deviate far from their current positions. The process of the algorithm and its implementation aspects are described in detail in the following sections.

1) Generation of the Particles

The objective of the network partition problem is to allocate every node of the power system network to a sub-network such that the nodes are equally distributed and the number of lines linking the sub-networks is minimum. Hence the particle is structured as a matrix of dimension (nc x nn), where nc is the number of clusters or sub-networks (a user defined quantity) and nn is the total number of nodes in the power system network. It is ensured that each node is assigned to exactly one cluster, i.e., each column of the particle array always sums to 1. The velocity of the particles corresponds to a threshold probability. Initially all velocities are set to zero vectors, and all solutions, including Pbest and Gbest, are undefined.
The position of a particle p in the search space is created as follows:
Step 1) Set j = nc, the number of sub-networks, and set k = 1, where k varies from 1 to nn.
Step 2) Generate a random number R1 in the range [1, nn/nc].
Step 3) Set the particle entries to 1 for nodes k to R1, and set k = k + R1.
Step 4) Set j = j − 1; if j ≠ 0 go to step 2, otherwise go to step 5.
Step 5) Repeat steps 1 to 4 for all the particles.

Step 1) Set N = number of nodes and M = number of clusters.
Step 2) Select a column at random from 1 to N.
Step 3) Find the row index whose element is 1.
Step 4) Select a row at random from 1 to M whose element is 0.
Step 6) Stop the initialization process. Step 5) Flip the elements using the condition given by
equation (4).
N1 N2 Nnn-1 Nnn Step 6) Repeat the above steps for all the particles.
C1 1 1 0 0 0 0
C2 0 0 1 0 0 0
N1 N2 Nnn-1 Nnn
C1 1 0 0 0 0 0
Cnc-1 0 0 0 1 0 0 C2 0 1 1 0 0 0
Cnc 0 0 0 0 1 1

Figure 1. Structure of Particles

The particle structure of the network partition problem is Cnc-1 0 0 0 1 0 0


shown in Figure 1 for the system with ‘nc’ clusters or sub Cnc 0 0 0 0 1 1
networks and ‘nn’ Nodes.
N1 N2 Nnn-1 Nnn
2) Generation of the Particles C1 1 0 0 0 0 0
The particles in the solution space are evaluated by means of
the fitness function given by equation (1). The fitness function C2 0 0 1 0 0 0
is such that the results of the optimization problem would
balance the computational load on the processors and reduce
the communication overhead as well. The choice of the
exponents of M and L determines the order of solution times
required for sub network solution and full solution of the Cnc-1 0 1 0 1 0 0
interconnected network in a typical parallel processor solution. Cnc 0 0 0 0 1 1
Once the particles are evaluated the Pbest and Gbest are
selected from the swarm in that iteration as follows: Figure 2. Modification of Particle Structure

Step 1) Set j = 1, p = the number of particles. 4) Stopping Criteria


Step 2) If (F (Xp) > Pbestp) then Pbest = F (Xp) Generally for evolutionary algorithms the solution is reached
Step 3) If max (F (Xp)) > Gbest) then Gbest = F (Xp) if the fitness function remains constant for a considerable
Step 4) Set j = j +1, if j < p go to step 2, otherwise go amount of iterations or a maximum number of iterations can
to step 5 be fixed. In this paper the later is followed.
Step 5) Stop the evaluation process.

3) Modification of the Particles IV. CASE STUDIES


To modify the particles in the solution space for the next To assess the efficiency of the proposed method of network
iteration, the velocity of the particles are obtained from partioning, it was tested using the data of the standard IEEE
equation (2). In this process of updating the velocity the 14 Bus, 30 Bus and 118 Bus test systems. The results obtained
weight factors must be known a priori. It has been shown that are compared with other methods like Simulated Annealing
the irrespective of the problem the following parameters are (SA) and Genetic Algorithm (GA). Simulation was done using
appropriate. C1 = C2 = 2.0, ωmax = 0.9, ωmin = 0.4 . In this Matlab in a high performance computing Linux cluster. It is a
paper the weighting function is kept constant for all iterations 2048 node Linux cluster which aids parallel and distributed
and is taken as the average of its range[7-9]. Once the processing. The parameters [10] used for simulation of the
velocities are updated for the next iteration the particles are algorithm are as follows: population size = 100, maximum
updated based on the sigmoid function given by equation (3). iteration = 50, mutation rate = 0.1.
Since this is a discrete optimization problem and there exists The number of clusters is varied depending upon the size of
some constraints on the redundancy of the nodes in the sub the network, but for comparison purpose results of 2 and 3
networks. The particles as depicted in Figure. 2 are modified clusters are discussed here. Figure 3(a) shows the IEEE 14
using the following procedure based on a high mutation Bus system partitioned into two clusters optimally and Figure
probability 3(b) shows a worst case partition of the same system present

115
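The DPSO update described above — velocity from equation (2), sigmoid thresholding from equation (3), and a high-probability mutation-style flip under the one-1-per-column constraint — can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the name `dpso_step`, the gating of the flip by the mutation probability (equation (4) itself is not reproduced in this excerpt), and the column repair step are our own assumptions; ω is fixed at 0.65, the average of [ωmin, ωmax] as stated in the text.

```python
import numpy as np

def dpso_step(X, V, pbest, gbest, w=0.65, c1=2.0, c2=2.0, p_mut=0.9, rng=None):
    """One DPSO iteration for a binary (nc x nn) particle.

    X, pbest, gbest are binary matrices with exactly one 1 per column
    (each of the nn nodes belongs to exactly one of the nc clusters);
    V is a real-valued velocity matrix of the same shape.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    r1, r2 = rng.random(X.shape), rng.random(X.shape)
    # Equation (2): inertia + cognitive + social terms
    V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)
    # Equation (3): the sigmoid maps each velocity to a probability threshold
    S = 1.0 / (1.0 + np.exp(-V))
    # Bit update gated by the mutation probability (assumed flip rule)
    X_new = ((rng.random(X.shape) < S) & (rng.random(X.shape) < p_mut)).astype(int)
    # Repair: restore the column-sum-equals-1 constraint of the particle matrix
    for j in range(X.shape[1]):
        col = X_new[:, j]
        if col.sum() != 1:
            col[:] = 0
            col[rng.integers(X.shape[0])] = 1
    return X_new, V
```

Each column of the returned matrix still sums to 1, so every node remains assigned to exactly one sub network after the update.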
in the population. Similarly, Figure 4(a) shows the IEEE 30 Bus system partitioned into three clusters. This is the optimal partition obtained from the DPSO applied to the network partitioning problem, and Figure 4(b) shows a worst case partition of the same system present in the population with the same configuration.

Figure 3(a). Optimal Partition of 14 Bus System

Figure 3(b). Viable Partitioning of 14 Bus System

Figure 4(a). Optimal Partition of 30 Bus System

Figure 4(b). Viable Partitioning of 30 Bus System

The partitions so obtained can be used for distributed computing and parallel processing of large scale interconnected power systems. This will enhance real time simulation and can be applied in grid computing as well. Table 1 shows the maximum number of nodes in a sub network (M) and the number of branches linking the sub networks (L). It also gives an account of the cost of the partition and the execution time. It can be concluded from the results that the optimal partition can be obtained by increasing the number of clusters as the size of the system increases. It is clear that 2 clusters are optimal for the 14 Bus system and 3 clusters are optimal for the 30 Bus system and higher. This observation is highlighted in Table 1.

TABLE 1. COMPARISON OF COST AND TIME FOR PARTITIONING IEEE STANDARD SYSTEMS INTO 2 AND 3 CLUSTERS WITH α = 1 AND β = 1

Test Case | No. of Clusters | M  | L  | Cost | Time (Sec)
14 Bus    | 2               | 7  | 3  | 76   | 4.14
14 Bus    | 3               | 8  | 4  | 128  | 6.68
30 Bus    | 2               | 16 | 7  | 599  | 7.96
30 Bus    | 3               | 12 | 7  | 487  | 15.36
118 Bus   | 2               | 59 | 18 | 9313 | 32.87
118 Bus   | 3               | 42 | 18 | 7596 | 61.54
In order to test the efficiency of the algorithm under different conditions, simulation is performed with varying configurations of parameters like weights, population size and mutation rates. Table II shows the partitions of the 14 Bus and 30 Bus systems with different weight factors.

TABLE II. COMPARISON OF COST FOR IEEE 14 BUS AND 30 BUS SYSTEMS WITH DIFFERENT CONFIGURATIONS

Weight       | No. of Clusters | Test System | M  | L | Cost
α = 1, β = 1 | 3               | 14 Bus      | 8  | 4 | 128
α = 1, β = 1 | 3               | 30 Bus      | 12 | 7 | 487
α = 3, β = 1 | 3               | 14 Bus      | 5  | 5 | 200
α = 3, β = 1 | 3               | 30 Bus      | 16 | 6 | 984
α = 1, β = 3 | 3               | 14 Bus      | 5  | 5 | 400
α = 1, β = 3 | 3               | 30 Bus      | 15 | 6 | 873

The performance of the optimization algorithm is shown by means of the convergence characteristics in Figures 5(a) and 5(b). Simulation was done with the standard parameters mentioned earlier, and convergence was attained when the system was partitioned into two and into three clusters. Unlike SA [1] and GA [3], DPSO reaches a near optimal solution in a smaller number of iterations and with fewer particles in the population.

Figure 5(a). Convergence of 14 Bus System

Figure 5(b). Convergence of 30 Bus System

V. CONCLUSION

This paper presents a Discrete Particle Swarm Optimization method for optimal tearing of a power system network. The algorithm is simple and can be used for clustering related problems in any field. In this method the DPSO is used to minimize the cost function, which is actually an estimate of the execution time of a sub network in the distributed computing environment. The algorithm was implemented in a high performance computing environment which supports distributed computing and parallel processing. The simulation results show that the DPSO can find a near optimum solution under different operating conditions. The torn sub networks can aid parallel processing, thereby improving the speed of intensive power system computations.

REFERENCES

[1] M.R. Irving and M.J.H. Sterling, "Optimal network tearing using simulated annealing", IEE Proceedings - Generation, Transmission and Distribution, Vol. 137, No. 1, Jan 1990, pp. 69-72.
[2] C.S. Chang, L.R. Lu and F.S. Wen, "Power system network partitioning using Tabu Search", Electric Power Systems Research, Vol. 49, 1999, pp. 55-61.
[3] H. Ding, A.A. El-Keib and R. Smith, "Optimal clustering of power networks using genetic algorithms", Electric Power Systems Research, Vol. 30, 1994, pp. 209-214.
[4] Jong-Bae Park, Ki-Song Lee, Joong-Rin Shin and Kwang Y. Lee, "A Particle Swarm Optimization for Economic Dispatch With Nonsmooth Cost Functions", IEEE Transactions on Power Systems, Vol. 20, No. 1, February 2005.
[5] Qian-Li Zhang, Xing Li and Quang-Anh Tran, "A Modified Particle Swarm Optimization Algorithm", Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18-21 August 2005.
[6] Li-Yeh Chuang, Hsueh-Wei Chang, Chung-Jui Tu and Cheng-Hong Yang, "Improved binary PSO for feature selection using gene expression data", Computational Biology and Chemistry, Elsevier, Vol. 32, 2008, pp. 29-38.
[7] X.H. Shi, X.L. Xing, Q.X. Wang, L.H. Zhang, X.W. Yang, C.G. Zhou and Y.C. Liang, "A Discrete PSO Method for Generalized TSP Problem", Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August 2004.
[8] H. Shayeghi, M. Mahdavi and A. Kazemi, "Discrete Particle Swarm Optimization Algorithm Used for TNEP Considering Network Adequacy Restriction", International Journal of Electrical, Computer, and Systems Engineering, Vol. 3, No. 1, 2009.
[9] Zhongxu Li, Yutian Liu, Rushui Liu and Xinsheng Niu, "Network Partition for Distributed Reactive Power Optimization in Power Systems", IEEE International Conference on Networking, Sensing and Control, 6-8 April 2008, pp. 385-388.
[10] P. Kanakasabapathy and K. Shanti Swarup, "Optimal Bidding Strategy for Multi-unit Pumped Storage Plant in Pool-Based Electricity Market Using Evolutionary Tristate PSO", IEEE International Conference on Sustainable Energy Technologies (ICSET 2008), 24-27 Nov. 2008, pp. 95-100.
An Efficient Algorithm to Reconstruct a Minimum
Spanning Tree in Asynchronous Distributed
Systems
Suman Kundu Dr. Uttam Kr. Roy
Department of Information Technology Department of Information Technology
Jadavpur University Jadavpur University
Salt Lake, Kolkata - 700098 Salt Lake, Kolkata - 700098
sumankundu.nsec@gmail.com u roy@it.jusl.ac.in

Abstract—In a highly dynamic asynchronous distributed network, node failure (or recovery) and link failure (or recovery) trigger topological changes. In many cases, reconstructing the minimum spanning tree after each such topological change is very much required.

In this paper, we describe a distributed algorithm based on message passing to reconstruct the minimum spanning tree after a link failure. The algorithm assumes that no further topological changes occur during its execution. The proposed algorithm requires significantly fewer messages to reconstruct the spanning tree in comparison to other existing algorithms.

I. INTRODUCTION

A distributed network consists of several nodes and the connections among them. Each node is a computational unit, and the connections between them can send and receive messages in a duplex manner. Multiple paths may exist between a pair of nodes. A Minimum Spanning Tree (hereafter referred to as MST) of such a network is the minimally connected tree that contains all the nodes of the network. Applications of the MST include effective communication in distributed systems, effective file searching and sharing in peer-to-peer networks, gateway routing in local area networks, bandwidth allocation in multi-hop radio networks and other computational scenarios. Usually, a cost is associated with each link. The cost may indicate the distance between two nodes, the time required to send or receive data packets, the bandwidth of the communication channel, or any other parameter. An MST always has the minimum cumulative cost within the network.

A distributed system is dynamic in nature, i.e. in any distributed network topological changes occur with respect to time. A change can occur due to deletion or recovery of nodes and links. In many situations, it is important to reconstruct the MST after each topological change. The main hurdle to reconstructing the MST arises from the asynchronous nature of the system. Moreover, a node knows only local information. If topological changes occur, they must be propagated to each node via message communication. It is also possible that some part of the network gets the latest knowledge, whereas some portion does not. The algorithm should address this issue as well.

Several algorithms for constructing an MST in distributed systems have been proposed in the last three decades. Most of these MST construction algorithms are applicable in a static topology. In this paper, we propose an algorithm based on message passing to reconstruct an MST that works seamlessly even in a dynamic topology. Our algorithm considers a single link failure and assumes that no further topological changes occur during its execution. It is also shown that the total number of messages required to reconstruct the spanning tree is significantly less in comparison to other existing algorithms.

The rest of the paper is organized as follows: Section II describes the related work and an overview of our result. Section III describes the distributed algorithm, its analysis and the proof of correctness. In Section IV, we provide simulation results; Section V concludes the overall algorithm, and finally in Section VI we point out the further research areas we are working on.

II. RELATED WORK

In their pioneering paper [2], Gallager, Humblet and Spira proposed one of the first distributed protocols to construct an MST, in the year 1983. The protocol of [2] was further improved in the protocols of [3], [4], [5], [6], [7] and [8]. In [9], some flaws of [5] are rectified. All these protocols for constructing an MST were developed for a static topology. Some of them address message efficiency and some address time efficiency as the performance measure for the algorithm.

However, distributed systems are dynamic, as described in the previous section. Researchers are working on protocols which are resilient in nature, to adapt to topological changes. A few such algorithms are given in [1], [12] and [14]. In their paper [10], B. Das and M. C. Loui provide a serial and a parallel algorithm to address a similar problem, later improved by Nardelli, Proietti and Widmayer in their paper [11]. These algorithms do not address the distributed version of the problem. In paper [13], P. Flocchini, T. M. Enriquez, L. Pagli, G. Prencipe and N. Santoro provided a distributed version of the same problem. In [13], the authors provided
the precomputed node replacement scheme. In our paper we provide an improved version of the distributed algorithm of [1] for a single link failure. In the following section, we will describe the response to a link failure by [1].

A. Basic Algorithm of [1]

In the algorithm of [1], C. Chang, I. A. Cimett and S.P.R. Kumar proposed a resilient algorithm which reconstructs the MST after link failure and recovery. The complexity of the algorithm for a single link failure is O(e), where e is the number of links in the network.

A link failure is a process of fragment expansion, i.e. the failure of a link which is part of the MST breaks the MST into two different fragments. The failed link initiates the process of recovery in the adjacent nodes. The initiator node generates a new fragment identity. In algorithm [1], the authors suggest two approaches to generate the new fragment identity such that no conflict occurs between subsequent topological changes. The first approach is to include the identities of all nodes of the fragment in the fragment identity. The second approach is to maintain a counter for each link; this counter counts the link failures, and its value is included along with the weight and the identity of the node when generating the new fragment identity. With the first approach, the fragment identity of a large fragment becomes very large, so the second approach is more efficient in terms of message size.

Suppose a link of weight w(u, v) is broken at a certain time, and let u be the parent of v. In response to the link failure, u will generate the new identity for its fragment as ⟨w(u, v), u, c⟩, where c is the counter of topological changes of the link (u, v). Then u will forward the fragment identity to the root of the fragment. In the case of v, however, after generating the fragment identity ⟨w(u, v), v, c⟩, it marks itself as root of its fragment. After getting the new fragment identity, the roots of these two fragments change their fragment identity and start broadcasting REIDEN<id> over the tree links and wait for the acknowledgment. Any intermediate node, upon getting the REIDEN<id> message, changes its fragment identity to the id and sends the same message to its child nodes. A leaf node, after getting the REIDEN<id>, changes its fragment identity and sends REIDEN_ACK<id> to its parent. An intermediate node sends REIDEN_ACK<id> to its parent only after getting REIDEN_ACK<id> from all of its children. Receipt of the REIDEN_ACK<id> message indicates that all nodes of the fragment are aware of the new identity value.

The root now changes its state to find and initiates the find minimum outgoing edge (hereafter referred to as MOE) phase by sending the FINDMOE message to the children. When a node receives the FINDMOE message, it changes its state to find. In the find state each node starts to send a TEST<id> message via each non-tree link. A TEST<id> message is responded to by either ACCEPT<id> or REJECT<id>. An ACCEPT<id> message indicates that the edge is outgoing, leading to another fragment. An important thing to remember here is that the ACCEPT or REJECT message should return the identity number of the test message; this helps to determine whether the message is for the current failure or a previous one. After identifying the MOE, a node propagates FINDMOE_ACK<w(MOE)> upward, where w(MOE) is the locally known best outgoing weight, either its own MOE or an MOE received from its children (whichever is minimum). After receiving FINDMOE_ACK<w(MOE)>, the root sends CHANGE_ROOT<id> through the same path over which it received the MOE. The CHANGE_ROOT<id> reaches the node on which the MOE of the fragment is incident. That node marks itself the new root of the fragment and sends a CONNECT message over the MOE. The connect subroutine works the same as in algorithm [2]: it merges two fragments sharing the same MOE and starts the next iteration.

B. Overview of Our Results

When considering the single link failure, our approach provides a significant improvement in the total number of messages required during reconstruction over the algorithm of [1]. Also, the message size for some control messages is improved slightly.

After a link failure, the fragment which contains the root node of the MST is referred to as the root fragment in this paper. If the root fragment contains E′ edges, then our algorithm requires 2 × E′ fewer messages to reconstruct the MST. Our approach here is to use the previously known fragment identity (call it historical data) for the root fragment. However, how the algorithm evolves if another link failure occurs during the execution is still under observation.

III. ALGORITHM TO RECONSTRUCT THE MST AFTER A LINK FAILURE

We closely followed the response of the algorithm [1] to a single link failure and found some areas of improvement. In the following subsections, we will describe the network model, our observation regarding the algorithm [1], our contribution to improving the algorithm, a description of the modified distributed algorithm, an analysis of the outcome and a proof of correctness.

A. Network Model

The communication model for the algorithm is an asynchronous network represented by an undirected weighted graph of N nodes. The graph is represented by G(V, E), where V is the set of nodes and E ⊂ V × V is the set of links. Each node is a computing unit consisting of a processor, a local memory and also an input and an output queue. The input (output) queue is an unlimited-size buffer to send and receive messages. A unique identification number is associated with each node (node i represents the node with the identification number i).

Each link (u, v), assigned a fixed weight w(u, v), is a bidirectional communication line between the node u and the node v. Each node has only local information, i.e. each node is aware of its identification number and the weights of its incident links. After construction of the MST, each node will be aware of two additional pieces of information: first, the adjacent edge leading to the parent node in the MST, and second, the adjacent edges leading to the child nodes in the MST.
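The counter-based identity scheme described above can be illustrated with a small sketch (the names `Link` and `new_fragment_identity` are ours, not from [1]; the tuple mirrors the ⟨w(u, v), node, c⟩ identity in the text):

```python
from dataclasses import dataclass

@dataclass
class Link:
    u: int                # one endpoint
    v: int                # other endpoint
    weight: float         # fixed link weight w(u, v)
    fail_count: int = 0   # c: counts topological changes of this link

def new_fragment_identity(link, node_id):
    """Identity generated by an endpoint after an MST link failure.

    Combining the failed link's weight, the generating node's id and the
    link's failure counter keeps identities from subsequent failures of
    the same link from clashing, without listing every node of the fragment.
    """
    return (link.weight, node_id, link.fail_count)

# Failure of the MST link between u=2 and v=5 with weight 7.5:
e = Link(u=2, v=5, weight=7.5)
e.fail_count += 1                       # record the topological change
id_u = new_fragment_identity(e, e.u)    # u forwards this toward its root
id_v = new_fragment_identity(e, e.v)    # v marks itself root of its fragment
```

The two endpoints always produce distinct identities because the node id differs, while the counter separates repeated failures of the same link.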
Nodes can communicate only via messages. Messages may be lost due to link failure during transmission. However, if the link is functioning, messages can be sent from either end and are received by the other end within a finite but undeterminable time, without error and in sequence. Also, if a link failure occurs, the failure event triggers the recovery process at each end of the failed link, and the recovery process is initiated by those end nodes.

B. Observation

In the algorithm of [1] the reconstruction process works in two phases. In Phase-I, the root node informs each node of the fragment of the new fragment identity, and in Phase-II, each node finds its own MOE and forwards the MOE to the root. The root then identifies the MOE of the fragment. Finally, the fragment sends a CONNECT message via the MOE.

Fig. 1. MST of a random network

After a failure, each fragment changes its fragment identity to a new one. Also, each message passed to a neighbor carries the fragment identity along with the control information. This is used because, if overlapping link failures occur, message responses can be filtered on the fragment identity in such a way that only the current failure is processed during the execution.

C. Our Contribution

Our contribution to the algorithm is that we can use historical data, namely the previously known fragment identity, for one fragment. When we use the previously known fragment identity, the fragment with the older identity enters its Phase-II without executing Phase-I. In our algorithm, we use the previously known fragment identity for the root fragment. Also, the FAILURE message propagating from the failed link to the root of the root fragment no longer needs to carry a newly generated fragment identity, which means the FAILURE message size is also reduced. The difficulty with this approach is that when a TEST<id> message is received, it is possible that the fragment identity of the receiving node is not updated yet; it may be part of the same fragment, or of another fragment still in Phase-I (propagation of the new fragment identity is not completed yet). However, if a node receives a TEST message in Phase-II, then its fragment identity is correct. So, to avoid a conflicting response to a TEST<id> message, the response is delayed until the node enters Phase-II.

Also, it is assumed that no further failures occur during the execution, so it is possible to reduce the size of some control messages. For example, the ACCEPT and REJECT messages do not need to send the fragment identity back to the sender.

D. Description of the protocol

In the beginning, each node maintains a collection of adjacent edges, sorted by the cost of the link. During its lifetime, an adjacent link can have one of the following statuses:
1) Basic - the link is yet to be processed
2) Parent - the link leads to the parent
3) Child - the link leads to a child
4) Rejected - the link leads to a node included in the same fragment
5) Down - the link is not working

It is assumed that the MST is already constructed using some distributed protocol. A link failure triggers the recovery process at both ends of the link. Suppose failure occurs for the link e = (u, v, w, c), where u and v are the nodes connected by the link, w is the weight of the link and c is the status change count for the link. If the link is not included in the MST, i.e. it is either in the Basic or in the Rejected state, then u and v simply change the link status to Down and do nothing.

Fig. 2. A non MST link failure

Otherwise, the nodes mark the link Down and respond in the following manner -
• When the previous status of the link is Parent - the node marks itself as root of the newly generated fragment. It then marks all Rejected links as Basic. This is necessary because those links may lead to other fragments after the topological change. The node generates a new fragment identity for the fragment, resets its own fragment identity and enters Phase-I by sending the INIT<fid> message to its children. Here fid is the new fragment identity as described by the algorithm [1], i.e. it includes the weight and the failure count along with the identity of the node. If u is the parent of v for the example edge e, then after the failure v marks itself the root, changes its fragment identity to fid = ⟨w(u, v), v, c⟩ and then initiates Phase-I by sending this fid along with the INIT message. After receiving the INIT<fid> message, a node changes its
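The five link states and the cost-sorted adjacent-edge collection described above can be sketched as follows (a minimal illustration with our own naming, not the paper's code):

```python
from enum import Enum, auto

class LinkState(Enum):
    BASIC = auto()     # yet to be processed
    PARENT = auto()    # leads to the parent in the MST
    CHILD = auto()     # leads to a child in the MST
    REJECTED = auto()  # leads to a node of the same fragment
    DOWN = auto()      # not working

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.adjacent = []   # entries: [weight, neighbor_id, LinkState]

    def add_link(self, weight, neighbor):
        self.adjacent.append([weight, neighbor, LinkState.BASIC])
        self.adjacent.sort(key=lambda e: e[0])   # keep sorted by cost

    def min_basic_edge(self):
        """Cheapest Basic edge - the candidate to TEST for the local MOE."""
        for e in self.adjacent:
            if e[2] == LinkState.BASIC:
                return e
        return None

    def on_non_mst_link_failure(self, neighbor):
        """A Basic/Rejected link failed: just mark it Down and do nothing."""
        for e in self.adjacent:
            if e[1] == neighbor and e[2] in (LinkState.BASIC, LinkState.REJECTED):
                e[2] = LinkState.DOWN
```

Because the collection stays sorted by cost, the first Basic entry is always the locally cheapest candidate outgoing edge.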
fragment identity to fid and marks all Rejected links as Basic. It then forwards the message to its children. If the node is a leaf node, it returns a FINISH message to its parent. Each intermediate node waits until it has received a FINISH message from all its children and then sends the FINISH message to its parent. A FINISH message received by v (i.e. the root) indicates that all nodes of the fragment know the current fragment identity. Then v starts Phase-II by sending the FINDMOE message.
• When the previous status of the link is Child - the node forwards the FAILURE message upward. Note that the node does not generate a new fragment identity to forward along with the FAILURE message. When the root node detects the failure, it initiates Phase-II directly by sending the FINDMOE message to its children.

Fig. 3. MST link failure and response of u and v

• A node receiving the FINDMOE message immediately enters the finding state. In the finding state, each node finds its local MOE: it picks the minimum weighted adjacent edge which is in the Basic state and sends TEST<fid>.

Fig. 4. Phase-II initiated by root of the fragment with FINDMOE message

• A node receives a TEST<fid> message. There are two cases to consider -
– Node executing in Phase-I: the response is delayed until the node itself enters Phase-II by receiving FINDMOE from its parent.
– Node executing in Phase-II: in this scenario a TEST<fid> message is replied to by ACCEPT if the node's fragment identity is different from fid, or by REJECT if its fragment identity is the same as fid.

Fig. 5. TEST message and response

• Upon receiving a REJECT message, the node picks the next best edge in the Basic state and sends TEST<fid> to test that edge. However, if it gets an ACCEPT message, which indicates it has found its local MOE, the node waits for its children's responses. After finding the local MOE, leaf nodes propagate the best weight to their parent via a REPORT<wt> message.
• When a REPORT<wt> message is received, an intermediate node compares its local MOE with the wt received from its children and updates the MOE accordingly. After getting REPORT<wt> messages from all of its children, it sends a REPORT<wt> to its parent with the best weight known to it. Thus the best weight is propagated to the root node of the fragment. At this point the root sends the CHANGE_ROOT message along the same path that leads to the MOE.
• The node on which the MOE of the fragment is incident receives CHANGE_ROOT and marks itself as root of the fragment. It then sends the CONNECT message over the MOE and merges with the fragment sharing the same MOE.

Fig. 6. Root of the fragment is changed and CONNECT message sent to merge the fragments
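The delayed TEST response above — a node in Phase-I queues TEST<fid> requests and answers them only once FINDMOE moves it into Phase-II — can be sketched as follows (hypothetical handler names, ours, not the paper's code):

```python
# Sketch of TEST<fid> handling: a node in Phase-I buffers the request and
# replies only after entering Phase-II, when its fragment identity is final.

class FragmentNode:
    def __init__(self, fragment_id):
        self.fragment_id = fragment_id
        self.phase = 1               # 1: re-identification, 2: MOE search
        self.pending_tests = []      # TEST messages awaiting Phase-II

    def on_test(self, sender, fid):
        if self.phase == 1:
            self.pending_tests.append((sender, fid))   # delay the reply
            return None
        return self._reply(sender, fid)

    def on_findmoe(self):
        """FINDMOE from the parent moves the node into Phase-II."""
        self.phase = 2
        replies = [self._reply(s, f) for s, f in self.pending_tests]
        self.pending_tests.clear()
        return replies

    def _reply(self, sender, fid):
        # REJECT: same fragment; ACCEPT: edge leads to another fragment.
        kind = "REJECT" if fid == self.fragment_id else "ACCEPT"
        return (kind, sender)
```

Deferring the reply this way avoids answering with a stale fragment identity while the new identity is still being propagated.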
E. Analysis

Compared with the algorithm [1], in our approach the root fragment directly enters Phase-II. So the INIT<fid> and FINISH messages (REIDEN<id> and REIDEN_ACK<id> in algorithm [1]) of Phase-I are not required for the root fragment. Suppose the root fragment has E′ edges after the failure. Then executing Phase-I would require sending E′ INIT<fid> messages over the E′ links, and likewise E′ FINISH messages over those links. That means the reconstruction process of our algorithm requires E′ + E′ = 2E′ fewer messages than the protocol described in [1].

When comparing our approach with the protocol of [1] in terms of message size, some control messages contain very few bits with respect to the protocol of [1]. For example, the FAILURE message in the root fragment contains only the control information indicating the failure occurrence; no fragment identity is sent along with it. As we consider that no further failure occurs during execution of the protocol, the ACCEPT and REJECT messages also contain only control information; no fragment identity is returned with them.

1) Complexity: Let us consider that the network contains N nodes and E edges, and that the initial MST contains the N nodes and e edges before the failure. Also, consider that the root fragment contains N′ nodes and E′ edges. If the height of the root fragment is h′, then propagating the failure message to the root of the root fragment requires O(h′) messages. Propagating the new fragment identity within the other fragment requires O(N − N′) messages, because this information is propagated through the tree links. Similarly, sending the FINDMOE request and merging two fragments require O(N) messages, since these messages are also sent through the tree links. However, finding the MOE of the fragments requires sending messages through O(E) links of the network. So the message complexity of the algorithm on a link failure is O(E).

F. Proof of Correctness

At first, we will prove several lemmas, which are used in the distributed algorithm, and then we will show that the algorithm generates the minimum weighted spanning tree after completion.

Lemma III-F.1. Before topological changes, each node of the network has information about the tree links incident to it.

Proof: It is assumed that the MST is initially constructed using some distributed protocol. In our simulation, we use

Proof: A failed link breaks the MST into two fragments and notifies the failure to either end of the link. The node attached to the root fragment propagates the FAILURE message toward the root element. The other node marks itself root of the new fragment and generates the new fragment identity. Each message is assumed to reach its destination in finite time and in sequence whenever the link is working. Also, it is assumed that no further failure occurs during the execution. Hence, the FAILURE message is correctly received by the roots of both fragments in finite time.

Lemma III-F.3. On each node, the fragment identity is updated before the start of Phase-II according to the latest link failure.

Proof: The root fragment uses the previously known fragment identity of the existing MST, so it does not need to update the fragment identity; it directly enters Phase-II, and all nodes belonging to the root fragment are already aware of the fragment identity. For the other fragment, however, the newly generated identity is propagated to the child nodes by the INIT<fid> message. A leaf node returns the FINISH message to its parent after updating its fragment identity. Only after receiving the FINISH message from all its children does an intermediate node send the FINISH message to its parent. As the messages are assumed to reach their destination sequentially in finite time, a FINISH message received by the root of the fragment implies that each node has already updated its fragment identity. The root node then initiates Phase-II for that fragment. Thus, when a node is in Phase-II, its fragment identity is always updated with the latest link failure.

Lemma III-F.4. Each fragment starts its Phase-II execution in finite time.

Proof: As no link failure occurs during the execution of the protocol, all messages of Phase-I will be properly responded to by the nodes in finite time, and Phase-I terminates when the FINISH message is received by the root of a fragment. The root node then initiates Phase-II by sending the FINDMOE message. Hence, each fragment starts its Phase-II in finite time.

Lemma III-F.5. Fragments find their MOE within a finite time.

Proof: After getting the FINDMOE message, each node sends a TEST message over the minimum weighted non-tree link (which is not in Down status) to test whether the link leads to another fragment or not. The node then waits for the response from the other end. The TEST message is correctly
the algorithm [7] to construct MST. Each node maintains a
reached to the other end because the link is not marked as
list of incident edges and their status as described in section
down and message lost is not a valid property according to
3.4 i.e. nodes are aware about the links leading to its parent
the assumption. TEST message response is delayed until the
and links leading to its children. These parent and child links
node start executing Phase-II. From the previous lemma, we
denote the tree links which are included in MST. Hence, each
found that TEST message is responded within a finite time
node is aware about the tree links, incident to it before any
because the responder node will enter Phase-II within a finite
topological changes.
time. Also, the response is known to be correct because in
Lemma III-F.2. Root of the fragments receives failure noti- Phase-II the fragment identity is already updated with the new
fication within a finite time. fragment identity. Now, if El be the local minimum outgoing

122
edge and Ec the minimum outgoing edge forwarded by its children, then an intermediate node forwards MOE = min(El, Ec) to its parent. Thus, at the root node, the MOE of the fragment is calculated as MOE = min(El, Ec) within a finite time.

Theorem III-F.1. The algorithm reconstructs the spanning tree in finite time, and the reconstructed spanning tree is the minimum weight spanning tree of the network.

Proof: When the minimum weighted outgoing edge (MOE) is determined by the root, the fragment changes its root to the node on which the MOE is incident. It then sends the CONNECT message through the MOE. If the MOE found by one fragment is also the MOE of the other fragment, the two fragments merge and create a merged fragment. If no other fragment remains, this is the desired spanning tree. As merging is only possible if both fragments agree that the MOE is common, the algorithm clearly does not produce any cyclic path. Considering the case where only one link failure occurs, we can easily derive that the MOE of one fragment is also the MOE of the other fragment (since network links are uniquely weighted). Hence, the fragments merge at the MOE and produce a spanning tree.

Now, there is no possibility of any other outgoing edge with lower weight than the MOE, because in the process only the minimum weight edge is filtered and forwarded from the leaf nodes to the root node (MOE = min(El, Ec), from the last lemma). Finally, the root node determines the MOE of the fragment. So the merging occurs at the minimum possible weighted edge between the two fragments. Hence, the merged spanning tree has the minimum collective weight in the system.

Thus, upon termination the algorithm reconstructs the spanning tree which has the minimum weight, i.e., the MST of the network.

IV. EXPERIMENT AND RESULT

To compare the output, we simulated a few network graphs using Network Simulator v2 (NS2 2.29) [15]. We constructed the initial MST using the algorithm of [7]. The figures below show the results of one experiment.

Fig. 7. Initial Graph and Constructed MST

Fig. 8. Recovered MST after failure (in red)

The example graph contains the vertex set V, edge set E, and corresponding weight set W as below.

Vertex:
V = {0, 1, 2, 3, 4, 5}

Edges:
E = {e1 = (0, 5), e2 = (0, 2), e3 = (0, 1), e4 = (1, 2), e5 = (1, 4), e6 = (1, 3), e7 = (2, 4), e8 = (2, 3)}

Weight:
W = {w(e1) = 4, w(e2) = 8, w(e3) = 15, w(e4) = 10, w(e5) = 3, w(e6) = 5, w(e7) = 6, w(e8) = 7}

The messages required to terminate the execution of the algorithm are tabulated in Table I. In Scenario 1, link e7 fails and initiates the recovery process. If we run the construction algorithm again (i.e., the algorithm of [7]) after the failure, it requires 114 messages, whereas the performance improves if we use the reconstruction algorithm (i.e., the algorithm of [1]). Our modified algorithm takes fewer messages than the existing reconstruction algorithm of [1].

TABLE I
Link Failure vs. Required Messages

  Broken Link | MST Construction | Algorithm of [1] | Modified Algorithm
  Scenario 1: Single link failure
  e7          | 114              | 59               | 55
  Scenario 2: Single link failure
  e6          | 114              | 54               | 46
  Scenario 3: Link failures one after another
  e7          | 114              | 59               | 55
  e6          |                  | 50               | 50

V. CONCLUSION

In this paper, we have presented a distributed algorithm for reconstructing a Minimum Spanning Tree after the deletion of a link. The problem can also be solved using the protocol of [1]. Here we showed that, for the single link deletion scenario, our protocol can reconstruct the MST with 2E′ fewer messages than protocol [1], where E′ represents the number of links in the root fragment. If we consider an MST with large depth where the link failure occurs very close to a leaf node (i.e., E′ is much greater than E − E′), then our algorithm performs in a much better
way. However, when E′ is zero, i.e., the link failure occurs at the root node, the algorithm completes without any improvement in the total number of messages. We can refer to Scenario 3, where for the second link failure there is no improvement in the total number of messages.

VI. FURTHER WORKS

When a TEST<fid> message is delayed until the node enters its finding state, we can use the same TEST<fid> to determine whether the adjacent edge is Rejected or Accepted, and use it instead of sending another TEST<fid> message over the same edge. However, this approach may lead to some other difficulties due to the asynchronous nature of the system, and we are currently working on this.

Whenever the failure occurs at the root or very close to the root, the improvement is close to zero. We are working on making the algorithm use historical data in such a way that it produces an improvement in these scenarios too.

Also, we are currently working on modifying the algorithm so that it accepts topological changes during its execution.

REFERENCES

[1] C. Cheng, I. A. Cimet, and S. P. R. Kumar, "A protocol to maintain a minimum spanning tree in a dynamic topology," Computer Communication Review, 18(4):330-338, Aug. 1988.
[2] R. Gallager, P. Humblet, and P. Spira, "A distributed algorithm for minimum-weight spanning trees," ACM Transactions on Programming Languages and Systems, 5(1):66-77, January 1983.
[3] F. Chin and H. Ting, "An almost linear time and O(n log n + e) messages distributed algorithm for minimum-weight spanning trees," Proceedings of the 26th IEEE Symp. on Foundations of Computer Science, pp. 257-266, 1985.
[4] E. Gafni, "Improvement in the time complexities of two message optimal protocols," Proceedings of the ACM Symp. on Principles of Distributed Computing, 1985.
[5] B. Awerbuch, "Optimal Distributed Algorithm for Minimum Weight Spanning Tree, Counting, Leader Election, and Related Problems," Symp. on Theory of Computing, pp. 230-240, May 1987.
[6] J. Garay, S. Kutten, and D. Peleg, "A Sub-Linear Time Distributed Algorithm for Minimum-Weight Spanning Trees," 34th IEEE Symp. on Foundations of Computer Science, pp. 659-668, November 1993.
[7] Gurdip Singh and Arthur J. Bernstein, "A highly asynchronous minimum spanning tree protocol," Distributed Computing, 8(3):151-161, March 1995.
[8] M. Elkin, "A faster distributed protocol for constructing a minimum spanning tree," Proceedings of the ACM-SIAM Symp. on Discrete Algorithms, pp. 352-361, 2004.
[9] Michalis Faloutsos and Mart Molle, "Optimal Distributed Algorithm for Minimum Spanning Trees Revisited," Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing, pp. 231-237, 1995.
[10] B. Das and M. C. Loui, "Reconstructing a minimum spanning tree after deletion of any node," Algorithmica, 31:530-547, 2001.
[11] E. Nardelli, G. Proietti, and P. Widmayer, "Nearly linear time minimum spanning tree maintenance for transient node failures," Algorithmica, 40:119-132, 2004.
[12] Hichem Megharbi and Hamamache Kheddouci, "Distributed algorithms for Constructing and Maintaining a Spanning Tree in a Mobile Ad hoc Network," First International Workshop on Managing Context Information in Mobile and Pervasive Environments, 2005.
[13] P. Flocchini, L. Pagli, G. Prencipe, and N. Santoro, "Distributed computation of all node replacements of a minimum spanning tree," Euro-Par, volume 4641 of LNCS, pp. 598-607, Springer, 2007.
[14] B. Awerbuch, I. Cidon, and S. Kutten, "Optimal maintenance of a spanning tree," J. ACM, 55(4), Article 18, 45 pages, September 2008.
[15] Network Simulator version 2 (NS2), URL: http://www.isi.edu/nsnam/ns/
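The worked example of Section IV (Figs. 7 and 8) can be cross-checked offline with a standard centralized MST computation. The sketch below is only a verification aid, not the distributed protocol of the paper: it runs Kruskal's algorithm on the stated edge and weight sets, then reruns it with the failed link e7 = (2, 4) removed. The recovered tree swaps in e8 = (2, 3), the minimum weight outgoing edge between the two fragments, exactly as Theorem III-F.1 predicts.

```python
def kruskal(n, edges):
    """Centralized Kruskal MST; edges are (weight, u, v) tuples."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge joins two fragments
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

# The example graph of Section IV, encoded per the sets E and W above.
E = [(4, 0, 5), (8, 0, 2), (15, 0, 1), (10, 1, 2),
     (3, 1, 4), (5, 1, 3), (6, 2, 4), (7, 2, 3)]
mst = kruskal(6, E)                                        # initial MST (Fig. 7)
mst_after = kruskal(6, [e for e in E if e != (6, 2, 4)])   # e7 deleted (Fig. 8)
```

The initial MST has total weight 26 and contains e7; after the failure, the recomputed tree has total weight 27 and contains e8 instead.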
A SAL based algorithm for convex optimization
problems
Amit Kumar Mishra
Department of Electronics and Communication Engineering
Indian Institute of Technology Guwahati, India
Email: akmishra@ieee.org

Abstract—A new successive approximation logic (SAL) based iterative optimization algorithm for convex optimization problems is presented in this paper. The algorithm can be generalized to multi-variable quadratic objective functions. There are two major advantages of the proposed algorithm. First of all, the proposed algorithm takes a fixed number of iterations which depends not on the objective function but on the search span and on the resolution we desire. Secondly, for an n-variable objective function, if the number of data points we consider in the span is N, then the algorithm takes just n log2 N iterations.

Index Terms—Quadratic objective function, iterative optimization

I. INTRODUCTION

Solution of convex optimization problems is a well studied area with a number of existing iterative algorithms. However, the performance of these algorithms depends on how far we start from the solution and on the nature of the objective function [1], [2]. In the current paper we present an iterative algorithm based on successive approximation logic (SAL). SAL has been successfully used in a range of applications, from analog-to-digital converters [3] to the coordinate rotation digital computer (CORDIC) architecture [4]. In the proposed algorithm we first discretize the search domain and represent the points using the binary number system. The starting point is an all-zero binary number. The bits are updated starting from the most significant bit (MSB) down to the least significant bit (LSB). The update rule is based on the slope of the objective function at the given candidate solution.

Some of the major advantages of this algorithm are as follows. First of all, it is simple and easily implementable on digital hardware. Secondly, irrespective of the point from where we start, the search takes exactly B iterations, where B is the number of bits used to represent each point in the search space. Thirdly, each iteration is computationally light, involving the calculation of the objective function twice; the algorithm needs just the sign of the gradient at a point, not its exact magnitude. The error of optimization is less than the LSB, i.e., 2^-B. Finally, for an n-variable objective function, if the number of data points we consider in the span is N, then the algorithm takes just n log2 N iterations.

In the present paper we have not handled the problem of boundary constraints. It is further assumed that the search space has a single extremum. Lastly, we only deal with the maxima-search problem; the solution to the minima-search problem can easily be achieved by incorporating some trivial modifications to the algorithm.

We also show a simple extrapolation of the algorithm to multi-variable optimization (MVO), with an illustration of the algorithm for bivariate optimization.

The next section expounds the algorithm for the single variable optimization (SVO) problem. Section 3 describes the algorithm for MVO. Section 4 compares the performance of the proposed algorithm with that of some of the classic algorithms from the literature. The last section concludes the paper.

Fig. 1: Generic single variable maximization problem

II. THE ALGORITHM FOR SVO

The problem definition for SVO is as follows. Given a function f : A → R, we seek the point xO such that f(xO) ≥ f(x) for all x in the search space. Throughout this paper we deal with the maximization problem; the extension of the algorithm to a minimization problem is trivial.

Figure 1 shows the generic single variable maximization problem. P1 and P2 are two generic points in the search space. In an iterative optimization, if P1 is the resulting point of the current iteration, the next iteration should move the point towards the right, and if the resulting point is P2, then the next iteration should move the point towards the left. This strategy is shown in Algorithm 1.

It may be noted here that the updating depends only on the sign of the slope, not on its exact magnitude. This gives scope to use computationally less complicated algorithms to estimate the slope.
The above mentioned updating is done using a digital successive approximation algorithm. In this, first of all the search space is sampled by a digital representation using the binary number system. This representation and the number of bits used for each point depend on the desired accuracy of the algorithm. The starting point of the updating is always 0. Let xi be the ith estimate of xO, and let B be the number of bits used for representing a point in the search space. In the ith iteration (i ∈ [1, B]), the (B − i − 1)th bit is updated conforming with Algorithm 1. The updating algorithm for the ith iteration is given in Algorithm 2.

Algorithm 1 Updating algorithm for xi
1: if slope(xi) ≥ 0 then
2:   xi+1 > xi
3: else if slope(xi) ≤ 0 then
4:   xi+1 < xi
5: end if

Algorithm 2 Updating algorithm for the ith iteration
1: if slope(xi) ≥ 0 then
2:   (B − i − 1)th bit of X = 1
3: else if slope(xi) ≤ 0 then
4:   (B − i − 1)th bit of X = 0
5: end if

The complete pseudo-code of the algorithm for SVO is shown in Algorithm 3. Table I gives a short description of the functions used in the pseudo-code.

TABLE I: Short description of the functions used in the pseudo-codes

  Function name | Description
  ceil          | ceiling function
  Hs            | Heaviside step function
  zeros         | zeros(K) gives a K bit binary number with all bits set to 0
  bin2int       | converts a binary number to the equivalent integer
  findslope     | finds the slope of the cost function at the given arguments

Algorithm 3 Find xO, given the search space boundaries xh and xl, and the desired resolution in the search space δx
1: B ⇐ ceil(log2((xh − xl)/δx))
2: δxup ⇐ (xh − xl)/2^B
3: arg ⇐ zeros(1, B)
4: sl ⇐ 0
5: for i = 1 to B do
6:   arg(i) ⇐ 1
7:   xi ⇐ xl + bin2int(arg) ∗ δxup
8:   sl ⇐ Hs(findslope(xi))
9:   arg(i) ⇐ sl
10: end for
11: xO ⇐ xB
12: return xO

A. Explanation of the pseudo-code

The algorithm needs three inputs, viz. the boundary points of the search space xh and xl, and the desired resolution in the search space δx. In step 1, the number of bits B required for the problem is estimated. From this the updated resolution δxup is calculated (δxup ≤ δx). Variable sl contains the slope of the function at xi for the ith iteration. arg is the binary number whose bits are updated in each iteration. In B iterations, all the bits of arg are updated. xB is assigned to xO and is returned as the answer.

III. THE ALGORITHM FOR MVO

For multi-variable optimization (MVO), an updating algorithm similar to the already discussed SVO one is used. In this, all the variables of the search space are digitized as per the desired resolution and represented using the binary number system. Instead of applying the algorithm to each dimension separately, all the dimensions are fused together by interleaving the binary representations of the dimensions. Hence, if B bits are used to represent each dimension, the updating algorithm is applied to a DB bit number, where D is the dimension of the MVO problem. In the interleaving, the bits of the same significance are placed together. Hence, a complete run of the algorithm needs DB iterations.

In general, if KSVO is the number of operations required for an SVO algorithm, a similar algorithm applied to a D-dimensional MVO will need KSVO^D operations. However, using the current algorithm, the number of operations for a D-variable optimization problem is D·KSVO. This results in a substantial speed up of the algorithm for MVO problems.

A. Pseudo-code for bi-variate optimization

As an example, Algorithm 4 gives the pseudo-code for a bivariate optimization problem. This algorithm needs six inputs, viz. the boundary points of the search space xh, xl and yh, yl, and the desired resolutions of the search space in both dimensions, δx, δy. In steps 1 and 2, the numbers of bits B1, B2 required for the problem in the two dimensions are estimated. To make the algorithm less complicated, a uniform number of bits is assigned to both dimensions, by taking the higher of B1 and B2 as the number of bits to represent the search space in both dimensions. Accordingly, the resolutions in both dimensions are updated to δxup and δyup in steps 4 and 5. Variable sl contains the slope of the function at (xi, yi) for the ith iteration. (argx, argy) are the digital numbers for the two dimensions, whose bits are updated in each iteration. arg is the 2B bit long digital number whose odd sequenced bits are derived from argx and even sequenced bits are derived from argy. In 2B iterations, all the bits of arg are updated. The final estimates (x2B, y2B) are assigned to (xO, yO) and are returned as the answer.
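Algorithm 3 can be exercised with a short runnable sketch. This is only an illustration, not the author's implementation: `findslope` is approximated here by a central difference, and the bit string is kept as a Python list.

```python
import math

def salo_svo(f, xl, xh, dx):
    """Sketch of Algorithm 3: fix the bits of the answer MSB-first,
    keeping a tentatively set bit iff the slope there is >= 0."""
    slope = lambda x, h=dx / 4: (f(x + h) - f(x - h)) / (2 * h)
    B = math.ceil(math.log2((xh - xl) / dx))    # step 1: number of bits
    dx_up = (xh - xl) / 2 ** B                  # step 2: updated resolution
    arg = [0] * B                               # step 3: all-zero binary number
    for i in range(B):                          # steps 5-10, MSB first
        arg[i] = 1                              # tentatively set the bit
        xi = xl + int("".join(map(str, arg)), 2) * dx_up
        arg[i] = 1 if slope(xi) >= 0 else 0     # Heaviside of the slope
    return xl + int("".join(map(str, arg)), 2) * dx_up
```

For example, maximizing f(x) = −(x − 0.3)^2 on [0, 1] with δx = 2^-8 returns a point within one LSB of the true maximizer 0.3, in exactly B = 8 iterations, matching the 2^-B error bound claimed in the introduction.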
Algorithm 4 Find (xO, yO), given the search space boundaries (xh, yh) and (xl, yl), and the desired resolution in the search space (δx, δy)
1: B1 ⇐ ceil(log2((xh − xl)/δx))
2: B2 ⇐ ceil(log2((yh − yl)/δy))
3: B ⇐ max(B1, B2)
4: δxup ⇐ (xh − xl)/2^B
5: δyup ⇐ (yh − yl)/2^B
6: arg ⇐ zeros(1, 2B)
7: sl ⇐ zeros(1, 2B)
8: argx ⇐ zeros(1, B)
9: argy ⇐ zeros(1, B)
10: for i = 1 to 2B do
11:   arg(i) ⇐ 1
12:   for j = 1 to B do
13:     argx(j) ⇐ arg(2j − 1)
14:     argy(j) ⇐ arg(2j)
15:   end for
16:   xi ⇐ xl + bin2int(argx) ∗ δxup
17:   yi ⇐ yl + bin2int(argy) ∗ δyup
18:   sl(i) ⇐ Hs(findslope(xi, yi))
19:   arg(i) ⇐ sl(i)
20: end for
21: xO ⇐ x2B
22: yO ⇐ y2B
23: return (xO, yO)

IV. COMPARISON WITH SOME STANDARD OPTIMIZATION ALGORITHMS BASED ON NUMERICAL EXPERIMENTS

In this section we compare the proposed SAL based optimization (SALO) algorithm with some of the existing powerful optimization algorithms reported in the open literature [5], [6]. The comparisons are made on the basis of performance in the optimization of a scaled sine-valley function and the scaled Rosenbrock function. Comparisons are made with respect to the classic BFGS algorithm, Yuan's modified BFGS algorithm [6], and the usual trust region method with curvilinear path (UTRCP) [5].

The following are the functions on which the algorithm has been validated:

1) Problem 1: The first problem is a sine-valley function given by:

f(x1, x2) = 100[x2 − sin(x1)]^2 + 0.25 x1^2.   (1)

The starting point for this problem was (3π/2, −1). The solution is (0, 0).

2) Problem 2: The second problem is Rosenbrock's function given by:

f(x1, x2) = 100(x2 − x1^2)^2 + (1 − x1)^2.   (2)

The starting point for this problem was (−1.2, 1.0). The solution is (1, 1).

The problems were solved for a resolution of 10^-8, i.e., ||∇f(xk)|| ≤ 10^-8. The algorithms are compared based on two factors, viz. the number of iterations needed (NI) and the number of function evaluations (NF) involved.

TABLE II: Comparison of the proposed algorithm with some standard optimization algorithms

  Prob. no. | BFGS [6] | MBFGS [6] | UTRCP [5] | SALO
            | NI/NF    | NI/NF     | NI/NF     | NI/NF
  1         | 40/57    | 39/54     | 22/35     | 17/35
  2         | 33/45    | 34/45     | 22/38     | 17/35

Table II gives the consolidated results. It can be observed that the performance of the proposed SALO algorithm is better than or comparable to some of the best algorithms in the literature. However, the classic algorithms are designed to work on any unconstrained function; the generalization of the SALO algorithm along the same lines is in progress.

V. CONCLUSIONS AND DISCUSSIONS

We have presented an algorithm based on the digital successive approximation principle for convex optimization problems. The algorithm can directly be applied to SVO and, with minor modification, to MVO. Some of the major advantages of this algorithm are as follows:
• The number of iterations is fixed and equal to the number of bits used to represent numbers in the sample space, irrespective of the cost function.
• The resolution of the algorithm is chosen by the user and is fixed for a given number of bits, irrespective of the cost function.
• A D-dimensional optimization takes DB iterations. This greatly reduces the computational complexity of an MVO.
• The complete algorithm runs in the digital domain and hence is highly amenable to digital computer implementation.
• The proposed algorithm will also work for finding non-smooth maxima or minima, provided there are no local maxima or minima.

We have tested the algorithm on a range of quadratic functions with different numbers of variables. In all cases the algorithm was found to locate the maximum with error < 2^-B.

We have not discussed the boundary problem in this paper. Still, because of the above mentioned advantages, the algorithm is deemed to prove a useful practical solution for convex optimization problems in any domain.

REFERENCES

[1] R. Fletcher, Practical Methods of Optimization. Wiley Interscience, 1987.
[2] M. Powell, "How bad are the BFGS and DFP methods when the objective function is quadratic?" Mathematical Programming, vol. 34, pp. 34–47, 1986.
[3] J. F. Wakerly, Digital Design: Principles and Practices. Prentice Hall, 1999.
[4] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. Electronic Computers, vol. EC-8, pp. 330–334, 1959.
[5] Y. Xiao and F. Zhou, "Nonmonotone trust region methods with curvilinear path in unconstrained optimization," Computing, vol. 48, pp. 303–317, 1992.
[6] Y.-X. Yuan, "A modified BFGS algorithm for unconstrained optimization," IMA Journal of Numerical Analysis, vol. 11, pp. 325–332, 1991.
ADCOM 2009
WIRELESS SENSOR NETWORKS

Session Papers:

1. V. V. S. Suresh Kalepu and Raja Datta, "Energy Efficient Cluster Formation using Minimum Separation Distance and Relay CH's in Wireless Sensor Networks"

2. Pankaj Gupta, Tarun Bansal and Manoj Misra, "An Energy Efficient Base Station to Node Communication Protocol for Wireless Sensor Networks"

3. R. C. Hansdah, Neeraj Kumar and Amulya Ratna Swain, "A Broadcast Authentication Protocol for Multi-Hop Wireless Sensor Networks"
Energy Efficient Cluster Formation using Minimum Separation Distance
and Relay CH’s in Wireless Sensor Networks
and , Member, IEEE
Department of Electronics and Electrical Communication Engineering
Indian Institute of Technology, Kharagpur, India, Kharagpur-721302
Email: ,

Abstract

In this work we propose a scheme to select the relay nodes for forwarding network data when a minimum separation distance (MSD) is maintained between cluster heads in a cluster based sensor network. This prolongs network lifetime by spreading out the cluster heads, thus lowering the average communication energy consumption by optimizing the next node for delivery of data. The work also includes a study of the above protocol under varying network area. We also propose another cluster-based routing protocol for large network areas, which improves the MSD routing protocol by introducing minimum spanning trees (MST) instead of direct communications to connect nodes in clusters. We have done extensive simulations to show that the proposed method outperforms the existing techniques.

Keywords: Wireless Sensor Network, MSD, TDMA, LEACH, PEGASIS, CH, Minimum Spanning Tree.

I. Introduction

Wireless sensor networks consist of hundreds to thousands of low-power multi-functioning sensor nodes, operating in an unattended environment, with limited computational and sensing capabilities. Recent developments in low-power wireless integrated micro sensor technologies have made these sensor nodes available in large numbers, at a low cost, to be employed in a wide range of applications in military and national security, environmental monitoring, and many other fields [1]. In contrast to traditional sensors, sensor networks offer a flexible proposition in terms of ease of deployment and multiple functionalities. In classical sensors, the placement of the nodes and the network topology need to be predetermined and carefully engineered. However, in the case of modern wireless sensor nodes, their compact physical dimensions permit a large number of sensor nodes to be randomly deployed in inaccessible terrains. In addition, the nodes in a wireless sensor network are also capable of performing other functions such as data processing and routing, whereas in traditional sensor networks special nodes with computational capabilities have to be installed separately to achieve such functionalities.

In order to take advantage of these features of wireless sensor nodes, we need to account for certain constraints associated with them. In particular, minimizing energy consumption is a key requirement in the design of sensor network protocols and algorithms. Since the sensor nodes are equipped with small, often irreplaceable, batteries with limited power capacity, it is essential that the network be energy efficient in order to maximize the life span of the network [1, 2].

In this paper, we propose a method to select the relay nodes to forward the aggregated data by considering Link Cost Factors (LCF). This work includes another efficient cluster-based routing protocol for large network areas, which improves the MSD routing protocol by introducing Minimum Spanning Trees (MST) instead of direct communications to connect nodes in clusters. The rest of the paper is organized as follows: the important existing protocols and the improvements subsequently proposed upon them are described in Section II, and the power radio model used for simulations is presented in Section III. Section IV describes the drawbacks of existing protocols and the proposed algorithm. The results, duly supported by the relevant plots of performance characteristics and related analysis, are presented in Section V, and Section VI concludes the paper.

II. Related work

2.1. Cluster-Based Routing Protocol

The popular existing hierarchical routing protocol in sensor networks is Low Energy Adaptive Clustering Hierarchy (LEACH). LEACH [3] is a TDMA cluster based approach where a node elects itself to become cluster head with some probability and broadcasts an advertisement message to all the other nodes in the network. A non cluster head node selects a cluster head to join based on the received signal strength. Being a cluster head is more energy consuming than being a non cluster head node, since the cluster head needs to receive data from all cluster members in its cluster and then send the data to the base station. All nodes in the network have the potential to be cluster head during some period of time. The TDMA scheme starts every round with a set-up phase to organize the clusters. After the set-up phase, the system is in a steady-state phase for a certain amount of time. The steady-state phase consists of several cycles where all nodes have their transmission slots periodically. The nodes send their data to the cluster head, which aggregates the data and sends it to the base station at the end of each cycle. After a certain amount of time, the TDMA round ends and the network re-enters the set-up phase. LEACH
has a drawback in that the clusters are not evenly distributed, due to its random rotation of the local cluster head.

The Power Efficient Gathering in Sensor Information Systems (PEGASIS) protocol, another clustering-based routing protocol, further enhances network lifetime by increasing local collaboration among sensor nodes [5]. In PEGASIS, nodes are organized into a chain using a greedy algorithm so that each node transmits to and receives from only one of its neighbors. In each round, a randomly chosen node from the chain transmits the aggregated data to the base station, thus reducing the per round energy expenditure compared to LEACH.

Figure 1. Major components and energy cost parameters of a sensor node.
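The greedy chain construction used by PEGASIS can be sketched as follows. This is an illustration in the spirit of [5], not the authors' code, and it assumes the common convention of starting the chain from the node farthest from the base station and repeatedly appending the nearest not-yet-chained node, so that each node communicates only with its chain neighbors.

```python
import math

def pegasis_chain(nodes, bs):
    """Greedy PEGASIS-style chain over 2-D node positions (a sketch).
    nodes and bs are (x, y) tuples."""
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    start = max(nodes, key=lambda n: dist(n, bs))   # farthest from the BS
    chain, left = [start], set(nodes) - {start}
    while left:
        nxt = min(left, key=lambda n: dist(chain[-1], n))  # nearest unchained
        chain.append(nxt)
        left.remove(nxt)
    return chain
```

For four collinear nodes with the base station off to one side, the chain simply runs from the far end toward the base station, and each node has at most two chain neighbors.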
III. Network and Radio Models

The Network Model and Architecture

Our proposed protocol lies in the realization that the base station is a high-energy node with a large amount of energy supply. Thus, it utilizes the base station to control the coordinated sensing task performed by the sensor nodes. In this article we assume a sensor network model, similar to those used in [3, 6], with the following properties:
• A fixed base station is located far away from the sensor nodes.
• The sensor nodes are energy constrained with a uniform initial energy allocation.
• The nodes are equipped with power control capabilities to vary their transmitted power.
• Each node senses the environment at a fixed rate and always has data to send to the base station.
• All sensor nodes are immobile.

The two key elements considered in the design of the protocol are the sensor nodes and the base station. The sensor nodes are geographically grouped into clusters and are capable of operating in two basic modes:
• The cluster head mode
• The sensing mode

In the sensing mode, the nodes perform sensing tasks and transmit the sensed data to the cluster head. In cluster head mode, a node gathers data from the other nodes within its cluster, performs data fusion, and routes the data to the base station through other cluster head nodes. The base station in turn performs the key tasks of cluster formation, randomized cluster head selection, and CH-to-CH routing path construction.

The Radio Model

As shown in Fig. 1, a typical sensor node consists of four major components: a data processor unit; a micro-sensor; a radio communication subsystem that consists of transmitter/receiver electronics, antennae, and an amplifier; and a power supply unit [1]. Although energy is dissipated in all of the first three components of a sensor node, we mainly consider the energy dissipation associated with the radio component, since the core objective of this article is to develop an energy-efficient network layer protocol to improve network lifetime. In addition, energy dissipated during data aggregation in the …

In our analysis, we use the same radio model discussed in [9]. The transmit and receive energy costs for the transfer of an l-bit data message between two nodes separated by a distance of d meters are given by Eqs. 1 and 2, respectively:

E_Tx(l, d) = l · E_elec-tx + l · E_amp(d)    (1)

E_Rx(l) = l · E_elec-rx    (2)

where E_Tx(l, d) in Eq. 1 denotes the total energy dissipated in the transmitter of the sensor node, and E_Rx(l) in Eq. 2 represents the energy cost incurred in the receiver of the destination node. The parameters E_elec-tx and E_elec-rx in Eq. 1 and Eq. 2 are the per bit energy dissipation for transmission and reception, respectively. E_amp(d) is the energy required by the transmit amplifier to maintain an acceptable signal-to-noise ratio in order to transfer data messages reliably. As is the case in [6], we use both the free-space propagation model and the two-ray ground propagation model to approximate the path loss sustained due to wireless channel transmission. Given a threshold transmission distance d0, the free-space model is employed when d < d0, and the two-ray model is applied for cases when d ≥ d0. Using these two models, the energy required by the transmit amplifier is given by

E_amp(d) = ε_fs · d^2 if d < d0, and ε_tr · d^4 if d ≥ d0    (3)

where ε_fs and ε_tr denote the transmit amplifier parameters corresponding to the free-space and two-ray models, respectively, and d0 is the threshold distance given by

d0 = sqrt(ε_fs / ε_tr)    (4)

We assume the same set of parameters used in [3] for all experiments throughout the article; the energy cost for data aggregation is likewise set as in [3].

IV. Energy efficient Routing protocol using MSD and Relay CH's

The proposed routing technique is an extension to LEACH. It uses a centralized cluster formation algorithm to form clusters; that is, the cluster formation is carried out by the BS. The protocol uses the same steady-state protocol as LEACH. During the set-up phase, the
cluster head nodes is also taken into account. base station receives information from each node about

2
131
their current location and energy level. After that, the 4.2 Our Approach
base station runs the centralized cluster formation
algorithm to determine cluster heads and clusters for that In WSNs asymmetric communication is possible.
round. Once the clusters are created, the base station That is, the base station reaches all the sensor nodes
broadcasts the information to all the nodes in the network. directly, while some sensor nodes cannot reach the base
Each of the nodes, except the cluster head, determines its station directly but need other nodes to forward its data,
local TDMA slot, used for data transmission, before it hence routing schemes are necessary.
goes to sleep until it is time to transmit data to its cluster As the network size increases the transmission
head, i.e., until the arrival of the next slot. distance within the cluster increases. There by energy
In our method during the set-up phase, for cluster consumption increases.
formation we are using the minimum separation distance In our approach we present the Designed energy
method proposed by Ewa Hansen, Jonas Neander [7] efficient cluster based routing protocol to overcome the
which overcomes the drawback of LEACH protocol by above drawbacks. This section includes selection of the
spatially distributing the cluster heads. A simple relay cluster heads to forward the data when minimum
algorithm to find and select cluster heads is described separation distance between clusters is maintained.. We
below. also present the efficient routing technique for large
sensor network areas. So, to forward the aggregated data
4.1 Cluster head selection algorithm from CHs to BS relay nodes are required. The selection
of Relay nodes is described below.
We randomly choose a node among the eligible nodes 4.2.1 Selection of Forwarding CHs
to become cluster head but we also make sure that the
nodes are separated with at least the minimum separation Once the CHs are identified and the nodes are
distance (if possible) from the other cluster head nodes. clustered relative to the distance from the CHs, the
Algorithm : CH selection algorithm routing towards the base station (BS) is initiated. First,
MSD = Minimum Separation Distance the CH checks if the BS is within communication range.
dc = Number of desired cluster heads, If so, data is sent directly to the BS. Otherwise, the data
energy(n) = Remaining energy for node n from the CHs in the sub-network are sent over a multi-
hop route to the BS. Here, the selection of a relay node is
set to maximize the link cost factor (LCF) which includes
energy, end-to-end delay and distance from the BS to the
RN.
Initially, a CH broadcasts HELLO packets to all CH
) nodes in range and receives ACK packets from all the
relay candidates that are in communication
range. The ACK packets contain information such as the
node ID, available energy, and processing delay at a
node, and distance from the BS. The RNs that are further
away from the BS than the current node do not respond to
the HELLO packets. If one of the ACK packets was sent
from the BS, then it is selected as a next hop node, thus
ending the route discovery procedure. Otherwise, the
current node builds a list of potential RNs from the
In the cluster head selection part, cluster heads are
ACKs. Then it selects the optimal RN using the LCF
randomly chosen from a list of eligible nodes. To
parameter. The same procedure is carried out for all hops
determine which nodes are eligible, the average energy of
to the BS. The advantage of this routing method is
the remaining nodes in the network is calculated. In order
reduction of the number of relay nodes that have to
to spread the load evenly, only nodes with energy levels
forward data in the network, and hence the scheme
above average are eligible.
reduces overhead and minimizes the number of hops and
If a node that has been randomly chosen is too close
communication due to flooding.
i.e. within the range of the minimum separation distance
The LCF from a node to its next hop node is given
from all other chosen cluster heads, a new node has to be
by (5) where represent the delay to reach the next hop
chosen to guarantee the minimum separation distance.
node, is the distance between the next hop node and
This process iterates until the desired number of cluster
heads is attained. If we cannot find a node outside the the BS, and is the energy remaining at the next hop
range of the minimum separation distance (to guarantee node:
the minimum separation distance) we choose any node (5)
among the eligible nodes to become cluster head. When
all cluster heads have been chosen and separated, In equation (5), consideration of the remaining energy
generally with at least the minimum separation distance, at the next hop node increases network lifetime, the
clusters are created the same way as in LEACH. distance to the BS from the next hop node reduces the
number of hops and end-to-end delay; and the delay

3
132
incurred to reach the next hop node minimizes any channel fading problems and processing delay. When multiple RNs are available for routing of packets, the optimal RN is selected based on the maximum LCF.

4.2.2 Using MST Intra-cluster for Large Sensor Networks

In this protocol the main idea is to use MSTs to replace direct communication in one layer of the network: intra-cluster. The average transmission distance of each node can be reduced by using MSTs instead of direct transmissions, and thus the energy dissipated in transmitting data is reduced.

Figure 2. (a) Direct communication in LEACH; (b) MST communication intra-cluster for a large network area.

In each cluster, all nodes including the CH are connected by an MST, with the CH as the leader collecting data from the whole tree. The CHs then use relay nodes to forward the data to the BS. The data fusion process is handled along the tree route. When the network area is larger, the reduction in transmission distance is greater; thus, this protocol is more energy efficient.
In direct transmission, routing path information is simple, and each node only needs to know of and send data to its CH. But in trees, each node must know the next node that it would send data to. So we form the MSTs at the BS.

V. Performance Evaluation

To assess the performance of the proposed routing protocol, we simulated the MSD and MST routing protocols using the C language.

Table 1. Characteristics of the test network.
Nodes: 100
Network size: 100 m x 100 m
Base station location: (50, 175)
Radio propagation speed: 3x10^8 m/s
Processing delay: 50 µs
Radio speed: 1 Mbps
Data size: 500 bytes

For an evaluation to be meaningful, the performance of the proposed protocol should be compared with the performances of certain well-known existing energy-aware protocols, namely LEACH and PEGASIS. Performance is measured by the quantitative metrics of average energy dissipation, system lifetime, total data messages successfully delivered, and number of nodes that are alive.
For these simulations, we consider a random network configuration with 100 nodes, where each node is assigned an initial energy of 1 J. Furthermore, the data message size for all simulations is fixed at 500 bytes, and the packet header for each type of packet is 25 bytes long.
In the performed simulations we have varied the minimum separation distance between cluster heads, in order to see the effects on energy consumption in the network. We have also investigated whether the number of clusters used, together with the minimum separation distance, has any effect on the energy consumption. The minimum separation distance varied between 30 and 45 meters, and the number of clusters varied between 2 and 8.

Figure 3. No of messages received at the BS by varying msd (30, 35, and 40 m) and the number of clusters (2 to 8).

Figure 4. Distribution of sensor nodes and cluster formation with msd = 30 m.

In Figure 3, we see how the minimum separation distance affects the energy consumption, i.e., the number of messages received at the base station during the lifetime of the network. We also see how the number of clusters used affects the energy consumption in the
network. Further, we see that when using 5 clusters and a minimum separation distance of 30 meters between cluster heads, the base station receives the most messages; this gives the most energy-efficient configuration. Figure 4 gives the distribution of sensor nodes and the formation of clusters with a minimum separation distance of 30 m.
The improvement gained through the MSD with relay CHs protocol is exemplified by the system lifetime graph in Fig. 5. This plot shows the number of nodes that remain alive over the number of rounds of activity for the 100 m x 100 m network scenario. With the MSD protocol, all the nodes remain alive for 920 rounds, while the corresponding numbers for LEACH and PEGASIS are 510 and 825, respectively. Furthermore, if system lifetime is defined as the number of rounds for which 75 percent of the nodes remain alive, the proposed protocol exceeds the system lifetime of LEACH by 30 percent. A 5 percent improvement in system lifetime is observed over PEGASIS.

Figure 5. No of alive nodes as rounds increase (LEACH, PEGASIS, and MSD).

Figure 6. A comparison of MSD protocol average energy dissipation with other clustering-based protocols.

Figure 6 shows the average energy dissipation of the protocols under study over the number of rounds of operation. This plot clearly shows that the MSD with relay CHs protocol has a much more desirable energy expenditure curve than those of LEACH and PEGASIS. On average, the MSD protocol exhibits a reduction in energy consumption of 40 percent over LEACH. This is because all the cluster heads in LEACH transmit data directly to the distant base station, which in turn causes significant energy losses in the cluster head nodes.
Next we analyze the number of data messages received by the base station for the three routing protocols under consideration. For this experiment, we again simulated a 100 m x 100 m network topology where each node begins with an initial energy of 1 J. Figure 7 shows the total number of data messages received by the base station over the average energy dissipation. The plot clearly illustrates the effectiveness of the proposed protocol in delivering significantly more data messages than its counterparts. Moreover, the results in Fig. 7 confirm that the MSD protocol delivers the most data messages per unit of energy of the two schemes. In the final experiment, we evaluate the performance of the routing protocols as the area of the sensor field is increased. For this simulation, 100 nodes are randomly placed in a square field of varying network areas with the base station located at least 75 m away from the closest sensor node, and results were obtained over 25 different network topologies for each network area instance. Figure 8 shows the number of alive nodes after 900 rounds by varying the network area; it compares the performances of the MSD with relay CHs protocol, LEACH, PEGASIS, and the MST protocol.

Figure 7. Total number of data messages received at the base station as a function of average energy dissipation.

Figure 8. Number of nodes alive as a function of network area (msd, pegasis, and MST in cluster).

Clearly, the MSD protocol outperforms both LEACH and PEGASIS as the network area increases up to 300 m. As the network area increases further, the MST protocol performs better than the other three protocols. This is mainly because LEACH does not ensure that the cluster heads are
uniformly placed across the whole sensor field. As a result, the cluster head nodes in LEACH can become concentrated in a certain region of the network, in which case nodes from the "cluster head deprived" regions will dissipate a considerable amount of energy while transmitting their data to a faraway cluster head. The utilization of the greedy algorithm in PEGASIS results in a gradual increase in neighbor distances. This in turn increases the communication energy cost for those PEGASIS nodes that have far neighbors. As shown shortly, increasing neighbor distances will have a significant effect on PEGASIS's performance when the area of the sensor field is increased.

Figure 9. Network lifetimes in a network area of 400 m x 400 m (msd and MST in cluster).

Figure 9 gives the performance comparison of the MSD and MST routing protocols for a network area of 400 m x 400 m. The plots also show that the tree topology makes the protocol perform better than direct transmission in a larger network area. The MST protocol uses a tree topology, whereas the MSD protocol uses direct communication intra-cluster. The simulation results show that MST is an elegant solution in a large network area.

VI. Conclusion

We presented a simple energy-efficient cluster formation algorithm for the AROS architecture. The simulations showed that using a minimum separation distance between cluster heads and using forwarding CHs improves energy efficiency compared to LEACH and PEGASIS, measured by the number of messages received at the base station. Using 5 clusters and a minimum separation distance of 30 meters between cluster heads is the most energy-efficient configuration for our simulated network.
Using an MST within the cluster is an elegant solution for large sensor networks. From the simulation results, this approach reduces the distance between a cluster head and its cluster member nodes, thereby reducing the transmission energy when a cluster member node communicates with its cluster head. Results show that it is more energy efficient than LEACH and PEGASIS for large sensor networks.

References

1) I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, "A Survey on Sensor Networks," IEEE Communications Magazine, vol. 40, no. 8, pp. 102-114, 2002.
2) J. N. Al-Karaki and A. E. Kamal, "Routing Techniques in Wireless Sensor Networks: A Survey," IEEE Wireless Communications, vol. 11, pp. 6-28, Dec. 2004.
3) W. Heinzelman, A. Chandrakasan, and H. Balakrishnan, "Energy-Efficient Communication Protocols for Wireless Microsensor Networks (LEACH)," Proc. of the 33rd Hawaii International Conference on System Sciences, vol. 8, pp. 3005-3014, January 4-7, 2000.
4) R.-S. Chang and C.-J. Kuo, "An Energy Efficient Routing Mechanism for Wireless Sensor Networks," IEEE, 2006.
5) S. Lindsey and C. Raghavendra, "PEGASIS: Power-Efficient Gathering in Sensor Information Systems," Proc. of the 2002 IEEE Aerospace Conference, pp. 1-6, March 2002.
6) S. D. Muruganathan, D. C. F. Ma, R. I. Bhasin, and A. O. Fapojuwo, "A Centralized Energy-Efficient Routing Protocol for Wireless Sensor Networks," IEEE Communications Magazine, vol. 43, pp. 8-13, 2005.
7) E. Hansen, J. Neander, M. Nolin, and M. Björkman, "Energy-Efficient Cluster Formation for Large Sensor Networks using a Minimum Separation Distance," Proc. of the Fifth Annual Mediterranean Ad Hoc Networking Workshop (MedHocNet), Lipari, Italy, June 2006.
8) G. Huang, X. Li, and J. He, "Dynamic Minimum Spanning Tree Routing Protocol for Large Wireless Sensor Networks," IEEE, 2006.
9) W. B. Heinzelman and A. P. Chandrakasan, "An Application-Specific Protocol Architecture for Wireless Microsensor Networks," IEEE Transactions on Wireless Communications, vol. 1, pp. 660-670, October 2002.
10) N. Israr and I. Awan, "Multihop Routing Algorithm for Inter-Cluster Head Communication," 22nd UK Performance Engineering Workshop, Bournemouth, UK, pp. 24-31, July 2006.
11) M. Younis, M. Youssef, and K. Arisha, "Energy-aware management for cluster-based sensor networks," Computer Networks, vol. 43, pp. 649-668, Dec. 2003.
12) A. Manjeshwar and D. P. Agrawal, "TEEN: A Routing Protocol for Enhanced Efficiency in Wireless Sensor Networks," Proc. 15th International Parallel and Distributed Processing Symposium, pp. 2009-2015, April 2001.
13) A. Manjeshwar and D. P. Agrawal, "APTEEN: A Hybrid Protocol for Efficient Routing and Comprehensive Information Retrieval in Wireless Sensor Networks," Proc. International Parallel and Distributed Processing Symposium (IPDPS 2002), pp. 195-202, April 2002.
14) J. Chang and L. Tassiulas, "Maximum lifetime routing in wireless sensor networks," IEEE/ACM Trans. Networking, vol. 12, no. 4, pp. 609-619, 2004.
15) A. Mainwaring, J. Polastre, R. Szewczyk, D. Culler, and J. Anderson, "Wireless Sensor Networks for Habitat Monitoring," WSNA'02, September 2002.
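The radio cost model of Eqs. 1-4 above can be sketched in code. This is our own illustration, not the authors' simulator (which was written in C); the parameter values are the ones quoted from [3], and all function and constant names here are ours.

```python
import math

# First-order radio model of Eqs. 1-4 with the parameter values from [3].
E_ELEC = 50e-9        # J/bit: per-bit electronics energy (TX and RX)
EPS_FS = 10e-12       # J/bit/m^2: free-space amplifier parameter
EPS_MP = 0.0013e-12   # J/bit/m^4: two-ray amplifier parameter
E_DA = 5e-9           # J/bit/signal: data-aggregation cost
D0 = math.sqrt(EPS_FS / EPS_MP)  # Eq. 4: threshold distance (~87.7 m)

def tx_energy(l_bits: int, d: float) -> float:
    """Eq. 1: total energy to transmit l_bits over d meters."""
    if d < D0:                           # Eq. 3, free-space branch
        amp = EPS_FS * l_bits * d ** 2
    else:                                # Eq. 3, two-ray branch
        amp = EPS_MP * l_bits * d ** 4
    return E_ELEC * l_bits + amp

def rx_energy(l_bits: int) -> float:
    """Eq. 2: energy to receive l_bits."""
    return E_ELEC * l_bits
```

For a 500-byte message (4000 bits) sent over 50 m, the free-space branch applies, giving E_elec * 4000 + eps_fs * 4000 * 50^2 = 0.3 mJ; beyond d0 the d^4 term makes long hops sharply more expensive, which is what motivates multi-hop relaying via CHs.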
An Energy Efficient Base Station to Node
Communication Protocol for
Wireless Sensor Networks
Pankaj Gupta Tarun Bansal Manoj Misra
Department of E.C.E. Department of Computer Science Department of E.C.E.
Indian Institute of Technology Roorkee University of Texas at Dallas Indian Institute of Technology Roorkee
Roorkee-247667, India Richardson TX, USA Roorkee-247667, India
pankiuec@iitr.ernet.in tarun@student.utdallas.edu manojfec@iitr.ernet.in
Abstract—Inexpensive sensors capable of significant computation and wireless communication with limited energy resources are available. Once deployed, the small sensor nodes are usually inaccessible to the user, and thus replacement of the energy source is not feasible. Hence, energy efficiency is a key design issue that needs to be enhanced in order to improve the life span of the network. Several network layer protocols like LEACH, BCDCP, and PEDAP have proved very useful and efficient for Node to Base Station (BS) communication. However, these centralized protocols have no explicit support for BS to Node communication. In some scenarios, BS to Node communication may be very frequent; in such cases the trivial solution of flooding may prove to be very costly. We introduce here the M-way Search Tree Based Base station to Node communication protocol (MSTBBN) for Wireless Sensor Networks, which can be used to provide efficient BS to Node communication. MSTBBN can be used with any of the centralized data-centric protocols (like LEACH-C) without any significant message overhead. Our solution provides efficient communication with a time complexity of O(h) hops, where h is the height of the BS-rooted tree constructed by the underlying routing protocol.

Keywords—Wireless Sensor Networks (WSN), Base station to node communication, Wireless communication, Routing protocol, Energy Efficiency.

I. INTRODUCTION

A wireless sensor is a battery-operated device capable of sensing physical quantities. In addition to sensing, it is capable of wireless communication, data storage, and a limited amount of computation and signal processing. A wireless sensor network (WSN) consists of hundreds to thousands of such low-power multifunctioning sensor nodes, operating in an unattended environment with limited computational and sensing capabilities, to achieve a common objective [1]. A WSN has one or more base stations (or sinks) which collect data from all sensor devices. These base station(s) (BS) are the interface through which the WSN interacts with the outside world.

Recent developments in low-power wireless integrated microsensor technologies have made these sensor nodes available in large numbers, at a low cost, to be employed in a wide range of applications in military & national security, environmental monitoring, and many other fields [2]. The technology promises to revolutionize the way we live, work, and interact with the physical environment.

In contrast to traditional sensors, sensor networks offer a flexible proposition in terms of ease of deployment and multiple functionalities. In classical sensors, the placement of the nodes and the network topology need to be predetermined and carefully engineered. However, in the case of modern wireless sensor nodes, their compact physical dimensions permit a large number of sensor nodes to be randomly deployed in inaccessible terrains. In addition, the nodes in a wireless sensor network are also capable of performing other functions such as data processing and routing, whereas in traditional sensor networks, special nodes with computational capabilities have to be installed separately to achieve such functionalities.

In order to take advantage of these features of wireless sensor nodes, we need to account for certain constraints associated with them. In particular, minimizing energy consumption is a key requirement in the design of sensor network protocols and algorithms. Since the sensor nodes are equipped with small, often irreplaceable, batteries with limited power capacity, it is essential that the network be energy efficient in order to maximize the life span of the network [1, 3].

Recent advances in wireless sensor networks have led to many new routing protocols specifically designed for sensor networks where energy awareness is an essential consideration. These routing mechanisms have considered the characteristics of sensor nodes along with the application and architecture requirements. Most of the attention, however, has been given to designing protocols for routing
data from sensor nodes to the base station. These protocols use flooding for data transmission from the BS to individual nodes [3]. Obviously, flooding proves to be a very costly solution. This paper presents a novel method for the base station to communicate with nodes in the network, which can be used in scenarios where BS to node communication is frequent; for example, in cases where the BS has to update some parameters at a particular node.

The rest of the paper is organised as follows. In Section II, we discuss previous work in this area. Section III presents an outline of the sensor network model we used and our assumptions. Our protocol is then described in Section IV. In Section V we analyze the performance of the algorithm using simulations. Lastly, the paper is concluded in Section VI with pointers to future work.

II. RELATED WORK

An adaptive clustering scheme called Low-Energy Adaptive Clustering Hierarchy (LEACH) is proposed in [4], which tries to reduce the number of nodes communicating directly with the BS. The protocol achieves this by forming a few clusters (elected randomly), where each cluster-head (CH) collects the data from nodes in its cluster, fuses it, and sends the result to the BS. LEACH-C [4] uses a centralized cluster formation algorithm to guarantee k nodes in the cluster and minimize the total energy spent by the non-cluster-head nodes, by evenly distributing the CHs throughout the network. UDACH proposed in [5] works on similar lines; however, here the cluster heads are selected based upon the residual energy of each node.

Another clustering-based approach called BCDCP [6] makes clusters of equal size to ensure similar power dissipation. PEDAP [7] follows a minimum spanning tree organisation with the BS as the root, improving the total lifetime and scalability. In PEGASIS [8] a chain is constructed among the sensor nodes so that each node receives from and transmits to a close neighbour.

Protocols discussed above and other data-oriented centralized network protocols allow efficient node to BS communication. However, when the base station has to communicate with the nodes, these protocols rely on flooding, where each receiver is obligated to retransmit any packet it has not seen before in the network. Network-wide flooding reduces the network capacity by sending information to hosts which are not supposed to receive it, thus increasing the traffic load and the packet collision rate. This also leads to an increase in individual node power consumption. Obviously this proves to be a very costly solution, especially in industrial deployments where the base station has to frequently communicate the values of various parameters to the nodes; for example, where nodes are deployed to monitor temperature as directed by the BS and the sampling frequency is different at each node.

Various alternatives to blind flooding have been proposed. Typically these techniques aim at minimising the number of retransmissions of broadcast messages. One of the alternatives proposed is randomised forwarding, where each node forwards a packet to its neighbours with a probability p. This scheme was termed gossiping [9]. The typical value of p lies in the range 0.65 to 0.75 for acceptable reliability of data delivery. However, this probabilistic forwarding increases the delay in data delivery. Directed flooding proposed in [10] sends data in a specific directional virtual aperture instead of broadcasting. Only nodes within this virtual aperture forward packets, and thus power consumption is reduced while maintaining a low overhead. However, it is very difficult to decide the size of the aperture. If the aperture is small, the adjusting times will be large. Increasing the aperture will give less adjustment, but the benefit of directed flooding reduces because of increased overhead. Alternatives proposed in [11, 12] require 1- or 2-hop neighbour information of nodes and thus do not scale with increasing node density.

Many solutions are available in the literature to solve these problems and avoid flooding. The first solution, LEAR, proposed in [13], is inspired from Dynamic Source Routing [14], where the base station puts the complete path to the destination node in the packet. Intermediate nodes read the path in the received packets and use it to forward them to the destination. However, this solution incurs the overhead of carrying the whole path inside the packet. When the network size is large with multiple hops, the overhead of carrying the whole path in the packet proves to be very costly.

The second solution, given by Hyun-sook Kim et al. in [15], requires nodes to maintain information about all their children. So whenever a node receives a packet from the BS, it forwards it to one of its children on the basis of the final destination address. This solution is also not scalable, as the size of the routing tables will grow with increasing network density.

III. SYSTEM MODEL AND ASSUMPTIONS

The system consists of the following components:

Node: This refers to sensor nodes. Sensor nodes are the heart of the network; they are in charge of collecting and processing data and routing it to the sink. In other words, sensor nodes can sense data from the environment, perform simple computations, and transmit this data wirelessly to a command center either directly or in a multi-hop fashion through neighbors.

Base Station: The base station is a sensor node responsible for getting requests for data collection from applications. The BS is also responsible for calculation of the routing tree according to
the underlying routing protocol. The base station thus coordinates and controls the sensing tasks performed by sensor nodes.

Destination Node: This is the node to which the BS wants to send information.

Intermediate Nodes: These are sensor nodes which come in the path between the BS and the destination node. Intermediate nodes forward data packets based upon the underlying routing protocol.

Network topology: It is a connectivity graph where nodes are sensor nodes and edges are communication links. In a wireless network, a link represents a one-hop connection, and the neighbours of a node are those within its radio range.

The following are our assumptions, which are consistent with the assumptions made in the literature [4, 6, 7, 8]:
• The base station is a high-energy node with a large amount of energy supply.
• Sensor nodes are homogeneous and energy constrained with uniform initial energy.
• Sensor nodes are equipped with power control capabilities to vary their transmitted power.
• Sensor nodes exhibit no mobility.
• All sensor nodes communicate through wireless links over a single shared channel.
• Links between two sensor nodes are bidirectional.
• Each node knows its current energy and location (using GPS [16] or other localization mechanisms).
• A message sent by a node is received correctly within a finite time by all one-hop neighbours [17].
• Network topology does not change during network operation.
• Each node can be identified uniquely by its identifier.
• Single-hop broadcast refers to the operation of sending a packet to all single-hop neighbours.

Most of these restrictions have been placed in order to simplify the solution. By slightly modifying the proposed protocol, these restrictions can be easily removed. In the sensing mode, the nodes perform sensing tasks and transmit the sensed data. In cluster head mode (wherever assumed), a node gathers data from other nodes within its cluster, performs data fusion, and routes the data to the base station. The base station in turn performs the key tasks of cluster formation, cluster head selection, and routing tree construction.

IV. PROTOCOL ARCHITECTURE

In all the centralized routing protocols like LEACH-C [4], PEDAP [7], etc., the routing tree is calculated by the base station at the beginning of each round. This calculation is done by the BS on the basis of parameters like the geographical position of the nodes, residual energy, and other heuristic parameters. The base station then broadcasts this routing tree to the nodes, which on receiving the routing tree rebroadcast it. Finally all the nodes in the field are aware of the routing tree. Nodes then use the information available in the routing tree to route their data to the BS. Although this setup allows nodes to transfer their data to the base station efficiently, it provides no support for BS to individual node communication.

In MSTBBN, we propose that along with the routing tree, the BS assigns each node a key K_MSTBBN (inspired by the concept of m-way search trees [18]), which is later used by the BS to communicate efficiently with any particular node without any packet or memory overhead. MSTBBN is not limited to any particular routing algorithm. Any of the centralized algorithms (like LEACH-C, PEDAP) can be used to provide the underlying routing capabilities. MSTBBN provides BS to node communication in O(h) hops and O(h) time & message complexity, where h is the height of the routing tree constructed by the underlying routing protocol with the BS as root.

Next we describe the working of the MSTBBN protocol in the following four phases.

Phase 1: Calculation of routing tree by BS.
At the beginning of each round, the routing tree is calculated by the underlying routing protocol. For example, Fig. 1 shows the routing tree calculated by LEACH-C with the BS, Cluster Heads (CHs), and Nodes.

Figure 1. Routing tree calculated by BS in LEACH-C (legend: BS, CH, Node).

Phase 2: Allocation of K_MSTBBN by BS.
In the next phase, using the idea of m-way search trees, the BS assigns a key K_MSTBBN to each node. The following rules are followed by the BS while assigning keys to the nodes, where K_MSTBBN(X) refers to the key assigned to node X according to MSTBBN:
• Rule 1: K_MSTBBN(N1) > K_MSTBBN(N2) where N1 is a child of N2. E.g., in Fig. 2, K_MSTBBN(A) = 0, which is the lowest in the whole tree, since A is the root (BS).
• Rule 2: KMSTBBN(N1) < KMSTBBN(N2) where N2 is the right sibling of N1. E.g., in fig. 2, KMSTBBN(B) < KMSTBBN(C), since C is the right sibling of B, where KMSTBBN(B) = 1 and KMSTBBN(C) = 6.
• Rule 3: KMSTBBN(N3) > KMSTBBN(N1) where N1 is in the subtree rooted at N2 and N3 is the right sibling of N2. E.g., in fig. 2, KMSTBBN(C) < KMSTBBN(D), since D is the right sibling of C, where KMSTBBN(C) = 6, KMSTBBN(D) = 12 and both are children of A (the base station).
• Rule 4: KMSTBBN(N1) < KMSTBBN(N3) where N3 is in the subtree rooted at N2 and N1 is the left sibling of N2. E.g., in fig. 2, KMSTBBN(C) < KMSTBBN(Z), since C is the left sibling of D and Z is in the subtree rooted at D, where KMSTBBN(C) = 6, KMSTBBN(D) = 12 and KMSTBBN(Z) = 14.

Figure 2. Routing tree with key values assigned to nodes

Observe that this method of assigning keys converts the routing tree into a search tree. Moreover, it ensures that the node-to-BS routing paths in the resulting tree are the same as those calculated by the underlying protocol. This is important, as MSTBBN does not require any alteration to the routing tree created by the underlying protocol. In phase 4, we explain how these keys are used to provide efficient BS-to-node communication.

Phase 3: BS broadcasts keys to nodes.

The BS next broadcasts the KMSTBBN of every node. This broadcast is done in a similar fashion as in the underlying protocol (like LEACH-C). Moreover, this broadcast can be piggybacked by the BS while broadcasting the routing tree, thus further cutting down the network overhead. When a node receives these packets, it records its own KMSTBBN and the KMSTBBN of its child with the highest value of KMSTBBN (or the base station can assign each node this value). Thus the memory requirement of MSTBBN is O(1). For example, Table 1 shows the KMSTBBN's stored at node C.

TABLE I. KMSTBBN'S STORED AT NODE C
KMSTBBN(C): 6
KMSTBBN(C's child with maximum KMSTBBN): 11

Phase 4: Base station to node communication.

Now, since the key assignment procedure has converted the routing tree into a search tree, we can use traditional searching algorithms similar to those for m-way search trees for data transmission. Whenever the BS has to communicate with any node, the BS broadcasts the data to its children. Nodes which receive the packet follow the flowchart shown in fig. 3 for further forwarding of the data packet.

E.g., in fig. 2, if base station A has to send data to node Z, where KMSTBBN(Z) = 14 = KMSTBBN(destination), it will forward its data to node D with KMSTBBN(D) = 12. However, owing to the broadcast nature of the wireless medium, nodes B and C will also receive the transmitted data (apart from node D). According to the flowchart shown in fig. 3, a node i on receiving the packet will check whether KMSTBBN(destination) (obtained from the packet) lies between the KMSTBBN of node i and the KMSTBBN of i's child with the highest value of KMSTBBN. If not, node i will drop the packet instead of processing it. Hence node i will only forward the packet if the destination node is in its subtree. Here nodes B and C will drop the packet, since Z is in the subtree of neither B nor C. Node D will forward the packet to its children, since Z is in the subtree of D. Except for Z, the rest of D's children will discard this packet (since the packet is neither intended for them, nor does the destination lie in their subtree), and in this way the packet finally reaches its destination. It can easily be seen that these decisions follow directly from the traditional m-way tree search algorithms.

Figure 3. Flowchart describing how a packet received will be handled by node i.

V. PERFORMANCE EVALUATION

In order to evaluate our scheme, we simulated MSTBBN over LEACH-C [4] and PEDAP [7] as underlying routing protocols on ns-2 [19, 20] and compared its performance with flooding and directed flooding. We used the following model in our simulation studies:
• The wireless sensor network consisted of randomly (uniformly) distributed nodes in a square field of size 100 x 100 m2.
• Each sensor node was assigned an initial energy of 20 Joules.
• The size of a data message was set to 500 bytes.
• The first-order radio model described in [4] was used to calculate the energy consumption for receiving and transmitting a data packet.
• The base station was located at the centre of the field.
• One round was defined as the duration of time from when the BS initiates sending of a data packet to a node to the time when this node receives the packet.
• The radio range of the nodes was set to 40 m.
• Performance was measured by the quantitative metrics of average energy dissipated per node and number of packets forwarded.

A. MSTBBN over LEACH-C

In this section we present simulation results where we ran the MSTBBN algorithm over LEACH-C [4] as the underlying routing protocol.

Energy dissipated with rounds: In the first experiment, we evaluated the average energy dissipated by nodes in the network with increasing number of rounds. For this experiment, we simulated a deployment of 100 nodes. In each round the base station randomly chooses a destination node and sends a data packet to it. The number of rounds was varied from 600 to 900. Fig. 4 shows the comparison of MSTBBN with flooding, while fig. 5 shows the comparison of MSTBBN with directed flooding. For directed flooding, the virtual aperture was varied from Θ = π/2 to Θ = 3π/2.

The simulation results clearly demonstrate that as we increase the number of rounds, MSTBBN achieves significant energy savings compared to both flooding and directed flooding (for all values of the virtual aperture Θ). This is because MSTBBN limits receptions and retransmissions of packets by assigning KMSTBBN to all nodes and then performing efficient routing on the basis of this assigned key. This results in reduced redundant packet transmissions and thus increased energy savings. The average energy efficiency of MSTBBN compared to flooding was observed to be 15.52%.

Energy dissipated with node density: In the next experiment, we evaluated the average energy dissipated by nodes in the network by varying the number of nodes from 25 to 400. This experiment was done for 100 rounds, and in each round the BS chooses a random destination node for sending a data packet. Fig. 6 shows the comparison of MSTBBN with flooding, while fig. 7 shows the comparison of MSTBBN with directed flooding. Energy dissipation in MSTBBN is nearly constant, while for both flooding and directed flooding energy dissipation keeps increasing with increasing number of nodes. This is because in the case of MSTBBN, only CHs forward data packets, while in the case of both flooding and directed flooding, comparatively more nodes receive flooded packets, which leads to more flooding and thereby increases the overall energy dissipation.

Figure 4. Average energy dissipated per node with increasing number of rounds

Figure 5. Average energy dissipated by nodes in LEACH-C using MSTBBN and Directed flooding (for Θ = π/2 to Θ = 3π/2) versus number of rounds.

Figure 6. Average energy dissipated per node with increasing number of nodes
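The first-order radio model cited in the simulation setup can be sketched as follows. The energy constants below are the values commonly quoted with that model in [4]; they are not stated in this paper, so treat them as assumptions.

```python
# Sketch of the first-order radio model of [4] used in the simulations.
# The constants are the values commonly quoted with that model, not
# figures taken from this paper, so treat them as assumptions.
E_ELEC = 50e-9      # J/bit: radio electronics energy (transmit or receive)
EPS_AMP = 100e-12   # J/bit/m^2: transmit amplifier energy (d^2 path loss)

def tx_energy(k_bits, d_m):
    """Energy to transmit k_bits over distance d_m metres."""
    return E_ELEC * k_bits + EPS_AMP * k_bits * d_m ** 2

def rx_energy(k_bits):
    """Energy to receive k_bits."""
    return E_ELEC * k_bits

# One 500-byte data message (the message size used in the experiments)
# sent over the 40 m radio range set in the simulation model.
k = 500 * 8
print(tx_energy(k, 40.0), rx_energy(k))
```

Under this model a reception is charged only the electronics term, which is why limiting redundant receptions and retransmissions, as MSTBBN does, translates directly into energy savings.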
Figure 7. Average energy dissipated by nodes in LEACH-C using MSTBBN and Directed flooding (for Θ = π/2 to Θ = 3π/2) with varying node density.

Number of packets forwarded with node density: In this experiment, we evaluated the number of packets forwarded by nodes by varying the number of nodes from 25 to 400. The experiment was done for 100 rounds, where the BS chooses a random destination node in each round for sending a data packet. Fig. 8 and fig. 9 show the comparison of MSTBBN with flooding and directed flooding, respectively. These results support our interpretation of the nearly constant energy dissipation in LEACH-C.

B. MSTBBN over PEDAP

In this section we present simulation results where we ran the MSTBBN algorithm over PEDAP [7] as the underlying routing protocol.

Energy dissipated with rounds: In this experiment, we evaluated the average energy dissipated by nodes in the network with increasing number of rounds. For this experiment, we simulated a deployment of 100 nodes. In each round the BS randomly chooses a destination node and sends a data packet to it. The number of rounds was varied from 600 to 900. Fig. 10 shows the comparison of MSTBBN over PEDAP with flooding, while fig. 11 shows its comparison with directed flooding. For directed flooding we varied the size of the virtual aperture from Θ = π/2 to Θ = 3π/2. The average energy efficiency of MSTBBN compared to flooding was observed to be 8.48%.

Figure 8. Number of packets forwarded by nodes in LEACH-C using MSTBBN and Flooding with varying node density.

Figure 9. Number of packets forwarded by nodes in LEACH-C using MSTBBN and Directed Flooding (for Θ = π/2 to Θ = 3π/2) with varying node density.

Figure 10. Average energy dissipated by nodes in PEDAP using MSTBBN and flooding versus number of rounds

Figure 11. Average energy dissipated by nodes in PEDAP using MSTBBN and Directed flooding (for Θ = π/2 to Θ = 3π/2) versus number of rounds
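The per-node forwarding behaviour measured in these experiments follows from the Phase 2 key assignment and the Phase 4 forwarding test of Section IV. A minimal sketch, assuming that the keys are produced by a preorder (depth-first) numbering of the routing tree, an inference that satisfies Rules 1 to 4 and reproduces the key layout of fig. 2; the example tree and node names are hypothetical.

```python
# Sketch of MSTBBN Phase 2 (key assignment) and Phase 4 (forwarding test).
# Assumption: a preorder (depth-first) numbering satisfies Rules 1-4; the
# paper does not spell out the assignment procedure, so this is an inference.

def assign_keys(tree, root):
    """Assign K_MSTBBN by preorder traversal. Each node also records the
    largest key in its subtree: the O(1) state kept per node (Table I)."""
    keys, max_in_subtree = {}, {}
    counter = [0]

    def visit(node):
        keys[node] = counter[0]
        counter[0] += 1
        for child in tree.get(node, []):
            visit(child)
        max_in_subtree[node] = counter[0] - 1  # last key issued in this subtree

    visit(root)
    return keys, max_in_subtree

def should_forward(node, dest_key, keys, max_in_subtree):
    """Phase 4 test: forward only if the destination is in our subtree."""
    return keys[node] < dest_key <= max_in_subtree[node]

# Hypothetical 17-node tree shaped like fig. 2: BS "A" with children B, C, D.
tree = {
    "A": ["B", "C", "D"],
    "B": ["b1", "b2"], "b1": ["b3", "b4"],
    "C": ["c1", "c2", "c3"], "c1": ["c4", "c5"],
    "D": ["d1", "Z", "d2", "d3"],
}
keys, max_sub = assign_keys(tree, "A")
dest = keys["Z"]  # 14, as in the worked example of Section IV
print([n for n in ("B", "C", "D") if should_forward(n, dest, keys, max_sub)])
```

Running the sketch, only D forwards the packet toward Z, matching the worked example: B and C drop it because key 14 lies outside their subtree key ranges.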
As we increase the number of rounds, the simulation results clearly indicate that MSTBBN is more energy efficient than flooding. Except for Θ = π/2, MSTBBN is also always more efficient than directed flooding. However, when we compare these results with those for LEACH-C, i.e. with fig. 4 and fig. 5, we find that MSTBBN is less efficient over PEDAP. The reason is that PEDAP is based on a spanning tree while LEACH-C is a single-level cluster-based protocol, so the height of the BS-rooted tree in the case of PEDAP is higher, while for LEACH-C it is constant and equal to 2. Thus fewer nodes receive and forward packets in the case of LEACH-C, which can be seen by comparing fig. 8 and fig. 14.

Energy dissipated with node density: In the next experiment we evaluated the average energy dissipated by nodes in the network by varying the number of nodes from 25 to 400. This experiment was done for 100 rounds, and in each round the BS chooses a random destination node for sending a data packet. Fig. 12 shows the comparison of MSTBBN with flooding, while fig. 13 shows its comparison with directed flooding. MSTBBN is thus an efficient protocol compared to flooding, while in the case of directed flooding, for some values of Θ, the energy dissipated is nearly the same as for MSTBBN.

Number of packets forwarded with node density: In this experiment, we evaluated the number of packets forwarded by nodes by varying the number of nodes from 25 to 400. The experiment was done for 100 rounds, where the BS chooses a random destination node in each round for sending a data packet. Fig. 14 shows the comparison of MSTBBN with flooding, while fig. 15 shows the comparison with directed flooding.

Figure 12. Average energy dissipated by nodes in PEDAP using MSTBBN and Flooding with varying node density.

Figure 13. Average energy dissipated by nodes in PEDAP using MSTBBN and Directed flooding (for Θ = π/2 to Θ = 3π/2) with varying node density.

Figure 14. Number of packets forwarded by nodes in PEDAP using MSTBBN and Flooding with varying node density.

Figure 15. Number of packets forwarded by nodes in PEDAP using MSTBBN and Directed Flooding (for Θ = π/2 to Θ = 3π/2) with varying node density.
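The height argument here can be made concrete with a small sketch: a LEACH-C style two-level cluster tree always has height 2, while a PEDAP style tree (approximated below by a Euclidean minimum spanning tree rooted at the BS) grows deeper. The deployment parameters (100 nodes, a 100 x 100 m field, 5 cluster heads) are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: BS-to-node delivery costs O(h) hops, where h is the height of the
# BS-rooted routing tree. A LEACH-C style cluster tree has h = 2; a PEDAP
# style spanning tree is much deeper. Parameters below are illustrative.
import math
import random

random.seed(1)
bs = (50.0, 50.0)  # base station at the centre of the field
nodes = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(100)]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# LEACH-C style: BS -> cluster heads (depth 1) -> member nodes (depth 2).
heads = set(random.sample(nodes, 5))
leach_height = max(1 if n in heads else 2 for n in nodes)

# PEDAP style: minimum spanning tree rooted at the BS (Prim's algorithm).
parent, in_tree, remaining = {}, {bs}, set(nodes)
while remaining:
    q, p = min(((q, p) for q in remaining for p in in_tree),
               key=lambda qp: dist(qp[0], qp[1]))
    parent[q] = p
    in_tree.add(q)
    remaining.remove(q)

def depth(n):
    return 0 if n == bs else 1 + depth(parent[n])

mst_height = max(depth(n) for n in nodes)
print("cluster tree height:", leach_height)  # always 2
print("spanning tree height:", mst_height)   # substantially larger
```

Fewer nodes lie on any BS-to-destination path in the two-level tree, which is consistent with MSTBBN saving more energy over LEACH-C than over PEDAP.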
VI. CONCLUSIONS AND FUTURE WORK

Minimizing energy consumption is a key requirement in the design of sensor network protocols and algorithms. Most of the attention, however, has been given to designing protocols for routing data from sensor nodes to the base station. In this paper we presented MSTBBN, a novel scheme for base station to node communication in O(h) hops, where h is the height of the tree constructed by the underlying routing protocol. h can vary from 2 (in LEACH-C) to as much as N (in PEGASIS), where N refers to the number of nodes in the whole network. To our knowledge, no protocol like the proposed one currently exists for base station to node communication. Existing approaches are based upon multicasting protocols, which are not energy efficient.

Our protocol can work with any underlying centralized routing protocol, like LEACH-C or BCDCP, without any noticeable modification or message overhead. The routing tree constructed by the underlying protocol is converted to a search tree using the keys assigned in MSTBBN. Furthermore, our protocol does not incur any message or memory overhead compared to the existing solutions, making our solution scalable with increasing number of nodes as well as density. The result is an energy-efficient scheme for base station to node communication, which we verified using ns-2 simulations.

Recent developments increasingly call for scenarios where the sensed data must be delivered to multiple base stations. This forms one of the areas for future research, where MSTBBN could be extended to serve scenarios with multiple base stations. Further, we also assumed that all the sensor nodes in the network are stationary. For further research, MSTBBN could be explored as a solution for slightly mobile networks; this could be done by adapting the key update interval to node mobility.

REFERENCES
[1] Arampatzis, Th., Lygeros, J., Manesis, S., "A survey of applications of wireless sensors and wireless sensor networks," in Proceedings of the 2005 IEEE International Symposium on Intelligent Control / Mediterranean Conference on Control and Automation, pp. 719-724, 27-29 June 2005.
[2] Li, Yingshu, Thai, My T., Wu, Weili (Eds.), "Wireless Sensor Networks and Applications," Springer Series on Signals and Communication Technology, 2008, ISBN 978-0-387-49591-0.
[3] Carlos de Morais Cordeiro, Dharma Prakash Agrawal, "Ad Hoc & Sensor Networks: Theory and Applications," World Scientific Publishing Company, 2006, ISBN 981-256-681-3.
[4] Wendi B. Heinzelman, Anantha P. Chandrakasan, Hari Balakrishnan, "An application-specific protocol architecture for wireless microsensor networks," IEEE Transactions on Wireless Communications, vol. 1, no. 4, October 2002, pp. 660-670.
[5] Jin-Young Choi, Joon-Sic Cho, Seon-Ho Park, Tai-Myoung Chung, "A clustering method of enhanced tree establishment in wireless sensor networks," in Proc. 10th Int. Conf. on Advanced Communication Technology (ICACT), Feb. 2008, pp. 1103-1107.
[6] S. D. Muruganathan, D. C. F. Ma, R. I. Bhasin, and A. O. Fapojuwo, "A centralized energy-efficient routing protocol for wireless sensor networks," IEEE Communications Magazine, vol. 43, no. 3, pp. S8-S13, Mar. 2005.
[7] Huseyin Ozgur Tan, Ibrahim Korpeoglu, "Power efficient data gathering and aggregation in wireless sensor networks," SIGMOD Record, vol. 32, no. 4, December 2003, pp. 66-71.
[8] S. Lindsey and C. S. Raghavendra, "PEGASIS: Power-efficient gathering in sensor information systems," in Proceedings of the IEEE Aerospace Conference, 2002.
[9] S. Hedetniemi and A. Liestman, "A survey of gossiping and broadcasting in communication networks," Networks, vol. 18, no. 4, pp. 319-349, 1988.
[10] R. Farivar, M. Fazeli, and S. G. Miremadi, "Directed flooding: A fault tolerant routing protocol for wireless sensor networks," 2005 Systems Communications (ICW'05, ICHSN'05, ICMCS'05, SENET'05), pp. 395-399, 2005.
[11] Hai Liu, Xiaohua Jia, Peng-Jun Wan, Xinxin Liu, Frances F. Yao, "A distributed and efficient flooding scheme using 1-hop information in mobile ad hoc networks," IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 5, pp. 658-671, May 2007.
[12] Trong Duc Le, Hyunseung Choo, "PIB: an efficient broadcasting scheme using predecessor information in multi-hop mobile ad-hoc networks," Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, January 31-February 01, 2008, Suwon, Korea.
[13] Kyungtae Woo, Chansu Yu, Dongman Lee, Hee Yong Youn, Ben Lee, "Non-blocking, localized routing algorithm for balanced energy consumption in mobile ad hoc networks," Proceedings of the Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'01), p. 117, August 15-18, 2001.
[14] David B. Johnson, David A. Maltz, and Josh Broch, "DSR: The dynamic source routing protocol for multi-hop wireless ad hoc networks," in Ad Hoc Networking, edited by Charles E. Perkins, Chapter 5, pp. 139-172, Addison-Wesley, 2001.
[15] Hyun-sook Kim, Ki-jun Han, "A power efficient routing protocol based on balanced tree in wireless sensor networks," Proceedings of the First International Conference on Distributed Frameworks for Multimedia Applications (DFMA'05), pp. 138-143, February 06-09, 2005.
[16] N. Bulusu, J. Heidemann, D. Estrin, "GPS-less low cost outdoor localization for very small devices," IEEE Personal Communications, vol. 7, 2000.
[17] Chalermek Intanagonwiwat, Ramesh Govindan and Deborah Estrin, "Directed diffusion: A scalable and robust communication paradigm for sensor networks," in Proceedings of the Sixth Annual International Conference on Mobile Computing and Networking (MobiCom '00), August 2000, Boston, Massachusetts.
[18] Sartaj Sahni, "Data Structures, Algorithms, and Applications in C++," edited by June Waldman, Chapter 11, pp. 525-527, McGraw-Hill, 2000.
[19] ns-2. http://www.isi.edu/nsnam/
[20] K. Fall, "ns Notes and Documents," The VINT Project, UC Berkeley, LBL, USC/ISI, and Xerox PARC, Feb. 2000, available at http://www.isi.edu/nsnam/ns-documentation.html.
A Broadcast Authentication Protocol for Multi-Hop
Wireless Sensor Networks
R. C. Hansdah, Neeraj Kumar and Amulya Ratna Swain
Dept. of Computer Science & Automation
Indian Institute of Science, Bangalore, India.
{hansdah, neerajkumar, amulya}@csa.iisc.ernet.in

Abstract—A base station in a wireless sensor network (WSN) needs to frequently broadcast messages to sensor nodes, since broadcast communication is used in many applications such as network query, time synchronization, multi-hop routing, etc. One of the main problems of broadcast communication in WSNs is source authentication. Source authentication means that the receivers of broadcast data have to verify that the received data really originated from the claimed source and was not modified on the way. This problem is complicated due to untrusted receivers and an unreliable communication environment in which the sender does not retransmit lost packets. In this paper, we propose a novel scheme for authenticating messages from each node of the WSN at the base station using a Diffie-Hellman key. Most existing schemes for broadcast authentication using hash key chains are limited to single-hop WSNs only. Using the above technique for source node authentication, we extend the broadcast authentication scheme using hash key chains to multi-hop wireless sensor networks. The number of transmissions of packets is also reduced by using some selective paths during the broadcast, and as a result, the storage and communication overhead is also reduced. The analysis and experiments show that our protocol is efficient and practical, and achieves better performance than previous approaches.

I. INTRODUCTION

A wireless sensor network consists of a collection of low cost, low power, and multifunctional sensor nodes. Some of the designated nodes, called base stations, facilitate computation within the WSN as well as communication with the outside world. A WSN usually has a single base station. The base station controls the sensor nodes as well as collects data reported by the sensor nodes. A WSN essentially can monitor events of practical importance either periodically, or whenever they occur, over any geographical area, such as forests, buildings etc. As a result, WSNs have the potential to provide practical solutions to many problems of these types. Some of the potential applications of WSNs are environmental and habitat monitoring, monitoring of civil structures like buildings, bridges etc., target tracking for military as well as civilian applications, monitoring the health conditions of patients, and so on.

Security of many of the applications of WSNs is very critical to the systems using them. There are many types of attacks that can be made on WSNs; a survey of the attacks can be found in [1], [2], [3]. One of the important operations of a WSN is that the base station needs to broadcast messages to the sensor nodes occasionally. But the messages need to be authenticated at each sensor node, ensuring that they have come from the base station only. This problem is known as the broadcast authentication problem. If a global shared secret key is used to authenticate these messages, malicious nodes can either modify these messages if they have to rebroadcast them, or masquerade as the base station even in a single-hop WSN, since they already have the key. A solution to this problem [4] is to use a hash key chain. The first key of the chain is usually distributed to each node of the WSN using some mechanism. The first message is encrypted using the first key of the chain. The key next to the first key of the chain is sent along with the first message, which can be used to authenticate the first message, and so on. The problem in a multi-hop environment, which is quite common in WSNs, is that the sensor nodes which receive the broadcast directly from the base station can modify the messages before rebroadcasting. In a single-hop WSN, this problem does not arise. Also, the solution given above ensures that malicious nodes cannot masquerade as the base station, since they do not have the next key. A few solutions to the above problem have been proposed in the literature [5], [6], [7]. Of these solutions, some use digital signatures [5], [6], and others use one-way hash key chains [7]. The solutions which use digital signatures are quite heavy on the meager resources of sensor nodes. In this paper, we propose a novel scheme to authenticate each node of the WSN at the base station using a Diffie-Hellman key. We also use this scheme to propose a broadcast authentication protocol using hash key chains for multi-hop WSNs. An important feature of our protocol is that it is fault-tolerant to node failures.

The rest of the paper is organized as follows. In section II, we give a brief survey of related works. Assumptions and definitions for the proposed protocol are described in section III. In section IV, we describe our proposed scheme. Security and performance analysis of the protocol is described in section V. In section VI, we discuss our simulation results. Conclusions are given in section VII.

II. SURVEY OF RELATED WORKS

Broadcast authentication is an essential service in WSNs. Symmetric key based message authentication codes [8] cannot be directly used for resource-constrained wireless sensor networks, since a compromised receiver can easily impersonate the sender. On the other hand, asymmetric key based digital signature schemes [9], which are typically used for broadcast authentication in traditional networks, are too expensive to be
used in WSNs, due to high computation involved in signature GH(M ) (si ) = GH(M ) (sj ) and si 6= sj , and send the message
verification. As a result, several broadcast authentication pro- M with the signature hsi , sj i to the receiver. After receiving the
tocol have been proposed for resource constrained WSNs [4], message, the receiver authenticates the received message by
[7], [10], [11], [12], [13]. authenticating the signature using previously obtained public
Perrig et al. have proposed a broadcast authentication pro- keys. The advantage of BiBa is fast verification and a short
tocol, named µTESLA [4], and it is the first protocol proposed signature but BiBa takes longer signing time and uses larger
for broadcast authentication in WSNs. This protocol is based public key size to authenticate the signer.
on one way hash key chain. µTESLA uses the key chain to To make an improvement over public key size and signing
emulate public key cryptography with delayed key disclosure. time, Reyzin et al. have proposed a new one-time signature
A key is initially chosen, and the remaining keys are generated scheme called HORS (Hash to Obtain Random Subset) [12]
using one way hash function. The first key of this chain(the last which reduces the time needed to sign the message and verify
key produced by the hash function) is used to encrypt the first the signature. It also reduces the key and signature sizes
message to be broadcasted by the base station, and this key in comparison to the ones used in BiBa and makes HORS
is distributed to each node of the WSNs apriori. The sender the faster one-time signature scheme. The security of BiBa
divides the time period for broadcast into multiple intervals depends upon the random-oracle model, while the security
and in each interval it uses one key starting with the first of HORS relies on the assumption of the existence of one-
key. At the end of each interval, it discloses the next key, way functions. HORS is computationally efficient, requiring
which makes it possible to authenticate the messages that were a single hash evaluation to generate the signature and a few
sent encrypted with the previous key. However, the receiving hash evaluations for verification as compared to BiBa. Still this
node needs to verify that the next key was not yet disclosed protocol has large public key size, which is not suitable in a
when it received the messages. After receiving a packet, WSN environment without additional modifications. Signing
if the receiver can ensure that the packet was sent before each packet would definitely provide secure broadcast authen-
the next key was disclosed, the receiver buffers this packet tication, but it still has considerable overhead for signing and
and authenticates it later after receiving the next key. The verifying packets and also uses more bandwidth.
protocol has certain drawbacks. The protocol requires loose An efficient broadcast authentication scheme[13], proposed
time synchronization between sender and receiver. Individual by Shang-Ming, is also based on one-time digital signature
authentication as well as instantaneous authentication is not scheme. Compared to HORS, this scheme requires less storage
available in µTESLA. More storage space is required at the and communication overhead at the expense of higher compu-
receiver side to buffer the packets until the next key is received. tation cost. In this scheme, key generation is the same as that
Many WSN applications are real time applications. Hence, used in HORS scheme. This scheme makes an improvement
to minimize the delay in authentication of real time data, over the HORS scheme by reducing the large key size, but
the maximum number of additional packets that are received still the public key size is large and computational overhead
before a packet is authenticated should be small. Nonetheless, per message is also large.
there would be some delay before a broadcast packet can be Bekara et al. have proposed a hop-by-hop broadcast source
authenticated, and therefore, it is not suitable for real time authentication protocol for WSN [7] to overcome the DOS
applications. attacks that limits the effect of attack to the one hop neighbor
To increase the scalability of µTESLA, Liu and Ning only. In this protocol, the authors use different key chains
have proposed a multilevel µTESLA [10]. The basic idea of for different hops of the network, where maximum hop of
this protocol is to predetermine and broadcast the parame- the network can be deduced from the maximum propagation
ters such as the key chain commitments instead of the unicast-based message transmission used in µTESLA. Even though it improves the scalability of µTESLA, it still suffers from certain drawbacks, such as the requirement of time synchronization, larger buffer storage, etc.

A broadcast authentication protocol called BiBa (Bins and Balls) [11] has been proposed by Perrig; it uses a one-time digital signature scheme to authenticate the source. In BiBa, the signer precomputes t random values, called SEALs (SElf-Authenticating vaLues). For each SEAL si, the signer generates a public key fsi = Fsi(0), where Fsi() is a one-way function, and these public keys are transferred to the receiver to authenticate the SEALs at the receiver end. For each message M, the signer computes GH(M) for all SEALs s1 to st, where GH(M) is a particular instance from a family of one-way functions whose range is 0 to n−1 (i.e., n possible output values). The signer generates a signature ⟨si, sj⟩, where

delay in the network. This protocol consists of three phases, i.e., the initialization phase, the data broadcast phase, and the data buffering/verification phase. In the initialization phase, the base station divides time into fixed intervals, generates a separate hash key chain for each hop of the network, and stores the first key of each key chain and the duration of the intervals in each sensor node. In the data broadcast phase, the base station computes the MAC for the data it sends in the current time interval by using the current key of each key chain, and broadcasts the data together with the MACs. Later, it discloses the keys of the current time interval one after another. In the data buffering/verification phase, after receiving the broadcasted data, each node in a particular hop buffers the data until the associated key corresponding to the hop number and time interval is disclosed. If the data packet is authentic, it forwards it to the next hop. As the size of the network and the number of nodes increase, the protocol requires more hash key chains, which demands more storage space.
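The one-way hash key chains used by this family of protocols (and by the scheme proposed later in this paper, Section III-B) can be illustrated with a short sketch. The function names and the choice of SHA-256 are illustrative assumptions, not taken from the paper.

```python
import hashlib

def h(k: bytes) -> bytes:
    """The chain's one-way hash function (SHA-256 here, as an example)."""
    return hashlib.sha256(k).digest()

def make_chain(kn: bytes, n: int):
    """Build [k0, k1, ..., kn] with k_{i-1} = h(k_i); kn is the initial key."""
    chain = [kn]
    for _ in range(n):
        chain.append(h(chain[-1]))
    chain.reverse()          # chain[i] is now k_i
    return chain

def authentic(disclosed: bytes, k0: bytes, n: int) -> bool:
    """A receiver holding only k0 accepts a disclosed key k_i if hashing it
    at most n times reaches k0; the one-way property of h prevents forging
    later keys from earlier ones."""
    k = disclosed
    for _ in range(n):
        if k == k0:
            return True
        k = h(k)
    return k == k0

chain = make_chain(b"sender-secret", 5)
assert authentic(chain[3], chain[0], 5)          # a genuinely disclosed key passes
assert not authentic(b"forged key bytes", chain[0], 5)
```

The sender discloses keys in the order k0, k1, ..., kn, so a receiver only ever needs the most recently verified key to check the next disclosure.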
Hence, the protocol is not scalable and has a large storage overhead. Since each node stores one key of each key chain, authenticating the broadcast messages using each of the keys introduces extra computation overhead at each node. Even though the protocol claims that nodes need to buffer data for a duration less than the key disclosure delay of µTESLA, it still suffers from delay in authentication.

Since symmetric key based cryptography is not suitable for broadcast authentication, most of the proposed protocols have used asymmetric key based cryptography. Among these protocols, a few use the public key concept to achieve asymmetric key based cryptography, and the others use the hash key chain technique. The protocols which use the public key concept suffer from large public key sizes and large computational overhead, and the protocols which use the hash key chain technique achieve broadcast authentication for single-hop networks only, with the exception of the protocol proposed in [7]. In this paper, we propose a novel scheme to authenticate each node of the WSN at the base station using a Diffie-Hellman key, and we also use this scheme to propose a broadcast authentication protocol for multi-hop wireless sensor networks using a multilevel hash key chain.

III. ASSUMPTIONS AND OBJECTIVES

In this section, we first state the assumptions about WSNs that we make for the proposed broadcast authentication protocol, and then give a brief description of the hash key chain which is used in our protocol. Finally, we briefly describe the objectives that we aim to achieve with our proposed scheme.

A. Assumptions

We make the following assumptions for our proposed protocol to ensure authentication of each node and also to broadcast messages securely over the whole WSN.
• The WSN has a single base station, and the sensor nodes are static.
• Each sensor node has a unique ID.
• The WSN is connected, i.e., there always exists a path between any pair of sensor nodes.
• The base station is trusted, but the broadcast medium is not trusted, i.e., opponents can eavesdrop on the messages being transmitted.
• The base station is secure, i.e., no one can tamper with it or extract information from it, and it is sufficiently powerful to perform cryptographic computations.

B. Hash Key Chain

A hash key chain of length n + 1 consists of a sequence of keys kn → kn−1 → kn−2 → . . . → k1 → k0, where each arrow denotes an application of an arbitrary hash function h and kn is the initial key. An important property of a hash key chain is that ki−1 can be derived from ki (0 < i ≤ n) but not vice versa. The key k0 is referred to as the first key of the chain, since it is used first in any application. Usually, the sender has the key chain, and the receivers have the first key k0. The signature on all messages signed with key ki by the sender can be verified at the receivers using key ki+1 after it becomes available at the receivers.

C. Objectives of the Proposed Protocol

We aim to achieve the following with the proposed protocol.
(i) Messages sent by each node of the WSN to the base station are fully authenticated at the base station.
(ii) Compromise of a single node should not affect the other nodes of the WSN, i.e., the other nodes are not compromised.
(iii) Intermediate nodes must not be able to modify the broadcast messages.

IV. THE PROPOSED PROTOCOL

In this section, we present our proposed scheme for broadcast authentication. Our proposed protocol uses a multilevel hash key chain and Diffie-Hellman keys for the authentication of messages sent by sensor nodes to the base station, and also of messages broadcasted by the base station. The broadcast authentication protocol proposed in this paper is based on a novel scheme using a Diffie-Hellman key to authenticate each sensor node at the base station. The scheme is described in the following subsection.

A. Authentication of Sensor Node at the Base Station

The scheme to authenticate sensor nodes at the base station uses a Diffie-Hellman key between a pair of nodes. A Diffie-Hellman key is generated from the multiplicative group Zp* = {1, 2, . . . , p − 1}, where p is a large prime number. Let g be a generator of Zp*. In this scheme, the base station is assigned a unique private key 1 < α < p − 1, and each sensor node i is assigned a unique private key 1 < βi < p − 1. Now the following is preloaded in each sensor node i.
1. A key ki = f((g^βi mod p)^α mod p), where f is any suitable function. We refer to the key ki as the Diffie-Hellman key.
2. g^βi mod p
It is assumed that the βi's have been chosen in such a way that βi ≠ βj ⇒ ki ≠ kj. This property can be ensured at the time of generating the βi's, which means that each node has a unique key. The base station is preloaded just with α and p.

A message M sent by a sensor node to the base station has the generic format shown in Figure 1.

Fig. 1. Authentication request packet from node to a base station

It is important that g^βi mod p is not encrypted. The message M itself may or may not be encrypted according to the requirement. Upon receipt of the message M, the base station computes the key ki and verifies the authenticity and integrity of the message.
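The node-to-base-station authentication just described can be sketched as follows. The concrete prime, generator, function f, and the use of HMAC-SHA-256 as the MAC are illustrative assumptions; the paper leaves these instantiations open.

```python
import hashlib
import hmac

p = 2**127 - 1      # a (Mersenne) prime; a real deployment would use a vetted group
g = 3               # assumed generator for this sketch
alpha = 123456789   # base station's private key, known only to the base station

def f(x: int) -> bytes:
    """The 'suitable function' f of the scheme (here: SHA-256 of the value)."""
    return hashlib.sha256(str(x).encode()).digest()

# Offline setup: node i is preloaded with k_i and g^beta_i mod p;
# beta_i itself is NOT stored on the node.
beta_i = 987654321
g_beta_i = pow(g, beta_i, p)
k_i = f(pow(pow(g, alpha, p), beta_i, p))

# Node i sends M together with g^beta_i mod p (in the clear) and a MAC under k_i.
M = b"node-id=17,level=2,data=..."
mac = hmac.new(k_i, M, hashlib.sha256).digest()

# The base station, storing only alpha and p, recomputes k_i and verifies.
tmp_key_i = f(pow(g_beta_i, alpha, p))
assert hmac.compare_digest(hmac.new(tmp_key_i, M, hashlib.sha256).digest(), mac)
```

Both sides arrive at the same key because (g^α)^βi ≡ (g^βi)^α (mod p); an eavesdropper seeing only g^βi mod p cannot compute ki without α.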
The important features of the scheme are as follows.
1. The base station stores only two values, viz. α and p, to authenticate any message from any sensor node.
2. At the minimum, each sensor node only needs to compute the MAC of the message that it sends to the base station.
3. The private key βi of each node, and the prime p, are not stored in the sensor node i.

B. Informal Description of the Broadcast Authentication Protocol

Since sensor nodes are resource and energy constrained, it is important that any broadcast operation consume as few resources and as little energy of each node as possible. Keeping this in mind, we initially construct a broadcast tree using which messages are broadcasted. So that the intermediate nodes which rebroadcast messages cannot modify them, we divide the sensor nodes into groups, and each group is initialized with a different hash key chain. Using this mechanism, we ensure that a message cannot be modified when it moves from a node in one group to a node in another group. We use the Diffie-Hellman key of each node to distribute the first key of the hash key chain of a group to each member node of the group. If the same tree were used repeatedly for broadcast, the energy of the internal nodes of the tree would dry up quite fast. Therefore, we periodically restructure the broadcast tree, taking into account the remaining energy of each node. After reconstruction, the nodes with higher remaining energy become the internal nodes. As a result, our broadcast authentication protocol consists of the following four phases, which are elaborated upon in the following subsections.
1) Broadcast tree establishment and group formation.
2) Authentication and key distribution.
3) Message broadcast phase.
4) Periodic restructuring of the tree.

C. Broadcast Tree Establishment and Group Formation

In this section, we present an algorithm for constructing the broadcast tree, using an approach similar to the one given in [14], and for dividing the sensor nodes into groups. The broadcast tree essentially has the following structure. The base station is at level 0, which is the highest level. All the sensor nodes which can be directly reached from the base station are at level 1, the next lower level. When the sensor nodes at level 1 broadcast a message, the new sensor nodes which receive these messages are at level 2, and so on. The algorithm designates some of the nodes with higher remaining energy as internal nodes of the tree at each level 1 ≤ i ≤ n − 1, where n is the total number of levels in the tree. The value of n depends on the extent of the geographical area into which the sensor nodes have been deployed and on the power used to broadcast messages, which essentially determines the communication range of a broadcast. When the internal nodes of the broadcast tree at level i broadcast a message, all the sensor nodes at level i + 1 receive the message. The procedure CBT below describes the construction of the broadcast tree at each node i.

Procedure CBT;
begin
    Initialize its cost to ∞;
    SET_FLAG = false; ACK_NODE = -1;
    Timer_Flag = RESET;
    while (node i receives an ADV message from node j) do
        if (Timer_Flag = RESET) then
            Wait for more advertisement messages for time duration t1;
            Timer_Flag = SET;
        end
        if (costi > costj + 1/REi) then
            costi = costj + 1/REi;
            levi = levj + 1;
            SET_FLAG = true; ACK_NODE = src;
        end
        if ((Timer_Flag = SET) and (time duration t1 expired)) then
            break;
        end
    end
    if (SET_FLAG = true) then
        Broadcast a new advertisement message having cost costi and level levi;
        Send an acknowledgment to node ACK_NODE;
    end
    Wait for a possible ACK message;
    if (ACK message received) then
        Node i is an internal node;
    else
        Node i is a leaf node;
    end
end
# costj = cost of node j from where the advertisement message has been received.
# REi = remaining energy of node i.
# costi = Σ_{j∈path} 1/REj
Algorithm 1: Broadcast tree construction phase of node i

Each node in the WSN stores the parent node ID and its level number, along with the associated least cost of the path to the base station through the parent node. At the very beginning, each node except the base station sets its cost field, parent node ID, and level number to ∞, -1, and -1 respectively. The base station sets both its cost field and level number to 0 and sets its parent node ID to its own ID. Initially, the base station broadcasts an advertisement message ADV with its node ID, level number, and cost, as shown in Figure 2. After receiving the first ADV message, a sensor node i waits for a fixed duration of time to receive additional ADV messages. It then chooses, among the nodes from which it received an ADV message, a node j that results in the path with the least cost from the base station to node i itself.
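The cost rule of procedure CBT can be worked through numerically; the topology and remaining-energy values below are made up purely for illustration.

```python
# Remaining energy RE of each node; "B" is the base station (cost 0, level 0).
RE = {"a": 4.0, "b": 2.0, "c": 5.0}
hears = {"a": ["B"], "b": ["B"], "c": ["a", "b"]}   # who receives whose ADV

cost, level, parent = {"B": 0.0}, {"B": 0}, {}
for i in ["a", "b", "c"]:                  # ADV messages spread level by level
    j = min(hears[i], key=lambda cand: cost[cand] + 1.0 / RE[i])
    cost[i] = cost[j] + 1.0 / RE[i]        # cost_i = cost_j + 1/RE_i
    level[i] = level[j] + 1
    parent[i] = j                          # i ACKs j, which makes j an internal node

print(parent)   # → {'a': 'B', 'b': 'B', 'c': 'a'}
```

Node c picks a as its parent because the path through a has the smaller accumulated Σ 1/RE (a has more remaining energy than b); a then learns from c's ACK that it is an internal node.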
Fig. 2. Advertisement (ADV) message format

When node i broadcasts its own ADV message, it piggybacks an ACK message for node j. When node j receives the ACK, it comes to know that it is an internal node of the tree. If a node does not receive any ACK message, it concludes that it is a leaf node. The difference between leaf nodes and internal nodes of a broadcast tree is that a leaf node never rebroadcasts a broadcast message. The base station is initially given an estimate of how long the construction of the broadcast tree would take. After this duration elapses, the base station can start using the tree. To maintain the integrity of the ADV messages, we can use a shared secret key for the whole network, which is preloaded before deployment. This key is not required afterwards.

Fig. 3. Broadcast tree

To prevent the intermediate nodes from modifying a broadcast message, one could assign a different hash key chain to each level of the tree. But this would run into problems, as the number of levels is not known a priori, and the level of a node may also change after a reconstruction of the tree. Hence, we divide the sensor nodes into groups based on their levels in the initially constructed broadcast tree, as follows. Let levi be the level of node i. Then groupi, the group number of node i, is assigned as follows:

    groupi = 0,                                                        if levi = 1
    groupi = ((((levi − 1) % (MAX_GROUP − 1)) + 2) % (MAX_GROUP − 1)) + 1,  if levi > 1     (1)

where MAX_GROUP is the maximum number of groups used for the whole network. The actual number of groups (ANOG) may be less than or equal to MAX_GROUP, and the groups are numbered from 0 to (ANOG − 1). Equation (1) is illustrated by an example in Table I, with MAX_GROUP equal to 4.

    levi    1  2  3  4  5  6  7  8  ...
    groupi  0  1  2  3  1  2  3  1  ...
                  TABLE I

The construction of the broadcast tree is illustrated in Figure 3. From Figure 3, it is clear that the broadcast tree construction algorithm always chooses a path from the base station to a node whose intermediate nodes have higher remaining energy compared to the nodes in the other possible paths. A packet from a node to the base station is sent along the tree towards the parent.

D. Authentication and Key Distribution

After the level number and group number are assigned to a sensor node i during the construction of the broadcast tree, it sends an authentication request (ARQ) message to the base station through its parent. On receipt of the authentication request message, the base station replies to the sensor node with an authentication request reply (ARR) message which, apart from other information, contains the first key of the hash key chain of the group to which the node i belongs. We first describe the format of the ARQ and ARR messages, followed by the format of the data packet (DP) messages which are broadcasted by the base station. The format of the ARQ message is shown in Figure 4.

Fig. 4. Authentication request (ARQ) message format

Each node sends an ARQ message to the base station through the path created during the broadcast tree construction and group formation phase. To reduce collisions, the ARQ message from each node is sent after a random delay. After an ARQ message is received at the base station, it is authenticated as shown in Figure 5. Upon receipt of an ARQ message from a node i, the base station sends an authentication request reply (ARR) message to the node with the first key of the hash key chain of the group to which the node i belongs. The format of the ARR message is shown in Figure 6. The ARR message for node i contains the first key of the hash key chain of the first group and the first key of the hash key chain of the group to which node i belongs, encrypted with the key ki. The use of these keys in the message broadcast phase is described in the next section. The base station generates a separate hash key chain for each group of nodes. We denote the hash key chain of group i as Sin → Si(n−1) → Si(n−2) → . . . → Si1 → Si0, where each arrow denotes an application of the hash function h. The maximum number of groups MAX_GROUP is independent of the number of sensor nodes and of the number of levels, and it is fixed a priori. The group of a node is as given by equation (1).

E. Message Broadcast Phase

After each node has received the ARR message, the base station can broadcast the data packet.
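Equation (1) can be checked directly; the short script below reproduces the group assignment of Table I.

```python
MAX_GROUP = 4   # maximum number of groups, fixed a priori

def group(lev: int) -> int:
    """Equation (1): group number of a node from its level in the initial tree."""
    if lev == 1:
        return 0
    return ((((lev - 1) % (MAX_GROUP - 1)) + 2) % (MAX_GROUP - 1)) + 1

print([group(lev) for lev in range(1, 9)])   # → [0, 1, 2, 3, 1, 2, 3, 1]
```

Levels beyond 1 cycle through groups 1 to MAX_GROUP − 1, so the number of key chains stays bounded regardless of the depth of the network.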
The format of the data packet is shown in Figure 7. The data packet message contains the message type, the message, the actual number of groups (ANOG), the encryption of the next key of each group by the corresponding previous key, i.e., E_{Sik}(S_{i(k+1)}), 0 ≤ i ≤ (ANOG − 1), and MACs of this information using each of the previous keys, as shown in Figure 7. To ensure confidentiality of messages, the required part of each data packet may be encrypted with S0k; if confidentiality of messages is not required, this encryption is not necessary. Upon receipt of a data packet message, a node can verify the authenticity of the message using its current key of the hash key chain, and it also receives the next key of the chain. An intermediate node cannot modify the message, since it does not have the current key of each of the other groups. For the very same reason, it cannot extract the next key of the other groups.

Fig. 5. Authentication of individual node

Fig. 6. Authentication request reply (ARR) message format

Fig. 7. Data Packet (DP) message format

F. Periodic Restructuring of the Tree

Each broadcast tree contains a few internal nodes; the rest are leaf nodes. Each internal node always consumes more energy, because these nodes transmit their own packets and also those of the leaf nodes. This reduces the remaining energy of the internal nodes. Hence, it is necessary that a node not act as an internal node for a long duration. To overcome this problem, we restructure the broadcast tree periodically, so that the nodes which have acted as internal nodes in the current broadcast tree do not act as internal nodes in the next broadcast tree. This ensures that the whole network survives for a longer duration and that all the nodes die at nearly the same time. Since the current key of the hash key chain of the first group is available at every node, it is used to ensure the integrity of the ADV messages during the restructuring of the broadcast tree.

As we change the broadcast tree periodically, the level of each node may change. However, since the group number of each node is generated from the node's level number in the initial tree, it remains unchanged by the restructuring of the tree, and the authentication process remains unaffected by the restructuring.

V. SECURITY AND PERFORMANCE ANALYSIS

In this section, we discuss the parameters that influence the security and performance of our protocol. In our protocol, we have used a parameter, called α, which strengthens the security. As we have mentioned earlier, α is preloaded and known only to the base station. After the completion of broadcast tree construction and group formation, every node i sends an authentication request packet to the base station, as shown in Figure 4. This authentication request packet contains the node ID, the level number, g^βi mod p, and a message authentication code (MAC). If an attacker gets this packet, it will not be useful to him, because the attacker cannot obtain or generate any key from this packet, even though the packet is not encrypted. If he tries to modify the packet, the base station can easily identify the modified packet by computing the MAC using a key tmp_key_i, which the base station calculates as shown in Figure 5. Since the attacker does not know the key ki, he is unable to compute the MAC. Hence, to break the security, α must be known to the attacker, which is not possible because α is preloaded and known only to the base station.

Now we discuss the security of the data packets, which are broadcasted by the base station after the authentication and key distribution phase. Let us assume that the broadcast packet is received by some attacker in the first hop. Suppose he tries to change the message msg0. If he changes the message, he has to change the MACs, including MAC_{S1,0}(x). But we have mentioned earlier that the first-level group key chain is known only to the receivers which have already been authenticated by the base station as group-one nodes.
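One round of the message broadcast phase for a single group can be sketched as follows. The XOR keystream stands in for the unspecified encryption E_k(·) and HMAC-SHA-256 for the MAC; both are illustrative assumptions, not choices made by the paper.

```python
import hashlib
import hmac

def E(key: bytes, data: bytes) -> bytes:
    """Toy stand-in for E_k(.): XOR with a SHA-256 keystream (illustrative only)."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# One fragment of a group's chain: the next key hashes to the current key.
next_key = hashlib.sha256(b"group-chain-seed").digest()
cur_key = hashlib.sha256(next_key).digest()        # nodes of the group hold this

# Base station builds the packet: payload, E_cur(next), and a MAC under cur_key.
msg = b"broadcast payload"
enc_next = E(cur_key, next_key)
mac = hmac.new(cur_key, msg + enc_next, hashlib.sha256).digest()

# A node of this group verifies the MAC and recovers the next chain key.
assert hmac.compare_digest(mac, hmac.new(cur_key, msg + enc_next, hashlib.sha256).digest())
recovered = E(cur_key, enc_next)                   # XOR is its own inverse
assert hashlib.sha256(recovered).digest() == cur_key   # chain property: h(S_{k+1}) = S_k

# A node of another group lacks cur_key, so it can neither forge the MAC
# nor decrypt enc_next, which is what stops intermediate nodes from tampering.
```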
But, to compute MAC_{S1,0}(x), he has to know the key S1,0, which is not possible. If the attacker modifies the broadcast packet using his own key and forwards the modified packet to the next level, then the next-level nodes identify the incorrect packet through its MAC if their group is different from that of the sender, and reject the packet.

It is possible that two different nodes in consecutive levels belong to the same group after a restructuring of the broadcast tree. In this case, some nodes in the next lower level may not be able to detect the modification. But other nodes will be able to detect it, and the event can be reported to all the nodes in the vicinity. Nonetheless, this situation would be very rare, as only the internal nodes broadcast, and it is very unlikely that a node would become an internal node and its level would also increase or decrease. Our protocol is also fault-tolerant to node failures, as the periodic restructuring of the broadcast tree eliminates the failed nodes from the internal nodes of the tree. Besides, even if an internal node has failed, a broadcast might still go through other neighbouring internal nodes.

Tables II and III show the comparisons between our proposed protocol and the H2BSAP protocol [7] with respect to the computation, transmission, and storage overhead at the base station and at the other nodes, respectively. Table IV describes the notations used in Tables II and III. As compared to the H2BSAP protocol, our proposed protocol uses a smaller number of key chains, which depends upon MAX_GROUP, whereas in H2BSAP the maximum number of key chains depends on the depth of the network. Since MAX_GROUP is fixed, our proposed protocol can support a network of any depth, but H2BSAP can support a maximum of up to 15 hops, as given in [7]. As already discussed, each node in the proposed protocol keeps only two keys, i.e., the key of the key chain of the first group and the key of the key chain of the group to which it belongs. Hence, the proposed protocol always has less overhead as compared to H2BSAP, and authentication in the proposed protocol is always immediate.

TABLE II. Overhead at the base station
                                   Proposed Protocol                              H2BSAP
Computation overhead per packet    MAX_GROUP × MACop + (MAX_GROUP + 1) × ENCop    l × MACop
Transmission overhead per packet   MAX_GROUP × |MAC| + [MAX_GROUP × |key|]        l × |MAC| + [l × |key|]
Storage overhead                   MAX_GROUP × (n + 1) × |key|                    l × (n + 1) × |key|

TABLE III. Overhead at other nodes
                                   Proposed Protocol                              H2BSAP
Computation overhead per packet    MACop + DECop                                  2 × MACop + [l × Hashop]
Transmission overhead per packet   MAX_GROUP × |MAC| + [MAX_GROUP × |key|]        (l − r) × |MAC| + [l × |key|]
Storage overhead                   2 × |key|                                      l × |key|

TABLE IV. Notations
MACop: MAC operation               Hashop: Hash operation
ENCop: Encryption operation        DECop: Decryption operation
|key|: Key length                  |MAC|: MAC length
n: Size of key chain               l: Maximum hop
MAX_GROUP: Maximum no. of groups   r: Hop distance from BS

VI. SIMULATION STUDIES

We have studied the performance of our broadcast authentication scheme using the Castalia simulator [15]. All the nodes in the network are randomly deployed. For the simulation, we vary the number of nodes from 50 to 225 in a fixed area of 100 × 100 meter². The network parameters, such as transmission range, transmission rate, sensitivity, transmission power, etc., for this simulation study are similar to the parameters specified in the CC2420 [16] and TelosB [17] data sheets. We have taken the initial energy of each node to be 29160 joules for 2 AA batteries, as given in the Castalia simulator. The energy consumption of the different radio modes used in this simulator is given in Table V. For this simulation, we assume that the clocks of all the nodes are synchronized. The simulation was carried out for both a realistic and an ideal channel. We have used the TelosB node hardware platform specification for our simulation and have also used the "tunable protocol" provided by Castalia as the MAC layer protocol. The broadcast packets are generated randomly with uniform distribution in every 2-second interval at the base station.

Figure 8 shows the total number of transmissions needed to broadcast a packet for different sizes of the network. In this figure, we have compared our protocol with the previously proposed scheme with respect to the number of transmissions made. Our approach gives better performance as compared to the previously proposed scheme, because our approach generates a broadcast tree in which only the internal nodes forward the packet over the network. This reduces the number of transmissions required to broadcast a packet over the network. Figure 9 shows the number of authenticated nodes and the average number of nodes that received the broadcast packet for different sizes of WSNs. From this figure, we can see that with an increase in the density of the network, the percentage of authenticated nodes decreases. This happens only due to collisions of the authentication request packets. It can be reduced by increasing the random delay before the authentication request packet is transferred. Figure 10 shows the minimum, average, and maximum delivery time of a broadcast packet in the network for different node densities of the network.

VII. CONCLUSIONS

In this paper, we have proposed a broadcast authentication protocol for multi-hop wireless sensor networks using a Diffie-Hellman key and hash key chains. The protocol is based on a novel scheme to authenticate each sensor node at the base station using the Diffie-Hellman key.
For this purpose, the base station needs to store only two values instead of a separate shared secret key for each node. Compromising a single node will not affect the other nodes, and even a compromised node will not be able to do any damage as far as the broadcast is concerned. The proposed protocol exhibits many nice properties, including individual authentication, instant authentication, and low overhead in communication and storage. It also improves over the existing broadcast authentication schemes in many aspects.

TABLE V. Radio Characteristics
Radio mode   Energy Consumption (mW)
Transmit     57.42
Receive      62
Listen       62
Sleep        1.4

Fig. 8. Number of transmissions required to broadcast a packet with flooding and without flooding

Fig. 9. Number of authenticated nodes, and average number of nodes that received the broadcast packet

Fig. 10. Minimum, average and maximum delivery time for a broadcast packet

REFERENCES

[1] X. Ren, "Security methods for wireless sensor networks," in Proceedings of the 2006 IEEE International Conference on Mechatronics and Automation, June 2006, pp. 1925–1930.
[2] T. Zia and A. Zomaya, "Security issues in wireless sensor networks," in International Conference on Systems and Networks Communications (ICSNC '06), Oct. 2006, pp. 40–40.
[3] Y. Wang, G. Attebury, and B. Ramamurthy, "A survey of security issues in wireless sensor networks," IEEE Communications Surveys & Tutorials, vol. 8, no. 2, pp. 2–23, 2006.
[4] A. Perrig, R. Szewczyk, V. Wen, D. Culler, and J. D. Tygar, "SPINS: security protocols for sensor networks," in MobiCom '01: Proceedings of the 7th Annual International Conference on Mobile Computing and Networking. New York, NY, USA: ACM Press, 2001, pp. 189–199. [Online]. Available: http://dx.doi.org/10.1145/381677.381696
[5] S. Yamakawa, Y. Cui, K. Kobara, and H. Imai, "Lightweight broadcast authentication protocols reconsidered," in IEEE Wireless Communications and Networking Conference (WCNC 2009), April 2009, pp. 1–6.
[6] P. Ning, A. Liu, and W. Du, "Mitigating DoS attacks against broadcast authentication in wireless sensor networks," ACM Trans. Sen. Netw., vol. 4, no. 1, pp. 1–35, 2008.
[7] C. Bekara, M. Laurent-Maknavicius, and K. Bekara, "H2BSAP: A hop-by-hop broadcast source authentication protocol for WSN to mitigate DoS attacks," in 11th IEEE Singapore International Conference on Communication Systems (ICCS 2008), Nov. 2008, pp. 1197–1203.
[8] H. Krawczyk, M. Bellare, and R. Canetti, "HMAC: Keyed-hashing for message authentication," Internet RFC 2104, February 1997.
[9] R. H. Brown and A. Prabhakar, "Digital signature standard (DSS)." [Online]. Available: http://www.itl.nist.gov/fipspubs/fip186.htm
[10] D. Liu and P. Ning, "Multilevel µTESLA: Broadcast authentication for distributed sensor networks," ACM Trans. Embed. Comput. Syst., vol. 3, no. 4, pp. 800–836, 2004.
[11] A. Perrig, "The BiBa one-time signature and broadcast authentication protocol," in CCS '01: Proceedings of the 8th ACM Conference on Computer and Communications Security. New York, NY, USA: ACM, 2001, pp. 28–37.
[12] L. Reyzin and N. Reyzin, "Better than BiBa: Short one-time signatures with fast signing and verifying," in ACISP '02: Proceedings of the 7th Australian Conference on Information Security and Privacy. London, UK: Springer-Verlag, 2002, pp. 144–153.
[13] S.-M. Chang, S. Shieh, W. W. Lin, and C.-M. Hsieh, "An efficient broadcast authentication scheme in wireless sensor networks," in ASIACCS '06: Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security. New York, NY, USA: ACM, 2006, pp. 311–320.
[14] F. Ye, A. Chen, S. Lu, and L. Zhang, "A scalable solution to minimum cost forwarding in large sensor networks," in Proceedings of the Tenth International Conference on Computer Communications and Networks, 2001, pp. 304–309.
[15] "Castalia: a simulator for wireless sensor networks," http://castalia.npc.nicta.com.au/pdfs/Castalia User Manual.pdf.
[16] "CC2420 data sheet," http://www.stanford.edu/class/cs244e/papers/cc2420.pdf.
[17] "TelosB data sheet," http://www.xbow.com/Products/Product pdf files/Wireless pdf/TelosB Datasheet.pdf.
ADCOM 2009
GRID SCHEDULING

Session Papers:

1. Tapio Niemi, Jukka Kommeri and Ari-Pekka Hameri, "Energy-efficient Scheduling of Grid Computing Clusters"

2. Ankit Kumar, Senthil Kumar R. K. and Bindhumadhava B. S., "Energy Efficient High Available System: An Intelligent Agent Based Approach"

3. Amit Agarwal and Padam Kumar, "A Two-phase Bi-criteria Workflow Scheduling Algorithm in Grid Environments"
Energy-efficient Scheduling of Grid Computing
Clusters
Tapio Niemi Jukka Kommeri
Helsinki Institute of Physics, Technology Programme Helsinki Institute of Physics, Technology Programme
CERN, CH-1211 Geneva 23, Switzerland CERN, CH-1211 Geneva 23, Switzerland
tapio.niemi@cern.ch kommeri@cern.ch

Ari-Pekka Hameri
HEC, University of Lausanne, CH-1015 Lausanne, Switzerland
ari-pekka.hameri@unil.ch

Abstract—Energy efficiency is an increasingly important component of computation costs in scientific computing. We have studied different scheduling settings with different hardware for high-throughput computing, trying to minimise the electricity usage of computing jobs. Instead of the common practice of one-task-per-CPU-core scheduling in grid clusters, we have tested variations of different scheduling methods based on the idea of fully loading the computing nodes. Our tests showed that running multiple tasks simultaneously can decrease energy usage per computing task by over 40% and improve the throughput of the computing node by up to 100% when running a high-energy physics (HEP) analysis application. The trade-off is that the processing times of individual tasks are longer, but in cases, such as HEP computing, in which the tasks are not time critical, only the total throughput is important.

I. INTRODUCTION

Energy consumption has become one of the main costs of computing, and several methods to improve the situation have been suggested. The focus of research has been on hardware and infrastructure aspects. Most of the computing centres and computing clusters of research institutes focus on high-performance computing, trying to optimise the processing time of individual computing jobs. Jobs can have strict deadlines or require massive parallelism. In high-throughput computing, instead, the aim is slightly different, since individual jobs are not time critical and the aim is to optimise the total throughput over a longer period of time.

In computing-intensive sciences, such as high-energy physics (HEP), energy-efficient solutions are important. For example, the Worldwide LHC Computing Grid (WLCG), to be used to analyse the data that the Large Hadron Collider of CERN will produce, includes tens of thousands of CPU cores. At this scale, even a small system optimisation can offer noticeable energy and cost savings. Since scientific computing, and especially high-energy physics computing, has special characteristics, energy-optimisation methods can also be tailored for it. The main characteristics in this sense are: large sets of similar jobs, data-intensive computing, no time criticality, no preceding conditions, and no intercommunication between jobs or their tasks, i.e., high parallelism. In spite of this special nature, improving energy efficiency in cluster and grid computing for HEP has mostly focused on infrastructure issues similar to those in general HPC computing, such as cooling and purchasing energy-efficient hardware. As far as we know, there are not many studies focusing on optimising the system configuration and scheduling settings for grid computing.

In this paper, we focus on a typical grid computing problem: how to process a large set of jobs efficiently. We try to optimise the energy efficiency and the total processing time of the set of jobs by choosing an optimal scheduling policy. In this sense our focus is closer to high-throughput computing than to high-performance computing. Basically, the problem is similar to production management in any manufacturing process. This kind of optimisation problem can lead to a trade-off situation: improving energy efficiency can weaken throughput. However, our tests indicated that these two aims are not necessarily contradictory, meaning that optimising system throughput also improves its energy efficiency.

Our method is based on the observation that computers should run at full power or be turned off, since the fixed power consumption is around 50% of the full power of the server. Since computers can run multiple tasks simultaneously in a CPU core using time-sharing techniques, this naturally leads to a load-based scheduling policy. Our previous tests [1] indicated that the load should reflect not only the processor load but all components of the computer, including memory usage, processor load, and I/O traffic. In the current study we tested different computing hardware: single-core machines, low-energy mini-PCs, and modern multicore systems commonly used in computing centres. Our test software included applications utilising different resources of the computer and a HEP analysis application.

The basic terminology used in this paper is:
- A task is the smallest entity of processing work. The task starts, retrieves/reads its possible input file, processes the data, and possibly writes its output file.
- A job is a collection of tasks. In the general situation tasks can have preceding relations, but in our case tasks are independent.
- A computing node, i.e. node, is a part of the computing cluster. It has one or more CPU cores, a fixed amount of memory and disk space, and a network connection with some fixed capacity. The node schedules its jobs independently on its CPU cores.
- Energy efficiency means how many similar jobs can be processed using the same amount of electricity.
- Computing efficiency, i.e. the system throughput, means how many similar jobs can be processed in a time unit.

The paper is organised as follows. In the Background section we explain the common concepts of scheduling and review related literature. After that, the methodology used is described in Section IV. Tests are explained in Section V and results in Section VI. Finally, conclusions are given in Section VII.

II. BACKGROUND

A. Scheduling

Scheduling determines in which order and to which computing nodes computing tasks should be allocated. How individual computers schedule their own processes is not included in our topic. Scheduling problems can be classified according to the following properties:
- on-line / off-line
- knowledge of jobs
- knowledge of computing resources

There is a lot of research on scheduling in multiprocessor systems. More formally (e.g. following [2] or [3]), the scheduling problem can be defined as follows: We have m machines Mj (j = 1, ..., m) (i.e. computing nodes in our case) and n jobs Ji (i = 1, ..., n) to be processed. A schedule S is an allocation of time intervals on machines for each job. The challenge is to find an optimal schedule for the jobs when certain constraints exist. A schedule is called optimal if it minimises a given optimality criterion. The criterion can be, for example, time, cost, or the usage of some resource. Table I illustrates a schedule.

    M1: J1
    M2: J2 J2 J1
    M3: J3 J3 J3
        (time ->)
TABLE I
A SCHEDULE

The optimality criterion can be defined in several ways. If the finishing time of job Ji is denoted by Ci, the cost is denoted by fi(Ci). The usual cost functions are called bottleneck objectives and sum objectives. The bottleneck is the maximum value of the cost functions of all jobs, while the sum is the summed value. The cost function can be defined in several ways. The most common ones are makespan, total flow time, and weighted flow time. When designing a schedule, there are different objective functions to be minimised, such as the completion time of the last job or the total completion time, i.e. the sum of all completion times.

Often there are several objectives, such as processing time and energy efficiency in our case. Then the overall objective is a (weighted) sum of the sub-objectives. This often leads to a Pareto-optimal schedule.

B. Cluster Schedulers

There are various batch scheduling systems – also called job schedulers or distributed resource management systems – available, such as Torque¹, OpenPBS², LSF³, Condor⁴, and Sun Grid Engine⁵. These systems have different features, but the basic functionality is very similar.

¹ www.clusterresources.com/pages/products/torque-resource-manager.php
² http://www.openpbs.org
³ www.platform.com/Products/Workload-Management
⁴ www.cs.wisc.edu/condor
⁵ http://gridengine.sunsource.net

In our tests we used Sun Grid Engine (SGE) [4] of Sun Microsystems, which is also commonly used in grid computing clusters. It has various features to control scheduling. The scheduling is done based on the load of the computing nodes and the resource requirements of the jobs. SGE supports checkpointing and migration of checkpointed jobs among computing nodes. In addition to batch jobs, interactive and parallel jobs are also supported. SGE accounts for the resources, such as CPU, memory, and I/O, that a job has used. SGE contains an accounting and reporting console (“ARCo”) that stores accounting data in an SQL database for later analysis.

In SGE, scheduling is done at fixed intervals (the default setting) or triggered by events such as a new job submission. The scheduler finds the best nodes for pending jobs based on, for example, the resource requirements of the job, the load of the nodes, and the relative performance of the nodes. By default, the scheduler dispatches the jobs to the queues, i.e. nodes, in the order in which they have arrived. If several queues are identical, the selection is random.

It is also possible to change the scheduling algorithm, but only one algorithm is shipped with the default distribution. However, the scheduling can be controlled in four ways: 1) dynamic resource management, 2) queue sorting, 3) job sorting, and 4) resource reservation. Here we focus only on queue and job sorting. In queue sorting, the queue instances of computing nodes are ranked in the order in which the scheduler should use them. The ranking possibilities for queues are, for example: system load, scaled system load, user-defined system load, or fixed order. The job sorting can be done, for example, in the following ways: ticket-based job priority, urgency- or POSIX-based priorities, or user- or group-based quotas.

III. RELATED WORK

Venkatachalam and Franz [5] give a detailed overview of techniques that can be used to reduce the energy consumption of computer systems. There are several studies on different parts of the topic, such as optimising processors by dynamic voltage scaling (e.g. [6] and [7]); optimising disk systems (e.g. [8],
[9], and [10]); network optimisation (e.g. [11]); and compilers (e.g. [12]). There are also several studies on pure energy issues. For example, Lefurgy et al. [13] suggest a method to control the peak power consumption of servers. The method is based on power measurement information from each computing server. Controlling peak power makes it possible to use smaller and more cost-effective power supplies.

Scheduling is a widely studied topic, but there is little work on scheduling as an energy-saving method. Instead, some works suggest clearly opposite approaches: for example, Koole and Righter [14] suggest a scheduling model in which tasks are replicated to several computers. However, the authors do not estimate how much more resource capacity is needed when the same tasks (or at least parts of them) are computed several times. Fu et al. [15] present a scheduling model that is able to restart batch jobs. They give an efficient algorithm to solve the problem, but they do not touch on resource usage.

There also exist several studies relevant to our topic. Kurowski et al. [16] study two-level hierarchical grid scheduling. Their approach takes into account all stakeholders of grid computing systems. The approach does not require the time characteristics of jobs to be known, and in it a set of jobs at the grid level is scheduled simultaneously onto the local computing resources.

Edmonds [17] studies non-clairvoyant scheduling in multiprocessor environments. In his model, the jobs can have arbitrary arrival times and their execution characteristics can change.

Wang et al. [18] have studied optimal scheduling methods in the case of identical jobs and different computers. They aim to maximise the throughput and minimise the total load. They give an on-line algorithm to solve the problem.

Shivam et al. [19] present a learning scheduling model, while Srinivasa Prasanna and Musicus [20] give a theoretical scheduling model in which the number of processors allocated to a task can be a continuous variable, and it is possible to allocate all processors to one task if needed.

Medernach [21] has studied the workload of a grid computing cluster in order to compare different scheduling methods. The idea of the work was to find ways in which the users of the cluster can be grouped to characterise their usage. The scheduling is based on the one-job-per-CPU-core idea.

Etsion and Tsafrir [22] compared commercial workload management systems, focusing on their scheduling systems and default settings. According to the authors, the default settings are often used by administrators as-is, or they are just slightly modified.

Aziz and El-Rewini [23] have studied online scheduling algorithms based on evolutionary algorithms in the grid context. Ges et al. [24] have studied the scheduling of irregular I/O-intensive parallel jobs. They note that CPU load alone is not enough, but all other system resources (memory, network, storage) must be taken into account in scheduling decisions. Santos-Neto et al. [25] have studied scheduling in the case of data-intensive data mining applications.

IV. METHODOLOGY

A. Problem Description

We assume we have a large set of tasks – compared to the number of CPU cores available – organised as a job. Further, we assume that jobs do not have deadlines, all of them arrive at time zero, there are no precedence relations between them or among the tasks in them, and the number of tasks is much larger than the number of computing nodes. These assumptions are usually true in HEP computing, and they make the needed scheduling algorithm simpler. Figure 1 illustrates the situation.

Our scheduling problem can be divided into two independent steps:
1. Finding the optimal load combination for the computing node. Optimal means that a job can be run using the smallest possible amount of energy in the minimal time.
2. Scheduling jobs to the computing nodes in such a way that all computing nodes are as close as possible to the optimum state (i.e. Step 1).

In an important special case in which all tasks inside a job are identical, the problem simplifies to the question of how many tasks to run simultaneously on a computing node. Then the problem is how to measure the load of the node and define the optimum load level. Generally, it is important to note that we did not try to minimise the processing time of an individual task but of a large job containing several tasks. We do not include the local process scheduling on the node in our study.

Fig. 1. Scheduling system (a cluster-level scheduler dispatches jobs from the cluster queue to node queues; node-level schedulers assign tasks to the CPU cores and other resources of each node: memory, disk, network)

Briefly, our hypothesis is:

Running several tasks simultaneously in a CPU core improves energy efficiency and throughput compared to running only one task.
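The direction of this hypothesis can be illustrated with a toy model that is not part of the paper itself: assume a node power model with an idle part plus a utilisation-proportional part (in line with the observation in the Introduction that fixed power consumption is around 50% of full power), and tasks consisting of a CPU phase and an I/O waiting phase. All numbers below are hypothetical.

```python
# Toy model: n tasks time-share one CPU core. Each task needs cpu_s
# seconds of CPU work and io_s seconds of I/O waiting. While one task
# waits on I/O, another can use the core, so core utilisation grows
# with n until the core saturates. All numbers are illustrative.

def node_metrics(n_tasks, cpu_s=60.0, io_s=60.0,
                 p_idle=120.0, p_dyn=120.0):
    """Return (tasks/hour, Wh/task) for n_tasks sharing one core."""
    util = min(1.0, n_tasks * cpu_s / (cpu_s + io_s))
    power = p_idle + p_dyn * util               # average watts
    # Time to finish the batch of n tasks on one core:
    batch_time = max(n_tasks * cpu_s,           # CPU-bound regime
                     cpu_s + io_s)              # I/O-bound regime
    throughput = n_tasks / batch_time * 3600.0  # tasks per hour
    wh_per_task = power * batch_time / 3600.0 / n_tasks
    return throughput, wh_per_task

for n in (1, 2, 3):
    t, e = node_metrics(n)
    print(f"{n} task(s)/core: {t:5.1f} tasks/h, {e:5.2f} Wh/task")
```

With these illustrative numbers, one task per core leaves the core idle half the time; two tasks saturate it, doubling the throughput while the energy per task drops, which mirrors the direction (though not the exact magnitude) of the measured results reported later.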
Possible reasons for this can be:
1. If tasks have I/O access, there would be idle time for the CPU core because of slow disks and network.
2. If tasks have intensive memory access, there would be idle time for the CPU core because of slow main memory access.

B. Test Method and Environment

Our test method was to execute a job, i.e. a large set of tasks, and measure the time and electricity consumed during the test run. The same test job was run with different cluster configurations to find the optimum one.

We ran our tests on different test environments:
- A Xeon test cluster including one front-end and three computing nodes running Sun Grid Engine [4]. The nodes had two single-core Intel Xeon 2.8 GHz processors (Supermicro X6DVL-EG2 motherboard, 2048 KB L2 cache, 800 MHz front-side bus) with 2 gigabytes of memory and 160 gigabytes of disk space.
- A Dell PowerEdge SC1435 computer with two 4-core AMD Opteron 2376 2.3 GHz processors and 32 gigabytes of memory.
- A cluster of three EeeBox mini computers with Intel Atom N270 processors and 2 gigabytes of memory.

The operating system used with the Xeon and Opteron machines was Rocks 5.0 with kernel version 2.6.18. The EeeBoxes required newer drivers, so with them Rocks 5.2 and kernel version 2.6.24.4 were used. The effect of the kernel was tested and found to be nonexistent.

The electricity consumption of the computing nodes was measured with a Watts Up Pro electricity meter. We tested the accuracy of our test environment by running the same tests several times with exactly the same settings. The differences between the runs were around ±1% in both time and electricity consumption.

We developed and customised some tools to make the testing process easier. The test runs were submitted using a Perl script that automatically set the wanted cluster parameters and stored all information in a relational database.

We assumed that the type of a job and the characteristics of the hardware are known in advance. The test applications were real HEP analysis applications and dummy test applications simulating CPU-intensive, memory-intensive, and disk-intensive applications.

V. TESTS

To find out why our hypothesis is valid, we formed tests to test our assumptions:
1. I/O access delays: we used two similar test applications, one having intensive I/O access and the other having no I/O access at all.
2. Memory access delays: we used two similar test applications, one having intensive memory access and the other using very little memory.

We tested two different scenarios:
1. In the simplest case, all tasks, all jobs, and all computing nodes were identical.
2. In the second case, we had different jobs but identical computing nodes. Then the problem is how to allocate tasks to the computing nodes.

The following scheduling methods were tested, focusing on the first two resources mentioned:
- The default scheduling settings of SGE. Job slots are equal to the number of CPU cores. Currently this is often used in grid clusters.
- Slot-based: a fixed number of jobs per CPU core.

A. Basic Tests

We used the following applications for the basic tests:
1. The I/O test application was a simple program that wrote and read 300 MB files multiple times. First it created a file containing numbers generated using the process id as a “random” seed, making all files unique. Files were named using the pid to avoid simultaneous writing/reading of the same file. After generating the file, the contents were copied to another file 20 times. Each copy was slightly different (a small shift in numerical values) to avoid buffering.
2. The CPU test application was a long loop calculating floating-point multiplications. A remainder of the index variable was also used to make compiler optimisation harder.
3. The memory test application reserved memory (200 MB) and filled it with numbers. After that, it read a part of the memory and wrote it to another part. This was done multiple times.

We performed two types of tests: 1) running identical test applications, and 2) mixing the applications. In the mixed test we submitted test applications to the test clusters in the following order: CPU, memory, I/O, and CPU, i.e. two CPU tests per one set of memory and I/O.

B. Test with Physics Software

We used a CMS analysis application in our tests. Input data for the test came from the CRAFT (CMS Running At Four Tesla) experiment that used cosmic ray data recorded with the CMS detector at the LHC during 2008 [26]. This detector was used in a similar way to future LHC experiments. Our test application uses the CMSSW framework, and it is very close to the analysis applications that will be used
when LHC collision data is available. The analysis software reads the input file (350 MB), performs the analysis, and writes the output file (1.6 MB) to the local disk. The disk I/O during the application execution is shown in Figure 2 and the memory usage in Figure 3.

Fig. 2. I/O usage of a single HEP analysis job (write and read speed, blocks/s, vs. time in minutes)

Fig. 3. Memory usage of a single HEP analysis job (used memory, megabytes, vs. time in minutes)

VI. RESULTS

The results of our basic tests are shown in Table II, of the mixed basic tests in Table III, and of the physics analysis tests in Table IV.

Generally, running more than one task in a CPU core improved throughput and decreased electricity consumption. Figure 4 illustrates this in the case of the multicore computer. However, the improvements depended heavily on the application and the hardware. In the modern multicore environment, running 3-4 tasks per CPU core gave the best results, while in the older single-core cluster 1-2 tasks per CPU core were best. The single-core cluster was also the only environment in which running multiple tasks could decrease the efficiency. This can partially be related to the low amount of memory.

There are big differences in energy efficiency among different hardware. Figure 5 illustrates this in our test environment: the modern multicore computer is over 7 times more energy efficient than the older single-core one. Even the modern low-energy mini-PC is not very energy efficient in heavy computing tasks. However, because of its very low power consumption and low idle power, it could be efficient in some other tasks.

According to our tests, a multicore computer with a sufficient amount of memory is the best hardware for physics computing. With this hardware, running multiple analysis tasks per CPU core gave the best improvement: running three simultaneous tasks per core increased throughput by 100% and decreased electricity consumption by 43% compared to running only one task per CPU core. It is remarkable that the one-task-per-core configuration in the multicore environment uses more energy per task than the optimised mini-PC environment does.

Fig. 4. Improvements on the 2x4 core machine

Fig. 5. Wh/job performance in physics jobs
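Reading the power columns of the tables that follow as average draw in watts, the reported Wh/job figures follow directly from the measured power and throughput: energy per job is the average power divided by the number of jobs completed per hour. A small sketch of this arithmetic, using two rows of Table II as input (small residual differences are rounding in the reported rates):

```python
def wh_per_job(avg_power_w, jobs_per_hour):
    """Energy per job in watt-hours: power (W) over one hour, split
    across the jobs completed in that hour."""
    return avg_power_w / jobs_per_hour

# Xeon cluster, memory test, one task per core (Table II):
# 60 jobs/hour at an average draw of 715 W across the nodes.
print(round(wh_per_job(715, 60), 2))   # close to the reported 11.88

# 2x4-core Opteron, memory test: 90 jobs/hour at 260 W.
print(round(wh_per_job(260, 90), 2))   # close to the reported 2.91
```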
Hardware          Test    Environment  Jobs/core  Jobs/hour  Wh/job  Avg. W/core  Avg. W (all nodes)
Xeon cluster      Memory  normal       1          60         11.88   119.17       715
                          optimal      3          91          8.44   127.50       765
                  Disk    normal       1          119         6.06   119.67       718
                          optimal      1          119         6.06   119.67       718
                  CPU     normal       1          176         3.79   111.33       668
                          optimal      1          176         3.79   111.33       668
2x4 core Opteron  Memory  normal       1          90          2.91    32.50       260
                          optimal      3          91          2.88    32.75       262
                  Disk    normal       1          320         0.79    31.25       250
                          optimal      4          493         0.59    36.25       290
                  CPU     normal       1          407         0.60    30.38       243
                          optimal      3          461         0.56    31.75       254
Eeebox cluster    Memory  normal       1          22          2.05    14.83       44.5
                          optimal      4          32          1.48    15.27       45.8
                  Disk    normal       1          38          1.27    15.53       46.6
                          optimal      3          65          0.76    16.13       48.4
                  CPU     normal       1          24          1.89    14.70       44.1
                          optimal      2          30          1.52    14.73       44.2
TABLE II
BASIC TESTS
Hardware          Jobs/core  Jobs/h  Wh/job  Jobs/h/node  Avg. power/node (W)  %-throughput  %-electricity
Xeon cluster      4          95      7.18    31.67        227.00               17%           -14%
2x4 core Opteron  4          258     0.77    258.00       258.00               38%           -22%
Eeebox cluster    3          43      1.09    14.33        15.27                105%          -50%
TABLE III
MIXED BASIC TESTS
Hardware          Environment  Jobs/core  Jobs/hour  Wh/job  %-throughput  %-electricity
Xeon cluster      normal       1          127        5.65
Xeon cluster      optimal      2          124.3      5.61    -2.1%         -0.8%
2x4 core Opteron  normal       1          188        1.134
2x4 core Opteron  optimal      3          376        0.77    100%          -42.54%
Eeebox cluster    normal       1          33         1.53
Eeebox cluster    optimal      3          45         1.24    36.6%         -19.0%
TABLE IV
PHYSICS TEST RESULTS
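The %-throughput and %-electricity columns in the tables above are relative changes of throughput (jobs/hour) and of energy per job against the one-task-per-core baseline. A short sketch of the arithmetic, using the EeeBox physics rows of Table IV as input:

```python
def improvements(jobs_h_base, wh_base, jobs_h_opt, wh_opt):
    """Percentage change in throughput and in electricity per job
    between a baseline configuration and an optimised one."""
    thr = (jobs_h_opt - jobs_h_base) / jobs_h_base * 100.0
    elec = (wh_opt - wh_base) / wh_base * 100.0
    return thr, elec

# EeeBox cluster physics test: 33 jobs/h at 1.53 Wh/job (normal)
# vs. 45 jobs/h at 1.24 Wh/job (three tasks per core).
thr, elec = improvements(33, 1.53, 45, 1.24)
print(f"throughput {thr:+.1f}%, electricity {elec:+.1f}%")
```

This reproduces the reported -19.0% electricity change exactly and the reported throughput gain up to rounding of the underlying job rates.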
VII. CONCLUSION AND FUTURE WORK

Our tests showed that both energy efficiency and throughput can be remarkably improved by running several tasks simultaneously in each CPU core. However, the results depended highly on the hardware used. The biggest improvement in physics computing, in both throughput (100%) and energy consumption (43%), was achieved on the modern 2 x 4 CPU core computer. The same hardware was clearly the most energy efficient, too.

Our future work includes the development of a load-based scheduler that is able to automatically find the best number of tasks per computing node by using the utilisation rates of the different resources of the computing system.

ACKNOWLEDGEMENTS

We would like to thank Arttu Klementtilä, who helped us implement a part of the test applications and execute the test runs, and Magnus Ehrnrooth's Foundation for a grant for the test hardware.

REFERENCES

[1] T. Niemi, J. Kommeri, K. Happonen, J. Klem, and A.-P. Hameri, "Improving energy-efficiency of grid computing clusters," in Advances in Grid and Pervasive Computing, 4th International Conference, GPC 2009, Geneva, Switzerland, 2009, pp. 110–118.
[2] M. Pinedo, Scheduling: Theory, Algorithms, and Systems. Springer, 2008.
[3] P. Brucker, Scheduling Algorithms. Springer, 2007.
[4] Beginner's Guide to Sun Grid Engine 6.2 Installation and Configuration, Sun Microsystems, 2008.
[5] V. Venkatachalam and M. Franz, "Power reduction techniques for microprocessor systems," ACM Comput. Surv., vol. 37, no. 3, pp. 195–237, 2005.
[6] R. Ge, X. Feng, and K. W. Cameron, "Performance-constrained distributed DVS scheduling for scientific applications on power-aware clusters," in SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, 2005, p. 34.
[7] N. Kappiah, V. W. Freeh, and D. K. Lowenthal, "Just in time dynamic voltage scaling: Exploiting inter-node slack to save energy in MPI programs," in SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, 2005.
[8] D. Essary and A. Amer, "Predictive data grouping: Defining the bounds of energy and latency reduction through predictive data grouping and replication," Trans. Storage, vol. 4, no. 1, pp. 1–23, 2008.
[9] Q. Zhu, Z. Chen, L. Tan, Y. Zhou, K. Keeton, and J. Wilkes, "Hibernator: Helping disk arrays sleep through the winter," in SOSP '05: 20th ACM Symposium on Operating Systems Principles. New York, NY, USA: ACM, 2005, pp. 177–190.
[10] X. Li, Z. Li, Y. Zhou, and S. Adve, "Performance directed energy management for main memory and disks," Trans. Storage, vol. 1, no. 3, pp. 346–380, 2005.
[11] S. Conner, G. M. Link, S. Tobita, M. J. Irwin, and P. Raghavan, "Energy/performance modeling for collective communication in 3-D torus cluster networks," in SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM, 2006.
[12] W. Zhang, J. S. Hu, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin, "Reducing instruction cache energy consumption using a compiler-based strategy," ACM Trans. Archit. Code Optim., vol. 1, no. 1, pp. 3–33, 2004.
[13] C. Lefurgy, X. Wang, and M. Ware, "Server-level power control," in ICAC '07: Proceedings of the Fourth International Conference on Autonomic Computing. Washington, DC, USA: IEEE Computer Society, 2007.
[14] G. Koole and R. Righter, "Resource allocation in grid computing," J. Scheduling, vol. 11, no. 3, pp. 163–173, 2008.
[15] R. Fu, T. Ji, J. Yuan, and Y. Lin, "Online scheduling in a parallel batch processing system to minimize makespan using restarts," Theor. Comput. Sci., vol. 374, no. 1-3, pp. 196–202, 2007.
[16] K. Kurowski, J. Nabrzyski, A. Oleksiak, and J. Weglarz, "A multicriteria approach to two-level hierarchy scheduling in grids," J. Scheduling, vol. 11, no. 5, pp. 371–379, 2008.
[17] J. Edmonds, "Scheduling in the dark," Theor. Comput. Sci., vol. 235, no. 1, pp. 109–141, 2000.
[18] C.-M. Wang, X.-W. Huang, and C.-C. Hsu, "Bi-objective optimization: An online algorithm for job assignment," in Advances in Grid and Pervasive Computing, GPC 2009, Geneva, Switzerland, 2009, pp. 223–234.
[19] P. Shivam, S. Babu, and J. Chase, "Active and accelerated learning of cost models for optimizing scientific applications," in VLDB '06: Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 2006, pp. 535–546.
[20] G. N. S. Prasanna and B. R. Musicus, "The optimal control approach to generalized multiprocessor scheduling," Algorithmica, vol. 15, no. 1, pp. 17–49, 1996.
[21] E. Medernach, "Workload analysis of a cluster in a grid environment," in Job Scheduling Strategies for Parallel Processing, 11th International Workshop, JSSPP 2005. Springer, 2005.
[22] Y. Etsion and D. Tsafrir, "A short survey of commercial cluster batch schedulers," Hebrew Univ. of Jerusalem, Tech. Rep. 2005-13, 2005.
[23] A. Aziz and H. El-Rewini, "On the use of meta-heuristics to increase the efficiency of online grid workflow scheduling algorithms," Cluster Computing, vol. 11, no. 4, pp. 373–390, 2008.
[24] L. F. Ges, P. Guerra, B. Coutinho, L. Rocha, W. Meira, R. Ferreira, D. Guedes, and W. Cirne, "AnthillSched: A scheduling strategy for irregular and iterative I/O-intensive parallel jobs," in Job Scheduling Strategies for Parallel Processing, 11th International Workshop, JSSPP 2005. Springer, 2005.
[25] E. Santos-Neto, W. Cirne, F. Brasileiro, A. Lima, R. Lima, and C. Grande, "Exploiting replication and data reuse to efficiently schedule data-intensive applications on grids," in Proceedings of the 10th Workshop on Job Scheduling Strategies for Parallel Processing, 2004, pp. 210–232.
[26] D. Acosta and T. Camporesi, "Cosmic success," CMS Times, November 2008.
Energy Efficient High Available System:
An Intelligent Agent Based Approach
Ankit Kumar, R.K.Senthil Kumar, B.S.Bindhumadhava
Centre for Development of Advanced Computing,
'C-DAC Knowledge Park', Opp. HAL Aeroengine Division,
No. 1, Old Madras Road, Byappanahalli,
Bangalore-560038, India.
{ankitk, senthil, bindhu}@cdacb.ernet.in
Abstract— For achieving high availability in agent based applications of a mission-critical nature, a replica of the real agent system can be created. However, even the replica of the real agent system may fail for various reasons, such as node failure or failure of a communication link, which may lead to agent loss. Another major issue with such systems is the wastage of energy, since the replica has to be in active mode (full power mode) all the time, which is harmful for the environment. The need for improved energy management in these types of systems has become essential for many reasons, such as reduced energy consumption and compliance with environmental standards.

To overcome these issues, we present an intelligent agent based approach for efficient energy management in these systems, and also for agent loss prevention, by creating a replica 'on demand' of the real agent system using an efficient election algorithm (to find the best-suited system for replication) designed for dynamic networks.

KEY WORDS

Agent System, Mobile Agents, Election Algorithm, Energy Management

I. INTRODUCTION & MOTIVATION

In distributed systems, there are various activities, such as electronic commerce, network management, process control applications, and defence applications, which are mobile agent based. A mobile agent is a self-managed software program performing a particular task, capable of autonomously migrating through a heterogeneous network. An agent can exist only on nodes which have an agent system running on them. An agent system provides an execution environment to mobile agents. In CMAF (C-DAC Mobile Agent Framework) [1], agent systems are classified into two categories - real agent systems and proxy agent systems. Pluggable services like registry, communication, and user interface provide functionalities to the agent system. These services are called system agents, since they work for the agent system. In a single network domain, there is only one real agent system, and all the other agent systems run as proxy agent systems. A proxy agent system has a lower load compared to a real agent system. The real agent system maintains a registry of all the agent systems running in the network, whereas a proxy agent system does not. Mobile agent execution is initiated from the real agent system, and an agent can migrate to any proxy agent system which is registered with the real agent system.

Mobile agent based applications of a mission-critical nature require a high degree of dependability and consistency. Despite the rapid evolution in all aspects of computer technology, both computer hardware and software are prone to numerous failure conditions which may lead to the termination of these applications, so providing high availability for these types of applications becomes increasingly important. High availability refers to the availability of resources in a system in the wake of component failures in that system. High availability in agent based applications can be achieved by detecting node failures and reconfiguring the system appropriately, so that the workload of the real agent system can be taken over by another node in the system, called the replica. Fault tolerance for the replica is also required to prevent the loss of agents performing critical applications. Checkpointing [3] of the real agent system or replica is not a good approach; we intend to achieve this goal by applying replication 'on demand'. Instead of making the system run in full power mode all the time, which leads to wastage of energy, we put it into the sleep state and bring it to the active state only when required. To achieve this, we propose an intelligent agent called the "green agent".

In this paper, we present an approach to ensure the reliability of the real agent system using replication based on an election algorithm for dynamic networks. In this approach, we optimize the performance by replicating only the real agent system and running all the other agent systems as proxy agent systems with minimal load.

The rest of the paper is organized as follows. In Section 2, we describe the 'agent based highly available environment'. The election algorithm for dynamic networks is discussed in Section 3. Section 4 discusses the agent based energy efficient high availability system. The performance evaluation is explained in Section 5. Section 6 presents the conclusions.

II. AGENT BASED HIGHLY AVAILABLE ENVIRONMENT

In our proposed approach, the 'agent based highly available environment', we use replication to address agent failure due to node failure or agent system failure. Replication increases the dependability and availability of a system.

In this model, all the agents running in the registered agent systems are checkpointed in the real agent system. On failure of a proxy agent system, the agents which were abnormally terminated continue their execution from the
real agent system using the last check pointed state. To Whenever a new proxy agent system comes up, it gets the
achieve dependability of agents, the real agent system is location of the real agent system and its replica from the
replicated. neighbour agent system. The proxy agent system then
The location of the real agent system and its replica are registers itself with the real agent system and its replica as
maintained in all the agent systems as realLocFile and shown in Figure 1.
replicaLocFile respectively.

Figure 1. Agent Based Highly Available Environment

When the real agent system fails, the replica takes over control and hence becomes the new real agent system. The realLocFile in the real agent system is updated with the new location. The agents which were blocked continue their execution in the new real agent system. These self healing and self configuring properties make our system a highly available and self aware environment for agent execution.
Whenever a replica becomes the new real agent system, another replica is created. The location of the current real agent system and its replica is updated in all the agent systems. The replication of the real agent system is based on an election algorithm for dynamic networks.

III. ELECTION ALGORITHM FOR DYNAMIC NETWORKS

The agent systems running in a distributed network form a hierarchical structure with the real agent system as the root. Since proxy agent systems can register and unregister at any time, this network is dynamic in nature. In the proposed system, we use an election algorithm to select the best-suited agent system from the proxy agent systems running in the network and reconfigure the selected agent system to create a replica for the real agent system. This algorithm handles the dynamic nature of the network.
Generally, the aim of an election algorithm [4] is to elect a node from a fixed set of nodes. Some of the most common applications of election algorithms are key distribution [5], routing coordination [6], sensor coordination [7] and general control [8], [17]. Nowadays, election algorithms are also being used in mobile agent based applications. In mobile agent based networks, the election algorithm should adapt to the dynamic nature of the network and should elect an agent system based on its performance.
Some existing election algorithms work only for static networks [9], [10], [11], [7], [12], [13], [14] or assume that the network is static in nature [15], [16]. Existing election algorithms designed for dynamic networks use random selection of a node [17]. Vasudevan, Kurose and Towsley proposed an election algorithm for mobile ad hoc networks based on extrema-finding [18]. There are also other extrema-finding algorithms [8] and clustering algorithms for mobile networks [19], [20]. But these algorithms are not used in

our approach since they require a high amount of message passing between nodes, which would increase the overhead.
In our approach, we propose an election algorithm for dynamic mobile agent based networks. This algorithm selects an agent system among the proxy agent systems based on the performance-related characteristics of the system. Since the processor speed and the amount of memory required for a proxy agent system and a real agent system are the same, we consider only the hard disk space and the load average for replica election. During the start up of the real agent system, the election is triggered by creating a mobile agent called ElectionAgent. On failure of the real agent system, the replica takes over control and creates a new replica by reconfiguring the proxy agent system selected by the ElectionAgent.
We now describe the operation of the election algorithm for mobile agent based networks. In Section A, we explain the algorithm for electing the best proxy agent system. An algorithm for performance comparison of proxy agent systems is given in Section B. Section C describes the process of updating the replica location in all the proxy agent systems.

A. Algorithm for Electing Best Proxy Agent System

When the real agent system starts up, it triggers the election for the best proxy agent system by creating a mobile agent called ElectionAgent. We describe the election process by explaining the methods used by the ElectionAgent. Table I shows the different methods used by the ElectionAgent.

Table I. Methods used by ElectionAgent
Method       Purpose
getList      gets the list of all proxy agent systems
move         migrates to a particular proxy agent system
getInfo      gets the attributes of the proxy agent system
moveBack     moves back to the real agent system
compareInfo  compares the information gathered
updateList   checks the registry for new proxy agent systems

1) getList: The real agent system maintains a registry of all the agent systems running in the network. The ElectionAgent makes a list of all the proxy agent systems from this registry, where each entry contains the agent system name and its location.

2) move: The agent takes the first entry from the list and tries to get the communication service of that agent system by using its location. Once the communication service is received, the ElectionAgent migrates to that particular proxy agent system. If the communication service is not available, the ElectionAgent retries ten times. After the retries, it discards this agent system, gets the next agent system entry from the list and tries to migrate to that agent system.

3) getInfo: After the successful migration to the proxy agent system, the ElectionAgent continues its execution. The election of an agent system is made on the basis of the hard disk space and the load average.

4) moveBack: The ElectionAgent gathers this information about the proxy agent system and gets the communication service of the real agent system to move back. It migrates back to the real agent system using its communication service.

5) compareInfo: After migrating back to the real agent system, the information which was gathered is compared with the previous information, if it exists, and the best value is saved in a tmpInfoFile. The tmpInfoFile contains the name, location and attributes of the best proxy agent system.

6) updateList: The ElectionAgent checks the registry for any new proxy agent system entry. If it finds a new entry, it is added to the list of agent systems.

For each proxy agent system entry in the list, the ElectionAgent gets the information about that agent system, compares it with the previous information in the tmpInfoFile and updates the tmpInfoFile with the best value. Finally, the tmpInfoFile, which will contain the information of the best proxy agent system, is renamed as infoFile. The ElectionAgent keeps on continuing its execution and hence we assume that the infoFile will always contain the best proxy agent system. When the real agent system fails, the replica will become the new real agent system. The realLocFile in the real agent system is updated with the new location and the agent system which is contained in the infoFile is reconfigured as the new replica. The replicaLocFile in the real agent system is updated with the new replica location, and the location of the real agent system and its replica are updated in all the proxy agent systems.

B. Algorithm for Performance Comparison

We describe an algorithm for comparing the performance of proxy agent systems based on the information gathered by the ElectionAgent. Here we consider the hard disk space and the load average of the proxy agent systems for comparison. The load average is the average number of processes in the kernel's run queue during an interval.
We represent a proxy agent system with free hard disk space h and load average l as (h,l). Let (h1,l1) be the proxy agent system contained in the tmpInfoFile and (h2,l2) be the proxy agent system that is to be compared. The best proxy agent system is selected based on the different conditions given below.

Condition 1: h1 = h2
In this case, we compare l1 and l2, and select the proxy agent system having the lesser load average.

l1 > l2 => (h2,l2)
l1 < l2 => (h1,l1)
l1 = l2 => (h1,l1)

Condition 2:
When there is a negligible difference between h1 and h2, we give more priority to the load average for the selection of the proxy agent system.
l1 > l2 => (h2,l2)
l1 < l2 => (h1,l1)
When there is no difference between l1 and l2, we select the proxy agent system having comparatively more free hard disk space.
l1 = l2 => (h1,l1), if h1 > h2
l1 = l2 => (h2,l2), if h2 > h1

Condition 3:
When there is a significant difference between h1 and h2, we give priority to either the load average or the free hard disk space for the selection of the proxy agent system, based on certain criteria.
Here we consider that a system with a single CPU is overloaded if the load average is greater than 1. So if only one of the systems has a load average less than 1, we select that system.
l1 > 1 and l2 < 1 => (h2,l2)
l1 < 1 and l2 > 1 => (h1,l1)
When l1 < 1 and l2 < 1, or l1 > 1 and l2 > 1, we have two cases. If the system having comparatively more free hard disk space also has the lesser load average, we select that system. Otherwise, we select the system having the lesser load average when the difference between the load averages is significant; when the difference is negligible, the system with comparatively more free hard disk space is selected.
1) h1 > h2
l1 < l2 => (h1,l1)
l1 = l2 => (h1,l1)
l1 > l2 => (h2,l2), if the difference between l1 and l2 is significant
l1 > l2 => (h1,l1), if the difference between l1 and l2 is negligible
2) h1 < h2
l1 > l2 => (h2,l2)
l1 = l2 => (h2,l2)
l1 < l2 => (h1,l1), if the difference between l1 and l2 is significant
l1 < l2 => (h2,l2), if the difference between l1 and l2 is negligible

C. Updating Replica Location in Proxy Agent Systems

The location of the real agent system and its replica is updated in all the proxy agent systems by the real agent system. We describe this process below.
The entries of all the proxy agent systems which are registered with the real agent system are added into a list. Each entry in the list contains the agent system name and its location. The location of the real agent system and its replica are retrieved from the realLocFile and the replicaLocFile. This information is sent by the real agent system to the first agent system entry in the list of agent systems using the communication service. In the proxy agent system, the entries of realLocFile and replicaLocFile are updated with the location of the new real agent system and its replica. The real agent system checks the registry for any new proxy agent system entry. If it finds a new entry, it is added to the list of agent systems.

Example:
We illustrate the operation of the algorithm with an example. Let us consider a network of agent systems as shown in Figure 2(a). It consists of one real agent system (R1), one replica (R2) and three proxy agent systems (P1, P2, P3). The location of the real agent system and its replica is shown in all the agent systems as [R1,R2].
R1 now creates the ElectionAgent (EA). EA maintains a list of all proxy agent systems as [P1,P2,P3]. It moves to P1 as shown in Figure 2(b). After reaching the proxy agent system P1, EA collects the information about P1, i.e., I(P1), and moves back to the real agent system R1. The information which was gathered is compared with the previous information, if it exists, and the best value is saved in a tmpInfoFile. This is represented as {P1} as shown in Figure 2(c).
Before EA migrates to the next proxy agent system, it updates the list of proxy agent systems. In this example, a new proxy agent system P4 is added to the list. The updated list is [P2,P3,P4], as shown in Figure 2(d). EA moves to each agent system in the list, gets the information and saves the best value. Finally, the tmpInfoFile, which will contain the information of the best proxy agent system, is renamed into infoFile; we assume that P3 is the best proxy agent system, as shown in Figure 2(e). EA continues its execution and updates the infoFile each time.
When R1 fails, R2 becomes the new real agent system and the location of the real agent system in R2 is updated as [R2,R2]. Now R2 fetches the best proxy agent system entry from the infoFile, i.e., [P3]. P3 is reconfigured as the new replica R3 and the location of the replica in R2 is updated as [R2,R3], as shown in Figure 2(f).
The real agent system R2 then updates the location of the real agent system and its replica in all the proxy agent systems. R2 has a list containing the entries of all proxy agent systems registered with it, [P1,P2,P4]. The location of the real agent system and its replica [R2,R3] is retrieved from the realLocFile and replicaLocFile. The real agent system sends this information to the first proxy agent system entry in the list, P1, using the communication service, as shown in Figure 3(a). In the proxy agent system P1, the location of the real agent system and its replica [R1,R2] is updated with [R2,R3]. For each proxy agent system entry in the list, the real agent system updates the entries of realLocFile and replicaLocFile with the location of the new real agent system and its replica, as shown in Figure 3(b). In Figure 3, we assume that EA is running and show only the entries in the tmpInfoFile and infoFile.
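The election and comparison logic of Sections A and B can be condensed into a short sketch. The (h, l) representation and the three conditions come from the text; the function names and the numeric thresholds standing in for the unquantified 'negligible' and 'significant' differences are our assumptions:

```python
# Sketch of the replica-election rules of Section B. EPS_DISK and
# EPS_LOAD quantify "negligible" differences; the paper leaves them open.

EPS_DISK = 0.05   # relative free-disk difference treated as negligible
EPS_LOAD = 0.10   # load-average difference treated as negligible

def better(a, b):
    """Return the better of two proxies a=(h1,l1), b=(h2,l2);
    h = free hard disk space, l = load average (ties keep a)."""
    h1, l1 = a
    h2, l2 = b
    # Condition 1: equal disk space -> lower load average wins.
    if h1 == h2:
        return b if l2 < l1 else a
    # Condition 2: negligible disk difference -> load average decides;
    # equal loads fall back to more free disk space.
    if abs(h1 - h2) <= EPS_DISK * max(h1, h2):
        if l1 != l2:
            return a if l1 < l2 else b
        return a if h1 > h2 else b
    # Condition 3: significant disk difference. A single-CPU system is
    # considered overloaded when load average > 1, so an unloaded system
    # beats an overloaded one outright.
    if l1 > 1 and l2 < 1:
        return b
    if l2 > 1 and l1 < 1:
        return a
    # Both loaded or both unloaded: prefer the roomier system unless it
    # carries a significantly higher load average.
    roomier, other = (a, b) if h1 > h2 else (b, a)
    if roomier[1] <= other[1] + EPS_LOAD:
        return roomier
    return other

def elect(proxies):
    """Fold the pairwise comparison over the gathered (h, l) readings,
    mirroring how the ElectionAgent keeps one best entry in tmpInfoFile."""
    best = proxies[0]
    for p in proxies[1:]:
        best = better(best, p)
    return best
```

Folding `better` over the readings gathered on each tour reproduces the tmpInfoFile updates: after the last comparison, the surviving entry is the one renamed into infoFile.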

Figure 2. Election Process in Proposed Environment

Figure 3. Update Process in Proposed Environment

Figure 4. Working of Green Agent

IV. AGENT BASED ENERGY EFFICIENT HIGHLY AVAILABLE SYSTEM

In the 'agent based highly available environment', we choose the replica dynamically 'on demand' using an election algorithm. Most of the time this replica is idle and is put to use only when it needs to be synchronized with the real agent system. During this idle time the replica runs in full power mode, which leads to a considerable amount of energy wastage.
Modern computer systems are equipped with special utilities that allow users to either manually or automatically schedule their computers to switch to sleep mode for energy saving, which is done by the administrator [22,23]. Practically, in highly available systems it is not possible for an administrator to determine when exactly the replica will not be in use, which leads to significant energy wastage. So, to reduce the energy wastage, we introduce an agent based efficient energy management approach for highly available systems. This approach is first implemented for the replica of the real agent system of our 'agent based highly available environment' and then extended to the other systems running as proxy agent systems.
In the current approach we put the system in sleep mode using an agent known as the 'Green Agent'. Figure 4 illustrates the working of the Green Agent.

A. Working of Green Agent

When the real agent system starts up, it triggers the energy saving process by creating a mobile agent called GreenAgent. We describe the working of the GreenAgent by explaining the methods it uses. Table II shows the different methods used by the GreenAgent.
The real agent system has a special registry which contains the status of all other agent systems, whether in active mode or sleep mode, along with their locations; it is periodically updated by the GreenAgent.

Table II. Methods used by GreenAgent
Method       Purpose
getList      gets the list of all active proxy agent systems
move         migrates to a particular proxy agent system
checkInfo    gets specific parameters of the proxy agent system and checks them to decide whether that system should sleep or remain active
moveBack     moves back to the real agent system
makeSleep    makes every system sleep whose flag is set to 'ready to sleep'
updateList   checks the registry for new proxy agent systems

1) getList: The GreenAgent gets the entries of all the active proxy agent systems from the registry and makes a list. Each entry in the list contains the agent system name and its location.

2) move: It takes the first agent system entry from the list of agent systems and tries to get the communication service of that agent system by using its location. Once the communication service is received, the GreenAgent migrates to that particular proxy agent system. If the communication service is not available, the GreenAgent retries ten times. If it is still not available after the retries, it discards this agent system, gets the next agent system entry from the list and tries to migrate to that agent system.

3) checkInfo: After the successful migration to the proxy agent system, the GreenAgent continues its execution. It checks specific parameters like CPU load, the number of applications/processes running, and mouse/keyboard activity in order to detect whether the current system has reached an idle state; if so, it sets the status flag corresponding to this agent system from 'active' to 'ready to sleep' in the real agent system.

4) moveBack: The GreenAgent now gets the communication service of the real agent system and migrates back to the real agent system using its communication service.

5) makeSleep: The GreenAgent finds all the agent systems whose status flags are set to 'ready to sleep', puts all these systems into standby mode by calling a special API function on these agent systems, and sets their flags to 'sleep'.

6) updateList: The GreenAgent checks the registry for any new proxy agent system entry. If it finds a new entry, it is added to the list of agent systems with its status.

The real agent system checks the status flag of a proxy agent system before sending any other mobile agent to it. Only if the flag is set to 'active' does it send the agent to the proxy agent system; if the flag is set to 'sleep' or 'ready to sleep', it first brings the corresponding system into active mode, sets the flag to 'active' and then sends the agent there.
The replica is updated by the real agent system through the synchronization process. It should be in the active state only during synchronization; the rest of the time it should be in the sleep state. As there is no agent running on the replica, the real agent system directly puts it into the active and sleep states as and when required. So, after the first synchronization, the real agent system sets the replica to 'standby' mode, and thereafter, before each synchronization, it sets the replica back to full power mode. Through this approach we achieve an efficient energy management technique in highly available systems.

V. PERFORMANCE EVALUATION

To compare the performance of a mobile agent on CMAF [1] and on the 'agent based highly available environment', four different situations were simulated.

1) ACP represents the normal execution of an agent in CMAF [1]. This involves migration time, processing time and agent checkpointing [2] time.

2) ASFT represents the normal execution of an agent in the fault tolerant agent system, i.e., the 'agent based highly available environment'.

3) ASFT-fP represents the execution of an agent in the 'agent based highly available environment' on failure of the real agent system while the agent is in a proxy agent system. In this case, we assume that the agent moves back from the proxy agent system to the replica before the updating of the realLocFile and the replicaLocFile.

4) ASFT-fR represents the execution of an agent in the 'agent based highly available environment' on failure of the real agent system while the agent is in the real agent system itself.

Our algorithm was simulated to study the impact of agent size on its execution time. A simulation environment was set up with one real agent system, one replica and one proxy agent system. For each simulated situation, an agent was sent from the real agent system to the proxy agent system while increasing its size. The results of the simulation are shown in Figure 5(a). The difference between the ACP and ASFT curves is due to the replication overhead. There is a constant difference between ASFT and ASFT-fP. This is due to the delay taken by the agent to realise that the real agent system has failed and that it has to move to the replica. The significant difference between ASFT and ASFT-fR is due to the time taken by the replica to take over control on failure of the real agent system, the time taken for reconfiguration of the proxy agent system into a replica, and the restoration overhead. Since the updating of the realLocFile and replicaLocFile in the proxy agent systems is done after the restoration of agents, it does not affect the ASFT-fR curve. In any case, this updating process does not add much delay.
Another simulation environment was set up to analyze the influence of the number of nodes on agent execution time. For each of the four simulated situations, the agent execution time was measured while increasing the number of nodes. Figure 5(b) shows the results of the simulation. From the figure, we observe that there is a small difference between the ACP and ASFT curves, due to the replication overhead. We can also see that the difference between ASFT and ASFT-fP, as well as ASFT-fR, is almost constant.
The energy management system was tested on a network consisting of 12 computers. CMAF was installed on each of the 12 machines and the total power consumed by the 12 computers was measured. After this, the green agent was triggered on all of the 12 machines and the power consumed by all the machines was measured again. The total power consumed by the machines was measured for a significant time in both cases, and a graph of power consumed versus time was plotted from the results. Figure 5(c) shows the power variations with and without the green agent. From the graph it can be clearly concluded that a significant amount of energy is saved if we use the green agent.
[Figure 5 plots: (a) agent execution time (s) vs. agent size (kB) and (b) agent execution time (s) vs. number of nodes, for ACP, ASFT, ASFT-fP and ASFT-fR; (c) power consumed (W) vs. time (hr), with and without the GreenAgent.]

Figure 5. Simulation Results

VI. CONCLUSION

In this paper, we have proposed an 'agent based highly available environment' for achieving high availability in agent based mission critical applications by creating a replica 'on demand' for a real agent system. We achieve this by using an election algorithm designed for dynamic networks to find the best-suited proxy agent system in the network. Since we create the replica dynamically 'on demand', we can avoid agent loss in such systems.
We have also considered the issue of heavy energy wastage in such systems and proposed an intelligent agent based efficient energy management approach for saving energy.
Finally, from the simulations, we found that the agent execution delay due to replication does not add much overhead. We can summarise that there is a significant delay in agent execution only when the real agent system fails. Our approach enhances the availability and self awareness of the agent system and provides a highly reliable environment for the execution of mobile agent based mission critical applications at the expense of replication overhead.
We have also observed and compared the energy consumed by the systems with and without the 'Green Agent'. The results clearly show that a considerable amount of energy is saved by using our energy management approach as compared to the normal approach.
Our future work will concentrate on the performance optimization of the agent based energy efficient highly available environment.

REFERENCES
[1] S. Venkatesh, B. S. Bindhumadhava and Amrit Anand Bhandari, "Implementation of Automated Grid Software Management Tool: A Mobile Agent Based Approach", Proc. of Int'l Conf. on Information and Knowledge Engineering, June 2006, pages 208-214.
[2] Banupriya, Manju Abraham and B. S. Bindhumadhava, "Fault Tolerance for Mobile Agents", Proc. of Int'l Conf. on Wireless Networks, June 2007.
[3] Eugene Gendelman, Lubomir F. Bic and Michael B. Dillencourt, "An Application-Transparent, Platform-Independent Approach to Rollback-Recovery for Mobile Agent Systems".
[4] N. Lynch, "Distributed Algorithms", Morgan Kaufmann Publishers, Inc., 1996.
[5] B. DeCleene et al., "Secure Group Communication for Wireless Networks", Proc. of MILCOM 2001, VA, October 2001.
[6] C. Perkins and E. Royer, "Ad-hoc On-Demand Distance Vector Routing", Proc. of the 2nd IEEE WMCSA, New Orleans, LA, February 1999, pp. 90-100.
[7] W. Heinzelman, A. Chandrakasan and H. Balakrishnan, "Energy-Efficient Communication Protocol for Wireless Microsensor Networks", Proc. of HICSS, 2000.
[8] K. Hatzis, G. Pentaris, P. Spirakis, V. Tampakas and R. Tan, "Fundamental Control Algorithms in Mobile Networks", Proc. of 11th ACM SPAA, March 1999, pages 251-260.
[9] R. Gallager, P. Humblet and P. Spira, "A Distributed Algorithm for Minimum Weight Spanning Trees", ACM Transactions on Programming Languages and Systems, vol. 5, no. 1, pages 66-77, January 1983.
[10] D. Peleg, "Time Optimal Leader Election in General Networks", Journal of Parallel and Distributed Computing, vol. 8, no. 1, pages 96-99, January 1990.
[11] D. Coore, R. Nagpal and R. Weiss, "Paradigms for Structure in an Amorphous Computer", Technical Report 1614, Massachusetts Institute of Technology Artificial Intelligence Laboratory, October 1997.
[12] D. Estrin, R. Govindan, J. Heidemann and S. Kumar, "Next Century Challenges: Scalable Coordination in Sensor Networks", Proc. of ACM MOBICOM, August 1999.
[13] S. Vasudevan, B. DeCleene, N. Immerman, J. Kurose and D. Towsley, "Leader Election Algorithms for Wireless Ad Hoc Networks", Proc. of IEEE DISCEX III, 2003.
[14] A. Amis, R. Prakash, T. Vuong and D. T. Huynh, "MaxMin D-Cluster Formation in Wireless Ad Hoc Networks", Proc. of IEEE INFOCOM, March 1999.
[15] M. K. Aguilera, C. Delporte-Gallet, H. Fauconnier and S. Toueg, "Stable Leader Election", Proc. of DISC 2001, LNCS 2180, p. 108 ff.
[16] G. Taubenfeld, "Leader Election in the Presence of n-1 Initial Failures", Information Processing Letters, vol. 33, no. 1, pages 25-28, October 1989.
[17] N. Malpani, J. Welch and N. Vaidya, "Leader Election Algorithms for Mobile Ad Hoc Networks", Proc. of the Fourth International Workshop on Discrete Algorithms and Methods for Mobile Computing and Communications, Boston, MA, August 2000.
[18] Sudarshan Vasudevan, Jim Kurose and Don Towsley, "Design and Analysis of a Leader Election Algorithm for Mobile Ad Hoc Networks".
[19] C. Lin and M. Gerla, "Adaptive Clustering for Mobile Wireless Networks", IEEE Journal on Selected Areas in Communications, 15(7):1265-75, 1997.
[20] P. Basu, N. Khan and T. Little, "A Mobility Based Metric for Clustering in Mobile Ad Hoc Networks".
[21] G. Newsham and D. Tiller, "Energy Consumption of Desktop Computers: Measurements and Savings Potential", IEEE Transactions on Industry Applications, Jul/Aug 1994, pp. 1065-1072.
[22] B. Nordman, K. Kinney, M. A. Piette and C. Webber, "User Guide to Power Management in PCs and Monitors", University of California, Berkeley, January 1997.

A Two-phase Bi-criteria Workflow Scheduling
Algorithm in Grid Environments
Amit Agarwal and Padam Kumar
Department of Electronics and Computer Engineering
Indian Institute of Technology Roorkee
Roorkee, India
{aamitdec, padamfec}@iitr.ernet.in

Abstract—Scheduling a workflow application in a highly dynamic and heterogeneous grid environment is a complex NP-complete optimization problem. It may require several different criteria to be considered simultaneously when evaluating the quality of a solution or a schedule. The two most important scheduling criteria frequently addressed by current grid research are the execution time and the economic cost. This paper presents an efficient bi-criteria scheduling heuristic for workflows called the Duplication-based Bi-criteria Scheduling Algorithm (DBSA). The proposed approach comprises two phases: (1) Duplication-based Scheduling, which optimizes the primary criterion, i.e. execution time; and (2) Sliding Constraint Schedule Optimization, which optimizes the secondary criterion, i.e. economic cost, while keeping the primary criterion within a sliding constraint. The sliding constraint is defined as a function of the primary criterion to determine how much the final solution may differ from the best solution found in primary scheduling. The experimental results reveal that the proposed approach generates schedules which are fairly optimized for both economic cost and makespan, while keeping the makespan within defined constraints, for executing workflow applications in the grid environment.

Keywords- grid computing; bi-criterion scheduling; optimization; workflow applications; DAG.

I. INTRODUCTION

The Grid [1] is a unified computing platform which consists of a diverse set of heterogeneous resources distributed over a large geographical region, inter-connected over high speed networks and the Internet. A workflow application can be defined as a collection of tasks with precedence constraints that are executed in a well-defined order to achieve a specific goal [2]. Scheduling workflow applications in a grid, with its characteristics of dynamism, heterogeneity, distribution, openness, voluntariness, uncertainty and deception, is a complex optimization problem, and several different criteria need to be considered simultaneously to obtain a realistic schedule. In general, minimization of the total execution time (or makespan) of the schedule is applied as the most important scheduling criterion [3-6]. Current grid computing systems are based on system-centric policies, whose objective is to optimize system-wide metrics of performance, i.e. the total execution time or makespan. The convergence of grid computing toward the service-oriented approach is fostering a new vision where economic aspects represent central issues to boost the adoption of computing as a utility [7]. In current economic market models [8, 9, 10], the economic cost (the cost of executing a workflow application over the grid) has been considered as another important scheduling criterion to employ user-centric policies.
Considering multiple criteria enables us to propose a more realistic solution. Thus, an effective multi-criteria scheduling algorithm is required for executing workflows over the grid while assuring high speed of communication and reducing the tasks' execution time and economic cost. A workflow type of application can be modeled as a Directed Acyclic Graph (DAG) in which the nodes represent the executable tasks and the directed edges represent the inter-task data and control dependencies. Since the DAG scheduling problem in the grid is NP-complete, we have emphasized heuristics for scheduling rather than exact methods. In [8, 10, 11, 12, 13, 14], several scheduling algorithms have been proposed which minimize the makespan and economic cost of the schedule, but only a few of them address workflow types of applications. In [8, 13], Buyya et al. propose multi-objective planning approaches for workflow scheduling on utility grids. In [10], a quality of service optimization strategy for multi-criteria scheduling has been presented for the criteria of payment, deadline and reliability. In [11], Wieczorek et al. present an efficient bi-criterion scheduling algorithm called the Dynamic Constraint Algorithm (DCA) based on a sliding constraint; it takes a list-based heuristic for primary scheduling. In [12], Dogan and Ozguner show another tradeoff, between makespan and reliability, using a sophisticated reliability model of computation and network performance. Our work uses a different specification for two specific criteria (i.e. makespan and economic cost).
In [15], Deelman et al. describe three different workflow scheduling strategies, namely full-plan-ahead scheduling, in-time local scheduling, and in-time global scheduling. In just-in-time scheduling (in-time local scheduling), the scheduling decision for an individual task is postponed as long as possible and performed just before the task execution starts (a fully dynamic approach). In full-ahead planning (full-plan-ahead), the whole workflow is scheduled before its execution starts (a fully static approach). In this paper, we adopt full-ahead planning as it does not incur the run-time overheads and scheduling complexity on a federation of geographically distributed computing resources known as a computational grid. Research studies show that heuristics performing best in a static environment (e.g. HLD [4], HBMCT [6]) have the highest potential to perform best in a more accurately modeled grid environment.

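As a hedged illustration of the two-phase idea described in the abstract, the sliding constraint can be read as a relative allowance over the best makespan M1 found in primary scheduling. The candidate tuples, the slide parameter and the function name below are our illustrative assumptions, not the paper's notation:

```python
# Sketch of sliding-constraint selection: phase 1 fixes the best
# makespan M1; phase 2 picks the cheapest candidate whose makespan
# stays within M1 * (1 + slide). Candidates are (makespan, cost) pairs.

def pick_schedule(candidates, slide):
    cands = list(candidates)
    m1 = min(m for m, _ in cands)               # primary criterion (makespan)
    bound = m1 * (1 + slide)                    # sliding constraint
    feasible = [(m, c) for m, c in cands if m <= bound]
    return min(feasible, key=lambda mc: mc[1])  # cheapest feasible schedule
```

With slide = 0 this degenerates to pure makespan minimization; larger values trade makespan slack for cheaper schedules.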
Scheduling heuristics can be categorized as list-based scheduling, cluster-based scheduling and duplication-based scheduling. By and large, current multi-criteria scheduling approaches have adopted list-based scheduling heuristics (such as HEFT [3], HBMCT [6]) as the primary scheduling. An extensive literature survey shows that duplication-based heuristics [4-5, 16] generate remarkably shorter schedules than list-based and cluster-based heuristics. The duplication approach utilizes idle time slots (scheduling holes) for task duplication, which in turn reduces the communication cost. The duplication strategy generates more optimized alternative schedules which help to minimize the overall schedule length. This motivates us to adopt the duplication-based approach for primary scheduling to optimize the makespan (primary criterion).

In secondary scheduling, our objective is to optimize the economic cost (secondary criterion) of the schedule while keeping the makespan within the defined sliding constraint. Fig. 1 illustrates that a primary solution (M1, C1) is obtained considering the primary criterion, i.e. makespan, in primary scheduling, yielding a makespan of length M1 while the economic cost is C1. In secondary scheduling, this schedule is optimized for the economic cost, dragging the makespan from M1 to M2 (M2 is the maximum allowable schedule length), which yields a schedule with makespan M2 and the reduced economic cost C2. This approach generates schedules which are remarkably more optimized, both in terms of execution time and economic cost, than other related algorithms.

In general, bi-criteria optimization yields a set of solutions (a Pareto set) rather than a single solution. Each solution in a Pareto set is called a Pareto optimum, and when these solutions are plotted in the objective space they are collectively known as the Pareto front. The main objective of a multi-criteria optimization problem is to obtain a Pareto front. In the literature [17], two approaches, LOSS and GAIN, were proposed to compute the weight values for a given DAG. In LOSS, the initial assignment is done for optimal makespan using an efficient DAG scheduling [3], whereas in GAIN the initial assignment is done by allocating tasks to the cheapest machines in order to reduce the economic cost as much as possible. In this paper, we consider the LOSS approach, where the initial assignment is done using a duplication-based scheduling approach inspired by our earlier research work [16] rather than HEFT [3], since the duplication-based heuristic produces a shorter makespan.

The rest of the paper is organized as follows. Section II describes the bi-criteria scheduling problem and related terminology. Section III presents the bi-criteria scheduling approach. The proposed bi-criteria scheduling algorithm (DBSA) is described in section IV. In section V, simulation results are presented and discussed. Section VI concludes the proposed research work.

II. BI-CRITERIA SCHEDULING PROBLEM

A. Workflow Application Model

A workflow scheduling problem can be defined as the assignment of available grid resources to different workflow tasks. A workflow can be modeled as a DAG, as shown in fig. 2, and can be represented by W(N, E, T, C), where N is a set of n computational tasks, T is a set of task computation volumes (one unit of computation volume is one million instructions), E is a set of communication arcs or edges that express precedence constraints among the tasks, and C is the set of communication data from parent tasks to child tasks (one unit of communication data is one Kbyte). The value of t_i in T is the computation volume for the task n_i in N. The value of c_ij in C is the communication data transferred along the edge e_ij in E from task n_i to task n_j, for n_i, n_j in N.

Figure 2. A workflow application with tasks n1-n8 (edge values are in Kbytes).

The execution time (makespan) can be defined as the total time between the finish time of the exit task and the start time of the entry task in the given DAG, and the economic cost is the summation of the economic costs of all workflow tasks scheduled on different resources, which can be computed as:

    Economic Cost (EC) = SUM (j = 1..m) C_j        (1)

where m is the total number of available resources in the grid and C_j is the execution cost of the tasks scheduled on a resource p_j, which can be calculated as:

    C_j = PBT_j(p_j) * M_j        (2)

Figure 1. A bi-criteria optimization process: a local search moves from the primary solution P (M1, C1) toward the final solution F (M2, C2), trading makespan (primary criterion, within the sliding constraint from M1 to M2) for reduced economic cost (secondary criterion).
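Equations (1) and (2) can be sketched in code; the resource busy times and per-unit costs below are hypothetical, not taken from the paper's tables:

```python
# Sketch of equations (1)-(2): EC = sum over resources of PBT_j * M_j,
# where PBT_j is the total busy time consumed on resource p_j and
# M_j is its per-unit-time cost. All values below are illustrative.

def economic_cost(busy_time, unit_cost):
    """EC = SUM_j PBT_j(p_j) * M_j, combining equations (1) and (2)."""
    assert busy_time.keys() == unit_cost.keys()
    return sum(busy_time[p] * unit_cost[p] for p in busy_time)

# Hypothetical schedule: busy time consumed on each resource.
PBT = {"p1": 6.0, "p2": 4.0, "p3": 2.0}
M = {"p1": 1.0, "p2": 2.5, "p3": 3.0}   # cost per unit time, Table III style

print(economic_cost(PBT, M))  # 6*1.0 + 4*2.5 + 2*3.0 = 22.0
```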

where M_j is the per unit time cost of executing tasks on a resource p_j and PBT_j is the total busy time consumed by the tasks scheduled on resource p_j.

TABLE I. COMPUTATION COST, B-LEVEL AND TASK SEQUENCE FOR THE DAG IN FIGURE 2

In this model, the cost of the idle time slots between the scheduled tasks on any resource is also included in the economic cost, as it is difficult for the grid scheduler to schedule other workflow tasks in these idle slots. Thus, the total execution time (makespan) can be expressed as:

    makespan = AFT(n_exit) - AST(n_entry)        (3)

where AFT and AST are the actual finish time and the actual start time of the exit task and the entry task respectively. The normalized schedule length (NSL) of a schedule can be calculated as:

    NSL = makespan / SUM (n_i in CP_min) min (p_j in P) {w_ij}

The denominator is the summation of the minimum execution costs of the tasks on the CP_min [3].

TABLE II. RESOURCE CAPACITY

    Resources:                            p1    p2    p3    p4
    Processing Capacity e(p_i) (MIPS):    220   350   450   310

TABLE III. MACHINE PRICE

    Resource p_i    Machine cost per MIPS M(p_i) (in Dollar $)
    1               1.0
    2               2.5
    3               3.0
    4               2.0

B. Grid Resource Model

A grid resource model can be represented by an undirected weighted graph G(P, Q, A, B), as shown in fig. 3, where P = {p_i | i = 1, 2, ..., p} is the set of p available resources, A = {e(p_i) | i = 1, 2, ..., p} is the set of execution rates (Table II), where e(p_i) is the execution rate of resource p_i, Q = {q(p_i, p_j) | i, j = 1, 2, ..., p} is the set of communication links connecting pairs of distinct resources, where q(p_i, p_j) is the communication link between p_i and p_j, and B = {b(p_i, p_j) | i, j = 1, 2, ..., p} is the set of data transfer rates (bandwidths), where b(p_i, p_j) is the data transfer rate between resources p_i and p_j (fig. 3). In our model, task executions are assumed to be non-preemptive, and the intra-processor communication cost between two tasks scheduled on the same resource is considered to be zero.

C. Bi-criteria Performance Criteria

The computation cost of task n_i on p_j is w_ij (see Table I). If resource p_j is not capable of processing task n_i, then w_ij is infinite. In a grid, some resources may not be in a fully connected topology. Therefore, bandwidths between such resources are computed by searching alternative paths between them with maximum allowable bandwidths. The communication cost between task n_i scheduled on resource p_m and task n_j scheduled on resource p_n can be computed as:

    comm(n_i, n_j) = c_ij / b(p_m, p_n)

Figure 3. Grid with 4 resources p1-p4, showing direct and indirect paths (bandwidth is in Kbps).

In this model, we ignore the communication startup costs of resources, and the intra-processor communication cost is negligible. A workflow of tasks is submitted to the Grid scheduler [14], where tasks are queued in non-increasing order of their b-level. The b-level (bottom level) of task n_i can be defined as the longest directed path, including execution cost and communication cost, from task n_i to the exit task in the given DAG. It can be computed recursively as:

    b_i = w~_i + max (n_j in succ(n_i)) { c~_ij + b_j }

where succ(n_i) refers to the immediate successors of task node n_i, w~_i is the mean computation cost of task n_i and c~_ij is the mean communication cost between tasks n_i and n_j.

The optimization goal of bi-criteria scheduling is to obtain the schedule with minimum schedule cost. It can be expressed in terms of a performance metric called effective schedule cost (ESC), which can be computed as:

    ESC = NSL * EC

III. BI-CRITERIA SCHEDULING APPROACH

In this paper, we consider the makespan as the primary criterion and the economic cost as the secondary criterion. We define a sliding constraint for the primary criterion, i.e. how much the final solution may differ from the best solution found for the primary criterion. The primary scheduler adopts an efficient duplication scheduling approach to minimize the schedule length as much as possible. This schedule is then forwarded to the secondary scheduler. The secondary scheduler optimizes the schedule produced by the primary scheduler to minimize the economic cost. In secondary scheduling, some of the duplicated tasks are removed and the schedule is modified such that the makespan after removing those duplicated tasks remains within the maximum allowable execution length.

Further, it investigates those tasks in the schedule which have been duplicated on other resources. Such tasks may become unproductive if their descendant tasks are receiving input data from their duplicated version. Thus, such unproductive tasks or sub-schedules are removed in order to reduce the economic cost. If the makespan of the resulting schedule is less than the upper limit of the defined sliding constraint, it can be further modified to reduce the economic cost: the schedule is modified by swapping tasks among resources (costlier to cheaper resources) such that the economic cost of the schedule is reduced while the makespan is kept within the upper limit.

IV. DBSA ALGORITHM

The proposed algorithm can be divided into two phases: (1) Primary Scheduling, optimizing the makespan; and (2) Secondary Scheduling, optimizing the economic cost while keeping the makespan within the defined sliding constraint. The pseudo code for the proposed algorithm is described in Algorithm I for primary scheduling and Algorithm II for secondary scheduling.

An efficient duplication-based scheduling heuristic is applied for the primary scheduling [16]. It generates a preliminary solution sol_w^prel in SC, with the total costs of the primary criterion and the secondary criterion denoted as c1^prel and c2^prel, respectively. The set SC contains all possible schedules for workflows to be executed over the grid [11]. The secondary scheduling optimizes the primary solution for the secondary criterion, generating the best possible solution sol_w^final in SC, with the total costs c1^final and c2^final of the primary and secondary criteria. The sliding constraint is equal to L, such that the primary criterion cost may increase from c1^prel to c1^prel + L. We can calculate the maximum allowable execution time Tmax of a workflow with cheapest economic cost Cmin using a cost optimization algorithm such as GreedyCost [13]. Similarly, the maximum allowable economic cost Cmax of a workflow with the shortest possible execution time Tmin can be computed using a time optimization algorithm such as HED [16].

Algorithm (DBSA)

Input:
  A DAG (workflow) W with task computation and communication costs.
  A set of available resources P with cost of execution per unit time.
  Sliding constraint L (10%, 25%, 50% and 75% of the makespan c1^prel).

Algorithm I: Primary Scheduling

  Construct a priority-based task sequence based on highest b-level
  for each unscheduled task ni in the task sequence
      Let finish time Fi of task ni be infinite
      for each capable resource pj
          Compute finish time Fij of task ni on resource pj
          Construct task predecessor list pred_list(ni)
          Initialize temp_list(ni, pj) to zero
          if (pred_list)
              for each predecessor dk not scheduled on pj
                  if duplication of dk on pj reduces the finish time Fij
                      Add dk to temp_list(ni, pj)
                      Update finish time Fij
                  endif
              endfor
          endif
          if Fij < Fi
              Fi = Fij
              ri = pj
              if (temp_list)
                  Copy temp_list(ni, pj) into duplicate_list(ni, pj)
              endif
          endif
      endfor
      Assign task ni to resource ri and update schedule S
      if (duplicate_list)
          Duplicate tasks from duplicate_list onto ri and update schedule S
      endif
  endfor
  Compute makespan c1^prel (Equation 3) and economic cost c2^prel (Equation 1) from schedule S
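The b-level computation and the priority-based task sequence at the top of Algorithm I can be sketched as follows; the DAG and its mean costs are illustrative, and the duplication refinement of Algorithm I is omitted for brevity:

```python
# Sketch of the priority step of Algorithm I: compute b-levels
# (b_i = mean_w_i + max over successors of (mean_c_ij + b_j)) and
# order tasks by decreasing b-level. DAG and costs are hypothetical.
from functools import lru_cache

succ = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": ["n4"], "n4": []}
mean_w = {"n1": 4.0, "n2": 3.0, "n3": 5.0, "n4": 2.0}   # mean computation cost
mean_c = {("n1", "n2"): 1.0, ("n1", "n3"): 2.0,
          ("n2", "n4"): 3.0, ("n3", "n4"): 1.0}         # mean communication cost

@lru_cache(maxsize=None)
def b_level(ni):
    # Exit task: b = mean computation cost only.
    kids = succ[ni]
    if not kids:
        return mean_w[ni]
    return mean_w[ni] + max(mean_c[(ni, nj)] + b_level(nj) for nj in kids)

# Task sequence: highest b-level first (entry task comes out on top).
sequence = sorted(succ, key=b_level, reverse=True)
print([(t, b_level(t)) for t in sequence])
```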

Algorithm II: Secondary Scheduling

  Let L (maximum allowable schedule length) = c1^prel + 10% of c1^prel
  if (duplicate_list && c1^prel <= L)
      Copy the duplicated tasks into list A and sort them in non-decreasing order of start time
      for each duplicated task ai in A
          Compute schedule length SL without considering ai in schedule S
          if SL <= L
              c1^final = SL
              Remove ai from S and update S and A
          endif
      endfor
      Construct list B of tasks from A that were duplicated
      Sort list B in non-decreasing order of task start time
      for each task bi in B
          Compute schedule length SL without considering bi in schedule S
          if SL <= L
              c1^final = SL
              Remove bi from S and update S and B
          endif
      endfor
  endif
  Compute economic cost c2^final of the optimized schedule S
  for each task ni scheduled on resource pj in S
      Construct list R of capable resources, in non-decreasing order of machine cost, whose machine cost is less than M(pj)
      for each resource pk in R
          Reschedule task ni to resource pk provided c1^final <= L
          Compute economic cost EC' (using Equation 1)
          if EC' < c2^final
              Update schedule S
              c2^final = EC'
          endif
      endfor
  endfor

The schedule produced by primary scheduling is illustrated in fig. 4(a) for the workflow shown in fig. 2. The total execution time (or makespan) of this schedule is 16. This schedule yields a total economic cost of 18.81$, computed from the machine costs in Table III using equations (1) and (2). We then apply the secondary scheduling to optimize the economic cost while keeping the makespan within the sliding constraint. In fig. 4(b), the schedule is optimized by removing some duplicated tasks whose removal keeps the makespan within the maximum allowable limit; this reduces the economic cost of the schedule to 17.31$. Next, we identify those duplicated tasks or sub-schedules whose descendant tasks are receiving input data from their duplicated version, and try to remove them. Removing such sub-schedules or tasks reduces the economic cost of the schedule to 14.32$, as shown in fig. 4(c). The tasks in this schedule are then rescheduled on the cheapest resources wherever that reduces the economic cost while keeping the makespan within the maximum allowable limit. The schedule in fig. 4(c) is modified so as to reschedule tasks from resource P3 to P4, which reduces the economic cost to 9.32$ while the makespan is kept below 18 (+10% of the makespan in primary scheduling).

Figure 4. Schedule of the bi-criteria approach using (a) Primary Scheduling and (b) to (d) Secondary Scheduling with a sliding constraint of +10% of the makespan.

V. SIMULATION RESULTS AND ANALYSIS

The algorithms described in section IV have been simulated and implemented for the evaluation of random task graphs (DAGs) of different sizes (100, 200, 300, 400 and 500 tasks) with different degrees of parallelism, i.e. maximum outdegree of nodes in the DAG (2, 4, 6, 8 and 10). The algorithms have been executed and compared on grids of heterogeneous clusters of different sizes (5, 10, 15, 20, and 25 clusters) with 4 resources in each cluster. The proposed algorithm (DBSA) has been compared with DCA [11] on the performance metric effective schedule cost (ESC), as described in section II, with respect to workflows and grids of different sizes. The algorithms have been run under the same conditions for fair comparison: for each workflow, each algorithm is run to find the best possible secondary criterion cost while keeping the primary criterion within the defined sliding constraint.
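The effective schedule cost metric used in this comparison combines the normalized schedule length and the economic cost (ESC = NSL x EC, Section II); a minimal sketch with hypothetical makespan, CP_min costs and economic cost:

```python
# ESC = NSL * EC, where NSL = makespan / (sum of minimum execution
# costs of the tasks on CP_min). The makespan, CP_min task costs and
# economic cost below are hypothetical, chosen only to illustrate
# the arithmetic (18.81$ echoes the style of Section IV's example).

def nsl(makespan, cpmin_min_costs):
    """Normalized schedule length over the minimum critical path."""
    return makespan / sum(cpmin_min_costs)

def esc(makespan, cpmin_min_costs, economic_cost):
    """Effective schedule cost: ESC = NSL * EC."""
    return nsl(makespan, cpmin_min_costs) * economic_cost

print(round(esc(16, [2, 3, 3], 18.81), 2))  # NSL = 16/8 = 2.0 -> 37.62
```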

The algorithm is run for primary scheduling for both criteria (i.e. makespan and economic cost) to find the best and worst solutions for the primary criterion (c1^best and c1^worst), which yield the maximum sliding constraint, i.e. the difference |c1^worst - c1^best|. The algorithm is run for three different sliding constraint values: 25%, 50% and 75% of the schedule difference |c1^worst - c1^best|. The simulated results and graphs reveal that the proposed bi-criteria scheduling approach outperforms the DCA algorithm in terms of both economic cost and schedule length. In fig. 5 and fig. 6, the DBSA algorithm yields a reduced effective schedule cost (ESC) as compared to DCA over the grid. The simulation parameters for modeling workflows and grid environments are presented in Table IV.

TABLE IV. GRID ENVIRONMENT LAYOUTS

    Number of grid resources       [20, 100]
    Resource bandwidth             [100 Mbps, 1 Gbps]
    Number of tasks                [100, 500]
    Computation cost of tasks      [50, 2000] ms
    Data transfer size             [20 Kbytes, 20 Mbytes]
    Resource capability (MIPS)     [220, 580]
    Execution cost (per MIPS)      [1-5 $ per MIPS]

Figure 5. Effect of grid sizes on effective schedule cost (DBSA vs. DCA, effective schedule cost against number of resources).

Figure 6. Effective schedule cost on different workflows (DBSA vs. DCA, effective schedule cost against number of tasks).

VI. CONCLUSIONS

In this paper, a novel bi-criteria workflow scheduling approach has been presented and analyzed. We have proposed an efficient scheduling algorithm called Duplication-based Bi-criteria Scheduling Algorithm (DBSA), which optimizes both the makespan and the economic cost of the schedule. The schedule generated by the DBSA algorithm is much more optimized than those of other related bi-criteria algorithms in respect of both makespan and economic cost. The algorithms have been implemented to schedule different random DAGs onto different grids of heterogeneous clusters of various sizes. Different variants of the algorithm were modeled and evaluated.

REFERENCES

[1] I. Foster and C. Kesselman, "The Grid 2: Blueprint for a New Computing Infrastructure", Morgan Kaufmann Pub., Elsevier Inc., 2004.
[2] Z. Shi and J. J. Dongarra, "Scheduling workflow applications on processors with different capabilities", Elsevier, 2005.
[3] H. Topcuoglu, S. Hariri, and M. Wu, "Performance-effective and low-complexity task scheduling for heterogeneous computing", IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 3, pp. 260-274, March 2002.
[4] S. Bansal, P. Kumar and K. Singh, "Dealing with Heterogeneity Through Limited Duplication for Scheduling Precedence Constrained Task Graphs", Journal of Parallel and Distributed Computing, vol. 65, no. 4, pp. 479-491, April 2005.
[5] A. Dogan and F. Ozguner, "LDBS: A Duplication Based Scheduling Algorithm for Heterogeneous Computing Systems", Proceedings of the Int'l Conf. on Parallel Processing, pp. 352-359, August 2002.
[6] R. Sakellariou and H. Zhao, "A hybrid heuristic for DAG scheduling on heterogeneous systems", 13th IEEE Heterogeneous Computing Workshop (HCW'04), Santa Fe, New Mexico, USA, April 2004.
[7] N. Ranaldo and E. Zimeo, "Time and Cost-Driven Scheduling of Data Parallel Tasks in Grid Workflows", IEEE Systems Journal, vol. 3, no. 1, pp. 104-120, March 2009.
[8] J. Yu, R. Buyya, and C. K. Tham, "Cost-based Scheduling of Scientific Workflow Applications on Utility Grids", Proceedings of the 1st IEEE International Conference on e-Science and Grid Computing (e-Science 2005), Melbourne, Australia, IEEE CS Press, December 2005.
[9] C. Ernemann, V. Hamscher and R. Yahyapour, "Economic Scheduling in Grid Computing", Proceedings of the 8th Workshop on Job Scheduling Strategies for Parallel Processing, vol. 2537 of Lecture Notes in Computer Science, Springer, pp. 128-152, 2002.
[10] C. Li and L. Li, "Utility-based QoS optimization strategy for multi-criteria scheduling in Grid", Journal of Parallel and Distributed Computing, 2006.
[11] M. Wieczorek, S. Podlipnig, R. Prodan and T. Fahringer, "Bi-criteria Scheduling of Scientific Workflows for the Grid", 8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID '08), pp. 9-16, 19-22 May 2008.
[12] A. Dogan and F. Ozguner, "Biobjective Scheduling Algorithms for Execution Time-Reliability Trade-off in Heterogeneous Computing Systems", The Computer Journal, vol. 48, no. 3, pp. 300-314, 2005.
[13] J. Yu and R. Buyya, "Scheduling Scientific Workflow Applications with Deadline and Budget Constraints using Genetic Algorithms", Scientific Programming Journal, vol. 14, no. 1, pp. 217-230, 2006.
[14] A. Agarwal and P. Kumar, "An Effective Compaction Strategy for Bi-criteria DAG Scheduling in Grids", Int. J. of Communication Networks and Distributed Systems (IJCNDS), Inderscience Publishers, in press.
[15] E. Deelman, J. Blythe, Y. Gil, and C. Kesselman, "Workflow Management in GriPhyN", Grid Resource Management: State of the Art and Future Trends, pp. 99-116, 2004.
[16] A. Agarwal and P. Kumar, "Economical Duplication Based Task Scheduling for Heterogeneous and Homogeneous Computing Systems", IEEE International Advance Computing Conference (IACC 2009), pp. 87-93, 6-7 March 2009.
[17] E. Tsiakkouri, H. Zhao, R. Sakellariou and M. D. Dikaiakos, "Scheduling Workflows with Budget Constraints", Proceedings of the CoreGRID Workshop "Integrated Research in Grid Computing", S. Gorlatch and M. Danelutto, Eds., pp. 347-357, November 2005.

ADCOM 2009
HUMAN COMPUTER INTERFACE -2

Session Papers:

1. Mozaffar Afaq, Mohammed Qadeer, Najaf Zaidi and Sarosh Umar ,“Towards Geometrical
Password for Mobile Phones”

2. Md Sahidullah, Sandipan Chakroborty and Goutam Saha, “Improving Performance of Speaker Identification System Using Complementary Information Fusion”

3. Narayanan Palani, “Right Brain Testing-Applying Gestalt psychology in Software Testing”

Towards Geometrical Password for Mobile Phones

Mozaffar Afaque, Dept of Computer Science, Indian Institute of Technology, Kharagpur, India (afaq.mozaffar@gmail.com)
M Sarosh Umar, Dept of Computer Engg, Aligarh Muslim University, Aligarh, India (sarosh.umar@gmail.com)
Najaf Zaidi, Design Engineer I, TR&D, ST Microelectronics, Greater Noida, India (najaf.zaidi@st.com)
Mohammed A Qadeer, Dept of Computer Engg, Aligarh Muslim University, Aligarh, India (maqadeer@ieee.org)

Abstract — Mobile cell phones have brought a revolution in the modern world. They have become profound instruments in bringing social as well as financial transformation. Mobile phones today not only hold the key to communication problems but can also be a suitable medium to facilitate commercial and financial transactions. There is an urgent need to establish ways to authenticate people over cell phones. The current method for authentication uses an alphanumeric username and password. The textual password scheme is convenient but has its own drawbacks. Alphanumeric passwords are most of the time easy to guess, offer limited possibilities and are easily forgotten. With financial transactions at stake, the need of the hour is a collection of robust schemes for authentication. Graphical passwords are one such scheme, offering a plethora of options and combinations. We are proposing a scheme which is simple for the user and robust at the same time. A graphical password formed by drawing geometries provides a larger password space and at the same time allows the user to use photographic memory, making it easy to remember. The proposed scheme is suitable for all touch-sensitive mobile phones.

Keywords: User authentication, graphical password, smart phone security, geometrical password.

I. INTRODUCTION

Cell phones have become a necessity. The ability to keep in touch with family and business associates, and access to email, are not the only reasons for the increasing demand for cell phones. Today's technically advanced cell phones are capable of not only receiving and placing phone calls, but can very conveniently store data, take pictures and connect to the internet. These features have allowed them to become successful mediums in the field of e-commerce. With the foray of mobiles into the world of finance, a method to authenticate users and their transactions was required. Textual passwords were the first choice, not because they were robust, but because they were easy to implement. With this choice we opted for a catch-22 situation: if the passwords are easy to remember, they are also easy to crack or guess, and when they are complex they are easily forgotten.

Most of the passwords chosen by users in a textual password system are dictionary based, and this makes the cracker's (one who tries to guess your password) job easier. Armed with a dictionary of 250,000 words, a cracker could compare their encryptions with those already stored in the password file in a little more than five minutes [1]. Even if edited words are included, that adds only 14 to 17 additional tests per word, which adds another 1,000,000 words to the list of possible passwords for each user [1]. In this paper we demonstrate a graphical grid-based password scheme which aims at providing a huge password space along with ease of use. We also analyze its strength by examining the success of the brute force technique. In this scheme we try to make the password easy for the user to remember and more complex for the attacker.

II. RELATED WORK

Many papers have been published in recent years with a vision to have a graphical technique for user authentication. Primarily there are just two methods, having recall-based and recognition-based approaches respectively. Traditionally both methods have been realized through the textual password space, which makes them easy to implement and at the same time easy to crack.

Figure 1: VisKey SFR

The study shows a 90% recognition rate, at a few seconds per picture, over 2560 pictures [2]. Clearly the mind of Homo sapiens is best suited to responding to a visual. A recall-based password approach is VisKey [3], which is designed for PDAs. In this scheme, to create a password, users have to tap spots in sequence. Since PDAs have a smaller screen, it is difficult to point to the exact location of a spot. Theoretically it provides a large password space, but not one large enough to withstand a brute force attack if the number of spots is fewer than seven [4].

There are also schemes like Passfaces, in which the user chooses different relevant pictures that describe a story [5]; this is an image-recognition-based password scheme. A recent study of graphical passwords [6] says that people are more comfortable with graphical passwords, which are easier to remember. In a recall-based scheme the user has to remember the password itself.

Jermyn, et al. [7] proposed a technique called "Draw-a-Secret (DAS)", which allows the user to draw their unique password (figure 2). In DAS the user-defined drawing, made by stylus strokes in the case of a PDA, is recorded, and the user has to reproduce the same drawing to authenticate himself. The DAS scheme also allows dots only, as in the example shown in figure 3.

Figure 2: DAS scheme.

Figure 3: Example of a password in DAS using only dots.

But research shows that people optimally recall only 6 to 8 points in a pattern [8], and the number of successful recalls decreases drastically after 3 or 4 dots [9]. Our main motivation is to increase the password space. The user can choose a geometrical shape of their choice on a device, such as a PDA, having a graphical user interface, which also optimizes the password storage space. In our scheme we allow users to draw geometrical shapes with fixed end points and to put dots at different locations, which yields filled triangles, in such a way that the chances of remembering those positions are better.

III. DRAWING GEOMETRY

Drawing geometry is a graphical password scheme in which the user draws some geometrical object on the screen. Through this scheme we are targeting devices like mobiles, notebook computers and hand-held devices such as Personal Digital Assistants (PDAs) which have a graphical user interface. Since these devices are graphical-input enabled, we can draw some interesting geometries using a stylus.

In this scheme there is an m x n grid, and each cell is further divided into four parts by diagonal lines, as shown in figure 4.

Figure 4: Grid provided to the user and some simple geometrical shapes drawn by the user.

In figure 4 above, we have considered a 4x5 grid, keeping in mind the typical screen size and width-to-height ratio of today's PDAs. Depending on the screen size, it can be changed to a justifiable number of rows and columns. Taking that size (4x5), we have a total of 5x4 = 20 blocks, and each block has four triangles, so the total number of possible triangles is (20 blocks) x (4 triangles/block) = 80 triangles. Similarly, each block has 4 small diagonal lines, so the total number of such lines is (20 blocks) x (4 lines/block) = 80 lines. We also have the lines which result from joining adjacent points horizontally and vertically. That gives 4x6 = 24 horizontal and 5x5 = 25 vertical lines, which makes a total of 24 + 25 = 49 horizontal and vertical lines. In this way we have a total of

    P(5,4) = 80 + 80 + 49 = 209 objects        (1)

Figure 5: Drawing a solid triangle
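The count in equation (1) generalizes to any grid of c x r cells; a quick sketch of the arithmetic above:

```python
# Generalization of equation (1): the number of selectable objects
# (triangles, diagonal lines, horizontal and vertical lines) in a
# grid of c columns x r rows of cells, each cell split into four
# triangles by its diagonals.

def object_count(c, r):
    triangles = 4 * c * r        # 4 triangles per cell
    diagonals = 4 * c * r        # 4 half-diagonal lines per cell
    horizontals = c * (r + 1)    # c segments per row of grid points
    verticals = (c + 1) * r      # r segments per column of grid points
    return triangles + diagonals + horizontals + verticals

print(object_count(4, 5))  # the paper's 4x5 grid -> 209
# Each object is either selected or not, so the raw space (ignoring
# the drawing rules) is 2**object_count(c, r) combinations.
```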

These 209 objects can be used to choose a password by drawing some of them in an efficient manner. A password is considered to be the selection of certain lines and triangles. When a triangle is selected it is filled with some color, and when a line is selected the color of that line changes (it gets highlighted). Any combination of selected lines and triangles forms a password, as shown in figure 5. In this way, highlighted lines and filled triangles provide us a larger password space. Filling a triangle and highlighting a line can be done using the stylus of the PDA, either by putting a dot inside the triangle or by dragging the stylus across the line. Research shows that as the number of dots increases, it becomes more difficult to remember them; in this scheme the filled triangles and highlighted lines form a geometric shape, and it is the shape that has to be recalled, not the dots. Moreover, we provide another option which converts all highlighted lines to un-highlighted ones and vice versa, and does the same for filled triangles, with a single click of an "Invert" button; this at least doubles the password space within the practical limit of password length. A line which is not inclined at an angle of 45°, 0° or 90°, i.e. a line which is not parallel to the diagonal, horizontal or vertical lines (let's call these non-parallel lines), can also be drawn by joining two points after enabling such drawing by clicking the button labeled "Line". As we can see, crossing the same line again cancels the effect of highlighting (figure 6); in general, crossing the same line an even number of times cancels the highlighting effect. The users don't need to recall the strokes, only the resulting geometry. By using the inversion operation, as shown in figure 7, the user can deselect all currently highlighted lines and triangles and select all the unselected ones.

Note that the inversion does not take place for non-parallel lines. Figure 8 shows a password made by using parallel and non-parallel lines. To draw such a line, the user drags the stylus from one point to another after pressing the "Line" button. The start point and end point of the line are decided by where the stylus touches the screen and where it leaves it. As illustrated in figure 8, if the stylus touches the screen at some coordinate (x, y), with the two nearest vertical lines va and vb (within half a cell width of the point P) such that va <= x < vb, and horizontal lines ha and hb such that ha <= y < hb, the nearest grid point P of that region is taken. The same strategy is adopted for the end point, where the stylus releases the screen. If the lines drawn by the user are parallel but the procedure adopted to draw them is that of non-parallel lines, the scheme automatically detects this; even if parallel lines are drawn by the non-parallel method of drawing, they are considered parallel lines.

Figure 6: Drawing lines
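The start- and end-point snapping described above can be sketched as follows; the cell dimensions are assumed parameters, not values from the paper:

```python
# Snap a stylus coordinate (x, y) to the nearest grid point, as in the
# non-parallel-line drawing rule: the touched point falls in a region
# bounded by the nearest vertical lines (va <= x < vb) and horizontal
# lines (ha <= y < hb), and the nearest grid point of that region is
# taken. cell_w and cell_h are assumed screen parameters.

def snap_to_grid(x, y, cell_w=20, cell_h=20):
    col = round(x / cell_w)    # index of the nearest vertical grid line
    row = round(y / cell_h)    # index of the nearest horizontal grid line
    return col * cell_w, row * cell_h

print(snap_to_grid(23, 48))  # -> (20, 40)
```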

Figure 7: Inversion of drawn geometry

Figure 9: Example of textual password

This allows users to use textual passwords in a graphical way. The letter(s) can be drawn in any direction, and any letter can be entered at any position on the screen, as per the user's convenience.

V. EXTENSION FOR POSITION INDEPENDENCE AND MULTISTAGE

As of now we have considered that the shapes and their locations together constitute the password. If the user has written the letter 'A' but fails to recall the position of the 'A', the password will still be incorrect. The scheme can be extended to accommodate such cases: the location of the figure can be ignored if the shape is correct (as illustrated in figure 10), so that the same shape pattern at the two circled locations is treated as the same password. Obviously, doing so decreases the password space, but this can be compensated by increasing the number of grids. As we have seen, text can be drawn, but the size of the PDA screen limits the grid size. We can therefore have multiple stages for drawing shapes, i.e. one shape in the first frame followed by the next frame, and so on.

The user can select the "more" button provided (not shown here) to go to the next fresh blank frame, on which more letters or shapes can be drawn. We could not write the full word IMAGINE in one frame, but with multistage entry we can write the first few letters, say IMA, in the first frame and the rest, GINE, in the second. Multistage entry increases the time required to enter the password, but it also gives a huge password space: the number of ways a password word such as GRAPH can be entered or chosen by the user increases, e.g. GR and APH, or GRA and PH, for two stages. Though the number of stages will normally be small, by not fixing the number of stages we gain a high password space.

Figure 8: Example of non-parallel lines

The grid shown on screen is for the user's convenience; a password drawn on invisible grids is shown in figure 7, which also illustrates the inversion.

IV. TEXT SIMULATION

The technique described above can be used to write any textual password. In the example shown, the word "IMAGINE" is written vertically to accommodate more letters on the screen; the letter E is (purposely) missing, as shown in figure 9. If the password contains more words, then multiple screens (say, frames) can be used to accommodate them.
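Position-independent matching, as described for figure 10, can be sketched by translating a drawn shape to a canonical origin before comparison. The representation of a shape as a set of grid segments is our own illustration, not the paper's data structure.

```python
# Illustrative sketch of position-independent matching: a drawn shape is
# translated so its minimum coordinates sit at the origin; two drawings
# then match if their canonical forms are equal, regardless of where on
# the grid they were drawn.

def canonical(segments):
    """Translate a shape (a set of grid segments) to the origin."""
    xs = [x for seg in segments for (x, _) in seg]
    ys = [y for seg in segments for (_, y) in seg]
    dx, dy = min(xs), min(ys)
    return frozenset(
        tuple(sorted(((x1 - dx, y1 - dy), (x2 - dx, y2 - dy))))
        for ((x1, y1), (x2, y2)) in segments
    )

# The same "L"-shaped figure drawn at two different grid locations:
shape_a = {((0, 0), (0, 2)), ((0, 2), (1, 2))}
shape_b = {((2, 1), (2, 3)), ((2, 3), (3, 3))}
assert canonical(shape_a) == canonical(shape_b)   # position is ignored
```

Comparing canonical forms instead of raw coordinates is what shrinks the password space, as noted above: all translated copies of a shape collapse to one password.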

VI. STORAGE OF PASSWORD

Since there is no need to store any image, only the password needs to be stored. As we have seen, for a grid size of 4x5 there are 209 possible objects if non-parallel lines are not considered; if we number every object 0, 1, 2, …, 208, then 209 bits (one per object) are sufficient to store such a password. An extra bit should be kept to record whether the password is inverted, to avoid extra computation while entering the password. To include non-parallel lines, each non-parallel line can be stored as the coordinates of its two points (start point and end point). A fixed number of bits first records how many such lines there are, followed by the coordinates of the end points of each line (10 bits for each). So if the number of non-parallel lines is np, the total password length, taking 10 bits to represent each non-parallel line, is given below.

Required bits to store password
= 209 + 1 + 10 + np × 10
= 220 + np × 10.

Figure 10: Example of position independence

So this scheme does not take much space to store the password, as many graphical schemes do [10].

Figure 11: Variation of password space with increase in number of grids.

VII. SECURITY ANALYSIS

As we have seen, we have 209 objects, each of which is individually either highlighted or unselected. Considering only the 209 objects and excluding the non-parallel lines, we have a total of 2^209 = 8.2275 × 10^62 possibilities, which is huge in terms of password space.

So the scheme is very robust from a security point of view even after excluding non-parallel lines. If we consider non-parallel lines as well, an additional 220 lines are added, each of which is also either highlighted or unselected; in that case the total number of possible passwords is 2^(209+220) = 2^429 = 1.386 × 10^129. It is clear that the password space increases exponentially with the number of rows or columns. A device with a bigger screen (such as an ATM) can have many more columns and rows.

Due to this large password space it is very difficult to carry out a brute force attack on this password. With this scheme, even if the user decides to use the graphical representation of text, he will be least susceptible to dictionary attacks. We have computed the password space above for the simple case of a 4x5 grid and single-stage password entry; if we include position independence and multistage entry, the password space increases many fold. Since we have made no special assumption for text simulation in this scheme, the password space remains the same even if we treat it as a textual password scheme.

VIII. CASE STUDY

We requested 25 users to try this scheme and share their experience with us. When asked to rate the ease of use of the new methodology on a scale of 1-10, we got an average of 8.5. Twenty out of twenty-five users said they found it easier to remember passwords as graphical geometries. Twenty-one users out of twenty-five could reproduce their passwords after an interval of one week.

IX. CONCLUSION AND FUTURE WORK

In this paper we have proposed a graphical password scheme in which the user can draw simple geometrical shapes consisting of lines and solid triangles. The user does not need to remember the way in which the password was drawn, just the final geometrical shape.

The scheme gives a large password space and is competent in resisting brute force attacks. Its way of storing the password requires less space than other graphical schemes.

The scheme is immune to shoulder surfing as long as the screen of the hand-held device is visible to the user only. However, when employed on PCs and ATM machines it is susceptible to shoulder surfing. To make it more robust and handle the problem of shoulder surfing, we will have to take into account the order in which the various components of the geometrical shape were drawn, i.e. which line or triangle was selected first, which was selected next, and so on. This consideration will limit the scheme's vulnerability to shoulder surfing and will also expand the password space.

REFERENCES

1. Daniel V. Klein, "Foiling the Cracker: A Survey of, and Improvements to, Password Security".

2. "Perception and memory for pictures: Single-trial learning of 2500 visual stimuli", Psychonomic Science, 19(2):73–74, 1970.

3. SFR-IT-Engineering, http://www.sfrsoftware.de/cms/EN/pocketpc/viskey/, accessed January 2007.

4. Muhammad Daniel Hafiz, Abdul Hanan Abdullah, Norafida Ithnin, Hazinah K. Mammi, "Towards Identifying Usability and Security Features of Graphical Password in Knowledge Based Authentication Technique", in Proceedings of the Second Asia International Conference on Modelling & Simulation, IEEE Computer Society.

5. D. Davis, F. Monrose, and M. Reiter, "On User Choice in Graphical Password Schemes", in 13th USENIX Security Symposium, 2004.

6. J. Thorpe and P. van Oorschot, "Graphical Dictionaries and the Memorable Space of Graphical Passwords", in 13th USENIX Security Symposium, 2004.

7. I. Jermyn, A. Mayer, F. Monrose, M. Reiter, and A. Rubin, "The Design and Analysis of Graphical Passwords", in 8th USENIX Security Symposium, 1999.

8. R.-S. French, "Identification of Dot Patterns From Memory as a Function of Complexity", Journal of Experimental Psychology, 47:22–26, 1954.

9. S.-I. Ichikawa, "Measurement of Visual Memory Span by Means of the Recall of Dot-in-Matrix Patterns", Behavior Research Methods and Instrumentation, 14(3):309–313, 1982.

10. Xiaoyuan Suo, Ying Zhu, and G. Scott Owen, "Graphical Passwords: A Survey", in Proceedings of the 21st Annual Computer Security Applications Conference (ACSAC 2005), IEEE Computer Society.

11. Konstantinos Chalkias, Anastasios Alexiadis, and George Stephanides, "A Multi-Grid Graphical Password Scheme".

12. Julie Thorpe and P.C. van Oorschot, "Towards Secure Design Choices for Implementing Graphical Passwords", in Proceedings of the 20th Annual Computer Security Applications Conference (ACSAC '04), IEEE Computer Society.

13. Julie Thorpe, P.C. van Oorschot, and Anil Somayaji, "Pass-thoughts: Authenticating With Our Minds".

14. Phen-Lan Lin, Li-Tung Weng, and Po-Whei Huang, "Graphical Passwords Using Images with Random Tracks of Geometric Shapes", in Proceedings of the 2008 Congress on Image and Signal Processing, IEEE Computer Society.

15. Sonia Chiasson, P.C. van Oorschot, and Robert Biddle, "Graphical Password Authentication Using Cued Click Points".

16. M. W. Calkins, "Short studies in memory and association from the Wellesley College Laboratory", Psychological Review, 5:451–462, 1898.

17. M. A. Borges, M. A. Stepnowsky, and L. H. Holt, "Recall and recognition of words and pictures by adults and children", Bulletin of the Psychonomic Society, 9:113–114, 1977.
Improving Performance of Speaker Identification
System Using Complementary Information Fusion
Md. Sahidullah, Sandipan Chakroborty and Goutam Saha
Department of Electronics and Electrical Communication Engineering
Indian Institute of Technology, Kharagpur, India, Kharagpur-721 302
Email: sahidullah@iitkgp.ac.in, mail2sandi@gmail.com, gsaha@ece.iitkgp.ernet.in
Telephone: +91-3222-283556/1470, FAX: +91-3222-255303

Abstract—Feature extraction plays an important role as a front-end processing block in the speaker identification (SI) process. Most SI systems utilize features like Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), or Linear Predictive Cepstral Coefficients (LPCC) for representing the speech signal. Their derivations are based on short-term processing of the speech signal, and they try to capture the vocal tract information while ignoring the contribution from the vocal cord. Vocal cord cues are equally important in the SI context, as information like the pitch frequency and the phase in the residual signal could convey important speaker-specific attributes and is complementary to the information contained in spectral feature sets. In this paper we propose a novel feature set extracted from the residual signal of LP modeling. Higher-order statistical moments are used here to capture the nonlinear relationships in the residual signal. To get the advantages of complementarity, the vocal cord based decision score is fused with the vocal tract based score. The experimental results on two public databases show that the fused-mode system outperforms single spectral features.

Index Terms—Speaker Identification, Feature Extraction, Higher-order Statistics, Residual Signal, Complementary Feature.

I. INTRODUCTION

Speaker Identification is the process of identifying a person by his/her voice signal [1]. A state-of-the-art speaker identification system requires a feature extraction unit as a front-end processing block, followed by an efficient modeling scheme. Vocal tract information, such as the formant frequencies and their bandwidths, is supposed to be unique to a human being, and the basic target of the feature extraction block is to characterize this information. The feature extraction process also represents the original speech signal in a compact format, emphasizes the speaker-specific information, and represents the original signal in a robust manner. Most speaker identification systems use Mel Frequency Cepstral Coefficients (MFCC) or Linear Prediction Cepstral Coefficients (LPCC) as the feature extraction block [1]. MFCC is a modification of the conventional Linear Frequency Cepstral Coefficient, keeping in mind the auditory system of the human being [2]. On the other hand, LPCC is based on time-domain processing of the speech signal [3]; the conventional LPCC was later also modified, motivated by the perceptual properties of the human ear [4]. Like the vocal tract, vocal cord information also contains some speaker-specific information [5]. The residual signal, which can be obtained from the Linear Prediction (LP) analysis of the speech signal, contains information related to the source, or vocal cord. Earlier, Auto-associative Neural Networks (AANN), Wavelet Octave Coefficients of Residues (WOCOR), residual phase, etc. were used to extract the information from the residual signal. In this work we introduce higher-order statistical moments to capture the information in the residual signal, and we integrate the vocal cord information with the vocal tract information to boost the performance of the speaker identification system. The log-likelihood scores of the two systems are fused together to get the advantages of their complementarity [6], [7]. The speaker identification results on both databases prove that by combining the two systems the performance can be improved over baseline spectral feature based systems.

This paper is organized as follows. In section II we first review the basics of linear prediction analysis, followed by the proposed feature extraction technique. The speaker identification experiments and results are presented in section III. Finally, the paper is concluded in section IV.

II. FEATURE EXTRACTION FROM RESIDUAL SIGNAL

In this section we first explain the conventional method of deriving the residual signal by LP analysis. The proposed feature extraction process is then described.

A. Linear Prediction Analysis and Residual Signal

In the LP model, the (n−1)-th to (n−p)-th samples of the speech wave (n, p are integers) are used to predict the n-th sample. The predicted value of the n-th speech sample [3] is given by

    ŝ(n) = Σ_{k=1}^{p} a(k) s(n−k)    (1)

where {a(k)}_{k=1}^{p} are the predictor coefficients and s(n) is the n-th speech sample. The value of p is chosen such that it can effectively capture the real and complex poles of the vocal tract in a frequency range equal to half the sampling frequency. The Prediction Coefficients (PC) are determined by
Fig. 1. Example of two speech frames (top), their LP residuals (middle) and corresponding residual moments (bottom).

minimizing the mean square prediction error [1], where the error is defined as

    E(n) = (1/N) Σ_{n=0}^{N−1} (s(n) − ŝ(n))²    (2)

and the summation is taken over all N samples. The set of coefficients {a(k)}_{k=1}^{p} which minimizes the mean-squared prediction error is obtained as the solution of the set of linear equations

    Σ_{k=1}^{p} φ(j,k) a(k) = φ(j,0),   j = 1, 2, 3, …, p    (3)

where

    φ(j,k) = (1/N) Σ_{n=0}^{N−1} s(n−j) s(n−k)    (4)

The PC, {a(k)}_{k=1}^{p}, are derived by solving the recursive equation (3).

Using the {a(k)}_{k=1}^{p} as model parameters, equation (5) represents the fundamental basis of the LP representation: any signal can be defined by a linear predictor and its prediction error,

    s(n) = −Σ_{k=1}^{p} a(k) s(n−k) + e(n)    (5)

The LP transfer function can be defined as

    H(z) = G / (1 + Σ_{k=1}^{p} a(k) z^{−k}) = G / A(z)    (6)

where G is the gain scaling factor for the present input and A(z) is the p-th order inverse filter. The LP coefficients themselves can be used for speaker recognition, as they contain speaker-specific information such as the vocal tract resonance frequencies and their bandwidths.

The prediction error e(n) is called the Residual Signal, and it contains all the complementary information that is not contained in the PC. It is worth mentioning that the residual signal conveys vocal source cues such as the fundamental frequency, pitch period, etc.

B. Statistical Moments of Residual Signal

The residual signal introduced in Section II-A generally has a noise-like behavior and a flat spectral response. Though it contains vocal source information, it is very difficult to characterize it perfectly. In the literature, Wavelet Octave Coefficients of Residues (WOCOR) [7], Auto-associative Neural Networks (AANN) [5], residual phase [6], etc. are used to extract the residual information. It is worth mentioning that higher-order statistics have shown significant results in a number of signal processing applications [8] when the nature of the signal is non-Gaussian. Higher-order statistics have also drawn the attention of researchers for retrieving information from LP residual signals [9]. Recently, higher-order cumulants of the LP residual signal have been investigated [10] for improving the performance of speaker identification systems.

Higher-order statistical moments of a signal parameterize the shape of its distribution [11]. Let the distribution of a random signal x be denoted by P(x); the central moment of order k of x is denoted by

    M_k = ∫_{−∞}^{∞} (x − μ)^k dP    (7)

for k = 1, 2, 3, …, where μ is the mean of x.

On the other hand, the characteristic function of the probability distribution of the random variable is given by

    φ_X(t) = ∫_{−∞}^{∞} e^{jtx} dP = Σ_{k=0}^{∞} ((jt)^k / k!) M_k    (8)

From the above equation it is clear that the moments M_k are the coefficients of the expansion of the characteristic function. Hence, they can be treated as one set of expressive constants of a distribution. Moments can also effectively capture the randomness of the residual signal of autoregressive modeling [12].

In this paper, we use the higher-order statistical moments of the residual signal to parameterize the vocal source information. The feature derived by the proposed technique is termed Higher Order Statistical Moment of Residual (HOSMR). The different blocks of the proposed feature extraction technique are shown in fig. 2.

Fig. 2. Block diagram of Residual Moment Based Feature Extraction Technique (windowed speech frame → LP analysis → inverse filtering → magnitude normalization → higher order moment computation → residual moment feature).

At first the residual signal is normalized to the range [−1, +1]. Then the central moment of order k of a residual signal e(n) is computed as

    m_k = (1/N) Σ_{n=0}^{N−1} (e(n) − μ)^k    (9)

where μ is the mean of the residual signal over a frame. As the range of the residual signal is normalized, the first-order moment (i.e. the mean) becomes zero. The higher-order moments (for k = 2, 3, 4, …, K) are taken as vocal source features, as they represent the shape of the distribution of the random signal. The lower-order moments are a coarse parametrization, whereas the higher orders are a finer representation of the residual signal. In fig. 1, the LP residual signal of a frame is shown along with its higher-order moments. It is clear from the figure that, when the lower-order moments are considered, both the even and odd order values are highly differentiable.

C. Fusion of Vocal Tract and Vocal Cord Information

In this section we propose to integrate the vocal tract and vocal cord parameters for identifying speakers. In spite of the significant performance difference between the two approaches, the ways in which they represent the speech signal are complementary to one another. Hence, it is expected that combining the advantages of both features will improve [13] the overall performance of the speaker identification system. The block diagram of the combined system is shown in fig. 3. Spectral features and residual features are extracted from the training data in two separate streams. Speaker modeling is then performed for the respective features independently, and the model parameters are stored in the model database. At the time of testing, the same process is adopted for feature extraction. The log-likelihoods of the two different features are computed w.r.t. their corresponding models. Finally, the output scores are weighted and combined.

We use score-level linear fusion. To get the advantages of both systems and their complementarity, it can be formulated as follows:

    LLR_combined = η · LLR_spectral + (1 − η) · LLR_residual    (10)

where LLR_spectral and LLR_residual are the log-likelihood ratios calculated from the spectral and residual based systems, respectively. The fusion weight is decided by the parameter η.

III. SPEAKER IDENTIFICATION EXPERIMENT

A. Experimental Setup

1) Pre-processing stage: In this work, the pre-processing stage is kept the same across the different feature extraction methods. It is performed using the following steps:
∙ Silence removal and end-point detection are done using an energy threshold criterion.
∙ The speech signal is then pre-emphasized with a 0.97 pre-emphasis factor.
∙ The pre-emphasized speech signal is segmented into frames of 20 ms each with 50% overlap, i.e. the total number of samples in each frame is N = 160 (sampling frequency F_s = 8 kHz).
∙ In the last step of pre-processing, each frame is windowed using the Hamming window given by

    w(n) = 0.54 + 0.46 cos(2πn / (N − 1))    (11)

where N is the length of the window.

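The pre-processing steps above can be sketched as follows; this is a minimal NumPy illustration under the stated parameters, with the energy-based silence-removal step omitted, and all names are our own.

```python
import numpy as np

# Sketch of the pre-processing chain: pre-emphasis, then 20 ms frames
# with 50% overlap (N = 160 samples at Fs = 8 kHz), then a Hamming
# window applied to every frame (cf. eq. (11)).

FS = 8000
N = 160            # 20 ms frame
HOP = N // 2       # 50% overlap

def preprocess(signal, alpha=0.97):
    # Pre-emphasis: s'(n) = s(n) - alpha * s(n-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Segment into overlapping frames.
    n_frames = 1 + (len(emphasized) - N) // HOP
    frames = np.stack([emphasized[i * HOP: i * HOP + N]
                       for i in range(n_frames)])
    # Window each frame.
    return frames * np.hamming(N)

frames = preprocess(np.random.randn(FS))   # one second of toy "speech"
print(frames.shape)                        # (99, 160)
```

With a one-second input at 8 kHz, the 160-sample frames with an 80-sample hop yield 99 frames, each ready for LP analysis or filterbank processing.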
Fig. 3. Block diagram of Fusion Technique: score-level fusion of vocal tract (short-term spectral based feature) and vocal cord (residual) information.

2) Classification & Identification stage: The Gaussian Mixture Modeling (GMM) technique is used to obtain a probabilistic model for the feature vectors of a speaker. The idea of GMM is to use a weighted summation of multivariate Gaussian functions to represent the probability density of the feature vectors, given by

    p(x) = Σ_{i=1}^{M} p_i b_i(x)    (12)

where x is a d-dimensional feature vector, b_i(x), i = 1, …, M, are the component densities and p_i, i = 1, …, M, are the mixture weights, or priors of the individual Gaussians. Each component density is given by

    b_i(x) = (1 / ((2π)^{d/2} |Σ_i|^{1/2})) exp{ −(1/2) (x − μ_i)ᵗ Σ_i⁻¹ (x − μ_i) }    (13)

with mean vector μ_i and covariance matrix Σ_i. The mixture weights must satisfy the constraints Σ_{i=1}^{M} p_i = 1 and p_i ≥ 0. The Gaussian Mixture Model is parameterized by the means, covariances and mixture weights of all component densities and is denoted by

    λ = {p_i, μ_i, Σ_i}_{i=1}^{M}    (14)

In SI, each speaker is represented by a GMM and is referred to by his/her model λ. The parameters of λ are optimized using the Expectation Maximization (EM) algorithm [14]. In these experiments, the GMMs are trained with 10 iterations, with the clusters initialized by the vector quantization [15] algorithm.

In the identification stage, the log-likelihood score of the feature vectors of the utterance under test is calculated by

    log p(X|λ) = Σ_{t=1}^{T} log p(x_t|λ)    (15)

where X = {x_1, x_2, …, x_T} is the feature vector sequence of the test utterance.

In the closed-set SI task, an unknown utterance is identified as an utterance of the particular speaker whose model gives the maximum log-likelihood. This can be written as

    Ŝ = arg max_{1≤k≤S} Σ_{t=1}^{T} log p(x_t|λ_k)    (16)

where Ŝ is the identified speaker from the speakers' model set Λ = {λ_1, λ_2, …, λ_S} and S is the total number of speakers.

3) Databases for experiments:
YOHO Database: The YOHO voice verification corpus [1], [16] was collected while testing ITT's prototype speaker
verification system in an office environment. Most subjects were from the New York City area, although there were many exceptions, including some non-native English speakers. A high-quality telephone handset (Shure XTH-383) was used to collect the speech; however, the speech was not passed through a telephone channel. There are 138 speakers (106 males and 32 females); for each speaker, there are 4 enrollment sessions of 24 utterances each and 10 test sessions of 4 utterances each. In this work, a closed-set text-independent speaker identification problem is attempted where we consider all 138 speakers as client speakers. For each speaker, all 96 (4 sessions × 24 utterances) utterances are used for developing the speaker model, while for testing 40 (10 sessions × 4 utterances) utterances are put under test. Therefore, for 138 speakers we put 138 × 40 = 5520 utterances under test and evaluated the identification accuracies.

POLYCOST Database: The POLYCOST database [17] was recorded as a common initiative within the COST 250 action during January–March 1996. It contains around 10 sessions recorded by 134 subjects from 14 countries. Each session consists of 14 items, two of which (MOT01 & MOT02 files) contain speech in the subject's mother tongue. The database was collected through the European telephone network. The recording was performed with ISDN cards on two XTL SUN platforms with an 8 kHz sampling rate. In this work, a closed-set text-independent speaker identification problem is addressed where only the mother tongue (MOT) files are used. The specified guideline [17] for conducting closed-set speaker identification experiments is adhered to, i.e. 'MOT02' files from the first four sessions are used to build a speaker model, while 'MOT01' files from session five onwards are taken for testing. As with the YOHO database, all speakers (131 after deletion of three speakers) in the database were registered as clients.

4) Score Calculation: In the closed-set speaker identification problem, the identification accuracy as defined in [18], given by equation (17), is followed:

    Percentage of identification accuracy (PIA) = (No. of utterances correctly identified / Total no. of utterances under test) × 100    (17)

B. Speaker Identification Experiments and Results

The performance of the speaker identification system based on the proposed HOSMR feature is evaluated on both databases. The order of LP is kept at 17, and 6 residual moments are taken to characterize the residual information. We have conducted experiments with a GMM-based classifier for different model orders. The identification results are shown in Table I. The identification performance is quite low, because the vocal cord parameters are not the only cues for identifying speakers; however, they make an inherent contribution to recognition and contain information that is not present in the spectral features, so the combined performance of the two systems is to be observed. We have conducted SI experiments using two major kinds of baseline features: some based on LP analysis (LPCC and PLPCC) and the others (LFCC and MFCC) based on filterbank analysis. The feature dimension is set at 19 for all kinds of features for better comparison. In the LP based systems, 19 filters are used for the all-pole modeling of the speech signals. On the other hand, 20 filters are used for the filterbank based systems, and 19 coefficients are taken for extracting the Linear Frequency Cepstral Coefficients (LFCC) and MFCC after discarding the first coefficient, which represents the dc component. Detailed descriptions are available in [19], [20]. The derivation of the LP based features can be found in [1], [4], [21].

The performance of the baseline SI systems and the fused systems for different features and different model orders is shown in Table II and Table III for the POLYCOST and YOHO databases, respectively. In this experiment, we take equal evidence from the two systems and set the value of η to 0.5. The results for the conventional spectral features follow the results shown in [22]. The POLYCOST database consists of speech signals collected over a telephone channel; the improvement for this database is significant compared to YOHO, which is microphonic. The experimental results show a significant performance improvement for the fused SI system compared to the spectral-only systems for various model orders.

TABLE I
SPEAKER IDENTIFICATION RESULTS ON THE POLYCOST AND YOHO DATABASES USING THE HOSMR FEATURE FOR DIFFERENT MODEL ORDERS OF GMM (HOSMR CONFIGURATION: LP ORDER = 17, NUMBER OF HIGHER ORDER MOMENTS = 6).

Database    Model Order    Identification Accuracy
POLYCOST    2              19.4960
            4              21.6180
            8              19.0981
            16             22.4138
YOHO        2              16.8841
            4              18.2246
            8              15.1268
            16             18.2246
            32             21.2138
            64             21.9565

TABLE II
SPEAKER IDENTIFICATION RESULTS ON THE POLYCOST DATABASE SHOWING THE PERFORMANCE OF THE BASELINE (SINGLE STREAM) SYSTEM AND THE FUSED SYSTEM (HOSMR CONFIGURATION: LP ORDER = 17, NUMBER OF HIGHER ORDER MOMENTS = 6, FUSION WEIGHT (η) = 0.5).

Feature    Model Order    Baseline System    Fused System
LPCC       2              63.5279            71.4854
           4              74.5358            78.9125
           8              80.3714            81.6976
           16             79.8408            82.8912
PLPCC      2              62.9973            65.7825
           4              72.2812            75.5968
           8              75.0663            77.3210
           16             78.3820            80.5040
LFCC       2              62.7321            71.6180
           4              74.9337            78.1167
           8              79.0451            81.2997
           16             80.7692            83.4218
MFCC       2              63.9257            69.7613
           4              72.9443            76.1273
           8              77.8515            79.4430
           16             77.8515            79.5756

TABLE III
SPEAKER IDENTIFICATION RESULTS ON THE YOHO DATABASE SHOWING THE PERFORMANCE OF THE BASELINE (SINGLE STREAM) SYSTEM AND THE FUSED SYSTEM (HOSMR CONFIGURATION: LP ORDER = 17, NUMBER OF HIGHER ORDER MOMENTS = 6, FUSION WEIGHT (η) = 0.5).

Feature    Model Order    Baseline System    Fused System
LPCC       2              80.9420            84.7101
           4              88.9855            91.0870
           8              93.8949            94.7826
           16             95.6884            96.2862
           32             96.5399            97.1014
           64             96.7391            97.2826
PLPCC      2              66.5761            72.5543
           4              76.9203            81.0507
           8              85.3080            87.7717
           16             90.6341            91.9022
           32             93.5326            94.3116
           64             94.6920            95.3986
LFCC       2              83.0072            85.8152
           4              90.3623            91.7935
           8              94.6196            95.4891
           16             96.2681            96.6848
           32             97.1014            97.3551
           64             97.2464            97.6268
MFCC       2              74.3116            78.6051
           4              84.8551            86.9384
           8              90.6703            92.0290
           16             94.1667            94.6920
           32             95.6522            95.9964
           64             96.7935            97.1014

IV. CONCLUSION

The objective of this paper is to propose a new technique to improve the performance of conventional speaker identification systems, which are based on spectral features representing only vocal tract information. Higher-order statistical moments of the residual signal are derived and treated as parameters carrying vocal cord information, and the log-likelihoods of the two systems are fused together. The experimental results on two popular speech corpora prove that a significant improvement can be obtained with the combined SI system.

REFERENCES

[1] J. P. Campbell, "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437–1462, Sep. 1997.
[2] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug. 1980.
[3] B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," The Journal of the Acoustical Society of America, vol. 55, no. 6, pp. 1304–1312, 1974.
[4] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[5] S. M. Prasanna, C. S. Gupta, and B. Yegnanarayana, "Extraction of speaker-specific excitation information from linear prediction residual of speech," Speech Communication, vol. 48, no. 10, pp. 1243–1261, 2006.
[6] K. Murty and B. Yegnanarayana, "Combining evidence from residual phase and MFCC features for speaker recognition," IEEE Signal Processing Letters, vol. 13, no. 1, pp. 52–55, Jan. 2006.
[7] N. Zheng, T. Lee, and P. C. Ching, "Integration of complementary acoustic features for speaker recognition," IEEE Signal Processing Letters, vol. 14, no. 3, pp. 181–184, Mar. 2007.
[8] A. Nandi, "Higher order statistics for digital signal processing," in Mathematical Aspects of Digital Signal Processing, IEE Colloquium on, pp. 6/1–6/4, Feb. 1994.
[9] E. Nemer, R. Goubran, and S. Mahmoud, "Robust voice activity detection using higher-order statistics in the LPC residual domain," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 217–231, Mar. 2001.
[10] M. Chetouani, M. Faundez-Zanuy, B. Gas, and J. Zarader, "Investigation on LP-residual representations for speaker identification," Pattern Recognition, vol. 42, no. 3, pp. 487–494, 2009.
[11] C.-H. Lo and H.-S. Don, "3-D moment forms: their construction and application to object identification and positioning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 10, pp. 1053–1064, Oct. 1989.
[12] S. G. Mattson and S. M. Pandit, "Statistical moments of autoregressive model residuals for damage localisation," Mechanical Systems and Signal Processing, vol. 20, no. 3, pp. 627–645, 2006.
[13] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, Mar. 1998.
[14] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, pp. 1–38, 1977.
[15] Y. Linde, A. Buzo, and R. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. COM-28, no. 1, pp. 84–95, 1980.
[16] A. Higgins, J. Porter, and L. Bahler, "YOHO speaker authentication final report," ITT Defense Communications Division, Tech. Rep., 1989.
[17] H. Melin and J. Lindberg, "Guidelines for experiments on the POLYCOST database," in Proceedings of a COST 250 Workshop on Application of Speaker Recognition Techniques in Telephony, 1996, pp. 59–69.
[18] D. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, Jan. 1995.
[19] S. Chakroborty, A. Roy, S. Majumdar, and G. Saha, "Capturing complementary information via reversed filter bank and parallel implementation with MFCC for improved text-independent speaker identification," in Computing: Theory and Applications, 2007. ICCTA '07. International Conference on, Mar. 2007, pp. 463–467.
[20] S. Chakroborty, "Some studies on acoustic feature extraction, feature selection and multi-level fusion strategies for robust text-independent speaker identification," Ph.D. dissertation, Indian Institute of Technology, 2008.
[21] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. First Indian Reprint: Pearson Education, 2003.
[22] D. Reynolds, "Experimental evaluation of features for robust speaker identification," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 639–643, Oct. 1994.

187
Right Brain Testing
Applying Gestalt psychology in Software Testing
Narayanan Palani
Student in Computer Applications, BITS-Pilani, India
[Tel: +91-9900465054  E-mail: narayananp24@gmail.com]

Abstract— Applying Gestalt psychology in software testing can open innovative paths and new techniques in web testing methodologies. It can address critical testing processes such as test-blindness testing, condition testing in complex conditions, and web security testing.

I. INTRODUCTION

Software testing is an art, and complete testing ensures that the application meets the customer's expectations. A tester can use various testing methods to address the key challenges in application testing. Applying psychology in software testing can resolve several flaws in testing, such as 'test blindness', 'application security constraints', and 'desired functionality and requirement mismatch'.

II. SCOPE:

Application of psychology in software testing can be achieved and well utilized by:
• Providing psychological lab training to the SQA team.
• Holding sessions with a research psychologist to develop frameworks that address critical problems in software testing.
• Having SQA teams assist research scholars and psychoanalysts in 'Collaborative Testing Assignments', in which innovative testing methods can be explored.

III. RIGHT BRAIN TESTING:

This approach analyses the application using diagrammatic, graphical and analytical representations (using the right side of the brain); hence it is apt to call it 'right brain testing', aimed at finding the right behavior of the AUT (application under test).

IV. APPLICATION OF GESTALTISM IN SOFTWARE TESTING:

Gestalt psychology is a theory of mind and brain positing that the operational principle of the brain is holistic. The Gestalt effect refers to the form-forming capability of our senses, particularly with respect to the visual recognition of figures/objects and whole forms instead of just a collection of simple lines and curves.

It can be applied to software testing to address the challenges of critical testing aspects in content testing, navigation testing, usability testing and user interface testing.

V. APPLICATION OF LAWS OF GESTALTISM IN SOFTWARE TESTING:

A. Law of Closure:
The mind may experience elements it does not perceive through sensation, in order to complete a regular figure (that is, to increase regularity).
Application: Figures, fonts and objects of the AUT may not be fully perceivable through the presentation; the mind completes them to form regular information (that is, to increase regularity).

B. Law of Similarity:
The mind groups similar elements into collective entities or totalities. This similarity might depend on relationships of form, color, size, or brightness.
Application: Collective entities of the AUT depend upon a broad range of user information, so it is necessary to present collective entities in similar places. This can be analyzed as a usability issue to improve the customer experience.

C. Law of Proximity:
Spatial or temporal proximity of elements may induce the mind to perceive a collective or totality.
Application: Spatial or temporal proximity of elements in the AUT may induce the mind to perceive a collective or totality. This should not create any negative impact, and it can be tested for user experience. It can be achieved by training a Subject Matter Expert (SME) to test the AUT using Gestalt psychology.

D. Law of Symmetry (figure-ground relationships):
Symmetrical images are perceived collectively, even in spite of distance.
Application: Symmetrical images/data/items in the GUI are perceived collectively, even in spite of distance. The tester should therefore make sure that the information makes the right impact and gives a user-friendly experience in navigation (in navigation testing).

E. Law of Continuity:
The mind continues visual, auditory, and kinetic patterns.
Application: The mind continues visual, auditory, and kinetic patterns across similar pages/menus in the AUT, so the user expects similar objects in consecutive steps, functionalities and processes. Similar objects/information can be grouped together in the GUI to address this (in user interface testing).

F. Law of Common Fate:
Elements with the same moving direction are perceived as a collective or unit.
Application: Elements with the same objective and scope should be collected together in the AUT. This makes for a good customer experience over the navigation period and gives a clear flow for testing in usability testing.

VI. APPLICATION OF GESTALT PROPERTIES IN SOFTWARE TESTING:

A. Application of Emergence:
An SME can identify 'strong emergence' and 'weak emergence' of components/functions in web pages by applying emergence to testing. A tester can validate linkages and parameter-passing mechanisms by applying 'emergence' to track the various components/scripts that communicate together at various stages of user interaction. In user interface testing, it addresses critical aspects such as application-user interaction problems and 'strong emergence' of application components with interrupts. In integration testing, a tester can easily track the problems that arise when units are combined strongly (using strong emergence) or weakly (using weak emergence). It delivers clear differences between strong and weak emergence of components.

Figure 1. Strong Emergence: Unit1-3; Weak Emergence: Unit1-2, Unit2-3, in integration testing.

Figure 2. Weak Emergence in pictures 1 and 2; Strong Emergence in picture 3.

B. Application of Reification:

Figure 3. Example picture representation of Reification: a complete three-dimensional shape is visible in the picture, where in actuality no such thing is drawn.

Reification is the constructive or generative aspect of perception of an object/process in the AUT. Misunderstanding of test requirements or wrong assumptions about functionalities can introduce new bugs into the application. A strong understanding of reification can open innovative ways to track hidden defects in the AUT. The tester can create a table to compare reification points with the understanding of test functionalities.
C. Application of Multistability:

Figure 4. Edgar Rubin's Rubin Figure: an example of Multistability.

Classification of paths to test in the Rubin Figure: 1. Left face only; 2. Right face only; 3. Left face and right figure; 4. Right face and left face; 5. White color of figure; 6. Black color of figure; 7. Size and color ratio.
Combinations for testing in the Rubin Figure: 1-5, 2-5, 3-5, 4-5, 1-6, 2-6, 3-6, 4-6, 5-6.

Instability between two or more alternative interpretations is known as 'Multistability'. When a test requirement is documented, it can be understood in two or more alternative interpretations, and content testing done according to only one understanding leads to incorrect testing. By practising the 'Multistability' methods of Gestaltism, a tester can clearly derive the combinations and types of test requirements. It makes the tester observe the variety of viewpoints, from the requirement specification to the developers if needed.

D. Application of Invariance:

Figure 5. An ornamental pattern in which dozens of features are processed simultaneously to represent 'Gestalt Invariance'.

Object recognition independent of rotation, translation and scale, and variations such as elastic deformations, different lighting and different component features, are monitored using 'Invariance Testing' in the AUT.

VII. APPLICATION OF 'PRODUCTIVE THINKING' IN USABILITY TESTING:

Formula:
PTR = (Combinations × Test Techniques) / Time
PTR (Productive Thinking Result) analyses the test factors. It can be customized based on evaluations to identify the weightage of the productive thinking process.

VIII. REPRODUCTIVE THINKING AND ITS REPRESENTATION IN TESTING:

Solving a problem from previous experience through reproductive thinking can be applied to all kinds of software testing methods. But the tester must document the observations and innovative new approaches for future testing reference.

IX. APPLICATION OF 'RULE OF FIGURE' IN GRAPHICAL USER INTERFACE (GUI) TESTING:

For example, look at the bottom of a page where the letters got cut off but you still knew what it was telling you. It can also be the concept of looking at a picture that holds two optical illusions. This is known as the 'Rule of Figure'. It can be the right option for testing the GUI of the AUT.

X. TECHNIQUES THAT CAN BE DERIVED FROM THE 'RULE OF FIGURE':

• View the AUT partially, imagine the remaining GUI, and take notes on those expectations to compare later.
• Analyse the size and formats of various diagrams, reports and graphical representations in the AUT.

XI. VIEWS THAT GO AGAINST GESTALT PSYCHOLOGY:

• The Three-Process View
• The Neo-Gestalt View

It is necessary to consider views that oppose Gestaltism when they are useful for testing. The Three-Process View and the Neo-Gestalt View criticise Gestaltism, but the properties of these views are useful for exploring innovative software testing approaches.
XII. APPLICATION OF THE NEO-GESTALTS VIEW:

A. Motion-like properties:
These can be applied to find the rate of change of a particular functionality/process and its bugs.

B. Rhythmic properties:
The relative timing of a particular process across various test versions and builds can be explored by applying 'rhythmic properties' in regular testing perspectives.
Example: 'Cash Transaction' in e-Banking applications.

XIII. APPLICATION OF THE THREE-PROCESS VIEW:

A. Selective-Encoding Insight (SEI):
SEI involves distinguishing what is important in a problem from what is irrelevant. It helps the tester identify customer specifications clearly from the various communications with customers.

B. Selective-Comparison Insight (SCPI):
SCPI identifies information by finding a connection between acquired knowledge and experience. It reveals hidden flows and flaws in the regular 'data flows' of 'path testing'.

C. Selective-Combination Insight (SCBI):
SCBI identifies a problem by understanding the different components and putting everything together. It serves integration test engineers and various unit-level integration testing.

XIV. LIMITATIONS OF GESTALT PSYCHOLOGY:

• Descriptive representation
• Diagrammatic analysis of problems

A. Avoid the descriptive approach and derive straightforward methods:
Gestalt psychology is descriptive rather than exploratory. It is necessary to explore a standard way of testing representations, and testers must understand methods and techniques through exploratory definitions of psychological applications in software testing to find bugs productively.

B. From diagrammatic representation to systematic definitions and formulas:
Gestaltism deals with diagrams, pictorial representations and various observations. It is essential to apply these techniques in a specific testing process to find 'time-specific objectives' to test. This delivers a good understanding of the AUT for various testing activities.

XV. CONCLUSION:

Complete testing of an AUT is a big challenge to SQA, and innovative new techniques can address this problem and meet the expectations of customers. Application of Gestalt psychology in software testing is a new path in which software testers benefit from innovative techniques to find abnormal defects in the AUT. It is a challenging research area, and it can address the key challenges of software testing with innovative techniques and methods.

XVI. ACKNOWLEDGMENT

I am heartily thankful to my faculty, whose encouragement, guidance and support from the initial to the final level enabled me to develop an understanding of the subject.

Lastly, I offer my regards and blessings to all of those who supported me in any respect during the completion of this research initiative.

XVII. REFERENCES:

[1] C. Kaner, J. Bach and B. Pettichord, "Lessons Learned in Software Testing".
[2] B. Beizer, "Software Testing Techniques".
[3] C. P. Schultz, R. Bryant and T. Langdell, "Game Testing All in One".
[4] http://en.wikipedia.org/wiki/Gestalt_psychology
[5] http://www.optimum-web.co.uk/whatcontent.htm
[6] http://www.contentmanager.net/magazine/article_244_testing_in_content_management_projects.html
[7] http://machineslikeus.com/the-constructive-aspect-of-visual-perception
[8] http://en.wikipedia.org/wiki/Rubin_vase
[9] http://drezdel123.wordpress.com/2009/03/
[10] http://www.allgraphicdesign.com/graphicsblog/2008/03/04/the-rules-of-the-gestalt-theory-and-how-to-apply-it-to-your-graphic-design-layouts/
ADCOM 2009
MOBILE AD-HOC NETWORKS

Session Papers:

1. Vijayashree Budyal, Sunilkumar Manvi and Sangamesh Hiremath, "Intelligent Agent based QoS Enabled Node Disjoint Multipath Routing"

2. Adil Erzin, Soumyendu Raha and V.N. Muralidhara, "Close to Regular Covering by Mobile Sensors with Adjustable Ranges"

3. Dipankaj Medhi, "Virtual Backbone Based Reliable Multicasting for MANET"
Intelligent Agent based QoS Enabled Disjoint
Multipath Routing in MANETs
*Vijayashree Budyal, #S. S. Manvi, **S. G. Hiremath, *Kala K. M.

*ECE Department, Basaveshwar Engg. College, Bagalkot, India
#ECE Department, Reva Institute of Tech. and Mgmt., Bangalore, India
**ECE Department, G M Institute of Technology, Davangere, India

Abstract: A mobile ad hoc network (MANET) is an infrastructureless, multihop, wireless, and frequently changing network. To support multimedia applications such as video and voice, MANETs require an efficient routing protocol and a Quality of Service (QoS) mechanism. QoS support in MANETs is an important issue, as best-effort routing is not efficient for supporting multimedia applications. Whenever there is a link break on the route, best-effort protocols need to initiate a new route discovery process, which results in a high routing load. On-demand Node-Disjoint Multipath Routing (NDMR) alleviates these problems and reduces routing overhead. This paper proposes an intelligent agent based QoS enabled NDMR, built on NDMR, for supporting multimedia applications; it considers bandwidth and delay as QoS metrics for optimal path computation in Mobile Ad hoc Networks (MANETs). A mobile agent is employed to find QoS paths and to select an optimal path among them. The performance of the scheme is evaluated for packet delivery ratio, end-to-end delay, QoS acceptance ratio and route discovery time for various network scenarios.

1. Introduction

Mobile ad hoc networks are infrastructureless networks that can be rapidly deployed. They are characterized by multihop wireless connectivity and frequently changing network topology [1]. The design of efficient and reliable routing protocols in such a network is a challenging issue. Ad Hoc On-demand Distance Vector (AODV) and Dynamic Source Routing (DSR) are the two most widely studied on-demand ad hoc routing protocols. The limitation of both is that they build and rely on a single path for each data transmission. Whenever there is a link break on the route, such protocols need to initiate a new route discovery process, resulting in high routing overheads [2].

Multipath routing is one of the solutions that aim to establish multiple paths between source and destination, and many benefits have been explored for multipath routing [4]. On-demand Node-Disjoint Multipath Routing (NDMR) has two novel aspects compared to the other on-demand multipath protocols: it reduces routing overhead dramatically, and it achieves multiple node-disjoint routing paths (no node except the source node and destination node is common to the multipaths) [5].

An important issue in multimedia communications is the routing of application data based on QoS requirements. QoS routing is a method of finding QoS routes between a source and destination. If a proper QoS route is identified, the applications will meet the guaranteed services [6]. A biologically inspired QoS routing algorithm based on a swarm intelligence inspired routing technique is described in [7].

Agent technology is emerging as a new paradigm in the areas of artificial intelligence and computing. Agents are said to become the next-generation components in software development because of their inherent structure and behaviour, which can be used to facilitate Internet services.

Agents are autonomous programs situated within a programming environment. Agents achieve their goals by collecting the relevant information from the host without affecting local processing. They have certain special properties, mandatory and orthogonal (supplementary), which make them different from standard programs. The mandatory properties are autonomy, reactivity, proactivity and temporal continuity; the orthogonal properties are communication, mobility, learning and believability. An agent must possess the mandatory properties, which are compulsory. The orthogonal properties enhance the capabilities of agents and provide a strong notion of agency; an agent may or may not possess them.
Agents can be classified as local/user-interface agents, networked agents, distributed artificial intelligence (AI) agents and mobile agents. Networked agents and user-interface agents are single-agent systems, whereas the other two types are multi-agent systems [9].

From the literature, we have noticed that dynamic QoS path computation, based on rapidly changing network conditions and capable of providing adaptability, flexibility, software reuse and customizability, has not been addressed. This paper proposes an intelligent agent based QoS enabled NDMR, built on NDMR, for supporting multimedia applications; it considers bandwidth and delay as QoS metrics for optimal path computation. A mobile agent is employed to find QoS paths and to select an optimal path among them.

II. QoS Enabled NDMR

This paper proposes an intelligent agent based QoS enabled NDMR, built on NDMR, for supporting multimedia applications: it identifies a set of multiple paths that meet the QoS requirements of a particular application and selects the path that leads to the highest overall resource efficiency. The scheme uses a mobile agent to perform this operation. Every node in the network comprises an agent platform to support mobile agents. In this section, we describe NDMR in brief and explain the functioning of QoS enabled NDMR.

2.1 Node-disjoint multipath routing protocol

The Node-disjoint multipath routing protocol (NDMR) is a protocol developed by Xuefei Li [5] that modifies and extends AODV to enable the path accumulation feature of DSR in route request packets. It can efficiently discover multiple paths between source and destination nodes with low broadcast redundancy and minimal routing latency.

In the route discovery process, the source creates a route request packet (RREQ) containing the message type, source address, current sequence number of the source, destination address, broadcast ID and route path. The source node then broadcasts the packet to its neighbouring nodes. The broadcast ID is incremented every time the source node initiates a RREQ, forming, together with the source node address, a unique identifier for the RREQ. Finding node-disjoint multiple paths with low overhead is not straightforward when the network topology changes dynamically. NDMR routing computation has three key features that help it achieve low broadcast redundancy and avoid introducing a broadcast flood in MANETs: path accumulation, decreasing multipath broadcast routing packets (using shortest routing hops), and selecting node-disjoint paths.

In NDMR, AODV is modified to include path accumulation in RREQ packets. When the packets are broadcast in the network, each intermediate node appends its own address to the RREQ packet. When a RREQ packet finally arrives at its destination, the destination is responsible for judging whether or not the route path is a node-disjoint path. If it is a node-disjoint path, the destination creates a route reply packet (RREP), which contains the node list of the whole route path, and unicasts it back along the reverse route path to the source that generated the RREQ packet.

When an intermediate node receives a RREP packet, it updates its routing table and reverse routing table using the node list of the whole route path contained in the RREP packet. When a duplicate RREQ is received, the possibility of finding node-disjoint multiple paths is zero if it is simply dropped, for it may have come along another path. But if all duplicate RREQ packets are broadcast, this generates a broadcast storm and dramatically decreases performance. To avoid this problem, a novel approach is introduced in NDMR: recording the shortest routing hops to keep paths loop-free and decrease routing broadcast overhead. When a node receives a RREQ packet for the first time, it checks the node list of the route path, calculates the number of hops from the source node to itself, and records this number as the shortest number of hops in its reverse routing table. If the node receives a duplicate RREQ packet again, it computes the number of hops and compares it with the shortest number of hops in its reverse routing table. If the number of hops is larger than the shortest number of hops in the reverse routing table, the RREQ packet is dropped. Only when it is less than or equal to the shortest number of hops does the node append its own address to the node list of the route path in the RREQ packet and broadcast it to neighbouring nodes again.

The destination node is responsible for selecting and recording multiple node-disjoint paths. When receiving the first RREQ packet, the destination records the list of node IDs of the entire route path in its reverse route table and sends a RREP packet along the reverse route path. When the destination receives a duplicate RREQ, it compares the node IDs of the entire route path in the RREQ to all of the existing node-disjoint paths in its reverse routing table. If there is no common node (except the source and destination nodes) between the node IDs from the RREQ and the node IDs of any node-disjoint path in the destination's reverse table, the route path in the current RREQ is a node-disjoint path and is recorded in the reverse routing table of the destination. Otherwise, the current RREQ is discarded.
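The two decision rules above, the intermediate node's shortest-hop filter for duplicate RREQs and the destination's node-disjointness test, can be sketched in C++ (the paper's simulations use C/C++). This is a minimal illustration, not the authors' implementation: all type and function names (NodeId, shouldRebroadcast, isNodeDisjoint) are invented for the sketch, and packet handling is reduced to the two checks.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Illustrative types; the names here are invented for this sketch.
using NodeId = int;
using Path = std::vector<NodeId>;  // node list accumulated in a RREQ

// Reverse-routing-table fragment: shortest hop count seen per (source, broadcast ID),
// which together uniquely identify a RREQ.
std::map<std::pair<NodeId, int>, int> shortestHops;

// Intermediate-node rule: rebroadcast a duplicate RREQ only when its hop count
// does not exceed the shortest hop count recorded for that RREQ.
bool shouldRebroadcast(NodeId source, int broadcastId, const Path& accumulated) {
    int hops = static_cast<int>(accumulated.size());  // hops from source to this node
    auto key = std::make_pair(source, broadcastId);
    auto it = shortestHops.find(key);
    if (it == shortestHops.end()) {       // first copy of this RREQ: record and forward
        shortestHops[key] = hops;
        return true;
    }
    if (hops > it->second) return false;  // longer than the best known path: drop
    it->second = hops;                    // shorter or equal: forward again
    return true;
}

// Destination rule: accept a candidate path (source..destination, size >= 2) only if
// it shares no intermediate node with any already-recorded node-disjoint path.
bool isNodeDisjoint(const Path& candidate, const std::vector<Path>& recorded) {
    std::set<NodeId> inner(candidate.begin() + 1, candidate.end() - 1);
    for (const Path& p : recorded)
        for (std::size_t i = 1; i + 1 < p.size(); ++i)
            if (inner.count(p[i])) return false;  // common intermediate node found
    return true;
}
```

For instance, once a hop count of 3 is recorded for a given RREQ, a later copy that has traversed 4 hops is dropped, while a destination that already holds 1-2-3-9 and 1-4-5-9 can still accept 1-7-9, since only the endpoints are shared.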
2.2 Functioning of QoS enabled NDMR

We describe the QoS routing metrics and the agency maintained at each node in the proposed work.

2.2.1 QoS routing metrics

Consider a network described as an undirected graph G(V, E), where V is a set of nodes and E is a set of edges. A path P from source s to destination d, P(s, d), is a sequence of edges belonging to E. The proposed scheme uses the residual bandwidth (bw_e) and delay (d_e) metrics of a link for QoS routing of an application. The QoS of an application is specified as Q = {bw_min, D}, where bw_min is the minimum bandwidth required for the application and D is the bounded end-to-end delay for delivery of information. The application can be viewed by a user with acceptable QoS by guaranteeing the bw_min and D metrics. The concave and additive properties are defined below for a path P = {l1, l2, ..., ln}, where ln is the nth link and m(P) is the metric value on the path P.

• Additive: a metric m is said to be additive for a given path P if m(P) = m(l1) + m(l2) + ... + m(ln).
• Concave: a metric m is said to be concave for a given path P if m(P) = min{m(l1), m(l2), ..., m(ln)}.

P(s, d) should satisfy the following bandwidth and delay criteria (1) and (2) for an application to begin and progress:

bw_P(s,d) = min_{(i,j) ∈ P(s,d)} bw_e(i,j) ≥ bw_min        (1)

delay_P(s,d) = Σ_{(i,j) ∈ P(s,d)} d_e(i,j) ≤ D             (2)

The notations bw_P(s,d) and delay_P(s,d) are the bandwidth and delay values associated with the optimal path P(s,d), respectively, whereas bw_e(i,j) and d_e(i,j) denote the residual bandwidth and delay on a link connecting nodes i and j on the path P(s,d).

2.2.2 Agency at each node

We assume that every node in the network maintains an agency, as shown in Fig. 1. An agency consists of a BlackBoard, a Communication Manager Agent, a Delay and Bandwidth Estimator agent (D&B), and a QoS negotiator/re-negotiator agent.

• BlackBoard: a shared knowledge-base structure, which is read and updated by agents as and when required. It consists of information such as the residual bandwidths and delays of the links connected to a node, as shown in Fig. 2.

• Communication Manager Agent (CMA): a static agent running at each node to serve applications requesting on-demand QoS routes and to support route-finding operations when the node is acting as an intermediate node. This agent is responsible for creating the other agents and for updating the data in the BlackBoard. All operations, such as communication, updating the BlackBoard, reading the BlackBoard and so on, take place with the permission of the communication manager agent.

Fig 1. Agency at each node.

Node   Residual bandwidth (Mbps)   Delays (secs)
4      3                           4
:      :                           :

Fig 2. Entry of the BlackBoard at node 4.
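The concave and additive metrics of Section 2.2.1 and the feasibility test of criteria (1) and (2) can be sketched in C++ as follows. This is a hedged illustration, not code from the paper; the LinkMetric struct and the function names are invented here.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Per-link metrics along a path P(s,d); illustrative names, not from the paper.
struct LinkMetric {
    double residualBandwidthMbps;  // bw_e(i,j)
    double delaySecs;              // d_e(i,j)
};

// Concave metric, criterion (1): path bandwidth is the minimum residual
// bandwidth over the links of the path.
double pathBandwidth(const std::vector<LinkMetric>& path) {
    double bw = std::numeric_limits<double>::infinity();
    for (const auto& link : path) bw = std::min(bw, link.residualBandwidthMbps);
    return bw;
}

// Additive metric, criterion (2): path delay is the sum of the link delays.
double pathDelay(const std::vector<LinkMetric>& path) {
    double d = 0.0;
    for (const auto& link : path) d += link.delaySecs;
    return d;
}

// A path satisfies Q = {bw_min, D} when bw(P) >= bw_min and delay(P) <= D.
bool satisfiesQoS(const std::vector<LinkMetric>& path, double bwMin, double D) {
    return pathBandwidth(path) >= bwMin && pathDelay(path) <= D;
}
```

For example, a three-link path with residual bandwidths 4, 3 and 5 Mbps and link delays 1, 2 and 1 secs has bw(P) = 3 Mbps and delay(P) = 4 secs, so it satisfies Q = {3 Mbps, 4 secs} but not Q = {3.5 Mbps, 4 secs}.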
Algorithm 1: Functions of the Communication-Manager agent

To maintain the QoS requirements of an application by performing dynamic negotiation/re-negotiation.

Begin
1. Receive a connection request for a source and destination with QoS requirements (maximum bandwidth and delay required).
2. Trigger the QoS negotiator/re-negotiator agent to negotiate the resources to find a QoS route (Algorithm 2a).
3. Trigger the D&B-Estimator agent (Algorithm 3a).
4. Trigger the D&B-Estimator agent periodically to observe the QoS of the application in the network (Algorithm 3b).
5. If there is a QoS violation, trigger the QoS negotiator/re-negotiator agent to re-negotiate the resources with the nodes in the established path (Algorithm 2b).
6. Repeat steps 4-5 until the session is completed.
7. Dispose of the created agents.
8. Stop.
End.

• QoS Negotiator/Re-negotiator agent: a mobile agent used to find the QoS route (a route satisfying bandwidth and delay) from the source to the destination at the beginning of a session, as well as whenever required. It negotiates/re-negotiates the resources in the path. It follows the principle of the NDMR routing protocol while establishing multiple paths between the source and destination, collecting the neighbours' connectivity and resource information (bandwidth availability, delays). Finally, it chooses a maximum-bandwidth and minimum-delay path among them for resource reservation.

Algorithm 2a: Negotiation phase

To find a QoS route considering the bandwidth and delay parameters.

Begin
1. The QoS negotiator/re-negotiator agent collects the QoS requirements from the Communication-Manager agent.
2. The agent migrates from the source through its neighbours until it reaches the destination. While traversing, it collects the resource availability from each of the intermediate nodes.
3. The agent routes from destination to source as per the pre-existing routing of the NDMR routing protocol.
4. When the agent reaches the source, it finds a set of multiple QoS paths that satisfy the required resources:
   • Prune all edges/links in the collected connectivity/resource information that offer less than the desired bandwidth and delay.
   • Find K node-disjoint paths (no node is common to more than one path) by following the principles of the NDMR routing protocol.
5. If QoS path(s) are available, select the best QoS path (the path with the widest bandwidth and lowest delay) and reserve the resources on the path. Else, inform the communication manager agent that no QoS path is available.
6. Dispose of the QoS negotiator/re-negotiator agent.
7. Stop.
End.

Algorithm 2b: Re-negotiation phase

To re-negotiate resources whenever a QoS violation or congestion/failure is detected during a session.

Begin
1. The QoS negotiator/re-negotiator agent collects the QoS requirements to be re-negotiated from the Communication-Manager agent.
2. It migrates along the specified path, visiting every node on the path.
3. After reaching the destination, it checks whether the re-negotiation was successful at all the visited nodes.
4. If the re-negotiation is successful, inform the Communication-Manager agent of the newly negotiated QoS values on the existing path and go to step 6. Else, find the multiple QoS paths that satisfy the required resources (as given in step 4 of Algorithm 2a).
5. If QoS path(s) are available, select the best QoS path, reserve the resources on the path, and inform the server and manager agent. Else, inform the communication manager agent that no QoS path is available.
6. Dispose of the QoS negotiator/re-negotiator agent.
7. Stop.
End.
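Steps 4-5 of Algorithm 2a (prune candidates that miss the QoS requirements, then pick the widest-bandwidth, lowest-delay path) might look like this in C++. The CandidatePath struct and selectBestQoSPath are illustrative names, not from the paper, and the per-path metrics are assumed to have been computed already from the link metrics via criteria (1) and (2).

```cpp
#include <optional>
#include <vector>

// Illustrative candidate path with its precomputed path metrics.
struct CandidatePath {
    std::vector<int> nodes;  // node list, source .. destination
    double bandwidthMbps;    // min residual bandwidth over the path's links
    double delaySecs;        // summed link delay
};

// Discard candidates that miss Q = {bw_min, D}, then choose the widest-bandwidth
// path, breaking ties by lowest delay. An empty result means the communication
// manager agent must be told that no QoS path is available.
std::optional<CandidatePath> selectBestQoSPath(
        const std::vector<CandidatePath>& candidates, double bwMin, double D) {
    std::optional<CandidatePath> best;
    for (const auto& c : candidates) {
        if (c.bandwidthMbps < bwMin || c.delaySecs > D) continue;  // prune (step 4)
        if (!best || c.bandwidthMbps > best->bandwidthMbps ||
            (c.bandwidthMbps == best->bandwidthMbps && c.delaySecs < best->delaySecs))
            best = c;  // step 5: widest bandwidth, then lowest delay
    }
    return best;
}
```

Selecting by bandwidth first and delay second matches the description "the path with the widest bandwidth and lowest delays"; other orderings (e.g. delay first) are equally easy to express by changing the comparison.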
the bandwidth and delay for each node and updates the Blackboard at regular intervals. Bandwidth is calculated by monitoring the flow of information on the link of a node and the available residual bandwidth. Delay is computed by averaging the queuing delays taken for all the packets on the link within a given time interval.

Algorithm 3a: The Delay and Bandwidth Estimator agent computes bandwidth and delay

Begin
1. Compute bandwidth and delay as follows:

   bw_p(s,d) = min_{(i,j) ∈ p(s,d)} bw_e(i,j) ≥ bw_min

   delay_p(s,d) = Σ_{(i,j) ∈ p(s,d)} d_e(i,j) ≤ D

   Notations bw_p(s,d) and delay_p(s,d) are the bandwidth and delay values associated with path p(s,d), respectively, whereas bw_e(i,j) and d_e(i,j) denote the residual bandwidth and delay on the link connecting nodes i and j on the path p(s,d), respectively.
2. Inform the calculated values to the Communication-manager-agent.
3. Dispose the agent.
4. Stop.
End

Algorithm 3b: Periodically, the Delay and Bandwidth Estimator agent observes bandwidth and delay variations

Begin
1. Observe bandwidth and delay at each node of the specified path by traversing from destination to source.
2. In the event of any QoS violations, the agent informs the Communication-manager-agent.
3. Dispose the agent.
4. Stop.
End

3. Simulation

The proposed scheme is simulated in various network scenarios using the C programming language to verify the performance and operational effectiveness of the scheme. In this section we describe the simulation model and simulation procedure.

3.1 Simulation model

The proposed model has been simulated in various network scenarios on a Pentium-4 machine using the C++ programming language to assess the performance and effectiveness of the approach. The simulated area for the network topology is A × B sq. mts; a total of "N" nodes, a link capacity of "C" Mbps and a transmission range are considered in the simulation. In order to simulate the mobility of nodes in the network, we considered nodes moving in any of the 8 directions over a distance of "d" mts with speed varying between 0 and 12 mph (meters per hour). The QoS requirements of an application are specified as Q = {bandwidth, delay}, where all the metrics of Q are randomly distributed. The principles of NDMR routing are used to generate node-disjoint multipaths and create routing tables at each of the nodes before applying the QoS-enabled NDMR scheme.

3.2 Simulation procedure

To illustrate some results of the proposed scheme, the simulation inputs are: A = 300 mts, B = 300 mts, number of nodes varying between 10 and 25, transmission range = 100 mts, node speed varying between 0 and 12 mph (meters per hour), propagation delay varying between 1 and 5 secs, data rate varying between 2 and 5 Mbps, packet size = 1 KB; the number of services is assumed to be constant.

The simulation procedure is as follows:
Begin
1. Create a network topology with a random number of nodes.
2. Randomly select the source node and destination node.
3. Deploy the proposed scheme.
4. Compute the performance parameters.
End.

The following performance metrics are used for evaluating the scheme:
• Packet Delivery Ratio: the ratio of the number of data packets delivered to the destination node to the number of data packets transmitted by the source node.
• Route discovery time: the time required to find the optimal QoS path.
• Average end-to-end delay: the average time a data packet takes to travel from source to destination; it includes all possible delays caused by queuing and retransmission plus acknowledgement.
• Bandwidth utilization ratio: the ratio of the sum of the utilized bandwidth of all the links to the total bandwidth of the network.
• QoS Acceptance ratio: the ratio of the number of QoS-fulfilled paths to the existing multipaths.
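The per-path feasibility test behind Algorithm 3a — the bandwidth of a path is the minimum residual bandwidth over its links and must meet bw_min, while the path delay is the sum of link delays and must stay within the bound D — can be sketched as follows. The function names and the example link values are illustrative, not from the paper.

```python
# Sketch of the QoS feasibility test of Algorithm 3a: bottleneck (minimum)
# residual bandwidth versus bw_min, and additive end-to-end delay versus D.

def path_qos(links):
    """links: list of (residual_bandwidth, delay) pairs per link of path p(s, d)."""
    bw_p = min(bw for bw, _ in links)        # bottleneck bandwidth of the path
    delay_p = sum(d for _, d in links)       # delays add up along the path
    return bw_p, delay_p

def feasible(links, bw_min, delay_bound):
    bw_p, delay_p = path_qos(links)
    return bw_p >= bw_min and delay_p <= delay_bound

# Example: a 3-hop path with per-link (Mbps, ms) values.
path = [(4.0, 2.0), (2.5, 1.5), (3.0, 2.5)]
print(path_qos(path))            # (2.5, 6.0)
print(feasible(path, 2.0, 10))   # True: bottleneck 2.5 >= 2.0 and delay 6.0 <= 10
```

A path that fails either check is reported back to the Communication-manager-agent as "QoS path not available", as in the negotiator agent's algorithm above.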

4. Results

Figure 3 depicts the route discovery time: it is higher for the proposed scheme than for the NDMR scheme, because the mobile agents need time to traverse the network to find an optimal path.

Fig 3. Route discovery time vs. no. of nodes

As shown in Figure 4, the Average End-to-End Delay for the proposed scheme is less than for NDMR because the proposed scheme selects the optimal path.

Fig 4. End-to-End Delay Vs. no. of nodes

As in Figure 5, the acceptance ratio is higher in IANDMR compared to NDMR with a higher delay requirement, because more optimal paths are available.

Fig 5. QoS Acceptance ratio Vs Delay required

We experimented by injecting a certain percentage of link failures; a mobile agent tries to find another QoS route for the same application. Figure 6 depicts an increased packet delivery ratio in the IANDMR scheme compared to NDMR.

Fig 6. Packet Delivery ratio Vs. no. of nodes

5. Conclusions

An intelligent-agent-based QoS-enabled NDMR scheme using the metrics bandwidth and delay for feasible path computation has been proposed. A comparison of NDMR and the proposed QoS-enabled scheme is presented in terms of Average End-to-End delay, Packet delivery ratio, network bandwidth utilization ratio and Route discovery time for sparse and dense network scenarios. The results demonstrate that the Average end-to-end delay, packet delivery ratio and network bandwidth utilization of the proposed scheme are better than those of the NDMR scheme. The performance of the scheme depends on the richness of the network connectivity information gathered by a mobile agent; that is, the scheme performs better in dense networks. The agent's visibility of its visited nodes also plays an important role in improving the QoS acceptance ratio. Important benefits of the agent-based scheme compared with traditional methods of software development are flexibility, adaptability, software reusability and maintainability.

Acknowledgments

We are very much thankful to the reviewers for the useful comments that helped us improve the quality of the paper.
References

[1] Jun-Zhao Sun, "Mobile Ad Hoc Networking: an essential technology for pervasive computing", proc. IEEE International Conference on Info-tech and Info-net, Beijing, vol. 3, pp. 316-321, 2001.
[2] Liza Abdul Latiff, Norsheila Fisal, "Routing Protocols in Wireless Mobile Ad Hoc Network - A Review", proc. IEEE 9th Asia-Pacific Conference on Communications, vol. 2, pp. 600-604, 2003.
[3] Ahmed Al-Maashri, Mohamed Ould-Khaoua, "Performance Analysis of MANET Routing Protocols in the Presence of Self-Similar Traffic", proc. IEEE 31st Conference on Local Computer Networks, pp. 801-807, Nov 2006.
[4] Hongxia Sun, Herman D. Hughes, "Adaptive Multi-path Routing Scheme for QoS Support in Mobile Ad-hoc Networks", www.scs.org/getDoc.cfm?id=2454.
[5] Xuefei Li and Laurie Cuthbert, "On-demand Node-Disjoint Multipath Routing in Wireless Ad hoc Networks", proc. IEEE 29th Annual International Conference on Local Computer Networks, U.S.A., pp. 419-420, Nov 2004.
[6] Chenxi Zhu and M. Scott Corson, "QoS routing for mobile ad hoc networks", proc. 21st Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 2, pp. 958-967, Jun 2002.
[7] Zhenyu Liu, Marta Z. Kwiatkowska, and Costas Constantinou, "A Biologically Inspired QoS Routing Algorithm for Mobile Ad Hoc Networks", International Journal of Wireless and Mobile Computing, 2006.
[8] Jennings, N.R., "An agent-based approach for building complex software systems", Communications of the ACM, vol. 44, pp. 35-41, 2001.
[9] Manvi, S.S., and Venkataram, P., "Applications of agent technology in communications: a review", Computer Communications Journal, pp. 1493-1508, 2004.
[10] Chess, D., Harrison, C., and Kershenbaum, A., "Mobile agents: are they a good idea?", Lecture Notes in Computer Science, Springer Berlin/Heidelberg, vol. 1222, pp. 25-45, 2006.
[11] Lange, D.B., and Oshima, M., "Seven good reasons for mobile agents", Communications of the ACM, vol. 42, pp. 88-89, 1999.
Close to Regular Covering by Mobile Sensors with
Adjustable Ranges
A. I. Erzin, V. N. Muralidhara, S. Raha

Abstract— A mobile wireless sensor network of a set of mobile sensors with adjustable sensing and communication ranges is analyzed for coverage. Each sensor, being in active mode, consumes its limited energy for sensing, communication and movement and, in a sleep mode, preserves its energy. The problem that we focus on in this paper is to maximize the lifetime of the WSN. This problem is sufficiently complex, and even special cases are NP-hard [1]. Our goal is to take advantage of the mobility of sensors in comparison with static sensors in the class of regular covers [2], [3], [4].

Index Terms— Adjustable ranges, coverage, energy efficiency, wireless sensor network.

I. INTRODUCTION

A wireless sensor network (WSN) is composed of a large number of sensor nodes deployed densely close to an area of interest and connected by a wireless interface. Wireless sensor networks constitute the platform of a broad range of applications such as national security, surveillance, military, health care, and environmental monitoring. A sensor node in a WSN is typically equipped with a radio transceiver, a microcontroller and a power supply, and every node in the network has very limited processing, storage and energy resources. In most real-world applications it is almost impossible to replenish the power resources; hence energy optimization is the most important issue in a WSN.

Suppose the WSN is presented by the set J, |J| = m, of mobile sensors with adjustable sensing and communication ranges, which are distributed randomly over the plane region O of area S. Each sensor, being in active mode, consumes its limited energy for sensing, communication and movement. In a sleep mode a sensor preserves its energy. Let the monitoring and communication areas of every sensor be the disks of certain radii with the sensor in the centers [2], [3], [5], [6].

We say that the region O is covered if every point in O belongs to at least one monitoring disk. The lifetime of a WSN is the number of time rounds during which the region O is covered by connected active sensors [7]. Observe that by maximizing the lifetime of a WSN, we are actually maximizing the time period over which the region is covered by the sensor nodes with the limited energy resources. The problem that we focus on in this paper is to maximize the lifetime of the WSN. This problem is sufficiently complex, and even special cases are NP-hard [1]. Our goal is to take advantage of the mobility of sensors in comparison with the static sensors in the class of regular covers [2], [3], [4]. In the model that we consider, the sensor nodes can adjust the sensing and communication ranges by consuming some energy. In this paper, we show that the mobility of the sensor nodes can be exploited to improve the lifetime of the WSN.

This research was supported jointly by the Russian Foundation for Basic Research (grant 08-07-91300-IND-a) and by the Department of Science and Technology of Government of India (grant INT/RFBR/P-04).
A. I. Erzin is with the Sobolev Institute of Mathematics, Russian Academy of Sciences, Novosibirsk, Russia and the Novosibirsk State University, Novosibirsk, Russia.
V. N. Muralidhara was with the Supercomputing Education and Research Center, Indian Institute of Science, Bangalore, India. At present he is a faculty member at the International Institute of Information Technology (IIIT) Bangalore.
S. Raha is with the Supercomputing Education and Research Center, Indian Institute of Science, Bangalore, India.

II. FIXED GRID

Let the region O be tiled by the regular triangles (tiles) with the side R√3. These triangles form a regular grid with the set of grid nodes I. Suppose each sensor has the energy storage q > 0. For any sensor, the sensing energy consumption per time period depends on the sensing range r (radius of the disk) and equals SE = µ1 r^a, µ1 > 0, a ≥ 2; the communication energy consumption per time period depends on the distance d and equals CE = µ2 d^b, µ2 > 0, b ≥ 2; and the energy consumption per time round during motion depends on the speed v and equals ME = µ3 v^c, µ3 > 0, c > 0. We suppose that during motion a sensor does not consume energy for sensing and communication.

Fig. 1. Covering model A1 (grid nodes i, j, k; tile side R√3; sensing disk i of radius R)

If all sensors have the same sensing ranges R, and are equally placed in the grid nodes, then the covering model, we
call it A1 [8], [4] (Fig. 1), is optimal with respect to the sensing energy consumption (or covering density). In the model A1 each triad of neighbor disks of radius R with centers in the nodes of a triangle has one common point in the center of the tile. In cover A1 each sensor, located in the node i, must cover the disk of radius R with center in the node i (we call it disk i). The density of the cover is D_A1 = 2π/√27 ≈ 1.2091 [4], [8], and the sensing energy consumption of every sensor is SE_A1 = µ1 R^a. The communication distance for each sensor in A1 is R√3, hence the communication energy consumption is CE_A1 = µ2 (R√3)^b. Therefore, the lifetime of one sensor is

t_A1 = q / (µ1 R^a + µ2 (R√3)^b).

Since the minimal number of grid nodes is N ≈ 2S/(R²√27) [4], the lifetime of cover A1 is

L_A1 ≈ t_A1 m / N ≈ q m √27 / (2S(µ1 R^(a−2) + µ2 R^(b−2) (√3)^b)).

Let the sensors be distributed uniformly over the region O, and let the parameter a_ij = 1 if i is the closest grid node to the sensor j (i.e. the distance between j and i is d_ij = min_{k∈I} d_kj) and a_ij = 0 otherwise. Denote the set J_i = {j ∈ J | a_ij = 1}. Then the sensors inside the regular hexagon i with center in the node i and the sides at the distance δ = R√3/2 from the center are in the set J_i (Fig. 2).

Fig. 2. Sensors Inside the Regular Hexagon (grid nodes i, j; disk i; radii kv and (k−1)v; sides of hexagon i)

We reasonably suppose that the sensor j in J_i (or in hexagon i) must cover the disk i. Then if j is located at the distance r away from the grid node i, it must increase its sensing range by r in order to cover the disk i. Moreover, if the distance between the node i and sensor j1 ∈ J1 is r1, and the distance between the node k and sensor j2 ∈ J2 is r2, then in order to guarantee communication between the neighbor sensors j1 and j2 it is necessary to increase the communication ranges of j1 and j2 by at least r1 + r2 units. But additionally, every sensor j ∈ J_i can move towards the node i during some time rounds in order to be nearer to i. For the sake of simplicity, we suppose that the speed of every sensor takes one of the two values 0 or v. Therefore, if a sensor j ∈ J_i is moving, then the speed v and direction (towards the grid node i) are known.

Let us consider the concentric circles of radii δ_k = kv, k = 1, 2, . . . , K = δ/v. Denote the set J_i^k = {j ∈ J_i | δ_(k−1) < d_ij ≤ δ_k}. Then any sensor j ∈ J_i^k could reach the node i in at most k time rounds.

Since the resource of each sensor is limited by q, if any sensor j ∈ J_i^k moves l time rounds and, as a result, consumes lµ3 v^c units of its energy, then, taking into account the remaining sensor-node distance (k − l)v, it can be active during

t_k(l) ≈ (q − lµ3 v^c) / (µ1 (R + (k − l)v)^a + µ2 (R√3 + 2(k − l)v)^b)

time rounds. The function t_k(l) is concave, so one can find T_k = t_k(l_k) = max_{0≤l≤k} t_k(l) in O(log K). For example, when q = 365, µ1 = 0.5, µ2 = 0.25, µ3 = 1.0, v = 0.15, R = …, a = b = c = 2, k = 6, one gets l_k = k = 6, and the lifetime of sensor j ∈ J_i^k equals t_k(l_k) = 65.61.

Since the sensors are distributed uniformly, there are N_k ≈ mπ(2k − 1)v²/S sensors in every set J_i^k. Let the first active sensors be initially located in J_i^1, and suppose that they do not move and are active during

L_1 = qN_1 / (µ1 (R + v)^a + µ2 (R√3 + 2v)^b)

time periods. During time L_1 sensors in J_i^2 could move towards the grid node i, and then they can be active during L_2 = N_2 max_{0≤l≤min{2, L_1}} t_2(l) time periods. Therefore, during the time Λ_(k−1) = Σ_{l=1}^{k−1} L_l sensors in J_i^k could move to the grid node i, and then they can be active during L_k = N_k max_{0≤l≤min{k, Λ_(k−1)}} t_k(l) time periods. As a result, we get the lifetime of the sensors as

Λ_δ = Λ_K = Σ_{k=1}^{K} L_k.

For example, when q = 365, µ1 = 0.5, µ2 = 0.25, µ3 = 1.0, v = 1, a = b = c = 2, R = 6, we have K = 5 and l_k = k for each sensor j ∈ J_1^k, k = 1, 2, . . . , K. Let us compare the lifetime Λ_0 of the WSN in the example when the sensors are static, and the lifetime Λ_δ of the WSN when the sensors are mobile. We have

Λ_0 ≈ Σ_{k=1}^{K} qN_k / (µ1 (R + kv)^a + µ2 (R√3 + 2kv)^b)
    ≈ (2 · 365mπ/S) Σ_{k=1}^{5} (2k − 1) / (40 + 8(1 + √3)k + 3k²)
    ≈ (2 · 365mπ/S)(1/64.85 + 3/95.71 + 5/132.56 + 7/175.42 + 9/224.28)
    ≈ 374 m/S

and

Λ_δ ≈ qN_1 / (µ1 (R + v)^a + µ2 (R√3 + 2v)^b)
    + Σ_{k=2}^{K} (q − l_k µ3 v^c)N_k / (µ1 (R + (k − l_k)v)^a + µ2 (R√3 + 2(k − l_k)v)^b)
    ≈ (mπ/S)(365/62.89 + (1/45) Σ_{k=2}^{5} (365 − k)(2k − 1))
    ≈ 621 m/S

In this example the motion of the sensors gives us a considerable gain in lifetime in comparison with the static case; the static case may be advantageous only when the movement energy consumption ME = µ3 v^c is relatively big. In any case, the optimal value of l_k can be zero, and if moving is disadvantageous, then the sensors will not move. Therefore, the model with mobile sensors is always better than the one with static sensors.

III. FREE GRID

The previous results depend on the parameter δ and are obtained in the case of a fixed grid. Suppose that the number of sensors N_k in J_i^k is sufficiently large for each k = 1, 2, . . . , K = δ/v. If the grid is wandering, then we may relocate it several times without changing the size (the new grid node i_n is relocated from the previous position i_(n−1) by a 2δ distance to the right or down, as in Fig. 3); then the WSN's lifetime can be increased as follows. Let us set δ = v and suppose that during the first time round, when a part of the sensors in J_(i_1)^1 are active, the other sensors in every J_(i_n)^1, n ≥ 1, move to the grid node i_n. The number of sensors in each J_(i_1)^1 which are active during the first time round is n′_1 ≈ (µ1 (R + v)^a + µ2 (R√3 + 2v)^b)/q, and we suppose that n′_1 ≤ N_1. These sensors do not move and must increase their sensing ranges by v. During the first time round N_1 − n′_1 sensors in each set J_(i_1)^1 will reach the grid node i_1, and it is not necessary to increase their sensing ranges to cover O. Moreover, each sensor in J_(i_n)^1, n ≥ 2, will reach i_n during the first time round. The number of sensors in every set J_(i_n)^1, n ≥ 2, is N_1, and the number of these sets (new grid nodes) is n′_2 ≥ ⌊R√3/(2v)⌋² − 1 (the value n′_2 + 1 is the number of disks of radius δ packed in a rhombus with side R√3), where ⌊A⌋ is the integer part of A. Every sensor located outside the sets J_(i_n)^1, n ≥ 1, has two time rounds to reach the nearest grid node (Fig. 3). The number of such sensors in the rhombus is

n′_3 ≈ R²√27 m/(2S) − (n′_2 + 1)N_1 ≥ (mR²/(4S))(2√27 − 3π) ≈ 0.24 mR²/S

Fig. 3. Relocation of Grids with Movement of a Sensor

Since N_1 ≈ mπv²/S, the WSN's lifetime in this case is

Λ_v ≈ 1 + (N_1 − n′_1 + N_1 n′_2)(q − µ3 v^c) / (µ1 R^a + µ2 (R√3)^b)
    + n′_3 (q − 2µ3 v^c) / (µ1 R^a + µ2 (R√3)^b)
    ≈ 1 + (3R²mπ/(4S) − (µ1 (R + v)^a + µ2 (R√3 + 2v)^b)/q)(q − µ3 v^c) / (µ1 R^a + µ2 (R√3)^b)
    + (6mR²/(25S))(q − 2µ3 v^c) / (µ1 R^a + µ2 (R√3)^b)

For the last example (q = 365, µ1 = 0.5, µ2 = 0.25, µ3 = 1.0, v = 1, a = b = c = 2, R = 6), the lifetime of the sensor network with a free grid is Λ_v ≈ 747m/S, the WSN's lifetime in the case of a fixed grid is Λ_δ ≈ 621m/S, the lifetime of cover A1 is L_A1 ≈ 758m/S, and the lifetime of the static WSN is Λ_0 ≈ 374m/S. Then in the example one gets 2Λ_0 ≈ 1.2Λ_δ ≈ 1.01Λ_v ≈ L_A1 and Λ_0 < Λ_δ < Λ_v < L_A1. The inequalities Λ_0 ≤ Λ_δ ≤ L_A1 and Λ_v ≤ L_A1 are always true. The inequality Λ_δ ≤ Λ_v depends on the parameters. Thus, if the energy consumption for motion in unit time, ME = µ3 v^c, is considerably big, then one may get the inequality Λ_0 > Λ_v. Let us change in the last example only the value of µ3 and set µ3 = 180. Then Λ_δ is the same, Λ_0 ≈ 374m/S, but Λ_v ≈ 333m/S < Λ_0.

IV. GRID SIZE OPTIMIZATION

The above results depend on the radius R which, in turn, determines the tile size. The lifetime L_A1 of model A1 is irrespective of R. Suppose for simplicity a = b = c = 2, v = 1 and l_k = k for any k ≤ R√3/2, and let us find the optimal value of R ∈ [2, 8] giving the maximum of the WSN's lifetimes Λ_0(R), Λ_δ(R) and Λ_v(R). In this case, for the considered models, the lifetimes are

Λ_0(R) ≈ (mπq/S) Σ_{k=1}^{R√3/2} (2k − 1) / (µ1 (R + k)² + µ2 (R√3 + 2k)²)
Λ_δ(R) ≈ (mπ/S) ( q / (µ1 (R + 1)² + µ2 (R√3 + 2)²) + (1/((µ1 + 3µ2)R²)) Σ_{k=2}^{R√3/2} (q − kµ3)(2k − 1) )

Λ_v(R) ≈ 1 + m(2.6q − 2.83µ3) / (S(µ1 + 3µ2)) − (µ1 (R + 1)² + µ2 (R√3 + 2)²)(q − µ3) / (qR²(µ1 + 3µ2))

The function Λ_v(R) is non-decreasing, and when, for example, q = 36, µ1 = 1.0, µ2 = 0.2, µ3 = 5.0, then max_{R∈[2,8]} Λ_v(R) = Λ_v(R_v) ≈ 57m/S, and the optimal R_v = 8. Λ_0(R) and Λ_δ(R) are multi-extremal functions, and we get max_{R∈[2,8]} Λ_0(R) = Λ_0(R_0) ≈ 20m/S when R_0 ≈ 7, and max_{R∈[2,8]} Λ_δ(R) = Λ_δ(R_δ) ≈ 33m/S when R_δ ≈ 2.4.

Further details can be found in [9].
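The per-sensor trade-off of Section II can be made concrete. A sensor in J_i^k that moves l rounds toward its grid node spends l·µ3·v^c on motion and then senses and communicates over smaller ranges; since t_k(l) is concave in l, its integer maximizer can be found by ternary search in O(log k) evaluations, as the text notes. The sketch below uses the paper's second worked example (q = 365, µ1 = 0.5, µ2 = 0.25, µ3 = 1.0, v = 1, a = b = c = 2, R = 6); the function and variable names are illustrative.

```python
# Sketch of t_k(l) from Section II and its maximization over l in [0, k].
import math

q, mu1, mu2, mu3 = 365.0, 0.5, 0.25, 1.0
v, a, b, c, R = 1.0, 2, 2, 2, 6.0
SQRT3 = math.sqrt(3.0)

def t(k, l):
    rem = (k - l) * v                           # leftover distance to the grid node
    sense = mu1 * (R + rem) ** a                # sensing energy per round
    comm = mu2 * (R * SQRT3 + 2 * rem) ** b     # communication energy per round
    return (q - l * mu3 * v ** c) / (sense + comm)

def best_l(k):
    """Integer ternary search for the argmax of the concave sequence t(k, 0..k)."""
    lo, hi = 0, k
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if t(k, m1) < t(k, m2):
            lo = m1 + 1
        else:
            hi = m2 - 1
    return max(range(lo, hi + 1), key=lambda l: t(k, l))

for k in range(1, 6):                           # K = 5 for these parameters
    l_star = best_l(k)
    print(k, l_star, round(t(k, l_star), 2))
```

For these parameters the search returns l = k for every k ≤ K, i.e. each sensor gains by moving all the way to its grid node, which is consistent with the text's l_k = k and with the lifetime gain Λ_δ > Λ_0 computed above.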

REFERENCES
[1] S. Slijepcevic and M. Potkonjak, “Power efficient organization of wireless
sensor networks,” in ICC, 2001, pp. 472–476.
[2] H. Zhang and J. Hou, “Maintaining sensing coverage and connectivity in
large sensor networks,” Ad Hoc & Sensor Wireless Networks, pp. 89–124,
2005.
[3] J. Wu and S. Yang, “Energy-efficient node scheduling models in sensor
networks with adjustable ranges,” Int. J. of Foundations of Computer
Science, no. 16, pp. 3–17, 2005.
[4] R. Williams, The Geometrical Foundation of Natural Structure: A Source Book of Design, Dover Pub. Inc.: New York, pp. 51–52, 1979.
[5] M. Cardei, J. Wu, and M. Lu, “Improving network lifetime using sensors
with adjustable sensing ranges,” Int. J.of Sensor Networks, no. 1, pp.
41–49, 2006.
[6] J. Wu and F. Dai, “Virtual backbone construction in manets using
adjustable transmission ranges,” IEEE Trans On Mobile Computing, no. 5,
pp. 1188–1200, 2006.
[7] M. Cardei and J. Wu, “Energy-efficient coverage problems in wireless ad-
hoc sensor networks,” Computer Communications, no. 29, pp. 413–420,
2006.
[8] R. Kershner, “The number of circles covering a set,” American Journal
of Mathematics, no. 61, pp. 665–671, 1939.
[9] A. I. Erzin, “Close to regular plane covering by mobile sensors,” Abstracts
of Int. Conf. on Optimization and Applications (OPTIMA-2009), Petrovac,
Montenegro, pp. 25–26, 2009.

Virtual Backbone Based Reliable Multicasting for MANET

Dipankaj G Medhi
ADG, Evolving Systems Network India Ltd
dipankaj@gmail.com

Abstract

In this paper, we propose a distributed reliable multicasting scheme that uses a virtual backbone infrastructure to transmit packets reliably over the unreliable communication channel of a mobile ad-hoc network (MANET). This novel approach consists of two phases of execution. In the first phase, we use a distributed clustering algorithm that extracts a d-hop dominating set and interconnects its members to form a backbone. The backbone infrastructure is dynamically changed to reflect the underlying topology condition. The coverage of a backbone node is automatically adjusted in response to communication link quality and node mobility. The reliable multicasting mechanism uses this infrastructure to transmit the data in the second phase of execution, which is the prime area of our investigation. We introduce a NACK-based localized loss recovery mechanism, in which the cluster-head temporarily acts as the source node. Besides protecting the source from unnecessary retransmission, this approach reduces the global congestion caused by control messages and retransmission. If the lost packet cannot be recovered locally, a global loss recovery mechanism is triggered that pulls back the lost packet globally. Moreover, it introduces an ACK-based one-hop reliable packet delivery scheme to reduce control message and retransmitted data packet overhead. Simulation results demonstrate a potential packet delivery ratio in high mobility conditions.

1. Introduction

A Mobile Ad Hoc Network (MANET) is an autonomous system consisting of a collection of wireless nodes that operates without any infrastructure. MANETs are envisioned to support advanced applications such as battlefield and disaster relief operations, temporary event networks, vehicular networks, or any other applications that require a network on demand. The domain of application targeted by MANET is group oriented. Undoubtedly, multicast is an efficient means of group-oriented communication. Although some applications such as temporary event networks (e.g. audio/video conferencing) can tolerate packet loss/error, other applications such as battlefield applications are loss sensitive. Hence, an inevitable need for reliable multicast arises. However, reliable multicast solutions proposed for wire-line networks [1, 2, 3, 4, 5] are not efficient to deploy in MANETs. Reliable multicast is a challenging research problem due to salient characteristics of ad-hoc networks such as an infrastructure-less dynamic network topology, error-prone wireless transmission media, node mobility, and limited bandwidth and resources. We accept this challenge and propose a novel solution to this problem in this paper.

Within the broad scope of group communication, this research work addresses the fundamental problem of reliable multicasting. Many approaches to this problem have been proposed in the literature. It has been shown that the performance of multicast protocols designed as extensions of conventional routing protocols is not attractive under stress conditions. For example, Jetcheva et al [6, 7] show that the performance of ADMR, MAODV and ODMRP falls below 70% (packet delivery ratio) when the number of sources increases. Similarly, Zhu et al [8] demonstrated that the performance of MAODV degrades to around 80% when the group size is 5 and the maximum speed of an individual node is 20 m/sec. ADMRP [6] exhibits a Packet Delivery Ratio of up to 95% with pause time 800 seconds, speed 20 m/sec, and three groups with 3 sources and 10 receivers per group. In a nutshell, all these protocols exhibit intolerably high packet loss rates under moderate to high mobility. Further, flooding is also not an alternative approach for reliable multicasting, as Obraczka [9] demonstrates that “when mobility intervals are very small and node speed is sufficiently high, even flooding becomes unreliable”.
As losses of data packets are obvious in a wireless environment, special attention needs to be paid to improving the packet delivery ratio in order to achieve reliability. The normal way of recovering a lost packet is by sending feedback from the receiver(s) to the sender. Based on the feedback, the sender may retransmit the lost packet once again. This mechanism may introduce a request-message explosion problem in extreme conditions. To rectify this problem, ReMHoc [10] randomly slows down the recovery process. Slowing down the recovery process at individual nodes augments end-to-end delay, and hence it is not a suitable solution. To overcome this problem, AG [11], ReACT [12], RALM [13] and RMA [14] introduce a localized loss recovery mechanism, in which the feedback message is sent to a set of nearby nodes. To get back the lost packet in any localized loss recovery system, the receiver must obtain information about the recovery nodes that hold the lost packet. Some of the existing approaches (for example AG [11], ReACT [12]) use an explicit search mechanism to find the recovery node. This in turn introduces control message overhead. Other approaches (for example RALM [13], RMA [14]) maintain a list containing the information of all the receiver nodes. This information is collected by flooding control messages in the entire network, or at least in a part of the network. Further, most of the existing approaches depend on an underlying multicast protocol (for example, RALM, ReMHoc, ReACT).

Keeping all these points in mind, we propose a new approach for reliable multicasting that utilizes a virtual backbone infrastructure. Constructing a virtual backbone for routing in MANET is not new, but none of the existing reliable multicasting mechanisms explores this area. In doing so, we developed a distributed clustering algorithm that extracts a dominating set and interconnects its members to form a backbone. The rest of the nodes get associated with one of the cluster-heads and form a forest of varying depth. The backbone construction mechanism is similar to ADB [15] and MobDHop [16] except for the cluster-head selection criteria, which better reflect the dynamic environment of a MANET. Next, we propose an approach for reliable multicasting over the backbone infrastructure to achieve guaranteed delivery of data packets. Further, we develop an ACK-based one-hop loss detection and recovery mechanism to achieve a high degree of reliability. Moreover, we use a NACK-based localized loss recovery mechanism with the cluster-head as the recovery node. Thus, our approach does not demand an explicit search for the recovery node; hence the network does not suffer from control message overhead. If the lost packet cannot be recovered locally (within the cluster), a global loss recovery mechanism is proposed, in which the lost packet is pulled from another group (cluster). Further, in our mechanism, the sender does not have to keep any information about the receiver nodes.

The rest of the paper is organized as follows. Section 2 describes the backbone construction process. Section 3 provides a theoretical correctness proof of the approach. Section 4 explains the mechanism used in this research work to achieve reliable multicasting. Section 5 summarizes this work.

2. Virtual Back-bone construction process

The virtual backbone infrastructure allows a smaller subset of nodes to participate in forwarding control/data packets and helps reduce the cost of flooding. Most virtual backbone construction algorithms for ad hoc networks are based on the connected dominating set problem, but like ADB and MobDHop, we construct a d-hop connected dominating set that creates a forest of varying-depth trees, each of which is rooted at a backbone node.

2.1 The Cluster-head Selection Criteria

For the construction of the virtual backbone, the criteria for selecting a cluster-head are trivial. The uniqueness of our virtual backbone creation approach is the cluster-head selection criterion. For efficient working of the proposed approach, the following cluster-head selection criteria need to be considered:
• Mobility: The most important factor to be considered in the cluster-head selection process is mobility. In order to avoid a frequent cluster-head (re)selection process, the cluster-head should be relatively stable. A dynamic node changes its geographical position very frequently and, hence, the nodes associated with it also change very frequently. If such a node is selected as cluster-head, a frequent breakdown of the backbone will take place.
Unlike others, we believe that mobility cannot be measured with speed (WCA), with Normalized Link Failure Frequencies (VDBM) or with Link Life Time (RMA) alone. The speed of an individual node does not reflect the surrounding environment. Similarly, Normalized Link Failure Frequencies (VDBM/ADM) reflect the dynamic condition of the area surrounding a node in terms of the number of link failures per neighbor. But sometimes a link may appear to have failed not due to unavailability of
neighbor, but due to loss of Neighbor Discovery NumberOfLinkExpiredt
request packet (due to buffer overflow, LFFt = i
hidden/expose terminal problem) at MAC layer. i DegreeOfNodet
Moreover, keeping track of each link to find out i
Link lifetime may be resource-consuming to track in a multicasting environment. So, a combination of speed and Network Layer Link Failure Frequency (NLLFF) is used to represent the true mobility scenario of an individual node at the network layer. A simplified approach for measuring NLLFF is used in this work.
• Degree: To reduce the overhead associated with the cluster-head and to achieve proper load balancing, there should be a limit on the number of nodes that can be associated with a cluster-head. So, unlike other approaches (e.g. WCA), in this work preference is given in the cluster-head selection process to a node that is not highly loaded (i.e., whose degree is lower than the threshold degree).
• Node ID: If the values of the above two criteria are the same for two candidates, the conflict is resolved by node ID: the node with the lowest ID is selected as cluster-head.

Based on these criteria, a weight is assigned to every node as follows:

Wt_i^t = speed_i^t × Speed_Factor + NLLFF_i^t × NLLFF_Factor + Degree_i^t × Degree_Factor

where Wt_i^t is the weight of node i at time t, NLLFF_i^t is the Network Layer Link Failure Frequency of node i at time t, Degree_i^t is the degree of node i at time t, and Speed_Factor, NLLFF_Factor and Degree_Factor are the multiplicative factors for speed, NLLFF and degree, respectively.

Network Layer Link Failure Frequency (NLLFF) reflects the dynamic condition of the surrounding area by measuring how frequently the neighbor table of the current node changes. To measure the NLLFF, the node temporarily remembers its neighbor information. After every ∆t interval of time, the node compares this remembered information with the newly gathered neighbor information; the difference gives the number of expired links. The NLLFF for node i at time t is calculated as an exponentially smoothed average:

NLLFF_i^t = (1 − α) × (number of links expired during the last ∆t) + α × NLLFF_i^(t−∆t)

where α (< 1) is the smoothing factor for the past history and NLLFF_i^(t−∆t) is the value measured at time t − ∆t.

2.2 Cluster-head Selection Process

The core selection process at each node begins after the node has waited for a random period of time, usually long enough to allow the node to have heard Hello packets from all of its neighbors. This process decides whether a node should still serve as a core, or become a child of an existing core. Fig 1 demonstrates the cluster-head election and tree construction procedure.

To keep the number of core nodes as low as possible, the cluster-head election procedure has to maintain two constraints:
• WT_CONSTRAIN: limits the maximum cumulative weight from the core to a child node. Because of this constraint, the tree in a highly dynamic area will be smaller.
• DEPTH_CONSTRAIN: limits the maximum depth a tree can have.

For maintaining the backbone structure in a dynamic environment, every node exchanges a periodic NEIGH_UPDATE message with its neighborhood. Upon receiving a NEIGH_UPDATE message, the node updates its NIT table and performs the following checks:
• If the NEIGH_UPDATE sender is a new core node, the node compares the weight of the new core node with that of its current core node. If the weight of the new core node is lower than that of the current core node, the node can join the new core node provided this does not violate WT_CONSTRAIN and DEPTH_CONSTRAIN.
• If the sender is a tree node with a better cumulative weight, the current node can join that tree provided this does not violate WT_CONSTRAIN and DEPTH_CONSTRAIN.
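The weight and NLLFF computations above can be sketched in a few lines. This is an illustrative sketch, not the authors' code; the function names, the neighbor-table representation, and the default factor values are assumptions.

```python
# Sketch of the NLLFF smoothing recurrence and the node-weight formula.
# alpha and the *_factor defaults are placeholder values, not from the paper.

def nllff_update(prev_nllff, old_neighbors, new_neighbors, alpha=0.7):
    """Smooth the count of links that expired during the last Delta-t:
    NLLFF_t = (1 - alpha) * expired + alpha * NLLFF_(t - Delta-t)."""
    expired = len(set(old_neighbors) - set(new_neighbors))
    return (1 - alpha) * expired + alpha * prev_nllff

def node_weight(speed, nllff, degree,
                speed_factor=1.0, nllff_factor=1.0, degree_factor=1.0):
    """Wt = speed*Speed_Factor + NLLFF*NLLFF_Factor + Degree*Degree_Factor."""
    return speed * speed_factor + nllff * nllff_factor + degree * degree_factor
```

A lower weight means a better cluster-head candidate, since low speed, low link-failure frequency and low degree are all preferred.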
• If the message contains no better information, the node continues with its current status.

Let NIT ← one-hop neighbor table
q ← min_wt(NIT) /* returns the least-weighted node's information */
MyBackboneStatus = NONMEMBER
VirtualBackboneCreation ()
{
1.  q ← min_wt(NIT);
2.  if (my_wt < q→wt) {
3.    MyCoreID = MyOwnID;
4.    MyParentID = MyOwnID;
5.    MyBackboneStatus = CLUSTERHEAD;
6.    OneHopBroadcast_MyStatus (MyCoreID, MyParentID, Wt, DistanceToCore); }
7.  else {
8.    if (my_wt == q→wt) {
9.      if (my_ID < q→ID) {
10.       MyCoreID = MyOwnID;
11.       MyParentID = MyOwnID;
12.       MyBackboneStatus = CLUSTERHEAD;
13.       OneHopBroadcast_MyStatus (MyCoreID, MyParentID, Wt, DistanceToCore); } } }
14. for (;;) {
15.   on receiving MyStatus (CoreID, ParentID, Wt, DistanceToCore) {
16.     if ((my_wt + Wt) < WT_TH            /* weight constraint is not violated */
            && (DistanceToCore + 1) < DEPTH_TH) {  /* distance constraint is not violated */
17.       if (my_wt > Wt) {                 /* I am heavier */
18.         MyCoreID = CoreID;
19.         MyParentID = ParentID;
20.         MyBackboneStatus = MEMBER;
21.         CumulativeWt = my_wt + Wt;
22.         DistanceToCore++;
23.         OneHopBroadcast_MyStatus (MyCoreID, MyID, CumulativeWt, DistanceToCore); }
24.       else if (my_wt == Wt) {           /* my weight is the same */
25.         if (my_ID < senderID) {         /* conflict resolution */
26.           MyCoreID = CoreID;
27.           MyParentID = ParentID;
28.           MyBackboneStatus = MEMBER;
29.           CumulativeWt = my_wt + Wt;
30.           DistanceToCore++;
31.           OneHopBroadcast_MyStatus (MyCoreID, MyID, CumulativeWt, DistanceToCore); } } }
      else {
32.     wait to be covered by another node }
33.   if (wait-to-be-covered timer expired && MyBackboneStatus == NONMEMBER) {
34.     MyCoreID = CoreID;
35.     MyParentID = ParentID;
      } } } }

Figure 1: Backbone Construction Process

2.3 An Illustrative Example

Fig 2 to 6 demonstrate the backbone creation process. Figure 2 shows the initial configuration of the nodes in the network with individual node IDs. Dotted circles with equal radius represent the fixed transmission range of each node.

[Figure 2. Initial configuration of nodes]

[Figure 3. Neighbor nodes with weight]

[Figure 4. Cluster with cluster-head]

Fig 3 shows the neighbor nodes with the corresponding weight of every node. This is the resultant weight after executing the backbone construction process. The cluster-head selection procedure is executed in a purely distributed manner and elected nodes 3, 4 and 7 as cluster-heads. For example, node 7 is the minimum-weight node among its neighborhood (node 2, node 14 and node 9), hence it declares itself as cluster-head and broadcasts this information to the neighborhood. All other nodes will join a cluster-head, gradually. Finally, a cluster will form locally as shown in Fig 4. From Fig 4, it is observable that node 9 is in the transmission range of node 7 and node 4. So, this node will exchange core node information
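The initial election step of Figure 1 (lines 1-13) can be paraphrased compactly. This is an illustrative sketch under assumed names, not the paper's code: a node declares itself cluster-head when it holds the minimum (weight, ID) pair within its one-hop neighbor table.

```python
# Sketch of the self-election decision from Figure 1, lines 1-13.

def elect_self(my_id, my_wt, neighbor_table):
    """neighbor_table: dict node_id -> weight (the NIT).
    Returns True if this node should declare itself CLUSTERHEAD."""
    if not neighbor_table:
        return True  # an isolated node forms its own cluster
    # q <- min_wt(NIT): least-weighted neighbor, ties broken by lowest ID
    q_id, q_wt = min(neighbor_table.items(), key=lambda kv: (kv[1], kv[0]))
    if my_wt < q_wt:
        return True                            # strictly smallest weight
    return my_wt == q_wt and my_id < q_id      # tie resolved by lowest ID
```

Running this on the example of Fig 3, node 7 (minimum weight among nodes 2, 14 and 9) elects itself, while node 9 defers to its lighter neighbors.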
with node 7 and node 4. Thus node 7 and node 4 will be able to know about each other.

[Figure 5. The backbone]

[Figure 6. The logical view]

Figure 5 shows the overall logical view after executing the proposed backbone construction algorithm. Figure 6 shows the logical view of the virtual structure, which will be used in the rest of this paper for simplicity.

3 Correctness Proof

Assumption: To study the correctness of the approach from a theoretical perspective, an ideal network situation has to be assumed. In other words, there is no queuing delay and no packet loss in the network. We represent the MANET as a unit disk graph G = (V, E), where V is the set of nodes in the vicinity and E is the set of bidirectional links between neighboring nodes. After execution of the algorithm, the graph will be divided into two subsets, Vc = {c : c is a core node and c ∈ V} and Vm = {m : m is a member node and m ∈ V}. Clearly, Vc ∪ Vm = V. Two nodes are considered neighbors if and only if their geographical distance is no more than a given transmission range r. Let N1(V) denote the set of all nodes that are in V or have a direct neighbor in V; N1(V) is then a dominating set of V. In general, the d-hop subgraph Gd(v), induced from the d-hop information of v, is (Nd(v), Ed(v)). Nd(v) denotes the d-hop neighbor set of node v. In other words, N0(v) = {v} and

Nd(v) = ( ∪_{u ∈ Nd−1(v)} N1(u) ) ∪ Nd−1(v),  for d ≥ 1.

Ed(v) denotes the set of links between d-hop neighbors. The local execution of the algorithm by any node v will generate an Nd(v).

Lemma 1 (Correctness): Every node can be a member of only one cluster.
Proof: The cluster is identified by the core ID. Every core node sends a MyStatus message (lines 6 and 13 of Fig 1) and this message is received by all its 1-hop neighbors. This creates N1(v). The weights of these neighbors are obviously greater than that of the core node, so they are covered by that core node. These nodes (covered by the current core node) can extend the coverage to other nodes (lines 23 and 31 in Fig 1) to construct N2(v), so those nodes also belong to the same cluster. When d exceeds DEPTH_THRESHOLD and the wait time has expired (line 33), the node creates its own cluster. Hence, every node can determine its cluster, and only one cluster.

Lemma 2 (Time-boundedness): The algorithm terminates in a finite amount of time.
Proof: Initially all nodes are NONMEMBER. At any instant of time t1, there exists at most one NONMEMBER node in a neighborhood having minimum (Weight, Degree, ID), because of the uniqueness of the metric. At time t2 ≥ t1 + (initial wait time), this NONMEMBER node turns into a core node and its neighbors turn into MEMBER after the message propagation delay. Thus the number of NONMEMBER nodes decreases by at least 1. In the worst case, after T ≤ |N| × (message propagation delay) all NONMEMBER nodes are exhausted. Hence the lemma follows.

Theorem 1 (Correctness and time-boundedness): The algorithm generates a d-hop dominating set within a finite time.
Proof: The core node expands its coverage to d hops (Lemma 1). Moreover, from Lemma 2, the set of CORE nodes forms a dominating set. Hence, the algorithm generates a d-hop dominating set.

Theorem 2: No two nodes in D are neighbors, where D is the d-hop dominating set generated by execution of the algorithm.
Proof: By contradiction. Assume that two core nodes i, j ∈ D, the dominating set, are neighbors. During execution of the algorithm, the lowest-weighted node is selected as core among its neighbors. So, if i and j are both in D, each should have
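The recursive definition of Nd(v) above translates directly into code. This is an illustrative sketch under an assumed adjacency-dict graph representation, taking N1(u) to be u together with its 1-hop neighbors.

```python
# Sketch of the d-hop neighbor set:
# N0(v) = {v},  Nd(v) = (union of N1(u) for u in Nd-1(v)) ∪ Nd-1(v).

def n_d(adj, v, d):
    """adj: dict node -> set of 1-hop neighbors; returns Nd(v) as a set."""
    nd = {v}                               # N0(v)
    for _ in range(d):
        expanded = set(nd)                 # keep Nd-1(v)
        for u in nd:
            expanded |= adj[u] | {u}       # add N1(u) for every u in Nd-1(v)
        nd = expanded
    return nd
```

On a path graph 1-2-3-4, for instance, N2(1) reaches exactly the nodes within two hops of node 1.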
minimum (Weight, Degree, ID) among their neighbors, which is not possible. Hence, the theorem is proved.

4. Virtual Backbone Based Reliable Multicasting Protocol

Now we describe the proposed reliable multicasting protocol that uses the virtual infrastructure constructed by the above-mentioned approach. For guaranteed delivery of data packets, we use an ACK-based technique for ensuring reliable one-hop data delivery and a receiver-initiated NACK-based approach for data recovery at the receiver side.

4.1 One Hop Reliable Packet Delivery: ACK based approach

One of the main goals of this research work is to reduce the bottleneck at the sender node by limiting retransmission requests. To achieve this goal, an ACK-based retransmission mechanism for successful packet delivery to the neighbor node is used. Consider the scenario shown in Fig 7, in which sender S is sending data packets to receiver R via intermediate nodes i1 and i2. It is observable that packet no. 4 is lost between node S and i1. By the one-hop reliable packet delivery scheme, this loss is identified and recovery is attempted locally. In a receiver-initiated NACK-based scheme, the receiver would try to recover the data packet (packet no. 4) by sending a retransmit request (NACK) over the links {(R, i2), (i2, i1), (i1, S)}. Hence, the links {(R, i2), (i2, i1)} would be overburdened by retransmission requests. We believe that this overburden can be reduced by providing an ACK-based system, in which loss recovery can be done at the one-hop neighbor. Moreover, it reduces the bottleneck at the sender due to NACK messages.

[Figure 7. Packet loss during transmission from sender to receiver]

In order to recover from one-hop packet loss, every node maintains a packet-sent table. As soon as it transmits/forwards a data packet, it makes an entry in that table with ACK_Recv = 0. As soon as a node receives a data packet, it sends back an ACK to the sender. Upon receiving an ACK, the sender sets the ACK_Recv flag in the packet-sent table for the particular packet and the corresponding destination node (for the <MCastAddr, SeqNo, TimeStamp, DestID> combination). Periodically, every node checks its packet-sent table and retransmits the data packet to the one-hop destination node if it has not received any ACK from that node. This process continues for MAX_RETRY times (in our simulation, this value is 2). After MAX_RETRY attempts, it presumes that the neighbor node has gone out of range.

4.2 Intra Cluster Operation

The reliable multicasting protocol explained in this research work performs its operation at two levels – the intra-cluster (local) and the inter-cluster (backbone) level. Inter-cluster operation is based on the concept of a rendezvous point, where all the control messages and data packets are directed to the cluster-head of the cluster the node belongs to.

4.2.1 Multicast Joining

Each node maintains a Multicast Member Joining Table (MMIT) that keeps information about the child nodes that participate in the multicast process. To join a multicast group, a multicast member node registers itself with its own cluster-head. For this purpose, node i announces its existence by sending a JOIN_REQ message to its parent (ParentIDi). The parent node updates the MMIT, sets itself as a forwarder and forwards the message towards the cluster-head. This process continues till the JOIN_REQ message reaches the cluster-head. As soon as the cluster-head receives a JOIN_REQ message, it makes an entry in its MMIT. Thus the forwarding nodes create a multicast tree rooted at the core.

Similarly, every node sends a LEAVE_REQ message whenever it wants to leave the group. This LEAVE_REQ message is processed in the same way as a JOIN_REQ message. As soon as any parent node/cluster-head hears a LEAVE_REQ message, it simply removes the entry of the node from its MMIT.

4.2.2 Data Transmission

After construction of the backbone and the forest of varying-depth trees, any node can start to transmit data packets via the backbone or the tree. If the source is a tree member, it can simply forward the data packet to its parent node and send it to all the multicast member child nodes (consulting the MMIT). Upon receiving a packet from a child node, the parent node forwards the data packet until it reaches the cluster-head. In this data forwarding process, if there is any multicast member on the path towards the cluster-head, it simply
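The one-hop ACK scheme of Section 4.1 can be sketched as a packet-sent table keyed by the <MCastAddr, SeqNo, TimeStamp, DestID> combination. This is an illustrative sketch with assumed class and field names, not the paper's code; only the MAX_RETRY value of 2 comes from the paper.

```python
# Sketch of the packet-sent table with periodic retransmission up to
# MAX_RETRY, after which the neighbor is presumed to be out of range.

MAX_RETRY = 2  # value used in the paper's simulations

class PacketSentTable:
    def __init__(self):
        self.entries = {}  # key -> {"ack": 0 or 1, "retries": count}

    def record_send(self, mcast_addr, seq_no, timestamp, dest_id):
        """Entry made on transmit/forward, with ACK_Recv = 0."""
        key = (mcast_addr, seq_no, timestamp, dest_id)
        self.entries[key] = {"ack": 0, "retries": 0}

    def on_ack(self, key):
        """Mark ACK_Recv for the packet/destination combination."""
        if key in self.entries:
            self.entries[key]["ack"] = 1

    def periodic_check(self):
        """Return keys to retransmit; drop entries past MAX_RETRY
        (neighbor presumed out of range)."""
        retransmit, gone = [], []
        for key, entry in self.entries.items():
            if entry["ack"]:
                continue
            if entry["retries"] < MAX_RETRY:
                entry["retries"] += 1
                retransmit.append(key)
            else:
                gone.append(key)
        for key in gone:
            del self.entries[key]
        return retransmit
```

In this way loss recovery stays between one-hop neighbors, instead of NACKs traveling back over every intermediate link toward the source.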
receives the packet and sends it to the upper layer. When a core node receives a data packet from a downstream node, the packet is buffered and forwarded to other core nodes by the backbone-level multicast process. The multicast source inserts a sequence number, the multicast group address and a time stamp (packet creation time) in every outgoing packet to uniquely identify it.

4.2.3 NACK based Error Recovery

The one-hop packet delivery mechanism improves the packet delivery ratio between a pair of nodes, but this supportive mechanism alone may not guarantee successful delivery of a packet to the destination node in highly dynamic situations. Hence, a NACK-based error recovery mechanism is adopted at both the local and the backbone level.

In our local loss recovery mechanism, the cluster-head acts as the recovery node. Hence, unlike other systems, our approach does not need an explicit search for a recovery node. Instead of sending a retransmission request as soon as it detects a lost packet, a receiver periodically informs the cluster-head about the packets it has not yet received. The major advantage of this mechanism is that it helps reduce NACK explosion at the core node. Each NACK message includes a sequence number R and a vector V. The sequence number R indicates that all packets up to R have been received successfully, and each flag in the vector corresponds to the sequence number of a lost packet.

4.3 Inter Cluster Operations

The previous section explained how to perform the multicast operation inside a cluster. To achieve data forwarding among different clusters, a core receiving a data packet must distribute it among the other core nodes.

4.3.1 Data Transmission

Any node in the backbone may play one of the following two roles – Core and Forwarder. The core node may be a multicast source or receiver, or it may simply act as a distributor of data packets among other nodes.

If a core node is a source node, it sends the data packet to all other cluster-heads via forwarder node(s). Moreover, it sends the data packet to all multicast member children within the cluster by consulting the MMIT. Whenever a cluster-head receives a data packet for the first time, it buffers the packet, sends back an ACK to the sender and performs cluster-level multicasting. Moreover, it forwards the data packet to other nearby core nodes listed in its Core Information Table (CIT). If the received packet is an older one, the packet is simply discarded and an ACK is sent back to the sender. Upon receiving a data packet, a forwarder node simply delivers the data packet on the link towards the cluster-head.

4.3.2 NACK Based Error Recovery

As soon as a cluster-head receives a new data packet, it buffers that packet to satisfy NACK requestors. Due to the inherent characteristics of MANETs, a cluster-head may need to pull a data packet from another cluster-head to satisfy the needs of its own cluster members. For this purpose, a NACK-based error recovery mechanism is required at the backbone level. Upon receiving a NACK message from downstream, the cluster-head checks the availability of the data packet in its buffer. If the packet is available, it sends the data packet back to the requester. Otherwise, it makes an entry in the NACK_Table (a table that keeps track of the NACK messages received by a node), generates a NACK request message putting its own ID in the requestor field, and hands the request message over to the neighbor leading to a nearby cluster-head. Thus, it helps in reducing the lost-packet recovery latency for the receiver.

5 Simulations and Performance Evaluation

To analyze the performance of the proposed approach, we have conducted experiments using GloMoSim 2.02, a library-based simulator designed by Scalable Network Inc. for mobile networks. In our experiments, 50 nodes are placed randomly in a 1000 m by 1000 m area. Constant bit rate (CBR) traffic is generated by the application, with each payload being 512 bytes. UDP is used at the transport layer. 802.11 is used as the MAC layer protocol with the two-ray path loss model. All the experiments are run for 15 minutes of simulation time.

5.1 Performance Analysis

To analyze the performance of the proposed reliable multicast protocol, the following metrics were used –

Average Packet Delivery Ratio: defined as the number of received data packets divided by the number of data packets generated. This metric measures the effectiveness and reliability of the protocol.
Average End-to-End Delay: the average delay over all the packets received by all the receivers. This metric evaluates the protocol's timeliness and efficiency.

First we study the behavior of the proposed approach with varying mobility. For that purpose we fixed the radio range of each node at 12.0 dB, with a packet inter-departure interval of 200 ms and 50 nodes in the vicinity.

Fig 8 plots the packet delivery ratio against node mobility. The packet delivery ratio achieved was above 99%. However, with increasing mobility (mostly above 30 m/sec) and an increasing number of groups, there is a slight downward trend in packet delivery ratio (less than 0.3%). This degradation is due to two major factors:
• In extreme mobility situations, there is instability in the backbone, as the number of re-affiliations increases. So, for a fraction of a second (during cluster-head handoff), a node may not be able to receive any packet.
• During the loss recovery process, more and more control messages are generated by the network, which in turn increases congestion in the network; as a result, more packets are lost.

[Figure 8: Effect of mobility on PDR (inter-departure interval 200 ms, Tx range 12 dB); curves for multicast groups 1, 2 and 3]

Fig 9 shows the end-to-end delay comparison for different multicast group sizes. It is clearly observable that there is a sharp increase in end-to-end delay with increasing mobility and number of senders. The end-to-end delay experienced by the receivers in a small group is encouraging; Multicast Group 1 in Fig 9 is the evidence of this conclusion.

[Figure 9: Effect of mobility on end-to-end delay; curves for multicast groups 1, 2 and 3]

Fig 10 and Fig 11 show the packet delivery ratio with varying inter-departure time, with mobility 0 and mobility 50 m/sec respectively. In both situations, the packet delivery ratio achieved by the proposed approach is above 99%. However, with decreasing packet inter-departure interval and an increasing number of groups, there is a slight downward trend in packet delivery ratio (less than 0.1%). This degradation is due to the high contention the network experiences as the traffic rate and network load grow.

[Figure 10: Effect of traffic rate on packet delivery ratio (mobility 0, Tx range 10); curves for multicast groups 1, 2 and 3]
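The two metrics of Section 5.1 can be computed directly from a packet trace. This is an illustrative sketch under an assumed trace format (per-packet creation and reception timestamps), not part of the GloMoSim setup.

```python
# Sketch of the evaluation metrics: packet delivery ratio and
# average end-to-end delay over all receivers.

def packet_delivery_ratio(generated, received):
    """Received data packets divided by generated data packets."""
    return received / generated if generated else 0.0

def average_end_to_end_delay(deliveries):
    """deliveries: list of (creation_time, reception_time) pairs over all
    packets received by all receivers; returns the mean delay."""
    if not deliveries:
        return 0.0
    return sum(rx - tx for tx, rx in deliveries) / len(deliveries)
```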
[Figure 11: Effect of traffic rate on packet delivery ratio (mobility 50 m/sec, Tx range 10); curves for multicast groups 1, 2 and 3]

Fig 12 and Fig 13 show the average end-to-end delay with varying inter-departure time, with mobility 0 and mobility 50 m/sec respectively. As seen in Fig 12, the end-to-end delay consistently increases as the packet inter-departure interval decreases and the number of group members increases. Although it shows a promising result for higher inter-departure intervals, the end-to-end delay increases up to approximately 2 seconds with a larger number of sources and a lower packet inter-departure interval.

[Figure 12: Effect of traffic rate on end-to-end delay (mobility 0); curves for multicast groups 1, 2 and 3]

[Figure 13: Effect of traffic rate on end-to-end delay (mobility 50 m/sec, Tx range 10); curves for multicast groups 1, 2 and 3]

6. Conclusion and Future Directions

This research work proposed a reliable multicasting approach for mobile ad hoc networks. We used a different cluster-head selection criterion to find the d-hop dominating set; for load balancing, the highest-degree node is given less weight-age in this process. We then used the virtual infrastructure for reliable multicasting. The error control mechanism combines a one-hop ACK-based reliable data packet delivery approach, which helps in reducing global packet retransmission requests and hence increases the performance of the network. Unlike other approaches, the localized loss recovery mechanism used in this approach does not demand explicit searching for a recovery node. Besides protecting the source from unnecessary retransmission requests, it reduces the global congestion caused by control messages and retransmissions. Through extensive simulation, we evaluated the performance of the proposed approach for a wide range of MANET scenarios. It shows a promising packet delivery ratio of up to 1, which was one of our main objectives.

7. References

[1] T. Gopalswamy, M. Singhal, D. Panda and P. Sadayappan, "A Reliable Multicast Algorithm for Mobile Ad Hoc Networks," in Proceedings of ICDCS 2002, July 2-5, 2002.

[2] S. Paul, K. K. Sabnani, J. C. Lin and S. Bhattacharyya, "Reliable Multicast Transport Protocol (RMTP)," IEEE
Journal on Selected Areas in Communications, Vol. 15, no. 3, Apr 1997, pp. 407-421.

[3] J. Macker and W. Dang, "The Multicast Dissemination Protocol (MDP) Framework," IETF Internet Draft, draft-macker-mdp-framework-00.txt.

[4] K. Obraczka, "Multicast Transport Mechanisms: A Survey and Taxonomy," IEEE Communications Magazine, Vol. 36, no. 1, Jan 1998, pp. 94-102.

[5] T. Speakman, N. Bhaskar, R. Edmonstone, D. Farinacci, S. Lin, A. Tweedly and L. Vicisano, "PGM Reliable Transport Protocol Specification," IETF Internet Draft, draft-speakman-pgm-spec-03.txt.

[6] J. G. Jetcheva and D. B. Johnson, "Adaptive Demand-Driven Multicast Routing in Multi-hop Wireless Ad Hoc Networks," MobiHoc 2001, CA, USA, 2001.

[7] J. G. Jetcheva and D. B. Johnson, "A Performance Comparison of On-Demand Multicast Routing Protocols for Ad Hoc Networks," reports-archive.adm.cs.cmu.edu/anon/2004/CMU-CS-04-176.pdf, 2004.

[8] Y. Zhu and T. Kunz, "MAODV Implementation for NS-2.26," Systems and Computing Engineering, Carleton University, Technical Report, 2004.

[9] K. Obraczka and K. Viswanath, "Flooding for Reliable Multicasting in Multi-hop Ad Hoc Networks," Wireless Networks, Kluwer Academic Publishers, pp. 627-634, 2001.

[10] A. Sobeih, H. Baraka and A. Fahmy, "ReMHoc: A Reliable Multicast Protocol for Wireless Mobile Multihop Ad Hoc Networks," Consumer Communication and Networking Conference (CCNC 2004), 2004.

[11] R. Chandra, V. Ramasubramanian and K. Birman, "Anonymous Gossip: Improving Multicast Reliability in Mobile Ad-hoc Networks," International Conference on Distributed Computing Systems, pp. 275-283, April 2001.

[12] V. Rajendran, Y. Yi, K. Obraczka, S.-J. Lee, K. Tang and M. Gerla, "Reliable, Adaptive, Congestion-Controlled Ad hoc Multicast Transport Protocol: Combining Source-based and Local Recovery," UCSC Technical Report, 2003.

[13] K. Tang, K. Obraczka, S.-J. Lee and M. Gerla, "A Reliable, Congestion-Controlled Multicast Transport Protocol in Multimedia Multihop Networks," in WPMC, 2002.

[14] T. Gopalsamy, M. Singhal, D. Panda and P. Sadayappan, "A Reliable Multicast Algorithm for Mobile Ad Hoc Networks," 22nd IEEE International Conference on Distributed Computing Systems (ICDCS 02), 2002.

[15] C. Jaikaeo and C.-C. Shen, "Adaptive Backbone-Based Multicasting for Ad Hoc Networks," IEEE International Conference on Communications (ICC), New York City, April 28 - May 2, 2002.

[16] Inn Inn ER and W. K. G. Seah, "Mobility-Based d-Hop Clustering Algorithm for Mobile Ad Hoc Networks," in Proceedings of the IEEE Wireless Communications and Networking Conference, March 21-25, 2004.
ADCOM 2009
DISTRIBUTED SYSTEMS

Session Papers:

1. Achyanta Kumar Sarmah, Smriti Kumar Sinha and Shyamanta Moni Hazarika, “Exploiting
Multi-context in a Security Pattern Lattice for Facilitating User Navigation”

2. Sundar Raman S and Varalakshmi P, “Trust in Mobile Ad Hoc Service GRID”

3. Soumitra Pal and Abhiram Ranade, “Scheduling Light-trails on WDM Rings”

Exploiting Multi-context in a Security Pattern Lattice for Facilitating User Navigation

Achyanta Kumar Sarmah(1,2), Smriti K. Sinha(1) and Shyamanta M. Hazarika(1)

(1) School of Engineering, Tezpur University, Assam, India, (achinta,smriti,smh)@tezu.ernet.in
(2) Rajiv Gandhi Indian Institute of Management, Shillong, Meghalaya, India, aks@iimshillong.in

Abstract

Repositories of Security Patterns (SPs) developed over the years are based on security templates that are essentially different from one another, each trying to capture security solutions at different levels of abstraction under different perspectives. This lack of uniformity amongst the repositories leaves the user without a proper organization of SPs to choose from. In addition, no representation of SPs has facilitated retrieval of inter-dependent patterns in a context, even though patterns are always related to one another in a context. In this paper, we carry forward the idea of a Security Pattern Lattice (SPL) proposed in our previous work [1] and attempt to reach an SP template that would cover different existing SP repositories. We conceptualize a Security Concept to comprise an extension of SPs and an intension of Security Requirements. Also, we introduce the concept of a Multi-context in an SPL that would allow a user to search for a concept with a given SP and retrieve its related patterns.

Keywords: Multi-context, Security Pattern, Security Requirement, SPL

I. Introduction

Experts and developers working on security are primarily concerned with the architecture and design of security solutions. In contrast, a user wants a convenient and simple way of incorporating security solutions into a system without needing to understand the complexity of that architecture and design. SP is an engineering approach to bridge this gap. It documents a reusable solution to some recurring security problem in a context. In essence, it captures the expert's knowledge to address a security problem.

For a structural design of a security concept we can attach certain conceptual elements to an SP and define a template. The available collections and repositories of SPs are found to follow some template structure. However, these templates define SPs at different levels of abstraction, and the SPs themselves differ in their perspectives. As such, a common structure to encompass all these patterns is still missing, one which would have allowed us to select, implement and deploy a pattern at the user level without requiring an understanding of the engineering and design details of the pattern. In our previous work [1] in this regard, we organized Security Patterns as a Concept Lattice. Carrying this forward, we attempt to reach an SP template that would cover different existing SP repositories at different levels of abstraction and perspective. We introduce the concept of a Multi-context in an SPL that would allow a user to navigate to a concept in the SPL and then select the exact pattern wanted.

II. Related works on SP

A host of SPs have been proposed to date. The authors in each case define a security template at a certain level of abstraction and then enumerate SPs from some perspective. For example, in [2], seven patterns related to the application and network domains are proposed at the design level. In [3], twenty-three patterns related to J2EE applications, identity management, Web services and service provisioning are proposed at the deployment and functional level. In [4], fourteen patterns related to the application and network domains have been proposed at the implementation and deployment level. Apart from such attempts at enumerating security patterns, there have also been attempts at reaching a common repository of these patterns, as in [5], [6]. However, these repositories serve chiefly as documentation of the existing patterns and their templates, and are void of a structured algorithmic formalism that would allow a developer to directly extract his/her required pattern from the repository and implement
it in any platform. This chiefly happens for two reasons: firstly, the templates used are different from one another and specifically suit only the abstraction and perspective for which they are used; and secondly, a hierarchy-based organizational scheme for security patterns, which are related to one another, is missing. A hierarchy-based organizational scheme allows users a one-point navigation facility to search for patterns based on any characteristic. In [1], we decided on a security template of our own based on Christopher Alexander's definition of pattern and attempted organizing patterns by exploiting results from FCA and constructing a concept lattice of security patterns, the SPL. Here we propose a generic template as discussed above.

III. SP template

A. SP templates from related works

In this section we summarize some of the templates used in the field of SP. Our aim in reviewing these templates is not to reach a template structure that would encompass all of them and give us uniformity of representation; it is rather to detect the various perspectives of SPs. We then attempt to capture these perspectives in the semantics of our proposed template and reach an algorithmic structure. Uniformity of representation would allow proper and efficient selection, while the algorithmic structure would help us translate a pattern into any implementing framework.

1) Markus Schumacher's pattern template: Based on the terminology provided in the Common Criteria, Schumacher in [7] proposes a template with the elements Name, Context, Problem and Solution, along with some other optional elements which could improve the comprehension of an SP.

2) Kienzle and Elder's template: Kienzle et al. in [8] consider four levels of abstraction for SPs, viz:
a) Concepts: These encompass general strategies and are represented by abstract nouns that could not be directly implemented by developers, for example "least privilege".
b) Classes of patterns: A class represents a general problem area that could have multiple solutions.
c) Patterns: A pattern is specific enough to allow basic properties to be specified and trade-off analysis to be conducted against other patterns.
d) Examples: An example is typified by sample code. It is the most immediately useful, but in a very narrow context.
The authors take an object-oriented approach and propose a template at the third level of abstraction with the elements Pattern name, Abstract, Aliases, Problem, Solution, Static structure, Dynamic structure, Implementation issues, Common attacks, Known uses, Sample code, Consequences, Related patterns, References. The elements of this template exhibit different perspectives:
• Documentation: Pattern name, Abstract, Aliases, Known uses, Consequences.
• Functionality: Problem, Solution.
• State: Static structure, Dynamic structure.
• Environmental: Related patterns, References, Common attacks.
• Implementation: Implementation issues, Sample code.

3) Sun's Core SP template for JEE by Nagappan et al.: Motivated by the concept of Security by Default [3], a notion that ensures security at all OSI levels, the authors put forward a security design methodology with the following stages:
a) Define security requirements
b) Candidate security architecture
c) Perform risk and trade-off analysis
d) Identify SPs and create security design
e) Implement prototype
f) Validation testing and auditing
The security template used for security patterns in this case has Problem, Forces, Solution, Structure, Strategies, Consequences, Reality checks, and Security actors and risks as its elements. The Reality check element in this template considers the perspective of testing resources for applicability of a pattern, though at a very conceptual level.

4) Microsoft's Web Service Security template: Microsoft classifies patterns for Web Service security into Architectural, Design and Implementation patterns. For this purpose it considers the elements Name, Context, Problem, Forces and Solution, with semantics similar to the elements in other templates.

Most of these patterns adhere to the basic elements of a security pattern, viz: Name, Context, Problem and Solution, with the addition of some other optional elements. In our case, we consider a pattern to have an algorithmic representation. We therefore build upon these basic elements and propose a template as follows:

CONTEXT - These are the preconditions that need to be met for applying the SP. For example, authentication always needs to be performed before authorization. A precondition here exhibits an inter-pattern relationship or dependencies between patterns. It could be implemented as a vector Vect<pattern, perspective>, where each element of the vector is a tuple <pattern, perspective> and the pattern in question is applicable for one or more of the elements of Vect.
3
PROBLEM - It defines a situation that would require this pattern to be applied. For example, authentication elements viz: cards, passes etc. tend to become namesakes after some time.

SOLUTION - These are the security algorithms/measures to be applied for solving the above problem. For example, digital signature is a solution for authorization.

CONSEQUENCES - The context of a SP would usually be given in terms of some variables or objects. Consequences give us the change in the values of the context, or introduce/deduce them.

Apart from these, we could have a number of optional elements which could serve as selection criteria for patterns from a repository. Some of these optional elements may be:
• Aliases
• Known uses or examples
• Abstract
• Sample code
• Common attacks and risks

IV. Security Requirements

SRs are in general considered to be those system security policies that constrain functional requirements. They are expected to provide information about the desired level of security for a system. While identifying and specifying SRs, a common problem is that they tend to be accidentally replaced with security-specific architectural constraints that may unnecessarily prevent the security team from using the most appropriate security mechanisms (patterns in our case) for meeting the true underlying SRs. Keeping this in mind, we need a structured way of representing security requirements that could distinguish between the security engineering artifacts and the domain concepts that the requirements satisfy. We base our SRs on the CIA model of information security and define a template for identifying them as follows:
• CONCERN: This is the security concern represented by the requirement. Concerns in our case have three components based on the CIA model of information security, viz: Confidentiality, Integrity and Availability. Extending these components to other perspectives of security, we may have an enumeration of security concerns as Availability, Identification, Authentication, Immunity, Integrity, Intrusion, Non-repudiation, Privacy, Security auditing, Survivability, Physical protection and System maintenance security.
• ENGINEERING ARTIFACT: These are objects or artifacts that are used for enforcing some security requirement in a system. Any requirement may have one or more engineering artifacts related to it. For example, authorization is usually enforced by an Access Control List or User Role.
• DOMAIN CONCEPT: These are concepts like role, session and subject which are specific to a certain domain and represent some security requirement.

We illustrate the relationships between the above elements with the example in Figure 1.

[Fig. 1. Elements of Authorization [9]: the Authorization concern is refined by the engineering artifacts Access Control List, RBAC and Capability, which in turn depend on the domain concepts Subjects, Objects, Sessions, Roles and Privileges.]

V. Security Pattern Lattice (SPL)

SPL is a lattice-theory-based approach to organizing SPs. Over the years, a corpus of SPs has been developed which is expanding continuously. The need of the hour is to address how to organize the security patterns within such a corpus so that application developers can navigate through the library of patterns and select suitable patterns without ambiguity. However, pattern organization is a nontrivial task, and the problem is not limited to SPs but extends to the corpus of patterns in other domains. SPs can be defined as a partially ordered set, which in turn can be organized in the form of a lattice. This has been exploited in our work [1] to present the SPL using results from FCA. We consider a SP template with elements <NAME, ALIASES, TASK, PARAMETER, EXHIBIT>, and define a Trust Element (TE):

Definition 1: A Trust Element is a property or a statement about an entity in a context which is otherwise unknown and whose absence makes the entity vulnerable to certain attacks. Here an entity could be a subject, an object or a situation in a given context.

Observing the fact that the set of TEs and the set of SPs exhibit a Galois connection, where minimizing one of them maximizes the other and vice versa, we define security context and security concept based on the principles of FCA and build a formal concept lattice for SPs in the application context, calling it the SPL. The set of SPs serves as the extension while the set of TEs serves as the intension here.
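The Galois connection between SPs and TEs can be illustrated with a toy formal context. The derivation operators below follow standard FCA; all pattern and TE names here are invented for the example and are not taken from the paper:

```python
# Toy formal context: which security patterns (objects) exhibit which
# trust elements (attributes). Names are illustrative only.
CONTEXT = {
    "Authentication": {"identity_verified"},
    "Authorization":  {"identity_verified", "access_mediated"},
    "Audit":          {"actions_logged"},
}

def intent(patterns):
    """TEs common to all given SPs (derivation operator ')."""
    sets = [CONTEXT[p] for p in patterns]
    if not sets:  # intent of the empty set is all attributes
        return {t for s in CONTEXT.values() for t in s}
    return set.intersection(*sets)

def extent(tes):
    """SPs exhibiting all given TEs (dual derivation operator)."""
    return {p for p, s in CONTEXT.items() if set(tes) <= s}

# Galois connection: enlarging the set of TEs shrinks the set of SPs.
assert extent({"identity_verified"}) == {"Authentication", "Authorization"}
assert extent({"identity_verified", "access_mediated"}) == {"Authorization"}

# A formal (security) concept is a pair (A, B) with intent(A) = B
# and extent(B) = A:
A = extent({"identity_verified"})
B = intent(A)
assert extent(B) == A
```

Each such closed (extent, intent) pair is one node of the SPL; the full lattice is obtained by ordering these pairs by inclusion of extents.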
Thereafter, we attempt to classify SPs within the SPL by the use of scaling techniques.

VI. Multi-context in a SPL

We extend our idea of a SPL to accommodate search for a specific pattern in a security concept as follows. The conceptual TE is formalized as a SR with a template. The user searches for some SP on the basis of a SR. This takes the user to a security concept in the SPL that has SPs satisfying the provided SR. For reaching the specific SP desired by the user, he/she launches a second-level search with some other parameter that would discriminate amongst the SPs in the security concept at hand. In the approach presented here, the second-level search is done on the basis of a precondition, or in other words a related pattern. The related pattern submitted by the user is searched for in the precondition vector of each of the patterns in the extent of the security concept at hand. The search returns all those patterns whose precondition vector contains the related pattern submitted by the user in the second-level search. From our template of SP, the precondition element is multivalued, so we represent a security concept as another lattice, say <G_c, M_c, I_c>, where
G_c = set of security patterns in the present concept,
M_c = union of all preconditions belonging to all patterns in G_c,
I_c = set of relationships between elements of G_c and M_c that tells us which precondition is applicable to which pattern.
In this context, the atomic concept having only the desired precondition as its intension would give us the necessary SPs.

VII. Generating and navigating in a multi-context SPL

A. Concept Generating Algorithms

Generation and navigation in a concept lattice involve iterating through all possible concepts in a given context. Hence, the complexity of any algorithm that generates or navigates in a concept lattice depends upon the number of concepts in the given context. Again, the number of concepts is exponential in the size of its input context. This case is similar to that of finding the power set of a set, where the complexity increases exponentially with the size of the set. So, from the standpoint of worst-case complexity, an algorithm generating all concepts and/or the concept lattice can be considered optimal if it generates the lattice with polynomial time delay¹ or in space linear in the number of all concepts [10].

Algorithm 1 presented here is a concept generating algorithm based on the Next-Closure algorithm [11], [12] by Ganter.

Boundary conditions for algorithm 1:
• Single-attribute concepts of the context are given.
• No new attribute or object is added during concept generation.

Data structures involved and input to algorithm 1:
• OBJECTS = array of objects.
• ATTRIBUTES = array of attributes.
• NO_ATR = total number of attributes.
• NO_OBJ = total number of objects.
• CONCEPT_ARRAY = an array whose elements are CONCEPTs.
• SUPREMUM = the CONCEPT with all objects and a null attribute set.
• MAXEXTENT = the extent with all objects from OBJECTS as members.
• SINGLEATTRIBUTECONCEPTS = list of concepts whose intents have only one attribute.

Output from algorithm 1:
CONCEPTLATTICE - the generated lattice of the mined concepts.

Procedures involved in algorithm 1:
• VAL: function to build the bitstring from the indices of a subset.
• UPDATESUBINTENTS: function to update the list of subintents. Input parameters: INTENTS (list of intents) and SUBINTENT (the subintent for all intents in INTENTS).
• RETRIEVEPOSITIONS: function to retrieve the set bit positions in an integer.

Classes involved in algorithm 1:
• SUBSET_ITERATOR (a class which allows us to iterate through the subsets of a given set)
Data:
A) SET_SIZE: size of the parent set.
B) SUBSET_SIZE: size of the subsets.
C) SUBSET_INDICES: list of indices of the current subset.
D) FIRSTTIME: flag to check whether the iterator has been called for the first time.
Functions:

¹An algorithm for listing a family of combinatorial structures is said to have polynomial delay if it executes at most polynomially many computation steps before either outputting the next structure or halting.
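The bit-level procedures above (VAL and RETRIEVEPOSITIONS) operate on the bitstring encoding of subsets. The following is an illustrative Python sketch of these two helpers, not the authors' code:

```python
def val(indices):
    """VAL sketch: build the bitstring (here an int) from the
    indices of a subset."""
    bits = 0
    for i in indices:
        bits |= 1 << i
    return bits

def retrieve_positions(bits):
    """RETRIEVEPOSITIONS sketch: return the set bit positions
    of an integer, lowest position first."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

# The two helpers are inverses of each other:
assert val([0, 2, 3]) == 0b1101
assert retrieve_positions(0b1101) == [0, 2, 3]
```

With subsets encoded this way, the extent intersection in step 6 of Algorithm 1 reduces to bitwise AND of extent bitstrings.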
A) INCREMENT_SUBSET_INDICES(): increments the present subset indices to produce the next subset indices.
B) NEXT_SUBSET(): function to return the indices of the next subset on the fly.

• CONCEPT (a class that represents a concept)
Data:
A) CONCEPT: a 16-bit integer that represents a concept. Its higher-order NO_ATR bits represent the intent and its lower-order NO_OBJ bits represent the extent.
B) SUB_INTENTS: list of intents of the concepts subsumed by this CONCEPT.
C) SUP_INTENTS: list of intents of the concepts that subsume this CONCEPT.
Functions:
A) GETEXTENT(): function to return the extent of this CONCEPT.
B) GETINTENT(): function to return the intent of this CONCEPT.
C) ADD_SUPINTENTS(): function to add superintents of this CONCEPT.
D) ADD_SUBINTENTS(): function to add subintents of this CONCEPT.
E) ADD_SUPINTENT(): function to add a single intent of this CONCEPT.

B. Extending algorithm 1 for facilitating multi-context in the concept lattice

In algorithm 2 described below, we extend algorithm 1 to facilitate multi-context in the generated concept lattice and allow the user to search for a pattern based on two criteria: a security requirement and related patterns.

Boundary conditions:
Same as for algorithm 1.

Data structures involved and input to the algorithm:
The data structures and input to algorithm 1 are extended with the following:
• NO_PERS = total number of perspectives.
• VECT_PRECON = vector of preconditions whose elements are objects of the class PRECON described below.
• VECT_SELECTED_SP = vector that holds the selected SPs.

Output from the algorithm:
VECT_SELECTED_SP populated with the selected SPs corresponding to the selection criteria provided, i.e. SECREQ and PREREQ.

Procedures involved:
Same as for algorithm 1.

Classes involved:
The set of classes for algorithm 1 is extended with the following classes:
• PRECON (a class that represents a precondition)
Data:
A) PATTERN: the pattern whose presence is required as the precondition.
B) PERSPECTIVE: the perspective in which PATTERN should be applied as a precondition for the parent pattern.
Functions:
A) isPRECON(PATTERN pat): returns true if pat is the same as PATTERN, false otherwise.
B) GETPAT(): function that returns the PATTERN.
C) GETPERSPECTIVE(): function that returns the PERSPECTIVE.

• SECPAT (a class representing a SP)
Data:
A) PROBLEM: textual definition of the security problem at hand.
B) SOLUTION: architectural and design specification for the pattern.
C) CONTEXT: a vector representing the preconditions.
D) CONSEQUENCES.
Functions:
A) GETNEXTPRECON(): function that returns the next PRECON for this SP.
B) MOVEFIRST_CONTEXT(): positions the CONTEXT vector at the first index.
C) MOVENEXT_CONTEXT(): positions the CONTEXT vector at the next index.
D) MOVEABS_CONTEXT(POS): positions the CONTEXT vector at the index POS.
E) isFIRSTPRECON(): boolean function that checks whether the CONTEXT vector is at its first position.
F) isLASTPRECON(): boolean function that checks whether the CONTEXT vector is at its last position.
G) isAFTERLASTPRECON(): boolean function that checks whether the next index is out of bounds.

Selection criteria for a pattern in the generated lattice:
• SECREQ: the security requirement for which an applicable pattern is to be searched for.
• PREREQ: the pattern which exists as a precondition for the pattern(s) applicable for the SECREQ provided.
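The PRECON and SECPAT classes described above might look as follows in Python. The method names mirror the text, but the fields and the cursor logic are an illustrative sketch, not the authors' implementation:

```python
class Precon:
    """PRECON: a precondition, i.e. a (pattern, perspective) pair."""
    def __init__(self, pattern, perspective):
        self.pattern = pattern          # PATTERN required as precondition
        self.perspective = perspective  # PERSPECTIVE in which it applies

    def is_precon(self, pat):           # isPRECON
        return pat == self.pattern

class SecPat:
    """SECPAT: a security pattern with a navigable CONTEXT vector."""
    def __init__(self, problem, solution, context, consequences):
        self.problem, self.solution = problem, solution
        self.context = context          # vector of Precon objects
        self.consequences = consequences
        self._pos = 0                   # cursor for MOVEFIRST/MOVENEXT

    def move_first_context(self):       # MOVEFIRST_CONTEXT
        self._pos = 0

    def get_next_precon(self):          # GETNEXTPRECON
        p = self.context[self._pos]
        self._pos += 1
        return p

    def is_after_last_precon(self):     # isAFTERLASTPRECON
        return self._pos >= len(self.context)

sp = SecPat("p", "s", [Precon("Authentication", "access control")], [])
sp.move_first_context()
assert not sp.is_after_last_precon()
assert sp.get_next_precon().is_precon("Authentication")
assert sp.is_after_last_precon()
```

The cursor methods are what algorithm 2 relies on when it scans a pattern's precondition vector for a match with PREREQ.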
Variables: CUR_ATTRIBUTE_INDICE = 0
1 repeat
2   Find all subsets of ATTRIBUTES of size CUR_ATTRIBUTE_INDICE+1.
3   CUR_INTENT = 0, CUR_EXTENT = MAXEXTENT.
4   foreach subset S from step 2 do
5     CUR_INTENT = S
6     CUR_EXTENT = intersection of the extents of the CONCEPTs from SINGLEATTRIBUTECONCEPTS corresponding to each element in S.
7     if CUR_EXTENT <> 0 then
        1) NEW_CONCEPT = {CUR_INTENT, CUR_EXTENT}
        2) Update SUP_INTENTS of NEW_CONCEPT with CUR_INTENT.
        3) Update SUB_INTENTS of each CONCEPT corresponding to each element in CUR_INTENT with NEW_CONCEPT.
        4) Append NEW_CONCEPT to CONCEPT_ARRAY.
8     end
9   end
10  CUR_ATTRIBUTE_INDICE = CUR_ATTRIBUTE_INDICE + 1; CUR_INTENT = 0
11 until CUR_ATTRIBUTE_INDICE = NO_ATR

Algorithm 1. Algorithm to produce a concept lattice from a given context
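Algorithm 1 can be rendered as a short executable sketch, using Python sets in place of the bitstring representation. The names (build_lattice, ctx) are illustrative, and the sub/superintent bookkeeping of steps 2)-3) is omitted:

```python
from itertools import combinations

def build_lattice(objects, single_attribute_extents):
    """Algorithm 1 sketch: enumerate attribute subsets by increasing size
    and keep those whose extent (intersection of the single-attribute
    extents) is non-empty. single_attribute_extents maps each attribute
    to its set of objects, i.e. the given single-attribute concepts."""
    attributes = sorted(single_attribute_extents)
    concepts = [(frozenset(), frozenset(objects))]        # SUPREMUM
    for size in range(1, len(attributes) + 1):            # CUR_ATTRIBUTE_INDICE+1
        for subset in combinations(attributes, size):     # subsets of that size
            extent = frozenset(objects)                   # start from MAXEXTENT
            for a in subset:                              # step 6: intersect
                extent &= single_attribute_extents[a]
            if extent:                                    # step 7: keep non-empty
                concepts.append((frozenset(subset), extent))
    return concepts

# Toy context: trust elements te1..te3 as attributes, patterns as objects.
ctx = {"te1": {"sp1", "sp2"}, "te2": {"sp2", "sp3"}, "te3": {"sp3"}}
lattice = build_lattice({"sp1", "sp2", "sp3"}, ctx)
assert (frozenset({"te1", "te2"}), frozenset({"sp2"})) in lattice
assert all(extent for _, extent in lattice)   # only non-empty extents kept
```

As in the paper's pseudocode, the enumeration is by subset size, so larger intents (smaller extents) appear later in CONCEPT_ARRAY.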

Variables: CUR_ATTRIBUTE_INDICE = 0
1 repeat
2   Find all subsets of ATTRIBUTES of size CUR_ATTRIBUTE_INDICE+1.
3   CUR_INTENT = 0, CUR_EXTENT = MAXEXTENT, TAR_CONCEPT = 0.
4   foreach subset S from step 2 do
5     CUR_INTENT = S
6     CUR_EXTENT = intersection of the extents of the CONCEPTs from SINGLEATTRIBUTECONCEPTS corresponding to each element in S.
7     if CUR_EXTENT <> 0 and SECREQ ∈ CUR_INTENT then
8       TAR_CONCEPT = {CUR_INTENT, CUR_EXTENT}, TAR_EXTENT = CUR_EXTENT
9       foreach pattern CUR_PAT in TAR_EXTENT do
10        if CUR_PAT is valid then
11          Search the PRECONs of CUR_PAT for a match with PREREQ; if a match is found, add CUR_PAT to VECT_SELECTED_SP.
12        end
13      end
14    end
15  end
16  CUR_ATTRIBUTE_INDICE = CUR_ATTRIBUTE_INDICE + 1
17 until CUR_ATTRIBUTE_INDICE = NO_ATR

Algorithm 2. Algorithm for searching in a multi-context SPL
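Algorithm 2's two-level selection (SECREQ filtering on intents, then PREREQ filtering on precondition vectors) can be sketched as follows, assuming a lattice given as (intent, extent) pairs; all names are illustrative:

```python
def search_spl(lattice, precons, secreq, prereq):
    """Algorithm 2 sketch: keep concepts whose intent contains SECREQ,
    then keep patterns in their extents whose precondition vector
    contains PREREQ. `precons` maps a pattern to its preconditions."""
    selected = []                                # VECT_SELECTED_SP
    for intent, extent in lattice:
        if secreq in intent:                     # SECREQ ∈ CUR_INTENT
            for pattern in extent:               # scan TAR_EXTENT
                if prereq in precons.get(pattern, ()) and pattern not in selected:
                    selected.append(pattern)     # PRECON matched PREREQ
    return selected

# Toy lattice: two concepts, each an (intent, extent) pair.
lattice = [({"authn_req"}, {"PasswordLogin"}),
           ({"authz_req"}, {"RBAC", "ACL"})]
precons = {"RBAC": ["Authentication"], "ACL": []}
assert search_spl(lattice, precons, "authz_req", "Authentication") == ["RBAC"]
```

Only RBAC survives both filters here: its concept's intent contains the requested SR, and its precondition vector contains the related pattern.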

Complexity of the proposed algorithm

The complexity of algorithm 2 for navigating in a multi-context SPL can be compared with that of subset generation for a set of cardinality n. Complexity of the subset generation algorithm = O(2^n), where n is the cardinality of the parent set.

In this case, the maximum number of concepts to be searched equals the total number of concepts in the SPL, which equals the cardinality of the power set of the set of attributes, ATTRIBUTES. But the actual number of valid concepts would be smaller in most cases, and the search may reach its target much earlier in the lattice.
Let X < 2^NO_ATR, where X is the number of valid concepts.
C1 = time required to perform an indexed search for an extent or intent = linear and constant.
C2 = time required for one computation = linear and constant.
C3 = time required to perform one bit-level operation = linear and constant.
C4 = time required to perform one conditional operation = linear and constant.
C5 = time to retrieve an extent or intent of a concept = linear and constant.
C6 = average time required to search linearly for an attribute in an intent = linear.
C7 = average time required to search for a given precondition in a pattern = linear.

We proceed to compute the analytical complexity of our algorithm in terms of X as follows.

Number of comparisons/computations to search for a concept:
1) Search for SECREQ in the intent of the current CONCEPT = C6.
2) Retrieve the extent if the search at step 1 is successful = C5 + 1.
In the worst case, the total number of comparisons that may be required to search for a concept in the SPL = X * (C6 + C5 + 1).

Number of comparisons to search for patterns in a concept with a given precondition:
1) Average number of patterns in the extent of a given concept = (NO_OBJ + 1)/2.
2) Average time required to search for a precondition in a pattern = C7.
3) Total time to search for patterns in a concept with a given precondition = C7 * (NO_OBJ + 1)/2.

Total number of comparisons/computations for selecting patterns in the SPL based on a SECREQ and a precondition = (comparisons to search for a concept) + (comparisons to search for patterns in a concept with a given precondition) = X * (C6 + C5 + 1) + C7 * (NO_OBJ + 1)/2.

Since C5 is linear and constant, taking C5 ≈ 1 we have:
number of comparisons/computations ≈ X * (C6 + 2) + C7 * (NO_OBJ + 1)/2.

In the above expression, C6 is linear in the number of attributes and C7 is linear in the number of patterns, so the terms C7 * (NO_OBJ + 1)/2 and (C6 + 2) are of linear order. Hence the overall complexity depends upon the order of X, i.e. it is O(X). Consequently, the algorithm is of polynomial order (and hence feasible) if X is of at most polynomial order. The order of X depends on the sparseness or denseness of the initial context that gives all the single-attribute concepts: the denser the context, the higher the order of X, and correspondingly lower for a sparse context.

VIII. Conclusions and future directions

Analysing the existing templates of SPs, we attempt to capture the various perspectives in which security patterns are considered. Building upon the SPL proposed by the authors earlier, which is a concept lattice organizing security patterns in a formal manner, the present work extends the SPL to include multiple contexts within it. For this, an all-encompassing template for security patterns is considered. Also, the applicability of all SPs is considered in light of various SRs based on the CIA model. Casting the SPs and SRs into an FCA framework as in the SPL, the authors propose a generating algorithm for the SPL, based on concept generating tools like Next-Closure, object exploration and attribute exploration. Observing the fact that patterns in a domain always work under a system of forces in which the patterns are related to one another, the idea of multi-context is introduced in the template of a SP as PRECON, a set of patterns related to the current one. Accordingly, the generating algorithm is extended to facilitate navigation for a particular pattern based on a provided SR and related SP. The present algorithm facilitates navigation based on only one characteristic of a SP, i.e. the CONTEXT or PRECONDITION. Also, the generating algorithm works only with a finite context: if a new SP is to be added, a rerun of the algorithm is required. Future work in this direction could be to make provision for adding a new concept to the existing lattice without the need to re-run the whole generating algorithm. Also, a facility could be incorporated in the navigation algorithm to allow search on any characteristic of a SP. With the algorithms at hand for generation and navigation, applications could be developed that would allow us to build a view of the generated lattice on any platform automatically.

References

[1] A. K. Sarmah, S. M. Hazarika, S. K. Sinha, "Security pattern lattice: A formal model to organize security patterns," in Proceedings of the 19th International Conference on Database and Expert Systems Application (2008) 292–296.
[2] J. Yoder, J. Barcalow, "Architectural patterns for enabling application security," in Proc. of PLoP 1997.
[3] C. Steel, R. Nagappan, R. Lai, Core Security Patterns, Prentice Hall, 2007.
[4] S. Romanosky, "Security design patterns part 1," v1.1 (2001).
[5] D. Kienzle, M. Elder, D. Tyree, J. Edward-Hewitt, "Security pattern repository," v1.0.
[6] M. Hafiz, "Security patterns and secure software architecture," 51st tutorial in the International Conference on Object Oriented Programming, Systems, Languages and Applications.
[7] M. Schumacher, Security Engineering with Patterns, Springer-Verlag New York, Inc., 2003.
[8] D. M. Kienzle, M. C. Elder, D. S. Tyree, "Introduction: security patterns template and tutorial," retrieved from citeseerx.ist.psu.edu/viewdoc/summary 10.1.1.131.2464.
[9] M. Dan, R. Indrakshi, R. Indrajit, H. S. Hilde, "Building security requirement patterns for increased effectiveness early in the development process," in Proc. of the Symposium on Requirements Engineering for Information Security, Paris.
[10] S. Kuznetsov, S. Obiedkov, "Comparing performance of algorithms for generating concept lattices," Journal of Experimental and Theoretical Artificial Intelligence (2002) 189–216.
[11] B. Ganter, "Two basic algorithms in concept analysis," Technical Report preprint, TH Darmstadt.
[12] B. Ganter, K. Reuter, "Finding all closed sets: A general approach," Order 8 (1991) 283–290.
Trust in Mobile Ad Hoc Service GRID

P. Varalakshmi, S. Thamarai Selvi, S. Sundar Raman
Department of Information Technology, Madras Institute of Technology, Anna University Chennai, Chennai-600044, India
Email: varanip@gmail.com, stselvi@annauniv.edu, ssrmit62@gmail.com

Abstract

A mobile ad hoc network (MANET) is a kind of wireless ad hoc network: a self-configuring network of mobile routers connected by wireless links. The routers/mobile nodes are free to move randomly and organize themselves arbitrarily; thus, the network's wireless topology may change rapidly and unpredictably. The mobile devices forming the ad hoc network can be laptops, PDAs and mobile phones. These devices can be integrated to form an infrastructure known as a grid. In order to effectively share and use these heterogeneous resources, we visualize a grid overlay on this network. The major challenges in forming a grid over an ad hoc network are its infrastructure-less nature and the trust implementation. Trust is calculated from the reputation of each node, and the reputation differs based on the behavior of the node for a given job. We use the existing architecture of the mobile ad hoc grid and add the trust computation. This makes the ad hoc grid nodes trust the remaining nodes for the assignment of jobs. Trust management is an effective method to maintain the credibility of the system and keep entities honest.

1. Introduction

Grid computing initially focused on large-scale resource sharing, innovative applications, and the achievement of high performance computing. Today, the Grid approach suggests the development of a distributed service environment that integrates a wide variety of resources with various quality-of-service capabilities to support scientific and business problem solving environments. A Grid service is a web service that provides a set of well defined interfaces and follows specific conventions. When we take Grid services to mobile devices, it becomes a real challenge to deploy such core Grid services, given their requirements in terms of space and computational power, especially in the case of hand-held devices.

A Mobile Ad Hoc Network (MANET) is an autonomous collection of mobile users (nodes) that communicate over relatively bandwidth-constrained wireless links. Due to mobility, the network topology may change rapidly and unpredictably over time. The network is decentralized, so network organization and message delivery must be executed by the nodes themselves, i.e., routing functionality will be incorporated into the mobile nodes. We need to use the underlying connectivity and routing protocols that exist on ad hoc networks in order to develop the Mobile Ad Hoc Service Grid (MASGRID). Thus MASGRID provides dynamic, secure, coordinated resource sharing among mobile devices and can be referred to as a "Mobile Virtual Organization". Implementing trust evaluation over MASGRID improves the overall performance and thus increases the security of the ad hoc GRID environment.

2. Proposed Work

In a traditional GRID environment, trust evaluation is carried out based on the job success rate and user feedback. But in MASGRID, in addition to the above factors, the reputation of a node can be evaluated based on mobility and power constraints. The power constraint can be evaluated based on the battery capability of the device considered as a resource. In MASGRID, each node acts as both grid user and grid resource provider. Hence each node has to maintain the evaluation factors in its own database.

2.1 System Architecture

The Mobile Ad Hoc Service GRID with trust evaluation for proper resource management and job submission requires the present MASGRID architecture to be reconfigured. The architecture consists of two services: 1) Resource Discovery Service and 2) Resource Access Service. Along with the above services, a watchdog service must be added.

The Resource Discovery Service (RDS) and Resource Access Service (RAS) use the underlying ad hoc network protocols for their functionality.

A Resource Lookup Table (RLT) is used to maintain the resource information accessible to the particular node.
Resource Discovery Service – the RDS is a service that is used to find a particular resource node using the RLT.
Resource Access Service – the RAS is a service that is used to execute a job on a remote grid node and keep track of the jobs submitted.
The Watchdog Service is used to check the job execution status and update the trust factor in the database.

Each node maintains three tables in its database: node_info, job_sent_info and job_received_info. The node_info table contains the neighbor listing and the available resource information; the trust and mobility factor information for each neighbor node is updated in this table. The job_sent_info table contains the job submission information; each job carries a status flag which records the job success and the node to which the particular job is assigned. The job_received_info table contains the jobs accepted from the node's neighbors; this table is maintained as a queue and job execution is carried out on a First-In First-Out basis.

Consider a node_i that needs to execute its job_j in the MASGRID environment. It gets the neighbor listing from its node_info_i table. Each node in the neighbor listing will be assigned a part of job_j. Thus node_i's job_sent_info_i table will be updated with the neighbor nodes' information, and later the corresponding neighbor nodes' job_received_info tables are also updated with node_i's information. The watchdog service checks the job execution feasibility in the neighbor nodes and updates the status flag of the job in both the job_sent_info and job_received_info tables.

2.2. Trust Evaluation

The trust evaluation is based on the job success rate of a node. Each node calculates the job success rate for its neighbors and updates the trust factor in its database.

Job Success Rate – It is the ratio of the number of jobs executed successfully by a node to the total number of jobs assigned to that node. Thus the job success rate greatly influences the trust and reputation value of a node. The job success rate can be calculated based on Eq. (1) as

T_i = JS_i / TJ_i    (1)

Here, T_i is the trust value of node i, JS_i is the number of jobs completed successfully by node i, and TJ_i is the total number of jobs assigned to node i.

Each node can behave either well or badly; both behaviors have the equal probability of 0.5. Hence each node assigns an initial value to all neighbor nodes such that a node is initially trusted partially for job submissions. Later the trust value changes based on the job success history of the particular node.

2.3. Evaluation of Mobility Factor

The mobility factor is based on the concept of how mobile a node is relative to the other nodes in its neighbor list. The interval of a node can be identified based on the theory of relativity. Each node itself moves towards its destination, and hence a neighbor node's speed cannot be predicted from just its displacement in a unit interval of time. Thus the mobility factor is calculated based on Eq. (2) as

mobility_node_j = 1 − (interval(node_j) / max_interval(node_j))    (2)

The interval function returns the unit of time taken by the node to reach the destination from a source point. The max_interval function returns the maximum interval among the neighbor nodes of node_j.

Thus, for a node j with m neighbors, one node (the one having the maximum interval) has mobility factor 0, and all the remaining nodes have mobility factors relative to that node. The mobility factor for all nodes is 0 initially, and each simulation interval updates the mobility factor for each node. Thus the mobility factor lies in the range [0, 1]. This factor identifies the stability of the network.

The random waypoint mobility model is chosen for random positioning and for keeping all nodes in random movement. This gives a relatively realistic model for mobile ad hoc networks, in which each node is free to move to any destination as it wishes, as shown in Figure 1.

[Fig. 1. Relative movement of nodes: node i and node j shown at times t and t+1 with velocities V_i and V_j; D_i is the distance between node i and node j at time t, and D_{i+1} the distance at time t+1.]

Using the above D_i and D_{i+1}, a relative velocity between node_i and node_j can be found. This is used for the computation of the mobility factor.
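Eq. (1) and Eq. (2) above can be sketched as small helper functions. The function names are illustrative, and the 0.5 bootstrap value follows the equal good/bad prior described in Section 2.2:

```python
def trust(jobs_success: int, jobs_total: int, initial: float = 0.5) -> float:
    """Eq. (1): T_i = JS_i / TJ_i.
    With no job history yet, a neighbor starts partially trusted (0.5),
    reflecting the equal probability of good and bad behaviour."""
    if jobs_total == 0:
        return initial
    return jobs_success / jobs_total

def mobility(interval_j: float, max_interval_j: float) -> float:
    """Eq. (2): mobility_node_j = 1 - interval(node_j) / max_interval(node_j).
    The slowest neighbor (interval == max_interval) gets factor 0; all
    others fall in (0, 1] relative to it."""
    return 1.0 - interval_j / max_interval_j

assert trust(0, 0) == 0.5            # initial partial trust
assert trust(3, 4) == 0.75           # 3 of 4 assigned jobs succeeded
assert mobility(10.0, 10.0) == 0.0   # the slowest neighbor
assert 0.0 <= mobility(2.0, 10.0) <= 1.0
```

In the architecture described above, each node would recompute these values for its neighbors and store them in its node_info table.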
2.4. Job Submission to the next hop

Further broadcasting of jobs can be carried out once the neighbor nodes are found to be not capable enough to finish the job. The broadcasting of these jobs is restricted based on the hop count limit. This increases the job success rate, since each node is in search of capable nodes in the later hop region.

Figure 2. Job submission to the next hop: nodej submits to nodei (hop 1), and nodei forwards to nodek (hop 2).

In Figure 2, the scenario shown may happen only when nodei is not capable of finishing the job received from some nodej. Then nodei will again assign the same job, with the source node being nodej, to a neighbor, say nodek, and receive the status. The status will be updated for nodej too.

Though job submission to the next hop proved to work well for a small number of jobs, there are some demerits at higher job counts. When the number of jobs is raised, nodei merely broadcasts the jobs and fails to receive the later jobs from nodej. Since the number of resources available for a job is increased, the job success rate also increases for the increased hop count.

3. Simulation

The simulation of trust evaluation in MASGRID has been carried out using Java and GloMoSim. MySQL is used as the database keeping the trust and mobility factors.

The mobility model chosen is the random waypoint mobility model. This model makes the system behave similarly to a mobile ad hoc environment. The following three comparisons are carried out in the simulation:

1. Trusted vs. Untrusted MASGRID
2. Trusted vs. Untrusted mobility
3. Single vs. Next Hop job submissions

For each node, separate tables are maintained, so that each node can keep the trust and mobility values of its neighbors separately and assign jobs to the neighbors respectively. The jobs are generated randomly and assigned to the given number of nodes. Each job is identified by its instructions and input/output file sizes.

The neighbor listing for each node is updated at each simulation interval based on the transmission range and the current position of each node. The ad hoc devices are identified and the configurations are assigned randomly to each node. Thus all the nodes collectively act as the ad hoc environment.

For each simulation, the following values need to be configured:

Number of nodes = 30
Transmission range = 400
Terrain = 2000, 2000
Number of jobs = 100 – 800
Max_length, Min_length = 1000, 100
Max_file_size, Min_file_size = 1000, 10

After the simulation, the performance evaluation is carried out and the graph is plotted.

4. Results and Discussion

In Figure 3, a graph is plotted with the number of jobs against the job success rate.

The job success rate for a set of nodes can be calculated as the cumulative of all nodes' job success rates divided by the number of nodes. Consider a case where there are 'm' nodes chosen for simulation and nodei has job success rate JSi. Then the cumulative job success rate for number of jobs 'n' is given by Eq. (3) as

Job Success Rate_n = (Σ_{i=0..m} JS_i) / m    (3)

So, for each unit increment in the number of jobs, the corresponding job success rate is calculated and plotted in all the cases.

In Figure 3, simulated results are shown for without trust vs. with trust; there is an upgrade in the performance of MASGRID when trust is considered while assigning jobs to the neighbor nodes. Increasing the number of jobs reduces the job success rate in the untrusted case, whereas the trusted case improves its performance.

Figure 3. Trusted vs. Untrusted MASGRID: job success rate (0–100%) against number of jobs (100–800), with trust and without trust.
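Eq. (3) above is a plain average of the per-node success rates; a minimal sketch (the function and variable names are ours):

```python
def cumulative_job_success_rate(per_node_rates):
    # Eq. (3): the cumulative job success rate over m nodes is the sum of the
    # individual rates JS_i divided by the number of nodes m.
    return sum(per_node_rates) / len(per_node_rates)

# Three nodes with individual success rates 0.5, 0.75 and 1.0:
rate = cumulative_job_success_rate([0.5, 0.75, 1.0])  # 0.75
```

Recomputing this average after each batch of jobs yields the curves plotted in Figures 3–5.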
In Figure 4, a graph is plotted with the number of jobs against the job success rate, and the simulated results are shown for mobility without trust vs. mobility with trust. Here, though both initially have the same performance, increasing the number of jobs gives better performance only for the trusted mobility node. Compared to the previous analysis, this graph shows a fluctuation, which stems from the mobility model pattern. Though the graph fluctuates, the untrusted mobility never beats the trusted mobility value. The effect of the simulation will be clearer once the environment is more mobile and the nodes move randomly.

Figure 4. Trusted mobility vs. untrusted mobility: job success rate (0–100%) against number of jobs (100–800).

In Figure 5, too, a graph is plotted with the number of jobs against the job success rate, and the simulated results are shown for first hop vs. next hop submissions. In this graph, the next hop values are almost twice those of the first hop simulation. In this case, the hop count is 2. Increasing the hop count with a proper and efficient flooding scheme shows improved overall performance and hence increases the overall job success rate.

Figure 5. Single hop vs. next hop: job success rate against number of jobs (100–800).

5. Conclusions

Trust implementation greatly improves the performance of MASGRID in proper resource utilization and increased job success rate. The mobility factors taken into consideration also improve the stability of job submission and raise the confidence that a node remains stable until the result of execution is received by the sender. The mobility factor also avoids job submission to nodes which are found to be more mobile and less stable; such nodes, once they have received a job, may suddenly leave the transmission range, and it is almost impossible to detect node availability in the ad hoc environment. Extending job submission to the next hop increases the scalability of job submissions and avoids assigning the job to inefficient nodes again.
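The next-hop extension of Section 2.4 amounts to a hop-limited search for a capable node. A minimal sketch of the idea (the data structures, the `capable` predicate, and the breadth-first order are our assumptions, not the paper's implementation):

```python
def submit_next_hop(job, source, neighbors, capable, hop_limit):
    # Hop-limited job submission (a sketch): try direct neighbors first, and
    # broadcast further only while the hop count limit is not exceeded.
    frontier = [(source, 0)]
    seen = {source}
    while frontier:
        node, hops = frontier.pop(0)
        for nxt in neighbors[node]:
            if nxt in seen:
                continue
            seen.add(nxt)
            if capable(nxt, job):
                return nxt                 # a capable node accepts the job
            if hops + 1 < hop_limit:
                frontier.append((nxt, hops + 1))
    return None                            # no capable node within the hop limit
```

With hop_limit = 2 this corresponds to the "next hop" case compared against single-hop submission in Figure 5.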
Scheduling Light-trails on WDM Rings
Soumitra Pal and Abhiram Ranade
Department of Computer Science and Engg.,
Indian Institute of Technology Bombay,
Powai, Mumbai 400076, India.

Abstract—We consider the problem of scheduling communication on optical WDM (wavelength division multiplexing) networks using the light-trails technology. We give two online algorithms which we prove to have competitive ratios O(log n) and O(log² n) respectively. We also consider a simplification of the problem in which the communication pattern is fixed and known beforehand, for which we give a solution using O(c + log n) wavelengths, where c is the congestion and a lower bound on the number of wavelengths needed. While congestion bounds are common in communication scheduling, and we use them in this work, it turns out that in some cases they are quite weak. We present a communication pattern for which the congestion bound is an O(log n / log log n) factor worse than the best lower bound. In some sense this pattern shows the distinguishing character of light-trail scheduling. Finally we present simulations of our online algorithms under various loads.

I. INTRODUCTION

Light-trails [1] are considered to be an attractive solution to the problem of bandwidth provisioning in optical networks. The key idea in this is the use of optical shutters which are inserted into the optical fiber, and which can be configured to either let the optical signal through or block it from being transmitted into the next segment. By configuring some shutters on (signal let through) and some off (signal blocked), the network can be partitioned into subnetworks in which multiple communications can happen in parallel on the same light wavelength. In order to use the network efficiently, it is important to have algorithms for controlling the shutters.¹

¹Notice that in the on mode, light goes through a shutter without being first converted to an electrical signal – this is one of the major advantages of the light-trail technology.

In this paper we consider the simplest scenario: two fiber optic rings, one clockwise and one anticlockwise, passing through a set of some n nodes, where typically n < 20 because of technological considerations. At each node of a ring there are optical shutters that can either be used to block or forward the signal on each possible wavelength. The optical shutters are controlled by an auxiliary network ("out-of-band channel"). It is to be noted that this network is typically electronic, and the shutter switching time is of the order of milliseconds, as opposed to optical signals which have frequencies of Gigahertz.

For this setting we give three algorithms for controlling the shutters, or bandwidth provisioning. The first two consider dynamic traffic, i.e. communication requests arrive and depart in an online manner, i.e. they have to be serviced as soon as they arrive. The algorithm must respond very quickly in this case. The third algorithm considers stationary traffic. In this case, our algorithm can be allowed to take more time, because the computed configuration will be used for a long time since the traffic pattern does not change. For both problems, our objective is to minimize the number of wavelengths needed to accommodate the given traffic.²

²If our analysis indicates that some λ wavelengths are needed while only λ′ are available, then effectively the system will have to be slowed down by a factor λ/λ′. This is of course only one formulation; there could be other formulations which allow requests to be dropped and analyse what fraction of requests are satisfied.

The input to the stationary problem is a matrix B, in which B(i, j) gives the bandwidth demanded between nodes i and j. We give an algorithm which schedules this traffic using O(c + log n) wavelengths, where c = max_k Σ_{i,j : i≤k<j} B(i, j) is the maximum congestion at any link. The congestion as defined above is a lower bound, and so our algorithm can be seen to use a number of wavelengths close to the optimal. The reader may wonder why the additive log n term arises in the result. We show that there are communication matrices B for which the congestion is much smaller than 1, but which yet require Ω(log n / log log n) wavelengths. In some sense, this justifies the form of our result.

For the online problem, we use the notion of competitive analysis [2], [3], [4]. Specifically we establish that our first algorithm is O(log n)-competitive, i.e. it requires at most an O(log n) factor more wavelengths as compared to the best possible algorithm, including an unrealistic algorithm which is given all the communication requests in advance. A multiplicative O(log n) factor might be considered to be too large to be relevant for practice (and indeed it is an important open problem whether a lower factor can be proved); however, the experience with online algorithm design is that such algorithms often give good hints for designing practical algorithms. We establish that our second algorithm for the online problem is O(log² n)-competitive; nevertheless we mention it because it is a simplified version of the first algorithm and it seems to perform better in our simulations.

That brings us to our final contribution: we simulate two algorithms based on our online algorithms for some traffic models. We compare them to a baseline algorithm which keeps the optical shutter switched off in only one node for each wavelength. Note that at least one node should switch off its optical shutter, otherwise the light signal will interfere with itself after traversing around the ring. We find that, except for the case of very low traffic, our algorithms are better than the
baseline. For very local traffic, our algorithms are in fact much superior.

The rest of the paper is organized as follows. We begin in Section II by comparing our work with previous related work. In Section III we give the details of our algorithm for the stationary problem. Section IV gives an example instance of the stationary problem where the congestion lower bound is weak. We describe our two algorithms for the online problem in Section V. We give results of simulation of our online algorithms in Section VI.

II. PREVIOUS WORK

Our problem as formulated is in fact similar to the problem of reconfigurable bus architectures [5], [6]. These have been proposed for standard electrical communication; like the optical shutter in light-trails, there is a switch which connects one segment of the bus to another, and which can be set on or off. Again, even in this model, the switches are slow as compared to the data rates on the buses. So from an abstract viewpoint this model is very similar to ours.

While there is much work in the reconfigurable bus literature, it mostly concerns regular interconnection patterns, such as those arising in matrix multiplication, list ranking and so on [7], [8], [9], [10]. The only work we know of dealing with random communication patterns is in relation to the PARBUS architecture. Such patterns are handled using standard techniques such as Chernoff bounds [11]. We do not know of any work which discusses how to schedule arbitrary irregular communication patterns in this setting. This is probably understandable because reconfigurable bus architectures have mostly been motivated as special purpose computers, except for the PRAM simulation motivation of PARBUS, where the communication becomes random. However, if the network is used for general purpose computing, it does make sense to have algorithms to provision bandwidth for arbitrary irregular patterns, as we do here.

After the light-trail technology was introduced in [1], much work has been published in the literature. For example, [12] has a mesh implementation of light-trails for general networks. The paper [13] implements a tree-shaped variant of light-trails, called clustered light-trails, for general networks. The paper [14] describes 'tunable light-trails', in which the hardware at the beginning works just like a simple light-path but can later be tuned to act as a light-trail. There is some preliminary work on multi-hop light-trails [15], in which transmissions are allowed to go through a sequence of overlapping light-trails. Survivability in case of failures is considered in [16] by assigning each transmission request to two disjoint light-trails.

Even with this basic hardware implementation, there are different works solving different design problems. Several objectives are mentioned in the seminal paper [17] – to minimize the total number of light-trails used, to minimize queuing delay, to maximize network utilization, etc. Most of the work in the literature seems to solve the problem by minimizing the total number of light-trails used [18], [19], [20], [21]. Though the paper [19] suggests that minimizing the total number of light-trails also minimizes the total number of wavelengths, this may not always be true. For example, consider a transmission matrix in which B(1, 2) = B(3, 4) = 0.5 and B(2, 3) = 1. To minimize the total number of light-trails used, we create two light-trails on two different wavelengths. Transmission (2,3) is put in one light-trail and transmissions (1,2) and (3,4) are put in the other light-trail. On the other hand, to minimize the total number of wavelengths, we put each of them in a separate light-trail on a single wavelength. We believe that minimizing the number of light-trails (while fixing the number of wavelengths) is an appropriate objective for the online case, where this is a measure of the work done by the scheduler. In our opinion, for the stationary problem, the number of wavelengths is a better measure. There are a few other models as well; e.g. [22] minimizes the total number of transmitters and receivers used in all light-trails.

The general approach followed in the literature to solve the stationary problem is to formulate the problem as an integer linear program (ILP) and then to solve the ILP using standard solvers. The papers [18], [19] give two different ILP formulations. However, solving these ILP formulations takes prohibitive time even with moderate problem size, since the problem is NP-hard. To reduce the time to solve the ILP, the paper [20] removed some redundant constraints from the formulation and added some valid inequalities to reduce the search space. However, the ILP formulation still remains difficult to solve.

Heuristics have also been used. The paper [20] solves the problem in a general network. It first enumerates all possible light-trails of length not exceeding a given limit. Then it creates a list of eligible light-trails for each transmission and a list of eligible transmissions for each light-trail. Transmissions are allocated in an order combining descending order of bandwidth requirement and ascending order of number of eligible light-trails. Among the eligible light-trails for a transmission, the one with a higher number of eligible transmissions and a higher number of already allocated transmissions is given preference. The paper [21] used another heuristic for the problem in a general network. For a ring network, [19] used three heuristics.

For the problem on a general network, [16] solves two subproblems. The first subproblem considers all possible light-trails on all the available wavelengths as bins and packs the transmissions into compatible bins with the objective of minimizing the total number of light-trails used. The second subproblem assigns these light-trails to wavelengths. The first subproblem is solved using three heuristics and the second problem is solved by converting it to a graph coloring problem, where each node corresponds to a light-trail and there is an edge between two nodes if the corresponding light-trails conflict with each other.

For the online problem, a number of models are possible. From the point of view of the light-trail scheduler, it is best if transmissions are not moved from one light-trail to another during execution, which is the model we use. It is also
appropriate to allow transmissions to be moved, with some penalty. This is the model considered in [19], [23], where the goal is to minimize the penalty, measured as the number of light-trails constructed. The distribution of the transmissions that arrive is also another interesting issue. It is appropriate to assume that the distribution is fixed, as has been considered in many simulation studies including our own. For our theoretical results, however, we assume that the transmission sequence can be arbitrary. The work in [19] assumes that the traffic is an unknown but gradually changing distribution. They use a stochastic optimization based heuristic which is validated using simulations. The paper [20] considers a model in which transmissions arrive but do not depart. Multi-hop problems have also been considered, e.g. [24]. An innovative idea to assign transmissions to light-trails using online auctions has been considered in [25].

A. Remarks

As may be seen, there are a number of dimensions along which the work in the literature may be classified: the network configuration, the kind of problem attempted, and the solution approach. Network configurations starting from simple linear arrays/rings [9], [19], [23] to fully structured/unstructured networks [8], [16], [18], [20], [21], [24] have been considered, both in the optical communication literature as well as the reconfigurable bus literature. The stationary problem as well as the dynamic problem has been considered, with additional minor variations in the models. Finally, three solution approaches can be identified. First is the approach in which scheduling is done using exact solutions of Integer Linear Programs [18], [19], [20]. This is useful for very small problems. For larger problems, using the second approach, a variety of heuristics have been used [16], [19], [20], [21]. The evaluation of the scheduling algorithms has been done primarily using simulations. The third approach could be theoretical. However, except for some work related to random communication patterns [11], we see no theoretical analysis of the performance of the scheduling algorithms.

In contrast, our main contribution is theoretical. We give algorithms with provable bounds on performance, both for the stationary and the online case. Our work uses the competitive analysis approach [2] for the online problem. We use techniques of approximation algorithms to solve the stationary problem. To our knowledge, this competitive analysis and approximation algorithm approach to solving the light-trail scheduling problem has not been used in the literature. We also give simulation results for the online algorithms.

III. THE STATIONARY PROBLEM

In this section, instead of considering two unidirectional rings, we consider a linear array of n nodes, numbered 0 to n−1. Communication is considered undirected. This simplifies the discussion; it should be immediately obvious that all results directly carry over to the two directed rings mentioned in the introduction.

The input is a matrix B with B(i, j) denoting the bandwidth requirement for the transmission from node i to node j, without loss of generality, as a fraction of the bandwidth of a single wavelength. The goal is to schedule these in a minimum number of wavelengths w. The output must give w as well as a partitioning of each wavelength into a set of segments. The partitioning may be specified as an increasing sequence of numbers (what we refer to as a configuration) between 0 and n − 1; if u appears in the sequence it indicates that the shutter in node u is off, otherwise the shutter in node u is on. The segment between two off shutters is a light-trail. A transmission from i to j can be assigned to a light-trail L only if u ≤ i, j ≤ v, where u, v are the endpoints of the light-trail. Further, the sum of the required bandwidths assigned to any single light-trail must not exceed 1.

It is customary to consider two variations: non-splittable, in which a transmission must be assigned to a single light-trail, and splittable, in which a transmission can be split into two or more transmissions by dividing up the bandwidth, and the resultant transmissions can be assigned to different light-trails. Our results hold for both variations.

We will use cl(S) to denote the congestion induced on a link l by a set S of transmissions. This is simply the total bandwidth requirement of those transmissions from S requiring to cross link l. Clearly max_l cl(S), the maximum congestion over all links, is a lower bound on the number of wavelengths needed. We use c(S) to denote max_l cl(S). Finally, if t is a transmission, then we abuse notation to write cl(t), c(t), instead of cl({t}), c({t}), for the congestion contributed by t only, which is equal to the bandwidth requirement of t.

The key observation behind our algorithm for the stationary problem is: if all transmissions go the same distance in the network, then it is easy to get a nearly optimal schedule. Thus we partition the transmissions into classes based on the length of the transmission. We then stitch back the separate schedules.

Define the length of a transmission to be the distance between the origin and the destination. Transmissions with length between 2^(i−1) (non-inclusive) and 2^i (inclusive) are said to belong to the ith class, where 0 ≤ i ≤ ⌈log2(n − 1)⌉.

Let R denote the set of all transmissions, and Ri the set of transmissions in class i. Class 0 is served simply by putting shutters off at every node. Clearly, ⌈c(R0)⌉ wavelengths will suffice for the splittable case, and twice that many for the non-splittable (using ideas from bin-packing [26]). For R1 also it is easily seen that O(⌈c(R1)⌉) wavelengths will suffice. So for the rest of this paper we only consider classes 2 and larger.

A. Schedule Transmissions of Class i

Our aim is to partition Ri further into sets S0, S1, . . . each with congestion at most some constant value, so that overall it does not take many wavelengths. We start with T0 = Ri, and in general, given Tj, we construct Sj and Tj+1 = Tj \ Sj as follows:

We add transmissions greedily into Sj starting from the leftmost link l moving right, i.e. for each l we pick transmissions crossing it and move them into Sj until we have removed
at least unit congestion from cl(Tj) or reduced cl(Tj) to 0. Then we move to the next link. So, at the end the following condition holds:

∀l: cl(Sj) = cl(Tj) if cl(Tj) ≤ 1, and cl(Sj) ≥ 1 otherwise.    (1)

However, to make sure that c(Sj) is not large, we move back transmissions from Sj, in the reverse order as they were added, into Tj so long as condition (1) remains satisfied. Now we claim the following:

Lemma 1. Construction of Sj, Tj+1 from Tj takes polynomial time and c(Sj) < 4.

Proof: For the first part, it can be seen that the construction takes at most n|Tj| time in the pick-up step and also in the move-back step.

For the second part, at the end of the move-back step, for any transmission t ∈ Sj there must exist a link l such that cl(Sj) < 1 + c(t), otherwise t would have been removed. We call l a sweet spot for t. Since c(t) ≤ 1, we have cl(Sj) < 2 for any sweet spot l.

Now consider any link x. Of the transmissions through x, let L1 (L2) denote transmissions having their sweet spot on the left (right) of x. Consider y, the rightmost of these sweet spots of some transmission t ∈ L1. Note first that cy(Sj) < 2. Also, all transmissions in L1 pass through both x, y. Thus cx(L1) = c(L1) = cy(L1) ≤ cy(Sj) < 2. Similarly, cx(L2) < 2. Thus cx(Sj) = cx(L1) + cx(L2) < 4. But since this applies to all links x, c(Sj) < 4.

Next we show that not too many Sj will be constructed.

Lemma 2. Suppose Sj is created for class i. Then j ≤ c(Ri).

Proof: Suppose Sj contains a transmission that uses some link l. The construction process above must have removed at least unit congestion from l in every previous step 0 through j − 1. Hence j ≤ cl(Ri) ≤ c(Ri).

Every transmission in Sj has length at least 2^(i−1) + 1, and must cross some node whose number is a multiple of 2^(i−1). The smallest numbered such node is called the anchor of the transmission. The trail-point of a transmission is the rightmost node numbered with a multiple of 2^(i−1) that is on the left of the anchor. If the transmission has its trail-point at some node with number of the form t·2^(i−1), then we define t mod 4 as its phase.

Lemma 3. The set Sj can be scheduled using O(1) wavelengths.

Proof: We partition Sj further into sets Sjp containing transmissions of phase p. Note that the transmissions in any Sjp either overlap at their anchors, or do not overlap at all. This is because if two transmissions in Sjp have different anchors, then these two anchors are at least 2^(i+1) distance apart. Since the length of a transmission is at most 2^i, the two transmissions cannot intersect.

So for the set Sjp, consider 4 wavelengths, each having shutters off at nodes numbered (4q + p)·2^(i−1). Clearly, for the splittable case, the transmissions will be accommodated in these wavelengths, since c(Sjp) < 4. For the non-splittable case, 8 wavelengths will suffice, using standard bin packing ideas [26].

Thus all of Sj can be accommodated in at most 16 wavelengths for the splittable case, and at most 32 wavelengths for the non-splittable case.

Theorem 4. The entire set Ri can be scheduled such that at each link x there are O(cx(Ri) + 1) light-trails.

Proof: We first consider the light-trails as constructed in Lemma 3. In that construction, it is possible that some light-trails contain links that are not used by any of the transmissions associated with the light-trail. In such cases we shrink the light-trails by removing the unused links (which can only be near either end of the light-trail, because all transmissions assigned to a light-trail overlap at their anchor).

Let j be the largest index such that x has a transmission from Sj. Then we know that cx(Ri) ≥ j. For each k = 0, 1, . . . , j we have O(1) light-trails at x as described above. Thus we have a total of O(j + 1) = O(cx(Ri) + 1) light-trails at x.

B. Merge Light-trails of All Classes Together

If we simply collect together the wavelengths as allocated above, we would get a bound O(c log n). Note, however, that if two transmissions, one in class i and the other in class j, are spatially disjoint, then they could possibly share the same wavelength. Given below is a systematic way of doing this, which gets us the sharper bound.

We know that at each link l there are a total of O(cl(Ri) + 1) light-trails. Thus the total number of light-trails at l is O(cl(R) + log n), summing over all classes.

Think of each light-trail as an interval, giving us a collection of intervals such that any link l has at most O(cl(R) + log n) = O(c + log n) intervals. Now this collection of intervals can be colored using O(c + log n) colors. So we put all the intervals of the same color in the same wavelength.

IV. ON THE CONGESTION LOWER BOUND

We now consider an instance of the stationary problem. For convenience, we assume m = n − 1 = 2^k for some k, and all logarithms with base 2. All the transmissions have the same bandwidth requirement α = 1/(log m + 1).

First, we have a transmission going from 0 to 2^k. Then a transmission from 0 to 2^(k−1) and a transmission from 2^(k−1) to 2^k. Then 4 spanning one-fourth the distance, and so on. Thus we have transmissions of log m + 1 classes, each class having transmissions of the same length. In class i ∈ {0, 1, . . . , log m} there are 2^i transmissions B(sij, dij) = α, where sij = jm/2^i, dij = (j + 1)m/2^i and j = 0, 1, . . . , 2^i − 1. All other entries of B are 0. This is illustrated in Fig. 1 for n = 17.

Fig. 1. An example instance where the congestion bound is weak (for n = 17, the transmissions of classes i = 0, . . . , 4 over nodes 0–16).

Clearly the congestion of this pattern is uniformly 1. Consider an optimal solution. There has to be a light-trail in which the first transmission from node 0 to m is scheduled. Thus we must have a wavelength with no off shutters except at node 0 and node m. In this wavelength, it is easily seen
that the longest transmissions should be scheduled. So we start assigning transmissions of the first few classes in this light-trail. Suppose all the transmissions of the first l classes are assigned. Then we have in total 1 + 2 + 4 + · · · + 2^l = 2^(l+1) − 1 transmissions assigned to this light-trail. The total bandwidth requirement of these transmissions should be less than 1. This gives us (2^(l+1) − 1)(1/(log m + 1)) ≤ 1, implying l ≤ log(log m + 2) − 1 ≈ log log m.

For the subsequent classes of transmissions, we allocate a new wavelength and create light-trails by putting shutters off at nodes numbered multiples of m/2^(l+1). It can be seen that again transmissions of the next about log log m classes can be put in these light-trails. We repeat this process until all transmissions are assigned.

In each wavelength we assign transmissions of log log m classes. There are in total (1 + log m) classes. Thus the total number of wavelengths needed is ⌈(1 + log m)/log log m⌉ = O(log n / log log n), rather than the congestion bound of 1.

For the example in Fig. 1, using this procedure, we have log log m = 2. Thus we require ⌈(1 + log m)/log log m⌉ = 3 wavelengths. The first wavelength is used for the transmissions of classes {0,1}, the second wavelength for classes {2,3} and the third for class 4.

V. THE ONLINE PROBLEM

In the online case, the transmissions arrive dynamically. An arrival event has parameters (si, di, ri), respectively giving the origin, destination, and bandwidth requirement of an arriving transmission request. The algorithm must assign such a transmission to a light-trail L such that si, di belong to the light-trail, and at any time the total bandwidth requirement of transmissions assigned to any light-trail is at most 1. A departure event marks the completion of a previously scheduled transmission. The corresponding bandwidth is released and becomes available for future transmissions. The algorithm must make assignments without knowing about subsequent events.

Unlike the stationary problem, congestion at any link may change over time. Let clt(S) denote the congestion induced

O(log n) and O(log² n) respectively. They are inspired by the analysis of the algorithm for the snapshot problem, as may be seen.

In both the online algorithms, when a transmission request arrives, we first determine its class i and trail-point x (defined in Section III-A). The transmission is allocated to some light-trail with end nodes x and x + 2^(i+1). However, the algorithms differ in the way a light-trail is chosen from some candidate light-trails.

A. Algorithm SEPARATECLASS

In this algorithm, every allocated wavelength is assigned a class label i and a phase label p, and has shutters off at nodes (4q + p)·2^(i−1) for all q, i.e. it is configured to serve only transmissions of that class and phase. Whenever a transmission of class i and phase p is to be served, it is only served by a wavelength with the same labels. If such a wavelength is found, and the light-trail starting at its trail-point has space, then the transmission is assigned to that light-trail. If no such wavelength is found, then a new wavelength is allocated and labeled and configured as above.

When a transmission finishes, it is removed from its associated light-trail. The wavelength can be relabeled only when there are no transmissions in any of its light-trails.

Lemma 5. Suppose, at some point of time, among the wavelengths allocated by the algorithm, x wavelengths had non-empty light-trails of the same class and phase across a link l. Then l must have congestion Ω(x) at some instant.

Proof: Number these wavelengths in the order that they got allocated. Suppose the xth one was allocated due to a transmission t. This could only happen because t could not fit in the first x − 1 wavelengths.

For the splittable case this can only happen if the previous x − 1 wavelengths contain congestion at least x − 1 − c(t) at the anchor of t, when t arrived. But this is Ω(x), giving us the result.

For the non-splittable case, suppose that c(t) ≤ 0.5. Then
on a link l at time t by a set S of transmissions. This is simply each of the first x − 1 light-trails must have congestion of
the total bandwidth requirement of those transmissions from least 0.5 when t arrived, giving congestion Ω(x). So suppose
S requiring to cross link l at time t. The congestion bound c(t) > 0.5. Let k be the largest such that wavelength k
c(S) is maxl maxt clt (S), the maximum congestion over all contains a transmission t0 with c(t0 ) ≤ 0.5. If no such k exists,
links over all time instants. then clearly the congestion is Ω(x). If k exists, then all the
For the online problem, we present two algorithms, S EP - wavelengths higher than k have congestion at least 0.5 when
ARATE C LASS and A LL C LASS having competitive ratios t arrived. And the wavelengths lower than k had congestion
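The per-wavelength bookkeeping of SeparateClass can be sketched in a few lines (an illustrative model, not the paper's implementation: the class and trail-point of a request are assumed to be computed as in Section III-A, all names are ours, and light-trail capacity is 1):

```python
# Illustrative model of SeparateClass bookkeeping. A request is described by
# its class, phase, trail-point x, and bandwidth load in (0, 1].
class Wavelength:
    def __init__(self, cls, phase):
        self.cls, self.phase = cls, phase   # labels fixed while non-empty
        self.trails = {}                    # trail-point -> list of loads

    def try_assign(self, x, load):
        trail = self.trails.setdefault(x, [])
        if sum(trail) + load <= 1.0:        # light-trail capacity is 1
            trail.append(load)
            return True
        return False

def assign(wavelengths, cls, phase, x, load):
    """Serve a request: reuse a same-labeled wavelength or allocate a new one."""
    for w in wavelengths:
        if w.cls == cls and w.phase == phase and w.try_assign(x, load):
            return w
    w = Wavelength(cls, phase)              # allocate, label, and configure
    wavelengths.append(w)
    w.try_assign(x, load)
    return w
```

A request is only ever placed on a wavelength carrying its own class and phase, which is exactly the property Lemma 5 exploits.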
Theorem 6. SeparateClass is O(log n) competitive.

Proof: Suppose that SeparateClass uses w wavelengths. We will show that the best possible algorithm (including off-line algorithms) must use at least Ω(w/log n) wavelengths.

Consider the time at which the wth wavelength was allocated. At this time w − 1 wavelengths are already in use, and of these w′ = (w − 1)/(4 log n) must have the same class and phase. Among these w′ wavelengths consider the one which was allocated last, to accommodate some light-trail L serving some newly arrived transmission. At that time, each of the previously allocated w′ − 1 wavelengths was nonempty in the extent of L. By Lemma 5, c(B) = Ω(((w − 1)/(4 log n)) − 1) = Ω(w/log n). This is a lower bound on any algorithm, even off-line.

B. Algorithm AllClass

This is a simplification of the previous algorithm in that allocated wavelengths are not labeled. When a transmission arrives, if a light-trail of its class and phase capable of including it is found, then the transmission is assigned to it. If no such light-trail is found, then an attempt is made to create such a light-trail from the unused portions of any of the existing wavelengths. If such a light-trail can be created, then it is created and the transmission is placed in it. Otherwise a new wavelength is allocated, the required light-trail is created, and the rest of the wavelength is considered unused.

When a transmission finishes, it is removed from its associated light-trail. If this makes the light-trail empty then we consider it as unused. Then we check if there are adjacent unused light-trails on the same wavelength. If so, we merge them by turning on the off shutter between them.

Theorem 7. AllClass is O(log^2 n) competitive.

Proof: Suppose AllClass uses w wavelengths. We will show that an optimal algorithm will use at least Ω(w/log^2 n). Clearly, we may assume w = Ω(log^2 n).

We first prove that there must exist a point of time in the execution of AllClass when there are w/(4 log n) light-trails crossing the same link.

Number the wavelengths in the order of allocation. Consider the transmission t for which the wth wavelength was allocated for the first time. Let L be the light-trail used for t. Clearly, the wth wavelength had to be allocated because the other wavelengths contained light-trails overlapping with L. Of these, if at least w/(4 log n) light-trails crossed either end of L, then we are done. If this fails, there must be at least w′ = w − 1 − w/(2 log n) wavelengths that have light-trails which are strictly contained inside the extent of L. Let L′ be the light-trail allocated on the w′th of these wavelengths. Note that L′ is strictly smaller than L. Thus we can repeat the above argument using L′ and w′ in place of L and w respectively, only log n times, and if we fail each time, we will end up with a light-trail L′′ such that there are at least w′′ wavelengths with light-trails conflicting with L′′, where w′′ = w − log n − log n·(w/(2 log n)) = w/2 − log n ≥ w/(4 log n) for w = Ω(log^2 n). But L′′ is a single link and so we are done.

Of these w/(4 log n) light-trails, at least w/(16 log^2 n) must have the same class and phase. But Lemma 5 applies, and hence there is a link having congestion at least w/(16 log^2 n). But this is a lower bound on the number of wavelengths required by any algorithm, including an offline algorithm.

VI. SIMULATIONS

We simulate our two online algorithms and a baseline algorithm on a pair of oppositely directed rings, with nodes numbered 0 through n − 1 clockwise.

We use slightly simplified versions of the algorithms described in Section V (but easily seen to have the same bounds): basically we only use phases 0 and 2. Any transmissions that would go into a class i phase 1 (or phase 3) light-trail are contained in some class i + 1 light-trail (of phase 0 or 2 only), and are put there. We define a class i and phase 0 light-trail to be one that is created by putting off shutters at nodes jn/2^i for different j, suitably rounding when n is not a power of 2. A light-trail with class i and phase 2 is created by putting off shutters at nodes (jn/2^i + n/2^(i+1)), again rounding suitably. For AllClass, there is a similar simplification. Basically, we use light-trails having end nodes at jn/2^i and (j + 1)n/2^i, or at jn/2^i + n/2^(i+1) and (j + 1)n/2^i + n/2^(i+1). As before, in SeparateClass, we require any wavelength to contain light-trails of only one class and phase; whereas in AllClass, a wavelength may contain light-trails of different classes and phases.

For the baseline algorithm in each ring we use a single off shutter at node 0. Transmissions from lower numbered nodes to higher numbered nodes use the clockwise ring, and the others, the counterclockwise ring.

A. The simulation experiment

A single simulation experiment consists of running the algorithms on a certain load, characterized by parameters λ, D, r_min and α, for 100 time steps. In our results, each data-point reported is the average of 150 simulation experiments with the same load parameters.

In each time step, all nodes j that are not busy transmitting generate a transmission (j, d_j, r_j) active for t_j time units. After that the node is busy for t_j steps; after that it generates another transmission as before. The transmission duration t_j is drawn from a Poisson distribution with parameter λ. The destination d_j of a transmission is picked using the distribution D discussed later. The bandwidth is drawn from a modified Pareto distribution with scale parameter = 100 × r_min and shape parameter = α. The modification is that if the generated bandwidth requirement exceeds the wavelength capacity 1, it is capped at 1.

We experimented with α ∈ {1.5, 2, 3} and λ ∈ {0.01, 0.1} but report results only for α = 1.5 and λ = 0.01; results for other values are similar.
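The per-node load generation just described can be sketched as follows (illustrative, not the authors' simulator: Python's standard library has no Poisson sampler, so Knuth's method is used, and the Pareto scale, stated as 100 × r_min in the paper, is left as a caller-supplied parameter):

```python
import math
import random

def poisson(lam, rng=random):
    """Poisson draw with parameter lam (Knuth's multiplicative method)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def generate_request(j, n, lam, scale, alpha, pick_dest, rng=random):
    """One transmission (j, d_j, r_j) active for t_j steps (names illustrative)."""
    t = poisson(lam, rng)                           # duration in time steps
    r = min(1.0, scale * rng.paretovariate(alpha))  # modified (capped) Pareto
    d = pick_dest(j, n, rng)                        # destination from D
    return (j, d, r), t
```

For instance, the Uniform distribution of Section VI corresponds to `pick_dest = lambda j, n, rng: rng.choice([k for k in range(n) if k != j])`.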
We tried four values 0.01, 0.1, 0.25 and 0.5 for r_min. Here we report the results for r_min = 0.01, 0.5. We considered four different distributions D for selecting the destination node of a transmission. 1) Uniform: we select a destination uniformly at random from the n − 1 nodes other than the source node. 2) UniformClass: we first choose a class uniformly from the ⌈log n/2⌉ + 1 possible classes. It should be noted that there can be a destination at a distance of at most n/2 in any direction, since we schedule along the direction requiring the shortest path. 3) Bimodal: first we randomly choose one of two possible modes. In mode 1, a destination from the two immediate neighbors is selected, and in mode 2, a destination from the nodes other than the two immediate neighbors is chosen uniformly. For applications where transmissions are generated by structured algorithms, local traffic, i.e. unit or short distances (e.g. √n for mesh-like communications), would dominate. Here, for simplicity, we create a bimodal traffic which is a mixture of completely local and completely global. 4) ShortPreferred: we select destinations at shorter distances with higher probability. In fact, we first choose a class i in the range 0, . . . , ⌈log n/2⌉ with probability 1/2^(i+1), and then select a destination uniformly from the possible destinations in that class. We report the results only for the distributions Uniform and Bimodal.

B. Results

Fig. 2 shows the results for the 4 load scenarios. For each scenario, we report the number of wavelengths required by the 3 algorithms and the measured congestion as defined in Section V. Each data-point is the average of 150 simulations (each of 100 time steps) for the same parameters on rings having n = 5, 6, . . . , 20 nodes. We say that the two scenarios corresponding to r_min = 0.01 have low load and the remaining two scenarios (r_min = 0.5) have high load.

For low load, the baseline algorithm outperforms our algorithms. At this level of traffic, it does not make sense to reserve different light-trails for different classes. However, as load increases our algorithms outperform the baseline algorithm. For the same load, it is also seen that our algorithms become more effective as we change from the completely global Uniform distribution to the more local Bimodal distribution. This trend was also seen with the other distributions we experimented with.

Though we could not show analytically that AllClass is always better than SeparateClass, our simulations show that AllClass performs better. It may be noted that our algorithm for the stationary case mixes up the light-trails of different classes, which suggests that AllClass might work better.

VII. CONCLUSIONS AND FUTURE WORK

It can be shown that the non-splittable stationary problem is NP-hard, using a simple reduction from bin-packing. We do not know if the splittable problem is also NP-hard. We gave an algorithm for both variations of the stationary problem which takes O(c + log n) wavelengths. It will also be useful to improve the lower bound arguments; as Section IV shows, congestion is not always a good lower bound. This may lead to a constant factor approximation algorithm for the problem.

In the online case we gave two algorithms which we prove to have competitive ratios O(log n) and O(log^2 n) respectively. In practice we found that the second algorithm was better, and showing this analytically is an important open problem.

Our online model is very conservative in the sense that once a transmission is allocated on a light-trail, the light-trail cannot be modified. However, other models allow light-trails to shrink/grow dynamically [17]. It will be useful to incorporate this (with some suitable penalty, perhaps) into our model. It will also be interesting to devise special algorithms that work well given the distribution of arrivals.

ACKNOWLEDGMENT

We would like to thank Ashwin Gumaste for encouragement, insightful discussions, and patient clearing of doubts related to light-trails.

REFERENCES

[1] I. Chlamtac and A. Gumaste, "Light-trails: A solution to IP centric communication in the optical domain," Lecture Notes in Computer Science, pp. 634–644, 2003.
[2] A. Borodin and R. El-Yaniv, Online Computation and Competitive Analysis. Cambridge University Press, New York, NY, USA, 1998.
[3] D. Sleator and R. Tarjan, "Amortized efficiency of list update and paging rules," Communications of the ACM, vol. 28, no. 2, pp. 202–208, 1985.
[4] A. Karlin, M. Manasse, L. Rudolph, and D. Sleator, "Competitive snoopy caching," Algorithmica, vol. 3, no. 1, pp. 79–119, 1988.
[5] R. Wankar and R. Akerkar, "Reconfigurable architectures and algorithms: A research survey," IJCSA, vol. 6, no. 1, pp. 108–123, 2009.
[6] K. Bondalapati and V. Prasanna, "Reconfigurable meshes: Theory and practice," in Fourth Workshop on Reconfigurable Architectures, IPPS, 1997.
[7] K. Li, Y. Pan, and S. Zheng, "Parallel matrix computations using a reconfigurable pipelined optical bus," Journal of Parallel and Distributed Computing, vol. 59, no. 1, pp. 13–30, 1999.
[8] C. Subbaraman, J. Trahan, and R. Vaidyanathan, "List ranking and graph algorithms on the reconfigurable multiple bus machine," in International Conference on Parallel Processing, ICPP 1993, vol. 3, 1993.
[9] Y. Pan, M. Hamdi, and K. Li, "Efficient and scalable quicksort on a linear array with a reconfigurable pipelined bus system," Future Generation Computer Systems, vol. 13, no. 6, pp. 501–513, 1998.
[10] Y. Wang, "An efficient O(1) time 3D all nearest neighbor algorithm from image processing perspective," Journal of Parallel and Distributed Computing, vol. 67, no. 10, pp. 1082–1091, 2007.
[11] S. Rajasekaran and S. Sahni, "Sorting, selection, and routing on the array with reconfigurable optical buses," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 11, pp. 1123–1132, 1997.
[12] A. Gumaste and I. Chlamtac, "Mesh implementation of light-trails: a solution to IP centric communication," in Proceedings of the 12th International Conference on Computer Communications and Networks, ICCCN'03, pp. 178–183, 2003.
[13] A. Gumaste, G. Kuper, and I. Chlamtac, "Optimizing light-trail assignment to WDM networks for dynamic IP centric traffic," in The 13th IEEE Workshop on Local and Metropolitan Area Networks, LANMAN'04, 2004, pp. 113–118.
[14] Y. Ye, H. Woesner, R. Grasso, T. Chen, and I. Chlamtac, "Traffic grooming in light trail networks," in IEEE Global Telecommunications Conference, GLOBECOM'05, 2005.
[15] A. Gumaste, J. Wang, A. Karandikar, and N. Ghani, "MultiHop Light-Trails (MLT): A Solution to Extended Metro Networks," Personal Communication.
[16] S. Balasubramanian, W. He, and A. Somani, "Light-Trail Networks: Design and Survivability," in The 30th IEEE Conference on Local Computer Networks, pp. 174–181, 2005.
[Fig. 2: four panels of simulation results (plot data omitted). Top row (a) Low Load, bottom row (b) High Load; left column Uniform and right column Bimodal destination distributions. Each panel plots the number of wavelengths W against the number of nodes n (6 to 20) for AllClass, SeparateClass, the Baseline algorithm, and the measured Congestion.]

Fig. 2. Simulation results
[17] A. Gumaste and I. Chlamtac, "Light-trails: an optical solution for IP transport [Invited]," Journal of Optical Networking, vol. 3, no. 5, pp. 261–281, 2004.
[18] J. Fang, W. He, and A. Somani, "Optimal light trail design in WDM optical networks," in IEEE International Conference on Communications, vol. 3, June 2004, pp. 1699–1703.
[19] A. Gumaste and P. Palacharla, "Heuristic and optimal techniques for light-trail assignment in optical ring WDM networks," Computer Communications, vol. 30, no. 5, pp. 990–998, 2007.
[20] A. Ayad, K. Elsayed, and S. Ahmed, "Enhanced optimal and heuristic solutions of the routing problem in light trail networks," in Workshop on High Performance Switching and Routing, HPSR'07, pp. 1–6, 2007.
[21] B. Wu and K. Yeung, "OPN03-5: Light-trail assignment in WDM optical networks," in IEEE Global Telecommunications Conference, GLOBECOM'06, 2006, pp. 1–5.
[22] S. Balasubramanian, A. Kamal, and A. Somani, "Network design in IP-centric light-trail networks," in 2nd International Conference on Broadband Networks, IEEE Broadnets'05, 2005, pp. 41–50.
[23] A. Lodha, A. Gumaste, P. Bafna, and N. Ghani, "Stochastic optimization of light-trail WDM ring networks using Benders decomposition," in Workshop on High Performance Switching and Routing, HPSR'07, 2007, pp. 1–7.
[24] W. Zhang, G. Xue, J. Tang, and K. Thulasiraman, "Dynamic light trail routing and protection issues in WDM optical networks," in IEEE Global Telecommunications Conference, GLOBECOM'05, 2005, pp. 1963–1967.
[25] A. Gumaste and S. Zheng, "Dual auction (and recourse) opportunistic protocol for light-trail network design," in IFIP International Conference on Wireless and Optical Communications Networks, 2006, p. 6.
[26] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson, "Approximation algorithms for bin packing: a survey," in Approximation Algorithms for NP-hard Problems, pp. 46–93, 1997.
ADCOM 2009
FOCUSSED SESSION ON RECONFIGURABLE COMPUTING
AES and ECC Cryptography Processor with
Runtime Configuration
Samuel Antão, Ricardo Chaves, Leonel Sousa
Instituto Superior Técnico/INESC-ID
Technical University of Lisbon
Email: {sfan,rjfc,las}@sips.inesc-id.pt

Abstract—In today's society, the number of applications that require cryptographic support keeps growing. The functionality and security of these applications rely on the capability of cryptographic accelerators to provide adequate performance metrics while maintaining flexibility. In this paper a programmable cryptographic processor prototype supporting AES and EC (Elliptic Curve) ciphering is presented. This processor consists of up to 12 programmable processing units. We explore and present results for the dynamic reconfiguration of these processing units, allowing the runtime replacement of AES by EC units (or vice-versa) according to the application needs. By combining programmability and runtime reconfiguration, both flexibility and performance can be improved. Moreover, the reconfiguration capability allows the required hardware area to be further reduced, since these functionalities are multiplexed in time. The presented prototype is supported by a Xilinx XC4VSX35 FPGA, consisting of 6 static EC units and 6 reconfigurable AES/EC units running simultaneously. This processor is able to cipher a 128-bit AES block in 22.9 µs and perform an EC point multiplication in 2.02 ms. The full reconfiguration of a processing unit can be achieved in less time than an EC multiplication.

I. INTRODUCTION

Currently, most applications require security and authentication services. Several protocols have been designed to provide such requirements to these applications, and they are used in a variety of devices: from smart cards, wireless sensors, cell phones, and laptops, which usually need a small number of connections, to high-end servers that have to establish thousands of connections. For such a wide variety of devices, there is also a wide range of different demands. The following highlights the key features that have to be considered:
• performance, supporting high throughput and low latency;
• low cost, using cheap platforms and mass-produced computing elements;
• compactness, allowing the coexistence of different applications in a small pool of resources;
• flexibility, allowing adjustment to different needs;
• low power, enhancing battery savings, reducing the costs in energy and heat sinks, and increasing autonomy.

The security and authentication protocols are often supported by two main types of cryptographic functions: symmetric and asymmetric. The former allows a secure and confidential communication to be established between two entities that share a secret, while the latter allows two entities to create a distributed secret without any previously agreed information. Several algorithms have been proposed to implement these cryptographic functions, and the most successful ones have been adopted for their strength against attacks and their compatibility with the demand for performance and compactness [1], [2].

Regarding the asymmetric algorithms, the Elliptic Curve (EC) cryptosystem has emerged as a reliable and effective alternative to the widely used Rivest-Shamir-Adleman (RSA) algorithm. The EC cryptosystems have the advantage of providing greater security per bit of the secret key. Therefore, smaller keys need to be used/stored. Consequently, more compact and bandwidth-efficient cryptosystems can be developed, since smaller keys need to be transmitted.

Although symmetric algorithms do not offer the same properties as asymmetric ones in terms of secret key construction, they are simpler, more compact, and more efficiently computed, allowing for better area and throughput metrics. Thus, their usage is mandatory for some applications. Currently, one of the most widely used algorithms for symmetric cryptography is the Advanced Encryption Standard (AES) [2].

Although both symmetric and asymmetric algorithms have been shown to provide good performance metrics, their complexity is still considerable. To overcome this problem, hardware accelerators are employed. Several accelerators have been proposed, supported on Application Specific Integrated Circuit (ASIC) solutions [3], Field Programmable Gate Arrays (FPGA) [4], Graphical Processing Units (GPU) [5], [6], and Instruction Set Architecture (ISA) extensions for general purpose processors [7]. While the flexibility of the solutions increases as we move from ASICs to general purpose solutions, the performance decreases. The ASIC approach allows for fast and low-power solutions, but with limited adaptability and higher design costs. General purpose processor solutions allow for optimal programmability, but achieve relatively low performance at higher power costs. The GPU solutions allow for the utilization of a large amount of parallel hardware structures at a reduced cost, because of the massive production driven by the gaming market. However, the GPU's datapath is not optimized for cryptographic procedures, the parallelism extraction for cryptography is limited, and the power consumption is significant. The FPGA solutions are a compromise between the high performance/low power of the ASIC and the flexibility/low cost of the general purpose

processors. Moreover, FPGAs allow programmable solutions to be combined with reconfiguration capabilities, providing adaptable datapaths. FPGAs can thus be considered an advisable option to efficiently support a wider range of cryptographic algorithms and procedures.

This paper proposes a general cryptographic processor supported on an FPGA. This programmable processor was designed to take advantage of the reconfigurable capabilities of an FPGA to achieve good performance metrics and enhanced flexibility. The processor proposed in this work aims to support the majority of security and authentication protocols, introducing microcoded AES and EC cores, and a true Random Number Generator (RNG) supported on oscillator rings to generate secrets. Very few attempts have been made in the related art to combine AES and EC arithmetic into a single arithmetic body. The efficiency of such approaches is compromised by the difference in the size of the datapath (m ≥ 163 for the EC versus m = 8 for the AES), requiring the use of different irreducible polynomials and thus different reduction structures. Our approach is different: instead of sharing the datapath for the AES and EC arithmetic, we create individual, compact, and high-performance AES and EC cores that share the same microcoded control unit. With this approach, and using the reconfiguration capabilities of the FPGA, it becomes very easy and efficient to dynamically trade AES and EC cores, depending on the requirements. A compact and flexible cryptographic processor with good performance metrics is obtained. With an RNG associated to the processing units, the secret keys of the protocols can be locally computed and directly stored in the processing units' memory. Avoiding the communication of secret keys makes the system more secure and resistant to external attacks.

The paper is organized as follows. In Section II we provide a brief introduction to the AES and EC arithmetic. In Section III we present the details of the reconfigurable architecture used. In Section IV we describe the system layout used to handle the runtime configuration of processing units. Section V presents results for the developed prototype, and Section VI draws some conclusions about the developed work.

II. AES AND EC CRYPTOGRAPHY

In this section we briefly introduce the arithmetic that the proposed processor supports.

A. AES arithmetic

The AES algorithm is composed of three main operations: the key expansion, the ciphering, and the deciphering. In the key expansion operation, the used key, with 16, 24, or 32 bytes, is expanded in order to obtain 176, 208, or 240 bytes, depending on the initial size. This expanded key is divided into sets of 16 bytes, and each set is used in one round of the ciphering/deciphering operation. The number of rounds depends on the used key size. The key and data used in the ciphering/deciphering rounds are organized in a common way: in a 4 × 4 byte matrix. Each AES round affects each of these matrices' elements using the following elementary operations:
• byte additions over GF(2^8), which correspond to an 8-bit bitwise exclusive OR (XOR) operation;
• the non-linear function S(.), often called an SBox, and its inverse; this function can be computed with multiplications and inversions over GF(2^8) with the irreducible polynomial I(x) = x^8 + x^4 + x^3 + x + 1;
• data matrix multiplication with constant matrices, with the irreducible polynomial I(x);
• matrix row rotating shift operation.
Further details about these operations and how they are applied can be found in [2].

B. EC arithmetic

An EC over GF(2^m) is a set composed of a point at infinity O and the points P_i = (x_i, y_i) ∈ GF(2^m) × GF(2^m) that comply with the following equation:

    y_i^2 + x_i·y_i = x_i^3 + a·x_i^2 + b,   a, b ∈ GF(2^m).   (1)

By establishing the addition operation over the EC points and applying it recursively, it is possible to obtain the multiplication-by-a-scalar operation. It is known to be computationally hard to invert this operation, since it is difficult to determine, from the recursive addition result of an EC point, how many times this point was added. This is known as the Elliptic Curve Discrete Logarithm Problem (ECDLP), which supports the security of EC cryptosystems.

The EC point addition and doubling (addition to itself) are performed with operations over the underlying field GF(2^m) applied to the points' coordinates. These GF(2^m) operations are the addition, multiplication, squaring, and inversion, modulo an irreducible polynomial of degree m. Details about how these operations can be efficiently performed can be found in [8].

III. CRYPTOGRAPHIC PROCESSOR ARCHITECTURE AND DETAILS

In this work we developed a prototype of a cryptographic accelerator supported on reconfigurable hardware, namely a prototyping board powered by a Xilinx Virtex 4 FPGA [9]. In this prototype the aim is to support the majority of protocols that need asymmetric and symmetric cryptographic schemes, and also the secure generation of secret keys for these protocols. A schematic overview of the proposed processor organization is presented in Figure 1. The processor is composed of several processing units (PUs), responsible for computing the cryptographic procedures. An RNG is also included in order to generate the secret data (such as the private keys). The processor has an I/O interface to communicate and receive the data (public keys, plain texts, ciphered texts) to/from the main controller, which we herein call the host of the processor. This interface is also used to provide commands, such as start commands for the processing units (PUs) or write/read commands of data and instructions. Through this interface the host can query the processor for, e.g., the availability of PUs, or check if the required tasks were already done.
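For concreteness, multiplication in GF(2^8) modulo I(x) = x^8 + x^4 + x^3 + x + 1, the operation underlying the SBox and the constant-matrix products of Section II-A, can be sketched as follows (an illustrative software model, not the PU implementation):

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # byte addition in GF(2^8) is XOR
        carry = a & 0x80
        a = (a << 1) & 0xFF      # multiply a by x
        if carry:
            a ^= 0x1B            # reduce by the irreducible polynomial
        b >>= 1
    return result
```

As a sanity check, the AES specification's worked example gives {57}·{83} = {c1} under this polynomial.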

When the host sends a write command to any PU, it also defines the origin of the data to be written, namely external data or internal data read from the RNG. Thus, the host can use the secret information without having to touch or to know it.

All the PUs run according to the microcode stored in a centralized instruction memory. For this, each PU has its own microprogram counter (µPC) and startup addresses to run and control the flow of the correct program. An arbiter controls the access to the instruction memory according to a priority policy, and signals any PU when the memory retrieves a valid microinstruction for it.

Each PU contains its local data memory, which is addressed according to the received microinstructions. Input data and temporary data, as well as the final results, are stored in this local memory. This memory can be accessed by the host when the PU is set to the idle state, through specific microinstructions directly provided by the host, making it possible to read and write data from/to the PUs. The width of the data memory, as well as the details of the arithmetic units available, is customizable according to the type of the PU. Different types of PUs support different cryptographic procedures.

With this modular architecture, the PUs share the same control through the instruction memory while facilitating the replacement of a given PU by another one. This allows full advantage to be taken of the reconfiguration capabilities of the electronic devices.

[Fig. 1: schematic of the processor organization (graphic omitted): host communication interfaces, random number generator, instruction-memory arbiter, microinstruction RAM, and the per-PU data RAM, control logic, and µPC.]

[Fig. 2: architectures of (a) the AES PU, with input/storage logic, data RAM, look-up ROM, adders, and addition logic, and (b) the EC PU, with registers R1–R6, data RAM, adders, reduction blocks, multiplier, and reduction & squaring logic. Graphic omitted.]
Fig. 2: Architecture of the processing units.

A. PU for AES

The architecture of the AES PU is presented in Figure 2a. This architecture is composed of a data RAM of 512 positions, a ROM, and two adders. The ROM implements a look-up table for the non-linear function S(.) and its inverse S^(-1)(.) (see Section II-A). We also include in this ROM the operations 2S(x), 3S(x), 9x, 11x, 13x, and 14x, where x represents the ROM address. With these operations, we are able to perform the multiplications with the constant matrices.

Since the computation of the AES is performed over GF(2^8), the used datapath and memory width is 8 bits. Given that x has 8 bits, the ROM has 2048 entries of 8 bits. This amount of data fits a single BRAM present in the Xilinx Virtex 4 technology. Furthermore, since these BRAMs are dual-port, the same ROM can be used for two PUs. The two adders at the input and output of the ROM perform the required additions for the AES arithmetic. With this architecture the following basic operations are implemented, where L(.) is a look-up result: R(c) = R(a) + R(b), R(c) = R(a) + L(R(b)), R(c) = L(R(a) + R(b)), and R(c) = L(R(b)), where a, b, and c are the addresses provided by the microinstruction. An operation to load a constant directly into the memory is also implemented. The byte shift operations can be handled with the appropriate addressing of data, since each address corresponds to one byte. Regarding the flow control of the program, the following jump-related operations are implemented:
• jmpset: sets the value of an indexing counter;
• jmpinc: jumps if the value associated with this jump instruction matches the value in the indexing counter; the indexing counter is incremented;
• jmpdec: similar to jmpinc, but decrements the indexing counter;
• end: determines the end of the program, upon which the PU becomes idle.

It is also possible to add the value in the indexing counter, multiplied by 16, to the data addresses. This makes it easy to browse through the 16-byte matrices in which the AES data is organized in the data BRAM, depending on the indexing counter. All these functionalities, including the choice of the ROM look-up and the usage of the indexing counter in the addresses, are mapped onto microinstructions of 36-bit width. Each microinstruction runs in 3 clock cycles: one cycle to read the data, one cycle to read the ROM, and another cycle to write the result.

B. PU for EC

The EC PU builds on our previous work presented in [8]
for polynomial basis field arithmetic. The architecture of this
GF(2m) PU GF(2m) PU
micro-instruction ... compact and flexible PU is similar to the one of the AES
...
n processing units (PU) PU, and is depicted in Figure 2b. There is also a data BRAM
where the field elements (of size m ≥ 163) are split and
Fig. 1: Processor Organization Overview. stored in 21 bit words. The arithmetic logic supports two-

238
word with two-word operand additions and Karatsuba-Offman
multiplications.
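The four register-level AES-PU operations above are easy to model in software. The sketch below is a hypothetical Python model (the look-up table is a toy stand-in, not the real AES S-box; in the real PU the 2048-entry ROM packs eight 256-entry tables); note that addition over GF(2^8) is bitwise XOR:

```python
# Hypothetical software model of the AES PU's four basic operations.
# L_TABLE stands in for any of the ROM's 256-entry tables; '+' over
# GF(2^8) is bitwise XOR.

L_TABLE = [(x * 7 + 3) % 256 for x in range(256)]  # toy table, NOT the AES S-box
R = bytearray(512)                                 # 512-position byte-wide data RAM

def L(x):
    """Look-up through the shared ROM."""
    return L_TABLE[x]

def op_add(c, a, b):         # R(c) = R(a) + R(b)
    R[c] = R[a] ^ R[b]

def op_add_lookup(c, a, b):  # R(c) = R(a) + L(R(b))
    R[c] = R[a] ^ L(R[b])

def op_lookup_add(c, a, b):  # R(c) = L(R(a) + R(b))
    R[c] = L(R[a] ^ R[b])

def op_lookup(c, b):         # R(c) = L(R(b))
    R[c] = L(R[b])

# Example: with R(0)=5 and R(1)=9, op_add stores 5 XOR 9 = 12 in R(2).
R[0], R[1] = 5, 9
op_add(2, 0, 1)
```

In the hardware, a, b, and c come from the microinstruction fields, and the three steps (read, look-up, write) map onto the 3-cycle execution described above.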
The microcode adopted for the EC PU can be classified into two main microinstruction types. The complex microinstructions (type I) are performed over field elements, while the lower-complexity microinstructions (type II) operate over words. There is a reserved type I microinstruction that corresponds to a customizable sequence of type II operations.

Type I instructions are used to access the memory (read and write) and the key register (key), to compute the m-bit addition, squaring, and reduction operations (add, sqr, and red), and to control the flow, either by conditionally jumping to a microinstruction address depending on the key register or by turning the Processing Unit (PU) to an idle state (jmp and end). The type II instructions allow for adding and multiplying 2-word operands (eadd and emult). An instruction determines the end of a type II sequence (eret) and, consequently, the end of the pers type I instruction. A customizable instruction (pers) is also reserved, corresponding to a user-defined sequence of type II instructions.

Another jump instruction is also introduced. When a PU is placed in the architecture, an ID is assigned to it. This jump instruction, called jumpid, is an unconditional jump operation, and is only executed if the ID in the microinstruction matches the ID assigned to the PU. If the IDs do not match, this instruction is ignored. Introducing this instruction allows a program to use microcode segments of other programs, since it works as a return instruction that is only considered by a PU running a specific program. This is useful to shrink the program sizes by sharing a single routine needed in different programs, e.g., the inversion in the scalar multiplication and in the point addition. This unit also supports an instruction that signals the end of the program.

The functionality provided by the EC PU is controlled by microinstructions of 32-bit width. Since the AES core needs 36-bit instructions, the EC PU also uses 36-bit coded instructions, ignoring the 4 most significant bits. Regarding the clock cycles required for the instructions, the jump and word-size addition need 3 cycles, the word-size multiplication needs 5 cycles, the field-size addition needs 13 cycles, and the reduction and squaring operations need 14 clock cycles.

C. Arbiter

The arbiter controls the access to the microinstruction memory when there are simultaneous and pending requests. The arbiter considers a static priority for each PU, where all the EC PUs have higher priority than the AES PUs. This is because the EC programs contain microinstructions that take a larger number of clock cycles than the AES microinstructions. The AES microinstructions are therefore more likely to efficiently fill the clock cycles between the EC requests than the opposite, resulting in a better efficiency for the whole system.

D. True Random Number Generator

A true RNG is also included to generate the secrets that lead to the private keys. Hence, since the private keys are not communicated by the device, there is no entity, other than the host, external to the device capable of obtaining them, at least without implementing sophisticated attacks, such as Differential Power Attacks [10].

Fig. 3: Random bits generator.

The randomness source of the RNG is the jitter of an oscillator. In a digital device, such as an FPGA, these oscillators can be obtained with combinatorial rings of an odd number of logic inverters. To obtain a random bitstream, we can implement several of these oscillators, compute the logic exclusive-OR of all the oscillator outputs, and sample the obtained signal with a frequency lower than the frequency of the oscillators [11]. An FPGA implementation of such an RNG was reported in [12] for an Altera Cyclone II FPGA. The authors in [12] suggest an improvement to the method presented in [11]: sampling the output of each oscillator prior to the exclusive-OR operation. This suggestion is based on the observation that the combinatorial logic responsible for computing the exclusive-OR operation may not have enough commutation speed between events at the inputs. For the RNG designed in this paper we followed this suggestion, which resulted in the circuit presented in Figure 3. We also introduced a reset signal to halt the oscillators and the random bitstream generation, in order to reduce the power consumption when the RNG is not being used. A shift register was appended to the output of the RNG to store the random data and allow it to be readily read.

IV. RUNTIME RECONFIGURATION

The proposed processor is specially designed to efficiently support runtime reconfiguration. The modularity of the processor allows different processing units to be configured easily without affecting the behavior of the others. This makes it possible to fulfill the runtime needs of the host by better adapting the computation to the protocols being used. Our design is supported by a Xilinx Virtex 4 FPGA, allowing the Xilinx dynamic reconfiguration flow to be used for this processor.

The only concern regarding the control of the dynamic reconfiguration is related to the dummy requests placed in the arbiter by the PU under reconfiguration, due to the unexpected behavior of the PU outputs during reconfiguration. To overcome this issue, the architecture contains an enable register that can be accessed by the host. When the host disables the PU that is going to be reconfigured, the valid requests of that PU are ignored by the arbiter. After the
reconfiguration, when the host enables the PU, a reset pulse is generated for that PU to set it to the idle state.

In order to support both AES and EC processing units, the reconfigurable zones should cover the resources required by the most demanding implementation loaded in that zone. Of the two considered PUs, the most demanding in terms of resources is the EC PU, due to its wider datapath (21-bit instead of 8-bit) and larger complexity. For these reasons, the reconfigurable zones are sized to fit an EC PU.

Since the several PUs compete for access to the instruction memory, conflicts can exist, and thus some PUs may stall waiting for their request to be fulfilled. This conflict penalty will increase if the number of PUs appended to one of the instruction memory ports increases. This effect has to be taken into account when setting the number of PUs in the design and, consequently, the number of reconfigurable zones. Each of the AES operations requires 3 clock cycles to perform, while an EC operation requires from 3 to 14 cycles. This means that the average number of clock cycles per instruction in the AES PUs is lower than the EC PU average; thus, the AES PUs will generate more conflicts than the EC PUs. Since the arbiter can issue one instruction per clock cycle, only a maximum of 3 AES PUs can ideally operate at the same time without conflicts. Adding a fourth PU with lower priority than the others will cause this fourth PU to stall until one of the others finishes its ongoing computation. This observation determines the number of required reconfigurable zones, which is 3 per instruction memory port. Thus, the system can have up to 3 AES PUs per instruction memory port, implemented in the reconfigurable zones. The system can have more static EC PUs, according to the conflicts that the user admits or to the available resources. Considering a dual-port instruction memory, the number of reconfigurable zones can be doubled to 6.

The use of dual-port memories also contributes to reducing the resources used in the design of the AES PUs. Considering a dual-port look-up ROM, the same memory can be implemented statically outside the PUs and shared by two AES PUs, as Figure 4 suggests. Moreover, this procedure keeps the look-up information out of the configuration data, enhancing the compactness of the bitstream and the configuration speed. Another issue that has to be considered while reconfiguring the PUs is the number of signals that cross the reconfigurable zone boundary, since the path of these signals through the boundary has to be directly instantiated. This instantiation, except for the clock signal when provided by a global buffer, is performed using directional slice bus macros. These bus macros are provided with the Xilinx ISE tools that support dynamic reconfiguration. The number and type of the required macros are determined by the number of PU input and output signals. Each bus macro occupies a Configurable Logic Block (CLB), which corresponds to 4 slices, and supports up to 8-bit signals. To determine the number of bus macros, the maximum number of inputs and the maximum number of outputs over both PU types (AES and EC) have to be considered. For the proposed design a maximum of 89 input signals (⌈89/8⌉ = 12 bus macros) and 64 output signals (⌈64/8⌉ = 8) is required, corresponding to a total of 20 bus macros.

Fig. 4: AES PUs with shared look-up memory.

V. EXPERIMENTAL RESULTS AND RELATED WORK

The proposed design was successfully implemented and experimentally tested on a prototyping board powered by a Xilinx XC4VSX35-10 FPGA [9]. We implemented and evaluated different combinations of the numbers of AES and EC cores. These implementations refer to EC arithmetic over GF(2^163) and AES arithmetic with a 128-bit key size. The FPGA programming files were obtained from a VHDL description of the hardware, synthesized with the Synplify Premier C-2009.06 tools and placed and routed with the ISE 9.2.04i PR14 tools. The Virtex 4 technology supports the handling of dynamic reconfiguration using the Internal Configuration Access Port (ICAP). The advantage of using this port is the possibility of directly instantiating it and conjugating it with the remaining design, including the communication logic, which can write the reconfiguration bitstream directly to this port.

The Virtex 4 FPGAs contain block RAMs that provide true dual-port capabilities. This allows all the memories employed in the design (instruction, data, and look-up) to be dual port, saving resources. As discussed in Section IV, the maximum number of AES PUs competing for an instruction memory port can be up to 3. Thus, we use 6 reconfigurable zones that can be reconfigured with an EC or AES PU. We also implement another 6 (3 per instruction BRAM port) static EC PUs. Thus the design can have up to 12 PUs working simultaneously, of which up to 6 can be AES PUs. The reason for implementing only 6 static PUs is related to the slice resource constraints and to the increasing number of conflicts while accessing the instruction BRAM. We considered an instruction memory with 1024 36-bit instructions to contain all the routines for EC and AES arithmetic.

The static design contains the required logic to implement the communication with the host, the random number generation, the AES look-up memories, and the 6 static EC PUs. The resources required to implement the static design are 8,446 slices and 11 BRAMs (2 for the instruction memory, 3 for the AES look-up ROMs, and 6 for data storage in the 6 static EC PUs).
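The bus-macro budget above is a simple ceiling division over the worst-case signal counts across both PU types. A quick Python sketch checking the arithmetic (signal counts taken from the text):

```python
from math import ceil

SIGNALS_PER_MACRO = 8  # each directional slice bus macro carries up to 8 signals

def macros_needed(signal_count):
    """Bus macros required to route signal_count signals across a zone boundary."""
    return ceil(signal_count / SIGNALS_PER_MACRO)

input_macros = macros_needed(89)   # worst-case PU inputs over AES and EC PU types
output_macros = macros_needed(64)  # worst-case PU outputs
total_macros = input_macros + output_macros  # 12 + 8 = 20
```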
There are 6 reconfigurable zones in the design, with a rectangular shape of 13 CLB width and 21 CLB height (13 × 21 × 4 = 1092 total slices). Considering the size of the Virtex 4 configuration frames, which have 1 CLB width and 16 CLB height, the reconfiguration of a reconfigurable area requires the communication of 26 frames. The different reconfigurable zones do not intersect each other's reconfiguration frames. For this, the bottom boundaries of the reconfigurable zones are at the CLB coordinates 0, 32, and 64 (slices 0, 64, and 128). The layout of the system, as well as the bus macro locations, is depicted in Figure 5 for different contents of the reconfigurable zones after place and route. Each PU employs 1 BRAM for its data memory. The reconfigurable AES PUs require 157 ± 2 slices and the EC PUs require 943 ± 7 slices. The variation in the slice resources employed by each PU is due to the slightly different placement of the resources by the tools for the different reconfiguration zones. The occupation of the reconfigurable zones by the PUs is 14% and 86% for the AES and EC PUs, respectively. Although the occupation of the reconfigurable zones is not complete, the margin of free resources allows the routing delays to be improved. Regarding the complete design, the required resources are 14,092 slices (92% of the total resources) with all the reconfigurable zones implementing EC PUs, and 9,387 slices (61% of the total resources) with all the reconfigurable zones implementing AES PUs. Considering the reconfigurable zones completely occupied, the required resources for the complete design are 14,998 slices (98% of the complete resources). The obtained system can run at a maximum frequency of 100.3 MHz.

Fig. 5: Processor layout with different reconfigurations. ((a) No reconfigurable PUs; (b) AES PUs only; (c) ECC PUs only. The figure also marks the reconfigurable zones and the bus macros.)

The reconfiguration bitstreams were generated in compressed format, using the appropriate Xilinx tool options. The minimum and maximum sizes, in 32-bit words, of the runtime reconfiguration bitstreams are 30662 and 31067 for the EC PUs, and 27898 and 29500 for the AES PUs. Although the reconfiguration area is the same for the AES and EC PUs, the AES PUs result in approximately 5% smaller bitstreams, due to the lower utilization of resources, allowing for a slightly higher compression. The reconfiguration time is directly correlated with the bitstream size and the clock frequency. The ICAP in Virtex 4 technologies allows a 32-bit reconfiguration word to be written in each clock cycle. The maximum ICAP working frequency is 100 MHz [13]; thus we expect that the maximum reconfiguration time can be 31067/100 MHz ≈ 310 µs. However, in the developed prototype, the reconfiguration bitstream is communicated from outside the device and written directly to the ICAP, so the reconfiguration time is limited by the communication process. The communication is performed through a PCI bus working at 33 MHz. Hence, we drive the ICAP with the same bus frequency, with the incoming data immediately transferred to the ICAP. With this, we obtain a maximum reconfiguration time of 31067/33 MHz ≈ 941 µs. In the next subsection we present the results specific to the RNG and PU operation.

A. Random Number Generator

In order to validate the implementation of the RNGs, random bitstreams were collected from the processor and their randomness tested. Two main batteries of tests are used for this purpose: the National Institute of Standards and Technology (NIST) test [14] and the Diehard one [15]. For the implementation proposed in [12], in order to pass both batteries of tests successfully, the authors obtained an RNG with 25 oscillators of 3 inverters each, sampled at
100 MHz. The option of using 3 inverters is justified by the enhanced compactness of the implementation.

The randomness of the bitstream is enhanced if the number of oscillators increases and/or the sampling frequency decreases. For the processor presented herein, using 3 inverters per oscillator, the number of oscillators required to pass both the NIST and Diehard tests at 100 MHz, which is the operating frequency of the prototype, was shown to be 20. Each oscillator is implemented within a CLB, resulting in a very compact RNG.

B. Processing Units

Using the proposed architecture and microcode format, we were able to program the EC scalar multiplication and point addition in 401 instructions, and the AES key expansion, ciphering, and deciphering in 253 instructions. The total latency for the EC PUs is 201,661 clock cycles for the EC scalar multiplication and 4,796 clock cycles for the point addition. The latency for the key expansion and ciphering/deciphering in our AES PU is 610 clock cycles and 2,290 clock cycles, respectively.

Performance metrics for different combinations of simultaneously working PUs in the cryptographic processor are presented in Table I. These metrics are measured at the prototype operating frequency, 100 MHz.

The evaluation in Table I uses 1 EC scalar multiplication and 88 consecutive AES ciphering operations, because the time consumed by one individual EC point multiplication is approximately the time of 88 AES operations, allowing a fair analysis. Although the instruction memory has two ports, we focus our analysis on a single arbiter individually, thus on one of the instruction memory ports. This analysis holds for both arbiters, even if the configuration of the PUs attached to them differs.

An EC point multiplication produces a result in 2.02 ms if no conflicts occur; thus the proposed design provides a throughput of 496 Op/s for only one PU. For 6 EC PUs running simultaneously, the throughput is 1,536 Op/s, which is lower than 6 times the throughput for one PU, due to the conflicts in accessing the instructions. Performing the same analysis for the AES arithmetic, considering the ciphering of 128-bit blocks, the proposed processor provides a throughput from 5.6 Mbit/s for 1 PU to 16.8 Mbit/s for 3 PUs. In this case, the throughput of the system scales directly with the number of PUs, since all the instructions for the 3 AES PUs competing for the instruction memory take the same 3 clock cycles, and thus no conflicts occur. Intermediate configurations can be useful for the dynamic requirements of the host.

We also introduce an efficiency metric in Table I. This efficiency measures the impact of the request conflicts resolved by the instruction memory arbiter: the ratio of time used for useful computing by all the operating PUs within a specific time interval. To perform this efficiency measurement we programmed all the PUs to run the same program consecutively, and after a specific time interval T, measured in clock cycles, the numbers of complete EC (nEC) and AES (nAES) operations were counted. The efficiency (E) is given by:

E = (nEC * TEC + nAES * TAES) / (nPU * T);  (2)

where TEC and TAES are the times of a single EC and AES operation, respectively, without conflicts in the memory accesses, measured in clock cycles, and nPU is the number of active PUs. From Table I, it can be observed that the efficiency is very close to 100% for configurations with fewer than 4 PUs. This result arises from the fact that an instruction takes at least 3 clock cycles to complete; thus the number of conflicts in the arbiter is negligible. Moreover, for the other configurations the efficiency is always greater than 61%.

Comparing the presented results with the related work is not straightforward, since different technologies and different metrics are used by different authors. Nonetheless, we introduce some related-work results to comparatively evaluate our design.

In [4] a compact AES/EC design is proposed, supported on a Xilinx Virtex XCV800 platform running at 41 MHz. Several Logical Units (LUs) that support the basic field operations over GF(2^8) are organized in two reconfigurable modes: a Single-Instruction-Multiple-Data (SIMD) mode that supports the AES arithmetic, and a Single-Instruction-Single-Data (SISD) mode that supports the EC arithmetic. This design does not support simultaneous EC and AES arithmetic, since the LUs must be reconfigured to reuse resources. It offers a throughput of 3.8 Mbit/s for the AES ciphering (128-bit key), and a point multiplication (in GF(2^163)) latency of 5.36 ms. Our AES throughput when using one PU is 5.5 Mbit/s (1.4 times higher) and our latency for the EC point multiplication is 2.02 ms (2.65 times lower). The design in [4] occupies 220K gates (approx. 2329 slices), which is 2.1 times more than one reconfigurable zone in our design. In [4], the sharing of the datapath between the AES and EC results in the splitting of an operation into smaller ones, when these operations could be computed more efficiently in dedicated hardware or using look-up tables. This could justify the lower performance metrics of this design.

In [3] a 0.18 µm ASIC solution operating at 100 MHz is proposed. In this solution the AES and EC arithmetic share the multipliers and registers. With 56K gates, the authors in [3] state that a throughput of 64 Mbit/s for the AES and a latency of 1.8 µs for a field multiplication can be achieved. Considering that 983 field multiplications and 650 squaring operations are required to implement the EC multiplication algorithm, we estimate that the EC point multiplication latency would be >2.9 ms. The design proposed herein is able to perform the EC point multiplication 1.4 times faster. Although our AES throughput is lower, our design can operate AES and EC simultaneously and offers a flexibility and programmability that an ASIC solution cannot.

In [16], a compact solution for AES supported by a Xilinx XC2S15 FPGA running at 67 MHz is proposed. This design is built around two main arithmetic units: a multiply-accumulate unit and a byte substitution unit, to support the non-linear function
TABLE I: Performance metrics for different combinations of simultaneously working PUs.
# ECC PUs | # AES PUs | Latency (K clk cycles) | Latency (ms) | ECC throughput (Op/s) | AES throughput (Mbit/s) | Efficiency (%)
0 0 - - - - -
1 0 201.7 2.02 496 - 100.00
2 0 201.7 2.02 992 - 100.00
3 0 201.7 2.02 1488 - 100.00
4 0 342.3 3.42 1169 - 82.50
5 0 344.9 3.45 1450 - 71.60
6 0 390.5 3.91 1536 - 61.67
0 1 201.5 2.02 - 5.59 99.98
1 1 206.9 2.07 483 5.44 99.08
2 1 223.5 2.24 895 5.04 96.61
3 1 348.8 3.49 860 3.23 81.80
4 1 354.5 3.55 1128 3.18 71.24
5 1 391.4 3.91 1278 2.88 61.59
0 2 201.5 2.02 - 11.18 99.98
1 2 208.1 2.08 481 10.83 98.57
2 2 348.7 3.49 574 6.46 79.27
3 2 350.3 3.50 856 6.43 70.72
4 2 385.9 3.86 1037 5.84 61.20
0 3 201.5 2.02 - 16.77 99.98
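The Efficiency column of Table I is computed with Eq. (2) from Section V. The sketch below simply evaluates the formula with purely illustrative numbers; it does not attempt to reproduce the operation counts behind Table I:

```python
def efficiency(n_ec, t_ec, n_aes, t_aes, n_pu, t_interval):
    """Eq. (2): fraction of available PU-cycles spent on useful computation."""
    return (n_ec * t_ec + n_aes * t_aes) / (n_pu * t_interval)

# Illustrative only: 2 PUs observed for 100,000 cycles complete 40 EC
# operations of 2,000 cycles each and 30 AES operations of 3,000 cycles each.
e = efficiency(n_ec=40, t_ec=2_000, n_aes=30, t_aes=3_000,
               n_pu=2, t_interval=100_000)
# (40*2000 + 30*3000) / (2*100000) = 0.85, i.e. 85% efficiency
```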
required in the AES. These units are controlled by microinstructions, and a microprogram counter controls the program flow and branches. This design achieves a throughput of 2.2 Mbit/s, occupying 124 slices and 2 BRAMs. Our AES PU offers a throughput 2.5 times higher, with 1092 slices allocated for its reconfigurable zone and 4 BRAMs. These 4 BRAMs are the minimum required for an AES PU to operate in the design proposed herein.

VI. CONCLUSIONS

In this paper, a microcoded and customizable cryptographic processor prototype is presented, capable of efficiently computing the AES and EC algorithms, as well as generating secrets through an RNG. The adopted approach relies on efficient and compact EC and AES processing units that share the same control from a central microinstruction memory, allowing simultaneous computing of AES and EC routines. With this processor, customization can be performed by adding processing units according to the processing needs. Additional configuration can be achieved at runtime through the dynamic reconfiguration capabilities of the FPGA. These characteristics make this processor highly adaptable and flexible. The reconfiguration time for a single PU is smaller than an EC multiplication, resulting in negligible impact on the system performance if several reconfigurations need to be performed. The proposed processing units, which provide the computing power of the processor, have been shown to be very compact and suitable for embedded systems, supporting AES and EC with configurations fitting reconfiguration zones of 1092 slices each, and throughputs up to 1536 Op/s for EC and 16.8 Mbit/s for AES. Another advantage of the proposed processor is the inclusion of a compact true RNG in the architecture. This true RNG allows for the internal generation of secrets (such as private keys), thus enhancing the system security.

REFERENCES

[1] National Institute of Standards and Technology, "Federal Information Processing Standards Publication 186-3: Digital Signature Standard," June 2009.
[2] ——, "Federal Information Processing Standards Publication 197: Advanced Encryption Standard," November 2001.
[3] J. Wang, X. Zeng, and J. Chen, "A VLSI implementation of ECC combined with AES," Proc. 8th International Conference on Solid-State and Integrated Circuit Technology, pp. 1899–1904, March 2006.
[4] W. Lim and M. Benaissa, "Subword parallel GF(2^m) ALU: an implementation for a cryptographic processor," Proc. IEEE Workshop on Signal Processing Systems, pp. 63–68, Aug. 2003.
[5] R. Szerwinski and T. Guneysu, "Exploiting the Power of GPUs for Asymmetric Cryptography," Proc. Workshop on Cryptographic Hardware and Embedded Systems, CHES, pp. 79–99, Aug. 2008.
[6] S. Manavski, "CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography," Proc. IEEE International Conference on Signal Processing and Communications, pp. 65–68, Nov. 2007.
[7] O. Kocabas, E. Savas, and J. Grossschadl, "Enhancing an Embedded Processor Core with a Cryptographic Unit for Speed and Security," Proc. International Conference on Reconfigurable Computing and FPGAs, pp. 409–414, Dec. 2008.
[8] S. Antão, R. Chaves, and L. Sousa, "Compact and Flexible Microcoded Elliptic Curve Processor for Reconfigurable Devices," Proc. 7th IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM, March 2009.
[9] Annapolis Micro Systems, Inc., Wildcard 4 Summary Description, 2007, http://www.annapmicro.com/wc4.html.
[10] P. Kocher, J. Jaffe, and B. Jun, "Differential Power Analysis," Proc. 19th Annual International Cryptology Conference, Advances in Cryptology, CRYPTO, vol. 1666, pp. 388–397, 1999.
[11] B. Sunar, W. Martin, and D. Stinson, "A provably secure true random number generator with built-in tolerance to active attacks," IEEE Transactions on Computers, vol. 56, no. 1, p. 109, 2007.
[12] K. Wold and C. Tan, "Analysis and Enhancement of Random Number Generator in FPGA Based on Oscillator Rings," Proc. International Conference on Reconfigurable Computing and FPGAs, REConFig, pp. 385–390, 2008.
[13] Xilinx, Inc., Virtex-4 FPGA Data Sheet: DC and Switching Characteristics, version 3.7, 2009, http://www.xilinx.com/support/documentation/data_sheets/ds302.pdf.
[14] National Institute of Standards and Technology, "A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications, Special Publication 800-22, Revision 1," August 2008, http://csrc.nist.gov/publications/nistpubs/800-22-rev1/SP800-22rev1.pdf.
[15] G. Marsaglia, "Diehard Battery of Tests of Randomness," 1995, http://stat.fsu.edu/pub/diehard/.
[16] T. Good and M. Benaissa, "AES on FPGA from the Fastest to the Smallest," Proc. Workshop on Cryptographic Hardware and Embedded Systems, CHES, pp. 427–440, September 2005.
The Delft Reconfigurable VLIW Processor
Stephan Wong, Fakhar Anjam
Computer Engineering Laboratory
Delft University of Technology
Mekelweg 4, 2628 CD Delft, The Netherlands
E-mail: J.S.S.M.Wong@tudelft.nl, F.Anjam@tudelft.nl
Abstract—In this paper, we present the rationale and design of the Delft reconfigurable and parameterized VLIW processor called ρ-VEX. Its architecture is based on the Lx/ST200 ISA developed by HP and STMicroelectronics. We implemented the processor on an FPGA as an open-source softcore and made it freely available. Using the ρ-VEX, we intend to bridge the gap between general-purpose and application-specific processing through parameterization of many architectural and organizational features of the processor. The parameters include: the instruction set (number and type of supported instructions), the number and type of functional units (FUs), the issue-width (number of slots), the register file size, and the memory bandwidth. The parameters can be set in a static or dynamic manner in order to provide the best performance or the best utilization of the available resources on the FPGA. A complete toolchain including a C compiler and a simulator is freely available. Any application written in C can be mapped to the ρ-VEX processor. This VLIW processor is able to exploit the instruction-level parallelism (ILP) inherent in an application and make its execution faster compared to a RISC processor system. This project creates research opportunities in the domain of softcore embedded VLIW processor prototyping, as well as designs that can be used in high-performance applications.

Keywords: Reconfigurable computing, FPGA, softcore, ILP, VLIW.

I. INTRODUCTION

Very Long Instruction Word (VLIW) processors can be used to increase performance beyond normal Reduced Instruction Set Computer (RISC) architectures [1]. While RISC architectures only take advantage of temporal parallelism (by utilizing pipelining), VLIW architectures can additionally take advantage of spatial parallelism by using multiple functional units (FUs) to execute several operations simultaneously. A VLIW processor improves performance by exploiting Instruction Level Parallelism (ILP) in a program. Field-programmable gate arrays (FPGAs) have become a widely used tool for rapid prototyping, providing both flexibility (as in software programming) and performance (as in dedicated hardware). Nowadays, FPGAs are moving beyond their simple prototyping beginnings towards mainstream products being utilized in many markets: general-purpose, high-performance,

numerical analysis etc. contain a lot of ILP, as they have many independent repetitive calculations. VLIW processors such as the Lx/ST200 [1] from HP and STMicroelectronics and the TriMedia [2] from NXP can exploit the ILP found in an application by means of a compiler. By issuing multiple operations in one instruction, a VLIW processor is able to accelerate an application many times compared to a RISC system [1][3].

This paper presents the design of an open-source, extensible and reconfigurable softcore VLIW processor. The processor architecture is based on the VEX (VLIW Example) Instruction Set Architecture (ISA), as introduced in [4], and is implemented on an FPGA. Parameters of the VLIW processor such as the number and type of functional units (FUs), supported instructions, memory bandwidth, and register file size can be chosen based on the application and the available resources on the FPGA. A software development toolchain including a highly optimizing C compiler and a simulator for VEX is made freely available by Hewlett-Packard (HP) [5]. We additionally present a development framework to optimally utilize the processor. Any application written in C can be executed on the processor implemented on the FPGA. The ISA can be extended with custom operations, and the compiler is able to generate code for the custom hardware units, further enhancing the performance.

The remainder of the paper is organized as follows. Section II explains the rationale behind the project. In Section III, some previous work related to softcore processors is discussed. The VEX VLIW processor architecture and the available software toolchain are discussed in Section IV. Section V presents the design and implementation details of our softcore VLIW processor ρ-VEX. Finally, conclusions are presented in Section VI.

II. THE RATIONALE

The utilization of reconfigurable hardware (with the most common nowadays being the field-programmable gate array (FPGA)) has increased tremendously in the past years due to their inherent parallelism that can be exploited in order to improve
and embedded. the execution of many applications, e.g., multimedia, bio-
For an application to take advantage of performance im- informatics, and many large-scale scientific computing appli-
provement from FPGA, it must possess inherent parallelism, or cations. Many approaches have been adopted to exploit recon-
the application source code should be structured in such a way 1 There are multiple factors that played a role, e.g., lowering cost of
to expose its parallelism. Applications in different domains ownership, but these are not mentioned as the discussion is focussed on
such as multimedia, bio-informatics, wireless communication, performance.

244
figurable hardware, but no single all-encompassing solution VLIW softcore(s) can be instantiated and configured on
has emerged as each performs usually only very well for their the FPGA. In this manner, a short trade-off study, e.g.,
particular environment or supported application(s). However, via a simulator or model, can determine the parameters
many of these solutions are hampered not by their ingenious most-suited for the available hardware and targeted ap-
designs but by the lack of tools to fully exploit the solution plication(s) at hand. This scenario is most suited for the
for more general-purpose cases. Therefore, we proposed the embedded design environment as the requirements and
ρ-VEX processor as a reconfigurable and extensible VLIW platform are usually well-known and fixed. The sharing
softcore processor to bridge the gap between application- of resources between multiple VLIW processors is also
specific and general-purpose processing. In the following, pre-determined.
we first highlight the advantages of our choice for a VLIW • dynamic resource sharing: When neither the application
processor as a starting point: nor the precise characteristics of the attached recon-
• simple hardware: One of the main advantages of VLIW figurable hardware is known at design time, the most
processors is that their hardware design is relatively appropriate scenario is to allow for dynamic resource
simple compared to other RISC processors as there is no sharing. In this scenario, enough resources are instan-
need for complex instruction decoders (e.g., out-of-order tiated to allow for sharing among the multiple VLIW
execution) in hardware since the compiler has already processors running on the same chip. The method how
taken care of the instruction scheduling. This means that to do this is under investigation and initial solutions have
the hardware we need to implement on the FPGA can been proposed already.
be kept simple and, therefore, higher clock frequencies • on-the-fly resource instantiation: When new resources
can be achieved to improve performance. Furthermore, are needed they can be instantiated on-the-fly. Similarly,
additional parallelism can be provided by simply adding when they are no longer needed their space can be freed
more issue slots or functional units. and be dedicated to other applications.
• availability of existing tools: Compilers for VLIW pro- The most promising solution to implement is most certainly
cessor are readily available and research and development the combination of the second and the third benefit mentioned
effort in improving them is still ongoing. Moreover, for above. On the other hand, one must not loose sight of certain
the VEX that we have chosen as a basis, a simulator intrinsic disadvantages of VLIW architectures that prevented
is available to investigate the performance gains for dif- it to become mainstream processors. However, we believe that
ferent architectural instances of the VEX processor. This these disadvantages are mainly due to their fixed design and
means we can exploit existing compilers (and simulators) many of these disadvantages can be mitigated when being
and future advancements without the need to dedicate implemented on reconfigurable hardware. We will highlight
much effort in their development. several issues2 in the following and how they could be
• no need for language translations: Another benefit of addressed:
using an existing VLIW architecture and its toolchain is • varying instruction word widths: Different applications
that there is no longer need for translators and automatic contain different levels of parallelism (this is true even
synthesis tools. Nowadays, e.g., when looking at C- within the same application). In order to fully exploit
2-VHDL tools, restrictions must be placed on the C this more issue slots should be used leading to longer
constructs before they can be utilized for the purpose (and therefore, different) instruction widths. Moreover,
of automatic hardware synthesis and sometimes code when using a different number of instructions can lead
rewriting is necessary to achieve improved performance. to a different encoding scheme of the VLIW instructions
In the latter, the (software) programmer needs to possess and thereby varying their length again. This issue can
hardware knowledge, which is not always the case. This be easily dealt with by the reconfigurable nature of a
means we can take any existing code and compile it to reconfigurable and parameterized VLIW processor as
our VLIW processor without rewriting code and without different instruction decoders can be instantiated. This
requiring the programmer to have hardware knowledge. can be achieved with or without reconfiguring the issue
We see a clear motivation for a reconfigurable VLIW slots (in the latter, unused issue slots can be shared among
processor between hardware design using automatic syn- other different softcores).
thesis tools (starting from C) and manual design as • high number of NOPs: Due to the traditionally fixed
adequate performance can be achieved after just the implementation nature of VLIW processors, their organi-
compilation time. zation may not completely match the parallelism inherent
The choice for a VLIW processor clearly has its advantages in the application leading to a high number of NOPs
and in the following, we will discuss reconfigurability-specific being scheduled. This leads to an under-utilization of the
benefits we foresee: available resources (in some cases to over 50%). Instead
• static resource sharing: When the size of the recon- 2 The length of this paper does not allow for an extensive discussion of the
figurable hardware structure and the available hardware shortcomings of VLIW processors and how they can be addressed. Therefore,
area are known beforehand, one or several pre-configured we only mention the most important ones.

245
of idling issue slots, the reconfigurable VLIW processor can reconfigure the issue slots, or reduce the number of issue slots, i.e., either physically or by enabling sharing.

• unbalanced issue slots: This issue is tightly coupled with the previous one, as it is one of the causes for the scheduling of NOPs: functional units might not be available across all issue slots. This issue can be addressed by adding more functional units per issue slot.

Having stated how a reconfigurable and parameterized VLIW can overcome the traditional shortcomings of a VLIW processor, we will highlight in the following how such a reconfigurable VLIW processor can be used in two likely scenarios:

1) stand-alone general-purpose processor: In this scenario, complete applications (or application threads) run on the VLIW processor. The implementation of the processor can be fixed during the execution of multiple applications, but our envisioned reconfigurable VLIW processor should be able to adapt itself to different applications (or even to code portions within a single application).

2) application-specific co-processor: In this scenario, only specific kernels that require acceleration are compiled to the VLIW processor. The benefits are: (1) no need for code rewriting, (2) avoidance of using complex tools such as C-2-VHDL translators, and (3) manual design of accelerators can be skipped. We have to note again that we are not stating that there is no need for the aforementioned actions or tools, but they can be avoided when the VLIW processor is capable of providing good enough performance within the requirements (such as power and area) set.

Having stated the above, we present an advantage due to the existence of a reconfigurable and parameterized VLIW processor, namely instruction-set architecture (ISA) emulation. This means that we can implement different ISAs on top of the VLIW processor and ensure that each emulation is the most efficient. This will have the obvious advantage that applications compiled for different architectures can be executed without code recompilation (cumbersome) or software code emulation (slow). Moreover, having the mentioned ability allows for the following scenarios:

1) ISA extension emulation: When new ISA extensions are being introduced, much research and development effort is needed in order to ensure market acceptance. However, with a reconfigurable ISA emulator it is possible to implement and ship the (draft) extension to potential end-users for actual use and evaluation. Furthermore, bug reports can lead to further improvements before the extension is fixed in hardware. The latter is still needed, since the performance and power utilization of reconfigurable hardware is usually not optimal. However, early-on experience of developers can lead to a much earlier market adoption of the intended ISA extension.

2) instantiation of dedicated processor organizations: When new processors are released, in many cases code recompilation is needed to take advantage of new organizational improvements. This need can be relaxed, as dedicated organizational features can be provided in the reconfigurable hardware for particular already-compiled code.

3) relaxation of backwards compatibility: Rarely used instructions can be implemented in reconfigurable hardware, and their implementation can be instantiated when needed. This means that complex instruction decoding hardware can be avoided, leading to a simpler hardware design and potentially lower power consumption.

By no means is our research in the design of the ρ-VEX processor finished, and there are still many open questions that need to be solved. However, discussing them is beyond the scope of this paper. In the remainder of this paper, we highlight several other similar approaches and describe in more depth the design of our ρ-VEX processor and its current development status.

III. RELATED WORK

In the literature, few softcore VLIW processors with a complete toolchain can be found. The first VLIW softcore processor found in the literature is Spyder [6]. The design and implementation of Spyder marked the beginning of the reconfigurable VLIW softcore processor. Spyder consists of three reconfigurable units. A compiler toolchain was made available. One of the drawbacks of Spyder was that both the processor architecture and the compiler were designed from scratch. Because the designer had to put effort in both directions, the processor did not evolve extensively.

Instance-specific VLIW processors are presented in [7][8]. These architectures are specific implementations for some applications, and do not represent a more general VLIW processor. A VLIW processor with a reconfigurable instruction set is presented in [9]. An FPGA-based design of a VLIW softcore processor is presented in [10]. Additionally, this processor is able to execute custom hardware. It has an ISA that is binary-code compatible with the Altera NIOS-II soft processor. To support this architecture, a compilation and design automation flow are described for programs written in C. The compilation scheme consists of Trimaran [11] as the front-end and the extended NIOS-II as the back-end. Due to the licensed Altera NIOS-II, this VLIW design is less flexible and not open source.

In [12], a modular design of a VLIW processor is reported. Certain parameters of the processor architecture could be altered in a modular fashion. The lack of a good software toolchain and the absence of parametric extensibility limited the use of this architecture. In [13], the architecture and micro-architecture of a customizable soft VLIW processor are presented. Additionally, tools are discussed to customize, generate and program this processor. Performance and area trade-offs achieved by customizing the processor's datapath and ISA are evaluated. The limitation is the absence of a compiler. In [14], the design and architecture of a VLIW
microprocessor is presented without any toolchain, which restricts the processor's usability.

In [3], we presented the design and implementation of a reconfigurable VLIW softcore processor called ρ-VEX. In addition, a development framework to utilize the processor is presented. The processor architecture is based on the VLIW Example (VEX) ISA, as introduced in [4]. VEX represents a scalable technology platform that allows variation in many aspects, including instruction issue-width, organization of functional units, and instruction set. A software development toolchain for the VEX architecture [5] is freely available from Hewlett-Packard (HP). The ρ-VEX processor is open-source and implemented on an FPGA. Different parameters such as the number and types of functional units, supported instructions, memory bandwidth, and size of the register file can be chosen based on the application requirements and the available resources on the FPGA. Initially, an instruction ROM file had to be generated for each application to be run on the processor, and the design had to be re-synthesized along with the instruction ROM file. Now a boot-loader-like functionality has been added, and the executable files can be downloaded to the instruction memory and executed directly, avoiding the necessity of re-synthesis.

IV. THE VEX VLIW PROCESSOR: HARDWARE AND RELATED SOFTWARE

Compared to superscalar and RISC processors, a VLIW architecture requires a more powerful compiler due to more complex operation scheduling. [4] presents the definition of the VLIW design philosophy as: "The VLIW design philosophy is to design processors that offer ILP in ways completely visible in the machine-level program and to the compiler".

A. The VEX System

VEX stands for VLIW Example. VEX is a system developed according to the VLIW philosophy by Hewlett-Packard (HP). VEX includes three basic components [4]:

1) The VEX ISA: The VEX Instruction Set Architecture (ISA) is a 32-bit clustered VLIW ISA that is scalable and customizable to individual application domains. The VEX ISA is loosely modeled on the ISA of the HP/ST Lx (ST200) family of VLIW embedded cores [1]. The VEX ISA is scalable because different parameters of the processor such as the number of clusters, FUs, registers, and latencies can be changed. The VEX ISA is customizable because special-purpose instructions can be defined in a structured way.

2) The VEX C Compiler: Based on trace scheduling, the VEX C compiler is an ISO/C89 compiler. It is derived from the Lx/ST200 C compiler, which itself is a descendant of the Multiflow C compiler. A very flexible programmable machine model determines the target architecture, which is provided as input to the compiler. This means that architecture exploration of the VEX ISA is possible with this compiler without the need to recompile the compiler.

3) The VEX Simulation System: The VEX simulator is an architectural-level simulator that uses compiled-simulator technology to achieve faster execution. It additionally provides a set of POSIX-like libc and libm libraries (based on the GNU newlib libraries), a simple built-in cache simulator (level-1 cache only) and an Application Program Interface (API) that enables other plug-ins used for modeling the memory system.

A VEX software toolchain including the VEX C compiler and the VEX simulator is made freely available by Hewlett-Packard Laboratories [5]. The reason behind choosing the VEX architecture for our project is the scalability and customizability of the VEX ISA and the availability of the free C compiler and simulator, which can be used for architecture exploration.

B. The VEX Instruction Set Architecture

VEX offers a 32-bit clustered VLIW ISA. VEX models a scalable technology platform for embedded VLIW processors that allows variations in the parameters of the processor. Following the VLIW design philosophy, the parameters of the processor, such as issue width, FUs, register files and processor instruction set, can be varied. The compiler is responsible for scheduling the instructions. Along with basic data and operation semantics, VEX includes many features for compiler flexibility in scheduling multiple concurrent operations. Some of these features are [4]:

• Parallel execution units, such as integer ALUs and multipliers.
• Parallel memory pipelines, including access to multiple data memory ports.
• Data prefetching and other locality hints supported by the architecture.
• A large multiported shared register file made visible by the architecture.
• Partial predication through select operations.
• Multiple condition registers to enable an efficient branch architecture.
• Long immediate operands that can be encoded in the same instruction.

Table I presents the parameters that can be changed for a VEX VLIW processor.

Table I
THE VEX DESIGN PARAMETERS

Processor Resource        Design Parameters
Functional Units          Number of FUs, type, supported instructions, degree of pipelining
Register File             Register size, register file size, number of read ports, number of write ports
Load/Store Unit           Number of memory ports, memory latency, cache size, line unit
Interconnection Network   Number and width of buses, forwarding connections between units

The most basic unit of execution in VEX is an operation, which is equivalent to a typical RISC-style instruction. An
encoded operation in the VEX system is called a syllable. Multiple syllables are combined to form an instruction, which is an atomic unit of execution in a VLIW processor. The instruction issue-width is the number of syllables in an instruction that can be issued, and it depends on the number of FUs in the processor. An instruction having multiple syllables or operations is issued every cycle by the compiler to the multiple execution units of the VLIW processor, which is the main reason for the performance gain compared to a RISC processor, which has an issue-width of one.

1) Multicluster Organization: The number of read ports of the shared multiported register file in a VLIW processor is twice the issue-width, and the number of write ports is equal to the issue-width (assuming that each FU requires two input operands and writes one output as a result). The number of read and write ports of the shared register file therefore grows with the issue-width. The resource/area requirement for a multiported register file is directly proportional to the product of the number of read and write ports; therefore, these parameters are not scalable to a large extent, which means that the issue-width is not scalable to a large extent. To reduce this pressure on the number of read and write ports of the shared register file, VEX defines a clustered architecture [4]. Using modular execution clusters, VEX provides scalability of issue-width and functionality. A cluster is a collection of register files and a tightly coupled set of FUs.

VEX clusters are numbered from zero. Cluster 0 is a special cluster that must always be present in any VEX implementation, because the control operations execute on this cluster. Different clusters have different unit/register mixes, but a single Program Counter (PC) and a unified I-cache control them all, so that they run in lockstep or proper sequence [1]. The structure of a VEX multicluster architecture is depicted in Figure 1.

FUs within a cluster can only access registers in the same cluster. VEX provides a simple pair of send-receive instruction primitives that move data among registers on different clusters. These intercluster copy operations may consume resources in both the source and destination cluster and may require more than one cycle (pipelined or not). There is only a single instruction cache (I-cache), but different data cache (D-cache) ports and/or private memories can be associated with each cluster. This means that VEX allows multiple memory accesses to execute simultaneously. Figure 1 depicts multiple D-cache blocks, attached by a crossbar to different clusters, which allows a variety of memory configurations.

Figure 1. The VEX Multicluster Organization

VEX clusters obey the following set of rules [4]:

• Each cluster has the ability to issue multiple operations in the same instruction.
• Different clusters can have different issue-widths and different types of operations.
• Different clusters can have different VEX ISAs, and not all clusters have to support the entire VEX ISA.
• All units within a cluster are indistinguishable or equally likely for selection. This means that the operations to be executed by a cluster do not have to be assigned to particular units within this cluster. Assigning operations to the units within a cluster is the job of the hardware decoding logic.

2) Structure of the Default VEX Cluster: The default single VEX cluster is a 4-issue VLIW core, as depicted in Figure 2, and consists of the following units [4]:

• Four 32-bit integer ALUs
• Two 16x32 multipliers (MULs)
• One Load/Store Unit
• One Branch Unit
• 64 32-bit general-purpose registers (GRs)
• 8 1-bit branch registers (BRs)

This cluster can issue up to four operations per instruction. These operations can be either integer ALU, MUL, or Load/Store operations. All FUs are directly connected to registers, and no FU is directly connected to another FU. The two types of register banks are GR and BR. Both are multiported and shared register files. Memory units support only load and store operations, i.e., operations that act on memory and save results directly in memory are not supported by the VEX system. The branch unit (control unit) in the default cluster is used for program sequencing and is present only in cluster 0 in the case of a multicluster machine.

C. The VEX C Compiler

The VEX C compiler is derived from the Lx/ST200 C compiler, which is itself derived from the Multiflow C compiler [15]. The Multiflow compiler includes high-level optimization algorithms based on Trace Scheduling [16]. The VEX C compiler is provided as a part of the freely available VEX toolchain by HP. The compiler supports the old C language, as well as ISO/ANSI C. The toolchain has a command-line interface and is provided in the form of binaries. Different command-line options are provided for the compiler and the toolchain. Applications can be compiled with profiling flags, and GNU gprof can be used to visualize the profile data. Because the VEX processor is scalable and customizable,
the compiler supports this scalability and customizability. To compile for a different configuration, the compiler is provided with configuration information in the form of a Machine Model Configuration (fmm) file. To include a custom instruction, the application code is annotated with pragmas. Different compiler pragmas are available to improve the performance. Refer to [4] for details on how to use the VEX compiler.

Figure 2. The Default VEX Cluster

D. The VEX Simulation System

The VEX toolchain provides tools that allow C programs compiled for a VEX VLIW configuration to be simulated on a host workstation. The VEX simulator is a fast compiled simulator (CS) that translates a VEX binary to a host-computer binary. It first converts the VEX binary to C and then, using the C compiler of the host, generates a host executable. The compiled-simulation workflow is depicted in Figure 3.

The VEX simulator produces instrumentation code to count execution cycles and other statistics, and generates a log file at the end of simulation. This log file has all the statistical information that can be analyzed for performance analysis and architecture exploration. The simulator provides a simple cache simulation library, which models an L1 instruction and data cache. The default cache simulation can be replaced by a user-defined library. In addition, the VEX simulator includes support for gprof, and the different statistical files generated at the end of simulation can be used with the gprof tool for analysis and profiling of the simulated application. Refer to [4] for details on how to use the VEX simulator.

Figure 3. The VEX Simulation Flow

V. THE ρ-VEX SOFTCORE VLIW PROCESSOR

In [3], we presented the design and implementation of a reconfigurable and extensible softcore VLIW processor. We implemented a single-cluster standard configuration of the VEX machine for our processor, called ρ-VEX. Figure 4 depicts the organization of our 32-bit, 4-issue VLIW processor implemented on an FPGA. The ρ-VEX processor consists of fetch, decode, execute and writeback stages. The fetch unit fetches a VLIW instruction from the attached instruction memory, and splits it into syllables that are passed on to the decode unit. In the decode stage, the instructions are decoded and the register contents used as operands are fetched from the register file. The actual operations take place in either the execute unit, or in one of the parallel CTRL or MEM units. ALU and MUL operations are performed in the execute stage. This stage is implemented parametrically, so that the number of ALU and MUL functional units can be adapted. All jump and branch operations are handled by the CTRL unit, and all data memory load and store operations are handled by the MEM unit. All write activities are performed in the writeback unit to ensure that all targets are written back at the same time. The different write targets can be the GR register file, the BR register file, the data memory, or the PC.

The ρ-VEX implements all of the 73 operations of the VEX operation set. It additionally supports reconfigurable operations, as the VEX compiler supports the use of custom instructions via pragmas inside the application code. In the current ρ-VEX prototype, it takes only a few lines of VHDL code to add a custom operation to the architecture. One of the 24 available reserved opcodes can be chosen, and a provided template VHDL function can be extended with the custom functionality. Currently, the following properties of ρ-VEX are parametric:

• Syllable issue-width
• Number of ALU units
• Number of MUL units
• Number of GR registers (up to 64)
• Number of BR registers (up to 8)
• Types of accessible FUs per syllable
• Width of memory busses

To optimally utilize the processor, a development framework is provided, which consists of compiling a piece of C code with the VEX compiler and then generating a VHDL instruction ROM file by assembling the assembly file with our assembler [3]. The ROM file is then synthesized with the rest of the processor VHDL design files.

As the target reconfigurable technology, a Xilinx Virtex-II Pro (XC2VP30) FPGA was chosen, embedded on the XUP V2P development board by Digilent. All experiments were performed on a non-pipelined ρ-VEX system with 32 general-purpose registers (GR). A data memory of 1 kB implemented using BlockRAM was connected to ρ-VEX to store results. The issue-width of ρ-VEX was varied between 1, 2 and 4. All configurations had the same number of ALU units as their issue-width. The 2- and 4-issue ρ-VEX configurations had 2 MUL units. The application code was loaded in the instruction memory before synthesis. We developed a debugging UART interface to transmit data via the serial RS-232 protocol. This interface invoked a transmission of the hexadecimal representation of the data memory contents, as well as the contents
of the internal ρ-VEX cycle counter register. Synthesis results for the ρ-VEX processor are presented in Table II.

Figure 4. The ρ-VEX VLIW Processor

A. Recent Developments

The following design improvements have been added to the original ρ-VEX processor:

• The assembler has been extended and now generates a binary executable file for the processor. The hardware design has been modified, and BlockRAM is used as the instruction memory. The executable file can be downloaded into the instruction memory of the already placed processor on the FPGA using a serial port on the PC and the FPGA development board; therefore, re-synthesis and re-implementation of the processor when changing the application are not required.

• We implemented a dynamically reconfigurable register file for the ρ-VEX processor to reduce the resources required by the multiported register file [17]. The VEX architecture supports up to 64 multiported shared registers in a register file for a single-cluster VLIW processor. This register file accounts for a considerable amount of area when the VLIW processor is implemented on an FPGA. Our processor design supports dynamic partial reconfiguration, allowing the creation of dedicated register file sizes for different applications. The processor can dynamically create its own register file composed of the actual number of registers the application needs. This means that valuable area can be freed and utilized for other implementations running on the same FPGA when not the full register size is needed. Our design needs 924

time without increasing cycle count.

VI. CONCLUSIONS

In this paper, we presented the design and implementation of a reconfigurable softcore VLIW processor based on the Lx/ST200 ISA developed by HP and STMicroelectronics. Our processor design, called ρ-VEX, is parametric, and different parameters such as the number and type of functional units, supported instructions, memory bandwidth, and register file size can be chosen depending upon the application and the available resources on the FPGA. A toolchain including a C compiler and a simulator is freely available. We provide a development framework to optimally utilize the reconfigurable VLIW processor. Any application written in C can be mapped to the VLIW processor on the FPGA. This VLIW processor is able to exploit the instruction-level parallelism (ILP) inherent in an application and make its execution faster compared to a RISC processor system. We described our rationale for the ρ-VEX processor and presented the possible advantages it can provide for its use as a general-purpose processor or application-specific co-processor.

REFERENCES

[1] P. Faraboschi, G. Brown, J.A. Fisher, G. Desoli, and F. Homewood, "Lx: A Technology Platform for Customizable VLIW Embedded Processing", in Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA 00), June 2000, pp. 203-213.
[2] TriMedia Processor Series. http://www.nxp.com/.
[3] S. Wong, T.V. As, and G. Brown, "ρ-VEX: A Reconfigurable and Extensible Softcore VLIW Processor", in IEEE International Conference on Field-Programmable Technologies (ICFPT 08), Taiwan, December 2008.
[4] J. Fisher, P. Faraboschi, and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Morgan Kaufmann, 2004.
[5] Hewlett-Packard Laboratories. VEX Toolchain. [Online]. Available: http://www.hpl.hp.com/downloads/vex/.
[6] C. Iseli and E. Sanchez, "Spyder: A Reconfigurable VLIW Processor using FPGAs", in FPGAs for Custom Computing Machines, January 1993, pp. 17-24.
[7] C. Grabbe, M. Bednara, J.V.Z. Gathen, J. Shokrollahi, and J. Teich, "A High Performance VLIW Processor for Finite Field Arithmetic", in Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS 03), April 2003.
[8] M. Koester, W. Luk, and G. Brown, "A Hardware Compilation Flow For Instance-Specific VLIW Cores", in Proceedings of the 18th International Conference on Field Programmable Logic and Applications (FPL 08), Sep 2008.
[9] A. Lodi, M. Toma, F. Campi, A. Cappelli, and R. Canegallo, "A VLIW Processor with Reconfigurable Instruction Set for Embedded
Applications", in IEEE Journal on Solid-State Circuits, vol. 38, no. 11,
slices on a Virtex-2 Pro device for dynamically placing Jan 2003, pp. 1876 - 1886.
a chunk of 8 registers, and places registers in multiples [10] A.K. Jones, R. Hoare, D. Kusic, J. Fazekas, and J. Foster, "An FPGA-
of 8 registers to simplify the design. The processor does based VLIW Processor with Custom Hardware Execution", in Pro-
ceedings of the 2005 ACM/SIGDA 13th Internal Symposium on Field
not need permanently 64 registers requiring 8594 slices Programmable Gate Arrays (FPGA 05), New York, NY, USA: ACM,
thereby considerably reducing the slice utilization at run- 2005, pp. 107 - 117.
[11] http://www.trimaran.org/.
[12] V. Brost, F. Yang, and M. Paindavoine, "A Modular VLIW Processor",
Table II in IEEE International Symposium on Circuits and Systems, ISCAS 2007.,
S YNTHESIS R ESULTS FOR ρ-VEX P ROCESSOR Apr 2007, pp. 3968 - 3971.
[13] M.A.R. Saghir, M. El-Majzoub, and P. Akl, "Customizing the Datapath
ρ-VEX Slices Max. Frequency and ISA of Soft VLIW Processors", in High Performance Embedded
Architectures and Compilers (HiPEAC 07), LNCS 4367 pp. 276-290,
1-issue 1895 (13%) 89.44 MHz Springer-Verlag Berlin Heidelberg 2007.
2-issue 5105 (37%) 89.44 MHz [14] W.F. Lee, VLIW Microprocessor Hardware Design For ASICs and
4-issue 10433 (76%) 89.44 MHz FPGA. McGraw-Hill, 2008.

250
[15] P.G. Lowney et al, "The Multiflow Trace Scheduling Compiler", The
Journal of Supercomputing, 7(1/2), 51-142, 1993.
[16] J. Fisher, "Trace Scheduling: A Technique for Global Microcode Com-
paction", IEEE Trans. on Computers, C-30(7), 478-490, 1981.
[17] S. Wong, F. Anjam, and M.F. Nadeem, "Dynamically Reconfigurable
Register File for a Softcore VLIW Processor", Accepted for publications
in DATE 2010.
Run-time Reconfiguration of Polyhedral Process Networks Implementations

Hristo Nikolov    Todor Stefanov    Ed Deprettere

Leiden Institute of Advanced Computer Science
Leiden University, The Netherlands
{nikolov, stefanov, edd}@liacs.nl
Abstract

Run-time reconfigurable computing is a novel computing paradigm which offers greater functionality with a simpler hardware design and reduced time-to-market. Although reconfigurable technology is constantly advancing, reconfigurable computing is hardly employed in real systems due to the difficulties associated with realizing and managing the reconfiguration process. In this paper, we address a particular design challenge, namely, the execution management of the dynamic (reconfigurable) modules. We propose a general and technology-independent approach for modeling and implementation of run-time execution management for applications modeled as polyhedral process networks. By exploiting the main characteristics of the polyhedral process networks, the approach guarantees consistent executions of reconfigurable implementations. We do not focus on low-level implementation issues of the reconfiguration process itself, since the latter is not (directly) related to the execution management we propose and is therefore out of the scope of this paper.

1 Introduction

When we talk about (re)configurable computing, we usually consider FPGA-based system designs. Such systems retain the execution speed of "fixed" hardware while having a great deal of functional flexibility, because the logic within the FPGA can be changed if or when it is necessary. As a result, hardware bug fixes and upgrades can be administered as easily as their software counterparts. For example, in order to support a new version of a network protocol, one can redesign the internal logic of the FPGA and send the enhancement to the affected customers by email. Once they have downloaded the new logic design to the system and restarted it, they will be able to use the new version of the protocol. Evolving from configurable computing, reconfigurable computing goes one step further by providing manipulation of the logic within the FPGA at run time. That is, the design of the hardware may change in response to the demands placed upon the system while it is running. Here, the FPGA acts as an execution engine for a variety of different hardware functions, much as a CPU acts as an execution engine for a variety of software threads. A particular example of run-time reconfigurable computing is the so-called dynamic partial reconfiguration (DPR). Partial reconfiguration is the process of configuring a portion of a field programmable gate array while the other part is still running/operating. DPR allows critical parts of the design to continue operating while a partial design is loaded into the FPGA.

Reconfigurable computing has two major advantages. First, it is possible to achieve greater functionality with a simpler hardware design. Because not all of the logic must be present in the FPGA at all times, the cost of supporting additional features is reduced to the cost of the memory required to store the logic design. The second advantage is reduced time-to-market. Most importantly, the logic design remains flexible up to, and even after, the product is shipped. This allows an incremental design flow, a luxury usually not available to typical hardware designs. One can even ship a product that meets the minimum requirements and add features after deployment. Moreover, in a networked product like a set-top box or a cellular telephone, it may even be possible to make such enhancements without customer involvement. In the case of run-time reconfigurable computing, a main consideration is the overhead introduced by the reconfiguration process itself. If reconfiguration is performed too often, this overhead can become a bottleneck, limiting system performance. Therefore, the ratio of execution time to reconfiguration time has to be kept reasonably high.
1.1 Problem Statement

The principal benefits of using dynamic (partial) reconfiguration (DPR) are the ability to execute larger hardware designs with fewer gates and to realize the flexibility of a software-based (multi-threaded) solution while retaining the execution speed of a more traditional, hardware-based approach. However, this comes at the price associated with the difficulties in realizing run-time reconfigurable computing. First, the provided design flows are weak and mostly experimental. It is not possible to model DPR during all the steps of a system development. For instance, SystemC can be used for the first high-level steps, but then it is difficult to use other tools, e.g., HW/SW partitioning tools, simply because DPR is not integrated by the tool vendors. For the low-level steps, it is (almost) impossible to simulate and validate the designs before the platform is integrated into the final board. As a result, designers are overwhelmed with too many and very low-level details in order to "get it right", making reconfigurable computing a highly error-prone and time-consuming task.

In addition to the lack of tool support, a major challenge when using dynamic reconfiguration is the execution management of the dynamic (reconfigurable) modules. This includes both spatial and temporal management. The latter is especially important in realizing reconfigurable implementations with consistent run-time behavior. Consistency here means that any reconfigurable implementation and execution generates results equivalent to those of its non-reconfigurable counterpart for the same application. The challenge in realizing an execution management is further exacerbated by the complexity of today's applications, especially in the domain of multimedia embedded systems. Usually, such systems consist of multiple compute modules that operate in a globally asynchronous fashion. If these modules require reconfiguration, i.e., they are dynamic, it is very easy to violate consistency at run time. This very much resembles the challenges in software multi-threading: common problems with thread synchronization include deadlock and the inability to (correctly) compose program fragments that are correct in isolation [3, 6]. In general, it is not known how a programmer can come up with a multi-threaded program with correctness guarantees. The same problems arise in reconfigurable computing as well, i.e., there is no correctness guarantee for applications demanding and implementing reconfiguration at run time. We address this issue, and in this paper we present an approach based on conditions defining "safe" points when reconfiguration may occur. The main contribution of the proposed approach is that if the defined conditions are respected, consistent system executions are guaranteed while allowing asynchronous reconfiguration of different dynamic modules at run time.

The remaining part of the paper is organized as follows. In Section 2, we discuss the scope of the approach and the main assumptions it relies on. Section 3 presents the solution approach. Implementation details are discussed in Section 4. Section 5 concludes the paper.

2 Scope of Work

One of the main assumptions in our work is that we consider only dataflow-dominated applications in the realm of multimedia, imaging, and signal processing that naturally contain tasks communicating via streams of data. Such applications are very well modeled by using the parallel dataflow model of computation (MoC) called Kahn Process Network (KPN) [4]. The KPN model we use is a network of concurrent autonomous processes that communicate data in a point-to-point fashion over bounded FIFO channels, using a blocking read/write on an empty/full FIFO as a synchronization mechanism. Each process in the network performs a sequential computation concurrently with the other processes. A well-known characteristic of KPNs is that their MoC is deterministic: for a given input data, always one and the same output data is produced. This input/output relation does not depend on the order in which the processes are executed. As the control is incorporated into the processes, no global scheduler is present.

To represent KPNs, we use polyhedral descriptions; therefore, we call our KPNs polyhedral process networks (PPN). The PPNs are a specific case of KPNs, i.e., PPNs are static and everything about the execution of the process networks is known at compile time. Moreover, the PPNs execute in finite memory, and the amount of data communicated through the FIFO channels is also known. We are interested in this subset of KPNs because they are analyzable, e.g., FIFO buffer sizes and execution schedules are decidable, and SW/HW synthesis from them is possible. A PPN is implemented as a heterogeneous multiprocessor system-on-chip (MPSoC) using the Daedalus design methodology [1, 10]. In such MPSoCs, the processing components are programmable processors and dedicated HW compute modules (IP cores). The latter may provide run-time reconfiguration. In this paper, we consider fixed communication topologies, i.e., a communication topology cannot be reconfigured in a target MPSoC. Hence, reconfiguration can be applied only on the dedicated dynamic IP cores.
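The KPN semantics described above can be sketched in a few lines of Python (our own illustration, not code from the Daedalus tool flow): two concurrent processes communicate over a bounded FIFO whose read blocks when the channel is empty and whose write blocks when it is full:

```python
# Minimal sketch of KPN semantics: two concurrent processes connected by a
# bounded FIFO with blocking read (on empty) and blocking write (on full).
# This is an illustration only, not part of the Daedalus/Espam tool flow.
import queue
import threading

fifo = queue.Queue(maxsize=4)   # bounded FIFO channel
results = []

def producer():
    for token in range(8):
        fifo.put(token)          # blocks when the FIFO is full

def consumer():
    for _ in range(8):
        token = fifo.get()       # blocks when the FIFO is empty
        results.append(token * 2)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# KPN determinism: the output depends only on the input stream,
# not on the interleaving of the two processes.
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

Whatever the thread interleaving, the FIFO preserves token order, so the output stream is always the same — the deterministic input/output relation noted above.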
[Figure 1. Motivating example: (a) processing without reconfiguration; (b) processing with reconfiguration; (c) PPN with dynamic parameters.]

An IP core implements the main computation of a PPN process, which behaves like a function call. Therefore, the computation performed by a reconfigurable IP core has to resemble a function call as well. This means that for each input data read by the IP core, the core is executed and it produces output data after an arbitrary delay. In addition, to guarantee seamless integration within the dataflow of the considered heterogeneous systems, an IP core must have unidirectional data interfaces at the input and the output that do not require random access to read and write data from/to memory. Additional information about the IP cores is given in Section 4.
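The function-call contract of an IP core can be sketched as follows (a behavioral illustration with hypothetical names, not an actual Espam-generated module): one firing per input token, with strictly unidirectional streaming interfaces:

```python
# Sketch of an IP core behaving like a function call within a PPN process:
# for each token read from the input FIFO, the core fires once and, after
# some delay, writes one result token to the output FIFO. Unidirectional
# streaming interfaces only -- no random access to memory.
from collections import deque

def ip_core_wrapper(core_function, in_fifo, out_fifo):
    """Fire the core once per input token (illustrative model)."""
    while in_fifo:
        x = in_fifo.popleft()        # read: initializes the function argument
        y = core_function(x)         # execute: one firing of the core
        out_fifo.append(y)           # write: result token to the output FIFO

in_fifo = deque([1, 2, 3])
out_fifo = deque()
ip_core_wrapper(lambda x: x * x, in_fifo, out_fifo)  # hypothetical core
print(list(out_fifo))  # [1, 4, 9]
```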
3 Solution approach

In this section, we discuss the solution approach which allows for run-time reconfiguration of PPN processes in a way that consistent and deterministic PPN executions are guaranteed on the considered MPSoCs. For illustrative purposes, we use the example presented in Section 3.1. The PPN model is briefly introduced in Section 3.2. It contains parameters which may change values at run time. The concept of modeling process networks containing dynamic parameters was introduced recently in [7]. We use the same approach as in [7] to preserve consistency of PPN executions and, in addition, we use the parameter values to trigger reconfiguration of particular processes (i.e., IP cores) at run time. In the proposed solution approach, we do not discuss technical details about how FPGA partial reconfiguration is realized, since this is highly vendor dependent and out of the scope of this work. Instead, we discuss when reconfiguration is actually safe to happen (in terms of consistency). This is based on conditions which have to be respected at run time. The conditions are discussed in Section 3.4.

3.1 Illustrative example

Below, we present a part of a multi-format video encoding application. Usually, encoding algorithms work on a YUV color space while, naturally, the input video information is represented in an RGB color space. Therefore, an initial conversion to YUV is required and then specific processing on the Y, U, and V image components is performed. Figure 1 illustrates this basic scenario, which we will use as our illustrative example. Figure 1(a) depicts a high-level view of an MPSoC system in which the input RGB stream is converted by processing component Conv to Y, U, and V streams. They are further processed in parallel by processing components Yproc, Uproc, and Vproc, respectively. Figure 1(b) depicts the same RGB-to-YUV conversion and processing, however, implemented on a system with run-time reconfiguration. In this version, there is one dynamic module Proc which is used to process the YUV data. According to the data that need to be processed, Proc is dynamically reconfigured by the Conv component. The implementation of the reconfiguration process must avoid any undetermined behavior. Therefore, explicit handshake logic is required for correct management of the reconfiguration. For brevity, these details are omitted in Figure 1(b).

In our example, we use only the type of the processed data to illustrate a reconfigurable computing scenario. However, depending on the required level of flexibility, additional information, e.g., frame size (standard or high definition), type of encoding (MJPEG, MPEG4, or DivX), etc., can also be used for reconfiguration of the system at run time. Moreover, due to performance limitations, for example, the quality of the encoding may need to be constrained as well. In our approach to reconfigurable computing, we capture reconfiguration information at the application level, i.e., in the polyhedral process network model we use to specify application behavior. More precisely, different configuration possibilities are defined by a set of parameters and their values in a PPN. Our illustrative example is represented as a PPN in Figure 1(c). It consists of two processes, P1 and P2, connected through one dataflow channel (d). P1 implements the RGB-to-YUV conversion and P2 realizes the processing of the Y, U, and V components. The information about what type of image component is to be processed is specified by a parameter. In order to transfer parameter values between the processes, we use control FIFO channels, i.e., channel c in Figure 1(c). At run time, the parameter values are used to trigger proper reconfigurations.

As is the case with all dataflow models, the main question here is whether PPNs with dynamic parameters are consistent. Consistency has to do with
a balancing of the production and consumption of tokens in the network. When this balancing is dependent on dynamic parameters, consistency conditions may be violated. In the remaining part of the paper, we discuss how we address this problem in order to guarantee consistent executions of applications modeled as PPNs on platforms using run-time partial reconfiguration.

3.2 Polyhedral (Kahn) process networks (PPN)

The parallelism in our PPNs is expressed at the level of the application tasks, as a process implements a single application task only. A process of a PPN consists of a function, input ports, output ports, and control. The function specifies how data tokens from input streams are transformed to data tokens on output streams. The function also has input and/or output arguments. The input and output ports are used to connect a process to FIFO channels in order to read data tokens, initializing the function input arguments, and to write data generated as a result of the function execution. The control specifies how many times the function is executed and which ports to read/write at every execution, i.e., at every iteration (firing) of the process. The control of a process can be compactly represented mathematically in terms of linearly bounded sets of iterator vectors using the polytope model [2]. A process has a Process Domain (DM) which is the set of all iterator vectors. Each iterator vector corresponds to one and only one integral point in a polytope. Formally,

DM = P(p) ∩ Z^n,

where P(p) is a parametric polytope,

P(p) = {i ∈ Q^n, p ∈ Z^m | Ai ≥ Bp + C},

where i is an iteration vector, A, B, and C are integral matrices of appropriate dimensions, and p is a parameter vector with an affine range R(p),

R(p) = {p ∈ Z^m | Dp ≥ E},

where D and E are integral matrices of appropriate dimensions. We use the values of the parameter vector's elements to determine different configuration options at run time.

3.3 Process network instance

In our approach to model dynamic parameters, we introduce the notion of a PPN instance, which is defined by the current values of the elements of the parameter vector. Consider the PPN representing a producer-consumer pair, shown in Figure 2(a). N1 and N2 are the FIFO channels of the parameters N1 for process P1 and N2 for process P2, respectively. Each parameter can take values within a fixed range. PPN(N1, N2) denotes an instance of the PPN.

[Figure 2. PPN and process execution cycle: (a) PPN with dynamic parameters; (b) structure of a process.]

1  // Execution of process P1
2  while( 1 ) {
3    // Execution cycle
4    read_parameter( N1 );
5    for ( int i=1; i<=N1; i=i+1 ) {
6      read( a, x );
7      execute_P1( x, &y );
8      write( y, b );
9    }
10 }

There is generally a relation between the parameters, in this example between N1 and N2. Therefore, some instances PPN(N1, N2) are invalid. For the PPN in Figure 2(a), all different instances are:

Parameter ranges:    PPN instances PPN(N1, N2):
1 ≤ N1 ≤ 3;          PPN(1,1); PPN(1,2); PPN(1,3)
1 ≤ N2 ≤ 3;          PPN(2,1); PPN(2,2); PPN(2,3)
N2 ≥ N1;             PPN(3,1); PPN(3,2); PPN(3,3)

Instances PPN(2,1), PPN(3,1), and PPN(3,2) are invalid because they violate the condition N2 ≥ N1. Similarly, instance PPN(2,4) is invalid because N2 is out of its range. Figure 2(b) shows the structure of a process we propose to deal with dynamic parameters. Network instances are selected by reading parameter values at run time. For this purpose, we add a read-parameters phase, see line 4, prior to the actual processing at lines 5-9. Because reading parameters and data processing are repeated (possibly an infinite number of times), we call this a process execution cycle (lines 3-9). When all processes in a PPN have performed an execution cycle, a network instance has performed an execution.

Definition 3.1 (Consistency of a PN instance). A PN instance is consistent if, after an execution, the number of tokens written to any channel is equal to the number of tokens read from it.

3.4 Preserving the consistency

The validity of the PPN instances is a necessary but not a sufficient condition to preserve the PPN consis-
tency when changing parameter values at run time. A valid set of parameters corresponds to a valid (and consistent) PPN instance. However, the transition from one valid instance to another valid instance at an arbitrary point may violate the consistency of the instances and the PPN execution. In order to transfer new values for parameters to a process of the PPN at run time, i.e., to select a new PPN instance, we use control channels with FIFO organization using a blocking read/write synchronization mechanism. In addition, we define the following three conditions which are sufficient to preserve consistency when changing parameter values dynamically at run time.

C1: Parameter sets have to correspond to valid network instances.
C2: A valid parameter set has to initiate a network instance execution.
C3: Processes may read new parameters from a valid set (corresponding to the selection of a new valid network instance) after they have completed a process execution cycle.

In other words, parameter values may be changed (reconfiguration may take place) either before or after an execution cycle of the processes. This is taken into account by the proposed execution cycle of a process illustrated in Figure 2(b). Note that the defined conditions are valid only for consistent PPN instances. Therefore, a consistency check of a PPN instance is required, either at design time or at run time. In our approach, a consistency check is performed at design time, since everything about the execution of a PPN is known. For more details about the defined conditions and the approach to deal with dynamic parameters at run time, we refer to [7], where the presented approach has been generalized for the SBF MoC [5].

4 Implementation

We consider that reconfiguration is applied to HW IP cores integrated in an MPSoC generated by Espam [8, 9]. To integrate an IP core, Espam generates a HW Module (HM) around an IP core taken from a library. To describe how reconfiguration based on parameter values is realized with respect to the previously defined conditions, we explain the structure of a HM, shown in Figure 3. For additional details about HW IP core integration with Espam, we refer to [8].

[Figure 3. HW Module top-view: READ, EXECUTE (IP core), and WRITE blocks between input and output FIFOs, coordinated by a CONTROL block via Exist/Read, Enable/Valid, Full/Write, and Conf/Done signals.]

The processes in our PPNs always have the same structure. It reflects the KPN operational semantics, i.e., read-execute-write using a blocking read/write synchronization mechanism. Therefore, a HW Module realizing a process of a PPN has a similar structure, shown in Figure 3, consisting of READ, EXECUTE, and WRITE blocks. The READ and WRITE blocks constitute the communication part of a HM. A set of input data ports belongs to the read unit and a set of output data ports belongs to the write unit. The number of input/output ports is equal to the number of edges going into (respectively, out of) the process of a PPN. The read unit is responsible for getting data from the proper channels (FIFOs) at each iteration. The write unit is responsible for writing the result to the proper channels (FIFOs) at each iteration. Selecting a proper channel at each iteration means following a local schedule incorporated into the read and write units. These local schedules are extracted automatically from the PPN specification by the Espam tool.

The EXECUTE block of a HW Module is actually a dedicated HW IP core to be integrated. It is not generated by Espam but is taken from a library. In order to be incorporated into a HW Module, an IP core has to provide an Enable/Valid control interface. The Enable signal is a control input to the IP core which allows the core to run when there is data to be processed. If input data is not available, or there is no room to store the output of the IP core to the output FIFO channels, then Enable is used to suspend the operation of the IP core. The Valid signal is a control output signal from the IP core used to indicate whether the data on the IP outputs is valid and ready to be written to an output FIFO channel. In addition, the IP core also has to provide an interface for accepting configuration information, illustrated by the Conf/Done signals in Figure 3.

A CONTROL block is added to capture the process behavior, e.g., the number of process firings, and to synchronize the operation of the other three blocks. CONTROL also implements the blocking read/write synchronization mechanism using the Exist/Read and Full/Write signals. Another function of the CONTROL block is to allow the parameter values to be set/modified from outside the HW Module at run time. Below, we present how the CONTROL block
implements the reconfiguration process such that the previously defined conditions are respected.

4.1 Respecting the conditions

Recall that the defined conditions are taken into account by the proposed execution cycle of a PPN process, shown in Figure 2(b). Therefore, to respect the conditions and to preserve the consistency of our PPNs, the CONTROL block of a HW Module (see Figure 3) implements this execution cycle.

In the beginning, the CONTROL block reads parameter values from the corresponding control FIFO channels. If data has not been written, the control block stalls waiting for it. The correctness of the parameter values (i.e., the configuration data) has to be guaranteed (condition C1) by the module generating them. Thus, the combined writing of parameter values and the reading of these parameters by the control block respects condition C2, because only a valid parameter set will cause a PPN process to initiate an execution cycle and, consequently, an execution of a network instance. After reading control data (e.g., iteration domains and information about configuring the IP core), the CONTROL block initiates an execution cycle. First, it performs an IP (re)configuration if required, as well as setting control information in the READ and WRITE blocks. After IP core (re)configuration is completed (indicated by the signal 'Done'), the control block uses the 'Exist/Read', 'Enable/Valid', and 'Full/Write' interfaces (see Figure 3) to control the execution (cycle) of the HW Module. The end of the cycle is reached when the READ and WRITE blocks have performed all required read and write operations. This is indicated by the corresponding 'Done' signals. After that, the control block is free to initiate another execution cycle (respecting condition C3), i.e., to read new configuration data from the control channels and to repeat the steps described above.

4.2 Discussion

By using FIFO control channels with a blocking synchronization mechanism, we keep the KPN semantics of our polyhedral process networks with dynamic parameters, i.e., we have the capability to control the execution without changing the model. Keeping the KPN model means that the deterministic behavior of our PPNs with dynamic parameters is preserved. The FIFO organization of the control channels and the blocking synchronization mechanism (the KPN semantics) keep the right order of selecting new network instances, i.e., the order in which the parameter sets are generated outside the network and written to the control channels. Since new parameter values are read by the processes after performing an execution cycle, parameter values selecting alternative PPN instances may be written to the control channels while a PPN instance is being executed. In addition, the proposed mechanism allows the processes to read the parameter values independently of each other without violating the conditions defined for preserving consistency.

Our approach to run-time reconfiguration is applied at two levels: high-level (no FPGA reconfiguration), by setting control registers, and low-level, by reconfiguring the FPGA logic. Since we consider a fixed communication topology, the READ and WRITE units are reconfigured by just writing data to control registers, e.g., the amount of data to be communicated and the particular communication patterns to read/write from/to different FIFO channels. Dynamic partial reconfiguration is applied only to the IP core of a HW Module.

From a design-complexity perspective, the proposed approach of using PPNs with dynamic parameters to capture (run-time) reconfiguration information and to target reconfigurable MPSoC implementations contributes to a simplified (low-level) design effort because:

1. By using the defined conditions and the control FIFOs, explicit handshaking (between processes) is eliminated. In addition, a reconfigurable IP core only has to set a "Done" signal to the CONTROL block after reconfiguration;

2. During the reconfiguration process, the dataflow FIFOs used for communication between the dynamic modules ensure proper operation of the static portion of the design.

5 Conclusions

In this paper, we proposed a general and technology-independent approach for the modeling and implementation of run-time execution management for applications modeled as polyhedral process networks (PPNs) and targeting reconfigurable computing. Based on the characteristics of the PPN formal model of computation, we proposed conditions which define "safe" points when reconfiguration can occur. The main contribution of the presented work is that it guarantees consistent executions of reconfigurable implementations. In addition, the FIFO communication and synchronization mechanism of the polyhedral process networks simplifies design efforts and facilitates automated implementations.
References

[1] Daedalus, a system-level design methodology and toolflow, http://daedalus.liacs.nl/.
[2] P. Feautrier. Automatic parallelization in the polytope model. In The Data Parallel Programming Model, volume 1132 of LNCS, pages 79–103, 1996.
[3] M. Herlihy. The multicore revolution. In 27th FSTTCS: Foundations of Software Technology and Theoretical Computer Science, pages 1–8, 2007.
[4] G. Kahn. The Semantics of a Simple Language for Parallel Programming. In Proc. IFIP Congress 74. North-Holland Publishing Co., 1974.
[5] B. Kienhuis and E. Deprettere. Modeling stream-based applications using the SBF model of computation. Journal of VLSI Signal Processing, 34(3), July 2003.
[6] E. A. Lee. The Problem with Threads. IEEE Computer, 39(5):33–42, 2006.
[7] H. Nikolov and E. Deprettere. Parameterized Stream-Based Functions Dataflow Model of Computation. In 6th Int. Workshop on Optimizations for DSP and Embedded Systems (ODES-6), Boston, USA, April 2008.
[8] H. Nikolov, T. Stefanov, and E. Deprettere. Automated Integration of Dedicated Hardwired IP Cores in Heterogeneous MPSoCs Designed with ESPAM. EURASIP Journal on Embedded Systems, 2008: Article ID 726096, 15 pages, 2008. doi:10.1155/2008/726096.
[9] H. Nikolov, T. Stefanov, and E. Deprettere. Systematic and automated multiprocessor system design, programming, and implementation. IEEE Trans. on CAD of Integrated Circuits and Systems, volume 27, March 2008.
[10] H. Nikolov, M. Thompson, T. Stefanov, A. Pimentel, S. Polstra, R. Bose, C. Zissulescu, and E. Deprettere. Daedalus: Toward composable multimedia MP-SoC design. In Proc. 45th ACM/IEEE Design Automation Conference (DAC'08), pages 574–579, Anaheim, USA, June 2008.
REDEFINE: Optimizations for Achieving High Throughput

Keshavan Varadarajan#, Ganesh Garga#, Mythri Alle#, Alexander Fell#, Ranjani Narayan‡, S K Nandy#‡
# CAD Lab, SERC, Indian Institute of Science, Bangalore, India
‡ Morphing Machines, Bangalore, India

Abstract—REDEFINE is a runtime reconfigurable hardware platform. In this paper, we trace the development of a runtime reconfigurable hardware from a general purpose processor, by eliminating certain characteristics such as register files and the bypass network. We instead allow explicit write backs to the reservation stations as in Transport Triggered Architecture (TTA), but use a dataflow paradigm unlike TTA. The compiler and hardware requirements for such a reconfigurable platform are detailed. The performance comparison of REDEFINE with a GPP yields a 1.91x improvement for the SHA-1 application. The performance can be improved further through the use of custom IP blocks inside the compute elements. This yields a 4x improvement in performance for the Shift-Reduce kernel, which is a part of the field multiplication operation. We also list other optimizations to the platform so as to improve its performance.

Index terms—REDEFINE, Explicit Transport, HyperOp, Custom FU, Fused HyperOps, Fused HyperOp Pipeline

I. INTRODUCTION

The embedded hardware segment comprises three kinds of silicon offerings, namely Application Specific Integrated Circuits (ASIC), Field Programmable Gate Arrays (FPGA) and General Purpose Embedded Cores (viz. ARM, Power). Each of these solutions performs differently with regard to efficiency metrics. Efficiency is primarily measured in terms of application throughput and the energy dissipated. ASICs, being highly application specific, deliver high throughput with very low energy dissipation. Due to high non-recurring engineering costs, however, ASICs can be deployed only in high volume segments. They are unsuitable where changes in standards require changes to the solution, since that would require redesign and redeployment. Redeployment may not be possible in all situations. General Purpose Processors (GPPs) on the other hand deliver lower throughputs with higher energy dissipation. However, their applicability to all domains makes them cost-effective to design and manufacture. GPPs are also unsuitable in environments where strict energy constraints are placed on the system.

A via media solution is to use a reconfigurable solution such as a Field Programmable Gate Array. Reconfigurable solutions are composed of combinational and sequential circuit elements that are composed to obtain the desired functionality. This solution does not offer the performance of an ASIC, but is flexible enough to emulate a wide range of hardware circuits. The increased scope of application renders it more cost-effective to design and manufacture. However, FPGAs can be used only if the application (or part of the application) can be placed completely on the FPGA fabric. While this restriction is relaxed by changing technology, which enables more compute and storage requirements to be placed on chip, FPGAs cannot accommodate all applications. This led to the creation of a new class of hardware solutions called runtime reconfigurable solutions. Unlike reconfigurable solutions, runtime reconfigurable solutions are designed for reconfiguration at runtime with reduced overheads, to enable an application to be divided into several sub-tasks for their piece-wise execution.

In order to render hardware amenable to runtime reconfiguration, we need to equip the hardware with the ability to switch between configurations with very short latencies, where a configuration includes a sequence of one or more operations. In the case of a GPP, an instruction represents such a configuration, and instructions can be loaded quickly from the instruction memory onto the processor for execution. The GPP undergoes reconfiguration every clock cycle and performs a different set of operations constituting an instruction. However, in order to achieve such reconfiguration, a GPP has a pre-built set of mathematical operations implemented, with all other transformations having to be expressed in terms of these elementary operations. Due to the lower granularity of these operations, the number of operations to be executed increases, causing an increase in execution time as compared to an ASIC. The mathematical operations supported by a processor may not always be amenable to an optimal realization of a given function. In such cases FPGA-based combinational circuit building blocks serve as a better means to represent the function. An instance of this is provided in section VI.

Thus a GPP, representing the ideal solution in terms of reconfigurability, serves as a good starting point to develop a runtime reconfigurable hardware solution. We try to eliminate some known inefficiencies in the GPP in order to improve the performance. One of the primary sources of inefficiency is the write back to the register file from the functional unit, as observed by Henk Corporaal [1]. He proposed the use of explicit transports as a means to avoid unnecessary write backs. This involves exposing the interconnect to the compiler infrastructure.

A modern processor pipeline includes several stages. The execute and write back stages are shown in Figure 1. The execute stage includes various functional units along with their reservation stations, and the writeback stage includes the
Figure 1 Functional Units (along with Reservation Stations) and Register File connected through the Bypass Network.
register file. These are connected over a bypass network. The bypass network enables distribution of the result operand to all operations that are waiting on this result so as to proceed to execution. The bypass network uses a broadcast mechanism to distribute the result operand to all dependent operations. The use of broadcast is a worst-case design, since it assumes that there might be operations in all functional units that are awaiting this result. Our analysis indicates that in most cases a result has a single consumer, and 98% of the results are consumed by at most 3 operations, when measured across several kernels including IDCT, deblocking filter, CAVLC, FIR and FFT. The cumulative density plots are shown in Figure 2 and Figure 3. Also, the bus-based interconnect used in modern superscalar processors [2] is not scalable. However, the use of dataflow computing inside the execute stage makes the design resilient to delays that can be encountered, such as Load-Store delays. We modify the design of the execute and write back stages as shown in Figure 4.

Figure 3 Plot showing the average cumulative density for different number of destinations of an instruction.
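A destination-count profile of the kind summarized above reduces to a cumulative density over the number of consumers per result. The sketch below is illustrative only: the trace is hypothetical, chosen to mirror the reported 98% figure, and is not the authors' kernel data.

```python
from collections import Counter

def cumulative_density(dest_counts):
    """Fraction of results consumed by at most k operations, for each k."""
    hist = Counter(dest_counts)
    total = len(dest_counts)
    cdf, running = {}, 0
    for k in sorted(hist):
        running += hist[k]
        cdf[k] = running / total
    return cdf

# Hypothetical per-result consumer counts from a kernel trace:
# most results have a single consumer, a few fan out wider.
trace = [1] * 70 + [2] * 20 + [3] * 8 + [5] * 2
cdf = cumulative_density(trace)
print(cdf[1], cdf[3])   # 0.7 0.98
```

Plotting such a dictionary against k yields curves of the shape shown in Figure 2 and Figure 3.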
The following changes were performed:
• The register file was eliminated, since it is a major source of contention.
• The registers of the reservation stations were made addressable, to enable explicit writes to them.
• The bus-based bypass network was replaced with a 2D interconnect. In this case, we choose a Honeycomb network, since it has the lowest degree per node [3].
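Behaviorally, these changes amount to dataflow firing with point-to-point writes: a producer writes its result only into the addressable reservation-station slots named by its transport list, instead of broadcasting on a bypass bus. The following is a minimal behavioral sketch; the class and field names are invented for illustration and are not REDEFINE's actual interfaces.

```python
# Sketch of explicit write backs: each operation owns a reservation-station
# slot whose operand registers are addressable, and a transport list names
# exactly which destination slots receive the result (no broadcast).

class Op:
    def __init__(self, fn, n_inputs, transports=()):
        self.fn = fn
        self.n_inputs = n_inputs
        self.transports = transports    # [(destination op, operand position)]
        self.operands = {}              # operand position -> value

    def ready(self):
        # Dataflow firing rule: all operands have arrived.
        return len(self.operands) == self.n_inputs

    def fire(self):
        result = self.fn(*(self.operands[i] for i in range(self.n_inputs)))
        for dest, pos in self.transports:   # explicit write back
            dest.operands[pos] = result
        return result

# Compute a*b + c: the multiply writes directly into the adder's slot 0.
add = Op(lambda x, y: x + y, 2)
mul = Op(lambda x, y: x * y, 2, transports=[(add, 0)])
mul.operands = {0: 3, 1: 4}
add.operands[1] = 5
assert mul.ready()
mul.fire()                  # writes 12 into add's operand 0
result = add.fire()
print(result)               # 17
```

The point of the sketch is that only the consumers named in `transports` are touched, matching the observation that most results have very few consumers.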

Compute
Reservation Element
Router ALU
Station (CE)

Tile

Figure 4 The modified execution and writeback stages. The register


file has been eliminated and the bypass network is a honey comb
structure.

In this modified architecture, the results are directly written


back to the reservation stations of the destinations. The
implementation of such explicit write backs involves
Figure 2 Plot showing the cumulative density for different number of addressing the following issues:
destinations of an operations obtained for various kernels.
• In order to perform explicit transports the compiler
needs to determine all data dependences and issue
explicit data transfers between them.

• Explicit write back can be implemented only if all dependent instructions are in one of the reservation stations. Since this cannot be guaranteed, an external register file is essential to store those operands that cannot be consumed immediately.
• Explicit write back also requires the source operation to know the placement of the destination operations. Dynamic assignment of reservation stations, and slots within them, increases the complexity of the hardware.

Our implementation of these requirements is elucidated in the subsequent section. In Transport Triggered Architectures these requirements were implemented in the context of VLIW processors, whereas we approach the problem in the context of the dataflow paradigm.

The FPGA perspective of runtime reconfiguration is presented in section III. Section IV compares our solution with other solutions. In section V, we present a performance comparison between REDEFINE and a GPP. In the subsequent section, section VI, we indicate how the performance can be improved through the use of custom IP blocks. Other possible improvements, identified through an analysis of time spent in various stages, are presented in section VII. The conclusions and scope for future work are presented in section VIII.

II. REDEFINE: A RUNTIME RECONFIGURABLE ARCHITECTURE

Figure 5 Schematic block diagram of REDEFINE. (Blocks shown: Resource Binder, HyperOp Launcher, Scheduler with Global Wait Match Unit, Reconfigurable Fabric, Inter-HyperOp Data Forwarder.)

In order to specify explicit transports between two operations, the compiler, RETARGET [4], constructs a dataflow graph from the SSA representation of the application specification (written in the C language), generated by LLVM [5]. LLVM transforms the application into SSA form based on a virtual instruction set architecture (VISA). The operations in the VISA are simple and non-orthogonal. The application is subdivided into smaller units called HyperOps [4]. HyperOps are constructed by grouping together basic blocks such that a total order of HyperOps can be constructed for execution. Explicit transports can be performed between operations of a HyperOp. These explicit transports are transformed into Transport Metadata that is interpreted by the compute element in order to forward the results to the destination. Explicit Transports are facilitated through the use of a Network on Chip [6]. Any data communication across HyperOps, i.e. inter-HyperOp data traffic, is facilitated through the Inter-HyperOp Data Forwarder (Figure 5; [7]), and data is stored in the global wait match unit, which is a part of the Scheduler.

For performing transports, the operations of the HyperOp need to know the exact placement of the dependent operations and their position on the reconfigurable fabric. The compiler performs virtual binding of all operations, i.e. each operation assumes that it is placed at location¹ (0,0). The transport directives, to dependent operands, are determined relative to the current location [8]. The routers support routing based on relative addresses. This arrangement makes the HyperOps relocatable. So an operation can be placed at any location as long as the related operations, i.e. operations that supply data to the said operation, or operations that receive data from the said operation, are at offsets that are predetermined by the compiler. The exact location where a HyperOp will be placed is determined at run time by the Resource Binder (Figure 5).

¹ The position of each CE on the fabric is specified by an ordered pair which specifies the position as an offset from the origin along the x and y directions.

A brief description of the various modules shown in Figure 5 is provided below:
• The Reconfigurable Fabric consists of compute elements (CEs). The CEs include an ALU along with its reservation station and router. A tile comprises a CE and a router. The ALU supports a subset of the operations present in the LLVM VISA. The interconnect employed is a toroidal honeycomb structure [9]. The reservation stations determine which of the ready operations will fire². In our current implementation (Figure 5), 64 tiles are interconnected in an 8x8 configuration, with 12 access routers along the periphery. The access routers provide connectivity to the HyperOp Launcher, the Inter-HyperOp Data Forwarder and the Load-Store Units that are used to access data memory banks. The access routers serve as gateways of communication between the Fabric and the external logic that drives it.

² Static scheduling can be used in place of dynamic scheduling; however, since an NoC is employed, dynamic scheduling can schedule other instructions in case of network delays.

• The HyperOp Launcher is the hardware unit responsible for transferring the compute and transport metadata to the Fabric, along with any data and constant operands. The compute metadata specifies the operation to be executed. The launcher is connected to an instruction memory that is split into 5 banks. This enables parallel transfer of 5 operations from the instruction memory to the HyperOp Launcher.
• The Resource Binder, as described previously, determines the exact location on the fabric where the HyperOp is to be placed. It also keeps track of the busy and unused tiles on the fabric.
• The Scheduler hosts the global wait match unit that serves as the external register file, where data exchanged between HyperOps is stored. Whenever all input operands of a HyperOp instance are available, it is considered for being launched on the fabric for execution. The HyperOps thus chosen are then
forwarded to the Resource Binder, where an appropriate location on the fabric is determined. The HyperOp Launcher transfers compute metadata, transport metadata, constant operands and input operands to the HyperOps.
• The input operands of a HyperOp are not placed at a predetermined position. This necessitates a lookup table that indicates the position in the global wait match unit where the input operands are placed. The Inter-HyperOp Data Forwarder is also responsible for storing loop invariants and employs a sticky token store for this purpose [10].

III. RUNTIME RECONFIGURATION: AN FPGA PERSPECTIVE

FPGAs are composed of Look Up Tables (LUTs) interconnected by a programmable interconnect. The LUTs emulate the behavior of a combinational circuit. The truth table of the combinational circuit is programmed into the LUTs. More complex combinational circuits are realized through interconnection of LUTs. Programming an FPGA thus involves transferring the truth tables for each LUT and the programming bits for the interconnect, in order to set up paths between communicating LUTs. The fine-grained structure of the LUTs and the programmable interconnect has both advantages and disadvantages. The fine-grained nature of the LUTs makes them more amenable to the realization of circuits which are combinational in nature. The use of a programmable interconnect renders the transport of data overhead free, due to pre-established paths. The primary disadvantage of the fine-grained structure is the higher latency incurred in programming the LUTs and the interconnect. An FPGA has a reconfiguration time of the order of milliseconds at best, while a GPP can reconfigure itself every clock cycle. Due to the large latencies involved in reconfiguration, FPGAs are not amenable to runtime reconfiguration. More recently, FPGA vendors have been supporting partial reconfiguration, so as to reduce the runtime overhead. However, this alone does not address the problem, as the ratio of the configuration size for an application to the size of the source specification is quite high.

In order to bring down the latency of reconfiguration in REDEFINE, we replaced the LUTs with more coarse-grained functional units (viz. adder, shifter), akin to a GPP. This ensures that the amount of information needed to program the Compute Elements (CEs) is quite low (just an opcode). The programmable interconnect is replaced by a packet switched Network on Chip. This trades reconfiguration overhead for hardware complexity. The routers in the NoC have embedded routing logic, which determines the path to be taken to transfer data from source to destination.

IV. RELATED WORK

Several solutions have been proposed that try to address the power-performance tradeoffs with regard to embedded hardware solutions. Modern embedded processors, viz. PowerPC, come along with a host of domain specific accelerators that help improve the performance while having a lower energy overhead for the accelerated application (when compared to a GPP). The recently released Intel Nehalem too has several onboard application specific accelerators. On the other hand, FPGA vendors ship a general purpose core alongside to execute software code, while the FPGA fabric provides hardware acceleration. Stretch Inc explores a similar solution [11]. This solution embeds an FPGA fabric alongside a Tensilica core to support Instruction Set Extensions at the post-fabrication stage. Molen [12] uses a similar hardware fabric as Stretch; however, the compiler for Molen supports C to RTL transformation to automatically program the embedded FPGA. All these solutions help in exploiting the best features of both GPPs and FPGAs. However, none of them reduce the time to reconfigure a hardware platform. This is a critical requirement for runtime reconfigurable architectures.

The requirement of lower reconfiguration time has been addressed in several hardware solutions, viz. NEC-DRP, DAP-DNA [13]. These hardware architectures employ reconfigurable functional units and ALUs in place of LUTs. However, they continue to use the programmable interconnect, as in the FPGA. In order to reduce the runtime reconfiguration overhead, multiple configuration planes are used. However, such a hardware configuration would be useful only if the execution times of the application subtasks were sufficiently large to hide the configuration load latencies.

The hardware structure shown in Figure 5 is akin to several recently proposed general-purpose architectures, viz. RAW [14], TRIPS [15] and Wavescalar [16]. In the RAW processor several MIPS cores are repeated in space and are interconnected by a 3-level mesh interconnect. The TRIPS and Wavescalar processors use function units and ALUs in place of a full core. All these processors are geared towards better exploitation of available thread level parallelism. On the other hand, our solution is intended towards kernel execution acceleration. This difference in emphasis leads to a completely different utilization of resources on the Fabric.

The requirements for explicit transports stated in section I can be implemented in several ways. Our solution employs explicit transports in the context of modern superscalar processors, i.e. it uses a dataflow paradigm. In Transport-Triggered Architectures (TTA) [1] this was achieved in the context of VLIW processors. The compiler computes the explicit transports. The register file too is made a functional unit, into which data can be explicitly transferred. Due to the VLIW nature of the machine, the locations of the operations are known a priori. The technique of explicit transports is amenable to application in other architectural paradigms as well. The primary reason to adopt the dataflow paradigm in REDEFINE, as elucidated in the previous section, is to work around network delays. Other VLIW techniques such as dynamic scheduling [17] could have been employed in the context of VLIW processors to work around nondeterministic NoC delays.
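The transport-directive scheme of section II (virtual binding at (0,0), with destinations addressed by offsets relative to the producer) can be sketched as below. This is an illustrative model on a plain 2D grid; the fabric's actual interconnect is a toroidal honeycomb, and the function names here are invented.

```python
# Sketch: transport directives store only relative offsets, so a HyperOp
# can be relocated to any free base tile without recomputing its directives.

def transport_directives(ops):
    """Compiler side: from virtual placements rooted at (0,0), derive
    each operation's destination offsets relative to its own position."""
    return {name: [(ops[d][0][0] - pos[0], ops[d][0][1] - pos[1])
                   for d in dests]
            for name, (pos, dests) in ops.items()}

def bind(ops, base):
    """Resource Binder side: shift the whole HyperOp to a base tile."""
    return {name: (pos[0] + base[0], pos[1] + base[1])
            for name, (pos, _) in ops.items()}

# Virtual placement rooted at (0,0): operation 'a' feeds 'b' and 'c'.
hyperop = {'a': ((0, 0), ['b', 'c']),
           'b': ((1, 0), []),
           'c': ((0, 1), [])}
rel = transport_directives(hyperop)
print(rel['a'])                       # [(1, 0), (0, 1)]

# The directives stay valid wherever the HyperOp is placed.
for base in [(3, 5), (6, 2)]:
    placed = bind(hyperop, base)
    for (dx, dy), dest in zip(rel['a'], ['b', 'c']):
        assert (placed['a'][0] + dx, placed['a'][1] + dy) == placed[dest]
```

Because only offsets are carried in the metadata, the Resource Binder is free to pick any base tile at run time, which is what makes HyperOps relocatable.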
V. COMPARING PERFORMANCE WITH A GPP

The impact on performance of the architectural changes in REDEFINE, as compared to a GPP, is provided in this section. The SHA-1 hashing function is executed on both REDEFINE and a GPP. SHA-1 is the most widely used cryptographic hash function. Cryptographic hash functions compute a cryptographic hash on a message digest, such that a change in the message causes the generation of a completely different hash. This facilitates detection of message modifications. The results of the SHA-1 run are available in Table 1. The GPP run was performed on a Pentium 4 running at 2.26GHz. The number of cycles was measured using Intel VTune, as described in [7]. The REDEFINE cycle count was obtained through a cycle-accurate simulation³, for an 8x8 fabric of tiles. A computation granularity of 32-bit is used in both the GPP and REDEFINE.

³ A SystemC/C++ based simulator developed in-house.

Table 1 Comparison of performance of SHA-1 function in REDEFINE vs GPP.

SHA-1 Performance     REDEFINE           General Purpose Processor
Execution Cycles      111777             21746290
Operating Frequency   100MHz (@130nm)    2.26GHz (@90nm)
Total Time taken      1.118 ms           1.071 ms⁴

⁴ The time in seconds computed from the number of cycles and the frequency does not match the time reported in milliseconds; however, the results reported by Intel VTune are being directly reproduced.

A. Analysis of Results

REDEFINE takes 4.2% more time (in seconds) to execute the same program. However, the reported execution time (in seconds) does not take into account the technology node at which the processors were synthesized. After technology normalization, we find that REDEFINE performs 1.91x better than the Pentium processor. This improvement in performance can be attributed to the use of the LLVM compiler [5] (which is known to perform about 30% better for the said application compared to gcc [18]), the use of RISC operations (as opposed to CISC operations in the Pentium 4 processor), the elimination of load-stores for scalars and the reduction in register writebacks. In REDEFINE, the average number of CEs used across several HyperOp executions is ~2.7 CEs. However, it has a peak CE utilization of 7 CEs. Thus the effects of variable issue width, as reported in [19], are also seen, whereas the Intel Pentium is a fixed issue width processor.

The SHA-1 function implemented in an ASIC can support up to 116Mbps, as compared to ~5Mbps supported by REDEFINE. Therefore REDEFINE performs ~20x worse than an ASIC.

VI. IMPROVING PERFORMANCE THROUGH CUSTOM FU

The REDEFINE architecture described thus far contains ALUs that are quite general in nature. In order to improve the throughput and get it as close to an ASIC as possible, it is essential to integrate custom IP blocks in the CEs, so as to accelerate the computation. However, these IP blocks need to be domain-specific, as opposed to application-specific, so as to avoid the pitfalls of an ASIC. We consider the example of a field multiplier to illustrate this.

Field multiplication is an important constituent of several cryptographic kernels, viz. ECDSA, ECC. Field multiplication, unlike normal multiplication, involves a sequence of Shift-Reduce (as opposed to a Shift in normal multiplication) and XOR operations (as opposed to an add in normal multiplication). The Shift-Reduce operation is compute intensive and is a good candidate for custom FU based acceleration. The design of the CE that contains the custom FU is shown in Figure 6.

Figure 6 Modified Compute Element with Custom Functional Unit. (Blocks shown: Reservation Station feeding an ALU and a Custom FU, with a Transporter connecting to the router.)

The results for the runs without the Custom Function Unit and with the Custom Function Unit are shown in Table 2. The performance of REDEFINE with the custom FU is about 4x better than without. A similar improvement in performance is seen in the context of FFT, as reported in [7].

Table 2 Comparison of Execution time of field multiplication without and with Custom FU.

                                   Without Custom FU    With Custom FU
Execution Time (in clock cycles)   1692                 424

VII. MODULE-WISE BREAK-UP OF EXECUTION LATENCIES

While the Compute Unit and Fabric contribute a large part of the execution delay, nearly an equal amount is contributed by the Support Logic (composed of the Scheduler, Resource Binder, HyperOp Launcher and Inter-HyperOp Data Forwarder; Figure 5). In Table 3, we tabulate the time spent in each of these blocks for the SHA and shift-reduce functions.

The time shown in Table 3 is the average of the time spent in the Support Logic across all executing HyperOps. The Support Logic overhead includes the time spent in launching the HyperOp, from when the last operand for a HyperOp arrived at the Inter-HyperOp data forwarder up to the selection of the first operation for launch, after the
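The Shift-Reduce and XOR structure of field multiplication described in section VI can be illustrated in miniature. The sketch below uses GF(2^8) with the AES reduction polynomial purely as a small worked example; ECC kernels use much larger fields, and this is not the custom FU's implementation.

```python
# Sketch: GF(2^8) field multiplication as interleaved Shift-Reduce and XOR
# steps (reduction polynomial x^8 + x^4 + x^3 + x + 1, i.e. 0x11b).

def gf_mul(a, b, poly=0x11b, m=8):
    acc = 0
    for _ in range(m):
        if b & 1:
            acc ^= a            # XOR in place of an add
        b >>= 1
        a <<= 1                 # Shift ...
        if a & (1 << m):
            a ^= poly           # ... then Reduce, instead of carry propagation
    return acc

print(hex(gf_mul(0x57, 0x83)))  # 0xc1 (worked example from the AES standard)
```

The inner Shift-then-Reduce pair is the step that the custom FU collapses into a single operation.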
opcodes and data operands are transferred on to the fabric by the HyperOp Launcher. It should be noted that the Inter-HyperOp data forwarder continues to receive data even during normal execution, and thus the data transport is overlapped by computation. Similarly, the HyperOp Launcher continues to transfer data operands even after the first operation is ready for launch. Of the time spent in the Support Logic, the maximum amount of time is spent in the HyperOp Launcher. The Inter-HyperOp data forwarder, Scheduler and Resource Binder take fixed time periods for processing the HyperOp. However, the time taken by the HyperOp Launcher depends upon the number of operations, input operands and constant operands that need to be transferred, and the position of the identified CEs on the fabric. Since the HyperOp Launcher can transfer data only through the ports along the periphery of the fabric, it incurs different latencies for different CEs. This limits the scalability of the fabric. Beyond a certain size, 3-D structures may become essential to support scaling of the fabric while not incurring high launch delays. The HyperOp Launcher latency is also affected by routing delays due to network congestion, since it employs the NoC to transfer data to the identified CEs.

Table 3 Percentage of Execution time spent in Support Logic.

Application                     Percentage time spent in Support Logic
SHA-1                           48.58
Shift-Reduce (w/o Custom FU)    41.49
Shift-Reduce (w/ Custom FU)     48.05

A. Enhancements to reduce Support Logic Overhead

In order to ameliorate this, several possible solutions exist.
• The maximum latencies are incurred in HyperOps that are part of a loop. In these cases the HyperOp's opcodes and constants may be retained on the fabric; only the new input operands are transferred to the HyperOp after every iteration. These are called persistent HyperOps.
• The persistent HyperOps solution helps reduce HyperOp launch latency only in the case of frequently executed HyperOps. Another mechanism is to prefetch a HyperOp. In this case the next HyperOp to be launched is statically predicted using compile time analysis augmented with profiling runs. This information is used in prefetching the HyperOps, so as to overlap the launch of the next HyperOp with the execution of the current HyperOp.
• In the case of both persistent HyperOps and prefetch, the input data needs to be transferred from the Scheduler to the HyperOp Launcher, for transfer onto the fabric. In the case of frequently interacting HyperOps, this latency can be eliminated through a Fused HyperOp. In a Fused HyperOp, also referred to as a Custom Instruction in [20], two or more closely interacting HyperOps are merged together to form a combined entity. In this combined entity, the inter-HyperOp communication is achieved completely on the fabric. However, since the fabric is a static dataflow machine, it is not possible to allow one HyperOp to proceed to the next iteration while the other HyperOp of the same Fused HyperOp is executing the previous iteration.
• Further, to improve performance, a pipeline of Fused HyperOps can be formed, referred to as the Fused HyperOp Pipeline. In [20] this was referred to as the Custom Instruction Pipeline. In a Fused HyperOp Pipeline, several instances of the Fused HyperOps can be unrolled, and data transfer between the various instances can be accomplished by setting up a pipeline between these entities. The Fused HyperOp Pipeline is useful for linear communication structures. However, other applications are known to have different communication structures, as mentioned in [21]. Other computation structures may need to be created for these applications.

VIII. CONCLUSIONS AND FUTURE WORK

In this paper, we presented the architectural evolution of REDEFINE, a runtime reconfigurable architecture. We used a general purpose processor (GPP) as the starting point for evolving a runtime reconfigurable architecture, since a GPP reconfigures itself every clock cycle. With certain architectural optimizations, i.e. explicit write backs, we were able to achieve a 1.91x performance improvement over a GPP for SHA-1. However, this solution performs 20x worse than an ASIC. To further improve the performance, the use of domain-specific custom IP blocks within the CE becomes necessary. Runtime reconfigurable hardware with a custom IP block gives a 4x improvement in performance for the Shift-Reduce kernel, which is the most compute intensive component of field multiplication. Apart from optimizations to the execution fabric, optimizations are also essential in the support logic, viz. the schedule, placement and launch of application substructures, in order to come close to the performance of an ASIC.

In this paper, we have presented only the performance comparison with regard to an ASIC and a GPP. We intend to extend this work to include a detailed power comparison and power-performance comparison with regard to an FPGA. We also intend to develop a complete library of domain-specific custom IPs that can be used within a compute element of the reconfigurable fabric, to accelerate cryptographic applications.

REFERENCES

[1] H. Corporaal, Microprocessor Architectures: From VLIW to TTA. John Wiley & Sons, 1998.
[2] K. C. Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, vol. 16, no. 2, pp. 28-40, 1996.
[3] A. N. Satrawala, K. Varadarajan, M. Alle, S. K. Nandy, and R. Narayan, "REDEFINE: Architecture of a SoC Fabric for Runtime Composition of Computation Structures," in FPL 2007: International Conference on Field Programmable Logic and Applications, Amsterdam, 2007, pp. 558-561.
[4] M. Alle, K. Varadarajan, A. Fell, S. K. Nandy, and R. Narayan, "Compiling Techniques for Coarse Grained Runtime Reconfigurable Architectures," in ARC '09: International Workshop on Applied Reconfigurable Computing, vol. 5453/2009, London, U.K., 2009, pp. 204-215.
[5] C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation," in CGO '04: Proceedings of the international symposium on Code generation and optimization, Palo Alto, CA, 2004, p. 75.
[6] N. Joseph, et al., "RECONNECT: A NoC for polymorphic ASICs using a low overhead single cycle router," in ASAP '08: Proceedings of the 2008 International Conference on Application-Specific Systems, Architectures and Processors, Leuven, Belgium, 2008, pp. 251-256.
[7] A. Fell, et al., "Streaming FFT On REDEFINE-V2: an Application-Architecture Design Space Exploration," in CASES '09: Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems, Grenoble, France, 2009, pp. 127-136.
[8] G. K. Singh, K. Varadarajan, M. Alle, S. K. Nandy, and R. Narayan, "A Generic Graph-Oriented Mapping Strategy for a Honeycomb," International Journal on Futuristic Computer Applications, vol. 1, no. 1, pp. xx-xx, 2010.
[9] A. Fell, P. Biswas, J. Chetia, S. K. Nandy, and R. Narayan, "Generic Routing Rules and a Scalable Access Enhancement for the Network-on-Chip RECONNECT," in IEEE International SoC Conference, Glasgow, 2009, pp. xx-xx.
[10] J. R. Gurd, C. C. Kirkham, and I. Watson, "The Manchester prototype dataflow computer," Commun. ACM, vol. 28, no. 1, pp. 34-52, 1985.
[11] Stretch Inc. Stretch S6000 Devices. [Online]. http://www.stretchinc.com/_files/s6ArchitectureOverview.pdf
[12] S. Vassiliadis, et al., "The Molen Polymorphic Processor," IEEE Transactions on Computers, vol. 53, no. 11, pp. 1363-1375, Nov. 2004.
[13] Fujitsu Limited, IPflex Inc. (2004, Mar.) IPFlex and Fujitsu Introduce DAP/DNA®-2, the Dynamically Reconfigurable Processor. [Online]. http://www.fujitsu.com/global/news/pr/archives/month/2004/20040317-01.html
[14] M. B. Taylor, et al., "The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs," IEEE Micro, vol. 22, no. 2, pp. 25-35, Mar. 2002.
[15] K. Sankaralingam, et al., "The Distributed Microarchitecture of the TRIPS Prototype Processor," in 39th International Symposium on Microarchitecture (MICRO), Orlando, 2006, pp. 480-491.
[16] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin, "WaveScalar," in 36th Annual International Symposium on Microarchitecture (MICRO-36), Washington, DC, USA, 2003, p. 291.
[17] B. R. Rau, "Dynamically scheduled VLIW processors," in MICRO 26: Proceedings of the 26th annual international symposium on Microarchitecture, Austin, Texas, 1993, pp. 80-92.
[18] M. Larabel. (2009, Sep.) GCC vs. LLVM-GCC Benchmarks. [Online]. http://www.phoronix.com/scan.php?page=article&item=apple_llvm_gcc&num=1
[19] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008.
[20] M. Alle, et al., "REDEFINE: Runtime reconfigurable polymorphic ASIC," ACM Trans. Embed. Comput. Syst., vol. 9, no. 2, pp. 1-48, 2009.
[21] K. Asanovic, et al., "A view of the parallel computing landscape," Commun. ACM, vol. 52, no. 10, pp. 56-67, Oct. 2009.
ADCOM 2009
POSTER PAPERS
A Comparative Study of Different Packet Scheduling Algorithms with Varied Network Service Load in IEEE 802.16 Broadband Wireless Access Systems

Prasun Chowdhury and Iti Saha Misra
Electronics and Telecommunication Engineering, Jadavpur University, Kolkata, India
prasun.jucal@gmail.com, itimisra@cal.vsnl.net.in
Abstract— In this paper, we conduct a comprehensive performance study of several packet scheduling algorithms found in the literature, namely Strict Priority, WFQ, RR, SCFQ and WRR, under different queue management types (FIFO, RED, RIO, WRED) in the point-to-multipoint mode of OFDM-based WiMAX networks. We compare the quality of service (QoS) delivered by the various queue scheduling schemes under different specific environments, i.e. different service loads. Extensive simulations are performed in QualNet 4.5, and the scheme best suited to each service load condition is identified.

Keywords- IEEE 802.16; MAC; QoS; Queue Scheduling algorithm; Queue management type

1. Introduction

Despite the numerous scheduling algorithms proposed for WiMAX (Worldwide Interoperability for Microwave Access) [1] networks, there is no comprehensive study that provides a unified platform for comparing such algorithms. The aim of this work is to allow a thorough understanding of the relative performance of representative uplink scheduling algorithms under various queue management types, and subsequently to utilize the results for different specific environments, i.e. different network loads. We focus on comparing representative algorithms for uplink traffic in the OFDM WiMAX physical layer using QualNet 4.5.

The remainder of the paper is organized as follows. In section 2, we provide a detailed description of our contribution and the simulations under QualNet 4.5. Finally, we conclude the paper in section 3.

2. Our Contribution

2.1 Simulation Setups in QualNet 4.5

We perform our scheduling algorithm comparison in the QualNet 4.5 simulator. We have modified and selected the scheduling algorithms from the graphical user interface of QualNet 4.5 to fit a particular environment. For this purpose we have designed the scenario shown in figure 1.

Figure 1. IEEE 802.16 WiMAX scenario in QualNet

In figure 1, we have designed a scenario consisting of one base station and five subscriber stations under one subnet. Each subscriber station has three separate connections with the base station, for the applications UGS, rtPS and nrtPS. We omit the BE service because it supports data streams for which no minimum service level is required and which may therefore be handled on a space-available basis. We have configured the subscriber stations with different IP queue scheduling schemes: SS1 is scheduled with the Strict Priority scheme [1], while SS2, SS3, SS4 and SS5 are scheduled with Weighted Fair Queuing (WFQ) [2], Round Robin (RR) [3], Self-Clocked Fair Queuing (SCFQ) [4] and Weighted Round Robin (WRR) [3] respectively. The IP queue management type and the service load of each subscriber station are kept fixed for a particular simulation. In this work, we have used four IP queue management types, i.e. First In First Out (FIFO) [5], Random Early Detection (RED) [6], Random Early Detection with In/Out (RIO) [6] and Weighted Random Early Detection (WRED) [7], and three subscriber station service loads, i.e. Service Load 1, Service Load 2 and Service Load 3. Details of the service loads are shown below. In this way we have carried out a total of twelve simulations in the QualNet 4.5 Animator. More details of these components are available in the QualNet 4.5 Advanced Wireless documentation [6].

The following WiMAX model parameters are used in the configuration:
 Radio Type – 802.16 Radio
 MAC Protocol – 802.16
 BS Frame Duration – 20 ms
 IP Queue Scheduling – Strict Priority, WFQ, RR, SCFQ, WRR
 IP Queue Management type – FIFO, RED, RIO, WRED
 PHY Channel Bandwidth – 20 MHz
 The starting times of the simulations are evenly distributed in the interval 0 s – 100 s

Table 1: Service Load configuration parameters

Service | Load 1 | Load 2 | Load 3 | Interval (sec) | Start time (sec)
UGS     | 128    | 256    | 512    | 0.01           | 0
rtPS    | 512    | 1024   | 2048   | 0.1            | 0
nrtPS   | 1024   | 2048   | 4096   | 1              | 0

2.2 Results and discussions

[Figures 2–8 are bar charts comparing the five scheduling schemes (Strict Priority, WFQ, RR, SCFQ, WRR) across the four queue management types (FIFO, RED, RIO, WRED); only their captions are reproduced here.]

Figure 2. Average Throughput (bps) comparison in queue scheduling schemes
Figure 3. Average Jitter (sec) comparison in queue scheduling schemes
Figure 4. Average Queuing Delay (sec) comparison for Service Load 1
Figure 5. Packet loss (%) comparison for Service Load 1
Figure 6. Queue Service ratio comparison for Service Load 1
Figure 7. Average queuing delay (sec) comparison for Service Load 2
Figure 8. Packet loss (%) comparison for Service Load 2
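As background for the comparisons above, Weighted Round Robin (one of the schedulers under study) serves each queue in proportion to a configured weight. The following is a generic sketch, not the QualNet implementation; the queue names and weights are illustrative, chosen to mirror the three service classes in the scenario.

```python
from collections import deque

def wrr_schedule(queues, weights, budget):
    """Dequeue up to `budget` packets, visiting queues round-robin and
    taking at most `weights[q]` packets from queue q per round."""
    order = list(queues)
    sent = []
    while budget > 0 and any(queues[q] for q in order):
        for q in order:
            for _ in range(weights[q]):
                if budget == 0 or not queues[q]:
                    break  # this queue is done for the round
                sent.append(queues[q].popleft())
                budget -= 1
    return sent

# Three service classes, as in the scenario (UGS, rtPS, nrtPS).
queues = {"UGS": deque(["u0", "u1", "u2", "u3"]),
          "rtPS": deque(["r0", "r1", "r2", "r3"]),
          "nrtPS": deque(["n0", "n1", "n2", "n3"])}
weights = {"UGS": 3, "rtPS": 2, "nrtPS": 1}
print(wrr_schedule(queues, weights, 6))
# → ['u0', 'u1', 'u2', 'r0', 'r1', 'n0']: per round, UGS gets 3 slots, rtPS 2, nrtPS 1
```

Unlike Strict Priority, a backlogged high-weight class cannot starve the others: every queue is visited in every round.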
[Figures 9–12 are bar charts comparing the five scheduling schemes across the four queue management types; only their captions are reproduced here.]

Figure 9. Queue Service ratio comparison for Service Load 2
Figure 10. Average queuing delay (sec) comparison for Service Load 3
Figure 11. Packet loss (%) comparison for Service Load 3
Figure 12. Queue Service ratio comparison for Service Load 3

For Service Load 1, Service Load 2 and Service Load 3, we observe that the queue scheduling algorithms SCFQ, RR and WFQ respectively, combined with the RED queue management type, provide the best QoS support among all the algorithms considered.

3. Conclusion

In the existing literature, many researchers have tried to find flaws in the queue scheduling algorithms and have modified them for better QoS support irrespective of the network service load. The present work clearly shows that an algorithm considered good in one specific environment may not provide good QoS in another. It is therefore reasonable to take the network service load into account when modifying packet scheduling algorithms, for meaningful results.

Acknowledgement: The authors deeply acknowledge the support from DST, Govt. of India for this work in the form of the FIST 2007 Project on "Broadband Wireless Communications" in the Department of ETCE, Jadavpur University.

References

[1] K. Wongthavarawat and A. Ganz, "Packet scheduling for QoS support in IEEE 802.16 broadband wireless access systems", International Journal of Communication Systems, vol. 16, issue 1, pp. 81-96, February 2003.
[2] N. Ruangchaijatupon, L. Wang and Y. Ji, "A Study on the Performance of Scheduling Schemes for Broadband Wireless Access Networks", Proceedings of the International Symposium on Communications and Information Technology, pp. 1008-1012, October 2006.
[3] C. Cicconetti, A. Erta, L. Lenzini and E. Mingozzi, "Performance Evaluation of the IEEE 802.16 MAC for QoS Support", IEEE Transactions on Mobile Computing, vol. 6, no. 1, pp. 26-38, January 2007.
[4] Byung-Hwan Choi and Hong-Shik Park, "Rate proportional SCFQ Algorithm for high speed packet-switched Networks", ETRI Journal, vol. 22, no. 3, September 2008.
[5] Fei Li, "Fairness Analysis in Competitive FIFO Buffer Management", IEEE International Performance, Computing and Communications Conference (IPCCC 2008), 2008.
[6] QualNet 4.5 Advanced Wireless Model Library, March 2008, http://www.scalablenetworks.com, http://www.qualnet.com
[7] Ming-Jye Sheng and Thomas Mak, "Analysis of Adaptive WRED and CBWFQ Algorithms on Tactical Edge", IEEE, 2008.
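Since RED is the queue management type that performs best for every service load above, it may help to recall its core mechanism: RED tracks an exponentially weighted moving average of the queue size and drops arriving packets with a probability that ramps up linearly between two thresholds. A simplified sketch follows (the threshold and weight values are illustrative, not QualNet defaults, and the count-based correction of full RED is omitted):

```python
def red_drop_probability(avg_q, min_th, max_th, max_p):
    """Simplified RED: drop probability for an arriving packet as a
    function of the average queue size."""
    if avg_q < min_th:
        return 0.0          # below the min threshold: always enqueue
    if avg_q >= max_th:
        return 1.0          # above the max threshold: always drop
    # linear ramp from 0 to max_p between the two thresholds
    return max_p * (avg_q - min_th) / (max_th - min_th)

def update_avg(avg_q, inst_q, w=0.002):
    """EWMA update of the average queue size on each packet arrival."""
    return (1 - w) * avg_q + w * inst_q

print(red_drop_probability(15, 10, 20, 0.1))  # midway between thresholds → 0.05
```

Dropping early and probabilistically, rather than only when the queue is full as FIFO does, signals congestion to senders before the queue overflows.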
A Simulation Based Comparison of Gateway Load Balancing Strategies in Integrated Internet-MANET

Rafi-U-Zaman, Khaleel-Ur-Rahman Khan and M. A. Razzaq
Department of CSE, M.J.C.E.T, Hyderabad, India
{rafi.u.zaman, khaleelrkhan}@mjcollege.ac.in, marazzaq@yahoo.com

A. Venugopal Reddy
Department of CSE, UCE, O.U., Hyderabad, India
avgreddy@osmania.ac.in
Abstract

The interconnection of a wireless mobile ad hoc network with the wired Internet is called Integrated Internet-MANET. This interconnection is facilitated through gateways which act as bridges between the heterogeneous networks. Load balancing of these gateways is a critical issue in Integrated Internet-MANET. In this paper, two gateway load balancing strategies for Integrated Internet-MANET are proposed, based on the load balanced routing protocols WLB-AODV and Modified-AODV. The proposed strategies have been simulated using the ns-2 simulator. The simulation results indicate that the strategy based on WLB-AODV performs better than the one based on Modified-AODV.

1. Introduction

Wireless mobile ad hoc networks (MANETs) are infrastructure-less networks. Various protocols have been proposed to perform routing within an ad hoc network [1]. To extend its usefulness, a MANET needs to be connected to the Internet. We call such an interconnected network an Integrated Internet-MANET. A review of strategies for Integrated Internet-MANET can be found in [2]. Such strategies make use of gateways to interconnect the ad hoc network to the Internet. It is observed that the problem of gateway load balancing has not been adequately addressed in these strategies. The few strategies that exist in the literature for gateway load balancing [3-7] make use of traditional ad hoc routing protocols like DSDV and AODV; none of them uses a specialized load-balanced routing protocol.

In this paper, we present an extended version of the AODV routing protocol, called Weighted Load Balanced AODV (WLB-AODV). In the second part of the paper, two gateway load balancing strategies are presented, one based on WLB-AODV and the other based on Modified-AODV [8]. Based on a simulation study, it is observed that the proposed strategy based on WLB-AODV outperforms the one based on Modified-AODV.

2. Weighted Load Balanced – Ad Hoc On-Demand Distance Vector (WLB-AODV) Routing Protocol

Modified-AODV is a modified version of AODV wherein the Aggregate Interface Queue Length (AIQL) is used as the path selection criterion instead of hop count. In this mechanism, when an intermediate node receives a Route Request, it adds its interface queue length to the Route Request and forwards it. This aggregation of the queue lengths of all the nodes lying on a path is called the Aggregate Interface Queue Length. The process is repeated until the Route Request reaches the destination. The destination selects the best route based on the AIQL and sends a Route Reply back to the source. Whereas Modified-AODV is based on AIQL alone, the WLB-AODV routing protocol is based on three metrics: hop length (HL), Aggregate Interface Queue Length (AIQL) and Aggregate Routing Table Entry (ARTE).

Hop Length (HL): the distance in number of hops between any two mobile nodes in the mobile ad hoc network. This is the route selection metric used in the original AODV routing protocol.

Aggregate Interface Queue Length (AIQL): every mobile node maintains a queue of outstanding packets which it has to forward. The longer the queue, the more work that mobile node has to do. Hence the queue length of a mobile node reflects its current load, and mobile nodes with longer queues are better avoided. If the queue length exceeds a maximum threshold, any incoming packets will be discarded. The AIQL is the
sum of the queue lengths of all mobile nodes lying on a path, as explained in the previous section.

Aggregate of Routing Table Entries (ARTE): every mobile node maintains a routing table which contains information about valid routes to a set of destinations. If the number of entries is large, the mobile node acts as an intermediate node that knows many destinations, and hence will find itself sending more Route Reply messages. Thus, a mobile node with more routing table entries is more likely to be a busy node and is better avoided. The Aggregate Routing Table Entry (ARTE) is the sum of all routing table entries on a route.

In the WLB-AODV routing protocol, a weighted sum of the above three metrics is used in the selection of a route. Every mobile node maintains the AIQL and ARTE values in its routing table for known destinations, apart from the hop count as in original AODV. The weighted sum is:

Mi = a * HL + b * AIQL + c * ARTE, with a + b + c = 1

where Mi represents the value of the weighted metric for route i, and a, b and c represent the weights given to each of the components.

3. Gateway Load Balancing Strategies for Integrated Internet-MANET using Load Balanced Routing Protocols

The proposed network architecture for gateway load balancing consists of two tiers. The high tier consists of foreign agents and the low tier consists of the mobile nodes which form the mobile ad hoc network, as shown in figure 1.

Figure 1. Network Architecture for Gateway Load Balancing in Integrated Internet-MANET

Two strategies for gateway load balancing are proposed. The first, called Strategy-1, uses the WLB-AODV routing protocol for routing within the mobile ad hoc network. The second, called Strategy-2, uses the Modified-AODV routing protocol. Both strategies have been implemented for the hybrid gateway discovery mechanism.

4. Simulation of the Proposed Strategies

A simulation of the proposed gateway load balancing strategies was carried out to compare their performance. The network simulator used was ns-2.31. The simulations were carried out for varying packet sending rates. The simulation parameters common to all the simulations are given in Table 1. The constants a, b and c in the metric Mi are given the values a = 0.5, b = 0.25 and c = 0.25.

Table 1. Simulation Parameters

Parameter Name | Value
Number of mobile nodes | 25
Number of source nodes | 5
Number of IGW | 3
Number of Correspondent Nodes | 3
Topology Size | 1200 x 500 m
Transmission Range | 250 m
Traffic Type | Constant Bit Rate
Mobile Node Speed | 20 m/sec
Packet Size | 512 bytes
Pause Time | 5 sec
Mobility Model | Random Waypoint Model
Carrier Sensing Range | 500 m
Simulation time | 900 sec
Packet Sending Rate | 5 – 40 packets/sec
Interface Queue Length | 50 packets
Advertisement Interval | 5 sec
Advertisement Zone | 3 hops

The performance is analyzed with respect to the following performance metrics:

Packet Delivery Ratio: the percentage of the number of packets received to the total number of packets sent.
End-to-End Delay: the average overall delay for a packet to traverse from a source node to a destination node.
Normalized Routing Load: the number of routing control packets per data packet delivered at the destination.

Figure 2 shows the comparison of end to end delay of the two strategies. It is quite clear that Strategy-1
outperforms Strategy-2. This indicates that Strategy-1 successfully chooses less loaded routes, enabling faster delivery of data packets. Figure 3 shows the comparison of the routing load of the two strategies. Here a slight, albeit consistent, advantage is observed for Strategy-1. This indicates that Strategy-1 incurs lower routing overhead due to fewer control packet retransmissions, by choosing lightly loaded routes. Figure 4 again establishes the superiority of Strategy-1 over Strategy-2, by delivering a better packet delivery ratio. This is because less loaded routes are chosen for data delivery and hence more packets are delivered.

[Figures 2–4 are line plots comparing the two strategies; only their captions are reproduced here.]

Figure 2. End to End Delay as a function of Packet Sending Rate
Figure 3. Normalized Routing Load as a function of Packet Sending Rate
Figure 4. Packet Delivery Ratio as a function of Packet Sending Rate

5. Conclusion

In this paper, two gateway load balancing strategies were proposed, one based on WLB-AODV and the other based on Modified-AODV. Through simulations in ns-2.31, it is observed that the strategy based on WLB-AODV gives performance enhancements over the strategy based on Modified-AODV, in terms of the performance metrics End-to-End Delay, Normalized Routing Load and Packet Delivery Ratio.

6. References

[1] E. M. Royer and C-K. Toh, "A Review of Current Routing Protocols for Ad Hoc Mobile Wireless Networks", IEEE Personal Communications Magazine, pp. 46-55, 1999.
[2] Khaleel Ur Rahman Khan, Rafi U Zaman and A. Venugopal Reddy, "Integrating Mobile Ad Hoc Networks and the Internet: challenges and a review of strategies", Proceedings of IEEE/CREATE-NET/ICST COMSWARE 2008, 2008.
[3] J. H. Zhao, X. Z. Yang and H. W. Liu, "Load-balancing Strategy of Multi-gateway for Ad hoc Internet Connectivity", Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05), vol. II, pp. 592-596, 2005.
[4] A. Trivino-Cabrera, E. Casilari, D. Bartolome and A. Ariza, "Traffic Load Distribution in Ad Hoc Networks through Mobile Internet Gateways", Proceedings of the Fourth International Working Conference on Performance Modelling and Evaluation of Heterogeneous Networks, 2006.
[5] Y-Y. Hsu, Y-C. Tseng, C-C. Tseng, C-F. Huang, J-H. Fan and H-L. Wu, "Design and Implementation of Two-Tier Mobile Ad Hoc Networks with Seamless Roaming and Load-Balancing Routing Capability", Proceedings of the First International Conference on Quality of Service in Heterogeneous Wired/Wireless Networks, pp. 52-58, 2004.
[6] J. Shin, H. Lee, J. Na, A. Park and S. Kim, "Load Balancing among Internet Gateways in Ad Hoc Networks", Proceedings of the 62nd IEEE Vehicular Technology Conference, pp. 1677-1680, 2005.
[7] Q. Le-Trung, P. E. Engelstad, T. Skeie and A. Taherkordi, "Load-Balance of Intra/Inter-MANET Traffic over Multiple Internet Gateways", Proceedings of the 6th International Conference on Advances in Mobile Computing and Multimedia, pp. 50-57, 2008.
[8] A. Rani and M. Dave, "Performance Evaluation of Modified AODV for Load Balancing", Journal of Computer Science, vol. 3, issue 11, 2007.
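To make the WLB-AODV route selection of Section 2 concrete, the weighted metric Mi = a*HL + b*AIQL + c*ARTE with the paper's weights (a = 0.5, b = 0.25, c = 0.25) can be sketched as below. The candidate route values are illustrative, not taken from the simulations.

```python
def route_metric(hl, aiql, arte, a=0.5, b=0.25, c=0.25):
    """Weighted route metric Mi = a*HL + b*AIQL + c*ARTE (a + b + c = 1).
    Lower is better: short, lightly queued, lightly routed paths win."""
    assert abs(a + b + c - 1.0) < 1e-9
    return a * hl + b * aiql + c * arte

def select_route(candidates, **weights):
    """Pick the candidate route tuple (hl, aiql, arte) with the smallest metric."""
    return min(candidates, key=lambda r: route_metric(*r, **weights))

# A short but congested route vs. a longer, lightly loaded one.
routes = [(3, 40, 20), (5, 4, 6)]
print(select_route(routes))
# → (5, 4, 6): metric 5.0 beats 16.5, so the lightly loaded route wins despite more hops
```

The example shows why the combined metric can outperform plain hop count: AODV alone would pick the 3-hop route regardless of its queue backlog.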
ECAR: An Efficient Channel Assignment and Routing in Wireless Mesh Network

Chaitanya P. Umbare and Dr. S. V. Rao
Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Assam, India-781039
Email: umbare@iitg.ernet.in, svrao@iitg.ernet.in
Abstract—Wireless mesh networking is a promising design paradigm for future generation wireless networks. The wireless mesh network (WMN) came into existence to resolve the limitations of, and to significantly improve the performance of, ad-hoc networks, wireless local area networks (WLANs), etc. In a WMN, or any network, channel allocation is done based on the queue states at different nodes, which depend on how routing is done. Thus it is important to have joint routing and channel allocation algorithms to control various measures of performance like delay and throughput. For better and more efficient use of multiple radios and advanced physical layer technologies, cross-layer control is required. In this paper, we propose an intelligent channel assignment strategy which results in minimum interference between channels, and efficient routing which results in a reduction of end-to-end delay.

I. INTRODUCTION

A WMN is a multi-hop, self-configured and self-healing network, which provides reliability, redundancy and scalability [1]. WMNs consist of two types of nodes: mesh routers and mesh clients. Mesh routers, which form the backbone of the network, are generally not mobile and are equipped with one or more radios operating on the same or different radio technologies. Mesh clients are generally mobile and equipped with one radio only, because of power constraints.

We now discuss some of the past work related to our protocol. The use of multiple 802.11 NICs per node is explored in routing in multi-radio, multi-hop WMNs [2]. Due to the static channel assignment to all the nodes, the throughput improvement is proportional to the number of NICs. The main idea in the multichannel CSMA MAC protocol for multihop wireless networks [3] is to find an optimum channel for every single packet transmission. Channel switching is on a packet-by-packet basis, which requires re-synchronization among communicating network cards for every packet; this decreases network throughput. The centralized channel assignment and routing algorithm for multi-channel wireless mesh networks [4] visits all the virtual links in decreasing order of their expected loads and, upon visiting a link, greedily assigns a channel that leads to minimum interference. This centralized algorithm demands complete information about the network to perform channel assignment and routing.

Distributed channel assignment in multi-radio 802.11 mesh networks [5] presents a distributed, self-stabilizing mechanism that assigns channels to multi-radio nodes in a WMN. The main disadvantage of this protocol is that the assigned channels are not changed once the channel assignment has stabilized. A distributed channel assignment and routing scheme for WMNs [6] is proposed by Raniwala. The assumption made in that paper is that the traffic of all nodes goes to or comes from the gateway nodes only; for other traffic patterns, the protocol does not work. Also, due to uncoordinated allocation among nodes with the same priority, the channel assignment may not be convergent and may thus cause severe interference among nodes.

In this paper we address two main issues which significantly affect the performance of the network: routing and channel assignment. We try to improve the protocol of [6] by intelligent channel assignment, which significantly reduces the interference between neighboring nodes, and intelligent routing, which reduces much of the routing overhead. The next section describes our protocol in detail.

II. PROPOSED SCHEME

The ECAR scheme consists of two phases:
1) Load-Balancing Routing
2) Distributed Load-Aware Channel Assignment
These phases are explained in the following sections.

A. Load-Balancing Routing

The main idea in the routing process is as follows: if the source and destination are in the same subnet and the hop count between the router and the destination is less than or equal to three, then the router forwards the data packet using NIC[0] to the appropriate neighbor with the help of the master routing table. If the destination router is in the subtree rooted at this router, then it sends the data packets to the appropriate child using NIC[2] and the inter-domain routing table; otherwise it sends the data packet to its parent using NIC[1] and the inter-domain routing table.

Each router periodically broadcasts a HELLO message to all its 1-hop neighbors. After receiving a HELLO message, a router updates its neighbor list by considering hop-count
information, and broadcasts this updated information to its 1-hop neighbors. As a result, each router has information about all the routers in the network, with the minimum hop-count from itself. In this way, each router builds the master routing table.

In order to build the inter-domain routing table, each gateway node broadcasts an ADVERTISE message to all its 1-hop neighbors with the residual capacity of the uplink. Upon receiving an ADVERTISE message, a router sends a JOIN message to the advertiser if the advertised gateway residual capacity is more than that of its existing gateway; otherwise it does not reply. When a router receives a JOIN message, it adds the sender to its children list and sends an ACCEPT message containing information about the channel to be used for communication. The router also sends a ROUTE-ADD message to its parent, containing the address of the node and its children, and this process continues up to the gateway node. When a router receives an ACCEPT message, it updates its parent entry and sends a LEAVE message to its previous parent. Upon receiving a LEAVE message, a router deletes the forwarding entries to that router and all its children and sends a ROUTE-DELETE message to its parent; the sending of ROUTE-DELETE is recursive. The master routing table and the inter-domain routing table can thus be built with only one message, which avoids the overhead of sending extra control messages.

B. Distributed Load-Aware Channel Assignment

In this section we present a localized distributed algorithm for assigning a channel to each interface. The NIC used to communicate with the parent router is termed the parent NIC, whereas the child NIC denotes the NIC used to communicate with the children. Each WMN router is responsible for assigning a channel to its child NIC. The parent NIC of a router is associated with a unique child NIC of the parent router and is assigned the same channel as the parent's child NIC.

Each router periodically exchanges its individual channel usage information with all of its (k+1)-hop neighbors through a CHANNEL-USAGE packet, where k is the ratio of the interference range to the communication range. After receiving the channel usage information, each router calculates the aggregate traffic load on each channel. Each router excludes the channels used by its ancestors and the channels used by nodes at the same level, determines a least loaded channel, and assigns it to its child NIC.

The load on the routers closer to the gateways is higher, since most of the traffic goes to and comes from the gateway nodes. In order to give them more relay bandwidth, they are given higher priority when assigning channels. Each router broadcasts the channel assigned to its child NIC to its children through a CHANNEL-CHANGE packet. A change in network traffic may change the load on the various channels, which may unbalance the channel load and decrease the network throughput. In order to balance the channel load and improve network throughput, each router dynamically changes the channel assigned to its child NIC by executing the above procedure periodically.

III. PERFORMANCE EVALUATION

This section gives the implementation details of our scheme in ns-2 [7]. We also present our simulation environment and results.

A. Simulation Environment

The simulation environment is as described in Table I. All the simulations are done in ns-2.

TABLE I. SIMULATION PARAMETERS

No. | Parameters | Values
1 | Area | 240 m x 240 m
2 | Transmission Range | 22.5 m
3 | Nodes | 25, 50, 75, 100
4 | Data Rate | 50 Kbps – 4 Mbps
5 | Number of Channels | 13
6 | Number of NICs/Node | 3
7 | Number of Flows | 5 – 30
8 | Number of Gateway nodes | 1 – 4
9 | Simulation Time | 400 sec

B. Simulation Results

The performance parameters measured in our simulations are average aggregate throughput and average end-to-end delay. Our protocol assigns channels to interfaces intelligently; because of this, the average aggregate throughput is much better in our case than in DCAR [6], as shown in the graphs below.

[Two line plots compare ECAR and DCAR: average aggregate throughput (Mbits/sec) versus number of nodes (30 flows, 2 gateway nodes), and versus number of flows (100 nodes, 3 gateway nodes); only this summary is reproduced here.]
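The least-loaded channel selection of Section II-B can be sketched as follows. This is an illustration under assumed inputs: the load numbers stand in for the aggregate traffic learned from CHANNEL-USAGE packets, and the fallback for the fully excluded case is our own assumption, not specified in the paper.

```python
def assign_child_channel(channel_load, ancestor_channels, sibling_channels):
    """Pick the least-loaded channel for a router's child NIC, excluding
    channels used by ancestors and by routers at the same tree level."""
    excluded = set(ancestor_channels) | set(sibling_channels)
    candidates = {ch: load for ch, load in channel_load.items()
                  if ch not in excluded}
    if not candidates:          # everything excluded: fall back to global minimum
        candidates = channel_load
    return min(candidates, key=candidates.get)

# Aggregate per-channel loads learned from (k+1)-hop neighbours.
load = {1: 5.0, 2: 0.5, 3: 2.0, 4: 1.0}
print(assign_child_channel(load, ancestor_channels={2}, sibling_channels={4}))
# → 3: channel 2 is the lightest overall but is excluded, so channel 3 wins
```

Excluding ancestor and same-level channels is what keeps a router's transmissions from interfering with the links directly above and beside it in the tree.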
[A third line plot compares the average aggregate throughput (Mbits/sec) of ECAR and DCAR versus the number of gateway nodes (30 flows, 100 nodes); only this summary is reproduced here.]

In the following graphs we show the effect on average end-to-end delay of changing the number of hops between the source and destination (1 in the first graph, 2 in the second and 3 in the third) with a varying number of nodes.

[Three line plots compare the average end-to-end delay (msec) of ECAR and DCAR versus the number of nodes; only this summary is reproduced here.]

IV. CONCLUSION AND FUTURE WORK

In particular, our protocol efficiently handles the two fundamental design issues in a multi-channel WMN. First, which of the available non-overlapped radio channels should be assigned to each 802.11 interface in the WMN? Second, how should packets be routed through a multi-channel wireless mesh network? Traffic between nearby routers is handled using the master routing table, by sending data packets directly to the next hop instead of through the gateway nodes, which reduces routing overhead. Moreover, dynamically changing the assigned channels minimizes interference and hence increases the throughput of the network.

REFERENCES

[1] R. Bruno, M. Conti, and E. Gregori, "Mesh networks: commodity multihop ad hoc networks," IEEE Communications Magazine, vol. 43, pp. 123-131, 2005.
[2] R. Draves, J. Padhye, and B. Zill, "Routing in multi-radio, multi-hop wireless mesh networks," in MobiCom '04: Proceedings of the 10th annual international conference on Mobile computing and networking, ACM Press, 2004, pp. 114-128.
[3] A. Nasipuri, J. Zhuang, and S. Das, "A multichannel CSMA MAC protocol for multihop wireless networks," 1999, pp. 1402-1406.
[4] A. Raniwala, K. Gopalan, and T.-C. Chiueh, "Centralized channel assignment and routing algorithms for multi-channel wireless mesh networks," SIGMOBILE Mob. Comput. Commun. Rev., vol. 8, pp. 50-65, 2004.
[5] B.-J. Ko, V. Misra, J. Padhye, and D. Rubenstein, "Distributed channel assignment in multi-radio 802.11 mesh networks," in Wireless Communications and Networking Conference (WCNC 2007), IEEE, 2007, pp. 3978-3983.
[6] A. Raniwala and T.-C. Chiueh, "Architecture and algorithms for an IEEE 802.11-based multi-channel wireless mesh network," vol. 3, 2005, pp. 2223-2234.
[7] Information Sciences Institute, "NS-2 network simulator," Software Package, 2003, http://www.isi.edu/nsnam/ns/.

275
Rotational Invariant Texture Classification of Color
Images using Local Texture Patterns
A.Suruliandi
Department of Computer Science and Engineering, Manonmaniam Sundaranar University
Tirunelveli, Tamilnadu, India
suruliandi@yahoo.com

E. M. Srinivasan, Department of ECE, Government Polytechnic College, Nagercoil, Tamilnadu, India (emsvasan@yahoo.com)
K. Ramar, Department of CSE, National Engineering College, Kovilpatti, Tamilnadu, India (kramar_nec@rediffmail.com)

Abstract— In this paper a new approach to extend the Local Texture Patterns (LTP) texture model to color images is presented. In this study, to extract the spatial features of a color image, Gray-Local Texture Patterns (GLTP) are introduced. Contrast is another important property of images. To extract contrast features, Color-Local Contrast Variance Patterns (CLCVP) are proposed. However, much important information contained in the image can be revealed by joint distributions of individual features. Hence, GLTP/CLCVP is proposed as a textural feature extraction technique for the classification of color images. The performance of the proposed features is evaluated with rotational invariant classification of Outex texture database images. From the experimental results it is observed that GLTP/CLCVP yields a high classification accuracy of 99.32% for color images.

Keywords- Texture Analysis, Texture Classification, Local Texture Patterns, Local Gray scale Texture Patterns, Local Color Contrast Variance Patterns.

I. INTRODUCTION

Texture methods can be categorized as statistical, geometrical, structural, model-based and signal processing features. A comparative study on texture measures is reported in [3].

A. Motivation and Justification for the Proposed Approach

Color texture models are essential for at least two reasons. The first is that most approaches to texture analysis quantify texture measures by single values like mean, variance, entropy etc. However, much important information contained in the distribution of texture values might be lost. The second is color, which has considerable importance in image analysis. Color as well as texture has been discussed intensively in the literature, but most of the known texture models are based on gray-scale images. This is an inadequate restriction in many real-world applications of computer vision. These restrictions are the motivation behind the development of color texture models.

He and Wang [1] proposed a texture modeling scheme called 'Texture Spectrum (TS)'. The TS operator is gray-scale invariant but not rotational invariant. Ojala et al. extended their earlier work in 2002 [4] and proposed a new texture model called 'Local Binary Patterns (LBP)'. The LBP model is gray-scale and rotational invariant. Recently, Suruliandi and Ramar [6] proposed a new texture model, LTP, by combining the best features of the TS and LBP models. Hence it is proposed to extend the LTP model and to combine it with a contrast variance feature for the classification of color images.

B. Outline of the Proposed Work

The spatial texture feature used in this study is LTP, which is basically a gray-scale operator, and it is extended to process color image patterns. The feature GLTP is introduced for this purpose. The LTP is an excellent measure of the spatial structure of local image texture, but it discards the contrast variance of the image, which is an important property. The feature CLCVP is proposed based on the contrast variance feature. Color and texture being complementary, it is expected that their orthogonal property in the form of a joint distribution can perform better for texture analysis. Hence, the feature GLTP/CLCVP is proposed as a joint distribution texture model.

C. Organization of the Paper

This paper is organized as follows. Section II describes the basic operators LTP and VAR. The proposed texture feature extraction technique for color images is explained in Section III. The classification principle is illustrated in Section IV. Experimental results are presented in Section V. Discussion and conclusion are presented in Section VI.

II. BASIC OPERATORS

A. Local Texture Patterns (LTP)

The local image texture information can be extracted from a neighborhood of 3 x 3 local regions. Let gc, g1,
g2,…,g8 be the pixel values of a local region, where gc is the value of the central pixel and g1, g2,…,g8 are the pixel values of its 8-neighborhood. Let the pattern unit P between gc and its neighbor gi (i = 1, 2,…,8) be defined as

P(gi, gc) = 0 if gi < (gc − Δg); 1 if (gc − Δg) ≤ gi ≤ (gc + Δg); 9 if gi > (gc + Δg),  i = 1, 2,…,8    (1)

where Δg is a small positive value that represents a desirable threshold value set by the user. The values for P can be any three distinct values; here 0, 1 and 9 are chosen in order to make the pattern labeling process easier, and they have no other significance. Fig. 1 shows a 3 x 3 local region, the P values calculated along the border, and its pattern string.

123 110 113        1 0 0
117 120 135        1   9        00999911
130 125 128        9 9 9
    (a)              (b)           (c)

Fig.1. (a) 3 x 3 local region (b) Pattern units matrix for Δg=4. (c) Pattern string.

To define uniform circular patterns over the pattern string, a uniformity measure U, which corresponds to the number of spatial transitions circularly in the pattern string, is defined as

U = s(P(g8, gc), P(g1, gc)) + Σ_{i=2}^{8} s(P(gi, gc), P(g(i−1), gc))    (2)

where

s(X, Y) = 1 if |X − Y| > 0; 0 otherwise    (3)

The following rotational, gray-scale shift invariant 'Local Texture Pattern (LTP)' operator is proposed for describing a local image texture:

LTP = Σ_{i=1}^{8} P(gi, gc) if U ≤ 3; 73 otherwise    (4)

For U = 0 there exist 3 LTPs (0, 8 and 72), for U = 2 there are 21 LTPs (1 to 7, 9, 16, 18, 24, 27, 32, 36, 40, 45, 48, 54, 56, 63 and 64) and for U = 3 there exist another 21 LTPs (10 to 15, 19 to 23, 28 to 31, 37 to 39, 46, 47 and 55). All other non-uniform patterns are grouped under the single label 73. Pattern strings such as 00019000 and 00091000 are considered rotate equivalents, and such patterns generate the same LTP. Therefore, the total number of LTPs is 46. Since there are a few holes in the LTP numbering scheme, they are relabeled to form a continuous numbering from 1 to 46 using a small lookup table.

B. Contrast Variance (VAR)

The local contrast variance (VAR) is defined by

VAR = (1/P) Σ_{i=0}^{P−1} (gi − μc)²,  where μc = (1/P) Σ_{i=0}^{P−1} gi    (5)

VAR is, by definition, invariant against shifts in gray-scale. In forming a pattern over a local region, either the LTP or VAR can be used as texture descriptors. The features, either the joint pair LTP+VAR or their joint distribution LTP/VAR, can also be used as texture descriptors.

C. Patterns Unification Procedure

VAR has a continuous-valued output and hence quantization of its feature space is needed. This can be achieved by the patterns unification procedure. The unique code corresponding to the inputs P, Q and R is computed as shown in Figure 2. The variables P, Q and R may represent VAR values. A '●' in the upper position means its value is higher, a '●' in the lower position means its value is lesser, and '●'s on the same line mean the values are equal.

[Figure 2. Patterns unification procedure: the 13 possible relative orderings of P, Q and R (shown as dots on higher/equal/lower levels) and their corresponding unique codes 1 to 13.]

III. PROPOSED TEXTURE MODEL FOR COLOR IMAGES

The procedure for computing GLTP/CLCVP is illustrated in Figure 3.
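As a concrete illustration of the Section II operators, the pattern unit (Eq. (1)), uniformity measure (Eqs. (2)-(3)), LTP label (Eq. (4)) and VAR (Eq. (5)) can be sketched for a single 3 x 3 region. This is a minimal sketch; the clockwise neighbor ordering is one plausible convention, not stated explicitly in the paper:

```python
def pattern_unit(gi, gc, dg):
    # Eq. (1): three-level pattern unit with threshold dg
    if gi < gc - dg:
        return 0
    if gi <= gc + dg:
        return 1
    return 9

def neighbors(region):
    # the 8 border pixels of a 3x3 region, read clockwise from top-left
    return [region[0][0], region[0][1], region[0][2], region[1][2],
            region[2][2], region[2][1], region[2][0], region[1][0]]

def ltp(region, dg=4):
    # region: 3x3 list of gray values
    gc = region[1][1]
    p = [pattern_unit(g, gc, dg) for g in neighbors(region)]
    # Eqs. (2)-(3): U counts circular transitions in the pattern string
    u = sum(1 for a, b in zip(p, p[1:] + p[:1]) if a != b)
    # Eq. (4): uniform patterns (U <= 3) keep their sum, others get 73
    return sum(p) if u <= 3 else 73

def var(region):
    # Eq. (5): local contrast variance over the P = 8 neighbors
    vals = neighbors(region)
    mu = sum(vals) / len(vals)
    return sum((g - mu) ** 2 for g in vals) / len(vals)
```

For the region of Fig. 1 with Δg = 4, the pattern string is a rotation of 00999911, U = 3 and the LTP label is 38, which lies in the U = 3 group (37 to 39).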
[Figure 3. Extraction of GLTP/CLCVP joint distribution features: from the RGB image, isolate the R, G and B planes; form the intensity image I = 0.299R + 0.587G + 0.114B and apply the LTP operator over I with a 3 x 3 sliding window to form the LTP plane; compute VAR for the R, G and B planes with a 3 x 3 sliding window, and combine the three VAR planes using the patterns unification procedure to form a unified VAR plane; finally, form the 2D joint distribution histogram GLTP/CLCVP from the LTP plane and the unified VAR plane.]

IV. CLASSIFICATION PRINCIPLE

A. Texture Similarities

Similarity between different textures is evaluated by comparing their pattern spectra using the log-likelihood ratio, also known as the G-statistic [5]:

G = 2 [ Σ_{s,m} Σ_{i=1}^{n} fi log fi − Σ_{s,m} ((Σ_{i=1}^{n} fi) log(Σ_{i=1}^{n} fi)) − Σ_{i=1}^{n} ((Σ_{s,m} fi) log(Σ_{s,m} fi)) + (Σ_{s,m} Σ_{i=1}^{n} fi) log(Σ_{s,m} Σ_{i=1}^{n} fi) ]    (6)

where s is a histogram of the texture measure distribution of the test sample image and m is a histogram of the texture measure distribution of the model sample image, n is the total number of bins in the histogram and fi is the frequency at bin i. The value of the G-statistic indicates the possibility that two image texture distributions come from the same population. The more alike the histograms are, the smaller the value of G.

B. k-Nearest Neighbor Classification

The algorithm for k-Nearest Neighbor classification is as follows.
• Have training data (xi, ti; i = 1, 2, ..., n), where xi is the attribute of training sample i and ti is the class label of training sample i.
• Have some test point x we wish to classify.
• Calculate the similarity or dissimilarity between the test point and the training points.
• Find the k training points k1, k2,…, kk which are closest to the test point.
• Set the classification t for the test point to be the most common of the k nearest neighbors.
• In the special case when k = 1, the algorithm behaves simply as Nearest Neighbor classification.

V. EXPERIMENTS

A. Images Used in the Experiments

The performance of the proposed texture features for color images is demonstrated with textures downloaded from the Outex database [2]. In this study five classes of textures are used for classification purposes. In each class of texture, images of three different rotation angles, three different illuminations and three different resolutions (100, 300 and 600 dpi) were used. In total there were 135 (5 x 3 x 3 x 3) textures used for the experiments. Figure 4 shows the five texture classes.

B. Rotational Invariant Texture Classification

There are many applications of texture analysis in which rotation invariance is important. In most approaches to texture classification, it is assumed that the unknown samples to be classified are identical to the training samples with respect to orientation. However, in reality the samples to be classified can occur at arbitrary orientations.

In this experiment, the classifier was trained with samples of illuminant 'inca', rotation angle of 0° and resolution of 600 dpi. 10 samples, each of size 64 x 64, were extracted from each of the five texture classes shown in Figure 4, so there were 50 (5 x 10) training models. For rotational invariant classification, samples of the same illuminant ('inca') and the same resolution as the training samples (600 dpi), but with different rotation angles, i.e. the other two rotation angles of each texture, were used to test the classifier. 40 samples, each of size 64 x 64, were extracted from each texture class, giving in total 200 (5 x 40) validation samples. The 3-NN classifier algorithm was used as the classifier. A validation sample was assigned to the class holding the majority of its 3 matches. The results of the classification are shown in Table 1.
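The G-statistic of Eq. (6) and the k-NN vote can be sketched as follows. This is a simplified sketch: histograms are plain frequency lists over the same bins, and the convention 0·log 0 = 0 is assumed:

```python
import math

def g_statistic(s, m):
    # Eq. (6): log-likelihood ratio between sample histogram s and
    # model histogram m; smaller G means more similar distributions.
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0
    t1 = sum(xlogx(f) for h in (s, m) for f in h)
    t2 = sum(xlogx(sum(h)) for h in (s, m))
    t3 = sum(xlogx(a + b) for a, b in zip(s, m))
    t4 = xlogx(sum(s) + sum(m))
    return 2.0 * (t1 - t2 - t3 + t4)

def knn_classify(test_hist, models, k=3):
    # models: (class_label, histogram) pairs; majority vote over the
    # k models with the smallest G-statistic, as in the 3-NN setup.
    ranked = sorted(models, key=lambda cm: g_statistic(test_hist, cm[1]))
    votes = [c for c, _ in ranked[:k]]
    return max(set(votes), key=votes.count)
```

Identical histograms give G = 0, consistent with "the more alike the histograms are, the smaller the value of G."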
Figure 4. Textures from the Outex database used in the experiments: (a) Canvas001; (b) Canvas002; (c) Canvas003; (d) Canvas005; (e) Cardboard001.

TABLE 1. ROTATIONAL INVARIANT CLASSIFICATION.

Texture                       CV-1-45  CV-1-90  CV-2-30  CV-2-60  CV-3-00  CV-3-10  CV-5-15  CV-5-30  CB-1-15  CB-1-75  Average
Classification Accuracy (%)   90.91    97.73    100      100      100      100      100      100      100      100      98.86

C. Comparative Analysis of GLTP/CLCVP for Rotational Invariant Texture Classification

This experiment was conducted to measure the performance of three texture features, GLTP/CLCVP, LBP/CLCVP and TS/CLCVP, for rotational invariant classification. All training textures were chosen with illumination 'inca', resolution of 600 dpi and rotation angle of 0°. The validation textures were the same five textures with illumination 'inca' and resolution of 600 dpi, but the other two rotation angles were used. The 3-NN classifier was used as the classification algorithm. The results are tabulated in Table 2.

TABLE 2. ROTATIONAL INVARIANT CLASSIFICATION RESULTS FOR VARIOUS TEXTURE MODELS.

Texture         Classification accuracy in %
                TS/CLCVP   LBP/CLCVP   GLTP/CLCVP
CV1-I-600-45    100.00     80.68       94.32
CV1-I-600-90    46.59      77.27       98.86
CV2-I-600-30    100.00     100.00      100.00
CV2-I-600-60    100.00     100.00      100.00
CV3-I-600-05    100.00     98.86       100.00
CV3-I-600-10    100.00     98.86       100.00
CV5-I-600-15    78.41      90.91       100.00
CV5-I-600-30    71.59      90.91       100.00
CB1-I-600-15    100.00     100.00      100.00
CB1-I-600-75    100.00     100.00      100.00
Average         89.66      93.75       99.32

Texture legend examples:
CV1-I-600-45: Canvas001 with 'inca' illumination at 600 dpi and rotation angle of 45.
CB1-I-600-15: Cardboard001 with 'inca' illumination at 600 dpi and rotation angle of 15.

VI. DISCUSSION AND CONCLUSION

In this paper, a new texture feature is proposed for color texture modelling. The feature GLTP/CLCVP is introduced as a joint distribution of texture patterns and contrast variance patterns. The performance of the proposed texture feature was studied with respect to rotational invariance. The experimental results reveal that the proposed texture feature GLTP/CLCVP yields promising results for rotational invariant classification problems.

In this work, the performance of the proposed texture feature extraction technique is tested using a k-NN classifier with the G-statistic as the distance measure. In future it is planned to test the performance using other classifiers with different distance measures.

REFERENCES
[1] D.C. He and L. Wang, "Texture unit, Texture Spectrum and Texture Analysis", IEEE Transactions on Geoscience and Remote Sensing, vol. 28, no. 4, pp. 509–512, 1990.
[2] T. Ojala, T. Mäenpää, M. Pietikäinen, J. Viertola, J. Kyllönen and S. Huovinen, "Outex - A New Framework for Empirical Evaluation of Texture Analysis Algorithms", Proc. 16th Int'l Conf. Pattern Recognition, 2002. Available online at http://www.outex.oulu.fi
[3] T. Ojala, M. Pietikäinen and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Feature Distributions", Pattern Recognition, vol. 29, no. 1, pp. 51–59, 1996.
[4] T. Ojala, M. Pietikäinen and T. Mäenpää, "Multiresolution Gray-scale and Rotation Invariant Texture Classification with Local Binary Patterns", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[5] R.R. Sokal and F.J. Rohlf, "Introduction to Biostatistics", 2nd ed., W.H. Freeman, 1987.
[6] A. Suruliandi and K. Ramar, "Local Texture Patterns - A univariate texture model for classification of images", Proceedings of the 2008 16th International Conference on Advanced Computing and Communications (ADCOM 2008), Anna University, Chennai, Tamilnadu, India, pp. 32–39, 2008. Available online at IEEE Xplore.
Time Synchronization for an Efficient Sensor Network System
Anita Kanavalli, Vijay Krishan, Ridhi Hirani, Santosh Prasad, Saranya K.,
P Deepa Shenoy, and Venugopal K R
Department of Computer Science and Engineering
University Visvesvaraya College of Engineering, Bangalore, India
anita.kanavalli@gmail.com
L M Patnaik
Vice Chancellor
Defence Institute of Advanced Technology, Pune, India

Abstract

Time synchronization schemes in Wireless Sensor Networks have been subjected to various security threats and attacks. In this paper we throw light on some of these attacks. Nevertheless we are more concerned with the pulse delay attack, which cannot be countered using any of the cryptographic techniques. We propose an algorithm called the Resync algorithm which not only detects the delay attack but also aims to rectify the compromised node and introduce it back into the network for the synchronization process. In-depth analysis has been done in terms of the rate of success achieved in detecting multiple outliers, i.e. nodes under attack, and the level of accuracy obtained in the offset values after running the Resync algorithm.

Keywords: Time synchronization, Attacks, Security, Recovery, Clock Adversary, Compromised nodes.

1 Introduction

Sensor networks are made up of small devices with sensing and processing facilities, equipped with a low-power radio interface. Industrial control applications are emerging, as compared to previous applications, which were dedicated to environmental monitoring and surveillance tasks. Industrial control applications demand much more with respect to security, especially when monitoring of critical equipment is needed.

Time synchronization is imperative for many applications in sensor networks, at many layers of their design. Examples include TDMA radio scheduling, reducing redundant messages by duplicate detection of the same event by different sensors, performing mobile object tracking, and using the different sensor nodes to perform ordered logging of events during system debugging, to name a few. The effect of inaccurate time synchronization would be detrimental to all the above applications if somehow the underlying time synchronization protocol were modified by an adversary. In object tracking, this will affect the estimated trajectory of the object, which would greatly differ from the actual one, because the collaborative data processing and signal processing techniques will be greatly affected. Similarly, the importance of time synchronization of sensor networks can be seen in other control applications.

Contribution: We propose an extension to L-SGS, i.e. the Resync algorithm, where the compromised node is brought back into the synchronization process by running the Resync algorithm on the compromised node. This extension is evaluated against the main L-SGS algorithm based on various parameters.
2 Model and Algorithm

Our system consists of a network of sensor nodes which are assumed to be within each other's power ranges. Such sensors are called neighbors of each other. The radio link present between the sensor nodes is bidirectional, to allow two-way communication between the nodes. In the group synchronization algorithm, the nodes are synchronized if their delay value has not crossed a pre-calculated threshold value. Both the receiver-sender and sender-receiver exchanges can fall prey to the pulse delay attack.

2.1 Problem definition

Implementation of the modified L-SGS algorithm. The algorithm uses the broadcast property of the wireless channel to broadcast messages from the central node to the other nodes. The times Tij are measured by the local clock of the node, where Tij is the time i received by node j. Each node has to keep track of four sets of times, Ti, i = 1 to 4, in order to calculate dj and δj, if not compromised. Any of the nodes in the group can serve as the central node, provided that collisions in the wireless channel do not cause any drop of packets to and from the central node. The Resync algorithm is implemented in order to counter the external attack and re-enter the synchronization process after being compromised.

2.2 Algorithm

The modified L-SGS is executed as follows:

Table I: L-SGS Algorithm

1. Nc → Ni: [SYNC]*, for all i excluding c.
2. Ni (T1i) → Nc (T2i): [Ni || Nc || REPLY]
3. Nc (T3i) → Ni (T4i): [Nc || Ni || REPLY]

If d ≤ D, then the offset δi = ((T2i − T1i) − (T4i − T3i))/2; else node Ni is labeled as a compromised node and runs the RESYNC algorithm.

3 Resync Algorithm

This proposed algorithm is a continuation of the modified L-SGS. As compared to L-SGS, the modified L-SGS does not abort when a node gets compromised. Instead, the node which has been identified as the compromised node runs the Resync algorithm in order to counter the external attack and re-enter the synchronization process after being compromised. The main idea used in the Resync algorithm is that once the outliers, i.e. the malicious time offsets, have been detected, they need to be excluded while calculating the true offsets between the nodes. This can be done by calculating the mean of the offsets of the benign nodes to approximate the true time offsets.

Let Γ be the time offset data set from all the nodes and χ be the time offsets from the outlier set; then the benign time offset set is defined as Γ − χ. Also, let µ be defined as the average of the set Γ − χ. The size of the set of time offsets is n and that of the time offsets of compromised nodes is k; then µ is calculated as follows:

µ = (1/(n − k)) Σ_{i=1}^{n−k} (Γ − χ)i

Therefore, when a node gets compromised, it has to ensure that it gets the true offsets from the nodes which are not compromised in the network. Accordingly, it has to take the average of these received time offsets and set its offset to the calculated mean. The Resync algorithm executes as follows:

Table II: Resync Algorithm

1. Nx → Ni: |Nx| || [COMPROMISED]*
2. Ni → Nx: [δi || OFFSET]
3. Nx calculates the average of the offsets of the remaining benign group members and adjusts its clock.
4. Nx → Ni: |Nx| || [RESYNC]*

Node Nx represents the node which has been compromised. This has to be conveyed to all the other group members taking part in the synchronization process. This is achieved in Step 1, where the compromised node includes its node id and broadcasts it to all the nodes in a COMPROMISED message. In Step 2, as a reply, each of the benign nodes includes the offset that it calculated as part of the modified L-SGS in the OFFSET message and transmits it to the compromised node.
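A minimal sketch of the two computations above, the L-SGS offset from the timestamp exchange and the Resync correction from the benign offsets, assuming the offsets and outlier set are already available (variable names are illustrative):

```python
def lsgs_offset(t1, t2, t3, t4):
    # Modified L-SGS: offset from the two-way exchange timestamps,
    # delta = ((T2 - T1) - (T4 - T3)) / 2
    return ((t2 - t1) - (t4 - t3)) / 2.0

def resync_offset(all_offsets, outliers):
    # Resync: exclude the outlier (malicious) offsets chi from the
    # full set gamma and average the benign remainder to obtain mu.
    benign = [o for o in all_offsets if o not in outliers]
    return sum(benign) / len(benign)
```

The compromised node Nx would call the second function on the offsets received in the OFFSET replies and set its clock to the result.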
4 Implementation and Performance Analysis

4.1 Implementation of modified L-SGS

Our implementation of the modified L-SGS is simulation oriented. The simulation was performed using Castalia, a recently developed simulator built specifically for Wireless Sensor Networks. Castalia is built on top of the open-source network simulator OMNeT++.

For the sample run, nine nodes are taken into account, one acting as the central node. The delays have been plotted when no node is compromised. Based on the threshold calculation presented above, the average delay davg = 0.0198446 s and σ = 0.00794624. The required n = 3, and thus the threshold is calculated to be D = 0.04368332 s. This value is then used to detect outliers in the subsequent runs. Based on the threshold delay calculation, the presence of compromised nodes can be detected as described in the algorithm. The following figure illustrates how the algorithm is able to pick up on the presence of a node under pulse delay attack based on the threshold value.

Figure 1: End-to-end delay in a sample run

The scatter graph in Figure 1 plots the uncompromised nodes and their respective delays. All the delays are below the threshold, which is represented by the dashed line.

4.2 Performance Evaluation

The Resync algorithm is responsible for ensuring that if a node is compromised, it broadcasts its status to the other nodes present in its group and, using the time offsets of the benign nodes, is able to correct its offset.

Figure 2: Run 1 - Comparison of true and calculated offsets

As we can see from the first plot, Nodes 1, 3, 4, 5 and 7, i.e. five out of the eight nodes, have a difference in the time offsets of between 1 and 10 ms. Thus, as we can see from the plot above, the Resync algorithm has an efficiency > 70% in most cases.

5 Conclusion

The performance of the modified L-SGS is measured using the Successful Detection Rate (SDR), where different delays are introduced into the network. The performance of the Resync algorithm is then measured by comparing the calculated offsets with the true time offsets. Even though in most cases the accuracy between the time offsets is about 70%, the algorithm can be refined in order to achieve a higher accuracy rate.

References

[1] S. Ganeriwal, R. Kumar, M. B. Srivastava, "Timing-sync protocol for sensor networks," in Proceedings of the First ACM Conference on Embedded Networked Sensor Systems (SenSys), pp. 138-149, 2003.

[2] H. Song, S. Zhu, G. Cao, "Attack-resilient time synchronization for wireless sensor networks," Ad Hoc Networks, vol. 5, no. 1, pp. 112-125, 2006.
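The threshold rule from the sample run (D = davg + n·σ) and the resulting outlier test can be sketched as follows. This is a sketch; the raw delay samples of the paper's run are not reproduced here:

```python
def delay_threshold(delays, n=3):
    # D = d_avg + n * sigma over the observed end-to-end delays
    mu = sum(delays) / len(delays)
    sigma = (sum((d - mu) ** 2 for d in delays) / len(delays)) ** 0.5
    return mu + n * sigma

def detect_compromised(delays, threshold):
    # a node whose delay exceeds D is flagged as under pulse delay attack
    return [i for i, d in enumerate(delays) if d > threshold]
```

With the reported davg = 0.0198446 s and σ = 0.00794624, n = 3 gives D = 0.0198446 + 3 × 0.00794624 = 0.04368332 s, matching the value quoted above.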
Parallel Hybrid Germ Swarm Computing for Video
Compression
K. M. Bakwad¹, S. S. Pattnaik¹, B. S. Sohi², S. Devi¹, B. K. Panigrahi³, M. R. Lohokare¹
¹ National Institute of Technical Teachers' Training and Research, Chandigarh, India
² UIET, Panjab University, Chandigarh, India
³ Indian Institute of Technology, Delhi, India
km_malikaforu@yahoo.co.in
shyampattnaik@yahoo.com

Abstract— This paper proposes Parallel Hybrid Germ Swarm Computing (PHGSC) for real-time video compression. The convergence of Bacterial Foraging Optimization (BFO) is very slow because of its fixed step size, and its performance is heavily degraded for real-time processing. In this paper, the authors initially tried to increase the speed of BFO by updating the bacteria positions in parallel instead of serially, which is treated as Parallel Germ Computing (PGC). Further, Parallel Germ Computing is hybridized with GLBest Particle Swarm Optimization (GLBestPSO) to improve the global performance of PGC. The PHGSC is used to reduce the computational time of motion estimation in video compression. The adaptive step size with prediction, zero motion vectors and Von Neumann neighborhood topology implemented in PHGSC find the best matching block computationally very fast. The presented PHGSC saves computational time up to 93.36% when compared with other published methods.

Keywords- Parallel Hybrid Germ Swarm Computing (PHGSC); Global and Local Best Particle Swarm Optimization (GLBestPSO); video compression; computational time; peak signal to noise ratio.

I. INTRODUCTION

Based on the foraging strategies of the E. coli bacterium, K. M. Passino proposed Bacterial Foraging Optimization in 2002 [1]. Bacterial Foraging Optimization [1] is gaining popularity in the research community due to its attractive features. Motion estimation is popularly used in video signal processing and is a fundamental component of video compression; it accounts for 70 to 90 percent of the computational complexity of all video compression. The exhaustive search (ES) or full search algorithm gives the highest peak signal to noise ratio amongst all block-matching algorithms but requires more computational time [2]. To reduce the computational time of the exhaustive search method, many other methods have been proposed, i.e. Simple and Efficient Search (SES) [2], Three Step Search (TSS) [3], New Three Step Search (NTSS) [3], Four Step Search (4SS) [4], Diamond Search (DS) [5], Adaptive Rood Pattern Search (ARPS) [6], Novel Cross Diamond Search [7], the New Cross-Diamond Search algorithm [8], the Adaptive Block Matching Algorithm [9], Efficient Block Matching Motion Estimation [10], Content Adaptive Video Compression [11] and a fast motion estimation algorithm [12]. GA has also been used for fast motion estimation [13]. In this paper, the authors propose a fusion of Parallel Germ Computing (PGC) with GLBestPSO for motion estimation.

II. PARALLEL HYBRID GERM SOFT COMPUTING

BFO can be classified into serial and parallel BFO. Standard BFO is serial. In BFO, if all of the bacteria update their information at the same time, then it is treated as Parallel Bacterial Foraging or Parallel Germ Computing. The Parallel Bacterial Foraging or Parallel Germ Computing developed by the authors can be found in [14]. PGC, when hybridized with GLBestPSO, is called PHGSC. PHGSC has been used for video compression.

The authors propose an adaptive step size, as given in Eq. (1), which is used to predict the best matching block in the reference frame with respect to the macro block in the current frame for which the motion vector is found. In PHGSC, the positions of the bacteria are updated as given in Eq. (2). In the step size equation, W and C are the same as in GLBestPSO [15], as given in Eq. (3) and Eq. (4). Due to the adaptive step equation of PHGSC, the next block search starts near the best matching block of the previous step.

Step size = abs[Mx + My] + r·W·C    (1)

θ(i, j+1, k) = θ(i, j, k) + C(i) Δ(i)/√(Δ^T(i) Δ(i)) + Step size    (2)

w = 1.1 − (gbesti / pbesti)    (3)

c = 1 + (gbesti / pbesti)    (4)

where
Mx = horizontal position of the motion vector of the previous block,
My = vertical position of the motion vector of the previous block.

Step by step algorithm of PHGSC
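Eqs. (1)-(4) can be sketched as below. Here r is assumed to be a random factor in [0, 1], and gbest/pbest are the GLBestPSO fitness values; the function and parameter names are illustrative, not the paper's:

```python
def glbest_coeffs(gbest, pbest):
    # Eqs. (3)-(4): inertia w and acceleration c from GLBestPSO [15]
    ratio = gbest / pbest
    return 1.1 - ratio, 1.0 + ratio

def adaptive_step(mx, my, r, gbest, pbest):
    # Eq. (1): step size predicted from the previous block's motion
    # vector components (mx, my) plus a randomized refinement term.
    w, c = glbest_coeffs(gbest, pbest)
    return abs(mx + my) + r * w * c
```

Because the first term reuses the previous block's motion vector, blocks moving coherently start their search close to the expected match, which is what makes the step size "adaptive with prediction."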
Step 1: Initialize parameters p, S, Nc, Ns, Nre, C(i), i = 1, 2, ..., S.
where
p = dimension of the search space
S = number of bacteria in the population
Nc = number of chemotactic steps
Ns = number of swimming steps
Nre = number of reproduction steps
C(i) = step size taken in the random direction specified by the tumble
J(i, j, k) = fitness value or cost of the i-th bacterium in the j-th chemotactic and k-th reproduction step
θ(i, j, k) = position vector of the i-th bacterium in the j-th chemotactic and k-th reproduction step
Jbest(j, k) = fitness value or cost of the best position in the j-th chemotactic and k-th reproduction step
Jglobal = fitness value or cost of the global best position in the entire search space

Step 2: Update the following parameters:
J(i, j, k)
Jbest(j, k)
Jglobal = Jbest(j, k)

Step 3: Reproduction loop: k = k + 1

Step 4: Chemotaxis loop: j = j + 1
a) Compute the fitness function J(i, j, k) for i = 1, 2, 3, ..., S.
b) Update Jbest(j, k).
c) Tumble: generate a random vector Δ(i) ∈ R^p with each element Δm(i), m = 1, 2, ..., p, a random number on [−1, 1].
d) Compute θ for i = 1, 2, ..., S:
θ(i, j+1, k) = θ(i, j, k) + C(i) Δ(i)/√(Δ^T(i) Δ(i))
e) Swim:
i) Let m = 0 (counter for swim length).
ii) While m < Ns:
Let m = m + 1.
Compute the fitness function J(i, j+1, k) for i = 1, 2, 3, ..., S and update Jbest(j+1, k).
If Jbest(j+1, k) < Jbest(j, k) (if doing better), let Jbest(j, k) = Jbest(j+1, k) and compute θ for i = 1, 2, ..., S:
θ(i, j+1, k) = θ(i, j, k) + C(i) Δ(i)/√(Δ^T(i) Δ(i))
Use this θ(i, j+1, k) to compute the new J(i, j+1, k).
Else, let m = Ns. This is the end of the while statement.

Step 6: If j < Nc, go to Step 4. In this case, continue chemotaxis, since the life of the bacteria is not over.
Step 8: Update Jglobal from Jbest(j, k).
Step 9: If k < Nre, go to Step 3; otherwise end.

III. PHGSC FOR MOTION ESTIMATION

The authors have already used MPPSO [14] and PBFO [16] for motion estimation. The Von Neumann topology is used as the search pattern. In the proposed method, a macro block is treated as a bacterium; five bacteria are used in PHGSC for motion estimation. The initial position of the block to be searched in the reference frame is the same as that of the block in the current frame for which the motion vector is found. The mean absolute difference (MAD) is taken as the objective or cost function for motion estimation and is expressed as Eq. (5):

MAD = (1/MN) Σ_{i=1}^{M} Σ_{j=1}^{N} |CurrentBlock(i, j) − ReferenceBlock(i, j)|    (5)

where M = number of rows in the frame and N = number of columns in the frame. The performance of the proposed method is evaluated by the peak signal to noise ratio, which is given by Eq. (6):

PSNR = 10 log10 [ 255² / ( (1/MN) Σ_{i=1}^{M} Σ_{j=1}^{N} (OriginalFrame(i, j) − CompensatedFrame(i, j))² ) ]    (6)

IV. RESULTS AND DISCUSSIONS

The proposed method (PHGSC) has been tested on standard video, i.e. Caltrain, and on lecture-based video sequences. Video sequences with a distance of two frames between the current frame and the reference frame are used to generate frame-by-frame results of the proposed algorithm. To test the efficiency of the proposed algorithm against existing algorithms, the algorithms were executed on an HP workstation (CPU 3.0 GHz, 2 GB RAM) with MATLAB. The performance of PHGSC is compared with that of other methods [2][3][4][5][6] and the results are presented in Table 1 and Table 2. The speed of PHGSC is faster than the published methods and its PSNR is close to the published methods, as shown in Table 3. PHGSC saves between 6.06 and 93.36 percent of the computational time, with a PSNR gain of −0.1573 to +1.7441. In the suggested method, zero motion vectors are stored directly; the zero motion vectors implemented in PHGSC save computational time while maintaining accuracy.

V. CONCLUSION

This paper presents a new hybrid soft computing technique known as PHGSC. The proposed technique is used for motion estimation in video. As compared to ES, PHGSC gives less PSNR of 0.1573 and 0.0189 for caltrain and lecturer based
Step 7: The Sr=S/2 bacteria with the highest cost function video sequence respectively.
values die and other Sr=S/2 bacteria with the best values split.

284
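The chemotaxis/swim/reproduction cycle of Steps 1–9 can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' MATLAB implementation: the parameter names S, Nc, Ns and Nre follow the listing above, while the fixed step size, the toy sphere cost function and the helper name bfo_minimize are our own choices.

```python
import math
import random

def bfo_minimize(cost, p=2, S=8, Nc=20, Ns=4, Nre=4, step=0.1, seed=1):
    """Toy bacterial-foraging loop (Steps 1-9): minimize cost over R^p."""
    rng = random.Random(seed)
    # Step 1: initialize S bacteria at random positions in [-1, 1]^p.
    theta = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(S)]
    J = [cost(t) for t in theta]
    best_pos, Jglobal = min(zip(theta, J), key=lambda z: z[1])

    for k in range(Nre):                      # Step 3: reproduction loop
        for j in range(Nc):                   # Step 4: chemotaxis loop
            for i in range(S):
                # c) Tumble: random direction Delta(i), normalized.
                d = [rng.uniform(-1, 1) for _ in range(p)]
                norm = math.sqrt(sum(x * x for x in d)) or 1.0
                m = 0
                while m < Ns:                 # e) Swim while improving
                    m += 1
                    trial = [theta[i][q] + step * d[q] / norm for q in range(p)]
                    Jtrial = cost(trial)
                    if Jtrial < J[i]:         # doing better: keep swimming
                        theta[i], J[i] = trial, Jtrial
                    else:
                        m = Ns                # end of the swim loop
                if J[i] < Jglobal:            # Step 8: update Jglobal
                    best_pos, Jglobal = list(theta[i]), J[i]
        # Step 7: the S/2 worst bacteria die; the S/2 best split in two.
        order = sorted(range(S), key=lambda i: J[i])[: S // 2]
        theta = [list(theta[i]) for i in order] + [list(theta[i]) for i in order]
        J = [J[i] for i in order] * 2
    return best_pos, Jglobal

# Example: the sphere function has its minimum (0) at the origin.
pos, val = bfo_minimize(lambda x: sum(q * q for q in x))
```

In the motion-estimation setting of Section III, the cost function would be the MAD of Eq. (5) evaluated at candidate block displacements rather than the sphere function used here.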
TABLE I. COMPARISON OF MEAN PSNR OF PROPOSED METHOD AND EXISTING METHODS

Sr. No | Type of Video Sequence | No. of Frames | Mean PSNR
       |                        |               | ES      | TSS     | SESTSS  | NTSS    | 4SS     | DS      | ARPS    | Proposed Method
1      | Caltrain               | 30            | 27.8422 | 26.2390 | 25.9408 | 26.9647 | 27.4322 | 27.5123 | 27.4336 | 27.6849
2      | Lecturer Based         | 24            | 35.2214 | 34.8762 | 34.8757 | 34.8467 | 34.8273 | 34.8252 | 34.7248 | 35.2025
TABLE II. COMPARISON OF COMPUTATIONAL TIME IN SECONDS OF PROPOSED METHOD AND EXISTING METHODS

Sr. No | Type of Video Sequence | No. of Frames | Computational Time in seconds
       |                        |               | ES   | TSS  | SESTSS | NTSS | 4SS  | DS   | ARPS | Proposed Method
1      | Caltrain               | 30            | 3.55 | 0.45 | 0.35   | 0.45 | 0.43 | 0.42 | 0.33 | 0.31
2      | Lecturer Based         | 24            | 5.88 | 0.73 | 0.56   | 0.58 | 0.54 | 0.52 | 0.42 | 0.39

TABLE III. COMPUTATIONAL TIME SAVED AND PSNR GAIN BY PROPOSED METHOD OVER EXISTING METHODS

Sr. No | Proposed Method                 | Type of Video Sequence | Existing Block Matching Method
       |                                 |                        | ES      | TSS     | SESTSS  | NTSS    | 4SS     | DS      | ARPS
1      | Computational time saved by     | Caltrain               | 91.26   | 31.11   | 11.42   | 31.11   | 27.9    | 26.19   | 6.06
       | PHGSC (in percentage)           | Lecture Based          | 93.36   | 46.57   | 30.35   | 32.35   | 27.71   | 25      | 7.14
2      | PSNR gain by PHGSC (in dB)      | Caltrain               | -0.1573 | +1.4459 | +1.7441 | +0.7202 | +0.2527 | +0.1726 | +0.2513
       |                                 | Lecture Based          | -0.0189 | +0.3263 | +0.3268 | +0.3558 | +0.3752 | +0.3773 | +0.4771
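The MAD cost of Eq. (5) and the PSNR measure of Eq. (6), on which the tables above are based, can be sketched directly. This is an illustrative Python version (the paper's experiments used MATLAB), and the tiny 2x2 "frames" below are made-up values for demonstration only.

```python
import math

def mad(block_a, block_b):
    """Mean absolute difference between two equal-sized blocks, Eq. (5)."""
    m, n = len(block_a), len(block_a[0])
    total = sum(abs(block_a[i][j] - block_b[i][j])
                for i in range(m) for j in range(n))
    return total / (m * n)

def psnr(original, compensated):
    """Peak signal-to-noise ratio for 8-bit frames, Eq. (6)."""
    m, n = len(original), len(original[0])
    mse = sum((original[i][j] - compensated[i][j]) ** 2
              for i in range(m) for j in range(n)) / (m * n)
    return float("inf") if mse == 0 else 10 * math.log10(255 ** 2 / mse)

# Made-up 2x2 blocks: MAD = (1 + 0 + 1 + 2) / 4 = 1.0
cur = [[10, 12], [14, 16]]
ref = [[11, 12], [13, 18]]
```

In a block-matching search, mad(cur, candidate) would be evaluated over the candidate positions visited by the search pattern, and psnr would be computed between the original and motion-compensated frames.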

The PHGSC saves from 6.06 to 93.36 percent of the computational time, with a PSNR gain of 0.1726 to 1.7441 dB over existing methods. The results show a promising improvement in accuracy while drastically reducing the computational time. The code developed is generalized in nature and proves to be a useful tool in motion estimation.

REFERENCES
[1] Liu and K.M. Passino, "Biomimicry of Social Foraging Bacteria for Distributed Optimization: Models, Principles, and Emergent Behaviors", Journal of Optimization Theory and Applications, vol. 115, no. 3, December 2002, pp. 603-628.
[2] Jianhua Lu and Ming L. Liou, "A Simple and Efficient Search Algorithm for Block Matching Motion Estimation", IEEE Trans. Circuits and Systems for Video Technology, vol. 7, no. 2, April 1997, pp. 429-433.
[3] Renxiang Li, Bing Zeng, and Ming L. Liou, "A New Three-Step Search Algorithm for Block Motion Estimation", IEEE Trans. Circuits and Systems for Video Technology, vol. 4, no. 4, August 1994, pp. 438-442.
[4] Lai-Man Po and Wing-Chung Ma, "A Novel Four-Step Search Algorithm for Fast Block Motion Estimation", IEEE Trans. Circuits and Systems for Video Technology, vol. 6, no. 3, June 1996, pp. 313-317.
[5] Shan Zhu and Kai-Kuang Ma, "A New Diamond Search Algorithm for Fast Block-Matching Motion Estimation", IEEE Trans. Image Processing, vol. 9, no. 2, February 2000, pp. 287-290.
[6] Yao Nie and Kai-Kuang Ma, "Adaptive Rood Pattern Search for Fast Block-Matching Motion Estimation", IEEE Trans. Image Processing, vol. 11, no. 12, December 2002, pp. 1442-1448.
[7] Chun-Ho Cheung and Lai-Man Po, "A Novel Cross-Diamond Search Algorithm for Fast Block Motion Estimation", IEEE Trans. Circuits and Systems for Video Technology, vol. 12, no. 12, December 2002, pp. 1168-1177.
[8] C.W. Lam, L.M. Po and C.H. Cheung, "A New Cross-Diamond Search Algorithm for Fast Block Matching Motion Estimation", 2003 IEEE International Conference on Neural Networks and Signal Processing, Nanjing, China, December 2003, pp. 1262-1265.
[9] Humaira Nisar and Tae-Sun Choi, "An Adaptive Block Motion Estimation Algorithm Based on Spatio-Temporal Correlation", Digest of Technical Papers, International Conference on Consumer Electronics, Jan 7-11, 2006, pp. 393-394.
[10] Viet-Anh Nguyen and Yap-Peng Tan, "Efficient Block-Matching Motion Estimation Based on Integral Frame Attributes", IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 2, March 2006, pp. 375-385.
[11] Jiancong Luo, Ishfaq Ahmad, Yongfang Liang and Viswanathan Swaminathan, "Motion Estimation for Content Adaptive Video Compression", IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 7, July 2008, pp. 900-909.
[12] Chun-Man Mak, Chi-Keung Fong and Wai-Kuen Cham, "Fast Motion Estimation for H.264/AVC in Walsh-Hadamard Domain", IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 6, June 2008, pp. 735-745.
[13] Shen Li, Weipu Xu, Nanning Zheng and Hui Wang, "A Novel Fast Motion Estimation Method Based on Genetic Algorithm", Acta Electronica Sinica, vol. 28, no. 6, June 2000, pp. 114-117.
[14] K.M. Bakwad, S.S. Pattnaik, B.S. Sohi, Swapna Devi and M.R. Lohakare, "Parallel Bacterial Foraging Optimization for Video Compression", International Journal of Recent Trends in Engineering (Computer Science), vol. 1, no. 1, June 2009, pp. 118-122.
[15] M. Senthil Arumugam, M.V.C. Rao and Aarthi Chandramohan, "A New and Improved Version of Particle Swarm Optimization Algorithm with Global-Local Best Parameters", Knowledge and Information Systems (KAIS), Springer, vol. 16, no. 3, 2008, pp. 324-350.
[16] K.M. Bakwad, S.S. Pattnaik, B.S. Sohi, Swapna Devi, Ch. Vidya Sagar, P.K. Patra and Sastry V.R.S. Gollapudi, "Small Population Based Modified Parallel Particle Swarm Optimization for Motion Estimation", IEEE 16th International Conference on Advanced Computing and Communications (ADCOM-08), Anna University, Chennai, India, 17 Dec 2008, pp. 367-373.

Texture Classification using Local Texture Patterns:
A Fuzzy Logic Approach
E.M. Srinivasan
Department of Electronics and Communication Engineering, Government Polytechnic College,
Nagercoil, Tamilnadu, India.
emsvasan@yahoo.com

A. Suruliandi
Department of CSE, M S University,
Tirunelveli, Tamilnadu, India.
suruliandi@yahoo.com

K. Ramar
Department of CSE, National Engineering College,
Kovilpatti, Tamilnadu, India.
kramar_nec@rediffmail.com

Abstract— Texture analysis plays a vital role in image processing. The prospects of texture-based image analysis depend on the texture features and the texture model. This paper presents a new texture model, 'Fuzzy Local Texture Patterns (FLTP)', together with the 'Fuzzy Pattern Spectrum (FPS)'. The local image texture is described by FLTP, and the global image texture is described by FPS, which is the occurrence frequency of FLTP over the entire image. The efficiency of the proposed texture model is tested with texture classification. The results show that the proposed method provides very good and robust performance.

Keywords- Texture Analysis, Texture Classification, Local Texture Patterns, Fuzzy Local Texture Patterns, Fuzzy Pattern Spectrum.

I. INTRODUCTION

Numerous texture modeling techniques have been developed by many researchers. Each method is superior in discriminating particular texture characteristics, but there is no single texture modeling technique suited to all texture images. A comparative study of various texture analysis methods can be found in [5, 9].

A. Motivation and Justification for the Proposed Approach

Barcelo et al. [1] proposed a texture characterization approach, 'Fuzzy Texture Spectrum (FTS)', which is based on the texture model 'Texture Spectrum (TS)' introduced by He and Wang [3, 4, 8]. In the FTS approach, fuzzy logic and fuzzy techniques are incorporated into the TS texture model with due consideration of the uncertainties introduced by noise and by different capture and digitization processes. In this representation scheme, the spectrum requires a total of 6561 bins. For real textures, the FTS method gives a better representation. Moreover, the FTS method provides superior discrimination between textured regions and homogeneous regions.

Recently, Suruliandi and Ramar [7] proposed a new texture modeling approach, 'Local Texture Patterns (LTP)'. They describe the local image texture by LTP and the global image texture by the 'Pattern Spectrum (PS)', which is the occurrence frequency of LTP over the whole image. In the LTP model, the pattern associated with a local texture region of size 3 x 3 is classified as uniform or non-uniform based on the gray-level differences between the central pixel and its neighbors, together with a uniformity measure computed by a specific rotation scheme. In this approach, the total number of patterns, and hence the number of bins in the histogram, is only 46. The LTP operator is computationally simple and robust against gray-scale and rotational variations.

As observed for the FTU model, if fuzzy techniques are used, there is a significant improvement in texture characterization. However, the number of bins required for the FTS model is 6561. This large number of bins brings out the local texture information in more detail, but at the same time, as the number of bins increases, the computational complexity of texture analysis also increases. In the case of the LTP model, only 46 bins are required, and hence the model is computationally efficient for texture analysis. Thus, it is realised that fuzzy techniques as used in the FTU model may be combined with the LTP model for a progressive approach. Hence, in this paper, it is proposed to introduce a new texture model that incorporates the advantages of both methods.

B. Outline of the Proposed Work

In this paper, a new texture analysis operator, 'Fuzzy Local Texture Patterns (FLTP)', is proposed. The local image texture is described by FLTP, and the global image texture is described by the 'Fuzzy Pattern Spectrum (FPS)', which is the occurrence frequency of FLTP over the entire image. The performance of the proposed approach is demonstrated with texture classification.

C. Organization of the Paper

This paper is organized as follows. Section II describes the LTP texture model. Section III describes the proposed FLTP texture model. Section IV presents experiments on texture classification of Brodatz [2] images and a comparative analysis of various texture models based on classification performance. Section V concludes the work.

II. LOCAL TEXTURE PATTERNS (LTP) AND PATTERN SPECTRUM (PS)

A. Local Image Texture Description by LTP

Let gc be the central pixel value and g1, g2, …, g8 be its neighbor pixel values in a 3 x 3 local region. Let the 'Pattern Unit' P between gc and its neighbors gi (i = 1, 2, …, 8) be defined as

               0 if gi < (gc − ∆g)
   P(gi, gc) = 1 if (gc − ∆g) ≤ gi ≤ (gc + ∆g)     i = 1, 2, …, 8     (1)
               9 if gi > (gc + ∆g)

where ∆g is a positive value that represents the gray-value tolerance and has its importance in forming the patterns. P can be assigned one of three distinct values: 0, 1 and 9. There are eight P values for a local region, and a Pattern Units matrix is filled with these values. The method of calculating the P values in a 3 x 3 local region and the formation of the Pattern Units matrix and the Pattern String are shown in Figure 1.

   128 115 118        1 0 0
   122 125 140        1 · 9
   135 130 133        9 9 9
      (a)               (b)

   0 0 9 9 9 9 1 1
         (c)

Figure 1. (a) 3 x 3 local region (b) Pattern Units matrix for ∆g = 4 (c) 'Pattern String'

A 'Uniformity' measure U, which corresponds to the number of circular spatial transitions in the 'Pattern String', is defined as

   U = s(P(g8, gc), P(g1, gc)) + Σ_{i=2..8} s(P(gi, gc), P(gi−1, gc))     (2)

where

   s(X, Y) = 1 if |X − Y| > 0; 0 otherwise     (3)

Patterns with a U value of at most 3 are treated as uniform patterns and the others as non-uniform patterns. The LTP operator for describing a local texture is defined as

   LTP = Σ_{i=1..8} P(gi, gc) if U ≤ 3; 73 otherwise     (4)

There are 46 LTP in total. As there are a few holes in the LTP numbering scheme, a lookup table is used to relabel them into continuous numbers from 1 to 46.

B. Global Image Texture Description by PS

The occurrence frequency of all the LTP is the PS, with the abscissa indicating the LTP and the ordinate representing its occurrence frequency. The global image texture is described with the help of the PS. The spectrum uses the LTP defined earlier as the measure to describe the global texture.

III. PROPOSED TEXTURE MODEL

In this section, the FLTP and FPS texture modeling approach is proposed. The proposed method borrows some of the basic principles of the LTP method and the FTS method.

A. Fuzzy Local Texture Patterns – FLTP

It is noted from Figure 1(b) that the Pattern Units matrix is filled with unique P values (0, 1 or 9). The P values simply represent the relationship between the central pixel and its neighbors within a small 3 x 3 pixel image region. To represent the same in a more flexible way, each cell of the Pattern Units matrix can be assigned three membership values. Without loss of generality of FTU and LTP, the membership values are directly associated with the degree to which the neighbor pixel is smaller than (0), equal to (1) or greater than (9) the centre pixel.

In a 3 x 3 local image region, let gc be the value of the central pixel and gi (i = 1, 2, …, 8) be the values of its neighbor pixels. With the assumption ∆g = 0 in (1), let the difference between gc and gi be xi (i = 1, 2, …, 8). Let µ0(xi), µ1(xi) and µ9(xi) be the membership degrees for the values 0, 1 and 9 of xi respectively. The 'Fuzzy Pattern Unit (FP)' value between gc and its neighbors gi (i = 1, 2, …, 8) is defined as

   FP(gc, gi) = (µ0(xi)/0, µ1(xi)/1, µ9(xi)/9)     i = 1, 2, …, 8     (5)

If there is a local homogeneous region, the difference between gc and gi will be equal or almost equal to zero, so µ1(xi) will be high while µ0(xi) and µ9(xi) will be low. In a textured region, the difference between gc and gi increases, and therefore µ1(xi) decreases while µ0(xi) and µ9(xi) increase.

Based on the above considerations, three membership functions, arrived at from heuristic results, are proposed here, with parameters {a, b} (a > b) determining the boundary coordinates of xi. The membership functions are given below:

              1                       if xi ≤ −a
   µ0(xi) =   −(xi + b)/(a − b)       if −a < xi < −b     (6)
              0                       otherwise

              0                       if |xi| ≥ a
   µ1(xi) =   −(|xi| − a)/(a − b)     if b < |xi| < a     (7)
              1                       otherwise

              1              if xi ≥ a
   µ9(xi) =   µ0(−xi)        if b < xi < a     (8)
              0              otherwise

The degrees to which the pixel gi is negative (smaller), zero (similar), or positive (larger) with regard to the central pixel gc are µ0(xi), µ1(xi) and µ9(xi) respectively. Hence, the FP associated with the central pixel is given by

   FP = {(µ0(x1)/0, µ1(x1)/1, µ9(x1)/9), …, (µ0(x8)/0, µ1(x8)/1, µ9(x8)/9)}     (9)

The local region can be represented as a Fuzzy Pattern Units matrix, as shown in Figure 2. The entries in the matrix are FP values calculated using (6), (7), (8) and (9).

   µ0(x1)/0, µ1(x1)/1, µ9(x1)/9   µ0(x2)/0, µ1(x2)/1, µ9(x2)/9   µ0(x3)/0, µ1(x3)/1, µ9(x3)/9
   µ0(x8)/0, µ1(x8)/1, µ9(x8)/9                                  µ0(x4)/0, µ1(x4)/1, µ9(x4)/9
   µ0(x7)/0, µ1(x7)/1, µ9(x7)/9   µ0(x6)/0, µ1(x6)/1, µ9(x6)/9   µ0(x5)/0, µ1(x5)/1, µ9(x5)/9

Figure 2. Fuzzy Pattern Units matrix

From the matrix elements, the FLTP is calculated by the following procedure. Each matrix element contains three P values (0, 1 or 9) and the corresponding membership values. Using these values, a set of 'Pattern Strings (S)' is constructed:

   Sk(psi) = Pi^u                       for one non-zero membership, µu(xi) = 1
   Sk(psi) = Pi^u, Sk+1(psi) = Pi^v     for two non-zero memberships, 0 < µu(xi), µv(xi) < 1     (10)

where psi (i = 1, 2, …, 8) is the i-th element of S, and Pi^v means the P value of the i-th element having membership v. If the i-th matrix element contains a membership degree equal to '1', the i-th element of the string is filled with the corresponding P value. For other non-zero membership values, there will be two strings filled with the corresponding P values at the i-th position of the strings.

We use a new mLTP operator with minor modifications of (4). Here, mLTP is defined by

   mLTP = Σ_{i=1..8} psi     (11)

When the membership degree values are '1' in all the matrix elements, there will be only one S and one mLTP. If there are 'n' elements in the matrix having two non-zero membership values, the total number of S and mLTP is 2^n. The degree of each mLTP is obtained by multiplying the eight corresponding membership degrees:

   µ(mLTP) = Π_{i=1..8} µ_psi(xi)     (12)

So, when this 3 × 3 local region is considered, the central pixel has an associated FLTP, which is defined by

   FLTP = Σ_{k=1..K} mLTPk · µ(mLTP)k     (13)

where K is the total number of S or mLTP.

B. Fuzzy Pattern Spectrum – FPS

Using the procedure outlined in the previous section, the FLTP are calculated. Each FLTP is identified as uniform or non-uniform using (2): if U ≤ 3, the pattern is assumed to be uniform, and otherwise non-uniform. In some natural textures, non-uniform patterns also help in describing the texture characteristics. In the proposed approach, it is decided to have 73 uniform patterns (0 to 72) and another 73 non-uniform patterns (73 to 146). Therefore, there are a total of 146 bins in the occurrence histogram of the FPS.

IV. EXPERIMENTS AND RESULTS

A. Textures used in the Experiments

The textures used in the experiments are taken from the publicly available Brodatz benchmark database. They are shown in Figure 3. The textures are Beach sand, Stone, Sand, Grass and Water. Normally, these are the images encountered in remotely sensed image analysis.

Figure 3. Brodatz images (a) Beach sand (b) Stone (c) Sand (d) Grass (e) Water.

B. Texture Similarity

Similarity between different textures is evaluated by comparing their histograms. The histograms are compared as a test of goodness-of-fit using a nonparametric statistic, the log-likelihood ratio, also known as the G-statistic [6]. The G-statistic compares the bins of two histograms and is defined as

   G = 2 [ Σ_{s,m} Σ_{i=1..n} fi log fi − Σ_{s,m} (Σ_{i=1..n} fi) log(Σ_{i=1..n} fi)
           − Σ_{i=1..n} (Σ_{s,m} fi) log(Σ_{s,m} fi) + (Σ_{s,m} Σ_{i=1..n} fi) log(Σ_{s,m} Σ_{i=1..n} fi) ]     (14)

where s is the histogram of the first image, m is the histogram of the second image, n is the total number of bins in the histogram, and fi is the frequency at bin i.

C. Classification of Brodatz Images using the Proposed Texture Model

An experiment on image classification was conducted to prove the efficiency of the proposed FLTP model. For this study, the Brodatz texture images shown in Figure 3, of size
512 x 512, were taken. Each individual texture image was considered as a sample, and there were 5 samples in total. Test images were extracted from the source images, keeping each pixel of the 512 x 512 image as the center of a sample; thus 262144 test samples were extracted, irrespective of the sample size. Each test sample was compared against the model samples using (14), and the test sample was classified into the category of the model sample giving the minimum G value. The test was carried out for test samples of size W equal to 15, 30 and 45. Table 1 shows the results.

TABLE 1. CLASSIFICATION OF BRODATZ IMAGES USING THE PROPOSED FLTP METHOD

Texture    | W  | Classified Samples (Total: 262144 Samples)             | % of Accuracy
           |    | Beach sand | Stone  | Sand   | Grass  | Water  |
Beach sand | 15 | 259905     | 832    | 489    | 918    | 0      | 99.15
           | 30 | 262038     | 0      | 106    | 0      | 0      | 99.96
           | 45 | 262144     | 0      | 0      | 0      | 0      | 100
Stone      | 15 | 3111       | 249738 | 0      | 9295   | 0      | 95.27
           | 30 | 157        | 260096 | 0      | 1891   | 0      | 99.22
           | 45 | 0          | 261756 | 0      | 388    | 0      | 99.85
Sand       | 15 | 31         | 211    | 260635 | 1196   | 71     | 99.42
           | 30 | 0          | 0      | 262144 | 0      | 0      | 100
           | 45 | 0          | 0      | 262144 | 0      | 0      | 100
Grass      | 15 | 121        | 3504   | 1806   | 256713 | 0      | 97.93
           | 30 | 0          | 0      | 0      | 262144 | 0      | 100
           | 45 | 0          | 0      | 0      | 262144 | 0      | 100
Water      | 15 | 0          | 0      | 10476  | 1709   | 249959 | 95.35
           | 30 | 0          | 0      | 3958   | 0      | 258186 | 98.49
           | 45 | 0          | 0      | 343    | 0      | 261801 | 99.87

D. Quantitative Analysis of Various Texture Models using Classification Accuracy

The performance of the FTS model and the LTP model was compared with that of the proposed FLTP model. The result of the comparison is tabulated in Table 2.

Using the FTS model, 99.91 percent classification accuracy was obtained. This is due to the fact that the FTS model has very good discriminating power.

The LTP model yields an accuracy of 99.57 percent. The strength of this model is that it is robust against gray-scale and rotational variations, which is an important criterion for real-time applications.

The FLTP method performs better, with a classification accuracy of 99.69 percent. This is due to the fact that the number of local patterns identified is 146, which is sufficiently large to characterize the local spatial patterns.

TABLE 2. CLASSIFICATION ACCURACY OF VARIOUS TEXTURE MODELS

Tex. Model | Classification Accuracy (%)
           | Beach sand | Stone | Sand  | Grass | Water | Avg
FTU        | 100        | 99.88 | 100   | 100   | 99.67 | 99.91
FLTP       | 100        | 98.92 | 100   | 100   | 99.52 | 99.69
LTP        | 99.90      | 99.12 | 99.71 | 99.12 | 100   | 99.57

V. DISCUSSION AND CONCLUSION

In this paper, a new texture characterization technique based on FLTP and FTS is presented. Local patterns are identified by the FLTP method, and these patterns are used to form the FPS, which characterizes the global texture feature of the given texture image. The classification results in Table 1 show a high classification accuracy of more than 99% for Brodatz images. It is observed from Table 2 that the classification accuracy is above 99% for the FLTP model, which compares well with the other models. From the results it is inferred that the proposed model has very good discriminatory power; hence, in future, it is planned to use the FLTP model for texture analysis tasks such as texture segmentation and texture-based edge detection.

REFERENCES
[1] A. Barcelo, E. Montseny and P. Sobrevilla, "Fuzzy Texture Unit and Fuzzy Texture Spectrum for Texture Characterization", Fuzzy Sets and Systems, 158, 239-252 (2007).
[2] P. Brodatz, Textures – A Photographic Album for Artists and Designers, Reinhold, New York (1968).
[3] D.C. He and L. Wang, "Texture Unit, Texture Spectrum and Texture Analysis", IEEE Trans. on Geoscience and Remote Sensing, 28(4), 509-512 (1990).
[4] D.C. He and L. Wang, "Unsupervised Textural Classification of Images Using the Texture Spectrum", Pattern Recognition, 25(3), 247-255 (1992).
[5] T. Ojala, M. Pietikäinen and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Feature Distributions", Pattern Recognition, 29(1), 51-59 (1996).
[6] R.R. Sokal and F.J. Rohlf, Introduction to Biostatistics, 2nd ed., W.H. Freeman (1987).
[7] A. Suruliandi and K. Ramar, "Local Texture Patterns – A Univariate Texture Model for Classification of Images", Proceedings of the 2008 16th International Conference on Advanced Computing and Communications (ADCOM 08), 32-39 (2008).
[8] L. Wang and D.C. He, "Texture Classification using Texture Spectrum", Pattern Recognition, 23, 905-910 (1990).
[9] J. Zhang and T. Tan, "Brief Review of Invariant Texture Analysis Methods", Pattern Recognition, 35, 735-747 (2002).
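For readers who wish to experiment with the crisp LTP baseline compared above, the following Python sketch (ours, not the authors' code) implements the pattern unit of Eq. (1), the uniformity measure of Eq. (2), the LTP label of Eq. (4) and the G-statistic of Eq. (14). The clockwise-from-top-left neighbor ordering is our assumption, and the fuzzy FLTP extension and the 46-label lookup table are omitted.

```python
import math

def pattern_units(region, delta_g):
    """P values (Eq. 1) for the 8 neighbors of a 3x3 region,
    taken clockwise from the top-left, against the central pixel."""
    gc = region[1][1]
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    out = []
    for r, c in order:
        gi = region[r][c]
        if gi < gc - delta_g:
            out.append(0)
        elif gi > gc + delta_g:
            out.append(9)
        else:
            out.append(1)
    return out

def uniformity(p):
    """Number of circular transitions in the pattern string (Eq. 2)."""
    return sum(1 for i in range(8) if p[i] != p[i - 1])

def ltp(region, delta_g=4):
    """Crisp LTP label (Eq. 4): sum of P values if uniform, else 73."""
    p = pattern_units(region, delta_g)
    return sum(p) if uniformity(p) <= 3 else 73

def g_statistic(s, m):
    """Log-likelihood ratio between two histograms (Eq. 14)."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0
    term1 = sum(xlogx(f) for h in (s, m) for f in h)
    term2 = sum(xlogx(sum(h)) for h in (s, m))
    term3 = sum(xlogx(s[i] + m[i]) for i in range(len(s)))
    term4 = xlogx(sum(s) + sum(m))
    return 2 * (term1 - term2 - term3 + term4)

# Figure 1's example region, with delta_g = 4 and central pixel 125.
region = [[128, 115, 118], [122, 125, 140], [135, 130, 133]]
```

For this region the pattern units are [1, 0, 0, 9, 9, 9, 9, 1] (a rotation of Figure 1(c)'s string), the uniformity is 3 (uniform), and the G-statistic is zero for identical histograms and positive otherwise, as expected of a goodness-of-fit measure.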

Integer Sequence based Discrete Gaussian and
Reconfigurable Random Number Generator
Arulalan Rajan, H S Jamadagni
Centre for Electronics Design and Technology,
Indian Institute of Science, India
(mrarul,hsjam)@cedt.iisc.ernet.in

Ashok Rao
Dept of E & C, CIT,
Gubbi, Tumkur, India
ashokrao.mys@gmail.com

Abstract - A simple random number generation technique based on integer sequences is proposed in this paper. Random integers with a Gaussian distribution were generated and compared with the numbers generated using the proposed technique. The mean square error between the estimated probability density function and the obtained one is negligible, of the order of 10^-6. Using the proposed technique, one can generate anywhere between 16,000 and 80,000 random integers between 1 and 100 with a Gaussian distribution. Depending on the required range, the number of random numbers generated can be varied. The technique lends itself to a very simple hardware implementation that is dynamically reconfigurable on-the-fly to generate random variables with different distributions.

Keywords- Integer Sequences, Discrete Gaussian, Random Number, Reconfigurable hardware

I. INTRODUCTION

Random sequence generators have become inevitable in almost all fields, ranging from communication to finance. Random sequences have some probability distribution function (PDF) associated with them. The most frequently used ones are the uniform, Gaussian, Poisson and Binomial distributions [1]. Of these, uniformly distributed random numbers are generated using linear feedback shift registers [2]. Similarly, Gaussian distributed random numbers are common in digital communication [3]. There are many techniques available for generating random numbers with a Gaussian (bell-shaped) distribution. These generators have, however, focused on generating random numbers in the interval (0, 1). Not much emphasis has been laid on generating a discrete analogue to a continuous Gaussian distribution. In this paper, we propose a new technique to generate a discrete analogue to Gaussian distributed random numbers. We also present a simpler hardware implementation of such Gaussian random number generators, and we propose a dynamically reconfigurable hardware random number generator based on integer sequences.

The paper is organized as follows: In section 2, we give an overview of the existing techniques for generating Gaussian random numbers and their hardware implementations. We propose the new technique based on integer sequences in section 3. In section 4, we discuss some of the results of the proposed technique with regard to the generation and statistical characteristics of the random numbers. We conclude in section 5.

II. OVERVIEW OF EXISTING TECHNIQUES FOR GAUSSIAN RANDOM NUMBER GENERATION

One of the most commonly used non-uniform, continuous distributions is the Gaussian distribution. A number of Gaussian random number generators have been described in the literature. Most of these involve the transformation of uniform random numbers [4]. In this section, we present a quick overview of these techniques. We also present a typical discrete analogue to Gaussian distributed random numbers. The cumulative distribution function (CDF) inversion technique [1], the Box-Muller transform method [5] and its many hardware implementations [6], the rectangle-wedge-tail method by Marsaglia [7], and several other algorithms and implementations [8] for generating Gaussian distributed random numbers have been reported in the literature.

With digital signal processing techniques requiring discrete random numbers, one needs to look at different strategies for generating discrete analogues of continuous random numbers. A simple and straightforward technique is to sample the continuous Gaussian, yielding the sampled Gaussian kernel. The disadvantage of this method is that the discrete function does not have the discrete analogues of the properties of the continuous function.

A second approach is to make use of a discrete Gaussian kernel [11] defined by

   T(n, t) = e^(−t) I_n(t)     (1)

where I_n(t) is the modified Bessel function of integer order. The complexity of generating the Bessel function in hardware is very high, and hence it is not best suited for hardware implementation.

Having given the overview of the existing techniques for generating Gaussian random numbers, we now proceed to discuss our technique for obtaining Gaussian random numbers.

III. INTEGER SEQUENCE BASED GAUSSIAN RANDOM NUMBER GENERATION

An integer sequence, as the name implies, is a sequence of integers generated using difference equations or polynomial functions. In our work, we consider a few of the integer sequences listed in the Online Encyclopedia of Integer Sequences (OEIS) [12], generated using some kind of recursive relations. Table 1 gives the list of integer sequences used for generating random numbers. Fig. 1 shows the plot of some of these sequences.
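The discrete Gaussian kernel T(n, t) = e^(−t) I_n(t) of Eq. (1) can be evaluated with a pure-Python series expansion of the modified Bessel function. This sketch is ours, meant only to illustrate that the kernel behaves as a discrete probability mass summing to one; the 60-term series truncation is an arbitrary choice adequate for small t.

```python
import math

def bessel_i(n, t, terms=60):
    """Modified Bessel function of integer order n, via its power series:
    I_n(t) = sum_k (t/2)^(2k+n) / (k! (k+n)!)."""
    n = abs(n)
    return sum((t / 2.0) ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
               for k in range(terms))

def discrete_gaussian(n, t):
    """T(n, t) = exp(-t) * I_n(t), the discrete Gaussian kernel of Eq. (1)."""
    return math.exp(-t) * bessel_i(n, t)

# The kernel is symmetric in n and sums to 1 over all integers,
# mirroring a zero-mean Gaussian whose variance grows with t.
t = 2.0
mass = sum(discrete_gaussian(n, t) for n in range(-20, 21))
```

This also illustrates the hardware complexity argument made above: even in software the kernel needs factorials and exponentials, which is why the paper turns to integer-sequence convolution instead.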

These sequences, of certain lengths (determined by the range in which the random numbers are needed), are pairwise convolved and their convolution plots studied. The envelope of each of these convolution plots turns out to be similar to a Gaussian. Fig. 2 shows the convolution of sequences, in which HCS denotes the Hofstadter-Conway sequence and GS denotes the Golomb sequence. The idea of generating a Gaussian distribution directly follows from the plot and is discussed in detail as follows. The indices of the sequence resulting from the convolution of two sequences are taken to be the set of random numbers that one can generate. The index typically runs from 1 to L+M−1. We take these numbers, n, from 1 to L+M−1 as the values that a random variable X can take. The probability that X = n is given by the value of the convolution at n. We explain this in detail with an example. We take S1 to be a Golomb sequence of length 51, and S2 also to be a Golomb sequence of length 50. We convolve the two sequences, and the sum of the convolution over the entire length of the result,

   Total = Σ_{n=1..L+M−1} S3(n), where S3 = S1 * S2     (2)

gives the total number of random numbers that can be generated between 1 and 100. The probability that the random variable X takes an integer value n between 1 and 100 is given by S3(n), the sequence resulting from the convolution, i.e.,

   P(X = n) = S3(n) / Total     (3)

With the technique for generating Gaussian random numbers having been described in detail, we now proceed to discuss the hardware implementation of a Gaussian random number generator.

A. Hardware Implementation of the Random Number Generator

A sequence like the Fibonacci sequence [11], described by the recurrence

   a(n) = a(n − 1) + a(n − 2), a(1) = 0, a(2) = 1     (4)

is easy to generate in hardware, as it involves only a simple recursion. However, this is not the case with most of the other sequences, which involve more than one recursion. To generate these sequences, new and simple strategies were developed. Let us look at the following sequences and discuss in detail the strategies developed for generating them:

   a(n) = 1 + a(n − a(a(n − 1))), a(1) = 1; a(2) = 2;     (5)

Without deviating much from the conventional scheme, we propose to use variance as the input. One can make the lengths of the sequences, or the sequences themselves, depend on the given variance.

On obtaining the lengths of the sequences, we can use a generator as simple as the one proposed in [14] to generate the sequence. The sequence elements are obtained for half the length, and symmetry is forced on the sequence for the other half of the sequence length. Once the sequences are generated, they can be convolved using either a single multiply-accumulate (MAC) unit or multiple MAC units. The result of this convolution is taken as the probability density function for the random variable X.

The higher-level block diagram of the random number generator shown in figure 4 is then used to generate the random numbers with a Gaussian distribution. An approach similar to the one used for generating the sequence can be used here also. We obtain a pattern from the convolution and store it in the pattern information memory (PIM). The addresses of this memory are precisely the values that the random numbers can take. The addresses are generated using the LFSR, with the initial seed being capable of taking the LFSR through all the states. Once all the states have been obtained, the seed of the LFSR is changed. The LFSR is used to provide the randomness in the generator's output. The decrement-and-compare unit (DCU) decrements the content of the location pointed to by the LFSR. The compare logic checks whether the contents of the memory location pointed to by the LFSR are zero; in that case, the address is incremented by 1 so that the next element can be output. Since the LFSR value has to remain the same until the random value is output and the contents of the memory are decremented, the LFSR can be made to operate at half the clock rate of the memory. The control engine shown is used to decide the sequence and the length of the sequence, based on the variance input and the distribution type.

A simple modification to figure 4 results in a dynamically reconfigurable random number generator. Here the pattern information memory is a segmented one, holding the pattern information of multiple convolution results. Higher-order address bits can be made to identify the distribution type, and the lower-order address bits can be used as the values that the random variable can take. Based on the variance and the
a(n) = a(a(n − 1)) + a(n − a(n − 1)); a(1) = a(2) = 1; - (6) distributions, the sequences and their lengths can be obtained.
Equations (5) and (6) are used to generate Golomb sequence Depending on the distribution, one can use the integer
[12] and Hofstadter Conway [13] sequence respectively. The sequences directly or convolve them or perform any other
elements generated from (5) are as follows: operations, followed by writing into the pattern information
1,2,2,3,3,4,4,4,5,5,5,6,6,6,6,7,7,7,7, 8 …. memory. An alternate approach to make the hardware a
As seen from (6) and (7), the direct hardware implementation reconfigurable one is by having another memory segment
of the generating function is not that simple. An alternate where the pattern can be stored. The contents of this memory
approach to generate these sequences, based on the inherent could be written into the pattern information memory shown
pattern was proposed in [14]. in figure 4, on the fly, so that any distribution can be obtained
Having looked at the individual sequence generator, we using the same hardware as in figure 4.
now proceed to describe the architecture used to generate the
random numbers.
Conventional random number generators take mean and
variance as the inputs and then generate random values. Not
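The construction of equations (2) and (3) can be checked with a short script. This is a plain-Python sketch of the MATLAB experiment described later in Section IV, not the hardware path; note the sequences are used here as generated, without the forced-symmetry step applied in the hardware:

```python
import random

def golomb(length):
    """Golomb's self-describing sequence, eq. (5):
    a(n) = 1 + a(n - a(a(n - 1))), a(1) = 1, a(2) = 2."""
    a = [0, 1, 2]  # index 0 is a placeholder so indexing is 1-based
    for n in range(3, length + 1):
        a.append(1 + a[n - a[a[n - 1]]])
    return a[1:length + 1]

def convolve(s1, s2):
    """Linear convolution S3 = S1 * S2, of length L + M - 1, as in eq. (2)."""
    out = [0] * (len(s1) + len(s2) - 1)
    for i, x in enumerate(s1):
        for j, y in enumerate(s2):
            out[i + j] += x * y
    return out

def pdf_from_convolution(s1, s2):
    """P(X = n) = S3(n) / Total, as in eq. (3); n ranges over 1..L+M-1."""
    s3 = convolve(s1, s2)
    total = sum(s3)
    return [v / total for v in s3]

# Golomb sequences of lengths 51 and 50 give a distribution on 1..100.
pdf = pdf_from_convolution(golomb(51), golomb(50))
samples = random.choices(range(1, len(pdf) + 1), weights=pdf, k=10000)
```

Here `random.choices` plays the role of the index-drawing step: each integer n between 1 and L+M−1 is emitted with probability S3(n)/Total.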

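A software model can illustrate the playback path of figure 4 (PIM, LFSR and DCU). The LFSR width and tap positions below are illustrative choices for a maximal-length register, not the ones used in the paper:

```python
def lfsr_addresses(seed, taps=(6, 5), width=7):
    """Fibonacci LFSR used as a pseudo-random address generator.
    Taps (6, 5) with width 7 give a maximal-length cycle over all
    nonzero 7-bit states (an illustrative choice)."""
    state = seed
    mask = (1 << width) - 1
    while True:
        yield state
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        state = ((state << 1) | fb) & mask

def play_out(counts, seed=1):
    """Model of the PIM + DCU playback: memory location n holds S3(n);
    each visit decrements it (the DCU), and exhausted locations are
    skipped by incrementing the address by 1 (the compare logic)."""
    mem = dict(enumerate(counts, start=1))
    remaining = sum(counts)
    out = []
    addr_gen = lfsr_addresses(seed)
    while remaining:
        addr = next(addr_gen)
        # fold the LFSR state into the valid address range 1..len(counts)
        n = 1 + (addr - 1) % len(counts)
        while mem[n] == 0:            # compare logic: location exhausted
            n = 1 + n % len(counts)   # increment address by 1 (with wrap)
        mem[n] -= 1                   # decrement-and-compare unit
        out.append(n)
        remaining -= 1
    return out
```

Over a full pass the generator emits each value n exactly S3(n) times, so the empirical distribution matches the stored pattern while the LFSR scrambles the order.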
IV. RESULTS

Sequences listed in Table 1 were considered for random number generation. Convolution was performed with the following combinations of the sequences:
• A sequence S1 convolved with itself
• A sequence S1 convolved with another sequence S2
Without loss of generality, the lengths of the sequences were taken to be the same, in order to study the relation between length and variance. The numbers were generated in MATLAB using the proposed technique and compared with random numbers generated using the same mean and variance as those of the proposed technique. The mean squared error is also plotted.

As mentioned in section 3, figure 1 and figure 2 are the plots of the sequences for various lengths and of their convolution, respectively. Figure 3 shows the histogram plot of the convolution of the Golomb sequence with itself. The CDF plot is shown in figure 5. The variation of length (assuming that the sequences convolved have the same length, equal to L) with standard deviation and variance is shown in figure 6 and figure 7 respectively. We find that for a Gaussian distribution, the length of the sequence and the standard deviation have a linear relation, while there is a square relation between the length and the variance. We thus find that the length of the sequences can be made dependent on the standard deviation and hence on the variance.

In the usual Gaussian random number generators, within a given interval, the variance σ² can vary depending on the requirement, and hence the profile or shape of the Gaussian PDF varies. The analogous situation here is that the range stays the same but, for a different variance, the lengths of the two sequences are changed. To achieve this, the lengths of the two sequences can be changed from 50 each to, say, 61 and 40, or any other combination of lengths such that the sequence resulting from the convolution has a length of 100. This is illustrated in figure 8.

Figure 9 compares the estimated PDF with the obtained PDF. Figure 10 gives the mean square error plot of the comparison. We find that the mean square error is of the order of 10⁻⁶. This shows that the technique proposed in this paper is efficient for generating random integers following a Gaussian distribution.

V. CONCLUSION

The technique of using integer sequences for random number generation has been proposed in this paper. It has been shown that convolution of a certain family of integer sequences can be used to generate a discrete Gaussian random variable. The variance of the Gaussian random variable is made to influence the choice of the integer sequences and the length of the sequences to be convolved. A simple architecture to generate Gaussian random numbers is also presented, and it has been illustrated that a slight modification to this architecture yields a reconfigurable random number generator supporting different distributions. The mean square error plot shows that the error is very small. In future, we propose to explore the use of integer sequences in generating extreme-valued probability distribution functions.

VI. REFERENCES
[1] D. E. Knuth, "Seminumerical Algorithms - The Art of Computer Programming", Vol. 2, 3rd ed., Addison-Wesley, USA, 1998.
[2] S. W. Golomb, "Shift Register Sequences", Aegean Park Press, 1981.
[3] Xilinx, "Additive White Gaussian Noise (AWGN) Core", CoreGen documentation file, 2002.
[4] D. Thomas, P. Leong, J. Villasenor, "Gaussian Random Number Generators", ACM Computing Surveys, Vol. 39, No. 4, Article 11, Oct 2007.
[5] G. E. P. Box and M. E. Muller, "A note on the generation of random normal deviates", The Annals of Math. Statistics, vol. 29, 1958, pp. 610-611.
[6] A. Alimohammad, S. F. Fard, B. F. Cockburn, C. Schlegel, "A Compact and Accurate Gaussian Variate Generator", IEEE Trans. on VLSI Systems, vol. 16, no. 5, 2008, pp. 517-527.
[7] G. Marsaglia, T. A. Bray, "A convenient method for generating normal variables", SIAM Review, vol. 6, no. 3, 1964, pp. 260-264.
[8] G. Zhang, P. H. W. Leong, D. Lee, J. D. Villasenor, R. C. C. Cheung, and W. Luk, "Ziggurat-based hardware Gaussian random number generator", in IEEE Intl. Conference on Field Programmable Logic and its Applications, 2005, pp. 275-280.
[9] T. Lindeberg, "Scale-space for discrete signals", IEEE Trans. PAMI, vol. 12, no. 3, March 1990, pp. 234-254.
[10] N. J. A. Sloane, "The On-Line Encyclopedia of Integer Sequences", www.research.att.com/~njas/sequences/
[11] www.research.att.com/~njas/sequences/A000045
[12] www.research.att.com/~njas/sequences/A001462
[13] www.research.att.com/~njas/sequences/A004001
[14] A. Rajan, H. S. Jamadagni, A. Rao, "Integer Sequence Window based Reconfigurable FIR Filters", Proc. of the First IEEE Intl. Workshop on Reconfigurable Computing, Dec 2008, India, http://ewh.ieee.org/conf/hprcw/rcw08.html

Table 1. List of Integer Sequences from [10]
Sequence | Generating Function                      | Initial Values
A001462  | a[n] = 1 + a[a[n-1]] ... a[n - a[a[n-1]]]: a[n] = 1 + a[n - a[a[n-1]]] | a[1] = 1, a[2] = 2
A004001  | a[n] = a[a[n-1]] + a[n - a[n-1]]         | a[1] = a[2] = 1
A113886  | a[n] = a[a[n-2]] + a[n - a[n-1]]         | a[1] = a[2] = 1
A005229  | a[n] = a[a[n-2]] + a[n - a[n-2]]         | a[1] = a[2] = 1
A098378  | a[n] = a[a[a[n-1]]] + a[n - a[a[n-1]]]   | a[1] = a[2] = 1
A006158  | a[n] = a[a[n-3]] + a[n - a[n-3]]         | a[1] = a[2] = 1
A006161  | a[n] = a[a[n-1] - 1] + a[n + 1 - a[n-1]] | a[1] = a[2] = 1
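The nested recurrences in Table 1 can be generated directly in software, even though (as noted in Section III) they are awkward for direct hardware implementation. A minimal sketch for two of the listed entries, A004001 (the Hofstadter-Conway sequence, eq. (6)) and A005229:

```python
def hofstadter_conway(length):
    """OEIS A004001, eq. (6): a(n) = a(a(n-1)) + a(n - a(n-1)),
    with a(1) = a(2) = 1 (1-based indexing)."""
    a = [0, 1, 1]  # index 0 unused
    for n in range(3, length + 1):
        a.append(a[a[n - 1]] + a[n - a[n - 1]])
    return a[1:length + 1]

def a005229(length):
    """OEIS A005229: a(n) = a(a(n-2)) + a(n - a(n-2)),
    with a(1) = a(2) = 1."""
    a = [0, 1, 1]  # index 0 unused
    for n in range(3, length + 1):
        a.append(a[a[n - 2]] + a[n - a[n - 2]])
    return a[1:length + 1]
```

Both functions memoize implicitly by building the table bottom-up, which sidesteps the double recursion that makes these sequences hard to compute term-by-term.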

Figure 1. Integer Sequences with forced symmetry
Figure 2. Convolution of Sequences
Figure 3. Histogram plot for random variable obtained from Golomb sequence convolution
Figure 4. Random Number Generator
Figure 5. Estimated CDF for random variable obtained from Golomb sequence convolution
Figure 6. Standard Deviation vs Length plot for Golomb sequence convolution based distribution
Figure 7. Variance vs Sequence Length
Figure 8. Profile variation with change in lengths for the same range between 1 and 100, but different variance
Figure 9. PDF plots comparison
Figure 10. MSE plot
Parallelization of PageRank and HITS
Algorithm on CUDA Architecture
Kumar Ishan, Mohit Gupta, Naresh Kumar, Ankush Mittal
Department of Electronics & Computer Engineering,
Indian Institute of Technology, Roorkee, India.
{kicomuec, mickyuec, naresuec, ankumfec}@iitr.ernet.in

Abstract

Efficiency of any search engine mostly depends on how efficiently and precisely it can determine the importance and popularity of a web document. The PageRank algorithm and the HITS algorithm are widely known approaches to determining the importance and popularity of web pages. Due to the large number of documents available on the World Wide Web, a huge amount of computation is required to determine the rank of web pages, making it very time consuming. Researchers have devoted much attention to parallelizing PageRank on PC clusters, grids, and multi-core processors like the Cell Broadband Engine to overcome this issue, but with little or no success. In this paper, we discuss the issues in porting these algorithms to the Compute Unified Device Architecture (CUDA) and introduce efficient parallel implementations of these algorithms on CUDA by exploiting the block structure of the web, which not only cut down the computation time but also significantly reduce the cost of the hardware required.

1. INTRODUCTION

The unceasing growth of the World Wide Web has led to a lot of research in the page ranking algorithms used by search engines to provide the most relevant results to the user for any particular query. The dynamic and diverse nature of the web graph further exaggerates the challenges in achieving optimum results. Web link analysis provides a way to order web pages by studying the link structure of web graphs. PageRank and HITS (Hyperlink-Induced Topic Search) are two such popular algorithms, used by current search engines in the same or modified form to rank documents based on their link structure. PageRank, originally introduced by Brin and Page [1], is based on the fact that a web page is more important if many other web pages link to it. At its core, it continuously iterates over the web graph until the rank assigned to all of the pages converges to a stable value. In contrast to PageRank, the similar HITS algorithm, developed by Kleinberg [2], ranks documents on the basis of two scores which it assigns to a particular set of documents dependent on a specific query, although the basis for computation is the same for both. This paper addresses issues related to the parallel implementation of these algorithms and proposes an innovative way of exploiting the block structure of the web existing at a much lower level. Our approach to the parallel implementation of these algorithms on NVIDIA's multi-core CUDA architecture not only reduces the computation time but also requires much cheaper hardware.

2. PAGERANK

2.1. Algorithm

Let Rank(p) denote the rank of web page p from the set of all web pages P. Let Sp be the set of all web pages that point to page p, and Nu be the outdegree of page u ∈ Sp. Then the "importance" given by a page u to the page p due to its link is measured as Rank(u)/Nu, so the total "importance" given to a page p is the sum of the "importance" due to all incoming links to p. This is computed iteratively n times for the page ranks to converge. The iteration is as follows:

∀p ∈ P, Rank_i(p) = (1 − d) + d · Σ_{u ∈ Sp} Rank_{i−1}(u) / Nu    (1)

where d is the "damping factor" from the random surfer model [1]. We use 0.85 as the value of d in this paper, as given in [1]. The use of d ensures the convergence of the PageRank algorithm [5].

2.2. Related Works

Since PageRank involves a huge amount of computation, many researchers have attempted their own approaches towards its parallel implementation. Haveliwala et al. [6] exploit the block structure of the web for computing PageRank. Rungsawang and Manaskasemsak used a PC cluster to compute PageRank [3], dividing the input graph into equal sets and computing them on the cluster nodes. Rungsawang and Manaskasemsak also implemented a partition-based parallel PageRank algorithm [4] on a PC cluster. Another efficient parallel implementation of PageRank on a PC cluster, by Kohlschutter et al. [5], achieves a gain of 10 times by using the block structure of web pages and reformulating the algorithm by combining the Jacobi and Gauss-Seidel methods. The implementation [9] on the multi-core 8-SPU Cell BE has shown that the PageRank algorithm runs 22 times more slowly.

2.3. CUDA Architecture

CUDA™, introduced by NVIDIA, is a general purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU. These GPUs are used as coprocessors to assist the CPU in computationally intensive tasks.
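Before turning to the CUDA-specific porting issues, the serial computation being ported — the iteration of equation (1) — can be sketched in plain Python. This is a minimal reference model, not the CUDA kernel; the adjacency-map representation of the web graph is an illustrative assumption:

```python
def pagerank(links, n_iter=52, d=0.85):
    """Power iteration for eq. (1):
    Rank_i(p) = (1 - d) + d * sum_{u in S_p} Rank_{i-1}(u) / N_u.
    `links` maps each page to the list of pages it links to;
    S_p and N_u are derived from this adjacency."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    out_degree = {p: len(links.get(p, [])) for p in pages}
    rank = {p: 1.0 for p in pages}
    for _ in range(n_iter):
        incoming = {p: 0.0 for p in pages}
        for u, targets in links.items():
            for p in targets:
                incoming[p] += rank[u] / out_degree[u]
        rank = {p: (1 - d) + d * incoming[p] for p in pages}
    return rank
```

The two inner loops — accumulating Rank(u)/Nu over incoming links, then applying the damping — are exactly the per-node work that the paper distributes between kernel and host.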

More details about this architecture can be found in [11]. Here, we highlight the features of the CUDA architecture that need special mention in relation to our work:
1. SIMT Architecture
2. Asynchronous Concurrent Execution
3. Warps
4. Memory Coalescing

2.4. Porting Issues of the Parallel Implementation on the CUDA Architecture

Porting issues with the PageRank algorithm are mainly concerned with the hardware restrictions of the CUDA architecture. Some important issues are as follows:

1) Non-coalesced memory access: CUDA has some constraints related to memory accesses. The protocol followed by the CUDA architecture for memory transactions ensures that all the threads referencing memory in the same segment are serviced simultaneously. Therefore, bandwidth is used most efficiently only if the simultaneous memory accesses by the threads in a half-warp belong to one segment. Due to the uneven and random nature of the indegrees of the nodes, the memory references sometimes become non-coalesced, hindering the simultaneous service of memory transactions and wasting memory bandwidth.

Solution: Nodes generally link within their locality, with few links to farther nodes. To improve the rank calculation of a node, say p, we process on the kernel only those nodes which belong to the locality of p, determined by lower and upper limits; the rest of the nodes are processed on the host processor. So we create two link-structured input files: one to be processed by the kernel, containing the nodes lying in the locality, and the other containing the rest of the nodes, to be processed on the host processor.

2) Divergence in control flow: CUDA demands that the execution paths of all threads in a warp be similar for the threads to execute in parallel; divergent execution paths of threads in a warp cause CUDA to suspend their parallel execution and execute them sequentially (they become serialized), decreasing throughput. As the indegrees of nodes can be very dissimilar, the loop involved in the calculation, which iterates over the number of indegrees, can make a thread's control flow diverge from that of the other threads.

Solution: To solve this problem, we tried to exploit the block structure [8] of the web. A careful study reveals that block structure exists even at a smaller level. So we divide all the nodes into blocks and calculate the average for each block separately. Then the rest of the nodes are added to the link-structured input file for the host. When blocks are scheduled on the device's multiprocessors, all threads in a warp follow a similar execution flow to a greater extent. The number of calculations on the host can be further decreased if we use some constant multiple (the average factor) of the average value. The constant for peak performance differs for different input graphs, depending on the distribution of the indegrees among the nodes.

2.5. Results and Observations

We used four different parameters in the experiments: block size, average factor, and the lower and upper limits for the range of locality. Increasing the range limits beyond a few thousand either decreases performance or leaves it unchanged, and the block sizes giving a reasonable increase in performance are 32 and 64: for smaller block sizes the number of threads executing in parallel is too few, while with larger block sizes threads become more divergent. As the average values for a smaller block size are very high, increasing the average factor decreases performance, while with larger blocks increasing the average factor increases performance. Using suitable parameters based on the above discussion, we achieved some promising results.

3. HITS

3.1. Algorithm

The HITS algorithm also ranks web pages on the basis of the link structure of the web. For this purpose, it assigns two scores to a web page, namely an Authority Score and a Hub Score. A higher Authority Score means that the given web page is linked to by many documents with high Hub Scores, and a higher Hub Score means that the given document points to many documents with high Authority Scores. In contrast to PageRank, this algorithm is query dependent, and both scores are assigned at run time depending on the query.

For a given query, a relatively small set of relevant documents, the Root Set (R), is retrieved from the web, generally on the basis of the occurrences of the words of the query (Q) or TFIDF. Then from the Root Set, a Complete Set (C) is formed by including all the documents which either point to at least one document in the Root Set or are pointed to by at least one document in the Root Set. Finally, scores are assigned to all the documents in the Complete Set in a number of iterations. In [7], it is shown that the scores generally converge in 5-10 iterations.

The Authority Value Ai is the sum of the Hub Scores Hj of the documents pointing towards document i, and the Hub Value Hi is defined as the sum of the Authority Scores Aj of the documents it points to. Since the Root Set and Complete Set, and therefore the Authority Score and Hub Score, are calculated at the time of the query, both these scores are query specific.
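The score iteration described above can be sketched in plain Python. This is a minimal serial reference, assuming the Complete Set is given as an adjacency map; the L2 normalization per round is a standard choice not spelled out in the text:

```python
def hits(links, n_iter=10):
    """Iterative Authority/Hub scoring: authority(p) sums the hub
    scores of pages linking to p; hub(p) sums the authority scores
    of the pages p links to; both are normalized each round."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(n_iter):
        auth = {p: sum(hub[u] for u, ts in links.items() if p in ts)
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub
```

The default of 10 rounds reflects the 5-10 iterations to convergence reported in [7]; the two summations are the per-score work that the paper splits between host and device.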
This makes the algorithm quite slow and unfeasible for use in real-life situations. Since the third step is the most time consuming, we present our algorithm to make this part run faster on the CUDA architecture.

3.2. Parallel Implementation

Since the computations involved in this algorithm are similar to PageRank, it follows the same model of implementation as discussed in section 2.4. As CUDA allows asynchronous concurrent execution, control returns to the host before the device completes its task, leading to parallel execution of code between the host (CPU) and the device (GPU). So, in order to minimize the computation time, the task of calculating each score is divided between host and device such that both take approximately the same time to compute their part of the job.

3.3. Results of the Parallel Implementation of HITS

Since the computations in the HITS algorithm are similar to PageRank, following the same model of implementation on a set of 300,000 nodes generated using WebGraph [12], we achieved a significant gain on the CUDA architecture as compared to the CPU.

4. CONCLUSION

In our paper, we demonstrate how to parallelize graph-based algorithms like PageRank and HITS on the CUDA architecture to achieve high performance with much cheaper hardware. Further, if the nodes of PC clusters include CUDA-enabled devices, we can achieve still better performance; for example, for the speedup of 10 on clusters achieved by [3], we can increase the performance gain up to 40 times with a marginal increase in cost.

Since the HITS algorithm calculates its scores at query time, its optimization will not only lead to quicker results; more accurate results can also be achieved by increasing the size of the Complete Set. Our approach can also be extended to parallelize graph-based algorithms utilizing sparse graphs. LSI is another information retrieval method; it uses Singular Value Decomposition to identify patterns between terms and concepts contained in an unstructured collection of text. It also requires relatively high computational performance compared to other IR methods, which is proving to be the main bottleneck to its use.

REFERENCES
[1] S. Brin and L. Page, "The Anatomy of a Large Scale Hypertextual Web Search Engine", Computer Networks and ISDN Systems, Volume 30, Issue 1-7, April 1998.
[2] J. M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM (JACM), Volume 46, Issue 5, September 1999.
[3] A. Rungsawang and B. Manaskasemsak, "PageRank Computation Using PC Cluster", Proceedings of the 10th European PVM/MPI User's Group Meeting, Venice, Italy, 29th Sep - 2nd Oct 2003.
[4] A. Rungsawang and B. Manaskasemsak, "Partition-Based Parallel PageRank Algorithm", Proceedings of the Third International Conference on Information Technology and Applications (ICITA'05), Sydney, 4th - 7th July 2005.
[5] C. Kohlschutter, P. Chirita, and W. Nejdl, "Efficient Parallel Computation of PageRank", Proceedings of the 28th European Conference on Information Retrieval (ECIR), London, United Kingdom, 2006.
[6] S. Kamvar, T. H. Haveliwala, C. D. Manning, G. H. Golub, "Exploiting the Block Structure of the Web for Computing PageRank", Technical Report CSSM-03-02, Computer Science Department, Stanford University, 2003.
[7] Y. G. Saffar, K. S. Esmaili, M. Ghodsi, and H. Abolhassani, "Parallel Online Ranking of Web Pages", The 4th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA-06), UAE, March 2006, pp. 104-109.
[8] A. Arasu, J. Novak, A. Tomkins, and J. Tomlin, "PageRank Computation and the Structure of the Web: Experiments and Algorithms", in Proceedings of the 11th World Wide Web Conference, poster track, Honolulu, Hawaii, 7-11 May 2002.
[9] G. Buehrer, S. Parthasarathy, and M. Goyder, "Data mining on the cell broadband engine", Proceedings of ICS'08, Cairo, Egypt, 20-24 October 2008.
[10] S. Nomura, S. Oyama, T. Hayamizu, and T. Ishida, "Analysis and Improvement of HITS Algorithm for Detecting Web Communities".
[11] NVIDIA CUDA Programming Guide 2.2, NVIDIA Corporation.
[12] WebGraph Laboratory, http://webgraph.dsi.unimi.it/, 2006.

Designing Application Specific Irregular Topology for Network-on-Chip

Naveen Choudhary
Department of Computer Science & Engineering
College of Engineering and Technology, Udaipur, India
naveenc121@yahoo.com

M. S. Gaur, V. Laxmi
Department of Computer Engineering
Malaviya National Institute of Technology, Jaipur, India
{gaurms|vlaxmi}@mnit.ac.in

Virendra Singh
SERC, Indian Institute of Science, Bangalore, India
viren@serc.iisc.ernet.in

Abstract—Network-on-chip (NoC) has been proposed as a solution for the communication challenges of System-on-chip (SoC) design in nano-scale technologies. Application specific SoC design offers the opportunity to incorporate custom NoC architectures that are more suitable for a particular application and may not conform to regular topologies. In this work we propose to generate a custom NoC that maximizes performance under given resource constraints. This being an NP-hard problem, we present a heuristic technique based on a genetic algorithm for the synthesis of custom NoC architectures along with the requisite routing tables, with the objective of improving the communication load distribution.

Keywords—NP-hard; Network-on-Chip; Optimization; Performance; Cores

I. INTRODUCTION

The Network-on-Chip [1, 2, 6] has been proposed as a promising communication architecture for modern SoC platforms with their increasing number of processor cores. Several early works favored the use of standard topologies such as meshes, tori, k-ary n-cubes or fat trees, under the assumption that the wires can be well structured in such topologies. However, most SoCs are heterogeneous, with each core having different size, functionality and communication requirements. Thus, standard topologies can have a structure that poorly matches the application traffic. This leads to large wiring complexity after floorplanning, as well as significant power and area overhead. Since the traffic characteristics of an application specific SoC can be well characterized at design time [7], it is expected that networks with an irregular topology tailored to the application requirements will have an edge over networks with a regular topology. A key problem in NoC is to ensure that no deadlock situation can block the whole network through its routing algorithm. There are, however, deadlock-free topology-agnostic routing algorithms such as up*/down* [3], L-turn [4], and down/up [5]. These algorithms have in common that they are based on turn prohibition, a methodology which avoids deadlock by prohibiting a subset of all turns in the network. In this paper, a genetic algorithm based heuristic is proposed for the design of customized irregular Networks-on-Chip. The presented methodology uses the predefined application communication characteristics to generate an optimized network topology along with the corresponding routing tables. The irregular NoC communication model is defined in Section II. The proposed genetic algorithm based methodology is presented in Section III. Section IV summarizes some experimental results, followed by a brief conclusion in Section V.

II. IRREGULAR NOC COMMUNICATION MODEL

Task graphs are generally used to model the behavior of complex SoC applications at an abstract level. The tasks T are mapped to a set of IP cores (Intellectual Property cores) V which communicate through unidirectional point-to-point abstract channels. In this paper the task-to-core mapping is assumed to be already done. Definition 1 and Definition 2 define the core graph and the NoC topology graph respectively.

Definition 1: The core graph is a directed graph G(V, E), with each vertex νi ∈ V representing an IP core and the directed edge (νi, νj), denoted ei,j ∈ E, representing the communication between the cores νi and νj. The weight of the edge ei,j, denoted bwi,j, represents the desired bandwidth of the communication from νi to νj.

Definition 2: The NoC topology graph is a directed graph N(U, F), with each vertex υi ∈ U representing a node/tile in the topology and the directed edge (υi, υj), denoted fi,j ∈ F, representing a direct physical communication link/channel between the vertices υi and υj. The weight of the edge fi,j, denoted Abwi,j, represents the available link/channel bandwidth across the edge fi,j.

III. IRREGULAR NOC TOPOLOGY GENERATION METHODOLOGY

As shown in Fig. 1, floorplanning using a methodology like B*-Trees [12], with the objective of minimizing area, can be done as the first step. The irregular topology construction starts by creating a Breadth First Search spanning tree based on the Manhattan distance among the IP cores. The permitted node degree at this stage (nd_treemax), i.e. the number of allowed ports per IP core, is kept less than the actual permitted node degree (ndmax). The initial tree topology is strongly connected, and thus provides a path between every pair of nodes; this property is retained throughout the topology generation process.

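As a concrete illustration of Definitions 1 and 2, the two weighted graphs and a per-channel bandwidth check can be written down directly. All names and bandwidth figures below are hypothetical, and the overflow function is one plausible reading of the "bandwidth requirement overflow per channel" used later in the fitness measure:

```python
# Core graph (Definition 1): edge (vi, vj) -> bw_ij, desired bandwidth.
core_graph = {("v0", "v1"): 100, ("v1", "v2"): 40}

# Topology graph (Definition 2): edge (ui, uj) -> Abw_ij, available
# link/channel bandwidth.
topology_graph = {("u0", "u1"): 128, ("u1", "u2"): 64}

def channel_overflow(assigned_load, available_bw):
    """Per-channel bandwidth overflow: the amount by which the traffic
    mapped onto a channel exceeds its available bandwidth, or 0 when
    the channel can carry its assigned load."""
    return {ch: max(0, assigned_load.get(ch, 0) - abw)
            for ch, abw in available_bw.items()}

# A traffic mapping that overloads channel (u0, u1) by 12 units.
load = {("u0", "u1"): 140, ("u1", "u2"): 40}
overflow = channel_overflow(load, topology_graph)
```

Keeping every entry of `overflow` at zero is exactly the load-distribution goal the topology generation heuristic optimizes towards.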
Based on the constructed minimum spanning tree and using Dijkstra's shortest path algorithm, the routing table entries for the routers of the NoC are generated for each edge in the core graph. At this stage the traffic load is assigned to these tree paths according to the bandwidth requirements in the core graph: the basic tree path for (υi, υj) in the NoC topology graph is assigned the traffic load bwi,j, and similarly the edges of the path (υi, υj) are assigned a traffic load equal to the sum of their previously assigned traffic load and bwi,j. In the next phase of the methodology a genetic algorithm based heuristic is used for the design of the customized irregular NoC. The proposed genetic algorithm explores the search space to generate an irregular topology with an optimized bandwidth load distribution and improved energy requirements.

A. Solution Representation

Each chromosome is represented by an array of genes, with the maximum size of the gene array equal to the number of edges in the core graph. Each gene contains the information regarding the various possible paths in the NoC topology graph between its <source(υi), destination(υj)> pair. A gene is only permitted to have a maximum of n (a configurable parameter) paths; of these n paths, at least one is the shortest path through edges exclusively of the minimum spanning tree, and the rest of the paths are generated by adding shortcuts.

B. Mutation Operators

The following three mutation operations are used to bring variety into the population.

1) Topology-Extension-Mutation: A random number of genes are picked from the selected chromosomes and their paths are checked for the traffic loads assigned to them. If any of the edges/channels of such a path are heavily loaded, a suitable shortcut channel is inserted into the topology. The added shortcut is constrained by the maximum permitted channel length emax, due to physical signaling delay, which prevents the algorithm from inserting wires that span long distances across the chip. Similarly, a shortcut is not added between IP cores if it would exceed the given maximum permitted node degree ndmax of either its source or target core. This constraint prevents the algorithm from instantiating slow routers with a large number of I/O channels, which would decrease the achievable clock frequency due to the internal routing and scheduling delay of the router. A new deadlock-free path including the added shortcut channel is formed using Dijkstra's shortest path algorithm in combination with the routing rules of up*/down* routing. The excess load of the selected path is transferred to the channels of the new path if this does not lead to overloading of the new path's channels; otherwise the shortcut is rejected.

2) Topology-Reduction-Mutation: This mutation tries to remove channels from the topology which are very lightly loaded. The load of the path to be removed is transferred to an existing path of the gene having minimum load on its channels, such that the overall load distribution improves.

Figure 1. Network construction flow using genetic algorithm

3) Energy-Reduction-Mutation: This mutation is done on randomly selected chromosomes, with a bias towards the best class of the population in each generation. In this mutation each path of every gene of the chromosome is traversed and we try to find a shorter replacement path by adding a suitable shortcut.

C. Crossover Operator

To achieve crossover, two chromosomes and a random crossover point are selected, and the genes of these chromosomes are mixed over the crossover point to produce two new chromosomes. A new chromosome is accepted only if it leads to an improvement in the cost.

D. Measure of Fitness & Output

The fitness (cost function) measure essentially has two components: (1) the average bandwidth requirement overflow, and (2) the dynamic energy requirement [8, 10] of the traffic for the customized topology. Let X1 be the maximum chromosome energy requirement among all the chromosomes in the population, X2 the maximum possible bandwidth requirement of a channel of the NoC topology graph among all chromosomes in the population, Eci the energy requirement of chromosome ci, and Bci the average bandwidth requirement overflow per channel of the NoC topology graph of chromosome ci. The cost of chromosome ci can then be formulated as

Cost_i = α × (Eci / X1) + β × (Bci / X2)

where α and β are two empirically determined constants, fixed as 0.25 and 0.75 respectively. It may be noted that the best 10% of chromosomes at any generation are transferred directly to the next generation, so that the solution does not degrade between generations. The topology, routing tables and traffic load mapping for the paths of the best output chromosome are accepted as the inputs for the IrNIRGAM (simulator for Irregular topology based NoC Interconnect Routing and Application Modeling) NoC simulator.

IV. EXPERIMENTAL RESULTS

In order to obtain a broad range of different irregular traffic scenarios, we randomly generated multiple core graphs using TGFF [11] with diverse bandwidth requirements of the IP cores. For performance comparison the NoC simulator IrNIRGAM, supporting irregular topologies, is deployed. IrNIRGAM is an extended version of NIRGAM [13] supporting irregular NoCs with table-based routing.

298
extended version of NIRGAM [13] supporting irregular NoCs with table-based routing. For the performance comparison, IrNIRGAM was run for 10000 clock cycles, and network throughput in flits and average flit latency were used as the comparison parameters.

Figure 2. Average performance comparison of IrNoC with 2-D Mesh topology with X-Y and OE routing

The proposed genetic algorithm was run for 1000 generations with a population size of 200 to obtain the customized irregular topology. Mutations are applied to 15% of the population and crossover to 30% of the population in each generation. During optimization the maximum channel length was set to twice the length of a tile. Figure 2 summarizes the performance results averaged over 50 generated irregular topologies (IrNoC) with a permitted node/core degree of 6 and the number of cores varying between 16 and 81, against a 2-D mesh with the same number of cores using X-Y [9] and OE [9] routing; for IrNoC, table-based routing was used. Figure 2 shows that the optimized IrNoCs sustain higher throughput and lower transmission latency in all cases. IrNoC with a permitted node degree of 6 achieves 19.4% and 32% more throughput on average, with a decrease in average flit latency of 15.2 and 60.3 clock cycles, compared to the corresponding 2-D mesh with X-Y and OE routing respectively. Similar tests on IrNoC with a permitted node degree of 4 showed gains of (7.5%, 18.9%) in throughput and (12.4 clocks, 57.45 clocks) in latency respectively. Figure 3 shows the throughput and latency comparison of IrNoC (with a permitted node degree of 6) and the 2-D mesh with X-Y and OE routing under varying packet injection intervals in clock cycles.

Figure 3. Average performance comparison of IrNoC with 2-D Mesh topology with X-Y and OE routing with varying packet injection interval

V. CONCLUSION

A genetic algorithm based methodology was implemented to tailor the NoC topology to the requirements of the application captured in the core graph. In future work, to further analyze the effectiveness of the proposed methodology, we intend to compare it with other application-specific design methodologies proposed in the literature, using realistic benchmarks in addition to fine-grained energy estimates, to provide a multi-objective optimization framework.

REFERENCES
[1] W. J. Dally, B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," in Proceedings of the 38th Design Automation Conference (DAC), pp. 684-689, 2001.
[2] L. Benini, G. De Micheli, "Networks on Chips: A New SoC Paradigm," IEEE Computer, vol. 35, no. 1, pp. 70-78, January 2002.
[3] M. D. Schroeder et al., "Autonet: A High-Speed Self-Configuring Local Area Network Using Point-to-Point Links," IEEE Journal on Selected Areas in Communications, vol. 9, Oct. 1991.
[4] A. Jouraku, A. Funahashi, H. Amano, M. Koibuchi, "L-turn Routing: An Adaptive Routing in Irregular Networks," in International Conference on Parallel Processing, pp. 374-383, Sep. 2001.
[5] Y. M. Sun, C. H. Yang, Y. C. Chung, T. Y. Hang, "An Efficient Deadlock-Free Tree-Based Routing Algorithm for Irregular Wormhole-Routed Networks Based on Turn Model," in International Conference on Parallel Processing, vol. 1, pp. 343-352, Aug. 2004.
[6] U. Ogras, J. Hu, R. Marculescu, "Key Research Problems in NoC Design: A Holistic Perspective," in IEEE CODES+ISSS, pp. 69-74, 2005.
[7] W. H. Ho, T. M. Pinkston, "A Methodology for Designing Efficient On-Chip Interconnects on Well-Behaved Communication Patterns," in HPCA 2003, pp. 377-388, Feb. 2003.
[8] J. Hu, R. Marculescu, "Energy-Aware Mapping for Tile-based NoC Architectures Under Performance Constraints," in ASP-DAC 2003, Jan. 2003.
[9] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: An Engineering Approach, Elsevier, 2003.
[10] J. Hu, R. Marculescu, "Energy- and Performance-Aware Mapping for Regular NoC Architectures," IEEE Trans. on CAD of Integrated Circuits and Systems, 24(4), April 2005.
[11] R. P. Dick, D. L. Rhodes, W. Wolf, "TGFF: Task Graphs for Free," in Proc. Intl. Workshop on Hardware/Software Codesign, March 1998.
[12] Y. C. Chang, Y. W. Chang, G. M. Wu, S. W. Wu, "B*-Trees: A New Representation for Non-Slicing Floorplans," in Proc. 37th Design Automation Conference, pp. 458-463, 2000.
[13] L. Jain, B. M. Al-Hashimi, M. S. Gaur, V. Laxmi, A. Narayanan, "NIRGAM: A Simulator for NoC Interconnect Routing and Application Modelling," in DATE 2007, 2007.

QoS Aware Minimally Adaptive XY routing for
NoC
Navaneeth Rameshan∗ , Mushtaq Ahmed† , M.S.Gaur‡ , Vijay Laxmi§ and Anurag Biyani¶
∗†‡§ Computer Engineering, Malaviya National Institute of Technology Jaipur, India
∗ navaneeth.rameshan@gmail.com, † mushtaq@mnit.ac.in, ‡ gaurms@mnit.ac.in, § vlaxmi@mnit.ac.in
¶ Jaypee Institute of Information technology, Noida, India

anuragbiyani@gmail.com

Abstract—Network-on-Chip (NoC) has emerged as a solution to communication handling in System-on-Chip design. A major design consideration is a high-performance router of small size. To achieve this objective, the routing algorithm needs to be simple as well as congestion-aware. QoS is also emerging as one of the design objectives in NoC design. Recent work has shown that deterministic routing does not fare well when traffic in the network increases [6]. An ideal routing algorithm should take congestion awareness into account. In this paper, we propose a new Quality-of-Service (QoS) aware routing algorithm which is simple to implement and adapts partially (it considers only minimal paths) to traffic congestion in order to meet different QoS requirements such as Best Effort (BE) and Guaranteed Throughput (GT) [5]. Comparison of our algorithm with other routing algorithms, namely XY, Odd-Even (OE) and DyAd, suggests improved performance in terms of average delay and jitter.

I. INTRODUCTION

NoC needs to support Quality of Service (QoS), which becomes a main concern when a variety of applications share the network, as it becomes necessary to offer guarantees on performance. QoS refers to the capability of the network to provide communication services above certain minimum value(s) of one or more performance metrics such as dedicated bandwidth, control of jitter, and latency [2]. To manage the allocation of resources to communication streams more judiciously and efficiently, network traffic is often divided into a number of classes when QoS is at a premium. Different classes of packets have different QoS requirements and different levels of importance. In general, network traffic can be classified into two traffic classes: (1) Guaranteed Throughput (GT) and (2) Best Effort (BE) [5].

In this paper we propose a new method, Minimally Adaptive XY (MAXY) routing, for routing in NoC, and compare its performance with standard routing algorithms, viz. XY, Odd-Even [3] and DyAd, in the context of QoS in NoC. The open-source simulator NIRGAM [4] is used for the comparison with the other routing algorithms.

II. XY AND ODD-EVEN ROUTING

In NoC, one of the simplest routing algorithms is XY routing. In XY routing, the path is determined solely by the addresses of the source and destination nodes in a deterministic manner, i.e., the same path is chosen for a given pair of source and destination nodes irrespective of the traffic situation in the network. In XY routing, a packet is forwarded in the X direction until the destination and current columns become equal, and then the packet is routed in the Y direction until the destination node is reached. XY routing is non-adaptive, which leads to poor load balancing and a lack of adaptability to congestion.

Odd-Even routing [3] is a deadlock-free, partially adaptive routing in which turning is restricted to prevent deadlock. The routing path is governed by the following rule (columns numbered from 0): for any packet, the following turns are not allowed:
1. East-to-North or East-to-South at an even column.
2. North-to-West or South-to-West at an odd column.

III. MAXY: MINIMALLY ADAPTIVE XY ROUTING

We propose a variation of the XY routing algorithm which introduces a capability to adapt to traffic while still retaining the biggest advantage of XY routing: simplicity. We select the routing direction only from the (at most two) path-length-reducing directions at any stage, i.e., the algorithm is minimal, and hence free from livelock issues [1]. Despite being minimal, our algorithm is adaptive, and decisions among directions are made at crucial positions en route.

We illustrate the functioning of the proposed algorithm by example. Say a packet is to be routed from source node (Sx,Sy) to destination node (Dx,Dy). Unlike XY routing, in which the packet is always routed first in the X direction, here we route the packet with the aim of equalizing the absolute differences between the X and Y coordinates of the current and final nodes. So the first step is to route the packet in the direction which helps equalize these absolute differences. Once the absolute difference is equal for both directions, the buffer availability of the two feasible directions is taken into account, and the packet is routed to the one with maximum buffer availability. If both have the same buffer availability, a random selection is made, since this gives an equal load-distribution probability among directions with the same number of free output buffer channel(s). The same process (equalizing the coordinates, and using buffer availability when they are equal) is repeated till the packet reaches the destination.

For example, assume that in an 8x8 mesh we have to route a packet from (0,0) to (3,6). In this case, first we compute
the absolute differences between the current node (initially the source node) and the destination node. Here ∆x = |3−0| = 3 and ∆y = |6−0| = 6, so we route the packet in the direction which reduces |∆y|. Thus South is chosen initially and the current node becomes (0,1). Since still |∆y| > |∆x|, the packet is routed South again, making the current node (0,2). As |∆y| > |∆x| once more, South is chosen yet again and the current position becomes (0,3). At this stage |∆y| = |∆x|, and from here on the algorithm becomes non-deterministic, i.e., the routing path becomes a function of the buffer availabilities at the nodes. We now look at the buffer availability of the two favorable routing directions (South and West in this case). Say the West direction's output buffer has more channels free than the South direction's output buffer at (0,3); then the packet is routed West, making the current node (1,3). Now |∆y| > |∆x|, so the packet is routed South and the next node is (1,4). Since at (1,4) |∆y| = |∆x|, the same method is used to choose between South and West. This process is repeated till we reach the destination node (3,6). Thus one possible path for the packet is: (0,0) → (0,1) → (0,2) → (0,3) → (1,3) → (1,4) → (1,5) → (2,5) → (3,5) → (3,6). The complete algorithm is given in Algorithm 1.

Algorithm 1 Minimally Adaptive XY Algorithm
Require: Sx, Sy — ⟨x, y⟩ coordinates of source node; Cx, Cy — ⟨x, y⟩ coordinates of current node; Dx, Dy — ⟨x, y⟩ coordinates of destination node
Ensure: Route from ⟨Sx, Sy⟩ to ⟨Dx, Dy⟩
Initialization
1: Cx = Sx
2: Cy = Sy
3: while (true) do
4:   absX = |Dx − Cx|
5:   absY = |Dy − Cy|
6:   if ((Cx == Dx) and (Cy == Dy)) then
7:     return IP CORE
8:   end if
9:   if (Cx > Dx) then
10:    dirX = WEST
11:  else
12:    dirX = EAST
13:  end if
14:  if (Cy > Dy) then
15:    dirY = NORTH
16:  else
17:    dirY = SOUTH
18:  end if
19:  if (absX == absY) then
20:    if (buffer[dirX] > buffer[dirY]) then
21:      Route in dirX
22:    else if (buffer[dirX] < buffer[dirY]) then
23:      Route in dirY
24:    else
25:      Route in random(dirX, dirY)
26:    end if
27:  else if (absX > absY) then
28:    Route in dirX
29:  else
30:    Route in dirY
31:  end if
32:  Update ⟨Cx, Cy⟩
33: end while

The proposed routing algorithm is free from livelock as it is minimal in nature, and it requires only two virtual channels for deadlock-free routing. The virtual channel (VC) assignment depends on the relative position of the source S and destination D: if D is towards the East (West) of S, packets use the first (second) virtual channel along the Y direction, while for the X direction any virtual channel can be used. This approach breaks the cycles formed in the channel dependency graph, thereby preventing deadlock, as illustrated in Figure 1.
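The per-hop decision of Algorithm 1 can be sketched in Python. This is our reconstruction, not the authors' implementation; the `buffers` mapping (number of free output-buffer channels per direction at the current router) is a hypothetical input, and ties are broken with a seeded random choice.

```python
import random

_RNG = random.Random(0)  # seeded tie-breaker so runs are reproducible

def maxy_next_direction(cur, dst, buffers, rng=_RNG):
    """One MAXY routing decision at node `cur` for a packet headed to `dst`."""
    cx, cy = cur
    dx, dy = dst
    if (cx, cy) == (dx, dy):
        return "IP_CORE"                       # deliver to the local core
    abs_x, abs_y = abs(dx - cx), abs(dy - cy)
    dir_x = "WEST" if cx > dx else "EAST"      # X move that reduces the X offset
    dir_y = "NORTH" if cy > dy else "SOUTH"    # Y move that reduces the Y offset
    if abs_x != abs_y:                         # shrink the larger offset first
        return dir_x if abs_x > abs_y else dir_y
    # Offsets equal: prefer the direction with more free buffer channels,
    # falling back to a random pick on a tie (equal load distribution).
    if buffers[dir_x] != buffers[dir_y]:
        return dir_x if buffers[dir_x] > buffers[dir_y] else dir_y
    return rng.choice([dir_x, dir_y])

# South increases y, matching the paper's example where (0,0) -> South -> (0,1).
STEP = {"EAST": (1, 0), "WEST": (-1, 0), "SOUTH": (0, 1), "NORTH": (0, -1)}

def route(src, dst, buffers):
    """Walk hop by hop until the packet is delivered; returns nodes visited."""
    path, cur = [src], src
    while True:
        d = maxy_next_direction(cur, dst, buffers)
        if d == "IP_CORE":
            return path
        cur = (cur[0] + STEP[d][0], cur[1] + STEP[d][1])
        path.append(cur)
```

For the worked example above, any run takes exactly nine hops from (0,0) to (3,6), however the ties are broken, since every hop reduces the remaining Manhattan distance by one — which is also why the scheme is livelock-free.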

Fig. 1. Virtual channel assignment to prevent cycle formation and deadlock

IV. RESULTS

To validate the proposed algorithm, a number of simulations were carried out with the NoC simulator NIRGAM [4] in the context of QoS.

A. Experimental Setup

In this experimental setup a 5x5, two-dimensional mesh topology, as shown in Figure 2, is selected, in which links 2-7, 7-12, 12-17, 2-1, 1-6, 6-11, 11-16, 16-21 are shared by both traffic types for XY routing, whereas links 2-7, 7-12, 12-17, 17-22, 22-21, 6-11, 11-16, 16-17 are shared by both traffic types for odd-even routing. Wormhole switching is employed for both traffic classes. Simulations were run for 5000 clock cycles. Traffic load (as a fraction of capacity) is varied from 10% to 100% in steps of 10%. The network is evaluated on the basis of average latency and jitter. The graphs for varying bandwidth and load are given below:

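The XY link list quoted in the setup can be largely reproduced with a short sketch. The node numbering used here (0-indexed, row-major, node = 5·y + x) is our assumption — the paper does not state its numbering convention.

```python
def xy_route(src, dst, width=5):
    """Nodes visited by deterministic XY routing: X first, then Y."""
    x, y = src % width, src // width
    dx, dy = dst % width, dst // width
    path = [src]
    while x != dx:                       # route along X until columns match
        x += 1 if dx > x else -1
        path.append(width * y + x)
    while y != dy:                       # then along Y to the destination row
        y += 1 if dy > y else -1
        path.append(width * y + x)
    return path

def links(path):
    """Undirected links traversed by a path."""
    return {frozenset(edge) for edge in zip(path, path[1:])}

# GT flows 1->17 and 2->21, BE flows 3->21 and 6->17 (Figure 2).
gt_links = links(xy_route(1, 17)) | links(xy_route(2, 21))
be_links = links(xy_route(3, 21)) | links(xy_route(6, 17))
shared = gt_links & be_links             # links carrying both traffic classes
```

Under this assumed numbering the GT flow 2→21 traverses links 2-1, 1-6, 6-11, 11-16 and 16-21, matching the list above; the computed GT/BE intersection covers most, though not all, of the quoted XY link set.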
Fig. 2. A 5x5, two-dimensional mesh showing source-destination pairs for GT (1-17, 2-21) and BE (3-21, 6-17) traffic

B. Results and Analysis

To analyze the average latency and average jitter under varying loads in the context of QoS, the traffic scenario shown in Figure 2 was used. It can be noticed from Figure 3 that the average latency of BE traffic with the MAXY routing algorithm is comparable to, but slightly less than, OE when the bandwidth reserved for GT is 0.75. As expected, the average latency for XY routing is the highest, as it is deterministic in nature and only one virtual channel is available for a given ratio of GT to BE. The congestion awareness gained through partial adaptivity in OE and DyAd may come at the expense of a new path with no guarantee on the path length taken. The proposed algorithm MAXY, however, adapts to the congestion caused by the availability of only one VC and chooses an alternative minimal path, thus reducing the average latency of BE traffic.

Fig. 3. Average latency of BE traffic for XY, OE, DyAd and MAXY routing with bandwidth for GT = 0.75.

Figure 4 shows that the average jitter for GT traffic with MAXY is either comparable to or less than OE for different load values when the bandwidth assigned to GT is 0.50, i.e., the available bandwidth is reduced by 25%, which leads to congestion in the VC; MAXY adapts to this congestion better than deterministic routing such as XY.

Fig. 4. Average jitter of GT traffic for XY, OE and MAXY routing with bandwidth for GT = 0.50

V. CONCLUSION

From the observed experimental results, it can be concluded that the proposed QoS-aware MAXY routing algorithm shows improvement in terms of latency and jitter over other routing algorithms such as XY, OE and DyAd in the presence of congestion in the network. Under congestion, MAXY performs better as it remains partially congestion-aware; because of its minimal nature it consumes less energy and rules out any chance of livelock, while deadlock can be prevented using at least two virtual channels. The proposed methodology can thus be used to reduce the latency of BE traffic when most of the resources are available for GT, not only ensuring QoS but also aiding the improvement of the other differentiated traffic classes.

REFERENCES
[1] J. A. Boyan and M. L. Littman, "Packet routing in dynamically changing networks: A reinforcement learning approach," in Advances in Neural Information Processing Systems, pp. 671-678, 1994.
[2] E. Bolotin, I. Cidon, R. Ginosar and A. Kolodny, "QNoC: QoS architecture and design process for network on chip," Journal of Systems Architecture, Special Issue on Networks on Chip, 2003.
[3] G.-M. Chiu, "The odd-even turn model for adaptive routing," IEEE Transactions on Parallel and Distributed Systems, pp. 729-738, 2000.
[4] L. Jain, B. M. Al-Hashimi, M. S. Gaur and V. Laxmi, "NIRGAM: A simulator for NoC interconnect routing and application modeling," in Design, Automation and Test in Europe, April 2007.
[5] M. Millberg, E. Nilsson, R. Thid and A. Jantsch, "Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip," in Proceedings of Design, Automation and Test in Europe, 2004.
[6] T. T. Ye, L. Benini and G. De Micheli, "Packetization and routing analysis of on-chip multiprocessor networks," Journal of Systems Architecture, 50(2-3), 2004.
