
Yoel Tenne and Chi-Keong Goh (Eds.)
Computational Intelligence in Optimization
Adaptation, Learning, and Optimization, Volume 7
Series Editors-in-Chief

Meng-Hiot Lim
Nanyang Technological University, Singapore
E-mail: emhlim@ntu.edu.sg

Yew-Soon Ong
Nanyang Technological University, Singapore
E-mail: asysong@ntu.edu.sg

Further volumes of this series can be found on our homepage: springer.com

Vol. 1. Jingqiao Zhang and Arthur C. Sanderson
Adaptive Differential Evolution, 2009
ISBN 978-3-642-01526-7

Vol. 2. Yoel Tenne and Chi-Keong Goh (Eds.)
Computational Intelligence in Expensive Optimization Problems, 2010
ISBN 978-3-642-10700-9

Vol. 3. Ying-ping Chen (Ed.)
Exploitation of Linkage Learning in Evolutionary Algorithms, 2010
ISBN 978-3-642-12833-2

Vol. 4. Anyong Qing and Ching Kwang Lee
Differential Evolution in Electromagnetics, 2010
ISBN 978-3-642-12868-4

Vol. 5. Ruhul A. Sarker and Tapabrata Ray (Eds.)
Agent-Based Evolutionary Search, 2010
ISBN 978-3-642-13424-1

Vol. 6. John Seiffertt and Donald C. Wunsch
Unified Computational Intelligence for Complex Systems, 2010
ISBN 978-3-642-03179-3

Vol. 7. Yoel Tenne and Chi-Keong Goh (Eds.)
Computational Intelligence in Optimization, 2010
ISBN 978-3-642-12774-8
Yoel Tenne and Chi-Keong Goh (Eds.)

Computational Intelligence in Optimization
Applications and Implementations

Dr. Yoel Tenne
Department of Mechanical Engineering and Science, Faculty of Engineering,
Kyoto University, Yoshida-honmachi,
Sakyo-ku, Kyoto 606-8501, Japan
E-mail: yoel.tenne@ky3.ecs.kyoto-u.ac.jp
Formerly: School of Aerospace, Mechanical and Mechatronic Engineering,
Sydney University, NSW 2006, Australia

Dr. Chi-Keong Goh
Advanced Technology Centre,
Rolls-Royce Singapore Pte Ltd
50 Nanyang Avenue, Block N2,
Level B3C, Unit 05-08, Singapore 639798
E-mail: chi.keong.goh@rolls-royce.com

ISBN 978-3-642-12774-8 e-ISBN 978-3-642-12775-5

DOI 10.1007/978-3-642-12775-5

Adaptation, Learning, and Optimization ISSN 1867-4534

Library of Congress Control Number: 2010926028


© 2010 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilm or in any other
way, and storage in data banks. Duplication of this publication or parts thereof is
permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from
Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this
publication does not imply, even in the absence of a specific statement, that such
names are exempt from the relevant protective laws and regulations and therefore
free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper


To our families for their love and support.
Preface

Optimization is an integral part of science and engineering. Most real-world
applications involve complex optimization processes, which are difficult to
solve without advanced computational tools. With the increasing challenge
of fulfilling the optimization goals of current applications, there is a strong
drive to advance the development of efficient optimizers. The challenges
introduced by emerging problems include:
• objective functions which are prohibitively expensive to evaluate, so that
typically only a small number of objective function evaluations can be made
during the entire search,
• objective functions which are highly multimodal or discontinuous, and
• non-stationary problems which may change in time (dynamic).

Classical optimizers may perform poorly or may even fail to produce any
improvement over the starting vector in the face of such challenges. This
has motivated researchers to explore the use of computational intelligence
(CI) to augment classical methods in tackling such challenging problems.
Such methods include: a) population-based search methods such as evolutionary
algorithms and particle swarm optimization, and b) non-linear mapping and
knowledge-embedding approaches such as artificial neural networks and fuzzy
logic, to name a few. Such approaches have been shown to perform well in
challenging settings. Specifically, CI methods are powerful tools which offer
several potential benefits, such as: a) robustness (they impose few or no
requirements on the objective function); b) versatility (they handle highly
non-linear mappings); c) self-adaptation to improve performance; and
d) parallel operation (making it easy to decompose complex tasks). However,
the successful application of CI methods to real-world problems is not
straightforward and requires both expert knowledge and trial-and-error
experiments. As such, the goal of this volume is to survey a wide range of
studies where CI has been successfully applied to challenging real-world
optimization problems, while highlighting the
insights researchers have obtained. Broadly, the studies in this volume focus
on four main disciplines: continuous optimization, classification, scheduling
and hardware implementations.
For continuous optimization, Neto et al. study the use of artificial neural
networks (ANNs) and Heuristic Rules for solving large scale optimization
problems. They focus on a recurrent ANN to solve a quadratic program-
ming problem and propose several techniques to accelerate convergence of
the algorithm. Their method is more efficient than one that uses an ANN alone.
Starzyk et al. propose a direct-search optimization algorithm which uses re-
inforcement learning, resulting in an algorithm which ‘learns’ the best path
during the search. The algorithm weights past steps based on their success
to yield a new candidate search step. They benchmark their algorithm with
several mathematical test functions and apply it to the training of a multi-layer
perceptron neural network for image recognition. Ventresca et al. use the
opposition sampling approach to decrease the number of function evaluations.
The approach attempts to sample the function in a subspace generated by the
‘opposites’ of an existing population of candidates. They apply their method
to differential evolution and incremental learning and show that the opposi-
tion method improves performance over baseline variants. Bazan studies an
optimization algorithm for problems where the objective function requires
large computational resources. His proposed algorithm uses locally regular-
ized approximations of the objective function using radial basis functions. He
provides convergence proofs and formulates a framework which can be applied
to other algorithms, such as the Gauss-Seidel or Conjugate Directions methods.
Ruiz-Torrubiano et al. study hybrid methods for solving large scale opti-
mization problems with cardinality constraints, a class of problems arising in
diverse areas such as finance, machine learning and statistical data analysis.
While existing methods (such as branch-and-bound) can provide exact solutions,
they require large computational resources. As such, the study focuses on
methods which can efficiently identify approximate solutions but require far less
computer resources. For problems where it is expensive to evaluate the ob-
jective function, Jayadeva et al. propose using a support-vector machine to
predict the location of yet undiscovered optima. Their framework can be
applied to problems where little or no a priori information is available on
the objective function, as the algorithm ‘learns’ during the search process.
Benchmarks show their method can outperform existing methods such as par-
ticle swarm optimization or genetic algorithms. Voutchkov and Keane study
multi-objective optimization problems using surrogate models. They inves-
tigate how to efficiently update the surrogates under a small optimization
‘budget’ and compare different updating strategies. They also show that using
a number of surrogate models can improve the optimization search and
that the size of the ensemble should increase with the problem dimension.
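In its simplest form, such a surrogate update loop fits a cheap model to all true evaluations made so far, then spends each new expensive evaluation at the model's minimiser. The following is a minimal one-dimensional sketch with an illustrative quadratic model; all names are hypothetical, and the chapter itself uses richer response-surface models and update strategies:

```python
def fit_quadratic(X, y):
    """Least-squares fit of y ~ c0 + c1*x + c2*x^2 via the normal
    equations, solved with Cramer's rule (fine for a 3x3 system)."""
    s = [sum(x ** k for x in X) for k in range(5)]            # sums of x^0..x^4
    t = [sum(yi * x ** k for x, yi in zip(X, y)) for k in range(3)]
    A = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(A)
    c = []
    for col in range(3):
        M = [row[:] for row in A]
        for r in range(3):
            M[r][col] = t[r]
        c.append(det3(M) / d)
    return lambda x: c[0] + c[1] * x + c[2] * x * x

def surrogate_search(f, lo, hi, budget=10, grid=400):
    """Surrogate-assisted search under a small evaluation budget: fit the
    cheap model to all evaluations so far, minimise the model on a grid,
    and evaluate the true (expensive) f at the model's minimiser."""
    X = [lo, 0.5 * (lo + hi), hi]                 # tiny initial design
    y = [f(x) for x in X]
    while len(X) < budget:
        model = fit_quadratic(X, y)
        xs = [lo + (hi - lo) * i / grid for i in range(grid + 1)]
        x_new = min(xs, key=model)                # cheap model search
        X.append(x_new)
        y.append(f(x_new))                        # one expensive evaluation
    fy, fx = min(zip(y, X))
    return fx, fy

# Usage: an expensive-to-evaluate stand-in with minimum 0.5 at x = 1.3
f = lambda x: (x - 1.3) ** 2 + 0.5
fx, fy = surrogate_search(f, -5.0, 5.0, budget=8)
```

The ‘update strategy’ here is the simplest possible one (always sample at the current model minimiser); the chapter compares several alternatives and ensembles of models.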
Other chapters study agent-based algorithms, that is, algorithms where the
optimization is carried out by agents which co-operate during the search. Dreżewski and Siwik
review agent-based co-evolutionary algorithms for multi-objective problems.
Such algorithms combine co-evolution (multiple species) with the agent ap-
proach (interaction). They review and compare existing methods and bench-
mark them over a range of test problems. Results show the agent-based co-
evolutionary algorithms can perform equally well and even surpass some of
the best existing multi-objective evolutionary algorithms. Salhi and Töreyen
propose a multi-agent algorithm based on game theory. Their framework uses
multiple solvers (agents) which compete over available resources and their
algorithm identifies the most successful solver. In the spirit of game theory,
successful solvers are rewarded by increasing their computing resources and
vice versa. Test results show the framework provides a better final solution
when compared to using a single solver.
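To make one of the earlier ideas concrete: the opposition sampling used by Ventresca et al. pairs each candidate x in a box [a, b] with its opposite a + b - x and keeps the better of the two. A minimal sketch of opposition-based initialisation follows; function and parameter names are illustrative, not taken from the chapter:

```python
import random

def opposite(x, lo, hi):
    """Opposite point of x within the box [lo, hi], per dimension."""
    return [l + h - xi for xi, l, h in zip(x, lo, hi)]

def opposition_init(f, lo, hi, pop_size, rng=random.Random(0)):
    """Initialise a population by evaluating each random point and its
    opposite, keeping whichever has the lower objective value."""
    pop = []
    for _ in range(pop_size):
        x = [rng.uniform(l, h) for l, h in zip(lo, hi)]
        xo = opposite(x, lo, hi)
        pop.append(min((x, xo), key=f))           # keep the better of the pair
    return pop

# Usage: sphere function on [-5, 5]^2
f = lambda x: sum(v * v for v in x)
pop = opposition_init(f, [-5, -5], [5, 5], pop_size=10)
```

Each kept individual is at least as good as its opposite, which is how the scheme buys improvement without extra random sampling; the chapter applies the same idea during the search as well as at initialisation.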
For applications in classification, Arana-Daniel et al. use Clifford algebra
to generalize support vector machines (SVMs) for classification (with an ex-
tension to regression). They represent input data as a multivector and use
a single Clifford kernel for multi-class problems. This approach significantly
reduces the computational complexity involved in training the SVM. Tests
using real-world applications of signal processing and computer vision show
the merit of their approach. Luukka and Lampinen propose a classification
method which uses principal component analysis to pre-process the data,
followed by optimization of the classifier parameters using a differential
evolution algorithm. Specifically, they optimize the class vectors used by the
classifier and the power of the distance metric. Test results using real-world
data sets show the proposed approach performs as well as or better than some of
the best existing classifiers. Lastly in this category, Zhang et al. study the
problem of feature selection in high-dimensional problems. They focus on the
GA-SVM approach, where a genetic algorithm (GA) optimizes the param-
eters of the SVM (the GA uses the SVM output as the objective values).
The approach requires large computational resources, which makes it difficult
to apply to large or high-dimensional data sets. As such, they propose several
measures, such as parallelization, neighbour search and caching, to accelerate the
search. Test results show their approach can reduce the computational cost
of training an SVM classifier.
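As background to the chapters that build on differential evolution, the core DE/rand/1/bin generation step can be sketched as follows; the constants F and CR are typical textbook values, not necessarily those used in the chapters:

```python
import random

def de_step(pop, f, F=0.8, CR=0.9, rng=random.Random(1)):
    """One generation of DE/rand/1/bin: for each target vector, build a
    mutant from three distinct other members, cross over binomially, and
    keep the trial vector only if it does not worsen the objective."""
    dim = len(pop[0])
    new_pop = []
    for i, target in enumerate(pop):
        a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
        mutant = [a[k] + F * (b[k] - c[k]) for k in range(dim)]
        j_rand = rng.randrange(dim)               # ensure one mutant gene survives
        trial = [mutant[k] if (rng.random() < CR or k == j_rand) else target[k]
                 for k in range(dim)]
        new_pop.append(trial if f(trial) <= f(target) else target)
    return new_pop

# Usage: minimise the sphere function in 3-D
f = lambda x: sum(v * v for v in x)
init_rng = random.Random(7)
init = [[init_rng.uniform(-5, 5) for _ in range(3)] for _ in range(8)]
pop = init
for _ in range(50):
    pop = de_step(pop, f)
best = min(pop, key=f)
```

The greedy per-slot selection guarantees the population never degrades, which is one reason DE is a popular engine both for opposition-based variants and for tuning classifier parameters as Luukka and Lampinen do.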
Two studies focus on difficult scheduling problems. First, Pieters studies
the problem of railway timetable design, which is an NP-hard
problem with additional challenging features, such as being reactive and
dynamic. He studies solving the problem with Symbiotic Networks, a class of
neural networks inspired by the symbiosis phenomenon in nature, in which the
network uses ‘agents’ to adapt itself to the problem. Test results show the
Symbiotic Network can successfully handle the complex scheduling problem.
Next, Srivastava et al. propose an approach combining evolutionary algo-
rithms, neural networks and fuzzy logic to solve problems of multi-objective
time-cost trade-off. They consider a range of such problems including non-
linear time-cost relationships, constrained resources and project uncertain-
ties. They show the merit of their approach by testing it on a real-world
test case.
Lastly, for applications of CI to hardware implementations, Meher studies
the use of systolic arrays for efficient hardware implementations of artificial
neural networks on VLSI and FPGA platforms for real-time applications.
The chapter surveys various approaches, current achievements as well as fu-
ture directions such as mixed analog-digital neural networks. This is followed
by Thangavelautham et al., who propose using coarse-coding techniques to
evolve multi-robot controllers, aiming to evolve the controller and the sensor
configurations simultaneously. To make the problem tractable
they use an Artificial Neural Tissue to exploit regularity in the sensor data.
Test results show their approach outperforms a reference one.
Overall, the chapters in this volume address a spectrum of issues arising
in the application of computational intelligence to difficult real-world
optimization problems. The chapters discuss both the current accomplishments
and the remaining open issues as well as point to future research directions
in the field.

September 2009
Yoel Tenne
Chi-Keong Goh
Acknowledgement to Reviewers

We express our thanks to our fellow researchers, who have kindly lent their
expertise in reviewing for this edited volume. Their assistance has been
invaluable to our endeavors.

B.V. Babu
Will Browne
Pedro M. S. Carvalho
Jia Chen
Sheng Chen
Tsung-Che Chiang
Siang-Yew Chong
Antonio Della Cioppa
Carlos A. Coello Coello
Marco Cococcioni
Claudio De Stefano
Antonio Gaspar-Cunha
Kyriakos C. Giannakoglou
David Ginsbourger
Frederico Guimarães
Martin Holena
Amitay Isaacs
Jayadeva
Wee Tat Koo
Slawomir Koziel
Jouni Lampinen
Xiaodong Li
Dudy Lim
Pasi Luukka
Pramod Kumar Meher
Hirotaka Nakayama
Ferrante Neri
Thai Dung Nguyen
Alberto Ochoa
Yew-Soon Ong
Khaled Rasheed
Tapabrata Ray
Abdellah Salhi
Vui Ann Shim
Ofer M. Shir
Dimitri Solomatine
Sanjay Srivastava
Janusz Starzyk
Stephan Stilkerich
Haldun Süral
Mohamed B. Trabia
Massimiliano Vasile
Lingfeng Wang
Chee How Wong
Contents

1 New Hybrid Intelligent Systems to Solve Linear and Quadratic
Optimization Problems and Increase Guaranteed Optimal
Convergence Speed of Recurrent ANN . . . . . . . . . . . . . . . . . . . . . . . . . 1
Otoni Nóbrega Neto, Ronaldo R.B. de Aquino, Milde M.S. Lira
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Neural Network of Maa and Shanblatt: Two-Phase
Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Hybrid Intelligent System Description . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Method of Tendency Based on the Dynamics in
Space-Time (TDST) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.2 Method of Tendency Based on the Dynamics in
State-Space (TDSS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.1 Case 1: Mathematical Linear Programming
Problem – Four Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.2 Case 2: Mathematical Linear Programming
Problem – Eleven Variables . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.3 Case 3: Mathematical Quadratic Programming
Problem – Three Variables . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 A Novel Optimization Algorithm Based on Reinforcement
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Janusz A. Starzyk, Yinyin Liu, Sebastian Batog
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.1 Basic Search Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.2 Extracting Historical Information by Weighted
Optimized Approximation . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.3 Predicting New Step Sizes . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2.4 Stopping Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.2.5 Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 Simulation and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.1 Finding Global Minimum of a Multi-variable
Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.2 Optimization of Weights in Multi-layer Perceptron
Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3.3 Micro-saccade Optimization in Active Vision for
Machine Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3 The Use of Opposition for Decreasing Function Evaluations in
Population-Based Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Mario Ventresca, Shahryar Rahnamayan, Hamid Reza Tizhoosh
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Theoretical Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.1 Definitions and Notations . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.2 Consequences of Opposition . . . . . . . . . . . . . . . . . . . . . . . 52
3.2.3 Lowering Function Evaluations . . . . . . . . . . . . . . . . . . . . . 53
3.2.4 Comparison to Existing Methods . . . . . . . . . . . . . . . . . . . 54
3.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Differential Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.2 Opposition-Based Differential Evolution . . . . . . . . . . . . . 57
3.3.3 Population-Based Incremental Learning . . . . . . . . . . . . . . 57
3.3.4 Oppositional Population-Based Incremental
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4.1 Evolutionary Image Thresholding . . . . . . . . . . . . . . . . . . . 59
3.4.2 Parameter Settings and Solution Representation . . . . . . . 63
3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5.1 ODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5.2 OPBIL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4 Search Procedure Exploiting Locally Regularized Objective
Approximation: A Convergence Theorem for Direct Search
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Marek Bazan
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 The Search Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Zangwill’s Method to Prove Convergence . . . . . . . . . . . . . . . . . . . . 75
4.4 The Main Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4.1 Closedness of the Algorithmic Transformation . . . . . . . . 78
4.4.2 A Perturbation in the Line Search . . . . . . . . . . . . . . . . . . . 80
4.5 The Radial Basis Approximation . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.1 Detecting Dense Regions . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.2 Regularization Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5.3 Choice of the Regularization Parameter λ Value . . . . . . . 90
4.5.4 Error Bounds for Radial Basis Approximation . . . . . . . . 91
4.6 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6.1 Test Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5 Optimization Problems with Cardinality Constraints . . . . . . . . . . . . 105
Rubén Ruiz-Torrubiano, Sergio Garcı́a-Moratilla, Alberto Suárez
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2 Approximate Methods for the Solution of Optimization
Problems with Cardinality Constraints . . . . . . . . . . . . . . . . . . . . . 108
5.2.1 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.2.2 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2.3 Estimation of Distribution Algorithms . . . . . . . . . . . . . . . 111
5.3 Benchmark Optimization Problems with Cardinality
Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3.1 The Knapsack Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3.2 Ensemble Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.3 Portfolio Optimization with Cardinality Constraints . . . . 119
5.3.4 Index Tracking by Partial Replication . . . . . . . . . . . . . . . . 122
5.3.5 Sparse Principal Component Analysis . . . . . . . . . . . . . . . 124
5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6 Learning Global Optimization through a Support Vector
Machine Based Adaptive Multistart Strategy . . . . . . . . . . . . . . . . . . . 131
Jayadeva, Sameena Shah, Suresh Chandra
6.1 Introduction and Background Research . . . . . . . . . . . . . . . . . . . . . . 132
6.2 Global Optimization with Support Vector Regression Based
Adaptive Multistart (GOSAM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.3.1 One Dimensional Wave Function . . . . . . . . . . . . . . . . . . . 137
6.3.2 Two Dimensional Case: Ackley’s Function . . . . . . . . . . . 140
6.3.3 Comparison with PSO and GA on Higher
Dimensional Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.4 Extension to Constrained Optimization Problems . . . . . . . . . . . . . . 143
6.4.1 Sequential Unconstrained Minimization Techniques . . . 143
6.5 Design Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.5.1 Sample and Hold Circuit . . . . . . . . . . . . . . . . . . . . . . 148
6.5.2 Folded Cascode Amplifier . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.7 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
7 Multi-objective Optimization Using Surrogates . . . . . . . . . . . . . . . . . 155
Ivan Voutchkov, Andy Keane
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.2 Surrogate Models for Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.3 Multi-objective Optimization Using Surrogates . . . . . . . . . . . . . . . 158
7.4 Pareto Fronts - Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.5 Response Surface Methods, Optimization Procedure and Test
Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.6 Update Strategies and Related Parameters . . . . . . . . . . . . . . . . . . . . 163
7.7 Test Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
7.8 Pareto Front Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.8.1 Generational Distance ([3], pp.326) . . . . . . . . . . . . . . . . . 165
7.8.2 Spacing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.8.3 Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.8.4 Maximum Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.9 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.9.1 Understanding the Results . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.9.2 Preliminary Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.9.3 The Effect of the Update Strategy Selection . . . . . . . . . . . 167
7.9.4 The Effect of the Initial Design of Experiments . . . . . . . . 171
7.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

8 A Review of Agent-Based Co-Evolutionary Algorithms for
Multi-Objective Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Rafał Dreżewski, Leszek Siwik
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.2 Model of Co-Evolutionary Multi-Agent System . . . . . . . . . . . . . . . 179
8.2.1 Co-Evolutionary Multi-Agent System . . . . . . . . . . . . . . . 180
8.2.2 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
8.2.3 Species . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.2.4 Sex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
8.2.5 Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
8.3 Co-Evolutionary Multi-Agent Systems for Multi-Objective
Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
8.3.1 Co-Evolutionary Multi-Agent System with
Co-Operation Mechanism (CCoEMAS) . . . . . . . . . . . . . . 187
8.3.2 Co-Evolutionary Multi-Agent System with
Predator-Prey Interactions (PPCoEMAS) . . . . . . . . . . . . . 190
8.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
8.4.1 Test Suite, Performance Metric and State-of-the-Art
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
8.4.2 A Glance at Assessing Co-operation Based Approach
(CCoEMAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.4.3 A Glance at Assessing Predator-Prey Based
Approach (PPCoEMAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.5 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9 A Game Theory-Based Multi-Agent System for Expensive
Optimisation Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Abdellah Salhi, Özgun Töreyen
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
9.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9.2.1 Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9.2.2 Game Theory: The Iterated Prisoners’ Dilemma . . . . . . 213
9.2.3 Multi-Agent Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
9.3 Constructing GTMAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
9.3.1 GTMAS at Work: Illustration . . . . . . . . . . . . . . . . . . . . . . 216
9.4 The GTMAS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
9.4.1 Solver-Agents Decision Making Procedure . . . . . . . . . . . 219
9.5 Application of GTMAS to TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
9.6 Tests and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
9.7 Conclusion and Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10 Optimization with Clifford Support Vector Machines and
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
N. Arana-Daniel, C. López-Franco, E. Bayro-Corrochano
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
10.2 Geometric Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
10.2.1 The Geometric Algebra of n-D Space . . . . . . . . . . . . . . . . 235
10.2.2 The Geometric Algebra of 3-D Space . . . . . . . . . . . . . . . 237
10.3 Linear Clifford Support Vector Machines for Classification . . . . . 237
10.4 Non Linear Clifford Support Vector Machines for
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
10.5 Clifford SVM for Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
10.6 Recurrent Clifford SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
10.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
10.7.1 3D Spiral: Nonlinear Classification Problem . . . . . . . . . . 247
10.7.2 Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
10.7.3 Multi-case Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
10.7.4 Experiments Using Recurrent CSVM . . . . . . . . . . . . . . . . 257
10.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
11 A Classification Method Based on Principal Component Analysis
and Differential Evolution Algorithm Applied for Prediction
Diagnosis from Clinical EMR Heart Data Sets . . . . . . . . . . . . . . . . . . 263
Pasi Luukka, Jouni Lampinen
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
11.2 Heart Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
11.3 Classification Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
11.3.1 Dimension Reduction Using Principal Component
Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
11.3.2 Classification Based on Differential Evolution . . . . . . . . 269
11.3.3 Differential Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
11.4 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
11.5 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
12 An Integrated Approach to Speed Up GA-SVM Feature Selection
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Tianyou Zhang, Xiuju Fu, Rick Siow Mong Goh, Chee Keong Kwoh,
Gary Kee Khoon Lee
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
12.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
12.2.1 Parallel/Distributed GA . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
12.2.2 Parallel SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
12.2.3 Neighbor Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
12.2.4 Evaluation Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
12.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
12.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

13 Computation in Complex Environments; Optimizing Railway
Timetable Problems with Symbiotic Networks . . . . . . . . . . . . . . . . . . 299
Kees Pieters
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
13.1.1 Convergence Inducing Process . . . . . . . . . . . . . . . . . . . . . 300
13.1.2 A Classification of Problem Domains . . . . . . . . . . . . . . . . 301
13.2 Railway Timetable Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
13.3 Symbiotic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
13.3.1 A Theory of Symbiosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
13.3.2 Premature Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
13.4 Symbiotic Networks as Optimizers . . . . . . . . . . . . . . . . . . . . . . . . . . 313
13.5 Trains as Symbiots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
13.5.1 Trains in Symbiosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
13.5.2 The Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
13.5.3 The Trains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
13.5.4 The Optimizing Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
13.5.5 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . 319
13.5.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
13.5.7 A Symbiotic Network as a CCGA . . . . . . . . . . . . . . . . . . . 321
13.5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
14 Project Scheduling: Time-Cost Tradeoff Problems . . . . . . . . . . . . . . . 325
Sanjay Srivastava, Bhupendra Pathak, Kamal Srivastava
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
14.1.1 A Mathematical Description of TCT Problems . . . . . . . . 328
14.2 Resource-Constrained Nonlinear TCT . . . . . . . . . . . . . . . . . . . . . . . 329
14.2.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 330
14.2.2 Working of ANN and Heuristic Embedded Genetic
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
14.2.3 ANNHEGA for a Case Study . . . . . . . . . . . . . . . . . . . . . . 334
14.3 Sensitivity Analysis of TCT Profiles . . . . . . . . . . . . . . . . . . . . . . . . 336
14.3.1 Working of IFAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
14.3.2 IFAG for a Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
14.4 Hybrid Meta Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
14.4.1 Working of Hybrid Meta Heuristic . . . . . . . . . . . . . . . . . . 345
14.4.2 HMH Approach for Case Studies . . . . . . . . . . . . . . . . . . . 348
14.4.3 Standard Test Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
14.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
15 Systolic VLSI and FPGA Realization of Artificial Neural
Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
Pramod Kumar Meher
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
15.2 Direct-Design of VLSI for Artificial Neural Network . . . . . . . . . . 362
15.3 Design Considerations and Systolic Building Blocks for ANN . . . 364
15.4 Systolic Architectures for ANN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
15.4.1 Systolic Architecture for Hopfield Net . . . . . . . . . . . . . . . 371
15.4.2 Systolic Architecture for Multilayer Neural Network . . . 373
15.4.3 Systolic Implementation of Back-Propagation
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
15.4.4 Implementation of Advance Algorithms and
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
15.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
16 Application of Coarse-Coding Techniques for Evolvable
Multirobot Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
Jekanthan Thangavelautham, Paul Grouchy,
Gabriele M.T. D’Eleuterio
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
16.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
16.2.1 The Body and the Brain . . . . . . . . . . . . . . . . . . . . . . . . 385
16.2.2 Task Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
16.2.3 Machine-Learning Techniques and Modularization . . . . 386
16.2.4 Fixed versus Variable Topologies . . . . . . . . . . . . . . . . . . . 387
16.2.5 Regularity in the Environment . . . . . . . . . . . . . . . . . . . . . . 388
16.3 Artificial Neural Tissue Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
16.3.1 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
16.3.2 The Decision Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
16.3.3 Evolution and Development . . . . . . . . . . . . . . . . . . . . . . . . 391
16.3.4 Sensory Coarse Coding Model . . . . . . . . . . . . . . . . . . . . . 393
16.4 An Example Task: Resource Gathering . . . . . . . . . . . . . . . . . . . . . . 395
16.4.1 Coupled Motor Primitives . . . . . . . . . . . . . . . . . . . . . . . . . 397
16.4.2 Evolutionary Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
16.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
16.5.1 Evolution and Robot Density . . . . . . . . . . . . . . . . . . . . . . . 403
16.5.2 Behavioral Adaptations . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
16.5.3 Evolved Controller Scalability . . . . . . . . . . . . . . . . . . . . . . 406
16.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
16.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
Chapter 1
New Hybrid Intelligent Systems to Solve Linear
and Quadratic Optimization Problems and
Increase Guaranteed Optimal Convergence
Speed of Recurrent ANN

Otoni Nóbrega Neto, Ronaldo R.B. de Aquino, and Milde M.S. Lira

Abstract. This chapter deals with the study of artificial neural networks (ANNs) and
Heuristic Rules (HR) to solve optimization problems. ANNs were studied as optimization
tools for solving large-scale problems because this technique has
great potential for hardware VLSI implementation, in which it may be more efficient
than traditional optimization techniques. However, the implementation of the computational
algorithm has shown that the proposed technique, although efficient, is slow
compared with traditional mathematical methods. In order to make it
a fast method, we will show two ways to increase the speed of convergence of the
computational algorithm. For analysis and comparison, we solved three test cases.
This chapter considers recurrent ANNs for solving linear and quadratic programming problems.
These networks are based on the solution of a set of differential equations that
are obtained from a transformation of an augmented Lagrange energy function. The
proposed hybrid systems combining recurrent ANN and HR presented a reduced
computational effort in relation to the one using only the recurrent ANN.

1.1 Introduction
The early 1980s were marked by a resurgence of interest in artificial neural networks
(ANNs). At that time, the development of ANNs gained the important characteristic
of temporal processing. Many researchers have attributed the resumption of
research on ANNs in the eighties to the Hopfield model presented in 1982 [1].
This recurrent Hopfield model constituted a great advance over the state of
knowledge in the area of neural networks until then.
Otoni Nóbrega Neto · Ronaldo R.B. de Aquino · Milde M.S. Lira
Electrical Engineering Department, Federal University of Pernambuco, Brazil
e-mail: otoninobrega@hotmail.com, rrba@ufpe.br

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 1–26.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

Nowadays, it is known that there are two ways of incorporating temporal computation
in a neural network: the first is by using a static neural network to accomplish
a dynamical mapping with a structure of short-term memory; the second is by making
internal feedback connections, through single- or multi-loop feedback, in which the
neural network can be fully connected. Artificial neural networks that have feedback
connections in their topology are known as
recurrent neural networks [2]. The theoretical study and applications of recurrent
neural nets were developed in several subsequent works [3, 4, 5, 6, 7, 8, 9]. In fact,
Hopfield's works showed that an energy value can be associated with each state of
the net and that this energy decreases monotonically as the trajectory evolves within
the state-space towards a fixed point, so that these fixed points are stable points of
energy [10]; i.e., the energy function behaves as a Lyapunov function for the model
described in detail in Hopfield's works. This brings questions of stability in recurrent
neural nets to the fore. For the stability of a non-linear dynamical system, we usually
neural nets. Considering the stability in a non-linear dynamical system, we usually
think about stability in the sense of Lyapunov. The Direct Method of Lyapunov is
broadly used for stability analysis of linear and non-linear systems which may be
either time-variant or time-invariant. Therefore, it is directly applicable to the
stability analysis of ANNs [2].
In 1985, Hopfield solved the traveling salesman problem [7], a problem in
combinatorial optimization, using a continuous model of the recurrent neural
network as an optimization tool. In 1986, Hopfield proposed a specialized ANN to
solve specific problems of linear programming (LP) [9] based on analog circuits,
studied since 1956 by Insley B. Pyne and presented in [11]. On that occasion, Hopfield
demonstrated that the dynamics of recurrent artificial neural nets are described by a
Lyapunov function; consequently, the network is stable and its point of stability is the
solution to the problem for which the ANN was modeled.
In 1987, Kennedy and Chua demonstrated that the ANN proposed by Hopfield
in 1986, although it searches for the minimum of the energy function, had not been
modeled to guarantee a lower bound, which was reached only when an operational
amplifier of the circuit saturated [12]. Due to this
deficiency, Kennedy and Chua proposed a new circuit for LP problems that also
proved to be able to solve quadratic programming (QP) problems as well. These
circuits were named "canonical non-linear programming circuits" and are based on
the Kuhn-Tucker (KT) conditions [12]. In this kind of ANN-based optimization,
the problem has to be “hard-wired” in the network and the convergence behavior of
the ANN depends greatly on how the cost function is modeled.
Later, further studies [13, 14] confirmed that, for non-linear programming problems,
the model proposed by Kennedy and Chua [15] completely satisfies the KT optimality
conditions and the penalty method, and that under appropriate conditions this net is
stable. In spite of the important progress presented in
Kennedy and Chua's studies, a deficiency was observed in the model: the equilibrium
point of the net lies only in the neighborhood of the optimal point of the original
problem, although the distance between the optimal point and the equilibrium point
of the network can be reduced by increasing the penalty parameter (s), as in [14]
and [16]. Even so, Kennedy and Chua's network is able to solve a
great class of optimization problems with and without constraints. However, when
the solutions of constrained optimization problems lie near the boundary of the
feasible region, i.e., equality constraints are close to being active, then the network only
converges to an approximate solution that can be outside the feasible region [17].
This is explained by the application of the penalty function theorem [16]. For ap-
plications in which an unfeasible solution cannot be tolerated, the usefulness of this
technique (Kennedy and Chua’s neural networks) is seriously jeopardized. With the
intention of overcoming this difficulty, Maa and Shanblatt proposed the two-phase
method [14]. This work reveals an innovation in the method presented by David
W. Tank and John J. Hopfield [18] and it guarantees that, in certain conditions, the
proposed network evolves towards the exact solution of the optimization problem.
Since the Kennedy and Chua network contains a finite penalty parameter, it generates
only approximate solutions and presents an implementation problem when
the penalty parameter is very large. To reach an exact solution, the Maa and
Shanblatt method uses another penalty parameter in the second phase. Therefore, to
avoid using penalty parameters, some significant works have been done in recent
years. Among them, a few primal-dual neural networks with two-layer and one-
layer structure were developed for solving linear and quadratic programming prob-
lems [18, 19, 20, 21]. These neural networks were proved to be globally convergent
to an exact solution when the objective function is convex.
Nowadays, recurrent ANNs have been used to solve real-world problems such
as hydrothermal scheduling, based on the augmented Lagrange Hopfield
network [22] and on the Maa and Shanblatt two-phase neural network [23, 24, 25].
In this work, ANNs were applied to solve optimization problems using the
method proposed by Maa and Shanblatt. ANNs were studied as optimization
tools for solving large-scale problems because this technique
has great potential for hardware VLSI implementation, in which it can be more
efficient than traditional optimization techniques. However, the software implementation
of the method has shown that, although the technique is efficient in
solving optimization problems, its speed of convergence can be slow
compared with traditional mathematical methods. In this regard, heuristic rules
were created and proposed, in hybrid form, to aid and accelerate the convergence
of the two-phase method in software. It is important to point out that the software
implementation of the method is an important part of the development and
analysis of the method in hardware. An important reason for choosing the Maa
and Shanblatt network is that it solves linear and quadratic optimization
problems with linear equality and inequality constraints directly, without mathematical
transformations that would increase the dimension of the problem. As we plan to
apply the HIS developed in this work to the hydrothermal scheduling problem
[24, 25], which does not need an exact solution, the first phase of the Maa and
Shanblatt network was chosen for this implementation. In future works, we may try
other kinds of recurrent ANNs.
Decision trees and classification rules are important and common methods used
for knowledge representation in expert systems [26]. Heuristic rules are rules
that have no particular foundation in a scientific theory, but are based only
on the observation of general patterns and derived from facts. These rules are
applicable to many problems as shown in [27, 28, 29, 30]. Here the basis of the
proposed heuristic rules is the dynamical behavior of neural networks. From the
convergence analysis, we identified the parameters and their relationships, which
are then transformed into a set of heuristic rules. We developed an algorithm based
on the heuristic rules and carried out some experiments to evaluate and support the
proposed technique.
In this work, two possible implementations were developed, tested and compared;
a high reduction in computational effort was observed by using the proposed
heuristic rules. This reduction is related to the decrease in the number of ODEs
computed during the convergence process. Other possible implementations are also
indicated.
This work is organized as follows: we begin with a review of the two-phase method of
Maa and Shanblatt; next, we present the proposed heuristic rules and show the
solutions for test cases using the previously discussed techniques; next, the simula-
tion results are presented and analyzed; and finally, we draw conclusions about the
proposed work.

1.2 Neural Network of Maa and Shanblatt: Two-Phase Optimization
The operation of the Hopfield network model and the subsequent models is based
on the constraint violations of the optimization problem. When a constraint violation
occurs, the magnitude and the direction of the violation are fed back to adjust the
states of the neurons of the network so that the overall energy function of the network
always decreases until it reaches a minimum level. These ANN models have
dynamical characteristics governed by a Lyapunov function. Therefore, it can be
demonstrated that these networks are stable and that the equilibrium point is the
solution of the LP or QP problem that the network represents. This type of neural
network was first improved in [15] and later in [14]. The network of the latter
version is the one used in this work.
The Maa and Shanblatt network is able to solve constrained or unconstrained convex
quadratic and linear programming problems.
Consider the following convex problem P:

(P)   min  f(x) = (1/2) x^T Q x + c^T x
      s.t. g(x) = Dx − b ≤ 0
           h(x) = Hx − w = 0
           x ∈ R^n                                              (1.1)

where c ∈ R^n, D ∈ R^(p×n), b ∈ R^p, H ∈ R^(q×n), w ∈ R^q, p, q ≤ n, and Q ∈ R^(n×n)
is symmetric and positive definite or positive semidefinite; f, the gi's and the hj's are
functions from R^n to R. Assume that the feasible domain of P is not empty and the
objective function is bounded below over the domain of P.
Particularly, P is said to be a convex program if f and the gi's are convex functions
and the hj's are affine functions. Another particularity of the formulation can
be observed when Q is a zero matrix and the cost function is thus reduced to
f (x) = cT x. In this case, if the inequality and equality constraints have linear for-
mulation then the problem P becomes a linear programming problem.
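To make the data of (P) concrete, the following minimal sketch (plain Python; the values chosen for Q, c, D, b, H and w are illustrative assumptions, not one of the chapter's test cases) evaluates the objective and the constraint residuals:

```python
# Illustrative instance of problem (P):
#   f(x) = 0.5 x^T Q x + c^T x,  g(x) = D x - b <= 0,  h(x) = H x - w = 0
Q = [[2.0, 0.0], [0.0, 2.0]]   # symmetric, positive definite
c = [-2.0, -4.0]
D = [[1.0, 1.0]]; b = [4.0]    # one inequality: x1 + x2 <= 4
H = [[1.0, -1.0]]; w = [0.0]   # one equality:  x1 - x2  = 0

def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def f(x):
    return (0.5 * sum(xi * qx for xi, qx in zip(x, matvec(Q, x)))
            + sum(ci * xi for ci, xi in zip(c, x)))

def g(x):  # inequality residuals; feasible when every entry is <= 0
    return [gi - bi for gi, bi in zip(matvec(D, x), b)]

def h(x):  # equality residuals; feasible when every entry is 0
    return [hi - wi for hi, wi in zip(matvec(H, x), w)]
```

For example, x = (1, 1) is feasible here: f([1, 1]) = −4, g([1, 1]) = [−2] ≤ 0 and h([1, 1]) = [0].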
The method presented by Maa and Shanblatt [14] is composed of two phases.
The first phase aims to initialize the problem and to converge quickly, without high
accuracy, towards the neighborhood of the optimal point, while the second phase
aims to reach the exact solution of the problem. To this end, the dynamics of the
first phase are based on the exact penalty Lagrangian, or energy, function L(s, x):
L(s, x) = f(x) + (s/2) [ Σ_{i=1}^{p} (g_i^+(x))² + Σ_{j=1}^{q} (h_j(x))² ]          (1.2)

where s is a large positive real number and g_i^+(x(t)) = max{0, g_i(x(t))}, with the
notation simplified to g^+ = [g_1^+, . . . , g_p^+]^T, according to [14].
As the system converges, x(t) → x̂, s·g_i^+(x(t)) → λ_i and s·h_j(x(t)) → μ_j,
which are the Lagrange multipliers associated with each corresponding constraint.
In the first phase, an approximation of the Lagrange multipliers is already obtained.
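As a minimal sketch of the energy function (1.2) (the objective and constraints are passed as generic callables, an assumption made here for illustration):

```python
def energy(f, g_list, h_list, x, s):
    """Exact-penalty energy L(s, x) of Eq. (1.2): the objective plus
    (s/2) times the squared inequality violations g_i^+ and the squared
    equality residuals h_j."""
    g_plus = [max(0.0, g(x)) for g in g_list]
    h_val = [h(x) for h in h_list]
    return f(x) + 0.5 * s * (sum(v * v for v in g_plus)
                             + sum(v * v for v in h_val))
```

For instance, with f(x) = x², the inequality x ≤ 1 and the equality x = 3, the point x = 2 gives L(10, 2) = 4 + 5·(1² + 1²) = 14.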
The block diagram of a two-phase optimization network is shown in Fig. 1.1.
The dynamics that happen in the first phase are in the time range 0 ≤ t ≤ t1 (t1 is the
time instant when the switch is closed connecting the first phase to the second one).
The network operates according to the following dynamics:
dx/dt = −∇f(x) − s [ Σ_{i=1}^{p} ∇g_i(x) g_i^+(x) + Σ_{j=1}^{q} ∇h_j(x) h_j(x) ]          (1.3)

In the second phase (t ≥ t1) the network begins to shift the directional vector s·g_i^+(x)
gradually to λ_i, and s·h_j(x) to μ_j. By imposing a small positive real value ε, the
update rates of dλ_i/dt and dμ_j/dt, given in (1.6) and (1.7) respectively, are made
comparatively much slower than that of dx/dt (1.5). Such dynamics can be approximated
by considering λ and μ to be fixed. It can then be seen that (1.5) seeks a
minimum point of the augmented Lagrangian function La(s, x):

La(s, x) = f(x) + λ^T g(x) + μ^T h(x) + (s/2) ( ‖g^+(x)‖² + ‖h(x)‖² )          (1.4)
In the block diagram of Fig. 1.1, in the first phase, the subsystems within the two
large rectangles do not contribute during t ≤ t1 and in the second phase, when t > t1 ,
the dynamics of the network become:
dx/dt = −∇f(x) − Σ_{i=1}^{p} ∇g_i(x) (s·g_i^+(x) + λ_i) − Σ_{j=1}^{q} ∇h_j(x) (s·h_j(x) + μ_j)          (1.5)
Fig. 1.1 Block diagram of the dynamical system of the Maa and Shanblatt network

The Lagrange multipliers are updated as:
dλ_i(t + Δt)/dt = ε·s·g_i^+(x(t)),   for i = 1, . . . , p, and          (1.6)

dμ_j(t + Δt)/dt = ε·s·h_j(x(t)),   for j = 1, . . . , q.          (1.7)
A practical value is ε = 1/s according to [14], which leaves the network with just one
adjustment parameter. However, using ε independently of s gives more freedom to
control the dynamics of the network. During the first phase, the Lagrange multipliers
are null, so there is no restriction on the initial value of x(t).
According to the penalty function theorem, the solution achieved in the first
phase is not equivalent to the minimum of the function f(x) unless the penalty
parameter s is infinite. Thus, the second optimization phase is necessary for any
finite value of s. The system reaches equilibrium when:
g_i^+ = 0,
h_j = 0, and
∇f(x) + Σ_{i=1}^{p} ∇g_i(x) λ_i + Σ_{j=1}^{q} ∇h_j(x) μ_j = 0,          (1.8)

which is identical to the optimality condition of the KT theorem; thus the equilibrium
point of the two-phase network is precisely a global minimum point of the convex
problem (P).
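The two-phase dynamics can be simulated with a simple forward-Euler discretization. The sketch below is illustrative only: the QP instance, the penalty s, the step size and the phase durations are all choices made here for demonstration, not values from the chapter. The problem is min x1² + x2² − 2x1 − 4x2 subject to x1 + x2 ≤ 4 and x1 − x2 = 0, whose exact optimum is x* = (1.5, 1.5) with equality multiplier μ* = −1:

```python
# Forward-Euler sketch of the two-phase dynamics (1.3) and (1.5)-(1.7).
Q = [[2.0, 0.0], [0.0, 2.0]]; c = [-2.0, -4.0]
D = [[1.0, 1.0]]; b = [4.0]        # inequality rows: g(x) = Dx - b <= 0
H = [[1.0, -1.0]]; w = [0.0]       # equality rows:   h(x) = Hx - w  = 0
s = 10.0        # penalty parameter
eps = 1.0 / s   # multiplier update rate (the practical choice eps = 1/s)
dt = 0.01       # Euler step size (illustrative)

def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def residuals(x):
    g_plus = [max(0.0, gi - bi) for gi, bi in zip(matvec(D, x), b)]
    h_val = [hi - wi for hi, wi in zip(matvec(H, x), w)]
    return g_plus, h_val

def xdot(x, lam, mu, second_phase):
    g_plus, h_val = residuals(x)
    grad = [qx + ci for qx, ci in zip(matvec(Q, x), c)]   # grad f(x)
    for i, row in enumerate(D):    # + grad g_i * (s g_i^+ [+ lam_i])
        coef = s * g_plus[i] + (lam[i] if second_phase else 0.0)
        grad = [gk + coef * rk for gk, rk in zip(grad, row)]
    for j, row in enumerate(H):    # + grad h_j * (s h_j [+ mu_j])
        coef = s * h_val[j] + (mu[j] if second_phase else 0.0)
        grad = [gk + coef * rk for gk, rk in zip(grad, row)]
    return [-gk for gk in grad]

x = [0.0, 0.0]; lam = [0.0]; mu = [0.0]
for _ in range(200):       # first phase, Eq. (1.3): 0 <= t <= t1 = 2
    x = [xi + dt * di for xi, di in zip(x, xdot(x, lam, mu, False))]
for _ in range(6000):      # second phase, Eqs. (1.5)-(1.7)
    d = xdot(x, lam, mu, True)
    g_plus, h_val = residuals(x)
    lam = [li + dt * eps * s * gp for li, gp in zip(lam, g_plus)]
    mu = [mj + dt * eps * s * hv for mj, hv in zip(mu, h_val)]
    x = [xi + dt * di for xi, di in zip(x, d)]
```

At the end of the run, x is close to (1.5, 1.5) and mu close to −1, satisfying the KT conditions (1.8); the inequality never becomes active, so lam stays at zero.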
In [12] it is demonstrated that the Kennedy and Chua network for linear and
quadratic programming problems fully satisfies the KT optimality conditions and
the penalty function method, and that under appropriate conditions this network is
completely stable. Moreover, it is shown that the equilibrium point lies in the
neighborhood of the optimal point of the original problem and that the distance
between them can be made arbitrarily small by selecting a sufficiently large value
of the penalty parameter (s).
For problems that cannot tolerate a solution in the infeasible region, due to physical
limits of operational amplifiers, a two-phase optimization network model is proposed.
In the second phase, we can obtain both the exact solution of these problems
and the Lagrange multipliers associated with each constraint.

1.3 Hybrid Intelligent System Description


The network proposed by Maa and Shanblatt has two attractive features: the property
of guaranteed global convergence for the mathematical programming problem, and
the possibility of physical implementation of the neural network in a circuit with
electrical components, where the response time of the circuit dynamics would be set
by the capacitance in the circuit, so that the convergence time
would be negligible. In spite of these attractive characteristics, the time required
to process the computational algorithm becomes a barrier in ANN-based
applications for solving large-scale mathematical programming problems, since it
requires solving several differential equations. Problems with a larger number of
variables and constraints involve a correspondingly larger number of differential
equations. To mitigate this problem, heuristic rules were
developed to accelerate the convergence of the computational algorithm involved in
recurrent neural networks.
The combination of recurrent ANN with heuristic rules forms the Hybrid Intelligent
System (HIS), in which these two techniques interact and exchange information
with one another until the optimal solution of the problem is achieved.
The basis of the proposed heuristic rules is the dynamical behavior of the neural
networks. Control theory studies the dynamical behavior of a process in depth.
With the aid of control theory and of the Lyapunov theorem for the
network [2], it can be stated that from any given initial state of the
state vectors x(0), the network will always change the values of the state variables
xi (t) in the direction in which the value of the Lyapunov function for the network
decreases continuously until it reaches a stationary point, which corresponds to a
global minimum of the programming problem. The trajectory of the variables is
illustrated in Fig. 1.2. It depicts a two-dimensional trajectory (orbit) of a dynamical
system, where it is possible to observe the state variables of the system at certain
time instants (t0 ,t1 , . . . ,t5 ). The dotted vector can be understood as the tendency of
the convergence (indicated by the gradient vector) of the variables in the dynamics
of the system.

Fig. 1.2 A two-dimensional trajectory (orbit) of a dynamical system

Trajectories of the state variables for the same system are exemplified graphically
in Fig. 1.3. These trajectories are distinct due to the fact that the state variables have
different initial states. The dynamics of recurrent ANNs have the same properties
and, therefore, are similar to the dynamics shown in Fig. 1.3.
Although the Maa and Shanblatt model deals with a continuous-time recurrent
network, in a computational algorithm the iterations are performed in discrete
time, since the integration requires a small, but non-null, step size. Therefore, we
have full control over the course of the algorithm's iterations in the network.
Detailed observations carried out while testing the algorithm of the Maa and
Shanblatt model showed that the computational convergence is slow and that the
convergence trajectories of recurrent networks in state-space are smooth and
possibly predictable. We then observed that, under certain conditions, it is not only
possible to estimate a point closer to the minimum point of the energy function of
the network, but also to estimate a point that, leaving the initial convergence orbit,
becomes the initial point of a new convergence orbit. This new orbit has a shorter
curvature and, consequently, a smaller Euclidean distance to the optimal point. In
this way, the number of steps needed for the computational algorithm to converge
can be reduced and, consequently, so can the time to compute the equilibrium point
of the network.
Fig. 1.3 An illustration of a two-dimensional state (phase) of a dynamical system and the
associated vectorial field

To achieve the equilibrium point, we use two methods. In the first one, the point
is calculated from the evolution of the dynamics in the space-time plane (in this
work, we consider only autonomous systems). In the second method, the calculation
is performed by observing the evolution of the variables in state-space. The
mechanism of these two methods and the way they operate in the proposed HIS are
described as follows.

1.3.1 Method of Tendency Based on the Dynamics in Space-Time (TDST)
Consider the convergences of dynamical systems of first order according to the
graphs in Fig. 1.4.
Observing the curves of Fig. 1.4, and noting that time is always increasing, i.e., the
next point x1(t) is always ahead of x1(t − Δt), we can conclude that a point closer
to convergence is located outside the internal area of the concavity of the convergence
curve, when the curve behaves as shown in Fig. 1.4(a). For example,
for t = 0.060, x1(0.060) ≈ 0.3; in this case, a better estimate of the point would
be x1(0.061) = 0.45 for Δt = 0.001 s. Restarting the network from this initial state
would generate a convergence curve as illustrated in Fig. 1.5. However, this
rule applies only to curves of types (a) and (c) in Fig. 1.4, since for
curves (b) and (d) a better estimate of the point is located inside the internal area
of the concavity.
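The idea of jumping ahead of a smooth, saturating trajectory can be illustrated with a standard device: Aitken's delta-squared extrapolation (a stand-in chosen here for illustration; the chapter's own rule classifies curvature types instead, as described below). Given three equally spaced samples of x1(t), it estimates the limit the curve is approaching:

```python
import math

def aitken_limit(x0, x1, x2):
    """Aitken delta-squared estimate of the limit of a convergent sequence
    sampled at three equally spaced times. It is exact for trajectories of
    the form L - A*exp(-k*t), the shape of a first-order convergence."""
    d1, d2 = x1 - x0, x2 - x1
    if d2 == d1:                  # straight line: no curvature to exploit
        return x2
    return x2 - d2 * d2 / (d2 - d1)

# Example: samples of x(t) = 1 - exp(-2 t) at t = 0, 0.1, 0.2.
samples = [1.0 - math.exp(-2.0 * t) for t in (0.0, 0.1, 0.2)]
estimate = aitken_limit(*samples)   # very close to the true limit 1.0
```

Restarting the integration from such an estimate corresponds to moving the state onto a new orbit with a smaller Euclidean distance to the equilibrium point.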
Observing the particularities of the possible curvatures of the convergence curve
in the space-time plane, the following parameters were created, relating to:
• curvature {curve, straight line};
• concavity, when it exists, {concave downwards, concave upwards};
[Fig. 1.4 contains four panels, (a)–(d), each plotting a state variable x1(t) against t on the range [0, 1] × [0, 1].]

Fig. 1.4 Dynamical convergences of first order (single variable systems): graphs of evolution
of state variables in time

• time rating {high, mean, low};
• variable xi (t) {increasing, decreasing}.
In order to assess these parameters, the network must provide at least three points (P0 , P1 , P2 ) on the convergence curve. These three points are then normalized along both the horizontal and vertical axes, to avoid numerical problems in the algorithm used to estimate a better point. The chosen normalization equation is presented in (1.9), where M is the maximum and m is the minimum of the three values to be normalized, and a and b are chosen according to the desired range of the normalized values. In this work, values are normalized into the range [0, 1], so a = 0 and b = 1; z is the value to be normalized, and zN the normalized value.

zN = (b(z − m) − a(z − M)) / (M − m)   (1.9)
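As a quick illustration, Eq. (1.9) can be implemented directly; the function name and the list-based interface below are my own choices, not part of the chapter.

```python
def normalize(values, a=0.0, b=1.0):
    """Map the values into [a, b] using Eq. (1.9); M and m are the
    maximum and minimum of the values to be normalized."""
    M, m = max(values), min(values)
    return [(b * (z - m) - a * (z - M)) / (M - m) for z in values]
```

For example, normalize([0.3, 0.45, 0.6]) gives [0, 0.5, 1] up to rounding.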
1 New HIS to Solve LP and QP Optimization and Increase Convergence Speed 11

Fig. 1.5 Action of the HIS to calculate a better point in dynamics evolving over time (annotated: original dynamic, advanced dynamic, point predicted by the heuristic rule (HIS-1), the algorithm time gain, and the time at which the ANN is restarted)

From the normalized points, two spatial vectors v1 and v2 are computed, and the following relevant information is obtained from them: the Euclidean norm and the angle (θi ) of each vector with respect to the horizontal axis, and the angle between them (Δ θ = θ2 − θ1 ). When there is a spatial curvature over the normalized points, classification regions (decision regions) are generated as shown in Fig. 1.6. To better understand Fig. 1.6, note that the normalized initial point (P0N ) is always found at the beginning of each region (S4, S5, S6, S7, S8 and S9). Besides these six possibilities, there are three others that occur for straight-line convergences, where Δ θ is approximately zero.
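A minimal sketch of the vector construction and the straight-versus-curved test described above; the tolerance value and the returned labels are my own illustrative choices, not the chapter's.

```python
import math

def classify(P0, P1, P2, tol=1e-3):
    """Label the pattern of three normalized points: 'straight' when the
    angle between the two chord vectors is ~0, else 'curve' plus its
    concavity. Threshold and return encoding are illustrative choices."""
    v1 = (P1[0] - P0[0], P1[1] - P0[1])
    v2 = (P2[0] - P1[0], P2[1] - P1[1])
    theta1 = math.atan2(v1[1], v1[0])
    theta2 = math.atan2(v2[1], v2[0])
    dtheta = theta2 - theta1
    if abs(dtheta) < tol:          # straight-line regime (regions S1-S3)
        return ('straight', theta1)
    # turning clockwise (dtheta < 0) corresponds to a concave-downward arc
    return ('curve', 'down' if dtheta < 0 else 'up')
```

For a decelerating increasing curve such as Fig. 1.4(a), the chords turn clockwise and the classifier reports a downward concavity.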

Fig. 1.6 Classification regions to patterns that have spatial curvature

Region 4 (S4) is similar to the beginning of the convergence shown in Fig. 1.4(a), while region 5 (S5) describes a behavior close to the curve formed by the end of the convergence in Fig. 1.4(a) and the beginning of that in Fig. 1.4(b). Region 6 (S6) corresponds to a convergence similar to Fig. 1.4(b).
Region 7 (S7) describes dynamics of the type shown in Fig. 1.4(d), and region 8 (S8) a behavior close to the curve formed by the end of the convergence in Fig. 1.4(c) and the beginning of that in Fig. 1.4(d), while region 9 (S9) represents a behavior close to Fig. 1.4(c).
The straight-line regime can be of three types: increasing - the derivative of the curve is positive and not close to zero, corresponding to region 1 (S1); constant - the derivative is approximately zero, corresponding to region 2 (S2); and decreasing - the derivative is negative and not close to zero, corresponding to region 3 (S3). Note that the regimes described by regions S5 and S8 can be considered close to the constant straight-line regime. Thus, we modeled the following heuristic rules:

Rule 1: if <curvature is a straight line> and <variable increases> and <time rating is high or mean> then <Action I>.
Rule 2: if <curvature is a straight line> and <time rating is low> then <Action II>.
Rule 3: if <curvature is a straight line> and <variable decreases> and <time rating is high or mean> then <Action III>.
Rule 4: if <curvature is a curve> and <variable increases> and <time rating is high or mean> and <concavity is concave downward> then <Action IV>.
Rule 5: if <curvature is a curve> and <time rating is low> and <concavity is concave downward> then <Action II>.
Rule 6: if <curvature is a curve> and <variable decreases> and <time rating is high or mean> and <concavity is concave downward> then <Action V>.
Rule 7: if <curvature is a curve> and <variable increases> and <time rating is high or mean> and <concavity is concave upward> then <Action V>.
Rule 8: if <curvature is a curve> and <time rating is low> and <concavity is concave upward> then <Action II>.
Rule 9: if <curvature is a curve> and <variable decreases> and <time rating is high or mean> and <concavity is concave upward> then <Action VI>.

The actions referenced in the rules invoke sub-functions that return a better value for the next initialization point of the network. The straight-line condition implies that either the system is converging very slowly or the step size of the integration algorithm is very small. In this case, the linear function shown in (1.10) can be applied:

P3N = a(P2N − P0N ) + P2N ,   (1.10)

where a is a constant that yields a gain in the magnitude of v3 (v3 = P3N − P2N ). The rules above are summarized in Table 1.1.

Table 1.1 Description of the actions to be taken due to the heuristic rules according to each decision region

Action  Description of the Action                                        Regions
I       P3N is calculated according to (1.10) with a = a1.               S1
II      P3N is calculated according to (1.10) with a = a2.               S2, S5, S8
III     P3N is calculated according to (1.10) with a = a3.               S3
IV      P3N takes the coordinates of the topmost point of the circle     S4
        passing through the normalized points P0N, P1N and P2N.
V       P3N takes the coordinates of the rightmost point of the circle   S6, S7
        passing through the normalized points P0N, P1N and P2N.
VI      P3N takes the coordinates of the bottommost point of the circle  S9
        passing through the normalized points P0N, P1N and P2N.

Having the normalized point P3N estimated by the heuristic rules, we must denormalize it to obtain the value P3 , which is then used to restart the recurrent network.
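Actions I-III and the return to the original scale can be sketched as below; inverting Eq. (1.9) gives z = m + (zN − a)(M − m)/(b − a). The function names and per-coordinate tuple interface are my own choices.

```python
def linear_action(P0N, P2N, a):
    """Actions I-III: P3N = a*(P2N - P0N) + P2N, Eq. (1.10), per coordinate."""
    return tuple(a * (p2 - p0) + p2 for p0, p2 in zip(P0N, P2N))

def denormalize(zN, m, M, a=0.0, b=1.0):
    """Invert Eq. (1.9): recover the original-scale value z from zN."""
    return m + (zN - a) * (M - m) / (b - a)
```

With a = 0 and b = 1 the inverse reduces to z = m + zN (M − m), so normalizing 0.45 over [0.3, 0.6] to 0.5 and denormalizing returns 0.45.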

1.3.2 Method of Tendency Based on the Dynamics in State-Space (TDSS)
To calculate a better point using the dynamics in state-space, two facts must be pointed out: first, the convergence of each variable depends on the convergence of the other variables; second, when working in state-space, the variations in the state variables are mapped by taking one variable xi as reference, so that the curve is free to evolve in all directions, which is not the case in the space-time plane. Bearing these facts in mind, and observing the convergence orbits in state-space shown in Fig. 1.2 and Fig. 1.3, we conclude that a better point in state-space must be located inside the concavity of the orbit of the system dynamics.
The calculation in state-space proceeds as follows: first we take one of the variables as a reference (for example, x1 (t)) and draw n − 1 complex planes (for a system with n variables); thus we have the planes x1 (t)0x2 (t), x1 (t)0x3 (t), . . ., x1 (t)0xn (t). Having the three points P2 = x(t), P1 = x(t − Δ t) and P0 = x(t − 2Δ t) provided by the network, we can create state vectors in each of the n − 1 complex planes. For example, for the plane x1 (t)0x2 (t), we have:


v1 (t) = (x1 (t) + i · x2 (t)) − (x1 (t − Δ t) + i · x2 (t − Δ t))   (1.11)
v2 (t) = (x1 (t − Δ t) + i · x2 (t − Δ t)) − (x1 (t − 2Δ t) + i · x2 (t − 2Δ t))   (1.12)

From these vectors, we carry out a rotation transformation of the axes using (1.13) and (1.14), according to Fig. 1.7:

v1 ' = v1 exp(−iθ1 )   (1.13)
v2 ' = v2 exp(−iθ1 )   (1.14)

where θ1 is the angle of v1 (t). We also apply a translation transformation using:

x1 '(t − 2Δ t) = −|v1 |
x2 '(t − 2Δ t) = 0
x1 '(t − Δ t) = 0
x2 '(t − Δ t) = 0
x1 '(t) = x1 (t) − x1 (t − 2Δ t)
x2 '(t) = x2 (t) − x2 (t − 2Δ t)   (1.15)
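Treating each plane's points as complex numbers x1 + i·x2 makes the transformation compact. The sketch below rotates by −θ1 and translates so that P1' sits at the origin and P0' = (−|v1|, 0), which is the configuration implied by Eqs. (1.13)-(1.15); the function name and the choice to return the third point as the rotated v2 are my own.

```python
import cmath

def transform(p0, p1, p2):
    """Rotate by -theta1 and translate three state-space points (given as
    complex numbers x1 + i*x2) so that P1' -> 0 and P0' -> (-|v1|, 0);
    the third point becomes the rotated vector v2."""
    v1 = p1 - p0                              # Eq. (1.11)
    v2 = p2 - p1                              # Eq. (1.12)
    rot = cmath.exp(-1j * cmath.phase(v1))    # rotation by -theta1, Eqs. (1.13)-(1.14)
    return complex(-abs(v1), 0.0), 0j, v2 * rot
```

For instance, transform(0j, 1j, 1 + 1j) places P0' at −1, P1' at the origin, and maps P2' to approximately −i.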

Fig. 1.7 Translation and rotation transformation in state-space

The rotation and translation transformations facilitate analyzing the behavior of vector v2 relative to vector v1 . Heuristic rules can then be applied to set the gain in magnitude and angle of the state vectors, yielding a vector v3 ' for each of the n − 1 complex planes. Finally, a strategy determines the final value of the reference variable. An effective strategy is to add to the value of the reference variable the average of the increments of this variable calculated over the complex planes.
Fig. 1.8 shows two pictures associated with two examples of sets of heuristic rules that can be used to produce vector v3 '. In each picture, the leftmost point on the circle is P0 ', the point fixed at the center of the circle is P1 ', and the points marked with small circles on the circle represent several possibilities for point P2 '. Finally, the results of the heuristic rules, the points P3 ', are marked with green circles. To obtain the final point P3 , we apply the inverse translation and rotation to the point P3 ', thus generating the appropriate value to initialize the recurrent network.
Fig. 1.9 shows an example of applying the heuristic rules to estimate a better point through the dynamics in state-space. The external curve represents the dynamics of the recurrent network without the heuristic rules, and the internal one represents the dynamics using the HIS (ANN and TDSS), which is based on the heuristic rules (TDSS). Note that, in the internal curve, the points marked with circles are the iteration points computed by the network and the points marked with plus signs are the points estimated by the heuristic rules (P3 ).

Fig. 1.8 Variations of the rules applied to points P0 , P1 , P2 to calculate the estimated point P3 : panels (a) and (b) plot X2 '(t) against X1 '(t)

Fig. 1.9 Graph of the convergence orbit of the state variables x1 and x2 : external curve - ANN (original orbit); internal curve - HIS (advanced orbit, with points predicted by the heuristic rule (HIS-2) and points computed by the recurrent NN)

By iteratively applying the ANN and the HR, the system reduces the curvature of the orbit in state-space, jumping from one orbit to another until it reaches the solution of the problem (the equilibrium point of the recurrent network).
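The alternation just described can be sketched as follows; the explicit Euler integrator, the fixed gain a = 2, the three-point window, and the toy one-variable dynamics are all illustrative stand-ins for the chapter's recurrent network and rule set, not the authors' implementation.

```python
def his_solve(f, x0, dt=1e-3, tol=1e-6, max_steps=100000):
    """Alternate integrator steps (stand-in for the recurrent ANN) with a
    heuristic jump built from the last three points (stand-in for the HR)."""
    x, window, steps = x0, [x0], 0
    while steps < max_steps:
        x = x + dt * f(x)              # one explicit Euler step of x' = f(x)
        window.append(x)
        steps += 1
        if abs(f(x)) < tol:            # equilibrium of the "network" reached
            return x, steps
        if len(window) == 3:
            p0, _, p2 = window
            x = 2.0 * (p2 - p0) + p2   # Action-style linear jump, gain a = 2
            window = [x]               # restart the dynamics from the estimate
    return x, steps
```

On the toy dynamics x' = 1 − x, the jumps shrink the error roughly three times faster per step than plain Euler integration, mirroring the "advance the dynamics" effect reported in the chapter.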

1.4 Case Studies


In order to test the proposed hybrid intelligent system, we chose mathematical programming problems previously solved in [16, 31, 32].
The following cases were solved using the HIS, which was implemented in MATLAB. In addition, to solve the differential equations involved, we implemented the level-four Richardson extrapolation method, which embeds the classical fourth-order Runge-Kutta method and uses a fixed integration step size [33].
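The chapter's integrator pairs Richardson extrapolation with classical RK4; the following is a generic step-doubling sketch of that idea for an autonomous system x' = f(x), not the authors' actual code.

```python
def rk4_step(f, x, h):
    """One classical fourth-order Runge-Kutta step for x' = f(x)."""
    k1 = f(x)
    k2 = f(x + 0.5 * h * k1)
    k3 = f(x + 0.5 * h * k2)
    k4 = f(x + h * k3)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def richardson_step(f, x, h):
    """Richardson extrapolation by step doubling: combine one step of size
    h with two of size h/2; for an order-4 base method the combination
    (16*fine - coarse)/15 cancels the leading error term."""
    coarse = rk4_step(f, x, h)
    fine = rk4_step(f, rk4_step(f, x, 0.5 * h), 0.5 * h)
    return fine + (fine - coarse) / 15.0

# demo: integrate x' = x from x(0) = 1 to t = 1 in ten steps of h = 0.1
x = 1.0
for _ in range(10):
    x = richardson_step(lambda y: y, x, 0.1)
# x now approximates e = 2.71828... to better than 1e-8
```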

1.4.1 Case 1: Mathematical Linear Programming Problem – Four Variables
A classical linear programming problem [31] was used to choose the parameters
of the developed heuristic rules. In [31] the problem was solved to compare the
performance of several recurrent network models. In this case we deal with the
following problem LP1 :

(LP1) min f(x) = −8x1 − 8x2 − 5x3 − 5x4
s.t.  x1 + x3 = 40
      x2 + x4 = 60
      −5x1 + 5x2 ≤ 0
      2x1 − 3x2 ≤ 0
      x ≥ 0
      x ∈ R4   (1.16)
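The optimum reported for LP1 later in the chapter (final cost −740 in Table 1.2) can be checked against a candidate vertex; the vertex x = (40, 40, 0, 20) below is derived by hand from the constraints and is not stated in the chapter.

```python
def lp1_objective(x):
    x1, x2, x3, x4 = x
    return -8 * x1 - 8 * x2 - 5 * x3 - 5 * x4

def lp1_feasible(x, eps=1e-9):
    """Check the equality, inequality, and nonnegativity constraints of LP1."""
    x1, x2, x3, x4 = x
    return (abs(x1 + x3 - 40) < eps and abs(x2 + x4 - 60) < eps
            and -5 * x1 + 5 * x2 <= eps and 2 * x1 - 3 * x2 <= eps
            and all(v >= -eps for v in x))

# candidate vertex derived by hand from the constraints (not from the chapter)
candidate = (40.0, 40.0, 0.0, 20.0)
```

Here lp1_feasible(candidate) holds and lp1_objective(candidate) evaluates to −740.0, matching the Phase-2 final cost reported in Table 1.2.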

1.4.2 Case 2: Mathematical Linear Programming Problem – Eleven Variables
With the heuristic rule parameters already defined in Case 1, we chose a larger-scale problem to compare the models' performance: a minimum-cost flow problem (LP2 ) solved in [32]. In this problem there are several nodes (points) representing consumers and suppliers, which are connected through paths (arcs). The aim of the problem is to calculate the flow through all paths so as to minimize the total cost, computed as the sum over all arcs of the product of the cost and the flow carried on that arc. The problem can be represented as a graph, as shown in Fig. 1.10, where the amount of flow on a node cannot exceed the capacity of the node. A network is a set of elements called nodes and a set of elements called arcs, each arc ei j being an ordered pair (i, j) of distinct nodes i and j. If ei j is an arc, then node i is called the tail of ei j and node j is called its head. The directed graph shown in Fig. 1.10 is formed of 6 nodes and 11 arcs.

Fig. 1.10 A directed graph of a minimum cost flow problem (nodes 1–6; arcs e12, e13, e15, e23, e42, e43, e53, e54, e56, e62, e64)

The cost of each arc is represented by vector c, its maximum flow capacity by vector b, and the demands by vector w. If wi ≤ 0, node i is a supplier (source); if wi > 0, node i is a consumer (sink). Suppose, for instance, that we have wT = [−9 4 17 1 − 5 − 8]. The matrix H is called the incidence matrix of the network. In general, the incidence matrix of a network with n nodes and m arcs has n rows and m columns. Thus, our matrix H has size 6 × 11 and is formed as follows:
H = [ −1 −1 −1  0  0  0  0  0  0  0  0
       1  0  0 −1  1  1  0  0  0  0  0
       0  1  0  1  0  0  1  1  0  0  0
       0  0  0  0 −1  0  0 −1  1  1  0
       0  0  1  0  0  0 −1  0 −1  0 −1
       0  0  0  0  0 −1  0  0  0 −1  1 ]   (1.17)

Considering that there are no losses in the network, i.e., everything that is produced is consumed, the sum of all the demands wi is zero. This condition makes the matrix H linearly dependent (LD): any row can be obtained as a linear combination of the other rows. To overcome this problem, we remove one row of the matrix H and the corresponding element of the column vector w. Here, the last row of H was removed, turning the matrix and the vector w into a truncated incidence matrix and a truncated vector, according to [32].
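The linear dependence is easy to verify: every column of the full incidence matrix (1.17) contains exactly one +1 (head) and one −1 (tail), so the rows sum to the zero vector. A quick check in Python, with the matrix transcribed from (1.17):

```python
# full 6 x 11 incidence matrix from Eq. (1.17)
H = [
    [-1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 1,  0,  0, -1,  1,  1,  0,  0,  0,  0,  0],
    [ 0,  1,  0,  1,  0,  0,  1,  1,  0,  0,  0],
    [ 0,  0,  0,  0, -1,  0,  0, -1,  1,  1,  0],
    [ 0,  0,  1,  0,  0,  0, -1,  0, -1,  0, -1],
    [ 0,  0,  0,  0,  0, -1,  0,  0,  0, -1,  1],
]
col_sums = [sum(row[j] for row in H) for j in range(11)]
# every column sums to 0, so the rows are linearly dependent
```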
H = [ −1 −1 −1  0  0  0  0  0  0  0  0
       1  0  0 −1  1  1  0  0  0  0  0
       0  1  0  1  0  0  1  1  0  0  0
       0  0  0  0 −1  0  0 −1  1  1  0
       0  0  1  0  0  0 −1  0 −1  0 −1 ]   (1.18)

Problem LP2 has the following form:

(LP2) min f(x) = 3x1 + 5x2 + x3 + x4 + 4x5 + x6 + 6x7 + x8 + x9 + x10 + x11
s.t.  −x1 − x2 − x3 = −9
      x1 − x4 + x5 + x6 = 4
      x2 + x4 + x7 + x8 = 17
      −x5 − x8 + x9 + x10 = 1
      x3 − x7 − x9 − x11 = −5
      x ≤ [2 10 10 6 8 7 9 9 10 8 6]T
      x ≥ 0
      x ∈ R11   (1.19)

1.4.3 Case 3: Mathematical Quadratic Programming Problem – Three Variables
As an example of quadratic programming (QP), the economic dispatch problem was solved as formulated in [16]. In this problem, the aim is to minimize the total generation cost while meeting the demand of the power system. The formulation considers 3 thermal generators (n = 3) connected to a single load.
Defining fi as the generation cost of the i-th generation unit, xi as the power generated by the i-th unit, and w as the total power demand of the load, with the limits xi,min , xi,max given by the physical limitations of the i-th unit, the power economic dispatch is expressed as QP:
(QP) min f(x) = ∑_{i=1}^{n} fi (xi )
s.t.  ∑_{i=1}^{n} xi − w = 0
      xmin ≤ x ≤ xmax
      x ∈ Rn   (1.20)

Data were obtained from [16]: xmin = [150 100 50]T in MW, xmax = [600 400 200]T in MW, w = 850 MW, and the following costs for the generating units:

f1 (x1 ) = 561 + 7.92x1 + 0.00156x1²
f2 (x2 ) = 310 + 7.85x2 + 0.00194x2²
f3 (x3 ) = 78 + 7.95x3 + 0.00482x3²   (1.21)
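A standard sanity check for this dispatch: when the box limits are inactive, the optimum equalizes the marginal costs, fi '(xi ) = 2ai xi + bi = λ for every unit, subject to Σxi = w. The closed-form computation below is my own verification, not the chapter's network; with the Case 3 data the resulting dispatch (≈ [392.5, 333.6, 123.9] MW) lies strictly inside the limits, so the assumption holds here.

```python
# unit data from Case 3: f_i(x_i) = c_i + b_i*x_i + a_i*x_i^2, demand w
a = [0.00156, 0.00194, 0.00482]
b = [7.92, 7.85, 7.95]
w = 850.0

# equal incremental cost: 2*a_i*x_i + b_i = lam for every unit, sum(x) = w
lam = (w + sum(bi / (2 * ai) for ai, bi in zip(a, b))) \
      / sum(1.0 / (2 * ai) for ai in a)
x = [(lam - bi) / (2 * ai) for ai, bi in zip(a, b)]
```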

1.5 Simulations
The parameters chosen to simulate problems LP1 and LP2 were: an integration step size of 1e−3 ; s = 100 for the neural network parameter in both the first and second phases; and e = 1.1 for the network parameter in the second phase. The main results of the LP1 and LP2 simulations are presented in Table 1.2.

Table 1.2 Simulation Results of LP1 and LP2

                                                      LP1                         LP2
                                              ANN     HIS-1a   HIS-2b    ANN     HIS-1a   HIS-2b
No. points by the ANN (Phase 1)               8316    1000     2391      8670    2965     3099
No. points by the HR (Phase 1)                -       331      795       -       986      1031
Total no. points (Phase 1)                    8316    1331     3186      8670    3951     4130
Normalized computer processing time (Phase 1) 1.00    0.12     0.29      1.00    0.33     0.35
Time instant when the switch is closed (s) = t1 8.32  1.33     3.19      8.67    3.95     4.13
No. of calculated points by the ANN (Phase 2) 8607    8277     8316      7692    7251     7263
No. of calculated points in both phases       16923   9608     11502     16362   11202    11393
Initial cost (Phase 1) = f(x(0))              -260.00 -260.00  -260.00   0.00    0.00     0.00
Final cost (Phase 1) = f(x(t1))               -741.77 -741.82  -741.78   55.67   55.65    55.63
Final cost (Phase 2) = f(x(tend))             -740.00 -740.00  -740.00   56.00   55.99    56.00

a HIS-1 = ANN and method of Tendency based on the Dynamics in Space-Time (TDST).
b HIS-2 = ANN and method of Tendency based on the Dynamics in State-Space (TDSS).

The results shown in row 5 of Table 1.2 indicate that both proposed hybrid systems (HIS-1 and HIS-2) were able to advance the dynamics of the simulated linear problems efficiently. This greatly reduces the time needed to process the network algorithm, since at each integration step n ODEs are solved, where n is the number of variables in the problem. For instance, for problem LP1 , the ANN calculated 8316 points by the end of the first phase, so 33264 ODEs were solved; using the HIS-1, only 1000 ANN points, and hence 4000 ODEs, were needed to reach the end of the first phase. In other words, the HIS-1 reduced the computational effort by approximately 88% compared to the ANN; for the LP2 problem, this rate was approximately 66%. The computational-effort reductions for problems LP1 and LP2 when comparing the ANN to the HIS-2 were 71% and 64%, respectively.
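The quoted percentages follow from Table 1.2 when only the ANN-computed (ODE-integrated) points are compared, since the heuristic estimates require no ODE solving. A quick arithmetic check:

```python
# ANN-computed points in Phase 1, transcribed from Table 1.2
ann_points = {'LP1': 8316, 'LP2': 8670}
his_points = {('LP1', 'HIS-1'): 1000, ('LP1', 'HIS-2'): 2391,
              ('LP2', 'HIS-1'): 2965, ('LP2', 'HIS-2'): 3099}

# percentage reduction in ODE-integrated points relative to the plain ANN
reduction = {k: round(100 * (1 - pts / ann_points[k[0]]))
             for k, pts in his_points.items()}
# -> {('LP1', 'HIS-1'): 88, ('LP1', 'HIS-2'): 71,
#     ('LP2', 'HIS-1'): 66, ('LP2', 'HIS-2'): 64}
```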
Figs. 1.11-1.13 present the simulation results for the LP1 problem, and Figs. 1.14-1.16 for the LP2 problem.
The parameters chosen to simulate problem QP were: an integration step size of 1e−2 and s = 50 for the neural network parameter in the first phase. The initial condition used was x(0) = [400 300 150]T .

Fig. 1.11 Dynamics of the problem LP1 obtained by the ANN with the initial condition x(0) = [10 10 10 10]T : (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1 (t) as reference

Fig. 1.12 Dynamics of the problem LP1 obtained by the HIS-1 with the initial condition x(0) = [10 10 10 10]T : (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1 (t) as reference

The results shown in row 5 of Table 1.3 indicate that both proposed hybrid systems were able to advance the dynamics of the simulated quadratic problem efficiently. This greatly reduces the time needed to process the network algorithm, since at each integration step n ODEs are solved, where n is the number of variables in the problem. For instance, for the problem QP, the ANN calculated 172629 points by the end of the first phase, so 517887 ODEs were solved; using the HIS-1, only 8257 points, yielding 24771 ODEs, were needed to reach the end of the first phase. In other words, the HIS-1 reduced the computational effort by approximately 95% compared to the

Fig. 1.13 Dynamics of the problem LP1 obtained by the HIS-2 with the initial condition x(0) = [10 10 10 10]T : (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1 (t) as reference

Fig. 1.14 Dynamics of the problem LP2 obtained by the ANN with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]T : (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1 (t) as reference

ANN. For the HIS-2 on the QP problem, this rate was approximately 70%. Figs. 1.17-1.19 present the simulation results for the problem QP.
All case studies were carried out on the same computer; we therefore take the processing time of the ANN in phase 1 as the base for normalizing the hybrid cases in the same phase. We point out that the hybrid systems were not used in phase 2. As a result, we observed that for the LP1 case the HIS-1 took 12% of the processing time and the HIS-2 29%; for the LP2 case the HIS-1 took 33% and the HIS-2 35%; and for the QP case the HIS-1 took 6% and the HIS-2 45%. It is important

Fig. 1.15 Dynamics of the problem LP2 obtained by the HIS-1 with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]T : (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1 (t) as reference

Fig. 1.16 Dynamics of the problem LP2 obtained by the HIS-2 with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]T : (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1 (t) as reference

to note that the efficiency of the heuristic rules varies with the type of problem. In this work, the results showed that the HIS-1 yielded the better performance, specifically on the LP1 and QP problems. We thus observed a decrease in processing time produced by the implemented heuristic rules, together with a reduction in the number of ODEs computed. These rules estimate the next value of each problem variable throughout the convergence. We highlight that, since the ODE solver also computes points during the application of the HIS, the proposed systems can correct themselves in the case of an incorrect estimate, which makes them resilient.

Table 1.3 Simulation Results of QP

                                                ANN      HIS-1a   HIS-2b
No. points by the ANN (Phase 1)                 172629   8257     51900
No. points by the HR (Phase 1)                  -        2750     17299
Total no. points (Phase 1)                      172629   11007    69199
Normalized computer processing time (Phase 1)   1.00     0.06     0.45
Time instant when the switch is closed (s) = t1 1726.29  110.07   691.99
Final cost (Phase 1) = f(x(t1))                 22680.05 22680.05 22680.05

a HIS-1 = ANN and method of Tendency based on the Dynamics in Space-Time (TDST).
b HIS-2 = ANN and method of Tendency based on the Dynamics in State-Space (TDSS).

Fig. 1.17 Dynamics of the problem QP obtained by the ANN with the initial condition x(0) = [400 300 150]T : (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1 (t) as reference

1.6 Conclusion
In this paper, two Hybrid Intelligent Systems have been proposed. These systems combine the Maa and Shanblatt network with heuristic rules. The Maa and Shanblatt network is a two-phase recurrent neural network that provides the exact solution of linear and quadratic programming problems. Compared to conventional linear and nonlinear optimization techniques, the two-phase network formulation is advantageous because no matrix inversion is required. The main aim of the proposed HIS is to increase the speed of convergence towards the optimal point, which is guaranteed by the ANN. In the cases presented, the optimal convergence

Fig. 1.18 Dynamics of the problem QP obtained by the HIS-1 with the initial condition x(0) = [400 300 150]T : (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1 (t) as reference

Fig. 1.19 Dynamics of the problem QP obtained by the HIS-2 with the initial condition x(0) = [400 300 150]T : (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1 (t) as reference

was reached. The proposed systems thus retain both advantages. The simulation analyses show a reduction in computational effort of approximately 95% compared to the ANN in the QP case solved in this paper, with guaranteed optimal convergence and without inverting matrices. The proposed HIS was developed with a view to solving large-scale operational planning problems, which may be addressed in future work: a large-scale economic power dispatch, scheduling hydro, thermal and wind power plants to minimize the overall production cost while satisfying the load demand in the mid-term operation planning of hydrothermal generation systems. In future work, we will propose the combination of these heuristic rules and/or their application in the second phase of the Maa and Shanblatt method.

References
1. Hopfield, J.J.: Neural networks and physical systems with emergent collective computa-
tional abilities. Proc. Natl. Acad. Sci. USA 79, 2552–2558 (1982)
2. Haykin, S.: Neural networks: a comprehensive foundation, 2nd edn. Prentice Hall, USA
(1999)
3. Hopfield, J.J.: Neurons with graded response have collective computational properties
like those of two-state neurons. Proc. Natl. Acad. Sci. USA 81, 3088–3092 (1984)
4. Hopfield, J.J.: Learning algorithms and probability distributions in feed-forward and
feed-back networks. Proc. Natl. Acad. Sci. USA 84, 8429–8433 (1987)
5. Hopfield, J.J.: The effectiveness of analogue neural network hardware. Network: Com-
putation in Neural Systems 1(1), 27–40 (1990)
6. Hopfield, J.J., Feinstein, D.I., Palmer, R.G.: Unlearning has a stabilizing effect in collec-
tive memories. Nature 304, 158–159 (1983)
7. Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problem.
Biological Cybernetics 52, 141–152 (1985)
8. Hopfield, J.J., Tank, D.W.: Computing with Neural Circuits: A Model. Science 233(8),
625–633 (1986)
9. Tank, D.W., Hopfield, J.J.: Simple Neural Optimization Networks: An A/D Converter,
Signal Decision Circuit, and a Linear Programming Circuit. IEEE Trans. on Circuits and
Systems 33(5), 533–541 (1986)
10. Ludemir, T.B., Braga, A.P., Carvalho, A.C.P.L.F.: Redes Neurais Artificiais: Teoria e Aplicações, 1st edn. LTC - Livros Técnicos e Científicos Editora S.A., Rio de Janeiro (2000)
11. Pyne, I.B.: Linear Programming on an electronic analogue computer. Trans. AIEE. Part
I (Comm. & Elect.) 75, 139–143 (1956)
12. Kennedy, M.P., Chua, L.O.: Unifying Tank and Hopfield Linear Programming Circuit
and the Canonical Nonlinear Programming Circuit of Chua and Lin. IEEE Trans. on
Circuits and Systems 34(2), 210–214 (1987)
13. Chiu, C., Maa, C.Y., Shanblatt, M.A.: An artificial neural network algorithm for dynamic
programming. Int. J. Neural Syst. 1(3), 211–220 (1990)
14. Maa, C.Y., Shanblatt, M.A.: A Two-Phase Optimization Neural Network. IEEE Trans-
actions on Neural Networks 3(6), 1003–1009 (1992)
15. Kennedy, M.P., Chua, L.O.: Neural Networks for Nonlinear Programming. IEEE Trans.
on Circuits and Systems 35(5), 210–220 (1988)
16. Maa, C.Y., Shanblatt, M.A.: Linear and Quadratic Programming Neural Network Anal-
ysis. IEEE Transactions on Neural Networks 3(4), 580–594 (1992)
17. Chiu, C., Maa, C.Y., Shanblatt, M.A.: Energy Function Analysis of Dynamic Program-
ming Neural Networks. IEEE Transactions on Neural Networks 2(4) (July 1991)
18. Xia, Y.S.: A New Neural Network for Solving Linear and Quadratic Programming Prob-
lems. IEEE Transactions on Neural Networks 7(6), 1544–1547 (1996)
19. Tao, Q., Cao, J.D., Xue, M.S., Qiao, H.: A High Performance Neural Network for Solv-
ing Nonlinear Programming Problems with Hybrid Constraints. Phys. Lett. A 288(2),
88–94 (2001)
20. Xia, Y.S., Wang, J.: A General Methodology for Designing Globally Convergent Op-
timization Neural Networks. IEEE Transactions on Neural Networks 9(6), 1331–1343
(1998)

21. Xia, Y.S., Wang, J.: A Recurrent Neural Network for Solving Nonlinear Convex Pro-
grams Subject to Linear Constraints. IEEE Transactions on Neural Networks 16(2),
379–386 (2005)
22. Dieu, V.N., Ongsakul, W.: Enhanced Merit Order and Augmented Lagrange Hop-
field Network for Hydrothermal Scheduling. Electrical Power and Energy Systems 30,
93–101 (2008)
23. Naresh, R., Dubey, J., Sharma, J.: Two-phase Neural Network Based Modeling Frame-
work of Constrained Economic Load Dispatch. IEE Proc. Gener. Transm. Distrib. 151(3)
(May 2004)
24. Aquino, R.R.B.: Recurrent Artificial Neural Networks: an application to optimization of
hydro thermal power systems (in Portuguese), Ph.D. Thesis, COPELE/UFPE, Campina
Grande, Brazil (January 2001)
25. Rosas, P., Aquino, R.R.B., et al.: Study of Impacts of a Large Penetration of Wind Power
and Distributed Power Generation as a Whole on the Brazilian Power System. In: Euro-
pean Wind Energy Conference (EWEC), London (November 2004)
26. Witten, I.H., Frank, E.: Data Mining Practical Machine Learning Tools and Techniques,
2nd edn. Morgan Kaufmann, San Francisco (2005)
27. Mitra, S., Mitra, M., Chaudhuri, B.B.: Pattern Defined Heuristic Rules and Directional
Histogram Based Online ECG Parameter Extraction. Measurement 42, 150–156 (2009)
28. Tuncel, G.: A Heuristic Rule-Based Approach for Dynamic Scheduling of Flexible Man-
ufacturing Systems. In: Levner, E. (ed.) Multiprocessor Scheduling: Theory and Appli-
cations, December 2007, p. 436. Itech Education and Publishing, Vienna (2007)
29. Baykasoglu, A., Ozbakir, L., Dereli, T.: Multiple Dispatching Rule Based Heuristic for
Multi-Objective Scheduling of Job Shops Using Tabu Search. In: Proceedings of MIM
2002: 5th International Conference on Managing Innovations in Manufacturing (MIM)
Milwaukee, Wisconsin, USA, September 9-11, pp. 1–6 (2002)
30. Idris, N., Baba, S., Abdullah, R.: Using Heuristic Rules from Sentence Decomposition
of Experts Summaries to Detect Students Summarizing Strategies. International Journal
of Human and Social Sciences 2, 1 (Winter 2008), www.waset.org
31. Zak, S.H., Upatising, V., Hui, S.: Solving Linear Programming Problems with Neural
Networks: A Comparative Study. IEEE Transactions on Neural Networks 6(1), 94–104
(1995)
32. Chvatal, V.: Linear Programming. W.H. Freeman and Company, New York (1983)
33. Lastman, G.J., Sinha, N.K.: Microcomputer-based numerical methods for science and engineering. Saunders College Publishing, USA (1988)
Chapter 2
A Novel Optimization Algorithm Based on
Reinforcement Learning

Janusz A. Starzyk, Yinyin Liu, and Sebastian Batog

Abstract. In this chapter, an efficient optimization algorithm is presented for problems with hard-to-evaluate objective functions. It uses the reinforcement learning principle to determine the particle moves in the search for the optimum. A model of successful actions is built, and future actions are based on past experience. The step increment combines exploitation of the known search path with exploration for an improved search direction. The algorithm requires neither prior knowledge of the objective function nor any characteristics of that function. It is simple, intuitive, and easy to implement and tune. The optimization algorithm was tested on several multi-variable functions and compared with other widely used random-search optimization algorithms. Furthermore, the training of a multi-layer perceptron, to find a set of optimized weights, is treated as an optimization problem; the optimized multi-layer perceptron was applied to classification of the Iris database. Finally, the algorithm is used in image recognition to find a familiar object with retina sampling and micro-saccades.

2.1 Introduction
Optimization is a process to find the maximum or the minimum function value
within given constraints by changing the values of its multiple variables. It is
essential for solving complex engineering problems in areas such as computer science,
aerospace, and machine intelligence. When the analytical relation
Janusz A. Starzyk · Yinyin Liu
Ohio University, School of Electrical Engineering and Computer Science, U.S.A.
e-mail: starzyk@bobcat.ent.ohiou.edu,yliu@bobcat.ent.ohiou.edu
Sebastian Batog
Silesian University of Technology, Institute Of Computer Science, Poland
e-mail: sebastian.batog@gmail.com

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 27–47.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
28 J.A. Starzyk, Y. Liu, and S. Batog

between the variables and the objective function value is explicitly known, analyti-
cal methods, such as Lagrange multiplier methods [1], interior point methods [18],
Newton methods [30], gradient descent methods [25], etc., can be applied. How-
ever, in many practical applications, analytical methods do not apply. This happens
when the objective functions are unknown, when relations between variables and
function value are not given or difficult to find, when the functions are known while
their derivatives are not applicable, or when the optimum value of function cannot
be verified. In these cases, iterative search processes are required to find the function
optimum.
Direct search algorithms [10] contain a set of optimization methods that do not
require derivatives and do not approximate either the objective functions or their
derivatives. These algorithms find locations with better function values following a
search strategy. They only need to compare the objective function values in succes-
sive iterative steps to make the move decision. Within the category of direct search,
distinctions can be made among three classes including pattern search methods [28],
simplex methods [6], and adaptive sets of search directions [23]. In pattern search
methods, the variables of the function are varied either by steps of predetermined
magnitude or by step sizes that are reduced by the same factor [15]. Simplex methods
construct a simplex in $\Re^N$ using N+1 points and use the simplex to drive the
search for optimum. The methods with adaptive sets of search directions, proposed
by Rosenbrock [23] and Powell [21], construct conjugate directions using the infor-
mation about the curvature of the objective function during the search.
In order to avoid local minima, random search methods are developed utilizing
randomness in setting the initial search points and other search parameters like the
search direction or the step size. In Optimized Step-Size Random Search (OSSRS)
[24], the step size is determined by fitting a quadratic function for the optimized
function values in each of the random directions. The random direction is generated
with a normal distribution of a given mean and standard deviation. Monte-Carlo
optimizations adopt randomness in the search process to create opportunities
to escape from local minima. Simulated Annealing (SA) [13] is one typical kind
of Monte-Carlo algorithm. It exploits the analogy between the search for a mini-
mum in the optimization problem and the annealing process in which a metal cools
and stabilizes into a minimum-energy crystalline structure. It accepts a move to a
new position with a worse function value with a probability controlled by the
"temperature" parameter; this probability decreases along the "cooling process".
SA can deal with highly nonlinear, chaotic problems provided that the cooling
schedule and other parameters are carefully tuned.
Particle Swarm Optimization (PSO) [11] is a population-based evolutionary com-
putational algorithm. It exploits the cooperation within the solution population in-
stead of the competition among them. At each iteration in PSO, a group of search
particles make moves in a mutually coordinated fashion. The step size of a particle is
a function of both the best solution found by that particle and the best solution found
so far by all the particles in the group. The use of a population of search particles
and the cooperation among them enable the algorithm to evaluate function values in
a wide range of variables in the input space and to find the optimum position. Each
2 A Novel Optimization Algorithm Based on Reinforcement Learning 29

particle only remembers its best solution and the global best solution of the group
to determine its step sizes.
Generally, during the course of the search, these optimization methods make a sequence
of decisions on the step sizes and obtain a number of function values. To implement
an efficient search for the optimum point, it is desirable that such historical
information be utilized in the optimization process.
Reinforcement Learning (RL) [27] is a type of learning process that maximizes certain
numerical values by combining exploration and exploitation and using rewards
as learning stimuli. In the reinforcement learning problem, the learning agent
performs experiments to interact with the unknown environment and accumulates
knowledge during this process. It is a trial-and-error exploratory process with
the objective to find the optimum action. During this process, an agent can learn
to build the model of the environment to instruct its search, so that the agent can
predict the environment’s response to its actions and choose the most useful actions
for its objectives based on its past exploring experience.
Surrogate-based optimization refers to the idea of speeding up the optimization process
by using surrogates for the objective and constraint functions. The surrogates also
allow for the optimization of problems with non-smooth or noisy responses, and
can provide insight into the nature of the design space. The max-min SAGA
approach [20] searches for designs that have the best worst-case performance in
the presence of parameter uncertainty. By leveraging a trust-region approach which
uses computationally cheap surrogate models, it allows for the
possibility of achieving robust design solutions on a limited computational budget.
Another example of a surrogate based optimization is the surrogate assisted
Hooke-Jeeves (SAHJA) algorithm [8] which can be used as a local component of a
global optimization algorithm. This local searcher uses the Hooke-Jeeves method,
which performs its exploration of the input space intelligently employing both the
real fitness and an approximated function.
The idea of building knowledge about an unknown problem through exploration
can be applied to optimization problems. To find the optimum of an unknown
multivariable function, an efficient search procedure can be performed using only
historical information from conducted experiments to expedite the search. In this
chapter, a novel and efficient optimization algorithm based on reinforcement learn-
ing is presented. This algorithm uses simple search operators and will be called
reinforcement learning optimization (RLO) in the later sections. It does not require
any prior knowledge of the objective function or function’s gradient information,
nor does it require any characteristics of the objective function. In addition, it is
conceptually very simple and easy to implement. This approach to optimization
is compatible with the neural networks and learning through interaction, thus it is
useful for systems of embodied intelligence and motivated learning as presented in
[26]. The following section presents the RLO method and illustrates it within several
machine learning applications.

2.2 Optimization Algorithm


2.2.1 Basic Search Procedure
An N-variable optimization objective function

$$V = f(p_1, p_2, \ldots, p_N) \qquad (p_1, p_2, \ldots, p_N, V \in \Re^1)$$

could have several local minima and several global minima $V_{opt_1}, \ldots, V_{opt_N}$. It is
desired that the search process, initiated from a random point, finds a path to the
global optimum point. Unlike particle swarm optimization [11], this process can be
performed with a single search particle that learns how to find its way to the opti-
mum point. It does not require the cooperation among a group of particles, although
implementing the cooperation among several search particles may further enhance
the search process in this method.
At each point of the search, the search particle intends to find a new location
with a better value within a searching range around it, and then determines the
direction and the step size for the next move. It tries to reach the optimum
through a weighted random search over each variable (coordinate). The step size of the search
in each variable is randomly generated with its own probability density function.
These functions are gradually learned during the search process. It is expected that
at the later stage of search, the probability density functions are approximated for
each variable. Then the stochastically randomized path to the minimum point of the
function from the start point is learned.
The step sizes of all the coordinates determine the center of the new searching
area and the standard deviations of the probability functions determine the size of
the new searching area around the center. In the new searching area, several
locations $P_s$ are randomly generated. If there is a location $p'$ with a better value than the
current one, the search operator moves to it. From this new location, new step sizes
and new searching range are determined, so that the search for optimum continues.
If in the current searching area there is no point with a better value that the search
particle can move to, another set of random points is generated, until no improvement
is obtained after several, say M, trials. Then the searching-area size and step
sizes are modified in order to find a better function value. If no better value is found
after K trials of generating different searching areas, or the proposed stopping
criterion is met, we can claim that the optimum point has been found. The algorithm
for searching for the minimum point is shown schematically in Fig. 2.1.
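The basic search loop of Fig. 2.1 can be sketched in a few lines. This is a minimal skeleton only: the model-learning parts of RLO (Sections 2.2.2-2.2.3) are omitted, the helper name `basic_search` and the fixed scaling factors 1.1 and 0.9 are our own choices, and the objective `f` is assumed to take the coordinates as separate arguments.

```python
import numpy as np

def basic_search(f, p0, sigma0=1.0, Ps=10, M=5, K=20, rng=None):
    """Minimal sketch of the basic RLO search loop (Fig. 2.1)."""
    rng = rng or np.random.default_rng()
    p = np.asarray(p0, dtype=float)
    best = f(*p)
    dp = rng.uniform(-1.0, 1.0, size=p.size)      # initial step sizes
    sigma = np.full(p.size, sigma0, dtype=float)  # searching-area size
    for _ in range(K):
        moved = False
        for _ in range(M):                        # M trials in this area
            # Ps random candidates around the predicted centre p + dp
            cand = p + dp + rng.normal(0.0, sigma, size=(Ps, p.size))
            vals = np.array([f(*c) for c in cand])
            if vals.min() < best:                 # move to a better point
                i = int(vals.argmin())
                dp = cand[i] - p
                p, best = cand[i], vals[i]
                moved = True
                break
        if not moved:
            sigma *= 1.1                          # enlarge the searching area
            dp *= 0.9                             # reduce the step size
    return p, best
```

For a simple convex function the loop steadily walks toward the minimum; the learned step-size model described in the following sections replaces the fixed `p + dp` centre with a prediction from past moves.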

2.2.2 Extracting Historical Information by Weighted Optimized Approximation
After the search particle makes a sequence of n moves, the step sizes of these moves
d pti (t = 1, 2, ..., n; i = 1, 2, ..., N) are available for learning. These historical steps
have made the search particle move towards better values of the objective function
and hopefully get closer to the optimum location. In this sense, these steps are the

Fig. 2.1 The algorithm of RLO searching for the minimum point

successful actions during the trial. It is proposed that the successful actions which
result in a positive reinforcement (as the step sizes of each coordinate) follow a
function of the iterative steps t, as in (2.1), where $dp^i$ represents the step sizes on the i-th
coordinate and $f_i(t)$ is the function for coordinate i.

$$dp^i = f_i(t) \quad (i = 1, 2, \ldots, N), \qquad (2.1)$$

These unknown functions f i (t) can be approximated, for example, using polynomi-
als through the least-squared fit (LSF) process.
$$\begin{bmatrix} 1 & t_1 & t_1^2 & \cdots & t_1^B \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & t_n & t_n^2 & \cdots & t_n^B \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B \end{Bmatrix} = \begin{Bmatrix} dp_1^i \\ \vdots \\ dp_n^i \end{Bmatrix} \qquad (2.2)$$

In (2.2), $dp_1^i$ to $dp_n^i$ are the step sizes on a certain coordinate
during the n steps and are fitted as unknown function values using polynomials of order
B. The polynomial coefficients $a_0$ to $a_B$ can be obtained and will represent the
function $f_i(t)$ used to estimate $dp^i$:

$$dp^i = \sum_{j=0}^{B} a_j t^j. \qquad (2.3)$$
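Equations (2.2)-(2.3) amount to an ordinary polynomial least-squares fit, which can be sketched as follows; the step-size history used here is illustrative, not data from the chapter.

```python
import numpy as np

# Illustrative step-size history dp_t for one coordinate, t = 1..n
t = np.arange(1, 9, dtype=float)
rng = np.random.default_rng(1)
dp = 0.5 * t - 0.04 * t**2 + 0.05 * rng.normal(size=t.size)

B = 2                                        # polynomial order
A = np.vander(t, B + 1, increasing=True)     # rows [1, t, t^2], as in (2.2)
a, *_ = np.linalg.lstsq(A, dp, rcond=None)   # coefficients a_0 ... a_B
dp_hat = A @ a                               # estimated step sizes, eq. (2.3)
```

The Vandermonde matrix built with `increasing=True` reproduces the row layout of (2.2) exactly, so `dp_hat` equals the polynomial of (2.3) evaluated at each step index.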

Using polynomials for function approximation is easy and efficient. However,
considering the characteristics of optimization problems, we have two concerns.
First, in order to generate a good approximation while avoiding overfitting, a proper
order of polynomials must be selected. In the optimized approximation algorithm
(OAA) presented in [17], the goodness of fit is determined by the so-called signal-
to-noise ratio figure (SNRF). Based on SNRF, an approximation stopping criterion

was developed. Using a certain set of basis functions for approximation, the error
signal, computed as the difference between the approximated function and the sam-
pled data, can be examined by SNRF to determine how much useful information it
contains. The SNRF for the error signal, denoted as $SNRF_e$, is compared to the
pre-calculated SNRF for white Gaussian noise (WGN), denoted as $SNRF_{WGN}$. If $SNRF_e$
is higher than $SNRF_{WGN}$, more basis functions should be used to improve the learning.
Otherwise, the error signal shows the characteristics of WGN and should not
be reduced any further, to avoid fitting the noise; the approximated function
obtained at this point is the optimum one. Such a process can be applied to determine the
proper order of the polynomial.
The second concern is that in the case of reinforcement learning, the knowl-
edge about originally unknown environment is gradually accumulated throughout
the learning process. The information that the learning system obtains at the be-
ginning of the process is mostly based on initially random exploration. During the
process of interaction, the learning system collects the historical information and
builds the model of the environment. The model can be updated after each step of
interaction. The decisions made at the later stages of the interaction are more based
on the built model rather than a random exploration. This means that the recent re-
sults are more important and should be weighted more heavily than the old ones.
For example, the weights applied can be exponentially increasing from the initial
trials to the recent ones, as

$$w_t = \frac{\alpha^t}{n} \quad (t = 1, 2, \ldots, n), \qquad (2.4)$$

where we define $\alpha^n = n$. As a result, the weights lie in the half-open interval (0, 1],
and the weight is 1 for the most recent sample. Applying the weights in the LSF, we
obtain the weighted least-squared fit (WLSF), expressed as follows:
$$\begin{bmatrix} 1 \cdot w_1 & t_1 w_1 & t_1^2 w_1 & \cdots & t_1^B w_1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 \cdot w_n & t_n w_n & t_n^2 w_n & \cdots & t_n^B w_n \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B \end{Bmatrix} = \begin{Bmatrix} dp_1 w_1 \\ \vdots \\ dp_n w_n \end{Bmatrix} \qquad (2.5)$$

Due to the weights applied to the given samples, the approximated function fits
the recent data better than the old data.
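The weighting of (2.4)-(2.5) can be sketched as follows, assuming $\alpha$ is chosen as $\alpha = n^{1/n}$ so that $\alpha^n = n$ as the text requires; the function name `wlsf` and the illustrative data are our own.

```python
import numpy as np

def wlsf(t, dp, B):
    """Weighted least-squared fit of the step-size history, eqs. (2.4)-(2.5)."""
    n = len(t)
    alpha = n ** (1.0 / n)                   # chosen so that alpha**n == n
    w = alpha ** np.arange(1, n + 1) / n     # eq. (2.4): w_t in (0, 1], w_n == 1
    A = np.vander(t, B + 1, increasing=True)
    # Scale each row of (2.2) by its weight w_t, as in (2.5)
    a, *_ = np.linalg.lstsq(A * w[:, None], dp * w, rcond=None)
    return a, w

t = np.arange(1, 11, dtype=float)
dp = 1.0 - 0.1 * t                           # illustrative, exactly linear history
a, w = wlsf(t, dp, B=1)
```

Because the weights grow geometrically toward 1, residuals on recent samples are penalized more heavily, which is exactly the "recent results matter more" behaviour described above.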
Utilizing the concept of OAA to obtain optimized WLSF, the SNRF for the error
signal or WGN has to be estimated considering the sample weights. In the original
OAA for one-dimensional problem [17], the SNRF for error signal was calculated
as

$$SNRF_e = \frac{C(e_j, e_{j-1})}{C(e_j, e_j) - C(e_j, e_{j-1})}, \qquad (2.6)$$

where C represents the correlation calculation, $e_j$ represents the error signal (j = 1, 2, ..., n),
and $e_{j-1}$ represents the circularly shifted version of $e_j$. The characteristics

of SNRF for WGN, expressed through the average value and the standard deviation,
can be estimated from Monte-Carlo simulation as (see the derivation in [17])

$$\mu_{SNRF_{WGN}}(n) = 0, \qquad (2.7)$$

$$\sigma_{SNRF_{WGN}}(n) = \frac{1}{\sqrt{n}}. \qquad (2.8)$$

Then the threshold, which determines whether $SNRF_e$ shows the characteristics of
$SNRF_{WGN}$ and whether the fitting error should not be further reduced, is

$$th_{SNRF_{WGN}}(n) = \mu_{SNRF_{WGN}}(n) + 1.7\,\sigma_{SNRF_{WGN}}(n). \qquad (2.9)$$

For the weighted approximation, the SNRF for the error signal is calculated as

$$SNRF_e = \frac{C(e_j \cdot w_j,\; e_{j-1} \cdot w_{j-1})}{C(e_j \cdot w_j,\; e_j \cdot w_j) - C(e_j \cdot w_j,\; e_{j-1} \cdot w_{j-1})}. \qquad (2.10)$$

In Fig. 2.2(a), $\sigma_{SNRF_{WGN}}(n)$ from a 200-run Monte-Carlo simulation is shown on
a logarithmic scale. The $\sigma_{SNRF_{WGN}}(n)$ can be estimated as

$$\sigma_{SNRF_{WGN}}(n) = \frac{2}{\sqrt{n}}. \qquad (2.11)$$

It is found that the 5% significance level can be approximated by the average value
plus 1.5 standard deviations for an arbitrary n. Fig. 2.2(b) illustrates the histogram
of $SNRF_{WGN}$ with $2^{16}$ samples, as an example. The threshold in this case of
a dataset with $2^{16}$ samples can be calculated as $\mu + 1.5\sigma = 0 + 1.5 \times 0.0078 =
0.0117$.
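Equation (2.10) and the threshold can be checked numerically. The helper name `snrf_weighted` is ours, and taking the correlation C as a plain dot product of the weighted error with its circular shift is one plausible reading of the text, not a definitive implementation.

```python
import numpy as np

def snrf_weighted(e, w):
    """SNRF of the weighted error signal, eq. (2.10), with a circular shift."""
    ew = e * w
    ew_shift = np.roll(ew, 1)            # (e * w) shifted circularly by one
    c01 = np.dot(ew, ew_shift)           # C(e_j w_j, e_{j-1} w_{j-1})
    c00 = np.dot(ew, ew)                 # C(e_j w_j, e_j w_j)
    return c01 / (c00 - c01)

n = 2 ** 16
rng = np.random.default_rng(0)
e = rng.normal(size=n)                   # white Gaussian noise "error"
w = np.ones(n)                           # uniform weights for this check
threshold = 1.5 * 2.0 / np.sqrt(n)       # mu + 1.5 sigma, with sigma = 2/sqrt(n)
```

With $n = 2^{16}$ the threshold evaluates to about 0.0117, matching the worked number above, and the SNRF of pure noise stays well below it.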
Therefore, to obtain an optimized weighted approximation in one-dimensional
case, the following algorithm is performed.

Optimized weighted approximation algorithm (OWAA)


Step (2.1). Assume that an unknown function F, with input space $t \subset \Re^1$, is described
by n training samples $dp_t$ (t = 1, 2, ..., n).
Step (2.2). The signal detection threshold is pre-calculated for the given number of
samples n based on $SNRF_{WGN}$. For a one-dimensional problem,

$$th_{SNRF_{WGN}}(n) = \frac{1.5 \cdot 2}{\sqrt{n}}.$$

Step (2.3). Take a set of basis functions, for example, polynomials of order from 0
up to order B.
Step (2.4). Use these B+1 basis functions to obtain the approximated function,

$$\hat{dp}_t = \sum_{l=1}^{B+1} f_l(x_t) \quad (t = 1, 2, \ldots, n). \qquad (2.12)$$

Fig. 2.2 Characteristic of SNRF for WGN in weighted approximation

Step (2.5). Calculate the approximation error signal,

$$e_t = dp_t - \hat{dp}_t \quad (t = 1, 2, \ldots, n). \qquad (2.13)$$

Step (2.6). Determine SNRF of the error signal using (2.10).


Step (2.7). Compare $SNRF_e$ with $th_{SNRF_{WGN}}$. If $SNRF_e$ is equal to or less
than $th_{SNRF_{WGN}}$, or if B exceeds the number of samples, stop the procedure; in
this case $\hat{F}$ is the optimized approximation. Otherwise, add one basis function (in
this example, increase the order of the approximating polynomial to B+1) and repeat
Steps (2.4)-(2.7).
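Steps (2.1)-(2.7) can be sketched as a loop that grows the polynomial order until the weighted-error SNRF falls below the WGN threshold. The function name `owaa` and the numerical guard on the denominator are our own; the fit follows (2.5) and the SNRF follows (2.10).

```python
import numpy as np

def owaa(t, dp, w, max_order=None):
    """Sketch of Steps (2.1)-(2.7): grow the polynomial order until the
    weighted error SNRF, eq. (2.10), drops below th = 1.5 * 2 / sqrt(n)."""
    n = len(t)
    th = 1.5 * 2.0 / np.sqrt(n)                      # Step (2.2)
    max_order = (n - 1) if max_order is None else max_order
    for B in range(max_order + 1):                   # Steps (2.3)-(2.7)
        A = np.vander(t, B + 1, increasing=True)
        a, *_ = np.linalg.lstsq(A * w[:, None], dp * w, rcond=None)
        e = dp - A @ a                               # Step (2.5), eq. (2.13)
        ew = e * w
        ews = np.roll(ew, 1)                         # circular shift
        denom = np.dot(ew, ew) - np.dot(ew, ews)
        snrf_e = np.dot(ew, ews) / denom if abs(denom) > 1e-12 else 0.0
        if snrf_e <= th:                             # Step (2.7): stop here
            return B, a
    return max_order, a
```

On a noisy linear step-size history the residual at order 0 is still strongly self-correlated, so the loop rejects it and settles on a low order once the residual looks like WGN.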
Using the above algorithm, the proper order of the polynomial is determined to extract
the useful information (but not the noise) from the historical data. Also, the
extracted information fits the recent results better than the old ones.
We illustrate this process of learning historical information by considering a 2-
variable function as an example.

Example

The function $V(p_1, p_2) = p_2^2 \sin(1.5 p_2) + 2 p_1^2 \sin(2 p_1) + p_1 \sin(2 p_2)$ has several
local minima, but only one global minimum, as shown in Fig. 2.3. In the process of
interaction, the historical information after each iteration is collected. The historical
step sizes of 2 coordinates are separately approximated, as shown in the Fig. 2.4 (a)
and 2.4 (b). The step sizes of two coordinates are approximated by quadratic poly-
nomials which are determined by OWAA and the coefficients of polynomials are
obtained using WLSF. In Fig. 2.4, the approximated functions are compared with
the quadratic polynomials whose coefficients are obtained from LSF. Again, it is

observed that the function obtained using WLSF fits the data from later
iterations more closely than the function obtained using LSF.

Fig. 2.3 A 2-variable function V (p1 , p2 )

Fig. 2.4 Function approximation for historical step sizes

The level of the approximation error signal $e_t$ for the step sizes of a certain
coordinate $dp^i$, which is the difference between the observed sampled data and the
approximated function, can be measured by its standard deviation, as shown in (2.14):

$$\sigma_{p_i} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (e_t - \bar{e})^2} \qquad (2.14)$$

This standard deviation will be called the approximation deviation in the follow-
ing discussion. It represents the maximum deviation of the location of the search
particle from the prediction by the approximated function in the unknown function
optimization problem.

2.2.3 Predicting New Step Sizes


The approximated functions will be used to determine the step sizes for the next
iteration, as shown in (2.15) and Fig. 2.5 along with the approximated functions.
$$dp_{t+1}^i = f_i(t+1) \qquad (2.15)$$

Fig. 2.5 Prediction of the step sizes for the next iteration

The step-size functions are the model of the environment that the learning system builds
during the process of interaction, based on historical information. The future step
size determined by such a model can be regarded as exploitation of the existing
model. However, a model built during the learning process cannot be treated
as exact: besides exploitation, which best utilizes the obtained model, exploration
is desired to a certain degree in order to improve the model and discover better
solutions. The exploration can be implemented using Gaussian random generator
(GRG). As a good trade-off between exploitation and exploration is needed, we pro-
pose to use the step sizes for the next iteration determined by the step size functions
as the mean value and the approximation deviation as the standard deviation of the
random generator. Gaussian random generators give several random choices of the
step sizes. Effectively, the determined step sizes of multiple coordinates generate
the center of the searching area, and the size of the searching range is determined
by the standard deviations of GRG for the coordinates. The multiple random values
generated by GRG for each coordinate effectively create multiple locations within
the searching area. The objective function values of these locations will be com-
pared and the location with the best value, called current best location, will be
chosen as the place from which the search particle will continue searching in the
next iteration. Therefore, the actual step sizes are calculated using the distance from
the "previous best location" to the "current best location". The actual step sizes are
added to the historical step sizes and used to update the model of the unknown
environment.
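The exploitation/exploration split described above can be sketched as follows; the helper name `next_candidates`, the illustrative coefficient vectors, and the coefficient ordering (lowest order first) are our own assumptions.

```python
import numpy as np

def next_candidates(p, coeffs, sigmas, t_next, Ps=10, rng=None):
    """Generate Ps candidate locations around the model-predicted centre.

    coeffs[i] -- polynomial coefficients (lowest order first) of f_i(t)
    sigmas[i] -- approximation deviation sigma_p_i from eq. (2.14)
    """
    rng = rng or np.random.default_rng()
    # Exploitation: predicted step sizes dp_{t+1}^i = f_i(t+1), eq. (2.15)
    dp = np.array([np.polyval(a[::-1], t_next) for a in coeffs])
    # Exploration: Gaussian random generator (GRG) around the new centre
    return p + dp + rng.normal(0.0, sigmas, size=(Ps, len(p)))

p = np.array([1.0, -2.0])
coeffs = [np.array([0.2, -0.01]), np.array([-0.1, 0.02])]   # illustrative models
sigmas = np.array([0.05, 0.05])
cands = next_candidates(p, coeffs, sigmas, t_next=8, Ps=10,
                        rng=np.random.default_rng(0))
```

The predicted step sizes fix the centre of the searching area, while the approximation deviations control its spread, exactly as in the description above.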

Several locations of the search particle in this approach are illustrated in


Fig. 2.6 using a 2-variable function as an example. The search particle was located
at the previous best location $p_{prev}(p_1^{prev}, p_2^{prev})$, and the previous step size was found as
$dp_{prev}(dp_1^{prev}, dp_2^{prev})$ after the current best location $p(p_1, p_2)$ was found as the best
location in the previous searching area (an area containing $p(p_1, p_2)$, not shown in the figure).
At the current best position $p(p_1, p_2)$, using the environment model built from historical
step sizes, the current step size is determined to be $dp_1$ on coordinate 1 and $dp_2$ on
coordinate 2, so that the center of the searching area is determined. The approximation
deviations of the two coordinates, $\sigma_{p_1}$ and $\sigma_{p_2}$, give the size of the searching range.
Within the searching range, several random points are generated in order to find a
better position to which the search operator will move.

Fig. 2.6 Step sizes and searching area

2.2.4 Stopping Criterion


The search particle moves from every "previous best location" to the "current best location",
and the step sizes actually taken are used for model learning. As new step sizes are
generated, the search particle is expected to move to locations with better objective
function values. In the proposed algorithm, the search particle only makes the move
when a location with a better function value is found.
However, if all the points generated in the current searching range have no better
function values than the current best value, the search particle does not move and
the GRG will repeat generating groups of particle locations for several trials. If no
better location is found after M trials, we suspect that the current searching range
is too small or the current step size is too large, which makes us miss the locations
with better function values. In such case, we should enlarge the size of the searching
area, and reduce the step size, as in (2.16),

$$\sigma_{p_i} = \alpha\,\sigma_{p_i}, \quad dp^i = \varepsilon\,dp^i \quad (i = 1, 2, \ldots, N), \qquad (2.16)$$
where α > 1, and ε < 1. If this new search is still not successful, the searching range
and the step size will continue changing until some points with better function values
are found. If at certain step of the search process, in order to find the new location
with better function values, the current step size is reduced to be too small to make
the search particle move anywhere, it indicates that the optimum point has been
reached. The stopping criterion can be defined as the current step size falling below
β times the previous step size:

$$dp < \beta\, dp_{prev} \quad (0 < \beta < 1,\ \beta \text{ is usually small}). \qquad (2.17)$$

2.2.5 Optimization Algorithm


Based on previous discussion, the proposed optimization algorithm (RLO) can be
described as follows.
(a). The procedure starts from a random point of the objective function with N-
variables V = f (p1 , p2 , ..., pN ) . It will try to make a series of moves to get closer to
the global optimum point.
(b). To move from the current location, the step size $dp^i$ (i = 1, 2, ..., N) and the
standard deviation $\sigma_{p_i}$ (i = 1, 2, ..., N) for each coordinate are generated from a
uniform probability distribution.
(c). The step sizes $dp^i$ determine the center of the searching area, and the deviations
of all the coordinates $\sigma_{p_i}$ determine its size. Several points $P_s$ in this range
are randomly chosen from a Gaussian distribution using $dp^i$ as mean values and
$\sigma_{p_i}$ as standard deviations.
(d). The objective function values are evaluated at these new points. Compare the
objective function values on random points with that at the current location.
(e). If the new points generated in Step (c) have no better values than the current
position, Step (c) is repeated for up to M trials until a point with a better function
value is found.
(f). If the search fails after M trials, enlarge the size of the searching area, and reduce
the step size, as in (2.16).
(g). If the search with the updated searching-area size and step sizes from Step
(f) is not successful, the range and the step size keep being adjusted until either
some points with better values are found, the current step sizes become much smaller
than the previous step sizes as in (2.17), or the function value changes by less than a
pre-specified threshold. If either of the latter two conditions occurs, the algorithm
terminates, which indicates that the optimum point has been reached.
(h). Move the search particle to the point $p(p_1, \ldots, p_N)$ with the best function value $V_b$ (a
local minimum or maximum, depending on the optimization objective). The distance
between the previous best point $p_{prev}$ and the current best point $p$ gives
the actual step size $dp^i$ (i = 1, 2, ..., N). Collect the historical information of the step
sizes taken during the search process.

(i). Approximate the step sizes as a function of the iterative steps using the weighted
least-squared fit as in (2.5). The proper maximum order of the basis
functions is determined using the SNRF described in Section 2.2.2 to avoid overfitting.
(j). Use the modeled function to determine the step sizes $dp^i$ (i = 1, 2, ..., N) for the
next iteration step. The difference between the approximated step sizes and the actual
step sizes gives the approximation deviation $\sigma_{p_i}$ (i = 1, 2, ..., N).
Repeat Steps (c) to (j).
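Steps (a)-(j) can be condensed into a runnable sketch. This is an illustration under stated simplifications, not the authors' implementation: the polynomial order is capped at a fixed `max_order` rather than chosen by the SNRF test, the approximation deviation is floored at a small constant, and the objective `f` is assumed to take a coordinate vector.

```python
import numpy as np

def rlo(f, p0, sigma0=1.0, Ps=10, M=5, K=50, alpha=1.1, eps=0.9,
        beta=0.005, max_order=2, rng=None):
    """Condensed sketch of RLO steps (a)-(j) for minimization."""
    rng = rng or np.random.default_rng()
    p = np.asarray(p0, dtype=float)
    N = p.size
    best = f(p)
    dp = rng.uniform(-1.0, 1.0, size=N)            # step (b)
    sigma = np.full(N, sigma0, dtype=float)
    last_step = np.linalg.norm(dp)
    history = [dp.copy()]                          # actual step sizes taken
    for _ in range(K):
        moved = False
        for _ in range(M):                         # steps (c)-(e)
            cand = p + dp + rng.normal(0.0, sigma, size=(Ps, N))
            vals = np.array([f(c) for c in cand])
            i = int(vals.argmin())
            if vals[i] < best:                     # step (h)
                dp = cand[i] - p                   # actual step size taken
                p, best = cand[i], vals[i]
                history.append(dp.copy())
                last_step = np.linalg.norm(dp)
                moved = True
                break
        if not moved:                              # step (f), eq. (2.16)
            sigma *= alpha
            dp *= eps
            if np.linalg.norm(dp) < beta * last_step:
                break                              # step (g), eq. (2.17)
            continue
        # steps (i)-(j): weighted fit of the step-size history per coordinate
        H = np.array(history)
        n = H.shape[0]
        if n >= 4:
            t = np.arange(1, n + 1, dtype=float)
            w = (n ** (1.0 / n)) ** t / n          # eq. (2.4)
            B = min(max_order, n - 2)
            A = np.vander(t, B + 1, increasing=True)
            for j in range(N):
                a, *_ = np.linalg.lstsq(A * w[:, None], H[:, j] * w,
                                        rcond=None)
                resid = H[:, j] - A @ a
                sigma[j] = max(np.std(resid), 1e-3)   # eq. (2.14), floored
                dp[j] = np.polyval(a[::-1], n + 1)    # eq. (2.15)
    return p, best
```

Even this stripped-down version exhibits the intended behaviour: early moves are random exploration, later centres come from the fitted step-size model, and failed trials shrink the step while enlarging the searching area.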
In general, the optimization algorithm based on the reinforcement learning builds
the model of successful moves for a given objective function. The model is built
based on historical successful actions and it is used to determine new actions. The
algorithm combines exploitation and exploration of the search space using random
generators. The optimization algorithm does not require any prior knowledge of the
objective function or its derivatives, nor does it impose any special requirements on the
objective function. The search operator is conceptually very simple and
intuitive. In the following section, the algorithm is verified using several experiments.

2.3 Simulation and Discussion


2.3.1 Finding Global Minimum of a Multi-variable Function
2.3.1.1 A Synthetic Bivariate Function
A synthetic bivariate function

$$V(p_1, p_2) = p_2^2 \sin(1.5 p_2) + 2 p_1^2 \sin(2 p_1) + p_1 \sin(2 p_2),$$

used previously in the example in section 2.2.2, is used as the objective function.
This function has several local minima and one global minimum equal to -112.2586.
The optimization algorithm starts at a random point and performs the search process
looking for the optimum point (minimum in this example). The number of random
points Ps generated in the searching area in each step is 10. The scaling factors α
and ε in (2.16) are 1.1 and 0.9. The β in (2.17) is 0.005.
One possible search path, from the start location to the final optimum location as
found by the RLO algorithm, is shown in Fig. 2.7. The global optimum is found in
13 iterative steps. The historical locations are shown in the figure as well. During
the search process, the historical step sizes taken are shown in Fig. 2.8 with their
approximation by WLSF.
Another example of the search process, starting from a different random point, is
shown in Fig. 2.9. The global optimum is found in 10 iterative steps.
Table 2.1 shows changes in the numerical function values and adjustment of the step
sizes dp1 and dp2 for p1 and p2 in the successive search steps. Notice how the step
size was initially reduced to be increased again once the algorithm started to follow
a correct path towards the optimum.

Fig. 2.7 Search path from start point to optimum

Fig. 2.8 Step sizes taken during the search process

Fig. 2.9 Search path from start point to optimum



Table 2.1 Function values and step sizes in a searching process

Search step    Function value V(p1, p2)    Step size dp1    Step size dp2

 1               1.4430                     2.9455            0.8606
 2             -34.8100                     0.3570           -1.7924
 3             -61.4957                    -0.0508           -0.7299
 4             -69.8342                    -0.0477           -0.3114
 5             -70.5394                    -0.1232            0.2015
 6             -71.5813                     0.0000            4.4358
 7            -109.0453                    -0.0281            0.3408
 8            -110.8888                     0.0495           -0.0531
 9            -112.0104                     0.0438           -0.0772
10            -112.1666

Such a search process was performed for 300 random trials. The success rate of
finding the global optimum is 93.78%. On average, it takes 5.9 steps and 4299
function evaluations to find the optimum in this problem.
The same problems were tested on several other direct-search-based optimization
algorithms, including SA [29], PSO [14] and OSSRS [2]. The success rate of finding
the global optimum and the average number of function evaluations are compared
in Tables 2.2, 2.3, 2.4. All the simulations were performed using an Intel Core Duo
2.2GHz based PC, with 2GB of RAM.

Table 2.2 Comparison of optimization performances on synthetic function

                                             RLO       SA        PSO      OSSRS
Success rate of finding the global optimum   93.78%    29.08%    94.89%   52.21%
Number of function evaluations               4299      13118     4087     313
CPU time consumption [s]                     28.4      254.35    20.29    1.95

2.3.1.2 Six-Hump Camel Back Function


The classic 2D six-hump camel back function [5] has 6 local minima and 2 global
minima. The function is given as

V(p1, p2) = (4 - 2.1p1^2 + p1^4/3) p1^2 + p1 p2 + (-4 + 4p2^2) p2^2,  p1 ∈ [-3, 3], p2 ∈ [-2, 2].

Within the specified bounded region, the function has 2 global minima equal to
-1.0316. The optimization performances of these algorithms from 300 random trials
are compared in Table 2.3.
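As a concrete sanity check, the function above can be written out directly; the minimizer coordinates used below are the commonly quoted values for this benchmark, not figures taken from the chapter.

```python
def camel_back(p1, p2):
    """Six-hump camel back function; two global minima of about -1.0316."""
    return ((4 - 2.1 * p1**2 + p1**4 / 3) * p1**2
            + p1 * p2
            + (-4 + 4 * p2**2) * p2**2)

# One global minimizer lies near (0.0898, -0.7126); the other is its mirror image.
print(round(camel_back(0.0898, -0.7126), 4))   # close to -1.0316
```

The two global minima are mirror images because every term of V is invariant under (p1, p2) → (-p1, -p2).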

Table 2.3 Comparison of optimization performances on six-hump camel back function

                                            RLO       SA        PSO      OSSRS
Success rate of finding the global optimum  80.33%    45.22%    86.44%   42.67%
Number of function evaluations              5016      8045.5    3971     256
CPU time consumption [s]                    33.60     151.86    20.35    1.63

2.3.1.3 Banana Function


Rosenbrock's famous "banana function" [23],

V(p1, p2) = 100(p2 - p1^2)^2 + (1 - p1)^2,

has 1 global minimum equal to 0, lying inside a narrow, curved valley. The opti-
mization performances of these algorithms from 300 random trials are compared in
Table 2.4.
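Written out directly (a trivial sketch, useful only to pin down the formula above):

```python
def banana(p1, p2):
    """Rosenbrock's banana function; global minimum of 0 at (1, 1)."""
    return 100 * (p2 - p1**2)**2 + (1 - p1)**2

print(banana(1.0, 1.0))  # 0.0
```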

Table 2.4 Comparison of optimization performances on banana function

                                            RLO       SA        PSO      OSSRS
Success rate of finding the global optimum  74.55%    3.33%     41%      88.89%
Number of function evaluations              48883.7   28412     4168     882.4
CPU time consumption [s]                    320.74    539.38    20.27    5.15

In these optimization problems, RLO demonstrates consistently satisfactory per-
formance without particular tuning of its parameters, whereas the other methods show
different levels of efficiency and capability in handling the various problems.

2.3.2 Optimization of Weights in Multi-layer Perceptron Training


The output of a multi-layer perceptron (MLP) can be viewed as the value of a
function with the weights as the approximation variables. Training the MLP, in the
sense of finding optimal values of the weights to accomplish the learning task, can
therefore be treated as an optimization problem. We take the Iris plant database [22] as a
testing case. The Iris database contains 3 classes, 5 numerical features and 150 sam-
ples. To classify the iris samples, a 3-layered MLP
with an input layer, a hidden layer and an output layer can be used. The size of the
input layer should be equal to the number of features. The size of the hidden layer
is chosen to be 6, and since the class IDs are numerical values equal to 1, 2 and
3, the size of the output layer is 1. The weight matrix between the input layer and
the hidden layer contains 30 elements, and the one between the hidden layer and the
output layer contains 6 elements. Overall, there are 36 weight elements (parameters)
to be optimized. In a typical trial, the optimization algorithm finds the optimal set
of weights after only 3 iterations. In the testing stage, the outputs of the MLP are
rounded to the nearest integers to indicate predicted class IDs. Comparing the
given class IDs with the predicted class IDs from the MLP in Fig. 2.10, 146 out of
150 iris samples are correctly classified by this set of weights, a correct-classification
rate of 97.3%. A single support vector machine (SVM) achieved a 96.73% classification
rate [12]. In addition, an MLP with the same structure, trained by back-propagation
(BP), achieved 96% on the Iris test case. The MLP and BP are implemented using
the MATLAB neural network toolbox.
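The chapter's experiments use the MATLAB neural network toolbox; the sketch below only illustrates the framing described above. A fixed 5-6-1 MLP's 36 weights form a flat search vector, and the misclassification count is the black-box objective. The data is random stand-in data (not the Iris set), and plain random search stands in for RLO; all names here are illustrative assumptions.

```python
import numpy as np

def mlp_output(weights, x):
    """5-6-1 MLP: unpack a flat 36-element weight vector and do a forward pass."""
    w1 = weights[:30].reshape(5, 6)    # input layer (5 features) -> hidden layer (6 units)
    w2 = weights[30:36].reshape(6, 1)  # hidden layer -> single output unit
    hidden = np.tanh(x @ w1)
    return hidden @ w2

def objective(weights, x, class_ids):
    """Misclassification count: outputs rounded to the nearest integer class ID."""
    predicted = np.rint(mlp_output(weights, x).ravel())
    return np.sum(predicted != class_ids)

# Stand-in data with the same shape as the Iris task; any black-box optimizer
# (RLO in the chapter, plain random search here) can minimize this objective.
rng = np.random.default_rng(0)
x = rng.normal(size=(150, 5))
class_ids = rng.integers(1, 4, size=150)
best_w = min((rng.normal(size=36) for _ in range(50)),
             key=lambda w: objective(w, x, class_ids))
```

The key design point is that the objective never exposes gradients: the network is evaluated only through its classification error, which is exactly what makes a direct-search method such as RLO applicable.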

Fig. 2.10 RLO performance on neural network training on Iris problem

2.3.3 Micro-saccade Optimization in Active Vision for Machine Intelligence
In the area of machine intelligence, active vision has become an interesting topic. In-
stead of taking in the whole scene captured by the camera and making sense of all
the information, as in the conventional computer vision approach, an active vision agent
focuses on small parts of the scene and moves its fixation frequently. Humans and
other animals use such quick movements of both eyes, called saccades [3], to
focus on the interesting parts of a scene and use their resources efficiently. The
interesting parts are usually important features of the input, and with the important
features extracted, the high-resolution scene can be analyzed and recognized with a
relatively small number of samples.
In a saccade movement network (SMN) presented in [16], the original images are
transformed into a set of low resolution images after saccade movements and retina
sampling. The set of images, as the sampled features, are fed to the self-organizing
winner-take-all classifier (SOWTAC) network for recognition. To find interesting
features of the input image and to direct the movements of saccade, image segmen-
tation, edge detection and basic morphology tools [4] are utilized.

Fig. 2.11 Face image and its interesting features in active vision [16]

Fig. 2.11 (a) shows a face image from [7] with 320×240 pixels. The interesting
features found are shown in Fig. 2.11 (b). The stars represent the center of the four
interesting features found on a face image and the rectangles represent the feature
boundaries. Then, the retina sampling model [16] places its fovea at the center of
each interesting feature, so that these features will be extracted.
In practice, the centers of the interesting features found by image processing tools
[4] are not guaranteed to be accurate, which affects the accuracy
of feature extraction and the pattern recognition process. To help find the
optimum sampling position, the RLO algorithm can be used to direct the movement of the
fovea of the retina and find the closest match between the obtained sample features
and pre-stored reference sample features. These slight movements during fixation to find
the optimum sampling positions can be called microsaccades in the active vision
process, although the actual role of microsaccades has been an unresolved topic of debate
for several decades [19].
Fig. 2.12 (a) shows a group of ideal samples of important features in face recog-
nition. Fig. 2.12 (b) shows the group of sampled features at the initial sampling posi-
tions. In the optimization process, the x-y coordinates need to be optimized so that
the sampled images have the optimum similarity to the ideal images. The level of
similarity can be measured by the sum of squared intensity differences [9]; in this
metric, increased similarity gives a decreased intensity difference. Such a problem
can also be perceived as an image registration problem. The two-variable objec-
tive function V(x, y), the sum of squared intensity differences, needs to be minimized
Fig. 2.12 Image sampling by micro-saccade

through the RLO algorithm. Note that the only information available is that V is a
function of the x and y coordinates; how the function is expressed and what its
characteristics are remain totally unknown, as does the minimum value of the
objective function. RLO is therefore a suitable algorithm for
such an optimization problem. Fig. 2.12 (c) shows the optimized sampled images us-
ing RLO-directed microsaccades. The optimized feature samples are closer to the
ideal feature samples, which helps the processing of the face image.
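A minimal sketch of this objective, assuming grayscale arrays and a rectangular reference patch; the function names and patch-extraction convention are illustrative assumptions, not the chapter's implementation.

```python
import numpy as np

def ssd(sample, reference):
    """Sum of squared intensity differences; smaller means more similar."""
    return float(np.sum((sample.astype(float) - reference.astype(float)) ** 2))

def objective(xy, image, reference):
    """V(x, y): SSD between the patch sampled at (x, y) and a reference patch."""
    x, y = int(round(xy[0])), int(round(xy[1]))
    h, w = reference.shape
    return ssd(image[y:y + h, x:x + w], reference)
```

An optimizer such as RLO would then perturb (x, y) to drive V(x, y) down, exactly as in any two-variable image registration problem.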
After the featured images are obtained through RLO-directed microsaccades,
these low-resolution images, instead of the entire high-resolution face image, are
sent to the SOWTAC network for further processing or recognition.

2.4 Conclusions
In this chapter, a novel and efficient optimization algorithm was presented for
problems in which the objective function is unknown. The search particle is able
to build a model of successful actions and choose its future action based on its
past exploring experience. The decisions on the step sizes (and directions) are made
based on a trade-off between exploitation of the known search path and exploration
for an improved search direction. In this sense, the algorithm falls into the category
of reinforcement learning based optimization (RLO) methods. The algorithm does
not require any prior knowledge of the objective function, nor does it require any
characteristics of such a function. It is conceptually very simple and intuitive as well
as very easy to implement and tune.
The optimization algorithm was tested and verified using several multi-variable
functions and compared with several other widely used random search optimization
algorithms. Furthermore, the training of a multi-layer perceptron (MLP), based on
finding a set of optimized weights to accomplish the learning, was treated as an opti-
mization problem, and the proposed RLO was used to find the weights of the MLP in the
training problem on the Iris database. Finally, the algorithm was used in an image recog-
nition process to find a familiar object with retina sampling and micro-saccades.
The performance of RLO will depend to a certain degree on the values of several
parameters that the algorithm uses. With certain preset parameters, the performance
of RLO meets our requirements in several machine learning problems involved
in our current research. In future research, a theoretical and systematic analysis
of the effect of these parameters will be conducted. In addition, using a group of
search particles and their cooperation and competition, a population-based RLO can
be developed. With the help of model approximation techniques and the trade-off
between exploration and exploitation proposed in this work, the population-based
RLO is expected to have better performance.

References
1. Arfken, G.: Lagrange Multipliers, 3rd edn. §17.6 in Mathematical Methods for Physi-
cists, pp. 945–950. Academic Press, Orlando (1985)
2. Belur, S.: A random search method for the optimization of a function of
n variables. MATLAB central file exchange, http://www.mathworks.com/
matlabcentral/fileexchange/loadFile.do?objectId=100
3. Cassin, B., Solomon, S.: Dictionary of Eye Terminology. Triad Publishing Company,
Gainsville (1990)
4. Detecting a Cell Using Image Segmentation. Image Processing Toolbox, the Mathworks,
http://www.mathworks.com/products/image/demos.html
5. Dixon, L.C.W., Szego, G.P.: The optimization problem: An introduction. Towards Global
Optimization II. North Holland, New York (1978)
6. Nelder, J.A., Mead, R.: A simplex method for function minimization. The Computer
Journal 7, 308–313 (1965)
7. Facegen Modeller. Singular Inversions,
http://www.facegen.com/products.htm
8. del Toro Garcia, X., Neri, F., Cascella, G.L., Salvatore, N.: A surrogate associated
Hooke-Jeeves algorithm to optimize the control system of a PMSM drive. IEEE ISIE,
347–352 (July 2006)
9. Hill, D.L.G., Batchelor, P.: Registration methodology: concepts and algorithms. In: Ha-
jnal, J.V., Hill, D.L.G., Hawkes, D.J. (eds.) Medical Image Registration. CRC, Boca
Raton (2001)
10. Hooke, R., Jeeves, T.A.: Direct search solution of numerical and statistical problems.
Journal of the Association for Computing Machinery 8, 212–229 (1961)
11. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proc. IEEE Int. Conf. Neu-
ral Networks, Perth, Australia, December 1995, vol. 4, pp. 1942–1948 (1995)
12. Kim, H., Pang, S., Je, H.: Support vector machine ensemble with bagging. In: Lee, S.-W.,
Verri, A. (eds.) SVM 2002. LNCS, vol. 2388, p. 397. Springer, Heidelberg (2002)
13. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by simulated annealing. Sci-
ence 220(4598), 671–680 (1983)
14. Leontitsis, A.: Hybrid Particle Swarm Optimization, MATLAB central file ex-
change, http://www.mathworks.com/matlabcentral/fileexchange/
loadFile.do?objectId=6497
15. Lewis, R.M., Torczon, V., Trosset, M.W.: Direct search methods: Then and now. Journal
of Computational and Applied Mathematics 124(1), 191–207 (2000)
16. Li, Y.: Active Vision through Invariant Representations and Saccade Movements. Master
thesis, School of Electrical Engineering and Computer Science, Ohio University (2006)
17. Liu, Y., Starzyk, J.A., Zhu, Z.: Optimized Approximation Algorithm in Neural Networks
without overfitting. IEEE Trans. on Neural Networks 19(4), 983–995 (2008)
18. Lustig, I.J., Marsten, R.E., Shanno, D.F.: Computational Experience with a Primal-Dual
Interior Point Method for Linear Programming. Linear Algebra and its Application 152,
191–222 (1991)
19. Martinez-Conde, S., Macknik, S.L., Hubel, D.H.: The role of fixational eye movements
in visual perception. Nature Reviews Neuroscience 5(3), 229–240 (2004)
20. Ong, Y.-S.: Max-min surrogate-assisted evolutionary algorithm for robust design. IEEE
Trans. on Evolutionary Computation 10(4), 392–404 (2006)
21. Powell, M.J.D.: An efficient method for finding the minimum of a function of several
variables without calculating derivatives. The Computer Journal 7, 155–162 (1964)
22. Fisher, R.A.: Iris Plants Database (July 1988),
http://faculty.cs.byu.edu/˜cgc/Teaching/CS_478/iris.arff
23. Rosenbrock, H.H.: An automatic method for finding the greatest or least value of a func-
tion. The Computer Journal 3, 175–184 (1960)
24. Sheela, B.V.: An optimized step-size random search. Computer Methods in Applied Me-
chanics and Engineering 19(1), 99–106 (1979)
25. Snyman, J.A.: Practical Mathematical Optimization: An Introduction to Basic Optimiza-
tion Theory and Classical and New Gradient-Based Algorithms. Springer, Heidelberg
(2005)
26. Starzyk, J.A.: Motivation in Embodied Intelligence. In: Frontiers in Robotics, Automa-
tion and Control, October 2008, pp. 83–110. I-Tech Education and Publishing (2008),
http://www.intechweb.org/book.php?%20id=78&content=subject&sid=11
27. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cam-
bridge (1998)
28. Torczon, V.: On the Convergence of Pattern Search Algorithms. SIAM Journal on Opti-
mization 17(1), 1–25 (1997)
29. Vandekerckhove, J.: General simulated annealing algorithm, MATLAB central file ex-
change, http://www.mathworks.com/matlabcentral/fileexchange/
loadFile.do?objectId=10548
30. Ypma, T.J.: Historical development of the Newton-Raphson method. SIAM Re-
view 37(4), 531–551 (1995)
Chapter 3
The Use of Opposition for Decreasing Function
Evaluations in Population-Based Search

Mario Ventresca, Shahryar Rahnamayan, and Hamid Reza Tizhoosh

Abstract. This chapter discusses the application of opposition-based computing
to reducing the number of function calls required to perform optimization
by population-based search. We provide motivation and comparison to simi-
lar, but different, approaches including antithetic variates and quasi-randomness/
low-discrepancy sequences. We employ differential evolution and population-based
incremental learning as optimization methods for image thresholding. Our results
confirm improvements in required function calls, as well as support the oppositional
principles used to attain them.

3.1 Introduction
Global optimization is concerned with discovering an optimal (minimum or maxi-
mum) solution to a given problem, generally within a large search space. In some in-
stances the search space may be simple (i.e. concave or convex optimization can be
used). However, most real-world problems are multi-modal and deceptive [5], which
often causes traditional optimization algorithms to become trapped at local optima.
Many strategies have been developed to overcome this for global optimization in-
cluding, but not limited to, simulated annealing [9], tabu search [4], evolutionary
algorithms [7] and swarm intelligence [3].
Some of these methods employ a single solution per iteration methodology
whereby only one solution is generated and successively perturbed towards more
Mario Ventresca · Hamid Reza Tizhoosh
Department of Systems Design Engineering, The University of Waterloo, Ontario, Canada
e-mail: mventres@pami.uwaterloo.ca,tizhoosh@uwaterloo.ca
Shahryar Rahnamayan
Faculty of Engineering and Applied Science,
The University of Ontario Institute of Technology, Ontario, Canada
e-mail: shahryar.rahnamayan@uoit.ca

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 49–71.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
50 M. Ventresca, S. Rahnamayan, and H.R. Tizhoosh

appropriate solutions (i.e. simulated annealing and tabu search). Single-solution
methods inherently require little computational time per iteration, but they of-
ten lack sufficient diversity to adequately explore the search space. An alternative
is population-based techniques, where many solutions are generated at each iter-
ation and all (or most) are used to determine the search direction (i.e. evolutionary
and swarm intelligence algorithms). By considering many solutions per iteration
these methods have higher diversity, but require a larger amount of
computation. In this chapter we focus on population-based optimization.
Real-world problems also present the possible issue of complexity in the eval-
uation of a solution; that is, determining the quality of a given solution may be com-
putationally expensive. Investigating simpler evaluation metrics is one possible
direction; however, evaluation may still be found to be expensive. Reducing
the time spent (i.e. the number of function evaluations) then becomes an important goal.
Opposition-based computing (OBC) is a newly devised concept, having as one
of its aims the improvement of convergence rates of algorithms by defining and
simultaneously considering pairs of opposite solutions [24]. This improved conver-
gence rate is also usually accompanied by a more desirable final result. To date,
OBC has shown improvements in reinforcement learning [20, 22, 23], evolution-
ary algorithms [14, 15, 16], ant colony optimization [12], simulated annealing [28],
estimation of distribution [29], and neural networks [26, 27, 30].
In this chapter we discuss the application of OBC to reducing the number of
function calls required to achieve a desired accuracy in population-based searches.
We show the theoretical reasoning behind OBC (which has roots in monotonic opti-
mization) and provide mathematical motivations for its ability to reduce function
calls and improve the accuracy of simulation results. The key factor in accomplishing
this is the simultaneous consideration of negatively associated variables and their
effect on the target evaluation function and search process. We choose to highlight
the improvements for the task of image thresholding using differential evolution and
population-based incremental learning.
The rest of this chapter is organized as follows: Section 3.2 discusses the theoret-
ical motivations behind our approach. Differential evolution and population-based
incremental learning are discussed in Section 3.3, as are their respective opposition-
based counterparts. The experimental setup is given in Section 3.4 and results are
presented in Section 3.5. Conclusions are given in Section 3.6.

3.2 Theoretical Motivations


In this section we introduce notations and definitions required to explain the concept
of opposition and its ability to reduce the number of function calls. We also provide
a brief comparison of OBC to antithetic variates and quasi-random/low-discrepancy
sequences.
3 The Use of Opposition for Decreasing Function Evaluations 51

3.2.1 Definitions and Notations


In the following definitions, assume that A ⊆ ℜd is non-empty, compact and d-
dimensional. Without loss of generality, f : A → ℜ is a continuous function to be
maximized. We assume all of A is feasible.
The purpose of a global search method is to discover the global optima (either
minimum or maximum) of a given function and not converge to one of the local
optima.
Definition 1 (Global Optimum). A solution θ∗ ∈ A is a global optimum if f(θ) ≤
f(θ∗) for all θ ∈ A. There may exist more than one global optimum.
Definition 2 (Local Optimum). A solution θ′ ∈ A is a local optimum if there exists an
ε-neighborhood Nε(θ′) with radius ε > 0 such that g(θ, θ′) < ε for distance function
g and all θ ∈ A ∩ Nε(θ′), and f(θ) ≤ f(θ′).
Recent research [1, 25, 32] has shown the benefit of utilizing monotonic transforma-
tions of the evaluation criteria as a means of discovering global optima. This causes
a reordering of the solutions and a gradient-based method can be used to search the
reordered space. An issue with these convexification and concavification methods
which transform certain functions to a monotonic form is that the mapping must be
known a priori. Otherwise, optimization on the transformed function is unreliable.
Definition 3 (Monotonicity). Function φ : ℜ → ℜ is monotonic if for x, y ∈ ℜ and
x < (>)y, then φ (x) ≤ (≥)φ (y). A strictly monotonic function is one which does
not permit equality, (i.e. φ (x) < (>)φ (y)).
Theoretically, a monotonic transformation is ideal; however, OBC does not require
it. Instead, opposition extends the monotonic global search idea through the use of
opposite solutions, which are simultaneously considered, and the more desirable
(w.r.t. f and the problem definition) is used during the search.
Definition 4 (Opposite). A pair (x, x̆) ∈ A are opposites if there exists a function
Φ : A → A where Φ (x) = y and Φ (y) = x. The breve notation will be used to
denote the opposite element (i.e. x̆ = Φ (x) = y).
The function Φ referred to in Definition 4 is the key to employing opposition-based
techniques. This determines which elements are opposites, and a poorly selected
function could lead to poor performance (see the following section).
Definition 5 (Opposite Mapping). A one-to-one function Φ : A → A where every
pair x, x̆ ∈ A are unique (i.e. for z ∈ A , if Φ (x) = y and Φ (y) = x then there does
not exist Φ (y) = z or Φ (x) = z).
Φ can be determined via prior knowledge, intuition or through some a priori
or online learning procedure. Simultaneous use of the opposites (for example, in
a maximization problem) is easily accomplished by letting f(θ) = f(θ̆) =
max( f(θ), f(θ̆)) and searching for the solution S which corresponds to the most de-
sirable solution,
S = arg max_θ f(θ). (3.1)
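For a box-constrained A, one concrete Φ satisfying Definitions 4 and 5 is the boundary-reflection map x̆_j = a_j + b_j − x_j (the same map the ODE variant in Section 3.3 uses for initialization). The sketch below is an illustrative assumption of how (3.1) is evaluated, not code from the chapter.

```python
def opposite(x, a, b):
    """Boundary-reflection opposite map: Φ(Φ(x)) = x, so each pair (x, x̆) is unique."""
    return [aj + bj - xj for xj, aj, bj in zip(x, a, b)]

def paired_value(f, x, a, b):
    """Simultaneous consideration: x and x̆ share the better of their two evaluations."""
    return max(f(x), f(opposite(x, a, b)))
```

Because Φ is an involution, the pairs (x, x̆) partition A, which is what makes the "halving" argument of the next subsection possible.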
3.2.2 Consequences of Opposition


As with a monotonic transformation the “optimal” opposition map is one which
effectively reorders elements of A such that φ ( f ) is monotonic. Under this trans-
formation cor(X, X̆) ≤ 0 for X, X̆ ⊂ A , as is shown in Figure 3.1.

Fig. 3.1 Example of an evaluation function transformed to a monotonic function. The values
in X and X̆ are negatively correlated. The original function (not shown) could have been any
nonlinear, continuous function

The implementation of opposition for an optimization problem involves the si-
multaneous examination of x and x̆, returning the more desirable of the two. That is,
we aim for a negative correlation between the two guesses. A consequence of Φ is an effec-
tive halving of the possible evaluations (as shown in Figure 3.2), where f(θ) = f(θ̆) =
max( f(θ), f(θ̆)). The halving results from full information of the transformed func-
tion, such that the opposite solution need not be observed to compute the
max operation. In the more general case, it is sufficient to determine a function such
that Pr(max( f(θ1), f(θ̆1)) > max( f(θ1), f(θ2))) > 0.5, for θ1, θ2 ∈ A.
A further consequence of the simultaneous consideration of opposites is a provably
more desirable E[f] and lower variance [28]. Alone, this does not guarantee a higher
quality outcome; the probability density function of the opposite-transformed func-
tion values should also be more desirable than the joint p.d.f. of random sampling
(i.e. the distribution of max(x1, x2)).
Fig. 3.2 Taking f(x) = max( f(x), f(−x)), we see that the possible evaluations in the search
space have been halved in the optimal situation of full knowledge. In the general case, the
transformed function will have a more desirable mean and lower variance. (The figure plots
the transformed monotonic function against the function representing the maximum of a
solution and its opposite.)

While not investigated in this chapter, we make the observation that successive
applications of different opposite maps will lead to further smoothing of f. For
example,

f^2(θ) = max( f(z = arg max_θ f(θ)), f(z̆) ) (3.2)

where z̆ is determined via a second opposite map Φ2 ≠ Φ1 and the superscript in f^2
indicates the two applications of opposite mappings. In the limit,

lim_{i→∞} f^i(θ) = max( f^i(z_i = arg max_{θ_i} f^{i−1}(θ_i)), f(z̆_i) ) = f(θ∗) (3.3)

for i > 0 and global optimum f(θ∗). Effectively, this flattens the entire error surface
of f, except for the global optimum(s). A more feasible alternative is to use k > 0
transformations which give reasonable results where 0 ≤ | f^{k−1} − f^k | < ε does not
diminish greatly.

3.2.3 Lowering Function Evaluations


We briefly discuss conditions on the design of the opposition map Φ which often lead
to improvements over purely random sampling with respect to lowering function
evaluations. If using an algorithm solely based on randomly generating solutions to
the problem at hand, then we require that for some ε > 0 and δ > 0.5
Pr( max( f(x), f(x̆)) − max( f(x), f(y)) > ε ) > δ (3.4)

where x, y, x̆ ∈ A. That is, the distribution of max( f(x), f(x̆)) must be more desirable
than the distribution of i.i.d. random guesses. If this condition is met, then the probability
that the optimal solution (or one of higher quality) is discovered is higher using opposite
guesses. The goal in developing Φ is to maximize ε and δ. A similar goal is to deter-
mine Φ such that E[g(x, x̆)] is maximized for some distance function g; satisfying
this condition implies (3.4).
Thus, probabilistically we expect a lower number of function calls to find a solu-
tion of a given thresholded quality. If employing this strategy within a guided search
method, the dynamics of the algorithm must be considered to assure the guarantee
(i.e. the algorithm adds bias to the search which affects the convergence rate).
Practically, the simplest manner to decide a satisfactory Φ is through intuition
or prior knowledge about f . A possibility is to utilize modified low-discrepancy
sequences (see below), which aim to distribute guesses evenly throughout the search
space.
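Whether a candidate Φ helps can be checked empirically. The toy experiment below (an illustrative assumption, not from the chapter) compares E[max( f(x), f(x̆))] under the reflection map x̆ = 1 − x against E[max( f(x), f(y))] for an independent second guess, using f(t) = t on [0, 1].

```python
import random

def mean_pair_max(f, second_guess, trials=20000):
    """Average of max(f(x), f(second_guess(x))) over random x in [0, 1]."""
    total = 0.0
    for _ in range(trials):
        x = random.random()
        total += max(f(x), f(second_guess(x)))
    return total / trials

random.seed(7)
f = lambda t: t
print(mean_pair_max(f, lambda x: 1.0 - x))          # opposite pair: about 0.75
print(mean_pair_max(f, lambda x: random.random()))  # independent pair: about 2/3
```

For this f the reflection pair has the higher expected maximum (0.75 versus 2/3) and zero variance contribution from the pairing, illustrating the more desirable mean and lower variance claimed above.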

3.2.4 Comparison to Existing Methods


While opposition may sometimes employ methods from antithetic variates and low-
discrepancy sequences, in general that is not the case. To elucidate the uniqueness
of opposition in the following we distinguish it from these two methods.

3.2.4.1 Antithetic Variates


Suppose we desire to estimate ξ = E[f] = E[(Y1 + Y2)/2] with the unbiased estimator

ξ̂ = (Y1 + Y2)/2. (3.5)

If Y1, Y2 are i.i.d. then var(ξ̂) = var(Y1)/2. However, if cov(Y1, Y2) < 0 then the
variance can be further reduced.
One method to accomplish this is through the use of a monotonic function h.
Generate Y1 i.i.d. as before but, utilizing h, take as our two variables
h(Y1) and h(1 − Y1), which are monotonic over the interval [0, 1]. Then

ξ̂ = ( h(Y1) + h(1 − Y1) ) / 2 (3.6)

will be an unbiased estimator of E[f].
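A standard worked example of the estimator (3.6), with h(u) = e^u so that E[h(U)] = e − 1 for U ~ Uniform(0, 1); this is textbook antithetic sampling, given here only to make the variance-reduction claim concrete.

```python
import math
import random

def estimate(h, n_pairs, antithetic=False):
    """Monte Carlo estimate of E[h(U)] from n_pairs averaged pairs, as in (3.5)/(3.6)."""
    total = 0.0
    for _ in range(n_pairs):
        u = random.random()
        v = 1.0 - u if antithetic else random.random()
        total += (h(u) + h(v)) / 2.0
    return total / n_pairs

random.seed(11)
exact = math.e - 1.0  # E[exp(U)] for U ~ Uniform(0, 1)
print(abs(estimate(math.exp, 5000) - exact))                   # plain Monte Carlo error
print(abs(estimate(math.exp, 5000, antithetic=True) - exact))  # usually much smaller
```

The reduction comes entirely from cov(e^U, e^{1−U}) < 0: the pair's errors partially cancel within each averaged sample.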
Opposition is similar in its selection of negatively correlated samples. However,
in the antithetic approach there is no guideline for constructing such a monotonic func-
tion, although such a function has been proven to exist [17]. Opposition provides
various means to accomplish this, as well as to incorporate the idea into optimiza-
tion while guaranteeing a more desirable expected value and lower variance in the
target function.
Further, opposition extends beyond the generation of solutions in random
sampling-based algorithms. It can also be applied to algorithm behavior and can be
used to relate concepts expressed linguistically, where we have not found evidence
that antithetic variates have any application.

3.2.4.2 Quasi-Randomness and Low-Discrepancy Sequences


These methods aim to combine the randomness from pseudorandom generators
which select values i.i.d. with the advantages of generating points distributed in
a grid-like fashion. The goal is to uniformly distribute values over the search space
by achieving a low-discrepancy. However, this is achieved at the cost of statistical
independence.
The discrepancy of a sequence is a measure of its uniformity and can be calcu-
lated via [6]:

D_N(X) = sup_{B∈J} | |B ∩ X| / N − λ_d(B) | (3.7)

where λ_d is the d-dimensional Lebesgue measure, |B ∩ X| is the number of points
of X = (x1, ..., xN) that fall into the interval B, and J is the set of d-dimensional intervals
defined as

∏_{i=1}^{d} [a_i, b_i) = { x ∈ ℜ^d : a_i ≤ x_i < b_i } (3.8)

for 0 ≤ a_i < b_i ≤ 1.
That is, the actual number of points within each interval for a given sample is
close to the expected number. Such sequences have been widely studied in quasi-
Monte Carlo methods [10].
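For reference, the base-b radical-inverse (van der Corput) construction is the usual building block for such sequences; the sketch below is a standard textbook construction, not code from this chapter.

```python
def van_der_corput(n, base=2):
    """n-th element of the base-b van der Corput low-discrepancy sequence."""
    q, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

def halton(n, bases=(2, 3)):
    """n-th point of a 2-D Halton sequence: one van der Corput stream per axis."""
    return tuple(van_der_corput(n, b) for b in bases)
```

Using coprime bases per dimension keeps the axes from falling into lockstep, which is what drives the low discrepancy of the resulting point set.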
Opposition may utilize low-discrepancy sequences in some situations. In general,
though, low-discrepancy sequences are simply a means for attaining a uniform
distribution without regard to the correlation between the evaluations at these points.
Further, opposition-based techniques simultaneously consider two points in order to
smooth the evaluation function and improve the performance of the sampling algorithm,
whereas quasi-random sequences are often concerned with many more points.
These methods have been applied to evolutionary algorithms, where a performance
study found that sampling methods such as Uniform, Normal, Halton, Sobol, Faure
and low-discrepancy sequences are valuable only for low-dimensional (d < 10, and
so not highly sparse) populations [11].

3.3 Algorithms
In this section we describe Differential Evolution (DE) and Population-Based
Incremental Learning (PBIL), which are the
parent algorithms for this study. We also describe the oppositional variants of these
methods.

3.3.1 Differential Evolution


Differential Evolution (DE) was proposed by Price and Storn in 1995 [21]. It is an
effective, robust, and simple global optimization algorithm [8]. DE is a population-
based directed search method. Like other evolutionary algorithms, it starts with an
initial population vector. If no preliminary knowledge about the solution space is
available, the population is randomly generated. Each vector of the initial population
can be generated as follows [8]:

Xi,j = a_j + rand_j(0, 1) × (b_j − a_j);  j = 1, 2, ..., d, (3.9)

where d is the problem dimension; a_j and b_j are the lower and upper boundaries
of variable j, respectively; and rand_j(0, 1) is a uniformly generated random number
in [0, 1].
Assume Xi,t (i = 1, 2, ..., N_p) are candidate solution vectors in generation t, where
N_p is the population size. Successive populations are generated by adding the
weighted difference of two randomly selected vectors to a third randomly selected
vector. For classical DE (DE/rand/1/bin), the mutation, crossover, and selection
operators are defined as follows:
Mutation: For each vector Xi,t in generation t a mutant vector Vi,t is defined by

Vi,t = Xa,t + F(Xc,t − Xb,t), (3.10)

where i ∈ {1, 2, ..., N_p} and a, b, and c are mutually different random integer indices
selected from {1, 2, ..., N_p}. Further, i, a, b, and c must be distinct, so N_p ≥ 4 is
necessary. F ∈ [0, 2] is a real constant which determines the amplification of
the added differential variation Xc,t − Xb,t. Larger values for F result in higher
diversity in the generated population, and lower values lead to faster convergence.
Crossover: By shuffling competing solution vectors, DE utilizes the crossover operation to generate new solutions and to increase population diversity. Classical DE (DE/rand/1/bin) uses binary crossover (denoted by 'bin' in the notation), which defines the following trial vector:

U_{i,t} = (U_{1i,t}, U_{2i,t}, ..., U_{di,t}),   (3.11)

where

U_{ji,t} = V_{ji,t}   if rand_j(0, 1) ≤ Cr or j = k,
U_{ji,t} = X_{ji,t}   otherwise,   (3.12)

with Cr ∈ (0, 1) the predefined crossover rate, rand_j(0, 1) the jth evaluation of a uniform random number generator, and k ∈ {1, 2, ..., d} a random parameter index,
3 The Use of Opposition for Decreasing Function Evaluations 57

chosen once for each i to make sure that at least one parameter is always selected from the mutated vector V_{ji,t}. The most popular values for Cr lie in the range (0.4, 1) [14].
Selection: This step decides which vector (U_{i,t} or X_{i,t}) becomes a member of the next generation, t + 1. For a minimization problem, the vector with the lower objective function value is chosen (greedy selection).
This evolutionary cycle (i.e., mutation, crossover, and selection) is repeated N_p (population size) times to generate a new population, and successive generations are produced until the predefined termination criteria are met.

3.3.2 Opposition-Based Differential Evolution


By utilizing opposite points, we can obtain fitter starting candidate solutions even when there is no a priori knowledge about the solution(s), as follows:
1. Randomly initialize the population P of size N_p.
2. Calculate the opposite population by

   OP_{i,j} = a_j + b_j − P_{i,j},   i = 1, 2, ..., N_p;  j = 1, 2, ..., d,   (3.13)

   where P_{i,j} and OP_{i,j} denote the jth variable of the ith vector of the population and the opposite population, respectively.
3. Select the N_p fittest individuals from P ∪ OP as the initial population.
The general ODE scheme also employs generation jumping, but it is not used in this work; only opposition-based population initialization and sample generation are employed.
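The three initialization steps above can be sketched in a few lines of NumPy; `opposition_init` is an illustrative name, and minimization of f is assumed.

```python
import numpy as np

def opposition_init(f, Np, a, b, rng=None):
    """Opposition-based initialization: sample Np random points (Eq. 3.9),
    form their opposites (Eq. 3.13), and keep the Np fittest of P ∪ OP."""
    rng = np.random.default_rng() if rng is None else rng
    a, b = np.asarray(a, float), np.asarray(b, float)
    P = a + rng.random((Np, a.size)) * (b - a)   # step 1: random population
    OP = a + b - P                               # step 2: opposite population
    pool = np.vstack([P, OP])
    fit = np.array([f(x) for x in pool])
    keep = np.argsort(fit)[:Np]                  # step 3: Np fittest (minimization)
    return pool[keep], fit[keep]

pop, fit = opposition_init(lambda x: float(np.sum(x ** 2)), Np=5,
                           a=[-5.0, -5.0], b=[5.0, 5.0],
                           rng=np.random.default_rng(1))
```

Note that the opposite of a point in the box [a, b] always remains inside the box, so no boundary handling is needed.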

3.3.3 Population-Based Incremental Learning


PBIL is a stochastic search which abstracts the population of samples found in evolutionary computation into a probability distribution over each variable of the solution [2]. At each generation a new sample population is generated based on the
current probability distribution. The best individual is retained and the probability
model is updated accordingly to reflect the belief regarding the final solution. The
update rule is similar to that found in reinforcement learning.
A population is represented by a probability matrix M := (m_{i,j})_{d×c}, which stores the probability distribution over each possible element of the solution. For a binary problem, a solution is S := (s_{i,j})_{d×c} ∈ {0, 1}^{d×c}, and m_{i,j} ∈ [0, 1] is the probability that element s_{i,j} = 1. For continuous problems, probability distributions are used instead of a threshold value, as is the case for discrete problems [18].
Learning consists of utilizing M to generate population P of k samples. After
evaluation of each sample according to function f the “best” (B∗ ) solution is re-
tained and M is updated according to

Mt = (1 − α )Mt−1 + α B∗ (3.14)

where 0 < α < 1 is the learning rate and t ≥ 1 is the iteration. Initially, m_{i,j} = 0.5 to reflect the lack of prior information.
To abstract the crossover and mutation operators of evolutionary computation,
PBIL employs a randomization of M. Let 0 < β , γ < 1 be the probability of mutation
and degree of mutation, respectively. Then with probability β

mi, j = (1 − γ )mi, j + γ · random(0 or 1). (3.15)

Algorithm 1 provides a summary of this approach.

Algorithm 1. Population-Based Incremental Learning [2]


1: {Initialize probabilities}
2: M0 := (mi, j ) = 0.5
3: for t = 1 to ω do
4: {Generate samples}
5: G1 = generate samples(k,Mt−1 )

6: {Find best sample}


7: B∗ = select best({B∗ } ∪ G1)

8: {Update M}
9: Mt = (1 − α )Mt−1 + α B∗

10: {Mutate probability vector}


11: for i = 0...d and j = 0...c do
12: if random(0, 1) < β then
13: mi, j = (1 − γ )mi, j + γ · random(0 or 1)
14: end if
15: end for
16: end for
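Algorithm 1 can be condensed into a short NumPy sketch. This is illustrative only (f is maximized over binary strings here, and the parameter values are taken from Table 3.3):

```python
import numpy as np

def pbil(f, d, k=24, iters=150, alpha=0.35, beta=0.15, gamma=0.25, rng=None):
    """Minimal binary PBIL following Algorithm 1 (maximization of f)."""
    rng = np.random.default_rng() if rng is None else rng
    m = np.full(d, 0.5)                      # M0: uninformative probabilities
    best, best_val = None, -np.inf
    for _ in range(iters):
        samples = (rng.random((k, d)) < m).astype(int)   # generate_samples
        vals = np.array([f(s) for s in samples])
        if vals.max() > best_val:                        # select_best
            best, best_val = samples[vals.argmax()].copy(), vals.max()
        m = (1 - alpha) * m + alpha * best               # update, Eq. (3.14)
        mut = rng.random(d) < beta                       # mutate with prob. beta
        m[mut] = (1 - gamma) * m[mut] + gamma * rng.integers(0, 2, mut.sum())
    return best, best_val

# Maximize the number of ones in a 20-bit string
best, val = pbil(lambda s: s.sum(), d=20, iters=100,
                 rng=np.random.default_rng(0))
```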

3.3.4 Oppositional Population-Based Incremental Learning


The opposition-based version of PBIL (OPBIL), shown in Algorithm 2, employs the opposition concept to improve diversity within the sample generation phase. A direct effect on convergence rate is observed as a consequence of this mechanism. Further, OPBIL has an ability to escape the local optima that estimation of distribution algorithms such as PBIL are prone to becoming trapped in [19]. The description provided here is brief; the interested reader is referred to [29] for a more detailed account.
The general structure of the PBIL algorithm remains; however, aside from the sampling procedure, the update and mutation rules are altered to reflect a degrading

degree of opposition with respect to the number of iterations. That is, as the number of iterations t → ∞, the amount by which two opposite solutions differ approaches 1 bit (w.r.t. Hamming distance).
Sampling is accomplished using an opposite guessing strategy whereby half of the population, R_1, is generated using the probability matrix M, and the other half is generated via a change in Hamming distance to a given element of R_1. The distance is calculated using an exponentially decaying function of the form

ξ(t) = l · e^(ct),   (3.16)

where l is the maximum number of bits in a guess and c < 0 is a user-defined constant.
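A concrete realization of this decay might look as follows. The exact flip rule and the value of c are our assumptions (the chapter only specifies the decaying form of ξ): an opposite guess is produced by flipping roughly ξ(t) randomly chosen bits, so that opposites start far away in Hamming distance and shrink toward a 1-bit flip as t grows.

```python
import numpy as np

def xi(t, l, c=-0.05):
    """Decaying opposite distance, Eq. (3.16): xi(t) = l * e^(c t), c < 0."""
    return l * np.exp(c * t)

def opposite_guess(x, t, c=-0.05, rng=None):
    """Flip about xi(t) bits of binary vector x (at least one)."""
    rng = np.random.default_rng() if rng is None else rng
    n_flip = max(1, int(round(xi(t, len(x), c))))
    idx = rng.choice(len(x), size=n_flip, replace=False)
    y = x.copy()
    y[idx] ^= 1          # flip the selected bits
    return y

x = np.zeros(128, dtype=int)
early = opposite_guess(x, t=0, rng=np.random.default_rng(0))    # flips all bits
late = opposite_guess(x, t=200, rng=np.random.default_rng(0))   # flips one bit
```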
Updating of M is performed in lines 14–28. The sample best solution η is used to focus the search either when a new global best solution has been discovered (i.e., η = B∗) or with probability p_amp. When no new optimum has been discovered, this strategy tends away from B∗. The actual update is performed in line 16 and is a reinforcement-learning-style update based on the sample best solution; the degree to which M is updated is controlled by the user-defined parameter 0 < ρ < 1.
Should the above criteria for an update fail, a decay of M is attempted with probability p_decay in line 17. The decay, performed in lines 21–27, slowly tends M away from B∗. This portion of the algorithm is intended to prevent premature convergence and to aid exploration through small, smooth updates. The parameter 0 < τ < 1 is user defined, where often τ ≪ ρ.
The equations in lines 11 and 12 were determined experimentally, and no argument regarding their optimality is made; indeed, there likely exist many functions that would yield more desirable results. They were chosen because they tend to lead to good behavior and outcomes for PBIL.

3.4 Experimental Setup


In this section we provide a discussion of the image thresholding problem, and the
application of evolutionary algorithms to solving it. Additionally, the evaluation
measure we employ to grade the quality of a segmentation is presented. Parame-
ter settings and problem representation are also given.

3.4.1 Evolutionary Image Thresholding


Image segmentation involves partitioning an image I into a set of segments, with the goal of locating objects in the image which are sufficiently similar. Thresholding is a special case of image segmentation with only two classes, defined by whether a given pixel is above or below a specific threshold value ω. This task has numerous applications, and several general segmentation algorithms have been proposed [33]. Due to the variety of image types, no single algorithm segments all images optimally.

Algorithm 2. Pseudocode for the OPBIL algorithm


Require: Maximum iterations, ω
Require: Number of samples per iteration, k
1: {Initialize probabilities}
2: M0 := (mi, j ) = 0.5

3: for t = 1 to ω do
4: {Generate samples}
5: R1 = generate samples(k/2,M)
6: R̆1 = generate opposites(R1)

7: {Find best sample}


8: η = select best({R1 ∪ R̆1 })
9: B∗ = select best(B∗ , η )

10: {Compute probabilities}


11: p_amp(Δ) = 1 − e^(−bΔ)
12: p_decay(Δ; f(B∗), f(η)) = (1 − (f(B∗) − f(η))/f(B∗)) / √(Δ + 1)
13: {Update M}
14: if η = B∗ OR random(0, 1) < pamp then
15: Δ =0
16: Mt = (1 − ρ )Mt−1 + ρη
17: else if random(0, 1) < pdecay then
18: if random(0, 1) < pdecay then
19: use B∗ in line 23 instead of η
20: end if
21: for all i, j, each with probability < p_decay do
22:   if η_{i,j} = B∗_{i,j} then
23:     m_{i,j} = m_{i,j} · (1 − τ · random(0, 1)) if η_{i,j} = 1, else m_{i,j} · (1 + τ · random(0, 1))
24:   else
25:     m_{i,j} = m_{i,j} · (1 + τ · random(0, 1)) if η_{i,j} = 1, else m_{i,j} · (1 − τ · random(0, 1))
26:   end if
27: end for
28: end if
29: Δ = Δ +1
30: end for

Many general-purpose segmentation algorithms are histogram based: they aim to discover a deep valley between two peaks and set ω equal to that value. However, many real-world problems have multimodal histograms, and deciding which valley corresponds to the best thresholding is not obvious. The difficulty is compounded when the relative size of the peaks is large (so that the valley becomes hard to distinguish) or when valleys are very broad. Several algorithms have been proposed to overcome this [33], and methods based on information theory and other statistical techniques have been proposed as well [13].
Typically, the problem of segmentation involves a high degree of uncertainty
which makes solving the problem difficult. Stochastic searches such as evolution-
ary algorithms and population-based incremental learning often cope well with un-
certainty in optimization, hence they provide an interesting alternative approach to
traditional methods.
The main difficulty associated with population-based methods is that they are computationally expensive, due to the large number of function calls required during the optimization process. One approach to minimizing uncertainty is to split the image into subimages which (hopefully) have characteristics allowing for an easy segmentation. Combining the subimages then forms the entire segmented image, although this requires extra function calls to analyze each subimage. An important caveat is that a local segmentation may be good for its subimage but not useful with respect to the image as a whole.
In this chapter we investigate thresholding with population-based techniques. Using ODE we do not perform any splitting into subimages, while for OPBIL we split I into four equal-sized subregions, each having its own threshold value. In both cases a single evaluation performs the segmentation, and we show that the opposition-based techniques reduce the required number of function calls.
As stated above, there exist many different segmentation algorithms, and numerous methods for evaluating the quality of a segmentation have also been put forth [34]. In this chapter we use a simple method which aims to minimize the discrepancy between the original M × N gray-level image I and its thresholded image T [31]:
∑_{i=1}^{M} ∑_{j=1}^{N} |I_{i,j} − T_{i,j}|   (3.17)

where | · | denotes the absolute value. Using a different evaluation measure would change the outcome of the algorithm; however, segmentation posed in this manner remains computationally expensive.
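Equation (3.17) is straightforward to implement. In the sketch below, the cast to a signed integer type prevents the pixel differences from wrapping around in unsigned arithmetic; mapping pixels to the two levels 0 and 255 is one common convention, which the chapter does not fix.

```python
import numpy as np

def discrepancy(I, T_img):
    """Segmentation quality, Eq. (3.17): sum of |I_ij - T_ij|."""
    return int(np.abs(I.astype(int) - T_img.astype(int)).sum())

def threshold(I, omega):
    """Binary thresholding: pixels above omega -> 255, others -> 0."""
    return np.where(I > omega, 255, 0)

# Tiny 2x2 example: score three candidate thresholds
I = np.array([[10, 200], [30, 250]], dtype=np.uint8)
scores = {w: discrepancy(I, threshold(I, w)) for w in (50, 128, 220)}
```

A search algorithm such as DE or PBIL would then minimize `discrepancy(I, threshold(I, omega))` over ω.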
We use the images shown in Figure 3.3 to evaluate the algorithms. The first column shows the original image, the second the gold image, and the third the approximate target image for ODE and OPBIL (i.e., the value-to-reach targets discussed below). The gold image is shown for completeness; it is not required in the experiments.
Both experiments employ a value-to-reach (VTR) stopping criterion, which measures the time or function calls required to reach a specific value. The VTR values have been determined experimentally and are given in Table 3.1. Due to

Fig. 3.3 The images used to benchmark the algorithms. The first column is the original gray-level image, the second is the gold image, and the third column is the target image of the optimization within the required function calls

the respective algorithms' ability to solve this problem, given the representation and convergence behavior, these values differ for ODE and OPBIL.

Table 3.1 Value-to-reach (VTR) values for the O/DE and O/PBIL experiments

Image O/DE O/PBIL


1 19579 19850
2 3391 4925
3 7139 7175
4 19449 19850
5 19650 19700
6 22081 22700

3.4.2 Parameter Settings and Solution Representation


The ODE and OPBIL algorithms differ in that the former is a real-valued optimization algorithm while the latter operates in the binary domain. Therefore, the solution representations also differ and consequently directly affect the quality of results. However, the goal of this chapter is to show the ability of opposition to decrease the required number of function evaluations, so fine-tuning aspects of these algorithms is not the focus of this investigation.

ODE Settings

The differential evolution experiments follow standard encoding guidelines. The user-defined parameter settings were determined empirically, as shown in Table 3.2.

Table 3.2 Parameter settings for differential evolution-based experiments

Parameter Value
Population size Np = 5
Amplification factor F = 0.9
Crossover probability Cr = 0.9
Mutation strategy DE/rand/1/bin
Maximum function calls MAXNFC = 200
Jumping rate (no jumping) Jr = −1

In order to maintain a reliable and fair comparison, these settings are kept un-
changed for all conducted experiments for both DE and ODE algorithms.

OPBIL Settings

As stated above, OPBIL requires a binary solution representation. However, thresholding aims to discover an integer value 0 ≤ T ≤ 255 with which to perform the segmentation operation I > T. Additionally, we use an approach of splitting I into subimages I_1, ..., I_16, where each I_i is an equal-sized square region of the original image.
The encoding is a matrix R := (r_{i,j})_{16×8}, corresponding to 16 subimages, each having a gray-level threshold value < 2^8 = 256. Each row of R is converted to an integer which is used to segment the respective region of I. The extra regions increase problem difficulty, as they result in more deceptive and multimodal problems.
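Decoding a row of R into a threshold can be done as below. Reading the bits most-significant-first is our assumption; the chapter does not specify the bit order.

```python
import numpy as np

def decode_thresholds(R):
    """Decode the 16x8 bit matrix R into 16 integer thresholds in [0, 255]:
    each row is read as an 8-bit binary number, most significant bit first."""
    weights = 2 ** np.arange(7, -1, -1)      # 128, 64, ..., 1
    return R @ weights

R = np.zeros((16, 8), dtype=int)
R[0] = [1, 1, 1, 1, 1, 1, 1, 1]              # row 0 decodes to 255
R[1] = [0, 0, 0, 0, 0, 0, 0, 1]              # row 1 decodes to 1
thresholds = decode_thresholds(R)
```

Each of the 16 decoded values would then threshold its own square subregion of I.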
Parameter settings for PBIL and OPBIL are as follows:

Table 3.3 Parameter settings for population-based incremental learning experiments

Parameter Value
Maximum iterations t = 150
Sample size k = 24
PBIL Only
Learning rate α = 0.35
Mutation probability β = 0.15
Mutation degree γ = 0.25
OPBIL Only
Update frequency control b = 0.1
Learning rate ρ = 0.25
Probability decay τ = 0.0005

3.5 Experimental Results


Using the test images and parameter settings stated above, we show the ability of oppositional concepts to decrease the required function calls. Due to space limitations, only the results for OPBIL are presented in detail, but similar behavior should be observed for ODE. All results presented correspond to the average of 30 runs, unless otherwise noted.

3.5.1 ODE
Table 3.4 presents a summary of the results regarding function calls for ODE versus DE. Except for image 2, we observe a decrease in function calls for all images. Images 4 and 5 show statistically significant improvements with respect to the decreased number of function calls, using a t-test at a 0.9 confidence level. Further, except for image 6, we observe a lower standard deviation, which indicates higher reliability of the results.

Computing the overall mean function calls, we find an improvement of 322 − 277 = 45 function calls, an average of 45/6 = 7.5 saved function calls per image. The ratio 322/277 ≈ 1.16 means that DE requires approximately 16% more function calls than ODE. For expensive optimization problems this can correspond to a great amount of savings with respect to algorithm run-time.

Table 3.4 Summary results for DE vs. ODE with respect to required function calls. μ and σ
correspond to the mean and standard deviation of the subscripted algorithm, respectively

Image μDE σDE μODE σODE


1 74 41 60 34
2 32 20 35 16
3 42 23 37 19
4 74 36 54 30
5 47 37 45 22
6 63 26 46 31
Total 322 277

3.5.2 OPBIL
Table 3.5 shows the expected number of iterations (each iteration uses 24 function calls) needed to attain the value-to-reach given in Table 3.1. In all cases OPBIL reaches its goal in fewer iterations than PBIL, where the results for images 2, 5, and 6 are statistically significant using a t-test at a 0.9 confidence level. Additionally, in all cases we find a lower standard deviation, indicating more reliable behavior for OPBIL.
Overall, OPBIL saves 444 − 347 = 97 iterations, an average of about 16 iterations (16 × 24 = 384 function calls) per image. The ratio 444/347 ≈ 1.28 means that PBIL requires about 28% more iterations than OPBIL.

Table 3.5 Summary results for PBIL vs. OPBIL with respect to required iterations. μ and σ correspond to the mean and standard deviation of the subscripted algorithm, respectively

Image μPBIL σPBIL μOPBIL σOPBIL


1 62 19 53 12
2 80 25 65 9
3 61 12 60 5
4 47 14 40 10
5 68 13 53 9
6 128 21 76 14
Total 444 347

In the following we analyze the correlation and distance for each sample per iteration, to examine whether the negative-correlation and larger-distance properties between a guess and its opposite are present in the samples. If they are, this supports (although does not confirm) the hypothesis that the observed improvements are due to these characteristics.
Figure 3.4 shows the averaged correlation cor(R_1^t, R̆_1^t) for the generated samples R_1^t and their respective opposites R̆_1^t at iteration t. The solid line corresponds to OPBIL and the dotted line to PBIL. These plots show that OPBIL indeed has a lower correlation (with respect to the evaluation function) than PBIL (for which we generate R_1 as above and let R̆_1 also be randomly generated). In all cases the correlation is much stronger for PBIL (noting that once an algorithm reaches the VTR we set the correlation to 0).


Fig. 3.4 Sample mean correlation over 30 trials for PBIL (dotted) versus OPBIL (solid). We
find OPBIL indeed yields a lower correlation than PBIL

We also examine the mean distance

ḡ = (2/k) ∑_{i=1}^{k/2} g(R_{1,i}^t, R̆_{1,i}^t),   (3.18)

which computes the fitness-distance between the ith guess R_{1,i} and its opposite R̆_{1,i} at iteration t, shown in Figure 3.5. The distance for PBIL is relatively low throughout the 150 iterations, gently decreasing as the algorithm converges. However, as a consequence of OPBIL's ability to maintain diversity, the distances between samples increase during the early stages of the search and then similarly rapidly decrease. Indeed, this is consistent with the lower correlation shown above.
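A minimal version of Eq. (3.18) is shown below, taking g as the absolute fitness difference; that choice of g is our assumption, since the chapter leaves g as a generic fitness-distance.

```python
import numpy as np

def g_bar(R1, R1_opp, f):
    """Eq. (3.18): mean fitness-distance over the k/2 (guess, opposite) pairs."""
    k = 2 * len(R1)                  # R1 holds half of the k samples
    return (2.0 / k) * sum(abs(f(x) - f(y)) for x, y in zip(R1, R1_opp))

# Toy check with the ones-count fitness on 8-bit strings
R1 = [np.zeros(8, dtype=int), np.ones(8, dtype=int)]
R1_opp = [1 - x for x in R1]         # bitwise opposites
d = g_bar(R1, R1_opp, lambda s: int(s.sum()))
```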


Fig. 3.5 Sample mean distance over 30 trials for samples of PBIL (dotted) versus OPBIL
(solid). We find OPBIL indeed yields a higher distance between paired samples than PBIL

The final test examines the standard deviation (w.r.t. the evaluation function) of the distance between samples, given in Figure 3.6. Both algorithms have similarly shaped plots with respect to this measure, reflecting the convergence rates of the respective algorithms. It seems that the use of opposition aids the search by infusing diversity early on and quickly focusing once a high-quality optimum is found. Conversely, basic PBIL does not include this bias, so its convergence is less rapid.


Fig. 3.6 Sample mean standard deviations over 30 trials for samples of PBIL (dotted) versus
OPBIL (solid)

3.6 Conclusion
In this chapter we have discussed the application of opposition-based computing
techniques to reducing the required number of function calls. Firstly, a brief

introduction to the underlying concepts of opposition was given, along with conditions under which opposition-based methods should be successful. A comparison to the similar concepts of antithetic variates and quasi-random/low-discrepancy sequences made clear the uniqueness of our method.
Two recently proposed algorithms, ODE and OPBIL, were briefly introduced, together with the manner in which opposition is used to improve their parent algorithms, DE and PBIL, respectively. The way opposites are used differs between the two, but the underlying concepts are the same.
Using the expensive optimization problem of image thresholding as a benchmark, we examined the ability of ODE and OPBIL to lower the number of function calls required to reach a pre-specified target value. Both algorithms were found to reduce the expected number of function calls, ODE by approximately 16% (function calls) and OPBIL by 28% (iterations). Further, concentrating on OPBIL, we showed the hypothesized lower correlation and higher fitness-distance measures for a quality opposite mapping.
Our results are very promising; however, future work is required in several regards. First, a further theoretical basis for opposition and for choosing opposite mappings is needed. This could lead to general implementation strategies when no prior knowledge is available. Further application to different real-world problems is also desired.

Acknowledgements
This work has been partially supported by Natural Sciences and Engineering Research Coun-
cil of Canada (NSERC).

References
1. Bai, F., Wu, Z.: A novel monotonization transformation for some classes of global opti-
mization problems. Asia-Pacific Journal of Operational Research 23(3), 371–392 (2006)
2. Baluja, S.: Population Based Incremental Learning - A Method for Integrating Genetic
Search Based Function Optimization and Competitive Learning. Tech. rep., Carnegie
Mellon University, CMU-CS-94-163 (1994)
3. Engelbrecht, A.: Fundamentals of Computational Swarm Intelligence. Wiley, Chichester
(2005)
4. Glover, F., Laguna, M.: Tabu Search. Kluwer, Dordrecht (1997)
5. Goldberg, D.E., Horn, J., Deb, K.: What makes a problem hard for a classifier system?
Tech. rep. In: Collected Abstracts for the First International Workshop on Learning Clas-
sifier Systems (IWLCS 1992), NASA Johnson Space (1992)
6. Niederreiter, H.: Random Number Generation and Quasi-Monte Carlo Methods. Society
for Industrial and Applied Mathematics (1992)
7. Holland, J.H.: Adaptation in Natural and Artificial Systems. The University of Michigan
Press (1975)
8. Price, K., Storn, R., Lampinen, J.A.: Differential Evolution: A Practical Approach to
Global Optimization. Springer, Heidelberg (2005)

9. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Sci-
ence 220(4598), 671–680 (1983)
10. Lemieux, C.: Monte Carlo and Quasi-Monte Carlo Sampling. Springer, Heidelberg
(2009)
11. Maaranen, H., Miettinen, K., Penttinen, A.: On initial populations of genetic algorithms for continuous optimization problems. Journal of Global Optimization 37(3), 405–436 (2007)
12. Montgomery, J., Randall, M.: Anti-pheromone as a tool for better exploration of search
space. In: Third International Workshop, ANTS, pp. 1–3 (2002)
13. O’Gorman, L., Sammon, M., Seul, M. (eds.): Practical Algorithms for Image Analysis.
Cambridge University Press, Cambridge (2008)
14. Rahnamayan, S., Tizhoosh, H.R., Salama, M.M.A.: Opposition-based differential evolu-
tion. IEEE Transactions on Evolutionary Computation 12(1), 64–79 (2008)
15. Rahnamayan, S., Tizhoosh, H.R., Salama, S.: Opposition-based Differential Evolution
Algorithms. In: IEEE Congress on Evolutionary Computation, pp. 7363–7370 (2006)
16. Rahnamayan, S., Tizhoosh, H.R., Salama, S.: Opposition-based Differential Evolution
Algorithms for Optimization of Noisy Problems. In: IEEE Congress on Evolutionary
Computation, pp. 6756–6763 (2006)
17. Rubinstein, R.: Monte Carlo Optimization, Simulation and Sensitivity of Queueing Net-
works. Wiley, Chichester (1986)
18. Sebag, M., Ducoulombier, A.: Extending Population-Based Incremental Learning to
Continuous Search Spaces. In: Eiben, A.E., Bäck, T., Schoenauer, M., Schwefel, H.-P.
(eds.) PPSN 1998. LNCS, vol. 1498, pp. 418–427. Springer, Heidelberg (1998)
19. Shapiro, J.: Diversity loss in general estimation of distribution algorithms. In: Parallel
Problem Solving in Nature IX, pp. 92–101 (2006)
20. Shokri, M., Tizhoosh, H.R., Kamel, M.: Opposition-based Q(lambda) Algorithm. In:
IEEE International Joint Conference on Neural Networks, pp. 646–653 (2006)
21. Storn, R., Price, K.: Differential evolution- a simple and efficient heuristic for global
optimization over continuous spaces. Journal of Global Optimization 11, 341–359 (1997)
22. Tizhoosh, H.R.: Reinforcement Learning Based on Actions and Opposite Actions. In:
International Conference on Artificial Intelligence and Machine Learning (2005)
23. Tizhoosh, H.R.: Opposition-based Reinforcement Learning. Journal of Advanced Com-
putational Intelligence and Intelligent Informatics 10(4), 578–585 (2006)
24. Tizhoosh, H.R., Ventresca, M. (eds.): Oppositional Concepts in Computational Intelli-
gence. Springer, Heidelberg (2008)
25. Toh, K.: Global optimization by monotonic transformation. Computational Optimization
and Applications 23(1), 77–99 (2002)
26. Ventresca, M., Tizhoosh, H.R.: Improving the Convergence of Backpropagation by Op-
posite Transfer Functions. In: IEEE International Joint Conference on Neural Networks,
pp. 9527–9534 (2006)
27. Ventresca, M., Tizhoosh, H.R.: Opposite Transfer Functions and Backpropagation
Through Time. In: IEEE Symposium on Foundations of Computational Intelligence, pp.
570–577 (2007)
28. Ventresca, M., Tizhoosh, H.R.: Simulated Annealing with Opposite Neighbors. In: IEEE
Symposium on Foundations of Computational Intelligence, pp. 186–192 (2007)
29. Ventresca, M., Tizhoosh, H.R.: A diversity maintaining population-based incremental
learning algorithm. Information Sciences 178(21), 4038–4056 (2008)

30. Ventresca, M., Tizhoosh, H.R.: Numerical condition of feedforward networks with op-
posite transfer functions. In: IEEE International Joint Conference on Neural Networks,
pp. 3232–3239 (2008)
31. Weszka, J., Rosenfeld, A.: Threshold evaluation techniques. IEEE Transactions on Sys-
tems, Man and Cybernetics 8(8), 622–629 (1978)
32. Wu, Z., Bai, F., Zhang, L.: Convexification and concavification for a general class of
global optimization problems. Journal of Global Optimization 31(1), 45–60 (2005)
33. Yoo, T. (ed.): Insight into Images: Principles and Practice for Segmentation, Registration,
and Image Analysis. AK Peters (2004)
34. Zhang, H., Fritts, J., Goldman, S.: Image segmentation evaluation: A survey of unsuper-
vised methods. Computer Vision and Image Understanding 110, 260–280 (2008)
Chapter 4
Search Procedure Exploiting Locally
Regularized Objective Approximation:
A Convergence Theorem for Direct Search
Algorithms

Marek Bazan

Abstract. The Search Procedure Exploiting Locally Regularized Objective Approximation is a method to speed up local optimization processes in which the objective function evaluation is expensive. It was introduced in [1] and further developed in [2]. In this paper we present the convergence theorem of the method. The theorem is proved for the EXTREM [6] algorithm but applies to any Gauss-Seidel algorithm that uses sequential quadratic interpolation (SQI) as a line search method. After some extension it can also be applied to conjugate direction algorithms. The proof is based on the Zangwill theory of closed transformations. This method of proof was chosen instead of the sufficient-decrease approach because its crucial element is an extension of the SQI convergence proof from [14], which is itself based on Zangwill's theory.

4.1 Introduction
Optimization processes with objective functions that are expensive to evaluate – typically because each evaluation requires solving a large system of linear equations or simulating a physical process – occur in many fields of modern design. The main strategy for speeding up such processes by constructing a model that approximates the objective function is trust-region methods [4]. The application of radial basis function approximation as the approximation model in trust-region methods was discussed in [13]. The standard way to prove convergence of a trust-region method is the method of sufficient decrease.
In [1] and [2] we presented the search procedure which can be viewed as an
alternative to trust region methods. It relies on combining the direct search algo-
rithm EXTREM [6] with the locally regularized radial basis approximation. The
Marek Bazan
Institute of Informatics, Automatics and Robotics, Department of Electronics,
Wrocław University of Technology, ul. Janiszewskiego 11/17, 50-372 Wrocław, Poland
e-mail: marek.bazan@pwr.wroc.pl

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 73–103.
springerlink.com c Springer-Verlag Berlin Heidelberg 2010
74 M. Bazan

method achieves a speed-up of the optimization by replacing a number of direct function evaluations with radial basis approximations. Combined with the EXTREM algorithm, the method was implemented within the computer program ROXIE for superconducting magnet design and optimization [3]; it is, however, a general framework, and as we shall see, any search algorithm based on the non-gradient Gauss-Seidel algorithm or on a conjugate direction algorithm can be used. We call our method the Search Procedure Exploiting Locally Regularized Objective Approximation (SPELROA).
In this paper we give the convergence theorem for the SPELROA method combined with the EXTREM minimization algorithm, under the assumption that the radial basis approximation has a relative error less than ε. The method of proof differs from those used for trust-region methods, since it is based on the Zangwill theory of closed transformations. The crucial element of the proof is a modification of the convergence proof of quadratic interpolation as a line search method (cf. [14]) under the assumption that a perturbation of the function value may be introduced into the algorithm at each step. We give the conditions on the function values, as well as on ε, needed to maintain convergence. The proof of the convergence of sequential quadratic interpolation in [14] is based on Zangwill's theory, and as we essentially extend it, we chose this method to prove the convergence of the whole SPELROA method as well.
The plan of this chapter is the following. In the next section we sketch the SPEL-
ROA method. In the third section we give some theory from [21] to be used in the
next section to prove the main result. Finally in the fourth section we discuss the
radial basis approximation and heuristics used for its construction. We also describe
difficulties in establishing strict error bound in the current state of the development
of radial basis function approximation for sparse data. In the reminder of the paper
we give numerical results for three test functions of 6,8 and 11 variables from a set
of test functions proposed in [12].

4.2 The Search Procedure


Let there be given a direct search optimization algorithm A that uses quadratic interpolation as a line search method. The SPELROA method combined with algorithm A can be written in the form of the following algorithm (cf. [2]).
While generating the set Z we have to take care that data points are not placed too close to each other. When two points are too close (the distance being controlled by a user-supplied parameter whose value is relative to the diameter of the set Z), one of the points has to be replaced by another point not yet included. Such a procedure of constructing Z keeps the separation distance (cf. [16]) greater than the user-supplied parameter value and therefore guarantees that the radial basis function interpolation matrix is nonsingular. The crucial step of the scheme is step 3, which contains a threefold check of whether the approximation f˜(xk) can be used in algorithm A instead of f(xk) evaluated directly. The conditions checked in steps 3.a) and 3.b) are related to the radial basis approximation and will be discussed

Algorithm 1. The Search Procedure Exploiting Locally Regularized Objective Approximation

Input: f : Rd → R – the objective function,
       x0 ∈ Rd – a starting point,
       ε > 0 – prescribed accuracy of the approximation,
       f˜(·) – a radial basis function approximation of f(·),
       Is – the number of initial steps performed by the direct optimization algorithm A,
       N < Is – the size of the dataset used to construct the approximation f˜(·),
       ε-check – a procedure checking the conditions required for the convergence theorem to hold.

0. Perform Is initial steps of algorithm A.
1. In the k-th step, generate the point xk at which the function value is to be evaluated by algorithm A.
2. Generate a set Z from the N nearest points at which the function was evaluated directly.
3. Judge whether a reliable approximation of f(xk) can be constructed at xk.
   a. If the point xk is located in a reliable region of the search domain, construct the approximation f˜ and evaluate f˜(xk).
   b. If the approximation f˜(xk) was correctly constructed, perform an ε-check.
   c. If the ε-check is positive (i.e. the procedure returns true), substitute

      f(xk) ← f˜(xk)

      into algorithm A.
   d. Else evaluate f(xk) directly.
4. If the stopping criterion of algorithm A is satisfied, stop. Else set k := k + 1 and go to 1.

in section 4.5, whereas the ε-check procedure of step 3.c) is associated with the convergence theorem and will be discussed in section 4.4.
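As a concrete illustration, the data-selection and substitution logic of steps 2 and 3 can be sketched in a few lines. This is a minimal sketch in Python, not the ROXIE implementation: the helpers `reliable`, `build_rbf` and `eps_check` are placeholders for the reliable-region test of Section 4.5.1, the regularized radial basis construction of Section 4.5.2 and the ε-check of Section 4.4.

```python
import numpy as np

def spelroa_step(x_k, evaluated, f, build_rbf, reliable, eps_check, N, sep):
    """One iteration of SPELROA (steps 2-3 of Algorithm 1).  `evaluated`
    holds (point, value) pairs obtained by direct evaluation; `build_rbf`,
    `reliable` and `eps_check` stand in for the chapter's RBF construction,
    reliable-region test and epsilon-check procedures."""
    # Step 2: take the N nearest directly evaluated points ...
    pts = sorted(evaluated, key=lambda p: np.linalg.norm(p[0] - x_k))[:N]
    # ... enforcing the user-supplied separation distance `sep`, which keeps
    # the radial basis interpolation matrix nonsingular (cf. [16]).
    Z = []
    for x, fx in pts:
        if all(np.linalg.norm(x - z) > sep for z, _ in Z):
            Z.append((x, fx))
    # Step 3: threefold check before substituting the approximation.
    if reliable(x_k, Z):                                     # 3.a
        f_tilde = build_rbf(Z)
        if f_tilde is not None and eps_check(x_k, f_tilde):  # 3.b + 3.c
            return f_tilde(x_k), False    # approximated value used
    return f(x_k), True                   # 3.d: direct evaluation
```

The boolean returned alongside the value records whether a direct evaluation was spent, which is exactly the quantity the speed-up of the method is measured by.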

4.3 Zangwill’s Method to Prove Convergence


Feasible direction optimization algorithms can be written as

xk+1 = xk + τk dk

where dk is a search direction and τk is a search step. The k-th iteration of such algorithms can be viewed as a composite algorithmic transformation A = M¹D, where D : Rd → Rd × Rd is a search direction generating transform

D(x) = (x, d),

where d ∈ Rd is a direction vector, whereas M¹ : Rd × Rd → Rd is a line minimization transform

M¹(x, d) = {y : f(y) = min_{τ∈J} f(x + τd), y = x + τ⁰d},

where (x, d) ∈ Rd × Rd, J is the variability interval of the scalar τ, and τ⁰ denotes a minimizing step length.


In his monograph [21] Zangwill gave a method of proving the convergence of feasible direction optimization algorithms based on the properties of the algorithmic transformation A. In this section we sketch the crucial definitions and lemmas used to state the main convergence theorem.
Definition 1. A transformation A : V → V is a transformation of a point into a set when each point x ∈ V is assigned a set A(x) of points from V. The result of applying the transform A(·) to a point xk can be any point xk+1 from the set A(xk), thus

xk+1 ∈ A(xk).

A transformation A = M¹D defining a feasible direction optimization algorithm is a transformation of a point into a set.
Definition 2. We say that a transformation A : V → V is closed at a point x∞ if the conditions
1. xk → x∞, k ∈ K,
2. yk ∈ A(xk), k ∈ K,
3. yk → y∞,
imply
4. y∞ ∈ A(x∞),
where K is a sequence of natural numbers. We say that the transformation A is closed on X ⊂ V if it is closed at every point x ∈ X.
The property of closedness for algorithmic transformations is an analogue of the property of continuity for "usual" functions.
Theorem 1 (see [21], p. 99). Let a transformation A : V → V of a point into a set be an algorithm which, for a given point x1, generates a sequence {xk}∞k=1. Let S ⊂ V be a set of solutions. Let us assume:
1. All points xk are in a compact set X ⊂ V.
2. There exists a function Z : V → R such that:

a. if the point x is not a solution, then for any y ∈ A(x)

Z(y) < Z(x);

b. if the point x is a solution, then either the algorithm stops or for any y ∈ A(x)

Z(y) ≤ Z(x).

3. The transformation A is closed at every point x that is not a solution.

Then either the algorithm stops at a point which is a solution, or any convergent subsequence generated by the algorithm has its limit in the solution set S.

Now we additionally need two lemmas from [21] concerning the closedness of a
composition of closed transforms.

Lemma 1. Let C : W → X be a given function and B : X → Y a transform of a point into a set. If the function C is continuous at the point w∞ and B is closed at C(w∞), then the composition A = BC : W → Y is closed at w∞.

Lemma 2. Let f be a continuous function. Then the transform M¹ is closed if J is a compact interval.

4.4 The Main Result


In practice another line search operator is usually considered, since implementation of the operator M¹(·,·) is expensive. Let us consider a line search operator M∗ defined as

M∗(x, d) = M¹(x, d) ∪ {y = x + τd : f(y) ≤ f(x) − Δ, τ ∈ J}. (4.1)

Its value is the set of points at which the function f is decreased by at least Δ along the direction d from the point x or, when this set is empty, the minimizer of f along that direction. A suggestion for a practical application of the operator M∗ can be found in one of the exercises in [21]. In the algorithm EXTREM such an operator is used instead of M¹; it is implemented as a parabolic search algorithm. Application of the operator M∗ is more practical, in particular in the initial part of the optimization process, where the first steps of the parabolic interpolation search give the most significant decrease of the function value, whereas the subsequent steps usually make f decrease much more slowly.
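The Δ-decrease acceptance rule that distinguishes M∗ from M¹ can be illustrated with a small sketch. This is not the parabolic search of EXTREM itself, only the acceptance logic of (4.1), with a finite grid of trial steps standing in for the interval J.

```python
def m_star(f, x, d, delta, taus):
    """Illustrative realization of the operator M* from (4.1): return the
    first trial point along direction d at which f decreases by at least
    delta; if no trial step achieves this, fall back to the line minimum
    over the grid `taus` (the M^1 branch)."""
    for t in taus:
        if f(x + t * d) <= f(x) - delta:   # sufficient decrease: accept early
            return x + t * d
    # no Delta-decrease found: return the grid minimizer
    return x + min(taus, key=lambda t: f(x + t * d)) * d
```

In the initial phase of the optimization the early-acceptance branch typically fires almost immediately, which is why M∗ is cheaper than a full line minimization.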
The main result of this chapter is the following theorem.

Theorem 2. Let f : Ω → R, Ω ⊂ Rd, be the objective function of the optimization problem, and assume we are given a method to approximate f at certain points of the domain Ω with a relative error ε > 0. If f is strictly convex, the SPELROA method combined with the Gauss-Seidel search algorithm, using approximated sequential quadratic interpolation as a line search method, converges to a stationary point x0 ∈ S, where S = {x : ∇f(x) = 0}.

The proof of Theorem 2 will be based on Theorem 1, with the transformation A given by

A = R M∗Dd M∗Dd−1 · · · M∗D2 M∗D1,

where Di chooses the i-th direction from the orthogonal direction base of the k-th iteration and R is an orthogonalization step that produces a new base along the direction joining x0(k−1) and xd(k−1) from iteration k − 1.

4.4.1 Closedness of the Algorithmic Transformation


To show that assumption 3 of Theorem 1 is fulfilled we first have to show that the transformation M∗ defined in (4.1) is closed. We prove the following lemma.

Lemma 3. Let f be a continuous function. Then the transform M∗ is closed if J is a compact interval.

Proof. According to Definition 2, let us consider sequences {(xk, dk)}∞k=1 and {yk}∞k=1. We assume that
1. (xk, dk) → (x∞, d∞), k ∈ K,
2. yk ∈ M∗(xk, dk), k ∈ K,
3. yk → y∞, k ∈ K.
So we have

yk = xk + τk dk,

where τk ∈ J is such that

f(xk + τk dk) ≤ f(xk) − Δ.

Because τk ∈ J for k ∈ K and J is compact, there exists a convergent subsequence

τk → τ∞, k ∈ K1,

where K1 ⊂ K and τ∞ ∈ J.
For a fixed τ ∈ J it follows from the definition of yk that

f(yk) < f(xk + τ dk) (4.2)

or

f(yk) − f(xk + τ dk) ≤ Δ. (4.3)

Note that if (4.3) is not satisfied then (4.2) is satisfied. Since f is continuous, in the limit we get

f(y∞) = lim_{k∈K1} f(yk) < lim_{k∈K1} f(xk + τ dk) = f(x∞ + τ d∞) (4.4)

and, in the same way,

f(y∞) − f(x∞ + τ d∞) ≤ Δ. (4.5)

Because for any τ either (4.4) or (4.5) is fulfilled, for any point y∗ ∈ M∗(x∞, d∞) we have

f(y∞) < f(y∗), (4.6)

or

f(y∞) − f(y∗) ≤ Δ. (4.7)

On the other hand, at a point y∗ ∈ M∗(x∞, d∞) the function f attains its least value over τ ∈ J, and

y∞ = x∞ + τ∞ d∞, τ∞ ∈ J,

or

f(y∞) < f(x∞) − Δ, (4.8)

and therefore

f(y∗) − f(y∞) ≤ Δ. (4.9)

Comparing (4.8) and (4.9) with (4.6) and (4.7) we get the result

y∞ ∈ M∗(x∞, d∞).

To make use of Lemma 1 in order to show the closedness of the transformation A, we also have to note that the transformations Di(x) = (x, di) (i = 1, ..., d) are continuous functions. For direct search algorithms the transformations Di generate orthogonal directions; they are the same as in the Gauss-Seidel algorithm (cf. [21]). For conjugate direction search algorithms the transformations Di generate successive conjugate directions. The transformation R that generates the orthogonal search directions [d0new, ..., d(d−1)new] for step k is defined as

R(x0(k−1), [d0, ..., d(d−1)]) = (x0(k), [d0new, ..., d(d−1)new]),

where the sequence of new orthogonal vectors is uniquely defined by the orthogonalization of the vectors w0, w1, ..., w(d−1):

w0     = s0 d0 + s1 d1 + ... + s(d−1) d(d−1)
w1     =         s1 d1 + ... + s(d−1) d(d−1)
...
w(d−1) =                       s(d−1) d(d−1)

where the scalars s0, s1, ..., s(d−1) correspond to the step sizes in all directions in step k − 1. The transformation R is uniquely defined, without any conditions on the scalars s0, s1, ..., s(d−1), as long as the orthogonalization is performed using the algorithm presented in [15]; in this case it is also a continuous function. Finally, since the transformation A is a composition of the closed transformations M∗ with the continuous functions Di (i = 0, ..., d − 1) and R, the assumptions of Lemma 1 are satisfied and we see that the transformation A is closed. This proves that assumption 3 of Theorem 1 holds for the unperturbed M∗. In the following subsection we show that the transformation M∗ can be realized by the perturbed transformation M¹.
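The basis-rotation step R admits a compact numerical sketch. The following assumes the old directions are orthonormal columns of D and uses classical Gram-Schmidt purely for illustration; the chapter itself relies on the numerically safer orthogonalization algorithm of [15].

```python
import numpy as np

def new_basis(D, s):
    """Sketch of the transformation R: form w_i = s_i d_i + ... + s_{d-1} d_{d-1}
    from the step sizes s of iteration k-1 and the old directions d_j
    (columns of D), then orthonormalize w_0, ..., w_{d-1}.  The first new
    direction is thus aligned with the total displacement of iteration k-1."""
    d = D.shape[1]
    W = np.column_stack([D[:, i:] @ s[i:] for i in range(d)])
    Q = np.zeros_like(W)
    for i in range(d):  # classical Gram-Schmidt (illustrative only)
        v = W[:, i] - Q[:, :i] @ (Q[:, :i].T @ W[:, i])
        Q[:, i] = v / np.linalg.norm(v)
    return Q
```

For nonzero step sizes the vectors w_i are linearly independent, so the orthogonalization is well defined, matching the remark that R needs no conditions on the scalars s_i.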

4.4.2 A Perturbation in the Line Search


For non-gradient optimization algorithms with an orthogonal basis as the set of search directions, the transformation M∗ is the only place in the algorithm where the perturbation coming from the approximation of the objective function is introduced by the SPELROA method. Therefore, to show that assumption 2 of Theorem 1 is fulfilled, it is sufficient to show that an implementation of the transformation M∗ that allows a perturbation of the function value at level ε > 0 still minimizes the function f along the direction d.

A proof of the convergence of the parabolic interpolation line search method can be found in [8] or [14]. Here we give conditions on the perturbation of the function under which the proof given in [14] remains valid. We keep the notation as close as possible to that of [14].

Let the function f : R → R be unimodal, and for a triplet ζ(i) = (ζ1(i), ζ2(i), ζ3(i)) let f(ζ2(i)) ≤ min{f(ζ1(i)), f(ζ3(i))}, i.e. the interval [ζ1(i), ζ3(i)] contains the unique minimizer of f.
For a non-perturbed objective function we define the set of feasible triplets T ⊂ R3 defining an interval [ζ1, ζ3] that contains the minimizer λ̂ as

T := {ζ ∈ R3 : ζ1 < ζ2 < ζ3, f(ζ2) ≤ min{f(ζ1), f(ζ3)}}
  ∪ {ζ ∈ R3 : ζ1 = ζ2 < ζ3, f′(ζ1) ≤ 0, f(ζ3) ≥ f(ζ1)}
  ∪ {ζ ∈ R3 : ζ1 < ζ2 = ζ3, f′(ζ3) ≥ 0, f(ζ1) ≥ f(ζ3)}
  ∪ {ζ ∈ R3 : ζ1 = ζ2 = ζ3 = λ̂}.

For ζ ∈ T with ζ1 < ζ2 < ζ3, the minimizer of the quadratic interpolating the points (ζ1, f(ζ1)), (ζ2, f(ζ2)), (ζ3, f(ζ3)) equals

λ∗(ζ) = (1/2) · [(ζ3² − ζ2²) f(ζ1) − (ζ3² − ζ1²) f(ζ2) + (ζ2² − ζ1²) f(ζ3)] / [(ζ3 − ζ2) f(ζ1) − (ζ3 − ζ1) f(ζ2) + (ζ2 − ζ1) f(ζ3)].
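The vertex formula can be checked numerically; the sketch below transcribes it literally, and recovers the exact minimizer whenever f itself is a quadratic.

```python
def quad_vertex(z1, z2, z3, f1, f2, f3):
    """Minimizer lambda*(zeta) of the parabola through (z1,f1), (z2,f2),
    (z3,f3), transcribed term by term from the formula above."""
    num = (z3**2 - z2**2) * f1 - (z3**2 - z1**2) * f2 + (z2**2 - z1**2) * f3
    den = (z3 - z2) * f1 - (z3 - z1) * f2 + (z2 - z1) * f3
    return 0.5 * num / den
```

For f(x) = (x − 1)² and the triplet (0, 0.5, 2), for instance, the formula returns the exact minimizer 1.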

The set of admissible replacement triplets A(ζ) is the set of candidate triplets that may replace ζ ∈ T, defining a smaller interval containing λ̂ in the next iteration of the algorithm. A0(ζ) is defined as

A0(ζ) := T ∩ {u1(ζ), u2(ζ), u3(ζ), u4(ζ)},  where
u1(ζ) = (ζ1, λ∗(ζ), ζ2),
u2(ζ) = (ζ2, λ∗(ζ), ζ3),
u3(ζ) = (λ∗(ζ), ζ2, ζ3),
u4(ζ) = (ζ1, ζ2, λ∗(ζ)).
(4.10)
The crucial assumption on the perturbation introduced into the triplet used to construct the quadratic is that we allow a perturbation at only one of the three points. Without loss of generality, let us assume that the minimum of the perturbed quadratic is constructed only for ζ such that ζ1 < ζ2 < ζ3. For a perturbation of the value of the function f at level ε ≠ 0 we define three sets of triplets:

T1 (ε ) := {ζ ∈ R3 : ζ1 < ζ2 < ζ3 , f (ζ2 ) ≤ min{ f (ζ1 )(1 − |ε |), f (ζ3 )} } (4.11)


T2 (ε ) := {ζ ∈ R3 : ζ1 < ζ2 < ζ3 , f (ζ2 )(1 + |ε |) ≤ min{ f (ζ1 ), f (ζ3 )} } (4.12)
T3 (ε ) := {ζ ∈ R3 : ζ1 < ζ2 < ζ3 , f (ζ2 ) ≤ min{ f (ζ1 ), f (ζ3 )(1 − |ε |)} } (4.13)

with the minima of the corresponding perturbed quadratics

λ̃1∗(ε; ζ) = (1/2) · [(ζ3² − ζ2²) f(ζ1)(1 + |ε|) − (ζ3² − ζ1²) f(ζ2) + (ζ2² − ζ1²) f(ζ3)] / [(ζ3 − ζ2) f(ζ1)(1 + |ε|) − (ζ3 − ζ1) f(ζ2) + (ζ2 − ζ1) f(ζ3)],

λ̃2∗(ε; ζ) = (1/2) · [(ζ3² − ζ2²) f(ζ1) − (ζ3² − ζ1²) f(ζ2)(1 + |ε|) + (ζ2² − ζ1²) f(ζ3)] / [(ζ3 − ζ2) f(ζ1) − (ζ3 − ζ1) f(ζ2)(1 + |ε|) + (ζ2 − ζ1) f(ζ3)],

λ̃3∗(ε; ζ) = (1/2) · [(ζ3² − ζ2²) f(ζ1) − (ζ3² − ζ1²) f(ζ2) + (ζ2² − ζ1²) f(ζ3)(1 + |ε|)] / [(ζ3 − ζ2) f(ζ1) − (ζ3 − ζ1) f(ζ2) + (ζ2 − ζ1) f(ζ3)(1 + |ε|)],

for a perturbation of the function f at the points ζ1, ζ2 and ζ3, respectively.
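The perturbed minima λ̃l∗ differ from λ∗ only in the (1 + |ε|) factor attached to one of the three function values, so they can be computed with the same transcription; for ε = 0 all three reduce to λ∗. The sketch below encodes this, with the index l selecting the perturbed point.

```python
def quad_vertex_perturbed(z1, z2, z3, f1, f2, f3, eps, l):
    """Minimizer lambda~_l*(eps; zeta) of the perturbed parabola: the value
    at point zeta_l (l in {1,2,3}) is multiplied by (1 + |eps|), exactly as
    in the three formulas above."""
    g = [f1, f2, f3]
    g[l - 1] *= 1.0 + abs(eps)          # perturb exactly one of the values
    num = (z3**2 - z2**2) * g[0] - (z3**2 - z1**2) * g[1] + (z2**2 - z1**2) * g[2]
    den = (z3 - z2) * g[0] - (z3 - z1) * g[1] + (z2 - z1) * g[2]
    return 0.5 * num / den
```

A small ε moves the vertex continuously, which is what the distance Δl(ε; ζ) of Section 4.4.2 quantifies.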


Such definitions of the sets Tl(ε) (l ∈ {1, 2, 3}) ensure that the perturbed minima are contained in the interval [ζ1, ζ3]. The corresponding sets of admissible triplets are defined as

Ãl(ε; ζ) := Tl(ε) ∩ {ũl1(ε; ζ), ũl2(ε; ζ), ũl3(ε; ζ), ũl4(ε; ζ)},

where
ũl1(ε; ζ) = (ζ1, λ̃l∗(ε; ζ), ζ2),
ũl2(ε; ζ) = (ζ2, λ̃l∗(ε; ζ), ζ3),
ũl3(ε; ζ) = (λ̃l∗(ε; ζ), ζ2, ζ3),
ũl4(ε; ζ) = (ζ1, ζ2, λ̃l∗(ε; ζ)),
(4.14)
for l ∈ {1, 2, 3}.
Lemma 4. Let S denote the set of stationary triplets of the function f,

S := {ζ ∈ T : f′(ζ1) = 0 or f′(ζ2) = 0 or f′(ζ3) = 0}.

The following statements hold:

1. For every ζ ∈ T, the set A(ζ) = A0(ζ) ∪ Ã1(ε; ζ) ∪ Ã2(ε; ζ) ∪ Ã3(ε; ζ) is nonempty.
2. The set-valued map A(·) is closed.
3. For every ζ ∈ T\S := {ζ ∈ T : ζ ∉ S} such that ζ1 < ζ2 < ζ3 we have c(y) < c(ζ) for all y ∈ A(ζ), where

c(ζ) = f˜1(ζ1) + f˜2(ζ2) + f˜3(ζ3)

and f˜l(·) = f(·) or f˜l(·) = f(·)(1 + |ε|), depending on whether the value at the point ζl was exact or perturbed.
Proof. Let us use the notation f˜l(·) = f(·) when the value at x = ζl is exact and f˜l(·) = f(·)(1 + |ε|) when the value at x = ζl is perturbed. First note that Tl(ε) ⊂ T for l ∈ {1, 2, 3} and ε > 0.

1. Let ζ = (ζ1, ζ2, ζ3) ∈ T be fixed. If f(ζ1), f(ζ2) and f(ζ3) are computed without perturbation, then A(ζ) is nonempty by the proof in [14]. Let us then consider the case in which one function value was approximated with relative error equal to ε. The minimum of the quadratic constructed in this case will be λ̃l∗(ε; ζ), where l ∈ {1, 2, 3} depends on the point at which the function value is approximated. Consider the case λ̃l∗(ε; ζ) ∈ [ζ1, ζ2], assuming moreover that the minimum λ∗(ζ), obtained as if the function had been evaluated without any perturbation, also belongs to [ζ1, ζ2]. Then A(ζ) is empty if and only if both ũ1(ε; ζ) and ũ3(ε; ζ) are not in A(ζ), i.e. if and only if

f(λ̃1∗(ε; ζ)) > min{f(ζ1)(1 + |ε|), f(ζ2)} = f(ζ2) (4.15)

and

f(ζ2) > min{f(λ̃1∗(ε; ζ)), f(ζ3)} ≥ min{f(λ̃1∗(ε; ζ)), f(ζ2)} (4.16)

when the function value is approximated at ζ1 (analogous inequalities are considered for perturbations at ζ2 and ζ3). Since inequalities (4.16) imply that f(ζ2) ≥ f(λ̃1∗(ε; ζ)), we get a contradiction with inequality (4.15), which proves the claim when the perturbation is at the point ζ1. A similar contradiction is obtained from the analogous inequalities for perturbations at ζ2 and ζ3. Note that the condition that the corresponding unperturbed λ∗(ζ) also lies in [ζ1, ζ2] guarantees that if we compare ỹ = (ỹ1, ỹ2, ỹ3) ∈ Ãl(ζ) (l ∈ {1, 2, 3}) with y = (y1, y2, y3) ∈ A0(ζ), we get ỹ1 ≤ y1 and ỹ3 ≥ y3. The case in which λ̃∗(ζ) belongs to [ζ2, ζ3] is symmetric. This shows that A(ζ) is nonempty.
2. We prove the closedness of A in the sense of Definition 2. Let us assume that ζ(i) → ζ∗ ∈ T, and that there exist ζ∗∗ ∈ T and an infinite subsequence K ⊂ N such that ζ(i+1) ∈ A(ζ(i)) for every i ∈ K and ζ(i+1) →K ζ∗∗ as i → ∞. Then there must exist k ∈ {1, 2, 3, 4} and an infinite subsequence K′ ⊂ K such that ζ(i+1) = uk(ζ(i)) or ζ(i+1) = ũ_k^{li}(ε; ζ(i)) (where li ∈ {1, 2, 3}) for every i ∈ K′. As shown in [14], the functions uk(·) are continuous, and therefore when the sequence {ζ(i)}∞i=0 does not contain any ζ(i) generated by approximated values, the continuity of uk(·) with respect to ζ and the closedness of the set T imply that uk(ζ(i)) → uk(ζ∗) = ζ∗∗ ∈ A(ζ∗). This proves the closedness of the transformation A if no approximation is used in the sequence ζ(i).
Now we consider sequences containing approximated triplets. Introducing an approximated triplet ζ(i+1) (i.e. ζ(i+1) ∈ Ãl(ε; ζ(i))) introduces discontinuities of the first kind into the functions uk(ζ), and the argument based on the continuity of uk(ζ) cannot be applied directly. We will show that the algorithm introduces only a finite number of isolated discontinuity points, which guarantees that from any sequence {ζ(i)}∞i=1, after removing a finite number of initial elements, we can apply the proof from [14].
We have to consider two cases.

• When the number of occurrences of ũ_k^{li} in the sequence {ζ(i)}∞i=0 is finite, the proof of the closedness of the transformation A given in [14] applies after removing some number of initial elements from {ζ(i)}∞i=0.
• Let us then assume that the ũ_k^{li} occur an infinite number of times in {ζ(i)}∞i=0. Since {ζ(i)}∞i=0 is convergent, for any δ > 0 there exists i0 such that for all i > i0 we have

||ζ(i) − ζ(i+1)|| = ||(ζ1(i), ζ2(i), ζ3(i)) − (ζ1(i+1), ζ2(i+1), ζ3(i+1))|| < δ. (4.17)

Let us choose an i1 for which inequality (4.17) holds. Since the approximation is used infinitely many times in the subsequent iterations, there exists a subsequence K = {i : i > i1} ⊂ N such that if i ∈ K then

ζ(i+1) ∈ Ãl(ε; ζ(i))

for some l ∈ {1, 2, 3}.


We will show that for any δ it is possible to choose ε such that

||ζ(i1) − ζ(i1+1)||2 > 2δ. (4.18)

This will eventually give a contradiction with the assumed convergence. To obtain conditions on ε we solve inequality (4.18), using the expressions (4.14) for ũ_k^l(ε; ζ) for k = 1, ..., 4 and l ∈ {1, 2, 3} and applying the parabola transformation q̂ from Appendix A. For example, for a perturbation at ζ1(i) we get

ε1 > [(B − C)(ζ(r) − Ka) − (B − ζ(r)C) − A(Ka − 1)] / [A(Ka − 1)] (4.19)

and one of the following systems of inequalities is fulfilled:

A(1 + ε1) + B − C > 0 and A(Ka − 1) < 0,  or  A(1 + ε1) + B − C < 0 and A(Ka − 1) > 0, (4.20)

where

ε1 = ε / f(ζl(i)),  Ka = (2δ / (ζ3 − ζ1))² − [1 − ζ(r)]²,  ζ(r) = (ζ2 − ζ1) / (ζ3 − ζ1),

with A, B and C defined in Appendix A. We leave it to the reader to show that these conditions are not contradictory, as well as the derivation of the analogous inequalities for l = 2 and l = 3.
The above considerations mean that for any δ the value of ε can be chosen so that (4.18) holds. But this contradicts the assumption that {ζ(i)}∞i=0 converges, since we can choose two subsequences of {ζ(i)}∞i=0 that converge to two different accumulation points: the first formed from the elements ζ(i1), the second from the elements ζ(i1+1), both infinite. From this we conclude that from a certain i0 on, the sequence {ζ(i)}∞i=0 cannot contain approximated points, and therefore the proof from [14] applies in this case as well.
This finishes the proof of the closedness of the transformation A(ζ).
3. Let us assume that ζ ∈ T\S. Then λ̃∗(ζ) ∈ (ζ1, ζ3). Throughout this part we consider the quadratic obtained with the transformation q̂ from Appendix A; this transformation preserves all the properties used in the proof. In the following we use the expression Δl(ε; ζ) for the distance between the unperturbed and the perturbed minimum, i.e. Δl(ε; ζ) = |λ̃l∗(ζ) − λ∗(ζ)|; see Appendix B for the expression for Δl(ε; ζ).
Here the situation is again symmetric with respect to ζ1 and ζ3, and we consider the case where λ∗(ζ) ∈ (ζ1, ζ2] as well as λ̃l∗(ζ) ∈ (ζ1, ζ2] (l ∈ {1, 2, 3}).
First we observe that f(ζ2) < f(ζ3): indeed f(ζ2) ≤ f(ζ3) since ζ ∈ T, and if f(ζ2) = f(ζ3) then λ̃∗(ζ) = (ζ2 + ζ3)/2, since either no value is perturbed or only the value at the point ζ2 is perturbed and this perturbation has no effect on the minimum; both cases contradict the assumption that λ̃l∗(ζ) ∈ (ζ1, ζ2] (l ∈ {1, 2, 3}).
When λ̃l∗(ζ) ∈ (ζ1, ζ2] (l ∈ {1, 2, 3}), only u1(ζ) and ũl1(ε; ζ), and u3(ζ) and ũl3(ζ) (l ∈ {1, 2, 3}), can belong to A(ζ). For unperturbed function values we have λ∗(ζ) < ζ2, and for perturbed function values we have

λ̃l∗(ε; ζ) + Δl(ε; ζ) < ζ2 (l ∈ {1, 2, 3}). (4.21)

In Appendix B we solve (4.21) with respect to ε, establishing conditions under which λ̃l∗(ε; ζ) < ζ2 and the line of proof from [14] remains valid. Only three cases arise.
a. A(ζ) = {u1(ζ), ũ11(ζ), ũ21(ζ), ũ31(ζ)}. Then, since u3(ζ) ∉ A(ζ) as well as ũl3(ε; ζ) ∉ A(ζ) (l ∈ {1, 2, 3}), and f(λ∗(ζ)) < f(ζ2) as well as f(λ̃l∗(ε; ζ)) < f(ζ2) (l = 1, 3) and

f(λ̃2∗(ε; ζ)) < f(ζ2)(1 + |ε|), (4.22)

we must have

c(ũl1(ζ)) = f˜l(ζ1) + f(λ̃l∗(ζ)) + f˜l(ζ3) < f˜l(ζ1) + f˜l(ζ2) + f˜l(ζ3) = c(ζ),

where f˜l(·) = f(·) or f˜l(·) = f(·)(1 + |ε|), depending on which value was perturbed.
b. A(ζ) = {u3(ζ), ũ13(ζ), ũ23(ζ), ũ33(ζ)}. Then, since u1(ζ) ∉ A(ζ) as well as ũl1(ε; ζ) ∉ A(ζ) (l ∈ {1, 2, 3}), we must have f(ζ2) ≤ f(λ∗(ζ)) as well as f˜(ζ2) ≤ f(λ̃l∗(ζ)) (l ∈ {1, 2, 3}), depending on which value was perturbed. Also f(λ∗(ζ)) < f(ζ1) as well as f(λ̃l∗(ζ)) < f˜l(ζ1) (l ∈ {1, 2, 3}), since otherwise we would have a local maximum in [ζ1, ζ2], contradicting unimodality. Therefore in this case we must have

c(ũl3(ζ)) = f(λ̃l∗(ζ)) + f˜2(ζ2) + f˜3(ζ3) < f˜l(ζ1) + f˜l(ζ2) + f˜l(ζ3) = c(ζ)

with the perturbation at the point ζl.



c. Finally, we can have the case A(ζ) = {u1(ζ), u3(ζ)}. In this case we are not able to include in A(ζ) any of the approximated triplets ũl(ε; ζ) (l = 1, 2, 3), because of the following properties:
i. f(ζ2) < f(ζ1) by assumption,
ii. λ∗(ζ) ≤ ζ2,
iii. f(λ∗(ζ)) = f(ζ2), which implies λ∗(ζ) = ζ2.
These equalities hold since otherwise we would have a contradiction with the unimodality of f(·). Approximating any of the values would mean that we could not guarantee property iii. Therefore, since f(ζ2) < min{f(ζ1), f(ζ3)}, we get c(u1(ζ)) < c(ζ) and c(u3(ζ)) < c(ζ). From a practical point of view, for a given ζ(i) in which one of the coordinates is approximated or all are exact, we can see whether we have to use exact values, i.e. remove the approximation, by checking whether

|λ̃l∗(ε; ζ) − ζ2| > Δl(ε; ζ) (l ∈ {1, 2, 3}).

This exhausts all the possibilities and finishes the proof of the third point.

We are now in a position to formulate the Sequential Quadratic Interpolation Algorithm with Perturbation (cf. [14] for the unperturbed version).
Note that in the following we use the expressions "approximated" and "perturbed" function value interchangeably; they are synonymous here. Moreover, the adjective "exact" here means perturbed at the level of the machine precision εM, where εM ≪ ε.
The main condition under which we can make use of the available approximation of the function at one of the points is the separation property, i.e. for the approximation used at a point with index l ∈ {1, 2, 3} the triplet ζ(i) belongs to Tl, where Tl is defined by (4.11), (4.12) or (4.13), respectively.
Now we can formulate a convergence theorem for this algorithm, analogous to that in [14, p. 155].

Theorem 3. Suppose that {ζ(i)}∞i=0 is a sequence constructed by Algorithm 2 in minimizing a continuously differentiable and unimodal function f : R → R. Then ζ(i) → ζ̂ as i → ∞, with ζ̂ ∈ S.

To prove the theorem we show how the proof given in [14] can be applied, making use of Lemma 4.

Proof. The main difficulty in applying the method of proof from [14] is the fact that using the approximation at certain points of the domain causes a discontinuity of the cost function c, as well as a discontinuity of the functional computing the minimum of the quadratics with respect to the parameter triplet ζ(i).
Let us first observe that the proof of the third point of Lemma 4 shows that allowing the perturbation according to (4.23) ensures that {ζ1(i)}∞i=0 is monotone increasing and {ζ3(i)}∞i=0 is monotone decreasing; since both these sequences are bounded, they are both convergent. Moreover, keeping Δ(ε; ζ(i)) at a level such that the above

Algorithm 2. The Sequential Quadratic Interpolation Algorithm with Objective Function Perturbation

Input: ζ(0) ∈ T – a starting point,
       ε > 0 – the relative error of the approximation available at certain points of the function evaluation.

0. Set i = 0.
1. Compute λ∗ = λ∗(ζ(i)) or λ∗ = λ̃l∗(ε; ζ(i)), depending on whether the function value was exact at all the points ζ1(i), ζ2(i), ζ3(i) or was perturbed at the point ζl(i) (l ∈ {1, 2, 3}), respectively.
2. If λ∗ = ζ1(i) or λ∗ = ζ3(i) then STOP; else construct the set A(ζ(i)):
   a. If no approximation to any value of the triplet ζ(i) is available, then A(ζ(i)) = A0 according to (4.10).
   b. If an approximation at the point ζl(i) (l ∈ {1, 2, 3}) is available:
      i. Compute the transformation q̂ as described in Appendix A.
      ii. Compute Δl(ε; ζ(i)).
      iii. If

           |λ̃l∗(ε; ζ) − ζ2(i)| < Δl(ε; ζ),

           then A(ζ(i)) = A0 and go to 3.
      iv. If

           λ̃l∗(ε; ζ) + Δl(ε; ζ) < ζ2(i)  or  λ̃l∗(ε; ζ) − Δl(ε; ζ) > ζ2(i), (4.23)

           then A = A0 ∪ Ãl.
3. Compute

   ζ(i+1) ∈ arg min{c(ζ) : ζ ∈ A(ζ(i))}.

4. Set i := i + 1 and go to step 1.

property of the mentioned sequences would hold if the approximation were not used, guarantees that ζ̂ ∈ [ζ1(i), ζ3(i)] for every i ∈ N. We have to distinguish two nontrivial cases.
1. {ζ(i)}∞i=1 → ζ̂ and ζ̂ = (ζ̂1, ζ̂2, ζ̂3) is an accumulation point with ζ̂1 < ζ̂2 < ζ̂3. In this case the contradiction is obtained using the continuity of the cost function c, as if an unperturbed algorithm were used. We can use the continuity argument here since, as shown in the proof of Lemma 4, the approximation can only be used in a finite number of steps; therefore, after removing a certain number of initial steps, the proof from [14] applies.
2. When the first case does not apply, the sequence constructed by Algorithm 2 can have two accumulation points. In [14] it is shown that they are in fact the same

accumulation point. Lemma 8 guarantees that the argument contained in [14] holds true if the constructed sequence reaches the stopping criterion with an unperturbed triplet. Since the sequence need not be infinite, we have to consider one more case, namely when the algorithm stops with a perturbed triplet.
The two accumulation points are ζ∗ = (ζ̂1, ζ̂1, ζ̂3) and ζ∗∗ = (ζ̂1, ζ̂3, ζ̂3). To provide the separation of function values, in the first case the perturbation can only be at the point ζ̂3, and in the second case only at the point ζ̂1. In these sequences we therefore have ζ2(i) → ζ1(i) or ζ2(i) → ζ3(i). On the other hand, the separation between λ̃∗(ε; ζ(i)) and ζ2(i) has to be greater than Δ3(ε; ζ(i)) and Δ1(ε; ζ(i)) in the two cases, respectively. This gives the contradiction with the convergence.
From the practical point of view, to ensure the convergence of the perturbed SQI algorithm, and therefore of the whole SPELROA method, it is sufficient for the ε-check procedure in step 3.c) of Algorithm 1 to be launched in each iteration of the perturbed SQI algorithm. It takes the last three points xk−2, xk−1, xk, computes ζk = (0, t1, t2) with t1 = (xk−2 − xk−1)/(xk−2 − xk) and t2 = 1, then computes the transformation q̂ defined by (4.34) to obtain the points (0, −2), (ζ(r), q̂(ζ(r))), (1, −1) when f˜(xk−2) < f˜(xk) (in the case f˜(xk−2) > f˜(xk) it is sufficient to rotate the scaled parabola around the point 0.5), and scales ε as

ε := ε / f(xk−2) when the perturbation is at the point xk−2,
ε := ε / f(xk−1) when the perturbation is at the point xk−1,
ε := 2ε / f(xk) when the perturbation is at the point xk,

depending on the point at which the function value was approximated. Then:
1. if the perturbation is at 0, and ε satisfies (4.35) and fulfills (4.36) or (4.37), the approximation can be used; otherwise the objective function is evaluated directly;
2. if the perturbation is at ζ(r), and ε satisfies (4.38) and fulfills (4.37) or (4.40), the approximation can be used; otherwise the objective function is evaluated directly;
3. if the perturbation is at 1, ε has to satisfy analogous conditions (left to the reader) for the approximation to be used by the algorithm.
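The unperturbed core of Algorithm 2, i.e. sequential quadratic interpolation with the admissible-replacement sets and the cost c(ζ) = f(ζ1) + f(ζ2) + f(ζ3), can be sketched as follows. The ε-check branches of step 2.b are omitted, so this is a baseline sketch that the perturbed version modifies, not the full algorithm.

```python
def sqi(f, triplet, iters=30):
    """Unperturbed sequential quadratic interpolation (cf. [14]): insert the
    parabola vertex lambda* into the bracketing triplet and pick, among the
    candidate replacements u1..u4 that remain feasible (strictly ordered and
    bracketing), the one with the smallest cost c = f1 + f2 + f3."""
    z = tuple(triplet)
    for _ in range(iters):
        z1, z2, z3 = z
        f1, f2, f3 = f(z1), f(z2), f(z3)
        den = (z3 - z2) * f1 - (z3 - z1) * f2 + (z2 - z1) * f3
        if den == 0.0:
            break
        lam = 0.5 * ((z3**2 - z2**2) * f1 - (z3**2 - z1**2) * f2
                     + (z2**2 - z1**2) * f3) / den
        if lam == z1 or lam == z3:
            break                                   # stopping rule of step 2
        cand = [(z1, lam, z2), (z2, lam, z3), (lam, z2, z3), (z1, z2, lam)]
        feas = [u for u in cand
                if u[0] < u[1] < u[2] and f(u[1]) <= min(f(u[0]), f(u[2]))]
        if not feas:
            break
        z = min(feas, key=lambda u: f(u[0]) + f(u[1]) + f(u[2]))
    return z
```

Every replacement keeps the bracketing property, so the outer points are monotone and the final middle point lies inside the original interval, mirroring the monotonicity argument used in the proof above.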

4.5 The Radial Basis Approximation


In this section we discuss the main aspects of, and the possibilities for, constructing a radial basis approximation of the objective function in Algorithm 1.

4.5.1 Detecting Dense Regions


Since the data set Z = {xi }Ni=1 is very sparse then a method to detect regions rich
in data within the convex hull Ω of Z is required. In [2] we introduced a number of
merit
88 M. Bazan

γ(x, Z) := [ ∑_{i,j=1; i<j}^N a_ij W_ij ] / [ ∑_{i,j=1; i<j}^N W_ij ]   (4.24)

where a_ij = d_ij/(r_i + r_j), W_ij = 1/(r_i + r_j), d_ij = ||x_j − x_i||_2 and r_j = ||x − x_j||_2.
γ(x, Z) measures how well the data points from Z surround the evaluation point x. The number a_ij
measures how far the point x lies from the segment x_i x_j; a_ij attains its maximal value 1
when x is on this segment. The weights W_ij emphasize in γ(x, Z) the impact of the segments
x_i x_j that are close to the point x. The additional division by the sum of the weights W_ij
ensures that for any x ∈ R^d the range of values of γ(x, Z) is (0, 1]. If the value of γ(x, Z)
is greater than a certain threshold, an approximation of good local quality can be expected
at the evaluation point x.
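As an illustration, the merit function (4.24) can be computed directly from its definition. The sketch below is our own minimal implementation under the notation above (the function name is ours, not from [2]):

```python
import numpy as np

def gamma_merit(x, Z):
    """Density merit gamma(x, Z) of Eq. (4.24): a W_ij-weighted average of
    a_ij = d_ij / (r_i + r_j); a_ij equals 1 exactly when x lies on the
    segment x_i x_j."""
    x = np.asarray(x, dtype=float)
    Z = np.asarray(Z, dtype=float)
    r = np.linalg.norm(Z - x, axis=1)           # r_j = ||x - x_j||_2
    num = den = 0.0
    for i in range(len(Z)):
        for j in range(i + 1, len(Z)):
            d_ij = np.linalg.norm(Z[j] - Z[i])  # d_ij = ||x_j - x_i||_2
            w_ij = 1.0 / (r[i] + r[j])          # W_ij
            num += d_ij * w_ij * w_ij           # a_ij * W_ij
            den += w_ij
    return num / den

# x on the segment between the two data points gives the maximal value 1:
print(gamma_merit([0.5, 0.0], [[0.0, 0.0], [1.0, 0.0]]))  # -> 1.0
```

Points far outside the convex hull of Z give values close to 0, so a threshold on γ(x, Z) separates well-surrounded evaluation points from extrapolation.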

Fig. 4.1 γ(x, Z) defined by (4.24), plotted over the (X1, X2) plane. Here the set Z is a data set
constructed at the 50-th optimization step of a 2-parameter Rosenbrock function optimization
with the EXTREM algorithm

4.5.2 Regularization Training


For a given data set Z = {x_i, f(x_i)}_{i=1}^N ⊂ R^d × R of pairwise different data points
x_i, and for the Gaussian radial basis function φ(x) = exp(−x²/r²), r > 0 (see [9]), a
radial basis function interpolant s(x) is defined as

s(x) = ∑_{i=1}^N w_i φ(||x − x_i||),   (4.25)

where
s(x_i) = f(x_i) = f_i,  i = 1, . . . , N.
4 Search Procedure Exploiting Locally Regularized Objective Approximation 89

The interpolation conditions s(x_i) = f_i imply the matrix formulation

Φ w = f,   (4.26)

where Φ = [φ(||x_i − x_j||)]_{i,j=1,...,N}, f = [f_1, f_2, . . . , f_N]^T, w = [w_1, w_2, . . . , w_N]^T.


Positive definiteness [9] of φ guarantees nonsingularity of the matrix Φ, and thus there
exists a unique solution w for which s(x) interpolates the data from the set Z. Although
other choices of positive definite radial basis functions are possible, we choose the
Gaussian function due to the natural interpretation of its parameter r, which can be set
to a value proportional to the diameter of the data set Z = {x_i}_{i=1}^N. There are two
reasons why we solve an approximation problem rather than an interpolation problem. Firstly,
when the data set Z is very irregular and φ is strictly positive definite, the matrix Φ is
very ill-conditioned. Secondly, solving (4.26) yields a solution s(x) which may oscillate
between data points where the data is sparse.
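The ill-conditioning mentioned here is easy to observe numerically. The sketch below (our own illustration, not code from this chapter) assembles the Gaussian interpolation matrix of (4.26) for an irregular data set, with the shape parameter r proportional to the data diameter, and prints its condition number:

```python
import numpy as np

def gaussian_rbf_matrix(X, r):
    """Interpolation matrix Phi with Phi_ij = exp(-||x_i - x_j||^2 / r^2)."""
    X = np.asarray(X, dtype=float)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / r ** 2)

rng = np.random.default_rng(0)
# Irregular data: a tight cluster plus a few distant points
X = np.vstack([rng.normal(0.0, 1e-3, (15, 2)),
               rng.normal(5.0, 1.0, (15, 2))])
Phi = gaussian_rbf_matrix(X, r=0.5 * np.ptp(X))   # r ~ half the data diameter
print(np.linalg.cond(Phi))                        # huge: Phi is nearly singular
```

The fifteen nearly coincident centers produce almost identical rows of Φ, which is exactly the situation arising on an optimization path, where consecutive iterates cluster near the current minimizer.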
The approximate solution is sought by means of Tikhonov regularization. Suppose
that we are given a linear mapping T : H → R and define the regularization
operator J : H → R by J(s) = ||Ts||²_R, where H is the native space of the underlying
radial basis function φ(·). For a function f ∈ H and for a prescribed value of the
single regularization parameter λ, its approximation is the function f_λ(x) ∈ H of the
form (4.25) which solves the minimization problem

min { ∑_{i=1}^N [f(x_i) − f_λ(x_i)]² + λ J(f_λ) : f_λ ∈ H }.   (4.27)

In matrix form the problem (4.27) is written as

(Φ^T Φ + λ² I) w = Φ^T f   (4.28)

with λ > 0 governing the trade-off between data reproduction and the desired
smoothness of the solution f_λ(x). Due to the ill-conditioning of the matrix Φ, a direct
inversion of the matrix (Φ^T Φ + λ² I) in (4.28) is not numerically stable.
To solve (4.28) in a numerically stable way we use the singular value decomposition
of the matrix Φ, defined as

Φ = U S V^T,   (4.29)

where U ∈ R^{N×N}, V ∈ R^{N×N} are orthogonal matrices and S = diag(σ_1, . . . , σ_N),
where σ_1 ≥ σ_2 ≥ · · · ≥ σ_N are the singular values of Φ. The singular value decomposition
is unique up to the signs of the columns of the matrices U and V. Using this
decomposition we express the regularized inverse in (4.28) as

Φ† = (Φ^T Φ + λ² I)^{−1} Φ^T = V ((S^T S + λ² I)^{−1} S^T) U^T = V Ω_λ U^T,   (4.30)

where

Ω_λ = diag( σ_1/(σ_1² + λ²), σ_2/(σ_2² + λ²), . . . , σ_N/(σ_N² + λ²) ).

Using the above equation, the weight vector w_λ is expressed (see [5]) by an expansion
with respect to the singular vectors of the matrix Φ:

w_λ = V Ω_λ U^T f = ∑_{i=1}^N [σ_i/(σ_i² + λ²)] (u_i^T f) v_i.   (4.31)

Comparing the expansion (4.31) for λ > 0 with the expansion for λ = 0, i.e. for the
interpolation problem, which reads

w = V S^{−1} U^T f = ∑_{i=1}^N (1/σ_i)(u_i^T f) v_i,   (4.32)

one can see the role of the regularization parameter λ. For σ_p ≥ λ ≥ σ_{p+1} we have

1/σ_k ≫ σ_k/(σ_k² + λ²)   (k > p),

and hence the impact of the singular vectors corresponding to singular values σ_k < λ
is damped in the expansion (4.31). This enables us to avoid the oscillations of the
solution that are introduced by inverting small singular values in the expansion of
the weight vector. Another approach to the problem of ill-conditioning of the
interpolation matrix generated by multiquadric functions was presented in [7].
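The SVD-filtered solution (4.31) translates directly into code. The following sketch (ours; names are illustrative) computes w_λ and demonstrates the damping effect of λ on the weights:

```python
import numpy as np

def tikhonov_weights(Phi, f, lam):
    """Regularized weights w_lambda of Eq. (4.31): the filter factors
    sigma_i / (sigma_i^2 + lambda^2) replace the 1/sigma_i of Eq. (4.32)."""
    U, s, Vt = np.linalg.svd(Phi)
    return Vt.T @ ((s / (s ** 2 + lam ** 2)) * (U.T @ f))

rng = np.random.default_rng(1)
X = rng.uniform(size=(8, 2))
Phi = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 0.5 ** 2)
f = np.sin(X[:, 0])

# A larger lambda damps every singular direction, shrinking the weights:
w_small = tikhonov_weights(Phi, f, 1e-8)
w_large = tikhonov_weights(Phi, f, 1.0)
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

Each filter factor σ_i/(σ_i² + λ²) is strictly decreasing in λ, so the norm of the weight vector shrinks monotonically as λ grows, which is the damping mechanism described above.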

4.5.3 Choice of the Regularization Parameter λ Value


An appropriate procedure for choosing the λ parameter is a crucial issue in Tikhonov
regularization. The chosen procedure should be able to find a λ for which the solution
s(x) both reproduces the data set Z well and generalizes well between data points.
These goals conflict, and a method should therefore strike the right balance between
reproduction of the data set and generalization to data from outside it.

4.5.3.1 Weighted Gradient Variance and Local Mean Square Error


In [2] we introduced a method that relates the choice of λ to the reproduction quality
at the data points close to the evaluation point x.
The method is defined as follows. Let

r_j = ||x − x_j||_2,  j = 1, . . . , N.

For a sequence of λ's covering the singular value spectrum (σ_N, σ_1) of the matrix
S from the decomposition (4.29) we calculate the Normalized Local Mean Square
Error

NLMSE_{λ,Z}(x) = [ ( ∑_{j=1}^N ([s_λ(x_j) − f_j]²/f_j²) · (1/r_j²) ) / ( ∑_{j=1}^N 1/r_j² ) ]^{1/2},   (4.33)

and the Weighted Gradient Variance

WGV_{λ,Z}(x) = [ ∑_{j=1}^N ||∇s_λ(x_j) − G_{λ,Z}(x)||²/r_j² ] / [ ∑_{j=1}^N 1/r_j² ],

where G_{λ,Z}(x) is the mean gradient at the point x, defined as

G_{λ,Z}(x) = [ ∑_{j=1}^N ∇s_λ(x_j)/r_j² ] / [ ∑_{j=1}^N 1/r_j² ],

where s_λ(x_j) is the value of the approximation constructed for λ at the point x_j from
the data set Z, and ∇ is the gradient operator. This is a discrepancy method, since only
those λ's are considered for which the value of NLMSE is smaller than a prescribed,
user-defined threshold NLMSEthr. The threshold value NLMSEthr specifies the approximation
quality that has to be preserved at the points of Z nearest to the evaluation point x.
From the set of λ's for which NLMSE_{λ,Z}(x) is smaller than NLMSEthr, the minimizer
of the oscillation measure WGV_{λ,Z}(x) is chosen as the optimum.
Figure 4.2 shows a) the data set of 30 points from the 2-parameter Rosenbrock function
optimization by the EXTREM algorithm, b) a plot of NLMSE_{λ,Z}(x) for two different
points of the domain, and c) the corresponding plots of WGV_{λ,Z}(A) and WGV_{λ,Z}(B).
Figure 4.3 shows a) the obtained approximation and b) the chosen value of the λ parameter.
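The selection rule can be sketched as follows. This is our own reading of the procedure (names are ours): for each trial λ the approximation s_λ is rebuilt with the SVD weights of (4.31), NLMSE follows (4.33), and the gradients needed for WGV are the analytic gradients of the Gaussian basis; the evaluation point x is assumed not to coincide with a data point.

```python
import numpy as np

def choose_lambda(X, f, x, r, lambdas, nlmse_thr):
    """Among trial lambdas whose NLMSE (4.33) at x stays below nlmse_thr,
    return the one minimizing the oscillation measure WGV
    (None if no lambda passes the discrepancy test)."""
    X, f, x = (np.asarray(a, dtype=float) for a in (X, f, x))
    diff = X[:, None, :] - X[None, :, :]
    Phi = np.exp(-np.sum(diff ** 2, axis=-1) / r ** 2)
    U, s, Vt = np.linalg.svd(Phi)
    wts = 1.0 / np.sum((X - x) ** 2, axis=-1)          # 1 / r_j^2
    best_lam, best_wgv = None, np.inf
    for lam in lambdas:
        w = Vt.T @ ((s / (s ** 2 + lam ** 2)) * (U.T @ f))   # Eq. (4.31)
        s_at = Phi @ w                                       # s_lambda(x_j)
        nlmse = np.sqrt(np.sum(wts * ((s_at - f) / f) ** 2) / np.sum(wts))
        # analytic gradient of the Gaussian basis at each data point
        grads = np.einsum('j,ijd,ij->id', w, -2.0 * diff / r ** 2, Phi)
        G = np.sum(grads * wts[:, None], axis=0) / np.sum(wts)  # mean gradient
        wgv = np.sum(wts * np.sum((grads - G) ** 2, axis=-1)) / np.sum(wts)
        if nlmse < nlmse_thr and wgv < best_wgv:
            best_lam, best_wgv = lam, wgv
    return best_lam

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
f = 2.0 + np.sin(X[:, 0] + X[:, 1])
lam = choose_lambda(X, f, np.array([0.3, 0.4]), 1.0, np.logspace(-10, 0, 11), 0.5)
print(lam)
```

For a small, well-separated data set, very small trial values of λ already pass a loose discrepancy threshold, so the routine returns one of the supplied candidates.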

4.5.4 Error Bounds for Radial Basis Approximation


In the previous section we presented a method for choosing the value of the regularization
parameter λ. Unfortunately, this method neither guarantees that the generalization error
is smaller than a prescribed value ε, nor allows us to estimate the generalization error
directly. Like Generalized Cross Validation (cf. [18]), it estimates an error on the data
points (WGV on the points nearest to the evaluation point, GCV on all points of the data set).
Error bounds for radial basis interpolation have been studied extensively for
more than two decades. The first bounds establishing the rate of convergence of
a radial basis interpolant for functions from the native space of the underlying
radial basis function were given in [10]. That paper laid the foundations for
the theory of convergence of radial basis interpolation (see

Fig. 4.2 a) Data set consisting of 30 points from the optimization path of the EXTREM
algorithm optimizing the 2-parameter Rosenbrock function. b) Local reproduction of the
data near the points A and B, measured by NLMSE_{λ,Z}(A) and NLMSE_{λ,Z}(B) respectively;
the threshold NLMSEthr = 10⁻⁵ is depicted by a straight line. c) The oscillation measure
WGV_{λ,Z}(x) at the points A and B; the dots mark the optimal λ for A and B

e.g. [17], [19] and references therein). The rate of convergence considered is with respect
to the data set density, i.e. the global fill distance

h(Z, Ω) := max_{y ∈ Ω} min_{1 ≤ i ≤ N} ||y − x_i||_2,

Fig. 4.3 a) The approximation error for λ chosen by the measure WGV_{λ,Z}(x) with
NLMSEthr = 5.0 · 10⁻⁶. b) Chosen value of the λ parameter — it can be noticed that in
regions where the data is sparse the method suggests a greater value of λ

where Z = {x_i}_{i=1}^N ⊆ Ω. This mesh norm measures the radius of the biggest ball
contained in the domain Ω that contains no data point in its interior; the domain Ω
is assumed to satisfy the interior cone condition with radius r and angle θ.
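The fill distance of a finite data set can be estimated by sampling Ω densely; the sketch below (our own) does this on the unit square, where a regular 3×3 grid has the exact fill distance √2/4:

```python
import numpy as np

def fill_distance(Z, sample):
    """h(Z, Omega) estimated on a dense sample of Omega: the largest
    distance from a sample point to its nearest data point in Z."""
    Z = np.asarray(Z, dtype=float)
    sample = np.asarray(sample, dtype=float)
    d = np.linalg.norm(sample[:, None, :] - Z[None, :, :], axis=-1)
    return d.min(axis=1).max()

g = np.linspace(0.0, 1.0, 3)
Z = np.array([(a, b) for a in g for b in g])        # 3x3 grid of data points
s = np.linspace(0.0, 1.0, 101)
dense = np.array([(a, b) for a in s for b in s])    # dense sample of Omega
print(fill_distance(Z, dense))                      # -> sqrt(2)/4 ~ 0.3536
```

The maximizing sample point is the center of a grid cell, i.e. the center of the largest data-free ball, matching the mesh-norm interpretation above.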

Definition 3. A set Ω ⊆ R^d is said to satisfy an interior cone condition if there exist
an angle θ ∈ (0, π/2) and a radius r > 0 such that for every x ∈ Ω a unit vector ξ(x)
exists for which the cone C(x, ξ(x), θ, r) := {x + ηy : y ∈ R^d, ||y||_2 = 1, y^T ξ(x) ≥
cos θ, η ∈ [0, r]} is contained in Ω.

A summary of error bounds for various radial basis functions was given in [16]. A
very precise derivation of the error bounds for Gaussian radial basis function
interpolation was given in [20]; there one can find a derivation of all the constants
involved in the bound. Analogous error estimates (with all constants involved) for the
approximation with positive definite radial basis functions constructed with Tikhonov
regularization, for functions from the Sobolev space W_p^τ, were presented in [19].
Here we show that the error estimates contained in [19] cannot be used in our scheme,
due to the small number of points available in the optimization process, so that the
use of the heuristic methods from the previous sections is justified.
All of the error bounds rely on a common property of local polynomial reproduction
that has to be guaranteed by the approximation procedure (cf. [19]). The error
bounds can be formulated in the form of the following theorem.

Theorem 4. Suppose that Ω ⊂ R^d is bounded and satisfies an interior cone condition
with angle θ and radius r. Let m be the maximal degree of polynomials reproduced
by f_λ of the form (4.25), defined as the solution of (4.27). Define the following
quantities:

ϑ := 2 arcsin( sin θ / (4(1 + sin θ)) ),

Q(m, θ) := sin θ sin ϑ / (8 m² (1 + sin θ)(1 + sin ϑ)).

If the global fill distance satisfies

h(Z, Ω) ≤ Q(m, θ) r,

the approximation error can be bounded as

||f − f_λ||_{L∞(Ω)} ≤ C [h(Z, Ω)]^{τ−d/p} |f|_{W_p^τ(Ω)} + 2ε,  where ε = max_{x_j ∈ Z} |f(x_j) − f_λ(x_j)|.

Let us consider the unit ball B(0, 1) as the domain of the approximation. It satisfies
the interior cone condition with r = 1 and θ = π/3. The approximation can be seen to
reproduce only linear polynomials, i.e. m = 1, and therefore for the above bound to
hold we must have

h(Z, Ω) ≤ Q(1, π/3) < 0.012613.

This means that from one data point to another there must be a distance not greater
than about 1.3% of the radius of the ball containing the whole data set. Such a number
of points cannot be generated by any local optimization algorithm.
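The smallness of this constant can be checked numerically. The sketch below (ours) evaluates Q from the formulas of Theorem 4, assuming the 8m² reading of the denominator:

```python
import math

def Q(m, theta):
    """Constant Q(m, theta) of Theorem 4 (8 m^2 reading of the denominator)."""
    s = math.sin(theta)
    vt = 2.0 * math.asin(s / (4.0 * (1.0 + s)))   # the angle vartheta
    sv = math.sin(vt)
    return s * sv / (8.0 * m * m * (1.0 + s) * (1.0 + sv))

print(Q(1, math.pi / 3))   # ~0.011, below the 0.012613 quoted in the text
```

Even for the most benign domain, the required fill distance is on the order of one percent of the domain radius, confirming why the bound is unusable with the few dozen points an optimization run produces.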
The above consideration shows that the existing accurate error bounds for regularized
approximation with radial basis functions cannot be applied to estimate the approximation
error in the SPELROA method, due to the sparseness of the data sets constructed by local
optimization algorithms.

4.6 Numerical Results


Numerical results on real optimization problems from the LHC magnet design process, with
up to 5 parameters, were presented in [2]. Here we present results for three test problems
from the set proposed in [12].

4.6.1 Test Problems


1. Six-variable problem and eight-variable problem I
   As the six-variable and the first eight-variable problem we considered the Extended
   Rosenbrock function (problem no. 21 in [12]). It is defined as

   f([x_1, x_2, . . . , x_d]) = ∑_{i=1}^{d/2} [ 100(x_{2i} − x_{2i−1}²)² + (1 − x_{2i−1})² ].

   The standard starting point is x_0 = (ξ_j), where ξ_{2j−1} = −1.2 and ξ_{2j} = 1, and
   the minimum equals f* = 0 at (1, . . . , 1).
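A direct implementation (ours) of the Extended Rosenbrock function:

```python
import numpy as np

def ext_rosenbrock(x):
    """Extended Rosenbrock function, problem no. 21 in [12]; d must be even."""
    x = np.asarray(x, dtype=float)
    odd, even = x[0::2], x[1::2]          # x_{2i-1} and x_{2i}
    return float(np.sum(100.0 * (even - odd ** 2) ** 2 + (1.0 - odd) ** 2))

x0 = np.tile([-1.2, 1.0], 3)              # standard starting point, d = 6
print(ext_rosenbrock(x0))                 # ~72.6 (each of the 3 pairs contributes 24.2)
print(ext_rosenbrock(np.ones(6)))         # -> 0.0 at the minimum
```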

2. Eight-variable problem II
   As the second eight-variable problem we chose the Chebyquad function (problem
   no. 35 in [12]). It is defined as

   f(x_1, . . . , x_d) = ∑_{i=1}^d f_i(x)²,  where  f_i(x) = (1/d) ∑_{j=1}^d T_i(x_j) − ∫_0^1 T_i(x) dx

   and T_i is the i-th Chebyshev polynomial shifted to the interval [0, 1]. The standard
   starting point is x_0 = (ξ_j), where ξ_j = j/(d + 1), and the minimum for d = 8 equals
   f* = 3.51687 · 10⁻³.
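A sketch (ours) of the Chebyquad function using the Chebyshev recurrence T_{i+1}(y) = 2yT_i(y) − T_{i−1}(y) and the closed-form integrals ∫₀¹ T_i dx = 0 for odd i and −1/(i² − 1) for even i:

```python
import numpy as np

def chebyquad(x):
    """Chebyquad function, problem no. 35 in [12]."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    y = 2.0 * x - 1.0                    # shift [0,1] -> [-1,1]
    T_prev, T = np.ones_like(y), y       # T_0, T_1
    total = 0.0
    for i in range(1, d + 1):
        integral = 0.0 if i % 2 else -1.0 / (i * i - 1)
        fi = T.mean() - integral         # f_i(x)
        total += fi * fi
        T_prev, T = T, 2.0 * y * T - T_prev
    return total

x0 = np.arange(1, 9) / 9.0               # standard starting point for d = 8
print(chebyquad(x0))
```

For d = 1 the single residual vanishes at x = 0.5, since the mean of T_1 and its integral are both zero there.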
3. Eleven-variable problem
   As the eleven-variable problem we chose the Osborne 2 function (problem no. 19
   in [12]). It is defined as

   f(x_1, . . . , x_11) = ∑_{i=1}^{65} f_i(x)²,  where

   f_i(x) = y_i − ( x_1 e^{−t_i x_5} + x_2 e^{−(t_i − x_9)² x_6} + x_3 e^{−(t_i − x_10)² x_7} + x_4 e^{−(t_i − x_11)² x_8} ),

   t_i = (i − 1)/10 and y_i for i = 1, . . . , 65 are constants that can be found in [12].
   The standard starting point is x_0 = [1.3, 0.65, 0.65, 0.7, 0.6, 3, 5, 7, 2, 4.5, 5.5]
   and the minimum is f* = 4.01377 · 10⁻².
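A sketch (ours) of the Osborne 2 objective; the 65 data constants y_i are not reproduced here (they are tabulated in [12]) and are passed in as an argument. The self-check generates synthetic data from the model itself, for which the residuals vanish exactly:

```python
import numpy as np

def osborne2_residuals(x, y):
    """Residuals f_i(x) = y_i - model(t_i; x) of the Osborne 2 problem."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(y)) / 10.0                         # t_i = (i - 1)/10
    model = (x[0] * np.exp(-t * x[4])
             + x[1] * np.exp(-(t - x[8]) ** 2 * x[5])
             + x[2] * np.exp(-(t - x[9]) ** 2 * x[6])
             + x[3] * np.exp(-(t - x[10]) ** 2 * x[7]))
    return np.asarray(y, dtype=float) - model

def osborne2(x, y):
    r = osborne2_residuals(x, y)
    return float(r @ r)

x0 = np.array([1.3, 0.65, 0.65, 0.7, 0.6, 3, 5, 7, 2, 4.5, 5.5])
y_synth = -osborne2_residuals(x0, np.zeros(65))          # model values at x0
print(osborne2(x0, y_synth))                             # -> 0.0
```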

4.6.2 Results
For each problem we ran Algorithm 1 combined with the EXTREM algorithm, with the
following three objective function approximation methods:
1. Radial basis function approximation without regularization,
2. Radial basis function approximation with regularization, using Generalized Cross
   Validation [18] to choose the λ parameter,
3. Radial basis function approximation with regularization, using the Weighted Gradient
   Variance to choose the λ parameter.
To construct the approximation of the objective function with each of these methods
we used 30 Gaussian radial basis functions with an equal shape parameter, set to half
of the distance between the most distant centers. No additional parameters were required
for the first and the second method. The single user-defined threshold NLMSEthr for the
measure (4.33) was used in the third method. Apart from the parameters concerning the
construction of the radial basis approximation, we had to set three parameters related
directly to Algorithm 1 itself: the number of initial steps Is = 50, ε = 10⁻³ for the
ε-check procedure, and γthr, the threshold value for the measure (4.24) used to detect
the reliable region.
In the tables below we show the performance of Algorithm 1 for the problems defined
in the previous subsection, with the above objective function approximation strategies,
compared to the EXTREM algorithm. The first column shows the number

Table 4.1 6-variable Rosenbrock function optimization using: Left) pure EXTREM, Right)
Algorithm 1 without regularization and with γthr = 0.65

EXTREM:
step num.  ||x − x*||   f
250        2.267853     2.885768
500        0.650821     0.109817
750        0.418187     0.031570
1000       0.246522     0.012666
1250       0.029985     0.001228
1523       0.002014     0.000001

Alg. 1, no regularization:
step num.  ||x − x*||   f          num. approx.
250        2.459117     3.711824   22
500        1.318398     0.559350   45
750        0.675244     0.089372   35
933        0.130808     0.005606   10

Table 4.2 6-variable Rosenbrock function optimization using: Left) Algorithm 1 with regularization
using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr = 5 · 10⁻⁶.
In both cases γthr = 0.65

Alg. 1 with GCV:
step num.  ||x − x*||   f          num. approx.
250        2.363547     3.991352   11
500        1.094954     0.334820   28
750        0.778636     0.162569   48
1000       0.218992     0.009666   47
1250       0.025537     0.000131   49
1288       0.025537     0.000131   10

Alg. 1 with WGV:
step num.  ||x − x*||   f          num. approx.
250        2.292762     2.923941   12
500        1.250487     0.435774   16
750        0.630858     0.150665   28
1000       0.513698     0.048269   38
1250       0.094284     0.001763   47
1262       0.094284     0.001763   4

Table 4.3 8-variable Rosenbrock function optimization using: Left) pure EXTREM, Right)
Algorithm 1 without regularization and γthr = 0.65

EXTREM:
step num.  ||x − x*||   f
250        3.923525     10.438226
500        2.731706     5.606541
750        1.771076     1.032648
1000       1.009230     0.251387
1250       0.484663     0.050215
1500       0.274941     0.015386
1750       0.262434     0.012950
2000       0.208392     0.007465
3471       0.000553     0.000000

Alg. 1, no regularization:
step num.  ||x − x*||   f          num. approx.
250        3.030179     6.376938   22
500        2.153959     1.855113   23
750        1.380208     0.471618   30
1000       0.910852     0.215283   46
1250       0.361361     0.028169   43
1500       0.186309     0.006926   55
1537       0.186309     0.006926   7

Table 4.4 8-variable Rosenbrock function optimization using: Left) Algorithm 1 with regularization
using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr = 10⁻⁶.
In both cases γthr = 0.65

Alg. 1 with GCV:
step num.  ||x − x*||   f          num. approx.
250        3.100123     7.932378   17
500        2.353671     3.580721   28
750        1.810286     1.085058   26
1000       0.946269     0.240704   41
1200       0.705151     0.118365   43

Alg. 1 with WGV:
step num.  ||x − x*||   f          num. approx.
250        3.263247     9.379781   11
500        2.288176     2.394216   15
750        1.272541     0.435303   17
1000       0.767617     0.143009   35
1250       0.505116     0.057994   42
1500       0.146317     0.004114   51
1750       0.028930     0.000169   35
1861       0.028958     0.000169   12

Table 4.5 8-variable Chebyquad function optimization using: Left) pure EXTREM, Right)
Algorithm 1 without regularization and with γthr = 0.65

EXTREM:
step num.  ||x − x*||   f
1          0.161512     0.038618
250        0.097709     0.006187
500        0.059121     0.004681
750        0.016460     0.003698
1000       0.000351     0.003517
1237       0.000000     0.003517

Alg. 1, no regularization:
step num.  ||x − x*||   f          num. approx.
250        0.098420     0.006216   4
500        0.059421     0.004684   18
750        0.009926     0.003584   10
1000       0.000454     0.003517   48
1197       0.000141     0.003517   97

of steps of the algorithm, i.e. the sum of the number of direct function evaluations and
the number of steps in which the radial basis function approximation was used. The second
column shows the distance from the minimum and the third column shows the objective
function value. The last column shows the number of steps within the previous 250 in
which the objective function approximation was used.
As we can see, in all cases SPELROA required considerably fewer steps than the pure
EXTREM algorithm to stop. Using the WGV method to build the radial basis approximation
gave the best convergence results, i.e. the best stopping point of SPELROA with WGV
compared to the stopping points obtained with the other methods. For all problems
NLMSEthr was chosen intuitively as ∼10⁻⁶, meaning that the reconstruction of the
training set in the vicinity of the evaluation point was at the level of 10⁻⁶. To compute
the reliable region, setting γthr to 0.65 was sufficient to preserve convergence of the
method. In the optimization of the 8-variable Chebyquad function it turned out to be
possible to reduce γthr to 0.6, which gave 16 points at which the objective function was
approximated in the first 250 steps

Table 4.6 8-variable Chebyquad function optimization using: Left) Algorithm 1 with regularization
using GCV and γthr = 0.65, Right) Algorithm 1 with regularization using WGV with
NLMSEthr = 10⁻⁶ and with γthr = 0.60

Alg. 1 with GCV:
step num.  ||x − x*||   f          num. approx.
250        0.098419     0.006216   4
500        0.059477     0.004684   17
750        0.008713     0.003584   11
1000       0.00156      0.003518   118
1095       0.00156      0.003518   68

Alg. 1 with WGV:
step num.  ||x − x*||   f          num. approx.
250        0.109306     0.007110   16
500        0.064139     0.004759   73
750        0.009802     0.003586   46
1000       0.001266     0.003519   31
1077       0.001051     0.003517   52

Table 4.7 11-variable Osborne 2 function optimization using: Left) pure EXTREM, Right)
Algorithm 1 without regularization and with γthr = 0.65

EXTREM:
step num.  ||x − x*||   f
1          4.755269     2.093420
250        1.004450     0.081672
500        0.099411     0.041034
750        0.073530     0.041772
1000       0.007715     0.040138
1250       0.001092     0.040138
1434       0.000000     0.040138

Alg. 1, no regularization:
step num.  ||x − x*||   f          num. approx.
250        1.379902     0.977272   29
500        0.568322     0.059211   31
710        0.335485     0.041512   43

Table 4.8 11-variable Osborne 2 function optimization using: Left) Algorithm 1 with regularization
using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr = 10⁻⁶.
In both cases γthr = 0.65

Alg. 1 with GCV:
step num.  ||x − x*||   f          num. approx.
250        1.673133     0.365797   32
500        0.975217     0.045721   31
750        0.496683     0.041805   43
1000       0.068066     0.040462   35
1106       0.062328     0.040177   25

Alg. 1 with WGV:
step num.  ||x − x*||   f          num. approx.
250        0.953416     0.106864   14
500        0.334653     0.047136   38
750        0.0441106    0.040209   49
919        0.037960     0.040185   29

instead of 4 such points when γthr = 0.65. An interesting result was also obtained for
the Osborne 2 function. The EXTREM algorithm found a better solution than that suggested
in [12]. Algorithm 1 did not converge to this minimum with any of the approximation
methods: the method without regularization did not converge at all, whereas GCV and WGV
converged to the minimum suggested in [12] rather than to the one found by EXTREM.

4.7 Summary
The Search Procedure Exploiting Locally Regularized Objective Approximation is a method
to speed up local optimization processes. The method combines a non-gradient optimization
algorithm with a regularized local radial basis function approximation: in a certain
number of function evaluation steps of the optimization algorithm, a local regularized
radial basis function approximation is used instead of a direct objective function
evaluation. In this chapter we presented the proof of convergence of the Search Procedure
Exploiting Regularized Objective Approximation, which applies to any Gauss–Seidel and
conjugate direction search algorithm that uses sequential quadratic interpolation as its
line search procedure. The convergence is proven under the assumption that the
approximation of the objective function with the prescribed relative approximation error
is exploited only in the sequential quadratic interpolation. The performance of the
method was presented on the 6- and 8-parameter Rosenbrock function, the 8-parameter
Chebyquad function and the 11-parameter Osborne 2 function. Further studies will compare
the method with trust region methods.

Acknowledgements. I would like to thank all referees for very valuable comments.

Appendix A
The minimum of the quadratic q(ζ) built on the three points (ζ_1, f(ζ_1)), (ζ_2, f(ζ_2))
and (ζ_3, f(ζ_3)), where {ζ_1, ζ_2, ζ_3} ⊂ R, equals

λ* = (1/2) · [ (ζ_3² − ζ_2²) f(ζ_1) + (ζ_1² − ζ_3²) f(ζ_2) + (ζ_2² − ζ_1²) f(ζ_3) ] / [ (ζ_3 − ζ_2) f(ζ_1) + (ζ_1 − ζ_3) f(ζ_2) + (ζ_2 − ζ_1) f(ζ_3) ].

Let us transform the quadratic q(x) = ax² + bx + c, x = ζ_1 + t(ζ_3 − ζ_1), t ∈ (−∞, ∞),
into the quadratic q̂(x) = âx² + b̂x + ĉ under the following assumptions.
1. The transformation is of the form

   q̂(x) = E(p(x)) + D,  p(x) = L_2(x),   (4.34)

   where L_2(x) is the Lagrange interpolation parabola constructed on the points
   (0, f(ζ_1)), (ζ_(r), f(ζ_2)) and (1, f(ζ_3)), with ζ_(r) = (ζ_2 − ζ_1)/(ζ_3 − ζ_1).
2. a) If f(ζ_1) > f(ζ_3):  q̂(0) = −1, q̂(1) = −2;  b) if f(ζ_1) < f(ζ_3):  q̂(0) = −2, q̂(1) = −1.
100 M. Bazan

From condition 1 we get

L_2(x) = a′x² + b′x + c′

where

a′ = f(ζ_1)/ζ_(r) + f(ζ_2)/(ζ_(r)(ζ_(r) − 1)) + f(ζ_3)/(1 − ζ_(r)),

b′ = −[ f(ζ_1)(ζ_(r) + 1)/ζ_(r) + f(ζ_2)/(ζ_(r)(ζ_(r) − 1)) + f(ζ_3)ζ_(r)/(1 − ζ_(r)) ],

c′ = f(ζ_1).

From the conditions in 2 we get, in cases a) and b) respectively,

a) E = −1/(a′ + b′),  D = −1 − Ec′ = −1 + c′/(a′ + b′);

b) E = 1/(a′ + b′),  D = −2 − Ec′ = −2 − c′/(a′ + b′).

Then in the canonical form q̂(x) = âx² + b̂x + ĉ we have â = Ea′, b̂ = Eb′, ĉ = Ec′ + D.
The crucial properties of this transformation are:
1. q̂(ζ_(r)) < −1.
2. p(ζ_(r)) = f(ζ_2).
3. q̂(ζ_(r)) does not depend on f(ζ_2).
4. For the minimum point λ* of q such that λ* ∈ [ζ_1, ζ_3] we get the minimum point
   λ̂* = (λ* − ζ_1)/(ζ_3 − ζ_1), where λ̂* is the minimum point of q̂.
5. The transformation has a singularity when a′ = −b′.
This transformation reduces the number of free parameters from 6 to 3.

Appendix B

B.1 Expression for Δ (ε ; ζ )


The unperturbed minimum λ*(ζ) is related to the perturbed minimum λ̃*_l(ε; ζ) by

λ̃*_l(ε; ζ) = λ*(ζ) ± Δ_l(ε; ζ).

Here we assume that both λ*(ζ) and λ̃*_l(ε; ζ) are minima of quadratics obtained by the
transformation q̂(·) from Appendix A under the assumption f(ζ_1) < f(ζ_3); if the opposite
is true, the quadratics have to be rotated with respect to the center of the interval [ζ_1, ζ_3].
Let us introduce notation to simplify the derivations. Denote A = 2(ζ_(r) − 1),
B = q̂(ζ_(r)), C = ζ_(r). Then for the unperturbed quadratic we have

λ* = (1/2) · [A(ζ_(r) + 1) + B − ζ_(r)C] / [A + B − C],

whereas for the perturbed one we have

λ*_1(ε; ζ) = (1/2) · [A(ζ_(r) + 1)(1 + ε) + B − ζ_(r)C] / [A(1 + ε) + B − C],

λ*_2(ε; ζ) = (1/2) · [A(ζ_(r) + 1) + B(1 + ε) − ζ_(r)C] / [A + B(1 + ε) − C],

λ*_3(ε; ζ) = (1/2) · [A(ζ_(r) + 1) + B − ζ_(r)C(1 + ε)] / [A + B − C(1 + ε)],

for the perturbation at 0, ζ_(r) and 1 respectively, where ζ_(r) = (ζ_2 − ζ_1)/(ζ_3 − ζ_1).
We can simplify the expression |λ* − λ̃*(ε; ζ)| to get

Δ_1(ε; ζ) = [ (C − ζ_(r)B) / (2(A + B − C)) ] · [ Aε / (Aε + A + B − C) ],

and Δ_2(ε; ζ) and Δ_3(ε; ζ) similarly.

B.2 The Main Condition

If the inequalities (4.21) are satisfied, then we have a guarantee that λ*(ζ) < ζ_2.
This is because if λ̃*_l(ε; ζ) is shifted by Δ_l(ε; ζ) to the left, then shifting it
back to the right will not make it greater than ζ_2; if it is shifted to the right,
then we have a margin of 2Δ_l(ε; ζ). As previously mentioned, we consider only
perturbations at 0 and ζ_(r), i.e. l = 1 and l = 2. For l = 1 we get

[A(ζ_(r) + 1)(1 + ε) + B − ζ_(r)C] / [A(1 + ε) + B − C] < 2ζ_2.

We have to consider two cases depending on the sign of the denominator.

1. If A(1 + ε) + B − C > 0 then we get

   A(1 − ζ_(r))ε < A(ζ_(r) − 1) + B(2ζ_(r) − 1) − C(2ζ_(r) − 1),

   which again divides into two cases depending on the sign of the coefficient of ε:

   a. When A(1 − ζ_(r)) > 0 then

      ε < [A(ζ_(r) − 1) + B(2ζ_(r) − 1) − C(2ζ_(r) − 1)] / [A(1 − ζ_(r))].

   b. When A(1 − ζ_(r)) < 0 then

      ε > [A(ζ_(r) − 1) + B(2ζ_(r) − 1) − C(2ζ_(r) − 1)] / [A(1 − ζ_(r))].

2. If A(1 + ε) + B − C < 0 then we get two cases:

   a. When A(1 − ζ_(r)) > 0 then

      ε > [A(ζ_(r) − 1) + B(2ζ_(r) − 1) − C(2ζ_(r) − 1)] / [A(1 − ζ_(r))].

   b. When A(1 − ζ_(r)) < 0 then

      ε < [A(ζ_(r) − 1) + B(2ζ_(r) − 1) − C(2ζ_(r) − 1)] / [A(1 − ζ_(r))].

Only the upper bounds for ε, i.e. cases 1.a) and 2.b), are interpretable as a solution
to our problem. So finally we get two regions for ε and ζ_(r), where

ε < [A(ζ_(r) − 1) + B(2ζ_(r) − 1) − C(2ζ_(r) − 1)] / [A(1 − ζ_(r))],   (4.35)

for

A(1 + ε) + B − C > 0  and  A(1 − ζ_(r)) > 0,   (4.36)

or

A(1 + ε) + B − C < 0  and  A(1 − ζ_(r)) < 0.   (4.37)

In the same way we can obtain the conditions for l = 2:

ε < [−B(1 − ζ_(r)) − A] / [B(1 − ζ_(r))],   (4.38)

for

A + B(1 + ε) − C > 0  and  B(1 − ζ_(r)) < 0,   (4.39)

or

A + B(1 + ε) − C < 0  and  B(1 − ζ_(r)) > 0.   (4.40)

References
1. Bazan, M., Russenschuck, S.: Using neural networks to speed up optimization algo-
rithms. Eur. Phys. J. AP 12, 109–115 (2000)
2. Bazan, M., Aleksa, M., Russenschuck, S.: An improved method using radial basis func-
tion neural networks to speed up optimization algorithms. IEEE Trans. on Magnetics 38,
1081–1084 (2002)
4 Search Procedure Exploiting Locally Regularized Objective Approximation 103

3. Bazan, M., Aleksa, M., Lucas, J., Russenschuck, S., Ramberger, S., Völlinger, C.: In-
tegrated design of superconducting magnets with the CERN field computation pro-
gram ROXIE. In: Proc. 6th International Computational Accelerator Physics Conference,
Darmstadt, Germany (September 2000)
4. Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust region methods. SIAM, Philadelphia (2005)
5. Hansen, P.C.: Rank-deficient and Discrete Ill-posed Problems. SIAM, Philadelphia
(1998)
6. Jacob, H.G.: Rechnergestützte Optimierung statischer und dynamischer Systeme.
Springer, Heidelberg (1982)
7. Kansa, E.J., Hon, Y.C.: Circumventing the ill-conditioning problem with multiquadric
radial basis functions: Applications to elliptic partial differential equations. Comp. Math. with
App. 39(7-8), 123–137 (2000)
8. Luenberger, D.G.: Introduction to linear and nonlinear programming, 2nd edn. Addison-
Wesley, New York (1984)
9. Micchelli, C.A.: Interpolation of Scattered Data: Distance Matrices and Conditionally
Positive Definite Functions. Constructive Approximation 2, 11–22 (1986)
10. Madych, W.R., Nelson, S.A.: Multivariate interpolation and conditionally positive defi-
nite functions II. Math. Comp. 54(189), 211–230 (1990)
11. Madych, W.R.: Miscellaneous error bounds for multiquadric and related interpolators.
Comp. Math. with Appl. 24(12), 121–138 (1992)
12. Moré, J.J., Garbow, B.S., Hillstrom, K.E.: Testing unconstrained optimization software.
ACM Trans. Math. Software 7(1), 17–41 (1981)
13. Oeuvray, R.: Trust region method based on radial basis functions with application on
biomedical imaging, Ecole Polytechnique Federale de Lausanne (2005)
14. Polak, E.: Optimization. Algorithms and Consistent Approximations. Applied Mathe-
matical Sciences, vol. 124. Springer, Heidelberg (1997)
15. Powell, M.J.D.: On calculation of orthogonal vectors. The Computer Journal 11(3),
302–304 (1968)
16. Schaback, R.: Error estimates and condition number for radial basis function interpola-
tion. Adv. Comput. Math. 3, 251–264 (1995)
17. Schaback, R.: Native Hilbert Spaces for Radial Basis Functions I. The new development
in Approximation Theory. Birkhäuser, Basel (1999)
18. Wahba, G.: Spline models for observational data. SIAM, Philadelphia (1990)
19. Wendland, H., Rieger, C.: Approximate interpolation with applications to selecting
smoothing parameters. Numerische Mathematik 101, 643–662 (2005)
20. Wendland, H.: Gaussian Interpolation Revisited. In: Kopotun, K., Lyche, T., Neamtu,
M. (eds.) Trends in Approximation Theory, pp. 427–436. Vanderbilt University Press,
Nashville (2001)
21. Zangwill, W.I.: Nonlinear Programming; a Unified Approach. Prentice-Hall Interna-
tional Series. Prentice-Hall, Englewood Cliffs (1969)
Chapter 5
Optimization Problems with Cardinality
Constraints

Rubén Ruiz-Torrubiano, Sergio Garcı́a-Moratilla, and Alberto Suárez

Abstract. In this article we review several hybrid techniques that can be used to
accurately and efficiently solve large optimization problems with cardinality con-
straints. Exact methods, such as branch-and-bound, require lengthy computations
and are, for this reason, infeasible in practice. As an alternative, this study focuses
on approximate techniques that can identify near-optimal solutions at a reduced
computational cost. Most of the methods considered encode the candidate solutions
as sets. This representation, when used in conjunction with specially devised search
operators, is specially suited to problems whose solution involves the selection of
optimal subsets of specified cardinality. The performance of these techniques is il-
lustrated in optimization problems of practical interest that arise in the fields of
machine learning (pruning of ensembles of classifiers), quantitative finance (port-
folio selection), time-series modeling (index tracking) and statistical data analysis
(sparse principal component analysis).

5.1 Introduction
Many practical optimization problems involve the selection of subsets of specified
cardinality from a collection of items. These problems can be solved by exhaustive
enumeration of all the candidate solutions of the specified cardinality. In practice,
only small problems of this type can be exactly solved within a reasonable amount of
Rubén Ruiz-Torrubiano
Computer Science Department, Universidad Autónoma de Madrid, Spain
e-mail: ruben.ruizt@estudiante.uam.es
Sergio Garcı́a-Moratilla
Computer Science Department, Universidad Autónoma de Madrid, Spain
e-mail: sergio.garcia@uam.es
Alberto Suárez
Computer Science Department, Universidad Autónoma de Madrid, Spain
e-mail: alberto.suarez@uam.es

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 105–130.
springerlink.com c Springer-Verlag Berlin Heidelberg 2010

time. The number of steps required to find the optimal solution can be reduced using
branch-and-bound techniques. Nevertheless, the computational complexity of the
search remains exponential, which means that large problems cannot be handled by
these exact methods. It is therefore important to design algorithms that can identify
near-optimal solutions at a reduced computational cost. In this article we present a unified framework for handling optimization problems with cardinality constraints.
A number of approximate methods within this framework are analyzed and their
performance is tested in extensive benchmark experiments.
In its general form, an optimization problem with cardinality constraints can be formulated in terms of a vector of binary variables $z = \{z_1, z_2, \ldots, z_D\}$, $z_i \in \{0, 1\}$. The goal is to minimize a cost function that depends on $z$, subject to a constraint that specifies the number of non-zero bits in $z$:

$$\min_{z} F(z) \quad \text{s.t.} \quad \sum_{i=1}^{D} z_i = k. \tag{5.1}$$

Optimization problems with cardinality constraints given by an inequality $\sum_{i=1}^{D} z_i \leq K$ can be solved by selecting the best of the solutions of $K$ optimization problems with the equality constraint $\sum_{i=1}^{D} z_i = k$; $k = 1, 2, \ldots, K$. Finally, the solution of a combinatorial optimization problem without restrictions can be obtained by solving the sequence of problems with cardinality constraints $\sum_{i=1}^{D} z_i = k$; $k = 1, 2, \ldots, D$.
Continuous optimization tasks with cardinality constraints can also be analyzed within this framework. Consider the problem of minimizing a function that depends on a D-dimensional continuous parameter $\theta$. We search for solutions with exactly $k$ non-zero components of $\theta$:

$$\min_{\theta \in \mathbb{R}^D} F(\theta) \quad \text{s.t.} \quad \sum_{i=1}^{D} I(\theta_i \neq 0) = k, \tag{5.2}$$

where $I(\cdot)$ is an indicator function ($I(\mathrm{true}) = 1$, $I(\mathrm{false}) = 0$). This hybrid problem can be transformed into a purely combinatorial one of the type (5.1) by introducing a D-dimensional binary vector $z$ whose $i$-th component indicates whether variable $i$ is allowed to take a non-zero value ($z_i = 1$) or is set to zero ($z_i = 0$):

$$\min_{z} F^{*}(z) \quad \text{s.t.} \quad \sum_{i} z_i = k. \tag{5.3}$$

The function $F^{*}(z)$ is the solution of an auxiliary continuous optimization problem in the reduced space defined by $z$:

$$F^{*}(z) = \min_{\theta^{[z]}} F(\theta^{[z]}), \tag{5.4}$$

where θ [z] denotes the k-dimensional vector formed by the components of θ for which
the value of the corresponding component of z is 1. The remaining components of θ
are set to zero in the auxiliary problem.
This decomposition makes it clear how hybrid methods that combine techniques
for combinatorial and continuous optimization can be applied to identify the so-
lution of the subset selection problem with a continuous objective function: For a
given value of z, the optimal θ [z] is calculated by solving the surrogate problem
defined by (5.4), where z determines which components of θ are allowed to take a
non-zero value. The final solution is obtained by searching in the purely combinato-
rial space of possible values of z, using the optimal function value that is a solution
of (5.4) to guide the exploration.
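The decomposition can be sketched in a few lines of Python. The separable quadratic objective below is an illustrative choice of ours, picked so that the auxiliary problem (5.4) has a closed-form solution; exhaustive enumeration of the subsets stands in for the combinatorial metaheuristics described in Section 5.2.

```python
from itertools import combinations

def F(theta, target):
    # Illustrative separable quadratic cost (not from the chapter).
    return sum((t - c) ** 2 for t, c in zip(theta, target))

def F_star(z, target):
    """Auxiliary continuous problem (5.4): components with z_i = 0 are
    clamped to zero; the remaining ones minimize F freely. For this
    separable quadratic the free minimizer is theta_i = target_i."""
    theta = [c if zi else 0.0 for zi, c in zip(z, target)]
    return F(theta, target), theta

def cardinality_search(target, k):
    """Outer combinatorial search over all z with sum(z) = k,
    guided by the value F*(z) of the auxiliary problem."""
    D = len(target)
    best = None
    for subset in combinations(range(D), k):
        z = [1 if i in subset else 0 for i in range(D)]
        value, theta = F_star(z, target)
        if best is None or value < best[0]:
            best = (value, z, theta)
    return best

value, z, theta = cardinality_search([3.0, -0.5, 2.0, 0.1], k=2)
# The search keeps the two largest-magnitude components (indices 0 and 2).
```

For this toy objective the outer search simply identifies the k components with largest magnitude; for the non-separable objectives treated later in the chapter, no such shortcut exists and the combinatorial search does real work.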
The success of this hybrid approach depends on the availability of a continuous
optimization algorithm that can efficiently identify the globally optimal solution
of the auxiliary optimization problem defined in (5.4) and on the efficiency of the
algorithm used to address the combinatorial part of the search. For simple forms
of the continuous objective function and of the remaining restrictions (other than
the cardinality constraint), the auxiliary problem can be efficiently solved by exact
optimization techniques. For instance, efficient linear and quadratic programming
algorithms are available if the function is linear or quadratic, respectively [1]. For
more complex objective functions, general non-linear optimization techniques (such
as quasi-Newton [2] or interior-point methods [3]) may be necessary. In these cases,
there is no guarantee that the solution of the auxiliary problem is globally optimal.
As a consequence, if the solutions found are far from the global optimum, the com-
binatorial search that is used to solve the original problem (5.3) can be seriously
misled.
In this work, we assume that the continuous optimization task defined by (5.4)
can be solved exactly and focus on the solution of the combinatorial part of the orig-
inal problem. Section 5.2 describes how standard combinatorial optimization tech-
niques can be adapted to handle the cardinality constraints considered. Emphasis is
placed on the use of an appropriate encoding for the search states in terms of sets.
This set-based encoding is particularly well-suited for the definition of search oper-
ators that preserve the cardinality of the candidate solutions. With this adaptation,
the approximate methods described provide a practicable alternative to identifying
the exact solution by exhaustive search, which becomes computationally infeasible
in large problems, or to computationally inexpensive optimization methods, such as
greedy search, which tend to find suboptimal solutions. The experiments presented
in Section 5.3 illustrate how the techniques reviewed find near-optimal solutions
with limited computational resources and can therefore be used to address optimal
subset selection problems of practical interest. Novel results regarding the applica-
tion of these techniques to some of these problems (ensemble pruning and sparse
PCA) are also provided. Finally, Section 5.4 summarizes the conclusions of this
work.

5.2 Approximate Methods for the Solution of Optimization Problems with Cardinality Constraints
In this section we describe how simulated annealing (SA), genetic algorithms (GA),
and estimation of distribution algorithms (EDA) can be used to solve large opti-
mization problems with cardinality constraints. They are stochastic search methods
that involve the generation of candidate solutions, which are then rejected or se-
lected according to their performance. In their standard formulation, no particular
consideration is given to the number of non-zero components in the candidate solu-
tions generated. Cardinality constraints can be taken into account using one of the
following approaches:
(i) No candidate solution violating the constraint is generated at any time by the al-
gorithm. Enforcing this property requires the design of appropriate genetic and
neighborhood operators, such that the space of solutions of a given cardinality
is closed under these search operators [4, 5].
(ii) Solutions that violate the cardinality constraint can be generated by the suc-
cessor operators. Whenever a violation occurs, a repair algorithm is applied to
transform the infeasible solution into a solution of the desired cardinality. Typ-
ically, a local search is used to obtain the closest feasible solution, but random
repair mechanisms can be used as well [6, 7].
(iii) Solutions that violate the cardinality constraint can be generated by the succes-
sor operators. In contrast with the previous approach, infeasible solutions are
not repaired. Instead, a penalty term is introduced on the evaluation function so
that infeasible candidate solutions have worse scores than feasible ones with an
equivalent performance [6].
In the experiments described in Section 5.3, the best overall performance is obtained
by methods that use a set-based representation together with appropriately designed
successor operators that preserve the cardinality of the solutions. These results un-
derscore the importance of using a representation that properly reflects the structure
of the problem. Therefore, the focus of this study is on the design of search opera-
tors that preserve the cardinality of the candidate solutions. These specially adapted
methods are generally preferable to standard schemes that take into account the re-
strictions by either ad-hoc repair mechanisms or by including a term in the cost
function that penalizes violations of the constraints.

5.2.1 Simulated Annealing


Simulated annealing (SA) is an optimization technique inspired by the field of ther-
modynamics [8]. The main idea is to mimic the physical process of melting a solid
and then cooling it to allow the formation of a regular crystalline structure that at-
tains a minimum of the system’s free energy. In simulated annealing the function to
be minimized F(z) (objective or cost function) takes the role of the free energy in
the physical system. The physical configuration space is replaced by the space of
candidate solutions, which are connected by transitions defined by a neighborhood

operator. The stochastic search proceeds by considering transitions from the cur-
rent state z(cur) to a neighboring configuration zl ∈ N (z(cur) ) generated at random.
The proposed transition is accepted if the value of the objective function decreases.
Otherwise, if the candidate configuration is of higher cost, the transition is accepted
only with a certain probability. This probability is expressed as a Boltzmann factor

$$P_{\mathrm{accept}}(z_l, z^{(cur)}; T_k) = \exp\left( -\,\frac{F(z_l) - F(z^{(cur)})}{T_k} \right), \tag{5.5}$$

where the parameter Tk plays the role of a temperature. A general version of this
technique is given as Algorithm 1. In this pseudocode, the function annealingSchedule returns the temperature $T_k$ for the following epoch. It is common to use a geometric schedule $T_k = \gamma T_{k-1}$, where the factor $\gamma$, smaller than but usually close to one, regulates how fast the temperature is decreased.

Algorithm 1. Simulated annealing


• Generate initial configuration z(0) and initial temperature T0
• z(cur) ← z(0)
• i←0
• While convergence criteria are not met [Annealing loop]
– i ← i+1
– Fix temperature for epoch i: Ti = annealingSchedule(Ti−1 )
– Fix length for epoch i: Li
– For l = 1, . . . , Li [Epoch loop]
1. Select randomly an element zl ∈ N (z(cur) ).
2. If F(zl ) < F(z(cur) ), then z(cur) ← zl
3. Else, generate u ∼ U[0, 1]
If u < Paccept (zl , z(cur) ; Ti ), then z(cur) ← zl
• Return the best value found.

Cardinality constraints can be handled in SA by selecting an appropriate encoding for z and a corresponding neighborhood N (z). In particular, the candidate
solutions can be encoded as sets of specified cardinality. The components of the
binary vector z are then interpreted as indicating membership to the set: if zi = 1,
the ith element is included in the solution. Otherwise, if zi = 0 it is excluded from
the selection. It is also necessary to design a neighborhood operator that preserves
the cardinality constraints, so that no penalty or repair mechanisms are needed. A
simple design is to exchange an element included in the current candidate solution
with an element excluded from it. This is the version of SA that will be used in
Section 5.3.
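A minimal Python sketch of this set-encoded SA with the swap neighborhood follows. The toy objective and the parameter values are illustrative choices of ours, not the configuration used in the experiments of Section 5.3.

```python
import math
import random

def simulated_annealing(F, universe, k, T0=1.0, gamma=0.9,
                        n_epochs=50, epoch_len=100, seed=0):
    """SA over subsets of `universe` with fixed cardinality k (set
    encoding). The neighborhood swaps one selected element with one
    excluded element, so every state visited is feasible."""
    rng = random.Random(seed)
    current = set(rng.sample(sorted(universe), k))
    current_val = F(current)
    best, best_val = set(current), current_val
    T = T0
    for _ in range(n_epochs):
        for _ in range(epoch_len):
            # Cardinality-preserving swap move.
            leaving = rng.choice(sorted(current))
            entering = rng.choice(sorted(universe - current))
            neighbor = (current - {leaving}) | {entering}
            delta = F(neighbor) - current_val
            # Accept downhill moves always, uphill moves with the
            # Boltzmann probability of Eq. (5.5).
            if delta < 0 or rng.random() < math.exp(-delta / T):
                current, current_val = neighbor, current_val + delta
                if current_val < best_val:
                    best, best_val = set(current), current_val
        T *= gamma  # geometric annealing schedule T_k = gamma * T_{k-1}
    return best, best_val

# Toy problem: pick the k = 3 elements of smallest total from 0..9.
subset, value = simulated_annealing(sum, set(range(10)), k=3)
```

Because the swap operator never changes the set size, no penalty or repair mechanism is needed anywhere in the loop.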

5.2.2 Genetic Algorithms


Genetic algorithms are a class of optimization methods that mimic the process of natural evolution [9]. Optimization is achieved by selection from a population
that exhibits some random variability. The outline of a general genetic algorithm is
shown in Algorithm 2.

Algorithm 2. Genetic Algorithm


• Generate an initial population P0 with P individuals.
• For each individual I j ∈ P0 , calculate fitness Φ (I j ).
• Initialize the generation counter t ← 0.
• While convergence criteria are not met:
– Increase the generation counter t ← t + 1.
– Select a parent set Πt ⊂ Pt composed of nP individuals from the population.
– While Πt ≠ ∅:
· Extract two individuals I1 and I2 from Πt .
· Apply the crossover operator Θ (I1 , I2 ) and generate nC children (with
probability pC ).
· Apply the mutation operator to the nC children (with probability pM ).
– Calculate the fitness value of the new individuals.
– Add the new individuals to the population.
– Select P individuals that make up Pt+1 , the population for generation t + 1.

For problems with cardinality constraints, two alternative encodings for the can-
didate solutions are considered. A first possibility is a standard binary representa-
tion, where the chromosomes are bit-strings. The difficulty with this encoding is
that standard mutation and crossover operators do not preserve the number of non-
zero bits of the parents. A possible solution to this problem is to assign a lower
fitness value to individuals in the population that violate the cardinality constraint.
Assuming that a problem with an inequality cardinality constraint is considered, a
penalized fitness function can be built by subtracting from the standard fitness func-
tion a penalty term that depends on the magnitude of the violation of the cardinality
constraint
$$\Delta_k(z) = |\mathrm{Card}(z) - k|. \tag{5.6}$$
The penalized fitness function is

$$\Phi_p(z) = \Phi(z) - \beta f_p(\Delta_k(z)), \tag{5.7}$$

where $f_p : \mathbb{N} \rightarrow \mathbb{R}^{+}$ is a monotonically increasing function of $\Delta_k(z)$ and $\beta \geq 0$ represents the strength of the penalty.
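As a concrete illustration, with the linear choice f_p(x) = x the penalized fitness can be computed as follows; the raw fitness function and the value of β below are hypothetical.

```python
def penalized_fitness(fitness, z, k, beta=400.0):
    """Penalized fitness of Eq. (5.7) with the linear choice
    f_p(x) = x applied to the violation Delta_k(z) of Eq. (5.6)."""
    delta_k = abs(sum(z) - k)      # |Card(z) - k|
    return fitness(z) - beta * delta_k

# Illustrative raw fitness: weighted sum of the selected positions.
raw = lambda z: float(sum(i * zi for i, zi in enumerate(z)))

feasible = [0, 1, 1, 0]    # cardinality 2
infeasible = [0, 1, 1, 1]  # cardinality 3: one unit of violation for k = 2
```

A feasible individual keeps its raw fitness, while each unit of violation costs β, so infeasible individuals are strongly demoted during selection.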
Another option is to repair infeasible individuals when they are generated. Sev-
eral repair mechanisms can be defined for this purpose. For instance, an individual

can be repaired by randomly setting some bits to 0 or 1, as needed, until the cardi-
nality constraint is satisfied (random repair). Another alternative is to use a heuristic
to determine which bits must be set to 0 or to 1 (heuristic repair). The results of a
greedy optimization or the solutions of a relaxed version of the problem can also be
used to achieve this objective [10].
An alternative to binary encoding is to use the set representation introduced in
simulated annealing. The use of this representation simplifies the design of crossover
and mutation operators that preserve the cardinality of the individuals. The neighborhood operator defined in SA can be used to construct mutated individuals. Since
this operator swaps a variable in the set of selected variables with another variable in
the complement of this set, the cardinality of the original chromosome is preserved
by the mutation.
Some crossover operators on sets were introduced in [5]. They are defined taking
into account the properties of respect and assortment [11]. Respect ensures that the
offspring inherit the common genetic material of the parents. Assortment guarantees
that every combination of the alleles of the two parents is possible in the child, pro-
vided that these alleles are compatible. When cardinality constraints are considered,
it is no longer possible to design crossover operators that guarantee both respect and
assortment.
A crossover operation that provides a good balance of these properties and en-
sures that the cardinality of the parents is preserved in the offspring is random as-
sorting recombination (RAR). RAR crossover is described in Algorithm 3. In this
algorithm, the integer parameter w ≥ 0 determines the amount of common informa-
tion from both parents that is retained by the offspring. For w = 0, elements that are
present in the chromosomes of both parents are not allowed in the child. Higher val-
ues of w assign more importance to the elements in the intersection of the parents’
sets (chromosomes). In the limit w → ∞, the child contains every element that is in
both of the parents’ chromosomes with a probability that approaches 1.

5.2.3 Estimation of Distribution Algorithms


Estimation of distribution algorithms (EDAs) are a class of evolutionary methods in
which diversity is generated by a probabilistic sampling scheme [12]. Depending
on the nature of this sampling scheme, different variants of EDAs can be designed.
In this work, we consider the Population Based Incremental Learning (PBIL) algo-
rithm as a representative algorithm of the EDA family [13]. It operates on binary
chromosomes of fixed length (z) and assumes statistical independence among the
genes {zi ; i = 1, 2, . . . , D}. In generation g, the genotype of the population is char-
acterized by the probability vector p(g) , whose ith component is the probability of
assigning the value 1 to the gene in the ith position. The update of the probability
distribution using $D_g^{Se}$ (see Algorithm 4) in PBIL is

$$p^{(g+1)} = \alpha\, \frac{1}{M} \sum_{m=1}^{M} z^{(g, i_m)} + (1 - \alpha)\, p^{(g)}, \tag{5.8}$$

Algorithm 3. Random Assortment Recombination algorithm


Input: Two parents I1 and I2 , and a fixed cardinality k.
Output: A child chromosome Θ .
• Create auxiliary sets A, B,C, D, E:
– A = elements present in both parents.
– B = elements not present in any of the parents.
– C, D = elements present in exactly one parent (each such element contributes one copy to C, marked for inclusion, and one copy to D, marked for exclusion).
– E = ∅.
• Build set
G = {w copies of elements from A and B, and 1 copy of elements in C and D}
• While |Θ | < k and G ≠ ∅:
– Extract g ∈ G without replacement.
– If g ∈ A or g ∈ C, and g ∉ E: Θ = Θ ∪ {g}.
– If g ∈ B or g ∈ D: E = E ∪ {g}.
• If |Θ | < k, add elements chosen at random from U − Θ until chromosome is
complete.
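A Python sketch of RAR crossover is given below. It follows Algorithm 3, reading C and D as one "include" and one "exclude" token for each element present in exactly one parent; the function and variable names are ours.

```python
import random

def rar_crossover(parent1, parent2, universe, k, w=1, rng=None):
    """Random Assorting Recombination (Algorithm 3) on set-encoded
    chromosomes of fixed cardinality k."""
    rng = rng or random.Random(0)
    both = parent1 & parent2                   # set A: in both parents
    neither = universe - (parent1 | parent2)   # set B: in neither parent
    single = parent1 ^ parent2                 # in exactly one parent (C, D)
    # Multiset G of (element, decision) tokens: w copies pull common
    # elements in (A) or keep common absences out (B); elements in a
    # single parent get one "in" token and one "out" token each.
    G = ([(g, "in") for g in both for _ in range(w)]
         + [(g, "out") for g in neither for _ in range(w)]
         + [(g, "in") for g in single]
         + [(g, "out") for g in single])
    rng.shuffle(G)
    child, excluded = set(), set()   # Theta and E in Algorithm 3
    while len(child) < k and G:
        g, decision = G.pop()        # extract without replacement
        if decision == "in" and g not in excluded:
            child.add(g)
        elif decision == "out":
            excluded.add(g)
    # Complete the chromosome at random if fewer than k elements were kept.
    while len(child) < k:
        child.add(rng.choice(sorted(universe - child)))
    return child

child = rar_crossover({0, 1, 2}, {1, 2, 3}, set(range(6)), k=3)
```

Increasing w adds more "in" tokens for the intersection of the parents, making it ever more likely that common elements survive in the child, as described in the text.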

Algorithm 4. Estimation of distribution algorithm (EDA)

• Initialize the distribution P^(0)(z) that characterizes the population.
• Initialize the generation counter g ← 0.
• While convergence criteria are not met:
  – Sample a population of P individuals using P^(g)(z):
      D_g = {z^(g,1), . . . , z^(g,P)}
  – Sort the population by non-increasing fitness value:
      D_g = {z^(g,i_1), z^(g,i_2), . . . , z^(g,i_P)},
    where i_1, i_2, . . . , i_P is a reordering of the indices 1, 2, . . . , P such that
      Φ(z^(g,i_1)) ≥ Φ(z^(g,i_2)) ≥ · · · ≥ Φ(z^(g,i_P))
  – Select the first M ≤ P individuals from the sorted population:
      D_g^Se = {z^(g,i_1), z^(g,i_2), . . . , z^(g,i_M)}
  – Estimate the new probability distribution P^(g+1)(z) using D_g^Se.
  – Update the generation counter g ← g + 1.
• Return the best solution found.

where $z^{(g, i_m)}$ represents the individual in the $i_m$-th position in generation g, and
α ∈ (0, 1] is a smoothing parameter included to avoid strong fluctuations in the
estimates of the probability distribution. Individuals are sorted by decreasing fitness
values. The Univariate Marginal Distribution Algorithm (UMDA) [14], another member of the EDA family, is recovered when α = 1. Even though the encoding is binary,
the cardinality constraints can be enforced in the sampling of individuals. Algorithm
5 describes a sampling method that generates individuals of a specified cardinality k
from a distribution of bits characterized by the probability vector p. The application
of this method to sample new individuals guarantees that the algorithm is closed
with respect to the cardinality constraint.

Algorithm 5. Sampling individuals of a specified cardinality k from p.

• Initialize p̂ ← p.
• Initialize the individual x = 0.
• For i = 1, 2, . . . , k:
  – Generate a random number u ∼ U[0, 1].
  – Determine the value of j such that ∑_{m=1}^{j−1} p̂_m < u ≤ ∑_{m=1}^{j} p̂_m.
  – Set x_j = 1.
  – Update the value p̂_j ← 0.
  – Renormalize
      p̂_i ← p̂_i / ∑_{l=1}^{D} p̂_l ,   i = 1, 2, . . . , D,
    so that p̂ can again be interpreted as a probability vector (∑_{i=1}^{D} p̂_i = 1).
• Return the generated individual x.
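A direct Python transcription of Algorithm 5, together with the PBIL update of (5.8), might look as follows. Instead of renormalizing p̂ explicitly, the sketch draws u over the remaining probability mass, which is equivalent; the function names are ours.

```python
import random

def sample_with_cardinality(p, k, rng):
    """Algorithm 5: sample a binary individual with exactly k ones from
    the factorized distribution with marginal probabilities p. Assumes
    at least k strictly positive entries in p."""
    p_hat = list(p)
    x = [0] * len(p)
    for _ in range(k):
        # Drawing u over the remaining mass is equivalent to
        # renormalizing p_hat at every step.
        u = rng.random() * sum(p_hat)
        acc = 0.0
        for j, pj in enumerate(p_hat):
            acc += pj
            if pj > 0.0 and u <= acc:
                break
        x[j] = 1        # include gene j ...
        p_hat[j] = 0.0  # ... and remove it from further draws
    return x

def pbil_update(p, selected, alpha=0.1):
    """PBIL update of Eq. (5.8): mix the empirical marginals of the
    M selected individuals into the current probability vector."""
    M = len(selected)
    return [alpha * sum(ind[i] for ind in selected) / M + (1.0 - alpha) * pi
            for i, pi in enumerate(p)]

rng = random.Random(0)
x = sample_with_cardinality([0.5, 0.1, 0.8, 0.2], k=2, rng=rng)
p_next = pbil_update([0.5, 0.5], [[1, 0], [1, 1]], alpha=0.5)
```

Every individual produced this way has exactly k ones, so the PBIL search is closed with respect to the cardinality constraint, as stated in the text.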

5.3 Benchmark Optimization Problems with Cardinality Constraints
This section introduces a collection of optimization problems with cardinality con-
straints that are used to illustrate the application of the methods described in the pre-
vious section. These include standard combinatorial optimization problems, such as
the knapsack problem, and real-world problems that arise in the fields of machine
learning (ensemble pruning), quantitative finance (cardinality-constrained portfolio
optimization), time series modeling (index tracking by partial replication) and statis-
tical data analysis (sparse principal components analysis). The problems considered
are either purely combinatorial or involve the optimization of continuous parame-
ters. In a purely combinatorial optimization problem F(z) can be directly evaluated
once z is known. The knapsack problem and ensemble pruning are of this type. The
remaining problems considered (portfolio selection, index tracking and sparse PCA)
are hybrid optimization tasks, in which the evaluation of F(z) for a fixed value of
the binary vector z requires the solution of an auxiliary continuous optimization

problem. While it is possible to address the combinatorial and the continuous op-
timization problems simultaneously, we concentrate on strategies that handle these
aspects separately. Therefore, the outcome of the continuous optimization algorithm
is used to guide the combinatorial optimization search, as in (5.4). For the hybrid
problems considered, the secondary optimization task can be efficiently solved in an
exact manner by quadratic programming. Nonetheless, the scheme can be directly
generalized when the evaluation of F(z) requires a more complex programming so-
lution, possibly without guarantee of convergence to the global solution of the sur-
rogate optimization problem. Under these conditions the algorithm used to address
the combinatorial part can actually be misled by the suboptimal solutions found in
the auxiliary problem.

5.3.1 The Knapsack Problem


Knapsack problems are a family of combinatorial optimization problems that in-
volve selecting a subset from a pool of items [15]. In this work, we consider the
0/1 knapsack problem, which can be shown to be NP-complete [16]. In an instance
of the 0/1 knapsack problem, D items are available to fill up a knapsack. A profit pi and a
weight wi are associated to the i-th item, i = 1, 2, . . . , D. The objective is to identify
the subset of items whose accumulated profit is maximum and whose overall weight
does not exceed a given capacity W:

$$\max_{z} \sum_{i=1}^{D} p_i z_i \quad \text{s.t.} \quad \sum_{i=1}^{D} w_i z_i \leq W, \qquad z_i \in \{0, 1\}, \ i = 1, 2, \ldots, D. \tag{5.9}$$

Both exact and approximate methods have been used to address the 0/1 knapsack
problem. Exact algorithms based on branch-and-bound approaches and dynamic
programming are reviewed in [17]. Genetic algorithms [18, 19] and EDAs [12] have
also been used to address this problem.
Cardinality constraints are generally not considered in the standard 0/1 knapsack
problem. Nevertheless, the optimum of the unconstrained problem can be obtained
by solving D cardinality-constrained knapsack problems $\sum_{i=1}^{D} z_i = k$; $k = 1, 2, \ldots, D$.
The k-th element in this sequence is a knapsack problem with the restriction that
only k items can be included in the knapsack. To compare the performance of the
different optimization methods analyzed in this work, we use the testing protocol
proposed in [20, 18]. Three types of problems, defined in terms of two parameters
v, r ∈ R+ , v > 1, are considered:
(1) Uncorrelated: Weights and profits are generated randomly in [1, v].
(2) Weakly correlated: Weights are generated randomly in [1, v] and profits are
generated in the interval [wi − r, wi + r].
(3) Strongly correlated: Weights are generated randomly in [1, v] and profit pi =
wi + r.
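The three instance families can be generated as in the following sketch; the function name is ours, and the defaults v = 10, r = 5 match the values used in the experiments described below.

```python
import random

def knapsack_instance(D, kind, v=10.0, r=5.0, seed=0):
    """Generate (weights, profits) for the three 0/1 knapsack test
    families of the benchmark protocol of [20, 18]."""
    rng = random.Random(seed)
    weights = [rng.uniform(1.0, v) for _ in range(D)]
    if kind == "uncorrelated":
        profits = [rng.uniform(1.0, v) for _ in range(D)]
    elif kind == "weakly_correlated":
        profits = [rng.uniform(w - r, w + r) for w in weights]
    elif kind == "strongly_correlated":
        profits = [w + r for w in weights]
    else:
        raise ValueError("unknown instance family: %s" % kind)
    return weights, profits

weights, profits = knapsack_instance(100, "strongly_correlated")
```

In the strongly correlated family every profit exceeds its weight by the same constant, which is what makes these instances hard for both exact and approximate solvers.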
In general, knapsack problems with correlations between weights and profits are more
difficult to solve than problems in which the weights and profits are independent. We

use v = 10, r = 5 and a capacity W = 2v, which tends to include very few items in
the solution. The results reported are averages over 25 realizations of each problem,
which are solved using the different approximate methods: SA, a standard GA with
linear penalty, a GA using set encoding and the RAR operator (w = 1), and PBIL.
The conditions under which the search is conducted are determined in exploratory
experiments. A geometric annealing schedule Tk = γ Tk−1 with γ = 0.9 is used in
SA. The GAs evolve populations composed of 100 individuals. The probabilities of
crossover and mutation are pc = 1, pm = 10−2 , respectively. In PBIL, a population
composed of 1000 individuals is used. The probability distribution is updated using
10% of the individuals. The smoothing parameter α is 0.1. Exact results, obtained with the solver SYMPHONY from the COIN-OR project [21], which implements a branch-and-cut (B&C) approach [22], are also reported for reference. In the strongly correlated
problems it was not possible to find the exact solutions within a reasonable amount
of time.

Table 5.1 Results for the 0/1 knapsack problem with restrictive capacity

                       GA Lin.          GA RAR           SA               PBIL             B&C (exact)
Corr.   No. Items   Profit   Time    Profit   Time    Profit   Time    Profit   Time    Profit   Time
none       100       79.36   26.2     82.09   54.5     80.70   98.4     81.89   24.1     82.11    62.0
none       250       90.63   38.1    105.34  134.9    102.91  284.4    104.51   47.4    106.43   178.7
none       500       95.93   57.2    119.88  261.9    118.07  531.7    117.28   91.5    123.93   568.8
weak       100       52.97   26.9     54.38   53.9     53.53   99.87    54.33   24.2     54.43    81.8
weak       250       59.07   38.3     66.24  130.4     65.13  286.4     65.85   47.7     67.10   180.3
weak       500       60.40   56.1     74.17  266.1     73.40  531.9     72.05   87.9     76.61   560.1
strong     100       76.19   26.2     79.77   57.8     79.73   98.7     78.99   24.0       −       −
strong     250       83.98   37.8     94.20  139.5     94.15  286.0     92.39   47.3       −       −
strong     500       84.52   55.3    101.40  272.2    102.16  525.6     96.60   86.9       −       −

Table 5.1 displays the average profit obtained and the time (in seconds) to reach
a solution for each method. The experiments were performed on an AMD Turion
computer with a 1.79 GHz processor and 1 GB of RAM. None of the approximate
methods reaches the optimal profit, which is calculated using an exact branch-and-
cut method. The highest profit obtained by an approximate optimization is high-
lighted in boldface. In all cases, the algorithms that use a set encoding (GA with
RAR crossover and SA) exhibit the best performance. They also require longer
times to reach a solution, especially SA. PBIL obtains good results only in small
uncorrelated knapsack problems. This is explained by the fact that the sampling and
estimation of probability distributions becomes progressively more difficult as the
dimensionality of the problem increases. Furthermore, PBIL assumes statistical in-
dependence between the variables, which makes the algorithm perform worse on
problems in which correlations are present. The standard GA with linear penalty
has a very poor performance in all the knapsack problems analyzed.

5.3.2 Ensemble Pruning


Consider the problem of automatic induction of classifiers from a collection of in-
stances {(xn , yn ); n = 1, 2, . . . , N}, where yn is the class label of the example char-
acterized by the vector of attributes xn . The goal is to induce from these data an
autonomous system that accurately predicts the class label on the basis of the vec-
tor of attributes of a previously unseen instance. There are a number of algorithms
that can be used for learning different types of classifiers: decision trees, neural
networks, support vector machines, etc. In practice, one of the most successful
paradigms is ensemble learning [23]. Ensembles are composed of a diverse col-
lection of classifiers that are generated from the same training data by introduc-
ing variations in the algorithm used for induction or in the conditions under which
learning takes place. The outputs of the individual classifiers are then combined (for
instance, by majority voting) to produce the prediction of the ensemble. Pooling the
decisions of the ensemble members has the potential of improving the generaliza-
tion capacity of a single learner. However, ensembles are costly to generate and have
large storage requirements. Furthermore, the time required to classify an unlabeled
instance increases linearly with the size of the ensemble. Recent work has shown
that the storage requirements and classification times can be significantly reduced
by selecting a subset of classifiers whose generalization capacity is equivalent and
sometimes superior to the original complete ensemble. This process receives the
name of ensemble pruning [24], selection [25], or thinning [26].
Ensemble pruning has been a subject of great interest in the recent literature on
machine learning (see refs. in [27]). Most studies focus on the definition of appropri-
ate quantities that can be optimized on the training set to obtain pruned ensembles
with good generalization performance. The individual properties of classifiers are
not useful to guide the selection process. The generalization capacity of the pruned
ensemble crucially depends on the complementarity of the classifiers that are part of
it. The search in the space of subensembles is usually greedy. A notable exception
is [28, 29, 30], where genetic algorithms, generally with real-valued chromosomes,
are used.
Another exception is [31], where ensemble pruning is formulated as a quadratic
integer programming problem. Consider an ensemble composed of D classifiers and
a set of labeled instances. Define a matrix G, whose element Gi j is the number of
common errors between classifier i and classifier j, where i, j = 1, 2, . . . , D. The
value of the diagonal term Gii is the number of errors made by classifier i. The
matrix is then symmetrized and its elements normalized so that they are on the same scale:

$$\tilde{G}_{ii} = \frac{G_{ii}}{N}, \qquad \tilde{G}_{ij} = \frac{1}{2} \left( \frac{G_{ij}}{G_{ii}} + \frac{G_{ji}}{G_{jj}} \right), \quad i \neq j. \tag{5.10}$$
Intuitively, $\sum_i \tilde{G}_{ii}$ measures the overall strength of the ensemble classifiers and $\sum_{i \neq j} \tilde{G}_{ij}$ measures their diversity. The subensemble selection problem of size k can now be formulated as a quadratic integer programming problem

$$\arg\min_{z}\ z^{T} \tilde{G}\, z \quad \text{s.t.} \quad \sum_{i=1}^{D} z_i = k, \qquad z_i \in \{0, 1\}. \tag{5.11}$$

The binary variable $z_i$ indicates whether classifier $i$ should be selected. The size of the pruned ensemble, $k$, is specified beforehand. The selection process is a combinatorial optimization problem whose exact solution requires evaluating the performance of the exponentially many, $\binom{D}{k}$, subensembles of size $k$ that can be extracted from an ensemble of size $D$. In [31] the solution is approximated in polynomial time by applying semi-definite programming (SDP) to a convex relaxation of the original problem.
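The construction of G̃ in (5.10) and the objective of (5.11) can be expressed compactly in Python. The sketch below uses a synthetic error matrix of ours and assumes that every classifier errs on at least one instance, so that the divisions by G_ii in (5.10) are well defined.

```python
def pruning_matrix(errors):
    """Build the symmetrized, normalized matrix of Eq. (5.10).
    errors[n][i] = 1 iff classifier i misclassifies instance n;
    every classifier must err on at least one instance (G_ii > 0)."""
    N, D = len(errors), len(errors[0])
    # G[i][j]: number of instances misclassified by both i and j.
    G = [[sum(errors[n][i] * errors[n][j] for n in range(N))
          for j in range(D)] for i in range(D)]
    G_tilde = [[0.0] * D for _ in range(D)]
    for i in range(D):
        G_tilde[i][i] = G[i][i] / N
        for j in range(D):
            if j != i:
                G_tilde[i][j] = 0.5 * (G[i][j] / G[i][i] + G[j][i] / G[j][j])
    return G_tilde

def pruning_objective(z, G_tilde):
    """Quadratic objective z^T G_tilde z of Eq. (5.11)."""
    D = len(z)
    return sum(z[i] * G_tilde[i][j] * z[j]
               for i in range(D) for j in range(D))

# Synthetic example: classifiers 0 and 2 always err together,
# classifier 1 errs on a different instance.
errs = [[1, 0, 1],
        [1, 0, 1],
        [0, 1, 0],
        [0, 0, 0]]
G_tilde = pruning_matrix(errs)
# Selecting the complementary pair {0, 1} yields a lower objective
# than the redundant pair {0, 2}.
```

The off-diagonal terms reward pairs of classifiers that rarely err on the same instances, which is exactly the complementarity the text identifies as the driver of generalization in pruned ensembles.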
To investigate the performance in the ensemble pruning problem of the opti-
mization methods described in Section 5.2, we generate bagging ensembles for five
representative benchmark problems from the UCI repository: heart, pima, satellite,
waveform and wdbc (Breast Cancer Wisconsin) [32]. The individual classifiers in
the ensemble are trained on different bootstrap samples of the original data [33].
If the classifiers used as base learners are unstable the fluctuations in the bootstrap
sample lead to the induction of different predictors. Assuming that the errors of
these classifiers are uncorrelated, pooling their decisions by majority voting should
improve the accuracy of the predictions. In the experiments performed, bagging en-
sembles of 101 CART trees are built [34]. The original ensemble is pruned to k = 21
decision trees. The strength-diversity measure G, the time consumed in seconds and
the number of evaluations are averaged over 5 ten-fold cross-validations for heart,
pima, satellite and wdbc, and over 50 independent partitions for waveform. The suc-
cess rate is the average over 50 repetitions of the optimization for a given partition
of the data into training and testing sets.
The parameters for the metaheuristic optimization methods are determined in
exploratory experiments using the results of SDP as a gauge. For the GAs, popu-
lations with 100 individuals are evolved using a steady state generational substitu-
tion scheme. The crossover probability is set to 1. The mutation probability is 10−2
for GAs with binary representation and 10−3 for GAs with set representation. The
strength of the penalty term in the GA with linear penalties is β = 400. If the best
individual of the final population does not satisfy the cardinality constraint, a greedy
search is performed to fulfill the restriction. The value w = 1 is used in RAR-GA. A
geometric annealing schedule with γ = 0.9 is used in SA. In these experiments, the
best solution in 10 independent executions of the SA algorithm is chosen. For PBIL,
a population of 1000 individuals is generated, where 10% of the individuals are used
to update the probability distribution. The smoothing constant is set to α = 0.1.
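The PBIL update described above can be sketched as follows. This is a hedged illustration: the Bernoulli sampling, elite selection, and the toy fitness function are our assumptions, not the authors' implementation; only the elite fraction (10%) and smoothing constant α = 0.1 come from the text.

```python
import numpy as np

def pbil_step(p, fitness, pop_size, top_frac=0.10, alpha=0.1, rng=None):
    """One PBIL generation: sample, select the elite, smooth p towards it."""
    if rng is None:
        rng = np.random.default_rng()
    pop = (rng.random((pop_size, p.size)) < p).astype(int)  # Bernoulli bits
    scores = np.array([fitness(ind) for ind in pop])
    elite = pop[np.argsort(scores)[::-1][: int(top_frac * pop_size)]]
    return (1 - alpha) * p + alpha * elite.mean(axis=0)

# Toy fitness: reward ones in the first half, penalize ones in the second.
fitness = lambda z: z[:5].sum() - z[5:].sum()

rng = np.random.default_rng(0)
p = np.full(10, 0.5)
for _ in range(30):
    p = pbil_step(p, fitness, pop_size=200, rng=rng)
```

After a few dozen generations the probability vector concentrates on the pattern favored by the fitness function.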
The results of the ensemble pruning experiments performed are summarized in
Table 5.2. Most of the optimization methods analyzed reach similar solutions in
all the classification problems considered, with the exception of the standard GA
with linear penalty, which obtains the worst values of the objective function. In
terms of this quantity, the best overall results correspond to SA and SDP. In terms
of efficiency, SDP should be preferred. In machine learning, the relevant measure
of performance is the generalization capacity of the classifiers generated. The test
118 R. Ruiz-Torrubiano, S. Garcı́a-Moratilla, and A. Suárez

Table 5.2 Results for the GA, SA and EDA approaches in the ensemble pruning problem

Algorithm   Problem   Best G   Success rate   Time (s)   Test Error
heart 156.2940 1.00 7.575 18.06
pima 234.8931 0.98 14.644 23.99
SA satellite 185.1163 1.00 63.490 13.15
waveform 105.2465 1.00 37.621 19.64
wdbc 121.9183 0.88 34.085 4.50
heart 157.8668 0.98 2.017 17.96
pima 235.0572 0.92 2.072 23.86
GA satellite 186.0706 1.00 0.870 12.85
Linear Penalty waveform 105.8294 0.02 5.179 19.70
wdbc 122.7625 0.06 4.963 4.45
heart 156.3127 1.00 0.851 17.87
pima 234.8931 1.00 1.665 23.99
GA Heuristic Repair satellite 185.1163 1.00 0.520 13.15
waveform 105.2465 0.80 8.168 19.64
wdbc 121.9183 0.90 7.875 4.50
heart 156.3860 1.00 0.697 17.96
pima 234.9190 1.00 1.381 24.09
GA satellite 185.1163 1.00 0.910 13.15
RAR (w = 1) waveform 105.2510 0.48 6.880 19.62
wdbc 121.9399 0.40 6.449 4.50
heart 156.4111 1.00 38.409 17.69
pima 234.9358 0.96 38.426 24.05
PBIL satellite 185.1163 1.00 16.400 13.15
waveform 105.2663 0.36 16.086 19.67
wdbc 122.0467 0.34 38.392 4.34
heart 156.3034 1.00 1.137 18.15
pima 234.8956 1.00 1.159 24.09
SDP satellite 185.1163 1.00 1.230 13.15
waveform 104.9984 0.90 1.230 19.60
wdbc 121.9143 0.90 1.117 4.39

error displayed in the last column of the table provides an estimate of the error rate
in examples that have not been used to train the classifiers. Lower test errors indicate
better generalization capacity. According to this measure the ranking of methods is
rather different: classifiers that were optimal according to the objective function are
suboptimal in terms of their generalization capacity. This indicates that the learning
process is affected by overfitting, because the objective function is estimated on the
training data. Nevertheless, the generalization performance of the pruned ensembles
is very similar for all the optimization methods considered. Table 5.3 shows the test
error of a single CART tree, of a complete bagging ensemble and the range of values

Table 5.3 Test errors for CART, standard bagging and pruned bagging

Problem CART Bagging Pruned bagging


heart 23.63 21.48 [17.69,18.15]
pima 24.84 24.67 [23.86,24.09]
satellite 13.80 14.25 [12.85,13.15]
waveform 30.27 22.53 [19.62,19.67]
wdbc 7.28 5.68 [4.34,4.50]

of the test error obtained by pruned bagging ensembles of size k = 21. In all the
classification problems considered, pruned ensembles have a lower test error than
CART and complete bagging.

5.3.3 Portfolio Optimization with Cardinality Constraints


The selection of optimal investment portfolios is a problem of great interest in the
area of quantitative finance and has attracted much attention in the scientific com-
munity (see refs. in [10]). It is a multiobjective optimization task with two opposed
goals: The maximization of profit and the minimization of risk. Several methods
have been proposed to address this problem, mostly within the classical mean-
variance model developed by H. Markowitz [35]. In this framework, the returns of
the assets considered for investment are modeled as white noise. Profit is quantified
in terms of the expected return of the portfolio. The variance of the portfolio re-
turns is used as a measure of risk. In its simplest version, the problem can be solved
by quadratic programming [1]. However, if cardinality constraints are included, the
problem becomes a mixed-integer quadratic problem, which can be shown to be
NP-complete [10]:
min_z  w[z]ᵀ · Σ[z,z] · w[z]                                  (5.12)
s.t.   w[z]ᵀ · r̄[z] = R∗                                      (5.13)
       a[z] ≤ w[z] ≤ b[z],  a[z] ≥ 0,  b[z] ≥ 0               (5.14)
       l ≤ A[z] · w[z] ≤ u                                    (5.15)
       zᵀ · 1 ≤ K                                             (5.16)
       wᵀ · 1 = 1,  w ≥ 0.                                    (5.17)

The inputs of the algorithm are r̄, the vector of expected asset returns and Σ, the
covariance matrix of the asset returns. The goal is to determine the optimal weights
of the assets in the portfolio; i.e. the value of w that minimizes the variance of the
portfolio returns (5.12), for a given value of the expected return of the portfolio, R∗
(5.13). Each element zi of the binary vector z specifies whether asset i is included in the
final portfolio (zi = 1) or not (zi = 0). Column vectors x[z] are obtained by remov-
ing from the corresponding vector x those components i for which zi = 0. Similarly,

the matrix A[z] is obtained by eliminating the i-th column of A whenever zi = 0. Fi-
nally, Σ[z,z] is obtained by removing from Σ the rows and columns for which the
corresponding indicator is zero (zi = 0). The symbols 0 and 1 denote vectors of the
appropriate size whose entries are all equal to 0 or to 1, respectively. Minimum and
maximum investment constraints, which set a lower and an upper bound on the in-
vestment of each asset in the portfolio are captured by (5.14). Vectors a and b are
D × 1 column vectors with the lower and upper bounds on the portfolio weights, re-
spectively. Inequality (5.15) summarizes the M concentration of capital constraints.
The m-th row of the M × D matrix A is the vector of coefficients of the linear combi-
nation that defines the constraint. The M × 1 column vectors l and u correspond to the
lower and upper bounds of the M linear restrictions, respectively. Concentration of
capital constraints can be used, for instance, to control the amount of capital invested
in a group of assets, so that investor preferences or limits for investment in certain
asset classes can be formally expressed. Since these constraints are linear, they do
not increase the difficulty of the problem, which can still be solved efficiently by
quadratic programming. Expression (5.16) corresponds to the cardinality constraint,
which limits the number of assets that can be included in the final portfolio. Finally,
equation (5.17) ensures that all the capital is invested in the portfolio.
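Once the support z is fixed, the continuous part of the problem reduces to a quadratic program. As a hedged illustration (not the authors' solver), the sketch below handles only the simplest case: objective (5.12) with the equality constraints (5.13) and (5.17), solved through the KKT linear system; the bound and concentration constraints (5.14)-(5.15) are ignored, and the toy Σ and r̄ are made up.

```python
import numpy as np

def min_variance_weights(Sigma, r_bar, R_star):
    """Equality-constrained minimum-variance weights for a fixed support.

    Solves min w^T Sigma w  s.t.  w^T r_bar = R_star,  1^T w = 1
    via the KKT system [2Sigma  r_bar  1; r_bar^T 0 0; 1^T 0 0].
    """
    n = len(r_bar)
    KKT = np.zeros((n + 2, n + 2))
    KKT[:n, :n] = 2 * Sigma
    KKT[:n, n] = r_bar;   KKT[n, :n] = r_bar       # return constraint
    KKT[:n, n + 1] = 1.0; KKT[n + 1, :n] = 1.0     # budget constraint
    rhs = np.zeros(n + 2)
    rhs[n], rhs[n + 1] = R_star, 1.0
    return np.linalg.solve(KKT, rhs)[:n]           # drop the multipliers

Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])
r_bar = np.array([0.08, 0.12])
w = min_variance_weights(Sigma, r_bar, R_star=0.10)
```

In the hybrid methods discussed here, a combinatorial search proposes the support z and a QP of this kind is solved for each candidate.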
The cardinality-constrained problem is difficult to solve by standard optimization
techniques. Branch-and-Bound methods can be used to find exact solutions [36]. De-
spite the improvements in efficiency, the complexity of the search is still exponential.
Genetic algorithms have also been used to address this problem: In [37], the perfor-
mance of GAs is compared to SA and to tabu search (TS) [38]. According to this in-
vestigation, the best-performing portfolios are obtained by pooling the results of the
different heuristics. In [39] SA is used to search directly in the space of real-valued
asset weights. Tabu search is employed in [40]. This work focuses on the design
of appropriate neighborhood operators to improve the efficiency of the search. In
[7, 41] Multi-Objective Evolutionary Algorithms (MOEAs) are used to address the
problem. These algorithms employ a hybrid encoding instead of a pure continuous
one and heuristic repair mechanisms to handle infeasible individuals. The impact of
local search improvements is also investigated in this work. The authors conclude
that the hybrid encoding improves the overall performance of the algorithm.
In the experiments carried out in this investigation, we address the problem of
optimal portfolio selection with lower bounds and cardinality constraints. The pa-
rameters of the constraints considered are li = 0.1, ui = 0.1, i = 1, . . . , D and K = 10.
The performance of the different optimization methods is compared by calculating
the efficient frontier for the problem with and without these constraints. Points on
the efficient frontier correspond to minimum-risk portfolios for a given expected
return, or, alternatively, to portfolios that have the largest expected return from a
family of portfolios with equal risk. As a measure of the quality of the solution ob-
tained, the average relative distance to the unconstrained efficient frontier (without
cardinality and lower bound constraints) is calculated

D = (1/NF) ∑_{i=1}^{NF} (σ_i^c − σ_i^∗) / σ_i^∗                    (5.18)

where NF = 100 is the number of frontier points considered, σ_i^c is the solution of
the constrained problem at the i-th point of the frontier, and σ_i^∗ is the solution of the
corresponding unconstrained problem.
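The average relative distance to the unconstrained frontier is straightforward to compute; the sketch below uses made-up frontier values purely for illustration.

```python
import numpy as np

def frontier_distance(sigma_constrained, sigma_unconstrained):
    """Average relative distance between two efficient frontiers."""
    sc = np.asarray(sigma_constrained, dtype=float)
    su = np.asarray(sigma_unconstrained, dtype=float)
    return float(np.mean((sc - su) / su))

# Two toy frontier points, each 10% worse than its unconstrained value.
D_val = frontier_distance([0.11, 0.22], [0.10, 0.20])   # ~0.10
```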

Table 5.4 Results for the GA, SA and EDA approaches in the portfolio selection problem

Algorithm   Index   Best D   Success rate   Time (s)   Optimizations
Hang Seng 0.00321150 1.00 1499.9 3.87 · 107
DAX 2.53162860 0.98 2877.3 7.63 · 107
SA FTSE 1.92205745 0.92 3610.4 8.87 · 107
S&P 4.69373181 0.91 3567.8 9.54 · 107
Nikkei 0.20197748 0.95 4274.5 9.25 · 107
Hang Seng 0.00327011 0.86 750.9 1.36 · 107
DAX 2.53314271 0.69 2999.0 4.60 · 107
GA FTSE 1.93255870 0.51 3539.3 5.76 · 107
Linear penalty S&P 4.69373181 0.76 4636.8 7.03 · 107
Nikkei 0.22992173 0.42 4811.7 6.47 · 107
Hang Seng 0.00321150 1.00 1122.9 2.18 · 107
DAX 2.53162860 1.00 4730.6 7.45 · 107
GA FTSE 1.92150019 0.94 6301.4 9.70 · 107
Heuristic Repair S&P 4.69373181 1.00 7860.6 11.42 · 107
Nikkei 0.20197748 0.99 10191.2 11.47 · 107
Hang Seng 0.00321150 1.00 1200.8 2.77 · 107
DAX 2.53162860 1.00 3178.5 6.14 · 107
GA FTSE 1.92150019 0.95 6384.6 12.02 · 107
RAR (w = 1) S&P 4.69373181 0.99 6575.6 12.34 · 107
Nikkei 0.20197748 1.00 9893.3 14.17 · 107
Hang Seng 0.00321150 1.00 2292.8 5.55 · 107
DAX 2.53162860 0.94 4489.1 7.70 · 107
PBIL FTSE 1.92208910 0.85 4782.3 8.06 · 107
S&P 4.69570006 0.88 5100.2 8.28 · 107
Nikkei 0.30164777 0.43 7486.5 8.21 · 107

The expected returns and the covariance matrix of the components of five major
world markets included in the OR-Library [42] are used as inputs for the optimiza-
tion: Hang Seng (Hong-Kong, 31 assets), DAX (Germany, 85 assets), FTSE (UK,
89 assets), Standard and Poor’s (U.S.A., 98 assets) and Nikkei (Japan, 225 assets).
The methods compared are SA, standard GA with linear penalty, standard GA with
heuristic repair, GA with a set representation and RAR (w = 1) crossover, and PBIL.
The SA heuristic is used with a geometric annealing scheme with constant γ = 0.9.
Populations of 100 individuals are used for the GAs. The mutation and crossover
probabilities are pm = 10−2 and pc = 1, respectively. PBIL samples populations

of 400 individuals, 10% of which are used to update the probability distribution.
The heuristic repair scheme performs an unconstrained optimization without the
cardinality constraint, and then either includes in the chromosome those products
with the highest weights or eliminates the products with the smallest weights in the
unconstrained solution, as needed.
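A minimal sketch of such a repair step is shown below, under the simplifying assumption that the retained weights are renormalized directly; in the scheme described above the reduced problem is re-optimized over the retained assets instead.

```python
import numpy as np

def repair_cardinality(w, K):
    """Keep the K largest weights, zero the rest, and renormalize."""
    w = np.asarray(w, dtype=float)
    keep = np.argsort(w)[::-1][:K]     # indices of the K largest weights
    repaired = np.zeros_like(w)
    repaired[keep] = w[keep] / w[keep].sum()
    return repaired

w = np.array([0.40, 0.05, 0.30, 0.15, 0.10])   # unconstrained solution
w2 = repair_cardinality(w, K=2)                # keeps assets 0 and 2
```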
Table 5.4 summarizes the results of the experiments. The value of D (5.18) dis-
played in the third column is the best out of 5 executions of each of the methods
considered. The proportion of attempts in which the corresponding optimization algorithm
obtains the best known solution is given in the column labeled success rate. The last
two columns report the time employed (in seconds) and the number of quadratic
optimizations performed, respectively. In terms of the quality of the obtained solu-
tions, using a binary encoding with linear penalties performs worse than all the other
approximate methods. By contrast, the heuristic repair scheme identifies the best of
the known solutions in all the problems investigated. GA with a set representation
and RAR (w = 1) crossover has also an excellent performance and is slightly more
efficient on average. High quality solutions are also obtained by SA, albeit at higher
computational cost. PBIL performs well only in problems in which the number of
assets considered for investment is small. As the dimensionality of the problem in-
creases, sampling and estimation of the probability distribution in algorithms of the
EDA family become less effective.

5.3.4 Index Tracking by Partial Replication


Index tracking is a passive investment strategy whose goal is to match the perfor-
mance of a reference financial index. The problem can be exactly solved by invest-
ing on each asset an amount of capital that is proportional to the corresponding
weight in the index. In practice, this strategy has the drawback of incurring high
initial transaction costs. Furthermore, there is an overhead in managing a portfolio
that invests in every constituent of the index. In particular, rebalancing the portfolio
can be costly if the composition of the index is revised. An alternative is to create a
tracking portfolio that invests only in a reduced set of assets. This partial replication
strategy will in general be unable to perfectly reproduce the behavior of the index.
However, a portfolio that invests in a fixed number of assets and closely follows the
evolution of the index can be obtained by minimizing the tracking error
min_{w,z}  (1/T) ∑_{t=1}^{T} ( ∑_{j=1}^{D} w_j r_j(t) − r_t )²          (5.19)
s.t.   ∑_{i=1}^{D} w_i = 1,                                             (5.20)
       l ≤ A · w ≤ u                                                    (5.21)
       z_i ∈ {0, 1},  a_i z_i ≤ w_i ≤ b_i z_i,  a_i ≥ 0, b_i ≥ 0,  i = 1, 2, . . . , D    (5.22)
       ∑_{i=1}^{D} z_i ≤ K,                                             (5.23)

where T is the length of the time series considered, D is the number of constituents
of the index, r j (t) is the return of asset j at time t and rt is the return of the index
at time t. Restriction (5.20) is a budget constraint, which ensures that all the cap-
ital is invested in the portfolio. Investment concentration constraints are captured
by (5.21). Expression (5.22) reflects lower and upper bound constraints. The binary
variables {z1 , z2 , . . . , zD } indicate whether an asset is included or excluded from the
tracking portfolio. Note that when zi = 0, the lower and upper bounds for the weight
of asset i are both equal to zero, which effectively excludes this asset from the
investment. The cardinality constraint is expressed by Eq.(5.23).
Index tracking has been extensively investigated in the literature. The hybrid GA
with set encoding and RAR crossover described in Section 5.2 is used in [4]. In-
stead of the tracking error, this work minimizes the variance of the difference be-
tween the returns of the index and of the tracking portfolio. Optimal impulse control
techniques are used in [43]. In [44] the problem is solved by using the threshold ac-
cepting (TA) heuristic, which is a deterministic analogue of simulated annealing, in
which transitions are rejected only when they lead to a deterioration in performance
that is above a specified threshold. Evolutionary algorithms with real-valued chro-
mosome representations are used in [45]. This investigation focuses on the influence
of transaction costs and portfolio rebalancing. In [46] the portfolio optimization and
index tracking problems are addressed by means of a heuristic relaxation method
that consists in solving a small number of convex optimization problems with fixed
transaction costs. Hybrid optimization approaches to minimizing the tracking error by
partial replication are also investigated in [47, 48, 49].
In the current investigation, publicly available benchmark data from the OR-
Library [42] is used to compare the optimization techniques described in Section
5.2. Five major world market indices are used in the experiments: Hang-Seng, DAX,
FTSE, S&P and Nikkei. For each index, the time series of 290 weekly returns for
the index and for its constituents are given. From these data, the first 145 values
are used to create a tracking portfolio that includes a maximum of K = 10 assets.
The last 145 values are used to measure the out-of-sample tracking error. The pop-
ulation sizes are 350 for the GAs and 1000 for PBIL. The values of the remaining
parameters coincide with those used in the portfolio selection problem.
Table 5.5 presents a summary of the experiments performed. The best out of 5 ex-
ecutions of the different optimization methods are reported. GA with random repair
obtains the best overall results. GA with set encoding and RAR (w = 1) crossover
matches these results except in Nikkei, which is the index with the largest number of
constituents. PBIL also has a good performance, but the computational cost is higher
than for the other algorithms. In fact, the algorithm reached the maximum number
of optimizations established without converging. The results of SA and GA with
binary encoding and linear penalty are suboptimal in all but the simplest problems.
They also exhibit low success rates. In all problems investigated, the out-of-sample
error is typically larger than the in-sample error, but of the same order of magnitude.

Table 5.5 Results for the GA, SA and EDA approaches in the index tracking problem

Algorithm   Index   Best MSE In-Sample   MSE Out-of-Sample   Success rate   Time (s)   Number of opts.
Hang Seng 1.3462 · 10−5 2.0575 · 10−5 0.40 1.12 19342
DAX 8.0837 · 10−6 7.4824 · 10−5 0.40 1.73 27101
SA FTSE 2.3951 · 10−5 7.0007 · 10−5 0.20 1.44 1.43
S&P 1.6781 · 10−5 4.7347 · 10−5 0.20 1.97 29764
Nikkei 2.1974 · 10−5 1.0719 · 10−4 0.20 95.00 1476549
Hang Seng 1.3462 · 10−5 2.0575 · 10−5 0.60 4.15 51509
DAX 8.0837 · 10−6 7.4824 · 10−5 0.20 13.69 144868
GA FTSE 2.7345 · 10−5 5.3148 · 10−5 0.20 17.66 158465
Linear Penalty S&P 1.7974 · 10−5 5.2898 · 10−5 0.20 36.89 311008
Nikkei 2.0061 · 10−5 1.0707 · 10−4 0.20 123.28 1015774
Hang Seng 1.3462 · 10−5 2.0575 · 10−5 1.00 5.92 81690
DAX 8.0837 · 10−6 7.4824 · 10−5 1.00 18.89 231840
GA FTSE 2.1836 · 10−5 8.0091 · 10−5 0.40 21.20 255820
Random Repair S&P 1.6573 · 10−5 5.5457 · 10−5 0.20 47.02 508313
Nikkei 1.8255 · 10−5 6.9574 · 10−5 0.20 170.62 1664696
Hang Seng 1.3462 · 10−5 2.0575 · 10−5 1.00 4.67 51513
DAX 8.0837 · 10−6 7.4824 · 10−5 1.00 14.17 124717
GA FTSE 2.1836 · 10−5 8.0091 · 10−5 0.40 18.83 156456
RAR (w = 1) S&P 1.6573 · 10−5 5.5457 · 10−5 0.20 42.31 311002
Nikkei 1.8917 · 10−5 8.1057 · 10−5 0.20 175.34 1015766
Hang Seng 1.3462 · 10−5 2.0575 · 10−5 1.00 167.04 2010000
DAX 8.0837 · 10−6 7.4824 · 10−5 1.00 199.28 2010000
PBIL FTSE 2.1836 · 10−5 8.0091 · 10−5 1.00 195.31 2010000
S&P 1.6781 · 10−5 4.7347 · 10−5 0.60 314.77 2010000
Nikkei 1.9510 · 10−5 7.4572 · 10−5 0.20 222.86 2010000

5.3.5 Sparse Principal Component Analysis


Principal Component Analysis (PCA) is a dimensionality reduction technique that is
frequently used in data analysis, data compression and data visualization. The goal
is to identify the directions along which the multidimensional data have the largest
variance. The principal components can be obtained by maximizing the variance of
normalized linear combinations of the original variables. Typically they have a non-
zero projection on all the original coordinates, which can make their interpretation
difficult. The goal of sparse PCA is to find principal components that have non-
zero loadings in only a small number of the original directions, while at the same
time explaining most of the variance. The first sparse principal component can be
obtained by solving the cardinality-constrained optimization problem

max_{w,z}  w[z]ᵀ · Σ[z,z] · w[z]                              (5.24)
s.t.   ‖w[z]‖₂ = 1                                            (5.25)
       zᵀ · 1 ≤ K,                                            (5.26)

where Σ is the data covariance matrix. As in the previous problems, the elements of
the binary vector z encode whether the principal component has a non-zero projec-
tion along the corresponding direction. Once the first principal component has been
found, if more principal components are to be calculated, the covariance matrix Σ
is deflated as follows

Σ = Σ − (wᵀ · Σ · w) w · wᵀ                                   (5.27)
and a new problem of the form given by (5.24), defined now in terms of this de-
flated covariance matrix, is solved. The decomposition stops after a maximum of
Rank(Σ) iterations. In practice, the number of principal components is either spec-
ified beforehand or determined by the percentage of the total variance of the data
explained.
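The deflation loop can be sketched as follows. The single-component solver here is a hypothetical stand-in (the plain dense leading eigenvector, i.e. without the cardinality constraint) so that the deflation mechanics of (5.27) can be shown in isolation; any of the constrained solvers discussed in this section could be substituted.

```python
import numpy as np

def leading_component(Sigma):
    """Stand-in solver: the dense leading eigenvector (unit norm)."""
    vals, vecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    return vecs[:, -1]

def pca_by_deflation(Sigma, n_components):
    Sigma = Sigma.copy()
    comps = []
    for _ in range(n_components):
        w = leading_component(Sigma)
        comps.append(w)
        # Deflate: remove the variance explained along w, as in (5.27).
        Sigma = Sigma - (w @ Sigma @ w) * np.outer(w, w)
    return np.array(comps)

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
comps = pca_by_deflation(A.T @ A, n_components=2)
```

With an exact eigen-solver the successive components come out mutually orthogonal; with a sparse solver the deflated directions are only approximately removed.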
The problem of finding sparse principal components has also received a fair
amount of attention in the recent literature. Greedy search is used in [50]. In [51]
SPCA is formulated as a regression problem, so that LASSO techniques [52] can be
used to favor sparse solutions. In LASSO, an L1 -norm penalty for non-zero values
of the factor loadings is used. A higher weight of the penalty term in the objective
function induces models that are sparser. However, it is not possible to have a di-
rect control on the number of non-zero coefficients in the solution. The cardinality
constraint is explicitly considered in [53], which uses a method based on solving a
relaxation of the problem by semidefinite programming (SDP).
To compare the performance of the different methods analyzed, we use the bench-
mark problem introduced in [54]. Consider the sparse vector v, whose components
are

      ⎧ 1,            if i ≤ 50
v_i = ⎨ 1/(i − 50),   if 50 < i ≤ 100                          (5.28)
      ⎩ 0,            otherwise
A covariance matrix is built from this vector and U, a square matrix of dimensions
150 × 150 whose elements are U[0, 1] random variables

Σ = σ v · vᵀ + Uᵀ · U,                                         (5.29)

where σ = 10 is the signal-to-noise ratio. In this manner, the pattern of
cardinality is partially masked by noise. In our experiments the results of SA, binary GA
with linear penalties, binary GA with random repair, set GA with RAR crossover
operator and w = 1, PBIL and DSPCA, an approximate method based on semidef-
inite programming [53, 55] are compared. SA uses a geometric annealing scheme
with γ = 0.9. The GAs use a population of 50 individuals. Crossover and muta-
tion are performed with probabilities pc = 1 and pm = 10−2 , respectively. PBIL is
executed with a population of 400 individuals and α = 0.1. In this algorithm, the

best 10% of the individuals are used to update the probability distribution. The first
sparse principal component is then calculated. For each of the methods that involve
stochastic search (all except DSPCA), the best out of 5 independent executions of
the algorithm is taken. Figure 5.1 displays the variance explained by the first sparse
principal component as a function of its cardinality K = 1, 2, . . . , 140, for all the
methods considered. GA using a linear penalty does not obtain good solutions in
this high-dimensional problem. PBIL performs slightly better, but is clearly infe-
rior to SA, GAs with random repair, GA with set encoding and DSPCA. Table 5.6
shows the detailed results for cardinality K = 50, which is the cardinality of the true
hidden pattern. In this table, the largest value of the variance achieved is highlighted
in bold. The success rates, the computation times on an AMD Turion machine with
1.79 GHz processor speed and 1 GB RAM and the total number of optimizations are
also given. The times for the DSPCA algorithm are not given, because a
MATLAB implementation was used [54], which cannot be directly compared with
the other results, obtained with code written in C. The GA with set encoding and
RAR (w = 1) crossover and the GA with binary encoding and random repair obtain
the best results and explain more variance than the solution obtained by DSPCA.
The first of these methods is slightly faster. SA is very fast and achieves a result
that is only slightly worse with a success rate of 100%. PBIL and GA with binary
encoding and linear penalty obtain solutions that are clearly inferior.
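For reference, the synthetic covariance matrix of (5.28)-(5.29) can be generated as follows (a sketch assuming NumPy; the function name is ours, and the construction assumes the dimension is at least 100 so that both non-zero segments of v fit):

```python
import numpy as np

def benchmark_covariance(dim=150, sigma=10.0, rng=None):
    """Build the sparse pattern v of (5.28) and the covariance of (5.29)."""
    if rng is None:
        rng = np.random.default_rng()
    v = np.zeros(dim)
    v[:50] = 1.0                        # v_i = 1           for i <= 50
    v[50:100] = 1.0 / np.arange(1, 51)  # v_i = 1/(i - 50)  for 50 < i <= 100
    U = rng.random((dim, dim))          # entries uniform on [0, 1]
    Sigma = sigma * np.outer(v, v) + U.T @ U
    return Sigma, v

Sigma, v = benchmark_covariance(rng=np.random.default_rng(0))
```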

[Figure: variance explained by the first sparse principal component (y-axis: Variance, 0-45)
as a function of its cardinality (x-axis: Cardinality, 25-150), for GA Linear Penalty,
GA Random Repair, GA RAR, SA, PBIL and DSPCA.]

Fig. 5.1 Comparison of results for the SPCA problem



Table 5.6 Results for the GA, SA, EDA and SDP approaches in the synthetic problem for
K = 50

Algorithm   Best variance   Success rate   Time (s)   Optimizations
SA 22.5727 1.00 65.95 11639
GA + Linear Penalty 19.7881 0.20 126.1 5137
GA + Random Repair 22.7423 0.80 172.1 7981
GA + RAR (w = 1) 22.7423 1.00 105.41 5146
PBIL 20.1778 1.00 198.20 40800
SDP 22.5001 − − −

5.4 Conclusions
Many tasks of practical interest can be formulated as optimization problems with
cardinality constraints. The examples analyzed in this article arise in various fields
of application: ensemble pruning, optimal portfolio selection, financial index track-
ing and sparse principal component analysis. They are large optimization problems
whose solution by standard optimization methods is computationally expensive. In
practice, using exact methods like branch-and-bound is feasible only for small prob-
lem instances. A practicable alternative is to use approximate optimization methods
that can identify near-optimal solutions at a lower computational cost: Genetic al-
gorithms, simulated annealing and estimation of distribution algorithms. However,
the search operators used in the standard formulations of these techniques are ill-
suited to the problem because they do not preserve the cardinality of the candi-
date solutions. This means that either ad-hoc penalization or repair mechanisms are
needed to enforce the constraints. Including penalty terms in the objective func-
tion distorts the search and generally leads to suboptimal solutions. Applying repair
mechanisms to infeasible configurations provides a more elegant and effective ap-
proach to the problem. Nonetheless, the best option is to use a set representation,
in conjunction with specially designed search operators that preserve the cardinality
of the candidate solutions. Some of the problems considered, such as the knapsack
problem and ensemble pruning, are purely combinatorial optimization tasks. In prob-
lems like portfolio selection, index tracking and sparse PCA both combinatorial and
continuous aspects are present. For these we advocate the use of hybrid methods
that separately handle the combinatorial and the continuous aspects of cardinality-
constrained optimization problems. Among the approximate methods considered,
a genetic algorithm with set encoding and RAR crossover obtains the best overall
performance. In problems where the comparison was possible, the solutions ob-
tained are close to the exact ones and to those identified by approximate methods
that use semidefinite programming. Using the same encoding, simulated annealing
also obtains fairly good solutions, generally at a higher computational cost. This
suggests that the RAR crossover operator enhances the search by introduc-
ing in the population individuals that effectively combine advantageous features of

their ancestors. Estimation of distribution algorithms, such as PBIL, perform well
on small and medium-sized problem instances. However, they fail to obtain good
solutions on large problems. The reason for this loss of efficacy is that the sampling
and estimation of probability distributions becomes progressively more difficult as
the dimensionality of the problem increases.

Acknowledgments
This research has been supported by Dirección General de Investigación (Spain),
project TIN2007-66862-C02-02.

References
1. Gill, P.E., Murray, W., Saunders, M.A., Wright, M.H.: Inertia-controlling methods for
general quadratic programming. SIAM Review 33, 1–36 (1991)
2. Gill, P., Murray, W.: Quasi-Newton methods for unconstrained optimization. IMA Journal
of Applied Mathematics 9 (1), 91–108 (1972)
3. Adler, I., Karmarkar, N., Resende, M.G.C., Veiga, G.: An implementation of Kar-
markar’s algorithm for linear programming. Mathematical Programming 44, 297–335
(1989)
4. Shapcott, J.: Index tracking: genetic algorithms for investment portfolio selection. Tech-
nical report, EPCC-SS92-24, Edinburgh, Parallel Computing Centre (1992)
5. Radcliffe, N.J.: Genetic set recombination. Foundations of Genetic Algorithms. Morgan
Kaufmann Publishers, San Francisco (1993)
6. Coello, C.: Theoretical and numerical constraint-handling techniques used with evolu-
tionary algorithms: a survey of the state of the art. Computer Methods in Applied Me-
chanics and Engineering 191, 1245–1287 (2002)
7. Streichert, F., Ulmer, H., Zell, A.: Evaluating a hybrid encoding and three crossover
operators on the constrained portfolio selection problem. In: Proceedings of the Congress
on Evolutionary Computation (CEC 2004), vol. 1, pp. 932–939 (2004)
8. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by simulated annealing. Sci-
ence 4598, 671–679 (1983)
9. Goldberg, D.: Genetic Algorithms in Search, Optimization and Machine Learning.
Addison-Wesley, Reading (1989)
10. Moral-Escudero, R., Ruiz-Torrubiano, R., Suarez, A.: Selection of optimal investment
portfolios with cardinality constraints. In: Proceedings of the IEEE World Congress on
Evolutionary Computation, pp. 2382–2388 (2006)
11. Radcliffe, N.J.: Equivalence class analysis of genetic algorithms. Complex Systems 5,
183–205 (1991)
12. Larrañaga, P., Lozano, J.A. (eds.): Estimation of Distribution Algorithms: A New Tool
for Evolutionary Computation. Kluwer Academic Publishers, Dordrecht (2002)
13. Baluja, S.: Population-based incremental learning: A method for integrating genetic
search based function optimization and competitive learning. Technical Report CMU-
CS-94-163, Carnegie Mellon University (1994)
14. Muehlenbein, H.: The equation for response to selection and its use for prediction. Evo-
lutionary Computation 5, 303–346 (1998)
15. Kellerer, H., Pferschy, U., Pisinger, D.: Knapsack Problems. Springer, Heidelberg (2004)

16. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W.
(eds.) Complexity of Computer Computations, pp. 85–103. Plenum Press (1972)
17. Pisinger, D.: Where are the hard knapsack problems? Computers & Operations Research,
2271–2284 (2005)
18. Simões, A., Costa, E.: An evolutionary approach to the zero/one knapsack problem: Test-
ing ideas from biology. In: Proceedings of the Fifth International Conference on Artificial
Neural Networks and Genetic Algorithms, ICANNGA (2001)
19. Ku, S., Lee, B.: A set-oriented genetic algorithm and the knapsack problem. In: Proceed-
ings of the IEEE World Congress on Evolutionary Computation, CEC 2001 (2001)
20. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer,
Heidelberg (1996)
21. Ladanyi, L., Ralphs, T., Guzelsoy, M., Mahajan, A.: SYMPHONY (2009),
https://projects.coin-or.org/SYMPHONY
22. Padberg, M.W., Rinaldi, G.: A branch-and-cut algorithm for the solution of large scale
traveling salesman problems. SIAM Review 33, 60–100 (1991)
23. Dietterich, T.G.: An experimental comparison of three methods for constructing ensem-
bles of decision trees: Bagging, boosting, and randomization. Machine Learning 40,
139–157 (2000)
24. Margineantu, D.D., Dietterich, T.G.: Pruning adaptive boosting. In: Proc. of the 14th
International Conference on Machine Learning, pp. 211–218. Morgan Kaufmann, San
Francisco (1997)
25. Caruana, R., Niculescu-Mizil, A., Crew, G., Ksikes, A.: Ensemble selection from li-
braries of models. In: Proc. of the 21st International Conference on Machine Learning,
p. 18. ACM Press, New York (2004)
26. Banfield, R.E., Hall, L.O., Bowyer, K.W., Kegelmeyer, W.P.: Ensemble diversity mea-
sures and their application to thinning. Information Fusion 6, 49–62 (2005)
27. Martı́nez-Muñoz, G., Lobato, D.H., Suárez, A.: An analysis of ensemble pruning tech-
niques based on ordered aggregation. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence 31, 245–259 (2009)
28. Zhou, Z.H., Wu, J., Tang, W.: Ensembling neural networks: Many could be better than
all. Artificial Intelligence 137, 239–263 (2002)
29. Zhou, Z.H., Tang, W.: Selective ensemble of decision trees. In: Liu, Q., Yao, Y., Skowron,
A. (eds.) RSFDGrC 2003. LNCS (LNAI), vol. 2639, pp. 476–483. Springer, Heidelberg
(2003)
30. Hernández-Lobato, D., Hernández-Lobato, J.M., Ruiz-Torrubiano, R., Valle, Á.: Pruning
adaptive boosting ensembles by means of a genetic algorithm. In: Corchado, E., Yin,
H., Botti, V., Fyfe, C. (eds.) IDEAL 2006. LNCS, vol. 4224, pp. 322–329. Springer,
Heidelberg (2006)
31. Zhang, Y., Burer, S., Street, W.N.: Ensemble pruning via semi-definite programming.
Journal of Machine Learning Research 7, 1315–1338 (2006)
32. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
33. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
34. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression
Trees. Chapman & Hall, New York (1984)
35. Markowitz, H.: Portfolio selection. Journal of Finance 7, 77–91 (1952)
36. Bienstock, D.: Computational study of a family of mixed-integer quadratic program-
ming problems. In: Balas, E., Clausen, J. (eds.) IPCO 1995. LNCS, vol. 920. Springer,
Heidelberg (1995)
130 R. Ruiz-Torrubiano, S. Garcı́a-Moratilla, and A. Suárez

37. Chang, T.J., Meade, N., Beasley, J.E., Sharaiha, Y.M.: Heuristics for cardinality con-
strained portfolio optimisation. Computers and Operations Research 27, 1271–1302
(2000)
38. Glover, F.: Future paths for integer programming and links to artificial intelligence. Com-
puters and Operations Research 13, 533–549 (1986)
39. Crama, Y., Schyns, M.: Simulated annealing for complex portfolio selection problems.
Technical report, Groupe d’Etude des Mathematiques du Management et de l’Economie
9911, Université de Liège (1999)
40. Schaerf, A.: Local search techniques for constrained portfolio selection problems. Com-
putational Economics 20, 177–190 (2002)
41. Streichert, F., Tanaka-Yamawaki, M.: The effect of local search on the constrained port-
folio selection problem. In: Proceedings of the IEEE World Congress on Evolutionary
Computation (CEC 2006), Vancouver, Canada, pp. 2368–2374 (2006)
42. Beasley, J.E.: OR-Library: Distributing test problems by electronic mail. Journal of the
Operational Research Society 41(11), 1069–1072 (1990)
43. Buckley, I., Korn, R.: Optimal index tracking under transaction costs and impulse con-
trol. International Journal of Theoretical and Applied Finance 1(3), 315–330 (1998)
44. Gilli, M., Këllezi, E.: Threshold accepting for index tracking. Computing in Economics
and Finance 72 (2001)
45. Beasley, J.E., Meade, N., Chang, T.: An evolutionary heuristic for the index tracking
problem. European Journal of Operational Research 148(3), 621–643 (2003)
46. Lobo, M., Fazel, M., Boyd, S.: Portfolio optimization with linear and fixed transaction
costs. Annals of Operations Research, special issue on financial optimization 152(1),
376–394 (2007)
47. Jeurissen, R., van den Berg, J.: Index tracking using a hybrid genetic algorithm. In: ICSC
Congress on Computational Intelligence Methods and Applications 2005 (2005)
48. Jeurissen, R., van den Berg, J.: Optimized index tracking using a hybrid genetic algo-
rithm. In: Proceedings of the IEEE World Congress on Evolutionary Computation (CEC
2008), pp. 2327–2334 (2008)
49. Ruiz-Torrubiano, R., Suárez, A.: A hybrid optimization approach to index tracking. Ac-
cepted for publication in Annals of Operations Research (2007)
50. Moghaddam, B., Weiss, Y., Avidan, S.: Spectral bounds for sparse PCA. In: Advances in
Neural Information Processing Systems, NIPS 2005 (2005)
51. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. Journal of Com-
putational and Graphical Statistics 15(2), 265–286 (2006)
52. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society B 58, 267–288 (1996)
53. d’Aspremont, A., Ghaoui, L.E., Jordan, M., Lanckriet, G.: A direct formulation for
sparse PCA using semidefinite programming. SIAM Review 49(3), 434–448 (2007)
54. d’Aspremont, A., Bach, F., Ghaoui, L.E.: Optimal solutions for sparse principal compo-
nent analysis. Journal of Machine Learning Research 9, 1269–1294 (2008)
55. d’Aspremont, A., Ghaoui, L.E., Jordan, M., Lanckriet, G.: MATLAB code for DSPCA
(2008), http://www.princeton.edu/˜aspremon/DSPCA.htm
Chapter 6
Learning Global Optimization through a Support Vector Machine Based Adaptive Multistart Strategy

Jayadeva, Sameena Shah, and Suresh Chandra
Abstract. We propose a global optimization algorithm called GOSAM (Global Optimization using Support vector regression based Adaptive Multistart) that applies statistical machine learning techniques, viz. Support Vector Regression (SVR), to adaptively direct iterative search in large-scale global optimization. At each iteration, GOSAM builds a training set of the objective function's local minima discovered up to the current iteration, and applies SVR to construct a regressor that learns the structure of the local minima. In the next iteration, the search for a local minimum is started from the minimum of this regressor. The idea is that the regressor will generalize well to the local minima not yet obtained in the search, and hence its minimum will be a 'crude approximation' to the global minimum. This approximation improves over time, leading the search towards regions that yield better local minima and eventually the global minimum. Simulation results on well-known benchmark problems show that GOSAM requires significantly fewer function evaluations to reach the global optimum than methods like Particle Swarm Optimization and Genetic Algorithms, and GOSAM proves relatively more efficient as the number of design variables (dimension) increases.
GOSAM does not require explicit knowledge of the objective function, and does not assume any specific properties. We also discuss some real-world applications of GOSAM involving constrained and design optimization problems.

Jayadeva
Dept. of Electrical Engineering, Indian Institute of Technology, Hauz Khas,
New Delhi - 110016, India
e-mail: jayadeva@ee.iitd.ac.in

Sameena Shah
Dept. of Electrical Engineering, Indian Institute of Technology, Hauz Khas,
New Delhi - 110016, India
e-mail: sameena.shah@gmail.com

Suresh Chandra
Dept. of Mathematics, Indian Institute of Technology, Hauz Khas,
New Delhi - 110016, India
e-mail: chandras@maths.iitd.ac.in

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 131–154.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

6.1 Introduction and Background Research


Global optimization involves finding the optimal or best possible configuration from
a large space of possible configurations. It is among the most fundamental of compu-
tational tasks, and has numerous applications including bio-informatics [26],
robotics [20], portfolio optimization [5], VLSI design [31], and nearly every
engineering application [27].
If the search space is small, then obtaining the global optimum is trivial; otherwise
some special structure like linearity, convexity or differentiability of the problem
needs to be exploited. Classical mathematical optimization techniques are based
on utilizing such special structures. Difficulties in obtaining the global optimum
arise when the objective function neither has any special structure, nor possesses
properties like continuity and differentiability, or if it has numerous local optima that
obstruct search for the global optimum [25]. Such objective functions are common
in many applications including data mining, location problems, and computational
chemistry, amongst others [13, 15]. Similar difficulties arise if some structure exists
but is not known a priori.
For the global optimization of these kinds of objective functions, one utilizes the broad class of general-purpose algorithms called local search algorithms [29].
Global optimizers typically depend on local search algorithms that search multiple
states in the neighborhood of a given state or configuration. Such methods find a
local optimum, which depends on the starting state. Local search generally does not
yield the global optimum, because it gets stuck in a local optimum. Therefore, local
search methods are usually augmented with some strategy to escape from local op-
tima. For instance, in simulated annealing (SA), first introduced by [22], the escape
strategy is a probabilistic shaking that is associated with a temperature parameter.
The “temperature” is reduced iteratively from a high initial value, based on a cooling
schedule.
On the other hand, though local search strategies suffer from entrapment in local
optima, they are very fast. The time required to run an iteration of simulated an-
nealing is sufficient for several iterations of gradient descent. Therefore, instead of
escaping from local optima, an alternative is the use of multi-restart local search ap-
proaches. These start from a new initial state once a local search step has terminated
in a local optimum. Multistart approaches are known to outperform other strategies on some problems; for instance, they outperform simulated annealing on the Travelling Salesperson Problem (TSP) [19].
The performance of any local search procedure depends on the starting state, and
multi-restart local search algorithms start from a randomly chosen state. None of the
above mentioned approaches exploit knowledge of the space that has been explored
so far, to guide further search. In other words, their search strategy does not evolve
with time. A question that comes to mind is whether, based on some knowledge
collected about the function, it is possible to generate a start state that is better than
a random one. If the answer is in the affirmative, then successive iterations will lead
us closer to the global minimum.
Evolutionary algorithms like Particle Swarm optimization (PSO), Genetic Al-
gorithms (GA) and Ant Colony optimization (ACO), are distributed iterative search
algorithms, which indirectly use some form of information about the space explored
so far, to direct search. Initially, there is a finite number of “agents” that search for
the global optimum. The paths of these agents are dynamically and independently
updated during the search based on the results obtained till the current update.
PSO, developed by Kennedy and Eberhart [21], is inspired by the flocking behav-
ior of birds. In PSO, particles start search in different regions of the solution space,
and every particle is made aware of the best local optimum amongst those found
by its neighbors, as well as the global optimum obtained up to the current iteration.
Each particle then iteratively adapts its path and velocity accordingly. The algorithm
converges when a particle finds a highly desirable path to proceed on, and the other
particles effectively follow its lead.
Genetic Algorithms [12] are motivated by the biological evolutionary opera-
tions of selection, mutation and crossover. In real life, the fittest individuals tend
to survive, reproduce and improve over generations. Based on this, “chromosomes”
that yield better optima are considered to correspond to fitter individuals, and are
used for creating the next generation of chromosomes that hopefully lead us to bet-
ter optima. The population of chromosomes is updated till convergence, or until a
specified number of updates is completed.
Ant colony optimization [10] mimics the behavior of a group of ants following
the shortest path to a food source. Ants (agents) exchange information indirectly,
through a mechanism called “stigmergy”, by leaving a trail of pheromone on the
paths traversed. States believed to be good are marked by heavier concentrations
of pheromone to guide the ants that arrive later. Therefore, decisions that are taken
subsequently get biased by previous decisions and their outcomes.
Some heuristic techniques use an alternate approach to guide further search by
application of machine learning techniques on past search results. Machine learning
techniques help in discovering relationships by analyzing the search data that other
techniques may ignore. If any relationship exists, then it could be exploited to re-
duce search time or improve the quality of optima. For this task, some papers try
to understand the structure of the search space, while others try to tune algorithms
accordingly (cf. [4] for a survey of these algorithms).
Boyan used information of the complete trajectories to the local minima and the
corresponding value of the local minima reached, to construct evaluation functions
[8], [9]. The minimum of the evaluation function determined a new starting point.
Optimal solutions were obtained for many combinatorial optimization problems like
bin packing, channel routing, etc.
Agakov et al. [3] gave a compiler optimization algorithm that trains on a set of computer programs and predicts which parts of the optimization space are likely to give large performance improvements. Boese et al. [6] explored the
use of local minima to adapt the optimization algorithm. For graph bisection and the
TSP, they found a “big valley” structure to the set of minima. Using this information
they were able to hand code a strategy to find good starting states for these problems.
Is this possible for other problems as well? The proposed work is motivated by the question: for any general global optimization problem, is there a structure to the set of local optima? If so, can it be learnt automatically through the use of machine learning?
We propose a new algorithm for the general global optimization problem, termed
as Global Optimization using Support vector regression based Adaptive Multistart
(GOSAM). GOSAM attempts to learn the structure of local minima based on local
minima discovered during earlier local searches. GOSAM uses Support Vector Ma-
chine based learning to learn a Fit function (regressor) that passes through all the
local minima, thereby learning the structure of the locations of local minima. Since
the regressor can only learn the structure of the local minima encountered till the
present iteration, the idea is that the regressor for local minima will generalize well
to the local minima not obtained so far in the search. Consequently, its minimum
would be a ‘crude approximation’ to the global minimum of the objective function.
In the next iteration the search for the local minimum is started from the minimum of
this regressor. The new local minimum obtained is added as a new training point and
the Fit function is re-constructed. Over time, this approximation gets better, leading
the search towards regions that yield better local minima and eventually the global
minimum. Surprisingly, for most problems this algorithm tends to direct search to
the region containing the global minimum in just a few iterations and is significantly
faster than other methods. The results reinforce our belief that many problems have
some ‘structure’ to the location of local minima, which can be exploited in direct-
ing further search. It is important to emphasize that GOSAM’s approach is differ-
ent from approximating a fitness landscape; GOSAM attempts to predict how local
minima are distributed, and where the best one might lie. This turns out to be very
efficient in practice.
In this chapter, we wish to demonstrate the same by testing on many benchmark
global optimization problems against established evolutionary methods. The rest of
the chapter is organized as follows. Section 6.2 discusses the proposed algorithm.
Section 6.3 is devoted to GOSAM’s performance on benchmark optimization prob-
lems, as well as a comparison with GA and PSO. Section 6.4 extends the algorithm
for constrained optimization problems. Section 6.5 demonstrates how the algorithm
may be applied to design optimization problems. Section 6.6 is devoted to a general
discussion on the convergence of GOSAM to the global optimum, while Section 6.7
contains concluding remarks.

6.2 Global Optimization with Support Vector Regression Based Adaptive Multistart (GOSAM)
The motivation of the proposed algorithm is to use information about the local minima encountered in earlier steps to predict the location of other, better minima. We denote the objective function to be minimized by f(x), where x = (x1, . . . , xn) is an n-dimensional vector of variables. We assume that the lower and upper bounds of each of these variables are known. For an unconstrained optimization problem, the
feasible region is the complete search space that lies within the lower and upper
bounds of all variables.
We now summarize the flow of the GOSAM algorithm. At each iteration, the algorithm performs a local search1, starting from a location termed the start-state, and then determines the start-state for the next iteration.

GOSAM Algorithm for Minimization of a Multivariate Function


1. Initialize start-state to an initial guess, possibly chosen randomly, lying in the
search space.
2. Starting from start-state, obtain a locally optimal solution using a local search
procedure. Term this solution as current-local-optimum.
3. Store current-local-optimum (x∗i , ∀i = 1 , . . . , n) and the corresponding function
value ( f (x∗ )) in the training set.
4. Apply Support Vector Regression treating all the local optima collected so far, as
the independent variables and their corresponding function values as the target
values. The regressor obtained will be called the current Fit function.
5. Obtain the minimum of the current Fit function using a local search procedure.
6. Set start-state to the minimum obtained in Step 5. If this minimum is out of bounds, or is the same as that obtained in the previous r runs, set the start-state to a random one.
7. If the termination criteria have been met, proceed to Step 8. Otherwise, go to
Step 2.
8. Return the best local minimum obtained.
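The steps above can be sketched in a few dozen lines. The sketch below is ours, not the authors' MATLAB implementation: a derivative-free pattern search stands in for the local minimizer of Step 2, an ordinary least-squares quadratic fit stands in for the SVR regressor of Step 4, and, for reproducibility, the random restart of Step 6 is replaced by the midpoint of the largest unexplored gap. Function names, tolerances, and the test objective are illustrative.

```python
import math

# Objective from Section 6.3.1, used here only as a concrete test problem.
def f(x):
    return (abs(x) - 10.0) * math.cos(2.0 * math.pi * x)

LO, HI = -10.0, 10.0

def local_search(x, step=0.1, tol=1e-6):
    """Derivative-free pattern search; a stand-in for the local minimizer of Step 2."""
    x = min(max(x, LO), HI)
    while step > tol:
        for cand in (x + step, x - step):
            if LO <= cand <= HI and f(cand) < f(x):
                x = cand
                break
        else:
            step *= 0.5
    return x

def fit_quadratic(pts):
    """Least-squares fit y = a*x^2 + b*x + c; a stand-in for the SVR Fit of Step 4."""
    s = [sum(x ** k for x, _ in pts) for k in range(5)]      # sums of x^0..x^4
    t = [sum(y * x ** k for x, y in pts) for k in range(3)]  # sums of y, y*x, y*x^2
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    for i in range(3):                                       # Gauss-Jordan elimination
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        if abs(m[i][i]) < 1e-12:
            return 0.0, 0.0, 0.0                             # degenerate training set
        for r in range(3):
            if r != i:
                fac = m[r][i] / m[i][i]
                m[r] = [mr - fac * mi for mr, mi in zip(m[r], m[i])]
    return tuple(m[i][3] / m[i][i] for i in range(3))        # (a, b, c)

def gosam(iters, start=6.2):
    minima, best = [], None
    for _ in range(iters):
        x_star = local_search(start)                         # Steps 2-3
        if all(abs(x_star - x) > 1e-6 for x, _ in minima):
            minima.append((x_star, f(x_star)))
        if best is None or f(x_star) < f(best):
            best = x_star
        start = None
        if len(minima) >= 3:                                 # Steps 4-6
            a, b, _ = fit_quadratic(minima)
            if a > 1e-9:
                cand = -b / (2.0 * a)
                if LO <= cand <= HI and all(abs(cand - x) > 1e-6 for x, _ in minima):
                    start = cand
        if start is None:                                    # deterministic restart
            xs = sorted([LO, HI] + [x for x, _ in minima])
            i = max(range(len(xs) - 1), key=lambda j: xs[j + 1] - xs[j])
            start = 0.5 * (xs[i] + xs[i + 1])
    return best
```

On the one-dimensional wave function of Section 6.3.1, this sketch reaches the global minimum at x = 0 within a handful of iterations, mirroring the behaviour described for GOSAM itself.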
Initially, we generate a feasible random state and assign it to start-state. In Step 2,
we perform local search on the objective function starting from the start-state. The
search terminates at a local minimum. We store the local minimum and the corre-
sponding value of the objective function in an array. This constitutes our training
data. We treat local minima as data points and their corresponding function values
as target values. In Step 4, we use Support Vector Regression (SVR), to perform
regression on the training data comprising the local minima obtained up to the current iteration. In general, fitting the local minima with a linear SVR regressor would
incur a large error. Nonlinear regression can be achieved with a wide choice of
kernel functions, and choice of the kernel will also impact the number of function
evaluations required to reach the global optimum. In all our experiments, we chose
the most commonly used second-degree polynomial kernel, also termed a quadratic kernel, primarily to simplify computation. It also facilitates Step 5 of the GOSAM
algorithm, which requires minimization of the SVR regressor. For a polynomial ker-
nel of degree 2, the problem to be minimized is a quadratic one, which can be solved
efficiently.
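Step 5 is cheap for the quadratic kernel k(x, z) = (x·z + 1)² because the decision function expands into an explicit quadratic form in x. A one-dimensional sketch follows; the support vectors, dual coefficients, and bias below are made up purely for illustration (a trained SVR would supply them).

```python
# Hypothetical support vectors, dual coefficients beta_i, and bias rho;
# illustrative values only, not output of a real SVR solver.
support = [-6.0, -1.0, 4.0]
beta = [0.02, 0.05, 0.03]
rho = -9.0

def fit_value(x):
    # quadratic-kernel regressor: g(x) = sum_i beta_i * (x*x_i + 1)^2 + rho
    return sum(b * (x * xi + 1.0) ** 2 for b, xi in zip(beta, support)) + rho

# Expanding the kernel gives g(x) = A*x^2 + B*x + C, so the minimum of the
# Fit function is available in closed form whenever A > 0:
A = sum(b * xi ** 2 for b, xi in zip(beta, support))
B = 2.0 * sum(b * xi for b, xi in zip(beta, support))
C = sum(beta) + rho
x_min = -B / (2.0 * A)
```

In higher dimensions the same expansion yields a quadratic form whose stationary point is found by solving a small linear system, which is why minimizing the Fit function remains inexpensive.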
We found that GOSAM is not handicapped by the choice of kernel, and that a
quadratic kernel worked well over a wide range of problems. All the local minima
1 To the best of our knowledge, there is no restriction on the choice of local search procedure
used.
obtained till the current iteration are treated as training samples, and their corre-
sponding function values as the target values. The regressor obtained in Step 4,
termed as the current Fit function, approximates local minima of the objective func-
tion. In the limit that all local minima are known, SVR will construct a regressor
that passes through all local minima of the objective function. The global minimum
of this function would then correspond to the global minimum of the original objec-
tive function. If we knew all the local minima, then regression would not be required, as one could simply pick the best of them. We utilize only the information
of the few local minima obtained through local search till the current iteration. We
then rely on the excellent generalization properties of SVRs to predict how the local
minima are distributed. Search is redirected to the region containing the minimum
of the regressor or the Fit function. Because of the limited size of the training set,
this regressor will not be an exact approximation of the local minima of the objective
function. However, over successive iterations, the Fit function tends to better local-
ize the global minimum of the function being optimized. This is demonstrated by
the experiments presented in section 6.3, that show that the ‘predictor’ turns out to
be so good that search terminates in the global minimum within very few iterations.
Apart from the generalization ability of SVRs, which is imperative in predicting
better starting points and finding the global optimum quickly, the choice of using
SVR for function approximation is also motivated by the fact that the regressors ob-
tained using SVR are generally very simple and can be constructed by using only a
few support vectors. Since minimization of the Fit function requires evaluating it at
several points, the use of only the support vectors contributes to computational ef-
ficiency. Regardless of the complexity of the kernel used, the optimization problem
that needs to be solved remains a convex quadratic one, because only a kernel matrix
that contains an inner product between the points is required. The meagre amounts
of data to be fit, i.e. the small number of local minima and their corresponding func-
tion values, also contribute to making the process fast and efficient.
In Step 5, we minimize the Fit function obtained and reset the start-state to its minimum.
If the local minimum obtained from this start-state is the same as the one obtained
in the previous r iterations, or out of bounds, then we conclude that the search has
become too localized, and needs to explore other regions to discover new minima.
In such a case, we reset the start-state to a random state.

6.3 Experimental Results


GOSAM was encoded in MATLAB and run on an 800 MHz Pentium III PC with
256 MB RAM. For all our test cases, we used the local minimizer of LINDO API
[23]. We tested GOSAM on a number of difficult global optimization benchmark
problems, which are available online at [11, 24]2 . Most of the benchmark problems
available in [11, 24] are parameterized in the number of variables, and can thus
2 The websites also include visualizations of the two dimensional examples, a discussion of
why each of these problems is difficult, and a mention of the estimated number of local
minima of the benchmarks.
be extended to any dimension. We first illustrate the working of GOSAM on one- and two-variable examples that possess several local minima. These examples will help visualize GOSAM's working. We then discuss results for higher dimensional benchmarks.

6.3.1 One Dimensional Wave Function


Figures 6.1 through 6.4 show the objective function f(x) = (|x| − 10) cos(2πx) (taken from [8]) as a dotted wave. The bounds for this problem were taken to be x = −10.0 to x = 10.0. As seen in Figs. 6.1–6.4, the objective function has several local minima, but the global minimum lies at x = 0. We now show how GOSAM finds the global minimum for this example.
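This objective is easy to reproduce; a minimal sketch (our code, not from the chapter):

```python
import math

def f(x):
    # objective of this example: f(x) = (|x| - 10) * cos(2*pi*x)
    return (abs(x) - 10.0) * math.cos(2.0 * math.pi * x)

# Local minima sit near the integers, where cos(2*pi*x) = 1 and the
# envelope |x| - 10 deepens towards the global minimum f(0) = -10.
```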


Fig. 6.1 Iteration 1 for the global minimization of the objective function f (x) = (|x| −
10)cos(2π x). Local search started from a random start state given by x = -1.3444 (indicated
by the circle) and terminated at the local minimum x = -1.0 where f (x) = -9.0. Using only
this one local minimum in the training set, the regressor obtained till the end of iteration 1 is
shown by the solid line

The initial randomly chosen starting state is x = -1.3444. This is shown as the
circled point in Fig. 6.1. Local search from this point led to the local minimum at
x = -1.0, indicated by a square in Fig. 6.1. At this point, the objective function has a
value of f (−1.0) = -9.0. Using only one local minimum in the training set, the SVR
regressor that was obtained is shown by the solid line parallel to the x-axis. Since
this regressor has a constant function value, its minimum is the same everywhere;
therefore, any random point can be selected as the minimum. In our simulation,
the random point returned was x = -6.3. Local search from this point terminated at
the local minimum x = -6.0. The regressor obtained using these two points led to
a minimum at the boundary. In cases when the minimum is at a boundary, we find
that one can start the next local search from either the boundary point, or from
a random new starting point. The search for the global optimum was not ham-
pered by either choice. However, the results reported here are based on a random
restart in such cases. In this simulation, search was restarted from a random point at
x = 4.483.
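The first two local searches can be reproduced with a simple derivative-free descent; the routine below is our stand-in for the LINDO local minimizer, with illustrative step and tolerance values.

```python
import math

def f(x):
    return (abs(x) - 10.0) * math.cos(2.0 * math.pi * x)

def descend(x, step=0.1, tol=1e-6):
    # shrink-step descent: move downhill while possible, else halve the step
    while step > tol:
        if f(x + step) < f(x):
            x += step
        elif f(x - step) < f(x):
            x -= step
        else:
            step *= 0.5
    return x

x1 = descend(-1.3444)   # iteration 1: ends near the local minimum at x = -1
x2 = descend(-6.3)      # iteration 2: ends near the local minimum at x = -6
```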


Fig. 6.2 Iteration 3 for the global minimization of the objective function f (x) = (|x| −
10)cos(2π x). Local search started from a random start state given by x = 4.483 (indicated
by the circle) and terminated at the local minimum at x = 4.0. The regressor obtained using
the three points, depicted by squares, is shown as the solid concave curve. The minimum of
this curve lies at x = -0.8422

In the third iteration, shown in Fig. 6.2, local search is started from x = 4.483, de-
picted by a circle. The local minimum was found to be at x = 4.0, and is depicted by
a square in the figure. When the information of these three local minima was used,
the SVR regressor shown as the solid concave curve was obtained. The minimum of
this curve lies at x = -0.8422.
The start state for iteration 4 was given by the minimum of the regressor obtained
in the previous iteration, given by x = -0.8422. This point is depicted as a circle in
Fig. 6.3. The local minimum obtained from this starting state is again depicted as the
square at the end of the slope. The regressor obtained using these four local minima
is shown as a bowl shaped curve, the minimum of which is located at x = −0.1130.
In the next iteration, depicted in Fig 6.4, local search from x = −0.1130, depicted
by a circle, led us to the global minimum at x = 0.0, depicted by a square in the
figure.

Fig. 6.3 Iteration 4 for the global minimization of the objective function f (x) = (|x| −
10)cos(2π x). Local search started from the minimum of the regressor obtained in the
previous iteration, given by x = -0.8422 (indicated by the circle). It terminated at the local
minimum at x = -1.0. The regressor obtained using the four local minima obtained till the
current iteration, depicted by squares, is shown as the solid convex shaped curve. The mini-
mum of this curve lies at x = −0.1130


Fig. 6.4 Iteration 5 for the global minimization of the objective function f (x) = (|x| −
10)cos(2π x). Local search started from the start state given by the minimum of the regressor
obtained at the end of the previous iteration, given by x = -0.1130 (indicated by the circle)
and terminated at the local minimum at x = 0.0 where f (x) = -10.0. The regressor obtained
using all the local minima obtained, depicted by squares, is shown as the solid convex curve
6.3.2 Two Dimensional Case: Ackley’s Function


Ackley’s function is a multimodal benchmark optimization problem, that is widely
used for testing global optimization algorithms. The n-dimensional Ackley’s
function is given by
\[
f(x) \;=\; -a\,\exp\!\left(-b\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}\,\right)
\;-\; \exp\!\left(\frac{1}{n}\sum_{i=1}^{n}\cos(c\,x_i)\right) \;+\; a \;+\; e,
\]

where a = 20, b = 0.2, and c = 2π. Its global minimum is located at xi = 0, ∀i = 1, 2, . . . , n, with the function value f(0) = 0. For the purpose of illustration, we
consider the two dimensional Ackley’s function.
As seen in Fig. 6.5, Ackley’s function has a large number of local minima that
hinder the search for the global minimum.
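The definition above translates directly into code; a sketch for arbitrary dimension n (our implementation, using the constants a, b, c given in the text):

```python
import math

def ackley(x, a=20.0, b=0.2, c=2.0 * math.pi):
    # n-dimensional Ackley function with the constants given in the text
    n = len(x)
    mean_sq = sum(xi * xi for xi in x) / n
    mean_cos = sum(math.cos(c * xi) for xi in x) / n
    return -a * math.exp(-b * math.sqrt(mean_sq)) - math.exp(mean_cos) + a + math.e
```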

Fig. 6.5 Ackley’s function. A huge number of local minima are seen that obstruct the search
for the global minimum at (0, 0)

Figures 6.6 through 6.8 show the plots of the regressor function obtained after
iterations 2, 3, and 4, respectively. Note that though Figures 6.7 and 6.8 look
similar, there is a difference in the locations of their minima. The minimum of the
bowl shaped Fit function of Fig. 6.8, when used as the start state for next local
minimization procedure, led to the global minimum of Ackley’s function.
Fig. 6.6 Regressor obtained after iteration 2, while optimizing Ackley’s function

Fig. 6.7 Regressor obtained after iteration 3, while optimizing Ackley’s function

6.3.3 Comparison with PSO and GA on Higher Dimensional Problems
Particle Swarm Optimization (PSO) and Genetic Algorithms (GA) are evolutionary
techniques that also use information revealed during search to generate new search
points. We compare our algorithm with both these approaches on several global op-
timization benchmark problems, ranging in dimension (number of variables n) from
Fig. 6.8 Regressor obtained after iteration 4, while optimizing Ackley’s function. Local
search starting from the minimum of this Fit function led to the global minimum

2 to 100. The Particle Swarm optimization toolbox was obtained from [30], while
the Genetic Algorithm optimization toolbox (GAOT) is the one available at [16].
The next start-state in PSO and GA is obtained by simple mathematical or logical
operations, whereas for GOSAM it is generated after determining the SVR followed
by minimization of a quadratic problem. Therefore, an iteration of GOSAM takes
more time than an iteration of either of these algorithms. Moreover, GA and PSO
run a number of agents in parallel, whereas the current implementation of GOSAM
is a sequential one. However, the difference in the number of function evaluations
required is so dramatic that GOSAM always found the global minimum significantly
faster.
In all our experiments, we evaluated the three algorithms on three different per-
formance criteria. The first criterion is the number of function evaluations required
to reach the global optimum. The second criterion is the number of times the global
optimum is reached in 20 runs, each from a randomly chosen starting point. The
third measure is the average CPU time taken to reach the global optimum.
Table 6.1 presents the results obtained. Each value indicated in the table is the av-
erage over 20 runs of the corresponding algorithm. For each run, the initial start
state of all the algorithms was the same randomly chosen point. The reported re-
sults have been obtained on a PC with a 1.6 GHz processor and 512 MB RAM. The
first row in the evaluation parameter for each benchmark function (Fn. Evals.) gives
the average number of function evaluations required by each algorithm to find the
global optimum. The number of times that the global optimum was obtained out of
the 20 runs is given in the second row (GO Obtd.). If the global minimum was not
obtained in all runs, the average value and the standard deviation of the best
optima obtained over all the runs are given in parentheses. The third
row (T (s)) indicates the average time taken in seconds by each algorithm in a run.
Though any number of local minima may be used for building the predictor, we
used a maximum of 100 local minima: the 101st local minimum overwrote the 1st,
and so on. In each case, the search was confined to lie within the box [−10, 10]^n,
where n is the dimension. In all our experiments, we used
the conventional SVR framework [14]. Techniques such as SMO [28] or online
SVMs [7] could be used to speed up the training process further. Our focus
in this work is to show the use of machine learning techniques to help predict the
location of better local minima.
The parameters for GA and PSO (c1 = 2, c2 = 2, c3 = 1, chi = 1, and swarm
size = 20) were kept at their default values. For GOSAM, the SVR parameters were
taken to be ε = 10^{-3} and C = 1000, and the kernel was a degree-two polynomial
kernel with t = 1.
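To make the iteration concrete, here is a minimal one-dimensional sketch of a GOSAM-style step. A plain least-squares quadratic stands in for the SVR with a degree-two polynomial kernel, so this is an illustration of the idea rather than the authors' implementation; the function name and restart logic mirror the rules described in the text (restart when the Fit function's minimizer is unusable).

```python
import numpy as np
from collections import deque

def next_start(minima_x, minima_f, bounds=(-10.0, 10.0), rng=None):
    """Fit a quadratic 'Fit function' to the local minima found so far and
    return its minimizer as the next start state; restart from a random
    point when the fit has no minimum or its minimizer leaves the box."""
    rng = rng or np.random.default_rng()
    a, b, _ = np.polyfit(minima_x, minima_f, deg=2)  # least-squares quadratic
    lo, hi = bounds
    if a <= 0:                        # concave fit: no minimizer, restart
        return float(rng.uniform(lo, hi))
    x_star = -b / (2.0 * a)           # minimizer of a x^2 + b x + c
    if not lo <= x_star <= hi:        # out of bounds: restart randomly
        return float(rng.uniform(lo, hi))
    return float(x_star)

# GOSAM retains at most 100 local minima; the 101st overwrites the 1st,
# i.e. a circular buffer:
minima = deque(maxlen=100)
```

For local minima of f(x) = (x − 1)², e.g. the points (−2, 9), (0, 1), (3, 4), the fitted quadratic is exact and the predicted next start is x = 1.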
Table 6.1 shows that GOSAM consistently outperforms both PSO and GA by a
large margin. The difference is highlighted dramatically in higher dimensions.
Finding the global minimum becomes increasingly difficult as the dimension n in-
creases; PSO and GA fail to find the global optimum in many cases, despite a large
number of function evaluations. However, GOSAM always found the global mini-
mum after a relatively small number of function evaluations (the count for function
evaluation for GOSAM also includes the number of times the objective function
was evaluated during local search). We believe that this result is significant, because
it shows that GOSAM scales very effectively to large dimensional problems. The
experimental results strikingly demonstrate that GOSAM not only finds the global
optimum consistently, but also does so with significantly fewer function evaluations.

6.4 Extension to Constrained Optimization Problems


Constrained optimization problems are usually solved by solving related uncon-
strained problems, which are obtained through the use of penalty or barrier func-
tions. We take recourse to Sequential Unconstrained Minimization Techniques
(SUMTs), which we briefly review.

6.4.1 Sequential Unconstrained Minimization Techniques


SUMTs comprise a class of non-linear programming methods that solve a
sequence of unconstrained optimization tasks. Given a problem of the form

Minimize a(x)                                                    (6.1)

subject to the constraints

g_i(x) ≤ 0,  for i = 1, . . . , M,                               (6.2)



Table 6.1 Comparison of GOSAM with PSO and GA on Difficult Benchmark Problems

N    Benchmark        Evaluation   GOSAM     PSO                    GAOT
     function         parameter
2    Ackley           Fn. Evals.   122.75    12580.0                2202.75
                      GO Obtd.     20        20                     20
                      T (s)        0.02970   0.535524               0.677470
2    Rastrigin        Fn. Evals.   129.5     23037.0                2198.15
                      GO Obtd.     20        20                     20
                      T (s)        0.0328    0.913196               0.662956
2    Griewangk        Fn. Evals.   108.95    91824.0                1180.15
                      GO Obtd.     20        20                     19‡ (0.0057 ± 2.5e-5)
                      T (s)        0.02265   3.758419               0.357913
2    Rotated Hyper    Fn. Evals.   24.35     22542.0                647.75
     Ellipsoid        GO Obtd.     20        20                     20
                      T (s)        0.0078    0.972250               0.236874
2    Rosenbrock’s     Fn. Evals.   105.05    49702.00               10600.75
     Valley           GO Obtd.     20        20                     20
                      T (s)        0.0148    2.046909               3.19688
2    Schwefel         Fn. Evals.   37.4      24000                  866.30
                      GO Obtd.     20        1†                     20
                      T (s)        0.01325   5.338205               0.268143
2    Branin’s Rcos    Fn. Evals.   81.85     52341.0                649.75
                      GO Obtd.     20        20                     20
                      T (s)        0.01795   2.371105               0.206017
2    Six Hump         Fn. Evals.   64.5      29892.0                643.95
     Camelback        GO Obtd.     20        20                     20
                      T (s)        0.01715   1.213340               0.199104
10   Ackley           Fn. Evals.   208.09    145746.0               11226.10
                      GO Obtd.     20        20                     20
                      T (s)        0.03516   6.636169               3.973002
10   Rastrigin        Fn. Evals.   298.65    300040.0               11219.25
                      GO Obtd.     20        0‡ (3.035 ± 2.054)     12‡ (0.4478 ± 0.4587)
                      T (s)        0.0398    12.854138              3.435049
10   Rotated Hyper    Fn. Evals.   46.9      105417.00              4766.35
     Ellipsoid        GO Obtd.     20        20                     20
                      T (s)        0.01255   4.995237               1.448820
10   Rosenbrock’s     Fn. Evals.   2177.90   300040.0               22917.3
     Valley           GO Obtd.     20        3‡ (1.9104 ± 1.2932)   3‡ (2.9247 ± 3.0229)
                      T (s)        0.04460   12.467250              7.737082
100  Ackley           Fn. Evals.   7437.4    300040.0               24708.40
                      GO Obtd.     20        0‡ (8.0094 ± 5.7940)   0‡ (1.6847 ± 0.2302)
                      T (s)        8.707     24.979903              14.027134
100  Rastrigin        Fn. Evals.   4931.5    2000040.00             36528.90
                      GO Obtd.     20        0 (486.3559 ± 737.8523)  0 (60.2946 ± 8.7735)
                      T (s)        7.615     128.840828             23.566207
100  Rotated Hyper    Fn. Evals.   431.5     292666.0               37001.70
     Ellipsoid        GO Obtd.     20        0‡ (0.3350 ± 1.0915)   0‡ (2.2683 ± 0.9803)
                      T (s)        0.0823    46.069183              29.993730
100  Rosenbrock’s     Fn. Evals.   8676.20   500040.0               61365.65
     Valley           GO Obtd.     20        0‡ (14324550.17 ± 1535813.439)  0‡ (357.198 ± 165.5572)
                      T (s)        0.89920   17.913865              36.662314

‡ The global optimum was not obtained in all the 20 runs. The value in the corresponding
parentheses indicates the mean and the standard deviation of the quality of global minima
obtained in the 20 runs.
† The global optimum obtained was not within the specified bounds.

where a(x) is the objective function, and g_i(x), for i = 1, . . . , M, are the M constraints.
One kind of SUMT, the quadratic penalty function method, minimizes a sequence
of functions (p = 1, 2, . . .) of the form

F_p(x) = a(x) + Σ_{i=1}^{M} α_{pi} [max(0, g_i(x))]² ,           (6.3)

where α_{pi} is a scalar weight and p is the problem number. The minimizer of the pth
problem in the sequence forms the guess or starting point for the (p + 1)th problem.
The scalar weights change from one problem to the next according to the rule α_{pi} ≥
α_{(p−1)i}; they are typically increased geometrically, by say 10%. These weights
indicate the relative emphasis placed on the constraints versus the objective function.
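As a concrete illustration, the sketch below applies the quadratic penalty method of (6.3) to a one-variable toy problem, minimizing a(x) = (x − 2)² subject to x − 1 ≤ 0. The inner minimizer is a simple ternary search (any local minimizer would do), the 10% geometric growth of the weight follows the rule above, and the function names and parameter values are illustrative assumptions.

```python
def minimize_penalty(a, g, alpha, lo, hi, iters=200):
    """Ternary search for the minimizer of F(x) = a(x) + alpha*max(0,g(x))^2
    on [lo, hi]; adequate here because F is unimodal for this toy problem."""
    F = lambda x: a(x) + alpha * max(0.0, g(x)) ** 2
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if F(m1) < F(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def sumt(a, g, lo, hi, alpha=1.0, growth=1.1, problems=120):
    """Solve a sequence of unconstrained problems (p = 1, 2, ...), growing
    the penalty weight geometrically (by 10% here) between problems."""
    x = None
    for _ in range(problems):
        x = minimize_penalty(a, g, alpha, lo, hi)
        alpha *= growth
    return x

# Minimize (x - 2)^2 subject to x - 1 <= 0: the iterates approach x = 1.
x = sumt(lambda x: (x - 2.0) ** 2, lambda x: x - 1.0, -10.0, 10.0)
```

Each intermediate minimizer lies slightly outside the feasible region (x = (2 + α)/(1 + α) in closed form) and converges to the constrained solution x = 1 as α grows.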
In the limit, as the constraint penalties become overwhelmingly large, the sequence
of minima of the unconstrained problems converges to a solution of the original
constrained optimization problem. We now illustrate the use of SUMT through the
application of GOSAM to the graph coloring problem.
Given a graph with a set of nodes or vertices, and an adjacency matrix D, the
Graph Coloring Problem (GCP) requires coloring each node or vertex so that no
two adjacent nodes have the same color. The adjacency matrix entry d_{ij} is 1 if
nodes i and j are adjacent, and 0 otherwise.
A minimal coloring requires finding a valid coloring that uses the least number of
colors. The GCP can be solved through an energy minimization approach. We used
an approach based on the Compact Analogue Neural Network (CANN) formulation
[17]. In this approach, an N-vertex GCP is solved by considering a network of N
neurons whose outputs denote the node colors. The outputs are represented by a set
of real numbers X_i, i = 1, 2, . . . , N; a color is not assumed to be an integer, as is
done conventionally.
The GCP is solved by minimizing a sequence (p = 1, 2, . . .) of functions of the
form

E = (A/2) Σ_{i=1}^{N} Σ_{j=1}^{N} (1 − d_{ij}) V_m ln cosh β(X_i − X_j)
  + (B_p/2) Σ_{i=1}^{N} Σ_{j=1}^{N} d_{ij} V_m ln [ cosh β(X_i − X_j + δ) cosh β(X_i − X_j − δ) / cosh² β(X_i − X_j) ]          (6.4)

In keeping with the earlier literature on neural network approaches to the GCP, we
term E in (6.4) an energy function.
The first term of equation (6.4) is present only for d_{ij} = 0, i.e., for non-adjacent
nodes. It is minimized when X_i = X_j, and therefore minimizes the
number of distinct colors used. The second term is minimized if the values of X_i and
X_j corresponding to adjacent nodes differ by at least δ. This term corresponds to the
adjacency constraint in the GCP, and becomes large as the problem sequence index
p increases. Nodes whose colors differ by less than δ are treated as nodes
with identical colors.
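A direct transcription of the energy (6.4) for real-valued colors might look as follows. The parameter values are illustrative defaults, not the chapter's settings (which are not reported for A, B_p, V_m, β, δ), and the denominator is read as cosh²β(X_i − X_j).

```python
import math

def gcp_energy(X, adj, A=1.0, B=1.0, beta=0.5, Vm=1.0, delta=1.0):
    """CANN-style energy of eq. (6.4) for real-valued node colors X.
    The first term pulls non-adjacent nodes toward equal colors; the
    second pushes adjacent nodes at least delta apart."""
    n = len(X)
    E = 0.0
    for i in range(n):
        for j in range(n):
            d = X[i] - X[j]
            if adj[i][j]:
                num = math.cosh(beta * (d + delta)) * math.cosh(beta * (d - delta))
                den = math.cosh(beta * d) ** 2
                E += (B / 2.0) * Vm * math.log(num / den)
            else:
                E += (A / 2.0) * Vm * math.log(math.cosh(beta * d))
    return E
```

On a single edge, separating the two colors by δ lowers the energy relative to giving both nodes the same color, while for a non-adjacent pair the reverse holds, which is exactly the behavior the two terms are designed to produce.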

We used GOSAM to minimize the energy functions corresponding to difficult
GCP benchmark problems [1], which have a large number of connections. Of these,
the Myciel instances are graphs based on the Mycielski transformation. These
graphs are difficult to solve because they are triangle free (clique number 2), yet
their coloring number increases with problem size. The “Huck” instance is a graph
in which each node represents a character in Twain’s “Huckleberry Finn”; two nodes
are connected by an edge if the corresponding characters encounter each other in the
book. In the “Games120” instance, the games played in a college football
season are represented by a graph in which the nodes represent the college teams,
and two teams are connected by an edge if they played each other during the season.
The energy functions for these problems are very complex and lead to extremely
hard global optimization problems. However, the constrained optimizer was able
to obtain the optimal coloring for each of these instances. Table 6.2 sums up the
results obtained. Note that the starting value of B and the amount by which B is
incremented in successive iterations both affect the time taken to reach the optimal
solution. One would like to start with a value of B that quickly takes the search into the
feasible region, which suggests that a large value of B would do the trick.
However, if B is taken to be too large, we might not be able to reach
the optimal solution. There is thus no obvious way to determine a good starting
value of B; instead it is based on educated guesses. A natural heuristic is that
for dense adjacency matrices a large value of B should be chosen, while a relatively
smaller value suffices for sparse adjacency matrices. Once the feasible
region is reached, the value of A can be increased slowly and cautiously (making
sure that the search does not exit the feasible region) until the optimal solution is reached. We defer
a more detailed discussion of this aspect, as it is beyond the scope of this chapter.

Table 6.2 Constrained optimization on benchmark GCP instances

Instance Nodes Edges Optimal coloring Best Solution Obtained Iterations required
Myciel3 11 20 4 4 3
Myciel4 23 71 5 5 5
Huck 74 301 11 11 8
Games120 120 638 9 9 10

6.5 Design Optimization Problems


Designers are usually confronted with the problem of finding optimal settings for a
large number of design parameters, with respect to several simulated product or pro-
cess characteristics. Problems of design and synthesis in the electronic domain are
generally constrained non-linear optimization problems. The principal characteris-
tics of these problems are very time consuming function evaluations and the absence
of derivative information. In most cases, evaluating the cost or objective function re-
quires a system simulation, and the function is rarely available in an analytical form.
In fact, the use of classical optimization techniques to give an optimal solution is

nearly impossible. For instance, VLSI design engineers carry out time-consuming
function evaluations by using circuit or other simulation tools, e.g. Spectre [2], and
choose a circuit with optimal component values. Since there are still many possible
design parameter settings and computer simulations are time consuming, it is
crucial to find the best possible design with a minimum number of simulations. We
used GOSAM to solve several circuit optimization problems. The interface between
the optimizer and the circuit simulators is shown in Fig. 6.9. Preliminary details of
this work were reported in [18].

Fig. 6.9 GOSAM’s interface with the circuit simulator Spectre. The optimizer sends updated
design variables to the interface, which updates the netlist and invokes Spectre to run a
simulation; Spectre writes the function value to an output file, which the interface reads and
returns to the optimizer

We initially start with values for the design variables that are provided by a de-
signer, or choose them randomly. Since there are no analytical formulae to compute
the output from the input design parameters, the function values are calculated using
a circuit simulator such as Spectre. The simulator writes the output value to a file,
which is read by the interface and returned to GOSAM. GOSAM then applies SVR
to the set of values obtained so far to determine the Fit function. The SVR yields a
smooth and differentiable regressor. GOSAM then computes the minimum of the Fit
function and sends it, as the vector of new design parameters, to the interface. A key
feature of this approach is that we can apply it even to scenarios where the objective
function is not available in analytical form or is not differentiable. A major bonus
is that examination of the Fit function yields information about what constitutes a
good design. We now briefly discuss a few interesting circuit optimization examples.
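The interface pattern of Fig. 6.9 — write the design, invoke the simulator as a black box, read the objective back from a file — can be sketched as below. The JSON file format, the file names, and passing them on the command line are assumptions for illustration only; Spectre's actual netlist and output formats differ.

```python
import json
import os
import subprocess
import tempfile

def evaluate_design(simulator_cmd, design):
    """Run one black-box evaluation: write the design variables to a file,
    invoke the (arbitrary) simulator command on it, and read back the
    objective value the simulator wrote to its output file."""
    with tempfile.TemporaryDirectory() as work:
        design_file = os.path.join(work, "design.json")
        output_file = os.path.join(work, "out.json")
        with open(design_file, "w") as f:
            json.dump(design, f)
        # The simulator is any executable that reads argv[1], writes argv[2].
        subprocess.run(list(simulator_cmd) + [design_file, output_file],
                       check=True)
        with open(output_file) as f:
            return json.load(f)["objective"]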

6.5.1 Sample and Hold Circuit


For a sample-and-hold circuit, the objective function was to hold the sampled value
as constant as possible during the hold period. The design variables are the widths
of 22 MOSFETs, along with the values of four capacitors named C1, C2, C3, and
C4. The transistor widths were constrained to lie between 250 nm and 1200 nm. Ca-
pacitor C3 was required to be between 1 fF and 5000 fF, while all other capacitors
were constrained to lie between 1 fF and 500 fF. Simulations show that the sampled

value is maintained well during the hold period. To date, numerous complex
VLSI circuits have been designed using GOSAM interfaced with the circuit
simulator Spectre. The chosen circuits include Phase Locked Loops (PLLs), a variety of
operational amplifiers, and filters. In these examples, transistor sizes and other com-
ponent values have been selected to optimize specified objectives such as jitter, gain,
phase margin, and power, while meeting specified constraints on other performance
measures as well as on transistor sizes.

Fig. 6.10 Response of the optimized Sample-and-Hold Circuit, showing output voltage
versus time. The goal was to keep the output constant during the hold period

6.5.2 Folded Cascode Amplifier


For a folded cascode amplifier, the design objective was to maximize the phase mar-
gin. The variables for the optimization task were taken to be the widths of 16 MOS-
FET transistors. The result, depicted in Figure 6.11, shows that GOSAM
obtained a maximum phase margin of 169.73°, as well as an excellent solution
with a phase margin of around 120°. An industry-level commercial tool found a
solution with a phase margin of around 59°.

6.6 Discussion
An important question relates to assumptions that may be implicitly or explicitly
made regarding the function to be optimized. We mentioned previously that any
local search mechanism could be used in conjunction with GOSAM. Figure 6.12
illustrates this with the help of a toy example. For the objective function shown
by the dashed curve in Fig. 6.12, the gradient cannot be computed to reach two of
the minima. A line search method is used in the outer triangular regions, while for
the parabolic region in the middle the gradient is available and a simple gradient
descent leads us to the local minimum. These three local minima, when used by
SVR to construct the regressor, yield the parabolic shaped solid curve of Fig. 6.12.
Local search starting from the minimum of this curve led to the global minimum.

Fig. 6.11 Phase margin versus iteration count for a folded cascode amplifier

Fig. 6.12 A toy example illustrating that any local minimizing procedure can be used with
GOSAM. The function is depicted as the dotted curve. For the outer triangular regions, the
gradient information cannot be used, so the local minima are found by a line search method.
However for the inner parabolic region, the local minimum can be found using gradient de-
scent. The regressor obtained is shown by the solid curve that passes through the local minima
obtained

In the worst case, GOSAM performs similarly to a random multistart. This is be-
cause whenever it is not possible to use the minimum of the Fit function (for exam-
ple, when it lies out of bounds, or when the previous two iterations returned almost
the same minimum), GOSAM restarts the search from a random state. In the
worst case it will therefore randomly explore the search space for new starting points. How-
ever, real applications never involve functions that are discontinuous everywhere,
and we have not encountered this worst case.

Fig. 6.13 A toy example to illustrate that the regressor for the objective function f(x), de-
noted as Fit of f(x), is smoother than f(x). Recursively, the Fit for the Fit of f(x) is smoother
than the Fit of f(x), and in the limit this leads to a convex underestimate of f(x)

Fig. 6.14 Testing: a web-based service. The client sends a request to the web server, which
invokes an instance of the GOSAM optimizer; the server requests function values at chosen
points, the client sends back the function values at the requested points, and the server returns
optimized points

Minimization of the regressor function is an essential step in GOSAM. In all
our experiments we used local search to accomplish this step. A natural question
is what happens if the Fit function itself turns out to have multiple
local minima. Such a situation is certainly possible, and is theoretically interesting.
An alternative approach that we suggest is to use GOSAM recursively. This idea is
intuitive because the regressor function, called the Fit function in Fig. 6.13, is smoother
than the objective function, as it is a smooth interpolation of only the local minima
of the objective function encountered earlier. Therefore, a Fit of the Fit function’s

local minima would be even smoother. This is depicted pictorially in Fig. 6.13,
which uses a hypothetical example to illustrate what the application of GOSAM
to f(x), and recursively to Fit functions, might achieve. The original function f(x)
has a number of minima. As can be seen, the number of minima decreases at each
step, the sequence of recursively computed Fit functions becomes increasingly
smooth, and the sequence terminates at a convex function that is related to the
double conjugate of the original function. However, local minimization of the Fit
function seems to be more than adequate, as is done in the present implementation.
It is possible to construct functions where GOSAM’s strategy will fail. For ex-
ample, it would be impossible to learn any structure from a function with a uniform
distribution of randomly located minima, or a function that is discontinuous almost
everywhere. However, on most problems of any practical interest, small perturbations
from a local minimum will lead us to another locally minimal configuration. This im-
plies that a learning tool can be used to predict locations of other minima from the
knowledge of only a few.

6.7 Conclusion and Future Work


In this chapter, we presented GOSAM, a fast and effective multistart global minimiza-
tion algorithm for solving optimization problems. GOSAM applies support vector
regression to the training set formed by previously discovered local minima, to
guide the search towards better local minima. This is different from approximating a fit-
ness landscape: GOSAM attempts to predict how local minima are distributed, and
where the best one might lie. A regressor that fits local minima is smoother than one
that tries to fit the original function. Approximating the fitness landscape requires
fitting all points and not just a few minima. The use of Support Vector Regression
allows only support vectors to be retained, and redundant information can be dis-
carded. Experimental results on large benchmarks show that GOSAM searches far
more efficiently, uses significantly fewer function evaluations, and finds the global
optimum more consistently than other state-of-the-art methods. The effectiveness of
GOSAM confirms that the generalizing ability of SVRs is very useful in predicting
where good local minima lie. We have also shown how GOSAM can be applied
to constrained tasks, as well as combinatorial optimization tasks such as graph
coloring, that are traditionally solved as integer programming problems. GOSAM
does not require the function to be known in terms of an analytical expression. It is
enough to have a black box that can evaluate the function at a chosen point. This
allows GOSAM to be interfaced to any such black box. We have presented results
in the VLSI domain, where GOSAM has been interfaced to a commercial circuit
simulator and used to optimize MOSFET sizes and component values to meet de-
sired objectives subject to specified constraints. The objectives are typically com-
plex, such as phase margin of a folded cascode amplifier, or jitter in a Phase Locked
Loop. A current version of GOSAM is equipped with a web interface that allows a
user to access it without revealing information about the function being optimized.
The set up of the web based service is shown in Fig. 6.14. As the figure illustrates,

only vectors and corresponding cost values are exchanged between the GOSAM
server and a client running a simulator or emulator. This allows GOSAM to be pro-
vided as a service across the web while protecting proprietary information about the
optimizer and the objective function.
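The exchange in Fig. 6.14 can be reduced to a minimal protocol: the server only ever proposes candidate vectors and receives cost values, so the client's objective stays private. A toy in-process version is sketched below; the transport (HTTP, sockets) is omitted, and a crude random search stands in for a GOSAM instance behind the web interface.

```python
import random

def server_optimize(ask_client, n_iters=200, lo=-10.0, hi=10.0):
    """Toy 'server': proposes vectors, learns only the cost values the
    client returns, and reports the best point seen. Stands in for an
    optimizer instance behind the web service."""
    best_x, best_f = None, float("inf")
    for _ in range(n_iters):
        x = [random.uniform(lo, hi)]
        f = ask_client(x)            # only vectors and costs cross over
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

# The client keeps its objective function private; the server never sees it.
secret_objective = lambda x: (x[0] - 3.0) ** 2
best_x, best_f = server_optimize(secret_objective)
```

The key design point is the narrow interface: `ask_client` could equally be an HTTP round trip to a remote simulator, and neither side needs the other's internals.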
Other aspects worthy of investigation include the use of different approaches to
SVR, such as online learning techniques, and parallelizing operations in GOSAM
to speed up search. Ongoing efforts include extending GOSAM to multi-objective
optimization tasks. GOSAM may be obtained from the authors for non-commercial
academic use on a trial basis.

Acknowledgements. The authors would like to thank Dr. R. Kothari of IBM India Re-
search Laboratory, Prof. R. Newcomb, University of Maryland, College Park, USA, and Prof.
S.C. Dutta Roy of the Department of Electrical Engineering, IIT Delhi, for their valuable
comments and a critical appraisal of the manuscript.

References
1. http://mat.gsia.cmu.edu/COLOR02/
2. http://www.cadence.com/products/custom ic/spectre/
index.aspx
3. Agakov, F., Bonilla, E., Cavazos, J., Franke, B., Fursin, G., O’Boyle, M., Thomson,
J., Toussaint, M., Williams, C.: Using machine learning to focus iterative optimisation.
In: Proceedings of the 4th Annual International Symposium on Code Generation and
Optimization (CGO), New York, NY, USA, pp. 295–305 (2006)
4. Baluja, S., Barto, A., Boese, K., Boyan, J., Buntine, W., Carson, T., Caruana, R., Davies,
S., Dean, T., Dietterich, T., Hazlehurst, S., Impagliazzo, R., Jagota, A., Kim, K., Mc-
Govern, A., Moll, R., Moss, E., Perkins, T., Sanchis, L., Su, L., Wang, X., Wolpert, D.:
Statistical machine learning for large-scale optimization. Neural Computing Surveys 3,
1–58 (2000)
5. Black, F., Litterman, R.: Global portfolio optimization. Financial Analysts Journal 48(5),
28–43 (1992)
6. Boese, K., Kahng, A.B., Muddu, S.: A new adaptive multi-start technique for combina-
torial global optimizations. Operations Research Letters 16(2), 101–113 (1994)
7. Bordes, A., Bottou, L.: The huller: A simple and efficient online SVM. In: Gama, J.,
Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI),
vol. 3720, pp. 505–512. Springer, Heidelberg (2005),
http://leon.bottou.org/papers/bordes-bottou-2005
8. Boyan, J.: Learning evaluation functions for global optimization. Phd dissertation, CMU
(1998)
9. Boyan, J., Moore, A.: Learning evaluation functions for global optimization and
boolean satisfiability. In: Proceedings of the Fifteenth National Conference on Arti-
ficial Intelligence, vol. 15, pp. 3–10. John Wiley and Sons Ltd., Chichester (1998),
http://www.cs.cmu.edu/˜jab/cv/pubs/boyan.stage2.ps.gz
10. Dorigo, M., Maniezzo, V., Colorni, A.: Ant system: Optimization by a colony of cooper-
ating agents. IEEE Transactions on Systems, Man, and Cybernetics-Part B 26(1), 29–41
(1996),
http://iridia.ulb.ac.be/˜mdorigo/ACO/publications.html

11. GEATbx: Genetic and evolutionary algorithm toolbox (1994),
http://www.geatbx.com/docu/fcnindex.html
12. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning.
Addison-Wesley Longman Publishing Co., Inc., Boston (1989)
13. Grossmann, I.E. (ed.): Global optimization in engineering design. Kluwer Academic
Publishers, Dordrecht (1996)
14. Gunn, S.: Support vector machines for classification and regression. Technical report,
Image Speech and Intelligent Systems Research Group, University of Southampton, UK
(1998), http://www.isis.ecs.soton.ac.uk/resources/svminfo/
15. Horst, R., Tuy, H.: Global Optimization: Deterministic Approaches. Springer, Berlin
(1993)
16. Houck, C., Joines, J., Kay, M.: A genetic algorithm for function optimization: A matlab
implementation. NCSU-IE TR 95-09 (1995),
http://www.ie.ncsu.edu/mirage/GAToolBox/gaot/
17. Jayadeva, Dutta Roy, S.C., Chaudhary, A.: Compact analogue neural network: A
new paradigm for neural based combinatorial optimisation. IEE Proc-Circuits Devices
Syst. 146(3) (1999)
18. Jayadeva, Shah, S., Chandra, S.: Learning to optimize VLSI design problems. In: INDI-
CON, pp. 1–5. IEEE, New Delhi (2006)
19. Johnson, D., McGeoch, L.: The travelling salesman problem: A case study in local opti-
misation. In: Local Search in Combinatorial Optimisation, pp. 215–310. John Wiley and
Sons, London (1997)
20. Kazerounian, K., Wang, Z.: Global versus local optimization in redundancy resolution of
robotic manipulators. The International Journal of Robotics Research 7(5), 2–12 (1988)
21. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of the IEEE
International Conference on Neural Networks, Perth, Australia, vol. 4, pp. 1942–1948
(1995)
22. Kirkpatrick, S., Gelatt, C., Vecchi, M.: Optimization by simulated annealing. Sci-
ence 220(4598), 671–680 (1983)
23. LINDO SYSTEMS Inc.: LINDO API User’s Manual (2002)
24. Madsen, K.: Test problems for global optimization,
http://www2.imm.dtu.dk/˜km/Test_ex_forms/test_ex.html
25. Mangasarian, O.: Nonlinear Programming. SIAM, Philadelphia (1994)
26. Moles, C., Mendes, P., Banga, J.: Parameter estimation in biochemical pathways: a com-
parison of global optimization methods. Genome Research 13, 2467–2474 (2003)
27. Neumaier, A.: Global optimization,
http://www.mat.univie.ac.at/˜neum/glopt/applications.html
28. Platt, J.: Fast Training of Support Vector Machines using Sequential Minimal Optimiza-
tion. In: Advances in Kernel Methods - Support Vector Learning, pp. 185–208. MIT
Press, Cambridge (1999)
29. Russel, S., Norvig, P.: Artificial intelligence: a modern approach. Prentice Hall, Engle-
wood Cliffs (1995)
30. Singh, J.: PSO algorithm toolbox (2003),
http://psotoolbox.sourceforge.net/
31. Wang, M., Yang, X., Sarrafzadeh, M.: Congestion minimization during placement. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems 19(10),
1140–1148 (2000)
Chapter 7
Multi-objective Optimization Using Surrogates

Ivan Voutchkov and Andy Keane

Abstract. Until recently, optimization was regarded as a discipline of rather theo-
retical interest, with limited real-life applicability due to the computational or ex-
perimental expense involved. Practical multiobjective optimization was considered
almost a utopia, even in academic studies, due to the multiplication of this ex-
pense. This paper discusses the idea of using surrogate models for multiobjective
optimization. With recent advances in grid and parallel computing, more companies
are buying inexpensive computing clusters that can work in parallel. This allows,
for example, efficient fusion of surrogates and finite element models into a multiob-
jective optimization cycle. The research presented here demonstrates this idea using
several response surface methods on a pre-selected set of test functions. We aim to
show that there are a number of techniques which can be used to tackle difficult prob-
lems, and we also demonstrate that a careful choice of response surface method is
important when carrying out surrogate-assisted multiobjective search.

7.1 Introduction
In the world of real engineering design, there are often multiple targets which man-
ufacturers are trying to achieve. For instance, in the aerospace industry, a general
problem is to minimize weight, cost and fuel consumption while keeping perfor-
mance and safety at a maximum. Each of these targets might be easy to achieve
individually. An airplane made of balsa wood would be very light and would have low
fuel consumption; however, it would not be structurally strong enough to perform at
high speeds or carry a useful payload. Also, such an airplane might not be very safe,
Ivan Voutchkov
University of Southampton, Southampton SO17 1BJ, United Kingdom
e-mail: iiv@soton.ac.uk
Andy Keane
University of Southampton, Southampton SO17 1BJ, United Kingdom
e-mail: ajk@soton.ac.uk

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 155–175.
springerlink.com c Springer-Verlag Berlin Heidelberg 2010
156 I. Voutchkov and A. Keane

i.e., robust to various weather and operational conditions. On the other hand, a solid
body and a very powerful engine will make the aircraft structurally sound and able
to fly at high speeds, but its cost and fuel consumption will increase enormously. So
engineers are continuously making trade-offs and producing designs that will sat-
isfy as many requirements as possible, while industrial, commercial and ecological
standards are at the same time getting ever tighter.
Multiobjective optimization (MO) is a tool that aids engineers in choosing the
best design in a world where many targets need to be satisfied. Unlike conventional
optimization, MO does not produce a single solution, but rather a set of solutions,
most commonly referred to as the Pareto front (PF) [12]. By definition it contains
only non-dominated solutions¹. It is up to the engineer to select a final design by
examining this front.
Over the past few decades, with the rapid growth of computational power, the fo-
cus in optimization algorithms has shifted from local approaches, which find an
optimal value with a minimal number of function evaluations, to more global
strategies, which are not necessarily as efficient as local searches but (some more
than others) promise to converge to global solutions; the main players are
various strands of genetic and evolutionary algorithms. At the same time, comput-
ing power has essentially stopped growing in terms of flops per CPU core. Instead,
parallel processing is an integral part of any modern computer system. Computing
clusters are ever more accessible through various techniques and interfaces such as
multi-threading, multi-core, Windows HPC, Condor, Globus, etc.
Parallel processing means that several function evaluations can be obtained at
the same time, which perfectly suits the ideology behind genetic and evolutionary
algorithms. For example, genetic algorithms are based on an idea borrowed from
biological reproduction, where the offspring of two parents inherit the best genes of
their parents but also undergo some mutation to maintain diversity. The offspring
produced in one generation represent a set of designs that can all be evaluated in
parallel. The fittest individuals survive and are copied into the next generation,
whilst weak designs are given only a small random chance of survival. Such
parallel search methods are conveniently applicable to multiobjective optimization
problems, where the fitness of an individual is measured by how close to the
Pareto front this design is. All individuals are ranked: those that are part of the
Pareto front get the lowest (best) rank, the next best have a higher rank, and so on.
Thus the multiobjective optimization is reduced to single-objective minimization of
the rank of the individuals. This idea was developed by Deb and implemented
in NSGA2 [5].
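The ranking idea can be sketched in a few lines of Python (an illustrative sketch only, not the NSGA2 reference implementation, which additionally uses crowding distances and a faster non-dominated sorting scheme):

```python
# Illustrative sketch: assigning Pareto ranks to a population by
# repeatedly peeling off successive non-dominated fronts (minimization).
def dominates(a, b):
    """True if objective vector a dominates b: no worse in all goals,
    strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_ranks(objs):
    """Return one rank per individual: 0 for the Pareto front,
    1 for the next front, and so on."""
    ranks = [None] * len(objs)
    remaining = set(range(len(objs)))
    rank = 0
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks

print(pareto_ranks([(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]))  # → [0, 0, 0, 1, 2]
```

The three mutually non-dominated points receive rank 0 and form the current Pareto front; minimizing this rank is the single-objective problem referred to above.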
In the context of this paper, the aim of MO is to produce a well-spread set
of optimal designs with as few function evaluations as possible. There are a number
of methods published and widely used to do this – MOGA, SPEA, PAES, VEGA,
NSGA2, etc. Some are better than others – generally the most popular in the litera-
ture are NSGA2 (Deb) and SPEA2 (Zitzler), because they are found to achieve good
results for most problems [2, 3, 4, 5, 6]. The first is based on genetic algorithms and
1 Non-dominated designs are those for which performance in any particular goal cannot
be improved without making performance in at least one other goal worse.
7 Multi-objective Optimization Using Surrogates 157

the second on an evolutionary algorithm, both of which are known to need many
function evaluations. In real engineering problems the cost of evaluating a design
is probably the biggest obstacle that prevents extensive use of optimization proce-
dures. In the multiobjective world, this cost is multiplied, because there are multi-
ple expensive results to obtain. Directly evaluating a finite element model can take
several days, which makes it very expensive to try hundreds or thousands of designs.

7.2 Surrogate Models for Optimization


It seems that increased computing power leads to increased hunger for even more
computing power, as engineers realise that they can run more detailed and realis-
tic models. In essence, from an engineering point of view, the available computing
power is never enough and this tendency does not seem to be changing at least in
the foreseeable future. To put these words into perspective: to be useful to an engi-
neering company, a modern optimization approach should be able to tackle a global
multiobjective optimization problem in about a week. The problem would typically
have 20-30 variables, 2-5 objectives, 2-5 constraints with evaluation times of about
12-48h per design and often per objective. Unless you have access to 5000-7000
parallel CPUs, the only way to currently tackle such problems is to use surrogate
models.
In the single objective world, approaches using surrogate models are fairly well
established and have proven to successfully deal with the problem of computational

Fig. 7.1 Direct search versus surrogate models for optimization


158 I. Voutchkov and A. Keane

expense (see Fig. 7.1) [22]. Since their introduction, more and more companies
have adopted surrogate assisted optimization techniques and some are making steps
to incorporate this approach in their design cycle as standard. The reason for this is
that instead of using the expensive computational models during the optimization
step, they are substituted with a much cheaper but still accurate replica. This makes
optimization not only useful, but usable and affordable. The key idea that makes
surrogate models efficient is that they should become more accurate in the region of
interest as the search progresses, rather than being equally accurate over the entire
design space, as an FE representation will tend to be. This is achieved by adding to
the surrogate knowledge base only at points of interest. The procedure is referred to
as surrogate update.
Various publications address the idea of surrogate models and multiobjective op-
timisation [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. As one would expect, no approxi-
mation method is universal. Factors such as function modality, number of variables,
number of objectives, constraints, computation time, etc., all have to be taken into
account when choosing an approximation method. The work presented here aims to
demonstrate this diversity and hints at some possible strategies to make best use of
surrogates for multi-objective problems.

7.3 Multi-objective Optimization Using Surrogates


To illustrate the basic idea, the zdt1 – zdt6 test function suite [3] will be used to be-
gin with. It is a good suite to demonstrate the effectiveness of surrogate models, as it
is fairly simple for response surface (surrogate) modelling. Fig. 7.2 represents the
zdt2 function and the optimisation procedure. The comparison strikingly demon-
strates the surrogate approach. The problem has two objective functions and two
design variables. The Pareto front obtained using surrogates with 40 function evalu-
ations is far superior to the one without surrogates and the same number of function
evaluations.

Table 7.1 Full function evaluations for ZDT2 – Fig. 7.2

Number of variables 2 5 10
Number of function evaluations without surrogates 2500 5000 10000
Number of function evaluations with surrogates 40 40 60

On the other hand 2500 evaluations without surrogates were required to obtain a
similar quality of Pareto front to the case with surrogates and 40 evaluations. The
difference is even more significant if more variables are added – see Table 7.1.
Here we have chosen objective functions with simple shapes to demonstrate the
effectiveness of using surrogates. Both functions would be readily approximated
using most available methods. It is not uncommon to have relationships of simi-
lar simplicity in real problems, although external noise factors could make them

Fig. 7.2 A (left) – Function ZDT2; B (right) – ZDT2 – Pareto front achieved in 40 evalua-
tions: Diamonds – Pareto front with surrogates; Circles – solution without surrogates

look rougher. Relationships of a higher order of multimodality would be more of a
challenge for most methods, as will be demonstrated later.

7.4 Pareto Fronts - Challenges


Depending on the search algorithm, the quality of the Pareto front could vary greatly.
There are various characteristics that describe a good quality Pareto front:
1. Spacing – better search techniques will space the points on the Pareto front
uniformly rather than producing clusters. See Fig. 7.3a
2. Richness – better search techniques will put more points on the Pareto front than
others. See Fig. 7.3b
3. Diversity – better search techniques will produce fronts that are spread out better
with respect to all objectives. See Fig. 7.3c
4. Optimality – better search techniques will produce fronts that dominate the fronts
produced by less good techniques. In test problems this is usually measured
as ‘generational distance’ to an ideal Pareto front. We discuss this later. See
Fig. 7.3d
5. Globality – the obtained Pareto front is global as opposed to local. As in
single objective optimization, in the multiobjective world it is also possible to
have local and global optimal solutions. This concept is demonstrated using the
F5 test function (a full description is given in section 7.7). Fig. 7.4
illustrates the function and the optimization procedure. Due to the sharp nature
of the global solution it cannot be guaranteed that with a small number of GA
evaluations, the correct solution will be found. Furthermore, since the surrogate is
based only on sampled data, if this data does not contain any points in the global
optimum area, then the surrogate will never know about its existence. Therefore
any optimization based only on such surrogates will lead us to the local solution.
Hence, conventional optimization approaches based on surrogate models rely
on constant updating of the surrogate. A widely accepted technique in single
objective optimization is to update the surrogate with its current optimal solution.
In multiobjective terms this will translate to updating the surrogate with one or
more points belonging to its Pareto front. If the surrogate Pareto front is local
and not global, then the next update will also be around the local Pareto front.
Continuing with this procedure the surrogate model will become more and more
accurate in the area of the local optimal solution, but will never know about the
existence of the global solution.
6. Robust convergence from any start design with any random number sequence. It
turns out that the success of a conventional multiobjective optimization based on
surrogates, using updates at previously found optimal locations strongly depends
on the initial data used to train the first surrogate before any updates are added.
If this data happens to contain points around the global Pareto front, then the
algorithm will be able to quickly converge and find a nice global Pareto front.
However, the odds are that the local Pareto fronts are smoother, easier-to-find
shapes, and in most cases this is where the procedure will converge unless suitable
global exploration steps are taken.
7. Efficiency and convergence – better search techniques will converge using fewer
function evaluations.

Fig. 7.3 Pareto front potential problems - (a) clustering; (b) too few points; (c) lack of diver-
sity; (d) non-optimality

Fig. 7.4 F5: Local and global solutions

7.5 Response Surface Methods, Optimization Procedure and Test Functions
In a previous publication [20] we have shown that for complex and high-dimensional
functions Kriging is the response surface method of choice [22]. We have also
stressed the importance of applying a high level of understanding when using Krig-
ing. There have been various publications that critique kriging, often due to a lack of
such understanding. We believe that if the user understands the strengths and weaknesses
of this approach it can become an invaluable tool, often the only one capable of
producing meaningful results in reasonable times.
Kriging is a response surface modelling (RSM) method, designed in the 1960s for
geological surveys [7]. It can be a very efficient RSM for cases where it is expensive to
obtain large amounts of data. A significant number of publications discuss the krig-
ing procedure in detail. An important factor in the success of the method is the tuning
of its hyper-parameters. It should be mentioned that researchers who have chosen
rigorous training procedures report positive results when using kriging, while pub-
lications that use basic training procedures often reject this method. Nevertheless,
the method is becoming increasingly popular in the world of optimization as it often
provides a surrogate with usable accuracy.
This method was used to build surrogates for the above test cases, therefore it is
useful to briefly outline its major pros and cons:

Pros:
• can always predict with no error at sample points,
• the error in close proximity to sample points is minimal,
• requires a small number of sample points in comparison to other response surface
methods,
• reasonably good behaviour with high dimensional problems.

Cons:
• for a large number of data points and variables, training of the hyper-parameters
and prediction may become computationally expensive.
Researchers should make a conscious decision when choosing Kriging for their
RSMs. Such a decision should take into account the cost of a direct function eval-
uation including constraints (if any), available computational power, and dimen-
sionality of the problem. Sometimes it might be possible to use kriging for one
of the objectives while another is evaluated directly, or a different RSM is used to
minimize the cost.
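To make these pros and cons concrete, here is a minimal Gaussian-process-style kriging sketch with a fixed squared-exponential correlation (the theta value, nugget, and test data are all illustrative; in practice the hyper-parameters would be tuned rigorously, as stressed above):

```python
import numpy as np

# Minimal kriging-style sketch with a fixed squared-exponential correlation.
# theta = 5.0 is an illustrative, untuned hyper-parameter.
def kernel(A, B, theta=5.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-theta * d2)

def krig_fit(X, y, nugget=1e-10):
    # O(n^3) Cholesky factorisation: this is the training cost that grows
    # quickly with the number of data points (the "con" above)
    K = kernel(X, X) + nugget * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return X, L, alpha

def krig_predict(model, Xs):
    X, L, alpha = model
    k = kernel(X, Xs)                       # cross-correlations
    mu = k.T @ alpha                        # predicted mean
    v = np.linalg.solve(L, k)
    var = np.clip(1.0 - (v ** 2).sum(0), 0.0, None)
    return mu, np.sqrt(var)                 # mean and RMSE estimate

X = np.array([[0.0], [0.5], [1.0]])
y = np.sin(np.pi * X[:, 0])
mu, rmse = krig_predict(krig_fit(X, y), X)  # at sample points: exact, ~zero error
```

At the sample points the predictor interpolates exactly and the RMSE estimate is essentially zero, illustrating the first two "pros"; the RMSE grows away from the data, which is exploited by the update strategies discussed below.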
As this paper aims to demonstrate various approaches to making better use of
surrogate models, we will use kriging throughout, but most conclusions could be
generalised for other RS methods as well. The chosen multiobjective algorithm is
NSGA2. Other multiobjective optimizers might show slightly different behaviour.
The basic procedure is as follows:

1. Carry out 20 LPtau [8] spaced initial direct function evaluations.
2. Train hyper-parameters, using a combination of GA and DHC
(dynamic hill climbing) [23].
3. Choose a selection of update strategies with a specified number of updates.
4. Search the RSMs using each of the selected methods.
5. Select designs that are best in terms of ranking and space-filling properties.
6. Evaluate the selected designs and add them to the data set.
7. Produce the Pareto front and compare with the previous one. Stop if 2-3
consecutive Pareto fronts are identical; otherwise continue.
8. If the Pareto front contains too many points, choose a specified
number of points that are furthest away from each other.
9. Repeat from step 2.
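A minimal sketch of this loop, with stand-ins for the expensive parts (the "expensive" objectives are a cheap ZDT1-style pair, the LPtau DOE is replaced by pseudo-random sampling, and the trained kriging models by nearest-neighbour prediction; all names and settings are illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def expensive_objectives(x):
    # Stand-in for a costly simulation: a ZDT1-style pair of objectives, n = 2
    g = 1.0 + 9.0 * x[1]
    return np.array([x[0], g * (1.0 - np.sqrt(x[0] / g))])

def nondominated(F):
    # Indices of the mutually non-dominated rows of F (minimization)
    return [i for i, f in enumerate(F)
            if not any(np.all(g <= f) and np.any(g < f)
                       for j, g in enumerate(F) if j != i)]

# Step 1: initial space-filling DOE (pseudo-random here; the chapter uses LPtau)
X = rng.random((20, 2))
Y = np.array([expensive_objectives(x) for x in X])

for it in range(5):                                   # update iterations
    # Steps 2-4: train surrogates and search them. Stand-in: score a large
    # candidate pool by nearest-neighbour "prediction" from the data set.
    C = rng.random((200, 2))
    pred = np.array([Y[np.argmin(((X - c) ** 2).sum(axis=1))] for c in C])
    # Step 5: keep candidates that look non-dominated on the surrogate
    picks = C[nondominated(pred)][:6]
    # Step 6: evaluate the picks expensively and add them to the data set
    X = np.vstack([X, picks])
    Y = np.vstack([Y, [expensive_objectives(x) for x in picks]])

front = Y[nondominated(Y)]                            # Step 7: current front
```

The structure is the point here: only the picked designs ever reach the expensive function, while the cheap surrogate absorbs the bulk of the search effort.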

There are several possible stopping criteria:


• fixed number of update iterations,
• stop when all update points are dominated,
• stop if the percentage of new update points that belong to the Pareto front falls
below a pre-defined value,
• stop if the percentage of old points on the current Pareto front rises above a
pre-defined value,
• stop when there is no further improvement of the Pareto front quality. The quality
of the Pareto front is a complex multiobjective problem on its own. The best
Pareto front could be defined as the one being as close as possible to the origin
of the objective function space, while having the best diversity, i.e., spread on all
objectives and the points are evenly distributed. Metrics for assessing the quality
of the Pareto front are discussed by Deb [3].
We have used the last of these criteria for our studies.
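The third and fourth criteria amount to tracking one fraction per update iteration; a minimal sketch (the 10% threshold is illustrative, not a value from the chapter):

```python
def update_front_fraction(front, new_points):
    """Fraction of newly evaluated update points whose objective vectors
    lie on the current Pareto front (membership by exact match here; a
    tolerance-based match would be used with real floating-point data)."""
    front_set = {tuple(f) for f in front}
    return sum(tuple(p) in front_set for p in new_points) / len(new_points)

# e.g. stop updating once the fraction drops below an (illustrative) 10%:
frac = update_front_fraction([(1.0, 2.0), (2.0, 1.0)], [(1.0, 2.0), (3.0, 3.0)])
stop = frac < 0.10   # here frac is 0.5, so the search continues
```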

7.6 Update Strategies and Related Parameters


One of the main aims of this publication is to show the effect of different update
strategies and number of updates. Here we consider the following six approaches in
various combinations:
• UPDMOD = 1; (Nr) - Random updates. These can help escape from local Pareto
fronts and enrich the genetic material,
• UPDMOD = 2; (Nrsm) - RSM Pareto front. A specified number of points are
extracted from the Pareto front obtained after the search of the response surface
models of the objectives and constraints (if any). When the RSM Pareto front is
rich it is possible to extract data that is uniformly distributed.
• UPDMOD = 3; (Nsl) - Secondary NSGA2 layer. A completely independent
NSGA2 algorithm is applied directly to the non-RSM objective functions and
constraints. This exploits the well known property of the NSGA2 which makes
it (slowly) converge to global solutions. During each update iteration, the direct
NSGA2 is run for one generation with population size of Nsl. There are two
strands to this approach. The first one is referred to as ‘decoupled’. The genetic
material is completely independent from the other update strategies. No entries
other than those from the direct NSGA2 are used. The second strand is referred
to as ‘coupled’, where the genetic information is composed of suitable designs
obtained by other participating update strategies. Suitable designs are selected in
terms of Pareto optimality, or rank in terms of NSGA2. Please note that although
it might sound similar, this is a completely different approach from the MμGA
algorithm proposed by Coello and Toscano (2000).
• UPDMOD = 4; (Nrmse) – Root of the Mean Squared Error (RMSE). When us-
ing kriging as a response surface model, it is possible to compute an estimate of
the RMSE, at no significant computational cost. The value of this metric is large
where there are large gaps between data points. RMSE is minimal close to or
at existing data points. Therefore adding updates at the location of the maximum
RMSE should significantly improve the quality and coverage of the response sur-
face model. When dealing with multiple objectives/constraints it is appropriate
to construct a Pareto front of maximum RMSEs for all objectives and extract
Nrmse points from it.
• UPDMOD = 5; (Nie) – Expected improvement (EI). This is another kriging spe-
cific function which represents the probability of finding the optimal point in a
new location. The update points are extracted from the Pareto front of the max-
imum values of the EI for all objectives. For constrained problems, the values
of EI for all objectives are multiplied by the value of the feasibility of the con-
straints, which is 1 for satisfied constraints, 0 for infeasible ones, and a rather
smooth ridge around the constraint boundary; see Forrester et al. [22].
• UPDMOD = 6; (Nkmean) – The RSMs are searched using GA or DHC and points
are extracted using a k-means clustering algorithm.
All these update strategies have their own strengths and weaknesses, and therefore
a suitable combination should be carefully considered. The results section of this
chapter provides some insights on the effects of each of these strategies when used
in various combinations.
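For reference, the expected improvement used by the fifth strategy has a standard closed form; a minimal single-objective sketch (the numbers, and the 0.8 feasibility factor, are illustrative):

```python
import math

def expected_improvement(mu, s, f_best):
    """Standard closed-form EI of a kriging prediction (mean mu, RMSE s)
    over the best observed value f_best, for minimization."""
    if s <= 0.0:
        return 0.0                                    # no uncertainty, no EI
    z = (f_best - mu) / s
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (f_best - mu) * Phi + s * phi

# Constrained case, as in the text: multiply each objective's EI by the
# constraint feasibility measure in [0, 1] (0.8 here is illustrative).
ei = expected_improvement(0.2, 0.1, 0.3) * 0.8
```

In the multiobjective setting described above, a Pareto front is built over the per-objective EI values and the Nie update points are extracted from it.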

Additional Parameters that Can Affect the Search


The following parameters can also affect the performance of a multi-objective
RSM search:
• RSMSIZE – number of points used for RSM construction. It is expected that the
more points are used, the more accurate the RSM predictions; however, this
comes at an increasing training cost. Therefore the number of training points
should be limited.
• EVALSIZE – number of points used during RSM evaluation. This stage is con-
siderably less expensive than training and therefore more points can be used dur-
ing the evaluation stage. Ultimately this should increase the density of quality
material and therefore leave fewer gaps for the RSM to approximate.
• EPREWARD – endpoint reward factor. A higher value rewards the end points of
the Pareto front, which improves its spread. A lower value increases the pressure
of the GA to explore the centre of the Pareto front.
• GA NPOP and GA NGEN – the population size and number of generations used
to search the RSM, RMSE and EI Pareto fronts.

7.7 Test Functions


Several test functions with various degrees of complexity have been chosen to
demonstrate the behaviour of the RS methods for the purpose of multiobjective opti-
mization. These functions are well known from the literature:
F5: (Fig. 7.4). High complexity shape – has a smooth and a sharp feature. The
combination of both makes it easier for the optimization procedure to converge to
the smooth feature, which represents a local Pareto front. The global Pareto front
lies around the sharp feature, which is harder to reach. Two objectives, x(i) ∈ [0, 1],
i = 1, 2; no constraints [3], page 350.
ZDT1 - ZDT6: Clustered and discontinuous Pareto fronts. Shape complexity
is moderate. Two objectives, n variables (in present study n = 2), no constraints.
x(i) ∈ [0, 1], i = 1, 2 [3], page 357.
ZDT1cons: Same formulation as for ZDT1 but with 25 variables and 2
constraints. Constraints are described in [3], page 368.
Bump: The bump function, 25 variables, 2 objectives, 1 constraint. We have used
the function as provided in [21] which is a single objective with two constraints.
We have made one of the constraints into a second objective, so that the optimiza-
tion problem is defined as: maximise the original objective and minimize the sum of the
variables whilst keeping the product of the variables greater than 0.75. There are 25
variables, each varying between 0 and 3.
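For concreteness, a sketch of the recast problem, using the commonly published form of Keane's bump function (an assumption on our part; the chapter takes the exact definition from [21]):

```python
import math

def bump(x):
    # Keane's bump function in its commonly published form (an assumption
    # here; the chapter takes the exact definition from [21]):
    # |sum(cos^4 x_i) - 2 prod(cos^2 x_i)| / sqrt(sum(i * x_i^2))
    s4 = sum(math.cos(v) ** 4 for v in x)
    p2 = 2.0
    for v in x:
        p2 *= math.cos(v) ** 2
    denom = math.sqrt(sum((i + 1) * v * v for i, v in enumerate(x)))
    return abs(s4 - p2) / denom

def biobjective_bump(x):
    # The chapter's recast: maximize bump (negated here so both objectives
    # are minimized), minimize sum(x), subject to prod(x) > 0.75.
    prod = 1.0
    for v in x:
        prod *= v
    feasible = prod > 0.75
    return (-bump(x), sum(x)), feasible

objs, ok = biobjective_bump([1.0] * 25)   # a feasible 25-variable design
```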

7.8 Pareto Front Metrics


To measure the performance of the various strategies discussed in this paper, we
have adopted several metrics. Some of them use comparison to an ‘ideal’ solution
which is denoted by Q and represents the Pareto front obtained using direct search
with a large number of iterations (20,000). All metrics are designed so that smaller
is better.

7.8.1 Generational Distance ([3], p. 326)


The average of the minimum Euclidean distance between each point of the two
Pareto fronts,

    gd = \frac{\sum_{i=1}^{|Q|} d_i}{|Q|},

where

    d_i = \min_{k=1,\dots,|P|} \sqrt{\sum_{j=1}^{M} \left( f_j^{(i)} - p_j^{(k)} \right)^2}

is the Euclidean distance between the solution (i) and the nearest member of the
other front.

7.8.2 Spacing

The standard deviation of the absolute differences between the solution (i) and the
nearest member of Q,

    sp = \sqrt{\frac{1}{|Q|} \sum_{i=1}^{|Q|} \left( d_i - \bar{d} \right)^2},

    d_i = \min_{k=1,\dots,|P|} \sum_{j=1}^{M} \left| f_j^{(i)} - p_j^{(k)} \right|.

7.8.3 Spread

    \Delta = 1 - \frac{\sum_{m=1}^{M} d_m^e - \sum_{i=1}^{|Q|} \left| d_i - \bar{d} \right|}{\sum_{m=1}^{M} d_m^e + |Q| \, \bar{d}},

where d_i is the absolute difference between neighbouring solutions, \bar{d} is
their mean, and d_m^e is the distance between the extreme solutions of the two
fronts in the m-th objective. For compatibility with the above metrics, the value of
the spread is subtracted from 1, so that a wider spread will produce a smaller value.

7.8.4 Maximum Spread

The normalized distance between the most distant points on the Pareto front. The
distance is normalized against the maximum spread of the 'ideal' Pareto front. For
compatibility with the above metrics, the value of the maximum spread is subtracted
from 1, so that a wider spread will produce a smaller value,

    MS = 1 - \sqrt{\frac{1}{M} \sum_{m=1}^{M} \left( \frac{\max_{i=1,\dots,|Q|} f_m^{(i)} - \min_{i=1,\dots,|Q|} f_m^{(i)}}{P_m^{\max} - P_m^{\min}} \right)^2}.
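Three of these metrics can be sketched directly from the formulas (the spread metric additionally needs neighbouring-solution differences and extreme distances, so it is omitted; here F is the obtained front and Q the 'ideal' front, as lists of objective tuples):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(p, front):
    # Distance from point p to its nearest member of the other front
    return min(dist(p, q) for q in front)

def generational_distance(F, Q):
    # Average nearest-neighbour distance from the obtained front to Q
    d = [nearest(f, Q) for f in F]
    return sum(d) / len(d)

def spacing(F, Q):
    # Population standard deviation of those nearest-neighbour distances
    d = [nearest(f, Q) for f in F]
    dbar = sum(d) / len(d)
    return math.sqrt(sum((x - dbar) ** 2 for x in d) / len(d))

def maximum_spread(F, Q):
    # 1 minus the per-objective range of F normalized by the range of Q
    M = len(F[0])
    total = 0.0
    for m in range(M):
        f_range = max(f[m] for f in F) - min(f[m] for f in F)
        q_range = max(q[m] for q in Q) - min(q[m] for q in Q)
        total += (f_range / q_range) ** 2
    return 1.0 - math.sqrt(total / M)

Q = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
F = [(0.0, 1.0), (1.0, 0.0)]   # lies on Q and hits its extremes exactly
```

With this F all three metrics evaluate to zero, consistent with the "smaller is better" convention above.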

7.9 Results
The study carried out aims to show the effect of applying various update strategies,
number of training and evaluation points, etc. The performance of each particular
approach is measured using the metrics described in the previous section.
An overall summary is given at the end of this section, but the best recipe ap-
pears to be highly problem dependent. It is also not possible to show all results for
all functions due to limited space, and we have therefore chosen several that best
represent the ideas discussed.
To correctly appreciate the results, please bear in mind that they are meant to
show diversity rather than a magic recipe that works in all situations.

7.9.1 Understanding the Results


The legend on the figures represents the selected strategy in the form
[Nr]-[Nrsm]-[Nsl]-[Nrmse]-[Nie]-[Nkmean]MUPD[RSMSIZE]MEVL[EVALSIZE]
so that 8-14-15-10-3-3MUPD50MEVL300 would represent 8 random update
points, 14 RSM updates, 15 NSGA2 Second layer updates, 10 RMSE updates, 3
EI updates, 3 KMEAN updates with 50 krig training points and 300 krig evaluation
points.
All approaches were given a maximum of 60 update iterations and a stopping cri-
terion of reaching two consecutive unchanged Pareto fronts. The total number of runs is
recorded for each update iteration and all metrics are plotted against number of real
function evaluations, (i.e. likely cost on real expensive problems).
Strategies with ‘dec’ appended to their name indicate that the decoupled Second
layer is used, as opposed to the coupled version for those with Nsl = 30 and no
suffix. Those labelled ‘43’ use a one-pass constraint penalty expected improvement
strategy, whilst those that have Nie = 30 and no suffix use a constraint feasibility
algorithm.

7.9.2 Preliminary Calculations


7.9.2.1 Finding the Ideal Pareto Front
As mentioned in section 7.8, most of the Pareto front metrics are based on com-
parison to an ‘ideal’ Pareto front. To find it, each of the test functions has been
run through a direct NSGA2 search (direct = without the use of surrogates) with a
population size of 100 for 200 generations, which takes 20,000 function evaluations.

7.9.2.2 How Many Generations for the RSM Search?


We have conducted a study for each of the test functions to find the minimum
number of generations for which they should be run in order to achieve best conver-
gence. We found that a population size of 70 with 80 generations is sufficient for all
of the test problems, and this is what we have used for our tests. Some test functions,
such as ZDT1 - ZDT6 with two variables could be converged using a smaller num-
ber of individuals and generations, however for comparison purposes we decided to
use the same settings for all functions.

7.9.2.3 What Is the Best Value for EPREWARD during the RSM Search?

The EPREWARD value is strictly individual for each function. Taking into account
the specifics of the test function it can improve the diversity of the Pareto front. The
default value is 0.65, which works well for most of the functions, but we have also
conducted studies where this parameter is varied between -1 and 1 in steps of 0.1,
and an individual value for each function is selected based on the best Pareto front
metrics.

7.9.3 The Effect of the Update Strategy Selection


Fig. 7.5 shows that the selection of update strategy is important even for functions
with only two variables. F5 has a deceptive Pareto front and several update strategies
were not able to escape from the local Pareto front.
Fig. 7.6 clearly shows that some strategies have converged earlier than the others,
but some of them to the local front. Generally methods such as Random updates and
Secondary NSGA2 layer updates are not based on the RSM and are the strongest
candidates when deceptive features in the multiobjective space are expected. It is
a common observation amongst most of the low dimensional objective functions
(two or three variables) that using all the update techniques together is not necessar-
ily the winning strategy. However combining at least one RSM and one non-RSM
technique proves to work well. It is somewhat important to note that the Second
NSGA2 layer shows its effect after sixth or seventh update iteration, as it needs time
to converge and gather genetic information.
Update strategies that employ a greater variety of techniques prove to be more
successful for functions with a higher number of variables (25).

Fig. 7.5 Pareto front for F5

Fig. 7.6 Generational distance for F5

Fig. 7.9 and Fig. 7.10 show that the ‘bump’ function is particularly difficult for
all strategies, which makes it a good test problem. This function has an extremely
tight constraint and multimodal features. It is not yet clear which combination of
strategies should be recommended, as the ‘ideal’ Pareto front has not been reached;
however, it seems that a decoupled secondary NSGA2 layer is showing good

Fig. 7.7 Pareto front for ZDT1cons

Fig. 7.8 Generational distance for ZDT1cons

advancement. We are continuing studies on this function and will give results in
future publications.
To summarize the performance of each strategy, an average statistic is com-
puted. It is derived as follows. The actual performance in most cases is a trade-
off between a given metric and the number of function evaluations needed for

Fig. 7.9 Pareto front for the ‘bump’ function

Fig. 7.10 Generational distance for the ‘bump’ function

convergence. Therefore the four metrics can be ranked against the number of runs,
in the same way as ranks are obtained during NSGA2 operation. The obtained ranks
are then averaged across all test functions. A low average rank means that the strategy
has been optimal for more metrics and functions. These results are summarized in
Table 7.2.

Table 7.2 Summary of performance

Random RSM PF SL RMSE EI KMEAN Av. Rank Min. Rank Max. Rank Note
0 30 0 0 30 0 1.53 1 2 EI const.feas
0 30 30 0 0 0 1.83 1 3.33 SL coupled
0 30 0 30 0 0 2 1.33 3.33 RMSE
0 30 0 0 30 0 2.2 1.33 3 EI normal
0 30 30 0 0 0 2.8 1.33 4 SL decoupled
30 30 0 0 0 0 2.84 2 4 Random
0 60 0 0 0 0 2.85 2 3.33 RSM PF

The summary shows that all strategies are generally better than using only the
conventional RSM based updates, which is expected, as the conventional method is
almost always bound to converge at local solutions. However, it must be underlined
that the correct combination is problem dependent and must be chosen with care and
understanding.

7.9.4 The Effect of the Initial Design of Experiments


All methods presented here start from a given initial design of experiments. This is
the starting point and this is what the initial surrogate model is based on. It is of
course important to show the effect of these initial conditions. In what follows we
show this effect by using a range of different initial DOEs. We have again

Fig. 7.11 Generational distance for zdt1 starting from different initial DOEs

Fig. 7.12 Generational distance for F5 starting from different initial DOEs

Fig. 7.13 Pareto fronts for ‘bump’ starting from different initial DOEs

used 10 updates for each of the techniques (60 updates per iteration in total) for all
functions. The only difference being the starting set of designs.
Fig. 7.11 and Fig. 7.12 illustrate the generational distance for the zdt1 and f5 func-
tions – both with two variables. They both demonstrate good consistency,

Fig. 7.14 Generational distance for ‘bump’ starting from different initial DOEs

Fig. 7.15 Pareto fronts for ‘zdt1cons’ starting from different initial DOEs

confirming once again that the surrogate updates are fairly robust for functions with
low number of variables.
Figures 7.13, 7.14 and 7.15 illustrate much greater variance and show that high
dimensionality is a difficult challenge for surrogate strategies; however, one should
also consider the low number of function evaluations used here.

7.10 Summary
In this publication we have aimed to share our experience in tackling expensive
multiobjective problems. We have shown that as soon as we decide to use surrogate
models to substitute for expensive objective functions, we need to consider a num-
ber of other specifics in order to produce a useful Pareto front. We have discussed
the challenges that one might face when using surrogates and have proposed six up-
date strategies that one might wish to use. Given an understanding of these strategies,
the researcher should decide on the budget of updates they can afford and then
spread this budget over several update strategies. We have shown that it is best to
use at least two different strategies – ideally a mixture of RSM and non-RSM based
techniques. When solving problems with few variables we have shown that a com-
bination of two or three techniques is sufficient, however with higher dimensional
problems, one should consider using more techniques.
It is also beneficial to constrain the number of designs that are used for RSM
training and also for RSM evaluation, to limit the cost. The method for selecting
these designs is open to further research. In this material we have used selection
based on Pareto front ranking.
Our research also included parameters that reward the search for exploring the
end points on the Pareto front. Although not explicitly mentioned in this material,
our studies use features such as improved crossover, mutation and selection
strategies, and a declustering algorithm applied in both the variable and objective
spaces to avoid data clustering. Data is also automatically conditioned and filtered,
and advanced kriging tuning techniques are used. These features are part of the
OPTIONS [1], OptionsMATLAB and OptionsNSGA2 RSM suites [24].

Acknowledgements. This work was funded by Rolls – Royce Plc, whose support is
gratefully acknowledged.

References
1. Keane, A.J.: OPTIONS manual,
http://www.soton.ac.uk/˜ajk/options.ps
2. Obayashi, S., Jeong, S., Chiba, K.: Multi-Objective Design Exploration for Aerodynamic
Configurations, AIAA-2005-4666
3. Deb, K.: Multi-objective optimization using evolutionary algorithms. John Wiley &
Sons, Ltd., New York (2003)
4. Zitzler, et al.: Comparison of multiobjective evolutionary algorithms: Empirical results.
Evolutionary Computational Journal 8(2), 125–148 (2000)
5. Knowles, J., Corne, D.: The Pareto archived evolution strategy: A new baseline algorithm
for multiobjective optimisation. In: Proceedings of the 1999 Congress on Evolutionary
Computation, pp. 98–105. IEEE Service Center, Piscataway (1999)
6. Fonseca, C.M., Fleming, P.J.: Multiobjective optimization and multiple constraint han-
dling with evolutionary algorithms - Part II: Application example. IEEE Transactions on
Systems, Man, and Cybernetics: Part A: Systems and Humans, 38–47 (1998)
7 Multi-objective Optimization Using Surrogates 175

7. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-
box functions. Journal of Global Optimization 13, 455–492 (1998)
8. Sobol’, I.M., Turchaninov, V.I., Levitan, Y.L., Shukhman, B.V.: Quasi-Random Sequence Generators. Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, Moscow (1992)
9. Nowacki, H.: Modelling of Design Decisions for CAD. In: Goos, G., Hartmanis, J.
(eds.) Computer Aided Design Modelling, Systems Engineering, CAD-Systems. LNCS,
vol. 89. Springer, Heidelberg (1980)
10. Kumano, T., et al.: Multidisciplinary Design Optimization of Wing Shape for a Small Jet
Aircraft Using Kriging Model. In: 44th AIAA Aerospace Sciences Meeting and Exhibit,
January 2006, pp. 1–13 (2006)
11. Nain, P.K.S., Deb, K.: A multi-objective optimization procedure with successive approx-
imate models. KanGAL Report No. 2005002 (March 2005)
12. Keane, A., Nair, P.: Computational Approaches for Aerospace Design: The Pursuit of
Excellence (2005) ISBN: 0-470-85540-1
13. Leary, S., Bhaskar, A., Keane, A.J.: A derivative based surrogate model for approximat-
ing and optimizing the output of an expensive computer simulation. J. Global Optimiza-
tion 30, 39–58 (2004)
14. Leary, S., Bhaskar, A., Keane, A.J.: A Constraint Mapping Approach to the Structural
Optimization of an Expensive Model using Surrogates. Optimization and Engineering 2,
385–398 (2001)
15. Emmerich, M., Naujoks, B.: Metamodel-assisted multiobjective optimization strategies
and their application in airfoil design. In: Parmee, I. (ed.) Proc of. Fifth Int’l. Conf.
on Adaptive Design and Manufacture (ACDM), Bristol, UK, April 2004, pp. 249–260.
Springer, Berlin (2004)
16. Giotis, A.P., Giannakoglou, K.C.: Single- and Multi-Objective Airfoil Design Using Ge-
netic Algorithms and Artificial Intelligence. In: EUROGEN 1999, Evolutionary Algo-
rithms in Engineering and Computer Science (May 1999)
17. Knowles, J., Hughes, E.J.: Multiobjective optimization on a budget of 250 evaluations.
In: Coello Coello, C.A., Hernández Aguirre, A., Zitzler, E. (eds.) EMO 2005. LNCS,
vol. 3410, pp. 176–190. Springer, Heidelberg (2005)
18. Chafekar, D., et al.: Multi-objective GA optimization using reduced models. IEEE
SMCC 35(2), 261–265 (2005)
19. Nain, P.: A computationally efficient multi-objective optimization procedure using suc-
cessive function landscape models. Ph.D. dissertation, Department of Mechanical Engi-
neering, Indian Institute of Technology (July 2005)
20. Voutchkov, I.I., Keane, A.J.: Multiobjective optimization using surrogates. In: Proc. 7th
Int. Conf. Adaptive Computing in Design and Manufacture (ACDM 2006), Bristol, pp.
167–175 (2006) ISBN 0-9552885-0-9
21. Keane, A.J.: Bump: A Hard (?) Problem (1994),
http://www.soton.ac.uk/˜ajk/bump.html
22. Forrester, A., Sobester, A., Keane, A.: Engineering design via Surrogate Modelling. Wi-
ley, Chichester (2008)
23. Yuret, D., Maza, M.: Dynamic hill climbing: Overcoming the limitations of optimization
techniques. In: The Second Turkish Symposium on Artificial Intelligence and Neural
Networks, pp. 208–212 (1993)
24. OptionsMatlab & OptionsNSGA2 RSM,
http://argos.e-science.soton.ac.uk/blogs/OptionsMatlab/
Chapter 8
A Review of Agent-Based Co-Evolutionary
Algorithms for Multi-Objective Optimization

Rafał Dreżewski and Leszek Siwik

Abstract. Agent-based evolutionary algorithms are the result of mixing two paradigms: multi-agent systems and evolutionary algorithms. Agent-based co-evolutionary algorithms allow many species and sexes of agents to exist within the system, and allow co-evolutionary interactions between species and sexes to be defined. Algorithms based on the model of co-evolutionary multi-agent systems have already been applied in many domains, such as multi-modal optimization, generation of investment strategies, portfolio optimization, and multi-objective optimization. In this chapter we present an overview of selected agent-based co-evolutionary algorithms, their formal models, and the results of experiments with standard test problems and a financial problem, aimed at comparing agent-based and "classical" state-of-the-art multi-objective algorithms. The presented results show that, depending on the problem being solved, agent-based algorithms obtain results comparable to, and sometimes even better than, those of "classical" algorithms; they are, of course, not a universal solver for every multi-objective optimization problem.

8.1 Introduction
In spite of the huge potential dormant in evolutionary algorithms, and despite many successful applications of such algorithms to difficult optimization and search problems, these methods frequently have not been able to deal with the problem at hand, and the obtained results have not been satisfying. Among the reasons for this situation, the following can be mentioned:
• centralization of the evolutionary process, where the process of selection as well as the creation of new generations is controlled by one single algorithm;
Rafał Dreżewski · Leszek Siwik
Department of Computer Science
AGH University of Science and Technology, Kraków, Poland
e-mail: {drezew,siwik}@agh.edu.pl

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 177–209.
springerlink.com c Springer-Verlag Berlin Heidelberg 2010

• reduction of the individual to a (system of) genes, without any capability of exerting influence on the process of evolution;
• omission of some operations and processes observable in nature that are crucial from the point of view of evolution and adaptation capabilities. Moreover, in the literature there are opinions that crossover and mutation are merely variants of one single, destructive and exploration-oriented, operator, and there is no agreement on whether (and if so, when) they should be used, or even whether they should be distinguished at all [17];
• in order to realize their own goals during decision making, individuals are able neither to gather nor to utilize any kind of information from the environment;
• individuals are deprived of biological and social behaviors that are absolutely natural and obvious in nature, such as competition, rivalry, cooperation, etc.;
• as a consequence of the previous point (the limited number of operators), it is almost impossible to define, within classical evolutionary algorithms, more sophisticated (and simultaneously more effective) advanced algorithms and computational methods.
As a consequence, arguments are raised in the literature that classical evolutionary algorithms are methods of adapting and fitting an algorithm's parameters to given conditions, rather than truly creative methods of search and optimization.
It is thus not surprising that intensive research is being performed on methods that utilize the ideas and concepts of computer models of the Darwinian evolution observable in nature, but which at the same time are devoid of the shortcomings mentioned above and can be perceived as a full analogy to natural processes. During this research, decentralization and autonomy have been in the limelight. The resulting method, called Evolutionary Multi-Agent System (EMAS) [2], should be perceived as a new trend among evolutionary algorithms, allowing the realization of the defined postulates by simultaneously utilizing the advantages of both evolutionary and agent-based approaches.
The proposed paradigm of evolutionary multi-agent systems is characterized by the following features, which are crucial given the shortcomings of classical evolutionary algorithms:
• autonomous agents take part in the process of evolution. Agents are able to make decisions to realize their own goals; they are not passive units of a global and central evolution, limited and reduced to the role of a (group of) genes;
• the process of evolution is decentralized, and the agents taking part in it are able to create advanced social structures and to realize sophisticated strategies of cooperation, competition, interaction and reciprocal relations;
• agents taking part in the process of evolution are able to observe the environment (and the changes occurring in it) and to make appropriate decisions and actions, which additionally enriches the spectrum of complex and effective computational methods and algorithms that can be realized.

During further research on realizing advanced, complex social and biological mechanisms within the confines of EMAS, a general model of so-called co-evolutionary multi-agent systems (CoEMAS) [8] has been proposed. It has turned out that with the use of such a model almost any kind of interaction, cooperation or competition among many species or sexes of co-evolving agents is possible, which allows for improving the quality of the obtained results. Such improvement results mainly from better maintenance of population diversity, which is especially important when applying such systems to multi-modal or multi-objective optimization tasks.
In the course of this chapter we focus on applying co-evolutionary multi-agent systems to solving multi-objective optimization tasks.
Following [5], the multi-objective optimization problem (MOOP) in its general form is defined as follows:

         ⎧ Minimize/Maximize  f_m(x̄),                    m = 1, 2, …, M
MOOP ≡   ⎨ subject to         g_j(x̄) ≥ 0,                j = 1, 2, …, J
         ⎪                    h_k(x̄) = 0,                k = 1, 2, …, K
         ⎩                    x_i^(L) ≤ x_i ≤ x_i^(U),    i = 1, 2, …, N

The authors of this chapter assume that readers are familiar with at least the fundamental concepts and notions of multi-objective optimization in the Pareto sense (the domination relation, the Pareto frontier and Pareto set, etc.); their explanation is omitted in this chapter (interested readers can find definitions and a deep analysis of all the necessary concepts and notions of Pareto multi-objective optimization, for instance, in [3, 5]).
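The domination relation underlying these Pareto concepts can be sketched in executable form. The snippet below is a generic illustration (the function names are ours, not from the chapter) and assumes all objectives are minimized:

```python
def dominates(fa, fb):
    """True if objective vector fa Pareto-dominates fb (minimization):
    fa is no worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(fa, fb))
            and any(a < b for a, b in zip(fa, fb)))

def nondominated(points):
    """Return the non-dominated subset of a set of objective vectors
    (an approximation of the Pareto front)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

For example, among the vectors (1, 5), (2, 2), (5, 1) and (3, 3), only (3, 3) is dominated (by (2, 2)), so the other three form the non-dominated set.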
This chapter is organized as follows:
• in Section 8.2 a formal model as well as a detailed description of the co-evolutionary multi-agent system (CoEMAS) is presented;
• in Section 8.3 detailed descriptions and formal models of two realizations of CoEMAS applied to solving MOOPs are given: the co-evolutionary multi-agent system with predator-prey interactions (PPCoEMAS) and the co-evolutionary multi-agent system with co-operation (CCoEMAS);
• in Section 8.4 we briefly discuss the test suite and performance metric used during the experiments, and then glance at the results obtained by both systems presented in the course of this chapter (PPCoEMAS and CCoEMAS);
• in Section 8.5 the most important remarks, conclusions and comments are given.

8.2 Model of Co-Evolutionary Multi-Agent System


Agent-based models of evolutionary algorithms are the result of mixing two paradigms: multi-agent systems and evolutionary algorithms. The result is a decentralized evolutionary system, in which agents "live" within the environment of the system, compete for limited resources, reproduce, die, migrate from one computational node to another, observe the environment and other agents, and can communicate with other agents and change the environment.
The basic model of an agent-based evolutionary algorithm (the so-called evolutionary multi-agent system, EMAS) was proposed in [2]. The EMAS model included all the features mentioned above. However, in the case of some problems, for example multi-modal or multi-objective optimization, these mechanisms turned out not to be sufficient. Such types of problems require mechanisms for maintaining population diversity, speciation mechanisms, and the possibility of introducing additional biologically and socially inspired mechanisms in order to solve the problem and obtain satisfying results.
The limitations of the basic EMAS model mentioned above, together with research aimed at applying agent-based evolutionary algorithms to multi-modal and multi-objective problems, led to the formulation of the model of the co-evolutionary multi-agent system (CoEMAS) [8]. This model included the possibility of different species and sexes existing in the system and allowed co-evolutionary interactions between them to be defined. Below we present the basic ideas and notions of the CoEMAS model, which we will use in Section 8.3 when the systems used in the experiments are described.

8.2.1 Co-Evolutionary Multi-Agent System


The CoEMAS is described as a 4-tuple:

CoEMAS = ⟨E, S, Γ, Ω⟩    (8.1)

where E is the environment of the CoEMAS, S is the set of species (s ∈ S) that co-evolve in the CoEMAS, Γ is the set of resource types that exist in the system (the amount of resource of type γ will be denoted by r^γ), and Ω is the set of information types that exist in the system (information of type ω will be denoted by i^ω).

8.2.2 Environment
The environment of CoEMAS may be described as a 3-tuple:

E = ⟨T^E, Γ^E, Ω^E⟩    (8.2)

where T^E is the topography of environment E, Γ^E is the set of resource types that exist in the environment, and Ω^E is the set of information types that exist in the environment. The topography of the environment is given by:

T^E = ⟨H, l⟩    (8.3)

where H is a directed graph with a cost function c defined: H = ⟨V, B, c⟩, V is the set of vertices, and B is the set of arcs. The distance between two nodes is defined as the length of the shortest path between them in the graph H.
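This node distance is a standard shortest-path computation over the arc costs c. A minimal sketch follows (the graph representation and the example values in the usage note are our illustrative choices, not taken from the chapter):

```python
import heapq

def distance(vertices, arcs, cost, src, dst):
    """Length of the shortest path from src to dst in the directed graph
    H = (V, B, c); returns None if dst is unreachable from src."""
    dist = {v: None for v in vertices}
    queue = [(0, src)]                       # (path length so far, vertex)
    while queue:
        d, v = heapq.heappop(queue)
        if dist[v] is not None:              # already settled with a shorter path
            continue
        dist[v] = d
        if v == dst:
            return d
        for (u, w) in arcs:                  # relax all arcs leaving v
            if u == v:
                heapq.heappush(queue, (d + cost[(u, w)], w))
    return dist[dst]
```

For instance, with V = {a, b, c}, arcs a→b, b→c (cost 1 each) and a→c (cost 3), the distance from a to c is 2, while c cannot reach a at all.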

Fig. 8.1 Co-evolutionary multi-agent system

The l function makes it possible to locate a particular agent in the environment space:

l : A → V    (8.4)

where A is the set of agents that exist in the CoEMAS.
Vertex v is given by:

v = ⟨A^v, Γ^v, Ω^v, ϕ⟩    (8.5)

where A^v is the set of agents located in the vertex v, Γ^v is the set of resource types that exist within v (Γ^v ⊆ Γ^E), Ω^v is the set of information types that exist within v (Ω^v ⊆ Ω^E), and ϕ is the fitness function.

8.2.3 Species
Species s ∈ S is defined as follows:

s = ⟨A^s, SX^s, Z^s, C^s⟩    (8.6)

where:
• A^s is the set of agents of species s (by a^s we denote an agent of species s, a^s ∈ A^s);
• SX^s is the set of sexes within s;
• Z^s is the set of actions which can be performed by the agents of species s (Z^s = ∪_{a∈A^s} Z^a, where Z^a is the set of actions which can be performed by the agent a);
• C^s is the set of relations with the other species that exist within the CoEMAS.
The set of relations of s_i with other species (C^{s_i}) is the union of the following sets of relations:

C^{s_i} = { −→^{s_i,z−} : z ∈ Z^{s_i} } ∪ { −→^{s_i,z+} : z ∈ Z^{s_i} }    (8.7)

where −→^{s_i,z−} and −→^{s_i,z+} are relations between species, based on some action z ∈ Z^{s_i} which can be performed by the agents of species s_i:

−→^{s_i,z−} = { ⟨s_i, s_j⟩ ∈ S × S : agents of species s_i can decrease the fitness of agents of species s_j by performing the action z ∈ Z^{s_i} }    (8.8)

−→^{s_i,z+} = { ⟨s_i, s_j⟩ ∈ S × S : agents of species s_i can increase the fitness of agents of species s_j by performing the action z ∈ Z^{s_i} }    (8.9)

If s_i −→^{s_i,z−} s_i then we are dealing with intra-species competition, for example the competition for limited resources, and if s_i −→^{s_i,z+} s_i then there is some form of co-operation within the species s_i.
With the use of the above relations we can define many different co-evolutionary interactions, e.g., mutualism, predator-prey, host-parasite, etc. For example, mutualism between two species s_i and s_j (i ≠ j) takes place if and only if there exist z_k ∈ Z^{s_i} and z_l ∈ Z^{s_j} such that s_i −→^{s_i,z_k+} s_j and s_j −→^{s_j,z_l+} s_i, and these two species live in tight co-operation.
Predator-prey interactions between two species s_i (predators) and s_j (preys) (i ≠ j) take place if and only if there exist z_k ∈ Z^{s_i} and z_l ∈ Z^{s_j} such that s_i −→^{s_i,z_k−} s_j and s_j −→^{s_j,z_l+} s_i, where z_k is the action of killing the prey (kill), and z_l is the action of death (die).
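These interaction patterns can be checked mechanically once the relations of Eqs. (8.8)-(8.9) are represented as data. The sketch below is our illustration (the tuple encoding and function names are assumptions, not part of the formal model):

```python
def has_relation(relations, si, sj, sign):
    """True if some action of species si affects the fitness of species sj
    with the given sign ('+' increases it, '-' decreases it).
    Each relation is encoded as (source_species, target_species, action, sign)."""
    return any(s == si and t == sj and g == sign
               for (s, t, _, g) in relations)

def mutualism(relations, si, sj):
    # both species can increase each other's fitness
    return (has_relation(relations, si, sj, "+")
            and has_relation(relations, sj, si, "+"))

def predator_prey(relations, pred, prey):
    # predators can decrease the preys' fitness (kill),
    # preys can increase the predators' fitness (die)
    return (has_relation(relations, pred, prey, "-")
            and has_relation(relations, prey, pred, "+"))
```

With relations {(pred, prey, kill, −), (prey, pred, die, +)} this classifies the pair as predator-prey but not as mutualism.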

8.2.4 Sex
The sex sx ∈ SX^s which exists within the species s is defined as follows:

sx = ⟨A^{sx}, Z^{sx}, C^{sx}⟩    (8.10)

where A^{sx} is the set of agents of sex sx and species s (A^{sx} ⊆ A^s):

A^{sx} = {a : a ∈ A^s ∧ a is an agent of sex sx}    (8.11)

By a^{sx} we denote an agent of sex sx (a^{sx} ∈ A^{sx}). Z^{sx} is the set of actions which can be performed by the agents of sex sx, Z^{sx} = ∪_{a∈A^{sx}} Z^a, where Z^a is the set of actions which can be performed by the agent a. Finally, C^{sx} is the set of relations between sx and the other sexes of the species s.
Analogously to the case of species, we can define the relations between the sexes of the same species. The set of all relations of the sex sx_i ∈ SX^s with the other sexes of species s (C^{sx_i}) is the union of the following sets of relations:

C^{sx_i} = { −→^{sx_i,z−} : z ∈ Z^{sx_i} } ∪ { −→^{sx_i,z+} : z ∈ Z^{sx_i} }    (8.12)

where −→^{sx_i,z−} and −→^{sx_i,z+} are the relations between sexes, in which some action z ∈ Z^{sx_i} is used:

−→^{sx_i,z−} = { ⟨sx_i, sx_j⟩ ∈ SX^s × SX^s : agents of sex sx_i can decrease the fitness of agents of sex sx_j by performing the action z ∈ Z^{sx_i} }    (8.13)

−→^{sx_i,z+} = { ⟨sx_i, sx_j⟩ ∈ SX^s × SX^s : agents of sex sx_i can increase the fitness of agents of sex sx_j by performing the action z ∈ Z^{sx_i} }    (8.14)

With the use of the presented relations between sexes we can model, for example, sexual selection interactions, in which agents of one sex choose partners for reproduction from among agents of the other sex within the same species, taking into account some preferred features (see [10]).

8.2.5 Agent
Agent a (see Fig. 8.2) of sex sx and species s (in order to simplify the notation we assume that a ≡ a^{sx,s}) is defined as follows:

a = ⟨gn^a, Z^a, Γ^a, Ω^a, PR^a⟩    (8.15)

where:
• gn^a is the genotype of agent a, which may be composed of any number of chromosomes (for example: gn^a = (x_1, x_2, …, x_k), where x_i ∈ ℝ, gn^a ∈ ℝ^k);
• Z^a is the set of actions which agent a can perform;
• Γ^a is the set of resource types which are used by agent a (Γ^a ⊆ Γ);
• Ω^a is the set of information types which agent a can possess and use (Ω^a ⊆ Ω);
• PR^a is the partially ordered set of profiles of agent a (⟨PR^a, ≼⟩) with the defined partial order relation ≼.

Fig. 8.2 Agent in the CoEMAS

The relation ≼ is defined in the following way:

≼ = { ⟨pr_i, pr_j⟩ ∈ PR^a × PR^a : the realization of the active goals of profile pr_i has equal or higher priority than the realization of the active goals of profile pr_j }    (8.16)

An active goal (denoted gl*) is a goal gl which should be realized at the given time. The relation ≼ is reflexive, transitive and antisymmetric, and partially orders the set PR^a:

pr ≼ pr for every pr ∈ PR^a    (8.17a)

(pr_i ≼ pr_j ∧ pr_j ≼ pr_k) ⇒ pr_i ≼ pr_k for every pr_i, pr_j, pr_k ∈ PR^a    (8.17b)

(pr_i ≼ pr_j ∧ pr_j ≼ pr_i) ⇒ pr_i = pr_j for every pr_i, pr_j ∈ PR^a    (8.17c)

The set of profiles PR^a is defined in the following way:

PR^a = {pr_1, pr_2, …, pr_n}    (8.18a)

pr_1 ≼ pr_2 ≼ ⋯ ≼ pr_n    (8.18b)

Profile pr_1 is the basic profile: the realization of its goals has the highest priority, and they will be realized before the goals of the other profiles.
A profile pr of agent a (pr ∈ PR^a) can be a profile in which only resources are used:

pr = ⟨Γ^{pr}, ST^{pr}, RST^{pr}, GL^{pr}⟩    (8.19)

Algorithm 6. Basic activities of agent a in CoEMAS

1   r^γ ← r^γ_init;    /* r^γ_init is the initial amount of resource given to the agent */
2   while r^γ > 0 do
3       activate the profile pr_i ∈ PR^a with the highest priority and with an active goal gl*_j ∈ GL^{pr_i};
4       if pr_i is the resource profile then
5           if 0 < r^γ < r^γ_min then    /* r^γ_min is the minimal amount of resource needed by the agent to realize its activities */
6               choose the strategy st_k ∈ ST^{pr_i} with the highest priority that can be used to take some resources from the environment or another agent;
7               perform the actions contained within st_k;
8           else if r^γ = 0 then
9               execute the die strategy;
10          end
11      else if pr_i is the reproduction profile then
12          if r^γ > r^{rep,γ}_min then    /* r^{rep,γ}_min is the minimal amount of resource needed for reproduction */
13              choose the strategy st_k ∈ ST^{pr_i} with the highest priority that can be used to reproduce;
14              perform the actions contained within st_k;
15          end
16      else if pr_i is the migration profile then
17          if r^γ > r^{mig,γ}_min then    /* r^{mig,γ}_min is the minimal amount of resource needed for migration */
18              choose the strategy st_k ∈ ST^{pr_i} with the highest priority that can be used to migrate;
19              perform the actions contained within st_k;
20              give r^{mig,γ}_min amount of resource to the environment;
21          end
22      end
23  end

or a profile in which only information is used:

pr = ⟨Ω^{pr}, M^{pr}, ST^{pr}, RST^{pr}, GL^{pr}⟩    (8.20)

or one in which both resources and information are used:

pr = ⟨Γ^{pr}, Ω^{pr}, M^{pr}, ST^{pr}, RST^{pr}, GL^{pr}⟩    (8.21)

where:
• Γ^{pr} is the set of resource types which are used within the profile pr (Γ^{pr} ⊆ Γ^a);
• Ω^{pr} is the set of information types which are used within the profile pr (Ω^{pr} ⊆ Ω^a);

• M^{pr} is the set of information representing the agent's knowledge about the environment and other agents (it is the model of the environment of agent a);
• ST^{pr} is the partially ordered set of strategies (⟨ST^{pr}, ≼⟩) which can be used by the agent within the profile pr in order to realize an active goal of this profile;
• RST^{pr} is the set of strategies that are realized within the profile pr; generally, not all of the strategies from the set ST^{pr} have to be realized within the profile pr, and some of them may be realized within other profiles;
• GL^{pr} is the partially ordered set of goals (⟨GL^{pr}, ≼⟩) which the agent has to realize within the profile pr.
The relation ≼ on strategies is defined in the following way:

≼ = { ⟨st_i, st_j⟩ ∈ ST^{pr} × ST^{pr} : strategy st_i has equal or higher priority than strategy st_j }    (8.22)

This relation is reflexive, transitive and antisymmetric, and partially orders the set ST^{pr}. Every single strategy st ∈ ST^{pr} consists of actions whose ordered performance leads to the realization of some active goal of the profile pr:

st = ⟨z_1, z_2, …, z_k⟩, st ∈ ST^{pr}, z_i ∈ Z^a    (8.23)

The relation ≼ on goals is defined in the following way:

≼ = { ⟨gl_i, gl_j⟩ ∈ GL^{pr} × GL^{pr} : goal gl_i has equal or higher priority than the goal gl_j }    (8.24)

This relation is reflexive, transitive and antisymmetric, and partially orders the set GL^{pr}.
The partially ordered sets of profiles PR^a, goals GL^{pr} and strategies ST^{pr} are used by the agent in order to make decisions about the goal to realize and to choose the appropriate strategy for realizing that goal. The basic activities of the agent a are shown in Algorithm 6.
In CoEMAS systems the set of profiles is usually composed of the resource profile (pr_1), the reproduction profile (pr_2), and the migration profile (pr_3):

PR^a = {pr_1, pr_2, pr_3}    (8.25a)

pr_1 ≼ pr_2 ≼ pr_3    (8.25b)

The resource profile has the highest priority, then the reproduction profile, and finally the migration profile.
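The profile-driven control loop of Algorithm 6 can be sketched as a simplified single-agent step function. This is our illustration only: the dictionary representation of the environment, the offspring counter and the way resources are exchanged are illustrative assumptions, not part of the formal model:

```python
def agent_step(r, r_min, r_rep, r_mig, env):
    """One pass over the profiles in priority order
    (resource > reproduction > migration); returns the new resource
    level of the agent, or None if the agent dies."""
    if r == 0:
        return None                       # resource profile: die strategy
    if r < r_min:                         # resource profile: replenish
        taken = min(env["free"], r_min - r)
        env["free"] -= taken
        return r + taken
    if r > r_rep:                         # reproduction profile
        env["offspring"] += 1
        return r - r_rep // 2             # parent passes resource to offspring
    if r > r_mig:                         # migration profile
        env["free"] += r_mig              # migration cost goes to the environment
        return r - r_mig
    return r
```

Called repeatedly while the returned resource level stays positive, this reproduces the overall shape of the while-loop in Algorithm 6 under these simplified thresholds.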

8.3 Co-Evolutionary Multi-Agent Systems for Multi-Objective Optimization

In this section we describe the two co-evolutionary multi-agent systems used in the experiments. Each of these systems uses a different co-evolutionary mechanism: co-operation and predator-prey interactions. Both systems are based on the general model of co-evolution in a multi-agent system described in Section 8.2; in this section only those elements of the systems are described that are specific to these instantiations of the general model. In both systems presented below, real-valued vectors are used as agents' genotypes. Mutation with self-adaptation and intermediate recombination are used as evolutionary operators [1].
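The two evolutionary operators just named can be sketched as follows. This is a generic evolution-strategy formulation; the function names and the learning-rate constants are our illustrative choices, not taken from [1]:

```python
import math
import random

def intermediate_rec(x1, x2):
    """Intermediate recombination: the offspring chromosome is the
    component-wise average of the two parent chromosomes."""
    return [(a + b) / 2.0 for a, b in zip(x1, x2)]

def self_adaptive_mut(x, sigma):
    """Mutation with self-adaptation: each standard deviation sigma_i is
    mutated first, then used as the step size for the corresponding x_i."""
    n = len(x)
    tau = 1.0 / math.sqrt(2.0 * math.sqrt(n))   # per-coordinate learning rate
    tau0 = 1.0 / math.sqrt(2.0 * n)             # global learning rate
    g = random.gauss(0.0, 1.0)                  # common factor for all coordinates
    new_sigma = [s * math.exp(tau0 * g + tau * random.gauss(0.0, 1.0))
                 for s in sigma]
    new_x = [xi + si * random.gauss(0.0, 1.0)
             for xi, si in zip(x, new_sigma)]
    return new_x, new_sigma
```

Because the step sizes are multiplied by a log-normal factor, they remain strictly positive and evolve together with the decision variables.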

8.3.1 Co-Evolutionary Multi-Agent System with Co-Operation Mechanism (CCoEMAS)

The co-evolutionary multi-agent system with co-operation mechanism is defined as follows (see Eq. (8.1)):

CCoEMAS = ⟨E, S, Γ, Ω⟩    (8.26)

The number of species corresponds to the number of criteria (n) of the multi-objective problem being solved, S = {s_1, …, s_n}. Three information types (Ω = {ω_1, ω_2, ω_3}) and one resource type (Γ = {γ}) are used. Information of type ω_1 denotes the nodes to which an agent can migrate. Information of type ω_2 denotes (for an agent of a given species) all agents from other species that are located within the same node at time t. Information of type ω_3 denotes (for a given agent) all agents from the same species located within the same node.

8.3.1.1 Species
The species s is defined as follows:

s = ⟨A^s, SX^s = {sx}, Z^s, C^s⟩    (8.27)

where SX^s is the set of sexes which exist within the species s, Z^s is the set of actions that agents of species s can perform, and C^s is the set of relations of species s with the other species that exist in the CCoEMAS.
Actions
The set of actions Z^s is defined as follows:

Z^s = {die, seek, get, give, accept, seekPartner, clone, rec, mut, migr}    (8.28)

where:
• die is the action of death (an agent dies when it runs out of resources);
• seek is the action of finding a dominated agent from the same species in order to take some resources from it;
• the get action takes some resource from another agent located within the same node, which is dominated by the agent that performs the get action;
• the give action gives some resources to the agent that performs the get action;
• the accept action accepts a partner for reproduction when the amount of resource possessed by the agent is above a given level;
• the seekPartner action seeks a partner for reproduction, such that it comes from another species and has an amount of resource above the minimal level needed for reproduction;
• clone is the action of producing offspring (the parents give some of their resources to the offspring during this action);
• rec is the recombination operator (intermediate recombination is used [1]);
• mut is the mutation operator (mutation with self-adaptation is used [1]);
• migr is the action of migrating from one node to another; during this action the agent loses some of its resource.
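The seek/get/give competition for resources described above can be sketched as follows: an agent looks for a worse agent of the same species at its own node and takes a fixed amount of resource from it. The dictionary representation, the transfer amount, and the assumption that higher values of the species' criterion are better are all our illustrative choices:

```python
def seek_and_get(agents, seeker, amount, fitness):
    """seek: find an agent at the seeker's node that is worse on the
    species' criterion (lower fitness, under our higher-is-better
    assumption); get/give: transfer `amount` of resource from it to the
    seeker. `agents` maps a name to {"node": ..., "r": ...}; returns the
    name of the agent the resource was taken from, or None."""
    me = agents[seeker]
    for name, other in agents.items():
        if (name != seeker and other["node"] == me["node"]
                and fitness(other) < fitness(me)):
            taken = min(amount, other["r"])
            other["r"] -= taken          # give action of the dominated agent
            me["r"] += taken             # get action of the seeker
            return name
    return None
```

An agent that repeatedly fails to find a worse neighbour slowly runs out of resources and eventually executes its die action, which is how selection pressure arises in the system.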

Relations
The set of relations of species s_i with the other species that exist within the system is defined as follows:

C^{s_i} = { −→^{s_i,get−}, −→^{s_i,accept+} }    (8.29)

The first relation models intra-species competition for limited resources:

−→^{s_i,get−} = { ⟨s_i, s_i⟩ }    (8.30)

The second one models co-operation between species:

−→^{s_i,accept+} = { ⟨s_i, s_j⟩ }    (8.31)

8.3.1.2 Agent
Agent a of species s (a ≡ a^s) is defined as follows:

a = ⟨gn^a, Z^a = Z^s, Γ^a = Γ, Ω^a = Ω, PR^a⟩    (8.32)

The genotype of agent a consists of two vectors (chromosomes): x of real-coded decision parameters' values and σ of standard deviations' values, which are used during mutation with self-adaptation. Agents of a given species are evaluated according to only the one criterion associated with this species. Z^a = Z^s (see Eq. (8.28)) is the set of actions which agent a can perform, Γ^a is the set of resource types used by the agent, and Ω^a is the set of information types. The basic activities of agent a in CCoEMAS, with the use of profiles, are presented in Algorithm 7.

Algorithm 7. Basic activities of agent a in CCoEMAS

1   r^γ ← r^γ_init;
2   while r^γ > 0 do
3       activate the profile pr_i ∈ PR^a with the highest priority and with an active goal gl*_j ∈ GL^{pr_i};
4       if pr_1 is activated then
5           if 0 < r^γ < r^γ_min then
6               seek, get;
7               r^γ ← r^γ + r^γ_get;
8           else if r^γ = 0 then
9               die;
10          end
11      else if pr_2 is activated then
12          if r^γ > r^{rep,γ}_min then
13              seekPartner, clone, rec, mut;
14              r^γ ← r^γ − r^{rep,γ}_give;
15          end
16      else if pr_3 is activated then
17          if accept is activated then
18              r^γ ← r^γ − r^{rep,γ}_give;
19          else if give is activated then
20              r^γ ← r^γ − r^γ_get;
21          end
22      else if pr_4 is activated then
23          if r^γ > r^{mig,γ}_min then
24              migr;
25              r^γ ← r^γ − r^{mig,γ}_min;
26          end
27      end
28  end

Profiles
The partially ordered set of profiles includes the resource profile (pr_1), the reproduction profile (pr_2), the interaction profile (pr_3), and the migration profile (pr_4):

PR^a = {pr_1, pr_2, pr_3, pr_4}    (8.33a)

pr_1 ≼ pr_2 ≼ pr_3 ≼ pr_4    (8.33b)

The resource profile is defined in the following way:

pr_1 = ⟨Γ^{pr_1} = Γ, Ω^{pr_1} = {ω_3}, M^{pr_1} = {i^{ω_3}}, ST^{pr_1}, RST^{pr_1} = ST^{pr_1}, GL^{pr_1}⟩    (8.34)

The set of strategies includes two strategies:

ST^{pr_1} = {die, seek, get}    (8.35)

The goal of the pr_1 profile is to keep the amount of resources above the minimal level, or to die when the amount of resources falls to zero. This profile uses the model M^{pr_1} = {i^{ω_3}}.
The reproduction profile is defined as follows:

pr_2 = ⟨Γ^{pr_2} = Γ, Ω^{pr_2} = {ω_2}, M^{pr_2} = {i^{ω_2}}, ST^{pr_2}, RST^{pr_2} = ST^{pr_2}, GL^{pr_2}⟩    (8.36)

The set of strategies includes one strategy:

ST^{pr_2} = {seekPartner, clone, rec, mut}    (8.37)

The only goal of the pr_2 profile is to reproduce. In order to realize this goal the agent can use the strategy of reproduction: seekPartner, clone, rec, mut. During reproduction the agent transfers the amount r^{rep,γ}_give of resources to the offspring.
The interaction profile is defined as follows:

pr_3 = ⟨Γ^{pr_3} = Γ, Ω^{pr_3} = {ω_2, ω_3}, M^{pr_3} = {i^{ω_2}, i^{ω_3}}, ST^{pr_3} = {accept, give}, RST^{pr_3} = ST^{pr_3}, GL^{pr_3}⟩    (8.38)

The goal of the pr_3 profile is to interact with agents from another species with the use of the accept and give strategies.
The migration profile is defined as follows:

pr_4 = ⟨Γ^{pr_4} = Γ, Ω^{pr_4} = {ω_1}, M^{pr_4} = {i^{ω_1}}, ST^{pr_4} = {migr}, RST^{pr_4} = ST^{pr_4}, GL^{pr_4}⟩    (8.39)

The goal of the pr_4 profile is to migrate within the environment. In order to realize such a goal the migration strategy migr is used, which first chooses the node on the basis of the information i^{ω_1} and then realizes the migration. As a result of migrating, the agent loses some of its resources.

8.3.2 Co-Evolutionary Multi-Agent System with Predator-Prey Interactions (PPCoEMAS)

The co-evolutionary multi-agent system with predator-prey interactions (PPCoEMAS) is defined as follows (see Eq. (8.1)):

PPCoEMAS = ⟨E, S, Γ, Ω⟩    (8.40)

The set of species includes two species, preys and predators: S = {prey, pred}. Two information types (Ω = {ω_1, ω_2}) and one resource type (Γ = {γ}) are used. Information of type ω_1 denotes the nodes to which an agent can migrate. Information of type ω_2 denotes the prey agents that are located within a particular node at time t.

8.3.2.1 Prey Species
The prey species (prey) is defined as follows:

prey = ⟨A^{prey}, SX^{prey} = {sx}, Z^{prey}, C^{prey}⟩    (8.41)

where SX^{prey} is the set of sexes which exist within the prey species, Z^{prey} is the set of actions that agents of the prey species can perform, and C^{prey} is the set of relations of the prey species with the other species that exist in the PPCoEMAS.

Actions

The set of actions Z prey is defined as follows:


3
Z prey = die, seek, get, give, accept, seekPartner,
4 (8.42)
clone, rec, mut, migr

where:
• die is the action of death (a prey dies when it runs out of resources);
• seek seeks another prey agent that is dominated by the prey performing this action
or is too close to it in the criteria space;
• get takes some resource from another prey agent located within the same node
which is dominated by the agent performing the get action or is too close to it in
the criteria space;
• give gives some resource to another agent (the one performing the get action);
• accept accepts a partner for reproduction when the amount of resource possessed
by the prey agent is above the given level;
• seekPartner is used to find a partner for reproduction when the amount of resource
is above the given level and the agent can reproduce;
• clone is the action of producing offspring (parents give some of their resources
to the offspring during this action);
• rec is the recombination operator (intermediate recombination is used [1]);
• mut is the mutation operator (mutation with self-adaptation is used [1]);
• migr is the action of migrating from one node to another; during this action the agent
loses some of its resource.
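The resource flow implied by the die/get/give actions can be sketched in Python. This is a minimal bookkeeping sketch; the class name and the default amounts are illustrative assumptions, not taken from the chapter:

```python
class Prey:
    # Minimal resource bookkeeping for the die/get/give actions of Z_prey.
    # The default amounts are illustrative, not taken from the chapter.
    def __init__(self, r_init=50):
        self.r = r_init                       # amount of the single resource gamma

    def alive(self):
        return self.r > 0                     # die: a prey dies when out of resources

    def get_from(self, other, amount=30):
        # get/give pair: transfer resource from a dominated prey in the same node.
        taken = min(amount, other.r)          # cannot take more than the donor owns
        other.r -= taken                      # the other agent performs give
        self.r += taken
        return taken
```

For example, a prey holding 50 units that takes resource from one holding only 10 ends with 60 units, while the donor drops to 0 and dies.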

Relations

The set of relations of prey species with other species that exist within the system is
defined as follows:

C^{prey} = { −−(prey,get−)−→, −−(prey,give+)−→ }   (8.43)
192 R. Dreżewski and L. Siwik

The first relation models intra-species competition for limited resources:

−−(prey,get−)−→ = {⟨prey, prey⟩}   (8.44)

The second one models predator-prey interactions:


−−(prey,give+)−→ = {⟨prey, pred⟩}   (8.45)

8.3.2.2 Predator Species


The predator species (pred) is defined as follows:
pred = ⟨A^{pred}, SX^{pred} = {sx}, Z^{pred}, C^{pred}⟩   (8.46)

Actions

The set of actions Z pred is defined as follows:

Z pred = {seek, getFromPrey, migr} (8.47)

where:
• seek finds the “worst” (according to the criterion associated with the given
predator) prey located within the same node as the predator;
• getFromPrey takes all resources from the chosen prey;
• migr allows the predator to migrate between nodes of the graph H; this results
in losing some of its resources.

Relations

The set of relations of pred species with other species that exist within the system
is defined as follows:

C^{pred} = { −−(pred,getFromPrey−)−→ }   (8.48)

This relation models predator-prey interactions:


−−(pred,getFromPrey−)−→ = {⟨pred, prey⟩}   (8.49)

As a result of the getFromPrey action being performed and all resources being taken
from the selected prey, the prey dies.

8.3.2.3 Prey Agent


Agent a of species prey (a ≡ a prey ) is defined as follows:

a = ⟨gn^a, Z^a = Z^{prey}, Γ^a = Γ, Ω^a = Ω, PR^a⟩   (8.50)


8 A Review of Agent-Based Co-Evolutionary Algorithms 193

Algorithm 8. Basic activities of agent a ≡ a^{prey} in PPCoEMAS

1   r^γ ← r^γ_{init};
2   while r^γ > 0 do
3       activate the profile pr_i ∈ PR^a with the highest priority and with the active goal gl*_j ∈ GL^{pr_i};
4       if pr1 is activated then
5           if 0 < r^γ < r^γ_{min} then
6               seek, get;
7               r^γ ← r^γ + r^γ_{get};
8           else if r^γ = 0 then
9               die;
10          end
11      else if pr2 is activated then
12          if r^γ > r^{rep,γ}_{min} then
13              if seekPartner, clone, rec, mut is performed then
14                  r^γ ← r^γ − r^{clone,γ}_{give};
15              else if accept is performed then
16                  r^γ ← r^γ − r^{accept,γ}_{give};
17              end
18          end
19      else if pr3 is activated then
20          if get is performed by a prey agent then
21              give;
22              r^γ ← r^γ − r^γ_{give};
23          else if get is performed by a predator agent then
24              give;
25              r^γ ← 0;
26          end
27      else if pr4 is activated then
28          if r^γ > r^{mig,γ}_{min} then
29              migr;
30              r^γ ← r^γ − r^{mig,γ}_{min};
31          end
32      end
33  end

The genotype of agent a consists of two vectors (chromosomes): x of real-coded
decision parameters’ values and σ of standard deviations’ values, which are used
during mutation with self-adaptation. Z^a = Z^{prey} (see Eq. (8.42)) is the set of actions
which agent a can perform, Γ^a is the set of resource types used by the agent, and
Ω^a is the set of information types. The basic activities of agent a are presented in Alg. 8.
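Since the genotype carries both x and σ, mutation with self-adaptation [1] can be sketched as the standard evolution-strategy update: the standard deviations are perturbed log-normally first, and the freshly mutated σ values are then used to perturb the decision variables. The learning rates τ and τ′ below follow the common 1/√(2√n) and 1/√(2n) recommendation; treat this as a generic ES sketch rather than the exact operator used in PPCoEMAS:

```python
import math
import random

def mutate_self_adaptive(x, sigma):
    # Evolution-strategy mutation with self-adaptation: mutate the step sizes
    # first (log-normal update), then perturb each decision variable with its
    # own freshly mutated standard deviation.
    n = len(x)
    tau = 1.0 / math.sqrt(2.0 * math.sqrt(n))   # per-coordinate learning rate
    tau_prime = 1.0 / math.sqrt(2.0 * n)        # global learning rate
    g = random.gauss(0.0, 1.0)                  # one draw shared by all sigmas
    new_sigma = [s * math.exp(tau_prime * g + tau * random.gauss(0.0, 1.0))
                 for s in sigma]
    new_x = [xi + si * random.gauss(0.0, 1.0) for xi, si in zip(x, new_sigma)]
    return new_x, new_sigma
```

The log-normal update keeps every standard deviation strictly positive, which is what lets the step sizes adapt multiplicatively over generations.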

Profiles

The partially ordered set of profiles includes resource profile (pr1 ), reproduction
profile (pr2 ), interaction profile (pr3 ), and migration profile (pr4 ):

PRa = {pr1 , pr2 , pr3 , pr4 } (8.51a)


pr1 ≻ pr2 ≻ pr3 ≻ pr4 (8.51b)

The resource profile is defined in the following way:


pr1 = ⟨Γ^{pr1} = Γ, Ω^{pr1} = {ω2}, M^{pr1} = {i^{ω2}}, ST^{pr1}, RST^{pr1} = ST^{pr1}, GL^{pr1}⟩   (8.52)

The set of strategies includes three strategies:

ST pr1 = {die, seek, get} (8.53)

The goal of the pr1 profile is to keep the amount of resources above the minimal
level or to die when the amount of resources falls to zero. This profile uses the
model M^{pr1} = {i^{ω2}}.
The reproduction profile is defined as follows:
pr2 = ⟨Γ^{pr2} = Γ, Ω^{pr2} = {ω2}, M^{pr2} = {i^{ω2}}, ST^{pr2}, RST^{pr2} = ST^{pr2}, GL^{pr2}⟩   (8.54)

The set of strategies includes five strategies:

ST pr2 = {seekPartner, clone, rec, mut, accept} (8.55)

The only goal of the pr2 profile is to reproduce. In order to realize this goal the agent
can use the reproduction strategies seekPartner, clone, rec, and mut, or can accept
partners for reproduction (accept).
The interaction profile is defined as follows:
pr3 = ⟨Γ^{pr3} = Γ, Ω^{pr3} = ∅, M^{pr3} = ∅, ST^{pr3} = {give}, RST^{pr3} = ST^{pr3}, GL^{pr3}⟩   (8.56)

The goal of the pr3 profile is to interact with predators and preys with the use of
the give strategy.
The migration profile is defined as follows:
pr4 = ⟨Γ^{pr4} = Γ, Ω^{pr4} = {ω1}, M^{pr4} = {i^{ω1}}, ST^{pr4} = {migr}, RST^{pr4} = ST^{pr4}, GL^{pr4}⟩   (8.57)

The goal of the pr4 profile is to migrate within the environment. In order to realize
this goal the migration strategy is used, which first chooses the destination node and
then performs the migration. As a result of migrating, the prey loses some amount of
resource.
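A minimal sketch of this migration strategy, assuming the agent object exposes a resource level r and that the destination is drawn uniformly from the nodes known via ω1-type information (the chapter does not fix a selection rule, so random choice is only a placeholder):

```python
import random

class Agent:
    # Stub with just the resource level; enough to illustrate migration cost.
    def __init__(self, r):
        self.r = r

def migrate(agent, known_nodes, cost):
    # Choose a destination node from the agent's omega_1 information and pay
    # the migration cost; refuse to migrate when resources are insufficient.
    if not known_nodes or agent.r <= cost:
        return None
    destination = random.choice(known_nodes)
    agent.r -= cost                           # migrating loses some resource
    return destination
```

The cost check mirrors line 28 of Alg. 8: an agent only migrates while it can afford the resource loss.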

Algorithm 9. Basic activities of agent a ≡ a^{pred} in PPCoEMAS

1   r^γ ← r^γ_{init};
2   while r^γ > 0 do
3       activate the profile pr_i ∈ PR^a with the highest priority and with the active goal gl*_j ∈ GL^{pr_i};
4       if pr1 is activated then
5           if 0 < r^γ < r^γ_{min} then
6               seek, getFromPrey;
7               r^γ ← r^γ + r^{prey,γ}_{get};   /* r^{prey,γ}_{get} are all resources of the prey agent that was chosen by a */
8           end
9       else if pr2 is activated then
10          if r^γ > r^{mig,γ}_{min} then
11              migr;
12              r^γ ← r^γ − r^{mig,γ}_{min};
13          end
14      end
15  end

8.3.2.4 Predator Agent


Agent a of species pred is defined analogously to the prey agent (see Eq. (8.50)). There
are two main differences. The genotype of a predator agent consists only of the
information about the criterion associated with the given agent. The set of profiles
consists of only two profiles, the resource profile (pr1) and the migration profile (pr2):
PR^a = {pr1, pr2}, where pr1 ≻ pr2. The basic activities of agent a are presented in
Alg. 9.
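The predator's resource acquisition (lines 5–7 of Alg. 9) can be sketched as follows. The PreyStub class, and the assumption that fitness is a tuple indexed by the predator's criterion with larger values meaning worse (minimisation), are illustrative:

```python
class PreyStub:
    # Stub carrying a fitness tuple and a resource level.
    def __init__(self, fitness, r):
        self.fitness = fitness
        self.r = r

class Predator:
    def __init__(self, criterion):
        self.criterion = criterion            # index of the associated criterion
        self.r = 0.0

    def get_from_prey(self, preys):
        # seek: the "worst" prey w.r.t. this predator's criterion (minimisation,
        # so the largest objective value); getFromPrey: take all its resources.
        worst = max(preys, key=lambda p: p.fitness[self.criterion])
        self.r += worst.r
        worst.r = 0                           # the chosen prey dies (r = 0)
        return worst
```

Setting the victim's resource level to zero is exactly what kills it: a prey's activity loop (Alg. 8) terminates as soon as r^γ reaches 0.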

Profiles

The resource profile is defined in the following way:


pr1 = ⟨Γ^{pr1} = Γ, Ω^{pr1} = {ω2}, M^{pr1} = {i^{ω2}}, ST^{pr1} = {seek, getFromPrey}, RST^{pr1} = ST^{pr1}, GL^{pr1}⟩   (8.58)

The goal of the pr1 profile is to keep the amount of resource above the minimal level
with the use of the seek and getFromPrey strategies.
The migration profile is defined as follows:
pr2 = ⟨Γ^{pr2} = Γ, Ω^{pr2} = {ω1}, M^{pr2} = {i^{ω1}}, ST^{pr2} = {migr}, RST^{pr2} = ST^{pr2}, GL^{pr2}⟩   (8.59)

The goal of the pr2 profile is to migrate within the environment. In order to realize this
goal the migration strategy migr is used. The realization of the migration strategy
results in losing some of the resource possessed by the agent.

8.4 Experimental Results


The agent-based co-evolutionary approaches for multi-objective optimization
presented formally in Section 8.3 have been tentatively assessed. Preliminary results
obtained during the experiments were presented in some of our previous papers; in
this section they are briefly summarized.

8.4.1 Test Suite, Performance Metric and State-of-the-Art Algorithms
Firstly, a slightly modified version of the so-called Laumanns multi-objective problem
was used, which is defined as follows [15, 18]:

           ⎧ f1(x) = x1² + x2²
Laumanns = ⎨ f2(x) = (x1 + 2)² + x2²                                  (8.60)
           ⎩ −5 ≤ x1, x2 ≤ 5

Secondly, the so-called Kursawe problem was used. Its definition is as follows [18]:

          ⎧ f1(x) = Σ_{i=1}^{n−1} (−10 exp(−0.2 √(x_i² + x_{i+1}²)))
Kursawe = ⎨ f2(x) = Σ_{i=1}^{n} (|x_i|^{0.8} + 5 sin x_i³)            (8.61)
          ⎩ n = 3,  −5 ≤ x1, x2, x3 ≤ 5
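Both benchmark functions are straightforward to state in code. The sketch below assumes minimisation of both objectives and follows the definitions above directly:

```python
import math

def laumanns(x):
    # Modified Laumanns problem, Eq. (8.60): two objectives on [-5, 5]^2.
    x1, x2 = x
    return (x1**2 + x2**2, (x1 + 2)**2 + x2**2)

def kursawe(x):
    # Kursawe problem, Eq. (8.61), for n = len(x) decision variables.
    f1 = sum(-10.0 * math.exp(-0.2 * math.sqrt(x[i]**2 + x[i + 1]**2))
             for i in range(len(x) - 1))
    f2 = sum(abs(xi)**0.8 + 5.0 * math.sin(xi**3) for xi in x)
    return (f1, f2)
```

At the origin, for instance, kursawe([0, 0, 0]) evaluates to (−20, 0), since each of the two exponential terms contributes −10.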

In one of the experiments discussed briefly in this chapter the problem of building an
effective portfolio was used. The assumed definition, as well as the true Pareto frontier
for this problem, can be found in [16].
During our experiments well-known and commonly used test suites were also
employed, inter alia the ZDT test suite ([19, p. 57–63], [21], [5, p. 356–362],
[4, p. 194–199]).

[Figure: the f1–f2 objective space, illustrating drifting towards the true Pareto frontier and dispersing solutions over the whole approximation of the true Pareto frontier.]

Fig. 8.3 Two goals of multi-objective optimization



The two main distinguishing features of a high-quality solution of a MOOP are
closeness to the true Pareto frontier and dispersion of the found non-dominated
solutions over the whole (approximation of the) Pareto frontier (see Figure 8.3).
Although using only one single measure when assessing the effectiveness of
(evolutionary) algorithms for multi-objective optimization is not enough [23], the
Hypervolume Ratio (HVR) measure [20] allows for estimating both of these aspects;
the discussion and presentation of the obtained results in this chapter is therefore
based on this very measure.
The hypervolume (HV), and the hypervolume ratio (HVR), describe the area covered
by the solutions of the obtained result set. For each solution, a hypercube is computed
with respect to a fixed reference point. To evaluate the hypervolume ratio, the
hypervolume value of the obtained set is normalized by the hypervolume value
computed for the true Pareto frontier. HV and HVR are defined as follows:
HV = v(∪_{i=1}^{N} v_i)   (8.62a)

HVR = HV(PF*) / HV(PF)   (8.62b)

where v_i is the hypercube computed for the i-th solution, PF* denotes the obtained
Pareto frontier, and PF is the true Pareto frontier.
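For two objectives under minimisation the hypervolume can be computed exactly with a simple sweep; the reference point is an assumption the user supplies (it must be dominated by every front member), and HVR then divides by the hypervolume of the true frontier:

```python
def hypervolume_2d(front, ref):
    # Exact hypervolume of a non-dominated 2-objective front (minimisation)
    # w.r.t. a reference point dominated by every point of the front.
    pts = sorted(front)                       # ascending f1, hence descending f2
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)  # rectangular slab added by this point
        prev_f2 = f2
    return hv

def hvr(front, true_front, ref):
    # Hypervolume ratio, Eq. (8.62b): obtained front vs. true Pareto frontier.
    return hypervolume_2d(front, ref) / hypervolume_2d(true_front, ref)
```

The sweep works because on a non-dominated 2-D front, sorting by the first objective makes the second objective strictly decreasing, so each point adds one disjoint rectangle.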
To assess PPCoEMAS and CCoEMAS quantitatively, a comparison with results
obtained by state-of-the-art algorithms has to be made. The results obtained by the
approaches discussed in this chapter are therefore compared with those obtained by
the NSGA-II [6, 7] and SPEA2 [12, 22] algorithms, since these are among the most
efficient and most commonly used evolutionary multi-objective optimization
algorithms. Additionally, the obtained results are also compared with the NPGA [13]
and PPES [15] algorithms.

8.4.2 A Glance at Assessing the Co-operation Based Approach (CCoEMAS)
The co-evolutionary multi-agent system with the co-operation mechanism
(CCoEMAS), presented in Section 8.3.1, was assessed tentatively using, inter alia,
the ZDT test suite. The population sizes assumed during the presented experiments
were as follows: CCoEMAS—200, NSGA-II—300 and SPEA2—100. Selected
parameters and their values assumed during those experiments are as follows:
r^γ_{init} = 50 (the level of resources possessed initially by an individual just after its
creation), r^γ_{get} = 30 (the amount of resources transferred in the case of
domination), r^{rep,γ}_{min} = 30 (the level of resources required for reproduction),
and p_mut = 0.5 (mutation probability).

[Figure: plots of HVR measure value against time in seconds for NSGA2, CCoEMAS and SPEA.]

Fig. 8.4 HVR values obtained by CCoEMAS, SPEA2, and NSGA-II run against Zitzler’s
problems ZDT1 (a) and ZDT2 (b) [11]

As may be seen from the analysis of the results presented in Figures 8.4 and 8.5,
CCoEMAS, being a less complex algorithm than NSGA-II or SPEA2, initially allows
for obtaining better solutions, but with time the classical algorithms—especially
NSGA-II—become the better alternatives. It is, however, worth mentioning that in the case of the

[Figure: plots of HVR measure value against time in seconds for NSGA2, CCoEMAS and SPEA2.]

Fig. 8.5 HVR values obtained by CCoEMAS, SPEA2, and NSGA-II run against Zitzler’s
problems ZDT3 (a), ZDT4 (b) and ZDT6 (c) [11]

ZDT4 problem, this characteristic seems to be reversed: initially the classical
algorithms seem to be the better alternatives, but finally CCoEMAS allows for
obtaining better solutions (observed as higher values of the HVR metric). A deeper
analysis of the results obtained during the presented experiments can be found in [11].

8.4.3 A Glance at Assessing the Predator-Prey Based Approach (PPCoEMAS)
In this section selected results regarding the co-evolutionary multi-agent system with
predator-prey interactions presented in Section 8.3.2 are given. Among others,
PPCoEMAS was assessed with the use of some of the classical benchmarking
problems presented in Section 8.4.1: firstly the Laumanns [15] and Kursawe [14] test
problems were used. Classical algorithms other than NSGA-II and SPEA2 were also
used during the experiments with the predator-prey approach; this time the
predator-prey evolutionary strategy (PPES) and the niched-Pareto genetic algorithm
(NPGA) were used. In this section only a summary of the obtained results is given;
a more detailed analysis can be found in [9, 16].

[Figure: Pareto frontier approximations in the f1–f2 space for the Laumanns problem.]

Fig. 8.6 Pareto frontier approximations obtained by PPCoEMAS (a) and PPES (b) algo-
rithms for the Laumanns problem after 6000 steps [9]

[Figure: bar plots of HV and HVR values for NPGA, PPES and PPCoEMAS.]

Fig. 8.7 The values of the HV (a) and HVR (b) measure for the Laumanns problem obtained by
PPCoEMAS, PPES and NPGA after 6000 steps

In the very first experiments with PPCoEMAS the relatively simple Laumanns test
problem was used. Figure 8.6 presents the Pareto frontier approximations obtained
by the PPCoEMAS and PPES algorithms, and Figure 8.7 presents the values of the
HV and HVR metrics for all three compared algorithms (PPCoEMAS, PPES and
NPGA). As can be seen, the differences between the analyzed algorithms are not
very distinct; however, the proposed PPCoEMAS system seems to be the best
alternative.
The second problem used was the more demanding multi-objective Kursawe
problem, with both its Pareto set and Pareto frontier disconnected. Figure 8.9
presents the final approximations of the Pareto frontier obtained by PPCoEMAS and
by the reference algorithms after 6000 time steps. As one may notice, there is no
doubt that PPCoEMAS is definitely the best alternative, since it is able to obtain a
Pareto frontier that is located very close to the model solution, that is very well
dispersed, and what

[Figure: bar plots of HV and HVR values for NPGA, PPES and PPCoEMAS.]

Fig. 8.8 The values of the HV (a) and HVR (b) measure for the Kursawe problem obtained by PP-
CoEMAS, PPES and NPGA after 6000 steps

is also very important—it is more numerous than the PPES- and NPGA-based
solutions. The above observations are fully confirmed by the values of the HV and
HVR metrics presented in Figure 8.8.
The proposed co-evolutionary multi-agent system with predator-prey interactions was
also assessed on the problem of building an effective portfolio. In this case each
individual in the prey population is represented as a p-dimensional vector, where
dimension i represents the percentage participation of the i-th (i ∈ 1 . . . p) share in the
whole portfolio.
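One practical detail of this representation is that an arbitrary real-valued genotype must map to valid percentage participations, i.e. non-negative weights summing to one. A simple repair sketch follows; the clipping-and-renormalising rule is an assumption for illustration, not necessarily the one used in [16]:

```python
def to_portfolio(genotype):
    # Map a p-dimensional real vector to portfolio weights: clip negatives
    # to zero and renormalise so the participations sum to 1.
    clipped = [max(0.0, g) for g in genotype]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(genotype)] * len(genotype)   # fall back to uniform
    return [c / total for c in clipped]
```

With such a repair in place, the evolutionary operators can act on unconstrained real vectors while every evaluated individual remains a feasible portfolio.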
During the presented experiments, Warsaw Stock Exchange quotations from
2003-01-01 until 2005-12-31 were taken into consideration. The portfolio consists of
the following three (experiment I) or seventeen (experiment II) stocks quoted on the
Warsaw Stock Exchange—in experiment I: RAFAKO, PONARFEH, PKOBP; in
experiment II: KREDYTB, COMPLAND, BETACOM, GRAJEWO, KRUK,
COMARCH, ATM, HANDLOWY, BZWBK, HYDROBUD, BORYSZEW,

[Figure: Pareto frontier approximations in the f1–f2 space for the Kursawe problem.]

Fig. 8.9 Pareto frontier approximations for the Kursawe problem obtained by PPCoEMAS (a),
PPES (b) and NPGA (c) after 6000 steps [9]

ARKSTEEL, BRE, KGHM, GANT, PROKOM, BPHPBK. As the market index,
WIG20 has been taken into consideration.
Figure 8.10 presents the final Pareto frontiers obtained using the PPCoEMAS,
NPGA and PPES algorithms after 1000 steps in experiment I. As one may notice, in
this case the frontier obtained by PPCoEMAS is more numerous than the NPGA-based
one and as numerous as the PPES-based one. Unfortunately, in this case the diversity
of the population in the PPCoEMAS approach is visibly worse than in the case of the
NPGA- or PPES-based frontiers.

[Figure: Pareto frontier approximations in the risk–profit space for the 3-stock portfolio problem.]

Fig. 8.10 Pareto frontier approximations after 1000 steps obtained by PPCoEMAS (a), PPES
(b), and NPGA (c) for building an effective portfolio consisting of 3 stocks [16]

A similar situation can also be observed in Figure 8.11, which presents the Pareto
frontiers obtained by PPCoEMAS, NPGA and PPES—but this time the portfolio
being optimized consists of 17 shares. This time, too, the PPCoEMAS-based frontier
is quite numerous and quite close to the true Pareto frontier, but the tendency to
focus solutions around only selected part(s) of the whole frontier is very distinct.
An explanation of the observed tendency can be found in [9, 16]; on the very

[Figure: Pareto frontier approximations in the risk–profit space for the 17-stock portfolio problem.]

Fig. 8.11 Pareto frontier approximations after 1000 steps obtained by PPCoEMAS (a), PPES
(b), and NPGA (c) for building an effective portfolio consisting of 17 stocks [16]

general level it can be said that it is caused by the stagnation of the evolution process
in PPCoEMAS. Hypothetical non-dominated average portfolios for experiments I
and II are presented in Figure 8.12 and Figure 8.13, respectively (in Figure 8.13 the
shares are presented from left to right in the order in which they were listed above).

[Figure: bar charts of percentage share in the portfolio for RAFAKO, PONAR and PKOBP after 1 step (a) and after 900 steps (b).]

Fig. 8.12 Effective portfolio consisting of three stocks proposed by PPCoEMAS [16]

[Figure: bar charts of percentage share in the portfolio for the seventeen stocks after 1 step (a) and after 900 steps (b).]

Fig. 8.13 Effective portfolio consisting of seventeen stocks proposed by PPCoEMAS [16]

8.5 Summary and Conclusions


Agent-based (co-)evolutionary algorithms have already been applied in many
different domains, including multi-modal optimization, multi-objective optimization,
and financial problems. Agent-based models of evolutionary algorithms allow for
mixing and simultaneously using different bio-inspired techniques and algorithms
within one coherent agent model, and for adding new biologically and socially
inspired operators and mechanisms in a very natural way. They also allow for
parallel and decentralized computations without any additional changes, because
these models are decentralized and use asynchronous computations.
In this chapter we have presented two selected agent-based co-evolutionary
algorithms for multi-objective optimization—one of them uses a co-operative
mechanism and the other a predator-prey mechanism. Formal models of these
systems, as well as the results of experiments with standard multi-objective test
problems and the financial problem of multi-objective portfolio optimization, were
presented. The results of the experiments show that agent-based algorithms may
obtain quite satisfactory results, comparable to or, in the case of some problems,
even better than state-of-the-art multi-objective evolutionary algorithms; however,
there is of course still room for improvement and further research. The presented
results also lead to the conclusion that no existing evolutionary algorithm for
multi-objective optimization can alone solve all problems best—there is, and always
will be, space for new algorithms and improvements suited to particular problems.
Future research on the agent-based models will concentrate on improvements
to the already proposed algorithms as well as on new algorithms and techniques.
Examples of new techniques which may be incorporated into agent-based models of
evolutionary algorithms include cultural and immunological mechanisms. Another
direction of development would be adding a social and economic layer to the existing
biological one and using such agent-based models for the modeling and simulation of
complex and emergent phenomena from social and economic life.

References
1. Bäck, T., Fogel, D., Michalewicz, Z. (eds.): Handbook of Evolutionary Computation.
IOP Publishing and Oxford University Press (1997)
2. Cetnarowicz, K., Kisiel-Dorohinicki, M., Nawarecki, E.: The application of evolution
process in multi-agent world to the prediction system. In: Tokoro, M. (ed.) Proceedings
of the 2nd International Conference on Multi-Agent Systems (ICMAS 1996). AAAI
Press, Menlo Park (1996)
3. Coello, C., Lamont, G., Van Veldhuizen, D.: Evolutionary Algorithms for Solving Multi-
Objective Problems, 2nd edn. Springer, New York (2007)
4. Coello Coello, C., Van Veldhuizen, D., Lamont, G.: Evolutionary algorithms for solv-
ing multi-objective problems, 2nd edn. Genetic and evolutionary computation. Springer,
Heidelberg (2007)
5. Deb, K.: Multi-Objective Optimization using Evolutionary Algorithms. John Wiley &
Sons, Chichester (2001)

6. Deb, K., Agrawal, S., Pratab, A., Meyarivan, T.: A Fast Elitist Non-Dominated Sorting
Genetic Algorithm for Multi-Objective Optimization: NSGA-II. In: Deb, K., Rudolph,
G., Lutton, E., Merelo, J.J., Schoenauer, M., Schwefel, H.-P., Yao, X. (eds.) PPSN 2000.
LNCS, vol. 1917, pp. 849–858. Springer, Heidelberg (2000),
citeseer.ist.psu.edu/article/deb00fast.html
7. Deb, K., Pratab, A., Agarwal, S., Meyarivan, T.: A fast and elitist multi-objective ge-
netic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2), 181–197
(2002)
8. Dreżewski, R.: A model of co-evolution in multi-agent system. In: Mařı́k, V., Müller, J.P.,
Pěchouček, M. (eds.) CEEMAS 2003. LNCS (LNAI), vol. 2691, pp. 314–323. Springer,
Heidelberg (2003)
9. Dreżewski, R., Siwik, L.: The application of agent-based co-evolutionary system with
predator-prey interactions to solving multi-objective optimization problems. In: Proceed-
ings of the 2007 IEEE Symposium Series on Computational Intelligence. IEEE, Los
Alamitos (2007)
10. Dreżewski, R., Siwik, L.: Agent-based co-evolutionary techniques for solving multi-
objective optimization problems. In: Kosiński, W. (ed.) Advances in Evolutionary Al-
gorithms. IN-TECH, Vienna (2008)
11. Dreżewski, R., Siwik, L.: Agent-based co-operative co-evolutionary algorithm for multi-
objective optimization. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M.
(eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 388–397. Springer, Heidelberg
(2008)
12. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: Improving the strength Pareto evolutionary
algorithm for multiobjective optimization. In: Giannakoglou, K., et al. (eds.) Evolution-
ary Methods for Design, Optimisation and Control with Application to Industrial Prob-
lems (EUROGEN 2001), pp. 95–100. International Center for Numerical Methods in
Engineering (CIMNE) (2002)
13. Horn, J., Nafpliotis, N., Goldberg, D.E.: A niched Pareto genetic algorithm for multi-
objective optimization. In: Proceedings of the First IEEE Conference on Evolutionary
Computation, IEEE World Congress on Computational Intelligence, vol. 1, pp. 82–87.
IEEE Service Center, Piscataway (1994),
citeseer.ist.psu.edu/horn94niched.html
14. Kursawe, F.: A variant of evolution strategies for vector optimization. In: Schwefel, H.-
P., Männer, R. (eds.) PPSN 1990. LNCS, vol. 496, pp. 193–197. Springer, Heidelberg
(1991), citeseer.ist.psu.edu/kursawe91variant.html
15. Laumanns, M., Rudolph, G., Schwefel, H.P.: A spatial predator-prey approach to multi-
objective optimization: A preliminary study. In: Eiben, A.E., Bäck, T., Schoenauer, M.,
Schwefel, H.-P. (eds.) PPSN 1998. LNCS, vol. 1498, p. 241. Springer, Heidelberg (1998)
16. Siwik, L., Dreżewski, R.: Co-evolutionary multi-agent system for portfolio optimization.
In: Brabazon, A., O’Neill, M. (eds.) Natural Computation in Computational Finance, pp.
273–303. Springer, Heidelberg (2008)
17. Spears, W.: Crossover or mutation? In: Proceedings of the 2nd Foundations of Genetic
Algorithms Workshop, pp. 221–237. Morgan Kaufmann, San Francisco (1992)
18. Van Veldhuizen, D.A.: Multiobjective evolutionary algorithms: Classifications, analyses
and new innovations. PhD thesis, Graduate School of Engineering of the Air Force Insti-
tute of Technology Air University (1999)
19. Zitzler, E.: Evolutionary algorithms for multiobjective optimization: methods and appli-
cations. PhD thesis, Swiss Federal Institute of Technology, Zurich (1999)

20. Zitzler, E., Thiele, L.: An evolutionary algorithm for multiobjective optimization: The
strength Pareto approach. Tech. Rep. 43, Swiss Federal Institute of Technology, Zurich,
Gloriastrasse 35, CH-8092 Zurich, Switzerland (1998),
citeseer.ist.psu.edu/article/zitzler98evolutionary.html
21. Zitzler, E., Deb, K., Thiele, L.: Comparison of Multiobjective Evolutionary Algorithms:
Empirical Results. Evolutionary Computation 8(2), 173–195 (2000)
22. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: Improving the strength Pareto evolutionary
algorithm. Tech. Rep. TIK-Report 103, Computer Engineering and Networks Labora-
tory (TIK), Department of Electrical Engineering, Swiss Federal Institute of Technology
(ETH) Zurich, ETH Zentrum, Gloriastrasse 35, CH-8092 Zurich, Switzerland (2001)
23. Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C.M., da Fonseca, V.G.: Performance
assessment of multiobjective optimizers: An analysis and review. IEEE Transactions on
Evolutionary Computation 7(2), 117–132 (2003)
Chapter 9
A Game Theory-Based Multi-Agent System for
Expensive Optimisation Problems

Abdellah Salhi and Özgun Töreyen

Abstract. This paper is concerned with the development of a novel approach to
solving expensive optimisation problems. The approach relies on game theory and a
multi-agent framework in which a number of existing algorithms, cast as agents, are
deployed with the aim of solving the problem in hand as efficiently as possible. The
key factor for the success of this approach is a dynamic resource allocation biased
toward promising algorithms on the given problem. This is achieved by allowing
the agents to play a cooperative-competitive game, the outcomes of which are used
to decide which algorithms, if any, drop out of the list of solver-agents and which
remain in use. A successful implementation of this framework results in the most
suited algorithm(s) for the given problem being predominantly used on the available
computing platform. In other words, it guarantees the best use of the resources, both
algorithms and hardware, with the by-product being the best approximate solution
for the problem given the available resources. The resulting Game Theory-based
Multi-Agent System (GTMAS) is tested on a standard collection of TSP problems,
and the results are included.

9.1 Introduction
Modelling problems arising in real-world applications, taking into account the
nonlinearity and the combinatorial aspects of solution sets, often leads to
expensive-to-solve optimisation problems; they are inherently intractable. Indeed,
even checking a given solution for optimality is NP-hard [10, 17, 32]. It is, therefore,
not reasonable, in general, to expect the optimum solution to be found in acceptable
time. What one can, almost always, only expect is an approximate solution, the
quality of which is crucial to its potential use.
Abdellah Salhi · Özgun Töreyen
Department of Mathematical Sciences, The University of Essex, Colchester CO4 3SQ, UK
e-mail: as@essex.ac.uk, otoreyen@aselsan.com.tr

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 211–232.
springerlink.com c Springer-Verlag Berlin Heidelberg 2010
212 A. Salhi and Ö. Töreyen

It is well known that, at least in the case of stochastic algorithms, the quality of
the approximate solution (or some confidence the user may have in it) is proportional
to the time spent in the search for it [22, 23, 24]. As many applications have a
time constraint—a deadline beyond which a better approximate solution is of no
use—it is essential that all available resources (software and hardware) be used as
well as possible, to ensure that the best approximate solution, under the
circumstances, is obtained. This is what the novel approach suggested here attempts
to achieve. To do so, it must:
1. find which algorithm(s), in the suite of algorithms, is the most appropriate for the
given instance of the expensive optimisation problem;
2. replicate this algorithm(s) on all available processor nodes in a parallel environ-
ment, or allocate to it all of the remaining CPU time if a single-processor, or
sequential, environment is used.
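A naive sequential realisation of points (1) and (2) probes every solver briefly and then hands the remaining budget to the best performer. Here `evaluate(alg, rounds)` is an assumed callback that runs the algorithm for the given number of rounds and returns its best objective value (minimisation); GTMAS itself makes this decision through the game described below, so this is only a baseline sketch:

```python
def allocate_budget(algorithms, evaluate, total_rounds=100, probe_rounds=5):
    # Probe each algorithm for a few rounds, keep the most promising one,
    # and spend the remaining budget exclusively on it.
    scores = {alg: evaluate(alg, probe_rounds) for alg in algorithms}
    best = min(scores, key=scores.get)                 # minimisation
    remaining = total_rounds - probe_rounds * len(algorithms)
    return best, evaluate(best, remaining)
```

The weakness of this fixed probe-then-commit scheme—committing on the basis of an early snapshot—is precisely what a dynamic, game-driven allocation is meant to avoid.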
Point (1) above is dealt with through measuring the performance of the algorithms
used. Point (2) is dealt with via the implementation of a cooperative/competitive
game of the Iterated Prisoners’ Dilemma (IPD) type [4, 7, 16, 20, 25, 27]. Although
other paradigms of cooperative/competitive behaviour, such as the Stag Hunt game
[7], could be used, the IPD seems appropriate. Note that implementing cooperation is
fairly straightforward, while implementing competition is not. We believe that
competition is at least as important as cooperation between agents for an effective
search. To the best of our knowledge, this is the first time implementing competition
for optimisation purposes has been attempted. We use payoff matrices as a handle to
manipulate it. Two algorithms (agents) cooperate by exchanging their current
solutions; they compete by not exchanging their solutions. Note that, intuitively,
cooperation may lead to early convergence to a local optimum, by virtue of
propagating a given solution potentially to all algorithms and having all of them
search the same area. Competition, on the other hand, may lead to good coverage of
the search space by virtue of not sharing solutions, i.e. helping algorithms “stay
away” from each other and, therefore, potentially explore different areas of the
search space.
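The cooperate-by-sharing / compete-by-withholding rule can be made concrete with a single IPD round. The payoff numbers below are the textbook values T = 5, R = 3, P = 1, S = 0, not the matrices actually tuned in GTMAS:

```python
# Classic PD payoffs (row score, column score): T=5 > R=3 > P=1 > S=0.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play_round(move_a, move_b, best_a, best_b):
    # 'C' = cooperate by sharing the current best solution, 'D' = defect by
    # withholding it.  Returns both payoffs plus whatever solution each agent
    # received from the other (None when the other defected).
    seen_by_b = best_a if move_a == "C" else None
    seen_by_a = best_b if move_b == "C" else None
    score_a, score_b = PAYOFF[(move_a, move_b)]
    return score_a, score_b, seen_by_a, seen_by_b
```

Iterating such rounds and accumulating the scores gives exactly the performance signal needed to decide which solver-agents stay in the game.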
Although the study presents the prototype of a generic solver that can involve
any number of solver algorithms and run on any computing platform, here a system
with only two search algorithms, implemented sequentially, is investigated. This
simplified model, however, has the inherent complexities of a system with many
more agents and should show how good or otherwise the general system can be for
expensive optimisation.
Note that the generic nature of this approach makes it applicable in any discipline
where problem solving is involved and more than one solution method is available.
This document is organised as follows. In Section 9.2, a brief literature review
is given. In Sections 9.3 and 9.4, the design and implementation of the system is
explained. Section 9.5 explains how the system is applied to solve the Travelling
Salesman Problem. The results are presented in Section 9.6. Finally, conclusions
are drawn and future research prospects are outlined in Section 9.7.
9 A Game Theory-Based Multi-Agent System 213

9.2 Background
In the following, a brief review of the three main topics involved, i.e. optimisation, the IPD and agent systems, will be given.

9.2.1 Optimisation
The general optimisation problem is of the global type, constrained, nonlinear and
involves mixed variables, i.e. both discrete and continuous variables. However, many optimisation problems encountered in real applications do not have all of these characteristics, but are still intractable. The 0-1-Knapsack problem, for
example, involves only binary variables and has one single constraint, but is still
NP-Hard. The general optimisation problem can be cast in the following form.
Let f be a function from Rⁿ to R and A ⊂ Rⁿ; then find x∗ ∈ A such that f(x∗) ≤ f(x) for all x ∈ A.
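The definition can be illustrated with a deliberately naive Python sketch that enumerates a finite candidate set A; the helper `minimise_on_grid` and the toy objective below are ours, for illustration only (the chapter's own code is in Matlab).

```python
import itertools

def minimise_on_grid(f, grids):
    """Brute-force search for x* with f(x*) <= f(x) over the finite
    candidate set A, here the Cartesian product of the given grids."""
    best_x, best_f = None, float("inf")
    for x in itertools.product(*grids):
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

# Toy instance: f(x) = (x1 - 1)^2 + (x2 + 2)^2 over integer points.
f = lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2
grid = list(range(-5, 6))
x_star, f_star = minimise_on_grid(f, [grid, grid])  # x* = (1, -2), f* = 0
```

Such enumeration is of course hopeless for the intractable instances discussed above; it only makes the x∗ condition concrete.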

9.2.2 Game Theory: The Iterated Prisoners’ Dilemma


The Prisoners’ Dilemma (PD), brought to attention by Merrill Flood of the Rand Corporation in 1951 and later formulated by Al Tucker [8, 11], is a popular paradigm for the problem of cooperation. It can be formulated as follows.

Table 9.1 Formulation of PD: The Payoff Matrix

Player2
C D
Player1 C R=3,R=3 S=0,T=5
D T=5,S=0 P=1,P=1

In the payoff matrix of Table 9.1, actions C and D stand for ‘Cooperate’ and ‘Defect’, and payoffs R, P, T, and S stand for ‘Reward’, ‘Punishment’, ‘Temptation’, and ‘Sucker’s’ payoff, respectively. This payoff matrix shows that defecting is attractive to each player for two reasons. First, it leads to a greater payoff (T = 5) in case the other player cooperates (S = 0). Second, it is a safe move because neither knows what the other’s move will be. So, to rational players, defecting is the best
choice. But, if both players choose to defect then it leads to a worse payoff (P = 1)
as compared to cooperating (R = 3). That is the dilemma.
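The dilemma can be checked mechanically from the payoff matrix; the following Python sketch (the dictionary encoding and the helper `dominates` are ours) verifies that defection dominates cooperation while mutual defection still pays less than mutual cooperation:

```python
# Payoff matrix of Table 9.1, keyed by (row action, column action);
# values are the (row player, column player) payoffs.
PD = {
    ("C", "C"): (3, 3),  # R, R
    ("C", "D"): (0, 5),  # S, T
    ("D", "C"): (5, 0),  # T, S
    ("D", "D"): (1, 1),  # P, P
}

def dominates(a, b):
    """True if action a pays the row player at least as much as b against
    every opponent action, and strictly more against at least one."""
    diffs = [PD[(a, o)][0] - PD[(b, o)][0] for o in ("C", "D")]
    return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)

assert dominates("D", "C")                     # defection is dominant...
assert PD[("D", "D")][0] < PD[("C", "C")][0]   # ...yet (P,P) pays less than (R,R)
```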
The special setting of the one-shot PD is seen by many to be contrary to the idea of
cooperation. This is because the only equilibrium point is the outcome [P, P] which
is a Nash equilibrium, [7]. Also, [P, P] is at the intersection of minimax strategy
choices for both players. These minimax strategies are dominant for both players,
hence the exclusion in principle of cooperation (by virtue of the dominance of the
chosen strategies). Moreover, even if cooperative strategies were chosen, the result-
ing cooperative ‘solution’ is not an equilibrium point. This means that it is not stable
214 A. Salhi and Ö. Töreyen

due to the fact that both players are tempted to defect from it. It should also be noted
that cooperative problems in real life are likely to be faced repeatedly. This makes
the IPD a more appropriate model for the study of cooperation than the one-shot
version of the game.
The PD game is characterised by the strict inequality relations between the pay-
offs: T > R > P > S. To avoid coordination or total agreement getting a ‘helping hand’, most experimental PD games also have a payoff matrix satisfying 2R > S + T, as in Table 9.1.
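These two conditions are easy to test for any candidate payoff set; in the Python sketch below the function names are ours:

```python
def is_pd(T, R, P, S):
    """Strict PD ordering of the payoffs."""
    return T > R > P > S

def no_helping_hand(T, R, P, S):
    """Extra condition 2R > S + T used in most experimental PD games."""
    return 2 * R > S + T

# The payoffs of Table 9.1 satisfy both conditions.
assert is_pd(T=5, R=3, P=1, S=0)
assert no_helping_hand(T=5, R=3, P=1, S=0)
```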
The close analysis of the IPD reveals that, unlike the one-shot PD, it has a large number of Nash equilibria, all lying inside the convex hull of the outcomes (0,5), (3,3), (5,0), (1,1) of the pure strategies in the one-shot PD (see Figure 9.1).
Note that (1,1) corresponding to [P, P] is a Nash equilibrium for the IPD also. For a
comprehensive investigation of the IPD, please refer to [4, 5, 15].

Fig. 9.1 Set of Nash Equilibrium Points (the convex hull of (0, 5), (3, 3), (1, 1) and (5, 0))

9.2.3 Multi-Agent Systems


Multi-Agent Systems (MAS) are collections of agents that work together to accomplish a task that is normally beyond the capabilities of a single agent, [19, 30, 31]. In our case, however, every agent is capable of solving the instance of the optimisation problem at hand.
The agents in a classical MAS communicate between themselves, cooperate, collaborate and sometimes negotiate ([1, 9, 12]). They do not normally compete, but in the framework used here they are allowed, in fact required, to do so.
The present paper describes a Game Theoretic Multi-Agent Solver (GTMAS),
[18], which implements the ideas introduced above. It will be limited to three agents
(Figure 9.2): a coordinator-agent, Solver-Agent 1 (SA1), running the Genetic Al-
gorithm (GA), [14, 26], and Solver-Agent 2 (SA2), running Simulated Annealing

(SA), [28]. The system, implemented in Matlab, is tested on a set of Travelling Salesman Problems (TSP) from TSPLIB [21].

9.3 Constructing GTMAS


The GTMAS architecture closely follows the goal-based MAS architecture of Park and Suguraman [19] and the IPD model as described in [2, 3, 6].
Let the problem to be solved be ℘. The overall goal of GTMAS is to solve ℘ ef-
ficiently using the best available algorithm(s) in the system’s library. This is equiv-
alent to completing two tasks: selecting the best algorithm(s) among all available
algorithms and obtaining a ‘good’ solution.
The overall goal is divided into sub-goals which are then matched to the system’s agents. Each of the solver-agents runs a different algorithm and tries to be
the first to solve ℘ by using as much as possible of the available computing facili-
ties, here CPU time. The coordinator-agent enables coordination, manages the game
through which the solver-agents compete for the facilities, allocates the facilities to
the solver-agents, and also communicates with the user, i.e. the owner of problem
℘, (Figure 9.2).

Fig. 9.2 Goal Hierarchy Diagram

Solver-agents cooperate (C) or compete (D) with each other by sharing, or not sharing, their solutions. If an agent stuck in a local optimum, say, can take an opponent’s possibly better solution and use it, then it can improve its own search.
The decision to cooperate or to compete is autonomously made by the agents us-
ing their beliefs (no notable change in the objective value in the last few iterations, for instance, may mean convergence to a local optimum), the history of the previous encounters with their opponents (the number of times they cooperated and
competed), and certain rules which follow observations of the behaviours of agents.
Some are explained below.
The rules are set to prevent the game from converging too soon to a near
pure competition game (which is equivalent to playing the GRIM strategy, [2]).

A go-it-alone type strategy cannot contribute more to the solution quality than running an algorithm on its own. These rules are:
• If the number of times SA1 knows the solution of SA2 increases, then the likelihood that SA2 finds the solution to ℘ first decreases. Therefore, SA2 is unlikely
to cooperate. Since all solver-agents are aware of this, they would cooperate less
and take their opponent’s solution more often, given the chance.
• If SA1 does not cooperate when SA2 cooperates, then SA2 would retaliate. This
leads to the TIT-FOR-TAT and the go-it-alone type strategy.
• If a solver-agent cooperates in the first encounter with another solver-agent, then it can be perceived as being in need of help, i.e. stuck at a local optimum. Agents,
therefore, perceive the first cooperation of their opponent as a “forceful invitation
to cooperate or else...” from bullet point 2 above.
There are all sorts of rules which are implicit in the IPD. Agents, however, do not
have to apply them systematically.
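The two named strategies, GRIM and TIT-FOR-TAT, can be sketched as functions from move histories to moves; this is a sketch of their standard formulations, not code from the chapter:

```python
def tit_for_tat(my_history, opp_history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not opp_history else opp_history[-1]

def grim(my_history, opp_history):
    """Cooperate until the opponent defects once, then defect forever."""
    return "D" if "D" in opp_history else "C"

def play_ipd(s1, s2, rounds):
    """Play `rounds` simultaneous-move encounters between two strategies."""
    h1, h2 = [], []
    for _ in range(rounds):
        m1, m2 = s1(h1, h2), s2(h2, h1)
        h1.append(m1)
        h2.append(m2)
    return h1, h2

h1, h2 = play_ipd(tit_for_tat, grim, 5)
# Two "nice" strategies never defect against each other.
assert h1 == ["C"] * 5 and h2 == ["C"] * 5
```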

9.3.1 GTMAS at Work: Illustration


Recall that the solver-agents, in order to solve ℘, play the IPD game. In each en-
counter, they either cooperate (C) or compete (D).
Figure 9.3 shows an encounter of two agents after they both obtain their intermediate solutions. Node 1 is the starting node which shows the decision alternatives

Fig. 9.3 Decision Tree of the Game



of SA1. It can cooperate and end up in Node 2, or compete and end up in Node 3. Nodes 2 and 3 are the
decision nodes of SA2 which has the same two alternative decisions as SA1. Node 4
follows the cooperation of both of the agents that may result in a solution exchange.
Node 5 shows the situation where SA1 cooperates and SA2 competes which means
SA2 may take the solution of SA1 and SA1 takes nothing. Node 6 depicts the same
situation as that leading to node 5 but with agents taking different actions. In node
7, neither gives its solution; they continue without any exchange.
The decision tree is expanded further with branching from nodes 4-7, but with
alternatives now being: “Take the opponent’s solution” and “Do not take the op-
ponent’s solution”. This branching determines which agent has a better solution
and is essential for setting up the payoff matrices that drive the system. Eight new nodes (leaves) arise. Each pair of sibling nodes yields a different payoff matrix. The labels (G) (for good) and (B) (for bad) refer to agents having a better solution than the opponent or otherwise, respectively. The cells that are crossed out refer to impossible
outcomes.
Managing the resources is based on the outcomes of the decisions of the solver-
agents. When an agent cooperates it gains one unit (of CPU time or equivalent in
terms of iterations it is allowed to do) and loses double that. When it competes it
gains two units and loses one (or half of the initial gain). This means the GTMAS
payoff matrix rewards competition. The idea behind supporting competition is to
counter the “helping hand” that cooperation gets from the rules underpinning the
construction of GTMAS (see above). It can also be argued that, intuitively at least,
too frequent exchanging of solutions will lead to early convergence to local optima.
So, competition gives solver-agents the chance to cover the search space better.
The 4 payoff matrices in Figure 9.3 can be combined in one payoff matrix
(Table 9.2).

Table 9.2 Combined Payoff Table for Evaluating and Rewarding Agents

B
C D
G C (1,-2) (1,-1)
D (2,-2) (2,-1)
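The combined matrix of Table 9.2 and the resource bookkeeping described above can be sketched in Python (the dictionary encoding and the helper name are ours):

```python
# Combined payoff matrix from Table 9.2: the row player (G) holds the
# better solution, the column player (B) the worse one.
PAYOFF = {
    ("C", "C"): (1, -2),
    ("C", "D"): (1, -1),
    ("D", "C"): (2, -2),
    ("D", "D"): (2, -1),
}

def reward(better_move, worse_move, budgets):
    """Add each agent's payoff (units of CPU time, or the equivalent in
    iterations) to its resource account; `budgets` maps 'G'/'B' to the
    current balances."""
    rg, rb = PAYOFF[(better_move, worse_move)]
    return {"G": budgets["G"] + rg, "B": budgets["B"] + rb}

# Competing is never worse for either player under this matrix.
assert reward("D", "D", {"G": 10, "B": 10}) == {"G": 12, "B": 9}
```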

The equilibrium point for the payoff matrix is (D, D) with payoffs 2 and -1. It
is also a regret-free point. The payoff matrix at the core of GTMAS is different
from those commonly found in the literature. These matrices would be drawn im-
mediately after decisions have been taken, i.e. at nodes 4 to 7 in Figure 9.3. Here,
they are drawn after other decisions are taken. In fact, one can highlight three main
differences:

(i) The return of a player is not dependent on the opponent’s choice directly. Whether
the opponent cooperates or competes becomes only relevant after the exchange of
solutions has been decided;
(ii) The payoff is affected by what has been achieved in terms of the quality of
solution after exchange (or otherwise of solutions);
Unlike traditional games, here, after the players (solver-agents) have made their
choices, they are given a chance to progress with the consequences of the choices.
Only after that, are they rewarded/punished. This was made explicit in the above
paragraph where reasons for rewarding competition/penalising cooperation, were
given; for instance when we said that a cooperating agent “gains one unit and loses
double that”, we meant that the solver-agent runs first for a unit of CPU time (or equivalent in iterations) and only after that is it penalised by taking 2 units of CPU
time from its account. Basically, the quality of the solution following decisions has
to be measured first before the payoffs are allocated. Time is an important factor in
the IPD.
(iii) The third difference is that the players are not “Solver-Agent 1” and “Solver-
Agent 2”, but instead “Solver-Agent with the better solution” and “Solver-Agent
with the worse solution”. The configuration of the table may change at each stage
according to the solution qualities of the solver-agents. The one with better solution
is always placed as the row player.

9.4 The GTMAS Algorithm


GTMAS is a generic (n + 1)-agent system that consists of a coordinator-agent and
n solver-agents. It can be seen as a loosely coupled hybrid algorithm that uses
any number of algorithms contributing to the hybridisation. The pseudocode of the
agents is given below.

Coordinator-Agent Pseudocode
1. Initialise belief. Initialise resources, n.
2. For Nstage stages, play the game and update belief
where Nstage limits the number of stages the game is played.
2.1. Start decision phase: Run the solver-agents to decide.
2.2. Manage the solution exchange.
2.3. Start competition phase: Run the solver-agents to compete.
2.4. Evaluate and reward/punish the solver-agents.
Update resources: m₁, m₂ iterations, where
m_i = n + ∑_{j=1}^{currentstage−1} r_ij , and r_ij is the reward of
agent i at stage j.
2.5. Increment stage.
3. End the game. Select the best algorithm. Report the results.
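The budget update in step 2.4 can be sketched in Python (the helper name is ours; the chapter's system itself is in Matlab):

```python
def iteration_budget(n, rewards_so_far):
    """m_i = n + sum of r_ij over earlier stages: the base number of
    iterations plus agent i's accumulated rewards/punishments."""
    return n + sum(rewards_so_far)

# With n = 10 and stage rewards 4, -4, 8 (entries of Table 9.3), an
# agent enters the next stage with a budget of 18 iterations.
assert iteration_budget(10, [4, -4, 8]) == 18
```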

Solver-Agents Pseudocode
1. Initialise belief.
2. If it is a decision phase, do:
2.1. If it is the first stage, do:
2.1.1. Initialise memory and algorithm specific parameters.
2.1.2. Run own algorithm for n iterations.
2.1.3. Cooperate.
2.1.4. End run. Send the results to the Coordinator-Agent.
2.2. If it is the second stage, do:
2.2.1. Update belief.
2.2.2. Run own algorithm for mi iterations.
2.2.3. Compete.
2.2.4. End run. Send the results to the Coordinator-Agent.
2.3. If stage > 2, do:
2.3.1. Update belief.
2.3.2. Run own algorithm for mi iterations.
2.3.3. Decide to cooperate/compete.
2.3.4. End run. Send the results to the Coordinator-Agent.
3. If it is a competition phase, do:
3.1. Update belief.
3.2. Run own algorithm for n iterations.
3.3. End run. Send the results to the Coordinator-Agent.

A prototype GTMAS is constructed with three agents: a coordinator-agent and two solver-agents. GTMAS starts with the initialisation of the coordinator-agent. The
coordinator-agent reads the problem data and initialises the payoff table. It also
initialises the resources accounts (seconds of CPU time or number of iterations)
assigned to the solver-agents. The overall iterations of GTMAS are called stages
which consist of decision and competition phases. In the decision phases, the
coordinator-agent asks the solver-agents for their decisions.
The solver-agents start with the initialisation of their memory (the outcome of encounters with their opponents), algorithm parameters (e.g. the population for the Genetic Algorithm) and decision parameters (η, β, α, σ and γ). After they run for
the given number of iterations determined by the coordinator-agent to obtain their
initial solutions for ℘, they make their decisions.
The decisions in the first two stages are not subject to analysis since no historical
data (memory of earlier encounters) exist yet. Both agents cooperate in the first
stage and compete in the second stage, regardless of the results and without prior
analysis. The following stages differ from the first two stages with the introduction
of memory. Decisions are made according to procedure Decide() below.

9.4.1 Solver-Agents Decision Making Procedure


Let SA1 be a solver-agent and SA2 its opponent in an IPD game. SA1 decides
whether to compete or cooperate according to the following procedure.

Procedure Decide() Pseudocode


0. Begin
1. If (SA2 has cooperated in the last move) then
1.1. If (P(IC|OC) < σ %) then
1.1.1. Cooperate.
1.2. Else If (P(OC|IC) < η %) then
1.2.0. Cooperate.
1.2.1. Else If (SA1 didn’t improve by γ % in last 2 stages) then
1.2.1.0. Decide randomly to cooperate or compete
1.2.1.1. Else
1.2.1.1.0. Compete.
1.2.1.2. End If
1.2.2. End If
1.3. End If
2. Else (Ask SA2 for its solution);
2.1. If (SA2 solution is α % better than that of SA1) then
2.1.1. If (SA1 is stuck with γ %) then
2.1.1.0. Cooperate.
2.1.2. Else If (SA2 solution is 2α % better than that of SA1) then
2.1.2.1. If (P(OC|ID) > β %) then
2.1.2.1.0. Compete.
2.1.2.2. Else
2.1.2.2.0. Cooperate.
2.1.2.3. End If
2.1.3. End If
2.3. Else
2.3.1. Compete.
2.4. End If
3. End If
4. Stop

In the decision-making process, P(IC|OC) is the probability that SA1 will cooperate in the next iteration given that SA2 cooperates in this iteration. It is equal to the ratio between the number of times SA1 cooperates in the (n + 1)st iteration, given that SA2 cooperated in the nth iteration, and the total number of encounters.
P(OC|IC) is the probability that SA2 will cooperate in the next iteration given that SA1 cooperates in this iteration. It is equal to the ratio of the number of times SA2 cooperates in the (n + 1)st iteration, given that SA1 cooperated in the nth iteration, to the total number of encounters.
P(OC|ID) is the probability that SA2 will cooperate in the next iteration given that SA1 competes in this iteration. It is equal to the ratio of the number of times SA2 cooperates in the (n + 1)st iteration, given that SA1 competed in the nth iteration, to the total number of encounters.
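These ratios can be estimated from the recorded move histories; in the Python sketch below (the function name and the sample histories are ours), the "total number of encounters" in the text is read as the number of consecutive encounter pairs:

```python
def conditional_coop_prob(history_self, history_opp, given_opp="C"):
    """Estimate P(self cooperates at n+1 | opponent played given_opp at n)
    as a ratio of counts over all consecutive encounter pairs."""
    hits = sum(
        1
        for n in range(len(history_opp) - 1)
        if history_opp[n] == given_opp and history_self[n + 1] == "C"
    )
    total = max(len(history_opp) - 1, 0)
    return hits / total if total else 0.0

# Hypothetical move histories (C = cooperate, D = compete) for SA1, SA2.
sa1 = ["C", "C", "D", "C", "D"]
sa2 = ["C", "D", "C", "C", "C"]
p_ic_oc = conditional_coop_prob(sa1, sa2, given_opp="C")  # P(IC|OC)
p_oc_id = conditional_coop_prob(sa2, sa1, given_opp="D")  # P(OC|ID)
```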

σ is a measure of how likely it is for SA1 to adopt a TIT-FOR-TAT strategy; it is referred to as “responsiveness”;
η is a measure of how likely it is for SA2 to adopt a TIT-FOR-TAT strategy;
γ is how likely it is for either agent to get stuck in a local optimum;
α is the difference between SA1’s solution and that of SA2;
β is how likely it is for the opponent to cooperate (be nice!).
If the opponent cooperated in the last move, the agents check their own responsive-
ness, first. If it is less than σ , then they cooperate; if not, they check the opponent’s
responsiveness. If the latter is less than η , they conclude that the opponent is not as
responsive as it should be, so they compete. If the opponent is responsive, then they
check their own status: if they improved by γ % over the last 2 stages, then they compete, otherwise they make a random choice as to whether they cooperate or compete.
If the opponent did not cooperate in the last move, then they compare their own
status with that of their opponent. If the opponent’s last solution is not α % better
than their own solution, they compete. If it is at least α % better, then they check
their own progress. If they are stuck with γ %, or in other words, the solution has
not improved more than γ % in the last 2 stages, then they cooperate. If they are not
stuck, they check the difference between the opponent’s and their own solutions. If
the solution of the opponent is not 2α % better than their own solution, they com-
pete. If it is 2α % better, then they check the opponent’s attitude to competition.
If the opponent is likely to cooperate, i.e. if the probability that it cooperates after competition is larger than β , they compete; otherwise they cooperate. Note that this description corresponds to the procedure Decide() above.
After an agent makes a decision, it is sent to the coordinator-agent which manages
the solution exchange. Solution exchange is settled randomly when a solution is
offered, following a cooperate move; it is accepted with probability 0.5.
The worse-performing agent is not entirely removed. It is kept, but it is only allowed one iteration in each stage. This is specific to the two solver-agent case as it
seems, from experimental results, that the weaker algorithm still helps the stronger
one to get, overall, a better solution. This, however, may not be the case if a large
number of algorithms were used.
When all the stages are completed, the results of the best solver-agent are
reported.

9.5 Application of GTMAS to TSP


GTMAS is applied to a collection of Travelling Salesman Problems (TSP), [21]. The
Genetic Algorithm (GA) and Simulated Annealing (SA) are selected as the solvers
in the library. GA is coded in Matlab 7.0 and Simulated Annealing Matlab code is
borrowed from Matlab Central ([29]). They are customised to be incorporated in
GTMAS which is also coded in Matlab 7.0.
Generic parameters of GTMAS are defined by pre-experimentation. Nstage , the
number of stages the game is played, is set to 5. Preliminary analyses show that

it is sufficient for convergence to the optimum or a good solution. n, the number of iterations the solver-agents run for at the start of the decision phase and in the competition phases, is set to 10. Accordingly, all the entries in the payoff table, r_ij, are quadrupled for
faster progress (see Table 9.3).

Table 9.3 Payoff Table of GTMAS

B
C D
G C (4,-8) (4,-4)
D (8,-8) (8,-4)

GTMAS is customised specifically for the GA-SA competition. SA is fast in the initial iterations because of the high temperature and the large probability of accepting bad solutions in order to escape local optima. Afterwards, it slows down considerably
with the decreases in temperature. The resource in the decision phase is CPU time.
The number of iterations SA is run for, m_SA, is updated at the beginning of each stage to balance the CPU time usage of SA against that of GA:
m_SA = (10 / currentstage) × m_SA .
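A literal reading of this cumulative update can be sketched as follows (the helper name is ours, for illustration only):

```python
def sa_budget_schedule(m0, stages):
    """Apply m_SA <- (10 / currentstage) * m_SA at the start of each
    stage, starting from the initial budget m0."""
    m, schedule = m0, []
    for stage in range(1, stages + 1):
        m = (10 / stage) * m
        schedule.append(m)
    return schedule

# Starting from m0 = 1: stage 1 gives 10 iterations, stage 2 gives 50.
sched = sa_budget_schedule(1, 2)
```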
Prior to experimentation, GA is expected to perform better than SA. GA is coded
for the specific problem in hand and its parameters are tuned accordingly. The pa-
rameters of SA are default parameters as found in the literature. The game played
between the agents affects the solution quality substantially. It depends on the solver-
agents’ attitude to cooperation and competition, the solution exchanges and the pay-
off matrix. These parameters are summarised with their possible values in Table 9.4.

Table 9.4 GTMAS Parameter Values

Payoff Matrix          Decision Model   Characteristic of GA   Characteristic of SA
simple                 random           random                 random
cooperation-rewarded   evaluative       cooperative            cooperative
competition-rewarded                    competitive            competitive
time-dependent

The payoff matrix can be simple, cooperation-rewarded, competition-rewarded or time-dependent.

Table 9.5 Simple (top-left), Cooperation-Rewarded (bottom-left), Competition-Rewarded (top-right) and Time-Dependent (bottom-right) Payoff Matrices

        B                              B
        C        D                     C        D
G  C  (4,−4)  (4,−4)           G  C  (4,−8)  (4,−4)
   D  (4,−4)  (4,−4)              D  (8,−8)  (8,−4)

        B                              B
        C        D                     C          D
G  C  (8,−4)  (8,−8)           G  C  (4/t,−4t)  (4/t,−4/t)
   D  (4,−4)  (4,−8)              D  (4t,−4t)   (4t,−4/t)

The solver-agents arrive at decisions using the decision-making parameters α, β, γ, η and σ. These parameters were explained earlier and their values are given in Table 9.6.

Table 9.6 The Values of the Decision Parameters

Agent
Characteristics α β γ η σ
cooperative 0.1 0.9 0.01 0.2 0.6
competitive 0.5 0.1 0.01 0.8 0.2

Twenty combinations of parameters were used and, for each, five runs were carried out on the problems of Table 9.13, from TSPLIB [21]. GA is found to be the better algorithm in 98 runs and SA is found to be better only in the remaining 2. The
final solutions, the elapsed times, the solver algorithm and the series of coopera-
tion/competition bouts and exchange of solutions are recorded. The results are en-
tered into SPSS 14 for analysis of significant factors. The cooperation/competition
series and the exchange series are categorised prior to analysis. The categorisation
is summarised in Table 9.7.

Table 9.7 Cooperation and Solution Exchange Sequences of Solver-Agents

Cooperate Take Opponent’s Solution


less than twice never takes
more than twice takes in first stage
takes after second stage
takes in both

These are added to the factors of the experiment. The final factors of the question
are summarised in Table 9.8 with their corresponding values and the number of
occurrences.

Table 9.8 ANOVA - Data Summary

Factors                                 Values   Number of Observations
PAYOFF simple 1 25
cooperation-rewarded 2 25
competition-rewarded 3 25
time-dependent 4 25
DECISION PROCESS coop GA vs coop SA 1 20
coop GA vs comp SA 2 20
comp GA vs coop SA 3 20
comp GA vs comp SA 4 20
random 5 20
CATEGORY GA COOPERATES less than twice 1 64
more than twice 2 36
CATEGORY SA COOPERATES less than twice 1 43
more than twice 2 57
CATEGORY GA TAKES never takes 1 25
SA’S SOLUTION takes in first stage 2 23
takes after second stage 3 26
takes in both 4 26
CATEGORY SA TAKES never takes 1 22
GA’S SOLUTION takes in first stage 2 33
takes after second stage 3 33
takes in both 4 12

Table 9.9 shows ANOVA results for the dependent variable deviation. Here, deviation is the difference between the true solution objective value and the objective value of the solution found. All factors and reasonable multiple interactions are included in the model. Most of them are highly insignificant due to the high random variability. However, the interaction of the solution-taking sequences of the agents is significant at the 11% level; therefore, the solution-taking sequence factors themselves are significant. Even though this is not a very strong level of significance, these are the factors most expected to explain the data, since the solution quality is expected to depend on when solution exchanges occur.
Table 9.10 shows ANOVA results for the dependent variable time. The only significant factor, at the 12% significance level, is the solution exchange sequence of the SA solver-agent. This matches the expectations, since SA varies a lot both between iterations within problems and between problems. When it takes a solution in any stage, the average elapsed time is about 100 seconds. When it does not take the GA solution, the average elapsed time is about 60 seconds.

Table 9.9 ANOVA - Significant Factors Affecting Deviation From True Solution Value

Source Type III SS Deg.F Mean Sq. F.Value Sig.


Corrected Model 589.795a 76 7.760 .865 .689
Intercept 2098.415 1 2098.415 233.899 .000
PAYOFF 26.754 3 8.918 .994 .413
DECISION PROCESS 16.223 4 4.056 .452 .770
PAYOFF*DECISION
PROCESS 25.528 4 6.382 .711 .593
CATEGORY COOP1*
CATEGORY COOP2 5.214 1 5.214 .581 .454
CATEGORY TAKEN1*
CATEGORY TAKEN2 124.516 7 17.788 1.983 .102
PAYOFF*CATEGORY
COOP1*CATEGORY
COOP2 .000 0 . . .
PAYOFF*CATEGORY
TAKEN1*CATEGORY
TAKEN2 106.348 13 8.181 .912 .555
DECISION PROCESS*
CATEGORY COOP1*
CATEGORY COOP2 .000 0 . . .
DECISION PROCESS*
CATEGORY TAKEN1*
CATEGORY TAKEN2 7.187 4 1.797 .200 .936
PAYOFF*DECISION
PROCESS*
CATEGORY COOP1*
CATEGORY COOP2 .000 0 . . .
PAYOFF*DECISION
PROCESS*
CATEGORY TAKEN1*
CATEGORY TAKEN2 5.806 2 2.903 .324 .727
PAYOFF*DECISION
PROCESS*
CATEGORY COOP1*
CATEGORY COOP2*
CATEGORY TAKEN1*
CATEGORY TAKEN2 .000 0 . . .
CATEGORY COOP1 .041 1 .041 .005 .947
CATEGORY COOP2 4.731 1 4.731 .527 .475
CATEGORY TAKEN1 9.223 3 3.074 .343 .795
CATEGORY TAKEN2 11.471 3 3.824 .426 .736
Error 206.344 23 8.971
Total 4238.542 100
Corrected Total 796.139 99
a. R Squared = .741 (Adjusted R Squared =-.116)

Table 9.10 ANOVA - Significant Factors Affecting Time

Source Type III SS Deg.F Mean Sq. F.Value Sig.


Corrected Model 331363.405a 76 4360.045 .522 .981
Intercept 626245.612 1 626245.612 74.956 .000
CATEGORY COOP1 6.103 1 6.103 .001 .979
CATEGORY COOP2 111.092 1 111.092 .013 .909
CATEGORY TAKEN1 7246.948 3 2415.649 .289 .833
CATEGORY TAKEN2 54448.233 3 18149.411 2.172 .119
PAYOFF 2783.043 3 927.681 .111 .953
DECISION PROCESS 878.624 4 219.656 .026 .999
CATEGORY COOP1*
CATEGORY COOP2 .538 1 .538 .000 .994
CATEGORY TAKEN1*
CATEGORY TAKEN2 50147.857 7 7163.980 .857 .553
PAYOFF*DECISION
PROCESS 794.601 4 198.650 .024 .999
CATEGORY COOP1*
CATEGORY COOP2*
DECISION PROCESS .000 0 . . .
CATEGORY COOP1*
CATEGORY COOP2*
PAYOFF .000 0 . . .
CATEGORY TAKEN1*
CATEGORY TAKEN2*
DECISION PROCESS 1624.845 4 406.211 .049 .995
CATEGORY TAKEN1*
CATEGORY TAKEN2*
PAYOFF 25019.542 13 1924.580 .230 .996
CATEGORY TAKEN1*
CATEGORY TAKEN2*
PAYOFF*DECISION
PROCESS 786.468 2 393.234 .047 .954
CATEGORY COOP1*
CATEGORY COOP2*
PAYOFF*DECISION
PROCESS .000 0 . . .
CATEGORY COOP1*
CATEGORY COOP2*
CATEGORY TAKEN1*
CATEGORY TAKEN2*
PAYOFF*DECISION
PROCESS .000 0 . . .
Error 192161.123 23 8354.831
Total 1536492.299 100
Corrected Total 523524.528 99
a. R Squared = .633 (Adjusted R Squared =-.580)

Table 9.11 Results with Respect to Solution Exchanges

GA Takes SA’s SA Takes GA’s Average Average


Solution Solution Deviation Time (sec)
never takes never takes - -
never takes takes in first stage 5.51% 108.8
never takes takes after second stage 4.89% 81.28
never takes takes in both 8.01% 98.8
takes in first stage never takes 4.62% 60.41
takes in first stage takes in first stage 6.85% 105.6
takes in first stage takes after second stage 5.18% 211.16
takes in first stage takes in both 6.06% 71.12
takes after second stage never takes 3.79% 58.24
takes after second stage takes in first stage 6.30% 119.97
takes after second stage takes after second stage 7.08% 93.08
takes after second stage takes in both 7.50% 296.77
takes in both never takes 7.41% 57.16
takes in both takes in first stage 3.36% 140.57
takes in both takes after second stage 5.74% 92.54
takes in both takes in both 5.34% 116.08

Table 9.12 Selection of Best Parameters

Characteristic Characteristic # of Time


Payoff of GA of SA Occur. Dev. (sec.)
coop-rewarded cooperative cooperative 2 3.59% 61.97
coop-rewarded cooperative competitive 1 1.27% 45.49
coop-rewarded competitive cooperative 1 3.33% 60.49
coop-rewarded competitive competitive 2 7.39% 50.17
coop-rewarded random random - - -
comp-rewarded cooperative cooperative - - -
comp-rewarded cooperative competitive - - -
comp-rewarded competitive cooperative - - -
comp-rewarded competitive competitive 1 0.65% 58.09
comp-rewarded random random - - -
simple cooperative cooperative - - -
simple cooperative competitive - - -
simple competitive cooperative 1 6.50% 66.41
simple competitive competitive 1 3.20% 70.54
simple random random - - -
time-dependent cooperative cooperative - - -
time-dependent cooperative competitive - - -
time-dependent competitive cooperative - - -
time-dependent competitive competitive 1 1.34% 54.77
time-dependent random random 1 3.47% 60.31

In Table 9.11, the best deviation is found when GA takes SA’s solution in both
the first stage and after the second stage and SA takes GA’s solution only in the
first stage. Average deviation is 3.36%. However, the average time elapsed to obtain
this average deviation is quite high, at 141 seconds. The second best deviation is
observed when GA takes SA’s solution after the second stage and SA never takes
GA’s solution. The average deviation is 3.79% with an average elapsed time of 58
seconds. From these results, it can be said that, in this setting, i.e. when GA com-
petes and takes the solution of SA and SA cooperates by offering its own solution
and never taking that of GA, the best performance is obtained. Whether obtaining
this solution exchange setting is random is not clear. What is clear is that it occurs quite often; Table 9.12 records some of its occurrences. Amongst these 11 occurrences, the best average deviation is obtained with a competition-rewarded payoff
matrix and both agents being competitive. The deviation is 0.65% which actually
comes from only one occurrence, and the time is 58 seconds. This analysis does not
show that if competitive agents play against each other in a competition-rewarded
environment, then this is the best environment; rather, it shows that if competitive
agents play against each other in a competition-rewarded environment and their so-
lution exchange happens to be one-way benefit to one of the solver-agents, then this
might be the best setting.

9.6 Tests and Results


GTMAS is tested on 10 problems from TSPLIB [21]. The results are summarised
in Table 9.13.
Table 9.13 shows the runs for GA alone, SA alone and GA and SA together
under the framework of GTMAS, sequentially. For each problem instance, GTMAS
selected GA as the best solving agent.
When average deviations are compared within problem instances, GTMAS is found to dominate GA: on average, GTMAS always finds better solutions than GA. This is due to GA benefiting from the presence of SA, which must be a synergistic effect. It is observed that when GA takes the solution of SA, the quality of the overall solution increases considerably. Solution exchange seems to play a critical role in determining the quality of the solution.
However, when elapsed times are compared, the average time of GTMAS is almost double that of GA for almost all problems. This rather unfavourable time count is the cost of keeping the SA solver-agent, because it improves the quality of the overall solution. There are also other overheads that come with the need for coordination, decision making and so on.
It should also be noted that the recorded time counts for GTMAS are those of a
sequential implementation. The times of a parallel implementation are expected to
be significantly lower.
9 A Game Theory-Based Multi-Agent System 229

Table 9.13 Results of GTMAS Applied to TSP

Problem Agent 1 Agent 2 Average Deviation Average Time (sec) Selected Algorithm
burma14 GA - 0.70% 6.95
burma14 - SA 0.53% 8.34
burma14 GA SA 0.00% 13.49 GA
ulysses16 GA - 0.27% 8.26
ulysses16 - SA 0.17% 42.28
ulysses16 GA SA 0.00% 14.71 GA
ulysses22 GA - 1.56% 9.83
ulysses22 - SA 1.16% 97.12
ulysses22 GA SA 1.47% 17.78 GA
att48 GA - 4.97% 41.23
att48 - SA 31.48% 10.52
att48 GA SA 4.06% 129.87 GA
eil51 GA - 4.46% 44.45
eil51 - SA 18.17% 423.26
eil51 GA SA 2.72% 91.58 GA
berlin52 GA - 8.67% 42.99
berlin52 - SA 36.37% 11.44
berlin52 GA SA 5.32% 66.94 GA
st70 GA - 11.62% 66.17
st70 - SA 24.89% 232.21
st70 GA SA 7.97% 143.52 GA
eil76 GA - 6.84% 74.90
eil76 - SA 33.34% 1162.02
eil76 GA SA 5.88% 136.23 GA
pr76 GA - 6.25% 94.02
pr76 - SA 35.91% 254.20
pr76 GA SA 5.71% 140.05 GA
eil101 GA - 10.37% 143.99
eil101 - SA 50.11% 220.71
eil101 GA SA 8.58% 211.98 GA

9.7 Conclusion and Further Work


A generic smart solver, GTMAS, has been constructed that combines a multi-agent
system architecture and game theory to deal with expensive optimisation problems.
Within GTMAS different algorithms attached to agents play an Iterated Prisoners’
Dilemma type game in which they cooperate to solve the problem and compete
over the computing facilities available (here CPU time). In the process, the system
finds the most appropriate algorithm for the given problem from a library of avail-
able algorithms and solves the problem. It also obtains a better quality approximate

solution than the best algorithm would obtain on its own. This is because of the
synergistic effect of the algorithms working together.
GTMAS implements an interesting resource allocation process that uses a purpose-built payoff matrix to encourage competition for the available computing resources. Solver-agents are rewarded for good performance by increased access to the computing facilities, and punished for bad performance by reduced access. This simple rule guarantees that the computing platform is increasingly dedicated to the most suited algorithm. In other words, the bulk of the computing platform will eventually be used by the best performing algorithm, which is synonymous with the computing resources being used efficiently.
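The reward-and-punish rule described above can be sketched in a few lines. This is an illustrative sketch only: the update step `delta`, the floor on an agent's share and the final renormalisation are our assumptions, not values taken from GTMAS.

```python
# Illustrative sketch of the reward/punish CPU-allocation rule: shares
# grow for well-performing solver-agents and shrink otherwise, so the
# platform drifts toward the best algorithm. The step size `delta` and
# the floor are hypothetical parameters, not values from GTMAS.

def update_shares(shares, rewards, delta=0.1, floor=0.05):
    """shares: dict agent -> CPU fraction; rewards: dict agent -> +1 or -1."""
    for agent, r in rewards.items():
        shares[agent] = max(floor, shares[agent] + delta * r)
    total = sum(shares.values())
    return {a: s / total for a, s in shares.items()}  # renormalise to sum 1

shares = {"GA": 0.5, "SA": 0.5}
# GA performed well this round, SA did not:
shares = update_shares(shares, {"GA": +1, "SA": -1})
```

Under this rule, repeated rewards for the same agent steadily concentrate the CPU budget on it, which is the behaviour the chapter describes.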
GTMAS as implemented here involves only two players. The study would benefit from a more extensive investigation with a larger number of algorithms. The results obtained here can be used to extend it to n players: the game can be designed such that, given the players A1 , A2 , ..., An , pair-wise games are considered and each game is evaluated separately according to the same 2-by-2 payoff matrix introduced here. The solvers that fail in these simultaneous 2-by-2 competitions are eliminated and the tournament continues with the ones that survive.
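The proposed pairwise elimination tournament can be sketched as follows. The `play` stub standing in for a full 2-by-2 GTMAS game, and the toy quality scores, are hypothetical placeholders.

```python
# Minimal sketch of the proposed n-player extension: solvers meet in
# pairwise games scored by the same 2-by-2 payoff matrix, and the
# loser of each game is eliminated until one solver survives.
# `play` is a hypothetical stub for a full GTMAS game.

def tournament(solvers, play):
    """play(a, b) -> winner of a pairwise game between solvers a and b."""
    survivors = list(solvers)
    while len(survivors) > 1:
        nxt = []
        # pair consecutive solvers; an odd one out advances directly
        for i in range(0, len(survivors) - 1, 2):
            nxt.append(play(survivors[i], survivors[i + 1]))
        if len(survivors) % 2:
            nxt.append(survivors[-1])
        survivors = nxt
    return survivors[0]

# Toy quality scores standing in for measured solver performance:
quality = {"GA": 3, "SA": 2, "TS": 1}
winner = tournament(["GA", "SA", "TS"], lambda a, b: max(a, b, key=quality.get))
```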
Another approach to the n-by-n game could be to play it simultaneously, using notions of Nash's poker game [13], with a specially created n-by-n payoff matrix that would evaluate all agents at once but select the best iteratively. Current and future research directions concern extending the ideas of the GTMAS prototype to a general n-by-n environment which deals with n algorithms, running in parallel, according to one of the two proposed payoff matrices.

References
1. Aldea, A., Alcantra, R.B., Jimenez, L., Moreno, A., Martinez, J., Riano, D.: The scope
of application of multi-agent systems in the process industry: Three case studies. Expert
Systems with Applications 26, 39–47 (2004)
2. Axelrod, R.: Effective choice in the prisoner’s dilemma. Journal of Conflict Resolu-
tion 24(1), 3–25 (1980)
3. Axelrod, R.: More effective choice in the prisoner’s dilemma. Journal of Conflict Reso-
lution 24(3), 379–403 (1980)
4. Axelrod, R.: The Evolution of Cooperation. Basic Books, New York (1984)
5. Axelrod, R.: The evolution of strategies in the iterated prisoners’ dilemma. In: Davis, L.
(ed.) Genetic Algorithms and Simulated Annealing, pp. 32–42. Morgan Kaufmann, Los
Altos (1987)
6. Axelrod, R., Hamilton, W.D.: The evolution of cooperation. Science 211, 1390–1396
(1981)
7. Binmore, K.: Fun and Games. D.C. Heath, Lexington (1991)
8. Binmore, K.: Playing fair: Game theory and the social contract. MIT Press, Cambridge
(1994)
9. Bratman, M.E.: Shared cooperative activity. The Philosophical Review 101(2), 327–341
(1992)

10. Byrd, R.H., Dert, C.L., Rinnooy Kan, A.H.G., Schnabel, R.B.: Concurrent stochastic
methods for global optimization. Mathematical Programming 46, 1–30 (1990)
11. Colman, A.M.: Game Theory and Experimental Games. Pergamon Press Ltd., Oxford
(1982)
12. Doran, J.E., Franklin, S., Jennings, N.R., Norman, T.J.: On cooperation in multi-agent
systems. The Knowledge Engineering Review 12(3), 309–314 (1997)
13. Nash, J.F.: Non-cooperative games. Annals of Mathematics 54(2), 286–295 (1951)
14. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan
Press, Ann Arbor (1975)
15. Linster, B.: Essays on Cooperation and Competition. PhD thesis, University of Michigan,
Michigan (1990)
16. Luce, R., Raiffa, H.: Games and Decisions. Wiley, New York (1957)
17. Murty, K.G., Kabadi, S.N.: Some NP-complete problems in quadratic and nonlinear pro-
gramming. Mathematical Programming 39, 117–130 (1987)
18. Töreyen, Ö.: A game-theory based multi-agent system for solving complex optimisation problems and a clustering application related to the integration of Turkey into the EU community. M.Sc. Thesis, Department of Mathematical Sciences, University of Essex, UK (2008)
19. Park, S., Sugumaran, V.: Designing multi-agent systems: A framework and application.
Expert Systems with Applications 28, 259–271 (2005)
20. Rapoport, A., Chammah, A.M.: Prisoner’s Dilemma: A Study in Conflict and Coopera-
tion. University of Michigan Press, Ann Arbor (1965)
21. Reinelt, G.: TSPLIB, http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95
22. Rinnooy Kan, A.H.G., Timmer, G.T.: Stochastic global optimization methods Part I:
Clustering methods. Mathematical Programming 39, 27–56 (1987)
23. Rinnooy Kan, A.H.G., Timmer, G.T.: Stochastic global optimization methods Part II:
Multi-level methods. Mathematical Programming 39, 57–78 (1987)
24. Rinnooy Kan, A.H.G., Timmer, G.T.: Global optimization. In: Nemhauser, G.L., Rin-
nooy Kan, A.H.G., Todd, M.J. (eds.) Optimization. Handbooks in Operations Research
and Management Science, ch. IX, vol. 1, pp. 631–662. North Holland, Amsterdam
(1989)
25. Salhi, A., Glaser, H., De Roure, D.: A genetic approach to understanding cooperative
behaviour. In: Osmera, P. (ed.) Proceedings of the 2nd International Mendel Conference
on Genetic Algorithms, MENDEL 1996, pp. 129–136 (1996)
26. Salhi, A., Glaser, H., De Roure, D.: Parallel implementation of a genetic-programming
based tool for symbolic regression. Information Processing Letters 66(6), 299–307
(1998)
27. Salhi, A., Glaser, H., De Roure, D., Putney, J.: The prisoners’ dilemma revisited. Techni-
cal Report DSSE-TR-96-2, Department of Electronics and Computer Science, The Uni-
versity of Southampton, U.K. (February 1996)
28. Salhi, A., Proll, L.G., Rios Insua, D., Martin, J.: Experiences with stochastic algorithms
for a class of global optimisation problems. RAIRO Operations Research 34(22), 183–
197 (2000)
29. Seshadri, A.: Simulated annealing for travelling salesman problem,
http://www.mathworks.com/matlabcentral/fileexchange

30. Tweedale, J., Ichalkaranje, H., Sioutis, C., Jarvis, B., Consoli, A., Phillips-Wren, G.:
Innovations in multi-agent systems. Journal of Network and Computer Applications 30,
1089–1115 (2007)
31. Wooldridge, M., Jennings, N.R.: Intelligent agents: Theory and practice. Knowledge En-
gineering Review 10(2), 115–152 (1995)
32. Zhigljavsky, A.A.: Theory of Global Search. Mathematics and its applications, Soviet
Series, vol. 65. Kluwer Academic Publishers, Dordrecht (1991)
Chapter 10
Optimization with Clifford Support Vector
Machines and Applications

N. Arana-Daniel, C. López-Franco, and E. Bayro-Corrochano

Abstract. This chapter introduces a generalization of the real- and complex-valued SVMs using the Clifford algebra. In this framework we handle the design of kernels involving the geometric product for linear and nonlinear classification and regression. The major advantage of our approach is that we redefine the optimization variables as multivectors. This allows us to have a multivector as output and therefore to represent multiple classes according to the dimension of the geometric algebra in which we work. By using the CSVM with one Clifford kernel we greatly reduce the complexity of the computation. This is possible thanks to the Clifford product, which performs the direct product between the spaces of different grade involved in the optimization problem. We compare the CSVM with the most widely used approaches to multi-class classification to show that ours is more suitable for practical use on certain types of problems. The chapter includes several experiments showing the application of the CSVM to classification and regression problems, as well as to 3D object recognition for visually guided robotics. In addition, we show the design of a recurrent system in which an LSTM network is connected with a CSVM, and we study the performance of this system on time series experiments and robot navigation using reinforcement learning.

10.1 Introduction
The Support Vector Machine (SVM) [1, 2, 3, 4] is a powerful optimization algorithm for solving classification and regression problems, but it was originally designed
N. Arana-Daniel · C. López-Franco
Computer Science Department, Exact Sciences and Engineering Campus, CUCEI,
University of Guadalajara, Av. Revolucion 1500, Col. Olı́mpica, C.P. 44430,
Guadalajara, Jalisco, México
e-mail: {nancy.arana,carlos.lopez}@cucei.udg.mx
E. Bayro-Corrochano
Cinvestav del IPN, Department of Electrical Engineering and Computer Science,
Zapopan, Jalisco, México

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 233–262.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

for binary classification. The methodology for extending this algorithm to multi-class classification is still an ongoing research issue. Currently there are two main types of approaches for multi-class SVM [6, 7]. One is to construct and combine several binary classifiers, while the other is to consider all data directly in one big optimization problem; the latter approach is computationally more expensive for multi-class problems. This is why the authors were motivated to develop an SVM-based algorithm for multi-classification and multi-regression based on the Clifford (or geometric) algebra framework [5]. The authors' hypothesis was that these algebras are an appropriate mathematical framework for such an algorithm, because Clifford algebras allow us to express in a compact way many geometric entities (which are used to represent the multiple classes) and the products between them.
This chapter presents the results obtained from the development of the above hypothesis: i) the design of the generalization of the real- and complex-valued Support Multi-Vector Machines for classification and regression using the Clifford geometric algebra, which from now onwards will be called Clifford Support Vector Machines (CSVM); ii) the development of the Multiple Input Multiple Output (MIMO) CSVM; and iii) the application of the CSVM as a classifier, a regressor and an important component of a recurrent system. This work is a continuation of a first one on the generalization of SVMs [8].

10.2 Geometric Algebra


Let Gn denote the geometric (Clifford) algebra of n dimensions; this is a graded linear space. As well as vector addition and scalar multiplication we have a non-commutative product which is associative and distributive over addition – this is the geometric or Clifford product. A further distinguishing feature of the algebra is that
any vector squares to give a scalar. The geometric product of two vectors a and b is
written ab and can be expressed as a sum of its symmetric and antisymmetric parts
ab = a·b + a∧b, (10.1)
where the inner product a·b and the outer product a∧b are defined by

a · b = 12 (ab + ba)
(10.2)
a ∧ b = 12 (ab − ba).

The inner product of two vectors is the standard scalar or dot product and produces
a scalar. The outer or wedge product of two vectors is a new quantity which we call
a bivector. We think of a bivector as an oriented area in the plane containing a and b, formed by sweeping a along b. Thus, b ∧ a will have the opposite orientation, making the wedge product anti-commutative as given in (10.2). The outer product is immediately generalizable to higher dimensions – for example, (a ∧ b) ∧ c, a trivector, is interpreted as the oriented volume formed by sweeping the area a ∧ b along vector c. The outer product of k vectors is a k-vector or k-blade, and such a quantity
is said to have grade k. A multivector A ∈ Gn is the sum of k-blades of different or equal grade. This linear combination is called homogeneous of grade r (A = ⟨A⟩r ) if it contains terms of only a single grade.
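The graded structure just described is easy to experiment with numerically. The sketch below is our own illustration, not code from the chapter: it stores a G(3,0,0) multivector as a length-8 coefficient array indexed by a basis-blade bitmask, and checks the decomposition ab = a · b + a ∧ b of (10.1)-(10.2).

```python
import numpy as np

# A G(3,0,0) multivector as a length-8 array: index = bitmask of the
# basis blade (bit i set means sigma_{i+1} is present), e.g. 0b011 is
# sigma1 sigma2. All basis vectors square to +1.

def reorder_sign(a, b):
    # parity of the swaps needed to bring the blade product a*b
    # (bitmasks) into canonical ascending order
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1

def gp(x, y):
    # geometric (Clifford) product of two multivectors
    out = np.zeros(8)
    for a in range(8):
        for b in range(8):
            out[a ^ b] += reorder_sign(a, b) * x[a] * y[b]
    return out

a = np.zeros(8); a[0b001], a[0b010] = 2.0, 3.0   # a = 2 sigma1 + 3 sigma2
b = np.zeros(8); b[0b001], b[0b100] = 1.0, -1.0  # b = sigma1 - sigma3
sym  = 0.5 * (gp(a, b) + gp(b, a))   # inner part a.b, cf. (10.2)
anti = 0.5 * (gp(a, b) - gp(b, a))   # outer part a^b
```

For these two vectors the symmetric part is the pure scalar a · b, and the antisymmetric part contains only bivector terms, exactly as (10.1) states.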

10.2.1 The Geometric Algebra of n-D Space


In an n-Dimensional space V n we can introduce an orthonormal basis of vectors
{σi }, i = 1, ..., n, such that σi · σ j = δi j . This leads to a basis for the entire algebra:

1, {σi }, {σi ∧ σ j }, {σi ∧ σ j ∧ σk }, . . . , σ1 ∧ σ2 ∧ . . . ∧ σn = I, (10.3)

which spans the entire geometric algebra Gn . Here I is the hypervolume, called the pseudoscalar, which commutes with all multivectors and is also used as a dualization operator. Note that the basis vectors are not represented by bold symbols. Any multivector can be expressed in terms of this basis. Because the addition of k-vectors (homogeneous multivectors of grade k) is closed and their multiplication by scalars stays within the set, the k-vectors form a vector space, denoted ⋀^k V^n. Each of these spaces is spanned by C(n,k) k-vectors, where C(n,k) := n!/((n−k)! k!). Thus, our geometric algebra Gn , which is spanned by ∑_{k=0}^{n} C(n,k) = 2^n elements, is a direct sum of its homogeneous subspaces of grades 0, 1, 2, ..., n, that is,

Gn = ⋀^0 V^n ⊕ ⋀^1 V^n ⊕ ⋀^2 V^n ⊕ ... ⊕ ⋀^n V^n, (10.4)

where ⋀^0 V^n = R is the set of real numbers and ⋀^1 V^n = V^n corresponds to the linear n-dimensional vector space. Thus, any multivector of Gn can be expressed in terms of the bases of these subspaces.
In this chapter we will specify a geometric algebra Gn of the n-dimensional space by G p,q,r , where p, q and r stand for the number of basis vectors which square to 1, −1 and 0 respectively and fulfill n = p + q + r. Its even subalgebra will be denoted by G+ p,q,r .
In the n-D space there are multivectors of grade 0 (scalars), grade 1 (vectors),
grade 2 (bivectors), grade 3 (trivectors), etc... up to grade n. Any two such multi-
vectors can be multiplied using the geometric product. Consider two multivectors
Ar and Bs of grades r and s respectively. The geometric product of Ar and Bs can be written as

Ar Bs = ⟨AB⟩r+s + ⟨AB⟩r+s−2 + . . . + ⟨AB⟩|r−s| (10.5)
where ⟨M⟩t is used to denote the t-grade part of multivector M, e.g. consider the geometric product of two vectors ab = ⟨ab⟩0 + ⟨ab⟩2 = a · b + a ∧ b. Another simple illustration is the geometric product of A = 4σ3 + 2σ1 σ2 and b = 8σ2 + 6σ3 :

Ab = 24(σ3 )2 + 16σ1 (σ2 )2 + 32σ3 σ2 + 12σ1 σ2 σ3
   = 24 + 16σ1 − 32σ2 σ3 + 12I. (10.6)

Note here that the Clifford product gives σi σi = (σi )2 = σi · σi = 1, because the wedge product σi ∧ σi = 0, and σi σ j = σi ∧ σ j : the geometric product of different unit basis vectors equals their wedge product, which for simple notation can be omitted. Using equation (10.5) we can express the inner and outer products of multivectors as

Ar · Bs = ⟨Ar Bs ⟩|r−s| (10.7)
Ar ∧ Bs = ⟨Ar Bs ⟩r+s . (10.8)

In order to deal with more general multivectors, we define the scalar product

A ∗ B = ⟨AB⟩0 . (10.9)
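The worked product (10.6) and the grade projections (10.7)-(10.9) can be checked numerically with the same bitmask blade representation; again this is an illustrative sketch of ours, not code from the chapter.

```python
import numpy as np

# Numerical check of the worked product (10.6), using a length-8
# G(3,0,0) representation: index = bitmask of basis vectors, all of
# which square to +1.

def reorder_sign(a, b):
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1

def gp(x, y):
    out = np.zeros(8)
    for a in range(8):
        for b in range(8):
            out[a ^ b] += reorder_sign(a, b) * x[a] * y[b]
    return out

def grade(x, t):
    # <x>_t : keep only blades built from t basis vectors, cf. (10.7)-(10.8)
    out = np.zeros(8)
    for i in range(8):
        if bin(i).count("1") == t:
            out[i] = x[i]
    return out

A = np.zeros(8); A[0b100], A[0b011] = 4.0, 2.0    # A = 4 sigma3 + 2 sigma1 sigma2
b = np.zeros(8); b[0b010], b[0b100] = 8.0, 6.0    # b = 8 sigma2 + 6 sigma3
Ab = gp(A, b)
scalar_part = grade(Ab, 0)[0]                     # A * b, the scalar product (10.9)
```

Running this reproduces the four terms of (10.6): a scalar 24, a vector 16σ1, a bivector −32σ2σ3 and a trivector 12I.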

For an r-grade multivector Ar = ∑_{i=0}^{r} ⟨Ar ⟩i , the following operations are defined:

Grade involution:      Âr = ∑_{i=0}^{r} (−1)^i ⟨A⟩i (10.10)
Reversion:             A†r = ∑_{i=0}^{r} (−1)^{i(i−1)/2} ⟨A⟩i (10.11)
Clifford conjugation:  Ãr = Â†r (10.12)
                          = ∑_{i=0}^{r} (−1)^{i(i+1)/2} ⟨A⟩i (10.13)

The grade involution simply negates the odd-grade blades of a multivector. The reversion can also be obtained by reversing the order of the basis vectors making up the blades in a multivector and then rearranging them to their original order using the anti-commutativity of the Clifford product. The scalar product ∗ is positive definite, i.e. one can associate with any multivector A = ⟨A⟩0 + ⟨A⟩1 + . . . + ⟨A⟩n a unique positive scalar magnitude |A| defined by

|A|2 = A† ∗ A = ∑_{r=0}^{n} |⟨A⟩r |2 ≥ 0, (10.14)

where |A| = 0 if and only if A = 0. For a homogeneous multivector Ar its magnitude is defined as |Ar | = √(A†r Ar ). In particular, for an r-vector Ar of the form Ar = a1 ∧ a2 ∧ . . . ∧ ar : A†r = (a1 . . . ar−1 ar )† = ar ar−1 . . . a1 and thus A†r Ar = a21 a22 . . . a2r , so we will say that such an r-vector is null if and only if it has a null vector for a factor. If in such a factorization of Ar there are p, q and s factors which square to a positive number, a negative number and zero, respectively, we will say that Ar is an r-vector with signature (p, q, s). In particular, if s = 0 such a non-singular r-vector has a multiplicative inverse

A−1 = (−1)^q A† /|A|2 = A/A2 . (10.15)

In general, the inverse A−1 of a multivector A, if it exists, is defined by the equation


A−1 A = 1.
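The grade-sign operations (10.10)-(10.15) amount to multiplying each grade-k part by a fixed sign. The sketch below (our illustration, on the same length-8 G(3,0,0) representation as the earlier sketches) implements them and checks |A|² = A† ∗ A and A⁻¹A = 1 for a vector.

```python
import numpy as np

# Grade-sign operations (10.10)-(10.11) on a length-8 G(3,0,0)
# representation (index = blade bitmask, basis vectors square to +1).

def reorder_sign(a, b):
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1

def gp(x, y):
    # geometric product
    out = np.zeros(8)
    for a in range(8):
        for b in range(8):
            out[a ^ b] += reorder_sign(a, b) * x[a] * y[b]
    return out

def grade_signs(x, sign_of_grade):
    # apply a grade-dependent sign to every blade coefficient
    return np.array([sign_of_grade(bin(i).count("1")) * x[i] for i in range(8)])

involute = lambda x: grade_signs(x, lambda k: (-1) ** k)                   # (10.10)
reverse  = lambda x: grade_signs(x, lambda k: (-1) ** (k * (k - 1) // 2))  # (10.11)

A = np.array([1.0, 2.0, 0.0, 3.0, 0.0, 0.0, 1.0, 0.5])
mag2 = gp(reverse(A), A)[0]        # |A|^2 = A^dagger * A, cf. (10.14)

v = np.zeros(8); v[0b001] = 2.0    # a vector has the inverse v / v^2, cf. (10.15)
v_inv = v / gp(v, v)[0]
```

In this Euclidean signature the magnitude reduces to the sum of squared coefficients, which the test below confirms.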

10.2.2 The Geometric Algebra of 3-D Space


The basis for the geometric algebra G3,0,0 of the 3-D space has 2^3 = 8 elements and is given by:

1 (scalar), {σ1 , σ2 , σ3 } (vectors), {σ1 σ2 , σ2 σ3 , σ3 σ1 } (bivectors), {σ1 σ2 σ3 } ≡ I (trivector). (10.16)

In G3,0,0 a typical multivector v will be of the form v = α0 + α1 σ1 + α2 σ2 + α3 σ3 + α4 σ2 σ3 + α5 σ3 σ1 + α6 σ1 σ2 + α7 I = ⟨v⟩0 + ⟨v⟩1 + ⟨v⟩2 + ⟨v⟩3 , where the αi are real numbers and ⟨v⟩0 = α0 ∈ ⋀^0 V^n, ⟨v⟩1 = α1 σ1 + α2 σ2 + α3 σ3 ∈ ⋀^1 V^n, ⟨v⟩2 = α4 σ2 σ3 + α5 σ3 σ1 + α6 σ1 σ2 ∈ ⋀^2 V^n, ⟨v⟩3 = α7 I ∈ ⋀^3 V^n.
In geometric algebra a rotor (short for rotator), R, is an even-grade element of the algebra which satisfies RR̃ = 1, where R̃ stands for the conjugate of R. If A = {a0 , a1 , a2 , a3 } ∈ G3,0,0 represents a unit quaternion, then the rotor which performs the same rotation is simply given by

R = a0 + a1 (I σ1 ) − a2 (I σ2 ) + a3 (I σ3 ) (10.17)
  = a0 + a1 σ2 σ3 + a2 σ3 σ1 + a3 σ1 σ2 , (10.18)

i.e. a scalar plus a bivector part. The quaternion algebra is therefore seen to be a subset of the geometric algebra of 3-space. The conjugate of a rotor is given by R̃ = a0 − a1 σ2 σ3 + a2 σ3 σ1 − a3 σ1 σ2 . The transformation in terms of a rotor, a → RaR̃ = b, is a very general way of handling rotations; it works for multivectors of any grade and in spaces of any dimension, in contrast to quaternion calculus. Rotors combine in a straightforward manner, i.e. a rotor R1 followed by a rotor R2 is equivalent to a total rotor R where R = R2 R1 .
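The rotor transformation a → RaR̃ is easy to verify numerically. In the sketch below (our illustration; the rotation angle and axis are arbitrary choices), a rotor built from quaternion-like coefficients rotates σ1 into σ2 by π/2 about the σ3 axis.

```python
import numpy as np

# Rotor sketch in G(3,0,0): b = R a R~ rotates the vector a. Same
# bitmask blade representation as the earlier sketches.

def reorder_sign(p, q):
    p >>= 1
    swaps = 0
    while p:
        swaps += bin(p & q).count("1")
        p >>= 1
    return -1 if swaps % 2 else 1

def gp(x, y):
    out = np.zeros(8)
    for p in range(8):
        for q in range(8):
            out[p ^ q] += reorder_sign(p, q) * x[p] * y[q]
    return out

def reverse(x):
    # reversion: sign (-1)^{k(k-1)/2} on each grade-k blade
    return np.array([(-1) ** (bin(i).count("1") * (bin(i).count("1") - 1) // 2) * x[i]
                     for i in range(8)])

def rotor_z(theta):
    # rotor for a rotation by theta about sigma3: cos(t/2) - sin(t/2) sigma1 sigma2
    R = np.zeros(8)
    R[0], R[0b011] = np.cos(theta / 2), -np.sin(theta / 2)
    return R

R = rotor_z(np.pi / 2)
a = np.zeros(8); a[0b001] = 1.0       # the vector sigma1
b = gp(gp(R, a), reverse(R))          # b = R a R~ : sigma1 -> sigma2
```

The sandwich product leaves grade-1 elements grade-1, and composing two rotors is just their geometric product, matching R = R2 R1 above.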

10.3 Linear Clifford Support Vector Machines for Classification
For the case of the Clifford SVM for classification we represent the data set in a certain Clifford algebra Gn , where n = p + q + r and any basis multivector squares to 1, −1 or 0 depending on whether it belongs to the p, q or r basis vectors respectively. We consider the general case of an input comprising D multivectors and one multivector output, i.e. each ith vector has D multivector entries xi = [xi1 , xi2 , ..., xiD ]T , where xi j ∈ Gn and D is its dimension. Thus the ith vector dimension is D × 2^n, so each
238 N. Arana-Daniel, C. López-Franco, and E. Bayro-Corrochano

data vector xi ∈ Gn^D. Each ith vector will be associated with one output of the 2^n possibilities, given by the following multivector output:

yi = yis + yiσ1 σ1 + yiσ2 σ2 + . . . + yiI I ∈ {±1 ± σ1 ± σ2 . . . ± I} (10.19)

where the first subindex s stands for the scalar part. For classification, the CSVM separates these multivector-valued samples into 2^n groups by selecting a good enough function from the set of functions

{ f (x) = w†T x + b }, (10.20)

where x, w = [w1 , w2 , . . . , wD ]T ∈ Gn^D and f (x), b ∈ Gn . An entry of the optimal hyperplane w is given by wi = wis + wiσ1 σ1 + ... + wiσ1 σ2 σ1 σ2 + . . . + wiI I.
Let us look at the last equation in detail:

f (x) = w†T x + b = [w†1 , w†2 , ..., w†D ]T [x1 , x2 , ..., xD ] + b = ∑_{i=1}^{D} w†i xi + b, (10.21)

where w†i xi corresponds to the Clifford product of two multivectors and w†i is the reversion of the multivector wi .
Next, we introduce a structural risk functional similar to the real-valued one of the SVM for classification. By using a loss function similar to Vapnik's ε-insensitive one, we utilize the following linearly constrained quadratic program for the primal equation:

min L(w, b, ξ ) = ½ w†T w + C ∑_{i, j} ξi j
subject to
yi j ( f (xi )) j = yi j (w†T xi + b) j ≥ 1 − ξi j (10.22)
ξi j ≥ 0 for all i, j,
where ξi j stands for the slack variables, i indexes the data vector and j indexes the multivector component, i.e. j = 1 for the coefficient of the scalar part, j = 2 for the coefficient of σ1 , . . . , j = 2^n for the coefficient of I. The dual expression of this problem can be derived straightforwardly. Firstly let us consider the expression for the orientation of the optimal hyperplane,

w = [w1 , w2 , ..., wD ]T (10.23)

each of the wk is given by the multivector

wk = wks + wkσ1 σ1 + ... + wkσ1σ2 σ1 σ2 + ... + wkI I. (10.24)



Each component of these weights is computed as follows:

wks = ∑_{j=1}^{l} (αs ) j (ys ) j (xks ) j ,
wkσ1 = ∑_{j=1}^{l} (ασ1 ) j (yσ1 ) j (xkσ1 ) j , . . . ,
wkI = ∑_{j=1}^{l} (αI ) j (yI ) j (xkI ) j , (10.25)
where (αs ) j , (ασ1 ) j , ..., (αI ) j , j = 1, ..., l are the Lagrange multipliers. According to the Wolfe dual programming [1], the dual form reads

min ½ (w†T w) − ∑_{i, j} αi j (10.26)

subject to aT · 1 = 0, where all the Lagrange multipliers must fulfill 0 ≤ (αs ) j ≤ C, 0 ≤ (ασ1 ) j ≤ C, ..., 0 ≤ (ασ1 σ2 ) j ≤ C, ..., 0 ≤ (αI ) j ≤ C for i = 1, ..., D and j = 1, ..., l. In aT · 1 = 0, 1 denotes a vector of all ones. The entries of the vector

a = [as , aσ1 , aσ2 , ..., aσ1 σ2 , ..., aI ] (10.27)

are given by

aTs = [(αs )1 (ys )1 , (αs )2 (ys )2 , ..., (αs )l (ys )l ]
aTσ1 = [(ασ1 )1 (yσ1 )1 , (ασ1 )2 (yσ1 )2 , ..., (ασ1 )l (yσ1 )l ]
...
aTI = [(αI )1 (yI )1 , (αI )2 (yI )2 , ..., (αI )l (yI )l ]; (10.28)

note that the vector a has dimension (l × 2^n ) × 1. We require a compact and easy representation of the resulting Gram matrix of the multivector components; this will help in programming the algorithm. For that, let us first consider the Clifford product w†T w, which can be expressed as follows:

w†T w = ⟨w†T w⟩s + ⟨w†T w⟩σ1 + ⟨w†T w⟩σ2 + . . . + ⟨w†T w⟩I (10.29)

Since w has the components presented in (10.25), equation (10.29) can be rewritten as follows:

w†T w = aTs ⟨x†T x⟩s as + ... + aTs ⟨x†T x⟩σ1 σ2 aσ1 σ2 + ... + aTs ⟨x†T x⟩I aI
      + aTσ1 ⟨x†T x⟩s as + ... + aTσ1 ⟨x†T x⟩σ1 σ2 aσ1 σ2 + ... + aTσ1 ⟨x†T x⟩I aI
      + ...
      + aTI ⟨x†T x⟩s as + aTI ⟨x†T x⟩σ1 aσ1 + ... + aTI ⟨x†T x⟩σ1 σ2 aσ1 σ2 + ... + aTI ⟨x†T x⟩I aI . (10.30)

Renaming the matrices of the t-grade parts ⟨x†T x⟩t as Ht , we rewrite the previous equation as:

w†T w = aTs Hs as + aTs Hσ1 aσ1 + aTs Hσ1 σ2 aσ1 σ2 + ... + aTs HI aI
      + aTσ1 Hs as + aTσ1 Hσ1 aσ1 + ... + aTσ1 Hσ1 σ2 aσ1 σ2 + ... + aTσ1 HI aI
      + ...
      + aTI Hs as + aTI Hσ1 aσ1 + ... + aTI Hσ1 σ2 aσ1 σ2 + ... + aTI HI aI . (10.31)

Taking the previous equations and definitions into consideration, the primal problem (10.22) now reads as follows:

min L(w, b, ξ ) = ½ aT Ha + C · ∑_{i, j} ξi j (10.32)

Using the previous definitions and equations we can state the dual optimization problem as follows:

max aT 1 − ½ aT Ha
subject to
0 ≤ (αs ) j ≤ C, 0 ≤ (ασ1 ) j ≤ C, ..., 0 ≤ (ασ1 σ2 ) j ≤ C, ..., 0 ≤ (αI ) j ≤ C
for j = 1, ..., l, (10.33)

where a is given by (10.27) and, again, 1 denotes a vector of all ones.


H is a positive semidefinite matrix which is the expected generalized Gram ma-
trix. This matrix in terms of the matrices of the t-grade parts of x∗ xt is written as
follows:
⎡ ⎤
Hs Hσ1 Hσ2 .... .... ... ... Hσ1σ2 ... HI
⎢ HσT1 Hs ... Hσ4 .....Hσ1 σ2 ... HI Hs ⎥
⎢ ⎥
⎢ HσT2 HσT1 Hs ... Hσ1 σ2 ... HI Hs Hσ1 ⎥
⎢ ⎥
H =⎢
⎢ . ⎥,
⎥ (10.34)
⎢ . ⎥
⎢ ⎥
⎣ . ⎦
HIT ... HσT1 σ2 .............HσT2 HσT1 Hs

note that the diagonal entries equal to Hs and since H is a symmetric matrix the
lower matrices are transposed. The optimal weight vector w is as given by (10.23).
The threshold b ∈ Gn can be computed by using the KKT conditions with the Clifford support vectors as follows:

b = bs + bσ1 σ1 + ... + bσ1 σ2 σ1 σ2 + ... + bI I = ∑_{j=1}^{l} (y j − w†T x j )/l. (10.35)
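For the simplest algebra G(1,0,0) the assembly of H can be illustrated concretely. In the sketch below (our illustration with random data; the two-block layout is the specialization of the general block structure to two components), the component Gram matrices are built from the scalar and I parts of x†m x j and placed in blocks, with the transposed block below the diagonal.

```python
import numpy as np

# Illustration of assembling the generalized Gram matrix H for the
# two-component case G(1,0,0) (scalar part + I part). The component
# Gram matrices hold the grade parts of x_m^dagger x_j; blocks below
# the diagonal are transposed, as described in the text. Random data
# are purely illustrative.

rng = np.random.default_rng(0)
l = 5
xs = rng.normal(size=l)   # scalar parts of the l samples
xI = rng.normal(size=l)   # I parts

# x_m^dagger x_j = (xs_m - I xI_m)(xs_j + I xI_j):
Hs = np.outer(xs, xs) + np.outer(xI, xI)   # scalar-grade part
HI = np.outer(xs, xI) - np.outer(xI, xs)   # I-grade part (antisymmetric)

H = np.block([[Hs, HI],
              [HI.T, Hs]])                 # symmetric, positive semidefinite
```

The assembled matrix is symmetric with Hs on the diagonal and, being the real embedding of a complex Gram matrix, has no negative eigenvalues, consistent with H being positive semidefinite.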

The decision function can be seen as sectors reserved for each involved class; i.e. in the case of complex numbers (G1,0,0 ) or quaternions (G0,2,0 ), the circle or the sphere is divided by means of spherical vectors. Thus the decision function can be envisaged as

y = csignm ( f (x)) = csignm (w†T x + b) = csignm ( ∑_{j=1}^{l} (α j ◦ y j )(x†T_j x) + b ), (10.36)
j=1
" #
where csignm f (x) is the function for detecting the sign of f (x) and m stands for
the different values which indicate the state valency, e.g. bivalent, tetravalent and
the operation “◦” is defined as

(α j ◦ y j ) = < α j >0 < y j >0 + < α j >1 < y j >1 σ1 + ...


+ < α j >2n < y j >2n I, (10.37)

simply one consider as coefficients of the multivector basis the multiplications be-
tween the coefficients of blades of same degree. For clarity we introduce this oper-
ation “◦”which takes place implicitly in previous equation (10.25).
Note that the cases of complex numbers, 2-state (outputs 1 for −π/2 ≤ arg( f (x)) < π/2 and −1 for π/2 ≤ arg( f (x)) < 3π/2) and 4-state (outputs 1+i for 0 ≤ arg( f (x)) < π/2, −1+i for π/2 ≤ arg( f (x)) < π, −1−i for π ≤ arg( f (x)) < 3π/2 and 1−i for 3π/2 ≤ arg( f (x)) < 2π), can be solved by the multi-class real-valued SVM; however, for higher representations like the 16-state using quaternions, it would be awkward to resort to the multi-class real-valued SVMs.
The major advantage of our approach is that we redefine the optimization variables as multivectors. This allows us to utilize the components of the multivector output to represent different classes. The number of achievable class outputs is directly proportional to the dimension of the involved geometric algebra. The key idea for solving multi-class classification in geometric algebra is to prevent the multivector elements of different grade from collapsing into a scalar; this is done thanks to the redefinition of the primal problem involving the Clifford product instead of the inner product (10.22). The reader should bear in mind that the Clifford product performs the direct product between the spaces of different grade and its result is represented by a multivector; thus the outputs of the CSVM are represented by y = ys + yσ1 σ1 + yσ2 σ2 + ... + yI I ∈ {±1 ± σ1 ± σ2 . . . ± I}.
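The sign-pattern encoding of classes can be sketched as follows; the helper names and the toy output are hypothetical, and the complex (two-component) case shown gives 2² = 4 classes.

```python
import numpy as np

# Sign-pattern class encoding: csignm maps the multivector output to a
# componentwise sign pattern; each pattern is one of 2^n classes. The
# helper names and the toy output below are hypothetical.

def csignm(f):
    # componentwise sign of the multivector output f
    return np.where(np.asarray(f) >= 0, 1, -1)

def state_index(pattern):
    # map a sign pattern such as [1, -1] to a class index 0 .. 2^n - 1
    bits = (np.asarray(pattern) > 0).astype(int)
    return int("".join(map(str, bits)), 2)

f = [0.7, -2.1]               # a hypothetical 2-component CSVM output
cls = state_index(csignm(f))  # one of the four states {±1 ± I}
```

Each additional multivector component doubles the number of representable states, which is why the class capacity grows with the dimension of the chosen geometric algebra.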

10.4 Non Linear Clifford Support Vector Machines for Classification
For nonlinear Clifford-valued classification problems we require a Clifford-valued kernel K(x, y). In order to fulfill the Mercer theorem we resort to a component-wise Clifford-valued mapping

φ : x ∈ Gn −→ Φ (x) = Φs (x) + Φσ1 (x)σ1 + Φσ2 (x)σ2 + ... + I ΦI (x) ∈ Gn .

In general we build a Clifford kernel K(xm , x j ) by taking the Clifford product between the reversion of xm and x j as follows:

K(xm , x j ) = Φ (xm )† Φ (x j ), (10.38)

where the kind of reversion operation (·)† of a multivector depends on the signature of the involved geometric algebra G p,q,r . Next, as an illustration, we present kernels using different geometric algebras. According to the Mercer theorem, there exists a mapping u : G → F which maps the multivectors x ∈ Gn into the complex Euclidean space: x −→ u(x) = ur (x) + IuI (x).
Complex-valued linear kernel function in G1,0,0 (the center of this geometric algebra, spanned by the scalar 1 and I = σ1 σ2 , is isomorphic to C):

K(xm , xn ) = u(xm )† u(xn )
           = (u(xm )s u(xn )s + u(xm )I u(xn )I ) + I(u(xm )s u(xn )I − u(xm )I u(xn )s )
           = (k(xm , xn )ss + k(xm , xn )II ) + I(k(xm , xn )Is − k(xm , xn )sI )
           = Hr + IHi , (10.39)

where (xs )m , (xs )n , (xI )m , (xI )n are the individual components of the complex numbers (x)m = (xs )m + I(xI )m ∈ G1,0,0 and (x)n = (xs )n + I(xI )n ∈ G1,0,0 respectively.
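The split of (10.39) into Hr and Hi can be checked directly with a complex conjugate product; the sample values below are arbitrary illustrations.

```python
import numpy as np

# Check of the split K = Hr + I Hi in (10.39) for G(1,0,0) ~ C: the
# kernel is the conjugate product u(xm)^dagger u(xn). Sample values
# are arbitrary illustrations.

xm = 2.0 + 1.0j
xn = 0.5 - 3.0j
K = np.conj(xm) * xn

Hr = xm.real * xn.real + xm.imag * xn.imag   # k_ss + k_II
Hi = xm.real * xn.imag - xm.imag * xn.real   # k_Is - k_sI
```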
For the quaternion-valued Gabor kernel function we use i = σ2 σ3 , j = −σ3 σ1 , k = σ1 σ2 . The Gaussian-window Gabor kernel function reads

K(xm , xn ) = g(xm , xn ) exp(−i wT0 (xm − xn )), (10.40)

where the normalized Gaussian window function is given by

g(xm , xn ) = (1/(√(2π) ρ)) exp(−‖xm − xn ‖2 /(2ρ 2 )), (10.41)

and the variables w0 and xm − xn stand for the frequency and space domains
respectively.
Unlike the Hartley transform or the 2D complex Fourier transform, this kernel function nicely separates the even and odd components of the involved signal, i.e.

K(xm , xn ) = K(xm , xn )s + K(xm , xn )σ2 σ3 + K(xm , xn )σ3 σ1 + K(xm , xn )σ1 σ2
           = g(xm , xn ) cos(wT0 xm ) cos(wT0 xn ) + g(xm , xn ) cos(wT0 xm ) sin(wT0 xn ) i
           + g(xm , xn ) sin(wT0 xm ) cos(wT0 xn ) j + g(xm , xn ) sin(wT0 xm ) sin(wT0 xn ) k.

Since g(xm , xn ) fulfills Mercer's condition, it is straightforward to prove that the k(xm , xn )u in the above equations satisfy this condition as well.
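The four components of the quaternion Gabor kernel can be evaluated as in the sketch below; w0, ρ and the sample points are illustrative choices, and we read the second cos/sin argument as xn, which we take to be the intent of the printed formula.

```python
import numpy as np

# The four components of the quaternion Gabor kernel: the Gaussian
# window g times the cos/sin combinations separating even and odd
# parts. w0, rho and the sample points are illustrative choices.

def gabor_kernel_parts(xm, xn, w0, rho=1.0):
    g = np.exp(-np.linalg.norm(xm - xn) ** 2 / (2 * rho ** 2)) / (np.sqrt(2 * np.pi) * rho)
    cm, sm = np.cos(w0 @ xm), np.sin(w0 @ xm)
    cn, sn = np.cos(w0 @ xn), np.sin(w0 @ xn)
    return (g * cm * cn,   # scalar part
            g * cm * sn,   # i = sigma2 sigma3 part
            g * sm * cn,   # j = -sigma3 sigma1 part
            g * sm * sn)   # k = sigma1 sigma2 part

xm = np.array([0.2, -0.1])
xn = np.array([0.0, 0.3])
w0 = np.array([1.0, 2.0])
parts = gabor_kernel_parts(xm, xn, w0)
```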
Having defined these kernels, we can proceed with the formulation of the SVM conditions. We substitute the mapped data Φ (x) = ∑_{u=1}^{2^n} ⟨Φ (x)⟩u into the linear function f (x) = w†T x + b = w∗T Φ (x) + b. The problem can be stated similarly as in (10.22)-(10.26). In fact, we can replace the kernel function in (10.33) to accomplish the Wolfe dual programming and thereby obtain the kernel function group for nonlinear classification

Hs = [Ks (xm , x j )]m, j=1,..,l
Hσ1 = [Kσ1 (xm , x j )]m, j=1,..,l
...
Hσn = [Kσn (xm , x j )]m, j=1,..,l
...
HI = [KI (xm , x j )]m, j=1,..,l . (10.42)

In the same way, we use the kernel functions to replace the dot product of the input data in (10.36). In general, the output function of the nonlinear Clifford SVM reads

y = csignm ( f (x)) = csignm (w†T Φ (x) + b), (10.43)

where m stands for the state valency.

10.5 Clifford SVM for Regression


The representation of the data set for the case of Clifford SVM for regres-
sion is the same as for Clifford SVM for classification; we represent the data set
244 N. Arana-Daniel, C. López-Franco, and E. Bayro-Corrochano

in a certain Clifford algebra Gn . Each i-th data vector has multivector entries xi = [xi1 , xi2 , ..., xiD ]T , where xi j ∈ Gn and D is its dimension. Let (x1 , y1 ), (x2 , y2 ), ..., (x j , y j ), ..., (xl , yl ) be the training set of independently and identically distributed multivector-valued sample pairs, where each label is yi = yis s + yiσ1 σ1 + yiσ2 σ2 + ... + yiI I, and the subindex s stands for the scalar part. The regression problem using multivectors is to find a multivector-valued function f (x) that has at most ε -deviation from the actually obtained targets yi ∈ Gn for all the training data and, at the same time, is as flat as possible. We will use a multivector-valued ε -insensitive loss function and arrive at the formulation of Vapnik [1]:

min ½ w†T w + C · ∑i, j (ξi j + ξ̃i j )
subject to
(yi − w†T xi − b) j ≤ ε + ξi j
(w†T xi + b − yi ) j ≤ ε + ξ̃i j ,    ξi j ≥ 0, ξ̃i j ≥ 0 for all i, j, (10.44)
where w, x ∈ GnD , and (·) j extracts the scalar accompanying a multivector basis element. Next we proceed as in Section 10.3; since the expression for the orientation of the optimal hyperplane is the same as in (10.23), each of the wi is computed as follows:
ws = ∑_{j=1}^{l} ((αs ) j − (α̃s ) j ) (xs ) j ,
wσ1 = ∑_{j=1}^{l} ((ασ1 ) j − (α̃σ1 ) j ) (xσ1 ) j , ...,
wI = ∑_{j=1}^{l} ((αI ) j − (α̃I ) j ) (xI ) j .

We can now redefine the entries of the vector in (10.27); these are given by

aTs = [(αs1 − α̃s1 ), (αs2 − α̃s2 ), ..., (αsl − α̃sl )],
aTσ1 = [(ασ1 1 − α̃σ1 1 ), (ασ1 2 − α̃σ1 2 ), ..., (ασ1 l − α̃σ1 l )],
...
aTI = [(αI1 − α̃I1 ), (αI2 − α̃I2 ), ..., (αIl − α̃Il )]. (10.45)

Now, we can rewrite the Clifford product, as we did in (10.29)-(10.31), to get the primal problem as follows:

min ½ aT Ha + C · ∑_{i=1}^{l} (ξi + ξ̃i )
subject to
(w† x + b − y) j ≤ (ε + ξ ) j
(y − w† x − b) j ≤ (ε + ξ̃ ) j
ξi j ≥ 0, ξ̃i j ≥ 0 for all i, j.

Thereafter, we can straightforwardly write the dual of this primal problem for solving the regression problem:

max −α̃ T (ε − y) − α T (ε + y) − ½ aT Ha
subject to
∑_{j=1}^{l} (αs j − α̃s j ) = 0, ∑_{j=1}^{l} (ασ1 j − α̃σ1 j ) = 0, ...,
∑_{j=1}^{l} (αI j − α̃I j ) = 0,
0 ≤ αis ≤ C, 0 ≤ αiσ1 ≤ C, ..., 0 ≤ αiσ1 σ2 ≤ C, ..., 0 ≤ αiI ≤ C,
0 ≤ α̃is ≤ C, 0 ≤ α̃iσ1 ≤ C, ..., 0 ≤ α̃iσ1 σ2 ≤ C, ..., 0 ≤ α̃iI ≤ C,
for j = 1, ..., l. (10.46)

For nonlinear regression, similarly as explained in Section 10.4, we utilize a particular kernel for computing k(xm , x j ) = Φ (xm )∗ Φ (x j ); again, this kind of conjugation operation (·)∗ of a multivector depends on the signature of the involved geometric algebra G p,q,r . We can use the kernels described in Section 10.4.

10.6 Recurrent Clifford SVM


SVMs are very powerful for solving regression and classification tasks. They carry out predictions by linearly combining kernel basis functions. By mapping the input feature space to a higher-dimensional space, SVMs can linearly separate clusters by means of an optimal hyperplane. A rather limited way to apply existing SVMs to sequence prediction [? ? ] or classification [12] is to build a training set either by transforming the sequential input to an input vector of some static domain (e.g., a frequency or phase representation, a Hidden Markov Model (HMM) [13, 14]), or by simple frequency counting of patterns, symbols or substrings, or by taking fixed time windows of k sequential values [10]. The window-based approaches, of course, fail if the temporal dependency exceeds the length of k steps. As for training HMMs with long sequences, they unfortunately get stuck in numerous local minima [15, 16]. Suykens and Vandewalle [17] incorporate the dynamic equations into the primal problem of an SVM solution. The major disadvantage of this approach is that the problem is no longer convex, thus there is no guarantee of finding a globally optimal solution. In none of these attempts has there been a recurrent SVM which learns tasks involving time lags of arbitrary length between important input events. However, a pioneering attempt using real-valued SVMs and neuroevolution for sequence prediction was made by Schmidhuber et al. [18]. Unfortunately, at present the research activity on recurrent SVMs is very scarce.
We started to explore a way to build a CSVM-based recurrent system which profits from all the advantages of the CSVM, namely: it helps to maintain convexity, it is MIMO, and it is suited to processing sequences with geometric characteristics. In order to do that, we decided to connect two processing modules in cascade: a Long Short-Term Memory (LSTM) [20] and a CSVM.

LSTM-CSVM is an Evolino- and Evoke-based system [18, 19]: the underlying idea of these systems is that two cascaded modules are needed: a robust module to process short- and long-time dependencies (the LSTM) and an optimization module to produce precise outputs (a CSVM, the Moore-Penrose pseudoinverse method, or an SVM, respectively). The LSTM module addresses the disadvantage of having relevant pieces of information outside the history window and also avoids the "vanishing error" problem presented by algorithms like Back-Propagation Through Time (BPTT, e.g., Williams and Zipser 1992) or Real-Time Recurrent Learning (RTRL, e.g., Robinson and Fallside 1987)1. Meanwhile, the CSVM maps the internal activations of the first module to a set of precise outputs; again, the multivector output representation is exploited to implement a system with fewer processing units and therefore lower computational complexity.
LSTM-CSVM works as follows: a sequence of input vectors (u(0)...u(t)) is given to the LSTM, which in turn feeds the CSVM with the outputs of each of its memory cells, see Fig. 10.1.

Fig. 10.1 LSTM-CSVM system

The CSVM aims at finding the expected nonlinear mapping of the training data. The input and output equations of Figure 10.1 are

φ (t) = f (W, u(t), u(t−1), ..., u(0)),
y(t) = b + ∑_{i=1}^{k} wi K(φ (t), φi (t)), (10.47)

where φ (t) = [ψ1 , ψ2 , ..., ψn ]T ∈ Rn is the activation at time t of the n units of the LSTM, which serves as input to the CSVM, given the input vectors (u(0)...u(t)) and the weight matrix W . Since the LSTM is a recurrent net, the argument of the function f (·) represents the history of the input vectors.
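A minimal sketch of the readout in (10.47), assuming a Gaussian kernel over the LSTM activations (the kernel choice, the function name and all parameter names are illustrative, not from the chapter):

```python
import numpy as np

def csvm_readout(phi_t, phi_train, w, b, rho=1.0):
    """Output layer of Eq. (10.47): combine kernel evaluations between
    the current LSTM activation phi(t) and the stored training
    activations phi_i(t), weighted by the learned coefficients w.
    """
    # Gaussian kernel between phi(t) and each stored activation (one row each)
    k = np.exp(-np.sum((phi_train - phi_t) ** 2, axis=1) / (2.0 * rho ** 2))
    return b + w @ k
```

In the full system each output component would itself be multivector-valued; the scalar version above only illustrates the kernel-expansion structure of the readout.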

1 The reader can get more information about BPTT and RTRL-vanishing error versus
LSTM-constant error flow in [20].

First, the LSTM-CSVM system was trained using the conventional algorithm for the LSTM. Although the system learns, it unfortunately takes too long to find a suitable matrix W . Instead, propagating the training data through the LSTM-CSVM system, we evolved the rows of the matrix using the evolutionary algorithm known as Enforced Sub-Populations (ESP) [21]. This approach differs from the standard methods because, instead of evolving the complete set of net parameters, it rather evolves subpopulations of the LSTM memory cells. For the mutation of the chromosomes, ESP uses a Cauchy density function.
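A sketch of the Cauchy mutation step used by ESP might look as follows; the function name and the use of NumPy's standard Cauchy sampler are assumptions, and ESP additionally manages selection within each subpopulation, which is omitted here:

```python
import numpy as np

def cauchy_mutate(chromosome, alpha, rng):
    """Perturb a weight chromosome with Cauchy-distributed noise.

    alpha is the Cauchy scale parameter (e.g. 1e-3 in the Venice Lagoon
    experiment below); heavy Cauchy tails allow occasional large jumps.
    """
    return chromosome + alpha * rng.standard_cauchy(size=chromosome.shape)
```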

10.7 Applications
In this section we present five interesting experiments. The first one shows multi-class classification using the CSVM with a simulated example. Here, we also present the number of variables computed per approach and a time comparison between the CSVM and three approaches for multi-class classification using real-valued SVMs. The second is about object multi-class classification with two types of training data: Phase a) artificial data and Phase b) real data obtained from a stereo vision system. We also compared the CSVM against MLPs (for multi-class classification). The third experiment presents a multi-class interpolation. The fourth and fifth include the experimental analysis of the recurrent CSVM.

10.7.1 3D Spiral: Nonlinear Classification Problem


We extended the well-known 2D spiral problem to 3D space. This experiment should test whether the CSVM is able to separate five 1-D manifolds embedded in R3 . In this application, we used a quaternion-valued CSVM which works in G0,2,0 2 ; this allows us to have quaternion inputs and outputs, and therefore, with one output quaternion, we can represent up to 2^4 = 16 classes. The functions were generated as follows:
fi (t) = [xi (t), yi (t), zi (t)]
= [zi ∗ cos(θ ) ∗ sin(θ ), zi ∗ sin(θ ) ∗ sin(θ ), zi ∗ cos(θ )], i = 1, ..., 5.
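The spiral data can be generated along these lines; the per-class offset of the z-profile z_i is an assumption, since the chapter gives only the common form of the five functions and the normalization by 10:

```python
import numpy as np

def spiral_class(i, n_points=50, turns=3.0):
    """Sample one of the five 3D spiral manifolds (class i = 1..5).

    The growing z-profile and the per-class offset 0.5*i are assumed;
    the chapter specifies only the common [z cosθ sinθ, z sinθ sinθ,
    z cosθ] form and the final normalization by 10.
    """
    theta = np.linspace(0.0, turns * 2.0 * np.pi, n_points)
    z = theta / (2.0 * np.pi) + 0.5 * i            # assumed z-profile
    pts = np.stack([z * np.cos(theta) * np.sin(theta),
                    z * np.sin(theta) * np.sin(theta),
                    z * np.cos(theta)], axis=1) / 10.0   # normalized by 10
    return pts
```

A useful sanity check: every point of class i lies at distance z/10 from the origin, since the three coordinates sum in quadrature to z².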

To depict these vectors, they were normalized by 10. In Fig. 10.2 one can see that the problem is highly nonlinearly separable. The CSVM uses for training 50 input quaternions from each of the five functions; since these have three coordinates, we simply use
2 The dimension of this geometric algebra is 2² = 4.


Fig. 10.2 3D spiral with five classes. The marks represent the support multivectors found by
the CSVM

the bivector part of the quaternion, namely xi = xi (t)σ2 σ3 + yi (t)σ3 σ1 + zi (t)σ1 σ2 ≡ [0, xi (t), yi (t), zi (t)]. The CSVM used the kernel given by (10.42). Note that the CSVM indeed manages to separate the five classes.

10.7.1.1 Comparisons Using 3D Spiral Example


According to [22], the most widely used methods for multi-class classification are: one-against-all [23], one-against-one [24], DAGSVM [26], and some methods that solve the multi-class problem in one step, known as all-together methods [27]. Table 10.1 shows a comparison of the number of variables computed per approach, considering also the CSVM.
The experiments shown in [22] indicate that "one-against-one and DAG methods are more suitable for practical use than the other methods"; of these methods, we have chosen one-against-one, together with the earliest implementation for SVM multi-class classification, the one-against-all approach, for comparison against our proposed CSVM. The comparisons were made using the 3D spiral toy example and the quaternion CSVM shown in the previous subsection. The number of classes was increased in each experiment; we started with K=3 classes and 50 training inputs for each class. Since the training inputs have three coordinates, we simply use the bivector part of the quaternion for the CSVM approach, namely xi = xi (t)σ2 σ3 + yi (t)σ3 σ1 + zi (t)σ1 σ2 ≡ [0, xi (t), yi (t), zi (t)]; therefore the CSVM computes D ∗ N = 3 ∗ 150 = 450 variables. The one-against-all and one-against-one approaches compute 450 and 300 variables respectively; however, the training times of the CSVM and one-against-one are very similar in the first experiment. Note that when we increase the number of classes, the performance of the CSVM is much better than that of the other approaches because the number of variables to compute is greatly reduced.
We improved the computational efficiency of all these algorithms by utilizing the decomposition method [28] and the shrinking technique [29]. We can see in Table 10.2 that the CSVM, using a quarter of the variables, is still faster, with around a quarter of the processing time of the other approaches. The classification performance of the four approaches is presented in Table 10.3. We used 50 and 20 vectors per class during training and testing, respectively. We can see that the CSVM for classification has overall the best performance.

Table 10.1 Number of variables per approach

Approach NQP NVQP TNV

CSVM 1 D*N D*N
One-against-all K N K*N
One-against-one K(K-1)/2 2*N/K N(K-1)
DAGSVM K(K-1)/2 2*N/K N(K-1)
All-together (all data at once) 1 K*N K*N

NQP: Number of quadratic problems to solve
NVQP: Number of variables to compute per quadratic problem
TNV: Total number of variables
D: Training input data dimension
N: Total number of training examples
K: Number of classes
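The total variable counts of Table 10.1 can be reproduced directly; the following helper is illustrative:

```python
def variable_counts(D, N, K):
    """Total number of variables (TNV column of Table 10.1) per approach.

    D: input dimension, N: total training examples, K: classes.
    """
    return {
        "CSVM": D * N,                      # one QP over D*N variables
        "one-against-all": K * N,           # K QPs of N variables each
        "one-against-one": N * (K - 1),     # K(K-1)/2 QPs of 2N/K variables
        "DAGSVM": N * (K - 1),              # same training as one-against-one
        "all-together": K * N,              # one QP over K*N variables
    }
```

For the first spiral experiment (D = 3, N = 150, K = 3) this gives 450 variables for the CSVM and one-against-all, and 300 for one-against-one, matching the counts quoted in the text.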

Table 10.2 Time training per approach (seconds)

Approach K=3, N=150 K=5, N=250 K=16, N=800
 (Variables) (Variables) (Variables)

CSVM, C=1000: 0.07 (450), 0.987 (750), 10.07 (3200)
One-against-all, (C, σ)=(1000, 2^−3): 0.11 (450), 8.54 (1250), 131.24 (12800)
One-against-one, (C, σ)=(1000, 2^−2): 0.09 (300), 2.31 (1000), 30.86 (12000)
DAGSVM, (C, σ)=(1000, 2^−3): 0.10 (300), 3.98 (1000), 38.88 (12000)
K: Number of classes, N: Number of training examples (50 per class).
Used kernels K(xi , x j ) = e^{−σ ||xi −x j ||²} with parameters taken from σ = {2, 2^0 , 2^−1 , 2^−2 , 2^−3 } and costs C = {1, 10, 100, 1000, 10000}. From these 5 × 5 combinations, the best result was selected for each approach.

Table 10.3 Percent of accuracy in training and test

Approach Ntrain=150 Ntrain=250 Ntrain=800
 Ntest=60 Ntest=100 Ntest=320
 K=3 K=5 K=16

CSVM, C=1000: 98.66 (95.00), 99.2 (98.00), 99.87 (99.68)
One-against-all, (C, σ)=(1000, 2^−3): 96.00 (90.00), 98.00 (96.00), 99.75 (99.06)
One-against-one, (C, σ)=(1000, 2^−2): 98.00 (95.00), 98.4 (99.00), 99.87 (99.375)
DAGSVM, (C, σ)=(1000, 2^−3): 97.33 (95.00), 98.4 (97.00), 99.87 (99.68)

K: Number of classes; Ntrain: number of training vectors; Ntest: number of test vectors. Accuracy (%) in the training phase is given first; in brackets, accuracy (%) in the test phase.
Used kernels K(xi , x j ) = e^{−σ ||xi −x j ||²} with parameters taken from σ = {2, 2^0 , 2^−1 , 2^−2 , 2^−3 } and costs C = {1, 10, 100, 1000, 10000}. From these 5 × 5 combinations, the best result was selected for each approach.

10.7.2 Object Recognition


In this subsection we show an application of the Clifford SVM to multi-class object classification. In these experiments, we want to use only one CSVM with a quaternion as input and a quaternion as output, which allows us to have up to 2^4 = 16 classes. Basically, we packed into a feature quaternion one 3D point (which lies on the surface of the object) and the magnitude of the distance between this point and the point on the main axis of the object on the same level curve. Fig. 10.3 depicts the 4 features taken from the object:

Xi = δi s + xi σ2 σ3 + yi σ3 σ1 + zi σ1 σ2 (10.48)
≡ [δi , (xi , yi , zi )]T

For each object we trained the CSVM using a set of several feature quaternions obtained from different level curves; that means that each object is represented by several feature quaternions and not only one. Due to this way of training the CSVM, the order in which the feature quaternions are shown to the CSVM is important: we begin to sample data from the bottom to the top of the objects, and we show the training and test data in this order to the CSVM. We processed the outputs using a counter that computes which class fires the most for each training or test set in


Fig. 10.3 Geometric characteristics of one training object. The magnitude is δi , and the 3D
coordinates (xi , yi , zi ) to build the feature vector: [δi , (xi , yi , zi )]

order to decide to which class the object belongs, see Fig. 10.4. Note carefully that this experiment is in any case a challenge for any recognition algorithm, because the feature signature is sparse. We will show later that, using this kind of feature vectors, the CSVM's performance is superior to the MLP's. Of course, if more time is spent improving the quality of the feature signature, the CSVM's performance will increase accordingly.

Fig. 10.4 After we get the outputs, these are accumulated using a counter to calculate which
class the object belongs
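The voting step of Fig. 10.4 is a simple majority count over the per-quaternion CSVM outputs; a minimal sketch (the function name is illustrative):

```python
from collections import Counter

def winner_class(predictions):
    """Return the class that fires most often across all feature
    quaternions of one object (the counter of Fig. 10.4).
    """
    return Counter(predictions).most_common(1)[0][0]
```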

It is important to say that all the objects (synthetic and real) were preprocessed in order to have a common center and the same scale; our learning process can thus be seen as centering- and scale-invariant.
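A sketch of such preprocessing; the chapter does not specify the normalization used, so unit RMS radius is one assumed choice:

```python
import numpy as np

def center_and_scale(points):
    """Move sampled 3D points to a common center and a common scale.

    Centering subtracts the centroid; the scale factor (RMS distance to
    the centroid) is an assumption, chosen so that all objects end up
    with unit RMS radius.
    """
    centered = points - points.mean(axis=0)
    scale = np.sqrt((centered ** 2).sum(axis=1).mean())  # RMS radius
    return centered / scale
```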

10.7.2.1 Phase a) Synthetic Data


In the first phase of this experiment, we used training data obtained from synthetic objects; the training set is shown in Fig. 10.5. Note that we have six different objects, which means a six-class classification problem, and we solve it with only one CSVM, making use of its multi-output characteristic. In general, for the "one versus all" approach one needs n SVMs (one for each class). In contrast, the CSVM needs only one machine because its quaternion output allows up to 16 class outputs. For the input-data coding, we used a 3D point which is packed into

the σ2 σ3 , σ3 σ1 , σ1 σ2 basis of the feature quaternion, and the magnitude was packed into the scalar part of the quaternion. Figure 10.6 shows the 3D points sampled from the objects. We compared the performance of the following approaches: the CSVM, a 4-7-6 MLP, and the real-valued SVM approaches one-against-one, one-against-all and DAGSVM. The results in Tables 10.4 and 10.5 show that the CSVM has better generalization and fewer training errors than the MLP approach and the real-valued SVM approaches. Note that all methods were sped up using the acceleration techniques [28, 29]. The authors think that the MLP presents more training and generalization errors because the way we represent the objects (as feature quaternion sets) makes the MLP get stuck in local minima very often during the learning phase, whereas the CSVM is guaranteed to find the optimal solution to the classification problem because it solves a convex quadratic problem with a global minimum. With respect to the real-valued SVM-based approaches, the CSVM takes advantage of the Clifford product, which enhances the discriminatory power of the classifier itself, unlike the other approaches, which are based solely on inner products.


Fig. 10.5 Training synthetic object set

10.7.2.2 Phase b) Real Data


In this phase of the experiment we obtained the training data using our robot "Geometer", shown on the right side of Fig. 10.7. We took two stereoscopic views of each object: one frontal view and one 180° rotated view (w.r.t. the frontal view). After that, we applied the Harris filter on each view in order to get the object corners and then, with the stereo system, the 3D points (xi , yi , zi ) which lay on the object surface, and we calculated the magnitude δi for the feature quaternion in (10.48). This process

Table 10.4 Object-recognition performance in percent (%) during training using synthetic
data

Object NTS CSVM MLP 1-vs-all 1-vs-1 DAGSVM


C=1200 a) b) c)
C 86 93.02 48.83 87.2 90.69 90.69
S 84 89.28 46.42 89.28 90.47 90.47
F 84 85.71 40.47 83.33 84.52 83.33
W 86 91.86 46.51 90.69 91.86 93.02
D 80 93.75 50.00 87.5 91.25 90.00
U 84 86.90 48.80 82.14 83.33 84.52
C=cylinder, S=sphere, F=fountain, W=worm, D=Diamond,
U=cube
NTS= Number of Training Vectors.
Used kernels K(xi , x j ) = e^{−σ ||xi −x j ||²} with parameters taken from σ = {2^−1 , 2^−2 , 2^−3 , 2^−4 , 2^−5 } and costs C = {150, 1000, 1100, 1200, 1400, 1500, 10000}. From these combinations, the best result was selected for each approach: a) (2^−4 , 1500), b) (2^−3 , 1200), c) (2^−4 , 1400).

Table 10.5 Object-recognition performance in percent (%) during test using synthetic data

Object NTS CSVM MLP 1-vs-all 1-vs-1 DAGSVM


C=1200 a) b) c)
C 52 94.23 80.76 90.38 96.15 96.15
S 66 87.87 45.45 83.33 84.84 86.36
F 66 90.90 51.51 83.33 86.36 84.84
W 66 89.39 57.57 86.36 83.33 86.36
D 58 93.10 55.17 93.10 93.10 93.10
U 66 92.42 46.96 89.39 90.90 89.39
C=cylinder, S=sphere, F=fountain, W=worm, D=Diamond,
U=cube
NTS= Number of Training Vectors.
Used kernels K(xi , x j ) = e^{−σ ||xi −x j ||²} with parameters taken from σ = {2^−1 , 2^−2 , 2^−3 , 2^−4 , 2^−5 } and costs C = {150, 1000, 1100, 1200, 1400, 1500, 10000}. From these combinations, the best result was selected for each approach: a) (2^−4 , 1500), b) (2^−3 , 1200), c) (2^−4 , 1400).

Fig. 10.6 Sampling of the training synthetic object set

is illustrated in Fig. 10.7, and the whole training object set is shown in Fig. 10.8. We take the non-normalized 3D point for the bivector basis σ23 , σ31 , σ12 of the feature quaternion in (10.48).

Fig. 10.7 Left:Sampling view of a real object. We use big white cross for the depiction.
Right: Stereo vision system in the experiment environment

After the training, we tested with a set of feature quaternions that the machine did not see during its training, and we used the "winner take all" approach to decide to which class the object belongs. The results of the training and test are shown in Table 10.6. We trained the CSVM with an equal number of training data for each object, that is, 90 feature quaternions per object, but we tested with a different number of data per object. Note that we have two pairs of objects that are very similar to each other; the first pair is composed of the half sphere shown in Fig. 10.8.c) and the rock in Fig. 10.8.d). In spite of these similarities, we got very good accuracy percentages in the

Fig. 10.8 Training real object set, stereo pair images. We include only the frontal views

test phase for both objects: 65.9% for the half sphere and 84% for the rock. We think we got better results for the rock because this object has a lot of texture that produces many corners, which in turn capture the irregularities better; therefore we have more test feature quaternions for the rock than for the half sphere (75 against 44, respectively). The second pair of similar objects is shown in Fig. 10.8.e) and Fig. 10.8.f); these are two identical plastic bottles of juice, but one of them (Fig. 10.8.f)) is burned, which makes the difference between them and gives the CSVM enough distinguishing features to make two object classes. As shown in Table 10.6, we got 60% of correctly classified test samples for the bottle in Fig. 10.8.e) against 61% for the burned bottle in Fig. 10.8.f). The lower recognition rates for the last objects (Fig. 10.8 c), e) and f)) are because the CSVM confuses the classes a bit, due to the fact that the feature vectors are not large and rich enough.

Table 10.6 Experimental Results using real data

Object NTS NES CTS %


Cube 90 50 38 76.00
Prism 90 43 32 74.42
Half sphere 90 44 29 65.90
Rock 90 75 63 84.00
Plastic bottle 1 90 65 39 60.00
Plastic bottle 2 90 67 41 61.20
NTS: Number of training samples
NES: Number of test samples
CTS: Number of correctly classified test samples
%: Percentage of correctly classified test samples

10.7.3 Multi-case Interpolation


A real-valued SVM can carry out regression and interpolation for multiple inputs and one real output. Surprisingly, a Clifford-valued SVM can have multiple inputs and 2^n outputs for an n-dimensional space R^n . For handling regression we use 1.0 > ε > 0, where the diameter of the tube surrounding the optimal hyperplane is twice ε . For the case of interpolation we use ε = 0. We have chosen an interesting task where we use a CSVM for interpolation in order to code a certain kind of behavior that we want a visually guided robot to perform. The robot should autonomously draw a complicated 2D pattern. This capacity should be coded internally in Long Term Memory (LTM), so that the robot reacts immediately without the need of reasoning, similar to a skilled person who reacts in milliseconds with incredible precision when accomplishing a very difficult task, for example a tennis player or a tango dancer. For our purpose we trained a CSVM off line using two real-valued functions. The CSVM used the geometric algebra G3+ (quaternion algebra), with two inputs (two components of the quaternion input) and two outputs (two components of the quaternion output). The first input u and first output x coded the relation x = a ∗ sin(3u) ∗ cos(u) for one axis. The second input v and second output y coded the relation y = a ∗ sin(3v) ∗ sin(v) for the other axis, see Fig. 10.9.a), b). The 2D pattern can be drawn using the 50 points generated by the functions for x and y, see Fig. 10.9.c). We tested whether the CSVM can interpolate well enough using 100 and 400 previously unseen input tuples {u, v}, see Fig. 10.9.d), e), respectively. Once the CSVM was trained, we incorporated it as part of the LTM of the visually guided robot shown in Fig. 10.9.f). For carrying out its task, the robot called the CSVM for a sequence of input patterns. The robot was able to draw the desired 2D pattern, as we see in Fig. 10.10.a)-d). The reader should bear in mind that this experiment was designed using the equation of a standard function in order to have a ground truth. Anyhow, our algorithm should also be able to learn 3D curves which do not have explicit equations.
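For reference, the training pairs for this pattern (a three-petal rose when u = v) can be generated as follows; the amplitude a and the parameter range are assumptions, since the chapter states only the two trigonometric relations and the count of 50 points:

```python
import numpy as np

def rose_training_pairs(a=1.0, n=50):
    """Generate the (u, x) and (v, y) training pairs of the 2D pattern:
    x = a*sin(3u)*cos(u) and y = a*sin(3v)*sin(v), here sampled with
    u = v over an assumed range [0, pi].
    """
    u = np.linspace(0.0, np.pi, n)
    x = a * np.sin(3.0 * u) * np.cos(u)
    y = a * np.sin(3.0 * u) * np.sin(u)
    return u, x, y
```

With u = v the two outputs trace the polar curve r = a sin(3θ), so each sample satisfies x² + y² = a² sin²(3u).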


Fig. 10.9 a) and b) Continuous curves of training output data for axes x and y (50 points).
c) 2D result by testing with 400 input data. d) Experiment environment

Fig. 10.10 a), b), c) Image sequence while the robot is drawing. d) The robot's drawing; result of testing with 400 input data

10.7.4 Experiments Using Recurrent CSVM


In this section we first analyze the performance of the recurrent CSVM against state-of-the-art algorithms for solving a time series problem. Then, in a second experiment, we apply the recurrent CSVM to tackle a partially observable problem in robotics.

10.7.4.1 Time Series


We utilized the data of water levels of the Venice Lagoon during the periods from 1980 to 1989 and 1990 to 1995³. The recurrent CSVM was trained with the first 400 series values. The LSTM module was evolved with four memory cells during 100 generations and using the Cauchy parameter α = 10−3 . The CSVM was trained using as inputs the output values of these four memory cells. The achieved training error was 0.0019 and the recurrent CSVM was tested with 600 steps. The system was able to predict 600 steps of the series in advance. Figure 10.11.a) shows the prediction performance of the recurrent CSVM on the training data, and Figure 10.11.b) depicts the results of predictions using the 600 unforeseen test values. In the figures the ordinate's range is [0..1].

Fig. 10.11 a) Time series Venice Lagoon training. b) Recall data. Thick line (in red): real data; thin line: values predicted by the LSTM-CSVM
3 A. Tomasin, CNR-ISDMG Universita Ca’Foscari, Venice.

In the next test, we employed the Mackey-Glass time series, which is commonly used for testing the generalization and prediction ability of an algorithm. The series is generated by the following differential equation:

ẏ(t) = α y(t − τ ) / (1 + y(t − τ )^β ) − γ y(t), (10.49)

where the parameters are usually set as α = 0.2, β = 10 and γ = 0.1. This equation is chaotic when the delay is τ > 16.8. We select as delay the most commonly used value of τ = 17. The task is to predict the series values after the delay, y[t + P], by using the previous points y[t], y[t − 6], y[t − 12], y[t − 18]. For P = 50 sampled values, it is expected that the algorithm learns the four-dimensional function y(t) = f (y[t − 50], y[t − 50 − 6], y[t − 50 − 12], y[t − 50 − 18]).
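The series itself can be generated with a simple Euler integration of (10.49); the step size, initial value and constant initial history are assumptions, since the chapter does not state the integration details:

```python
import numpy as np

def mackey_glass(n_steps, tau=17, alpha=0.2, beta=10.0, gamma=0.1,
                 dt=1.0, y0=1.2):
    """Euler integration of the Mackey-Glass equation (10.49).

    dt and y0 are assumed; the first tau/dt values serve as the
    (constant) initial history required by the delay term.
    """
    history = int(tau / dt)
    y = np.full(n_steps + history, y0)
    for t in range(history, n_steps + history - 1):
        y_tau = y[t - history]                      # delayed value y(t - tau)
        y[t + 1] = y[t] + dt * (alpha * y_tau / (1.0 + y_tau ** beta)
                                - gamma * y[t])
    return y[history:]
```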
The LSTM-CSVM was trained with the first 3000 values of the series using P = 100. The LSTM module with 4 memory cells was evolved with a Cauchy parameter α = 10−5 over 150 generations. The "Echo state" approach was trained with 1000 neurons to a mean square error of 10−4 , while using the Evolino system the achieved error was 1.9 × 10−3 with 30 cells evolved over 200 generations [19, 30]. It has been reported [31] that using an LSTM the minimum error achieved was 0.1184, using the same number of 4 memory cells as in our LSTM-CSVM.

Table 10.7 Time series Mackey-Glass

Approach Units Generations Error

Echo state 1000 200 10−4
Evolino 30 200 1.9 × 10−3
LSTM 4 — 0.1184
LSTM-CSVM 4 (+ CSVM) 150 0.011

Table 10.7 shows a summary of the comparison results. Here we note that the LSTM alone has a poorer performance than the LSTM-CSVM, showing that the CSVM clearly improves the prediction precision. Note that for this complex time series, as opposed to the other two approaches (the Echo state approach and Evolino), the LSTM-CSVM uses a lower number of neurons and it requires fewer generations during training for an acceptable error of 0.011.

10.7.4.2 Robot Navigation in Discrete Labyrinths


Finally, we utilize the LSTM-CSVM with reinforcement learning in a task of a robotic perception-action system. The robot system comprises a stereoscopic camera, a 6-DOF robot arm and a 4-finger Barrett hand. The task consisted of moving the robot hand through a real 2D labyrinth. This was built using 10 blocks of 10 cm height each. To enhance the visibility, their top faces were painted red; this facilitated the segmentation of the blocks by the stereoscopic cameras. The stereoscopic system took images from an angle of 45 degrees; therefore we needed to calibrate the cameras and correct the camera views so that they were oriented from the top, perpendicularly to the labyrinth. With this information we had a complete 3D view from above. Using a color segmentation algorithm, we obtained the vector coordinates of the block corners. These observation vectors were then fed to the LSTM-CSVM.
The architecture of the LSTM-CSVM with reinforcement learning and the training were the same as in the previous application. The differences with the simulated experiment were: i) the 3D vectors of the block corners were obtained by the stereoscopic camera (the blocks build a 2D labyrinth), ii) the robot actions were hand movements through the 2D labyrinth, and iii) the length of this real labyrinth was smaller than the previous simulated one. We had 4 different labyrinths, each 10 blocks long.
The evolution of the system consisted of 50 generations using a Cauchy noise parameter of α = 10−4 . The CSVM module is fed with a vector of the outputs of the last 4 memory cells of the LSTM. The 4 outputs of the CSVM represent the 4 different actions to be carried out during the navigation through the labyrinth. After each generation, the best net was kept, and the task was considered fulfilled by a perfect reward of 4.0. The four possible actions of the system are robot hand movements of 10 cm length towards left, right, back and forth. The initial position of the robot arm was located at the entry of a labyrinth. In all the labyrinths we exploited the internal state (support state), i.e. the coordinates of the exit, which was the same for all cases.


Fig. 10.12 Training labyrinths 1 and 2; recall labyrinths 3 and 4. (Third column) The robot hand
is at the entry of the labyrinth holding a plastic object. (Fourth column) Position of the hand
marked with a cross
260 N. Arana-Daniel, C. López-Franco, and E. Bayro-Corrochano

Figure 10.12 shows the four labyrinths. The images in the first and third columns
were obtained by the stereoscopic system. The images in the second and fourth
columns were obtained after perspective correction and color segmentation.
Labyrinths 1 and 2 were used for training, whereas 3 and 4 were used for recall.
The third and fourth columns in Figure 10.12 show the agent at the beginning of
the labyrinths. In each labyrinth, only one trajectory was successful (reward 4.0).
The training and the test were done offline; the robot then had to follow the action
vectors computed by the LSTM-CSVM system.

10.8 Conclusions
This chapter generalizes the real-valued SVM to the Clifford-valued SVM, which is used
for classification, regression and recurrence. The CSVM accepts multiple multivector
inputs and multivector outputs, like a MIMO architecture, which allows us to build
multi-class applications. We can use the CSVM over complex, quaternion or hypercomplex
numbers according to our needs. The application section shows experiments
in pattern recognition and visually guided robotics which illustrate the power of the
algorithms and help the reader understand the Clifford SVM and use it in various
tasks of complex and quaternion signal and image processing, pattern recognition
and computer vision using high-dimensional geometric primitives. This generalization
appears particularly promising in geometric computing and its applications,
such as graphics, augmented reality and robot vision.

Chapter 11
A Classification Method Based on Principal
Component Analysis and Differential Evolution
Algorithm Applied for Prediction Diagnosis
from Clinical EMR Heart Data Sets

Pasi Luukka and Jouni Lampinen

Abstract. In this article we study a classification method based on first preprocessing
the data using principal component analysis and then using the compressed data in
the actual classification process, which is based on the differential evolution algorithm,
an evolutionary optimization algorithm. The method is applied here to prediction
diagnosis from clinical data sets with a chief complaint of chest pain, using classical
Electronic Medical Record (EMR) heart data sets. For experimentation we used a set
of five frequently applied benchmark data sets: the Cleveland, Hungarian, Long Beach,
Switzerland and Statlog data sets. These data sets contain demographic properties,
clinical symptoms, clinical findings, laboratory test results, specific electrocardiography
(ECG) results, findings pertaining to angina and coronary infarction, etc. In other words,
they are classical EMR data pertaining to the evaluation of a chest pain patient and to
ruling out angina and/or Coronary Artery Disease (CAD). The prediction diagnosis results
with the proposed classification approach were found to be promisingly accurate. For
example, the Switzerland data set was classified with 94.5% ± 0.4% accuracy. Combining
all these data sets resulted in a classification accuracy of 82.0% ± 0.5%. We compared
the results of the proposed method with the corresponding results of other methods
reported in the literature that have demonstrated relatively high classification performance
in solving this problem. Depending on the case, the results of the proposed method were
of equal level with the best compared methods, or clearly outperformed them in
classification accuracy. In general, the results suggest that the proposed method has
potential in this task.

Pasi Luukka
Laboratory of Applied Mathematics, Lappeenranta University of Technology,
P.O. Box 20, FIN-53851 Lappeenranta, Finland
e-mail: pasi.luukka@lut.fi

Jouni Lampinen
Department of Computer Science, University of Vaasa, P.O. Box 700,
FI-65101 Vaasa, Finland
e-mail: jouni.lampinen@uwasa.fi

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 263–283.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
264 P. Luukka and J. Lampinen

11.1 Introduction
Many data sets that come from the real world are unavoidably coupled with noise.
Noise can be defined as a random error or variance in a measured variable [13]. Data
analysis is almost always burdened with uncertainty of different kinds. There are
several different techniques to deal with noisy data [7].
A major problem in mining scientific data sets is that the data is often high-dimensional:
in many cases there is a large number of features representing each object.
One problem is that the computational time of pattern recognition algorithms
can become prohibitive when the number of dimensions grows high. This can be a
severe problem, especially as some of the features are not discriminatory. Besides
the computational cost, irrelevant features may also reduce the accuracy of some
algorithms.
To address this problem of high dimensionality, a common approach is to identify
the most important features associated with an object, so that further processing can
be simplified without compromising the quality of the final results. There are several
different ways in which the dimension of a problem can be reduced. The simplest
approach is to identify important attributes based on the input from domain experts.
Another commonly used approach is Principal Component Analysis (PCA) [19],
which defines new attributes (principal components or PCs) as mutually-orthogonal
linear combinations of the original attributes. For many data sets, it is sufficient to
consider only the first few PCs, thus reducing the dimension. However, for some
data sets, PCA does not provide a satisfactory representation. It is not always the
case that mutually-orthogonal linear combinations are the best way to define new
attributes; for example, nonlinear combinations sometimes need to be considered. The
analysis of the problem of dealing with data of high dimensionality is both difficult
and subtle. The information loss caused by these methods is also sometimes a
problem.
One of the latest methods in evolutionary computation is the differential evolution
(DE) algorithm [30]. In this paper we examine the applicability of a classification
method, in which the data is first preprocessed with PCA and the resulting data is then
classified with a DE classifier, to the diagnosis of heart disease. In the literature there are
several papers in which evolutionary computation research has concerned the theory
and practice of classifier systems [4], [16], [17], [18], [31], [35], [10]. The differential
evolution algorithm has been studied in unsupervised learning problems, which
can in a sense be recast as classification problems, in [26], [11]. DE was also
used in combination with artificial neural networks in [1] for the diagnosis of breast cancer.
It has also been used to tune classifier parameter values in [12], and in a similarity
classifier [23] to tune the parameters of the similarity measures.
11 A Classification Method Based on PCA and DE 265

Here we propose a method which first preprocesses the data using PCA and then
classifies the processed data using a differential evolution classification method. The
differential evolution algorithm is applied to find an optimal class vector to represent
each class; a sample is then classified by comparing it with the class vectors. In
addition, DE is also applied to determine the value of a distance parameter that we
use in making the final classification decision.
The advantages of this procedure are that we are able to reduce the dimensionality,
and hence the computational cost that would otherwise be intolerably high, especially
for high-dimensional data sets. A further advantage of this procedure is that we are
able to filter out noise, which enhances the creation of the class vector for each class
in the classifier. The class vectors are optimized using the DE algorithm. Using this
procedure we also find the optimal dimension for these data sets. The combination of
finding the best reduced dimension, filtering out noise from the data, and optimizing
the class vectors and the needed parameters for the problem at hand yields a more
accurate solution to the problem.
The data sets for empirical evaluation of the proposed approach were taken from
the UCI Machine Learning Repository [25]. The classifier and preprocessing
methods were implemented with MATLAB™ software.
From the optimization and modelling point of view, the classification problem
subject to our investigations can be divided into two parts: the classification model,
and the optimization approach applied for fitting (or learning) the model. Generally,
a multipurpose classifier can be viewed as a scalable and learnable model
that can be fitted to a particular dataset by scaling it to the data dimensionality
and by optimizing a set of model parameters to maximize the classification accu-
racy. For the optimization, simply the classification accuracy over the learning set
may serve as the objective function value to be maximized. Alternatively the opti-
mization problem can be formulated as a minimization task, as we did here, where
the number of misclassified samples is to be minimized. In the literature, mostly
linear or nonlinear local optimization approaches, or approaches that can be viewed
as such, have been applied for solving the actual classifier model optimization problem.
This is the most common approach despite the fact that the underlying optimization
problem is a global optimization problem. For example, the weight set
of a feed-forward neural network classifier is typically optimized with a gradient-descent-based
local optimizer, or alternatively by some other local optimizer such as the
Levenberg-Marquardt algorithm. This kind of usage of limited-capacity optimizers
for fitting the classification model limits the achievable classification accuracy in
two ways. First, the model must be limited so that local optimizers can be applied
to fit it. This means that only very moderately multimodal classification models
can be applied, and due to such a modelling limitation, the classification capability
will be limited correspondingly. Secondly, if a local optimizer is applied to optimize
(to fit or to learn) even a moderately multimodal classification model, it is likely to
get trapped in a local optimum, a suboptimal solution. Therefore, the only way to
get classifier models with a higher modelling capacity at our disposal, and also to get
full capacity out of the current multimodal classification models, is to apply global
optimization for fitting the classification models to the data to be classified. For
example, in the case of a nonlinear feed-forward neural network classifier, the model is
clearly multimodal, but it is practically always fitted by applying a local optimizer that is
capable of providing only locally optimal solutions. Thus, we consider that applying
global optimization instead of local optimization is an important fundamental issue
that is currently severely constraining the further development of classifiers. The
capabilities of the currently used local optimizers limit the selection of applicable
classifier models, and the capabilities of the currently used models that include
multimodal properties are likewise limited by the capabilities of the optimizers
applied to fit them to the data.
Based on the above-mentioned considerations, our basic motivation for applying
a global optimizer for learning the applied classifier model comes from the fact that
typically local (nonlinear) optimizers have been applied for this purpose, despite the fact
that the underlying optimization problem is actually a multimodal global optimization
problem, and a local optimizer should be expected to become trapped in a locally
suboptimal solution. The advantage of our proposed method is that, since DE does
not get trapped in a local minimum, we can expect it to find better solutions than
those found in the nearest local minimum.
Another motivation was that we also wished to optimize the parameter p of
the Minkowski distance metric (see Section 11.3). In practice, that means increased
nonlinearity and increased multimodality of the classification model, resulting in
more locally optimal points in the search space, where a local optimizer would be
even more likely to get trapped. Practically, optimizing p successfully requires the usage
of an effective global optimizer, since local optimizers are unlikely to provide even
an acceptably good suboptimal solution anymore; with a global optimizer,
optimization of p becomes possible. Two advantages were expected
from this. First, by optimizing the value of p systematically, instead of selecting it
a priori by trial and error as earlier, a higher classification accuracy may be
reached. Secondly, the selection of the value of p can be done automatically
this way, and laborious trial-and-error experimentation by the user is not needed
at all. Furthermore, the potential for further developments is increased. Local
optimization approaches severely limit the selection of classifier models to
be used, and the possible problem formulations for the classifier model optimization task
become limited, too. Simply put, local optimizers can fit, or learn, only classifier
models for which trapping in a locally suboptimal solution is not a major problem,
while global optimizers do not have such fundamental limitations. For example, the
range of possible class membership functions can be extended to those requiring
global optimization (due to increased nonlinearity and multimodality), which
cannot be handled anymore by simple local optimizers, even nonlinear
ones. In addition, we would like to remark that we have not yet fully utilized the
further development capabilities provided by our global optimization approach. For
example, even more difficult optimization problem settings are now within reach,
and differential evolution has good capabilities for multi-objective and
multi-constrained nonlinear optimization, which provides further possibilities for our
future developments.
11.2 Heart Data Sets


The heart data sets that we applied for experimentation were all taken from [25],
where they are freely available. They all contain 13 attributes (which have been
extracted from a larger set of 75). Information about the attributes can be found
in Table 11.1, and the basic properties of the data sets are summarized in Table
11.2. Regarding attribute types: attributes 1, 4, 5, 8, 10 and 12 are real-valued;
attribute 11 is of ordered type; attributes 2, 6 and 9 are binary; and attributes
3, 7 and 13 are nominal. The variable to be predicted is the absence or presence of heart disease.
The data sets were collected in different locations, and the principal investigators
responsible for the data collection are: 1) Andras Janos, Hungarian Institute of Cardiology;
2) William Steinbrunn, University Hospital, Zurich; 3) Matthias Pfisterer,
University Hospital, Basel; 4) Robert Detrano, V.A. Medical Center, Long Beach,
and Cleveland Clinic Foundation. The donor of the Statlog data set was Ross D. King,
University of Strathclyde, Glasgow. The Statlog data set is slightly modified from
the Cleveland data set (it uses only 270 of the 303 samples).

Table 11.1 Heart data sets attribute information

no Attribute
1. Age
2. Sex
3. Chest pain type (4 values)
4. Resting blood pressure
5. Serum cholestoral in mg/dl
6. Fasting blood sugar > 120 mg/dl
7. Resting electrocardiographic results (values 0,1,2)
8. Maximum heart rate achieved
9. Exercise induced angina
10. Oldpeak = ST depression induced by exercise relative to rest
11. The slope of the peak exercise ST segment
12. Number of major vessels (0-3) colored by fluoroscopy
13. Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

Table 11.2 Test data sets and their properties

Name Nb. classes Dim Nb. cases


Heart-Cleveland 2 13 303
Heart-Hungarian 2 13 294
Heart-Long-Beach-va 2 13 200
Heart-Switzerland 2 13 123
Heart-Statlog 2 13 270
11.3 Classification Method


The heart data sets were classified by first preprocessing the data using the PCA
algorithm and then classifying the resulting data using the classification method based on
differential evolution. In the following, we first explain the principal
component analysis method in more detail, then the classification method based on the
differential evolution algorithm, and after this we give a more thorough description of
the differential evolution algorithm itself.

11.3.1 Dimension Reduction Using Principal Component Analysis

High-dimensional data sets present many mathematical challenges as well as some
opportunities, and are bound to give rise to new theoretical developments [7]. One
of the problems with high-dimensional data sets is that, in many cases, not all the
measured variables are "important" for understanding the underlying phenomena of
interest. In mathematical terms, the problem we investigate in dimension reduction
can be stated as follows: given the r-dimensional random variable x = (x_1, ..., x_r)^T,
find a lower-dimensional representation of it, y = (y_1, ..., y_k)^T with k < r, that captures
the content of the original data, according to some criterion. The components of
y are sometimes called the hidden components. Different fields use different names
for multivariate vectors: the term "variable" is mostly used in statistics, while
"feature" and "attribute" are common alternatives in the computer science
and machine learning literature.
PCA [19] is the best linear dimension reduction technique in the mean-square
error sense. In various fields, it is also known as the singular value decomposition
(SVD), the Karhunen-Loève transform, the Hotelling transform, and the empirical
orthogonal function (EOF) method.
Let x_1, ..., x_n be the n r-dimensional real vectors constituting the data set. In PCA
the data is first centered:

$$\frac{1}{n} \sum_{p=1}^{n} x_p = 0 \qquad (11.1)$$

PCA attempts to find a k-dimensional subspace L of R^r such that the orthogonal
projections P_L x_p of the n points on L have maximal variance. If L is the line spanned
by a unit vector u, the projection of x ∈ R^r on L is

$$P_L x = (u' x)\, u \qquad (11.2)$$

where the prime denotes transposition. The variance of the data in the direction of L is,
therefore,

$$\frac{1}{n} \sum_{p=1}^{n} (u' x_p)^2 = \frac{1}{n} \sum_{p=1}^{n} u' x_p x_p' u = u' \Big( \frac{1}{n} \sum_{p=1}^{n} x_p x_p' \Big) u = u' S u \qquad (11.3)$$
where S is the sample covariance matrix of the data. PCA thus looks for the vector
u* which maximizes u'Su under the constraint ||u|| = 1. It is easy to show that the
solution is the normalized eigenvector u_1 of S associated with its largest eigenvalue
λ_1, and

$$u_1' S u_1 = \lambda_1 u_1' u_1 = \lambda_1 \qquad (11.4)$$

This is then extended to find the k-dimensional subspace L on which the projected
points P_L x_p have maximal variance. The lines spanned by the eigenvectors u_j are
called the principal axes of the data, and the k new features y_j = u_j' x defined by the
coordinates of x along the principal axes are called the principal components. The vector
y_p of principal components for each initial pattern vector x_p may easily be computed
in matrix form as y_p = U_k' x_p, where U_k = [u_1, ..., u_k] is the r × k matrix having the k
normalized eigenvectors of S as its columns.
PCA can be used in classification problems to display data in the form of infor-
mative plots. The score values have the same properties as the weighted averages,
i.e., they are not sensitive to random noise but show the processes that affect several
variables simultaneously in a systematic way. This makes them suitable for detect-
ing multivariate trends, such as the clustering of objects or variables in multivariate
data sets. PCA can be seen as a data compression method which can be used to (1)
display multivariate data sets, (2) filter noise and (3) study and interpret multivariate
processes. One clear limitation of PCA is that it can only handle linear relations
between variables [9]. We acknowledge that this may not be the best kernel
for the approach, but here, in our procedure, it seems to work.
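
The centering, eigendecomposition and projection steps of Eqs. (11.1)–(11.4) can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' MATLAB implementation; the function name and toy data are ours):

```python
import numpy as np

def pca_reduce(X, k):
    """Project the n x r data matrix X onto its first k principal axes."""
    Xc = X - X.mean(axis=0)                # centering, Eq. (11.1)
    S = Xc.T @ Xc / Xc.shape[0]            # sample covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]  # indices of the k largest
    U_k = eigvecs[:, order]                # principal axes u_1, ..., u_k
    return Xc @ U_k, U_k                   # scores y_p = U_k' x_p

# Toy data: 200 points in R^2 whose variance lies mostly along (3, 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[3.0, 1.0]]) \
    + 0.1 * rng.normal(size=(200, 2))
Y, U = pca_reduce(X, k=1)
print(Y.shape, U.shape)  # (200, 1) (2, 1)
```

With k = 1 the recovered principal axis is, up to sign, close to the direction (3, 1) along which the toy data was generated.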

11.3.2 Classification Based on Differential Evolution


The problem of classification is basically one of partitioning the feature space into
regions, one region for each category. Ideally, one would like to arrange this parti-
tioning so that none of the decisions is ever wrong [8].
The objective is to classify a set X of objects into N different classes C_1, ..., C_N by
their features. We suppose that T is the number of different kinds of features that
we can measure from the objects. The key idea is to determine for each class i the ideal
vector

$$y_i = (y_{i1}, \ldots, y_{iT}) \qquad (11.5)$$

that represents class i as well as possible. Later on we call these vectors class vectors.
When the class vectors have been determined, we have to decide to which class
a sample x belongs according to some criterion. This can be done,
e.g., by computing the distances d_i between the class vectors and the sample which
we want to classify. For computing the distance we used the Minkowski metric:

$$d(x, y) = \left( \sum_{j=1}^{T} |x_j - y_j|^p \right)^{1/p} \qquad (11.6)$$
We used the Minkowski metric because it is more general than the Euclidean metric;
the Euclidean metric is still included as the special case p = 2. We also found
that when the value of p was optimized using DE, the optimum was not even near p = 2,
which corresponds to the Euclidean metric.
After computing the distances between the samples and the class vectors, we can
make the classification decision according to the shortest distance: we decide that
x ∈ C_m if

$$d(x, y_m) = \min_{i=1,\ldots,N} d(x, y_i). \qquad (11.7)$$

Before doing the actual classification, all the parameters of the classifier have to be
decided. These parameters are
1. The class vectors y_i = (y_i(1), . . . , y_i(T)) for each class i = 1, . . . , N
2. The power value p in (11.6).
In this study we used the differential evolution algorithm [30] to optimize both the class
vectors and the value of p. For this purpose we split the data into a learning set learn and
a testing set test. The split was made so that half of the data was used for the learning set and
half for the testing set. We used the data available in the learning set to find the optimal class
vectors y_i, and the data in the testing set test was applied for assessing the
classification performance of the proposed classifier. A brief description of the differential
evolution algorithm is presented in the following section. The number of parameters
that the differential evolution algorithm needs to optimize here is classes × dimension +
1, the additional parameter coming from the Minkowski distance. As the results will later show, PCA can be
used to lower the data's dimensionality, and with low dimensions we can still find
results which are clearly better than those obtained by using the DE classifier alone.
If we are not satisfied with simply lowering the data's dimensionality and with the enhancement
achieved this way, but also want to find the best lowered dimension, we have to do
this for every dimension lower than the maximum dimension, so that we get

$$\sum_{i=1}^{\text{dimension}} \big( \text{classes} \cdot (\text{dimension} - i) + 1 \big)$$

parameters to be optimized.
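
As a quick check of the parameter counts discussed above for the heart data (2 classes, 13 attributes; the variable names and arithmetic are ours):

```python
# Parameter counts for the heart data: N = 2 classes, T = 13 attributes.
classes, dimension = 2, 13

# Class vectors for the full dimension plus the Minkowski power p:
single = classes * dimension + 1
print(single)   # 27

# Searching over every reduced dimension as well (the sum in the text):
total = sum(classes * (dimension - i) + 1 for i in range(1, dimension + 1))
print(total)    # 169
```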
In short, the procedure of our algorithm is as follows:
1. Divide the data into a learning set and a testing set
2. Create initial class vectors for each class (here we simply used random numbers)
3. Compute the distances between the samples in the learning set and the class vectors
4. Classify the samples according to their minimum distance
5. Compute the classification accuracy (number of correctly classified samples / total
number of samples in the learning set)
6. Compute the objective function value to be minimized as cost = 1 − accuracy
7. Create new class vectors for each class for the next population using the selection,
mutation and crossover operations of the differential evolution algorithm, and go to
step 3 until the stopping criterion is reached (e.g. the maximum number of iterations
is reached)
8. Classify the data in the testing set according to the minimum distance between the
class vectors and the samples.
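
Steps 3–6 above, which together form the objective function that DE minimizes, can be sketched as follows (an illustrative Python version; the original was implemented in MATLAB, and the helper names and toy data are ours):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance, Eq. (11.6)."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def cost(candidate, X_learn, labels, n_classes):
    """Steps 3-6: cost = 1 - accuracy on the learning set.

    `candidate` packs the N class vectors followed by the power p,
    as described in the text.
    """
    T = X_learn.shape[1]
    p = max(candidate[-1], 1e-6)                         # keep the power positive
    class_vectors = np.reshape(candidate[:-1], (n_classes, T))
    correct = 0
    for x, label in zip(X_learn, labels):
        d = [minkowski(x, y, p) for y in class_vectors]  # step 3
        if int(np.argmin(d)) == label:                   # step 4
            correct += 1
    accuracy = correct / len(X_learn)                    # step 5
    return 1.0 - accuracy                                # step 6

# Toy learning set: two well-separated classes in 2D.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
labels = np.array([0, 0, 1, 1])
good = np.array([0.05, 0.0, 0.95, 1.05, 2.0])  # two class vectors + p
print(cost(good, X, labels, n_classes=2))  # 0.0
```

DE then searches over `candidate` vectors to drive this cost toward zero on the learning set.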

11.3.3 Differential Evolution


The DE algorithm [33], [30] was introduced by Storn and Price in 1995, and it belongs
to the family of Evolutionary Algorithms (EAs). The design principles of DE
are simplicity, efficiency, and the use of floating-point encoding instead of binary
numbers. As a typical EA, DE has a random initial population that is then improved
using selection, mutation, and crossover operations. Several ways exist to determine
a stopping criterion for EAs, but usually a predefined upper limit Gmax for the number
of generations to be computed provides an appropriate stopping condition. Other
control parameters of DE are the crossover control parameter CR, the mutation factor
F, and the population size NP.
In each generation G, DE goes through each D-dimensional decision vector v_{i,G}
of the population and creates the corresponding trial vector u_{i,G} as follows in the
most common DE version, DE/rand/1/bin [29]:

r1, r2, r3 ∈ {1, 2, . . . , NP} (randomly selected, mutually different and different from i)
jrand = floor(rand_i[0, 1) · D) + 1
for (j = 1; j ≤ D; j = j + 1)
{
    if (rand_j[0, 1) < CR ∨ j = jrand)
        u_{j,i,G} = v_{j,r3,G} + F · (v_{j,r1,G} − v_{j,r2,G})
    else
        u_{j,i,G} = v_{j,i,G}
}

In this DE version, NP must be at least four, and it, along with CR and F, remains
fixed during the whole execution of the algorithm. Parameter CR ∈ [0, 1], which controls
the crossover operation, represents the probability that an element of the trial vector
is chosen from a linear combination of three randomly chosen vectors rather than from
the old vector v_{i,G}. The condition "j = j_rand" makes sure that at least one element
differs from the elements of the old vector. The parameter F is a scaling
factor for mutation and its value typically lies in (0, 1+]¹. In practice, CR controls the
rotational invariance of the search: a small value (e.g., 0.1) is practicable with
separable problems, while larger values (e.g., 0.9) suit non-separable problems.
The control parameter F controls the speed and robustness of the search, i.e., a lower
value for F increases the convergence rate but also adds the risk of getting stuck
in a local optimum. Parameters CR and NP have the same kind of effect on the
convergence rate as F has.
¹ The notation means that the practical upper limit is about 1 but is not strictly defined.
272 P. Luukka and J. Lampinen

After the mutation and crossover operations, the trial vector u_{i,G} is compared to
the old vector v_{i,G}. If the trial vector has an equal or better objective value, then it
replaces the old vector in the next generation. This can be presented as follows (in
this paper minimization of objectives is assumed) [29]:

            ⎧ u_{i,G}   if f(u_{i,G}) ≤ f(v_{i,G})
v_{i,G+1} = ⎨
            ⎩ v_{i,G}   otherwise.

DE is an elitist method, since the best population member is always preserved and
the average objective value of the population never gets worse.
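Putting the mutation, crossover, and one-to-one elitist selection steps together, a complete DE/rand/1/bin loop can be sketched as follows. The test function, bounds, and parameter values here are illustrative only; this is a generic minimizer, not the authors' classifier-fitting code:

```python
import numpy as np

def de_minimize(f, bounds, NP=20, F=0.5, CR=0.9, Gmax=200, seed=0):
    """Minimize f over a box using DE/rand/1/bin with one-to-one
    elitist selection: a trial replaces its target only if it is
    no worse, i.e. f(u) <= f(v)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    D = lo.size
    pop = lo + rng.random((NP, D)) * (hi - lo)       # random initial population
    fit = np.array([f(v) for v in pop])
    for _ in range(Gmax):
        for i in range(NP):
            r1, r2, r3 = rng.choice([r for r in range(NP) if r != i],
                                    size=3, replace=False)
            mutant = pop[r3] + F * (pop[r1] - pop[r2])
            cross = rng.random(D) < CR
            cross[rng.integers(D)] = True            # j = j_rand guarantee
            trial = np.where(cross, mutant, pop[i])
            f_trial = f(trial)
            if f_trial <= fit[i]:                    # elitist one-to-one replacement
                pop[i], fit[i] = trial, f_trial
    best = np.argmin(fit)
    return pop[best], fit[best]

# Minimize the sphere function in 3 dimensions:
x_best, f_best = de_minimize(lambda x: float(np.sum(x * x)),
                             bounds=([-5, -5, -5], [5, 5, 5]))
print(f_best)  # small value near 0
```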
As the objective function f to be minimized, we applied the number of incorrectly
classified learning-set samples. Each population member v_{i,G}, as well as each
new trial solution u_{i,G}, contains the class vectors for all classes and the power value
p. In other words, DE seeks the vector (y(1), ..., y(T), p) that minimizes the
objective function f. After the optimization process, the final solution defining the
optimized classifier is the best member of the population of the last generation Gmax,
the individual v_{i,Gmax}. The best individual is the one providing the lowest objective
function value and therefore the best classification performance for the learning set.
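Each DE decision vector thus concatenates the class vectors y(1), ..., y(T) with the power value p. A possible encoding and decoding, with hypothetical helper names and an assumed flat layout, could look like this:

```python
import numpy as np

def decode(decision_vector, n_classes, dim):
    """Split a flat DE decision vector into class vectors and the power p.
    Layout assumed here: [y(1), ..., y(T), p] with T = n_classes."""
    flat = np.asarray(decision_vector)
    class_vectors = flat[: n_classes * dim].reshape(n_classes, dim)
    p = flat[-1]
    return class_vectors, p

def encode(class_vectors, p):
    """Inverse of decode: flatten the class vectors and append p."""
    return np.concatenate([np.ravel(class_vectors), [p]])

v = encode(np.arange(6.0).reshape(2, 3), p=2.5)
cv, p = decode(v, n_classes=2, dim=3)
print(v)  # → [0. 1. 2. 3. 4. 5. 2.5]
print(p)  # → 2.5
```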
The control parameters of the DE algorithm were set here as follows: CR = 0.9 and
F = 0.5 were applied for all classification problems, and NP was chosen to be six
times the number of optimized parameters.
However, these selections were mainly based on general recommendations and
practical experience with DE, and no systematic investigation was performed to find
the optimal control parameter values; therefore, further classification performance
improvements may be possible in the future by finding better control parameter
settings.

11.4 Classification Results


All data sets were split in half; one half was used for training and the other half
for testing the classifier. The training sets were randomly created 30 times for each
dimension. The results are also compared to other existing results in the literature.
In Table 11.3 the results for the applied data sets are reported and compared with
the results achieved without PCA. The first column gives the data set and whether
the data was first preprocessed with PCA. The second column gives the best
classification accuracy and the third the mean classification accuracy. The variance is
reported next, followed by the reduced dimension providing the best results. Finally,
the optimized p-value is given in the last column. Results for the Cleveland heart data
set are given as Heart-Cleveland, for the Hungarian as Heart-Hungarian, for the
Switzerland heart data set as Heart-Switzerland, and for the Long Beach data set as
Heart-Long-Beach. All four data sets are combined in Heart-All. There are several
missing values in these data sets, and a dummy value of −9 is simply used for each
missing value. Results for the heart-statlog data set are given as Heart-statlog. The
best mean classification accuracies are in boldface.

Table 11.3 Classification results for the heart data sets: comparison of classification results
with the original data and with data preprocessed with PCA. The best mean accuracy is in
boldface

Data Best result (in %) Mean result (in %) Variance (in %) dim p-value
Heart-Cleveland 89.44 % 82.86 % 7.71 13 19.3
Heart-Cleveland(PCA) 91.55 % 86.48 % 2.82 12 82.8
Heart-Hungarian 88.44 % 83.42 % 5.95 13 88.1
Heart-Hungarian(PCA) 93.20 % 87.48 % 3.34 11 96.7
Heart-Switzerland 95.16 % 94.35 % 0.67 13 70.8
Heart-Switzerland(PCA) 95.16 % 94.46 % 0.66 5 82.1
Heart-Long Beach 80.20 % 78.32 % 1.31 13 54.4
Heart-Long Beach(PCA) 85.15 % 79.93 % 2.70 12 67.9
Heart-All 78.22 % 76.98 % 0.94 13 1.8
Heart-All(PCA) 84.22 % 82.01 % 1.05 13 49.2
Heart-statlog 88.89 % 83.21 % 10.80 13 81.1
Heart-statlog(PCA) 91.86 % 87.63 % 4.01 13 90.9

Cleveland data set: From Table 11.3 we can observe that the best mean classification
accuracy for the Cleveland data set is 86.5%, and when the 99% confidence interval
is computed for the results (using Student's t distribution, μ ± t_{1−α/2} S_μ/√n), we
obtain 86.5% ± 0.8%. This result was obtained when the data was first preprocessed
with PCA; the preprocessing enhanced the results by over 3%. The best mean accuracy
was found with a target dimensionality of 12.
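The confidence half-widths can be reproduced from the reported variances. A sketch using SciPy's Student-t quantile, assuming n = 30 repetitions as in the experimental setup:

```python
import math
from scipy import stats

def t_confidence_halfwidth(variance, n, confidence=0.99):
    """Half-width t_{1-alpha/2} * S_mu / sqrt(n) of the Student-t
    interval for a mean estimated from n runs with the given
    sample variance."""
    alpha = 1.0 - confidence
    t = stats.t.ppf(1.0 - alpha / 2.0, df=n - 1)
    return t * math.sqrt(variance) / math.sqrt(n)

# Heart-Cleveland(PCA) row of Table 11.3: mean 86.48 %, variance 2.82, 30 runs
hw = t_confidence_halfwidth(variance=2.82, n=30)
print(round(hw, 1))  # → 0.8, matching the reported 86.5 % ± 0.8 %
```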
The results achieved with the Cleveland data set are compared to other results in Tables
11.4–11.6. In Table 11.4 the classification results obtained by our DE-based
approach are compared to the corresponding results reported in [32], where a method
called Classification by Feature Partitioning (CFP) was introduced. This method is
an inductive, incremental, and supervised learning method. There the data set was
divided into two sets as here, but the training and testing set sizes were slightly different.
When comparing our results with the results of Sirin and Güvenir [32], we observed
that the DE classifier classified the Cleveland data set with a higher accuracy (82.9%)
than the IB classifiers and C4, but yielded a slightly lower accuracy than CFP (84.0%).
When the data was first preprocessed with PCA, a classification accuracy of 86.5%
was reached by the DE-classifier. In Table 11.5 the classification results obtained by
the DE-based approach are compared to the results of the classifiers reported in
[21]. They used a decision tree classifier and also preprocessed the data with a wavelet
transform. They also used a two-fold technique as here, but the division between the
training and testing sets was 80–20. For their decision tree classifier a 76% accuracy was
reported; in comparison, DE yielded an accuracy of 83%. Li et al. [21] managed to enhance
their results by first preprocessing the data with the wavelet transform, gaining about
4 percentage points for a mean accuracy of 80%. We reached about a 3-percentage-point
enhancement using PCA, corresponding to a 86% classification accuracy.
In Table 11.6 we have compared our results with the results reported in [3], where
tenfold cross-validation was used instead of the twofold scheme of our experiment. As can
be seen there, the smart crossover operator with multiple parents for a Pittsburgh Learning
Classifier System appears to give around 10% better performance with this data set.
Generally, the results obtained here by the DE classifier with PCA preprocessing
appear rather promising.

Table 11.4 DE classifier's classification results compared to the results Sirin and Güvenir
reported in [32] for the Cleveland and Hungarian data sets

data set IB1 IB2 IB3 C4 CFP DE PCA + DE


Hungarian 58.7 55.9 80.5 78.2 82.3 83.4 87.5
Cleveland 75.7 71.4 78.0 75.5 84.0 82.9 86.5

Table 11.5 DE classifier's classification results compared to the results of Li et al. [21] for
the Cleveland, Hungarian and Switzerland data sets

Data set Decision tree (Dt) Dt + wavelet DE PCA + DE


Hungarian 76 80 83 87
Cleveland 76 80 83 86
Switzerland 88 88 94 94

Table 11.6 DE classifier's classification results compared to the results of Bacardit and
Krasnogor [3] for the Cleveland, Hungarian and Statlog data sets

Data set EnhancedPLCS DE PCA + DE


Hungarian 86.05 83.42 87.48
Cleveland 95.54 82.86 86.48
Statlog 94.44 83.21 87.63

Hungarian data set: With the Hungarian data set the same situation was observed:
the best results were found when the data was first preprocessed with PCA. The best
mean accuracy with a 99% confidence interval was 87.5% ± 0.9%. Preprocessing with
PCA enhanced the results by over 3%. The best accuracy was found with a target
dimensionality of 11.
The results obtained with the Hungarian data set are compared to the results
of the other classifiers in Tables 11.4–11.7. When the results are compared with
those reported by Sirin and Güvenir [32] (Table 11.4), the DE-classifier yielded a
slightly higher mean accuracy, 83.4%, than the second best, CFP, with an accuracy
of 82.3%. When the Hungarian data set was preprocessed with PCA, the accuracy of
the DE-classifier increased to 87.5%, which can be considered a remarkably good result.
In Table 11.5 our results are compared with the corresponding ones by Li et al.
[21]. They reported an accuracy of 76% with their decision tree classifier, while in
comparison our DE-classifier reached an 83% accuracy. Li et al. also preprocessed the
data, and their wavelet transform preprocessing gained about 4 percentage points
in accuracy (80%). We obtained a 4-percentage-point enhancement in accuracy when we
preprocessed the data using PCA and then performed the classification using the
DE classifier, reaching an accuracy of 87%. In Table 11.6 our results are compared
with the results of [3]. With their method an accuracy of 86% was reported for the
Hungarian data set. This accuracy outperforms DE, but when the data is first preprocessed
with PCA and the resulting linear combinations of features are used, the DE-classifier
manages to get better results. In Table 11.7 our results are compared with the results
of Detrano et al. [6]. They reported a 77% accuracy with CDF, while our corresponding
result was an 83% accuracy using the DE classifier.

Table 11.7 DE classifier's classification results compared to the results Detrano et al. [6]
reported for the Hungarian, Long Beach and Switzerland data sets

Data set CDF CADENZKA DE PCA + DE


Hungarian 77 74 83 87
Long Beach 79 77 78 80
Switzerland 81 81 94 94

Switzerland data set: With the Switzerland data set an even higher classification
accuracy was reached than in the previous two cases. The best mean accuracy with a 99%
confidence interval was 94.5% ± 0.4%. The variances were also considerably lower than
with the previous two data sets. The results with the original data and the PCA-preprocessed
data were very close, and there were no statistically significant differences between
classifying the original data and the preprocessed data.
A comparison to the other results with the Switzerland data set is provided in Tables
11.5 and 11.7. Rather similar findings were made as in [21] and [6]: also in [21] and
[6] it was found possible to classify the Switzerland data set with a higher accuracy
than the other data sets. Furthermore, in [21] the same was noticed as what we
found: preprocessing the Switzerland data set did not considerably enhance the
results, which were similar to those with the original data. Li et al. [21] reported an accuracy
of 88% with a decision tree classifier and Detrano et al. [6] an accuracy of 81% with both
CDF and CADENZKA. We managed to classify the data with 94% accuracy using the
preprocessed data and the DE classifier. The DE classifier's accuracy of 94% is the highest
among these results.
Long Beach data set: The Long Beach data set seemed to be the most difficult to
classify with the algorithm presented in this paper. The best mean accuracy with a 99%
confidence interval was 79.9% ± 0.9%. This result was achieved when the data was
first preprocessed using the principal component analysis algorithm and then classified
with the DE classifier. The dimension where the best results were achieved was 12. The Long
Beach data set has the highest number of missing values among these four data sets,
and for this reason it is often left out from studies [21]. Here all missing
values were replaced by the dummy value −9, and the large number of −9 values is probably
the reason why the accuracies are somewhat lower with this data set than with the other
data sets.
For the Long Beach data set the results are compared in Table 11.7. The results without
preprocessing are rather similar to those of the classifiers CDF and CADENZKA:
accuracies of 79% with CDF and 77% with CADENZKA were reported, while we
obtained an accuracy of 78% with the DE-classifier. When the data was preprocessed with
PCA and then classified with DE, an accuracy of 80% was obtained for this particular
data set.
Heart-All: In Heart-All all four previous data sets are combined to obtain a larger
amount of data. When all four data sets were combined, the best mean
accuracy with a 99% confidence interval was 82.0% ± 0.5%. This was achieved when the
data was first preprocessed with PCA, using 13 dimensions.
Here the variance (1.05) was lower than with the other data sets, with the
exception of the Switzerland data set. Compared with the results obtained
without PCA preprocessing, we managed to enhance the results by about 5
percentage points by using PCA.
Table 11.8 gives a classification result comparison to the results reported by Łeski
in [22] and Pedreira in [27]. In all of these results the Cleveland, Hungarian, Switzerland
and Long Beach data sets are combined. Łeski used an ε-margin nonlinear
classifier based on fuzzy if-then rules and divided the data into two folds as in
our procedure, but there the testing set was much larger than the training set. Pedreira used
Kohonen's LVQ2.1 with training data selection, and ten-fold cross-validation was
used.

Table 11.8 DE classifier's classification results compared to the results reported by Łeski in [22]
and Pedreira in [27]. In all the results the Cleveland, Hungarian, Switzerland and Long Beach
data sets are combined

Method Accuracy (%)


SVM-best for several experiments varying C and δ² [27] 80.00
Kohonen’s LVQ2.1 (mean of 10-fold experiment) [27] 73.00
ε −margin nonlinear classifier based on fuzzy if-then rules [22] 79.96
Incremental SVM [5] 78.70
Fisher’s linear discriminant [22] 78.04
Logistic-regression-derived discriminant function [6] 77.00
Bayes point machine [14] 77.20
DE classifier 76.98
DE classifier with PCA 82.01

The DE-classifier provided a similar level of accuracy (76.98%) to Incremental SVM
(78.70%), Fisher's linear discriminant (78.04%), the logistic-regression-derived
discriminant function (77.00%) and the Bayes point machine (77.20%). Kohonen's LVQ2.1
managed slightly worse, with 73.00% accuracy. The SVM-best result for several
experiments varying C and δ² (80.00%) and the ε-margin nonlinear classifier based on
fuzzy if-then rules (79.96%) classified with slightly higher accuracy than the DE
classifier. When the data was preprocessed with PCA and then classified with the DE classifier, a
higher mean accuracy of 82.01% was gained. As reported in Table 11.8, the result of the
DE classifier with PCA data preprocessing outperformed the results of these seven
classifiers.
Heart-statlog: When the heart-statlog data set was classified, the best mean accuracy
with a 99% confidence interval was 87.6% ± 1.0%. When this result is compared to
the original Cleveland data set, the accuracy is only a little higher, so removing in
advance the samples with missing values increased the accuracy by only about
1 percentage point. The variances were actually higher with the heart-statlog data set than
with the original Cleveland data set. When the results are compared to those
of [20], where this data set was classified with 19 different classifiers, the results with
the DE classifier are more accurate than the best compared result, an 84.4%
accuracy with the NewId classifier.
The results with the heart-statlog data set are compared in Tables 11.9–11.11.
A similar mean accuracy was observed with the DE-classifier (83.2%) as with LMT
(83.2%), SLogistic (83.3%) and MLogistic (83.7%) in Table 11.9. In this experiment
Hervas-Martinez & Martinez-Estudillo [15] used two folds as we did, but the division
between the training set and the testing set was 75–25. When the data was preprocessed
with PCA and then classified with DE, we observed a 87.6% mean accuracy. In Table
11.10 the results are compared with those reported by Abdel-Aal [2] with
GMDH, where optimal feature selection was also carried out. They also used
a two-fold division of the data set, with a 70–30 division between the
training set and the testing set. When comparing the results of GMDH with all features
(82.5%) to the results of the DE-classifier (83.2%), the DE-classifier provided a slightly
higher accuracy. When optimal features were selected with GMDH, an accuracy
of 85% was reported, which is a bit lower than we reached by applying the DE-classifier
to the PCA-preprocessed data (87.6%). In Table 11.11 the results are compared to
those reported by Polat et al. [28], who used a ten-fold cross-validation
scheme to get their results. They reported results for their main classification
system, the artificial immune recognition system (AIRS), which reached an accuracy of 84.50%.

Table 11.9 DE classifier's heart-statlog data set classification results compared to the results
reported by Hervas-Martinez & Martinez-Estudillo [15]

Method Accuracy (%)


LMT 83.22 ± 6.50
SLogistic 83.30 ± 6.48
MLogistic 83.67 ± 6.43
C4.5 78.15 ± 7.42
CART 78.00 ± 8.25
LotusS 77.63 ± 7.16
DE 83.2 ± 1.7
PCA+DE 87.6 ± 1.0

Table 11.10 DE classifier's heart-statlog data set classification results compared to the results
reported by Abdel-Aal [2]. Results are compared using all features and the optimal
feature subset

Method Accuracy (%)


GMDH(all 13 features) 82.5
GMDH(optimal 6 features) 85
DE(all 13 features) 83.2
PCA+DE(13 features) 87.6

Table 11.11 DE classifier's classification results compared to the results Polat et al. [28] reported
for the heart-statlog data set

Author Method Accuracy (%)


ToolDiag, RA IB1 − 4 50.00
WEKA, RA InductH 58.50
ToolDiag, RA RBF 60.00
WEKA, RA FOIL 64.00
ToolDiag, RA MLP+BP 65.60
WEKA, RA T2 68.10
WEKA, RA 1R 71.40
WEKA, RA IB1c 74.00
WEKA, RA K∗ 76.70
Robert Detrano Logistic regression 77.00
Cheung (2001) C4.5 81.11
Cheung (2001) Naive-Bayes 81.48
Cheung (2001) BNND 81.11
Cheung (2001) BNNF 80.96
WEKA, RA Naive-Bayes 83.60
Polat et al. (2005) AIRS 84.50
Polat et al. (2007) Fuzzy-AIRS-k-NN- based system 87.00
Our work DE-classifier 83.21
Our work PCA+DE-classifier 87.63

This is a slightly higher accuracy than the 83.21% that we reached with the DE-classifier. They
also used a preprocessing method, which they called a weighting scheme based on
k-nearest neighbours (k-NN), as a preprocessing step before classifying
with their main classifier, AIRS. Using this method they gained an accuracy of
87.00%, while we reached an accuracy of 87.63% by preprocessing the data with PCA
and then classifying with the DE-classifier. Thus, the compared results appear to have a
rather similar level of accuracy in these cases. When we compare the results with the
enhanced PLCS [3], their method gives about 7% higher classification accuracy than the
presented method.

[Figure 11.1: heart data classification results plotted against the reduced dimension (with
PCA) for the Switzerland, Statlog, Hungarian, Cleveland, All, and Long Beach data sets;
panel (a) shows the mean classification accuracies and panel (b) the variances.]

Fig. 11.1 Classification results with respect to the reduced dimension: a) mean classification
accuracies when the data is first preprocessed with PCA and then classified with the DE
classifier, b) variances

In Fig. 11.1 the mean classification accuracies and variances are plotted for every
dimension to show how classification accuracy changes with respect to the reduced
dimension. As can be seen from Fig. 11.1, accurate results can be found already with few
dimensions. When the data is preprocessed with PCA, good results are still found with all
data sets when the dimension is as low as 4. The variances are also low for all dimensions
if the first three dimensions are disregarded. The figure suggests that rather accurate
results can be achieved when the reduced dimensionality is only about half
of the original. This information can be very useful when dealing with large amounts
of high-dimensional data, since the computations take considerably less
time when the dimensionality is reduced. Computations with only six dimensions
were about 3.5 times faster than computations with the full 13 dimensions.
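The dimensionality reduction itself is straightforward. A minimal PCA-by-SVD sketch, projecting centered data onto its first d principal components (illustrative only, not the authors' code; the data here is synthetic):

```python
import numpy as np

def pca_reduce(X, d):
    """Project data X (n_samples, n_features) onto its first d principal
    components, computed from the SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 13))     # e.g. 13 original heart-data features
Z = pca_reduce(X, 6)               # keep roughly half the dimensions
print(Z.shape)  # → (100, 6)
```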

Comparison to Support Vector Machine classifier


To make our comparison even clearer, we also computed the classification results
with a Support Vector Machine (SVM) [34] and with the combination PCA+SVM.
The results of these classification runs can be found in Table 11.12. With the SVM we
used an RBF kernel function. When one compares the results with the original data
and with the preprocessed data, one can see that with the SVM classifier the results are
not always enhanced when the data is first preprocessed using PCA; sometimes
(as in the case of the Hungarian data set) we even obtained worse results with the combination
PCA+SVM than with the SVM alone. Moreover, although the SVM can be considered a
quite new and high-performing classifier, it does not perform that well in this task:
classification accuracies are about 20 percentage points lower with the SVM than with the
DE-classifier. This experiment was done to emphasize that PCA works well on this data
with the DE-classifier, but this observation is not directly transferable to other classifiers.

Table 11.12 Results with SVM and with the combination PCA+SVM: best and mean
classification results and variances

Data & method Best result Mean result Variance


Original data
Heart C & SVM 62.86 58.33 0.065
Heart H & SVM 70.07 63.20 0.095
Heart L & SVM 73.74 65.86 0.136
Heart S & SVM 91.94 86.60 0.021
Heart Statlog & SVM 65.93 60.93 0.073
Heart all & SVM 68.37 64.75 0.029
Preprocessed with PCA
Heart C & SVM 66.70 64.05 0.20
Heart H & SVM 46.94 44.83 0.12
Heart L & SVM 52.43 50.26 0.14
Heart S & SVM 99.67 93.38 0.79
Heart Statlog & SVM 67.19 65.22 0.13
Heart all & SVM 65.96 65.07 0.03
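For reference, a PCA+SVM combination of the kind used in this comparison can be assembled in a few lines with scikit-learn. This is a generic sketch on synthetic data (RBF kernel, default hyperparameters, an assumed PCA dimension of 6), not the authors' exact configuration:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a heart data set: 13 features, 2 classes.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
# Half-and-half split, as in the chapter's experimental setup.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = make_pipeline(StandardScaler(), PCA(n_components=6), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # test-set accuracy of PCA + RBF-SVM
```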

11.5 Discussion and Conclusions


In this paper we applied a classification method based on first preprocessing the data
with PCA and then applying a differential evolution classifier to the diagnosis of heart
disease. For demonstrating and assessing the proposed classification approach, we
computed results for four different heart data sets individually, and also for the case
where all data sets were combined. For the diagnosis of heart
disease we found that by preprocessing the data first with PCA, a higher
classification accuracy can be achieved than without preprocessing. This was observed in all
studied cases except the Switzerland data set. Another benefit is the
reduced overall computing time: through dimensionality reduction, data sets of
high dimensionality can be classified considerably faster with the reduced data than
with the original data. This procedure also made it possible for
DE to find more robust and accurate class vectors, which improved the classification
accuracy; in this way we also managed to filter out noise and improve the results.
We consider that the main factor behind the good classification accuracy
in the studied cases was the application of an effective global optimizer, differential
evolution, for fitting the classification model, instead of approaches based on local
optimization. The results also indicate that preprocessing the data before
classification may, in successful cases, not only help with the curse of increasing data
dimensionality, but also provide a further improvement in classification accuracy.
Still, the main contributor to the accuracy was the global optimization method in
the classifier, which made it possible, at least to some extent, to avoid getting trapped
in locally optimal (and thereby suboptimal) solutions, and to improve the solution
further in comparison with the compared approaches. Another
important point contributing to the classification accuracy was the systematic
optimization of the parameter p, instead of keeping it fixed or setting it manually by
trial and error. It should be noted that including p among the optimized parameters
was possible thanks to the application of a global optimizer capable of handling the
extra parameter, and the additional nonlinearity and multimodality of the classifier
model optimization problem that including p among the optimized parameters
brings.
Generally, the classification accuracy yielded by the proposed approach compared
well with the corresponding results of several classifiers reported in the literature.
We managed to classify the Switzerland data set with a 94.5% ± 0.4% mean
accuracy, and when all heart data sets were combined, we achieved a mean
accuracy of 82.0% ± 0.5%. The results suggest that the proposed classification
approach has potential in the diagnosis of heart disease.
A further advantage of the approach is that when the dimension of the data is reduced,
the overall computational time is reduced, allowing the classification of even larger data
sets.

References
1. Abbass, H.A.: An evolutionary artificial neural networks approach for breast cancer di-
agnosis. Artificial Intelligence in Medicine 25, 265–281 (2002)
2. Abdel-Aal, R.E.: GMDH-based Feature Ranking and Selection for Improved Classifica-
tion of Medical Data. Journal of Biomedical Informatics 38, 456–468 (2005)

3. Bacardit, J., Krasnogor, N.: Smart Crossover Operator with Multiple Parents for a Pitts-
burgh Learning Classifier System. In: Proceedings of the 8th Annual Conference on Ge-
netic and Evolutionary Computation (GECCO 2006), pp. 1441–1448. ACM Press, New
York (2006)
4. Booker, L.: Improving the performance of genetic algorithms in classifier systems. In:
Grefenstette, J.J. (ed.) Proc. 1st Int. Conf. on Genetic Algorithms, Pittsburgh, PA, July
1985, pp. 80–92 (1985)
5. Cauwenberghs, G., Poggio, T.: Incremental and decremental support vector machine
learning. In: Advanced Neural Information Processing Systems, vol. 13. MIT Press,
Cambridge (2001)
6. Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Sandhu, S., Guppy, K., Lee, S.,
Froelicher, V.: International application of a new probability algorithm for the diagnosis
of coronary artery disease. American Journal of Cardiology 64, 304–310 (1989)
7. Donoho, D.: High-dimensional data analysis: The curses and blessings of dimensional-
ity. In: Lecture at the “Mathematical Challenges of the 21st Century” conference of the
American Math. Society, Los Angeles, August 6-11 (2000)
8. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley & Sons, Chich-
ester (1973)
9. Fodor, I.K.: A Survey of Dimension Reduction Techniques, LLNL technical report (June
2002)
10. Fogarty, T.C.: Co-evolving co-operative populations of rules in learning control systems.
In: Fogarty, T.C. (ed.) AISB-WS 1994. LNCS, vol. 865, pp. 195–209. Springer, Heidel-
berg (1994)
11. Giacobini, M., Brabazon, A., Cagnoni, S., Gianni, A.D., Drechsler, R.: Automatic
Recognition of Hand Gestures with Differential Evolution - Applications of Evolutionary
Computing: Evoworkshops (2008)
12. Gomes-Skarmeta, A.F., Valdes, M., Jimenez, F., Marin-Blazquez, J.G.: Approximative
fuzzy rules approaches for classification with hybrid-GA techniques. Information Sci-
ences 136, 193–214 (2001)
13. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Pub-
lisher, San Francisco (2000)
14. Herbrich, R., Graepel, T., Campbell, C.: Bayes point machines. J. Machine Learning
Res. 1, 245–279 (2001)
15. Hervas-Martinez, C., Martinez-Estudillo, F.: Logistic Regression Using Covariates Ob-
tained by Product-unit Neural Network Models. Pattern Recognition 40, 52–64 (2007)
16. Holland, J.H.: Properties of the bucket-brigade algorithm. In: Grefenstette, J.J. (ed.) Proc.
1st Int. Conf. on Genetic Algorithms, Pittsburgh, PA, July 1985, pp. 1–7 (1985)
17. Holland, J.H.: Genetic algorithms and classifier systems: foundations and future direc-
tions. In: Proc. 2nd Int. Conf. on Genetic Algorithms, pp. 82–89 (1987)
18. Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.R.: Classifier systems, Q-
morphisms and induction. In: Davis, L. (ed.) Genetic algorithms and Simulated Anneal-
ing, ch. 9, pp. 116–128 (1987)
19. Jolliffe, I.: Principal Component Analysis. Springer, Heidelberg (1986)
20. King, R.D., Feng, C., Sutherland, A.: Statlog: Comparison of Classification Algorithms
on Large Real-World Problems. Applied Artificial Intelligence 9(3), 256–287 (1995)
21. Li, Q., Li, T., Zhu, S., Kambhamettu, C.: Improving Medical/Biological Data Classi-
fication Performance by Wavelet Preprocessing. In: Proceedings of IEEE International
Conference on Data mining (ICDM), pp. 657–660 (2002)

22. Łeski, J.M.: An ε − Margin Nonlinear Classifier Based on Fuzzy If-Then Rules. IEEE
Transactions on Systems, Man and Cybernetics-Part B: Cybernetics 34(1), 68–76 (2004)
23. Luukka, P., Sampo, J.: Similarity Classifier Using Differential Evolution and Genetic
Algorithm in Weight Optimization. Journal of Advanced Computational Intelligence and
Intelligent Informatics 8(6), 591–598 (2004)
24. Martens, H., Naes, T.: Multivariate Calibration. John Wiley, Chichester (1989)
25. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of machine learning
databases. University of California, Department of Information and Computer Science,
Irvine, CA, http://www.ics.uci.edu/˜mlearn/MLRepository.html
(Cited 30 November 2008)
26. Omran, M., Engelbrecht, A.P., Salman, A.: Differential Evolution Methods for Unsu-
pervised Image Classification. In: Proceedings of the Seventh Congress on Evolutionary
Computation (CEC 2005), Edinburgh, Scotland. IEEE Press, Los Alamitos (2005)
27. Pedreira, C.E.: Learning Vector Quantization with Training Data Selection. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 28(1), 157–162 (2006)
28. Polat, K., Sahan, S., Günes, S.: Automatic detection of heart disease using an artifi-
cial immune recognition system (AIRS) with fuzzy resource allocation mechanism and
k-nn (nearest neighbour) based weighting preprocessing. Expert Systems with Applica-
tions 32, 625–631 (2007)
29. Price, K.V.: New Ideas in Optimization. In: An Introduction to Differential Evolution,
ch. 6, pp. 79–108. McGraw-Hill, London (1999)
30. Price, K., Storn, R., Lampinen, J.: Differential Evolution - A Practical Approach to
Global Optimization. Springer, Heidelberg (2005)
31. Robertson, G.: Parallel implementation of genetic algorithms in a classifier system. In:
Davis, L. (ed.) Genetic algorithms and Simulated Annealing, ch. 10, pp. 129–140 (1987)
32. Sirin, I., Güvenir, H.A.: An Algorithm for Classification by Feature Partitioning. Techni-
cal Report CIS-9301, Bilkent University, Dept. of Computer Engineering and Informa-
tion Science, Ankara (1993)
33. Storn, R., Price, K.V.: Differential Evolution - a Simple and Efficient Heuristic for Global
Optimization over Continuous Space. Journal of Global Optimization 11(4), 341–359
(1997)
34. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
35. Wilson, S.W.: Hierarchical credit allocation in a classifier system. In: Davis, L. (ed.)
Genetic algorithms and Simulated Annealing, ch. 8, pp. 104–115 (1987)
Chapter 12
An Integrated Approach to Speed Up GA-SVM
Feature Selection Model

Tianyou Zhang, Xiuju Fu, Rick Siow Mong Goh, Chee Keong Kwoh,
and Gary Kee Khoon Lee

Abstract. Significant information or features are often overshadowed by noise, resulting in poor classification results. Feature selection methods such as GA-SVM are desirable for filtering out the irrelevant features and thus improving accuracy; the selection itself may also offer critical insights into the problem. However, the high computational cost greatly discourages the application of GA-SVM, especially for large-scale datasets. In this paper, an HPC-enabled GA-SVM (HGA-SVM) is proposed and implemented by integrating data parallelization, multithreading and heuristic techniques, with the ultimate goal of maintaining robustness while lowering computational cost. Our proposed model comprises four improvement strategies: 1) GA Parallelization, 2) SVM Parallelization, 3) Neighbor Search and 4) Evaluation Caching. All four strategies improve their respective aspects of the feature selection algorithm and contribute collectively towards higher computational throughput.

12.1 Introduction
The booming of information technologies has promoted the production of data in all sorts of domains. Significant information or features are often mixed up with noise inside the data, which poses a challenging task for machine learning: filtering out the irrelevant features and selecting the truly important ones. Given data samples with class labels, supervised classification models are usually used together with optimization algorithms for feature selection, in which classification accuracies serve as the fitness evaluation of the selected feature subsets. In this
Tianyou Zhang · Xiuju Fu · Rick Siow Mong Goh · Gary Kee Khoon Lee
Institute of High Performance Computing, 1 Fusionopolis Way, #16-16 Connexis,
Singapore 138632
e-mail: zhangty@ihpc.a-star.edu.sg
Chee Keong Kwoh
Nanyang Technological University, Singapore 637457

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 285–298.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
286 T. Zhang et al.

work, we develop a feature selection model which combines the merits of support
vector machine (SVM), genetic algorithm (GA) and high performance computing
techniques.
A supervised learning model is capable of learning a set of functions (classifiers) from prior knowledge. It has been widely applied in many domains including bioinformatics, cheminformatics, financial forecasting, etc. A typical example of supervised learning: given a set of training data with multiple input features and labeled outputs, a classifier is learnt from the "known" examples and generalized to label the "unknown" ones. The rationale for applying supervised learning is that labeling can be expensive. For example, in most bioinformatics problems, laboratory approaches are the most reliable and trustworthy but are time-consuming, labor-intensive and costly. A more cost-effective and efficient alternative is to conduct laboratory experiments to collect sufficient labeled data, then train a classifier to label the rest of the input examples, and finally verify the highlighted examples in the laboratory.
Support vector machine (SVM) [1] is a set of supervised learning tools based on the structural risk minimization principle, and has been popular in both classification and regression tasks. The principle of SVM is to construct an optimal separating hyperplane that maximizes the margin between two classes of data. The concept can be visualized as two boundary planes parallel to the hyperplane, one on each side, pushed maximally towards the data points as long as no data points fall in between them. In this picture, the "margin" is the distance between the boundary planes, and the "support vectors" are the data points sitting on those planes. In many real-world problems, however, the unavoidable existence of noise makes it infeasible to construct such a hard-margin classifier with zero errors. The soft margin [2] was then introduced to allow data points to fall in between the boundary planes, or even across the hyperplane, with a penalty cost C that controls the tradeoff between maximal margin and minimal error. For linearly separable data, constructing the linear classifier is straightforward; for non-linearly separable data, the kernel trick [3] is needed to map the data into a high-dimensional space in which the transformed data become linearly separable. The performance of an SVM classifier is often estimated by k-fold cross validation: the data are divided into k subsets, and in each round a different subset is used for validation while the remaining subsets are used for training. The generalization error or accuracy values computed in all k rounds are finally aggregated to measure the overall performance of the SVM.
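The k-fold rotation described above can be sketched in a few lines of plain Python. The `majority_baseline` scorer below is a hypothetical stand-in for SVM training and scoring, used only to keep the sketch self-contained; it is not the chapter's implementation (which uses LibSVM).

```python
import random

def k_fold_cross_validation(data, labels, k, train_and_score):
    """Split the data into k subsets; each subset serves once as the
    validation set while the remaining subsets are used for training.
    The k per-round scores are aggregated into an average."""
    indices = list(range(len(data)))
    random.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsets
    scores = []
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(train, valid))
    return sum(scores) / k  # aggregate over all k rounds

# Placeholder "classifier": predict the majority class of the training set.
def majority_baseline(train_idx, valid_idx):
    majority = max(set(labels[j] for j in train_idx),
                   key=lambda c: sum(labels[j] == c for j in train_idx))
    return sum(labels[j] == majority for j in valid_idx) / len(valid_idx)

random.seed(0)
data = [[random.random()] for _ in range(100)]
labels = [0] * 70 + [1] * 30
print(round(k_fold_cross_validation(data, labels, 10, majority_baseline), 2))  # → 0.7
```

Because every example falls into exactly one validation fold, the aggregated score here equals the majority-class rate of the whole dataset, illustrating how the k rounds together cover all the data.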
When SVM is employed for data classification, the choice of the margin cost C and the kernel parameters is a very important step for obtaining high-performance SVM classifiers [4]. The optimal parameters, those leading to the minimal generalization error, are data-dependent. Presently no rules or formulas can compute such values analytically, so parameter tuning is often required. An intuitive realization of parameter tuning is grid search [5]: the parameters are varied by a step-size within a preset range of values (arranged as a "grid"), and the optimal values are found by measuring every combination (every node in the grid). Due to its complexity, usually a two-dimensional grid is used to tune a pair of parameters such as C and γ (the Gaussian function width in the RBF kernel). Even after parameter tuning, an SVM classifier might deliver poor accuracy on some particular datasets. One possible reason is noise interference, in which an overwhelming number of irrelevant features are included in the inputs so that a truly representative classifier cannot be learnt. If prior knowledge is insufficient to differentiate which features are truly relevant to the output, such that all possible features are included in the training data, the learning accuracy deteriorates. In those cases, the key to improving learning performance is feature selection: the technique of searching for the significant candidates in the feature space by optimization methods such as genetic algorithms.
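As a sketch of the exhaustive grid search described above (not the chapter's Octave implementation), the example below measures every node of a logarithmically spaced w×w grid over C and γ. The `toy_accuracy` surface is a made-up stand-in for the SVM cross-validation accuracy at each node.

```python
import itertools
import math

def log_grid(lo_exp, hi_exp, steps):
    """Logarithmically spaced values from 10**lo_exp to 10**hi_exp."""
    return [10 ** (lo_exp + i * (hi_exp - lo_exp) / (steps - 1))
            for i in range(steps)]

def grid_search(evaluate_accuracy, steps=10):
    """Measure every (C, gamma) node of a steps x steps grid and
    return the best pair -- steps**2 SVM evaluations in total."""
    Cs = log_grid(-2, 3, steps)      # e.g. C in [1e-2, 1e3]
    gammas = log_grid(-2, 3, steps)  # e.g. gamma in [1e-2, 1e3]
    return max(itertools.product(Cs, gammas),
               key=lambda cg: evaluate_accuracy(*cg))

# Hypothetical accuracy surface peaking near C = 10, gamma = 0.1.
def toy_accuracy(C, gamma):
    return -((math.log10(C) - 1) ** 2 + (math.log10(gamma) + 1) ** 2)

best_C, best_gamma = grid_search(toy_accuracy)
print(best_C, best_gamma)
```

Replacing `toy_accuracy` with an actual train-and-cross-validate call gives the tuning loop that the chapter identifies as a major cost: steps² full SVM evaluations per chromosome.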
The genetic algorithm (GA) [6] is a search technique inspired by natural evolution. In evolution, individuals with better genetic merits (chromosomes) are more likely to survive natural selection and reproduce offspring, while the unfit ones are filtered out. By constant filtering, generation after generation, the population tends to carry fitter and fitter chromosomes. To mimic this process, each candidate solution to a feature-selection problem is encoded as a "chromosome" (a feature-subset representation) in the form of a bit array (e.g. 10010101), where 1s and 0s denote the presence or absence of each feature. A group of such candidate solutions is sampled randomly to form the initial population of chromosomes. The chromosomes are then evaluated by an objective function to compute their fitness scores. Multiple chromosomes are stochastically selected based on their fitness, recombined (crossover) and mutated, and finally form the next generation. By means of random mutations and crossovers, variety is introduced into the chromosomes and evaluated in every generation, gradually evolving the solutions towards the optimum. The process is iterated until convergence, i.e. there is no more improvement to the best fitness score in the population. At the end, the chromosome with the best-ever fitness is the final solution, and all the features denoted by 0s are filtered out.
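The bit-string GA loop described above can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the fitness function here simply counts 1-bits, standing in for the SVM cross-validation accuracy, and all rates and sizes are arbitrary.

```python
import random

def evolve(fitness, n_bits, pop_size=20, generations=50,
           crossover_rate=0.6, mutation_rate=0.05, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Fitness-proportionate (roulette) selection of parents.
        weights = [fitness(c) + 1e-9 for c in pop]
        parents = [list(c) for c in rng.choices(pop, weights=weights, k=pop_size)]
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            if rng.random() < crossover_rate:          # one-point crossover
                cut = rng.randrange(1, n_bits)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            nxt += [a, b]
        for c in nxt:                                  # bit-flip mutation
            for i in range(n_bits):
                if rng.random() < mutation_rate:
                    c[i] ^= 1
        pop = nxt
        best = max(pop + [best], key=fitness)          # track best-ever
    return list(best)

# Toy fitness: number of 1-bits (a stand-in for SVM CV accuracy).
solution = evolve(sum, n_bits=16)
```

In the feature-selection setting, each evaluation of `fitness` would itself contain the parameter-tuned, cross-validated SVM training, which is why the number of evaluations dominates the runtime.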
By taking SVM as the objective function in GA, GA-SVM has been widely used to filter out irrelevant features and improve learning accuracy in noisy settings. However, GA-SVM suffers from a practical problem of high computational cost. Assuming a population of m over g generations in GA, a w×w grid in parameter tuning, and t seconds for one SVM training plus 10-fold cross validation, the overall runtime costs m·g·w²·t seconds. The process is so time-consuming that even a small-scale problem may need nearly a day to complete (demonstrated in the Results section). That strongly discourages the application of GA-SVM to larger and more complex data. In this paper, we introduce high performance computing (HPC) techniques and heuristic methods to speed up the traditional GA-SVM feature selection model. In our HPC-enabled GA-SVM (HGA-SVM), we employ data parallelization, multithreading, repeated-evaluation reduction and heuristic optimization, with the ultimate goal of trimming down the computational cost and making large-scale feature selection more feasible. HGA-SVM comprises four improvement strategies: 1) GA Parallelization, 2) SVM Parallelization, 3) Neighbor Search, and 4) Evaluation Caching. All four strategies work collectively towards higher computational efficiency.

12.2 Methodology
The GA-SVM feature selection model operates GA to search the feature space for a subset of features that produces the best learning performance under SVM. An implementation of such a model comprises three operators: crossover, mutation, and SVM evaluation. With respect to population size, the first two are of linear complexity, while the last is of quadratic complexity. It is clear that reducing the population size could lower the computational cost effectively. Moreover, as the input data grow larger and more complex, most of the time lag arises in SVM evaluation, since all other operators work only at the chromosome level. If a single SVM training is slowed down by t seconds on a larger dataset, then by the effect of 10-fold cross validation and 10×10 grid search, the time lag per chromosome in every generation is amplified to 1000t seconds. SVM evaluation is thus the biggest obstacle in GA-SVM, discouraging its application to large-scale datasets, and a speedup specific to SVM is therefore desirable.
The high computational cost of GA-SVM can also arise from parameter tuning in SVM. Exhaustive grid search is an intuitive and straightforward tuning technique; however, it is slow even when the grid dimension is small. A simple 10×10 grid search requires 100 runs of SVM learning (with 10-fold cross validation) for each chromosome in every generation. This exhaustive search incurs a huge computational cost. Yet parameter tuning cannot be omitted despite the cost; otherwise SVM learning would be biased and the purpose of improving learning performance undermined.
An additional waste of computational power is redundant evaluation. It happens when identical chromosomes re-emerge in different GA generations due to random mutations and crossovers. Since a standard GA is memory-less, i.e. keeps no historical records of past-evaluated chromosomes, it has to evaluate every appearance of those identical chromosomes. These unnecessary evaluations waste computational power and increase the runtime.
We have enumerated three causes that lead to the slow execution of GA-SVM on large-scale datasets. Targeting each of them, four improvement strategies were designed to alleviate the computational cost: parallel GA and parallel SVM speed up the GA and SVM respectively; neighbor search replaces grid search to reduce the number of parameter combinations to be measured; and evaluation caching avoids repeated unnecessary evaluations. Figure 12.1 shows the workflow of all four improvement strategies in HGA-SVM.

12.2.1 Parallel/Distributed GA
The design of the GA parallelization follows the parallel island model [7] in a coarse-grained architecture. The entire population of chromosomes is divided into n subpopulations, and each subpopulation is assigned to a different parallel node. Every parallel node evolves its local subpopulation by a serial GA. At the end of every generation, multiple chromosomes are selected randomly at each node and exchanged among the peers, a step called "migration". Migration brings new variety into the local population and helps build up a common trend of evolution across all subpopulations.
By distributing chromosomes to n parallel nodes, the local population size is reduced by a factor of 1/n (the population size is an even integer before and after the reduction). The execution time of a single GA generation is sped up by approximately n times, because selection, crossover and mutation are of linear complexity and SVM evaluation is of quadratic complexity with respect to population size [8]. There are, however, some drawbacks of parallelization: parallel overheads, which include start-up/termination overhead, synchronization overhead and communication overhead. The first two are unavoidable in coordinating parallel computing across multiple nodes, so we focus on reducing the communication overhead in this

Fig. 12.1 Design Scheme of HPC-enabled GA-SVM feature selection model


study. With the adoption of the parallel island framework, GAs run independently at different nodes with their local copies of the data, so the need for data communication is minimized. A further reduction is realized in the migration operation, which requires passing arrays of bits (chromosomes) among the nodes. Many migration schemes specific to certain topologies exist in the literature. To minimize communication overhead, we adopt a "ring" topology for migration, in which each node transfers its local best chromosomes to its neighbor on the ring. For instance, among three parallel nodes A, B and C, the exchange happens as A → B, B → C and C → A. The incurred communication overhead is then linear w.r.t. the number of parallel nodes. If there are n parallel nodes, it takes n generations for the migrated chromosomes to travel around the ring and return to their original node. Therefore any serial GA should be allowed to terminate only if there has been no further improvement for at least n consecutive generations. Once any serial GA terminates, the parallel GA stops the iteration and collects the local best chromosomes from the individual nodes to compute the final solution.
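The chapter's implementation exchanges chromosomes between nodes via MPI; the sketch below only simulates one ring-migration step in-process, with the count of 1-bits standing in for fitness. All names and sizes are illustrative.

```python
def ring_migrate(islands, n_migrants=1):
    """One migration step on a ring: island i sends copies of its best
    n_migrants chromosomes to island (i + 1) % n, replacing the
    receiver's worst ones (fitness = number of 1-bits here)."""
    n = len(islands)
    # Snapshot outgoing migrants first, so every island sends its own
    # pre-migration best (mirroring simultaneous MPI sends).
    outgoing = [sorted(isl, key=sum, reverse=True)[:n_migrants]
                for isl in islands]
    for i in range(n):
        recv = (i + 1) % n
        islands[recv].sort(key=sum)                      # worst first
        islands[recv][:n_migrants] = [list(c) for c in outgoing[i]]
    return islands

islands = [
    [[1, 1, 1, 0], [0, 0, 0, 0]],   # island A
    [[0, 1, 0, 0], [0, 0, 1, 0]],   # island B
    [[1, 1, 1, 1], [1, 0, 0, 0]],   # island C
]
ring_migrate(islands)   # A -> B, B -> C, C -> A
```

Because each island communicates with exactly one neighbor, the message count per step grows linearly with the number of islands, which is the property the text exploits to keep communication overhead low.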
The GA parallelization is developed using the MPI library [9] and is usually catered for distributed-memory hardware architectures. A parallel GA using MPI is able to scale to all the compute nodes available, which significantly benefits the application of our HGA-SVM to large-scale datasets. With the introduction of multi-core processors, the GA parallelization can also be applied to mainstream shared-memory systems to achieve good performance speedup.

12.2.2 Parallel SVM


SVM training is compute-intensive because it requires quadratic programming (QP) [1] to determine the optimal separating hyperplane. Over the years, several methods have been developed to lower the computational cost. One of them is sequential minimal optimization (SMO) [10], which divides a large QP problem into a series of smaller QP problems that can be solved analytically, so that the training is sped up. However, when SVM has to be repeated thousands of times in a typical feature selection task, SVM with SMO (SVM-SMO) is still considerably slow.
To speed up SVM-SMO, we apply parallelization to distribute the computations to multiple nodes/threads for concurrent execution. As parallelization introduces extra overheads for coordination and communication, it is wise to parallelize the most computationally intensive section to achieve the maximal speedup. Table 12.1 shows a typical execution profile of SVM-SMO (retrieved from a LibSVM [11] training execution). It is clear that kernel calculations take up most of the computational time.
The caller function of those kernel calculations comprises an iterative loop scanning through the instance space to select a pair of instances for optimization. Thus OpenMP [12], a parallelization protocol designed for shared-memory multi-processor/multi-core systems, is most suitable here. The implementation of the parallel SVM is relatively simpler than the GA parallelization: it can be done by identifying and

Table 12.1 A Typical Execution Profile for SVM-SMO (LIBSVM)

Time (%)   Self (sec)   Calls         Function Name
81.15      231.34       791,262,338   Kernel::kernel
11.67      33.27        269,460       SVC_Q::get_Q
4.78       13.64        66,095        Solver::select_working_set
2.22       6.34         1             Solver::Solve
0.15       0.44         66            Solver::do_shrinking
0.02       0.05         3,806         Cache::swap_index

resolving the data dependencies inside the loop, followed by inserting the OpenMP directives, without any modification to the structure of the algorithm. The parallel SVM is able to utilize multiple CPU cores concurrently in the form of multithreading (refer to Figure 12.1), effectively reducing the computation time. Since OpenMP also introduces overheads, the parallel SVM performs more efficiently when the training dataset is sufficiently large.
In our HGA-SVM, the GA and SVM operations are parallelized using MPI and OpenMP respectively. The parallelization techniques used in these two operations allow them to work together as a hybrid parallelization that speeds up the workflow concurrently.
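The chapter parallelizes LibSVM's kernel loop with OpenMP directives in C++. As a language-neutral illustration of the same work decomposition, the sketch below splits kernel-matrix rows across a thread pool. Note that CPython threads will not actually speed up pure-Python arithmetic because of the GIL; the sketch only shows the row-wise split of kernel evaluations, and all names are our own.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def rbf(x, y, gamma=0.5):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_row(args):
    """Compute one full row of the kernel matrix."""
    i, X, gamma = args
    return [rbf(X[i], X[j], gamma) for j in range(len(X))]

def kernel_matrix_parallel(X, gamma=0.5, n_threads=4):
    """Distribute kernel-matrix rows across a thread pool, mirroring
    how the OpenMP version splits kernel evaluations across cores."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(kernel_row,
                             [(i, X, gamma) for i in range(len(X))]))

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = kernel_matrix_parallel(X)
```

Each row is independent of the others, which is exactly the property that makes the kernel loop a good target for a worksharing directive.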

12.2.3 Neighbor Search


Parameter tuning is crucial for achieving minimal generalization error in SVM learning; however, it is also time-consuming when an exhaustive grid search is employed. Inspired by the pattern search method [13], we propose a new derivative-free method, neighbor search, as a general solution to the parameter selection problem. Neighbor search inherits the underlying structure of grid search but does not attempt to measure every node in the grid. In our context, the parameters to be tuned are the margin cost C and the RBF kernel width γ. The neighbor search for C and γ starts from an initial position in the grid (say, a 10×10 grid) as the centroid and samples multiple neighbor nodes with uniform distribution within the grid of parameter domains. The centroid and its neighbors are measured by SVM learning accuracy with the corresponding parameter pairs applied, and the best node (the one associated with the highest accuracy) is nominated as the new centroid. By repeating this process, the centroid keeps moving towards the best node until convergence, i.e. the centroid itself is the best among the group of examinees. Neighbor search is a heuristic search method, and the confidence level of its solution depends on the sampling size, i.e. how many neighbor nodes are sampled in every round. With a larger sampling size, the solution is more likely to be the grid optimum, but the search is slower as more measurements must be done; with a smaller sampling size, confidence in the solution is lower but the search is faster. By introducing neighbor search, the tradeoff between solution confidence and runtime cost can be adjusted to achieve considerable speedup at the cost of a bearable suboptimal solution.
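One possible reading of the procedure is sketched below; the function names, the safety bound on rounds, and the single-peak `peak` surface (standing in for SVM accuracy at grid node (i, j)) are our own assumptions, not the chapter's code.

```python
import random

def neighbor_search(accuracy, grid=10, samples=8, start=(0, 0), seed=3):
    """Heuristic search over a grid x grid parameter lattice: measure
    `samples` nodes drawn uniformly from the grid each round and move
    the centroid to the best node; stop when the centroid itself wins."""
    rng = random.Random(seed)
    centroid = start
    for _ in range(1000):                 # safety bound on rounds
        nodes = [(rng.randrange(grid), rng.randrange(grid))
                 for _ in range(samples)]
        best = max([centroid] + nodes, key=accuracy)
        if best == centroid:              # centroid wins: converged
            break
        centroid = best
    return centroid

# Toy accuracy surface with a single peak at node (7, 2).
peak = lambda node: -abs(node[0] - 7) - abs(node[1] - 2)
print(neighbor_search(peak))
```

The `samples` argument is the knob discussed in the text: raising it costs more SVM measurements per round but makes the returned node more likely to be the grid optimum.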

Intuitively, if two chromosomes differ in only a few bits, their optimal locations in the grid might be close to each other. This applies to mutated chromosomes in the new generation when the Hamming distance between a parent and its child chromosome is small. Since neighbor search has already been done for the parent chromosome and found its optimal node, the same node can be used as the initial centroid for the child chromosome, which is advantageous for faster convergence.

12.2.4 Evaluation Caching


Caching is employed in our algorithm because it helps avoid repeated unnecessary evaluations across generations. A cache is built up to store all the previously evaluated chromosomes. Whenever an evaluation is requested, the cache is searched first; only on a cache miss is an SVM evaluation executed, and the cache is updated afterwards. The efficiency of the cache depends on how frequently identical chromosomes re-emerge, which is variable and stochastic by nature. According to probability theory, however, the cache tends to be less effective as the data dimension grows, because the chance of encountering identical chromosomes (after random mutation and crossover) decreases. The implementation of evaluation caching requires additional memory space to store the cache entries. Keeping a small footprint for the cache is a challenge, as a large number of entries can be expected. In HGA-SVM, encoding compression is introduced to reduce the length of individual cache entries, with two encoding schemes for different types of data. For low-dimensional datasets, a simple multi-bit encoding compresses a chromosome into a multi-bit symbol string, where the compression rate depends on the number of distinct symbols used. For high-dimensional sparse datasets, further compression is achieved by encoding the differences between consecutive bits of a chromosome, followed by compressing the runs of consecutive 0s in the encoded string. The encoding compression schemes not only reduce the footprint of the evaluation cache, but also cut down the computational cost of cache search because every cache entry is shorter.
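A minimal sketch of the caching idea follows. The run-length `compress` here is a simplified stand-in for the chapter's two encoding schemes, and `sum` stands in for the expensive SVM evaluation; both are illustrative assumptions.

```python
def compress(chromosome):
    """Run-length encode a bit string -- compact for sparse
    (mostly-zero) high-dimensional chromosomes, and hashable so it
    can serve as a dictionary key."""
    out, run, bit = [], 1, chromosome[0]
    for b in chromosome[1:]:
        if b == bit:
            run += 1
        else:
            out.append((bit, run))
            bit, run = b, 1
    out.append((bit, run))
    return tuple(out)

class EvaluationCache:
    """Look up a chromosome before evaluating; only on a cache miss
    is the (expensive) SVM evaluation executed and stored."""
    def __init__(self, evaluate):
        self.evaluate = evaluate
        self.store = {}
        self.hits = self.misses = 0

    def fitness(self, chromosome):
        key = compress(chromosome)
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.evaluate(chromosome)
        return self.store[key]

cache = EvaluationCache(evaluate=sum)   # `sum` stands in for SVM CV
cache.fitness([0, 1, 1, 0])
cache.fitness([0, 1, 1, 0])             # identical chromosome: cache hit
print(cache.hits, cache.misses)         # → 1 1
```

As the text notes, the hit rate (and hence the payoff of this strategy) drops as the feature dimension grows, since identical chromosomes become exponentially less likely to recur.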

12.3 Experiments and Results


The source code of the GA [14] was ported to Octave [15] and incorporated with the improvement strategies, including parallel GA, neighbor search and evaluation caching. The MPI required by the parallel GA is supported by the MPITB library [16]. The encoding compression required by the evaluation cache was coded as C++ libraries and linked to Octave for efficiency. The source code of LibSVM 2.8.6 [11] was modified to implement the parallel SVM and also ported to Octave as an external library, which allows direct access to runtime variables in memory. The experiment platform is a machine with two Intel Xeon quad-core processors (3.0 GHz) and 32 GB of memory.
The parameters of our HGA-SVM are listed in Table 12.2. The population/subpopulation sizes and the crossover/mutation/migration rates are self-explanatory, and

Table 12.2 List of Preset Parameters in GA-SVM

GA Parameters
# of parallel nodes      n
population size          80
subpopulation size       80/n
cross-over rate          60%
mutation rate            5%
migration rate           50%
max-generation           100
max-convergence          n
fitness epsilon          0.01%

SVM Parameters
SVM kernel               RBF
cross-validation         10-fold stratified
C and γ range            10^-2 to 10^3
grid size                10×10
neighbor sampling size   8

their values were set based on experience. Max-generation refers to the maximum number of generations to be evolved; max-convergence denotes the number of consecutive generations to wait before termination if there is no further improvement to the fitness score, which is measured by the average accuracy of stratified 10-fold cross validation of the SVM classifier with the tuned parameters. The minimal update level of the fitness score is 0.01%. The RBF kernel was used in SVM, and C and γ were tuned in the range 10^-2 to 10^3 on a 10×10 grid with a sampling size of 8.
Two datasets were used in our study to evaluate the performance of the algorithm; their details are shown in Table 12.3. Both are pre-processed datasets from LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The performance of the improvement strategies was measured by computation reduction (including search reduction and evaluation reduction) and runtime reduction per generation. Search reduction refers to the percentage of search tasks saved compared to grid search; evaluation reduction refers to the percentage of evaluation tasks saved compared to GA without caching. As GA is a stochastic process, the number of generations needed for convergence may vary between runs, so the overall runtime is not appropriate for benchmarking. Instead, the runtime per generation is used to illustrate the effectiveness of the improvement strategies.
GA-SVM is time-consuming even for a small-scale dataset like Austrian: for 690 examples with 14 features, it took 1048 minutes (~17.5 hours) to complete 16 generations of GA-SVM. This slowness affirmed our determination to speed up GA-SVM for feasible application in practice.

Table 12.3 List of Datasets Tested with GA-SVM

Name # of Examples # of Features # of Classes


Austrian 690 14 2
Adult 1605 123 2

Fig. 12.2 Distributions of Runtime Per Generation w.r.t. MPI Nodes

Parallel GA can significantly speed up the traditional GA by distributing chromosomes to different parallel nodes. The reduction in population size cuts down the computational cost of all GA operations, especially SVM evaluation. Fig. 12.2 confirms this expectation in the experiment with the Austrian dataset: the runtime per generation decreased as the number of parallel nodes increased. Taking the average runtime into account, Fig. 12.3 plots the relationship between the average runtime per generation and the number of parallel nodes. The respective speedup gains for 2, 4 and 8 nodes were 2.01, 4.00 and 8.46, demonstrating a linear fashion of runtime reduction.
In the parallel SVM, the kernel computations are evenly distributed among multiple threads, and each thread is allocated to a processing core for concurrent execution. Fig. 12.4 shows how the SVM training time changes with a growing number of threads (cores) on the Adult dataset. The speedups for 2, 3 and 4 threads were 1.89, 2.60 and 2.88 (equivalent to 0.94, 0.86 and 0.71 per thread) respectively, and the plot exhibits an inverse-exponential fashion. That is the result of parallel overheads. In our design, the communication overhead had been minimized by data localization and a faster migration algorithm. The remaining overheads (such as thread start-up and termination) have a fixed cost and are independent of the data size. Therefore, a

Fig. 12.3 Parallel-GA average runtime per generation w.r.t. MPI Nodes

Fig. 12.4 Parallel-SVM training time w.r.t. number of threads

better performance can be expected on larger datasets, since the cost of these overheads would be amortized.
Evaluation caching avoids unnecessary evaluations of identical chromosomes in different generations. If the chance of a cache hit (i.e. a chromosome has been evaluated earlier and stored in the cache) is significant, the overall speedup is remarkable. Fig. 12.5 shows the percentage of evaluations saved by caching in the experiment. Frequent re-emergence of identical chromosomes was observed as a result of the low feature dimension and mutation rate: a cache hit rate of 76.25% was observed, leading to a 75.62% reduction in the average runtime per generation (from 61.42 to 14.97 minutes, a 4-times speedup). The performance of evaluation caching depends strongly on the re-emergence probability of identical chromosomes, which is mostly affected by the feature dimension. With the binary encoding of GA chromosomes, the total number of feature combinations is 2^n, where n is the feature dimension. When the feature dimension increases, the chance of hitting a previously evaluated chromosome drops rapidly. This phenomenon was confirmed in the experiment with the Adult dataset: as the feature dimension rose from 14 to 123 with the same population size, there were only 3 cache hits during 40

Fig. 12.5 Evaluation Reduction Distribution for Evaluation Caching (left: Austrian, right:
Adult)

Fig. 12.6 Search Reduction Distribution for Neighbor Search (left: Austrian, right: Adult)

Fig. 12.7 Integration of Four Improvement Strategies (Austrian) (left: standard GA-SVM,
right: HGA-SVM)

generations of GA. Considering the overhead incurred by caching, the results suggest that this strategy should be applied with caution to high-dimensional data.
Neighbor search improves on grid search for tuning C and γ by replacing the exhaustive search with neighbor sampling and heuristic search. Fig. 12.6 summarizes the search reduction of neighbor search in the experiments with the Austrian and Adult datasets. For the Austrian dataset, two independent runs of HGA-SVM were conducted, one with grid search and one with neighbor search. Using the same 10×10 grid and parameter range, grid search required 8000 SVM measurements per generation (80 chromosomes × 10 × 10 grid), while neighbor search measured only 1410 to 1524 times (80.95% to 82.37% reduction, 81.77% on average). Both runs of HGA-SVM found the same best fitness of 87.97% classification accuracy, but the run using neighbor search was 5.79 times faster per generation (61.42 min vs. 10.60 min). A similar observation was made with the Adult dataset: 79.10% to 84.70% search reduction by neighbor search, i.e. on average 5.76 times faster per generation.
Finally, the collective speedup of all four improvement strategies was evaluated, and a remarkable reduction in computational cost was observed. Fig. 12.7 shows the distributions of runtime per generation with the Austrian dataset: the average runtime per generation is reduced from 61.42 min to 0.46 min, ∼133 times.
In all the above experiments, an improvement in SVM learning accuracy over 10-fold cross validation was also observed (Fig. 12.8). The learning accuracy was enhanced by 3.74% - 8.10% as a result of feature selection.

Fig. 12.8 Improvement to Classification Accuracy

12.4 Conclusion
Our HGA-SVM illustrates an integrated approach that combines parallelization and heuristic techniques to lower the computational cost effectively. We have demonstrated the individual speedup gains from parallel GA, parallel SVM, neighbor search and evaluation caching, as well as their collective gain. Through feature selection, the learning accuracy of SVM was enhanced as well. Overall, we show that our HGA-SVM is useful in alleviating the computational cost while improving learning performance, allowing feasible application to larger and more complex data. In our future work, caching, cross validation, and more efficient heuristic techniques will be explored to further improve the current algorithm.

References
1. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin clas-
sifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning
Theory, pp. 144–152 (1992)
2. Cortes, C., Vapnik, V.: Support Vector Networks. Machine Learning 20, 273–297 (1995)
3. Aronszajn, N.: Theory of reproducing kernels. Transactions of the American Mathemat-
ical Society 68, 337–404 (1950)
4. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing Multiple Parameters
for Support Vector Machines. Machine Learning 46, 131–159 (2002)
5. Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification
(2003)
6. Mitchell, M.: An Introduction to Genetic Algorithms (1998)
7. Tanese, R.: Distributed genetic algorithms. In: Proceedings of the third international con-
ference on Genetic algorithms, George Mason University, United States, pp. 434–439.
Morgan Kaufmann Publishers Inc., San Francisco (1989)
8. Vapnik, V.N.: Statistical Learning Theory. Wiley Interscience, Hoboken (1998)
9. Foster, I.: Designing and Building Parallel Programs: Concepts and Tools for Parallel
Software Engineering. Addison-Wesley, Reading (1995)
10. Platt, J.: Sequential minimal optimization: A fast algorithm for training support vector
machines. Advances in Kernel Methods-Support Vector Learning, 185–208 (1999)
11. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001),
software available at http://www.csie.ntu.edu.tw/cjlin/libsvm
12. Dagum, L., Menon, R.: OpenMP: An Industry-Standard API for Shared-Memory Pro-
gramming. IEEE Computational Science & Engineering, 46–55 (1998)
13. Momma, M., Bennett, K.P.: A pattern search method for model selection of support
vector regression. In: Proceedings of the SIAM International Conference on Data Mining
(2002)
14. Houck, C.R., Joines, J., Kay, M.: A Genetic Algorithm for Function Optimization: A
Matlab Implementation, NCSU-IE TR, vol. 95 (1995)
15. Eaton, J.W.: Octave, http://www.gnu.org/software/octave/
16. Fernández, J., Anguita, M., Ros, E., Bernier, J.: SCE Toolboxes for the Development of
High-Level Parallel Applications. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A.,
Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3992, pp. 518–525. Springer, Heidelberg
(2006)
Chapter 13
Computation in Complex Environments:
Optimizing Railway Timetable Problems with
Symbiotic Networks

Kees Pieters

13.1 Introduction
The title of this contribution balances on two concepts. The first is ‘complex
environments’. ‘Complex’ should not be read in the colloquial sense of the word;
complexity addresses, amongst others, non-linear, contingent and ‘chaotic’ phenomena
([10], [11]). Many thinkers on complexity consider such characteristics, sometimes
called organized complexity, to demarcate a transition point where analytical
approaches are no longer feasible ([26]:18).
Put in another way, organized complexity moves away from traditional machines,
which have supreme performance for their intended tasks, but also require very sta-
ble and predictable environments. Rather, a line is drawn towards living organisms,
which are very robust and are better adjusted to contingent environments than ma-
chines are. Along this gradient, ‘robust machines’ form an interesting field of in-
quiry for optimization problems. Railway Timetable Problems (RTP) can be seen as
a benchmark for such robust machines (or algorithms).
The second concept, ‘symbiotic networks’, is introduced as an optimization strat-
egy that can, to some extent, optimize in such complex environments. RTP has been
a benchmark problem for symbiotic networks, and so this contribution does not fo-
cus on RTP in itself, but rather uses RTP to analyze the behaviour of symbiotic
networks in a practical setting.
This paper is outlined as follows: first, a ‘meta-perspective’ on optimization
processes will be drawn; then the RTP will be introduced as a complex environment.
The theory of symbiotic networks will be discussed, along with how this approach was
implemented to optimize RTP. Lastly, the various tests that have been carried out
with the simulation environment will be discussed.
Cornelis P. Pieters
Condast, Omloop 82, 3552 AZ, Utrecht, the Netherlands
e-mail: cees_pieters@wxs.nl

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 299–324.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

13.1.1 Convergence Inducing Process


Research on computational intelligence often tends to focus on the algorithms or
heuristics that aim to solve certain problems. This focus sometimes obfuscates the
fact that the algorithms are taken up in a larger process, which also contributes to the
eventual solutions. In the specific case of computational intelligence, the algorithms
that are used are often part of a specific pattern [1, 20, 21, 22] called a ‘convergence
inducing process’ ([5]:424-425), as depicted in Table 13.1.

Table 13.1 Convergence Inducing Process

Pattern      Convergence Inducing Process
Description  An actor samples an environment by an iterative cycle of testing
             and evaluation until a certain goal criterion has been met
a.k.a.       Problem Solver, Global Search
Notes        The actor typically contains a balanced mix between divergent
             processes (exploration) and convergent processes (optimization),
             which map the evaluated variables to the goal function. The latter
             processes tend to dominate (eventually)

Typically, this process combines a divergent process of exploration with a con-
vergent process of optimization, of which the latter is dominant. The actor, which
in the case of computational intelligence consists of the strategies that are deployed,
usually includes a securing mechanism which stores high-ranking values, and is able
to use these for further evaluation. Genetic algorithms, goal-directed agent systems,
and the learning phase of neural networks can all be considered instantiations of this
pattern ([6],[18]).
The algorithms or heuristics that implement the strategies determine how the
convergence-inducing process behaves. The pattern may thus take various specific
forms, depending on the problems that are tackled and the manner and quality of
the optimization that is aimed for. However, the essential property of this process
is always a form of feedback between actor —for instance a certain computational
algorithm— and data that are present in the actor’s environment, which is often
called the ‘problem domain’ in computational intelligence.
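The pattern of Table 13.1 can be sketched as a generic loop. This is an illustrative reading, not code from the chapter; the names (`evaluate`, `propose`, the toy target problem) are assumptions.

```python
import random

def convergence_inducing_process(evaluate, propose, x0, goal, max_iter=10_000):
    """Generic actor loop: iteratively test and evaluate candidate
    solutions until the goal criterion is met, securing the
    best-ranking value found so far for further evaluation."""
    best, best_score = x0, evaluate(x0)
    for _ in range(max_iter):
        if best_score <= goal:            # goal criterion met
            break
        candidate = propose(best)         # divergent step (exploration)
        score = evaluate(candidate)
        if score < best_score:            # convergent step (optimization)
            best, best_score = candidate, score
    return best, best_score

# Toy problem domain: minimize the distance to an unknown target value.
random.seed(1)
target = 42.0
evaluate = lambda x: abs(x - target)
propose = lambda x: x + random.uniform(-1.0, 1.0)
best, score = convergence_inducing_process(evaluate, propose, x0=0.0, goal=0.01)
```

The divergent step proposes new candidates; the convergent step (accepting only improvements) eventually dominates, exactly as the Notes row of the table describes.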

13.1.2 A Classification of Problem Domains


This ‘broader’ perspective on computational intelligence shows that there is often
a relationship between the process and the characteristics of the problem domain.
This, of course, is widely known by now. There are a number of ‘No Free Lunch’
(NFL) theorems that stress this relationship [27], and, for instance, the De Jong
functions provide powerful benchmarks against which computational intelligence can
be tested and (mutually) evaluated [7]. Some optimization strategies are also paired
with so-called deceptive functions, which can be seen as problem domains that
deceive the algorithms into optimizing towards poor (sub-optimal) results [14].
Complex problem domains can also be categorized as follows:

Table 13.2 Characteristics of Problem Domains

Property             Remarks
Familiar/Uncertain   In a predominantly uncertain environment, the process
                     incorporates only limited specific knowledge about the
                     problem domain for its optimization strategy. Often only
                     certain aspects of the problem domain are known, such as
                     its structure (tree, graph, etc.)
NP-hard (-complete)  Results in a combinatorial explosion of the search space
Linear/non-linear    Usually increases the uncertainty of trends and patterns,
                     such as curves
Static/non-static    In a static environment, the problem domain does not
                     change while it is being processed by the actor. A
                     non-static environment can be dynamic, stochastic or in
                     any other way changing while being processed
Reactive             Implies that the problem domain is influenced by the
                     actor. A reactive problem domain is always dynamic as well
Contingency          The environment can provide unexpected situations or
                     events that disrupt the process of optimization

The ‘Traveling Salesman Problem’ or certain job shop problems are examples of
familiar, NP-hard problem domains, while agent systems typically operate in reac-
tive and often highly nonlinear environments [28]. Strictly speaking, many neural
networks can be considered as operating in ‘familiar’ environments, as the patterns
that they learn during the training phase (design) are assumed to be present in the
problem domain once the network is in operation. In practical settings, most so-
lutions require a mix of these, in which many constraints and heuristics that are
specific for the problem domain are also designed. Note that the categories are
‘actor-centric’; the designer of the process may, and usually will, have more knowl-
edge about the environment than the actor itself.
Along the scales that are introduced with these characteristics, it becomes clear
that computational intelligence often combines intelligence ‘by design’ and ‘true’
forms of computational intelligence. ‘Designed’ intelligence includes the implicit
assumptions about a chosen solution strategy, the type of solution that is chosen
(GAs, neural networks, etc.) and ‘tweaking’ of the algorithms by the designer.
Problem domains that are predominantly uncertain, NP-hard, non-linear, reactive
and contingent are amongst the most difficult to tackle. These form the charac-
teristics of a complex environment. A convergence inducing process operating in

these domains usually cannot be tailored for a specific environment, but will need
to find a near-optimal solution in finite time in a problem domain that continuously
changes, and which may react to the process itself. A robust algorithm will try to op-
timize, regardless of the conditions the environment imposes on it, and a good robust
optimizer may even use these conditions to its advantage.
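As an illustration, the characteristics of Table 13.2 can be recorded as simple flags. The class and the two example domains below are assumptions for illustration, with the RTP flags following the discussion in the text.

```python
from dataclasses import dataclass

@dataclass
class ProblemDomain:
    """Characteristics from Table 13.2 (field names are illustrative)."""
    uncertain: bool
    np_hard: bool
    nonlinear: bool
    non_static: bool
    reactive: bool
    contingent: bool

def is_complex_environment(d: ProblemDomain) -> bool:
    # A domain that is predominantly uncertain, NP-hard, non-linear,
    # reactive and contingent is a complex environment in the text's sense.
    return all([d.uncertain, d.np_hard, d.nonlinear, d.reactive, d.contingent])

# RTP with delays fulfills all criteria; the classic TSP is familiar and static.
rtp = ProblemDomain(uncertain=True, np_hard=True, nonlinear=True,
                    non_static=True, reactive=True, contingent=True)
tsp = ProblemDomain(uncertain=False, np_hard=True, nonlinear=False,
                    non_static=False, reactive=False, contingent=False)
```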
There is one specific class of problems that provides an interesting point of refer-
ence for optimization in these severely complex problem domains: railway timetable
and -pathing problems [3, 8, 12, 13, 17].

13.2 Railway Timetable Problems


The typical railway timetable problem (RTP) revolves around the question of which
set of departure times (timetable) of trains causes no conflicts between trains while
they are traveling on a railroad infrastructure. The infrastructure (the problem do-
main) is a resource-constrained network; trains tend to share tracks at some point,
and an ideal timetable would prevent trains from hindering each other during ser-
vice. The railway timetable problem can be extended by including other variables,
such as waiting time for passengers, the availability of carriages and personnel at
stations at a given time, and so on.
The form of the railroad infrastructure plays an important role in the complexity
of the problem domain. In some countries the network can develop into a tree
structure, which reduces the complexity significantly. Many densely populated
countries with old railway infrastructures, such as the Netherlands, have to work
with dense networks that contain many cycles.
For an extensive overview of RTP, see ([12]:37-58) and [3]. This contribution
will focus on the ‘simpler’ variant that optimizes on conflicts alone.
Besides the infrastructure, usually some other aspects of the problem domain
are known as well, such as the allowed speeds of the trains in different situations,
the minimal (and often also the maximum) waiting times at stations and so on. A
railway timetable can also be further constrained by specific demands of the services
that are offered. For instance, a certain trajectory may require trains to travel with
fixed intervals, or at least a number of times within a certain time frame (e.g. four
times an hour). This is called a cyclic, or periodic, railway timetable, which is known
to be a special form of a Periodic Event Scheduling Problem (PESP) ([13]:10-12).
PESP is known to be NP-hard and therefore a challenge in itself even without the
dynamic aspects.
Generally speaking, most solution strategies meticulously record these
characteristics in graphs and process them, which usually results in computationally
intensive solutions. As a result, to date few solutions have been developed this way
that can optimize a full service of trains on a full infrastructure. Most research
therefore concentrates on describing one or a few tracks [3], or focuses on train
movement around stations [8].

Train scheduling in practice therefore still relies on human experts [29], although
some supportive tools have been developed to assist in generating timetables for a
full service of trains on a full infrastructure ([13]:15-16).
Another approach can be taken where optimization is performed on a full in-
frastructure, and where the algorithm has to optimize without prior knowledge of
the problem domain. In this case, the number of conflicts will be measured as a
function of a given timetable (see Figure 13.1). This approach makes the problem
domain predominantly uncertain, NP-hard, non-linear and reactive. If the optimiza-
tion has to account for delays of trains, or other sources of contingencies, then RTP
fulfills all the criteria for a complex environment. In this case, efficiency is not the
only criterion for the optimization strategy, but robustness also becomes an impor-
tant aspect.

Fig. 13.1 RTP as a convergence inducing process

In effect, the trains just follow their intended trajectories (in a simulated envi-
ronment), while the optimization strategy measures the conflicts and adjusts the
timetables ‘on-the-fly’. The departure times are the only degrees of freedom that the
problem solver has, but changing them also changes the conflicts (reactivity).
Intuitively it is clear that train timetable problems cannot be resolved entirely
through optimization of the departure times. The railroad infrastructure may be too
restrictive, for instance when two trains are heading towards each other on a single
track. Most existing railroad infrastructures will be sufficiently extensive to avoid
these infrastructural limitations.
A second source of conflicts is related to the number of trains in service at a given
time. The potential for conflicts increases with the number of trains that use the
infrastructure. At a certain critical threshold, the load on the railroad infrastructure
becomes so high that conflicts can no longer be resolved. This is another boundary
of the RTP.
A practical description of RTP therefore is as follows:
Given a certain rail infrastructure, and given a certain required service of trains, can
the infrastructure provide this service without conflicts?
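The goal function measured in Figure 13.1, the number of conflicts as a function of a given timetable, can be sketched as follows. The route format, train names and segment names are invented for illustration; a conflict is counted whenever two trains occupy the same segment during overlapping time intervals.

```python
def count_conflicts(timetable, routes):
    """Number of pairwise conflicts: two trains occupying the same track
    segment during overlapping time intervals. `timetable` maps train ->
    departure time; `routes` maps train -> [(segment, t_in, t_out)] with
    occupation offsets relative to departure."""
    occupancy = []  # (segment, start, end, train)
    for train, dep in timetable.items():
        for seg, t_in, t_out in routes[train]:
            occupancy.append((seg, dep + t_in, dep + t_out, train))
    conflicts = 0
    for i in range(len(occupancy)):
        for j in range(i + 1, len(occupancy)):
            s1, a1, b1, tr1 = occupancy[i]
            s2, a2, b2, tr2 = occupancy[j]
            if s1 == s2 and tr1 != tr2 and a1 < b2 and a2 < b1:
                conflicts += 1
    return conflicts

# Two trains sharing segment B-C; only the departure times are free.
routes = {"IC1": [("A-B", 0, 10), ("B-C", 10, 25)],
          "IC2": [("B-C", 0, 15)]}
print(count_conflicts({"IC1": 0, "IC2": 5}, routes))   # prints 1: B-C overlap
print(count_conflicts({"IC1": 0, "IC2": 30}, routes))  # prints 0: conflict resolved
```

Shifting a departure time changes which intervals overlap, which is exactly the reactivity described above: the only degrees of freedom also reshape the conflict landscape.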

One attempt to automatically generate an optimal rail timetable for a full railroad
infrastructure (a model of the Dutch railroad infrastructure) used so-called symbiotic
networks to resolve this [16, 17, 19].

13.3 Symbiotic Networks


Like many forms of evolutionary computational intelligence, symbiotic networks
have been inspired by a phenomenon known in nature, namely symbiosis (or mutu-
alism). As a reference, a possible repertoire of interactions between agents is listed
in a pattern called actor/co-actor (see Table 13.3).1

Table 13.3 Actor/Co-actor Pattern

Pattern      Actor/Co-actor
Description  An actor can interact with others according to a number of
             strategies. The notion of perceived ‘benefit’ is assigned to these
             interactions between actor, the entity initiating the interaction,
             and the co-actor, the entity that is subject to the interaction.
             This could, for instance, be according to the following table:

             Name          Benefit actor   Benefit co-actor   a.k.a.
             Co-existence  0               0
             Competition   +               0/-                Adversarialism
             Parasitism    +               -/0/+
             Mutualism     +               +
             Altruism      -               +
             Symbiosis     0/+             0/+
             Synnecrosis   -               -                  (e.g. spite)

Notes        A further distinction could be made in predator-prey relationships
             (the actor is predator), which means that the co-actor is the
             resource that the actor aims to acquire. Also a ‘vertical’
             interaction where the actor is a participant of a co-actor
             (aggregate form) is an interesting special case.

Of all the interaction patterns between biological organisms, such as competition
and altruism, symbiosis is the only one that, to date, has not been elucidated
by mathematical means. The initial research therefore aimed to make a model that
by mathematical means. The initial research therefore aimed to make a model that
would demonstrate the minimal conditions in which autonomous agents, which do
not necessarily rely on each other, engage in a relationship of mutual dependency
[15, 16]. It will be clear that this premise already biased the model towards an agent network.
The eventual network however proved to have a relationship with neural net-
works, as the convergence criterion for symbiotic networks was very similar to the
Rosenblatt convergence criterion for perceptron neural nets [25]. Basically a sym-
biotic network can optimize (or ‘learn’) one pattern, whereas a neural network can
learn multiple patterns. A symbiotic network however, like most agent systems, does
not have a distinct learning cycle. A symbiotic network ‘learns by doing’.
1 The original name of the pattern was ‘actor-actant’, but has been renamed because of the
specific use of ‘actant’ in the social sciences.

Practical tests showed that symbiotic networks are particularly interesting in dy-
namic environments. If a number n of agents collaborated in solving a certain task,
the complexity of O(n³) proved to be relatively poor in static environments, such
as when the Traveling Salesman Problem was taken as a benchmark [15].
article in a Dutch newspaper on railway timetable problems provided the ideal ex-
perimental environment for symbiotic networks. A model of the Dutch railway in-
frastructure was made, in which various optimization strategies were implemented
and compared. It was experimentally demonstrated that the complexity of the
solution remained approximately O(n³) for RTP.
First, the theory behind symbiotic networks will be given some attention. As was
mentioned earlier, the initial research aimed to find the minimal requirements that
agents would need to engage in a symbiotic interaction. Symbiosis is seen here as
a mutually beneficial pattern of interaction. According to actor/co-actor, mutual-
ism will also conform to this pattern, but mutualism is generally associated with a
parasitic interaction that turns out to be mutually beneficial [9]. Symbiosis in the
way it is used here rather starts from co-existence, which develops into symbio-
sis under certain conditions. The same applies to more co-operative strategies, such as
CCGAs [23, 24], swarming algorithms and ant colony algorithms [2], although these
solutions are usually designed to be co-operative. Symbiotic networks have to learn
this, given a certain overall (predetermined) goal.
As a first crude description, one could say that the agents in a symbiotic network
provide some sort of service that benefits the others. This is usually the implicit
assumption of symbiosis or mutualism, which comes down to a form of ‘I’ll scratch
your back if you’ll scratch mine’. However, the model that was developed shows
that this elementary form of feedback is insufficient to create a stable relationship.
A specific kind of communication is required to negotiate the need for the services
through the network.
Symbiosis assumes that the benefit, which is mathematically represented by a
goal function, is provided by the other participants in the network and so the agents
are encouraged to maintain and optimize the relationship once this benefit is ‘de-
tected’. In symbiotic networks, this optimization is considered to be the result of an
unusual feedback loop that is established once the agents are in each other’s sphere
of influence. This loop is achieved by an ability of the agents to communicate their
needs (through so-called stress-signals) and that they are able to change their behav-
ior, based on these stress-signals. It is a bit like a parent and her baby; when the baby
starts crying, the parent stops doing whatever she was engaged in and addresses the
baby’s needs.
In a way, some agents are sensitive to certain contingencies in, or require resources
from, the environment, while other agents have the means to address them.
It is clear that competitive or altruistic approaches will not be a preferred form of in-
teraction between these agents in such situations. Parasitic or (other) invasive strate-
gies might work if agents know beforehand which others to target. However, it is
not always known which teams should be formed. Symbiotic networks, to some
extent, are able to figure this out by learning patterns in the stress signals that are
communicated.

Fig. 13.2 In a symbiotic network, the problem domain is ‘folded in’ the network

The model presented in this paper considers the environment of the system to be
an integral part of the network (Figure 13.2) [4]. This approach is similar to that of
Pnuelian reactive systems and agent-based systems [28].

13.3.1 A Theory of Symbiosis


The environment is a vector E = {E0, . . . , Ek} of sub-environments Ei, or
neighborhoods². For reasons of descriptive clarity, a dotted line marked with the
token Ei will be used to denote the neighborhood from which the entity obtains its
input or releases its output. Such a neighborhood usually has a dimension, such as
force, temperature, velocity, etc. The model further starts from the premise of a
primary transformation process of an input signal to an output signal. The response
of this process is:

Oi = μi · Ii    (13.1)
Every agent is connected to the environment through a pair of neighborhoods {Ei ,
Ei+1 }, that may have different dimensions. Agents are connected through these
neighborhoods. For now, the following is presupposed:

Fig. 13.3 Primary Transformation Process


2 With the inevitable progression of insight, currently ‘surroundings’ is preferred for the
interaction space of an agent.

Ii+1 = Ei+1 · Oi (13.2)


Apart from this, the response of the neighborhoods cannot be controlled and may
even be unknown. Each agent is expanded with the following properties:

- A goal gi
- A dimensionless stress function si (Ii , gi ) (13.3)
- Symbiotic behavior μi (s0 , s1 , . . . , sn ) (13.4)
This results in the following agent, a symbiot (Figure 13.4):

Fig. 13.4 A symbiot emits a stress signal and is able to change its behavior as a function of
the stress signals in the network

The goal is associated with the input of the entity, which reflects for instance
a biological organism’s need for food. When these symbiots are connected, a very
simple network is formed (Figure 13.5).
In this figure, symbiot1 is the successor of symbiot0 through the environment.
At convergence, symbiot0 should be able to support symbiot1 in reaching its goal
I1 = g1 . This means that the following will apply:
I1 = g1 = O0 · E1 ⇒ O0 = g1 / E1    (13.5)

Fig. 13.5 Minimal Symbiotic Network



According to (13.1), this results in:

O0 = μ0 · I0 ⇒ μ0 = g1 / (E1 · I0)    (13.6)

The behavior μ0 is determined by the stress signals and should converge to a situa-
tion where I = g, or I0 = g0 and I1 = g1 . In such a situation (13.6) becomes

μ0 = g1 / (g0 · E1)    (13.7)

It is clear that if this applies for all symbiots in the system, one could say that the
system has mapped its behavior μ = [μ0, μ1] against its environment and its goals.
This is interesting when the environment is unknown, for a converged system
can provide information about the environmental relationships between the various
probes that the system has put in the environment. Note that in this situation the
inputs (i.e. the goals) should never be allowed to become zero as this would impair
convergence.
If the symbiots are connected in such a way that E2 = E0 , then a situation of
mutual benefit has been formed. This results in symbiosis according to the pattern
of actor/co-actor.
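A minimal numeric sketch of the network of Fig. 13.5 (all parameter values are arbitrary assumptions): symbiot 0 adapts its behavior μ0 from the successor's stress signal, under the simplifying assumption that its own input is already at its goal (I0 = g0). At convergence, μ0 approaches g1/(g0 · E1), as in (13.7).

```python
# Minimal symbiotic network: symbiot 0 services symbiot 1 through
# neighborhood E1, driven only by the successor's stress signal.
E1, g0, g1 = 2.0, 1.5, 3.0
rho, delta = 1.0, 0.1
mu0, I0 = 0.0, g0            # assumption: symbiot 0's input is at its goal

for _ in range(200):
    I1 = E1 * mu0 * I0        # (13.2) with O0 = mu0 * I0, as in (13.1)
    s1 = rho * (g1 - I1)      # stress of the successor, as in (13.11)
    mu0 += delta * s1         # behavior change driven by stress

# At convergence mu0 -> g1 / (g0 * E1), here 3.0 / (1.5 * 2.0) = 1.0
print(round(mu0, 4), round(g1 / (g0 * E1), 4))  # prints: 1.0 1.0
```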
Convergence of the symbiotic system takes time. A system has converged when:

lim(t→tc) Δμi ⇒ 0    (13.8)

Δ is the change over a certain amount of time and the convergence time tc is the
time when the system has converged. Suppose the behavior μi changes according to
a certain algorithm f μi (s0 , s1 , . . . , sn ):

Δμi = μi^(t+1) − μi^(t) = fμi(s0, s1, . . . , sn)    (13.9)

Here, t stands for iteration step in time. At convergence the following should apply:

f μi (s0 , s1 , . . . , sn ) = 0 (13.10)

Adding this so-called symbiotic algorithm to the symbiot results in figure 13.6.

Fig. 13.6 Symbiot



The stress signal si (Ii , gi ) reflects the aim that one has when applying the network
to a specific problem, and should be zero once the goals have been achieved. The
network aims to achieve I = g. Take, for instance, the following stress function:

si (Ii , gi ) = ρ (gi − Ii ), ρ > 0 (13.11)

In this equation, the stress will increase the further the input edges away from the
goal function. The factor ρ is used to make the stress signal dimensionless and to
normalize its value, for instance between ⟨−1, 1⟩. In this case, at convergence:

lim(t→tc) si = lim(t→tc) ρ(gi − Ii) = 0    (13.12)

Convergence should ideally never take an infinite amount of time. However,
the convergence time tc also poses restrictions on the goals of the symbiot. If these
goals are functions of time, then it will be clear that, ideally, during optimization a
goal should not change and the input signal should be stable, as the system would
otherwise possibly never be able to converge. This also applies to the environment
E. However, practically this cannot be assumed, and thus the system will be in
continuous flux.
Besides this, the various signals are limited to minimum and maximum ranges in
practical systems. If the behavior or the stress signals of the symbiots run into these
limits, then the system may converge to a state where I ≠ g.
So far, it has been assumed that convergence has taken place. The question,
however, is what criteria allow the symbiotic network to converge to a situation
where I ≈ g.
Convergence
Consider a network of n + 1 symbiots with:
• an environment E = [E0 , E1 , . . . , En ]
• an input vector I = [I0 , I1 , . . . , In ]
• a goal vector g = [g0 , g1 , . . . , gn ]
• an output vector O = [O0 , O1 , . . . , On ]
• a stress vector s = [s0 , s1 , . . . , sn ]
The goal vector is constant within the convergence time tc. For each symbiot i, the
following applies at a given time t:

Oi^t = μi^t · Ii^t    (13.13)

μi^(t+1) = μi^t + δ · f(s),  δ > 0,  f(s) is dimensionless    (13.14)

si^t = ρ(gi^t − Ii^t),  ρ > 0,  si is dimensionless    (13.15)

δ and ρ are included in (13.14) and (13.15) to scale the dimensionless factors f (s)
and si to the dimensions of Ii and μi . For now, each neighborhood is defined as
follows:

Ij = Ej · Oi,  Ej > 0 ∀ i, j    (13.16)

Now suppose an initial situation where I ≠ g and the following relations apply:

f(s) ≥ 0, if ∑(i=0..n) si ≥ 0;  f(s) < 0, if ∑(i=0..n) si < 0    (13.17)
Because of (13.14), μi will increase or decrease due to (13.17). The same will also
happen to Oi due to (13.13), (13.14) and (13.16). But the opposite will occur due
to (13.15). This process will therefore repeat itself until ∑(i=0..n) si = 0. The time tc
that this convergence takes depends on the values of the elements of s, and therefore
on the values of I (13.15) and E (13.16). This means that the environment affects
the convergence time of the network.
The convergence criterion shows that any symbiotic algorithm that complies with
the constraints of (13.17) will cause the system to converge, and that every symbiot
in the network should be connected to at least one other through the environment.
The convergence criterion also shows that I = g is only one of the possible solutions.
Depending on the symbiotic algorithm, averages of the various (gi − Ii) functions
can also cause convergence. This will result in premature convergence, (gi − Ii) = ei,
where ei is not equal to zero.
The problem of premature convergence is similar to limitations in pattern match-
ing that are recognized in neural networks [5]. The n symbiots as a whole form
an n×n matrix (n symbiots that are ‘listening’ to n stress signals, including their
own) that is dynamically altered by the stress vector s. This results in a number of
eigenvectors, which stand for solutions of s where the symbiotic algorithm is zero.
The feedback loop that is constructed through the environment causes the system
to converge to one of them. However, the eigenvector still represents a whole range
of solutions of s, of which only (s = 0) is desired. The challenge of the system is
to approximate this ideal convergence. This will be discussed further in the next
section.
The neighborhood influences the convergence process through its sign. Ei could
be a negative function in (13.16), in which case convergence is still possible, pro-
vided that the signs of the appropriate si+1 are inverted in the symbiotic algorithm
that is used.
Up to now a very rigid goal criterion has been used, namely one where Ii = gi . In
nature, the survival goals are often less strict and could be, for instance, Ii ≥ gi (e.g.
food). This would translate to the stress signals as follows:
si(Ii, gi) = ρ(gi − Ii) if Ii < gi;  0 if Ii ≥ gi    (13.18)

This choice increases the solution space of the network significantly, as a whole
range of input vectors I lead to a situation where s = 0. This allows a much larger

portion of the solution space to give adequate results, leading to efficient systems
that use very simple symbiotic algorithms.

13.3.2 Premature Convergence


As a symbiotic network is intrinsically embedded in its environment, the latter will
partially determine the network structure. This unpredictable aspect influences the
network in two ways:
1. By the transformation of an output signal of one symbiot to the input of another.
This is determined by the environment E = [E0 , . . . , En ].
2. By the connection scheme, or structure of the network.
The convergence criterion showed that the first issue mainly concerns the sign
of each transformation, and furthermore that the values of E affect the convergence
time tc of the entire network.
The connection scheme, which deals with the question of which and how many
successors a symbiot has, deserves further scrutiny, as it influences premature con-
vergence, where I ≠ g when f(s) = 0. A successor of a certain symbiot i is a symbiot j
whose input is either connected to the output of symbiot i through a neighborhood
(immediate successor) or through a number of symbiot-neighborhood pairs, as
depicted in Figure 13.7. This figure shows a parallel construction, where one symbiot
services two immediate successors, as well as a branch, a sequence of symbiots. Both
constructions have implications for the convergence of the network.

Fig. 13.7 Extended Symbiotic Network

Parallel connections of symbiots can impair ideal convergence. It is clear that if
a symbiot services multiple successors, convergence to all their goals is only pos-
sible if there is overlap between them. The symbiot will never be able to serve its
successors if they have contradicting goals. In such a case the network will opt for
an average, the upper, or the lower bounds of their goals, depending on the symbi-
otic algorithm that is chosen. These limitations are intrinsically determined by the
way the symbiots are connected and the goals they have. There is no way that a
symbiotic algorithm can work around these limitations.
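This limitation can be demonstrated numerically (all values below are assumptions for illustration): one symbiot with a fixed input services two immediate successors with contradicting goals through identical neighborhoods, using a plain summing symbiotic algorithm f(s) = s1 + s2. The network settles on the average of the two goals rather than on either one.

```python
# One symbiot, two immediate successors with contradicting goals.
g1, g2 = 2.0, 6.0
E1 = E2 = 1.0
I0 = 1.0                  # fixed input of the servicing symbiot
mu0, delta, rho = 0.0, 0.05, 1.0

for _ in range(500):
    I1, I2 = E1 * mu0 * I0, E2 * mu0 * I0   # both inputs driven by the same mu0
    s1, s2 = rho * (g1 - I1), rho * (g2 - I2)
    mu0 += delta * (s1 + s2)                # summing symbiotic algorithm

I1 = E1 * mu0 * I0
print(round(I1, 3))       # prints 4.0: the average (g1 + g2) / 2, not g1 or g2
```

The stress sum s1 + s2 is zero at the average, so by (13.17) the network has converged even though neither goal is met: premature convergence by construction.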

Branches in a network can also cause premature convergence through propaga-
tion. In a branch, a change in the behavior of one symbiot results in changes to the
input of its immediate successors, and their immediate successors and so forth. If
such sequence of successors contains amplifying elements, then a small adjustment
at the start of the branch can cause enormous fluctuations just behind the amplifi-
cation. This amplification is transferred to the stress signals and results in a situation
where the contributions of symbiots further down the chain are much higher than
those of the immediate successors. They ‘out-shout’ not only the immediate succes-
sors, but also other stress signals in the network. Therefore, they impair correct
functioning of the network. Note that amplification occurs when a goal value of a
symbiot is lower than that of a successor.
For instance, take a symbiotic algorithm of the following type:

f(s) = w_0 · s_0 + · · · + w_n · s_n = ∑_{i=0}^{n} (w_i · s_i)    (13.19)
It is clear that a major issue for such a symbiotic algorithm is to ‘decide’ which
stress signals should get which weights. A symbiot can only serve its (immediate)
successors, and therefore the symbiotic algorithm should ideally be constructed to
discard the stress signals of other symbiots; these contributions can only
pollute useful information. It is like trying to hear your children call out on a busy
playground. The dilemma a symbiot faces is that its successors are determined
by the environment and are therefore not necessarily known.
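The weighted sum in (13.19) can be sketched directly; the function and the example weights below are illustrative, not taken from the original implementation.

```python
def aggregate_stress(weights, signals):
    """Weighted stress aggregation as in (13.19): f(s) = sum of w_i * s_i.

    Ideally, signals from non-successors carry weight 0, so that they
    cannot pollute the information coming from the (immediate) successors.
    """
    if len(weights) != len(signals):
        raise ValueError("one weight per stress signal is required")
    return sum(w * s for w, s in zip(weights, signals))

# A symbiot with two immediate successors (weight 1) and two unrelated
# symbiots whose stress signals are discarded (weight 0):
f = aggregate_stress([1.0, 1.0, 0.0, 0.0], [0.4, -0.1, 0.9, -0.7])
# only the successors' signals contribute (f is close to 0.3)
```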
If a symbiot knows what its immediate successors are, and there is overlap in
their goals, then it can concentrate on optimizing these goals. This would result in
a network where ideal convergence can take place. Therefore, a perfect symbiotic
algorithm will be able:
1. to identify the immediate successors, and
2. to set their corresponding weights to nonzero values.
The weights of other stress signals should ideally be zero, as their sum totals zero at
convergence anyway. Besides this, they should ideally not influence the successors’
operation.
These issues were investigated in an experimental setting, which confirmed these
observations [15]. One of the most interesting algorithms proved to be a variant of
(13.19) in which the weight factor was determined by a Hebbian learning rule:
w_i^t = w_i^{t−1} + ρ · s_i ,    ρ ∈ (0, 1)    (13.20)

The algorithm also included a forgetting rule in order to compensate for erroneous
learning during the initial phase of the convergence process. The resulting network
converged almost ideally in situations where branches had decreasing goals (no am-
plifications). Apparently the network is able to ‘learn’ its (immediate) successors.
When amplifications were included (random goals), the system converged to the
same level as an averaging algorithm, a variant of (13.19) where:
13 Computation in Complex Environments 313

w_i = w  ∀i ,    w ∈ (0, 1).


This particular configuration gives the network traits of a neural network, in which
the algorithms learn patterns of communication in the stress signals.
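A minimal sketch of the Hebbian update (13.20) combined with a forgetting rule follows. The exact form of the forgetting rule is not given in the text, so the multiplicative decay used here is an assumption.

```python
def hebbian_update(weights, signals, rho=0.1, decay=0.01):
    """One step of (13.20): w_i^t = w_i^{t-1} + rho * s_i, rho in (0, 1).

    The multiplicative decay stands in for the forgetting rule; its exact
    form is not specified in the text, so this is an assumption.
    """
    assert 0.0 < rho < 1.0
    return [(1.0 - decay) * w + rho * s for w, s in zip(weights, signals)]

# Repeated exposure to the same stress pattern: the weight tied to the
# persistently active signal grows, while the silent one stays at zero,
# which is one way the network can 'learn' its immediate successors.
w = [0.0, 0.0]
for _ in range(50):
    w = hebbian_update(w, [1.0, 0.0])
```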
At this point, the environment has also become less specific than initially
depicted in (13.2) and (13.16). All that is now required is that a symbiot be
able to service its successors in whatever way possible. If this is the case, then the
network will try to optimize. As the environment influences the stress signals, the
traits of the environment are taken up in the patterns that are learned. This does
not mean that the system will converge ideally, but only that the system will do its
best given the means and limitations of the problem domain. The symbiots ‘fold around’
neighborhoods and try to address the contingencies through the feedback loops they
form, and this contributes to the robustness of the algorithm. The challenge then is
to see which symbiotic algorithms are best suited to guide the convergence-inducing
process.

13.4 Symbiotic Networks as Optimizers


The theoretical model of symbiotic networks may have obscured the fact that, in or-
der to use a symbiotic network as an optimizer, only a relatively simple heuristic needs
to be implemented. Every agent in the network must be able to monitor stress signals,
and it must be able to change its behaviour so that the overall stress decreases.
At first glance this seems problematic, for the change in the global stress signals does
not (necessarily) depend on the temporal change of an individual agent. However,
as all the agents are trying to achieve the same thing, the network will try to optimize
globally. The quality of the optimization is another matter, of course, but it was hy-
pothesized that the dynamics of the network might actually support optimization to
some extent, as they might keep the network from getting trapped in a local mini-
mum. Besides this, as more stress signals are resolved, the network can concentrate
more and more on optimizing the remaining ‘pockets of resistance’. This hypothesis
needed experimental verification, which was carried out with RTP.
When symbiotic networks are used as optimizers, the manner of optimization
becomes relatively simple. Since the stress signals both measure the effectiveness
of the co-operation the agents engage in and monitor events from the environment,
optimization is a matter of trying to minimize the stress signals, albeit with the
caveat that these signals partially reflect an environment that can disrupt the
optimization process.
Therefore, for many practical optimization problems, a symbiotic network may
be configured to be continuously in operation, like a living cell, but it can also
terminate when a certain condition has been met, such as:

‖s‖ < ε ,    ε > 0    (13.21)
It is also worth pointing out that, in principle, there are no parameters that need to
be ‘tweaked’; the multiple feedback loops in an n-agent network are the principal
operators of the optimization process.
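The termination test (13.21) amounts to checking the norm of the collected stress vector; a direct sketch, where the tolerance value is an arbitrary illustration:

```python
import math

def converged(stress, eps=1e-3):
    """Termination condition (13.21): stop when ||s|| < eps, with eps > 0."""
    return math.sqrt(sum(s * s for s in stress)) < eps

# the network keeps running while significant stress remains ...
assert not converged([0.5, -0.2])
# ... and may terminate once all stress signals are close to zero
assert converged([1e-4, -2e-4])
```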

This does not mean that specific heuristics or optimization algorithms cannot
improve the overall convergence when applied to a specific problem. A number of
these ‘designed’ interventions will return in the specific case of RTP, which will be
discussed next.
Globally, configuring RTP as a symbiotic network means simulating the railway
infrastructure and the intended services of the trains. The simulation lets the trains
travel their trajectory, and every time a conflict between trains occurs, a stress signal
is generated. The actual optimizing layer collects the stress signals and generates a
new timetable, which defines the ‘behaviour’ of the symbiots. The symbiotic algo-
rithms that are used determine how the stress signals modify the timetable. Most of
the strategies implemented here change the departure time of a train a minute earlier
or later per optimization cycle, and keep track of whether a previous change resulted
in an improvement. Thus, local optimization (of individual trains) should result
in global optimization. The overall flow chart is given in Figure 13.8. This approach
will be discussed in greater detail next.

Fig. 13.8 Basic Flow Chart of RTP Configured as a Symbiotic Network
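The cycle of Figure 13.8 can be sketched roughly as below. The railway simulation itself is abstracted into a `simulate` callback, and the one-minute shift heuristic with direction tracking is a plain reading of the strategy described in the text; all names and details are illustrative.

```python
def optimize_timetable(timetable, simulate, cycles=100):
    """Shift each conflicting train's departure one minute per cycle.

    `timetable` maps train -> departure time (minutes); `simulate` maps a
    timetable to {train: stress}, with 0.0 meaning no conflicts. A train
    keeps its shift direction while its stress does not grow, and reverses
    it otherwise, so local improvements should add up globally.
    """
    direction = {t: +1 for t in timetable}      # +1 = later, -1 = earlier
    prev = simulate(timetable)
    for _ in range(cycles):
        for t, s in prev.items():
            if s != 0.0:
                timetable[t] += direction[t]    # one-minute adjustment
        cur = simulate(timetable)
        for t in timetable:
            if abs(cur[t]) > abs(prev[t]):      # change made things worse
                direction[t] = -direction[t]
        prev = cur
        if all(s == 0.0 for s in cur.values()):
            break                               # conflict-free timetable
    return timetable
```

With a real railway simulation plugged in for `simulate`, each pass of the loop corresponds to one optimization cycle in the flow chart.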

13.5 Trains as Symbiots


One of the major problems of a dense railway infrastructure is that trains put a de-
mand on scarce resources, i.e. the tracks, switch points, platforms of railway stations
and so on. Trains traveling along different trajectories with different speed, some-
times may find other trains ahead of them traveling much slower. On single-track
trajectories there is a risk of running into a train coming from the opposite direction.
There are only a limited number of places where such conflicts can be resolved.
Railway stations often have a number of extra tracks that allow faster trains to
overtake slower ones, and switch points can cause the trains to take different direc-
tions. This contributes to the highly dynamic nature of such conflicts, and makes it
a complex problem to tackle.
As was mentioned earlier, optimizing the departure times of the trains is one
of the most cost-effective means of improving the capacity of the rail net. As the
conflicts are a function of these departure times, there should be a timetable that
results in a minimal number of conflicts, and preferably none, of course.
A symbiotic network was implemented to generate a timetable with a minimal
number of conflicts for a model of the Dutch rail net. The Netherlands Railways

(NS) operates a (daily) cyclic rail timetable, so trains depart at fixed intervals (for
instance twice or three times an hour) [12, 13].
Before continuing, it may be helpful to give a few definitions:
• A timetable is a set of departure times for all the trains that are required to travel
on a certain rail infrastructure during a certain time frame, for instance daily.
• A trajectory is a service that one or more trains are required to travel; it runs
from a certain station (origin) to a destination. In between, the trains usually stop
at one or more stopover stations.
• A (train) schedule is the schedule assigned to a trajectory. This includes
the number of trips during a certain period (e.g. four times per hour) and the
departure times.
• A trip is the service that a certain train is actually carrying out at some point.
RTP has been used as an environment to investigate the behaviour of a symbiotic
network in a practical setting. The focus was not on generating actual timetables.
The research did confirm that the network does optimize, and therefore could, in
principle, be used to generate real timetables.
The experiments were conducted by running a computer simulation of the Dutch
railway system with the different symbiotic algorithms and under different condi-
tions. Each run consisted of 30,000 iterations, where an iteration step represents a
‘real time’ of 15 seconds. All the experiments initially select a random departure
time within the range of a trip.

13.5.1 Trains in Symbiosis


The previous discussion of symbiotic networks has suggested that a practical
application basically consists of three layers that interact with each other:
• The environment, which consists of the railway infrastructure and the rules and
constraints that apply, such as maximum speed, waiting times at stations, etc.
• The trains, which are considered the active optimizing entities in this network, as
the departure times of these trains can be changed.
• An optimizing layer, which consists of the symbiotic algorithms that are used.
In the particular solution that was developed (other strategies are also possible), the
trains ‘collect’ stress signals based on the conflicts they encounter. These are passed
to the optimizing layer, which uses the stress signals to manage the train schedules.
This alters the departure times of the trains, which is the ‘change of behavior’ that
symbiotic agents need to be capable of. The departure times thus depend on the
stress signals.
While the environment is domain-specific, the optimizing layer does not depend
on the problem domain. The optimizing layer also has a sort of ‘plugin’ structure
for the symbiotic algorithms. This way different types of algorithms can be tested
and compared.

Fig. 13.9 Dutch Rail Infrastructure and Global Software Architecture

13.5.2 The Environment


For this research, an environment was created in software that simulates a heavily
constrained version of the Dutch railway infrastructure. This ensured that the model
would generate many conflicts, but also implied that a conflict-free timetable would
be nearly impossible. This choice was made for practical reasons, and because the
focus was on optimization strategies rather than on creating an actual timetable.
The environment consists of a set of neighborhoods, which provide the interac-
tion space of the trains and constrain their movement. The neighborhoods
implement tracks, stations, border crossings, curves, and so on.
Single, double, or multiple (parallel) tracks are possible, but in the experimental
model the environment mainly consists of single and double tracks. Only in very
dense areas, such as in and between major cities (especially the densely populated
area between Amsterdam, Rotterdam, The Hague and Utrecht, called the ‘Randstad’),
are four or six tracks sometimes used in order to anticipate congestion in those areas.
For details on the implementation, see [17].
Every train is subject to a trajectory, a list of neighborhoods that determines the
travel plan. The trajectory starts and ends at stations, the origin and destination.
The trajectory also defines at which stations the train stops in between (stopover
stations), the departure time in both directions, and the number of trains traveling
this trajectory per hour. The application generates one timetable for a full day (from
05:00 to midnight), and does not differentiate between weekdays and weekends,
or other specific circumstances.

13.5.3 The Trains


The trains are the active entities in the network. As mentioned earlier, they travel
along their trajectory according to the constraints defined by the neighborhoods they
pass. But trains also have their own set of constraints, which sometimes overrule

those of the neighborhoods. For one, the different types of trains define a hierarchical
structure (Table 13.4):

Table 13.4 Types of Trains

Type            vmax [km/h]   Description

International   140           Only stops in a few major cities and border crossings
Intercity       140           Only stops in major cities of its trajectory
Express Train   120           Stops at a limited number of stations of its trajectory
Local Train     100           Stops at most of the stations of its trajectory

The table shows the hierarchy from top to bottom. The type of train not only de-
termines the maximum speed vmax and the number of stops, it also defines how the
trains are influenced by stress signals. The departure time will only be influenced
by stress signals of trains of equal or higher type. An international train will only
respond to stress signals of other international trains, while local trains respond to
stress signals of all other train types. This approach minimizes possible goal con-
flicts of trains that are considered more important, but imposes a great deal of stress
on express trains and local trains. On the other hand, many local trains operate in
rural areas, sparsely populated parts of the railway infrastructure, where they mainly
encounter stress around the origin and destination stations of their trajectory. In such
cases, the hierarchy in train types contributes to a situation where stress is distributed
from ‘hot spots’ to the periphery.
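The rank-based filtering of stress signals can be sketched as follows; the numeric ranks and the tuple format are illustrative, not taken from the original software.

```python
# Hierarchy from Table 13.4, highest rank first; names are illustrative.
RANK = {"international": 3, "intercity": 2, "express": 1, "local": 0}

def accepted_stress(own_type, incoming):
    """Keep only stress from trains of equal or higher rank.

    `incoming` is a list of (sender_type, stress) pairs; a train's
    departure time only responds to the signals this function returns.
    """
    own = RANK[own_type]
    return [s for sender, s in incoming if RANK[sender] >= own]

signals = [("local", 0.2), ("intercity", -0.5), ("international", 0.1)]
# an international train ignores everything below its own rank ...
assert accepted_stress("international", signals) == [0.1]
# ... while a local train responds to all other train types
assert accepted_stress("local", signals) == [0.2, -0.5, 0.1]
```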
If the trajectory is a train’s general travel plan, a trip is its instantiation. If a
trajectory is traveled, say, four times an hour, the application creates eight trips for
that trajectory, four for each direction. A train may ‘collect’ stress signals based
on its encounters with other trains, but it is the trip that uses the result to change
the departure time of the next train. When a train has reached its destination, the
collected stress is made available to the optimizing layer. This approach makes the
optimization process less sensitive to short-term fluctuations of the stress signals
during a trip. Strictly speaking, the trip is therefore the active optimizing element
in the network, while the train merely causes and collects stress signals. This way a
relationship between departure time and stress signal is established.

The Stress Signals

The stress signals reflect the conflicts that trains can encounter:
• Two trains traveling in opposite directions pass each other on a single track.
• A train tries to enter a neighborhood that has no free tracks.
• A train encounters another train of lower rank (speed) less than 1500 meters in
front of it, heading in the same direction.
The model does not adjust the behavior of the trains; instead, they pass each other as
if the other train were not there. Therefore, only the stress signal is a reminder of these

encounters. When two trains find themselves in conflict, the stress that is calculated
is determined by the location of the trains and the nearest neighborhood that has
sufficient tracks to resolve the problem.
If the nearest free neighborhood is in front of a train, the train outputs a negative
stress signal, which is translated into a request to leave earlier on the next trip. In the
other situation a positive stress signal is given, which is a request to leave later. The
system enforces that both trains give stress signals that lead them to the same free
neighborhood. This prevents stalemates from occurring, especially for trains coming
from opposite directions. If, for instance, both trains were to give a stress signal to
leave earlier, exactly the same conflict would occur a bit earlier on
the next trip.
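The sign convention can be sketched on a one-dimensional track. Treating 'in front of' as the direction of increasing coordinates is a simplification made up for this example.

```python
def stress_sign(train_pos, free_pos):
    """Negative stress if the nearest free neighborhood lies ahead of the
    train (request: leave earlier on the next trip); positive otherwise
    (request: leave later). 'Ahead' here means a larger coordinate, which
    is an illustrative simplification of the real track topology.
    """
    return -1.0 if free_pos > train_pos else 1.0

# Both conflicting trains are steered toward the SAME free neighborhood,
# so they cannot both ask to leave earlier and recreate the conflict:
free_neighborhood = 30.0
a = stress_sign(20.0, free_neighborhood)   # free track ahead -> leave earlier
b = stress_sign(40.0, free_neighborhood)   # free track behind -> leave later
```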
An advantage of this particular environment is that every agent in the system
knows exactly which other agents it is interacting with, so its response is tar-
geted to serve those that actually profit from it. This improves the behavior of the
system as a whole, as the chance of premature convergence due to goal conflicts
and neutralizing stress signals becomes smaller. Of course, the stress collected by a
train is still the superposition of the collected stress signals, but due to the dynamic
character of the network there is a good chance that the system ‘pulls’ itself out of
temporary premature convergence through fluctuations of the stress signals. The dy-
namic character of the network also prevents stable branches with amplifications from
occurring in the network. This means that the dynamics of the environment can, to some
extent, actually be used by the system to improve its overall behavior.

13.5.4 The Optimizing Layer


The optimizing layer collects the stress signals of the trains and feeds them to the
symbiotic algorithms, which in turn calculate a new departure time for those trains.
The optimizing layer is not specific to a certain problem domain.
Various algorithms were initially tested, mainly variants of the
following formula:
d_i^t = d_i^{t−1} + ρ · r_i · ∑_{k=0}^{m} s_{ki}^t    (13.22)
In this equation, d_i^t is the departure time of trip i at time t, r_i is the range within
which the departure time can fluctuate, and ρ is an additional limiting factor. The sum
of the stress of the m trains that were encountered during the trip determines the new
departure time of the trip. The stress is normalized to the range [−1, 1].
The range r_i depends on the number of trips that are carried out per hour. If a
trajectory is traversed four times per hour, the departure time fluctuates 7.5 minutes
around every quarter of an hour. For a trajectory that runs twice an hour, it
fluctuates 15 minutes around every half hour.
The factor ρ ∈ [0, 2] has been added as an additional optimization parameter. For
most experiments it defaulted to 1. The range and the factor ρ together determine
the bounds of the ‘liveliness’ in the system.
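Update rule (13.22) with the trip-dependent range can be sketched directly; clamping the summed stress to [−1, 1] reflects the normalization mentioned above, and ρ defaults to 1 as in most experiments. The numbers in the usage lines are made up.

```python
def new_departure(prev_dep, stresses, trips_per_hour, rho=1.0):
    """Departure-time update per (13.22): d^t = d^{t-1} + rho * r * sum(s).

    The range r is half the interval between trips: 7.5 minutes for four
    trips per hour, 15 minutes for two. The summed stress is clamped to
    [-1, 1], reflecting the normalization described in the text.
    """
    r = 60.0 / trips_per_hour / 2.0            # minutes of allowed fluctuation
    total = max(-1.0, min(1.0, sum(stresses)))
    return prev_dep + rho * r * total

# four trips per hour and net stress +1 -> depart 7.5 minutes later
assert new_departure(15.0, [0.5, 0.5], 4) == 22.5
```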

Most learning algorithms included a weight vector for every stress signal, in a
similar fashion to (13.19) and (13.20).
A series of tests was carried out to analyze the optimizing behaviour of the
network using three algorithms: a hill-climbing algorithm, Hebbian learning,
and a third that made the system behave somewhat like a Kohonen network [17].

13.5.5 Computational Complexity


With every iteration step, a total of n trains can be traveling, where n can be
larger than the number of trips in the system, as the duration of a trip usually
exceeds the interval between consecutive trips. Each active train can interact with maxi-
mally n other trains, leading to n stress signals per train that need to be updated with every
iteration step. This gives an upper bound on the computational complexity of
O(n²) per iteration step. However, most, if not all, trains will only interact with a
limited number of other trains, leading to a practical complexity closer to O(n) per
iteration step for most railway infrastructures. This complexity needs to be multi-
plied by the convergence time tc of the network in order to obtain an estimate of
the computational complexity of the solution. In practice, convergence usually
completed well within 30,000 iterations for 186 trips, which is smaller than
n², so the upper bound of the complexity is O(n⁴), while the practical complexity
is better than O(n³).

13.5.6 Results
A typical result of a number of runs is depicted in Figure 13.10.

Fig. 13.10 Average Amount of Trains Without Conflicts in %



On average, a system with Hebbian learning manages to reduce the number of
conflicts to 17 ± 5% of the trips. However, individual runs have been made that
managed to reduce the number of conflicts to less than 9%, while the upper bound
of 22% was fairly constant. This gives reason to believe that there is still much room
for improvement.

Fig. 13.11 Convergence with Delays: Maximally 5 minutes, 10% probability

Fig. 13.12 Convergence with Delays: Maximally 5 minutes, 20% probability



Figure 13.11 shows the results when random delays of trains were introduced to
test the robustness of the solutions. The most striking result is that on aver-
age the network hardly seems affected by the delays. Normally a conflict is resolved
the moment the conflicting trains find a free track. Delays push the trains further
away from the conflict area into the free neighborhoods. As most of these ‘sink-
holes’ are formed by stations with a three-minute stopover time, they are sufficient
to resolve the majority of the delays that are generated, while delays larger than
three minutes will only cause incidental stress and not lead to structural changes in
the system.
Delays with a higher impact will at some point deteriorate convergence, although
the variance seems fairly constant (Figure 13.12). Similar results were obtained
when the maximum delay time was increased.

13.5.7 A Symbiotic Network as a CCGA


The lack of comparable solutions makes it difficult to find a benchmark for RTP. Al-
ternative strategies based on, for instance, competition are hard to implement, as these
normally work best when a number of known alternatives can be tested and mutually
compared. However, as symbiotic networks are fundamentally co-operative, they
are comparable to the co-evolving cooperative genetic algorithms (CCGA)
proposed by Potter and de Jong [23, 24]. Therefore, as follow-up research, an
attempt at benchmarking was made by implementing a CCGA.
In a CCGA, a number of cooperating genetic algorithms (GAs) are mutually con-
nected by certain credit-assignment strategies, which are implemented by design.
The populations optimize individually, but they share their results, which are then
translated into a fitness function.
As the stress signals provide a means of credit assignment, it was relatively easy
to configure the symbiotic network to operate as a CCGA. Every symbiotic algo-
rithm is implemented as a GA operating on a population of departure times of trains,
and the stress signals are used as the fitness function. Starting from an initial, random
population that covers the first trips, new individuals are formed through the stan-
dard operators for GAs.
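One generation of such a per-trajectory GA might look as follows. The truncation selection, averaging crossover, and minute-level mutation are illustrative choices, not the operators used in the original study.

```python
import random

def ccga_step(population, stress_of, mutation_rate=0.02, span=15):
    """One generation for a single trajectory's GA in the CCGA setup.

    Departure times (minutes) are the genome; the collected stress is the
    inverse fitness, so lower stress means fitter. The fitter half breeds
    the next generation via averaging 'crossover' and minute-level mutation.
    """
    ranked = sorted(population, key=stress_of)     # lower stress = fitter
    parents = ranked[: max(2, len(ranked) // 2)]
    children = []
    while len(children) < len(population):
        a, b = random.sample(parents, 2)
        child = (a + b) / 2.0                      # averaging crossover
        if random.random() < mutation_rate:
            child += random.uniform(-span, span)   # mutate within the range
        children.append(child)
    return children

random.seed(1)
pop = [0.0, 5.0, 10.0, 14.0]
# toy fitness: stress grows with distance from an ideal departure of 10
pop = ccga_step(pop, stress_of=lambda d: abs(d - 10.0))
```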
This approach can both provide a benchmark for symbiotic networks against a
well-documented alternative co-operative strategy, and provide a novel ap-
proach for CCGAs, as symbiotic networks learn a form of credit assignment rather
than having it implemented by design. Besides this, the configuration could be
used to assess the intuition that the network would be less prone to the notorious
‘tweaking’ of reproduction and mutation rates in GAs; symbiotic networks tend to
utilize the dynamics of the network itself for optimization. For details see [19].
On average, the CCGA configuration performs a bit better than the Hebbian learn-
ing strategies, presumably because the latter tend to dampen out as the stress
decreases. Hebbian learning tends to ‘coagulate’ the system near convergence.
Due to mutation and crossover, CCGAs keep trying new solutions around conver-
gence and, like the delays, manage to utilize the possibilities of neighborhoods with

free tracks much better. On average, CCGAs manage to let 85±3% of the trains run
without conflicts.
The configuration is hardly affected by varying mutation rates up to 5%. Above
that, the system gives significantly poorer results.

13.5.8 Discussion
When problem domains and solution strategies are considered from a wider per-
spective, such as that provided by the pattern of a convergence-inducing process,
various classes of solution strategies can be compared in relation to the specific
problems they aim to address. Besides the more intuitive distinction between in-
telligence ‘by design’ and ‘true’ computational intelligence, it also allows a more
comprehensive assessment of the various solution strategies provided by interactions
of multiple agents, such as those depicted in the actor/co-actor pattern. Most of
all, it introduces the environment as an integral part of the optimization process.
The specific problem domain provided by railway timetable problems has
demonstrated the relationships between environmental conditions, constraints,
heuristics and possibilities, and the interplay between designed and computational
intelligence.
In this contribution, symbiotic networks were introduced as an approach to opti-
mization in complex, dynamic environments. So far the research has pursued mod-
est goals, concentrating on understanding how agents can learn to collaborate
in complex environments in order to achieve an overall goal. Railway timetable
problems offered a means to analyze this, but the results demonstrate that symbiotic
networks have the potential to be applied in real-world applications as robust
problem solvers.

Acknowledgments
I would like to thank prof. dr. Harry Hunneman from the University for Humanistics
in Utrecht in the Netherlands, and prof. dr. Paul Cilliers from the Centre for Studies
in Complexity of Stellenbosch University in South Africa for their valuable support
in developing a ‘helicopter view’ on the issues related to complexity thinking.
I am also greatly indebted to dr. ir. Schil de Vos and dr. Jack Gerissen for their
support during my research on symbiotic algorithms at the Open University in the
Netherlands, and Schil also for his feedback on the draft version of this chapter.

References
1. Alexander, C.: A Pattern Language: Towns, Buildings, Construction. Oxford University
Press, USA (1977)
2. Blum, C.: Ant colony optimization: Introduction and recent trends. Physics of Life Re-
views 2(4), 353–373 (2005),
http://www.dx.doi.org/10.1016/j.plrev.2005.10.001

3. Caprara, A., Fischetti, M., Toth, P.: Modeling and solving the train timetabling problem.
Operations Research 50(5), 851–861 (2002),
http://www.jstor.org/stable/3088485
4. Cilliers, P.: Boundaries, hierarchies and networks in complex systems. International Jour-
nal of Innovation Management 5(2), 135–147 (2001)
5. Hassoun, M.H.: Fundamentals of Artificial Neural Networks. The MIT Press, Cambridge
(1995)
6. Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis
with Applications to Biology, Control, and Artificial Intelligence. The MIT Press, Cam-
bridge (1992)
7. Jong, K.A.D.: An analysis of the behavior of a class of genetic adaptive systems. PhD
thesis, University of Michigan (1975),
http://portal.acm.org/citation.cfm?id=907087
8. Lee, Y., Chen, C.: Modeling and solving the train pathing problem. In: Twelfth World
Multi-Conference on Systemics, Cybernetics and Informatics, IIIS, Orlando (2008)
9. Margulis, L.: Symbiotic Planet: A New Look At Evolution, 1st edn. Basic Books (1999)
10. Mitchell, M.: Complexity: A Guided Tour. Oxford University Press, USA (2009)
11. Morin, E.: On Complexity. Hampton Press (2008)
12. Odijk, M.A.: Railway timetable generation (1998)
13. Peeters, L.: Cyclic Railway Timetable Optimization. Phd Thesis, ERIM PhD Series Re-
search in Management, Erasmus Universiteit, Rotterdam (2003)
14. Picek, S., Gloub, M.: Dealings with problem hardness in genetic algorithms. WSEAS
Transactions on Computers 8(5) (2009)
15. Pieters, C.P.: Symbiotic algorithms. Master’s thesis, Open University (2003)
16. Pieters, C.P.: Symbiotic networks. In: The 2003 Congress on Evolutionary Computation,
CEC 2003, vol. 2, pp. 921–927 (2003)
17. Pieters, C.P.: Trains in symbiosis. In: IASTED 2004 Congress on Artificial Intelligence
and Soft Computing 2004, pp. 481–487 (2005)
18. Pieters, C.P.: Effective Adaptive Plans, pp. 277–282. Springer, Heidelberg (2006),
http://dx.doi.org/10.1007/1-4020-5263-4_44
19. Pieters, C.P.: Reflections on the geno- and the phenotype. In: CEC 2006 IEEE Congress
on Evolutionary Computation, pp. 1632–1638 (2006)
20. Pieters, C.P.: Complex systems and patterns. In: Twelfth World Multi-Conference on
Systemics, Cybernetics and Informatics, vol. VII, pp. 268–275 (2008)
21. Pieters, C.P.: A pattern-oriented approach to health; using pac in a discourse of health.
International Journal of Education and Information Technologies 3(2), 126–134 (2009),
http://www.naun.org/journals/educationinformation/eit-90.pdf
22. Pieters, C.P.: Patterns, complexity and the lingua democratica. In: Proceedings of the
10th WSEAS International Conference on Automation and Information, ICAI 2009,
Recent Advances in Automation & Information. WSEAS Press, Prague (2009)

23. Potter, M.A., de Jong, K.A.: A cooperative coevolutionary approach to function op-
timization. In: Davidor, Y., Männer, R., Schwefel, H.-P. (eds.) PPSN 1994. LNCS,
vol. 866, pp. 249–257. Springer, Heidelberg (1994),
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.119.2706
24. Potter, M.A., de Jong, K.A.: Cooperative coevolution: An architecture for evolving coad-
apted subcomponents. Evolutionary Computation 8(1), 1–29 (2000),
http://dx.doi.org/10.1162/106365600568086
25. Rosenblatt, F.: The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review 65(6), 386–408 (1958)
26. Weinberg, G.M.: An Introduction to General Systems Thinking, 25th edn. Dorset House
Publishing Company, Incorporated (2001)
27. Wolpert, D., Macready, W.: No free lunch theorems for optimization. IEEE Transactions
on Evolutionary Computation 1(1), 67–82 (1997)
28. Wooldridge, M.: Reasoning about Rational Agents, 1st edn. The MIT Press, Cambridge
(2000)
29. Zwaneveld, P.J., Kroon, L.G., van Hoesel, S.P.M.: Routing trains through a railway sta-
tion based on a node packing model. European Journal of Operational Research 128(1),
14–33 (2001)
Chapter 14
Project Scheduling: Time-Cost Tradeoff
Problems

Sanjay Srivastava, Bhupendra Pathak, and Kamal Srivastava

Abstract. We design and implement new methods to solve multiobjective time-cost
tradeoff (TCT) problems in project scheduling using an evolutionary algorithm and
its hybrid variants with fuzzy logic and artificial neural networks. We deal with
a wide variety of TCT problems encountered in real-world engineering projects.
These include the consideration of (i) nonlinear time-cost relationships of project ac-
tivities, (ii) the presence of a constrained resource apart from precedence constraints,
and (iii) project uncertainties. We also present a hybrid metaheuristic (HMH) com-
bining a genetic algorithm with simulated annealing to solve the discrete version of
the multiobjective TCT problem. HMH is employed to solve two test cases of TCT.

14.1 Introduction
The project manager handles conflicting objectives when optimizing the various
parameters of the project scheduling process. Minimizing project completion time
and project cost continue to be universally sought objectives, conflicting in nature;
this tradeoff is known as the time-cost tradeoff (TCT) in project scheduling. TCT
belongs to the class of multiobjective optimization (MOO) problems, wherein there
is no single optimum solution; rather, there exists a number of solutions which are
all optimal (Pareto-optimal solutions), called the optimal TCT profile in the project
scheduling literature. The tradeoff between project time and cost gives project
managers both challenges and opportunities to work out the best schedule to complete
a project, and is of considerable economic importance. Projects are usually
represented using networks, having nodes
Sanjay Srivastava
Department of Mechanical Engineering, Dayalbagh Educational Institute,
Dayalbagh, Agra, India
e-mail: ssrivastava.engg@gmail.com
Bhupendra Pathak · Kamal Srivastava
Department of Mathematics, Dayalbagh Educational Institute, Dayalbagh, Agra, India
e-mail: pathak111@gmail.com,kamalsrivast@yahoo.com

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 325–357.
springerlink.com c Springer-Verlag Berlin Heidelberg 2010
326 S. Srivastava, B. Pathak, and K. Srivastava

Fig. 14.1 Network model

and directed arcs (Figure 14.1). These diagrams provide a powerful visualization of
the relationships among the various project activities, which form the precedence
constraints in TCT analysis. There are two kinds of networks:
i Activity-On-Arc (AOA): The arcs represent activities and the nodes represent
events. An event represents a stage of accomplishment; either the start or the
completion of an activity.
ii Activity-On-Node (AON): The nodes represent activities and the directed arcs
represent precedence relations. This representation is easier to construct.
A nonincreasing time-cost relationship of a project activity can be either continuous
or discrete (cost refers to direct cost throughout this work). A continuous
relationship can be linear or nonlinear. Accordingly, TCT problems may be categorized
as: (i) the linear TCT problem, for a linear continuous time-cost relationship; (ii) the
nonlinear TCT problem, for a nonlinear continuous time-cost relationship; and (iii) the
discrete TCT problem, for a discrete time-cost relationship.
Real-life projects contain a large number of activities; therefore, it is
almost impossible to enumerate all possible combinations to identify the best
decisions for completing a project in the shortest time and at the minimum cost.
Several researchers have suggested various methods [1], including mathematical
techniques and heuristics, for obtaining the TCT profile, but there still remain many
serious impediments that restrict a wider use of TCT profiles as management tools.
The first concerns the form of the time-cost relationship of project activities and
the size of project networks. Most of the existing methodologies for determining an
optimal TCT profile rely on unrealistic assumptions about the time-cost relationship
of activities (such as linearity, convexity, or continuity) in order to manage
computational costs. This, however, renders the derived profile inaccurate. Most of
the reported methodologies that attempt to deal with realistic time-cost relationships
are accompanied by an only-for-small-networks admonition. Moreover, the discrete
version of the TCT problem is known to be NP-hard, and it has been proved that any
exact solution algorithm would very likely exhibit an exponential worst-case
complexity [2].
The complexity of the TCT problem increases further if resource constraints are also
present, which is not uncommon in realistic projects. In addition, to solve the TCT
problem in a generalized way, the scheduler must consider the presence of project
uncertainties, such as weather conditions, space congestion, and labor performance,
which dynamically affect both the project duration and the project cost during
implementation. In view of the foregoing research issues, we developed comprehensive
and intelligent methods to solve a variety of realistic TCT problems.
Some related research efforts follow. Richard et al. [3] developed nonlinear time-cost
tradeoff models with quadratic cost relations. Vanhoucke [4] applied a branch
and bound method to solve the discrete TCT problem with time switch constraints.
Vanhoucke and Debels [5] used an analytical method as well as tabu search to solve
the discrete TCT problem.
As mentioned, TCT is a MOO problem with two conflicting objectives. MOO
has been explored intensively by researchers since the 1990s, and as a result diverse
techniques have been developed over the years [6]. Most of these techniques sidestep
the complexities involved in MOO and transform the multiobjective problem into a
single-objective problem by employing some user-defined function. Since MOO involves
determining Pareto-optimal solutions, it is hard to compare the results of various MOO
solution techniques, as it is the decision-maker who decides the best solution out of
all optimal solutions pertaining to a specific scenario [7]. Evolutionary algorithms
(EAs) are metaheuristics that are able to search large regions of the solution space
without being trapped in local optima [8]. Some well-known metaheuristics are the
genetic algorithm (GA), simulated annealing (SA), and tabu search. Genetic algorithms
are search algorithms [9] based on the mechanics of natural selection and genetics
that search through the decision space for optimal solutions [10]. In a GA, a string
represents a set of decisions (a chromosome combination), a potential solution to a
problem. Each string is evaluated on its performance with respect to the fitness
function (objective function). The ones with better performance (fitness value) are
more likely to survive than the ones with worse performance. Then genetic information
is exchanged between strings by crossover and perturbed by mutation. The result is a
new generation with (usually) better survival abilities. This process is repeated until
a certain termination condition is met. A genetic algorithm uses a population of
solutions in each iteration of its search procedure, instead of a single solution.
Since a population of solutions is processed in each iteration, the outcome of a GA is
also a population of solutions. This unique feature makes the GA a true multiobjective
optimization technique, and that is how GAs transcend classical search and optimization
techniques [11]. The robustness of GAs is greatly enhanced when they are hybridized
with other metaheuristics such as simulated annealing and tabu search [8]. Different
versions of multiobjective GAs have been successfully employed to solve many MOO
problems in science and engineering [7]. GA-based MOO techniques have also been used
to solve the TCT problem [12], [13]. More recently, Azaron et al. [14] proposed models
using genetic algorithms and the Pareto front approach to solve the nonlinear TCT
problem in PERT networks. However, these models did not consider the presence of a
constrained resource and/or nonlinear time-cost relationships of project activities.
14.1.1 A Mathematical Description of TCT Problems


The TCT problems dealt with in this work involve, at the activity level, expediting an
activity anywhere between two time limits: (i) the normal time, NT (maximum activity
time), associated with the normal cost (least activity direct cost), and (ii) the crash
time, CT (minimum activity time), associated with the crash cost (maximum activity
direct cost). We describe TCT problems as follows.
Let the set φ represent the space of all feasible instances θ of the network, where
an instance θ = {< ti , ci > | CTi ≤ ti ≤ NTi , i = 1, 2, . . . , n}; ti and ci , which have
a nonincreasing relationship, denote the time and cost of the ith activity respectively;
n is the number of activities in the network; and CTi , NTi are the crash time and
normal time of the ith activity respectively.
Further, for the ith activity, ci = fi (ti ), where fi : [CTi , NTi ] → R (R is the set
of real numbers) is a linear map for a linear TCT and a nonlinear map for a nonlinear
TCT. tθ and cθ denote the project duration and project cost respectively.
The discrete version of the TCT problem requires defining θ in the following way.
Let Ai = {< ti j , ci j > | j = 1, 2, . . . , pi }, i = 1, 2, . . . , n, denote the set of pi
alternatives for activity i, where ti j and ci j are the time required and the cost
involved for the jth alternative. An instance θ is defined as θ = {< x, y > | < x, y > ∈ Ai ,
i = 1, . . . , n}; each Ai contributes exactly one pair < x, y > to θ . Clearly |θ | = n.
Three possible problem formulations for the general TCT problem are:
i Find θ∗ s.t. cθ∗ = minθ∈φ {cθ | tθ ≤ ProjectDeadline}
ii Find θ∗ s.t. tθ∗ = minθ∈φ {tθ | cθ ≤ Budget}
iii When the objective is to identify the entire TCT profile for the project network,
the problem is to find

B = {θ∗ ∈ φ | there does not exist another θ ∈ φ with (tθ ≤ tθ∗ ) ∧ (cθ ≤ cθ∗ )},

with strict inequality in at least one case. Here the set of instances θ∗ represents
the entire TCT profile over the set of feasible project durations for the network.
The decision-maker is free to choose a θ∗ depending on specific project requirements.
This formulation is the most generalized one, and it is the one addressed
throughout our work.
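As a concrete illustration of formulation (i), the discrete instances θ can be enumerated directly when the network is small. The sketch below uses an invented three-activity serial network (so tθ is a plain sum; a general network would need a longest-path computation instead):

```python
from itertools import product

# Hypothetical discrete alternatives A_i = {<t_ij, c_ij>} for a three-activity
# serial network (activities run one after another).
A = [
    [(2, 100), (3, 70), (4, 50)],   # activity 1
    [(1, 80),  (2, 40)],            # activity 2
    [(3, 120), (5, 60)],            # activity 3
]

def instances(alternatives):
    """Enumerate every instance theta: exactly one <t, c> pair per activity."""
    for theta in product(*alternatives):
        t = sum(t_i for t_i, _ in theta)    # project duration t_theta
        c = sum(c_i for _, c_i in theta)    # project direct cost c_theta
        yield t, c, theta

def min_cost_under_deadline(alternatives, deadline):
    """Formulation (i): theta* with minimal c_theta s.t. t_theta <= deadline."""
    feasible = [(c, t, theta) for t, c, theta in instances(alternatives)
                if t <= deadline]
    return min(feasible) if feasible else None

best = min_cost_under_deadline(A, deadline=8)   # (cost, time, chosen pairs)
```

For realistic networks such enumeration is infeasible, which is exactly why the evolutionary methods of this chapter are needed.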
The chapter is organized as follows. Section 1 introduces the time-cost
tradeoff problem and establishes the background of our work, including the
mathematical description of TCT problems. Section 2 presents a methodology using
ANNs and a heuristic with a multiobjective GA to solve the multimode resource-
constrained nonlinear time-cost tradeoff (RCNTCT) problem. An integrated Fuzzy-
ANN-GA method is developed to carry out the sensitivity analysis of the nonlinear
TCT profile in Section 3. The design and implementation of the hybrid metaheuristic
(HMH), an evolutionary MOO method for the discrete TCT problem, is detailed in
Section 4. Section 5 draws conclusions and lays down important dimensions of future
work.
14.2 Resource-Constrained Nonlinear TCT


We present a method for solving the nonlinear TCT problem of project scheduling
under a constrained resource. In real-world projects, some resources are usually
constrained, and in such situations the project manager should include
resource constraints in the TCT analysis apart from precedence constraints [15].
Further, existing methods usually ignore the nonlinear time-cost relationships of
project activities. We tackle these obstacles through an intelligent method that
integrates ANN models and a heuristic with a GA-based MOO algorithm; the GA is
employed to search for the Pareto-optimal front. The method is termed artificial
neural network and heuristic embedded genetic algorithm (ANNHEGA).
RCNTCT may also be viewed as a case of the resource-constrained project scheduling
problem (RCPSP) [16]. A survey of solution procedures for the RCPSP can be
found in Ozdamar and Ulusoy [17], and more recently in Kolisch and Hartmann
[18]. Many different objectives are possible in the RCPSP, as per the needs of the
decision-maker; the most investigated one is finding the minimum makespan of the
project. In our case, however, the objective is to search for the entire RCNTCT
profile. With this objective, the problem dealt with in this section falls into the
class of multimode resource-constrained project scheduling (MM-RCPSP) [19]. There
have been a few earlier research attempts at combining the TCT problem with the
RCPSP. Erenguc et al. [20] proposed a branch and bound method to determine resource
requirements, the extent of crashing, and the start time of each activity so as to
minimize total project cost. Leu and Yang [13] proposed a model unifying resource
allocation, time-cost tradeoff, and resource leveling using a GA with a multiple
attribute decision-making approach. To the best of our knowledge, the literature on
the class of TCT problems (RCNTCT) defined here is nearly void.
We design a heuristic to account for a resource constraint in solving the nonlinear
TCT problem: it checks the availability of the resource required by the project
activities. Some related definitions follow. A mode of an activity is a distinct
option, in terms of cost and duration, for performing the activity under consideration.
In mode m, activity i requires a processing time ti (m) and a constant amount of
resource rti (m) during each period while it is in progress. Once a mode is selected,
the activity continues in this mode until it finishes. A constant work content Wi is
defined for each activity as the product of ti (m) and rti (m). The constrained resource
is assumed to be available in a constant amount RA throughout the project, which is
not unusual in practice. Non-preemptive scheduling of activities is considered. The
activities are numbered from 0 to n + 1, where activities 0 and n + 1 are dummy
activities denoting the beginning and the end of the project in an AON network. CTi
and NTi , the two limits of ti , determine the upper and lower limits of the constrained
resource requirement of the ith activity respectively. It is important to mention that
in RCNTCT, although there is a continuous and nonincreasing relationship between ti
and ri , as well as between ti and ci , for every activity of the project network,
there is no direct correspondence between ri and ci .
For RCNTCT, the mathematical formulation of the nonlinear TCT described in
Subsection 1.1 includes the following additional constraints:

sk (m) − si (m) ≥ ti (m)   ∀k ∈ S      (precedence constraints)

∑ rti (m) ≤ RA , the sum taken over all activities i ∈ Osi (m)      (resource constraint)

where si (m) and sk (m) are the starting times of activities i and k respectively in
mode m; S is the set of activities succeeding activity i; rti (m) is the resource
requirement of activity i at processing time ti (m); and Osi (m) is the set of
activities being performed at time si (m).

14.2.1 Artificial Neural Networks


We utilize the function approximation capabilities of ANNs, using a back-propagation
neural network with the Levenberg-Marquardt (LM) learning rule, to model the
time-cost relationship of each activity of a project network. An ANN model is capable
of capturing a nonlinear time-cost relationship. One back-propagation neural network
is employed per activity of the project network for rapid estimation of the
corresponding activity cost. The LM learning rule uses an approximation of Newton's
method to get better performance, and is relatively fast. The LM update rule is
ΔW = (J T J + μ I)−1 J T e, where ΔW is the weight update matrix, J is the Jacobian
matrix of the derivatives of each error with respect to each weight, μ is a scalar,
and e is an error vector. When the scalar μ is very large, the expression approximates
the gradient descent method; when it is small, it becomes the Gauss-Newton method.
The Gauss-Newton method is faster and more accurate near an error minimum. Hence,
the aim is to shift towards Gauss-Newton as quickly as possible: μ is decreased after
each successful step and increased only when a step increases the error. The
architecture of the ANN employed to solve the RCNTCT problem is shown in Figure 14.2.

Fig. 14.2 Artificial neural network architecture for nonlinear TCT problem
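The LM update rule can be sketched for the one-input, one-output case used here. The code below is a minimal illustration, not the authors' implementation: it fits an assumed nonlinear time-cost model c(t) = a·exp(−bt) + d to synthetic data using ΔW = (JᵀJ + μI)⁻¹Jᵀe, halving μ after a successful step and doubling it otherwise:

```python
import numpy as np

# Synthetic activity time-cost data (an assumed stand-in for Table 14.1 data).
t = np.array([14.0, 15.0, 16.0, 18.0, 21.0, 23.0, 24.0])
c = 1000.0 * np.exp(-0.1 * t) + 1200.0

def model(w, t):
    a, b, d = w
    return a * np.exp(-b * t) + d

def jacobian(w, t):
    """Derivatives of the model output with respect to each parameter."""
    a, b, _ = w
    e = np.exp(-b * t)
    return np.column_stack([e, -a * t * e, np.ones_like(t)])

def lm_fit(w, t, c, mu=1e-2, iters=200):
    for _ in range(iters):
        e = c - model(w, t)                     # error vector
        J = jacobian(w, t)
        dW = np.linalg.solve(J.T @ J + mu * np.eye(len(w)), J.T @ e)
        if np.sum((c - model(w + dW, t)) ** 2) < np.sum(e ** 2):
            w, mu = w + dW, mu * 0.5            # success: toward Gauss-Newton
        else:
            mu *= 2.0                           # failure: toward gradient descent
    return w

w = lm_fit(np.array([800.0, 0.2, 1000.0]), t, c)
```

Because JᵀJ + μI is positive definite for μ > 0, the linear solve is always well posed, and the accept/reject test on the squared error implements the μ schedule described above.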
14.2.2 Working of the ANN and Heuristic Embedded Genetic Algorithm
In ANNHEGA, a string represents a potential solution to the problem (see Subsection
A). ANNHEGA starts by generating an initial population (see Subsection B). Each
string is checked beforehand against the resource constraint throughout the project
schedule in the heuristic module of the algorithm (see Subsection G). The TCT profile
and convex hull of the existing (initial) generation (see Subsection C) are determined
and plotted. Fitness values ( fitu ) of the individuals of the initial population (see
Subsection D) are then determined. Keeping the individuals on the TCT profile, a pool
of chromosomes is generated according to the individuals' fitness values. It is
important to mention that the GA employed here incorporates elitism, keeping the
individuals on the TCT profile for the next generation, as this helps in converging to
the true TCT profile. Arias and Coello [21] have proved that GAs for MOO converge to
the global optimal solution for some functions in the presence of an elite-preserving
operator. As the next step of ANNHEGA, the crossover operator (see Subsection E) and
the mutation operator (see Subsection F) are applied to produce the next generation.
This process is repeated for a pre-specified number of generations or, alternatively,
until no improvement is observed in the non-dominated solutions for a pre-specified
number of iterations.

A. Structure of a Solution
A solution here is a string that represents an instance θ of the project schedule
(Figure 14.3); each element ti of an n-tuple string T can assume any natural-number
value from [CTi , NTi ]. As already mentioned, ci = fi (ti ) for the ith activity, where
fi : [CTi , NTi ] → R is a nonlinear map for the nonlinear TCT. The associated cθ and
tθ of each individual string are determined by summing up the corresponding costs of
the activities and by computing the maximum path time respectively.

Fig. 14.3 An instance of project schedule
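Evaluating tθ and cθ for a string, as described above, can be sketched as follows; the small AON network, activity times, and the linear cost function fi are invented for illustration:

```python
# c_theta sums activity costs c_i = f_i(t_i); t_theta is the maximum (longest)
# path time through the AON network.
# succ maps each activity to its successors; 0 and 5 are dummy start/end nodes.
succ = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
dur = {0: 0, 1: 4, 2: 6, 3: 3, 4: 2, 5: 0}     # one instance's activity times

def project_duration(succ, dur, start=0):
    """Longest path from the dummy start node of a DAG (memoized DFS)."""
    memo = {}
    def longest(i):
        if i not in memo:
            memo[i] = dur[i] + max((longest(k) for k in succ[i]), default=0)
        return memo[i]
    return longest(start)

def project_cost(times, cost_fn):
    """Sum of activity direct costs."""
    return sum(cost_fn(i, t) for i, t in times.items())

t_theta = project_duration(succ, dur)
# an assumed linear f_i for the sketch: cost falls as the activity is relaxed
c_theta = project_cost({i: t for i, t in dur.items() if i not in (0, 5)},
                       lambda i, t: 100 - 5 * t)
```

In ANNHEGA the lambda above is replaced by a query to the trained per-activity ANN.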

B. Initial Population
The initial population consists of n p solutions, where (n p − 2) strings are selected
randomly from the feasible search space, i.e., each ti of a string is chosen randomly
from [CTi , NTi ]. The remaining two strings are formed such that for the first string
ti = NTi ∀i = 1, . . . , n, and for the second string ti = CTi ∀i = 1, . . . , n. This
helps in identifying the extent of diversification of the population in each generation
of the GA while searching for the optimal TCT profile. These solutions are referred to
as parents. Each string, representing a unique network schedule, is tested beforehand
against the resource constraint, and the early start times of non-critical activities
are modified if necessary as per the procedure in Subsection G. The associated cost of
each individual string is determined by summing up the cost of each activity, and the
project duration of each string is determined by computing the maximum path time. The
cost data of each activity are estimated by the corresponding trained ANN. These n p
strings form the initial population of ANNHEGA.
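A minimal sketch of this initialization (the CTi and NTi bounds are invented for illustration):

```python
import random

# Illustrative CT_i (crash) and NT_i (normal) bounds for a 5-activity network.
CT = [2, 3, 1, 4, 2]
NT = [6, 8, 4, 9, 7]

def initial_population(CT, NT, n_p, seed=0):
    """(n_p - 2) random strings within [CT_i, NT_i], plus the all-normal and
    all-crash extremes that bracket the search space."""
    rng = random.Random(seed)
    pop = [[rng.randint(CT[i], NT[i]) for i in range(len(CT))]
           for _ in range(n_p - 2)]
    pop.append(list(NT))   # first extreme string: t_i = NT_i for all i
    pop.append(list(CT))   # second extreme string: t_i = CT_i for all i
    return pop

pop = initial_population(CT, NT, n_p=10)
```

Each string would then be passed to the resource-feasibility check of Subsection G before entering the population.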

C. Time-Cost Tradeoff Profile and Convex Hull


Let θ1 and θ2 be two strings in a population F. θ1 dominates θ2 if tθ1 ≤ tθ2 and
cθ1 ≤ cθ2 , with either tθ1 < tθ2 or cθ1 < cθ2 . Let D be a binary relation defined on
the set F by D = {(θ1 , θ2 ) | θ1 , θ2 ∈ F ∧ θ1 dominates θ2 }; then the non-dominated
set (NDS) is given by NDS = {θi ∈ F | (θ j , θi ) ∈/ D ∀ j, j ≠ i}, i.e., it comprises
the strings (solutions) of F that are not dominated by any other string of F. The
curve formed by joining these solutions is referred to as the time-cost tradeoff
profile, and the solutions as the tradeoff points, in the project scheduling literature.
We define a convex hull merely as a boundary that encloses all members of a
population from below (Figure 14.4). This boundary is in the form of straight line
segments. The purpose of drawing a convex hull for each population is to evaluate
the fitness of each individual in the population [12]. A convex hull may not include
all the solution points of the non-dominated set.
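The dominance relation D and the non-dominated set translate directly into code; the (tθ , cθ ) points below are invented for illustration:

```python
def dominates(p, q):
    """p dominates q: p is no worse in both time and cost and the two
    points differ (so at least one inequality is strict)."""
    return p[0] <= q[0] and p[1] <= q[1] and p != q

def non_dominated_set(points):
    """NDS = points of the population F not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# invented (t_theta, c_theta) pairs of a population F
F = [(60, 250), (62, 230), (65, 230), (66, 210), (70, 220), (60, 260)]
nds = non_dominated_set(F)
```

Joining the surviving points in increasing time order traces the TCT profile described above.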

D. Distance Measure vs. Fitness


After determining the TCT profile and the convex hull of the existing generation
(the parent generation), we calculate the minimal distance (du ) between each parent
and the segments of the convex hull (Figure 14.4). Then we determine the fitness
value and the probability of selection for each individual within the parent
population as follows:

fitu = dmax − du    (14.1)

probu = fitu / ∑ fitu    (14.2)

where fitu = fitness value of parent u; dmax = maximum du in the generation; du =
minimal distance between parent u and the segments v of the convex hull,
du = min (duv , for all v); and probu = probability of selection of parent u.
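Equations (14.1) and (14.2) can be sketched as follows, assuming the convex hull is given as a polyline of (time, cost) vertices; the hull and population points are invented for illustration:

```python
import math

def point_segment_distance(p, a, b):
    """Minimal distance d_uv from point p to the hull segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # clamp the projection parameter to stay on the segment
    s = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + s * dx), py - (ay + s * dy))

def fitness(points, hull):
    """fit_u = d_max - d_u (Eq. 14.1); prob_u = fit_u / sum(fit_u) (Eq. 14.2)."""
    d = [min(point_segment_distance(p, hull[v], hull[v + 1])
             for v in range(len(hull) - 1)) for p in points]
    d_max = max(d)
    fit = [d_max - d_u for d_u in d]
    total = sum(fit)
    return fit, [f / total for f in fit]

hull = [(60, 260), (66, 210)]              # convex hull below the population
pts = [(60, 260), (63, 240), (66, 250)]    # invented parent strings
fit, prob = fitness(pts, hull)
```

Parents close to the hull get the largest fitness and hence the highest selection probability.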
Fig. 14.4 Fitness evaluation of a member of the population

E. Crossover
We consider one-point crossover, wherein a parent P1 produces a child by crossing
over with another parent P2 selected randomly. A random integer q with 1 ≤ q ≤ n
is chosen, where q represents the crossover site. The first q positions of the child
are taken from the first q positions of P1, while the remaining (n − q) positions are
taken from the last (n − q) positions of P2 .

F. Mutation
The mutation operator modifies a randomly selected activity of a string with
probability mr ; that is, (mr × |F|) strings undergo mutation. The mutation operator
works on a given string in the following manner. Let the string be represented by
str(i), i = 1, . . . , n. A random number q, 1 ≤ q ≤ n, is generated for the location
of the gene to be mutated. Another random natural number r, r ∈ [CTq , NTq ], is
generated, and str[q] is replaced by r.
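The two operators can be sketched as follows (a minimal illustration; the bounds and parents are invented):

```python
import random

def one_point_crossover(p1, p2, rng):
    """First q genes from P1, remaining n - q genes from P2 (1 <= q <= n)."""
    q = rng.randint(1, len(p1))
    return p1[:q] + p2[q:]

def mutate(s, CT, NT, rng):
    """Replace a randomly chosen gene str[q] by a random r in [CT_q, NT_q]."""
    s = list(s)
    q = rng.randrange(len(s))
    s[q] = rng.randint(CT[q], NT[q])
    return s

rng = random.Random(1)
CT, NT = [2, 3, 1, 4], [6, 8, 4, 9]
child = one_point_crossover([6, 8, 4, 9], [2, 3, 1, 4], rng)
mutant = mutate(child, CT, NT, rng)
```

Both operators keep every gene inside its [CTi , NTi ] interval, so offspring remain feasible with respect to the activity time limits (resource feasibility is checked separately in Subsection G).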

G. Heuristic Procedure
The float value of an activity is defined as the available time by which the activity
can be delayed without affecting the time deadline of the project. Obviously, the
float value of a critical activity is zero. Each string, representing a unique network
schedule, is tested beforehand against the resource constraint in this module, and the
early start times of non-critical activities are modified if necessary by exploiting
their float values. The heuristic checks the resource requirement (RR ), period by
period for each string, against the resource availability (RA ) of the given project.
If RR > RA at any time interval Δ t of a network schedule, the start time of a
non-critical activity falling in Δ t is shifted period by period, exploiting its float
value, so as to bring RR within RA throughout the network schedule. Further, if more
than one non-critical activity falls in Δ t, each one is processed in turn (ties may
be broken arbitrarily) in a similar manner until RR is adjusted within RA . If, even
after shifting the corresponding non-critical activity (or activities), RR is not
adjusted within RA for a string, the string is rejected altogether and does not
participate in the evolutionary process of ANNHEGA. In case of rejection of strings
due to violation of the resource constraint, ANNHEGA keeps generating other strings
and checking them against the resource constraint until the population size is met.
That is how ANNHEGA maintains the population size in the evolutionary process.
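A simplified sketch of the period-by-period check and float-based shifting (the schedule data are invented, and the scan order is one arbitrary tie-breaking choice):

```python
def usage_profile(schedule, horizon):
    """Per-period resource requirement R_R of a schedule
    {activity: (start, duration, rate)}."""
    use = [0] * horizon
    for start, dur, rate in schedule.values():
        for t in range(start, start + dur):
            use[t] += rate
    return use

def repair(schedule, floats, RA, horizon):
    """Shift non-critical activities (float > 0) period by period until
    R_R <= R_A everywhere; return None if the string must be rejected."""
    for i, fl in floats.items():
        for _ in range(fl):
            if max(usage_profile(schedule, horizon)) <= RA:
                return schedule
            s, d, r = schedule[i]
            schedule[i] = (s + 1, d, r)   # delay one period within the float
    return schedule if max(usage_profile(schedule, horizon)) <= RA else None

# activity: (start, duration, resource rate); activity 2 has a float of 2
sched = {1: (0, 2, 4), 2: (0, 2, 3)}
repaired = repair(dict(sched), floats={2: 2}, RA=5, horizon=6)
```

Here the two activities overlap with a combined requirement of 7 against RA = 5, and delaying activity 2 by its two float periods restores feasibility; a string that stays infeasible after all floats are exhausted is rejected, as described above.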
It is well known that, in general, the resource requirement of a project over all
periods is never constant, even after applying the best resource leveling procedure.
The proposed heuristic procedure makes use of this fact in fixing the upper limit of
RA , as detailed below. The peak resource requirement Rmax , based on the activities'
normal times, is computed for two extreme cases: (1) all the non-critical activities
are scheduled to start at their earliest start time (ES), and (2) all the non-critical
activities are scheduled to start at their latest start time (LS). The average of the
peak resource requirements of these two cases is taken as the peak resource
requirement of the project network:

Rmax = (Rmax of ES + Rmax of LS) / 2

The initial value of RA used to run ANNHEGA is taken as equal to Rmax ; this may be
termed the upper limit of the constrained resource of the project. In generating the
TCT profile, the project duration is progressively crashed, and obviously more
resources are required for each subsequent crashing. In order to decide the lowest
possible limit of RA (below this limit project expediting is not possible), RA is
successively reduced and ANNHEGA is run each time, until a point occurs where the
project time starts increasing instead of decreasing in order to satisfy the resource
constraint.
More formally, the ANNHEGA scheme can be summarized in pseudocode as follows. Let
CHP be the set of children, I be the current generation number, and GEN be the
maximum number of generations.
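A high-level sketch of that loop, consistent with Subsections A–G (the step labels are ours):

```
I ← 0
P ← initial population of n_p resource-feasible strings        (Subsections B, G)
repeat
    evaluate t_θ and c_θ of every string in P, using the
        trained ANNs for activity costs                        (Subsection A)
    determine the TCT profile (NDS) and convex hull of P       (Subsection C)
    compute fit_u and prob_u for every string in P             (Subsection D)
    CHP ← strings on the current TCT profile                   (elitism)
    while |CHP| < n_p do
        select parents according to prob_u
        child ← crossover of the selected parents (rate c_r)   (Subsection E)
        mutate child with probability m_r                      (Subsection F)
        if child satisfies the resource constraint after
            float-based shifting, add child to CHP             (Subsection G)
    P ← CHP; I ← I + 1
until I = GEN, or the NDS is unchanged for a pre-specified
      number of iterations
```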

14.2.3 ANNHEGA for a Case Study


ANNHEGA is implemented on a case study to illustrate its working (Figure 14.5). The
project's basic structure is similar to that of Feng et al. [12]; however, a continuous
and nonlinear time-cost relationship with additional time-cost information is
incorporated for each activity, and a work content (the product of the activity
duration and the amount of the resource needed) is assigned to each one. The project's
18 activities are numbered i = 1, . . . , 18. The per-period availability of the
resource (RA ) is constant and is computed as mentioned earlier.
Fig. 14.5 Network of the test problem

14.2.3.1 Computational Results

Table 14.1 shows the network data of the test problem: the time, cost, and work
content (WC) of each activity; ID denotes the activity number. The ANNs are trained
with the time-cost data of Table 14.1. There are a total of seven time-cost options
available for each activity; training data for each ANN are prepared by picking the
first and last time-cost options and by randomly selecting three more options. The
remaining two options of each activity are used as testing data for the neural
network. A three-layer network, as shown in Figure 14.2, with one input (activity
time) and one output (activity cost), is used. The training effort with the LM
learning rule is low (only 5 to 7 iterations are needed). One network is trained for
each activity; thus a total of 18 ANNs are employed here. An error goal of 1.0e-03 is
specified. The modelling power of the ANNs is validated using the testing data set:
the activity cost is evaluated using the ANNs and compared with the known cost data.
The close agreement of the cost values obtained using the neural networks (ANN cost)
demonstrates the accuracy of the ANN module of ANNHEGA (Table 14.2). The initial
population size (n p ), mutation rate (mr ), and crossover rate (cr ) are chosen as
200, 0.02, and 1.0 respectively, based on the criterion of faster convergence to the
final Pareto front. We use these parameters for the other test problems as well. The
initial value of the resource availability (RA ) is computed as mentioned earlier in
Subsection 2.2G; below the lowest limit of RA , the project duration would start
increasing instead. In addition, the search is set to terminate when the tradeoff
points do not change in 5 consecutive iterations. An initial generation of 200
strings is randomly selected and shown in Figure 14.6. It can be seen that the
initial generation is distributed over the solution space and does not gather in one
region. The fitness of an individual in a population depends on its proximity to the
convex hull. The population improves towards the convex hull with each passing
generation; accordingly, the convex hull also moves towards the coordinate axes.
This illustrates the tendency of the algorithm to converge towards the Pareto-optimal
front. Figure 14.7 depicts intermediate improvements in the tradeoff points and the
convex hull as they move towards the axes. Figure 14.8 shows the convex hull and
tradeoff points of the final population; a clear improvement is visible. Since the
tradeoff points do not improve further, these points are concluded to be the best
solution points found by ANNHEGA. Further, they are compared with the analytical
solutions to judge the accuracy of ANNHEGA: it is able to find at least 90% of the
Pareto-optimal solutions on average over 50 runs. An ANNHEGA-based system can help
to monitor and control a project in a cost-effective way in real time, and one can
choose the best alternative on the RCNTCT profile to execute real-world projects.

14.3 Sensitivity Analysis of TCT Profiles

In real-life projects, the duration and cost of each activity can change dynamically
as a result of many uncertain variables, such as management experience (ME), labor
skill (LS), and weather conditions (WC). Project managers must take these
uncertainties into account and provide an optimal balance of time and cost based on
their own experience and knowledge. Such uncertainty is well represented by fuzzy set
concepts. Time analysis of a project under uncertainties has been studied using a
fuzzy set theoretic approach [22]. Daisy and Thomas [23] applied fuzzy set theory to
model managers' behavior in predicting project network parameters within an activity.
Leu et al. [24] used fuzzy set theory to model the variations in the duration of
activities due to changing environmental factors. Other types of uncertainty, such as
budget uncertainty, have also been incorporated into the project time-cost tradeoff
[25]. Existing methods for the sensitivity analysis of TCT profiles with regard to
project uncertainties ignore the cost parameter of project activities [26] and make
no provision for nonlinear time-cost relationships of project activities. To address
these problems we devised and implemented a novel method: it examines the effects of
project uncertainties on both the duration and the cost of the activities, and
incorporates nonlinear time-cost relationships of project activities. The method
integrates three key fields of computational intelligence (fuzzy logic, ANNs, and a
multiobjective genetic algorithm) and is referred to as Integrated Fuzzy-ANN-GA
(IFAG).
A rule-based fuzzy logic framework is developed which derives the changes in the
duration and the cost of each activity for the given uncertainties, and ANNs are then
trained with these time-cost data (one ANN per activity) to model the time-cost
relationships. It has already been shown in Section 2 that the integration of ANNs
with a GA facilitates the evaluation of the GA's fitness function. The GA is employed
to search for the Pareto-optimal front for a given set of time-cost pairs of each
project activity. That is how the integration of the fuzzy logic framework and the
ANNs with the GA captures the sensitivity of the nonlinear TCT profile with respect
to project uncertainties. A test case of the TCT problem is solved using IFAG. Fuzzy
sets and fuzzy inference systems are briefly described below.

A. Fuzzy Sets

Fuzzy set theory is an efficient tool for modelling uncertainties associated with
vagueness, imprecision, or/and lack of information regarding variables of decision
space. The underlying power of fuzzy set theory is that it uses linguistic variables,
14 Project Scheduling: Time-Cost Tradeoff Problems 337

Table 14.1 Network Data of Test Problem

ID Time Cost WC ID Time Cost WC ID Time Cost WC


1 14 2400 224 7 9 30000 900 13 14 4000 860
1 15 2150 7 11 27200 13 15 3795
1 16 1900 7 13 26100 13 16 3500
1 18 1750 7 14 25600 13 18 3200
1 21 1500 7 15 24000 13 21 2750
1 23 1340 7 17 22300 13 23 2155
1 24 1200 7 18 22000 13 24 1800
2 15 3000 625 8 14 220 120 14 9 3000 90
2 17 2630 8 15 215 14 10 2930
2 18 2400 8 16 200 14 12 2825
2 20 1800 8 17 190 14 14 2605
2 21 1720 8 21 167 14 15 2400
2 23 1500 8 23 150 14 17 2295
2 25 1000 8 24 120 14 18 2200
3 15 4500 300 9 15 300 225 15 10 6525 80
3 17 4415 9 18 240 15 13 5990
3 19 4220 9 20 180 15 14 4500
3 22 4000 9 23 150 15 16 3500
3 25 3730 9 24 130 15 17 3355
3 30 3375 9 25 110 15 18 2600
3 33 3200 9 25 100 15 20 1930
4 12 45000 520 10 15 450 100 16 20 3000 650
4 13 44300 10 22 400 16 22 2000
4 15 38450 10 23 390 16 24 1750
4 16 35000 10 27 345 16 26 1685
4 18 33700 10 28 320 16 28 1500
4 19 32400 10 30 325 16 29 1385
4 20 30000 10 33 320 16 30 1000
5 22 20000 420 11 12 450 380 17 14 4000 100
5 24 17500 11 13 420 17 16 3700
5 25 16400 11 14 370 17 17 3455
5 26 15900 11 16 350 17 18 3200
5 27 15700 11 17 330 17 21 2780
5 28 15000 11 19 305 17 23 2335
5 30 10000 11 20 300 17 24 1800
6 14 40000 800 12 22 2000 480 18 9 3000 108
6 16 39200 12 24 1750 18 10 2900
6 17 34500 12 25 1690 18 12 2790
6 18 32000 12 27 1525 18 14 2565
6 20 27700 12 28 1500 18 15 2400
6 22 20300 12 29 1200 18 16 2315
6 24 18000 12 30 1000 18 18 2200
338 S. Srivastava, B. Pathak, and K. Srivastava

Table 14.2 Comparison of Ann Cost with Actual Cost

ID Time Cost ANN Cost    ID Time Cost ANN Cost

1 15 2150 2150.07    10 23 390 390.000
1 21 1500 1500.05    10 28 330 329.928
2 17 2630 2629.97    11 14 370 370.000
2 23 1500 1499.92    11 17 330 330.000
3 22 4000 4000.05    12 24 1750 1750.03
3 25 3730 3730.01    12 27 1525 1525.09
4 15 38450 38450.2    13 16 3500 3500.00
4 18 33700 33700.5    13 21 2750 2750.30
5 27 15700 15700.0    14 12 2825 2825.30
5 28 15000 15000.4    14 14 2605 2605.11
6 17 34500 34500.6    15 13 5990 5990.93
6 22 20300 20300.2    15 16 3500 3500.30
7 14 25600 25600.5    16 22 2000 2000.70
7 15 24000 24000.1    16 29 1385 1385.21
8 17 190 190.001    17 21 2780 2780.41
8 23 150 150.003    17 23 2335 2335.24
9 18 240 239.997    18 14 2565 2565.00
9 23 150 149.988    18 16 2315 2315.10

rather than quantitative variables to represent imprecise concepts. The values of
linguistic variables are words or sentences in a given language. For example,
management experience can be considered a linguistic variable, since its values,
such as long experience or short experience, are not clearly defined but are
nonetheless meaningful classifications.

B. Fuzzy Inference System

Fuzzy inference is the process of formulating the mapping from a given input to
an output using fuzzy logic. The mapping then provides a basis from which deci-
sions can be made. The process of fuzzy inference involves membership functions,
fuzzy logic operators, and if-then rules. Fuzzy inference systems (FIS) have been
successfully applied in fields such as automatic control, data classification, decision
analysis, expert systems, and computer vision. We have employed a Mamdani-type
fuzzy inference system using MATLAB’s fuzzy logic toolbox. Mamdani’s fuzzy
inference method [27] is the most commonly used fuzzy methodology. Mamdani’s
effort was based on Lotfi Zadeh’s work on fuzzy algorithms for complex systems
and decision processes [28].

14.3.1 Working of IFAG


IFAG starts by inputting project uncertainties into the fuzzy logic framework (see
Subsection 3.1.1), which in turn generates a set of time-cost pairs for each activity of
a given project. This results in the same project network but with different time-cost
data for the activities. These data are then input to the ANNs for training as
described earlier. The working of the GA is already described in Section 2; the only
difference is that strings generated by the GA need not pass through the heuristic
module, as resources are assumed to be sufficiently available in this part of the work.
The working of IFAG is further detailed with a case study in Subsection 3.2.

14.3.1.1 Fuzzy Logic Framework for Project Uncertainties


A. FIS for Activity Duration and Cost

The FIS to capture the effect of linguistic variables on activity duration and cost
is designed with three input variables (ME, LS, and WC) and two output variables
(activity duration and activity cost). Triangular membership functions are used to
model the linguistic variables, input as well as output. The FIS editor interfaces
the inputs and outputs.

B. Membership Function (MF) Curves of Input Linguistic Variables

The linguistic variables, namely ME, LS, and WC, are modeled using five membership
functions, such as the ones shown in Figure 14.9 for weather condition. The
linguistic variables are defined over the range 0–1.

C. Membership Function (MF) Curves for Output Variables

The output variables, activity duration and activity cost, are modeled by seven
membership functions (Figure 14.10) over the universe of discourse (UOD). The UOD
range for activity duration is assumed to be from (D − 0.2 × D) to (D + 0.2 × D),
where D represents an initial estimate of the activity duration by the project experts.
Similarly, the UOD range for activity cost is assumed to be from (C − 0.2 × C)
to (C + 0.2 × C), where C represents an initial estimate of the activity cost by the
project experts.
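The Mamdani inference described above can be sketched in miniature. This is a rough single-input illustration, not the chapter's three-input MATLAB FIS: it uses only three of the five weather-condition sets of Figure 14.9 and three of the seven output sets of Figure 14.10, the rule base (bad weather → long duration, etc.) is an assumption, and D = 10 days is an arbitrary example estimate. The output UOD is D ± 20%, as assumed in the text.

```python
import numpy as np

def tri(x, a, b, c):
    # Triangular membership function with feet at a, c and peak at b.
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def mamdani_duration(wc, D=10.0):
    # Input fuzzy sets for weather condition over [0, 1]
    mu_bad = tri(wc, -0.5, 0.0, 0.5)
    mu_med = tri(wc, 0.0, 0.5, 1.0)
    mu_good = tri(wc, 0.5, 1.0, 1.5)
    # Output universe of discourse: D +/- 20%
    y = np.linspace(0.8 * D, 1.2 * D, 201)
    long_mf = tri(y, 1.0 * D, 1.2 * D, 1.4 * D)
    med_mf = tri(y, 0.9 * D, 1.0 * D, 1.1 * D)
    short_mf = tri(y, 0.6 * D, 0.8 * D, 1.0 * D)
    # Rules: bad -> long, medium -> medium, good -> short (min-implication),
    # aggregated by max, then defuzzified by the centroid method.
    agg = np.maximum.reduce([
        np.minimum(mu_bad, long_mf),
        np.minimum(mu_med, med_mf),
        np.minimum(mu_good, short_mf),
    ])
    return float((agg * y).sum() / agg.sum())
```

Bad weather (wc near 0) inflates the defuzzified duration above D, good weather deflates it, which is exactly the up/down shifting of activity durations that feeds the TCT profile.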

14.3.2 IFAG for a Case Study


The project network shown in Fig. 14.5 is taken as a test problem with time-cost options
as illustrated in Table 14.1. However, the activity work contents listed in Table 14.1 are
not considered here, as resources are assumed to be sufficiently available.

Fig. 14.6 Initial population

Fig. 14.7 Intermediate improvements in the tradeoff points and convex hull

Fig. 14.8 Tradeoff points and convex hull of the final population

Fig. 14.9 Membership curves for weather condition (fuzzy sets: VeryBad, Bad, Medium,
Good, & VeryGood)

Fig. 14.10 Membership curves for activity duration (fuzzy sets: VerySmall, Small, Small-
Medium, Medium, LongMedium, Long, VeryLong)

14.3.2.1 Computational Results


IFAG starts by generating a TCT profile using the time-cost data of Table 14.1 with
project uncertainties set to (ME = 0.5, LS = 0.5, WC = 0.5) using its ANN-GA module.
This is equivalent to running IFAG without considering project uncertainties.
The corresponding result (a TCT profile), shown in Fig. 14.11 and Table 14.3, is
termed the normal TCT profile. Thereafter, we input project uncertainties at the user
interface by changing the values of ME, LS, and WC, which causes the normal
TCT profile to vary up and down, as it is sensitive to different values of ME, LS,
and WC. If the values of these linguistic variables are greater than (0.5, 0.5, 0.5),
i.e., better than normal conditions, the profile moves towards the coordinate axes, i.e.,
project duration and cost are reduced, and vice versa. The following tables present
the important results, wherein project cost is in $(1.0e+005) and project time is in
days. Table 14.4 depicts the pessimistic case wherein the values of ME, LS, and WC
worsen (ME = 0.3, LS = 0.3, WC = 0.3); these values fall below the normal ones.
As expected, in this case the whole TCT profile shifts upwards, i.e., it moves away
from the coordinate axes (Figure 14.11). Along similar lines, Table 14.5 illustrates the
optimistic case; the TCT profile moves towards the coordinate axes (Figure 14.11).

Table 14.3 Project Cost and Time Under Normal Condition

Project Time 169 162 159 152 133 121 116 108 104
Project Cost 98040 98370 98520 99630 101670 104360 107120 122540 137700

Table 14.4 ME = 0.3, LS = 0.3, WC= 0.3

Project Time 185 152 146 145 138 117 104 103 102
Project Cost 110650 111450 112350 112630 114060 118610 154250 158750 166440

Table 14.5 ME = 0.9, LS = 0.9, WC= 0.9

Project Time 151 125 121 117 117 105 102


Project Cost 73550 76680 77550 78440 79280 87170 97910

Fig. 14.11 TCT Profiles under different values of linguistic variables (ME/LS/WC =
0.5/0.5/0.5, 0.9/0.9/0.9, 0.8/0.6/0.9, 0.3/0.3/0.3, and 0.2/0.3/0.4)

The responsiveness of the TCT profile for scenarios such as (ME = 0.2, LS = 0.3,
WC = 0.4) and (ME = 0.8, LS = 0.6, WC = 0.9) is consistent (Figure 14.11).
Further, the values of the linguistic variables are taken in mixed ways, i.e., some
above the normal conditions and some below them. The TCT profile under normal
conditions (ME, LS, and WC as 0.5, 0.5, and 0.5 respectively) is shown in Figure 14.12
for a run different from the earlier one. For (ME, LS, WC) = (0.8, 0.4, 0.7), the
project duration and cost obtained are shown in Table 14.6 and Figure 14.12.
Table 14.7 depicts the case (ME = 0.3, LS = 0.8, WC = 0.2). The results are shown in Figure 14.12 for

Table 14.6 ME = 0.8, LS = 0.4, WC= 0.7

Project Time 154 145 137 129 119 110 106 105 103
Project Cost 99720 100410 101970 103880 108190 126740 143980 148630 158120

Table 14.7 ME = 0.3, LS = 0.8, WC= 0.2

Project Time 187 178 176 159 157 122 119 117 115
Project Cost 108280 109100 109280 112440 113280 117640 131100 141300 148830

Fig. 14.12 TCT profiles under different values of linguistic variables (ME/LS/WC =
0.5/0.5/0.5, 0.8/0.4/0.7, and 0.3/0.8/0.2)

comparison purposes. IFAG provides project managers with a comprehensive tool for
analyzing their time-cost optimization decisions in a flexible and realistic manner.

14.4 Hybrid Meta Heuristic


Lastly, we present a hybrid meta heuristic technique for solving the multiobjective
discrete TCT problem, which is known to be NP-hard. HMH hybridizes a multiobjective
GA with simulated annealing, and is apposite for problems where the
generation of the complete Pareto front, a TCT profile in this case, is essential for a
decision-maker. We validated HMH on two standard test problems of MOO. We
also present two case studies of discrete TCT solved using HMH. As
mentioned, the robustness of GAs is greatly enhanced when they are hybridized with
other meta heuristics such as simulated annealing, tabu search, etc. Yip and Pao [29]
employed a hybrid simulated annealing and simulated-evolution-based evolutionary
algorithm to solve the traveling salesman problem for single-objective optimization; the
design and development of HMH for multiobjective optimization of the discrete
TCT problem is motivated by this work.

It is important to mention here that the HMH presented in this work is unconventional
in its working (fitness function evaluation, etc.) in comparison to existing
multiobjective evolutionary algorithms (MOEAs), well surveyed in [11]. HMH
suits our problem of searching for the optimal TCT profile well. The Pareto front
solutions obtained from HMH are diverse enough from a project-expediting viewpoint;
in fact, they include almost all the relevant solution points on the Pareto front
through which project compression needs to be carried out by the decision-maker.
We have not incorporated a mechanism to preserve diversity except for generating two
non-random extreme solutions. Further, HMH is meant to work for two objectives
only, and for problems involving a convex Pareto-optimal front. Apart from diversity
preservation, convergence to the true Pareto front is another important issue in MOEAs
[11]. In most real-world projects the true Pareto front (the optimal TCT profile) is
unknown, so metrics that measure the extent of convergence to a known
set of Pareto-optimal solutions cannot be used for our problem. However, HMH is
validated on two standard test problems involving a convex Pareto front [11].
The proposed HMH incorporates the concept of Pareto optimality to evolve a
family of nondominated solutions distributed along the TCT profile, hence eliminating
the need to aggregate multiple objectives into a compromise function.
HMH embeds simulated annealing in the GA to decide the number of children to be
generated from the parents of the next generation. The algorithm is general enough to
incorporate the various aspects of time-cost relationships as per the given specifications
of real-life networks, such as linearity of time-cost relationships or continuous
mapping between time and cost of activities. The mathematical description of the
discrete TCT problem is presented in Subsection 1.1.
The preliminaries and definitions needed to understand HMH are explained concisely
below. Some of these definitions are similar to those given in Subsection 2.2; the
changes in the definitions are attributed to the discrete version of TCT. Identical
definitions are not repeated here.

A. Structure of a Solution
A solution here is a string which represents an instance θ of the project schedule
(Fig. 14.2); each element ti of an n-tuple string, T, can assume any value from the
set {tij}, j = 1, . . . , pi. The associated cost cθ and duration tθ of each individual
string are determined in the usual manner.
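The evaluation of cθ and tθ can be sketched as follows. This is an illustrative sketch only: it assumes an activity-on-node representation of the precedence network (the chapter's network may be activity-on-arrow), and the function name and data layout are hypothetical.

```python
from functools import lru_cache

def schedule_cost_and_time(durations, costs, successors):
    # c_theta: total direct cost of the selected time-cost options.
    # t_theta: project duration, i.e. the longest path through the
    # (acyclic) precedence network; successors[i] lists the activities
    # that must follow activity i.
    c_theta = sum(costs)

    @lru_cache(maxsize=None)
    def path_len(i):
        # Longest-path length starting at activity i
        return durations[i] + max((path_len(j) for j in successors[i]), default=0)

    t_theta = max(path_len(i) for i in range(len(durations)))
    return c_theta, t_theta
```

For a two-activity chain with durations (3, 4) and costs (10, 20), this returns cθ = 30 and tθ = 7.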

B. Initial Population
The initial population, consisting of np solutions, is generated by randomly selecting
(np − 2) individual strings from the feasible search space, i.e., each ti of a string is
chosen randomly from the set {tij}, j = 1, . . . , pi. The remaining two strings are added
non-randomly as follows: Tmax and Tmin are two strings added such that every activity
i has duration tmaxi = max∀j{tij} and tmini = min∀j{tij}, respectively.
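The construction above can be sketched as follows (a minimal Python sketch; the chapter's implementation is in MATLAB, and the function name here is illustrative):

```python
import random

def initial_population(options, n_p):
    # options[i] = the feasible duration set {t_ij} of activity i.
    # (n_p - 2) random strings, plus the two non-random extremes.
    pop = [[random.choice(opts) for opts in options] for _ in range(n_p - 2)]
    pop.append([max(opts) for opts in options])   # Tmax: all-longest durations
    pop.append([min(opts) for opts in options])   # Tmin: all-shortest durations
    return pop
```

The two extreme strings anchor the ends of the TCT profile, which is why they are inserted deterministically rather than left to chance.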
See Subsections 2.2C and 2.2E for definitions of TCT profile/convex hull and
crossover respectively.

C. Distance Measurement
The distance dw of an individual solution point in a population is determined by
calculating the minimal Euclidean distance dwv between the w-th solution point
and each segment v of the convex hull, i.e., dw = min∀v (dwv) (Figure 14.4).
Solutions with a lower value of this distance are considered fitter than those with a
larger value.
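The distance computation dw = min∀v (dwv) can be sketched directly (function names are illustrative):

```python
import math

def point_segment_distance(p, a, b):
    # Minimal Euclidean distance from point p to the segment ab.
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:                    # degenerate segment
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment, clamped to its endpoints
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_hull(p, hull_vertices):
    # d_w = min over hull segments v of d_wv
    return min(point_segment_distance(p, hull_vertices[v], hull_vertices[v + 1])
               for v in range(len(hull_vertices) - 1))
```

Here the hull is given as an ordered list of (time, cost) vertices along the lower envelope of the population.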

D. Mutation
The mutation operator modifies a randomly selected activity of a string with a
probability mr; that is, (mr × |F|) strings undergo mutation. The operator works
on a given string as follows. For the values of a string T represented by ti,
i = 1, . . . , n, a random number q, 1 ≤ q ≤ n, is generated as the location of the gene
to be mutated. Another random value tq', with tminq ≤ tq' ≤ tmaxq and tq' ≠ tq,
is generated, and tq' replaces tq.
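A sketch of this operator on a single string (illustrative names; option sets replace the [tmin, tmax] interval since the problem is discrete):

```python
import random

def mutate(string, options):
    # Replace the duration of one randomly chosen activity q with a
    # different feasible value from its option set {t_qj}.
    q = random.randrange(len(string))
    alternatives = [t for t in options[q] if t != string[q]]
    if alternatives:                       # an activity may have a single option
        string[q] = random.choice(alternatives)
    return string
```

Excluding the current value guarantees that a mutated gene actually changes, matching the tq' ≠ tq condition in the text.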

E. Simulated Annealing
Simulated annealing (SA) is a popular search technique which imitates the cooling
process of material in a heat bath. SA as a stochastic optimization method was introduced
in the context of minimization problems by Kirkpatrick et al. [30]. It is a global
optimization method that can escape local optima. Starting from
an initial configuration, the SA algorithm generates new configurations at random
from the neighborhood of the original configuration. If the change results in a better
configuration, the transition to the new configuration is accepted; otherwise it is
accepted with a Boltzmann probability factor. This probability factor is regulated by
a parameter called temperature (temp) and provides a mechanism for accepting a bad
move. In the initial iterations (temp = temp0) this probability is high (almost one),
and as the temperature is subsequently lowered using a cooling ratio (cool r) it comes
down to almost zero in the final stage of iterations (temp = temp f).
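The Boltzmann acceptance rule and geometric cooling schedule described above can be sketched as (illustrative names; delta is the worsening of the objective):

```python
import math
import random

def boltzmann_accept(delta, temp):
    # Downhill (improving) moves are always accepted; uphill moves are
    # accepted with probability exp(-delta / temp).
    if delta <= 0:
        return True
    return random.random() < math.exp(-delta / temp)

def cooled(temp0, cool_r, k):
    # Geometric cooling, as in the text: temp is multiplied by cool_r
    # once per generation.
    return temp0 * cool_r ** k
```

At high temperature almost every move is accepted (exploration); as temp falls towards temp_f, the acceptance probability for bad moves decays towards zero (exploitation).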

14.4.1 Working of Hybrid Meta Heuristic


HMH begins by generating an initial population of np solutions, which are referred
to as parents. Initially each parent u is allowed to produce child num(u) = nc/np
children, where nc is the total number of children produced in a generation;
this number is suitably chosen so that the search space can be extensively
scanned for the selection process to follow. A parent produces a child by crossing
over with a randomly selected string from the remaining population of parents.
Thus child num(u) children are produced by repeating this process the required
number of times for each parent. A parent together with its children
constitutes a family; all the solutions in a family are referred to as members of the
family. Thus, throughout the procedure, np families exist in a population. In the
initial generation, each family has a single parent.
The procedure followed in each iteration is explained below.

The Pareto front (or TCT profile) of the generation, i.e., of all the parents together
with their children, is determined; it represents the nondominated set of the generation.
Thereafter, a convex hull that encloses all members of the population from
below is drawn. The basic idea is that the smaller the distance of an individual
within a generation from the convex hull, the better its fitness with respect to either
or all of the objectives (Figure 14.4).
For each family u, its members on the Pareto front are counted (par num(u), u =
1, . . . , np). These par num(u) members become the parents for the u-th family.
However, it is important to note that if no member of a family u appears on the Pareto
front, the family is not rejected altogether; in the hope of its future improvement,
the member of that family which is 'nearest' to the Pareto front is selected to be the
parent for the next generation. This 'nearness' is measured by the distance function
described in Section 4. The importance of the number par num(u) is twofold. Firstly,
it determines the parents for the next generation chosen from each family. This is
how elitism is incorporated in the algorithm, which helps it converge closer to the
Pareto-optimal front. Elites of the current population are given an opportunity to be
carried over directly to the next generation; therefore, a 'good' solution found in
a current generation will never be lost unless a better solution is discovered. The
absence of elitism does not guarantee this feature. Importantly, the presence of elites
enhances the probability of creating better offspring. Secondly, it helps keep a
'good' distribution of solutions over the Pareto front.
The next step is to decide the number of children, child num(u), u = 1, . . . , np,
allocated to each family in the next generation. This number indicates how good the
family's region of the search space is. To accomplish this, the distance measure
defined in Section 4 is used, which measures the nearness of each member of a
family to the Pareto front. To find child num(u), simulated annealing
has been incorporated into the selection process, as described in the procedure
find num() given in Subsection 4.1.1. It first counts the members of the family
which satisfy the Boltzmann criterion. Clearly, the number child num(u) is proportional
to the number of members of the u-th family which are closer to the convex hull.
Further, par num(u) also plays a direct role in measuring the fitness of each family u;
that is, the number of children to be produced in the next generation by family u
is determined by par num(u) plus the number of family members who qualify under the
Boltzmann criterion. This is natural, as these par num(u) members are on the TCT
profile.
The next step is the generation of children by each family. As mentioned earlier,
initially each family has a single parent, but in subsequent generations the number
of parents per family may be more than one (as par num(u) ≥ 1 for the families
whose members are on the Pareto front). In such a case, the number child num(u) is
divided almost equally among these par num(u) parents for producing the children.
The method of producing children by any parent is the same as that explained for the
initial generation. Mutation is then applied to randomly selected strings of the population,
and the temperature is cooled down. The process is repeated until no improvement
is observed in the TCT profile for a specified number of generations. The algorithm
is thus able to search for the best family in the evolution process.

14.4.1.1 Pseudocode of HMH


A step-wise pseudocode of HMH follows.
Step 1: Set the initial temperature temp = temp0 and the parameter improve iter.
        Set np and nc.
        Set Gen = 1.
Step 2: Select (np − 2) parents randomly, and add the remaining two parents by taking
        the shortest and the longest durations of the activities.
Step 3: For u = 1 to np do child num(u) = nc/np and par num(u) = 1.
Step 4: Generate nc children from the parents, with parent(u) producing
        child num(u) children. This creates np families consisting of parents and
        their corresponding children.
Step 5: Determine the TCT profile and convex hull of the existing generation, i.e., the

            ∑_{u=1}^{np} [child num(u) + par num(u)]                (14.3)

        strings that constitute a generation.

For each family(u), u = 1, . . . , np, do steps 6 to 8.


Step 6: Find the number of members appearing on the obtained non-dominated
        front, i.e., par num(u). These members become the parents for the next
        generation.
Step 7: For each member w, w = 1, . . . , par num(u) + child num(u), compute dw
        by the distance function dw = min∀v (dwv), where dwv is the distance of the
        w-th member from the v-th line segment of the convex hull (Figure 14.4).
Step 8: Determine the number of children child num(u) that the family will generate
        in the next generation, as detailed in Procedure find num().
Step 9: The parents (those found in Step 6) produce child num(u) children by
        crossing over randomly with the other members of the obtained non-dominated
        front. If no member of a family appears on the front, then the
        new parent for this family is decided as follows: the string having the best
        fitness value among all the members of the family becomes the parent for the
        next generation.
Step 10: Apply mutation.
Step 11: Gen = Gen + 1.
Step 12: Decrease the temperature: temp = temp ∗ cool r.
Step 13: Repeat steps 5 to 12 until Gen ≥ Max iter or the TCT profile remains
        identical for improve iter generations.
Procedure find num()
Step 1: sum = 0;
Step 2: for u = 1 to np do accept(u) = 0;
Repeat steps 3 to 5 for each family(u).
Step 3: Repeat step 4 for each member of the family.

Step 4: If the member is not in the Pareto front,
        then if exp(−dw/temp) > ρ, then accept(u) = accept(u) + 1;
        end
Step 5: sum = sum + accept(u) + par num(u);
Step 6: for u = 1 to np do child num(u) = (nc × accept(u))/sum

14.4.2 HMH Approach for Case Studies


Many test cases were generated to validate the efficiency and accuracy of HMH;
two case studies of discrete TCT are detailed in this section. First,
a project network (Fig. 14.5) is considered with time-cost options of the different
activities (time in days and cost in $) as shown in Table 14.8 [12].

Table 14.8 Options of First Test Problem

ID Time Cost ID Time Cost ID Time Cost ID Time Cost


1 14 2400 5 24 17500 9 20 180 14 9 3000
1 15 2150 5 28 15000 9 23 150 14 15 2400
1 16 1900 5 30 10000 9 25 100 14 18 2200
1 21 1500 6 14 40000 10 15 450 15 16 3500
1 24 1200 6 18 32000 10 22 400 16 20 3000
2 15 3000 6 24 18000 10 33 320 16 22 2000
2 18 2400 7 9 30000 11 12 450 16 24 1750
2 20 1800 7 15 24000 11 16 350 16 28 1500
2 23 1500 7 18 22000 11 20 300 16 30 1000
2 25 1000 8 14 220 12 22 2000 17 14 4000
3 15 4500 8 15 215 12 24 1750 17 18 3200
3 22 4000 8 16 200 12 28 1500 17 24 1800
3 33 3200 8 21 208 12 30 1000 18 9 3000
4 16 35000 8 24 120 13 14 4000 18 15 2400
4 20 30000 9 15 300 13 24 1800 18 18 2200
5 22 20000 9 18 240

Experiments are performed to select the SA and GA parameters. The SA parameters,
initial temperature (temp0), cooling ratio (cool r), and final temperature (temp f),
are chosen as 100, 0.85, and 0.1 respectively. We initially experimented with cool r
= 0.75, 0.8, 0.85, and 0.9, and found that the value 0.85 gave the best results.
Similar experiments were conducted to decide the parameters temp0 and temp f,
which would ensure faster convergence to the final Pareto front. The GA parameters,
initial population np, the ratio nc/np, and mutation rate mr, are selected as 60, 8,
and 0.02 respectively. We illustrate one such selection: to decide np, experiments
are done with different values of np ranging from 20 to 100. For each value of np,
50 trials are conducted, keeping the other parameters constant. The average time to
converge to the final Pareto front is reported in Table 14.9. The results indicate that
for np = 60, convergence to the final Pareto front is fastest. In addition, the

Table 14.9 Selection of Initial Population

np 20 40 60 80 100
Average time(sec) for 10 runs 20.95 17.85 9.33 17.44 18.74

Table 14.10 Options of Second Test Problem

Activity Duration Cost Activity Duration Cost Activity Duration Cost


A 5 480 E 12 1860 F 19 2000
6 300 13 1450 G 13 1900
B 9 450 14 1050 14 1200
C 12 850 F 16 3860 H 7 950
13 600 17 3220 8 640
D 15 420 18 2600 I 9 560

search is set to terminate when the TCT profile does not change over five consecutive
iterations (found to be a sufficient number). These parameters are used for all
the experiments with HMH on the test problems.
An initial generation of np strings is randomly selected and nc children are produced.
Results of a typical run of HMH for this test problem follow. It can be seen
(Figure 14.13) that the initial generation is well distributed over the solution space.
Figure 14.14 illustrates the intermediate improvements. In succeeding iterations
HMH searches for the optimal TCT profile. Figure 14.15 depicts the tradeoff points
of the final generation population. Since the tradeoff points do not improve further,
these points are concluded to be the best points obtained. On average, it takes
6 iterations for HMH to find the best possible TCT profile for this test
problem.
Interestingly, HMH achieves good efficiency, as it finds a Pareto-optimal
front after examining an extremely small fraction of the possible solutions. For
the project network of Figure 14.5, the total number of possible schedules is 4.72 ×
10^9, whereas HMH (on an average of 50 runs) examined only 3600 (180 × 20)
possible schedules to converge to the best possible TCT profile, an extremely
small fraction (0.00007627%) of the solution space. The TCT profile of the
final generation obtained by HMH is compared with the analytical results obtained
from exhaustive enumeration. HMH performs very well in terms of accuracy, as it is
able to find 95% of the optimal solutions on the TCT profile (on an average of 50
runs of HMH). Further, on visually comparing the HMH results with the GA-based MOO
results [12] for the same test problem (Figure 14.5), HMH turns out to be better
in terms of both degree of convergence to the true Pareto front and diversity of
solutions.

Fig. 14.13 Initial population with tradeoff points and convex hull

Fig. 14.14 Intermediate improvements

The second problem involves an adaptation of the 9-activity network (Figure 14.16
and Table 14.10) from [31]. The resources are assumed to be available without
constraints. The HMH parameters chosen for this problem are the same as mentioned
earlier. An initial generation of np strings is randomly selected and nc children are
produced. The initial generation is found to be well diversified over the solution space
for this test problem as well. The diversity of solutions is further maintained through
the intermediate improvements in the tradeoff points as these move towards the axes.
Figure 14.17 illustrates the initial population, and Figure 14.18 presents the tradeoff
points and the convex hull for the final generation population. HMH again proves
itself in terms of both efficiency and accuracy when its results are compared with
those obtained by exhaustive enumeration.

Fig. 14.15 Tradeoff points and convex hull of final generation population

Fig. 14.16 Network of the second test problem

Fig. 14.17 Initial population

Fig. 14.18 Tradeoff points and convex hull of final generation population

14.4.3 Standard Test Problems


To test and validate HMH, two standard test problems involving convex Pareto-optimal
fronts from [11] are successfully attempted here. On visualizing the results,
it is clear that HMH produces Pareto solutions that are good from both diversity
and convergence viewpoints.

14.4.3.1 Schaffer’s Two Objective Problem

This problem has two objectives, which are to be minimized:

    SCH1:  Minimize f1(x) = x^2,
           Minimize f2(x) = (x − 2)^2,
           −A ≤ x ≤ A.

This problem has Pareto-optimal solutions x* ∈ [0, 2], and the Pareto-optimal set is a
convex set: f2* = (sqrt(f1*) − 2)^2 in the range 0 ≤ f1* ≤ 4. Different values of the
bound parameter A are used in different studies; values as low as A = 10 and as high
as A = 10^5 have been used. Figure 14.19 shows the first generation tradeoff points
and convex hull along with the population. The tradeoff points and convex hull of the
final generation, reached in only the 3rd iteration, are shown in Figure 14.20, and it
can be clearly seen that the obtained non-dominated front matches the known
Pareto-optimal front well.
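The SCH1 definitions above can be sketched and checked directly (illustrative function name):

```python
import math

def sch1(x):
    # Schaffer's two-objective problem; both objectives are minimised.
    return x ** 2, (x - 2) ** 2

# On the Pareto-optimal set x* in [0, 2], the front satisfies
# f2* = (sqrt(f1*) - 2)^2:
for x in (0.0, 0.5, 1.0, 1.5, 2.0):
    f1, f2 = sch1(x)
    assert abs((math.sqrt(f1) - 2) ** 2 - f2) < 1e-12
```

The endpoints x = 0 and x = 2 give the two extreme points (0, 4) and (4, 0) of the front; everything between them trades one objective against the other.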

14.4.3.2 Zitzler-Deb-Thiele’s 1st (ZDT1) Test Problem


ZDT1 has two objectives to be minimized. In its general form the problem is:

    Minimize f1(x),
    Minimize f2(x) = g(x) h(f1(x), g(x)).

Other ZDT test problems vary in the way the three functions f1(x), g(x), and h(x)
are defined. ZDT1 is:

    ZDT1:  f1(x) = x1,
           g(x) = 1 + (9 / (n − 1)) ∑_{i=2}^{n} xi,
           h(f1, g) = 1 − sqrt(f1 / g).
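The ZDT1 definitions above can be sketched as (illustrative function name; the decision vector is a plain list):

```python
import math

def zdt1(x):
    # ZDT1 with n = len(x) decision variables, each in [0, 1];
    # both objectives are minimised.
    f1 = x[0]
    g = 1.0 + 9.0 * sum(x[1:]) / (len(x) - 1)
    f2 = g * (1.0 - math.sqrt(f1 / g))
    return f1, f2
```

On the Pareto-optimal front (xi = 0 for i ≥ 2), g = 1 and the objectives satisfy f2 = 1 − sqrt(f1); any positive tail variables push g above 1 and the point away from the front.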

The problem has 30 variables, which lie in the range [0, 1]. It has a convex Pareto-optimal
region that corresponds to 0 ≤ x1* ≤ 1 and xi* = 0 for i = 2, 3, . . . , 30. In
this problem, the Pareto-optimal front is formed with g(x) = 1. Figure 14.21 shows
the first generation tradeoff points and the convex hull along with the population. The
final generation tradeoff points and convex hull are shown in Figure 14.22;
importantly, the obtained non-dominated front matches the known Pareto-optimal front
fairly well. Further, it has a good distribution of non-dominated solutions across
the front. HMH is, therefore, efficient and accurate in tackling a large number of

Fig. 14.19 First generation tradeoff points (NDS) and convex hull

Fig. 14.20 Final tradeoff points (NDS) and convex hull

decision variables. HMH has performed extremely well on the above standard
test problems. The non-dominated solutions have converged very close to the known
Pareto-optimal fronts, and it can be seen that the obtained non-dominated solutions
maintain good diversity. All the test problems presented in this work were run on an
HP Intel(R) Pentium(R) 4 CPU with a 3.2 GHz processor and 1 GB RAM. The
procedures are coded in MATLAB 7.0 and tested under Microsoft Windows XP
Professional version 2002.
354 S. Srivastava, B. Pathak, and K. Srivastava

[Figure: ZDT1 test problem — population, tradeoff points (NDS) and convex hull plotted in the f1–f2 plane]

Fig. 14.21 First generation tradeoff points (NDS) and convex hull

Fig. 14.22 Final tradeoff points (NDS) and convex hull
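The tradeoff points (NDS) plotted in Figures 14.19–14.22 are simply the non-dominated members of the current population. A minimal filter for the two-objective minimization case (our own sketch; the function name is hypothetical):

```python
def nondominated(points):
    """Keep the (f1, f2) points not dominated by any other point (both minimized)."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

front = nondominated([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (2.0, 3.0)])
# front == [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
```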

14.5 Conclusions
ANNHEGA amalgamates ANN models and a heuristic technique with a GA in a
unique way to solve the resource-constrained nonlinear TCT problem, and becomes
a powerful multiobjective optimization method without losing its simplicity. The
method makes TCT analysis more realistic by adding two important dimensions to
it: first, any existing arbitrarily shaped time-cost relationship can be dealt with
using its ANN module; second, the heuristic module accounts for a constrained
resource in the TCT analysis. The feasibility of ANNHEGA is shown
through an illustrative test case. An additional outcome of this work is that it delivers
the lowest limit of the constrained resource beyond which project expediting is not
feasible; this information is important for the schedule planner. An ANNHEGA-based
system can help to monitor and control a project in the most cost-effective way in
real time, and one can choose the best alternative over the RCNTCT profile to
execute the project. There are interesting future extensions of this work: more than
one constrained resource can be incorporated in the system, and other precedence
relationships may be considered.
IFAG is presented to carry out the sensitivity analysis of nonlinear TCT profiles
with respect to real-life project uncertainties. The fuzzy-logic framework facilitates
(1) the representation of imprecise activity durations as well as activity costs; (2) the
estimation of a new time-cost pair for each activity based on the input uncertainties;
and (3) the interpretation of the fuzzy results in crisp form. A case study is solved
using IFAG to demonstrate how it works. The method provides project managers with
a comprehensive tool for analyzing their time-cost optimization decisions in a more
flexible and realistic manner. In future work we intend to investigate the
responsiveness of the RCNTCT profile to project uncertainties.
HMH is a new MOO method that combines a genetic algorithm and simulated
annealing to solve the TCT problem, incorporating the concept of Pareto optimality
to evolve a family of nondominated solutions distributed well along the TCT profile.
Two case studies of discrete TCT are solved using HMH to illustrate its performance;
HMH can discover near-optimal solutions after examining an extremely small
fraction of the possible solutions. HMH is also tested on two standard MOO test
problems to validate its performance. HMH suits our problems well; however, from
an algorithmic viewpoint we intend, as part of future work, to (1) incorporate a
mechanism to preserve diversity in the algorithm, (2) compare it with standard
MOEAs such as NSGA-II, SPEA2, and PAES, using metrics that evaluate diversity
and convergence properties, and (3) enhance it to handle more than two objectives.
An obvious further extension of our work is to use HMH in place of the GA in
ANNHEGA to solve RCNTCT. Similarly, the sensitivity analysis of TCT profiles can
be investigated using HMH along with fuzzy logic and ANNs. HMH may be further
explored for solving other complex MOO problems.

Chapter 15
Systolic VLSI and FPGA Realization of
Artificial Neural Networks

Pramod Kumar Meher

Abstract. Systolic architectures are established as a widely popular class of VLSI
structures for repetitive and computation-intensive applications, owing to the
simplicity of their processing elements (PEs), modularity of design, regular
nearest-neighbor interconnections between the PEs, high level of pipelinability, small
chip area and low power consumption. In a systolic array, data is pumped
rhythmically at regular intervals across the PEs to yield high throughput by fully
pipelined processing. Systolic arrays significantly reduce the I/O bottleneck by
feeding the data at the chip boundary and pipelining it across the structure, and the
extensive reuse of data within the array allows a large volume of computation to be
executed with only a modest increase in bandwidth. Since FPGA devices consist of
regularly placed, interconnected logic blocks, they closely resemble systolic
processors; the systolic computation within the PEs can therefore easily be mapped
to the configurable logic blocks of an FPGA device. Interestingly, artificial neural
network (ANN) algorithms are also quite suitable for systolic implementation due to
their repetitive multiply-accumulate behaviour. Several variations of one-dimensional
and two-dimensional systolic arrays are reported in the literature for the
implementation of different types of neural networks, and special-purpose systolic
designs for various ANN-based applications relating to pattern recognition and
classification, adaptive filtering and channel equalization, vector quantization, image
compression and general signal/image processing have been suggested in the last two
decades. This chapter is devoted to systolic architectures for the implementation of
ANN algorithms on custom VLSI and FPGA platforms. The key techniques used for
the design of the basic systolic building blocks of ANN algorithms are discussed in
detail. Moreover, the mapping of fully connected unconstrained ANNs, as well as
multilayer ANN algorithms, onto fully pipelined systolic architectures is described
with generalized
Pramod Kumar Meher
Department of Embedded Systems,
Institute for Infocomm Research, 1 Fusionopolis Way, Singapore-138632
e-mail: pkmeher@i2r.a-star.edu.sg

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 359–380.
springerlink.com c Springer-Verlag Berlin Heidelberg 2010
360 P.K. Meher

dependence graph formulation. A brief overview of systolic architectures for
advanced ANN algorithms for different applications is presented at the end.

15.1 Introduction
Over the years, the artificial neural network (ANN) has not only become increasingly
popular due to its adaptive and non-linear capabilities, but has also been established
as a potent intelligent tool in virtually every area of technology for solving ill-posed
problems where conventional techniques fail to be effective. ANN algorithms are,
however, computation-intensive, and their computational complexity increases with
the number of inputs in a training pattern, the number of layers of neurons in a
multilayer network, and the number of neurons in the different layers. Apart from
that, the ANN algorithms of the training phase are iterative by nature, and require the
execution of several iterations to train the network. General-purpose computers
based on the sequential von Neumann architecture are found to be too slow for the
iterative training, particularly when the network consists of a large number of neurons
and multiple hidden layers. On the other hand, ANN algorithms are inherently
parallel, and each layer of a multilayer network can easily be implemented by a
separate pipeline stage. Attempts have therefore been made to exploit these features
of ANN algorithms by implementing them on single instruction-stream, multiple
data-stream (SIMD) machines and array processors [1, 2]. The SIMD configuration
has been considered a good choice for the implementation of these algorithms, as it
provides a large number of processing cells using a shared controller with minimal
programming and a low burden on the operating system. Real-time and embedded
systems, however, impose stringent limitations on the cost, size, power consumption,
throughput rate and computational latency of the neural algorithms. To fit into the
embedding environment, the computing structure very often needs to be small enough,
and at the same time it should meet the speed requirements of time-critical and
hard-real-time applications. Although general-purpose computers can execute the
ANN algorithms of small networks in software, it is essential to realize these
algorithms on a dedicated VLSI or field programmable gate array (FPGA) device to
meet the cost, size and timing requirements of embedded and real-time applications.
Conventional general-purpose machines and SIMD machines fall far too short of the
requirements and specifications of many such application environments. Several
attempts have therefore been made in the last two decades to realize ANNs in analog,
as well as digital, VLSI. There are two kinds of approaches to the hardware
implementation of ANN algorithms: the direct-design approach and the
indirect-design approach [3, 4]. In the direct-design approach, the neural algorithms
are directly mapped into dedicated hardware, while the indirect approach makes use
of the matrix-processing behaviour of the neural models.
The ANN algorithms are found to be well suited for systolic implementation due
to their repetitive and recursive behaviour. Several variations of one-dimensional
and two-dimensional systolic arrays are, therefore, reported for the implementation
15 Systolic VLSI and FPGA Realization of Artificial Neural Networks 361


Fig. 15.1 A generalized representation of a systolic architecture

of artificial neural networks [5]-[15]. Application-specific systolic architectures for
various ANN-based applications relating to pattern recognition and classification,
adaptive filtering and channel equalization, vector quantization, image compression
and general signal/image processing have come up in the last two decades.
The general organization of a systolic structure is shown in Fig. 15.1. It is a
network of simple and identical processing elements (PEs), arranged regularly in
an array and connected by localized communication links. Each PE computes
rhythmically and pumps data synchronously in and out, such that a regular flow of
data is maintained across the array for fully pipelined computation. Systolic
architectures are now established as one of the most popular classes of VLSI
structures for repetitive and computation-intensive applications due to the simplicity
of their PEs, the modularity of their structure, regular and nearest-neighbor
interconnections between the PEs, a high level of pipelinability, small chip area and
low power consumption [16]-[19]. Systolic array architectures significantly reduce
the I/O bottleneck by feeding the data at the boundary and pipelining it across the
structure, and the extensive reuse of data within the array allows a large volume of
computation to be executed with only a modest increase of bandwidth [17].
FPGA devices have progressed steadily and substantially not only in terms of
their logic and I/O resources but also in terms of their performance. The reusability
of these devices along with widely and freely available easy-to-use software tools for
simulation, synthesis, and place and route has made FPGA a preferred platform for
computation-intensive applications in signal processing and communication. FPGA
362 P.K. Meher


Fig. 15.2 A generalized structure of an FPGA device

implementation of several ANN-based applications has been reported in the
literature [20]-[22]. The generalized structure of an FPGA device is shown in Fig. 15.2. It
consists of regularly arranged programmable logic components called configurable
logic blocks (CLBs), and a hierarchy of reconfigurable interconnects that allow the
CLBs to be connected together to perform the necessary computations for specific
applications. From Fig. 15.1 and Fig. 15.2 we can see that the FPGA structure and the
systolic structure are quite similar, particularly in terms of the regularity and modularity
of the computing logic. It is therefore straightforward to map systolic
architectures, with their repetitive modular structure, onto FPGA platforms. In this
chapter, we discuss the basic design principles and development of systolic VLSI for
the ANN algorithms, which can easily be ported to FPGA platforms as well.
The rest of the chapter is organized as follows. In the next section, we discuss
the direct-design approach for the systolic implementation of ANN algorithms.
The basic design principles of mapping neural algorithms onto systolic array
architectures are discussed in Section 15.3. The systolic architectures for fully connected
unconstrained networks and for multilayer neural networks are discussed in
Section 15.4, and a brief overview of systolic architectures for advanced ANN
algorithms for different applications is placed at the end of that section. Conclusions
of the chapter are presented in Section 15.5.

15.2 Direct-Design of VLSI for Artificial Neural Network


In the direct-design approach, the computing structure directly follows the neural
computing model by straightforward mapping. These designs are tailored to
particular ANN models for specific applications, and aim at high-performance
processing of the algorithm by fully dedicated implementation. Unlike the structures
of the indirect designs, these structures use global connectivity and can therefore
be used only for networks of smaller dimensions. Although there are some optical
implementations of direct designs for certain neural models [23, 24], direct-design
electronic neural processors are relatively more prevalent. Several electronic
processors have been designed and developed using analog CMOS [25, 26] as well as
digital CMOS [3, 4] technologies for various dedicated applications, e.g., pattern
recognition, feature extraction and machine vision. Analog designs are
continuous-valued and may be implemented in continuous-time RC circuits or
discrete-time switched-capacitor circuits, while digital designs, by definition, are
discrete-time as well as discrete-valued. Although both the analog and the digital
approach have shown some success in specific application areas, each of them is also
associated with certain disadvantages. The salient features of analog neural
processors are discussed in the following.
The analog approach to implementing neural algorithms has considerable potential,
as it can be used for massive computation with low power dissipation, since
transistors in the sub-threshold region consume very little power. An analog neuron can
be implemented conveniently by a simple differential amplifier, where the synaptic
weights are taken care of by resistive circuit elements. Consequently, it is possible
to pack several neurons into a small analog chip. Analog circuits are capable of
processing more than 1 bit per transistor to yield a very high throughput, and the
analog approach consequently offers compactness of hardware and higher speed
compared with digital neural processors. The computation of the weighted sum is
performed conveniently in analog circuits either by the magnitude of currents or by
the amount of charge, while the addition operations of digital circuits introduce
substantial latency into the implementation of neural algorithms. Analog neural
networks also allow asynchronous updating of weights, which helps to realize very
high-speed processing, and the non-linear behaviour of electronic circuits facilitates
convenient implementation of the non-linear threshold functions at the outputs of the
neurons. The non-volatile storage of analog weights, however, sets a serious limitation
on the learning capabilities of analog processors. Therefore, the learning algorithm
may be executed in a companion digital signal processor chip on the printed
circuit board, so that the weight values are updated and stored conveniently in digital
memory; the updated weight values are dynamically transferred to and stored in the
capacitors formed by MOS transistors, with additional augmented capacitors, in the
analog circuit [27]. In spite of several encouraging features, analog neural
processors have only a limited domain of applicability due to some inherently associated
drawbacks:
1. Analog neural networks are more susceptible to noise, crosstalk, variations of
the power supply, and the effect of temperature.
2. Analog circuits are not perfectly reproducible, and their performance varies from
chip to chip.
3. The error voltage due to switching transients and the data loss due to leakage
currents tend to deteriorate the operational accuracy of analog neural processors.
4. Analog processors are used with only limited precision (usually 8-bit), because
the chip area increases with the precision.

In contrast, digital designs offer better tolerance to variations of the power supply,
ambient temperature, noise and crosstalk compared with analog processors.
Besides, they provide accurate storage of weights in digital memory cells, which
facilitates the updating of weights during learning. Compared with analog circuits,
digital designs also allow a flexible choice of word length depending on the precision
requirement. In addition, digital designs offer fast turn-around due to the availability
of advanced CAD software and better support for silicon fabrication. Digital
implementations, on the other hand, are relatively slow and involve more chip area than
their analog counterparts.

15.3 Design Considerations and Systolic Building Blocks for ANN

A neural network usually consists of a number of neurons with dense
interconnections among them. A variety of network topologies and interconnections of the
neurons have been suggested to perform a wide range of applications using different
learning mechanisms. Based on the type of connectivity, all these networks may be
put into two major categories. In one category, every neuron is connected to every
other neuron; this may be called a neural network of unconstrained connectivity.
The other category is the multilayer neural network, in which the neurons are arranged
in different layers: the neurons of a given layer are not connected with each other,
but each neuron in a layer is connected to every neuron in its adjacent layers through
connecting weights. Irrespective of the type, a neural network goes through a series
of iterations to arrive at the solution of a given problem. In general, the neural
algorithms are executed in two phases, the search phase and the learning phase [3]-[6].
In the search phase, all the neurons of the network iteratively update their activation
values, or states. Assuming unconstrained connectivity among the neurons, the
state/output of the i-th neuron at the (k + 1)-th iteration can be expressed in a
generalized form as:

xi (k + 1) = f (ui (k + 1), ui (k), xi (k)) for i = 1, 2, · · ·, N, (15.1)


where the internal activation of the neuron is given by

ui (k + 1) = ∑_{j=1}^{N} Wi j (k) · x j (k) + θi (k).    (15.2)

Wi j is the weight value of the connection running from the j-th neuron to the i-th
neuron, and θi (k) is the bias value associated with the i-th neuron. {Wi j } constitutes
the weight matrix W of size N × N, where N is the number of neurons in the network.
f is a non-linear function, usually called the threshold function or the activation
function, which may be a sigmoid function, a step function, a squashing function or
may be a stochastic function. In the learning phase, the neurons adaptively update
their weights by either supervised or unsupervised learning. In this chapter, we
focus on the VLSI implementation of neural networks based on the most basic
supervised learning rule [5], namely the Widrow-Hoff rule (also popularly known
as the delta rule), where the weights of the neurons are updated in every iteration of
learning according to

(Wi j )NEW = (Wi j )OLD + η · x j · (di − xi ) (15.3)


η in (15.3) is the learning rate, which determines the rate at which the weights
converge to their optimal values as well as the magnitude of the residual mean-square
error, and di is the desired target value specified for the i-th neuron for a given set of
inputs. The processing of each neuron (given by equations (15.1)-(15.3)) is performed
in two successive stages, as shown in Fig. 15.3.

[Figure: a first stage computes the neuron output from the input values and the weights; a second stage updates the neuron weights from the error values]

Fig. 15.3 Two-stage processing of a neuron
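The two stages of Fig. 15.3 can be written down functionally. The sketch below is our own NumPy illustration of equations (15.1)-(15.3), with the simplifying assumption that the activation depends only on ui (k + 1) and is taken as f(u) = tanh(u); the function names are ours, not from the text:

```python
import numpy as np

def search_phase(W, x, theta, f=np.tanh):
    # Eq. (15.2): internal activations u of all N neurons, computed in parallel,
    # then Eq. (15.1) simplified to x(k+1) = f(u(k+1))
    return f(W @ x + theta)

def learning_phase(W, x, d, eta=0.1):
    # Eq. (15.3): Widrow-Hoff (delta-rule) update of every weight W_ij
    return W + eta * np.outer(d - x, x)

W = np.array([[0.0, 0.5], [0.5, 0.0]])
theta = np.zeros(2)
x_new = search_phase(W, np.array([1.0, -1.0]), theta)       # stage 1: neuron outputs
W_new = learning_phase(W, x_new, d=np.array([1.0, -1.0]))   # stage 2: weight update
```

Both stages are fully parallel over the neurons; the data dependencies live entirely inside the matrix-vector product, which is why equation (15.2) becomes the systolic building block discussed next.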

The computation of the neuron output/state is performed in the first stage, followed by
the updating of the weights to be used for computing the output of the next iteration.
Besides, from equations (15.1)-(15.3) we can see that once the outputs of all the
neurons are available, it is fairly simple and straightforward to perform the weight
updating in a fully parallel computing section without any data dependencies. The
outputs of all the neurons for any given iteration can also be computed in parallel
according to equations (15.1) and (15.2), since there are no data dependencies between
them; data dependencies do exist, however, between the partial sums for computing
the outputs of the individual neurons, and these need to be taken care of in a parallel
implementation. Moreover, matrix-vector multiplication of the form of equation (15.2)
is frequently encountered in most ANN algorithms. (Note that for a multilayer
feed-forward network, as well as a network with back-propagation learning, we have
computations similar to those of the unconstrained network given in (15.2) for
computing the outputs of the neurons of a given layer and for updating the neuron
weights, where the connection weights run from the j-th neuron in the preceding layer
to the i-th neuron in the current layer.) Keeping this in view, we discuss here the
derivation of a systolic array for the computation of equation (15.2), which can be
used as a building block for the systolic implementation of many different types of
ANN. To derive the systolic architecture, let us represent equation (15.2) for any
given neuron at a specific iteration in a generalized matrix-vector product form:
yi = ∑_{j=1}^{N} Wi j · x j + θi .    (15.4)

A three-step systolic mapping procedure, namely

• representation of the computation in a locally recursive form,
• transformation of the recursive computation into a dependence graph (DG),
• appropriate mapping of the DG onto a suitable systolic array architecture,

is normally followed to derive a systolic array by DG formulation [5]. Since
equation (15.4) is already in a locally recursive form, we can derive a localized DG
directly from it, as shown in Fig. 15.4. It consists of N² identical nodes arranged
in N rows and N columns; the function of each node is depicted in Fig. 15.4(b).
The elements of the vector x = {x1 , x2 , · · · , xN } are fed to the N nodes of the first
row of the DG and move vertically down to the adjacent nodes, while the outputs of
the nodes of the different columns are transferred horizontally to the adjacent nodes
on their right. The desired matrix-vector product is obtained from the right-most
column of nodes of the DG.
The nodes of this DG can be projected vertically down to obtain a linear systolic
array (shown in Fig. 15.5) consisting of N locally connected PEs. The function of
the PEs is described in Fig. 15.5(b). Each PE of the structure performs one
multiplication and one addition during a clock period. The elements of the weight matrix W
are stored in N circular shift registers of size N, such that the i-th register stores
the elements of the i-th column of the weight matrix and feeds a weight value to the

[Figure: an N × N grid of nodes; x1 , . . . , xN enter at the top of the columns, θ1 , . . . , θN enter the rows from the left, and y1 , . . . , yN leave at the right. Node function: Xout ← Xin + Wi j · Yin ; Yout ← Yin ]
Fig. 15.4 DG for the computation of matrix-vector product of equation (15.4). (a) The DG.
(b) Function of each node of the DG
[Figure: a row of N PEs; x j enters PE j delayed by ( j − 1)Δ, the biases θ1 , . . . , θN enter PE 1, the i-th weight register feeds column i of W to PE i, and the outputs y1 , . . . , yN leave the right-most PE in successive cycles. PE function: Xout ← Xin + Win · Yin ]
Fig. 15.5 The linear systolic array for the computation of matrix-vector product. (a) The
systolic array. (b) Function of each PE of the structure

i-th PE in each clock cycle. The elements of the input vector x are fed to N different
PEs, as shown in Fig. 15.5(a), such that the input to a PE is staggered by one clock
cycle relative to the adjacent PE on its left. The elements of the input vector x (once
loaded into the PEs) stay in their respective PEs throughout the computation, while
the computed output of each PE is transferred to its neighbouring PE on the right.
The first output value of the array is obtained from the right-most PE after N cycles,
while the remaining N − 1 output values are obtained in the next N − 1 cycles, the
duration of a clock period being T = TM + TA , where TM and TA are, respectively,
the times needed to perform one multiplication and one addition in a PE. The latency
of the structure is thus N cycles, and its throughput is one output per cycle. It has all
the advantages of systolic design, but the output has to be demultiplexed so that it can
be stored in separate registers for use in the next iteration. For a large ANN the time
required for demultiplexing is large, and the demultiplexer involves considerably
high area complexity and a lot of additional interconnections.
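The timing just described (inputs staggered by one cycle, partial sums marching one PE to the right per cycle, output yi emerging from the right-most PE at cycle N − 1 + i with 0-indexing) can be checked with a small software model. The sketch below is our own cycle-accurate illustration, not code from the chapter:

```python
def systolic_matvec(W, x, theta):
    """Cycle-accurate model of the linear systolic array of Fig. 15.5.
    PE j holds x[j] stationary; the partial sum of output i reaches PE j
    at cycle i + j and picks up the contribution W[i][j] * x[j] there."""
    N = len(x)
    pe_out = [0.0] * N              # value emitted by each PE in the previous cycle
    y = []
    for t in range(2 * N - 1):      # 2N - 1 cycles drain the whole pipeline
        new_out = [0.0] * N
        for j in range(N):
            i = t - j               # index of the output currently at PE j
            if 0 <= i < N:
                acc = theta[i] if j == 0 else pe_out[j - 1]   # bias enters at PE 0
                new_out[j] = acc + W[i][j] * x[j]
        pe_out = new_out
        if t >= N - 1:              # right-most PE emits y_i at cycle N - 1 + i
            y.append(pe_out[N - 1])
    return y

y = systolic_matvec([[1.0, 2.0], [3.0, 4.0]], [5.0, 6.0], [0.5, -0.5])
# y == [17.5, 38.5]
```

Running it on any W, x and θ reproduces W·x + θ one output per cycle after an initial latency of N cycles, exactly as stated above.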
To avoid this difficulty of the pure systolic realization, we discuss here a semi-systolic
implementation of the matrix-vector product of (15.4). For the semi-systolic realization, the
dependence graph of Fig. 15.4 can be modified to the form shown in Fig. 15.6,
where the nodes are flipped about the diagonal (so that the weight values appear in
transposed form) and the i-th column of the resulting DG is circularly shifted up by
(i − 1) places. As in the original DG of Fig. 15.4, the modified DG consists of
N² nodes arranged in N rows and N columns. In this case also, the input
values are loaded into the nodes of the first row of the DG, but they move diagonally
down to the adjacent nodes on the left of the lower row, while the outputs computed
by the nodes of each row are transferred vertically down to the adjacent nodes.

[Figure: the modified DG; each node computes Yout ← Yin + Wij·Zin and Zout ← Zin.]
Fig. 15.6 The modified DG for semi-systolic computation of matrix-vector product of (15.4).
(a) The dependence graph. (b) Function of each node of the DG. Note: N′ = N − 1

The DG of Fig. 15.6 can be projected vertically down, as in the case of the DG of Fig. 15.4, to obtain a semi-systolic linear array consisting of N identical PEs as shown in Fig. 15.7. The function of the PEs of this structure is described in Fig. 15.7(b). Each PE again performs one multiplication and one addition during a cycle period. The input structure for the elements of the weight matrix W is the same as that of the pure-systolic structure of Fig. 15.5. Unlike the latter, however, the input values xi for i = 1, 2, · · ·, N are loaded simultaneously to the individual PEs without staggering, and are transferred to the adjacent PEs on the left in the subsequent cycles. The elements of the input vector x thus move circularly across the array, the output of the left-most PE being connected to the input of the right-most PE. The results of computation in different PEs in this case
15 Systolic VLSI and FPGA Realization of Artificial Neural Networks 369

[Figure: the semi-systolic array of N PEs; each PE executes:
Initialise: X ← Yin and S ← 0;
For count = 1 to N, in every cycle do:
  S ← S + Win·X;
  Xout ← X;
  X ← Xin;
  count ← count + 1;
End do;
Yout ← S.]

Fig. 15.7 A semi-systolic array for the computation of equation (15.4). (a) The semi-systolic
array. (b) Function of each PE of the structure. N′ = N − 1

do not move, but are accumulated in the respective PEs. The sums of products computed in the PEs are finally released simultaneously as outputs after N cycles. The semi-systolic structure also has a latency of N cycles, and the duration of each clock cycle equals the time required to perform one multiply-accumulate operation, T = TM + TA, as in the pure systolic structure of Fig. 15.5. Unlike the latter, since all the outputs are obtained after N cycles from the N PEs of the structure, the output of each PE can be reused as input in the same PE for the

[Figure: the alternative DG; each node computes Zout ← Zin + Wij·Yin and Yout ← Yin. Each PE of the corresponding semi-systolic array executes:
Initialise: X ← Yin; Xout ← 0 and Xin ← 0;
For count = 1 to N, in every cycle do:
  Xout ← Xin + Win·X;
  count ← count + 1;
End do;
Yout ← Xin.]
Fig. 15.8 An alternative dependence graph of the semi-systolic computation of equation
(15.4). (a) The DG. (b) Function of each node of the DG. (c) Function of each PE of a
semi-systolic array

next iteration. In spite of its global communication, the semi-systolic structure of Fig. 15.7 may be suitable for fast implementation of the matrix-vector product owing to its scope for reuse of computed values, which we discuss in detail in the next Section.
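The circulating-input schedule can be sketched in the same way (illustrative Python; the diagonal weight-stream indexing `W[i, (i + t) % N]` mirrors the weight arrangement of Fig. 15.7, and the optional bias initialization anticipates its use in Section 15.4):

```python
import numpy as np

def semi_systolic_matvec(W, x, theta=None):
    """Cycle-level simulation of the semi-systolic array of Fig. 15.7.

    All x values are loaded simultaneously and circulate left-ward,
    while the partial sums stay in their PEs (initialized here with an
    optional bias). PE i consumes the weight stream W[i, i], W[i, i+1],
    ..., wrapping modulo N; all N outputs are released together after
    N cycles.
    """
    N = len(x)
    acc = list(theta) if theta is not None else [0.0] * N   # S register of each PE
    resident = list(x)                                      # X register of each PE
    for t in range(N):
        for i in range(N):
            acc[i] += W[i, (i + t) % N] * resident[i]       # S <- S + Win * X
        resident = resident[1:] + resident[:1]              # circular left shift
    return np.array(acc)
```

After N cycles the accumulators hold W·x (plus the bias if supplied), matching the text's claim that all outputs appear simultaneously with latency N.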
The computation of (15.4) can alternatively be represented by the DG shown in Fig. 15.8, where the distribution of the weight values is similar to that of the DG of Fig. 15.6, and the elements of vector x are loaded to the individual nodes on the first row. The input values in this case, however, move vertically down to the respective adjacent nodes, while the computed results move diagonally down to the adjacent nodes of the next lower row on the left. The output computed by the left-most node of a row of the DG is transferred to the right-most node on the next row of the DG. The DG of Fig. 15.8 can be projected vertically to obtain an array similar to the semi-systolic array of Fig. 15.7, where the function of the PEs is depicted in Fig. 15.8(c).

15.4 Systolic Architectures for ANN


In this Section, we describe the derivation of systolic architectures for the Hopfield net [28] and for the multilayer ANN [29] with back-propagation learning [30], using the basic matrix-vector processing units discussed in Section 15.3. The structures presented here are similar to those suggested by Kung and Hwang in [5] and [6], and to those of Shams and Przytula in [2].

15.4.1 Systolic Architecture for Hopfield Net


The system dynamics of the Hopfield net [28] for the search phase can be presented as a matrix-vector product computation followed by a non-linear activation:

ui(k + 1) = Wij(k) · xj(k) + θi, (15.5)

where summation over the repeated index j is assumed, and

xi(k + 1) = (1 + tanh[(ui(k) + η · ui(k + 1))/u0])/2 for 1 ≤ i ≤ N. (15.6)

The vector u(k) = {ui(k), for 1 ≤ i ≤ N} represents the states of all the N neurons at the k-th iteration, and θ = {θi, for 1 ≤ i ≤ N} is the bias vector.
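A direct (non-systolic) reference evaluation of (15.5)-(15.6) can be written as follows; the defaults for η, u0, the iteration cap and the tolerance, as well as taking u(0) = 0, are our illustrative choices, not values from the chapter:

```python
import numpy as np

def hopfield_search(W, theta, x0, eta=0.5, u0=1.0, max_iters=200, tol=1e-6):
    """Iterate (15.5)-(15.6) until the activations stop changing."""
    x = np.asarray(x0, dtype=float)
    u_prev = np.zeros_like(x)                   # u(0) taken as zero for illustration
    for _ in range(max_iters):
        u = W @ x + theta                                   # (15.5)
        x_new = (1 + np.tanh((u_prev + eta * u) / u0)) / 2  # (15.6)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x, u_prev = x_new, u
    return x
```

Because tanh maps into (−1, 1), every activation produced by (15.6) lies in (0, 1).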
The systolic and semi-systolic architectures for the matrix-vector product discussed in the last Section can be utilized for VLSI realization of Hopfield nets as shown in Fig. 15.9. The structure consists of N PEs, where N is the length of the input activation. Each of these PEs consists of two sub-cells, PE-1 and PE-2, and a circular-shift register. The set of circular-shift registers Ri for 1 ≤ i ≤ N is used to feed the appropriate weight values to the PEs as shown in Fig. 15.9. The function of PE-1 is the same as that of the PEs of the semi-systolic structure of Fig. 15.7. The function of PE-2 is described in Fig. 15.9(b): it performs the desired non-linear function given by equation (15.6). Several techniques have been reported for efficient computation of the tanh function required by these cells, and it can be implemented in many different ways [31]-[34]. For a low-complexity implementation one may use a CORDIC circuit or a look-up table of 2^L words, where L is the word-length [31].
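As a concrete illustration of the table-based option, the sketch below builds a 2^L-entry tanh table indexed by an L-bit two's-complement code; the word-length L = 8 and the input range [−4, 4) are our assumptions, not values from [31]:

```python
import numpy as np

L = 8                                   # word-length (assumed)
HALF = 1 << (L - 1)
SCALE = HALF / 4.0                      # quantization scale for inputs in [-4, 4)

# 2**L-entry table indexed by the L-bit two's-complement code of the input.
codes = np.arange(1 << L)
signed = np.where(codes < HALF, codes, codes - (1 << L))
TANH_LUT = np.tanh(signed / SCALE)

def tanh_lut(v):
    """Quantize v to an L-bit code, saturate at the range limits, look up tanh."""
    q = int(np.clip(round(v * SCALE), -HALF, HALF - 1))
    return float(TANH_LUT[q & ((1 << L) - 1)])
```

With these parameters the quantization step is 1/32, so the absolute error stays below about 0.016 over the covered range (the tanh slope never exceeds 1).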
In the search phase, each PE may be treated as a neuron where the weight vector
W j = {W j j ,W j( j+1) , · · ·,W jN ,W j1 ,W j2 , · · ·,W j( j−2) ,W j( j−1) } in the shift-register R j
of the j-th PE (for 1 ≤ j ≤ N) corresponds to the synaptic weights of the neuron.
During (k + 1)-th iteration of the search phase, the activation output xi (k) of the i-th
PE (for 1 ≤ i ≤ N) of the k-th iteration is reloaded to the same PE, which moves

[Figure: the array of PE-1/PE-2 pairs with circular-shift registers R1, . . ., RN; PE-2 computes X ← η·(Xin + θin); X ← (Ain + X)/u0; Xout ← [1 + tanh(X)]/2.]
Fig. 15.9 The linear array architecture for implementation of the search phase of the Hopfield
net. (a) The array architecture. Ri of the i-th PE is a circular-shift register that contains the i-th
column of the weight matrix W as shown in Fig.5. (b) Function of the non-linear processing
cell PE-2. u0 is the initial activation value stored in the PE and η is a constant

across the array from one PE to its adjacent PE in every computational cycle, such that each input activation visits each PE once in every N cycles. When xj(k) arrives at the i-th PE, it is multiplied with Wij, and the product is accumulated in the same PE. N such products are accumulated in N consecutive cycles, and the accumulated sum is then transferred to the non-linear processing cell PE-2. The output activation xi(k + 1) obtained from the non-linear processing cell of the i-th PE is reloaded to the same PE for the processing of the next iteration. The iterative process continues until convergence is reached after a certain number of iterations, whereupon the learning phase starts for the adjustment of weights. The systolic architecture derived for the search phase (Fig. 15.9) can be reused for the learning phase as follows:
1. Calculate the product of the error value (di − xi) and the learning rate η as Si = η · (di − xi), and store it in the i-th PE.
2. To calculate the weight increment terms according to (15.3), the converged activation values xj for 1 ≤ j ≤ N move across the PEs, as in the case of the search phase,

from a PE to its adjacent PE on the left in every computational cycle. When xj visits the i-th PE, it is multiplied with Si to find the current weight increment term ΔWij for updating the weights available from the circular-shift register.
3. After N clock cycles, all the activation values have passed through each PE, and all the N weights associated with each PE are adjusted.
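The three steps can be summarized in a cycle-level sketch (illustrative Python; since (15.3) appears earlier in the chapter, we assume it has the usual outer-product form ΔWij = η · (di − xi) · xj):

```python
import numpy as np

def hopfield_weight_update(W, x, d, eta):
    """Cycle-level sketch of the learning phase on the array of Fig. 15.9.

    Step 1: S[i] = eta * (d[i] - x[i]) is stored in the i-th PE.
    Step 2: the converged activations circulate left-ward for N cycles;
            when x[j] visits PE i, the increment S[i] * x[j] is applied
            to W[i, j] in the circular-shift register.
    Step 3: after N cycles every weight of every PE has been adjusted.
    """
    N = len(x)
    S = eta * (np.asarray(d, float) - np.asarray(x, float))
    resident = list(x)
    for t in range(N):
        for i in range(N):
            W[i, (i + t) % N] += S[i] * resident[i]
        resident = resident[1:] + resident[:1]
    return W
```

After the N cycles the net change to W equals the outer product η·(d − x)·xᵀ, i.e. every weight has received exactly one increment.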

15.4.2 Systolic Architecture for Multilayer Neural Network


The multilayer perceptron model [29] is very popular owing to its wide range of applications. In this Subsection, we show that the architectural design discussed in the previous Section can also be used for the implementation of the multilayer neural model. The system dynamics of the search phase of the l-th layer of an L-layer network is given by:

xi(l) = f(ui(l)), (15.7)

where

ui(l) = ∑_{j=1}^{N_{l−1}} Wij(l) · xj(l − 1) + θi(l) for 1 ≤ i ≤ Nl and 1 ≤ l ≤ L, (15.8)

and Nl is the number of nodes in the l-th layer. For simplicity of presentation (without loss of generality), we have assumed that each layer consists of an equal number of nodes (i.e., Nl = N); θi(l) is the bias input of the i-th neuron in the l-th layer.
The computations of each neuron can be performed by a systolic array of the kind shown in Fig. 15.9 (discussed in Subsection 15.4.1), such that the computation of (15.7) and (15.8) for all the neurons of a layer can be realized in fully parallel form in L systolic arrays. The resulting mesh architecture, consisting of LN PEs arranged in L rows and N columns, is shown in Fig. 15.10. The bias values (not shown explicitly in the structure) are used to initialize the accumulation registers in the PEs. For a reduced-hardware implementation, the computations of the different layers may be performed by a single array structure by time-multiplexing them under a simple control unit, with external storage elements to store the outputs of the neurons.
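As a functional reference for (15.7)-(15.8) (the mathematics only, not the systolic schedule), the layer-by-layer computation can be written as follows; the sigmoid activation is our illustrative choice for f:

```python
import numpy as np

def mlp_forward(x0, weights, biases, f=lambda u: 1 / (1 + np.exp(-u))):
    """Evaluate (15.7)-(15.8) layer by layer.

    In 0-based indexing, weights[l] maps the activations of layer l to
    the pre-activations of layer l + 1. Returns the activations of all
    layers, x(0) .. x(L).
    """
    xs = [np.asarray(x0, dtype=float)]
    for W, theta in zip(weights, biases):
        u = W @ xs[-1] + theta          # (15.8): weighted sum plus bias
        xs.append(f(u))                 # (15.7): non-linear activation
    return xs
```

Each row of the mesh in Fig. 15.10 corresponds to one iteration of this loop.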

15.4.3 Systolic Implementation of Back-Propagation Algorithm


The back-propagation (BP) algorithm [30] is one of the most widely used learning schemes for multilayer neural nets. It is an iterative gradient-descent technique that minimizes the mean-square error between the desired target values and the output values of the neurons in a multilayer neural net. The BP algorithm comprises two basic steps: (i) the feed-forward step and (ii) the reverse or backward step. In the feed-forward step, one of the input training patterns is fed to the network, and the output activation values are computed for each layer according to (15.7) and (15.8). In the reverse step, the differences between the desired targets and the output

[Figure: an L × N mesh of PEs; the l-th row maps the activations x1(l−1), . . ., xN(l−1) to x1(l), . . ., xN(l).]

Fig. 15.10 Systolic mesh architecture for multilayer ANN

activation values (also called the error signals) are estimated for all the neurons at the output layer, and propagated back progressively for the weight adjustment of the preceding layers. The weight adaptation of the l-th layer according to the BP algorithm, pertaining to the m-th input pattern, is given by

Wij^m(l) = Wij^{m−1}(l) + η · δi^m(l) · xj^m(l − 1), (15.9)

where the error signal δi^m(l) for updating the weights of the l-th layer is recursively computed as:

δi^m(L) = (di^m − xi^m(L)) · f′(ui^m(L)) for l = L, and (15.10)

δi^m(l) = [ ∑j δj^m(l + 1) · Wji^{m−1}(l + 1) ] · f′(ui^m(l)) for l < L. (15.11)

The feed-forward step is the same as that of the search phase and can be implemented by the structure of Fig. 15.10. For weight adjustment by the back-propagation algorithm, the structure of the search phase can be reused with simple modification. The formula for updating the weights of the L-th layer is given by (15.10), which is similar to (15.3) except that η is replaced by the derivative of the non-linear function, f′, and can be implemented by the structure discussed in Subsection 15.4.1. For all other values of l (i.e., for l < L in (15.11)), the error signal used in (15.10) is replaced by an inner product of the N-point error vector and a row of the weight matrix. The inner products of (15.11) can easily be realized by the semi-systolic structure of Fig. 15.7 using PEs whose function is described in Fig. 15.8(c). The

[Figure: the error signals δ1(l), . . ., δN(l) of the l-th layer are computed from δ1(l+1), . . ., δN(l+1) of the (l+1)-th layer.]

Fig. 15.11 Systolic array structure of weight updating by back-propagation algorithm

array structure for updating the weights {Wij^{m−1}(l)} and calculating the error signals {δi^m(l)} is shown in Fig. 15.11, and is discussed in the following:
1. The weight value Wji^{m−1}(l + 1), available from the local shift-register of the j-th PE of the (l + 1)-th layer, is multiplied with the residing error value δj^m(l + 1), and the product is then added to the accumulated products from the preceding PE on its right. The accumulated product for calculating δi^m(l) is initialized at the i-th PE, and then propagated circularly left-ward across the array.
2. Similar operations are repeated for all i, for i = 1, 2, 3, · · ·, N. The accumulated value originating from the i-th PE returns to it after adding up all the products of the inner product of (15.11). In the meantime, f′(ui^m(l)) is calculated at the i-th PE of the l-th layer. The inner product (not shown in the figure) computed at the i-th PE of the (l + 1)-th layer is transferred to the i-th PE of the l-th layer, where it is multiplied with f′(ui^m(l)) to obtain δi^m(l).
3. δi^m(l) is then used by the i-th PE of the l-th layer for updating {Wij^{m−1}(l) for j = i, i + 1, · · ·, N, 1, · · ·, i − 1} at the l-th layer, and for the calculation of δi^m(l − 1) at the (l − 1)-th layer. All three steps are repeated backward for all the layers until the input layer is reached.
The computation time of each layer is proportional to the number of neurons in the layer. The hardware utilization of the structure will, therefore, be optimal when all the layers have an equal number of neurons. If different layers have different numbers of neurons, then the layers with fewer neurons are required to wait until the layers with more neurons complete their computations. An architecture for unequal numbers of neurons is discussed in [6]. Optimized one- and two-dimensional systolic architectures are suggested in [7]-[9] for the reduction of computation time. Several other variations and optimizations of systolic architectures for ANN models are reported in [10]-[15].

15.4.4 Implementation of Advanced Algorithms and Applications


Systolic realizations of several ANN applications relating to signal processing, image processing, pattern recognition and classification problems have been reported in the literature. Recurrent neural networks (RNNs) form the most general class of neural networks, in which every node can be connected to any other node. RNNs have the ability to implement highly non-linear dynamical systems of arbitrary complexity [35], [36]. A systolic array architecture (similar to the one discussed for the Hopfield net in Section 15.4.1) for the recurrent learning algorithm of [35] has been derived systematically by Kechriotis and Manolakos [37] from the DG formulation using the canonical mapping methodology, for the implementation of the retrieving phase as well as the learning phase. A highly regular and modular recurrent-neural-network implementation of a shortest-path processor in reconfigurable hardware and dedicated VLSI is suggested in [38]. Ramacher et al. have developed a neural signal processor that executes the compute-bound primitives shared by all neural nets for high-speed signal processing [39]. Vidal and Massicotte have presented an efficient architecture for channel equalization using a piecewise-linear multilayer neural network [40]. Broomhead et al. have presented a fully systolic network, based on the multilayer feed-forward perceptron model using radial basis functions, for nonlinear system identification, nonlinear adaptive filtering and pattern classification [41]. Cavaiuolo et al. have presented a systolic neural network for image processing, and have discussed its advantages over conventional implementations [42]. VLSI architectures for pattern recognition and classification using ANN algorithms are discussed in [43] and [44]. A reconfigurable systolic implementation of a face recognition system based on a principal component neural network is presented in a recent paper [45]. A neural chip along with an analog vector quantizer for image processing applications is suggested by Sheu et al. in [46]. An interesting scheme for hybrid analog-digital systolization of neural networks is presented in [47].

15.5 Conclusion
Systolic array architectures, owing to their many advantageous features, are considered attractive for the implementation of computation-intensive ANN algorithms in custom VLSI and FPGA devices for real-time applications. The key techniques used for mapping ANN algorithms into systolic computing structures have been discussed, and a brief overview of systolic architectures for different ANN applications has been presented in this chapter. Along with the design of basic systolic building blocks for various ANN algorithms, the mapping of fully-connected unconstrained ANNs, as well as multilayer ANN algorithms, into fully-pipelined systolic architectures has been described by generalized dependence graph formulation. The reader may refer to the cited references for detailed discussions on the hardware implementation of advanced ANN algorithms and of extended forms of ANN for different applications. Interested readers may also find several variations and optimizations of systolic architectures for ANN models in the references. Most of the VLSI structures suggested in the literature are meant for a particular network topology and a specific training algorithm. Only a few of the architectures offer the flexibility of adapting to different learning processes or to the topology and constitution of the network [48]-[49]. Self-configuring and adaptive architectures could also be designed for complex multi-modal learning applications and for applications subject to different constraints and environmental influences [50, 51]. It is observed that both analog and digital implementations have their inherent advantages and disadvantages. Therefore, it is expected that mixed analog-digital circuits might be able to deliver the best of the two for the VLSI implementation of different ANN models. Mixed analog-digital neural networks have significant potential to be deployed directly and more efficiently in various applications in signal processing, communication and instrumentation, where real-world interaction in the analog domain is prevalent during different phases of network operation.

References
1. Brown, J.R., Garber, M.M., Venable, S.F.: Artificial neural network on a SIMD architec-
ture. In: Proceedings Frontiers of Massively Parallel Computation, pp. 43–47 (1988)
2. Shams, S., Przytula, K.W.: Mapping of neural networks onto programmable parallel ma-
chines. In: Proceedings IEEE International Symposium on Circuits and Systems, vol. 4,
pp. 2613–2617 (1990)
3. Kung, S.Y.: Digital Neurocomputing. Prentice Hall, Englewood Cliffs (1992)
4. Kung, S.Y.: Tutorial: digital neurocomputing for signal/image processing. In: Proceed-
ings of IEEE Workshop Neural Networks for Signal Processing, pp. 616–644 (1991)
5. Kung, S.Y., Hwang, J.N.: Parallel architectures for artificial neural nets. In: Proceedings
of IEEE International Conference on Neural Networks, vol. 2, pp. 165–172 (1988)
6. Kung, S.Y., Hwang, J.N.: A unifying algorithm/architecture for artificial neural networks.
In: Proceedings of International Conference on Acoustics, Speech, and Signal Process-
ing, vol. 4, pp. 2505–2508 (1989)
7. Amin, H., Curtis, K.M., Hayes Gill, B.R.: Efficient two-dimensional systolic array ar-
chitecture for multilayer neural network. Electronics Letters 33(24), 2055–2056 (1997)
8. Amin, H., Curtis, K.M., Hayes Gill, B.R.: Two-ring systolic array network for artificial
neural networks. In: IEE Proceedings Circuits, Devices and Systems, vol. 164(5), pp.
225–230 (1999)
9. Myoupo, J.F., Seme, D.: A single-layer systolic architecture for back propagation learn-
ing. In: Proceedings of IEEE International Conference on Neural Networks, vol. 2, pp.
1329–1333 (1996)
10. Khan, E.R., Ling, N.: Systolic architectures for artificial neural nets. In: Proceedings of
IEEE International Joint Conference on Neural Networks, vol. 1, pp. 620–627 (1991)
11. Zubair, M., Madan, B.B.: Systolic implementation of neural networks. In: Proceedings
of IEEE International Conference on Computer Design: VLSI in Computers and Proces-
sors, pp. 479–482 (1989)
12. Pazienti, F.: Systolic array for neural network implementation. In: Proceedings 6th
Mediterranean Electrotechnical Conference, vol. 2, pp. 981–984 (1991)
13. Girones, R.G., Salcedo, A.M.: Systolic implementation of a pipelined on-line back prop-
agation. In: Proceedings Seventh International Conference on Microelectronics for Neu-
ral, Fuzzy and Bio-Inspired Systems, pp. 387–394 (1999)

14. Naylor, D., Jones, S.: A performance model for multilayer neural networks in linear
arrays. IEEE Transactions on Parallel and Distributed Systems 5(12), 1322–1328 (1994)
15. Naylor, D., Jones, S., Myers, D.: Back propagation in linear arrays-a performance anal-
ysis and optimization. IEEE Transactions on Neural Networks 6(3), 583–595 (1995)
16. Kung, H.T.: Why systolic architectures? Computer 15(1), 37–46 (1982)
17. Kung, S.Y.: VLSI Array Processors. Prentice Hall, Englewood Cliffs (1988)
18. Parhi, K.K.: VLSI Digital Signal Processing Systems: Design and Implementation.
Wiley-Interscience Publication, John Wiley & Sons, New York (1999)
19. Zhang, D., Pal, S.K. (eds.): Neural Networks and Systolic Array Design. World Scien-
tific, River Edge (2002)
20. Ben Salem, A.K., Ben Othman, S., Ben Saoud, S.: Design and implementation of a neural
command rule on a FPGA circuit. In: Proceedings 12th IEEE International Conference
on Electronics, Circuits and Systems, pp. 1–4 (2005)
21. Liu, J., Liang, D.: A Survey of FPGA-Based Hardware Implementation of ANNs. In:
Proceedings 1st International Conference on Neural Networks and Brain, pp. 915–918
(2005)
22. Mohan, A.R., Sudha, N., Meher, P.K.: An embedded face recognition system on A VLSI
array architecture and its FPGA implementation. In: Proceedings 34th Annual Confer-
ence of IEEE Industrial Electronics, pp. 2432–2437 (2008)
23. Farhat, N.H., Psaltis, D., Prata, A., Paek, E.: Optical Implementation of the Hopfield
Model. Applied Optics 24, 1469–1475 (1985)
24. Wagner, K., Psaltis, D.: Multilayer optical learning networks. Applied Optics 26, 5061–
5076 (1987)
25. Mead, C.: Analog VLSI and neural systems. Addison Wesley, Reading (1989)
26. Sivilotti, M.A., Mahowald, M.A., Mead, C.A.: Real-time visual computations using analog CMOS processing arrays. In: Losleben, P. (ed.) Advanced Research in VLSI, pp. 295–312. MIT Press, Cambridge (1987)
27. Sheu, B.J., Choi, J.: Neural Information Processing and VLSI. Kluwer Academic Pub-
lishers, Dordrecht (1995)
28. Hopfield, J.J., Tank, D.W.: Neural computation of decisions in optimization problems.
Biological Cybernetics 52, 141–154 (1985)
29. Rumelhart, D.E., McClelland, J.L.: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge (1986)
30. Werbos, P.: Beyond regression: New tools for prediction and analysis in the behavioral
sciences. Ph.D. thesis, Harvard University, Cambridge, Mass. (1974)
31. Gisutham, B., Srikanthan, T., Asari, K.V.: A high speed flat CORDIC based neuron with
multi-level activation function for robust pattern recognition. In: Proceedings Fifth IEEE
International Workshop on Computer Architectures for Machine Perception, pp. 87–94
(2000)
32. Anna Durai, S., Siva Prasad, P.V., Balasubramaniam, A., Ganapathy, V.: A learning strat-
egy for multilayer neural network using discretized sigmoidal function. In: Proceedings
Fifth IEEE International Conference on Neural Networks, pp. 2107–2110 (1995)
33. Zhang, M., Vassiliadis, S., Delgado-Frias, J.G.: Sigmoid generators for neural comput-
ing using piecewise approximations. IEEE Transactions on Computers 45, 1045–1049
(1996)
34. Saichand, V., Nirmala, D.M., Arumugam, S., Mohankumar, N.: FPGA realization of
activation function for artificial neural networks. In: Proceedings Eighth International
Conference on Intelligent Systems Design and Applications, vol. 3, pp. 159–164 (2008)

35. Williams, R.J., Zipser, D.: Experimental analysis of the real-time recurrent learning al-
gorithm. Connection Science 1, 87–111 (1989)
36. Williams, R.J., Zipser, D.: Gradient-based learning algorithms for recurrent networks
and their computational complexity. In: Back-propagation: Theory, Architectures and
Applications. Erlbaum, Hillsdale (1992)
37. Kechriotis, G., Manolakos, E.S.: A VLSI array architecture for the on-line training of recurrent neural networks. In: Conference Record of the Twenty-Fifth Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 506–510 (1991)
38. Shaikh-Husin, N., Hani, M.K., Teoh, G.S.: Implementation of recurrent neural network
algorithm for shortest path calculation in network routing. In: Proceedings of Interna-
tional Symposium on Parallel Architectures, Algorithms and Networks, I-SPAN 2002,
pp. 313–317 (2002)
39. Ramacher, U., Beichter, J., Bruls, N., Sicheneder, E.: Architecture and VLSI design of
a VLSI neural signal processor. In: Proceedings IEEE International Symposium on Cir-
cuits and Systems, vol. 3, pp. 1975–1978 (1993)
40. Vidal, M., Massicotte, D.: A VLSI parallel architecture of a piecewise linear neural net-
work for nonlinear channel equalization. In: Proceedings the 16th IEEE Conference on
Instrumentation and Measurement Technology, vol. 3, pp. 1629–1634 (1999)
41. Broomhead, D.S., Jones, R., McWhirter, J.G., Shepherd, T.J.: A systolic array for nonlin-
ear adaptive filtering and pattern recognition. In: Proceedings IEEE International Sym-
posium on Circuits and Systems, vol. 2, pp. 962–965 (1990)
42. Cavaiuolo, M., Yakovleff, A.J.S., Watson, C.R., Kershaw, J.A.: A systolic neural net-
work image processing architecture. In: Proceedings Computer Systems and Software
Engineering, pp. 695–700 (1992)
43. Bermak, A., Martinez, D.: Digital VLSI implementation of a multi-precision neural net-
work classifier. In: Proceedings 6th International Conference on Neural Information Pro-
cessing, vol. 2, pp. 560–565 (1999)
44. Shadafan, R.S., Niranjan, M.: A systolic array implementation of a dynamic sequential
neural network for pattern recognition. In: Proceedings IEEE World Congress on Com-
putational Intelligence and IEEE International Conference on Neural Networks, vol. 4,
pp. 2034–2039 (1994)
45. Sudha, N., Mohan, A.R., Meher, P.K.: Systolic array realization of a neural network-
based face recognition system. In: Proceedings 3rd IEEE Conference on Industrial Elec-
tronics and Applications, pp. 1864–1869 (2008)
46. Sheu, B.J., Chang, C.F., Chen, T.H., Chen, O.T.C.: Neural-based analog trainable vector
quantizer and digital systolic processors. In: Proceedings IEEE International Symposium
on Circuits and Systems, vol. 3, pp. 1380–1383 (1991)
47. Moreno, J.M., Castillo, F., Cabestany, J., Madrenas, J., Napieralski, A.: An analog sys-
tolic neural processing architecture. IEEE Micro. 14(3), 51–59 (1994)
48. Madraswala, T.H., Mohd, B.J., Ali, M., Premi, R., Bayoumi, M.A.: A reconfigurable
‘ANN’ architecture. In: Proceedings IEEE International Symposium on Circuits and Sys-
tems, vol. 3, pp. 1569–1572 (1992)
49. Jang, Y.-J., Park, C.-H., Lee, H.-S.: A programmable digital neuro-processor design with
dynamically reconfigurable pipeline/parallel architecture. In: Proceedings International
Conference on Parallel and Distributed Systems, pp. 18–24 (1998)
50. Patra, J.C., Lee, H.Y., Meher, P.K., Ang, E.L.: Field Programmable Gate Array Imple-
mentation of a Neural Network-Based Intelligent Sensor System. In: Proceeding Inter-
national Conference on Control Automation Robotics and Vision, December 2006, pp.
333–337 (2006)

51. Patra, J.C., Chakraborty, G., Meher, P.K.: Neural Network-Based Robust Linearization
and Compensation Technique for Sensors under Nonlinear Environmental Influences.
IEEE Transactions on Circuits and Systems-I: Regular Papers 55(5), 1316–1327 (2008)

About the author

Pramod Kumar Meher received the B.Sc. and M.Sc. degrees in physics and the Ph.D.
in science from Sambalpur University, Sambalpur, India, in 1976, 1978, and 1996,
respectively. He has a wide scientific and technical background covering physics,
electronics, and computer engineering. Currently, he is a Senior Scientist with the
Institute for Infocomm Research, Singapore. Prior to this assignment he was a visit-
ing faculty with the School of Computer Engineering, Nanyang Technological Uni-
versity, Singapore. Previously, he was a Professor of computer applications with
Utkal University, Bhubaneswar, India, from 1997 to 2002, a Reader in electron-
ics with Berhampur University, Berhampur, India, from 1993 to 1997, and a Lec-
turer in physics with various Government Colleges in India from 1981 to 1993. His
research interests include the design of dedicated and reconfigurable architectures for
computation-intensive algorithms pertaining to signal processing, image processing,
communication, and intelligent computing. He has published more than 140 tech-
nical papers in various reputed journals and conference proceedings. Dr. Meher is a
Fellow of the Institution of Electronics and Telecommunication Engineers (IETE),
India and a Fellow of the Institution of Engineering and Technology (IET), UK.
He is currently serving as Associate Editor for the IEEE Transactions on Circuits
and Systems-II: Express Briefs, IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, and Journal of Circuits, Systems, and Signal Processing. He was
the recipient of the Samanta Chandrasekhar Award for excellence in research in
engineering and technology for the year 1999.
Chapter 16
Application of Coarse-Coding Techniques for
Evolvable Multirobot Controllers

Jekanthan Thangavelautham, Paul Grouchy, and Gabriele M.T. D’Eleuterio

Abstract. Robots, in their most general embodiment, can be complex systems trying
to negotiate and manipulate an unstructured environment. They ideally require an
‘intelligence’ that reflects our own. Artificial evolutionary algorithms are often used
to generate a high-level controller for single- and multi-robot scenarios. But evolutionary algorithms, for all their advantages, can be very computationally intensive.
It is therefore very desirable to minimize the number of generations required for
a solution. In this chapter, we incorporate the Artificial Neural Tissue (ANT) ap-
proach for robot control from previous work with a novel Sensory Coarse Coding
(SCC) model. This model is able to exploit regularity in the sensor data of the en-
vironment. Determining how the sensor suite of a robot should be configured and
utilized is critical for the robot’s operation. Much as nature evolves body and brain
simultaneously, we should expect improved performance resulting from artificially
evolving the controller and sensor configuration in unison. Simulation results on
an example task, resource gathering, show that the ANT+SCC system is capable of
finding fitter solutions in fewer generations. We also report on hardware experiments
for the same task that show complex behaviors emerging through self-organized task
decomposition.

16.1 Introduction
Our motivation for evolutionary-based control approaches for multirobot systems
originates in the use of robots for space exploration and habitat construction on
Jekanthan Thangavelautham
Mechanical Engineering Department, Massachusetts Institute of Technology,
77 Massachusetts Ave., Cambridge, MA, USA, 02139
e-mail: jthanga@mit.edu
Paul Grouchy · Gabriele M.T. D’Eleuterio
Institute for Aerospace Studies, University of Toronto,
4925 Dufferin St., Toronto, Canada, M3H5T6
e-mail: {paul.grouchy,gabriele.deleuterio}@utoronto.ca

Y. Tenne and C.-K. Goh (Eds.): Computational Intelligence in Optimization, ALO 7, pp. 381–412.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
382 J. Thangavelautham, P. Grouchy, and G.M.T. D’Eleuterio

alien planets and planetoids, such as Mars and the Moon. Potential scenarios include
establishing a distributed antenna for communications, deploying a mobile array of
actuators and sensors for geological measurements, or constructing elements of an
outpost in preparation for the arrival of humans.
These kinds of projects call for not just a single monolithic robotic system but for
teams of elemental robots working in collaboration and coordination. While space
applications may require the use of a multiagent strategy, they are by no means the
only ones. Consider, for example, terrestrial applications such as search and rescue,
mapping, manufacturing and construction.
A number of factors make the team approach viable and attractive. Among them
is increased reliability: one can afford to lose a member of the team without destroying
the team's integrity. A team approach can offer increased efficiency through
parallelization of operations. As such, multiagent systems are more readily scalable.
Most important, however, a team can facilitate task decomposition. A complex task
can be parsed into manageable subtasks which can be delegated to multiple elemen-
tal units.
Robotic systems can themselves be complex and their environments are generally
unstructured. Sound control strategies are therefore not easy to develop. A method-
ical approach is not only desired but arguably required. It would be ideal if the
controller could be automatically generated starting from a ‘blank slate,’ where the
designer is largely relieved of the design process and detailed models of the systems
or environment can be avoided. It is by such a road that we have come to the use of
evolutionary algorithms for generating controllers that are based on neural-network
architectures.
We have developed and tested, both in simulation and hardware, a neuroevo-
lutionary approach called the Artificial Neural Tissue (ANT). This neural-network-
based controller employs a variable-length genome consisting of a regulatory
system that dictates the rate of morphological growth and can selectively activate
and inhibit neuron ensembles through a coarse-coding scheme [31]. The approach
requires an experimenter to specify a goal function, a sensory input layout for the
robots and a repertoire of allowable basis behaviors. The control topology and its
contents emerge through the evolutionary process.
But is it possible to evolve, in addition to the controllers themselves, the neces-
sary sensor configurations and the selection of behavior primitives (motor-actuator
commands) concurrently with the evolution of the controller? It is this question that
we address in this work.
In tackling this challenge, we turn to a key theme in our ANT concept, namely,
coarse coding. Coarse coding is an efficient, distributed means of representation that
makes use of multiple coarse receptive fields to represent a higher-resolution field.
As is well known, nature exploits coarse coding in the brain and sensory systems.
In artificial systems, coarse coding is used to interpret data. In ANT, however, it
is the program (the artificial neural structure responsible for computation) that is
coarse coded. This allows the development of spatially modular functionality in the
architecture that mimics the modularity in natural brains.
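The principle can be made concrete with a minimal sketch of our own (not code from the chapter): overlapping one-dimensional receptive fields, each too wide to localize a stimulus on its own, jointly pin it down to the intersection of the fields that respond.

```python
def active_fields(x, fields):
    """Indices of the coarse receptive fields (lo, hi) that respond to stimulus x."""
    return [i for i, (lo, hi) in enumerate(fields) if lo <= x <= hi]

def coarse_decode(x, fields):
    """Intersect all responding fields to recover a finer interval containing x."""
    act = active_fields(x, fields)
    lo = max(fields[i][0] for i in act)
    hi = min(fields[i][1] for i in act)
    return lo, hi

# Three wide fields of width 4, offset by 2: individually crude...
fields = [(0, 4), (2, 6), (4, 8)]
# ...but jointly they localize x = 5 to the overlap of the two active fields.
print(coarse_decode(5, fields))  # (4, 6)
```

The resolution of the decoded interval grows with the number of overlapping fields, which is what makes coarse coding an efficient distributed representation.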
16 Coarse-Coding Techniques for Evolvable Multirobot Controllers 383

We present in this work a Sensory Coarse Coding (SCC) model that extends
capabilities of ANT and allows for the evolution of sensor configuration and coupled
motor primitives. The evolution of sensor configuration and behavior primitives can
be used to take advantage of regularities in the task space and can help to guide and
speed up the evolution of the controller.
Training neural network controllers using an evolutionary-algorithm approach
like ANT for robotics is computationally expensive and tends to be used when other
standard search algorithms like gradient descent are unsuitable or require substantial
supervision for the given task space. The controllers are developed using a biolog-
ically motivated development process and replicated on one or more robotic plat-
forms for evaluation. Training on hardware is often logistically difficult, requiring
a long-term power source and a means of automating the controller evaluation pro-
cess. The alternative is to simulate the robotic evaluation process on one or more
computers. The bulk of the required time in training is the evaluation process. Ge-
netic operations including selection, mutation and crossover tend to take less than
one percent of the computational time. Therefore any method that can reduce the
number of genetic evaluations will have a substantial impact on the training process.
Furthermore, a significant reduction in the number of generations required can also
make the training process feasible on hardware. Robotic simulations often take into
account the dynamics and kinematics of robotic vehicle interactions. However, care
has to be taken to ensure the simulation environment resembles or is compatible with
actual hardware. In other circumstances, it may be beneficial to prototype and
demonstrate capabilities and concepts in simulation before proceeding towards
expensive hardware demonstration. With robotics applications on the lunar surface, the
low gravity environment cannot be easily replicated on earth and hence high fidelity
dynamics simulations may be needed to demonstrate aspects of system capability.
For multirobotic tasks, the global effect of local interactions between robots is
often difficult to gauge, and the specific interactions required to achieve coordinated
behavior may even be counterintuitive. Furthermore, it is not at all straightforward to
determine the best sensor configuration. Often detailed analysis of the task needs to
be performed to figure out the necessary coordination rules and sensory configura-
tions. The alternative is to use optimization techniques to in effect shrink the search
space sufficiently to enable evolutionary search algorithms to find suitable solutions.
Evolving this configuration may also give useful insight into the sensors necessary
for a task. This may help guide a robotic designer in their design processes—we do
not presume to make the designer completely redundant—by helping them deter-
mine which sensors and actuators are best to achieve a given objective. In addition,
the evolution of the sensor configuration in conjunction with the controller would
allow us to mimic nature more closely. Nature perforce evolves body and brain
together.
The remainder of this chapter is organized as follows. First, we provide back-
ground to our problem by reviewing past work on the use of evolutionary algorithms
for the development of multirobot controllers and on ‘body and brain’ evolution. We
present the workings of the Artificial Neural Tissue approach followed by the Sen-
sory Coarse Coding model. We refer to the integration of the latter into the former
as the ANT+SCC system. Next, we report on a number of experiments we have conducted
to demonstrate the ANT+SCC system on a group of robots, concentrating on
an example of resource gathering. This is followed by a discussion of the findings
and finally we venture some concluding remarks.

16.2 Background
Coordination and control of multirobot systems are often inspired by biology. In
nature, multiagent systems such as social insects use a number of mechanisms for
control and coordination. These include the use of templates, stigmergy, and self-
organization. Templates are environmental features perceptible to the individuals
within the collective [3]. In insect colonies, templates may be a natural phenomenon
or they may be created by the colonies themselves. They may include temperature,
humidity, chemicals, or light gradients. Stigmergy is a form of indirect communi-
cation mediated through the environment [12]. One way in which ants and termites
exploit stigmergy is through the use of pheromone trails. Self-organization describes
how local or microscopic behaviors give rise to a macroscopic structure in systems
[2]. However, many existing approaches suffer from another emergent feature called
antagonism [5]. This is the effect that arises when multiple agents trying to perform
the same task interfere with each other and reduce the overall efficiency of the group.
Within the field of robotics, many have sought to develop multirobot control and
coordination behaviors based on one or more of the prescribed mechanisms used
in nature. These solutions have been developed using user-defined deterministic ‘if-
then’ rules or preprogrammed stochastic behaviors. Such techniques in robotics in-
clude template-based approaches that exploit light fields to direct the creation of
walls [33] and planar annulus structures [34]. Stigmergy has been used extensively
in collective-robotic construction tasks, including blind bulldozing [24], box
pushing [21] and heap formation [1].
Inspired by insect societies, the robots are equipped with the necessary sensors re-
quired to demonstrate multirobot control and coordination behaviors. Furthermore,
the robot controllers are often designed by hand to be reactive and have access only
to local information. They are nevertheless able to self-organize through coopera-
tion to achieve an overall objective. This is difficult to do by hand, since the global
effect of these local interactions is often very difficult to predict. The simplest hand-
coded techniques have been to design a controller for a single robot and scaling
to multiple units by treating other units as obstacles to be avoided [1], [24]. Other
more sophisticated techniques make use of explicit communication or designing an
extra set of coordination rules to gracefully handle agent-to-agent interactions [33].
These approaches are largely heuristic, rely on ad hoc assumptions that often re-
quire knowledge of the task domain and are implemented with a specified robot
configuration in mind.

16.2.1 The Body and the Brain


The ecological balance principle [25] states that the complexity of an ‘agent’ must
match the complexity of the task environment. In the natural world, this matching is
done by the evolutionary process which molds both the body and the brain of indi-
vidual organisms for survival. Adaptive systems may evolve to exploit regularities
in the task environment that concurrently impact the physical design and control.
This has been demonstrated with artificial evolution of both the body and brain of
artificial creatures. Early work by Sims [27] explored breeding virtual creatures that
could swim and hop in a simulated 3-D environment. A generative encoding system
was used to describe the phenotype and was based on Lindenmayer’s L-system [19].
The creatures were fully described by the genome and composed of a hierarchical
description outlining three-dimensional rigid parts and various joints. Reactive con-
trol laws were also evolved that described the interaction of the various parts.
Framsticks, another method of evolving artificial body-brain systems, combined
a neural-network controller with a specified morphology [17]. The resultant virtual
creatures were evolved to perform locomotion in various environments, both on
land and in water. Grammar-based techniques have also been used to evolve brain
and bodies for robotic applications [22] and have demonstrated simple Braitenberg
tasks. Work by Lipson and Pollack [20] demonstrated the evolution of robot designs
and actuators that were realized by a 3-D printer. Standard components such as
motors were added allowing the system to perform locomotion. Further work in
this area by Zykov et al. [35] has been directed towards evolving self-replicating
machine configurations using a physical substrate.
In robotics, there are both potential advantages and disadvantages when design-
ing both body and brain concurrently for solving individual tasks. One advantage is
that the end design may be specifically tuned towards solving a specialized task in
an efficient manner (equivalent to finding a niche in nature). However, this may be
at the cost of losing multipurpose capabilities. On the other hand, specific needs and
performance considerations may warrant the need for special-purpose robots.
In many practical situations, one may not have the resources necessary to design
specialized robots for a task at hand. Hence one has to use standard robot configura-
tions and implement a controller for this configuration. However, it may be practical
to reconfigure placement of sensors on a standard robot configuration for use on a
specified task. How best to configure these sensors and implement the associated
controllers remains a crucial question in multirobot systems.

16.2.2 Task Decomposition


Sensor configuration is particularly key to task decomposition. The ability to parti-
tion or segment a complex task into subtasks is a vital capability for both natural and
artificial systems. Part of solving a real-world task requires sensing the environment
and using it to provide feedback when performing actions.

One of the advantages of multirobot systems, as mentioned above, is precisely
the opportunity to facilitate task decomposition. With the use of such systems, control
and coordination are critical. Both the control and coordination capabilities are
dependent on the individual robot’s ability to sense its environment and its ability to
perform actions.

16.2.3 Machine-Learning Techniques and Modularization


Machine-learning techniques, particularly artificial evolution, exploit self-
organization and relieve the designer of having to determine a suitable control
strategy and sensor configuration. Using these techniques, controllers and sensor
configurations develop cooperation and interaction strategies by setting the evolv-
ing system loose in the environment. By contrast, it is difficult to design controllers
by hand with cooperation in mind because it is difficult, given the complexity of the
system as well as the environment, to predict or control the global behaviors that
will result from local interactions.
One machine-learning technique to overcome the difficulties of design by hand is
based on Cellular Automata (CA) look-up tables. A genetic algorithm can be used
to evolve the table entries [7]. The assumption is that each combination of sensory
inputs will result in a particular choice of output behaviors. This approach is an in-
stance of a ‘tabula rasa’ technique. The control system starts off as a blank slate with
limited assumptions regarding control architecture and is guided through training by
a fitness function (system goal function). Such approaches can be used to obtain ro-
bust, scalable controllers that exploit multirobot mechanisms such as stigmergy and
self-organization. Furthermore, these approaches are beneficial for hardware exper-
iments as there is minimal computational overhead incurred, especially if onboard
sensor processing is available.
One of the limitations of a look-up table approach is that the table size grows
exponentially with the number of inputs. For a 3 × 3 tiling formation task, a sin-
gle look-up table architecture is found to be intractable owing to premature search
stagnation [29]. To address this limitation, the controller can be modularized into
subsystems by exploiting regularities in the task environment. These subsystems
can explicitly communicate and coordinate actions with other agents. This act of
dividing the agent functionality into subsystems is a form of user-assisted task de-
composition. Such intervention requires domain knowledge of the task and ad hoc
design choices to facilitate searching for a solution.
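The scaling problem is easy to make concrete. A sketch of our own (names and structure hypothetical, not from the chapter) of a CA-style look-up-table controller with n binary sensor inputs shows the table — and hence the genome a genetic algorithm must search — growing as 2^n:

```python
import random
from itertools import product

def random_table(n_inputs, n_behaviors, rng):
    """A CA-style controller: one output behavior per sensor combination.

    The table entries are the 'genome' a genetic algorithm would evolve.
    """
    return {bits: rng.randrange(n_behaviors)
            for bits in product((0, 1), repeat=n_inputs)}

rng = random.Random(0)
for n in (4, 8, 16):
    print(n, len(random_table(n, 4, rng)))  # 16, 256, 65536: exponential growth
```

A neural network with the same inputs, by contrast, needs only a number of weights polynomial in n, which is precisely the generalization advantage the next paragraph describes.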
Use of neural networks is also another form of modularization, where each neu-
ron can communicate and perform some form of sensory information processing.
The added advantage of neural-network architectures is that the neurons can gen-
eralize (unlike CAs) by recognizing correlations between a combination of sensory
inputs, thus effectively shrinking the search space. Fixed-topology neural-network
architectures have been used extensively for multirobot tasks, including building
walls [33], tile formation [30], and cooperative transport [13]. However, monolithic
fixed-topology neural-network architectures also face scalability problems. With an

increasing number of hidden neurons, one must contend with the effects of spa-
tial crosstalk where noisy neurons interfere and drown out signals from feature-
detecting neurons [16].
Crosstalk in combination with limited supervision (through use of a global fitness
function) can lead to the ‘bootstrap problem’ [23], where evolutionary algorithms
are unable to pick out incrementally fitter solutions resulting in premature stagnation
of the evolutionary run. Thus, choosing the wrong network topology may lead to a
situation that is either unable to solve a task or is difficult to train [31].

16.2.4 Fixed versus Variable Topologies


Fixed-topology architectures accordingly have limitations, particularly in robotics,
for the very reason that the topology must be determined a priori and there is no
opportunity to modify it without starting over. Variable-topology architectures,
however, allow for the evolution of both the network architecture and the neuronal
weights simultaneously. The genotypes for these systems are encoded in a one-to-
one mapping such as in Neuro-Evolution of Augmenting Topologies (NEAT) [28].
Recursive rewriting of the genotype contents to produce a phenotype is used in
methods such as Cellular Encoding [14], L-systems [27] and artificial ontogeny [8].
Ontogeny (morphogenesis) models developmental biology and
includes a growth program in the genome that starts from a single egg and subdi-
vides into specialized daughter cells. Other morphogenetic systems include [4] and
Developmental Embryonal Stages (DES) [10].
The growth program within many of these morphogenetic systems is controlled
through artificial gene regulation. Artificial gene regulation is a process in which
gene activation/inhibition regulates (and is regulated by) the expression of other
genes. Once the growth program has been completed, there is no further use for
gene regulation within the artificial system, which is in stark contrast to biologi-
cal systems where gene regulation is always present. These variable topologies also
have to be grown incrementally, starting from a single cell, in order to minimize the
dimensionality of the search space, as the size of the network architecture may inadvertently
make training difficult [28]. With recursive rewriting of the phenotype, limited mu-
tations can result in substantial changes to the growth program. Such techniques also
introduce a deceptive fitness landscape where limited fitness sampling of a pheno-
type may not correspond well to the genotype, resulting once again in premature
search stagnation [26].
The Artificial Neural Tissue concept [31] is intended to address limitations evi-
dent with existing variable topologies through the modeling of a number of biolog-
ically plausible mechanisms. ANT also uses a nonrecursive genotype-to-phenotype
mapping, avoiding deceptive fitness landscapes, and includes gene duplication sim-
ilar to DES. Gene duplication produces redundant copies of a master gene and facil-
itates neutral ‘complexification,’ where the copied gene undergoes mutational drift
and results in the expression of incremental innovations [10].

16.2.5 Regularity in the Environment


Most of the developmental systems described above deal with techniques used to fa-
cilitate evolution of network topologies. A critical advantage of the neural-network
approach is its ability to generalize. Generalization often relies on regularities and
patterns in the sensory input space to be effective. However, the sensor configuration
used by these developmental controllers still needs to be specified by the
experimenter. As discussed earlier, in a multirobot environment, it is often counterintuitive
as to what control rules are necessary to obtain desired global behaviors. Further-
more, it is difficult to know what sensor configuration is necessary to facilitate these
control rules. A number of techniques including the one presented here attempt to
address this limitation. By allowing the evolutionary search process to modify the
sensor configuration and geometry, it is expected that efficient solutions can be
found with fewer genetic evaluations.
Blondie24 [6], an evolved checkers-playing neural network, is one of the early
efforts to exploit regularity in task space. A standard fixed neural-network topology
was designed to take into account the regularity of the checkerboard. Hidden nodes
of the network were tied to subsquares (cell regions arranged in square shape). The
resultant network was coevolved with real players on an Internet checkers server
and reached an expert level of proficiency in the game.
HyperNEAT extends NEAT's capabilities by combining a variable neural-network
topology with a hypercube-based generative description of the topology [11]. In-
stead of encoding for every weight in the network separately in the genome, Hyper-
NEAT uses a type of Compositional Pattern Producing Network (CPPN) to produce
a network that represents the weight (connectivity) parameters of the phenotype
network (controller). It also allows for the CPPNs to represent symmetries and reg-
ularities from the geometry of the task inputs directly in the controller.
Geometric regularities can also be extracted using coarse-coding techniques.
Coarse coding allows for the partitioning of separate geometric locations and thus
allows for each part to be learned separately through task decomposition [11]. While
this is advantageous, it has also been argued that it may prevent the learning system
from discovering interdimensional regularities and alternate approaches have been
shown to overcome this limitation using a priori knowledge [18]. However, as we
show here, this capability can also be evolved.
ANT+SCC extends the ANT approach with a sensor mapping and filtering
scheme. This functionality is performed by a group of sensory neurons that interact
in a coarse-coding fashion. This interaction helps determine resultant sensor geom-
etry and resolution and aids in the filtering process. The output from these sensor
neurons feeds into an ANT controller where higher-level processing is performed.
This laminar scheme bears some resemblance in functionality to how the visual cor-
tex in the mammalian brain operates with lower layers performing sensory filtering
such as edge and line detection which in turn is used by higher-level functionality
to make, for example, optical flow measurements.

16.3 Artificial Neural Tissue Model


The ANT architecture (Fig. 16.1a) presented in this chapter consists of a developmental
program, encoded in the ‘genome,' that constructs a three-dimensional neural
tissue and associated regulatory functionality. The tissue consists of two types
of neural units: decision neurons and motor-control neurons, or simply motor neurons.
Regulation is performed by the decision neurons, which dynamically excite or
inhibit motor-control neurons within the tissue based on a coarse-coding technique.
The following sections discuss the computational aspects of the tissue and how it is
created.

16.3.1 Computation
We imagine the motor neurons of our network to be spheres arranged in a regular
rectangular lattice in which the neuron Nλ occupies the position λ = (l, m, n) ∈ I³
(sphere centered within cube). The state sλ of the neuron is binary, i.e., sλ ∈ S = {0, 1}.
Each neuron Nλ nominally receives inputs from neurons Nκ, where κ ∈ ⇑(λ),
the nominal input set. Here we shall assume that these nominal inputs are the 3 × 3
neurons centered one layer below Nλ; in other terms,
⇑(λ) = {(i, j, k) | i = l−1, l, l+1; j = m−1, m, m+1; k = n−1}. (As will be explained presently, however, we shall
not assume that all the neurons are active all the time.) The activation function of
each neuron is taken from among four possible threshold functions of the weighted
input σ :

\[
\psi_{\text{down}}(\sigma, \theta_1) = \begin{cases} 0, & \text{if } \sigma \ge \theta_1 \\ 1, & \text{otherwise} \end{cases}
\qquad
\psi_{\text{up}}(\sigma, \theta_2) = \begin{cases} 0, & \text{if } \sigma \le \theta_2 \\ 1, & \text{otherwise} \end{cases}
\]
\[
\psi_{\text{ditch}}(\sigma, \theta_1, \theta_2) = \begin{cases} 0, & \min(\theta_1, \theta_2) \le \sigma < \max(\theta_1, \theta_2) \\ 1, & \text{otherwise} \end{cases}
\tag{16.1}
\]
\[
\psi_{\text{mound}}(\sigma, \theta_1, \theta_2) = \begin{cases} 0, & \sigma \le \min(\theta_1, \theta_2) \text{ or } \sigma > \max(\theta_1, \theta_2) \\ 1, & \text{otherwise} \end{cases}
\]

The weighted input σλ for neuron Nλ is nominally taken as

\[
\sigma_\lambda = \frac{\sum_{\kappa \in \Uparrow(\lambda)} w_{\kappa\lambda}\, s_\kappa}{\sum_{\kappa \in \Uparrow(\lambda)} s_\kappa}
\tag{16.2}
\]

with the proviso that σ = 0 if the numerator and denominator are zero. Also, wκλ ∈ R
is the weight connecting Nκ to Nλ . We may summarize these threshold functions in
a single analytical expression as

ψ = (1 − k1 )[(1 − k2 )ψdown + k2 ψup ] + k1 [(1 − k2 )ψditch + k2 ψmound ] (16.3)



where k1 and k2 can take on the value 0 or 1. The activation function is thus encoded
in the genome by k1 , k2 and the threshold parameters θ1 , θ2 ∈ R.
It may appear that ψdown and ψup are mutually redundant as one type can be
obtained from the other by reversing the signs on all the weights. However, retaining
both increases diversity in the evolution because a single 2-bit ‘gene’ is required to
encode the threshold function and only one mutation suffices to convert ψdown into
ψup or vice versa as opposed to changing the sign of every weight.
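For reference, the threshold functions and the unified form can be transcribed almost line-for-line from Eqs. (16.1)–(16.3); the sketch below (function and variable names are ours) also includes the weighted input of Eq. (16.2) with its zero-denominator proviso.

```python
def psi_down(sigma, t1, t2):
    # Eq. (16.1): 0 if sigma >= theta1, else 1 (t2 unused)
    return 0 if sigma >= t1 else 1

def psi_up(sigma, t1, t2):
    # 0 if sigma <= theta2, else 1 (t1 unused)
    return 0 if sigma <= t2 else 1

def psi_ditch(sigma, t1, t2):
    # 0 inside the interval [min, max), else 1
    return 0 if min(t1, t2) <= sigma < max(t1, t2) else 1

def psi_mound(sigma, t1, t2):
    # 0 outside the interval (min, max], else 1
    return 0 if sigma <= min(t1, t2) or sigma > max(t1, t2) else 1

def psi(sigma, k1, k2, t1, t2):
    """Unified activation, Eq. (16.3): the 2-bit gene (k1, k2) selects the form."""
    return ((1 - k1) * ((1 - k2) * psi_down(sigma, t1, t2) + k2 * psi_up(sigma, t1, t2))
            + k1 * ((1 - k2) * psi_ditch(sigma, t1, t2) + k2 * psi_mound(sigma, t1, t2)))

def weighted_input(states, weights):
    """Eq. (16.2): weighted mean over active (s=1) inputs; 0 if none are active."""
    denom = sum(states)
    return sum(w * s for w, s in zip(weights, states)) / denom if denom else 0.0
```

Because (k1, k2) is a 2-bit gene, a single mutation moves between adjacent threshold forms, which is exactly the diversity argument made above for retaining both psi_down and psi_up.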
The sensor data are represented by the activation of the sensor input neurons
N_{α_i}, i = 1…m, summarized as A = {s_{α_1}, s_{α_2}, …, s_{α_m}}. Similarly, the output of the
network is represented by the activation of the output neurons N_{ω_j^k}, j = 1…n,
summarized as Ω = {s_{ω_1^1}, s_{ω_2^1}, …, s_{ω_n^b}}, where k = 1…b specifies the output behavior.
Each output neuron commands one behavior of the agent. (In the case of a robot,
a typical behavior may be to move forward a given distance. This may result in
the coordinated action of several actuators. Alternatively, the behavior may be more
primitive, such as augmenting the current of a given actuator.) If s_{ω_j^k} = 1, output
neuron ω_j^k votes to activate behavior k; if s_{ω_j^k} = 0, it does not. Since multiple neurons
can have access to a behavior pathway, an arbitration scheme is imposed to ensure
the controller is deterministic: p(k) = ∑_{j=1}^{n_k} s_{ω_j^k} / n_k, where n_k is the number of
output neurons connected to output behavior k, and behavior k is activated if p(k) ≥ 0.5.
As implied by the set notation of Ω , the outputs are not ordered. In this embodi-
ment, the order of activation is selected randomly. We are primarily interested here
in the statistical characteristics of relatively large populations but such an approach
would likely not be desirable in a practical robotic application. However this can be
remedied by simply assigning a sequence a priori to the activations (as shown in
Table 16.2 for the resource gathering task).
We moreover note that the output neurons can be redundant; that is, more than
one neuron can command the same behavior, in which case for a given time step
one behavior may be “emphasized” by being voted multiple times. Neurons may
also cancel each other out.
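The voting scheme can be sketched as follows (a minimal illustration with hypothetical behavior names; the actual controller wires output neurons to behavior pathways rather than passing dictionaries):

```python
def arbitrate(votes):
    """Activate behavior k when p(k) = (number of 1-votes) / n_k >= 0.5.

    votes maps each behavior to the states (0 or 1) of the output neurons
    wired to that behavior pathway, so the controller is deterministic even
    when several (possibly redundant) neurons share one pathway.
    """
    return [k for k, states in votes.items()
            if states and sum(states) / len(states) >= 0.5]

print(arbitrate({"move_forward": [1, 1, 0], "turn_left": [0, 0, 1]}))
# ['move_forward']: p = 2/3 for move_forward, 1/3 for turn_left
```

Redundant neurons voting for the same behavior raise p(k), which is the "emphasis" effect noted above; dissenting neurons lower it and can cancel the behavior out.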

16.3.2 The Decision Neuron


The coarse-coding nature of the artificial neural tissue is provided by the decision
neurons. Decision neurons can be thought of as rectangular structures occupying
nodes in the lattice as established by the evolutionary process (Fig. 16.1). The effect
of these neurons is to excite into operation or inhibit (disable) the motor control
neurons (shown as spheres). Once a motor control neuron is excited into opera-
tion, the computation outlined in (16.2) is performed. Motivated as we are to seek
biological support for ANT, we may look to the phenomenon of chemical communi-
cation among neurons. In addition to communicating electrically along axons, some
neurons release chemicals that are read by other neurons, in essence serving as a
“wireless” communication system to complement the “wired” one.

Fig. 16.1 Synaptic connections between motor neurons and operation of neurotransmitter
field: (a) synaptic connections; (b) coarse coding

The state s_μ of a decision neuron T_μ is binary and determined by one of the
same activation functions (16.1) that is used to calculate the output of a motor-control
neuron. The inputs to T_μ are all the input sensor neurons N_α; i.e.,
s_μ = ψ_μ(s_{α_1} … s_{α_m}), where σ_μ = ∑_α v_α^μ s_α / ∑_α s_α and v_α^μ are the weights. The
decision neuron is dormant if s_μ = 0 and releases a virtual neurotransmitter chemical of
uniform concentration c_μ over a prescribed field of influence if s_μ = 1. Motor-control
neurons within the highest chemical concentration field are excited into operation.
Only those neurons that are so activated will establish the functioning network
for the given set of input sensor data. Owing to the coarse-coding effect, the sums
used in the weighted input of (16.2) are over only the active subset ⇑̂(λ) ⊆ ⇑(λ) of
inputs to N_λ. Likewise the output of ANT is in general Ω̂ ⊆ Ω. The decision neuron's
field of influence is taken to be a rectangular box extending ±d_μ^r, where r = 1, 2, 3,
from μ in the three perpendicular directions. These three dimensions, along with μ
and c_μ, the concentration level of the virtual chemical emitted by T_μ, are encoded
in the genome.
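One plausible reading of this selection mechanism, sketched below with names of our own choosing, is to superpose the concentration fields of all active decision neurons and excite exactly those motor neurons that sit at the resulting peak concentration:

```python
def activated_motor_neurons(motor_positions, decision_fields):
    """Select motor neurons lying in the region of highest total concentration.

    motor_positions: list of (l, m, n) lattice coordinates.
    decision_fields: list of (centre, (d1, d2, d3), c) for each *active*
        decision neuron: a box extending +/- d_r from `centre` in each of the
        three directions, with uniform neurotransmitter concentration c inside.
    """
    def concentration(pos):
        total = 0.0
        for centre, dims, c in decision_fields:
            if all(abs(p - q) <= d for p, q, d in zip(pos, centre, dims)):
                total += c
        return total

    levels = {pos: concentration(pos) for pos in motor_positions}
    peak = max(levels.values(), default=0.0)
    return [pos for pos, c in levels.items() if c == peak and c > 0.0]

# Two overlapping fields: the motor neurons in the overlap see the peak level.
fields = [((0, 0, 0), (1, 1, 1), 1.0), ((1, 0, 0), (1, 1, 1), 0.5)]
motors = [(0, 0, 0), (1, 0, 0), (3, 3, 3)]
print(activated_motor_neurons(motors, fields))  # [(0, 0, 0), (1, 0, 0)]
```

Only the selected neurons then perform the computation of Eq. (16.2), so different sensor inputs recruit different subnetworks from the same tissue.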

16.3.3 Evolution and Development


A population of ANT controllers is evolved in an artificial Darwinian manner. The
‘genome’ for a controller contains a ‘gene’ for each cell with a specifier D that is

Fig. 16.2 Gene map for the Artificial Neural Tissue

used to distinguish the functionality (between motor control, decision and tissue).
A constructor protein (an autonomous program) interprets the information encoded
in the gene and translates this into a cell descriptor protein (see Fig. 16.2). The
gene ‘activation’ parameter is a binary flag resident in all the cell genes and is used
to either express or repress the contents of the gene. When repressed, a descriptor
protein of the gene content is not created. Otherwise, the constructor protein ‘grows’
the tissue in which each cell is located relative to a specified seed-parent address.
A cell death flag determines whether the cell commits suicide after being grown. Once again, this feature in the genome aids the evolutionary process: a cell that commits suicide still occupies a volume in the lattice although it is dormant. Because the cell otherwise retains its characteristics, evolution can reinstate it by merely toggling a bit.

Fig. 16.3 Genes are ‘read’ by constructor proteins that transcribe the information into a de-
scriptor protein which is used to construct a cell. When a gene is repressed, the constructor
protein is prevented from reading the gene contents

In turn, mutation (manipulation of gene parameters with a uniform random distribution) to the growth program results in new cells being formed through cell division.
The rate at which mutation occurs to a growth program is also specified for each tis-
sue and is dependent on the neuron replication probability parameter. Cell division
requires a parent cell (selected with highest replication probability relative to the rest
of the cells within the tissue) and results in copying m% of the original cell contents
to a daughter cell (where m is determined based on a uniform random distribution),
with the remaining cell contents initialized with a uniform random distribution. The
cell type of each new cell is determined based on the ratio of motor control to de-
cision neurons specified in the tissue gene. The new cell can be located in one of
six neighboring locations (top, bottom, north, south, east, west), sharing a common
side with the parent, as long as the volume is not occupied by another cell.
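The cell-division step described above can be sketched as follows. The flat list-of-floats genome is an assumption for illustration; the actual gene encoding described in this chapter is richer.

```python
# Illustrative sketch of cell division: a daughter cell inherits m% of the
# parent gene's contents (m drawn from a uniform random distribution), and
# the remainder is re-initialized uniformly at random. The plain list of
# floats standing in for a gene is an assumption, not the chapter's encoding.
import random

def divide(parent_gene, rng=random):
    m = rng.random()                       # fraction copied, uniform on [0, 1)
    daughter = []
    for g in parent_gene:
        if rng.random() < m:
            daughter.append(g)             # inherited from the parent
        else:
            daughter.append(rng.random())  # re-initialized uniformly
    return daughter
```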

16.3.4 Sensory Coarse Coding Model


In this section we present the Sensory Coarse Coding (SCC) model. The model in-
cludes two components that allow for filtering and mapping locations of sensory
inputs. Sensory Coarse Coding provides additional functionality not found within
the ANT model, namely the ability to search for spatial mappings of sensory in-
puts while simultaneously filtering these inputs for further processing. Biological
motivation for this capability comes from analyzing the visual cortex.
Within the tissue architecture, we include one additional type of neuron, the sen-
sor neuron. There exists a group of v sensory neurons Π = [Φτ 1 , Φτ 2 . . . Φτ v ], where
each neuron has a position τ = (l, m), l ∈ [0, h − 1] + 0.5, m ∈ [0, h − 1] + 0.5 on
a spatial map spanning h × h grid squares and representing the agent/robot and its
surroundings (Fig. 16.4 left). The state sτ of a sensor neuron can assume one of up
to q states, i.e., sτ ∈ A = {sα 1 , sα 2 . . . sα q }.
Each neuron Φτ receives input from spatial map locations Lϕ , where ϕ ∈ ⇑(τ ),
the input set. Each grid square Lϕ assumes a sensor reading, one of q states, i.e.,
Lϕ ∈ A . Here we shall assume that the receptive field for this sensor neuron is
a bounded area containing aτ × bτ grid squares centered at (l, m), i.e., ⇑(τ ) =
{(i, j) | i ∈ [l − aτ /2, l + aτ /2], j ∈ [m − bτ /2, m + bτ /2]}. The sensory neurons are
not necessarily fed their entire input set. A coarse coding system is used to decide
which inputs, if any, a sensory neuron will receive. Each sensory neuron emits a
stimulus chemical in the area ⇑(τ ) such that the amount of chemical diffused at
location ϕ due to sensory neuron Φτ i is the following:

1, if ϕ ∈ ⇑(τ i )
cϕ ,τ i = (16.4)
0, otherwise

Therefore, the net concentration of chemical diffused due to the v sensory neurons at ϕ is:

    cϕ = ∑_{i=1}^{v} cϕ,τi    (16.5)
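The deposit-and-sum scheme of Eqs. (16.4) and (16.5) can be sketched as follows; the set-of-tuples representation of a receptive field is an illustrative assumption.

```python
# Sketch of Eqs. (16.4)-(16.5): each sensory neuron deposits one unit of
# stimulus chemical on every grid square of its receptive field; the net
# concentration at a square is the sum over all v neurons.
from collections import Counter

def net_concentration(receptive_fields):
    """receptive_fields: one set of (i, j) grid squares per sensory neuron."""
    c = Counter()
    for field in receptive_fields:
        for square in field:
            c[square] += 1      # Eq. (16.4): unit deposit inside the field
    return c                    # Eq. (16.5): summed over all neurons

def coarse_coded_inputs(field, c):
    """The squares of a neuron's field holding the maximum net concentration."""
    if not field:
        return set()
    peak = max(c[sq] for sq in field)
    return {sq for sq in field if c[sq] == peak}
```

Overlapping receptive fields raise the concentration where they intersect, so the squares actually fed to a neuron are those its field shares with the most other fields.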

Fig. 16.4 (Left) Three sensor neurons and their respective receptive field shown shaded. With
S = 1, only Φ2 is selected since ∑ϕ ∈⇑(τ ) cϕ is the highest among the three. (Right) Once Φ2
is activated, only locations with the highest chemical concentrations (shaded in dark gray)
are fed as inputs to the evolved priority filter. The result is a single output from the neuron,
indicating red

To determine which grid squares feed sensory neuron Φτ, the chemical concentration at each location ϕ ∈ ⇑(τ) is calculated. The states of the locations Lϕ that have the maximum chemical concentration in the grid are fed to the sensory neuron inputs. If ∑ϕ∈⇑(τ) cϕ = 0, sensory neuron Φτ is deactivated. Furthermore, only the S sensor neurons with the highest ∑ϕ∈⇑(τ) cϕ are activated; S is an integer that can be evolved within the tissue gene or be specified for a task.
Therefore, we define Iτ = {Lϕ | cϕ = max_{ϕ∈⇑(τ)} cϕ} as the input set after coarse coding to sensory neuron Φτ. For sensor neurons that are active, we calculate sτ:

    sτ = min_{i∈[1,...,q]} (pi)  such that  pj ∩ Iτ = ∅, ∀ j < i    (16.6)

where pj is an element of a global priority list P of sensory states, P = [p1 . . . pq], and where pj ∈ A. The global priority list is obtained by polling a group of filter
units and is described in Section 16.3.4.1. In summary, each sensory neuron takes
inputs from ⇑(τ ) and produces a single output sτ , where both inputs and outputs
are restricted to the states in A . This reduction of inputs to a single output is done
through prioritized filtering using the global priority list P. Thus if a sensory neu-
ron’s input set ⇑(τ ) contains one or more states p1 , the sensory neuron’s output sτ
is set to p1 , regardless of its other input states. Similarly, if a sensory neuron’s input
set contains one or more states p2 and no states p1 , the output is set to p2 , regardless
of its other inputs, and so on down the priority list.
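The prioritized filtering of Eq. (16.6) reduces to a first-match scan down the priority list; a minimal sketch:

```python
# Sketch of the prioritized filtering of Eq. (16.6): the sensory neuron's
# single output is the highest-priority state present among its
# coarse-coded input squares.
def filter_output(input_states, priority_list):
    """input_states: states of the selected squares; priority_list: P."""
    for p in priority_list:     # p1 first, then p2, and so on
        if p in input_states:
            return p
    return None                 # no listed state present
```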

16.3.4.1 Input Filtering


The priority list P is generated by polling a group of n filter units. Each of these independent units takes q weighted inputs and produces a single binary output using the threshold activation function ψup from (16.1). Each filter unit j has q weights wjk, 1 ≤ k ≤ q. To poll the filter units for a particular input state sαk ∈ A, the units are given an input vector of size q containing all zeros except at position k, which is set to one, and their outputs are summed to yield Vsαk.

Thus to tally the votes for input state sα3, the filter units receive the input vector [0 0 1 0 . . . 0] of size q, and their outputs are summed as given below:

    Vsαk = ∑_{j=1}^{n} ψup(wjk, θj)    (16.7)

This process is repeated for all states in A , and the priority list is generated by
assigning the state with the highest number of votes to p1 , assigning the state that
garnered the second highest number of votes to p2 , etc. In case of a tie, the tie-
breaker is the sum of the raw outputs of the filter networks, i.e., before the ψup
activation function is applied.
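The polling procedure can be sketched as follows. A simple step function stands in for ψup, and taking the summed pre-threshold weights as the tie-breaker is an assumption about the 'raw outputs'.

```python
# Sketch of the polling scheme of Section 16.3.4.1 and Eq. (16.7). The step
# activation and the exact tie-breaker quantity are assumptions.
def build_priority_list(weights, thetas, states):
    """weights[j][k]: weight of filter unit j for state k; thetas[j]: thresholds."""
    votes, raw = [], []
    for k in range(len(states)):
        outs = [1 if w[k] >= th else 0 for w, th in zip(weights, thetas)]
        votes.append(sum(outs))                 # Eq. (16.7): summed votes
        raw.append(sum(w[k] for w in weights))  # tie-breaker before threshold
    order = sorted(range(len(states)), key=lambda k: (-votes[k], -raw[k]))
    return [states[k] for k in order]
```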

16.3.4.2 Evolution and Development


Fig. 16.5 shows the additional types of genes included in the tissue genome. These
genes are developed similarly to the motor neurons and decision neurons as de-
scribed in Section 16.3.3. The sensor neurons are grown on a two-dimensional spa-
tial map. Mutations can perturb the contents of an existing gene or result in the de-
velopment of new ones. Both the filter units and sensor neurons also have a ‘sensor type’ specifier, which restricts each to accessing certain types of sensory inputs such as obstacle detection or color detection (see Section 16.4 for further details). Furthermore, sensor neurons can reference different groups of filter units using the ‘Filter Reference’ parameter; however, for the experiments presented here we set this value to 0.

Fig. 16.5 Gene map for the Sensory Coarse Coding Model

16.4 An Example Task: Resource Gathering


The effectiveness of the ANT controller is demonstrated in simulation on the re-
source gathering task [32]. A team of robots collects resource material distributed
throughout its work space and deposits it in a designated dumping area. The
workspace is modeled as a two-dimensional grid environment, with one robot occupying four grid squares.
For this task, the controller must possess a number of capabilities including gath-
ering resource material, avoiding the workspace perimeter, avoiding collisions with
other robots, and forming resources into a berm at the designated location. (In the
present experiment, a berm is simply a mound of the resource material.) The berm
location has perimeter markings on the floor and a light beacon mounted nearby. The
two colors on the border are intended to allow the controller to determine whether
the robot is inside or outside the berm location (Fig. 16.6).

Fig. 16.6 2D grid world model of experiment chamber

Though solutions can be found without the light beacon, its presence improves
the efficiency of the solutions found, as it allows the robots to track the target loca-
tion from a distance instead of randomly searching the workspace for the perimeter.
The global fitness function for the task measures the amount of resource material
accumulated in the designated location within a finite number of time steps, in this
case T = 300. Darwinian selection is performed based on the fitness value of each
controller averaged over 100 different initial conditions.

Table 16.1 Predefined Sensor Inputs

Sensor Variables Function Description


V1 . . .V4 Resource Detection Resource, No Resource
C1 . . .C4 Template Detection Blue, Red, Orange, Floor
S1 , S2 Obstacle Detection Obstacle, No Obstacle
LP1 Light Position Left, Right, Center, No Light
LD1 Light Range 0-10 (distance to light)

Fig. 16.7 Predefined input sensor mapping, with simulation model inset

Table 16.2 Preordered Basis Behaviors

Order Behavior Description


1 Dump Resource Move one grid square back; turn left
2 Move Forward Move one grid square forward
3 Turn Right Turn 90◦ right
4 Turn Left Turn 90◦ left
5, 7, 9, 11 Bit Set Set memory bit i to 1, i = 1 . . . 4
6, 8, 10, 12 Bit Clear Set memory bit i to 0, i = 1 . . . 4

Simple feature-detection heuristics are used to determine the values of V1 . . .V4 and C1 . . .C4 based on the grid locations shown. For detection of the light beacon,
the electronic shutter speed and gain are adjusted to ensure that the light source is
visible while other background features are underexposed. The position of the light
LP1 is determined based on the pan angle of the camera. The distance to the light
source LD1 is estimated based on its size in the image. The robots also have access
to four memory bits, which can be manipulated using some of the basis behaviors.
Table 16.2 lists the basis behaviors the robot can perform. These behaviors are acti-
vated based on the output of the ANT controller, and all occur within a single time
step.

16.4.1 Coupled Motor Primitives


In this section we consider an alternative setup, where the ANT controllers are pro-
vided building blocks for the basis behaviors in the form of motor primitive [9]
sequences. The motor primitives are taken as discrete voltage signals over a dis-
crete time window applied on DC motors as shown in Fig. 16.8 and as arguments to
the motor primitive commands in Table 16.3. These voltage output signals feed to
the actuators and can be in one of three states, {1, 0, −1}V for a discrete time win-
dow, Δ tn , n ∈ {0, 1, 2, 3, 4, 5} as shown. In addition, each actuator takes on a default

Fig. 16.8 Motor primitives composed of discretized voltage signals shown for a simulated
robot

Fig. 16.9 Modified tissue gene that includes order of execution of motor primitive sequences

Table 16.3 Coupled Motor Primitives for the Sign-Following Task

Neuron ID Behavior Coupled Motor Signals


1 Move Forward Left Motor 1 || Right Motor 1
2 Turn Right 90◦ Left Motor 1 || Right Motor -1
3 Turn Left 90◦ Left Motor -1 || Right Motor 1
4 Pivot Right Left Motor 0 || Right Motor -1
5 Pivot Left Left Motor 0 || Right Motor 1
6 Pivot Right Left Motor 1 || Right Motor 0
7 Pivot Left Left Motor -1 || Right Motor 0
8, 10, 12, 14 Bit set Set memory bit i to 1, i = 1 · · · 4
9, 11, 13, 15 Bit clear Set memory bit i to 0, i = 1 · · · 4

voltage value of 0. The actual value of V , the voltage constant, is dependent on the
actuator.
The ANT controller also needs to determine the order of execution of these motor
primitive sequences. The modified tissue gene is shown in Fig. 16.9. The order
of the output coupled motor primitive (CMP) sequences is evolved as additional parameters in the tissue gene and is read starting from the left. The elements of the table, o1, · · · , oε, contain the Neuron ID values. The order is randomly initialized at the start of the evolutionary process, with each Neuron ID occupying one spot on the gene. Point mutations to this section of the tissue gene result in swapping

Neuron ID values between sites. Table 16.3 shows the repertoire of coupled motor
primitives provided for the ANT controllers and thus ε = 15 for this particular setup
(Fig. 16.9). The motor primitives are coupled: for example, the left drive motor and the right drive motor are executed in parallel (indicated using ||). Under this
setup, it is still possible for the controller to execute a sequence of motor primitives
in a serial fashion.
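The order-swapping point mutation described above can be sketched as follows; the flat-list representation of the execution-order table is an assumption.

```python
# Sketch of the order-of-execution point mutation: a point mutation swaps
# the Neuron ID values held at two sites of the tissue gene's
# execution-order table.
import random

def mutate_order(order, rng=random):
    """Return a copy of the CMP execution order with two sites swapped."""
    order = list(order)
    i = rng.randrange(len(order))
    j = rng.randrange(len(order))
    order[i], order[j] = order[j], order[i]
    return order
```

Because sites are swapped rather than overwritten, every mutated order remains a permutation of the ε = 15 Neuron IDs.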

16.4.2 Evolutionary Parameters


The evolutionary algorithm population size for the experiments is P = 100, with
crossover probability pc = 0.7, mutation probability pm = 0.025 and a tournament
size of 0.06P. The tissue is initialized as a ‘seed culture’, with 3 × 6 motor control neurons in one layer. After this, the tissue is grown to include 70–110 neurons
(selected from a uniform random distribution) before starting the evolutionary pro-
cess. These seeding parameters are not task specific and have been observed to be
sufficient for a number of different robotic tasks.
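A minimal sketch of the tournament selection implied by these parameters; genomes and the fitness function are placeholders, not the ANT tissue gene.

```python
# Sketch of the evolutionary setup in Section 16.4.2: population P = 100,
# crossover probability pc = 0.7, mutation probability pm = 0.025, and
# tournament size 0.06P (i.e., 6). Genomes and fitness are placeholders.
import random

P, PC, PM = 100, 0.7, 0.025
TOURNAMENT = max(2, int(0.06 * P))   # 6 individuals per tournament

def tournament_select(pop, fitness, rng=random):
    """Pick the fittest of TOURNAMENT randomly sampled individuals."""
    contenders = rng.sample(range(len(pop)), TOURNAMENT)
    return pop[max(contenders, key=lambda i: fitness[i])]
```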

16.5 Results
We compare evolutionary performance of various evolvable control system models
in Fig. 16.10. Included is a Cellular Automata (CA) lookup table, which consists of a table of reactive rules spanning 16384 entries for this task and which was evolved using the population, selection and mutation parameters from Section 16.4.2. The genome is binary and is merely the contents of the lookup table. For this approach, we also assumed that the light beacon is turned off. Hence there exist 2^4 × 4^4 × 2^2 = 16384 possible combinations of sensory input states, accounting for the resource detection, template detection and obstacle detection sensors, respectively (Table 16.1). For each combination of sensory inputs, any of the 12 allowable behaviors outlined in Table 16.2 could be
executed. As can be seen, the population quickly stagnates at a very low fitness due
to the ‘bootstrap problem’ [23]. With limited supervision, the fitness function makes
it difficult to distinguish between incrementally fitter solutions. Instead, the system depends on ‘bigger leaps’ in fitness space (through sequences of mutations) to produce solutions distinguishable during selection. However, bigger leaps become more improbable as evolution progresses, resulting in search stagnation. The performance of a
population of randomly initialized fixed-topology, fully-connected networks, con-
sisting of between 2 and 3 layers, with up to 40 hidden and output neurons is also
shown in Fig. 16.10.
In a fixed-topology network there tends to be more ‘active’ synaptic connec-
tions present (since all neurons are active), and thus it takes longer for each neuron
to tune these connections for all sensory inputs. In this regard ANT is advanta-
geous, since the topology is evolved and decision neurons learn to inhibit noisy
neurons through a masking process. The net result is that ANT requires fewer ge-
netic evaluations to evolve desired solutions in comparison to standard neural net-
works. The standard ANT model using sensory inputs and basis behaviors outlined

Fig. 16.10 Evolutionary performance comparison, showing population best averaged over
30 evolutionary algorithm runs of various evolvable control architectures. Error bars indicate
standard deviation. As shown, ANT combined with Sensory Coarse Coding (SCC) and Cou-
pled Motor Primitives (CMP) ordered through evolution obtains desired solutions with fewer
genetic evaluations. The CA lookup table approach remains stagnant and is unable to solve the task, while fixed-topology neural nets converge at a much slower rate

in Tables 16.1 and 16.2, respectively, shows a noticeable improvement in evolu-


tionary performance over the lookup table and fixed topology architectures. Further
improvement is gained when we allow the ANT architecture to evolve the execution
order scheme and coupled motor primitives (Section 16.4.1) instead of using a list
of preordered basis behaviors.
Finally, we also compare ANT+SCC using coupled motor primitives with these
models. To make the comparison meaningful with respect to the other models, we
impose some restrictions on the ANT+SCC configuration. This includes limiting
the maximum number of active (selected) sensor neurons from the SCC model to
4 for the resource detection layer and 4 for the template detection layer. We also
used predefined layouts for the other spatial sensors, namely obstacle detection. As
can be seen in these results, ANT+SCC shows a noticeable performance advantage
over the baseline ANT model. Furthermore, we obtain equivalent population best
fitness values requiring approximately 5 times less genetic evaluations than with the
baseline ANT model (Table 16.4). It should be noted that desired solutions ( f ≥
0.885) were not obtained using standard neural networks within 10,000 generations
of evolution. Examples of an evolved execution order scheme and sensor priority
table using ANT+SCC+CMP are shown in Fig. 16.12.
A typical evolutionary run for the resource gathering task using ANT takes ap-
proximately six hours on a dual core Intel T7200, 2GHz desktop processor, with

Table 16.4 Number of generations required to obtain desired solutions ( f ≥ 0.885)

Method Avg. Generations Standard Deviation


ANT+SCC+CMP 421 133
ANT+CMP 1,142 241
ANT+SCC+CMP Control Experiment 1 1,343 104
ANT+SCC+CMP Control Experiment 2 1,968 312
ANT 1,983 235
Fixed Topology Neural Net. > 10,000 NA

only one core being used for the evolutionary run. With ANT+SCC+CMP, one can
get a comparable solution in less than one hour and thirty minutes. Furthermore, since this fivefold improvement in performance is due to enhancements in the search process, the improvement is expected to carry over to faster processors.
Using regular neural networks comparable solutions were not obtained even after
approximately 30 hours (10,000 generations) of evolution.
The solutions obtained in Table 16.4 can accumulate at least 88.5% of the dispersed resources in the designated dumping area within T = 300 timesteps and have been determined to be of sufficient quality to complete the task (see Fig. 16.18 for a hardware demonstration). Given more time steps, it is expected that the robots would accumulate the remaining resources. One would ideally like to provide raw sensor data as input to the robot controller; however, this results in an exponential increase in search space for a linear increase in sensor states. The alternative would be to filter out and guess which subset of the sensory states may be useful for solving
a prespecified task. A wrong guess or poor design choice may make the process of
finding a suitable controller difficult or impossible. Hence, ANT+SCC allows for
additional flexibility, by helping to filter out and determine suitable sensory input
states.
Fig. 16.13 shows the evolved population of sensory neurons on the body-centric
spatial map. Fig. 16.11 (left) shows the average area used by selected sensor neu-
rons and the average number of sensor neurons that participated in the selection pro-
cess during evolution. The average area remains largely constant, indicating there is
strong selective pressure towards particular geometric shapes and area. This makes
sense for the resource gathering task, as controllers need to detect a sufficiently
large area in front to identify color cues indicating whether the robot is inside or
outside the dumping area. What is interesting is that with S = 4, for template detec-
tion sensor neurons, we still see a steady increase in the number of sensor neurons
competing to get selected. The increased number of neurons can potentially act in a
cooperative manner, reinforcing and serving as redundant receptive fields covering
key locations on the spatial map. Redundancy is beneficial in limiting the impact of

deleterious mutations. Fig. 16.11 (right) shows that the individuals in the evolution-
ary process start off by sensing a smaller area and that this area is steadily increased
as the solutions converge. If each sensor neuron senses just one grid square area,
then filtering is effectively disabled. At the beginning of the evolutionary process,
individuals take on reduced risk by sensing a smaller effective area, but as the fil-
tering capability evolves concurrently (correctly prioritizing the sensory cues), it
allows the individual controllers to sense and filter a larger area. The number of active filter units continues to be pruned until it reaches a steady-state value. This
trend is consistent with experiments using ANT [31], where noisy neurons are shut
off as the controllers converge towards a solution.
To measure the contribution of coarse-coding and filtering to the ANT+SCC performance improvement, we performed control experiment 1, in which the maximum size of the sensor cells was restricted to one grid square and the net concentration of each grid square within the spatial map was set to 1 (Table 16.4). These two modifications effectively prevent coarse sensor cells from forming and interacting to form fine representations. Instead, what is left is a group of fine sensor neurons that are always active. With the sensor cell area restricted to one

Fig. 16.11 (Left) Average area occupied by selected sensor neurons and number of sensor
neurons that participated in the selection process during evolution. (Right) Number of active
filter units and number of grid squares accessible by the sensor neurons during evolution.
Both plots show parameters from population best averaged over 30 evolutionary algorithm
runs

Fig. 16.12 Evolved coupled motor primitives ordering scheme and sensor priority list for
template detection shown for an ANT+SCC+CMP controller with a fitness of 0.98. See Ta-
ble 16.3 and 16.1 for reference

Fig. 16.13 Example of an evolved sensor layout (fitness of 0.98) using the ANT+SCC+CMP
model. (Left) Participating sensor neurons and receptive fields (template detection) shown.
(Right) Selected sensor neurons shown. Shaded area indicates resultant regions sensed by the
controller

grid square, the priority filter has no effect, since it requires at least two grid squares
with differing sensory input states. The fitness performance of this model is com-
parable to the baseline ANT model. However, since this model also uses coupled
motor primitives and it performed worse than ANT+CMP alone, the net impact of
these imposed constraints is actually a decrease in performance. Furthermore, we
performed a second control experiment (control experiment 2), where we imposed
the receptive field sizes to 3 × 3 grid squares and set the net concentration at each
grid square to 1 (Table 16.4). These two modifications ensure the receptive field re-
mains coarse and prevents coarse coding interactions from occurring, while leaving
the filter functionality within SCC turned on. The net effect is a noticeable drop in performance. Both of these experiments indicate that the coarse-coding interaction between sensor neurons helps find desired solutions within fewer genetic evaluations.

16.5.1 Evolution and Robot Density


Fig. 16.14 shows the fitness (population best) of the overall system evaluated at each
generation of the artificial evolutionary process using the baseline ANT model, with
a specified initial resource density and various robot densities. These results show
that system performance increases with the number of robots present (with total area
held constant). For scenarios initialized with more robots, each robot has a smaller
area to cover in trying to gather and dump resources.

16.5.2 Behavioral Adaptations


In an ANT-based architecture, networks are dynamically formed with decision neu-
rons processing the sensory input and in turn ‘selecting’ motor-control neurons
through coarse-coding [31]. The behavioral activity of the controllers (see Fig. 16.16)
shows the formation of small networks of neurons which handle individual behav-
iors, such as dumping resources or detecting visual templates (boundary perimeters,

Fig. 16.14 Evolutionary performance comparison of ANT-based solutions for one to five
robots. Error bars indicate standard deviation

target area markings, etc.). Localized regions within the tissue do not exclusively handle these specific user-defined, distal behaviors. Instead, the activity of the decision neurons indicates a distribution of specialized ‘feature detectors’ among independent networks.
Some of the emergent solutions evolved indicate that the individual robots all
figure out how to dump nearby resources into the designated berm area, but that not
all robots deliver resources all the way to the dumping area every time. Instead, the
robots learn to pass the resource material from one individual to another during an
encounter, forming a ‘bucket brigade’ (see Fig. 16.15, 16.18). This technique im-
proves the overall efficiency of the system as less time is spent traveling to and from
the dumping area. Since the robots cannot explicitly communicate with one another,
these encounters happen by chance rather than through preplanning. As with other
multiagent systems, communication between robots occurs through the manipula-
tion of the environment in the form of stigmergy. The task in [33] is similar in that
distributed objects must be delivered to a confined area; however, the hand-designed
controller does not scale as well as the ‘bucket brigade’ solution that the ANT con-
trollers discovered here. We also noticed that the robot controllers make use of the light beacon, located next to the dumping area, to home in on the target; however, there is no noticeable difference in fitness performance when the robot controllers are evolved with the light turned off [32]. In these simulation experiments,
the robots have no way to measure the remaining time available; hence, the sys-
tem cannot greedily accumulate resource materials without periodically dumping
the material at the designated area.

Fig. 16.15 Snapshots of robots and trajectories of a task simulation (4 robots)

Fig. 16.16 Tissue Topology and neuronal activity of a select number of decision neurons.
Decision neurons in turn ‘select’ (excite into operation) motor control neurons within its
diffusion field

Fig. 16.17 Scaling of ANT-based solutions from one to five robots

Fig. 16.18 Snapshots of two rovers performing the resource gathering task using an ANT
controller. Frames 2 and 3 show the ‘bucket brigade’ behavior, while frames 4 and 5 show
the boundary avoidance behavior

16.5.3 Evolved Controller Scalability


We examine the fittest solutions from the simulation runs shown in Fig. 16.17 for
scalability in the number of robots while holding the amount of resources constant.
Taking the controller evolved for a single robot and running it on a multirobot sys-
tem shows limited performance improvement. In fact, using four or more robots
results in a decrease in performance, due to the increased antagonism created.

The scalability of the evolved solution depends in large part on the number of
robots used during the training runs. The single-robot controller expectedly lacks
the cooperative behavior necessary to function well within a multiagent setting.
For example, such controllers fail to develop ‘robot collision avoidance’ or ‘bucket
brigade’ behaviors. Similarly, the robot controllers evolved with two or more robots
perform demonstrably worse when scaled down to a single robot, showing that the
solutions are dependent on cooperation among the robots.

16.6 Discussion
In this chapter, we use a global fitness function to train multirobot controllers with
limited supervision to perform self-organized task decomposition. Techniques that
perform well for the task make use of modularity and generalization. Modularity
is the use and reuse of components, while generalization is the process of finding
patterns or making inferences from many particulars. With a multirobot setup, mod-
ularity together with parallelism is exploited by evolved controllers to accomplish
the task. Rather than have one centralized individual attempting to solve a task using
global information, the individuals within the group are decentralized, make use of
local information and take on different roles through a process of self-organization.
This process of having different agents solve different subcomponents of the task in
order to complete the overall task is a form of task decomposition.
In this multirobot setup, there are both advantages and disadvantages to consider.
Multiple robots working independently exploit parallelism, helping to reduce the
time and effort required to complete a task. Furthermore, we also see that solutions
show improved overall system performance when evolved with groups of robots. It
should be noted that the density of robots is critical to solving the task. Higher
densities of robots result in antagonism, with robots spending more time getting out
of the way of one another rather than progressing on the task, leading to reduced
system performance.
It was shown that a CA look-up table architecture, which lacks both modularity and generalization, is intractable due to the ‘bootstrap problem,’ resulting in premature search stagnation. This is because EAs are unable to
find an incrementally better solution during the early phase of evolution. Use of
neural networks is a form of functional modularization, where each neuron per-
forms sensory-information processing and makes solving the task more tractable.
However with increased numbers of hidden neurons, one is faced with the effects of
spatial crosstalk where noisy neurons interfere and drown out signals from feature-
detecting neurons [16]. Crosstalk in combination with limited supervision (through
use of a global fitness function) can again lead to the ‘bootstrap problem’ [23]. Thus,
choosing the wrong network topology may lead to a situation that is either unable
to solve the problem or difficult to train [31].
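To make concrete why a pure look-up table lacks generalization, consider a minimal sketch (all names and sizes here are illustrative assumptions, not the chapter's actual setup): the table must hold one action entry per unique combination of sensory inputs, so evolution has to tune every entry independently, and the entry count grows exponentially with the number of sensors.

```python
import random

# Hypothetical illustration (sensor counts and state sets are assumptions):
# a CA-style look-up table controller stores one action entry per unique
# combination of sensory inputs, so evolution must tune each independently.
NUM_SENSORS = 8          # discrete sensor channels
STATES_PER_SENSOR = 3    # e.g. {empty, resource, wall}
NUM_ACTIONS = 4

table_size = STATES_PER_SENSOR ** NUM_SENSORS
print(table_size)  # 6561 entries even for this small sensor suite

def make_random_table():
    """One genome = one action per sensory combination (no generalization)."""
    return [random.randrange(NUM_ACTIONS) for _ in range(table_size)]

def act(table, sensor_readings):
    """Index the table by the mixed-radix encoding of the sensor vector."""
    index = 0
    for reading in sensor_readings:
        index = index * STATES_PER_SENSOR + reading
    return table[index]

controller = make_random_table()
action = act(controller, [0] * NUM_SENSORS)  # action for the all-'empty' state
```

Because nearby sensory states share no structure in this representation, an incremental improvement to one entry says nothing about any other, which is one way to see why early evolution stalls on the bootstrap problem.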
With the use of Artificial Neural Tissues (ANT), we introduce hierarchical func-
tional modularity into the picture. The tissue consists of modular neurons that can
form dynamic, modular networks of neurons. These groups of neurons handle
408 J. Thangavelautham, P. Grouchy, and G.M.T. D’Eleuterio

specialized functionality, as we have shown, and can be reused repeatedly for this
purpose. In contrast, with a standard fixed topology neural network, similar func-
tionality may need to evolve independently multiple times in different parts of the
network. In these various neural network architectures, modularity is functional,
with behaviors and capabilities existing in individual neurons or in groups and trig-
gered when necessary. ANT facilitates evolution of this capability by allowing for
regulatory functionality that enables dynamic activation and inhibition of neurons
within the tissue. Groups of neurons can easily be activated or shut off through a
coarse-coding process. Furthermore, with the ANT+SCC model, we allow for evo-
lution of both spatial and functional modularity. Spatial modularity is possible with
the SCC model, since we may get specialized sensory neurons that find spatial sen-
sory patterns. The outputs from these sensory neurons are used as inputs by various
groups of neurons active within ANT. These sensor neurons act as specialized fea-
ture detectors looking for either color cues or resources.
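The dynamic activation and inhibition of neuron groups described above can be pictured with a simplified sketch (an illustration only, assuming rectangular fields and a fixed overlap threshold, not the chapter's actual ANT implementation): regulatory decision neurons each emit a coarse activation field over a sheet of motor neurons, and only neurons covered by enough overlapping fields participate in the output.

```python
# Simplified sketch of coarse-coded regulation in an ANT-like tissue.
# Field shapes and the overlap threshold are assumptions for illustration.
import itertools

GRID = 6  # motor neurons laid out on a GRID x GRID sheet

def activation_mask(decision_fields, threshold=2):
    """Each active decision neuron covers a coarse rectangular field of
    motor neurons; a motor neuron fires only where enough fields overlap."""
    counts = [[0] * GRID for _ in range(GRID)]
    for (r0, r1, c0, c1) in decision_fields:
        for r, c in itertools.product(range(r0, r1), range(c0, c1)):
            counts[r][c] += 1
    return [[counts[r][c] >= threshold for c in range(GRID)] for r in range(GRID)]

# Two overlapping coarse fields: only their intersection is activated, so a
# few regulators can switch whole groups of neurons on or off at once.
fields = [(0, 4, 0, 4), (2, 6, 2, 6)]
mask = activation_mask(fields)
active = sum(cell for row in mask for cell in row)
print(active)  # the 2x2 intersection -> 4 active motor neurons
```

The point of the sketch is the regulatory economy: activating or dropping one coarse field changes the fate of many motor neurons simultaneously, rather than requiring each neuron to be silenced or recruited individually.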
Comparison of the various evolvable control system models indicates that con-
trollers with an increased ability to generalize evolve desired solutions with far
fewer genetic evaluations. The CA lookup table architecture, lacking generalization
capability, performed the worst: its evolved functionality needs to
be tuned for every unique combination of sensory inputs. A regular fixed topology
network performed better, but since the topology had no capacity to increase in size
or selectively activate/inhibit neurons within the network, it needed to tune most
of its neurons both to help perform input identifications or actions and to prevent
these same neurons from generating spurious outputs. Thus the
same capabilities may have to be acquired by different neurons located in different
parts of the network requiring an increased number of genetic evaluations to reach
a desired solution.
The standard ANT architecture can quickly shut off (mask out) neurons generating
spurious output and thus does not require sequences of mutations to occur, tuning
each neuron within the tissue to acquire compatible (or similar) capabilities
or remain dormant. Thus certain networks of neurons within the tissue can acquire
and apply a certain specialized capability (Fig. 16.16), while most others remain
dormant through the regulatory process. Hence within ANT, increased functional
generalization is achieved through specialization. With the fixed topology neural
network, the net effect of all neurons being active all the time is that the
controllers must evolve to individually silence each spurious neuron or acquire the
same capabilities repeatedly, implying reduced functional generalization.
ANT+SCC can generalize even further. Apart from being able to selectively ac-
tivate/inhibit neurons, it can also choose to receive a coarse or fine representation
of the sensory input. In other words, it can perform further sensor generalization.
A coarse representation of the sensory input in effect implies some degree of gen-
eralization. The priority filtering functionality prioritizes certain sensor states over
others, while the coarse coding representation selects a subset of the inputs to send
to the filter. The resultant input preprocessing facilitates finding and exploiting un-
derlying patterns in the input set. The net effect is that the controller does not have
16 Coarse-Coding Techniques for Evolvable Multirobot Controllers 409

to deal with as many unique conditions since the number of unique sensory input
combinations seen by the ANT controller is reduced by SCC. This in turn facili-
tates evolution of controllers that require fewer generations to reach a desired solu-
tion. At the same time, over-generalization of the sensory inputs is problematic (see
ANT+SCC+CMP control experiment 2). By imposing coarse receptive fields and
preventing coarse-coded interactions, the controllers may miss key (fine) features
through prioritized filtering. Hence, although the sensory input space may
effectively have shrunk, valuable information is lost through over-generalization. These
results justify the need for representations that selectively increase or decrease gen-
eralization of sensory input through coarse-coding.
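One way to picture this input preprocessing is the hedged sketch below (the pooling scheme and priority ordering are assumptions, not the exact SCC filter): coarse receptive fields pool neighbouring sensor readings, and a priority filter then reports only the highest-priority state within each field, so many distinct raw input vectors collapse onto one filtered pattern.

```python
# Illustrative sketch of SCC-style sensory generalization (assumed details):
# coarse fields pool adjacent sensors, and a priority filter keeps only the
# most important state per field, shrinking the set of distinct inputs.
PRIORITY = {"resource": 2, "wall": 1, "empty": 0}  # assumed ordering

def coarse_code(readings, field_size=2):
    """Group adjacent sensor readings into coarse receptive fields."""
    return [readings[i:i + field_size] for i in range(0, len(readings), field_size)]

def priority_filter(field):
    """Report only the highest-priority state seen within a coarse field."""
    return max(field, key=lambda state: PRIORITY[state])

def preprocess(readings):
    return [priority_filter(field) for field in coarse_code(readings)]

raw = ["empty", "resource", "wall", "empty", "empty", "empty"]
print(preprocess(raw))  # ['resource', 'wall', 'empty']

# The controller now faces fewer unique conditions, but fine detail (which
# of the two sensors in a field saw the resource) is lost, which is exactly
# the over-generalization risk noted above.
```

The trade-off discussed in the text falls directly out of this sketch: widening `field_size` shrinks the input space further but discards progressively more fine-grained information.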
This increased ability to generalize by the ANT+SCC model also seems to offset the
increased number of parameters (an enlarged search space) that need to be evolved.
Herein lies a tradeoff, as a larger search space alone may require a greater number
of genetic evaluations to reach a desired solution, but this may also provide some
unexpected benefits. In particular, a larger space may help in finding more feasible
or desirable solutions than those already present and may even reduce the necessary
number of genetic evaluations by guiding evolution (as in the ANT+SCC case). As
pointed out, ANT+SCC with its ability to further generalize sensory input appears to
provide a net benefit, even though it needs to be evolved with additional parameters
(in comparison to the standard ANT model).
This benefit is also apparent when comparing the baseline ANT controller with
ANT-ordered coupled motor primitives. The additional genomic parameters appears
to be beneficial once again, since the search process has access to more potential
solutions. Furthermore, it should be noted that these additional degrees of freedom
within the ANT+SCC controller do not appear to introduce deceptive sensory inputs
or capabilities. Deceptive inputs and capabilities can slow down the evolutionary
process, since the evolving system may retain these capabilities when they initially
provide a slight fitness advantage. However, these functionalities can in turn limit
or prevent the controllers from reaching the desired solution. Thus in effect, the
evolving population can get stuck at a local optimum, unable to move on to a better
fitness peak.

16.7 Conclusion
This chapter has reported on a number of experiments used to automatically gen-
erate neural-network-based controllers for multirobot systems. We have shown
that with judicious selection of a fitness function, it is possible to encourage self-
organized task decomposition using evolutionary algorithms. We have also shown
that by exploiting hierarchical modularity, regulatory functionality and the ability
to generalize, controllers can overcome tractability concerns. Controllers with in-
creased modularity and generalization abilities are found to evolve desired solutions
with fewer training evaluations by effectively reducing the size of the search space.
These techniques are also able to find novel multirobot coordination and control
strategies. To facilitate this process of evolution, coarse-coding techniques are used
to evolve ensembles of arbitration neurons that acquire specialized functionality.
Similar techniques are used to evolve sensor-filter configurations. Both techniques
facilitate functional and spatial modularity and generalization. This combination al-
lows for a methodical approach to control development, particularly one where the
controller and robot sensory configurations can be automatically generated starting
from a blank slate, where the designer can be largely relieved of the design process
and where detailed models of the system or environment can be avoided.

References
1. Beckers, R., Holland, O.E., Deneubourg, J.L.: From local actions to global tasks: Stig-
mergy and collective robots. In: Fourth International Workshop on the Syntheses and
Simulation of Living Systems, pp. 181–189. MIT Press, Cambridge (1994)
2. Bonabeau, E., Theraulaz, G., Deneubourg, J.-L., Aron, S., Camazine, S.: Self-
organization in social insects. Trends in Ecology and Evolution 12, 188–193 (1997)
3. Bonabeau, E., Dorigo, M., Theraulaz, G.: Swarm Intelligence: From Natural to Artificial
Systems. Oxford Univ. Press, New York (1999)
4. Bongard, J., Pfeifer, R.: Repeated structure and dissociation of genotypic and pheno-
typic complexity in artificial ontogeny. In: Proceedings of the Genetic and Evolutionary
Computation Conference 2001, San Francisco, CA, pp. 829–836 (2001)
5. Chantemargue, F., Dagaeff, T., Schumacher, M., Hirsbrunner, B.: Implicit cooperation
and antagonism in multi-agent systems, University of Fribourg, Technical Report (1996)
6. Chellapilla, K., Fogel, D.B.: Evolving an expert checkers playing program without using
human expertise. IEEE Transactions on Evolutionary Computation 5(4), 422–428 (2001)
7. Das, R., Crutchfield, J.P., Mitchell, M., Hanson, J.: Evolving globally synchronized cel-
lular automata. In: Proceedings of the Sixth International Conference on Genetic Algo-
rithms 1995, pp. 336–343. Morgan Kaufmann, San Francisco (1995)
8. Dellaert, F., Beer, R.: Towards an evolvable model of development for autonomous agent
synthesis. In: Artificial Life IV: Proceedings of the 4th International Workshop on the
Synthesis and Simulation of Living Systems, pp. 246–257. MIT Press, Cambridge (1994)
9. Demiris, J., Matarić, M.J.: Perceptuo-Motor Primitives in Imitation. In: Autonomous
Agents 1998 Workshop on Agents in Interaction: Acquiring Competence (1998)
10. Federici, D., Downing, K.: Evolution and Development of a Multicellular Organism:
Scalability, Resilience, and Neutral Complexification. Artificial Life 12, 381–409 (2006)
11. Gauci, J., Stanley, K.: A Case Study on the Critical Role of Geometric Regularity in
Machine Learning. In: Proceedings of the 23rd AAAI Conference on AI. AAAI Press,
Menlo Park (2008)
12. Grassé, P.-P.: La reconstruction du nid et les coordinations interindividuelles; la théorie
de la stigmergie. Insectes Sociaux 6, 41–84 (1959)
13. Groß, R., Dorigo, M.: Evolving a Cooperative Transport Behavior for Two Simple
Robots. In: Liardet, P., Collet, P., Fonlupt, C., Lutton, E., Schoenauer, M. (eds.) EA
2003. LNCS, vol. 2936, pp. 305–316. Springer, Heidelberg (2004)
14. Gruau, F., Whitley, D., Pyeatt, L.: A comparison between cellular encoding and direct
encoding for genetic neural networks. In: Genetic Programming 1996, pp. 81–89. MIT
Press, Cambridge (1996)
15. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer,
New York (2001)
16. Jacobs, R., Jordan, M., Barto, A.: Task decomposition through competition in a modular
connectionist architecture. Cognitive Science 15, 219–250 (1991)
17. Komosinski, M., Ulatowski, S.: Framsticks: towards a simulation of a nature-like world,
creatures and evolution. In: Proceedings of the 5th European Conference on Artificial
Life. Springer, Berlin (1998)
18. Leffler, B.R., Littman, M.L., Edmunds, T.: Efficient reinforcement learning with relocat-
able action models. In: Proceedings of the 22nd AAAI Conference on Artificial Intelligence, pp. 572–577 (2007)
19. Lindenmayer, A.: Mathematical models for cellular interaction in development I. Fila-
ments with one-sided inputs. Journal of Theoretical Biology 18, 280–289 (1968)
20. Lipson, H., Pollack, J.: Automatic design and manufacture of artificial lifeforms. Na-
ture 406, 974–978 (2000)
21. Matarić, M.J., Nilsson, M., Simsarian, K.T.: Cooperative multi-robot box-pushing. In:
IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 556–561
(1995)
22. Mautner, C., Belew, R.K.: Evolving Robot Morphology and Control. In: Sugisaka, M.
(ed.) Proceedings of Artificial Life and Robotics 1999 (AROB 1999), Oita, ISAROB
(1999)
23. Nolfi, S., Floreano, D.: Evolutionary Robotics: The Biology, Intelligence, and Technol-
ogy of Self-Organizing Machines. MIT Press, Cambridge (2000)
24. Parker, C.A., Zhang, H., Kube, C.R.: Blind bulldozing: Multiple robot nest construction.
In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2010–
2015 (2003)
25. Pfeifer, R., Scheier, C.: Understanding Intelligence. MIT Press, Cambridge (1999)
26. Roggen, D., Federici, D.: Multi-cellular Development: Is There Scalability and Robustness
to Gain? In: Yao, X., Burke, E.K., Lozano, J.A., Smith, J., Merelo-Guervós,
J.J., Bullinaria, J.A., Rowe, J.E., Tiňo, P., Kabán, A., Schwefel, H.-P. (eds.) PPSN 2004.
LNCS, vol. 3242, pp. 391–400. Springer, Heidelberg (2004)
27. Sims, K.: Evolving 3D Morphology and Behavior by Competition. In: Proceedings of
Artificial Life IV, pp. 28–39. MIT Press, Cambridge (1994)
28. Stanley, K., Miikkulainen, R.: Continual Coevolution through Complexification. In: Pro-
ceedings of the Genetic and Evolutionary Computation Conference 2002. Morgan Kauf-
mann, San Francisco (2002)
29. Thangavelautham, J., Barfoot, T., D’Eleuterio, G.M.T.: Coevolving communication and
cooperation for lattice formation tasks (updated). In: Advances In Artificial Life: Pro-
ceedings of the 7th European Conference on ALife, pp. 857–864 (2003)
30. Thangavelautham, J., D’Eleuterio, G.M.T.: A neuroevolutionary approach to emergent
task decomposition. In: Yao, X., Burke, E.K., Lozano, J.A., Smith, J., Merelo-Guervós,
J.J., Bullinaria, J.A., Rowe, J.E., Tiňo, P., Kabán, A., Schwefel, H.-P. (eds.) PPSN 2004.
LNCS, vol. 3242, pp. 991–1000. Springer, Heidelberg (2004)
31. Thangavelautham, J., D’Eleuterio, G.M.T.: A coarse-coding framework for a gene-
regulatory-based artificial neural tissue. In: Advances In Artificial Life: Proceedings of
the 8th European Conference on ALife, pp. 67–77 (2005)
32. Thangavelautham, J., Alexander, S., Boucher, D., Richard, J., D’Eleuterio, G.M.T.:
Evolving a Scalable Multirobot Controller Using an Artificial Neural Tissue Paradigm.
In: IEEE International Conference on Robotics and Automation, Washington, D.C.
(2007)
33. Wawerla, J., Sukhatme, G., Mataric, M.: Collective construction with multiple robots. In:
IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2696–2701
(2002)
34. Wilson, M., Melhuish, C., Sendova-Franks, A.B., Scholes, S.: Algorithms for building
annular structures with minimalist robots inspired by brood sorting in ant colonies. Au-
tonomous Robots 17, 115–136 (2004)
35. Zykov, V., Mytilinaios, E., Adams, B., Lipson, H.: Self-reproducing machines. Na-
ture 435(7038), 163–164 (2005)
